Optimum feature selection based on genetic algorithm under Web spam detection
WANG Jiaqing1, ZHU Yan1, CHEN Tung-shou2, CHANG Chin-chen3
1. College of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China; 2. Department of Computer Science and Information Engineering, Taichung University of Science and Technology, Taichung Taiwan 404, China; 3. Department of Information Engineering and Computer Science, Feng Chia University, Taichung Taiwan 407, China
Abstract: To address the problem that the features used in Web spam detection are high-dimensional and redundant, an Improved Feature Selection method Based on Information Gain and Genetic Algorithm (IFS-BIGGA) was proposed. Firstly, the features were ranked by Information Gain (IG), and a dynamic threshold was set to remove redundant features. Secondly, the chromosome encoding function was modified and the selection operator was improved in the Genetic Algorithm (GA). Then, the Area Under the receiver operating characteristic Curve (AUC) of a Random Forest (RF) classifier was used as the fitness function to pick out highly discriminative features. Finally, the Optimal Minimum Feature Set (OMFS) was obtained by increasing the number of experimental iterations to reduce the effect of the algorithm's randomness. The experimental results show that, compared with the high-dimensional feature set, OMFS decreases the AUC under RF by 2% but increases the True Positive Rate (TPR) by 21%, reduces the feature dimension by 92%, and decreases the average detection time by 83%. Moreover, compared with the Traditional GA (TGA) and the Imperialist Competitive Algorithm (ICA), the F1 score under Bayes Net (BN) is increased by 4.2% and 3.5% respectively. These results indicate that IFS-BIGGA can effectively reduce the feature dimension, and thus the computational cost, and improve detection efficiency in practical Web spam detection projects.
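The two-stage pipeline described in the abstract (IG ranking with a threshold, then a GA search over the surviving features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed threshold, the tournament selection with elitism, and the toy fitness function are all hypothetical stand-ins. The paper uses a dynamic threshold, a modified chromosome encoding, an improved selection operator, and the AUC of an RF classifier as fitness, and the abstract does not specify any of these in detail.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of one discrete feature column with respect to the labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_and_filter(X, y, threshold):
    """Return indices of features whose IG exceeds threshold, best first.
    Assumption: a fixed threshold stands in for the paper's dynamic one."""
    gains = [(information_gain([row[j] for row in X], y), j)
             for j in range(len(X[0]))]
    return [j for g, j in sorted(gains, reverse=True) if g > threshold]

def genetic_select(candidates, fitness, pop_size=20, generations=30,
                   crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Bit-string GA: gene i == 1 means candidates[i] is kept.
    Assumption: plain tournament selection plus elitism, not the
    improved selection operator of the paper."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(chrom):
        return [c for c, bit in zip(candidates, chrom) if bit]

    def tournament(scored):
        a, b = rng.sample(scored, 2)
        return (a if a[0] >= b[0] else b)[1]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=lambda c: fitness(decode(c)))
    for _ in range(generations):
        scored = [(fitness(decode(c)), c) for c in pop]
        nxt = [best[:]]  # elitism: carry the best chromosome forward
        while len(nxt) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if n > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # flip each bit independently with probability mutation_rate
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda c: fitness(decode(c)))
    return decode(best)

# Toy data (hypothetical): feature 0 equals the label, feature 1 is
# constant, feature 2 is only weakly related to the label.
X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

def toy_fitness(subset):
    """Stand-in for the RF AUC: reward feature 0, penalise subset size."""
    return (1.0 if 0 in subset else 0.0) - 0.1 * len(subset)

candidates = rank_and_filter(X, y, threshold=0.05)
print(candidates, genetic_select(candidates, toy_fitness))
```

To move closer to the setting in the abstract, `toy_fitness` would be replaced by a function that trains a Random Forest on the selected columns and returns its cross-validated AUC; the GA loop itself would not need to change.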