Web spam detection based on immune clonal feature selection and under-sampling ensemble

doi:10.11772/j.issn.1001-9081.2016.07.1899

Journal of Computer Applications, 2016, Vol. 36, Issue (7): 1899-1903. DOI: 10.11772/j.issn.1001-9081.2016.07.1899

LU Xiaoyong¹, CHEN Musheng², WU Jhenglong³, CHANG Peichan³

1. School of Software, Nanchang University, Nanchang Jiangxi 330047, China;
2. Information Engineering School, Nanchang University, Nanchang Jiangxi 330031, China;
3. College of Informatics, Yuan Ze University, Taoyuan Taiwan 32003, China

Received: 2016-01-08; Revised: 2016-03-02; Online: 2016-07-14; Published: 2016-07-10
Supported by:
This work is partially supported by the Science and Technology Support Program of Jiangxi Province (20131102040039).

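Corresponding author: CHEN Musheng
About the authors: LU Xiaoyong, born in 1957 in Gao'an, Jiangxi, Ph.D., professor. His research interests include information management and information systems, and industrial engineering. CHEN Musheng, born in 1977 in Yudu, Jiangxi, Ph.D. candidate. His research interests include data mining and knowledge discovery. WU Jhenglong, born in 1983 in Yilan, Taiwan, Ph.D. His research interests include intelligent optimization and text mining. CHANG Peichan, born in 1959 in Kaohsiung, Taiwan, Ph.D., professor. His research interests include production scheduling and intelligent optimization.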

Abstract

Abstract: To address the "curse of dimensionality" and class imbalance in Web spam detection, a binary classification algorithm based on immune clonal feature selection and Under-Sampling (US) ensemble was proposed. Firstly, the majority class of the training set was under-sampled into several subsets, each close in size to the minority class, and each subset was combined with the minority-class samples to form a balanced training subset. Then, an immune clonal algorithm was designed to select several optimal feature subsets, and the balanced training subsets were projected onto these feature subsets to generate multiple views of the balanced data. Finally, several Random Forest (RF) classifiers were trained on these views to classify the testing samples, and the final class of each testing sample was determined by simple voting. Experimental results on the WEBSPAM UK-2006 dataset show that the proposed ensemble classifier outperforms RF, Bagging with RF, and AdaBoost with RF, with accuracy, F1-Measure, and AUC (Area Under ROC Curve) each improved by more than 11%. Compared with several state-of-the-art baseline classification models, the ensemble classifier improves the F1-Measure by 2% and achieves the best AUC.
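The pipeline described in the abstract (balanced under-sampled subsets, one feature view per subset, one classifier per view, simple voting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and class names are hypothetical, the feature subsets are passed in by hand rather than selected by the immune clonal algorithm, and a nearest-centroid learner stands in for Random Forest so that the example needs only the standard library.

```python
import random
from collections import Counter


def balanced_subsets(samples, labels, n_subsets, seed=0):
    """Under-sample the majority class into n_subsets balanced training sets."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    major_idx = [i for i, y in enumerate(labels) if y != minority]
    subsets = []
    for _ in range(n_subsets):
        # Draw as many majority samples as there are minority samples.
        drawn = rng.sample(major_idx, len(minor_idx))
        idx = drawn + minor_idx
        subsets.append(([samples[i] for i in idx], [labels[i] for i in idx]))
    return subsets


def project(rows, feature_subset):
    """Keep only the selected feature columns (one view of the data)."""
    return [[row[j] for j in feature_subset] for row in rows]


class CentroidClassifier:
    """Stand-in base learner (NOT Random Forest): nearest class centroid."""

    def fit(self, rows, labels):
        sums, counts = {}, Counter(labels)
        for row, y in zip(rows, labels):
            acc = sums.setdefault(y, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
        self.centroids = {y: [s / counts[y] for s in acc]
                          for y, acc in sums.items()}
        return self

    def predict_one(self, row):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, c))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))


def train_ensemble(samples, labels, feature_subsets, seed=0):
    """Train one classifier per (balanced subset, feature subset) pair."""
    subsets = balanced_subsets(samples, labels, len(feature_subsets), seed)
    members = []
    for (rows, ys), feats in zip(subsets, feature_subsets):
        clf = CentroidClassifier().fit(project(rows, feats), ys)
        members.append((feats, clf))
    return members


def vote(members, row):
    """Classify one sample by simple majority vote over all members."""
    votes = Counter(clf.predict_one([row[j] for j in feats])
                    for feats, clf in members)
    return votes.most_common(1)[0][0]
```

Because every member sees all minority samples but only a slice of the majority class, no single classifier is trained on the skewed class ratio, while the ensemble as a whole still covers much of the majority data. Using an odd number of members avoids tied votes in the two-class case.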

Key words: Web spam detection, ensemble learning, immune clonal algorithm, feature selection, Under-Sampling (US), Random Forest (RF)
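The immune clonal feature selection named above is not specified in detail on this page, so the following is only a simplified sketch of a generic clonal selection loop over binary feature masks: antibodies are cloned, clones are hypermutated at a rate that grows as affinity rank worsens, and the best of each clone family survives. The function names, the parameter defaults, and the toy affinity function in the usage note are all assumptions. In a real setting the affinity would typically be a validation score of a classifier trained on the selected features.

```python
import random


def mutate(mask, rate, rng):
    """Flip each bit of a feature mask with probability `rate`."""
    return tuple(b ^ 1 if rng.random() < rate else b for b in mask)


def clonal_feature_selection(n_features, affinity, pop_size=10, n_clones=5,
                             generations=30, seed=0):
    """Return the final feature masks sorted by affinity, best first."""
    rng = random.Random(seed)
    # Initial antibody population: random binary masks over the features.
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=affinity, reverse=True)
        next_pop = []
        for rank, antibody in enumerate(ranked):
            # Better-ranked antibodies are mutated more gently.
            rate = min(0.5, (rank + 1) / (2 * pop_size))
            clones = [mutate(antibody, rate, rng) for _ in range(n_clones)]
            # Keep the parent unless one of its clones has higher affinity.
            next_pop.append(max(clones + [antibody], key=affinity))
        population = next_pop
    return sorted(population, key=affinity, reverse=True)
```

As a usage illustration, a toy affinity that rewards two hypothetical informative features and penalizes mask size could be written as `lambda m: m[0] + m[3] - 0.1 * sum(m)`. Taking the top few masks from the returned list yields several feature subsets, one per view. The sketch omits steps found in fuller clonal selection algorithms, such as affinity-proportional clone counts and replacing the weakest antibodies with random newcomers.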

LU Xiaoyong, CHEN Musheng, WU Jhenglong, CHANG Peichan. Web spam detection based on immune clonal feature selection and under-sampling ensemble[J]. Journal of Computer Applications, 2016, 36(7): 1899-1903.
