Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (1): 295-299.DOI: 10.11772/j.issn.1001-9081.2017061560


Optimum feature selection based on genetic algorithm under Web spam detection

WANG Jiaqing1, ZHU Yan1, CHEN Tung-shou2, CHANG Chin-chen3   

  1. College of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China;
    2. Department of Computer Science and Information Engineering, Taichung University of Science and Technology, Taichung Taiwan 404, China;
    3. Department of Information Engineering and Computer Science, Feng Chia University, Taichung Taiwan 407, China
  • Received:2017-06-26 Revised:2017-08-20 Online:2018-01-10 Published:2018-01-22
  • Supported by:
    This work is partially supported by the Academic and Technological Leadership Training Foundation of Sichuan Province (WZ0100112371408, YH1500411031402), the Academic and Technological Leadership Research Foundation of Sichuan Province (WZ0100112371601/004), and the Demonstration Project in Technology Service Industry of Sichuan Province (2016GFW0166).

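Optimal feature selection based on genetic algorithm in Web spam detection

WANG Jiaqing (王嘉卿)1, ZHU Yan (朱焱)1, CHEN Tung-shou (陈同孝)2, CHANG Chin-chen (张真诚)3

  1. College of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756;
    2. Department of Computer Science and Information Engineering, Taichung University of Science and Technology, Taichung, Taiwan 404;
    3. Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan 407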
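  • Corresponding author: ZHU Yan
  • About the authors: WANG Jiaqing (born 1993), female, from Jinhua, Zhejiang, M.S. candidate; main research interest: Web data mining. ZHU Yan (born 1965), female, from Guilin, Guangxi, professor, Ph.D., CCF member; main research interests: data mining, Web anomaly detection, big data management and intelligent analysis. CHEN Tung-shou (born 1964), male, from Huoqiu, Anhui, professor, Ph.D.; main research interests: image processing, data mining, information security. CHANG Chin-chen (born 1954), male, from Taichung, Taiwan, professor, Ph.D.; main research interests: database design, e-commerce security, electronic multimedia imaging technology, cryptography.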
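  • Funding:
    Academic and Technological Leadership Training Foundation of Sichuan Province (WZ0100112371408, YH1500411031402); Academic and Technological Leadership Research Foundation of Sichuan Province (WZ0100112371601/004); Demonstration Project in Technology Service Industry of Sichuan Province (2016GFW0166).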

Abstract: To address the problem that the features used in Web spam detection are often high-dimensional and redundant, an Improved Feature Selection method Based on Information Gain and Genetic Algorithm (IFS-BIGGA) was proposed. Firstly, features were ranked by Information Gain (IG), and a dynamic threshold was set to discard redundant features. Secondly, the chromosome encoding function and the selection operator of the Genetic Algorithm (GA) were improved, and the Area Under the receiver operating characteristic Curve (AUC) of a Random Forest (RF) classifier was used as the fitness function to select highly discriminative features. Finally, the Optimal Minimum Feature Set (OMFS) was obtained by increasing the number of experimental iterations to reduce the effect of the algorithm's randomness. The experimental results show that, compared with the high-dimensional feature set, the OMFS lowers the AUC under RF by 2% but raises the True Positive Rate (TPR) by 21% and reduces the feature dimension by 92%. The average detection time of several commonly used classifiers is reduced by 83%. Moreover, compared with the Traditional GA (TGA) and the Imperialist Competitive Algorithm (ICA), the F1 score under Bayes Net (BN) is increased by 4.2% and 3.5% respectively. The results indicate that IFS-BIGGA effectively reduces the feature dimension, which lowers the computational cost and improves detection efficiency in practical Web spam detection projects.

Key words: feature selection, Genetic Algorithm (GA), Information Gain (IG), Random Forest (RF) algorithm, Web spam detection
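
The two-stage idea summarized in the abstract (an IG filter followed by a GA search over the surviving features) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the fixed `keep_ratio` stands in for the paper's dynamic threshold, the binary encoding, tournament selection and one-point crossover are generic GA operators rather than the modified encoding and selection operator of IFS-BIGGA, and `toy_fitness` is a hypothetical placeholder for the AUC of an RF classifier. All function and parameter names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """IG of one discrete feature column with respect to the class labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def ig_filter(X, y, keep_ratio=0.5):
    """Rank feature indices by IG and keep the top fraction.

    Assumption: a fixed ratio replaces the paper's dynamic threshold.
    """
    n_features = len(X[0])
    scores = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:max(1, int(n_features * keep_ratio))]


def genetic_search(candidates, fitness, pop_size=20, generations=30,
                   crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Generic GA over binary chromosomes; bit i selects candidates[i]."""
    rng = random.Random(seed)
    n = len(candidates)

    def decode(chrom):
        return [f for f, bit in zip(candidates, chrom) if bit]

    def score(chrom):
        subset = decode(chrom)
        return fitness(subset) if subset else float("-inf")

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: the best chromosome always survives
        while len(next_pop) < pop_size:
            # Binary tournament selection for each parent
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            child = p1[:]
            if n > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=score)
    return decode(best)


def toy_fitness(subset):
    """Hypothetical stand-in for RF AUC: reward feature 0, penalize subset size."""
    return (1.0 if 0 in subset else 0.0) - 0.1 * len(subset)


# Toy data: feature 0 copies the label, feature 1 is constant,
# feature 2 is weakly informative, feature 3 is uninformative.
X = [
    [1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 0, 1],
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = ig_filter(X, y, keep_ratio=0.5)
selected = genetic_search(candidates, toy_fitness)
print("after IG filter:", candidates)
print("after GA search:", selected)
```

In a realistic setting, `toy_fitness` would be replaced by a function that trains a classifier on the selected columns and returns a validation score such as AUC. Because the GA is stochastic, repeating the search with different seeds and comparing the resulting subsets mirrors the abstract's point about increasing the number of iterations to limit randomness.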

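Abstract (translated from the Chinese): To address the high dimensionality and redundancy of features in Web spam detection, an improved feature selection algorithm based on information gain and genetic algorithm (IFS-BIGGA) is proposed. First, features are ranked by importance using Information Gain (IG), and a dynamic threshold is set to reduce redundant features. Second, the chromosome encoding function and the selection operator of the Genetic Algorithm (GA) are improved, and the Area Under the receiver operating characteristic Curve (AUC) of Random Forest (RF) is used as the fitness function to select highly discriminative features. Finally, the number of experimental iterations is increased to avoid the randomness of the algorithm, producing the Optimal Minimum Feature Set (OMFS). Experiments show that, compared with the high-dimensional feature set, the OMFS generated by IFS-BIGGA lowers the AUC under RF by 2% but raises the True Positive Rate (TPR) by 21% and reduces the feature dimension by 92%; meanwhile, the average detection time of several commonly used classifiers is reduced by 83%; in addition, the F1 score of IFS-BIGGA is 4.2% and 3.5% higher than that of the Traditional Genetic Algorithm (TGA) and the Imperialist Competitive Algorithm (ICA), respectively. The results indicate that IFS-BIGGA performs feature dimensionality reduction efficiently and, in practical Web page detection projects, effectively reduces computational cost and improves detection efficiency.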

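Key words (translated from the Chinese): feature selection, genetic algorithm, information gain, random forest algorithm, Web spam detection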
