
Selecting Optimal Features to Combat Web Spam Efficiently

王嘉卿1,朱焱2,3,陈同孝4,张真诚5   

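  1. Southwest Jiaotong University, Xipu Town, Pixian County, Chengdu, Sichuan Province
  2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031;
  3. Key Laboratory of Cloud Computing and Intelligent Technology of Sichuan Province Universities, Chengdu 610031
  4. Taichung University of Science and Technology
  5. Department of Information Engineering, Feng Chia University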
  • Received: 2017-06-26 Revised: 2017-08-12 Published: 2017-08-12
  • Corresponding author: 王嘉卿

Optimum Features Selection for Beating Web Spam Efficiently

  • Received: 2017-06-26 Revised: 2017-08-12 Online: 2017-08-12
  • Contact: Jia-Qing WANG

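Abstract: Web spam has a destructive effect on search engines and Internet security. Research on web spam detection is important and extensive, and concentrates on extracting new features and improving classification algorithms. However, the basic web page features used in detection are high-dimensional and redundant, which "overloads" the classifier and reduces detection efficiency. Efficient feature dimensionality reduction is therefore necessary, and this paper proposes an improved feature selection algorithm based on information gain and a genetic algorithm. Because the algorithm has some randomness, the number of experimental iterations is increased to produce an optimal minimum feature set. Comparative experiments are carried out with Bayesian network and random forest classifiers. The experiments verify that using the optimal minimum feature set greatly shortens detection time, while the detection results approach or even exceed those obtained with the original high-dimensional feature set. This shows that the improved feature selection algorithm can effectively reduce feature dimensionality and the computational cost of classification while keeping the detection results robust.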

Keywords: web spam detection, feature selection, genetic algorithm, information gain, optimal minimum feature set

Abstract: Web spam is destructive to search engines and Internet security. Research on web spam detection is popular and essential, and focuses on integrating new features and improving classification algorithms. However, the basic web page features typically used in spam detection are high-dimensional and redundant, which ‘overloads’ the classifier and reduces detection efficiency. Feature dimensionality reduction is therefore necessary, and this paper proposes an improved feature selection algorithm based on Kullback-Leibler Divergence and Genetic Algorithm (IFSBKGA). Because the algorithm has a certain randomness, the number of experimental iterations is increased to generate an Optimal Minimum Feature Set (OMFS). Comparative experiments are conducted with Naive Bayes and Random Forest classifiers: with the OMFS, the detection results match or exceed those obtained with hundreds of features, and the detection time is dramatically reduced. The experiments verify that IFSBKGA can reduce feature dimension and decrease the computational cost of classification while ensuring detection robustness.

Key words: Web spam detection, feature selection, Genetic Algorithm (GA), Kullback-Leibler Divergence (KLD), Optimal Minimum Feature Set (OMFS)
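
The general idea of combining an information-theoretic score with a genetic search can be illustrated with a minimal sketch. It is not the authors' IFSBKGA: the toy data, the function names (`entropy`, `information_gain`, `genetic_select`), the GA parameters, and above all the fitness function are assumptions made for illustration. In particular, the fitness here is the summed information gain of the chosen features minus a size penalty, a stand-in for the classifier-based evaluation that a real wrapper method would use.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus their expected entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def genetic_select(gains, size_penalty=0.05, pop_size=20, generations=40, seed=0):
    """Evolve a bit mask over features (needs at least 2 features).

    The fitness is a hypothetical stand-in for classifier accuracy.
    """
    rng = random.Random(seed)
    n = len(gains)

    def fitness(mask):
        # Reward informative features; penalise each selected feature to keep the set small.
        return sum(g for g, bit in zip(gains, mask) if bit) - size_penalty * sum(mask)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # truncation selection, keeps the best masks
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)              # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n)                # single-bit mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return [i for i, bit in enumerate(best) if bit]


# Toy data (assumed): 4 binary features per page, label 1 = spam.
# Feature 0 matches the label exactly; the other three carry no information.
X = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
selected = genetic_select(gains)
print("information gain per feature:", gains)
print("selected feature indices:", selected)
```

Because the search is stochastic, one run may not return the smallest adequate set. Repeating `genetic_select` with different `seed` values and comparing the resulting sets is one simple way to mirror the abstract's point about increasing the number of iterations.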

CLC number: