Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (2): 535-539. DOI: 10.11772/j.issn.1001-9081.2017.02.0535

• Artificial intelligence •

Three random under-sampling based ensemble classifiers for Web spam detection

CHEN Musheng1, LU Xiaoyong2

  1. School of Information Engineering, Nanchang University, Nanchang Jiangxi 330031, China;
  2. School of Software, Nanchang University, Nanchang Jiangxi 330047, China
  • Received: 2016-08-01 Revised: 2016-08-22 Online: 2017-02-10 Published: 2017-02-11
  • Corresponding author: CHEN Musheng, dreaminit@163.com
  • About the authors: CHEN Musheng, born in 1977 in Yudu, Jiangxi, male, Ph.D. candidate. His research interests include data mining and knowledge discovery, and information management and information systems. LU Xiaoyong, born in 1957 in Gao'an, Jiangxi, male, Ph.D., professor. His research interests include information management and information systems, and industrial engineering.
  • Supported by:
    This work is partially supported by the Science and Technology Support Program of Jiangxi Province (20131102040039).

Abstract: To address the slightly imbalanced classification problem in Web spam detection, three ensemble classifiers based on random under-sampling were proposed: Random Under-Sampling once without replacement (RUS-once), Random Under-Sampling multiple times without replacement (RUS-multiple) and Random Under-Sampling with replacement (RUS-replacement). First, the imbalanced training dataset was converted into several balanced datasets by using one of the under-sampling techniques. Second, a Classification And Regression Tree (CART) classifier was trained on each balanced dataset. Finally, all of the CART classifiers were combined into an ensemble classifier by simple voting, and the ensemble was used to classify the test samples. The experimental results show that all three random under-sampling based ensemble classifiers achieve good classification results, and that RUS-multiple and RUS-replacement perform better than RUS-once. Compared with CART, Bagging with CART and Adaboost with CART, the AUC values of RUS-multiple and RUS-replacement increase by about 10% on WEBSPAM UK-2006 and by about 25% on WEBSPAM UK-2007; compared with several state-of-the-art baseline classification models, RUS-multiple and RUS-replacement achieve the best results in terms of AUC.
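
The sampling and voting steps described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the sampling details, so the interpretation here is an assumption: RUS-once shuffles the majority class once and cuts it into disjoint chunks, RUS-multiple draws each chunk independently without replacement, and RUS-replacement draws each chunk with replacement. The names `balanced_subsets`, `majority_vote` and `ensemble_predict` are hypothetical, and the base learners are passed in as plain callables; in practice a CART learner would be trained on each balanced subset.

```python
import random
from collections import Counter


def balanced_subsets(majority, minority, scheme, n_subsets, seed=0):
    """Build balanced training sets: a majority-class draw plus all minority samples.

    The meaning given to each scheme is an assumption, not the paper's definition.
    """
    rng = random.Random(seed)
    k = len(minority)  # each draw matches the minority-class size
    if scheme == "once":
        # Shuffle once, then cut into disjoint chunks (no sample is reused).
        pool = list(majority)
        rng.shuffle(pool)
        n_subsets = min(n_subsets, len(pool) // k)
        draws = [pool[i * k:(i + 1) * k] for i in range(n_subsets)]
    elif scheme == "multiple":
        # Independent draws; no repeats within a draw, repeats possible across draws.
        draws = [rng.sample(list(majority), k) for _ in range(n_subsets)]
    elif scheme == "replacement":
        # Independent draws with replacement; repeats possible within a draw.
        draws = [rng.choices(list(majority), k=k) for _ in range(n_subsets)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [draw + list(minority) for draw in draws]


def majority_vote(labels):
    """Simple voting: return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(models, x):
    """Classify x by simple voting over base classifiers given as callables."""
    return majority_vote([model(x) for model in models])
```

Each returned subset would be used to fit one base classifier, and `ensemble_predict` then combines their outputs. An odd number of base classifiers avoids ties in a two-class vote.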

Key words: Web spam detection, imbalanced classification, ensemble learning, under-sampling, Classification And Regression Tree (CART)

CLC number: