Three random under-sampling based ensemble classifiers for Web spam detection

Journal of Computer Applications, 2017, Vol. 37, Issue (2): 535-539. DOI: 10.11772/j.issn.1001-9081.2017.02.0535

CHEN Musheng¹, LU Xiaoyong²

1. School of Information Engineering, Nanchang University, Nanchang Jiangxi 330031, China;
2. School of Software, Nanchang University, Nanchang Jiangxi 330047, China

Received: 2016-08-01 Revised: 2016-08-22 Online: 2017-02-11 Published: 2017-02-10
Corresponding author: CHEN Musheng, dreaminit@163.com
About the authors: CHEN Musheng (born 1977), male, from Yudu, Jiangxi, Ph.D. candidate; research interests: data mining and knowledge discovery, information management and information systems. LU Xiaoyong (born 1957), male, from Gao'an, Jiangxi, professor, Ph.D.; research interests: information management and information systems, industrial engineering.
Supported by:
This work is partially supported by the Science and Technology Support Program of Jiangxi Province (20131102040039).

Abstract

Abstract: To address the slightly imbalanced classification problem in Web spam detection, three ensemble classifiers based on random under-sampling were proposed: Random Under-Sampling once without replacement (RUS-once), Random Under-Sampling multiple times without replacement (RUS-multiple), and Random Under-Sampling with replacement (RUS-replacement). First, the imbalanced training dataset was converted into several balanced datasets by using one of the under-sampling techniques. Second, a Classification And Regression Tree (CART) classifier was trained on each balanced dataset. Finally, the CART classifiers were combined by simple voting into an ensemble classifier, which was used to classify the test samples. The experimental results show that all three random under-sampling based ensemble classifiers achieve good classification results, and that RUS-multiple and RUS-replacement perform better than RUS-once. Compared with CART, Bagging with CART, and Adaboost with CART, the Area Under the ROC Curve (AUC) values of RUS-multiple and RUS-replacement are about 10% higher on WEBSPAM UK-2006 and about 25% higher on WEBSPAM UK-2007; compared with several state-of-the-art baseline classification models, RUS-multiple and RUS-replacement achieve the best AUC values.

Key words: Web spam detection, imbalanced classification, ensemble learning, under-sampling, Classification And Regression Tree (CART)
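
The three sampling schemes and the voting step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not spell out how the subsets are drawn, so the reading used here (RUS-once cuts a single shuffle into disjoint subsets, RUS-multiple redraws each subset without replacement, RUS-replacement draws each subset with replacement) is an assumption. All function names are hypothetical, and `train_stump` is a toy one-feature threshold learner standing in for CART; a real CART implementation such as scikit-learn's `DecisionTreeClassifier` could be wrapped and passed in its place.

```python
import random
from collections import Counter


def rus_once(majority, size, rng):
    """Shuffle the majority class once and cut it into disjoint subsets."""
    pool = list(majority)
    rng.shuffle(pool)
    # Leftover samples that do not fill a whole subset are dropped.
    return [pool[i:i + size] for i in range(0, len(pool) - size + 1, size)]


def rus_multiple(majority, size, n_subsets, rng):
    """Draw each subset without replacement; different subsets may overlap."""
    return [rng.sample(list(majority), size) for _ in range(n_subsets)]


def rus_replacement(majority, size, n_subsets, rng):
    """Draw each subset with replacement, so a sample may repeat inside it."""
    return [rng.choices(list(majority), k=size) for _ in range(n_subsets)]


def train_stump(samples):
    """Toy stand-in for CART: threshold halfway between the two class means.

    Assumes (feature, label) pairs in which spam (label 1) has larger values.
    """
    spam = [x for x, y in samples if y == 1]
    normal = [x for x, y in samples if y == 0]
    threshold = (sum(spam) / len(spam) + sum(normal) / len(normal)) / 2
    return lambda x: 1 if x > threshold else 0


def train_ensemble(subsets, minority, train):
    """Train one base classifier per balanced set (subset + all minority)."""
    return [train(list(subset) + list(minority)) for subset in subsets]


def vote(models, x):
    """Simple majority vote over the base classifiers' predicted labels."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Made-up data: 90 normal pages (label 0) and 30 spam pages (label 1).
rng = random.Random(0)
normal = [(i / 100, 0) for i in range(90)]
spam = [(2.0 + i / 100, 1) for i in range(30)]
models = train_ensemble(rus_once(normal, len(spam), rng), spam, train_stump)
print(vote(models, 3.0), vote(models, 0.1))
```

An odd number of base classifiers avoids tied votes in the two-class case; with an even number, `Counter.most_common` resolves a tie in favour of the label it encountered first.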

CHEN Musheng, LU Xiaoyong. Three random under-sampling based ensemble classifiers for Web spam detection[J]. Journal of Computer Applications, 2017, 37(2): 535-539.

