Web spam detection based on random forest and under-sampling ensemble

Journal of Computer Applications, 2016, Vol. 36, Issue (3): 731-734. DOI: 10.11772/j.issn.1001-9081.2016.03.731

LU Xiaoyong¹, CHEN Musheng²

1. School of Software, Nanchang University, Nanchang Jiangxi 330047, China;
2. Information Engineering School, Nanchang University, Nanchang Jiangxi 330031, China

Received: 2015-08-10 Revised: 2015-10-03 Online: 2016-03-10 Published: 2016-03-17
Corresponding author: CHEN Musheng
About the authors: LU Xiaoyong (born 1957), male, from Gao'an, Jiangxi, professor, Ph.D.; research interests: information management and information systems, industrial engineering. CHEN Musheng (born 1977), male, from Yudu, Jiangxi, Ph.D. candidate; research interests: data mining and knowledge discovery, information management and information systems.
Supported by:
This work is partially supported by the Science and Technology Support Program of Jiangxi Province (20131102040039).

Abstract

To address the class imbalance and "curse of dimensionality" problems in Web spam detection, a binary classifier algorithm based on Random Forest (RF) and an under-sampling ensemble is proposed. First, under-sampling is used to draw several sub-sample sets from the majority class of the training set, and each of them is merged with the minority-class samples to form a balanced training subset. Then, one RF classifier is trained on each balanced training subset. Finally, the RF classifiers classify the test samples, and voting determines the final class of each test sample. Experiments on the WEBSPAM UK-2006 dataset show that the ensemble classifier outperforms RF alone as well as the Bagging and AdaBoost ensembles of RF, improving accuracy, F1-measure, and area under the ROC curve (AUC) by at least 14%, 13%, and 11%, respectively. Compared with the results of the winning teams of the Web spam challenge 2007, the ensemble classifier improves the F1-measure by at least 1% and achieves the best AUC.

Key words: Web spam detection, Random Forest (RF), under-sampling, ensemble classifier, machine learning
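
The three steps described in the abstract (under-sample the majority class into balanced subsets, train one classifier per subset, combine by voting) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`balanced_subsets`, `train_stump`, `train_forest`, `train_ensemble`) are hypothetical, the defaults `n_subsets=5` and `n_trees=15` are arbitrary, and the base learner is a toy forest of one-feature decision stumps standing in for a real Random Forest so that the example needs only the Python standard library.

```python
import random
from collections import Counter


def balanced_subsets(majority, minority, n_subsets, rng):
    """Draw n_subsets random majority samples, each the size of the minority
    class, and merge every one of them with all minority rows."""
    k = min(len(minority), len(majority))
    return [rng.sample(majority, k) + list(minority) for _ in range(n_subsets)]


def train_stump(rows, rng):
    """Toy base learner (assumption, not from the paper): the best threshold
    on one randomly chosen feature."""
    f = rng.randrange(len(rows[0][0]))
    best = None
    for x, _ in rows:
        t = x[f]
        left = [y for xi, y in rows if xi[f] <= t]   # never empty: contains x
        right = [y for xi, y in rows if xi[f] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab


def train_forest(rows, n_trees, rng):
    """Toy stand-in for a Random Forest: stumps fit on bootstrap samples,
    combined by majority vote."""
    stumps = [train_stump([rng.choice(rows) for _ in rows], rng)
              for _ in range(n_trees)]
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]


def train_ensemble(majority, minority, n_subsets=5, n_trees=15, seed=0):
    """Train one forest per balanced subset; predict by voting across forests."""
    rng = random.Random(seed)
    subsets = balanced_subsets(majority, minority, n_subsets, rng)
    forests = [train_forest(s, n_trees, rng) for s in subsets]
    return lambda x: Counter(f(x) for f in forests).most_common(1)[0][0]


if __name__ == "__main__":
    # Synthetic rows of the form (features, label): 0 = normal, 1 = spam.
    normal = [((i * 0.01, 1 - i * 0.01), 0) for i in range(100)]
    spam = [((9 + i * 0.1, 10 - i * 0.1), 1) for i in range(10)]
    predict = train_ensemble(normal, spam)
    print(predict((9.5, 9.5)), predict((0.5, 0.5)))
```

Using odd values for `n_subsets` and `n_trees` avoids tied votes in the binary case. In practice the toy forest would be replaced by a full Random Forest implementation, with the under-sampling and voting logic left unchanged.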

LU Xiaoyong, CHEN Musheng. Web spam detection based on random forest and under-sampling ensemble[J]. Journal of Computer Applications, 2016, 36(3): 731-734.

