Identification method of spam comments in microblog based on AdaBoost

Journal of Computer Applications, 2013, Vol. 33, Issue (12): 3563-3566.

HUANG Ling, LI Xueming

College of Computer Science, Chongqing University, Chongqing 400044, China

Received: 2013-06-14 Revised: 2013-08-02 Online: 2013-12-31 Published: 2013-12-01
Contact: HUANG Ling
About the authors: HUANG Ling (born 1988), male, from Chongqing, master's student; main research interests: data mining, e-commerce.
LI Xueming (born 1967), male, from Chongqing, professor, Ph.D.; main research interests: data mining, grid computing.
Supported by:
National Natural Science Foundation of China project

Abstract

Abstract: To address the large number of spam comments on microblogs, a method based on AdaBoost is proposed to identify them. The method first extracts a feature vector of eight feature values to represent each microblog comment, then uses the AdaBoost algorithm to train several weak classifiers on these features, each performing better than random prediction, and finally combines the weighted weak classifiers into a high-accuracy strong classifier. Experiments on comment data sets extracted from popular Sina microblogs indicate that the eight selected features are effective and that the method achieves a high recognition rate for microblog spam comments.

Key words: microblog, spam comment identification, feature vector, AdaBoost algorithm, weak classifier
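
The three steps named in the abstract (feature vectors, weak classifiers trained by AdaBoost, weighted combination into a strong classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the eight features or the type of weak classifier, so this sketch assumes single-feature threshold classifiers (decision stumps) and uses made-up eight-slot vectors in which only the first slot carries signal, purely so the result is easy to check by hand. All names (`train_stump`, `adaboost_train`, `adaboost_predict`) are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity

def adaboost_train(X, y, rounds=3):
    """Train up to `rounds` stumps; labels must be +1 (spam) or -1 (normal)."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        if err >= 0.5:                     # no better than random: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this weak classifier
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Strong classifier: sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Made-up data: 10 comments, 8 feature slots each; only slot 0 is informative.
X = [[float(i)] + [0.0] * 7 for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_train(X, y, rounds=3)
accuracy = sum(adaboost_predict(model, r) == t for r, t in zip(X, y)) / len(X)
print(len(model), accuracy)
```

On this toy data no single stump separates the two classes (the best one misclassifies three of ten samples), while the weighted vote of three stumps classifies all ten training samples correctly, which is the effect the abstract relies on.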

CLC number: TP391

HUANG Ling, LI Xueming. Identification method of spam comments in microblog based on AdaBoost [J]. Journal of Computer Applications, 2013, 33(12): 3563-3566.
