Identification of micro-blog advertising publisher based on clustering analysis

doi:10.11772/j.issn.1001-9081.2017102478

Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1267-1271. DOI: 10.11772/j.issn.1001-9081.2017102478

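ZHAO Xingyu, ZHAO Zhihong, WANG Yepei, CHEN Songyu

Software Institute, Nanjing University, Nanjing 210093

Received: 2017-10-19 Revised: 2017-11-24 Online: 2018-05-10 Published: 2018-05-24
Corresponding author: ZHAO Zhihong
About the authors: ZHAO Xingyu (born 1993), male, from Yangzhou, Jiangsu, master's student; main research interests: data mining and public opinion analysis. ZHAO Zhihong (born 1975), male, from Qinghe, Hebei, professor, Ph.D.; main research interest: data mining. WANG Yepei (born 1993), male, from Yangzhou, Jiangsu, master's student; main research interests: neural networks and public opinion analysis. CHEN Songyu (born 1986), male, from Yancheng, Jiangsu, lecturer, Ph.D. candidate; main research interests: public opinion analysis and natural language processing.
Funding:
Prospective Joint Research Project of Industry-Academia-Research in Jiangsu Province (BY2015069-03).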

Identification of micro-blog advertising publisher based on clustering analysis

ZHAO Xingyu, ZHAO Zhihong, WANG Yepei, CHEN Songyu

Software Institute, Nanjing University, Nanjing Jiangsu 210093, China

Received:2017-10-19 Revised:2017-11-24 Online:2018-05-10 Published:2018-05-24
Contact: ZHAO Zhihong
Supported by:
This work is partially supported by the Prospective Joint Research Project of Industry-Academia-Research in Jiangsu (BY2015069-03).

Abstract

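Abstract (Chinese version, translated): Micro-blog platforms carry a large volume of advertising content, which seriously degrades the experience of ordinary users and interferes with related research. Existing studies mostly handle advertising micro-blogs with classification algorithms such as Support Vector Machine (SVM) or random forest, but manually labeling a large training set for these classifiers is difficult. A method for identifying micro-blog advertisement publishers based on clustering analysis is therefore proposed. At the user level, advertisement publishers post many ordinary micro-blogs to dilute their advertising content; to address this, the concept of a core micro-blog is introduced. The method extracts each user's core micro-blog topics and the corresponding micro-blog sequences, computes the user features and the text features of the corresponding micro-blogs, and clusters these features to identify advertisement publishers. Experimental results show a precision of 92%, a recall of 97%, and an F value of 95%, demonstrating that the method accurately identifies micro-blog advertisement publishers even when the advertising content has been deliberately diluted, and that it can provide theoretical support and a practical approach for identifying and cleaning up micro-blog spam.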

Keywords: micro-blog advertising, density-based spatial clustering, text filtering, feature extraction

Abstract: Micro-blog platforms contain a large amount of advertising content, which seriously affects user experience and related research. Most existing work handles advertising micro-blogs with classification algorithms such as Support Vector Machine (SVM) and random forest. However, manually labeling a large training set for such classifiers is difficult. A method for identifying micro-blog advertisement publishers based on clustering analysis is therefore proposed. At the user level, the concept of a core micro-blog is introduced to deal with publishers who post many ordinary micro-blogs to dilute their advertising content. The core micro-blog topics extracted for each user, together with the corresponding micro-blog sequences, are used to compute user features and text features. A clustering algorithm is then applied to these features to identify the advertisement publishers. Experimental results show a precision of 93%, a recall of 97%, and an F value of 95%, which indicates that the method can accurately identify micro-blog advertisement publishers even when the advertising content has been artificially diluted. The method provides theoretical support and a practical approach for identifying and cleaning up micro-blog spam.
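
The abstract does not say which F measure is reported. As a rough check, the sketch below assumes the balanced F1 score (the harmonic mean of precision and recall) and applies it to the rounded figures quoted on this page: 93% precision in the English abstract, 92% in the Chinese one, and 97% recall in both. The function name `f_measure` and the `beta` parameter are illustrative, not taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values as quoted in the two abstracts (assumed to be fractions of 1).
f_en = f_measure(0.93, 0.97)  # precision from the English abstract
f_zh = f_measure(0.92, 0.97)  # precision from the Chinese abstract

print(f"{f_en:.3f} {f_zh:.3f}")  # prints "0.950 0.944"
```

Because the published percentages are rounded, this arithmetic cannot settle which precision figure is the intended one; it only shows how the three reported numbers relate under the F1 assumption.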

Key words: micro-blog advertising, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), text filtering, feature extraction
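
The keywords name DBSCAN as the clustering algorithm. The following is a minimal, dependency-free sketch of plain DBSCAN applied to per-user feature vectors, included only to illustrate the clustering step. The two features, the sample values, and the `eps` and `min_pts` settings are hypothetical; the abstract does not list the paper's actual features or parameters.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be reclaimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical per-user features, e.g. (share of core micro-blogs with links,
# average text similarity among core micro-blogs). Values are made up.
users = [
    (0.90, 0.85), (0.88, 0.90), (0.92, 0.88),  # one dense group
    (0.10, 0.15), (0.12, 0.10), (0.08, 0.12),  # another dense group
    (0.50, 0.50),                              # isolated user
]

labels = dbscan(users, eps=0.1, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Clustering only groups users with similar features. Deciding which cluster corresponds to advertisement publishers requires a further step that the abstract does not describe.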

CLC number: TP391

ZHAO Xingyu, ZHAO Zhihong, WANG Yepei, CHEN Songyu. Identification of micro-blog advertising publisher based on clustering analysis[J]. Journal of Computer Applications, 2018, 38(5): 1267-1271.
