Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1267-1271. DOI: 10.11772/j.issn.1001-9081.2017102478

• Artificial Intelligence •

Identification of micro-blog advertising publisher based on clustering analysis

ZHAO Xingyu, ZHAO Zhihong, WANG Yepei, CHEN Songyu

  1. Software Institute, Nanjing University, Nanjing Jiangsu 210093, China
  • Received: 2017-10-19 Revised: 2017-11-24 Online: 2018-05-10 Published: 2018-05-24
  • Contact: ZHAO Zhihong
  • About the authors: ZHAO Xingyu (born 1993), male, from Yangzhou, Jiangsu, M.S. candidate; research interests: data mining, public opinion analysis. ZHAO Zhihong (born 1975), male, from Qinghe, Hebei, professor, Ph.D.; research interests: data mining. WANG Yepei (born 1993), male, from Yangzhou, Jiangsu, M.S. candidate; research interests: neural networks, public opinion analysis. CHEN Songyu (born 1986), male, from Yancheng, Jiangsu, lecturer, Ph.D. candidate; research interests: public opinion analysis, natural language processing.
  • Supported by:
    This work is partially supported by the Prospective Joint Research Project of Industry-Academia-Research in Jiangsu (BY2015069-03).

Abstract: Micro-blog platforms carry a large amount of advertising content, which seriously degrades the experience of ordinary users and interferes with related research. Most existing studies handle advertising micro-blogs with classification algorithms such as Support Vector Machine (SVM) or random forest; however, manually labeling a large training set for such classifiers is difficult. A method for identifying micro-blog advertising publishers based on clustering analysis is therefore proposed. At the user level, the concept of a core micro-blog is introduced to address the practice of advertising publishers posting many ordinary micro-blogs to dilute their advertising content. The core micro-blog topics of each user and the corresponding micro-blog sequences are extracted, user features and the text features of the corresponding micro-blogs are computed, and a clustering algorithm is applied to these features to identify advertising publishers. Experimental results show a precision of 93%, a recall of 97%, and an F value of 95%, indicating that the method identifies advertising publishers accurately even when advertising content has been deliberately diluted. The method offers theoretical support and a practical approach for identifying and cleaning up micro-blog spam.
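As a quick consistency check on the reported figures, the F value can be recomputed from precision and recall. The sketch below assumes the F value is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall.

    Assumption: the abstract's "F value" is F1 (beta = 1).
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention: F is 0 when both inputs are 0
    return (1 + beta ** 2) * precision * recall / denom


# Precision and recall as reported in the English abstract.
f1 = f_measure(0.93, 0.97)
print(f"F1 = {f1:.4f}")  # F1 = 0.9496
```

Under this assumption, a precision of 93% and a recall of 97% give roughly 0.95, which agrees with the reported F value of 95%.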

Key words: micro-blog advertising, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), text filtering, feature extraction
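The keywords name DBSCAN as the clustering algorithm. The following is a minimal sketch of how DBSCAN groups per-user feature vectors without labeled training data. It is not the authors' implementation: the two features (share of core micro-blogs containing links, and text repetition among core micro-blogs), the sample values, and the parameters `eps` and `min_pts` are all hypothetical, since the abstract does not give them.

```python
from math import dist

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with Euclidean distance; returns one label per point."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbors)
    return labels


# Hypothetical per-user features: (link ratio, text repetition), both in [0, 1].
users = [
    (0.90, 0.85), (0.88, 0.90), (0.92, 0.80),  # advertiser-like behaviour
    (0.10, 0.15), (0.12, 0.10), (0.08, 0.20),  # ordinary-user-like behaviour
    (0.50, 0.95),                              # isolated outlier
]
labels = dbscan(users, eps=0.15, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

DBSCAN only separates dense groups from noise; deciding which cluster corresponds to advertising publishers still requires inspecting the clusters, which is an interpretation step outside this sketch.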

CLC number: