Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (5): 1302-1306. DOI: 10.11772/j.issn.1001-9081.2016.05.1302

Micro-blog hot-spot topic discovery based on real-time word co-occurrence network

LI Yaxing1, WANG Zhaokai1, FENG Xupeng2, LIU Lijun1, HUANG Qingsong1,3   

    1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
    2. Educational Technology and Network Center, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
    3. Yunnan Provincial Key Laboratory of Computer Technology Applications(Kunming University of Science and Technology), Kunming Yunnan 650500, China
  • Received: 2015-09-14 Revised: 2015-10-22 Online: 2016-05-09 Published: 2016-05-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (81360230).


李亚星1, 王兆凯1, 冯旭鹏2, 刘利军1, 黄青松1,3   

    1. 昆明理工大学 信息工程与自动化学院, 昆明 650500;
    2. 昆明理工大学 教育技术与网络中心, 昆明 650500;
    3. 云南省计算机技术应用重点实验室(昆明理工大学), 昆明 650500
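  • Corresponding author: HUANG Qingsong
  • About the authors: LI Yaxing (born 1991), female, from Xinxiang, Henan, M.S. candidate; research interests: machine learning, natural language processing. WANG Zhaokai (born 1991), male, from Wenzhou, Zhejiang, M.S. candidate; research interests: machine learning, natural language processing. FENG Xupeng (born 1986), male, from Zhengzhou, Henan, assistant experimentalist, M.S.; research interest: information retrieval. LIU Lijun (born 1978), male, from Xinxiang, Henan, lecturer, M.S.; research interest: medical information services. HUANG Qingsong (born 1962), male, from Changsha, Hunan, professor, Ph.D.; research interests: intelligent information systems, information retrieval.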

Abstract: To address the real-time, sparse and massive nature of micro-blog data, a topic discovery model based on a real-time word co-occurrence network was proposed. Firstly, a set of keywords was extracted from the raw corpus, and the relationship weights between co-occurring keywords were calculated using a time parameter to construct the word co-occurrence network; strongly correlated latent feature words were then inferred from the network, using a weight adjustment coefficient, to reduce feature sparsity. Secondly, incremental topic clustering was performed with an improved Single-Pass algorithm. Finally, the keywords of each topic were ranked by a heat calculation to obtain the most representative keywords of the topic. The experimental results show that, compared with the classic Single-Pass algorithm, the proposed model improves topic discovery accuracy by about 6% and the comprehensive index by 8%, which supports the validity and accuracy of the proposed model.

Key words: topic discovery, real-time co-occurrence network, short text, Single-Pass clustering, heat calculation
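
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of classic Single-Pass incremental clustering, not the improved variant proposed in the paper. The abstract does not give the similarity measure, the threshold, or the time-based co-occurrence weights, so everything here is an assumption made for illustration: posts are treated as pre-tokenized word lists, similarity is plain cosine over term counts, and the names `cosine`, `single_pass` and the threshold value 0.3 are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.3):
    """Classic Single-Pass clustering over a stream of tokenized posts.

    Each post is compared with every existing cluster centroid. It joins
    the most similar cluster if the similarity reaches the threshold
    (an assumed value), otherwise it starts a new cluster.
    Returns (labels, centroids); centroids are summed term counts.
    """
    centroids = []
    labels = []
    for tokens in docs:
        vec = Counter(tokens)
        best_idx, best_sim = -1, 0.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            centroids[best_idx].update(vec)  # fold the post into the cluster
            labels.append(best_idx)
        else:
            centroids.append(vec)  # the post seeds a new topic
            labels.append(len(centroids) - 1)
    return labels, centroids


# Toy, made-up posts (already tokenized); not data from the paper.
posts = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "donation"],
    ["football", "final", "goal"],
    ["football", "goal", "penalty"],
]
labels, centroids = single_pass(posts, threshold=0.3)
print(labels)  # [0, 0, 1, 1]
```

Each post is processed exactly once, which is what makes the method suitable for a stream. The result depends on arrival order and on the threshold. The sketch leaves both as simple as possible and does not try to reproduce the paper's weight adjustment or heat ranking.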

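Abstract (translated from Chinese): To address the real-time, sparse and massive nature of micro-blogs, a topic discovery model based on a real-time word co-occurrence network is proposed. First, a set of subject words is selected from the raw corpus, and the relationship weights of co-occurring subject words are computed with a time parameter to build the word co-occurrence network; latent feature words strongly associated with a topic are then inferred from this network to mitigate the sparsity of micro-blog feature words. Second, incremental topic clustering is performed with an improved Single-Pass algorithm. Finally, the subject words of each topic are ranked by a heat calculation to obtain the most representative ones. Experimental results show that, compared with the classic Single-Pass clustering algorithm, the model improves topic discovery accuracy by about 6% and the comprehensive index by 8%, demonstrating its validity and accuracy.

Key words (translated from Chinese): topic discovery, real-time co-occurrence network, short text, Single-Pass clustering, heat calculation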

