Micro-blog hot-spot topic discovery based on real-time word co-occurrence network

Journal of Computer Applications, 2016, Vol. 36, Issue 5: 1302-1306. DOI: 10.11772/j.issn.1001-9081.2016.05.1302

LI Yaxing¹, WANG Zhaokai¹, FENG Xupeng², LIU Lijun¹, HUANG Qingsong¹,³

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
2. Educational Technology and Network Center, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
3. Yunnan Provincial Key Laboratory of Computer Technology Applications (Kunming University of Science and Technology), Kunming Yunnan 650500, China

Received: 2015-09-14 Revised: 2015-10-22 Online: 2016-05-09 Published: 2016-05-10
Corresponding author: HUANG Qingsong
About the authors: LI Yaxing (1991-), female, from Xinxiang, Henan, M.S. candidate; research interests: machine learning, natural language processing. WANG Zhaokai (1991-), male, from Wenzhou, Zhejiang, M.S. candidate; research interests: machine learning, natural language processing. FENG Xupeng (1986-), male, from Zhengzhou, Henan, assistant experimentalist, M.S.; research interest: information retrieval. LIU Lijun (1978-), male, from Xinxiang, Henan, lecturer, M.S.; research interest: medical information services. HUANG Qingsong (1962-), male, from Changsha, Hunan, professor, Ph.D.; research interests: intelligent information systems, information retrieval.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (81360230).

Abstract: To address the real-time, sparse and massive nature of micro-blog data, a topic discovery model based on a real-time word co-occurrence network is proposed. First, a set of topic words is selected from the raw corpus, and the relationship weights between co-occurring topic words are calculated from a time parameter to construct the word co-occurrence network. The network, together with a weight adjustment coefficient, is then used to infer latent feature words that are strongly associated with a topic, which alleviates the sparsity of micro-blog feature words. Second, an improved Single-Pass algorithm performs incremental topic clustering. Finally, the topic words of each topic are ranked by a heat calculation to obtain the most representative topic words. Experimental results show that, compared with the classical Single-Pass clustering algorithm, the proposed model improves topic discovery accuracy by about 6% and the comprehensive index by 8%, which supports the validity and accuracy of the proposed model.

Key words: topic discovery, real-time co-occurrence network, short text, Single-Pass clustering, hot degree calculation
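The abstract does not give the weighting formula or the specific improvements made to Single-Pass, so the following is only a minimal sketch under stated assumptions. It pairs a time-decayed co-occurrence count (an exponential half-life decay is assumed here, not taken from the paper) with the classical Single-Pass baseline using cosine similarity over term counts. The names `build_cooccurrence`, `single_pass`, `half_life` and `threshold` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts, now, half_life=3600.0):
    """Return {(w1, w2): weight} with w1 < w2 in sorted order.

    posts: iterable of (timestamp_seconds, list_of_words).
    Each co-occurrence contributes 0.5 ** (age / half_life), so recent
    posts count more. ASSUMPTION: this decay form is illustrative and is
    not the paper's time-parameter formula.
    """
    weights = defaultdict(float)
    for ts, words in posts:
        decay = 0.5 ** ((now - ts) / half_life)
        # Count each unordered word pair once per post.
        for w1, w2 in combinations(sorted(set(words)), 2):
            weights[(w1, w2)] += decay
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.5):
    """Classical Single-Pass clustering (baseline, not the improved variant).

    Scans the documents once; each document joins the most similar
    existing cluster if the similarity reaches the threshold, otherwise
    it starts a new cluster. Returns one cluster index per document.
    """
    centroids = []  # summed term counts per cluster
    labels = []
    for words in docs:
        vec = Counter(words)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # incremental centroid update
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels
```

In a pipeline like the one the abstract outlines, the edge weights could be used to expand each short post with strongly associated words before calling `single_pass`; how that expansion is done in the paper is not stated here, so it is left out of the sketch.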

CLC number: TP391.1

LI Yaxing, WANG Zhaokai, FENG Xupeng, LIU Lijun, HUANG Qingsong. Micro-blog hot-spot topic discovery based on real-time word co-occurrence network[J]. Journal of Computer Applications, 2016, 36(5): 1302-1306.

