Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (2): 456-460. DOI: 10.11772/j.issn.1001-9081.2015.02.0456

• Artificial Intelligence •

Topic evolution in text stream based on feature ontology

CHEN Qian1,2, GUI Zhiguo1, GUO Xin2, XIANG Yang3

  1. School of Information and Communication Engineering, North University of China, Taiyuan Shanxi 030051, China;
    2. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China;
    3. School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
  • Received: 2014-08-29 Revised: 2014-11-04 Online: 2015-02-10 Published: 2015-02-12
  • Corresponding author: GUI Zhiguo
  • About the authors: CHEN Qian (1983-), male, born in Huanggang, Hubei, lecturer, Ph.D.; research interests: text mining, machine learning. GUI Zhiguo (1972-), male, born in Tianjin, professor, Ph.D.; research interests: image processing and reconstruction. GUO Xin (1982-), female, born in Taiyuan, Shanxi, lecturer, Ph.D.; research interests: social networks, data dimensionality reduction. XIANG Yang (1962-), male, born in Chongqing, professor, doctoral supervisor, Ph.D.; research interests: data mining, semantic decision support.
  • Supported by:

    National Natural Science Foundation of China (61403238, 61071192, 61271357, 61171178); Natural Science Foundation of Shanxi Province (2014021022-1); Shanxi Province Outstanding Innovation Project for Graduate Students (20123098); Shanxi Province International Cooperation Project (2013081035).

Abstract:

In the era of web big data, most research on topic evolution in text streams is based on classical probabilistic topic models. The bag-of-words assumption underlying these models leads to a loss of topic semantics and restricts evolution analysis to batch (retrospective) processing. To tackle these problems, an online incremental topic evolution algorithm based on feature ontology was proposed. Firstly, a feature ontology was built from word co-occurrence and the general-purpose ontology base WordNet, and the topics in the text stream were modeled with it. Secondly, a text stream topic matrix construction algorithm was put forward to realize online incremental topic evolution analysis. Finally, based on this matrix, an algorithm for constructing the text stream topic ontology evolution graph was put forward, in which topic similarity was computed from the sub-graph similarity of feature ontologies, so that the evolution patterns of topics in the text stream over time were obtained. Experiments on scientific literature showed that the proposed algorithm reduced the time complexity to O(nK+N), an improvement over the classical probabilistic topic evolution model, while its satisfaction was no worse than that of the sliding-window based online Latent Dirichlet Allocation (LDA) model. By introducing an ontology annotated with semantic relations, the proposed method can display the semantic features of topics graphically and, on this basis, builds the topic evolution graph online and incrementally; it therefore has advantages in semantic interpretability and topic visualization.

Key words: text stream, topic modeling, feature ontology, topic evolution, word co-occurrence
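
The abstract describes comparing topics through the sub-graph similarity of their feature ontologies in order to link topics across time. The following Python sketch illustrates only that general idea under loudly stated assumptions: each topic is reduced to a set of word co-occurrence edges, the similarity is a plain Jaccard overlap of edge sets (a stand-in, since the abstract does not specify the paper's actual measure), and WordNet relation annotation is omitted. The function names `cooccurrence_edges`, `edge_jaccard`, and `link_topics`, the threshold value, and the toy data are all hypothetical.

```python
from itertools import combinations


def cooccurrence_edges(docs, min_count=1):
    """Return undirected word pairs that co-occur in at least min_count documents."""
    counts = {}
    for doc in docs:
        # Sort the distinct words so each pair has one canonical orientation.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, count in counts.items() if count >= min_count}


def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two edge sets (assumed stand-in for sub-graph similarity)."""
    if not edges_a and not edges_b:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def link_topics(prev_topics, curr_topics, threshold=0.15):
    """Link topics of adjacent time slices whose similarity reaches the threshold."""
    links = []
    for prev_name, prev_edges in prev_topics.items():
        for curr_name, curr_edges in curr_topics.items():
            sim = edge_jaccard(prev_edges, curr_edges)
            if sim >= threshold:
                links.append((prev_name, curr_name, sim))
    return links


# Toy time slices (hypothetical data): topic name -> co-occurrence edge set.
slice_1 = {"t1": cooccurrence_edges([["topic", "model", "text"], ["topic", "model"]])}
slice_2 = {
    "t2": cooccurrence_edges([["topic", "model", "stream"]]),
    "t3": cooccurrence_edges([["image", "reconstruction"]]),
}
links = link_topics(slice_1, slice_2)
```

Each element of `links` is one edge of a toy evolution graph, connecting a topic in the earlier slice to a sufficiently similar topic in the later one. Because only adjacent slices are compared, new slices can be appended incrementally without reprocessing earlier ones, which mirrors the online setting the abstract describes.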

CLC Number: