Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (11): 3080-3083.

• Database Technology •

Dynamic Discovery of Authors' Research Interests in Scientific Literature

SHI Qingwei, LI Yanni, GUO Pengliang

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Received: 2013-05-13  Revised: 2013-07-17  Published: 2013-11-01  Online: 2013-12-04
  • Corresponding author: SHI Qingwei
  • About the authors: SHI Qingwei (1973-), male, born in Fuxin, Liaoning; associate professor, Ph. D.; research interests: text mining, machine learning. LI Yanni (1989-), female, born in Dalian, Liaoning; M. S. candidate; research interests: text mining. GUO Pengliang (1988-), male, born in Chaoyang, Liaoning; M. S. candidate; research interests: text mining, machine learning.

Abstract: To mine authors, topics and time, and the relationships among them, in large-scale scientific literature corpora, this paper proposed an Author-Topic over Time (AToT) model based on the internal and external features of scientific literature. In AToT, a document was represented as a mixture of probabilistic topics; each topic corresponded to a multinomial distribution over words and a Beta distribution over time, and each author corresponded to a multinomial distribution over topics. The topic-word distribution was determined not only by word co-occurrences within documents but also by document timestamps. The topic-word distributions and the author-topic distributions were used to describe the evolution of topics over time and the changes in authors' research interests, respectively. The model parameters could be learned from the document collection by Gibbs sampling. The experimental results on a collection of 1700 NIPS conference papers show that the AToT model can characterize the latent topic evolution, dynamically discover changes in authors' research interests, and predict the authors related to a given topic, while achieving lower perplexity than the author-topic model.
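The generative process described in the abstract can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the dimensions, hyperparameters, and Beta parameters below are all assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes and symmetric Dirichlet hyperparameters.
V, K, A = 50, 3, 4          # vocabulary size, topics, authors
alpha, beta = 0.5, 0.1

theta = rng.dirichlet([alpha] * K, size=A)   # author -> multinomial over topics
phi = rng.dirichlet([beta] * V, size=K)      # topic -> multinomial over words
psi = rng.uniform(1.0, 5.0, size=(K, 2))     # topic -> Beta(a, b) over normalized time

def generate_document(authors, n_words):
    """Generate (word, timestamp) pairs for a document with the given co-authors."""
    words, stamps = [], []
    for _ in range(n_words):
        a = rng.choice(authors)                # pick one co-author uniformly
        z = rng.choice(K, p=theta[a])          # topic from that author's multinomial
        words.append(rng.choice(V, p=phi[z]))  # word from the topic's multinomial
        stamps.append(rng.beta(*psi[z]))       # timestamp from the topic's Beta
    return words, stamps

words, stamps = generate_document(authors=[0, 2], n_words=100)
```

Because each topic also emits a timestamp, the fitted Beta distributions summarize when a topic was active, and the per-author topic multinomials summarize how an author's interests shift across the corpus.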

Key words: topic model, temporal analysis, unsupervised learning, text model, perplexity
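Perplexity, the evaluation measure used above to compare AToT with the author-topic model, is the exponential of the negative mean per-word log-likelihood on held-out text (lower is better). A minimal sketch, with made-up likelihood values for illustration:

```python
import math

def perplexity(log_likelihoods):
    """exp of the negative mean per-word log-likelihood (lower is better)."""
    n = len(log_likelihoods)
    return math.exp(-sum(log_likelihoods) / n)

# Sanity check: a model assigning uniform probability 1/100 to every word
# has perplexity 100 -- it is "as confused as" a 100-way uniform choice.
uniform_ppl = perplexity([math.log(1 / 100)] * 50)
```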

CLC number: