Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (8): 2103-2108. DOI: 10.11772/j.issn.1001-9081.2016.08.2103

• The 6th China Conference on Data Mining (CCDM 2016)

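Knowledge mining and visualizing for scenic spots with probabilistic topic model

XU Jie, FAN Yushun, BAI Bing

  1. Department of Automation, Tsinghua University, Beijing 100084, China
  • Received: 2016-03-01 Revised: 2016-05-11 Online: 2016-08-10 Published: 2016-08-10
  • Corresponding author: FAN Yushun
  • About the authors: XU Jie (born 1990), female, from Yantai, Shandong, is an M.S. candidate; her research interests include business process management, service recommendation, and big data. FAN Yushun (born 1962), male, from Yangzhou, Jiangsu, is a professor and doctoral supervisor with a Ph.D.; his research interests include enterprise modeling and optimization analysis, business process reengineering, and workflow management. BAI Bing (born 1990), male, from Beijing, is a Ph.D. candidate; his research interests include services computing, service recommendation, and big data.
  • Funding:
    Specialized Research Fund for the Doctoral Program of Higher Education (20120002110034).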
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61572419, 61403328, 61403329) and the Natural Science Foundation of Shandong Province (ZR2013FM011, 2015GSF115009, ZR2014FQ016, ZR2014FQ026).

Abstract: Tourism texts about destinations are noisy, cover many scenic spots, and are hard to present intuitively. To address these problems, a scenic spot-topic model based on the probabilistic topic model was proposed. The model assumes that a single document involves several correlated scenic spots, and it introduces a special scenic spot, called the "global scenic spot", to filter out semantic noise. The Gibbs sampling algorithm was then used to learn the maximum a posteriori estimates of the model parameters and to obtain a topic distribution vector for each scenic spot at a destination. In the experiments, the topic features of the scenic spots were clustered, and the clustering quality was used to evaluate the trained model indirectly; the effect of the "global scenic spot" on the model was also analyzed qualitatively. The results show that the proposed model represents tourism text better than the baselines TF-IDF (Term Frequency-Inverse Document Frequency) and Latent Dirichlet Allocation (LDA), and that introducing the "global scenic spot" clearly improves the modeling effect. Finally, the results were visualized as a scenic spot association graph.

Key words: probabilistic topic model, tourism text, noise, Gibbs sampling, visualization
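
As a rough illustration of the Gibbs sampling step mentioned in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for plain LDA, the baseline named above. It is not the proposed scenic spot-topic model: it has no "global scenic spot" and does not model correlated scenic spots. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy word ids are assumptions made for this example only.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (baseline only, not the paper's model).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic distributions and
    per-topic word distributions, estimated from the final sample.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_{k,w}
    topic_total = [0] * n_topics                              # n_k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised full conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed point estimates computed from the final counts.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus with hypothetical word ids; real input would be tokenised tourism text.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [5, 3, 4, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a topic distribution vector for one document. In the setting the abstract describes, vectors of this kind (computed per scenic spot rather than per document) are what would be clustered and compared, but how the proposed model assigns them to scenic spots is not shown here.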

CLC number: