Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (12): 3401-3406.

• Database and Data Mining •

Annotating Web documents at multiple granularities with a statistical topic model

Liu YUAN¹, Longbo ZHANG²

  1. Shaanxi Normal University
  2.
  • Received: 2010-06-23 Revised: 2010-07-26 Online: 2010-12-22 Published: 2010-12-01
  • Corresponding author: Liu YUAN
  • Supported by:
    Research on data stream mining for intrusion detection

Abstract: Existing semantic annotation techniques for Web documents fall short in annotation completeness. To address this weakness, the Latent Dirichlet Allocation (LDA) model is applied to the semantic annotation of Web documents. Because Web documents have pronounced domain characteristics, domain information is embedded into the conventional LDA model, yielding a new model called domain-enabled LDA, which improves the completeness of the annotation results and avoids forcing topic assignments onto words. An association is also established between the latent topics of a document and the concepts of the ontology for the document's domain, so that the semantics expressed by the ontology concepts can be used to interpret the latent topics accurately. This makes the semantics of the document explicit and supports document retrieval. Because the LDA model can assign a latent topic to each word in a document, the notion of multi-granularity semantic annotation is proposed. Experiments on the 20news-group and WebKB data sets demonstrate the effectiveness of the domain-enabled LDA model and indicate that annotating documents at multiple granularities helps handle different types of queries in information retrieval.

Key words: statistical topic model, ontology, semantic annotation, concept, information retrieval
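
The abstract does not spell out how the multi-granularity annotations are built, so the following is only a minimal sketch of one plausible reading: per-word topic assignments from an already-fitted topic model are rolled up into word-level, sentence-level, and document-level annotations. The function name `annotate_multi_granularity`, the topic-to-concept mapping `topic_concepts`, the concept names, and the majority-vote aggregation are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def annotate_multi_granularity(sentences, word_topics, topic_concepts):
    """Roll per-word topic ids up into word-, sentence-, and document-level labels.

    sentences:      list of sentences, each a list of word tokens.
    word_topics:    same shape as `sentences`, one topic id per word
                    (assumed to come from an already-fitted LDA-style model).
    topic_concepts: hypothetical mapping from topic id to an ontology concept name.
    """
    word_level = []
    sentence_level = []
    doc_counts = Counter()

    for words, topics in zip(sentences, word_topics):
        # Finest granularity: each word carries the concept of its own topic.
        word_level.append([(w, topic_concepts[t]) for w, t in zip(words, topics)])

        if not topics:
            sentence_level.append(None)  # nothing to vote on in an empty sentence
            continue

        # Middle granularity: majority topic within the sentence (assumed rule).
        counts = Counter(topics)
        doc_counts.update(counts)
        sentence_level.append(topic_concepts[counts.most_common(1)[0][0]])

    # Coarsest granularity: share of words assigned to each concept in the document.
    total = sum(doc_counts.values())
    doc_level = {topic_concepts[t]: n / total for t, n in doc_counts.items()} if total else {}
    return word_level, sentence_level, doc_level


# Toy input; the topic ids and concept names are made up for this example.
sentences = [["course", "syllabus", "exam"], ["professor", "research", "course"]]
word_topics = [[0, 0, 0], [1, 1, 0]]
topic_concepts = {0: "Course", 1: "Faculty"}

word_level, sentence_level, doc_level = annotate_multi_granularity(
    sentences, word_topics, topic_concepts
)
print(sentence_level)
print(doc_level)
```

Under this reading, a narrow query could be matched against word-level or sentence-level labels, while a broad query could use the document-level concept shares. Which granularity suits which query type is a design choice the abstract leaves open.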