Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (05): 1355-1358.

• Artificial intelligence •

Weight computing method for text feature terms by integrating word sense

LI Ming-tao1, LUO Jun-yong1,2, YIN Mei-juan3, LU Lin1

  1. Institute of Information Engineering, Information Engineering University, Zhengzhou, Henan 450002, China
  2. Institute of Information Engineering, Information Engineering University
  3. Institute of Information Engineering, Information Engineering University, Zhengzhou, Henan 450002, China
  • Received: 2011-11-04 Revised: 2011-12-28 Online: 2012-05-01 Published: 2012-05-01
  • Contact: LI Ming-tao
  • About the authors: LI Ming-tao (born 1984), male, from Xiangyang, Hubei, master's candidate; main research interests: social network analysis and data mining. LUO Jun-yong (born 1964), male, from Nanchang, Jiangxi, professor; main research interests: information security and data mining. YIN Mei-juan (born 1977), female, from Wuhu, Anhui, lecturer; main research interests: social network analysis and data mining. LU Lin (born 1983), female, from Handan, Hebei, master's candidate; main research interests: social network analysis and network information security.

Abstract: Most existing methods that compute text similarity with the Vector Space Model (VSM) use TF-IDF scores as the weights of feature terms. This ignores the word sense similarity among feature terms, so the result does not accurately reflect how similar two texts are. To address this problem, a term weight computing method that integrates word sense is proposed. First, the word sense similarity between feature terms is computed as the cosine of their word sense vectors, based on Chinese WordNet. Then the TF-IDF weights are revised according to these similarities, so that the revised weights reflect both the frequency and the word sense of the feature terms. Experimental results on the HIT IR-lab Multi-Document Summarization Corpus show that computing text similarity from the revised weights effectively improves the differentiation among document clusters.

Key words: text similarity, feature term weight, word sense similarity, Chinese WordNet
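
The idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's revision formula, so the `revise` function below is an assumed, simplified scheme (each term passes a share of its TF-IDF weight to sense-similar terms), not the authors' method. The toy documents in `DOCS` and the similarity table `SIM`, which stands in for similarities derived from Chinese WordNet, are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tokenized documents.
DOCS = [
    ["car", "engine"],
    ["automobile", "motor"],
    ["stock", "market"],
]

# Hypothetical word sense similarities in [0, 1]. In the paper these would
# come from word sense vectors built on Chinese WordNet.
SIM = {("car", "automobile"): 0.9, ("engine", "motor"): 0.8}


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF times log IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def revise(weights, sim):
    """Assumed revision scheme (not the paper's formula): every term in the
    document adds sim * weight to each term it is sense-similar to."""
    revised = dict(weights)
    for (a, b), s in sim.items():
        for src, dst in ((a, b), (b, a)):
            if src in weights:
                revised[dst] = revised.get(dst, 0.0) + s * weights[src]
    return revised


if __name__ == "__main__":
    base = tf_idf(DOCS)
    rev = [revise(w, SIM) for w in base]
    print("plain TF-IDF, doc0 vs doc1:", cosine(base[0], base[1]))
    print("revised,      doc0 vs doc1:", cosine(rev[0], rev[1]))
    print("revised,      doc0 vs doc2:", cosine(rev[0], rev[2]))
```

With plain TF-IDF the first two documents share no terms, so their cosine similarity is 0 even though they describe the same topic. Under the assumed revision it rises to roughly 0.99, while the unrelated third document stays at 0, which is the kind of improved cluster differentiation the abstract reports.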
