Journal of Computer Applications

• Database technology

Text term weighting approach based on latent semantic indexing

Yuan-yuan LI, Yong-qiang MA

  1. School of Information Science and Technology, Southwest Jiaotong University
  • Received: 2007-12-10 Revised: 2008-01-20 Online: 2008-06-01 Published: 2008-06-01
  • Corresponding author: Yuan-yuan LI

Abstract: Latent Semantic Indexing (LSI) is a document retrieval model developed over the last ten years. It is easy to compute and requires little human intervention. Term weighting, an important and difficult optimization step in LSI, is analyzed in detail. The most widely used term weighting method, TF-IDF, relies on linear processing, which is unreasonable, and it struggles to emphasize the key terms that contribute most to the content of a text. To address these shortcomings, a new weighting scheme based on the Sigmoid function and a location factor is proposed. The new method highlights the differing importance of terms in a document and is better suited to constructing the latent semantic space. Tests on an experimental platform, the "Chinese LSI Retrieval Analysis System", show that the new weighting method improves the performance of LSI-based information retrieval.

Key words: Latent Semantic Indexing, Sigmoid function, location factor, weighting algorithm
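
The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea: term frequency is scaled by a location factor, passed through a Sigmoid function instead of being used linearly, and then multiplied by an inverse document frequency. The function name `term_weights`, the zone names, the `LOCATION_FACTOR` values, and the exact form of the IDF term are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical location factors; the source does not give its values.
LOCATION_FACTOR = {"title": 2.0, "body": 1.0}


def sigmoid(x):
    """Standard logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def term_weights(docs):
    """Return one {term: weight} dict per document.

    docs: list of dicts mapping a zone name ("title", "body") to a token list.
    Assumed weight: sigmoid(location-weighted tf) * idf.
    This is an illustrative formula, not the one proposed in the paper.
    """
    n_docs = len(docs)

    # Location-weighted term frequency for each document.
    weighted_tf = []
    for doc in docs:
        tf = Counter()
        for zone, tokens in doc.items():
            for token in tokens:
                tf[token] += LOCATION_FACTOR.get(zone, 1.0)
        weighted_tf.append(tf)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tf in weighted_tf:
        df.update(tf.keys())

    # The Sigmoid bounds the frequency contribution below 1, so very
    # frequent terms cannot grow without limit as they do in linear TF.
    return [
        {
            term: sigmoid(freq) * math.log(n_docs / df[term] + 1.0)
            for term, freq in tf.items()
        }
        for tf in weighted_tf
    ]
```

In a full LSI pipeline, weights of this kind would fill the term-document matrix before the singular value decomposition is applied. Under these assumptions, a term that appears in a title gets a larger weighted frequency than the same term in the body, while the Sigmoid keeps repeated terms from dominating the matrix.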