Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (09): 2429-2431. DOI: 10.3724/SP.J.1087.2011.02429

• Database Technology •

Latent semantic features selection based on support vector machine

LI Min-song, DUAN Zhuo-hua

  1. College of Computer Science, Shaoguan University, Shaoguan, Guangdong 512022, China
  • Received: 2011-01-24 Revised: 2011-03-19 Online: 2011-09-01 Published: 2011-09-01
  • Contact: LI Min-song
  • About the authors:

    LI Min-song (1974-), male, born in Chenzhou, Hunan; lecturer, M.S., CCF member. His research interests include artificial intelligence and computer networks.
    DUAN Zhuo-hua (1969-), male, born in Lengshuijiang, Hunan; associate professor, Ph.D. His research interests include artificial intelligence.

  • Supported by:
    Doctoral Research Start-up Project of the Natural Science Foundation of Guangdong Province (9451200501002983); Shaoguan Technology Innovation Project (韶科(成)2008-03)

Abstract: Latent Semantic Indexing (LSI) is an effective feature extraction method that captures the latent semantic structure among the words in documents. However, the feature subspace selected by LSI is not necessarily the most appropriate one for text classification, because LSI orders the extracted features by their variance and does not consider their classification capability. The high generalization ability of the Support Vector Machine (SVM) makes it especially suitable for classifying high-dimensional data such as documents. A feature selection method based on SVM is therefore proposed to choose the LSI features that are suited to classification. Making use of the generalization ability of SVM, the method estimates $w_k^2$, the contribution of the $k$-th feature to the inverse squared margin, from the parameters of the classifier trained under each rule. Experimental results indicate that the method improves classification performance with a more compact representation while requiring less training and testing time than LSI.

Key words: Latent Semantic Indexing (LSI), Vector Space Model (VSM), Singular Value Decomposition (SVD), term-document matrix, Support Vector Machine (SVM)
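
The pipeline summarized in the abstract (LSI features from an SVD, a linear SVM trained on them, features ranked by the squared weight $w_k^2$) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the hyperparameters are hypothetical, a single binary classifier is assumed, and the plain subgradient-descent trainer is only a stand-in for a proper SVM solver.

```python
import numpy as np


def lsi_features(term_doc, n_components):
    """Project documents onto the leading LSI dimensions via truncated SVD.

    term_doc has shape (n_terms, n_docs). The result has shape
    (n_docs, n_components), with columns ordered by decreasing singular value.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (vt[:n_components] * s[:n_components, None]).T


def train_linear_svm(x, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize the L2-regularized hinge loss by full-batch subgradient descent.

    Assumption: a deliberately simple stand-in for a real SVM solver.
    Labels y must take values in {-1, +1}.
    """
    w = np.zeros(x.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        margins = y * (x @ w + b)
        active = margins < 1  # samples that violate the margin
        grad_w = lam * w - (y[active][:, None] * x[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def rank_features_by_weight(w):
    """Return feature indices sorted by decreasing squared weight w_k^2."""
    return np.argsort(-(w ** 2))


# Hypothetical toy data: 6 terms x 8 documents, two classes.
rng = np.random.default_rng(0)
term_doc = rng.poisson(1.0, size=(6, 8)).astype(float)
term_doc[:3, :4] += 2.0  # the first class uses the first three terms more often
labels = np.array([1, 1, 1, 1, -1, -1, -1, -1])

z = lsi_features(term_doc, n_components=4)
w, b = train_linear_svm(z, labels)
order = rank_features_by_weight(w)
print("LSI dimensions ranked by w_k^2:", order)
```

Keeping only the first few entries of `order` gives a compact feature set chosen for its contribution to the classifier, whereas plain LSI would keep the leading dimensions in singular-value order regardless of the labels.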

CLC Number: