Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (6): 1626-1630. DOI: 10.11772/j.issn.1001-9081.2014.06.1626

• Artificial Intelligence •

Precise text mining using low-rank matrix decomposition

HUANG Xiaohai1,2,3, GUO Zhi1,2, HUANG Yu1,2

  1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China;
    2. Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. School of Information Science and Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2013-11-28 Revised: 2013-12-31 Online: 2014-06-01 Published: 2014-07-02
  • Contact: HUANG Xiaohai
  • About the authors: HUANG Xiaohai (1990-), male, born in Putian, Fujian, is an M.S. candidate; his research interest is text mining. GUO Zhi (1975-), male, born in Hohhot, Inner Mongolia, is a research fellow with a Ph.D.; his research interest is signal and information processing. HUANG Yu (1980-), male, born in Dandong, Liaoning, is an assistant research fellow with a Ph.D.; his research interests are data mining and visualization.

Abstract:

Applications such as full-text retrieval require a precise representation of text content, yet traditional topic models can only extract the topic background of a text and cannot describe its specific emphases in fine detail. To address this, a low-rank and sparse text representation model was proposed: the representation is decomposed into a low-rank component that captures the topic background and a sparse component that captures the keywords describing different aspects of the topic. To realize the decomposition, a topic matrix was defined and Robust Principal Component Analysis (RPCA) was introduced to perform the matrix factorization. Experiments on a news corpus show that the model complexity is 25% lower than that of Latent Dirichlet Allocation (LDA). In practical applications, using the low-rank component for text classification reduces the number of features required by 28.7%, making it useful for dimensionality reduction of feature sets; using the sparse component for full-text retrieval improves retrieval precision by 10.8% over LDA, helping to optimize the hit rate of retrieval results.
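The low-rank plus sparse decomposition described in the abstract is the convex program commonly solved in RPCA (min ||L||_* + λ||S||_1 subject to M = L + S). The sketch below is an illustrative NumPy implementation using the inexact augmented Lagrange multiplier scheme, not the authors' code; the weight λ = 1/√max(m, n) and the penalty schedule are common RPCA defaults and are assumptions here.

```python
import numpy as np

def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M into a low-rank part L and a sparse part S by solving
    min ||L||_* + lam * ||S||_1  s.t.  M = L + S
    with an inexact augmented Lagrange multiplier scheme."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))           # standard PCP weight
    norm_M = np.linalg.norm(M)                   # Frobenius norm
    mu = 1.25 / (np.linalg.norm(M, 2) + 1e-12)   # initial penalty (spectral norm)
    rho = 1.5                                    # penalty growth rate
    Y = np.zeros_like(M)                         # dual variable
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value thresholding of (M - S + Y/mu)
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: elementwise soft thresholding of (M - L + Y/mu)
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual ascent on the constraint M = L + S, then tighten the penalty
        Y += mu * (M - L - S)
        mu *= rho
        if np.linalg.norm(M - L - S) <= tol * norm_M:
            break
    return L, S
```

Applied to a term-document (or topic) matrix, the low-rank part L plays the role of the shared topic background while the sparse part S highlights document-specific keywords, mirroring the model proposed in the paper.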

CLC number: