Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (6): 1626-1630. DOI: 10.11772/j.issn.1001-9081.2014.06.1626

• Artificial Intelligence •

Precise text mining using low-rank matrix decomposition

HUANG Xiaohai1,2,3, GUO Zhi1,2, HUANG Yu1,2

  1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China;
    2. Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China;
    3. School of Information Science and Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2013-11-28 Revised: 2013-12-31 Online: 2014-06-01 Published: 2014-07-02
  • Contact: HUANG Xiaohai
  • About the authors: HUANG Xiaohai (1990-), male, born in Putian, Fujian, is an M.S. candidate; his research interest is text mining. GUO Zhi (1975-), male, born in Hohhot, Inner Mongolia, is a research fellow with a Ph.D.; his research interest is signal and information processing. HUANG Yu (1980-), male, born in Dandong, Liaoning, is an assistant research fellow with a Ph.D.; his research interests are data mining and visualization.

Abstract:

Applications such as full-text retrieval require a precise representation of text content, yet traditional topic models can only extract the topic background of a text and cannot describe its specific emphases in fine detail. To address this, a low-rank and sparse text representation model was proposed: the representation is decomposed into a low-rank component that captures the topic background and a sparse component that captures the keywords describing different aspects of the topic. To realize the decomposition, a topic matrix was defined and Robust Principal Component Analysis (RPCA) was introduced to perform the matrix factorization. Experiments on a news corpus show that the model complexity is 25% lower than that of Latent Dirichlet Allocation (LDA). In practical applications, using the low-rank component for text classification reduces the number of features required by 28.7%, making it useful for dimensionality reduction of feature sets; using the sparse component for full-text retrieval improves retrieval precision by 10.8% over LDA, helping to optimize the hit rate of retrieval results.
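The low-rank plus sparse decomposition described in the abstract is the convex program commonly solved in RPCA (min ||L||_* + λ||S||_1 subject to M = L + S). The sketch below is an illustrative NumPy implementation using the inexact augmented Lagrange multiplier scheme, not the authors' code; the weight λ = 1/√max(m, n) and the penalty schedule are common RPCA defaults and are assumptions here.

```python
import numpy as np

def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M into a low-rank part L and a sparse part S by solving
    min ||L||_* + lam * ||S||_1  s.t.  M = L + S
    with an inexact augmented Lagrange multiplier scheme."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))           # standard PCP weight
    norm_M = np.linalg.norm(M)                   # Frobenius norm
    mu = 1.25 / (np.linalg.norm(M, 2) + 1e-12)   # initial penalty (spectral norm)
    rho = 1.5                                    # penalty growth rate
    Y = np.zeros_like(M)                         # dual variable
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value thresholding of (M - S + Y/mu)
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: elementwise soft thresholding of (M - L + Y/mu)
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual ascent on the constraint M = L + S, then tighten the penalty
        Y += mu * (M - L - S)
        mu *= rho
        if np.linalg.norm(M - L - S) <= tol * norm_M:
            break
    return L, S
```

Applied to a term-document (or topic) matrix, the low-rank part L plays the role of the shared topic background while the sparse part S highlights document-specific keywords, mirroring the model proposed in the paper.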

CLC number: