基于加权核非负矩阵分解的短文本聚类算法

doi:10.11772/j.issn.1001-9081.2018020356

计算机应用 ›› 2018, Vol. 38 ›› Issue (8): 2180-2184.DOI: 10.11772/j.issn.1001-9081.2018020356

基于加权核非负矩阵分解的短文本聚类算法

曹大为¹, 贺超波², 陈启买¹, 刘海¹

1. 华南师范大学计算机学院, 广州 510631;
2. 仲恺农业工程学院信息科学技术学院, 广州 510225

收稿日期:2018-02-08 修回日期:2018-03-18 发布日期:2018-08-11 出版日期:2018-08-10
通讯作者: 陈启买
作者简介:曹大为(1993-),男,浙江天台人,硕士研究生,CCF会员,主要研究方向:数据挖掘、机器学习;贺超波(1981-),男,广东河源人,副教授,博士,CCF高级会员,主要研究方向:数据挖掘、社会计算;陈启买(1965-),男,湖南衡阳人,教授,硕士,主要研究方向:数据挖掘、机器学习;刘海(1974-),男,湖南张家界人,副教授,博士,CCF会员,主要研究方向:文本挖掘、深度学习。
基金资助:
广东省科技计划项目（2017A040405057，2017A030303074，2016A030303058，2015A020209178）；广州市科技计划项目（201604016035，201807010043）。

Short text clustering algorithm based on weighted kernel nonnegative matrix factorization

CAO Dawei¹, HE Chaobo², CHEN Qimai¹, LIU Hai¹

1. School of Computer, South China Normal University, Guangzhou Guangdong 510631, China;
2. School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou Guangdong 510225, China

Received:2018-02-08 Revised:2018-03-18 Online:2018-08-11 Published:2018-08-10
Supported by:
This work is partially supported by the Science and Technology Planning Project of Guangdong Province (2017A040405057, 2017A030303074, 2016A030303058, 2015A020209178), the Science and Technology Planning Project of Guangzhou (201604016035,201807010043).

摘要/Abstract

摘要： 对互联网产生的大量短文本进行聚类分析具有重要的应用价值，但由于短文本存在特征稀疏和特征难以提取的问题，导致传统的文本聚类算法难以有效处理该问题。为了解决该问题，利用非负矩阵分解（NMF）模型提出基于加权核非负矩阵分解（WKNMF）的短文本聚类算法。该算法通过核方法的映射关系将稀疏特征空间映射到高维隐性空间，从而可以充分利用短文本中的隐性语义特征进行聚类；另外，利用核技巧简化高维数据的复杂运算，并通过迭代更新规则不断地动态调整短文本的权重向量，从而可以区分不同短文本对聚类的重要性。在真实的微博数据集上进行了相关实验，结果表明WKNMF算法比K均值、隐含狄利克雷分布（LDA）、NMF和自组织神经网络（SOM）具有更好的聚类质量，准确度和归一化互信息分别达到了66.38%和66.91%。

关键词: 核方法, 短文本聚类, 非负矩阵分解, 核技巧, 迭代优化求解

Abstract: Clustering analysis of a large number of short texts generated by the Internet is of great application value. Because the characteristics of short texts such as sparse features and difficulty of extracting features, the traditional text clustering algorithm faces many challenges in short text clustering. To solve the problem, a short text clustering algorithm based on Weighted Kernel Nonnegative Matrix Factorization (WKNMF) was proposed by using Nonnegative Matrix Factorization (NMF) model. To make full use of hidden semantic features in short texts for clustering, sparse feature space was mapped to high-dimensional implicit vectors by using kernel method. In addition, kernel trick was used to simplify the complex operation of high-dimensional data, and the weight vectors of short texts were dynamically adjusted through iterative optimization updating rules, so that the importance of different short texts to clustering can be distinguished. Experiments were conducted on real Micro-blog data sets and WKNMF algorithm was compared with K-means, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF) and Self-Organization Map (SOM). The experimental results show that the proposed WKNMF algorithm has a better clustering performance than the contrast algorithms, its accuracy and Normalized Mutual Information (NMI) reach 66.38% and 66.91% respectively.

Key words: kernel method, short text clustering, Nonnegative Matrix Factorization (NMF), kernel trick, iterative optimization

中图分类号:

TP391

曹大为, 贺超波, 陈启买, 刘海. 基于加权核非负矩阵分解的短文本聚类算法[J]. 计算机应用, 2018, 38(8): 2180-2184.

CAO Dawei, HE Chaobo, CHEN Qimai, LIU Hai. Short text clustering algorithm based on weighted kernel nonnegative matrix factorization[J]. Journal of Computer Applications, 2018, 38(8): 2180-2184.

参考文献

[1] WANG J, LI Q, CHEN Y P, et al. Recommendation in internet forums and blogs[C]//ACL 2010:Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA:Association for Computational Linguistics, 2010:257-265.
[2] LI J, RITTER A, HOVY E. Weakly supervised user profile extraction from twitter[C]//ACL 2014:Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA:Association for Computational Linguistics, 2014:165-174.
[3] 史庆伟,刘雨诗,张丰田.基于微博文本的词对主题演化模型[J].计算机应用,2017,37(5):1407-1412. (SHI Q W, LIU Y S, ZHANG F T. Biterm topic evolution model of microblog[J]. Journal of Computer Applications, 2017, 37(5):1407-1412.)
[4] BANERJEE S, RAMANATHAN K, GUPTA A. Clustering short texts using Wikipedia[C]//SIGIR 2007:Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York:ACM, 2007:787-788.
[5] FODEH S, PUNCH B, TAN P-N. On ontology-driven document clustering using core semantic features[J]. Knowledge and Information Systems, 2011, 28(2):395-421.
[6] EFRON M, ORGANISCIAK P, EENLON K. Improving retrieval of short texts through document expansion[C]//SIGIR 2012:Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York:ACM, 2012:911-920.
[7] TANG J, WANG X, GAO H, et al. Enriching short text representation in microblog for clustering[J]. Frontiers of Computer Science, 2012, 6(1):88-101.
[8] HU X, SUN N, ZHANG C, et al. Exploiting internal and external semantics for the clustering of short texts using world knowledge[C]//CIKM 2009:Proceedings of the 18th ACM Conference on Information and Knowledge Management. New York:ACM, 2009:919-928.
[9] KIM Y. Convolutional neural networks for sentence classification[C]//EMNLP 2014:Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA:Association for Computational Linguistics, 2014:1746-1751.
[10] 周飞燕,金林鹏,董军.卷积神经网络研究综述[J].计算机学报,2017,40(6):1229-1251. (ZHOU F Y, JIN L P, DONG J. Review of convolutional neural network[J]. Chinese Journal of Computers, 2017, 40(6):1229-1251.)
[11] LEE D D. Learning the parts of objects by non-negative matrix factorization[J]. Nature, 1999, 401(10):788-791.
[12] VAGHMARE A K, RAO C V R. Unsupervised noise removal technique based on constrained NMF[J]. IET Signal Processing, 2017, 11(7):788-795.
[13] 涂鼎,陈岭,陈根才,等.基于在线层次化非负矩阵分解的文本流主题检测[J].浙江大学学报(工学版),2016,50(8):1618-1626. (TU D, CHEN L, CHEN G C, et al. Hierarchical online NMF for detecting and tracking topic[J]. Journal of Zhejiang University (Engineering Science), 2016, 50(8):1618-1626.)
[14] 徐海勇,郁梅,骆挺,等.基于非负矩阵分解的彩色图像质量评价方法[J].电子与信息学报,2016,38(3):578-585. (XU H Y, YU M, LUO T, et al. A color image quality assessment method based on non-negative matrix factorization[J]. Journal of Electronics and Information Technology, 2016, 38(3):578-585.)
[15] 陈增强,谢征,张青.基于非负矩阵分解的复杂网络重构[J].复杂系统与复杂性科学,2016,13(3):26-32. (CHEN Z Q, XIE Z, ZHANG Q. Complex network reconstruction based on nonnegative matrix factorization[J]. Complex Systems and Complexity Science, 2016, 13(3):26-32.)
[16] XU W, LIU X, GONG Y. Document clustering based on non-negative matrix factorization[C]//SIGIR 2003:Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York:ACM, 2003:267-273.
[17] 袁伟,朱山风.基于距离学习的生物医学文本聚类算法研究[J].计算机应用与软件,2010,27(11):4-5. (YUAN W, ZHU S F. Study on biomedical text clustering algorithm based on metric learning[J]. Computer Applications and Software, 2010, 27(11):4-5.)
[18] LIN X, ZHANG Q, WEI G. The clustering algorithm for Chinese texts based on Lingo[C]//FSKD 2011:Eighth International Conference on Fuzzy Systems and Knowledge Discovery. Washington, D.C.:IEEE Computer Society, 2011, 2:1187-1190.
[19] LIU Y, PI D, CHENG Q. Ensemble kernel method:SVM classification based on game theory[J]. Journal of Systems Engineering and Electronics, 2016, 27(1):251-259.
[20] ZHANG L, HU X. Locally adaptive multiple kernel clustering[J]. Neurocomputing, 2014, 137:192-197.
[21] ZHANG D, ZHOU Z-H, CHEN S. Non-negative matrix factorization on kernels[C]//PRICAI 2006:Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence. Berlin:Springer, 2006:404-412.
[22] AN S, YUN J-M, CHOI S. Multiple kernel nonnegative matrix factorization[C]//ICASSP 2011:Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing. Washington, DC:IEEE Computer Society, 2011:1976-1979.
[23] ZHU F, HONEINE P, KALLAS M. Kernel nonnegative matrix factorization without the pre-image problem[C]//MLSP 2014:Proceedings of the 2014 IEEE International Workshop on Machine Learning for Signal Processing. Piscataway, NJ:IEEE, 2014:1-6.
[24] LIN C-J. Projected gradient methods for nonnegative matrix factorization [J]. Neural Computation, 2007, 19(10): 2756-2779.
[25] LLOYD S. Least squares quantization in PCM [J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.
[26] RAO C R. The utilization of multiple measurements in problems of biological classification [J]. Journal of the Royal Statistical Society, Series B (Methodological), 1948, 10(2): 159-203.
[27] 张立文,徐家宁,李进,等.基于免疫网络和SOM的文本聚类算法研究[J].计算机应用与软件,2010,27(5):118-120. (ZHANG L W, XU J N, LI J, et al. Research on text clustering algorithm based on artificial immune network and SOM [J]. Computer Applications and Software, 2010, 27(5): 118-120.)
[28] MUNKRES J. Algorithms for the assignment and transportation problems [J]. Journal of the Society for Industrial and Applied Mathematics, 1957, 5(1): 32-38.

基于加权核非负矩阵分解的短文本聚类算法

Short text clustering algorithm based on weighted kernel nonnegative matrix factorization

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

编辑推荐

Metrics

[1]	任奇泽, 贾洪杰, 陈东宇. 融合局部结构学习的大规模子空间聚类算法[J]. 《计算机应用》唯一官方网站, 2023, 43(12): 3747-3754.
[2]	包永春, 张建臣, 杜守信, 张军军. 基于非负矩阵分解与稀疏表示的多标签分类算法[J]. 《计算机应用》唯一官方网站, 2022, 42(5): 1375-1382.
[3]	刘清华, 廖士中. 基于随机素描方法的在线核回归[J]. 《计算机应用》唯一官方网站, 2022, 42(3): 676-682.
[4]	管娇娇, 钱雪忠, 周世兵, 姜凯彬, 宋威. 基于格拉斯曼流形子空间融合的多视图聚类[J]. 《计算机应用》唯一官方网站, 2022, 42(12): 3740-3749.
[5]	王锦凯, 贾旭. 基于迁移孪生非负矩阵分解的静脉识别算法[J]. 计算机应用, 2021, 41(3): 898-903.
[6]	陈献, 胡丽莹, 林晓炜, 陈黎飞. 基于核非负矩阵分解的有向图聚类算法[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3447-3454.
[7]	杜汉, 龙显忠, 李云. 基于图学习正则判别非负矩阵分解的人脸识别[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3455-3461.
[8]	李华, 卢桂馥, 余沁茹. 基于干净数据的流形正则化非负矩阵分解[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3492-3498.
[9]	王锦凯, 贾旭. 基于加权正交约束非负矩阵分解的车脸识别算法[J]. 计算机应用, 2020, 40(4): 1050-1055.
[10]	成其伟, 陈启买, 贺超波, 刘海. 基于改进对称二值非负矩阵分解的重叠社区发现方法[J]. 计算机应用, 2020, 40(11): 3203-3210.
[11]	刘颖, 梁楠楠, 李大湘, 杨凡超. 基于光谱距离聚类的高光谱图像解混算法[J]. 计算机应用, 2019, 39(9): 2541-2546.
[12]	陈善学, 储成泉. 基于稀疏和正交约束非负矩阵分解的高光谱解混[J]. 计算机应用, 2019, 39(8): 2276-2280.
[13]	杨亮东, 杨志霞. 稀疏限制的增量式鲁棒非负矩阵分解及其应用[J]. 计算机应用, 2019, 39(5): 1275-1281.
[14]	余江兰, 李向利, 赵朋飞. 基于核技巧和超图正则的稀疏非负矩阵分解[J]. 计算机应用, 2019, 39(3): 742-749.
[15]	李艳生, 刘园, 张毅. 基于感知掩蔽的重构非负矩阵分解单通道语音增强算法[J]. 计算机应用, 2019, 39(3): 894-898.