Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (8): 2180-2184.DOI: 10.11772/j.issn.1001-9081.2018020356

Previous Articles     Next Articles

Short text clustering algorithm based on weighted kernel nonnegative matrix factorization

CAO Dawei1, HE Chaobo2, CHEN Qimai1, LIU Hai1   

  1. 1. School of Computer, South China Normal University, Guangzhou Guangdong 510631, China;
    2. School of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou Guangdong 510225, China
  • Received:2018-02-08 Revised:2018-03-18 Online:2018-08-10 Published:2018-08-11
  • Supported by:
    This work is partially supported by the Science and Technology Planning Project of Guangdong Province (2017A040405057, 2017A030303074, 2016A030303058, 2015A020209178), the Science and Technology Planning Project of Guangzhou (201604016035,201807010043).


曹大为1, 贺超波2, 陈启买1, 刘海1   

  1. 1. 华南师范大学 计算机学院, 广州 510631;
    2. 仲恺农业工程学院 信息科学技术学院, 广州 510225
  • 通讯作者: 陈启买
  • 作者简介:曹大为(1993-),男,浙江天台人,硕士研究生,CCF会员,主要研究方向:数据挖掘、机器学习;贺超波(1981-),男,广东河源人,副教授,博士,CCF高级会员,主要研究方向:数据挖掘、社会计算;陈启买(1965-),男,湖南衡阳人,教授,硕士,主要研究方向:数据挖掘、机器学习;刘海(1974-),男,湖南张家界人,副教授,博士,CCF会员,主要研究方向:文本挖掘、深度学习。
  • 基金资助:

Abstract: Clustering analysis of a large number of short texts generated by the Internet is of great application value. Because the characteristics of short texts such as sparse features and difficulty of extracting features, the traditional text clustering algorithm faces many challenges in short text clustering. To solve the problem, a short text clustering algorithm based on Weighted Kernel Nonnegative Matrix Factorization (WKNMF) was proposed by using Nonnegative Matrix Factorization (NMF) model. To make full use of hidden semantic features in short texts for clustering, sparse feature space was mapped to high-dimensional implicit vectors by using kernel method. In addition, kernel trick was used to simplify the complex operation of high-dimensional data, and the weight vectors of short texts were dynamically adjusted through iterative optimization updating rules, so that the importance of different short texts to clustering can be distinguished. Experiments were conducted on real Micro-blog data sets and WKNMF algorithm was compared with K-means, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF) and Self-Organization Map (SOM). The experimental results show that the proposed WKNMF algorithm has a better clustering performance than the contrast algorithms, its accuracy and Normalized Mutual Information (NMI) reach 66.38% and 66.91% respectively.

Key words: kernel method, short text clustering, Nonnegative Matrix Factorization (NMF), kernel trick, iterative optimization

摘要: 对互联网产生的大量短文本进行聚类分析具有重要的应用价值,但由于短文本存在特征稀疏和特征难以提取的问题,导致传统的文本聚类算法难以有效处理该问题。为了解决该问题,利用非负矩阵分解(NMF)模型提出基于加权核非负矩阵分解(WKNMF)的短文本聚类算法。该算法通过核方法的映射关系将稀疏特征空间映射到高维隐性空间,从而可以充分利用短文本中的隐性语义特征进行聚类;另外,利用核技巧简化高维数据的复杂运算,并通过迭代更新规则不断地动态调整短文本的权重向量,从而可以区分不同短文本对聚类的重要性。在真实的微博数据集上进行了相关实验,结果表明WKNMF算法比K均值、隐含狄利克雷分布(LDA)、NMF和自组织神经网络(SOM)具有更好的聚类质量,准确度和归一化互信息分别达到了66.38%和66.91%。

关键词: 核方法, 短文本聚类, 非负矩阵分解, 核技巧, 迭代优化求解

CLC Number: