Journal of Computer Applications, 2018, Vol. 38, Issue (11): 3127-3131. DOI: 10.11772/j.issn.1001-9081.2018041357

• The 7th China Conference on Data Mining (CCDM 2018) •

Multi-dimensional text clustering with user behavior characteristics

LI Wanying1, HUANG Ruizhang1,2,3, DING Zhiyuan1, CHEN Yanping1,2, XU Liyang1

  1. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China;
  2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China;
  3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210093, China
  • Received: 2018-04-30 Revised: 2018-06-21 Online: 2018-11-10 Published: 2018-11-10
  • Corresponding author: HUANG Ruizhang
  • About the authors: LI Wanying (born 1992), female, from Kaiyang, Guizhou, M.S. candidate; research interests: data mining, text mining, machine learning. HUANG Ruizhang (born 1979), female, from Tianjin, associate professor, Ph.D., CCF member; research interests: data mining, text mining, machine learning, information retrieval. DING Zhiyuan (born 1993), male, from Xiaogan, Hubei, M.S. candidate; research interests: data mining, text mining, machine learning. CHEN Yanping (born 1980), male, from Qiannan, Guizhou, associate professor, Ph.D.; research interests: artificial intelligence, natural language processing. XU Liyang (born 1990), male, from Qiannan, Guizhou, M.S. candidate; research interests: data mining, text mining, machine learning.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61462011), the Major Research Program of the National Natural Science Foundation of China (91746116), the Major Applied Basic Research Program of Guizhou Province (JZ20142001), the Major Special Science and Technology Projects of Guizhou Province ([2017]3002), and the Science and Technology Project of Guizhou Province ([2018]1035).

Abstract: Traditional multi-dimensional text clustering generally extracts features from text content alone and seldom considers the interaction between users and texts, such as likes, reposts, comments, follows, and citations. Moreover, it mainly combines multiple spatial dimensions linearly and fails to consider the relationships among attributes within each dimension. To make effective use of text-related user behavior information, a Multi-dimensional Text Clustering with User Behavior Characteristics (MTCUBC) model is proposed. Following the principle that the similarity between texts should be consistent across different spaces, the model uses user behavior information as constraints on text content clustering to adjust the similarity, and then applies metric learning to improve the distances between texts, thereby improving the clustering result. Experiments show that, compared with linearly combined multi-dimensional clustering, the MTCUBC model has a clear advantage on high-dimensional sparse data.

Key words: multi-dimensional clustering, metric learning, constraint, user behavior characteristics
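
The contrast drawn in the abstract, between combining dimensions linearly and using user behavior as constraints on content similarity, can be illustrated with a small sketch. This is not the MTCUBC algorithm, whose details the abstract does not give. The function names, the thresholds `must_thr` and `cannot_thr`, the update rule with `step`, and the toy matrices are all assumptions made for illustration, and the metric learning stage is omitted.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T

def linear_combination(S_text, S_user, alpha=0.5):
    """Baseline: weighted sum of the two similarity matrices."""
    return alpha * S_text + (1.0 - alpha) * S_user

def constraints_from_behavior(S_user, must_thr=0.8, cannot_thr=0.1):
    """Hypothetical rule: very similar behavior gives a must-link pair,
    very dissimilar behavior gives a cannot-link pair."""
    rows, cols = np.triu_indices(S_user.shape[0], k=1)
    must, cannot = [], []
    for i, j in zip(rows, cols):
        if S_user[i, j] >= must_thr:
            must.append((int(i), int(j)))
        elif S_user[i, j] <= cannot_thr:
            cannot.append((int(i), int(j)))
    return must, cannot

def adjust_similarity(S_text, must, cannot, step=0.5):
    """Pull must-link pairs toward 1 and cannot-link pairs toward 0
    (an assumed update rule, chosen only to keep values in [0, 1])."""
    S = S_text.copy()
    for i, j in must:
        S[i, j] = S[j, i] = S[i, j] + step * (1.0 - S[i, j])
    for i, j in cannot:
        S[i, j] = S[j, i] = S[i, j] * (1.0 - step)
    return S

# Toy data: 4 documents, 3 terms, 3 users (all values are made up).
text_features = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]], dtype=float)
user_features = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)

S_text = cosine_similarity(text_features)
S_user = cosine_similarity(user_features)
must_links, cannot_links = constraints_from_behavior(S_user)
S_adjusted = adjust_similarity(S_text, must_links, cannot_links)
S_linear = linear_combination(S_text, S_user)
print(np.round(S_adjusted, 3))
print(np.round(S_linear, 3))
```

Either matrix could then be passed to a similarity-based clustering method. In the constraint variant, the behavior space only changes content similarities for pairs where the behavior signal is strong, whereas the linear baseline blends the two spaces for every pair.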

CLC Number: