Multi-dimensional text clustering with user behavior characteristics

doi:10.11772/j.issn.1001-9081.2018041357

Journal of Computer Applications, 2018, Vol. 38, Issue (11): 3127-3131. DOI: 10.11772/j.issn.1001-9081.2018041357

The 7th China Conference on Data Mining (CCDM 2018)

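Corresponding author: HUANG Ruizhang
About the authors: LI Wanying (1992-), female, born in Kaiyang, Guizhou, master's student; research interests: data mining, text mining, machine learning. HUANG Ruizhang (1979-), female, born in Tianjin, associate professor, Ph.D., CCF member; research interests: data mining, text mining, machine learning, information retrieval. DING Zhiyuan (1993-), male, born in Xiaogan, Hubei, master's student; research interests: data mining, text mining, machine learning. CHEN Yanping (1980-), male, born in Qiannan, Guizhou, associate professor, Ph.D.; research interests: artificial intelligence, natural language processing. XU Liyang (1990-), male, born in Qiannan, Guizhou, master's student; research interests: data mining, text mining, machine learning.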

Multi-dimensional text clustering with user behavior characteristics

LI Wanying¹, HUANG Ruizhang^1,2,3, DING Zhiyuan¹, CHEN Yanping^1,2, XU Liyang¹

1. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China;
2. Guizhou Provincial Key Laboratory of Public Big Data(Guizhou University), Guiyang Guizhou 550025, China;
3. State Key Laboratory for Novel Software Technology(Nanjing University), Nanjing Jiangsu 210093, China

Received:2018-04-30 Revised:2018-06-21 Online:2018-11-10 Published:2018-11-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61462011), the Major Research Program of the National Natural Science Foundation of China (91746116), the Major Applied Basic Research Program of Guizhou Province (JZ20142001), the Major Special Science and Technology Projects of Guizhou Province ([2017]3002), and the Science and Technology Project of Guizhou Province ([2018]1035).

Abstract

Abstract: Traditional multi-dimensional text clustering generally extracts features from text content alone and seldom considers the interaction between users and texts, such as likes, reposts, comments, follows, and citations. Moreover, it mainly combines multiple feature spaces linearly and fails to consider the relationships between attributes within each dimension. To make effective use of text-related user behavior information, a Multi-dimensional Text Clustering with User Behavior Characteristics (MTCUBC) model is proposed. Based on the principle that the similarity between texts should be consistent across different spaces, the model uses user behavior information as constraints on text content clustering to adjust the similarity, and then applies metric learning to improve the distances between texts, thereby improving the clustering results. Experiments show that, compared with linearly combined multi-dimensional clustering, the MTCUBC model has a clear advantage on high-dimensional sparse data.

Key words: multi-dimensional clustering, metric learning, constraint, user behavior characteristics
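
As a rough illustration of the constraint idea described in the abstract, the sketch below adjusts content similarity using pairwise constraints derived from user behavior vectors. It is not the MTCUBC model itself, whose formulation is not given on this page: the function names, the thresholds `must_thr` and `cannot_thr`, the `step` factor, and the toy data are all assumptions made for illustration, and the metric learning stage is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def adjusted_similarity(content, behavior, must_thr=0.8, cannot_thr=0.1, step=0.5):
    """Content similarity nudged by behavior-derived constraints.

    Hypothetical rule (an assumption, not the paper's): pairs whose behavior
    similarity is at least `must_thr` are pulled toward 1, pairs at or below
    `cannot_thr` are pushed toward 0, and all other pairs are left unchanged.
    """
    n = len(content)
    sim = [[cosine(content[i], content[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            b = cosine(behavior[i], behavior[j])
            if b >= must_thr:        # behavior suggests "same cluster"
                s = sim[i][j] + step * (1.0 - sim[i][j])
            elif b <= cannot_thr:    # behavior suggests "different clusters"
                s = sim[i][j] * (1.0 - step)
            else:
                continue
            sim[i][j] = sim[j][i] = s
    return sim


# Toy data: rows are documents; content columns are term counts,
# behavior columns mark which (hypothetical) users interacted with the document.
content = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1]]
behavior = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]

sim = adjusted_similarity(content, behavior)
for row in sim:
    print([round(x, 3) for x in row])
```

In this toy example, documents 0 and 1 share no terms but were handled by the same users, so their similarity is raised; document 2 shares terms with both but has a disjoint audience, so its similarities are lowered. The adjusted matrix could then be passed to any similarity-based clustering routine.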

CLC number: TP311.1

LI Wanying, HUANG Ruizhang, DING Zhiyuan, CHEN Yanping, XU Liyang. Multi-dimensional text clustering with user behavior characteristics[J]. Journal of Computer Applications, 2018, 38(11): 3127-3131.
