Deep semi-supervised text clustering with intentional regularization

doi:10.11772/j.issn.1001-9081.2024070931

Journal of Computer Applications, 2025, Vol. 45, Issue (7): 2145-2152. DOI: 10.11772/j.issn.1001-9081.2024070931

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024) •

Le XU¹,²,³, Ruizhang HUANG¹,²,³, Ruina BAI¹,²,³, Yongbin QIN¹,²,³

1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang Guizhou 550025, China
2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China

Received: 2024-07-01 Revised: 2024-09-25 Accepted: 2024-10-09 Online: 2025-07-10 Published: 2025-07-10
Contact: Ruizhang HUANG
About author: XU Le, born in 1999, female, from Luzhou, Sichuan, M. S. candidate, CCF member. Her research interests include natural language processing, text mining, and machine learning.
BAI Ruina, born in 1995, female, from Baotou, Inner Mongolia, Ph. D. candidate. Her research interests include natural language processing and multi-view learning.
QIN Yongbin, born in 1980, male, from Yantai, Shandong, Ph. D., professor, CCF senior member. His research interests include big data governance and application, multi-source data fusion, intelligent computing, machine learning, and algorithm design.
Supported by:
National Natural Science Foundation of China (62066007)

Abstract

To address the problem that existing semi-supervised text clustering methods cannot take user intent into account in both representation learning and clustering, a Deep Semi-supervised Text Clustering with Intentional Regularization (IRDSTC) model was proposed. By introducing an intention regularization strategy, an Intention Regularized Representation Learning (IRRL) module and an Intention Regularized Clustering (IRC) module were designed. Firstly, an intent matrix was constructed from the intent constraint information provided by the user, to capture the user's expectations about the relationships between texts. Secondly, the matrix was applied to both the representation learning stage and the clustering stage. In the representation learning stage, the intermediate-layer representations extracted by the deep model were converted into a representation correlation matrix, which was combined with the intent matrix to construct a regular term, so that user intent drives representation learning. In the clustering stage, an assignment consistency matrix was constructed from the cluster assignment probabilities obtained during clustering iterations, and was combined with the intent matrix to construct a regular term, so that user intent guides the clustering process. Experimental results show that the IRDSTC model outperforms the other clustering methods on the Reu-10k, BBC, ACM, and Abstract datasets in terms of clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Specifically, compared with the second-best model, Improved Deep Embedded Clustering (IDEC), the IRDSTC model achieves relative NMI improvements of 28.26%, 32.58%, 27.13%, and 34.94% on the four datasets, respectively, indicating that the IRDSTC model has a better clustering effect.

Key words: intent, regularization, semi-supervision, text clustering
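
As a rough illustration of the intention regularization described in the abstract, the following Python sketch builds an intent matrix from pairwise constraints and scores a pairwise matrix against it. The abstract does not give the formulas, so everything concrete here is an assumption: the function names, the encoding (1 for must-link, -1 for cannot-link, 0 for unconstrained), cosine similarity as the representation correlation, the inner product of soft assignments as the assignment consistency, and the masked squared-error form of the regular term. It is a sketch under those assumptions, not the authors' implementation.

```python
import math


def build_intent_matrix(n, must_link, cannot_link):
    """Assumed encoding: 1 = must-link, -1 = cannot-link, 0 = no constraint."""
    intent = [[0] * n for _ in range(n)]
    for i, j in must_link:
        intent[i][j] = intent[j][i] = 1
    for i, j in cannot_link:
        intent[i][j] = intent[j][i] = -1
    return intent


def representation_correlation(z):
    """Cosine similarity between every pair of embedding rows (assumed choice)."""
    norms = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in z]
    return [
        [sum(a * b for a, b in zip(zi, zj)) / (ni * nj) for zj, nj in zip(z, norms)]
        for zi, ni in zip(z, norms)
    ]


def assignment_consistency(q):
    """Inner product of soft cluster assignments for every pair (assumed choice)."""
    return [[sum(a * b for a, b in zip(qi, qj)) for qj in q] for qi in q]


def intent_regularizer(pairwise, intent):
    """Mean squared gap between a pairwise matrix and the intent targets.

    Only constrained pairs contribute: must-link pairs target 1,
    cannot-link pairs target 0 (both targets are assumptions).
    """
    total, count = 0.0, 0
    for i, row in enumerate(intent):
        for j, m_ij in enumerate(row):
            if m_ij == 0:
                continue
            target = 1.0 if m_ij > 0 else 0.0
            total += (pairwise[i][j] - target) ** 2
            count += 1
    return total / count if count else 0.0


# Toy data: 4 texts, 2-dimensional embeddings, 2 clusters (all hypothetical).
intent = build_intent_matrix(4, must_link=[(0, 1)], cannot_link=[(0, 2)])
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]

loss_ir = intent_regularizer(representation_correlation(z), intent)
loss_ic = intent_regularizer(assignment_consistency(q), intent)
print(loss_ir, loss_ic)
```

In a training loop, terms of this kind would be added to the base clustering objective with weighting coefficients; the weights and the base objective are not stated on this page and are left out of the sketch.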

CLC Number: TP391.1

Le XU, Ruizhang HUANG, Ruina BAI, Yongbin QIN. Deep semi-supervised text clustering with intentional regularization[J]. Journal of Computer Applications, 2025, 45(7): 2145-2152.

Figures/Tables 6

Fig. 1 Framework of IRDSTC model

Tab. 1 Information of individual datasets

Name	Number of classes	Number of texts	Dimensions
Reu-10k	4	10 000	2 000
BBC	5	2 225	9 635
ACM	3	3 025	1 870
Abstract	3	4 306	10 000

Tab. 2 Results of comparison experiments (%)

Dataset	Metric	K-means	AE	DEC	IDEC	Cop-Kmeans	SDEC	CPAC	S⁴NMF	IRDSTC
Reu-10k	ACC	54.04	71.16	72.75	75.03	65.84	72.80	72.79	61.99	81.27
	NMI	41.28	48.41	51.62	51.98	42.53	48.88	48.32	41.26	66.67
	ARI	27.15	54.82	57.97	58.26	50.49	53.91	54.61	28.92	62.87
BBC	ACC	51.58	57.03	60.45	76.40	69.16	64.80	67.01	78.83	94.79
	NMI	30.88	53.88	57.19	64.48	53.65	51.02	51.01	56.41	85.49
	ARI	20.50	41.13	47.15	69.10	49.78	44.60	41.08	53.29	87.93
ACM	ACC	67.37	83.83	84.74	85.13	75.02	85.53	82.09	66.94	92.30
	NMI	33.54	51.83	54.85	56.16	48.71	55.37	56.42	27.83	71.40
	ARI	33.85	57.74	59.92	62.16	51.39	61.43	59.39	28.54	78.34
Abstract	ACC	69.18	80.14	86.32	83.83	77.63	90.78	89.62	82.28	95.91
	NMI	38.30	54.36	59.08	60.98	65.14	68.14	66.65	52.65	82.29
	ARI	27.60	51.52	62.69	62.03	70.03	74.05	66.97	53.87	87.97
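
The NMI improvements quoted in the abstract can be checked against the IDEC and IRDSTC columns of Tab. 2. The short Python sketch below computes the relative gain, (IRDSTC - IDEC) / IDEC, for each dataset. Truncating to two decimals (rather than rounding) is an assumption made here because it reproduces the quoted figures; the page does not say how the authors rounded.

```python
import math

# NMI (%) taken from Tab. 2: (IDEC, IRDSTC) for each dataset.
nmi = {
    "Reu-10k": (51.98, 66.67),
    "BBC": (64.48, 85.49),
    "ACM": (56.16, 71.40),
    "Abstract": (60.98, 82.29),
}


def relative_gain(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


def truncate(value, digits=2):
    """Drop (not round) everything after `digits` decimal places."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


gains = {name: truncate(relative_gain(b, p)) for name, (b, p) in nmi.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

Under these assumptions the four values agree with the 28.26%, 32.58%, 27.13%, and 34.94% stated in the abstract, which indicates that those figures are relative gains rather than differences in percentage points.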

Tab. 3 Results of ablation experiments (%)

Optimization objective	Reu-10k			BBC			ACM			Abstract
	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
$L_{\mathrm{RE}} + L_{\mathrm{KL}}$	72.75	51.62	57.97	60.45	57.19	47.51	84.74	54.85	59.92	86.32	59.08	62.69
$L_{\mathrm{RE}} + L_{\mathrm{KL}} + L_{\mathrm{IR}}$	80.43	66.47	61.59	93.44	84.13	83.85	91.64	69.44	76.59	95.84	82.03	87.77
$L_{\mathrm{RE}} + L_{\mathrm{KL}} + L_{\mathrm{IC}}$	73.46	53.12	59.61	63.55	60.25	50.50	88.40	62.72	68.79	88.32	63.50	68.41
$L_{\mathrm{RE}} + L_{\mathrm{KL}} + L_{\mathrm{IR}} + L_{\mathrm{IC}}$	81.27	66.67	62.87	94.79	85.49	87.93	92.30	71.40	78.34	95.91	82.29	87.97
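
To read the ablation results more easily, the following Python sketch computes the ACC gain (in percentage points) of each optimization objective over the first row of Tab. 3. The dictionary keys are shorthand labels chosen here for the four rows, not notation from the paper.

```python
# ACC (%) from Tab. 3 for each optimization objective, in dataset order.
DATASETS = ("Reu-10k", "BBC", "ACM", "Abstract")
acc = {
    "RE+KL": (72.75, 60.45, 84.74, 86.32),
    "RE+KL+IR": (80.43, 93.44, 91.64, 95.84),
    "RE+KL+IC": (73.46, 63.55, 88.40, 88.32),
    "RE+KL+IR+IC": (81.27, 94.79, 92.30, 95.91),
}


def gains_over_baseline(table, baseline="RE+KL"):
    """Percentage-point gain of every non-baseline row over the baseline row."""
    base = table[baseline]
    return {
        name: tuple(round(v - b, 2) for v, b in zip(values, base))
        for name, values in table.items()
        if name != baseline
    }


acc_gains = gains_over_baseline(acc)
for name, deltas in acc_gains.items():
    print(name, dict(zip(DATASETS, deltas)))
```

On these numbers, adding the IR term alone raises ACC by more than adding the IC term alone on all four datasets, and the full objective gives the highest ACC in every case.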

Fig. 2 Effect of number of constraints on model performance

Fig. 3 Visualization results of ACM dataset

