Deep semi-supervised text clustering with intentional regularization

Journal of Computer Applications, 2025, Vol. 45, Issue (7): 2145-2152. DOI: 10.11772/j.issn.1001-9081.2024070931

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024) •

Le XU¹,²,³, Ruizhang HUANG¹,²,³, Ruina BAI¹,²,³, Yongbin QIN¹,²,³

1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang Guizhou 550025, China
2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China

Received: 2024-07-01 Revised: 2024-09-25 Accepted: 2024-10-09 Online: 2025-07-10 Published: 2025-07-10
Contact: Ruizhang HUANG
About author: XU Le, born in 1999, M.S. candidate. Her research interests include natural language processing, text mining, and machine learning.
BAI Ruina, born in 1995, Ph.D. candidate. Her research interests include natural language processing and multi-view learning.
QIN Yongbin, born in 1980, Ph.D., professor. His research interests include big data governance and application, multi-source data fusion, intelligent computing, machine learning, and algorithm design.
Supported by: National Natural Science Foundation of China (62066007)

Author details (translated from the Chinese record): XU Le (born 1999), female, from Luzhou, Sichuan, M.S. candidate, CCF member. BAI Ruina (born 1995), female, from Baotou, Inner Mongolia, Ph.D. candidate. QIN Yongbin (born 1980), male, from Yantai, Shandong, professor, Ph.D., CCF senior member.

Abstract:

To address the problem that existing semi-supervised text clustering methods fail to take user intent into account in both representation learning and clustering, a Deep Semi-supervised Text Clustering with Intentional Regularization (IRDSTC) model was proposed. Based on an intention regularization strategy, an Intention Regularized Representation Learning (IRRL) module and an Intention Regularized Clustering (IRC) module were designed. Firstly, an intent matrix was constructed from the intent constraint information provided by the user to capture the user's expectations about the relationships between texts. Secondly, the matrix was applied to both the representation learning stage and the clustering stage. In the representation learning stage, the intermediate-layer representation extracted by the deep model was converted into a representation correlation matrix, which was combined with the intent matrix to construct a regularization term, so that user intent drives representation learning. In the clustering stage, an allocation consistency matrix was constructed from the cluster assignment probabilities obtained during clustering iterations and combined with the intent matrix to construct a regularization term, so that user intent guides the clustering process. Experimental results show that on the Reu-10k, BBC, ACM, and Abstract datasets, the IRDSTC model outperforms the other clustering methods in clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Specifically, compared with Improved Deep Embedding Clustering (IDEC), the IRDSTC model increases NMI by 28.26%, 32.58%, 27.13%, and 34.94% in relative terms on the four datasets, respectively, indicating that the IRDSTC model achieves a better clustering effect.

Key words: intent, regularization, semi-supervision, text clustering
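
The abstract does not give the formulas for the two regularization terms, so the following is only a minimal sketch of how the described pipeline could look. Everything in it is an assumption made for illustration: encoding the user's constraints as must-link (+1) and cannot-link (-1) entries of the intent matrix, using cosine similarity as the representation correlation matrix, using the inner product of soft assignments as the allocation consistency matrix, and penalizing disagreement with a squared error. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def intent_matrix(n, must_link, cannot_link):
    """Symmetric n x n matrix: +1 for must-link pairs, -1 for cannot-link pairs, 0 elsewhere (assumed encoding)."""
    m = np.zeros((n, n))
    for i, j in must_link:
        m[i, j] = m[j, i] = 1.0
    for i, j in cannot_link:
        m[i, j] = m[j, i] = -1.0
    return m

def cosine_correlation(z):
    """Pairwise cosine similarity between the rows of z (assumed form of the correlation matrix)."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    zn = z / np.clip(norms, 1e-12, None)
    return zn @ zn.T

def intent_penalty(similarity, intent):
    """Mean squared gap between similarity and its target (1 for must-link, 0 for cannot-link).

    Only constrained pairs contribute; unconstrained pairs are ignored.
    """
    mask = intent != 0
    if not mask.any():
        return 0.0
    target = (intent > 0).astype(float)
    return float(np.mean((similarity[mask] - target[mask]) ** 2))

# Toy data: three texts, 2-D intermediate representations, two clusters.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # intermediate-layer representations
q = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])  # soft cluster assignment probabilities
m = intent_matrix(3, must_link=[(0, 1)], cannot_link=[(0, 2)])

l_ir = intent_penalty(cosine_correlation(z), m)  # representation-stage term
l_ic = intent_penalty(q @ q.T, m)                # clustering-stage term
print(l_ir, l_ic)
```

In a sketch like this, both terms would be added to the base clustering loss, so that representations and assignments that contradict the user's constraints are penalized during training.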

CLC Number: TP391.1

Le XU, Ruizhang HUANG, Ruina BAI, Yongbin QIN. Deep semi-supervised text clustering with intentional regularization[J]. Journal of Computer Applications, 2025, 45(7): 2145-2152.

Figures/Tables 6

Fig. 1 Framework of IRDSTC model

Tab. 1 Information of individual datasets

Name	Number of classes	Number of texts	Dimensions
Reu-10k	4	10 000	2 000
BBC	5	2 225	9 635
ACM	3	3 025	1 870
Abstract	3	4 306	10 000

Tab. 2 Results of comparison experiments

Dataset	Metric	K-means	AE	DEC	IDEC	Cop-Kmeans	SDEC	CPAC	S⁴NMF	IRDSTC
Reu-10k	ACC	54.04	71.16	72.75	75.03	65.84	72.80	72.79	61.99	81.27
	NMI	41.28	48.41	51.62	51.98	42.53	48.88	48.32	41.26	66.67
	ARI	27.15	54.82	57.97	58.26	50.49	53.91	54.61	28.92	62.87
BBC	ACC	51.58	57.03	60.45	76.40	69.16	64.80	67.01	78.83	94.79
	NMI	30.88	53.88	57.19	64.48	53.65	51.02	51.01	56.41	85.49
	ARI	20.50	41.13	47.15	69.10	49.78	44.60	41.08	53.29	87.93
ACM	ACC	67.37	83.83	84.74	85.13	75.02	85.53	82.09	66.94	92.30
	NMI	33.54	51.83	54.85	56.16	48.71	55.37	56.42	27.83	71.40
	ARI	33.85	57.74	59.92	62.16	51.39	61.43	59.39	28.54	78.34
Abstract	ACC	69.18	80.14	86.32	83.83	77.63	90.78	89.62	82.28	95.91
	NMI	38.30	54.36	59.08	60.98	65.14	68.14	66.65	52.65	82.29
	ARI	27.60	51.52	62.69	62.03	70.03	74.05	66.97	53.87	87.97
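
The NMI gains over IDEC quoted in the abstract can be checked against Tab. 2 if they are read as relative improvements rather than percentage-point differences. The short sketch below recomputes them from the rounded table entries; the helper name `relative_gain` is illustrative. The results agree with the abstract's figures to within about 0.01, a gap consistent with rounding of the published values.

```python
# NMI values (%) copied from Tab. 2 as (IDEC, IRDSTC) pairs per dataset.
nmi = {
    "Reu-10k": (51.98, 66.67),
    "BBC": (64.48, 85.49),
    "ACM": (56.16, 71.40),
    "Abstract": (60.98, 82.29),
}

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

gains = {name: relative_gain(b, i) for name, (b, i) in nmi.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```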

Tab. 3 Results of ablation experiments

Optimization objective	Reu-10k			BBC			ACM			Abstract
	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
$L_{RE} + L_{KL}$	72.75	51.62	57.97	60.45	57.19	47.51	84.74	54.85	59.92	86.32	59.08	62.69
$L_{RE} + L_{KL} + L_{IR}$	80.43	66.47	61.59	93.44	84.13	83.85	91.64	69.44	76.59	95.84	82.03	87.77
$L_{RE} + L_{KL} + L_{IC}$	73.46	53.12	59.61	63.55	60.25	50.50	88.40	62.72	68.79	88.32	63.50	68.41
$L_{RE} + L_{KL} + L_{IR} + L_{IC}$	81.27	66.67	62.87	94.79	85.49	87.93	92.30	71.40	78.34	95.91	82.29	87.97
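
Read from the row labels of Tab. 3, the full training objective combines four terms. The form below is a sketch under assumptions: the interpretation of each term (reconstruction loss, KL-divergence clustering loss, and the regularization terms of the IRRL and IRC modules) is inferred from the abstract, and any trade-off coefficients the paper may place on the terms are not shown on this page, so none are written here.

```latex
% Full objective as suggested by the last row of Tab. 3 (term meanings are assumed).
L = L_{RE} + L_{KL} + L_{IR} + L_{IC}
% L_{RE}: reconstruction loss of the deep model (assumed)
% L_{KL}: KL-divergence clustering loss (assumed)
% L_{IR}: intention regularization term of the IRRL module (assumed)
% L_{IC}: intention regularization term of the IRC module (assumed)
```

Under this reading, the first row of the table is the baseline without intent regularization, and the second and third rows each add one regularization term on its own.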

Fig. 2 Effect of number of constraints on model performance

Fig. 3 Visualization results of ACM dataset

Related Articles 15

[1]	Ao SHEN, Ruizhang HUANG, Jingjing XUE, Yanping CHEN, Yongbin QIN. Deep variational text clustering model based on distribution augmentation [J]. Journal of Computer Applications, 2025, 45(8): 2457-2463.
[2]	Binhong XIE, Yingkun LA, Yingjun ZHANG, Rui ZHANG. Semi-supervised object detection framework guided by self-paced learning [J]. Journal of Computer Applications, 2025, 45(8): 2546-2554.
[3]	Danyang CHEN, Changlun ZHANG. Multi-scale decorrelation graph convolutional network model [J]. Journal of Computer Applications, 2025, 45(7): 2180-2187.
[4]	Shujian GUO, Jieyue YU, Xuesong YIN. Graph regularized elastic net subspace clustering [J]. Journal of Computer Applications, 2025, 45(5): 1464-1471.
[5]	Shun YANG, Xiaoyong BIAN, Xi CHEN. Non-iterative graph capsule network for remote sensing scene classification [J]. Journal of Computer Applications, 2025, 45(1): 247-252.
[6]	Chenqian LI, Jun LIU. Ultrasound carotid plaque segmentation method based on semi-supervision and multi-scale cascaded attention [J]. Journal of Computer Applications, 2024, 44(8): 2604-2610.
[7]	Xiaoxia JIANG, Ruizhang HUANG, Ruina BAI, Lina REN, Yanping CHEN. Deep event clustering method based on event representation and contrastive learning [J]. Journal of Computer Applications, 2024, 44(6): 1734-1742.
[8]	Aiguo SHANG, Xinjuan ZHU. Joint approach of intent detection and slot filling based on multi-task learning [J]. Journal of Computer Applications, 2024, 44(3): 690-695.
[9]	Shuaihua ZHANG, Shufen ZHANG, Mingchuan ZHOU, Chao XU, Xuebin CHEN. Malicious traffic detection model based on semi-supervised federated learning [J]. Journal of Computer Applications, 2024, 44(11): 3487-3494.
[10]	Jie WU, Xuezhong QIAN, Wei SONG. Personalized federated learning based on similarity clustering and regularization [J]. Journal of Computer Applications, 2024, 44(11): 3345-3353.
[11]	Shengwei MA, Ruizhang HUANG, Lina REN, Chuan LIN. Structured deep text clustering model based on multi-layer semantic fusion [J]. Journal of Computer Applications, 2023, 43(8): 2364-2369.
[12]	Mengjie LAN, Jianping CAI, Lan SUN. Self-regularization optimization methods for Non-IID data in federated learning [J]. Journal of Computer Applications, 2023, 43(7): 2073-2081.
[13]	Wenbo LI, Bo LIU, Lingling TAO, Fen LUO, Hang ZHANG. Deep spectral clustering algorithm with L1 regularization [J]. Journal of Computer Applications, 2023, 43(12): 3662-3667.
[14]	Kaiqiang YUE, Bo LI, Panlong FAN. Air combat maneuver decision method based on three-way decision [J]. Journal of Computer Applications, 2022, 42(2): 616-621.
[15]	Lili FAN, Guifu LU, Ganyi TANG, Dan YANG. Low-rank representation subspace clustering method based on Hessian regularization and non-negative constraint [J]. Journal of Computer Applications, 2022, 42(1): 115-122.