Deep variational text clustering model based on distribution augmentation

doi:10.11772/j.issn.1001-9081.2024081100

Journal of Computer Applications, 2025, Vol. 45, Issue (8): 2457-2463. DOI: 10.11772/j.issn.1001-9081.2024081100

The 21st CCF Conference on Web Information Systems and Applications (WISA 2024)

Ao SHEN¹,²,³, Ruizhang HUANG¹,²,³, Jingjing XUE¹,²,³, Yanping CHEN¹,²,³, Yongbin QIN¹,²,³

1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence, Guizhou University, Guiyang Guizhou 550025, China
2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China

Received: 2024-08-06  Revised: 2024-08-25  Accepted: 2024-09-02  Online: 2025-08-15  Published: 2025-08-10
Contact: Ruizhang HUANG
About the authors:
SHEN Ao, born in 2000, native of Jining, Shandong, M.S. candidate, CCF member. His research interests include natural language processing and text mining.
XUE Jingjing, born in 1995, native of Rizhao, Shandong, Ph.D., CCF member. Her research interests include natural language processing and text mining.
CHEN Yanping, born in 1980, native of Changshun, Guizhou, Ph.D., professor, CCF member. His research interests include artificial intelligence and natural language processing.
QIN Yongbin, born in 1980, native of Yantai, Shandong, Ph.D., professor, CCF senior member. His research interests include big data management and application, and multi-source data fusion.
Supported by:
National Natural Science Foundation of China (62066007); Guizhou Province Science and Technology Support Program (Qiankehe Support [2023] General 300)

Abstract

To address the missing distribution information and distribution collapse that deep variational text clustering models encounter in practical applications, a Deep Variational text Clustering Model based on Distribution augmentation (DVCMD) is proposed. The model augments distribution information by integrating enhanced latent semantic distributions into the original latent semantic distribution, which improves the completeness and accuracy of the latent distribution. A distribution consistency constraint additionally encourages the model to learn consistent semantic representations, so that the learned semantic distributions express the true information of the data more faithfully and clustering performance improves. Experimental results show that, compared with existing deep clustering models and structural semantic-enhanced clustering models, DVCMD improves Normalized Mutual Information (NMI) by at least 0.16, 9.01, 2.30, and 2.72 percentage points on four real-world datasets (Abstract, BBC, Reuters-10k, and BBCSports, respectively), validating the effectiveness of the model.
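
For readers unfamiliar with the NMI metric quoted above, the following is a minimal sketch of how it can be computed from two label assignments. It assumes the arithmetic-mean normalization, 2·I(U;V) / (H(U) + H(V)); the abstract does not state which normalization the authors used, and the function name `normalized_mutual_info` is purely illustrative.

```python
import math
from collections import Counter


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization: 2*I(U;V) / (H(U) + H(V)).

    The choice of normalization is an assumption; other variants
    (geometric mean, max, min) exist and give different values.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    # Mutual information from the joint and marginal label counts.
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[u] * count_pred[v]))
        for (u, v), c in count_joint.items()
    )

    denom = h_true + h_pred
    # Both partitions are a single cluster: treat them as identical.
    return 1.0 if denom == 0 else 2.0 * mutual_info / denom
```

NMI is invariant to how clusters are numbered, so a prediction that swaps the cluster IDs of a perfect clustering still scores 1, while a prediction statistically independent of the ground truth scores 0. Multiplying by 100 gives values on the scale used in the results table.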

Key words: deep text clustering, distribution augmentation, Variational Auto-Encoder (VAE), semantic representation, distribution consistency constraint
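
The abstract and keywords name a distribution consistency constraint but do not give its form. As a hedged illustration only, one common way to penalize disagreement between two latent distributions in a VAE is a symmetric KL divergence between diagonal Gaussians. The sketch below assumes that form; the function names and the choice of symmetric KL are assumptions, not the authors' formulation.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) for two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Per-dimension closed form:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def symmetric_consistency(mu_a, logvar_a, mu_b, logvar_b):
    """Hypothetical consistency penalty: mean of the two KL directions.

    Assumption: 'a' plays the role of the original latent distribution and
    'b' the enhanced one; the paper may define the constraint differently.
    """
    forward = kl_diag_gaussians(mu_a, logvar_a, mu_b, logvar_b)
    backward = kl_diag_gaussians(mu_b, logvar_b, mu_a, logvar_a)
    return 0.5 * (forward + backward)
```

The penalty is zero when the two distributions coincide and grows as their means or variances drift apart, which is the qualitative behavior a consistency constraint needs.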

CLC Number: TP391.1

Ao SHEN, Ruizhang HUANG, Jingjing XUE, Yanping CHEN, Yongbin QIN. Deep variational text clustering model based on distribution augmentation[J]. Journal of Computer Applications, 2025, 45(8): 2457-2463.

Dataset	Samples	Input dimension	Clusters
Abstract	4,306	10,000	3
BBC	2,225	10,000	5
Reuters-10k	10,000	2,000	4
BBCSports	737	4,613	5

Clustering performance (ACC, NMI, ARI; values in %)
Clustering type	Model	Abstract			BBC			Reuters-10k			BBCSports
		ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
Traditional clustering	K-means	69.18	38.26	27.69	51.58	30.88	20.50	54.04	41.54	27.95	55.24	32.44	25.91
Traditional clustering	LDA	80.19	29.47	36.44	45.66	23.90	18.25	55.46	25.27	26.07	59.63	44.67	32.42
Deep clustering	AE	75.56	45.26	39.95	53.60	39.93	19.90	74.90	49.69	49.55	67.16	49.13	29.76
	IDEC	88.63	63.89	68.68	83.60	66.56	61.07	73.40	48.51	54.11	73.41	60.52	47.13
	DCEC	92.80	73.33	79.42	88.91	75.03	74.78	72.28	51.22	55.60	62.42	64.39	48.08
	VAE	81.14	52.82	51.32	65.14	54.06	40.67	61.33	33.20	19.79	63.74	30.12	28.57
	VaDE	61.79	23.46	27.64	29.65	7.30	6.06	40.22	30.57	—	—	—	—
Structural semantic-enhanced clustering	VGAE	75.43	61.26	67.49	84.13	53.20	57.72	60.85	25.51	26.17	60.11	54.48	28.36
	SDCN	93.03	72.90	79.11	77.55	65.28	62.58	77.15	50.82	55.36	78.43	68.29	55.46
	SDCMS	91.08	70.85	74.30	76.18	64.56	56.51	76.61	51.79	51.05	67.44	59.35	55.75
	SSVAE	86.11	57.64	62.25	80.31	67.54	59.40	78.65	51.17	52.89	70.14	56.05	39.11
	DSEDC	88.57	63.71	68.32	76.73	61.25	51.06	73.15	53.19	58.02	65.81	58.79	48.52
	SEDCN	93.73	75.87	81.72	76.29	73.60	67.55	73.76	55.33	61.26	71.64	65.27	58.43
	DVCMD	93.84	76.03	82.27	94.61	84.04	87.45	74.31	57.63	55.09	81.41	71.01	66.58
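
The "at least 0.16, 9.01, 2.30, and 2.72 percentage points" figures in the abstract can be reproduced from the NMI columns of the table above. The sketch below copies those NMI values and subtracts the strongest baseline from DVCMD on each dataset. The function name `nmi_margins` is illustrative, and the missing VaDE entry on BBCSports is represented as `None`.

```python
DATASETS = ["Abstract", "BBC", "Reuters-10k", "BBCSports"]

# NMI (%) per baseline, copied from the results table; None marks a missing entry.
BASELINE_NMI = {
    "K-means": [38.26, 30.88, 41.54, 32.44],
    "LDA": [29.47, 23.90, 25.27, 44.67],
    "AE": [45.26, 39.93, 49.69, 49.13],
    "IDEC": [63.89, 66.56, 48.51, 60.52],
    "DCEC": [73.33, 75.03, 51.22, 64.39],
    "VAE": [52.82, 54.06, 33.20, 30.12],
    "VaDE": [23.46, 7.30, 30.57, None],
    "VGAE": [61.26, 53.20, 25.51, 54.48],
    "SDCN": [72.90, 65.28, 50.82, 68.29],
    "SDCMS": [70.85, 64.56, 51.79, 59.35],
    "SSVAE": [57.64, 67.54, 51.17, 56.05],
    "DSEDC": [63.71, 61.25, 53.19, 58.79],
    "SEDCN": [75.87, 73.60, 55.33, 65.27],
}
DVCMD_NMI = [76.03, 84.04, 57.63, 71.01]


def nmi_margins(baselines, proposed):
    """Margin (percentage points) of the proposed model over the best baseline."""
    margins = {}
    for i, dataset in enumerate(DATASETS):
        # Skip baselines with no reported value for this dataset.
        best = max(v[i] for v in baselines.values() if v[i] is not None)
        margins[dataset] = round(proposed[i] - best, 2)
    return margins


print(nmi_margins(BASELINE_NMI, DVCMD_NMI))
```

The strongest NMI baseline differs by dataset: SEDCN on Abstract and Reuters-10k, DCEC on BBC, and SDCN on BBCSports.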

Model	Abstract			BBC			Reuters-10k			BBCSports
	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
DVCMD-f	62.22	33.02	31.09	72.07	44.36	43.69	54.20	26.01	22.67	58.87	34.05	26.50
DVCMD-r	78.14	45.75	44.64	72.80	51.20	43.87	61.71	43.28	38.19	64.59	50.91	39.45
DVCMD-c	87.48	61.25	65.82	81.78	63.45	61.20	72.40	54.09	51.45	74.93	58.40	47.40
