Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (8): 2364-2369. DOI: 10.11772/j.issn.1001-9081.2022091356
• The 19th International Conference on Web Information Systems and Applications (WISA 2022) •
Structured deep text clustering model based on multi-layer semantic fusion
Shengwei MA1,2, Ruizhang HUANG1,2, Lina REN1,2, Chuan LIN1,2
Received: 2022-09-12
Revised: 2022-10-13
Accepted: 2022-10-17
Online: 2022-12-26
Published: 2023-08-10
Contact: Ruizhang HUANG
About author: MA Shengwei, born in 1999 in Ziyun, Guizhou, M. S. candidate, CCF member. Her research interests include natural language processing and deep clustering.
Shengwei MA, Ruizhang HUANG, Lina REN, Chuan LIN. Structured deep text clustering model based on multi-layer semantic fusion[J]. Journal of Computer Applications, 2023, 43(8): 2364-2369.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022091356
| Dataset | Samples | Dimensions | Classes | Dataset | Samples | Dimensions | Classes |
|---|---|---|---|---|---|---|---|
| Citeseer | 3 327 | 3 073 | 6 | Reuters | 10 000 | 2 000 | 4 |
| ACM | 3 025 | 1 870 | 3 | Abstract | 4 306 | 10 000 | 3 |
| DBLP | 4 058 | 334 | 4 | | | | |
Tab. 1 Dataset details
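For orientation, the following is a minimal sketch of how a flat-feature baseline such as the k-means row in Tab. 2 could be run on data with the shapes listed in Tab. 1 (here Citeseer: 3 327 samples, 3 073 features, 6 classes). It assumes scikit-learn and uses random placeholder features; it is not the authors' pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative shapes from Tab. 1 (Citeseer): 3 327 samples, 3 073 features, 6 classes.
# X is a random placeholder; in practice it would hold the bag-of-words / TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((3327, 3073))

# Plain k-means with as many clusters as gold classes, as in the k-means baseline of Tab. 2.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print(labels.shape)  # (3327,)
```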
| Model | Citeseer (ACC / NMI / ARI) | ACM (ACC / NMI / ARI) | Reuters (ACC / NMI / ARI) | DBLP (ACC / NMI / ARI) | Abstract (ACC / NMI / ARI) |
|---|---|---|---|---|---|
| k-means | 0.4313 / 0.2073 / 0.1554 | 0.6731 / 0.3276 / 0.3080 | 0.5402 / 0.4239 / 0.2858 | 0.3843 / 0.1119 / 0.0674 | 0.6918 / 0.3826 / 0.2769 |
| AE | 0.5708 / 0.2764 / 0.2931 | 0.8096 / 0.4657 / 0.5180 | 0.7134 / 0.4985 / 0.5571 | 0.5143 / 0.2540 / 0.1221 | 0.8521 / 0.5805 / 0.5989 |
| DEC | 0.5233 / 0.2821 / 0.2421 | 0.8473 / 0.5918 / 0.6098 | 0.7312 / 0.5026 / 0.5486 | 0.5816 / 0.2951 / 0.2392 | 0.8687 / 0.6036 / 0.6412 |
| GAE | 0.6135 / 0.3463 / 0.3355 | 0.8452 / 0.5538 / 0.5946 | 0.5440 / 0.2592 / 0.1961 | 0.6121 / 0.3080 / 0.2202 | 0.8737 / 0.5900 / 0.6532 |
| SDCN | 0.6280 / 0.3489 / 0.3582 | 0.8932 / 0.6491 / 0.7103 | 0.7589 / 0.4767 / 0.5159 | 0.6695 / 0.3167 / 0.3343 | 0.9303 / 0.7290 / 0.7911 |
| AGCN | 0.6270 / 0.3608 / 0.3697 | 0.9038 / 0.6823 / 0.7369 | 0.7679 / 0.5142 / 0.5380 | 0.6672 / 0.3324 / 0.3466 | 0.9343 / 0.7465 / 0.8109 |
| SDCMS | 0.6637 / 0.3966 / 0.3998 | 0.9183 / 0.7228 / 0.7745 | 0.7851 / 0.5284 / 0.5612 | 0.6697 / 0.3486 / 0.3199 | 0.9438 / 0.7814 / 0.8362 |
Tab. 2 Comparison of clustering results of different models
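The ACC, NMI and ARI values in Tab. 2 are standard clustering metrics. As a minimal sketch (using scikit-learn and SciPy, not the authors' evaluation code), they can be computed from gold labels and predicted cluster assignments as follows; ACC uses the Hungarian algorithm to find the best one-to-one mapping between clusters and classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correct under the best cluster-to-class mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                                     # class/cluster co-occurrence counts
    rows, cols = linear_sum_assignment(cost.max() - cost)   # Hungarian matching (maximize matches)
    return cost[rows, cols].sum() / y_true.size

# Toy example: gold labels vs. cluster assignments from any model in Tab. 2.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0
print(adjusted_rand_score(y_true, y_pred))            # 1.0
```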
| Dataset | SDCMS | SDCMS-d | Dataset | SDCMS | SDCMS-d |
|---|---|---|---|---|---|
| Citeseer | 0.6637 | 0.6415 | Reuters | 0.7851 | 0.7245 |
| ACM | 0.9183 | 0.9074 | DBLP | 0.6697 | 0.6389 |
Tab. 3 Clustering accuracy of different structures
| Dataset | GCN layers | ACC | NMI | ARI |
|---|---|---|---|---|
| Citeseer | 0 | 0.6180 | 0.3489 | 0.3582 |
| Citeseer | 1 | 0.6279 | 0.3634 | 0.3648 |
| Citeseer | 4 | 0.6637 | 0.3966 | 0.3998 |
| ACM | 0 | 0.8932 | 0.6491 | 0.7103 |
| ACM | 1 | 0.9071 | 0.7085 | 0.7486 |
| ACM | 4 | 0.9174 | 0.7182 | 0.7715 |
| Reuters | 0 | 0.7589 | 0.4767 | 0.5159 |
| Reuters | 1 | 0.7546 | 0.5000 | 0.4887 |
| Reuters | 4 | 0.7851 | 0.5284 | 0.5612 |
Tab. 4 Clustering accuracy of different GCN layer numbers
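Tab. 4 varies the number of GCN layers stacked on the learned representation. The snippet below is a minimal PyTorch sketch of such a stacked GCN encoder with configurable depth; the class names, dimensions and final softmax assignment are illustrative assumptions, not the SDCMS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution: H' = act(A_norm @ H @ W), A_norm being the normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h, a_norm, activate=True):
        h = a_norm @ (h @ self.weight)
        return F.relu(h) if activate else h

class StackedGCN(nn.Module):
    """Stack of GCN layers; the depth corresponds to the layer counts compared in Tab. 4."""
    def __init__(self, dims):  # e.g. dims = [input_dim, 500, 500, 2000, n_clusters] (illustrative)
        super().__init__()
        self.layers = nn.ModuleList([GCNLayer(d_in, d_out)
                                     for d_in, d_out in zip(dims[:-1], dims[1:])])

    def forward(self, x, a_norm):
        h = x
        for i, layer in enumerate(self.layers):
            h = layer(h, a_norm, activate=(i < len(self.layers) - 1))
        return F.softmax(h, dim=1)  # soft cluster assignment per node/document
```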