Structured deep text clustering model based on multi-layer semantic fusion

Journal of Computer Applications, 2023, Vol. 43, Issue (8): 2364-2369. DOI: 10.11772/j.issn.1001-9081.2022091356

• The 19th CCF China Conference on Information Systems and Applications

Shengwei MA¹,², Ruizhang HUANG¹,², Lina REN¹,², Chuan LIN¹,²

1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China
2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China

Received: 2022-09-12 Revised: 2022-10-13 Accepted: 2022-10-17 Online: 2022-12-26 Published: 2023-08-10
Contact: Ruizhang HUANG
About the authors:
MA Shengwei, born in 1999 in Ziyun, Guizhou, M.S. candidate, CCF member. Her research interests include natural language processing and deep clustering.
REN Lina, born in 1987 in Fuxin, Liaoning, lecturer, Ph.D. candidate, CCF member. Her research interests include natural language processing, text mining, and machine learning.
LIN Chuan, born in 1975 in Zigong, Sichuan, associate professor, M.S. His research interests include text mining, machine learning, and big data management and applications.
Supported by:
National Natural Science Foundation of China (62066007)

Abstract:

In recent years, owing to the advantages of the structural information captured by Graph Neural Networks (GNNs) in machine learning, researchers have begun to incorporate GNNs into deep text clustering. However, current GNN-based deep text clustering algorithms overlook the important role of the decoder in semantic complementation when fusing text semantic information, so semantic information is lost in the data generation part. To address this problem, a Structured Deep text Clustering Model based on multi-layer Semantic fusion (SDCMS) is proposed. The model uses a GNN to integrate structural information into the decoder, enhances the representation of text data through layer-by-layer semantic complementation, and obtains better network parameters through a triple self-supervision mechanism. Experiments on five real datasets (Citeseer, Acm, Reuters, Dblp, and Abstract) show that, compared with the current best-performing Attention-driven Graph Clustering Network (AGCN) model, SDCMS improves accuracy, Normalized Mutual Information (NMI), and Average Rand Index (ARI) by up to 5.853%, 9.922%, and 8.142%, respectively.

Key words: deep text clustering, layer-by-layer semantic enhancement, text semantic information, graph neural network, self-supervised learning
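The abstract describes feeding structural information from a GNN into the decoder layer by layer, but this page does not give the fusion formula. The following is only a minimal sketch of one plausible fusion step: it assumes a weighted sum between the GCN hidden state and the decoder hidden state (weight `eps`), followed by a standard GCN propagation with a symmetrically normalized adjacency matrix. The function names, the `eps` parameter, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def fused_gcn_layer(a_norm, gcn_hidden, decoder_hidden, weight, eps=0.5):
    """One GCN layer whose input mixes GCN and decoder features.

    ASSUMPTION: the mixing rule (1 - eps) * gcn + eps * decoder is an
    illustrative choice, not the rule stated by the authors.
    """
    mixed = [[(1.0 - eps) * g + eps * h for g, h in zip(g_row, h_row)]
             for g_row, h_row in zip(gcn_hidden, decoder_hidden)]
    return relu(matmul(matmul(a_norm, mixed), weight))


if __name__ == "__main__":
    # Toy 3-document graph: document 0 is linked to documents 1 and 2.
    adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
    a_norm = normalize_adjacency(adj)
    gcn_hidden = [[0.2, 0.1], [0.4, 0.3], [0.0, 0.5]]
    decoder_hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    for row in fused_gcn_layer(a_norm, gcn_hidden, decoder_hidden, identity):
        print(row)
```

Stacking one such step per decoder layer would give a layer-by-layer fusion; how the actual model weights and trains these layers is not specified here.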

CLC Number: TP391.1

Shengwei MA, Ruizhang HUANG, Lina REN, Chuan LIN. Structured deep text clustering model based on multi-layer semantic fusion[J]. Journal of Computer Applications, 2023, 43(8): 2364-2369.


Dataset	Samples	Dimension	Classes
Citeseer	3327	3073	6
Acm	3025	1870	3
Reuters	10000	2000	4
Dblp	4058	334	4
Abstract	4306	10000	3


Model	Citeseer			Acm			Reuters			Dblp			Abstract
	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI	ACC	NMI	ARI
k-means	0.4313	0.2073	0.1554	0.6731	0.3276	0.3080	0.5402	0.4239	0.2858	0.3843	0.1119	0.0674	0.6918	0.3826	0.2769
AE	0.5708	0.2764	0.2931	0.8096	0.4657	0.5180	0.7134	0.4985	0.5571	0.5143	0.2540	0.1221	0.8521	0.5805	0.5989
DEC	0.5233	0.2821	0.2421	0.8473	0.5918	0.6098	0.7312	0.5026	0.5486	0.5816	0.2951	0.2392	0.8687	0.6036	0.6412
GAE	0.6135	0.3463	0.3355	0.8452	0.5538	0.5946	0.5440	0.2592	0.1961	0.6121	0.3080	0.2202	0.8737	0.5900	0.6532
SDCN	0.6280	0.3489	0.3582	0.8932	0.6491	0.7103	0.7589	0.4767	0.5159	0.6695	0.3167	0.3343	0.9303	0.7290	0.7911
AGCN	0.6270	0.3608	0.3697	0.9038	0.6823	0.7369	0.7679	0.5142	0.5380	0.6672	0.3324	0.3466	0.9343	0.7465	0.8109
SDCMS	0.6637	0.3966	0.3998	0.9183	0.7228	0.7745	0.7851	0.5284	0.5612	0.6697	0.3486	0.3199	0.9438	0.7814	0.8362
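The "up to" percentages quoted in the abstract can be read as relative improvements of SDCMS over AGCN. The sketch below recomputes them from the AGCN and SDCMS rows of the table above; treating the gain as (new - old) / old is an assumption about how the authors computed it, and the names `SCORES`, `relative_gain_percent`, and `best_gains` are illustrative only.

```python
# Scores copied from the AGCN and SDCMS rows of the results table:
# {dataset: {model: (ACC, NMI, ARI)}}.
SCORES = {
    "Citeseer": {"AGCN": (0.6270, 0.3608, 0.3697),
                 "SDCMS": (0.6637, 0.3966, 0.3998)},
    "Acm": {"AGCN": (0.9038, 0.6823, 0.7369),
            "SDCMS": (0.9183, 0.7228, 0.7745)},
    "Reuters": {"AGCN": (0.7679, 0.5142, 0.5380),
                "SDCMS": (0.7851, 0.5284, 0.5612)},
    "Dblp": {"AGCN": (0.6672, 0.3324, 0.3466),
             "SDCMS": (0.6697, 0.3486, 0.3199)},
    "Abstract": {"AGCN": (0.9343, 0.7465, 0.8109),
                 "SDCMS": (0.9438, 0.7814, 0.8362)},
}
METRICS = ("ACC", "NMI", "ARI")


def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent (assumed definition)."""
    return (new - old) / old * 100.0


def best_gains(scores):
    """For each metric, return (dataset, gain %) where SDCMS gains most over AGCN."""
    best = {}
    for i, metric in enumerate(METRICS):
        gains = {
            dataset: relative_gain_percent(models["SDCMS"][i], models["AGCN"][i])
            for dataset, models in scores.items()
        }
        top = max(gains, key=gains.get)
        best[metric] = (top, round(gains[top], 3))
    return best


if __name__ == "__main__":
    for metric, (dataset, gain) in best_gains(SCORES).items():
        print(f"{metric}: +{gain}% on {dataset}")
```

Under this reading, all three maxima fall on Citeseer (5.853% for ACC, 9.922% for NMI, 8.142% for ARI), which agrees with the figures in the abstract. The same table also shows that SDCMS trails AGCN on Dblp ARI, so the gains are not uniform across datasets.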


Dataset	SDCMS	SDCMS-d
Citeseer	0.6637	0.6415
Acm	0.9183	0.9074
Reuters	0.7851	0.7245
Dblp	0.6697	0.6389


Dataset	Layers	ACC	NMI	ARI
Citeseer	0	0.6180	0.3489	0.3582
Citeseer	1	0.6279	0.3634	0.3648
Citeseer	4	0.6637	0.3966	0.3998
Acm	0	0.8932	0.6491	0.7103
Acm	1	0.9071	0.7085	0.7486
Acm	4	0.9174	0.7182	0.7715
Reuters	0	0.7589	0.4767	0.5159
Reuters	1	0.7546	0.5000	0.4887
Reuters	4	0.7851	0.5284	0.5612
