DDDC: Deep Dynamic Document Clustering Model

doi:10.11772/j.issn.1001-9081.2022091354

Journal of Computer Applications, 2023, Vol. 43, Issue (8): 2370-2375.

• The 19th CCF China Conference on Information Systems and Applications •

Hui LU¹,², Ruizhang HUANG¹,², Jingjing XUE¹,², Lina REN¹,², Chuan LIN¹,²

1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
2. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China

Received: 2022-09-06 Revised: 2022-10-26 Accepted: 2022-11-01 Online: 2022-12-12 Published: 2023-08-10
Contact: Ruizhang HUANG
About the authors:
LU Hui, born in 1998 in Anshun, Guizhou, M. S. candidate, CCF member. His research interests include dynamic clustering and topic mining.
XUE Jingjing, born in 1995 in Rizhao, Shandong, Ph. D. candidate, CCF member. Her research interests include deep document clustering.
REN Lina, born in 1987 in Fuxin, Liaoning, Ph. D. candidate, lecturer, CCF member. Her research interests include natural language processing, document mining and machine learning.
LIN Chuan, born in 1975 in Zigong, Sichuan, M. S., associate professor. His research interests include document mining, machine learning, and big data management and applications.
Supported by:
National Natural Science Foundation of China (62066007)

Abstract

The rapid development of the Internet has led to explosive growth in news data. How to capture the topic evolution of currently popular events from massive news data has become a hot research topic in document analysis. However, the commonly used traditional dynamic clustering models are inflexible and inefficient on large-scale datasets, while existing deep document clustering models lack a general method for capturing the topic evolution of time series data. To address these problems, a Deep Dynamic Document Clustering (DDDC) model was designed. Building on existing deep variational inference algorithms, the model captures, in each time slice, a topic distribution that incorporates the content of the preceding time slices, and obtains the evolution of event topics from these distributions through clustering. Experimental results on real news datasets show that, compared with Dynamic Topic Model (DTM), Variational Deep Embedding (VaDE) and other algorithms, the DDDC model improves clustering accuracy by at least 4 percentage points and Normalized Mutual Information (NMI) by at least 3 percentage points in every time slice on the different datasets, verifying the effectiveness of the DDDC model.

Key words: dynamic document clustering, event topic evolution, topic distribution, time series data, deep variational inference

CLC Number:

TP391.1

Hui LU, Ruizhang HUANG, Jingjing XUE, Lina REN, Chuan LIN. DDDC: deep dynamic document clustering model[J]. Journal of Computer Applications, 2023, 43(8): 2370-2375.


Clustering model	2019 ACC	2019 NMI	2020 ACC	2020 NMI	2021 ACC	2021 NMI	2022 ACC	2022 NMI
DDDC without inheritance	0.89	0.84	0.70	0.68	0.77	0.71	0.72	0.69
DDDC with inheritance	—	—	0.93	0.88	0.97	0.90	0.92	0.89
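
The ACC and NMI columns above are the usual clustering metrics. As a rough illustration only (this page does not give the authors' evaluation code), the sketch below computes both in plain Python. Two things are assumptions of this sketch: cluster ids are matched to labels by brute force over permutations, which is practical for a handful of topics and reaches the same optimum that the Hungarian assignment algorithm would, and NMI uses the arithmetic-mean normalisation, since the normalisation is not stated here. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings of cluster ids to labels."""
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    if len(clusters) > len(labels):
        raise ValueError("this sketch assumes no more clusters than labels")
    pair_counts = Counter(zip(y_pred, y_true))
    best = 0
    # Brute force is acceptable for a few topics; the Hungarian
    # algorithm finds the same optimum in polynomial time.
    for assigned in permutations(labels, len(clusters)):
        hits = sum(pair_counts[(c, l)] for c, l in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(y_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(y_true, y_pred):
    """NMI with arithmetic-mean normalisation: 2 * I(T; P) / (H(T) + H(P))."""
    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(y_true, y_pred)).items():
        mi += (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
    denom = entropy(y_true) + entropy(y_pred)
    return 2 * mi / denom if denom > 0 else 1.0


# Toy data, not taken from the paper: three topics, one document misplaced.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
acc = clustering_accuracy(y_true, y_pred)
nmi = normalized_mutual_info(y_true, y_pred)
```

In this toy case eight of the nine documents land in the matched cluster, so `acc` is 8/9. Cluster ids are arbitrary, which is why the mapping step is needed before accuracy can be computed at all, whereas NMI is invariant to relabelling.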

Clustering model	2019 ACC	2019 NMI	2020 ACC	2020 NMI	2021 ACC	2021 NMI	2022 ACC	2022 NMI
AE+GMM without inheritance	0.85	0.78	0.68	0.63	0.72	0.63	0.65	0.59
AE+GMM with inheritance	—	—	0.85	0.75	0.90	0.80	0.86	0.78

Dataset	Model	2019 ACC	2019 NMI	2020 ACC	2020 NMI	2021 ACC	2021 NMI	2022 ACC	2022 NMI
Series300	DTM	0.85	0.72	0.89	0.73	0.93	0.87	0.85	0.74
	ToT	0.88	0.82	0.82	0.72	0.89	0.79	0.85	0.77
	DDDC	—	—	0.93	0.88	0.97	0.90	0.92	0.89
Series500	AE	0.71	0.58	0.79	0.59	0.78	0.57	0.76	0.63
	SDCN	0.74	0.57	0.80	0.66	0.79	0.65	0.77	0.66
	VAE	0.76	0.66	0.74	0.63	0.70	0.61	0.75	0.63
	VaDE	0.86	0.75	0.81	0.69	0.83	0.72	0.79	0.72
	DTM	0.70	0.62	0.73	0.60	0.68	0.47	0.72	0.50
	ToT	0.68	0.56	0.67	0.49	0.69	0.50	0.69	0.53
	DDDC	—	—	0.86	0.74	0.87	0.75	0.83	0.77
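
The margins quoted in the abstract can be checked directly against the comparison table above. The sketch below is only an arithmetic check, not part of the paper: it copies the 2020 to 2022 scores as integers in hundredths (to avoid floating-point noise), groups the rows under each dataset exactly as they appear in the table as extracted here, and compares DDDC with the best baseline per time slice. The names `SCORES` and `margins` are hypothetical.

```python
# Scores in hundredths (0.93 -> 93), copied from the comparison table.
# Per model: (ACC, NMI) for 2020, 2021, 2022. 2019 is skipped because
# DDDC has no score for the first time slice.
YEARS = (2020, 2021, 2022)
SCORES = {
    "Series300": {
        "DTM": [(89, 73), (93, 87), (85, 74)],
        "ToT": [(82, 72), (89, 79), (85, 77)],
        "DDDC": [(93, 88), (97, 90), (92, 89)],
    },
    "Series500": {
        "AE": [(79, 59), (78, 57), (76, 63)],
        "SDCN": [(80, 66), (79, 65), (77, 66)],
        "VAE": [(74, 63), (70, 61), (75, 63)],
        "VaDE": [(81, 69), (83, 72), (79, 72)],
        "DTM": [(73, 60), (68, 47), (72, 50)],
        "ToT": [(67, 49), (69, 50), (69, 53)],
        "DDDC": [(86, 74), (87, 75), (83, 77)],
    },
}


def margins(scores, target="DDDC"):
    """Return {(dataset, year): (acc_gain, nmi_gain)} over the best baseline."""
    out = {}
    for dataset, models in scores.items():
        baselines = [v for name, v in models.items() if name != target]
        for i, year in enumerate(YEARS):
            best_acc = max(b[i][0] for b in baselines)
            best_nmi = max(b[i][1] for b in baselines)
            acc, nmi = models[target][i]
            out[(dataset, year)] = (acc - best_acc, nmi - best_nmi)
    return out


gains = margins(SCORES)
min_acc_gain = min(g[0] for g in gains.values())
min_nmi_gain = min(g[1] for g in gains.values())
print(min_acc_gain, min_nmi_gain)  # 4 3
```

The smallest gaps are 4 points of ACC and 3 points of NMI, which agrees with the "at least 4" and "at least 3" percentage points stated in the abstract.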


Time slice	Autonomous driving	Ukraine	Xiong'an	Smart city	Total
2019	100	100	100	100	400
2020	100	100	100	100	400
2021	100	100	100	100	400
2022	100	100	100	100	400
