Chinese-Vietnamese news topic discovery method based on cross-language neural topic model

doi:10.11772/j.issn.1001-9081.2020122054

Journal of Computer Applications, 2021, Vol. 41, Issue (10): 2879-2884. DOI: 10.11772/j.issn.1001-9081.2020122054

Special Issue: Artificial Intelligence

YANG Weiya^1,2, YU Zhengtao^1,2, GAO Shengxiang^1,2, SONG Ran^1,2

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China

Received: 2020-12-29 Revised: 2021-04-22 Online: 2021-07-14 Published: 2021-10-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61972196, 61762056, 61472168), the Yunnan Major Science and Technology Special Program (202002AD080001), and the Yunnan High-tech Industry Special Project (201606).

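Corresponding author: GAO Shengxiang
About the authors: YANG Weiya (born 1996), male, from Guang'an, Sichuan, M.S. candidate; main research interests: natural language processing, cross-language topic discovery. YU Zhengtao (born 1970), male, from Qujing, Yunnan, professor, Ph.D., CCF member; main research interests: natural language processing, machine translation, information retrieval. GAO Shengxiang (born 1977), female, from Dali, Yunnan, associate professor, Ph.D., CCF member; main research interests: natural language processing, machine translation, information retrieval. SONG Ran (born 1996), male, from Chuxiong, Yunnan, M.S. candidate; main research interests: natural language processing, information extraction.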

Abstract

Abstract: In the Chinese-Vietnamese cross-language news topic discovery task, Chinese-Vietnamese parallel corpora are scarce, which makes high-quality bilingual word embeddings difficult to train, and news texts are generally long, so methods based on bilingual word embeddings struggle to represent them well. To address these problems, a Chinese-Vietnamese news topic discovery method based on a Cross-Language Neural Topic Model (CL-NTM) was proposed. In the method, news texts were represented by their topic information, and the bilingual semantic alignment task was converted into a bilingual topic alignment task. Firstly, neural topic models based on the variational autoencoder were trained separately for Chinese and Vietnamese to obtain monolingual abstract representations of the topics. Then, a small-scale parallel corpus was used to map the bilingual topics into the same semantic space. Finally, the K-means method was used to cluster the bilingual topic representations and thereby discover the topics of news event clusters. Experimental results show that, compared with the Improved Chinese-English Latent Dirichlet Allocation model (ICE-LDA), the proposed method increases the Macro-F1 value and the topic coherence by 4 percentage points and 7 percentage points respectively, indicating that it can effectively improve the clustering quality and topic interpretability of news topics.

Key words: cross-language, topic alignment, Neural Topic Model (NTM), K-means clustering, topic discovery
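
The last stage described in the abstract (K-means clustering of bilingual topic representations, evaluated with Macro-F1) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the topic vectors are invented toy values that are presumed to be already mapped into a shared space, the variational autoencoder training and the parallel-corpus mapping step are not shown, the initialization (first k points) is a simplification, and the function names `kmeans` and `macro_f1` are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(gold)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy document-topic vectors (invented), assumed to live in one shared space.
zh_docs = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.85, 0.10, 0.05]]
vi_docs = [[0.80, 0.10, 0.10], [0.05, 0.90, 0.05], [0.15, 0.75, 0.10]]
gold = [0, 1, 0, 0, 1, 1]  # invented event labels for the six documents

labels = kmeans(zh_docs + vi_docs, k=2)
print(labels, macro_f1(gold, labels))
```

In this toy setup the cluster ids happen to line up with the gold event labels because of the chosen initialization; in a realistic evaluation the clusters would first have to be matched to gold classes before Macro-F1 is computed.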

CLC Number: TP391

YANG Weiya, YU Zhengtao, GAO Shengxiang, SONG Ran. Chinese-Vietnamese news topic discovery method based on cross-language neural topic model[J]. Journal of Computer Applications, 2021, 41(10): 2879-2884.
