Chinese-Vietnamese news topic discovery method based on cross-language neural topic model

Journal of Computer Applications, 2021, Vol. 41, Issue (10): 2879-2884. DOI: 10.11772/j.issn.1001-9081.2020122054

Special topic: Artificial Intelligence

YANG Weiya^1,2, YU Zhengtao^1,2, GAO Shengxiang^1,2, SONG Ran^1,2

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China

Received: 2020-12-29 Revised: 2021-04-22 Online: 2021-07-14 Published: 2021-10-10
Corresponding author: GAO Shengxiang
About the authors: YANG Weiya, born in 1996 in Guang'an, Sichuan, male, M.S. candidate. His research interests include natural language processing and cross-language topic discovery. YU Zhengtao, born in 1970 in Qujing, Yunnan, male, Ph.D., professor, CCF member. His research interests include natural language processing, machine translation, and information retrieval. GAO Shengxiang, born in 1977 in Dali, Yunnan, female, Ph.D., associate professor, CCF member. Her research interests include natural language processing, machine translation, and information retrieval. SONG Ran, born in 1996 in Chuxiong, Yunnan, male, M.S. candidate. His research interests include natural language processing and information extraction.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61972196, 61762056, 61472168), the Yunnan Major Science and Technology Special Program (202002AD080001), and the Yunnan High-tech Industry Special Project (201606).

Abstract

Abstract: In Chinese-Vietnamese cross-language news topic discovery, Chinese-Vietnamese parallel corpora are scarce, which makes it difficult to train high-quality bilingual word embeddings; moreover, news texts are generally long, so methods based on bilingual word embeddings struggle to represent them well. To address these problems, a Chinese-Vietnamese news topic discovery method based on a Cross-Language Neural Topic Model (CL-NTM) was proposed. The method represents each news text by its topic information and converts bilingual semantic alignment into a bilingual topic alignment task. Firstly, neural topic models based on the variational autoencoder were trained separately for Chinese and Vietnamese to obtain monolingual abstract topic representations. Then, a small-scale parallel corpus was used to map the bilingual topics into the same semantic space. Finally, the K-means method was used to cluster the bilingual topic representations, thereby discovering the topics of news event clusters. Experimental results show that, compared with the Improved Chinese-English Latent Dirichlet Allocation model (ICE-LDA), the proposed method improves the Macro-F1 value and topic coherence by 4 percentage points and 7 percentage points respectively, indicating that it effectively improves both the clustering performance and the interpretability of the discovered news topics.

Key words: cross-language, topic alignment, Neural Topic Model (NTM), K-means clustering, topic discovery
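
The clustering and evaluation steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the Chinese and Vietnamese topic vectors have already been mapped into one shared space (the toy vectors and event labels below are invented for illustration), runs plain Lloyd's K-means with a deterministic farthest-first initialization, and maps each cluster to its majority gold label before computing Macro-F1. All function names are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, iters=50):
    """Lloyd's K-means; returns one cluster id per input point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def majority_label_map(gold, cluster_ids):
    """Map each cluster id to the most frequent gold label inside it."""
    by_cluster = {}
    for g, c in zip(gold, cluster_ids):
        by_cluster.setdefault(c, Counter())[g] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the gold classes."""
    scores = []
    for cls in sorted(set(gold)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy data (assumed): topic proportions of six news items, already in a
# shared Chinese-Vietnamese topic space, with invented gold event labels.
docs = [
    ("zh", [0.8, 0.1, 0.1], "event_a"),
    ("vi", [0.7, 0.2, 0.1], "event_a"),
    ("zh", [0.1, 0.8, 0.1], "event_b"),
    ("vi", [0.2, 0.7, 0.1], "event_b"),
    ("zh", [0.1, 0.1, 0.8], "event_c"),
    ("vi", [0.1, 0.2, 0.7], "event_c"),
]
vectors = [vec for _, vec, _ in docs]
gold = [label for _, _, label in docs]

cluster_ids = kmeans(vectors, k=3)
mapping = majority_label_map(gold, cluster_ids)
pred = [mapping[c] for c in cluster_ids]
score = macro_f1(gold, pred)
print(cluster_ids, score)
```

On this toy data each Chinese item falls into the same cluster as its Vietnamese counterpart, so Macro-F1 is 1.0. Real topic vectors would come from the trained topic models and the learned cross-language mapping, which the sketch deliberately leaves out.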

CLC number: TP391

YANG Weiya, YU Zhengtao, GAO Shengxiang, SONG Ran. Chinese-Vietnamese news topic discovery method based on cross-language neural topic model[J]. Journal of Computer Applications, 2021, 41(10): 2879-2884.

