Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (10): 2879-2884. DOI: 10.11772/j.issn.1001-9081.2020122054

Special topic: Artificial Intelligence

Chinese-Vietnamese news topic discovery method based on cross-language neural topic model

YANG Weiya1,2, YU Zhengtao1,2, GAO Shengxiang1,2, SONG Ran1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China;
  2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China
  • Received: 2020-12-29 Revised: 2021-04-22 Online: 2021-10-10 Published: 2021-07-14
  • Corresponding author: GAO Shengxiang
  • About the authors: YANG Weiya (born 1996), male, from Guang'an, Sichuan, M.S. candidate; research interests: natural language processing, cross-language topic discovery. YU Zhengtao (born 1970), male, from Qujing, Yunnan, professor, Ph.D., CCF member; research interests: natural language processing, machine translation, information retrieval. GAO Shengxiang (born 1977), female, from Dali, Yunnan, associate professor, Ph.D., CCF member; research interests: natural language processing, machine translation, information retrieval. SONG Ran (born 1996), male, from Chuxiong, Yunnan, M.S. candidate; research interests: natural language processing, information extraction.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61972196, 61762056, 61472168), the Yunnan Major Science and Technology Special Program (202002AD080001), and the Yunnan High-tech Industry Special Project (201606).

Abstract: In Chinese-Vietnamese cross-language news topic discovery, Chinese-Vietnamese parallel corpora are scarce, which makes high-quality bilingual word embeddings difficult to train, and news texts are generally long, so methods based on bilingual word embeddings struggle to represent them well. To address these problems, a Chinese-Vietnamese news topic discovery method based on a Cross-Language Neural Topic Model (CL-NTM) was proposed. The method represents each news text by its topic information, converting bilingual semantic alignment into a bilingual topic alignment task. Firstly, neural topic models based on the variational autoencoder were trained separately for Chinese and Vietnamese to obtain monolingual abstract topic representations. Then, a small-scale parallel corpus was used to map the bilingual topics into the same semantic space. Finally, the K-means method was used to cluster the bilingual topic representations and thereby discover the topics of news event clusters. Experimental results show that, compared with the Improved Chinese-English Latent Dirichlet Allocation model (ICE-LDA), the proposed method improves the Macro-F1 value and topic coherence by 4 and 7 percentage points respectively, indicating that it can effectively improve both the clustering performance and the interpretability of news topics.

Key words: cross-language, topic alignment, Neural Topic Model (NTM), K-means clustering, topic discovery
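
The alignment and clustering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: the topic vectors are synthetic stand-ins for the outputs of the variational-autoencoder topic models, the least-squares linear map is one assumed way to place both languages in a shared space (the abstract only states that a small parallel corpus is used for the mapping), and the names `fit_topic_mapping`, `kmeans` and `sample_docs` are hypothetical.

```python
import numpy as np

def fit_topic_mapping(src_topics, tgt_topics):
    """Least-squares linear map W such that src_topics @ W ~ tgt_topics.
    Assumed alignment technique, chosen only for illustration."""
    W, *_ = np.linalg.lstsq(src_topics, tgt_topics, rcond=None)
    return W

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # distance from every point to its nearest already-chosen centre
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# --- Synthetic stand-ins for monolingual topic representations (assumption) ---
rng = np.random.default_rng(0)
n_topics, n_events, docs_per_event, n_parallel = 5, 3, 20, 30
event_centers = 5.0 * np.eye(n_topics)[:n_events]

def sample_docs(events):
    """Noisy topic vectors scattered around the centre of each news event."""
    return event_centers[events] + 0.1 * rng.standard_normal((len(events), n_topics))

# The "Vietnamese" topic space is modelled as an unknown rotation of the "Chinese" one.
rotation, _ = np.linalg.qr(rng.standard_normal((n_topics, n_topics)))

true_events = np.repeat(np.arange(n_events), docs_per_event)
zh_topics = sample_docs(true_events)
vi_topics = sample_docs(true_events) @ rotation

# Small parallel corpus: paired documents observed in both spaces.
par_events = np.arange(n_parallel) % n_events
par_zh = sample_docs(par_events)
par_vi = par_zh @ rotation + 0.01 * rng.standard_normal((n_parallel, n_topics))

# Step 2: map Vietnamese topic vectors into the Chinese topic space.
W = fit_topic_mapping(par_vi, par_zh)
vi_aligned = vi_topics @ W

# Step 3: cluster documents of both languages together.
all_topics = np.vstack([zh_topics, vi_aligned])
all_events = np.concatenate([true_events, true_events])
labels, _ = kmeans(all_topics, n_events)
```

In this toy setting, documents about the same event should receive the same cluster label regardless of language, which is the behaviour the shared topic space is meant to enable. Real topic representations are unlikely to be related by an exact linear map, so the quality of the alignment would depend on the model and on the size of the parallel corpus.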

CLC number: