Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (4): 972-977. DOI: 10.11772/j.issn.1001-9081.2019101711

• Artificial Intelligence •

Improving machine simultaneous interpretation by punctuation recovery

CHEN Yuna, SHI Xiaodong

  1. School of Informatics, Xiamen University, Xiamen Fujian 361005, China
  • Received: 2019-10-11 Revised: 2019-12-02 Online: 2020-04-10 Published: 2020-04-17
  • Corresponding author: SHI Xiaodong
  • About the authors: CHEN Yuna (born 1995), female, from Quanzhou, Fujian, M.S. candidate; research interests: natural language processing and machine translation. SHI Xiaodong (born 1966), male, from Jiangyin, Jiangsu, professor, Ph.D., CCF member; research interests: natural language processing, machine translation, and artificial intelligence.
  • Supported by:
    This work is partially supported by the Key Project of the National Social Science Foundation of China (16AZD049) and the Language Research Project Outstanding Achievement Late Fund of the National Language Commission of China (WT135-38).

Abstract: In a pipeline Machine Simultaneous Interpretation (MSI) system, feeding Automatic Speech Recognition (ASR) output directly into Neural Machine Translation (NMT) gives the translator semantically incomplete input. To address this problem, a model based on Bidirectional Encoder Representations from Transformers (BERT) and Focal Loss was proposed. Firstly, several segments generated by the ASR system were cached and combined into a word string. Then, a BERT-based sequence labeling model was used to restore the punctuation of the string, with Focal Loss as the training loss function to alleviate the class imbalance caused by unpunctuated samples outnumbering punctuated ones. Finally, the punctuation-restored string was input into NMT. Experimental results on English-German and Chinese-English translation show that, in terms of translation quality, MSI using the proposed punctuation recovery model improves by 8.19 BLEU and 4.24 BLEU respectively over MSI that feeds ASR output directly into NMT, and by 2.28 BLEU and 3.66 BLEU respectively over MSI using a punctuation recovery model based on a bidirectional recurrent neural network with attention mechanism. Therefore, the proposed model can be effectively applied to MSI.

Key words: Machine Simultaneous Interpretation (MSI), punctuation recovery, Focal Loss, Automatic Speech Recognition (ASR), pretrained language model
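The role Focal Loss plays against the class imbalance mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard multi-class form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The label set, the logits, and the values gamma = 2.0 and alpha = 1.0 are assumptions made for illustration only; the abstract does not state the paper's label inventory or hyperparameters, and this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of per-class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Plain cross-entropy for a single token: -log(p_t)."""
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single token: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults (assumptions), not values
    taken from the paper.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical label set: "O" means no punctuation follows the token.
LABELS = ["O", "COMMA", "PERIOD"]

# An easy, confidently classified "O" token (the majority class).
easy_logits, easy_target = [4.0, 0.0, 0.0], LABELS.index("O")
# A hard token whose true label is the rare "PERIOD" class.
hard_logits, hard_target = [1.0, 0.5, 0.0], LABELS.index("PERIOD")

# Ratio of focal loss to cross-entropy equals (1 - p_t)^gamma when alpha = 1.
easy_ratio = focal_loss(easy_logits, easy_target) / cross_entropy(easy_logits, easy_target)
hard_ratio = focal_loss(hard_logits, hard_target) / cross_entropy(hard_logits, hard_target)

print(f"easy token keeps {easy_ratio:.4f} of its cross-entropy loss")
print(f"hard token keeps {hard_ratio:.4f} of its cross-entropy loss")
```

The easy majority-class token keeps only a small fraction of its cross-entropy loss, while the hard minority-class token keeps most of it, so the abundant unpunctuated positions contribute less to the gradient. Setting gamma to 0 recovers ordinary cross-entropy.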
