Journal of Computer Applications, 2024, Vol. 44, Issue (2): 418-423. DOI: 10.11772/j.issn.1001-9081.2023020231

• Artificial Intelligence •

Text punctuation restoration for Vietnamese speech recognition with multimodal features

Hua LAI1,2, Tong SUN1,2, Wenjun WANG1,2, Zhengtao YU1,2, Shengxiang GAO1,2, Ling DONG1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China
  • Received: 2023-03-06 Revised: 2023-05-06 Accepted: 2023-05-10 Online: 2023-08-14 Published: 2024-02-10
  • Contact: Shengxiang GAO
  • About author: LAI Hua, born in 1966, a native of Lipu, Guangxi, M. S., associate professor. His research interests include intelligent information processing.
    SUN Tong, born in 1996, a native of Jining, Shandong, M. S. candidate. His research interests include natural language processing and speech recognition.
    WANG Wenjun, born in 1988, a native of Kunming, Yunnan, Ph. D. candidate. His research interests include natural language processing and speech recognition.
    YU Zhengtao, born in 1970, a native of Qujing, Yunnan, Ph. D., professor. His research interests include natural language processing, machine translation, and information retrieval.
    DONG Ling, born in 1984, a native of Dali, Yunnan, Ph. D. candidate. His research interests include speech recognition and natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61732005); Yunnan High and New Technology Industry Development Project (201606); Yunnan Province Major Science and Technology Special Project (202103AA080015); Yunnan Fundamental Research Project (202001AS070014); Reserve Talents of Academic and Technical Leaders in Yunnan Province (202105AC160018)

Abstract:

The text output by Vietnamese speech recognition systems lacks punctuation; restoring it helps eliminate ambiguity and makes the text easier to read and understand. Vietnamese speech recognition output often contains erroneous syllables that destroy the semantics of the text, so punctuation restoration models that rely on the text modality alone predict punctuation inaccurately on such noisy text. To address this, a punctuation restoration method for Vietnamese speech recognition text based on multimodal features was proposed, in which the intonation pauses and tone changes in Vietnamese speech guide the model toward correct punctuation predictions on noisy text. Specifically, Mel-Frequency Cepstral Coefficients (MFCC) were used to extract speech features, pre-trained language models were used to extract contextual text features, and the speech and text features were fused through a label attention mechanism, enhancing the model's ability to learn contextual information from noisy Vietnamese text. Experimental results show that, compared with punctuation restoration models that extract single-modality text features with Transformer and BERT (Bidirectional Encoder Representations from Transformers), the proposed method improves precision, recall, and F1 score on the Vietnamese dataset by at least 10 percentage points each, verifying the effectiveness of fusing speech and text features for punctuation prediction on noisy Vietnamese speech recognition text.
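As a rough illustration of what fusing speech and text features with attention can look like, the sketch below lets each text token attend over a sequence of speech frames and concatenates the attended speech summary onto the token vector. The abstract does not define the label attention mechanism, so generic scaled dot-product cross-attention is used here as a stand-in; the function names (`attend`, `fuse`), the toy vectors, and the assumption that speech frames have already been projected to the text dimension are all hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a sequence of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def fuse(text_feats, speech_feats):
    """Concatenate each token vector with its attention-weighted speech summary."""
    fused = []
    for token_vec in text_feats:
        context, _ = attend(token_vec, speech_feats, speech_feats)
        fused.append(token_vec + context)  # list concatenation: [text ; speech]
    return fused


# Toy inputs (hypothetical): 3 token vectors and 4 speech frames, both of dimension 2.
# In practice the token vectors would come from a pre-trained language model and the
# frames from MFCC features projected to the same dimension.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech_feats = [[0.5, 0.1], [0.2, 0.9], [0.0, 0.0], [0.7, 0.7]]

fused = fuse(text_feats, speech_feats)
print(len(fused), len(fused[0]))
```

Each fused vector keeps the original token features in its first half, so a downstream punctuation classifier can still rely on the text when the speech signal is uninformative.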

Key words: speech recognition, punctuation restoration, Vietnamese, Bidirectional Encoder Representations from Transformers (BERT), multimodal
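The precision, recall, and F1 figures quoted in the abstract can be read as token-level scores over predicted punctuation labels. The sketch below shows one common way to compute them, micro-averaged over punctuation classes with the "no punctuation" label excluded; the averaging scheme, the label names (`O`, `COMMA`, `PERIOD`, `QUESTION`), and the toy sequences are assumptions for illustration, not details reported by the paper.

```python
def punctuation_prf(gold, pred, no_punct="O"):
    """Micro-averaged precision, recall and F1 over punctuation labels.

    Positions labelled `no_punct` count as negatives and are excluded
    from the true-positive, predicted-positive and gold-positive totals.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != no_punct)
    pred_pos = sum(1 for p in pred if p != no_punct)
    gold_pos = sum(1 for g in gold if g != no_punct)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (hypothetical labels): one label per token, naming the mark that follows it.
gold = ["O", "COMMA", "O", "O", "PERIOD", "O", "QUESTION"]
pred = ["O", "COMMA", "O", "COMMA", "PERIOD", "O", "O"]

precision, recall, f1 = punctuation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

An improvement of 10 percentage points means an absolute difference between two such scores expressed in percent, not a 10% relative change.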

CLC Number: