Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (2): 418-423. DOI: 10.11772/j.issn.1001-9081.2023020231

• Artificial intelligence •

Text punctuation restoration for Vietnamese speech recognition with multimodal features

Hua LAI1,2, Tong SUN1,2, Wenjun WANG1,2, Zhengtao YU1,2, Shengxiang GAO1,2, Ling DONG1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China
  • Received: 2023-03-06 Revised: 2023-05-06 Accepted: 2023-05-10 Online: 2023-08-14 Published: 2024-02-10
  • Contact: Shengxiang GAO
  • About author: LAI Hua, born in 1966, native of Lipu, Guangxi, M. S., associate professor. His research interests include intelligent information processing.
    SUN Tong, born in 1996, M. S. candidate. His research interests include natural language processing, speech recognition.
    WANG Wenjun, born in 1988, Ph. D. candidate. His research interests include natural language processing, speech recognition.
    YU Zhengtao, born in 1970, Ph. D., professor. His research interests include natural language processing, machine translation, information retrieval.
    DONG Ling, born in 1984, Ph. D. candidate. His research interests include speech recognition, natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61732005); Yunnan High and New Technology Industry Development Project (201606); Yunnan Province Major Science and Technology Special Project (202103AA080015); Yunnan Fundamental Research Project (202001AS070014); Reserve Talents of Academic and Technical Leaders in Yunnan Province (202105AC160018)




The text output by a Vietnamese speech recognition system lacks punctuation; restoring punctuation in the recognized text helps eliminate ambiguity and makes the text easier to understand. However, punctuation restoration models based on the text modality alone predict punctuation inaccurately on noisy text, because phoneme errors, which occur frequently in Vietnamese speech recognition, destroy the semantics of the text. To address this problem, a punctuation restoration method for Vietnamese speech recognition text that uses multimodal features was proposed, in which intonation pauses and tone changes in Vietnamese speech guide the prediction of punctuation for noisy text. Specifically, Mel-Frequency Cepstral Coefficients (MFCC) were used to extract speech features, pre-trained language models were used to extract contextual text features, and a label attention mechanism was used to fuse the speech and text features, thereby enhancing the model's ability to learn contextual information from noisy Vietnamese text. Experimental results show that, compared with Transformer-based and BERT (Bidirectional Encoder Representations from Transformers)-based punctuation restoration models that extract only text features, the proposed method improves precision, recall, and F1 score on the Vietnamese dataset by at least 10 percentage points, which demonstrates that fusing speech and text features improves punctuation prediction accuracy on noisy Vietnamese speech recognition text.
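
To make the fusion step concrete, the following is a minimal sketch of how per-token text features could attend over per-frame speech features. It is not the authors' implementation: the abstract does not specify the label attention mechanism, so plain scaled dot-product cross-attention is substituted here. The function names (`softmax`, `attend`, `fuse`) and the toy vectors are hypothetical, and the speech features are assumed to have already been projected to the same dimension as the text features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def fuse(text_feats, speech_feats):
    """Concatenate each token vector with its attended acoustic context.

    Assumption: this stands in for the label attention fusion named in
    the abstract, whose details are not given there.
    """
    fused = []
    for token in text_feats:
        context, _ = attend(token, speech_feats, speech_feats)
        fused.append(token + context)  # list concatenation: [text ; speech]
    return fused


# Toy inputs (hypothetical): 3 token vectors and 4 speech-frame vectors,
# standing in for language-model outputs and projected MFCC frames.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

fused_feats = fuse(text_feats, speech_feats)
```

In a full system, each fused vector would be passed to a classifier that predicts the punctuation mark, if any, following the corresponding token.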

Key words: speech recognition, punctuation restoration, Vietnamese, Bidirectional Encoder Representations from Transformers (BERT), multimodal



