Text punctuation restoration for Vietnamese speech recognition with multimodal features

doi:10.11772/j.issn.1001-9081.2023020231

Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (2): 418-423. DOI: 10.11772/j.issn.1001-9081.2023020231

• Artificial Intelligence

Hua LAI^1,2, Tong SUN^1,2, Wenjun WANG^1,2, Zhengtao YU^1,2, Shengxiang GAO^1,2, Ling DONG^1,2

^1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
^2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China

Received: 2023-03-06 Revised: 2023-05-06 Accepted: 2023-05-10 Online: 2023-08-14 Published: 2024-02-10
Corresponding author: Shengxiang GAO
About the authors: LAI Hua, born in 1966 in Lipu, Guangxi, M. S., associate professor. His research interests include intelligent information processing.
SUN Tong, born in 1996 in Jining, Shandong, M. S. candidate. His research interests include natural language processing and speech recognition.
WANG Wenjun, born in 1988 in Kunming, Yunnan, Ph. D. candidate. His research interests include natural language processing and speech recognition.
YU Zhengtao, born in 1970 in Qujing, Yunnan, Ph. D., professor. His research interests include natural language processing, machine translation, and information retrieval.
DONG Ling, born in 1984 in Dali, Yunnan, Ph. D. candidate. His research interests include speech recognition and natural language processing.
Supported by:
National Natural Science Foundation of China (61732005); Yunnan High and New Technology Industry Development Project (201606); Yunnan Province Major Science and Technology Special Project (202103AA080015); Yunnan Fundamental Research Project (202001AS070014); Reserve Talents of Academic and Technical Leaders in Yunnan Province (202105AC160018)

Abstract:

The text output by Vietnamese speech recognition systems lacks punctuation; restoring punctuation in the recognized text helps eliminate ambiguity and makes the text easier to read and understand. However, Vietnamese speech recognition text often contains erroneous syllables that destroy its semantics, so punctuation restoration models based on the text modality alone predict punctuation inaccurately on such noisy text. To address this, a punctuation restoration method for Vietnamese speech recognition text with multimodal features was proposed, which uses the intonation pauses and tone changes in Vietnamese speech to guide the model toward correct punctuation predictions on noisy text. Specifically, Mel-Frequency Cepstral Coefficients (MFCC) were used to extract speech features, a pre-trained language model was used to extract text context features, and the speech and text features were fused through a label attention mechanism, enhancing the model's ability to learn contextual information from noisy Vietnamese text. Experimental results show that, compared with punctuation restoration models that extract single-modality text features based on Transformer and BERT (Bidirectional Encoder Representations from Transformers), the proposed method improves precision, recall, and F1 score on the Vietnamese dataset by at least 10 percentage points, verifying the effectiveness of fusing speech and text features for improving punctuation prediction precision on noisy Vietnamese speech recognition text.

Key words: speech recognition, punctuation restoration, Vietnamese, Bidirectional Encoder Representations from Transformers (BERT), multimodal
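
The abstract describes fusing speech and text features through a label attention mechanism but this page gives no architectural details. The following is only a minimal sketch of one common reading of label attention: a fused token vector acts as the query, and one embedding per punctuation label acts as the key. The concatenation-based `fuse` function, the label set (including a hypothetical `O` label for "no punctuation"), and all numbers are illustrative assumptions, not the authors' model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(text_vec, speech_vec):
    # Assumption: the simplest possible fusion, plain concatenation.
    return list(text_vec) + list(speech_vec)

def label_attention(token_vec, label_embeddings):
    """Scaled dot-product attention from one token (query) to all labels (keys).

    Returns a dict mapping each label to its attention weight.
    """
    scale = math.sqrt(len(token_vec))
    labels = list(label_embeddings)
    scores = [dot(token_vec, label_embeddings[name]) / scale for name in labels]
    return dict(zip(labels, softmax(scores)))

# Toy label embeddings (hypothetical values, dimension 4).
LABELS = {
    "O": [1.0, 0.0, 0.0, 0.0],
    "COMMA": [0.0, 1.0, 0.5, 0.0],
    "PERIOD": [0.0, 0.0, 1.0, 1.0],
}

text_vec = [0.1, 0.2]    # stand-in for a pre-trained language model token feature
speech_vec = [0.9, 1.5]  # stand-in for MFCC frames pooled over the token's time span

weights = label_attention(fuse(text_vec, speech_vec), LABELS)
predicted = max(weights, key=weights.get)
print(predicted, weights)
```

In this toy setup the speech half of the fused vector dominates the dot products, which illustrates the intuition in the abstract: acoustic evidence can steer the label decision even when the text features are uninformative.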

CLC number: TP183

Hua LAI, Tong SUN, Wenjun WANG, Zhengtao YU, Shengxiang GAO, Ling DONG. Text punctuation restoration for Vietnamese speech recognition with multimodal features[J]. Journal of Computer Applications, 2024, 44(2): 418-423.

Figures/Tables: 9

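Text type	Text sequence
Speech recognition text
Punctuation prediction for the recognized text	.
Original correct text	.
Translation of the correct text	Dizziness, possibly caused by a severe arrhythmia.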

Text sequence	Audio frame	Text sequence	Audio frame
tôn	[4.920, 4.960)	can	[5.760, 6.000)
bên	[4.960, 5.120)		[6.000, 6.440)
ngoài	[5.120, 5.440)	còn (particle)	6.440
lan	[5.440, 5.760)	có	(6.440, 6.680]
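
The alignment table above contains one interval, [6.000, 6.440), with no text attached, which is the kind of pause the abstract says can guide punctuation prediction. As a rough illustration only, the sketch below measures the silence after each token and flags long gaps as candidate punctuation positions. Reading the interval bounds as seconds, the 0.2 threshold, and the function names are assumptions made for this example; the paper's method learns from MFCC features rather than applying a fixed rule.

```python
# (token, start, end) rows copied from the alignment table.
# Assumption: the interval bounds are timestamps in seconds.
ALIGNMENT = [
    ("tôn", 4.920, 4.960),
    ("bên", 4.960, 5.120),
    ("ngoài", 5.120, 5.440),
    ("lan", 5.440, 5.760),
    ("can", 5.760, 6.000),
    ("có", 6.440, 6.680),
]

PAUSE_THRESHOLD = 0.2  # hypothetical cut-off, not taken from the paper

def pauses_after_tokens(alignment):
    """Return (token, gap) pairs: the silence between a token and the next one."""
    gaps = []
    for (token, _, end), (_, next_start, _) in zip(alignment, alignment[1:]):
        gaps.append((token, round(next_start - end, 3)))
    return gaps

def punctuation_candidates(alignment, threshold=PAUSE_THRESHOLD):
    """Tokens followed by a pause at least as long as the threshold."""
    return [token for token, gap in pauses_after_tokens(alignment) if gap >= threshold]

candidates = punctuation_candidates(ALIGNMENT)
print(pauses_after_tokens(ALIGNMENT))
print(candidates)
```

With these rows the only gap above the threshold follows "can", matching the empty interval in the table.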

Dataset	Sentences	Punctuation marks
Training set	9 000	28 000
Test set	1 000	3 000

Model	COMMA			PERIOD			QUESTION MARK			OVERALL
Model	P	R	F1	P	R	F1	P	R	F1	P	R	F1
Punctuator2	nan	0.0	nan	54.6	33.6	41.6	0.0	nan	nan	1.1	11.1	2.0
VietPunc	62.7	59.4	61.3	73.4	74.0	74.0	71.4	39.2	54.8	68.0	65.3	66.6
Transformer CRF	66.0	53.0	59.0	71.0	57.0	63.0	75.0	52.0	59.0	70.6	54.0	61.3
Transformer Linear	47.3	12.8	20.1	45.2	7.4	12.7	nan	nan	nan	46.8	10.9	17.7
BERT Linear	55.6	44.6	49.5	85.4	71.9	78.1	80.0	60.0	75.0	71.9	58.5	64.5
BERT MFCC LAN	78.0	69.3	73.4	87.4	92.8	90.0	82.9	42.5	56.2	82.6	77.0	79.7
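
The table reports precision (P), recall (R), and F1 per punctuation class, and some cells are `nan`. The page does not say how the metrics were computed; the sketch below uses the standard definitions, under which a ratio with a zero denominator is undefined, which is consistent with the `nan` entries. The helper names and the example counts are illustrative assumptions.

```python
import math

def f1_from(precision, recall):
    """Harmonic mean of precision and recall; nan when it is undefined."""
    if math.isnan(precision) or math.isnan(recall) or precision + recall == 0:
        return math.nan
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else math.nan  # no predictions made
    recall = tp / (tp + fn) if (tp + fn) else math.nan     # no gold labels present
    return precision, recall, f1_from(precision, recall)

# Check the OVERALL row of BERT MFCC LAN: P = 82.6, R = 77.0.
overall_f1 = round(f1_from(82.6, 77.0), 1)
print(overall_f1)

# Hypothetical counts: a model that never predicts a comma although 5 exist.
p, r, f = precision_recall_f1(tp=0, fp=0, fn=5)
print(p, r, f)
```

The first check reproduces the reported overall F1 of 79.7 from the reported P and R. The second case yields an undefined precision, a recall of 0.0, and an undefined F1, the same pattern as the COMMA columns of the Punctuator2 row.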

Model	OVERALL
Model	P	R	F1
BERT LAN	64.0	57.9	60.8
BERT MFCC LAN	82.6	77.0	79.7

Model	OVERALL
Model	P	R	F1
BERT MFCC	59.0	55.0	63.5
BERT MFCC LAN	82.6	77.0	79.7
