Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer

doi:10.11772/j.issn.1001-9081.2023010062

Journal of Computer Applications, 2024, Vol. 44, Issue (1): 123-128. DOI: 10.11772/j.issn.1001-9081.2023010062

• Artificial Intelligence •


Yunyun GAO, Lasheng ZHAO, Qiang ZHANG (corresponding author)

Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian, Liaoning 116622, China

Received: 2023-01-30 Revised: 2023-04-01 Accepted: 2023-04-07 Online: 2023-06-06 Published: 2024-01-10
Contact: Qiang ZHANG
About the authors: GAO Yunyun, born in 1997 in Yantai, Shandong, M.S. candidate. Her research interests include deep learning and spoken term detection.
ZHAO Lasheng, born in 1978 in Shuozhou, Shanxi, Ph.D., lecturer. His research interests include deep learning and speech signal processing.
ZHANG Qiang, born in 1971 in Xi'an, Shaanxi, Ph.D., professor. His research interests include biological computing and artificial intelligence, and big data analysis and processing.
Supported by:
Basic Scientific Research Project of Liaoning Provincial Department of Education (LJKMZ20221838)

Abstract

In Query-by-Example Spoken Term Detection (QbE-STD), the speech information captured by Acoustic Word Embeddings (AWEs) extracted with a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN) is limited. To better represent speech content and improve model performance, an acoustic word embedding model based on Bi-directional Long Short-Term Memory (Bi-LSTM) and a convolutional-Transformer was proposed. Firstly, Bi-LSTM was used to extract features and model the speech sequence, with stacking used to improve the learning ability of the model. Secondly, to learn local information while capturing global information, a CNN and a Transformer encoder were connected in parallel to form the convolutional-Transformer, which exploits its strengths in feature extraction to aggregate more useful information and improve the discriminability of the embeddings. Under the constraint of contrastive loss, the Average Precision (AP) of the proposed model reached 94.36%, which is 1.76% higher than that of the attention-based Bi-LSTM model. The experimental results show that the proposed model effectively improves performance and better accomplishes QbE-STD.

Key words: Convolutional Neural Network (CNN), Acoustic Word Embedding (AWE), speech information, Query-by-Example Spoken Term Detection (QbE-STD), Recurrent Neural Network (RNN)
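
The abstract states that the model is trained under a contrastive loss but does not give its exact form on this page. As a minimal sketch, assuming the common margin-based pairwise formulation with cosine distance (the function names, the distance choice, and the `margin` value are all illustrative assumptions, not taken from the paper), the idea can be written as:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrastive_loss(emb_a, emb_b, same_word, margin=0.5):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumed formulation (not confirmed by the source):
    - same-word pairs are pulled together: loss = d^2
    - different-word pairs are pushed apart until their distance
      exceeds `margin`: loss = max(0, margin - d)^2
    """
    d = cosine_distance(emb_a, emb_b)
    if same_word:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Under this formulation, two embeddings of the same word incur no loss once they point in the same direction, while embeddings of different words stop contributing once they are at least `margin` apart, which is one way to encourage the discriminability the abstract refers to.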

CLC Number: TP183

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG. Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer[J]. Journal of Computer Applications, 2024, 44(1): 123-128.

Figures/Tables (6)

Fig. 1 Flowchart of QbE-STD based on acoustic word embedding

Fig. 2 Schematic diagram of overall network structure

Fig. 3 Structure of SE block

Fig. 4 Structure of Transformer encoder

Tab. 1 Ablation experiment results

No.	Model (with FFN)	AP/%	No.	Model (without FFN)	AP/%
M1	Bi-LSTM + FFN	88.62	N1	Bi-LSTM	88.92
M2	Transformer	89.01	N2	Transformer	90.92
M3	1D Conv + SE	92.31	N3	1D Conv + SE	92.88
M4	Swish -> ReLU	92.52	N4	Swish -> ReLU	93.16
M5	Serial connection	92.55	N5	Serial connection	91.43
M6	Proposed model (parallel)	92.68	N6	Proposed model (parallel)	93.53
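
This page does not describe how the AP values in the table are computed. A minimal sketch, assuming AP is obtained by ranking word pairs by embedding distance and scoring same-word pairs as positives (a common evaluation for acoustic word embeddings, but an assumption here; `average_precision` and its arguments are hypothetical names):

```python
def average_precision(distances, labels):
    """Average precision for pairs ranked by ascending distance.

    distances: list of floats, smaller means more similar.
    labels: list of bools, True when the pair is the same word.
    Returns 0.0 when there are no positive pairs.
    """
    ranked = sorted(zip(distances, labels), key=lambda pair: pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_same) in enumerate(ranked, start=1):
        if is_same:
            hits += 1
            # Precision at the rank where this positive pair is retrieved.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

With this definition, a model whose same-word pairs all rank ahead of different-word pairs scores 1.0, and every different-word pair ranked above a same-word pair lowers the score.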

Tab. 2 Comparative experiment results of different models

Model	AP/%	PRBEP/%	KL divergence
LSTM ($L_t$) [27]	74.15	70.48	4.6583
Bi-LSTM ($L_t$) [28]	88.92	83.60	5.1296
Bi-LSTM+Attention ($L_t$) [16]	91.52	86.09	5.6371
Bi-LSTM+Attention ($L_c$)	92.73	87.28	6.0472
Proposed model ($L_t$)	93.53	87.39	6.2099
Proposed model ($L_c$)	94.36	88.96	6.4672
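
The 1.76% improvement quoted in the abstract can be checked against the table. The sketch below is an interpretation, not something the page states: it assumes the figure is a relative gain of the proposed model over Bi-LSTM+Attention, with both taken from their $L_c$ rows. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

proposed_lc = 94.36   # AP/% of the proposed model, L_c row of Table 2
attention_lc = 92.73  # AP/% of Bi-LSTM+Attention, L_c row of Table 2

absolute_gain = proposed_lc - attention_lc
relative_gain = relative_improvement(proposed_lc, attention_lc)
print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.2f}%")
```

Under that reading the absolute difference is 1.63 percentage points and the relative gain rounds to 1.76%, which matches the abstract.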


References (28)

1	ZHANG W Q, SONG B L, CAI M, et al. A query-by-example spoken term detection method based on phonetic posteriorgram[J]. Journal of Tianjin University (Science and Technology), 2015, 48(9): 757-760. (in Chinese)
2	HAZEN T J, SHEN W, WHITE C. Query-by-example spoken term detection using phonetic posteriorgram templates[C]// Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2009: 421-426. DOI: 10.1109/asru.2009.5372889
3	ZHANG Y, GLASS J R. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams[C]// Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2009: 398-403. DOI: 10.1109/asru.2009.5372931
4	MANTEENA G, ANGUERA X. Speed improvements to information retrieval-based dynamic time warping using hierarchical K-Means clustering[C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2013: 8515-8519. DOI: 10.1109/icassp.2013.6639327
5	LEUNG C C, WANG L, XU H, et al. Toward high-performance language-independent query-by-example spoken term detection for MediaEval 2015: post-evaluation analysis[C]// Proceedings of the INTERSPEECH 2016. [S.l.]: International Speech Communication Association, 2016: 3703-3707. DOI: 10.21437/interspeech.2016-691
6	LEVIN K, HENRY K, JANSEN A, et al. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings[C]// Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2013: 410-415. DOI: 10.1109/asru.2013.6707765
7	LEVIN K, JANSEN A, VAN DURME B. Segmental acoustic indexing for zero resource keyword search[C]// Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2015: 5828-5832. DOI: 10.1109/icassp.2015.7179089
8	SHEN F, DU C, YU K. Acoustic word embeddings for end-to-end speech synthesis[J]. Applied Sciences, 2021, 11(19): No.9010. DOI: 10.3390/app11199010
9	SHI B, SETTLE S, LIVESCU K. Whole-word segmental speech recognition with acoustic word embeddings[C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2021: 164-171. DOI: 10.1109/slt48900.2021.9383578
10	KAMPER H. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models[C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2019: 6535-6539. DOI: 10.1109/icassp.2019.8683639
11	KAMPER H, WANG W, LIVESCU K. Deep convolutional acoustic word embeddings using word-pair side information[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2016: 4950-4954. DOI: 10.1109/icassp.2016.7472619
12	HUANG J, GHARBIEH W, SHIM H S, et al. Query-by-example keyword spotting system using multi-head attention and soft-triple loss[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2021: 6858-6862. DOI: 10.1109/icassp39728.2021.9414156
13	SETTLE S, LIVESCU K. Discriminative acoustic word embeddings: recurrent neural network-based approaches[C]// Proceedings of the 2016 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2016: 503-510. DOI: 10.1109/slt.2016.7846310
14	CHEN G, PARADA C, SAINATH T N. Query-by-example keyword spotting using long short-term memory networks[C]// Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2015: 5236-5240. DOI: 10.1109/icassp.2015.7178970
15	YUAN Y, LV Z, HUANG S, et al. Verifying deep keyword spotting detection with acoustic word embeddings[C]// Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 613-620. DOI: 10.1109/asru46091.2019.9003781
16	YUAN Y, XIE L, LEUNG C C, et al. Fast query-by-example speech search using attention-based deep binary embeddings[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1988-2000. DOI: 10.1109/taslp.2020.2998277
17	AO C W, LEE H Y. Query-by-example spoken term detection using attention-based multi-hop networks[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2018: 6264-6268. DOI: 10.1109/icassp.2018.8462570
18	ZHANG K, WU Z, JIA J, et al. Query-by-example spoken term detection using attentive pooling networks[C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 1267-1272. DOI: 10.1109/apsipaasc47483.2019.9023023
19	RAM D, MICULICICH L, BOURLARD H. CNN based query by example spoken term detection[C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 92-96. DOI: 10.21437/interspeech.2018-1722
20	RAM D, MICULICICH L, BOURLARD H. Neural network based end-to-end query by example spoken term detection[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1416-1427. DOI: 10.1109/taslp.2020.2988788
21	NAIK P, GAONKAR M N, THENKANIDIYOOR V, et al. Kernel based matching and a novel training approach for CNN-based QbE-STD[C]// Proceedings of the 2020 International Conference on Signal Processing and Communications. Piscataway: IEEE, 2020: 1-5. DOI: 10.1109/spcom50965.2020.9179588
22	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. DOI: 10.1109/cvpr.2018.00745
23	YUAN Y, LEUNG C C, XIE L, et al. Query-by-example speech search using recurrent neural acoustic word embeddings with temporal context[J]. IEEE Access, 2019, 7: 67656-67665. DOI: 10.1109/access.2019.2918638
24	JACOBS C, MATUSEVYCH Y, KAMPER H. Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation[C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2021: 919-926. DOI: 10.1109/slt48900.2021.9383594
25	ZHANG Y, PARK D S, HAN W, et al. BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1519-1532. DOI: 10.1109/jstsp.2022.3182537
26	YANG Z, HIRSCHBERG J. Linguistically-informed training of acoustic word embeddings for low-resource languages[C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 2678-2682. DOI: 10.21437/interspeech.2019-3119
27	SHITOV D, PIROGOVA E, WYSOCKI T A, et al. Learning acoustic word embeddings with dynamic time warping triplet networks[J]. IEEE Access, 2020, 8: 103327-103338. DOI: 10.1109/access.2020.2999055
28	LI Z, WU L, LI T, et al. Improves neural acoustic word embeddings query by example spoken term detection with Wav2Vec pretraining and circle loss[C]// Proceedings of the 12th International Symposium on Chinese Spoken Language Processing. Piscataway: IEEE, 2021: 1-5. DOI: 10.1109/iscslp49672.2021.9362065
