Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer

Journal of Computer Applications, 2024, Vol. 44, Issue (1): 123-128. DOI: 10.11772/j.issn.1001-9081.2023010062

Topic: Artificial Intelligence

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG (corresponding author)

Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian, Liaoning 116622, China

Received: 2023-01-30 Revised: 2023-04-01 Accepted: 2023-04-07 Online: 2023-06-06 Published: 2024-01-10
Contact: Qiang ZHANG
About the authors: GAO Yunyun, born in 1997, from Yantai, Shandong, M.S. candidate. Her research interests include deep learning and spoken term detection.
ZHAO Lasheng, born in 1978, from Shuozhou, Shanxi, Ph.D., lecturer. His research interests include deep learning and speech signal processing.
ZHANG Qiang, born in 1971, from Xi'an, Shaanxi, Ph.D., professor. His research interests include biocomputing and artificial intelligence, and big data analysis and processing.
Supported by: Basic Scientific Research Project of Liaoning Provincial Department of Education (LJKMZ20221838)

Abstract

In Query-by-Example Spoken Term Detection (QbE-STD), the speech information captured by Acoustic Word Embeddings (AWEs) extracted with a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN) is limited. To better represent speech content and improve model performance, an acoustic word embedding model based on Bi-directional Long Short-Term Memory (Bi-LSTM) and a convolutional-Transformer was proposed. Firstly, Bi-LSTM was used to extract features and model speech sequences, and Bi-LSTM layers were stacked to improve the learning ability of the model. Secondly, to learn local information while capturing global information, a CNN and a Transformer encoder were connected in parallel to form the convolutional-Transformer, which exploits the feature extraction strengths of both to aggregate more effective information and improve the discriminability of the embeddings. Under the constraint of contrastive loss, the Average Precision (AP) of the proposed model reached 94.36%, which is 1.76% higher in relative terms than that of the attention-based Bi-LSTM model. The experimental results show that the proposed model can effectively improve performance and better accomplish QbE-STD.

Key words: Convolutional Neural Network (CNN), Acoustic Word Embedding (AWE), speech information, Query-by-Example Spoken Term Detection (QbE-STD), Recurrent Neural Network (RNN)
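
The abstract states that the embeddings are trained under a contrastive loss but does not give the formula on this page. As a rough illustration only, the sketch below uses one common pairwise form of contrastive loss on cosine distance. The function names, the margin value of 0.5 and the choice of cosine distance are assumptions made for this example and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(emb_a, emb_b, same_word, margin=0.5):
    """Pairwise contrastive loss (assumed form, not the paper's exact definition).

    Same-word pairs are pulled together (loss grows with distance);
    different-word pairs are pushed apart until they are at least
    `margin` away from each other (hypothetical default of 0.5).
    """
    d = cosine_distance(emb_a, emb_b)
    if same_word:
        return d * d
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    # Two toy 2-D "embeddings"; real acoustic word embeddings are much larger.
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], same_word=True))
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], same_word=False))
```

In this form, a pair of segments of the same word contributes no loss once their embeddings coincide, and a pair of different words contributes no loss once their distance exceeds the margin, which is the behaviour that makes embeddings more discriminative.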

CLC number: TP183

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG. Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer[J]. Journal of Computer Applications, 2024, 44(1): 123-128.

Figures and tables (6)

Fig. 1 Flowchart of QbE-STD based on acoustic word embedding

Fig. 2 Schematic diagram of overall network structure

Fig. 3 Structure of SE block

Fig. 4 Structure of Transformer encoder

Tab. 1 Ablation experiment results

No.	Model (with FFN)	AP/%	No.	Model (without FFN)	AP/%
M1	Bi-LSTM + FFN	88.62	N1	Bi-LSTM	88.92
M2	Transformer	89.01	N2	Transformer	90.92
M3	1D Conv + SE	92.31	N3	1D Conv + SE	92.88
M4	Swish -> ReLU	92.52	N4	Swish -> ReLU	93.16
M5	Series connection	92.55	N5	Series connection	91.43
M6	Proposed model (parallel connection)	92.68	N6	Proposed model (parallel connection)	93.53

Tab. 2 Comparative experiment results of different models

Model	AP/%	PRBEP/%	KL divergence
LSTM ($L_t$) [27]	74.15	70.48	4.6583
Bi-LSTM ($L_t$) [28]	88.92	83.60	5.1296
Bi-LSTM+Attention ($L_t$) [16]	91.52	86.09	5.6371
Bi-LSTM+Attention ($L_c$)	92.73	87.28	6.0472
Proposed model ($L_t$)	93.53	87.39	6.2099
Proposed model ($L_c$)	94.36	88.96	6.4672
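
Table 2 reports AP and PRBEP without describing how they are computed on this page. The sketch below shows one standard way to obtain both from a ranked list of word pairs, assuming each pair has a similarity score and a binary same-word label. The function names and the toy data are hypothetical, and the paper's own evaluation protocol may differ.

```python
def _rank_by_similarity(similarities, labels):
    """Pair each similarity with its label and sort from most to least similar."""
    return sorted(zip(similarities, labels), key=lambda p: p[0], reverse=True)


def average_precision(similarities, labels):
    """Mean of the precision values measured at the rank of each positive pair."""
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(_rank_by_similarity(similarities, labels), start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def precision_recall_breakeven(similarities, labels):
    """Precision at rank R, where R is the number of positive pairs.

    At that cut-off, precision and recall share the same numerator and
    denominator, so this is the precision-recall break-even point (PRBEP).
    """
    ranked = _rank_by_similarity(similarities, labels)
    n_pos = sum(1 for _, label in ranked if label)
    if n_pos == 0:
        return 0.0
    return sum(1 for _, label in ranked[:n_pos] if label) / n_pos


if __name__ == "__main__":
    # Toy example: four word pairs, two of which are same-word pairs (label 1).
    sims = [0.9, 0.8, 0.7, 0.6]
    same = [1, 0, 1, 0]
    print(round(average_precision(sims, same), 4))   # (1/1 + 2/3) / 2
    print(precision_recall_breakeven(sims, same))    # 1 positive in the top 2
```

Both metrics depend only on the ordering of the pairs, so any monotonic rescaling of the similarity scores leaves them unchanged.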

References (28)

1	ZHANG W Q, SONG B L, CAI M, et al. A query-by-example spoken term detection method based on phonetic posteriorgram [J]. Journal of Tianjin University (Science and Technology), 2015, 48(9): 757-760. (in Chinese)
2	HAZEN T J, SHEN W, WHITE C. Query-by-example spoken term detection using phonetic posteriorgram templates [C]// Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2009: 421-426. DOI: 10.1109/asru.2009.5372889
3	ZHANG Y, GLASS J R. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams [C]// Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2009: 398-403. DOI: 10.1109/asru.2009.5372931
4	MANTEENA G, ANGUERA X. Speed improvements to information retrieval-based dynamic time warping using hierarchical K-Means clustering [C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2013: 8515-8519. DOI: 10.1109/icassp.2013.6639327
5	LEUNG C C, WANG L, XU H, et al. Toward high-performance language-independent query-by-example spoken term detection for MediaEval 2015: post-evaluation analysis [C]// Proceedings of the INTERSPEECH 2016. [S.l.]: International Speech Communication Association, 2016: 3703-3707. DOI: 10.21437/interspeech.2016-691
6	LEVIN K, HENRY K, JANSEN A, et al. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings [C]// Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Piscataway: IEEE, 2013: 410-415. DOI: 10.1109/asru.2013.6707765
7	LEVIN K, JANSEN A, VAN DURME B. Segmental acoustic indexing for zero resource keyword search [C]// Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2015: 5828-5832. DOI: 10.1109/icassp.2015.7179089
8	SHEN F, DU C, YU K. Acoustic word embeddings for end-to-end speech synthesis [J]. Applied Sciences, 2021, 11(19): No.9010. DOI: 10.3390/app11199010
9	SHI B, SETTLE S, LIVESCU K. Whole-word segmental speech recognition with acoustic word embeddings [C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2021: 164-171. DOI: 10.1109/slt48900.2021.9383578
10	KAMPER H. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models [C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2019: 6535-6539. DOI: 10.1109/icassp.2019.8683639
11	KAMPER H, WANG W, LIVESCU K. Deep convolutional acoustic word embeddings using word-pair side information [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2016: 4950-4954. DOI: 10.1109/icassp.2016.7472619
12	HUANG J, GHARBIEH W, SHIM H S, et al. Query-by-example keyword spotting system using multi-head attention and soft-triple loss [C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2021: 6858-6862. DOI: 10.1109/icassp39728.2021.9414156
13	SETTLE S, LIVESCU K. Discriminative acoustic word embeddings: recurrent neural network-based approaches [C]// Proceedings of the 2016 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2016: 503-510. DOI: 10.1109/slt.2016.7846310
14	CHEN G, PARADA C, SAINATH T N. Query-by-example keyword spotting using long short-term memory networks [C]// Proceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2015: 5236-5240. DOI: 10.1109/icassp.2015.7178970
15	YUAN Y, LV Z, HUANG S, et al. Verifying deep keyword spotting detection with acoustic word embeddings [C]// Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 613-620. DOI: 10.1109/asru46091.2019.9003781
16	YUAN Y, XIE L, LEUNG C C, et al. Fast query-by-example speech search using attention-based deep binary embeddings [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1988-2000. DOI: 10.1109/taslp.2020.2998277
17	AO C W, LEE H Y. Query-by-example spoken term detection using attention-based multi-hop networks [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2018: 6264-6268. DOI: 10.1109/icassp.2018.8462570
18	ZHANG K, WU Z, JIA J, et al. Query-by-example spoken term detection using attentive pooling networks [C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 1267-1272. DOI: 10.1109/apsipaasc47483.2019.9023023
19	RAM D, MICULICICH L, BOURLARD H. CNN based query by example spoken term detection [C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 92-96. DOI: 10.21437/interspeech.2018-1722
20	RAM D, MICULICICH L, BOURLARD H. Neural network based end-to-end query by example spoken term detection [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1416-1427. DOI: 10.1109/taslp.2020.2988788
21	NAIK P, GAONKAR M N, THENKANIDIYOOR V, et al. Kernel based matching and a novel training approach for CNN-based QbE-STD [C]// Proceedings of the 2020 International Conference on Signal Processing and Communications. Piscataway: IEEE, 2020: 1-5. DOI: 10.1109/spcom50965.2020.9179588
22	HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. DOI: 10.1109/cvpr.2018.00745
23	YUAN Y, LEUNG C C, XIE L, et al. Query-by-example speech search using recurrent neural acoustic word embeddings with temporal context [J]. IEEE Access, 2019, 7: 67656-67665. DOI: 10.1109/access.2019.2918638
24	JACOBS C, MATUSEVYCH Y, KAMPER H. Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation [C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2021: 919-926. DOI: 10.1109/slt48900.2021.9383594
25	ZHANG Y, PARK D S, HAN W, et al. BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition [J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1519-1532. DOI: 10.1109/jstsp.2022.3182537
26	YANG Z, HIRSCHBERG J. Linguistically-informed training of acoustic word embeddings for low-resource languages [C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 2678-2682. DOI: 10.21437/interspeech.2019-3119
27	SHITOV D, PIROGOVA E, WYSOCKI T A, et al. Learning acoustic word embeddings with dynamic time warping triplet networks [J]. IEEE Access, 2020, 8: 103327-103338. DOI: 10.1109/access.2020.2999055
28	LI Z, WU L, LI T, et al. Improves neural acoustic word embeddings query by example spoken term detection with Wav2Vec pretraining and circle loss [C]// Proceedings of the 12th International Symposium on Chinese Spoken Language Processing. Piscataway: IEEE, 2021: 1-5. DOI: 10.1109/iscslp49672.2021.9362065
