Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 123-128.DOI: 10.11772/j.issn.1001-9081.2023010062

• Artificial intelligence •

Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG

  1. Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian, Liaoning 116622, China
  • Received: 2023-01-30 Revised: 2023-04-01 Accepted: 2023-04-07 Online: 2023-06-06 Published: 2024-01-10
  • Contact: Qiang ZHANG
  • About author:GAO Yunyun, born in 1997, M. S. candidate. Her research interests include deep learning and spoken term detection.
    ZHAO Lasheng, born in 1978, Ph. D., lecturer. His research interests include deep learning and speech signal processing.
    ZHANG Qiang, born in 1971, Ph. D., professor. His research interests include biological computing and artificial intelligence, and big data analysis and processing.
  • Supported by:
    Basic Scientific Research Project of Liaoning Provincial Department of Education(LJKMZ20221838)


Abstract:

In Query-by-Example Spoken Term Detection (QbE-STD), the speech information captured by Acoustic Word Embeddings (AWEs) extracted with a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN) is limited. To represent speech content better and improve model performance, an acoustic word embedding model based on Bi-directional Long Short-Term Memory (Bi-LSTM) and a convolutional-Transformer was proposed. Firstly, stacked Bi-LSTM layers were used to extract features and model the speech sequence, with the stacking improving the learning ability of the model. Secondly, to learn local information while capturing global information, a CNN and a Transformer encoder were connected in parallel to form the convolutional-Transformer, which exploits the complementary strengths of both in feature extraction to aggregate more effective information and improve the discriminability of the embeddings. Under the constraint of contrastive loss, the Average Precision (AP) of the proposed model reaches 94.36%, 1.76 percentage points higher than that of the attention-based Bi-LSTM model. Experimental results show that the proposed model effectively improves performance and achieves better QbE-STD.
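The pipeline described above — a stacked Bi-LSTM front end feeding parallel CNN and Transformer-encoder branches whose outputs are merged and pooled into a fixed-dimensional embedding — can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch; the class name `ConvTransformerAWE`, layer sizes, number of layers, and mean-pooling choice are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only: hyperparameters and pooling are assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn

class ConvTransformerAWE(nn.Module):
    def __init__(self, n_feats=39, hidden=256, embed_dim=128, n_heads=4):
        super().__init__()
        # Stacked Bi-LSTM models the acoustic frame sequence.
        self.bilstm = nn.LSTM(n_feats, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        d = 2 * hidden  # Bi-LSTM output dimension
        # Parallel branches: CNN captures local patterns,
        # Transformer encoder captures global context.
        self.conv = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.proj = nn.Linear(2 * d, embed_dim)

    def forward(self, x):           # x: (batch, frames, n_feats)
        h, _ = self.bilstm(x)       # (batch, frames, 2*hidden)
        local = self.conv(h.transpose(1, 2)).transpose(1, 2)
        global_ = self.transformer(h)
        merged = torch.cat([local, global_], dim=-1)
        emb = self.proj(merged.mean(dim=1))  # pool over time -> fixed size
        # Unit-norm embeddings are convenient for the contrastive
        # (distance-based) objective mentioned in the abstract.
        return nn.functional.normalize(emb, dim=-1)
```

At query time, such embeddings allow QbE-STD to be reduced to a nearest-neighbor comparison: a spoken query and a search segment are both embedded, and their cosine distance scores the match.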

Key words: Convolutional Neural Network (CNN), Acoustic Word Embedding (AWE), speech information, Query-by-Example Spoken Term Detection (QbE-STD), Recurrent Neural Network (RNN)

