Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 123-128. DOI: 10.11772/j.issn.1001-9081.2023010062
• Artificial intelligence •
Yunyun GAO, Lasheng ZHAO, Qiang ZHANG
Received: 2023-01-30
Revised: 2023-04-01
Accepted: 2023-04-07
Online: 2023-06-06
Published: 2024-01-10
Contact: Qiang ZHANG
About author: GAO Yunyun, born in 1997, from Yantai, Shandong, M. S. candidate. Her research interests include deep learning and spoken term detection.
Yunyun GAO, Lasheng ZHAO, Qiang ZHANG. Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer[J]. Journal of Computer Applications, 2024, 44(1): 123-128.
URL: http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023010062
No. | Model (with FFN) | AP/% | No. | Model (without FFN) | AP/% |
---|---|---|---|---|---|
M1 | Bi-LSTM + FFN | 88.62 | N1 | Bi-LSTM | 88.92 |
M2 | Transformer | 89.01 | N2 | Transformer | 90.92 |
M3 | 1D Conv + SE | 92.31 | N3 | 1D Conv + SE | 92.88 |
M4 | Swish -> ReLU | 92.52 | N4 | Swish -> ReLU | 93.16 |
M5 | Serial connection | 92.55 | N5 | Serial connection | 91.43 |
M6 | Proposed model (parallel) | 92.68 | N6 | Proposed model (parallel) | 93.53 |
Tab. 1 Ablation experiment results
Model | AP/% | PRBEP/% | KL divergence |
---|---|---|---|
LSTM( | 74.15 | 70.48 | 4.6583 |
Bi-LSTM( | 88.92 | 83.60 | 5.1296 |
Bi-LSTM+Attention( | 91.52 | 86.09 | 5.6371 |
Bi-LSTM+Attention( | 92.73 | 87.28 | 6.0472 |
Proposed model( | 93.53 | 87.39 | 6.2099 |
Proposed model( | 94.36 | 88.96 | 6.4672 |
Tab. 2 Comparative experiment results of different models