Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 123-128.DOI: 10.11772/j.issn.1001-9081.2023010062

• Artificial intelligence •

Acoustic word embedding model based on Bi-LSTM and convolutional-Transformer

Yunyun GAO, Lasheng ZHAO, Qiang ZHANG

  1. Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian, Liaoning 116622, China
  • Received: 2023-01-30 Revised: 2023-04-01 Accepted: 2023-04-07 Online: 2023-06-06 Published: 2024-01-10
  • Contact: Qiang ZHANG
  • About author:GAO Yunyun, born in 1997, M. S. candidate. Her research interests include deep learning and spoken term detection.
    ZHAO Lasheng, born in 1978, Ph. D., lecturer. His research interests include deep learning and speech signal processing.
    ZHANG Qiang, born in 1971, Ph. D., professor. His research interests include biological computing and artificial intelligence, and big data analysis and processing.
  • Supported by:
    Basic Scientific Research Project of Liaoning Provincial Department of Education(LJKMZ20221838)


Abstract:

In Query-by-Example Spoken Term Detection (QbE-STD), the speech information captured by Acoustic Word Embeddings (AWEs) extracted with a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN) is limited. To represent speech content better and improve model performance, an acoustic word embedding model based on Bi-directional Long Short-Term Memory (Bi-LSTM) and a convolutional-Transformer was proposed. Firstly, stacked Bi-LSTM layers were used to extract features and model the speech sequence, with the stacking improving the learning ability of the model. Secondly, to learn local information while capturing global information, a CNN and a Transformer encoder were connected in parallel to form the convolutional-Transformer, which exploits the complementary strengths of both in feature extraction to aggregate more effective information and improve the discriminability of the embeddings. Under the constraint of contrastive loss, the Average Precision (AP) of the proposed model reaches 94.36%, 1.76 percentage points higher than that of the attention-based Bi-LSTM model. Experimental results show that the proposed model effectively improves performance and achieves better QbE-STD.
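The pipeline described above — a stacked Bi-LSTM front end feeding parallel CNN and Transformer-encoder branches whose outputs are merged and pooled into a fixed-dimensional embedding — can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch; the class name `ConvTransformerAWE`, layer sizes, number of layers, and mean-pooling choice are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative sketch only: hyperparameters and pooling are assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn

class ConvTransformerAWE(nn.Module):
    def __init__(self, n_feats=39, hidden=256, embed_dim=128, n_heads=4):
        super().__init__()
        # Stacked Bi-LSTM models the acoustic frame sequence.
        self.bilstm = nn.LSTM(n_feats, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        d = 2 * hidden  # Bi-LSTM output dimension
        # Parallel branches: CNN captures local patterns,
        # Transformer encoder captures global context.
        self.conv = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.proj = nn.Linear(2 * d, embed_dim)

    def forward(self, x):           # x: (batch, frames, n_feats)
        h, _ = self.bilstm(x)       # (batch, frames, 2*hidden)
        local = self.conv(h.transpose(1, 2)).transpose(1, 2)
        global_ = self.transformer(h)
        merged = torch.cat([local, global_], dim=-1)
        emb = self.proj(merged.mean(dim=1))  # pool over time -> fixed size
        # Unit-norm embeddings are convenient for the contrastive
        # (distance-based) objective mentioned in the abstract.
        return nn.functional.normalize(emb, dim=-1)
```

At query time, such embeddings allow QbE-STD to be reduced to a nearest-neighbor comparison: a spoken query and a search segment are both embedded, and their cosine distance scores the match.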

Key words: Convolutional Neural Network (CNN), Acoustic Word Embedding (AWE), speech information, Query-by-Example Spoken Term Detection (QbE-STD), Recurrent Neural Network (RNN)

