Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (9): 2910-2918. DOI: 10.11772/j.issn.1001-9081.2022081149

• Multimedia Computing and Computer Simulation •


Triplet deep hashing method for speech retrieval

Qiuyu ZHANG, Yongwang WEN

  1. School of Computer and Communication,Lanzhou University of Technology,Lanzhou Gansu 730050,China
  • Received:2022-08-07 Revised:2022-11-14 Accepted:2022-11-25 Online:2023-01-18 Published:2023-09-10
  • Contact: Yongwang WEN
  • About author:ZHANG Qiuyu, born in 1966 in Xinji, Hebei, research fellow, PhD supervisor, CCF member. His research interests include network and information security, intelligent information processing, and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China(61862041)


Abstract:

The existing deep hashing methods for content-based speech retrieval make insufficient use of supervised information, generate suboptimal hash codes, and suffer from low retrieval precision and efficiency. To address these problems, a triplet deep hashing method for speech retrieval was proposed. Firstly, spectrogram image features were fed to the model in a triplet manner to extract the effective information of the speech features. Then, an Attentional mechanism-Residual Network (ARN) model was proposed, in which a spatial attention mechanism was embedded into the Residual Network (ResNet), and the salient region representation was improved by aggregating the energy-salient region information of the whole spectrogram. Finally, a novel triplet cross-entropy loss was introduced to map the classification information and the similarity between spectrogram image features into the learned hash codes, thereby achieving maximum class separability and maximum hash code separability during model training. Experimental results show that the efficient and compact binary hash codes generated by the proposed method achieve recall, precision, and F1 score all above 98.5% in speech retrieval. Compared with methods such as single-label retrieval, the average running time of the proposed method using Log-Mel spectrograms as features is shortened by 19.0% to 55.5%. Therefore, the proposed method can significantly improve retrieval efficiency and precision while reducing the amount of computation.
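To make the abstract's "triplet cross-entropy loss" idea concrete, the following is a minimal, generic sketch of a triplet-plus-cross-entropy objective, NOT the paper's exact formulation: the triplet term pulls the hash-like outputs of same-class spectrograms together and pushes different-class ones apart (hash code separability), while the cross-entropy term enforces class separability. All vectors, logits, and the margin value are illustrative assumptions.

```python
import math

def triplet_term(anchor, positive, negative, margin=2.0):
    """Hinge loss on squared Euclidean distances between hash-like outputs:
    penalizes the anchor being closer to the negative than to the positive."""
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_ap - d_an + margin)

def cross_entropy_term(logits, true_class):
    """Softmax cross-entropy on the anchor's class prediction."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[true_class] / sum(exps))

# Toy continuous network outputs (before binarization to hash codes).
anchor, positive, negative = [0.9, -0.8], [1.0, -0.9], [-0.7, 0.8]
total_loss = (triplet_term(anchor, positive, negative)
              + cross_entropy_term([2.0, 0.1], true_class=0))
print(round(total_loss, 3))
```

In a real training loop, the two terms would be weighted and back-propagated through the hashing network; at retrieval time, the continuous outputs are binarized and items are ranked by Hamming distance.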

Key words: speech retrieval, triplet deep hashing, attentional mechanism, spectrogram feature, triplet cross-entropy loss
