Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (9): 2910-2918. DOI: 10.11772/j.issn.1001-9081.2022081149

• Multimedia computing and computer simulation •

Triplet deep hashing method for speech retrieval

Qiuyu ZHANG, Yongwang WEN

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu 730050, China
  • Received: 2022-08-07  Revised: 2022-11-14  Accepted: 2022-11-25  Online: 2023-01-18  Published: 2023-09-10
  • Contact: Yongwang WEN
  • About author: ZHANG Qiuyu, born in 1966, research fellow. His research interests include network and information security, intelligent information processing, and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61862041)


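Qiuyu ZHANG (张秋余), Yongwang WEN (温永旺)

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050
  • Corresponding author: Yongwang WEN
  • About the author: ZHANG Qiuyu (born 1966), male, from Xinji, Hebei; research fellow, doctoral supervisor, CCF member. Main research interests: network and information security, intelligent information processing, pattern recognition.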


Existing deep hashing methods for content-based speech retrieval make insufficient use of supervised information and suffer from suboptimal hash codes, low retrieval precision and low retrieval efficiency. To address these problems, a triplet deep hashing method for speech retrieval was proposed. Firstly, spectrogram image features were fed to the model as triplets to extract the effective information in the speech features. Then, an Attentional mechanism-Residual Network (ARN) model was proposed: a spatial attention mechanism was embedded in a Residual Network (ResNet), and the representation of salient regions was improved by aggregating information from the energy-salient regions across the whole spectrogram. Finally, a novel triplet cross-entropy loss was introduced to map the classification information and the similarity between spectrogram image features into the learned hash codes, thereby maximizing class separability and hash code discriminability during model training. Experimental results show that the efficient and compact binary hash codes generated by the proposed method achieve recall, precision and F1 score all above 98.5% in speech retrieval. Compared with methods such as the single-label retrieval method, the average running time of the proposed method with Log-Mel spectra as features is shortened by 19.0% to 55.5%. Therefore, the proposed method significantly improves retrieval efficiency and retrieval precision while reducing the amount of computation.
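The abstract does not give the exact form of the triplet cross-entropy loss, so the following is only a minimal sketch of the general idea, assuming a standard hinge triplet term on the real-valued (pre-binarization) network outputs added to a softmax cross-entropy classification term. All function names, the `margin` and `alpha` hyperparameters, and the toy vectors are hypothetical and are not taken from the paper; the ARN feature extractor itself is not modeled.

```python
import math

def binarize(u):
    """Threshold real-valued outputs at zero to obtain a {0, 1} hash code."""
    return [1 if x >= 0 else 0 for x in u]

def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def sq_dist(u, v):
    """Squared Euclidean distance between two real-valued vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def triplet_ce_loss(anchor, positive, negative, logits, label,
                    margin=1.0, alpha=1.0):
    """Hinge triplet term plus a weighted classification term.

    margin and alpha are illustrative assumptions, not values from the paper.
    """
    triplet = max(0.0, sq_dist(anchor, positive)
                  - sq_dist(anchor, negative) + margin)
    return triplet + alpha * cross_entropy(logits, label)

def retrieve(query_code, database, k=1):
    """Return the k database keys nearest to the query in Hamming distance."""
    ranked = sorted(database, key=lambda name: hamming(query_code, database[name]))
    return ranked[:k]

# Toy triplet: the anchor and positive share a class, the negative does not.
anchor = [0.9, -0.8, 0.7, -0.6]
positive = [0.8, -0.9, 0.6, -0.7]
negative = [-0.9, 0.8, -0.7, 0.6]

loss = triplet_ce_loss(anchor, positive, negative, logits=[2.0, 0.1], label=0)
db = {"same_class": binarize(positive), "other_class": binarize(negative)}
print(loss, retrieve(binarize(anchor), db))
```

In this toy case the triplet term is already zero because the negative lies far from the anchor, so only the classification term contributes to the loss. Retrieval then reduces to ranking stored binary codes by Hamming distance, which is the kind of comparison that makes compact hash codes cheap to search.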

Key words: speech retrieval, triplet deep hashing, attentional mechanism, spectrogram feature, triplet cross-entropy loss



CLC Number: