Triplet deep hashing method for speech retrieval

doi:10.11772/j.issn.1001-9081.2022081149

Abstract

Abstract:

The existing deep hashing methods of content-based speech retrieval do not make enough use of supervised information and have the suboptimal generated hash codes， low retrieval precision and low retrieval efficiency. To address the above problems， a triplet deep hashing method for speech retrieval was proposed. Firstly， the spectrogram image features were used as the input of the model in triplet manner to extract the effective information of the speech feature. Then， an Attentional mechanism-Residual Network （ARN） model was proposed， that is， the spatial attention mechanism was embedded on the basis of the ResNet （Residual Network）， and the salient region representation was improved by aggregating the energy salient region information in the whole spectrogram. Finally， a novel triplet cross-entropy loss was introduced to map the classification information and similarity between spectrogram image features into the learned hash codes， thereby achieving the maximum class separability and maximal hash code discriminability during model training. Experimental results show that the efficient and compact binary hash codes generated by the proposed method has the recall， precision and F1 score of over 98.5% in speech retrieval. Compared with methods such as single-label retrieval method， the average running time of the proposed method using Log-Mel spectra as features is shorted by 19.0% to 55.5%. Therefore， this method can improve the retrieval efficiency and retrieval precision significantly while reducing the amount of computation.

Key words: speech retrieval, triplet deep hashing, attentional mechanism, spectrogram feature, triplet cross-entropy loss

摘要：

现有基于内容的语音检索中深度哈希方法对监督信息利用不足，生成的哈希码是次优的，而且检索精度和检索效率不高。针对以上问题，提出一种用于语音检索的三联体深度哈希方法。首先，将语谱图图像特征以三联体方式作为模型的输入来提取语音特征的有效信息；然后，提出注意力机制-残差网络（ARN）模型，即在残差网络（ResNet）的基础上嵌入空间注意力力机制，并通过聚集整个语谱图能量显著区域信息来提高显著区域表示；最后，引入新三联体交叉熵损失，将语谱图图像特征之间的分类信息和相似性映射到所学习的哈希码中，可在模型训练的同时实现最大的类可分性和最大的哈希码可分性。实验结果表明，所提方法生成的高效紧凑的二值哈希码使语音检索的查全率、查准率、F1分数均超过了98.5%。与单标签检索等方法相比，使用Log-Mel谱图作为特征的所提方法的平均运行时间缩短了19.0%~55.5%，能在减小计算量的同时，显著提高检索效率和精度。

关键词: 语音检索, 三联体深度哈希, 注意力机制, 语谱图特征, 三联体交叉熵损失

CLC Number:

TN912.34

Qiuyu ZHANG, Yongwang WEN. Triplet deep hashing method for speech retrieval[J]. Journal of Computer Applications, 2023, 43(9): 2910-2918.

张秋余, 温永旺. 用于语音检索的三联体深度哈希方法[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2910-2918.

Figures/Tables 13

References 29

1	NG W W Y， LI J Y， TIAN X， et al. Bit-wise attention deep complementary supervised hashing for image retrieval［J］. Multimedia Tools and Applications， 2022， 81（1）： 927-951. 10.1007/s11042-021-11494-8
2	万方，强浩鹏，雷光波. 自监督深度离散哈希图像检索［J］. 中国图象图形学报， 2021， 26（11）：2659-2669.
	WAN F， QIANG H P， LEI G B， et al. Self-supervised depth discrete hashing for image retrieval［J］. Journal of Image and Graphics， 2021， 26（11）： 2659-2669.
3	CAO R， ZHANG Q， ZHU J S， et al. Enhancing remote sensing image retrieval using a triplet deep metric learning network［J］. International Journal of Remote Sensing， 2020， 41（2）： 740-751. 10.1080/2150704x.2019.1647368
4	LI M Y， AN Z Y， WEI Q M， et al. Triplet deep hashing with joint supervised loss based on deep neural networks［J］. Computational Intelligence and Neuroscience， 2019， 2019： No.8490364. 10.1155/2019/8490364
5	黄羿博，王勇，张秋余，等. 基于混沌测量矩阵的生物哈希密文语音检索［J］. 华中科技大学学报（自然科学版）， 2020， 48（12）：32-37.
	HUANG Y B， WANG Y， ZHANG Q Y， et al. Biohashing encrypted speech retrieval based on chaotic measurement matrix［J］. Journal of Huazhong University of Science and Technology （Natural Science Edition）， 2020， 48（12）： 32-37.
6	ZHANG Q Y， ZHAO X J， ZHANG Q W， et al. Content-based encrypted speech retrieval scheme with deep hashing［J］. Multimedia Tools and Applications， 2022， 81（7）： 10221-10242. 10.1007/s11042-022-12123-8
7	MASALSKI M， ADAMCZYK M， MORAWSKI K. Optimization of the speech test material in a group of hearing impaired subjects： a feasibility study for multilingual digit triplet test development［J］. Audiology Research， 2021， 11（3）： 342-356. 10.3390/audiolres11030032
8	FAN L， JIANG Q Y， YU Y Q， et al. Deep hashing for speaker identification and retrieval［C］// Proceeding of the INTERSPEECH 2019. ［S.l.］： International Speech Communication Association， 2019： 2908-2912. 10.21437/interspeech.2019-2457
9	DENG M Q， MENG T T， CAO J W， et al. Heart sound classification based on improved MFCC features and convolutional recurrent neural networks［J］. Neural Networks， 2020， 130： 22-32. 10.1016/j.neunet.2020.06.015
10	SAFI M E， ABBAS E I. Isolated word recognition based on PNCC with different classifiers in a noisy environment［J］. Applied Acoustics， 2022， 195： No.108848. 10.1016/j.apacoust.2022.108848
11	沈侃文，李文钧，岳克强. 融合LPCC和MFCC的支持向量机OSAHS鼾声识别［J］. 杭州电子科技大学学报（自然科学版）， 2020， 40（6）：1-5， 12. 10.13954/j.cnki.hdu.2020.06.001
	SHEN K W， LI W J， YUE K Q. Support vector machine OSAHS snoring recognition based on LPCC and MFCC［J］. Journal of Hangzhou Dianzi University （Natural Sciences）， 2020， 40（6）： 1-5， 12. 10.13954/j.cnki.hdu.2020.06.001
12	JIA Y J， CHEN X， YU J Q， et al. Speaker recognition based on characteristic spectrograms and an improved self-organizing feature map neural network［J］. Complex and Intelligent Systems， 2021， 7（4）： 1749-1757. 10.1007/s40747-020-00172-1
13	LI Q， SUN Z N， HE R， et al. A general framework for deep supervised discrete hashing［J］. International Journal of Computer Vision， 2020， 128（8/9）： 2204-2222. 10.1007/s11263-020-01327-w
14	LONG J， WEI X X， QI Q Q， et al. A deep hashing method based on attention module for image retrieval［C］// Proceeding of the 13th International Conference on Intelligent Computation Technology and Automation. Piscataway： IEEE， 2020： 284-288. 10.1109/icicta51737.2020.00066
15	LIAO W T， YANG M Y， ZHAN N， et al. Triplet-based deep similarity learning for person re-identification［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway： IEEE， 2017： 385-393. 10.1109/iccvw.2017.52
16	ZHAO H， HE S F. A retrieval algorithm for encrypted speech based on perceptual hashing［C］// Proceeding of the 12th International Conference on Natural Computation， Fuzzy Systems and Knowledge Discovery. Piscataway： IEEE， 2016： 1840-1845. 10.1109/fskd.2016.7603458
17	HE S F， ZHAO H. A retrieval algorithm of encrypted speech based on syllable-level perceptual hashing［J］. Computer Science and Information Systems， 2017， 14（3）： 703-718. 10.2298/csis170112024h
18	HUANG Y B， WANG Y， ZHANG Q Y， et al. Multi-format speech BioHashing based on spectrogram［J］. Multimedia Tools and Applications， 2020， 79（33/34）： 24889-24909. 10.1007/s11042-020-09211-y
19	WANG C Y， TAI T C， WANG J C， et al. Sound events recognition and retrieval using multi-convolutional-channel sparse coding convolutional neural networks［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2020， 28：1875-1887. 10.1109/taslp.2020.2964959
20	HE K M， ZHANG X Y， REN S Q， et al. Deep residual learning for image recognition［C］// Proceeding of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2016： 770-778. 10.1109/cvpr.2016.90
21	WOO S， PARK J， LEE J Y， et al. CBAM： convolutional block attention module［C］// Proceeding of the 2018 European Conference on Computer Vision， LNCS 11211. Cham： Springer， 2018： 3-19.
22	WANG D， ZHANG X W. THCHS-30： a free Chinese speech corpus［EB/OL］. （2015-12-10）［2022-05-11］..
23	ARIAS-VERGARA T， KLUMPP P， VASQUEZ-CORREA J C， et al. Multi-channel spectrograms for speech processing applications using deep learning methods［J］. Pattern Analysis and Applications， 2021， 24（2）： 423-431. 10.1007/s10044-020-00921-5
24	ZHENG X T， ZHANG Y C， LU X Q. Deep balanced discrete hashing for image retrieval［J］. Neurocomputing， 2020， 403： 224-236. 10.1016/j.neucom.2020.04.037
25	ZHANG Z， ZOU Q， LIN Y W， et al. Improved deep hashing with soft pairwise similarity for multi-label image retrieval［J］. IEEE Transactions on Multimedia， 2020， 22（2）： 540-553. 10.1109/tmm.2019.2929957
26	ZHANG Q Y， ZHAO X J， HU Y J. A classification retrieval method for encrypted speech based on deep neural network and deep hashing［J］. IEEE Access， 2020， 8： 202469-202482. 10.1109/access.2020.3036048
27	LI W， CHEN Y M， HU H S， et al. Using granule to search privacy preserving voice in home IoT systems［J］. IEEE Access， 2020， 8： 31957-31969. 10.1109/access.2020.2972975
28	LI W， XIAO Y Z， TANG C， et al. Multi-user searchable encryption voice in home IoT system［J］. Internet of Things， 2020， 11： No.100180. 10.1016/j.iot.2020.100180
29	ZHANG Q Y， BAI J， XU F J， et al. A retrieval method for encrypted speech based on improved power normalized cepstrum coefficients and perceptual hashing［J］. Multimedia Tools and Applications， 2022， 81（11）： 15127-15151. 10.1007/s11042-022-12560-5

组成	参数设置
net1模块	Conv2D（kernel_size=（3，3），activation='ReLU'，stride=2， input-shape=（3，256，256））
	BatchNormalization（axis=-1）
	Conv2D（kernel_size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	MaxPooling2D（kernel_ size=3， stride=1）
SA模块	Conv2D（kernel_ size=（3，3），stride=2，padding=1）
SA模块	Sigmoid（）
net2模块	Conv2D（kernel_ size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	Conv2D（kernel_ size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	Average Pooling2D（kernel_size=（3，3）， stride=1）
FC层	Flattens（）
Dropout	Dropout（0.5）
哈希层	Dense（64，activation='ReLu'）
输出层	Dense（10，activation='softmax'）

组成	参数设置
net1模块	Conv2D（kernel_size=（3，3），activation='ReLU'，stride=2， input-shape=（3，256，256））
	BatchNormalization（axis=-1）
	Conv2D（kernel_size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	MaxPooling2D（kernel_ size=3， stride=1）
SA模块	Conv2D（kernel_ size=（3，3），stride=2，padding=1）
SA模块	Sigmoid（）
net2模块	Conv2D（kernel_ size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	Conv2D（kernel_ size=（3，3）， activation='ReLU'，stride=2）
	BatchNormalization（axis=-1）
	Average Pooling2D（kernel_size=（3，3）， stride=1）
FC层	Flattens（）
Dropout	Dropout（0.5）
哈希层	Dense（64，activation='ReLu'）
输出层	Dense（10，activation='softmax'）

图像特征	哈希码长度	不同方式的mAP/%
图像特征	哈希码长度	三联体标签	单标签
语谱图	64	98.86	89.70
	128	98.53	92.04
	256	98.20	94.50
	384	98.20	96.96
	512	95.20	96.96
Log-Mel谱图	64	98.86	90.99
	128	98.86	94.94
	256	98.69	96.81
	384	98.69	98.00
	512	98.66	98.03

图像特征	哈希码长度	不同方式的mAP/%
图像特征	哈希码长度	三联体标签	单标签
语谱图	64	98.86	89.70
	128	98.53	92.04
	256	98.20	94.50
	384	98.20	96.96
	512	95.20	96.96
Log-Mel谱图	64	98.86	90.99
	128	98.86	94.94
	256	98.69	96.81
	384	98.69	98.00
	512	98.66	98.03

方法	语音特征	网络模型	不同哈希码长度下的mAP/%
方法	语音特征	网络模型	k=64	k=128	k=256	k=384	k=512
本文方法	语谱图	ARN	98.86	98.53	98.20	98.20	95.20
本文方法	Log-Mel谱图	ARN	98.86	98.86	98.69	98.69	98.66
文献［6］方法	语谱图	CNN	89.09	92.81	94.50	96.31	96.96
文献［24］方法	语谱图	VGG16	85.00	84.40	82.20	82.20	80.40
文献［24］方法	Log-Mel谱图	VGG16	82.50	81.90	81.90	81.90	81.70
文献［24］方法	语谱图	3LCNN	78.31	75.82	74.59	74.23	73.15
文献［24］方法	Log-Mel谱图	3LCNN	80.05	79.83	79.27	78.56	77.96
文献［25］方法	语谱图	VGG16	86.70	86.20	85.20	85.10	84.00
文献［25］方法	Log-Mel谱图	VGG16	86.00	85.30	85.00	85.00	84.90
文献［26］方法	Log-Mel谱图	CNN	88.17	90.92	93.22	94.94	95.42
文献［26］方法	Log-Mel谱图	CRNN	88.33	91.03	94.36	95.56	95.94