Speech deception detection algorithm based on denoising autoencoder and long short-term memory network

doi:10.11772/j.issn.1001-9081.2019071183

Abstract

To further improve the performance of speech deception detection, a speech deception detection algorithm based on a Denoising AutoEncoder (DAE) and a Long Short-Term Memory (LSTM) network was proposed. Firstly, a parallel structure of DAE and LSTM, named PDL (Parallel connection of DAE and LSTM), was constructed. Then, hand-crafted features were extracted from the speech and fed into the DAE to obtain more robust features. Simultaneously, the Mel spectra extracted after framing and windowing the speech were input into the LSTM frame by frame for frame-level deep feature learning. Finally, the two types of features were fused through a fully connected layer and batch normalization, and a softmax classifier was used for deception recognition. Experimental results on the CSC (Columbia-SRI-Colorado) corpus and a self-built corpus show that the recognition accuracy with the fused features is 65.18% and 68.04% respectively, up to 5.56 and 7.22 percentage points higher than those of the comparison algorithms, indicating that the proposed algorithm can effectively improve the accuracy of deception recognition.

Key words: Denoising AutoEncoder (DAE), Long Short-Term Memory (LSTM) network, speech feature, feature fusion, deception detection

CLC Number:

TP391.41

Hongliang FU, Peizhi LEI. Speech deception detection algorithm based on denoising autoencoder and long short-term memory network[J]. Journal of Computer Applications, 2020, 40(2): 589-594.

Figures/Tables 13

Tab. 1 Feature set of 2009 International speech emotion recognition challenge

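Base feature set	Features and functionals included
LLD $(16 × 2)$	Root-mean-square energy, fundamental frequency, zero-crossing rate, harmonics-to-noise ratio, Mel-frequency cepstral coefficients 1-12
HLSF (12)	Standard deviation, kurtosis, skewness, mean, maximum and minimum values, relative position, range, extreme values, slope, offset, mean square error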
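
The dimensionality implied by Tab. 1 can be checked with a short calculation. The sketch below assumes, following the usual reading of the INTERSPEECH 2009 feature set, that the "× 2" denotes each low-level descriptor together with its first-order delta; the variable names are illustrative only.

```python
# Dimensionality of the utterance-level feature vector implied by Tab. 1.
n_lld = 16          # RMS energy, F0, ZCR, HNR, MFCC 1-12
n_streams = 2       # assumption: each LLD plus its first-order delta
n_functionals = 12  # statistical functionals (HLSF) applied to every contour

feature_dim = n_lld * n_streams * n_functionals
print(feature_dim)  # 384
```

The result, 384, agrees with the DAE input size listed in Tab. 3.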

Fig. 1 Mel spectrum of truth and deception
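
As a rough illustration of how frame-level Mel features of the kind shown in Fig. 1 might be computed, the sketch below frames and windows a signal and applies a triangular Mel filterbank. The sampling rate, frame length, hop size, FFT size and Hamming window are all assumptions that the source does not specify; 64 Mel bands is chosen only because it matches the LSTM input size in Tab. 3.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_frames(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=64):
    """Frame the signal, apply a Hamming window, and return log-Mel energies per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

# One second of a 440 Hz tone stands in for a speech recording.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = log_mel_frames(tone)
print(feats.shape)  # (98, 64): 98 frames, 64 Mel bands per frame
```

Each row of `feats` is one frame, which is the unit that would be passed to the LSTM step by step.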

Fig. 2 Denoising autoencoder

Fig. 3 Structure of LSTM

Fig. 4 Overall framework of the proposed algorithm

Fig. 5 Extracting frame-level features with LSTM

Tab. 2 Number of players in games

Game	Male	Female	Total
Werewolf game	23	16	39
Killer game	40	24	64

Tab. 3 Parameters of model

Network	Layer	Number of units
DAE	Input	384
	Encoder_1	512
	Encoder_2	1024
	Decoder_1	512
	Decoder_2	384
LSTM	Input	64
	Hidden layer	1024
	Average	1024
Total output		2048
Fully connected layer		1024
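
The layer sizes in Tab. 3 can be tied together in a forward-pass sketch of the parallel structure. This is a minimal, untrained NumPy illustration and not the authors' implementation: the ReLU activations, Gaussian corruption, the order of batch normalization and activation, and the use of the 1024-unit DAE code as the fused DAE feature (inferred from the 2048-unit total output) are all assumptions. The decoder layers, which would only matter for reconstruction training, are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Small random weights and a zero bias (untrained, for shape checking only)."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(hidden=1024):
    return {
        "enc": [dense(384, 512), dense(512, 1024)],  # DAE encoder sizes from Tab. 3
        "lstm": dense(64 + hidden, 4 * hidden),      # input, forget, output, candidate gates
        "fc": dense(2048, 1024),                     # fusion layer
        "out": dense(1024, 2),                       # truth vs. deception
    }

def dae_encode(x, layers, noise_std=0.1):
    # Assumed Gaussian corruption; normally applied only while training.
    h = x + rng.normal(0.0, noise_std, x.shape)
    for w, b in layers:
        h = np.maximum(h @ w + b, 0.0)  # assumed ReLU
    return h

def lstm_mean(frames, w, b):
    """Single-layer LSTM over (batch, time, 64); returns the time-averaged hidden state."""
    batch, steps, _ = frames.shape
    hidden = b.size // 4
    h = np.zeros((batch, hidden))
    c = np.zeros((batch, hidden))
    total = np.zeros((batch, hidden))
    for t in range(steps):
        z = np.concatenate([frames[:, t, :], h], axis=1) @ w + b
        i, f, o, g = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        total += h
    return total / steps

def batch_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pdl_forward(handcrafted, mel_frames, p):
    dae_feat = dae_encode(handcrafted, p["enc"])           # (batch, 1024)
    lstm_feat = lstm_mean(mel_frames, *p["lstm"])          # (batch, 1024)
    fused = np.concatenate([dae_feat, lstm_feat], axis=1)  # (batch, 2048)
    w_fc, b_fc = p["fc"]
    h = np.maximum(batch_norm(fused @ w_fc + b_fc), 0.0)   # FC, BN, then assumed ReLU
    w_out, b_out = p["out"]
    return softmax(h @ w_out + b_out)

params = init_params()
probs = pdl_forward(rng.normal(size=(4, 384)), rng.normal(size=(4, 10, 64)), params)
print(probs.shape)  # (4, 2)
```

The random inputs stand in for four utterances, each with a 384-dimensional hand-crafted vector and ten 64-band Mel frames.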

Tab. 4 Recognition accuracy of different models

Corpus	Model	WA/%	UA/%
CSC	PDL-DAE	62.22	58.46
	PDL-LSTM	63.51	59.98
	PDL	65.18	62.56
Killer	PDL-DAE	62.88	59.74
	PDL-LSTM	64.94	62.11
	PDL	68.04	65.35
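
The source does not define WA and UA. The sketch below assumes the definitions customary in speech emotion and deception recognition: WA (weighted accuracy) as overall accuracy and UA (unweighted accuracy) as the mean of per-class recalls. The labels are invented solely to show how the two measures diverge on imbalanced data.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [k for k, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[k] == label for k in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Invented labels: 0 = truthful (6 samples), 1 = deceptive (4 samples).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")  # WA = 0.7000, UA = 0.6667
```

Under these definitions UA falls below WA whenever the minority class is recognized less reliably, which is consistent with the pattern in Tab. 4.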

Fig. 6 Convergence curves on different corpora

Tab. 5 T-test of test results

Model pair	p-value on CSC	p-value on Killer
(PDL, DAE)	$< 0.001$	$< 0.001$
(PDL, LSTM)	$< 0.001$	$< 0.001$
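
The source does not state which t-test variant produced Tab. 5. As one possibility, the sketch below computes a paired t statistic over repeated runs of two models. The per-run accuracies are hypothetical values invented for illustration and are not results from the paper.

```python
import math
import statistics

def paired_t(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-run WA values (%), invented purely for illustration.
pdl_runs = [65.0, 65.4, 64.9, 65.3, 65.2]
dae_runs = [62.1, 62.4, 62.0, 62.3, 62.2]

t_stat, dof = paired_t(pdl_runs, dae_runs)
print(f"t = {t_stat:.1f}, dof = {dof}")
```

A p-value then follows from the t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel` if SciPy is available.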

Tab. 6 Recognition accuracy with and without DAE

Corpus	Processing method	WA/%	UA/%
CSC	Direct fusion	63.89	60.08
CSC	Proposed algorithm	65.18	62.56
Killer	Direct fusion	65.97	62.46
Killer	Proposed algorithm	68.04	65.35

Tab. 7 Comparison of recognition accuracy and single-utterance recognition time of different deception detection methods

Corpus	Method	WA/%	UA/%	Recognition time per utterance/s
CSC	SVM	59.62	53.20	0.629×10^-3
	DNN	60.79	57.08	0.351×10^-3
	SAE	62.27	57.86	0.370×10^-3
	DBN-ELM	62.58	59.21	0.443×10^-3
	CNN	63.13	60.03	0.237×10^-2
	Proposed algorithm	65.18	62.56	0.785×10^-2
Killer	SVM	60.82	55.68	0.268×10^-2
	DNN	61.45	58.13	0.164×10^-2
	SAE	61.89	59.25	0.135×10^-2
	DBN-ELM	63.40	61.03	0.144×10^-2
	CNN	64.02	61.56	0.690×10^-2
	Proposed algorithm	68.04	65.35	0.197×10^-1
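
The maximum gains quoted in the abstract can be recovered from the WA column of Tab. 7. The short check below takes the WA values from the table and reports the largest absolute difference between the proposed algorithm and any baseline; the key "PDL" for the proposed algorithm and the helper name `max_gain` are illustrative choices.

```python
# WA (%) taken from Tab. 7; "PDL" denotes the proposed algorithm.
wa = {
    "CSC": {"SVM": 59.62, "DNN": 60.79, "SAE": 62.27,
            "DBN-ELM": 62.58, "CNN": 63.13, "PDL": 65.18},
    "Killer": {"SVM": 60.82, "DNN": 61.45, "SAE": 61.89,
               "DBN-ELM": 63.40, "CNN": 64.02, "PDL": 68.04},
}

def max_gain(scores, proposed="PDL"):
    """Return the weakest baseline and the proposed method's margin over it."""
    baselines = {k: v for k, v in scores.items() if k != proposed}
    weakest = min(baselines, key=baselines.get)
    return weakest, round(scores[proposed] - baselines[weakest], 2)

gains = {corpus: max_gain(scores) for corpus, scores in wa.items()}
print(gains)  # {'CSC': ('SVM', 5.56), 'Killer': ('SVM', 7.22)}
```

Both maxima are absolute differences in percentage points relative to the SVM baseline.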
