Disguised voice detection method based on inverted Mel-frequency cepstral coefficient

doi:10.11772/j.issn.1001-9081.2019050870

Journal of Computer Applications, 2019, Vol. 39, Issue (12): 3510-3514.

LIN Xiaodan, QIU Yingqiang

College of Information Science and Engineering, Huaqiao University, Xiamen Fujian 361021, China

Received: 2019-05-23 Revised: 2019-06-20 Online: 2019-07-29 Published: 2019-12-10
Contact: LIN Xiaodan
About the authors: LIN Xiaodan, born in 1983 in Quanzhou, Fujian, Ph.D., lecturer. Her research interests include multimedia forensics and signal processing. QIU Yingqiang, born in 1981 in Longyan, Fujian, Ph.D., associate professor. His research interests include information hiding.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61871434) and the Scientific Research Fund of Huaqiao University (Y19060).

Abstract

Pitch shifting is commonly used to conceal a speaker's identity, and the wide availability of voice-changing software has made such disguise easier. Existing detection methods for pitch-shifted speech cannot determine which operation (pitch raising or pitch lowering) was applied. To address this problem, the traces that pitch shifting leaves in the signal spectrum, especially in the high-frequency region, were analyzed, and an electronic disguised voice detection method based on statistical moment features derived from the Inverted Mel-Frequency Cepstral Coefficient (IMFCC) was proposed. Firstly, the IMFCC and its first-order difference were extracted from each voice frame. Then, their statistical means were calculated. Finally, a Support Vector Machine (SVM) multi-class classifier was built on these statistical features to distinguish original, pitch-raised and pitch-lowered voice. Experimental results on the TIMIT and NIST voice datasets show that the proposed method performs well in detecting original, pitch-raised and pitch-lowered voice signals. Compared with the baseline system that uses Mel-Frequency Cepstral Coefficient (MFCC) features, the proposed features significantly increase the recognition rate of the disguise operation. The method also outperforms a Convolutional Neural Network (CNN) based framework when limited training data is available, and it generalizes well across different datasets and different pitch-shifting methods.

Key words: voice disguise, inverted Mel-frequency, cepstral coefficient, statistical moment, multi-classification
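
The feature-extraction steps named in the abstract (per-frame IMFCC, first-order difference, statistical mean) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the frame length, hop size, FFT size, number of filters and cepstral coefficients, the plain frame-to-frame difference, and all function names are hypothetical choices made for the example. The inverted filter bank is assumed to be a standard triangular mel filter bank flipped along both axes, so that the narrow filters fall in the high-frequency region.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):      # rising edge
            fb[i, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge
            fb[i, k] = (hi - k) / (hi - mid)
    return fb

def frame_signal(x, frame_len, hop):
    """Split x into overlapping Hamming-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def dct_matrix(n_ceps, n_filters):
    """Unnormalized DCT-II basis, shape (n_ceps, n_filters)."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))

def imfcc(x, sr, n_fft=512, frame_len=400, hop=160, n_filters=24, n_ceps=12):
    """Per-frame IMFCC, shape (n_frames, n_ceps). Defaults are assumptions."""
    frames = frame_signal(x, frame_len, hop)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Assumed inversion: flip filter order and frequency axis of the mel bank.
    inv_fb = mel_filterbank(n_filters, n_fft, sr)[::-1, ::-1]
    log_energy = np.log(power @ inv_fb.T + 1e-10)
    return log_energy @ dct_matrix(n_ceps, n_filters).T

def imfcc_mean_feature(x, sr):
    """Concatenate the means of the IMFCC and of its first-order difference."""
    c = imfcc(x, sr)
    delta = np.diff(c, axis=0)  # simple frame-to-frame difference (assumption)
    return np.concatenate([c.mean(axis=0), delta.mean(axis=0)])
```

With these defaults, `imfcc_mean_feature(x, 16000)` returns one 24-dimensional vector per utterance. Such vectors could then be passed to a multi-class SVM (for example scikit-learn's `SVC`) trained on the three labels original, pitch-raised and pitch-lowered.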

CLC Number: TN912.3

LIN Xiaodan, QIU Yingqiang. Disguised voice detection method based on inverted Mel-frequency cepstral coefficient[J]. Journal of Computer Applications, 2019, 39(12): 3510-3514.


