Journal of Computer Applications, 2019, Vol. 39, Issue (12): 3510-3514. DOI: 10.11772/j.issn.1001-9081.2019050870

• Artificial Intelligence •

Disguised voice detection method based on inverted Mel-frequency cepstral coefficient

LIN Xiaodan, QIU Yingqiang

  1. College of Information Science and Engineering, Huaqiao University, Xiamen Fujian 361021, China
  • Received: 2019-05-23 Revised: 2019-06-20 Online: 2019-07-29 Published: 2019-12-10
  • About the authors: LIN Xiaodan (born 1983), female, born in Quanzhou, Fujian, lecturer, Ph. D. Her research interests include multimedia forensics and signal processing. QIU Yingqiang (born 1981), male, born in Longyan, Fujian, associate professor, Ph. D. His research interests include information hiding.
  • Contact: LIN Xiaodan
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61871434) and the Scientific Research Fund of Huaqiao University (Y19060).



Abstract: Voice disguise through pitch shifting is commonly used to conceal a speaker's identity, and the wide availability of voice changers makes such disguise substantially easier. Existing disguised voice detection methods cannot tell which operation (pitch raising or pitch lowering) was applied. To determine simultaneously whether a speech signal has been pitch-shifted and how it was modified (pitch-raised or pitch-lowered), the traces that electronic disguise leaves in the signal spectrum, especially in the high-frequency region, were analyzed, and an electronic disguised voice detection method based on statistical moment features derived from Inverted Mel-Frequency Cepstral Coefficients (IMFCC) was proposed. Firstly, the IMFCC and their first-order differences were extracted from each voice frame. Then, their statistical means were calculated. Finally, a multi-class Support Vector Machine (SVM) classifier was designed on these statistical features to distinguish original, pitch-raised and pitch-lowered voices. Experimental results on the TIMIT and NIST voice datasets show that the proposed method achieves good detection performance on original, pitch-raised and pitch-lowered voice signals alike. Compared with a baseline system that uses MFCC features, the proposed features significantly increase the recognition rate of the disguise operation. The method also outperforms a Convolutional Neural Network (CNN) based framework when only limited training data is available. Extensive experiments further demonstrate that the proposed method generalizes well across different datasets and different disguising methods.
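The abstract does not specify the filterbank construction or any analysis parameters. The sketch below, assuming NumPy, shows one common way to compute IMFCC-based utterance features: build a standard triangular Mel filterbank and flip it in both filter order and frequency so that narrow, densely spaced filters sit at high frequencies, then take log filter energies and a DCT-II, compute first-order (delta) coefficients, and average both over frames. The 25 ms frame, 10 ms hop, 512-point FFT, 26 filters, 13 coefficients, 0.97 pre-emphasis and the regression-style delta with N = 2 are illustrative assumptions rather than values from the paper, and all function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style Mel scale
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Standard triangular Mel filterbank, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):  # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb


def inverted_mel_filterbank(n_filters, n_fft, sr):
    # Flip filter order and frequency axis: narrow filters now sit at high frequencies
    return mel_filterbank(n_filters, n_fft, sr)[::-1, ::-1]


def dct_ii(x, n_out):
    # Unnormalised DCT-II along the last axis, keeping the first n_out coefficients
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return x @ np.cos(np.pi * k * (2 * m + 1) / (2 * n)).T


def imfcc(x, sr=16000, n_fft=512, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Per-frame IMFCC matrix of shape (n_frames, n_ceps)."""
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])  # pre-emphasis (assumed)
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    if len(x) < frame_len:  # pad very short inputs to one full frame
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ inverted_mel_filterbank(n_filters, n_fft, sr).T
    return dct_ii(np.log(np.maximum(energies, 1e-10)), n_ceps)


def delta(c, N=2):
    # Regression-style first-order difference over +/-N frames (edge frames repeated)
    T = c.shape[0]
    padded = np.pad(c, ((N, N), (0, 0)), mode="edge")
    num = sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T]) for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))


def imfcc_mean_features(x, **kwargs):
    """Utterance-level feature: mean IMFCC concatenated with mean delta-IMFCC."""
    c = imfcc(x, **kwargs)
    return np.concatenate([c.mean(axis=0), delta(c).mean(axis=0)])


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    rng = np.random.default_rng(0)
    # Synthetic tone plus light noise as a stand-in for one second of speech
    test_signal = np.sin(2 * np.pi * 220 * t) + 0.01 * rng.standard_normal(sr)
    feats = imfcc_mean_features(test_signal, sr=sr)
    print(feats.shape)  # (26,)
```

Each utterance thus maps to a fixed-length vector (2 × n_ceps) regardless of its duration, which is what makes a conventional classifier applicable. In practice such vectors would be fed to a three-class SVM (original, pitch-raised, pitch-lowered); scikit-learn's `SVC`, for instance, handles multi-class problems with a one-vs-one scheme internally, though the paper's exact classifier configuration is not given here.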

Key words: voice disguise, inverted Mel-frequency, cepstral coefficient, statistical moment, multi-classification
