Open-set source cell-phone identification method based on feature interaction and representation enhancement

Journal of Computer Applications, 2025, Vol. 45, Issue (12): 3813-3819. DOI: 10.11772/j.issn.1001-9081.2024121815

Feng YUE¹,²,³, Yang PENG¹,⁴, Zhaopin SU¹,²,⁴, Guofu ZHANG¹,³,⁴, Chensi LIAN³,⁵, Bo YANG³,⁵, Zhen FANG¹,⁴

1. School of Computer and Information Technology, Hefei University of Technology, Hefei Anhui 230601, China
2. Intelligent Interconnected Systems Provincial Laboratory of Anhui (Hefei University of Technology), Hefei Anhui 230009, China
3. Joint Laboratory of Intelligent Prevention and Recognition of Audio and Video (Hefei University of Technology), Hefei Anhui 230009, China
4. Anhui Provincial Key Laboratory of Industrial Safety and Emergency Technology (Hefei University of Technology), Hefei Anhui 230009, China
5. Department of Physical Evidence Identification, Anhui Public Security Department, Hefei Anhui 230009, China

Received: 2024-12-26 Revised: 2025-01-07 Accepted: 2025-01-10 Online: 2025-01-15 Published: 2025-12-10
Corresponding author: Zhaopin SU
About the authors:
YUE Feng, born in 1981, from Hefei, Anhui, Ph. D., associate research fellow. His research interests include swarm intelligence and software engineering.
PENG Yang, born in 1999, from Xingtai, Hebei, M. S. candidate. His research interests include voiceprint recognition.
ZHANG Guofu, born in 1979, from Hefei, Anhui, Ph. D., professor, CCF member. His research interests include speech security.
LIAN Chensi, born in 1984, from Hefei, Anhui, M. S., senior engineer. Her research interests include voiceprint and synthetic speech identification.
YANG Bo, born in 1985, from Hefei, Anhui, M. S., senior engineer. His research interests include voiceprint and synthetic speech identification.
FANG Zhen, born in 2000, from Bozhou, Anhui, M. S. candidate. His research interests include voiceprint recognition.
First contact: SU Zhaopin, born in 1983, from Heze, Shandong, Ph. D., associate professor, CCF member. Her research interests include complex intelligent systems and multimedia security.
Supported by:
Humanities and Social Sciences Research Planning Fund of the Ministry of Education (24YJA870011); Key Research and Development Program of Anhui Province (202104d07020001); Natural Science Foundation of Anhui Province (2208085MF166)

Abstract

Multimedia forensics based on cell-phone speech has long been an active research topic. However, existing speech-based source cell-phone identification methods are confined to the closed-set mode, in which the training set and the test set share the same set of categories. They cannot guarantee recognition accuracy for cell-phones of unknown categories and therefore cannot be applied directly to unknown cell-phones. To address this problem, an Open-set Source Cell-phone Identification method based on Feature interaction and representation enhancement (FireOSCI) was proposed. Firstly, a global feature extraction block named GlobalBlock was designed on the basis of the multi-head attention block Fastformer to better capture the global information of the whole speech sample and obtain rich device feature information. Secondly, a local feature extraction block named LocalBlocks was designed on the basis of SE-Res2Block (Squeeze-Excitation Res2Block) to enhance the features related to cell-phone information and suppress the features unrelated to source cell-phone identification. Thirdly, an attention-based feature fusion mechanism was designed to deeply fuse the global features with the multi-layer local features. Finally, a source cell-phone confirmation network was designed on the basis of attention pooling to improve the recognition accuracy in open-set mode. Results of comparison experiments on a cell-phone speech dataset covering 13 cell-phone brands and 86 cell-phone models show that the proposed method can identify cell-phones of unknown categories and provides a reference technical solution for open-set recognition of speech-based source cell-phones.

Key words: source cell-phone, open-set recognition, feature interaction, representation enhancement, deep fusion
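To make the open-set confirmation idea in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that frame-level features have already been extracted, that attention scores are given as plain numbers (in a trained network they would be learned), and that two recordings are attributed to the same phone when the cosine similarity of their pooled embeddings reaches a threshold. The function names and the threshold value 0.5 are hypothetical.

```python
import math


def attentive_pool(frames, scores):
    """Collapse T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors (each a list of D floats).
    scores: list of T unnormalised attention scores (assumed given here).
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    # Weighted mean over time: frames with higher scores contribute more.
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_source(enrolled, probe, threshold=0.5):
    """Open-set decision: accept only if the embeddings are similar enough.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out trials.
    """
    return cosine(enrolled, probe) >= threshold
```

Because the decision compares two embeddings rather than choosing among fixed class labels, a phone model that never appeared in training can still be enrolled and checked, which is the property that distinguishes open-set from closed-set identification.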

CLC number: TN912.34

Feng YUE, Yang PENG, Zhaopin SU, Guofu ZHANG, Chensi LIAN, Bo YANG, Zhen FANG. Open-set source cell-phone identification method based on feature interaction and representation enhancement[J]. Journal of Computer Applications, 2025, 45(12): 3813-3819.

Figures/Tables: 11

Layer	Parameters	Layer	Parameters
FC1	(94, 96)	Fastformer	(96, 96, 8, 0.2)
LN	(96, 96)	FC2	(96, 94)

Layer	Parameters	Layer	Parameters
LocalBlock1	(512, 512, 3, 2, 8)	AAP	(94, 1)
LocalBlock2	(512, 512, 3, 3, 8)	Conv1d3	(512, 128)
LocalBlock3	(512, 512, 3, 4, 8)	Conv1d4	(128, 512)
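
The layer names AAP, Conv1d3, and Conv1d4 in the table above are consistent with a squeeze-and-excitation path (average pooling over time, a 512 to 128 bottleneck, then a 128 to 512 expansion), although the page does not state this explicitly. Under that assumption, the channel re-weighting can be sketched in plain Python as follows. The function name, the omission of bias terms, and the tiny dimensions in the usage note are illustrative choices, not details taken from the paper.

```python
import math


def squeeze_excite(x, w_down, w_up):
    """Re-weight channels with a squeeze-and-excitation gate.

    x:      C channels, each a list of T frame values.
    w_down: R x C weight matrix (bottleneck, R < C), biases omitted.
    w_up:   C x R weight matrix (expansion), biases omitted.
    """
    # Squeeze: average each channel over time.
    squeezed = [sum(channel) / len(channel) for channel in x]
    # Bottleneck projection followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_down
    ]
    # Expansion followed by a sigmoid, giving one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_up
    ]
    # Excite: scale every frame of a channel by that channel's gate.
    return [[g * value for value in channel] for g, channel in zip(gates, x)]
```

For example, with two channels `[[1.0, 3.0], [2.0, 2.0]]`, an all-zero `w_down` of shape 1 x 2, and any `w_up` of shape 2 x 1, every gate equals 0.5 and each channel is halved. Channels whose gates approach 1 are kept, and those whose gates approach 0 are suppressed, which matches the abstract's description of enhancing device-related features while suppressing unrelated ones.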

Feature	EER	ACC
MFCC	12.90	82.76
Fbank	11.06	83.23
Global features	10.78	87.74
Local features	9.57	88.89
Fused features	8.64	91.47
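
The EER column reports the equal error rate, the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one common way to estimate it from two lists of similarity scores. It is a generic illustration with hypothetical names and toy scores, and the evaluation protocol actually used in the paper is not described on this page.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER and the threshold at which it occurs.

    genuine:  scores of trials where both recordings come from the same phone.
    impostor: scores of trials where the recordings come from different phones.
    A trial is accepted when its score is >= the threshold.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fall below it.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy scores for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy scores, one of four genuine trials is rejected and one of four impostor trials is accepted at a threshold of 0.6, so the estimated EER is 0.25. A lower EER means the genuine and impostor score distributions overlap less.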

Module	EER	ACC
FireOSCI-GlobalBlock	9.57	88.89
FireOSCI-LocalBlocks	10.78	87.74
FireOSCI-Attention Merge	9.54	89.65
FireOSCI-ASP	10.28	88.76
FireOSCI-GlobalBlock-ASP	9.74	88.65
FireOSCI	8.64	91.47

Method	Duration 3 s		Duration 2 s		Duration 1 s
	EER	ACC	EER	ACC	EER	ACC
ECAPA-TDNN	9.57	88.89	12.41	87.76	13.82	85.52
DS-TDNN	15.74	81.89	16.25	79.98	17.55	78.39
CAM++	11.11	84.51	12.52	82.33	15.79	80.56
TCN+LDA	16.03	82.81	16.84	81.46	18.87	78.20
FireOSCI	8.64	91.47	11.33	89.28	12.36	86.78
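
As a worked reading of the 3 s columns in the table above, the short script below computes how far FireOSCI's EER falls relative to the strongest baseline, and the corresponding ACC difference. The numbers are copied from the table. The helper name is hypothetical, and the script only performs arithmetic on the reported values.

```python
# (EER, ACC) at a duration of 3 s, copied from the table above.
results = {
    "ECAPA-TDNN": (9.57, 88.89),
    "DS-TDNN": (15.74, 81.89),
    "CAM++": (11.11, 84.51),
    "TCN+LDA": (16.03, 82.81),
    "FireOSCI": (8.64, 91.47),
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline * 100.0


# Strongest baseline = lowest EER among the methods other than FireOSCI.
baselines = {name: v for name, v in results.items() if name != "FireOSCI"}
best_baseline = min(baselines, key=lambda name: baselines[name][0])

eer_drop = relative_reduction(results[best_baseline][0], results["FireOSCI"][0])
acc_gain = results["FireOSCI"][1] - results[best_baseline][1]

print(f"strongest baseline by EER: {best_baseline}")
print(f"relative EER reduction: {eer_drop:.2f}")
print(f"ACC difference: {acc_gain:.2f}")
```

At 3 s, ECAPA-TDNN is the strongest baseline, and FireOSCI's EER is roughly 9.7% lower in relative terms, with an ACC that is 2.58 higher in the table's units.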

Method	Sampling rate (Sr) 16 000 Hz		Sampling rate (Sr) 22 050 Hz		Sampling rate (Sr) 32 000 Hz
	EER	ACC	EER	ACC	EER	ACC
ECAPA-TDNN	12.98	87.33	9.29	90.40	9.18	90.56
DS-TDNN	16.06	75.21	13.97	82.02	13.94	82.45
CAM++	11.48	87.38	11.93	88.31	11.92	88.40
TCN+LDA	15.89	83.24	15.96	82.97	15.69	83.05
FireOSCI	10.34	89.51	8.64	91.36	8.52	91.42
