Journal of Computer Applications (《计算机应用》), 2022, Vol. 42, Issue (4): 1260-1268. DOI: 10.11772/j.issn.1001-9081.2021071258

• CCF第36届中国计算机应用大会 (CCF NCCA 2021) •

基于自注意力机制时频谱同源特征融合的鸟鸣声分类

刘志华1,2, 陈文洁1,2, 陈爱斌1,2

  1. 中南林业科技大学 计算机与信息工程学院,长沙 410004
  2. 中南林业科技大学 人工智能应用研究所,长沙 410004
  • 收稿日期:2021-07-16 修回日期:2021-08-26 接受日期:2021-08-30 发布日期:2021-08-26 出版日期:2022-04-10
  • 通讯作者: 陈爱斌
  • 作者简介:刘志华(1996—),男,湖南邵阳人,硕士研究生,主要研究方向:深度学习、音频分类
    陈文洁(1996—),女,湖南株洲人,硕士研究生,主要研究方向:深度学习、图像检测与分类
  • 基金资助:
    智慧物流技术湖南省重点实验室资助项目(2019TP1015)

Homologous spectrogram feature fusion with self-attention mechanism for bird sound classification

Zhihua LIU1,2, Wenjie CHEN1,2, Aibin CHEN1,2

  1. College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
  2. Institute of Applied Artificial Intelligence, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
  • Received: 2021-07-16 Revised: 2021-08-26 Accepted: 2021-08-30 Online: 2021-08-26 Published: 2022-04-10
  • Contact: Aibin CHEN
  • About the authors: LIU Zhihua, born in 1996, M.S. candidate. His research interests include deep learning and audio classification.
    CHEN Wenjie, born in 1996, M.S. candidate. Her research interests include deep learning and image detection and classification.
  • Supported by:
    Hunan Key Laboratory of Intelligent Logistics Technology (2019TP1015)

摘要:

目前深度学习模型大都难以应对复杂背景噪声下的鸟鸣声分类问题。考虑到鸟鸣声具有时域连续性、频域高低性特点,提出了一种利用同源谱图特征进行融合的模型用于复杂背景噪声下的鸟鸣声分类。首先,使用卷积神经网络(CNN)提取鸟鸣声梅尔时频谱特征;然后,使用特定的卷积以及下采样操作,将同一梅尔时频谱特征的时域和频域维度分别压缩至1,得到仅包含鸟鸣声高低特性的频域特征以及连续特性的时域特征。基于上述提取频域以及时域特征的操作,在时域和频域维度上同时对梅尔时频谱特征进行提取,得到具有连续性以及高低特性的时频域特征。然后,将自注意力机制分别用于得到的时域、频域、时频域特征以加强其各自拥有的特性。最后,将这三类同源谱图特征决策融合后的结果用于鸟鸣声分类。所提模型用于Xeno-canto网站的8种鸟类音频分类,并在分类对比实验中取得了平均精确率(MAP)为0.939的较好结果。实验结果表明该模型能应对复杂背景噪声下的鸟鸣声分类效果较差的问题。

关键词: 深度学习, 鸟鸣声分类, 卷积神经网络, 自注意力机制, 同源谱图特征融合

Abstract:

Most current deep learning models struggle to classify bird sound under complex background noise. Because bird sound has continuity characteristics in the time domain and high-low characteristics in the frequency domain, a model that fuses homologous spectrogram features was proposed for bird sound classification under complex background noise. Firstly, a Convolutional Neural Network (CNN) was used to extract Mel-spectrogram features of bird sound. Then, specific convolution and down-sampling operations were used to compress the time-domain and frequency-domain dimensions of the same Mel-spectrogram feature to 1 separately, yielding a frequency-domain feature that contains only the high-low characteristics and a time-domain feature that contains only the continuity characteristics. Building on these operations, the Mel-spectrogram feature was also extracted along the time and frequency dimensions simultaneously, yielding a time-frequency-domain feature with both continuity and high-low characteristics. Next, the self-attention mechanism was applied separately to the time-domain, frequency-domain and time-frequency-domain features to strengthen the characteristics of each. Finally, the result of decision fusion over these three homologous spectrogram features was used for bird sound classification. The proposed model was applied to audio classification of 8 bird species from the Xeno-canto website and achieved a good result in the classification comparison experiments, with a Mean Average Precision (MAP) of 0.939. The experimental results show that the proposed model can address the poor classification performance on bird sound under complex background noise.

Key words: deep learning, bird sound classification, Convolutional Neural Network (CNN), self-attention mechanism, homologous spectrogram feature fusion
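
The pipeline described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: mean pooling stands in for the paper's "specific convolution and down-sampling operations", the self-attention uses identity query/key/value projections, the classifier heads are omitted, and decision fusion is assumed to be a plain average of per-branch class probabilities. All function names, the toy spectrogram, and the per-branch scores are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compress_time(spec):
    """Collapse the time axis of spec[freq][time] to 1 (mean pooling stand-in).

    Returns one value per frequency bin: the frequency-domain feature.
    """
    return [sum(row) / len(row) for row in spec]


def compress_freq(spec):
    """Collapse the frequency axis of spec[freq][time] to 1 (mean pooling stand-in).

    Returns one value per time frame: the time-domain feature.
    """
    return [sum(col) / len(spec) for col in zip(*spec)]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption)."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                         for j in range(dim)])
    return attended


def decision_fusion(branch_probs):
    """Average per-class probabilities across branches (assumed fusion rule)."""
    n_classes = len(branch_probs[0])
    return [sum(p[c] for p in branch_probs) / len(branch_probs)
            for c in range(n_classes)]


# Toy "Mel-spectrogram feature": 2 frequency bins x 3 time frames.
spec = [[0.0, 1.0, 2.0],
        [2.0, 3.0, 4.0]]

freq_feature = compress_time(spec)   # one value per frequency bin
time_feature = compress_freq(spec)   # one value per time frame

# Treat each bin or frame as a 1-dimensional token for attention.
freq_attended = self_attention([[v] for v in freq_feature])
time_attended = self_attention([[v] for v in time_feature])

# Hypothetical per-branch class scores for 3 classes (made-up numbers);
# in a real model these would come from classifier heads on each branch.
branch_logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [1.8, 0.3, 0.4]]
fused = decision_fusion([softmax(logits) for logits in branch_logits])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities is only one possible decision-fusion rule; the abstract does not state which rule the model uses, so a weighted or voting scheme would fit the description equally well.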

中图分类号: