Journal of Computer Applications, official website ›› 2022, Vol. 42 ›› Issue (3): 731-735. DOI: 10.11772/j.issn.1001-9081.2021060995

• 2021 China Computer Federation Conference on Artificial Intelligence (CCFAI 2021) •

Audio-visual joint action recognition based on a key frame selection network

Tingxiu CHEN, Jianqin YIN

  1. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2021-06-11 Revised: 2021-08-13 Accepted: 2021-08-20 Online: 2022-04-09 Published: 2022-03-10
  • Contact: Jianqin YIN
  • About author: CHEN Tingxiu, born in 1997 in Daxing'anling, Heilongjiang, M. S. candidate. Her research interests include computer vision and deep learning.
  • Supported by:
    National Natural Science Foundation of China (61673192); Fundamental Research Funds for the Central Universities (2020XD-A04)

Abstract:

In recent years, action recognition by audio-visual joint learning has attracted some attention. Whether in video (the visual modality) or audio (the auditory modality), an action occurs over a short interval, and typically only the information within that interval expresses the action category distinctly. How to better exploit the discriminative action information carried by the key frames of the audio-visual modalities is one of the open problems in audio-visual action recognition. To address this problem, a key frame selection network, KFIA-S, was proposed. Through a linear temporal attention mechanism based on a fully connected layer, the audio-visual information at each time step was assigned a different weight, so that the audio-visual features beneficial to video classification were selected, redundant information was reduced, background interference was suppressed, and action recognition accuracy was improved. The effect of temporal attention of different strengths on action recognition was also studied. Experiments on the ActivityNet dataset show that KFIA-S achieves state-of-the-art recognition accuracy, which demonstrates the effectiveness of the proposed method.

Key words: video action recognition, audio-visual joint learning, temporal attention, deep learning, long short-term memory recurrent neural network

CLC Number:
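The linear temporal attention described in the abstract can be sketched as follows. This is a minimal NumPy illustration, assuming a one-unit fully connected layer scores the fused audio-visual feature at each time step and the scores are normalized over time with a softmax; the function and variable names are illustrative and not from the paper's implementation.

```python
import numpy as np

def temporal_attention(features, W, b=0.0):
    """Sketch of FC-based linear temporal attention (illustrative only).

    features: (T, D) array of per-time-step fused audio-visual features.
    W: (D,) weight vector and b: scalar bias of a one-unit fully
    connected layer that scores each time step.
    Returns the attention-weighted feature summary and the weights.
    """
    scores = features @ W + b                        # (T,) one score per step
    scores = scores - scores.max()                   # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # normalize over time
    pooled = weights @ features                      # (D,) weighted sum
    return pooled, weights

# Toy usage: 8 time steps of 16-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
W = rng.standard_normal(16)
pooled, w = temporal_attention(feats, W)
```

Time steps with high scores (e.g. key frames where the action occurs) dominate the pooled feature, while low-weight steps (redundant or background frames) are suppressed, which is the selection effect the abstract describes.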