Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (S1): 1-7. DOI: 10.11772/j.issn.1001-9081.2022101472

• Artificial Intelligence •    

  • Biographies: LIU Xiwei, born in 1978 in Puyang, Henan, Ph. D., senior engineer. His research interests include smart education, complex system modeling and simulation, virtual reality, and human factors engineering.
    GONG Xiaoyan, born in 1976 in Binzhou, Shandong, Ph. D., senior engineer. Her research interests include smart education, complex system management and control, and intelligent transportation.
    ZHAO Hongxia, born in 1982 in Xiaogan, Hubei, Ph. D., engineer. Her research interests include smart education, complex system management and control, and intelligent transportation.
    BIAN Siyu, born in 1998 in Haikou, Hainan. Her research interests include smart education and virtual reality.
    SHAO Shuai, born in 1992 in Beijing, Ph. D. His research interests include smart home, multi-sensor fusion, and human-computer interaction systems.
    DAI Yaping, born in 1963 in Zibo, Shandong, Ph. D., professor. Her research interests include image feature extraction and recognition, multi-sensor data fusion and decision diagnosis, and artificial intelligence and expert systems.
    DAI Wenxin, born in 1999 in Cangzhou, Hebei, M. S. candidate. Her research interests include computer vision and emotion recognition. dwx08042@163.com
  • Funding:
    Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0108801); Systematic Major Special Project on Key Technology Research for Improving Railway Safety Development Effectiveness in the New Era (P2021T002)

Dynamic facial expression recognition based on hybrid attention mechanism

Xiwei LIU1,2, Xiaoyan GONG1,2, Hongxia ZHAO1, Siyu BIAN1, Shuai SHAO3, Yaping DAI3, Wenxin DAI1,3()   

  1. State Key Laboratory of Multimodal Artificial Intelligence Systems (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China
    2. Institute of Smart Education Systems, Qingdao Academy of Intelligent Industries, Qingdao, Shandong 266044, China
    3. School of Automation, Beijing Institute of Technology, Beijing 100081, China
  • Received:2022-10-11 Revised:2022-12-20 Accepted:2022-12-26 Online:2023-07-04 Published:2023-06-30
  • Contact: Wenxin DAI


Abstract:

In natural scenes, complex factors such as face occlusion and pose variation are common, and, owing to spatial locality, the convolution filters in a Convolutional Neural Network (CNN) cannot learn long-range inductive biases between different facial regions in most neural layers. To address these problems, a Hybrid Attention mechanism Model (HA-Model) was proposed for Dynamic Facial Expression Recognition (DFER) to improve the robustness and accuracy of DFER. HA-Model consists of two parts: spatial feature extraction and temporal feature processing. In the spatial feature extraction part, two attention mechanisms, a Transformer and a grid attention module incorporating the Convolutional Block Attention Module (CBAM), guide the network to learn, from a spatial perspective, facial features that are robust to occlusion and pose variation, and to focus on locally salient facial features. In the temporal feature processing part, a Transformer guides the network to learn the temporal connections among high-level semantic features, so as to learn a global representation of facial expression features. Experimental results show that HA-Model achieves accuracies of 67.27% and 50.41% on the DFEW and AFEW benchmarks respectively, verifying that HA-Model can effectively extract facial features and improve the accuracy of dynamic facial expression recognition.
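As a rough illustration of the two-stage pipeline the abstract describes, the sketch below mimics HA-Model's data flow in plain NumPy with random weights: each frame's feature map is refined by a CBAM-style channel-and-spatial attention step and embedded into a token, and a single-head Transformer-style self-attention layer then models temporal relations across frames before classification into the seven basic expressions. All shapes, weights, and simplifications (e.g. replacing CBAM's spatial convolution with an average of pooled maps) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2):
    """CBAM-style refinement of one C x H x W feature map.
    Channel attention: a shared two-layer MLP over avg- and max-pooled
    channel descriptors. Spatial attention: simplified here to a
    sigmoid over averaged channel-wise avg/max maps (conv omitted)."""
    ch_avg, ch_max = feat.mean(axis=(1, 2)), feat.max(axis=(1, 2))      # (C,)
    ch = sigmoid(w2 @ np.tanh(w1 @ ch_avg) + w2 @ np.tanh(w1 @ ch_max)) # (C,)
    feat = feat * ch[:, None, None]                                     # channel-reweighted
    sp = sigmoid(0.5 * (feat.mean(axis=0) + feat.max(axis=0)))          # (H, W)
    return feat * sp[None, :, :]                                        # spatially reweighted

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a (T, D) sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (T, T) frame-to-frame weights
    return A @ V

rng = np.random.default_rng(0)
T, C, H, W, D = 8, 16, 7, 7, 32        # frames, channels, spatial dims, token dim (all assumed)
clip = rng.standard_normal((T, C, H, W))  # stand-in for per-frame CNN feature maps

w1 = rng.standard_normal((C // 4, C)) * 0.1   # channel-attention MLP (squeeze)
w2 = rng.standard_normal((C, C // 4)) * 0.1   # channel-attention MLP (excite)
proj = rng.standard_normal((C * H * W, D)) * 0.01  # frame -> token projection
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Wcls = rng.standard_normal((D, 7)) * 0.1      # 7 basic expression classes

# Stage 1 (spatial): refine each frame with CBAM, embed it as a D-dim token.
tokens = np.stack([cbam(f, w1, w2).reshape(-1) @ proj for f in clip])  # (T, D)
# Stage 2 (temporal): Transformer-style self-attention across the T frames.
ctx = self_attention(tokens, Wq, Wk, Wv)                               # (T, D)
logits = ctx.mean(axis=0) @ Wcls                                       # pool frames, classify
probs = softmax(logits)
print(probs.shape, np.isclose(probs.sum(), 1.0))  # → (7,) True
```

The point of the sketch is the division of labor: attention over pixels and channels within a frame (stage 1) is separate from attention over frames (stage 2), which is how the model can be robust to per-frame occlusion while still learning a clip-level representation.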

Key words: Dynamic Facial Expression Recognition (DFER), deep learning, Convolutional Neural Network (CNN), attention mechanism, Transformer, Convolutional Block Attention Module (CBAM)

CLC number: