Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (12): 3496-3502. DOI: 10.11772/j.issn.1001-9081.2019061075

• Artificial Intelligence •

Indoor crowd detection network based on multi-level features and hybrid attention mechanism

SHEN Wenxiang, QIN Pinle, ZENG Jianchao

  1. College of Big Data, North University of China, Taiyuan, Shanxi 030051, China
  • Received: 2019-06-24 Revised: 2019-09-19 Published: 2019-12-10 Online: 2019-10-15
  • Contact: ZENG Jianchao
  • About the authors: SHEN Wenxiang (born 1995), male, from Huainan, Anhui, master's student; research interests: deep learning, computer vision. QIN Pinle (born 1978), male, from Changzhi, Shanxi, associate professor, Ph.D., CCF member; research interests: machine vision, big data, medical imaging. ZENG Jianchao (born 1963), male, from Dali County, Shaanxi, professor, Ph.D., CCF member; research interests: evolutionary computation, machine learning.
  • Supported by:
    This work is partially supported by the Shanxi Provincial Key Research and Development Plan (201803D31212-1).

Abstract: To address the diversity of target scales and poses in indoor crowds and the tendency of head targets to be confused with surrounding objects, an indoor crowd detection Network based on Multi-level Features and hybrid Attention mechanism (MFANet) was proposed. The network consists of three parts: a feature fusion module, a multi-scale dilated convolution pyramid feature decomposition module, and a hybrid attention module. Firstly, shallow-layer and intermediate-layer features were fused into a feature containing context information, to remedy the poor semantic information and weak classification ability for small targets in the shallow feature maps. Then, exploiting the property that dilated convolution enlarges the receptive field without adding parameters, the fused feature was decomposed at multiple scales to form a new small-target detection branch, enabling the network to locate and detect multi-scale targets. Finally, a local hybrid attention module was used to fuse global pixel-correlation spatial attention with channel attention, strengthening the features that contribute most to key information and thereby improving the network's ability to distinguish targets from background. The experimental results show that the proposed method achieves a precision of 0.94, a recall of 0.91 and an F1 score of 0.92 on the indoor surveillance scene dataset SCUT-HEAD, all significantly better than those of other algorithms currently used for indoor crowd detection.
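
The abstract's claim that dilated convolution enlarges the receptive field without adding parameters can be illustrated with a little arithmetic. The following is a minimal sketch, not the MFANet implementation: the 3x3 kernel, the dilation rates (1, 2, 4) and the 256-channel width are assumptions chosen for illustration, since the abstract does not state them.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated kernel along one axis.

    A kernel with k taps and dilation d spans k + (k - 1) * (d - 1) pixels.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv2d_weight_count(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Number of weights in a square 2D convolution (bias excluded).

    Dilation does not appear here: it spaces the taps apart but adds none.
    """
    return in_channels * out_channels * kernel_size * kernel_size


if __name__ == "__main__":
    # Hypothetical pyramid branches; rates and channel width are assumptions.
    for dilation in (1, 2, 4):
        extent = effective_kernel_size(3, dilation)
        weights = conv2d_weight_count(256, 256, 3)
        print(f"dilation={dilation}: covers {extent}x{extent} pixels, {weights} weights")
```

Each branch sees a wider context while the weight count stays fixed, which is the property the multi-scale decomposition relies on.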

Key words: indoor crowd detection, feature fusion, attention mechanism, dilated convolution, feature pyramid
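
The three scores reported in the abstract can be checked for internal consistency. This is a small sketch under one assumption: the first reported score (0.94) is read as precision in the usual detection sense, so that F1 is the harmonic mean of it and the recall (0.91). The function name is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Scores as reported for SCUT-HEAD; 0.94 is assumed to be precision.
    print(round(f1_score(0.94, 0.91), 2))
```

Under that reading the result rounds to the reported F1 of 0.92, so the three figures agree with one another.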

CLC number: