Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (2): 537-543. DOI: 10.11772/j.issn.1001-9081.2020060793

Special Issue: Multimedia Computing and Computer Simulation

• Multimedia computing and computer simulation •

Crowd counting network based on multi-scale spatial attention feature fusion

DU Peide, YAN Hua   

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu Sichuan 610065, China
  • Received: 2020-06-10 Revised: 2020-09-20 Online: 2021-02-10 Published: 2020-12-18
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (11872069).

  • Corresponding author: YAN Hua
  • About the authors: DU Peide (born 1994), male, native of Hunyuan, Shanxi, M.S. candidate; his research interests include pattern recognition and intelligent control. YAN Hua (born 1971), male, native of Quxian, Sichuan, Ph.D., professor; his research interests include pattern recognition and intelligent control.

Abstract: To address the poor performance of crowd counting in scenes of varying density caused by severe scale changes and occlusions, a Multi-scale spatial Attention Feature fusion Network (MAFNet) was proposed on the basis of the Congested Scene Recognition Network (CSRNet) by combining a multi-scale feature fusion structure with a spatial attention module. Before feature extraction with MAFNet, the scene images with head annotations were processed with a Gaussian filter to obtain the ground-truth density maps. In addition, a method of jointly using two basic loss functions was proposed to constrain the consistency between the estimated density map and the ground-truth density map. With the multi-scale feature fusion structure as the backbone of MAFNet, the strategy of extracting and fusing multi-scale features simultaneously was used to obtain the multi-scale fusion feature map, which was then calibrated and re-fused by the spatial attention module. After that, the estimated density map was generated through dilated convolution, and the number of people in the scene was obtained by integrating the estimated density map pixel by pixel. To verify the effectiveness of the proposed model, evaluations were conducted on four datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF and WorldExpo'10). Experimental results on Part_B of the ShanghaiTech dataset show that, compared with CSRNet, MAFNet reduces the Mean Absolute Error (MAE) by 34.9% and the Mean Square Error (MSE) by 29.4%. Furthermore, experimental results on multiple datasets show that, by using the attention mechanism and the multi-scale feature fusion strategy, MAFNet can extract more detailed information and reduce the impact of scale changes and occlusions.
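The pipeline described in the abstract, that is, blurring head annotations with a Gaussian filter to build a ground-truth density map and integrating an estimated density map pixel by pixel to recover the count, can be illustrated with a short sketch. This is not the authors' code: the fixed Gaussian sigma, the function names, and the root form of the reported MSE follow common crowd-counting practice and are assumptions made purely for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter

def heads_to_density_map(head_points, image_shape, sigma=4.0):
    """Place a unit impulse at each annotated head and blur with a Gaussian.

    head_points : iterable of (row, col) head coordinates
    image_shape : (height, width) of the scene image
    sigma       : Gaussian standard deviation (assumed fixed here; many works
                  instead use a geometry-adaptive sigma)
    """
    density = np.zeros(image_shape, dtype=np.float32)
    h, w = image_shape
    for r, c in head_points:
        r, c = int(round(r)), int(round(c))
        if 0 <= r < h and 0 <= c < w:
            density[r, c] += 1.0
    # Gaussian filtering preserves the integral, so the map still sums to the head count.
    return gaussian_filter(density, sigma=sigma)

def count_from_density_map(density_map):
    """Estimate the crowd count by integrating the density map pixel by pixel."""
    return float(density_map.sum())

def mae_mse(pred_counts, true_counts):
    """MAE and MSE over per-image counts; MSE is computed in the root form
    commonly reported in crowd counting (an assumption about this paper's metric)."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    mae = np.mean(np.abs(pred - true))
    mse = np.sqrt(np.mean((pred - true) ** 2))
    return mae, mse

if __name__ == "__main__":
    gt = heads_to_density_map([(30, 40), (32, 45), (80, 120)], (128, 160))
    print(count_from_density_map(gt))          # approximately 3.0
    print(mae_mse([3.2, 10.0], [3.0, 12.0]))   # toy example with two images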
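The abstract also mentions jointly using two basic loss functions to constrain the consistency between the estimated and ground-truth density maps, without naming the two losses. The PyTorch-style sketch below is a hypothetical illustration only: a pixel-wise MSE term plus an L1 term on the integrated counts, weighted by an assumed coefficient alpha, is one common combination and is not claimed to be the paper's actual choice.

import torch
import torch.nn as nn

class JointDensityLoss(nn.Module):
    """Illustrative joint loss: the two component losses and the weight alpha
    are assumptions, since the paper does not specify them here."""
    def __init__(self, alpha=0.1):
        super().__init__()
        self.alpha = alpha                 # assumed weighting between the two terms
        self.pixel_loss = nn.MSELoss()     # pixel-wise consistency of the two maps
        self.count_loss = nn.L1Loss()      # consistency of the integrated counts

    def forward(self, est_density, gt_density):
        # est_density, gt_density: tensors of shape (N, 1, H, W)
        pixel_term = self.pixel_loss(est_density, gt_density)
        count_term = self.count_loss(est_density.sum(dim=(1, 2, 3)),
                                     gt_density.sum(dim=(1, 2, 3)))
        return pixel_term + self.alpha * count_term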

Key words: dense crowd counting, Convolutional Neural Network (CNN), feature fusion, attention mechanism, multi-scale

