Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (3): 844-853. DOI: 10.11772/j.issn.1001-9081.2021030392

• Artificial Intelligence •

Semantic segmentation of RGB-D indoor scenes based on attention mechanism and pyramid fusion

Na YU, Yan LIU, Xiongju WEI, Yuan WAN

  1. College of Science, Wuhan University of Technology, Wuhan Hubei 430070, China
  • Received: 2021-03-16; Revised: 2021-05-16; Accepted: 2021-05-31; Online: 2022-04-09; Published: 2022-03-10
  • Corresponding author: Yuan WAN
  • About author: YU Na, born in 2000 in Xianning, Hubei, is a CCF member. Her research interests include machine learning, deep learning, and semantic image segmentation.
    LIU Yan, born in 2000 in Lu'an, Anhui, is a CCF member. His research interests include machine learning, deep learning, and semantic image segmentation.
    WEI Xiongju, born in 2000 in Xianning, Hubei. His research interests include machine learning, deep learning, and semantic image segmentation.
  • Supported by:
    National Innovation and Entrepreneurship Training Program for College Students (202010497047)

Semantic segmentation of RGB-D indoor scenes based on attention mechanism and pyramid fusion

Na YU, Yan LIU, Xiongju WEI, Yuan WAN

  1. College of Science, Wuhan University of Technology, Wuhan Hubei 430070, China
  • Received: 2021-03-16; Revised: 2021-05-16; Accepted: 2021-05-31; Online: 2022-04-09; Published: 2022-03-10
  • Contact: Yuan WAN
  • About author: YU Na, born in 2000. Her research interests include machine learning, deep learning, and semantic image segmentation.
    LIU Yan, born in 2000. His research interests include machine learning, deep learning, and semantic image segmentation.
    WEI Xiongju, born in 2000. His research interests include machine learning, deep learning, and semantic image segmentation.
  • Supported by:
    National Innovation and Entrepreneurship Training Program for College Students (202010497047)

Abstract:

To address the problem that existing RGB-D indoor scene semantic segmentation methods cannot fuse multi-modal features effectively, a semantic segmentation network model for RGB-D indoor scene images named APFNet, based on an attention mechanism and pyramid fusion, was proposed, and two new modules were designed for it: an attention mechanism fusion module and a pyramid fusion module. The attention mechanism fusion module extracts the attention allocation weights of the RGB features and the Depth features separately, making full use of the complementarity of the two kinds of features so that the network focuses on the multi-modal feature domains with higher information content. The pyramid fusion module uses pyramid features at four different scales to fuse local and global information, extract scene context, and improve the segmentation accuracy of object edges and small-scale objects. The two fusion modules were integrated into a three-branch encoder-decoder network to realize end-to-end output. The model was compared experimentally with state-of-the-art methods such as the multi-level residual feature fusion network (RDF-152), the attention complementary network (ACNet), and the spatial information guided convolution network (SGNet) on the SUN RGB-D and NYU Depth v2 datasets. The results show that, compared with the best-performing method RDF-152, APFNet improves the Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), and Mean Intersection over Union (MIoU) by 0.4, 1.1, and 3.2 percentage points respectively while reducing the number of encoder layers from 152 to 50, and improves the semantic segmentation quality of small-scale objects such as pillows and photos by 0.9 to 4.5 percentage points and of large-scale objects such as boards and ceilings by 12.4 to 18 percentage points. The model therefore has advantages in handling the semantic segmentation of indoor scenes.
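The abstract describes the attention mechanism fusion module only at a high level, so the following is a minimal PyTorch sketch under stated assumptions: each modality's attention allocation weights are taken to be squeeze-and-excitation style per-channel weights (global average pooling, a 1×1 convolution, and a sigmoid), and the re-weighted RGB and Depth maps are summed. All names (AttentionFusion, rgb_gate, depth_gate) are illustrative, not from the paper.

    # Hypothetical sketch of attention-based RGB/Depth feature fusion;
    # the paper's actual module may differ.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            # One gate per modality: global average pooling -> 1x1 conv
            # -> sigmoid yields per-channel attention weights in [0, 1].
            def make_gate():
                return nn.Sequential(
                    nn.AdaptiveAvgPool2d(1),
                    nn.Conv2d(channels, channels, kernel_size=1),
                    nn.Sigmoid(),
                )
            self.rgb_gate = make_gate()
            self.depth_gate = make_gate()

        def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
            # Re-weight each modality by its own attention weights and sum,
            # so channels carrying more information dominate the fused map.
            return rgb * self.rgb_gate(rgb) + depth * self.depth_gate(depth)

    # Usage: fuse two 256-channel feature maps of the same spatial size.
    fuse = AttentionFusion(256)
    fused = fuse(torch.randn(1, 256, 60, 80), torch.randn(1, 256, 60, 80))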

Key words: RGB-D semantic segmentation, attention mechanism, pyramid fusion, multi-modal, deep supervision

Abstract:

To address the ineffective fusion of multi-modal features in RGB-D indoor scene semantic segmentation, a network named APFNet (Attention mechanism and Pyramid Fusion Network) was proposed, in which an attention mechanism fusion module and a pyramid fusion module were designed. To make full use of the complementarity of the RGB features and the Depth features, the attention allocation weights of these two kinds of features were extracted separately by the attention mechanism fusion module, making the network focus on the multi-modal feature domains with higher information content. Local and global information were fused by the pyramid fusion module with pyramid features at four different scales, so that scene context was extracted and the segmentation accuracies of object edges and small-scale objects were improved. These two fusion modules were integrated into a three-branch encoder-decoder network to realize end-to-end output. Comparative experiments were conducted with state-of-the-art methods, such as the multi-level RGB-D residual feature fusion network (RDF-152), the Attention Complementary features Network (ACNet) and the Spatial information Guided convolution Network (SGNet), on the SUN RGB-D and NYU Depth v2 datasets. Compared with the best-performing method RDF-152, APFNet increased the Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), and Mean Intersection over Union (MIoU) by 0.4, 1.1 and 3.2 percentage points respectively, while the number of layers of the encoder network was reduced from 152 to 50. The semantic segmentation accuracies for small-scale objects such as pillows and photos, and for large-scale objects such as boards and ceilings, were increased by 0.9 to 3.4 and by 12.4 to 18 percentage points respectively. The results show that the proposed APFNet has advantages in the semantic segmentation of indoor scenes.
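The abstract specifies four pyramid scales but not their sizes; the sketch below assumes the common PSPNet-style pooling bins (1, 2, 3, 6): the input is pooled to each bin size, reduced by a 1×1 convolution, upsampled back, and concatenated with the input so that local and global context are fused. The bin sizes and all names are assumptions, not the paper's exact design.

    # Hypothetical sketch of a four-scale pyramid fusion module; bin
    # sizes (1, 2, 3, 6) are borrowed from PSPNet as an assumption.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidFusion(nn.Module):
        def __init__(self, channels: int, bins=(1, 2, 3, 6)):
            super().__init__()
            # One branch per scale: adaptive average pooling to a
            # bin x bin grid, then a 1x1 conv to reduce channels.
            self.branches = nn.ModuleList([
                nn.Sequential(
                    nn.AdaptiveAvgPool2d(b),
                    nn.Conv2d(channels, channels // len(bins), kernel_size=1),
                    nn.ReLU(inplace=True),
                )
                for b in bins
            ])
            # Project the concatenated input + pyramid features back down.
            self.project = nn.Conv2d(channels * 2, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h, w = x.shape[2:]
            # Upsample every pooled map back to the input resolution.
            pyramid = [
                F.interpolate(branch(x), size=(h, w), mode="bilinear",
                              align_corners=False)
                for branch in self.branches
            ]
            # The 1x1 bin carries global scene context; the finer bins
            # keep local detail that helps edges and small objects.
            return self.project(torch.cat([x] + pyramid, dim=1))

    # Usage: refine a 512-channel encoder feature map.
    ppm = PyramidFusion(512)
    out = ppm(torch.randn(1, 512, 30, 40))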

Key words: RGB-D semantic segmentation, attention mechanism, pyramid fusion, multi-modal, deep supervision
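For reference, the three metrics quoted in the abstract are the standard segmentation measures, not anything specific to this paper. With k classes and p_ij denoting the number of pixels of class i predicted as class j, they can be written as:

    % Standard definitions of PA, MPA and MIoU.
    \mathrm{PA} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}}, \qquad
    \mathrm{MPA} = \frac{1}{k}\sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}}, \qquad
    \mathrm{MIoU} = \frac{1}{k}\sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}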

CLC number: