Journal of Computer Applications (《计算机应用》)


Multi-scale sparse graph induced vision graph neural networks

ZHANG Zimo1, ZHAO Xuezhuan2,3

  1. National University of Singapore
  2. Zhengzhou University of Aeronautics
  3. Chongqing Research Institute of Harbin Institute of Technology
  • Received: 2024-07-01  Revised: 2024-11-07  Online: 2024-12-03  Published: 2024-12-03
  • Corresponding author: ZHAO Xuezhuan
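  • Funding:
    Key R&D Special Project of Henan Province; Aeronautical Science Foundation of China; Science and Technology Research Project of Henan Province; Natural Science Foundation of Chongqing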





Abstract: In recent years, Vision Graph Neural Networks (ViGs) have attracted considerable attention from researchers in computer vision, and graph construction is a key modeling step in ViGs. The widely used K-Nearest Neighbor (KNN) graph construction method operates at a single, fixed scale and has quadratic computational complexity, which makes it difficult to model both the local and the multi-scale information in an image. To address this issue, a Multi-Scale Sparse Graph (MSSG) construction method was proposed. MSSG decomposes the KNN graph along the channel dimension into three sparse subgraphs of different scales, achieving linear computational complexity while effectively modeling both local and multi-scale information in an image. To enhance the model's global modeling capability, a strategy for fusing global and local multi-scale information was proposed. Building on these components, a new vision backbone network, MSViG, was proposed. Image classification experiments on ImageNet-1K show that the proposed architecture outperforms existing ViGs. For example, MSViG-T achieves 2.1 percentage points higher Top-1 classification accuracy than ViG-T, and MSViG also delivers substantial performance gains over conventional ViGs on downstream vision tasks such as object detection and instance segmentation.

Key words: graph neural network, vision graph neural network, vision backbone, image classification, object detection, instance segmentation
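The abstract contrasts a dense KNN graph, whose construction needs all N^2 pairwise patch distances, with MSSG's channel-wise split into three sparse subgraphs of different scales that cost only linear time in N. The abstract does not say how those sparse subgraphs are defined, so the NumPy sketch below is an illustration under stated assumptions, not the authors' method: it splits the channels into three groups and gives each group a fixed dilated 3x3 grid graph (dilations 1, 2 and 4, chosen here purely for illustration), aggregating with max-relative features as in the original ViG. The function names `knn_graph`, `dilated_grid_neighbors`, `max_relative` and `multi_scale_sparse_aggregate` are hypothetical.

```python
import numpy as np

def knn_graph(x, k):
    """Dense KNN graph over N patch features x of shape (N, C).

    Materializes the full N x N distance matrix, so time is O(N^2 * C)
    and memory is O(N^2). Returns (N, k) neighbor indices (self included).
    """
    sq = (x ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # squared distances
    return np.argsort(dist, axis=1)[:, :k]

def dilated_grid_neighbors(h, w, dilation):
    """Hypothetical sparse subgraph on an h x w patch grid.

    Each patch links to the 9 positions of a 3x3 pattern spaced `dilation`
    apart (clipped at borders), so the graph has exactly 9 * h * w edges.
    """
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    nbrs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            r = np.clip(rows + dy * dilation, 0, h - 1)
            c = np.clip(cols + dx * dilation, 0, w - 1)
            nbrs.append((r * w + c).reshape(-1))  # row-major patch index
    return np.stack(nbrs, axis=1)  # shape (h * w, 9)

def max_relative(x, nbrs):
    """Max-relative aggregation: for each node i, max over neighbors j of x_j - x_i."""
    return (x[nbrs] - x[:, None, :]).max(axis=1)

def multi_scale_sparse_aggregate(x, h, w, dilations=(1, 2, 4)):
    """Assumed channel-split multi-scale sparse graph (not the authors' code).

    Channels are split into len(dilations) groups; group g aggregates over a
    dilated grid graph with dilation dilations[g]. For fixed channels and
    dilations, the cost grows linearly with the patch count N = h * w.
    """
    out = np.empty_like(x)
    groups = np.array_split(np.arange(x.shape[1]), len(dilations))
    for idx, d in zip(groups, dilations):
        nbrs = dilated_grid_neighbors(h, w, d)
        out[:, idx] = max_relative(x[:, idx], nbrs)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = w = 8
    x = rng.standard_normal((h * w, 48))  # 64 patches, 48 channels
    y = multi_scale_sparse_aggregate(x, h, w)
    print("output shape:", y.shape)
    n = 56 * 56  # patch count for a 56 x 56 feature map
    print("dense KNN distance entries:", n * n)
    print("sparse edges (3 scales x 9 neighbors):", 3 * 9 * n)
```

For a 56 x 56 feature map the dense KNN step evaluates 9,834,496 pairwise distances, while this sketch uses 84,672 fixed edges. The price of the fixed pattern is that neighbors are no longer chosen by feature similarity, and the paper's global and local fusion strategy is not modeled here.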
