Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (11): 3713-3720. DOI: 10.11772/j.issn.1001-9081.2024111662

• Multimedia computing and computer simulation •

Human-object interaction detection algorithm by fusing local feature enhanced perception

Junyi LIN, Mingxuan CHEN, Yongbin GAO

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2024-11-22 Revised: 2025-04-09 Accepted: 2025-04-17 Online: 2025-04-22 Published: 2025-11-10
  • Contact: Mingxuan CHEN
  • About author: LIN Junyi, born in 1999 in Yantai, Shandong, M. S. candidate. His research interests include human-object interaction detection.
    GAO Yongbin, born in 1988 in Ji'an, Jiangxi, Ph. D., associate professor. His research interests include computer vision, machine learning, knowledge graph, and intelligent manufacturing.
  • Supported by:
    Shanghai Local Capacity Building Project (21010501500); Shanghai “Science and Technology Innovation Action Plan” Social Development Science and Technology Research Project (21DZ1204900)

Abstract:

The core of Human-Object Interaction (HOI) detection is to identify the humans and objects in an image and accurately classify the interactions between them, which is crucial for deeper scene understanding. However, existing algorithms struggle with complex relationships: insufficient local information leads to erroneous associations and makes fine-grained operations hard to distinguish. To address this limitation, a Local Feature-enhanced Perceptual Module (LFPM) was designed to strengthen the model's ability to capture local feature information through the interaction of local and non-local features. This module comprised three key components: the Downsampling Aggregation branch Module (DAM), which acquired low-frequency features through downsampling and aggregated non-local structural information; the Fine-Grained Feature Branch (FGFB) module, which performed convolution operations in parallel to supplement the DAM's extraction of local information; and the Multi-Scale Wavelet Convolution (MSWC) module, which further refined the output features in the spatial and channel dimensions for a finer and more complete feature representation. Additionally, to address the limitations of Transformer in mining local spatial and channel features, a spatial and channel Squeeze and Excitation (scSE) module was introduced. This module allocated attention across the spatial and channel dimensions, enhancing the model's sensitivity to locally salient regions and effectively improving HOI detection accuracy. Finally, the LFPM, the scSE module, and the Transformer architecture were integrated to form the Local Feature Enhancement Perception model (LFEP) framework. Experimental results show that, compared with the SQA (Strong guidance Query with self-selected Attention) algorithm, the LFEP framework improves Average Precision (AP) by 1.1 percentage points on the V-COCO dataset and mean Average Precision (mAP) by 0.49 percentage points on the HICO-DET dataset. Ablation experiments also validate the effectiveness of each module in LFEP.

Key words: feature perception, multi-frequency convolution, down-sampling aggregation, end-to-end, Human-Object Interaction (HOI) detection
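
To make the attention mechanism named in the abstract more concrete, the following is a minimal NumPy sketch of a concurrent spatial and channel squeeze-and-excitation gate in its commonly described form. It is not the authors' implementation. The function name `scse`, the weight shapes, the reduction ratio `r`, the omission of bias terms, and the element-wise sum used to merge the two branches are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def scse(feat, w1, w2, w_sp):
    """Hypothetical scSE gate on a single feature map (biases omitted).

    feat : (C, H, W) feature map
    w1   : (C // r, C) channel-reduction weights (assumed shape)
    w2   : (C, C // r) channel-expansion weights (assumed shape)
    w_sp : (C,) weights of a 1x1 convolution producing one spatial map
    """
    # Channel branch: squeeze over space, excite per channel.
    z = feat.mean(axis=(1, 2))                       # (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))        # (C,) gates in (0, 1)
    channel_out = feat * s[:, None, None]

    # Spatial branch: squeeze over channels, excite per location.
    q = sigmoid(np.tensordot(w_sp, feat, axes=([0], [0])))  # (H, W)
    spatial_out = feat * q[None, :, :]

    # Assumption: the two recalibrated maps are merged by addition.
    return channel_out + spatial_out


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
w_sp = 0.1 * rng.standard_normal(C)

out = scse(feat, w1, w2, w_sp)
print(out.shape)
```

In this sketch each branch rescales the input by a gate between 0 and 1, so channels and locations with larger gate values contribute more to the output. How the paper places this module relative to the Transformer is not stated in the abstract and is not modeled here.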

CLC Number: