

Multi-dimensional frequency domain feature fusion for human-object interaction detection

樊跃波1,陈明轩1,汤显1,高永彬1,李文超2   

  1. Shanghai University of Engineering Science
    2. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science
  • Received: 2025-03-10  Revised: 2025-05-10  Online: 2025-06-10  Published: 2025-06-10
  • Corresponding author: 陈明轩

Abstract: Human-object interaction (HOI) detection aims to identify all interactions between humans and objects in an image. Most existing methods adopt an encoder-decoder architecture trained end to end, which typically relies on absolute positional encoding and yields limited performance in complex multi-object interaction scenes. To address the difficulty of capturing the relative spatial relationships between humans and objects caused by this reliance on absolute positional encoding, as well as the insufficient integration of local and global information in complex multi-object interaction scenes, a new HOI detection model combining cross-dimensional interaction feature extraction with frequency-domain feature fusion was proposed. First, the conventional Transformer encoder was improved by introducing an additional positional encoding; fused with the absolute positional encoding, it allows the relative relationships between humans and objects to be modeled. Second, a new feature extraction module was introduced to strengthen the integration of image information: cross-dimensional interaction captures interaction features across the channel, spatial, and feature dimensions to improve representational power, while the discrete cosine transform extracts frequency-domain features that capture richer local and global information. Finally, the Wise-IoU loss function was adopted to improve detection accuracy and class discrimination, allowing the model to handle targets of different categories more flexibly. Experiments were conducted on two public datasets, HICO-DET and V-COCO. Compared with the GEN-VLKT model, the proposed model improves mAP by 0.95 percentage points on all categories of HICO-DET and AP by 0.9 percentage points in scenario 1 of V-COCO.

Key words: human-object interaction detection, object detection, relative position encoding, frequency-domain features, discrete cosine transform
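
The abstract describes extracting frequency-domain features with the discrete cosine transform (DCT) and fusing them with the spatial features, but the page gives no implementation details. The following is a minimal sketch of that general idea only, assuming a PyTorch setting and a channel-attention-style fusion; the names dct_matrix, FreqChannelAttention, freq_hw, and reduction are illustrative assumptions and do not come from the paper.

```python
# Illustrative sketch only -- NOT the authors' module. It shows one common way
# to turn low-frequency 2D-DCT coefficients into channel weights and fuse them
# back into the spatial feature map.
import math
import torch
import torch.nn as nn


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix D of shape (n, n), so that X_freq = D @ x."""
    i = torch.arange(n, dtype=torch.float32)                # spatial index
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    d = torch.cos(math.pi / n * (i + 0.5) * k)               # (n, n)
    d[0] *= 1.0 / math.sqrt(n)
    d[1:] *= math.sqrt(2.0 / n)
    return d


class FreqChannelAttention(nn.Module):
    """Channel reweighting driven by low-frequency 2D-DCT coefficients (hypothetical)."""

    def __init__(self, channels: int, height: int, width: int,
                 freq_hw: int = 2, reduction: int = 16):
        super().__init__()
        self.freq_hw = freq_hw
        # Fixed DCT bases for the expected spatial size of the feature map.
        self.register_buffer("dct_h", dct_matrix(height))
        self.register_buffer("dct_w", dct_matrix(width))
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(
            nn.Linear(channels * freq_hw * freq_hw, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) spatial features from the backbone/encoder.
        b, c, _, _ = x.shape
        # 2D DCT-II per channel: D_h @ X @ D_w^T -> frequency map of shape (B, C, H, W).
        freq = torch.einsum("ph,bchw,qw->bcpq", self.dct_h, x, self.dct_w)
        # Keep only a small low-frequency block as a compact frequency descriptor.
        desc = freq[:, :, :self.freq_hw, :self.freq_hw].reshape(b, -1)
        # Map the descriptor to per-channel gates and fuse them with the spatial features.
        gates = self.fc(desc).view(b, c, 1, 1)
        return x * gates


# Usage: reweight a 256-channel feature map of spatial size 20x20.
feat = torch.randn(2, 256, 20, 20)
attn = FreqChannelAttention(channels=256, height=20, width=20)
out = attn(feat)   # (2, 256, 20, 20)
```

In the described model, such a frequency branch would sit alongside the cross-dimensional interaction module and the modified positional encoding; those components are omitted here.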

CLC number: