Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (2): 580-586.DOI: 10.11772/j.issn.1001-9081.2025020241

• Multimedia computing and computer simulation •

Multi-dimensional frequency domain feature fusion for human-object interaction detection

Yuebo FAN, Mingxuan CHEN, Xian TANG, Yongbin GAO, Wenchao LI

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2025-03-11 Revised: 2025-05-10 Accepted: 2025-05-15 Online: 2025-06-10 Published: 2026-02-10
  • Contact: Mingxuan CHEN
  • About author: FAN Yuebo, born in 2000, M. S. candidate. His research interests include computer vision and machine learning.
    CHEN Mingxuan, born in 1993, Ph. D., lecturer. His research interests include artificial intelligence, pattern recognition, and deep learning. Email: mchen@sues.edu.cn
    TANG Xian, born in 1978, Ph. D., associate professor, CCF member. Her research interests include graph data query processing and buffer algorithms.
    GAO Yongbin, born in 1988, Ph. D., associate professor. His research interests include computer vision, machine learning, and natural language processing.
    LI Wenchao, born in 1999, M. S. candidate. His research interests include natural language processing and knowledge graph construction.


Abstract:

The task of Human-Object Interaction (HOI) detection aims to identify all interactions between humans and objects in an image. Most existing research adopts an encoder-decoder framework trained end to end, which relies heavily on Absolute Position Encoding (APE) and performs poorly in complex multi-object interaction scenarios. To address the limited ability to capture relative spatial relationships between humans and objects caused by this reliance on APE, as well as the insufficient integration of local and global information in complex multi-object interaction scenarios, an HOI detection model was proposed that combines cross-dimensional interaction feature extraction with frequency-domain feature fusion. Firstly, the conventional Transformer encoder was improved by introducing Relative Position Encoding (RPE); by fusing RPE with APE, the model was enabled to capture the spatial relationships between humans and objects. Then, a new feature extraction module was introduced to strengthen image information integration by capturing interaction features across the channel, spatial, and feature dimensions, while the Discrete Cosine Transform (DCT) was applied to extract frequency-domain features and thereby capture richer local and global information. Finally, the Wise-IoU loss function was adopted to improve detection accuracy and class discrimination, allowing the model to handle targets of different categories more flexibly. Experiments were conducted on two public datasets, HICO-DET and V-COCO. The results show that, compared with the GEN-VLKT (Guided Embedding Network Visual-Linguistic Knowledge Transfer) model, the proposed model improves mean Average Precision (mAP) by 0.95 percentage points over all categories of HICO-DET and AP by 0.9 percentage points on Scenario 1 of V-COCO.
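The DCT-based frequency-domain feature extraction described above can be illustrated with a minimal sketch. This is not the paper's implementation: the channel count, the low-frequency block size `keep`, and the split into a low-frequency (global structure) descriptor plus a high-frequency (local detail) energy term are illustrative assumptions; the orthonormal DCT-II itself is standard.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # spatial index (columns)
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= np.sqrt(1.0 / n)    # DC row scaling
    m[1:] *= np.sqrt(2.0 / n)   # AC row scaling -> orthonormal basis
    return m

def dct2(x: np.ndarray) -> np.ndarray:
    """Separable 2D DCT over the last two axes of a (C, H, W) feature map."""
    _, h, w = x.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    # per channel: D_h @ X_c @ D_w^T
    return np.einsum('hk,ckl,wl->chw', dh, x, dw)

def frequency_features(x: np.ndarray, keep: int = 4):
    """Hypothetical split of the spectrum: a low-frequency block per channel
    (global image structure) and the remaining high-frequency energy
    (local detail strength)."""
    f = dct2(x)
    low = f[:, :keep, :keep].reshape(x.shape[0], -1)
    total = (f ** 2).sum(axis=(1, 2))
    low_energy = (f[:, :keep, :keep] ** 2).sum(axis=(1, 2))
    high_energy = np.sqrt(np.clip(total - low_energy, 0.0, None))
    return low, high_energy
```

Because the basis is orthonormal, the transform preserves energy, so the low/high split is a lossless partition of each channel's spectrum; in a detection backbone these descriptors would be fused back with the spatial features rather than used alone.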

Key words: Human-Object Interaction (HOI) detection, object detection, Relative Position Encoding (RPE), frequency domain feature, Discrete Cosine Transform (DCT)

