Multi-dimensional frequency domain feature fusion for human-object interaction detection
Yuebo FAN, Mingxuan CHEN, Xian TANG, Yongbin GAO, Wenchao LI
Journal of Computer Applications, 2026, 46(2): 580-586. DOI: 10.11772/j.issn.1001-9081.2025020241

Human-Object Interaction (HOI) detection aims to identify all interactions between humans and objects in an image. Most existing work adopts an encoder-decoder framework for end-to-end training, which relies heavily on Absolute Positional Encoding (APE) and performs poorly in complex multi-object interaction scenarios. To address the limited capture of relative spatial relationships between humans and objects caused by this reliance on APE, as well as the insufficient integration of local and global information in complex multi-object scenarios, an HOI detection model was proposed that combines cross-dimensional interaction feature extraction with frequency-domain feature fusion. Firstly, the conventional Transformer encoder was improved by introducing Relative Position Encoding (RPE); fusing RPE with APE enables the model to capture the spatial relationships between humans and objects. Secondly, a new feature extraction module was introduced to enhance image information integration by capturing interaction features across the channel, spatial, and feature dimensions, while the Discrete Cosine Transform (DCT) was applied to extract frequency-domain features that carry richer local and global information. Finally, the Wise-IoU loss function was adopted to improve detection accuracy and class discrimination, allowing the model to handle targets of various categories more flexibly. Experiments were conducted on two public datasets, HICO-DET and V-COCO. The results show that, compared with the GEN-VLKT (Guided Embedding Network Visual-Linguistic Knowledge Transfer) model, the proposed model improves mean Average Precision (mAP) by 0.95 percentage points over all categories of HICO-DET and AP by 0.9 percentage points on Scenario 1 of V-COCO.
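The DCT-based frequency-domain step can be illustrated with a minimal sketch. This is not the paper's actual module: the function name `frequency_fusion`, the low-frequency cutoff `keep`, and the fixed 0.8/0.2 fusion weights are all illustrative assumptions standing in for the learned components; it only shows the general idea of separating a feature map into low-frequency (global) and high-frequency (local) parts via a 2D DCT and recombining them. NumPy and SciPy are assumed available.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_fusion(feat, keep=4):
    """Split a (C, H, W) feature map into low- and high-frequency parts
    with a 2D DCT over the spatial axes, then fuse them additively.

    keep: side length of the retained low-frequency block of DCT
    coefficients (small indices = global structure, the rest = local
    detail). The 0.8/0.2 weights stand in for learned fusion weights.
    """
    coeffs = dctn(feat, axes=(-2, -1), norm="ortho")
    low = np.zeros_like(coeffs)
    low[..., :keep, :keep] = coeffs[..., :keep, :keep]  # global structure
    high = coeffs - low                                 # local detail
    global_part = idctn(low, axes=(-2, -1), norm="ortho")
    local_part = idctn(high, axes=(-2, -1), norm="ortho")
    return 0.8 * global_part + 0.2 * local_part

# Toy usage: one 8-channel 16x16 feature map.
feat = np.random.default_rng(0).standard_normal((8, 16, 16))
fused = frequency_fusion(feat)
print(fused.shape)  # (8, 16, 16)
```

Because the DCT is linear, keeping all coefficients (`keep` equal to the spatial size) reduces the output to a plain rescaling of the input; smaller `keep` values increasingly emphasize the low-frequency, global component.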
