Human-object interaction detection algorithm by fusing local feature enhanced perception

doi:10.11772/j.issn.1001-9081.2024111662

Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (11): 3713-3720.DOI: 10.11772/j.issn.1001-9081.2024111662

• Multimedia computing and computer simulation •

Junyi LIN, Mingxuan CHEN, Yongbin GAO

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

Received: 2024-11-22 Revised: 2025-04-09 Accepted: 2025-04-17 Online: 2025-04-22 Published: 2025-11-10
Contact: Mingxuan CHEN
About author: LIN Junyi, born in 1999, a native of Yantai, Shandong, M. S. candidate. His research interests include human-object interaction detection.
GAO Yongbin, born in 1988, a native of Ji'an, Jiangxi, Ph. D., associate professor. His research interests include computer vision, machine learning, knowledge graph, and intelligent manufacturing.
Supported by:
Shanghai Local Capacity Building Project(21010501500);Shanghai “Science and Technology Innovation Action Plan” Social Development Science and Technology Research Project(21DZ1204900)

Abstract

The core of Human-Object Interaction (HOI) detection is to identify the humans and objects in an image and accurately classify the interactions between them, which is crucial for deepening scene understanding. However, existing algorithms struggle with complex interactions because they lack sufficient local information, which leads to erroneous associations and makes fine-grained actions hard to distinguish. To address this limitation, a Local Feature-enhanced Perceptual Module (LFPM) was designed to strengthen the model's ability to capture local feature information by combining local and non-local feature interactions. The module comprises three key components: the Downsampling Aggregation branch Module (DAM), which acquires low-frequency features through downsampling and aggregates non-local structural information; the Fine-Grained Feature Branch (FGFB) module, which performs convolution operations in parallel to supplement the DAM's extraction of local information; and the Multi-Scale Wavelet Convolution (MSWC) module, which further refines the output features in the spatial and channel dimensions so that the feature representation is more precise and complete. Additionally, to address the limitations of Transformer in mining local spatial and channel features, a spatial and channel Squeeze and Excitation (scSE) module was introduced. This module allocates attention across the spatial and channel dimensions, which enhances the model's sensitivity to locally salient regions and improves HOI detection accuracy. Finally, LFPM, scSE, and the Transformer architecture were integrated to form the Local Feature Enhancement Perception (LFEP) framework. Experimental results show that, compared with the SQA (Strong guidance Query with self-selected Attention) algorithm, the LFEP framework improves Average Precision (AP) by 1.1 percentage points on the V-COCO dataset and mean Average Precision (mAP) by 0.49 percentage points on the HICO-DET dataset. Ablation experiments also validate the effectiveness of each module of LFEP.
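
As a rough illustration of how an scSE-style block can allocate attention across the channel and spatial dimensions, the NumPy sketch below applies a channel gate and a spatial gate to a single feature map and sums the two recalibrated maps. The function name `scse`, the weight shapes, the reduction ratio `r`, the random weights, and the additive fusion are all assumptions made for illustration; the abstract does not state the layer sizes or the fusion rule used in LFEP.

```python
import numpy as np


def sigmoid(v):
    """Logistic function used as the gating non-linearity."""
    return 1.0 / (1.0 + np.exp(-v))


def scse(x, w1, w2, w_spatial):
    """Apply a concurrent spatial and channel squeeze-and-excitation gate.

    x         : feature map of shape (C, H, W)
    w1, w2    : channel-branch weights of shape (C // r, C) and (C, C // r)
    w_spatial : 1x1 convolution weights of shape (C,) for the spatial branch
    (all shapes are illustrative assumptions, not taken from the paper)
    """
    # Channel branch: squeeze over space, then excite each channel.
    z = x.mean(axis=(1, 2))                        # (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))      # (C,), values in (0, 1)
    x_channel = x * s[:, None, None]

    # Spatial branch: squeeze over channels with a 1x1 conv, excite each pixel.
    q = sigmoid(np.einsum("c,chw->hw", w_spatial, x))  # (H, W), values in (0, 1)
    x_spatial = x * q[None, :, :]

    # Additive fusion of the two recalibrated maps (an assumed choice).
    return x_channel + x_spatial


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
out = scse(
    x,
    rng.standard_normal((C // r, C)),
    rng.standard_normal((C, C // r)),
    rng.standard_normal(C),
)
print(out.shape)
```

Because both gates lie in (0, 1), each branch can only attenuate a response; with all-zero weights both gates equal 0.5 and the block returns the input unchanged, which is a convenient sanity check.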

Key words: feature perception, multi-frequency convolution, down-sampling aggregation, end-to-end, Human-Object Interaction (HOI) detection

CLC Number: TP391.41

Junyi LIN, Mingxuan CHEN, Yongbin GAO. Human-object interaction detection algorithm by fusing local feature enhanced perception[J]. Journal of Computer Applications, 2025, 45(11): 3713-3720.

Figures/Tables 9

Fig. 1 Schematic diagram of detail perception ability

Fig. 2 Schematic diagram of ability to enhance sensitivity of local salient regions

Fig. 3 General architecture of LFEP

Fig. 4 General architecture of LFPM

Fig. 5 General architecture of scSE attention module

Tab. 1 mAP comparison of different methods on HICO-DET test set (%)

Type	Method	Default (Full)	Default (Rare)	Default (Non-rare)	Known Object (Full)	Known Object (Rare)	Known Object (Non-rare)
One-stage	UnionDet [13]	17.58	11.72	19.33	19.76	14.68	21.27
One-stage	IPNet [29]	19.56	12.79	21.58	22.05	15.77	23.92
One-stage	PPDM [12]	21.94	13.97	24.32	24.81	17.09	27.12
One-stage	AS-Net [15]	24.40	22.39	25.01	27.41	25.44	28.00
One-stage	QPIC [30]	29.07	21.85	31.23	31.68	24.14	33.93
One-stage	CDT [31]	30.48	25.48	32.37	—	—	—
One-stage	SQAB [32]	30.82	24.92	32.58	33.58	27.19	35.49
One-stage	SQA [33]	31.99	25.88	32.62	35.12	32.74	—
Two-stage	TIN [34]	17.03	13.42	18.11	19.17	15.51	20.26
Two-stage	DRG [21]	19.26	17.74	19.71	23.40	21.75	23.89
Two-stage	ACP [35]	20.59	15.92	21.98	—	—	—
Two-stage	DJRN [36]	21.34	18.53	22.18	23.69	20.64	24.60
Two-stage	IDN [37]	23.36	22.47	23.63	26.43	25.01	26.85
Two-stage	FCL [38]	25.27	20.57	26.67	27.71	22.34	28.93
Two-stage	TMHOI [39]	26.95	21.28	28.56	—	—	—
Two-stage	OCN [40]	31.43	25.80	33.11	—	—	—
Proposed	LFEP	32.48	27.12	34.05	35.09	29.58	36.16

Tab. 2 Comparison of effectiveness of different methods on V-COCO test set

Method	$AP_{role}^{\#1}$	$AP_{role}^{\#2}$
UnionDet [13]	47.5	56.2
TIN [34]	47.8	54.2
IPNet [29]	51.0	—
DRG [21]	51.0	—
FCL [38]	52.4	—
ACP [35]	52.9	—
IDN [37]	53.3	60.3
AS-Net [15]	53.9	—
HOTR [41]	55.2	64.4
QPIC [30]	58.8	61.0
CDT [31]	61.43	—
SQAB [32]	65.0	—
OCN [40]	65.3	—
SQA [33]	65.4	—
LFEP	66.5	68.8
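
The headline gains quoted in the abstract can be reproduced from the SQA and LFEP rows of Tab. 1 and Tab. 2. The short sketch below does that arithmetic; the dictionary layout and the helper name `gain_pp` are illustrative choices, and only the four scores are taken from the tables.

```python
# Scores copied from the SQA and LFEP rows of Tab. 1 (default, full) and Tab. 2 (AP_role #1).
scores = {
    "V-COCO AP_role #1": {"SQA": 65.4, "LFEP": 66.5},
    "HICO-DET mAP (default, full)": {"SQA": 31.99, "LFEP": 32.48},
}


def gain_pp(baseline, proposed):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(proposed - baseline, 2)


for benchmark, s in scores.items():
    print(f"{benchmark}: +{gain_pp(s['SQA'], s['LFEP'])} percentage points")
```

The gains are absolute differences between two percentages (percentage points), not relative improvements.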

Fig. 6 Visualization test results

Tab. 3 Ablation experiment results on HICO-DET dataset for each module

Method	mAP/% (Default)	mAP/% (Known Object)
Baseline	30.75	33.12
Baseline+LFPM	31.61	34.26
Baseline+MSWC	31.20	33.60
Baseline+LFPM+MSWC	32.08	34.63
Baseline+scSE	31.16	33.62
LFEP	32.48	35.09
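
To read the ablation at a glance, it can help to express every configuration as a gain over the baseline. The sketch below does this for the rows of Tab. 3; the variable names and the `gains` dictionary are illustrative, and the figures are copied from the table without any re-evaluation.

```python
# (default mAP, known-object mAP) per configuration, copied from Tab. 3.
ablation = {
    "Baseline": (30.75, 33.12),
    "Baseline+LFPM": (31.61, 34.26),
    "Baseline+MSWC": (31.20, 33.60),
    "Baseline+LFPM+MSWC": (32.08, 34.63),
    "Baseline+scSE": (31.16, 33.62),
    "LFEP": (32.48, 35.09),
}

base_default, base_known = ablation["Baseline"]

# Gain of each configuration over the baseline, in percentage points.
gains = {
    name: (round(default - base_default, 2), round(known - base_known, 2))
    for name, (default, known) in ablation.items()
    if name != "Baseline"
}

for name, (d_default, d_known) in gains.items():
    print(f"{name}: +{d_default} (default), +{d_known} (known object)")
```

Under this reading, the full LFEP configuration shows the largest gain in both settings, and every partial configuration improves on the baseline.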

References 41

[1]	SADEGHI M A， FARHADI A. Recognition using visual phrases［C］// Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2011： 1745-1752.
[2]	IFTEKHAR A S M， KUMAR S， McEVER R A， et al. GTNet： guided transformer network for detecting human-object interactions［C］// Proceedings of the SPIE 12527， Pattern Recognition and Tracking XXXIV. Bellingham， WA： SPIE， 2023： No.125270Q.
[3]	CAO Y， TANG Q， YANG F， et al. Re-mine， learn and reason： exploring the cross-modal semantic correlations for language-guided HOI detection［C］// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2023： 23435-23446.
[4]	ZHONG X， DING C， QU X， et al. Polysemy deciphering network for robust human-object interaction detection［J］. International Journal of Computer Vision， 2021， 129（6）： 1910-1929.
[5]	YANG Y， ZHUANG Y， PAN Y. Multiple knowledge representation for big data artificial intelligence： framework， applications， and case studies［J］. Frontiers of Information Technology and Electronic Engineering， 2021， 22（12）： 1551-1558.
[6]	ZHANG A， LIAO Y， LIU S， et al. Mining the benefits of two-stage and one-stage HOI detection［C］// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2021： 17209-17220.
[7]	GONG X， ZHANG Z Y， LIU L， et al. A survey of human-object interaction detection［J］. Journal of Southwest Jiaotong University， 2022， 57（4）： 693-704. (in Chinese)
[8]	GUPTA S， MALIK J. Visual semantic role labeling［EB/OL］. ［2024-09-20］.
[9]	CHAO Y W， LIU Y， LIU X， et al. Learning to detect human-object interactions［C］// Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision. Piscataway： IEEE， 2018： 381-389.
[10]	ZHENG S， XU B， JIN Q. Open-category human-object interaction pre-training via language modeling framework［C］// Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2023： 19392-19402.
[11]	ZOU C， WANG B， HU Y， et al. End-to-end human object interaction detection with HOI Transformer［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 11820-11829.
[12]	LIAO Y， LIU S， WANG F， et al. PPDM： parallel point detection and matching for real-time human-object interaction detection［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 479-487.
[13]	KIM B， CHOI T， KANG J， et al. UnionDet： union-level detector towards real-time human-object interaction detection［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12360. Cham： Springer， 2020： 498-514.
[14]	ZOU C， WANG B， HU Y， et al. Cascaded decoding network for HOI detection［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 11825-11834.
[15]	CHEN M， LIAO Y， LIU S， et al. Reformulating HOI detection as adaptive set prediction［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 9000-9009.
[16]	ZHONG X， DING C， QU X， et al. Polysemy deciphering network for human-object interaction detection［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12365. Cham： Springer， 2020： 69-85.
[17]	GKIOXARI G， GIRSHICK R， DOLLÁR P， et al. Detecting and recognizing human-object interactions［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 8359-8367.
[18]	ZHANG Y， PAN Y， YAO T， et al. Exploring structure-aware Transformer over interaction proposals for human-object interaction detection［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 19526-19535.
[19]	ZHANG F Z， CAMPBELL D， GOULD S. Efficient two-stage detection of human-object interactions with a novel Unary-Pairwise Transformer［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 20072-20080.
[20]	ZHOU D， LIU Z， WANG J， et al. Human-object interaction detection via Disentangled Transformer［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 19546-19555.
[21]	GAO C， XU J， ZOU Y， et al. DRG： dual relation graph for human-object interaction detection［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12357. Cham： Springer， 2020： 696-712.
[22]	ZHANG F Z， CAMPBELL D， GOULD S. Spatially conditioned graphs for detecting human-object interactions［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 13299-13307.
[23]	PARK N， KIM S. How do Vision Transformers work？［EB/OL］. ［2025-01-13］.
[24]	HENDRYCKS D， GIMPEL K. Gaussian Error Linear Units （GELUs）［EB/OL］. ［2024-11-09］.
[25]	KUHN H W. The Hungarian method for the assignment problem［M］// JÜNGER M， LIEBLING T M， NADDEF D， et al. 50 years of integer programming 1958-2008. Berlin： Springer， 2010： 29-47.
[26]	REN S， HE K， GIRSHICK R， et al. Faster R-CNN： towards real-time object detection with region proposal networks［C］// Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge： MIT Press， 2015： 91-99.
[27]	REZATOFIGHI H， TSOI N， GWAK J， et al. Generalized intersection over union： a metric and a loss for bounding box regression［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 658-666.
[28]	LIN T Y， GOYAL P， GIRSHICK R， et al. Focal loss for dense object detection［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 2999-3007.
[29]	WANG T， YANG T， DANELLJAN M， et al. Learning human-object interaction detection using interaction points［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 4115-4124.
[30]	TAMURA M， OHASHI H， YOSHINAGA T. QPIC： query-based pairwise human-object interaction detection with image-wide contextual information［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 10405-10414.
[31]	ZONG D， SU S. Zero-shot human-object interaction detection via similarity propagation［J］. IEEE Transactions on Neural Networks and Learning Systems， 2024， 35（12）： 17805-17816.
[32]	LI J， LAI H， GAO G， et al. SQAB： specific query anchor boxes for human-object interaction detection［J］. Displays， 2023， 80： No.102570.
[33]	ZHANG F， SHENG L， GUO B， et al. SQA： strong guidance query with self-selected attention for human-object interaction detection［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
[34]	LI Y L， ZHOU S， HUANG X， et al. Transferable interactiveness knowledge for human-object interaction detection［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 3580-3589.
[35]	KIM D J， SUN X， CHOI J， et al. Detecting human-object interactions with action co-occurrence priors［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12366. Cham： Springer， 2020： 718-736.
[36]	LI Y L， LIU X， LU H， et al. Detailed 2D-3D joint representation for human-object interaction［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 10163-10172.
[37]	LI Y L， LIU X， WU X， et al. HOI analysis： integrating and decomposing human-object interaction［C］// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2020： 5011-5022.
[38]	HOU Z， YU B， QIAO Y， et al. Detecting human-object interaction via fabricated compositional learning［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 14641-14650.
[39]	ZHU L， LAN Q， VELASQUEZ A， et al. TMHOI： translational model for human-object interaction detection［EB/OL］. ［2024-06-20］.
[40]	YUAN H， WANG M， NI D， et al. Detecting human-object interactions with object-guided cross-modal calibrated semantics［C］// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2022： 3206-3214.
[41]	KIM B， LEE J， KANG J， et al. HOTR： end-to-end human-object interaction detection with transformers［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 74-83.

Related Articles 15

[1]	Wei ZONG, Yue ZHAO, Yin LI, Xiaona XU. Review of optimization methods for end-to-end speech-to-speech translation [J]. Journal of Computer Applications, 2025, 45(5): 1363-1371.
[2]	Dongmei XIE, Xinye BIAN, Lianfei YU, Wenbo LIU, Ziling WANG, Zhijian QU, Jiafeng YU. DeepsORF: coding sORFs prediction method based on graph coding with improved flow attention [J]. Journal of Computer Applications, 2025, 45(2): 546-555.
[3]	Ming JIANG, Linqin WANG, Hua LAI, Shengxiang GAO. End-to-end Vietnamese text normalization method based on editing constraints [J]. Journal of Computer Applications, 2025, 45(2): 362-370.
[4]	Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE. End-to-end Chinese speech recognition method with byte-level byte pair encoding [J]. Journal of Computer Applications, 2025, 45(1): 318-324.
[5]	Cong LIU, Genshun WAN, Jianqing GAO, Zhonghua FU. End-to-end speech recognition method based on prosodic features [J]. Journal of Computer Applications, 2023, 43(2): 380-384.
[6]	Lei YANG, Hongdong ZHAO, Kuaikuai YU. End-to-end speech emotion recognition based on multi-head attention [J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
[7]	GUO Shuai, SU Yang. Encrypted traffic classification method based on data stream [J]. Journal of Computer Applications, 2021, 41(5): 1386-1391.
[8]	WU Saisai, LIANG Xiaohe, XIE Nengfu, ZHOU Ailian, HAO Xinning. Annotation method for joint extraction of domain-oriented entities and relations [J]. Journal of Computer Applications, 2021, 41(10): 2858-2863.
[9]	HU Xuemin, TONG Xiuchi, GUO Lin, ZHANG Ruohan, KONG Li. End-to-end autonomous driving model based on deep visual attention neural network [J]. Journal of Computer Applications, 2020, 40(7): 1926-1931.
[10]	CHEN Xiukai, LU Zhihua, ZHOU Yu. Speech separation algorithm based on convolutional encoder decoder and gated recurrent unit [J]. Journal of Computer Applications, 2020, 40(7): 2137-2141.
[11]	JIA Yongchao, HE Xiaowei, ZHENG Zhonglong. Object tracking algorithm combining re-detection mechanism and convolutional regression network [J]. Journal of Computer Applications, 2019, 39(8): 2247-2251.
[12]	QIU Zeyu, QU Dan, ZHANG Lianhai. End-to-end speech synthesis based on WaveNet [J]. Journal of Computer Applications, 2019, 39(5): 1325-1329.
[13]	PAN Peike, WANG Yan, LUO Yong, ZHOU Jiliu. Automatic segmentation of nasopharyngeal neoplasm in MR image based on U-net model [J]. Journal of Computer Applications, 2019, 39(4): 1183-1188.
[14]	WANG Kang, DONG Yuanfei. Angular interval embedding based end-to-end voiceprint recognition model [J]. Journal of Computer Applications, 2019, 39(10): 2937-2941.
[15]	YAO Yu, RYAD Chellali. End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers [J]. Journal of Computer Applications, 2018, 38(9): 2495-2499.