Journal of Computer Applications


Zero-shot human-object interaction detection method via multimodal collaborative prompt optimization


  • Received: 2025-07-16; Revised: 2025-09-16; Online: 2025-10-13; Published: 2025-10-13


MA Yue 1, LAI Huicheng 2, JIANG Di 1, WANG Liejun 2

  1. College of Computer Science and Technology, Xinjiang University
  2. Xinjiang University
  • Corresponding author: LAI Huicheng
  • Supported by:
    Key Research and Development Program of Xinjiang Uygur Autonomous Region; Joint Funds of the National Natural Science Foundation of China

Abstract: To address the difficulty of recognizing unseen categories in zero-shot Human-Object Interaction (HOI) detection, a zero-shot HOI detection method via multimodal collaborative prompt optimization, named MCPNet (Multimodal Cooperative Prompt Network), was proposed to enhance the model's generalization ability under zero-shot conditions. The framework jointly leveraged text-driven prompt learning and a visual feature generation mechanism, improving the recognition of unseen interactions from both the semantic and the visual modeling perspectives. First, for semantic modeling, a prompt-guided module was developed by integrating handcrafted templates with learnable prompts, which were encoded by the Contrastive Language-Image Pre-training (CLIP) text encoder to produce fine-grained semantic representations. Second, a generation module conditioned on the learnable prompts was introduced, in which a Variational AutoEncoder (VAE) was employed to model the latent semantic space and to synthesize discriminative visual features for unseen categories, thereby alleviating the long-tailed distribution problem. Finally, for visual modeling, a collaborative mechanism between local and global features was proposed: local interaction-region features were combined with global image semantics as input to the interaction detection head, enhancing HOI recognition in complex scenes. On the HICO-DET (Humans Interacting with Common Objects Detection) dataset, the proposed method achieves significant gains in mAP (mean Average Precision) on unseen categories across multiple zero-shot settings, outperforming most existing state-of-the-art methods: it surpasses CLIP4HOI by 8.59% under the UC (Unseen Combination) setting, 9.94% under the RF-UC (Rare-First UC) setting, 8.24% under the NF-UC (Non-rare-First UC) setting, and 9.37% under the UO (Unseen Object) setting. These consistent improvements across evaluation protocols demonstrate the effectiveness of the proposed method.
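The abstract's prompt-guided module follows the familiar CoOp-style recipe of prepending learnable context vectors to template and class token embeddings before text encoding. The following is a minimal numpy sketch of that idea only, not the MCPNet implementation: the real system would run a frozen CLIP text encoder, whereas here the encoder is a stand-in mean-pool, and all dimensions, class names, and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512    # CLIP-like embedding width (assumption)
N_CTX = 4    # number of learnable context vectors (assumption)

# Stand-ins for real CLIP token embeddings of a handcrafted template
# (e.g. "a photo of a person ...") and of two interaction class names.
template_emb = rng.normal(size=(3, DIM))
class_embs = {"ride bicycle": rng.normal(size=(2, DIM)),
              "hold cup":     rng.normal(size=(2, DIM))}

# Learnable context vectors; optimized during training, random init here.
ctx = rng.normal(size=(N_CTX, DIM))

def encode_prompt(cls_emb):
    """Stand-in text encoder: concatenate template tokens, learnable
    context, and class tokens, then mean-pool and L2-normalize.
    (A real system would run the frozen CLIP transformer instead.)"""
    tokens = np.concatenate([template_emb, ctx, cls_emb], axis=0)
    feat = tokens.mean(axis=0)
    return feat / np.linalg.norm(feat)

# One prompt-derived embedding per interaction class.
text_feats = np.stack([encode_prompt(e) for e in class_embs.values()])

# Score an image feature against the class embeddings (cosine similarity).
img_feat = rng.normal(size=DIM)
img_feat /= np.linalg.norm(img_feat)
logits = text_feats @ img_feat    # shape (2,)
```

Because only `ctx` is trainable while the template and encoder stay fixed, gradients from the interaction loss adapt the prompts without touching the pretrained text model, which is what makes the learned prompts transferable to unseen categories.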

Key words: human-object interaction detection, zero-shot learning, Contrastive Language-Image Pre-training (CLIP), learnable prompts, feature generation
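The feature-generation step described above, synthesizing visual features for unseen categories from a prompt-conditioned VAE, can be sketched as follows. This is a hypothetical numpy illustration of test-time sampling only: the decoder weights, widths, and the tanh output layer are assumptions, and in the actual method the decoder would be the trained VAE conditioned on each class's learnable prompt embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, Z = 512, 64    # visual-feature and latent widths (assumptions)

# Stand-in decoder weights; in the paper's setting these would come from
# a VAE trained on seen-category features.
W_dec = rng.normal(size=(Z + DIM, DIM)) * 0.02

def sample_unseen_features(prompt_emb, n):
    """Draw n latent codes from the prior z ~ N(0, I), concatenate the
    class's prompt embedding as the condition, and decode to synthetic
    visual features of shape (n, DIM)."""
    z = rng.normal(size=(n, Z))
    cond = np.concatenate([z, np.tile(prompt_emb, (n, 1))], axis=1)
    return np.tanh(cond @ W_dec)

# Synthesize 16 features for one (hypothetical) unseen interaction class.
fake_feats = sample_unseen_features(rng.normal(size=DIM), n=16)
```

Training the interaction classifier on such synthesized features alongside real seen-category features is what lets the abstract claim relief from the long-tailed distribution: rare and unseen classes receive training signal without any real images.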
