Journal of Computer Applications


Multimodal Object Detection Method with Controllable Text-Image Fusion Intensity

  

  • Received:2025-11-14 Revised:2026-01-23 Accepted:2026-01-26 Online:2026-02-04 Published:2026-02-04


PEI Jiangbo, LI Wei, LI Yiyang

  1. School of Aeronautics and Astronautics, Sichuan University
  • Corresponding author: LI Wei

Abstract: To address the limited expressive power and insufficient semantic information of a single visual modality in object detection within complex scenes, a multimodal object detection framework with adjustable fusion intensity was proposed. A textual modality was introduced as a semantic prior: global semantic modeling was performed on the text data to compensate, at the semantic level, for the shortcomings of the visual modality in category recognition and semantic association. Since the textual data is essentially a semantic description of the whole dataset rather than of any single image, a semantic enhancement module based on soft mapping of category predictions was designed to adaptively filter and strengthen the semantic information relevant to the current image content, effectively suppressing interference from irrelevant semantic noise. To address the uneven contributions of different modalities during multimodal fusion, an adjustable fusion intensity mechanism was proposed that dynamically adjusts the fusion weights of visual and semantic features, allowing the model to achieve optimal information fusion according to the quality of the input data and the task requirements. Experimental results on the PASCAL VOC and DUT Anti-UAV datasets show that the average precision (AP) of the proposed method is improved by 1.1 and 3.1 percentage points, respectively, over the best baseline method, verifying the effectiveness and generalization ability of the proposed method in collaborative multimodal feature modeling and object detection in complex scenes.
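The two mechanisms summarized above can be illustrated with a minimal NumPy sketch. All function names, shapes, and the scalar fusion weight `alpha` below are illustrative assumptions, not the paper's actual architecture: the semantic enhancement step soft-maps per-image category predictions onto dataset-level text embeddings, and the fusion step blends visual and semantic features with a tunable intensity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_enhancement(cls_logits, text_embed):
    """Soft mapping of category predictions (hypothetical shapes).

    cls_logits: (B, C) per-image class scores.
    text_embed: (C, D) dataset-level text embeddings, one per class.
    Classes the image likely contains contribute more to the output,
    suppressing semantic noise from irrelevant categories.
    """
    weights = softmax(cls_logits)        # (B, C) soft category weights
    return weights @ text_embed          # (B, D) image-adapted semantics

def adjustable_fusion(visual_feat, semantic_feat, alpha):
    """Blend the two modalities with a tunable intensity alpha in [0, 1].

    In the paper the weight is adjusted dynamically from input quality;
    here it is left as an external parameter for simplicity.
    """
    return (1.0 - alpha) * visual_feat + alpha * semantic_feat
```

With `alpha = 0` the fusion reduces to purely visual features, and raising `alpha` increases the contribution of the text-derived semantics; predicting `alpha` from the input (rather than fixing it) is what makes the fusion intensity adjustable per sample.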

Key words: object detection, multimodal fusion, text-image fusion, adjustable fusion intensity, semantic enhancement


