Journal of Computer Applications


Multimodal Object Detection Method with Controllable Text-Image Fusion Intensity

  

  • Received:2025-11-14 Revised:2026-01-23 Accepted:2026-01-26 Online:2026-02-04 Published:2026-02-04


PEI Jiangbo, LI Wei, LI Yiyang

  1. School of Aeronautics and Astronautics, Sichuan University
  • Corresponding author: LI Wei

Abstract: To address the limited expressive power and insufficient semantic information of a single visual modality in object detection within complex scenes, a multimodal object detection framework with adjustable fusion intensity was proposed. A textual modality was introduced as a semantic prior: global semantic modeling was performed on the text data to compensate, at the semantic level, for the shortcomings of the visual modality in category recognition and semantic association. Since the textual data is essentially a semantic description of the whole dataset rather than of any single image, a semantic enhancement module based on soft mapping of category predictions was designed to adaptively filter and strengthen the semantic information relevant to the current image content, effectively suppressing interference from irrelevant semantic noise. To address the uneven contributions of different modalities during multimodal fusion, an adjustable fusion intensity mechanism was proposed that dynamically adjusts the fusion weights of visual and semantic features, allowing the model to achieve optimal information fusion according to the quality of the input data and the task requirements. Experimental results on the PASCAL VOC and DUT Anti-UAV datasets show that the average precision (AP) of the proposed method is improved by 1.1 and 3.1 percentage points, respectively, over the best baseline method, verifying the effectiveness and generalization ability of the proposed method in collaborative multimodal feature modeling and object detection in complex scenes.
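The two mechanisms summarized above can be illustrated with a minimal NumPy sketch. All function names, shapes, and the scalar fusion weight `alpha` below are illustrative assumptions, not the paper's actual architecture: the semantic enhancement step soft-maps per-image category predictions onto dataset-level text embeddings, and the fusion step blends visual and semantic features with a tunable intensity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_enhancement(cls_logits, text_embed):
    """Soft mapping of category predictions (hypothetical shapes).

    cls_logits: (B, C) per-image class scores.
    text_embed: (C, D) dataset-level text embeddings, one per class.
    Classes the image likely contains contribute more to the output,
    suppressing semantic noise from irrelevant categories.
    """
    weights = softmax(cls_logits)        # (B, C) soft category weights
    return weights @ text_embed          # (B, D) image-adapted semantics

def adjustable_fusion(visual_feat, semantic_feat, alpha):
    """Blend the two modalities with a tunable intensity alpha in [0, 1].

    In the paper the weight is adjusted dynamically from input quality;
    here it is left as an external parameter for simplicity.
    """
    return (1.0 - alpha) * visual_feat + alpha * semantic_feat
```

With `alpha = 0` the fusion reduces to purely visual features, and raising `alpha` increases the contribution of the text-derived semantics; predicting `alpha` from the input (rather than fixing it) is what makes the fusion intensity adjustable per sample.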

Key words: object detection, multimodal fusion, text-image fusion, adjustable fusion intensity, semantic enhancement


