Image captioning with block-prototype contrastive alignment based on dynamic semantic mapping

doi:10.11772/j.issn.1001-9081.2026030300

Journal of Computer Applications

WANG Xin^1,2, AN Junxiu^2,3, MAO Ke^1,2

1. College of Software Engineering, Chengdu University of Information Technology
2. Institute of Parallel Computing and Big Data, Chengdu University of Information Technology
3. College of Statistics, Chengdu University of Information Technology

Received: 2026-03-26 Revised: 2026-05-12 Online: 2026-06-03 Published: 2026-06-03
About the authors: WANG Xin, born in 2000, M. S. candidate. Her research interests include image captioning. AN Junxiu, born in 1970, M. S., professor. Her research interests include data mining and intelligent computing.
Supported by: National Social Science Fund of China (22BXW048); Project of Chengdu Municipal Science and Technology Bureau (2025-YF05-00114-SN)

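Corresponding author: AN Junxiu
Author biographies: WANG Xin (born 2000), female, from Dazhou, Sichuan, M. S. candidate; main research interest: image captioning. AN Junxiu (born 1970), female, from Linfen, Shanxi, professor, M. S., CCF member; main research interests: data mining and intelligent computing. MAO Ke (born 2000), male, from Chengdu, Sichuan, M. S.; main research interest: natural language processing.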

Abstract

Abstract: Image captioning requires organizing the objects, attributes, and relationships in an image into coherent sentences. Although existing Transformer-based methods achieve good results, patch-level visual representations remain prone to semantic fragmentation, which degrades fine-grained alignment, the representation of long-tail concepts, and generation stability. To address this issue, an image captioning method based on dynamic prototype mapping is proposed. In the encoding stage, fine-grained pseudo-labels are first generated in the feature space; patch-level consistency constraints and prototype contrastive alignment are then combined to aggregate local visual features onto learnable semantic prototypes. At the same time, a dynamic mapping mechanism periodically updates the correspondence between pseudo-labels and prototypes to adapt to the evolution of features during training. In the decoding stage, prototype memory attention retrieves the semantic prototypes related to the current generation state to assist word prediction. Experimental results on the MS-COCO dataset show that, compared with the baseline model, the proposed method improves the BLEU-1 and CIDEr (Consensus-based Image Description Evaluation) metrics by 1.6 and 8.8 percentage points, respectively, and that the generated captions are more stable in terms of local semantic consistency.

Key words: image captioning, contrastive learning, semantic prototype, Transformer, multimodal
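
The three mechanisms named in the abstract (patch-to-prototype contrastive alignment, periodic remapping of pseudo-labels to prototypes, and prototype memory attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the loss form, the remapping rule, or the attention layout, so the InfoNCE-style loss, the nearest-centroid remapping, the single-head attention, the temperature value, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_prototype_contrastive_loss(patches, prototypes, targets, temperature=0.1):
    """InfoNCE-style loss (assumed form): pull each patch toward its target
    prototype and push it away from all other prototypes."""
    p = l2_normalize(patches)                      # (N, D)
    c = l2_normalize(prototypes)                   # (K, D)
    logits = p @ c.T / temperature                 # (N, K) scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(targets)), targets].mean())

def remap_labels_to_prototypes(patches, pseudo_labels, prototypes):
    """Hypothetical remapping rule: send each pseudo-label to the prototype
    closest (by cosine similarity) to the mean feature of its patches."""
    c = l2_normalize(prototypes)
    mapping = {}
    for label in np.unique(pseudo_labels):
        centroid = l2_normalize(patches[pseudo_labels == label].mean(axis=0))
        mapping[int(label)] = int(np.argmax(c @ centroid))
    return mapping

def prototype_memory_attention(state, prototypes):
    """Single-head scaled dot-product attention of one decoder state over
    the prototype memory; returns the context vector and attention weights."""
    scores = prototypes @ state / np.sqrt(state.shape[-1])  # (K,)
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ prototypes, weights

# Toy data: 3 prototypes in 8 dimensions, 20 noisy patches around each one.
rng = np.random.default_rng(0)
prototypes = np.eye(3, 8)
true_proto = np.repeat(np.arange(3), 20)
patches = prototypes[true_proto] + rng.normal(scale=0.05, size=(60, 8))

# Pseudo-label ids are arbitrary cluster ids, so they need not match prototype ids.
pseudo_labels = (true_proto + 1) % 3

# In training this step would be repeated periodically as features evolve.
mapping = remap_labels_to_prototypes(patches, pseudo_labels, prototypes)
targets = np.array([mapping[int(label)] for label in pseudo_labels])

loss_mapped = patch_prototype_contrastive_loss(patches, prototypes, targets)
loss_unmapped = patch_prototype_contrastive_loss(patches, prototypes, pseudo_labels)

# A decoder state that resembles prototype 1 should attend mostly to it.
context, weights = prototype_memory_attention(prototypes[1], prototypes)
print(mapping, loss_mapped < loss_unmapped, int(np.argmax(weights)))
```

In this toy setting the pseudo-label ids are deliberately shifted relative to the prototype ids, which is why the remapping step is needed before the contrastive loss is meaningful; in real training the prototypes would be learnable parameters updated by gradient descent rather than fixed arrays.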

CLC Number: TP391.41

WANG Xin, AN Junxiu, MAO Ke. Image captioning with block-prototype contrastive alignment based on dynamic semantic mapping[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2026030300.
