Journal of Computer Applications


Image captioning with block-prototype contrastive alignment based on dynamic semantic mapping

WANG Xin1,2, AN Junxiu2,3, MAO Ke1,2

  1. College of Software Engineering, Chengdu University of Information Technology; 2. Institute of Parallel Computing and Big Data, Chengdu University of Information Technology; 3. College of Statistics, Chengdu University of Information Technology
  • Received: 2026-03-26 Revised: 2026-05-12 Online: 2026-06-03 Published: 2026-06-03
  • About author: WANG Xin, born in 2000, M. S. candidate. Her research interests include image captioning. AN Junxiu, born in 1970, M. S., professor. Her research interests include data mining and intelligent computing.
  • Supported by:
    National Social Science Fund of China (22BXW048); Project of Chengdu Municipal Science and Technology Bureau (2025-YF05-00114-SN)

基于动态语义映射的块-原型对比对齐图像字幕生成

王鑫1,2,安俊秀2,3,毛柯1,2   

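  • Corresponding author: AN Junxiu
  • About the authors: WANG Xin (born 2000), female, from Dazhou, Sichuan, M. S. candidate; main research interest: image captioning. AN Junxiu (born 1970), female, from Linfen, Shanxi, professor, M. S., CCF member; main research interests: data mining and intelligent computing. MAO Ke (born 2000), male, from Chengdu, Sichuan, M. S.; main research interest: natural language processing.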

Abstract: Image captioning requires organizing the objects, attributes, and relationships in an image into coherent sentences. Although existing Transformer-based methods achieve good results, their patch-level visual representations remain prone to semantic fragmentation, which degrades fine-grained alignment, the representation of long-tail concepts, and generation stability. To address this issue, an image captioning method based on dynamic prototype mapping is proposed. In the encoding stage, fine-grained pseudo-labels are first generated in the feature space; patch-level consistency constraints are then combined with prototype contrastive alignment to aggregate local visual features onto learnable semantic prototypes. A dynamic mapping mechanism periodically updates the correspondence between pseudo-labels and prototypes so that it tracks the evolution of the features during training. In the decoding stage, prototype memory attention retrieves the semantic prototypes relevant to the current generation state to assist word prediction. Experiments on the MS-COCO dataset show that, compared with the baseline model, the method improves BLEU-1 and CIDEr (Consensus-based Image Description Evaluation) by 1.6 and 8.8 percentage points, respectively, and produces captions with more stable local semantic consistency.

Key words: image captioning, contrastive learning, semantic prototype, Transformer, multimodal
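
The abstract names prototype contrastive alignment and a periodically updated pseudo-label-to-prototype mapping but gives no formulas. The sketch below is one plausible reading, not the authors' implementation: it assumes cosine similarity, an InfoNCE-style loss with a temperature, and a remapping rule that sends each pseudo-label to the prototype nearest its mean patch feature. The function names `prototype_contrastive_loss` and `remap_labels`, and the default `temperature`, are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prototype_contrastive_loss(patches, prototypes, pseudo_labels, mapping,
                               temperature=0.1):
    """Mean InfoNCE-style loss pulling each patch toward its mapped prototype.

    Assumption: the positive for a patch is the prototype its pseudo-label
    maps to; all other prototypes act as negatives.
    """
    protos = [l2_normalize(p) for p in prototypes]
    total = 0.0
    for feat, label in zip(patches, pseudo_labels):
        f = l2_normalize(feat)
        logits = [dot(f, p) / temperature for p in protos]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[mapping[label]]
    return total / len(patches)

def remap_labels(patches, pseudo_labels, prototypes):
    """Hypothetical dynamic mapping: each pseudo-label is assigned to the
    prototype most similar to the mean feature of its patches."""
    protos = [l2_normalize(p) for p in prototypes]
    mapping = {}
    for label in set(pseudo_labels):
        members = [f for f, l in zip(patches, pseudo_labels) if l == label]
        centroid = l2_normalize([sum(col) / len(members) for col in zip(*members)])
        sims = [dot(centroid, p) for p in protos]
        mapping[label] = sims.index(max(sims))
    return mapping

if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # toy patch features
    pseudo_labels = [0, 0, 1]                        # toy pseudo-labels
    prototypes = [[0.0, 1.0], [1.0, 0.0]]            # toy prototypes
    mapping = remap_labels(patches, pseudo_labels, prototypes)
    print(mapping, prototype_contrastive_loss(patches, prototypes,
                                              pseudo_labels, mapping))
```

In a training loop, `remap_labels` would be called every few epochs so the mapping follows the drifting features, and the loss would use the latest mapping in between. The update period and whether the mapping must be one-to-one are not stated in the abstract.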

