Chinese image captioning method based on multi-level visual and text-image dynamic interaction

Jun-Yan ZHANG1, Yi-Ming ZHAO2, Bing LIN3, Yun-Ping WU4

  1. College of Photonic and Electronic Engineering, Fujian Normal University
    2. College of Photonic and Electronic Engineering, Fujian Normal University
    3. College of Physics and Energy, Fujian Normal University
    4. College of Physics and Optoelectronic Information Science and Technology, Fujian Normal University, Fuzhou 350007, China
  • Received: 2024-05-17  Revised: 2024-08-11  Accepted: 2024-09-13  Online: 2024-09-18  Published: 2024-09-18
  • Contact: Jun-Yan ZHANG
  • Supported by:
    Key Project of the Straits Joint Funds of the National Natural Science Foundation of China; Industry-University-Research Cooperation Project of the Fujian Provincial Department of Science and Technology; Education and Teaching Research Project for Undergraduate Colleges and Universities in Fujian Province

Abstract: Image captioning technology helps computers better understand image content and achieves cross-modal interaction. To address the incomplete extraction of multi-granularity image features and the insufficient understanding of image-text correlation in Chinese image captioning tasks, this paper proposed a method that extracted multi-level visual semantic features of images and dynamically fused them during decoding. First, multi-level visual features were extracted on the encoder side, with multi-granularity features obtained through an auxiliary guidance module attached to the local feature extractor. Then, a text-image interaction module was designed to dynamically attend to the semantic associations between image and text information, and a dynamic feature fusion decoder was designed to fuse and decode the dynamically weighted image-text features through a closed-loop fusion and attention mechanism, enhancing the information without loss and yielding semantically coherent output. Finally, the Chinese image description was generated. The method was evaluated with the BLEU-n, Rouge, Meteor, and CIDEr metrics and compared with eight other methods, and it improved on the semantic-relevance metrics. Compared with the baseline model, it improved BLEU-1, BLEU-2, BLEU-3, BLEU-4, Rouge_L, Meteor, and CIDEr by 5.32%, 6.76%, 8.07%, 9.78%, 12.33%, 4.88%, and 13.16% respectively, showing that the method achieves better accuracy.
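
For orientation, here is a minimal, hypothetical PyTorch sketch of the kind of pipeline the abstract describes: multi-level (global plus local) visual tokens feed a caption decoder through a gated cross-attention step that dynamically weights image and text information. All names (TextImageInteraction, CaptionDecoder), sizes, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextImageInteraction(nn.Module):
    """Hypothetical stand-in for the paper's text-image interaction module:
    text states query the visual tokens, and a learned gate dynamically
    weights how much image context flows into each text position."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # produces the dynamic fusion weight

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) decoder states; visual: (B, N, D) multi-level visual tokens
        attended, _ = self.attn(text, visual, visual)  # text attends to the image
        w = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        return w * attended + (1.0 - w) * text         # dynamically fused states

class CaptionDecoder(nn.Module):
    """Toy captioning decoder: embed tokens, inject dynamically weighted
    visual context, then predict the next token."""
    def __init__(self, vocab: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.interact = TextImageInteraction(dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)        # (B, T, D)
        x = self.interact(x, visual)  # image-text dynamic interaction
        h, _ = self.rnn(x)
        return self.out(h)            # (B, T, vocab) next-token logits

# Toy usage: a pooled global feature plus a 7x7 grid of local features
# stand in for "multi-level" visual features.
B, D, vocab = 2, 256, 5000
visual = torch.cat([torch.randn(B, 1, D), torch.randn(B, 49, D)], dim=1)
tokens = torch.randint(0, vocab, (B, 12))
print(CaptionDecoder(vocab, D)(tokens, visual).shape)  # torch.Size([2, 12, 5000])
```

In the paper the fused features additionally pass through a closed-loop fusion-and-attention step before decoding; that refinement is omitted here for brevity.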

Key words: Chinese image captioning, multi-level visual features of images, multi-granularity, image-text interaction, dynamic fusion
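
For readers checking the metric setup: the BLEU-n scores can be reproduced with NLTK as in the toy snippet below (made-up tokenized captions, not the paper's data); Rouge_L, Meteor, and CIDEr are conventionally computed with the COCO caption evaluation toolkit. For Chinese captions, sentences are usually segmented (character-level or with a word segmenter) before scoring.

```python
# Toy BLEU-1..4 computation with NLTK; sentences are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "grass"]]]  # per-image reference sets
hypotheses = [["a", "dog", "is", "running", "on", "grass"]]  # model outputs
smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram is unmatched

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights up to n-grams
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```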

CLC number: