Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (5): 1520-1527. DOI: 10.11772/j.issn.1001-9081.2024050616

• Artificial intelligence

Chinese image captioning method based on multi-level visual and dynamic text-image interaction

Junyan ZHANG1, Yiming ZHAO1, Bing LIN2, Yunping WU1

  1. College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou Fujian 350117, China
    2. College of Physics and Energy, Fujian Normal University, Fuzhou Fujian 350117, China
  • Received: 2024-05-17 Revised: 2024-08-11 Accepted: 2024-09-13 Online: 2024-09-18 Published: 2025-05-10
  • Contact: Yunping WU
  • About author: ZHANG Junyan, born in 2001, M. S. candidate. Her research interests include natural language processing and image captioning.
    ZHAO Yiming, born in 1999, M. S. candidate. His research interests include large language models and natural language processing.
    LIN Bing, born in 1986, Ph. D., associate professor. His research interests include cloud-edge computing, computation offloading, and intelligent optimization computing.
    WU Yunping, born in 1971, Ph. D., professor. His research interests include embedded systems, location-based services, and industrial big data analysis.
  • Supported by:
    Key Program of National Natural Science Foundation of China-Strait Joint Fund (U1805263); Industry-Academia-Research Program of Fujian Provincial Department of Science and Technology (2022H6024); General Program on Education and Teaching Research in Undergraduate Colleges in Fujian Province (39)

基于多级视觉与图文动态交互的图像中文描述方法

张军燕1, 赵一鸣1, 林兵2, 吴允平1

  1. 福建师范大学 光电与信息工程学院,福州 350117
    2. 福建师范大学 物理与能源学院,福州 350117
  • 通讯作者: 吴允平
  • 作者简介:张军燕(2001—),女,安徽六安人,硕士研究生,主要研究方向:自然语言处理、图像文字描述
    赵一鸣(1999—),男,山东滨州人,硕士研究生,主要研究方向:大语言模型、自然语言处理
    林兵(1986—),男,福建福清人,副教授,博士,CCF会员,主要研究方向:云边计算、计算卸载、智能优化计算
    吴允平(1971—),男,福建福州人,教授,博士,主要研究方向:嵌入式系统、位置服务、行业大数据分析。
  • 基金资助:
    国家自然科学海峡联合基金重点项目(U1805263);福建省科技厅产学研项目(2022H6024);福建省本科高校教育教学研究一般项目(39)

Abstract:

Image captioning helps computers understand image content and enables cross-modal interaction. To address the incomplete extraction of multi-granularity image features and the insufficient understanding of image-text correlation in Chinese image captioning, a method that extracts multi-level visual and semantic features of images and dynamically integrates them during decoding was proposed. Firstly, multi-level visual features were extracted at the encoder, and multi-granularity features were obtained through an auxiliary guidance module of the local image feature extractor. Then, a text-image interaction module was designed to attend dynamically to the semantic associations between visual and textual information. At the same time, a dynamic feature fusion decoder was designed to perform closed-loop dynamic fusion, attention, and decoding of features carrying dynamic text-image weights, so that information was enhanced without loss and the output remained semantically relevant. Finally, fluent Chinese descriptive sentences were generated. The method was evaluated with the BLEU-n, Rouge, Meteor, and CIDEr metrics and compared with eight existing methods. Experimental results show that the proposed method improves all of the semantic relevance evaluation metrics. Specifically, compared with the baseline model NIC (Neural Image Caption), the proposed method improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, Rouge_L, Meteor, and CIDEr by 5.62%, 7.25%, 8.78%, 10.85%, 14.06%, 5.14%, and 15.16%, respectively, indicating that the method achieves good accuracy.
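
The abstract does not specify how the fusion weights are computed. As a generic illustration only, and not the authors' decoder, the following minimal sketch shows one common way to weight a visual feature vector against a textual one with an input-dependent gate. The function name `gated_fusion`, the scalar gate, and the parameters `w_v`, `w_t`, and `bias` are all assumptions introduced for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, w_v, w_t, bias=0.0):
    """Fuse two equal-length feature vectors with a scalar gate.

    Hypothetical illustration, not the paper's module:
        gate  = sigmoid(w_v . visual + w_t . text + bias)
        fused = gate * visual + (1 - gate) * text
    The gate depends on the inputs, so the mix changes per example.
    """
    if not (len(visual) == len(text) == len(w_v) == len(w_t)):
        raise ValueError("all vectors must have the same length")
    score = (
        sum(a * b for a, b in zip(w_v, visual))
        + sum(a * b for a, b in zip(w_t, text))
        + bias
    )
    gate = sigmoid(score)
    fused = [gate * v + (1.0 - gate) * t for v, t in zip(visual, text)]
    return gate, fused


# With all-zero gate parameters the gate is 0.5, so the result is the plain average.
gate, fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained model the gate parameters would be learned and the gate would typically be a vector rather than a scalar; the scalar form is used here only to keep the sketch short.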

Key words: Chinese image captioning, image multi-level visual feature, multi-granularity, image-text interaction, dynamic fusion

摘要:

图像文字描述技术可以帮助计算机更好地理解图像内容,实现跨模态交互。针对图像中文描述任务中存在的图像多粒度特征提取不全面以及图文关联性理解不充分等问题,提出一种基于多级视觉与图文动态交互的图像中文描述方法。首先,在编码器端提取多级视觉特征,通过图像局部特征提取器的辅助引导模块获取多粒度特征。其次,设计图文交互模块对图文信息的语义关联进行动态关注;同时,设计特征动态融合解码器将带有图文信息动态权重的特征经过闭环动态融合并关注与解码,以保证信息增强且无缺失,从而获得语义关联性的输出。最后,生成语义通顺的图像中文描述语句。使用BLEU-n、Rouge、Meteor、CIDEr指标评估方法的有效性并与8种不同方法进行对比。实验结果显示,所提方法的语义相关性评价指标均有提升。具体而言,与基线模型NIC(Neural Image Caption)相比,所提方法在BLEU-1、BLEU-2、BLEU-3、BLEU-4、Rouge_L、Meteor、CIDEr分别提升了5.62%、7.25%、8.78%、10.85%、14.06%、5.14%、15.16%,表明该方法具有较好的准确性。

关键词: 图像中文描述, 图像多级视觉特征, 多粒度, 图文交互, 动态融合
