Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 58-64.DOI: 10.11772/j.issn.1001-9081.2022071109

• Cross-media representation learning and cognitive reasoning •

Scene graph-aware cross-modal image captioning model

Zhiping ZHU, Yan YANG, Jie WANG

  1. College of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
  • Received:2022-07-29 Revised:2022-11-20 Accepted:2022-11-30 Online:2023-01-15 Published:2024-01-10
  • Contact: Yan YANG
  • About author: ZHU Zhiping, born in 1998, M. S. candidate. His research interests include natural language processing and computer vision.
    WANG Jie, born in 1994, Ph. D. candidate. His research interests include cross-modal learning and natural language processing.
  • Supported by:
    National Natural Science Foundation of China(61976247)

  • Corresponding author: YANG Yan, born in 1964, Ph. D., professor, CCF Distinguished Member. Her research interests include artificial intelligence, big data analysis and mining, ensemble learning and multi-view learning, and cloud computing and cloud services.

Abstract:

To address the forgetting and underutilization of the textual information of images in image captioning methods, a Scene Graph-aware Cross-modal Network (SGC-Net) was proposed. Firstly, the scene graph was used as the image's visual features, and a Graph Convolutional Network (GCN) was used for feature fusion, so that the visual and textual features lay in the same feature space. Then, the text sequence generated by the model was stored, and the corresponding position information was added as the textual features of the image, which solved the text-feature loss caused by the single-layer Long Short-Term Memory (LSTM) network. Finally, to address the over-dependence on image information and the underuse of text information, the self-attention mechanism was used to extract the significant image information and text information and fuse them. Experimental results on the Flickr30K and MS-COCO (MicroSoft Common Objects in COntext) datasets demonstrate that SGC-Net outperforms Sub-GC on BLEU1 (BiLingual Evaluation Understudy with 1-gram), BLEU4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), with improvements of 1.1, 0.9, 0.3, 0.7 and 0.4 on Flickr30K and 0.3, 0.1, 0.3, 0.5 and 0.6 on MS-COCO, respectively. These results show that SGC-Net effectively improves image captioning performance and the fluency of the generated descriptions.
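The pipeline the abstract describes (GCN fusion over scene-graph nodes, then self-attention over the concatenated visual and textual features) can be sketched roughly as follows. This is an illustrative NumPy toy, not the authors' implementation: the feature sizes, the single attention head, and the random stand-in weights are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcn_layer(node_feats, adj, weight):
    # One graph-convolution step over the scene-graph nodes:
    # add self-loops, symmetrically normalise the adjacency,
    # aggregate neighbour features, project, and apply ReLU.
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)

def self_attention_fuse(visual, textual, w_q, w_k, w_v):
    # Concatenate scene-graph (visual) features with the stored
    # textual features, then apply single-head scaled dot-product
    # self-attention so every token can attend to both modalities.
    x = np.concatenate([visual, textual], axis=0)   # (n_v + n_t, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return scores @ v                               # fused features

# Toy run with random stand-ins for learned parameters.
rng = np.random.default_rng(0)
d, d_k = 32, 16
adj = (rng.random((5, 5)) > 0.5).astype(float)      # scene-graph edges
adj = np.maximum(adj, adj.T)                        # make undirected
visual = gcn_layer(rng.standard_normal((5, d)), adj,
                   rng.standard_normal((d, d)))
textual = rng.standard_normal((7, d))  # stored caption tokens + positions
fused = self_attention_fuse(visual, textual,
                            *(rng.standard_normal((d, d_k)) for _ in range(3)))
print(fused.shape)  # (12, 16)
```

Keeping both modalities in one attention pass is what lets each generated token draw on either the scene-graph structure or the previously generated text, which is the imbalance the abstract's final step targets.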

Key words: image captioning, scene graph, attention mechanism, Long Short-Term Memory (LSTM) Network, feature fusion

