Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3900-3905.DOI: 10.11772/j.issn.1001-9081.2021101743

• Multimedia computing and computer simulation •

Image caption generation model with adaptive commonsense gate

You YANG 1,2, Lizhi CHEN 2, Xiaolong FANG 2, Longyue PAN 2

  1. National Center for Applied Mathematics in Chongqing, Chongqing 401331, China
    2. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Received:2021-10-11 Revised:2021-12-17 Accepted:2021-12-23 Online:2021-12-31 Published:2022-12-10
  • Contact: Lizhi CHEN
  • About author:YANG You, born in 1965, Ph. D., associate professor. His research interests include digital image processing, computer vision.
    FANG Xiaolong, born in 1994, M. S. candidate, CCF member. His research interests include computer vision.
    PAN Longyue, born in 1998, M. S. candidate. Her research interests include computer vision.
  • Supported by:
    Chongqing Normal University Graduate Scientific Research and Innovation Project(YKC20038);Chongqing Normal University Fund (Talent Introduction/Doctor Start-up)(21XLB032)


Abstract:

Focusing on the issues that traditional image caption models cannot make full use of image information and fuse features in only a single way, an image caption generation model with an Adaptive Commonsense Gate (ACG) was proposed. Firstly, VC R-CNN (Visual Commonsense Region-based Convolutional Neural Network) was used to extract visual commonsense features, which were input layer by layer into the Transformer encoder. Then, an ACG was designed in each encoder layer to adaptively fuse the visual commonsense features with the encoding features. Finally, the encoding features fused with commonsense information were fed into the Transformer decoder to complete the training. Training and testing were carried out on the MSCOCO dataset. The results show that the proposed model reaches 39.2, 129.6 and 22.7 respectively on the evaluation metrics BLEU-4 (BiLingual Evaluation Understudy), CIDEr (Consensus-based Image Description Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), improvements of 3.2%, 2.9% and 2.3% over the POS-SCAN (Part-Of-Speech Stacked Cross Attention Network) model. The proposed model significantly outperforms Transformer models that use only single salient region features and can describe image content accurately.
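The adaptive fusion described above (a gate that decides, per feature dimension, how much visual commonsense information to mix into each encoder layer's features) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the sigmoid-gated convex combination, the weight shapes, and the function name `adaptive_commonsense_gate` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_commonsense_gate(enc_feats, vc_feats, W, b):
    """Hypothetical ACG sketch: a sigmoid gate computed from the
    concatenated encoder and visual-commonsense features mixes the
    two sources per dimension (a learned convex combination)."""
    concat = np.concatenate([enc_feats, vc_feats], axis=-1)  # (n, 2d)
    gate = sigmoid(concat @ W + b)                           # (n, d), values in (0, 1)
    return gate * vc_feats + (1.0 - gate) * enc_feats        # (n, d)

# Toy example: 3 region features of dimension 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(3, 4))   # encoder (salient-region) features
vc = rng.normal(size=(3, 4))    # VC R-CNN commonsense features
W = rng.normal(size=(8, 4)) * 0.1
b = np.zeros(4)
fused = adaptive_commonsense_gate(enc, vc, W, b)
print(fused.shape)  # (3, 4)
```

Because the gate lies in (0, 1), each fused value stays between the corresponding encoder and commonsense values, so the fusion can fall back to pure encoding features (gate near 0) or lean on commonsense (gate near 1) as training dictates.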

Key words: image caption, natural language processing, Convolutional Neural Network (CNN), visual commonsense, Adaptive Commonsense Gate (ACG)

