Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3900-3905.DOI: 10.11772/j.issn.1001-9081.2021101743

• Multimedia computing and computer simulation •

Image caption generation model with adaptive commonsense gate

You YANG 1,2, Lizhi CHEN 2, Xiaolong FANG 2, Longyue PAN 2

  1. National Center for Applied Mathematics in Chongqing, Chongqing 401331, China
    2. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Received:2021-10-11 Revised:2021-12-17 Accepted:2021-12-23 Online:2021-12-31 Published:2022-12-10
  • Contact: Lizhi CHEN
  • About author:YANG You, born in 1965, Ph. D., associate professor. His research interests include digital image processing, computer vision.
    FANG Xiaolong, born in 1994, M. S. candidate, CCF member. His research interests include computer vision.
    PAN Longyue, born in 1998, M. S. candidate. Her research interests include computer vision.
  • Supported by:
    Chongqing Normal University Graduate Scientific Research and Innovation Project(YKC20038);Chongqing Normal University Fund (Talent Introduction/Doctor Start-up)(21XLB032)


Abstract:

Focusing on the issues that traditional image caption models cannot make full use of image information and fuse features in only a single way, an image caption generation model with an Adaptive Commonsense Gate (ACG) was proposed. Firstly, VC R-CNN (Visual Commonsense Region-based Convolutional Neural Network) was used to extract visual commonsense features, which were input layer by layer into the Transformer encoder. Then, an ACG was designed in each encoder layer to adaptively fuse the visual commonsense features with the encoding features. Finally, the encoding features fused with commonsense information were fed into the Transformer decoder to complete the training. Training and testing were carried out on the MSCOCO dataset. The results show that the proposed model reaches 39.2, 129.6 and 22.7 respectively on the evaluation metrics BLEU-4 (BiLingual Evaluation Understudy), CIDEr (Consensus-based Image Description Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), improvements of 3.2%, 2.9% and 2.3% over the POS-SCAN (Part-Of-Speech Stacked Cross Attention Network) model. The proposed model significantly outperforms Transformer models that use only single salient region features and can describe image content accurately.
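The adaptive fusion described above (a gate that decides, per feature dimension, how much visual commonsense information to mix into each encoder layer's features) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the sigmoid-gated convex combination, the weight shapes, and the function name `adaptive_commonsense_gate` are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_commonsense_gate(enc_feats, vc_feats, W, b):
    """Hypothetical ACG sketch: a sigmoid gate computed from the
    concatenated encoder and visual-commonsense features mixes the
    two sources per dimension (a learned convex combination)."""
    concat = np.concatenate([enc_feats, vc_feats], axis=-1)  # (n, 2d)
    gate = sigmoid(concat @ W + b)                           # (n, d), values in (0, 1)
    return gate * vc_feats + (1.0 - gate) * enc_feats        # (n, d)

# Toy example: 3 region features of dimension 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(3, 4))   # encoder (salient-region) features
vc = rng.normal(size=(3, 4))    # VC R-CNN commonsense features
W = rng.normal(size=(8, 4)) * 0.1
b = np.zeros(4)
fused = adaptive_commonsense_gate(enc, vc, W, b)
print(fused.shape)  # (3, 4)
```

Because the gate lies in (0, 1), each fused value stays between the corresponding encoder and commonsense values, so the fusion can fall back to pure encoding features (gate near 0) or lean on commonsense (gate near 1) as training dictates.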

Key words: image caption, natural language processing, Convolutional Neural Network (CNN), visual commonsense, Adaptive Commonsense Gate (ACG)

