Journal of Computer Applications, 2022, 42(12): 3900-3905. DOI: 10.11772/j.issn.1001-9081.2021101743
• Multimedia computing and computer simulation •
Image caption generation model with adaptive commonsense gate
You YANG 1,2, Lizhi CHEN 2, Xiaolong FANG 2, Longyue PAN 2
Received: 2021-10-11
Revised: 2021-12-17
Accepted: 2021-12-23
Online: 2021-12-31
Published: 2022-12-10
Contact: Lizhi CHEN
About author: YANG You, born in 1965 in Chongqing, Ph. D., associate professor. His research interests include digital image processing and computer vision.
You YANG, Lizhi CHEN, Xiaolong FANG, Longyue PAN. Image caption generation model with adaptive commonsense gate[J]. Journal of Computer Applications, 2022, 42(12): 3900-3905.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021101743
| Human-annotated reference caption | Transformer baseline | Proposed model |
| --- | --- | --- |
| Two men on motorcycles at a stop light. | a man riding on the back of a motorcycle. | two people |
| The view of runway from behind the windows of airport. | a group of airplanes parked at an airport. | a truck driving |
| A person feeding a cat with a banana. | a cat eating a banana with a banana. | a person feeding a banana |
| a living room with a big table next to a book shelf. | a living room with a couch and a table. | a living room filled with furniture and a large window. |
| Several zebras eat the green grass in the pasture. | a group of zebra standing on top of a lush green field. | a group of zebras grazing in a field next to the water. |
| A man getting ready to kick a soccer ball. | a man kicking a soccer ball on a field. | a soccer player in a green uniform |
Tab. 1 Examples of generated captions
| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- |
| SCST[18] | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| RFNet[1] | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 | 21.2 |
| Up-Down[7] | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| HAN[19] | 80.9 | 37.6 | 27.8 | 58.1 | 121.7 | 21.5 |
| SGAE[20] | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
| ORT[9] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 | 22.6 |
| POS-SCAN[21] | 80.2 | 38.0 | 28.5 | — | 125.9 | 22.2 |
| SRT[22] | 80.3 | 38.5 | 28.7 | 58.4 | 129.1 | 22.4 |
| Proposed model | 80.3 | 39.2 | 28.9 | 58.9 | 129.6 | 22.7 |
Tab. 2 Comparison of different image caption generation models on evaluation metrics
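The scores in Tab. 2 are the standard MS COCO captioning metrics (BLEU-1/4, METEOR, ROUGE-L, CIDEr, SPICE). As a minimal sketch, not the authors' evaluation script, such scores are typically computed with the pycocoevalcap toolkit; the example below uses toy, pre-tokenized captions and omits METEOR and SPICE, which additionally require a Java runtime:

```python
# Minimal sketch: scoring generated captions against references with
# pycocoevalcap (the standard COCO caption evaluation toolkit).
# Assumes `pip install pycocoevalcap`; captions here are toy examples.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image id -> list of ground-truth captions (pre-tokenized, lowercased)
gts = {
    "img1": ["two men on motorcycles at a stop light",
             "a pair of riders waiting at a traffic light"],
}
# res: image id -> single-element list with the model's generated caption
res = {
    "img1": ["two people on motorcycles stopped at a traffic light"],
}

for name, scorer in [("BLEU", Bleu(4)), ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu(4) returns a list [B-1, B-2, B-3, B-4]
```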
| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- |
| TED | 79.3 | 38.5 | 28.4 | 58.5 | 127.7 |
| TED-VC | 79.4 | 38.5 | 28.5 | 58.4 | 128.2 |
| TED-FVC | 79.7 | 38.8 | 28.6 | 58.7 | 128.6 |
| TED-ACG | 80.3 | 39.2 | 28.9 | 58.9 | 129.6 |
Tab. 3 Settings and results of ablation experiments
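In the ablation, TED appears to be the Transformer encoder-decoder baseline, and the -VC/-FVC/-ACG variants progressively add visual-commonsense features and the adaptive commonsense gate. The paper's exact gate formulation is not reproduced here; the following is only a hedged sketch of what an adaptive gating fusion between region features and commonsense features might look like (class name, dimensions, and projection layout are illustrative assumptions, not taken from the paper):

```python
# Hypothetical sketch of an adaptive gate that mixes visual region features
# with visual-commonsense features; the published ACG may differ in detail.
import torch
import torch.nn as nn

class AdaptiveCommonsenseGate(nn.Module):
    def __init__(self, vis_dim: int = 2048, cs_dim: int = 1024, d_model: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # project region features
        self.cs_proj = nn.Linear(cs_dim, d_model)    # project commonsense features
        self.gate = nn.Linear(2 * d_model, d_model)  # gate computed from both streams

    def forward(self, vis: torch.Tensor, cs: torch.Tensor) -> torch.Tensor:
        # vis: (batch, regions, vis_dim), cs: (batch, regions, cs_dim)
        v = self.vis_proj(vis)
        c = self.cs_proj(cs)
        g = torch.sigmoid(self.gate(torch.cat([v, c], dim=-1)))  # element-wise gate in (0, 1)
        return g * v + (1.0 - g) * c  # adaptively mixed encoder input

# usage with random toy tensors (36 detected regions per image)
fuse = AdaptiveCommonsenseGate()
fused = fuse(torch.randn(2, 36, 2048), torch.randn(2, 36, 1024))
print(fused.shape)  # torch.Size([2, 36, 512])
```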
1 | JIANG W H, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 510-526. |
2 | HUANG Y W, YOU Y D, ZHAO P. Image caption generation model with convolutional attention mechanism[J]. Journal of Computer Applications, 2020, 40(1): 23-27. 10.11772/j.issn.1001-9081.2019050943 |
3 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90 |
4 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
5 | ZHU X X, LI L X, LIU J, et al. Captioning transformer with stacked attention modules[J]. Applied Sciences, 2018, 8(5): No.739. 10.3390/app8050739 |
6 | YU J, LI J, YU Z, et al. Multimodal transformer with multi-view visual representation for image captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(12): 4467-4480. 10.1109/tcsvt.2019.2947482 |
7 | ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636 |
8 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. 10.1109/tpami.2016.2577031 |
9 | HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words[C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2021-06-13]. |
10 | WANG T, HUANG J Q, ZHANG H W, et al. Visual commonsense R-CNN[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE,2020: 10757-10767. 10.1109/cvpr42600.2020.01077 |
11 | LI W H, ZENG S Y, WANG J J. Image description generation algorithm based on improved attention mechanism[J]. Journal of Computer Applications, 2021, 41(5): 1262-1267. 10.11772/j.issn.1001-9081.2020071078 |
12 | LI G, ZHU L C, LIU P, et al. Entangled transformer for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8927-8936. 10.1109/iccv.2019.00902 |
13 | LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755. |
14 | KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. 10.1109/cvpr.2015.7298932 |
15 | VEDANTAM R, ZITNICK C L, PARIKH D, et al. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575. 10.1109/cvpr.2015.7299087 |
16 | ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398. |
17 | KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-08-03]. |
18 | RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195. 10.1109/cvpr.2017.131 |
19 | WANG W Z, CHEN Z H, HU H F. Hierarchical attention network for image captioning[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2019: 8957-8964. 10.1609/aaai.v33i01.33018957 |
20 | YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10677-10686. 10.1109/cvpr.2019.01094 |
21 | ZHOU Y E, WANG M, LIU D Q, et al. More grounded image captioning by distilling image-text matching model[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4776-4785. 10.1109/cvpr42600.2020.00483 |
22 | WANG L, BAI Z C, ZHANG Y H, et al. Show, recall, and tell: image captioning with recall mechanism[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 12176-12183. 10.1609/aaai.v34i07.6898 |