Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 58-64. DOI: 10.11772/j.issn.1001-9081.2022071109
• Cross-media representation learning and cognitive reasoning •

Scene graph-aware cross-modal image captioning model

Zhiping ZHU, Yan YANG, Jie WANG
Received: 2022-07-29
Revised: 2022-11-20
Accepted: 2022-11-30
Online: 2023-01-15
Published: 2024-01-10
Contact: Yan YANG
About author: ZHU Zhiping, born in 1998 in Nanchong, Sichuan, M. S. candidate. His research interests include natural language processing and computer vision.
Zhiping ZHU, Yan YANG, Jie WANG. Scene graph-aware cross-modal image captioning model[J]. Journal of Computer Applications, 2024, 44(1): 58-64.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022071109
| Model | BLEU1 | BLEU4 | METEOR | ROUGE | CIDEr | SPICE |
|---|---|---|---|---|---|---|
| GVD[39] | 69.2 | 26.9 | 22.1 | — | 60.1 | 16.1 |
| Up-Down[12] | 69.4 | 27.3 | 21.7 | — | 56.6 | 16.0 |
| Sub-GC[34] | 69.1 | 28.2 | 22.3 | 49.0 | 60.3 | 16.7 |
| SGC-Net | 70.2 | 29.1 | 22.6 | 49.7 | 61.7 | 17.1 |
Tab. 1 Comparison of experimental results on Flickr30K dataset
| Model | BLEU1 | BLEU4 | METEOR | ROUGE | CIDEr | SPICE |
|---|---|---|---|---|---|---|
| Up-Down[12] | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 |
| Sub-GC[34] | 76.8 | 36.2 | 27.7 | 56.6 | 115.3 | 20.7 |
| SGC-Net | 77.1 | 36.3 | 28.0 | 57.1 | 114.8 | 21.3 |
Tab. 2 Comparison of experimental results on MS-COCO dataset
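The columns in Tab. 1 and Tab. 2 (BLEU, METEOR, ROUGE, CIDEr, SPICE) are the standard COCO captioning metrics. The paper does not name its evaluation toolkit, so the following is only a minimal sketch of how such scores are typically computed with the open-source pycocoevalcap package (an assumption; METEOR and SPICE additionally require a Java runtime):

```python
# Hedged sketch: scoring generated captions with pycocoevalcap
# (assumed toolkit; the paper does not state which implementation it used).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Ground-truth references and generated captions, keyed by image id.
# Each value is a list of already-tokenized caption strings.
gts = {0: ["a man is riding a horse on the beach",
           "a person rides a brown horse along the shore"]}
res = {0: ["a man rides a horse on a beach"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # [BLEU1, BLEU2, BLEU3, BLEU4]
meteor, _ = Meteor().compute_score(gts, res)  # requires Java
rouge, _ = Rouge().compute_score(gts, res)    # ROUGE-L
cider, _ = Cider().compute_score(gts, res)
spice, _ = Spice().compute_score(gts, res)    # requires Java

print(f"BLEU1={bleu[0]:.3f} BLEU4={bleu[3]:.3f} METEOR={meteor:.3f} "
      f"ROUGE={rouge:.3f} CIDEr={cider:.3f} SPICE={spice:.3f}")
```

The tables report these values multiplied by 100.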
| L | BLEU1 | BLEU4 | METEOR | ROUGE | CIDEr | SPICE | 
|---|---|---|---|---|---|---|
| 1 | 76.4 | 36.0 | 27.7 | 56.5 | 113.7 | 20.9 | 
| 2 | 76.6 | 36.0 | 27.8 | 56.6 | 113.6 | 21.0 | 
| 3 | 76.8 | 36.2 | 27.8 | 56.8 | 114.2 | 21.2 | 
| 4 | 76.9 | 36.3 | 27.9 | 56.9 | 114.5 | 21.1 | 
| 5 | 77.1 | 36.3 | 28.0 | 57.1 | 114.8 | 21.3 | 
| 6 | 76.8 | 35.9 | 27.8 | 56.7 | 114.5 | 21.0 | 
| 7 | 76.7 | 35.8 | 27.8 | 56.6 | 114.5 | 20.9 | 
| 8 | 76.5 | 35.7 | 27.7 | 56.6 | 114.2 | 21.0 | 
| 9 | 76.5 | 35.7 | 27.7 | 56.4 | 114.2 | 20.9 | 
| 10 | 76.4 | 35.8 | 27.7 | 56.7 | 113.9 | 20.8 | 
| 11 | 76.3 | 35.7 | 27.7 | 56.4 | 113.8 | 20.8 | 
Tab. 3 Influence of text length on experimental results
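Tab. 3 sweeps a single text-length parameter L from 1 to 11, and every metric peaks at L = 5. A quick sanity check over the CIDEr column, reading the numbers straight from the table:

```python
# CIDEr scores copied from Tab. 3, indexed by the text length L.
cider = {1: 113.7, 2: 113.6, 3: 114.2, 4: 114.5, 5: 114.8, 6: 114.5,
         7: 114.5, 8: 114.2, 9: 114.2, 10: 113.9, 11: 113.8}

best_L = max(cider, key=cider.get)
print(best_L, cider[best_L])  # -> 5 114.8
```

The peak row (L = 5, CIDEr 114.8) matches the SGC-Net row in Tab. 2, suggesting L = 5 is the setting used for the main MS-COCO comparison.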
| 1 | HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. 10.1162/neco.1997.9.8.1735 | 
| 2 | CHEN L, ZHANG H W, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306. 10.1109/cvpr.2017.667 | 
| 3 | PEDERSOLI M, LUCAS T, SCHMID C, et al. Areas of attention for image captioning [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1251-1259. 10.1109/iccv.2017.140 | 
| 4 | LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention [J]. Journal of Software, 2022, 33(9): 3210-3222. | 
| 5 | CHEN Y, GUO Y, XIE Y Y, et al. Offline visual aid system for the blind based on image captioning [J]. Telecommunications Science, 2022, 38(1): 61-72. 10.11959/j.issn.1000-0801.2022014 | 
| 6 | XIE Z Y, FENG Y Z, HU Y R, et al. Generating image description of rice pests and diseases using a ResNet18 feature encoder [J]. Transactions of the Chinese Society of Agricultural Engineering, 2022, 38(12): 197-206. 10.11975/j.issn.1002-6819.2022.12.023 | 
| 7 | FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images [C]// Proceedings of the 2010 European Conference on Computer Vision, LNCS 6314. Berlin: Springer, 2010: 15-29. | 
| 8 | LI S M, KULKARNI G, BERG T L, et al. Composing simple image descriptions using web-scale n-grams [C]// Proceedings of the 15th Conference on Computational Natural Language Learning. Stroudsburg, PA: ACL, 2011: 220-228. | 
| 9 | ORDONEZ V, KULKARNI G, BERG T L. Im2Text: describing images using 1 million captioned photographs [C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2011: 1143-1151. | 
| 10 | HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics [J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. 10.1613/jair.3994 | 
| 11 | GONG Y C, WANG L W, HODOSH M, et al. Improving image-sentence embeddings using large weakly annotated photo collections [C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8692. Cham: Springer, 2014: 529-545. | 
| 12 | ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636 | 
| 13 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. 10.1109/tpami.2016.2577031 | 
| 14 | YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11218. Cham: Springer, 2018: 711-727. | 
| 15 | KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [EB/OL]. (2017-02-22) [2022-05-17]. 10.48550/arXiv.1609.02907 | 
| 16 | RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195. 10.1109/cvpr.2017.131 | 
| 17 | LIU W, CHEN S H, GUO L T, et al. CPTR: full Transformer network for image captioning [EB/OL]. (2021-01-28) [2022-05-17]. | 
| 18 | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021-06-03) [2022-05-17]. | 
| 19 | BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [EB/OL]. (2016-05-19) [2022-05-17]. 10.1017/9781108608480.003 | 
| 20 | XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]// Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 2048-2057. 10.1109/cvpr.2015.7298935 | 
| 21 | HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4633-4642. 10.1109/iccv.2019.00473 | 
| 22 | LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3242-3250. 10.1109/cvpr.2017.345 | 
| 23 | PAN Y W, YAO T, LI Y H, et al. X-Linear attention networks for image captioning [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10968-10977. 10.1109/cvpr42600.2020.01098 | 
| 24 | LUO Y P, JI J Y, SUN X S, et al. Dual-level collaborative transformer for image captioning [C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2021: 2286-2293. 10.1609/aaai.v35i3.16328 | 
| 25 | ZELLERS R, YATSKAR M, THOMSON S, et al. Neural motifs: scene graph parsing with global context [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 5831-5840. 10.1109/cvpr.2018.00611 | 
| 26 | YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions [J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78. 10.1162/tacl_a_00166 | 
| 27 | LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755. | 
| 28 | KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. 10.1109/cvpr.2015.7298932 | 
| 29 | PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2002: 311-318. 10.3115/1073083.1073135 | 
| 30 | BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments [C]// Proceedings of the ACL-05 Workshop: Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: ACL, 2005: 65-72. | 
| 31 | LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]// Proceedings of the ACL-04 Workshop: Text Summarization Branches Out. Stroudsburg, PA: ACL, 2004: 74-81. 10.3115/1218955.1219032 | 
| 32 | VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575. 10.1109/cvpr.2015.7299087 | 
| 33 | ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398. | 
| 34 | ZHONG Y W, WANG L W, CHEN J S, et al. Comprehensive image captioning via scene graph decomposition [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12359. Cham: Springer, 2020: 211-229. | 
| 35 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90 | 
| 36 | KRISHNA R, ZHU Y K, GROTH O, et al. Visual genome: connecting language and vision using crowdsourced dense image annotations [J]. International Journal of Computer Vision, 2017, 123(1): 32-73. 10.1007/s11263-016-0981-7 | 
| 37 | PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2014: 1532-1543. 10.3115/v1/d14-1162 | 
| 38 | KINGMA D P, BA J L. Adam: a method for stochastic optimization [EB/OL]. (2017-01-30) [2022-02-19]. | 
| 39 | ZHOU L W, KALANTIDIS Y, CHEN X L, et al. Grounded video description [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6571-6580. 10.1109/cvpr.2019.00674 | 