基于图注意力网络的全局图像描述生成方法

doi:10.11772/j.issn.1001-9081.2022040513

《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (5): 1409-1415.DOI: 10.11772/j.issn.1001-9081.2022040513

• 人工智能 • 上一篇

基于图注意力网络的全局图像描述生成方法

隋佳宏¹, 毛莺池¹^,²(), 于慧敏¹, 王子成³, 平萍¹^,²

^1.河海大学计算机与信息学院，南京 210098
^2.水利部水利大数据重点实验室（河海大学），南京 210098
^3.中国电建集团昆明勘测设计研究院有限公司，昆明 650051

收稿日期:2022-04-05 修回日期:2022-07-11 接受日期:2022-07-14 发布日期:2023-05-08 出版日期:2023-05-10
通讯作者: 毛莺池
作者简介:隋佳宏（1998—），女，山东烟台人，硕士研究生，CCF会员，主要研究方向：计算机视觉
毛莺池（1976—），女，上海人，教授，博士，CCF高级会员，主要研究方向：边缘智能计算 yingchimao@hhu.edu.cn
于慧敏（1998—），女，山西大同人，硕士研究生，CCF会员，主要研究方向：计算机视觉
王子成（1990—），男，湖北荆州人，工程师，硕士，主要研究方向：数字图像处理、三维建模
平萍（1982—），女，江苏吴江人，副教授，博士，CCF会员，主要研究方向：数字图像处理。
基金资助:
国家自然科学基金资助项目(61902110);江苏省重点研发计划项目(BE2020729);华能集团总部科技项目(HNKJ19?H12)

Global image captioning method based on graph attention network

Jiahong SUI¹, Yingchi MAO¹^,²(), Huimin YU¹, Zicheng WANG³, Ping PING¹^,²

^1.College of Computer and Information，Hohai University，Nanjing Jiangsu 210098，China
^2.Key Laboratory of Water Big Data Technology of Ministry of Water Resources （Hohai University），Nanjing Jiangsu 210098，China
^3.Power China Kunming Engineering Corporation Limited，Kunming Yunnan 650051，China

Received:2022-04-05 Revised:2022-07-11 Accepted:2022-07-14 Online:2023-05-08 Published:2023-05-10
Contact: Yingchi MAO
About author:SUI Jiahong， born in 1998， M. S. candidate. Her research interests include computer vision.
MAO Yingchi， born in 1976， Ph. D.， professor. Her research interests include edge intelligent computing.
YU Huimin， born in 1998， M. S. candidate. Her research interests include computer vision.
WANG Zicheng， born in 1990， M. S.， engineer. His research interests include digital image processing， three-dimensional modeling.
PING Ping， born in 1982， Ph. D.， associate professor. Her research interests include digital image processing.
Supported by:
National Natural Science Foundation of China(61902110);Key Research and Development Program of Jiangsu Province(BE2020729);Science and Technology Project of China Huaneng Group Headquarter(HNKJ19-H12)

摘要/Abstract

摘要：

现有图像描述生成方法仅考虑网格的空间位置特征，网格特征交互不足，并且未充分利用图像的全局特征。为生成更高质量的图像描述，提出一种基于图注意力网络（GAT）的全局图像描述生成方法。首先，利用多层卷积神经网络（CNN）进行视觉编码，提取给定图像的网格特征和整幅图像特征，并构建网格特征交互图；然后，通过GAT将特征提取问题转化成节点分类问题，包括一个全局节点和多个局部节点，更新优化后可以充分利用全局和局部特征；最后，基于Transformer的解码模块利用改进的视觉特征生成图像描述。在Microsoft COCO数据集上的实验结果表明，所提方法能有效捕捉图像的全局和局部特征，在CIDEr（Consensus-based Image Description Evaluation）指标上达到了133.1%。可见基于GAT的全局图像描述生成方法能有效提高文字描述图像的准确度，从而可以使用文字对图像进行分类、检索、分析等处理。

关键词: 网格特征, 图注意力网络, 卷积神经网络, 图像描述生成, 全局特征

Abstract:

The existing image captioning methods only focus on the grid spatial location features without enough grid feature interaction and full use of image global features. To generate higher-quality image captions， a global image captioning method based on Graph ATtention network （GAT） was proposed. Firstly， a multi-layer Convolutional Neural Network （CNN） was utilized for visual encoding， extracting the grid features and entire image features of the given image and building a grid feature interaction graph. Then， by using GAT， the feature extraction problem was transformed into a node classification problem， including a global node and many local nodes， and the global and local features were able to be fully utilized after updating the optimization. Finally， through the Transformer-based decoding module， the improved visual features were adopted to realize image captioning. Experimental results on the Microsoft COCO dataset demonstrated that the proposed method effectively captured the global and local features of the image， achieving 133.1% in CIDEr （Consensus-based Image Description Evaluation） metric. It can be seen that the proposed image captioning method is effective in improving the accuracy of image captioning， thus allowing processing tasks such as classification， retrieval， and analysis of images by words.

Key words: grid feature, Graph ATttention network (GAT), Convolutional Neural Network (CNN), image captioning, global feature

中图分类号:

TP183

隋佳宏, 毛莺池, 于慧敏, 王子成, 平萍. 基于图注意力网络的全局图像描述生成方法[J]. 计算机应用, 2023, 43(5): 1409-1415.

Jiahong SUI, Yingchi MAO, Huimin YU, Zicheng WANG, Ping PING. Global image captioning method based on graph attention network[J]. Journal of Computer Applications, 2023, 43(5): 1409-1415.

图/表 10

参考文献 35

1	HOSSAIN M Z， SOHEL F， SHIRATUDDIN M F， et al. A comprehensive survey of deep learning for image captioning［J］. ACM Computing Surveys， 2019， 51（6）： No.118. 10.1145/3295748
2	MIKOLOV T， KARAFIÁT M， BURGET L， et al. Recurrent neural network based language model［C］// Proceedings of the INTERSPEECH 2010. ［S.l.］： International Speech Communication Association， 2010： 1045-1048. 10.21437/interspeech.2010-343
3	李康康，张静. 基于注意力机制的多层次编码和解码的图像描述模型［J］. 计算机应用， 2021， 41（9）：2504-2509. 10.11772/j.issn.1001-9081.2020111838
	LI K K， ZHANG J. Multi-layer encoding and decoding model for image captioning based on attention mechanism［J］. Journal of Computer Applications， 2021， 41（9）： 2504-2509. 10.11772/j.issn.1001-9081.2020111838
4	HE K M， ZHANG X Y， REN S Q， et al. Deep residual learning for image recognition［C］// Proceedings of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2016： 770-778. 10.1109/cvpr.2016.90
5	REN S Q， HE K M， GIRSHICK R， et al. Faster R-CNN： towards real-time object detection with region proposal networks［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2017， 39（6）：1137-1149. 10.1109/tpami.2016.2577031
6	ANDERSON P， HE X D， BUEHLER C， et al. Bottom-up and top-down attention for image captioning and visual question answering［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 6077-6086. 10.1109/cvpr.2018.00636
7	HERDADE S， KAPPELER A， BOAKYE K， et al. Image captioning： transforming objects into words［C/OL］// Proceedings of the 33rd Conference on Neural Information Processing Systems ［2022-02-19］..
8	HUANG L， WANG W M， CHEN J， et al. Attention on attention for image captioning［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 4633-4642. 10.1109/iccv.2019.00473
9	JI J Y， LUO Y P， SUN X S， et al. Improving image captioning by leveraging intra- and inter-layer global representation in Transformer network［C］// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2021： 1655-1663. 10.1609/aaai.v35i2.16258
10	GUO L T， LIU J， ZHU X X， et al. Normalized and geometry-aware self-attention network for image captioning［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 10324-10333. 10.1109/cvpr42600.2020.01034
11	JIANG H Z， MISRA I， ROHRBACH M， et al. In defense of grid features for visual question answering［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 10264-10273. 10.1109/cvpr42600.2020.01028
12	KRISHNA R， ZHU Y K， GROTH O， et al. Visual Genome： connecting language and vision using crowdsourced dense image annotations［J］. International Journal of Computer Vision， 2017， 123（1）： 32-73. 10.1007/s11263-016-0981-7
13	ZHANG X Y， SUN X S， LUO Y P， et al. RSTNet： captioning with adaptive attention on visual and non-visual words［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 15460-15469. 10.1109/cvpr46437.2021.01521
14	LUO Y P， JI J Y， SUN X S， et al. Dual-level collaborative Transformer for image captioning［C］// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2021： 2286-2293. 10.1609/aaai.v35i3.16328
15	YAO T， PAN Y W， LI Y H， et al. Exploring visual relationship for image captioning［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11218. Cham： Springer， 2018： 711-727.
16	GUO L T， LIU J， TANG J H， et al. Aligning linguistic words and visual semantic units for image captioning［C］// Proceedings of the 27th ACM International Conference on Multimedia. New York： ACM， 2019： 765-773. 10.1145/3343031.3350943
17	KIPF T N， WELLING M. Semi-supervised classification with graph convolutional networks［EB/OL］. （2017-02-22）［2022-02-17］.. 10.48550/arXiv.1609.02907
18	YAO T， PAN Y W， LI Y H， et al. Hierarchy parsing for image captioning［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 2621-2629. 10.1109/iccv.2019.00271
19	TAI K S， SOCHER R， MANNING C D. Improved semantic representations from tree-structured long short-term memory networks［C］// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. Stroudsburg， PA： ACL， 2015： 1556-1566. 10.3115/v1/p15-1150
20	ZHENG Q T， WANG Y P. Graph self-attention network for image captioning［C］// Proceedings of the IEEE/ACS 17th International Conference on Computer Systems and Applications. Piscataway： IEEE， 2020： 1-8. 10.1109/aiccsa50499.2020.9316518
21	VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017：6000-6010.
22	RENNIE S J， MARCHERET E， MROUEH Y， et al. Self-critical sequence training for image captioning［C］// Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 1179-1195. 10.1109/cvpr.2017.131
23	CORNIA M， STEFANINI M， BARALDI L， et al. Meshed-memory transformer for image captioning［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 10575-10584. 10.1109/cvpr42600.2020.01059
24	LIN T Y， MAIRE M， BELONGIE S， et al. Microsoft COCO： common objects in context［C］// Proceedings of the 2014 European Conference on Computer Vision， LNCS 8693. Cham： Springer， 2014： 740-755.
25	KARPATHY A， LI F F. Deep visual-semantic alignments for generating image descriptions［C］// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2015： 3128-3137. 10.1109/cvpr.2015.7298932
26	PAPINENI K， ROUKOS S， WARD T， et al. BLEU： a method for automatic evaluation of machine translation［C］// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： ACL， 2002： 311-318. 10.3115/1073083.1073135
27	BANERJEE S， LAVIE A. METEOR： an automatic metric for MT evaluation with improved correlation with human judgments［C］// Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg， PA： ACL， 2005： 65-72.
28	VEDANTAM R， ZITNICK C L， PARIKH D. CIDEr： consensus-based image description evaluation［C］// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2015： 4566-4575. 10.1109/cvpr.2015.7299087
29	LIN C Y. ROUGE： a package for automatic evaluation of summaries［C］// Proceedings of 2004 ACL Workshop on Text Summarization Branches Out. Stroudsburg， PA： ACL， 2004： 74-81. 10.3115/1218955.1219032
30	ANDERSON P， FERNANDO B， JOHNSON M， et al. SPICE： semantic propositional image caption evaluation［C］// Proceedings of the 2016 European Conference on Computer Vision， LNCS 9909. Cham： Springer， 2016： 382-398.
31	KINGMA D P， BA J. Adam： a method for stochastic optimization［EB/OL］. （2017-01-30）［2022-02-19］..
32	JIANG W H， MA L， JIANG Y G， et al. Recurrent fusion network for image captioning［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11206. Cham： Springer， 2018： 510-526.
33	YANG X， TANG K H， ZHANG H W， et al. Auto-encoding scene graphs for image captioning［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 10677-10686. 10.1109/cvpr.2019.01094
34	PAN Y W， YAO T， LI Y H， et al. X-Linear attention networks for image captioning［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 10968-10977. 10.1109/cvpr42600.2020.01098
35	LIU W， CHEN S H， GUO L T， et al. CPTR： full Transformer network for image captioning［EB/OL］. （2021-01-28）［2022-02-11］..

方法	B1	B4	METEOR	CIDEr	ROUGE-L	SPICE
SCST^［22］	—	34.2	26.7	114.0	55.7	—
Up-Down^［6］	79.8	36.3	27.7	120.1	56.9	21.4
RFNet^［32］	79.1	36.5	27.7	121.9	57.7	21.2
GCN-LSTM^［15］	80.5	38.2	28.5	127.6	58.3	22.0
SGAE^［33］	80.8	38.4	28.4	127.8	58.6	22.1
ORT^［7］	80.5	38.6	28.7	128.3	58.4	22.6
CPTR^［35］	81.7	40.0	29.1	129.4	59.4	—
AoA^［8］	80.2	38.9	29.2	129.8	58.8	22.4
M2^［23］	80.8	39.1	29.2	131.2	58.6	22.6
GET^［9］	81.5	38.8	29.0	131.6	58.9	22.8
X-Transformer^［34］	80.9	39.7	29.5	132.8	59.1	23.4
本文方法	81.2	39.3	29.7	133.1	59.2	22.8

方法	B1	B4	METEOR	CIDEr	ROUGE-L	SPICE
SCST^［22］	—	34.2	26.7	114.0	55.7	—
Up-Down^［6］	79.8	36.3	27.7	120.1	56.9	21.4
RFNet^［32］	79.1	36.5	27.7	121.9	57.7	21.2
GCN-LSTM^［15］	80.5	38.2	28.5	127.6	58.3	22.0
SGAE^［33］	80.8	38.4	28.4	127.8	58.6	22.1
ORT^［7］	80.5	38.6	28.7	128.3	58.4	22.6
CPTR^［35］	81.7	40.0	29.1	129.4	59.4	—
AoA^［8］	80.2	38.9	29.2	129.8	58.8	22.4
M2^［23］	80.8	39.1	29.2	131.2	58.6	22.6
GET^［9］	81.5	38.8	29.0	131.6	58.9	22.8
X-Transformer^［34］	80.9	39.7	29.5	132.8	59.1	23.4
本文方法	81.2	39.3	29.7	133.1	59.2	22.8

全局节点	交互方式	区域特征	B1	B4	METEOR	CIDEr	ROUGE-L	SPICE
w/o	相邻	w/o	80.5	38.7	29.5	129.2	58.6	22.4
w/	邻域	w/o	80.8	39.2	29.4	130.5	59.0	22.1
w/	相邻	w/	81.0	39.1	29.5	130.8	59.2	22.6
w/	相邻	w/o	81.2	39.3	29.7	133.1	59.2	22.8

全局节点	交互方式	区域特征	B1	B4	METEOR	CIDEr	ROUGE-L	SPICE
w/o	相邻	w/o	80.5	38.7	29.5	129.2	58.6	22.4
w/	邻域	w/o	80.8	39.2	29.4	130.5	59.0	22.1
w/	相邻	w/	81.0	39.1	29.5	130.8	59.2	22.6
w/	相邻	w/o	81.2	39.3	29.7	133.1	59.2	22.8

图序	描述方法	描述结果
（a）	GT	A man catching a baseball as another slides into the base.
	Base	A baseball player is throwing a baseball.
	本文方法	A man is catching the baseball， *and another is throwing the ball* .
（b）	GT	A black bird standing on top of a power pole.
	Base	Two birds sitting on top of a wooden poset.
	本文方法	*A black* birtd sitting on top of a wire.
（c）	GT	A polar bear is standing in some snow.
	Base	A polar bear is standing in the water.
	本文方法	A polar bear is standing in the *snow* .
（d）	GT	A child feeding a giraffe from the palm of his hand.
	Base	A young boy feeding a giraffe at a zoo.
	本文方法	A young boy feeding a giraffe *with a hand* .

基于图注意力网络的全局图像描述生成方法

Global image captioning method based on graph attention network

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 10

参考文献 35

相关文章 15

编辑推荐

Metrics

[1]	许睿, 梁爽, 万航, 文益民, 沈世铭, 李建. 基于烛台图模式匹配的PM_2.5扩散特征的提取[J]. 《计算机应用》唯一官方网站, 2023, 43(5): 1394-1400.
[2]	樊小宇, 蔺素珍, 王彦博, 刘峰, 李大威. 基于残差图卷积神经网络的高倍欠采样核磁共振图像重建算法[J]. 《计算机应用》唯一官方网站, 2023, 43(4): 1261-1268.
[3]	杨海宇, 郭文普, 康凯. 基于卷积长短时深度神经网络的信号调制方式识别方法[J]. 《计算机应用》唯一官方网站, 2023, 43(4): 1318-1322.
[4]	刘今越, 李慧宇, 贾晓辉, 李佳蕊. 基于人体模型约束的步态动态识别方法[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 972-977.
[5]	李振亮, 李波. 基于矩阵分解的卷积神经网络改进方法[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 685-691.
[6]	张秋余, 王煜坤. 基于改进Inception网络的语音分类模型[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 909-915.
[7]	吕学强, 张煜楠, 韩晶, 崔运鹏, 李欢. 融合边特征与注意力的表格结构识别模型[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 752-758.
[8]	李海丰, 张凡, 朴敏楠, 王怀超, 李南莎, 桂仲成. 基于通道和空间注意力的机场道面地下目标自动检测[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 930-935.
[9]	王萍, 陈楠, 鲁磊. 基于场景先验及注意力引导的跌倒检测算法[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 529-535.
[10]	汪洋, 傅洪亮, 陶华伟, 杨静, 谢跃, 赵力. 基于决策边界优化域自适应的跨库语音情感识别[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 374-379.
[11]	王若莹, 吕凡, 赵柳清, 胡伏原. 融合用户需求和边界约束的平面图生成算法[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 575-582.
[12]	倪苒岩, 张轶. 基于视频时空特征的行为识别方法[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 521-528.
[13]	王佑芯, 陈斌. 基于深度对比网络的印刷缺陷检测方法[J]. 《计算机应用》唯一官方网站, 2023, 43(1): 250-258.
[14]	申志军, 穆丽娜, 高静, 史远航, 刘志强. 细粒度图像分类综述[J]. 《计算机应用》唯一官方网站, 2023, 43(1): 51-60.
[15]	林荐壮, 杨文忠, 谭思翔, 周乐鑫, 陈丹妮. 融合滤波增强和反转注意力网络用于息肉分割[J]. 《计算机应用》唯一官方网站, 2023, 43(1): 265-272.