Journal of Computer Applications, 2022, 42(12): 3900-3905. DOI: 10.11772/j.issn.1001-9081.2021101743
• Multimedia computing and computer simulation •
Image caption generation model with adaptive commonsense gate
You YANG 1,2, Lizhi CHEN 2, Xiaolong FANG 2, Longyue PAN 2
Received: 2021-10-11
Revised: 2021-12-17
Accepted: 2021-12-23
Online: 2021-12-31
Published: 2022-12-10
Contact: Lizhi CHEN
About author: YANG You, born in 1965 in Chongqing, Ph. D., associate professor. His research interests include digital image processing and computer vision.
You YANG, Lizhi CHEN, Xiaolong FANG, Longyue PAN. Image caption generation model with adaptive commonsense gate[J]. Journal of Computer Applications, 2022, 42(12): 3900-3905.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021101743
| Human-annotated reference caption | Transformer baseline | Proposed model |
| --- | --- | --- |
| Two men on motorcycles at a stop light. | a man riding on the back of a motorcycle. | two people |
| The view of runway from behind the windows of airport. | a group of airplanes parked at an airport. | a truck driving |
| A person feeding a cat with a banana. | a cat eating a banana with a banana. | a person feeding a banana |
| a living room with a big table next to a book shelf. | a living room with a couch and a table. | a living room filled with furniture and a large window. |
| Several zebras eat the green grass in the pasture. | a group of zebra standing on top of a lush green field. | a group of zebras grazing in a field next to the water. |
| A man getting ready to kick a soccer ball. | a man kicking a soccer ball on a field. | a soccer player in a green uniform |
Tab. 1 Examples of generated captions
| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- |
| SCST[18] | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| RFNet[1] | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 | 21.2 |
| Up-Down[7] | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| HAN[19] | 80.9 | 37.6 | 27.8 | 58.1 | 121.7 | 21.5 |
| SGAE[20] | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
| ORT[9] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 | 22.6 |
| POS-SCAN[21] | 80.2 | 38.0 | 28.5 | — | 125.9 | 22.2 |
| SRT[22] | 80.3 | 38.5 | 28.7 | 58.4 | 129.1 | 22.4 |
| Proposed model | 80.3 | 39.2 | 28.9 | 58.9 | 129.6 | 22.7 |
Tab. 2 Comparison of different image caption generation models on evaluation metrics
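The scores in Tab. 2 are the standard MS COCO captioning metrics (BLEU-1/4, METEOR, ROUGE-L, CIDEr, SPICE). As a minimal sketch, not the authors' evaluation script, such scores are typically computed with the pycocoevalcap toolkit; the example below uses toy, pre-tokenized captions and omits METEOR and SPICE, which additionally require a Java runtime:

```python
# Minimal sketch: scoring generated captions against references with
# pycocoevalcap (the standard COCO caption evaluation toolkit).
# Assumes `pip install pycocoevalcap`; captions here are toy examples.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# gts: image id -> list of ground-truth captions (pre-tokenized, lowercased)
gts = {
    "img1": ["two men on motorcycles at a stop light",
             "a pair of riders waiting at a traffic light"],
}
# res: image id -> single-element list with the model's generated caption
res = {
    "img1": ["two people on motorcycles stopped at a traffic light"],
}

for name, scorer in [("BLEU", Bleu(4)), ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # Bleu(4) returns a list [B-1, B-2, B-3, B-4]
```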
| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
| --- | --- | --- | --- | --- | --- |
| TED | 79.3 | 38.5 | 28.4 | 58.5 | 127.7 |
| TED-VC | 79.4 | 38.5 | 28.5 | 58.4 | 128.2 |
| TED-FVC | 79.7 | 38.8 | 28.6 | 58.7 | 128.6 |
| TED-ACG | 80.3 | 39.2 | 28.9 | 58.9 | 129.6 |
Tab. 3 Settings and results of ablation experiments
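In the ablation, TED appears to be the Transformer encoder-decoder baseline, and the -VC/-FVC/-ACG variants progressively add visual-commonsense features and the adaptive commonsense gate. The paper's exact gate formulation is not reproduced here; the following is only a hedged sketch of what an adaptive gating fusion between region features and commonsense features might look like (class name, dimensions, and projection layout are illustrative assumptions, not taken from the paper):

```python
# Hypothetical sketch of an adaptive gate that mixes visual region features
# with visual-commonsense features; the published ACG may differ in detail.
import torch
import torch.nn as nn

class AdaptiveCommonsenseGate(nn.Module):
    def __init__(self, vis_dim: int = 2048, cs_dim: int = 1024, d_model: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # project region features
        self.cs_proj = nn.Linear(cs_dim, d_model)    # project commonsense features
        self.gate = nn.Linear(2 * d_model, d_model)  # gate computed from both streams

    def forward(self, vis: torch.Tensor, cs: torch.Tensor) -> torch.Tensor:
        # vis: (batch, regions, vis_dim), cs: (batch, regions, cs_dim)
        v = self.vis_proj(vis)
        c = self.cs_proj(cs)
        g = torch.sigmoid(self.gate(torch.cat([v, c], dim=-1)))  # element-wise gate in (0, 1)
        return g * v + (1.0 - g) * c  # adaptively mixed encoder input

# usage with random toy tensors (36 detected regions per image)
fuse = AdaptiveCommonsenseGate()
fused = fuse(torch.randn(2, 36, 2048), torch.randn(2, 36, 1024))
print(fused.shape)  # torch.Size([2, 36, 512])
```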
1 | JIANG W H, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 510-526. |
2 | HUANG Y W, YOU Y D, ZHAO P. Image caption generation model with convolutional attention mechanism[J]. Journal of Computer Applications, 2020, 40(1): 23-27. 10.11772/j.issn.1001-9081.2019050943 |
3 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90 |
4 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
5 | ZHU X X, LI L X, LIU J, et al. Captioning transformer with stacked attention modules[J]. Applied Sciences, 2018, 8(5): No.739. 10.3390/app8050739 |
6 | YU J, LI J, YU Z, et al. Multimodal transformer with multi-view visual representation for image captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(12): 4467-4480. 10.1109/tcsvt.2019.2947482 |
7 | ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636 |
8 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. 10.1109/tpami.2016.2577031 |
9 | HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words[C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2021-06-13]. |
10 | WANG T, HUANG J Q, ZHANG H W, et al. Visual commonsense R-CNN[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE,2020: 10757-10767. 10.1109/cvpr42600.2020.01077 |
11 | LI W H, ZENG S Y, WANG J J. Image description generation algorithm based on improved attention mechanism[J]. Journal of Computer Applications, 2021, 41(5): 1262-1267. 10.11772/j.issn.1001-9081.2020071078 |
12 | LI G, ZHU L C, LIU P, et al. Entangled transformer for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8927-8936. 10.1109/iccv.2019.00902 |
13 | LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755. |
14 | KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. 10.1109/cvpr.2015.7298932 |
15 | VEDANTAM R, ZITNICK C L, PARIKH D, et al. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575. 10.1109/cvpr.2015.7299087 |
16 | ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398. |
17 | KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-08-03]. |
18 | RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195. 10.1109/cvpr.2017.131 |
19 | WANG W Z, CHEN Z H, HU H F. Hierarchical attention network for image captioning[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2019: 8957-8964. 10.1609/aaai.v33i01.33018957 |
20 | YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10677-10686. 10.1109/cvpr.2019.01094 |
21 | ZHOU Y E, WANG M, LIU D Q, et al. More grounded image captioning by distilling image-text matching model[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 4776-4785. 10.1109/cvpr42600.2020.00483 |
22 | WANG L, BAI Z C, ZHANG Y H, et al. Show, recall, and tell: image captioning with recall mechanism[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 12176-12183. 10.1609/aaai.v34i07.6898 |