[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2425-2433.
[2] YU J, ZHANG W, LU Y, et al. Reasoning on the relation: enhancing visual representation for visual question answering and cross-modal retrieval [J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196-3209.
[3] LU S, LIU M, YIN L, et al. The multi-modal fusion in visual question answering: a review of attention mechanisms [J]. PeerJ Computer Science, 2023, 9: No.e1400.
[4] 李祥, 范志广, 李学相, 等. 基于深度学习的视觉问答研究综述 [J]. 计算机科学, 2023, 50(5): 177-188. (LI X, FAN Z G, LI X X, et al. Survey of visual question answering based on deep learning [J]. Computer Science, 2023, 50(5): 177-188.)
[5] MALINOWSKI M, ROHRBACH M, FRITZ M. Ask your neurons: a neural-based approach to answering questions about images [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1-9.
[6] KIM J H, LEE S W, KWAK D, et al. Multimodal residual learning for visual QA [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 361-369.
[7] KIM J H, JUN J, ZHANG B T. Bilinear attention networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 1571-1581.
[8] TENEY D, LIU L, VAN DEN HENGEL A. Graph-structured representations for visual question answering [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3233-3241.
[9] YANG Z, HE X, GAO J, et al. Stacked attention networks for image question answering [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 21-29.
[10] RAHMAN T, CHOU S H, SIGAL L, et al. An improved attention for visual question answering [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1653-1662.
[11] ZENG Y, ZHANG X, LI H. Multi-grained vision language pre-training: aligning texts with visual concepts [C]// Proceedings of the 39th International Conference on Machine Learning. New York: JMLR.org, 2022: 25994-26009.
[12] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[13] JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10264-10273.
[14] NGUYEN B X, DO T, TRAN H, et al. Coarse-to-fine reasoning for visual question answering [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2022: 4558-4566.
[15] ZHOU H, ZHANG J, LUO T, et al. Debiased scene graph generation for dual imbalance learning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4274-4288.
[16] ZHOU H, YANG Y, LUO T, et al. A unified deep sparse graph attention network for scene graph generation [J]. Pattern Recognition, 2022, 123: No.108367.
[17] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[18] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding [C]// Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[19] CHEN Y C, LI L, YU L, et al. UNITER: universal image-text representation learning [C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 104-120.
[20] YANG Z, QIN Z, YU J, et al. Scene graph reasoning with prior visual relationship for visual question answering [C]// Proceedings of the 2020 IEEE International Conference on Image Processing. Piscataway: IEEE, 2020: 1411-1415.
[21] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6274-6283.
[22] XIONG P, SHEN Y, JIN H. MGA-VQA: multi-granularity alignment for visual question answering [EB/OL]. [2024-02-12].
[23] JING C, JIA Y, WU Y, et al. Maintaining reasoning consistency in compositional visual question answering [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5089-5098.
[24] ANDREAS J, ROHRBACH M, DARRELL T, et al. Neural module networks [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 39-48.
[25] CHEN W, GAN Z, LI L, et al. Meta module network for compositional visual reasoning [C]// Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2021: 655-664.
[26] HU R, ROHRBACH A, DARRELL T, et al. Language-conditioned graph networks for relational reasoning [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 10293-10302.
[27] JING C, JIA Y, WU Y, et al. Learning the dynamics of visual relational reasoning via reinforced path routing [C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 1122-1130.
[28] 周浩. 图像语义场景图生成方法研究 [D]. 长沙: 国防科技大学, 2021: 133-136. (ZHOU H. Scene graph generation for image semantic understanding and representation [D]. Changsha: National University of Defense Technology, 2021: 133-136.)
[29] TANG K, ZHANG H, WU B, et al. Learning to compose dynamic tree structures for visual contexts [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6612-6621.
[30] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6693-6702.
[31] GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6632-6641.
[32] SHA F, CHAO W L, HU H. Learning answer embeddings for visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 5428-5436.
[33] DO T, TRAN H, DO T T, et al. Compact trilinear interaction for visual question answering [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 392-401.
[34] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from Transformers [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 5100-5111.
[35] KIM W, SON B, KIM I. ViLT: vision-and-language Transformer without convolution or region supervision [C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 5583-5594.
[36] LI X, YIN X, LI C, et al. OSCAR: object-semantics aligned pre-training for vision-language tasks [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12375. Cham: Springer, 2020: 121-137.