[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2425-2433.
[2] YU J, ZHANG W, LU Y, et al. Reasoning on the relation: enhancing visual representation for visual question answering and cross-modal retrieval [J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196-3209.
[3] LU S, LIU M, YIN L, et al. The multi-modal fusion in visual question answering: a review of attention mechanisms [J]. PeerJ Computer Science, 2023, 9: No.e1400.
[4] LI X, FAN Z G, LI X X, et al. Survey of visual question answering based on deep learning [J]. Computer Science, 2023, 50(5): 177-188. (in Chinese)
[5] MALINOWSKI M, ROHRBACH M, FRITZ M. Ask your neurons: a neural-based approach to answering questions about images [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1-9.
[6] KIM J H, LEE S W, KWAK D, et al. Multimodal residual learning for visual QA [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 361-369.
[7] KIM J H, JUN J, ZHANG B T. Bilinear attention networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 1571-1581.
[8] TENEY D, LIU L, VAN DEN HENGEL A. Graph-structured representations for visual question answering [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3233-3241.
[9] YANG Z, HE X, GAO J, et al. Stacked attention networks for image question answering [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 21-29.
[10] RAHMAN T, CHOU S H, SIGAL L, et al. An improved attention for visual question answering [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1653-1662.
[11] ZENG Y, ZHANG X, LI H. Multi-grained vision language pre-training: aligning texts with visual concepts [C]// Proceedings of the 39th International Conference on Machine Learning. New York: JMLR.org, 2022: 25994-26009.
[12] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[13] JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10264-10273.
[14] NGUYEN B X, DO T, TRAN H, et al. Coarse-to-fine reasoning for visual question answering [C]// Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2022: 4558-4566.
[15] ZHOU H, ZHANG J, LUO T, et al. Debiased scene graph generation for dual imbalance learning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4274-4288.
[16] ZHOU H, YANG Y, LUO T, et al. A unified deep sparse graph attention network for scene graph generation [J]. Pattern Recognition, 2022, 123: No.108367.
[17] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[18] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding [C]// Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[19] CHEN Y C, LI L, YU L, et al. UNITER: universal image-text representation learning [C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 104-120.
[20] YANG Z, QIN Z, YU J, et al. Scene graph reasoning with prior visual relationship for visual question answering [C]// Proceedings of the 2020 IEEE International Conference on Image Processing. Piscataway: IEEE, 2020: 1411-1415.
[21] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6274-6283.
[22] XIONG P, SHEN Y, JIN H. MGA-VQA: multi-granularity alignment for visual question answering [EB/OL]. [2024-02-12].
[23] JING C, JIA Y, WU Y, et al. Maintaining reasoning consistency in compositional visual question answering [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5089-5098.
[24] ANDREAS J, ROHRBACH M, DARRELL T, et al. Neural module networks [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 39-48.
[25] CHEN W, GAN Z, LI L, et al. Meta module network for compositional visual reasoning [C]// Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2021: 655-664.
[26] HU R, ROHRBACH A, DARRELL T, et al. Language-conditioned graph networks for relational reasoning [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 10293-10302.
[27] JING C, JIA Y, WU Y, et al. Learning the dynamics of visual relational reasoning via reinforced path routing [C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 1122-1130.
[28] ZHOU H. Scene graph generation for image semantic understanding and representation [D]. Changsha: National University of Defense Technology, 2021: 133-136. (in Chinese)
[29] TANG K, ZHANG H, WU B, et al. Learning to compose dynamic tree structures for visual contexts [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6612-6621.
[30] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6693-6702.
[31] GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6632-6641.
[32] SHA F, CHAO W L, HU H. Learning answer embeddings for visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 5428-5436.
[33] DO T, TRAN H, DO T T, et al. Compact trilinear interaction for visual question answering [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 392-401.
[34] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from Transformers [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 5100-5111.
[35] KIM W, SON B, KIM I. ViLT: vision-and-language Transformer without convolution or region supervision [C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 5583-5594.
[36] LI X, YIN X, LI C, et al. OSCAR: object-semantics aligned pre-training for vision-language tasks [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12375. Cham: Springer, 2020: 121-137.