Journal of Computer Applications, 2024, Vol. 44, Issue (1): 39-46. DOI: 10.11772/j.issn.1001-9081.2023010055
• Cross-media representation learning and cognitive reasoning •
Multi-channel multi-step integration model for generative visual dialogue
Sihang CHEN, Aiwen JIANG, Zhaoyang CUI, Mingwen WANG
Received: 2023-01-30
Revised: 2023-05-05
Accepted: 2023-05-09
Online: 2023-06-06
Published: 2024-01-10
Contact: Aiwen JIANG
About author: CHEN Sihang, born in 1997 in Pingxiang, Jiangxi, M.S. candidate, CCF student member. His research interests include visual dialogue.
Sihang CHEN, Aiwen JIANG, Zhaoyang CUI, Mingwen WANG. Multi-channel multi-step integration model for generative visual dialogue[J]. Journal of Computer Applications, 2024, 44(1): 39-46.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023010055
Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓
---|---|---|---|---|---
MN | 52.59 | 42.29 | 62.85 | 68.88 | 17.06
HCIAE | 53.86 | 44.06 | 63.55 | 69.24 | 16.01
CoAtt | 54.11 | 44.32 | 63.82 | 69.75 | 16.47
DMRM | 55.96 | 46.20 | 66.02 | 72.43 | 13.35
VD-BERT | 55.95 | 46.83 | 65.43 | 72.05 | 13.18
MITVG | 56.83 | 47.14 | 67.19 | 73.72 | 14.37
LTMI-GoG | 56.32 | 46.65 | 66.41 | 72.69 | 13.78
LTMI-LG | 56.56 | 46.71 | 66.69 | 73.37 | 13.62
MCMI | 57.91 | 48.82 | 68.86 | 74.73 | 12.47
Tab. 1 Experimental result comparison with mainstream models on VisDial v0.9 (generative tasks)
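The ranking columns in Tab. 1 and Tab. 2 follow the standard VisDial evaluation protocol: for every question, the generative decoder scores the 100 candidate answers and the rank of the ground-truth answer is recorded; ↑ marks metrics where higher is better and ↓ where lower is better. The following minimal sketch is illustrative only and not the authors' code (the `retrieval_metrics` helper and the toy ranks are assumptions); it shows how MRR, R@1/5/10, and Mean rank are derived from those recorded ranks.

```python
import numpy as np

def retrieval_metrics(ranks):
    """Compute VisDial ranking metrics from the 1-based rank of the
    ground-truth answer among the 100 candidates of each question."""
    ranks = np.asarray(ranks, dtype=np.float64)
    return {
        "MRR/%":  100.0 * float(np.mean(1.0 / ranks)),  # mean reciprocal rank
        "R@1/%":  100.0 * float(np.mean(ranks <= 1)),   # ground truth ranked first
        "R@5/%":  100.0 * float(np.mean(ranks <= 5)),   # ground truth in top 5
        "R@10/%": 100.0 * float(np.mean(ranks <= 10)),  # ground truth in top 10
        "Mean":   float(np.mean(ranks)),                # mean rank, lower is better
    }

# Toy usage with hypothetical ranks for three dialogue rounds.
print(retrieval_metrics([1, 4, 20]))
```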
Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓
---|---|---|---|---|---
MN | 47.99 | 38.18 | 57.54 | 64.32 | 18.60
HCIAE | 49.10 | 39.35 | 58.49 | 64.70 | 18.46
CoAtt | 49.25 | 39.66 | 58.83 | 65.38 | 18.15
ReDAN | 49.69 | 40.19 | 59.35 | 66.06 | 17.92
DMRM | 50.16 | 40.15 | 60.02 | 67.21 | 15.19
DAM | 50.51 | 40.53 | 60.84 | 67.94 | 15.26
KBGN | 50.05 | 40.40 | 60.11 | 66.82 | 17.54
MITVG | 51.14 | 41.03 | 61.25 | 68.49 | 14.37
LTMI-GoG | 51.32 | 41.25 | 61.83 | 69.44 | 15.32
LTMI-LG | 51.30 | 41.34 | 61.61 | 69.06 | 15.26
UTC | 52.22 | 42.56 | 62.40 | 69.51 | 15.67
MCMI | 52.28 | 43.24 | 62.78 | 69.45 | 14.20
Tab. 2 Experimental result comparison with mainstream models on VisDial v1.0 (generative tasks)
Model | M1↑/% | M2↑
---|---|---
DMRM | 67.00 | 3.20
MCMI | 76.00 | 3.90
Tab. 3 Manual evaluation result comparison between DMRM and MCMI
Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓
---|---|---|---|---|---
MCMI-1 | 51.48 | 41.93 | 60.46 | 67.90 | 14.76
MCMI-2 | 52.10 | 42.35 | 61.43 | 68.21 | 14.47
MCMI-V | 49.94 | 40.20 | 59.18 | 65.87 | 16.17
MCMI-H | 51.63 | 41.44 | 60.72 | 66.71 | 15.18
MCMI-S | 50.26 | 41.39 | 60.17 | 67.34 | 15.01
MCMI-FA | 52.02 | 42.26 | 61.37 | 67.69 | 14.47
MCMI | 52.28 | 43.24 | 62.78 | 69.45 | 14.20
Tab. 4 Ablation experiment results on VisDial v1.0 (generative tasks)
[1] WEI Z Y, FAN Z H, WANG R Z, et al. From vision to text: a brief survey for image captioning [J]. Journal of Chinese Information Processing, 2020, 34(7): 19-29. (in Chinese) 10.3969/j.issn.1003-0077.2020.07.002
[2] KASAI J, SAKAGUCHI K, DUNAGAN L, et al. Transparent human evaluation for image captioning [C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2022: 3464-3478. 10.18653/v1/2022.naacl-main.254
[3] YANG X, ZHANG H, CAI J. Auto-encoding and distilling scene graphs for image captioning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(5): 2313-2327.
[4] YU J, ZHANG W, LU Y, et al. Reasoning on the relation: enhancing visual representation for visual question answering and cross-modal retrieval [J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196-3209. 10.1109/tmm.2020.2972830
[5] LIU Y, WEI W, PENG D, et al. Depth-aware and semantic guided relational attention network for visual question answering [J]. IEEE Transactions on Multimedia, 2022 (Early Access): 1-14. 10.1109/tmm.2022.3190686
[6] JIANG J, LIU Z, ZHENG N. LiVLR: a lightweight visual-linguistic reasoning framework for video question answering [J]. IEEE Transactions on Multimedia, 2022 (Early Access): 1-12. 10.1109/tmm.2022.3185900
[7] BITEN A F, LITMAN R, XIE Y, et al. LaTr: layout-aware transformer for scene-text VQA [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 16527-16537. 10.1109/cvpr52688.2022.01605
[8] DAS A, KOTTUR S, GUPTA K, et al. Visual dialog [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1080-1089. 10.1109/cvpr.2017.121
[9] LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3242-3250. 10.1109/cvpr.2017.345
[10] GAN Z, CHENG Y, KHOLY A, et al. Multi-step reasoning via recurrent dual attention for visual dialog [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 6463-6474. 10.18653/v1/p19-1648
[11] GUO D, XU C, TAO D. Image-question-answer synergistic network for visual dialog [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10426-10435. 10.1109/cvpr.2019.01068
[12] PARK S, WHANG T, YOON Y, et al. Multi-view attention network for visual dialog [J]. Applied Sciences, 2021, 11(7): No.3009. 10.3390/app11073009
[13] NIU Y, ZHANG H, ZHANG M, et al. Recursive visual attention in visual dialog [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6672-6681. 10.1109/cvpr.2019.00684
[14] CHEN F, MENG F, XU J, et al. DMRM: a dual-channel multi-hop reasoning model for visual dialog [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 7504-7511. 10.1609/aaai.v34i05.6248
[15] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6274-6283. 10.1109/cvpr.2019.00644
[16] KIM H, TAN H, BANSAL M. Modality-balanced models for visual dialogue [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 8091-8098. 10.1609/aaai.v34i05.6320
[17] WANG Y, JOTY S, LYU M, et al. VD-BERT: a unified vision and dialog transformer with BERT [C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 3325-3338. 10.18653/v1/2020.emnlp-main.269
[18] NGUYEN V Q, SUGANUMA M, OKATANI T. Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12369. Cham: Springer, 2020: 223-240.
[19] WANG Z, WANG J, JIANG C. Unified multimodal model with unlikelihood training for visual dialog [C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 4625-4634. 10.1145/3503161.3547974
[20] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2019: 13-23.
[21] YU X, ZHANG H, HONG R, et al. VD-PCR: improving visual dialog with pronoun coreference resolution [J]. Pattern Recognition, 2022, 125: No.108540. 10.1016/j.patcog.2022.108540
[22] JIANG X, DU S, QIN Z, et al. KBGN: knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue [C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1265-1273. 10.1145/3394171.3413826
[23] CHEN F, CHEN X, MENG F, et al. GoG: relation-aware graph-over-graph network for visual dialog [C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg, PA: ACL, 2021: 230-243. 10.18653/v1/2021.findings-acl.20
[24] JOHNSON J, KARPATHY A, LI F F. DenseCap: fully convolutional localization networks for dense captioning [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4565-4574. 10.1109/cvpr.2016.494
[25] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636
[26] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755.
[27] LU J, KANNAN A, YANG J, et al. Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 313-323.
[28] WU Q, WANG P, SHEN C, et al. Are you talking to me? Reasoned visual dialog generation through adversarial learning [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6106-6115. 10.1109/cvpr.2018.00639
[29] CHEN F, CHEN X, XU C, et al. Learning to ground visual objects for visual dialog [C]// Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg, PA: ACL, 2021: 1081-1091. 10.18653/v1/2021.findings-emnlp.93
[30] JIANG X, YU J, SUN Y, et al. DAM: deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue [C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2021: 687-693. 10.24963/ijcai.2020/96
[31] CHEN F, MENG F, CHEN X, et al. Multimodal incremental transformer with visual grounding for visual dialogue generation [C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg, PA: ACL, 2021: 436-446. 10.18653/v1/2021.findings-acl.38
[32] CHEN C, TAN Z, CHENG Q, et al. UTC: a unified transformer with inter-task contrastive learning for visual dialog [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18082-18091. 10.1109/cvpr52688.2022.01757