Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 39-46. DOI: 10.11772/j.issn.1001-9081.2023010055
Sihang CHEN, Aiwen JIANG, Zhaoyang CUI, Mingwen WANG
Received: 2023-01-30
Revised: 2023-05-05
Accepted: 2023-05-09
Online: 2023-06-06
Published: 2024-01-10
Contact: Aiwen JIANG
About author: CHEN Sihang, born in 1997 in Pingxiang, Jiangxi, M. S. candidate, CCF student member. His research interests include visual dialogue.
Supported by:
Abstract:
Visual dialogue research has made considerable progress in multi-modal information fusion and reasoning. However, when answering questions that involve well-defined semantic attributes and spatial relations between objects, the ability of mainstream models is still limited. Few mainstream models can explicitly provide semantically sufficient, fine-grained descriptions of the image content before generating a response, so a necessary bridge for narrowing the semantic gap between visual feature representations and the textual semantics of the dialogue history and the current question is missing. Therefore, a visual dialogue model based on Multi-Channel Multi-step Integration, named MCMI, was proposed. The model explicitly provides a set of fine-grained semantic descriptions of the visual content, and through the mutual interaction and multi-step fusion of vision, semantics and dialogue history, it enriches the semantic representation of the question and achieves more accurate answer decoding. On the VisDial v0.9 and VisDial v1.0 datasets, compared with the baseline Dual-channel Multi-hop Reasoning Model (DMRM), the MCMI model improves Mean Reciprocal Rank (MRR) by 1.95 and 2.12 percentage points, Recall at 1 (R@1) by 2.62 and 3.09 percentage points, and lowers the mean rank of the correct answer (Mean) by 0.88 and 0.99, respectively. On VisDial v1.0, compared with the recent UTC (Unified Transformer Contrastive learning) model, MCMI improves MRR and R@1 by 0.06 and 0.68 percentage points and lowers Mean by 1.47. To further assess the quality of the generated dialogues, two manual evaluation metrics were proposed: M1, the proportion of responses passing a Turing-test-like judgment, and M2, a dialogue quality score on a five-point scale. On the VisDial v0.9 dataset, compared with the baseline DMRM, the MCMI model improves M1 and M2 by 9.00 percentage points and 0.70, respectively.
CLC number:
Sihang CHEN, Aiwen JIANG, Zhaoyang CUI, Mingwen WANG. Multi-channel multi-step integration model for generative visual dialogue[J]. Journal of Computer Applications, 2024, 44(1): 39-46.
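The multi-channel multi-step fusion summarized in the abstract can be pictured with a minimal sketch: a question vector repeatedly attends to three channels (visual features, fine-grained semantic descriptions of the image, and dialogue history), and the fused result is used to condition answer decoding. The module names, dimensions, single-layer soft attention and GRU-based update below are illustrative assumptions, not the authors' exact MCMI architecture.

```python
# Illustrative sketch only: a question vector is refined over several steps
# by attending to three channels (vision, semantic descriptions, history).
# All design choices here are assumptions, not the published MCMI model.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Soft attention of a query vector over one channel's feature set."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, query, feats):           # query: (B, D); feats: (B, N, D)
        q = query.unsqueeze(1).expand_as(feats)
        alpha = torch.softmax(self.score(torch.cat([feats, q], dim=-1)), dim=1)
        return (alpha * feats).sum(dim=1)       # attended summary: (B, D)

class MultiStepFusion(nn.Module):
    """Question representation refined over T steps by the three channels."""
    def __init__(self, dim: int, steps: int = 2):
        super().__init__()
        self.steps = steps
        self.att_v = ChannelAttention(dim)      # vision channel
        self.att_s = ChannelAttention(dim)      # semantic-description channel
        self.att_h = ChannelAttention(dim)      # dialogue-history channel
        self.update = nn.GRUCell(3 * dim, dim)

    def forward(self, q, vis, sem, hist):       # q: (B, D); others: (B, N, D)
        for _ in range(self.steps):
            ctx = torch.cat([self.att_v(q, vis),
                             self.att_s(q, sem),
                             self.att_h(q, hist)], dim=-1)
            q = self.update(ctx, q)              # enriched question representation
        return q

if __name__ == "__main__":
    B, D = 2, 512
    fusion = MultiStepFusion(D)
    out = fusion(torch.randn(B, D),             # question
                 torch.randn(B, 36, D),          # visual region features
                 torch.randn(B, 5, D),           # fine-grained descriptions
                 torch.randn(B, 10, D))          # dialogue history turns
    print(out.shape)                             # torch.Size([2, 512])
```

The enriched question vector would then be fed to a generative decoder to produce the answer; how many fusion steps and which update cell to use are design choices left open by this sketch.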
| Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓ |
| --- | --- | --- | --- | --- | --- |
| MN | 52.59 | 42.29 | 62.85 | 68.88 | 17.06 |
| HCIAE | 53.86 | 44.06 | 63.55 | 69.24 | 16.01 |
| CoAtt | 54.11 | 44.32 | 63.82 | 69.75 | 16.47 |
| DMRM | 55.96 | 46.20 | 66.02 | 72.43 | 13.35 |
| VD-BERT | 55.95 | 46.83 | 65.43 | 72.05 | 13.18 |
| MITVG | 56.83 | 47.14 | 67.19 | 73.72 | 14.37 |
| LTMI-GoG | 56.32 | 46.65 | 66.41 | 72.69 | 13.78 |
| LTMI-LG | 56.56 | 46.71 | 66.69 | 73.37 | 13.62 |
| MCMI | 57.91 | 48.82 | 68.86 | 74.73 | 12.47 |

Tab. 1 Experimental result comparison with mainstream models on VisDial v0.9 (generative tasks)
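The retrieval metrics in Tables 1, 2 and 4 follow the usual VisDial evaluation: the model ranks the candidate answers for each question, MRR is the mean reciprocal rank of the ground-truth answer, R@k is the percentage of questions whose ground-truth answer falls in the top k, and Mean is the average rank (lower is better). A minimal sketch of this computation, assuming the per-question ground-truth ranks are already available:

```python
# Minimal, illustrative computation of MRR, R@k and Mean from the 1-based
# rank assigned to the ground-truth answer for each question.
from typing import Iterable

def visdial_metrics(gt_ranks: Iterable[int], ks=(1, 5, 10)) -> dict:
    """Return MRR (%), R@k (%) and Mean rank for a list of ground-truth ranks."""
    ranks = list(gt_ranks)
    n = len(ranks)
    metrics = {
        "MRR": 100.0 * sum(1.0 / r for r in ranks) / n,   # higher is better
        "Mean": sum(ranks) / n,                            # lower is better
    }
    for k in ks:
        metrics[f"R@{k}"] = 100.0 * sum(r <= k for r in ranks) / n
    return metrics

if __name__ == "__main__":
    # toy example: ranks of the ground-truth answer for five questions
    print(visdial_metrics([1, 3, 2, 15, 1]))
```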
| Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓ |
| --- | --- | --- | --- | --- | --- |
| MN | 47.99 | 38.18 | 57.54 | 64.32 | 18.60 |
| HCIAE | 49.10 | 39.35 | 58.49 | 64.70 | 18.46 |
| CoAtt | 49.25 | 39.66 | 58.83 | 65.38 | 18.15 |
| ReDAN | 49.69 | 40.19 | 59.35 | 66.06 | 17.92 |
| DMRM | 50.16 | 40.15 | 60.02 | 67.21 | 15.19 |
| DAM | 50.51 | 40.53 | 60.84 | 67.94 | 15.26 |
| KBGN | 50.05 | 40.40 | 60.11 | 66.82 | 17.54 |
| MITVG | 51.14 | 41.03 | 61.25 | 68.49 | 14.37 |
| LTMI-GoG | 51.32 | 41.25 | 61.83 | 69.44 | 15.32 |
| LTMI-LG | 51.30 | 41.34 | 61.61 | 69.06 | 15.26 |
| UTC | 52.22 | 42.56 | 62.40 | 69.51 | 15.67 |
| MCMI | 52.28 | 43.24 | 62.78 | 69.45 | 14.20 |

Tab. 2 Experimental result comparison with mainstream models on VisDial v1.0 (generative tasks)
| Model | M1↑/% | M2↑ |
| --- | --- | --- |
| DMRM | 67.00 | 3.20 |
| MCMI | 76.00 | 3.90 |

Tab. 3 Manual evaluation result comparison between DMRM and MCMI
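M1 and M2 in Table 3 are the manual metrics defined in the abstract: M1 is the proportion of generated responses that pass a Turing-test-like judgment, and M2 is a dialogue quality score on a five-point scale. A minimal sketch, assuming each response carries a binary pass label and a 1-5 score from the annotators:

```python
# Illustrative aggregation of the two manual metrics; the annotation format
# (one pass label and one 1-5 score per response) is an assumption.
def manual_metrics(passed, scores):
    """M1: percentage of responses judged as passing; M2: mean five-point score."""
    m1 = 100.0 * sum(passed) / len(passed)
    m2 = sum(scores) / len(scores)
    return {"M1": m1, "M2": m2}

if __name__ == "__main__":
    print(manual_metrics([True, True, False, True], [4, 5, 2, 4]))
```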
| Model | MRR↑/% | R@1↑/% | R@5↑/% | R@10↑/% | Mean↓ |
| --- | --- | --- | --- | --- | --- |
| MCMI-1 | 51.48 | 41.93 | 60.46 | 67.90 | 14.76 |
| MCMI-2 | 52.10 | 42.35 | 61.43 | 68.21 | 14.47 |
| MCMI-V | 49.94 | 40.20 | 59.18 | 65.87 | 16.17 |
| MCMI-H | 51.63 | 41.44 | 60.72 | 66.71 | 15.18 |
| MCMI-S | 50.26 | 41.39 | 60.17 | 67.34 | 15.01 |
| MCMI-FA | 52.02 | 42.26 | 61.37 | 67.69 | 14.47 |
| MCMI | 52.28 | 43.24 | 62.78 | 69.45 | 14.20 |

Tab. 4 Ablation experiment results
1. WEI Z Y, FAN Z H, WANG R Z, et al. From vision to text: a brief survey for image captioning [J]. Journal of Chinese Information Processing, 2020, 34(7): 19-29. (in Chinese) 10.3969/j.issn.1003-0077.2020.07.002
2. KASAI J, SAKAGUCHI K, DUNAGAN L, et al. Transparent human evaluation for image captioning [C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2022: 3464-3478. 10.18653/v1/2022.naacl-main.254
3. YANG X, ZHANG H, CAI J. Auto-encoding and distilling scene graphs for image captioning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(5): 2313-2327.
4. YU J, ZHANG W, LU Y, et al. Reasoning on the relation: enhancing visual representation for visual question answering and cross-modal retrieval [J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196-3209. 10.1109/tmm.2020.2972830
5. LIU Y, WEI W, PENG D, et al. Depth-aware and semantic guided relational attention network for visual question answering [J]. IEEE Transactions on Multimedia, 2022 (Early Access): 1-14. 10.1109/tmm.2022.3190686
6. JIANG J, LIU Z, ZHENG N. LiVLR: a lightweight visual-linguistic reasoning framework for video question answering [J]. IEEE Transactions on Multimedia, 2022 (Early Access): 1-12. 10.1109/tmm.2022.3185900
7. BITEN A F, LITMAN R, XIE Y, et al. LaTr: layout-aware transformer for scene-text VQA [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 16527-16537. 10.1109/cvpr52688.2022.01605
8. DAS A, KOTTUR S, GUPTA K, et al. Visual dialog [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1080-1089. 10.1109/cvpr.2017.121
9. LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3242-3250. 10.1109/cvpr.2017.345
10. GAN Z, CHENG Y, KHOLY A, et al. Multi-step reasoning via recurrent dual attention for visual dialog [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 6463-6474. 10.18653/v1/p19-1648
11. GUO D, XU C, TAO D. Image-question-answer synergistic network for visual dialog [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10426-10435. 10.1109/cvpr.2019.01068
12. PARK S, WHANG T, YOON Y, et al. Multi-view attention network for visual dialog [J]. Applied Sciences, 2021, 11(7): No.3009. 10.3390/app11073009
13. NIU Y, ZHANG H, ZHANG M, et al. Recursive visual attention in visual dialog [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6672-6681. 10.1109/cvpr.2019.00684
14. CHEN F, MENG F, XU J, et al. DMRM: a dual-channel multi-hop reasoning model for visual dialog [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 7504-7511. 10.1609/aaai.v34i05.6248
15. YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6274-6283. 10.1109/cvpr.2019.00644
16. KIM H, TAN H, BANSAL M. Modality-balanced models for visual dialogue [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 8091-8098. 10.1609/aaai.v34i05.6320
17. WANG Y, JOTY S, LYU M, et al. VD-BERT: a unified vision and dialog transformer with BERT [C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2020: 3325-3338. 10.18653/v1/2020.emnlp-main.269
18. NGUYEN V Q, SUGANUMA M, OKATANI T. Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12369. Cham: Springer, 2020: 223-240.
19. WANG Z, WANG J, JIANG C. Unified multimodal model with unlikelihood training for visual dialog [C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 4625-4634. 10.1145/3503161.3547974
20. LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2019: 13-23.
21. YU X, ZHANG H, HONG R, et al. VD-PCR: improving visual dialog with pronoun coreference resolution [J]. Pattern Recognition, 2022, 125: No.108540. 10.1016/j.patcog.2022.108540
22. JIANG X, DU S, QIN Z, et al. KBGN: knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue [C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1265-1273. 10.1145/3394171.3413826
23. CHEN F, CHEN X, MENG F, et al. GoG: relation-aware graph-over-graph network for visual dialog [C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg, PA: ACL, 2021: 230-243. 10.18653/v1/2021.findings-acl.20
24. JOHNSON J, KARPATHY A, LI F F. DenseCap: fully convolutional localization networks for dense captioning [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4565-4574. 10.1109/cvpr.2016.494
25. ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636
26. LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755.
27. LU J, KANNAN A, YANG J, et al. Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 313-323.
28. WU Q, WANG P, SHEN C, et al. Are you talking to me? reasoned visual dialog generation through adversarial learning [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6106-6115. 10.1109/cvpr.2018.00639
29. CHEN F, CHEN X, XU C, et al. Learning to ground visual objects for visual dialog [C]// Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg, PA: ACL, 2021: 1081-1091. 10.18653/v1/2021.findings-emnlp.93
30. JIANG X, YU J, SUN Y, et al. DAM: deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue [C]// Proceedings of the 29th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2020: 687-693. 10.24963/ijcai.2020/96
31. CHEN F, MENG F, CHEN X, et al. Multimodal incremental transformer with visual grounding for visual dialogue generation [C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg, PA: ACL, 2021: 436-446. 10.18653/v1/2021.findings-acl.38
32. CHEN C, TAN Z, CHENG Q, et al. UTC: a unified transformer with inter-task contrastive learning for visual dialog [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18082-18091. 10.1109/cvpr52688.2022.01757