Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 32-38. DOI: 10.11772/j.issn.1001-9081.2022081260

• Cross-media representation learning and cognitive reasoning •

Multi-modal dialog reply retrieval based on contrastive learning and GIF tags

Yirui HUANG1, Junwei LUO2, Jingqiang CHEN1,3

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210023, China
    2. China Mobile Communications Group Chongqing Company Limited, Chongqing 401120, China
    3. Jiangsu Key Laboratory for Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing Jiangsu 210023, China
  • Received: 2022-08-25 Revised: 2022-12-14 Accepted: 2023-01-31 Online: 2023-02-28 Published: 2024-01-10
  • Contact: Jingqiang CHEN
  • About author: HUANG Yirui, born in 1997, M. S. candidate. Her research interests include cross-modal retrieval.
    LUO Junwei, born in 1982, M. S. His research interests include artificial intelligence.
    CHEN Jingqiang, born in 1983, Ph. D., associate professor. His research interests include automatic text summarization, natural language processing, and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China(61806101)


Abstract:

GIFs (Graphics Interchange Format images) are frequently used as responses to posts on social media platforms, but when addressing the question of how to choose an appropriate GIF to reply to a post, many existing approaches fail to exploit the tag information attached to GIFs on social media. A Multi-Modal Dialog reply retrieval approach based on Contrastive learning and GIF Tags (CoTa-MMD) was proposed to integrate this tag information into the retrieval process. Specifically, the tags were used as intermediate variables, so that text-to-GIF retrieval was converted into text-to-tag-to-GIF retrieval. The modal representations were then learned with a contrastive learning algorithm, and the retrieval probability was computed using the law of total probability. Compared with direct text-to-image retrieval, the introduction of transition tags reduced the retrieval difficulty caused by the heterogeneity of different modalities. Experimental results show that, compared with the DSCMR (Deep Supervised Cross-Modal Retrieval) model, the CoTa-MMD model improved the recall sum of the text-to-image retrieval task by 0.33 percentage points on the PEPE-56 multimodal dialogue dataset and by 4.21 percentage points on the Taiwan multimodal dialogue dataset.
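To make the retrieval scheme described above concrete, the following minimal PyTorch sketch illustrates the two ingredients named in the abstract: a contrastive (InfoNCE-style) objective that aligns text, tag, and GIF embeddings, and a tag-mediated score computed with the law of total probability, P(gif | text) = Σ_tag P(tag | text) · P(gif | tag). This is a sketch under assumptions, not the paper's implementation: the tensors text_emb, tag_emb, and gif_emb are random stand-ins for the outputs of hypothetical text, tag, and GIF encoders, and the loss is a generic batch-wise contrastive objective.

# Minimal sketch of tag-mediated GIF retrieval with contrastive alignment.
# NOT the authors' released code; encoders are replaced by random embeddings.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.07):
    """Contrastive (InfoNCE) loss aligning two modalities within a batch.
    anchor, positive: (batch, dim) L2-normalized embeddings; matching rows
    are positive pairs, all other rows in the batch serve as negatives."""
    logits = anchor @ positive.t() / temperature            # (batch, batch) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def retrieval_scores(text_emb, tag_emb, gif_emb):
    """Score each GIF for each query text through the intermediate tags:
    P(gif | text) = sum_tag P(tag | text) * P(gif | tag)  (total probability)."""
    p_tag_given_text = F.softmax(text_emb @ tag_emb.t(), dim=-1)   # (n_text, n_tag)
    p_gif_given_tag = F.softmax(tag_emb @ gif_emb.t(), dim=-1)     # (n_tag, n_gif)
    return p_tag_given_text @ p_gif_given_tag                      # (n_text, n_gif)

if __name__ == "__main__":
    # Random stand-ins for encoder outputs (batch of 4 items, embedding dim 8).
    text_emb = F.normalize(torch.randn(4, 8), dim=-1)
    tag_emb = F.normalize(torch.randn(4, 8), dim=-1)
    gif_emb = F.normalize(torch.randn(4, 8), dim=-1)

    # Training objective: pull matched text-tag and tag-GIF pairs together.
    loss = info_nce_loss(text_emb, tag_emb) + info_nce_loss(tag_emb, gif_emb)

    # Inference: rank GIFs for each text via the tag-mediated scores.
    scores = retrieval_scores(text_emb, tag_emb, gif_emb)
    print(loss.item(), scores.argmax(dim=-1))

Routing the score through the tags, rather than comparing text and GIF embeddings directly, is what reduces the cross-modal gap the abstract refers to: both softmax factors compare representations that are closer in modality (text vs. tag text, tag vs. GIF).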

Key words: cross-modal retrieval, multi-modal dialogue, Graphics Interchange Format (GIF), contrastive learning, representation learning


CLC Number: