Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 732-738.DOI: 10.11772/j.issn.1001-9081.2024081139

• Frontier research and typical applications of large models •

Commonsense question answering model based on cross-modal contrastive learning

Yuanlong WANG, Tinghua LIU, Hu ZHANG

  1. School of Computer and Information Technology,Shanxi University,Taiyuan Shanxi 030006,China
  • Received: 2024-08-12 Revised: 2024-09-09 Accepted: 2024-09-13 Online: 2024-09-25 Published: 2025-03-10
  • Contact: Yuanlong WANG
  • About author: LIU Tinghua, born in 1998, M. S. candidate. Her research interests include natural language processing.
    ZHANG Hu, born in 1979, Ph. D., professor, CCF member. His research interests include natural language processing, big data mining and analysis.
  • Supported by:
    National Natural Science Foundation of China(62176145)


Abstract:

Commonsense Question Answering (CQA), a subfield of intelligent question answering, aims to use commonsense knowledge to automatically answer questions posed in natural language and obtain accurate answers. This task typically requires background commonsense knowledge to enhance the model's problem-solving capability. Most related methods rely on extracting and utilizing commonsense from textual data; however, commonsense is often implicit and not always represented directly in the text, which limits the application range and effectiveness of these methods. Therefore, a cross-modal contrastive learning-based CQA model was proposed to fully utilize cross-modal information for enriching the expression of commonsense knowledge. Firstly, a cross-modal commonsense representation module was designed to integrate commonsense bases and a cross-modal large model, thereby obtaining cross-modal commonsense representations. Secondly, in order to enhance the model's ability to distinguish among different options, contrastive learning was carried out on the cross-modal representations of questions and options. Finally, a softmax layer was used to generate relevance scores for the question-option pairs, and the option with the highest score was taken as the final predicted answer. Experimental results on the public datasets CommonSenseQA (CSQA) and OpenBookQA (OBQA) show that, compared to DEKCOR (DEscriptive Knowledge for COmmonsense question answeRing), the proposed model improves accuracy by 1.46 and 0.71 percentage points respectively.
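The scoring pipeline the abstract describes — contrastive learning over question and option representations, followed by a softmax over relevance scores — can be sketched as follows. This is a minimal illustration only: the embeddings are random toy vectors, and the function names, dimensions, and temperature value are assumptions, not the paper's actual CLIP-based implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relevance_scores(q_emb, opt_embs):
    """Score each question-option pair by dot product, normalized with softmax,
    as in the final scoring layer described in the abstract."""
    logits = opt_embs @ q_emb
    return softmax(logits)

def contrastive_loss(q_emb, opt_embs, correct_idx, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the correct option's representation
    toward the question, push the distractors away."""
    q = q_emb / np.linalg.norm(q_emb)
    o = opt_embs / np.linalg.norm(opt_embs, axis=1, keepdims=True)
    sims = o @ q / temperature          # scaled cosine similarities
    probs = softmax(sims)
    return -np.log(probs[correct_idx])  # negative log-likelihood of the answer

# Toy 4-dimensional "cross-modal" embeddings for one question and three options.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
opts = rng.normal(size=(3, 4))
scores = relevance_scores(q, opts)      # relevance score per option, sums to 1
pred = int(np.argmax(scores))           # highest-scoring option is the answer
```

During training, minimizing `contrastive_loss` sharpens the separation between the correct option and the distractors; at inference, only `relevance_scores` and the argmax are needed.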

Key words: intelligent question answering, Commonsense Question Answering (CQA), contrastive learning, cross-modal commonsense, Contrastive Language-Image Pre-training (CLIP)

