Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 732-738.DOI: 10.11772/j.issn.1001-9081.2024081139

• Frontier research and typical applications of large models •

Commonsense question answering model based on cross-modal contrastive learning

Yuanlong WANG, Tinghua LIU, Hu ZHANG

  1. School of Computer and Information Technology,Shanxi University,Taiyuan Shanxi 030006,China
  • Received: 2024-08-12 Revised: 2024-09-09 Accepted: 2024-09-13 Online: 2024-09-25 Published: 2025-03-10
  • Contact: Yuanlong WANG
  • About author: LIU Tinghua, born in 1998, M. S. candidate. Her research interests include natural language processing.
    ZHANG Hu, born in 1979, Ph. D., professor, CCF member. His research interests include natural language processing, big data mining and analysis.
  • Supported by:
    National Natural Science Foundation of China(62176145)


Abstract:

Commonsense Question Answering (CQA), a subfield of intelligent question answering, aims to use commonsense knowledge to automatically answer questions posed in natural language and obtain accurate answers. This task typically requires background commonsense knowledge to enhance the model's problem-solving capability. Most related methods rely on extracting and utilizing commonsense from textual data; however, commonsense is often implicit and not always represented directly in the text, which limits the application range and effectiveness of these methods. Therefore, a cross-modal contrastive learning-based CQA model was proposed to fully utilize cross-modal information for enriching the expression of commonsense knowledge. Firstly, a cross-modal commonsense representation module was designed to integrate commonsense bases and a cross-modal large model, thereby obtaining cross-modal commonsense representations. Secondly, in order to enhance the model's ability to distinguish among different options, contrastive learning was carried out on the cross-modal representations of questions and options. Finally, a softmax layer was used to generate relevance scores for the question-option pairs, and the option with the highest score was taken as the final predicted answer. Experimental results on the public datasets CommonSenseQA (CSQA) and OpenBookQA (OBQA) show that, compared to DEKCOR (DEscriptive Knowledge for COmmonsense question answeRing), the proposed model improves accuracy by 1.46 and 0.71 percentage points respectively.
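The scoring pipeline the abstract describes — contrastive learning over question and option representations, followed by a softmax over relevance scores — can be sketched as follows. This is a minimal illustration only: the embeddings are random toy vectors, and the function names, dimensions, and temperature value are assumptions, not the paper's actual CLIP-based implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relevance_scores(q_emb, opt_embs):
    """Score each question-option pair by dot product, normalized with softmax,
    as in the final scoring layer described in the abstract."""
    logits = opt_embs @ q_emb
    return softmax(logits)

def contrastive_loss(q_emb, opt_embs, correct_idx, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the correct option's representation
    toward the question, push the distractors away."""
    q = q_emb / np.linalg.norm(q_emb)
    o = opt_embs / np.linalg.norm(opt_embs, axis=1, keepdims=True)
    sims = o @ q / temperature          # scaled cosine similarities
    probs = softmax(sims)
    return -np.log(probs[correct_idx])  # negative log-likelihood of the answer

# Toy 4-dimensional "cross-modal" embeddings for one question and three options.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
opts = rng.normal(size=(3, 4))
scores = relevance_scores(q, opts)      # relevance score per option, sums to 1
pred = int(np.argmax(scores))           # highest-scoring option is the answer
```

During training, minimizing `contrastive_loss` sharpens the separation between the correct option and the distractors; at inference, only `relevance_scores` and the argmax are needed.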

Key words: intelligent question answering, Commonsense Question Answering (CQA), contrastive learning, cross-modal commonsense, Contrastive Language-Image Pre-training (CLIP)

