Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (11): 3335-3344. DOI: 10.11772/j.issn.1001-9081.2023111645

• Artificial Intelligence •

Graph convolution network-based masked data augmentation

Xinrong HU1, Jingxue CHEN1, Zijian HUANG1, Bangchao WANG1, Xun YAO1, Junping LIU1, Qiang ZHU1, Jie YANG2

  1. School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan Hubei 430200, China
  2. School of Computer and Information Technology, University of Wollongong, Wollongong New South Wales 2259, Australia
  • Received: 2023-12-01 Revised: 2024-03-21 Accepted: 2024-04-10 Online: 2024-04-12 Published: 2024-11-10
  • Contact: Bangchao WANG
  • About author: HU Xinrong, born in 1973 in Wuhan, Hubei, Ph. D., professor, CCF senior member. Her research interests include computer vision, machine learning, and natural language processing.
    CHEN Jingxue, born in 2000 in Hefei, Anhui, M. S. candidate. Her research interests include natural language processing.
    HUANG Zijian, born in 1995 in Foshan, Guangdong, M. S. candidate. His research interests include natural language processing.
    YAO Xun, born in 1969 in Wuhan, Hubei, Ph. D., lecturer, CCF member. His research interests include computer vision, pattern recognition, and natural language processing.
    LIU Junping, born in 1979 in Wuhan, Hubei, Ph. D., associate professor, CCF member. His research interests include machine learning and natural language processing.
    ZHU Qiang, born in 1984 in Wuhan, Hubei, Ph. D., lecturer. His research interests include natural language processing and bioinformatics.
    YANG Jie, born in 1984 in Fuzhou, Fujian, Ph. D., research fellow. His research interests include natural language processing and artificial intelligence.
  • Supported by:
    CCF-Zhipu.AI Large Model Innovation Fund (CCF-Zhipu202312)

Abstract:

To address inaccurate information in raw data, low sample quality, and poor model generalization in Multiple-Choice Question Answering (MCQA), a masked data augmentation model based on Graph Convolutional Network (GCN) was proposed, named GMDA (Graph convolution network-based MASK Data Augmentation). With GCN as the basic framework, the words in the passage were first abstracted as graph nodes, and Question-candidate Answer (QA) pair nodes were used to connect them, establishing links to the relevant passage nodes. Second, the similarity between nodes was calculated, and masking was applied to nodes in the graph to generate augmented samples. Third, GCN was used to expand the features of the augmented samples, enhancing the model's information representation capability. Finally, a scorer was introduced to score the original and augmented samples, and it was combined with a curriculum learning strategy to improve the accuracy of answer prediction. Comprehensive evaluation results show that, compared with the best baseline model EAM on the RACE-M and RACE-H datasets, GMDA improves accuracy by an average of 0.8 and 0.4 percentage points, respectively, and that, compared with the best baseline model STM (Self-Training Method) on the DREAM dataset, GMDA improves accuracy by an average of 1.4 percentage points. In addition, the comparative experiments verify the effectiveness of GMDA on MCQA tasks, which can support further research on and application of data augmentation techniques in this field.

Key words: Multiple-Choice Question Answering (MCQA), data augmentation, Graph Convolutional Network (GCN), scorer, curriculum learning
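
The pipeline summarized in the abstract (word nodes, similarity-based masking, GCN feature expansion, scoring with curriculum ordering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how similarity is computed, which nodes are masked, or how scores drive the curriculum. The function names, the cosine-similarity threshold, the choice to mask the nodes least similar to the QA-pair vector, the standard symmetric-normalized GCN layer, and the highest-score-first ordering are all assumptions made here for illustration.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def build_adjacency(x, threshold=0.5):
    """Assumption: link two word nodes when their cosine similarity exceeds threshold."""
    sim = cosine_similarity_matrix(x)
    adj = (sim > threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # self-loops are added inside gcn_layer
    return adj


def mask_nodes(x, query, mask_ratio=0.25):
    """Assumption: zero out the passage nodes least similar to the QA-pair vector."""
    unit_x = x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)
    unit_q = query / max(np.linalg.norm(query), 1e-12)
    relevance = unit_x @ unit_q
    n_mask = int(round(mask_ratio * len(x)))
    masked_idx = np.argsort(relevance)[:n_mask]
    x_aug = x.copy()
    x_aug[masked_idx] = 0.0
    return x_aug, np.sort(masked_idx)


def gcn_layer(adj, x, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degree >= 1 thanks to self-loops
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)


def curriculum_order(scores):
    """Assumption: present samples from highest to lowest scorer output."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = rng.normal(size=(8, 4))   # hypothetical word-node embeddings
    qa_pair = rng.normal(size=4)      # hypothetical QA-pair embedding
    adj = build_adjacency(words, threshold=0.3)
    augmented, masked = mask_nodes(words, qa_pair, mask_ratio=0.25)
    hidden = gcn_layer(adj, augmented, rng.normal(size=(4, 3)))
    print("masked nodes:", masked, "hidden shape:", hidden.shape)
```

In this sketch the adjacency matrix is built from the unmasked node features, so masked nodes still receive information from their neighbours through the GCN layer. That is one plausible reading of "feature expansion" and not something the abstract confirms.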

CLC Number: