Journal of Computer Applications


Integrating internal and external data for out-of-distribution detection training & testing

  

  • Received: 2024-08-14  Revised: 2024-10-21  Online: 2024-11-07  Published: 2024-11-07
  • Contact: PENG Tao
  • Supported by:
    China University Industry Research and Innovation Fund

WANG Zhiyuan1, PENG Tao2, YANG Jie3

  1. Wuhan Textile University
  2. School of Computer Science and Artificial Intelligence, Wuhan Textile University
  3. University of Wollongong

Abstract: Out-of-Distribution (OOD) detection aims to identify foreign samples that deviate from the training data distribution, so as to prevent erroneous model predictions in anomalous scenarios. Because genuine OOD data are inherently unknowable in advance, existing OOD detection methods based on pre-trained language models have not evaluated the impact of the OOD distribution on detection performance in the training and testing phases simultaneously. To address this gap, we propose IEDOD-TT, a framework for out-of-distribution text detection that integrates internal and external data in both training and testing. The framework applies a different data-integration strategy at each stage: during training, pseudo-OOD datasets are generated from the original training set with a Masked Language Model (MLM), and contrastive learning is introduced to enlarge the feature gap between internal and external data; during testing, a comprehensive OOD detection score is designed by combining density estimates of the internal and external data distributions. Experimental results on text classification tasks show that, across different distribution shifts, IEDOD-TT improves average AUROC and average FPR95 by 1.61% and 2.71% respectively over its variant, and by 1.56% and 2.83% over other baselines, confirming the additional performance gain from jointly modeling the internal and external data distributions.
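The training-stage idea — masking tokens in in-distribution text and refilling them to manufacture pseudo-OOD samples — can be sketched as follows. This is a toy stand-in, not the paper's implementation: the filler vocabulary, mask rate, and random refilling are illustrative assumptions, whereas the actual framework refills masks with a pre-trained MLM.

```python
import random

# Toy sketch of MLM-style pseudo-OOD generation: mask a fraction of
# the tokens in an in-distribution sentence and refill them with
# out-of-domain vocabulary, yielding a "near-OOD" sample that keeps
# the surface form of the training data. A real implementation would
# refill the masked positions with a pre-trained masked language model.

FILLER_VOCAB = ["quantum", "nebula", "sonata", "glacier", "cipher"]  # assumed

def make_pseudo_ood(sentence: str, mask_rate: float = 0.3, seed: int = 0) -> str:
    """Replace roughly mask_rate of the tokens with out-of-domain fillers."""
    rng = random.Random(seed)
    tokens = sentence.split()
    n_mask = max(1, int(len(tokens) * mask_rate))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = rng.choice(FILLER_VOCAB)
    return " ".join(tokens)

text = "the movie was a touching story about family"
pseudo = make_pseudo_ood(text)
print(pseudo)
```

Pairs of (original, pseudo-OOD) sentences produced this way could then serve as negatives in a contrastive objective that pushes internal and external features apart.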

Key words: out-of-distribution detection, pre-trained language model, internal and external data integration, contrastive learning, text classification
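The test-stage scoring idea — combining density estimates of the internal and external distributions into one metric — can be illustrated with a log-density ratio. Everything here is a simplifying assumption: 1-D Gaussian densities over hand-picked feature values stand in for the paper's density estimation over pre-trained-model representations.

```python
import math
from statistics import NormalDist

# Hypothetical 1-D feature values for in-distribution (ID) data and for
# pseudo-OOD data; the real framework would use representations from a
# pre-trained language model.
id_feats = [0.10, 0.20, 0.15, 0.05, 0.12]
ood_feats = [1.00, 1.20, 0.90, 1.10, 1.05]

def fit_gaussian(xs):
    """Fit a 1-D Gaussian density to a list of feature values."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return NormalDist(mu, max(var, 1e-6) ** 0.5)

p_id, p_ood = fit_gaussian(id_feats), fit_gaussian(ood_feats)

def ood_score(x: float) -> float:
    """Log-density ratio: larger values mean more in-distribution-like."""
    return math.log(p_id.pdf(x) + 1e-12) - math.log(p_ood.pdf(x) + 1e-12)

print(ood_score(0.10), ood_score(1.10))
```

Thresholding such a score (and evaluating the threshold sweep with AUROC / FPR95) mirrors the evaluation protocol described in the abstract.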

