《计算机应用》唯一官方网站 ›› 2025, Vol. 45 ›› Issue (8): 2497-2506. DOI: 10.11772/j.issn.1001-9081.2024081141

• 人工智能 •

分布外检测中训练与测试的内外数据整合

王祉苑1, 彭涛1,2(), 杨捷3   

1.武汉纺织大学 计算机与人工智能学院,武汉 430200
    2.武汉纺织大学 纺织行业智慧感知与计算重点实验室,武汉 430200
    3.伍伦贡大学 计算机与信息技术学院,澳大利亚 新南威尔士 伍伦贡 2522
  • 收稿日期:2024-08-14 修回日期:2024-10-18 接受日期:2024-10-21 发布日期:2024-11-07 出版日期:2025-08-10
  • 通讯作者: 彭涛
• 作者简介:王祉苑(2000—),女,湖北十堰人,硕士研究生,主要研究方向:自然语言处理。
    杨捷(1984—),男,福建福州人,教授,博士,主要研究方向:自然语言处理、机器视觉、人工智能。
  • 基金资助:
    中国高校产学研创新基金资助项目(2021ITA05012)

Integrating internal and external data for out-of-distribution detection training and testing

Zhiyuan WANG1, Tao PENG1,2(), Jie YANG3   

  1. 1.School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan Hubei 430200, China
    2.Key Laboratory of Intelligent Perception and Computing for Textile Industry, Wuhan Textile University, Wuhan Hubei 430200, China
    3.School of Computer Science and Information Technology, University of Wollongong, Wollongong New South Wales 2522, Australia
  • Received:2024-08-14 Revised:2024-10-18 Accepted:2024-10-21 Online:2024-11-07 Published:2025-08-10
  • Contact: Tao PENG
  • About author:WANG Zhiyuan, born in 2000, M. S. candidate. Her research interests include natural language processing.
YANG Jie, born in 1984, Ph. D., professor. His research interests include natural language processing, machine vision, and artificial intelligence.
  • Supported by:
China Higher Education Institute Industry-Research-Innovation Fund (2021ITA05012)

摘要:

分布外(OOD)检测旨在识别偏离训练数据分布的外来样本,以规避模型对异常情况的错误预测。由于真实OOD数据的不可知性,目前基于预训练语言模型(PLM)的OOD检测方法尚未同时评估OOD分布在训练与测试阶段对检测性能的影响。针对这一问题,提出一种训练与测试阶段整合内外数据的OOD文本检测框架(IEDOD-TT)。该框架分阶段采用不同的数据整合策略:在训练阶段通过掩码语言模型(MLM)在原始训练集上生成伪OOD数据集,并引入对比学习增强内外数据之间的特征差异;在测试阶段通过结合内外数据分布的密度估计设计一个综合的OOD检测评分指标。实验结果表明,所提方法在CLINC150、NEWS-TOP5、SST2和YELP这4个数据集上的综合表现与最优基线方法doSCL-cMaha相比,平均接受者操作特征曲线下面积(AUROC)提升了1.56个百分点,平均95%真阳性率下的假阳性率(FPR95)降低了2.83个百分点;与所提方法的最佳变体IS/IEDOD-TT (ID Single/IEDOD-TT)相比,所提方法在这4个数据集上的平均AUROC提升了1.61个百分点,平均FPR95降低了2.71个百分点。实验结果证明了IEDOD-TT在处理文本分类任务时针对不同数据分布偏移的有效性,并验证了综合考虑内外数据分布的额外性能提升。

关键词: 分布外检测, 预训练语言模型, 内外数据整合, 对比学习, 文本分类

Abstract:

Out-Of-Distribution (OOD) detection aims to identify foreign samples that deviate from the training data distribution, so as to prevent erroneous predictions by the model in anomalous scenarios. Because real OOD data are unknowable in advance, current OOD detection methods based on Pre-trained Language Models (PLMs) have not evaluated the influence of the OOD distribution on detection performance during the training and testing stages simultaneously. To address this issue, an OOD text detection framework that integrates internal and external data in both training and testing stages (IEDOD-TT) was proposed. In this framework, different data integration strategies were adopted at different stages: in the training stage, a Masked Language Model (MLM) was employed to generate a pseudo-OOD dataset from the original training set, and contrastive learning was introduced to enhance the feature disparity between internal and external data; in the testing stage, a comprehensive OOD detection score was designed by combining density estimation over internal and external data distributions. Experimental results show that, on the CLINC150, NEWS-TOP5, SST2, and YELP datasets, compared with the optimal baseline method doSCL-cMaha, the proposed method improves the average Area Under the Receiver Operating Characteristic curve (AUROC) by 1.56 percentage points and reduces the average False Positive Rate at 95% true positive rate (FPR95) by 2.83 percentage points; compared with its best variant IS/IEDOD-TT (ID Single/IEDOD-TT), the proposed method improves the average AUROC by 1.61 percentage points and reduces the average FPR95 by 2.71 percentage points on these four datasets. These results validate the effectiveness of IEDOD-TT in handling different data distribution shifts in text classification tasks, and confirm the additional performance gains obtained by comprehensively considering both internal and external data distributions.
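The two metrics reported above, AUROC and FPR95, can be computed directly from the detector's scores on in-distribution (ID) and OOD test samples. The following is a minimal illustrative sketch, not the paper's implementation; it assumes the convention that a higher score indicates an ID sample:

```python
def auroc(id_scores, ood_scores):
    # AUROC equals the Mann-Whitney U statistic normalized to [0, 1]:
    # the probability that a random ID sample scores higher than a
    # random OOD sample, counting ties as one half.
    wins = sum((i > o) + 0.5 * (i == o)
               for i in id_scores for o in ood_scores)
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_95_tpr(id_scores, ood_scores):
    # Pick the (approximate) threshold that accepts 95% of ID samples,
    # i.e. the 5th percentile of ID scores, then measure the fraction
    # of OOD samples wrongly accepted at that threshold.
    thresh = sorted(id_scores)[int(0.05 * len(id_scores))]
    return sum(o >= thresh for o in ood_scores) / len(ood_scores)
```

A "1.56 percentage point" AUROC gain in the abstract thus corresponds to the detector ranking ID above OOD samples 1.56% more often in such pairwise comparisons.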

Key words: Out-Of-Distribution (OOD) detection, Pre-trained Language Model (PLM), internal and external data integration, contrastive learning, text classification

中图分类号: