
CCML2021+224 Label Noise Filtering Method Based on Dynamic Probability Sampling

张增辉,姜高霞,王文剑   

  1. Shanxi University
  • Received: 2021-06-16 Revised: 2021-06-29 Online: 2021-06-29
  • Corresponding author: 王文剑

Label Noise Filtering Method Based on Dynamic Probability Sampling

  • Received: 2021-06-16 Revised: 2021-06-29 Online: 2021-06-29

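Abstract: In machine learning problems, data quality profoundly affects the accuracy of a system's predictions. Because information is hard to obtain and human cognition is subjective and limited, experts cannot label all samples accurately. Several probability sampling methods have appeared in recent years, but they cannot avoid a manual division of the samples that is unreasonable and strongly subjective. To address this problem, this paper proposes a label noise filtering method based on dynamic probability sampling. It takes full account of the differences among samples in each data set, counts the frequency distribution of confidence within each interval, analyzes the trend of the information entropy of that distribution, and determines a reasonable threshold. Experimental results show that the proposed method achieves relatively high label noise identification ability and classification ability in most cases.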

Keywords: label noise, dynamic probability sampling, noise filtering, label confidence, confidence

Abstract: In machine learning, data quality has a profound impact on prediction accuracy. Because information is difficult to obtain and human cognition is subjective and limited, experts cannot label every sample accurately. The probability sampling methods proposed in recent years still divide samples manually, a step that is hard to justify and highly subjective. To address this problem, this paper proposes a label noise filtering method based on dynamic probability sampling. The method takes full account of the differences among samples in each data set: it counts the frequency of label confidence values within each interval, analyzes the trend of the information entropy of that distribution, and determines a reasonable threshold from it. Experimental results show that the proposed method achieves relatively high label noise identification and classification performance in most cases.
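
The abstract gives only the outline of the method (per-interval confidence frequencies, an entropy trend, a derived threshold), so the following is a minimal sketch under stated assumptions rather than the authors' algorithm. It assumes label confidences in [0, 1] are already available, uses equal-width intervals, places the threshold at the interval edge where the entropy of the accumulated frequencies drops the most, and keeps low-confidence samples with probability confidence / threshold. The function names, the threshold rule, and the sampling rule are all hypothetical.

```python
import math
import random


def bin_frequencies(confidences, n_bins=10):
    """Count how many confidences fall into each equal-width interval of [0, 1]."""
    counts = [0] * n_bins
    for c in confidences:
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes into the last bin
        counts[idx] += 1
    return counts


def entropy(counts):
    """Shannon entropy (bits) of the normalized counts; 0.0 if all counts are zero."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def choose_threshold(confidences, n_bins=10):
    """Assumed rule: return the lower edge of the bin whose addition makes the
    entropy of the accumulated bins fall the most (a dominant mass begins there)."""
    counts = bin_frequencies(confidences, n_bins)
    prefix_entropy = [entropy(counts[:k]) for k in range(1, n_bins + 1)]
    best_k, best_gain = 1, float("inf")
    for k in range(1, n_bins):
        # Change in entropy caused by adding bin index k to bins 0..k-1.
        gain = prefix_entropy[k] - prefix_entropy[k - 1]
        if gain < best_gain:
            best_k, best_gain = k, gain
    return best_k / n_bins


def filter_noise(samples, confidences, threshold, seed=0):
    """Keep samples at or above the threshold; below it, keep each sample with
    probability confidence / threshold (an assumed sampling rule)."""
    rng = random.Random(seed)
    kept = []
    for s, c in zip(samples, confidences):
        if c >= threshold or rng.random() < c / threshold:
            kept.append(s)
    return kept


# Toy data: a few low-confidence (suspect) labels and many high-confidence ones.
confs = [0.05] * 5 + [0.15] * 5 + [0.25] * 5 + [0.85] * 50 + [0.95] * 100
threshold = choose_threshold(confs)
kept = filter_noise(list(range(len(confs))), confs, threshold)
```

Because the threshold is derived from each data set's own confidence histogram, two data sets with different confidence distributions get different cut points, which is the sense of "dynamic" assumed in this sketch.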

Key words: label noise, dynamic probability sampling, noise filtering, label confidence, confidence

CLC number: