Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 67-73. DOI: 10.11772/j.issn.1001-9081.2020060970

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020) •

Label noise filtering method based on local probability sampling

ZHANG Zenghui1, JIANG Gaoxia1, WANG Wenjian1,2   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China;
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China
  • Received: 2020-04-30 Revised: 2020-06-22 Online: 2021-01-10 Published: 2020-08-21
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61673249, U1805263, 61906113), the Shanxi Key Research and Development Program (International Science and Technology Cooperation) (201903D421050), and the Scientific and Technological Innovation Program of Higher Education Institutions in Shanxi (2020L0007).

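  • Corresponding author: WANG Wenjian
  • About the authors: ZHANG Zenghui (张增辉), born in 1996 in Taiyuan, Shanxi, female, M.S. candidate. Her research interest is data quality analysis. JIANG Gaoxia (姜高霞), born in 1987 in Xinjiang, Shanxi, male, Ph.D., lecturer, CCF member. His research interests include data quality analysis and machine learning. WANG Wenjian (王文剑), born in 1968 in Taiyuan, Shanxi, female, Ph.D., professor, CCF member. Her research interests include machine learning, computational intelligence, and image processing.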

Abstract: In classification learning tasks, noise is inevitably introduced during data acquisition. Label noise in particular not only makes the learned model more complex, but also causes overfitting and reduces the generalization ability of the classifier. Although existing label noise filtering algorithms alleviate these problems to some extent, they still suffer from poor noise recognition ability, unsatisfactory classification performance, and low filtering efficiency. To address these issues, a local probability sampling method based on the label confidence distribution is proposed for label noise filtering. First, random forest classifiers vote on the label of each sample to obtain its label confidence. Then, the samples are divided into easy-to-recognize and hard-to-recognize ones according to their label confidence values. Finally, the two groups are filtered with different strategies. Experimental results show that, in the presence of label noise, the proposed method maintains high noise recognition ability in most cases and has a clear advantage in classification generalization performance.

Key words: label noise, local probability sampling, noise filtering, Random Forest (RF), confidence estimation
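
The three steps named in the abstract (ensemble voting, splitting by label confidence, group-specific filtering) can be illustrated with a minimal sketch. The abstract does not give the thresholds or the exact sampling rule, so everything specific here is an assumption made for illustration: the function names, the `low` and `high` cut-offs, and the rule of keeping a hard sample with probability equal to its confidence are hypothetical and are not the authors' method. The votes are passed in as plain lists; in practice they might come from the individual trees of a random forest.

```python
import random
from typing import List, Sequence


def label_confidence(votes: Sequence[Sequence[int]],
                     labels: Sequence[int]) -> List[float]:
    """Fraction of ensemble votes that agree with each sample's given label."""
    return [
        sum(v == y for v in sample_votes) / len(sample_votes)
        for sample_votes, y in zip(votes, labels)
    ]


def filter_by_confidence(conf: Sequence[float],
                         low: float = 0.3,    # assumed cut-off, not from the paper
                         high: float = 0.7,   # assumed cut-off, not from the paper
                         seed: int = 0) -> List[int]:
    """Return the indices of the samples that are kept.

    conf >= high        : easy-to-recognize clean sample, always kept
    conf <= low         : easy-to-recognize noisy sample, always removed
    low < conf < high   : hard-to-recognize sample, kept with probability
                          equal to its confidence (placeholder sampling rule)
    """
    rng = random.Random(seed)
    kept = []
    for i, c in enumerate(conf):
        if c >= high:
            kept.append(i)
        elif c > low and rng.random() < c:
            kept.append(i)
    return kept


# Hypothetical votes of four ensemble members for three samples
votes = [[1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 0]]
labels = [1, 1, 1]
conf = label_confidence(votes, labels)
kept = filter_by_confidence(conf)
```

With these inputs the first sample is always kept, the second is always removed, and the third is kept or dropped by the random draw. Treating the two easy groups deterministically and sampling only the hard group is one way to read "different filtering strategies"; the paper's local probability sampling may differ.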
