Label noise filtering method based on local probability sampling
ZHANG Zenghui1, JIANG Gaoxia1, WANG Wenjian1,2
1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan, Shanxi 030006, China
Abstract: In classification learning tasks, noise is inevitably introduced during data acquisition. Label noise in particular not only makes the learning model more complex, but also leads to overfitting and reduced generalization ability of the classifier. Although existing label noise filtering algorithms can alleviate these problems to some extent, they still suffer from limitations such as weak noise recognition ability, unsatisfactory classification performance, and low filtering efficiency. To address these issues, a local probability sampling method based on the label confidence distribution was proposed for label noise filtering. First, random forest classifiers were used to vote on the labels of the samples, yielding a label confidence for each sample. Then, according to the confidence values, the samples were divided into easy-to-recognize and hard-to-recognize ones. Finally, the two groups of samples were filtered with different strategies. Experimental results show that, in the presence of label noise, the proposed method maintains high noise recognition ability in most cases and has an obvious advantage in classification generalization performance.
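The vote-confidence-filter pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the function names, the easy/hard threshold of 0.7, and the rule of keeping a hard sample with probability equal to its confidence are all illustrative assumptions; in the actual method the per-sample votes would be produced by an ensemble of random forest classifiers.

```python
import random

def label_confidence(vote_matrix, labels):
    # Confidence of each sample's observed label = fraction of ensemble
    # members that voted for that label (votes assumed precomputed).
    return [votes.count(y) / len(votes) for votes, y in zip(vote_matrix, labels)]

def probability_filter(vote_matrix, labels, easy_thresh=0.7, seed=0):
    # easy_thresh and the sampling rule below are illustrative assumptions.
    rng = random.Random(seed)
    conf = label_confidence(vote_matrix, labels)
    kept = []
    for i, c in enumerate(conf):
        if c >= easy_thresh:
            kept.append(i)           # easy sample: keep directly
        elif rng.random() < c:       # hard sample: keep with probability = confidence
            kept.append(i)
    return kept                      # indices of samples surviving the filter
```

A sample whose observed label receives no ensemble votes (confidence 0) is always removed, while unanimously supported samples are always kept; borderline samples are retained probabilistically, which is the "probability sampling" idea in the title.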
ZHANG Zenghui, JIANG Gaoxia, WANG Wenjian. Label noise filtering method based on local probability sampling. Journal of Computer Applications, 2021, 41(1): 67-73.