Imbalanced data learning based on particle swarm optimization
CAO Peng1,2*, LI Bo1,2, LI Wei1,2, ZHAO Dazhe1,2
1. College of Information Science and Engineering, Northeastern University, Shenyang Liaoning 110004, China;
2. Key Laboratory of Medical Image Computing of Ministry of Education (Northeastern University), Shenyang Liaoning 110179, China
Abstract: To improve classification performance on imbalanced data, a new method based on Particle Swarm Optimization (PSO) is proposed. The method uses PSO to optimize the re-sampling rate and select the feature subset simultaneously, taking an evaluation metric designed for imbalanced data as the objective function, so as to obtain the best data distribution. The method was tested on a large number of UCI datasets and compared with state-of-the-art methods. The experimental results show that it has substantial advantages over the other methods, and that optimizing the re-sampling rate and the feature set simultaneously effectively improves performance on imbalanced data.
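The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: each particle encodes one re-sampling rate plus one value per feature (selected when above 0.5), minority samples are over-sampled by random duplication rather than by a synthetic method such as SMOTE, the classifier is a 3-nearest-neighbour vote, the objective is the G-mean on a hold-out set, and the data are synthetic. The PSO constants (`w`, `c1`, `c2`) and the helper names are likewise hypothetical.

```python
import math
import random

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (minority class = 1)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        return 0.0
    sens = sum(1 for p in pos if p == 1) / len(pos)
    spec = sum(1 for p in neg if p == 0) / len(neg)
    return math.sqrt(sens * spec)

def knn_predict(train_X, train_y, x, idx, k=3):
    """Majority vote of the k nearest neighbours, using only features in idx."""
    dists = sorted(
        (sum((a[j] - x[j]) ** 2 for j in idx), y) for a, y in zip(train_X, train_y)
    )
    votes = sum(y for _, y in dists[:k])
    return 1 if votes * 2 > k else 0

def decode(position, max_rate=3.0):
    """Assumed encoding: position[0] -> over-sampling rate, the rest -> feature mask."""
    rate = position[0] * max_rate
    mask = [v > 0.5 for v in position[1:]]
    return rate, mask

def fitness(position, train, valid):
    """G-mean on the validation set after over-sampling and feature selection."""
    rate, mask = decode(position)
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0  # a particle that selects no feature is useless
    X, y = train
    minority = [x for x, label in zip(X, y) if label == 1]
    rng = random.Random(0)  # fixed seed: the same particle always gets the same score
    extra = [rng.choice(minority) for _ in range(int(rate * len(minority)))]
    X_res, y_res = X + extra, y + [1] * len(extra)
    preds = [knn_predict(X_res, y_res, x, idx) for x in valid[0]]
    return g_mean(valid[1], preds)

def pso(train, valid, n_features, n_particles=10, n_iter=15, seed=1):
    """Plain global-best PSO with inertia weight; positions are clamped to [0, 1]."""
    rng = random.Random(seed)
    dim = n_features + 1
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, train, valid) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # assumed constants, not taken from the paper
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(pos[i], train, valid)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return decode(gbest), gbest_fit

def make_data(n_major, n_minor, rng):
    """Synthetic data: features 0-1 are informative, features 2-3 are pure noise."""
    X, y = [], []
    for label, n, shift in ((0, n_major, 0.0), (1, n_minor, 1.5)):
        for _ in range(n):
            informative = [rng.gauss(shift, 1.0) for _ in range(2)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
            X.append(informative + noise)
            y.append(label)
    return X, y

data_rng = random.Random(42)
train = make_data(60, 10, data_rng)   # 6:1 class imbalance
valid = make_data(30, 5, data_rng)
(best_rate, best_mask), best_fit = pso(train, valid, n_features=4)
print("over-sampling rate:", best_rate)
print("selected features:", best_mask)
print("validation G-mean:", best_fit)
```

Because the re-sampling rate and the feature mask live in the same particle, every fitness evaluation scores one joint choice of both, which is the sense in which the two are optimized simultaneously. Scoring on a single hold-out set keeps the sketch short; a cross-validated objective would be a more robust choice in practice.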
CAO Peng, LI Bo, LI Wei, ZHAO Dazhe. Imbalanced data learning based on particle swarm optimization [J]. Journal of Computer Applications, 2013, 33(03): 789-792.