Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (9): 2591-2596.DOI: 10.11772/j.issn.1001-9081.2019030531

• Data science and technology •

Unbalanced data ensemble classification algorithm based on improved SMOTE

WANG Zhongzhen<sup>1</sup>, HUANG Bo<sup>1,2</sup>, FANG Zhijun<sup>1</sup>, GAO Yongbin<sup>1</sup>, ZHANG Juan<sup>1</sup>   

  1. School of Electric and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China;
    2. Jiangxi Province Economic Crime Investigation and Prevention and Control Technology Collaborative Innovation Center, Nanchang Jiangxi 330103, China
  • Received: 2019-04-01    Revised: 2019-05-14    Online: 2019-05-28    Published: 2019-09-10
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61603242) and the Open Project of the Jiangxi Collaborative Innovation Center of Economic Crime Investigation, Prevention and Control Technology (JXJZXTCX-030).

  • Corresponding author: HUANG Bo
  • About the authors: WANG Zhongzhen, born in 1992, M. S. candidate. His research interests include machine learning and data mining. HUANG Bo, born in 1985, Ph. D., member of CCF. His research interests include artificial intelligence, software engineering and requirements engineering. FANG Zhijun, born in 1971, Ph. D., professor, member of CCF. His research interests include pattern recognition, intelligent computing and video analysis. GAO Yongbin, born in 1988, Ph. D. His research interests include artificial intelligence, machine learning, image processing and pattern recognition. ZHANG Juan, born in 1975, Ph. D., associate professor. Her research interests include computer vision, artificial intelligence and software testing.

Abstract:

To address the low classification accuracy on unbalanced datasets, an unbalanced data classification algorithm combining an improved SMOTE (Synthetic Minority Oversampling TEchnique) with the AdaBoost algorithm, called KSMOTE-AdaBoost, was proposed. Firstly, a noise sample identification algorithm was proposed based on the idea of K-Nearest Neighbors (KNN): the noise samples in the sample set were accurately identified and filtered out according to the number of heterogeneous samples among the K nearest neighbors of each sample. Secondly, during oversampling, the sample set was divided into different sub-clusters based on the idea of clustering, and new samples were synthesized between the samples in each cluster and its cluster center according to the cluster center and the number of samples the sub-cluster contains. During sample synthesis, both the inter-class and the intra-class data imbalance were fully considered, and the samples were corrected in time to guarantee the quality of the synthesized samples and balance the sample information. Finally, taking advantage of the AdaBoost algorithm, decision trees were used as base classifiers, and the balanced sample set was trained iteratively until the termination condition was satisfied, yielding the final classification model. Comparative experiments were carried out on 6 KEEL datasets with G-mean and AUC as evaluation indicators. The experimental results show that, compared with the classical oversampling algorithms SMOTE and ADASYN (ADAptive SYNthetic sampling approach), the proposed oversampling algorithm achieves the highest G-mean and AUC on 3 out of 4 groups. Compared with the existing unbalanced classification models SMOTE-Boost, CUS (Cluster-based Under-Sampling)-Boost and RUS (Random Under-Sampling)-Boost on the 6 groups of data, the proposed classification model achieves a higher G-mean than CUS-Boost and RUS-Boost on all groups while being lower than SMOTE-Boost on 3 groups, and a higher AUC than SMOTE-Boost and RUS-Boost on all groups while being lower than CUS-Boost on 1 group. These results verify that the proposed KSMOTE-AdaBoost has a better classification effect and higher generalization performance.
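
As a concrete illustration of the three stages summarized above (KNN-based noise filtering, cluster-guided oversampling, and AdaBoost training with decision-tree base classifiers), the following Python sketch outlines one possible realization. The neighbor count k, the number of sub-clusters, the interpolation rule, and the helper names are assumptions made for illustration, not the paper's exact implementation.

```python
# Illustrative sketch of a KSMOTE-AdaBoost-style pipeline (binary classification).
# Parameter values and helper names are assumptions, not the authors' implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def filter_noise(X, y, k=5):
    """Step 1: drop samples whose k nearest neighbours all belong to the other class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                       # idx[:, 0] is the sample itself
    keep = np.array([np.sum(y[idx[i, 1:]] != y[i]) < k for i in range(len(y))])
    return X[keep], y[keep]

def cluster_oversample(X, y, minority_label, n_clusters=3, random_state=0):
    """Step 2: cluster the minority class and synthesise new samples by interpolating
    between cluster members and their cluster centre (a cluster-guided SMOTE variant)."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_needed = int(np.sum(y != minority_label) - len(X_min))   # roughly balance classes
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_min)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        c = rng.integers(n_clusters)                            # pick a sub-cluster
        members = X_min[km.labels_ == c]
        x = members[rng.integers(len(members))]
        gap = rng.random()                                      # random interpolation factor
        synthetic.append(x + gap * (km.cluster_centers_[c] - x))
    X_new = np.vstack([X, np.array(synthetic)]) if synthetic else X
    y_new = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X_new, y_new

def ksmote_adaboost(X, y, minority_label, k=5, n_estimators=50):
    """Step 3: train AdaBoost with decision-tree base classifiers on the balanced set."""
    X_f, y_f = filter_noise(X, y, k=k)
    X_b, y_b = cluster_oversample(X_f, y_f, minority_label)
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=n_estimators, random_state=0)
    return clf.fit(X_b, y_b)
```

In this sketch the noise filter removes a sample only when all k of its neighbors are heterogeneous; the paper's threshold and correction rules for synthesized samples may differ.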

Key words: unbalanced data classification, Synthetic Minority Oversampling TEchnique (SMOTE), K-Nearest Neighbors (KNN), oversampling, clustering, AdaBoost algorithm

