Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (1): 121-124. DOI: 10.11772/j.issn.1001-9081.2015.01.0121

• Data Technology •

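Classification algorithm for imbalanced datasets based on a genetic-algorithm-improved synthetic minority over-sampling technique

HUO Yudan1,2, GU Qiong1,3, CAI Zhihua2, YUAN Lei1

  1. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang, Hubei 441053;
  2. School of Computer Science, China University of Geosciences, Wuhan 430074;
  3. Center for the Study of Logic and Intelligence, Southwest University, Chongqing 400715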
  • Received: 2014-07-18 Revised: 2014-09-08 Online: 2015-01-01 Published: 2015-01-26
  • Corresponding author: GU Qiong
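  • About the authors: HUO Yudan (born 1989), male, from Hengshui, Hebei, master's student; research interests: evolutionary computation, data mining, Kalman filtering. GU Qiong (born 1973), female, from Jingmen, Hubei, associate professor, Ph.D., CCF member; research interests: data mining, online public opinion. CAI Zhihua (born 1964), male, from Huanggang, Hubei, professor, doctoral supervisor, Ph.D., CCF senior member; research interests: data mining, evolutionary computation. YUAN Lei (born 1959), male, from Danyang, Jiangsu, professor, CCF member; research interests: databases, information systems.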
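  • Supported by:

    National Natural Science Foundation of China (61075063); Natural Science Foundation of Hubei Province (2013CFA004); China Postdoctoral Science Foundation General Program (2014M560700); Chongqing Postdoctoral Special Funding Project (XM2014057).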

Classification method for imbalance dataset based on genetic algorithm improved synthetic minority over-sampling technique

HUO Yudan1,2, GU Qiong1,3, CAI Zhihua2, YUAN Lei1   

  1. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang Hubei 441053, China;
  2. School of Computer Science, China University of Geosciences, Wuhan Hubei 430074, China;
  3. Center for the Study of Logic and Intelligence, Southwest University, Chongqing 400715, China
  • Received: 2014-07-18 Revised: 2014-09-08 Online: 2015-01-01 Published: 2015-01-26

Abstract:

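When handling imbalanced dataset classification, the Synthetic Minority Over-sampling Technique (SMOTE) assigns the same sampling rate to every minority-class sample, which makes it somewhat blind. To address this problem, GASMOTE, a SMOTE method improved with a Genetic Algorithm (GA), is proposed. First, different sampling rates are set for different minority-class samples, and each combination of sampling-rate values is encoded as an individual in the population. Next, the GA selection, crossover, and mutation operators are applied repeatedly to optimize the population, and the best combination of sampling-rate values is obtained when the stopping condition is reached. Finally, SMOTE sampling is performed on the imbalanced dataset according to the best combination found. Experiments on 10 typical imbalanced datasets show that, compared with SMOTE, GASMOTE raises the F-measure by 5.9 percentage points and the G-mean by 1.6 percentage points; compared with Borderline-SMOTE, it raises the F-measure by 3.7 percentage points and the G-mean by 2.3 percentage points. The method can serve as a new over-sampling technique for imbalanced dataset classification.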
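The idea of giving each minority-class sample its own sampling rate can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_per_sample`, the neighbourhood size `k`, the use of Euclidean distance, and the reading of a "sampling rate" as a non-negative integer count of synthetic points per sample are all assumptions made for illustration.

```python
import math
import random


def smote_per_sample(minority, rates, k=5, rng=None):
    """SMOTE-style interpolation with one sampling rate per minority sample.

    minority: list of feature vectors (lists of floats).
    rates:    list of non-negative ints; rates[i] synthetic points are
              generated from minority[i] (hypothetical encoding).
    k:        number of nearest minority neighbours to interpolate towards.
    """
    rng = rng or random.Random(0)
    if len(rates) != len(minority):
        raise ValueError("need exactly one sampling rate per minority sample")

    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, rates)):
        # Indices of the k nearest minority neighbours of x, excluding x itself.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours = others[:k]
        if not neighbours:
            continue  # a single minority sample has nothing to interpolate with
        for _ in range(n_new):
            nb = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    new_points = smote_per_sample(points, rates=[0, 1, 2, 3], k=2)
    print(len(new_points))
```

Plain SMOTE corresponds to the special case where every entry of `rates` is the same value; the rate vector is what a search procedure such as a GA would tune.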

Key words: imbalanced dataset, classification, Synthetic Minority Over-sampling Technique (SMOTE), sampling rate, Genetic Algorithm (GA)

Abstract:

When the Synthetic Minority Over-sampling Technique (SMOTE) is applied to imbalanced dataset classification, it assigns the same sampling rate to every minority-class sample when synthesizing new samples, which makes the procedure somewhat blind. To overcome this problem, a Genetic Algorithm (GA) improved SMOTE algorithm, named GASMOTE (Genetic Algorithm Improved Synthetic Minority Over-sampling Technique), was proposed. First, GASMOTE set different sampling rates for different minority-class samples, and each combination of sampling rates was encoded as one individual in the population. Then, the selection, crossover, and mutation operators of GA were applied iteratively to the population, yielding the best combination of sampling rates once the stopping criterion was met. Finally, the best combination of sampling rates was used in SMOTE to synthesize new samples. Experimental results on ten typical imbalanced datasets show that, compared with SMOTE, GASMOTE improved the F-measure by 5.9 percentage points and the G-mean by 1.6 percentage points; compared with Borderline-SMOTE, it improved the F-measure by 3.7 percentage points and the G-mean by 2.3 percentage points. GASMOTE can serve as a new over-sampling technique for imbalanced dataset classification.
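The GA search over sampling-rate combinations and the two reported metrics can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's procedure: the abstract does not specify the fitness function, operator variants, or parameters, so tournament selection, one-point crossover, per-gene random-reset mutation, elitism, and every parameter value below are hypothetical choices. The `fitness` callable is left to the caller; one plausible option, also an assumption, is the F-measure or G-mean of a classifier trained on the data over-sampled with the candidate rates. The metric helpers use the standard definitions (F-measure with equal weight on precision and recall, G-mean as the geometric mean of sensitivity and specificity).

```python
import math
import random


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the minority (positive) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def evolve_rates(n_samples, fitness, max_rate=5, pop_size=20,
                 generations=30, p_mut=0.1, seed=0):
    """Search for a per-sample sampling-rate vector that maximizes `fitness`.

    Each individual is a list of n_samples ints in [0, max_rate].
    All operator choices and defaults are hypothetical.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, max_rate) for _ in range(n_samples)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def pick(population):
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1, p2 = pick(pop), pick(pop)
            cut = rng.randrange(1, n_samples) if n_samples > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            # Mutation: reset each gene to a random rate with probability p_mut.
            child = [rng.randint(0, max_rate) if rng.random() < p_mut else g
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


if __name__ == "__main__":
    # Toy fitness standing in for a classifier-based score.
    target = [3, 0, 5, 1, 2, 4]
    toy_fitness = lambda rates: -sum(abs(r - t) for r, t in zip(rates, target))
    print(evolve_rates(len(target), toy_fitness))
```

Because the best individual is copied into every new generation, the best fitness found never decreases from one generation to the next; the returned vector would then be passed to a per-sample SMOTE step.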

Key words: imbalance dataset, classification, Synthetic Minority Over-sampling Technique (SMOTE), sampling rate, Genetic Algorithm (GA)

CLC Number: