Journal of Computer Applications, 2019, Vol. 39, Issue (3): 629-633. DOI: 10.11772/j.issn.1001-9081.2018071598


NIBoost: a new imbalanced dataset classification method based on cost-sensitive ensemble learning

WANG Li, CHEN Hongmei, WANG Shengwu   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China
  • Received: 2018-07-31  Revised: 2018-09-13  Online: 2019-03-10  Published: 2019-03-11
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61572406).

  • Corresponding author: CHEN Hongmei
  • About the authors: WANG Li (1992-), female, born in Heze, Shandong, M.S. candidate, CCF member; her research interests include data mining. CHEN Hongmei (1971-), female, born in Chengdu, Sichuan, professor, Ph.D., CCF member; her research interests include intelligent information processing and data mining. WANG Shengwu (1995-), male, born in Wuhu, Anhui, M.S. candidate, CCF member; his research interests include data mining.

Abstract:

Minority-class samples are frequently misclassified when traditional classification algorithms are applied to the large amounts of imbalanced data found in real life, because most of these algorithms assume a balanced class distribution or equal misclassification costs for all samples. To overcome this problem, a classification algorithm for imbalanced datasets based on cost-sensitive ensemble learning and oversampling, named New Imbalanced Boost (NIBoost), was proposed. Firstly, in each iteration an oversampling algorithm was used to add a certain number of minority-class samples to balance the dataset, and a classifier was trained on this new dataset. Secondly, the classifier was used to classify the dataset, yielding the predicted class label of each sample and the classification error rate of the classifier. Finally, the weight coefficient of the classifier and the new weight of each sample were calculated from the classification error rate and the predicted class labels. Experimental results on UCI datasets, with decision tree and Naive Bayes used as the weak classifiers, show that when the decision tree was used as the base classifier of NIBoost, compared with the RareBoost algorithm, F-value increased by up to 5.91 percentage points, G-mean by up to 7.44 percentage points, and AUC by up to 4.38 percentage points. The experimental results show that the proposed algorithm has advantages on the imbalanced data classification problem.
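As an illustration of the three-step training loop described above, the following is a minimal Python sketch, not the authors' implementation: the oversampling step (random duplication of minority samples) and the AdaBoost-style weight update stand in for the paper's exact oversampling algorithm and cost-sensitive formulas, and all names such as niboost_sketch are hypothetical.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def niboost_sketch(X, y, n_rounds=10, minority_label=1):
        """Boosting loop sketched from the abstract: oversample, train, reweight."""
        X, y = np.asarray(X), np.asarray(y)
        n = len(y)
        w = np.full(n, 1.0 / n)                  # uniform sample weights at start
        rng = np.random.default_rng(0)
        classifiers, alphas = [], []
        for _ in range(n_rounds):
            # Step 1: balance the training set by oversampling the minority class
            # (random duplication here; the paper uses its own oversampling algorithm).
            minority = np.flatnonzero(y == minority_label)
            majority = np.flatnonzero(y != minority_label)
            n_extra = max(0, len(majority) - len(minority))
            extra = rng.choice(minority, size=n_extra, replace=True)
            idx = np.concatenate([np.arange(n), extra])
            clf = DecisionTreeClassifier(max_depth=3)
            clf.fit(X[idx], y[idx], sample_weight=np.concatenate([w, w[extra]]))
            # Step 2: classify the original set and get the weighted error rate.
            pred = clf.predict(X)
            err = np.sum(w[pred != y]) / np.sum(w)
            if err <= 0 or err >= 0.5:           # degenerate round, stop early
                break
            # Step 3: classifier weight and new sample weights. An AdaBoost-style
            # update is assumed; the paper's cost-sensitive formula may differ.
            alpha = 0.5 * np.log((1.0 - err) / err)
            w = w * np.exp(np.where(pred == y, -alpha, alpha))
            w /= w.sum()
            classifiers.append(clf)
            alphas.append(alpha)
        return classifiers, alphas

A final prediction would then combine the stored classifiers by an alpha-weighted vote, as in standard AdaBoost.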

Key words: imbalanced dataset, classification, cost-sensitive, oversampling, AdaBoost algorithm

