Journal of Computer Applications, 2021, Vol. 41, Issue (1): 48-52. DOI: 10.11772/j.issn.1001-9081.2020060878

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020)

Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data

WANG Junhong1,2, YAN Jiarong1,2   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China;
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China
  • Received: 2020-05-31 Revised: 2020-07-22 Online: 2021-01-10 Published: 2020-09-02
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61772323) and the Natural Science Foundation of Shanxi Province (201701D121051).

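  • Corresponding author: WANG Junhong
  • About the authors: WANG Junhong (born 1979), female, from Quwo, Shanxi, is an associate professor, Ph.D., and CCF member; her research interests include formal concept analysis, data mining, rough sets, and granular computing. YAN Jiarong (born 1995), male, from Lüliang, Shanxi, is an M.S. candidate; his research interests include data mining and machine learning.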

Abstract: To address the low prediction accuracy that traditional classifiers achieve on the minority class of an unbalanced dataset, an unbalanced data classification algorithm based on undersampling and cost-sensitiveness, called USCBoost (UnderSamples and Cost-sensitive Boosting), was proposed. Firstly, in each iteration of the AdaBoost (Adaptive Boosting) algorithm, before the base classifier was trained, the majority class samples were sorted by weight in descending order, and as many majority class samples as there were minority class samples were selected according to their weights. The weights of the selected majority class samples were then normalized, and these samples were combined with the minority class samples to form a temporary training set for training the base classifier. Secondly, in the weight update stage, a higher misclassification cost was assigned to the minority class, so that the weights of minority class samples increased faster and the weights of majority class samples increased more slowly. On ten UCI datasets, USCBoost was compared with AdaBoost, AdaCost (Cost-sensitive AdaBoosting), and RUSBoost (Random Under-Sampling Boosting). Experimental results show that USCBoost achieves the highest F1-measure on six datasets and the highest G-mean on nine datasets, indicating that the proposed algorithm has better classification performance on unbalanced data.
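
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names, the label encoding (+1 for minority, -1 for majority), the deterministic top-k selection, the normalization of the kept majority weights to sum to 1, and the cost values 2.0 and 1.0 are all assumptions made for illustration.

```python
import math

MINORITY = 1    # assumed label encoding: +1 = minority class
MAJORITY = -1   # assumed label encoding: -1 = majority class


def undersample_by_weight(labels, weights):
    """Return (indices, weights) of a balanced temporary training set."""
    minority = [i for i, y in enumerate(labels) if y == MINORITY]
    majority = [i for i, y in enumerate(labels) if y == MAJORITY]
    # Sort majority samples from largest to smallest weight and keep as
    # many as there are minority samples (assumed: deterministic top-k).
    majority.sort(key=lambda i: weights[i], reverse=True)
    kept = majority[:len(minority)]
    # Assumed normalization: kept majority weights are rescaled to sum to 1.
    total = sum(weights[i] for i in kept)
    temp_weights = {i: weights[i] / total for i in kept}
    # Assumption: minority samples keep their current weights unchanged.
    temp_weights.update({i: weights[i] for i in minority})
    return sorted(temp_weights), temp_weights


def cost_sensitive_update(labels, predictions, weights, alpha,
                          minority_cost=2.0, majority_cost=1.0):
    """Hypothetical cost-sensitive variant of the AdaBoost weight update."""
    updated = []
    for y, h, w in zip(labels, predictions, weights):
        if y == h:
            # Correctly classified: shrink the weight as in plain AdaBoost.
            updated.append(w * math.exp(-alpha))
        else:
            # Misclassified: grow the weight, scaled by the class cost, so
            # minority weights rise faster than majority weights.
            cost = minority_cost if y == MINORITY else majority_cost
            updated.append(w * math.exp(alpha * cost))
    total = sum(updated)
    return [w / total for w in updated]  # renormalize to sum to 1


# Usage: two minority samples, four majority samples.
labels = [1, 1, -1, -1, -1, -1]
weights = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]
indices, temp_weights = undersample_by_weight(labels, weights)
new_weights = cost_sensitive_update([1, -1], [-1, 1], [0.5, 0.5], alpha=0.5)
```

In the usage lines, the two heaviest majority samples are kept alongside the two minority samples, and when one minority and one majority sample are both misclassified, the minority sample ends up with the larger weight because of its higher assumed cost.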

Key words: unbalanced data, classification, cost-sensitiveness, AdaBoost algorithm, undersampling
