Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1707-1712. DOI: 10.11772/j.issn.1001-9081.2018102180

• Data Science and Technology •

Credit assessment method based on majority weight minority oversampling technique and random forest

TIAN Chen, ZHOU Lijuan

  1. Information Engineering College, Capital Normal University, Beijing 100048, China
  • Received: 2018-10-30 Revised: 2019-01-21 Online: 2019-06-10 Published: 2019-06-17
  • Corresponding author: TIAN Chen
  • About the authors: TIAN Chen (born 1994), male, from Beijing, master's student; main research interest: data mining. ZHOU Lijuan (born 1969), female, from Liaoyang, Liaoning, professor, Ph.D.; main research interests: data mining, machine learning, big data processing, cloud computing, and database systems.
  • Supported by:
    This work is partially supported by the National Key R&D Program of China (2017YFB1400803) and the National Natural Science Foundation of China (31571563, 61601310).

Abstract: To address the imbalanced datasets that are common in credit assessment, and the limited classification performance of a single classifier on such data, a credit assessment method combining the Majority Weighted Minority Oversampling TEchnique with Random Forest (MWMOTE-RF) was proposed. First, MWMOTE was applied during data preprocessing to increase the number of minority-class samples. Then, on the resulting more balanced dataset, random forest, a supervised machine learning algorithm, was used to classify and predict the data. With the Area Under the receiver operating characteristic Curve (AUC) as the evaluation metric, simulation experiments were conducted on the German credit card dataset from the UCI Machine Learning Repository and on a company's car loan default dataset. The results show that, on the same datasets, the AUC of the MWMOTE-RF method was 18% and 20% higher than that of the random forest method and the Naive Bayes method, respectively. Random forest was also combined with the Synthetic Minority Over-sampling TEchnique (SMOTE) and with ADAptive SYNthetic over-sampling (ADASYN); compared with these two combinations, the AUC of MWMOTE-RF was 1.47% and 2.34% higher, respectively. These results verify the effectiveness of the proposed method and its improvement of classifier performance.

Key words: imbalanced dataset, machine learning, Majority Weighted Minority Oversampling TEchnique (MWMOTE), random forest, credit assessment
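
The abstract names two building blocks: oversampling the minority class before training, and scoring a classifier by AUC. The sketch below illustrates both using only the Python standard library. It is not the authors' implementation. The function names `oversample_minority` and `auc` are hypothetical, and the oversampler is a plain SMOTE-style interpolation; the majority-based weighting that distinguishes MWMOTE is not reproduced, and the random forest step is omitted.

```python
import math
import random


def oversample_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Assumption: each synthetic point lies on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    MWMOTE additionally weights which minority samples are chosen; that
    weighting is NOT implemented here.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the paper's datasets).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = oversample_minority(minority, n_new=6)
print(len(new_points))                           # 6
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a full pipeline the synthetic points would be appended to the training split only, so that the AUC on held-out data is not inflated by samples interpolated from test records.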

CLC Number: