Credit assessment method based on majority weight minority oversampling technique and random forest

doi:10.11772/j.issn.1001-9081.2018102180

Journal of Computer Applications, 2019, Vol. 39, Issue (6): 1707-1712. DOI: 10.11772/j.issn.1001-9081.2018102180

TIAN Chen, ZHOU Lijuan

Information Engineering College, Capital Normal University, Beijing 100048, China

Received: 2018-10-30 Revised: 2019-01-21 Online: 2019-06-17 Published: 2019-06-10
Corresponding author: TIAN Chen
About the authors: TIAN Chen (born 1994), male, from Beijing, master's student; main research interest: data mining. ZHOU Lijuan (born 1969), female, from Liaoyang, Liaoning, professor, Ph.D.; main research interests: data mining, machine learning, big data processing, cloud computing, and database systems.
Supported by:
This work is partially supported by the National Key R&D Program (2017YFB1400803) and the National Natural Science Foundation of China (31571563, 61601310).

Abstract

Abstract: To address the imbalanced datasets that are common in credit assessment, and the limited effectiveness of a single classifier on imbalanced data, a credit assessment method combining the Majority Weighted Minority Oversampling TEchnique with Random Forest (MWMOTE-RF) was proposed. Firstly, MWMOTE was applied during data preprocessing to increase the number of minority-class samples. Then, on the resulting, more balanced dataset, random forest, a supervised machine learning algorithm, was used to classify and predict the data. With the Area Under the receiver operating characteristic Curve (AUC) as the evaluation metric, simulation experiments were conducted on the German credit card dataset from the UCI machine learning repository and on a company's car loan default dataset. The results show that, on the same datasets, the AUC of MWMOTE-RF is 18% and 20% higher than that of random forest and Naive Bayes, respectively. In addition, compared with random forest combined with the Synthetic Minority Over-sampling TEchnique (SMOTE) and with ADAptive SYNthetic over-sampling (ADASYN), the AUC of MWMOTE-RF is 1.47% and 2.34% higher, respectively. These results verify the effectiveness of the proposed method and its improvement of classifier performance.
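As a rough illustration of the two steps the abstract describes (oversampling the minority class, then scoring a classifier by AUC), the sketch below uses only the Python standard library. It is not the authors' implementation and it is not MWMOTE: the oversampler is a simplified SMOTE-style interpolation between minority neighbours, without MWMOTE's sample weighting or clustering. The function names `oversample_minority` and `auc_score`, the parameter `k`, and the toy data are all assumptions made for this example, and the random forest training step is omitted.

```python
import random


def oversample_minority(X, y, minority_label=1, k=3, seed=0):
    """Balance a binary dataset by SMOTE-style interpolation.

    Assumption: this is a simplified stand-in, NOT MWMOTE (no majority-based
    weighting and no clustering). Needs at least two minority samples.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    # Number of synthetic samples needed to match the majority class size.
    n_needed = (len(X) - len(minority)) - len(minority)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of `base` (squared Euclidean).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to mate
        X_new.append([a + gap * (b - a) for a, b in zip(base, mate)])
        y_new.append(minority_label)
    return X_new, y_new


def auc_score(y_true, scores, positive_label=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for s, t in zip(scores, y_true) if t == positive_label]
    neg = [s for s, t in zip(scores, y_true) if t != positive_label]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): six majority samples, two minority samples.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.1],
     [1.0, 1.0], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = oversample_minority(X, y)
print(y_bal.count(0), y_bal.count(1))  # 6 6

# AUC for four hypothetical classifier scores: 3 of 4 pairs are ordered correctly.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a fuller pipeline, a classifier such as scikit-learn's `RandomForestClassifier` would presumably be fitted on `X_bal, y_bal`, with its predicted probabilities on held-out, non-oversampled data passed to `auc_score`.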

Key words: imbalanced dataset, machine learning, Majority Weighted Minority Oversampling TEchnique (MWMOTE), random forest, credit assessment

CLC Number:

TP18
TP399

TIAN Chen, ZHOU Lijuan. Credit assessment method based on majority weight minority oversampling technique and random forest[J]. Journal of Computer Applications, 2019, 39(6): 1707-1712.

