Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (7): 2256-2264. DOI: 10.11772/j.issn.1001-9081.2021050810

• Frontier and Comprehensive Applications •

Credit risk prediction model based on borderline adaptive SMOTE and Focal Loss improved LightGBM

Hailong CHEN, Chang YANG, Mei DU, Yingyu ZHANG

  1. College of Computer Science and Technology, Harbin University of Science and Technology, Harbin Heilongjiang 150080, China
  • Received: 2021-05-18 Revised: 2021-09-29 Accepted: 2021-10-12 Online: 2022-07-15 Published: 2022-07-10
  • Contact: Hailong CHEN
  • About author: YANG Chang, born in 1997 in Suihua, Heilongjiang, M. S. candidate. Her research interests include machine learning.
    DU Mei, born in 1996 in Jinan, Shandong, M. S. candidate. Her research interests include machine learning.
    ZHANG Yingyu, born in 1996 in Tangshan, Hebei, M. S. candidate. Her research interests include machine learning.
  • Supported by:
    National Natural Science Foundation of China (61772160); Special Research Program of Scientific and Technological Innovation for Young Scientists of Harbin (2017RAQXJ045)

Abstract:

To address the problem that dataset imbalance in credit risk assessment degrades model predictions, a credit risk prediction model was proposed that combines Borderline Adaptive Synthetic Minority Oversampling Technique (BA-SMOTE) with Focal Loss-Light Gradient Boosting Machine (FLLightGBM), an algorithm in which the loss function of LightGBM is improved with the Focal Loss function. Firstly, building on Borderline Synthetic Minority Oversampling Technique (Borderline-SMOTE), an adaptive idea and a new interpolation method were introduced, so that a different number of new samples was generated for each minority sample on the border and the new samples lay closer to the original minority sample, thereby balancing the dataset. Secondly, the Focal Loss function was used to improve the loss function of the LightGBM (Light Gradient Boosting Machine) algorithm, and the improved algorithm was trained on the rebalanced dataset to obtain the final BA-SMOTE-FLLightGBM model, which combines the BA-SMOTE method with the FLLightGBM algorithm. Finally, credit risk prediction was performed on the Lending Club dataset. Experimental results show that, compared with the imbalanced classification algorithms RUSBoost (Random Under-Sampling with AdaBoost), CUSBoost (Cluster-based Under-Sampling with AdaBoost), KSMOTE-AdaBoost (K-means clustering SMOTE with AdaBoost), and AK-SMOTE-Catboost (AllKnn-SMOTE-Catboost), the proposed model achieves marked gains on both evaluation indicators, improving G-mean by 9.0%-31.3% and AUC (Area Under Curve) by 5.0%-14.1%. These results verify that the proposed model achieves better default prediction in credit risk assessment.

Key words: credit risk, imbalanced data, oversampling, LightGBM (Light Gradient Boosting Machine), Focal Loss
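
As an illustration of the second step described in the abstract, the following is a minimal sketch of how a Focal Loss objective can be expressed for a gradient boosting learner, which needs the first and second derivatives of the loss with respect to the raw score. It is not the authors' implementation: the function names, the binary label convention (1 = default), and the values `alpha=0.25` and `gamma=2.0` are assumptions chosen for illustration, since the abstract does not state them. The derivatives follow from differentiating FL = -alpha_t (1 - p_t)^gamma log(p_t) with p_t = sigmoid(±z).

```python
import numpy as np


def _pt_and_alpha(z, y, alpha):
    """Return the label sign, p_t (probability of the true class), and alpha_t."""
    y_sign = 2.0 * y - 1.0                       # +1 for class 1, -1 for class 0
    pt = 1.0 / (1.0 + np.exp(-y_sign * z))       # sigmoid of the signed raw score
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return y_sign, pt, alpha_t


def focal_loss(z, y, alpha=0.25, gamma=2.0):
    """Per-sample Focal Loss on raw scores z (alpha, gamma are assumed values)."""
    _, pt, alpha_t = _pt_and_alpha(z, y, alpha)
    return -alpha_t * (1.0 - pt) ** gamma * np.log(pt)


def focal_loss_grad_hess(z, y, alpha=0.25, gamma=2.0):
    """First and second derivatives of the Focal Loss with respect to z."""
    y_sign, pt, alpha_t = _pt_and_alpha(z, y, alpha)
    log_pt = np.log(pt)
    u = gamma * pt * log_pt + pt - 1.0
    grad = y_sign * alpha_t * (1.0 - pt) ** gamma * u
    hess = alpha_t * pt * (
        (1.0 - pt) ** (gamma + 1.0) * (gamma * log_pt + gamma + 1.0)
        - gamma * (1.0 - pt) ** gamma * u
    )
    return grad, hess


# Hypothetical raw scores and labels, only to show the call.
scores = np.array([-2.0, -0.5, 0.3, 1.7])
labels = np.array([0.0, 1.0, 1.0, 0.0])
grad, hess = focal_loss_grad_hess(scores, labels)
```

With `gamma=0` and `alpha=0.5` the expressions reduce to half the usual logistic-loss derivatives, which gives a quick sanity check. A pair of arrays like `grad` and `hess` is what a custom objective for a gradient boosting library is generally expected to return; the exact hook and argument order in LightGBM depend on the API and version used, so the wiring is deliberately left out here. The second derivative of the Focal Loss is not guaranteed to be positive everywhere, which an actual implementation may need to handle.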

CLC Number: