Ensemble learning training method based on AUC and Q statistics

doi:10.11772/j.issn.1001-9081.2018102162

Abstract

To address information asymmetry in lending and to integrate different data sources and loan default prediction models more effectively, an ensemble learning training method is proposed that measures the accuracy of learners by the Area Under Curve (AUC) value and their diversity by Q statistics. The method is implemented as TABAQ (Training Algorithm Based on AUC and Q statistics). Empirical analysis on Peer-to-Peer (P2P) loan data shows that ensemble performance is closely related to the accuracy and diversity of the base learners and only weakly correlated with the number of base learners, and that statistical ensemble performs best among the ensemble learning methods compared. The experiments also show that integrating information from the borrower side and the investor side effectively reduces the information asymmetry in loan default prediction. TABAQ combines the advantages of information source fusion and ensemble learning: while prediction accuracy improves steadily, the number of Type I prediction errors is further reduced by 4.85%.

Key words: ensemble learning, Area Under Curve (AUC), Q statistics, loan default prediction, information asymmetry, Peer-to-Peer loan (P2P loan)
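The two measures named in the abstract can be illustrated with a minimal Python sketch. This is not the TABAQ algorithm, whose steps are not given on this page. It assumes AUC is computed by pairwise comparison of positive and negative scores, and that the Q statistic is the pairwise Yule's Q over the correct/incorrect outcomes of two classifiers, as commonly defined in the classifier diversity literature. The function names `auc` and `q_statistic` and the toy data are hypothetical.

```python
from itertools import product


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def q_statistic(labels, preds_a, preds_b):
    """Pairwise Q statistic (Yule's Q) between two classifiers.

    Values near 1 mean the classifiers tend to be right and wrong on the
    same examples (low diversity); negative values mean they tend to err
    on different examples (high diversity).
    """
    n11 = n00 = n10 = n01 = 0
    for y, a, b in zip(labels, preds_a, preds_b):
        a_ok, b_ok = (a == y), (b == y)
        if a_ok and b_ok:
            n11 += 1  # both correct
        elif not a_ok and not b_ok:
            n00 += 1  # both wrong
        elif a_ok:
            n10 += 1  # only the first classifier is correct
        else:
            n01 += 1  # only the second classifier is correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q statistic is undefined for this pair of classifiers")
    return (n11 * n00 - n01 * n10) / denom


# Toy data (hypothetical): 1 = default, 0 = repaid.
labels = [1, 0, 1, 0, 1, 0]
scores_a = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1]
preds_a = [1 if s >= 0.5 else 0 for s in scores_a]
preds_b = [1, 0, 1, 0, 0, 1]

accuracy_a = auc(labels, scores_a)
diversity_ab = q_statistic(labels, preds_a, preds_b)
```

A selection procedure in this spirit would prefer base learners with high AUC and low pairwise Q, though how TABAQ weighs the two is not stated here.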



ZHANG Ning, CHEN Qin. Ensemble learning training method based on AUC and Q statistics[J]. Journal of Computer Applications, 2019, 39(4): 935-939.

