Journal of Computer Applications, 2019, Vol. 39, Issue (4): 935-939. DOI: 10.11772/j.issn.1001-9081.2018102162

Ensemble learning training method based on AUC and Q statistics

ZHANG Ning1, CHEN Qin1,2   

  1. School of Information, Central University of Finance and Economics, Beijing 100081, China;
  2. Information Management Department, China Development Bank Financial Leasing Company Limited, Shenzhen, Guangdong 518038, China
  • Received: 2018-10-26 Revised: 2018-12-22 Online: 2019-04-10 Published: 2019-04-10
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFB1400701).

基于AUC及Q统计值的集成学习训练方法

章宁1, 陈钦1,2   

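  1. School of Information, Central University of Finance and Economics, Beijing 100081, China;
  2. Information Management Department, China Development Bank Financial Leasing Company Limited, Shenzhen, Guangdong 518038, China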
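  • Corresponding author: CHEN Qin
  • About the authors: ZHANG Ning (born 1975), female, from Linchuan, Jiangxi; professor, Ph.D.; research interests: Internet finance, personal information protection, service outsourcing. CHEN Qin (born 1977), male, from Nanchang, Jiangxi; senior engineer, Ph.D. candidate; research interests: financial technology, intelligent investment, big data analysis, information retrieval.
  • Funding:
    National Key Research and Development Program of China (2017YFB1400701).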

Abstract: To address information asymmetry in the lending process and to integrate different data sources and loan default prediction models more effectively, an ensemble learning training method was proposed that measures the accuracy and the diversity of learners with the Area Under Curve (AUC) value and the Q statistic. The method was implemented as TABAQ (Training Algorithm Based on AUC and Q statistics). Empirical analysis on Peer-to-Peer (P2P) loan data showed that the performance of ensemble learning was closely related to the accuracy and diversity of the base learners and only weakly correlated with the number of base learners, and that statistical ensemble performed best among the ensemble learning methods. The experiments also showed that integrating information sources from the borrower side and the investor side effectively reduced information asymmetry in loan default prediction. TABAQ combines the advantages of information source fusion and ensemble learning: prediction accuracy improved steadily, and the number of Type I prediction errors fell by a further 4.85%.

Key words: ensemble learning, Area Under Curve (AUC), Q statistics, loan default prediction, information asymmetry, Peer-to-Peer loan (P2P loan)
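
The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the TABAQ implementation: the abstract does not give the paper's formulas, so the code assumes the common rank-based definition of AUC and the standard pairwise Q statistic, Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10), where N11 counts samples both learners classify correctly, N00 samples both misclassify, and N10 and N01 samples only one of them gets right. The function names `auc` and `q_statistic` and the convention of returning 0.0 for a zero denominator are assumptions made for this illustration.

```python
from typing import Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive sample gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def q_statistic(y_true: Sequence[int],
                pred_a: Sequence[int],
                pred_b: Sequence[int]) -> float:
    """Pairwise Q statistic between two learners' hard predictions.
    Values near +1 mean the learners err on the same samples (low diversity);
    values near -1 mean they err on different samples (high diversity)."""
    n11 = n00 = n10 = n01 = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == y), (b == y)
        if a_ok and b_ok:
            n11 += 1
        elif a_ok:
            n10 += 1
        elif b_ok:
            n01 += 1
        else:
            n00 += 1
    numerator = n11 * n00 - n01 * n10
    denominator = n11 * n00 + n01 * n10
    # Assumed convention: treat an undefined Q (zero denominator) as 0.0.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy labels (1 = default, 0 = repaid); the data are made up for illustration.
    labels = [1, 0, 1, 0, 1, 0, 1, 0]
    learner_a = [1, 0, 1, 0, 0, 1, 1, 0]
    learner_b = [1, 0, 0, 1, 1, 0, 1, 0]
    print("Q(a, b) =", q_statistic(labels, learner_a, learner_b))
    print("AUC =", auc([1, 0, 1, 0, 1, 0], [0.9, 0.2, 0.8, 0.4, 0.3, 0.1]))
```

Under these assumptions, a training procedure would favor base learners with a high AUC whose pairwise Q values are low, which matches the abstract's finding that ensemble performance depends on accuracy and diversity more than on the number of base learners.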

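Abstract (Chinese version, translated): To address information asymmetry in the lending process and to integrate different data sources and loan default prediction models more effectively, an ensemble learning training method is proposed that uses the AUC (Area Under Curve) value and the Q statistic to measure the accuracy and diversity of learners, and an ensemble learning training algorithm based on AUC and Q statistics (TABAQ) is implemented. An empirical analysis on peer-to-peer (P2P) loan data finds that the effect of ensemble learning is closely related to the accuracy and diversity of the base learners but only weakly correlated with the number of base learners integrated, and that statistical ensemble performs best among the ensemble learning methods. The experiments also find that fusing borrower-side and investor-side information can effectively reduce information asymmetry in loan default prediction. TABAQ effectively exploits the advantages of both data source fusion and learner ensembling: while prediction accuracy improves steadily, the number of Type I prediction errors falls by a further 4.85%.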

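Keywords (Chinese version, translated): ensemble learning, area under curve, Q statistic, loan default prediction, information asymmetry, peer-to-peer lending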

CLC Number: