Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (6): 1633-1637. DOI: 10.11772/j.issn.1001-9081.2019101878

• Artificial Intelligence •

Xgboost algorithm optimization based on gradient distribution harmonized strategy

LI Hao, ZHU Yan   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
  • Received: 2019-11-04 Revised: 2019-12-16 Online: 2020-06-10 Published: 2020-06-18
  • Contact: ZHU Yan, born in 1965, Ph. D., professor. Her research interests include data mining, Web anomaly detection, big data management and intelligent analysis.
  • About author: LI Hao, born in 1992, M. S. candidate. His research interests include unbalanced data classification. ZHU Yan, born in 1965, Ph. D., professor. Her research interests include data mining, Web anomaly detection, big data management and intelligent analysis.
  • Supported by:
    Science and Technology Plan of Sichuan Province (2019YFSY0032).

Abstract: To solve the problem of the low detection rate of the minority class by the ensemble learning model eXtreme gradient boosting (Xgboost) in binary classification, an improved Xgboost algorithm based on a gradient distribution harmonized strategy, called Loss Contribution Gradient Harmonized Algorithm-Xgboost (LCGHA-Xgboost), was proposed. Firstly, Loss Contribution (LC) was defined to approximate the loss of each individual sample in the Xgboost algorithm. Secondly, Loss Contribution Density (LCD) was defined to measure how difficult a sample is to classify correctly in the Xgboost algorithm. Finally, a gradient distribution harmonizing algorithm, LCGHA, was proposed to dynamically adjust the first-order gradient distribution of samples according to their LCD, indirectly increasing the losses of hard samples (mainly in the minority class) and reducing the losses of easy samples (mainly in the majority class), so that the Xgboost algorithm tends to learn the hard samples. The experimental results show that, compared with the three ensemble learning algorithms Xgboost, Gradient Boosting Decision Tree (GBDT) and Random_Forest, LCGHA-Xgboost improves Recall by 5.4%-16.7% and Area Under the Curve (AUC) by 0.94%-7.41% on multiple UCI datasets, and improves Recall by 44.4%-383.3% and AUC by 5.8%-35.6% on the spam Web page datasets WebSpam-UK2007 and DC2010. LCGHA-Xgboost can effectively improve the detection ability for the minority class and reduce its classification error rate.
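
The abstract describes LC, LCD and the LCGHA reweighting only at a high level, without their exact formulas. The Python snippet below is therefore only a minimal sketch of the general idea as a custom Xgboost objective, not the paper's implementation: it approximates each sample's loss contribution by the absolute first-order gradient of the logistic loss, estimates the loss contribution density with a histogram count, and rescales each gradient by the inverse of that density so that hard (sparse-gradient) samples carry more weight. The name lcgha_objective and the parameter n_bins are hypothetical.

    import numpy as np
    import xgboost as xgb

    def lcgha_objective(n_bins=10):
        # Build a custom binary-logistic objective that reweights
        # first-order gradients by inverse loss-contribution density.
        def objective(preds, dtrain):
            y = dtrain.get_label()
            p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw margin
            grad = p - y                       # first-order gradient of log loss
            hess = p * (1.0 - p)               # second-order gradient
            lc = np.abs(grad)                  # loss contribution proxy, in [0, 1]
            # Histogram-based density: how many samples fall into the
            # same LC bin as each sample (each bin has width 1/n_bins).
            edges = np.linspace(0.0, 1.0, n_bins + 1)
            bin_idx = np.clip(np.digitize(lc, edges) - 1, 0, n_bins - 1)
            density = np.bincount(bin_idx, minlength=n_bins)[bin_idx]
            # Inverse-density weight: samples in sparse bins (hard samples,
            # mostly minority class) are up-weighted; samples in dense bins
            # (easy samples, mostly majority class) are down-weighted.
            weight = len(lc) / (density * n_bins).astype(float)
            return grad * weight, hess
        return objective

    # Usage sketch (X is a feature matrix, y holds 0/1 labels):
    # dtrain = xgb.DMatrix(X, label=y)
    # booster = xgb.train({"max_depth": 6, "eta": 0.1}, dtrain,
    #                     num_boost_round=100, obj=lcgha_objective())

Because each sample's own bin always contains at least that sample, the density is never zero, and the mean weight stays close to 1, so the reweighting shifts emphasis toward hard samples without changing the overall gradient scale.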

Key words: imbalanced classification, Xgboost, gradient distribution, loss contribution, loss contribution density
