Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (6): 1633-1637. DOI: 10.11772/j.issn.1001-9081.2019101878

• Artificial Intelligence •

Xgboost algorithm optimization based on gradient distribution harmonized strategy

LI Hao, ZHU Yan   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
  • Received: 2019-11-04 Revised: 2019-12-16 Online: 2020-06-10 Published: 2020-06-18
  • Contact: ZHU Yan, born in 1965, Ph. D., professor. Her research interests include data mining, Web anomaly detection, big data management and intelligent analysis.
  • About author: LI Hao, born in 1992, M. S. candidate. His research interests include unbalanced data classification. ZHU Yan, born in 1965, Ph. D., professor. Her research interests include data mining, Web anomaly detection, big data management and intelligent analysis.
  • Supported by:
    Science and Technology Plan of Sichuan Province (2019YFSY0032).

Abstract: To solve the problem of the low detection rate of the minority class by the ensemble learning model eXtreme gradient boosting (Xgboost) in binary classification, an improved Xgboost algorithm based on a gradient distribution harmonized strategy, called Loss Contribution Gradient Harmonized Algorithm-Xgboost (LCGHA-Xgboost), was proposed. Firstly, Loss Contribution (LC) was defined to approximate the loss of each individual sample in the Xgboost algorithm. Secondly, Loss Contribution Density (LCD) was defined to measure how difficult a sample is to classify correctly in the Xgboost algorithm. Finally, a gradient distribution harmonizing algorithm, LCGHA, was proposed to dynamically adjust the first-order gradient distribution of samples according to their LCD, indirectly increasing the losses of hard samples (mainly in the minority class) and reducing the losses of easy samples (mainly in the majority class), so that the Xgboost algorithm tends to learn the hard samples. The experimental results show that, compared with the three ensemble learning algorithms Xgboost, Gradient Boosting Decision Tree (GBDT) and Random_Forest, LCGHA-Xgboost improves Recall by 5.4%-16.7% and Area Under the Curve (AUC) by 0.94%-7.41% on multiple UCI datasets, and improves Recall by 44.4%-383.3% and AUC by 5.8%-35.6% on the spam Web page datasets WebSpam-UK2007 and DC2010. LCGHA-Xgboost can effectively improve the detection ability for the minority class and reduce its classification error rate.
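
The abstract describes LC, LCD and the LCGHA reweighting only at a high level, without their exact formulas. The Python snippet below is therefore only a minimal sketch of the general idea as a custom Xgboost objective, not the paper's implementation: it approximates each sample's loss contribution by the absolute first-order gradient of the logistic loss, estimates the loss contribution density with a histogram count, and rescales each gradient by the inverse of that density so that hard (sparse-gradient) samples carry more weight. The name lcgha_objective and the parameter n_bins are hypothetical.

    import numpy as np
    import xgboost as xgb

    def lcgha_objective(n_bins=10):
        # Build a custom binary-logistic objective that reweights
        # first-order gradients by inverse loss-contribution density.
        def objective(preds, dtrain):
            y = dtrain.get_label()
            p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw margin
            grad = p - y                       # first-order gradient of log loss
            hess = p * (1.0 - p)               # second-order gradient
            lc = np.abs(grad)                  # loss contribution proxy, in [0, 1]
            # Histogram-based density: how many samples fall into the
            # same LC bin as each sample (each bin has width 1/n_bins).
            edges = np.linspace(0.0, 1.0, n_bins + 1)
            bin_idx = np.clip(np.digitize(lc, edges) - 1, 0, n_bins - 1)
            density = np.bincount(bin_idx, minlength=n_bins)[bin_idx]
            # Inverse-density weight: samples in sparse bins (hard samples,
            # mostly minority class) are up-weighted; samples in dense bins
            # (easy samples, mostly majority class) are down-weighted.
            weight = len(lc) / (density * n_bins).astype(float)
            return grad * weight, hess
        return objective

    # Usage sketch (X is a feature matrix, y holds 0/1 labels):
    # dtrain = xgb.DMatrix(X, label=y)
    # booster = xgb.train({"max_depth": 6, "eta": 0.1}, dtrain,
    #                     num_boost_round=100, obj=lcgha_objective())

Because each sample's own bin always contains at least that sample, the density is never zero, and the mean weight stays close to 1, so the reweighting shifts emphasis toward hard samples without changing the overall gradient scale.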

Key words: imbalanced classification, Xgboost, gradient distribution, loss contribution, loss contribution density
