Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (2): 291-294. DOI: 10.11772/j.issn.1001-9081.2016.02.0291

• The 3rd CCF Conference on Big Data (CCF BigData 2015) •

Ensemble learning based on probability calibration

JIANG Zhengshen1, LIU Hongzhi2

  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
    2. School of Software and Microelectronics, Peking University, Beijing 102600, China
  • Received: 2015-08-29  Revised: 2015-09-11  Online: 2016-02-10  Published: 2016-02-03
  • Corresponding author: LIU Hongzhi (1982-), male, born in Wuhan, Hubei, associate professor, Ph. D., CCF member; research interests: information fusion, pattern recognition.
  • About the author: JIANG Zhengshen (1990-), male, born in Songyuan, Jilin, Ph. D. candidate; research interests: pattern recognition.
  • Funding:
    National Natural Science Foundation of China (61232005); CCF-Tencent Research Fund.




Abstract: Since a lack of diversity can lead to poor performance in ensemble learning, a two-phase ensemble learning method based on probability calibration was proposed, together with two methods for reducing the impact of multicollinearity. In the first phase, the probabilities given by the original classifiers were calibrated using different calibration methods. In the second phase, another classifier was trained on the calibrated probabilities and used to predict the final result. The different calibration methods used in the first phase provided diversity for the second phase, which has been shown to be an important factor in effective ensembles. To address the limited improvement caused by correlation between base classifiers, two methods for reducing multicollinearity were also proposed: choose-best and bootstrap sampling. The choose-best method selected, for each base classifier, the best among the original and calibrated classifiers; the bootstrap method combined a set of classifiers chosen from the base classifiers with replacement. The experimental results show that using different calibrated probabilities indeed improves the effectiveness of the ensemble, and that the choose-best and bootstrap sampling methods achieve further improvement. This indicates that probability calibration provides a new way to produce diversity, and that the multicollinearity it introduces can be resolved by sampling methods.
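The two-phase scheme and the bootstrap mitigation described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the Gaussian naive Bayes base learner, the logistic-regression meta-learner, the synthetic data, and scikit-learn's CalibratedClassifierCV (sigmoid/Platt and isotonic calibration) are all choices made for the example.

```python
# Phase 1: calibrate a base classifier's probabilities in different ways.
# Phase 2: train a meta-classifier on the (correlated) calibrated probabilities.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)

# Phase 1: one original classifier plus two calibrated variants of it.
base = GaussianNB().fit(X_base, y_base)
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3).fit(X_base, y_base)
isotonic = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3).fit(X_base, y_base)
pool = [base, platt, isotonic]

def meta_features(X):
    # Each model contributes its positive-class probability as one meta-feature.
    return np.column_stack([m.predict_proba(X)[:, 1] for m in pool])

# Phase 2: the meta-classifier learns to combine the probabilities.
X_tr, X_te, y_tr, y_te = train_test_split(X_meta, y_meta, test_size=0.3, random_state=0)
meta = LogisticRegression().fit(meta_features(X_tr), y_tr)
print("stacked accuracy:", meta.score(meta_features(X_te), y_te))

# Multicollinearity mitigation by bootstrap sampling: draw classifiers
# with replacement from the pool and average their probabilities.
rng = np.random.default_rng(0)
sampled = [pool[i] for i in rng.integers(0, len(pool), size=5)]
avg_proba = np.mean([m.predict_proba(X_te)[:, 1] for m in sampled], axis=0)
pred = (avg_proba >= 0.5).astype(int)
print("bootstrap-ensemble accuracy:", np.mean(pred == y_te))
```

The meta-features are strongly correlated by construction, since sigmoid and isotonic calibration are monotone transforms of the same underlying scores; that correlation is exactly the multicollinearity the choose-best and bootstrap sampling methods are meant to reduce.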

Key words: ensemble learning, probability calibration, multicollinearity, bootstrap sampling, random subspace
