Journal of Computer Applications, 2017, Vol. 37, Issue (7): 1994-1998. DOI: 10.11772/j.issn.1001-9081.2017.07.1994

• Artificial Intelligence •

基于主动学习不平衡多分类AdaBoost算法的心脏病分类

王莉莉1,2, 付忠良1,2, 陶攀1,2, 胡鑫1,2   

  1. 中国科学院 成都计算机应用研究所, 成都 610041;
    2. 中国科学院大学, 北京 100049
  • 收稿日期:2017-01-12 修回日期:2017-02-27 出版日期:2017-07-10 发布日期:2017-07-18
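  • Corresponding author: WANG Lili
  • About the authors: WANG Lili (born 1987), female, from Zhoukou, Henan, Ph.D. candidate, research interests: machine learning, pattern recognition, data mining; FU Zhongliang (born 1967), male, from Hechuan, Chongqing, professor, M.S., research interests: machine learning, pattern recognition; TAO Pan (born 1988), male, from Anyang, Henan, Ph.D. candidate, research interests: machine learning, data mining; HU Xin (born 1987), male, from Zunyi, Guizhou, M.S. candidate, research interests: data warehousing, data mining.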
  • 基金资助:
    四川省科技支撑计划项目(2016JZ0035);中国科学院西部之光项目。

Heart disease classification based on active imbalance multi-class AdaBoost algorithm

WANG Lili1,2, FU Zhongliang1,2, TAO Pan1,2, HU Xin1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2017-01-12 Revised:2017-02-27 Online:2017-07-10 Published:2017-07-18
  • Supported by:
    This work is partially supported by the Sichuan Science and Technology Support Project (2016JZ0035) and the West Light Foundation of the Chinese Academy of Sciences.

摘要: 针对不平衡分类中小类样本识别率低问题,提出一种基于主动学习不平衡多分类AdaBoost改进算法。首先,利用主动学习方法通过多次迭代抽样,选取少量的、对分类器最有价值的样本作为训练集;然后,基于不确定性动态间隔的样本选择策略,降低训练集的不平衡性;最后,利用代价敏感方法对多分类AdaBoost算法进行改进,对不同的类别给予不同的错分代价,调整样本权重更新速度,强迫弱分类器"关注"小类样本。在临床经胸超声心动图(TTE)测量数据集上的实验分析表明:与多分类支持向量机(SVM)相比,心脏病总体识别率提升了5.9%,G-mean指标提升了18.2%,瓣膜病(VHD)识别率提升了0.8%,感染性心内膜炎(IE)(小类)识别率提升了12.7%,冠心病(CAD)(小类)识别率提升了79.73%;与SMOTE-Boost相比,总体识别率提升了6.11%,G-mean指标提升了0.64%,VHD识别率提升了11.07%,先心病(CHD)识别率提升了3.69%。在TTE数据集和4个UCI数据集上的实验结果表明,该算法在不平衡多分类时能有效提高小类样本识别率,并且保证其他类别识别率不会大幅度降低,综合提升分类器性能。

关键词: 主动学习, 不平衡分类, 多分类AdaBoost, 多类别分类, 心脏病分类

Abstract: To address the low recognition rate of minority-class samples in imbalanced classification, an improved imbalanced multi-class AdaBoost algorithm based on active learning was proposed. Firstly, active learning was used to select, through multiple iterations of sampling, a small number of the samples most informative to the classifier as the training set. Secondly, a sample selection strategy based on the uncertainty of a dynamic margin was used to reduce the imbalance of the training set. Finally, a cost-sensitive method was used to improve the multi-class AdaBoost algorithm: different classes were assigned different misclassification costs, which adjusts the speed of sample weight updates and forces the weak classifiers to "focus on" minority-class samples. Experimental results on a clinical TransThoracic Echocardiography (TTE) measurement data set show that, compared with the multi-class Support Vector Machine (SVM), the overall recognition accuracy for heart disease increased by 5.9%, G-mean by 18.2%, the recognition accuracy of Valvular Heart Disease (VHD) by 0.8%, that of Infective Endocarditis (IE) (minority class) by 12.7%, and that of Coronary Artery Disease (CAD) (minority class) by 79.73%; compared with SMOTE-Boost, the overall recognition accuracy increased by 6.11%, G-mean by 0.64%, the recognition accuracy of VHD by 11.07%, and that of Congenital Heart Disease (CHD) by 3.67%. Experimental results on the TTE data set and 4 UCI data sets show that, in imbalanced multi-class classification, the proposed algorithm effectively improves the recognition accuracy of minority classes without substantially reducing that of the other classes, improving overall classifier performance.
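
The three ingredients named in the abstract (margin-based uncertainty sampling, class-dependent misclassification costs in the boosting weight update, and the G-mean metric) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the top-two-probability margin, the exponential update scaled by a per-class cost, and every function and parameter name below are assumptions chosen for illustration. G-mean is computed here as the geometric mean of per-class recall, a common multi-class definition that may differ in detail from the paper's.

```python
import math


def select_by_margin(proba, k):
    """Pick the k pool samples whose top-two class probabilities are closest.

    Assumption: uncertainty is measured as the gap between the two most
    probable classes; the paper's "dynamic margin" may be defined differently.
    """
    margins = []
    for index, row in enumerate(proba):
        top = sorted(row, reverse=True)
        margins.append((top[0] - top[1], index))
    margins.sort()  # smallest margin (most uncertain) first
    return [index for _, index in margins[:k]]


def cost_sensitive_update(weights, y_true, y_pred, class_cost, alpha):
    """One boosting-round weight update with a hypothetical per-class cost.

    Misclassified samples are up-weighted by exp(alpha * cost of their true
    class), so errors on costly (minority) classes grow faster. Weights are
    renormalized to sum to 1.
    """
    raw = [
        w * math.exp(alpha * class_cost[t]) if t != p else w
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(raw)
    return [w / total for w in raw]


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        preds = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in preds) / len(preds))
    return math.prod(recalls) ** (1.0 / len(recalls))


if __name__ == "__main__":
    # Toy class-probability rows for three unlabeled pool samples.
    pool_proba = [[0.5, 0.4, 0.1], [0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]
    print(select_by_margin(pool_proba, 2))

    # Class 1 plays the role of the minority class and gets a higher cost.
    new_weights = cost_sensitive_update(
        [0.25] * 4, [0, 0, 1, 1], [0, 1, 1, 0], class_cost=[1.0, 2.0], alpha=0.5
    )
    print(new_weights)

    print(g_mean([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 2]))
```

Because G-mean multiplies the per-class recalls, a single poorly recognized minority class pulls the score down sharply, which is why it is reported alongside overall accuracy for imbalanced data.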

Key words: active learning, imbalance classification, multi-class AdaBoost, multi-class classification, heart disease classification
