Composite classification model learned on multiple isolated subdomains for imbalanced class

doi:10.11772/j.issn.1001-9081.2016.09.2475

Journal of Computer Applications, 2016, Vol. 36, Issue (9): 2475-2480. DOI: 10.11772/j.issn.1001-9081.2016.09.2475


JIN Yan¹, PENG Xinguang²

1. Information Institute, Business College of Shanxi University, Taiyuan, Shanxi 030031, China;
2. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China

Received: 2016-02-07 Revised: 2016-04-24 Online: 2016-09-10 Published: 2016-09-08
Corresponding author: JIN Yan
About the authors: JIN Yan (born 1982), female, from Gaoping, Shanxi; lecturer, M.S.; main research interests: data mining and network security. PENG Xinguang (born 1955), male, from Taiyuan, Shanxi; professor, Ph.D., CCF member; main research interests: network security and trusted computing.
Supported by:
This work is partially supported by the Shanxi Natural Science Foundation (2010011022-2), the Science and Technology Innovation Project of Shanxi Province (20131112), and the Fundamental Research Funds for the Business College of Shanxi University (2014010).

Abstract

Abstract: To further reduce the constraint that class imbalance places on classification algorithms, a composite classification model based on subdomain learning is proposed for imbalanced datasets, starting from the regional distribution characteristics of the data. In the subdomain division stage, an extended Support Vector Data Description (SVDD) algorithm gives the minimal bounding domain of each class, which splits the class into a dense region inside the boundary and a sparse region outside it. Drawing on the concept of class overlap, in which different classes contain similar samples, boundary samples are searched for and combined to form overlapping domains. In the subdomain cleanup stage, following the proximity assumption of the K-Nearest Neighbor (KNN) algorithm, a sample validity parameter is set according to how dense or sparse each domain is, and the samples in each domain are checked one by one to remove noise. Classifiers are then learned on each subdomain in isolation and combined in sequence to produce the composite classifier CCRD for imbalanced datasets. Compared with SVM (Support Vector Machine), KNN, C4.5 and the cost-sensitive MetaCost, CCRD clearly improves the correct classification of positive instances without increasing the misclassification of negative instances. Compared with SMOTE (Synthetic Minority Over-sampling TEchnique) sampling, CCRD reduces the misclassification of negative instances without affecting the correct classification of positive instances. In dataset-by-dataset comparisons on five datasets, the classification performance of CCRD improves in every case, most notably for the positive class of Haberman_sur. The experimental results indicate that the composite classification model learned on multiple isolated subdomains classifies well and is an effective method for imbalanced datasets.
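The subdomain division described in the abstract can be illustrated with a minimal sketch. It does not implement SVDD: as a loudly simplified stand-in, each class boundary is approximated by a sphere around the class centroid whose radius encloses a chosen share of the class samples. The function names, the `coverage` parameter, and the rule that a sample lying inside another class's sphere belongs to the overlapping domain are all assumptions made for illustration, not the authors' procedure.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def fit_sphere(points, coverage=0.8):
    """Crude stand-in for an SVDD boundary (assumption, not SVDD itself):
    the class centroid plus the radius that encloses `coverage` of the samples."""
    center = centroid(points)
    dists = sorted(math.dist(p, center) for p in points)
    idx = max(0, math.ceil(coverage * len(dists)) - 1)
    return center, dists[idx]


def split_subdomains(samples, labels, coverage=0.8):
    """Tag every sample as 'overlap', 'dense' or 'sparse'.

    overlap: the sample lies inside the sphere of another class (hypothetical rule)
    dense:   the sample lies inside the sphere of its own class
    sparse:  the sample lies outside the sphere of its own class
    """
    spheres = {}
    for cls in set(labels):
        pts = [s for s, y in zip(samples, labels) if y == cls]
        spheres[cls] = fit_sphere(pts, coverage)

    domains = []
    for s, y in zip(samples, labels):
        inside_other = any(
            math.dist(s, center) <= radius
            for cls, (center, radius) in spheres.items()
            if cls != y
        )
        own_center, own_radius = spheres[y]
        if inside_other:
            domains.append("overlap")
        elif math.dist(s, own_center) <= own_radius:
            domains.append("dense")
        else:
            domains.append("sparse")
    return domains
```

Calling `split_subdomains(samples, labels)` returns one tag per sample, so each subdomain can then be handed to its own classifier. A kernel-based boundary would replace `fit_sphere` in a closer reproduction of the method.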

Key words: regional distribution of imbalanced class, Support Vector Data Description (SVDD), sparse and overlapping domains, learning classifiers on multiple isolated subdomains, Composite Classification model (CCRD)
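The cleanup stage applied to the sparse and overlapping domains can be sketched in the same spirit. The code below is a hedged illustration of a KNN-style validity check: a sample is kept only if the share of same-class samples among its k nearest neighbours reaches a threshold that depends on the sample's domain. The threshold values, the `VALIDITY` table, the function name, and the choice to search neighbours over the whole dataset are assumptions; the article's actual parameter settings are not given here.

```python
import math

# Hypothetical per-domain validity thresholds: the minimum share of
# same-class samples required among the k nearest neighbours.
VALIDITY = {"dense": 0.6, "sparse": 0.3, "overlap": 0.5}


def clean_noise(samples, labels, domains, k=3, validity=VALIDITY):
    """Return the indices of the samples that pass the validity check."""
    kept = []
    for i, (s, y, dom) in enumerate(zip(samples, labels, domains)):
        # Indices of the k nearest neighbours of sample i, excluding itself.
        neighbours = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(s, samples[j]),
        )[:k]
        if not neighbours:
            # A lone sample has no neighbours to contradict it, so keep it.
            kept.append(i)
            continue
        same = sum(1 for j in neighbours if labels[j] == y)
        if same / len(neighbours) >= validity[dom]:
            kept.append(i)
    return kept
```

The returned indices select the cleaned training set for each subdomain. Lowering the threshold for a domain makes the check more tolerant there, which is one way to reflect that isolated samples are expected in sparse regions.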

CLC Number: TP391

JIN Yan, PENG Xinguang. Composite classification model learned on multiple isolated subdomains for imbalanced class[J]. Journal of Computer Applications, 2016, 36(9): 2475-2480.

