Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (9): 2475-2480. DOI: 10.11772/j.issn.1001-9081.2016.09.2475

• Big Data •

Composite classification model learned on multiple isolated subdomains for imbalanced class

JIN Yan1, PENG Xinguang2

  1. Information Institute, Business College of Shanxi University, Taiyuan Shanxi 030031, China;
  2. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan Shanxi 030024, China
  • Received:2016-02-07 Revised:2016-04-24 Online:2016-09-10 Published:2016-09-08
  • Corresponding author: JIN Yan
  • About the authors: JIN Yan (born 1982), female, from Gaoping, Shanxi, is a lecturer with a master's degree; her main research interests are data mining and network security. PENG Xinguang (born 1955), male, from Taiyuan, Shanxi, is a professor with a doctorate and a CCF member; his main research interests are network security and trusted computing.
  • Supported by:
    This work is partially supported by the Shanxi Natural Science Foundation (2010011022-2), the Science and Technology Innovation Project of Higher Education Institutions of Shanxi Province (20131112), and the Fundamental Research Funds for the Business College of Shanxi University (2014010).

Abstract: To further reduce the constraints that class imbalance places on classification algorithms, a composite classification model learned on multiple isolated subdomains was proposed, starting from the regional distribution characteristics of the dataset. In the subdomain division stage, an extended Support Vector Data Description (SVDD) algorithm described each class by its minimum enclosing boundary, dividing the class into a dense domain inside the boundary and a sparse domain outside it. Drawing on the notion of class overlap, in which different classes contain similar samples, boundary samples were searched for and grouped into overlapping domains. In the subdomain cleanup stage, based on the neighborhood assumption of the K-Nearest Neighbor (KNN) algorithm, sample validity parameters were set according to the density of each domain, and the samples in each domain were checked one by one to remove noise. Classifiers were then learned on each isolated subdomain and combined sequentially to produce the composite classifier CCRD for imbalanced datasets. Compared with similar algorithms, including Support Vector Machine (SVM), KNN and C4.5, and with the cost-sensitive MetaCost, CCRD clearly improved the correct classification of positive instances without increasing the misclassification of negative instances; compared with Synthetic Minority Over-sampling TEchnique (SMOTE) sampling, CCRD reduced the misclassification of negative instances without affecting the correct classification of positive instances; in dataset-by-dataset comparisons on five datasets, the classification performance of CCRD improved in every case, most notably for the positive class on Haberman_sur. The experimental results indicate that the composite classification model learned on multiple isolated subdomains classifies well and is an effective method for imbalanced datasets.

Key words: regional distribution of imbalanced class, Support Vector Data Description (SVDD), sparse and overlapping domains, learning classifiers on multiple isolated subdomains, Composite Classification model (CCRD)
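
The two preprocessing stages named in the abstract (dense/sparse subdomain division and neighborhood-based noise cleanup) can be illustrated with a minimal sketch. It is not the authors' implementation: the extended SVDD is replaced here by a much simpler stand-in (a sphere around the class centroid whose radius is a distance quantile), and the function names, the `quantile` parameter and the per-domain `min_agreement` thresholds are all hypothetical values chosen for illustration. Overlapping-domain detection and the sequential combination of per-subdomain classifiers are not covered.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def split_dense_sparse(points, quantile=0.8):
    """Stand-in for SVDD (assumption): enclose one class in a sphere around
    its centroid, with the radius set to a quantile of the distances.
    Returns (dense_indices, sparse_indices)."""
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    ranked = sorted(dists)
    radius = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]
    dense = [i for i, d in enumerate(dists) if d <= radius]
    sparse = [i for i, d in enumerate(dists) if d > radius]
    return dense, sparse

def knn_agreement(i, points, labels, k):
    """Fraction of the k nearest neighbours of sample i sharing its label."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    nearest = others[:k]
    return sum(labels[j] == labels[i] for j in nearest) / len(nearest)

def clean_noise(points, labels, region, k=3, min_agreement=None):
    """Keep sample i only if its neighbourhood agreement reaches the
    threshold of its subdomain. region[i] is 'dense' or 'sparse';
    the default thresholds are hypothetical, not taken from the paper."""
    if min_agreement is None:
        min_agreement = {"dense": 0.5, "sparse": 0.3}
    return [i for i in range(len(points))
            if knn_agreement(i, points, labels, k) >= min_agreement[region[i]]]

# Small demo: one class with an outlying sample, then a mislabeled sample.
negative = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
dense_idx, sparse_idx = split_dense_sparse(negative, quantile=0.8)
print("dense:", dense_idx, "sparse:", sparse_idx)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
kept = clean_noise(points, labels, region=["dense"] * len(points), k=3)
print("kept after cleanup:", kept)
```

In the demo, the sample at (5, 5) falls outside the stand-in boundary and lands in the sparse domain, while the sample labeled 1 at (0.5, 0.5) sits among class-0 neighbours and is dropped as noise. A looser threshold for sparse domains reflects the idea that isolated samples there should not be discarded as readily as in dense domains.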

CLC number: