Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (8): 2281-2287.DOI: 10.11772/j.issn.1001-9081.2019010162

• Data science and technology •

Learning sample extraction method based on convex boundary

GU Yiyi, TAN Xuntao, YUAN Yubo   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2019-01-22  Revised: 2019-03-30  Online: 2019-04-15  Published: 2019-08-10
  • Supported by:
    This work is partially supported by Provincial Key Research and Development Plan of Zhejiang (2019C03004).

  • Corresponding author: YUAN Yubo
  • About the authors: GU Yiyi (1994-), female, born in Tianjin, M. S., research interests: sample extraction methods, data preprocessing, data mining; TAN Xuntao (1994-), male, born in Neijiang, Sichuan, M. S., research interests: data quality assessment, data mining; YUAN Yubo (1976-), male, born in Xuanwei, Yunnan, Ph. D., associate professor, research interests: machine learning, data science, data quality assessment, data mining.

Abstract: The quality and quantity of learning samples are crucial for intelligent data classification systems, yet there is no general, effective method for discovering meaningful samples in such systems. Motivated by this, the concept of the convex boundary of a dataset was proposed, and a fast method for discovering a meaningful sample set was given. Firstly, abnormal and incomplete samples in the learning sample set were cleaned with a box-plot function. Secondly, the concept of the data cone was proposed, and the normalized learning samples were partitioned into cones. Finally, each cone-shaped sample subset was centralized, and the samples whose distance to the convex boundary was extremely small were extracted to form the convex boundary sample set. In the experiments, six classical data classification algorithms, namely Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR), were tested on 12 UCI datasets. The results show that the convex boundary sample sets significantly shorten the training time of every algorithm while maintaining classification performance. In particular, on datasets containing much noise, such as the caesarian section, electrical grid stability and car evaluation datasets, the convex boundary sample set even improves classification performance. To evaluate the efficiency of the convex boundary sample set more objectively, the sample cleaning efficiency was defined as the ratio of the sample size change rate to the classification performance change rate. A cleaning efficiency greater than 1 indicates that the method is effective, and the larger the value, the greater the benefit of using convex boundary samples as learning samples. For example, on the HTRU2 (pulsar) dataset, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, demonstrating the strong performance of this method.

Key words: machine learning, data classification, sample selection, convex cone, boundary sample

