Learning sample extraction method based on convex boundary

doi:10.11772/j.issn.1001-9081.2019010162

Journal of Computer Applications, 2019, Vol. 39, Issue (8): 2281-2287. DOI: 10.11772/j.issn.1001-9081.2019010162

GU Yiyi, TAN Xuntao, YUAN Yubo

School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Received: 2019-01-22 Revised: 2019-03-30 Online: 2019-04-15 Published: 2019-08-10
Corresponding author: YUAN Yubo
About the authors: GU Yiyi (born 1994), female, from Tianjin, M.S., research interests: sample extraction methods, data preprocessing, data mining; TAN Xuntao (born 1994), male, from Neijiang, Sichuan, M.S., research interests: data quality assessment, data mining; YUAN Yubo (born 1976), male, from Xuanwei, Yunnan, associate professor, Ph.D., research interests: machine learning, data science, data quality assessment and data mining.
Supported by:
This work is partially supported by the Provincial Key Research and Development Plan of Zhejiang (2019C03004).

Abstract: The quality and quantity of learning samples are critical to intelligent data classification systems, yet there is no generally accepted method for finding meaningful samples in such systems. Motivated by this, the concept of the convex boundary of a dataset was proposed, and a fast method for discovering a meaningful sample set was given. Firstly, abnormal samples and samples with incomplete features were removed from the learning sample set by a box-plot function. Secondly, the concept of a data cone was proposed, and the normalized learning samples were partitioned into cones. Finally, each cone-shaped sample subset was centralized, and the samples that differ very little from the convex boundary were extracted to form the convex boundary sample set. In the experiments, six classical data classification algorithms, namely Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR), were tested on 12 UCI datasets. The results show that training on the convex boundary sample set significantly shortens the training time of each algorithm while maintaining classification performance. In particular, for datasets containing a large amount of noisy data, such as the caesarean section, electrical grid stability and car evaluation datasets, the convex boundary sample set improves classification performance. To better evaluate the efficiency of the convex boundary sample set, the sample cleaning efficiency was defined as the ratio of the sample size change rate to the classification performance change rate, and this index was used to evaluate the significance of convex boundary samples objectively. A cleaning efficiency greater than 1 indicates that the method is effective, and the higher the value, the better the effect of using convex boundary samples as learning samples. For example, on the HTRU2 pulsar dataset, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, which indicates the strong performance of the method.
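
The cleaning step named in the abstract can be illustrated with a minimal sketch. It assumes the conventional Tukey box-plot rule (fences at 1.5 times the interquartile range, applied per feature); the abstract does not state the exact rule used, and the function name `boxplot_clean` and the `whisker` parameter are hypothetical.

```python
import math
from statistics import quantiles


def boxplot_clean(rows, whisker=1.5):
    """Drop incomplete rows, then rows with any feature outside the box-plot fences.

    Assumption: Tukey's rule with fences at Q1 - whisker*IQR and Q3 + whisker*IQR.
    The paper's exact cleaning rule may differ.
    """
    # Step 1: remove samples with missing feature values (None or NaN).
    complete = [
        row for row in rows
        if not any(v is None or math.isnan(v) for v in row)
    ]
    # Step 2: compute the lower and upper fence of every feature column.
    fences = []
    for column in zip(*complete):
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        fences.append((q1 - whisker * iqr, q3 + whisker * iqr))
    # Step 3: keep only samples whose every feature lies inside its fences.
    return [
        row for row in complete
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, fences))
    ]


# Made-up toy data, not taken from the paper.
samples = [
    [1.0, 10.0],
    [2.0, 11.0],
    [3.0, 12.0],
    [4.0, 13.0],
    [100.0, 12.0],         # abnormal value in the first feature
    [float("nan"), 11.0],  # incomplete sample
]
cleaned = boxplot_clean(samples)
```

In this toy input the incomplete row and the row containing 100.0 are dropped, leaving four samples for the later normalization and cone partition steps.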

Key words: machine learning, data classification, sample selection, convex cone, boundary sample
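
The sample cleaning efficiency described in the abstract can be sketched as follows. The abstract only says it is the ratio of the sample size change rate to the classification performance change rate, so the relative-change formula below, the handling of an unchanged score, and all function and parameter names are assumptions made for illustration.

```python
def change_rate(before, after):
    """Relative change |before - after| / before (assumed definition)."""
    return abs(before - after) / before


def cleaning_efficiency(n_full, n_boundary, score_full, score_boundary):
    """Ratio of sample size change rate to performance change rate.

    Assumption: both rates are relative changes with respect to the full
    learning sample set; the paper's exact formula may differ.
    """
    sample_rate = change_rate(n_full, n_boundary)
    score_rate = change_rate(score_full, score_boundary)
    if score_rate == 0:
        # Performance is unchanged, so any sample reduction is "free".
        return float("inf")
    return sample_rate / score_rate


# Hypothetical numbers, not results from the paper:
# 1000 samples reduced to 200, accuracy moving from 0.90 to 0.89.
efficiency = cleaning_efficiency(1000, 200, 0.90, 0.89)
print(f"cleaning efficiency: {efficiency:.1f}")  # roughly 72 for these numbers
```

Under this reading, a value above 1 means the sample set shrank proportionally more than the classification score changed, which matches the abstract's statement that values greater than 1 indicate an effective method.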

CLC number: TP311.1

Cite this article: GU Yiyi, TAN Xuntao, YUAN Yubo. Learning sample extraction method based on convex boundary[J]. Journal of Computer Applications, 2019, 39(8): 2281-2287.

