Abstract: The quality and quantity of learning samples are crucial for intelligent data classification systems, yet there is no generally effective method for finding meaningful samples. To address this, the concept of the convex boundary of a dataset was proposed, and a fast method for discovering a meaningful sample set was given. First, abnormal and incomplete samples in the learning sample set were cleaned using the box-plot method. Second, the concept of the data cone was proposed to partition the normalized learning samples into cones. Finally, the sample subset in each cone was centralized, and the samples differing only very slightly from the convex boundary were extracted to form the convex boundary sample set. In the experiments, six classical data classification algorithms, namely Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR), were tested on 12 UCI datasets. The results show that convex boundary sample sets significantly shorten the training time of each algorithm while maintaining classification performance; in particular, for datasets with much noisy data, such as the Caesarian Section, Electrical Grid and Car Evaluation datasets, the convex boundary sample set can even improve classification performance. To better evaluate the efficiency of the convex boundary sample set, the sample cleaning efficiency was defined as the quotient of the sample-size change rate and the classification-performance change rate; this index evaluates the significance of convex boundary samples objectively. A cleaning efficiency greater than 1 indicates that the method is effective, and the higher the value, the better the effect of using convex boundary samples as learning samples. For example, on the HTRU2 dataset, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, demonstrating the strong performance of the method.
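As a concrete illustration of the cleaning step, the sketch below removes incomplete and abnormal samples with the standard box-plot (IQR) rule. The abstract does not state the whisker coefficient, so the conventional 1.5×IQR is assumed, and the function name `boxplot_clean` is ours.

```python
import numpy as np

def boxplot_clean(X, k=1.5):
    """Remove incomplete (NaN) samples and samples falling outside the
    box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR] in any feature.
    k = 1.5 is the conventional whisker coefficient (an assumption;
    the abstract does not specify it)."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    complete = ~np.isnan(X).any(axis=1)            # drop incomplete samples
    inlier = ((X >= lo) & (X <= hi)).all(axis=1)   # drop abnormal samples
    return X[complete & inlier]
```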
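The cone partition and boundary extraction are the paper's core contribution, and the abstract does not give their exact formulation. The following is therefore only one plausible reading, under the assumptions that samples are min-max normalized, cones are approximated by the dominant coordinate, and "convex boundary" samples are approximated by points of near-maximal norm within each cone; the function name and tolerance `eps` are hypothetical.

```python
def cone_boundary_samples(X, eps=0.05):
    """Hypothetical sketch (not the paper's exact algorithm): min-max
    normalize, bucket samples into cones by their dominant (largest)
    coordinate, and in each cone keep the samples whose norm is within
    eps of the cone maximum, treating those as near the convex boundary."""
    X = np.asarray(X, dtype=float)
    Xn = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # scale to [0, 1]
    cone = Xn.argmax(axis=1)              # crude cone index per sample
    norm = np.linalg.norm(Xn, axis=1)
    keep = np.zeros(len(Xn), dtype=bool)
    for c in np.unique(cone):
        m = cone == c
        keep[m] = norm[m] >= norm[m].max() - eps
    return X[keep]
```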
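The sample cleaning efficiency is defined explicitly in the abstract as the quotient of the sample-size change rate and the classification-performance change rate. A direct transcription follows; the guard for an unchanged performance is our addition.

```python
def cleaning_efficiency(n_before, n_after, perf_before, perf_after):
    """Quotient of the sample-size change rate and the classification-
    performance change rate; values greater than 1 indicate that the
    sample reduction outweighs any loss in classification performance."""
    size_rate = abs(n_before - n_after) / n_before
    perf_rate = abs(perf_before - perf_after) / perf_before
    # Unchanged performance with fewer samples -> unbounded efficiency.
    return float("inf") if perf_rate == 0 else size_rate / perf_rate

# Hypothetical illustration (not the paper's measured numbers):
# an 85% sample reduction with accuracy dropping 0.950 -> 0.938
# yields an efficiency of about 67.
print(cleaning_efficiency(10000, 1500, 0.950, 0.938))
```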