Abstract: The quality and quantity of learning samples are crucial for intelligent data classification systems, yet there is no generally effective method for finding meaningful samples. To address this, the concept of the convex boundary of a dataset was proposed, and a fast method for discovering a meaningful sample set was given. First, abnormal and incomplete samples in the learning sample set were cleaned using the box-plot method. Second, the concept of the data cone was proposed to partition the normalized learning samples into cones. Finally, the sample subset in each cone was centralized, and the samples differing only very slightly from the convex boundary were extracted to form the convex boundary sample set. In the experiments, six classical data classification algorithms, namely Gaussian Naive Bayes (GNB), Classification And Regression Tree (CART), Linear Discriminant Analysis (LDA), Adaptive Boosting (AdaBoost), Random Forest (RF) and Logistic Regression (LR), were tested on 12 UCI datasets. The results show that convex boundary sample sets significantly shorten the training time of each algorithm while maintaining classification performance; in particular, for datasets with much noisy data, such as the Caesarian Section, Electrical Grid and Car Evaluation datasets, the convex boundary sample set can even improve classification performance. To better evaluate the efficiency of the convex boundary sample set, the sample cleaning efficiency was defined as the quotient of the sample-size change rate and the classification-performance change rate; this index evaluates the significance of convex boundary samples objectively. A cleaning efficiency greater than 1 indicates that the method is effective, and the higher the value, the better the effect of using convex boundary samples as learning samples. For example, on the HTRU2 dataset, the cleaning efficiency of the proposed method for the GNB algorithm exceeds 68, demonstrating the strong performance of the method.
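As a concrete illustration of the cleaning step, the sketch below removes incomplete and abnormal samples with the standard box-plot (IQR) rule. The abstract does not state the whisker coefficient, so the conventional 1.5×IQR is assumed, and the function name `boxplot_clean` is ours.

```python
import numpy as np

def boxplot_clean(X, k=1.5):
    """Remove incomplete (NaN) samples and samples falling outside the
    box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR] in any feature.
    k = 1.5 is the conventional whisker coefficient (an assumption;
    the abstract does not specify it)."""
    X = np.asarray(X, dtype=float)
    q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    complete = ~np.isnan(X).any(axis=1)            # drop incomplete samples
    inlier = ((X >= lo) & (X <= hi)).all(axis=1)   # drop abnormal samples
    return X[complete & inlier]
```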
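The cone partition and boundary extraction are the paper's core contribution, and the abstract does not give their exact formulation. The following is therefore only one plausible reading, under the assumptions that samples are min-max normalized, cones are approximated by the dominant coordinate, and "convex boundary" samples are approximated by points of near-maximal norm within each cone; the function name and tolerance `eps` are hypothetical.

```python
def cone_boundary_samples(X, eps=0.05):
    """Hypothetical sketch (not the paper's exact algorithm): min-max
    normalize, bucket samples into cones by their dominant (largest)
    coordinate, and in each cone keep the samples whose norm is within
    eps of the cone maximum, treating those as near the convex boundary."""
    X = np.asarray(X, dtype=float)
    Xn = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # scale to [0, 1]
    cone = Xn.argmax(axis=1)              # crude cone index per sample
    norm = np.linalg.norm(Xn, axis=1)
    keep = np.zeros(len(Xn), dtype=bool)
    for c in np.unique(cone):
        m = cone == c
        keep[m] = norm[m] >= norm[m].max() - eps
    return X[keep]
```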
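The sample cleaning efficiency is defined explicitly in the abstract as the quotient of the sample-size change rate and the classification-performance change rate. A direct transcription follows; the guard for an unchanged performance is our addition.

```python
def cleaning_efficiency(n_before, n_after, perf_before, perf_after):
    """Quotient of the sample-size change rate and the classification-
    performance change rate; values greater than 1 indicate that the
    sample reduction outweighs any loss in classification performance."""
    size_rate = abs(n_before - n_after) / n_before
    perf_rate = abs(perf_before - perf_after) / perf_before
    # Unchanged performance with fewer samples -> unbounded efficiency.
    return float("inf") if perf_rate == 0 else size_rate / perf_rate

# Hypothetical illustration (not the paper's measured numbers):
# an 85% sample reduction with accuracy dropping 0.950 -> 0.938
# yields an efficiency of about 67.
print(cleaning_efficiency(10000, 1500, 0.950, 0.938))
```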