Journal of Computer Applications (《计算机应用》), sole official website


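Random sample partition-based distributed observation point classifier for big data

LI Xu1, HE Yulin1, CUI Laizhong2, HUANG Zhexue2, Philippe Fournier-Viger2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen)
  2. Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University
  • Received: 2023-06-29 Revised: 2023-08-21 Online: 2023-09-11 Published: 2023-09-11
  • Corresponding author: HE Yulin
  • Supported by:
    General Program of the National Natural Science Foundation of China; General Program of the Natural Science Foundation of Guangdong Province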

Abstract: The observation point classifier (OPC) is a recent supervised learning model that attempts to transform a linearly inseparable problem in the multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space, and it is particularly effective for classifying high-dimensional data. To reduce the high training complexity of OPC on big data classification problems, a distributed observation point classifier (DOPC) based on the random sample partition (RSP) of big data was designed under the Spark framework. First, RSP data blocks of the big data were generated in a distributed computing environment and converted into resilient distributed datasets (RDDs). Second, a set of OPCs was trained collaboratively on the RSP data blocks; because the OPC on each RSP data block is trained independently, the training can be implemented efficiently on Spark. Finally, the OPCs trained on the RSP data blocks were integrated under the Spark framework into a DOPC that predicts the class labels of new samples. Experiments on 8 big data sets were conducted to validate the feasibility, rationality and effectiveness of DOPC as implemented on a Spark cluster. The results show that DOPC achieves higher testing accuracy than the single-machine OPC at lower computational cost. Compared with the RSP-based neural network, decision tree, naive Bayes and K-nearest neighbor classifiers implemented under the Spark framework, DOPC improves the average testing accuracy by 1.7, 0.2, 12.1 and 1.9 percentage points, respectively. These results indicate that DOPC is an efficient, low-cost supervised learning algorithm for big data classification problems.

Key words: big data classification, distributed file system, random sample partition, observation point classifier, Spark computing framework
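
The three-step pipeline summarized in the abstract (partition the data into random sample blocks, train one base classifier per block independently, fuse the block-level predictions) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: it does not use Spark, the base learner is a simple nearest-class-mean classifier standing in for the OPC (whose details are not given here), majority voting is assumed as the fusion rule, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_sample_partition(samples, num_blocks, seed=0):
    """Shuffle the samples and deal them round-robin into num_blocks blocks,
    so that each block is a random sample of the whole data set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_centroid_classifier(block):
    """Stand-in base learner (NOT the OPC): store one mean vector per class."""
    sums, counts = {}, Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(model, features):
    """Return the label whose class mean is nearest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: sq_dist(model[label]))


def ensemble_predict(models, features):
    """Fuse the block-level predictions by majority vote (assumed fusion rule)."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]


def make_toy_data(n_per_class=200, seed=1):
    """Hypothetical toy data: two Gaussian clusters in the plane."""
    rng = random.Random(seed)
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (4.0, 4.0))):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    return data


data = make_toy_data()
blocks = random_sample_partition(data, num_blocks=5)
# Each block is trained on its own, with no shared state between blocks.
models = [train_centroid_classifier(block) for block in blocks]
print(ensemble_predict(models, (0.2, -0.1)), ensemble_predict(models, (3.8, 4.3)))
```

Because the per-block training calls share no state, the list comprehension over `blocks` is the part that a distributed framework could run in parallel, for example one block per partition; how DOPC actually maps this onto Spark is not specified in the abstract.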
