Random sample partition-based distributed observation point classifier for big data

doi:10.11772/j.issn.1001-9081.2023060847

Journal of Computer Applications (official website)

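Li Xu (李旭)¹, He Yulin (何玉林)¹, Cui Laizhong (崔来中)², Huang Zhexue (黄哲学)², Philippe Fournier-Viger²

1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen)
2. Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University

Received: 2023-06-29 Revised: 2023-08-21 Online: 2023-09-11 Published: 2023-09-11
Corresponding author: He Yulin
Funding:
General Program of the National Natural Science Foundation of China; General Program of the Natural Science Foundation of Guangdong Province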

Abstract

The observation point classifier (OPC) is a recent supervised learning model that attempts to transform a linearly inseparable problem in the multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space, and it is particularly effective for classifying high-dimensional data. To address the high training complexity of OPC on big data classification problems, a distributed OPC (DOPC) based on the random sample partition (RSP) of big data was designed under the Spark framework. First, the RSP data blocks of the big data were generated in a distributed computing environment and converted into resilient distributed datasets (RDDs). Next, a set of OPCs was trained collaboratively on the RSP data blocks; because the OPC on each RSP data block is trained independently, the training can be implemented efficiently in Spark. Finally, the OPCs trained on the RSP data blocks were integrated under the Spark framework into a DOPC that predicts the class labels of new samples. Experiments on 8 big data sets validated the feasibility, rationality, and effectiveness of DOPC implemented on a Spark cluster. The results show that DOPC achieves higher testing accuracy than the single-machine OPC at lower computational cost. Compared with the RSP model-based neural network, decision tree, naïve Bayes, and K-nearest neighbor classifiers implemented under the Spark framework, DOPC improves the average testing accuracy by 1.7, 0.2, 12.1, and 1.9 percentage points, respectively. This testing performance indicates that DOPC is an efficient, low-consumption supervised learning algorithm for big data classification problems.

Key words: Big data classification, Distributed file system, Random sample partition, Observation point classifier, Spark computing framework
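The three steps summarized in the abstract (partition into random sample blocks, train one classifier per block independently, fuse the predictions) can be illustrated with a minimal single-machine sketch. Everything in it is an assumption made for illustration: plain Python lists stand in for Spark RDDs, `ToyDistanceClassifier` is a toy stand-in and not the paper's OPC, the observation point is fixed by hand at the origin, majority voting is assumed as the fusion rule, and the ring-shaped data set is synthetic. It only shows why classes that are not linearly separable in 2-D can become separable by a threshold on the distance to an observation point.

```python
import math
import random
from collections import Counter


def random_sample_partition(samples, n_blocks, seed=0):
    """Shuffle once, then deal samples round-robin so that every block
    is a random sample of the whole data set (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


class ToyDistanceClassifier:
    """Toy stand-in for a base learner (NOT the paper's OPC): it maps each
    sample to its distance from one observation point and assigns the class
    whose mean training distance is closest."""

    def fit(self, block, observation_point):
        self.point = observation_point
        distances_by_label = {}
        for x, y in block:
            distances_by_label.setdefault(y, []).append(math.dist(x, self.point))
        self.mean_dist = {y: sum(d) / len(d) for y, d in distances_by_label.items()}
        return self

    def predict(self, x):
        d = math.dist(x, self.point)
        return min(self.mean_dist, key=lambda y: abs(self.mean_dist[y] - d))


def train_ensemble(blocks, observation_point):
    """Train one model per block; the models do not share any state, which is
    what makes this step easy to run in parallel."""
    return [ToyDistanceClassifier().fit(block, observation_point) for block in blocks]


def predict_majority(models, x):
    """Fuse the per-block predictions by majority vote (assumed fusion rule)."""
    votes = Counter(model.predict(x) for model in models)
    return votes.most_common(1)[0][0]


def make_ring_data(n, seed=1):
    """Synthetic data: class 0 inside radius 1, class 1 on a ring of radius
    2 to 3. The classes are not linearly separable in the 2-D plane."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        radius = rng.uniform(0.0, 1.0) if label == 0 else rng.uniform(2.0, 3.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        data.append(((radius * math.cos(angle), radius * math.sin(angle)), label))
    return data


data = make_ring_data(2000)
blocks = random_sample_partition(data, n_blocks=8)
models = train_ensemble(blocks, observation_point=(0.0, 0.0))
accuracy = sum(predict_majority(models, x) == y for x, y in data) / len(data)
print(f"blocks: {len(blocks)}, training-set accuracy: {accuracy:.3f}")
```

In a Spark setting, the per-block training loop would presumably be expressed as an operation over the partitions of an RDD, but that mapping is not shown here because the abstract does not describe it.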

CLC number: TP181

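Li Xu, He Yulin, Cui Laizhong, Huang Zhexue, Philippe Fournier-Viger. Random sample partition-based distributed observation point classifier for big data[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2023060847.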
