Distributed observation point classifier for big data with random sample partition

doi:10.11772/j.issn.1001-9081.2023060847

Journal of Computer Applications, 2024, Vol. 44, Issue (6): 1727-1733. DOI: 10.11772/j.issn.1001-9081.2023060847

Special topic: The 38th CCF National Conference of Computer Applications (CCF NCCA 2023)

Xu LI¹, Yulin HE¹, Laizhong CUI¹,², Zhexue HUANG¹,², Fournier‑Viger PHILIPPE²

1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China
2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China

Received: 2023-06-29 Revised: 2023-08-21 Accepted: 2023-08-23 Online: 2023-09-11 Published: 2024-06-10
Contact: Yulin HE
About the authors: LI Xu, born in 1996 in Shantou, Guangdong, M.S., engineer, CCF member. His research interests include distributed computation of big data, data mining, and machine learning.
CUI Laizhong, born in 1984 in Baishan, Jilin, Ph.D., professor, CCF member. His research interests include Internet architecture, edge computing, and AI-driven network optimization.
HUANG Zhexue, born in 1959 in Harbin, Heilongjiang, Ph.D., professor, CCF member. His research interests include intelligent computing for new computational power networks, approximate computing for big data, data mining, and machine learning.
PHILIPPE Fournier‑Viger, born in 1980 in Montreal, Canada, Ph.D., professor. His research interests include data mining, pattern recognition, and artificial intelligence.
Supported by:
National Natural Science Foundation of China (61972261); Natural Science Foundation of Guangdong Province (2023A1515011667); Basic Research Foundation of Shenzhen (JCYJ20220818100205012)

Abstract

Abstract:

The Observation Point Classifier (OPC) is a supervised learning model that tries to transform a linearly inseparable problem in the original multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space, and it is particularly effective for classifying high-dimensional data. To alleviate the high training complexity of OPC on big data classification problems, a Random Sample Partition (RSP)-based Distributed OPC (DOPC) for big data was designed under the Spark framework. First, the RSP data blocks of the big data were generated in a distributed computing environment and converted into a Resilient Distributed Dataset (RDD). Second, a set of OPCs was trained collaboratively on the RSP data blocks; because the OPC on each RSP data block is trained independently, this step parallelizes efficiently on Spark. Finally, the OPCs trained on the RSP data blocks were fused into a DOPC under the Spark framework to predict the class label of an unseen sample. Experiments on eight big datasets were conducted on a Spark cluster to validate the feasibility, rationality, and effectiveness of DOPC. The experimental results show that DOPC trained on multiple computation nodes achieves higher testing accuracy than OPC trained on a single computation node while consuming less time, and that, compared with the RSP-model-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers implemented under the Spark framework, DOPC has stronger generalization capability. These results demonstrate that DOPC is an efficient, low-cost supervised learning algorithm for big data classification problems.

Key words: big data classification, distributed file system, Random Sample Partition (RSP), Observation Point Classifier (OPC), Spark computing framework
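
The three steps in the abstract (partition the data into RSP blocks, train one classifier per block independently, fuse the per-block classifiers) can be sketched in plain Python. This is a minimal single-machine illustration under stated assumptions, not the authors' Spark implementation: the base learner is a nearest-centroid classifier standing in for OPC, whose internals are not described on this page; fusion by majority vote is assumed; and the function names (`rsp_blocks`, `train_centroid_classifier`, `ensemble_predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def rsp_blocks(samples, num_blocks, seed=0):
    """Shuffle once, then deal samples round-robin into num_blocks blocks,
    so that each block is a random sample of the whole data set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_centroid_classifier(block):
    """Stand-in base learner (NOT the paper's OPC): one mean vector per class."""
    sums, counts = {}, Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def ensemble_predict(models, features):
    """Fuse the per-block models by majority vote (an assumed fusion rule)."""
    votes = Counter(predict_one(model, features) for model in models)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated 2-D Gaussian classes.
data_rng = random.Random(42)
data = [([data_rng.gauss(0, 1), data_rng.gauss(0, 1)], 0) for _ in range(200)]
data += [([data_rng.gauss(5, 1), data_rng.gauss(5, 1)], 1) for _ in range(200)]

blocks = rsp_blocks(data, num_blocks=4)
# Each block is trained without looking at any other block.
models = [train_centroid_classifier(block) for block in blocks]
print(ensemble_predict(models, [0.2, -0.1]), ensemble_predict(models, [4.8, 5.3]))
```

Because no block depends on another during training, the loop over `blocks` is the part that a cluster could distribute, one block per task.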

CLC number:

TP181

Xu LI, Yulin HE, Laizhong CUI, Zhexue HUANG, Fournier‑Viger PHILIPPE. Distributed observation point classifier for big data with random sample partition[J]. Journal of Computer Applications, 2024, 44(6): 1727-1733.

Dataset	Number of samples	Number of attributes	Number of classes	Number of blocks
Skin Segmentation	245 057	3	2	50
Census-Income	299 285	40	2	50
RLCP	5 749 132	11	2	100
HIGGS	11 000 000	28	2	1 200
Synthetic dataset 1	1 000 000	7	2	50
Synthetic dataset 2	1 500 000	7	5	50
Synthetic dataset 3	10 000 000	7	3	500
Synthetic dataset 4	20 000 000	7	2	1 000

Dataset	Single-node OPC training time/s	Single-node OPC test accuracy	DOPC training time/s	DOPC test accuracy
Skin Segmentation	139.734±15.478	0.990±0.012	45.896±3.783	0.994±0.002
Census-Income	8 315.174±53.542	0.927±0.007	58.750±5.315	0.943±0.001
RLCP	1 625.486±89.263	0.959±0.003	230.218±4.172	0.984±0.002
HIGGS	-	-	14 267.521±318.905	0.601±0.005
Synthetic dataset 1	26 125.715±131.297	0.951±0.021	458.760±21.142	0.963±0.003
Synthetic dataset 2	42 512.142±236.364	0.800±0.013	1 339.028±30.513	0.821±0.016
Synthetic dataset 3	259 341.914±714.381	0.852±0.035	7 015.188±214.275	0.877±0.018
Synthetic dataset 4	-	-	8 720.214±227.164	0.950±0.002
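
To make the training-time comparison in the table above easier to read, the speedup of DOPC over single-node OPC can be computed as the ratio of the two mean training times. The short sketch below does only that arithmetic; the values are the means copied from the table (standard deviations are ignored), the variable names are illustrative, and HIGGS and synthetic dataset 4 are left out because no single-node OPC result is reported for them.

```python
# Mean training times in seconds: (single-node OPC, DOPC), copied from the table.
training_time_s = {
    "Skin Segmentation": (139.734, 45.896),
    "Census-Income": (8315.174, 58.750),
    "RLCP": (1625.486, 230.218),
    "Synthetic dataset 1": (26125.715, 458.760),
    "Synthetic dataset 2": (42512.142, 1339.028),
    "Synthetic dataset 3": (259341.914, 7015.188),
}

# Speedup = single-node training time divided by DOPC training time.
speedup = {
    name: single_node / dopc
    for name, (single_node, dopc) in training_time_s.items()
}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
```

On these six datasets the ratio ranges from about 3x (Skin Segmentation) to about 141x (Census-Income), so the benefit depends strongly on the dataset.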

Dataset	DOPC	NN-Spark	DT-Spark	KNN-Spark	NB-Spark
Skin Segmentation	0.994±0.002	0.989±0.001	0.976±0.002	0.994±0.002	0.923±0.003
Census-Income	0.943±0.001	0.938±0.007	0.920±0.031	0.937±0.012	0.570±0.044
RLCP	0.984±0.002	0.998±0.001	0.998±0.001	0.999±0.000	0.981±0.005
HIGGS	0.601±0.005	0.570±0.005	0.673±0.005	0.579±0.014	0.583±0.004
Synthetic dataset 1	0.963±0.003	0.938±0.001	0.963±0.008	0.951±0.013	0.870±0.002
Synthetic dataset 2	0.821±0.016	0.818±0.021	0.809±0.015	0.815±0.006	0.658±0.001
Synthetic dataset 3	0.877±0.018	0.820±0.004	0.831±0.003	0.831±0.002	0.675±0.004
Synthetic dataset 4	0.950±0.002	0.924±0.007	0.944±0.003	0.941±0.005	0.902±0.009
Average accuracy	0.891±0.006	0.874±0.005	0.889±0.008	0.880±0.006	0.770±0.009
