Journal of Computer Applications (official website) ›› 2024, Vol. 44 ›› Issue (6): 1727-1733. DOI: 10.11772/j.issn.1001-9081.2023060847

Special topic: The 38th CCF National Conference on Computer Applications (CCF NCCA 2023)


Distributed observation point classifier based on random sample partition of big data

LI Xu1, HE Yulin1, CUI Laizhong1,2, HUANG Zhexue1,2, FOURNIER-VIGER Philippe2

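  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China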
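  • Received: 2023-06-29 Revised: 2023-08-21 Accepted: 2023-08-23 Online: 2023-09-11 Published: 2024-06-10
  • Corresponding author: HE Yulin
  • About the authors: LI Xu, born in 1996 in Shantou, Guangdong, male, M.S., engineer, CCF member. His research interests include distributed computation of big data, data mining, and machine learning.
    CUI Laizhong, born in 1984 in Baishan, Jilin, male, Ph.D., professor, CCF member. His research interests include Internet architecture, edge computing, and AI-driven network optimization.
    HUANG Zhexue, born in 1959 in Harbin, Heilongjiang, male, Ph.D., professor, CCF member. His research interests include intelligent computing for new computing power networks, approximate computing for big data, data mining, and machine learning.
    FOURNIER-VIGER Philippe, born in 1980 in Montreal, Canada, male, Ph.D., professor. His research interests include data mining, pattern recognition, and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61972261); Natural Science Foundation of Guangdong Province (2023A1515011667); Basic Research Foundation of Shenzhen (JCYJ20220818100205012)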

Distributed observation point classifier for big data with random sample partition

Xu LI1, Yulin HE1, Laizhong CUI1,2, Zhexue HUANG1,2, Philippe FOURNIER-VIGER2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received: 2023-06-29 Revised: 2023-08-21 Accepted: 2023-08-23 Online: 2023-09-11 Published: 2024-06-10
  • Contact: Yulin HE
  • About the authors: LI Xu, born in 1996, M.S., engineer. His research interests include distributed computation of big data, data mining, and machine learning.
    CUI Laizhong, born in 1984, Ph.D., professor. His research interests include Internet architecture, edge computing, and AI-driven network optimization.
    HUANG Zhexue, born in 1959, Ph.D., professor. His research interests include intelligent computing for new computing power networks, approximate computing for big data, data mining, and machine learning.
    FOURNIER-VIGER Philippe, born in 1980, Ph.D., professor. His research interests include data mining, pattern recognition, and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61972261); Natural Science Foundation of Guangdong Province (2023A1515011667); Basic Research Foundation of Shenzhen (JCYJ20220818100205012)

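Abstract (from the Chinese-language version):

The Observation Point Classifier (OPC) is a supervised learning model that attempts to transform a linearly inseparable problem in the multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space; it is particularly effective for classifying high-dimensional data. To address the high training complexity of OPC on big data classification problems, a Distributed OPC (DOPC) based on the Random Sample Partition (RSP) of big data was designed under the Spark framework. First, RSP data blocks of the big data were generated in a distributed computing environment and converted into a Resilient Distributed Dataset (RDD). Second, a set of OPCs was trained collaboratively on the RSP data blocks; because the OPC on each RSP data block is trained independently, this step can be implemented efficiently on Spark. Finally, the OPCs trained on the RSP data blocks were integrated into a DOPC under the Spark framework to predict the class labels of new samples. The feasibility, rationality and effectiveness of DOPC, implemented on a Spark cluster, were validated experimentally on 8 big datasets. The experimental results show that DOPC achieves higher testing accuracy than single-machine OPC at a lower computational cost, and that DOPC generalizes better than the RSP-model-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers implemented under the Spark framework. These results indicate that DOPC is an efficient, low-cost supervised learning algorithm for big data classification problems.

Key words: big data classification, distributed file system, random sample partition, observation point classifier, Spark computing framework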

Abstract:

The Observation Point Classifier (OPC) is a supervised learning model that tries to transform a multi-dimensional linearly inseparable problem in the original data space into a one-dimensional linearly separable problem in the projective distance space, and it is well suited to high-dimensional data classification. To alleviate the high training complexity of OPC on big data classification problems, a Random Sample Partition (RSP)-based Distributed OPC (DOPC) for big data was designed under the Spark framework. First, RSP data blocks were generated and transformed into a Resilient Distributed Dataset (RDD) in the distributed computing environment. Second, a set of OPCs was collaboratively trained on the RSP data blocks with high Spark parallelizability. Finally, the different OPCs were fused into a DOPC to predict the final label of an unknown sample. Experiments on eight big datasets were conducted to validate the feasibility, rationality and effectiveness of the designed DOPC. Experimental results show that DOPC trained on multiple computation nodes achieves higher testing accuracy than OPC trained on a single computation node while consuming less time, and that, compared with the RSP-model-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers under the Spark framework, DOPC obtains stronger generalization capability. These results demonstrate that DOPC is an efficient, low-cost supervised learning algorithm for handling big data classification problems.

Key words: big data classification, distributed file system, Random Sample Partition (RSP), Observation Point Classifier (OPC), Spark computing framework
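
The three steps named in the abstract (partition into RSP blocks, train one model per block independently, fuse the models) can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base learner is a nearest-centroid stand-in rather than the OPC, the fusion rule is a plain majority vote (the abstract does not specify one), and all function names (`rsp_blocks`, `train_centroids`, `ensemble_predict`, `make_data`) and the synthetic two-class data are hypothetical.

```python
import random
from collections import Counter


def rsp_blocks(data, n_blocks, seed=0):
    """Shuffle once, then deal the records into n_blocks near-equal blocks.

    Assumption: a shuffle-and-split is used here as a simple way to make each
    block a random sample of the whole dataset.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def train_centroids(block):
    """Stand-in base learner (NOT the OPC): the mean vector of each class."""
    sums, counts = {}, Counter()
    for x, y in block:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_one(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sqdist(model[y]))


def ensemble_predict(models, x):
    """Fuse the per-block models by majority vote (an assumed fusion rule)."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


def make_data(n, seed=1):
    """Synthetic 2-D data: class 0 around (0, 0), class 1 around (5, 5)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2
        center = 0.0 if y == 0 else 5.0
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], y))
    return data


data = make_data(2000)
blocks = rsp_blocks(data, n_blocks=8)
# Each block is trained on its own, with no communication between blocks.
models = [train_centroids(b) for b in blocks]
label = ensemble_predict(models, [4.8, 5.1])
```

Because no block needs data from any other block during training, the list comprehension that builds `models` is the part that a distributed framework could run in parallel, one block per worker.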

CLC Number: