Distributed observation point classifier for big data with random sample partition

Xu LI1, Yulin HE1, Laizhong CUI1,2, Zhexue HUANG1,2, Fournier‑Viger PHILIPPE2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen Guangdong 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen Guangdong 518060, China
  • Received: 2023-06-29 Revised: 2023-08-21 Accepted: 2023-08-23 Online: 2023-09-11 Published: 2024-06-10
  • Contact: Yulin HE
  • About author: LI Xu, born in 1996, M. S., engineer. His research interests include distributed computation of big data, data mining, and machine learning.
    CUI Laizhong, born in 1984, Ph. D., professor. His research interests include Internet architecture, edge computing, and AI-driven network optimization.
    HUANG Zhexue, born in 1959, Ph. D., professor. His research interests include intelligent computing for new computing power networks, big data approximate computing, data mining, and machine learning.
    PHILIPPE Fournier‑Viger, born in 1980, Ph. D., professor. His research interests include data mining, pattern recognition, and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61972261); Natural Science Foundation of Guangdong Province (2023A1515011667); Basic Research Foundation of Shenzhen (JCYJ20220818100205012)



The Observation Point Classifier (OPC) is a supervised learning model that aims to transform a multi-dimensional linearly inseparable problem in the original data space into a one-dimensional linearly separable problem in the projective distance space, and it performs well on high-dimensional data classification. To alleviate the high training complexity of applying OPC to big data classification, a Random Sample Partition (RSP)-based Distributed OPC (DOPC) was designed under the Spark framework. First, RSP data blocks were generated and transformed into Resilient Distributed Datasets (RDDs) in the distributed computing environment. Second, a set of OPCs was trained collaboratively on the RSP data blocks, exploiting Spark's parallelism. Finally, the OPCs were fused into a DOPC to predict the final label of an unknown sample. Experiments on eight big datasets were conducted to validate the feasibility, rationality, and effectiveness of the designed DOPC. The experimental results show that DOPC trained on multiple computing nodes achieves higher testing accuracy with less time consumption than OPC trained on a single computing node; moreover, compared with the RSP model-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers under the Spark framework, DOPC obtains stronger generalization capability. These results demonstrate that DOPC is a highly effective, low-cost supervised learning algorithm for big data classification problems.
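
The three-step pipeline (partition, per-block training, fusion) can be illustrated with a minimal single-machine sketch in plain Python. Several parts are assumptions made only for illustration: the abstract does not give the OPC training procedure, so a nearest-centroid classifier stands in for the base learner; the fusion rule is assumed to be a majority vote; and the partition is a simple shuffle-and-deal rather than the Spark RDD implementation. All function names are hypothetical.

```python
import random
from collections import Counter


def random_sample_partition(data, num_blocks, seed=0):
    """Shuffle the records and deal them into disjoint blocks.

    Simplified stand-in for RSP: each block is a random sample of the data.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_centroid_model(block):
    """Stand-in base learner (NOT the paper's OPC): one mean vector per class."""
    sums = {}
    counts = Counter()
    for features, label in block:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def fuse_predict(models, features):
    """Assumed fusion rule: majority vote over the per-block models."""
    votes = Counter(predict_one(model, features) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy two-class data: class 0 near the origin, class 1 near (10, 10).
    data = [((float(i % 3), float(i % 2)), 0) for i in range(30)]
    data += [((10.0 + i % 3, 10.0 + i % 2), 1) for i in range(30)]
    blocks = random_sample_partition(data, num_blocks=3)
    models = [train_centroid_model(block) for block in blocks]
    print(fuse_predict(models, (0.5, 0.5)), fuse_predict(models, (10.5, 10.5)))
```

In a Spark setting, the per-block training step is the part that would run in parallel across partitions; this sketch runs it sequentially to stay self-contained.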

Key words: big data classification, distributed file system, Random Sample Partition (RSP), Observation Point Classifier (OPC), Spark computing framework
