Ensemble classification model for distributed drifted data streams

doi:10.11772/j.issn.1001-9081.2020081277

Abstract

To address the low classification accuracy in big data environments, an ensemble classification model for distributed data streams was proposed. Firstly, micro-clusters were used to reduce the amount of data transmitted from local nodes to the central nodes, thereby reducing the communication cost. Secondly, the training samples of the global classifier were generated by a sample reconstruction algorithm. Finally, an ensemble classification model for drifted data streams was proposed, which adopted a weighted combination of dynamic classifiers and steady classifiers and used a mixed labeling strategy to label the most representative instances for updating the ensemble model. Experiments on two synthetic datasets and two real datasets showed that the model fluctuated less under concept drift than the two distributed mining models DS-means and BDS-ensemble, and achieved higher accuracy than the Online Active Learning Ensemble model (OALEnsemble), with the accuracy on the four datasets improved by 1.58, 0.97, 0.77 and 1.91 percentage points respectively. Although the memory consumption of the model was slightly higher than that of the BDS-ensemble and DS-means models, it obtained a considerable improvement in classification performance at a small additional memory cost. Therefore, the model is suitable for classifying big data with distributed and streaming characteristics, such as in network monitoring and banking business systems.
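As an illustration of why micro-clusters can reduce the data sent from local nodes, the sketch below uses a common additive summary (count, per-dimension linear sum, per-dimension sum of squares). The abstract does not specify the micro-cluster layout used in the model, so the class `MicroCluster`, its fields and the `label` attribute are assumptions made for this example only.

```python
import math


class MicroCluster:
    """Additive summary of a group of points (hypothetical layout, not taken from the paper)."""

    def __init__(self, dim, label=None):
        self.n = 0                 # number of absorbed points
        self.ls = [0.0] * dim      # per-dimension linear sum
        self.ss = [0.0] * dim      # per-dimension sum of squares
        self.label = label         # optional class label (assumed field)

    def add(self, point):
        """Absorb one point; the raw point can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        """Combine two summaries, for example at a central node."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        """Mean of the absorbed points (requires n > 0)."""
        return [s / self.n for s in self.ls]

    def radius(self):
        """Square root of the summed per-dimension variance around the centroid."""
        var = sum(sq / self.n - (s / self.n) ** 2
                  for s, sq in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


# A local node summarizes its points and would transmit only (n, ls, ss, label).
local = MicroCluster(dim=2, label="normal")
for p in [(1.0, 2.0), (3.0, 4.0)]:
    local.add(p)
print(local.n, local.centroid(), local.radius())
```

Because the three statistics are additive, summaries from different local nodes can be merged without access to the original points, which is what makes this kind of structure attractive when communication cost matters.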

Key words: distributed, data stream, ensemble, classification, concept drift

CLC Number: TP391.1

YIN Chunyong, ZHANG Guojie. Ensemble classification model for distributed drifted data streams[J]. Journal of Computer Applications, 2021, 41(7): 1947-1955.
