Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (7): 1947-1955. DOI: 10.11772/j.issn.1001-9081.2020081277

Special Issue: Data science and technology

• Data science and technology •

Ensemble classification model for distributed drifted data streams

YIN Chunyong, ZHANG Guojie   

  1. School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing Jiangsu 210044, China
  • Received: 2020-08-21 Revised: 2020-11-27 Online: 2021-07-10 Published: 2020-12-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61772282).


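  • Corresponding author: YIN Chunyong
  • About the authors: YIN Chunyong (1977-), male, born in Weifang, Shandong, Ph.D., professor and doctoral supervisor; research interests: cyberspace security, big data mining and privacy protection, artificial intelligence and new types of computing. ZHANG Guojie (1997-), female, born in Taizhou, Jiangsu, master's student; research interests: machine learning, data mining, data stream classification.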

Abstract: To address the problem of low classification accuracy in big data environments, an ensemble classification model for distributed data streams was proposed. First, micro-clusters were used to reduce the amount of data transmitted from the local nodes to the central nodes, thereby lowering the communication cost. Second, a sample reconstruction algorithm was used to generate the training samples for the global classifier. Finally, an ensemble classification model for drifting data streams was proposed, which combined dynamic classifiers and steady classifiers through a weighted combination strategy and used a hybrid labeling strategy to label the most representative instances for updating the ensemble model. Experiments on two synthetic datasets and two real datasets showed that the model fluctuated less under concept drift than the two distributed mining models DS-means and BDS-ensemble, and achieved higher accuracy than the Online Active Learning Ensemble model (OALEnsemble), with the accuracy on the four datasets improved by 1.58, 0.97, 0.77 and 1.91 percentage points respectively. Although the memory consumption of the model was slightly higher than that of the DS-means and BDS-ensemble models, it obtained a considerable improvement in classification performance at a small memory cost. Therefore, the model is suitable for classifying big data that is distributed and streaming in nature, such as in network monitoring and banking systems.

Key words: distributed, data stream, ensemble, classification, concept drift
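
The abstract states that micro-clusters reduce the data sent from local nodes to the central nodes, but it does not give the summary format. As a minimal sketch only, the Python below assumes the additive summary commonly used in stream clustering (a count, a per-dimension linear sum and a per-dimension squared sum). The names `MicroCluster`, `summarize` and `max_distance` are hypothetical and do not come from the paper, and the nearest-centroid absorption rule is an illustrative assumption rather than the authors' algorithm.

```python
import math


class MicroCluster:
    """Additive summary of a group of points (assumed layout, not the paper's)."""

    def __init__(self, point):
        self.n = 1                                        # number of absorbed points
        self.linear_sum = [float(x) for x in point]       # per-dimension sum of x
        self.squared_sum = [float(x) ** 2 for x in point]  # per-dimension sum of x^2

    def absorb(self, point):
        """Fold one more point into the summary without storing the point."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root of the summed per-dimension variance around the centroid."""
        variance = sum(
            sq / self.n - (ls / self.n) ** 2
            for ls, sq in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(max(variance, 0.0))  # guard against tiny negative round-off


def summarize(points, max_distance):
    """Greedy one-pass summarization: a point joins the nearest micro-cluster
    if its centroid is within max_distance, otherwise it starts a new one."""
    clusters = []
    for p in points:
        nearest, best = None, float("inf")
        for mc in clusters:
            d = math.dist(p, mc.centroid())
            if d < best:
                nearest, best = mc, d
        if nearest is not None and best <= max_distance:
            nearest.absorb(p)
        else:
            clusters.append(MicroCluster(p))
    return clusters


# A local node would transmit only these summaries instead of the raw points.
local_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
summary = summarize(local_points, max_distance=1.0)
payload = [(mc.n, mc.linear_sum, mc.squared_sum) for mc in summary]
```

Because the three statistics are additive, a receiving node could merge summaries from several local nodes by summing them component-wise, which is one reason this layout is a common choice for distributed settings.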

