Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (08): 2176-2185. DOI: 10.3724/SP.J.1087.2012.02176

• Database Technology •

Concept drift detection method with limited amount of labeled data

LI Nan1,2, GUO Gong-de1,2, CHEN Li-fei1,2

  1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350007, China
  2. Key Laboratory of Network Security and Cryptography of Fujian Province University (Fujian Normal University), Fuzhou Fujian 350007, China
  • Received: 2012-01-16 Revised: 2012-03-08 Online: 2012-08-28 Published: 2012-08-01
  • Contact: GUO Gong-de
  • About the authors: LI Nan (born 1987), male, from Fuzhou, Fujian; master's student; research interests: information fusion, data stream mining;
    GUO Gong-de (born 1965), male, from Longyan, Fujian; professor, Ph.D.; research interests: data mining, machine learning;
    CHEN Li-fei (born 1970), male, from Changle, Fujian; associate professor, Ph.D.; research interests: data mining, pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61174175)

Abstract: Traditional algorithms for classifying data streams with concept drift typically use the true labels of test data to detect drift and adjust the classification model as needed. This is impractical in real-world applications, because labeling instances that arrive continuously at high speed requires substantial human and material resources. To address this problem, a concept drift detection method that needs only a limited amount of labeled data is proposed. The method exploits the way the fast KNNModel algorithm classifies instances with model clusters: without knowing the class labels of the incoming data, it judges that concept drift has occurred when the number of instances in the current data block that are not covered by any model cluster is significantly larger, at a given significance level, than in the preceding block. When drift is detected, domain experts are asked to label only the few instances not covered by any model cluster, and these instances are used to update the model. Experimental results show that the method classifies data streams quickly while adapting to concept drift, and achieves classification accuracy comparable to or higher than that of traditional data stream classification algorithms.

Key words: concept drift, data stream, classification, KNNModel, model cluster
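
The detection rule described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the statistical test, the distance measure, or the data structures, so everything below is an assumption made for illustration: the `ModelCluster` class, Euclidean distance for coverage, a one-sided two-proportion z-test for "significantly larger", and the default critical value 1.645 (one-sided 5% level). It is not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ModelCluster:
    """Hypothetical model cluster: a center, a coverage radius, a class label."""
    center: Sequence[float]
    radius: float
    label: int


def is_covered(x: Sequence[float], clusters: List[ModelCluster]) -> bool:
    # An instance is covered if it lies within the radius of any cluster
    # (Euclidean distance is an assumption of this sketch).
    return any(math.dist(x, c.center) <= c.radius for c in clusters)


def uncovered(block: List[Sequence[float]],
              clusters: List[ModelCluster]) -> List[Sequence[float]]:
    # Instances of a data block that no model cluster covers.
    return [x for x in block if not is_covered(x, clusters)]


def drift_detected(prev_block: List[Sequence[float]],
                   curr_block: List[Sequence[float]],
                   clusters: List[ModelCluster],
                   z_crit: float = 1.645) -> bool:
    """Flag drift when the uncovered fraction rises significantly.

    Uses a one-sided two-proportion z-test (an assumed choice of test);
    no class labels are needed at this stage.
    """
    n1, n2 = len(prev_block), len(curr_block)
    if n1 == 0 or n2 == 0:
        return False
    u1 = len(uncovered(prev_block, clusters))
    u2 = len(uncovered(curr_block, clusters))
    pooled = (u1 + u2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both blocks are entirely covered or entirely uncovered: no increase.
        return False
    z = (u2 / n2 - u1 / n1) / se
    return z > z_crit
```

In this sketch, when `drift_detected` returns `True`, the list returned by `uncovered(curr_block, clusters)` plays the role of the small set of instances that would be handed to a domain expert for labeling before the model is updated.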
