Concept drift detection method with limited amount of labeled data

doi:10.3724/SP.J.1087.2012.02176

Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (08): 2176-2185. DOI: 10.3724/SP.J.1087.2012.02176

LI Nan¹,², GUO Gong-de¹,², CHEN Li-fei¹,²

1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350007, China
2. Key Laboratory of Network Security and Cryptography of Fujian Province University (Fujian Normal University), Fuzhou, Fujian 350007, China

Received: 2012-01-16 Revised: 2012-03-08 Online: 2012-08-28 Published: 2012-08-01
Corresponding author: GUO Gong-de
About the authors: LI Nan (1987-), male, born in Fuzhou, Fujian, M.S. candidate; research interests: information fusion, data stream mining;
GUO Gong-de (1965-), male, born in Longyan, Fujian, professor, Ph.D.; research interests: data mining, machine learning;
CHEN Li-fei (1970-), male, born in Changle, Fujian, associate professor, Ph.D.; research interests: data mining, pattern recognition.
Supported by:
National Natural Science Foundation of China (61174175)

Abstract

Most existing classification algorithms for concept-drifting data streams rely on the true labels of test data to detect concept drift and to adjust the classification model as needed. This is impractical in real-world applications, because manually labeling instances that arrive continuously and at high speed requires substantial human and material resources. To address this problem, a concept drift detection method that needs only a small amount of labeled data is proposed. The method exploits the fact that the fast KNNModel algorithm classifies instances with model clusters. Without knowing the class labels of the data being classified, it decides whether concept drift has occurred by testing, at a given significance level, whether the number of instances in the current data block that are not covered by any model cluster has increased significantly compared with the previous data block. When concept drift is detected, domain experts are asked to label only the small number of instances that are not covered by any model cluster, and these instances are used to update the model. Experimental results show that the proposed method classifies data streams quickly while adapting to concept drift, and achieves classification accuracy close to or higher than that of traditional data stream classification algorithms.

Key words: concept drift, data stream, classification, KNNModel, model cluster
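
The detection idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract states neither how a model cluster is represented nor which statistical test is used, so the sketch makes two assumptions: a model cluster is a hypersphere (a center, a radius, and a class label) that covers every instance within its radius, and the "significant increase" is checked with a one-sided two-proportion z-test. The names `ModelCluster`, `uncovered`, and `drift_detected` are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class ModelCluster:
    """Assumed representation: a hypersphere around a representative point."""
    center: tuple
    radius: float
    label: str


def is_covered(x, clusters):
    """True if x lies inside at least one model cluster."""
    return any(math.dist(x, c.center) <= c.radius for c in clusters)


def uncovered(block, clusters):
    """Instances of a data block that no model cluster covers.

    No class labels are needed for this step.
    """
    return [x for x in block if not is_covered(x, clusters)]


def drift_detected(prev_uncovered, prev_size, cur_uncovered, cur_size, alpha=0.05):
    """One-sided two-proportion z-test (an assumed choice of test).

    Reports drift when the uncovered rate of the current block is
    significantly higher than that of the previous block at level alpha.
    """
    p_prev = prev_uncovered / prev_size
    p_cur = cur_uncovered / cur_size
    pooled = (prev_uncovered + cur_uncovered) / (prev_size + cur_size)
    se = math.sqrt(pooled * (1 - pooled) * (1 / prev_size + 1 / cur_size))
    if se == 0:
        # Both blocks are entirely covered or entirely uncovered: no increase.
        return False
    z = (p_cur - p_prev) / se
    return z > NormalDist().inv_cdf(1 - alpha)
```

In this sketch, a stream processor would call `uncovered` on each incoming block, pass the counts for the previous and current blocks to `drift_detected`, and, only when drift is reported, hand the uncovered instances of the current block to a domain expert for labeling before updating the model clusters.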

CLC Number: TP311.13

Cite this article: LI Nan, GUO Gong-de, CHEN Li-fei. Concept drift detection method with limited amount of labeled data [J]. Journal of Computer Applications, 2012, 32(08): 2176-2185.

