Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (11): 2949-2951.

• Database and Data Mining •

Distance-based algorithm for mining outliers in data streams

YANG Xianfei1, ZHANG Jianpei1, YANG Jing2, CHU Yan3

  1. Harbin Engineering University
    2. College of Computer Science and Technology, Harbin Engineering University
    3.
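  • Received: 2010-05-05 Revised: 2010-07-07 Online: 2010-11-05 Published: 2010-11-01
  • Corresponding author: YANG Xianfei
  • Supported by:
    Research on Key Technologies of Distributed Data Mining Based on Mobile Agent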

Algorithm for mining data stream outliers based on distance

  • Received: 2010-05-05 Revised: 2010-07-07 Online: 2010-11-05 Published: 2010-11-01
  • Contact: YANG Xianfei

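Abstract: Traditional outlier mining algorithms cannot effectively mine outliers in data streams. To address the unbounded input and dynamically changing nature of data streams, a new distance-based algorithm for mining data stream outliers is proposed. Using the Hoeffding theorem and the central limit theorem for independent, identically distributed variables, the algorithm dynamically detects changes in the probability distribution of the data stream and uses the detection results to adaptively adjust the sliding window size when mining outliers. Experimental results show that the algorithm can effectively mine data stream outliers on both a synthetic data set and the real KDD-CUP99 data set.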

Keywords: data stream, outlier, Hoeffding theorem, sliding window

Abstract: Traditional outlier mining algorithms cannot effectively mine outliers in data streams. To cope with the unbounded input and dynamic change of the data stream environment, a new distance-based algorithm for detecting data stream outliers is proposed. Changes in the probability distribution of the data stream are detected dynamically using the Hoeffding theorem and the central limit theorem for independent, identically distributed random variables. The detection results are then used to adaptively adjust the sliding window size for mining outliers in the data stream. Experimental results show that the algorithm can effectively mine data stream outliers on an artificial data set and on the real KDD-CUP99 data set.

Key words: data stream, outlier, Hoeffding theorem, sliding window
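
The abstract gives no pseudocode, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes one-dimensional values, the common distance-based definition of an outlier (fewer than `min_neighbors` window points within `radius`), and a simple change test that compares the means of the older and newer halves of the window against the sum of their one-sided Hoeffding bounds. The class name, method names, parameters, and the halving rule are all hypothetical, and the central limit theorem component mentioned in the abstract is not modeled.

```python
import math
from collections import deque


def hoeffding_bound(value_range, n, delta):
    """One-sided Hoeffding bound: with probability at least 1 - delta, the mean
    of n independent samples spanning a range of width value_range exceeds the
    true mean by no more than the returned epsilon."""
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


class AdaptiveWindowOutlierDetector:
    """Hypothetical sketch: distance-based outliers over an adaptive window."""

    def __init__(self, radius, min_neighbors, value_range,
                 delta=0.05, min_window=20, max_window=200):
        self.radius = radius                # neighborhood radius
        self.min_neighbors = min_neighbors  # neighbors needed to be an inlier
        self.value_range = value_range      # assumed width of the value range
        self.delta = delta                  # confidence parameter of the bound
        self.min_window = min_window
        self.max_window = max_window
        self.window = deque()

    def _is_outlier(self, x):
        # Count window points within `radius` of x.
        neighbors = sum(1 for y in self.window if abs(x - y) <= self.radius)
        return neighbors < self.min_neighbors

    def _shrink_if_changed(self):
        # Assumed change test: compare the means of the two window halves.
        n = len(self.window)
        if n < 2 * self.min_window:
            return
        items = list(self.window)
        old, new = items[: n // 2], items[n // 2:]
        gap = abs(sum(old) / len(old) - sum(new) / len(new))
        eps = (hoeffding_bound(self.value_range, len(old), self.delta)
               + hoeffding_bound(self.value_range, len(new), self.delta))
        if gap > eps:
            # The distribution appears to have changed: keep only recent data.
            self.window = deque(new)

    def update(self, x):
        """Process one stream element; return True if it is flagged."""
        flagged = (len(self.window) >= self.min_window
                   and self._is_outlier(x))
        self.window.append(x)
        if len(self.window) > self.max_window:
            self.window.popleft()
        self._shrink_if_changed()
        return flagged
```

In this sketch, calling `update(x)` once per arriving element returns whether `x` looks like an outlier relative to the current window. The window grows up to `max_window` while the stream looks stationary and is cut to its newer half when the two halves disagree by more than the bound allows.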