Journal of Computer Applications, 2018, Vol. 38, Issue (5): 1346-1352. DOI: 10.11772/j.issn.1001-9081.2017102511

• Data Science and Technology •

Outlier detection in time series data based on heteroscedastic Gaussian processes

YAN Hong1,2, YANG Bo2, YANG Hongyu2

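  1. College of Computer Science, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China;
  2. National Key Laboratory of Air Traffic Control Automation System Technology, Sichuan University, Chengdu 610064, China
  • Received: 2017-10-23 Revised: 2017-12-22 Online: 2018-05-10 Published: 2018-05-24
  • Corresponding author: YAN Hong
  • About the authors: YAN Hong (born 1984), male, from Panzhihua, Sichuan; lecturer, Ph.D. candidate, CCF member; research interests: machine learning, air traffic control automation. YANG Bo (born 1973), male, from Chengdu, Sichuan; associate professor, Ph.D.; research interests: air traffic control automation, machine learning. YANG Hongyu (born 1967), female, from Chengdu, Sichuan; professor, Ph.D.; research interests: air traffic control automation, image processing.
  • Supported by:
    National Air Traffic Control Research Project (GKG201403004).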

Outlier detection in time series data based on heteroscedastic Gaussian processes

YAN Hong1,2, YANG Bo2, YANG Hongyu2   

  1. College of Computer Science, Civil Aviation Flight University of China, Guanghan, Sichuan 618307, China;
  2. National Key Laboratory of Air Traffic Control Automation System Technology, Sichuan University, Chengdu, Sichuan 610064, China
  • Received:2017-10-23 Revised:2017-12-22 Online:2018-05-10 Published:2018-05-24
  • Contact: YAN Hong
  • Supported by:
    This work is partially supported by the National Air Traffic Control Research Project (GKG201403004).

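Abstract: During measurement, time series data are usually affected by factors such as the inherent variability of the observed system and external interference. To handle the case where the degree of this influence differs from one time point to another, a time series outlier detection method based on a Gaussian process prediction model is proposed. The monitored data are decomposed into two parts, a standard value and a deviation term. In addition to modeling the standard value under ideal conditions, a second Gaussian process is used to describe the heteroscedastic deviation term. Variational inference is used to solve the posterior inference problem that arises once the deviation term is introduced, and a tolerance interval set on the posterior distribution is used to identify outliers. The method is validated on the network traffic time series data published by Yahoo. The trend of the model's tolerance interval across time points matches the deviations observed in the labeled normal data, and in comparison experiments the outlier detection F1-score is higher than that of the autoregressive integrated moving average (ARIMA) model, the one-class support vector machine, and density-based spatial clustering of applications with noise (DBSCAN). The experimental results show that the model effectively describes the distribution of normal data at each time point, achieves a balanced trade-off between false alarm rate and recall, and avoids the performance problems caused by improperly set model parameters.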

Keywords: outlier detection, time series, Gaussian process, heteroscedasticity, variational inference
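
The decomposition described in the abstract can be written compactly as follows. This is only a sketch in assumed notation: the symbols $y_t$, $f$, $g$, $k_f$, $k_g$, $\mu_0$ and $z$ are not taken from the paper, and the form shown is the common heteroscedastic Gaussian process formulation in which the log noise variance is itself given a Gaussian process prior.

```latex
\begin{aligned}
y_t &= f(t) + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}\bigl(0,\ e^{g(t)}\bigr),\\
f &\sim \mathcal{GP}\bigl(0,\ k_f(t,t')\bigr), & g &\sim \mathcal{GP}\bigl(\mu_0,\ k_g(t,t')\bigr),\\
\text{flag } y_t &\iff \bigl|y_t - \mu_*(t)\bigr| > z\,\sigma_*(t).
\end{aligned}
```

Here $f$ plays the role of the standard value and $\varepsilon_t$ the deviation term, while $\mu_*(t)$ and $\sigma_*(t)$ denote the mean and standard deviation of the approximate posterior predictive distribution at time $t$. Because $g$ enters through $e^{g(t)}$, the posterior is no longer available in closed form, which is why an approximation such as variational inference is needed.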

Abstract: Time series data are inevitably subject to disturbances such as inherent uncertainty and external interference. To detect outliers in time series data whose disturbances vary over time, an approach based on a Gaussian process prediction model was proposed. The monitored data were decomposed into two components: a standard value and a deviation term. Besides modeling the ideal, deviation-free standard value, Gaussian processes were also employed to model the heteroscedastic deviations. The posterior distribution of the predicted data, which becomes analytically intractable once the deviation term is introduced, was approximated by variational inference. A tolerance interval selected from the posterior distribution was used for outlier detection. Verification experiments were conducted on the public network traffic time series datasets from Yahoo. The calculated tolerance interval coincided with the actual range of reasonable deviation in the labeled normal data at various time points. In the comparison experiments, the proposed model outperformed the autoregressive integrated moving average (ARIMA) model, the one-class support vector machine and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in terms of F1-score. The experimental results show that the proposed model can effectively describe the distribution of normal data at various time points, achieve a tradeoff between false alarm rate and recall, and avoid the performance problems caused by improper parameter settings.

Key words: outlier detection, time series, Gaussian process, heteroscedasticity, variational inference
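
To make the idea of a time-varying tolerance interval concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the variational inference procedure of the paper: it swaps in a simpler two-stage approximation, in which one Gaussian process smooths the observations (the standard value) and a second one smooths the bias-corrected log squared residuals (the deviation term). All function names, kernel settings and the synthetic data are hypothetical and chosen only for illustration; the Yahoo dataset is not used.

```python
import numpy as np

# For eps ~ N(0, s^2): E[log eps^2] = log s^2 + LOG_CHI2_MEAN, Var[log eps^2] = LOG_CHI2_VAR.
LOG_CHI2_MEAN = -(np.euler_gamma + np.log(2.0))
LOG_CHI2_VAR = np.pi ** 2 / 2
BIG = 1e6  # noise variance assigned to flagged points so they barely influence the fits


def rbf(a, b, length_scale, variance):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)


def gp_smooth(x, y, noise_var, length_scale):
    """GP posterior mean at the training inputs, with a constant mean and per-point noise."""
    w = 1.0 / noise_var
    mean = np.sum(w * y) / np.sum(w)              # precision-weighted constant mean
    K = rbf(x, x, length_scale, np.var(y) + 1e-9)
    alpha = np.linalg.solve(K + np.diag(noise_var), y - mean)
    return mean + K @ alpha


def detect_outliers(x, y, z=3.0, n_iter=3, ls_mean=1.0, ls_noise=2.0):
    """Return (flags, lower, upper) for the tolerance interval f(x) +/- z * sigma(x)."""
    flags = np.zeros(x.shape, dtype=bool)
    noise = np.full(x.shape, np.var(y))           # homoscedastic starting guess
    for _ in range(n_iter):
        # Standard value: smooth the data, down-weighting currently flagged points.
        f = gp_smooth(x, y, np.where(flags, BIG, noise), ls_mean)
        # Deviation term: smooth the bias-corrected log squared residuals.
        log_r2 = np.log((y - f) ** 2 + 1e-8) - LOG_CHI2_MEAN
        g = gp_smooth(x, log_r2, np.where(flags, BIG, LOG_CHI2_VAR), ls_noise)
        noise = np.exp(g)                         # estimated noise variance per time point
        half_width = z * np.sqrt(noise)
        flags = np.abs(y - f) > half_width
    return flags, f - half_width, f + half_width


# Synthetic example (assumed data): noise level grows with time, five injected spikes.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
true_sd = 0.05 + 0.025 * x
y = np.sin(x) + true_sd * rng.standard_normal(x.size)
injected = np.array([30, 70, 110, 150, 190])
y[injected] += 3.0

flags, lower, upper = detect_outliers(x, y)
print("flagged indices:", np.flatnonzero(flags))
```

The interval returned here reflects only the estimated observation noise; a fuller treatment would also add the posterior variance of the standard value and would learn the kernel hyperparameters rather than fixing them.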

CLC number: