Outlier detection in time series data based on heteroscedastic Gaussian processes
YAN Hong1,2, YANG Bo2, YANG Hongyu2
1. College of Computer Science, Civil Aviation Flight University of China, Guanghan Sichuan 618307, China; 2. National Key Laboratory of Air Traffic Control Automation System Technology, Sichuan University, Chengdu Sichuan 610064, China
Abstract: Time series data inevitably contain disturbances, such as inherent uncertainties and external interference. To detect outliers in time series data with time-varying disturbances, an approach based on a Gaussian process prediction model was proposed. The monitoring data were decomposed into two components: the standard value and the deviation term. Gaussian processes served as the basis for modeling the ideal, deviation-free standard value and were also employed to model the heteroscedastic deviations. The posterior distribution of the predicted data, which becomes analytically intractable once the deviation term is introduced, was approximated by variational inference. A tolerance interval selected from the posterior distribution was then used for outlier detection. Verification experiments were conducted on Yahoo's public network traffic time series datasets. The calculated tolerance interval coincided with the actual range of reasonable deviation present in the labeled normal data at various time points. In the comparison experiments, the proposed model outperformed the autoregressive integrated moving average (ARIMA) model, the one-class support vector machine and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in terms of F1-score. The experimental results show that the proposed model can effectively describe the distribution of normal data at various time points, achieve a tradeoff between false alarm rate and recall, and avoid the performance problems caused by improper parameter settings.
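The tolerance-interval idea described in the abstract can be illustrated with a small sketch. It is a simplification and not the authors' implementation: the input-dependent noise variance is assumed to be known here, whereas the paper infers it with a second Gaussian process and variational inference. The function names, the RBF kernel, its hyperparameters and the multiplier `z = 3` are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def hetero_gp_interval(x_train, y_train, noise_var_train,
                       x_test, noise_var_test,
                       z=3.0, length_scale=1.0, signal_var=1.0):
    """GP regression with per-point noise variance (assumed known).

    Returns the lower and upper bounds of the interval
    mean +/- z * predictive standard deviation at each test input.
    """
    # Heteroscedastic noise enters as a non-constant diagonal.
    K = rbf_kernel(x_train, x_train, length_scale, signal_var)
    K = K + np.diag(noise_var_train)
    K_s = rbf_kernel(x_test, x_train, length_scale, signal_var)

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha                       # predicted standard value

    v = np.linalg.solve(L, K_s.T)
    latent_var = np.maximum(signal_var - np.sum(v ** 2, axis=0), 0.0)
    pred_var = latent_var + noise_var_test   # add time-varying deviation

    half_width = z * np.sqrt(pred_var)
    return mean - half_width, mean + half_width

def detect_outliers(y, lower, upper):
    """Flag observations that fall outside the tolerance interval."""
    return (y < lower) | (y > upper)

# Synthetic demo: a sine "standard value" with noise that grows over time.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
noise_std = 0.05 + 0.02 * x                  # hypothetical noise profile
noise_var = noise_std ** 2

y_train = np.sin(x) + rng.normal(0.0, noise_std)   # normal data only
y_test = np.sin(x) + rng.normal(0.0, noise_std)
y_test[100] += 3.0                                  # injected outlier

lower, upper = hetero_gp_interval(x, y_train, noise_var, x, noise_var)
flags = detect_outliers(y_test, lower, upper)
print("flagged indices:", np.flatnonzero(flags))
```

Because the predictive variance includes the input-dependent noise term, the interval widens where normal data fluctuate more, so the same absolute deviation may be tolerated late in the series but flagged early on.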
YAN Hong, YANG Bo, YANG Hongyu. Outlier detection in time series data based on heteroscedastic Gaussian processes[J]. Journal of Computer Applications, 2018, 38(5): 1346-1352.