Anomaly detection method for hydrologic sensor data based on SparkR

doi:10.11772/j.issn.1001-9081.2018081782

Journal of Computer Applications, 2019, Vol. 39, Issue (2): 436-440. DOI: 10.11772/j.issn.1001-9081.2018081782

LIU Zihao¹, LI Ling², YE Feng²

1. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang Jiangsu 212003, China;
2. College of Computer and Information, Hohai University, Nanjing Jiangsu 211100, China

Received: 2018-08-17 Revised: 2018-09-02 Online: 2019-02-15 Published: 2019-02-10
Supported by:
This work is partially supported by the Jiangsu Province Postdoctoral Research Funding Project (1701020C) and the Six Talent Peaks Project of Jiangsu Province (XYDXX-078).

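Corresponding author: YE Feng
About the authors: LIU Zihao (born 1995), male, from Nanjing, Jiangsu, M.S. candidate; main research interests: data mining and big data. LI Ling (born 1968), female, from Nanjing, Jiangsu, engineer, M.S.; main research interests: cloud computing and big data. YE Feng (born 1980), male, from Jinan, Shandong, lecturer, Ph.D., CCF member; main research interests: distributed computing and big data.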

Abstract

To detect outliers efficiently in massive hydrologic sensor data, an anomaly detection method for hydrologic time series based on SparkR was proposed. Firstly, after data cleaning, a sliding window combined with an AutoRegressive Integrated Moving Average (ARIMA) model was used to forecast the data on the SparkR platform. Then, a confidence interval was calculated for each prediction, and values falling outside the interval were judged to be anomalies. Finally, based on the detection results, the K-Means algorithm was used to cluster the original data, the state transition probabilities were calculated, and the quality of the detected anomalies was evaluated. With hydrologic sensor data collected from the Chu River as experimental data, experiments on running time and on outlier detection performance were carried out. The results show that for million-scale data, computing on two nodes takes longer than on one node, whereas for ten-million-scale data two nodes take less time than one, with a maximum reduction of 16.21%; after the quality evaluation, the sensitivity rises from 5.24% to 92.98%. These results indicate that, on SparkR, the proposed method, which exploits the characteristics of hydrologic data and combines prediction-based testing with cluster-based verification, improves computational efficiency over traditional methods when detecting anomalies in hydrologic time series of ten-million scale, and markedly improves sensitivity.

Key words: SparkR, AutoRegressive Integrated Moving Average (ARIMA) model, anomaly detection, hydrologic time series, K-Means
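
The prediction-interval step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' SparkR implementation: a least-squares AR(1) fit stands in for the full ARIMA model, the K-Means and state-transition evaluation is omitted, and the function names, the window size of 30, the 1.96 interval multiplier, and the synthetic series are all assumptions made for illustration. Sensitivity is computed here as the usual true-positive rate; the abstract does not spell out its exact definition.

```python
import math
import random


def fit_ar1(window):
    """Least-squares fit of x[t] = c + phi * x[t-1] (a stand-in for ARIMA).

    Returns (c, phi, sigma), where sigma is the residual standard deviation.
    """
    xs, ys = window[:-1], window[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    phi = sxy / sxx if sxx > 0 else 0.0
    c = mean_y - phi * mean_x
    residuals = [y - (c + phi * x) for x, y in zip(xs, ys)]
    # Two parameters were estimated, hence n - 2 degrees of freedom.
    sigma = math.sqrt(sum(r * r for r in residuals) / max(n - 2, 1))
    return c, phi, sigma


def detect_anomalies(series, window_size=30, z=1.96):
    """Flag indices whose value lies outside forecast +/- z * sigma."""
    flagged = []
    for t in range(window_size, len(series)):
        window = series[t - window_size:t]  # the sliding window before t
        c, phi, sigma = fit_ar1(window)
        forecast = c + phi * window[-1]     # one-step-ahead prediction
        if abs(series[t] - forecast) > z * sigma:
            flagged.append(t)
    return flagged


def sensitivity(flagged, true_anomalies):
    """True-positive rate: detected true anomalies / all true anomalies."""
    true_set = set(true_anomalies)
    if not true_set:
        return 0.0
    return len(true_set & set(flagged)) / len(true_set)


# Synthetic demo data (hypothetical, not the Chu River measurements):
# a smooth signal with small noise and two injected spikes.
random.seed(0)
series = [10 + math.sin(t / 10) + random.gauss(0, 0.05) for t in range(200)]
for i in (80, 150):
    series[i] += 5.0

flagged = detect_anomalies(series)
print("flagged indices:", flagged)
print("sensitivity:", sensitivity(flagged, [80, 150]))
```

In this sketch a flagged value stays in later windows and can inflate the residual spread for the following points, so false alarms next to a spike are possible. A second-stage check, such as the clustering-based evaluation the abstract mentions, is one way to filter them.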

CLC Number:

TP391

LIU Zihao, LI Ling, YE Feng. Anomaly detection method for hydrologic sensor data based on SparkR[J]. Journal of Computer Applications, 2019, 39(2): 436-440.
