Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (2): 436-440. DOI: 10.11772/j.issn.1001-9081.2018081782


Anomaly detection method for hydrologic sensor data based on SparkR

LIU Zihao1, LI Ling2, YE Feng2   

  1. School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang Jiangsu 212003, China;
    2. College of Computer and Information, Hohai University, Nanjing Jiangsu 211100, China
  • Received: 2018-08-17 Revised: 2018-09-02 Online: 2019-02-10 Published: 2019-02-15
  • Supported by:
    This work is partially supported by the Jiangsu Province Postdoctoral Research Funding Project (1701020C) and the Six Talent Peaks Project of Jiangsu Province (XYDXX-078).


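  • Corresponding author: YE Feng
  • About the authors: LIU Zihao (born 1995), male, from Nanjing, Jiangsu, M.S. candidate; research interests: data mining, big data. LI Ling (born 1968), female, from Nanjing, Jiangsu, engineer, M.S.; research interests: cloud computing, big data. YE Feng (born 1980), male, from Jinan, Shandong, lecturer, Ph.D., CCF member; research interests: distributed computing, big data.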

Abstract: To detect outliers efficiently in massive hydrologic sensor data, an anomaly detection method for hydrologic time series based on SparkR was proposed. Firstly, after the data were cleaned, a sliding window and an AutoRegressive Integrated Moving Average (ARIMA) model were used to forecast the data on the SparkR platform. Then, a confidence interval was calculated for the prediction results, and values falling outside the interval were judged to be anomalies. Finally, based on the detection results, the K-Means algorithm was used to cluster the original data, the state transition probabilities were calculated, and the quality of the detected anomalies was evaluated. Using hydrologic sensor data obtained from the Chu River, experiments were carried out on running time and on outlier detection performance. The results show that for data on the scale of millions of records, computation with two slave nodes takes longer than with one; for tens of millions of records, two slave nodes take less time than one, with a maximum reduction of 16.21%. After the evaluation step, the sensitivity increases from 5.24% to 92.98%. These results indicate that, on SparkR, the proposed method, which exploits the characteristics of hydrologic data and combines a forecast test with a cluster check, improves computational efficiency for hydrologic time series at the scale of tens of millions of records and significantly improves sensitivity compared with traditional methods.
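
The forecast-and-interval step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' SparkR implementation: a first-order autoregressive fit stands in for the full ARIMA model, and the function names, the window size of 20, and the factor z = 1.96 (roughly a 95% interval if residuals are close to normal) are all assumptions made for illustration.

```python
import statistics


def fit_ar1(window):
    """Least-squares fit of x[t] = intercept + phi * x[t-1] on one window.

    Simplified stand-in for ARIMA (assumption). Returns
    (intercept, phi, residual_std).
    """
    prev, curr = window[:-1], window[1:]
    mean_prev = statistics.fmean(prev)
    mean_curr = statistics.fmean(curr)
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        phi = 0.0  # constant window: fall back to predicting the mean
    else:
        cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
        phi = cov / var_prev
    intercept = mean_curr - phi * mean_prev
    residuals = [q - (intercept + phi * p) for p, q in zip(prev, curr)]
    return intercept, phi, statistics.pstdev(residuals)


def detect_anomalies(series, window_size=20, z=1.96):
    """Return indices whose value lies outside the forecast interval.

    Each point is forecast one step ahead from the preceding
    `window_size` values; the interval is forecast +/- z * residual_std.
    """
    anomalies = []
    for t in range(window_size, len(series)):
        window = series[t - window_size:t]
        intercept, phi, resid_std = fit_ar1(window)
        forecast = intercept + phi * window[-1]
        lower = forecast - z * resid_std
        upper = forecast + z * resid_std
        if not (lower <= series[t] <= upper):
            anomalies.append(t)
    return anomalies


if __name__ == "__main__":
    # Hypothetical water-level readings with one injected spike at index 45.
    readings = [10 + 0.1 * ((i * 7) % 5) for i in range(60)]
    readings[45] = 25.0
    print(detect_anomalies(readings))
```

A narrow interval flags many points, which is consistent with the abstract's use of a later clustering-based evaluation step to assess the detected anomalies; that step is not reproduced here.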

Key words: SparkR, AutoRegressive Integrated Moving Average (ARIMA) model, anomaly detection, hydrologic time series, K-Means

