Journal of Computer Applications (计算机应用) ›› 2018, Vol. 38 ›› Issue (1): 38-43. DOI: 10.11772/j.issn.1001-9081.2017071903

• Papers from the 2017 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2017) •

Real-time processing system for automatic weather station data on Spark Streaming architecture

ZHAO Wenfang (赵文芳)1, LIU Xulin (刘旭林)2

  1. Information Center, Beijing Meteorological Bureau, Beijing 100089, China;
  2. Observation Center, Beijing Meteorological Bureau, Beijing 100176, China
  • Received: 2017-08-02 Revised: 2017-08-10 Online: 2018-01-10 Published: 2018-01-22
  • Corresponding author: LIU Xulin
  • About the authors: ZHAO Wenfang (born 1980), female, from Ezhou, Hubei; senior engineer, M.S.; research interests: big data, cloud computing, machine learning, and meteorological big data processing. LIU Xulin (born 1963), male, from Wuhan, Hubei; research fellow, M.S.; research interests: high-performance computing, software architecture, data mining, and knowledge discovery.
  • Supported by:
    This work is partially supported by the Public Welfare Industry Research Funds of China Meteorological Bureau (201206031).

Abstract: Existing operational platforms for Automatic Weather Station (AWS) data suffer from delayed data processing, slow interactive response, and poor timeliness of statistics. To address these problems, a method based on Spark Streaming and HBase was proposed, combining a real-time computing framework with a distributed database system to process large-scale streaming data. Flume was used to collect AWS data; Spark Streaming processed the data as a stream and stored the results in HBase. Two algorithms were designed within the Spark framework: one for writing streaming AWS data into HBase, and one for computing real-time statistics of the extreme values of observed meteorological elements. On this basis, a fast and reliable application system for real-time acquisition, processing, and statistics of AWS data was implemented on the Cloudera platform. Comparative analysis and performance monitoring verified that the system offers low latency and high throughput, runs stably, and balances load well. The experimental results show that, when Spark Streaming is used for real-time operational processing of AWS data, parallel writes to HBase, HBase-based queries, and statistics on the various elements all achieve millisecond-level response. The system fully meets the application requirements for AWS data and effectively supports weather forecasting operations.

Key words: Automatic Weather Station (AWS), Spark Streaming, stream computing, meteorological data processing, Flume
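
The abstract mentions a real-time algorithm for element extremes but does not give its details here. As a rough illustration only, the sketch below shows one way running minima and maxima per station and element could be maintained across micro-batches, in plain Python rather than Spark. The record layout `(station_id, element, value)`, the station identifiers, and the function names `update_extremes` and `run_stream` are all hypothetical and not taken from the paper. In an actual Spark Streaming job, a stateful operation such as `updateStateByKey` could play the role of the fold shown here.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (station_id, element, value).
# The paper's actual message format is not given in this section.
Record = Tuple[str, str, float]
# Running state: (station_id, element) -> (minimum, maximum) seen so far.
State = Dict[Tuple[str, str], Tuple[float, float]]


def update_extremes(state: State, batch: Iterable[Record]) -> State:
    """Fold one micro-batch into the running per-station, per-element extremes."""
    new_state = dict(state)  # copy so the previous state is left untouched
    for station_id, element, value in batch:
        key = (station_id, element)
        if key in new_state:
            lo, hi = new_state[key]
            new_state[key] = (min(lo, value), max(hi, value))
        else:
            new_state[key] = (value, value)
    return new_state


def run_stream(batches: Iterable[List[Record]]) -> State:
    """Process micro-batches in arrival order and return the final state."""
    state: State = {}
    for batch in batches:
        state = update_extremes(state, batch)
    return state


if __name__ == "__main__":
    # Illustrative data only; station IDs and values are made up.
    demo_batches = [
        [("S001", "temperature", 21.5), ("S001", "wind_speed", 3.2)],
        [("S001", "temperature", 23.1), ("S002", "temperature", 19.8)],
        [("S001", "temperature", 20.4)],
    ]
    for key, (lo, hi) in sorted(run_stream(demo_batches).items()):
        print(key, "min:", lo, "max:", hi)
```

Keeping the state keyed by station and element means each new observation costs a single dictionary lookup and comparison, which is the property that makes this kind of statistic cheap to keep current as data arrive.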

CLC Number: