Journal of Computer Applications, 2021, Vol. 41, Issue (4): 1035-1041. DOI: 10.11772/j.issn.1001-9081.2020081589

Special topic: The 35th CCF National Conference on Computer Applications (CCF NCCA 2020)


Visibility forecast model based on LightGBM algorithm

YU Dongchang1,2, ZHAO Wenfang1,2, NIE Kai3, ZHANG Ge4

  1. Beijing Institute of Urban Meteorology, Beijing 100089, China;
  2. Beijing Meteorological Information Center, Beijing 100089, China;
  3. Beijing Meteorological Observation Center, Beijing 100176, China;
  4. XinTuZhiXing (Beijing) Technology Corporation Limited, Beijing 100022, China
  • Received: 2020-10-13 Revised: 2020-11-01 Online: 2020-12-29 Published: 2021-04-10
  • Corresponding author: ZHAO Wenfang
  • About the authors: YU Dongchang (余东昌), born 1978, male, from Gutian, Fujian, senior engineer. Research interests: parallel computing, big data analysis, artificial intelligence. ZHAO Wenfang (赵文芳), born 1980, female, from Ezhou, Hubei, research fellow, M.S. Research interests: meteorological data analysis and processing, machine learning, artificial intelligence. NIE Kai (聂凯), born 1983, male, from Yangquan, Shanxi, senior engineer. Research interests: intelligent meteorological observation, big data analysis. ZHANG Ge (张舸), born 1991, male, from Beijing, senior engineer, M.S. Research interests: remote sensing data analysis, software architecture.

Abstract: To improve the accuracy of visibility forecasts, especially low-visibility forecasts, a visibility forecast model based on the ensemble learning methods random forest and LightGBM was proposed. First, meteorological forecast data from a numerical modeling system were combined with surface meteorological observation data and PM2.5 concentration observation data, and the random forest algorithm was used to construct the feature vectors. Second, three missing-value processing methods were designed for gaps of different time spans, producing a sample set with good continuity for training and testing. Finally, a visibility forecast model based on LightGBM was established, and its parameters were optimized by grid search. The proposed model was compared with Support Vector Machine (SVM), Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) models. Experimental results show that the proposed model obtains the highest Threat Score (TS) at every visibility level; for low visibility (less than 2 km), the average correlation coefficient between the model's predictions and the observed visibility across observation stations is 0.75, and the average mean square error is 6.49. The LightGBM-based model can therefore effectively improve the accuracy of visibility forecasts.

Key words: visibility forecast, ensemble learning, random forest algorithm, LightGBM algorithm
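
The abstract reports three evaluation measures: Threat Score (TS) per visibility level, a correlation coefficient, and a mean square error. The sketch below shows one conventional way to compute them in plain Python. It assumes the standard meteorological definition TS = hits / (hits + misses + false alarms); the abstract does not state the exact formula or the class boundaries the authors used. The function names, the class bounds, and the sample values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def threat_score(observed, predicted, low, high):
    """TS for the visibility class [low, high), in the same units as the data.

    Assumed definition: hits / (hits + misses + false alarms).
    """
    hits = misses = false_alarms = 0
    for obs, pred in zip(observed, predicted):
        obs_in = low <= obs < high
        pred_in = low <= pred < high
        if obs_in and pred_in:
            hits += 1          # event observed and forecast
        elif obs_in:
            misses += 1        # event observed but not forecast
        elif pred_in:
            false_alarms += 1  # event forecast but not observed
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")


def mean_squared_error(observed, predicted):
    """Average of squared differences between observation and forecast."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observation and forecast."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)


# Made-up visibility values in km, not data from the paper.
obs = [0.8, 1.5, 3.0, 6.0, 1.2, 9.0]
pred = [1.0, 2.5, 1.8, 5.5, 1.1, 8.0]

ts_low = threat_score(obs, pred, 0.0, 2.0)  # class "< 2 km"
mse = mean_squared_error(obs, pred)
r = pearson_r(obs, pred)
print(ts_low, mse, r)
```

On the sample values, the "< 2 km" class has two hits, one miss and one false alarm, so its TS is 0.5. Station-level averages like those quoted in the abstract would come from applying these functions to each station separately and averaging the results.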
