
Air quality prediction approach based on integrating forecasting dataset

Minghe GAO1, Ying ZHANG1,*, Rongrong ZHANG1, Zihao HUANG1, Linyan HUANG1, Fanyu LI1, Xin ZHANG2, Yanhao WANG1

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  2. School of Computer Science and Technology, Changchun University of Science and Technology, Jilin 130022, China
  Received: 2019-07-18 Online: 2020-04-20 Published: 2020-04-16


To address the air quality prediction problem, a prediction approach based on predictive features was proposed and designed using the LightGBM model. The approach effectively predicts the PM2.5 concentration, the key indicator of air quality, in Beijing over the upcoming 24 h. In constructing the prediction solution, the characteristics of the training dataset were analyzed to guide data cleansing, and random forest was combined with linear interpolation to handle large amounts of missing data and noise interference. Predictive data features were integrated into the dataset, and corresponding statistical features were designed to improve prediction accuracy. A sliding window mechanism was used to mine high-dimensional temporal features and increase the number of data features. The performance and results of the proposed approach were analyzed in detail and compared with baseline models. The experimental results showed that the proposed LightGBM-based approach with integrated forecasting data achieved higher prediction accuracy than the other models.

Key words: predictive data fusion, high-dimensional statistical features, air quality prediction, machine learning


  • CLC number: TP18


Figure: Variation of PM2.5 mass concentration over 2 d





Parameter  Missing records  Missing ratio/%
PM10       83,263           26.7718
CO         42,813           13.7658
O3         20,421           6.5660
PM2.5      20,389           6.5557
NO2        18,651           5.9969
SO2        18,548           5.9638
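
The abstract describes combining random forest imputation with linear interpolation to fill missing records. As a minimal sketch of the interpolation half only, assuming hourly readings held in a Python list with `None` marking a gap, short interior gaps could be filled as follows. The function names `missing_ratio` and `interpolate_gaps` and the `max_gap` threshold are illustrative assumptions, not the authors' code; longer gaps are left untouched here so that a model-based method can handle them.

```python
def missing_ratio(values):
    """Return the percentage of entries that are missing (None)."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def interpolate_gaps(values, max_gap=3):
    """Linearly fill interior runs of None that are at most max_gap long.

    Leading or trailing gaps and longer runs stay as None; handing those
    to a model-based imputer is an assumption of this sketch.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this run
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first valid index after the run, or n
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue                   # no anchor on one side, or run too long
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            filled[start + k] = left + step * (k + 1)
    return filled


# Toy hourly PM2.5 readings with three gaps of different kinds
pm25 = [10.0, None, None, 40.0, None, None, None, None, 90.0, None]
print(missing_ratio(pm25))                 # 70.0
print(interpolate_gaps(pm25, max_gap=3))
# [10.0, 20.0, 30.0, 40.0, None, None, None, None, 90.0, None]
```

Only the two-hour gap is filled; the four-hour gap exceeds `max_gap` and the trailing gap has no right-hand anchor.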









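Feature                    Name                                                           Feature description
Time features              day_of_week                                                    Day of the week; 1: Monday, 2: Tuesday, …, 7: Sunday
                           day_of_hour                                                    Hour of the day, 00:00 to 23:00
                           day_of_month                                                   Day-of-month index; 0: the 1st, 1: the 2nd, …, 30: the 31st
                           isweekend                                                      Whether the day falls on a weekend
                           hour_to_predict                                                Time to be predicted, 0 to 24 h
                           CO_1, …, CO_144                                                CO features over the past 144 h
                           h_temperature_1, …, h_temperature_144                          Air temperature features over the past 144 h
Meteorological features    humity                                                         Humidity at the given time
                           weather                                                        Weather condition at the given time
                           temperature                                                    Temperature at the given time
                           wind_direction                                                 Wind direction at the given time
                           wind_speed                                                     Wind speed at the given time
                           pressure                                                       Air pressure at the given time
Air quality features       CO                                                             CO mass concentration at the given time
                           PM10                                                           PM10 mass concentration at the given time
                           NO2                                                            NO2 mass concentration at the given time
                           O3                                                             O3 mass concentration at the given time
                           SO2                                                            SO2 mass concentration at the given time
Weather forecast features  temperature_1, …, temperature_24                               Temperature over the next 24 h
                           humity_1, …, humity_24                                         Humidity over the next 24 h
                           pressure_1, …, pressure_24                                     Air pressure over the next 24 h
                           weather_1, …, weather_24                                       Weather condition over the next 24 h
                           wind_direction_1, …, wind_direction_24                         Wind direction over the next 24 h
                           wind_speed_1, …, wind_speed_24                                 Wind speed over the next 24 h
Statistical features       mean_pm25\PM10\O3_1, mean_pm25\PM10\O3_3, mean_pm25\PM10\O3_5  Mean PM2.5\PM10\O3 mass concentration over the previous 1, 3, and 5 d
                           max_pm25\PM10\O3_1, max_pm25\PM10\O3_3, max_pm25\PM10\O3_5     Maximum PM2.5\PM10\O3 mass concentration over the previous 1, 3, and 5 d
                           min_pm25\PM10\O3_1, min_pm25\PM10\O3_3, min_pm25\PM10\O3_5     Minimum PM2.5\PM10\O3 mass concentration over the previous 1, 3, and 5 d
                           pm25_13\O3_13\pm10_13                                          Ratio of the mean PM2.5\O3\PM10 mass concentration over the previous 1 d to that over the previous 3 d
                           pm25_35\O3_35\pm10_35                                          Ratio of the mean PM2.5\O3\PM10 mass concentration over the previous 3 d to that over the previous 5 d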
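
The historical and statistical features listed above can be illustrated with a short sliding-window sketch. It assumes an hourly series in a plain list and interprets "previous 1, 3, and 5 d" as trailing windows of 24, 72, and 120 hourly values; that interpretation, the helper names `lag_features` and `window_stats`, and the `pm25_lag` prefix are assumptions made for illustration rather than the authors' implementation.

```python
from statistics import mean


def lag_features(series, t, n_lags, prefix):
    """Map "<prefix>_<k>" to the value k hours before index t, for k = 1..n_lags."""
    if t < n_lags:
        raise ValueError("not enough history before index t")
    return {f"{prefix}_{k}": series[t - k] for k in range(1, n_lags + 1)}


def window_stats(series, t, days, name):
    """Mean, max and min of the 24 * days hourly values that precede index t."""
    size = 24 * days
    if t < size:
        raise ValueError("not enough history before index t")
    window = series[t - size:t]
    return {
        f"mean_{name}_{days}": mean(window),
        f"max_{name}_{days}": max(window),
        f"min_{name}_{days}": min(window),
    }


# Toy hourly PM2.5 series covering 5 days (120 h): 1.0, 2.0, ..., 120.0
pm25 = [float(v) for v in range(1, 121)]
t = len(pm25)  # build features for the hour right after the series ends

# Three lags keep the demo short; the feature table uses 144 hourly lags
feats = lag_features(pm25, t, n_lags=3, prefix="pm25_lag")
for d in (1, 3, 5):
    feats.update(window_stats(pm25, t, d, "pm25"))

# Ratio features: 1 d mean over 3 d mean, and 3 d mean over 5 d mean
feats["pm25_13"] = feats["mean_pm25_1"] / feats["mean_pm25_3"]
feats["pm25_35"] = feats["mean_pm25_3"] / feats["mean_pm25_5"]

print(feats["pm25_lag_1"], feats["mean_pm25_1"], feats["mean_pm25_3"], feats["mean_pm25_5"])
# 120.0 108.5 84.5 60.5
```

Each target hour yields one flat dictionary of features, which is the row shape a gradient-boosting model such as LightGBM expects.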









Item       PM2.5    PM10     NO2      CO       O3       SO2
Mean       58.8     88.1     45.8     1.0      55.7     8.98
Std. dev.  66.1     89.3     32.1     1.0      53.8     11.7
Min.       2.0      5.0      1.0      0.1      1.0      1.0
25%        16.0     37.0     20.0     0.4      29.1     2.0
50%        39.0     70.0     39.0     0.7      45.0     5.0
75%        77.0     113.0    66.0     1.2      79.0     11.0
Max.       1,004.0  3,000.0  300.0    15.0     504.0    307.0
Count      290,621  227,747  292,359  268,197  290,589  292,462



Item       Temperature/℃  Pressure/
Mean       38.2           1,026.8    37.1   35,487.5   9.8
Std. dev.  5,030.6        5,025.7    18.9   184,454.8  5.5
Min.       -21.3          940.0      5.0    0.0        0.1
25%        2.5            994.2      23.0   78.0       5.6
50%        13.8           1,005.6    33.0   48.0       8.5
75%        23.2           1,016.9    48.0   280.0      12.9
Max.       999,999.0      999,999.0  100.0  999,999.0  30.0
Count      158,047        158,047    158,047  157,813  157,813
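
The maxima of 999,999.0 in the table above inflate the mean and standard deviation far beyond the quartiles, which suggests a placeholder for invalid readings. Treating that value as a sentinel is an assumption of this sketch, not something the excerpt states. Under that assumption, a summary of the same shape (count, mean, standard deviation, quartiles, extremes) could be computed after masking the sentinel; the `summarize` helper and the toy readings are hypothetical.

```python
from statistics import mean, stdev, quantiles

SENTINEL = 999999.0  # assumed placeholder for an invalid reading


def summarize(values, sentinel=SENTINEL):
    """Describe a column after dropping None and the assumed sentinel value."""
    valid = [v for v in values if v is not None and v != sentinel]
    # "inclusive" interpolates linearly between the observed data points
    q1, q2, q3 = quantiles(valid, n=4, method="inclusive")
    return {
        "count": len(valid),
        "mean": mean(valid),
        "std": stdev(valid),   # sample standard deviation (n - 1 denominator)
        "min": min(valid),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(valid),
    }


# Toy temperature readings containing one sentinel and one missing value
temps = [-2.0, 999999.0, 3.0, 8.0, None, 13.0, 18.0]
stats = summarize(temps)
print(stats["count"], stats["mean"], stats["min"], stats["max"])
# 5 8.0 -2.0 18.0
print(stats["25%"], stats["50%"], stats["75%"])
# 3.0 8.0 13.0
```

With the sentinel masked, the mean falls back inside the range spanned by the quartiles.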







Method                            S       M        A
XGBoost                           0.4307  33.0948  27.0545
GBDT                              0.4329  33.3060  27.2638
Proposed method                   0.4229  32.8711  26.4360
DNN                               0.5406  42.5152  33.4365
LightGBM (without forecast data)  0.4298  33.8923  26.6825
