Journal of Shandong University (Engineering Science), 2020, Vol. 50, Issue (2): 91-99. doi: 10.6040/j.issn.1672-3961.0.2019.404

• Machine Learning and Data Mining •

Air quality prediction approach based on integrating forecasting dataset

Minghe GAO1, Ying ZHANG1,*, Rongrong ZHANG1, Zihao HUANG1, Linyan HUANG1, Fanyu LI1, Xin ZHANG2, Yanhao WANG1

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China
  2. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, Jilin 130022, China
  • Received: 2019-07-18 Online: 2020-04-20 Published: 2020-04-16
  • Contact: Ying ZHANG E-mail: 1010619625@qq.com; dearzppzpp@163.com
  • About the author: Minghe GAO (born 1995), female, from Changling, Jilin; master's student; main research interest: artificial intelligence. E-mail: 1010619625@qq.com
  • Supported by:
    Fundamental Research Funds for the Central Universities (2018MS024); National Natural Science Foundation of China (61305056); Jilin Province Science and Technology Development Plan Project (20190303133SF)

Abstract:

To address the air quality prediction problem, a prediction approach based on predictive (forecast) data features was proposed and designed using the LightGBM model. The approach effectively predicted the PM2.5 mass concentration, the key indicator of air quality, in urban Beijing over the upcoming 24 h. In constructing the prediction scheme, the characteristics of the training dataset were analyzed to guide data cleaning, and random forest was combined with linear interpolation to handle the large amount of missing data and the noise interference. Predictive data features were integrated into the dataset, and related statistical features were designed to improve prediction accuracy. A sliding window mechanism was used to mine high-dimensional time features and increase the number of data features. The performance and results of the prediction model were analyzed in detail and compared with baseline models. The experimental results showed that the proposed LightGBM-based approach with integrated forecasting data achieved higher prediction accuracy than the baseline models.

Key words: predictive data fusion, high-dimensional statistical features, air quality prediction, machine learning

CLC Number:

  • TP18

Fig. 1  PM2.5 mass concentration over 2 d

Fig. 2  PM2.5 mass concentration at 3 monitoring stations

Table 1  Missing data

Parameter  Missing records  Missing ratio/%
PM10       83,263           26.7718
CO         42,813           13.7658
O3         20,421           6.5660
PM2.5      20,389           6.5557
NO2        18,651           5.9969
SO2        18,548           5.9638

Fig. 3  Distribution of grid nodes and monitoring stations

Fig. 4  Data fusion process

Fig. 5  Principle of the sliding window
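
The sliding-window idea named in the Fig. 5 caption, which the feature table applies to lag features such as CO_1, …, CO_144, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function name `lag_features`, the toy readings, the window length of 3 (the feature table uses 144 h), and the reading of `CO_k` as "the value k hours before the current time step" are all assumptions.

```python
from typing import Dict, List, Sequence


def lag_features(series: Sequence[float], window: int, name: str) -> List[Dict[str, float]]:
    """Build one row of lag features for every time step with a full history.

    For time index t (t >= window) the row holds
    name_1 = series[t-1], name_2 = series[t-2], ..., name_window = series[t-window].
    """
    rows = []
    for t in range(window, len(series)):
        row = {f"{name}_{k}": series[t - k] for k in range(1, window + 1)}
        rows.append(row)
    return rows


# Toy hourly CO readings (made-up numbers, for illustration only).
co = [0.4, 0.5, 0.7, 0.6, 0.9]
rows = lag_features(co, window=3, name="CO")
for row in rows:
    print(row)
```

Each step of the window yields one training row, so a window of 144 h over several pollutants and weather variables quickly produces the large feature count the abstract refers to.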

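Table 2  Feature table

Category                   Name                                                           Description
Time features              day_of_week                                                    Day of the week (1: Monday, 2: Tuesday, …, 7: Sunday)
                           day_of_hour                                                    Hour of the day (00:00 to 23:00)
                           day_of_month                                                   Day of the month, zero-based (0: 1st, 1: 2nd, …, 30: 31st)
                           isweekend                                                      Whether the day falls on a weekend
                           hour_to_predict                                                Time to be predicted, 0 to 24 h
                           CO_1, …, CO_144                                                CO features for the previous 144 h
                           h_temperature_1, …, h_temperature_144                          Air temperature features for the previous 144 h
Meteorological features    humity                                                         Humidity at the given time
                           weather                                                        Weather condition at the given time
                           temperature                                                    Temperature at the given time
                           wind_direction                                                 Wind direction at the given time
                           wind_speed                                                     Wind speed at the given time
                           pressure                                                       Air pressure at the given time
Air quality features       CO                                                             CO mass concentration at the given time
                           PM10                                                           PM10 mass concentration at the given time
                           NO2                                                            NO2 mass concentration at the given time
                           O3                                                             O3 mass concentration at the given time
                           SO2                                                            SO2 mass concentration at the given time
Weather forecast features  temperature_1, …, temperature_24                               Temperature over the next 24 h
                           humity_1, …, humity_24                                         Humidity over the next 24 h
                           pressure_1, …, pressure_24                                     Air pressure over the next 24 h
                           weather_1, …, weather_24                                       Weather condition over the next 24 h
                           wind_direction_1, …, wind_direction24                          Wind direction over the next 24 h
                           wind_speed_1, …, wind_speed_24                                 Wind speed over the next 24 h
Statistical features       mean_pm25\PM10\O3_1, mean_pm25\PM10\O3_3, mean_pm25\PM10\O3_5  Mean PM2.5/PM10/O3 mass concentration over the previous 1, 3, and 5 d
                           max_pm25\PM10\O3_1, max_pm25\PM10\O3_3, max_pm25\PM10\O3_5     Maximum PM2.5/PM10/O3 mass concentration over the previous 1, 3, and 5 d
                           min_pm25\PM10\O3_1, min_pm25\PM10\O3_3, min_pm25\PM10\O3_5     Minimum PM2.5/PM10/O3 mass concentration over the previous 1, 3, and 5 d
                           pm25_13\O3_13\pm10_13                                          Ratio of the previous-1 d to the previous-3 d mean mass concentration of PM2.5/O3/PM10
                           pm25_35\O3_35\pm10_35                                          Ratio of the previous-3 d to the previous-5 d mean mass concentration of PM2.5/O3/PM10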
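
The statistical features in the last rows of the feature table (mean, maximum, and minimum over the previous 1, 3, and 5 d, plus the 1 d/3 d and 3 d/5 d mean ratios) might be computed along these lines. This is a hedged sketch, not the authors' code. It assumes hourly readings with 24 values per day; the function name `window_stats`, the generic key names, and the toy series are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence

HOURS_PER_DAY = 24  # assumption: the series is sampled once per hour


def window_stats(hourly: Sequence[float]) -> Dict[str, float]:
    """Mean, max, and min over the most recent 1, 3, and 5 days of readings,
    plus the ratios of the 1 d to 3 d mean and of the 3 d to 5 d mean.

    If fewer than 5 days are available, the longer windows simply use
    whatever history exists. A zero mean would make a ratio undefined,
    which a real pipeline would need to guard against.
    """
    feats: Dict[str, float] = {}
    for d in (1, 3, 5):
        recent = hourly[-d * HOURS_PER_DAY:]
        feats[f"mean_{d}"] = mean(recent)
        feats[f"max_{d}"] = max(recent)
        feats[f"min_{d}"] = min(recent)
    feats["ratio_13"] = feats["mean_1"] / feats["mean_3"]
    feats["ratio_35"] = feats["mean_3"] / feats["mean_5"]
    return feats


# Toy PM2.5 series: five days, each with a constant made-up value (oldest first).
hourly = [v for day in (10.0, 20.0, 30.0, 40.0, 50.0) for v in [day] * HOURS_PER_DAY]
feats = window_stats(hourly)
print(feats)
```

A ratio above 1 indicates that the most recent window is more polluted than the longer one, which gives a tree model a direct signal of a rising or falling trend.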

Fig. 6  Daily mean PM2.5 mass concentration by date

Fig. 7  PM2.5 mass concentration under different weather conditions

Fig. 8  Heat map of correlated features

Table 3  Statistics of the air quality dataset

Item   PM2.5    PM10     NO2      CO       O3       SO2
Mean   58.8     88.1     45.8     1.0      55.7     8.98
Std    66.1     89.3     32.1     1.0      53.8     11.7
Min    2.0      5.0      1.0      0.1      1.0      1.0
25%    16.0     37.0     20.0     0.4      29.1     2.0
50%    39.0     70.0     39.0     0.7      45.0     5.0
75%    77.0     113.0    66.0     1.2      79.0     11.0
Max    1,004.0  3,000.0  300.0    15.0     504.0    307.0
Count  290,621  227,747  292,359  268,197  290,589  292,462
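
As a quick sanity check, the missing-data table and the entry counts in the air quality statistics table can be cross-checked with a few lines of arithmetic. The sketch below assumes that the entry counts are non-missing records, so that missing plus present gives the total number of records per pollutant. That interpretation is an inference from the numbers, not something the page states.

```python
# Missing-record counts from the missing-data table (Table 1).
missing = {"PM10": 83263, "CO": 42813, "O3": 20421,
           "PM2.5": 20389, "NO2": 18651, "SO2": 18548}
# Entry counts from the air quality statistics table (Table 3).
entries = {"PM10": 227747, "CO": 268197, "O3": 290589,
           "PM2.5": 290621, "NO2": 292359, "SO2": 292462}
# Missing ratios in percent as reported in Table 1.
reported_pct = {"PM10": 26.7718, "CO": 13.7658, "O3": 6.5660,
                "PM2.5": 6.5557, "NO2": 5.9969, "SO2": 5.9638}

# Assumption: total records = missing + non-missing entries.
totals = {p: missing[p] + entries[p] for p in missing}
computed_pct = {p: 100 * missing[p] / totals[p] for p in missing}

for p in missing:
    print(f"{p:6s} total={totals[p]}  computed={computed_pct[p]:.4f}%  reported={reported_pct[p]:.4f}%")
```

Under that assumption every pollutant sums to the same total of 311,010 records, and the computed ratios agree with the reported ones to four decimal places.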

Table 4  Statistics of the meteorological dataset

Item   Temperature/℃  Pressure/hPa  Humidity/%  Wind direction/(°)  Wind speed/(m·s⁻¹)
Mean   38.2           1,026.8       37.1        35,487.5            9.8
Std    5,030.6        5,025.7       18.9        184,454.8           5.5
Min    -21.3          940.0         5.0         0.0                 0.1
25%    2.5            994.2         23.0        78.0                5.6
50%    13.8           1,005.6       33.0        48.0                8.5
75%    23.2           1,016.9       48.0        280.0               12.9
Max    999,999.0      999,999.0     100.0       999,999.0           30.0
Count  158,047        158,047       158,047     157,813             157,813
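
The maxima of 999 999.0 for temperature, pressure, and wind direction, together with the inflated means and standard deviations in those columns, suggest that this value is a placeholder for missing readings. That reading is an assumption, since the page does not say so. Under it, a cleaning step could mask the placeholder and fill short interior gaps by linear interpolation, one of the two techniques the abstract names. The sketch below covers only the interpolation part (the random forest imputation is omitted), and the helper names and toy series are hypothetical.

```python
from typing import List, Optional, Sequence

SENTINEL = 999999.0  # assumption: placeholder value marking a missing reading


def mask_sentinel(values: Sequence[float], sentinel: float = SENTINEL) -> List[Optional[float]]:
    """Replace sentinel readings with None so they are treated as missing."""
    return [None if v == sentinel else v for v in values]


def interpolate_linear(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation between the nearest
    known neighbors. Leading and trailing gaps are left as None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            frac = (i - left) / gap
            out[i] = out[left] + frac * (out[right] - out[left])
    return out


# Toy hourly temperatures with sentinel values (made-up numbers).
temps = [2.0, 999999.0, 999999.0, 5.0, 6.0, 999999.0]
cleaned = interpolate_linear(mask_sentinel(temps))
print(cleaned)
```

Leaving the trailing gap unfilled is a deliberate choice in this sketch: extrapolating beyond the last known reading would invent a trend, so such gaps are better handled by a model-based imputer.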

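Fig. 9  Distribution of meteorological stations and monitoring stations

Fig. 10  Scatter plot of actual versus predicted PM2.5 mass concentration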

Table 5  Comparison of the proposed method with baseline models

Method                                S       M        A
XGBoost                               0.4307  33.0948  27.0545
GBDT                                  0.4329  33.3060  27.2638
Proposed method                       0.4229  32.8711  26.4360
DNN                                   0.5406  42.5152  33.4365
LightGBM (without forecast features)  0.4298  33.8923  26.6825
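
To put the comparison in relative terms, the reduction achieved by the proposed method against each baseline can be computed directly from the table. The sketch assumes that S, M, and A are error measures for which lower is better. The page does not expand the abbreviations, but this reading is consistent with the abstract's claim of higher accuracy for the proposed method. The dictionary keys are illustrative labels.

```python
# Scores (S, M, A) copied from the comparison table.
results = {
    "XGBoost": (0.4307, 33.0948, 27.0545),
    "GBDT": (0.4329, 33.3060, 27.2638),
    "proposed": (0.4229, 32.8711, 26.4360),
    "DNN": (0.5406, 42.5152, 33.4365),
    "LightGBM (no forecast)": (0.4298, 33.8923, 26.6825),
}

proposed = results["proposed"]

# Percentage reduction of each score relative to the baseline value.
reduction = {
    name: tuple(100 * (base - prop) / base for base, prop in zip(scores, proposed))
    for name, scores in results.items()
    if name != "proposed"
}

for name, (s, m, a) in reduction.items():
    print(f"{name:24s} S: {s:5.2f}%  M: {m:5.2f}%  A: {a:5.2f}%")
```

The row for LightGBM without forecast features isolates the contribution of the weather forecast features, since the model is otherwise the same.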