Training data
508,134 daily PM\(_{2.5}\) measurements on 2,769 different days across 764 unique stations from 2017-01-01 to 2024-07-31.
Measured PM\(_{2.5}\) concentration (\(\mu g/m^3\)) quantiles are:
percentile | concentration |
---|---|
0% | 0.0 |
5% | 2.4 |
25% | 4.6 |
50% | 7.0 |
75% | 10.2 |
95% | 17.6 |
100% | 593.0 |
Random Forest
The generalized regression forest has 200 trees and was trained using a sample fraction of 0.5, a minimum node size of 5, and an \(m_{try}\) value of 20.
Variable Importance
The variable importance is calculated as an exponentially-weighted sum of how many times each feature was selected within the first 6 splits of each tree in the forest.
importance | predictor |
---|---|
0.707 | merra_oc |
0.104 | merra_bc |
0.069 | hpbl |
0.032 | plume_smoke |
0.024 | x |
0.020 | temperature_max |
0.014 | merra_so4 |
0.006 | wind_speed |
0.005 | temperature_min |
0.004 | y, doy, merra_dust |
0.003 | solar_radiation |
0.002 | elevation_sd, specific_humidity |
0.001 | elevation_median, wind_direction, merra_ss |
0.000 | year, precipitation |
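As a rough illustration of this kind of depth-weighted split counting, the sketch below pools the split variables at each depth, converts them to per-depth frequencies, and down-weights deeper splits. The exact weights used for the table above are not stated here; the decay `depth ** -decay_exponent`, the function name `depth_weighted_importance`, and the input layout are all assumptions made for the example.

```python
from collections import Counter


def depth_weighted_importance(split_vars_by_depth, n_features, decay_exponent=2.0):
    """Depth-weighted split frequency per feature (illustrative sketch).

    split_vars_by_depth[d] lists the feature index chosen at every split of
    depth d + 1, pooled over all trees. The decay form is an assumption.
    """
    # Shallow splits get larger weights: depth 1 -> 1, depth 2 -> 2 ** -decay, ...
    weights = [(d + 1) ** -decay_exponent for d in range(len(split_vars_by_depth))]
    importance = [0.0] * n_features
    for weight, split_vars in zip(weights, split_vars_by_depth):
        counts = Counter(split_vars)
        total = max(1, len(split_vars))  # avoid dividing by zero at empty depths
        for feature, count in counts.items():
            importance[feature] += weight * count / total
    # Normalize by the total weight so the scores sum to 1 when every depth has splits.
    weight_sum = sum(weights)
    return [value / weight_sum for value in importance]


# Toy forest: 4 splits at depth 1 and 8 splits at depth 2, over 3 features.
toy_splits = [[0, 0, 0, 1], [0, 1, 1, 2, 2, 2, 0, 0]]
scores = depth_weighted_importance(toy_splits, n_features=3)
```

With this normalization, a feature that dominates the earliest splits (as merra_oc does in the table) ends up with most of the total importance.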
LOLO Model Accuracy
Leave-one-location-out (LOLO) accuracy is calculated from the out-of-bag
predictions of the trained random forest, with resampling clustered by
location. Accuracy is characterized using the median absolute error (mae)
and Spearman's correlation coefficient (rho). Accuracy metrics are
calculated for each left-out location and then summarized as the median of
each statistic across all locations. This most closely captures the
performance in a real-world scenario where we are trying to predict air
pollution between 2017 and 2023 in a place where it was not measured.
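The per-location summary described above can be sketched in a few lines. This is a minimal illustration rather than the code used to produce the reported numbers: the record layout `(location_id, measured, oob_predicted)` and the function names are hypothetical, and Spearman's coefficient is computed by hand (Pearson correlation of average ranks) to keep the example dependency-free.

```python
from collections import defaultdict
from statistics import median


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def lolo_summary(records):
    """records: iterable of (location_id, measured, oob_predicted) tuples."""
    by_location = defaultdict(list)
    for location, measured, predicted in records:
        by_location[location].append((measured, predicted))
    maes, rhos = [], []
    for pairs in by_location.values():
        measured = [m for m, _ in pairs]
        predicted = [p for _, p in pairs]
        maes.append(median(abs(m - p) for m, p in pairs))  # median absolute error
        rhos.append(spearman_rho(measured, predicted))
    # Summarize with the median of each statistic across locations.
    return {"mae": median(maes), "rho": median(rhos)}
```

Taking the median across locations, rather than pooling all observations, keeps stations with long records from dominating the summary.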
Each left-out location, or AQS monitor, contains a variable number of
days with air pollution measurements. This depends on the frequency of
the daily measurements as well as when the monitoring station was
initiated or deprecated. Some station-time groupings have only a single
measurement, so any station or station-time grouping with 4 or fewer
observations is excluded. In the tables below, median_n represents
the median number of observations used in each station grouping to
calculate the overall median accuracy metrics. ci_coverage is the
percentage of the time that the 95% confidence interval of the predicted
concentration contained the measured concentration.
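A small sketch of the exclusion rule and the two extra table columns follows. It assumes a hypothetical input of `(measured, ci_lower, ci_upper)` rows per station grouping, and it assumes that coverage is pooled over all retained observations; the text does not say whether coverage is pooled or summarized per grouping, so treat that choice as illustrative.

```python
from statistics import median

MIN_OBSERVATIONS = 5  # groupings with 4 or fewer observations are dropped


def coverage_summary(groups):
    """groups: dict mapping a grouping id to a list of (measured, ci_lower, ci_upper)."""
    kept = {key: rows for key, rows in groups.items() if len(rows) >= MIN_OBSERVATIONS}
    if not kept:
        raise ValueError("no grouping has enough observations")
    # Count measurements that fall inside their predicted interval (pooled; an assumption).
    covered = sum(lower <= measured <= upper
                  for rows in kept.values()
                  for measured, lower, upper in rows)
    total = sum(len(rows) for rows in kept.values())
    return {
        "ci_coverage": 100.0 * covered / total,  # percent of intervals containing the measurement
        "median_n": median(len(rows) for rows in kept.values()),
    }
```

Under this reading, a well-calibrated 95% interval should give a ci_coverage close to 95%.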
Monthly
Stations with 4 or fewer total monthly observations are excluded.
mae | rho | ci_coverage | median_n |
---|---|---|---|
0.71 | 0.89 | 95% | 75 |