
Training data

508,134 daily PM\(_{2.5}\) measurements on 2,769 different days across 764 unique stations from 2017-01-01 to 2024-07-31.

Measured PM\(_{2.5}\) concentration (\(\mu g/ m^3\)) quantiles are:

percentile concentration
0% 0.0
5% 2.4
25% 4.6
50% 7.0
75% 10.2
95% 17.6
100% 593.0

Random Forest

The generalized regression forest has 200 trees and was trained using a sample fraction of 0.5, a minimum node size of 5, and an \(m_{try}\) value of 20.

Variable Importance

The variable importance is calculated as an exponentially-weighted sum of how many times each feature was selected within the first 6 splits of each tree in the forest.

importance predictor
0.707 merra_oc
0.104 merra_bc
0.069 hpbl
0.032 plume_smoke
0.024 x
0.020 temperature_max
0.014 merra_so4
0.006 wind_speed
0.005 temperature_min
0.004 y, doy, merra_dust
0.003 solar_radiation
0.002 elevation_sd, specific_humidity
0.001 elevation_median, wind_direction, merra_ss
0.000 year, precipitation
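
The importance calculation described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: it assumes that "the first 6 splits" means the first 6 depth levels of each tree, that the weight at each depth is depth^(-2), and that each depth's counts are normalized to fractions before weighting. The source states none of these details. The function name `variable_importance` and the `split_counts` layout are hypothetical.

```python
def variable_importance(split_counts, decay_exponent=2.0):
    """Weighted sum of per-depth split frequencies.

    split_counts[d][j] is the number of times feature j was chosen for a
    split at depth d + 1, summed over all trees (hypothetical layout).
    The depth ** -decay_exponent weighting is an assumption.
    """
    n_features = len(split_counts[0])
    weights = [(depth + 1) ** -decay_exponent for depth in range(len(split_counts))]
    total_weight = sum(weights)

    importance = [0.0] * n_features
    for weight, counts in zip(weights, split_counts):
        level_total = max(1, sum(counts))  # guard against depths with no splits
        for j, count in enumerate(counts):
            importance[j] += weight * count / level_total

    # Dividing by the total weight makes the importances sum to 1
    # whenever every depth has at least one split.
    return [value / total_weight for value in importance]
```

Under these assumptions, a feature that dominates the earliest splits (as merra_oc does in the table) receives most of the importance, because shallow depths carry the largest weights.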

LOLO Model Accuracy

Leave-one-location-out (LOLO) accuracy is calculated from the out-of-bag predictions of the trained random forest, with resampling clustered by location. Accuracy is characterized using the median absolute error (mae) and Spearman’s correlation coefficient (rho). Accuracy metrics are calculated for each left-out location and then summarized as the median of each statistic across all locations. This most closely captures performance in a real-world scenario where we are trying to predict air pollution between 2017 and 2024 in a place where it was not measured.

Each left-out location, or AQS monitor, contains a variable number of days with air pollution measurements. This depends on the frequency of the daily measurements as well as when the monitoring station was initiated or deprecated. Some station-time groupings have only a single measurement, so any station or station-time grouping with 4 or fewer observations is excluded. In the tables below, median_n represents the median number of observations used in each station grouping to calculate the overall median accuracy metrics. ci_coverage is the percentage of observations for which the 95% confidence interval of the predicted concentration contained the measured concentration.
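
The per-station summary described above might look roughly like the following sketch. It assumes the out-of-bag predictions and their 95% interval bounds are already available as `(observed, predicted, ci_lower, ci_upper)` tuples keyed by station. That layout and all function names are hypothetical. It also assumes ci_coverage is summarized with the median across stations in the same way as mae and rho, which the text does not state explicitly.

```python
from math import isnan
from statistics import median


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either series is constant
    return cov / (sx * sy)


def station_metrics(rows):
    """rows: list of (observed, predicted, ci_lower, ci_upper) for one station."""
    observed = [r[0] for r in rows]
    predicted = [r[1] for r in rows]
    mae = median(abs(o - p) for o, p in zip(observed, predicted))
    rho = spearman_rho(observed, predicted)
    coverage = sum(lo <= o <= hi for o, _, lo, hi in rows) / len(rows)
    return mae, rho, coverage


def lolo_summary(by_station, min_obs=5):
    """Median of per-station metrics, dropping stations with 4 or fewer rows."""
    kept = [rows for rows in by_station.values() if len(rows) >= min_obs]
    metrics = [station_metrics(rows) for rows in kept]
    return {
        "mae": median(m[0] for m in metrics),
        "rho": median(m[1] for m in metrics if not isnan(m[1])),
        "ci_coverage": median(m[2] for m in metrics),
        "median_n": median(len(rows) for rows in kept),
    }
```

Computing each metric within a station before taking the median across stations keeps densely sampled monitors from dominating the summary, which matches the goal of describing performance at a typical unmeasured location.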

Daily

mae rho ci_coverage median_n
1.2 0.85 69% 601

Actual PM2.5 Concentrations vs LOLO Daily Predictions

Daily Prediction Accuracies per Calendar Year

year mae rho ci_coverage median_n
2017 1.10 0.86 70% 117
2018 1.19 0.85 70% 117
2019 1.20 0.85 69% 116
2020 1.20 0.83 69% 115
2021 1.25 0.85 69% 116
2022 1.20 0.84 69% 117
2023 1.30 0.85 69% 116
2024 1.26 0.83 68% 41

Monthly

Stations with 4 or fewer total monthly observations are excluded.

mae rho ci_coverage median_n
0.71 0.89 95% 75

Annual

Stations with 4 or fewer total annual observations are excluded.

mae rho ci_coverage median_n
0.61 0.88 100% 8
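
The text does not say how daily values are combined into the monthly and annual summaries above. One plausible reading, shown here purely as an assumption, is that observed and predicted concentrations are averaged within each station-month or station-year before the accuracy metrics are computed. The function name `aggregate_means` and the row layout are hypothetical.

```python
from collections import defaultdict


def aggregate_means(daily_rows, period="month"):
    """Average observed and predicted values per station and period.

    daily_rows: iterable of (station_id, "YYYY-MM-DD", observed, predicted).
    period: "month" groups by "YYYY-MM"; "year" groups by "YYYY".
    Averaging within each period is an assumption, not the documented method.
    """
    key_length = {"month": 7, "year": 4}[period]
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for station, date, observed, predicted in daily_rows:
        entry = totals[(station, date[:key_length])]
        entry[0] += observed
        entry[1] += predicted
        entry[2] += 1
    return {
        key: (obs_sum / n, pred_sum / n)
        for key, (obs_sum, pred_sum, n) in totals.items()
    }
```

If aggregation does work this way, averaging would smooth out day-to-day prediction error, which is consistent with mae falling from 1.2 (daily) to 0.71 (monthly) and 0.61 (annual).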

Median LOLO Accuracy Per Spatial Aggregation Period

Station-specific estimates of cross-validated MAE and rho were spatially aggregated to level-5 S2 cells and summarized. The visualized result is a rough approximation of how model performance may vary across the country: