CV Model Performance • appc

Training data

521,015 daily PM\(_{2.5}\) measurements on 2,861 different days across 767 unique stations from 2017-01-01 to 2024-10-31.

Measured PM\(_{2.5}\) concentration (\(\mu g/m^3\)) quantiles are:

percentile	concentration
0%	0.0
5%	2.4
25%	4.6
50%	7.0
75%	10.2
95%	17.5
100%	593.0

Random Forest

The generalized regression forest has 200 trees and was trained using a sample fraction of 0.5, a minimum node size of 5, and an \(m_{try}\) value of 20.

Variable Importance

Variable importance is calculated as an exponentially weighted sum of the number of times each feature was selected within the first 6 splits of each tree in the forest.

importance	predictor
0.718	merra_oc
0.097	merra_bc
0.066	hpbl
0.038	plume_smoke
0.021	x
0.019	temperature_max
0.013	merra_so4
0.005	temperature_min
0.004	y, doy
0.003	solar_radiation, specific_humidity, merra_dust
0.002	elevation_sd, wind_speed
0.001	elevation_median, wind_direction
0.000	year, precipitation, merra_ss
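
As a rough illustration of the weighting described above, the sketch below computes a depth-weighted split frequency. The function name, the layout of `split_counts`, the weight form `(depth + 1) ** -decay_exponent`, the decay exponent of 2, and the example counts are all assumptions made for illustration; they are not taken from this page or from the fitted model.

```python
def variable_importance(split_counts, decay_exponent=2.0):
    """Depth-weighted split frequency per feature.

    split_counts[d][j] is the number of times feature j was chosen for a
    split at depth d (d = 0 is the root), summed over all trees.
    """
    n_features = len(split_counts[0])
    # Assumed weight form: splits near the root count more than deep splits.
    weights = [(d + 1) ** -decay_exponent for d in range(len(split_counts))]
    importance = [0.0] * n_features
    for weight, counts in zip(weights, split_counts):
        total = sum(counts)
        if total == 0:
            continue  # no splits recorded at this depth
        for j, count in enumerate(counts):
            # Convert counts to frequencies within each depth level.
            importance[j] += weight * count / total
    weight_sum = sum(weights)
    return [value / weight_sum for value in importance]


# Hypothetical counts for 3 features over the first 6 depth levels.
counts = [
    [180, 15, 5],
    [300, 70, 30],
    [500, 200, 100],
    [700, 500, 400],
    [900, 900, 800],
    [1000, 1100, 1200],
]
scores = variable_importance(counts)
```

Because the frequencies are normalized within each depth and the weights are rescaled to sum to one, the scores sum to 1 when every depth has at least one split, which matches the scale of the table above.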

LOLO Model Accuracy

Leave-one-location-out (LOLO) accuracy is calculated from the out-of-bag predictions of the trained random forest, with resampling clustered by location. Accuracy is characterized using the median absolute error (mae) and Spearman’s correlation coefficient (rho). Both metrics are calculated for each left-out location and then summarized as the median across all locations. This most closely captures performance in a real-world scenario where we are trying to predict air pollution between 2017 and 2024 in a place where it was not measured.
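
The two-stage summary described above (a metric per location, then the median across locations) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function names, the `by_location` layout, and the example values are hypothetical, and the out-of-bag predictions are taken as given rather than produced by a forest.

```python
from statistics import median


def median_absolute_error(measured, predicted):
    return median(abs(m - p) for m, p in zip(measured, predicted))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def summarize_lolo(by_location):
    """by_location maps a location id to (measured, out_of_bag_predicted)."""
    maes, rhos = [], []
    for measured, predicted in by_location.values():
        maes.append(median_absolute_error(measured, predicted))
        rhos.append(spearman_rho(measured, predicted))
    return {"mae": median(maes), "rho": median(rhos)}


# Hypothetical measurements and predictions for three locations.
example = {
    "site_a": ([5.0, 7.0, 9.0, 12.0, 6.0], [5.5, 6.0, 9.5, 10.0, 6.5]),
    "site_b": ([3.0, 4.0, 8.0, 10.0, 15.0], [4.0, 3.5, 7.0, 11.0, 13.0]),
    "site_c": ([6.0, 6.5, 7.0, 9.0, 20.0], [6.2, 6.1, 7.4, 8.0, 14.0]),
}
summary = summarize_lolo(example)
```

Taking the median across locations means that a station with many measurements does not outweigh a station with few, and that a single extreme day (such as the last value at `site_c`) has limited influence on the summary.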

Each left-out location, or AQS monitor, contains a variable number of days with air pollution measurements. This depends on the frequency of the daily measurements as well as when the monitoring station was initiated or deprecated. Some station-time groupings have only a single measurement, so any station or station-time grouping with 4 or fewer observations is excluded. In the tables below, median_n is the median number of observations per station grouping used to calculate the overall median accuracy metrics, and ci_coverage is the percentage of predictions whose 95% confidence interval contained the measured concentration.
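
The exclusion rule, median_n, and ci_coverage can be illustrated with a short sketch. The function names, the `(measured, lower, upper)` row layout, and the example values are assumptions for illustration; the interval bounds are taken as inputs, since how they are derived from the forest is not described here.

```python
from statistics import median


def ci_coverage(measured, lower, upper):
    """Share of measurements that fall inside their prediction interval."""
    inside = sum(lo <= m <= hi for m, lo, hi in zip(measured, lower, upper))
    return inside / len(measured)


def drop_sparse_groups(groups, min_obs=5):
    """Keep groups with at least min_obs rows (drops groups with 4 or fewer)."""
    return {key: rows for key, rows in groups.items() if len(rows) >= min_obs}


# Hypothetical rows of (measured, ci_lower, ci_upper) per station.
groups = {
    "site_a": [
        (5.0, 3.0, 7.0),
        (7.0, 6.5, 9.0),
        (9.0, 9.5, 12.0),
        (12.0, 8.0, 13.0),
        (6.0, 4.0, 6.0),
    ],
    "site_b": [(4.0, 2.0, 5.0)],  # a single measurement, so it is excluded
}
kept = drop_sparse_groups(groups)
coverage = {key: ci_coverage(*zip(*rows)) for key, rows in kept.items()}
median_n = median(len(rows) for rows in kept.values())
```

In this toy input only `site_a` survives the filter, and four of its five measurements fall inside their intervals.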

Daily

mae	rho	ci_coverage	median_n
1.2	0.85	69%	601.5

Figure: Actual PM\(_{2.5}\) Concentrations vs LOLO Daily Predictions

Daily Prediction Accuracies per Calendar Year

year	mae	rho	ci_coverage	median_n
2017	1.10	0.86	69%	117
2018	1.15	0.86	70%	117
2019	1.20	0.85	69%	116
2020	1.20	0.84	69%	115
2021	1.20	0.85	69%	116
2022	1.20	0.83	68%	117
2023	1.30	0.85	70%	116
2024	1.20	0.84	70%	62

Monthly

Stations with 4 or fewer total monthly observations are excluded.

mae	rho	ci_coverage	median_n
0.72	0.89	95%	75

Annual

Stations with 4 or fewer total annual observations are excluded.

mae	rho	ci_coverage	median_n
0.61	0.88	100%	8
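
How the daily values are rolled up to months and years is not spelled out here. The sketch below assumes the simplest option, averaging the daily measured and predicted concentrations per station within each period before scoring; the function name, the row layout, and the example rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(daily_rows, period="month"):
    """Average daily (measured, predicted) pairs per station and period.

    daily_rows holds (station, "YYYY-MM-DD", measured, predicted) tuples.
    """
    key_length = 7 if period == "month" else 4  # "YYYY-MM" or "YYYY"
    buckets = defaultdict(list)
    for station, date, measured, predicted in daily_rows:
        buckets[(station, date[:key_length])].append((measured, predicted))
    return {
        key: (mean(m for m, _ in pairs), mean(p for _, p in pairs))
        for key, pairs in buckets.items()
    }


# Hypothetical daily rows for one station.
rows = [
    ("s1", "2020-01-01", 4.0, 5.0),
    ("s1", "2020-01-02", 6.0, 6.0),
    ("s1", "2020-02-01", 10.0, 8.0),
]
monthly = aggregate(rows, period="month")
annual = aggregate(rows, period="year")
```

Under this assumption, averaging smooths out day-to-day errors, which would be consistent with the lower mae values reported for the monthly and annual summaries than for the daily one.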

Median LOLO Accuracy Per Spatial Aggregation Period

Station-specific estimates of cross-validated mae and rho were spatially aggregated to level-5 S2 cells and summarized. The visualized result is a rough approximation of how model performance may vary across the country: