Workshop on Predictability, dynamics and applications research using the TIGGE and S2S ensembles
Calibrating ensemble forecasts of quantitative precipitation: An empirical comparison
Speaker
Description
The past decades have witnessed radical change in the practice of weather forecasting, in that ensemble prediction systems have been implemented operationally. Despite their undisputed successes, ensemble forecasts remain subject to biases and dispersion errors, thereby calling for statistical postprocessing to generate calibrated probabilistic forecasts from ensemble output.
Quantitative precipitation poses particular challenges in this context, due to the mixed discrete-continuous character of the variable, which necessitates a point mass at zero (representing the probability of no precipitation) and a right-skewed predictive density for strictly positive accumulations. State-of-the-art postprocessing techniques, such as Bayesian model averaging (BMA) and ensemble model output statistics (EMOS), draw on parametric statistical models, typically using gamma or extreme value distributions. Isotonic distributional regression (IDR) is a recently developed, highly promising nonparametric alternative. IDR learns calibrated predictive distributions from training data, subject to natural monotonicity constraints, without invoking any parametric distributional assumptions.
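To make the monotonicity constraint concrete, the following is a minimal sketch of IDR in the simplest possible setting, assuming a single real-valued covariate (for example, the ensemble mean) rather than the full ensemble. For each threshold z, the exceedance indicator 1{y > z} is fitted by a non-decreasing function of the covariate using the pool-adjacent-violators algorithm, and the predictive CDF is one minus that fit. The function names `pava`, `idr_fit`, and `idr_predict`, the toy data, and the step-function lookup for new covariate values are illustrative assumptions, not the implementation used in the study.

```python
import bisect


def pava(values, weights):
    """Weighted least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the non-decreasing order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for v, _, n in blocks:
        fitted.extend([v] * n)
    return fitted


def idr_fit(x_train, y_train):
    """Fit conditional CDFs P(Y <= z | X = x) on a grid of observed thresholds."""
    xs = sorted(set(x_train))
    thresholds = sorted(set(y_train))
    groups = {x: [] for x in xs}
    for x, y in zip(x_train, y_train):
        groups[x].append(y)
    weights = [len(groups[x]) for x in xs]
    cdf = []  # cdf[k][j] = P(Y <= thresholds[k] | X = xs[j])
    for z in thresholds:
        # Exceedance frequency per unique covariate value, made monotone in x.
        freq = [sum(y > z for y in groups[x]) / len(groups[x]) for x in xs]
        exceed = pava(freq, weights)
        cdf.append([1.0 - p for p in exceed])
    return xs, thresholds, cdf


def idr_predict(model, x_new):
    """Look up the fitted CDF at the nearest training covariate at or below x_new."""
    xs, thresholds, cdf = model
    j = min(max(bisect.bisect_right(xs, x_new) - 1, 0), len(xs) - 1)
    return thresholds, [row[j] for row in cdf]


# Toy data (hypothetical): covariate values and observed accumulations in mm.
x_train = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.0, 0.0, 0.0, 1.5, 0.5, 3.0]
model = idr_fit(x_train, y_train)
thresholds, cdf_values = idr_predict(model, 2.5)
# The first CDF value is the point mass at zero, P(Y = 0 | x).
print(list(zip(thresholds, cdf_values)))
```

Because the fitted object is a step-function CDF on the observed thresholds, the point mass at zero arises directly from the data, with no parametric form for the positive accumulations.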
In this case study, we evaluate and compare raw ensemble, BMA, EMOS, and IDR-calibrated probabilistic quantitative precipitation forecasts based on the ECMWF ensemble at the airports in Brussels, Frankfurt, London (Heathrow), and Zurich from 2007 to 2017. Not surprisingly, IDR requires (much) more training data than BMA and EMOS, and we find rolling training periods of about 500 days to work best for IDR, whereas training periods of about 80 days are optimal for BMA and EMOS. In a direct comparison, the IDR-calibrated forecast outperforms its competitors in terms of calibration, reliability, and proper scoring rules, such as the continuous ranked probability score (CRPS) and the Brier score, at all stations considered.
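For readers unfamiliar with the two scoring rules named above, here is a minimal sketch of how they can be computed for a raw ensemble. The CRPS uses the standard representation for an empirical distribution, the mean absolute error of the members minus half the mean absolute difference between members; the Brier score is the mean squared difference between the forecast probability and the binary outcome. The function names and the numbers are illustrative assumptions and not data from the study.

```python
def crps_ensemble(members, obs):
    """CRPS of the empirical distribution of `members` for observation `obs`."""
    m = len(members)
    mean_abs_error = sum(abs(x - obs) for x in members) / m
    mean_abs_spread = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return mean_abs_error - 0.5 * mean_abs_spread


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def prob_of_precip(members, threshold=0.0):
    """Ensemble-based probability that accumulation exceeds `threshold`."""
    return sum(x > threshold for x in members) / len(members)


# Hypothetical four-member ensemble (mm) and a verifying observation of 1.0 mm.
ensemble = [0.0, 0.0, 1.0, 3.0]
print(crps_ensemble(ensemble, 1.0))           # 0.375
print(prob_of_precip(ensemble))               # 0.5
print(brier_score([0.5, 0.0], [1, 0]))        # 0.125
```

Both scores are negatively oriented, so smaller values indicate better forecasts, and for a single-member ensemble the CRPS reduces to the absolute error.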
We note possible adaptations to the TIGGE and S2S multi-model ensembles, for which we anticipate major gains in predictive performance.