19th Workshop on high performance computing in meteorology

Towards fault tolerance in high-performance computing for numerical weather and climate prediction

Speaker

Dr Tommaso Benacchio (Politecnico di Milano)

Description

Progress in the accuracy of numerical weather and climate prediction depends heavily on growth in available computing power. As the number of cores in top computing facilities pushes into the millions, the rising average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown.

This talk will discuss hardware-level, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems, analysing a selection of applicable existing strategies. Numerical examples will showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, including results with a new fault-tolerant version of the Generalized Conjugate Residual Krylov solver used in ECMWF's next-generation FVM dynamical core.

The potential impact of these strategies, and the resilience-performance trade-offs they imply, will be considered in relation to the current development of numerical weather prediction algorithms and systems towards the exascale.

Primary author

Dr Tommaso Benacchio (Politecnico di Milano)
