Every second year ECMWF hosts a workshop on the use of high performance computing in meteorology. High performance computer architectures are becoming ever more complex, which presents significant challenges for the ongoing development of operational meteorological applications. This workshop will look especially at I/O and scalability of weather and climate applications on HPC systems.
A set of gems is making the rounds: the DYAMOND simulations
aim to gain experience in high-resolution global
modelling. The initial aim is to reach 40 days of
global simulation at roughly 2.5 km grid size.
This type of simulation poses a number of challenges: usable
boundary conditions, output data size, output data handling,
sustainable computer resource usage, and a serious amount of
patience for everybody involved.
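To get a feel for the output data size challenge, the number of grid columns at this resolution can be estimated from the Earth's surface area. The sketch below assumes a mean Earth radius of 6371 km, equal-area cells of 2.5 km by 2.5 km, and uncompressed 4-byte values; the function names are illustrative and not part of any DYAMOND tooling.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def global_columns(grid_km):
    """Approximate number of horizontal cells needed to cover the sphere."""
    surface_km2 = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return surface_km2 / grid_km ** 2


def field_size_gb(grid_km, levels=1, bytes_per_value=4):
    """Size of one field in GB (10**9 bytes), assuming no compression."""
    return global_columns(grid_km) * levels * bytes_per_value / 1e9


if __name__ == "__main__":
    print(f"columns at 2.5 km: {global_columns(2.5):.3g}")
    print(f"one 2D field:      {field_size_gb(2.5):.2f} GB")
```

Under these assumptions the globe needs roughly 80 million columns, so a single uncompressed 2D field is about a third of a gigabyte before any vertical levels or time steps are counted.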
For setting up and initially debugging simulations of this kind,
tools have to be improved in their memory scalability. The range
of applications spans preprocessing tools, the respective
data sets, visualization tools allowing gridpoint-level
access to data, and new visualization algorithms.
We will present an overview of our solutions.
The German Weather Service DWD has recently modernized substantial parts of its numerical weather prediction (NWP) suite.
In this presentation we sketch the rationale behind these changes, which are centered around probabilistic forecasting, ensemble data assimilation and the nonhydrostatic unstructured grid point model ICON.
Moreover, we expand on DWD's NWP strategy for the mid-term future 2018-2022: Seamless integration of nowcasting and very short-range forecasts comes within reach, as well as the operational simulation of aerosols, trace gases, and pollen. The rapid evolution of the operational process chain will therefore continue, in order to answer the increasing demands on forecasts and to keep up with the technological developments.
Future projections of DWD's enlarged regional models and ensemble forecasts predict a ten-fold increase of data production until 2022, let alone the dramatic rise of the computational workload. What are the implications of this trend for data analysis, data management and computation?
Naturally, this includes the discussion of the consequences for DWD's upcoming HPC procurements. Apart from this aspect, data-intensive simulation in future NWP process chains will definitely increase the importance of meta-data. With the example of DWD's ICON and COSMO models, we present some ideas on ensuring data provenance, when data can no longer be easily moved and user communities become more diverse.
This work analyzes and improves the I/O process of IFS, one of the most important atmospheric models used around Europe. IFS can use two different output schemes: the MF I/O server and an inefficient sequential I/O scheme. The latter is the only scheme that can be used outside ECMWF. This is the case for EC-Earth, a global coupled climate model that uses IFS as its atmospheric component. In recent EC-Earth experiments, the I/O part of IFS represented about 30% of the total execution time. Furthermore, EC-Earth experiments have to run post-processing tasks that perform costly operations.
We therefore present an easy-to-use development that integrates an asynchronous parallel I/O server called XIOS into IFS. XIOS offers a series of features that are especially targeted at climate models: netCDF format data, online diagnostics and CMORized data. Thus, it will be possible to shorten the critical path of EC-Earth experiments by running the post-processing task concurrently with the EC-Earth execution.
Moreover, a profiling analysis is carried out to evaluate the computational performance of the IFS-XIOS integration, showing that the initial integration is not optimal and needs to be optimized to increase both computational performance and efficiency. Consequently, different HPC optimization techniques, such as computation and communication overlap, are applied to the integration to minimize the I/O overhead in the resulting IFS execution. This shows that any implementation needs to be analyzed from a computational point of view.
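The idea of overlapping computation with output can be illustrated with a minimal sketch. It is not the XIOS API or IFS code: it is a plain-Python stand-in in which a background thread plays the role of the I/O server, and the names `run_model`, `io_server` and `write_field` are hypothetical. The model loop hands each field to a queue and continues with the next time step instead of waiting for the write to finish.

```python
import queue
import threading


def run_model(n_steps, write_field):
    """Advance a toy model while a background thread handles the output."""
    outbox = queue.Queue()

    def io_server():
        # Drain the queue until the sentinel None arrives.
        while True:
            item = outbox.get()
            if item is None:
                break
            write_field(*item)

    writer = threading.Thread(target=io_server)
    writer.start()

    state = 0.0
    for step in range(n_steps):
        state += 1.0               # stand-in for one model time step
        outbox.put((step, state))  # hand off the output and keep computing
    outbox.put(None)               # signal that no more output follows
    writer.join()                  # wait for outstanding writes to finish
    return state


if __name__ == "__main__":
    written = []
    run_model(5, lambda step, value: written.append((step, value)))
    print(written)
```

In this sketch the only time the model waits for I/O is the final `join`, so the write cost is hidden behind the computation as long as the writer keeps up with the producer.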
The results show that using XIOS to output data from IFS is clearly beneficial. The new parallel scheme significantly reduces the execution time compared with the original sequential scheme. In one of the tests, XIOS outputs 3.2 TB of data with only two and a half minutes of overhead. Furthermore, when the cost of converting GRIB to netCDF files is taken into account, the overall execution with XIOS is three times faster.
This presentation will provide an overview of existing NCAR systems and the strategy for its next generation of computing and storage. Topics will include application preparations, early technology deployment, procurement and benchmarking strategies. The role of GPUs and cloud solutions in this overall strategy will also be discussed.
The Met Office is currently running 3 large Cray XC40 supercomputers based on Intel Xeon processors and hosts a small number of more experimental systems, including the GW4 Consortium’s Isambard machine which includes Cavium ThunderX2 ARM processors. This presentation will give an overview of these systems and will include some performance comparisons obtained on them. An overview of the current operational models run by the Met Office will be given, along with an explanation of how reliable runtimes for the high resolution deterministic configurations were obtained. Recent work on optimisation will be reviewed including experiments with a 32-bit precipitation scheme.
Various forms of Kalman filtering, especially different types of Ensemble Kalman filters and hybrid variational/ensemble Kalman filters, have become popular in geophysical data assimilation. Kalman filters work well but they have a characteristic defect in that they leave behind discontinuous model trajectories that do not correspond to physically consistent and continuous model states, and are also less accurate than optimal smooth trajectories. This is a problem in areas such as re-analysis of climate history. For that reason, a smoothing operation that is consistent with the corresponding Kalman filter algorithm is highly desirable.
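The difference between a filtered and a smoothed trajectory can be sketched on the simplest possible case. The example below is a textbook scalar Kalman filter followed by a Rauch-Tung-Striebel fixed-interval smoother for a random-walk state observed directly; it is an illustration under those assumptions, not the FIKS or VEnKS algorithms discussed in this abstract, and the noise variances `q` and `r` are arbitrary illustrative values.

```python
def kalman_filter(obs, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_k = x_{k-1} + w_k, y_k = x_k + v_k."""
    xp, pp, xf, pf = [], [], [], []
    x, p = x0, p0
    for y in obs:
        x_pred, p_pred = x, p + q         # forecast step
        gain = p_pred / (p_pred + r)      # Kalman gain
        x = x_pred + gain * (y - x_pred)  # analysis step
        p = (1.0 - gain) * p_pred
        xp.append(x_pred)
        pp.append(p_pred)
        xf.append(x)
        pf.append(p)
    return xp, pp, xf, pf


def rts_smoother(xp, pp, xf, pf):
    """Backward pass: revise each analysis using all later observations."""
    xs, ps = list(xf), list(pf)
    for k in range(len(xf) - 2, -1, -1):
        c = pf[k] / pp[k + 1]             # smoother gain (state transition = 1)
        xs[k] = xf[k] + c * (xs[k + 1] - xp[k + 1])
        ps[k] = pf[k] + c * c * (ps[k + 1] - pp[k + 1])
    return xs, ps


if __name__ == "__main__":
    observations = [1.0, 0.5, 1.5, 1.2, 0.8]
    xp, pp, xf, pf = kalman_filter(observations, q=0.1, r=0.5)
    xs, ps = rts_smoother(xp, pp, xf, pf)
    for k in range(len(observations)):
        print(k, round(xf[k], 3), round(xs[k], 3), round(pf[k], 3), round(ps[k], 3))
```

The smoothed variance is never larger than the filtered variance at the same time, which is the sense in which the smoothed trajectory is the more accurate of the two; at the final time both coincide because no later observations exist.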
We have compared the performance on parallel computers and resulting model trajectories and corresponding model errors of two advanced Kalman smoothers, the Fixed Interval Kalman Smoother (FIKS) and the Variational Ensemble Kalman Smoother (VEnKS). The latter is based on the Variational Ensemble Kalman Filter (VEnKF), a hybrid variational-ensemble Kalman filter. The comparison is done using the Lorenz95 model and a random walk based 2D advection model, so that the resulting similarities and differences in performance, forecast skill and the quality of the ensuing model trajectories can be clearly understood and discerned in subsequent analysis.
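For readers unfamiliar with the Lorenz95 test bed, the following is a minimal sketch of the model in its commonly used form, dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F with cyclic indices. The 40 variables, forcing F = 8, time step 0.05 and fourth-order Runge-Kutta scheme are conventional choices assumed here; the abstract does not state the configuration used in the comparison.

```python
def lorenz95_tendency(x, forcing=8.0):
    """Right-hand side of the Lorenz95 model with cyclic boundary conditions."""
    n = len(x)
    # Negative indices wrap around in Python, which gives the cyclic
    # neighbours x[i-1] and x[i-2]; the modulo handles x[i+1].
    return [
        (x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
        for i in range(n)
    ]


def rk4_step(x, dt, forcing=8.0):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = lorenz95_tendency(x, forcing)
    k2 = lorenz95_tendency(shifted(x, k1, dt / 2.0), forcing)
    k3 = lorenz95_tendency(shifted(x, k2, dt / 2.0), forcing)
    k4 = lorenz95_tendency(shifted(x, k3, dt), forcing)
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


if __name__ == "__main__":
    state = [8.0] * 40
    state[19] += 0.01          # small perturbation of the rest state
    for _ in range(100):
        state = rk4_step(state, dt=0.05)
    print(min(state), max(state))
```

The uniform state x_i = F is a fixed point of the model, so a small perturbation is needed to start the chaotic evolution that makes the model a useful data assimilation test case.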
In recent years the HPC hardware market has been dominated by the x86 architecture, with 93% of the Top500 systems currently powered by Intel CPUs. The resulting lack of competition has inevitably resulted in significantly rising component costs, and decreased rates of performance improvement and related innovations. Arm's entry into the HPC market is seen as one of the most likely developments to disrupt the current status quo. 2018 has seen the first generation of HPC-optimised CPUs hit the market, with for example Cavium's ThunderX2 32 core CPUs proving to be performance competitive with state-of-the-art x86 CPUs. Performance per dollar benefits are even more compelling. In this talk, we will investigate how Arm's HPC strategy might impact the meteorological sciences, and report on early results from Isambard, the world's first production, Arm-based supercomputer.
Through the Scalability Programme, ECMWF has invested significant resources in the enhancement of efficiency and scalability of the forecasting system - from observational data handling at the beginning of the processing chain, to data assimilation, running the forecast model and dealing with vast amounts of model output data. The recently made case for the next HPC upgrade has demonstrated again the need for substantial efficiency gains in order to meet the strategic target of running a global forecast ensemble at 5 km spatial resolution in 2025.
ECMWF's current research foci aim at two key sources of efficiency in the future: the separation of concerns and the fully integrated and distributed model output data handling.
The separation of concerns aims to create flexibility for algorithmic and numerical choices at the science-code level and highly specialised modules that exploit various processor technology options at the kernel level. In between, options for domain specific languages supported by generic libraries for performing operations on model fields optimizing memory access and parallelism are being developed. This effort is strongly supported by FET-HPC projects like ESCAPE, ESCAPE-2 and EuroEXA, together with many European partners.
The data handling strategy aims to overcome the file oriented and disk based approach and integrates data operations into the pre and post-processing chain through object stores much closer to where data is needed or created. This offers options to exploit new memory layers with less latency and high access bandwidth during model runtime, and flexible work management with distributed processing for model output post-processing. This work is based on partnership projects like NextGenIO and MAESTRO.
The talk will focus on these areas but will also touch on the bigger picture of weather and climate prediction in Europe, that is currently proposed as the European Flagship project ExtremeEarth.
The presentation reviews recent developments in the European HPC strategy towards exascale, ranging from the development of low-power microprocessor technologies to Centres of Excellence in HPC applications. In order to meet the challenges ahead, the European Commission proposes to join forces with Member States and relevant stakeholders in HPC and big data and to establish the EuroHPC Joint Undertaking. Its mission is to develop, deploy, extend and maintain an integrated world-class supercomputing and data infrastructure and to develop and support a highly competitive and innovative High-Performance Computing ecosystem. The main characteristics of the new funding body, starting operations already in 2019, are introduced. The presentation will also touch on some aspects of the European Commission proposals for the next long-term EU budget for 2021-2027 and in particular the Digital Europe Programme, where HPC features prominently.
EuroEXA is a H2020 project working on a co-designed architecture for ExaScale. It aims to improve energy efficiency 4X and performance 4X through collaboration between facility, hardware, system architecture, software and applications.
Peter Hopton of EuroEXA will tell us more about this exciting project, its vision and the results so far.
The biggest challenge for state-of-the-art global numerical weather prediction (NWP) arises from the need to simulate complex, multi-scale physical phenomena with tight production schedules. The requirement for climate projections to produce at least 1 simulated year per day (SYPD) is very similar to existing medium-range NWP requirements. With increasing resolution and Earth system model complexity describing the land-surface, ocean, wave, and sea-ice interactions with necessary detail, there is growing concern to adequately align the development of extreme-scale application software for weather and climate services with the roadmap for extreme-scale computing capability and associated changes in the hardware architectures. Moreover, there is a need to better understand and robustly handle the uncertainty of the underlying simulated processes, potentially with hardware that itself is less reliable than we know today, or even to explicitly trade uncertainty for energy-efficiency of the simulations.
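The similarity between the SYPD target and NWP time constraints can be made concrete with a short calculation. The sketch below converts a throughput in SYPD into the wall-clock time needed for a forecast of a given length; the 365-day year and the 10-day forecast length are assumptions chosen for illustration and are not taken from the abstract.

```python
def wallclock_minutes(simulated_days, sypd, days_per_year=365.0):
    """Wall-clock minutes needed to simulate a period at a given SYPD rate."""
    simulated_days_per_wallclock_day = sypd * days_per_year
    wallclock_days = simulated_days / simulated_days_per_wallclock_day
    return wallclock_days * 24.0 * 60.0


if __name__ == "__main__":
    print(f"{wallclock_minutes(10, 1.0):.1f} minutes for 10 days at 1 SYPD")
```

At 1 SYPD a 10-day forecast would take a little under 40 minutes of wall-clock time, which illustrates why the climate throughput target and a tight forecast production schedule lead to comparable speed requirements.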
The European Centre for Medium-Range Weather Forecasts (ECMWF) is leading the H2020 programmes ESCAPE and ESCAPE-2 to foster innovation actions for developing a holistic understanding of energy-efficiency for extreme-scale applications using heterogeneous architectures, accelerators and special compute units by (a) defining and encapsulating the fundamental algorithmic building blocks ("Weather & Climate Dwarfs") underlying weather and climate services; (b) combining frontier research on algorithm development and hardware adaptation using DSLs for use in extreme-scale, high-performance computing applications; and (c) developing benchmarks and cross-disciplinary VVUQ for weather & climate applications. These efforts synthesize the complementary skills of global NWP with leading European regional forecasting and climate modelling consortia, university research, experienced high-performance computing centres and hardware vendors.
This talk will summarize ESCAPE outcomes while sketching the next steps towards accelerating state-of-the-art Earth-System modelling for the coming decade.
The talk will present the ESiWACE project which is one of the H2020 centers of excellence funded to support the application of high performance computing.
The central goal of ESiWACE is to enable very high resolution weather and climate simulations by leveraging synergies between several projects and initiatives in the community and by filling gaps with own developments.
We will present use cases, results from an international comparison of cloud permitting atmosphere models and discuss future strategies for model development within the European weather and climate community.
The quest towards the Exascale is having a profound impact on the design of future supercomputers. In particular, we are seeing increased heterogeneity, both in the compute and memory subsystems. However, this heterogeneity makes efficient programming a very challenging task. We need mature and efficient programming environments, and the EPiGRAM-HS project, a recently funded EC exascale project, aims to provide such a programming environment.
The overall objective of EPiGRAM-HS is the development of a new validated programming environment for large-scale heterogeneous computing systems, including accelerators, reconfigurable hardware and low-power microprocessors together with non-volatile and high-bandwidth memories. EPiGRAM-HS will extend the programmability of large-scale heterogeneous systems by introducing new concepts in MPI and GASPI and developing runtime systems for automatic data placement and code generation. The EPiGRAM-HS programming environment will be integrated into and validated against HPC scientific applications, among them the IFS mini-application. Most importantly, EPiGRAM-HS will provide a migration path for applications to use large-scale heterogeneous systems.
In this talk, we will discuss the motivation for the EPiGRAM-HS project, present its overall vision and describe our planned work for meeting the challenge of programming heterogeneous systems at exascale for maximum performance.
The NEXTGenIO Project
Professor Mark Parsons
EPCC, The University of Edinburgh
One of the major challenges for Exascale computing is tackling the I/O bottleneck. Today’s systems are capable of processing data quickly, but speeds are limited by how fast the system is able to read and write data. This represents a significant loss of time and energy in the system. Being able to widen, and ultimately eliminate, this bottleneck would greatly increase the performance and efficiency of HPC systems. It is imperative that the I/O bottleneck is tackled properly as we approach the installation of the first Exascale systems worldwide.
NEXTGenIO is a Horizon 2020 “FETHPC” project that is solving the I/O problem by bridging the gap between memory and storage. The project is using Intel's revolutionary new Optane DC Persistent Memory, which sits between conventional memory and disk storage. Fujitsu, one of NEXTGenIO’s lead partners, has designed the hardware to exploit this new memory technology. The goal is to build a system with 100x faster I/O than current HPC systems, a significant step towards Exascale computation. New hardware is only part of the challenge. NEXTGenIO is also focussing on the software required to make use of this new layer of memory.
The advances that Optane DC Persistent Memory and the NEXTGenIO software stack represent will be transformational across the computing sector. This talk will present the work of the project and show how the hardware and software developments are tackling the Exascale I/O bottleneck.
New memory and storage technologies create new opportunities for mitigating the growing gaps between memory and I/O performance on the one hand, and compute performance on the other hand. Efficient use of hierarchical or heterogeneous memory and storage architectures based on these new technologies can, however, be difficult and cumbersome. Programming models and run-time systems available today have limited capabilities to address this challenge. Typically focussed on improving the processing capabilities, they often lack data-awareness. Without being memory-aware, these software layers and tools are not able to manage data placement and transport. The newly started Maestro project, a research and development project funded by the European Commission, will address these shortcomings by designing a memory- and data-aware middleware framework for high performance computing as well as high performance data analytics applications. The project relies on a co-design approach for which numerical weather prediction and climate reanalysis have been identified as a key application area for developing, validating and demonstrating the Maestro concepts.
Title: Tackling the Simulation and Analysis Frontiers of Atmospheric and Earth System Science
Author: Dr. Richard Loft
Affiliation: National Center for Atmospheric Research, Boulder, CO USA
Atmospheric and Earth system models (ESMs) are complex multiscale, multi-physics software systems developed over decades through the collaboration of many institutions. With low floating point intensity, large code bases, and the need for high throughput rates, ESMs are problematic for current post-Dennard-scaling computer architectures. ESMs are also especially data intensive, making it difficult to extract scientific knowledge from a sea of model output. NCAR’s efforts to tackle these challenges will be reviewed, including an update on the on-going port of the MPAS atmospheric model to many-core architectures, the introduction of machine learning techniques to both modeling and post analysis, and our plans to develop flexible and parallel in-situ and cloud based analysis systems. Data-centric system design concepts that leverage emerging computational and data storage technologies to support a new approach to simulation and prediction will also be presented.
grow across a variety of fields, our team at the National Oceanic and Atmospheric Administration (NOAA) Earth System Research Laboratory (ESRL) is researching the application of these techniques for processing of satellite observation data. With the recent launch of new satellites like the Geostationary Operational Environmental Satellite (GOES)-16, the data volume has increased by orders of magnitude and can be difficult to process in a timely manner using traditional methods. Deep Learning has shown promising advancements to significantly improve both processing speed and scientific accuracy of results.
Through our research, we are using a Convolutional Neural Network (CNN) to identify regions of interest (ROI) from satellite observations. These areas include cyclones, both tropical and extratropical, cyclogenesis, and eventually convection initiation. Additionally, our team has started preliminary research into the use of CNNs with satellite data for other areas, including hurricane intensification, where a CNN model is trained to predict the likelihood that a tropical system will jump two categories in the next 24 hours, and the use of CNNs to generate a soil moisture product from satellite radiance observations to improve soil moisture within atmospheric prediction models.
This presentation will provide an introduction to our research efforts into the application of deep learning, the tradeoffs on computing and accuracy when designing neural networks, along with the challenges of data preparation including creation of “labeled” data, training the model, and where we see these applications heading into the future.
The Japan Meteorological Agency (JMA) signed a contract with Hitachi for the tenth-generation HPC system introduced in June 2018.
This system consists of two Cray XC50 supercomputers, each with Intel Xeon Skylake processors and three DDN Lustre file systems, and was measured in benchmark tests at about 10x the performance of the previous system.
We plan to update almost all operational models on the new, more powerful supercomputers, and these model updates will contribute to providing more accurate meteorological information for disaster risk mitigation.
This talk will describe the migration effort from the previous POWER system to the Xeon system, the performance of the new HPC system in preliminary experiments, and the plan for updating the atmospheric models.
The Goddard Earth Observing System (GEOS), developed by the Global Modeling and Assimilation Office (GMAO) at NASA Goddard, has been using a finite-volume cubed sphere (FV3) dynamical core since the development of FV3 back in 2007. The scalability of this unique dynamical core has allowed the GEOS modeling system to explore the boundaries of global atmospheric prediction since its inception. FV3 has supported a wide range of modeling requirements within NASA from MERRA-2 reanalysis runs at 50km global resolution all the way down to global cloud resolving applications of GEOS first at 3km in 2009 and then 1.5km in 2015 with a non-hydrostatic version of FV3.
Similar to most Earth system models, the future requirements for GEOS over the next 10 years include significant increases in resolution and complexity in order to build a fully coupled model. The computational and storage requirements for the resulting large-scale coupled simulations are driving NASA toward Exascale architectures. However, many application questions have emerged, including the scalability of GEOS on the emerging Exascale architectures. The influence of machine learning and deep learning on the current and near future of high performance computing is clear in systems like Summit at the Department of Energy. Can the current application approach in GEOS continue to be evolved to meet future research requirements on these new systems or is a different approach necessary? In addition to presenting recent GEOS results, this presentation will discuss how the introduction of machine learning models into applications can combine conventional computing techniques with machine learning to create a viable approach for future Exascale Earth system models.
The enhancement of numerical codes is given a lot of attention around Europe. Weather and climate models are improving the accuracy of their operational simulations through factors such as reduced parametrization or increased grid resolution. However, this accuracy improvement will need more computational resources, provided through a new generation of supercomputers and HPC techniques. To take advantage of this new generation, firstly, performance analysis can reveal in detail the computational behavior of the models, and the information obtained can be used to introduce optimizations. Secondly, new ways of optimization are presented, including dynamic load balancing and improvements to the typical paradigms used to take advantage of the MPI and OpenMP implementations of our models.
The optimizations will improve the energy efficiency of the models when thousands of resources are used for their parallel execution and reduce the final execution time of the operational models, which are launched every day to obtain predictions. In this study, we present results for the operational configuration used by ECMWF for weather predictions, using the latest stable version of IFS. We present our methodology and analysis results to better understand the computational behavior of IFS. Additionally, we present different options for improving the functionality and performance of some of the bottlenecks found.
Recent updates to Navy NWP systems for multiple earth system components have resulted in a significant increase in computational requirements for future operational systems. At the same time, ongoing efforts to modernize the operational control of these NWP runs are driven in part by a high-level desire to allow the Navy METOC Enterprise to leverage cloud computing and other distributed HPC opportunities.
We will describe recent updates to operational systems, including the upcoming transition of a high-resolution global coupled system in light of these computational challenges. These updates also include infrastructure changes aimed at promoting research to operations, increasing HPC utilization rates, and improving the reliability and maintainability of the operational run through the adoption and use of the Cylc workflow manager.
Finally, we will describe recent efforts to benchmark the performance of several cloud computing platforms to support the demand of production NWP HPC. We will show results comparing an on-premises standalone Cray system and several cloud platforms on the computational performance (both scalability and timing) using the NAVGEM global atmospheric model as the test code and discuss implications for future operational HPC deployments.
The United States’ National Weather Service (NWS) has been developing and running environmental models on supercomputers for well over 25 years. These computers are operated and maintained by the National Centers for Environmental Prediction (NCEP). The current supercomputing contract, the Weather and Climate Operational Supercomputing System (WCOSS), was awarded in 2011 and will run through 2021. The multi-year contract was structured with an initial system installation, followed by multiple enhancements throughout the life of the contract. In addition, NWS received additional funds during the life of the contract to use for additional supercomputing resources for the agency. The resulting system is currently a heterogeneous system composed of IBM, Cray and Dell components woven into one computing cluster. WCOSS has had three upgrades since that first installation - each upgrade being an addition to the existing system, rather than a rip-and-replace of the old system.
This approach has presented opportunities as well as challenges. Interruption to development of modeling applications has been minimized by the upgrade paradigm. It has been possible to accommodate supplemental funding without too much disruption. At the same time, the conglomerate of systems has meant utilizing common technologies that were not necessarily optimal for all components. Management of the workload has been necessary to determine and control which applications should run on which computing components.
We will outline the current operational compute capabilities of WCOSS. We will discuss the experiences working with a heterogeneous computing environment. We’ll discuss some lessons learned and some thoughts for the future.
GRAPES (Global/Regional Assimilation and Prediction System)-GLB is the operational global weather prediction model of the China Meteorological Administration (CMA). Sunway Taihu Light is a heterogeneous many-core supercomputer system made in China. The parallel processing of GRAPES-GLB is based on MPI and OpenMP, whereas Sunway Taihu Light supports MPI, OpenACC and A-thread but not OpenMP. To keep the code readable and portable, OpenACC was adopted as the programming model in most cases. This paper mainly presents the efforts to explore the performance of GRAPES-GLB on Sunway Taihu Light. Some hot spots of the dynamical core, including the Piecewise Rational Method for water vapor advection and the semi-Lagrangian advection computations, will be described in detail, and preliminary test results will be presented.
An ultra-scalable implicit solver is developed for GRAPES-GLB. In the solver, we propose a highly efficient hybrid domain-decomposed preconditioner that can greatly accelerate the convergence rate at the extreme scale. For solving the overlapped subdomain problems, a geometry-based pipelined incomplete LU factorization method is designed to further exploit the on-chip fine-grained concurrency. We perform systematic optimizations on different hardware levels to achieve the best utilization of the heterogeneous computing units and a substantial reduction of data movement cost. The implicit solver successfully scales to half of the Sunway Taihu Light supercomputer system, with over 6M heterogeneous cores, and enables fast atmospheric dynamic simulations at 10 km horizontal resolution.
Reduced precision can be used to better allocate computational resources in numerical models. Seemingly large rounding errors may be tolerated in numerical models because of the inherent uncertainty in subgrid-scale processes which are typically represented by stochastic schemes. There are now many examples in numerical weather prediction of models using single precision (instead of double) giving significant run time improvements with no change in forecast skill. However, studies with more idealised chaotic models have shown that precision could be reduced much further than single precision. This talk will give a summary of the ongoing work in Oxford looking at reduced precision in more complex models. This work is broken down into separable components of numerical weather prediction models such as data assimilation, individual physics parametrizations, and different scales in spectral space.
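A common way to experiment with precision below single is to emulate it in software by rounding the significand of ordinary double-precision values. The sketch below shows one such emulation; the function name `reduce_precision` is hypothetical, and the approach is an assumption for illustration rather than the tooling used in the work described here. It rounds only the significand and does not emulate a reduced exponent range.

```python
import math


def reduce_precision(value, bits):
    """Round a float to `bits` significant binary digits (exponent untouched)."""
    if value == 0.0 or not math.isfinite(value):
        return value
    mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent
    scale = 2.0 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


if __name__ == "__main__":
    third = 1.0 / 3.0
    for bits in (53, 24, 11, 8):
        approx = reduce_precision(third, bits)
        print(bits, approx, abs(approx - third) / third)
```

With 24 bits the significand matches that of IEEE single precision and with 11 bits that of half precision, so applying the function after each arithmetic operation in a model component gives a rough indication of how rounding error at that precision affects the result.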
Limited-area models are much more economical than uniform-resolution global models for high-resolution climate and weather modeling, but boundary errors become a problem after a few days of integration. We present results from variable-resolution approaches available to models using the GFDL Finite-Volume Cubed-Sphere Dynamical Core (FV3), including grid stretching and grid nesting. In the HiRAM climate and seasonal prediction model, a ~10 km region is able to greatly improve the representation of precipitation systems and especially intense tropical cyclones, without degrading (and in some cases improving) the large-scale atmospheric circulation. In the fvGFS weather prediction model, 3-km domains are able to explicitly represent convective features, including intense hurricanes and severe thunderstorms, again without degrading the hemispheric skill of the model. Promising results for medium-range prediction of severe weather are shown with this global-to-regional weather model. Prospects for further developments in the model infrastructure and in model science supporting new applications will be discussed.
NRL’s Marine Meteorology Division is developing NEPTUNE, a new global weather model to go into U.S. Navy operational forecasting early next decade. NEPTUNE is based on the Naval Postgraduate School NUMA spectral element dynamical core (Giraldo et al.) to be memory local, computationally dense and scalable to higher numerical order and larger problem sizes. Model development is proceeding with full-physics real-data testing and verification, data assimilation using the JEDI framework based on ECMWF’s OOPS, HPC performance characterization and optimization, and evaluation of programming models for performance portability. This talk will provide an introduction to the goals and status of the NEPTUNE project, emphasizing computational readiness challenges, with results of testing on different processors, including ARM, and early experiences with programming models for performance portability, including early testing of NEPTUNE kernels using the DOE’s Kokkos library.
The development of digital infrastructures and the advancement of weather and climate research and operations have gone hand in hand. Until recently this development went on steadily: scaling up to larger computers, simulating finer scales, assimilating more data and simulating more ensemble members. Now, the digital world around us is changing drastically. High performance computing is not driven by meteorology. Currently, machine learning drives the development of hardware. Moreover, the increasing footprint of our calculations needs new paradigms of scientific computing. A next generation of hardware architecture is needed and the community will need to redesign the workflow and software of our simulations and data handling. Science in general is also changing towards more data-driven science. Big data, both in volume and heterogeneity, allows for new, unexploited opportunities. Machine learning techniques are already used to learn from data and to optimize numerical modelling systems. The problem at hand, to simulate and skilfully predict smaller and smaller scales and to make use of new e-infrastructures and methodologies, requires a new way of thinking. Meteorology can learn from other areas of research where these new developments are embraced and tailor them to its own needs. In a future infrastructure these challenges will need to be integrated in a ‘cloud’ that allows for deployment of simulations on multiple HPCs, and that allows for access to huge distributed data sets and for performing analytics on them. And when the data become too big, real-time analytics will be needed on interactively simulated data. If we combine this with the user-friendliness and capabilities of current cloud providers, who are already active in the weather services arena, a (c)loud revolution takes place that puts meteorology and climate research at the forefront of sciences and of applications and services.
The increasing resolution of operational models leads to scalability concerns for most operational NWP centres in the world. Moreover, massively parallel systems with hundreds of thousands of cores, along with new computing architectures (such as GPUs), are strong incentives for improving our current fully-implicit semi-Lagrangian (FISL) methods. The fully-implicit and semi-Lagrangian schemes are both non-local algorithms; they are inherently communication intensive and therefore difficult to optimize. This is currently driving the search for alternative numerical techniques that would allow for a large time step, local computing properties, high arithmetic intensity and a minimal parallel communication footprint. In searching for an exponential time integrator to replace the traditional FISL schemes, our esteemed colleague Stéphane Gaudreault has developed a parallel implementation (MPI/OpenMP) of the exponential propagation iterative (EPI) method in the Global Environmental Multiscale (GEM) model with the Yin-Yang overset grid. Initial results with the shallow water equations and idealized test cases with the non-hydrostatic Euler equations look very promising. The next steps are to assemble a complete 3D model with results comparable to GEM-FISL and to port the resulting code to GPUs.
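As a rough illustration of why exponential integrators tolerate large time steps, the sketch below applies the simplest member of the family, the exponential (Rosenbrock-)Euler step u + dt * phi_1(dt * J) * f(u), to a stiff scalar test equation and compares it with forward Euler. This is not the EPI scheme implemented in GEM, which the abstract does not detail; the test equation, the stiffness value LAMBDA and all function names are assumptions made for illustration only.

```python
import math


def phi1(z):
    """phi_1(z) = (exp(z) - 1) / z, with the limit value 1 at z = 0."""
    return 1.0 if z == 0.0 else math.expm1(z) / z


def exponential_euler_step(u, dt, f, jac):
    """One exponential Euler step for a scalar ODE u' = f(u) with Jacobian jac(u)."""
    return u + dt * phi1(dt * jac(u)) * f(u)


def forward_euler_step(u, dt, f):
    """One explicit (forward) Euler step, for comparison."""
    return u + dt * f(u)


def integrate(step, u0, dt, n_steps):
    """Apply a one-argument step function n_steps times."""
    u = u0
    for _ in range(n_steps):
        u = step(u)
    return u


LAMBDA = 1000.0  # hypothetical stiffness parameter


def f(u):
    """Stiff relaxation towards the steady state u = 1."""
    return -LAMBDA * (u - 1.0)


def jac(u):
    """Derivative of f with respect to u."""
    return -LAMBDA


dt = 0.1  # far above the forward Euler stability limit 2 / LAMBDA
u_exp = integrate(lambda u: exponential_euler_step(u, dt, f, jac), 0.0, dt, 10)
u_fe = integrate(lambda u: forward_euler_step(u, dt, f), 0.0, dt, 10)
print("exponential Euler:", u_exp)
print("forward Euler:    ", u_fe)
```

With this step size the exponential step stays at the steady state while forward Euler diverges. In a full model the scalar phi_1 would become a matrix function acting on a vector, which is typically approximated iteratively rather than evaluated directly.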
An Update of CMA HPC System
Wei Min, Zhao Chunyan
(National Meteorological Information Center, Beijing, China)
CMA HPCS provides reliable HPC resources and high performance computing services for weather and climate applications, generating millions of meteorological data products daily. It currently comprises two systems, "PI-Sugon" and IBM Flex P460. The new CMA HPC system, "PI-Sugon", has a peak performance of over 8 PFLOPS and a total physical storage capacity of over 20 PB. Its architecture includes common CPU computing nodes, GPU nodes and Knights Landing (KNL) nodes. The average CPU utilization of "PI-Sugon" subsystem 1 exceeded 60% within a few months of going online. The IBM Flex P460 system consists of two identical national subsystems and another 7 regional subsystems, which together provide a total peak performance of 1.7 PFLOPS with 6.7 PB of physical storage capacity. The average CPU utilization of the IBM Flex P460 operational subsystem is 73%, while that of the research subsystem is 80%. The CMA HPCS provides resources for GRAPES (Global and Regional Assimilation Prediction System) and BCC_CSM (Beijing Climate Center Climate System Model). In addition, it provides research services for dozens of models.
The accurate resource management system and the monitoring system provide efficient management of HPCS resources, and the stable operational environment and job scheduling system support model running and R&D. The model version management system provides standardized code management and cooperation service for model R&D based on HPCS. The GRAPES integrated-setting experiment tool (GISET) is a GUI for experiment construction and running of model R&D.
In order to explore the applicability of meteorological numerical models on multicore platforms, CMA has been porting and optimizing GRAPES and BCC_AGCM (Beijing Climate Center Atmospheric General Circulation Model) on the "Sunway TaihuLight", GPU and KNL platforms.
Effective utilization of fine-grain chips with thousands of compute cores, and systems with millions of compute cores, will require adapting and rewriting applications to prepare them for exascale. For the last two years, significant efforts have been made to adapt the National Weather Service’s Finite-Volume Cubed-Sphere (FV3) model in order to run efficiently on GPUs without degrading performance on CPU hardware. The work to adapt the FV3 code has proven to be both challenging and disappointing, as code changes were more invasive and performance gains were elusive. Some details of this work will be presented.
Disappointment with FV3 led to the formation of a project to explore how to design and develop future assimilation and NWP models for exascale. Modeled after the ESCAPE project in Europe, the focus has been to develop dwarfs to evaluate key areas of the weather prediction system. Initial work is focused on four areas: transport, 4DVAR, I/O, and machine learning. Prototype development is from scratch and is not constrained by legacy codes, approaches, software or languages. This presentation will give an overview of the work, with additional details in follow-on presentations by Harrop (Software Design) and Flynt (I/O).
Modernizing Scientific Software Development
Christopher W. Harrop, Cooperative Institute for Research in Environmental Sciences
Mark W. Govett, NOAA Earth Systems Research Laboratory
The commercial software industry has known for many years that the heavy cost of poorly designed software and undisciplined software development practices can sink an entire enterprise. To combat this problem, a mountain of software development tools and best practices have been developed and honed by the industry over the years. While no one tool or development methodology is objectively known to be a perfect solution, industry standard solutions have emerged that are widely adhered to.
The scientific software community, by contrast, has been slow to recognize and respond to the impact of poor software designs and development practices. There are several reasons for this, including insular attitudes and lack of software engineering expertise among scientists as well as a lack of incentives to create quality software in a research environment. As scientific software has increased in complexity, the problems stemming from poor designs and development practices have compounded. We are adapting techniques and technologies from the commercial software industry to rectify this situation.
We will present a description of the software design and development processes that we have adapted and are adhering to in order to improve scientific software quality and reduce the cost of its maintenance. We will describe how the Git-Flow branching model, used in tandem with an Agile-like development approach, affords us a structured, disciplined means of ensuring that software integrity is maintained and that new developments are delivered in a timely fashion. We will describe a test-driven design approach, encompassing tests spanning fine and coarse granularities, scientific correctness, algorithmic correctness, and performance metrics, that ensures scientific code is tested before it is contributed and that it continues to be tested throughout the software lifecycle. Finally, we will show how we are adapting the use of design patterns and anti-patterns into our scientific software designs.
OMNI/O is a new tool being developed to accurately record, analyze and replay file input and output performed by parallel applications without modification to the applications being profiled. The tool consists of four components which together enable the collection and replay of I/O patterns and statistics for common forecast models and data assimilation systems. The “trace” component records events for each thread and processor by intercepting operating system calls and logging the information necessary to replay all events. The recorded events can be analyzed using the “statistics” component to gain insight into an application's I/O pattern, throughput and opportunities for optimization. The “preproc” component takes the recorded events and prepositions real or fake data files at user-configurable locations to allow for characterization of different file systems using the same pattern as the profiled application. The final component, “replay”, executes the desired I/O pattern without the original profiled application being present. It is envisioned that these representative workloads can then be used for procurement benchmarks, to motivate additional scenarios, and to understand limitations in current systems. This presentation will focus on the techniques used to develop the tool along with results from its application to common forecast models such as HRRR and FV3 coupled to the GSI data assimilation system.
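The preposition-then-replay idea can be sketched in a few lines. The example below is a minimal illustration, not OMNI/O itself: it does not intercept operating system calls, and the IOEvent record layout, the function names and the sample trace are all hypothetical.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class IOEvent:
    """One recorded I/O operation (hypothetical record layout)."""
    op: str      # "read" or "write"
    path: str    # file name relative to the replay root
    offset: int  # byte offset of the operation
    size: int    # number of bytes transferred


def preposition(events, root):
    """Create fake files large enough to satisfy every recorded read."""
    needed = {}
    for ev in events:
        if ev.op == "read":
            needed[ev.path] = max(needed.get(ev.path, 0), ev.offset + ev.size)
    for path, size in needed.items():
        with open(os.path.join(root, path), "wb") as fh:
            fh.write(b"\0" * size)


def replay(events, root):
    """Re-issue the recorded operations; return bytes moved per operation type."""
    moved = {"read": 0, "write": 0}
    for ev in events:
        full = os.path.join(root, ev.path)
        if ev.op == "write":
            # Open for update if the file exists so earlier data is preserved.
            mode = "r+b" if os.path.exists(full) else "wb"
            with open(full, mode) as fh:
                fh.seek(ev.offset)
                moved["write"] += fh.write(b"\0" * ev.size)
        else:
            with open(full, "rb") as fh:
                fh.seek(ev.offset)
                moved["read"] += len(fh.read(ev.size))
    return moved


# A made-up trace standing in for events captured from a real application.
trace = [
    IOEvent("read", "input.dat", 0, 4096),
    IOEvent("read", "input.dat", 4096, 4096),
    IOEvent("write", "output.dat", 0, 8192),
]

with tempfile.TemporaryDirectory() as root:
    preposition(trace, root)
    totals = replay(trace, root)
print(totals)
```

Pointing the root at a different file system would replay the same pattern there, which is the use case the abstract describes for characterizing storage without the original application.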
For about 15 years, Météo-France has been running non-hydrostatic limited area models in operations. These non-hydrostatic dynamics, currently used in Météo-France's limited area model AROME, can be enabled in our global model ARPEGE.
The renewal of Météo-France HPC systems in 2020 is expected to increase its computing power by a factor of five, and may make possible the operation of a short-range (2 or 3 days) global very high-resolution non-hydrostatic model. Météo-France is currently engaged in the Dyamond project, whose purpose is the comparison of different non-hydrostatic global models running at 2.5 km.
The presentation will describe the workings of the non-hydrostatic dynamics, the expected and measured performance of ARPEGE-NH, and the work achieved so far in the framework of the Dyamond project. Finally, we will discuss the feasibility of implementing a 2.5 km ARPEGE-NH in operations.
This talk describes issues in state-of-the-art data management and then introduces the Earth-System Data Middleware (ESD), which sits underneath NetCDF4/HDF5. While this is considered an intermediate step, integrated solutions are necessary that cover workflows consisting of storage and compute and that enable transparent and concurrent usage of the storage technologies in the complex future storage landscape.
A proposed approach is the definition and development of next-generation APIs that supersede the outdated POSIX and MPI-IO interfaces. This would ultimately enable (I/O) performance portability and require less tuning of low-level features.
However, active involvement of the earth-system community is necessary to drive such efforts.
This talk will provide an overview of work done as part of a collaboration between the Met Office and NIWA to create a prototype LFRic mini-app that uses Paraview with the Catalyst library for in-situ visualisation. The Met Office LFRic project is developing a software infrastructure to support the replacement for the Met Office Unified Model (UM). This new model will deliver the Met Office's operational weather forecasting and climate capabilities on the HPC platforms of the next decade. Key requirements are scalability and the ability to deploy efficiently to different hardware architectures in preparation for exascale regimes. When considering future massively parallel, high resolution outputs, I/O is a serious potential bottleneck. In-situ analytics and visualisation as part of a modelling workflow therefore make sense, as we can do much of the post-processing work while the data is still 'hot'. Our Catalyst mini-app enables visualisation of LFRic fields with 'on-the-fly' conversion of mesh and data from LFRic's internal data structures to Paraview/VTK data structures while the model simulation is running. Parallel image rendering on compute nodes (using the MPI decomposition defined by the LFRic infrastructure) is also supported. Workflows on user desktops as well as HPC are possible with a range of end-user interactions. For example, a full Python pipeline enables customisation of visualisation scripts without writing significant code, and a 'live' mode lets end users interact with a running simulation via Paraview.
FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS
Mike Ashworth, Graham Riley, Andrew Attwood, and John Mawer
School of Computer Science, University of Manchester, Manchester M13 9PL, UK
The EuroExa project proposes a High-Performance Computing (HPC) architecture which is both scalable to Exascale performance levels and delivers world-leading power efficiency. This is achieved through the use of low-power ARM processors accelerated by closely-coupled FPGA programmable components. In order to demonstrate the efficacy of the design, the EuroExa project includes application porting work across a rich set of applications. One such application is the new weather and climate model, LFRic (named in honour of Lewis Fry Richardson), which is being developed by the UK Met Office and its partners for operational deployment in the middle of the next decade.
Much of the run-time of the LFRic model consists of compute intensive operations which are suitable for acceleration using FPGAs. Programming methods for such high-performance numerical workloads are still immature for FPGAs compared with traditional HPC architectures. We have selected the Xilinx Vivado toolset including High-Level Synthesis (HLS). HLS generates a block of code for the FPGA (an IP block) from a pure C, C++ or OpenCL function annotated with HLS pragmas, which supply optimization hints and instructions about the data interface. This block can be combined with other standard IP blocks in Vivado Design Studio and a bitstream generated for programming the FPGA. The FPGA is then driven by standard C code on the ARM processor which manages the flow of data between the ARM and the FPGA and handles signals to control the HLS blocks. We shall describe this process in detail, discuss the benefits of a range of optimizations and report performance achieved from a matrix-vector multiply kernel.
In this talk we will present collaborative work performed as part of the EU-funded ESCAPE project (led by ECMWF) aimed at modernising numerical weather prediction applications for efficient utilisation of modern parallel architectures. The project features the use of "dwarves" designed to represent the main computational properties of the real applications, and this talk will focus on GPU enabling and optimisation for two dwarves. The first captures the key algorithmic patterns from the spherical harmonics techniques used by global modelling applications such as the Integrated Forecasting System (IFS), which combines Legendre and Fourier transforms on a spherical grid. The second represents the "MPDATA" advection scheme and uses an unstructured grid.
For single-GPU benchmarks, we have improved performance, over pre-existing suboptimal GPU implementations, by around a factor of 17 for the spherical harmonics dwarf and a factor of 57 for the MPDATA dwarf. Optimizations include the improvement in memory management and minimization of data movement, algorithmic redevelopment to allow more flexible exposure of parallelism, minimization of launch overhead, data layout and loop structure adaptations to improve memory access patterns and coalescing, and efficient usage of libraries. Performance modelling shows we are approaching maximum available hardware performance. We have performed substantial work to allow efficient multi-GPU scaling, and results will showcase the recently announced DGX-2 server featuring the high-bandwidth NVSwitch interconnect which provides an excellent match for the all-to-all communications required by the algorithms.
We will present our work in a manner to allow our analysis and optimisation techniques to be easily transferable for other researchers working on their own applications.
We will also briefly describe other weather-related projects NVIDIA is involved with including COSMO, MPAS and WRF, as well as research into exploiting artificial intelligence in this domain.
Since its formation in 2015, the Cray EMEA Research Lab has taken a keen interest in new approaches to optimise data location, distributed task computation and HPC I/O. To fully address HPC I/O optimization we must move beyond configuration, application and library optimisation to looking at the whole workflow and optimising the data movement required by dependencies in the workflow. Our Octopus project will deliver a high-level workflow description, scheduling, and execution framework, built around data and memory hierarchy awareness. We believe that being able to schedule applications while reasoning about their data production, data locality, and data consumption, and accounting for the cost of moving data between tiers of the memory hierarchy, will enable efficient execution of coupled applications in complex workflows. Our use-cases cover: in-situ analysis and visualization; simulation with multiple consumers; coupled simulations with multiple concurrent analysis, external-data source and archiving dependencies; and even sample distribution and shuffling in high-throughput machine learning tasks. Octopus is an ambitious project that will provide optimisations that are impossible if we constrain ourselves to the individual application. We have an outline design and will develop Octopus collaboratively with partners of recently funded EU projects; ECMWF is one of these partners.
We have developed an enabling library, ‘Data Junction’, to transport (and redistribute) data encapsulated in Octopus objects between applications. It already supports various transports (including MPI, Dataspaces, Ceph RADOS, libfabric and POSIX file).
The talk will describe Octopus, work completed so far on data transport and how this will continue via collaborative EU H2020 projects MAESTRO and EPIGRAM-HS. We are interested in how the Octopus framework can be applied to a variety of NWP and climate workflows.
CIMA Research Foundation, a non-profit organization committed to promoting study, research and development in engineering and environmental sciences, has developed a workflow manager for the execution of high-resolution NWP on a regional scale and the exploitation of meteorological products for further simulations (hydro, fire, energy production) supporting civil protection and energy trading.
The system has to provide results within a strict time constraint, exploiting HPC resources granted on a best-effort basis as well as cloud resources. Users include regional meteorological services, civil protection, SMEs active in the energy trading market, and insurance and re-insurance companies.
The workflow manager, developed in Go and fully controlled by REST APIs, is able to exploit concurrency to minimize the latency in all steps of the workflow and to minimize the uncertainties due to the use of best-effort resources.
In particular, it fetches global NWP products in parallel; as soon as a timeframe of global data is available, it is pre-processed by WPS with a custom namelist. The WRF simulation is then started and continuously monitored. As soon as an output timestep is available, it is further processed on HPC and cloud resources, and the time steps of the final products are delivered to service subscribers.
The first output timestep is available in less than 5 minutes from the start of our NWP simulation.
We plan to enrich the system with: the possibility to exploit dedicated resources to further reduce the input-to-output latency; more flexible namelist generation, to automate on-demand simulations on a new domain; a web interface to simplify workflow monitoring and management.
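The streaming structure of the workflow described above (parallel fetch, pre-processing as each timeframe arrives, delivery of each output timestep as soon as it exists) can be sketched as follows. The real workflow manager is written in Go; this Python version is only an illustration, and every function here is a hypothetical stand-in that performs no download and runs neither WPS nor WRF.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_global_frame(hour):
    """Stand-in for downloading one timeframe of global NWP data."""
    return {"hour": hour, "stage": "global"}


def preprocess(frame):
    """Stand-in for the pre-processing of one timeframe."""
    return {**frame, "stage": "preprocessed"}


def simulate(frames):
    """Stand-in for the regional simulation: yields each output timestep when ready."""
    for frame in sorted(frames, key=lambda fr: fr["hour"]):
        yield {"hour": frame["hour"], "stage": "simulated"}


def postprocess(step):
    """Stand-in for post-processing and delivery of one output timestep."""
    return {**step, "stage": "delivered"}


def run_workflow(hours, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fetch all timeframes in parallel.
        fetches = [pool.submit(fetch_global_frame, h) for h in hours]
        # Pre-process each timeframe as soon as its fetch completes.
        prepped = [pool.submit(preprocess, fut.result())
                   for fut in as_completed(fetches)]
        frames = [fut.result() for fut in prepped]
        # Hand each output timestep to post-processing as soon as it is produced.
        deliveries = [pool.submit(postprocess, step) for step in simulate(frames)]
        return [fut.result() for fut in deliveries]


products = run_workflow([0, 3, 6, 9])
for product in products:
    print(product)
```

The point of the design is that no stage waits for the whole previous stage to finish, which is what keeps the latency to the first delivered timestep low.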
Typically, operational NWP centres collect observational data up until a predefined cut-off time. After this data cut-off, the assimilation process begins. Consequently, data assimilation systems are optimised for "time to solution" to make the best use of the critical path. Under the Continuous Data Assimilation concept, the assimilation computations are started before all of the observations have arrived. This approach allows more time to be spent on the assimilation which enables an improved estimation of the initial state. Results will be presented from an experimental quasi-continuous data assimilation system where newly arrived observations are added at each outer loop of 4D-Var. In the future, such a system may evolve into something that runs continuously, with the first guess state being refined continually as new observations arrive. The implications in terms of operational HPC resources and resilience will be discussed.
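The idea of adding newly arrived observations at each outer loop can be illustrated with a toy scalar variational analysis. This is not 4D-Var: the state is a single number observed directly, so the minimiser of the cost function has a closed form and no relinearisation around the previous guess is needed. The background value, error standard deviations and observation batches are made-up numbers.

```python
def analysis(background, sigma_b, observations, sigma_o):
    """Minimiser of J(x) = (x - xb)^2 / sigma_b^2 + sum_i (y_i - x)^2 / sigma_o^2."""
    wb = 1.0 / sigma_b ** 2
    wo = 1.0 / sigma_o ** 2
    return (wb * background + wo * sum(observations)) / (wb + wo * len(observations))


def quasi_continuous(background, sigma_b, sigma_o, arrivals):
    """arrivals[k] holds the observations that arrived before outer loop k.

    Each outer loop re-solves the analysis with every observation received so far.
    """
    available = []
    guesses = []
    for batch in arrivals:
        available.extend(batch)
        guesses.append(analysis(background, sigma_b, available, sigma_o))
    return guesses


# Hypothetical observation batches arriving before three successive outer loops.
arrivals = [[1.2, 0.8], [1.1], [0.9, 1.0, 1.0]]
guesses = quasi_continuous(0.0, 1.0, 0.5, arrivals)
all_obs = [y for batch in arrivals for y in batch]
print(guesses)
```

In this linear toy problem the final guess coincides with an analysis that waits for all observations, but the computation starts as soon as the first batch is in, which is the scheduling benefit the abstract points to.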