Workshop: Building reproducible workflows for earth sciences




ECMWF's research and operational activities are computationally expensive and constantly need to scale further with new requirements and challenges. There is a high demand for the use of new mechanisms for sharing code, data, environments and making the scientific workflow more reproducible and flexible to move to new platforms. For example, for the upcoming move to ECMWF's new data centre in Bologna, Italy, we would like to explore new and robust technologies to develop reproducible workflows and how cloud computing could benefit from them.

Events team
    • 09:10 09:50
      Registration and coffee
    • 09:50 10:10
      Welcome and introduction 20m
      Speakers: Claudia Vitolo (ECMWF), Florian Pappenberger (ECMWF)
    • 10:10 10:30
      Reproducible workflows - Setting the scene 20m

      ECMWF has always been a hub of activity around earth science data. Scientists and analysts continue to develop new ways of analysing and presenting data. For ECMWF it is crucial that this work can be shared and reproduced at any time.
      But ECMWF is also a place that is constantly changing to make best use of new technologies. In 2020 ECMWF's whole computer centre will be moved from Reading to Bologna. Combined with the popularity of public and private cloud environments, this gives even more urgency to ensuring that scientific work can be reproduced in a robust way.

      This presentation will give an overview of ECMWF's perspective on reproducible workflows, the challenges encountered and solutions found.

      Speaker: Stephan Siemen (ECMWF)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 11:40
      Responding to reproducibility challenges from physics to social sciences 40m

      Facilitating research reproducibility presents a pressing issue across all sciences. However, since different challenges arise in natural and social sciences, domain-specific strategies might be the best way to promote reproducibility. This talk presents experiences from two different disciplines: energy economics and high-energy physics. It discusses and compares potential technical and conceptual solutions for facilitating reproducibility and openness in the two fields. On the energy economics side, the ubiquitous use of proprietary software and sensitive data are encumbering efforts to share research and thus inhibit reproducibility. I present insights around these issues based on interviews with faculty and staff at the Energy Policy Institute at the University of Chicago. On the high-energy physics side, vast amounts of data and complex analysis workflows are among the main barriers to reproducibility. I present domain-tailored solutions to these problems, including projects called CERN Open Data, CERN Analysis Preservation and REANA - Reusable Analysis. Finally, I discuss the types of tools that can be used to facilitate reproducibility and sharing, detailing their ability to address various challenges across different disciplines.

      Speaker: Dr Ana Trisovic (IQSS, Harvard University)
    • 11:40 12:00
      The important role of versioning your code 20m

      Versioning source code is an essential task for reproducible workflows. But it is not only code that can be versioned: data, configurations and setups of workflows can be versioned too. The popularity of Git and GitHub has helped to raise awareness and allows greater accountability for code across the community. Many tools can be triggered on code changes and have made it possible for any developer or scientist to easily integrate Continuous Integration (CI) and Continuous Deployment (CD) processes.

      The talk will show how versioning is used at ECMWF to help improve reproducibility and robustness of our work and what we offer with our new presence in GitHub.

      Speaker: Stephan Siemen (ECMWF)
    • 12:00 12:40
      Reproducibility and workflows with the Integrated Forecasting System 40m

      Development and testing of scientific changes to ECMWF’s Integrated Forecasting System is complex and is distinct from non-scientific software workflows at ECMWF. This talk will compare these branching models and explain why those differences exist. It will also discuss work that has recently been done to take ideas from non-scientific software workflows to improve and streamline our scientific workflows where we can.

      Speaker: Andrew Bennett (ECMWF)
    • 12:40 13:00
      Versioning and tracking changes of vector data 20m

      Geospatial data are often treated as static datasets. But, in reality, the data are presenting specific features at a specific time.

      Handling changes to the data is usually manual work which does not capture the history and source of changes.

      We have started GEODIFF, a library to version and track changes to vector data. The Mergin service was developed so that users can take advantage of the tool to track their vector data when changes are made from various sources.
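
      As an illustration of what tracking changes to vector data involves, the sketch below computes a changeset between two versions of a feature table and replays it. It is not GEODIFF's API: the function names, the dictionary-based feature model and the WKT geometry strings are all assumptions made for this example, and both versions are assumed to share the same attribute schema.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical feature model: attribute name -> value, with geometry kept as WKT text.
Feature = Dict[str, Any]
Change = Tuple[str, int, Any]


def diff_features(old: Dict[int, Feature], new: Dict[int, Feature]) -> List[Change]:
    """Return the changes that turn `old` into `new`, keyed by feature id."""
    changes: List[Change] = []
    for fid in sorted(old.keys() - new.keys()):
        changes.append(("delete", fid, None))
    for fid in sorted(new.keys() - old.keys()):
        changes.append(("insert", fid, new[fid]))
    for fid in sorted(old.keys() & new.keys()):
        # Record only the attributes whose values differ (same schema assumed).
        changed = {k: v for k, v in new[fid].items() if old[fid].get(k) != v}
        if changed:
            changes.append(("update", fid, changed))
    return changes


def apply_changes(base: Dict[int, Feature], changes: List[Change]) -> Dict[int, Feature]:
    """Replay a changeset on a copy of `base` and return the result."""
    result = {fid: dict(feature) for fid, feature in base.items()}
    for operation, fid, payload in changes:
        if operation == "delete":
            del result[fid]
        elif operation == "insert":
            result[fid] = dict(payload)
        else:  # "update": overwrite only the changed attributes
            result[fid].update(payload)
    return result
```

      Storing the changeset rather than a second full copy is what preserves the history: each changeset can be attributed to a source and replayed in order.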

      Speaker: Saber Razmjooei (Lutra Consulting)
    • 13:00 14:00
      Lunch break 1h
    • 14:00 14:30
      Open discussion: The importance of versioning and standards
      Convener: Stephan Siemen (ECMWF)
    • 14:30 14:50
      Leveraging OGC standards to boost reproducibility 20m
      Speaker: Ingo Simonis (OGC)
    • 14:50 15:10
      Challenges and needs of reproducible workflows of Open Big Weather and Climate data 20m

      ECMWF, also in its role as operator of the two Copernicus services on Climate Change (C3S) and Atmosphere Monitoring (CAMS), offers a range of open environmental data sets on climate, air quality, fire and floods.
      Through Copernicus, a wealth of data is being made available free and open, and a new range of users, not necessarily ‘expert’ users, are interested in exploiting the data.
      This makes the reproducibility of workflows particularly important. A full, free and open data policy is vital for reproducible workflows and an important prerequisite. Reproducibility, however, has to be reflected in all aspects of the data processing chain. The biggest challenge is currently limited data ‘accessibility’, where ‘accessibility’ means more than just improving data access. Accessibility aspects are strongly linked with being reproducible and require improvements and developments along the entire data processing chain, including the development of example workflows and reproducible training materials, the need for data standards and interoperability, as well as developing or improving the right open-source software tools.

      The presentation will go through each step of some example workflows for open meteorological and climate data and will discuss reproducibility and ‘accessibility’ challenges and future needs that will be required in order to make open meteorological and climate data fully accessible and reproducible.

      Speaker: Julia Wagemann (ECMWF)
    • 15:10 15:30
      The Copernicus Climate Data Store: ECMWF’s approach to providing online access to climate data and tools 20m

      Data about Earth’s climate is being gathered at an ever-increasing rate. To organise and provide access to this data, the Copernicus (European Union's Earth Observation Program) Climate Change Service (C3S) operated by ECMWF released the Climate Data Store (CDS) mid-2018. The aim of the C3S service is to help a diverse set of users, including policy-makers, businesses and scientists, to investigate and tackle climate change and the CDS provides a cloud-based platform and freely available data to enable this.

      The CDS provides reliable information about the past, present and future climate, on global, continental, and regional scales. It contains a variety of data types, including satellite observations, in-situ measurements, climate model projections and seasonal forecasts.

      It is set to give free access to this huge amount of open climate data, presenting new opportunities to all those who require authoritative information on climate change. A quick ‘click-and-go’ experience via a simple, uniform user interface offers easy online access to a wealth of climate data that anyone can freely browse and download after a simple registration process.

      As well as discovering and browsing trusted datasets, users can use the CDS Toolbox to analyze CDS data online by building their own data processing workflows and their own web-based applications.

      Reproducibility is key here, both for the CDS catalogue and the CDS Toolbox. The CDS needs to be a reliable and sustainable system that users can trust.

      The CDS is continually being optimised and expanded through interaction with users. It delivers more than 60TB a day and can be accessed online.

      Speaker: Dr Gionata Biavati (ECMWF)
    • 15:30 16:00
      Coffee break 30m
    • 16:00 16:20
      Standardised data representation - power of reproducible work-flow 20m

      ECMWF maintains strong collaboration with WMO, space agencies, conventional data providers and Member/Co-operating States on standardized data representation to enable the exchange and assimilation of high-quality observations. This secures efficient data workflows and longevity in the data archive and climate reanalysis, ensuring that new weather observations will continue to improve forecasts for the benefit of society. The power of standardized data comes from its use amongst internal and external stakeholders, allowing seamless integration in operations while fulfilling research needs.

      Speaker: Ms Marijana Crepulja (ECMWF)
    • 16:20 16:40
      ECMWF Data Governance 20m

      ECMWF is hosting the largest meteorological archive in the world with more than 300 PB stored on tape. Data created, used, archived and distributed by ECMWF are a valuable and unique asset. Often, they are generated at considerable expense and effort, and can be almost impossible to reproduce. As such, they should be properly governed and managed to ensure that the maximum value and benefit is extracted from them. It is also vital that they are available for reuse in the future. Often decisions about data can have far reaching implications which are not immediately apparent, and the Data Governance process provides the means to ensure that all impacts are properly considered. The "Data Governance" workflow therefore plays a crucial role and enables ECMWF to meet both current and future data challenges.
      In this talk, I will review the type of data we handle at ECMWF and explain the importance of having precise, well-defined metadata to describe and index the information. I will illustrate the Data Governance process through a set of concrete examples covering new parameters and new types of data.

      Speaker: Sebastien Villaume (ECMWF)
    • 16:40 17:10
      Computer hall tour or Weather Wall demonstration
    • 17:10 18:00
      Drinks reception 50m
    • 09:00 09:10
      Recap from day 1 and remarks 10m
      Speaker: Claudia Vitolo (ECMWF)
    • 09:10 09:50
      Scaling Reproducible Research with Project Jupyter 40m

      Jupyter notebooks have become the de-facto standard as a scientific and data science tool for producing computational narratives. Over five million Jupyter notebooks exist on GitHub today. Beyond the classic Jupyter notebook, Project Jupyter's tools have evolved to provide end to end workflows for research that enable scientists to prototype, collaborate, and scale with ease. JupyterLab, a web-based, extensible, next generation interactive development environment enables researchers to combine Jupyter notebooks, code and data to form computational narratives. JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening the users with installation and maintenance tasks. Binder builds upon JupyterHub and provides free, sharable, interactive computing environments to people all around the world.

      Speaker: Ms Carol Willing (Project Jupyter)
    • 09:50 10:10
      Automated production of high value air quality forecasts with Pangeo, Papermill and Krontab 20m

      In many ways, a Jupyter notebook describes a data processing pipeline: you select some data at the top of the notebook, define reduction and analysis algorithms as the core of the notebook’s content, and generate value – often in the form of plots or new insight – at the end of the notebook by applying the algorithms to the data. Value can be added to analysis and insight by including textual metadata throughout the notebook that describes the analysis applied and the interpretation of the insight generated in the notebook.

      It is a common requirement to want to apply the same processing pipeline, described by a Jupyter notebook, to multiple datasets. In the case of air quality forecasts, this might mean executing the same processing pipeline on all chemical species implicated in a particular air quality study.

      In this talk we will present Pangeo as an open-source, highly customisable, scalable, cloud-first data processing platform. We will demonstrate using Pangeo to run a defined data processing pipeline in a Jupyter notebook, and move on to explore running this notebook multiple times on a range of input datasets using papermill. Finally we will demonstrate running the processing pipeline automatically to a schedule defined with krontab, a crontab-like job scheduling system for kubernetes.
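
      The idea of running one notebook over several input datasets can be sketched as follows. This is not the speakers' code: the notebook name, the species list and the `plan_runs` and `execute_runs` helpers are hypothetical. The only papermill call used is `papermill.execute_notebook(input_path, output_path, parameters=...)`, which injects values into the notebook cell tagged `parameters`; papermill is a third-party package and is imported lazily so that the planning step works without it.

```python
from pathlib import Path
from typing import Dict, List


def plan_runs(template: str, species: List[str], out_dir: str) -> List[Dict]:
    """Build one run description per chemical species (names are illustrative)."""
    stem = Path(template).stem
    runs = []
    for name in species:
        runs.append(
            {
                "input_path": template,
                "output_path": str(Path(out_dir) / f"{stem}_{name}.ipynb"),
                "parameters": {"species": name},
            }
        )
    return runs


def execute_runs(runs: List[Dict]) -> None:
    """Execute every planned run with papermill (requires `pip install papermill`)."""
    import papermill as pm  # third-party; imported here so plan_runs has no dependency

    for run in runs:
        pm.execute_notebook(
            run["input_path"],
            run["output_path"],
            parameters=run["parameters"],
        )


if __name__ == "__main__":
    for run in plan_runs("air_quality.ipynb", ["no2", "o3", "pm10"], "out"):
        print(run["output_path"], run["parameters"])
```

      Each executed copy is written to its own output notebook, so the parameters and results of every run are kept together.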

      Speaker: Peter Killick (Met Office Informatics Lab)
    • 10:10 10:30
      Jupyter for Reproducible Science at Photon and Neutron Facilities 20m

      Modern photon and neutron facilities produce huge amounts of data which can lead to interesting and important scientific results. However the increasing volume of data produced at these facilities leads to some fundamental issues with data analysis.

      With data sets in the hundreds of terabytes, it is difficult for scientists to work with their data. The large size also means that a lot of computational power is required to analyse them. Lastly, it can be difficult to find out what analysis was performed to arrive at a certain result, making reproducibility challenging.

      Jupyter notebooks potentially offer an elegant solution to these problems. They can be run remotely, so the data can stay at the facility where it was gathered, and the integrated text and plotting functionality allows scientists to explain and record the steps they are taking to analyse their data, meaning that others can easily follow along and reproduce their results.

      The PaNOSC (Photon and Neutron Open Science Cloud) project aims to promote remote analysis via Jupyter notebooks, with a focus on reproducibility and following FAIR (Findable, Accessible, Interoperable, Re-Usable) data standards.

      There are many technical challenges which must be addressed before such an approach is possible, such as recreating the computational environments, recording workflows used by the scientists, seamlessly moving these environments and workflows to the data, and having this work through one common portal which links together several international facilities.

      Some of these challenges have likely also been encountered by other scientific communities. As such, it would be very useful to work together and try to come up with common workflows and solutions for creating reproducible notebooks for science.

      This project (PaNOSC) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 823852.

      Speaker: Robert Rosca (European XFEL)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 11:20
      Design of a Generic Workflow Generator for the JEDI Data Assimilation System 20m

      The JEDI (Joint Effort in Data assimilation Integration) is a collaborative project that provides a generic interface to data assimilation algorithms and observation operators for atmospheric, marine and other Earth system models, allowing these components to be easily and dynamically composed into complete data-assimilation and forecast-cycling systems. In this work we present the design of a generic workflow generation system that allows users to easily configure the JEDI components to produce custom data analysis toolchains with full cycling capability. Like the JEDI system itself, the workflow component is designed as a dynamically composable system of generic applications. An important point is that the JEDI workflow system is a generic workflow generation system, designed to programmatically produce workflow descriptions for a range of production-quality workflow management software engines including ecFlow, Cylc, and Apache Airflow. Configuration of the JEDI executables, the Python applications that control them, and the connection of applications into larger workflow specifications, is entirely accomplished with YAML-syntax configuration files using the Jinja templating engine. The combination of YAML and Jinja is simultaneously powerful, simple, and easily editable, allowing the user to quickly reconfigure workflow descriptions. A user can change model parameters, DA algorithms, covariance models, observation operators, and observation QC filtering algorithms, as well as the entire workflow graph structure, all without writing any shell scripts, editing any code, or recompiling any packages. Another key focus of the JEDI workflow system is data provenance and experiment reproducibility. 
      Execution reproducibility is accomplished through elimination of unportable shell scripting in favor of Python-3; reliance on version control systems; universal use of checksum verification of all input products; and archiving of all relevant configuration and state as human-readable and editable YAML files.
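
      Checksum verification of input products, one of the measures listed above, can be sketched with the Python standard library. This is a generic illustration and not JEDI code: the manifest layout (file name mapped to an expected SHA-256 digest) and the function names are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large input products need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_inputs(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the names of inputs that are missing or whose checksum differs."""
    failures = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            failures.append(name)
    return failures
```

      A workflow step would call `verify_inputs` before running and refuse to start if the returned list is not empty.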

      Speaker: Dr Mark J. Olah (UCAR / JCSDA)
    • 11:20 12:00
      Building robust and reproducible workflows with Cylc and Rose 40m

      Cylc is an open-source workflow tool used by a number of National Met Services, including the Met Office in the UK, to control the workflow of their software. We talk about how the Met Office uses Cylc and the related software configuration tool Rose to ensure that our workflows are reproducible, and discuss best practice when designing workflows. We also discuss the features of Cylc which improve robustness, enabling workflows to endure hardware outages and other interruptions to service.

      Speaker: Stuart Whitehouse (Met Office)
    • 12:00 12:20
      Workflow in CESM2 20m

      The Community Earth System Model (CESM) version 2.x includes a case control system (CCS) developed in object-oriented Python. In this talk I will present the CCS with an emphasis on workflow control using tools developed at NCAR as well as third-party tools such as Cylc.

      Speaker: Mr Jim Edwards (National Center for Atmospheric Research USA)
    • 12:20 12:40
      Reproducible science at large scale within a continuous delivery pipeline: the BSC vision 20m

      Numerical models of the climate system are an essential pillar of modern climate research. Over the years, these models have become more and more complex and today’s Earth System Models (ESMs) consist of several components of the climate system coupled together, and running on high performance computing (HPC) facilities. Moreover, climate experiments often entail running different instances of these models from different starting conditions and for an indeterminate number of steps. This workflow usually involves other tasks needed to perform a complete experiment, such as data pre-processing, post-processing, transferring or archiving.

      As a result, reproducing a climate experiment is far from trivial: it requires the orchestration of different methodologies and tools in order to guarantee the statistical reproducibility of the research when bit-for-bit reproducibility is not possible.

      In this work we show the methodology and software tools employed to achieve science reproducibility in the Earth Sciences department of the Barcelona Supercomputing Center. Version control systems (VCS), test-oriented development, continuous integration with automatic and periodic tests, and a well-established reproducibility methodology, data repositories and data federation services, are all orchestrated by a fault-tolerant workflow management system.

      Additionally, we show our experience in providing an operational service in the context of a research environment, with the set up of a highly fault-tolerant system. In this case the option of providing redundancy by using cloud services to execute computational models was considered and studied in comparison with the capabilities provided by HPC systems.
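
      The distinction between bit-for-bit and statistical reproducibility mentioned above can be made concrete with a small sketch. It is not the BSC methodology: the acceptance rule (the candidate's mean must lie within `n_sigma` standard deviations of the reference ensemble's means), the function names and the sample values are assumptions chosen to keep the example short.

```python
import struct
from statistics import mean, stdev
from typing import Sequence


def bit_identical(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when both runs hold exactly the same IEEE-754 double values."""
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


def statistically_consistent(
    reference_runs: Sequence[Sequence[float]],
    candidate: Sequence[float],
    n_sigma: float = 2.0,
) -> bool:
    """Check the candidate's mean against the spread of the reference means.

    Needs at least two reference runs so that a spread can be estimated.
    """
    reference_means = [mean(run) for run in reference_runs]
    centre = mean(reference_means)
    spread = stdev(reference_means)
    return abs(mean(candidate) - centre) <= n_sigma * spread


if __name__ == "__main__":
    # Illustrative global-mean temperatures from three reference runs.
    references = [[14.9, 15.1, 15.0], [15.0, 15.2, 15.1], [14.8, 15.0, 14.9]]
    rerun = [15.0, 15.1, 15.05]
    print(bit_identical(rerun, references[0]))
    print(statistically_consistent(references, rerun))
```

      A rerun on a different machine or compiler will usually fail the first check, which is why a statistical criterion is needed as a fallback.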

      Speaker: Miguel Castrillo (BSC-CNS)
    • 12:40 13:00
      Space Situational Awareness - Virtual Search Environment 20m

      Mr Marek Kubel-Grabau

      Analytic at EVERSIS sp z.o.o.

      EVERSIS created a prototype web platform called the SSA-VRE (Space Situational Awareness - Virtual Search Environment) which aims to provide domain specialists with a space where science solutions and tools could be made available for scientific audiences, supported by possible ideas exchange and community building.

      The SSA-VRE organically grows and expands the possibilities of using the data, which fosters innovation and results in world-leading collaborative solutions. It supports cross-segment cooperation among dozens of scientists and service providers. The platform is packed not only with data and tools but also with inspiring remarks, knowledge resources to browse and share, and projects everyone can join and develop.

      The Eversis team would like to discuss the concept of science community building and information flow in open-source digital environments. The team will try to answer the question of why such an idea is successful in the SSA domain and how it could benefit other science domains.

      Speaker: Marek Kubel-Grabau (Eversis)
    • 13:00 14:00
      Lunch break 1h
    • 14:00 14:20
      CMIP6 post-processing workflow at the Met Office 20m

      The Climate Data Dissemination System is the software system developed in the Met Office Hadley Centre to post-process simulation output into the format required for submission to the Coupled Model Intercomparison Project phase 6 (CMIP6). The system decouples the production of data and metadata in CMIP standard formats from the running of climate simulations. This provides the flexibility required by participation in a global community-led project, allowing simulations to run before all the specifications of the datasets have been finalised.

      I will describe how we have developed a workflow based on standard Python tools developed in the Met Office and the climate research community to build a system that enables traceability and reproducibility in the climate data post-processing workflow. I will discuss the advantages and disadvantages of the choices we have made for this system. I will then discuss plans for future climate data projects, including CMIP7.

      Speaker: Mr Stephen Haddad (Met Office)
    • 14:20 14:40
      Reproducible workflows for big data with Python 20m


      Speaker: Mr Iain Russell (ECMWF)
    • 14:40 15:00
      Using containers for reproducible results 20m

      This talk will show how we're using containers at ECMWF in several web developments, in order to achieve a reasonably reproducible workflow from development to production.

      Speaker: Carlos Valiente (ECMWF)
    • 19:00 21:00
      Workshop dinner at Cote Brasserie
    • 09:00 09:10
      Recap from day 2 and remarks 10m
      Speaker: Stephan Siemen (ECMWF)
    • 09:10 09:50
      Publishing Reproducible Geoscientific Papers: Status quo, benefits, and opportunities 40m

      Markus Konkol, University of Münster, Institute for Geoinformatics

      Open reproducible research (ORR) is the practice of publishing the source code and the datasets needed to produce the computational results reported in a paper. Since many geoscientific articles include geostatistical analyses and spatiotemporal data, reproducibility should be a cornerstone of the computational geosciences but is rarely realized. Furthermore, publishing scientific outcomes in static PDFs does not adequately report on computational aspects. Thus, readers cannot fully understand how the authors came to the conclusions and how robust these are to changes in the analysis. Consequently, it is difficult for reviewers to follow the analysis steps, and for other researchers to reuse existing materials. This talk starts with obstacles that prevented geoscientists from publishing ORR. To overcome these barriers, the talk suggests concrete strategies. One strategy is the executable research compendium (ERC) which encapsulates the paper, code, data, and the entire software environment needed to produce the computational results. Such concepts can assist authors in adhering to ORR principles to ensure high scientific standards. However, ORR is not only about reproducing results but it involves a number of additional benefits, e.g. an ERC-based workflow. It allows authors to convey their computational methods and results by also providing interactive access to code and data, and readers to deeply investigate the computational analysis while reading the actual article, e.g. by changing the parameters of the analysis. Finally, the presentation introduces the concept of a binding; a binding connects those code lines and data subsets that produce a specific result, e.g. a figure or number. By also considering user interface widgets (e.g. a slider), this approach allows readers to interactively manipulate the parameters of the analysis to see how the results change.

      Speaker: Mr Markus Konkol (University of Münster, Institute for Geoinformatics)
    • 09:50 10:10
      DARE: Integrating solutions for Data-Intensive and Reproducible Science 20m

      The DARE (Delivering Agile Research Excellence on European e-Infrastructures) project is implementing solutions to enable user-driven reproducible computations that involve complex and data-intensive methods. Technology developed in DARE enables Domain Experts, Computational Scientists and Research Developers to compose, use and validate methods that are expressed in abstract terms. Scientists’ workflows translate to concrete applications that are deployed and executed on cloud resources offered by European and international e-infrastructures, as well as in-house institutional platforms and commercial providers. The platform’s core services enable researchers to visualise the collected provenance data from runs of their methods for detailed diagnostics and validation, in support of long-running research campaigns involving multiple runs. Use cases are presented by two scientific communities in the framework of EPOS and IS-ENES, conducting research in computational seismology and climate-impact studies respectively. DARE enables users to develop their methods within generic environments, such as Jupyter notebooks, associated with conceptual and evolving workspaces, or via the invocation of OGC WPS services interfacing with institutional data archives. We will show how DARE exploits computational facilities adopting software containerisation and infrastructure orchestration technologies (Kubernetes). These are transparently managed via the DARE API, in combination with registries describing data, data sources and methods. Ultimately, the extensive adoption of workflows (dispel4py, CWL), methods abstraction and containerisation allows DARE to dedicate special attention to portability and reproducibility of scientific progress in different computational contexts. We will show how the choices of research developers, as well as the effects of the execution of their workflows, are captured and managed, which enables validation, monitoring and reproducibility.
      We will discuss the implementation of the provenance mechanisms that adopt workflow provenance types, lineage services (S-ProvFlow) and PROV-Templates to record and interactively use context-rich provenance information in W3C PROV compliant formats.

      Speaker: Dr Alessandro Spinuso (KNMI)
    • 10:10 10:30
      ECMWF's new product generation - Lessons learned from development to operations 20m

      ECMWF's operational forecast generates massive amounts of I/O in short bursts, accumulating to tens of TiB in hourly windows. From this output, millions of user-defined daily products are generated and disseminated to member states and commercial clients all over the world. These products are processed from the raw output of the IFS model, within the time-critical path and under a strict delivery schedule. The upcoming rise in resolution and growing popularity will increase both the size and number of these products.
      In 2019, ECMWF brought into operations a completely redesigned Product Generation software stack that computes the requested user products. To introduce the new software swiftly into operations, minimal user impact was a design criterion, as were increased robustness and reliability.
      We will present the lessons learned, technical decisions and the steps taken to bring this massive operational change successfully to completion.

      Speaker: Tiago Quintino (ECMWF)
    • 10:30 11:00
      Coffee break 30m
    • 11:00 11:40
      Using Cloud to Streamline R&D Workflow 40m

      The Finnish Meteorological Institute (FMI) uses cloud services in several ways. First, FMI has piloted providing its services in the cloud. Second, FMI has joined the AWS Public Data Sets programme in order to provide its open data. Users who need the whole grid have found the service very convenient, and for that particular use case AWS popularity is increasing rapidly while usage of FMI's own data portal is decreasing slowly. Third, FMI has piloted Google Cloud for machine learning in impact analysis studies. Additional services like BigQuery and Data Studio have been very useful in conducting the studies.

      Speaker: Mr Roope Tervo (Finnish Meteorological Institute)
    • 11:40 12:00
      A journey into the long white Cloud 20m

      In 2012 the New Zealand (NZ) MetService started on a journey to move its NWP modelling and other automated processing to Amazon Cloud computing. Sitting on multiple earthquake faults, with a data centre that had limited ability to increase capacity and resilience against external outages, the organisation decided on a revolutionary change compared with the past: no longer to own a data centre for weather modelling. Although the move was mainly driven by the requirement for a resilient computing environment, preparing for the Cloud made many more benefits apparent, provided the Cloud infrastructure solution was designed appropriately.

      The main benefits we aimed for were:
      - high resilience of infrastructure by combining multiple AWS regions in a seamless environment.
      - change towards scientists adopting professional software development practices and establishing an extremely robust release process.
      - cost effectiveness by using spot market instance prices for operations and research.
      - "self-healing" workflows that can guarantee automatic completion against all hardware failures.
      - scientists no longer wait for data; they analyse it.
      - much clearer cost attribution towards applications and services.
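
      The "self-healing" idea in the list above, where a task is simply run again after an interruption such as a reclaimed spot instance, can be sketched as a retry wrapper. This is a generic pattern and not MetService's implementation: the function name, the attempt limit and the fixed delay are assumptions, and it only makes sense for tasks that are safe to rerun from scratch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3, delay_s: float = 0.0) -> T:
    """Run `task` until it succeeds or `max_attempts` is reached.

    The task must be idempotent: a failed attempt may have done partial work.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:  # in practice, catch only infrastructure errors
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```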

      This presentation will touch on a number of aspects that were encountered on this journey of change which impacted on people, financials, accountabilities, and the mindset of solving science and infrastructure problems in the Cloud. It has been an interesting time with a few surprises along the way, but the general consensus is: let's not go back to the way it was.

      Having said all this, not all systems and data can be moved to the Cloud, leaving the NZ MetService to operate in a hybrid environment. The requirement to deliver data to NZ aviation with an availability beyond what the Southern Cross cable can provide, together with the requirements of NZ privacy law, means that some data needs to be hosted redundantly within NZ. Some of the challenges that this poses will be presented as well.

      Speaker: Dr Andy Ziegler (Meteorological Service NZ Ltd.)
    • 12:00 12:20
      From loose scripts to ad-hoc reproducible workflows: a methodology using ECMWF's ecflow 20m

      With a rising need for resource-efficient and fully tractable software stacks, e.g. in operational contexts, programmers increasingly face ever-changing, complex software and hardware dependencies within their workflows. In this talk, we show how ecflow was used to adaptively build up a fully reproducible workflow at runtime for the calibration of the European Flood Awareness System's hydrologic model, although the presented methods can be adapted to other workflow needs. Whether we require serial, parallel, or mixed-mode execution on one or more computer systems, ecflow can help to manage, monitor and adapt any software stack without penalising execution performance.

      Speaker: Damien Decremer (ECMWF)
    • 12:20 12:40
      The Role of Containers in Reproducible Workflows 20m

      A key challenge in supporting reproducible workflows in science is ensuring that the software environment for any simulation or analysis is sufficiently captured and re-runnable. This is compounded by the growing complexity of scientific software and of the systems it executes on. Containers offer a potential approach to addressing some of these challenges. This presentation will describe how containers can be used for scientific use cases with an emphasis on reproducibility. It will also cover some of the aspects of reproducibility that aren't easily addressed by containers.

      Speaker: Shane Canon (Lawrence Berkeley National Lab)
    • 12:40 13:00
      Reproducing new and old operational systems on development workstations using containers 20m

      Linux containers (Singularity and Docker) have significantly assisted with the Bureau of Meteorology's transition of operational weather model statistical post-processing from an old mid-range system to a new data-intensive HPC cluster. Containers provided a way to run the same software as both old and new systems on development workstations, which led to significantly easier development and allowed migration work to begin well before full readiness of the new HPC cluster system. Containers also provided reproducibility and consistent results for scientific verification of post-processing software. This talk describes how containers have been used in a team containing a mix of scientists and software developers, what has been learnt from the use of containers and recommendations for other teams adopting containers as part of their development process.

      Speaker: Dr Tom Gale (Bureau of Meteorology)
    • 13:00 14:00
      Lunch break 1h
    • 14:00 14:30
      Open discussion: Reproducibility in cloud and HPC environments
      Convener: Tiago Quintino (ECMWF)
    • 14:30 14:50
      Remote presentation: Developing a Unified Workflow for Convection Allowing Applications of the FV3 20m

      The Environmental Modeling Center (EMC) at the National Centers for Environmental Prediction (NCEP) has developed a limited area modeling capability for the Unified Forecast System (UFS), which uses the Finite Volume Cubed-Sphere (FV3) dynamical core. The limited area FV3 is a deterministic, convection-allowing model (CAM) being run routinely over several domains for testing and development purposes as part of a long-term effort toward formulating the Rapid Refresh Forecast System (RRFS). The RRFS is a convection-allowing, ensemble-based data assimilation and prediction system which is planned to feature an hourly update cadence. While current testing resides mostly on NOAA high-performance computing platforms, work is underway to perform development using cloud compute resources.
      Two workflows for running the limited area FV3 have been developed: 1) a community workflow, primarily focused on engaging NOAA collaborators and research community modeling efforts, and 2) an operational workflow, focused on the transition to operations. Both workflows use the Rocoto workflow manager and shell scripting, and both have been ported to multiple supercomputing platforms. Unification of the two workflows is underway to foster collaboration and accelerate development. In July 2019, a code sprint focused on developing a workflow for operations and research applications took place, with participants from multiple organizations. Outcomes from this sprint, current efforts, and ongoing challenges will be discussed.

      Speaker: Mr Benjamin Blake (IMSG and NOAA/NWS/NCEP/EMC)
    • 14:50 15:30
      Scaling Machine Learning with the help of Cloud Computing 40m

      One of the most common hurdles with developing data science/machine learning models is to design end-to-end pipelines that can operate at scale and in real-time. Data scientists and engineers are often expected to learn, develop and maintain the infrastructure for their experiments. This process takes time away from focussing on training and developing the models.

      What if there were a way of abstracting away the tasks unrelated to machine learning while still retaining control? This talk will discuss the merits of using Kubeflow, an open-source, Kubernetes-based platform. With the help of Kubeflow, users can:

      • Develop machine learning models easily and make repeatable, portable deployments across diverse infrastructure, e.g. from a laptop to a production cluster.
      • Scale infrastructure based on demand.

      This talk will also present the current use cases of Kubeflow and how teams from other industries have been utilising the cloud to scale their machine learning operations.

      Speaker: Mr Salman Iqbal (ONS / Learnk8s)
    • 15:30 16:00
      Coffee break 30m
    • 16:00 16:20
      ESoWC: A Machine Learning Pipeline for Climate Science 20m

      Thomas Lees [1], Gabriel Tseng [2], Simon Dadson [1], Steven Reece [1]
      [1] University of Oxford
      [2] Okra Solar, Phnom Penh

      As part of the ECMWF Summer of Weather Code we have developed a machine learning pipeline for working with climate and hydrological data. We had three goals for the project. Firstly, to bring machine learning capabilities to scientists working in hydrology, weather and climate. This meant we wanted to produce an extensible and reproducible workflow that can be adapted for different input and output datasets. Secondly, we wanted the pipeline to use open source datasets in order to allow for full reproducibility. The ECMWF and Copernicus Climate Data Store provided access to all of the ERA5 data. We augmented this with satellite-derived variables from other providers. Finally, we had a strong focus on good software engineering practices, including code reviews, unit testing and continuous integration. This allowed us to iterate quickly and to develop a code base that can be extended and tested to ensure that core functionality is preserved. Here we present our pipeline, outline some of our key learnings and show some example results.

      Speakers: Mr Thomas Lees (University of Oxford), Gabriel Tseng (will present remotely) (Okra Solar)
    • 16:20 16:40
      A reproducible flood forecasting case study using different machine learning techniques 20m

      Lukas Kugler(1), Sebastian Lehner(1,2)
      (1)University of Vienna, Department of Meteorology and Geophysics, Vienna
      (2)Zentralanstalt für Meteorologie und Geodynamik, Vienna

      Extreme weather events can cause massive economic and human damage due to their inherent rarity and the usually large amplitudes associated with them. For this reason, forecasting with a focus on extreme events is essential.

      The core of this study is the prediction of extreme flooding events using various machine learning methods (Linear Regression, Support Vector Regression, Gradient Boosting Regression, Time-Delay Neural Net). These will be compared with each other, with a persistence forecast and with forecast reruns from GloFAS (the Global Flood Awareness System). The whole work was carried out with Jupyter Notebooks using a small sample data set, all of which is available on GitHub [1] and hence open source and fully reproducible.

      The data sources are the ERA5 reanalysis, from which various meteorological variables are used as predictors, and the GloFAS 2.0 reanalysis, from which river discharge is used as the predictand. The area of interest is the upper Danube catchment. All of the data is available from 1981 to 2016 and was divided into 25 years for model training, 6 years for validation and hyperparameter optimization, and 5 years for an independent testing period.
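
      The 25/6/5-year split described above can be sketched as follows. This is a minimal illustration only: it assumes the three periods are consecutive and ordered training, validation, testing, which the abstract does not state, and the helper name `split_years` is hypothetical rather than taken from the authors' notebooks.

```python
def split_years(first_year, n_train, n_valid, n_test):
    """Return (train, valid, test) as inclusive (start_year, end_year) tuples.

    Assumption: the periods are consecutive and in chronological order.
    """
    train = (first_year, first_year + n_train - 1)
    valid = (train[1] + 1, train[1] + n_valid)
    test = (valid[1] + 1, valid[1] + n_test)
    return train, valid, test


# 1981 to 2016 inclusive is 36 years: 25 + 6 + 5
train, valid, test = split_years(1981, 25, 6, 5)
print("train:", train)
print("valid:", valid)
print("test: ", test)
```

      Under this assumed ordering the test period would end in 2016, matching the end of the available data, and would contain the 2013 flooding event.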

      Since the focus is on extreme flooding events, times within the test period containing steep increases in river discharge are evaluated with 14-day forecasts, with varying initialisation times. Additionally, a comparison with GloFAS 'forecast reruns' is carried out for the 2013 flooding event in Central Europe.

      This work was supported through the ESoWC (ECMWF Summer of Weather Code) 2019 program from ECMWF (European Centre for Medium-Range Weather Forecasts), Copernicus and the Department of Meteorology and Geophysics of the University of Vienna.

      Speaker: Sebastian Lehner (Zentralanstalt für Meteorologie und Geodynamik)
    • 16:40 17:00
      Remote presentation: Singularity Containers 20m

      Singularity is an open source container platform which is ideally suited to reproducible scientific workflows. With Singularity you can package tools or entire workflows into a single container image file that can be flexibly deployed on High Performance Computing (HPC), cloud, or local compute resources. Whether you start from a Docker image or build from scratch, your containers can be cryptographically signed and verified, allowing confidence in the provenance of your analyses. GPUs and high performance networking hardware are supported natively for efficient execution of large models.

      We will give a brief overview of Singularity - what it is, where to get it, and how to use it. As an example, we'll show how Singularity can be used to run a workflow where different tools are installed on different Linux distributions, where it provides flexibility by freeing the user from the constraints of their environment.

      Speaker: Dr David Trudgian (Sylabs Inc.)
    • 17:00 17:10