Workshop: Building reproducible workflows for earth sciences

Keynote speakers

14 October - 11:00

Ana Trisovic (IQSS, Harvard University; Twitter: @atrisovic)

Responding to reproducibility challenges from physics to social sciences

Facilitating research reproducibility is a pressing issue across all sciences. However, since different challenges arise in the natural and social sciences, domain-specific strategies might be the best way to promote reproducibility. This talk presents experiences from two different disciplines: energy economics and high-energy physics. It discusses and compares potential technical and conceptual solutions for facilitating reproducibility and openness in the two fields. On the energy economics side, the ubiquitous use of proprietary software and sensitive data encumbers efforts to share research and thus inhibits reproducibility. I present insights around these issues based on interviews with faculty and staff at the Energy Policy Institute at the University of Chicago. On the high-energy physics side, vast amounts of data and complex analysis workflows are among the main barriers to reproducibility. I present domain-tailored solutions to these problems, including projects called CERN Open Data, CERN Analysis Preservation and REANA - Reusable Analysis. Finally, I discuss the types of tools that can be used to facilitate reproducibility and sharing, detailing their ability to address various challenges across different disciplines.


15 October - 9:10

Carol Willing (Willing Consulting / Project Jupyter; Twitter: @WillingCarol)

Scaling reproducible research with Project Jupyter

Jupyter notebooks have become the de facto standard tool in science and data science for producing computational narratives. Over five million Jupyter notebooks exist on GitHub today. Beyond the classic Jupyter notebook, Project Jupyter's tools have evolved to provide end-to-end workflows for research that enable scientists to prototype, collaborate, and scale with ease. JupyterLab, a web-based, extensible, next-generation interactive development environment, enables researchers to combine Jupyter notebooks, code and data to form computational narratives. JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening them with installation and maintenance tasks. Binder builds upon JupyterHub and provides free, shareable, interactive computing environments to people all around the world.


16 October - 9:10

Markus Konkol (University of Münster, Institute for Geoinformatics; Twitter: @MarkusKonkol)

Publishing Reproducible Geoscientific Papers: Status quo, benefits, and opportunities

Open reproducible research (ORR) is the practice of publishing the source code and the datasets needed to produce the computational results reported in a paper. Since many geoscientific articles include geostatistical analyses and spatiotemporal data, reproducibility should be a cornerstone of the computational geosciences but is rarely realized. Furthermore, publishing scientific outcomes in static PDFs does not adequately report on computational aspects. Thus, readers cannot fully understand how the authors came to their conclusions and how robust these are to changes in the analysis. Consequently, it is difficult for reviewers to follow the analysis steps, and for other researchers to reuse existing materials. This talk starts with the obstacles that have prevented geoscientists from publishing ORR. To overcome these barriers, the talk suggests concrete strategies. One strategy is the executable research compendium (ERC), which encapsulates the paper, code, data, and the entire software environment needed to produce the computational results. Such concepts can assist authors in adhering to ORR principles to ensure high scientific standards. However, ORR is not only about reproducing results; it also brings a number of additional benefits. An ERC-based workflow, for example, allows authors to convey their computational methods and results by also providing interactive access to code and data, and allows readers to investigate the computational analysis in depth while reading the actual article, e.g. by changing the parameters of the analysis. Finally, the presentation introduces the concept of a binding: a binding connects those code lines and data subsets that produce a specific result, e.g. a figure or number. By also considering user interface widgets (e.g. a slider), this approach allows readers to interactively manipulate the parameters of the analysis to see how the results change.