I’ve just come back from a workshop run by the Software Sustainability Institute on Docker and reproducibility. Widely used in industry and academia, Docker, the containerisation technology, is one of several tools that support running software across different platforms in a sane way.
Two or three years ago, there was a huge amount of excitement about using Puppet and Chef to build systems sustainably and reproducibly. Then Docker appeared and the focus moved to it as the hot new thing. As part of this shift, it is now also considered part of the toolkit for making experiments portable and repeatable. In theory, an image runs the same software on different underlying hardware without the overhead of a hypervisor.
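To make that portability concrete, here is a minimal sketch of a Dockerfile that pins its base image and dependencies so the environment rebuilds the same way on any host. The image tag, file names and package versions are illustrative assumptions of mine, not something presented at the workshop:

```dockerfile
# Pin the base image to an exact tag (and ideally a digest) so rebuilds
# pull the same environment rather than whatever "latest" points to today.
FROM python:3.10-slim

# Pin dependency versions explicitly; unpinned installs drift over time.
# requirements.txt would list exact versions, e.g. numpy==1.24.0.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy the analysis code and declare how the experiment is run.
COPY analysis.py /app/analysis.py
WORKDIR /app
CMD ["python", "analysis.py"]
```

Anyone with Docker installed can then build and run the container (`docker build -t myexperiment .` followed by `docker run myexperiment`) and, in principle, get the same software stack regardless of the underlying machine.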
Although Docker has made strides in sorting out some of the security issues of its early versions, it does not typically get run on HPC clusters; there, Singularity has grown to solve some of the security issues of running software from unknown users on a shared system. (We did allow Docker containers to run on our small cluster, but we also knew the software and the users.)
Matthew Upson’s talk on the use of Docker as part of a reproducible data science process at the Government Digital Service (some of it aspirational) shows, I think, a valid way forward, allowing departments and other bodies to use a variety of software versions across different systems.
James Mooney and David Gerrard’s talk provoked questions about archiving software in general at the end of a project: not just the artefacts but also the context, such as email, IRC logs and so on. The project has published a post about their talk and the issues raised on its website.
Overall the two days were interesting, and hearing about issues across various disciplines helped me understand the requirements driving the use of containers. It is also a salutary reminder, as if we needed one, that reproducibility is a set of practices and not tied to one tool chain or set of notions.