A Statistical Methods Manuscript#

About this case study#

The purpose of this case study is to discuss the different components of research reproducibility implemented in designing and conducting a statistical study. With the help of their manuscript, the authors provide a catalog of methods used in their research and cross-reference them to the respective sections discussed in this Guide for Reproducible Research.

About the Manuscript#

  • Title: A review of Bayesian perspectives on sample size derivation for confirmatory trials[KGL+20].

  • Authors: Kevin Kunzmann, Michael J. Grayling, Kim May Lee, David S. Robertson, Kaspar Rufibach, James M. S. Wason

  • Publication month & year: June 2020

Overview#

The manuscript [KGL+20] itself is concerned with the problem of deriving a suitable sample size for a clinical trial. This is a classical problem in statistics and particularly important in medical statistics where collecting trial data is extremely expensive and ethical considerations need to be addressed. The manuscript reviews and extends methods to systematically incorporate planning uncertainty into the sample size derivation.

Citation summary#

The manuscript can be cited in plain text APA format:

Kunzmann, K., Grayling, M. J., Lee, K. M., Robertson, D. S., Rufibach, K., & Wason, J. (2020). A review of Bayesian perspectives on sample size derivation for confirmatory trials. arXiv preprint arXiv:2006.15715.

BibTeX format:

@article{kunzmann2020,
  title   = {A review of Bayesian perspectives on sample size derivation for confirmatory trials},
  author  = {Kunzmann, Kevin and Grayling, Michael J and Lee, Kim May and Robertson, David S and Rufibach, Kaspar and Wason, James},
  journal = {arXiv preprint arXiv:2006.15715},
  year    = {2020}
}

Catalog of different methods for reproducible research#

Version control#

The git repository kkmann/sample-size-calculation-under-uncertainty contains all code required to produce the manuscript arXiv:2006.15715 from scratch. For an in-depth explanation of the importance of version control for reproducible research, see Version Control Systems.

Research data management#

In this particular case, data management aspects are not an issue since the manuscript is exclusively based on hypothetical examples and no external, protected data is required.

Literate programming#

The manuscript [KGL+20] itself is written in and built with LaTeX. The source files are contained in the subfolder latex/. Plain TeX files were preferred over literate programming solutions like knitr for R to facilitate the use of dedicated LaTeX editors like Overleaf. This means, however, that all figures used in the manuscript need to be created separately. A dedicated Jupyter notebook, notebooks/figures-for-manuscript.ipynb, which combines code with rudimentary descriptions, is provided to that end.

Reproducible software environment#

Although this means that all code required to compile the manuscript from scratch is available in one self-contained repository, it is not yet sufficient to ensure reproducibility. Installing LaTeX, Jupyter, and R in the versions needed to run all code can still be challenging for less experienced users. To prevent this from keeping interested readers from experimenting with the code, a combination of the Python package repo2docker and a free BinderHub hosting service is used. For details on these techniques, see the chapters on Binder and BinderHub. This allows interested individuals to start an interactive version of the repository with all required software preinstalled, in exactly the right versions. Note that it is possible to provide version-stable Binder links:

[Binder badge: JupyterLab view] [Binder badge: Shiny app]

These badges point to the state of the repository at a specific point in time (via the git tagging feature). This means that the links will remain valid and unchanged even if the contents of the repository are corrected later. Binder supports multiple user interfaces. The first badge uses this to provide a JupyterLab integrated development environment (IDE) view of the repository, in which you can explore files, run the Jupyter notebook, or open a shell for further commands. The second badge directly opens an interactive Shiny app that illustrates some of the points discussed in the manuscript and requires no familiarity with programming at all. All relevant configurations for Binder are located in the subfolder .binder.
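A version-stable link of the kind described above can be assembled from the repository owner, the repository name, and a git tag. The following is a minimal sketch that assumes the public mybinder.org URL scheme for GitHub repositories. The helper name binder_url and the tag v0.0.0 are placeholders chosen for illustration; they are not taken from the repository.

```shell
# Build a mybinder.org launch URL pinned to a git ref (tag, branch, or commit).
# usage: binder_url OWNER REPO REF [URLPATH]
binder_url() {
    owner=$1
    repo=$2
    ref=$3
    urlpath=${4:-}
    url="https://mybinder.org/v2/gh/${owner}/${repo}/${ref}"
    # Optionally select the user interface, for example "lab" for JupyterLab.
    if [ -n "$urlpath" ]; then
        url="${url}?urlpath=${urlpath}"
    fi
    printf '%s\n' "$url"
}

# "v0.0.0" is a hypothetical tag; substitute a tag that exists in the repository.
binder_url kkmann sample-size-calculation-under-uncertainty v0.0.0 lab
```

Because the ref is part of the URL, a link built from a tag keeps pointing at the same snapshot, whereas a link built from a branch name follows later commits.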

Workflow management using Snakemake#

Since JupyterLab also allows you to open a shell in the repository instance launched via a Binder link, another feature of the repository can be used to reproduce the entire manuscript from scratch. The Python workflow manager Snakemake was used to define all required steps in a Snakefile. To execute this workflow, open a shell in the online JupyterLab instance. Once the user interface has finished loading, open a new terminal and type

snakemake -F --cores 1  manuscript

This will execute all the required steps in turn:

  1. create all plots by executing the Jupyter notebook file

  2. compile the actual latex/main.pdf file from the LaTeX sources

You should then see a main.pdf file in the latex subfolder.

Support for local instantiation of the software environment#

The Python package repo2docker can also be used locally to reproduce the same computing environment. To this end, you will need to have Python and Docker installed. For details on Docker and container technologies in general, please see the chapter on reproducible environments and containers. Then simply clone the repository on your local machine using the commands

git clone git@github.com:kkmann/sample-size-calculation-under-uncertainty.git
cd sample-size-calculation-under-uncertainty

After cloning the repository, you can build and run a Docker container locally from the configuration files provided in the .binder/ folder with the following command

jupyter-repo2docker -E .

The container is started automatically after the build completes and you can use the usual Jupyter interface in your browser by following the link printed by repo2docker to explore the repository locally.

Use of continuous integration#

Although not necessary for the reproducibility of this manuscript, the repository also makes use of continuous integration (CI) via GitHub Actions. GitHub Actions runners are provided directly by GitHub (see the chapter on continuous integration with GitHub Actions).

The repository defines two workflows in the .github/workflows directory. The first one, .github/workflows/build_and_run.yml, is triggered whenever the master branch of the repository is updated and the specifications in .binder have changed. It builds the container, pushes it to the public container registry Docker Hub, and then checks that the Snakemake workflow runs without problems. The second one, .github/workflows/run.yml, runs when the .binder folder has not changed and uses the pre-built Docker container to run the Snakemake workflow. The latter saves a lot of computing time, since the computational environment changes much less often than the contents of the repository. The use of CI thus makes it easier to check pull requests for technical integrity and makes the latest version of the required container available for direct download. Instead of building the container locally with repo2docker, you can therefore download it directly and execute the workflow using the following commands

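# start the pre-built container in the background (the image is pulled if needed)
docker run -d --name mycontainer kkmann/sample-size-calculation-under-uncertainty
# run the Snakemake workflow inside the running container
docker exec mycontainer \
    snakemake -F --cores 1 manuscript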
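The choice between the two CI workflows described above comes down to one question: did anything under .binder change? The sketch below illustrates that decision in plain shell. It is not how the repository implements it (GitHub Actions would normally express this through trigger conditions in the workflow files themselves), and the function name choose_workflow and the example file names are hypothetical.

```shell
# Read changed file paths on stdin (one per line) and print the workflow
# that would apply under the rule described in the text.
choose_workflow() {
    if grep -q '^\.binder/'; then
        # The environment specification changed: rebuild the container first.
        echo "build_and_run.yml"
    else
        # Only repository contents changed: reuse the pre-built container.
        echo "run.yml"
    fi
}

# Hypothetical change sets, for illustration only.
printf '%s\n' '.binder/environment.yml' 'latex/main.tex' | choose_workflow
printf '%s\n' 'latex/main.tex' | choose_workflow
```

On a local clone, the same function could be fed real data with `git diff --name-only HEAD~1 | choose_workflow`.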

Long term archiving and citability#

The GitHub repository is also linked with zenodo.org to ensure long-term archiving (see Citing Software).

Note that a DOI provided by Zenodo can also be used with BinderHub to turn a repository snapshot backed up on Zenodo into an interactive environment (see this blog post).
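As a minimal sketch of what such a link looks like, the helper below builds a launch URL from a Zenodo DOI, assuming the public mybinder.org URL scheme for Zenodo records. The function name binder_zenodo_url and the DOI shown are placeholders, not the DOI of this repository.

```shell
# Build a mybinder.org launch URL from a Zenodo DOI.
# usage: binder_zenodo_url DOI
binder_zenodo_url() {
    doi=$1
    printf 'https://mybinder.org/v2/zenodo/%s/\n' "$doi"
}

# Placeholder DOI for illustration; substitute the one shown on the Zenodo record.
binder_zenodo_url 10.5281/zenodo.0000000
```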