Tools for a Reproducible, Shareable & Communicable Science

Diplôme Universitaire Public Health Data Science

Published: May 25, 2026

Foreword and Instructions

Note: 🎯 Learning Objectives

By the end of this class you will be able to:

  1. Explain why reproducibility matters in science and identify the main barriers.
  2. Organize a data-analysis project so that any colleague (including your future self) can re-run it from scratch.
  3. Write a self-contained, reproducible dynamic document with Quarto that blends narrative text, code, and output together.
  4. Produce publication-quality visualizations with ggplot2 (R) or seaborn/matplotlib (Python) and save them programmatically.
  5. Audit your own project against the standard reproducibility checklist before sharing it.

Instructions: how to navigate this class

This class walks you through 4 Parts (augmented by this introduction and a concluding summary). Along the way, you will meet three different types of activity, identified by the icons below:

| Icon | Type | Purpose |
|---|---|---|
| 📖 | Targeted reading | A short external resource to read (~5–10 min). Follow the links; don't skip them. |
| 💡 | Concept definition | Key ideas with illustrations and worked examples. |
| 🧑‍💻 | Hands-on exercise | You write and run the code yourself. Blocks are not pre-executed here. |

Choose R or Python: both languages are shown in the exercises via tabs, but you only need to do one (according to your preferences and familiarity).

Total time needed: approximately 3 hours. Do not hesitate to take a short break between Parts.


⚙️ Setup (to be completed before starting the class)

Install the latest version of R → https://cran.r-project.org/
Install the latest version of RStudio → https://posit.co/download/rstudio-desktop/

Then paste the following lines into your console:

install.packages(c(
  "tidyverse",    # ggplot2, dplyr, readr, …
  "gapminder",    # our main dataset
  "renv",         # dependency management
  "here",         # portable file paths
  "quarto",       # render .qmd from R
  "knitr",        # knitting engine
  "sessioninfo"   # document your environment
))

Verify Quarto is available: in the RStudio Terminal tab, type quarto --version.
You should see a version number ≥ 1.4.

Install Python ≥ 3.10 → https://www.python.org/downloads/
Install VS Code → https://code.visualstudio.com/ (with the Python extension)
or use JupyterLab: pip install jupyterlab

Create and activate a virtual environment, then install packages:

# Create project folder and virtual environment
mkdir repro-phds && cd repro-phds
python -m venv .venv

# Activate (choose the line that matches your OS)
source .venv/bin/activate       # for macOS or Linux
.venv\Scripts\activate          # for Windows (PowerShell)

# Install python libraries
pip install pandas matplotlib seaborn gapminder jupyterlab quarto

Verify that Quarto is available by running quarto --version in your terminal; it should return a version number. If the command is not found, install the Quarto CLI from https://quarto.org/docs/get-started/.
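The same checks can also be scripted. The sketch below is only an illustration (the function names are invented for this example): it tests the running interpreter against the Python 3.10 minimum given above and looks for a `quarto` executable on the PATH using the standard library.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the setup instructions


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def quarto_on_path():
    """Return the path of the quarto executable, or None if it is not found."""
    return shutil.which("quarto")


print("Python OK:", python_is_recent_enough())
print("Quarto found at:", quarto_on_path())
```

If the second line prints `None`, the Quarto command-line tool is not visible from your terminal session.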

Warning: ⏱ Before you start

If you run into installation errors, search the error message on the web: troubleshooting is a core data-science skill! Most errors have a Stack Overflow solution on the first page of results. GenAI chatbots such as Le Chat from Mistral AI can also be very helpful.


1 Reproducibility: what it is and why it matters

⏱ ~45 minutes

1.1 📖 Readings: the reproducibility crisis

Note: 📖 Targeted reading (~10 min)

Before you read, think about: Have you ever tried to re-run an analysis and gotten different numbers? Have you read a Methods section and thought “I have no idea how they actually did this”?

  1. Read the Nature News Feature “1,500 scientists lift the lid on reproducibility” (Baker, 2016).
    Focus on the survey bar chart and the section “Barriers to reproducibility”:
    👉 https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
    (~5 min)

  2. Skim the Turing Way overview page. Read “Why this is important” and the bullet list of barriers:
    👉 https://book.the-turing-way.org/reproducible-research/overview
    (~5 min)

Reflection: In public health the stakes are especially high. Irreproducible findings can influence clinical guidelines or policy decisions. Can you recall one published COVID-19 study from 2020 that had to be retracted or substantially corrected? (Hint: look up “Surgisphere hydroxychloroquine Lancet” or “Hydroxychloroquine and azithromycin Gautret Raoult International Journal of Antimicrobial Agents”.)


1.2 💡 The Reproducibility Spectrum

“Reproducible” is used loosely, and several definitions coexist across scientific fields. In this class, we will use the definitions below, which focus on computational reproducibility:

| | Same data? | Same code/method? | What does it test? |
|---|---|---|---|
| Reproducible | ✅ | ✅ | Can you re-run the analysis and get the exact same numbers? |
| Replicable | ❌ | ✅ | Does the same method generalise to new data? |
| Robust | ✅ | ❌ | Do different analytical choices reach the same conclusion? |
| Generalisable | ❌ | ❌ | Does the finding hold beyond the original context? |
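The four definitions amount to a lookup on two yes/no questions. As a small illustration (the function name `classify` is made up for this sketch), the table can be written as a dictionary keyed by the pair (same data, same code):

```python
def classify(same_data: bool, same_code: bool) -> str:
    """Map the two questions of the table above to the matching term."""
    labels = {
        (True, True): "reproducible",
        (False, True): "replicable",
        (True, False): "robust",
        (False, False): "generalisable",
    }
    return labels[(same_data, same_code)]


# Re-running the original code on the original data tests reproducibility
print(classify(same_data=True, same_code=True))   # → reproducible
# Applying the same code to newly collected data tests replicability
print(classify(same_data=False, same_code=True))  # → replicable
```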

The diagram below (from The Turing Way community) relates these 4 levels to one another:

A 2x2 matrix showing reproducible, replicable, robust and generalisable research across axes of same/different data and same/different code

Reproducibility matrix. The Turing Way / Scriberia, CC-BY 4.0. Source: https://book.the-turing-way.org

1.3 💡 Why is reproducibility important?

What is the value of a non-reproducible article?
If it isn't reproducible, is it science?
Should we just trust each other?

Public funding demands accountability, and scientific credibility depends on it. Reproducibility helps achieve both.

  • 📙 Scientific journals increasingly require it: at a growing number of journals, reviewers verify code, data, and workflows before final acceptance
  • ⚔️ It acts as a methodological shield: it reduces the likelihood of undetected errors & spurious findings
  • 🇪🇺🇫🇷 Institutional requirements: the EU, the ANR and the HCERES all require some level of reproducibility for the research they fund
  • 🧱 Increased impact: reproducible articles are cited more and extended more (a more trustworthy foundation for future work)

In Public Health, it carries additional importance:

  • Policy decisions are made from published findings. A wrong or irreproducible result can harm patients and populations at scale.
  • Publicly funded research should be accountable. If the public paid for the study, the code and data should (where possible) be publicly accessible. This idea is closely related to open science, a concept connected to reproducibility but distinct from it (➡️🌐 more details here)

Barriers, motivations and solutions

| Barrier | Concrete example | Practical solution |
|---|---|---|
| Cultural | “My PI never does this” | Frame it as a career investment, not overhead |
| Technical | Simulation takes 3 days to run | Store pre-computed intermediate results; provide a fast reduced-run mode |
| Legal | Patient data under GDPR | Generate synthetic data of same structure; grant temporary restricted auditor access |
| Time | “I'll clean up the code later” | Start early: it costs extra work upfront, but saves 4x more at revision |
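To make the "synthetic data of same structure" idea concrete, here is one deliberately simple sketch, assumed for illustration only: each column is shuffled independently, so column names, types and marginal distributions are preserved while rows no longer correspond to real individuals. The patient records are invented, and this approach is not a formal privacy guarantee (rare values still appear in the output).

```python
import random


def shuffle_columns(records, seed=0):
    """Return new records in which every column is permuted independently."""
    rng = random.Random(seed)           # fixed seed: the output is repeatable
    columns = list(records[0].keys())
    shuffled = {}
    for col in columns:
        values = [row[col] for row in records]
        rng.shuffle(values)             # breaks the link between columns
        shuffled[col] = values
    return [
        {col: shuffled[col][i] for col in columns}
        for i in range(len(records))
    ]


# Hypothetical patient records, invented for this example
patients = [
    {"age": 34, "sex": "F", "sbp": 118},
    {"age": 61, "sex": "M", "sbp": 142},
    {"age": 47, "sex": "F", "sbp": 131},
    {"age": 29, "sex": "M", "sbp": 110},
]

synthetic = shuffle_columns(patients, seed=42)
print(synthetic)
```

Because associations between columns are destroyed, such data are suitable for checking that the code runs end to end, not for reproducing the scientific result itself.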

Reproducibility is a safeguard, not a burden

Reproducibility earns trust. Scientists who care about reproducibility are more efficient in the long term, because they can build on their own past work more easily. Reproducibility brings benefits to science at several scales:

Benefits of reproducibility at 3 different levels

| 🌍 Research field | 🏥 Research group | 🧑‍🔬 Yourself |
|---|---|---|
| Stronger methodological credibility | Faster transmission to collaborators | Faster article completion & revisions |
| Cumulative, extendable knowledge | Reduced technical debt | Transparent & trustable for audience |
| Lower risk of published errors | Clear, defensible archival | Hard skills & efficient workflow |
More citations

Important: 🔑 Key takeaways
  • Reproducibility is a scientific requirement: a scientific result that cannot be reproduced has very little (if any) added value. In this class, we focus on computational reproducibility, which is only one layer of reproducible science (data generation, for instance, is another).
  • Reproducibility is a continuum: universal, absolute, permanent reproducibility is an ideal that is out of reach. Nonetheless, we should strive to improve practical reproducibility, thinking about future reuse of our work and about how we establish trust in scientific results.

1.4 🧑‍💻 Exercise 1: Create Your Reproducible Project Skeleton

Tip: 🧑‍💻 Exercise 1: Project Setup (25 min)

Goal: Create a well-organised, portable project that you will build on throughout this class.


Step 1: Create a project

  1. Open RStudio → File → New Project → New Directory → New Project
  2. Name it repro-phds, choose a folder you will remember
  3. ✅ Tick “Use renv with this project”: this sets up dependency tracking automatically
  4. Click Create Project

Notice the .Rproj file in the Files pane. Whenever you open this file, RStudio automatically sets the working directory to your project root. This alone eliminates most path-related bugs.
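Python has no .Rproj file, but the same idea (resolve every path from the project root rather than from wherever the script happens to run) can be sketched with `pathlib`. The helper name `find_project_root` and the choice of marker files are assumptions made for this example, not a standard API.

```python
from pathlib import Path

# Files or folders whose presence is assumed to mark the project root
MARKERS = ("requirements.txt", "renv.lock", ".git")


def find_project_root(start, markers=MARKERS):
    """Walk upwards from `start` until a folder containing a marker is found."""
    current = Path(start).resolve()
    for folder in (current, *current.parents):
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"No project root found above {current}")


# Usage (from any script inside the project):
#   root = find_project_root(Path.cwd())
#   figure_dir = root / "output" / "figures"
```

Paths built this way stay valid whether the script is launched from the project root or from a sub-folder such as analysis/.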

# If you have not done so previously, create the directory and the environment below
mkdir repro-phds && cd repro-phds

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate          # macOS/Linux
# .venv\Scripts\activate           # Windows: run the line corresponding to your OS

# Start Python
python

Step 2: Create a standard folder structure

Run this in your console (inside the project):

dirs <- c(
  "data/raw",           # Original data: NEVER overwrite
  "data/processed",     # Cleaned / transformed data
  "R",                  # Reusable R functions
  "analysis",           # .qmd scripts go here
  "output/figures",     # Saved plots  → fig01_..., fig02_...
  "output/tables"       # Exported tables → tab01_..., tab02_...
)

invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))
cat("✅ Folders created!\n")

Run the following python code (with .venv active, inside repro-phds/):

import os

dirs = [
    "data/raw",
    "data/processed",
    "src",               # Python source files / modules
    "analysis",          # Notebooks or scripts
    "output/figures",
    "output/tables"
]

for d in dirs:
    os.makedirs(d, exist_ok=True)

print("✅ Folders created!")

Your project should now look like this:

repro-phds/
├── data/
│   ├── raw/           ← original data  (treat as read-only)
│   └── processed/     ← cleaned data
├── R/  (or src/)      ← reusable functions / modules
├── analysis/          ← your scripts and notebooks
├── output/
│   ├── figures/       ← fig01_name.pdf, fig01_name.png, …
│   └── tables/        ← tab01_name.csv, …
└── README.md          ← you'll write this next

Step 3: Write a README.md

Create README.md at the project root. A README is the first thing anyone (including future-you) reads. It must answer: What is this? How do I run it?

Copy the template below into your README.md and fill in the blanks:

# repro-phds: Reproducibility Practice (Public Health)

## Description
A self-contained data analysis project produced during the
"Tools for Reproducible Science" class (Bachelor Public Health, [Year]).

## How to reproduce all results

1. Open `repro-phds.Rproj` in RStudio
   (or activate `.venv` in Python: `source .venv/bin/activate`).
2. Restore dependencies:
   - R: run `renv::restore()` in the console
   - Python: `pip install -r requirements.txt`
3. Render the main report:
   open `analysis/report.qmd` and click **Render**,
   or run `quarto render analysis/report.qmd`.

## Expected runtime
< 1 minute on a standard laptop.

## Session information
(Paste the output of `sessioninfo::session_info()` here
after completing Exercise 4.)

Step 4: Lock your dependencies

renv was already initialised when you created the project.
After installing packages, record their exact versions:

renv::snapshot()
# → "renv.lock has been updated."

This file is your insurance against “it worked last year” bugs.
A collaborator who runs renv::restore() will get the exact same packages.

In the terminal, execute the following bash code:

pip freeze > requirements.txt

# Verify it was created
cat requirements.txt

A collaborator can restore this exact environment with:

pip install -r requirements.txt
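A requirements file only restores the exact environment if every entry is pinned to a version. The sketch below (a hypothetical helper with invented example entries) flags lines that lack an `==` pin; note that it would also flag the `name @ url` and `-e` forms that `pip freeze` sometimes emits, which deserve a manual look anyway.

```python
def unpinned_requirements(lines):
    """Return the requirement entries that are not pinned with '=='."""
    unpinned = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if entry and "==" not in entry:
            unpinned.append(entry)
    return unpinned


# Invented example entries; the version numbers are placeholders
example = ["pandas==2.2.2", "seaborn", "# plotting", "matplotlib>=3.8"]
print(unpinned_requirements(example))  # → ['seaborn', 'matplotlib>=3.8']

# Usage on a real file:
#   with open("requirements.txt") as handle:
#       print(unpinned_requirements(handle))
```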

✅ Self-check

Verify all five points before moving on:

  • getwd() (R) or os.getcwd() (Python) returns the path to repro-phds/
  • The six sub-folders exist in the file explorer / Files pane
  • README.md is at the project root and has all four sections
  • renv.lock (R) or requirements.txt (Python) has been created
  • Closing and reopening the project works cleanly
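Part of this self-check can be automated. The following sketch is an illustration under the folder layout used in this exercise (the names `EXPECTED_DIRS`, `EXPECTED_FILES` and `missing_items` are invented here); the language-specific R/ or src/ folder and the lock file are left out so that it applies to both tracks.

```python
from pathlib import Path

# Layout from Step 2, minus the language-specific R/ or src/ folder
EXPECTED_DIRS = [
    "data/raw",
    "data/processed",
    "analysis",
    "output/figures",
    "output/tables",
]
EXPECTED_FILES = ["README.md"]


def missing_items(root):
    """List expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Usage from the project root: an empty list means the skeleton is complete
#   print(missing_items("."))
```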

Question to think about: Why should data/raw/ be treated as strictly read-only? What would happen if you accidentally saved a modified version over your original data file?
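One way to notice an accidental overwrite, sketched here as an assumption rather than a prescribed method, is to record a SHA-256 checksum for every raw file and compare the record later. The function names are invented for this example; only the standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def checksum_manifest(folder):
    """Map each file under `folder` (relative path) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """Return files whose checksum differs from, or is missing in, `new`."""
    return sorted(name for name in old if new.get(name) != old[name])


# Usage: store checksum_manifest("data/raw") once (e.g. as JSON),
# then compare it with a fresh manifest before each analysis run.
```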



2 Literate Programming with Quarto

⏱ ~45 minutes

2.1 📖 Reading: One Document to Rule Them All

Note: 📖 Targeted Reading (5 min)

Donald Knuth introduced literate programming in the 1980s: instead of writing code meant primarily for computers and sprinkling it with comments, you write a narrative document meant for humans that contains the code. The output is a document that is the analysis.

  1. Read the Quarto “Hello, Quarto” guide and try to render the example document if you can:
    👉 https://quarto.org/docs/get-started/hello/rstudio.html

  2. Keep the Quarto cheatsheet open as a reference throughout this block:
    👉 https://rstudio.github.io/cheatsheets/quarto.pdf

The central idea: When figures and numbers are computed inside your .qmd file, there is no copy-paste step where a stale number can slip through. Re-run the document and every value updates automatically.


2.2 💡 Anatomy of a .qmd File

A Quarto file has three building blocks:

┌──────────────────────────────────────────────────────┐
│  1. YAML header       title, author, output type...  │
│──────────────────────────────────────────────────────│
│  2. Markdown text     prose, headings, lists, links  │
│──────────────────────────────────────────────────────│
│  3. Code chunks       R or Python code + its output  │
└──────────────────────────────────────────────────────┘

Quarto schematics

A colourful illustration showing wizards casting spells to turn R code into beautiful output documents

Figure 1: Artwork by Allison Horst, CC-BY 4.0. Quarto (like its precursor R Markdown) lets you blend code and narrative text into a single dynamic document.

1. The YAML header

The YAML header sits between --- markers at the very top of your document. It sets the defaults that control the output and its overall format, written in YAML.
Be careful: indentation is significant in YAML.

---
title: "Global Health Trends: Gapminder Data"
author: "Your Name"
date: today                   # auto-fills today's date
format:
  html:
    toc: true                 # table of contents
    code-fold: true           # hide code by default (reader can expand)
    embed-resources: true     # self-contained HTML file (portable)
execute:
  echo: true                  # show code in output by default
  warning: true               # keep showing warnings
  message: false              # suppress messages (eg when loading packages)
---

2. Markdown text

Between code chunks, narrative text uses plain Markdown (lightweight formatting). Below is an example of markdown basics:

# Heading level 1
## Heading level 2

A paragraph. Make text **bold**, *italic*, or `code-styled`.

- Bullet list item one
- Bullet list item two

1. Numbered item one
2. Numbered item two

[Descriptive link text](https://url.com)
![Image alt text](path/to/image.png){width=50%}

3. Code chunks

Code chunks are enclosed in triple backticks, with {r} or {python} specifying which language engine runs the code. Chunk-specific options are set at the beginning of the chunk using the syntax #| keyword: value. Below is an example chunk, followed by its output:

```{r}
#| label: summary-table-r
#| echo: true
#| eval: true

library(gapminder)
head(gapminder)
```
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Key chunk options (written after #| inside the chunk):

| Option | Default | Effect |
|---|---|---|
| echo | true | Show the code in output |
| eval | true | Run the code |
| include | true | Include the chunk's code and output in the rendered document |
| message | true | Show package-load messages |
| warning | true | Show warnings |
| cache | false | Cache results (useful for slow code) |
| label | - | Name for cross-referencing figures/tables |
| fig-cap | - | Figure caption |
| fig-width / fig-height | - | Figure dimensions in inches |
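To make the `#| keyword: value` syntax concrete, here is a toy reader for the option lines at the top of a chunk. It is only a sketch to illustrate the format (Quarto's own parser handles full YAML values, whereas this one keeps every value as a plain string), and the function name is invented.

```python
def parse_chunk_options(chunk_source):
    """Collect the leading '#| key: value' lines of a chunk into a dict."""
    options = {}
    for line in chunk_source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#|"):
            key, _, value = stripped[2:].partition(":")
            options[key.strip()] = value.strip()
        elif stripped:
            break  # first line of real code: the option block is over
    return options


# The body of the example chunk shown earlier in this section
chunk = """#| label: summary-table-r
#| echo: true
#| eval: true

library(gapminder)
head(gapminder)
"""

print(parse_chunk_options(chunk))
# → {'label': 'summary-table-r', 'echo': 'true', 'eval': 'true'}
```

The sketch also shows why options must come first: anything after the first line of code is treated as ordinary comments.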

2.2.1 Inline code

You can also insert computed values directly into prose using `r ...` or `{python} ...`:

The dataset covers `r length(unique(gapminder$country))` countries from `r min(gapminder$year)` to `r max(gapminder$year)`

When rendered, this becomes:

The dataset covers 142 countries from 1952 to 2007

No copy-paste, no stale numbers, no manual update! ✨
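The principle behind inline code can be shown in plain Python, outside Quarto: the sentence is built from values computed on the data, so it cannot drift out of sync with them. The rows below are a tiny invented stand-in, not the Gapminder data.

```python
# Hypothetical (country, year) pairs standing in for a real dataset
rows = [
    ("France", 1952), ("France", 2007),
    ("Mali", 1952), ("Mali", 2007),
    ("Peru", 1977),
]

n_countries = len({country for country, _ in rows})
years = [year for _, year in rows]

# Every number in the sentence is computed, never typed by hand
sentence = (
    f"The dataset covers {n_countries} countries "
    f"from {min(years)} to {max(years)}."
)
print(sentence)  # → The dataset covers 3 countries from 1952 to 2007.
```

Adding or removing rows changes the printed sentence without any manual edit, which is exactly what inline code does inside a rendered document.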


2.3 🧑‍💻 Exercise 2: Your First Reproducible Health Report

Tip: 🧑‍💻 Exercise 2: First Quarto report (25 min)

Goal: Write a reproducible .qmd report that loads real global health data, computes summaries with inline values, and renders to a self-contained HTML file.

We use the Gapminder dataset: life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007. Life expectancy is a central metric in public health: it integrates mortality at all ages and reflects overall health system performance.


Step 1: Create the file

In RStudio: File → New File → Quarto Document.
Delete all template content. Save as analysis/report.qmd.

Paste the YAML header that matches your language (the first block is for R, the second for Python):

---
title: "Global Health Trends: Gapminder Analysis"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    theme: flatly
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
---

---
title: "Global Health Trends: Gapminder Analysis"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    theme: flatly
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
jupyter: python3
---

Step 2: Setup chunk (always the very first chunk!)

A setup chunk runs silently before anything else. It is a good place to set your random seed for instance.

```{r}
#| label: setup-r
#| include: false

set.seed(20260401)      # ← ensures the reproducibility of any downstream random operations

library(ggplot2)
library(dplyr)
library(gapminder)
library(here)
library(knitr)
```
```{python}
#| label: setup-py
#| include: false

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20260401)

# Load Gapminder from GitHub (stable URL)
url = "https://raw.githubusercontent.com/jennybc/gapminder/main/inst/extdata/gapminder.tsv"
gapminder = pd.read_csv(url, sep="\t")

sns.set_theme(style="whitegrid", palette="colorblind")
```
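The effect of fixing a seed can be checked directly. This sketch uses Python's standard `random` module rather than NumPy (an illustrative substitution; the principle is the same): the same seed yields the same draws, and a different seed yields different ones.

```python
import random


def draw_sample(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    return [rng.random() for _ in range(n)]


first = draw_sample(20260401)
second = draw_sample(20260401)
other = draw_sample(1)

print(first == second)  # → True: same seed, same numbers
print(first == other)   # → False: a different seed changes the draws
```

Using a dedicated generator object, instead of reseeding the global one, keeps the result independent of any other code that also draws random numbers.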

Step 3: Data section with inline values

## The Data

The **Gapminder** dataset is a global development database compiled 
by the Gapminder Foundation. It covers health, population and wealth 
indicators for **`r length(unique(gapminder$country))` countries** 
across **`r length(unique(gapminder$year))` time points**
(from `r min(gapminder$year)` to `r max(gapminder$year)`).

```{r}
#| label: tbl-preview-r
#| tbl-cap: "First 6 rows of the Gapminder dataset."

head(gapminder) |> knitr::kable()
```
## The Data

The **Gapminder** dataset is a global development database compiled 
by the Gapminder Foundation. It covers health, population and wealth 
indicators for **`{python} gapminder["country"].nunique()` countries** 
across **`{python} gapminder["year"].nunique()` time points**
(from `{python} gapminder["year"].min()` to `{python} gapminder["year"].max()`).

```{python}
#| label: tbl-preview-py

gapminder.head(6)
```

Step 4: Summary statistics section

## Summary Statistics

```{r}
#| label: tbl-continent-2007-r
#| tbl-cap: "Life expectancy and GDP per capita by continent in 2007."

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    `N countries`         = n(),
    `Mean life exp.`      = round(mean(lifeExp), 1),
    `Median life exp.`    = round(median(lifeExp), 1),
    `Mean GDP/capita`     = round(mean(gdpPercap), 0)
  ) |>
  knitr::kable(format.args = list(big.mark = ","))
```

In 2007, the global mean life expectancy was
**`r round(mean(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` years**,
with values ranging from
`r round(min(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` to
`r round(max(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` years.
## Summary Statistics

```{python}
#| label: tbl-continent-2007-py

# Wrapping the chain in parentheses avoids backslashes and indentation errors
(
    gapminder.query("year == 2007")
    .groupby("continent")
    .agg(
        n_countries=("country", "nunique"),
        mean_lifeExp=("lifeExp", "mean"),
        median_lifeExp=("lifeExp", "median"),
        mean_gdpPercap=("gdpPercap", "mean")
    )
    .round(1)
    .reset_index()
    .rename(columns={
        "continent": "Continent",
        "n_countries": "N Countries",
        "mean_lifeExp": "Mean Life Exp.",
        "median_lifeExp": "Median Life Exp.",
        "mean_gdpPercap": "Mean GDP/capita"
    })
)
```

Step 5: Render!

Click the blue Render button (top of the editor pane), or press Ctrl+Shift+K / Cmd+Shift+K.

Or from the console:

quarto::quarto_render("analysis/report.qmd")

Or in the terminal:

quarto render analysis/report.qmd
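Python users can trigger the same terminal command from a script, which is convenient when rendering is one step of a larger pipeline. The sketch below assumes the `quarto` command-line tool is installed; the helper names are invented, and `--to` is the CLI option that selects the output format.

```python
import shutil
import subprocess


def build_render_command(qmd_path, output_format="html"):
    """Return the quarto CLI call as a list of arguments."""
    return ["quarto", "render", str(qmd_path), "--to", output_format]


def render(qmd_path, output_format="html"):
    """Render a document, failing loudly if quarto is missing or errors out."""
    if shutil.which("quarto") is None:
        raise RuntimeError("quarto executable not found on PATH")
    subprocess.run(build_render_command(qmd_path, output_format), check=True)


print(build_render_command("analysis/report.qmd"))
# → ['quarto', 'render', 'analysis/report.qmd', '--to', 'html']

# Usage: render("analysis/report.qmd")
```

With `check=True`, a failed render raises an exception instead of passing silently, so a broken report cannot go unnoticed.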

Open the resulting report.html in your browser. Check that:

  • Inline values (number of countries, year range) appear correctly in the prose
  • The summary table has a caption
  • The “Code” toggle (from code-fold: true) works for each chunk

✅ Self-check

  • The HTML file renders without errors
  • The inline numbers in the text match the table values
  • Changing year == 2007 to year == 1997 and re-rendering updates all values automatically
  • You can open report.html in a web browser without RStudio or any other software (it is self-contained)

Bonus: Add a ## Introduction section above ## The Data with two or three sentences on why life expectancy is a meaningful public health indicator. Try adding a footnote using [^1] syntax and a simple citation using a .bib file.



3 Data Visualisation

⏱ ~40 minutes

3.1 📖 Reading: The Grammar of Graphics

Note: 📖 Targeted Reading (5 min)

Rather than memorizing dozens of plot types, you learn a grammar: a set of composable rules. Any statistical graphic can be described as data mapped to visual properties (position, colour, size, shape), rendered through geometric objects, on a coordinate system.

  1. Skim the ggplot2 homepage and the first “Getting Started” example:
    👉 https://ggplot2.tidyverse.org/
    (For Python users, read the seaborn tutorial overview instead:
    👉 https://seaborn.pydata.org/tutorial/introduction.html)

  2. Read the abstract and first page of Wilkinson's Grammar of Graphics, as summarised here (“Abstract” and “Introduction” paragraphs only):
    👉 https://vita.had.co.nz/papers/layered-grammar.html

Key idea: Once you understand the grammar, you can quickly build almost any chart by composing (modular) layers. You no longer need to search for “how do I make a violin plot?” at every turn.


3.2 💡 Building Plots Layer by Layer

The seven layers

| Component | In ggplot2 | Example |
|---|---|---|
| Data | ggplot(data = ...) | gapminder \|> filter(year == 2007) |
| Aesthetics | aes(x, y, colour, size, shape) | aes(x = gdpPercap, y = lifeExp, colour = continent) |
| Geometry | geom_*() | geom_point(), geom_line(), geom_boxplot() |
| Scales | scale_*() | scale_x_log10(), scale_colour_viridis_d() |
| Facets | facet_wrap() / facet_grid() | facet_wrap(~ continent) |
| Theme | theme_*() | theme_minimal() |
| Labels | labs() | labs(title = "...", x = "...", caption = "...") |

An artistic illustration of a person building a layered ggplot2 graphic

Artwork by Allison Horst, CC-BY 4.0. Every ggplot2 plot is built by layering these components.

A complete worked example

library(ggplot2)
library(gapminder)
library(dplyr)

gap_2007 <- gapminder |> filter(year == 2007)

ggplot(
  data = gap_2007,
  aes(x = gdpPercap, y = lifeExp,
      colour = continent,
      size   = pop)          # bubble size = population
) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +   # log GDP axis
  scale_colour_viridis_d(option = "plasma") +         # colour-blind safe
  scale_size(range = c(2, 14), guide = "none") +
  labs(
    title    = "Wealth and Health in 2007",
    subtitle = "Each bubble is a country; size ∝ population",
    x        = "GDP per capita (log scale, USD)",
    y        = "Life expectancy (years)",
    colour   = "Continent",
    caption  = "Source: Gapminder Foundation"
  ) +
  theme_minimal(base_size = 13)

# Python version; assumes the `gapminder` DataFrame from the setup chunk is loaded
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

gap_2007 = gapminder[gapminder["year"] == 2007].copy()

# Bubble sizes proportional to population
max_pop = gap_2007["pop"].max()
gap_2007["bubble_size"] = (gap_2007["pop"] / max_pop) * 1800 + 30

fig, ax = plt.subplots(figsize=(10, 6))

continents = gap_2007["continent"].unique()
palette = sns.color_palette("plasma", len(continents))

for i, cont in enumerate(sorted(continents)):
    subset = gap_2007[gap_2007["continent"] == cont]
    ax.scatter(
        np.log10(subset["gdpPercap"]),
        subset["lifeExp"],
        s=subset["bubble_size"],
        color=palette[i],
        alpha=0.75,
        label=cont
    )

ax.set_xlabel("GDP per capita (log scale, USD)", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Wealth and Health in 2007", fontsize=14, fontweight="bold")
ax.set_xticks([3, 4, 5])
ax.set_xticklabels(["$1K", "$10K", "$100K"])
ax.legend(title="Continent", loc="lower right")
plt.tight_layout()
plt.show()

Saving figures programmatically

Warning: ⚠️ Never use the “Export” button

Saving a figure manually (by right-clicking or clicking “Export” in the Plots pane) is not reproducible. The size, resolution, and format vary each time and the step is invisible in your code.

Always save figures inside your script, with fixed dimensions and resolution.

# 1. Assign your plot to a named variable (any ggplot object works)
p <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point()

# 2. Save as a vector (PDF) and/or a raster (PNG) image, depending on the size and nature of your graph
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.pdf"),
  plot     = p,
  width    = 10, height = 6, units = "in"
)
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.png"),
  plot     = p,
  width    = 10, height = 6, units = "in",
  dpi      = 300          # 300 dpi = publication quality
)

# Python version
fig, ax = plt.subplots(figsize=(10, 6))
# ... your plotting code ...

fig.savefig("output/figures/fig01_wealth_health_2007.pdf", bbox_inches="tight")
fig.savefig("output/figures/fig01_wealth_health_2007.png", dpi=300, bbox_inches="tight")
plt.close(fig)   # close this figure to free memory

File naming convention: use a numeric prefix that matches your manuscript (fig01_, fig02_, …, tab01_, tab02_, …). This is one of the items auditors check first.
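Generating file names from a small helper, rather than typing them, keeps the numbering consistent. The sketch below is one possible encoding of the convention: the two-digit prefix follows the examples in this class, while the lowercase-and-underscore rule for the descriptive part and the list of extensions are assumptions added for illustration.

```python
import re

# fig01_some_name.pdf, tab02_other_name.csv, ... (assumed pattern)
NAME_PATTERN = re.compile(r"^(fig|tab)\d{2}_[a-z0-9_]+\.(pdf|png|csv)$")


def output_name(kind, number, slug, ext):
    """Build a file name such as fig01_wealth_health_2007.pdf."""
    return f"{kind}{number:02d}_{slug}.{ext}"


def follows_convention(filename):
    """Return True if the file name matches the assumed pattern."""
    return NAME_PATTERN.match(filename) is not None


print(output_name("fig", 1, "wealth_health_2007", "pdf"))
# → fig01_wealth_health_2007.pdf
print(follows_convention("Figure 1 final.png"))  # → False
```

Running `follows_convention` over the contents of output/figures/ and output/tables/ is a quick pre-submission check.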

Choosing accessible colour palettes

About 8% of men have some form of colour-vision deficiency. Here are a few palettes designed to be perceptually uniform and colour-blind safe:

| Palette | Package | Use case |
|---|---|---|
| viridis, plasma, magma | viridis / built-in ggplot2 | Sequential / continuous |
| scale_colour_viridis_d() | ggplot2 | Discrete categorical |
| scale_colour_brewer(palette = "Set2") | ggplot2 | Categorical (up to 8 groups) |
| sns.set_palette("colorblind") | seaborn | Any seaborn chart |
| palette="viridis" | seaborn | Continuous colour mapping |

3.3 🧑‍💻 Exercise 3: Figures

Tip: 🧑‍💻 Exercise 3: Your first figure (20 min)

Goal: Add two publication-quality figures to analysis/report.qmd and save them programmatically to output/figures/.


Figure 2: Distribution of life expectancy in 2007

Add a section ## Distribution in 2007:

## Distribution in 2007

As of 2007, the distribution of life expectancy varies widely within
each continent (see @fig-dist-r).

```{r}
#| label: fig-dist-r
#| fig-cap: "Life expectancy by continent in 2007. Violin = density; points = individual countries; white bar = median."
#| fig-width: 9
#| fig-height: 5

p_dist <- gapminder |>
  filter(year == 2007) |>
  ggplot(aes(
    x      = reorder(continent, lifeExp, FUN = median),
    y      = lifeExp,
    fill   = continent,
    colour = continent
  )) +
  geom_violin(alpha = 0.35, linewidth = 0.4) +
  geom_jitter(width = 0.12, alpha = 0.8, size = 2.2) +
  stat_summary(fun = median, geom = "crossbar",
               width = 0.45, colour = "white", linewidth = 0.9) +
  scale_fill_viridis_d(option = "plasma") +
  scale_colour_viridis_d(option = "plasma") +
  labs(
    title   = "Life Expectancy Distribution by Continent in 2007",
    x       = "Continent (ordered by median life expectancy)",
    y       = "Life expectancy (years)",
    caption = "Source: Gapminder. Each point = one country."
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

p_dist

ggsave(
  here::here("output", "figures", "fig03_distribution_2007.pdf"),
  plot = p_dist, width = 9, height = 5
)
```
## Distribution in 2007

As of 2007, the distribution of life expectancy varies widely within
each continent (see @fig-dist-py).

```{python}
#| label: fig-dist-py
#| fig-cap: "Life expectancy by continent in 2007."

gap_2007 = gapminder[gapminder["year"] == 2007].copy()

# Order continents by median life expectancy
order = (
    gap_2007.groupby("continent")["lifeExp"]
    .median()
    .sort_values()
    .index.tolist()
)

fig, ax = plt.subplots(figsize=(9, 5))
sns.violinplot(data=gap_2007, x="continent", y="lifeExp",
               order=order, palette="plasma", alpha=0.4, ax=ax)
sns.stripplot(data=gap_2007, x="continent", y="lifeExp",
              order=order, palette="plasma", alpha=0.75,
              jitter=True, size=5, ax=ax)

ax.set_xlabel("Continent (ordered by median)", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Life Expectancy Distribution by Continent in 2007", fontsize=14)
plt.tight_layout()

fig.savefig("../output/figures/fig03_distribution_2007.pdf", bbox_inches="tight")
plt.show()
```

Cross-referencing in the narrative

In your report, reference figures by their label (replace @fig-trends-r with @fig-trends-py, and @fig-dist-r with @fig-dist-py, if you are using Python):

Life expectancy has grown dramatically since 1952 (see @fig-trends-r),
although large disparities persist between and within continents
(@fig-dist-r).

✅ Self-check

  • Both figures appear in the rendered HTML
  • Two PDF files exist in output/figures/ with descriptive names
  • File names use the fig02_, fig03_ prefix convention
  • Both figures use a colour-blind-safe palette
  • Each figure has an informative fig-cap
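
The file-name items of this self-check can be verified with a few lines of Python rather than by eye. The sketch below is a hypothetical helper: the function names and the exact naming pattern (figNN_description.pdf, lowercase) are assumptions made for illustration, not part of the class material.

```python
import re
from pathlib import Path

# Assumed convention: "fig" + two digits + "_" + lowercase description + ".pdf"
FIG_NAME = re.compile(r"^fig\d{2}_[a-z0-9_]+\.pdf$")


def badly_named(file_names):
    """Return the names that do not follow the figNN_description.pdf pattern."""
    return [name for name in file_names if not FIG_NAME.match(name)]


def check_figures(folder="output/figures"):
    """List the PDF files in `folder` and the subset with unexpected names."""
    names = sorted(p.name for p in Path(folder).glob("*.pdf"))
    return names, badly_named(names)


# "fig02_trends.pdf" is a made-up name used only for this example
print(badly_named(["fig02_trends.pdf", "fig03_distribution_2007.pdf", "plot.pdf"]))
```

Calling check_figures() from the project root returns the PDF files found and those that break the convention, so an empty second list means the naming items are satisfied.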

Bonus challenge: Add a 3rd figure: a scatter plot for Africa only showing how GDP per capita relates to life expectancy in each decade. Is the relationship the same as globally? Does it change over time?



4 Sharing & the Reproducibility Checklist

⏱ ~35 minutes

4.1 📖 Reading: The Journal Audit

Note 📖 Targeted Reading (5 min)

The Biometrical Journal, a leading biostatistics journal, performs reproducibility audits on every accepted paper. Manuscripts receive a checklist and cannot be published until all items pass. This practice is increasingly common across health and life-science journals.

  1. Read Box 1 and the Introduction of:
    Hejblum et al. (2020). Realistic and robust reproducible research for biostatistics.
    👉 https://www.biorxiv.org/content/10.1101/2020.01.15.907485v1
    (~5 min; focus on the 10 checklist items)

  2. Browse the Turing Way section on "Making Research Reproducible":
    👉 https://book.the-turing-way.org/reproducible-research/reproducible-research

Ask yourself: If you submitted your repro-phds project for a journal audit right now, which of the 10 checklist items would you fail?

4.2 💡 Tools for Sharing and Archiving

The full reproducibility toolchain

Getting from "works on my machine" to "anyone can run it" requires stacking a few layers:

📁  Project organisation     RStudio Projects / standard folder structure
🔒  Dependency management    renv (R)  /  venv + requirements.txt (Python)
📝  Literate document        Quarto (.qmd)
🎲  Random seeds             set.seed() / np.random.seed()
🗂️  Version control          Git
🌐  Remote hosting           GitHub
📦  Permanent archiving      Zenodo  →  DOI you can cite in a paper
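
To see what the "Random seeds" layer buys you, here is a minimal sketch. It uses Python's standard-library random module so that it runs without any installed package; this is a substitution made for illustration, and the same idea applies to np.random.seed() named above. The simulate function and the seed values are hypothetical.

```python
import random


def simulate(seed, n=5):
    """Draw n standard-normal values from a generator initialised with `seed`."""
    rng = random.Random(seed)  # independent generator, fully determined by the seed
    return [rng.gauss(0, 1) for _ in range(n)]


first = simulate(seed=2026)
second = simulate(seed=2026)

# Same seed, same draws: the "random" part of the analysis can be re-run exactly
print(first == second)  # True
```

Without a fixed seed, every render of the report would produce slightly different numbers, and a reader could not tell a real change from random noise.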

renv and virtual environments

renv creates a project-local library and records exact package versions in renv.lock.

renv::snapshot()    # update lockfile after installing / changing packages
renv::restore()     # collaborator command: install exact same versions
renv::status()      # check whether lockfile and current state agree

At the end of your report, always record your session environment:

sessioninfo::session_info()

This prints the R version, the operating system, and the exact version of every loaded package.

For Python, in the terminal, as mentioned in Section 1:

pip freeze > requirements.txt       # save exact versions

# Collaborator setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

At the end of your notebook:

## Requirements

```{python}
import sys, importlib.metadata
print(f"Python {sys.version}")
for pkg in ["pandas", "matplotlib", "seaborn", "numpy"]:
    print(f"  {pkg}: {importlib.metadata.version(pkg)}")
```
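
A collaborator can also check that the active environment really matches requirements.txt, much as renv::status() does for R. The sketch below is an illustration under stated assumptions: parse_requirements and find_mismatches are hypothetical helper names, and only exact pins of the form name==version (what pip freeze usually writes) are handled.

```python
from importlib import metadata


def parse_requirements(text):
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Usage sketch, run from the project root:
# with open("requirements.txt", encoding="utf-8") as handle:
#     print(find_mismatches(parse_requirements(handle.read())))
```

An empty dictionary means the environment agrees with the file; anything else tells you which package to reinstall.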

Git and GitHub: the essentials

Why use Git and GitHub?

🎖 Off-site backup: if your laptop dies, nothing is lost
🎖 Full history: you can return to any previous version, so you never have to worry about having deleted code that you need later
🎖 Collaboration: others can contribute easily, without emailing files
🎖 Integration with Zenodo for citable DOIs

  • git is a version control system that tracks changes to all the files within a project and synchronizes those changes across multiple computers and contributors.
  • GitHub is a web platform owned by Microsoft that hosts git repositories in the cloud.

You do not need to be a git expert. Three commands cover most of everyday use:

git add .                               # stage all changes
git commit -m "Add life expectancy analysis and figures"
git push                                # send to GitHub

Note 🐙 One-time git setup
git config --global user.name  "Your chosen user name"
git config --global user.email "you@domain.ext"

In RStudio: Tools → Global Options → Git/SVN, then point to the git executable. For troubleshooting the git setup in RStudio, refer to the online book Happy Git with R.

Zenodo: a permanent DOI for your code (and optionally data)

Zenodo is a CERN-hosted repository that gives your code or data a DOI (Digital Object Identifier): a permanent, citable identifier. This is what goes in your paper's Data Availability statement.

How it works in three steps:

  1. Push your project to a public GitHub repository
  2. Connect GitHub at https://zenodo.org/account/settings/github
  3. Create a GitHub release (e.g. v1.0); Zenodo archives it automatically and issues a DOI

Your paper then cites:

"All analysis code is available at https://doi.org/10.5281/zenodo.XXXXXXX"

Because the archive has a DOI and is hosted by a public research organisation (CERN), it is far more future-proof than a GitHub repository alone. The two are not mutually exclusive: keep developing on GitHub and archive releases on Zenodo.

4.3 🧑‍💻 Exercise 4: Self-Audit and Preparing to Share

Tip 🧑‍💻 Exercise 4: Check the Reproducible Research (RR) Checklist (20 min)

Goal: Systematically audit your repro-phds project against the standard checklist, fix any gaps, and prepare it for sharing.


Step 1: The checklist

Work through every item. For anything not yet in place, add it now:

🗂️ README & execution

  • README.md describes the project in one paragraph
  • README gives exact instructions to reproduce all results (which file, which command)
  • A single command re-runs everything (quarto render analysis/report.qmd)
  • Runtime is documented (e.g. "< 1 minute on a standard laptop")

📦 Dependencies

  • renv.lock (R) or requirements.txt (Python) is present and up to date
  • sessioninfo::session_info() (R) or equivalent Python output is printed in the report

📊 Data & code

  • All scripts needed to reproduce results are inside the project
  • Data is either in data/raw/ or downloaded via a stable URL in the code
  • No absolute paths anywhere in the code (/home/alice/… or C:…)

🎲 Randomness

  • set.seed() (R) or np.random.seed() (Python) is set in the setup chunk

📁 Output

  • All figures are saved programmatically to output/figures/
  • Figure file names use the fig01_, fig02_ prefix convention matching the report
  • At least one table is saved to output/tables/
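
The "no absolute paths" item is easy to miss by eye, so a quick automated scan can help. The sketch below is a heuristic under stated assumptions: the regular expression only catches quoted strings starting with /home/, /Users/ or a Windows drive letter, and the helper names (find_absolute_paths, scan_project) are hypothetical.

```python
import re
from pathlib import Path

# Heuristic (not exhaustive): a quote followed by /home/, /Users/ or a drive letter
ABSOLUTE_PATH = re.compile(r"""["'](/home/|/Users/|[A-Za-z]:[\\/])""")


def find_absolute_paths(source):
    """Return (line number, line) pairs that look like they contain an absolute path."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if ABSOLUTE_PATH.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_project(root=".", suffixes=(".py", ".R", ".qmd")):
    """Scan code files under `root`, skipping hidden folders and renv/."""
    report = {}
    for path in Path(root).rglob("*"):
        parts = path.relative_to(root).parts
        if any(part.startswith(".") or part == "renv" for part in parts):
            continue
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = find_absolute_paths(text)
            if hits:
                report[str(path)] = hits
    return report


example = 'df = read_csv("/home/alice/data.csv")\nout = "output/figures/fig01.pdf"'
print(find_absolute_paths(example))
```

Running scan_project() from the project root should return an empty dictionary before you share the project.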

Step 2: Add session info to the README

Add the output of the session-info command to the bottom of your README.md:

# R: run in the console and paste the output into README.md
sessioninfo::session_info()

# Python: run in the terminal and paste the output into README.md
python -c "import sys; print(sys.version)"
python -m pip list

Step 3: Final dependency snapshot and clean render

# R: update the lockfile
renv::snapshot()

# The ultimate test: restart R completely, then render from scratch:
# Session → Restart R  (or Ctrl+Shift+F10)
renv::restore()
quarto::quarto_render("analysis/report.qmd")

# Python (in the terminal): update the requirements file
pip freeze > requirements.txt

# Deactivate and reactivate the environment, then render
deactivate
source .venv/bin/activate
pip install -r requirements.txt
quarto render analysis/report.qmd

Step 4: Export a table

library(gapminder)
library(dplyr)
library(readr)
library(here)

summary_2007 <- gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    n_countries    = n(),
    mean_lifeExp   = round(mean(lifeExp), 2),
    median_lifeExp = round(median(lifeExp), 2),
    mean_gdpPercap = round(mean(gdpPercap), 0)
  )

write_csv(summary_2007,
          here("output", "tables", "tab01_continent_summary_2007.csv"))

# Python: assumes the gapminder DataFrame is already loaded, as in the report
import pandas as pd

summary_2007 = (
    gapminder.query("year == 2007")
    .groupby("continent")
    .agg(
        n_countries=("country", "nunique"),
        mean_lifeExp=("lifeExp", "mean"),
        median_lifeExp=("lifeExp", "median"),
        mean_gdpPercap=("gdpPercap", "mean"),
    )
    .round(2)
    .reset_index()
)

summary_2007.to_csv("output/tables/tab01_continent_summary_2007.csv",
                    index=False)
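
The relative path above only works when the code runs from the project root. A Python counterpart to here::here() can be sketched with pathlib. This is a minimal illustration under assumptions: find_project_root and project_path are hypothetical helpers, and the marker files used to recognise the root (.git, requirements.txt, renv.lock) are a choice made for this example.

```python
from pathlib import Path


def find_project_root(start=None, markers=(".git", "requirements.txt", "renv.lock")):
    """Walk upwards from `start` until a folder containing one of `markers` is found."""
    current = Path(start or Path.cwd()).resolve()
    for folder in (current, *current.parents):
        if any((folder / marker).exists() for marker in markers):
            return folder
    raise FileNotFoundError(f"No project marker {markers} found above {current}")


def project_path(*parts, start=None):
    """Build a path relative to the project root, wherever the code is run from."""
    return find_project_root(start).joinpath(*parts)


# Usage sketch (assumes a pandas DataFrame named summary_2007 exists):
# out = project_path("output", "tables", "tab01_continent_summary_2007.csv")
# out.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
# summary_2007.to_csv(out, index=False)
```

With this approach the same line works whether the code is executed from analysis/ during a Quarto render or from the project root in a terminal.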

Step 5 (Bonus): Push to GitHub and archive on Zenodo

# 1. Initialise Git (skip if you already did this)
git init
git add .
git commit -m "Initial reproducible analysis: Gapminder life expectancy"

# 2. Create a new repo at https://github.com/new, then:
git branch -M main    # make sure the local branch is named main
git remote add origin https://github.com/YOUR_USERNAME/repro-phds.git
git push -u origin main

# 3. Go to https://zenodo.org/account/settings/github
#    and enable the repo (do this before creating the release)

# 4. Create a release on GitHub (click "Releases" → "Create a new release")
#    Tag it: v1.0.0 → the release triggers automatic DOI creation

✅ Final self-check

The ultimate reproducibility test:
Ask a classmate to clone or download your repository, follow your README instructions exactly, and try to reproduce your figures without asking you anything.

If they can, you have a reproducible project. 🎉

  • All 13 checklist items above are ticked
  • The project renders from a clean R/Python session
  • A classmate can reproduce your figures from the README instructions alone
  • (Bonus) The project is on GitHub with a Zenodo DOI

Reflection questions:

  1. What was the hardest checklist item to satisfy, and why?
  2. Roughly how much extra time did the reproducible setup cost you, compared to "just writing code"?
  3. Imagine one real-world public health scenario where the extra cost would clearly be worth it.


🏆 Summary & wrap-up

What we learned:

🎖 Adopting a clean, portable structure for easy navigation inside your project
🎖 Running everything end-to-end with a single command
🎖 Programmatically saving figures
🎖 Locking your dependencies, so the environment is reproducible
🎖 Including inline computed values
🎖 Meeting the standard of a journal reproducibility checklist

The one rule

Start small, start now. Reproducibility is an ideal; each small step forward brings you closer to it.

Key tools

| Tool | Problem it solves | One command to remember |
|---|---|---|
| RStudio Project | No more absolute paths; self-contained workspace | Open .Rproj |
| Quarto | Code + prose in one document; no copy-paste of values | quarto render |
| renv / venv | Exact package versions: "it worked last year" insurance | renv::snapshot() |
| here | Portable file paths that work on any OS | here::here("output", "fig.pdf") |
| ggplot2 / seaborn | Programmatic, consistent, exportable figures | ggsave() / fig.savefig() |
| Git + GitHub | Version history, collaboration, off-site backup | git add . && git commit && git push |
| Zenodo | Permanent DOI for code citation in papers | Connect GitHub in Zenodo settings |

Going further

| Resource | What for |
|---|---|
| The Turing Way | Comprehensive, community-maintained reproducible research guide |
| R for Data Science (2e) | Tidyverse, ggplot2, Quarto: free online book |
| What They Forgot to Teach You About R | Project organisation, naming, file paths |
| Quarto documentation | Everything Quarto |
| ggplot2 book (3e) | Deep dive into the grammar of graphics |
| Python Data Science Handbook | NumPy, Pandas, Matplotlib, scikit-learn |
| Hejblum et al. (2020) | Reproducible research for biostatistics |
| Hornung et al. (2026) | Overcoming computational reproducibility barriers |
| Ouvrir la Science (in English) | French open science guides and passports (MESRI) |