Tools for a Reproducible, Shareable & Communicable Science

Diplôme Universitaire Public Health Data Science

Author

Boris Hejblum

Published

May 25, 2026

Foreword and Instructions

🎯 Learning Objectives

By the end of this class you will be able to:

Explain why reproducibility matters in science and identify the main barriers.
Organize a data-analysis project so that any colleague (including your future self) can re-run it from scratch.
Write a self-contained, reproducible dynamic document with Quarto that blends narrative text, code, and output together.
Produce publication-quality visualizations with ggplot2 (R) or seaborn/matplotlib (Python) and save them programmatically.
Audit your own project against the standard reproducibility checklist before sharing it.

Instructions: how to navigate this class

This class walks you through 4 Parts (augmented by this introduction and a concluding summary). Along the way, you will meet three different types of activity, identified by the icons below:

Icon	Type	Purpose
📖	Targeted reading	A short external resource to read (~5–10 min). Follow the links — don’t skip them.
💡	Concept definition	Key ideas with illustrations and worked examples.
🧑‍💻	Hands-on exercise	YOU write and run code yourself. Blocks are not pre-executed here.

Choose R or Python: both languages are shown in the exercises via tabs, but you only need to do one (according to your preferences and familiarity).

Total time needed: approximately 3 hours. Do not hesitate to take a short break between Parts.

⚙️ Setup (to be completed before starting the class)

R

Install the latest version of R → https://cran.r-project.org/
Install the latest version of RStudio → https://posit.co/download/rstudio-desktop/

Then paste the following lines into your console:

install.packages(c(
  "tidyverse",    # ggplot2, dplyr, readr, …
  "gapminder",    # our main dataset
  "renv",         # dependency management
  "here",         # portable file paths
  "quarto",       # render .qmd from R
  "knitr",        # knitting engine
  "sessioninfo"   # document your environment
))

Verify Quarto is available: in the RStudio Terminal tab, type quarto --version.
You should see a version number ≥ 1.4.

Python

Install Python ≥ 3.10 → https://www.python.org/downloads/
Install VS Code → https://code.visualstudio.com/ (with the Python extension)
or use JupyterLab: pip install jupyterlab

Create and activate a virtual environment, then install packages:

# Create project folder and virtual environment
mkdir repro-phds && cd repro-phds
python -m venv .venv

# Activate (choose the line that matches your OS)
source .venv/bin/activate       # for macOS or Linux
# .venv\Scripts\activate        # for Windows (PowerShell)

# Install python libraries
pip install pandas matplotlib seaborn gapminder jupyterlab quarto

Verify Quarto is indeed available by running quarto --version in your terminal (it should return a version number).

⏱ Before you start

If you run into installation errors, search the error message on the web: troubleshooting is a core data-science skill! Most errors have a Stack Overflow solution on the first result page. GenAI chatbots such as Le Chat from Mistral AI can also be very helpful.

1 Reproducibility: what it is and why it matters

⏱ ~45 minutes

1.1 📖 Readings: the reproducibility crisis

📖 Targeted reading (~10 min)

Before you read, think about: Have you ever tried to re-run an analysis and gotten different numbers? Have you read a Methods section and thought “I have no idea how they actually did this”?

Read the Nature News Feature “1,500 scientists lift the lid on reproducibility” (Baker, 2016).
Focus on the survey bar chart and the section “Barriers to reproducibility”:
👉 https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
(~5 min)
Skim the Turing Way overview page. Read “Why this is important” and the bullet list of barriers:
👉 https://book.the-turing-way.org/reproducible-research/overview
(~5 min)

Reflection: In public health the stakes are especially high. Irreproducible findings can influence clinical guidelines or policy decisions. Can you recall one published COVID-19 study from 2020 that had to be retracted or substantially corrected? (Hint: look up “Surgisphere hydroxychloroquine Lancet” or “Hydroxychloroquine and azithromycin Gautret Raoult International Journal of Antimicrobial Agents”)

1.2 💡 The Reproducibility Spectrum

“Reproducible” is used loosely, and several definitions coexist across scientific fields. In this class, we will use the definitions below, focusing on computational reproducibility:

	Same data?	Same code/method?	What does it test?
Reproducible	✅	✅	Can you re-run the analysis and get the exact same numbers?
Replicable	❌	✅	Does the same method generalise to new data?
Robust	✅	❌	Do different analytical choices reach the same conclusion?
Generalisable	❌	❌	Does the finding hold beyond the original context?

The diagram below (from The Turing Way community) relates these 4 levels to one another:

A 2x2 matrix showing reproducible, replicable, robust and generalisable research across axes of same/different data and same/different code — Reproducibility matrix. The Turing Way / Scriberia — CC-BY 4.0. Source: https://book.the-turing-way.org

1.3 💡 Why is Reproducibility important?

What is the value of a non-reproducible article?
If it isn’t reproducible, is it science?
Should we just trust each other?

Public funding demands accountability, while scientific credibility depends on it. Reproducibility helps achieve that.

📙 Scientific journals require it: peer reviewers now verify code, data, and workflows before final acceptance
⚔️ It acts as a methodological shield: it reduces the likelihood of undetected errors & spurious findings
🇪🇺🇫🇷 Institutional law: the EU, the ANR and the HCERES all require some level of reproducibility for their funded research
🧱 Increased impact: Reproducible articles are cited more & extended more (a more trustworthy foundation for future works)

In Public Health, it carries additional importance:

- Policy decisions are made from published findings. A wrong or irreproducible result can harm patients and populations at scale.
- Publicly funded research should be accountable. If the public paid for the study, the code and data should (where possible) be publicly accessible. This idea is closely related to open science, a concept connected to reproducibility but distinct from it.

Barriers, motivations and solutions

Barrier	Concrete example	Practical solution
Cultural	“My PI never does this”	Frame it as a career investment, not overhead
Technical	Simulation takes 3 days to run	Store pre-computed intermediate results; provide a fast reduced-run mode
Legal	Patient data under GDPR	Generate synthetic data of same structure; grant temporary restricted auditor access
Time	“I’ll clean up the code later”	Start early: it costs extra work upfront, but saves 4x more at revision

Reproducibility is a safeguard, not a burden

Reproducibility earns trust. Scientists who care about reproducibility are more efficient in the long term, as they can build on their own past work more easily. Reproducibility carries benefits at different scales for science:

Benefits of reproducibility at 3 different levels
🌍 Research field	🏥 Research group	🧑‍🔬 Yourself
Stronger methodological credibility	Faster transmission to collaborators	Faster article completion & revisions
Cumulative, extendable knowledge	Reduced technical debt	Transparent & trustable for audience
Lower risk of published errors	Clear, defensible archival	Hard skills & efficient workflow
		More citations

🔑 Key takeaways

Reproducibility is a scientific requirement: a scientific result that cannot be reproduced has very little (if any) added value. In this class, we focus on computational reproducibility, which is only one of the layers of reproducible science (data generation, for instance, is another).
Reproducibility is a continuum: universal, absolute, forever reproducibility is an ideal that is out of reach. Nonetheless, we should strive to improve practical reproducibility, thinking about the future reuse of our work and how we establish trust in scientific results.

1.4 🧑‍💻 Exercise 1: Create Your Reproducible Project Skeleton

🧑‍💻 Exercise 1: Project Setup (25 min)

Goal: Create a well-organised, portable project that you will build on throughout this class.

Step 1: Create a project

Open RStudio → File → New Project → New Directory → New Project
Name it repro-phds, choose a folder you will remember
✅ Tick “Use renv with this project” — this sets up dependency tracking automatically
Click Create Project

Notice the .Rproj file in the Files pane. Whenever you open this file, RStudio automatically sets the working directory to your project root. This alone eliminates 90 % of path-related bugs.

# If you have not done so previously, create the directory and the environment below
mkdir repro-phds && cd repro-phds

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate          # macOS/Linux
# .venv\Scripts\activate           # Windows: run the line corresponding to your OS

# Start Python
python
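For Python users, nothing plays the role of the .Rproj file automatically, so relative paths depend on where a script is launched from. One possible workaround is sketched below, assuming that the `.venv` folder created above marks the project root. The helper names `find_project_root` and `project_path` are hypothetical and not part of any library.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = ".venv") -> Path:
    """Walk up from `start` until a folder containing `marker` is found."""
    start = start.resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"No '{marker}' found in or above {start}")


def project_path(*parts: str, start: Optional[Path] = None) -> Path:
    """Build a path relative to the project root, wherever the script runs from."""
    root = find_project_root(start if start is not None else Path.cwd())
    return root.joinpath(*parts)
```

With this in place, `project_path("output", "figures")` should point to the same folder whether the script is launched from the project root or from `analysis/`, which is roughly what `here::here()` provides in R.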

Step 2: Create a standard folder structure

Run this in your console (inside the project):

dirs <- c(
  "data/raw",           # Original data — NEVER overwrite
  "data/processed",     # Cleaned / transformed data
  "R",                  # Reusable R functions
  "analysis",           # .qmd scripts go here
  "output/figures",     # Saved plots  → fig01_..., fig02_...
  "output/tables"       # Exported tables → tab01_..., tab02_...
)

invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))
cat("✅ Folders created!\n")

Run the following Python code (with .venv active, inside repro-phds/):

import os

dirs = [
    "data/raw",
    "data/processed",
    "src",               # Python source files / modules
    "analysis",          # Notebooks or scripts
    "output/figures",
    "output/tables"
]

for d in dirs:
    os.makedirs(d, exist_ok=True)

print("✅ Folders created!")

Your project should now look like this:

repro-phds/
├── data/
│   ├── raw/           ← original data  (treat as read-only)
│   └── processed/     ← cleaned data
├── R/  (or src/)      ← reusable functions / modules
├── analysis/          ← your scripts and notebooks
├── output/
│   ├── figures/       ← fig01_name.pdf, fig01_name.png, …
│   └── tables/        ← tab01_name.csv, …
└── README.md          ← you'll write this next

Step 3: Write a README.md

Create README.md at the project root. A README is the first thing anyone — including future-you — reads. It must answer: What is this? How do I run it?

Copy the template below into your README.md and fill in the blanks:

# repro-phds — Reproducibility Practice (Public Health)

## Description
A self-contained data analysis project produced during the
"Tools for Reproducible Science" class (Bachelor Public Health, [Year]).

## How to reproduce all results

1. Open `repro-phds.Rproj` in RStudio
   (or activate `.venv` in Python: `source .venv/bin/activate`).
2. Restore dependencies:
   - R: run `renv::restore()` in the console
   - Python: `pip install -r requirements.txt`
3. Render the main report:
   open `analysis/report.qmd` and click **Render**,
   or run `quarto render analysis/report.qmd`.

## Expected runtime
< 1 minute on a standard laptop.

## Session information
(Paste the output of `sessioninfo::session_info()` here
after completing Exercise 4.)

Step 4: Lock your dependencies

renv was already initialised when you created the project.
After installing packages, record their exact versions:

renv::snapshot()
# → "renv.lock has been updated."

The renv.lock file is your insurance against “it worked last year” bugs.
A collaborator who runs renv::restore() will get the exact same packages.

In the terminal, execute the following bash code:

pip freeze > requirements.txt

# Verify it was created
cat requirements.txt

A collaborator can restore this exact environment with:

pip install -r requirements.txt
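Besides the lock file, it can help to record the interpreter and package versions in a human-readable form, in the spirit of `sessioninfo::session_info()` in R. The sketch below relies only on the standard library (`importlib.metadata`, `platform`). The function name `session_info` and the list of packages are illustrative assumptions.

```python
import platform
from importlib import metadata


def session_info(packages):
    """Return the Python version, the platform, and the installed version of each package."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing
            info["packages"][name] = "not installed"
    return info


report = session_info(["pandas", "matplotlib", "seaborn"])
print(f"Python {report['python']} on {report['platform']}")
for name, version in report["packages"].items():
    print(f"  {name}: {version}")
```

The printed lines can be pasted into the "Session information" section of the README.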

✅ Self-check

Verify all five points before moving on:

getwd() (R) or os.getcwd() (Python) returns the path to repro-phds/
The six sub-folders exist in the files explorer / pane
README.md is at the project root and has all four sections
renv.lock (R) or requirements.txt (Python) has been created
Closing and reopening the project works cleanly

Question to think about: Why should data/raw/ be treated as strictly read-only? What would happen if you accidentally saved a modified version over your original data file?
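One way to make the read-only rule concrete is to remove write permission from the raw files once they are in place. This is a minimal sketch, assuming the folder layout from Step 2. `make_read_only` is a hypothetical helper, and the exact effect of `chmod` depends on your operating system (on Windows it only toggles the read-only flag).

```python
import stat
from pathlib import Path


def make_read_only(folder):
    """Remove write permission from every file under `folder`; return the files touched."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    protected = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # Keep the existing mode, minus all write bits
            path.chmod(path.stat().st_mode & ~write_bits)
            protected.append(path)
    return protected
```

Calling `make_read_only("data/raw")` after copying the original files guards against accidental overwrites. It is not a security measure, since the owner can always restore write permission.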

2 Literate Programming with Quarto

⏱ ~45 minutes

2.1 📖 Reading: One Document to Rule Them All

📖 Targeted Reading (5 min)

Donald Knuth introduced literate programming in the 1980s: instead of writing code with comments, intended primarily for computers, you write a narrative document, intended for humans, that contains the code. The output is a document that is the analysis.

Read the Quarto “Hello, Quarto” guide — try to render the example document if you can:
👉 https://quarto.org/docs/get-started/hello/rstudio.html
Keep the Quarto cheatsheet open as a reference throughout this block:
👉 https://rstudio.github.io/cheatsheets/quarto.pdf

The central idea: When figures and numbers are computed inside your .qmd file, there is no copy-paste step where a stale number can slip through. Re-run the document and every value updates automatically.

2.2 💡 Anatomy of a `.qmd` File

A Quarto file has three building blocks:

┌──────────────────────────────────────────────────────┐
│  1. YAML header       title, author, output type...  │
│──────────────────────────────────────────────────────│
│  2. Markdown text     prose, headings, lists, links  │
│──────────────────────────────────────────────────────│
│  3. Code chunks       R or Python code + its output  │
└──────────────────────────────────────────────────────┘

A colourful illustration showing wizards casting spells to turn R code into beautiful output documents. Figure 1: Artwork by Allison Horst (CC-BY 4.0). Quarto (like its precursor R Markdown) lets you blend code and narrative text into a single dynamic document.

1. The `YAML` header

The YAML header sits between --- markers at the very top of your document. It sets the defaults that control the output and its overall format, using the YAML language.
Be careful: indentation is significant in YAML.

---
title: "Global Health Trends — Gapminder Data"
author: "Your Name"
date: today                   # auto-fills today's date
format:
  html:
    toc: true                 # table of contents
    code-fold: true           # hide code by default (reader can expand)
    embed-resources: true     # self-contained HTML file (portable)
execute:
  echo: true                  # show code in output by default
  warning: true               # keep showing warnings
  message: false              # suppress messages (e.g. when loading packages)
---

2. Markdown text

Between code chunks, narrative text uses plain Markdown (lightweight formatting). Below is an example of markdown basics:

# Heading level 1
## Heading level 2

A paragraph. Make text **bold**, *italic*, or `code-styled`.

- Bullet list item one
- Bullet list item two

1. Numbered item one
2. Numbered item two

[Descriptive link text](https://url.com)
![Image alt text](path/to/image.png){width=50%}

3. Code chunks

Enclosed in triple back-ticks with {r} or {python} to specify the programming language interpreter to use. Chunk-specific options are set at the beginning of the chunk with the following syntax: #| keyword: value. Below is a chunk example, followed by its output:

```{r}
#| label: summary-table-r
#| echo: true
#| eval: true

library(gapminder)
head(gapminder)
```

# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Key chunk options (written after #| inside the chunk):

Option	Default	Effect
`echo`	`true`	Show the code in output
`eval`	`true`	Run the code
`include`	`true`	Include chunk output at all
`message`	`true`	Show package-load messages
`warning`	`true`	Show warnings
`cache`	`false`	Cache results (useful for slow code)
`label`	—	Name for cross-referencing figures/tables
`fig-cap`	—	Figure caption
`fig-width` / `fig-height`	—	Figure dimensions in inches

2.2.1 Inline code

You can also insert computed values directly into prose using `r ` or `{python}`:

The dataset covers `r length(unique(gapminder$country))` countries from `r min(gapminder$year)` to `r max(gapminder$year)`

When rendered, this becomes:

The dataset covers 142 countries from 1952 to 2007

No copy-paste, no stale numbers, no manual updates! ✨
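The same principle can be imitated in plain Python, outside Quarto: compute each value once and let the sentence read it from variables. The sketch below uses made-up toy records (not the Gapminder data), and all variable names are illustrative.

```python
# Toy records standing in for a dataset (values are made up for illustration)
records = [
    {"country": "A", "year": 1952},
    {"country": "A", "year": 2007},
    {"country": "B", "year": 1952},
    {"country": "B", "year": 2007},
]

# Compute each quantity once, directly from the data
n_countries = len({row["country"] for row in records})
first_year = min(row["year"] for row in records)
last_year = max(row["year"] for row in records)

# The prose reads the values from variables instead of hard-coded numbers
sentence = (
    f"The dataset covers {n_countries} countries "
    f"from {first_year} to {last_year}."
)
print(sentence)
```

If a record is added or removed, the sentence changes on the next run without any manual edit, which is what inline code achieves inside a .qmd file.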

2.3 🧑‍💻 Exercise 2: Your First Reproducible Health Report

🧑‍💻 Exercise 2: First Quarto report (25 min)

Goal: Write a reproducible .qmd report that loads real global health data, computes summaries with inline values, and renders to a self-contained HTML file.

We use the Gapminder dataset: life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007. Life expectancy is a central metric in public health — it integrates mortality at all ages and reflects overall health system performance.

Step 1: Create the file

In RStudio: File → New File → Quarto Document.
Delete all template content. Save as analysis/report.qmd.

Paste the YAML header that matches your language:

---
title: "Global Health Trends — Gapminder Analysis"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    theme: flatly
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
---

---
title: "Global Health Trends — Gapminder Analysis"
author: "Your Name"
date: today
format:
  html:
    toc: true
    code-fold: true
    theme: flatly
    embed-resources: true
execute:
  echo: true
  warning: false
  message: false
jupyter: python3
---

Step 2: Setup chunk (always the very first chunk!)

A setup chunk runs silently before anything else. It is a good place to set your random seed for instance.

```{r}
#| label: setup-r
#| include: false

set.seed(20260401)      # ← ensures the reproducibility of any downstream random operations

library(ggplot2)
library(dplyr)
library(gapminder)
library(here)
library(knitr)
```

```{python}
#| label: setup-py
#| include: false

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20260401)

# Load Gapminder from GitHub (stable URL)
url = "https://raw.githubusercontent.com/jennybc/gapminder/main/inst/extdata/gapminder.tsv"
gapminder = pd.read_csv(url, sep="\t")

sns.set_theme(style="whitegrid", palette="colorblind")
```
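To see what fixing the seed buys you, the sketch below uses Python's standard `random` module (an illustrative choice, whereas the setup chunk above seeds NumPy). The function name `simulate` is hypothetical.

```python
import random


def simulate(seed, n=5):
    """Draw `n` pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator: does not touch the global state
    return [rng.random() for _ in range(n)]


run_1 = simulate(20260401)
run_2 = simulate(20260401)
run_3 = simulate(42)

print(run_1 == run_2)  # same seed, same draws
print(run_1 == run_3)  # different seed, different draws
```

Using a local generator rather than the global state is a design choice: it keeps the result independent of any other code that also draws random numbers.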

Step 3: Data section with inline values

## The Data

The **Gapminder** dataset is a global development database compiled 
by the Gapminder Foundation. It covers health, population and wealth 
indicators for **`r length(unique(gapminder$country))` countries** 
across **`r length(unique(gapminder$year))` time points**
(from `r min(gapminder$year)` to `r max(gapminder$year)`).

```{r}
#| label: tbl-preview-r
#| tbl-cap: "First 6 rows of the Gapminder dataset."

head(gapminder) |> knitr::kable()
```

## The Data

The **Gapminder** dataset is a global development database compiled 
by the Gapminder Foundation. It covers health, population and wealth 
indicators for **`{python} gapminder["country"].nunique()` countries** 
across **`{python} gapminder["year"].nunique()` time points**
(from `{python} gapminder["year"].min()` to `{python} gapminder["year"].max()`).

```{python}
#| label: tbl-preview-py
#| tbl-cap: "First 6 rows of the Gapminder dataset."

gapminder.head(6)
```

Step 4: Summary statistics section

## Summary Statistics

```{r}
#| label: tbl-continent-2007-r
#| tbl-cap: "Life expectancy and GDP per capita by continent in 2007."

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    `N countries`         = n(),
    `Mean life exp.`      = round(mean(lifeExp), 1),
    `Median life exp.`    = round(median(lifeExp), 1),
    `Mean GDP/capita`     = round(mean(gdpPercap), 0)
  ) |>
  knitr::kable(format.args = list(big.mark = ","))
```

In 2007, the global mean life expectancy was
**`r round(mean(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` years**,
with values ranging from
`r round(min(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` to
`r round(max(dplyr::filter(gapminder, year == 2007)$lifeExp), 1)` years.

## Summary Statistics

```{python}
#| label: tbl-continent-2007-py
#| tbl-cap: "Life expectancy and GDP per capita by continent in 2007."

# Parentheses allow method chaining across lines without backslashes
(
    gapminder.query("year == 2007")
    .groupby("continent")
    .agg(
        n_countries=("country", "nunique"),
        mean_lifeExp=("lifeExp", "mean"),
        median_lifeExp=("lifeExp", "median"),
        mean_gdpPercap=("gdpPercap", "mean")
    )
    .round(1)
    .reset_index()
    .rename(columns={
        "continent": "Continent",
        "n_countries": "N Countries",
        "mean_lifeExp": "Mean Life Exp.",
        "median_lifeExp": "Median Life Exp.",
        "mean_gdpPercap": "Mean GDP/capita"
    })
)
```

Step 5: Render!

Click the blue Render button (top of the editor pane), or press Ctrl+Shift+K / Cmd+Shift+K.

Or from the console:

quarto::quarto_render("analysis/report.qmd")

Or in the terminal:

quarto render analysis/report.qmd

Open the resulting report.html in your browser. Check that:

Inline values (number of countries, year range) appear correctly in the prose
The summary table has a caption
The “Code” toggle (from code-fold: true) works for each chunk

✅ Self-check

The HTML file renders without errors
The inline numbers in the text match the table values
Changing year == 2007 to year == 1997 and re-rendering updates all values automatically
You can open report.html in any web browser, without RStudio or any other software (it is self-contained)

Bonus: Add a ## Introduction section above ## The Data with two or three sentences on why life expectancy is a meaningful public health indicator. Try adding a footnote using [^1] syntax and a simple citation using a .bib file.

3 Data Visualisation

⏱ ~40 minutes

3.1 📖 Reading — The Grammar of Graphics

📖 Targeted Reading (5 min)

Rather than memorizing dozens of plot types, you learn a grammar — a set of composable rules. Any statistical graphic can be described as data mapped to visual properties (position, colour, size, shape), rendered through geometric objects, on a coordinate system.

Skim the ggplot2 homepage and the first “Getting Started” example:
👉 https://ggplot2.tidyverse.org/
(For Python users, read the seaborn tutorial overview instead:
👉 https://seaborn.pydata.org/tutorial/introduction.html)
Read the abstract and first page of Wilkinson’s Grammar of Graphics, as summarised here:
👉 https://vita.had.co.nz/papers/layered-grammar.html — “Abstract” and “Introduction” paragraphs only

Key idea: Once you understand the grammar, you can quickly build almost any chart by composing (modular) layers. You no longer need to search for “how do I make a violin plot?” at every turn.

3.2 💡 Building Plots Layer by Layer

The seven layers

Component	In ggplot2	Example
Data	`ggplot(data = ...)`	`gapminder |> filter(year == 2007)`
Aesthetics	`aes(x, y, colour, size, shape)`	`aes(x = gdpPercap, y = lifeExp, colour = continent)`
Geometry	`geom_*()`	`geom_point()`, `geom_line()`, `geom_boxplot()`
Scales	`scale_*()`	`scale_x_log10()`, `scale_colour_viridis_d()`
Facets	`facet_wrap()` / `facet_grid()`	`facet_wrap(~ continent)`
Theme	`theme_*()`	`theme_minimal()`
Labels	`labs()`	`labs(title = "...", x = "...", caption = "...")`

An artistic illustration of a person building a layered ggplot2 graphic — Artwork by Allison Horst — CC-BY 4.0. Every ggplot2 plot is built by layering these components.

A complete worked example

library(ggplot2)
library(gapminder)
library(dplyr)

gap_2007 <- gapminder |> filter(year == 2007)

ggplot(
  data = gap_2007,
  aes(x = gdpPercap, y = lifeExp,
      colour = continent,
      size   = pop)          # bubble size = population
) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +   # log GDP axis
  scale_colour_viridis_d(option = "plasma") +         # colour-blind safe
  scale_size(range = c(2, 14), guide = "none") +
  labs(
    title    = "Wealth and Health in 2007",
    subtitle = "Each bubble is a country; size ∝ population",
    x        = "GDP per capita (log scale, USD)",
    y        = "Life expectancy (years)",
    colour   = "Continent",
    caption  = "Source: Gapminder Foundation"
  ) +
  theme_minimal(base_size = 13)

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

gap_2007 = gapminder[gapminder["year"] == 2007].copy()

# Bubble sizes proportional to population
max_pop = gap_2007["pop"].max()
gap_2007["bubble_size"] = (gap_2007["pop"] / max_pop) * 1800 + 30

fig, ax = plt.subplots(figsize=(10, 6))

continents = gap_2007["continent"].unique()
palette = sns.color_palette("plasma", len(continents))

for i, cont in enumerate(sorted(continents)):
    subset = gap_2007[gap_2007["continent"] == cont]
    ax.scatter(
        np.log10(subset["gdpPercap"]),
        subset["lifeExp"],
        s=subset["bubble_size"],
        color=palette[i],
        alpha=0.75,
        label=cont
    )

ax.set_xlabel("GDP per capita (log scale, USD)", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Wealth and Health in 2007", fontsize=14, fontweight="bold")
ax.set_xticks([3, 4, 5])
ax.set_xticklabels(["$1K", "$10K", "$100K"])
ax.legend(title="Continent", loc="lower right")
plt.tight_layout()
plt.show()

Saving figures programmatically

⚠️ Never use the “Export” button

Saving a figure manually (by right-clicking or clicking “Export” in the Plots pane) is not reproducible. The size, resolution, and format vary each time and the step is invisible in your code.

Always save figures inside your script, with fixed dimensions and resolution.

# 1. Assign your plot to a named variable (a minimal version of the plot above)
p <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point()

# 2. Save both a vector (PDF) and a raster (PNG) version; which one to use depends on the size and nature of your graph
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.pdf"),
  plot     = p,
  width    = 10, height = 6, units = "in"
)
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.png"),
  plot     = p,
  width    = 10, height = 6, units = "in",
  dpi      = 300          # 300 dpi = publication quality
)

fig, ax = plt.subplots(figsize=(10, 6))
# ... your plotting code ...

fig.savefig("output/figures/fig01_wealth_health_2007.pdf", bbox_inches="tight")
fig.savefig("output/figures/fig01_wealth_health_2007.png", dpi=300, bbox_inches="tight")
plt.close()   # free memory

File naming convention: use a numeric prefix that matches your manuscript — fig01_, fig02_, …, tab01_, tab02_, … This is one of the items auditors check first.
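A naming convention is easier to keep when it is checked automatically. The sketch below flags file names that deviate from one possible reading of the convention (prefix `fig` or `tab`, two digits, an underscore, a lowercase name, an extension). The regular expression and the helper `badly_named` are assumptions made for illustration, and the two non-conforming names in the example are invented.

```python
import re

# Assumed convention: "fig" or "tab", a two-digit number, an underscore,
# a lowercase descriptive name, and a file extension.
NAME_PATTERN = re.compile(r"^(fig|tab)\d{2}_[a-z0-9_]+\.[a-z0-9]+$")


def badly_named(filenames):
    """Return the file names that do not follow the assumed convention."""
    return [name for name in filenames if not NAME_PATTERN.match(name)]


example = [
    "fig01_wealth_health_2007.pdf",
    "fig02_trends_by_continent.pdf",
    "Rplot.png",          # default export name, no prefix or number
    "tab1_summary.csv",   # one-digit number instead of two
]
print(badly_named(example))
```

In a real project, the list of names could come from `os.listdir("output/figures")`, so the check can run just before sharing the project.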

Choosing accessible colour palettes

About 8 % of men have some form of colour-vision deficiency. Here are a few palettes designed to be perceptually uniform and colour-blind safe:

Palette	Package	Use case
`viridis`, `plasma`, `magma`	`viridis` / built-in ggplot2	Sequential / continuous
`scale_colour_viridis_d()`	ggplot2	Discrete categorical
`scale_colour_brewer(palette = "Set2")`	ggplot2	Categorical (up to 8 groups)
`sns.set_palette("colorblind")`	seaborn	Any seaborn chart
`palette="viridis"`	seaborn	Continuous colour mapping

3.3 🧑‍💻 Exercise 3: Figures

🧑‍💻 Exercise 3: Your first figures (20 min)

Goal: Add two publication-quality figures to analysis/report.qmd and save them programmatically to output/figures/.

Figure 1: Life expectancy trends over time

Add a new section ## Trends Over Time to your report:

## Trends Over Time

Life expectancy has improved on every continent since 1952 (see @fig-trends-r),
but large disparities remain.

```{r}
#| label: fig-trends-r
#| fig-cap: "Median life expectancy by continent, 1952–2007. Ribbons show interquartile range."
#| fig-width: 10
#| fig-height: 5

p_trends <- gapminder |>
  group_by(continent, year) |>
  summarise(
    med  = median(lifeExp),
    q25  = quantile(lifeExp, 0.25),
    q75  = quantile(lifeExp, 0.75),
    .groups = "drop"
  ) |>
  ggplot(aes(x = year, y = med,
             colour = continent, fill = continent)) +
  geom_ribbon(aes(ymin = q25, ymax = q75), alpha = 0.2, colour = NA) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.2) +
  scale_colour_viridis_d(option = "plasma") +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title   = "Life Expectancy Trends by Continent (1952–2007)",
    x       = "Year",
    y       = "Life expectancy (years)",
    colour  = "Continent", fill = "Continent",
    caption = "Source: Gapminder. Ribbons show IQR."
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

p_trends

ggsave(
  here::here("output", "figures", "fig02_trends_by_continent.pdf"),
  plot = p_trends, width = 10, height = 5
)
```

## Trends Over Time

```{python}
#| label: fig-trends-py
#| fig-cap: "Median life expectancy by continent, 1952–2007."

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

trend = (
    gapminder
    .groupby(["continent", "year"])["lifeExp"]
    .agg(med="median",
         q25=lambda x: x.quantile(0.25),
         q75=lambda x: x.quantile(0.75))
    .reset_index()
)

palette = sns.color_palette("plasma", n_colors=trend["continent"].nunique())

fig, ax = plt.subplots(figsize=(10, 5))
for i, (cont, grp) in enumerate(trend.groupby("continent")):
    ax.fill_between(grp["year"], grp["q25"], grp["q75"],
                    color=palette[i], alpha=0.2)
    ax.plot(grp["year"], grp["med"], marker="o", markersize=4,
            color=palette[i], label=cont, linewidth=1.5)

ax.set_xlabel("Year", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Life Expectancy Trends by Continent (1952–2007)", fontsize=14)
ax.legend(loc="lower right", title="Continent")
plt.tight_layout()

fig.savefig("../output/figures/fig02_trends_by_continent.pdf", bbox_inches="tight")
plt.show()
```

Figure 2: Distribution of life expectancy in 2007

Add a section ## Distribution in 2007:

## Distribution in 2007

As of 2007, the distribution of life expectancy varies widely within
each continent (see @fig-dist-r).

```{r}
#| label: fig-dist-r
#| fig-cap: "Life expectancy by continent in 2007. Violin = density; points = individual countries; white bar = median."
#| fig-width: 9
#| fig-height: 5

p_dist <- gapminder |>
  filter(year == 2007) |>
  ggplot(aes(
    x      = reorder(continent, lifeExp, FUN = median),
    y      = lifeExp,
    fill   = continent,
    colour = continent
  )) +
  geom_violin(alpha = 0.35, linewidth = 0.4) +
  geom_jitter(width = 0.12, alpha = 0.8, size = 2.2) +
  stat_summary(fun = median, geom = "crossbar",
               width = 0.45, colour = "white", linewidth = 0.9) +
  scale_fill_viridis_d(option = "plasma") +
  scale_colour_viridis_d(option = "plasma") +
  labs(
    title   = "Life Expectancy Distribution by Continent in 2007",
    x       = "Continent (ordered by median life expectancy)",
    y       = "Life expectancy (years)",
    caption = "Source: Gapminder. Each point = one country."
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

p_dist

ggsave(
  here::here("output", "figures", "fig03_distribution_2007.pdf"),
  plot = p_dist, width = 9, height = 5
)
```

## Distribution in 2007

As of 2007, the distribution of life expectancy varies widely within
each continent (see @fig-dist-py).

```{python}
#| label: fig-dist-py
#| fig-cap: "Life expectancy by continent in 2007."

gap_2007 = gapminder[gapminder["year"] == 2007].copy()

# Order continents by median life expectancy
order = (
    gap_2007.groupby("continent")["lifeExp"]
    .median()
    .sort_values()
    .index.tolist()
)

fig, ax = plt.subplots(figsize=(9, 5))
sns.violinplot(data=gap_2007, x="continent", y="lifeExp",
               order=order, palette="plasma", alpha=0.4, ax=ax)
sns.stripplot(data=gap_2007, x="continent", y="lifeExp",
              order=order, palette="plasma", alpha=0.75,
              jitter=True, size=5, ax=ax)

ax.set_xlabel("Continent (ordered by median)", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Life Expectancy Distribution by Continent in 2007", fontsize=14)
plt.tight_layout()

fig.savefig("../output/figures/fig03_distribution_2007.pdf", bbox_inches="tight")
plt.show()
```

Cross-referencing in the narrative

In your report, reference figures by their label (if you are using Python, replace @fig-trends-r with @fig-trends-py and @fig-dist-r with @fig-dist-py):

Life expectancy has grown dramatically since 1952 (see @fig-trends-r),
although large disparities persist between and within continents
(@fig-dist-r).

✅ Self-check

Both figures appear in the rendered HTML
Two PDF files exist in output/figures/ with descriptive names
File names use the fig02_, fig03_ prefix convention
Both figures use a colour-blind-safe palette
Each figure has an informative fig-cap
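
The file-related items of this self-check can also be verified programmatically. The sketch below is a minimal illustration and not part of the course material: the function name `check_figures` and the naming pattern (`fig`, two digits, an underscore, a lowercase description, `.pdf`) are assumptions inferred from the file names used in this exercise.

```python
import re
from pathlib import Path

# Assumed convention: "fig" + two digits + "_" + lowercase description + ".pdf"
FIG_PATTERN = re.compile(r"^fig\d{2}_[a-z0-9_]+\.pdf$")


def check_figures(fig_dir, expected_prefixes=("fig02_", "fig03_")):
    """Return a list of problems found in fig_dir (empty list = all good)."""
    fig_dir = Path(fig_dir)
    if not fig_dir.is_dir():
        return [f"missing directory: {fig_dir}"]

    pdf_names = sorted(p.name for p in fig_dir.glob("*.pdf"))
    problems = []

    # Every PDF should follow the naming convention
    for name in pdf_names:
        if not FIG_PATTERN.match(name):
            problems.append(f"unexpected file name: {name}")

    # Every expected figure should have been saved
    for prefix in expected_prefixes:
        if not any(name.startswith(prefix) for name in pdf_names):
            problems.append(f"no figure with prefix {prefix}")

    return problems
```

After rendering the report, calling `check_figures("output/figures")` from the project root should return an empty list.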

Bonus challenge: Add a third figure: a scatter plot for Africa only, showing how GDP per capita relates to life expectancy in each decade. Is the relationship the same as globally? Does it change over time?

4 Sharing & the Reproducibility Checklist

⏱ ~35 minutes

4.1 📖 Reading — The Journal Audit

📖 Targeted Reading (5 min)

The Biometrical Journal, a leading biostatistics journal, performs reproducibility audits on every accepted paper. Manuscripts receive a checklist and cannot be published until all items pass. This practice is increasingly common across health and life-science journals.

Read Box 1 and the Introduction of:
Hejblum et al. (2020). Realistic and robust reproducible research for biostatistics.
👉 https://www.biorxiv.org/content/10.1101/2020.01.15.907485v1
(~5 min — focus on the 10 checklist items)
Browse the Turing Way section on “Making Research Reproducible”:
👉 https://book.the-turing-way.org/reproducible-research/reproducible-research

Ask yourself: If you submitted your repro-phds project for a journal audit right now, which of the 10 checklist items would you fail?

4.2 💡Tools for Sharing and Archiving

The full reproducibility toolchain

Getting from “works on my machine” to “anyone can run it” requires stacking a few layers:

📁  Project organisation     RStudio Projects / standard folder structure
🔒  Dependency management    renv (R)  /  venv + requirements.txt (Python)
📝  Literate document        Quarto (.qmd)
🎲  Random seeds             set.seed() / np.random.seed()
🗂️  Version control          Git
🌐  Remote hosting           GitHub
📦  Permanent archiving      Zenodo  →  DOI you can cite in a paper
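
To illustrate the random-seed layer: any step that draws random numbers (a bootstrap, a simulation, a train/test split) gives a different answer on each run unless the generator is seeded. The sketch below uses only Python's standard library so that it runs without extra packages; `set.seed()` in R and `np.random.seed()` in NumPy play the same role. The function name, the toy data, and the seed value are illustrative assumptions.

```python
import random
import statistics


def bootstrap_mean(values, n_boot=1000, seed=None):
    """Mean of bootstrap means; reproducible when a seed is given."""
    rng = random.Random(seed)  # local generator: does not touch global state
    boot_means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]
    return statistics.fmean(boot_means)


life_exp = [43.8, 76.4, 72.3, 42.7, 75.3, 81.2]  # toy values for illustration

# Same seed -> identical result on every run
assert bootstrap_mean(life_exp, seed=2026) == bootstrap_mean(life_exp, seed=2026)

# No seed -> the result changes from one call to the next (almost surely)
print(bootstrap_mean(life_exp), bootstrap_mean(life_exp))
```

Setting the seed once, near the top of the report, is usually enough; what matters is that the value is written in the code rather than chosen interactively.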

renv and virtual environments

renv creates a project-local library and records exact package versions in renv.lock.

renv::snapshot()    # update lockfile after installing / changing packages
renv::restore()     # collaborator command: install exact same versions
renv::status()      # check whether lockfile and current state agree

At the end of your report, always record your session environment:

sessioninfo::session_info()

This prints the R version, the OS, and the exact version of every loaded package.

For Python, in the terminal, as mentioned in Section 1:

pip freeze > requirements.txt       # save exact versions

# Collaborator setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
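
A `requirements.txt` only guarantees identical versions if every line is pinned with `==`, which is what `pip freeze` produces; hand-edited files often are not. The following sketch flags unpinned lines. It is a simplified illustration: the function name `unpinned_requirements` is hypothetical, the version numbers in the example are placeholders, and the parsing ignores advanced requirement syntax such as URLs, extras, or environment markers.

```python
def unpinned_requirements(text):
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):      # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


example = """
pandas==2.2.2
seaborn>=0.13      # a range, not an exact version
matplotlib
numpy==1.26.4
"""

print(unpinned_requirements(example))  # ['seaborn>=0.13', 'matplotlib']
```

In a project you would pass the contents of the real file, for instance `unpinned_requirements(open("requirements.txt").read())`.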

At the end of your notebook:

## Requirements

```{python}
import sys, importlib.metadata
print(f"Python {sys.version}")
for pkg in ["pandas", "matplotlib", "seaborn", "numpy"]:
    print(f"  {pkg}: {importlib.metadata.version(pkg)}")
```

Git and GitHub: the essentials

Why use git and GitHub?

🎖 Off-site backup: if your laptop dies, nothing is lost
🎖 Full history: you can return to any previous version, so code you deleted by mistake is never truly lost
🎖 Collaboration: others can contribute easily, without emailing files
🎖 Integration with Zenodo for citable DOIs

git is a version control system that tracks changes to all the files within a project and synchronizes those changes across multiple computers and contributors.
GitHub is a web platform, owned by Microsoft, that provides cloud hosting for git repositories.

You do not need to be a git expert. Three commands cover most of everyday use:

git add .                               # stage all changes
git commit -m "Add life expectancy analysis and figures"
git push                                # send to GitHub

🐙 One-time git setup

git config --global user.name  "Your chosen user name"
git config --global user.email "you@domain.ext"

In RStudio: Tools → Global Options → Git/SVN, then point to the git executable. For troubleshooting the git setup in RStudio, refer to the Happy Git with R online book.

Zenodo: a permanent DOI for your code (and optionally data)

Zenodo is a CERN-hosted repository that gives your code or data a DOI (Digital Object Identifier): a permanent, citable identifier. This is what goes in your paper’s Data Availability statement.

How it works in three steps:

Push your project to a public GitHub repository
Connect GitHub at https://zenodo.org/account/settings/github
Create a GitHub release (e.g. v1.0) — Zenodo archives it automatically and issues a DOI

Your paper then cites: “All analysis code is available at https://doi.org/10.5281/zenodo.XXXXXXX”

Because it has a DOI and is hosted by CERN with European Union funding, a Zenodo archive is far more future-proof than a GitHub repository alone (and the two are not mutually exclusive).

4.3 🧑‍💻 Exercise 4: Self-Audit and Preparing to Share

🧑‍💻 Exercise 4: Check the RR Checklist (20 min)

Goal: Systematically audit your repro-phds project against the standard checklist, fix any gaps, and prepare it for sharing.

Step 2: Add session info to the README

Add the output of the session-info command to the bottom of your README.md:

# R: run in the console and paste the output into README.md
sessioninfo::session_info()

# Python: run in the terminal and paste the output into README.md
python -c "import sys; print(sys.version)"
python -m pip list
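
Copy-pasting is itself a manual step that can go stale. For the Python route, here is a minimal sketch that appends the versions to the README programmatically. The function names, the section title, and the default package list are assumptions to adapt to your project.

```python
import importlib.metadata
import platform
import sys
from pathlib import Path


def session_info(packages):
    """Build a Markdown section listing Python, OS and package versions."""
    lines = [
        "## Session info",
        "",
        f"- Python {sys.version.split()[0]}",
        f"- OS: {platform.platform()}",
    ]
    for pkg in packages:
        try:
            version = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"- {pkg}: {version}")
    return "\n".join(lines) + "\n"


def append_session_info(readme="README.md",
                        packages=("pandas", "matplotlib", "seaborn", "numpy")):
    """Append the session-info section to the end of the README."""
    with Path(readme).open("a", encoding="utf-8") as f:
        f.write("\n" + session_info(packages))
```

Each call to `append_session_info()` adds a new section, so run it once before sharing, or delete the previous section first.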

Step 3: Final dependency snapshot and clean render

# R: update the lockfile
renv::snapshot()

# The ultimate test — restart R completely, then render from scratch:
# Session → Restart R  (or Ctrl+Shift+F10)
renv::restore()
quarto::quarto_render("analysis/report.qmd")

# Python: update the pinned requirements
pip freeze > requirements.txt

# Deactivate and reactivate the environment, then render
deactivate
source .venv/bin/activate
pip install -r requirements.txt
quarto render analysis/report.qmd

Step 4: Export a table

library(gapminder)
library(dplyr)
library(readr)
library(here)

summary_2007 <- gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarise(
    n_countries    = n(),
    mean_lifeExp   = round(mean(lifeExp), 2),
    median_lifeExp = round(median(lifeExp), 2),
    mean_gdpPercap = round(mean(gdpPercap), 0)
  )

write_csv(summary_2007,
          here("output", "tables", "tab01_continent_summary_2007.csv"))

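import pandas as pd  # assumes `gapminder` is the DataFrame loaded earlier in the report

# Inside parentheses, method chains need no backslash continuations
summary_2007 = (
    gapminder.query("year == 2007")
    .groupby("continent")
    .agg(
        n_countries=("country", "nunique"),
        mean_lifeExp=("lifeExp", "mean"),
        median_lifeExp=("lifeExp", "median"),
        mean_gdpPercap=("gdpPercap", "mean")
    )
    .round(2)
    .reset_index()
)

# Path relative to the project root (use "../output/..." when running from analysis/)
summary_2007.to_csv("output/tables/tab01_continent_summary_2007.csv",
                    index=False)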
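
The Python snippets above use relative paths (`output/...` here, `../output/...` in the figure chunks), which resolve differently depending on the directory the code is run from, whereas `here::here()` always starts from the project root. A minimal standard-library sketch of the same idea follows. The helper names `find_project_root` and `project_path` and the choice of marker files are assumptions for illustration, not an established API.

```python
from pathlib import Path

# Files or folders assumed to exist only at the top of the project
ROOT_MARKERS = ("requirements.txt", ".git", ".venv")


def find_project_root(start=None):
    """Walk up from `start` until a directory containing a marker is found."""
    start = Path(start or Path.cwd()).resolve()
    for folder in (start, *start.parents):
        if any((folder / marker).exists() for marker in ROOT_MARKERS):
            return folder
    raise FileNotFoundError(f"no project root found above {start}")


def project_path(*parts, start=None, create_parent=True):
    """Build a path from the project root, like here::here() in R."""
    path = find_project_root(start).joinpath(*parts)
    if create_parent:
        # to_csv() and savefig() fail if the target folder does not exist
        path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

With such a helper, `summary_2007.to_csv(project_path("output", "tables", "tab01_continent_summary_2007.csv"), index=False)` would write to the same file whether the code runs from the project root or from `analysis/`.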

Step 5 (Bonus) — Push to GitHub and archive on Zenodo

# 1. Initialise Git (skip if you already did this)
git init
git add .
git commit -m "Initial reproducible analysis — Gapminder life expectancy"

# 2. Create a new repo at https://github.com/new, then:
git remote add origin https://github.com/YOUR_USERNAME/repro-phds.git
git branch -M main    # make sure the local branch is named "main"
git push -u origin main

# 3. Create a release on GitHub (click "Releases" → "Create a new release")
#    Tag it: v1.0.0

# 4. Go to https://zenodo.org/account/settings/github
#    Enable the repo → the release triggers automatic DOI creation

✅ Final self-check

The ultimate reproducibility test:
Ask a classmate to clone or download your repository, follow your README instructions exactly, and try to reproduce your figures without asking you anything.

If they can — you have a reproducible project. 🎉

All 13 checklist items above are ticked
The project renders from a clean R/Python session
A classmate can reproduce your figures from the README instructions alone
(Bonus) The project is on GitHub with a Zenodo DOI

Reflection questions:

What was the hardest checklist item to satisfy — and why?
Roughly how much extra time did the reproducible setup cost you, compared to “just writing code”?
Imagine one real-world public health scenario where the extra cost would clearly be worth it.

🏆 Summary & wrap-up

What we learned:

🎖 Adopting a clean, portable structure for easy navigation inside your project
🎖 Running everything end-to-end with a single command
🎖 Programmatically saving figures
🎖 Locking your dependencies, so the environment is reproducible
🎖 Including inline computed values
🎖 Passing the standard of a journal reproducibility checklist

The one rule

Start small, start now. Reproducibility is an ideal, and each small step forward brings you closer to it.

Key tools

Tool	Problem it solves	One command to remember
RStudio Project	No more absolute paths; self-contained workspace	Open `.Rproj`
Quarto	Code + prose in one document; no copy-paste of values	`quarto render`
renv / venv	Exact package versions — “it worked last year” insurance	`renv::snapshot()`
here	Portable file paths that work on any OS	`here::here("output", "fig.pdf")`
ggplot2 / seaborn	Programmatic, consistent, exportable figures	`ggsave()` / `fig.savefig()`
Git + GitHub	Version history, collaboration, off-site backup	`git add . && git commit && git push`
Zenodo	Permanent DOI for code citation in papers	Connect GitHub in Zenodo settings

Going further

Resource	What for
The Turing Way	Comprehensive, community-maintained reproducible research guide
R for Data Science (2e)	Tidyverse, ggplot2, Quarto — free online book
What They Forgot to Teach You About R	Project organisation, naming, file paths
Quarto documentation	Everything Quarto
ggplot2 book (3e)	Deep dive into the grammar of graphics
Python Data Science Handbook	NumPy, Pandas, Matplotlib, scikit-learn
Hejblum et al. (2020)	Reproducible research for biostatistics
Hornung et al. (2026)	Overcoming computational reproducibility barriers
Ouvrir la Science (in English)	French open science guides and passports (MESRI)
