# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Tools for a Reproducible, Shareable & Communicable Science
Diplôme Universitaire Public Health Data Science
Foreword and Instructions
Setup (to be completed before starting the class)
Install the latest version of R: https://cran.r-project.org/
Install the latest version of RStudio: https://posit.co/download/rstudio-desktop/
Then paste the following lines into your console:
install.packages(c(
  "tidyverse",   # ggplot2, dplyr, readr, …
  "gapminder",   # our main dataset
  "renv",        # dependency management
  "here",        # portable file paths
  "quarto",      # render .qmd from R
  "knitr",       # knitting engine
  "sessioninfo"  # document your environment
))
Verify Quarto is available: in the RStudio Terminal tab, type quarto --version.
You should see a version number ≥ 1.4.
Install Python ≥ 3.10: https://www.python.org/downloads/
Install VS Code: https://code.visualstudio.com/ (with the Python extension)
or use JupyterLab: pip install jupyterlab
Create and activate a virtual environment, then install packages:
# Create project folder and virtual environment
mkdir repro-phds && cd repro-phds
python -m venv .venv
# Activate (choose the one line right for your OS)
source .venv/bin/activate # for macOS or Linux
.venv\Scripts\activate # for Windows (PowerShell)
# Install python libraries
pip install pandas matplotlib seaborn gapminder jupyterlab quarto
Verify Quarto is indeed available by running quarto --version in your terminal (it should return a version number).
If you run into installation errors, search for the error message on the web: troubleshooting is a core data-science skill! Many common errors have a documented solution (for example on Stack Overflow) among the first search results. GenAI chatbots such as Le Chat from Mistral AI can also be very helpful.
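Once the Python setup is done, a quick self-check can confirm that the pieces are in place. The sketch below is only an illustration: the function name `check_environment` is hypothetical, and it relies solely on the Python standard library to test the interpreter version, look for a `quarto` executable on the PATH, and list packages that are not installed.

```python
import shutil
import sys
from importlib import metadata


def check_environment(packages, min_python=(3, 10)):
    """Summarise what is available in the currently active environment."""
    report = {
        # True when the running interpreter is at least min_python
        "python_ok": sys.version_info >= min_python,
        # True when an executable named "quarto" is found on the PATH
        "quarto_on_path": shutil.which("quarto") is not None,
        "missing_packages": [],
    }
    for name in packages:
        try:
            metadata.version(name)  # raises if the package is not installed
        except metadata.PackageNotFoundError:
            report["missing_packages"].append(name)
    return report


# Package names taken from the pip install line of the setup instructions
print(check_environment(["pandas", "matplotlib", "seaborn", "gapminder", "jupyterlab"]))
```

Run it with the virtual environment activated: an empty `missing_packages` list suggests the installation step succeeded.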
1 Reproducibility: what it is and why it matters
⏱ ~45 minutes
1.1 Readings: the reproducibility crisis
1.2 💡 The Reproducibility Spectrum
"Reproducible" is used loosely, and several definitions co-exist across scientific fields. In this class, we use the definition below, which focuses on computational reproducibility:
| | Same data? | Same code/method? | What does it test? |
|---|---|---|---|
| Reproducible | ✓ | ✓ | Can you re-run the analysis and get the exact same numbers? |
| Replicable | ✗ | ✓ | Does the same method generalise to new data? |
| Robust | ✓ | ✗ | Do different analytical choices reach the same conclusion? |
| Generalisable | ✗ | ✗ | Does the finding hold beyond the original context? |
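Following The Turing Way's definitions, the four terms can be read as a lookup on two yes/no questions: was the same data used, and was the same code or method used? The helper below is a hypothetical illustration of that reading (the name `classify` is not part of any library).

```python
def classify(same_data: bool, same_method: bool) -> str:
    """Map the two yes/no questions onto the four terms of the spectrum."""
    labels = {
        (True, True): "Reproducible",     # same data, same method
        (False, True): "Replicable",      # new data, same method
        (True, False): "Robust",          # same data, different method
        (False, False): "Generalisable",  # new data, different method
    }
    return labels[(same_data, same_method)]


# Re-running the authors' own script on their own data
print(classify(same_data=True, same_method=True))
```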
The diagram below (from The Turing Way community) relates these 4 levels to one another:
1.3 💡 Why is Reproducibility important?
What is the value of a non-reproducible article?
If it isn't reproducible, is it science?
Should we just trust each other?
Public funding demands accountability, while scientific credibility depends on it. Reproducibility helps achieve that.
- Scientific journals require it: peer reviewers increasingly verify code, data, and workflows before final acceptance
- It acts as a methodological shield: it reduces the likelihood of undetected errors & spurious findings
- 🇪🇺🇫🇷 Institutional requirements: the EU, the ANR and the HCERES all require some level of reproducibility for the research they fund
- 🧱 Increased impact: reproducible articles are cited more & extended more (a more trustworthy foundation for future work)
In Public Health, it carries additional importance:
- Policy decisions are made from published findings. A wrong or irreproducible result can harm patients and populations at scale.
- Publicly funded research should be accountable. If the public paid for the study, the code and data should (where possible) be publicly accessible. This idea is closely related to open science, a concept connected to reproducibility but distinct from it (➡️ more details here).
Barriers, motivations and solutions
| Barrier | Concrete example | Practical solution |
|---|---|---|
| Cultural | "My PI never does this" | Frame it as a career investment, not overhead |
| Technical | Simulation takes 3 days to run | Store pre-computed intermediate results; provide a fast reduced-run mode |
| Legal | Patient data under GDPR | Generate synthetic data of same structure; grant temporary restricted auditor access |
| Time | "I'll clean up the code later" | Start early: it costs extra work upfront, but saves 4x more at revision |
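The "Technical" row can be made concrete with a small caching pattern. This is a minimal sketch under stated assumptions, not a prescribed implementation: `slow_simulation` is a hypothetical stand-in for an expensive computation, the JSON cache layout is one choice among many, and the `quick` flag plays the role of the fast reduced-run mode.

```python
import json
import tempfile
from pathlib import Path


def slow_simulation(n_steps: int) -> dict:
    """Hypothetical stand-in for a computation that would take days."""
    total = sum(i * i for i in range(n_steps))
    return {"n_steps": n_steps, "total": total}


def cached_simulation(cache_dir: Path, quick: bool = False) -> dict:
    """Reuse a stored result when one exists, otherwise compute and store it."""
    n_steps = 1_000 if quick else 1_000_000  # quick = reduced-run mode
    cache_file = cache_dir / f"simulation_{n_steps}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = slow_simulation(n_steps)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result


# Demonstration in a temporary folder: the second call reads the cached file
with tempfile.TemporaryDirectory() as tmp:
    first = cached_simulation(Path(tmp) / "cache", quick=True)
    second = cached_simulation(Path(tmp) / "cache", quick=True)
    print(first == second)
```

Keeping the run size in the cache file name prevents a reduced run from being mistaken for the full result.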
Reproducibility is a safeguard, not a burden
Reproducibility earns trust. Scientists who care about reproducibility are more efficient in the long term, as they can build on their own past work more easily. Reproducibility brings benefits at different scales:
| Research field | 👥 Research group | 🧑‍🔬 Yourself |
|---|---|---|
| Stronger methodological credibility | Faster transmission to collaborators | Faster article completion & revisions |
| Cumulative, extendable knowledge | Reduced technical debt | Transparent & trustable for audience |
| Lower risk of published errors | Clear, defensible archival | Hard skills & efficient workflow |
| More citations |
1.4 🧑‍💻 Exercise 1: Create Your Reproducible Project Skeleton
2 Literate Programming with Quarto
⏱ ~45 minutes
2.1 Reading: One Document to Rule Them All
2.2 💡 Anatomy of a .qmd File
A Quarto file has three building blocks:
┌──────────────────────────────────────────────────────┐
│  1. YAML header      title, author, output type...   │
├──────────────────────────────────────────────────────┤
│  2. Markdown text    prose, headings, lists, links   │
├──────────────────────────────────────────────────────┤
│  3. Code chunks      R or Python code + its output   │
└──────────────────────────────────────────────────────┘
1. The YAML header
The YAML header sits between --- markers at the very top of your document. It sets the defaults that control the output and its overall format, using YAML syntax.
Be careful: indentation is significant in YAML.
---
title: "Global Health Trends - Gapminder Data"
author: "Your Name"
date: today                 # auto-fills today's date
format:
  html:
    toc: true               # table of contents
    code-fold: true         # hide code by default (reader can expand)
    embed-resources: true   # self-contained HTML file (portable)
execute:
  echo: true                # show code in output by default
  warning: true             # keep showing warnings
  message: false            # suppress messages (e.g. when loading packages)
---
2. Markdown text
Between code chunks, narrative text uses plain Markdown (lightweight formatting). Below is an example of markdown basics:
# Heading level 1
## Heading level 2
A paragraph. Make text **bold**, *italic*, or `code-styled`.
- Bullet list item one
- Bullet list item two
1. Numbered item one
2. Numbered item two
[Descriptive link text](https://url.com)
{width=50%}
3. Code chunks
Code chunks are enclosed in triple backticks, with {r} or {python} specifying which language engine runs the code. Chunk-specific options are set at the beginning of the chunk with the following syntax: #| keyword: value. Below is a chunk example, followed by its output:
```{r}
#| label: summary-table-r
#| echo: true
#| eval: true
library(gapminder)
head(gapminder)
```
Key chunk options (written after #| inside the chunk):
| Option | Default | Effect |
|---|---|---|
| echo | true | Show the code in output |
| eval | true | Run the code |
| include | true | Include chunk output at all |
| message | true | Show package-load messages |
| warning | true | Show warnings |
| cache | false | Cache results (useful for slow code) |
| label | (none) | Name for cross-referencing figures/tables |
| fig-cap | (none) | Figure caption |
| fig-width / fig-height | (none) | Figure dimensions in inches |
2.2.1 Inline code
You can also insert computed values directly into prose using `r ` or `{python}`:
The dataset covers `r length(unique(gapminder$country))` countries from `r min(gapminder$year)` to `r max(gapminder$year)`
When rendered, this becomes:
The dataset covers 142 countries from 1952 to 2007
No copy-paste, no stale numbers, no manual update! ✨
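The same principle applies in Python: compute every number from the data, then interpolate it into the sentence. The sketch below uses a tiny set of hypothetical records (not the real gapminder data) and a hypothetical helper named `describe_coverage`, so that it runs without any installed package.

```python
# Tiny stand-in records (hypothetical values, NOT the real gapminder data)
records = [
    {"country": "A", "year": 1952},
    {"country": "A", "year": 2007},
    {"country": "B", "year": 1952},
    {"country": "B", "year": 2007},
]


def describe_coverage(rows):
    """Build the sentence from the data, so it can never go stale."""
    n_countries = len({row["country"] for row in rows})
    first_year = min(row["year"] for row in rows)
    last_year = max(row["year"] for row in rows)
    return f"The dataset covers {n_countries} countries from {first_year} to {last_year}"


print(describe_coverage(records))
# prints: The dataset covers 2 countries from 1952 to 2007
```

If a record is added or removed, the sentence updates on the next render, which is exactly what inline code achieves inside a .qmd document.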
2.3 🧑‍💻 Exercise 2: Your First Reproducible Health Report
3 Data Visualisation
⏱ ~40 minutes
3.1 Reading: The Grammar of Graphics
3.2 💡 Building Plots Layer by Layer
The seven layers
| Component | In ggplot2 | Example |
|---|---|---|
| Data | ggplot(data = ...) | gapminder \|> filter(year == 2007) |
| Aesthetics | aes(x, y, colour, size, shape) | aes(x = gdpPercap, y = lifeExp, colour = continent) |
| Geometry | geom_*() | geom_point(), geom_line(), geom_boxplot() |
| Scales | scale_*() | scale_x_log10(), scale_colour_viridis_d() |
| Facets | facet_wrap() / facet_grid() | facet_wrap(~ continent) |
| Theme | theme_*() | theme_minimal() |
| Labels | labs() | labs(title = "...", x = "...", caption = "...") |
A complete worked example
library(ggplot2)
library(gapminder)
library(dplyr)

gap_2007 <- gapminder |> filter(year == 2007)

ggplot(
  data = gap_2007,
  aes(x = gdpPercap, y = lifeExp,
      colour = continent,
      size = pop)                                      # bubble size = population
) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = scales::dollar_format()) +    # log GDP axis
  scale_colour_viridis_d(option = "plasma") +          # colour-blind safe
  scale_size(range = c(2, 14), guide = "none") +
  labs(
    title    = "Wealth and Health in 2007",
    subtitle = "Each bubble is a country; size ∝ population",
    x        = "GDP per capita (log scale, USD)",
    y        = "Life expectancy (years)",
    colour   = "Continent",
    caption  = "Source: Gapminder Foundation"
  ) +
  theme_minimal(base_size = 13)

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from gapminder import gapminder  # DataFrame provided by the gapminder package installed during setup

gap_2007 = gapminder[gapminder["year"] == 2007].copy()

# Bubble sizes proportional to population
max_pop = gap_2007["pop"].max()
gap_2007["bubble_size"] = (gap_2007["pop"] / max_pop) * 1800 + 30

fig, ax = plt.subplots(figsize=(10, 6))
continents = gap_2007["continent"].unique()
palette = sns.color_palette("plasma", len(continents))

# One scatter call per continent, so that each gets its own legend entry
for i, cont in enumerate(sorted(continents)):
    subset = gap_2007[gap_2007["continent"] == cont]
    ax.scatter(
        np.log10(subset["gdpPercap"]),   # plot log10(GDP) on a linear axis
        subset["lifeExp"],
        s=subset["bubble_size"],
        color=palette[i],
        alpha=0.75,
        label=cont
    )

ax.set_xlabel("GDP per capita (log scale, USD)", fontsize=12)
ax.set_ylabel("Life expectancy (years)", fontsize=12)
ax.set_title("Wealth and Health in 2007", fontsize=14, fontweight="bold")
ax.set_xticks([3, 4, 5])                 # positions are log10 values: 10^3, 10^4, 10^5
ax.set_xticklabels(["$1K", "$10K", "$100K"])
ax.legend(title="Continent", loc="lower right")
plt.tight_layout()
plt.show()
Saving figures programmatically
Saving a figure manually (by right-clicking or clicking "Export" in the Plots pane) is not reproducible. The size, resolution, and format vary each time and the step is invisible in your code.
Always save figures inside your script, with fixed dimensions and resolution.
# 1. Assign your plot to a named variable
p <- ggplot(gap_2007, aes(...)) + ...

# 2. Save: provide a vector (PDF) and/or a raster (PNG) image, depending on the size and nature of your graph
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.pdf"),
  plot = p,
  width = 10, height = 6, units = "in"
)
ggsave(
  filename = here::here("output", "figures", "fig01_wealth_health_2007.png"),
  plot = p,
  width = 10, height = 6, units = "in",
  dpi = 300   # 300 dpi = publication quality
)

fig, ax = plt.subplots(figsize=(10, 6))
# ... your plotting code ...
fig.savefig("output/figures/fig01_wealth_health_2007.pdf", bbox_inches="tight")
fig.savefig("output/figures/fig01_wealth_health_2007.png", dpi=300, bbox_inches="tight")
plt.close(fig)  # free memory
File naming convention: use a numeric prefix that matches your manuscript: fig01_, fig02_, …, tab01_, tab02_, … This is one of the items auditors check first.
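On the Python side, the naming convention and the portable-path idea (what here::here() provides in R) can be combined in one small helper. This is a sketch only: `figure_path` is a hypothetical name, and the output/figures layout is assumed from the examples above.

```python
import tempfile
from pathlib import Path


def figure_path(project_root: Path, number: int, slug: str, ext: str = "png") -> Path:
    """Return <root>/output/figures/figNN_<slug>.<ext>, creating the folder if needed."""
    if number < 1:
        raise ValueError("figure numbers start at 1")
    folder = project_root / "output" / "figures"
    folder.mkdir(parents=True, exist_ok=True)
    # :02d pads the number with a zero, so fig02 sorts before fig10
    return folder / f"fig{number:02d}_{slug}.{ext}"


# Demonstration in a temporary folder standing in for the project root
with tempfile.TemporaryDirectory() as tmp:
    path = figure_path(Path(tmp), 1, "wealth_health_2007", "pdf")
    print(path.name)  # prints: fig01_wealth_health_2007.pdf
```

Because pathlib builds the path from its parts, the same code works on Windows, macOS, and Linux; the result can be passed directly to fig.savefig().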
Choosing accessible colour palettes
About 8 % of men have some form of colour-vision deficiency. Here are a few palettes designed to be perceptually uniform and colour-blind safe:
| Palette | Package | Use case |
|---|---|---|
| viridis, plasma, magma | viridis / built-in ggplot2 | Sequential / continuous |
| scale_colour_viridis_d() | ggplot2 | Discrete categorical |
| scale_colour_brewer(palette = "Set2") | ggplot2 | Categorical (up to 8 groups) |
| sns.set_palette("colorblind") | seaborn | Any seaborn chart |
| palette="viridis" | seaborn | Continuous colour mapping |
3.3 🧑‍💻 Exercise 3: Figures
4 Sharing & the Reproducibility Checklist
⏱ ~35 minutes
4.1 Reading: The Journal Audit
4.2 💡 Tools for Sharing and Archiving
The full reproducibility toolchain
Getting from "works on my machine" to "anyone can run it" requires stacking a few layers:
- Project organisation: RStudio Projects / standard folder structure
- Dependency management: renv (R) / venv + requirements.txt (Python)
- Literate document: Quarto (.qmd)
- Random seeds: set.seed() / np.random.seed()
- Version control: Git
- Remote hosting: GitHub
- Permanent archiving: Zenodo, which provides a DOI you can cite in a paper
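The random-seed layer is easy to overlook, so here is a minimal illustration of why it matters. It is a sketch that uses Python's standard-library `random` module rather than the NumPy call named in the list, so that it runs without extra packages; `simulate_sample` is a hypothetical function.

```python
import random


def simulate_sample(seed: int, n: int = 5) -> list[float]:
    """Draw n standard-normal values from a generator seeded explicitly."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    return [round(rng.gauss(0, 1), 4) for _ in range(n)]


run_1 = simulate_sample(seed=2024)
run_2 = simulate_sample(seed=2024)
run_3 = simulate_sample(seed=2025)

print(run_1 == run_2)  # same seed, so the draws are identical
print(run_1 == run_3)  # different seed, so the draws are expected to differ
```

Recording the seed in the script (rather than setting it interactively) is what lets a collaborator obtain the same numbers.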
renv and virtual environments
renv creates a project-local library and records exact package versions in renv.lock.
renv::snapshot() # update lockfile after installing / changing packages
renv::restore() # collaborator command: install exact same versions
renv::status() # check whether lockfile and current state agree
At the end of your report, always record your session environment:
sessioninfo::session_info()
This prints the R version, the OS, and the exact version of every loaded package.
In the terminal, as mentioned in Section 1:
pip freeze > requirements.txt # save exact versions
# Collaborator setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
At the end of your notebook:
## Requirements
```{python}
import sys, importlib.metadata
print(f"Python {sys.version}")
for pkg in ["pandas", "matplotlib", "seaborn", "numpy"]:
    print(f"  {pkg}: {importlib.metadata.version(pkg)}")
```
Git and GitHub: the essentials
Why use git and GitHub?
- Off-site backup: if your laptop dies, nothing is lost
- Full history: you can go back to any previous version, without worrying about having deleted code you might need later
- Collaboration: others can contribute easily, without emailing files
- Integration with Zenodo for citable DOIs

- git is a version control system that tracks changes to all the files within a project and synchronises those changes across multiple computers and contributors.
- GitHub is a web platform from Microsoft providing git integration with cloud hosting.
You do not need to be a git expert. Three commands cover most of everyday use:
git add . # stage all changes
git commit -m "Add life expectancy analysis and figures"
git push # send to GitHub
git setup
git config --global user.name "Your chosen user name"
git config --global user.email "you@domain.ext"
In RStudio: Tools → Global Options → Git/SVN → point to the git executable. For troubleshooting the git setup in RStudio, you can refer to the online book Happy Git with R.
Zenodo: a permanent DOI for your code (and optionally data)
Zenodo is a CERN-hosted repository that gives your code or data a DOI (Digital Object Identifier): a permanent, citable identifier. This is what goes in your paper's Data Availability statement.
How it works in three steps:
- Push your project to a public GitHub repository
- Connect GitHub at https://zenodo.org/account/settings/github
- Create a GitHub release (e.g. v1.0): Zenodo archives it automatically and issues a DOI
Your paper then cites:
> "All analysis code is available at https://doi.org/10.5281/zenodo.XXXXXXX"
The fact that it has a DOI and is hosted by CERN, with support from the European Commission, makes a Zenodo archive much more future-proof than a GitHub repository alone (and the two are not mutually exclusive).
4.3 🧑‍💻 Exercise 4: Self-Audit and Preparing to Share
Summary & wrap-up
What we learned:
- Adopting a clean, portable structure for easy navigation inside your project
- Running everything end-to-end with a single command
- Programmatically saving figures
- Locking your dependencies, so the environment is reproducible
- Including inline computed values
- Passing the standard of a journal reproducibility checklist
The one rule
Start small, start now. Reproducibility is an ideal: each small step forward brings you closer.
Key tools
| Tool | Problem it solves | One command to remember |
|---|---|---|
| RStudio Project | No more absolute paths; self-contained workspace | Open .Rproj |
| Quarto | Code + prose in one document; no copy-paste of values | quarto render |
| renv / venv | Exact package versions: "it worked last year" insurance | renv::snapshot() |
| here | Portable file paths that work on any OS | here::here("output", "fig.pdf") |
| ggplot2 / seaborn | Programmatic, consistent, exportable figures | ggsave() / fig.savefig() |
| Git + GitHub | Version history, collaboration, off-site backup | git add . && git commit && git push |
| Zenodo | Permanent DOI for code citation in papers | Connect GitHub in Zenodo settings |
Going further
| Resource | What for |
|---|---|
| The Turing Way | Comprehensive, community-maintained reproducible research guide |
| R for Data Science (2e) | Tidyverse, ggplot2, Quarto; free online book |
| What They Forgot to Teach You About R | Project organisation, naming, file paths |
| Quarto documentation | Everything Quarto |
| ggplot2 book (3e) | Deep dive into the grammar of graphics |
| Python Data Science Handbook | NumPy, Pandas, Matplotlib, scikit-learn |
| Hejblum et al. (2020) | Reproducible research for biostatistics |
| Hornung et al. (2026) | Overcoming computational reproducibility barriers |
| Ouvrir la Science (in English) | French open science guides and passports (MESRI) |