Basic principles in Biostatistics: likelihood and statistical thinking

Boris Hejblum

September 10^th, 2019

Motivational example

Birth weight study

Study on risks factor for low birth weights from Baystate medical center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]

- low: low birth weight indicator coded as: 1 (low birth weight, i.e. ) and 0 (normal birth weight)
- age: mother’s age (in years) at the time of birth
- lwt: mother’s weight (in pounds) at the time of the last menstrual period
- race: mother’s race coded as: 1 (white), 2 (black) and 3 (other)
- smoke: mother’s smoking status coded as 1 (smoker) and 0 (non-smoker)
- ptl: number of previous premature labours
- ht: hypertension history indicator coded as: 1(yes) and 0 (no)
- ui: uterine irritability indicator coded as: 1 (yes) and 0 (no)
- ftv: number of physician visits during the first trimester
- bwt: infant’s birth weight (in grams)

`R` primer

`R` 101 for M2 PHDS

launch Rstudio
open a new .Rmd file from the default template
compile it with the "Knit" button
discuss with your neighbor to understand each part of the script

Importing data

import the dataset in birthweight.txt (you can use the "Import Dataset" button from Rstudio
describe the data (use nice table outputs in Rmarkdown and ggplot2 graphics)

R basics

program a function that add the logarithm of two numbers
use browser() to debug your function when evaluating at negative arguments

Statistical modeling

Why do we do (bio)statistics ?

Statistics: summarizing information from experimental observations and quantifying the associated uncertainty

Always start with the research/scientific question !

How do we do (bio)statistics

ideally we would start from our scientific question, figure out what data we need to answer the question
usually the question is not very clear & the data have already been collected

Statistical Inference: we use a simple Generative Probabilistic model that could have generated the observations (Machine Learning sometimes reject this paradigm – cf. L. Breiman)

The likelihood

The likelihood is a fundamental building block of Biostatistics:

its maximization yields estimators that have good asymptocic (when the number of observations increases indefinitely) properties – no bias and smallest variance,
it also allows to perform significance testing of model parameters in many settings.

likelihood interpretation

The likelihood function quantifies how likely it is that a given (set of) observation(s) has been generated by our hypothesized Generative Probabilistic model.

The likelihood function is equal to the model joint probability distribution computed for the observations, and thus only a function of the model parameters.

Maximum Likelihood Estimator (MLE)

The idea of the MLE is to to optimize the likelihood function given the observations, by finding the model parameters that would give these observations the highest probability of being generated under the model.

seems like a reasonable and intuitive idea !

Bayesian statistics

Reverend Thomas Bayes proposed an alternative framework for statistical inference (actually before the “frequentist” method). It also relies on a probabilistic model through the likelihood function, but has different philosophical grounds than the frequentist.

To be continued…

Computational statistics

Computational statistics have become essential in modern statistics, with always bigger data, and always more sophisticated approaches.

Maximizing the likelihood

Maximizing the likelihood can easily be done analytically for simple linear models.

However: non-linear likelihoods are hard (sometimes impossible) to optimize analytically !

numerical optimization

Newton-Raphson optimizer

An algorithm to find values for which a function is zero.

Applied to the derivative of the likelihood function, this will identify the MLE
given that the log-likelihood is a concave function

Why always use the ?

Generally, we maximize the log-likelihood instead of the likelihood.

same thing ( is a monotonically increasing function)
products (e.g. likelihoods) become sums
sums are easier to compute & easier to derivate
more precision for small positives numbers (e.g. small probabilities)

trick

This is not taught often but it can come very handy if you are writing your own statistical/optimisation program:

Disclaimer: not useful today

Newton-Raphson optimizer intuition

start from an initial point
tangential linear approximation of in
get the next point as the solution
…

Repeat until

About the linear approximation at the step: - goes through - has slope

So it has the following equation: . Thus we find by setting , which gives us

Newton-Raphson optimizer example

#install.packages("animation")
library("animation")
newton.method(FUN = function(x) (x - 2)^2 - 1, init = 9.5, 
              rg = c(-1, 10), tol = 0.001, interact = FALSE, 
              col.lp = c("orange", "red3", "dodgerblue1"), 
              lwd=1.5)

Practicals

What is the prevalence of low birth weights in our data ?

propose a generative probabilistic model
define the parameter of interest
program the associated likelihood function
maximize this likelihood analytically

Brute force numerical optimization

Let’s Program a Newton-Raphson algorithm to maximise this likelihood numerically

write two funtions that computes the first and second derivatives of the log-likelihoodrespectively
write a Newton-Raphson function with 5 arguments (the first deriative of the function to maximize, its second derivative, the initial starting point, the tolerance, the maximum number of iterations)
use all three functions to compute the MLE of the low birthweight prevalence

Robustness and statistical inference

Try several initial values, different tolerance, and experiment with different stopping rules – i.e. convergence criterion – (you can combine them). Compare the results. Does it always work ?
Compute the p-value from the Wald test, the score test and the Likelihood Ratio Test respectively, and compare to the glm function output

Does the mother’s smoking status impacts the probability of low birth weight

Repeat the exercise to assess wether the mother’s smoking status impacts the probability of low birth weight