1 Motivational example

1.1 Birth weight study

Study on risk factors for low birth weight from the Baystate Medical Center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]

  • low: low birth weight indicator coded as: 1 (low birth weight, i.e. \(\leq 2.5kg\)) and 0 (normal birth weight)
  • age: mother’s age (in years) at the time of birth
  • lwt: mother’s weight (in pounds) at the time of the last menstrual period
  • race: mother’s race coded as: 1 (white), 2 (black) and 3 (other)
  • smoke: mother’s smoking status coded as 1 (smoker) and 0 (non-smoker)
  • ptl: number of previous premature labours
  • ht: hypertension history indicator coded as: 1(yes) and 0 (no)
  • ui: uterine irritability indicator coded as: 1 (yes) and 0 (no)
  • ftv: number of physician visits during the first trimester
  • bwt: infant’s birth weight (in grams)

2 R primer

2.1 R 101 for M2 PHDS

  1. launch RStudio
  2. open a new .Rmd file from the default template
  3. compile it with the "Knit" button
  4. discuss with your neighbor to understand each part of the script
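For reference, a minimal .Rmd skeleton (ours; the default template generates something similar):

---
title: "My first report"
output: html_document
---

## A section

Some text, followed by an R code chunk:

```{r}
summary(cars)
```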

2.2 Importing data

  1. import the dataset in birthweight_data.txt (you can use the "Import Dataset" button from RStudio)
  bw_data <- read.delim("birthweight_data.txt")
  2. briefly describe the data (use nice table outputs in R Markdown and ggplot2 graphics)
library(knitr)
bw_data$smoke <- as.factor(bw_data$smoke) # tell R that smoke is a binary factor and not a continuous variable (the default)
knitr::kable(bw_data[1:5,])
low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
  0   19  182     2      0    0   0   1    0  2523
  0   33  155     3      0    0   0   0    3  2551
  0   20  105     1      1    0   0   0    1  2557
  0   21  108     1      1    0   0   1    2  2594
  0   18  107     1      1    0   0   1    0  2600
library(ggplot2)
ggplot(bw_data) +
  geom_point(aes(y=bwt, x=lwt, color=smoke)) +
  geom_hline(aes(yintercept=2500), linetype=2) +
  xlab("Mother's weight (in pounds)") +
  ylab("Birth weight (in kilograms)") + 
  scale_color_discrete("Smoking status", labels=c("No", "Yes"))

2.3 R basics

  • program a function that adds the logarithms of two numbers
  • use browser() to debug your function when evaluating at negative arguments
mylog <- function(a, b){
  #browser()
  ans <- log(a) + log(b)
  return(ans)
}
mylog(a = 1, b = 1)
## [1] 0
mylog(a = 1, b = -1)
## Warning in log(b): NaNs produced
## [1] NaN
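Once the bug is located, a defensive variant can fail early with an informative error instead of silently propagating NaN (a sketch; the name mylog_safe is ours):

mylog_safe <- function(a, b){
  stopifnot(a > 0, b > 0) # log() is only defined for positive numbers
  log(a) + log(b)
}
mylog_safe(a = 1, b = -1) # error: b > 0 is not TRUE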

3 Statistical modeling

3.1 Why do we do (bio)statistics?

Statistics: summarizing information from experimental observations and quantifying the associated uncertainty

Always start with the research/scientific question!

3.2 How do we do (bio)statistics?

  • ideally we would start from our scientific question, figure out what data we need to answer the question
  • usually the question is not very clear & the data have already been collected

Statistical Inference: we use a simple Generative Probabilistic model that could have generated the observations (Machine Learning sometimes rejects this paradigm – cf. L. Breiman)

3.3 The likelihood

The likelihood is a fundamental building block of Biostatistics:

  • its maximization yields estimators with good asymptotic properties (as \(n\), the number of observations, grows indefinitely): no bias and the smallest variance,
  • it also makes it possible to perform significance testing of model parameters in many settings.

3.4 Likelihood interpretation

The likelihood function quantifies how likely it is that a given (set of) observation(s) has been generated by our hypothesized Generative Probabilistic model.

The likelihood function is equal to the model’s joint probability distribution evaluated at the observations, and is thus a function of the model parameters only.
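For \(n\) independent and identically distributed observations \(y_1, \dots, y_n\) with density (or probability mass function) \(f(\cdot\,; \theta)\), this reads:

\[ L(\theta | y_1, \dots, y_n) = \prod_{i=1}^n f(y_i; \theta) \]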

3.5 Maximum Likelihood Estimator (MLE)

The idea of the MLE is to optimize the likelihood function given the observations, by finding the model parameters that would give these observations the highest probability of being generated under the model.
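Formally:

\[ \widehat{\theta}_{MLE} = \underset{\theta}{\arg\max}\; L(\theta | y_1, \dots, y_n) \]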

Seems like a reasonable and intuitive idea!

3.6 Bayesian statistics

Reverend Thomas Bayes proposed an alternative framework for statistical inference (actually predating the “frequentist” one). It also relies on a probabilistic model through the likelihood function, but rests on different philosophical grounds from the frequentist approach.
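In a nutshell, inference then relies on the posterior distribution of the parameters, obtained from Bayes’ theorem by combining the likelihood with a prior distribution \(p(\theta)\):

\[ p(\theta | y) \propto L(\theta | y)\, p(\theta) \]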

To be continued…

4 Computational statistics

Computational statistics has become essential to modern statistical practice, with ever bigger data and ever more sophisticated methods.

4.1 Maximizing the likelihood

Maximizing the likelihood can easily be done analytically for simple linear models.

However, non-linear likelihoods are hard (sometimes impossible) to optimize analytically!

\(\Rightarrow\) numerical optimization

4.2 Newton-Raphson optimizer

An algorithm to find the values at which a function equals zero.

Applied to the first derivative of the likelihood function, it identifies the MLE, provided that the log-likelihood is a concave function.
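The update rule at iteration \(k\) then writes:

\[ \theta_{k+1} = \theta_k - \frac{\ell'(\theta_k)}{\ell''(\theta_k)} \]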

4.3 Why always use the \(\log\)?

Generally, we maximize the log-likelihood instead of the likelihood.

  • same thing (\(\log\) is a monotonically increasing function)
  • products \(\prod_{i=1}^n\) (e.g. likelihoods) become sums \(\sum_{i=1}^n\):
    sums are easier to compute and easier to differentiate
  • better precision for small positive numbers (e.g. small probabilities) – see the sketch below
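A quick numerical illustration of the last point (a minimal sketch): multiplying many small probabilities underflows to \(0\) in double precision, while summing their logarithms is perfectly stable.

p <- rep(1e-5, 100)  # 100 small probabilities
prod(p)              # 0: the true value, 1e-500, underflows
sum(log(p))          # -1151.293: no underflow on the log scale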

4.4 The log-sum-exp trick

This is not often taught, but it can come in very handy if you are writing your own statistical/optimisation program:

\[ \log \sum_{i=1}^n e^{x_i} = c + \log \sum_{i=1}^n e^{x_i-c} \]
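Taking \(c = \max_i x_i\) avoids underflow when all the \(x_i\) are very negative. A minimal R sketch (the name logsumexp is ours):

logsumexp <- function(x){
  c <- max(x)              # shift by the largest value
  c + log(sum(exp(x - c)))
}
x <- c(-1000, -1001)
log(sum(exp(x))) # -Inf: exp() underflows to 0
logsumexp(x)     # -999.6867: the correct value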

Disclaimer: not useful today

4.5 Newton-Raphson optimizer intuition

  1. start from an initial point \(x_0\)
  2. make a tangential linear approximation of \(f\) at \((x_0, f(x_0))\)
  3. get the next point \(x_1\) as the root of this linear approximation
  4. repeat until \(x_{n+1}\approx x_n\)

About the linear approximation at step \(n + 1\):

  • it goes through the point \((x_n, f(x_n))\)
  • it has slope \(f'(x_n)\)

So it has the following equation: \(y= f'(x_n)(x-x_n) + f(x_n)\). Thus we find \(x_{n+1}\) by setting \(y=0\), which gives us \(x_{n+1} = x_n -\frac{f(x_n)}{f'(x_n)}\).

4.6 Newton-Raphson optimizer example

#install.packages("animation") # run once to install the package
library("animation")
# animated illustration of Newton-Raphson applied to f(x) = (x - 2)^2 - 1, whose roots are 1 and 3
newton.method(FUN = function(x) (x - 2)^2 - 1, init = 9.5, 
              rg = c(-1, 10), tol = 0.001, interact = FALSE, 
              col.lp = c("orange", "red3", "dodgerblue1"), 
              lwd=1.5)

5 Practicals

5.1 What is the prevalence of low birth weights in our data?

  1. propose a generative probabilistic model

We assume that the low birth weight indicator of each infant follows an independent Bernoulli distribution with the same probability parameter \(\pi\):\[y_i\overset{iid}{\sim} Bernoulli(\pi)\] where \(y_i\) is \(1\) if a low birth weight is observed for the \(i\)-th baby, and \(0\) otherwise.

  2. define the parameter of interest

\(\pi\) is our parameter of interest and represents the probability of low birth weight (i.e. the prevalence).

  3. Write down the log-likelihood and its first derivative (by hand)

\(L(\pi | y_1, \dots, y_n) = \prod_{i=1}^n \pi^{y_i} (1-\pi)^{(1-y_i)}\)
\(\ell(\pi | y_1, \dots, y_n) = \sum_{i=1}^n y_i\log(\pi) + (1-y_i)\log(1-\pi)\)
\(\ell'(\pi | y_1, \dots, y_n) = \sum_{i=1}^n \dfrac{y_i}{\pi} - \dfrac{1-y_i}{1-\pi}\)

  4. Show that the maximum of this likelihood (where the derivative vanishes) is attained at \(\displaystyle\widehat{\pi}_{MLE} = \dfrac{1}{n}\sum_{i=1}^ny_i\)

\(\displaystyle\ell'(\widehat{\pi}_{MLE} | y_1, \dots, y_n) = 0\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n \dfrac{y_i}{\widehat{\pi}_{MLE}} = \sum_{i=1}^n \dfrac{1-y_i}{1-\widehat{\pi}_{MLE}}\)
\(\displaystyle\Leftrightarrow \dfrac{1}{\widehat{\pi}_{MLE}} \sum_{i=1}^n y_i = \dfrac{1}{1-\widehat{\pi}_{MLE}}\sum_{i=1}^n (1-y_i)\)
\(\displaystyle\Leftrightarrow \dfrac{1}{\widehat{\pi}_{MLE}} \sum_{i=1}^n y_i = \dfrac{n - \sum_{i=1}^n y_i}{1-\widehat{\pi}_{MLE}}\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n y_i - \widehat{\pi}_{MLE} \sum_{i=1}^n y_i = n\widehat{\pi}_{MLE} - \widehat{\pi}_{MLE}\sum_{i=1}^n y_i\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n y_i = n\widehat{\pi}_{MLE}\)
\(\displaystyle\Leftrightarrow \widehat{\pi}_{MLE} = \dfrac{1}{n}\sum_{i=1}^n y_i\)

mean(bw_data$low) # the sample mean is the MLE of the prevalence
## [1] 0.3121693

5.2 Brute force numerical optimization

Now let’s use R to program a Newton-Raphson algorithm to maximise this likelihood numerically.

  1. write an R function that computes either the likelihood or the log-likelihood for a probability p, with two additional arguments: i) obs, the vector of observations (bw_data$low by default), and ii) log, a logical (FALSE by default)
likelihood <- function(p, obs = bw_data$low, log=FALSE){
  
  n <- length(obs) #189
  
  res <- 1 # initial likelihood value
  res_log <- 0 # initial log-likelihood value
  
  for(i in 1:n){ # loop over all observations
    res <- res*p^obs[i]*(1-p)^(1-obs[i])
    res_log <- res_log + obs[i]*log(p) +(1-obs[i])*log(1-p)
  }
  
  if(log){
    return(res_log)
  }else{
    return(res)
  }
}
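For comparison, a vectorized alternative without an explicit loop (a sketch; the name likelihood_vec is ours), exploiting the fact that a Bernoulli distribution is a binomial with size 1:

likelihood_vec <- function(p, obs = bw_data$low, log = FALSE){
  # sum the Bernoulli log-densities for each candidate value of p
  ll <- sapply(p, function(pp) sum(dbinom(obs, size = 1, prob = pp, log = TRUE)))
  if(log) ll else exp(ll)
}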
  2. Write two R functions of p that compute the first and the second derivatives of the log-likelihood respectively (each with obs = bw_data$low as an additional argument)
first_deriv_loglik <- function(p, obs = bw_data$low){
  n <- length(obs) #189
  res <- 0 # initial derivative value
  for(i in 1:n){
    res <- res + obs[i]/p - (1-obs[i])/(1-p)
  }
  return(res)
}

second_deriv_loglik <- function(p, obs = bw_data$low){
  n <- length(obs) #189
  res <- 0 # initial derivative value
  for(i in 1:n){
    res <- res - obs[i]/p^2 - (1-obs[i])/(1-p)^2
  }
  return(res)
}
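Because the data only enter through \(\sum_{i=1}^n y_i\), both derivatives also have loop-free closed forms (a sketch; the names are ours):

first_deriv_loglik2 <- function(p, obs = bw_data$low){
  sum(obs)/p - (length(obs) - sum(obs))/(1 - p)
}
second_deriv_loglik2 <- function(p, obs = bw_data$low){
  -sum(obs)/p^2 - (length(obs) - sum(obs))/(1 - p)^2
}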
  3. plot these 4 functions (the likelihood, the log-likelihood, and the first and second derivatives of the log-likelihood)
par(mfrow = c(2,2)) # splitting the plot window into 4 subpanels
xvalues <- seq(from = 0.01, to = 0.99, by = 0.001)
plot(x = xvalues, y = likelihood(xvalues),
     type="l",
     ylab = "likelihood", xlab="p",
     main = "Bernoulli likelihood of a low birth weight")
abline(v=mean(bw_data$low), col="red")

plot(x = xvalues, y = likelihood(xvalues, log = TRUE),
     type="l",
     ylab = "log-likelihood", xlab="p",
     main = "Bernoulli log-likelihood of a low birth weight")
abline(v=mean(bw_data$low), col="red")
abline(h=likelihood(mean(bw_data$low), log=TRUE), col="red", lty=2)
plot(x = xvalues, y = first_deriv_loglik(xvalues),
     type="l",
     ylab = "log-likelihood first derivative", xlab="p",
     main = "Bernoulli modeling for a low birth weight")
abline(v=mean(bw_data$low), col="red")
abline(h=0, col="red", lty=2)
plot(x = xvalues, y = second_deriv_loglik(xvalues),
     type="l",
     ylab = "log-likelihood first derivative", xlab="p",
     main = "Bernoulli modeling for a low birth weight"
)
abline(v=mean(bw_data$low), col="red")

  4. write a Newton-Raphson function with 5 arguments (the first derivative of the function to maximize, its second derivative, the initial starting point, the tolerance, and the maximum number of iterations)
newton_raphson <- function(f, fprime, init, tol=0.001, itermax=20, ...) {
    
  # initialization ----
    x_new <- init
    diff <- tol + 1  # ensure that the initialization is entering the while loop
    iter <- 0
  
  # updates ----
    while (diff > tol & iter < itermax) { 
      # as long as diff is greater than the required precision AND the maximum number of iterations has not been reached
        iter <- iter + 1
        x_current <- x_new # store the current value
        x_new <- x_current - f(x_current, ...)/fprime(x_current, ...) #update to the new value
        diff <- abs(x_new - x_current) #compute the difference to assess convergence
        #diff <- abs(f(x_new) - f(x_current)) # a different convergence criterion
        cat(iter, " x_new:", x_new, " diff:", diff, "\n") #print the current step
    }
    
    if(iter == itermax){
      warning("Maximum number of iteration reached") 
      # throw a warning if the maximum number of iteration is reached
    }
  
    return(x_new)
}
  5. use all three functions to compute the MLE of the low birth weight prevalence
pi_opt <- newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik, 
                         init = 0.7, tol=0.01)
## 1  x_new: 0.4769454  diff: 0.2230546 
## 2  x_new: 0.3069933  diff: 0.169952 
## 3  x_new: 0.3121212  diff: 0.005127827
mean(bw_data$low)
## [1] 0.3121693

5.3 Robustness and statistical inference

  1. Try several initial values and tolerances, and experiment with different stopping rules – i.e. convergence criteria (you can combine them). Compare the results. Does it always work?
newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik, 
                         init = 0.01, tol=0.01)
## 1  x_new: 0.01977524  diff: 0.009775238
## [1] 0.01977524

Here the very first step from init = 0.01 happens to be smaller than tol = 0.01, so the stopping rule fires immediately and the algorithm returns a value far from the MLE: a convergence criterion based solely on the step size can be fooled.
newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik, 
                         init = 0.01, tol=0.0001)
## 1  x_new: 0.01977524  diff: 0.009775238 
## 2  x_new: 0.0386545  diff: 0.01887927 
## 3  x_new: 0.07375934  diff: 0.03510483 
## 4  x_new: 0.1337386  diff: 0.05997927 
## 5  x_new: 0.2175799  diff: 0.08384125 
## 6  x_new: 0.2895745  diff: 0.0719946 
## 7  x_new: 0.311171  diff: 0.0215966 
## 8  x_new: 0.3121676  diff: 0.0009965095 
## 9  x_new: 0.3121693  diff: 1.752721e-06
## [1] 0.3121693
  2. Considering the null hypothesis \(H_0: \pi=0.5\), compute the p-values for: i) the Wald test, ii) the score test, and iii) the Likelihood Ratio Test respectively, and compare them to the glm and anova function outputs.

  3. Comment on the AIC and the 95% confidence interval

# logit and expit are bijective functions mapping the probability space ]0, 1[ to the real line ]-inf, +inf[ (logit) and back (expit)
logit <- function(p){
  log(p/(1-p))
}
expit <- function(x){
  exp(x)/(1+exp(x))
}
# logistic regression with just an intercept
my_logistic_reg <- glm(low~1, data=bw_data, family = "binomial")
summary(my_logistic_reg)
## 
## Call:
## glm(formula = low ~ 1, family = "binomial", data = bw_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8651  -0.8651  -0.8651   1.5259   1.5259  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.790      0.157  -5.033 4.84e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 234.67  on 188  degrees of freedom
## AIC: 236.67
## 
## Number of Fisher Scoring iterations: 4
expit(my_logistic_reg$coefficients)
## (Intercept) 
##   0.3121693
var_pi_opt <- -1/second_deriv_loglik(pi_opt) # inverse of the observed information at the MLE

# Wald test: squared standardized distance between the MLE and the null value
pval_Wald <- pchisq((pi_opt-0.5)^2/var_pi_opt, df=1, lower.tail = FALSE)

# score test: squared score over the observed information, both evaluated at the null value
pval_score <- pchisq(first_deriv_loglik(0.5)^2/-second_deriv_loglik(0.5), df=1, lower.tail = FALSE)


-2*likelihood(pi_opt, log = TRUE)
## [1] 234.672
pval_LRT <- pchisq(-2*(likelihood(0.5, log=TRUE)-likelihood(pi_opt, log=TRUE)), 
                   df=1, lower.tail = FALSE)
anova(glm(low~1, data=bw_data, family = "binomial"), # model with just the intercept, i.e. the average prevalence
      glm(low~0, data=bw_data, family = "binomial"), # null model with no parameter, which fixes pi at expit(0) = 0.5
      test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: low ~ 1
## Model 2: low ~ 0
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       188     234.67                          
## 2       189     262.01 -1  -27.338 1.709e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pi_opt + c(-1, 1)*1.96*sqrt((pi_opt*(1-pi_opt))/189) # 95% normal confidence interval
## [1] 0.2460605 0.3781818

5.4 Does the mother’s smoking status impact the probability of low birth weight?

  1. BONUS: repeat the exercise to assess whether the mother’s smoking status impacts the probability of low birth weight
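A possible starting point (a sketch, not a full solution): add smoke as a covariate in the logistic regression and compare it to the intercept-only model fitted above with a likelihood ratio test.

my_logistic_reg_smoke <- glm(low ~ smoke, data = bw_data, family = "binomial")
summary(my_logistic_reg_smoke)
anova(my_logistic_reg, my_logistic_reg_smoke, test = "LRT")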