Study on risk factors for low birth weight from the Baystate Medical Center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]
- `low`: low birth weight indicator coded as 1 (low birth weight, i.e. \(\leq 2.5\) kg) and 0 (normal birth weight)
- `age`: mother's age (in years) at the time of birth
- `lwt`: mother's weight (in pounds) at the time of the last menstrual period
- `race`: mother's race coded as 1 (white), 2 (black) and 3 (other)
- `smoke`: mother's smoking status coded as 1 (smoker) and 0 (non-smoker)
- `ptl`: number of previous premature labours
- `ht`: hypertension history indicator coded as 1 (yes) and 0 (no)
- `ui`: uterine irritability indicator coded as 1 (yes) and 0 (no)
- `ftv`: number of physician visits during the first trimester
- `bwt`: infant's birth weight (in grams)
- Work through the R primer "R 101 for M2 PHDS".
- In Rstudio, create a new `.Rmd` file from the default template and compile it with the "Knit" button.
- Import the `birthweight.txt` data (you can use the "Import Dataset" button from Rstudio):
bw_data <- read.delim("birthweight_data.txt")
(report your analyses using Rmarkdown and ggplot2 graphics)

library(knitr)
bw_data$smoke <- as.factor(bw_data$smoke) # tell R that smoke is a binary factor and not a continuous variable (the default)
knitr::kable(bw_data[1:5,])
low | age | lwt | race | smoke | ptl | ht | ui | ftv | bwt |
---|---|---|---|---|---|---|---|---|---|
0 | 19 | 182 | 2 | 0 | 0 | 0 | 1 | 0 | 2523 |
0 | 33 | 155 | 3 | 0 | 0 | 0 | 0 | 3 | 2551 |
0 | 20 | 105 | 1 | 1 | 0 | 0 | 0 | 1 | 2557 |
0 | 21 | 108 | 1 | 1 | 0 | 0 | 1 | 2 | 2594 |
0 | 18 | 107 | 1 | 1 | 0 | 0 | 1 | 0 | 2600 |
library(ggplot2)
ggplot(bw_data) +
geom_point(aes(y=bwt, x=lwt, color=smoke)) +
geom_hline(aes(yintercept=2500), linetype=2) +
xlab("Mother's weight (in pounds)") +
ylab("Birth weight (in kilograms)") +
scale_color_discrete("Smoking status", labels=c("No", "Yes"))
Use browser() to debug your function when evaluating it at negative arguments:

mylog <- function(a, b){
#browser()
ans <- log(a) + log(b)
return(ans)
}
mylog(a = 1, b = 1)
## [1] 0
mylog(a = 1, b = -1)
## Warning in log(b): NaNs produced
## [1] NaN
Statistics: summarizing information from experimental observations and quantifying the associated uncertainty
Always start with the research/scientific question!
Statistical Inference: we use a simple Generative Probabilistic model that could have generated the observations (Machine Learning sometimes rejects this paradigm, cf. L. Breiman)
The likelihood is a fundamental building block of Biostatistics:
The likelihood function quantifies how likely it is that a given (set of) observation(s) has been generated by our hypothesized Generative Probabilistic model.
The likelihood function is equal to the model's joint probability distribution evaluated at the observations, and is thus a function of the model parameters only.
The idea of the MLE is to optimize the likelihood function given the observations, i.e. to find the model parameters that would give these observations the highest probability of being generated under the model.
Seems like a reasonable and intuitive idea!
Reverend Thomas Bayes proposed an alternative framework for statistical inference (actually before the “frequentist” method). It also relies on a probabilistic model through the likelihood function, but has different philosophical grounds than the frequentist.
To be continued…
Computational statistics has become essential in modern statistics, with ever bigger data and ever more sophisticated approaches.
Maximizing the likelihood can easily be done analytically for simple linear models.
However, non-linear likelihoods are hard (sometimes impossible) to optimize analytically!
\(\Rightarrow\) numerical optimization
An algorithm to find values for which a function is zero.
Applied to the first derivative of the (log-)likelihood, this identifies the MLE (given that the log-likelihood is a concave function).
Generally, we maximize the log-likelihood instead of the likelihood.
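One practical reason (an illustrative aside using simulated data, not the study data): with many observations the likelihood is a product of many small probabilities and underflows double precision, whereas the log-likelihood stays well behaved; since the log is monotone, both have the same maximiser.

fake_obs <- rbinom(n = 2000, size = 1, prob = 0.3) # hypothetical data, for illustration only
prod(dbinom(fake_obs, size = 1, prob = 0.3))       # underflows to 0 in double precision
sum(dbinom(fake_obs, size = 1, prob = 0.3, log = TRUE)) # finite, usable log-likelihood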
The following log-sum-exp trick is not often taught, but it can come in very handy if you are writing your own statistical/optimisation program:
\[ \log \sum_{i=1}^n e^{x_i} = c + \log \sum_{i=1}^n e^{x_i-c} \]
Disclaimer: not useful today
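As a quick illustration (a minimal sketch; the helper log_sum_exp below is not part of the original code), shifting by the maximum avoids the underflow of the naive computation:

x <- c(-1000, -1001, -1002) # e.g. log-probabilities far below log(.Machine$double.xmin)
log(sum(exp(x)))            # -Inf: exp() underflows to 0
log_sum_exp <- function(x){
  m <- max(x)               # shift by the maximum before exponentiating
  m + log(sum(exp(x - m)))
}
log_sum_exp(x)              # finite answer, approximately -999.59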
The linear approximation at step \(n+1\):

- goes through the point \((x_n, f(x_n))\)
- has slope \(f'(x_n)\)

So it has the following equation: \(y = f'(x_n)(x - x_n) + f(x_n)\). We find \(x_{n+1}\) by setting \(y = 0\), which gives \(x_{n+1} = x_n - \dfrac{f(x_n)}{f'(x_n)}\).
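For instance, with the function used in the animation below, \(f(x) = (x-2)^2 - 1\) and \(x_0 = 9.5\), one step gives:

\[ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 9.5 - \frac{(9.5-2)^2 - 1}{2\,(9.5-2)} = 9.5 - \frac{55.25}{15} \approx 5.82 \]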
#install.packages("animation")
library("animation")
newton.method(FUN = function(x) (x - 2)^2 - 1, init = 9.5,
rg = c(-1, 10), tol = 0.001, interact = FALSE,
col.lp = c("orange", "red3", "dodgerblue1"),
lwd=1.5)
We assume that the low birth weight indicator of each infant follows an independent Bernoulli distribution with the same probability parameter \(\pi\):\[y_i\overset{iid}{\sim} Bernoulli(\pi)\] where \(y_i\) is \(1\) if a low birth weight is observed for the \(i\)th baby, and \(0\) otherwise.
\(\pi\) is our parameter of interest and represents the probability of low birth weight (i.e. the prevalence).
\(L(\pi | y_1, \dots, y_n) = \prod_{i=1}^n \pi^{y_i} (1-\pi)^{(1-y_i)}\)

\(\ell(\pi | y_1, \dots, y_n) = \sum_{i=1}^n y_i\log(\pi) + (1-y_i)\log(1-\pi)\)

\(\ell'(\pi | y_1, \dots, y_n) = \sum_{i=1}^n \dfrac{y_i}{\pi} - \dfrac{1-y_i}{1-\pi}\)
\(\displaystyle\ell'(\widehat{\pi}_{MLE} | y_1, \dots, y_n) = 0\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n \dfrac{y_i}{\widehat{\pi}_{MLE}} = \sum_{i=1}^n \dfrac{1-y_i}{1-\widehat{\pi}_{MLE}}\)
\(\displaystyle\Leftrightarrow \dfrac{1}{\widehat{\pi}_{MLE}} \sum_{i=1}^n y_i = \dfrac{1}{1-\widehat{\pi}_{MLE}}\sum_{i=1}^n (1-y_i)\)
\(\displaystyle\Leftrightarrow \dfrac{1}{\widehat{\pi}_{MLE}} \sum_{i=1}^n y_i = \dfrac{n - \sum_{i=1}^n y_i}{1-\widehat{\pi}_{MLE}}\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n y_i - \widehat{\pi}_{MLE} \sum_{i=1}^n y_i = n\widehat{\pi}_{MLE} - \widehat{\pi}_{MLE}\sum_{i=1}^n y_i\)
\(\displaystyle\Leftrightarrow \sum_{i=1}^n y_i = n\widehat{\pi}_{MLE}\)
\(\displaystyle\Leftrightarrow \widehat{\pi}_{MLE} = \dfrac{1}{n}\sum_{i=1}^n y_i = \bar{y}\)

i.e. the MLE is simply the observed proportion of low birth weights:
mean(bw_data$low)
## [1] 0.3121693
Now let's use R to program a Newton-Raphson algorithm to maximise this likelihood numerically.

Write an R function that computes either the likelihood or the log-likelihood for a probability p, with two additional arguments: i) obs, the vector of observations (bw_data$low by default), and ii) log, a logical (FALSE by default):

likelihood <- function(p, obs = bw_data$low, log=FALSE){
n <- length(obs) #189
res <- 1 # initial likelihood value
res_log <- 0 # initial log-likelihood value
for(i in 1:n){ # loop over all observations
res <- res*p^obs[i]*(1-p)^(1-obs[i])
res_log <- res_log + obs[i]*log(p) +(1-obs[i])*log(1-p)
}
if(log){
return(res_log)
}else{
return(res)
}
}
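As a quick sanity check (not part of the original exercise), the hand-written log-likelihood can be compared against R's built-in Bernoulli density dbinom():

likelihood(0.5, log = TRUE) # hand-written version
sum(dbinom(bw_data$low, size = 1, prob = 0.5, log = TRUE)) # should return the same value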
Then write two R functions of p that compute the first and the second derivatives of the log-likelihood respectively (each with obs = bw_data$low as an additional argument):

first_deriv_loglik <- function(p, obs = bw_data$low){
n <- length(obs) #189
res <- 0 # initial derivative value
for(i in 1:n){
res <- res + obs[i]/p - (1-obs[i])/(1-p)
}
return(res)
}
second_deriv_loglik <- function(p, obs = bw_data$low){
n <- length(obs) #189
res <- 0 # initial derivative value
for(i in 1:n){
res <- res - obs[i]/p^2 - (1-obs[i])/(1-p)^2
}
return(res)
}
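A finite-difference check (an optional addition, not part of the exercise) is an easy way to catch sign or algebra mistakes in hand-coded derivatives:

p0 <- 0.3
h <- 1e-6
(likelihood(p0 + h, log = TRUE) - likelihood(p0 - h, log = TRUE)) / (2 * h) # central difference
first_deriv_loglik(p0) # should be very close to the line above
(first_deriv_loglik(p0 + h) - first_deriv_loglik(p0 - h)) / (2 * h) # same idea for the second derivative
second_deriv_loglik(p0) # should be very close to the line above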
par(mfrow = c(2,2)) # splitting the plot window into 4 subpanels
xvalues <- seq(from = 0.01, to = 0.99, by = 0.001)
plot(x = xvalues, y = likelihood(xvalues),
type="l",
ylab = "likelihood", xlab="p",
main = "Bernoulli likelihood of a low birth weight")
abline(v=mean(bw_data$low), col="red")
plot(x = xvalues, y = likelihood(xvalues, log = TRUE),
type="l",
ylab = "log-likelihood", xlab="p",
main = "Bernoulli log-likelihood of a low birth weight")
abline(v=mean(bw_data$low), col="red")
abline(h=likelihood(mean(bw_data$low), log=TRUE), col="red", lty=2)
plot(x = xvalues, y = first_deriv_loglik(xvalues),
type="l",
ylab = "log-likelihood first derivative", xlab="p",
main = "Bernoulli modeling for a low birth weight")
abline(v=mean(bw_data$low), col="red")
abline(h=0, col="red", lty=2)
plot(x = xvalues, y = second_deriv_loglik(xvalues),
type="l",
ylab = "log-likelihood first derivative", xlab="p",
main = "Bernoulli modeling for a low birth weight"
)
abline(v=mean(bw_data$low), col="red")
newton_raphson <- function(f, fprime, init, tol=0.001, itermax=20, ...) {
# initialization ----
x_new <- init
diff <- tol + 1 # ensure that the initialization is entering the while loop
iter <- 0
# updates ----
while (diff > tol & iter < itermax) {
# as long as diff is greater than the required precision AND the maximum number of iterations has not been reached
iter <- iter + 1
x_current <- x_new # store the current value
x_new <- x_current - f(x_current, ...)/fprime(x_current, ...) #update to the new value
diff <- abs(x_new - x_current) #compute the difference to assess convergence
#diff <- abs(f(x_new) - f(x_current)) # a different convergence criterion
cat(iter, " x_new:", x_new, " diff:", diff, "\n") #print the current step
}
if(iter == itermax){
warning("Maximum number of iterations reached")
# throw a warning if the maximum number of iterations is reached
}
return(x_new)
}
pi_opt <- newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik,
init = 0.7, tol=0.01)
## 1 x_new: 0.4769454 diff: 0.2230546
## 2 x_new: 0.3069933 diff: 0.169952
## 3 x_new: 0.3121212 diff: 0.005127827
mean(bw_data$low)
## [1] 0.3121693
newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik,
init = 0.01, tol=0.01)
## 1 x_new: 0.01977524 diff: 0.009775238
## [1] 0.01977524

With such a loose tolerance, the algorithm stops after the very first (small) step, far from the MLE; tightening the tolerance fixes this:
newton_raphson(f = first_deriv_loglik, fprime = second_deriv_loglik,
init = 0.01, tol=0.0001)
## 1 x_new: 0.01977524 diff: 0.009775238
## 2 x_new: 0.0386545 diff: 0.01887927
## 3 x_new: 0.07375934 diff: 0.03510483
## 4 x_new: 0.1337386 diff: 0.05997927
## 5 x_new: 0.2175799 diff: 0.08384125
## 6 x_new: 0.2895745 diff: 0.0719946
## 7 x_new: 0.311171 diff: 0.0215966
## 8 x_new: 0.3121676 diff: 0.0009965095
## 9 x_new: 0.3121693 diff: 1.752721e-06
## [1] 0.3121693
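The same optimum can be recovered with base R's general-purpose routines (a cross-check, not part of the exercise): optimize() maximises the log-likelihood directly, and uniroot() finds the zero of its first derivative.

optimize(likelihood, interval = c(0.01, 0.99), maximum = TRUE, log = TRUE)$maximum
uniroot(first_deriv_loglik, interval = c(0.01, 0.99))$root # both should be close to mean(bw_data$low)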
Considering the null hypothesis \(H_0: \pi=0.5\), compute the p-value for: i) the Wald test, ii) the score test, and iii) the Likelihood Ratio Test, and compare with the output of the glm and anova functions.
Comment on the AIC and the 95% confidence interval.
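As a reminder, for a scalar parameter with null value \(\pi_0 = 0.5\), the three statistics are compared to a \(\chi^2_1\) distribution:

\[ W = \frac{(\widehat{\pi}_{MLE} - \pi_0)^2}{\widehat{Var}(\widehat{\pi}_{MLE})} \qquad S = \frac{\ell'(\pi_0)^2}{-\ell''(\pi_0)} \qquad LRT = -2\left(\ell(\pi_0) - \ell(\widehat{\pi}_{MLE})\right) \]

with \(\widehat{Var}(\widehat{\pi}_{MLE}) = -1/\ell''(\widehat{\pi}_{MLE})\).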
# logit and expit are bijective functions used to go from the real space ]-inf, +inf[ to the probability space [0,1]
logit <- function(p){
log(p/(1-p))
}
expit <- function(x){
exp(x)/(1+exp(x))
}
# logistic regression with just an intercept
my_logistic_reg <- glm(low~1, data=bw_data, family = "binomial")
summary(my_logistic_reg)
##
## Call:
## glm(formula = low ~ 1, family = "binomial", data = bw_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8651 -0.8651 -0.8651 1.5259 1.5259
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.790 0.157 -5.033 4.84e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 234.67 on 188 degrees of freedom
## AIC: 236.67
##
## Number of Fisher Scoring iterations: 4
expit(my_logistic_reg$coefficients)
## (Intercept)
## 0.3121693
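Conversely, as a quick cross-check (not in the original code), the logit of the sample prevalence recovers the fitted intercept:

logit(mean(bw_data$low)) # should match the glm intercept, about -0.79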
var_pi_opt <- -1/second_deriv_loglik(pi_opt) # variance estimate of the MLE from the observed information
pval_Wald <- pchisq((pi_opt-0.5)^2/var_pi_opt, df=1, lower.tail = FALSE)
pval_score <- pchisq(first_deriv_loglik(0.5)^2/-second_deriv_loglik(0.5), df=1, lower.tail = FALSE)
-2*likelihood(pi_opt, log = TRUE)
## [1] 234.672
pval_LRT <- pchisq(-2*(likelihood(0.5, log=TRUE)-likelihood(pi_opt, log=TRUE)),
df=1, lower.tail = FALSE)
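Displaying the three p-values side by side (not part of the original code); the LRT p-value should match the Pr(>Chi) column of the anova() output below:

c(Wald = pval_Wald, score = pval_score, LRT = pval_LRT)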
anova(glm(low~1, data=bw_data, family = "binomial"), # model with just the intercept, i.e. the average prevalence
glm(low~0, data=bw_data, family = "binomial"), # null model
test = "LRT")
## Analysis of Deviance Table
##
## Model 1: low ~ 1
## Model 2: low ~ 0
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 188 234.67
## 2 189 262.01 -1 -27.338 1.709e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pi_opt + c(-1, 1)*1.96*sqrt((pi_opt*(1-pi_opt))/189) # 95% normal confidence interval
## [1] 0.2460605 0.3781818
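For comparison (a sketch, not part of the original solution), the Wald interval from the glm fit lives on the logit scale and can be mapped back to the probability scale with expit(); it should be close to the normal-approximation interval computed above.

expit(confint.default(my_logistic_reg)) # Wald 95% CI for the intercept, back-transformed to a probability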