September 18th, 2019

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Introduction

Linear model

Linear regression is also called the linear model:

  • simple linear model: \(Y\) and \(X\)
  • multiple linear model: \(Y\) and \(X_1\), \(X_2\), …, \(X_p\)
  • generalized linear models (and in particular logistic regression): \(Y\) and \(X_1\), \(X_2\), …, \(X_p\) when \(Y\) is not Gaussian

Motivation example: Birth weight study

Study on risk factors for low birth weight from the Baystate Medical Center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]

- low: low birth weight indicator coded as: 1 (low birth weight, i.e. \(\leq 2.5\) kg) and 0 (normal birth weight)
- age: mother's age (in years) at the time of birth
- lwt: mother's weight (in pounds) at the time of the last menstrual period
- race: mother's race coded as: 1 (white), 2 (black) and 3 (other)
- smoke: mother's smoking status coded as 1 (smoker) and 0 (non-smoker)
- ptl: number of previous premature labours
- ht: hypertension history indicator coded as: 1 (yes) and 0 (no)
- ui: uterine irritability indicator coded as: 1 (yes) and 0 (no)
- ftv: number of physician visits during the first trimester
- bwt: infant's birth weight (in grams)

Correlation

Quantifies the relationship between 2 continuous variables \(X\) and \(Y\)

Correlation

Are \(X\) and \(Y\) values linked?

A few definitions:

  • \(cor(X, Y) = \dfrac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}\)
  • \(Var(X) = E[(X-E(X))^2]\)
  • \(Cov(X,Y)=E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y)\)
  • \(E(X) = \sum_{x}x\mathbb{P}(X=x)\)

Estimations for correlation

Those quantities are estimated over a sample \((x_1, \dots, x_n)\) by:

  • \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
  • \(s^2_x = \dfrac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}\)
  • \(s_{xy} = \dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{n-1}\)
  • \(r=\frac{s_{xy}}{\sqrt{s^2_{x}s^2_{y}}}\)

NB: the correlation is a scaled version of the covariance,
it is thus an association measure between \(-1\) and \(1\)
(the closer to 0, the weaker the link)

Correlation between the mother's age and the birth weight

Here, the correlation between the mother's age and the birth weight is \(0.09\)
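
A minimal sketch of this computation in R (assuming birthweight_data.txt is a whitespace-delimited file with a header row and the columns listed above; the name bw is ours):

    # assumption: birthweight_data.txt is whitespace-delimited with a header row
    bw <- read.table("birthweight_data.txt", header = TRUE)

    # Pearson correlation between the mother's age and the birth weight
    cor(bw$age, bw$bwt)        # point estimate (about 0.09 here)
    cor.test(bw$age, bw$bwt)   # adds a test and a confidence interval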

Other correlation and association measures

This is the linear correlation, also called Pearson correlation coefficient.

Many other measures have been proposed to explore and quantify the link between two variables:

  • Spearman correlation
  • Mutual Information
  • Maximal Information Coefficient
  • …

Simple Linear Regression

Also quantifies the relationship between 2 continuous variables \(X\) and \(Y\)

Average birth weight by mother's age

Linear Model

It all started with a scatter plot…

\(\Rightarrow\) Regression, a.k.a. linear model:
what is the optimal line going through this scatter plot?

“Best” line?

“Best” line?

A Line equation

\[y = a + bx\]

We can write such an equation for relating \(y_i\) values to \(x_i\) values:

\[y_i = \beta_0 + \color{blue}{\beta_1} x_i + \color{red}{\varepsilon_i}\]

  • \(\beta_0\) is interpreted as the average value of \(Y\) when \(x_i\) is 0
  • \(\beta_1\) represents the slope of the regression line
  • \(\varepsilon_i\) is the random error separating the regression line from the observed value

Conditional distribution

The regression problem can be formalized as:

How do we characterize the conditional distribution of the random variable \(Y\) given the value \(x\) of the random variable \(X\)?

\[Y|x \;\sim\; ?\] using:

  • its expected value \(E(Y|x)\)
  • its variance \(Var(Y|x)\)
  • its probability distribution

Regression equation

The simple linear model can be written as:

\[Y|x\;\sim\;\mathcal{N}(\beta_0 + \color{blue}{\beta_1}x,\; \color{red}{\sigma^2})\]

The regression equation then reads: \[E(Y|x) = \beta_0 +\beta_1 x\]

Hypotheses of this linear model

  1. linearity of the relationship between \(Y\) and \(X\)
  2. homoskedasticity of the observations \(y_i\) (same variance across observations)
  3. normality of the observations \(y_i\)
  4. independence of the \(y_i\) across observations

Linearity hypothesis

When \(X\) increases by 1 unit, \(Y\) increases on average by \(\beta_1\) units,
regardless of the initial value of \(X\)

\(\begin{align*} E(Y_i|X_i=a+1) - E(Y_i|X_i=a) &= [\beta_0 +\beta_1(a+1)]\\&\quad-[\beta_0 +\beta_1a]\\ &=\beta_1 \end{align*}\)

Homoskedasticity hypothesis

\(Var(Y_i|x_i) = \sigma^2\) for all \(x_i\)
usually not enough data to assess numerically – instead a graphical assessment can be performed

Ex: for any motherโ€™s age (\(x\)), the conditional birth weight (\(Y|x\)) must have the same variance

Normality hypothesis

\(Y_i|x_i \sim \mathcal{N}\) for all \(x_i\)
usually not enough data to assess – instead a graphical assessment can be performed on the marginal histogram of \(Y\) (instead of \(Y|x\))

Ex: for any motherโ€™s age (\(x\)), the conditional birth weight (\(Y|x\)) must have a Gaussian distribution

Independence hypothesis

\(Cov(Y_i, Y_j) = 0\) for all \(i \neq j\)
Such a hypothesis cannot be checked directly on the data, but must be assessed from what we know from the data collection process

Ex: birth weights of the 189 infants are independent, assuming there are no twins (or even siblings) among them

Random errors & residuals

  • \(E(Y_i|x_i) = \beta_0 + \beta_1 x_i \quad (i = 1, \dots, n)\)
  • \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, \dots, n)\)

where \(\beta_0 + \beta_1 x_i\) is the deterministic part
and \(\varepsilon_i\) is the random part: \[\varepsilon_i=Y_i-E(Y_i|x_i)=Y_i-(\beta_0 + \beta_1 x_i) \enspace\]

the \(\varepsilon_i\) are called the random errors; their post-fit estimates \(\hat{\varepsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\) are called the residuals

Random errors hypothesis

The four hypotheses of the linear model can be expressed with respect to the random errors:

  • independence of the \(\varepsilon_i\) \(\Rightarrow\) independence of \(Y_i\)
  • \(E(\varepsilon_i)=0\) \(\Rightarrow\) linearity hypothesis
  • \(Var(\varepsilon_i)=\sigma^2\) \(\Rightarrow\) \(Var(Y_i|x_i)=\sigma^2\) (homoskedasticity)
  • normality of the \(\varepsilon_i\) \(\Rightarrow\) normality of the \(Y_i|x_i\).

Summary of the hypotheses

  • Linearity: \(E(Y_i|X_i=x_i) = E(Y_i|x_i) = \beta_0 + \beta_1 x_i\) or \(E(\varepsilon_i)=0\) \((i = 1, \dots, n)\)
    ✅ check: \(X \times Y\) scatter plot + post-fit residuals

  • Homoskedasticity: \(Var(Y_i|x_i) = \sigma^2\) or \(Var(\varepsilon_i)=\sigma^2\) for all \(i\)
    ✅ check: \(X \times Y\) scatter plot + post-fit residuals

  • Normality: \(Y_i|x_i \sim \mathcal{N}\) for any \(x_i\) or \(\varepsilon_i \sim \mathcal{N}\) for all \(i\)
    ✅ check: histogram of \(Y\) before estimation + post-fit residuals

  • Independence: \(cov(Y_i, Y_j) = 0\) or \(cov(\varepsilon_i, \varepsilon_j) = 0\) for all pairs \(i \neq j\)
    ✅ check: context

linear model and Ordinary Least Squares (OLS)

One way to define the “best” line is as the line minimizing the sum of squared errors.

The Least Squares criterion minimizes: \[\Phi(\beta_0,\beta_1)=\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2\]

OLS estimates

Minimizing \(\Phi(\beta_0,\beta_1)\) gives \(\hat{\beta}_0\) and \(\hat{\beta}_1\) (obtained by setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) to 0):

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^n (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta_1}\bar{x}\]

with \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\) and \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\)

Another possible formula is: \(\hat{\beta}_1 = \frac{\overline{xy} - \bar{x} \bar{y}} {\overline{x^2} - \bar{x}^2}\)
with \(\overline{xy} = \frac{1}{n} \sum_{i=1}^n x_i y_i\) and \(\overline{x^2} = \frac{1}{n} \sum_{i=1}^n x^2_i\)
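
These closed-form estimates are easy to verify numerically, e.g. with \(x\) the mother's age and \(y\) the birth weight (a sketch reusing the bw data.frame loaded earlier):

    # OLS estimates from the closed-form formulas (x = age, y = bwt)
    x <- bw$age; y <- bw$bwt
    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)

    # equivalent "moments" formula for the slope
    b1_alt <- (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)

    c(b0, b1)   # matches coef(lm(bwt ~ age, data = bw))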

OLS estimators properties

It can be shown that, as \(n\) (the number of observations) gets larger:

  • \(\hat{\beta_0} \sim \mathcal{N}\left(\beta_0, \sigma^2_{\hat{\beta_0}}\right)\) \(\Rightarrow\) \(\dfrac{\hat{\beta_0}-\beta_0}{s(\hat{\beta_0})}\sim Student(n-2)\)

  • \(\hat{\beta_1} \sim \mathcal{N}\left(\beta_1, \sigma^2_{\hat{\beta_1}}\right)\) \(\Rightarrow\) \(\dfrac{\hat{\beta_1}-\beta_1}{s(\hat{\beta_1})}\sim Student(n-2)\)

Confidence Intervals (CI)

It follows that

  • Confidence interval at \(100\times(1-\alpha)\%\) for \(\beta_0\) : \[[\hat{\beta_0}\pm t^{n-2}_{1-\alpha/2}\times s(\hat{\beta}_0)]\]

  • Confidence interval at \(100\times(1-\alpha)\%\) for \(\beta_1\) : \[[\hat{\beta_1}\pm t^{n-2}_{1-\alpha/2}\times s(\hat{\beta}_1)]\]

Student's Test

The most important test in the simple linear model is whether \(\beta_1\) is significantly different from \(0\),
i.e. does \(X\) significantly inform the prediction of \(Y\):

  • null hypothesis \(H_0:\ \beta_1=0\)
    (no association between \(Y\) & \(X\))
  • alternative hypothesis \(H_1:\ \beta_1\neq0\)
    (linear association between \(Y\) & \(X\))

\(\Rightarrow\) test statistic: \(T=\frac{\hat{\beta}_1}{s(\hat{\beta}_1)}\underset{H_0}{\sim} Student(n-2)\)

Link between Tests and CI

Link between correlation and simple linear regression

Likelihood approach

📝

  • 👉 write the likelihood for the simple linear model (one possible answer below)
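
For reference, since \(Y_i|x_i \sim \mathcal{N}(\beta_0+\beta_1 x_i, \sigma^2)\) independently across observations, the likelihood of the sample writes:

\[L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\beta_0-\beta_1 x_i)^2}{2\sigma^2}\right)\]

NB: maximizing its logarithm over \(\beta_0\) and \(\beta_1\) amounts to minimizing the least squares criterion \(\Phi(\beta_0,\beta_1)\), so the maximum likelihood and OLS estimates coincide here.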

Practicals 1


  • 👉 load the data from birthweight_data.txt
  • 👉 using dplyr, compute a new data.frame with both the mean and the standard deviation (look at the group_by() and summarize_all() functions) of all original variables, by the mother's age. Do the observed values make sense?
  • 👉 add the number of observations summarized with the add_tally() function
  • 👉 draw a scatter plot with ggplot2 of average birth weight as a function of the mother's age (a possible sketch follows)
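
A possible solution sketch (column names follow the data description above; summarise_all() applies each summary function to every non-grouping column):

    library(dplyr)
    library(ggplot2)

    bw <- read.table("birthweight_data.txt", header = TRUE)

    # mean and sd of all variables, by mother's age
    bw_by_age <- bw %>%
      group_by(age) %>%
      add_tally() %>%                            # n = number of observations per age
      summarise_all(list(mean = mean, sd = sd))  # n is constant within each age,
                                                 # so n_mean carries the group size

    # average birth weight as a function of the mother's age
    ggplot(bw_by_age, aes(x = age, y = bwt_mean)) +
      geom_point()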

Practicals 2

  • 👉 using the lm() function (look at the help and the examples), fit a simple linear model explaining the average birth weight from the mother's age
  • 👉 explore the results with the ls() function, the $ operator and the summary() function
  • 👉 assess the linear model assumptions
  • 👉 interpret your results (take particular care with \(\beta_0\))
  • 👉 fit a second simple linear model explaining the individual birth weight from the mother's age using the non-aggregated original data
  • 👉 compare the results. How can you explain the differences? (a possible sketch follows)
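
A possible solution sketch, reusing bw and bw_by_age from Practicals 1:

    # model on the aggregated data (one row per mother's age)
    fit_avg <- lm(bwt_mean ~ age, data = bw_by_age)
    ls(fit_avg)              # components of the fitted object
    fit_avg$coefficients     # accessed with the $ operator
    summary(fit_avg)         # estimates, standard errors, t-tests, R^2

    # model on the raw, non-aggregated data
    fit_ind <- lm(bwt ~ age, data = bw)
    summary(fit_ind)

    # graphical diagnostics of the model assumptions
    par(mfrow = c(2, 2)); plot(fit_ind)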

Multiple Linear Regression

Quantifies the linear relationship between a Gaussian variable \(Y\) and multiple variables \(X_1\), \(X_2\), …, \(X_p\)

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model adequacy diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Multiple Linear Regression

\[\begin{align*} y_i &= \beta_0 + \color{blue}{\beta_1} x_{1i} + \color{blue}{\beta_2} x_{2i} + \dots + \color{blue}{\beta_p} x_{pi} + \color{red}{\varepsilon_i}\\ &= \beta_0 + \left(\sum_{j=1}^p\color{blue}{\beta_j} x_{ji}\right) + \color{red}{\varepsilon_i} \end{align*}\]

Matrix notation

\[Y= X\boldsymbol{\beta} + \varepsilon\] \[ Y = \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix} ,\quad X = \begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{1n} & x_{2n} & \dots & x_{pn} \end{bmatrix} ,\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} ,\quad\] \[\varepsilon = \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} . \]

Ordinary Least Squares formula

\[\boldsymbol{\widehat{\beta}}= (X^TX)^{-1}X^TY\]

โ—๏ธRequires \(X^TX\) to be invertible !

This is no the case if:

  • \(p \geq n\) (high-dimensional case)
  • \(X^TX\) is not of full rank (e.g.ย one covariate can be computed as a perfect linear combination of the other)
    numerical instabilities when covaraites are extremely correlated
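
The closed-form solution can be checked numerically (a sketch, again with the bw data.frame; age and lwt are arbitrary example covariates):

    # design matrix with an intercept and two covariates
    X <- cbind(1, bw$age, bw$lwt)
    y <- bw$bwt

    # beta_hat = (X'X)^{-1} X'y -- solve() fails if X'X is not invertible
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)
    beta_hat

    # same estimates via lm()
    coef(lm(bwt ~ age + lwt, data = bw))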

Indicator coding

For categorical predictors, the design matrix \(X\) uses modality indicator variables:

\(X_j = \begin{bmatrix} m_1 \\ m_K \\ m_1 \\ \vdots \\ m_{K} \end{bmatrix}\) \(\rightarrow\,\) \(X_{m_1}^j = \begin{bmatrix} 1 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} ,\,\) \(\dots,\,\) \(X_{m_K}^j = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 1 \end{bmatrix}\)

❗️Need to set a reference class to ensure identifiability

?model.matrix()
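
A small illustration of how R builds these indicator columns (assuming race is still in its 1/2/3 numeric coding):

    # declare race as a factor so R creates indicator columns for it
    bw$race <- factor(bw$race, levels = c(1, 2, 3),
                      labels = c("white", "black", "other"))

    # one column per non-reference level; "white" is the reference here
    head(model.matrix(~ race, data = bw))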

Interactions

\[X_j\times X_m\]

\(\Rightarrow\) the value of \(X_m\) modifies the effect of \(X_j\) on \(Y\)

โ—๏ธif an interaction is included in a linear model, the main effects involved must always be included as well
otherwise the interaction is not interpretable

Due to numerical complexity, it is very rare to use higher order (\(>2\)) interactionsโ€ฆ
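
In R's formula interface, the * operator expands to the main effects plus their interaction, so this rule is followed automatically (a sketch with bw):

    # smoke * age expands to smoke + age + smoke:age
    fit_int <- lm(bwt ~ smoke * age, data = bw)
    summary(fit_int)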

Confounding

potential confounders must be included in the model!
\(X_2\) is a confounding factor for the relationship between \(X_1\) and \(Y\) if:

  • \(X_2\) is associated with both \(X_1\) and \(Y\)
  • \(X_2\) is a direct or indirect cause of \(Y\), but not a consequence
  • \(X_2\) is not on the causal path between \(X_1\) and \(Y\)

Assess the relative variation \(RV=\dfrac{\left|\beta_1^{model1} - \beta_1^{model2}\right|}{\left|\beta_1^{model1}\right|}\)
\(\Rightarrow\) arbitrary thresholds of 10% or 20% to declare potential confounding factors
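
A sketch of this comparison in R (lwt is only an illustrative candidate confounder here):

    # relative variation of the smoking coefficient when adjusting for lwt
    fit1 <- lm(bwt ~ smoke, data = bw)          # model 1: crude
    fit2 <- lm(bwt ~ smoke + lwt, data = bw)    # model 2: adjusted
    RV <- abs(coef(fit1)["smoke"] - coef(fit2)["smoke"]) / abs(coef(fit1)["smoke"])
    RV   # compare to the 10% or 20% threshold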

Likelihood Ratio Test

The Likelihood Ratio Test (LRT) can be used to compare nested models, in particular for testing several coefficients at once.

AIC & BIC criteria:

  • based on \(-2\,\text{log-likelihood}\)
  • penalize for the number of parameters (the more parameters, the better the fit, mechanically)
  • can be used to compare non-nested models on the same data
  • the lower the better

\(AIC=2p-2\,\text{log-likelihood}\)
\(BIC=p\log(n)-2\,\text{log-likelihood}\)
\(R_{adj}^{2}=1-(1-R^{2}){n-1 \over n-p-1}\)
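
In R, nested linear models can be compared with anova(), and the criteria are available directly (a sketch with arbitrary example models):

    # nested model comparison (F-test version of the LRT for linear models)
    fit_small <- lm(bwt ~ age, data = bw)
    fit_big   <- lm(bwt ~ age + lwt + smoke, data = bw)
    anova(fit_small, fit_big)

    # information criteria (lower is better)
    AIC(fit_small); AIC(fit_big)
    BIC(fit_small); BIC(fit_big)

    # adjusted R^2 is reported by summary()
    summary(fit_big)$adj.r.squared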

Multiple testing

โ—๏ธwhen \(p\) becomes large, the number of coefficients tested in the model becomes large as well, and one must be careful and deal with the multiple testing issue

To be continued…

Model selection

One of the recurring questions in multiple regression is:

Which variables should be included in the model?

Often dealt with via stepwise inclusion methods

❗️this is not statistically sound and should be avoided

Modern methods, such as the LASSO or Random Forests, provide a framework for suitable variable selection in multivariate regression models… To be continued

Practicals 3


  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the individual birth weight from the other variables? (a possible starting point follows)
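
One possible starting point (a sketch; race should first be declared as a factor, as in the indicator coding example):

    # main-effects model with all covariates except low (a recoding of bwt)
    fit_all <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                  data = bw)
    summary(fit_all)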

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Generalized linear models (GLM)

Quantifies the linear relationship between a non-Gaussian variable \(Y\) and multiple variables \(X_1\), \(X_2\), …, \(X_p\)

Binary outcome & regression

Linear regression model:

\[E(Y|x_1,x_2,\dots,x_p) = \beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\]

If \(Y\) is a binary variable \(\in \{0, 1\}\):

\[E(Y|x_1,x_2,\dots,x_p) = \mathbb{P}(Y=1|x_1,x_2,\dots,x_p)\] ❗️\(E(Y|x_1,x_2,\dots,x_p)\ \in\ [0, 1]\) (probability) while \(\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\ \in\ ]-\infty, \infty[\)

logit

\[logit(p) = log\left(\dfrac{p}{1-p}\right)\ \in\ ]-\infty, \infty[\]

expit

\[expit(x) = \dfrac{e^x}{1+e^x}\ \in\ ]0,1[\]

NB: \(expit(logit(p))=p\) and \(logit(expit(x))=x\)
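
Both functions are one-liners in R (base R also ships them as qlogis() and plogis()):

    logit <- function(p) log(p / (1 - p))
    expit <- function(x) exp(x) / (1 + exp(x))

    expit(logit(0.3))   # returns 0.3
    # base R equivalents: qlogis(0.3) and plogis(qlogis(0.3))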

logistic regression

\[logit\left(\mathbb{P}(Y=1|x_1,x_2,\dots,x_p)\right) = \beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\]

\(\Rightarrow \mathbb{P}(Y=1|x_1,x_2,\dots,x_p) = \dfrac{e^{\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p}}{1 + e^{\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p}}\)

  • the \(logit\) is called the link function:
    it makes the link between the expectation of the outcome and the linear predictor

NB: No error term; randomness is directly encompassed in the model

Odds and Odds Ratio

  • odds: \(\dfrac{p}{1-p}\)

If \(logit(\mathbb{P}(Y=1|x)) = \beta_0 + \beta_1x\): \[\log\left(\frac{\mathbb{P}(Y=1|X=a+1)}{1-\mathbb{P}(Y=1|X=a+1)}\right) - \log\left(\frac{\mathbb{P}(Y=1|X=a)}{1 -\mathbb{P}(Y=1|X=a)}\right) = \beta_1\] Odds Ratio (OR): \(OR_{X}=e^{\beta_1}\)

CI and OR

\(CI(OR)_{1-\alpha} = [e^{\hat{\beta_1} \pm u_{1-\alpha/2}\,\hat{s}(\beta_1)}]\)

❗️Normal asymptotic approximation \(\Rightarrow\) Wald test:

\[\dfrac{\hat{\beta_1}}{\hat{s}(\beta_1)}\underset{H_0}{\sim}\mathcal{N}(0,1)\quad\text{when}\quad n\rightarrow\infty\]
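
A sketch of how this looks in R (smoke as an example covariate; confint.default() returns the Wald intervals):

    # logistic regression of low birth weight on smoking status
    fit_logit <- glm(low ~ smoke, data = bw, family = binomial)

    exp(coef(fit_logit))              # odds ratios
    exp(confint.default(fit_logit))   # Wald CIs on the OR scale
    summary(fit_logit)                # Wald z-tests for each coefficient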

linearity hypothesis

One way to assess the linearity hypothesis is to split the continuous covariates into categories and check that the coefficients associated with each category are coherent with a linear relationship.

This is the only hypothesis to be checked in the logistic regression
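
A sketch of this check (the cut points are arbitrary illustrative choices):

    # categorize the mother's age and inspect the coefficient pattern
    bw$age_cat <- cut(bw$age, breaks = c(0, 20, 25, 30, 50))
    fit_cat <- glm(low ~ age_cat, data = bw, family = binomial)
    coef(fit_cat)   # roughly evenly spaced coefficients support linearity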

Practicals 4


  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the binary variable low from the other variables? Use the glm() function with the binomial family (logit link)
  • 👉 change the reference category for race to obtain the CI for the OR for smoking in black women (a possible sketch follows)
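
A possible solution sketch (assumes race was declared a factor earlier; with a smoke × race interaction, the smoke coefficient is the effect of smoking in the reference race category):

    # full main-effects logistic model
    fit4 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                data = bw, family = binomial)
    summary(fit4)

    # set "black" as the reference category, then refit with the interaction
    bw$race <- relevel(bw$race, ref = "black")
    fit_sr <- glm(low ~ smoke * race, data = bw, family = binomial)
    exp(coef(fit_sr)["smoke"])                # OR for smoking in black women
    exp(confint.default(fit_sr)["smoke", ])   # its Wald CI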

Generalized Linear Models (GLM)

  • Poisson regression for discrete quantitative outcome (e.g. counts)
    link function is the \(log\)
  • Probit & Tobit models as alternatives to logistic regression
    (different link functions)
  • …

Independence

GLMs assume that all the observations are \(iid\) (independent and identically distributed)

Linear mixed-effect models can be used to model grouped data

Ex: longitudinal data, batch effects, etc…

Conclusion

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model fit diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Additional resources

Points of Significance | Statistics for Biologists series (Nature Methods)