TU BASICS: Regression modeling

Boris Hejblum

September 18th, 2019

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Introduction

Linear model

Linear regression is also called the linear model:

  • simple linear model: Y and X
  • multiple linear model: Y and X1, X2, …, Xp
  • generalized linear models (and in particular logistic regression): Y and X1, X2, …, Xp when Y is not Gaussian

Motivation example: Birth weight study

Study on risk factors for low birth weight from the Baystate Medical Center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]

- low: low birth weight indicator coded as: 1 (low birth weight, i.e. < 2.5 kg) and 0 (normal birth weight)
- age: mother’s age (in years) at the time of birth
- lwt: mother’s weight (in pounds) at the time of the last menstrual period
- race: mother’s race coded as: 1 (white), 2 (black) and 3 (other)
- smoke: mother’s smoking status coded as 1 (smoker) and 0 (non-smoker)
- ptl: number of previous premature labours
- ht: hypertension history indicator coded as: 1(yes) and 0 (no)
- ui: uterine irritability indicator coded as: 1 (yes) and 0 (no)
- ftv: number of physician visits during the first trimester
- bwt: infant’s birth weight (in grams)

Correlation

Quantifies the relationship between 2 continuous variables X and Y

Correlation

Are X and Y values linked?

A few definitions:

  • $cor(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$
  • $Var(X) = E[(X - E(X))^2]$
  • $Cov(X,Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$
  • $E(X) = \sum_x x\, P(X = x)$

Estimations for correlation

Those quantities are estimated over a sample $(x_1, \dots, x_n)$ by:

  • $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
  • $s^2_x = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$
  • $s_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
  • $r = \frac{s_{xy}}{\sqrt{s^2_x s^2_y}}$

NB: the correlation is a scaled version of the covariance,
it is thus an association measure between $-1$ and $1$
(the closer to 0, the weaker the link)

Correlation between the mother’s age and the birth weight

Here, the correlation between the mother’s age and the birth weight is 0.09
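This can be computed directly in R. A minimal sketch, assuming the data from birthweight_data.txt loads with read.table() default settings and a header row (column names as in the data description above):

    # load the birth weight data (header/separator settings are assumptions)
    birthweight <- read.table("birthweight_data.txt", header = TRUE)

    # Pearson correlation between mother's age and infant birth weight
    cor(birthweight$age, birthweight$bwt)

    # cor.test() additionally returns a confidence interval and a p-value
    cor.test(birthweight$age, birthweight$bwt)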

Other correlation and association measures

This is the linear correlation, also called Pearson correlation coefficient.

Many other measures have been proposed to explore and quantify the link between two variables:

  • Spearman correlation
  • Mutual Information
  • Maximal Information Coefficient

Simple linear Regression

Also quantifies the relationship between 2 continuous variables X and Y

Average birth weight by mother’s age

age    n   bwt_mean   bwt_se
 14    3   2967.333   843.3447
 15    3   2504.000   237.7036
 16    7   3331.714   695.1493
 17   12   2758.500   443.3844
 18   10   2998.200   573.6156
 19   16   3051.875   681.9460
 20   18   2860.167   694.1185
 21   12   2724.167   698.4330
 22   13   3150.231   551.7235
 23   13   2951.692   640.3095

(age in years; n: number of births; bwt_mean, bwt_se: mean and standard error of birth weight in grams; first rows shown)

Linear Model

It all started with a scatter plot…

Regression, a.k.a. linear model:
what is the optimal line going through this scatter plot?

“Best” line?

A line equation

$y = a + bx$

We can write such an equation relating the $y_i$ values to the $x_i$ values:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

  • $\beta_0$ is interpreted as the average value of Y when $x_i$ is 0
  • $\beta_1$ represents the slope of the regression line
  • $\varepsilon_i$ is the random error separating the regression line from the observed value

Conditional distribution

The regression problem can be formalized as:

How do we characterize the conditional distribution of the random variable Y knowing the values x of the random variable X?

$Y|x \sim \; ?$

using:

  • its expected value E(Y|x)
  • its variance Var(Y|x)
  • its probability distribution

Regression equation

The simple linear model can be written as:

$Y|x \sim \mathcal{N}(\beta_0 + \beta_1 x,\, \sigma^2)$

The regression equation is then: $E(Y|x) = \beta_0 + \beta_1 x$
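To make this concrete, here is a small simulation from this conditional distribution (the values of $\beta_0$, $\beta_1$ and $\sigma$ are arbitrary, for illustration only):

    # simulate Y|x ~ N(beta0 + beta1 * x, sigma^2)
    set.seed(42)
    n <- 100
    x <- runif(n, min = 15, max = 45)   # e.g. mothers' ages in years
    beta0 <- 2400; beta1 <- 15; sigma <- 500
    y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)
    plot(x, y)                          # scatter around the true regression line
    abline(a = beta0, b = beta1, col = "red")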

Hypotheses of this linear model

  1. linearity of the relationship between Y and X
  2. homoskedasticity of the observations yi (same variance across observations)
  3. normality of the observations yi
  4. independence of the yi across observations

Linearity hypothesis

When X increases by 1 unit, Y increases on average by $\beta_1$ units,
regardless of the initial value of X:

$E(Y_i|X_i = a+1) - E(Y_i|X_i = a) = [\beta_0 + \beta_1(a+1)] - [\beta_0 + \beta_1 a] = \beta_1$

Homoskedasticity hypothesis

$Var(Y_i|x_i) = \sigma^2$ for all $x_i$
There is usually not enough data to assess this numerically – instead, a graphical assessment can be performed

Ex: for any mother’s age (x), the conditional birth weight (Y|x) must have the same variance

Normality hypothesis

$Y_i|x_i \sim \mathcal{N}$ for all $x_i$
There is usually not enough data to assess this – instead, a graphical assessment can be performed on the marginal histogram of Y (instead of Y|x)

Ex: for any mother’s age (x), the conditional birth weight (Y|x) must have a Gaussian distribution

Independence hypothesis

$Cov(Y_i, Y_j) = 0$ for all $i \neq j$
Such a hypothesis cannot be checked directly on the data, but must be assessed from what we know of the data collection process

Ex: the birth weights of the 189 infants are independent, assuming there are no twins (or even siblings) among them

Random errors & residuals

  • $E(Y_i|x_i) = \beta_0 + \beta_1 x_i \quad (i = 1, \dots, n)$
  • $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, \dots, n)$

where $\beta_0 + \beta_1 x_i$ is the deterministic part
and $\varepsilon_i$ is the random part: $\varepsilon_i = Y_i - E(Y_i|x_i) = Y_i - (\beta_0 + \beta_1 x_i)$

The $\varepsilon_i$ are called the random errors, or the residuals

Random errors hypothesis

The 4 hypotheses of the linear model can be expressed with respect to the random errors:

  • independence of the $\varepsilon_i$ ⇔ independence of the $Y_i$
  • $E(\varepsilon_i) = 0$ ⇔ linearity hypothesis
  • $Var(\varepsilon_i) = \sigma^2$ ⇔ $Var(Y_i|x_i) = \sigma^2$ (homoskedasticity)
  • normality of the $\varepsilon_i$ ⇔ normality of the $Y_i|x_i$

Summary of the hypotheses

  • Linearity: $E(Y_i|X_i = x_i) = E(Y_i|x_i) = \beta_0 + \beta_1 x_i$ or $E(\varepsilon_i) = 0$ $(i = 1, \dots, n)$
    check: $X \times Y$ scatter plot + post-fit residuals

  • Homoskedasticity: $Var(Y_i|x_i) = \sigma^2$ or $Var(\varepsilon_i) = \sigma^2$ for all $i$
    check: $X \times Y$ scatter plot + post-fit residuals

  • Normality: $Y_i|x_i \sim \mathcal{N}$ for any $x_i$ or $\varepsilon_i \sim \mathcal{N}$ for all $i$
    check: histogram of Y before estimation + post-fit residuals

  • Independence: $cov(Y_i, Y_j) = 0$ or $cov(\varepsilon_i, \varepsilon_j) = 0$ for all pairs $i \neq j$
    check: context

Linear model and Ordinary Least Squares (OLS)

One way to define the “best” line is as the line that minimizes the sum of squared errors.

The least squares criterion minimizes: $\Phi(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$

OLS estimates

Minimizing $\Phi(\beta_0, \beta_1)$ gives $\hat{\beta}_0$ and $\hat{\beta}_1$ (solved by setting the partial derivatives with respect to $\beta_1$ and $\beta_0$ respectively to 0):

$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

with $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$

Another possible formula is: $\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$
with $\overline{xy} = \frac{1}{n}\sum_{i=1}^n x_i y_i$ and $\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2$
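These formulas are easy to check numerically. A minimal sketch, assuming the birthweight data.frame from the introduction is loaded:

    x <- birthweight$age
    y <- birthweight$bwt
    # hand-computed OLS estimates
    beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    beta0_hat <- mean(y) - beta1_hat * mean(x)
    c(beta0_hat, beta1_hat)
    # should match the estimates returned by lm()
    coef(lm(bwt ~ age, data = birthweight))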

OLS estimators properties

It can be shown that, as n (the number of observations) gets larger:

  • $\hat{\beta}_0 \sim \mathcal{N}(\beta_0, \sigma^2_{\hat{\beta}_0})$ ⇒ $\frac{\hat{\beta}_0 - \beta_0}{s(\hat{\beta}_0)} \sim Student(n-2)$

  • $\hat{\beta}_1 \sim \mathcal{N}(\beta_1, \sigma^2_{\hat{\beta}_1})$ ⇒ $\frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \sim Student(n-2)$

Confidence Intervals (CI)

It follows that

  • Confidence interval at $100 \times (1-\alpha)\%$ for $\beta_0$: $\left[\hat{\beta}_0 \pm t^{n-2}_{1-\alpha/2} \times s(\hat{\beta}_0)\right]$

  • Confidence interval at $100 \times (1-\alpha)\%$ for $\beta_1$: $\left[\hat{\beta}_1 \pm t^{n-2}_{1-\alpha/2} \times s(\hat{\beta}_1)\right]$

Student’s Test

The most important test in the simple linear model is whether $\hat{\beta}_1$ is significantly different from 0,
i.e. does X significantly inform the prediction of Y:

  • null hypothesis $H_0$: $\beta_1 = 0$
    (no association between Y & X)
  • alternative hypothesis $H_1$: $\beta_1 \neq 0$
    (linear association between Y & X)

test statistic: $T = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} \underset{H_0}{\sim} Student(n-2)$

Link between Tests and CI

The $100 \times (1-\alpha)\%$ confidence interval for $\beta_1$ excludes 0 exactly when the test of $H_0: \beta_1 = 0$ rejects at level $\alpha$.

Link between correlation and simple linear regression

In the simple linear model, $\hat{\beta}_1 = r\,\frac{s_y}{s_x}$: the estimated slope is the sample correlation rescaled by the ratio of the standard deviations, so testing $\beta_1 = 0$ amounts to testing the nullity of the Pearson correlation.

Likelihood approach

📝

  • 👉 write the likelihood for the simple linear model (a reference solution follows)
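For reference, under the Gaussian model above with independent observations, the likelihood writes:

$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right)$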

Practicals 1

  • 👉 load the data from birthweight_data.txt
  • 👉 using dplyr, compute a new data.frame with both the mean and the standard deviation (look at the group_by() and summarize_all() functions) of all the original variables by the mother’s age. Do the observed values make sense?
  • 👉 add the number of observations summarized with the add_tally() function
  • 👉 draw a scatter plot with ggplot2 of average birth weight as a function of the mother’s age (a possible sketch follows this list)
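A possible sketch for these steps (the exact column names in the file are assumptions, to be checked with str() after loading):

    library(dplyr)
    library(ggplot2)

    birthweight <- read.table("birthweight_data.txt", header = TRUE)

    bw_by_age <- birthweight %>%
      group_by(age) %>%
      add_tally() %>%                             # n: births per age (n_mean after summarizing)
      summarize_all(list(mean = mean, sd = sd))   # mean and sd of every variable

    ggplot(bw_by_age, aes(x = age, y = bwt_mean)) +
      geom_point() +
      labs(x = "mother's age (years)", y = "average birth weight (g)")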

Practicals 2

  • 👉 using the lm() function (look at the help and the examples), fit a simple linear model explaining the average birth weight from the mother’s age (see the sketch after this list)
  • 👉 explore the results with the ls() function, the $ operator and the summary() function
  • 👉 assess the linear model assumptions
  • 👉 interpret your results (take particular care for β0)
  • 👉 fit a second simple linear model explaining the individual birth weight from the mother’s age using the non-aggregated original data
  • 👉 compare the results. How can you explain the differences?
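A sketch for the first steps (bw_by_age is the aggregated data.frame from Practicals 1):

    # simple linear model on the aggregated data
    fit <- lm(bwt_mean ~ age, data = bw_by_age)
    ls(fit)             # list the components of the fitted object
    fit$coefficients    # access a component with the $ operator
    summary(fit)        # estimates, standard errors, t-tests, R^2

    # graphical assessment of the model assumptions
    # (residuals vs fitted, normal QQ-plot, scale-location, leverage)
    par(mfrow = c(2, 2))
    plot(fit)

    # same model on the non-aggregated data, for comparison
    fit_indiv <- lm(bwt ~ age, data = birthweight)
    summary(fit_indiv)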

Multiple Linear Regression

Quantifies the linear relationship between a Gaussian variable Y and multiple variables X1, X2, …, Xp

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model adequacy diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Multiple linear Regression

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i = \beta_0 + \left( \sum_{j=1}^p \beta_j x_{ji} \right) + \varepsilon_i$

Matrix notation

$Y = X\beta + \varepsilon$

$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{p1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{pn} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

Ordinary Least Squares formula

$\hat{\beta} = (X^T X)^{-1} X^T Y$

❗️Requires $X^T X$ to be invertible!

This is not the case if:

  • $p \geq n$ (high-dimensional case)
  • $X^T X$ is not of full rank (e.g. one covariate can be computed as a perfect linear combination of the others)
    → numerical instabilities when covariates are extremely correlated
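A minimal sketch of this formula in R, checked against lm() (the covariates are chosen arbitrarily for illustration; the birthweight data is assumed loaded):

    X <- cbind(1, birthweight$age, birthweight$lwt)   # design matrix with intercept
    y <- birthweight$bwt
    beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
    beta_hat
    coef(lm(bwt ~ age + lwt, data = birthweight))     # should match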

Indicator coding

For categorical predictors, the design matrix X uses modality indicator variables: a covariate $X_j$ taking modalities $m_1, \dots, m_K$ is replaced by one indicator column per modality:

$X_j = \begin{bmatrix} m_1 \\ \vdots \\ m_K \\ m_1 \\ \vdots \\ m_K \end{bmatrix} \rightarrow X_j^{m_1} = \begin{bmatrix} 1 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \; \dots, \; X_j^{m_K} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$

❗️Need to set a reference class to ensure identifiability

?model.matrix()
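For instance, a minimal sketch with race recoded as a factor (the labels are taken from the data description above):

    birthweight$race <- factor(birthweight$race,
                               labels = c("white", "black", "other"))
    # model.matrix() builds the indicator columns, dropping the reference class
    head(model.matrix(~ race, data = birthweight))
    # columns: (Intercept), raceblack, raceother -- "white" is the reference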

Interactions

$X_j \times X_m$

the value of $X_m$ modifies the effect of $X_j$ on Y

❗️if an interaction is included in a linear model, the main effects involved must always be included as well,
otherwise the interaction is not interpretable

Due to numerical complexity, it is very rare to use higher order (>2) interactions…
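In an R formula, `*` expands to the main effects plus their interaction, so this constraint is satisfied automatically. A minimal, purely illustrative sketch:

    # bwt ~ age * smoke is equivalent to bwt ~ age + smoke + age:smoke
    fit_int <- lm(bwt ~ age * smoke, data = birthweight)
    summary(fit_int)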

Confounding

Potential confounders must be included in the model!
$X_2$ is a confounding factor for the relationship between $X_1$ and Y if:

  • $X_2$ is associated with both $X_1$ and Y
  • $X_2$ is a direct or indirect cause of Y, but not a consequence
  • $X_2$ is not on the causal path between $X_1$ and Y

Assess the relative variation $RV = \frac{|\beta_1^{model\,1} - \beta_1^{model\,2}|}{\beta_1^{model\,1}}$
with arbitrary thresholds of 10% or 20% to declare potential confounding factors
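A minimal sketch of this relative-variation check, with smoke as the (illustrative) potential confounder of the age effect:

    fit1 <- lm(bwt ~ age, data = birthweight)           # model 1: unadjusted
    fit2 <- lm(bwt ~ age + smoke, data = birthweight)   # model 2: adjusted
    b1_m1 <- coef(fit1)["age"]
    b1_m2 <- coef(fit2)["age"]
    RV <- abs(b1_m1 - b1_m2) / abs(b1_m1)
    RV   # compare to the arbitrary 10% or 20% threshold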

Likelihood Ratio Test

The Likelihood Ratio Test (LRT) can be used to compare nested models, in particular for testing several coefficients at once.

AIC & BIC criteria:

  • based on $-2 \times \text{log-likelihood}$
  • penalize for the number of parameters (the more parameters, the better the fit, mechanically)
  • can be used to compare non-nested models on the same data
  • the lower the better

$AIC = 2p - 2\,\text{log-likelihood}$
$BIC = p \log(n) - 2\,\text{log-likelihood}$
$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$
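In R, nested models can be compared with anova() (for Gaussian linear models this is an F-test; for GLMs, pass test = "LRT"), and AIC()/BIC() compare fits penalized for their number of parameters. The models below are illustrative:

    fit_small <- lm(bwt ~ age, data = birthweight)
    fit_large <- lm(bwt ~ age + lwt + smoke, data = birthweight)
    anova(fit_small, fit_large)   # test of the additional coefficients
    AIC(fit_small, fit_large)     # the lower the better
    BIC(fit_small, fit_large)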

Multiple testing

❗️when p becomes large, the number of coefficients tested in the model becomes large as well, and one must be careful and deal with the multiple testing issue

To be continued…

Model selection

One of the recurring question in multiple regression is:

Which variables should be included in the model?

Often dealt with using stepwise inclusion methods

❗️this is not statistically sound and should be avoided

Modern methods, such as the LASSO or Random Forests, encompass a framework for suitable variable selection in multivariate regression models… To be continued

Practicals 3

  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the individual birth weight from the other variables? (a possible starting sketch follows this list)
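A possible starting point (from there, compare candidate models, e.g. with AIC, and check the residuals):

    fit_full <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                   data = birthweight)
    summary(fit_full)
    par(mfrow = c(2, 2))
    plot(fit_full)   # assumption diagnostics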

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Generalized linear models (GLM)

Quantifies the linear relationship between a non-Gaussian variable Y and multiple variables X1, X2, …, Xp

Binary outcome & regression

Linear regression model:

$E(Y|x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

If Y is a binary variable, $Y \in \{0, 1\}$:

$E(Y|x_1, x_2, \dots, x_p) = P(Y = 1|x_1, x_2, \dots, x_p)$

❗️$E(Y|x_1, x_2, \dots, x_p) \in [0, 1]$ (a probability) while $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \in \; ]-\infty, +\infty[$

logit

$logit(p) = \log\left(\frac{p}{1-p}\right) \; \in \; ]-\infty, +\infty[$

expit

$expit(x) = \frac{e^x}{1+e^x} \; \in \; ]0, 1[$

NB: $expit(logit(p)) = p$ and $logit(expit(x)) = x$
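Both functions are available in base R as the quantile and distribution functions of the logistic distribution:

    qlogis(0.5)         # logit(0.5) = 0
    plogis(0)           # expit(0) = 0.5
    p <- 0.3
    plogis(qlogis(p))   # recovers p: expit(logit(p)) = p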

logistic regression

$logit(P(Y=1|x_1, x_2, \dots, x_p)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

$P(Y=1|x_1, x_2, \dots, x_p) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}$

  • the logit is called the link function:
    it makes the link between the expectation of the outcome and the linear predictor

NB: No error term, randomness is directly encompassed in the model

Odds and Odds Ratio

  • the odds: $\frac{p}{1-p}$

If $logit(P(Y=1|x)) = \beta_0 + \beta_1 x$, then: $\log\left(\frac{P(Y=1|X=a+1)}{1-P(Y=1|X=a+1)}\right) - \log\left(\frac{P(Y=1|X=a)}{1-P(Y=1|X=a)}\right) = \beta_1$

Odds Ratio (OR): $OR_X = e^{\beta_1}$

CI and OR

$CI(OR)_{1-\alpha} = \left[ e^{\hat{\beta}_1 \pm u_{1-\alpha/2} \times \hat{s}(\hat{\beta}_1)} \right]$

❗️Normal asymptotic approximation ⇒ Wald test:

$\frac{\hat{\beta}_1}{\hat{s}(\hat{\beta}_1)} \underset{H_0}{\sim} \mathcal{N}(0, 1) \quad \text{when } n \to \infty$
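In R, ORs and their Wald CIs can be obtained from a fitted logistic regression; a minimal sketch (the model is illustrative; confint.default() uses the normal approximation above, while confint() would profile the likelihood):

    fit_logit <- glm(low ~ age + smoke, data = birthweight, family = binomial)
    exp(coef(fit_logit))               # odds ratios
    exp(confint.default(fit_logit))    # Wald CIs on the OR scale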

linearity hypothesis

One way to assess the linearity hypothesis is to split up the continuous covariates into categories and check that the coefficients associated with each category are coherent with a linear relationship.

This is the only hypothesis to be checked in logistic regression

Practicals 4

  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the binary variable low from the other variables? Use the glm() function with family = "binomial"
  • 👉 change the reference category for race to obtain the CI for the OR for smoking in black women (a possible sketch follows this list)
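A possible sketch (the smoke × race interaction, the covariate choice and the factor recoding of race are assumptions needed for the last question):

    # recode race as a factor (labels taken from the data description)
    birthweight$race <- factor(birthweight$race,
                               labels = c("white", "black", "other"))
    fit_low <- glm(low ~ age + lwt + smoke * race + ht + ui,
                   data = birthweight, family = binomial)
    summary(fit_low)

    # with "black" as the reference level, the smoke coefficient becomes
    # the effect of smoking among black women
    birthweight$race <- relevel(birthweight$race, ref = "black")
    fit_low2 <- glm(low ~ age + lwt + smoke * race + ht + ui,
                    data = birthweight, family = binomial)
    exp(confint.default(fit_low2))   # Wald CIs on the OR scale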

Generalized Linear Models (GLM)

  • Poisson regression for discrete quantitative outcomes (e.g. counts);
    the link function is the log (see the sketch after this list)
  • Probit & Tobit models as alternatives to logistic regression
    (different link functions)
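As an example, ftv (the number of physician visits during the first trimester) is a count variable; a minimal, purely illustrative Poisson sketch:

    fit_pois <- glm(ftv ~ age, data = birthweight, family = poisson)
    summary(fit_pois)   # coefficients are on the log scale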

Independence

GLMs assume that all the observations are i.i.d.

Linear mixed-effect models can be used to model grouped data

Ex: longitudinal data, batch effects, etc…

Conclusion

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model fit diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Additional resources