September 18th, 2019

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Introduction

Linear model

Linear regression is also called the linear model:

  • simple linear model: \(Y\) and \(X\)
  • multiple linear model: \(Y\) and \(X_1\), \(X_2\), …, \(X_p\)
  • generalized linear models (and in particular logistic regression): \(Y\) and \(X_1\), \(X_2\), …, \(X_p\) when \(Y\) is not Gaussian

Motivation example: Birth weight study

Study on risk factors for low birth weight from the Baystate Medical Center in Springfield (MA, USA) during 1986 [Hosmer & Lemeshow (1989), Applied Logistic Regression]

- low: low birth weight indicator coded as: 1 (low birth weight, i.e. \(\leq 2.5\) kg) and 0 (normal birth weight)
- age: mother's age (in years) at the time of birth
- lwt: mother's weight (in pounds) at the time of the last menstrual period
- race: mother's race coded as: 1 (white), 2 (black) and 3 (other)
- smoke: mother's smoking status coded as 1 (smoker) and 0 (non-smoker)
- ptl: number of previous premature labours
- ht: hypertension history indicator coded as: 1 (yes) and 0 (no)
- ui: uterine irritability indicator coded as: 1 (yes) and 0 (no)
- ftv: number of physician visits during the first trimester
- bwt: infant's birth weight (in grams)

Correlation

Quantifies the relationship between 2 continuous variables \(X\) and \(Y\)

Correlation

Are \(X\) and \(Y\) values linked?

A few definitions:

  • \(cor(X, Y) = \dfrac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}\)
  • \(Var(X) = E[(X-E(X))^2]\)
  • \(Cov(X,Y)=E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y)\)
  • \(E(X) = \sum_{x}x\mathbb{P}(X=x)\)

Estimations for correlation

Those quantities are estimated over a sample \((x_1, \dots, x_n)\) by:

  • \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
  • \(s^2_x = \dfrac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}\)
  • \(s_{xy} = \dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{n-1}\)
  • \(r=\frac{s_{xy}}{\sqrt{s^2_{x}s^2_{y}}}\)

NB: the correlation is a scaled version of the covariance,
it is thus an association measure between \(-1\) and \(1\)
(the closer to 0, the weaker the link)

Correlation between the mother's age and the birth weight

Here, the correlation between the mother's age and the birth weight is \(0.09\)
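
A minimal sketch of this computation in R (assuming birthweight_data.txt is a whitespace-delimited file with a header row and the columns listed above; the name bw is ours):

    # assumption: birthweight_data.txt is whitespace-delimited with a header row
    bw <- read.table("birthweight_data.txt", header = TRUE)

    # Pearson correlation between the mother's age and the birth weight
    cor(bw$age, bw$bwt)        # point estimate (about 0.09 here)
    cor.test(bw$age, bw$bwt)   # adds a test and a confidence interval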

Other correlation and association measures

This is the linear correlation, also called Pearson correlation coefficient.

Many other measures have been proposed to explore and quantify the link between two variables:

  • Spearman correlation
  • Mutual Information
  • Maximal Information Coefficient
  • …

Simple Linear Regression

Also quantifies the relationship between 2 continuous variables \(X\) and \(Y\)

Average birth weight by mother's age

Linear Model

It all started with a scatter plot…

\(\Rightarrow\) Regression, a.k.a. linear model:
what is the optimal line going through this scatter plot?

“Best” line?

“Best” line?

A Line equation

\[y = a + bx\]

We can write such an equation for relating \(y_i\) values to \(x_i\) values:

\[y_i = \beta_0 + \color{blue}{\beta_1} x_i + \color{red}{\varepsilon_i}\]

  • \(\beta_0\) is interpreted as the average value of \(Y\) when \(x_i\) is 0
  • \(\beta_1\) represents the slope of the regression line
  • \(\varepsilon_i\) is the random error separating the regression line from the observed value

Conditional distribution

The regression problem can be formalized as:

How do we characterize the conditional distribution of the random variable \(Y\) given the value \(x\) of the random variable \(X\)?

\[Y|x \;\sim\; ?\] using:

  • its expected value \(E(Y|x)\)
  • its variance \(Var(Y|x)\)
  • its probability distribution

Regression equation

The simple linear model can be written as:

\[Y|x\;\sim\;\mathcal{N}(\beta_0 + \color{blue}{\beta_1}x,\; \color{red}{\sigma^2})\]

The regression equation then reads: \[E(Y|x) = \beta_0 +\beta_1 x\]

Hypotheses of this linear model

  1. linearity of the relationship between \(Y\) and \(X\)
  2. homoskedasticity of the observations \(y_i\) (same variance across observations)
  3. normality of the observations \(y_i\)
  4. independence of the \(y_i\) across observations

Linearity hypothesis

When \(X\) increases by 1 unit, \(Y\) increases on average by \(\beta_1\) units,
regardless of the initial value of \(X\)

\(\begin{align*} E(Y_i|X_i=a+1) - E(Y_i|X_i=a) &= [\beta_0 +\beta_1(a+1)]\\&\quad-[\beta_0 +\beta_1a]\\ &=\beta_1 \end{align*}\)

Homoskedasticity hypothesis

\(Var(Y_i|x_i) = \sigma^2\) for all \(x_i\)
usually not enough data to assess numerically – instead a graphical assessment can be performed

Ex: for any motherโ€™s age (\(x\)), the conditional birth weight (\(Y|x\)) must have the same variance

Normality hypothesis

\(Y_i|x_i \sim \mathcal{N}\) for all \(x_i\)
usually not enough data to assess – instead a graphical assessment can be performed on the marginal histogram of \(Y\) (instead of \(Y|x\))

Ex: for any motherโ€™s age (\(x\)), the conditional birth weight (\(Y|x\)) must have a Gaussian distribution

Independence hypothesis

\(Cov(Y_i, Y_j) = 0\) for all \(i \neq j\)
Such a hypothesis cannot be checked directly on the data, but must be assessed from what we know from the data collection process

Ex: birth weights of the 189 infants are independent, assuming there are no twins (or even siblings) among them

Random errors & residuals

  • \(E(Y_i|x_i) = \beta_0 + \beta_1 x_i \quad (i = 1, \dots, n)\)
  • \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, \dots, n)\)

where \(\beta_0 + \beta_1 x_i\) is the deterministic part
and \(\varepsilon_i\) is the random part: \[\varepsilon_i=Y_i-E(Y_i|x_i)=Y_i-(\beta_0 + \beta_1 x_i) \enspace\]

the \(\varepsilon_i\) are called the random errors; their post-fit estimates \(\hat{\varepsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\) are called the residuals

Random errors hypothesis

The four hypotheses of the linear model can be expressed with respect to the random errors:

  • independence of the \(\varepsilon_i\) \(\Rightarrow\) independence of \(Y_i\)
  • \(E(\varepsilon_i)=0\) \(\Rightarrow\) linearity hypothesis
  • \(Var(\varepsilon_i)=\sigma^2\) \(\Rightarrow\) \(Var(Y_i|x_i)=\sigma^2\) (homoskedasticity)
  • normality of the \(\varepsilon_i\) \(\Rightarrow\) normality of the \(Y_i|x_i\).

Summary of the hypotheses

  • Linearity: \(E(Y_i|X_i=x_i) = E(Y_i|x_i) = \beta_0 + \beta_1 x_i\) or \(E(\varepsilon_i)=0\) \((i = 1, \dots, n)\)
    ✅ check: \(X \times Y\) scatter plot + post-fit residuals

  • Homoskedasticity: \(Var(Y_i|x_i) = \sigma^2\) or \(Var(\varepsilon_i)=\sigma^2\) for all \(i\)
    ✅ check: \(X \times Y\) scatter plot + post-fit residuals

  • Normality: \(Y_i|x_i \sim \mathcal{N}\) for any \(x_i\) or \(\varepsilon_i \sim \mathcal{N}\) for all \(i\)
    ✅ check: histogram of \(Y\) before estimation + post-fit residuals

  • Independence: \(cov(Y_i, Y_j) = 0\) or \(cov(\varepsilon_i, \varepsilon_j) = 0\) for all pairs \(i \neq j\)
    ✅ check: context

linear model and Ordinary Least Squares (OLS)

One way to define the “best” line is as the line minimizing the sum of squared errors.

The Least Squares criterion minimizes: \[\Phi(\beta_0,\beta_1)=\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2\]

OLS estimates

Minimizing \(\Phi(\beta_0,\beta_1)\) gives \(\hat{\beta}_0\) and \(\hat{\beta}_1\) (obtained by setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) to 0):

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})} {\sum_{i=1}^n (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta_1}\bar{x}\]

with \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\) and \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\)

Another possible formula is: \(\hat{\beta}_1 = \frac{\overline{xy} - \bar{x} \bar{y}} {\overline{x^2} - \bar{x}^2}\)
with \(\overline{xy} = \frac{1}{n} \sum_{i=1}^n x_i y_i\) and \(\overline{x^2} = \frac{1}{n} \sum_{i=1}^n x^2_i\)
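
These closed-form estimates are easy to verify numerically, e.g. with \(x\) the mother's age and \(y\) the birth weight (a sketch reusing the bw data.frame loaded earlier):

    # OLS estimates from the closed-form formulas (x = age, y = bwt)
    x <- bw$age; y <- bw$bwt
    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)

    # equivalent "moments" formula for the slope
    b1_alt <- (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)

    c(b0, b1)   # matches coef(lm(bwt ~ age, data = bw))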

OLS estimators properties

It can be shown that, as \(n\) (the number of observations) gets larger:

  • \(\hat{\beta_0} \sim \mathcal{N}\left(\beta_0, \sigma^2_{\hat{\beta_0}}\right)\) \(\Rightarrow\) \(\dfrac{\hat{\beta_0}-\beta_0}{s(\hat{\beta_0})}\sim Student(n-2)\)

  • \(\hat{\beta_1} \sim \mathcal{N}\left(\beta_1, \sigma^2_{\hat{\beta_1}}\right)\) \(\Rightarrow\) \(\dfrac{\hat{\beta_1}-\beta_1}{s(\hat{\beta_1})}\sim Student(n-2)\)

Confidence Intervals (CI)

It follows that

  • Confidence interval at \(100\times(1-\alpha)\%\) for \(\beta_0\) : \[[\hat{\beta_0}\pm t^{n-2}_{1-\alpha/2}\times s(\hat{\beta}_0)]\]

  • Confidence interval at \(100\times(1-\alpha)\%\) for \(\beta_1\) : \[[\hat{\beta_1}\pm t^{n-2}_{1-\alpha/2}\times s(\hat{\beta}_1)]\]

Student's Test

The most important test in the simple linear model is whether \(\beta_1\) is significantly different from \(0\),
i.e. does \(X\) significantly inform the prediction of \(Y\):

  • null hypothesis \(H_0:\ \beta_1=0\)
    (no association between \(Y\) & \(X\))
  • alternative hypothesis \(H_1:\ \beta_1\neq0\)
    (linear association between \(Y\) & \(X\))

\(\Rightarrow\) test statistic: \(T=\frac{\hat{\beta}_1}{s(\hat{\beta}_1)}\underset{H_0}{\sim} Student(n-2)\)

Link between Tests and CI

Link between correlation and simple linear regression

Likelihood approach

📝

  • 👉 write the likelihood for the simple linear model (one possible answer below)
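
For reference, since \(Y_i|x_i \sim \mathcal{N}(\beta_0+\beta_1 x_i, \sigma^2)\) independently across observations, the likelihood of the sample writes:

\[L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\beta_0-\beta_1 x_i)^2}{2\sigma^2}\right)\]

NB: maximizing its logarithm over \(\beta_0\) and \(\beta_1\) amounts to minimizing the least squares criterion \(\Phi(\beta_0,\beta_1)\), so the maximum likelihood and OLS estimates coincide here.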

Practicals 1


  • 👉 load the data from birthweight_data.txt
  • 👉 using dplyr, compute a new data.frame with both the mean and the standard deviation (look at the group_by() and summarize_all() functions) of all original variables, by the mother's age. Do the observed values make sense?
  • 👉 add the number of observations summarized with the add_tally() function
  • 👉 draw a scatter plot with ggplot2 of average birth weight as a function of the mother's age (a possible sketch follows)
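
A possible solution sketch (column names follow the data description above; summarise_all() applies each summary function to every non-grouping column):

    library(dplyr)
    library(ggplot2)

    bw <- read.table("birthweight_data.txt", header = TRUE)

    # mean and sd of all variables, by mother's age
    bw_by_age <- bw %>%
      group_by(age) %>%
      add_tally() %>%                            # n = number of observations per age
      summarise_all(list(mean = mean, sd = sd))  # n is constant within each age,
                                                 # so n_mean carries the group size

    # average birth weight as a function of the mother's age
    ggplot(bw_by_age, aes(x = age, y = bwt_mean)) +
      geom_point()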

Practicals 2

  • 👉 using the lm() function (look at the help and the examples), fit a simple linear model explaining the average birth weight from the mother's age
  • 👉 explore the results with the ls() function, the $ operator and the summary() function
  • 👉 assess the linear model assumptions
  • 👉 interpret your results (take particular care with \(\beta_0\))
  • 👉 fit a second simple linear model explaining the individual birth weight from the mother's age using the non-aggregated original data
  • 👉 compare the results. How can you explain the differences? (a possible sketch follows)
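
A possible solution sketch, reusing bw and bw_by_age from Practicals 1:

    # model on the aggregated data (one row per mother's age)
    fit_avg <- lm(bwt_mean ~ age, data = bw_by_age)
    ls(fit_avg)              # components of the fitted object
    fit_avg$coefficients     # accessed with the $ operator
    summary(fit_avg)         # estimates, standard errors, t-tests, R^2

    # model on the raw, non-aggregated data
    fit_ind <- lm(bwt ~ age, data = bw)
    summary(fit_ind)

    # graphical diagnostics of the model assumptions
    par(mfrow = c(2, 2)); plot(fit_ind)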

Multiple Linear Regression

Quantifies the linear relationship between a Gaussian variable \(Y\) and multiple variables \(X_1\), \(X_2\), …, \(X_p\)

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model adequacy diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Multiple Linear Regression

\[\begin{align*} y_i &= \beta_0 + \color{blue}{\beta_1} x_{1i} + \color{blue}{\beta_2} x_{2i} + \dots + \color{blue}{\beta_p} x_{pi} + \color{red}{\varepsilon_i}\\ &= \beta_0 + \left(\sum_{j=1}^p\color{blue}{\beta_j} x_{ji}\right) + \color{red}{\varepsilon_i} \end{align*}\]

Matrix notation

\[Y= X\boldsymbol{\beta} + \varepsilon\] \[ Y = \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix} ,\quad X = \begin{bmatrix} 1 & x_{11} & x_{21} & \dots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{1n} & x_{2n} & \dots & x_{pn} \end{bmatrix} ,\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} ,\quad\] \[\varepsilon = \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} . \]

Ordinary Least Squares formula

\[\boldsymbol{\widehat{\beta}}= (X^TX)^{-1}X^TY\]

โ—๏ธRequires \(X^TX\) to be invertible !

This is no the case if:

  • \(p \geq n\) (high-dimensional case)
  • \(X^TX\) is not of full rank (e.g.ย one covariate can be computed as a perfect linear combination of the other)
    numerical instabilities when covaraites are extremely correlated
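
The closed-form solution can be checked numerically (a sketch, again with the bw data.frame; age and lwt are arbitrary example covariates):

    # design matrix with an intercept and two covariates
    X <- cbind(1, bw$age, bw$lwt)
    y <- bw$bwt

    # beta_hat = (X'X)^{-1} X'y -- solve() fails if X'X is not invertible
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)
    beta_hat

    # same estimates via lm()
    coef(lm(bwt ~ age + lwt, data = bw))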

Indicator coding

For categorical predictors, the design matrix \(X\) uses modality indicator variables:

\(X_j = \begin{bmatrix} m_1 \\ m_K \\ m_1 \\ \vdots \\ m_{K} \end{bmatrix}\) \(\rightarrow\,\) \(X_{m_1}^j = \begin{bmatrix} 1 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} ,\,\) \(\dots,\,\) \(X_{m_K}^j = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 1 \end{bmatrix}\)

❗️Need to set a reference class to ensure identifiability

?model.matrix()
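
A small illustration of how R builds these indicator columns (assuming race is still in its 1/2/3 numeric coding):

    # declare race as a factor so R creates indicator columns for it
    bw$race <- factor(bw$race, levels = c(1, 2, 3),
                      labels = c("white", "black", "other"))

    # one column per non-reference level; "white" is the reference here
    head(model.matrix(~ race, data = bw))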

Interactions

\[X_j\times X_m\]

\(\Rightarrow\) the value of \(X_m\) modifies the effect of \(X_j\) on \(Y\)

โ—๏ธif an interaction is included in a linear model, the main effects involved must always be included as well
otherwise the interaction is not interpretable

Due to numerical complexity, it is very rare to use higher order (\(>2\)) interactionsโ€ฆ
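
In R's formula interface, the * operator expands to the main effects plus their interaction, so this rule is followed automatically (a sketch with bw):

    # smoke * age expands to smoke + age + smoke:age
    fit_int <- lm(bwt ~ smoke * age, data = bw)
    summary(fit_int)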

Confounding

potential confounders must be included in the model!
\(X_2\) is a confounding factor for the relationship between \(X_1\) and \(Y\) if:

  • \(X_2\) is associated with both \(X_1\) and \(Y\)
  • \(X_2\) is a direct or indirect cause of \(Y\), but not a consequence
  • \(X_2\) is not on the causal path between \(X_1\) and \(Y\)

Assess the relative variation \(RV=\dfrac{\left|\beta_1^{model1} - \beta_1^{model2}\right|}{\left|\beta_1^{model1}\right|}\)
\(\Rightarrow\) arbitrary thresholds of 10% or 20% to declare potential confounding factors
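
A sketch of this comparison in R (lwt is only an illustrative candidate confounder here):

    # relative variation of the smoking coefficient when adjusting for lwt
    fit1 <- lm(bwt ~ smoke, data = bw)          # model 1: crude
    fit2 <- lm(bwt ~ smoke + lwt, data = bw)    # model 2: adjusted
    RV <- abs(coef(fit1)["smoke"] - coef(fit2)["smoke"]) / abs(coef(fit1)["smoke"])
    RV   # compare to the 10% or 20% threshold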

Likelihood Ratio Test

The Likelihood Ratio Test (LRT) can be used to compare nested models, in particular for testing several coefficients at once.

AIC & BIC criteria:

  • based on \(-2\,\text{log-likelihood}\)
  • penalize for the number of parameters (the more parameters, the better the fit, mechanically)
  • can be used to compare non-nested models on the same data
  • the lower the better

\(AIC=2p-2\,\text{log-likelihood}\)
\(BIC=p\log(n)-2\,\text{log-likelihood}\)
\(R_{adj}^{2}=1-(1-R^{2}){n-1 \over n-p-1}\)
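
In R, nested linear models can be compared with anova(), and the criteria are available directly (a sketch with arbitrary example models):

    # nested model comparison (F-test version of the LRT for linear models)
    fit_small <- lm(bwt ~ age, data = bw)
    fit_big   <- lm(bwt ~ age + lwt + smoke, data = bw)
    anova(fit_small, fit_big)

    # information criteria (lower is better)
    AIC(fit_small); AIC(fit_big)
    BIC(fit_small); BIC(fit_big)

    # adjusted R^2 is reported by summary()
    summary(fit_big)$adj.r.squared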

Multiple testing

โ—๏ธwhen \(p\) becomes large, the number of coefficients tested in the model becomes large as well, and one must be careful and deal with the multiple testing issue

To be continued…

Model selection

One of the recurring questions in multiple regression is:

Which variables should be included in the model?

Often dealt with via stepwise inclusion methods

❗️this is not statistically sound and should be avoided

Modern methods, such as the LASSO or Random Forests, provide a framework for suitable variable selection in multivariate regression models… To be continued

Practicals 3


  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the individual birth weight from the other variables? (a possible starting point follows)
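
One possible starting point (a sketch; race should first be declared as a factor, as in the indicator coding example):

    # main-effects model with all covariates except low (a recoding of bwt)
    fit_all <- lm(bwt ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                  data = bw)
    summary(fit_all)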

Class objectives

Be able to:

  • identify whether a linear or logistic regression would let you answer the scientific question of interest given the available data
  • propose an adequate model for the question (model, outcome, predictors)
  • fit the model in R
  • interpret and discuss the results

Generalized linear models (GLM)

Quantifies the linear relationship between a non-Gaussian variable \(Y\) and multiple variables \(X_1\), \(X_2\), …, \(X_p\)

Binary outcome & regression

Linear regression model:

\[E(Y|x_1,x_2,\dots,x_p) = \beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\]

If \(Y\) is a binary variable \(\in \{0, 1\}\):

\[E(Y|x_1,x_2,\dots,x_p) = \mathbb{P}(Y=1|x_1,x_2,\dots,x_p)\] ❗️\(E(Y|x_1,x_2,\dots,x_p)\ \in\ [0, 1]\) (probability) while \(\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\ \in\ ]-\infty, \infty[\)

logit

\[logit(p) = log\left(\dfrac{p}{1-p}\right)\ \in\ ]-\infty, \infty[\]

expit

\[expit(x) = \dfrac{e^x}{1+e^x}\ \in\ ]0,1[\]

NB: \(expit(logit(p))=p\) and \(logit(expit(x))=x\)
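
Both functions are one-liners in R (base R also ships them as qlogis() and plogis()):

    logit <- function(p) log(p / (1 - p))
    expit <- function(x) exp(x) / (1 + exp(x))

    expit(logit(0.3))   # returns 0.3
    # base R equivalents: qlogis(0.3) and plogis(qlogis(0.3))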

logistic regression

\[logit\left(\mathbb{P}(Y=1|x_1,x_2,\dots,x_p)\right) = \beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p\]

\(\Rightarrow \mathbb{P}(Y=1|x_1,x_2,\dots,x_p) = \dfrac{e^{\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p}}{1 + e^{\beta_0+\beta_1x_1 + \beta_2 x_2 +\dots + \beta_p x_p}}\)

  • the \(logit\) is called the link function:
    it makes the link between the expectation of the outcome and the linear predictor

NB: No error term; randomness is directly encompassed in the model

Odds and Odds Ratio

  • odds: \(\dfrac{p}{1-p}\)

If \(logit(\mathbb{P}(Y=1|x)) = \beta_0 + \beta_1x\): \[\log\left(\frac{\mathbb{P}(Y=1|X=a+1)}{1-\mathbb{P}(Y=1|X=a+1)}\right) - \log\left(\frac{\mathbb{P}(Y=1|X=a)}{1 -\mathbb{P}(Y=1|X=a)}\right) = \beta_1\] Odds Ratio (OR): \(OR_{X}=e^{\beta_1}\)

CI and OR

\(CI(OR)_{1-\alpha} = [e^{\hat{\beta_1} \pm u_{1-\alpha/2}\,\hat{s}(\beta_1)}]\)

❗️Normal asymptotic approximation \(\Rightarrow\) Wald test:

\[\dfrac{\hat{\beta_1}}{\hat{s}(\beta_1)}\underset{H_0}{\sim}\mathcal{N}(0,1)\quad\text{when}\quad n\rightarrow\infty\]
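
A sketch of how this looks in R (smoke as an example covariate; confint.default() returns the Wald intervals):

    # logistic regression of low birth weight on smoking status
    fit_logit <- glm(low ~ smoke, data = bw, family = binomial)

    exp(coef(fit_logit))              # odds ratios
    exp(confint.default(fit_logit))   # Wald CIs on the OR scale
    summary(fit_logit)                # Wald z-tests for each coefficient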

linearity hypothesis

One way to assess the linearity hypothesis is to split the continuous covariates into categories and check that the coefficients associated with each category are coherent with a linear relationship.

This is the only hypothesis to be checked in the logistic regression
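
A sketch of this check (the cut points are arbitrary illustrative choices):

    # categorize the mother's age and inspect the coefficient pattern
    bw$age_cat <- cut(bw$age, breaks = c(0, 20, 25, 30, 50))
    fit_cat <- glm(low ~ age_cat, data = bw, family = binomial)
    coef(fit_cat)   # roughly evenly spaced coefficients support linearity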

Practicals 4


  • 👉 load the data from birthweight_data.txt
  • 👉 how can you best explain the binary variable low from the other variables? Use the glm() function with the binomial family (logit link)
  • 👉 change the reference category for race to obtain the CI for the OR for smoking in black women (a possible sketch follows)
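
A possible solution sketch (assumes race was declared a factor earlier; with a smoke × race interaction, the smoke coefficient is the effect of smoking in the reference race category):

    # full main-effects logistic model
    fit4 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                data = bw, family = binomial)
    summary(fit4)

    # set "black" as the reference category, then refit with the interaction
    bw$race <- relevel(bw$race, ref = "black")
    fit_sr <- glm(low ~ smoke * race, data = bw, family = binomial)
    exp(coef(fit_sr)["smoke"])                # OR for smoking in black women
    exp(confint.default(fit_sr)["smoke", ])   # its Wald CI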

Generalized Linear Models (GLM)

  • Poisson regression for discrete quantitative outcome (e.g. counts)
    link function is the \(log\)
  • Probit & Tobit models as alternatives to logistic regression
    (different link functions)
  • …

Independence

GLMs assume that all the observations are \(iid\) (independent and identically distributed)

Linear mixed-effect models can be used to model grouped data

Ex: longitudinal data, batch effects, etc…

Conclusion

Regression analysis steps

Regardless of which regression model you are using:

  1. Model specification according to the research question, and the data available
  2. Estimation with point estimates and Confidence Intervals for the model parameters
  3. Significance testing for each model parameter of interest
  4. Model fit diagnostics
  5. Results presentation (often as a table)
  6. Results interpretation and discussion

Additional resources

Points of Significance | Statistics for Biologists series (Nature Methods)