Linear models

Guillaume Blanchet – Andrew MacDonald

2024-04-30

Basic regression model

Hierarchical models are a generalized version of the classic regression models you have seen in your undergraduate courses.

In its simplest form, a regression model is usually presented as

\[ y_i = \beta_0 + \beta_1 x_{i} + \varepsilon \]

It is known as a simple linear model, where:

  • \(y_i\) is the value of a response variable for observation \(i\)
  • \(x_i\) is the value of an explanatory variable for observation \(i\)
  • \(\beta_0\) is the model intercept
  • \(\beta_1\) is the model slope
  • \(\varepsilon\) is the error term

Basic regression model

The cool thing about the simple linear model is that it can be studied visually quite easily.

For example, if we are interested in knowing how a newly discovered plant species (Bidonia exemplaris) reacts to humidity, we can relate the biomass of B. exemplaris sampled at 100 sites with the soil humidity content and readily visualize the data and the trend.
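In R, such a figure can be produced with a few lines of code. The sketch below uses simulated data (the values are hypothetical, chosen only for illustration, and will not reproduce the exact estimates shown later):

# Hypothetical simulated data (for illustration only)
set.seed(42)
humidity <- runif(100, -1, 1)
b.exemplaris <- 4 + 1.3 * humidity + rnorm(100, sd = 0.5)

# Scatter plot with the fitted regression line
plot(humidity, b.exemplaris,
     xlab = "Soil humidity", ylab = "Biomass of B. exemplaris")
abline(lm(b.exemplaris ~ humidity), col = "red")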

Basic regression model

Regression parameters

Generally, we focus on the most commonly assessed regression parameters of the simple linear model, the slope and the intercept, but there is another one that is very important to consider, especially for this course.

Any idea which one it is?

Regression parameters

If we go back to the mathematical description of the model

\[ y_i = \beta_0 + \beta_1 x_{i} + \varepsilon \]

we can see that in the simple linear regression the error term (\(\varepsilon\)) actually has a very precise definition:

\[\varepsilon \sim \mathcal{N}(0, \sigma^2)\]

where \(\sigma^2\) is an estimated variance.

In words, it means that the error in a simple linear regression follows a Gaussian distribution with a variance that is estimated.

For most of the course, we will play with the variance parameter \(\sigma^2\) in a bunch of different ways.

But before we do this, we need to understand a bit more about how this parameter influences the model.

Regression parameters

A first way to do this is to think about the simple linear regression in a slightly different way. Specifically, based on what we learned in the previous slide, the simple linear regression can be rewritten as \[ y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_{i}, \sigma^2) \]
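To see that the two formulations are equivalent, here is a small sketch in R (with arbitrary parameter values, chosen only for illustration) that simulates data both ways:

# Arbitrary parameter values, for illustration only
beta0 <- 4
beta1 <- 1.3
sigma <- 0.5
x <- runif(100, -1, 1)

# Formulation 1: add a Gaussian error term to the linear predictor
y1 <- beta0 + beta1 * x + rnorm(100, mean = 0, sd = sigma)

# Formulation 2: draw y directly from a Gaussian distribution centred
# on the linear predictor (note that rnorm() takes the standard
# deviation, not the variance)
y2 <- rnorm(100, mean = beta0 + beta1 * x, sd = sigma)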

As we will see later in this course, this way of writing the model will become particularly useful.

Regression parameters

Variance of the model (\(\sigma^2\))

In essence, \(\sigma^2\) tells us about what the model could not account for.

For example, let’s compare the biomass of Bidonia exemplaris with that of Ilovea chicktighii, another species (a carnivorous plant).

Regression parameters

Variance of the model (\(\sigma^2\))

By regressing the biomass of each plant on humidity, we can obtain the estimated parameters for each regression (which are all available using summary.lm)

# Regression model
regBexemplaris <- lm(b.exemplaris ~ humidity)
regIchicktighii <- lm(i.chicktighii ~ humidity)

# Summary
summaryRegBexemplaris <- summary(regBexemplaris)
summaryRegIchicktighii <- summary(regIchicktighii)

Regression parameters

Variance of the model (\(\sigma^2\))

For Bidonia exemplaris

# Estimated coefficients
summaryRegBexemplaris$coefficients[,1:2]
            Estimate Std. Error
(Intercept) 4.015999 0.05235200
humidity    1.313897 0.08845811
# Estimated residual standard deviation (sigma)
summaryRegBexemplaris$sigma
[1] 0.5232624

For Ilovea chicktighii

# Estimated coefficients
summaryRegIchicktighii$coefficients[,1:2]
            Estimate  Std. Error
(Intercept) 3.991785 0.008928294
humidity    1.271250 0.015085956
# Estimated residual standard deviation (sigma)
summaryRegIchicktighii$sigma
[1] 0.089239
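The much smaller value for Ilovea chicktighii indicates that humidity leaves much less unexplained variation in its biomass. Note that summary.lm reports \(\hat{\sigma}\) (the residual standard deviation), not \(\sigma^2\); squaring it gives the estimated variance. A quick sketch of the relation:

# sigma reported by summary.lm is the residual standard deviation
sqrt(sum(residuals(regBexemplaris)^2) / df.residual(regBexemplaris))

# Squaring it gives the estimated variance sigma^2
summaryRegBexemplaris$sigma^2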

Limits of the simple linear regression

There are two major pitfalls of the simple linear model for problems in the life sciences:

  1. One explanatory variable is almost never enough to approach biological questions nowadays.
  2. The simple linear model assumes that the error follows a Gaussian distribution.

Multiple linear regression

Simple linear regression can be extended to account for multiple explanatory variables to study more complex problems. This type of regression model is known as a multiple linear regression.

Mathematically, a multiple linear regression can be defined as

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon \]

Theoretically, estimating the parameters of a multiple regression model is done the same way as for simple linear regression. However, in practice, matrix algebra is very useful in this context, and also for this course in general.

In this respect, let’s take a bit of time to get acquainted with different basic (and maybe not so basic!) knowledge of matrix algebra.

Matrix algebra interlude

A general way to write matrices

\[ \mathbf{A} = \begin{bmatrix} A_{11} & A_{12} & \dots & A_{1j} & \dots & A_{1n}\\ A_{21} & A_{22} & \dots & A_{2j} & \dots & A_{2n}\\ \vdots & \vdots & \ddots & \vdots & & \vdots\\ A_{i1} & A_{i2} & \dots & A_{ij} & \dots & A_{in}\\ \vdots & \vdots & & \vdots & \ddots & \vdots\\ A_{m1} & A_{m2} & \dots & A_{mj} & \dots & A_{mn}\\ \end{bmatrix} \] \[\mathbf{A} = \left[A_{ij}\right]=\left[A_{ij}\right]_{m\times n}\]

Basic matrix operations

The transpose of a matrix

\[A = \begin{bmatrix} 5 & -6 & 4 & -4\\ \end{bmatrix} \]

\[B = \begin{bmatrix} -8\\ 9\\ -2\\ \end{bmatrix} \]

\[C = \begin{bmatrix} -4 & 1\\ 2 & -5\\ \end{bmatrix} \]

\[A^t=\begin{bmatrix} 5\\ -6\\ 4\\ -4\\ \end{bmatrix} \]

\[B^t =\begin{bmatrix} -8 & 9 & -2\\ \end{bmatrix} \]

\[C^t =\begin{bmatrix} -4 & 2\\ 1 & -5\\ \end{bmatrix} \]

The transpose of a matrix

In R

A <- matrix(c(3, 1, 5, -2), nrow = 2, ncol = 2)
A
     [,1] [,2]
[1,]    3    5
[2,]    1   -2
t(A)
     [,1] [,2]
[1,]    3    1
[2,]    5   -2

Multiplying a matrix by a scalar

\[\mathbf{B} = c\mathbf{A}\] \[B_{ij} = cA_{ij}\]

\[ 0.3 \begin{bmatrix} 3 & 5\\ 1 & -2\\ \end{bmatrix} = \begin{bmatrix} 0.9 & 1.5\\ 0.3 & -0.6\\ \end{bmatrix} \]

In R

A <- matrix(c(3, 1, 5, -2), nrow = 2, ncol = 2)
c <- 0.3
c*A 
     [,1] [,2]
[1,]  0.9  1.5
[2,]  0.3 -0.6

Matrix multiplication (not division!)

\[\mathbf{C} = \mathbf{A} \cdot \mathbf{B}\]

\[C_{ik} = \sum^{n}_{j=1}A_{ij}B_{jk}\]

Rules

Associative: \(\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}\)

Distributive: \(\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B}+\mathbf{A}\mathbf{C}\)

Not commutative: \(\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}\)
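These rules can be checked numerically in R; a small sketch with arbitrary matrices:

# Small matrices to check the rules numerically
A <- matrix(c(3, 1, 5, -2), nrow = 2)
B <- matrix(c(1, 0, 2, 4), nrow = 2)
C <- matrix(c(-4, 2, 1, -5), nrow = 2)

# Associative and distributive: both comparisons return TRUE
all.equal(A %*% (B %*% C), (A %*% B) %*% C)
all.equal(A %*% (B + C), A %*% B + A %*% C)

# Not commutative: A %*% B and B %*% A generally differ
identical(A %*% B, B %*% A)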

Inner product

\[(\mathbf{Ax})_i=\sum_{j=1}^{n}A_{ij}x_j\]

\[ \begin{bmatrix} 3 & 5\\ 1 & -2\\ \end{bmatrix} \begin{bmatrix} 2\\ 5\\ \end{bmatrix} = \begin{bmatrix} (3, 5) \cdot (2, 5)\\ (1, -2) \cdot (2, 5) \\ \end{bmatrix} = \begin{bmatrix} 3 \times 2 + 5 \times 5\\ 1 \times 2 -2 \times 5\\ \end{bmatrix} = \begin{bmatrix} 31\\ -8\\ \end{bmatrix} \]

In R

A <- matrix(c(3, 1, 5, -2), nrow = 2, ncol = 2)
x <- matrix(c(2,5), nrow = 2, ncol = 1)
A %*% x
     [,1]
[1,]   31
[2,]   -8

Some important matrices

Identity matrix

The identity matrix is a square matrix where all off-diagonal values are 0 and all diagonal values are 1.

\[ \mathbf{I}=\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ \end{bmatrix} \]

The identity matrix is important because

\[\mathbf{A} \cdot \mathbf{I}_n = \mathbf{A}\] or

\[\mathbf{I}_m \cdot \mathbf{A} = \mathbf{A}\]
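In R, diag() builds an identity matrix, and multiplying by it leaves a matrix unchanged:

# A 3 x 3 identity matrix
diag(3)

# Multiplying by the identity leaves A unchanged
A <- matrix(c(3, 1, 5, -2), nrow = 2, ncol = 2)
A %*% diag(2)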

Diagonal matrix

A diagonal matrix is a square matrix where all values outside the diagonal are 0.

\[D= \begin{bmatrix} d_1 & 0 & \dots & 0\\ 0 & d_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & d_n\\ \end{bmatrix}\]

An example

\[ \begin{bmatrix} -1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 6\\ \end{bmatrix} \]
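In R, diag() also builds a diagonal matrix from a vector of diagonal values:

# The diagonal matrix from the example above
diag(c(-1, 0, 6))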

End of matrix algebra interlude

Multiple linear regression

In this course, we will rely heavily on multiple linear regression and expand on it by studying how some of the parameters (the \(\beta\)s) can depend on other data and parameters.

As we saw earlier, a classic way to write multiple linear regression is

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon \]

However, we can rewrite this using matrix notation as

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

Multiple linear regression

Matrix notation

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \]

Using the matrix notation, we assume that

  • \(\mathbf{y}\) is a vector of length \(n\) (samples)
  • \(\mathbf{X}\) is a matrix with \(n\) rows (samples) and \(p\) columns (explanatory variables)
  • \(\boldsymbol{\beta}\) is a vector of \(p\) regression coefficients
  • \(\boldsymbol{\varepsilon}\) is a vector of residuals of length \(n\)
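A minimal sketch (with made-up data) showing how this notation maps onto R: model.matrix() builds \(\mathbf{X}\), and the ordinary least-squares estimate \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}\) matches the coefficients returned by lm().

# Simulated example with two made-up explanatory variables
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

# Design matrix X (first column of 1s for the intercept)
X <- model.matrix(~ x1 + x2)

# Ordinary least squares: beta-hat = (X^t X)^-1 X^t y
betaHat <- solve(t(X) %*% X) %*% t(X) %*% y
betaHat

# The same estimates are returned by lm()
coef(lm(y ~ x1 + x2))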

Error in linear models

As previously mentioned, in (simple and multiple!) linear regression the error term (\(\varepsilon\)) actually has a very precise definition:

\[\varepsilon \sim \mathcal{N}(0, \sigma^2)\] where \(\sigma^2\) is an estimated variance

which means that the error in a linear regression follows a Gaussian distribution with an estimated variance.

Note: the model can be written using matrix notation as
\[\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2)\] where \(\sigma^2\) is the estimated variance.

Generalized linear models

To get around the assumption of Gaussian errors, generalized linear models (GLMs) have been proposed. In essence, GLMs use link functions to adapt the model so that it can be applied to non-Gaussian data.

Mathematically, the generic way to write a generalized linear model is

\[ \widehat{y}_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \]

or in matrix notation

\[ \widehat{\mathbf{y}} = g^{-1}(\mathbf{X}\boldsymbol{\beta}) \]

where \(g\) is the link function and \(g^{-1}\) the inverse link function.

Generalized linear models

Link functions

There are many types of link functions, and they are usually directly associated with the type of data the analysis is carried out on.

Arguably the most common link function in ecology is

logit link function

It is commonly used for modelling binary (0-1) data.

\[ \mathbf{X}\boldsymbol{\beta} = \ln\left(\frac{\widehat{\mathbf{y}}}{1 - \widehat{\mathbf{y}}}\right) \]

The inverse logit link function is

\[ \widehat{\mathbf{y}} = \frac{\exp(\mathbf{X}\boldsymbol{\beta})}{1 + \exp(\mathbf{X}\boldsymbol{\beta})} = \frac{1}{1 + \exp(-\mathbf{X}\boldsymbol{\beta})} \]
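A minimal sketch (with simulated binary data) of the logit link in R; plogis() computes the inverse logit, and the binomial family uses the logit link by default:

# Simulated binary data (values chosen only for illustration)
set.seed(2)
x <- runif(200, -2, 2)
p <- plogis(-0.5 + 1.5 * x)   # inverse logit of the linear predictor
y <- rbinom(200, size = 1, prob = p)

# GLM with a logit link (the default for the binomial family)
modLogit <- glm(y ~ x, family = binomial(link = "logit"))
coef(modLogit)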

Generalized linear models

Link functions

Another commonly used link function is

log link function

It is commonly used for modelling count data.

\[ \mathbf{X}\boldsymbol{\beta} = \ln\left(\widehat{\mathbf{y}}\right) \]

The inverse log link function is

\[ \widehat{\mathbf{y}} = \exp(\mathbf{X}\boldsymbol{\beta}) \]
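Similarly, a minimal sketch (with simulated count data) of the log link in R, using a Poisson GLM:

# Simulated count data (values chosen only for illustration)
set.seed(3)
x <- runif(200, -1, 1)
lambda <- exp(0.5 + x)        # inverse log link of the linear predictor
y <- rpois(200, lambda = lambda)

# GLM with a log link (the default for the Poisson family)
modLog <- glm(y ~ x, family = poisson(link = "log"))
coef(modLog)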