Univariate regression

The shortest route to science is a straight line.

Simulation workout!

  1. make a histogram of 500 numbers from a distribution!
    • normal
    • Poisson
    • **EXTRA** try a new one, like beta, gamma, or lognormal
  2. make a histogram of Poisson observations, using the classic log() link function (a minimal sketch follows the sample output below).

\[ \begin{align} y &\sim \text{Poisson}(e^a) \\ a &\sim \text{Normal}(??, ??) \end{align} \]

## sample Poisson variables like this:
rpois(500, exp(3))
  [1] 23 26 27 16 25 16 14 21 15 18 26 19 22 21 20 23 22 21 19 15 18 19 18 24 23
 [26] 18 20 14 18 15 21 22 24 23 15 19 18 23 18 22 20 15 22 11 25 28 23 19 30 25
  ... (450 further values omitted)
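
Here is a minimal sketch of step 2: draw a from its Normal prior, then push it through exp(), the inverse of the log link (the prior mean and sd below are placeholders; choose your own):

a <- rnorm(1, mean = 3, sd = 0.2)  # placeholder prior values
hist(rpois(500, lambda = exp(a)))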
  3. make a histogram of Binomial observations, using the inverse logit link function.

\[ \begin{align} y &\sim \text{Binomial}\left(N, \frac{1}{1+e^{-a}} \right) \\ a &\sim \text{Normal}(??, ??) \end{align} \]

Here’s a plot of the link function, to help you think about it:

curve(1 / (1 + exp(-x)), xlim = c(-3, 3), ylim = c(0, 1))

a <- rnorm(1, mean = 0, sd = 1)
hist(rbinom(n = 500, size = 50, prob = 1 / (1 + exp(-a))))

Statistical models of penguin bill morphology

We’ll be studying the relationship between two measurements of penguin bills. Specifically, we’ll ask: “Are longer bills also deeper?” This question might not be the most interesting ecologically, but it is a great chance to practice some interesting stats.

Let’s begin with plotting the data:

library(palmerpenguins)
library(tidyverse)
library(cmdstanr)
This is cmdstanr version 0.7.1
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /home/andrew/software/cmdstan
- CmdStan version: 2.34.1
library(tidybayes)
penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point() + 
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Bill depth (mm) as predicted by bill length (mm) across the entire palmerpenguins dataset.

Let’s write a simple statistical model for these data:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \beta_0 &\sim \text{Normal}(??, ??) \\ \beta_1 &\sim \text{Normal}(??, ??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

What should our priors be? Before we can answer that, we have a more important question:

WHERE IS ZERO??

It has to be somewhere. Does it make sense? Take control and choose for yourself.

If we fit a model like this without thinking about the location of zero, we get some pretty silly answers:

bill_line <- coef(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
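
Printing the fitted coefficients makes the problem concrete:

## the "(Intercept)" term is the predicted bill depth when bill length is 0 mm:
## a penguin with no bill at all!
bill_line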

When the value of bill length is 0, the average of the response is the intercept:

\[ \begin{align} \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \mu_i &= \beta_0 + \beta_1\times0 \\ \mu_i &= \beta_0 \\ \end{align} \]

But if we take the data as we found it, we’re going to be talking about \(\beta_0\) as the depth of a penguin’s bill when that bill has 0 length! Either way it is the same line. However, from the point of view of setting priors and interpreting coefficients, it helps a lot to set a meaningful 0.

A very common choice is to subtract the average from your independent variable, so that it is equal to 0 at the average:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(??, ??) \\ \beta_1 &\sim \text{Normal}(??, ??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

Now \(\beta_0\) means the average bill depth at the average bill length. It becomes easier to think about priors:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(17,2) \\ \beta_1 &\sim \text{Normal}(0,.5) \\ \sigma &\sim \text{Exponential}(0.5) \end{align} \]

Exercise

What continuous predictors have you used in your analysis? How would you find a biologically meaningful zero? Think about how you would center time, age, mass, fitness, etc.

Prior predictive simulations

Armed with this model, it becomes much easier to think about prior predictions.

We’ll make a bunch of lines implied by the equation above. There are two steps:

  1. Center the predictor
  2. Make up a vector that goes from the minimum to the maximum of the predictor. This is just for convenience!
bill_len_centered <- with(penguins,
                          bill_length_mm - mean(bill_length_mm,
                                                na.rm = TRUE))

## make up a short vector
some_bill_lengths <- seq(
  from = min(bill_len_centered, na.rm = TRUE), 
  to = max(bill_len_centered, na.rm = TRUE),
  length.out = 10
  )
Shortcuts to these common tasks

These tasks are so common that they are automated in helper functions.

For centering predictors, see the base R function ?scale.

For creating a short vector over the range of a predictor, see modelr::seq_range. The R package modelr has many different functions to help with modelling.
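
For example, the two centering-and-grid steps above could be written with those helpers (a sketch; note that scale() returns a one-column matrix, so we coerce it back to a vector):

## center with scale() (it returns a 1-column matrix, hence as.numeric())
bill_len_centered <- as.numeric(
  scale(penguins$bill_length_mm, center = TRUE, scale = FALSE))

## an evenly spaced vector over the predictor's range
some_bill_lengths <- modelr::seq_range(bill_len_centered, n = 10)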

To simulate, we’ll use some matrix algebra, as we saw in lecture:

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

knitr::kable(head(X))
     some_bill_lengths
1          -11.8219298
1           -8.7663743
1           -5.7108187
1           -2.6552632
1            0.4002924
1            3.4558480
knitr::kable(head(B))
inters   19.3267323   18.7340270   16.6558349   17.814553   14.0237585   17.8310158   14.6174138
slopes    0.2564205   -0.4651482    0.2435612   -0.278236   -0.1508848    0.6891655   -0.5906592
prior_mus <- X %*% B

matplot(x = some_bill_lengths,
        y = prior_mus, type = "l")

Exercise

Copy the code above. Increase the number of simulations. Which priors are too wide? Which are too narrow?
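
As a starting point, here is one version with the simulation count pulled out into a variable (n_sim is arbitrary; some_bill_lengths comes from above):

n_sim <- 100
slopes <- rnorm(n_sim, 0, .5)
inters <- rnorm(n_sim, 17, 2)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

matplot(x = some_bill_lengths, y = X %*% B, type = "l")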

Simulating Observations

There are always at least TWO kinds of predictions we can be thinking about:

  1. Predicted averages. This is often called a “confidence” interval for a regression line.
  2. Predicted observations. This is often called a “prediction” interval.

We can use the full model to simulate observations!

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)
sigmas <- rexp(7, rate = 0.3)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

prior_mus <- X %*% B

prior_obs <- matrix(0, nrow = nrow(prior_mus), ncol = ncol(prior_mus))

for (j in 1:ncol(prior_obs)) {
  prior_obs[,j] <- rnorm(n = nrow(prior_mus),
                         mean = prior_mus[,j],
                         sd = sigmas[j])
}

matplot(x = some_bill_lengths,
        y = prior_obs, type = "p")
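
Incidentally, the loop can be replaced with a single vectorized call, since rnorm() recycles its mean and sd arguments. A sketch (R stores matrices column-major, so each sigma must repeat once per row to line up with its column of prior_mus):

prior_obs2 <- matrix(
  rnorm(n = length(prior_mus),
        mean = prior_mus,                          # recycled column-major
        sd = rep(sigmas, each = nrow(prior_mus))), # one sigma per column
  nrow = nrow(prior_mus))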

Tidyverse style for those who indulge:

tibble(
  sim_id = 1:7,
  slopes = rnorm(7, 0, .5),
  inters = rnorm(7, 17, 2),
  sigmas = rexp(7, rate = 0.2)
  ) |> 
  mutate(x = list(seq(from = -10, to = 10, length.out = 6))) |> 
  rowwise() |> 
  mutate(avg = list(x * slopes + inters),
         obs = list(rnorm(length(avg), mean = avg, sd = sigmas)),
         sim_id = as.factor(sim_id)) |> 
  unnest(cols = c("x", "avg", "obs")) |> 
  ggplot(aes(x = x, y = avg, group = sim_id, fill = sim_id)) + 
  geom_line(aes(colour = sim_id)) + 
  geom_point(aes(y = obs, fill = sim_id), pch = 21, size = 3) + 
  scale_fill_brewer(type = "qual") + 
  scale_colour_brewer(type = "qual") + 
  facet_wrap(~sim_id)

EXERCISE

Pick one of the two simulations above and modify it. Here are some suggested modifications:

  • Experiment with priors that are “too narrow” or “too wide”.
  • Try a different distribution than the one used.
  • Instead of bill size, imagine that we are applying this model to YOUR data. What would you change?

Linear regression in Stan

Now we write a Stan program for this model. We’ll begin with a simple model that has no posterior predictions:

normal_regression_no_prediction <- cmdstan_model(
  stan_file = "topics/02_regression/normal_regression_no_prediction.stan")

normal_regression_no_prediction
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
}

In order to get the posterior, we need to pass our data to Stan. We follow the same steps as previously:

  • Remember to remove NAs first!
  • Arrange the data in a list.
  • Pass the data to the compiled Stan model to estimate.
## drop NAs
penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

## assemble data list
data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm
          ))

## run the sampler, using the compiled model.
normal_reg_no_pred <- normal_regression_no_prediction$sample(
  data = data_list, 
  refresh = 0)
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.1 seconds.
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.6 seconds.
normal_reg_no_pred$summary()
# A tibble: 4 × 10
  variable       mean    median     sd    mad       q5       q95  rhat ess_bulk
  <chr>         <dbl>     <dbl>  <dbl>  <dbl>    <dbl>     <dbl> <dbl>    <dbl>
1 lp__      -396.     -395.     1.20   0.998  -398.    -394.      1.00    2081.
2 intercept   17.2      17.2    0.104  0.103    17.0     17.3     1.00    3849.
3 slope       -0.0849   -0.0848 0.0192 0.0192   -0.117   -0.0524  1.00    4194.
4 sigma        1.93      1.93   0.0729 0.0720    1.81     2.06    1.00    4059.
# ℹ 1 more variable: ess_tail <dbl>
normal_reg_no_pred$draws() |> 
  bayesplot::mcmc_areas(pars = c("slope", "intercept", "sigma"))

EXERCISE

Discussion: Look just at the posterior distribution of the slope right above. Do we have evidence that there’s a relationship between bill length and bill depth?

Posterior predictions in R

We can calculate a posterior prediction line directly in R for these data. I’ll show each step in this workflow separately:

normal_reg_no_pred |> 
  spread_rvars(slope, intercept, sigma)
# A tibble: 1 × 3
            slope  intercept        sigma
       <rvar[1d]> <rvar[1d]>   <rvar[1d]>
1  -0.085 ± 0.019   17 ± 0.1  1.9 ± 0.073

tidybayes helps us extract the posterior distribution of the parameters into a convenient object called an rvar. Learn more about this workflow in the tidybayes package documentation, and about the rvar datatype in the documentation for the posterior package.
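
As a tiny illustration of the rvar idea, here is a sketch using the posterior package directly, with made-up draws rather than our model’s posterior:

library(posterior)
x <- rvar(rnorm(4000, mean = 0, sd = 1))  # wrap 4000 draws into a single rvar
x^2 + 1                                   # ordinary math propagates across all draws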

Next we combine these posteriors with a vector of predictor values to make a posterior distribution of LINES:

normal_reg_predline <- normal_reg_no_pred |> 
  tidybayes::spread_rvars(slope, intercept) |> 
  expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |> 
  mutate(mu = intercept + slope*x)

knitr::kable(normal_reg_predline)
slope            intercept       x   mu
-0.085 ± 0.019   17 ± 0.1    -15.0   18 ± 0.31
-0.085 ± 0.019   17 ± 0.1     -7.5   18 ± 0.18
-0.085 ± 0.019   17 ± 0.1      0.0   17 ± 0.10
-0.085 ± 0.019   17 ± 0.1      7.5   17 ± 0.18
-0.085 ± 0.019   17 ± 0.1     15.0   16 ± 0.31

Finally we’ll plot these:

normal_reg_predline |> 
  ggplot(aes(x = x, ydist = mu)) + 
  stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)

Posterior predictions in Stan

We can also make these posterior predictions in Stan.

normal_regression <- cmdstan_model(stan_file = "topics/02_regression/normal_regression.stan")

normal_regression
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;
  
  // calculate expectation
  post_bill_dep_average = intercept + slope * pred_values;
  
  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept + slope * pred_values[i], sigma);
  }  
  
}
penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          npost = 6,
          pred_values = modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)
          ))

bill_norm_reg <- normal_regression$sample(data = data_list, 
                                          refresh = 0)
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.1 seconds.
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.5 seconds.
library(tidyverse)

bill_posterior <- bill_norm_reg |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list$pred_values[i]) 

bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_average)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_obs)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") +
  labs(title = "Predicted observations")

EXERCISE

Extend this model to include species. Specifically, let each species have its own value of the intercept. This involves combining this regression example with the previous activity on discrete predictors.

When you’re done, look at the resulting summary of coefficients. What do you notice that’s different?

normal_regression_spp <- cmdstan_model(stan_file = "topics/02_regression/normal_regression_spp.stan")

normal_regression_spp
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // species IDs
  array[N] int spp_id;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
  array[npost] int pred_spp_id;
}
parameters {
  vector[3] intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
  bill_dep ~ normal(intercept[spp_id] + slope * bill_len, sigma);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;
  
  // calculate expectation
  post_bill_dep_average = intercept[pred_spp_id] + slope * pred_values;
  
  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept[pred_spp_id[i]] + slope * pred_values[i], sigma);
  }  
  
}

We set up a list for this model just as we did before. Note that this time we are using TRIPLE the pred_values, because we want to run independent predictions for each species.

bill_vec <- modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)

data_list_spp <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          spp_id = as.numeric(as.factor(species)),
          npost = 3*6,
          pred_values = rep(bill_vec, 3),
          pred_spp_id = rep(1:3, each = 6)
          ))

normal_reg_spp_post <- normal_regression_spp$sample(data = data_list_spp, refresh = 0)
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.1 seconds.
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.9 seconds.

Note that the sign of the slope is different now! Giving each species its own intercept reverses the overall negative relationship, a classic example of Simpson’s paradox.

normal_reg_spp_post$summary()
# A tibble: 42 × 10
   variable         mean   median     sd    mad       q5      q95  rhat ess_bulk
   <chr>           <dbl>    <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <dbl>    <dbl>
 1 lp__         -157.    -157.    1.56   1.35   -161.    -156.    0.999    1892.
 2 intercept[1]   19.4     19.4   0.117  0.116    19.2     19.6   1.00     1937.
 3 intercept[2]   17.4     17.4   0.141  0.141    17.2     17.7   1.00     2143.
 4 intercept[3]   14.3     14.3   0.105  0.105    14.1     14.4   1.00     1949.
 5 slope           0.198    0.198 0.0171 0.0173    0.170    0.226 1.00     1686.
 6 sigma           0.956    0.955 0.0360 0.0350    0.899    1.02  1.00     3169.
 7 post_bill_d…   17.0     17.0   0.966  0.963    15.4     18.6   0.999    3886.
 8 post_bill_d…   18.1     18.1   0.945  0.953    16.5     19.6   1.00     3833.
 9 post_bill_d…   19.2     19.2   0.966  0.954    17.6     20.8   1.00     3755.
10 post_bill_d…   20.3     20.3   0.960  0.941    18.7     21.9   1.00     3841.
# ℹ 32 more rows
# ℹ 1 more variable: ess_tail <dbl>
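
To see the flip side by side, we can pull just the slope out of each fit (a quick check with cmdstanr’s summary method):

## pooled model: negative slope; within-species model: positive slope
normal_reg_no_pred$summary(variables = "slope")
normal_reg_spp_post$summary(variables = "slope")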

Plotting posterior predictions

Using stat_lineribbon(), let’s plot the average and predicted intervals for this regression.

bill_posterior <- normal_reg_spp_post |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list_spp$pred_values[i],
         spp = data_list_spp$pred_spp_id[i]) |> 
  mutate(spp = as.factor(levels(penguins$species)[spp]))

bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_average,
             fill = spp, 
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 fill = species, colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) +   
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_obs,
             fill = spp,
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Predicted observations") + 
  facet_wrap(~spp, ncol = 1)

Exercise!

Show how \(\sigma\) differs between these two models. One possible approach is sketched below.
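
A sketch, extracting the posterior of \(\sigma\) from each fit as an rvar (the species model’s \(\sigma\) should be much smaller, since species differences now soak up most of the spread):

sigma_single  <- tidybayes::spread_rvars(normal_reg_no_pred, sigma)$sigma
sigma_species <- tidybayes::spread_rvars(normal_reg_spp_post, sigma)$sigma

## posterior means (the full rvars can also be plotted or summarized further)
c(single = mean(sigma_single), species = mean(sigma_species))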