Univariate regression

The shortest route to science is a straight line.

Load packages and data

library(palmerpenguins)
library(ggplot2)
suppressPackageStartupMessages(library(dplyr))
library(tidybayes)
# library(cmdstanr)
suppressPackageStartupMessages(library(rstan))
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Statistical models of penguin bill morphology

We’ll be studying the relationship between two continuous measurements of penguin bills: bill length and bill depth. Specifically, we’ll ask “Are longer bills also deeper?” This question might not be the most interesting ecologically, but it is a great chance to practice some interesting stats.

Let’s begin with plotting the data:

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point() + 
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Bill depth (mm) as predicted by bill length (mm) across the entire palmerpenguins dataset.

Let’s write a simple statistical model for these data:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

What should our priors be? Before we can answer that, we have a more important question:

WHERE IS ZERO??

It has to be somewhere. Does its location make sense? Take control and choose for yourself.

If we fit a model like this without thinking about the location of zero, we get some pretty silly answers:

coef(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
   (Intercept) bill_length_mm 
   20.88546832    -0.08502128 

When the value of bill length is 0, the average of the response is the intercept:

\[ \begin{align} \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \mu_i &= \beta_0 + \beta_1\times0 \\ \mu_i &= \beta_0 \\ \end{align} \]

But, if we take the data as we found it, we’re going to be talking about \(\beta_0\) as the depth of a penguin’s bill when the bill has 0 length! Clearly that isn’t a very meaningful value. From the point of view of setting priors and interpreting coefficients, it helps a lot to set a meaningful 0.

A very common choice is to subtract the average from your independent variable, so that a penguin with an average bill length now has a centered value of 0:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

Now \(\beta_0\) means the average bill depth at the average bill length. It becomes easier to think about priors:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(17,2) \\ \beta_1 &\sim \text{Normal}(0,.5) \\ \sigma &\sim \text{Exponential}(0.5) \end{align} \]
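Where does the 17 come from? With a centered predictor, the intercept is the average response, so the observed mean bill depth is a natural prior center. Here is a quick check of that logic (a sketch; penguins_centered and bill_length_c are illustrative names):

penguins_centered <- dplyr::mutate(
  penguins,
  bill_length_c = bill_length_mm - mean(bill_length_mm, na.rm = TRUE)
)
## with a centered predictor, the least-squares intercept lands on the
## average bill depth (about 17.1 mm)
coef(lm(bill_depth_mm ~ bill_length_c, data = penguins_centered))
mean(penguins$bill_depth_mm, na.rm = TRUE)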

Exercise

What continuous predictors have you used in your analysis? How would you find a biologically meaningful zero? Think about how you would center time, age, mass, fitness etc.

Prior predictive simulations

Armed with this model, it becomes much easier to think about prior predictions.

We’ll make a bunch of lines implied by the equation above. There are two steps:

  1. Center the predictor
  2. Make up a vector that goes from the minimum to the maximum of the predictor. This is just for convenience!
bill_len_centered <- with(penguins,
                          bill_length_mm - mean(bill_length_mm,
                                                na.rm = TRUE))

## make up a short vector
some_bill_lengths <- seq(
  from = min(bill_len_centered, na.rm = TRUE), 
  to = max(bill_len_centered, na.rm = TRUE),
  length.out = 10
  )
Shortcuts to these common tasks

These tasks are so common that they are automated in helper functions.

For centering predictors, see the base R function ?scale (however, doing this by hand is often more convenient).

For creating a short vector over the range of a predictor, see modelr::seq_range. The R package modelr has many different functions to help with modelling.
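For instance, here is a minimal sketch of both helpers doing the two steps above (the _alt names are just to avoid clobbering the objects we made by hand):

## scale() can center for us; it returns a matrix, hence the [, 1]
bill_len_centered_alt <- scale(penguins$bill_length_mm,
                               center = TRUE, scale = FALSE)[, 1]

## a 10-point grid over the observed range of the predictor
some_bill_lengths_alt <- modelr::seq_range(na.omit(bill_len_centered_alt), n = 10)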

To simulate, we’ll use some matrix algebra, as we saw in lecture:

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

knitr::kable(head(X))
|   | some_bill_lengths|
|--:|-----------------:|
|  1|       -11.8219298|
|  1|        -8.7663743|
|  1|        -5.7108187|
|  1|        -2.6552632|
|  1|         0.4002924|
|  1|         3.4558480|
knitr::kable(head(B))
|       |           |          |           |           |           |           |           |
|:------|----------:|---------:|----------:|----------:|----------:|----------:|----------:|
|inters | 17.2896383| 19.556429| 18.8264880| 14.3230226| 18.4934815| 20.3406031| 17.7371358|
|slopes |  0.2565904|  0.495448|  0.2071873| -0.3701628| -0.2776949| -0.1027866| -0.4361996|
## each column of prior_mus is one simulated line: X is 10 x 2, B is 2 x 7
prior_mus <- X %*% B

matplot(x = some_bill_lengths,
        y = prior_mus, type = "l")

Exercise

Copy the code above. Increase the number of simulations. Which priors are too wide? Which are too narrow?

Simulating Observations

There are always at least TWO kinds of predictions we can be thinking about:

  1. Predicted averages. This is often called a “confidence” interval for a regression line.
  2. Predicted observations. This is often called a “prediction” interval. (A short sketch of the difference follows below.)
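Concretely, at a single bill length the average inherits its spread only from the line’s parameters, while a new observation adds \(\sigma\) on top. A minimal sketch using the priors above:

## sketch: spread of the average vs. a new observation, at the mean bill length
b0  <- rnorm(1e4, 17, 2)      # prior draws of the intercept
b1  <- rnorm(1e4, 0, 0.5)     # prior draws of the slope
sg  <- rexp(1e4, rate = 0.5)  # prior draws of sigma
mu0 <- b0 + b1 * 0            # predicted average at x = 0: the "confidence" kind
y0  <- rnorm(1e4, mean = mu0, sd = sg)  # predicted observation: the "prediction" kind
c(sd_average = sd(mu0), sd_observation = sd(y0))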

We can use the full model to simulate observations!

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)
sigmas <- rexp(7, rate = 0.3)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

prior_mus <- X %*% B

## one column of simulated observations per prior draw
prior_obs <- matrix(0, nrow = nrow(prior_mus), ncol = ncol(prior_mus))

## add observation noise around each simulated average line
for (j in 1:ncol(prior_obs)) {
  prior_obs[,j] <- rnorm(n = nrow(prior_mus),
                         mean = prior_mus[,j],
                         sd = sigmas[j])
}

matplot(x = some_bill_lengths,
        y = prior_obs, type = "p")

Tidyverse style for those who indulge:

tibble(
  sim_id = 1:7,
  slopes = rnorm(7, 0, .5),
  inters = rnorm(7, 17, 2),
  sigmas = rexp(7, rate = 0.2)
  ) |> 
  mutate(x = list(seq(from = -10, to = 10, length.out = 6))) |> 
  rowwise() |> 
  mutate(avg = list(x * slopes + inters),
         obs = list(rnorm(length(avg), mean = avg, sd = sigmas)),
         sim_id = as.factor(sim_id)) |> 
  tidyr::unnest(cols = c("x", "avg", "obs")) |> 
  ggplot(aes(x = x, y = avg, group = sim_id, fill = sim_id)) + 
  geom_line(aes(colour = sim_id)) + 
  geom_point(aes(y = obs, fill = sim_id), pch = 21, size = 3) + 
  scale_fill_brewer(type = "qual") + 
  scale_colour_brewer(type = "qual") + 
  facet_wrap(~sim_id)

EXERCISE

Pick one of the two simulations above and modify it. Here are some suggested modifications:

  • Experiment with priors that are “too narrow” or “too wide”.
  • Try a different distribution than the one used
  • Instead of bill size, imagine that we are applying this model to YOUR data. What would you change?

Linear regression in Stan

Now we write a Stan program for this model. We’ll begin with a simple model that has no posterior predictions:

normal_regression_no_prediction <- stan_model(
  file = here::here("topics/02_regression/normal_regression_no_prediction.stan"),
  model_name = "normal_regression_no_prediction")

normal_regression_no_prediction
S4 class stanmodel 'normal_regression_no_prediction' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
} 

To get the posterior, we need to pass our data to Stan. We follow the same steps as before:

  • Remember to remove NAs first!
  • Arrange the data in a list.
  • Pass the data to a Stan model to estimate.
## drop NAs
penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

## assemble data list
data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm
          ))

## run the sampler, using the compiled model.
normal_reg_no_pred <- sampling(
  object = normal_regression_no_prediction,
  data = data_list, 
  refresh = 0)
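Before reading the estimates, it’s worth a quick check that the sampler behaved. A minimal sketch using built-in rstan and bayesplot diagnostics:

## sketch: check the sampler before trusting the estimates
rstan::check_hmc_diagnostics(normal_reg_no_pred)
bayesplot::mcmc_trace(as.array(normal_reg_no_pred),
                      pars = c("intercept", "slope", "sigma"))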

summary(normal_reg_no_pred)$summary
                   mean      se_mean         sd         2.5%           25%
intercept   17.14879647 0.0015503892 0.10346566   16.9404187   17.08009150
slope       -0.08440727 0.0002796366 0.01880894   -0.1215021   -0.09709005
sigma        1.92532431 0.0011155299 0.07327060    1.7858456    1.87508209
lp__      -395.67475128 0.0267169260 1.19300817 -398.8297774 -396.23426581
                    50%           75%       97.5%    n_eff      Rhat
intercept   17.15109315   17.21793369   17.356910 4453.598 1.0009309
slope       -0.08421363   -0.07182047   -0.047809 4524.185 1.0000700
sigma        1.92275361    1.97412896    2.076412 4314.169 0.9993515
lp__      -395.35376822 -394.81515460 -394.305017 1993.948 1.0002518
normal_reg_no_pred |> 
  bayesplot::mcmc_areas(pars = c("slope", "intercept", "sigma"))

EXERCISE

Discussion: Look just at the posterior distribution of the slope right above. Do we have evidence that there’s a relationship between bill length and bill depth?
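One way to put a number on that discussion (a sketch): count the share of posterior draws in which the slope falls below zero.

## sketch: posterior probability that the slope is negative
slope_draws <- rstan::extract(normal_reg_no_pred, pars = "slope")$slope
mean(slope_draws < 0)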

Posterior predictions in R

We can calculate a posterior prediction line directly in R for these data. I’ll show each step in this workflow separately:

normal_reg_no_pred |> 
  spread_rvars(slope, intercept, sigma)
# A tibble: 1 × 3
            slope  intercept        sigma
       <rvar[1d]> <rvar[1d]>   <rvar[1d]>
1  -0.084 ± 0.019   17 ± 0.1  1.9 ± 0.073

tidybayes helps us extract the posterior distribution of the parameters into a convenient object called an rvar. See the tidybayes documentation and the posterior package’s rvar vignette to learn more.
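As a small illustration of the idea (a sketch; rvar comes from the posterior package, which tidybayes builds on):

x <- posterior::rvar(rnorm(4000, mean = 2, sd = 0.1))
x        # prints as 2 ± 0.1: one object holding 4000 draws
x^2 + 1  # arithmetic propagates across every draw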

Next we combine these posteriors with a vector of predictor values to make a posterior distribution of LINES:

normal_reg_predline <- normal_reg_no_pred |> 
  tidybayes::spread_rvars(slope, intercept) |> 
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |> 
  mutate(mu = intercept + slope*x)

normal_reg_predline
# A tibble: 5 × 4
            slope  intercept     x         mu
       <rvar[1d]> <rvar[1d]> <dbl> <rvar[1d]>
1  -0.084 ± 0.019   17 ± 0.1 -15    18 ± 0.30
2  -0.084 ± 0.019   17 ± 0.1  -7.5  18 ± 0.18
3  -0.084 ± 0.019   17 ± 0.1   0    17 ± 0.10
4  -0.084 ± 0.019   17 ± 0.1   7.5  17 ± 0.17
5  -0.084 ± 0.019   17 ± 0.1  15    16 ± 0.30

Finally we’ll plot these:

normal_reg_predline |> 
  ggplot(aes(x = x, ydist = mu)) + 
  stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)

Using posterior draws individually

The above workflow makes a nice figure, but perhaps it helps to see the individual lines to understand what is happening here. We can get these with another tidybayes function, spread_draws:

normal_reg_predline_draws <- normal_reg_no_pred |> 
  tidybayes::spread_draws(slope, intercept, ndraws = 12)

knitr::kable(normal_reg_predline_draws)
| .chain| .iteration| .draw|      slope| intercept|
|------:|----------:|-----:|----------:|---------:|
|      4|        732|  3732| -0.0802951|  16.99151|
|      2|        531|  1531| -0.0894771|  17.04231|
|      3|        564|  2564| -0.0411701|  17.08328|
|      4|        796|  3796| -0.0909085|  17.20641|
|      2|        148|  1148| -0.0760923|  16.99839|
|      1|        270|   270| -0.0981297|  17.00251|
|      1|        157|   157| -0.0813635|  17.12567|
|      1|        348|   348| -0.1036138|  17.07408|
|      2|         20|  1020| -0.0797008|  17.06398|
|      4|        877|  3877| -0.0949920|  17.24112|
|      4|        352|  3352| -0.0876686|  17.12087|
|      1|        328|   328| -0.0764813|  17.17234|
normal_reg_predline_draws |> 
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |> 
  mutate(mu = intercept + slope*x) |> 
  ggplot(aes(x = x, y = mu, group = .draw)) + 
  geom_line() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)

Posterior predictions in Stan

We can also make these posterior predictions in Stan.

First, compile the model:

normal_regression <- stan_model(file = here::here("topics/02_regression/normal_regression.stan"))

normal_regression
S4 class stanmodel 'anon_model' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept + slope * pred_values[i], sigma);
  }

} 

Then prepare the data and feed it into the sampling() function.

penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          npost = 6,
          pred_values = modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)
          ))

bill_norm_reg <- sampling(normal_regression,
                          data = data_list, 
                          refresh = 0)
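Each generated quantity is recomputed for every posterior draw, so each prediction comes back as a full distribution. A quick way to see the shape (a sketch):

## sketch: one row per posterior draw, one column per prediction point
post_obs <- rstan::extract(bill_norm_reg, pars = "post_bill_dep_obs")$post_bill_dep_obs
dim(post_obs)  # 4000 x 6 with rstan's default 4 chains of 1000 kept draws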
library(tidyverse)

bill_posterior <- bill_norm_reg |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  ## the [i] above adds an integer index column, used here to look up
  ## the predictor value each prediction corresponds to
  mutate(bill_length = data_list$pred_values[i]) 

bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_average)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_obs)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") +
  labs(title = "Predicted observations")

EXERCISE

Extend this model to include species. Specifically, let each species have its own value of the intercept. This involves combining this regression example with the previous activity on discrete predictors.

When you’re done, look at the resulting summary of coefficients. What do you notice that’s different?

normal_regression_spp <- stan_model(file = here::here("topics/02_regression/normal_regression_spp.stan"),
                                    model_name = "normal_regression_spp")

normal_regression_spp
S4 class stanmodel 'normal_regression_spp' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // species IDs
  array[N] int spp_id;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
  array[npost] int pred_spp_id;
}
parameters {
  vector[3] intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
  bill_dep ~ normal(intercept[spp_id] + slope * bill_len, sigma);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept[pred_spp_id] + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept[pred_spp_id[i]] + slope * pred_values[i], sigma);
  }

} 

We set up a list for this model just as we did before. Note that this time we are using TRIPLE the pred_values, because we want to run independent predictions for each species.

bill_vec <- modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)

data_list_spp <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          spp_id = as.numeric(as.factor(species)),
          npost = 3*6,
          pred_values = rep(bill_vec, 3),
          pred_spp_id = rep(1:3, each = 6)
          ))

normal_reg_spp_post <- sampling(normal_regression_spp,
                                data = data_list_spp, refresh = 0)

Note that the sign of the slope is different now! Pooling across species masked the positive relationship within each species; this is a classic example of Simpson’s paradox.

summary(normal_reg_spp_post)$summary |> 
  head(5) |>
  knitr::kable()
|             |       mean|   se_mean|        sd|       2.5%|        25%|        50%|        75%|      97.5%|    n_eff|      Rhat|
|:------------|----------:|---------:|---------:|----------:|----------:|----------:|----------:|----------:|--------:|---------:|
|intercept[1] | 19.3627152| 0.0028533| 0.1193427| 19.1265277| 19.2831202| 19.3601586| 19.4455365| 19.5915510| 1749.395| 0.9997753|
|intercept[2] | 17.4423245| 0.0031066| 0.1444712| 17.1583700| 17.3451299| 17.4400714| 17.5403406| 17.7229913| 2162.686| 1.0000708|
|intercept[3] | 14.2744151| 0.0022653| 0.1052333| 14.0656090| 14.2031493| 14.2757284| 14.3444366| 14.4824974| 2158.007| 0.9996055|
|slope        |  0.1986246| 0.0004297| 0.0177094|  0.1637669|  0.1867533|  0.1984669|  0.2108494|  0.2323239| 1698.866| 0.9999005|
|sigma        |  0.9558671| 0.0006063| 0.0360305|  0.8892068|  0.9313392|  0.9541228|  0.9793045|  1.0284729| 3531.928| 0.9996378|

Plotting posterior predictions

Using stat_lineribbon(), let’s plot the average and predicted intervals for this regression.

bill_posterior <- normal_reg_spp_post |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list_spp$pred_values[i],
         spp = data_list_spp$pred_spp_id[i]) |> 
  mutate(spp = as.factor(levels(penguins$species)[spp]))

bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_average,
             fill = spp, 
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 fill = species, colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) +   
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_obs,
             fill = spp,
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Predicted observations") + 
  facet_wrap(~spp, ncol = 1)

Exercise!

  1. We have one model without species identity as an independent variable, and one which includes species. Look at the difference in \(\sigma\) between these two models. Why did the value change?
  2. Posterior predictions: Edit the generated quantities blocks of normal_regression.stan and normal_regression_spp.stan to create replicate observations for all N observations (that is, yrep, as we did in the model of bill depth previously). Use this to compare the models with bayesplot::ppc_dens_overlay() or another function from the bayesplot package. (A sketch of the R side follows below.)
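A sketch of the R side of exercise 2, once a replicate vector (called yrep here) exists in the generated quantities block:

## sketch: overlay densities of replicated and observed bill depths
yrep <- rstan::extract(bill_norm_reg, pars = "yrep")$yrep
bayesplot::ppc_dens_overlay(y = data_list$bill_dep, yrep = yrep[1:50, ])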