Univariate regression

The shortest route to science is a straight line.

Load packages and data

library(palmerpenguins)
library(ggplot2)
suppressPackageStartupMessages(library(dplyr))
library(tidybayes)
# library(cmdstanr)
suppressPackageStartupMessages(library(rstan))
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Statistical models of penguin bill morphology

We’ll be studying the relationship between two continuous measurements of penguin bills: bill length and bill depth. Specifically, we’ll ask “Are longer bills also deeper?” This question might not be the most interesting ecologically, but it is a great chance to practice some interesting stats.

Let’s begin with plotting the data:

penguins |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point() + 
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Bill depth (mm) as predicted by bill length (mm) across the entire palmerpenguins dataset.

Let’s write a simple statistical model for these data:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

What should our priors be? Before we can answer that, we have a more important question:

WHERE IS ZERO??

It has to be somewhere. Does its location make sense? Take control and choose for yourself.

If we fit a model like this without thinking about the location of zero, we get some pretty silly answers:

coef(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
   (Intercept) bill_length_mm 
   20.88546832    -0.08502128 

When the value of bill length is 0, the average of the response is the intercept:

\[ \begin{align} \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \mu_i &= \beta_0 + \beta_1\times0 \\ \mu_i &= \beta_0 \\ \end{align} \]

But, if we take the data as we found it, we’re going to be talking about \(\beta_0\) as the depth of a penguin’s bill when the bill has 0 length! Clearly that isn’t a very meaningful value. From the point of view of setting priors and interpreting coefficients, it helps a lot to set a meaningful 0.

A very common choice is to subtract the average from your independent variable, so that a penguin with an average bill length now has a centered value of 0:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

Now \(\beta_0\) means the average bill depth at the average bill length. It becomes easier to think about priors:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(17,2) \\ \beta_1 &\sim \text{Normal}(0,.5) \\ \sigma &\sim \text{Exponential}(0.5) \end{align} \]
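Where does the 17 come from? With a centered predictor, the intercept is the average response, so the observed mean bill depth is a natural prior center. Here is a quick check of that logic (a sketch; penguins_centered and bill_length_c are illustrative names):

penguins_centered <- dplyr::mutate(
  penguins,
  bill_length_c = bill_length_mm - mean(bill_length_mm, na.rm = TRUE)
)
## with a centered predictor, the least-squares intercept lands on the
## average bill depth (about 17.1 mm)
coef(lm(bill_depth_mm ~ bill_length_c, data = penguins_centered))
mean(penguins$bill_depth_mm, na.rm = TRUE)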

Exercise

What continuous predictors have you used in your analysis? How would you find a biologically meaningful zero? Think about how you would center time, age, mass, fitness etc.

Prior predictive simulations

Armed with this model, it becomes much easier to think about prior predictions.

We’ll make a bunch of lines implied by the equation above. There are two steps:

  1. Center the predictor
  2. Make up a vector that goes from the minimum to the maximum of the predictor. This is just for convenience!
bill_len_centered <- with(penguins,
                          bill_length_mm - mean(bill_length_mm,
                                                na.rm = TRUE))

## make up a short vector
some_bill_lengths <- seq(
  from = min(bill_len_centered, na.rm = TRUE), 
  to = max(bill_len_centered, na.rm = TRUE),
  length.out = 10
  )
Shortcuts to these common tasks

These tasks are so common that they are automated in helper functions.

For centering predictors, see the base R function ?scale (however, doing this by hand is often more convenient).

For creating a short vector over the range of a predictor, see modelr::seq_range. The R package modelr has many different functions to help with modelling.
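For instance, here is a minimal sketch of both helpers doing the two steps above (the _alt names are just to avoid clobbering the objects we made by hand):

## scale() can center for us; it returns a matrix, hence the [, 1]
bill_len_centered_alt <- scale(penguins$bill_length_mm,
                               center = TRUE, scale = FALSE)[, 1]

## a 10-point grid over the observed range of the predictor
some_bill_lengths_alt <- modelr::seq_range(na.omit(bill_len_centered_alt), n = 10)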

To simulate, we’ll use some matrix algebra, as we saw in lecture:

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

knitr::kable(head(X))
|   | some_bill_lengths|
|--:|-----------------:|
|  1|       -11.8219298|
|  1|        -8.7663743|
|  1|        -5.7108187|
|  1|        -2.6552632|
|  1|         0.4002924|
|  1|         3.4558480|
knitr::kable(head(B))
|       |           |          |           |           |           |           |           |
|:------|----------:|---------:|----------:|----------:|----------:|----------:|----------:|
|inters | 17.2896383| 19.556429| 18.8264880| 14.3230226| 18.4934815| 20.3406031| 17.7371358|
|slopes |  0.2565904|  0.495448|  0.2071873| -0.3701628| -0.2776949| -0.1027866| -0.4361996|
## each column of prior_mus is one simulated line: X is 10 x 2, B is 2 x 7
prior_mus <- X %*% B

matplot(x = some_bill_lengths,
        y = prior_mus, type = "l")

Exercise

Copy the code above. Increase the number of simulations. Which priors are too wide? Which are too narrow?

Simulating Observations

There are always at least TWO kinds of predictions we can be thinking about:

  1. Predicted averages. This is often called a “confidence” interval for a regression line.
  2. Predicted observations. This is often called a “prediction” interval. (A short sketch of the difference follows below.)
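Concretely, at a single bill length the average inherits its spread only from the line’s parameters, while a new observation adds \(\sigma\) on top. A minimal sketch using the priors above:

## sketch: spread of the average vs. a new observation, at the mean bill length
b0  <- rnorm(1e4, 17, 2)      # prior draws of the intercept
b1  <- rnorm(1e4, 0, 0.5)     # prior draws of the slope
sg  <- rexp(1e4, rate = 0.5)  # prior draws of sigma
mu0 <- b0 + b1 * 0            # predicted average at x = 0: the "confidence" kind
y0  <- rnorm(1e4, mean = mu0, sd = sg)  # predicted observation: the "prediction" kind
c(sd_average = sd(mu0), sd_observation = sd(y0))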

We can use the full model to simulate observations!

slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)
sigmas <- rexp(7, rate = 0.3)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

prior_mus <- X %*% B

## one column of simulated observations per prior draw
prior_obs <- matrix(0, nrow = nrow(prior_mus), ncol = ncol(prior_mus))

## add observation noise around each simulated average line
for (j in 1:ncol(prior_obs)) {
  prior_obs[,j] <- rnorm(n = nrow(prior_mus),
                         mean = prior_mus[,j],
                         sd = sigmas[j])
}

matplot(x = some_bill_lengths,
        y = prior_obs, type = "p")

Tidyverse style for those who indulge:

tibble(
  sim_id = 1:7,
  slopes = rnorm(7, 0, .5),
  inters = rnorm(7, 17, 2),
  sigmas = rexp(7, rate = 0.2)
  ) |> 
  mutate(x = list(seq(from = -10, to = 10, length.out = 6))) |> 
  rowwise() |> 
  mutate(avg = list(x * slopes + inters),
         obs = list(rnorm(length(avg), mean = avg, sd = sigmas)),
         sim_id = as.factor(sim_id)) |> 
  tidyr::unnest(cols = c("x", "avg", "obs")) |> 
  ggplot(aes(x = x, y = avg, group = sim_id, fill = sim_id)) + 
  geom_line(aes(colour = sim_id)) + 
  geom_point(aes(y = obs, fill = sim_id), pch = 21, size = 3) + 
  scale_fill_brewer(type = "qual") + 
  scale_colour_brewer(type = "qual") + 
  facet_wrap(~sim_id)

EXERCISE

Pick one of the two simulations above and modify it. Here are some suggested modifications:

  • Experiment with priors that are “too narrow” or “too wide”.
  • Try a different distribution than the one used
  • Instead of bill size, imagine that we are applying this model to YOUR data. What would you change?

Linear regression in Stan

Now we write a Stan program for this model. We’ll begin with a simple model that has no posterior predictions:

normal_regression_no_prediction <- stan_model(
  file = here::here("topics/02_regression/normal_regression_no_prediction.stan"),
  model_name = "normal_regression_no_prediction")

normal_regression_no_prediction
S4 class stanmodel 'normal_regression_no_prediction' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
} 

To get the posterior, we need to pass our data to Stan. We follow the same steps as before:

  • Remember to remove NAs first!
  • Arrange the data in a list.
  • Pass the data to a Stan model to estimate.
## drop NAs
penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

## assemble data list
data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm
          ))

## run the sampler, using the compiled model.
normal_reg_no_pred <- sampling(
  object = normal_regression_no_prediction,
  data = data_list, 
  refresh = 0)
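Before reading the estimates, it’s worth a quick check that the sampler behaved. A minimal sketch using built-in rstan and bayesplot diagnostics:

## sketch: check the sampler before trusting the estimates
rstan::check_hmc_diagnostics(normal_reg_no_pred)
bayesplot::mcmc_trace(as.array(normal_reg_no_pred),
                      pars = c("intercept", "slope", "sigma"))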

summary(normal_reg_no_pred)$summary
                   mean      se_mean         sd         2.5%           25%
intercept   17.14879647 0.0015503892 0.10346566   16.9404187   17.08009150
slope       -0.08440727 0.0002796366 0.01880894   -0.1215021   -0.09709005
sigma        1.92532431 0.0011155299 0.07327060    1.7858456    1.87508209
lp__      -395.67475128 0.0267169260 1.19300817 -398.8297774 -396.23426581
                    50%           75%       97.5%    n_eff      Rhat
intercept   17.15109315   17.21793369   17.356910 4453.598 1.0009309
slope       -0.08421363   -0.07182047   -0.047809 4524.185 1.0000700
sigma        1.92275361    1.97412896    2.076412 4314.169 0.9993515
lp__      -395.35376822 -394.81515460 -394.305017 1993.948 1.0002518
normal_reg_no_pred |> 
  bayesplot::mcmc_areas(pars = c("slope", "intercept", "sigma"))

EXERCISE

Discussion: Look just at the posterior distribution of the slope right above. Do we have evidence that there’s a relationship between bill length and bill depth?
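One way to put a number on that discussion (a sketch): count the share of posterior draws in which the slope falls below zero.

## sketch: posterior probability that the slope is negative
slope_draws <- rstan::extract(normal_reg_no_pred, pars = "slope")$slope
mean(slope_draws < 0)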

Posterior predictions in R

We can calculate a posterior prediction line directly in R for these data. I’ll show each step in this workflow separately:

normal_reg_no_pred |> 
  spread_rvars(slope, intercept, sigma)
# A tibble: 1 × 3
            slope  intercept        sigma
       <rvar[1d]> <rvar[1d]>   <rvar[1d]>
1  -0.084 ± 0.019   17 ± 0.1  1.9 ± 0.073

tidybayes helps us extract the posterior distribution of the parameters into a convenient object called an rvar. See the tidybayes documentation and the posterior package’s rvar vignette to learn more.
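As a small illustration of the idea (a sketch; rvar comes from the posterior package, which tidybayes builds on):

x <- posterior::rvar(rnorm(4000, mean = 2, sd = 0.1))
x        # prints as 2 ± 0.1: one object holding 4000 draws
x^2 + 1  # arithmetic propagates across every draw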

Next we combine these posteriors with a vector of predictor values to make a posterior distribution of LINES:

normal_reg_predline <- normal_reg_no_pred |> 
  tidybayes::spread_rvars(slope, intercept) |> 
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |> 
  mutate(mu = intercept + slope*x)

normal_reg_predline
# A tibble: 5 × 4
            slope  intercept     x         mu
       <rvar[1d]> <rvar[1d]> <dbl> <rvar[1d]>
1  -0.084 ± 0.019   17 ± 0.1 -15    18 ± 0.30
2  -0.084 ± 0.019   17 ± 0.1  -7.5  18 ± 0.18
3  -0.084 ± 0.019   17 ± 0.1   0    17 ± 0.10
4  -0.084 ± 0.019   17 ± 0.1   7.5  17 ± 0.17
5  -0.084 ± 0.019   17 ± 0.1  15    16 ± 0.30

Finally we’ll plot these:

normal_reg_predline |> 
  ggplot(aes(x = x, ydist = mu)) + 
  stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)

Using posterior draws individually

The above workflow makes a nice figure, but perhaps it helps to see the individual lines to understand what is happening here. We can get these with another tidybayes function, spread_draws:

normal_reg_predline_draws <- normal_reg_no_pred |> 
  tidybayes::spread_draws(slope, intercept, ndraws = 12)

knitr::kable(normal_reg_predline_draws)
| .chain| .iteration| .draw|      slope| intercept|
|------:|----------:|-----:|----------:|---------:|
|      4|        732|  3732| -0.0802951|  16.99151|
|      2|        531|  1531| -0.0894771|  17.04231|
|      3|        564|  2564| -0.0411701|  17.08328|
|      4|        796|  3796| -0.0909085|  17.20641|
|      2|        148|  1148| -0.0760923|  16.99839|
|      1|        270|   270| -0.0981297|  17.00251|
|      1|        157|   157| -0.0813635|  17.12567|
|      1|        348|   348| -0.1036138|  17.07408|
|      2|         20|  1020| -0.0797008|  17.06398|
|      4|        877|  3877| -0.0949920|  17.24112|
|      4|        352|  3352| -0.0876686|  17.12087|
|      1|        328|   328| -0.0764813|  17.17234|
normal_reg_predline_draws |> 
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |> 
  mutate(mu = intercept + slope*x) |> 
  ggplot(aes(x = x, y = mu, group = .draw)) + 
  geom_line() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)

Posterior predictions in Stan

We can also make these posterior predictions in Stan.

First, compile the model:

normal_regression <- stan_model(file = here::here("topics/02_regression/normal_regression.stan"))

normal_regression
S4 class stanmodel 'anon_model' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept + slope * pred_values[i], sigma);
  }

} 

Then prepare the data and feed it into the sampling() function.

penguins_no_NA <- penguins |> 
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |> 
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          npost = 6,
          pred_values = modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)
          ))

bill_norm_reg <- sampling(normal_regression,
                          data = data_list, 
                          refresh = 0)
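Each generated quantity is recomputed for every posterior draw, so each prediction comes back as a full distribution. A quick way to see the shape (a sketch):

## sketch: one row per posterior draw, one column per prediction point
post_obs <- rstan::extract(bill_norm_reg, pars = "post_bill_dep_obs")$post_bill_dep_obs
dim(post_obs)  # 4000 x 6 with rstan's default 4 chains of 1000 kept draws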
library(tidyverse)

bill_posterior <- bill_norm_reg |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  ## the [i] above adds an integer index column, used here to look up
  ## the predictor value each prediction corresponds to
  mutate(bill_length = data_list$pred_values[i]) 

bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_average)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length, ydist = post_bill_dep_obs)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") +
  labs(title = "Predicted observations")

EXERCISE

Extend this model to include species. Specifically, let each species have its own value of the intercept. This involves combining this regression example with the previous activity on discrete predictors.

When you’re done, look at the resulting summary of coefficients. What do you notice that’s different?

normal_regression_spp <- stan_model(file = here::here("topics/02_regression/normal_regression_spp.stan"),
                                    model_name = "normal_regression_spp")

normal_regression_spp
S4 class stanmodel 'normal_regression_spp' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // species IDs
  array[N] int spp_id;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
  array[npost] int pred_spp_id;
}
parameters {
  vector[3] intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
  bill_dep ~ normal(intercept[spp_id] + slope * bill_len, sigma);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept[pred_spp_id] + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept[pred_spp_id[i]] + slope * pred_values[i], sigma);
  }

} 

We set up a list for this model just as we did before. Note that this time we are using TRIPLE the pred_values, because we want to run independent predictions for each species.

bill_vec <- modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)

data_list_spp <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          spp_id = as.numeric(as.factor(species)),
          npost = 3*6,
          pred_values = rep(bill_vec, 3),
          pred_spp_id = rep(1:3, each = 6)
          ))

normal_reg_spp_post <- sampling(normal_regression_spp,
                                data = data_list_spp, refresh = 0)

Note that the sign of the slope is different now! Pooling across species masked the positive relationship within each species; this is a classic example of Simpson’s paradox.

summary(normal_reg_spp_post)$summary |> 
  head(5) |>
  knitr::kable()
|             |       mean|   se_mean|        sd|       2.5%|        25%|        50%|        75%|      97.5%|    n_eff|      Rhat|
|:------------|----------:|---------:|---------:|----------:|----------:|----------:|----------:|----------:|--------:|---------:|
|intercept[1] | 19.3627152| 0.0028533| 0.1193427| 19.1265277| 19.2831202| 19.3601586| 19.4455365| 19.5915510| 1749.395| 0.9997753|
|intercept[2] | 17.4423245| 0.0031066| 0.1444712| 17.1583700| 17.3451299| 17.4400714| 17.5403406| 17.7229913| 2162.686| 1.0000708|
|intercept[3] | 14.2744151| 0.0022653| 0.1052333| 14.0656090| 14.2031493| 14.2757284| 14.3444366| 14.4824974| 2158.007| 0.9996055|
|slope        |  0.1986246| 0.0004297| 0.0177094|  0.1637669|  0.1867533|  0.1984669|  0.2108494|  0.2323239| 1698.866| 0.9999005|
|sigma        |  0.9558671| 0.0006063| 0.0360305|  0.8892068|  0.9313392|  0.9541228|  0.9793045|  1.0284729| 3531.928| 0.9996378|

Plotting posterior predictions

Using stat_lineribbon(), let’s plot the average and predicted intervals for this regression.

bill_posterior <- normal_reg_spp_post |> 
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list_spp$pred_values[i],
         spp = data_list_spp$pred_spp_id[i]) |> 
  mutate(spp = as.factor(levels(penguins$species)[spp]))

bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_average,
             fill = spp, 
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 fill = species, colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) +   
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Average response")
bill_posterior |> 
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_obs,
             fill = spp,
             colour = spp)) + 
  tidybayes::stat_lineribbon() + 
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 colour = species),
             data = penguins_no_NA, 
             inherit.aes = FALSE) + 
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Predicted observations") + 
  facet_wrap(~spp, ncol = 1)

Exercise!

  1. We have one model without species identity as an independent variable, and one which includes species. Look at the difference in \(\sigma\) between these two models. Why did the value change?
  2. Posterior predictions: Edit the generated quantities blocks of normal_regression.stan and normal_regression_spp.stan to create replicate observations for all N observations (that is, yrep, as we did in the model of bill depth previously). Use this to compare the models with bayesplot::ppc_dens_overlay() or another function from the bayesplot package. (A sketch of the R side follows below.)
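A sketch of the R side of exercise 2, once a replicate vector (called yrep here) exists in the generated quantities block:

## sketch: overlay densities of replicated and observed bill depths
yrep <- rstan::extract(bill_norm_reg, pars = "yrep")$yrep
bayesplot::ppc_dens_overlay(y = data_list$bill_dep, yrep = yrep[1:50, ])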