Univariate regression

The shortest route to science is a straight line.

Statistical models of penguin bill morphology.

Load packages and data

library(palmerpenguins)
library(ggplot2)
suppressPackageStartupMessages(library(dplyr))
library(tidybayes)
# library(cmdstanr)
suppressPackageStartupMessages(library(rstan))
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())
We'll be studying the relationship between two measurements of penguin bills. Specifically, we'll ask "Are longer bills also deeper?" This question might not be the most interesting ecologically, but it is a great chance to practice some interesting statistics.
Let’s begin with plotting the data:
penguins |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  stat_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
These data come from the palmerpenguins dataset. Let's write a simple statistical model for these data:
\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]
What should our priors be? Before we can answer that, we have a more important question:
Where is zero? It has to be somewhere. Does its location make sense for your data? If not, take control and choose it for yourself.
If we fit a model like this without thinking about the location of zero, we get some pretty silly answers:
coef(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
(Intercept) bill_length_mm
20.88546832 -0.08502128
When the value of bill length is 0, the average of the response is the intercept:
\[ \begin{align} \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \mu_i &= \beta_0 + \beta_1\times0 \\ \mu_i &= \beta_0 \\ \end{align} \]
But if we take the data as we found them, we're going to be talking about \(\beta_0\) as the depth of a penguin's bill when the bill has 0 length! Clearly that isn't a very meaningful value. From the point of view of setting priors and interpreting coefficients, it helps a lot to set a meaningful zero.
A very common choice is to subtract the average from your independent variable, so that a penguin with an average bill length now has a centered value of 0:
\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \end{align} \]
Now \(\beta_0\) means the average bill depth at the average bill length. It becomes easier to think about priors:
\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(17,2) \\ \beta_1 &\sim \text{Normal}(0,.5) \\ \sigma &\sim \text{Exponential}(0.5) \end{align} \]
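As a quick check (this snippet is my addition, not part of the original lesson), you can refit the earlier lm() with a centered predictor. The slope is unchanged, but the intercept should now land near the sample mean of bill depth, a value a prior can actually speak to:

coef(lm(bill_depth_mm ~ I(bill_length_mm - mean(bill_length_mm, na.rm = TRUE)),
        data = penguins))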
What continuous predictors have you used in your analysis? How would you find a biologically meaningful zero? Think about how you would center time, age, mass, fitness etc.
Prior predictive simulations
Armed with this model, it becomes much easier to think about prior predictions.
We'll make a bunch of lines implied by the equation above. There are two steps:
- Center the predictor
- Make up a vector that goes from the minimum to the maximum of the predictor. This is just for convenience!
bill_len_centered <- with(penguins,
                          bill_length_mm - mean(bill_length_mm,
                                                na.rm = TRUE))

## make up a short vector
some_bill_lengths <- seq(
  from = min(bill_len_centered, na.rm = TRUE),
  to = max(bill_len_centered, na.rm = TRUE),
  length.out = 10
)
These tasks are so common that they are automated in helper functions. For centering predictors, see the base R function ?scale (however, doing this by hand is often more convenient). For creating a short vector over the range of a predictor, see modelr::seq_range. The R package modelr has many different functions to help with modelling.
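For example (my own illustration, not from the lesson), seq_range() returns evenly spaced values across the observed range of a variable:

## five evenly spaced values spanning the observed bill lengths
modelr::seq_range(na.omit(penguins$bill_length_mm), n = 5)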
To simulate, we’ll use some matrix algebra, as we saw in lecture:
slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

knitr::kable(head(X))
|   | some_bill_lengths|
|--:|-----------------:|
|  1|       -11.8219298|
|  1|        -8.7663743|
|  1|        -5.7108187|
|  1|        -2.6552632|
|  1|         0.4002924|
|  1|         3.4558480|
knitr::kable(head(B))
|       |          |         |          |          |          |          |          |
|:------|---------:|--------:|---------:|---------:|---------:|---------:|---------:|
|inters |17.2896383|19.556429|18.8264880|14.3230226|18.4934815|20.3406031|17.7371358|
|slopes | 0.2565904| 0.495448| 0.2071873|-0.3701628|-0.2776949|-0.1027866|-0.4361996|
prior_mus <- X %*% B

matplot(x = some_bill_lengths,
        y = prior_mus, type = "l")
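It can help to pause and check the dimensions here (this check is my addition): each of the 7 columns of prior_mus is one candidate regression line, evaluated at the 10 bill lengths in some_bill_lengths.

dim(X)          # 10 x 2: a column of 1s and the centered bill lengths
dim(B)          # 2 x 7: one (intercept, slope) pair per column
dim(prior_mus)  # 10 x 7: one column of means per simulated line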
Copy the code above. Increase the number of simulations. Which priors are too wide? Which are too narrow?
Simulating Observations
There are always at least TWO kinds of predictions we can be thinking about:
- Predicted averages. This is often called a “confidence” interval for a regression line.
- Predicted observations. This is often called a “prediction” interval.
We can use the full model to simulate observations!
slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)
sigmas <- rexp(7, rate = 0.3)

X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

prior_mus <- X %*% B

prior_obs <- matrix(0, nrow = nrow(prior_mus), ncol = ncol(prior_mus))

for (j in 1:ncol(prior_obs)) {
  prior_obs[, j] <- rnorm(n = nrow(prior_mus),
                          mean = prior_mus[, j],
                          sd = sigmas[j])
}

matplot(x = some_bill_lengths,
        y = prior_obs, type = "p")
Tidyverse style for those who indulge:
tibble(
  sim_id = 1:7,
  slopes = rnorm(7, 0, .5),
  inters = rnorm(7, 17, 2),
  sigmas = rexp(7, rate = 0.2)
) |>
  mutate(x = list(seq(from = -10, to = 10, length.out = 6))) |>
  rowwise() |>
  mutate(avg = list(x * slopes + inters),
         obs = list(rnorm(length(avg), mean = avg, sd = sigmas)),
         sim_id = as.factor(sim_id)) |>
  tidyr::unnest(cols = c("x", "avg", "obs")) |>
  ggplot(aes(x = x, y = avg, group = sim_id, fill = sim_id)) +
  geom_line(aes(colour = sim_id)) +
  geom_point(aes(y = obs, fill = sim_id), pch = 21, size = 3) +
  scale_fill_brewer(type = "qual") +
  scale_colour_brewer(type = "qual") +
  facet_wrap(~sim_id)
Pick one of the two simulations above and modify it. Here are some suggested modifications:
- Experiment with priors that are "too narrow" or "too wide" (one possible starting point is sketched just after this list).
- Try a different distribution than the one used
- Instead of bill size, imagine that we are applying this model to YOUR data. What would you change?
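For the first suggestion, here is one possible starting point (the specific values are mine and purely illustrative): make the slope prior much wider and watch the prior lines fan out.

slopes_wide <- rnorm(7, 0, 5)   # sd = 5 instead of 0.5: probably "too wide"
inters      <- rnorm(7, 17, 2)

prior_mus_wide <- cbind(1, some_bill_lengths) %*% rbind(inters, slopes_wide)

matplot(x = some_bill_lengths, y = prior_mus_wide, type = "l")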
Linear regression in Stan
Now we write a Stan program for this model. We’ll begin with a simple model that has no posterior predictions:
normal_regression_no_prediction <- stan_model(
  file = here::here("topics/02_regression/normal_regression_no_prediction.stan"),
  model_name = "normal_regression_no_prediction")
normal_regression_no_prediction
S4 class stanmodel 'normal_regression_no_prediction' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
}
In order to get the posterior, we need to pass our data to Stan. We follow the same steps as previously:
- Remember to remove NAs first!
- Arrange the data in a list.
- Pass the data to a Stan model to estimate.
## drop NAs
penguins_no_NA <- penguins |>
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |>
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

## assemble data list
data_list <- with(penguins_no_NA,
                  list(N = length(bill_length_center),
                       bill_len = bill_length_center,
                       bill_dep = bill_depth_mm
                  ))

## run the sampler, using the compiled model.
normal_reg_no_pred <- sampling(
  object = normal_regression_no_prediction,
  data = data_list,
  refresh = 0)
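It is also worth confirming that the list matches the names and sizes declared in the Stan data block (this quick check is my addition, not part of the original):

str(data_list)  # N: a single integer; bill_len and bill_dep: numeric vectors of length N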
summary(normal_reg_no_pred)$summary
mean se_mean sd 2.5% 25%
intercept 17.14879647 0.0015503892 0.10346566 16.9404187 17.08009150
slope -0.08440727 0.0002796366 0.01880894 -0.1215021 -0.09709005
sigma 1.92532431 0.0011155299 0.07327060 1.7858456 1.87508209
lp__ -395.67475128 0.0267169260 1.19300817 -398.8297774 -396.23426581
50% 75% 97.5% n_eff Rhat
intercept 17.15109315 17.21793369 17.356910 4453.598 1.0009309
slope -0.08421363 -0.07182047 -0.047809 4524.185 1.0000700
sigma 1.92275361 1.97412896 2.076412 4314.169 0.9993515
lp__ -395.35376822 -394.81515460 -394.305017 1993.948 1.0002518
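The n_eff and Rhat columns above are the standard convergence diagnostics; as a quick programmatic check (my addition), you can confirm that every Rhat is close to 1:

all(summary(normal_reg_no_pred)$summary[, "Rhat"] < 1.01)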
normal_reg_no_pred |>
  bayesplot::mcmc_areas(pars = c("slope", "intercept", "sigma"))
Discussion: Look just at the posterior distribution of the slope right above. Do we have evidence that there's a relationship between bill length and bill depth?
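One way to put a number on this (a sketch of my own, not part of the original activity) is to ask what fraction of posterior draws of the slope fall below zero:

slope_draws <- rstan::extract(normal_reg_no_pred, pars = "slope")$slope
mean(slope_draws < 0)   # posterior probability that the slope is negative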
Posterior predictions in R
We can calculate a posterior prediction line directly in R for these data. I’ll show each step in this workflow separately:
normal_reg_no_pred |>
  spread_rvars(slope, intercept, sigma)
# A tibble: 1 × 3
slope intercept sigma
<rvar[1d]> <rvar[1d]> <rvar[1d]>
1 -0.084 ± 0.019 17 ± 0.1 1.9 ± 0.073
tidybayes helps us extract the posterior distribution of the parameters into a convenient object called an rvar. Learn more about tidybayes here and about the rvar datatype here.
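To build intuition (my own illustration; the values are made up), an rvar, which comes from the posterior package, wraps a whole vector of draws in a single object that prints as mean ± sd and supports ordinary arithmetic:

x <- posterior::rvar(rnorm(1000, mean = 17, sd = 0.1))
x         # prints as the mean ± sd of the draws
x - 17    # arithmetic propagates the full set of draws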
Next we combine these posteriors with a vector of predictor values to make a posterior distribution of LINES:
normal_reg_predline <- normal_reg_no_pred |>
  tidybayes::spread_rvars(slope, intercept) |>
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |>
  mutate(mu = intercept + slope*x)
normal_reg_predline
# A tibble: 5 × 4
slope intercept x mu
<rvar[1d]> <rvar[1d]> <dbl> <rvar[1d]>
1 -0.084 ± 0.019 17 ± 0.1 -15 18 ± 0.30
2 -0.084 ± 0.019 17 ± 0.1 -7.5 18 ± 0.18
3 -0.084 ± 0.019 17 ± 0.1 0 17 ± 0.10
4 -0.084 ± 0.019 17 ± 0.1 7.5 17 ± 0.17
5 -0.084 ± 0.019 17 ± 0.1 15 16 ± 0.30
Finally we’ll plot these:
normal_reg_predline |>
  ggplot(aes(x = x, dist = mu)) +
  stat_lineribbon() +
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)
Using posterior draws individually
The above workflow makes a nice figure, but perhaps it helps to see the individual lines to understand what is happening here. We can get these with another tidybayes function, spread_draws():
normal_reg_predline_draws <- normal_reg_no_pred |>
  tidybayes::spread_draws(slope, intercept, ndraws = 12)

knitr::kable(normal_reg_predline_draws)
.chain | .iteration | .draw | slope | intercept |
---|---|---|---|---|
4 | 732 | 3732 | -0.0802951 | 16.99151 |
2 | 531 | 1531 | -0.0894771 | 17.04231 |
3 | 564 | 2564 | -0.0411701 | 17.08328 |
4 | 796 | 3796 | -0.0909085 | 17.20641 |
2 | 148 | 1148 | -0.0760923 | 16.99839 |
1 | 270 | 270 | -0.0981297 | 17.00251 |
1 | 157 | 157 | -0.0813635 | 17.12567 |
1 | 348 | 348 | -0.1036138 | 17.07408 |
2 | 20 | 1020 | -0.0797008 | 17.06398 |
4 | 877 | 3877 | -0.0949920 | 17.24112 |
4 | 352 | 3352 | -0.0876686 | 17.12087 |
1 | 328 | 328 | -0.0764813 | 17.17234 |
normal_reg_predline_draws |>
  tidyr::expand_grid(x = seq(from = -15, to = 15, length.out = 5)) |>
  mutate(mu = intercept + slope*x) |>
  ggplot(aes(x = x, y = mu, group = .draw)) +
  geom_line() +
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             inherit.aes = FALSE,
             data = penguins_no_NA)
Posterior predictions in Stan
We can also make these posterior predictions in Stan.
First, compile the model:
normal_regression <- stan_model(file = here::here("topics/02_regression/normal_regression.stan"))
normal_regression
S4 class stanmodel 'anon_model' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;
  // calculate expectation
  post_bill_dep_average = intercept + slope * pred_values;
  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept + slope * pred_values[i], sigma);
  }
}
Then prepare the data and feed it into the sampling() function.
penguins_no_NA <- penguins |>
  tidyr::drop_na(bill_depth_mm, bill_length_mm) |>
  dplyr::mutate(
    bill_length_center = bill_length_mm - mean(bill_length_mm))

data_list <- with(penguins_no_NA,
                  list(N = length(bill_length_center),
                       bill_len = bill_length_center,
                       bill_dep = bill_depth_mm,
                       npost = 6,
                       pred_values = modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)
                  ))

bill_norm_reg <- sampling(normal_regression,
                          data = data_list,
                          refresh = 0)
library(tidyverse)

bill_posterior <- bill_norm_reg |>
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list$pred_values[i])
bill_posterior |>
  ggplot(aes(x = bill_length, dist = post_bill_dep_average)) +
  tidybayes::stat_lineribbon() +
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA,
             inherit.aes = FALSE) +
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") +
  labs(title = "Average response")
bill_posterior |>
  ggplot(aes(x = bill_length, dist = post_bill_dep_obs)) +
  tidybayes::stat_lineribbon() +
  geom_point(aes(x = bill_length_center, y = bill_depth_mm),
             data = penguins_no_NA,
             inherit.aes = FALSE) +
  scale_fill_brewer(palette = "Greens", direction = -1, guide = "none") +
  labs(title = "Predicted observations")
Extend this model to include species. Specifically, let each species have its own value of the intercept. This involves combining this regression example with the previous activity on discrete predictors.
When you’re done, look at the resulting summary of coefficients. What do you notice that’s different?
normal_regression_spp <- stan_model(file = here::here("topics/02_regression/normal_regression_spp.stan"),
                                    model_name = "normal_regression_spp")
normal_regression_spp
S4 class stanmodel 'normal_regression_spp' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // species IDs
  array[N] int spp_id;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
  array[npost] int pred_spp_id;
}
parameters {
  vector[3] intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
  bill_dep ~ normal(intercept[spp_id] + slope * bill_len, sigma);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;
  // calculate expectation
  post_bill_dep_average = intercept[pred_spp_id] + slope * pred_values;
  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept[pred_spp_id[i]] + slope * pred_values[i], sigma);
  }
}
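If the vectorized indexing in intercept[spp_id] is unfamiliar, it works just like indexing a vector in R: each observation picks up the intercept of its own species. Here is a tiny R analogue (the numbers are made up for illustration):

intercept <- c(19.4, 17.4, 14.3)   # one hypothetical intercept per species
spp_id    <- c(1, 1, 2, 3, 3)      # species IDs of five hypothetical penguins
intercept[spp_id]                  # each penguin gets its own species' intercept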
We set up a list for this model just as we did before. Note that this time we are using TRIPLE the pred_values, because we want to run independent predictions for each species.
bill_vec <- modelr::seq_range(penguins_no_NA$bill_length_center, n = 6)

data_list_spp <- with(penguins_no_NA,
                      list(N = length(bill_length_center),
                           bill_len = bill_length_center,
                           bill_dep = bill_depth_mm,
                           spp_id = as.numeric(as.factor(species)),
                           npost = 3*6,
                           pred_values = rep(bill_vec, 3),
                           pred_spp_id = rep(1:3, each = 6)
                      ))

normal_reg_spp_post <- sampling(normal_regression_spp,
                                data = data_list_spp, refresh = 0)
Note that the sign of the slope is different now!
summary(normal_reg_spp_post)$summary |>
  head(5) |>
  knitr::kable()
|             |       mean|   se_mean|        sd|       2.5%|        25%|        50%|        75%|      97.5%|    n_eff|      Rhat|
|:------------|----------:|---------:|---------:|----------:|----------:|----------:|----------:|----------:|--------:|---------:|
|intercept[1] | 19.3627152| 0.0028533| 0.1193427| 19.1265277| 19.2831202| 19.3601586| 19.4455365| 19.5915510| 1749.395| 0.9997753|
|intercept[2] | 17.4423245| 0.0031066| 0.1444712| 17.1583700| 17.3451299| 17.4400714| 17.5403406| 17.7229913| 2162.686| 1.0000708|
|intercept[3] | 14.2744151| 0.0022653| 0.1052333| 14.0656090| 14.2031493| 14.2757284| 14.3444366| 14.4824974| 2158.007| 0.9996055|
|slope        |  0.1986246| 0.0004297| 0.0177094|  0.1637669|  0.1867533|  0.1984669|  0.2108494|  0.2323239| 1698.866| 0.9999005|
|sigma        |  0.9558671| 0.0006063| 0.0360305|  0.8892068|  0.9313392|  0.9541228|  0.9793045|  1.0284729| 3531.928| 0.9996378|
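To see the sign flip side by side (this comparison is my own addition), pull the slope summary from each fit:

rbind(
  no_species   = summary(normal_reg_no_pred,  pars = "slope")$summary[, c("mean", "2.5%", "97.5%")],
  with_species = summary(normal_reg_spp_post, pars = "slope")$summary[, c("mean", "2.5%", "97.5%")]
)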
Plotting posterior predictions
Using stat_lineribbon(), let's plot the average and predicted intervals for this regression.
bill_posterior <- normal_reg_spp_post |>
  tidybayes::spread_rvars(post_bill_dep_average[i],
                          post_bill_dep_obs[i]) |>
  mutate(bill_length = data_list_spp$pred_values[i],
         spp = data_list_spp$pred_spp_id) |>
  mutate(spp = as.factor(levels(penguins$species)[spp]))
bill_posterior |>
  ggplot(aes(x = bill_length,
             ydist = post_bill_dep_average,
             fill = spp,
             colour = spp)) +
  tidybayes::stat_lineribbon() +
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 fill = species, colour = species),
             data = penguins_no_NA,
             inherit.aes = FALSE) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Average response")
bill_posterior |>
  ggplot(aes(x = bill_length,
             dist = post_bill_dep_obs,
             fill = spp,
             colour = spp)) +
  tidybayes::stat_lineribbon() +
  geom_point(aes(x = bill_length_center,
                 y = bill_depth_mm,
                 colour = species),
             data = penguins_no_NA,
             inherit.aes = FALSE) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Predicted observations") +
  facet_wrap(~spp, ncol = 1)
Exercise!
- We have one model without species identity as an independent variable, and one which includes species. Look at the difference in \(\sigma\) between these two models. Why did the value change? (A starting sketch for extracting both posteriors follows this list.)
- Posterior predictions: Edit the generated quantities blocks of `normal_regression.stan` and `normal_regression_spp.stan` to create replicate observations for all N observations (that is, `yrep`, as we did in the model of bill depth previously). Use this to compare the models using `bayesplot::ppc_dens_overlay()` or another function from the bayesplot package.
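A starting sketch for the first exercise (my own addition; it assumes both fitted models from above are still in your session), comparing the posterior of \(\sigma\) between the two models:

sigma_compare <- dplyr::bind_rows(
  no_species   = tidybayes::spread_draws(normal_reg_no_pred,  sigma),
  with_species = tidybayes::spread_draws(normal_reg_spp_post, sigma),
  .id = "model"
)

ggplot(sigma_compare, aes(x = sigma, y = model)) +
  tidybayes::stat_halfeye()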