Univariate regression

The shortest route to science is a straight line.

Load packages and data

library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':

    penguins, penguins_raw
library(rstan)
Loading required package: StanHeaders

rstan version 2.32.7 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())

Statistical models of Penguin bill morphology.

We’ll be studying the relationship between two numbers about penguin bills. Specifically, we’ll ask “Are longer bills also deeper?”

This question might not be the most interesting ecologically, but it is a great chance to practice some interesting stats.

Let’s begin with plotting the data :

# Remove NAs
penguins_clean <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])

# Plot data
plot(penguins_clean$bill_length_mm,
     penguins_clean$bill_depth_mm,
     xlab = "Bill length (mm)",
     ylab = "Bill depth (mm)",
     pch = 19, 
     cex = 0.7)

# Regression
reg <- lm(bill_depth_mm ~ bill_length_mm,
          data = penguins_clean)

# Regression lines
abline(reg, 
       col = "steelblue",
       lwd = 3)

Bill depth (mm) as predicted by bill length (mm) across the entire palmerpenguins dataset.

Let’s write a simple statistical model for these data:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \\ \sigma &\sim \text{Exponential}(??) \end{align} \]

What should our priors be?

Before we can answer that question, we have a more important question:

WarningWHERE IS ZERO?

Specifically, what does it mean biologically when bill lengths is at 0 ? Does it make sense?

If we fit a model like this without thinking about the location of zero, we get some pretty silly answers:

coef(lm(bill_depth_mm ~ bill_length_mm, data = penguins))
   (Intercept) bill_length_mm 
   20.88546832    -0.08502128 

When the value of bill length is 0, the average of the response is the intercept:

\[ \begin{align} \mu_i &= \beta_0 + \beta_1\times\text{Bill length}_i \\ \mu_i &= \beta_0 + \beta_1\times0 \\ \mu_i &= \beta_0 \\ \end{align} \]

But, if we take the data as we found it, we’re going to be talking about \(\beta_0\) as the depth of a penguin’s bill when the bill has 0 length! Clearly that isn’t a very meaningful value. From the point of view of setting priors and interpreting coefficients, it helps a lot to set a meaningful 0.

A very common choice is to subtract the average from your independent variable, so that penguins with an average bill length now have a mean of 0:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(??) \\ \beta_1 &\sim \text{Normal}(??) \end{align} \]

Now \(\beta_0\) means the average bill depth at the average bill length.

Using this “new” data, it becomes easier to think about priors:

\[ \begin{align} \text{Bill depth}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_1\times(\text{Bill length}_i - \overline{\text{Bill length}})\\ \beta_0 &\sim \text{Normal}(17,2) \\ \beta_1 &\sim \text{Normal}(0,.5) \\ \sigma &\sim \text{Exponential}(0.5) \end{align} \]

NoteExercise

What continuous predictors have you used in your analysis? How would you find a biologically meaningful zero? Think about how you would centre time, age, mass, fitness etc.

Prior predictive simulations

Armed with this model, it becomes much easier to think about prior predictions.

We’ll make a bunch of lines derived from the equation above. There’s two steps:

  1. Centre the predictor
  2. Make up a vector that goes from the minimum to the maximum of the predictor. This is just for convenience!
bill_len_centered <- with(penguins_clean,
                          bill_length_mm - mean(bill_length_mm))

## make up a short vector
some_bill_lengths <- seq(from = min(bill_len_centered),
                         to = max(bill_len_centered),
                         length.out = 10)
WarningShortcuts to these common tasks

These tasks are so common that they are automated in helper functions.

For centering predictors, see the ?scale function ?scale

For creating a short vector over the range of a predictor, use seq(min(x), max(x), length.out = n).

To simulate, we’ll use some matrix algebra, as we saw in lecture:

# Parameters to simulate
slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)

# Basis of the regression
X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

# Bill length
knitr::kable(head(X))
some_bill_lengths
1 -11.8219298
1 -8.7663743
1 -5.7108187
1 -2.6552632
1 0.4002924
1 3.4558480
# Regression parameters
knitr::kable(head(B))
inters 19.170348 13.7062430 18.8615390 16.9928140 17.7574422 15.612315 14.5494335
slopes -0.447755 0.0793149 -0.2137636 -0.4838215 0.3335648 -0.056673 0.0088077
# Simulation for mu 
prior_mus <- X %*% B

# Draw results
matplot(x = some_bill_lengths,
        y = prior_mus, type = "l")

NoteExercise

Copy the code above. Increase the number of simulations and study the structure of the priors. Which prior are too wide? Which are too narrow?

Simulating Observations

There are always at least TWO kinds of predictions we can be thinking about:

  1. Predicted averages. This is often called a “confidence” interval for a regression line.
  2. Predicted observations. This is often called a “prediction” interval.

We can use the full model to simulate observations!

# Simulate parameters
slopes <- rnorm(7, 0, .5)
inters <- rnorm(7, 17, 2)
sigmas <- rexp(7, rate = 0.3)

# Basis of the regression
X <- cbind(1, some_bill_lengths)
B <- rbind(inters, slopes)

# Simulation for mu 
prior_mus <- X %*% B

# Result object for the simulated mu 
prior_obs <- matrix(0, nrow = nrow(prior_mus), ncol = ncol(prior_mus))

# Simulate observation
for (j in 1:ncol(prior_obs)) {
  prior_obs[,j] <- rnorm(n = nrow(prior_mus),
                         mean = prior_mus[,j],
                         sd = sigmas[j])
}

# Plot results
matplot(x = some_bill_lengths,
        y = prior_obs, type = "p")

Base R version with small multiples:

set.seed(42)
n_sim <- 7

# Simulate parameters
slopes <- rnorm(n_sim, 0, .5)
inters <- rnorm(n_sim, 17, 2)
sigmas <- rexp(n_sim, rate = 0.2)

# Generate X
x_sim  <- seq(-10, 10, length.out = 6)

# Plot structure
par(mfrow = c(2, 4), mar = c(3, 3, 2, 1))

for (i in seq_len(n_sim)) {
  # Build regression
  avg <- inters[i] + slopes[i] * x_sim
  
  # Sample observations from regression average
  obs <- rnorm(length(avg), mean = avg, sd = sigmas[i])
  
  # Plot regression result
  plot(x_sim,
       avg, 
       type = "l",
       col = i + 1,
       lwd = 2,
       ylim = range(c(avg, obs)), 
       xlab = "x",
       ylab = "value",
       main = paste("Sim", i))
  # Plot points
  points(x_sim, obs, pch = 21, bg = i + 1, col = i + 1)
}

TipEXERCISE

Pick one of the two simulations above and modify it. Here are some suggested modifications:

  • Experiment with priors that are “too narrow” or “too wide”.
  • Try a different distribution than the one used
  • Instead of bill size, imagine that we are applying this model to YOUR data. What would you change?

Linear regression in Stan

Now let’s write this model with Stan. We’ll begin with a simple model that has no posterior predictions:

normal_regression_no_prediction <- stan_model(
  file = "topics/02_regression/normal_regression_no_prediction.stan",
  model_name = "normal_regression_no_prediction")

normal_regression_no_prediction
S4 class stanmodel 'normal_regression_no_prediction' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
} 

In order to get the posterior, we need to put our data in Stan. We follow the same steps as previously:

  • Remember to remove NAs first!
  • arrange the data in a list
  • pass the data to a Stan model to estimate.
penguins_no_NA <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm", "species")])

penguins_no_NA$bill_length_center <- penguins_no_NA$bill_length_mm -
                                     mean(penguins_no_NA$bill_length_mm)

## Assemble data list
data_list <- with(penguins_no_NA,
                  list(N = length(bill_length_center),
                       bill_len = bill_length_center,
                       bill_dep = bill_depth_mm))

## Run the sampler, using the compiled model
normal_reg_no_pred <- sampling(
  object = normal_regression_no_prediction,
  data = data_list,
  refresh = 0)

summary(normal_reg_no_pred)$summary
                   mean      se_mean         sd         2.5%           25%
intercept   17.14801897 0.0016233866 0.10512092   16.9470600   17.07622490
slope       -0.08490538 0.0003062498 0.01894187   -0.1222547   -0.09786552
sigma        1.92496487 0.0011839345 0.07382542    1.7812733    1.87570429
lp__      -395.70683313 0.0275135750 1.22085229 -398.8090134 -396.28767384
                    50%           75%         97.5%    n_eff      Rhat
intercept   17.14843349   17.21843949   17.35462251 4193.091 0.9994090
slope       -0.08448248   -0.07219798   -0.04755828 3825.552 0.9993120
sigma        1.92210188    1.97292401    2.07447768 3888.271 0.9994085
lp__      -395.38407095 -394.79830318 -394.29751247 1968.939 1.0009826
bayesplot::mcmc_areas(normal_reg_no_pred,
                      pars = c("slope", 
                               "intercept", 
                               "sigma"))

TipEXERCISE

Discussion : Look only at the posterior distribution of the slope above. Do we have evidence that there’s a relationship between bill length and bill depth.

Posterior predictions in R

We can calculate a posterior prediction line directly in R for these data. I’ll show each step in this workflow separately:

draws_inspect <- rstan::extract(normal_reg_no_pred, 
                                pars = c("slope", "intercept", "sigma"))
str(draws_inspect)
List of 3
 $ slope    : num [1:4000(1d)] -0.0965 -0.0577 -0.1 -0.0807 -0.0914 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ intercept: num [1:4000(1d)] 17.3 17 17 16.8 17 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ sigma    : num [1:4000(1d)] 2.04 1.91 2.03 1.94 1.82 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL

rstan::extract() pulls the posterior draws for each parameter into a named list — e.g. draws_inspect$slope, draws_inspect$intercept are numeric vectors with one value per MCMC draw.

Next we combine these posteriors with a vector of observations to make a posterior distribution of LINES:

draws_reg <- rstan::extract(normal_reg_no_pred, pars = c("slope", "intercept"))
x_seq     <- seq(-15, 15, length.out = 5)

# mu_draws: matrix (rows = MCMC draws, cols = x-values)
mu_draws  <- outer(as.vector(draws_reg$slope), x_seq) + as.vector(draws_reg$intercept)
str(mu_draws)
 num [1:4000, 1:5] 18.8 17.9 18.5 18 18.4 ...

Finally, let’s plot these:

q <- apply(mu_draws, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(penguins_no_NA$bill_length_center, penguins_no_NA$bill_depth_mm,
     xlab = "Bill length (centered, mm)", ylab = "Bill depth (mm)", pch = 19, cex = 0.5)

# 95% Confidence intervals
polygon(c(x_seq, rev(x_seq)), c(q[1,], rev(q[5,])),
        col = adjustcolor("steelblue", 0.15), border = NA)

# 50% Confidence intervals
polygon(c(x_seq, rev(x_seq)), c(q[2,], rev(q[4,])),
        col = adjustcolor("steelblue", 0.30), border = NA)

# Model
lines(x_seq, q[3,], col = "steelblue", lwd = 2)

Using posterior draws individually

The above workflow makes a nice figure, but perhaps it helps to see the individual lines to understand what is happening here.

We can get these by sampling rows directly from the extracted draws:

set.seed(42)
draw_idx <- sample(length(draws_reg$slope), 12)
draws12 <- data.frame(.draw = draw_idx,
                      slope = draws_reg$slope[draw_idx],
                      intercept = draws_reg$intercept[draw_idx])

knitr::kable(draws12)
.draw slope intercept
2609 -0.1084951 17.08312
2369 -0.0990726 17.03514
1177 -0.0716151 17.11933
1098 -0.0804855 17.22800
1252 -0.0980822 17.20400
3218 -0.1004108 17.27400
634 -0.0806283 17.30834
2097 -0.0533362 17.22336
1152 -0.1035335 17.10023
1327 -0.0889499 17.07135
2072 -0.1017803 17.03660
3911 -0.0722955 17.01069
plot(penguins_no_NA$bill_length_center, penguins_no_NA$bill_depth_mm,
     xlab = "Bill length (centered, mm)", ylab = "Bill depth (mm)", pch = 19, cex = 0.5)
for (i in seq_len(nrow(draws12))) {
  lines(x_seq, draws12$intercept[i] + draws12$slope[i] * x_seq,
        col = adjustcolor("steelblue", 0.6))
}

Posterior predictions in Stan

We can also make these posterior predictions in Stan.

First, compile the model:

normal_regression <- stan_model(file = here::here("topics/02_regression/normal_regression.stan"))

normal_regression
S4 class stanmodel 'anon_model' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  bill_dep ~ normal(intercept + slope * bill_len, sigma);
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept + slope * pred_values[i], sigma);
  }

} 

Then prepare the data and feed it into the sampling() function.

penguins_no_NA <- na.omit(penguins[, c("bill_length_mm", 
                                       "bill_depth_mm", 
                                       "species")])

penguins_no_NA$bill_length_center <- penguins_no_NA$bill_length_mm - 
                                     mean(penguins_no_NA$bill_length_mm)

data_list <- with(penguins_no_NA,
     list(N = length(bill_length_center),
          bill_len = bill_length_center,
          bill_dep = bill_depth_mm,
          npost = 6,
          pred_values = seq(min(penguins_no_NA$bill_length_center),
                            max(penguins_no_NA$bill_length_center), length.out = 6)
          ))

bill_norm_reg <- sampling(normal_regression,
                          data = data_list,
                          refresh = 0)
draws_avg <- rstan::extract(bill_norm_reg, "post_bill_dep_average")$post_bill_dep_average
draws_obs <- rstan::extract(bill_norm_reg, "post_bill_dep_obs")$post_bill_dep_obs
x_pred    <- data_list$pred_values

q_avg <- apply(draws_avg, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
q_obs <- apply(draws_obs, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

y_range <- range(penguins_no_NA$bill_depth_mm)

# Average response ()
plot(penguins_no_NA$bill_length_center,
     penguins_no_NA$bill_depth_mm,
     xlab = "Bill length (centered, mm)", 
     ylab = "Bill depth (mm)",
     main = "Average response", 
     pch = 19, 
     cex = 0.5, 
     ylim = y_range)

# 95 % interval
polygon(c(x_pred, rev(x_pred)), 
        c(q_avg[1,], rev(q_avg[5,])),
        col = adjustcolor("darkgreen", 0.15),
        border = NA)

# 50 % interval
polygon(c(x_pred, rev(x_pred)),
        c(q_avg[2,], rev(q_avg[4,])),
        col = adjustcolor("darkgreen", 0.30),
        border = NA)

lines(x_pred, q_avg[3,], 
      col = "darkgreen",
      lwd = 2)
# Predicted observations
plot(penguins_no_NA$bill_length_center,
     penguins_no_NA$bill_depth_mm,
     xlab = "Bill length (centered, mm)",
     ylab = "Bill depth (mm)",
     main = "Predicted observations",
     pch = 19, 
     cex = 0.5, 
     ylim = y_range)

# 95 % Interval
polygon(c(x_pred, rev(x_pred)),
        c(q_obs[1,], rev(q_obs[5,])),
        col = adjustcolor("darkgreen", 0.15),
        border = NA)

# 50 % Interval
polygon(c(x_pred, rev(x_pred)), 
        c(q_obs[2,], rev(q_obs[4,])),
        col = adjustcolor("darkgreen", 0.30),
        border = NA)

lines(x_pred, q_obs[3,], col = "darkgreen", lwd = 2)

TipEXERCISE

Extend this model to include species. Specifically, let each species have its own value of the intercept. This involves combining this regression example with the previous activity on discrete predictors.

When you’re done, look at the resulting summary of coefficients. What do you notice that’s different?

normal_regression_spp <- stan_model(file = here::here("topics/02_regression/normal_regression_spp.stan"),
                                    model_name = "normal_regression_spp")

normal_regression_spp
S4 class stanmodel 'normal_regression_spp' coded as follows:
data {
  int<lower=0> N;
  vector[N] bill_len;
  vector[N] bill_dep;
  // species IDs
  array[N] int spp_id;
  // posterior predictions
  int<lower=0> npost;
  vector[npost] pred_values;
  array[npost] int pred_spp_id;
}
parameters {
  vector[3] intercept;
  real slope;
  real<lower=0> sigma;
}
model {
  intercept ~ normal(17, 2);
  slope ~ normal(0, 1);
  sigma ~ exponential(.7);
  bill_dep ~ normal(intercept[spp_id] + slope * bill_len, sigma);
}
generated quantities {
  vector[npost] post_bill_dep_obs;
  vector[npost] post_bill_dep_average;

  // calculate expectation
  post_bill_dep_average = intercept[pred_spp_id] + slope * pred_values;

  // make fake observations
  for (i in 1:npost) {
    post_bill_dep_obs[i] = normal_rng(intercept[pred_spp_id[i]] + slope * pred_values[i], sigma);
  }

} 

We set up a list for this model just as we did before. Note that this time we are using TRIPLE the pred_values, because we want to run independent predictions for each species.

bill_vec <- seq(min(penguins_no_NA$bill_length_center),
                max(penguins_no_NA$bill_length_center), 
                length.out = 6)

data_list_spp <- with(penguins_no_NA,
                      list(N = length(bill_length_center),
                           bill_len = bill_length_center,
                           bill_dep = bill_depth_mm,
                           spp_id = as.numeric(as.factor(species)),
                           npost = 3*6,
                           pred_values = rep(bill_vec, 3),
                           pred_spp_id = rep(1:3, each = 6)))

normal_reg_spp_post <- sampling(normal_regression_spp,
                                data = data_list_spp, refresh = 0)

Note that the sign of the slope is different now!

summaryReg <- summary(normal_reg_spp_post)$summary
knitr::kable(summaryReg)
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept[1] 19.3597188 0.0028076 0.1178442 19.1283731 19.2807603 19.3605046 19.4390444 19.592941 1761.709 1.0001721
intercept[2] 17.4461816 0.0033706 0.1446876 17.1679296 17.3461074 17.4458744 17.5466656 17.733834 1842.693 1.0010622
intercept[3] 14.2751953 0.0021842 0.1054456 14.0708215 14.2052106 14.2754060 14.3458451 14.483918 2330.546 0.9995826
slope 0.1981401 0.0004593 0.0176456 0.1635319 0.1864720 0.1980774 0.2099092 0.232515 1476.126 1.0009297
sigma 0.9561868 0.0006192 0.0372180 0.8851993 0.9318227 0.9551486 0.9801020 1.032546 3612.428 0.9996801
post_bill_dep_obs[1] 17.0174941 0.0158601 0.9725923 15.0930004 16.3664415 17.0394912 17.6929610 18.881642 3760.547 1.0001870
post_bill_dep_obs[2] 18.0951910 0.0150132 0.9661301 16.1987037 17.4443081 18.0979706 18.7449339 20.028641 4141.175 1.0000381
post_bill_dep_obs[3] 19.2119649 0.0152526 0.9694458 17.2993607 18.5715742 19.2117132 19.8668967 21.112527 4039.770 0.9996714
post_bill_dep_obs[4] 20.2937604 0.0155405 0.9694027 18.3552383 19.6686777 20.2876916 20.9455283 22.241329 3891.155 0.9991819
post_bill_dep_obs[5] 21.3751262 0.0162381 0.9847415 19.4383149 20.7210571 21.3761287 22.0238175 23.306502 3677.699 0.9991823
post_bill_dep_obs[6] 22.4848641 0.0190974 1.0455611 20.4449495 21.7892184 22.4733634 23.1460302 24.573956 2997.449 1.0009398
post_bill_dep_obs[7] 15.1041174 0.0166638 1.0224158 13.1558813 14.4192934 15.0863210 15.7782108 17.115795 3764.511 1.0003766
post_bill_dep_obs[8] 16.2058078 0.0161155 0.9698412 14.3408701 15.5625438 16.2030100 16.8437579 18.106316 3621.703 0.9999073
post_bill_dep_obs[9] 17.2907308 0.0150669 0.9577040 15.4332081 16.6465703 17.2683091 17.9554241 19.168899 4040.324 0.9994804
post_bill_dep_obs[10] 18.3642054 0.0149140 0.9630019 16.4394623 17.7171384 18.3738350 19.0153630 20.233431 4169.346 1.0006077
post_bill_dep_obs[11] 19.4510232 0.0155083 0.9730427 17.5770048 18.7889146 19.4435967 20.0855874 21.414828 3936.709 1.0007954
post_bill_dep_obs[12] 20.5443505 0.0157338 0.9847124 18.5994657 19.8706999 20.5334977 21.2082646 22.494451 3916.983 1.0000848
post_bill_dep_obs[13] 11.9678003 0.0171697 1.0103772 10.0180399 11.2826965 11.9462710 12.6443808 14.002803 3462.910 1.0006268
post_bill_dep_obs[14] 13.0682340 0.0156212 0.9764727 11.2038141 12.4081100 13.0670591 13.7122596 15.043696 3907.429 0.9997173
post_bill_dep_obs[15] 14.1146745 0.0151597 0.9729787 12.1945318 13.4394566 14.1187605 14.8025678 15.982926 4119.313 0.9998715
post_bill_dep_obs[16] 15.1958922 0.0150143 0.9645257 13.3082843 14.5550695 15.1890726 15.8269176 17.074409 4126.829 0.9997639
post_bill_dep_obs[17] 16.3135359 0.0148360 0.9444386 14.4352714 15.7026046 16.3278289 16.9566798 18.131907 4052.440 0.9999311
post_bill_dep_obs[18] 17.3875919 0.0166938 0.9949562 15.4211640 16.7064951 17.3990366 18.0559603 19.299965 3552.204 1.0006753
post_bill_dep_average[1] 17.0173206 0.0031367 0.1432558 16.7387642 16.9209788 17.0172751 17.1097395 17.297705 2085.788 1.0006505
post_bill_dep_average[2] 18.1070911 0.0011593 0.0812954 17.9511495 18.0525445 18.1063131 18.1607377 18.269893 4917.752 0.9997264
post_bill_dep_average[3] 19.1968616 0.0024731 0.1073937 18.9866559 19.1249089 19.1966252 19.2689300 19.408350 1885.751 1.0000465
post_bill_dep_average[4] 20.2866321 0.0048094 0.1878713 19.9226012 20.1576043 20.2861197 20.4085305 20.657279 1525.958 1.0005777
post_bill_dep_average[5] 21.3764025 0.0072298 0.2790979 20.8311073 21.1884936 21.3763720 21.5606494 21.926043 1490.256 1.0007415
post_bill_dep_average[6] 22.4661730 0.0097245 0.3732739 21.7310010 22.2168695 22.4651075 22.7144145 23.206268 1473.404 1.0008085
post_bill_dep_average[7] 15.1037834 0.0084767 0.3169614 14.4790827 14.8884086 15.1017771 15.3180297 15.736126 1398.159 1.0012461
post_bill_dep_average[8] 16.1935539 0.0060363 0.2294292 15.7520652 16.0367372 16.1901955 16.3489788 16.644641 1444.628 1.0012771
post_bill_dep_average[9] 17.2833244 0.0036972 0.1537815 16.9882789 17.1776681 17.2814085 17.3892937 17.587707 1730.056 1.0011323
post_bill_dep_average[10] 18.3730948 0.0018072 0.1161788 18.1493983 18.2956937 18.3713753 18.4507480 18.601003 4132.642 1.0002856
post_bill_dep_average[11] 19.4628653 0.0024856 0.1489430 19.1680281 19.3627047 19.4630555 19.5685519 19.744614 3590.651 0.9999307
post_bill_dep_average[12] 20.5526358 0.0045702 0.2229536 20.1120778 20.4021314 20.5493305 20.7127928 20.975749 2379.949 1.0002126
post_bill_dep_average[13] 11.9327971 0.0073430 0.2844504 11.3741740 11.7381787 11.9343087 12.1184582 12.497569 1500.612 1.0005419
post_bill_dep_average[14] 13.0225676 0.0048288 0.1939463 12.6472098 12.8918139 13.0242167 13.1494180 13.396970 1613.208 1.0002983
post_bill_dep_average[15] 14.1123381 0.0025008 0.1146999 13.8906469 14.0353964 14.1126295 14.1886142 14.339449 2103.557 0.9997046
post_bill_dep_average[16] 15.2021085 0.0011400 0.0868027 15.0329621 15.1436637 15.2013451 15.2623970 15.366788 5797.340 0.9998009
post_bill_dep_average[17] 16.2918790 0.0028711 0.1440525 16.0116603 16.1960674 16.2915556 16.3916611 16.572717 2517.291 1.0008868
post_bill_dep_average[18] 17.3816495 0.0053640 0.2297941 16.9304067 17.2285934 17.3810188 17.5365419 17.831044 1835.301 1.0010573
lp__ -157.5586590 0.0386916 1.5677566 -161.4834179 -158.3137637 -157.2486909 -156.4073253 -155.449299 1641.811 1.0031875

Plotting posterior predictions

Let’s plot the posterior credible ribbons for this regression

draws_avg_spp <- rstan::extract(normal_reg_spp_post, 
                                "post_bill_dep_average")$post_bill_dep_average

draws_obs_spp <- rstan::extract(normal_reg_spp_post, 
                                "post_bill_dep_obs")$post_bill_dep_obs

spp_levels <- levels(penguins$species)
spp_colors <- c("#66C2A5", "#FC8D62", "#8DA0CB")

n_pred     <- 6  # predictions per species
x_range_spp <- range(data_list_spp$pred_values)
y_range_spp <- range(penguins_no_NA$bill_depth_mm)

# --- Average response (in one plot) --- 
plot(0, 0, 
     type = "n",
     xlim = x_range_spp, ylim = y_range_spp,
     xlab = "Bill length (centered, mm)",
     ylab = "Bill depth (mm)",
     main = "Average response")

for (s in seq_along(spp_levels)) {
  idx <- ((s - 1) * n_pred + 1):(s * n_pred)
  x_s <- data_list_spp$pred_values[idx]
  q_s <- apply(draws_avg_spp[, idx],
               2,
               quantile,
               probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
  
  polygon(c(x_s, rev(x_s)), 
          c(q_s[1,], rev(q_s[5,])),
          col = adjustcolor(spp_colors[s], 0.15), 
          border = NA)
  
  polygon(c(x_s, rev(x_s)), 
          c(q_s[2,], rev(q_s[4,])),
          col = adjustcolor(spp_colors[s], 0.30),
          border = NA)
  
  lines(x_s, q_s[3,], col = spp_colors[s], lwd = 2)
  spp_pts <- penguins_no_NA[as.numeric(as.factor(penguins_no_NA$species)) == s, ]
  points(spp_pts$bill_length_center, spp_pts$bill_depth_mm,
         pch = 19, cex = 0.5, col = spp_colors[s])
}
legend("topright", legend = spp_levels, col = spp_colors, lwd = 2, bty = "n")
# --- Predicted observations (one panel per species) ---
par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
for (s in seq_along(spp_levels)) {
  idx <- ((s - 1) * n_pred + 1):(s * n_pred)
  x_s <- data_list_spp$pred_values[idx]
  q_s <- apply(draws_obs_spp[, idx], 2, quantile,
               probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
  spp_pts <- penguins_no_NA[as.numeric(as.factor(penguins_no_NA$species)) == s, ]
  
  plot(spp_pts$bill_length_center, 
       spp_pts$bill_depth_mm,
       xlim = x_range_spp, 
       ylim = y_range_spp,
       xlab = "Bill length (centered, mm)", 
       ylab = "Bill depth (mm)",
       main = spp_levels[s], 
       pch = 19, 
       cex = 0.5, 
       col = spp_colors[s])
  
  polygon(c(x_s, rev(x_s)), 
          c(q_s[1,], rev(q_s[5,])),
          col = adjustcolor(spp_colors[s], 0.15),
          border = NA)
  
  polygon(c(x_s, rev(x_s)), 
          c(q_s[2,], 
            rev(q_s[4,])),
          col = adjustcolor(spp_colors[s], 0.30), 
          border = NA)
  
  lines(x_s, q_s[3,], 
        col = spp_colors[s], 
        lwd = 2)
}

Exercise!

  1. We have one model without species identity as an independent variable, and one which includes species. Look at the difference in \(\sigma\) between these two models. Why did the value change?
  2. Posterior predictions Edit the generated quantities blocks of normal_regression.stan and normal_regression_spp.stan to create replicate observations for all N observations (that is, yrep as we did in the model of bill depth previously). Use this to compare the models using bayesplot::ppc_dens_overlay() or another function from the bayesplot package.