Models with one level of hierarchy

Some of these things are somewhat like the others.

TipBayesian workflow
  1. Visualize your data
  2. Decide on your model structure
  3. Simulate from the model to understand it
  4. Fit the model to the data
  5. Plot model predictions to evaluate the fit / draw conclusions

Today’s goal is to look at a couple of different model structures that we saw yesterday.

Load packages and data

library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':

    penguins, penguins_raw
library(rstan)
Loading required package: StanHeaders

rstan version 2.32.7 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())

Gaussian random intercepts: Penguin body mass

Are populations of penguins on different islands different in their body mass?

The Palmer penguins are found on three different islands. Let’s look at the distribution of body mass of each species on each island.

Plot the data

First, let’s do a bit of data cleaning and preparation. For this section we’ll drop NA values in the predictor1. Following, we’ll create a new variable out of the union of the species and island names2

We’ll also change the units of body mass to kilograms. You can always transform a variable to more sensible units!

penguin_mass_island <- na.omit(penguins[, c("species",
                                            "island", 
                                            "body_mass_g")])
penguin_mass_island$sp_island <- paste(penguin_mass_island$species,
                                       penguin_mass_island$island, 
                                       sep = "_")

penguin_mass_island$mass_kg <- penguin_mass_island$body_mass_g / 1000

knitr::kable(head(penguin_mass_island[, c("sp_island", 
                                          "mass_kg")]))
sp_island mass_kg
Adelie_Torgersen 3.750
Adelie_Torgersen 3.800
Adelie_Torgersen 3.250
Adelie_Torgersen 3.450
Adelie_Torgersen 3.650
Adelie_Torgersen 3.625

Let’s visualize the distribution of body sizes for these species+island combinations:

spi_groups <- unique(penguin_mass_island$sp_island)

spi_cols <- palette.colors(length(spi_groups),
                           palette = "Dark2")

names(spi_cols) <- spi_groups

plot(0, 
     0, 
     type = "n",
     xlim = range(penguin_mass_island$mass_kg),
     ylim = c(0.5, length(spi_groups) + 0.5),
     xlab = "Mass in kg",
     ylab = "", 
     yaxt = "n")

for (i in seq_along(spi_groups)) {
  pts <- penguin_mass_island$mass_kg[penguin_mass_island$sp_island == spi_groups[i]]
  points(pts, 
         jitter(rep(i, length(pts)),
                factor = 0.4),
         col = spi_cols[i],
         pch = 19, 
         cex = 0.7)
}

axis(2, 
     at = seq_along(spi_groups),
     labels = spi_groups, 
     las = 1, 
     cex.axis = 0.8)

Observations of the mass of penguins on five different species-island combinations.

It’s always good to ask some questions about the dataset. Here is a classic one: are the sample sizes equal among the species-island combinations?

knitr::kable(as.data.frame(table(sp_island = penguin_mass_island$sp_island)))
sp_island Freq
Adelie_Biscoe 44
Adelie_Dream 56
Adelie_Torgersen 51
Chinstrap_Dream 68
Gentoo_Biscoe 123

Decide on a model structure

We’ll begin by fitting a model that assumes that body size for each of these five groups is completely independent:

\[ \begin{align} \text{Body mass}_i &\sim \text{Normal}(\mu_i, \sigma_{\text{obs}}) \\ \mu_i &= \bar\beta + \beta_{\text{group}[i]} \\ \bar\beta &\sim \text{Normal}(5, 2) \\ \beta_{\text{group}} &\sim \text{Normal}(0, 1) \\ \sigma_{\text{obs}} &\sim \text{Exponential}(.5) \end{align} \]

Here the subscript \(i\) is just counting the row of the dataset, and \(\text{group}[i]\) means the group (species+island) to which row number \(i\) belongs.

Simulate to understand this model

Here’s a little trick to get group indexes (numbers) from a character vector:

group_names <- unique(penguin_mass_island$sp_island)
group_numbers <- seq_along(group_names)
names(group_numbers) <- group_names

group_numbers
Adelie_Torgersen    Adelie_Biscoe     Adelie_Dream    Gentoo_Biscoe 
               1                2                3                4 
 Chinstrap_Dream 
               5 
penguin_groupid <- penguin_mass_island
penguin_groupid$group_id <- group_numbers[penguin_groupid$sp_island]

head(penguin_groupid)
# A tibble: 6 × 6
  species island    body_mass_g sp_island        mass_kg group_id
  <fct>   <fct>           <int> <chr>              <dbl>    <int>
1 Adelie  Torgersen        3750 Adelie_Torgersen    3.75        1
2 Adelie  Torgersen        3800 Adelie_Torgersen    3.8         1
3 Adelie  Torgersen        3250 Adelie_Torgersen    3.25        1
4 Adelie  Torgersen        3450 Adelie_Torgersen    3.45        1
5 Adelie  Torgersen        3650 Adelie_Torgersen    3.65        1
6 Adelie  Torgersen        3625 Adelie_Torgersen    3.62        1

As you can see, we’re now set upwith the names and the indexes we need.

Now we can simulate data and plot it:

ngroup <- length(group_numbers)
overall_mean <- rnorm(1, mean = 5, sd = 2)
group_diffs <- rnorm(n = ngroup, mean = 0, sd = 1)
sigma_obs <- rexp(1, .5)

penguin_pred_obs <- penguin_groupid
penguin_pred_obs$fake_mass_avg <- overall_mean + group_diffs[penguin_pred_obs$group_id]
penguin_pred_obs$fake_mass_obs <- rnorm(nrow(penguin_pred_obs),
                                        mean = penguin_pred_obs$fake_mass_avg,
                                        sd = sigma_obs)

plot(0, 0, type = "n",
     xlim = range(penguin_pred_obs$fake_mass_obs),
     ylim = c(0.5, length(spi_groups) + 0.5),
     xlab = "Simulated mass (kg)", ylab = "", yaxt = "n")
for (i in seq_along(spi_groups)) {
  pts <- penguin_pred_obs$fake_mass_obs[penguin_pred_obs$sp_island == spi_groups[i]]
  points(pts,
         jitter(rep(i, length(pts)), 
                factor = 0.4),
         col = spi_cols[i],
         pch = 19, 
         cex = 0.7)
}
axis(2, 
     at = seq_along(spi_groups),
     labels = spi_groups, 
     las = 1, 
     cex.axis = 0.8)

TipEXERCISE

Run the above code a few times, trying different prior values.

Write it in Stan

fixed_groups <- stan_model(file = here::here("topics/03_one_random_effect/fixed_groups.stan"),
                           model_name = "fixed_groups")

fixed_groups
S4 class stanmodel 'fixed_groups' coded as follows:
data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0> Ngroup;
  array[N] int<lower=0, upper=Ngroup> group_id;
}
parameters {
  real b_avg;
  vector[Ngroup] b_group;
  real<lower=0> sigma;
}
model {
  y ~ normal(b_avg + b_group[group_id], sigma);
  b_group ~ std_normal();
  b_avg ~ normal(5, 2);
  sigma ~ exponential(.5);
}
generated quantities {

  vector[Ngroup] group_averages;

  for (k in 1:Ngroup){
    group_averages[k] = b_avg + b_group[k];
  }

  // predict making one new observation per group
  vector[Ngroup] one_obs_per_group;

  for (k in 1:Ngroup) {
    one_obs_per_group[k] = normal_rng(group_averages[k], sigma);
  }
} 

Fit the model

Fitting the model requires arranging the data first, and creating a list which we then feed into Stan.

peng_group_list <- with(penguin_groupid,
         list(N = length(mass_kg),
              y = mass_kg,
              Ngroup = max(group_id),
              group_id = group_id))

fixed_groups_samples <- sampling(fixed_groups,
                                 data = peng_group_list,
                                 refresh = 0L)

Plot predictions to evaluate results

Let’s begin by plotting the averages for each group.

draws_grp <- rstan::extract(fixed_groups_samples,
                            "group_averages")$group_averages
q_grp <- apply(draws_grp, 
               2, 
               quantile,
               probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

spi_labels <- names(group_numbers)

plot(0, 0, type = "n",
     xlim = range(c(q_grp, penguin_mass_island$mass_kg)),
     ylim = c(0.5, length(spi_labels) + 0.5),
     xlab = "Mass (kg)", ylab = "", yaxt = "n",
     main = "Group averages (fixed effects)")

for (i in seq_along(spi_labels)) {
  pts <- penguin_mass_island$mass_kg[penguin_mass_island$sp_island == spi_labels[i]]
  points(pts, 
         jitter(rep(i, length(pts)),factor = 0.4),
         col = adjustcolor(spi_cols[i], 0.5), pch = 21)
  
  segments(q_grp[1, i], i, q_grp[5, i], i, col = "black", lwd = 1)
  segments(q_grp[2, i], i, q_grp[4, i], i, col = "black", lwd = 3)
  points(q_grp[3, i], i, pch = 19, col = "black", cex = 1.2)
}
axis(2, 
     at = seq_along(spi_labels), 
     labels = spi_labels, 
     las = 1, 
     cex.axis = 0.8)

Some things to notice about the code above:

  • I’m using my named vector group_numbers to re-create the column sp_island. This is my technique for making sure I always use the correct label, but you can do this any way you want.
  • We use quantile() on the posterior draws to compute the 2.5%, 25%, 50%, 75%, and 97.5% quantiles, then draw them with segments() (thin line for the 95% credible interval, thick line for the 50% interval) and points() for the median — the base R equivalent of stat_pointinterval().
  • We add points from the original data with a for loop calling points() with vertical jitter(), but no horizontal noise.
TipEXERCISE: plot posterior predictions of observations

Repeat the exercise above using the value of one_obs_per_group. Why are the results different? What additional error is included in these predictions?

draws_obs_grp <- rstan::extract(fixed_groups_samples,
                                "one_obs_per_group")$one_obs_per_group
q_obs_grp     <- apply(draws_obs_grp, 
                       2, 
                       quantile, 
                       probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(0, 
     0, 
     type = "n",
     xlim = range(c(q_obs_grp, penguin_groupid$mass_kg)),
     ylim = c(0.5, length(group_names) + 0.5),
     xlab = "Mass (kg)", ylab = "", yaxt = "n",
     main = "Predicted observations (fixed effects)")

for (i in seq_along(group_names)) {
  pts <- penguin_groupid$mass_kg[penguin_groupid$sp_island == group_names[i]]
  points(pts,
         jitter(rep(i, length(pts)), factor = 0.7),
         col = adjustcolor(spi_cols[i], 0.3),
         pch = 19, 
         cex = 0.5)
  
  segments(q_obs_grp[1, i], i, q_obs_grp[5, i], i, col = "black", lwd = 1)
  segments(q_obs_grp[2, i], i, q_obs_grp[4, i], i, col = "black", lwd = 3)
  points(q_obs_grp[3, i], i, pch = 19, col = "black", cex = 1.2)
}

axis(2, 
     at = seq_along(group_names), 
     labels = group_names, 
     las = 1,
     cex.axis = 0.8)

Make it hierarchical

Math

\[ \begin{align} \text{Body mass}_i &\sim \text{Normal}(\mu_i, \sigma_{\text{obs}}) \\ \mu_i &= \bar\beta + \beta_{\text{group}[i]} \\ \bar\beta &\sim \text{Normal}(5, 2) \\ \beta_{\text{group}} &\sim \text{Normal}(0, 1) \\ \sigma_{\text{obs}} &\sim \text{Exponential}(.5) \end{align} \]

\[ \begin{align} \text{Body mass}_i &\sim \text{Normal}(\mu_i, \sigma_{\text{obs}}) \\ \mu_i &= \bar\beta + \beta_{\text{group}[i]} \\ \bar\beta &\sim \text{Normal}(5, 2) \\ \beta_{\text{group}} &\sim \text{Normal}(0, \sigma_{\text{sp}}) \\ \sigma_{\text{obs}} &\sim \text{Exponential}(0.5) \\ \sigma_{\text{sp}} &\sim \text{Exponential}(1) \end{align} \]

Simulation of a hierarchical model

TipEXERCISE

Simulate from the model above. Base your approach on the code for simulation the non-hierarchical version. Remember to simulate the standard deviation of group differences

ngroup <- length(group_numbers)
overall_mean <- rnorm(1, mean = 5, sd = 2)
sigma_group <- rexp(1, 0.1)
group_diffs <- rnorm(n = ngroup, 
                     mean = 0, 
                     sd = sigma_group)
sigma_obs <- rexp(1, 0.5)

penguin_pred_obs <- penguin_groupid
penguin_pred_obs$fake_mass_avg <- overall_mean + group_diffs[penguin_pred_obs$group_id]
penguin_pred_obs$fake_mass_obs <- rnorm(nrow(penguin_pred_obs),
                                        mean = penguin_pred_obs$fake_mass_avg,
                                        sd = sigma_obs)

plot(0, 0, type = "n",
     xlim = range(penguin_pred_obs$fake_mass_obs),
     ylim = c(0.5, length(spi_groups) + 0.5),
     xlab = "Simulated mass (kg)", 
     ylab = "", 
     yaxt = "n")

for (i in seq_along(spi_groups)) {
  pts <- penguin_pred_obs$fake_mass_obs[penguin_pred_obs$sp_island == spi_groups[i]]
  points(pts, 
         jitter(rep(i, length(pts)),
                factor = 0.4),
         col = spi_cols[i], 
         pch = 19, 
         cex = 0.7)
}

axis(2,
     at = seq_along(spi_groups), 
     labels = spi_groups, 
     las = 1, 
     cex.axis = 0.8)

Stan

Below I’m comparing the two Stan programs side-by-side. Compare them to the models above!

hierarchical_groups <- stan_model(
  file = "topics/03_one_random_effect/hierarchical_groups.stan")
fixed_groups
S4 class stanmodel 'fixed_groups' coded as follows:
data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0> Ngroup;
  array[N] int<lower=0, upper=Ngroup> group_id;
}
parameters {
  real b_avg;
  vector[Ngroup] b_group;
  real<lower=0> sigma;
}
model {
  y ~ normal(b_avg + b_group[group_id], sigma);
  b_group ~ std_normal();
  b_avg ~ normal(5, 2);
  sigma ~ exponential(.5);
}
generated quantities {

  vector[Ngroup] group_averages;

  for (k in 1:Ngroup){
    group_averages[k] = b_avg + b_group[k];
  }

  // predict making one new observation per group
  vector[Ngroup] one_obs_per_group;

  for (k in 1:Ngroup) {
    one_obs_per_group[k] = normal_rng(group_averages[k], sigma);
  }
} 
hierarchical_groups
S4 class stanmodel 'anon_model' coded as follows:
data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0> Ngroup;
  array[N] int<lower=0, upper=Ngroup> group_id;
}
parameters {
  real b_avg;
  vector[Ngroup] b_group;
  real<lower=0> sigma_obs;
  real<lower=0> sigma_grp;
}
model {
  y ~ normal(b_avg + b_group[group_id], sigma_obs);
  b_group ~ normal(0, sigma_grp);
  b_avg ~ normal(5, 2);
  sigma_obs ~ exponential(.5);
  sigma_grp ~ exponential(1);
}
generated quantities {

  vector[Ngroup] group_averages;

  for (k in 1:Ngroup){
    group_averages[k] = b_avg + b_group[k];
  }

  // predict making one new observation per group
  vector[Ngroup] one_obs_per_group;

  for (k in 1:Ngroup) {
    one_obs_per_group[k] = normal_rng(group_averages[k], sigma_obs);
  }

  // difference for a new group
  real new_b_group = normal_rng(0, sigma_grp);

  // observations from that new group
  real one_obs_new_group = normal_rng(b_avg + new_b_group, sigma_obs);

} 
hierarchical_groups_samples <- sampling(hierarchical_groups,
  data = peng_group_list, refresh = 0)
hierarchical_groups_samples
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                      mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff
b_avg                 3.99    0.02 0.36  3.23  3.79  4.00  4.19  4.70   488
b_group[1]           -0.28    0.02 0.37 -1.00 -0.49 -0.28 -0.09  0.49   503
b_group[2]           -0.28    0.02 0.37 -1.02 -0.49 -0.28 -0.07  0.47   495
b_group[3]           -0.30    0.02 0.37 -1.02 -0.51 -0.30 -0.09  0.45   507
b_group[4]            1.08    0.02 0.36  0.37  0.87  1.08  1.28  1.84   485
b_group[5]           -0.26    0.02 0.37 -0.99 -0.46 -0.26 -0.06  0.50   493
sigma_obs             0.46    0.00 0.02  0.43  0.45  0.46  0.48  0.50  1866
sigma_grp             0.76    0.01 0.32  0.38  0.55  0.69  0.89  1.60   905
group_averages[1]     3.71    0.00 0.07  3.58  3.67  3.71  3.75  3.84  4777
group_averages[2]     3.71    0.00 0.07  3.58  3.66  3.71  3.76  3.85  3953
group_averages[3]     3.69    0.00 0.06  3.57  3.65  3.69  3.73  3.82  4153
group_averages[4]     5.07    0.00 0.04  4.99  5.04  5.07  5.10  5.16  4835
group_averages[5]     3.73    0.00 0.06  3.62  3.70  3.74  3.77  3.84  4608
one_obs_per_group[1]  3.70    0.01 0.47  2.77  3.38  3.69  4.01  4.64  3948
one_obs_per_group[2]  3.70    0.01 0.47  2.76  3.38  3.71  4.02  4.65  3882
one_obs_per_group[3]  3.69    0.01 0.47  2.75  3.38  3.70  4.01  4.62  3788
one_obs_per_group[4]  5.07    0.01 0.46  4.17  4.76  5.07  5.38  5.99  4207
one_obs_per_group[5]  3.72    0.01 0.47  2.79  3.41  3.72  4.04  4.63  3958
new_b_group           0.01    0.01 0.82 -1.61 -0.45 -0.01  0.46  1.80  3970
one_obs_new_group     4.01    0.02 1.01  2.05  3.38  3.99  4.61  6.16  2263
lp__                 88.40    0.07 2.13 83.35 87.18 88.77 89.96 91.55   970
                     Rhat
b_avg                1.01
b_group[1]           1.01
b_group[2]           1.01
b_group[3]           1.01
b_group[4]           1.01
b_group[5]           1.01
sigma_obs            1.00
sigma_grp            1.00
group_averages[1]    1.00
group_averages[2]    1.00
group_averages[3]    1.00
group_averages[4]    1.00
group_averages[5]    1.00
one_obs_per_group[1] 1.00
one_obs_per_group[2] 1.00
one_obs_per_group[3] 1.00
one_obs_per_group[4] 1.00
one_obs_per_group[5] 1.00
new_b_group          1.00
one_obs_new_group    1.00
lp__                 1.00

Samples were drawn using NUTS(diag_e) at Wed May 27 15:40:37 2026.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
draws_bg  <- rstan::extract(hierarchical_groups_samples, "b_group")$b_group
draws_new <- rstan::extract(hierarchical_groups_samples, "new_b_group")$new_b_group

all_labels <- c(group_names, "New Group")
n_all <- length(all_labels)
q_bg <- apply(draws_bg, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
q_new <- quantile(draws_new, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
q_all <- cbind(q_bg, q_new)

plot(0, 0, type = "n",
     xlim = range(q_all),
     ylim = c(0.5, n_all + 0.5),
     xlab = "Group effect", ylab = "", yaxt = "n",
     main = "Group effects (hierarchical)")

for (i in seq_len(n_all)) {
  segments(q_all[1, i], i, q_all[5, i], i, lwd = 1, col = "steelblue")
  segments(q_all[2, i], i, q_all[4, i], i, lwd = 3, col = "steelblue")
  points(q_all[3, i], i, pch = 19, col = "steelblue")
}

axis(2, at = seq_len(n_all), labels = all_labels, las = 1, cex.axis = 0.8)

draws_oopg <- rstan::extract(hierarchical_groups_samples, 
                             "one_obs_per_group")$one_obs_per_group
draws_ong  <- rstan::extract(hierarchical_groups_samples,
                             "one_obs_new_group")$one_obs_new_group

q_oopg <- apply(draws_oopg, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
q_ong  <- quantile(draws_ong, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
q_oall <- cbind(q_oopg, q_ong)

plot(0, 0, type = "n",
     xlim = range(c(q_oall, penguin_groupid$mass_kg)),
     ylim = c(0.5, n_all + 0.5),
     xlab = "Mass (kg)", ylab = "", yaxt = "n",
     main = "Predicted observations (hierarchical)")

for (i in seq_len(n_all)) {
  if (i <= length(group_names)) {
    pts <- penguin_groupid$mass_kg[penguin_groupid$sp_island == group_names[i]]
    points(pts, 
           jitter(rep(i, length(pts)),
                  factor = 0.5),
           col = adjustcolor(spi_cols[i], 0.5), 
           pch = 19,
           cex = 0.5)
  }
  segments(q_oall[1, i], i, q_oall[5, i], i, lwd = 1, col = "black")
  segments(q_oall[2, i], i, q_oall[4, i], i, lwd = 3, col = "black")
  points(q_oall[3, i], i, pch = 19, col = "black")
}
axis(2,
     at = seq_len(n_all), 
     labels = all_labels,
     las = 1, 
     cex.axis = 0.8)

Exercises

  1. Try leaving out a group and refitting the hierarchical model. Are the predictions for the missing group accurate?
  2. There are other categorical predictors in the dataset. Try including year as a part of the group-creating factor (i.e. in the call to unite() above). What changes?
  3. Modify the generated quantities block to simulate a fake observation for every row of the dataset. This opens the possibility of using bayesplot to make predictions. Look back at the code from Day 1 and create a posterior predictive check for both models. (e.g. using the ppc_dens_overlay function)
  4. We could perhaps have used sex as a grouping factor, but sex has missing values in it! Why is this a problem for this kind of model? What would it take to address that? (Discussion only; missing values are unfortunately outside the scope of the class!)

Observation-level random effects: Mite abundance

Biological question?

Let’s write a model to answer the question:

How does the total abundance of the mite community change as water content increases?

Express this in Math

Here’s a partially complete model for species richness over time

\[ \begin{align} \text{S}_i &\sim \text{Poisson}(e^a) \\ a &= \bar\beta + \beta_{\text{water}} \cdot \text{water}_i \\ \bar\beta &\sim \text{Normal}(?, ?) \\ \beta_{\text{water}} &\sim \text{Normal}(?, ?) \\ \end{align} \]

TipEXERCISE

Simulate from this model, and look at your simulations to decide on a reasonable prior for the data.

n <- 30
water <- seq(from = -5, to = 5, length.out = n)

b0 <- rnorm(1, mean = log(17), sd = .3)
b1 <- rnorm(1, mean = 0, sd = .2)

S <- rpois(n, lambda = exp(b0 + b1*water))
plot(water, S)

Data preparation & visualization

First we need to load and prepare the data:

data(mite, package = "vegan")
data("mite.env", package = "vegan")

# Combine data and environment
mite_df <- mite
mite_df$site_id <- rownames(mite)
mite_df <- cbind(mite_df, mite.env)

spp_cols_names <- colnames(mite)

mite_data_long <- do.call(rbind, lapply(spp_cols_names, function(sp) {
  data.frame(site_id = mite_df$site_id,
             WatrCont = mite_df$WatrCont,
             spp = sp,
             abd = mite_df[[sp]])
}))

mite_data_long <- mite_data_long[order(mite_data_long$site_id), ]
rownames(mite_data_long) <- NULL

First let’s transform the mite dataset into a data frame of total community abundance (N) per site. We’ll also standardize the water content while we’re at it:

mite_community_abd <- aggregate(abd ~ site_id + WatrCont,
                                data = mite_data_long, FUN = sum)

names(mite_community_abd)[names(mite_community_abd) == "abd"] <- "N"

mite_community_abd$water_c <- (mite_community_abd$WatrCont -
                               mean(mite_community_abd$WatrCont)) / 100

knitr::kable(head(mite_community_abd))
site_id WatrCont N water_c
11 134.13 216 -2.765057
27 145.28 173 -2.653557
20 145.68 184 -2.649557
21 184.04 117 -2.265957
22 185.89 172 -2.247457
5 204.13 199 -2.065057

We get a nice histogram of community abundance, and a clear negative relationship with water volume:

hist(mite_community_abd$N,
     xlab = "Community abundance (N)",
     main = "Histogram of N")
plot(mite_community_abd$water_c, mite_community_abd$N,
     xlab = "Water content (centered)",
     ylab = "Community abundance (N)",
     pch = 19, cex = 0.7)

Write the model in Stan and estimate it

This model will have a structure very similar to previous ones we’ve seen. The main difference is the use of the poisson_log function:

poisson_regression <- stan_model(
  file = "topics/03_one_random_effect/poisson_regression.stan",
  model_name = "poisson_regression")

poisson_regression
S4 class stanmodel 'poisson_regression' coded as follows:
data {
  int<lower=0> N;
  vector[N] water;
  array[N] int y;
  // for prediction
  int<lower=0> Npred;
  vector[Npred] water_pred;
}
parameters {
  real b_avg;
  real b_water;
}
model {
  y ~ poisson_log(b_avg + b_water * water);
  b_water ~ normal(0, .2);
  b_avg ~ normal(0, .3);
}
generated quantities {
  array[N] int fake_obs;
  for (i in 1:N){
    fake_obs[i] = poisson_log_rng(b_avg + b_water * water[i]);
  }

  // confidence interval for the line
  vector[Npred] line_avg;
  line_avg = exp(b_avg + b_water * water_pred);

  // prediction interval for the line
  array[Npred] int line_obs;
  for (j in 1:Npred){
    line_obs[j] = poisson_rng(line_avg[j]);
  }
} 
NoteBuilt in Stan functions

Stan has a lot of built-in functions to facilitate modelling. For example the log link function is so common that there is a specialized likelihood function just for that, poisson_log. And there’s an even more efficient version called poisson_log_glm. I didn’t use it here because I wanted the code to closely match the equations. You can read about these functions (and much more!) in the amazing Stan manual

Using a now-familiar workflow, we feed the data into the model:

water_for_pred <- seq(from = -3, to = 4.5, length.out = 15)

abd_data_list <- list(N = length(mite_community_abd$N),
              water = mite_community_abd$water_c,
              y = mite_community_abd$N,
              Npred = 15,
              water_pred = water_for_pred)

poisson_regression_sample <- sampling(
  poisson_regression,
  data = abd_data_list,
  refresh = 0)

Plot the model to see if it fits well

draws_lo <- rstan::extract(poisson_regression_sample, "line_obs")$line_obs
q_lo <- apply(draws_lo, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(mite_community_abd$water_c, mite_community_abd$N,
     pch = 19, cex = 0.7,
     xlab = "Water content (centered)", ylab = "Community abundance (N)",
     main = "Poisson model — prediction interval")

polygon(c(water_for_pred, rev(water_for_pred)),
        c(q_lo[1,], rev(q_lo[5,])),
        col = adjustcolor("forestgreen", 0.15), border = NA)

polygon(c(water_for_pred, rev(water_for_pred)),
        c(q_lo[2,], rev(q_lo[4,])),
        col = adjustcolor("forestgreen", 0.30), border = NA)

lines(water_for_pred, q_lo[3,], col = "forestgreen", lwd = 2)
fake_obs_S <- rstan::extract(poisson_regression_sample, pars = "fake_obs")

bayesplot::ppc_dens_overlay(y = mite_community_abd$N,
                            yrep = head(fake_obs_S$fake_obs, 50))

Remember, on the left we are plotting the Prediction interval here: it’s showing the distribution of probable observations according to the model. Notice that the model predicts much narrower variation than we really find!

On the right hand side we have the posterior predictive check, which once again shows that the model is overconfident and predicts a range of observations that are far too narrow.

EXERCISE

  1. Discuss with your neighbours: Would you trust this model? Would you publish it? The technical name for this phenomenon is “overdispersion”. Have you checked for this in previous count models you’ve done?

  2. Add a random effect for every individual observation in the model. Begin by writing the mathematical notation for this new model!

  3. Fit the model and re-create the two figures above. What do you notice?

  • Which model is more trustworthy?
  • look at the slope in the new model. Is it different?

Mathematical notation

\[ \begin{align} \text{S}_i &\sim \text{Poisson}(e^a) \\ a &= \bar\beta + \beta_{\text{water}} \cdot \text{water}_i + \text{site}_i\\ \bar\beta &\sim \text{Normal}(?, ?) \\ \beta_{\text{water}} &\sim \text{Normal}(?, ?) \\ \text{site} &\sim \text{Normal}(?, \sigma) \\ \sigma &\sim \text{Exponential}(?) \end{align} \]

Prior predictive checks – Important ! Do not skip !

prior_b0 <- rnorm(1, 4.5, .2)
prior_b1 <- rnorm(1, 0, .2)

mite_community_abd$avg_prior <- exp(prior_b0 + prior_b1 * mite_community_abd$water_c)

plot(mite_community_abd$water_c,
     mite_community_abd$N,
     pch = 19, cex = 0.7,
     xlab = "Water content (centered)",
     ylab = "Community abundance (N)")

lines(mite_community_abd$water_c,
      mite_community_abd$avg_prior,
      type = "l", 
      lwd = 3, 
      col = "blue")

legend("topleft",
       col = "blue",
       lty = 1, 
       lwd = 3, 
       legend = "Expected N")

Stan code

Write the model

poisson_regression_overdisp <- stan_model(
  file = "topics/03_one_random_effect/poisson_regression_overdisp.stan",
  model_name = "poisson_regression_overdisp")

poisson_regression_overdisp
S4 class stanmodel 'poisson_regression_overdisp' coded as follows:
data {
  int<lower=0> N;
  vector[N] water;
  array[N] int y;
  // for prediction
  int<lower=0> Npred;
  vector[Npred] water_pred;
}
parameters {
  real b_avg;
  real b_water;
  real<lower=0> sigma_site;
  vector[N] site_intercepts;
}
model {
  b_avg ~ normal(0, .3);
  b_water ~ normal(0, .2);
  site_intercepts ~ normal(0, sigma_site);
  sigma_site ~ exponential(.5);
  y ~ poisson_log(b_avg + b_water * water + site_intercepts);
}
generated quantities {
  array[N] int fake_obs;
  for (i in 1:N){
    fake_obs[i] = poisson_log_rng(b_avg + b_water * water[i] + site_intercepts[i]);
  }

  // confidence interval for the line

  // prediction interval for the line
  vector[Npred] new_site_intercepts;
  array[Npred] int line_obs;
  for (j in 1:Npred){
    new_site_intercepts[j] = normal_rng(0, sigma_site);
  }

  vector[Npred] line_avg;
  line_avg = exp(b_avg + b_water * water_pred + new_site_intercepts);

  for(k in 1:Npred){
    line_obs[k] = poisson_rng(line_avg[k]);
  }
} 

Sample it

poisson_regression_overdisp_sample <- sampling(poisson_regression_overdisp,
                                               data = abd_data_list, 
                                               refresh = 0,iter = 20000)
Warning in validityMethod(object): The following variables have undefined
values: fake_obs[1],The following variables have undefined values:
fake_obs[2],The following variables have undefined values: fake_obs[3],The
following variables have undefined values: fake_obs[4],The following variables
have undefined values: fake_obs[5],The following variables have undefined
values: fake_obs[6],The following variables have undefined values:
fake_obs[7],The following variables have undefined values: fake_obs[8],The
following variables have undefined values: fake_obs[9],The following variables
have undefined values: fake_obs[10],The following variables have undefined
values: fake_obs[11],The following variables have undefined values:
fake_obs[12],The following variables have undefined values: fake_obs[13],The
following variables have undefined values: fake_obs[14],The following variables
have undefined values: fake_obs[15],The following variables have undefined
values: fake_obs[16],The following variables have undefined values:
fake_obs[17],The following variables have undefined values: fake_obs[18],The
following variables have undefined values: fake_obs[19],The following variables
have undefined values: fake_obs[20],The following variables have undefined
values: fake_obs[21],The following variables have undefined values:
fake_obs[22],The following variables have undefined values: fake_obs[23],The
following variables have undefined values: fake_obs[24],The following variables
have undefined values: fake_obs[25],The following variables have undefined
values: fake_obs[26],The following variables have undefined values:
fake_obs[27],The following variables have undefined values: fake_obs[28],The
following variables have undefined values: fake_obs[29],The following variables
have undefined values: fake_obs[30],The following variables have undefined
values: fake_obs[31],The following variables have undefined values:
fake_obs[32],The following variables have undefined values: fake_obs[33],The
following variables have undefined values: fake_obs[34],The following variables
have undefined values: fake_obs[35],The following variables have undefined
values: fake_obs[36],The following variables have undefined values:
fake_obs[37],The following variables have undefined values: fake_obs[38],The
following variables have undefined values: fake_obs[39],The following variables
have undefined values: fake_obs[40],The following variables have undefined
values: fake_obs[41],The following variables have undefined values:
fake_obs[42],The following variables have undefined values: fake_obs[43],The
following variables have undefined values: fake_obs[44],The following variables
have undefined values: fake_obs[45],The following variables have undefined
values: fake_obs[46],The following variables have undefined values:
fake_obs[47],The following variables have undefined values: fake_obs[48],The
following variables have undefined values: fake_obs[49],The following variables
have undefined values: fake_obs[50],The following variables have undefined
values: fake_obs[51],The following variables have undefined values:
fake_obs[52],The following variables have undefined values: fake_obs[53],The
following variables have undefined values: fake_obs[54],The following variables
have undefined values: fake_obs[55],The following variables have undefined
values: fake_obs[56],The following variables have undefined values:
fake_obs[57],The following variables have undefined values: fake_obs[58],The
following variables have undefined values: fake_obs[59],The following variables
have undefined values: fake_obs[60],The following variables have undefined
values: fake_obs[61],The following variables have undefined values:
fake_obs[62],The following variables have undefined values: fake_obs[63],The
following variables have undefined values: fake_obs[64],The following variables
have undefined values: fake_obs[65],The following variables have undefined
values: fake_obs[66],The following variables have undefined values:
fake_obs[67],The following variables have undefined values: fake_obs[68],The
following variables have undefined values: fake_obs[69],The following variables
have undefined values: fake_obs[70],The following variables have undefined
values: new_site_intercepts[1],The following variables have undefined values:
new_site_intercepts[2],The following variables have undefined values:
new_site_intercepts[3],The following variables have undefined values:
new_site_intercepts[4],The following variables have undefined values:
new_site_intercepts[5],The following variables have undefined values:
new_site_intercepts[6],The following variables have undefined values:
new_site_intercepts[7],The following variables have undefined values:
new_site_intercepts[8],The following variables have undefined values:
new_site_intercepts[9],The following variables have undefined values:
new_site_intercepts[10],The following variables have undefined values:
new_site_intercepts[11],The following variables have undefined values:
new_site_intercepts[12],The following variables have undefined values:
new_site_intercepts[13],The following variables have undefined values:
new_site_intercepts[14],The following variables have undefined values:
new_site_intercepts[15],The following variables have undefined values:
line_obs[1],The following variables have undefined values: line_obs[2],The
following variables have undefined values: line_obs[3],The following variables
have undefined values: line_obs[4],The following variables have undefined
values: line_obs[5],The following variables have undefined values:
line_obs[6],The following variables have undefined values: line_obs[7],The
following variables have undefined values: line_obs[8],The following variables
have undefined values: line_obs[9],The following variables have undefined
values: line_obs[10],The following variables have undefined values:
line_obs[11],The following variables have undefined values: line_obs[12],The
following variables have undefined values: line_obs[13],The following variables
have undefined values: line_obs[14],The following variables have undefined
values: line_obs[15],The following variables have undefined values:
line_avg[1],The following variables have undefined values: line_avg[2],The
following variables have undefined values: line_avg[3],The following variables
have undefined values: line_avg[4],The following variables have undefined
values: line_avg[5],The following variables have undefined values:
line_avg[6],The following variables have undefined values: line_avg[7],The
following variables have undefined values: line_avg[8],The following variables
have undefined values: line_avg[9],The following variables have undefined
values: line_avg[10],The following variables have undefined values:
line_avg[11],The following variables have undefined values: line_avg[12],The
following variables have undefined values: line_avg[13],The following variables
have undefined values: line_avg[14],The following variables have undefined
values: line_avg[15]. Many subsequent functions will not work correctly.
#|
fake_obs_overdisp_S <- rstan::extract(poisson_regression_overdisp_sample,
                                      pars = "fake_obs")

bayesplot::ppc_dens_overlay(y = mite_community_abd$N,
                            yrep = head(fake_obs_overdisp_S$fake_obs, 50))
draws_la  <- rstan::extract(poisson_regression_overdisp_sample, "line_avg")$line_avg
q_la      <- apply(draws_la, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(mite_community_abd$water_c, mite_community_abd$N,
     pch = 19, cex = 0.7,
     xlab = "Water content (centered)", ylab = "Community abundance (N)",
     main = "Overdispersed model — mean trend")
polygon(c(water_for_pred, rev(water_for_pred)),
        c(q_la[1,], rev(q_la[5,])),
        col = adjustcolor("steelblue", 0.15), border = NA)
polygon(c(water_for_pred, rev(water_for_pred)),
        c(q_la[2,], rev(q_la[4,])),
        col = adjustcolor("steelblue", 0.30), border = NA)
lines(water_for_pred, q_la[3,], col = "steelblue", lwd = 2)

NoteBonus exercises

Compare the slope estimates between the two poisson models above (possibly in the same figure). What happened here?

Another great way to model overdispersion is via the Negative Binomial distribution. Look at the Stan documentation for neg_binomial_2_log and adapt your model to use it (don’t forget to drop the random effect when you do!).

EXPERIMENTAL: replace the normal distribution for the random observation-level effect with the Student t distribution, to help with that one extreme data point! how does the slope change?

Footnotes

  1. dropping NA values is not always the best idea! I’m doing it here to create a focused example. Bayesian gives us other options for working with missing values, including modelling them directly.↩︎

  2. again, this is slightly contrived for the sake of making a clean example! Normally you’d most likely treat these two variables separately.↩︎