Fitting an intercept-only model

Where is the variation?

Variance partitioning with hierarchical models

Its very useful to have an idea of where the variation is in a dataset, as a guide to model building and future data collection. Before collecting information on independent variables that you think might explain variation – first find out how much variation there is!

Random intercept models give us a very handy way to investigate the variation in data: the so-called “intercept only” model. At this point in the course we now have all the tools needed to build such a model. This goal can be achieved in two steps:

  1. We build a model with no predictors at all, only a random intercept for every grouping variable in the data (e.g. species, sites, years, regions).
  2. Then we examine the relative magnitudes of the standard deviations to see which is reasonably larger or smaller.

Abundance of mites in different samples

We’re going to build such a model by extending the observation level random effect model from the previous Hierachy on the intercept exercise. Note that observation-level random effects are useful for Poisson models because they can take help with overdispersion.

Mathematical model

\[ \begin{align} \text{Abundance}_i &\sim \text{Poisson}(\lambda_i) \\ \log{\lambda_i} &\sim \mu + \beta_{\text{sample}[i]} + \beta_{\text{species[i]}} + \beta_i\\ \mu &\sim \text{Normal}(3, 1)\\ \beta_{\text{sample}} &\sim \text{Normal}(0, \sigma_{\text{samp}})\\ \beta_{\text{species}} &\sim \text{Normal}(0, \sigma_{\text{species}})\\ \beta_i &\sim \text{Normal}(0, \sigma_{\text{obs}}) \\ \sigma_{\text{samp}} &\sim \text{Exponential}(3)\\ \sigma_{\text{species}} &\sim \text{Exponential}(3)\\ \sigma_{\text{obs}} &\sim \text{Exponential}(3) \end{align} \]

Load packages and prepare data

library(rstan)
Loading required package: StanHeaders

rstan version 2.32.7 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())

data("mite", package = "vegan")
spp_names <- colnames(mite)
spp_names <- setNames(1:ncol(mite), colnames(mite))

mite_mat  <- as.matrix(mite)
mite_long <- data.frame(site_id = rep(seq_len(nrow(mite_mat)),
                                      times = ncol(mite_mat)),
                        spp = rep(colnames(mite_mat), 
                                  each = nrow(mite_mat)),
                        abd = as.vector(mite_mat))
mite_long$spp_id <- spp_names[mite_long$spp]

knitr::kable(head(mite_long))
site_id spp abd spp_id
1 Brachy 17 1
2 Brachy 2 1
3 Brachy 4 1
4 Brachy 23 1
5 Brachy 5 1
6 Brachy 19 1
spp_site_obs_intercepts <- stan_model(
  file = "topics/intercept_only/spp_site_obs_intercepts.stan",
  model_name = "spp_site_obs_intercepts")

spp_site_obs_intercepts
S4 class stanmodel 'spp_site_obs_intercepts' coded as follows:
data{
  int N;
  int N_spp;
  array[N] int<lower=1,upper=N_spp> spp_id;
  int N_sites;
  array[N] int<lower=1,upper=N_sites> site_id;
  array[N] int abd;
}
parameters{
  vector[N_spp] spp_effects;
  vector[N_sites] site_effects;
  vector[N] obs_effects;
  real mu;
  real<lower=0> sigma_spp;
  real<lower=0> sigma_sites;
  real<lower=0> sigma_obs;
}
model {
  abd ~ poisson_log(mu + spp_effects[spp_id] + site_effects[site_id] + obs_effects);
  spp_effects ~ normal(0, sigma_spp);
  site_effects ~ normal(0, sigma_sites);
  obs_effects ~ normal(0, sigma_obs);
  mu ~ normal(3, 1);
  sigma_spp ~ exponential(3);
  sigma_sites ~ exponential(3);
  sigma_obs ~ exponential(3);
} 

Now we can sample this model.

WarningWarning: irresponsible statistics

I’m sampling only 2 chains below, for illustration purposes and to make sure the code run in a reasonable amount of time. Use more chains in your research.

spp_site_obs_intercepts_samp <- sampling(spp_site_obs_intercepts,
                                         data = list(N = nrow(mite_long),
                                                     N_spp = max(mite_long$spp_id),
                                                     spp_id = mite_long$spp_id,
                                                     N_sites = max(mite_long$site_id),
                                                     site_id = mite_long$site_id,
                                                     abd = mite_long$abd),
                                         chains = 2,
                                         iter = 5000)
Warning

When working a model that takes a long time to run, saving the sampling in an .rds object using the saveRDS function can be a valuable time saver.

Exploring the model output

The output is large, so rather than showing all the parameters, I’m just going to count them.

nrow(summary(spp_site_obs_intercepts_samp)$summary)
[1] 2560
nrow(mite_long)
[1] 2450

We have fit many more parameters than observations! However, this model fits just fine, with no divergent iterations. These are one indication (not the only one!) that the model fit OK:

rstan::check_divergences(spp_site_obs_intercepts_samp)
0 of 5000 iterations ended with a divergence.

Let’s view the summary of only the standard deviations of the random effects (sigma):

## let's look at the variance components
summary(spp_site_obs_intercepts_samp,
        pars = c("sigma_spp", "sigma_sites", "sigma_obs"))$summary
                mean     se_mean         sd     2.5%       25%       50%
sigma_spp   1.510298 0.003641104 0.18485698 1.195419 1.3804024 1.4957883
sigma_sites 0.611614 0.001714294 0.07269443 0.482590 0.5604841 0.6069964
sigma_obs   1.487701 0.001482228 0.04308800 1.406660 1.4581799 1.4871572
                  75%    97.5%     n_eff      Rhat
sigma_spp   1.6212500 1.921567 2577.5411 1.0014901
sigma_sites 0.6583157 0.767820 1798.1742 0.9996911
sigma_obs   1.5156123 1.575220  845.0502 1.0000490

And we can plot them also:

bayesplot::mcmc_areas(spp_site_obs_intercepts_samp,
                      regex_pars ='sigma',
                      border_size = 0.5)

Standard deviations of the mite dataset

Alternatively we can pull out posterior samples and make this figure ourselves:

sigma_post <- rstan::extract(spp_site_obs_intercepts_samp,
  pars = c("sigma_spp", "sigma_sites", "sigma_obs"))

sigma_df <- data.frame(
  sigma_spp   = sigma_post$sigma_spp,
  sigma_sites = sigma_post$sigma_sites,
  sigma_obs   = sigma_post$sigma_obs)

med_order <- order(sapply(sigma_df, median))
sigma_ordered <- sigma_df[, med_order]
cols <- c("steelblue", "darkorange", "forestgreen")
dens_list <- lapply(sigma_ordered, density)
y_max <- max(sapply(dens_list, function(d) max(d$y))) * 1.1

plot(0,
     0, 
     type = "n",
     xlim = range(unlist(sigma_df)), 
     ylim = c(0, y_max),
     xlab = "Value",
     ylab = "Density",
     main = "Standard deviations of random effects")

for (i in seq_along(dens_list)) {
  d <- dens_list[[i]]
  
  polygon(c(d$x, rev(d$x)),
          c(d$y, rep(0, length(d$x))),
          col = adjustcolor(cols[i], 0.3),
          border = NA)
  
  lines(d$x, 
        d$y, 
        col = cols[i], 
        lwd = 2)
}

legend("topright",
       legend = colnames(sigma_ordered),
       col = cols,
       lwd = 2, 
       bty = "n")

This can be a useful guide to future model building – perhaps collecting data on species traits would help improve a model’s predictive power.

Calculate the posterior distribution of average abundance for each species

We can conceptualize this as a kind of “averaged rank abundance plot” for the species.

To do this we need to extract the average \(\mu\) and add it to the species effects, \(\beta_{\text{species}}\)

spp_post <- rstan::extract(spp_site_obs_intercepts_samp,
                           pars = c("mu", "spp_effects"))
# spp_avg_draws: (n_draws x n_spp) matrix of species log-averages
# sweep adds the mu vector (one value per draw) across all species columns
spp_avg_draws <- sweep(spp_post$spp_effects, 1, as.vector(spp_post$mu), "+")
n_spp     <- ncol(spp_avg_draws)
spp_med   <- apply(spp_avg_draws, 2, median)
spp_order <- order(spp_med)  # ascending order for log-scale slab

# --- Plot 1: density slab per species (log scale, sorted ascending) ---
plot(0, 0, type = "n",
     xlim = range(spp_avg_draws),
     ylim = c(0.5, n_spp + 0.5),
     xlab = "Species average (log scale)", ylab = "",
     yaxt = "n",
     main = "Species average abundances (log scale)")

for (i in seq_along(spp_order)) {
  s  <- spp_order[i]
  d  <- density(spp_avg_draws[, s])
  dsc <- d$y / max(d$y) * 0.45
  polygon(c(d$x, rev(d$x)),
          c(i + dsc, rep(i, length(d$x))),
          col = adjustcolor("steelblue", 0.4), border = "black", lwd = 0.3)
}

axis(2, at = seq_len(n_spp),
     labels = names(spp_names)[spp_order],
     las = 1, cex.axis = 0.55)
# --- Plot 2: point + interval per species (count scale, sorted descending) ---
spp_avg_count <- exp(spp_avg_draws)
spp_order_desc <- order(apply(spp_avg_count, 2, median), decreasing = TRUE)
spp_q <- apply(spp_avg_count[, spp_order_desc], 2,
               quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(0, 0, type = "n",
     xlim = c(0.5, n_spp + 0.5),
     ylim = c(0, max(spp_q[5, ])),
     xlab = "", ylab = "Average count",
     xaxt = "n",
     main = "Species average abundances (count scale)")

for (i in seq_len(n_spp)) {
  segments(i, spp_q[1, i], i, spp_q[5, i], col = "steelblue", lwd = 1)
  segments(i, spp_q[2, i], i, spp_q[4, i], col = "steelblue", lwd = 3)
  points(i, spp_q[3, i], pch = 19, col = "steelblue", cex = 0.8)
}

axis(1, at = seq_len(n_spp),
     labels = names(spp_names)[spp_order_desc],
     las = 2, cex.axis = 0.55)

Two visualizations of species average abundances, as measured by a random effects model which included random effects for site and site-species combinations. - species average abundances on the log scale - species average abundances on the observation scale.

Two visualizations of species average abundances, as measured by a random effects model which included random effects for site and site-species combinations. - species average abundances on the log scale - species average abundances on the observation scale.