Summarizing many univariate models

A secret weapon for when you’re building hierarchical models.

We’ve already looked at univariate models. When we fit the same model to multiple different groups, we don’t expect the same values for all the coefficients. Each thing we are studying will respond to the same variable in different ways.

Hierarchical models represent a way to model this variation, in ways that range from simple to complex.

Before we dive in with hierarchical structure, let’s build a bridge between these two approaches.

This is useful to help us understand what a hierarchical model does.

However it is also useful from a strict model-building perspective – so useful that Andrew Gelman calls it a “Secret Weapon”

Loading models and data

library(rstan)
Loading required package: StanHeaders

rstan version 2.32.7 (Stan version 2.32.2)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())

data(mite, package = "vegan")
data("mite.env", package = "vegan")
data("mite.xy", package = "vegan")

And some quick data restructuring to combine both.

# Combine data and environment
spp_cols_ce <- colnames(mite)
mite_combined_ce <- cbind(mite.env, mite)

mite_data_long <- do.call(rbind, lapply(spp_cols_ce, function(sp) {
  data.frame(WatrCont = mite_combined_ce$WatrCont,
             spp = sp,
             abd = mite_combined_ce[[sp]])
}))

rownames(mite_data_long) <- NULL

To keep things simple and univariate, let’s consider only water concentration as an independent variable.

First, a quick word about centring and scaling a predictor variable. In short, think of the biology of your system before transforming the data. Furthermore, make sure you understand how these transformation may influence the model parameters.

Here, I decide to centre the predictor by subtracting the mean. This changes the intercept of my linear predictor. It becomes the mean log-odds of occurrence when the water content is average.

I decided not to perform additional transformation on the water content because I still want to retain the units associated to this explanatory variable. This may change if I want to build a model that include multiple explanatory variables with different units. In this case, we may want to apply a transformation on the explanatory variables that removes the effect of the units.

mite_data_long_transformed <- mite_data_long
mite_data_long_transformed$presabs <- as.numeric(mite_data_long_transformed$abd > 0)
mite_data_long_transformed$water <- (mite_data_long_transformed$WatrCont -
                                     mean(mite_data_long_transformed$WatrCont))

u_spp_ce <- unique(mite_data_long_transformed$spp)

nc_ce <- 5
nr_ce <- ceiling(length(u_spp_ce) / nc_ce)

par(mfrow = c(nr_ce, nc_ce), 
    mar = c(2, 2, 1.5, 0.3))

for (sp in u_spp_ce) {
  df_sp <- mite_data_long_transformed[mite_data_long_transformed$spp == sp, ]
  
  plot(df_sp$water, 
       df_sp$presabs,
       pch = 19, 
       cex = 0.4, 
       col = adjustcolor("steelblue", 0.4),
       xlab = "",
       ylab = "", 
       main = sp, 
       cex.main = 0.7)
  
  fit_sp <- glm(presabs ~ water, 
                data = df_sp, 
                family = binomial)
  
  x_s    <- seq(min(df_sp$water), 
                max(df_sp$water), 
                length.out = 100)
  
  lines(x_s, predict(fit_sp, 
                     newdata = data.frame(water = x_s), 
                     type = "response"),
        col = "steelblue", 
        lwd = 1.5)
}

An important things to notice about this figure is that there is a ton of variation in how different species respond to water!

u_spp_glm <- unique(mite_data_long_transformed$spp)

mite_many_glms_list <- lapply(u_spp_glm, function(s) {
  df <- mite_data_long_transformed[mite_data_long_transformed$spp == s, ]
  glm(presabs ~ water, family = "binomial", data = df)
})

names(mite_many_glms_list) <- u_spp_glm

mite_many_glm_coefs <- do.call(rbind, lapply(u_spp_glm, function(s) {
  cf <- coef(summary(mite_many_glms_list[[s]]))
  data.frame(spp = s,
             term = rownames(cf),
             estimate = cf[, "Estimate"],
             std.error = cf[, "Std. Error"],
             stringsAsFactors = FALSE)
}))
rownames(mite_many_glm_coefs) <- NULL
NoteSplit-Apply-Combine

To explore this kind of thinking, we are going to use an approach sometimes called “split-apply-combine”

There are many possible ways to do this in practice. Above, we used lapply to carry out the split-apply-combine pattern. That is, we loop over species, fit one GLM per species, and collect the coefficients with coef(summary()).

int_df <- mite_many_glm_coefs[mite_many_glm_coefs$term == "(Intercept)", ]
slp_df <- mite_many_glm_coefs[mite_many_glm_coefs$term == "water", ]

par(mfrow = c(1, 2), mar = c(4, 5, 2, 0.5))

for (df_t in list(int_df, slp_df)) {
  xlim_t <- range(c(df_t$estimate - 2*df_t$std.error, 
                    df_t$estimate + 2*df_t$std.error))
  
  plot(df_t$estimate, 
       seq_along(df_t$spp),
       pch = 19, 
       yaxt = "n",
       xlim = xlim_t,
       xlab = "Estimate", 
       ylab = "",
       cex = 0.75,
       main = df_t$term[1])
  
  abline(v = 0, col = "red")
  
  segments(df_t$estimate - df_t$std.error, 
           seq_along(df_t$spp),
           df_t$estimate + df_t$std.error,
           seq_along(df_t$spp))
  
  axis(2, at = seq_along(df_t$spp),
       labels = df_t$spp, 
       las = 1, 
       cex.axis = 0.6)
}

As you can see, some of these estimates are high, others low. We could also plot these as histograms to see this distribution.

par(mfrow = c(1, 2))

hist(int_df$estimate, 
     xlab = "Estimate",
     main = "(Intercept)")

hist(slp_df$estimate, 
     xlab = "Estimate",
     main = "water")

Once again, the two parameters of this model represent :

  • Intercept The probability (in log-odds) of a species being present at the average water concentration. some species are common, others are rare.
  • water this is the change in probability (in log-odds) as water increases by one centilitre per litre of substrate.

Let’s do it in Stan

The above lapply approach is useful for quick exploration, but we can also do the same procedure in Stan.

all_species_unpooled <- stan_model(
  file = "topics/correlated_effects/all_species_unpooled.stan")

all_species_unpooled
S4 class stanmodel 'anon_model' coded as follows:
data {
  // number of rows in dataset
  int<lower=0> Nsites;
  // number of species
  int<lower=0> S;
  // one environmental variable to use in prediction
  vector[Nsites] x;
  // response site (rows) by species (columns) 2D array
  array[Nsites,S] int <lower=0,upper=1> y;
}
parameters {
  // parameters are now VECTORS
  vector[S] intercept;
  vector[S] slope;
}
model {
  for (s in 1:S){
    y[,s] ~ bernoulli_logit(intercept[s] + slope[s] * x);
  }
  // priors don't change because Stan is vectorized:
  // every element of the vector gets the same prior
  intercept ~ normal(0, 3);
  slope ~ normal(0, 3);
} 

Let’s fit this model by passing in the data:

mite_bin <- mite
mite_bin[mite_bin>0] <- 1

mite_pa_list <- list(Nsites = nrow(mite_bin),
                     S = ncol(mite_bin),
                     x = with(mite.env, 
                              (WatrCont - mean(WatrCont))),
                     y = as.matrix(mite_bin))

all_species_unpooled_posterior <- rstan::sampling(all_species_unpooled,
                                                  data = mite_pa_list,
                                                  refresh = 1000, 
                                                  chains = 4)

Now let’s try to plot this:

# start by looking at the names of variables

# Helper: draw per-species ribbon plots from a Stan fit
draw_spp_ribbons <- function(fit, water_seq = seq(-300, 400, length.out = 100),
                             spp_names = colnames(mite_bin),
                             fill_col = "green4",
                             add_data = FALSE) {
  
  draws_int <- rstan::extract(fit, "intercept")$intercept
  draws_slp <- rstan::extract(fit, "slope")$slope
  
  n_spp <- ncol(draws_int)
  nc <- 5
  nr <- ceiling(n_spp / nc)
  
  par(mfrow = c(nr, nc), 
      mar = c(2, 2, 1.5, 0.3))
  
  for (j in seq_len(n_spp)) {
    
    pm <- matrix(0, nrow(draws_int), length(water_seq))
    
    for (wi in seq_along(water_seq)){
      pm[, wi] <- plogis(draws_int[, j] + draws_slp[, j] * water_seq[wi])
    }
    
    q_pp <- apply(pm, 
                  2, 
                  quantile, 
                  probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
    
    plot(0,
         0, 
         type = "n", 
         xlim = range(water_seq), 
         ylim = c(-0.05, 1.05),
         xlab = "", 
         ylab = "", 
         main = spp_names[j], 
         cex.main = 0.8)
    
    polygon(c(water_seq, 
              rev(water_seq)), 
            c(q_pp[1,], 
              rev(q_pp[5,])),
            col = adjustcolor(fill_col, 0.15), 
            border = NA)
    
    polygon(c(water_seq,
              rev(water_seq)), 
            c(q_pp[2,], 
              rev(q_pp[4,])),
            col = adjustcolor(fill_col, 0.30),
            border = NA)
    
    lines(water_seq, 
          q_pp[3,], 
          col = fill_col, 
          lwd = 1.5)
    
    if (add_data) {
      sp_df <- mite_data_long_transformed[mite_data_long_transformed$spp == spp_names[j], ]
      
      colo <- col2rgb(fill_col)
      colPts <- rgb(colo[1]/255,
                    colo[2]/255,
                    colo[3]/255,
                    alpha = 0.3)
      
      points(sp_df$water, 
             sp_df$presabs,
             pch = 21, 
             cex = 0.5,
             bg = colPts,
             col = colPts)
    }
  }
}

draw_spp_ribbons(all_species_unpooled_posterior, 
                 fill_col = "green4",
                 add_data = TRUE)

TipEXERCISE
  1. Add hierarchy to both the slope and the intercept of this model. Note that the lme4 syntax would be y ~ 1 + water + (1 | species) + (0 + water | species)

  2. Make a plot of the slope coefficients from both models, with and without hierarchy. Are they the same? How are they different?

  3. Make a posterior prediction of species richness in these communities.

First we rewrite the Stan model from above, replacing the standard deviations with parameters. We do this for the priors on the intercepts and slopes separately.

all_species_partpooled <- stan_model(
  file = "topics/correlated_effects/all_species_partpooled.stan")

all_species_partpooled
S4 class stanmodel 'anon_model' coded as follows:
data {
  // number of rows in dataset
  int<lower=0> Nsites;
  // number of species
  int<lower=0> S;
  // one environmental variable to use in prediction
  vector[Nsites] x;
  // response site (rows) by species (columns) 2D array
  array[Nsites,S] int <lower=0,upper=1> y;
}
parameters {
  // parameters are now VECTORS
  vector[S] intercept;
  vector[S] slope;
  real<lower=0> sigma_intercept;
  real<lower=0> sigma_slope;
  real avg_intercept;
  real avg_slope;
}
model {
  for (s in 1:S){
    y[,s] ~ bernoulli_logit(intercept[s] + slope[s] * x);
  }
  // priors don't change because Stan is vectorized:
  // every element of the vector gets the same prior
  intercept ~ normal(avg_intercept, sigma_intercept);
  slope ~ normal(avg_slope, sigma_slope);
  avg_intercept ~ normal(0, 1);
  avg_slope ~ normal(0, 1);
  sigma_intercept ~ exponential(5);
  sigma_slope ~ exponential(5);

} 

Sample the model

all_species_partpooled_posterior <- sampling(all_species_partpooled,
                                             data = mite_pa_list,
                                             refresh = 1000, # What is reported
                                             chains = 4)

Plot posterior predictions of trend lines, just as before:

draw_spp_ribbons(all_species_partpooled_posterior,
                 fill_col = "red3",
                 add_data = TRUE)

Comparing slope estimates of both models

draws_slp_up <- rstan::extract(all_species_unpooled_posterior, 
                               "slope")$slope
draws_slp_pp <- rstan::extract(all_species_partpooled_posterior,
                               "slope")$slope
spp_names_ce <- colnames(mite_bin)

q_up <- apply(draws_slp_up, 
              2, 
              quantile,
              probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

q_pp <- apply(draws_slp_pp, 
              2, 
              quantile,
              probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

ord <- order(q_pp[3, ])
n_sp <- length(spp_names_ce)

plot(0,
     0, 
     type = "n",
     xlim = c(0.5, n_sp + 0.5), 
     ylim = range(c(q_up, q_pp)),
     xlab = "",
     ylab = "Slope (water)",
     xaxt = "n")

axis(1,
     at = seq_len(n_sp), 
     labels = spp_names_ce[ord], 
     las = 2, 
     cex.axis = 0.6)

abline(h = 0, 
       lty = 2, 
       col = "grey60")

for (j in seq_len(n_sp)) {
  oj <- ord[j]
  for (xoff in list(list(x = j - 0.15, q = q_up[, oj], col = "steelblue"),
                    list(x = j + 0.15, q = q_pp[, oj], col = "orange"))) {
    
    segments(xoff$x, 
             xoff$q[1], 
             xoff$x, 
             xoff$q[5], 
             col = xoff$col, 
             lwd = 1)
    
    segments(xoff$x, 
             xoff$q[2],
             xoff$x, 
             xoff$q[4], 
             col = xoff$col,
             lwd = 3)
    
    points(xoff$x,
           xoff$q[3],
           pch = 19,
           col = xoff$col)
  }
}

legend("topleft",
       legend = c("non-hierarchical", "hierarchical"),
       col = c("steelblue", 
               "orange"), 
       pch = 19)

Posterior distribution of species richness

We can always calculate things out of the posterior distribution, and get a new distribution which reflects the uncertainty in all our parameter estimates. Here I’m suggesting we calculate the relationship between species richness and water concentration

draws_int_pp_S <- rstan::extract(all_species_partpooled_posterior, 
                                 "intercept")$intercept

draws_slp_pp_S <- rstan::extract(all_species_partpooled_posterior, 
                                 "slope")$slope

water_seq_S <- seq(-8, 6, length.out = 10)
idx_S <- sample(nrow(draws_int_pp_S), 
                min(700, 
                    nrow(draws_int_pp_S)))

S_mat <- matrix(0, 
                length(idx_S), 
                length(water_seq_S))

for (ki in seq_along(idx_S)) {
  k <- idx_S[ki]
  for (wi in seq_along(water_seq_S))
    S_mat[ki, wi] <- sum(plogis(draws_int_pp_S[k, ] + draws_slp_pp_S[k, ] *
                                  water_seq_S[wi]))
}

q_S <- apply(S_mat, 
             2, 
             quantile, 
             probs = c(0.025, 0.25, 0.5, 0.75, 0.975))

plot(0,
     0,
     type = "n",
     xlim = range(water_seq_S), 
     ylim = range(q_S),
     xlab = "Water (centered)",
     ylab = "Expected species richness")

polygon(c(water_seq_S, rev(water_seq_S)), 
        c(q_S[1,], rev(q_S[5,])),
        col = adjustcolor("steelblue",
                          0.15),
        border = NA)

polygon(c(water_seq_S, rev(water_seq_S)), 
        c(q_S[2,], rev(q_S[4,])),
        col = adjustcolor("steelblue", 0.30),
        border = NA)

lines(water_seq_S, 
      q_S[3,],
      col = "steelblue", 
      lwd = 2)