Models with one level of hierarchy

Some of these things are somewhat like the others.
Bayesian workflow

1. Visualize your data
2. Decide on your model structure
3. Simulate from the model to understand it
4. Fit the model to the data
5. Plot model predictions to evaluate the fit / draw conclusions
Today’s goal is to look at a couple of different model structures that we saw yesterday.
```r
library(tidyverse)
library(cmdstanr)
```

```
This is cmdstanr version 0.7.1
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /home/andrew/software/cmdstan
- CmdStan version: 2.34.1
```

```r
library(tidybayes)
library(palmerpenguins)
```
Gaussian random intercepts: Penguin body mass
Are populations of penguins on different islands different in their body mass?
The Palmer penguins are found on three different islands. Let’s look at the distribution of body mass of each species on each island.
Plot the data
```r
penguin_mass_island <- penguins |>
  select(species, island, body_mass_g) |>
  drop_na(body_mass_g) |>
  unite(sp_island, species, island) |>
  ## change the units from grams to kilograms
  mutate(mass_kg = body_mass_g / 1000)
```
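Now we can plot the mass distribution for each species-island group:

```r
penguin_mass_island |>
  ggplot(aes(y = sp_island,
             x = mass_kg,
             colour = sp_island)) +
  geom_jitter(alpha = 0.8, height = 0.1, width = 0) +
  scale_color_brewer(palette = "Dark2")
```

Stan needs a numeric index for each group. Here’s a little trick to get group indexes (numbers) from a character vector:

```r
group_names <- unique(penguin_mass_island$sp_island)
group_numbers <- seq_along(group_names)
names(group_numbers) <- group_names

penguin_groupid <- penguin_mass_island |>
  mutate(group_id = group_numbers[sp_island])
```

With the model translated into Stan (compiled here as `fixed_groups`), we assemble the data list and sample:

```r
fixed_groups <- cmdstan_model(
  stan_file = "topics/03_one_random_effect/fixed_groups.stan")

peng_group_list <- with(penguin_groupid,
                        list(N = length(mass_kg),
                             y = mass_kg,
                             Ngroup = max(group_id),
                             group_id = group_id))

fixed_groups_samples <- fixed_groups$sample(data = peng_group_list,
                                            refresh = 0,
                                            parallel_chains = 4)
```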
CmdStan prints its progress as it samples. The informational message below occurred only sporadically, so (as the message itself explains) the sampler is fine:

```
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/tmp/RtmpH4tcI4/model-4700c1ddd09d5.stan', line 13, column 2 to column 47)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Chain 1 finished in 0.4 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.
Chain 4 finished in 0.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 0.4 seconds.
```
Plot predictions to evaluate results
Let’s begin by plotting the averages for each group.
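We can extract the posterior of each group’s average with `tidybayes::gather_rvars()` and plot it over the raw data (`group_averages` is the vector of group means defined in the Stan program):

```r
fixed_groups_samples |>
  gather_rvars(group_averages[group_id]) |>
  mutate(sp_island = names(group_numbers)[group_id]) |>
  ggplot(aes(y = sp_island, dist = .value)) +
  stat_pointinterval() +
  geom_jitter(data = penguin_mass_island,
              aes(y = sp_island, x = mass_kg, colour = sp_island),
              pch = 21, inherit.aes = FALSE,
              alpha = 0.8, height = 0.1, width = 0) +
  scale_colour_brewer(palette = "Dark2")
```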
Some things to notice about the code above:

- I’m using my named vector `group_numbers` to re-create the column `sp_island`. This is my technique for making sure I always use the correct label, but you can do this any way you want.
- We use `tidybayes::stat_pointinterval()` to summarize the posterior distribution.
- We’re adding points from the original data (`penguin_mass_island`) with `geom_jitter()`. We’re adding noise vertically to make the visualization clearer, but not adding any horizontal noise.
EXERCISE: plot posterior predictions of observations
Repeat the exercise above using the value of `one_obs_per_group`. Why are the results different? What additional error is included in these predictions?
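One possible solution, reusing the plotting pattern above (`one_obs_per_group` is a simulated observation for each group, drawn in the Stan program’s generated quantities block):

```r
fixed_groups_samples |>
  tidybayes::gather_rvars(one_obs_per_group[group_id]) |>
  mutate(sp_island = group_names[group_id]) |>
  ggplot(aes(y = sp_island, dist = .value, colour = sp_island)) +
  stat_pointinterval(colour = "black") +
  geom_jitter(aes(y = sp_island, x = mass_kg, colour = sp_island),
              inherit.aes = FALSE, alpha = .2,
              data = penguin_groupid, height = .2, width = 0) +
  scale_colour_brewer(palette = "Dark2")
```

These intervals are wider than the group averages because they also include the observation-level error, σ_obs.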
Simulate from the model above. Base your approach on the code for simulating the non-hierarchical version. Remember to simulate one additional number: the standard deviation of group differences.
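A solution sketch: compared to the non-hierarchical simulation, the only new step is drawing `sigma_group` before the group differences:

```r
ngroup <- length(group_numbers)
overall_mean <- rnorm(1, mean = 5, sd = 2)
sigma_group <- rexp(1, .1)   # NEW: standard deviation of group differences
group_diffs <- rnorm(n = ngroup, mean = 0, sd = sigma_group)
sigma_obs <- rexp(1, .5)

penguin_pred_obs <- penguin_groupid |>
  mutate(fake_mass_avg = overall_mean + group_diffs[group_id],
         fake_mass_obs = rnorm(length(fake_mass_avg),
                               mean = fake_mass_avg,
                               sd = sigma_obs))
```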
Fitting the hierarchical version of the model produces similar output:

```
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/tmp/RtmpWhWKuf/model-473001618f1b5.stan', line 15, column 2 to column 33)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Chain 1 finished in 0.4 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.
Chain 4 finished in 0.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.4 seconds.
Total execution time: 0.4 seconds.
```
1. Try leaving out a group and refitting the hierarchical model. Are the predictions for the missing group accurate?
2. There are other categorical predictors in the dataset. Try including `year` as part of the group-creating factor (i.e. in the call to `unite()` above). What changes?
3. Modify the `generated quantities` block to simulate a fake observation for EVERY row of the dataset. This opens the possibility of using `bayesplot` for posterior predictive checks. Look back at the code from Day 1 and create a posterior predictive check for both models, e.g. using `ppc_dens_overlay` (see the sketch after this list).
4. We could perhaps have used `sex` as a grouping factor, but `sex` has missing values in it! Why is this a problem for this kind of model? What would it take to address that? (Discussion only; missing values are unfortunately outside the scope of the class!)
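For exercise 3, a minimal sketch of the `bayesplot` pattern, assuming your new `generated quantities` block stores one simulated mass per row of the data in a vector named `fake_obs` (a hypothetical name):

```r
# extract the simulated observations as a draws-by-observation matrix
fake_obs <- fixed_groups_samples$draws(variables = "fake_obs")
fake_obs_matrix <- posterior::as_draws_matrix(fake_obs)

# compare 50 posterior draws against the observed masses
bayesplot::ppc_dens_overlay(y = penguin_groupid$mass_kg,
                            yrep = head(fake_obs_matrix, 50))
```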
Observation-level random effects: Mite abundance
What is the question?
Let’s write a model to answer the question:
How does the total abundance of the mite community change as water content increases?
Express this in Math
Here’s a partially complete model for total community abundance as a function of water content:

$$
\begin{align}
S_i &\sim \text{Poisson}(e^{a_i}) \\
a_i &= \bar\beta + \beta_{\text{water}} \cdot \text{water}_i \\
\bar\beta &\sim \text{Normal}(?, ?) \\
\beta_{\text{water}} &\sim \text{Normal}(?, ?)
\end{align}
$$
Simulate from this model, and look at your simulations to decide on a reasonable prior for the data.
SOLUTION
```r
n <- 30
water <- seq(from = -5, to = 5, length.out = n)
b0 <- rnorm(1, mean = log(17), sd = .3)
b1 <- rnorm(1, mean = 0, sd = .2)
S <- rpois(n, lambda = exp(b0 + b1 * water))
plot(water, S)
```
Data preparation & visualization
First we need to load and prepare the data:
```r
data(mite, package = "vegan")
data("mite.env", package = "vegan")

# combine data and environment
mite_data_long <- mite |>
  tibble::rownames_to_column(var = "site_id") |>
  bind_cols(mite.env) |>
  pivot_longer(Brachy:Trimalc2,
               names_to = "spp", values_to = "abd")
```
Next, let’s transform the mite dataset into a dataframe of total community abundance (N) per site. We’ll also center and rescale the water content while we’re at it:
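```r
mite_community_abd <- mite_data_long |>
  group_by(site_id, WatrCont) |>
  summarize(N = sum(abd)) |>
  ungroup() |>
  mutate(water_c = (WatrCont - mean(WatrCont)) / 100)
```

Dividing the centered water content by 100 just puts the predictor on a convenient scale for the slope.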
Remember, the left panel shows the prediction interval: the distribution of probable observations according to the model. Notice that the model predicts much narrower variation than we really find!
The right panel shows the posterior predictive check, which once again shows that the model is overconfident and predicts a range of observations that is far too narrow.
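For reference, here is the code behind those two panels. `line_obs` and `fake_obs` are quantities simulated in the Stan program’s generated quantities block, `water_for_pred` is the grid of water values used for predictions, and `poisson_regression_sample` is the fitted Poisson regression:

```r
# Left panel: prediction interval across the water gradient
poisson_regression_sample |>
  tidybayes::gather_rvars(line_obs[i]) |>
  mutate(water = water_for_pred) |>
  ggplot(aes(x = water, dist = .value)) +
  stat_lineribbon() +
  geom_point(aes(x = water_c, y = N),
             data = mite_community_abd, inherit.aes = FALSE) +
  scale_fill_brewer(palette = "Greens")

# Right panel: posterior predictive check
fake_obs_S <- poisson_regression_sample$draws(variables = "fake_obs")
fake_obs_S_matrix <- posterior::as_draws_matrix(fake_obs_S)
bayesplot::ppc_dens_overlay(y = mite_community_abd$N,
                            yrep = head(fake_obs_S_matrix, 50))
```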
EXERCISE
1. Discuss with your neighbours: would you trust this model? Would you publish it? The technical name for this phenomenon is “overdispersion”. Have you checked for this in previous count models you’ve done?
2. Add a random effect for every individual observation in the model. Begin by writing the mathematical notation for this new model (one possible notation is sketched below)!
3. Fit the model and re-create the two figures above. What do you notice?
   - Which model is more trustworthy?
   - Look at the slope in the new model. Is it different?
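One possible notation for this new model, adding an independent “site” effect for every observation:

$$
\begin{align}
S_i &\sim \text{Poisson}(e^{a_i}) \\
a_i &= \bar\beta + \beta_{\text{water}} \cdot \text{water}_i + \text{site}_i \\
\bar\beta &\sim \text{Normal}(?, ?) \\
\beta_{\text{water}} &\sim \text{Normal}(?, ?) \\
\text{site}_i &\sim \text{Normal}(0, \sigma) \\
\sigma &\sim \text{Exponential}(?)
\end{align}
$$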
Fitting this observation-level random effect model, CmdStan again reports a few benign warnings:

```
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/tmp/RtmpIhvutV/model-471231bcd7800.stan', line 18, column 2 to column 42)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Chain 1 finished in 0.3 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.

Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/tmp/RtmpIhvutV/model-471231bcd7800.stan', line 18, column 2 to column 42)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Chain 4 finished in 0.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 1.3 seconds.
```
Another great way to model overdispersion is via the Negative Binomial distribution. Look at the Stan documentation for neg_binomial_2_log and adapt your model to use it (don’t forget to drop the random effect when you do!).
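To see what the negative binomial buys you before touching Stan, here is a hypothetical prior simulation in R (the prior on the dispersion is purely illustrative): `rnbinom()`’s `mu`/`size` parametrization matches Stan’s `neg_binomial_2(mu, phi)`, so `size` plays the role of the dispersion parameter `phi`, and the variance is `mu + mu^2/phi` rather than the Poisson’s `mu`.

```r
n <- 30
water <- seq(from = -5, to = 5, length.out = n)
b0 <- rnorm(1, mean = log(17), sd = .3)
b1 <- rnorm(1, mean = 0, sd = .2)
phi <- rexp(1, rate = 0.1)  # hypothetical prior on the dispersion parameter

# smaller phi = more overdispersion; as phi grows this approaches the Poisson
N_sim <- rnbinom(n, mu = exp(b0 + b1 * water), size = phi)
plot(water, N_sim)
```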