Introduction to Stan and simulation

Repeating yesterday’s exercise, with the programming language Stan.

library(rstan)
library(ggplot2)
rstan_options("auto_write" = TRUE)
TipGoals of this lesson
  1. Introduce the idea of a generative model
  2. Using generated data to validate a model
  3. Stan model syntax
  4. Outline of a Bayesian workflow

The process

To practice our first models in Stan, we’ll being with an imaginary example based on yesterday’s activity. We’ll imagine we’re conducting a particularly pleasant study: counting birds!

Question How many birds will each person in the class find? For the purposes of this example, let’s say there are 22 people.

  1. We’re going to count birds, so we’ll have count data: a number that is either 0 or some positive, integer number
  2. We’ll make a simplifying assumption: everybody has the same chance of seeing a bird (i.e. no differences in skill or equipment), and everyone in the class is an independent observer (i.e. nobody is working in pairs, etc.)
  3. Everyone in the class makes only one count, so we have 22 numbers.

We’re Bayesian, so we need to write a probability distribution for all the possible values.

\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Uniform}(0, 60) \end{align} \]

A quick note about notation for models like these:

  • We use a subscript \(i\) to indicate the “label” for each observation in our dataset. You can think of this as the row number of the data spreadsheet, and imagine sliding your finger down the column of measurements, modelling each value in turn.
  • Usually we’ll use more general language, such as \(y_i\). But for this simple example I wanted to make things as explicit as possible.
  • Notice the symbol \(\sim\). This is read as “distributed as”, and indicates the probability distribution from which the values might come. When the values we’re talking about are data that we can observe (in this case, number of birds counted), we call the distribution the likelihood. When the value is something we can’t observe (in this case, the average count \(\lambda\)) we call the distribution the prior.
Warning

We’ll be talking about better ways to model count data in a later exercise! For now, I’m using the Uniform distribution for simplicity. It’s not usually a very good choice!

Simulation in R

Before starting to work on real data, we are going to begin by learning how to generate data by simulation. There are at least three reasons why this is a good idea:

  1. Understand your priors.. For most interesting models in ecology, you will not be able to pick good numbers for your prior parameters just by thinking hard, simply because parameters for a distribution are not always intuitive. Should the prior defining the number seen by an average person in the group be \(\text{Uniform}(0, 60)\) ? Or should it be narrower (e.g. \(\text{Uniform}(0, 29)\)) ? Or broader (e.g. \(\text{Uniform}(0, 100)\))? As we’ll see, simulation will demystify the process.
  2. Validate your model. Bayesian models are great because they can create datasets by simulation. This suggests a very minimum requirement we might have for a statistical model: use known parameters and a model to generate data, then fit that same model to the very data it generated, and see if we get back something close to those known parameter values.
  3. Test your understanding. Perhaps most importantly, simulation helps you to test your own intuition. If you can simulate data from your model, then you really understand it! If you can’t, then you don’t know quite how it works yet. It’s rare1 that a biologist will fail to learn something by simulating a dataset.

Simple exercise in simulation

Let’s imagine that today, we are taking a hike as a group in this beautiful Natural Reserve. What is the number of birds (total abundance of ALL species) each of us is going to see on our hike?

Some questions to ask about simulated data

  1. What kind of observations are you going to make? Do they have a minimum or maximum value? Are they integers, or are they decimal numbers, or something else?
  2. Where do the numbers come from? This could be anything, from simple linear approximations (i.e. the models we’re looking at in this course) to ordinary differential equations, mathematical models, generalized additive models, etc.
  3. How many observations will we be making?

One of the most useful particularities of Bayesian models is that they are generative: they can be used to make a simulated dataset. We’ll do that now for our bird example.

Let’s simulate from a Poisson distribution:

set.seed(525600)
n_people <- 22
avg_birds_per_person <- runif(1, min = 0, max = 60)
bird_count <- rpois(n_people, lambda = avg_birds_per_person)

Some things to note in the code above:

Every statistical distribution that is in R (which is a lot! almost all! ) has four different functions. If the distribution is called distr, then they are:

  • rdistr : Draw random numbers from the distribution
  • qdistr : The quantile function – which value gives a certain proportion of the distribution?
  • pdistr : The probability density function – which proportion of the distribution is below a certain value?
  • ddistr : The density function – draws the “shape” of a distribution. How probable are specific values?

The other thing to note is that there are TWO simulation steps here: 1. Simulating a value of the average (\(\lambda\)). 2. Simulating observations.

In our model, the Uniform distribution was referred to as the prior, and the Poisson distribution was referred to as a likelihood, but here you can see that they are very nearly the same thing: just statements about what distribution of values might be most consistent with the data.

Plotting the result

Let’s take a look at our simulated values:

hist(bird_count, col = "lightblue", xlim = c(0, 50), main = "Histogram of bird count")

Histogram of simulated counts of birds

This is pretty great, and represents one possible realization of sampling. However, one sample isn’t enough to tell us about what our \(\text{Uniform}(0, 60)\) prior really means.

TipEXERCISE

Try to make many different simulations (say, 12 simulations). This represents 12 different repeats of the whole process: draw a value from the uniform prior, THEN draw a value from the Poisson distribution. Visualize them any way you want! (the worked example below presents one way to do this)

set.seed(42)

# Simulate birds
simulate_some_birds <- function() {
  lambda <- runif(1, min = 0, max = 60)
  data.frame(obs = rpois(22, lambda = lambda))
}

# Carry out the simulations 12 times 
simul_list <- replicate(12, simulate_some_birds())

# Define common breaks
simul <- unlist(simul_list)
br <- seq(min(simul), max(simul), length = 30)

# Figure of the results 
par(mfrow = c(3,4), 
    mar = c(2.5,2.5,0.1,0.1), 
    oma = c(2,2,0,0))

histRes <- vector("list", length = 12)
for(i in 1:12) {
  histRes[[i]] <- hist(simul_list[[i]],
                       breaks = br, 
                       plot = FALSE)
}

ymax <- max(sapply(histRes, function(x) x$counts))

for(i in 1:12) {
  hist(simul_list[[i]],
       breaks = br,
       xlab = "", 
       ylab = "",
       main = "",
       ylim = c(0,ymax),
       bty = "L")
}

mtext("Number of birds observed per person",
      side = 1,
      outer = TRUE)
mtext("Frequency",
      side = 2,
      outer = TRUE)

Twelve different simulations of a possible bird dataset. Do all of these seem plausible?

This figure shows different simulations of what, according to our prior, might be reasonable datasets for us to study. Do any of them seem implausible to you? If so, try changing the prior. The goal is to make fake datasets that seem plausible, but which still include the possibility of some surprising observations.

When you have a prior that generates observations that cover a range of scientifically reasonable values, then you are ready to move on to fitting real data.

However before we actually do that, let’s do the whole thing again: this time in Stan.

Simulating data in Stan

Let’s look back at the equation:

\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Uniform}(0, 60) \end{align} \]

And then translate it into Stan:

poisson_simulation <- rstan::stan_model(
  file = "topics/01_simulation/poisson_simulation.stan", 
  model_name = "poisson_simulation",
  save_dso = TRUE
  )

poisson_simulation
S4 class stanmodel 'poisson_simulation' coded as follows:
data {
  int<lower=0> n_people;
}
generated quantities {
  real<lower=0> avg_birds_per_person;
  // an array -- like a list in R
  array[n_people] int<lower=0> bird_count;

  // simulate averages
  avg_birds_per_person = uniform_rng(0, 60);
  // simulate observations with that average
  for (i in 1:n_people){
    bird_count[i] = poisson_rng(avg_birds_per_person);
  }
} 

What you see just above is not R, but is the first Stan program we will see in this course. Stan code is written in a text file, which I’ve read in here just to display it for you.

When you sample the model, as we’ll do later, Stan samples the posterior distribution using Hamiltonian Monte Carlo.

Before we run it, let’s look at the parts of a Stan model:

Parts of a Stan model

This Stan program has two parts. Each part is separated with curly braces, {}. The are they data block and the generated quantities block:

data {
  int<lower=0> n_people;
}

And the generated quantities block.

generated quantities {
  real<lower=0> avg_observed;
  
  // an array -- like a list in R
  array[n_people] int<lower=0> bird_count;
  
  // simulate averages
  avg_birds_per_person = uniform_rng(0, 60);
  
  // simulate observations with that average
  for (i in 1:n_people){
    bird_count[i] = poisson_rng(avg_birds_per_person);
  }
}
TipEXERCISE

Let’s look at similarities and differences to the procedure in R. Try to find a few similarities and differences between R code and Stan code!

Similarities:

  • We have a random number generating function for each of our distributions. In R, these were called runif and rpois, here they are uniform_rng and poisson_rng.
  • Once again, the only thing we need to provide is n_people, the number of observers we have

Differences:

  • Every line ends with a semicolon ;
  • In Stan, the name of a variable is on the RIGHT of a line, while in R it’s on the left.
  • We need to use a for-loop to generate random variables.
  • Note the syntax for creating an array of integers. Arrays in Stan are a little like lists in R: they can hold any other kind of object, and are of a predefined length.

Sampling in Stan

# We connect the data in R to the model in Stan using a named list
poisson_simulation_datalist <- list(n_people = 22)

poisson_simulation_samp <- rstan::sampling(poisson_simulation,
                                           data = poisson_simulation_datalist, 
                                           refresh = 0,
                                           algorithm = "Fixed_param")

# refresh : Controls how often the progress of the sampling is reported
# algorithm : Required only because no parameters are estimated

This generates a large number of simulated datasets – the default is 4000 datasets! Each time the model samples, it draws a new value for the unobserved average (avg_birds_per_person) and for the number of birds seen by each person.

Let’s look at a 12 of these datasets and visualize them.

# Extract the simulations
pois_simul <- rstan::extract(poisson_simulation_samp,permuted = TRUE)

# Organize the simulated data
bird_count_simul <- pois_simul$bird_count
avg_birds_per_person_simul <- pois_simul$avg_birds_per_person

Here, bird_count_simul includes all the simulations for each iterations while avg_birds_per_person is a scalar defining the mean (and variance!) of the Poisson “model”.

Let’s see what we get from the Stan output by looking at 12 randomly drawn datasets from bird_count_simul and the associated values in avg_birds_per_person.

draws <- sample(4000,12)
bird_count_draws <- bird_count_simul[draws,]
avg_birds_per_person_draws <- avg_birds_per_person_simul[draws]

Let’s take a look at some of these simulations using histograms.

# Define common breaks
br <- seq(min(bird_count_draws),
          max(bird_count_draws),
          length = 30)

# Figure of the results 
par(mfrow = c(3,4), 
    mar = c(2.5,2.5,0.1,0.1), 
    oma = c(2,2,0,0))

histSimul <- vector("list", length = 12)
for(i in 1:12) {
  histSimul[[i]] <- hist(bird_count_draws[i,],
                       breaks = br, 
                       plot = FALSE)
}

ymax <- max(sapply(histSimul, function(x) x$counts))

for(i in 1:12) {
  hist(bird_count_draws[i,],
       breaks = br,
       xlab = "", 
       ylab = "",
       main = "",
       ylim = c(0,ymax),
       bty = "L",
       col = "orange",
       border = FALSE)

  abline(v = avg_birds_per_person_draws[i], 
         col = "darkgreen", 
         lwd = 3)
}

mtext("Number of birds observed per person",
      side = 1,
      outer = TRUE)
mtext("Frequency",
      side = 2,
      outer = TRUE)

Prior simulations of bird counts. Green bars are the mean for a particular simulation, and orange histograms show the distribution of observations around this mean.

Parameter recovery

Let’s go back and look at the fake datasets we created in R

avg_birds_per_person
[1] 34.25577
bird_count
 [1] 42 24 37 22 28 27 30 39 37 29 38 32 34 33 34 46 34 29 47 37 34 29

and let’s see if we can recapture the only known parameter, avg_birds_per_person, which is equal to 34.255774.

We’ll do it first in R, using the function fitdistr from the MASS package:

MASS::fitdistr(bird_count, 
               densfun = "Poisson")
    lambda  
  33.727273 
 ( 1.238167)

This could also be done with glm

bird_glm <- glm(bird_count ~ 1, family = "poisson")
exp(coef(bird_glm))
(Intercept) 
   33.72727 

You can see that in all cases we are getting close to the value of avg_birds_per_person, which in these simulations is the true value.

Sampling the posterior distribution in Stan

Time for the HMC Slides!

We will be doing a lot of Stan models this week, and we will begin by replicating the above GLM in Stan.

poisson_model <- rstan::stan_model(
  file = "topics/01_simulation/poisson_model.stan",
  model_name = "poisson_model")

poisson_model
S4 class stanmodel 'poisson_model' coded as follows:
data {
  int<lower=0> n_people;
  array[n_people] int<lower=0> bird_count_observed;
}
parameters {
  real<lower=0> avg_birds_per_person;
}
model {
  bird_count_observed ~ poisson(avg_birds_per_person);
  avg_birds_per_person ~ uniform(0, 60);
}
generated quantities {
  // an array -- like a list in R
  array[n_people] int<lower=0> bird_count;

  // simulate observations with that average
  for (i in 1:n_people){
    bird_count[i] = poisson_rng(avg_birds_per_person);
  }
} 

This model has all the same code as the previous one, but has two additional parts. Let’s compare them

poisson_simulation
S4 class stanmodel 'poisson_simulation' coded as follows:
data {
  int<lower=0> n_people;
}
generated quantities {
  real<lower=0> avg_birds_per_person;
  // an array -- like a list in R
  array[n_people] int<lower=0> bird_count;

  // simulate averages
  avg_birds_per_person = uniform_rng(0, 60);
  // simulate observations with that average
  for (i in 1:n_people){
    bird_count[i] = poisson_rng(avg_birds_per_person);
  }
} 
poisson_model
S4 class stanmodel 'poisson_model' coded as follows:
data {
  int<lower=0> n_people;
  array[n_people] int<lower=0> bird_count_observed;
}
parameters {
  real<lower=0> avg_birds_per_person;
}
model {
  bird_count_observed ~ poisson(avg_birds_per_person);
  avg_birds_per_person ~ uniform(0, 60);
}
generated quantities {
  // an array -- like a list in R
  array[n_people] int<lower=0> bird_count;

  // simulate observations with that average
  for (i in 1:n_people){
    bird_count[i] = poisson_rng(avg_birds_per_person);
  }
} 

What’s different in this second Stan program? There are two new sections:

  • parameters block : Indicated by parameters {}, this block includes all the unobserved quantities. In the present case there is only one unobserved quantity: avg_birds_per_person. In this block, we have to give this value a name and say what kind of number it is. It is also the place to declare any constraints on the parameter. In our example, we state that the parameter is always positive (because it is an average of counts).
  • model block : Indicated by model {}, this block contains the model. What this means in a Bayesian context is that it lists the probability distribution for all the observations, and for all the unobserved parameters. In other words, it looks just like our mathematical expressions above. The model block can also contain intermediate calculations, for example combining data and parameters with an equation. We’ll see examples of this later.

This Stan program also contains one line that’s been moved. Specifically,

  avg_birds_per_person ~ uniform(0, 60);

has been moved to the model block. This has two consequences.

First, when the model sees our data, it looks for values of avg_birds_per_person that would make those data probable. In other words, it computes the posterior distribution of plausible values for the average. Together, the prior and the data constrain what those values can be.

Second, compared to the poisson_simulation file, the generated quantities block means something different now. Previously, we had no idea what avg_birds_per_person should be, so we had the computer choose a random number from a wide range. We called this “the prior”. Now, when the computer draws new values of bird_count, it is going to use the values from the posterior that it is finding in the model {} block. This means that the simulations, rather than being prior predictive checks, are now posterior predictive checks.

Parameter recovery

TipEXERCISE: parameter recovery in Stan

Use the Stan code above to fit the model to our simulated data. Do we recover the parameters?

Once again, we connect a dataset to our model with a list:

bird_data_list <- list(bird_count_observed = bird_count,
                       n_people = length(bird_count))

poisson_model_samp <- rstan::sampling(poisson_model, 
                                      data  = bird_data_list,
                                      refresh = 0)

summary(poisson_model_samp)$summary
                           mean    se_mean        sd       2.5%        25%
avg_birds_per_person   33.82153 0.02955644 1.2033624   31.53522   32.99033
bird_count[1]          33.98250 0.09960319 5.8618072   23.00000   30.00000
bird_count[2]          33.90775 0.09459180 5.9202620   23.00000   30.00000
bird_count[3]          33.71150 0.09649288 5.8788101   23.00000   30.00000
bird_count[4]          33.86775 0.09883340 5.9166723   23.00000   30.00000
bird_count[5]          33.57825 0.09745838 6.0008232   23.00000   29.00000
bird_count[6]          33.66450 0.09784491 6.0740977   23.00000   29.00000
bird_count[7]          33.78100 0.09889797 6.0039196   23.00000   30.00000
bird_count[8]          33.83275 0.09952679 5.9857526   23.00000   30.00000
bird_count[9]          33.66325 0.09583036 5.7520102   23.00000   30.00000
bird_count[10]         33.71050 0.09677890 5.8317829   23.00000   30.00000
bird_count[11]         33.66150 0.09425594 5.8705224   23.00000   30.00000
bird_count[12]         33.93425 0.09446877 5.8762709   23.00000   30.00000
bird_count[13]         33.85575 0.09788260 5.9424128   22.00000   30.00000
bird_count[14]         33.71300 0.09801505 5.9216465   22.00000   30.00000
bird_count[15]         33.80900 0.09852701 6.0054172   22.97500   30.00000
bird_count[16]         33.80750 0.09430131 5.8768681   23.00000   30.00000
bird_count[17]         33.81750 0.09708688 5.9360343   22.97500   30.00000
bird_count[18]         33.83275 0.09495247 5.8445973   23.00000   30.00000
bird_count[19]         33.72100 0.09374341 5.8836905   22.97500   30.00000
bird_count[20]         33.74875 0.09707166 5.9274706   23.00000   30.00000
bird_count[21]         33.81975 0.09492447 5.9404194   23.00000   30.00000
bird_count[22]         33.92200 0.09450489 5.8874087   23.00000   30.00000
lp__                 1871.63203 0.01492467 0.6850866 1869.69541 1871.48083
                            50%        75%      97.5%    n_eff      Rhat
avg_birds_per_person   33.79564   34.59665   36.22421 1657.634 1.0005020
bird_count[1]          34.00000   38.00000   46.00000 3463.511 1.0012297
bird_count[2]          34.00000   38.00000   46.00000 3917.192 0.9999287
bird_count[3]          33.00000   37.00000   46.00000 3711.832 0.9995640
bird_count[4]          34.00000   38.00000   46.00000 3583.831 1.0005823
bird_count[5]          33.00000   37.00000   46.00000 3791.257 0.9999575
bird_count[6]          34.00000   38.00000   46.00000 3853.781 0.9995651
bird_count[7]          34.00000   38.00000   46.00000 3685.488 1.0002217
bird_count[8]          34.00000   38.00000   46.00000 3617.075 0.9993676
bird_count[9]          33.00000   38.00000   45.00000 3602.741 1.0004714
bird_count[10]         33.00000   38.00000   46.00000 3631.126 1.0016181
bird_count[11]         34.00000   37.00000   46.00000 3879.145 1.0003653
bird_count[12]         34.00000   38.00000   46.00000 3869.252 0.9994187
bird_count[13]         34.00000   38.00000   46.00000 3685.655 1.0001659
bird_count[14]         34.00000   38.00000   46.00000 3650.055 0.9998041
bird_count[15]         34.00000   38.00000   46.00000 3715.145 1.0008637
bird_count[16]         34.00000   38.00000   46.00000 3883.797 0.9998472
bird_count[17]         34.00000   38.00000   46.00000 3738.279 0.9997732
bird_count[18]         34.00000   38.00000   46.00000 3788.756 0.9996116
bird_count[19]         33.00000   38.00000   46.00000 3939.292 0.9996765
bird_count[20]         34.00000   38.00000   46.00000 3728.669 0.9992992
bird_count[21]         34.00000   38.00000   46.00000 3916.317 1.0011303
bird_count[22]         34.00000   38.00000   46.00000 3880.966 1.0003850
lp__                 1871.89161 1872.05767 1872.10206 2107.081 1.0004748

We can look at a table of coefficients but it is much easier to once again look at posterior samples as a figure.

NoteVisualize everything!

Bayesian workflows are highly visual. Make as many plots as you can: of your parameters, your predictions, the performance of your chains, etc.

bayesplot

Another essential package for working with posterior samples is called bayesplot. Let’s use it to look at the posterior distribution for avg_birds_per_person.

bayesplot::mcmc_areas(poisson_model_samp, pars = "avg_birds_per_person") + 
  geom_vline(xintercept = avg_birds_per_person, col = "orange", lwd = 2)

posterior distribution for avg_birds_per_person. The orange line is the true parameter value, which we simulated in R.

Posterior predictive checks

Bayesian models GENERATES data, which suggests a clear way to validate our models: ask the model to make some data, then see how well these data correspond to our real data. Here, we will take 50 fake datasets of bird counts and compare them to the simulation we first did in R.

The process involves a bit of fiddling around in R to get the simulated data, but then bayesplot does all the work:

bird_count_model_draws <- rstan::extract(poisson_model_samp,
                                         pars = "bird_count")
bayesplot::ppc_dens_overlay(y = bird_count,
                            yrep = head(bird_count_model_draws$bird_count, 50))

tidybayes again

As we see above, bayesplot offers many out-of-the-box figures. Sometimes however, you’ll want to control exactly what your figures look like, and for this tidybayes is an excellent tool.

Let’s use the flexibility of tidybayes to show how the prior and posterior differ between our two models

# Define breaks to compare the two model results using histogram
brCompare <- seq(min(pois_simul$avg_birds_per_person),
                 max(pois_simul$avg_birds_per_person),
                 length = 60)

# Draw histogram of simulated prior
hist(pois_simul$avg_birds_per_person,
     breaks = brCompare,
     xlab = "",
     ylab = "",
     main = "Prior")
# Draw histogram of model posterior
poisson_model_est <- rstan::extract(poisson_model_samp)
hist(poisson_model_est$avg_birds_per_person,
     breaks = brCompare,
     xlab = "",
     ylab = "",
     main = "Posterior")

Shinystan

shinystan::launch_shinystan(poisson_model_samp)

Exercises

Level 1

  • What would you do next to add complexity the bird-counting model above?
  • We plotted histograms to evaluate our model. Experiment with other types of plots. For example, what is the maximum value in each posterior simulation? What is the minimum? How to these compare to the real data? TIP: check out ?ppc_stat.

Level 2

  • Take a closer look at summary(poisson_model_samp). All the values of bird_count are essentially the same. That’s correct, but why?
  • Try to fit YOUR data and your chosen distribution from Monday’s exercise. Check to see if the distribution you chose is implemented in Stan – see, for example, this list.
  • check the fit using the plots we have already seen today.

Level 3

  • You would never actually do the analysis in this exercise! If all you want is the average of a Poisson distribution, you can get that without any sampling at all. Start by writing the model with a different prior:

\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Gamma}(9, .5) \end{align} \]

This lets us calculate the posterior distribution directly. See the equation on Wikipedia and calculate the posterior for our bird data.

Footnotes

  1. in my experience anyway!↩︎