library(tidyverse)
library(cmdstanr)
Introduction to Stan and simulation
Repeating yesterday’s exercise, with the programming language Stan.
- Introduce the idea of a generative model
- Using generated data to validate a model
- Stan model syntax
- Outline of a Bayesian workflow
The process
To practice our first models in Stan, we’ll being with an imaginary example based on yesterday’s activity. We’ll imagine we’re conducting a particularly pleasant study: counting birds!
Question How many birds will each person in the class find? For the purposes of this example, let’s say there are 22 people.
- We’re going to count birds, so we’ll have count data: a number that is either 0 or some positive, round number
- We’ll make a simplifying assumption: everybody has the same chance of seeing a bird (i.e. no differences in skill or equipment), and everyone in the class is an independent observer (i.e. nobody is working in pairs, etc.)
- Everyone in the class makes only one count, so we have 22 numbers.
We’re Bayesian, so we need to write a probability distribution for all the possible values.
\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Uniform}(0, 60) \end{align} \]
A quick note about notation for models like these:
- We use a subscript \(i\) to indicate the “label” for each observation in our dataset. You can think of this as the row number of the data spreadsheet, and imagine sliding your finger down the column of measurements, modelling each value in turn.
- Usually we’ll use more general language, such as \(y_i\). But for this simple example I wanted to make things as explicit as possible.
- Notice the symbol \(\sim\). This is read as “distributed as”, and indicates the probability distribution from which the values might come. When the values we’re talking about are data that we can observe (in this case, counts of birds), we call the distribution the likelihood. When the value is something we can’t observe (in this case, the average count \(\lambda\)) we call the distribution the prior.
We’ll be talking about better ways to model count data in a later exercise! For now, I’m using the Uniform distribution for simplicity. It’s not usually a very good choice!
Simulation in R
Before starting work on real data, we are going to begin by learning how to generate data by simulation. There are at least three reasons why this is a good idea:
- Understand your priors.. For most interesting models in ecology, you will not be able to pick good numbers for your prior parameters just by thinking hard. Should the prior on annual tree growth be \(\text{Normal}(2, 1)\) ? Or should the standard deviation be bigger? Smaller? As we’ll see, simulation will demystify the process.
- Validate your model. Bayesian models are great because they can create datasets by simulation. This suggests a very minimum requirement we might have for a statistical model: use known parameters and a model to generate data, then fit that same model to the very data it generated, and see if we get back something close to those known parameter values.
- Test your understanding. Perhaps most importantly, simulation helps you to test your own intuition. If you can simulate data from your model, then you really understand it! If you can’t, then you don’t know quite how it works yet. It’s rare1 that a biologist will fail to learn something by simulating a dataset.
Simple exercise in simulation
Let’s imagine we are taking a walk as a group today at this beautiful field site. What is the number of birds (total abundance of ALL species) each of us is going to see on our hike?
Some questions to ask about simulated data
- What kind of observations are you going to make? Do they have a minimum or maximum value? Are they integers, or are they decimal numbers, or something else?
- Where do the numbers come from? This could be anything, from simple linear approximations (i.e. the models we’re looking at in this course) to ODEs, mathematical models, GAMs, etc.
- How many observations will we be making?
One of the most useful traits of Bayesian models is that they are generative: they can be used to make a simulated dataset. We’ll do that now for our bird example.
let’s simulate from a Poisson distribution:
set.seed(525600)
<- 21
n_people <- runif(1, min = 0, max = 30)
avg_birds_per_person <- rpois(n_people, lambda = avg_birds_per_person) bird_count
Some things to note in the code above:
Every statistical distribution that is in R (which is a lot! almost all! ) has four different functions. If the distribution is called dist
, then they are:
rdist
= draw random numbers fromdist
qdist
= the quantile function – what value gives a certain proportion of the distribution?pdist
= the probability density function – what proportion of the distribution is below a certain value?ddist
the density function = draws the “shape” of a distribution. How probable are specific values?
The other thing to note is that there are TWO simulation steps here: first, simulating a value of the average (\(\lambda\)) and second, simulating observations. In our model, the Uniform distribution was referred to as the prior, and the Poisson distribution was referred to as a likelihood, but here you can see that they are very nearly the same thing: just statements about what distribution of values might be most consistent with the data.
Plotting the result
Let’s take a look at our simulated values:
hist(bird_count, col = "lightblue", xlim = c(0, 50))
This is pretty great, and represents one possible realization of sampling. However, one sample isn’t enough to tell us about what our \(\text{Uniform}(0, 60)\) prior really means.
Try to make many different simulations (say, 12 simulations). This represents 12 different repeats of the whole process: draw a value from the uniform prior, THEN draw a value from the poisson. Visualize them any way you want! (the worked example below uses ggplot2
)
library(tidyverse)
set.seed(525600)
<- function() {
simulate_some_birds <- runif(1, min = 0, max = 60)
lambda data.frame(obs = rpois(23, lambda = lambda))
}
<- replicate(12, simulate_some_birds())
rep_list
::tibble(simulation = 1:12,
tibbleobs = rep_list) |>
unnest(cols = "obs") |>
ggplot(aes(x = obs)) +
geom_histogram(bins = 28) +
facet_wrap(~simulation) +
theme_bw() +
labs(x = "Number of birds observed per person")
This figure shows different simulations of what, according to our prior, might be reasonable datasets for us to study. Do any of them seem implausible to you? If so, try changing the prior. The goal is to make fake datasets that seem plausible, but which still include the possibility of some surprising observations.
When you have a prior that generates observations that cover a range of scientifically reasonable values, then you are ready to move on to fitting real data.
However before we actually do that, let’s do the whole thing again: this time in Stan.
Simulating data in Stan
Let’s look back at the equation:
\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Uniform}(0, 60) \end{align} \]
And then translate it into Stan:
<- cmdstan_model(
poisson_simulation stan_file = "topics/01_simulation/poisson_simulation.stan")
poisson_simulation
data {
int<lower=0> n_people;
}generated quantities {
real<lower=0> avg_birds_per_person;
// an array -- like a list in R
array[n_people] int<lower=0> bird_count;
// simulate averages
0, 60);
avg_birds_per_person = uniform_rng(// simulate observations with that average
for (i in 1:n_people){
bird_count[i] = poisson_rng(avg_birds_per_person);
} }
What you see just above is not R, but is the first Stan program we will see in this course. Stan code is written in a text file, which you then bring into R with the function cmdstan_model
as shown above. This does more than read the program, it compiles the code into a computer program. When you sample the model, as we’ll do later, Stan samples the posterior distribution using Hamiltonian Monte Carlo.
This Stan program has two parts. Each part is separated with curly braces, {}
. The are they data
block and the generated quantities
block:
data {
int<lower=0> n_people;
}
And the generated quantites block.
generated quantities {
real<lower=0> avg_observed;
// an array -- like a list in R
array[n_people] int<lower=0> bird_count;
// simulate averages
0, 60);
avg_birds_per_person = uniform_rng(// simulate observations with that average
for (i in 1:n_people){
bird_count[i] = poisson_rng(avg_birds_per_person);
} }
Let’s look at similarities and differences to the procedure in R. Try to find a few similarities and differences between R code and Stan code!
similarities:
- We have a random number generating function for each of our distributions. In R, these were called
runif
andrpois
, here they areuniform_rng
andpoisson_rng
. - Once again, the only thing we need to provide is
n_people
, the number of observers we have
differences:
- every line ends with a semicolon
;
- in Stan, the name of a variable is on the RIGHT of a line, while in R it’s on the left.
- we need to use a for-loop to generate random variables.
- note the syntax for creating an
array
of integers. Arrays in Stan are a little like lists in R: they can hold any other kind of object, and are of a certain length.
Sampling in Stan
# We connect the data in R to the model in Stan using a
# named list!
<- list(
poisson_simulation_datalist n_people = 21
)
<- poisson_simulation$sample(
poisson_sim_stan data = poisson_simulation_datalist,
refresh = 0,
# usually not necessary -- this model has no parameters
fixed_param = TRUE)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
This generates a large number of simulated datasets – the default is 4000 datasets! Each time the model samples, it draws a new value for the unobserved average (avg_birds_per_person
) and for the number of birds seen by each person.
Let’s pull out just a few of these datasets and visualize them.
We’ll use a wonderful package called tidybayes
to easily extract posterior draws from cmdstan
objects.
library(tidybayes)
<- tidybayes::spread_draws(poisson_sim_stan,
pois_sim
avg_birds_per_person,
bird_count[],ndraws = 20,
seed = 525600)
Here we pass tidybayes::spread_draws()
the model name, as well as the names of the parameters that we want to work with. The parameter avg_birds_per_person
is a scalar, so we only need to mention it by name. The parameter vector bird_count
needs square brackets after its name. This syntax is strange but gives us lots of options for more complex models, as we’ll see!
Let’s see what we get from tidybayes by looking at the first few rows.
|>
pois_sim as.data.frame() |>
head(25) |>
::kable() knitr
.chain | .iteration | .draw | avg_birds_per_person | bird_count |
---|---|---|---|---|
1 | 553 | 553 | 24.4378 | 29 |
1 | 553 | 553 | 24.4378 | 27 |
1 | 553 | 553 | 24.4378 | 26 |
1 | 553 | 553 | 24.4378 | 30 |
1 | 553 | 553 | 24.4378 | 28 |
1 | 553 | 553 | 24.4378 | 21 |
1 | 553 | 553 | 24.4378 | 32 |
1 | 553 | 553 | 24.4378 | 24 |
1 | 553 | 553 | 24.4378 | 28 |
1 | 553 | 553 | 24.4378 | 26 |
1 | 553 | 553 | 24.4378 | 24 |
1 | 553 | 553 | 24.4378 | 11 |
1 | 553 | 553 | 24.4378 | 32 |
1 | 553 | 553 | 24.4378 | 19 |
1 | 553 | 553 | 24.4378 | 23 |
1 | 553 | 553 | 24.4378 | 28 |
1 | 553 | 553 | 24.4378 | 28 |
1 | 553 | 553 | 24.4378 | 10 |
1 | 553 | 553 | 24.4378 | 23 |
1 | 553 | 553 | 24.4378 | 24 |
1 | 553 | 553 | 24.4378 | 20 |
4 | 310 | 3310 | 28.4318 | 32 |
4 | 310 | 3310 | 28.4318 | 28 |
4 | 310 | 3310 | 28.4318 | 29 |
4 | 310 | 3310 | 28.4318 | 25 |
Remember we asked for only 25 of the 4000 posterior samples. Here is one sample, and just a bit of the next. We can see that the value of avg_birds_per_person
is the same within each iteration. The model uses this average to sample every bird_count
value, one for each person making observations. Then the program takes a new value of avg_birds_per_person
and simulates everyone’s bird_count
again!
Let’s take a look at some of these simulations:
|>
pois_sim ggplot(aes(x = bird_count)) +
geom_histogram(fill = "orange") +
geom_vline(aes(xintercept = avg_birds_per_person), col = "darkgreen", lwd = 1) +
facet_wrap(~.draw) +
theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Parameter recovery
Let’s go back and look at the fake datasets we created in R
avg_birds_per_person
[1] 17.12789
bird_count
[1] 23 10 19 27 20 15 16 18 18 22 14 14 14 18 17 13 26 19 16 13 10
and let’s see if we can recapture the only known parameter, avg_birds_per_person
, which is equal to 17.127887.
We’ll do it first in R, using the function fitdistr
from the MASS
package:
::fitdistr(bird_count, dpois, start = list(lambda=10)) MASS
Warning in stats::optim(x = c(23L, 10L, 19L, 27L, 20L, 15L, 16L, 18L, 18L, : one-dimensional optimization by Nelder-Mead is unreliable:
use "Brent" or optimize() directly
lambda
17.2382812
( 0.9060239)
This could also be done with glm
<- glm(bird_count ~ 1, family = "poisson")
bird_glm exp(coef(bird_glm))
(Intercept)
17.2381
You can see that in all cases we are getting close to the value of avg_birds_per_person
, which in these simulations is the true value.
Sampling the posterior distribution in Stan
Time for the HMC Slides!
We will be doing a lot of Stan models this week, and we will begin by replicating the above GLM in Stan.
<- cmdstan_model(
poisson_model stan_file = "topics/01_simulation/poisson_model.stan")
poisson_model
data {
int<lower=0> n_people;
array[n_people] int<lower=0> bird_count_observed;
}parameters {
real<lower=0> avg_birds_per_person;
}model {
bird_count_observed ~ poisson(avg_birds_per_person);0, 60);
avg_birds_per_person ~ uniform(
}generated quantities {
// an array -- like a list in R
array[n_people] int<lower=0> bird_count;
// simulate observations with that average
for (i in 1:n_people){
bird_count[i] = poisson_rng(avg_birds_per_person);
} }
This model has all the same code as the previous one, but has two additional parts. Let’s compare them
poisson_simulation
data {
int<lower=0> n_people;
}generated quantities {
real<lower=0> avg_birds_per_person;
// an array -- like a list in R
array[n_people] int<lower=0> bird_count;
// simulate averages
0, 60);
avg_birds_per_person = uniform_rng(// simulate observations with that average
for (i in 1:n_people){
bird_count[i] = poisson_rng(avg_birds_per_person);
} }
poisson_model
data {
int<lower=0> n_people;
array[n_people] int<lower=0> bird_count_observed;
}parameters {
real<lower=0> avg_birds_per_person;
}model {
bird_count_observed ~ poisson(avg_birds_per_person);0, 60);
avg_birds_per_person ~ uniform(
}generated quantities {
// an array -- like a list in R
array[n_people] int<lower=0> bird_count;
// simulate observations with that average
for (i in 1:n_people){
bird_count[i] = poisson_rng(avg_birds_per_person);
} }
What’s different in this second Stan program? There are two new sections:
- parameters block : Indicated by
parameters {}
, this block includes all the unobserved quantities. In this case there is only one:avg_birds_per_person
. We have to give this value a name and say what kind of number it is. Here is also the place to declare any constraints. In our example, we state that the parameter is always positive (because it is an average of counts). - model block indicated by
model {}
, this block contains the model. What this means in a Bayesian model is that it lists the probability distribution for all the observations, and for all the unobserved parameters. In other words, it looks just like our mathematical expressions above. The model block can also contain intermediate calculations, for example combining data and parameters with an equation. We’ll see examples later.
This Stan program also contains one line that’s been moved.
0, 60); avg_birds_per_person ~ uniform(
has been moved to the model
block. This has two consequences.
First, when the model sees our data, it’s going to try to find values of avg_birds_per_person
which make those data probable – in other words, it is going to find the posterior distribution of possible values of the average. Together, the prior and the data constrain what those values can be.
Second, the generated quantities
block means something different now. Previously, we had no idea what avg_birds_per_person
should be, so we had the computer choose a random number from a wide range. We called this “the prior” Now, when the computer draws new values of bird_count
, it is going to use the values from the posterior that it is finding in the model {}
block . This means that the simulations, rather than being prior predictive checks, are now posterior predictive checks.
Parameter recovery
Use the Stan code above to fit the model to our simulated data. Do we recover the parameters?
Once again, we connect a dataset to our model with a list:
<- list(bird_count_observed = bird_count,
bird_data_list n_people = length(bird_count))
<- poisson_model$sample(data = bird_data_list,
poisson_model_samp refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
$summary() poisson_model_samp
# A tibble: 23 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ 671. 671. 0.699 0.329 670. 672. 1.00 1912. 2347.
2 avg_birds_per_p… 17.2 17.2 0.907 0.919 15.8 18.8 1.00 1800. 2084.
3 bird_count[1] 17.2 17 4.30 4.45 10 25 1.00 3756. 4108.
4 bird_count[2] 17.1 17 4.29 4.45 10 24 1.00 3749. 3290.
5 bird_count[3] 17.3 17 4.29 4.45 11.0 25 1.00 3866. 3935.
6 bird_count[4] 17.3 17 4.20 4.45 11 24 1.00 3554. 3727.
7 bird_count[5] 17.1 17 4.19 4.45 11 24 1.00 3870. 3959.
8 bird_count[6] 17.2 17 4.26 4.45 11 25 1.00 3814. 3771.
9 bird_count[7] 17.2 17 4.29 4.45 10 25 1.00 3930. 3644.
10 bird_count[8] 17.2 17 4.26 4.45 10 25 1.00 4151. 3893.
# ℹ 13 more rows
We can look at a table of coefficients but it is much easier to once again look at posterior samples as a figure.
Bayesian workflows are highly visual. Make as many plots as you can: of your parameters, your predictions, the performance of your chains, etc.
bayesplot
Another essential package for working with posterior samples is called bayesplot
. Let’s use it to look at the posterior distribution for avg_birds_per_person
.
## Extract draws from the model object
<- poisson_model_samp$draws()
poisson_model_draws
::mcmc_areas(poisson_model_draws, pars = "avg_birds_per_person") +
bayesplotgeom_vline(xintercept = avg_birds_per_person, col = "orange", lwd = 2)
Posterior predictive checks
Bayesian models MAKE data, which suggests a clear way to validate our models: ask the model to make some data, then see how well these data correspond to biology (e.g. to our real data). Here, we will take 50 fake datasets of bird counts and compare them to the simulation we first did in R.
The process involves a bit of fiddling around in R to get the simulated data, but then bayesplot
does all the work:
<- poisson_model_samp$draws(variables = "bird_count")
bird_count_draws <- posterior::as_draws_matrix(bird_count_draws)
bird_count_draws_matrix
::ppc_dens_overlay(y = bird_count,
bayesplotyrep = head(bird_count_draws_matrix, 50))
tidybayes again
As we see above, bayesplot
offers many out-of-the-box figures. Sometimes however, you’ll want to control exactly what your figures look like, and for this tidybayes
is an excellent tool.
Let’s use the flexibility of tidybayes
to show how the prior and posterior differ between our two models
::gather_rvars(poisson_sim_stan,
tidybayes|>
avg_birds_per_person) ggplot(aes(y = "avg_birds_per_person", dist = .value)) +
::stat_halfeye() +
tidybayeslabs(title = "Prior")
::gather_rvars(poisson_model_samp,
tidybayes|>
avg_birds_per_person) ggplot(aes(y = "avg_birds_per_person", dist = .value)) +
::stat_halfeye() +
tidybayeslabs(title = "Posterior") +
coord_cartesian(xlim = c(0, 60))
Shinystan
::launch_shinystan(poisson_model_samp) shinystan
Exercises
Level 1
- Consider the What would you do next to add complexity the bird-counting model above?
- We plotted histograms to evaluate our model. Experiment with other types of plots. For example, what is the maximum value in each posterior simulation? What is the minimum? How to these compare to the real data?
Level 2
- Take a closer look at
poisson_model_samp$summary()
. All the values ofbird_count
are the same. That’s correct, but why? - Try to fit YOUR data and your chosen distribution from Monday’s exercise. Check to see if the distribution you chose is implemented in Stan – see, for example, this list.
- check the fit using the plots we have already seen today.
Level 3
- You would never actually do the analysis in this exercise! If all you want is the average of a Poisson distribution, you can get that without any sampling at all. Start by writing the model with a different prior:
\[ \begin{align} \text{Number of Birds}_{\text{seen by person i}} &\sim \text{Poisson}(\lambda) \\ \lambda &\sim \text{Gamma}(9, .5) \end{align} \]
This lets us calculate the posterior distribution directly. See the equation on Wikipedia and calculate the posterior for our bird data.
Footnotes
in Andrew’s experience anyway↩︎