Generative models and multilevel models

class: title-slide, middle

# Model evaluation with ecological data

## Generative models and multilevel models

.instructors[
  Andrew MacDonald, Dominique Gravel
]

---
# Generative models

Bayesian models can generate data. This is part of what makes them useful in ecology:

* Easy to make predictions
* Straightforward to test the model

---
# Simple Generative model

We've already seen a simple model which is generative: our Binomial model with a Beta prior:

We start with this model
$$
`\begin{align}
y \sim \text{Binomial}(6, \theta) \\
\theta \sim Beta(2,2)
\end{align}`
$$

and we have data `4, 5, 4, 4, 3`

* So total successes is `$4 + 5 + 4 + 4 + 3 = 20$`
* And total failures is `$6 * 5 - 20 = 10$`

The posterior becomes:

$$
\theta_{post} \sim \text{Beta}(22, 12)
$$

---
# Generating data with this simple model

Quick pseudocode (_Note_ this is also called the Beta-Binomial distribution) :

* generate values of `$\theta$` from `$\text{Beta}(22, 12)$`
* for each value, generate an observation of surviving eggs using:

$$
y \sim \text{Binomial}(6, \theta)
$$

.pull-left[

```r
theta_post <- rbeta(200, 
 22, 12)
obs <- rbinom(n = 200, 
 size = 6, 
 theta_post)
hist(obs, breaks = 20)
```
]

.pull-right[
<img src="13_generative_files/figure-html/unnamed-chunk-1-1.png" width="504" />
]

---
# nonlinear tree growth -- write the model

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}(\text{log}(\mu), \sigma_{height}) \\
\mu &= \frac{aL}{a/s + L} \\
\end{align}`
$$

To make it Bayesian, we specify distributions for every parameter:
--
$$
`\begin{align}
\sigma_{height} & \sim \text{Exponential}(0.5)\\
a               & \sim \text{Normal}(200, 15)\\
s               & \sim \text{Normal}(15, 2)\\
\end{align}`
$$

--
* _not_ the only choices for these parameters! Not even particularly good ones
--
* use prior predictive checks to see if these priors make sense

---
# Nonlinear light data -- simulation

Draw parameters from your priors.

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}(\text{log}(\mu), \sigma_{height}) \\
\mu &= \frac{aL}{a/s + L} \\
\sigma_{height} & \sim \text{Exponential}(2)\\
a               & \sim \text{Normal}(200, 15)\\
s               & \sim \text{Normal}(15, 2)\\
\end{align}`
$$

---
# Nonlinear light data -- simulation

Draw parameters from your priors.

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}(\text{log}(\mu), \sigma_{height}) \\
\mu &= \frac{aL}{a/s + L} \\
\sigma_{height} & = 0.36\\
a               & = 196.4\\
s               & = 16.1\\
\end{align}`
$$
---
# Nonlinear light data -- simulation

Draw parameters from your priors.

* keep these fixed for the rest of the simulation! You can go back and try different ones later!

* **pop quiz** is `$\mu$` a parameter??

---
# Nonlinear light data -- simulation

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}\left(\text{log}\left(\frac{aL}{a/s + L} \right), \sigma_{height}\right) \\
\sigma_{height} & = 0.36\\
a & = 196.4\\
s & = 16.1\\
\end{align}`
$$
 
**NO**. `$\mu$` is not a parameter, it is simply a handy way to rewrite this equation so it is easier to read.

`$\mu$` might be a very interesting thing to calculate after the model is run!
---
class: center
# Simple simulation

<img src="13_generative_files/figure-html/unnamed-chunk-2-1.png" width="504" />
---
# Other ways of talking about the same model

let's go back to this representation of the model:

* the deterministic part of the model is indicated with `$=$`
* the stochastic parts are all indicated with `$\sim$`

When one distribution has a parameter which is also a distribution, we say that the first is *conditional* on the second.
---
class: center

# Graphing conditional distributions

```
## Warning in if (layout %in% c("dendogram")) {: the condition has length > 1 and
## only the first element will be used
```

---
class: center

# Converting the graph to probability notation

$$
[a, s, \sigma_{\text{height}}|y ] \propto 
[y |g(L, a, s), \sigma_{\text{height}}] 
[a] 
[s] [\sigma_{\text{height}}] 
$$

---
# Converting the graph to probability notation

$$
[a, s, \sigma_{\text{height}}|y ] \propto 
[y |g(L, a, s), \sigma_{\text{height}}] 
[a] 
[s] [\sigma_{\text{height}}] 
$$

* the vertical bar `$|$` reads "conditioned on" or "dependent on"

* `$L$` is observed and therefore it appears *only* of the right-hand side of `$|$`

* `$a$`, `$s$` and `$L$` are parameters and therefore need distributions

* `$[y |g(L, a, s), \sigma_y]$` is a likelihood! This means all the observations multiplied together:

$$
[y |g(L, a, s), \sigma_y] = \prod_{i=1}^{300} \text{LogNormal}\left(y_i, \text{log}\left(\frac{aL}{a/x + L} \right), \sigma_{height}\right)
$$

---
# Hierarchical models

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}(\text{log}(\mu_n), \sigma_{height}) \\
\mu_n &= \frac{a_nL}{a_n/s + L} \\
a_n &= \bar{a} + a_{j[n]}\\
a_j &\sim \text{Normal}(0, \sigma_a)\\
\bar{a}               & \sim \text{Normal}(200, 15)\\
\sigma_{a} & \sim \text{Exponential}(2)\\
\sigma_{height} & \sim \text{Exponential}(4)\\
s               & \sim \text{Normal}(15, 2)\\
\end{align}`
$$

* For observation number `$n$`, between `$1$` and `$N$` total observations.
* For species number `$j$` , which is between `$1$` and `$S$` total species (in our example `$S$` is 5)
* There are **5 values** of `$a_j$` : `$a_1$`, `$a_2$`, `$a_3$`, `$a_4$`, `$a_5$`

---
# Quick word about nested indexing

`$a_{j[n]}$`  This means the value of `$a$` for observation `$n$` -- whichever of the `$j$` species it is in.

```r
sp_ids <- c(fir = 1, maple = 2)
fake_data <- expand.grid(light = c(5,10, 60),
 spp = c("fir", "maple"))
*fake_data$id <- sp_ids[fake_data$spp]
knitr::kable(fake_data)
```

| light|spp   | id|
|-----:|:-----|--:|
|     5|fir   |  1|
|    10|fir   |  1|
|    60|fir   |  1|
|     5|maple |  2|
|    10|maple |  2|
|    60|maple |  2|

---
# Hierarchical models -- two ways to write

$$
`\begin{align}
\text{height} &\sim \text{LogNormal}(\text{log}(\mu), \sigma_{height}) \\
\mu_n &= \frac{a_{j[n]}L}{a_{j[n]}/s + L} \\
a_j &\sim \text{Normal}(\bar{a}, \sigma_a)\\
\bar{a}               & \sim \text{Normal}(200, 15)\\
\sigma_{a} & \sim \text{Exponential}(2)\\
\sigma_{height} & \sim \text{Exponential}(4)\\
s               & \sim \text{Normal}(15, 2)\\
\end{align}`
$$

BOTH will have the same 5 values for `$a_j$`

---
class: center
# Different species `$a$` simulation

---
# DAG for Multilevel models

```
## Warning in if (layout %in% c("dendogram")) {: the condition has length > 1 and
## only the first element will be used
```

Note: we have **arrows pointing in** AND **arrows pointing out**

---
# Probability notation

$$
`\begin{align}
[a_j,\bar{a}, \sigma_a, s, \sigma_{\text{height}}|y ] &\propto \\
&[y |g(L, a_j, s), \sigma_{\text{height}}] \times \\
&[a_j |\bar{a}, \sigma_a] \times \\
&[\bar{a}] \times \\
&[\sigma_a] \times \\
&[s] \times \\ 
&[\sigma_{\text{height}}] 
\end{align}`
$$