---
title: "Gaussian Processes"
description: |
Smooth lines in fancy colours.
execute:
freeze: false
comments:
hypothesis: true
format:
html:
code-tools: true
editor_options:
chunk_output_type: console
---
:::{.callout-tip}
## Goals of this lesson
1. Let's appreciate together the power of online community resources
1. Gaussian processes are families of smooth functions we learn from data
1. When used for prediction, a Gaussian process is both a "prior" and a "likelihood"
:::
## Background reading
Gaussian processes are surprisingly common, and there are lots of resources on the topic:
1. The Stan manual [has a chapter on it](https://mc-stan.org/docs/stan-users-guide/gaussian-processes.html)
1. The Stan team gives lots of [example models on Github](https://github.com/stan-dev/example-models/blob/master/misc/gaussian-process/gp-fit-logit.stan) which I adapted for this example.
2. Michael Betancourt has an extremely detailed, very rigous [tutorial on Gaussian Process](https://betanalpha.github.io/assets/case_studies/gaussian_processes.html#3_Inferring_A_Gaussian_Process)
3. Here's a complete, worked [analysis of human birthdays](https://avehtari.github.io/casestudies/Birthdays/birthdays.html#Model_4:_long_term_smooth_+_seasonal_+_weekday_with_increasing_magnitude) by world-class statisticians (in particular Andrew Gelman, Aki Vehtari and Daniel Simpson)
4. Gaussian Processes are related to Generalized additive models (GAMs) and can be represented by a collection of basis functions. This is approximate but much much (!) faster. See this [excellent tutorial](https://avehtari.github.io/casestudies/Motorcycle/motorcycle_gpcourse.html#45_GP_with_basis_functions_for_f_and_g) by Aki Vehtari as well as the corresponding paper by [Riutort -Mayol et al. (2023)](https://link.springer.com/article/10.1007/s11222-022-10167-2) also cited in the blog post.
5. This [blog](https://rpubs.com/NickClark47/stan_geostatistical) applies Gaussian Processes to spatial count data
6. Here is a very long and wonderfully detailed post describing a Gaussian Process approach to [occupany modelling](https://peter-stewart.github.io/blog/gaussian-process-occupancy-tutorial/)
7. Another [blog post on Gaussian Processes](https://brendanhasz.github.io/2018/10/10/hmm-vs-gp.html), Hidden Markov Models and more, very clear explanation.
<!-- add equation -->
<!-- add simulation -->
### Reorganizing the mite data
Let's begin by (once again!) loading and reorganizing the mite data. This time we'll also use `mite.xy`, which gives the coordinates of each one of the 70 samples. For the sack of this example, let's focus on the PWIL species because it has a rather strong relationship with water and for which presence and absence are roughly balanced. This is just to make the illustration clear.
#### Loading models and data
```{r setup}
library(rstan)
rstan_options("auto_write" = TRUE)
options(mc.cores = parallel::detectCores())
# mite data
data(mite, package = "vegan")
data(mite.env, package = "vegan")
## ALSO: the spatial data
data(mite.xy, package = "vegan")
```
```{r pick-sp}
pwil_data <- data.frame(plot_id = 1:nrow(mite),
WatrCont = mite.env$WatrCont,
abd = mite$PWIL,
presabs = ifelse(mite$PWIL>0, 1, 0),
water = (mite.env$WatrCont - mean(mite.env$WatrCont))/100)
```
Let's focus on a species with a rather strong relationship with water and for which presence and absence are roughly balanced. This is just to make the example clear.
```{r}
#| fig-cap: Probability of occurrance of one mite species, as a fuction of water content of the soil
# Plot data
plot(pwil_data$water,
pwil_data$presabs,
pch = 19,
cex = 0.6,
col = adjustcolor("steelblue", 0.5),
xlab = "Water (centered)",
ylab = "Presence / Absence")
# Logistic regression
fit_pwil <- glm(presabs ~ water,
data = pwil_data,
family = binomial(link = "logit"))
# Draw prediction line
x_seq_pwil <- seq(min(pwil_data$water),
max(pwil_data$water),
length.out = 100)
lines(x_seq_pwil,
predict(fit_pwil,
newdata = data.frame(water = x_seq_pwil),
type = "response"),
col = "steelblue",
lwd = 2)
```
```{r}
#| fig-cap: Presence-absence data for mite species "PWIL", at the spatial location of each point.
# add the spatial coordinates:
pwil_spatial <- cbind(pwil_data, mite.xy)
pa_cols <- palette.colors(2, "Dark2")
plot(pwil_spatial$x, pwil_spatial$y,
col = pa_cols[pwil_spatial$presabs + 1],
bg = pa_cols[pwil_spatial$presabs + 1],
pch = 21, cex = 1.5,
xlab = "x", ylab = "y",
asp = 1)
legend("topright", legend = c("Absent", "Present"),
pch = 21, pt.bg = pa_cols, col = pa_cols)
```
In this exercise, we will use a Gaussian process from two different angles:
1. To build a nonlinear function of one variable to study how it relates to the abundance of one species
1. To build a model characterizing spatial autocorrelation in the distribution of one species
# Smooth function of one variable
## Write the model
$$
\begin{align}
\mathsf{Pr}(y_i = 1) &\sim \mathsf{Bernoulli}(p_i)\\
\mathsf{logit}(p_i) &= a + f_i\\
f_i &\sim \mathsf{multivariate\ normal}(0, K(x | \theta)) \\
K(x | \alpha, \rho, \sigma)_{i, j}
&= \alpha^2
\exp \left(
- \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2
\right)
+ \delta_{i, j} \sigma^2,
\end{align}
$$
That equation above presents the general notation for a model with $D$ dimensions that relies on the exponentiated quadratic function to account for the spatial constraints. In this first case study, we will focus on the univariate version of this equation.
$$
\begin{align}
\mathsf{Pr}(y_i = 1) &\sim \mathsf{Bernoulli}(p_i)\\
\mathsf{logit}(p_i) &= a + f_i\\
f_i &\sim \mathsf{Multivariate\ Normal}(0, K(x | \theta)) \\
K(x | \alpha, \rho, \sigma)_{i, j}
&= \alpha^2
e^{
\frac{-(\text{water}_i - \text{water}_j)^2}{2 \rho^2}}
+ \delta_{i, j} \sigma^2 \\
\rho &\sim \mathsf{Inverse\ Gamma}(5, 14) \\
\alpha &\sim \mathsf{Normal}(0, .8) \\
a &\sim \mathsf{Normal}(0, .2) \\
\end{align}
$$
Here's an interpretation of the parameters of this model:
* $\alpha^2$ is the maximum covariance between two points
* $\rho$ tells us how quickly that covariance goes down as two samples become more different in their water amount
* $\delta_{i, j} \sigma^2$ adds the variances along the diagonal
See the explanation of this function in the [Stan User's guide](https://mc-stan.org/docs/stan-users-guide/gaussian-processes.html#gaussian-process-regression)
## Simulate to understand it
Here is the Stan code that replicates the mathematical model above.
```{r}
#| class-output: stan
gp_example_sim <- stan_model(file = "topics/04_gp/gp_example_sim.stan")
gp_example_sim
```
```{r, eval=TRUE}
gp_example_sim_samples <- sampling(gp_example_sim,
data = list(N = 20,x = seq(from = -3,
to = 5,
length.out = 20)),
chains = 4,
iter = 2000)
```
```{r}
x_pred_prior <- seq(from = -3, to = 5, length.out = 20)
draws_f_prior <- rstan::extract(gp_example_sim_samples, "f")$f
draws_a_prior <- as.vector(rstan::extract(gp_example_sim_samples, "a")$a)
draw_idx_prior <- sample(length(draws_a_prior), 45)
plot(0,
0,
type = "n",
xlim = range(x_pred_prior),
ylim = c(0, 1),
xlab = "Water",
ylab = "Pr(presence)")
for (k in draw_idx_prior) {
lines(x_pred_prior,
plogis(draws_f_prior[k, ] + draws_a_prior[k]),
col = adjustcolor("steelblue", 0.4))
}
```
## Express that model in code
With a working simulation, we can now adapt the model to handle real data.
```{r}
#| class-output: stan
gp_example_pred <- stan_model(
file = "topics/04_gp/gp_example_pred.stan")
gp_example_pred
```
We need to generate data for making predictions! I'll create a new vector of observations called `new_x` that cover the range of the `water` variable in our dataset.
```{r eval=TRUE}
# sample N values on the range of x
new_x <- seq(from = -3, to = 5, length.out = 15)
gp_data_list <- list(N = length(pwil_spatial$presabs) + length(new_x),
Nobs = length(pwil_spatial$presabs),
x = c(pwil_spatial$water, new_x),
z = pwil_spatial$presabs)
# put them on the data frame
gp_example_pwil_samp <- sampling(gp_example_pred,
data = gp_data_list,
chains = 4,
iter = 2000,
refresh = 1000)
```
```{r eval=TRUE}
#| fig.cap: A Gaussian Process estimates a distribution of smooth functions to a dataset. Here we're using it to estimate the effect of water amount on the occurence of a mite.
Nobs_pwil <- length(pwil_spatial$presabs)
draws_f_pwil <- rstan::extract(gp_example_pwil_samp, "f")$f
draws_a_pwil <- as.vector(rstan::extract(gp_example_pwil_samp, "a")$a)
draws_pred <- plogis(draws_f_pwil[, (Nobs_pwil + 1):ncol(draws_f_pwil)] +
draws_a_pwil)
q_pred <- apply(draws_pred,
2,
quantile,
probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
plot(pwil_spatial$water,
jitter(pwil_spatial$presabs, factor = 0.05),
pch = 19,
cex = 0.5,
col = adjustcolor("black", 0.4),
xlab = "Water (centered)",
ylab = "Pr(presence)",
ylim = c(-0.05, 1.05))
polygon(c(new_x, rev(new_x)),
c(q_pred[1, ], rev(q_pred[5, ])),
col = adjustcolor("red3", 0.15),
border = NA)
polygon(c(new_x, rev(new_x)),
c(q_pred[2, ], rev(q_pred[4, ])),
col = adjustcolor("red3", 0.30),
border = NA)
lines(new_x,
q_pred[3, ],
col = "red3",
lwd = 2)
```
<!--  -->
We can also pull out some specific functions. What I want you to see here is that there are MANY curvy lines that are consistent with this model.
```{r}
draw_idx_63 <- sample(length(draws_a_pwil), 63)
plot(0,
0,
type = "n",
xlim = range(new_x),
ylim = c(0, 1),
xlab = "Water (centered)",
ylab = "Pr(presence)")
for (k in draw_idx_63) {
prob_k <- plogis(draws_f_pwil[k, (Nobs_pwil + 1):ncol(draws_f_pwil)] +
draws_a_pwil[k])
lines(new_x, prob_k, col = adjustcolor("black", 0.3))
}
```
# Spatial predictions
To make a prediction of a function on one X variable, we needed a sequence of points to predict along.
To make spatial predictions, we need a _grid_ of points to predict along.
```{r}
grid_points <- expand.grid(x = seq(min(mite.xy$x), max(mite.xy$x), by = 0.25),
y = seq(min(mite.xy$y), max(mite.xy$y), by = 0.25))
plot(grid_points$x,
grid_points$y,
pch = 19,
cex = 0.4,
xlab = "x",
ylab = "y",
asp = 1,
xlim = range(grid_points$x) + c(-0.25, 0.25))
```
Mathematically, the model is slightly different than for the previous case because the exponentiated quadratic function now needs to account for two dimensions. So, our model looks like this:
$$
\begin{align}
\mathsf{Pr}(y_i = 1) &\sim \mathsf{Bernoulli}(p_i)\\
\mathsf{logit}(p_i) &= a + f_i\\
f_i &\sim \mathsf{multivariate\ normal}(0, K(x | \theta)) \\
K(x | \alpha, \rho, \sigma)_{i, j}
&= \alpha^2
\exp \left(
- \dfrac{1}{2 \rho^2} \sum_{d=1}^2 (x_{i,d} - x_{j,d})^2
\right)
+ \delta_{i, j} \sigma^2,
\end{align}
$$
However, from a coding standpoint, other than a change in the `data {}` block, the Stan code is unchanged! This is because the `gp_exp_quad_cov` Stan function was designed to work for any dimensions.
### Prior predictive simulations
```{r}
#| class-output: stan
gp_example_2D_prior <- stan_model(
file = "topics/04_gp/gp_example_2D_prior.stan")
gp_example_2D_prior
```
```{r}
gp_example_2D_prior_samp <- sampling(gp_example_2D_prior,
data = list(N = nrow(grid_points),
x = grid_points,
rho_a = 5,
rho_b = 14),
chains = 4,
iter = 2000,
refresh = 1000)
```
### visualize the prior
```{r}
## extract the predictors
draws_f2d_prior <- rstan::extract(gp_example_2D_prior_samp, "f")$f
draws_a2d_prior <- as.vector(rstan::extract(gp_example_2D_prior_samp, "a")$a)
draw_idx_2d_prior <- sample(length(draws_a2d_prior), 6)
x_uniq <- sort(unique(grid_points$x))
y_uniq <- sort(unique(grid_points$y))
par(mfrow = c(2, 3), mar = c(1, 1, 1.5, 0.5))
for (k in draw_idx_2d_prior) {
pa_k <- plogis(draws_f2d_prior[k, ] + draws_a2d_prior[k])
z_mat <- matrix(pa_k, nrow = length(x_uniq), ncol = length(y_uniq))
image(x_uniq, y_uniq, z_mat,
col = hcl.colors(100, "Inferno"),
xlab = "", ylab = "", main = "Pr(y=1)", asp = 1)
}
```
:::{.callout-warning}
## CAUTION: Slow
The model below is the slowest model we've seen so far and takes about 5 minutes to run on my laptop.
:::
```{r}
#| class-output: stan
gp_example_pred_2D <- stan_model(
file = "topics/04_gp/gp_example_pred_2D.stan")
gp_example_pred_2D
```
plot the effect in space:
```{r, eval=TRUE}
## sample the model
gp_example_2D_samp <- sampling(gp_example_pred_2D,
data = list(N = length(pwil_spatial$presabs) + nrow(grid_points),
Nobs = length(pwil_spatial$presabs),
x = rbind(pwil_spatial[c("x", "y")], grid_points),
z = pwil_spatial$presabs),
chains = 4,
iter = 1000)
# gp_example_2D_samp$save_object("topics/04_gp/gp_example_2D_samp_pwil.rds")
```
```{r}
# gp_example_2D_samp_pwil <- read_rds("topics/04_gp/gp_example_2D_samp_pwil.rds")
## extract the predictors
Nobs_2d <- length(pwil_spatial$presabs)
draws_f2d <- rstan::extract(gp_example_2D_samp, "f")$f
draws_a2d <- as.vector(rstan::extract(gp_example_2D_samp, "a")$a)
draws_pred_2d <- plogis(draws_f2d[, (Nobs_2d + 1):ncol(draws_f2d)] + draws_a2d)
pa_median_2d <- apply(draws_pred_2d, 2, median)
z_mat_2d <- matrix(pa_median_2d, nrow = length(x_uniq), ncol = length(y_uniq))
image(x_uniq, y_uniq, z_mat_2d,
col = hcl.colors(100, "Inferno"),
xlab = "x", ylab = "y", main = "Pr(y=1) — posterior median", asp = 1)
# overlay observed points
pt_col <- ifelse(pwil_spatial$presabs == 1, "white", "lightblue")
points(pwil_spatial$x, pwil_spatial$y,
pch = 21, cex = 1.5, bg = pt_col, col = "grey40")
```
<!--  -->
## Extensions:
Add water to the model. Does the spatial effect disappear, increase, or stay kind of the same?
Next step: try to model water curve for more than one species. Would it be possible to make the species rho parameters hierarchical?