class: title-slide, middle

<style type="text/css">
.title-slide {
  background-image: url('assets/img/bg.jpg');
  background-color: #23373B;
  background-size: contain;
  border: 0px;
  background-position: 600px 0;
  line-height: 1;
}
</style>

# Sampling the posterior with MCMC

<hr width="65%" align="left" size="0.3" color="orange"></hr>

## Rejection sampling, Metropolis-Hastings etc.

<hr width="65%" align="left" size="0.3" color="orange" style="margin-bottom:40px;" alt="@Martin Sanchez"></hr>

.instructors[
  **ECL707/807** - Dominique Gravel
]

<img src="assets/img/logo.png" width="25%" style="margin-top:20px;"></img>
<img src="assets/img/Bios2_reverse.png" width="23%" style="margin-top:20px;margin-left:35px"></img>

---

# Objective

The objective of an MCMC algorithm is to draw samples of parameters from a complex posterior distribution

---

# Outline

- Understand the principle of rejection sampling
- Get an intuition of a multivariate posterior distribution
- Introduction to Markov Chain Monte Carlo
- Run a simple example of Gibbs sampling

---

# Bayes theorem

$$ P(\theta|D,H) = \frac{P(D|\theta,H)P(\theta|H)}{P(D|H)} $$

$$ posterior = \frac{likelihood \times prior}{evidence} $$

The problem : how can we compute the posterior distribution without knowledge of the denominator ?

---

# Dealing with the normalization constant

1. Use the proportional form :

$$ P[\theta \mid X] \propto P[X \mid \theta]P[\theta] $$

2. Use conjugacy

3. Sample from the posterior

---

# Rejection sampling

1. Use an easily sampled distribution (the candidate distribution `\(c(x)\)`)

2. Evaluate the probability of the candidate according to a target distribution `\(t(x)\)`

3. 
Reject candidates with a probability proportional to the difference between the two distributions

---

# Rejection sampling

**Pseudo-code**

```
DEFINE candidate distribution c(x)
DEFINE target distribution t(x)
DEFINE constant M such that M*c(x) >= t(x) for all x
```

---

# Rejection sampling

Acceptance probability :

$$p = \frac{t(x)}{Mc(x)} $$

---

# Rejection sampling

**Pseudo-code**

```
REPEAT
    DRAW sample X from c(x)
    COMPUTE acceptance probability p = t(X)/(M*c(X))
    DRAW random value rand from uniform on [0,1]
    IF rand < p accept X
    ELSE reject X
UNTIL X is accepted
```

---

# Rejection sampling

**Exercise 1**

Write a function that uses rejection sampling to return `\(n\)` random normal deviates using only a uniform random number generator as a source of randomness.

Note : you can still use *dnorm* to evaluate your target distribution.

---

# Rejection sampling

```r
rej_norm <- function(n, mu, sig) {

    # Define the target
    target <- function(x, mu, sig) dnorm(x, mu, sig)

    # Define the candidate : uniform on [mu - 4*sig, mu + 4*sig]
    cand <- function(x, mu, sig) dunif(x, mu - 4*sig, mu + 4*sig)

    # Constant such that M*cand(x) >= target(x) over the candidate's support
    # (the target peaks at ~0.40/sig, while M*cand is 5/(8*sig) ~ 0.62/sig)
    M <- 5

    # Object to store the n results
    result <- numeric(n)

    # Main loop
    for(i in 1:n) {
        accept <- FALSE
        while(!accept) {
            X <- runif(1, mu - 4*sig, mu + 4*sig)
            p <- target(X, mu, sig) / (M * cand(X, mu, sig))
            rand <- runif(1)
            accept <- (rand < p)
        } #endwhile
        result[i] <- X
    } #endfor
    return(result)
} #endfunction
```

---

# Metropolis-Hastings algorithm

- A very general technique to explore the parameter space
- Simulated annealing is a special case of MH
- Unlike rejection sampling, each iteration generates a sample from the target distribution
- Samples are dependent (autocorrelated), so effective sample size is much smaller than chain length
- For a Markov Chain `\(X\)`, draw `\(X_t\)` from candidate distribution `\(c(x|X_{t-1})\)` and compare to target distribution `\(t(x)\)`
- Better propositions are not deterministically accepted
- The posterior probability is stationary after a *burn-in* period

---

# Acceptance probability

Full definition is :

`$$p = 
\frac{t(X_{t})c(X_{t-1}|X_{t})}{t(X_{t-1})c(X_{t}|X_{t-1})}$$`

Where `\(X_t\)` is the current candidate value and `\(X_{t-1}\)` is the previous value.

Note that if `\(t(x)\)` is a posterior distribution, the normalization constant cancels.

---

# Metropolis-Hastings algorithm

```r
# metropolis <- function(N = 5000, X0 = 0, A) {

# N: number of MCMC iterations
# X0: starting value of the chain
# A: step parameter

# Object to store results
chain <- numeric(N + 1)
chain[1] <- X0

#}
```

---

# Metropolis-Hastings algorithm

```r
# DEFINE candidate distribution
cand_fn <- function(y, A) runif(1, y - A, y + A)

# DEFINE target distribution
target_fn <- function(x) dnorm(x, 0, 1)

# DEFINE acceptance probability function
accept_fn <- function(current.value, previous.value)
    target_fn(current.value) / target_fn(previous.value)
```

---

# Acceptance probability

- What to use for `\(c(x|y)\)`?
- Very flexible, but must allow for positive recurrence of the chain
- An example is the uniform distribution centered on the previous value
- A nice property of the uniform is that `\(c(x \mid y) = c(y \mid x)\)`, so:

`$$p = \frac{t(X_{t})}{t(X_{t-1})}$$`

---

# Metropolis-Hastings algorithm

```r
for(step in 2:(N + 1)) {

    # PROPOSE candidate
    current.value <- cand_fn(chain[step - 1], A)

    # COMPUTE acceptance probability
    p <- accept_fn(current.value, chain[step - 1])

    # DO rejection sampling
    rand <- runif(1)
    if(rand < p) {
        chain[step] <- current.value
    } else {
        chain[step] <- chain[step - 1]
    }
}
```

---

# Chick survival

The Fake Petrel ( _Hydrobates inventus_ ) lays precisely 6 eggs. You have data on survivorship from 10 different nests:

```r
set.seed(1859)
x_obs <- rbinom(10, 6, 0.6)
x_obs
```

```
## [1] 4 4 5 4 4 4 5 4 4 6
```

Our model is:

$$
`\begin{align}
\text{hatched}_{i} &\sim \text{Binomial}(\theta, 6) \\
\theta &\sim \text{Uniform}(0,1)
\end{align}`
$$

---

# Exercise 2

- Copy the MH code and adapt it to evaluate the parameter of the petrel example (binomial model)
- Plot the chain to observe convergence
- Change initial values to compare

---

# Gibbs sampling

- Gibbs sampling is a generalization of MCMC for _joint distributions_
- Objective: Sample from `\([\theta|X]\)` where `\(X\)` is data and :

`$$\theta = [\theta_1, \theta_2, \theta_3, \ldots, \theta_k]$$`

- Very flexible: define samplers for each `\(\theta_i\)`
- Formally, we use a Metropolis-within-Gibbs scheme, where Metropolis-Hastings is the sampler when no easier sampler can be identified

---

# Simple bivariate example

- We will consider data `\(X = \{21.4, 17.64, 18.31, 15.12, 14.40, 15, 19.59, 15.06, 15.71, 14.65\}\)`
- Objective : estimate posterior mean `\(\mu\)` and standard deviation `\(\sigma\)` given prior estimates :
  + `\([\mu] = \mathrm{N}(14.5, 0.6)\)` with parameters `\((\mu_0, \sigma_0)\)`
  + `\([\tau] = \left [ \frac{1}{\sigma^2} \right ] = \Gamma \left ( 1.08367, 3.068 \right )\)` with parameters `\((\alpha_{\sigma}, \beta_{\sigma})\)`

---

# Proportional form

`$$[{\theta} \mid {X} ] \propto [ {X} \mid {\theta} ] [ {\theta}]$$`

Where :

`$$[ {X} \mid {\theta} ] = \prod_{i = 1}^{n} \frac{1}{\hat{\sigma}\sqrt{2\pi}} e^{-\frac{\left (X_i - \hat{\mu}\right )^{2}}{2\hat{\sigma}^{2}}} = \prod_{i = 1}^{n} \mathrm{dnorm}(X_i, \hat{\mu}, \hat{\sigma})$$`

And :

.small[
`$$[{\theta}] = [\hat{\mu} \mid \mu_0, \sigma_0][\hat{\sigma} \mid \alpha, \beta] = \mathrm{dnorm}(\hat{\mu}, \mu_0, \sigma_0) \times \\ \mathrm{dgamma} \left (\hat{\tau} = \frac{1}{\hat{\sigma}^2}, \alpha, \beta \right )$$`
]

---

# Sampler

```r
mh.sampler <- function(previous, A, target, prior, other.params, data,
                       cand_fn = function(x, A) runif(1, x - A, x + A)) {
    # Propose and select candidate values using the Metropolis-Hastings algorithm
    # cand_fn: the function used to draw candidate samples
    # previous: the previous state in the chain
    # A: the tuning parameter
    # target: function returning the target density at x
    candidate.value <- cand_fn(previous, A)
    p <- mh.acceptance(candidate.value, previous, target, prior, other.params, data)
    U <- runif(1)
    if (U < p) {
        result <- candidate.value
    } else {
        result <- previous
    }
    return(result)
}
```

---

# Conditional mu

```r
mu.conditional <- function(mu, mu.prior, sigma, data) {

    # Likelihood (computed on the log scale, then exponentiated)
    lik <- exp(sum(dnorm(data, mu, sigma, log = TRUE)))

    # Prior (normal kernel; the normalization constant cancels in the MH ratio)
    prior <- exp(-(mu - mu.prior[1])^2 / (2 * mu.prior[2]^2))

    # Return the conditional density of mu
    return(lik * prior)
}
```

---

# Conditional sigma

```r
sigma.conditional <- function(sigma, tau.prior, mu, data) {

    # Make sure sigma is positive (zero density otherwise)
    if (sigma <= 0) {
        return(0)
    } else {
        # Likelihood (computed on the log scale, then exponentiated)
        lik <- exp(sum(dnorm(data, mu, sigma, log = TRUE)))

        # Gamma prior kernel on tau = 1/sigma^2 : tau^(alpha - 1) * exp(-beta * tau)
        prior <- exp((tau.prior[1] - 1) * log(tau(sigma)) - tau(sigma) * tau.prior[2])
    }

    return(lik * prior)
}
```

---

# Acceptance

```r
mh.acceptance <- function(current, previous, target, prior, other.params, data) {

    # target: function returning the target density at parameter x
    num <- target(current, prior, other.params, data)
    denom <- target(previous, prior, other.params, data)

    # If the previous state has zero density, always accept the candidate
    if (denom == 0) {
        return(1)
    } else {
        return(num / denom)
    }
}
```

---

# Run the whole thing

```r
N <- 5000 # number of iterations of the mcmc
starting <- c(10, 4) # starting values for mu and sigma
A <- c(1.4, 0.8) # tuning parameters for the candidate function
X <- c(21.4, 17.64, 18.31, 15.12, 14.40, 15, 19.59, 15.06, 15.71, 14.65) # data set
mu.prior <- c(14.5, 0.6) # prior parameters for the mean (normal distribution)
tau.prior <- c(1.28367, 3.068) # prior parameters for tau = 1/sigma^2 (gamma distribution)

## Small function converting sigma to the precision tau
tau <- function(sigma) 1/(sigma^2)

# Object to store the results
chain <- matrix(nrow = N + 1, ncol = 2)
chain[1,] <- starting

# Run the chain
for(t in 2:(N + 1)) {

    # Sample for mu, fixing sigma
    chain[t, 1] <- mh.sampler(previous = chain[t-1, 1], A = A[1],
        target = mu.conditional, prior = 
mu.prior, other.params = chain[t-1, 2], data = X)

    # Sample for sigma, fixing mu
    chain[t, 2] <- mh.sampler(previous = chain[t-1, 2], A = A[2],
        target = sigma.conditional, prior = tau.prior,
        other.params = chain[t-1, 1], data = X)
}
```

---

# Exercise 3

That was a big chunk of code and we are at day 9 .... Let's try to implement this. Run the code and check the following :

- Run multiple chains and overlay the series
- Change the initial values and see how the chain behaves. Start far from the true values
- Adjust the parameter `\(A\)` to see how it affects convergence time
- Plot the histogram of the values
- Check the relationship between mu and sigma

---

# Exercise 4

We will try to adapt the code to the hemlock example.

- Use your simulated data so that you know precisely what the true parameters are.
- Specify an informative prior that is not too far from the truth, with a reasonable dispersion around the expected value.
- Revisit the **conditional** functions for each parameter theta
- Adapt the chain to account for the extra parameter
- Run the chain and plot the histogram of the different parameters
- Compare to the parameters used to simulate the data
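
---

# Appendix : minimal Metropolis-Hastings sketch

As a compact reference, here is a minimal, self-contained version of the Metropolis-Hastings loop from the slides, targeting a standard normal with a symmetric uniform candidate. The chain length, step size `A` and burn-in below are illustrative choices, not prescriptions.

```r
set.seed(42)

N <- 10000                                # number of MCMC iterations
A <- 1.5                                  # half-width of the uniform candidate
target_fn <- function(x) dnorm(x, 0, 1)   # target density

chain <- numeric(N + 1)
chain[1] <- 0                             # starting value

for (step in 2:(N + 1)) {
    # PROPOSE a candidate from the symmetric uniform kernel
    candidate <- runif(1, chain[step - 1] - A, chain[step - 1] + A)

    # Symmetric candidate: the acceptance ratio reduces to a ratio of targets
    p <- target_fn(candidate) / target_fn(chain[step - 1])

    # ACCEPT the candidate, or keep the previous value
    if (runif(1) < p) {
        chain[step] <- candidate
    } else {
        chain[step] <- chain[step - 1]
    }
}

# After a burn-in, the remaining draws approximate the target N(0, 1)
posterior <- chain[-(1:1000)]
c(mean(posterior), sd(posterior))
```

The same skeleton underlies the petrel and bivariate examples: only `target_fn` (and, for several parameters, the loop over conditionals) changes.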