rm(list =ls())# delete any data that's already loaded into Rconflicts_prefer(dplyr::filter)ggplot2::theme_set(ggplot2::theme_bw()+# ggplot2::labs(col = "") +ggplot2::theme( legend.position ="bottom", text =ggplot2::element_text(size =12, family ="serif")))knitr::opts_chunk$set(message =FALSE)options('digits'=6)panderOptions("big.mark", ",")pander::panderOptions("table.emphasize.rownames", FALSE)pander::panderOptions("table.split.table", Inf)conflicts_prefer(dplyr::filter)# use the `filter()` function from dplyr() by defaultlegend_text_size=9run_graphs=TRUE
The previous appendix showed that, outside conjugate special cases, the posterior distribution has no closed form and its normalizing constant is an intractable integral. This appendix develops Markov chain Monte Carlo (MCMC), the simulation strategy that makes general Bayesian inference practical. The structure parallels (Dobson and Barnett 2018, chap. 13): we explain why ordinary inference fails, introduce Monte Carlo integration and Markov chains, describe the Metropolis–Hastings and Gibbs samplers, and then turn to convergence diagnostics and model comparison.
1 Foundations
When the posterior distribution has a known closed form, we can sample from it directly and compute summaries analytically. In most real-world models, however, the normalizing constant \(p(\tilde{y})\) is intractable, making direct sampling and closed-form summaries impossible. Markov Chain Monte Carlo (MCMC) methods provide a practical solution: instead of sampling independent draws from the exact posterior, they generate a correlated sequence of draws whose long-run distribution converges to the posterior (Dobson and Barnett 2018, chap. 13, p. 287).
1.1 Why the normalizing constant is computationally challenging
The posterior is \(p(\tilde{\theta}\mid \tilde{y}) \propto p(\tilde{y}\mid \tilde{\theta})\, p(\tilde{\theta})\). The normalizing constant is
a high-dimensional integral that typically has no closed form. Computing it numerically becomes infeasible as the dimension of \(\tilde{\theta}\) grows (Dobson and Barnett 2018, chap. 13, p. 287).
MCMC sidesteps this problem: many algorithms require only the ratio\(p(\tilde{\theta}^* \mid \tilde{y}) / p(\tilde{\theta}\mid \tilde{y})\), in which the normalizing constant cancels.
1.2 Monte Carlo integration
Before introducing Markov chains, it helps to understand Monte Carlo integration(Dobson and Barnett 2018, chap. 13, p. 290). If we could draw \(M\) independent samples \(\tilde{\theta}^{(1)}, \ldots, \tilde{\theta}^{(M)} \ \sim_{\operatorname{iid}}\ p(\tilde{\theta}\mid \tilde{y})\), we could approximate any posterior expectation:
In particular, the posterior mean, posterior variance, and credible-interval endpoints are all special cases of this formula.
Exm
Example 1 (Monte Carlo posterior summaries for a Bernoulli model) Recall the Beta-Bernoulli model of the conjugate-prior example, with \(n = 91\) observations and \(r = 55\) successes. The posterior is \(\pi \mid \tilde{y}\sim \operatorname{Beta}(56, 37)\). We draw \(M = 5{,}000\) samples and compare Monte Carlo summaries with exact values.
A Markov chain is a sequence of random variables \(\tilde{\theta}^{(1)}, \tilde{\theta}^{(2)}, \ldots\) such that the distribution of \(\tilde{\theta}^{(t+1)}\) depends only on \(\tilde{\theta}^{(t)}\) (the Markov property).
Definition 1 (Markov chain) A sequence of random variables \(\tilde{\theta}^{(1)}, \tilde{\theta}^{(2)}, \ldots\) is a Markov chain if, for all \(t\),
Under mild conditions, a Markov chain converges to a unique stationary distribution. MCMC algorithms are designed so that the stationary distribution is the target posterior \(p(\tilde{\theta}\mid \tilde{y})\)(Dobson and Barnett 2018, chap. 13, p. 291).
1.4 The Metropolis–Hastings algorithm
The Metropolis–Hastings (MH) algorithm is the most general MCMC method (Dobson and Barnett 2018, chap. 13, p. 293). At each iteration \(t\):
Propose a new value \(\tilde{\theta}^*\) from a proposal distribution\(q(\tilde{\theta}^* \mid \tilde{\theta}^{(t)})\).
Accept the proposal with probability \(\min(1, \alpha)\): if accepted, set \(\tilde{\theta}^{(t+1)} = \tilde{\theta}^*\). otherwise set \(\tilde{\theta}^{(t+1)} = \tilde{\theta}^{(t)}\).
Because the normalizing constant \(p(\tilde{y})\) appears in both numerator and denominator, it cancels in the ratio, so only the unnormalized posterior is needed.
Definition 2 (Metropolis–Hastings algorithm) Given a target posterior \(p(\tilde{\theta}\mid \tilde{y}) \propto p(\tilde{y}\mid \tilde{\theta})\, p(\tilde{\theta})\) and a proposal distribution \(q(\cdot \mid \tilde{\theta})\), one step of the MH algorithm proceeds as:
Set \(\tilde{\theta}^{(t+1)} = \tilde{\theta}^*\) with probability \(\alpha\), else \(\tilde{\theta}^{(t+1)} = \tilde{\theta}^{(t)}\).
Exm
Example 2 (Metropolis–Hastings for a Bernoulli model) We implement the MH algorithm for the same Beta-Bernoulli model as in Example 1. We use a normal random-walk proposal \(\pi^* = \pi^{(t)} + \varepsilon\), \(\varepsilon \sim \operatorname{N}\mathopen{}\left(0, \sigma^2\right)\mathclose{}\).
Show R code
log_post_unnorm<-function(pi, r, n){if(pi<=0||pi>=1)return(-Inf)r*log(pi)+(n-r)*log(1-pi)}mh_bernoulli<-function(n_iter, r, n, pi_init=0.5, sigma=0.05){chain<-numeric(n_iter)chain[1]<-pi_initfor(iinseq(2, n_iter)){pi_curr<-chain[i-1]pi_prop<-pi_curr+rnorm(1, sd =sigma)log_alpha<-log_post_unnorm(pi_prop, r, n)-log_post_unnorm(pi_curr, r, n)chain[i]<-if(log(runif(1))<log_alpha)pi_propelsepi_curr}chain}set.seed(1)mh_chain<-mh_bernoulli(n_iter =5000L, r =55L, n =91L)
1.5 The Gibbs sampler
The Gibbs sampler is a special case of the MH algorithm for multiparameter models (Dobson and Barnett 2018, chap. 13, p. 300). When \(\tilde{\theta}= (\theta_1, \ldots, \theta_K)\), each component is updated in turn by sampling from its full conditional distribution\(p(\theta_k \mid \theta_{-k}, \tilde{y})\), where \(\theta_{-k}\) denotes all components except \(\theta_k\). The resulting proposal is always accepted (acceptance probability 1). Gibbs sampling requires that the full conditionals are available in closed form. JAGS (Just Another Gibbs Sampler) automates this approach for a wide class of models (Dobson and Barnett 2018, chap. 13).
1.6 Burn-in
Definition 3 (burn-in) The burn-in period is the initial segment of an MCMC chain that is discarded before posterior summaries are computed. During burn-in, the chain is moving toward the stationary distribution and does not yet represent draws from the posterior (Dobson and Barnett 2018, chap. 13, p. 301).
The length of the burn-in period is chosen by inspection of trace plots. A common default is to discard the first 10–20% of iterations.
1.7 Convergence diagnostics
Because MCMC chains are correlated and may take time to reach stationarity, we must assess whether the chain has converged before using it for inference (Dobson and Barnett 2018, chap. 13, p. 302).
1.7.1 Trace plots
A trace plot graphs the sampled value \(\tilde{\theta}^{(t)}\) against iteration \(t\). A well-mixed chain shows rapid oscillation around a stable mean, with no obvious trends or stuck regions.
Figure 1: MH chain trace plot; dashed line marks end of burn-in.
1.7.2 Autocorrelation function (ACF)
Because successive draws in an MCMC chain are correlated, the effective sample size (ESS) is less than the number of iterations. The autocorrelation function (ACF) quantifies this dependence. A rapidly decaying ACF indicates good mixing.
1.7.3 Multiple chains and the Gelman–Rubin statistic
Running multiple chains from different starting points provides a stronger convergence check than a single chain. The Gelman–Rubin statistic\(\hat{R}\) compares within-chain and between-chain variance. \(\hat{R}\approx 1\) indicates convergence (Dobson and Barnett 2018, chap. 13, p. 304). A common rule of thumb is \(\hat{R}< 1.1\).
1.8 Posterior inference from MCMC samples
After discarding burn-in, the remaining chain \(\tilde{\theta}^{(B+1)}, \ldots, \tilde{\theta}^{(T)}\) is treated as a (correlated) sample from the posterior. Posterior summaries are computed as sample statistics:
Example 4 (Posterior summaries from MH chain) Using the chain from Example 2, we discard the first 500 iterations as burn-in and compute posterior summaries.
Show R code
burnin<-500Lpost_chain<-mh_chain[seq(burnin+1L, length(mh_chain))]round(c( mean =mean(post_chain), lower =unname(quantile(post_chain, 0.025)), upper =unname(quantile(post_chain, 0.975))), 4)#> mean lower upper #> 0.6014 0.5044 0.6958
2 Comparing MCMC estimates to maximum likelihood
It is reassuring — and a useful check — that for models where both approaches apply, the posterior mean from a well-mixed MCMC chain and the maximum likelihood estimate (MLE) typically agree closely when the prior is weak and the sample size is moderate or large (Dobson and Barnett 2018, chap. 13, p. 298). Intuitively, as \(n\) grows the likelihood dominates the prior, so the posterior concentrates near the MLE, and the posterior standard deviation approaches the frequentist standard error.
Exm
Example 5 (Posterior mean vs. MLE for a Bernoulli model) For the Beta-Bernoulli model with \(r = 55\) successes in \(n = 91\) trials, the MLE is \(\hat\pi = r/n\), while the posterior mean under a uniform prior is \((r+1)/(n+2)\). The two differ only by the prior’s pseudo-counts and converge as \(n\) grows:
A persistent discrepancy between the two, by contrast, is a signal worth investigating: it may reflect a genuinely informative prior, an unconverged chain, or a coding error in the model.
3 The importance of parameterization
How a model is written affects how well its sampler mixes (Dobson and Barnett 2018, chap. 13, p. 299). Two algebraically equivalent parameterizations of the same model can produce chains with very different autocorrelation: strong posterior correlation between parameters slows a component-wise sampler, because each update can only move one coordinate at a time along a narrow, tilted ridge.
Common remedies include centering predictors (subtracting their means) so that intercept and slope are less correlated, and reparameterizing variance components to a scale on which the posterior is more nearly symmetric. These changes leave the model’s likelihood and the scientific conclusions unchanged; they alter only the geometry the sampler must explore.
4 Bayesian model fit: the deviance information criterion
To compare competing Bayesian models we need a measure that balances goodness of fit against complexity, much as Akaike’s information criterion (AIC) does for likelihood-based models. The standard MCMC-based criterion is the deviance information criterion (DIC)(Dobson and Barnett 2018, chap. 13, p. 306).
Let \(D(\tilde{\theta}) = -2 \log p(\tilde{y}\mid \tilde{\theta})\) be the deviance. Write \(\overline{D}\) for its posterior mean (estimated by averaging \(D(\tilde{\theta}^{(t)})\) over the chain) and \(D(\bar{\tilde{\theta}})\) for the deviance evaluated at the posterior mean of the parameters.
Definition 4 (Deviance information criterion) The effective number of parameters and the DIC are
The term \(D(\bar{\tilde{\theta}})\) rewards fit, while \(p_D\) penalizes complexity: it measures how much the deviance is reduced, on average, by letting the parameters adapt to the data. As with AIC, lower DIC is better, and only differences in DIC between models fit to the same data are meaningful. Because \(p_D\) is computed directly from the posterior draws, DIC is a convenient by-product of an MCMC run; it should nonetheless be used with care, as it can behave poorly for models with weakly identified parameters or markedly non-normal posteriors.
References
Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
---title: "Markov Chain Monte Carlo Methods"format: html: default revealjs: output-file: mcmc-methods-slides.html pdf: output-file: mcmc-methods-handout.pdf docx: output-file: mcmc-methods-handout.docx---{{< include shared-config.qmd >}}The [previous appendix](intro-bayes.qmd#sec-bayes-software) showed that,outside conjugate special cases, the posterior distribution has no closedform and its normalizing constant is an intractable integral.This appendix develops **Markov chain Monte Carlo (MCMC)**, thesimulation strategy that makes general Bayesian inference practical.The structure parallels [@dobson4e, Chapter 13]:we explain why ordinary inference fails, introduce Monte Carlointegration and Markov chains, describe the Metropolis–Hastings andGibbs samplers, and then turn to convergence diagnostics and modelcomparison.# Foundations {#sec-mcmc-methods}{{< include _subfiles/intro-bayes/_sec-mcmc.qmd >}}# Comparing MCMC estimates to maximum likelihood {#sec-mcmc-vs-mle}It is reassuring — and a useful check — that for models where bothapproaches apply, the **posterior mean** from a well-mixed MCMC chainand the **maximum likelihood estimate** (MLE) typically agree closelywhen the prior is weak and the sample size is moderate or large[@dobson4e, Chapter 13, p. 298].Intuitively, as $n$ grows the likelihood dominates the prior, so theposterior concentrates near the MLE, and the posterior standarddeviation approaches the frequentist standard error.:::{#exm-mcmc-vs-mle}#### Posterior mean vs. MLE for a Bernoulli modelFor the Beta-Bernoulli model with $r = 55$ successes in $n = 91$ trials,the MLE is $\hat\pi = r/n$, while the posterior mean under a uniformprior is $(r+1)/(n+2)$. The two differ only by the prior's pseudo-countsand converge as $n$ grows:```{r}#| code-fold: showr_obs <-55Ln_obs <-91Lround(c(mle = r_obs / n_obs,posterior_mean = (r_obs +1) / (n_obs +2)), 4)```:::A persistent discrepancy between the two, by contrast, is a signal worthinvestigating: it may reflect a genuinely informative prior, anunconverged chain, or a coding error in the model.# The importance of parameterization {#sec-mcmc-parameterization}How a model is written affects how well its sampler mixes[@dobson4e, Chapter 13, p. 299].Two algebraically equivalent parameterizations of the same model canproduce chains with very different autocorrelation:strong posterior correlation between parameters slows acomponent-wise sampler, because each update can only move one coordinateat a time along a narrow, tilted ridge.Common remedies include **centering** predictors (subtracting theirmeans) so that intercept and slope are less correlated, and**reparameterizing** variance components to a scale on which theposterior is more nearly symmetric.These changes leave the model's likelihood and the scientificconclusions unchanged; they alter only the geometry the sampler mustexplore.# Bayesian model fit: the deviance information criterion {#sec-mcmc-dic}To compare competing Bayesian models we need a measure that balancesgoodness of fit against complexity, much as Akaike's informationcriterion (AIC) does for likelihood-based models.The standard MCMC-based criterion is the **deviance informationcriterion (DIC)** [@dobson4e, Chapter 13, p. 306].Let $D(\vth) = -2 \log p(\vy \mid \vth)$ be the **deviance**.Write $\overline{D}$ for its posterior mean (estimated by averaging$D(\vth^{(t)})$ over the chain) and $D(\bar{\vth})$ for thedeviance evaluated at the posterior mean of the parameters.:::{#def-dic}#### Deviance information criterionThe **effective number of parameters** and the **DIC** are$$\bap_D &\eqdef \overline{D} - D(\bar{\vth}),\\\operatorname{DIC} &\eqdef D(\bar{\vth}) + 2\,p_D = \overline{D} + p_D.\ea$$:::The term $D(\bar{\vth})$ rewards fit, while $p_D$ penalizescomplexity: it measures how much the deviance is reduced, on average, byletting the parameters adapt to the data.As with AIC, **lower DIC is better**, and only differences in DICbetween models fit to the same data are meaningful.Because $p_D$ is computed directly from the posterior draws, DIC is aconvenient by-product of an MCMC run; it should nonetheless be used withcare, as it can behave poorly for models with weakly identifiedparameters or markedly non-normal posteriors.# References {.unnumbered}::: {#refs}:::