Reference: Non-Reversible Parallel Tempering on Optimized Paths
Let \(\pi_0, \pi_1\) be some probability density functions.
Parallel tempering introduces intermediate probability distributions
\[\pi_{\beta_n} \propto \pi_0^{1 - \beta_n} \pi_1^{\beta_n}\]
where \(0=\beta_0<\beta_1<\cdots<\beta_N=1\) is a sequence of inverse temperatures that controls the ‘annealing schedule’. Here \(\pi_0\) is the reference distribution and \(\pi_1\) is the target distribution.
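As a concrete sketch of the tempered path, evaluated in log space for numerical stability (the standard-normal reference and bimodal target below are illustrative assumptions, not from the text):

```python
import numpy as np

def log_pi0(x):
    # log density of the reference pi_0: standard normal (up to a constant)
    return -0.5 * x**2

def log_pi1(x):
    # log density of an example bimodal target pi_1 (up to a constant)
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

def log_pi_beta(x, beta):
    # log pi_beta = (1 - beta) * log pi_0 + beta * log pi_1
    return (1.0 - beta) * log_pi0(x) + beta * log_pi1(x)

# Evenly spaced annealing schedule 0 = beta_0 < ... < beta_N = 1
N = 10
betas = np.linspace(0.0, 1.0, N + 1)
```

In practice the schedule need not be evenly spaced; optimizing the placement of the \(\beta_n\) is a central theme of the paper.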
To facilitate sampling, PT runs a single higher-dimensional Markov chain that tracks all of these intermediate distributions at once; its state space is \(X^{N+1}\):
\[X_t = (X_t^0, X_t^1, \ldots, X_t^N) \in X^{N+1}\]
which is constructed in a precise way so that the stationary distribution is the product of the intermediate distributions:
\[\pi((x^i)) = \prod_{i=0}^N \pi_{\beta_i}(x^i).\]
Clearly, applying any MCMC scheme to each \(x^i\) independently preserves this stationary distribution. The interesting part of PT is the ‘swap’ step, which lets samples move between the intermediate distributions.
The local exploration kernel is just the product of independent kernels for each \(x^i\):
\[K(x,\mathrm{d}x) = \prod_{i=0}^N K_i(x^i, \mathrm{d}x^i)\]
where \(K_i\) is a kernel that leaves the \(i\)-th intermediate distribution \(\pi_{\beta_i}\) invariant. One can pick \(K_0\) to be an independence sampler, \(K_0(x^0, \cdot) = \pi_0(\cdot)\).
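A local exploration sweep might, for instance, apply one random-walk Metropolis step per component; because each \(K_i\) leaves its own \(\pi_{\beta_i}\) invariant, the product leaves \(\pi\) invariant. The step size and density functions here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_mh_step(x, log_density, step=0.5, rng=rng):
    # One random-walk Metropolis step targeting `log_density`.
    prop = x + step * rng.normal()
    if np.log(rng.uniform()) < log_density(prop) - log_density(x):
        return prop
    return x

def local_exploration(xs, log_densities, rng=rng):
    # Product kernel K = prod_i K_i: update each component x^i with a
    # kernel invariant for its own pi_{beta_i}. Component 0 could instead
    # draw x^0 ~ pi_0 directly (the independence-sampler choice for K_0).
    return [rw_mh_step(x, ld, rng=rng) for x, ld in zip(xs, log_densities)]
```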
Swaps are the fundamental building blocks of a communication scheme. Denote by \(s_{n,n+1}\) the swap map \((x^0, \ldots, x^n, x^{n+1}, \ldots, x^N) \mapsto (x^0, \ldots, x^{n+1}, x^n, \ldots, x^N)\). The Metropolis–Hastings swap kernel is given by
\[K^{n,n+1}(x,\mathrm{d}x) = (1-\alpha_{n,n+1}(x)) \delta_x(\mathrm{d}x) + \alpha_{n,n+1}(x) \delta_{s_{n,n+1}(x)}(\mathrm{d}x)\]
where \(\alpha_{n,n+1}(x)\) is given by the classical Metropolis-Hastings acceptance ratio
\[\alpha_{n,n+1}(x) = \frac{\pi(s_{n,n+1}(x))}{\pi(x)} \wedge 1. \]
If we write \(\pi_{\beta_i}(x^i) = \exp(W_i(x^i))/Z_i\), then clearly \(\pi(x) = \exp\left(\sum_{i=0}^N W_i(x^i)\right) / Z\), where \(Z = \prod_{i=0}^N Z_i\). This means
\[\begin{align*} \alpha_{n,n+1}(x) &= \exp\left(W_{n+1}(x^n)+W_n(x^{n+1}) -W_{n+1}(x^{n+1})-W_n(x^n)\right) \wedge 1 \\ &= \exp\left(\Delta W_n(x^n) - \Delta W_n(x^{n+1})\right) \wedge 1 \end{align*}\]
where \(\Delta W_n = W_{n+1} - W_n\) is the difference of the log densities.
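The acceptance probability thus depends only on four log-density evaluations. A minimal helper (the function and argument names are hypothetical) might be:

```python
import numpy as np

def swap_accept_prob(W_n_xn, W_n_xnp1, W_np1_xn, W_np1_xnp1):
    # alpha_{n,n+1} = min(1, exp(DeltaW_n(x^n) - DeltaW_n(x^{n+1})))
    # where DeltaW_n = W_{n+1} - W_n is the log-density difference.
    dW_xn = W_np1_xn - W_n_xn        # DeltaW_n(x^n)
    dW_xnp1 = W_np1_xnp1 - W_n_xnp1  # DeltaW_n(x^{n+1})
    return min(1.0, np.exp(dW_xn - dW_xnp1))
```

Note that the normalizing constants \(Z_i\) cancel, so only unnormalized log densities are needed.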
Let \(K^0 = \prod_{n\equiv 0\mod 2} K^{n,n+1}\) and \(K^1 = \prod_{n\equiv 1\mod 2} K^{n,n+1}\) be the even and odd swap kernels, respectively. Intuitively, they attempt swaps only on even-indexed or odd-indexed pairs; since the pairs within each kernel are disjoint, those swaps can be attempted in parallel.
The Stochastic Even/Odd (SEO) swap kernel mixes \(K^0\) and \(K^1\) with equal probability, while the Deterministic Even/Odd (DEO) swap kernel alternates between them across iterations \(t\). This can be expressed as
\[K^{\mathrm{SEO}}(x,\mathrm{d}x) = \frac{1}{2} K^0(x,\mathrm{d}x) + \frac{1}{2} K^1(x,\mathrm{d}x)\]
\[K_t^{\mathrm{DEO}}(x,\mathrm{d}x) = \begin{cases} K^0(x,\mathrm{d}x) & \text{if } t \text{ is even} \\ K^1(x,\mathrm{d}x) & \text{if } t \text{ is odd} \end{cases}\]
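A DEO communication sweep can be sketched as follows, assuming each \(W_i\) is available as a log-density function (the function names are assumptions, not from the text):

```python
import numpy as np

def deo_sweep(xs, log_Ws, t, rng):
    # Deterministic even/odd communication: at sweep t, attempt swaps on
    # pairs (n, n+1) with n even when t is even, n odd when t is odd.
    parity = t % 2
    xs = list(xs)
    for n in range(parity, len(xs) - 1, 2):
        # DeltaW_n(x^n) - DeltaW_n(x^{n+1}), with DeltaW_n = W_{n+1} - W_n
        dW_xn = log_Ws[n + 1](xs[n]) - log_Ws[n](xs[n])
        dW_xnp1 = log_Ws[n + 1](xs[n + 1]) - log_Ws[n](xs[n + 1])
        if np.log(rng.uniform()) < dW_xn - dW_xnp1:
            xs[n], xs[n + 1] = xs[n + 1], xs[n]
    return xs
```

When all \(W_i\) coincide, every attempted swap is accepted, so the even sweep swaps pairs \((0,1), (2,3), \ldots\) and the odd sweep swaps \((1,2), (3,4), \ldots\).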
The parallel tempering kernel is just a composition of a local exploration kernel and a swap (communication) kernel:
\[K^{\mathrm{PT}} = K^{\mathrm{local}} \circ K^{\mathrm{comm}}\]
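Putting the pieces together, a minimal end-to-end sketch of \(K^{\mathrm{PT}}\) with DEO communication (the bimodal target, step size, and random-walk local moves are all illustrative assumptions):

```python
import numpy as np

def log_pi0(x):
    return -0.5 * x**2  # standard-normal reference (up to a constant)

def log_pi1(x):
    # example bimodal target (up to a constant)
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

N = 5
betas = np.linspace(0.0, 1.0, N + 1)
# W_i(x) = (1 - beta_i) log pi_0(x) + beta_i log pi_1(x)
log_pis = [lambda x, b=b: (1 - b) * log_pi0(x) + b * log_pi1(x) for b in betas]

def pt_step(xs, t, rng, step=0.8):
    # Local exploration: one random-walk Metropolis move per component.
    xs = list(xs)
    for i, ld in enumerate(log_pis):
        prop = xs[i] + step * rng.normal()
        if np.log(rng.uniform()) < ld(prop) - ld(xs[i]):
            xs[i] = prop
    # DEO communication: even pairs on even t, odd pairs on odd t.
    for n in range(t % 2, N, 2):
        delta = (log_pis[n + 1](xs[n]) - log_pis[n](xs[n])) \
              - (log_pis[n + 1](xs[n + 1]) - log_pis[n](xs[n + 1]))
        if np.log(rng.uniform()) < delta:
            xs[n], xs[n + 1] = xs[n + 1], xs[n]
    return xs

rng = np.random.default_rng(1)
xs = [0.0] * (N + 1)
for t in range(200):
    xs = pt_step(xs, t, rng)
```

The component \(x^N\), which targets \(\pi_1\), is the one whose samples are kept; the lower-\(\beta\) components exist only to help it traverse between modes.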