Reference: Non-Reversible Parallel Tempering on Optimized Paths
Let \(\pi_0, \pi_1\) be some probability density functions.
Parallel tempering introduces intermediate probability distributions
\[\pi_{\beta_n} \propto \pi_0^{1 - \beta_n} \pi_1^{\beta_n}\]
where \(0=\beta_0<\beta_1<\cdots<\beta_N=1\) is a sequence of inverse temperatures that controls the ‘annealing schedule’. Here \(\pi_0\) is the reference distribution and \(\pi_1\) is the target distribution.
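As a concrete sketch of the tempered path, evaluated in log space for numerical stability (the standard-normal reference and bimodal target below are illustrative assumptions, not from the text):

```python
import numpy as np

def log_pi0(x):
    # log density of the reference pi_0: standard normal (up to a constant)
    return -0.5 * x**2

def log_pi1(x):
    # log density of an example bimodal target pi_1 (up to a constant)
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

def log_pi_beta(x, beta):
    # log pi_beta = (1 - beta) * log pi_0 + beta * log pi_1
    return (1.0 - beta) * log_pi0(x) + beta * log_pi1(x)

# Evenly spaced annealing schedule 0 = beta_0 < ... < beta_N = 1
N = 10
betas = np.linspace(0.0, 1.0, N + 1)
```

In practice the schedule need not be evenly spaced; optimizing the placement of the \(\beta_n\) is a central theme of the paper.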
To facilitate sampling, PT runs a single higher-dimensional Markov chain that tracks all of these intermediate distributions at once; its state space is \(X^{N+1}\):
\[X_t = (X_t^0, X_t^1, \ldots, X_t^N) \in X^{N+1}\]
which is constructed in a precise way so that the stationary distribution is the product of the intermediate distributions:
\[\pi((x^i)) = \prod_{i=0}^N \pi_{\beta_i}(x^i).\]
Clearly, applying any MCMC scheme to each \(x^i\) independently preserves this stationary distribution. The interesting part of PT is the ‘swap’ step, which lets samples move between the intermediate distributions.
The local exploration kernel is just the product of independent kernels for each \(x^i\):
\[K(x,\mathrm{d}x) = \prod_{i=0}^N K_i(x^i, \mathrm{d}x^i)\]
where \(K_i\) is a kernel that leaves the \(i\)-th intermediate distribution \(\pi_{\beta_i}\) invariant. One can pick \(K_0\) to be an independence sampler, \(K_0(x^0, \cdot) = \pi_0(\cdot)\).
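A local exploration sweep might, for instance, apply one random-walk Metropolis step per component; because each \(K_i\) leaves its own \(\pi_{\beta_i}\) invariant, the product leaves \(\pi\) invariant. The step size and density functions here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_mh_step(x, log_density, step=0.5, rng=rng):
    # One random-walk Metropolis step targeting `log_density`.
    prop = x + step * rng.normal()
    if np.log(rng.uniform()) < log_density(prop) - log_density(x):
        return prop
    return x

def local_exploration(xs, log_densities, rng=rng):
    # Product kernel K = prod_i K_i: update each component x^i with a
    # kernel invariant for its own pi_{beta_i}. Component 0 could instead
    # draw x^0 ~ pi_0 directly (the independence-sampler choice for K_0).
    return [rw_mh_step(x, ld, rng=rng) for x, ld in zip(xs, log_densities)]
```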
Swaps are the fundamental building blocks of a communication scheme. Denote by \(s_{n,n+1}\) the swap map \((x^0, \ldots, x^n, x^{n+1}, \ldots, x^N) \mapsto (x^0, \ldots, x^{n+1}, x^n, \ldots, x^N)\). The Metropolis–Hastings swap kernel is given by
\[K^{n,n+1}(x,\mathrm{d}x) = (1-\alpha_{n,n+1}(x)) \delta_x(\mathrm{d}x) + \alpha_{n,n+1}(x) \delta_{s_{n,n+1}(x)}(\mathrm{d}x)\]
where \(\alpha_{n,n+1}(x)\) is given by the classical Metropolis-Hastings acceptance ratio
\[\alpha_{n,n+1}(x) = \frac{\pi(s_{n,n+1}(x))}{\pi(x)} \wedge 1. \]
If we write \(\pi_{\beta_i}(x^i) = \exp(W_i(x^i))/Z_i\), then clearly \(\pi(x) = \exp\left(\sum_{i=0}^N W_i(x^i)\right) / Z\), where \(Z = \prod_{i=0}^N Z_i\). This means
\[\begin{align*} \alpha_{n,n+1}(x) &= \exp\left(W_{n+1}(x^n)+W_n(x^{n+1}) -W_{n+1}(x^{n+1})-W_n(x^n)\right) \wedge 1 \\ &= \exp\left(\Delta W_n(x^n) - \Delta W_n(x^{n+1})\right) \wedge 1 \end{align*}\]
where \(\Delta W_n = W_{n+1} - W_n\) is the difference of the log densities.
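The acceptance probability thus depends only on four log-density evaluations. A minimal helper (the function and argument names are hypothetical) might be:

```python
import numpy as np

def swap_accept_prob(W_n_xn, W_n_xnp1, W_np1_xn, W_np1_xnp1):
    # alpha_{n,n+1} = min(1, exp(DeltaW_n(x^n) - DeltaW_n(x^{n+1})))
    # where DeltaW_n = W_{n+1} - W_n is the log-density difference.
    dW_xn = W_np1_xn - W_n_xn        # DeltaW_n(x^n)
    dW_xnp1 = W_np1_xnp1 - W_n_xnp1  # DeltaW_n(x^{n+1})
    return min(1.0, np.exp(dW_xn - dW_xnp1))
```

Note that the normalizing constants \(Z_i\) cancel, so only unnormalized log densities are needed.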
Let \(K^0 = \prod_{n\equiv 0\mod 2} K^{n,n+1}\) and \(K^1 = \prod_{n\equiv 1\mod 2} K^{n,n+1}\) be the even and odd swap kernels, respectively. Intuitively, they attempt swaps only on even-indexed or odd-indexed pairs; since the pairs within each kernel are disjoint, those swaps can be attempted in parallel.
The Stochastic Even/Odd (SEO) swap kernel mixes \(K^0\) and \(K^1\) with equal probability, while the Deterministic Even/Odd (DEO) swap kernel alternates between them across iterations \(t\). This can be expressed as
\[K^{\mathrm{SEO}}(x,\mathrm{d}x) = \frac{1}{2} K^0(x,\mathrm{d}x) + \frac{1}{2} K^1(x,\mathrm{d}x)\]
\[K_t^{\mathrm{DEO}}(x,\mathrm{d}x) = \begin{cases} K^0(x,\mathrm{d}x) & \text{if } t \text{ is even} \\ K^1(x,\mathrm{d}x) & \text{if } t \text{ is odd} \end{cases}\]
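A DEO communication sweep can be sketched as follows, assuming each \(W_i\) is available as a log-density function (the function names are assumptions, not from the text):

```python
import numpy as np

def deo_sweep(xs, log_Ws, t, rng):
    # Deterministic even/odd communication: at sweep t, attempt swaps on
    # pairs (n, n+1) with n even when t is even, n odd when t is odd.
    parity = t % 2
    xs = list(xs)
    for n in range(parity, len(xs) - 1, 2):
        # DeltaW_n(x^n) - DeltaW_n(x^{n+1}), with DeltaW_n = W_{n+1} - W_n
        dW_xn = log_Ws[n + 1](xs[n]) - log_Ws[n](xs[n])
        dW_xnp1 = log_Ws[n + 1](xs[n + 1]) - log_Ws[n](xs[n + 1])
        if np.log(rng.uniform()) < dW_xn - dW_xnp1:
            xs[n], xs[n + 1] = xs[n + 1], xs[n]
    return xs
```

When all \(W_i\) coincide, every attempted swap is accepted, so the even sweep swaps pairs \((0,1), (2,3), \ldots\) and the odd sweep swaps \((1,2), (3,4), \ldots\).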
The parallel tempering kernel is just a composition of a local exploration kernel and a swap (communication) kernel:
\[K^{\mathrm{PT}} = K^{\mathrm{local}} \circ K^{\mathrm{comm}}\]
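Putting the pieces together, a minimal end-to-end sketch of \(K^{\mathrm{PT}}\) with DEO communication (the bimodal target, step size, and random-walk local moves are all illustrative assumptions):

```python
import numpy as np

def log_pi0(x):
    return -0.5 * x**2  # standard-normal reference (up to a constant)

def log_pi1(x):
    # example bimodal target (up to a constant)
    return np.logaddexp(-0.5 * (x - 3.0)**2, -0.5 * (x + 3.0)**2)

N = 5
betas = np.linspace(0.0, 1.0, N + 1)
# W_i(x) = (1 - beta_i) log pi_0(x) + beta_i log pi_1(x)
log_pis = [lambda x, b=b: (1 - b) * log_pi0(x) + b * log_pi1(x) for b in betas]

def pt_step(xs, t, rng, step=0.8):
    # Local exploration: one random-walk Metropolis move per component.
    xs = list(xs)
    for i, ld in enumerate(log_pis):
        prop = xs[i] + step * rng.normal()
        if np.log(rng.uniform()) < ld(prop) - ld(xs[i]):
            xs[i] = prop
    # DEO communication: even pairs on even t, odd pairs on odd t.
    for n in range(t % 2, N, 2):
        delta = (log_pis[n + 1](xs[n]) - log_pis[n](xs[n])) \
              - (log_pis[n + 1](xs[n + 1]) - log_pis[n](xs[n + 1]))
        if np.log(rng.uniform()) < delta:
            xs[n], xs[n + 1] = xs[n + 1], xs[n]
    return xs

rng = np.random.default_rng(1)
xs = [0.0] * (N + 1)
for t in range(200):
    xs = pt_step(xs, t, rng)
```

The component \(x^N\), which targets \(\pi_1\), is the one whose samples are kept; the lower-\(\beta\) components exist only to help it traverse between modes.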