Markov Chains and Metropolis-Hastings
Summer semester 2022
Vesa Kaarnioja
[Link]@[Link]
π(x_{k+1} | x_0, ..., x_k) = π(x_{k+1} | x_k)   (the Markov property),
π(y | x) = q(x, y)   (the transition kernel of the chain).
Here, both R(x, y ) and r (x) are as yet undetermined—the trick will be to
calibrate these in order to find a kernel such that p is its invariant density
as discussed on the previous slide.
Since R is a transition kernel, y ↦ R(x, y) is a probability density and hence

    ∫_{R^d} R(x, y) dy = 1   for all x ∈ R^d.
Denote by A the event of moving away from x and by ¬A the event of not
moving. Clearly, the probability of not moving is P(¬A | x) = r(x) and the
probability of moving is P(A | x) = 1 − r(x).
Even though the expression for K seems complicated, it turns out that the
drawing can be performed according to the following procedure.
Metropolis–Hastings algorithm
1 Fix x^{(0)} ∈ R^d and set k = 0.
2 Given x = x^{(k)}, draw y using the proposal density q(x, y).
3 Calculate the acceptance ratio

    α(x, y) = min{1, p(y)q(y, x) / (p(x)q(x, y))}.

4 Draw t ∼ U([0, 1]). If α(x, y) ≥ t, set x^{(k+1)} = y; otherwise set x^{(k+1)} = x.
5 Increase k ← k + 1 and return to step 2.
Remark. Note that due to the form of α, both the target p and the
proposal density q can be unnormalized within the Metropolis–Hastings
algorithm.
Why does this work?
Let us focus on the main loop of the Metropolis–Hastings algorithm:
Given x, draw y using the transition kernel q(x, y).
Calculate the acceptance ratio α(x, y) = min{1, p(y)q(y, x) / (p(x)q(x, y))}.
Draw t ∼ U([0, 1]). If α > t, accept y; otherwise stay put at x.
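As a concrete illustration, this loop can be sketched in NumPy with a symmetric Gaussian random walk proposal, in which case q(x, y) = q(y, x) cancels in α. The standard normal target and the step size below are illustrative choices, not part of the slides:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, rng=None):
    """Random walk Metropolis-Hastings.

    log_p : log of the (possibly unnormalized) target density p.
    step  : standard deviation of the Gaussian random walk proposal;
            this proposal is symmetric, q(x, y) = q(y, x), so the
            acceptance ratio reduces to min(1, p(y) / p(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    accepted = 0
    lp_x = log_p(x)
    for k in range(n_samples):
        y = x + step * rng.standard_normal(x.size)  # draw y ~ q(x, .)
        lp_y = log_p(y)
        # accept with probability alpha = min(1, exp(lp_y - lp_x))
        if np.log(rng.uniform()) < lp_y - lp_x:
            x, lp_x = y, lp_y
            accepted += 1
        samples[k] = x                              # on rejection, stay put at x
    return samples, accepted / n_samples

# Illustration: sample a standard normal target.
chain, acc = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), [0.0], 20000,
                                 rng=np.random.default_rng(0))
```

Working with log densities avoids overflow when p is only known up to a (possibly huge) normalization constant.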
Recall that A was the event of moving in the Markov chain. Then a direct
computation shows that the resulting kernel satisfies the detailed balance
condition p(x)K(x, y) = p(y)K(y, x), so that p is the invariant density of
the chain, as desired.
Example
Let us consider sampling from the density

    p(x_1, x_2) = (1/Z) exp(−10(x_1² − x_2)² − (x_2 − 1/4)⁴),

where Z is a normalizing constant.
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Random walk Metropolis–Hastings with 5000 samples, step size 0.5, acceptance ratio 0.3272
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Derivation of the single component Gibbs sampler
We continue to be interested in sampling the distribution with density
p(x). The single component Gibbs sampler is based on the same Markov
process that was introduced in the derivation of Metropolis–Hastings: if
you are currently situated at some x ∈ Rd , either
1 stay put at x with the probability r(x), 0 ≤ r(x) ≤ 1, or
2 move to a new point y, drawing its components one at a time from the full
conditional densities

    p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d)
        = p(y_1, ..., y_i, x_{i+1}, ..., x_d) / ∫_R p(y_1, ..., y_i, x_{i+1}, ..., x_d) dy_i.
This transition kernel K does not in general satisfy the detailed balance
equation, but it does satisfy the standard balance equation, which is
sufficient to ensure that p is the invariant density of the Markov chain (see
derivation of the Metropolis–Hastings method).
Theorem
The transition kernel

    K(x, y) = ∏_{i=1}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d),

where

    p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d)
        = p(y_1, ..., y_i, x_{i+1}, ..., x_d) / ∫_R p(y_1, ..., y_i, x_{i+1}, ..., x_d) dy_i,

satisfies the balance equation

    ∫_{R^d} p(y)K(y, x) dx = ∫_{R^d} p(x)K(x, y) dx.
Remark. We only consider the single component Gibbs sampler here. The
Gibbs sampler can be written in slightly more general form; see, e.g.,
[Kaipio and Somersalo 2005, Chapter 3.6.3].
Proof.
We begin with the left-hand side of the balance equation and consider
∫_{R^d} K(y, x) dx. We integrate inductively over the variables in the order
x_d, x_{d−1}, ..., x_1:

    ∫_R K(y, x) dx_d
        = ∫_R ∏_{i=1}^{d} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) dx_d
        = ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ∫_R p(x_d | x_1, ..., x_{d−1}) dx_d
        = ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d),

since the last integral equals 1. Consequently,

    ∫_R ∫_R K(y, x) dx_d dx_{d−1}
        = ∫_R ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) dx_{d−1}
        = ∏_{i=1}^{d−2} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ∫_R p(x_{d−1} | x_1, ..., x_{d−2}, y_d) dx_{d−1}
        = ∏_{i=1}^{d−2} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ⇒ ...,

where again the integral equals 1. Proceeding by inductively integrating over
x_{d−2}, x_{d−3}, ..., x_1, we obtain ∫_{R^d} K(y, x) dx = 1 and thus

    ∫_{R^d} p(y)K(y, x) dx = p(y) ∫_{R^d} K(y, x) dx = p(y).
Next we consider the right-hand side of the balance equation. Recall that
K(x, y) = ∏_{i=1}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d). We integrate
inductively over the variables, this time in the order x_1, ..., x_d:

    ∫_R p(x)K(x, y) dx_1
        = K(x, y) ∫_R p(x_1, x_2, ..., x_d) dx_1        (K is independent of x_1)
        = ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1 | x_2, ..., x_d) ∫_R p(x_1, x_2, ..., x_d) dx_1
        = ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, x_2, ..., x_d),

since p(y_1 | x_2, ..., x_d) = p(y_1, x_2, ..., x_d) / ∫_R p(x_1, x_2, ..., x_d) dx_1. Consequently,

    ∫_R ∫_R p(x)K(x, y) dx_1 dx_2
        = ∫_R ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, x_2, ..., x_d) dx_2
        = ∏_{i=3}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) ∫_R p(y_2 | y_1, x_3, ..., x_d) p(y_1, x_2, ..., x_d) dx_2
        = ∏_{i=3}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, y_2, x_3, ..., x_d) ⇒ ...,

where we used p(y_2 | y_1, x_3, ..., x_d) = p(y_1, y_2, x_3, ..., x_d) / ∫_R p(y_1, x_2, x_3, ..., x_d) dx_2.
Proceeding by inductively integrating over x_3, ..., x_d, we eventually obtain
∫_{R^d} p(x)K(x, y) dx = p(y). Therefore the balance equation holds. □
Single component Gibbs sampler
1 Fix x^{(0)} ∈ R^d and set k = 0.
2 Set x = x^{(k)}.
(i) Set j = 1.
(ii) Draw t from the univariate density p(y_j | y_1, ..., y_{j−1}, x_{j+1}, ..., x_d)
and set y_j = t.
(iii) If j = d, set y = (y_1, ..., y_d) and terminate the inner loop. Otherwise,
set j ← j + 1 and return to step (ii).
3 Set x^{(k+1)} = y, increase k ← k + 1 and return to step 2.
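As a sketch, this procedure can be implemented generically once the problem-specific full conditional samplers are supplied as a function. The bivariate Gaussian example is a stand-in chosen because its full conditionals are known in closed form:

```python
import numpy as np

def gibbs(sample_conditional, x0, n_samples, rng=None):
    """Single component Gibbs sampler.

    sample_conditional(j, x, rng) must return one draw from the full
    conditional density of component j given all other components of x;
    supplying these samplers is the problem-specific part.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for k in range(n_samples):
        for j in range(x.size):          # inner loop: update one component at a time
            x[j] = sample_conditional(j, x, rng)
        samples[k] = x                   # x^{(k+1)} = y
    return samples

# Illustration: bivariate Gaussian with unit variances and correlation rho,
# for which x_1 | x_2 ~ N(rho * x_2, 1 - rho^2) and symmetrically for x_2.
rho = 0.8
def cond(j, x, rng):
    return rho * x[1 - j] + np.sqrt(1 - rho**2) * rng.standard_normal()

chain = gibbs(cond, [0.0, 0.0], 20000, rng=np.random.default_rng(0))
```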
Example
Let us consider the density from before,

    p(x_1, x_2) = (1/Z) exp(−10(x_1² − x_2)² − (x_2 − 1/4)⁴),

where the normalizing constant is Z = 1.1813...
This time we use the Gibbs sampler. To sample the univariate densities
that arise in the process, we use inverse transform sampling. In this case,
the explicit algorithm we use is written below.
Fix x^{(0)} ∈ R² and set x = x^{(0)};
For k = 1, ..., N, do
    Calculate Φ_1(t) = ∫_{−∞}^{t} p(x_1, x_2) dx_1;
    Draw u ∼ U([0, 1]), set x_1 = Φ_1^{−1}(u);
    Calculate Φ_2(t) = ∫_{−∞}^{t} p(x_1, x_2) dx_2;
    Draw u ∼ U([0, 1]), set x_2 = Φ_2^{−1}(u);
    Set x^{(k)} = x.
End
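A possible NumPy realization of this algorithm approximates the conditional CDFs Φ_i by cumulative trapezoidal sums on a grid and inverts them by interpolation. The grid resolution and the truncation interval [−2.5, 2.5] are implementation assumptions, not part of the slides:

```python
import numpy as np

# Unnormalized target from the example; the constant Z cancels when the
# conditional CDFs are normalized, so it is not needed.
def p(x1, x2):
    return np.exp(-10.0 * (x1**2 - x2)**2 - (x2 - 0.25)**4)

def inverse_cdf_sample(density_1d, u, grid):
    """Inverse transform sampling on a grid: build the normalized CDF Phi
    by cumulative trapezoidal sums and invert it at u by interpolation."""
    vals = density_1d(grid)
    cdf = np.concatenate(([0.0], np.cumsum((vals[1:] + vals[:-1]) / 2 * np.diff(grid))))
    cdf /= cdf[-1]
    keep = np.concatenate(([True], np.diff(cdf) > 0))  # drop flat (zero-density) stretches
    return np.interp(u, cdf[keep], grid[keep])

rng = np.random.default_rng(0)
grid = np.linspace(-2.5, 2.5, 2001)   # truncation interval is an assumption
x = np.array([0.0, 0.0])
samples = np.empty((5000, 2))
for k in range(5000):
    x[0] = inverse_cdf_sample(lambda t: p(t, x[1]), rng.uniform(), grid)
    x[1] = inverse_cdf_sample(lambda t: p(x[0], t), rng.uniform(), grid)
    samples[k] = x
```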
Single component Gibbs sampler with 5000 samples
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Computational remarks about MCMC
As a general rule of thumb, one should aim at roughly 30%
acceptance rates when using Gaussian (or close to Gaussian) proposal
and target densities with MH.
It usually takes the Markov chain a number of iterations to reach the
steady state. It is therefore usually advisable to discard the first
N_0 samples, since they may not be representative of the target
distribution; this is the so-called “burn-in” period. The length of the
burn-in period varies depending on the application, but one might, for
example, throw away the first 5–10% of the steps for a sufficiently
large sample size.
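In code, discarding the burn-in is a simple slice once the chain is stored as an array; the synthetic chain below merely stands in for real MCMC output:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
# Synthetic chain: an initial drift toward the steady state, followed by
# stationary samples (a stand-in for real MCMC output).
chain = np.concatenate([np.linspace(5.0, 0.0, 500),
                        rng.standard_normal(n - 500)])

n_burn = int(0.10 * n)        # discard the first ~10% as burn-in
post_burn = chain[n_burn:]    # use only these samples for estimates
```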
In MH, using a Gaussian kernel (e.g., random walk
Metropolis–Hastings) is a popular choice due to the ease of
implementation. While it is a safe choice, it does not take into
account the form of the posterior density. To increase efficiency, it is
advisable to take the shape of the density into account when
designing the proposal density. In the high-dimensional setting, this is
especially useful if the posterior density is anisotropic (stretched in
some directions).
Computational remarks about MCMC
The proposal distribution in MH can also be updated while the
sampling algorithm moves around the posterior density. This process
is called adaptation.
Visual assessment: we are aiming for independent sample points, whose
sample histories look like a “fuzzy worm”. One could aim for something
like the Gaussian white noise signal below:

[Figure: sample history of a Gaussian white noise signal and its sample
autocorrelation function, normalized by γ_0 = N^{−1} ∑_{j=1}^{N} z_j².]
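For the autocorrelation part of this check, the sample autocorrelation function can be estimated as follows; centering the chain by its sample mean is an implementation choice:

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelation of a chain z, normalized by
    gamma_0 = N^{-1} * sum_j z_j^2 (computed after centering z).
    For an independent chain, all lags k >= 1 should be near zero."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()
    n = z.size
    gamma0 = np.sum(z**2) / n
    return np.array([np.sum(z[:n - k] * z[k:]) / (n * gamma0)
                     for k in range(max_lag + 1)])

# White noise: the ideal "fuzzy worm" reference signal.
acf = sample_acf(np.random.default_rng(0).standard_normal(5000), 50)
```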
Preconditioned Crank–Nicolson algorithm
Given the current state x, the pCN proposal takes the form

    y = √(1 − β²) x + β ξ,   ξ ∼ N(0, C),

where C is the covariance of the Gaussian prior/reference measure.
Here, the step size 0 < β < 1 is a free parameter (which can be
optimized for statistical efficiency).
The pCN method is dimension robust: the acceptance probability
does not degenerate to zero as the dimension d → ∞. Contrast this
with, e.g., random walk Metropolis, whose acceptance probability
degenerates to zero as the dimension d → ∞.
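A minimal pCN sketch under simplifying assumptions: the prior is N(0, I) (identity covariance) and the posterior has the form exp(−Φ(x)) times the prior, so the acceptance probability involves only the negative log-likelihood Φ. The quadratic Φ and dimension d = 50 below are illustrative:

```python
import numpy as np

def pcn(phi, x0, n_samples, beta=0.2, rng=None):
    """Preconditioned Crank-Nicolson sampler for a posterior of the form
    exp(-phi(x)) * N(0, I), where phi is the negative log-likelihood and
    N(0, I) is the prior (identity covariance assumed for simplicity).

    Proposal: y = sqrt(1 - beta^2) * x + beta * xi with xi ~ N(0, I).
    This proposal preserves the prior, so the acceptance probability
    min(1, exp(phi(x) - phi(y))) involves only the likelihood and does
    not degenerate as the dimension grows.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    phi_x = phi(x)
    for k in range(n_samples):
        y = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.size)
        phi_y = phi(y)
        if np.log(rng.uniform()) < phi_x - phi_y:
            x, phi_x = y, phi_y
        samples[k] = x
    return samples

# Illustration in d = 50: Gaussian likelihood centered at 1 per coordinate,
# so the posterior is N(1/2, 1/2) coordinatewise.
chain = pcn(lambda x: 0.5 * np.sum((x - 1.0)**2), np.zeros(50), 20000,
            rng=np.random.default_rng(0))
```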
Further variations of MCMC
We have only scratched the surface of some basic ideas surrounding
MCMC methods. In the literature and practical applications, one can find
many variations of these ideas to boost the performance of MCMC for
“difficult”/“high-dimensional” problems. To list a couple of notable ones:
Adaptive Metropolis: as the proposal density q(x, y ), use a random walk model
Y = X + W with W ∼ N (0, Γ), where the covariance Γ is replaced by the sample
covariance (plus some small perturbation of identity) computed using the sample
history. The updating can happen either at every step or after every M steps of
the Metropolis iteration. The main theoretical challenge is proving the ergodicity
of the chain—this was proved by Haario, Saksman, and Tamminen (2001).
Computationally, stable updating formulae for the sample means and covariances
are needed in practice.
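One such stable updating scheme is a Welford-style rank-one update of the running mean and covariance, sketched below; the biased (1/k) normalization of the covariance is a choice made here to keep the recursion simple:

```python
import numpy as np

def update_mean_cov(mean, cov, x_new, k):
    """Welford-style rank-one update of the running sample mean and the
    biased (1/k-normalized) sample covariance after the k-th sample,
    k >= 1. Avoids storing or re-scanning the whole sample history."""
    delta = x_new - mean
    new_mean = mean + delta / k
    new_cov = cov + (np.outer(delta, x_new - new_mean) - cov) / k
    return new_mean, new_cov

# Stream 5000 correlated 2D samples and compare against the batch result.
rng = np.random.default_rng(0)
xs = rng.standard_normal((5000, 2)) @ np.array([[1.0, 0.0], [0.5, 1.0]])
mean, cov = np.zeros(2), np.zeros((2, 2))
for k, x in enumerate(xs, start=1):
    mean, cov = update_mean_cov(mean, cov, x, k)
```

In adaptive Metropolis, the proposal covariance Γ would then be taken as this running covariance (suitably scaled, plus a small multiple of the identity) and refreshed every step or every M steps.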
Independence Metropolis: as the proposal density q(x, y ), use a probability density
that is independent of the previous sample x, i.e., q(x, y ) = q(y ). The proposal
density should be similar to the target density.
Metropolis-within-Gibbs, Delayed rejection adaptive Metropolis, . . .
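For instance, the independence Metropolis variant mentioned above can be sketched as follows; note that the acceptance ratio reduces to a ratio of importance weights p(y)q(x)/(p(x)q(y)). The N(1, 1) target and wider N(0, 2²) proposal are illustrative choices:

```python
import numpy as np

def independence_mh(log_p, log_q, draw_q, n_samples, rng=None):
    """Independence Metropolis: the proposal ignores the current state,
    q(x, y) = q(y), so alpha(x, y) = min(1, p(y)q(x) / (p(x)q(y))).
    log_p and log_q may both be unnormalized."""
    rng = np.random.default_rng() if rng is None else rng
    x = draw_q(rng)
    w_x = log_p(x) - log_q(x)            # log importance weight of x
    samples = np.empty(n_samples)
    for k in range(n_samples):
        y = draw_q(rng)
        w_y = log_p(y) - log_q(y)
        if np.log(rng.uniform()) < w_y - w_x:
            x, w_x = y, w_y
        samples[k] = x
    return samples

# Target N(1, 1); proposal N(0, 2^2), wide enough to cover the target.
chain = independence_mh(
    lambda x: -0.5 * (x - 1.0)**2,       # log p up to a constant
    lambda x: -0.5 * (x / 2.0)**2,       # log q up to a constant
    lambda rng: 2.0 * rng.standard_normal(),
    20000,
    rng=np.random.default_rng(0),
)
```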
Software: [Link]
[Link]