Markov Chains and Metropolis-Hastings

The document presents lecture notes on Inverse Problems, focusing on discrete time Markov chains and their properties, including transition kernels and invariant densities. It introduces the Metropolis–Hastings algorithm as a method to sample from probability densities and provides a detailed derivation of the algorithm. Additionally, it includes examples of sampling from a specific density using different step sizes.
Inverse Problems

Sommersemester 2022

Vesa Kaarnioja
[Link]@[Link]

FU Berlin, FB Mathematik und Informatik

Ninth lecture, June 27, 2022


The content of these slides roughly follows the material presented in the following monographs:
D. Calvetti and E. Somersalo. Introduction to Bayesian Scientific Computing – Ten Lectures on Subjective Computing. 2007.
J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. 2005.
Discrete time Markov chains
A sequence \{X_k\}_{k=0}^\infty of random variables is called a discrete time Markov chain if the probability distribution of X_{k+1} depends only on the previous state X_k:

    \pi(x_{k+1} \mid x_0, \dots, x_k) = \pi(x_{k+1} \mid x_k).

Suppose in addition that there exists a probability transition kernel q(x, y) such that

    \pi(x_{k+1} \mid x_k) = q(x_k, x_{k+1}).

Then the Markov chain is called time invariant (or time homogeneous), since the kernel q is independent of the time k.

Remark. We assume here and in the sequel that transition kernels satisfy:
    x \mapsto \int_B q(x, y)\, dy is measurable for all B \in \mathcal{B}(\mathbb{R}^d);
    B \mapsto \int_B q(x, y)\, dy is a probability distribution for all x \in \mathbb{R}^d, B \in \mathcal{B}(\mathbb{R}^d).
In particular, P(Y \in B \mid X = x) = \int_B q(x, y)\, dy and \int_{\mathbb{R}^d} q(x, y)\, dy = 1.
Let X be a random variable with probability density p(x). Let q(x, y) be an arbitrary transition kernel used to generate a new random variable Y given X = x, i.e.,

    \pi(y \mid x) = q(x, y).

The probability density of Y can be found through marginalization:

    \pi(y) = \int_{\mathbb{R}^d} \pi(y \mid x) p(x)\, dx = \int_{\mathbb{R}^d} q(x, y) p(x)\, dx.

If the probability density of Y is equal to the probability density of X,

    \int_{\mathbb{R}^d} q(x, y) p(x)\, dx = p(y),

then we call p an invariant density of the transition kernel q.
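In the finite-state setting the same definitions become matrix statements, which makes invariance easy to check numerically. The following sketch (the 3-state transition matrix is an illustrative assumption, not taken from the slides) finds the invariant density of a kernel and verifies the discrete analogue of the invariance condition above:

```python
import numpy as np

# A 3-state transition matrix: Q[i, j] = P(X_{k+1} = j | X_k = i).
# Each row sums to 1, the discrete analogue of ∫ q(x, y) dy = 1.
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# An invariant density p satisfies sum_i p_i Q[i, j] = p_j, i.e. p Q = p,
# so p is the left eigenvector of Q with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(Q.T)
p = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
p /= p.sum()   # normalize to a probability vector

assert np.allclose(p @ Q, p)   # invariance: p Q = p
print(p)
```

Running the chain from any starting state, the empirical state frequencies converge to this p, which is the finite-state version of the ergodic theorem quoted below.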


Definition (Irreducible transition kernel)
The transition kernel q is irreducible if, regardless of the starting point, the
Markov chain generated by q can visit any set of positive measure with
positive probability.

Definition (Periodic transition kernel)
The transition kernel q is periodic if, for some integer m ≥ 2, there is a collection of disjoint nonempty sets \{E_1, \dots, E_m\} \subset \mathbb{R}^d such that for all j \in \{1, \dots, m\} and for all x \in E_j:

    P(Y \in E_{\mathrm{mod}(j, m)+1} \mid X = x) = \int_{E_{\mathrm{mod}(j, m)+1}} q(x, y)\, dy = 1.

That is, the Markov chain generated by q remains in a periodic loop forever.

Definition (Aperiodic transition kernel)


The transition kernel q is aperiodic if it is not periodic.
Theorem
Let \{X_k\}_{k=0}^\infty be a time invariant Markov chain with transition kernel q, i.e.,

    \pi(x_{k+1} \mid x_k) = q(x_k, x_{k+1}).

Assume that p is an invariant density of q and that the following technical conditions hold:
    q is irreducible;
    q is aperiodic.
Then for all x_0 \in \mathbb{R}^d and any B \in \mathcal{B}(\mathbb{R}^d), it holds that

    \lim_{N \to \infty} P(X_N \in B \mid X_0 = x_0) = \int_B p(x)\, dx.

Moreover, for any regular enough function f,

    \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^N f(X_j) = \int_{\mathbb{R}^d} f(x) p(x)\, dx   a.s.
Suppose we want to sample some probability density p and we know that it is invariant with respect to a transition kernel q. Then we can proceed as follows:
1. Select a starting point x_0 and set k = 0.
2. Draw x_{k+1} from q(x_k, x_{k+1}).
3. Set k ← k + 1 and return to step 2.
The previous theorem implies that the sample \{x_k\}_{k=0}^N is asymptotically distributed according to p as N → ∞.
This raises the question: given a probability density p, how do you find a kernel q such that p is its invariant density?
The Metropolis–Hastings algorithm is a method to construct such a kernel!
Derivation of the Metropolis–Hastings algorithm
We are interested in obtaining samples from the probability density p. Consider the following Markov process: if you are currently situated at some x \in \mathbb{R}^d, either
1. stay put at x with probability r(x), 0 ≤ r(x) ≤ 1, or
2. move away from x using a transition kernel R(x, y) otherwise.
Here, both R(x, y) and r(x) are as yet undetermined; the trick will be to calibrate them so that p is the invariant density of the resulting kernel, as discussed on the previous slide.
Since R is a transition kernel, y \mapsto R(x, y) is a probability density and hence

    \int_{\mathbb{R}^d} R(x, y)\, dy = 1   for all x \in \mathbb{R}^d.
Denote by A the event of moving away from x and by ¬A the event of not moving. Clearly

    P(A) = 1 − r(x)   and   P(¬A) = r(x).

Given a current state X = x, we want to know the probability density of Y generated by the aforementioned strategy. Let B \in \mathcal{B}(\mathbb{R}^d) and consider the probability of the event Y \in B. Then

    P(Y \in B \mid X = x) = P(Y \in B \mid X = x, A) P(A)       (move away from x)
                          + P(Y \in B \mid X = x, ¬A) P(¬A).    (stay put at x)

The probability of arriving in B through a move is

    P(Y \in B \mid X = x, A) = \int_B R(x, y)\, dy.

The only way to arrive in B without moving is if x is already in B:

    P(Y \in B \mid X = x, ¬A) = \chi_B(x) = 1 if x \in B, 0 if x \notin B.

Hence, writing K(x, y) := (1 − r(x)) R(x, y),

    P(Y \in B \mid X = x) = \int_B (1 − r(x)) R(x, y)\, dy + r(x) \chi_B(x)
                          = \int_B K(x, y)\, dy + r(x) \chi_B(x).
The probability of Y \in B can be obtained by marginalizing over x:

    P(Y \in B) = \int_{\mathbb{R}^d} P(Y \in B \mid X = x) p(x)\, dx
               = \int_{\mathbb{R}^d} \left( \int_B K(x, y)\, dy \right) p(x)\, dx + \int_{\mathbb{R}^d} r(x) \chi_B(x) p(x)\, dx
               = \int_B \left( \int_{\mathbb{R}^d} K(x, y) p(x)\, dx \right) dy + \int_B r(x) p(x)\, dx
               = \int_B \left( \int_{\mathbb{R}^d} K(x, y) p(x)\, dx + r(y) p(y) \right) dy
               = \int_B \left( \int_{\mathbb{R}^d} K(x, y) p(x)\, dx − \int_{\mathbb{R}^d} K(y, x) p(y)\, dx + p(y) \right) dy,

where we used \int_{\mathbb{R}^d} K(y, x)\, dx = (1 − r(y)) \int_{\mathbb{R}^d} R(y, x)\, dx = 1 − r(y).

If the balance equation

    \int_{\mathbb{R}^d} p(y) K(y, x)\, dx = \int_{\mathbb{R}^d} p(x) K(x, y)\, dx        (1)

holds, then P(Y \in B) = \int_B p(y)\, dy, as desired.
The Metropolis–Hastings algorithm is a technique for finding a kernel K that satisfies the detailed balance equation

    p(y) K(y, x) = p(x) K(x, y),

which implies (1). Let us start with a proposal density q(x, y), chosen so that generating a Markov chain with it is easy. (For this reason, a Gaussian kernel is a very popular choice.) There are three separate cases:
1. If p(y) q(y, x) = p(x) q(x, y), then set r(x) = 0 and R(x, y) = K(x, y) = q(x, y); the previous analysis ensures that p is an invariant density for the kernel q.
2. If p(y) q(y, x) < p(x) q(x, y), then define the kernel K to be

    K(x, y) = \alpha(x, y) q(x, y),

where \alpha is chosen so that p(y) \alpha(y, x) q(y, x) = p(x) \alpha(x, y) q(x, y). We can make the selection

    \alpha(y, x) = 1   and   \alpha(x, y) = \frac{p(y) q(y, x)}{p(x) q(x, y)} < 1.

3. If p(y) q(y, x) > p(x) q(x, y), then in complete analogy to the above:

    \alpha(x, y) = 1   and   \alpha(y, x) = \frac{p(x) q(x, y)}{p(y) q(y, x)} < 1.

In summary, we define K as

    K(x, y) = \alpha(x, y) q(x, y),   \alpha(x, y) = \min\left(1, \frac{p(y) q(y, x)}{p(x) q(x, y)}\right).

Even though the expression for K seems complicated, it turns out that the drawing can be performed according to the following procedure.
Metropolis–Hastings algorithm
1. Choose x^{(0)} \in \mathbb{R}^d and set k = 0.
2. Given x = x^{(k)}, draw y using the transition kernel q(x, y) of your choosing.
3. Calculate the acceptance ratio

    \alpha(x, y) = \min\left(1, \frac{p(y) q(y, x)}{p(x) q(x, y)}\right).

4. Flip the \alpha-coin: draw t \sim U([0, 1]). If \alpha > t, set x^{(k+1)} = y; otherwise stay put at x and set x^{(k+1)} = x^{(k)}.
5. Set k ← k + 1 and return to step 2.

Remark. Due to the form of \alpha, both the target p and the proposal density q may be unnormalized within the Metropolis–Hastings algorithm.
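As a concrete illustration, the procedure above can be sketched in a few lines. The target below is the example density appearing later in these slides; the function names, random seed, and step size are assumptions made for this sketch. Since the random walk proposal is symmetric, q(y, x) = q(x, y) and the acceptance ratio reduces to p(y)/p(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    """Unnormalized log target density (the example density from the slides)."""
    x1, x2 = x
    return -10.0 * (x1**2 - x2)**2 - (x2 - 0.25)**4

def metropolis_hastings(log_p, x0, n_samples, gamma):
    """Random walk Metropolis-Hastings: propose Y = X + W, W ~ N(0, gamma^2 I)."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_samples, x.size))
    accepted = 0
    for k in range(n_samples):
        y = x + gamma * rng.standard_normal(x.size)      # propose a move
        # Flip the alpha-coin on the log scale to avoid overflow:
        if np.log(rng.uniform()) < log_p(y) - log_p(x):
            x = y
            accepted += 1
        chain[k] = x                                     # rejected moves repeat x
    return chain, accepted / n_samples

chain, acc = metropolis_hastings(log_p, x0=[0.0, 0.0], n_samples=5000, gamma=0.5)
print(f"acceptance ratio: {acc:.3f}")
```

Note that only the unnormalized log density is needed, in line with the remark above.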
Why does this work?
Let us focus on the main loop of the Metropolis–Hastings algorithm:
    Given x, draw y using the transition kernel q(x, y).
    Calculate the acceptance ratio \alpha(x, y) = \min\left(1, \frac{p(y) q(y, x)}{p(x) q(x, y)}\right).
    Draw t \sim U([0, 1]). If \alpha > t, accept y; otherwise stay put at x.
Recall that A was the event of moving in the Markov chain. Then

    P(A \mid y, x) = "probability of accepting the transition" = \alpha(x, y),
    P(y \mid x) = "probability of drawing y" = q(x, y).

Then

    "probability of an accepted y" = P(A, y \mid x) = P(A \mid y, x) P(y \mid x) = \alpha(x, y) q(x, y) = K(x, y),

as desired.
Example
Let us consider sampling from the density

    p(x_1, x_2) \propto \exp(-10 (x_1^2 - x_2)^2 - (x_2 - \tfrac{1}{4})^4).

As the proposal distribution, we use the random walk model Y = X + W, W \sim \mathcal{N}(0, \gamma^2 I), with the kernel

    q(x, y) \propto \exp\left(-\frac{1}{2\gamma^2} \|x - y\|^2\right).

We draw 5000 samples from the probability distribution with density p using three different step sizes: γ = 0.1, γ = 0.5, and γ = 2.
[Figure: Random walk Metropolis–Hastings with 5000 samples, γ = 0.1, acceptance ratio 0.7764. Left: scatter plot of the samples; right: sample histories of x_1 and x_2.]

[Figure: Random walk Metropolis–Hastings with 5000 samples, γ = 0.5, acceptance ratio 0.3272. Left: scatter plot of the samples; right: sample histories of x_1 and x_2.]

[Figure: Random walk Metropolis–Hastings with 5000 samples, γ = 2, acceptance ratio 0.058. Left: scatter plot of the samples; right: sample histories of x_1 and x_2.]
Derivation of the single component Gibbs sampler
We continue to be interested in sampling the distribution with density p(x). The single component Gibbs sampler is based on the same Markov process that was introduced in the derivation of Metropolis–Hastings: if you are currently situated at some x \in \mathbb{R}^d, either
1. stay put at x with probability r(x), 0 ≤ r(x) ≤ 1, or
2. move away from x using a transition kernel R(x, y) otherwise.
Recall also the definition we made in the Metropolis–Hastings derivation:

    K(x, y) = (1 − r(x)) R(x, y).

Suppose that x is a d-variate random variable. For the single component Gibbs sampler, we set r(x) = 0 (moving is obligatory) and define the transition kernel

    K(x, y) = R(x, y) = \prod_{i=1}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d),

where

    p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) = \frac{p(y_1, \dots, y_i, x_{i+1}, \dots, x_d)}{\int_{\mathbb{R}} p(y_1, \dots, y_i, x_{i+1}, \dots, x_d)\, dy_i}.
This transition kernel K does not in general satisfy the detailed balance
equation, but it does satisfy the standard balance equation, which is
sufficient to ensure that p is the invariant density of the Markov chain (see
derivation of the Metropolis–Hastings method).
Theorem
The transition kernel

    K(x, y) = \prod_{i=1}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d),

where

    p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) = \frac{p(y_1, \dots, y_i, x_{i+1}, \dots, x_d)}{\int_{\mathbb{R}} p(y_1, \dots, y_i, x_{i+1}, \dots, x_d)\, dy_i},

satisfies

    \int_{\mathbb{R}^d} p(y) K(y, x)\, dx = \int_{\mathbb{R}^d} p(x) K(x, y)\, dx.

Remark. We only consider the single component Gibbs sampler here. The Gibbs sampler can be written in a slightly more general form; see, e.g., [Chapter 3.6.3, Kaipio and Somersalo 2005].
Proof.
We begin with the left-hand side of the balance equation and consider \int_{\mathbb{R}^d} K(y, x)\, dx, where (with the roles of x and y swapped) K(y, x) = \prod_{i=1}^d p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d). We integrate inductively over the variables in the order x_d, x_{d-1}, \dots, x_1:

    \int_{\mathbb{R}} K(y, x)\, dx_d = \int_{\mathbb{R}} \left( \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d) \right) dx_d
    = \left( \prod_{i=1}^{d-1} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d) \right) \underbrace{\int_{\mathbb{R}} p(x_d \mid x_1, \dots, x_{d-1})\, dx_d}_{=1}
    = \prod_{i=1}^{d-1} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d),

and hence

    \int_{\mathbb{R}} \int_{\mathbb{R}} K(y, x)\, dx_d\, dx_{d-1} = \int_{\mathbb{R}} \left( \prod_{i=1}^{d-1} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d) \right) dx_{d-1}
    = \left( \prod_{i=1}^{d-2} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d) \right) \underbrace{\int_{\mathbb{R}} p(x_{d-1} \mid x_1, \dots, x_{d-2}, y_d)\, dx_{d-1}}_{=1}
    = \prod_{i=1}^{d-2} p(x_i \mid x_1, \dots, x_{i-1}, y_{i+1}, \dots, y_d).

Proceeding by inductively integrating over x_{d-2}, x_{d-3}, \dots, x_1, we obtain \int_{\mathbb{R}^d} K(y, x)\, dx = 1 and thus

    \int_{\mathbb{R}^d} p(y) K(y, x)\, dx = p(y) \int_{\mathbb{R}^d} K(y, x)\, dx = p(y).
Next we consider the right-hand side of the balance equation. Recall that K(x, y) = \prod_{i=1}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d). We integrate inductively over the variables, this time in the order x_1, \dots, x_d:

    \int_{\mathbb{R}} p(x) K(x, y)\, dx_1 = K(x, y) \int_{\mathbb{R}} p(x_1, x_2, \dots, x_d)\, dx_1   (K is independent of x_1)
    = \left( \prod_{i=2}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) \right) \underbrace{p(y_1 \mid x_2, \dots, x_d)}_{= \frac{p(y_1, x_2, \dots, x_d)}{\int_{\mathbb{R}} p(x_1, x_2, \dots, x_d)\, dx_1}} \int_{\mathbb{R}} p(x_1, x_2, \dots, x_d)\, dx_1
    = \left( \prod_{i=2}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) \right) p(y_1, x_2, \dots, x_d),

and hence

    \int_{\mathbb{R}} \int_{\mathbb{R}} p(x) K(x, y)\, dx_1\, dx_2 = \int_{\mathbb{R}} \left( \prod_{i=2}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) \right) p(y_1, x_2, \dots, x_d)\, dx_2
    = \left( \prod_{i=3}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) \right) \underbrace{p(y_2 \mid y_1, x_3, \dots, x_d)}_{= \frac{p(y_1, y_2, x_3, \dots, x_d)}{\int_{\mathbb{R}} p(y_1, x_2, x_3, \dots, x_d)\, dx_2}} \int_{\mathbb{R}} p(y_1, x_2, \dots, x_d)\, dx_2
    = \left( \prod_{i=3}^d p(y_i \mid y_1, \dots, y_{i-1}, x_{i+1}, \dots, x_d) \right) p(y_1, y_2, x_3, \dots, x_d).

Proceeding by inductively integrating over x_3, \dots, x_d, we eventually obtain \int_{\mathbb{R}^d} p(x) K(x, y)\, dx = p(y). Therefore the balance equation holds. □
Single component Gibbs sampler
1. Choose the initial value x^{(0)} \in \mathbb{R}^d and set k = 0.
2. Draw the next sample as follows:
   (i) Set x = x^{(k)} and j = 1.
   (ii) Draw t \in \mathbb{R} from the one-dimensional distribution

        p(t \mid y_1, \dots, y_{j-1}, x_{j+1}, \dots, x_d) \propto p(y_1, \dots, y_{j-1}, t, x_{j+1}, \dots, x_d)

   and set y_j = t.
   (iii) If j = d, set y = (y_1, \dots, y_d) and terminate the inner loop. Otherwise, set j ← j + 1 and return to step (ii).
3. Set x^{(k+1)} = y, increase k ← k + 1 and return to step 2.
Example
Let us consider the density from before,

    p(x_1, x_2) = \frac{1}{Z} \exp(-10 (x_1^2 - x_2)^2 - (x_2 - \tfrac{1}{4})^4),

where the normalizing constant is Z = 1.1813\ldots
This time we use the Gibbs sampler. To sample the univariate densities that arise in the process, we use inverse transform sampling. In this case, the explicit algorithm we use is written below.

Fix x^{(0)} \in \mathbb{R}^2 and set x = x^{(0)};
For k = 1, \dots, N, do
    Calculate \Phi_1(t) = \int_{-\infty}^t p(x_1, x_2)\, dx_1 (as a function of t with x_2 fixed, normalized so that \Phi_1(+\infty) = 1);
    Draw u \sim U([0, 1]), set x_1 = \Phi_1^{-1}(u);
    Calculate \Phi_2(t) = \int_{-\infty}^t p(x_1, x_2)\, dx_2 (as a function of t with the new x_1 fixed, normalized likewise);
    Draw u \sim U([0, 1]), set x_2 = \Phi_2^{-1}(u);
    Set x^{(k)} = x.
End
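The explicit algorithm above can be sketched as follows, with the conditional CDFs computed numerically on a grid. The truncation of the domain to [-2, 2]^2, the grid resolution, and the variable names are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def p(x1, x2):
    """Unnormalized target density from the example."""
    return np.exp(-10.0 * (x1**2 - x2)**2 - (x2 - 0.25)**4)

# Truncate to the box [-2, 2]^2 (the density is negligible outside).
grid = np.linspace(-2.0, 2.0, 1001)

def sample_conditional(density_on_grid):
    """Inverse transform sampling on the grid: build the normalized CDF Phi
    from the unnormalized conditional density and evaluate Phi^{-1}(u)."""
    cdf = np.cumsum(density_on_grid)
    cdf /= cdf[-1]                       # normalize so that Phi(+inf) = 1
    u = rng.uniform()
    return grid[np.searchsorted(cdf, u)]

def gibbs(n_samples, x0=(0.0, 0.0)):
    x1, x2 = x0
    chain = np.empty((n_samples, 2))
    for k in range(n_samples):
        x1 = sample_conditional(p(grid, x2))   # draw x1 | x2
        x2 = sample_conditional(p(x1, grid))   # draw x2 | x1
        chain[k] = (x1, x2)
    return chain

chain = gibbs(5000)
print(chain.mean(axis=0))
```

Each Gibbs move is always accepted, which is why no acceptance ratio appears here, in contrast to Metropolis–Hastings.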
[Figure: Single component Gibbs sampler with 5000 samples. Left: scatter plot of the samples; right: sample histories of x_1 and x_2.]
Computational remarks about MCMC
    As a general rule of thumb, one should aim at roughly 30% acceptance rates when using Gaussian (or close to Gaussian) proposal and target densities with MH.
    It usually takes the Markov chain a number of iterations to reach the steady state. It is therefore usually advisable to discard the first N_0 samples, since they may not be representative of the target distribution; this is the so-called "burn-in" period. The length of the burn-in period varies depending on the application, but as an example one might throw away the first ~5-10% of the steps for a sufficiently large sample size.
    In MH, a Gaussian kernel (e.g., random walk Metropolis–Hastings) is a popular choice due to its ease of implementation. While it is a safe choice, it does not take into account the form of the posterior density. To increase efficiency, it is advisable to take the shape of the density into account when designing the proposal density. In the high-dimensional setting, this is especially useful if the posterior density is anisotropic (stretched in some directions).
Computational remarks about MCMC
    The proposal distribution in MH can also be updated while the sampling algorithm moves around the posterior density. This process is called adaptation.
    Visual assessment: we are aiming for independent sample points, where the sample histories look like a "fuzzy worm". One could aim at something resembling a Gaussian white noise signal.

[Figure: sample history of a Gaussian white noise signal.]

    More quantitatively, the independence of consecutive draws can be estimated from the sample itself by computing its (sample-based) autocovariance.
A note on convergence
The success of the Metropolis–Hastings and Gibbs sampler algorithms depends largely on whether they satisfy the ergodicity conditions from before. There are known sufficient conditions on the density p that guarantee the ergodicity of these methods. For example, the following proposition gives some relatively general conditions.

Proposition (Proposition 3.12 in Kaipio and Somersalo 2005)
(a) Let p : \mathbb{R}^d \to \mathbb{R}_+ and let q : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+ be a candidate-generating kernel. If the Markov chain corresponding to q is aperiodic, then the Metropolis–Hastings chain is also aperiodic. Further, if the Markov chain corresponding to q is irreducible and \alpha(x, y) > 0 for all (x, y) \in E_+ \times E_+, where E_+ := \{x \in \mathbb{R}^d \mid p(x) > 0\}, then the Metropolis–Hastings chain is irreducible.
(b) Let p be a lower semicontinuous density and E_+ as above. The Gibbs sampler defines an irreducible and aperiodic transition kernel if E_+ is connected and each (d−1)-dimensional marginal p(x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_d) = \int_{\mathbb{R}} p(x)\, dx_j is locally bounded.
Let us consider

    \int_{\mathbb{R}^d} f(x) p(x)\, dx \approx \frac{1}{N} \sum_{j=1}^N f(x_j).

Assume the variables Y_j = f(x_j) are i.i.d. with E[Y_j] = y and Var(Y_j) = \sigma^2. Define

    \widetilde{Y}_N = \frac{1}{N} \sum_{j=1}^N Y_j   and   Z_N = \frac{\sqrt{N}\, (\widetilde{Y}_N - y)}{\sigma}.

Then \widetilde{Y}_N \to E[Y] a.s. (law of large numbers). Asymptotically, Z_N is (standard) normally distributed:

    \lim_{N \to \infty} P(Z_N \leq z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^z e^{-s^2/2}\, ds.

Loosely speaking, this means that

    \frac{1}{N} \sum_{j=1}^N f(x_j) - \int_{\mathbb{R}^d} f(x) p(x)\, dx \approx \frac{\sigma}{\sqrt{N}},

provided that the x_j are independent.
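The σ/√N behavior is easy to observe numerically for independent samples. In the sketch below, the standard normal target, the choice f(x) = x^2 (whose exact integral is 1 and whose variance is 2), and the sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate ∫ f(x) p(x) dx for p = N(0, 1) and f(x) = x^2 (exact value 1)
# with i.i.d. samples, and compare the error to sigma / sqrt(N).
f = lambda x: x**2
exact = 1.0
sigma = np.sqrt(2.0)   # Var(X^2) = 2 for X ~ N(0, 1)

for N in (100, 10_000, 1_000_000):
    x = rng.standard_normal(N)
    err = abs(f(x).mean() - exact)
    print(f"N = {N:>9}: error = {err:.5f}, sigma/sqrt(N) = {sigma / np.sqrt(N):.5f}")
```

The printed errors fluctuate from run to run, but they shrink by roughly a factor of 10 for each factor of 100 in N, as the central limit theorem predicts.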


Autocovariance and correlation length
The independence of consecutive draws can be estimated from the sample itself. Suppose that we are interested in the convergence of the integral of f(x) with respect to the probability density p(x). Let us denote z_j = f(x_j), where \{x_1, \dots, x_N\} \subset \mathbb{R}^d is an MCMC sample, and let \bar{z} = N^{-1} \sum_{j=1}^N z_j. Then we define the normalized autocovariance of the sample as

    \gamma_k = \frac{1}{(N - k) \gamma_0} \sum_{j=1}^{N-k} (z_j - \bar{z})(z_{j+k} - \bar{z}),   k ≥ 1,

where \gamma_0 = N^{-1} \sum_{j=1}^N (z_j - \bar{z})^2.

The correlation length of the sample \{z_j\}_{j=1}^N can be estimated from the decay of its normalized autocovariance sequence. If only every kth sample point is independent, one might expect the discrepancy to behave as 1/\sqrt{N/k} = \sqrt{k/N} instead of 1/\sqrt{N}. In consequence, one should try to choose the proposal distribution so that the correlation length is as small as possible.
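The normalized autocovariance γ_k can be computed directly from a sample history. In the sketch below, the AR(1) test signal is an illustrative assumption; its autocovariance decays like a^k, mimicking a correlated MCMC chain, whereas for white noise it would drop to (nearly) zero immediately:

```python
import numpy as np

def normalized_autocovariance(z, max_lag):
    """Normalized autocovariance gamma_k for k = 1..max_lag of a scalar
    sample history z, normalized by the sample variance gamma_0."""
    z = np.asarray(z, dtype=float)
    N = z.size
    zbar = z.mean()
    gamma0 = np.mean((z - zbar)**2)
    return np.array([
        np.sum((z[:N - k] - zbar) * (z[k:] - zbar)) / ((N - k) * gamma0)
        for k in range(1, max_lag + 1)
    ])

# Illustrative test signal: an AR(1) chain z_j = a z_{j-1} + noise.
rng = np.random.default_rng(3)
a, N = 0.9, 100_000
z = np.empty(N)
z[0] = 0.0
for j in range(1, N):
    z[j] = a * z[j - 1] + rng.standard_normal()

gamma = normalized_autocovariance(z, max_lag=50)
print(gamma[:5])   # lag-1 value is close to a = 0.9, then decays like a^k
```

The lag at which this sequence has decayed close to zero gives a rough estimate of the correlation length discussed above.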
[Figure: Normalized autocovariance sequences (horizontal and vertical components, lags 0-100) for the Metropolis–Hastings example with γ = 0.1, γ = 0.5, and γ = 2.]
[Figure: Normalized autocovariance sequence (horizontal and vertical components, lags 0-100) for the Gibbs example.]
Preconditioned Crank–Nicolson algorithm
    The preconditioned Crank–Nicolson (pCN) algorithm is an instance of the Metropolis–Hastings algorithm with a specially chosen proposal density.
    The proposal is drawn using the model Y = \sqrt{1 - \beta^2}\, X + \beta W, where W \sim \mathcal{N}(0, C_0) and C_0 is a symmetric and positive definite matrix, with the (non-symmetric!) kernel

    q(x, y) \propto \exp\left(-\frac{1}{2\beta^2} (y - \sqrt{1 - \beta^2}\, x)^T C_0^{-1} (y - \sqrt{1 - \beta^2}\, x)\right).

    Here, the step size 0 < \beta < 1 is a free parameter (which can be optimized for statistical efficiency).
    The pCN method is dimension robust: its acceptance probability does not degenerate to zero as the dimension d → ∞. Contrast this with, e.g., random walk Metropolis, whose acceptance probability degenerates to zero as d → ∞.
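A single pCN step can be sketched as follows. Here the target is assumed to have the form p(x) ∝ exp(−Φ(x)) π_0(x) with Gaussian prior π_0 = N(0, C_0), in which case the prior and proposal contributions cancel in the Metropolis–Hastings ratio and the acceptance probability reduces to min(1, exp(Φ(x) − Φ(y))), a standard property of pCN. The dimension, the identity prior covariance, the toy negative log-likelihood Φ, and the step size below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

d = 100                  # dimension; pCN remains usable as d grows
C0_chol = np.eye(d)      # Cholesky factor of the prior covariance C0 (identity here)

def Phi(x):
    """Negative log-likelihood (illustrative toy choice)."""
    return 0.5 * np.sum((x - 1.0)**2) / 10.0

def pcn_step(x, beta):
    """One pCN step: propose y = sqrt(1 - beta^2) x + beta w, w ~ N(0, C0),
    then accept with probability min(1, exp(Phi(x) - Phi(y)))."""
    w = C0_chol @ rng.standard_normal(d)
    y = np.sqrt(1.0 - beta**2) * x + beta * w
    if np.log(rng.uniform()) < Phi(x) - Phi(y):
        return y, True
    return x, False

x = np.zeros(d)
accepts = 0
for _ in range(2000):
    x, ok = pcn_step(x, beta=0.2)
    accepts += ok
print(f"acceptance rate: {accepts / 2000:.2f}")
```

Because the acceptance probability involves only Φ, the prior contribution never enters the ratio, which is the mechanism behind the dimension robustness noted above.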
Further variations of MCMC
We have only scratched the surface of some basic ideas surrounding MCMC methods. In the literature and in practical applications, one can find many variations of these ideas to boost the performance of MCMC for "difficult"/"high-dimensional" problems. To list a couple of notable ones:
    Adaptive Metropolis: as the proposal density q(x, y), use a random walk model Y = X + W with W \sim \mathcal{N}(0, \Gamma), where the covariance \Gamma is replaced by the sample covariance (plus some small perturbation of the identity) computed from the sample history. The update can happen either at every step or after every M steps of the Metropolis iteration. The main theoretical challenge is proving the ergodicity of the chain; this was proved by Haario, Saksman, and Tamminen (2001). Computationally, stable updating formulae for the sample means and covariances are needed in practice.
    Independence Metropolis: as the proposal density q(x, y), use a probability density that is independent of the previous sample x, i.e., q(x, y) = q(y). The proposal density should be similar to the target density.
    Metropolis-within-Gibbs, delayed rejection adaptive Metropolis, ...
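The stable updating formulae mentioned under Adaptive Metropolis can be realized with standard recursive (Welford-type) estimators, sketched below. The incremental identities themselves are textbook; the test data and names are illustrative assumptions:

```python
import numpy as np

def update_mean_cov(mean, cov, n, x):
    """Recursively update the sample mean and sample covariance after
    observing the (n+1)-th point x, using the standard identities
        m_{n+1} = m_n + (x - m_n) / (n + 1),
        C_{n+1} = ((n - 1) C_n + (x - m_n)(x - m_{n+1})^T) / n   for n >= 1,
    which avoid re-scanning the whole sample history at every step."""
    new_mean = mean + (x - mean) / (n + 1)
    if n >= 1:
        cov = ((n - 1) * cov + np.outer(x - mean, x - new_mean)) / n
    return new_mean, cov

rng = np.random.default_rng(5)
data = rng.standard_normal((1000, 2)) @ np.array([[1.0, 0.0], [0.5, 0.8]])

mean = np.zeros(2)
cov = np.zeros((2, 2))
for n, x in enumerate(data):
    mean, cov = update_mean_cov(mean, cov, n, x)

# The recursive estimates agree with the batch estimates.
assert np.allclose(mean, data.mean(axis=0))
assert np.allclose(cov, np.cov(data.T))
print(mean, cov, sep="\n")
```

In an adaptive Metropolis implementation, cov (plus a small multiple of the identity) would serve as the proposal covariance Γ.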
Software: [Link]
[Link]
