Markov Chains and Metropolis-Hastings
Summer semester 2022
Vesa Kaarnioja
[Link]@[Link]
π(x_{k+1} | x_0, ..., x_k) = π(x_{k+1} | x_k)   (the Markov property),
π(y | x) = q(x, y)   (the transition kernel of the chain).
Here, both R(x, y ) and r (x) are as yet undetermined—the trick will be to
calibrate these in order to find a kernel such that p is its invariant density
as discussed on the previous slide.
Since R is a transition kernel, y ↦ R(x, y) is a probability density and hence

    ∫_{R^d} R(x, y) dy = 1   for all x ∈ R^d.
Denote by A the event of moving away from x and by ¬A the event of not
moving. Clearly, the probability of not moving is P(¬A | x) = r(x) and the
probability of moving is P(A | x) = 1 − r(x).
Even though the expression for K seems complicated, it turns out that the
drawing can be performed according to the following procedure.
Metropolis–Hastings algorithm
1 Fix x^{(0)} ∈ R^d and set k = 0.
2 Given x = x^{(k)}, draw y using the proposal density q(x, y).
3 Calculate the acceptance ratio

    α(x, y) = min{1, p(y)q(y, x) / (p(x)q(x, y))}.

4 Draw t ∼ U([0, 1]). If α(x, y) ≥ t, set x^{(k+1)} = y; otherwise set x^{(k+1)} = x.
5 Increase k ← k + 1 and return to step 2.
Remark. Note that due to the form of α, both the target p and the
proposal density q can be unnormalized within the Metropolis–Hastings
algorithm.
Why does this work?
Let us focus on the main loop of the Metropolis–Hastings algorithm:
Given x, draw y using the transition kernel q(x, y).
Calculate the acceptance ratio α(x, y) = min{1, p(y)q(y, x) / (p(x)q(x, y))}.
Draw t ∼ U([0, 1]). If α > t, accept y; otherwise stay put at x.
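As a concrete illustration, this loop can be sketched in NumPy with a symmetric Gaussian random walk proposal, in which case q(x, y) = q(y, x) cancels in α. The standard normal target and the step size below are illustrative choices, not part of the slides:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, rng=None):
    """Random walk Metropolis-Hastings.

    log_p : log of the (possibly unnormalized) target density p.
    step  : standard deviation of the Gaussian random walk proposal;
            this proposal is symmetric, q(x, y) = q(y, x), so the
            acceptance ratio reduces to min(1, p(y) / p(x)).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    accepted = 0
    lp_x = log_p(x)
    for k in range(n_samples):
        y = x + step * rng.standard_normal(x.size)  # draw y ~ q(x, .)
        lp_y = log_p(y)
        # accept with probability alpha = min(1, exp(lp_y - lp_x))
        if np.log(rng.uniform()) < lp_y - lp_x:
            x, lp_x = y, lp_y
            accepted += 1
        samples[k] = x                              # on rejection, stay put at x
    return samples, accepted / n_samples

# Illustration: sample a standard normal target.
chain, acc = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), [0.0], 20000,
                                 rng=np.random.default_rng(0))
```

Working with log densities avoids overflow when p is only known up to a (possibly huge) normalization constant.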
Recall that A was the event of moving in the Markov chain. Then a direct
computation shows that the resulting kernel satisfies the detailed balance
condition p(x)K(x, y) = p(y)K(y, x), so that p is the invariant density of
the chain, as desired.
Example
Let us consider sampling from the density

    p(x_1, x_2) = (1/Z) exp(−10(x_1² − x_2)² − (x_2 − 1/4)⁴),

where Z is a normalizing constant.
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Random walk Metropolis–Hastings with 5000 samples, step size 0.5, acceptance ratio 0.3272
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Derivation of the single component Gibbs sampler
We continue to be interested in sampling the distribution with density
p(x). The single component Gibbs sampler is based on the same Markov
process that was introduced in the derivation of Metropolis–Hastings: if
you are currently situated at some x ∈ Rd , either
1 stay put at x with the probability r(x), 0 ≤ r(x) ≤ 1, or
2 move to a new point y, drawing its components one at a time from the full
conditional densities

    p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d)
        = p(y_1, ..., y_i, x_{i+1}, ..., x_d) / ∫_R p(y_1, ..., y_i, x_{i+1}, ..., x_d) dy_i.
This transition kernel K does not in general satisfy the detailed balance
equation, but it does satisfy the standard balance equation, which is
sufficient to ensure that p is the invariant density of the Markov chain (see
derivation of the Metropolis–Hastings method).
Theorem
The transition kernel

    K(x, y) = ∏_{i=1}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d),

where

    p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d)
        = p(y_1, ..., y_i, x_{i+1}, ..., x_d) / ∫_R p(y_1, ..., y_i, x_{i+1}, ..., x_d) dy_i,

satisfies the balance equation

    ∫_{R^d} p(y)K(y, x) dx = ∫_{R^d} p(x)K(x, y) dx.
Remark. We only consider the single component Gibbs sampler here. The
Gibbs sampler can be written in slightly more general form; see, e.g.,
[Kaipio and Somersalo 2005, Chapter 3.6.3].
Proof.
We begin with the left-hand side of the balance equation and consider
∫_{R^d} K(y, x) dx. We integrate inductively over the variables in the order
x_d, x_{d−1}, ..., x_1:

    ∫_R K(y, x) dx_d
        = ∫_R ∏_{i=1}^{d} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) dx_d
        = ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ∫_R p(x_d | x_1, ..., x_{d−1}) dx_d
        = ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d),

since the last integral equals 1. Consequently,

    ∫_R ∫_R K(y, x) dx_d dx_{d−1}
        = ∫_R ∏_{i=1}^{d−1} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) dx_{d−1}
        = ∏_{i=1}^{d−2} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ∫_R p(x_{d−1} | x_1, ..., x_{d−2}, y_d) dx_{d−1}
        = ∏_{i=1}^{d−2} p(x_i | x_1, ..., x_{i−1}, y_{i+1}, ..., y_d) ⇒ ...,

where again the integral equals 1. Proceeding by inductively integrating over
x_{d−2}, x_{d−3}, ..., x_1, we obtain ∫_{R^d} K(y, x) dx = 1 and thus

    ∫_{R^d} p(y)K(y, x) dx = p(y) ∫_{R^d} K(y, x) dx = p(y).
Next we consider the right-hand side of the balance equation. Recall that
K(x, y) = ∏_{i=1}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d). We integrate
inductively over the variables, this time in the order x_1, ..., x_d:

    ∫_R p(x)K(x, y) dx_1
        = K(x, y) ∫_R p(x_1, x_2, ..., x_d) dx_1        (K is independent of x_1)
        = ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1 | x_2, ..., x_d) ∫_R p(x_1, x_2, ..., x_d) dx_1
        = ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, x_2, ..., x_d),

since p(y_1 | x_2, ..., x_d) = p(y_1, x_2, ..., x_d) / ∫_R p(x_1, x_2, ..., x_d) dx_1. Consequently,

    ∫_R ∫_R p(x)K(x, y) dx_1 dx_2
        = ∫_R ∏_{i=2}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, x_2, ..., x_d) dx_2
        = ∏_{i=3}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) ∫_R p(y_2 | y_1, x_3, ..., x_d) p(y_1, x_2, ..., x_d) dx_2
        = ∏_{i=3}^{d} p(y_i | y_1, ..., y_{i−1}, x_{i+1}, ..., x_d) p(y_1, y_2, x_3, ..., x_d) ⇒ ...,

where we used p(y_2 | y_1, x_3, ..., x_d) = p(y_1, y_2, x_3, ..., x_d) / ∫_R p(y_1, x_2, x_3, ..., x_d) dx_2.
Proceeding by inductively integrating over x_3, ..., x_d, we eventually obtain
∫_{R^d} p(x)K(x, y) dx = p(y). Therefore the balance equation holds. □
Single component Gibbs sampler
1 Fix x^{(0)} ∈ R^d and set k = 0.
2 Set x = x^{(k)}.
(i) Set j = 1.
(ii) Draw t from the univariate density p(y_j | y_1, ..., y_{j−1}, x_{j+1}, ..., x_d)
and set y_j = t.
(iii) If j = d, set y = (y_1, ..., y_d) and terminate the inner loop. Otherwise,
set j ← j + 1 and return to step (ii).
3 Set x^{(k+1)} = y, increase k ← k + 1 and return to step 2.
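As a sketch, this procedure can be implemented generically once the problem-specific full conditional samplers are supplied as a function. The bivariate Gaussian example is a stand-in chosen because its full conditionals are known in closed form:

```python
import numpy as np

def gibbs(sample_conditional, x0, n_samples, rng=None):
    """Single component Gibbs sampler.

    sample_conditional(j, x, rng) must return one draw from the full
    conditional density of component j given all other components of x;
    supplying these samplers is the problem-specific part.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for k in range(n_samples):
        for j in range(x.size):          # inner loop: update one component at a time
            x[j] = sample_conditional(j, x, rng)
        samples[k] = x                   # x^{(k+1)} = y
    return samples

# Illustration: bivariate Gaussian with unit variances and correlation rho,
# for which x_1 | x_2 ~ N(rho * x_2, 1 - rho^2) and symmetrically for x_2.
rho = 0.8
def cond(j, x, rng):
    return rho * x[1 - j] + np.sqrt(1 - rho**2) * rng.standard_normal()

chain = gibbs(cond, [0.0, 0.0], 20000, rng=np.random.default_rng(0))
```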
Example
Let us consider the density from before,

    p(x_1, x_2) = (1/Z) exp(−10(x_1² − x_2)² − (x_2 − 1/4)⁴),

where the normalizing constant is Z = 1.1813...
This time we use the Gibbs sampler. To sample the univariate densities
that arise in the process, we use inverse transform sampling. In this case,
the explicit algorithm we use is written below.
Fix x^{(0)} ∈ R² and set x = x^{(0)};
For k = 1, ..., N, do
    Calculate Φ_1(t) = ∫_{−∞}^{t} p(x_1, x_2) dx_1;
    Draw u ∼ U([0, 1]), set x_1 = Φ_1^{−1}(u);
    Calculate Φ_2(t) = ∫_{−∞}^{t} p(x_1, x_2) dx_2;
    Draw u ∼ U([0, 1]), set x_2 = Φ_2^{−1}(u);
    Set x^{(k)} = x.
End
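A possible NumPy realization of this algorithm approximates the conditional CDFs Φ_i by cumulative trapezoidal sums on a grid and inverts them by interpolation. The grid resolution and the truncation interval [−2.5, 2.5] are implementation assumptions, not part of the slides:

```python
import numpy as np

# Unnormalized target from the example; the constant Z cancels when the
# conditional CDFs are normalized, so it is not needed.
def p(x1, x2):
    return np.exp(-10.0 * (x1**2 - x2)**2 - (x2 - 0.25)**4)

def inverse_cdf_sample(density_1d, u, grid):
    """Inverse transform sampling on a grid: build the normalized CDF Phi
    by cumulative trapezoidal sums and invert it at u by interpolation."""
    vals = density_1d(grid)
    cdf = np.concatenate(([0.0], np.cumsum((vals[1:] + vals[:-1]) / 2 * np.diff(grid))))
    cdf /= cdf[-1]
    keep = np.concatenate(([True], np.diff(cdf) > 0))  # drop flat (zero-density) stretches
    return np.interp(u, cdf[keep], grid[keep])

rng = np.random.default_rng(0)
grid = np.linspace(-2.5, 2.5, 2001)   # truncation interval is an assumption
x = np.array([0.0, 0.0])
samples = np.empty((5000, 2))
for k in range(5000):
    x[0] = inverse_cdf_sample(lambda t: p(t, x[1]), rng.uniform(), grid)
    x[1] = inverse_cdf_sample(lambda t: p(x[0], t), rng.uniform(), grid)
    samples[k] = x
```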
Single component Gibbs sampler with 5000 samples
[Figure: scatter plot of the samples in the (x_1, x_2) plane and the sample histories of x_1 and x_2 over 5000 iterations.]
Computational remarks about MCMC
As a general rule of thumb, one should aim at roughly 30%
acceptance rates when using Gaussian (or close to Gaussian) proposal
and target densities with MH.
It usually takes the Markov chain a number of iterations to reach the
steady state. It is therefore usually advisable to discard the first
N_0 samples, since they may not be representative of the target
distribution; this is the so-called “burn-in” period. The length of the
burn-in period varies depending on the application, but one might, for
example, throw away the first 5–10% of the steps for a sufficiently
large sample size.
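In code, discarding the burn-in is a simple slice once the chain is stored as an array; the synthetic chain below merely stands in for real MCMC output:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
# Synthetic chain: an initial drift toward the steady state, followed by
# stationary samples (a stand-in for real MCMC output).
chain = np.concatenate([np.linspace(5.0, 0.0, 500),
                        rng.standard_normal(n - 500)])

n_burn = int(0.10 * n)        # discard the first ~10% as burn-in
post_burn = chain[n_burn:]    # use only these samples for estimates
```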
In MH, using a Gaussian kernel (e.g., random walk
Metropolis–Hastings) is a popular choice due to the ease of
implementation. While it is a safe choice, it does not take into
account the form of the posterior density. To increase efficiency, it is
advisable to take the shape of the density into account when
designing the proposal density. In the high-dimensional setting, this is
especially useful if the posterior density is anisotropic (stretched in
some directions).
Computational remarks about MCMC
The proposal distribution in MH can also be updated while the
sampling algorithm moves around the posterior density. This process
is called adaptation.
Visual assessment: we are aiming for independent sample points, whose
sample histories look like a “fuzzy worm”. One could aim for something
like the Gaussian white noise signal below:

[Figure: sample history of a Gaussian white noise signal and its sample
autocorrelation function, normalized by γ_0 = N^{−1} ∑_{j=1}^{N} z_j².]
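For the autocorrelation part of this check, the sample autocorrelation function can be estimated as follows; centering the chain by its sample mean is an implementation choice:

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelation of a chain z, normalized by
    gamma_0 = N^{-1} * sum_j z_j^2 (computed after centering z).
    For an independent chain, all lags k >= 1 should be near zero."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean()
    n = z.size
    gamma0 = np.sum(z**2) / n
    return np.array([np.sum(z[:n - k] * z[k:]) / (n * gamma0)
                     for k in range(max_lag + 1)])

# White noise: the ideal "fuzzy worm" reference signal.
acf = sample_acf(np.random.default_rng(0).standard_normal(5000), 50)
```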
Preconditioned Crank–Nicolson algorithm
Given the current state x, the pCN proposal takes the form

    y = √(1 − β²) x + β ξ,   ξ ∼ N(0, C),

where C is the covariance of the Gaussian prior/reference measure.
Here, the step size 0 < β < 1 is a free parameter (which can be
optimized for statistical efficiency).
The pCN method is dimension robust: the acceptance probability
does not degenerate to zero as the dimension d → ∞. Contrast this
with, e.g., random walk Metropolis, whose acceptance probability
degenerates to zero as the dimension d → ∞.
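A minimal pCN sketch under simplifying assumptions: the prior is N(0, I) (identity covariance) and the posterior has the form exp(−Φ(x)) times the prior, so the acceptance probability involves only the negative log-likelihood Φ. The quadratic Φ and dimension d = 50 below are illustrative:

```python
import numpy as np

def pcn(phi, x0, n_samples, beta=0.2, rng=None):
    """Preconditioned Crank-Nicolson sampler for a posterior of the form
    exp(-phi(x)) * N(0, I), where phi is the negative log-likelihood and
    N(0, I) is the prior (identity covariance assumed for simplicity).

    Proposal: y = sqrt(1 - beta^2) * x + beta * xi with xi ~ N(0, I).
    This proposal preserves the prior, so the acceptance probability
    min(1, exp(phi(x) - phi(y))) involves only the likelihood and does
    not degenerate as the dimension grows.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    phi_x = phi(x)
    for k in range(n_samples):
        y = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(x.size)
        phi_y = phi(y)
        if np.log(rng.uniform()) < phi_x - phi_y:
            x, phi_x = y, phi_y
        samples[k] = x
    return samples

# Illustration in d = 50: Gaussian likelihood centered at 1 per coordinate,
# so the posterior is N(1/2, 1/2) coordinatewise.
chain = pcn(lambda x: 0.5 * np.sum((x - 1.0)**2), np.zeros(50), 20000,
            rng=np.random.default_rng(0))
```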
Further variations of MCMC
We have only scratched the surface of some basic ideas surrounding
MCMC methods. In the literature and practical applications, one can find
many variations of these ideas to boost the performance of MCMC for
“difficult”/“high-dimensional” problems. To list a couple of notable ones:
Adaptive Metropolis: as the proposal density q(x, y ), use a random walk model
Y = X + W with W ∼ N (0, Γ), where the covariance Γ is replaced by the sample
covariance (plus some small perturbation of identity) computed using the sample
history. The updating can happen either at every step or after every M steps of
the Metropolis iteration. The main theoretical challenge is proving the ergodicity
of the chain—this was proved by Haario, Saksman, and Tamminen (2001).
Computationally, stable updating formulae for the sample means and covariances
are needed in practice.
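One such stable updating scheme is a Welford-style rank-one update of the running mean and covariance, sketched below; the biased (1/k) normalization of the covariance is a choice made here to keep the recursion simple:

```python
import numpy as np

def update_mean_cov(mean, cov, x_new, k):
    """Welford-style rank-one update of the running sample mean and the
    biased (1/k-normalized) sample covariance after the k-th sample,
    k >= 1. Avoids storing or re-scanning the whole sample history."""
    delta = x_new - mean
    new_mean = mean + delta / k
    new_cov = cov + (np.outer(delta, x_new - new_mean) - cov) / k
    return new_mean, new_cov

# Stream 5000 correlated 2D samples and compare against the batch result.
rng = np.random.default_rng(0)
xs = rng.standard_normal((5000, 2)) @ np.array([[1.0, 0.0], [0.5, 1.0]])
mean, cov = np.zeros(2), np.zeros((2, 2))
for k, x in enumerate(xs, start=1):
    mean, cov = update_mean_cov(mean, cov, x, k)
```

In adaptive Metropolis, the proposal covariance Γ would then be taken as this running covariance (suitably scaled, plus a small multiple of the identity) and refreshed every step or every M steps.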
Independence Metropolis: as the proposal density q(x, y ), use a probability density
that is independent of the previous sample x, i.e., q(x, y ) = q(y ). The proposal
density should be similar to the target density.
Metropolis-within-Gibbs, Delayed rejection adaptive Metropolis, . . .
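For instance, the independence Metropolis variant mentioned above can be sketched as follows; note that the acceptance ratio reduces to a ratio of importance weights p(y)q(x)/(p(x)q(y)). The N(1, 1) target and wider N(0, 2²) proposal are illustrative choices:

```python
import numpy as np

def independence_mh(log_p, log_q, draw_q, n_samples, rng=None):
    """Independence Metropolis: the proposal ignores the current state,
    q(x, y) = q(y), so alpha(x, y) = min(1, p(y)q(x) / (p(x)q(y))).
    log_p and log_q may both be unnormalized."""
    rng = np.random.default_rng() if rng is None else rng
    x = draw_q(rng)
    w_x = log_p(x) - log_q(x)            # log importance weight of x
    samples = np.empty(n_samples)
    for k in range(n_samples):
        y = draw_q(rng)
        w_y = log_p(y) - log_q(y)
        if np.log(rng.uniform()) < w_y - w_x:
            x, w_x = y, w_y
        samples[k] = x
    return samples

# Target N(1, 1); proposal N(0, 2^2), wide enough to cover the target.
chain = independence_mh(
    lambda x: -0.5 * (x - 1.0)**2,       # log p up to a constant
    lambda x: -0.5 * (x / 2.0)**2,       # log q up to a constant
    lambda rng: 2.0 * rng.standard_normal(),
    20000,
    rng=np.random.default_rng(0),
)
```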
Software: [Link]
[Link]