MCMC Algorithms: Metropolis & Gibbs Sampling

The document provides a comprehensive guide to Markov Chain Monte Carlo (MCMC) methods, focusing on how to construct chains that sample from desired distributions. It covers key algorithms such as the Metropolis and Metropolis-Hastings algorithms, explaining their mechanics, acceptance probabilities, and the importance of detailed balance. Additionally, it discusses practical considerations for implementing these algorithms effectively.

Uploaded by

Sankalp Savarn

Markov Chain Monte Carlo (MCMC): How to Build a Chain That Samples From What We Want

January 18, 2026


Contents

Recap: What We Learnt Before

The Big Question

Introduction to The MCMC Strategy

1 The Metropolis Algorithm: The Simplest MCMC Method
1.1 The Intuition: Climbing a Probability Mountain
1.2 The Requirement: Symmetric Proposals
1.3 Deriving the Acceptance Probability
1.4 What Does This Acceptance Ratio Mean?
1.5 The Metropolis Algorithm: Step by Step
1.6 Why It Works: Verification of Detailed Balance
1.7 Real Example: Inferring How Biased a Coin Is

2 Metropolis-Hastings: When We Have Biased Proposals
2.1 The Motivation: Sometimes Symmetry Isn't Natural
2.2 Deriving the Acceptance Ratio for Asymmetric Proposals
2.3 Understanding the Proposal Ratio Correction
2.4 The Metropolis-Hastings Algorithm
2.5 Real Example: A Skewed Distribution

3 Gibbs Sampling: When Conditional Distributions Are Easy [Optional Section]
3.1 A Different Approach: Update One Variable at a Time
3.2 The Mathematical Insight: Gibbs Is a Special Case of Metropolis-Hastings
3.3 Why This Works
3.4 The Gibbs Sampling Algorithm
3.5 Why this works
3.6 Strengths and Limitations of Gibbs

4 Comparing the Three Algorithms
4.1 When to use each

5 Practical Reality: Making These Algorithms Work
5.1 Burn-In: Forgetting Where We Started
5.2 Tuning: Acceptance Rates and Step Sizes
5.3 Checking Convergence: Trace Plots
5.4 Limitations: When MCMC Struggles

6 Looking Forward: Beyond the Basics
6.1 The Next Generation: Modern MCMC Methods

Recap: What We Learnt Before

Before we dive into new material, let’s remember what we discovered in the previous
week. Don’t worry if these ideas feel fuzzy—we will go through them slowly again.

The Stationary Distribution: A Simple Example

In the previous week, we met something called the stationary distribution. Let’s use
the frog example we know well.

Imagine a frog jumping between three lily pads: Lily Pad A, Lily Pad B, and Lily Pad
C. Every time the frog is at a lily pad, it follows a pattern of jumping to the next pad.

After the frog jumps around for a very long time (say, 1 million jumps), we count how
often it spends time at each lily pad:

• It spends 50% of its time at Lily Pad A

• It spends 30% of its time at Lily Pad B

• It spends 20% of its time at Lily Pad C

We write this as a vector (a list of numbers):

π = [0.5, 0.3, 0.2]

This vector π (pronounced ”pi”) is called the stationary distribution. It tells us the
long-term probability of finding the frog at each lily pad.

Here’s what’s remarkable: no matter where the frog starts (at Lily Pad A, B, or C), after
enough time, these percentages stay the same. The frog forgets where it started and
settles into this pattern.

What Each Symbol Means

Let’s explain the symbols we will use so we are on the same page:

• πi is the probability the frog is at lily pad i. So π1 = 0.5 means 50% chance the
frog is at Lily Pad A.

• pij is the jumping rule: the probability of jumping from pad i to pad j. For example,
pAB is the probability of jumping from Lily Pad A to Lily Pad B.

All these jumping rules together form what we call the transition matrix P . When we
multiply the stationary distribution π by the transition matrix P , we get π back again.
In symbols:
π · P = π

Let’s break down what this means:

• π on the left side is our current distribution (where the frog is likely to be right
now).

• P is the jumping rules (what the frog does next).

• After applying the jumping rules, we get π again on the right side.

In other words: if the frog follows this pattern long enough to reach the stationary
distribution, then following the jumping rules doesn’t change the distribution. The system
is in perfect balance.

Detailed Balance: The Engine Behind Stationarity

Now comes an important detail. Not all jumping patterns lead to a stationary distribu-
tion. There’s a special condition that guarantees it will work.
That condition is called detailed balance, and here’s what it means in plain language:

The number of frogs jumping from Lily Pad A to Lily Pad B (in a given time period)
equals the number jumping from Lily Pad B back to Lily Pad A.
Imagine we have 1,000 frogs at Lily Pad A and 100 frogs at Lily Pad B. If detailed balance
holds:

• (Number of frogs from A jumping to B) = (Number of frogs from B jumping to A)

• 1,000 × 0.1 = 100 × 1

• 100 frogs leave A, and 100 frogs leave B heading back to A

• The populations stay balanced.

In mathematical symbols, detailed balance is written as:

πi · pij = πj · pji

Let us understand each part:

• πi : how many frogs are at lily pad i.

• pij : the jumping probability from pad i to pad j.

• πi · pij : the number of frogs jumping from i to j.

• πj · pji : the number of frogs jumping back from j to i.

• Setting them equal means: the flows balance in both directions.

This is powerful: if we can design jumping rules that satisfy detailed balance, we automatically get a stationary distribution.
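To make this concrete, here is a small numeric check with a hypothetical transition matrix (our own construction for illustration, not one given in the lectures), built so that detailed balance holds for π = [0.5, 0.3, 0.2]:

```python
pi = [0.5, 0.3, 0.2]            # stationary distribution over pads A, B, C
P = [[0.75, 0.15, 0.10],        # jumping rules from A: stay, ->B, ->C
     [0.25, 0.65, 0.10],        # jumping rules from B
     [0.25, 0.15, 0.60]]        # jumping rules from C

# Detailed balance: pi_i * p_ij == pi_j * p_ji for every pair of pads
for i in range(3):
    for j in range(3):
        assert abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12

# Consequence: applying the jumping rules leaves pi unchanged (pi . P = pi)
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pi_next, pi))
```

Note that we only had to balance the pairwise flows; stationarity then came for free, which is exactly the point of the detailed balance condition.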

The Big Question: How Do We Design the Jumping Rules?

Here’s where the previous week left us with a puzzle. We know:

• If detailed balance holds, we get a stationary distribution.

• We want to sample from some target distribution (like inferring a coin’s bias or
estimating a particle’s mass).

But how do we actually construct the jumping rules pij to make the stationary distribution
equal to the distribution we want to sample from?

This is the central question of our current week. The answer is elegant: we don’t specify
all the jumping rules directly. Instead, we use a proposal-and-accept-reject
strategy, which is the basis of Markov Chain Monte Carlo [MCMC] methods.

Introduction to The MCMC Strategy

In this week, we will learn three practical algorithms that use the detailed balance principle to construct Markov chains that sample from any target distribution we want. These algorithms form the foundation of Markov Chain Monte Carlo (MCMC) methods.
All the algorithms in this week follow the same basic pattern. Let’s understand this
pattern before we look at specific algorithms.

The Basic Loop: Propose and Accept

Here’s what happens at each step of an MCMC algorithm:

1. We are at some current state: Call it xt. This might be a guess for the coin's bias, or a proposed parameter value.

2. We propose a new state: We suggest a candidate value x′ using some proposal rule. This might be "jump left or right by a random amount."

3. We compute an acceptance probability: We calculate α, a number between 0 and 1. This number tells us: "Should we accept this proposal?" Higher α means "this is a good proposal, probably accept it."

4. We accept or reject: We flip a coin (metaphorically). If the coin says "yes" (with probability α), we move: xt+1 = x′. If the coin says "no", we stay put: xt+1 = xt.

5. We repeat: Now we are at xt+1 , and we do the same thing again.

After many iterations, we have a chain of states: x0, x1, x2, . . . , xN. If we designed the
acceptance probability correctly, these states behave as samples from the distribution we
want.

The Secret: Detailed Balance Sets the Acceptance Probability

The magic is in step 3. How do we choose the acceptance probability α?


The answer: We derive it from the detailed balance condition.
Remember: if detailed balance holds, we are guaranteed to get the right stationary distribution.

So we work backwards: we ask, "What acceptance probability will make detailed balance true?" And that's the acceptance probability we use.

Different algorithms use different proposal strategies, which leads to different acceptance
formulas. But all of them are derived from the same principle: detailed balance.

1 The Metropolis Algorithm: The Simplest MCMC Method

1.1 The Intuition: Climbing a Probability Mountain

Let’s think of probability as a landscape.

Figure 1: Probability Landscape [Height represents probability density.]

Imagine we are standing on a mountain range.

• High places (peaks) represent high probability.

• Low places (valleys) represent low probability.

• We are currently at location xt with height (probability) π(xt ).

Our goal: explore this landscape so we spend time at each location proportional to its
height. We want to spend more time on the tall peaks and less time in the valleys.

Here’s the intuitive principle for Metropolis:

If we propose moving to a higher place (uphill), we should almost always go there. Spend
more time on tall peaks!

If we propose moving to a lower place (downhill), we should sometimes go there, but hesitantly. We need to escape local peaks to find better ones, but we don't want to waste time in valleys.

A Concrete Example

We are at a spot with probability π(xt) = 0.8. We propose moving to a spot with probability π(x′) = 0.4.

This is downhill. The ratio is 0.4/0.8 = 0.5. So we accept the move with 50 percent probability. Half the time we make the move, half the time we stay put.

If we proposed a spot with probability π(x′) = 0.9, this is uphill. We would accept with probability min(1, 0.9/0.8) = 1, meaning we always go there.

This is the essence of Metropolis.

1.2 The Requirement: Symmetric Proposals

For Metropolis to work cleanly, we need to use a symmetric proposal.

What does symmetric mean? It means: the probability of jumping from xt to x′ equals the probability of jumping backwards from x′ to xt.

Example: Gaussian Random Walk

A common symmetric proposal is the Gaussian random walk:

x′ = xt + ϵ, where ϵ ∼ N(0, σ²)

What does this mean?

• We draw a random number ϵ from a normal (bell-curve) distribution.

• This random number has mean 0 and standard deviation σ.

• We add it to our current state xt .

• The result x′ is our proposal.

Why is this symmetric?

• The probability of jumping from xt to x′ depends only on the distance |x′ − xt |.

• The probability of jumping backwards from x′ to xt also depends on |xt − x′ |.

• Since the distances are the same, the probabilities are equal!

If we jump +5 units forward with some probability, we can jump −5 units backward with
the exact same probability. That’s symmetry.
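This symmetry can be checked numerically with the Gaussian density itself (the particular states 1.3 and 2.8 below are arbitrary illustrative values):

```python
import math

def q(x_new, x_old, sigma=1.0):
    """Gaussian random-walk proposal density: N(x_old, sigma^2) evaluated at x_new."""
    z = (x_new - x_old) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Forward and backward jumps have identical density, because the density
# depends only on the squared distance between the two states.
xt, xp = 1.3, 2.8
assert abs(q(xp, xt) - q(xt, xp)) < 1e-15
```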

Why Symmetry Matters

If our proposal were asymmetric (biased in one direction), that bias would interfere with
detailed balance. We would need to mathematically correct for it. Metropolis avoids this
complication by requiring symmetry from the start.

When the proposal is asymmetric, we use the more general Metropolis-Hastings algorithm
(coming next).

1.3 Deriving the Acceptance Probability

Now for the key derivation. We will figure out what acceptance probability ensures
detailed balance.

Figure 2: Detailed Balance

The Transition Probability: Two Pieces

When we move from xt to x′ , two things happen:

1. We propose a move using the proposal distribution. The probability is q(x′ |xt ).

2. We accept or reject. The probability is α(x′ , xt ).

The full transition probability (chance of actually moving from xt to x′ ) is:

p(x′ |xt ) = q(x′ |xt ) · α(x′ , xt )

Similarly, moving backwards from x′ to xt :

p(xt |x′ ) = q(xt |x′ ) · α(xt , x′ )

Applying Detailed Balance

We want detailed balance to hold. Recall what that means: the flow from xt to x′ equals
the flow from x′ back to xt . In symbols:

π(xt ) · p(x′ |xt ) = π(x′ ) · p(xt |x′ )

Substituting our transition probabilities:

π(xt ) · q(x′ |xt ) · α(x′ , xt ) = π(x′ ) · q(xt |x′ ) · α(xt , x′ )

Since our proposal is symmetric, q(x′ |xt ) = q(xt |x′ ). These cancel:

π(xt ) · α(x′ , xt ) = π(x′ ) · α(xt , x′ )

Rearranging:

α(x′, xt) / α(xt, x′) = π(x′) / π(xt)

Choosing the Acceptance Probability

There are many ways to satisfy this ratio. One elegant choice is:

α(x′, xt) = min(1, π(x′) / π(xt))

This reads as: ”accept with probability equal to the ratio of the new probability to the
old probability, but cap it at 1.”

Let's verify this works:

• Uphill case: If π(x′) > π(xt), then π(x′)/π(xt) > 1, so α(x′, xt) = 1. We always accept. And α(xt, x′) = π(xt)/π(x′) (which is less than 1).
Check: π(xt) · 1 = π(x′) · [π(xt)/π(x′)] = π(xt). Correct!

• Downhill case: If π(x′) < π(xt), then π(x′)/π(xt) < 1, so α(x′, xt) = π(x′)/π(xt). We accept with this probability. And α(xt, x′) = 1.
Check: π(xt) · [π(x′)/π(xt)] = π(x′) = π(x′) · 1. Correct!

Therefore:

α(x′, xt) = min(1, π(x′) / π(xt))

This is the Metropolis acceptance ratio. It emerges from requiring detailed balance.

1.4 What Does This Acceptance Ratio Mean?

Now that we have derived it, let’s understand its purpose.

– Always Climb Uphill

When π(x′) > π(xt), we have α = 1. We always accept uphill moves.
Why? Because the chain should naturally gravitate toward high-probability regions. By always accepting uphill moves, we ensure thorough exploration of the peaks.

– Occasionally Go Downhill

When π(x′) < π(xt), we have α = π(x′)/π(xt). We accept with this probability.
Why accept downhill moves at all? Because the chain needs to escape local peaks. If we never went downhill, we would get stuck at the nearest peak and never discover better peaks elsewhere.

But by accepting downhill moves with probability proportional to the probability ratio, we create a balance. The long-run visitation frequency at each location matches its probability exactly. This is the genius of detailed balance.

Concrete Example

We are at a location with unnormalized probability f(xt) = 20. We propose a move to a location with f(x′) = 10.

Acceptance ratio:

α = 10/20 = 0.5

We accept with 50% probability. We draw a random number from 0 to 1. If it's less than 0.5, we accept. Otherwise, we stay put.

This 50% rate is exactly what detailed balance requires to balance probability flows.

1.5 The Metropolis Algorithm: Step by Step

Starting out:

1. Pick any starting point x0 . It doesn’t matter—the chain will forget where it started.

2. Set t = 0 (our step counter).

3. Choose a proposal standard deviation σ. (This controls how far we jump. We will
adjust this later.)

Main loop (repeat for N iterations):

1. Propose a jump:
x′ = xt + ϵ, where ϵ ∼ N(0, σ²)

Draw a random noise ϵ and add it to get the proposal.

2. Calculate acceptance probability:

α = min(1, π(x′) / π(xt))

In practice, we often use the unnormalized probability f (x):

α = min(1, f(x′) / f(xt))

3. Accept or reject:

• Draw a random number u between 0 and 1.


• If u < α, we accept: xt+1 = x′ .
• If u ≥ α, we reject: xt+1 = xt (stay put).

After all iterations:

• We have a chain: x0 , x1 , . . . , xN .

• Discard the first B samples (burn-in) to let the chain forget its starting point.

• Use the remaining N − B samples as our draws from the distribution we want.
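The steps above can be sketched in a short Python function. Everything below (the standard-normal test target, σ = 1, the iteration counts, the seed) is an illustrative choice, not a prescription from the text; working with log f instead of f is a standard trick to avoid numerical underflow and does not change the algorithm:

```python
import math
import random

def metropolis(log_f, x0, sigma, n_iter, burn_in, seed=0):
    """Random-walk Metropolis sampler. log_f is the log of the (possibly
    unnormalized) target density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for t in range(n_iter):
        x_prop = x + rng.gauss(0.0, sigma)        # step 1: symmetric Gaussian proposal
        log_alpha = log_f(x_prop) - log_f(x)      # step 2: log of pi(x') / pi(x)
        if rng.random() < math.exp(min(0.0, log_alpha)):  # step 3: accept w.p. min(1, ratio)
            x = x_prop                            # accept the move
        samples.append(x)                         # on reject, x is unchanged: we stay put
    return samples[burn_in:]                      # discard the burn-in prefix

# Sanity check on a standard normal target, f(x) proportional to exp(-x^2 / 2)
samples = metropolis(lambda x: -0.5 * x * x, x0=5.0, sigma=1.0,
                     n_iter=20000, burn_in=2000)
mean = sum(samples) / len(samples)
```

Even though the chain starts far out at x0 = 5, after burn-in the sample mean settles near 0, the mean of the target.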

1.6 Why It Works: Verification of Detailed Balance

We derived the acceptance ratio by requiring detailed balance. Let’s verify it actually
maintains that balance.

With acceptance ratio α and symmetric proposal, detailed balance becomes:

π(xt) · min(1, π(x′)/π(xt)) = π(x′) · min(1, π(xt)/π(x′))

Uphill (π(x′) > π(xt)):

• Left: π(xt) · 1 = π(xt)

• Right: π(x′) · [π(xt)/π(x′)] = π(xt)

• They match as LHS = RHS.

Downhill (π(x′) < π(xt)):

• Left: π(xt) · [π(x′)/π(xt)] = π(x′)

• Right: π(x′) · 1 = π(x′)

• They match as LHS = RHS again.

Detailed balance is satisfied. Therefore, π is the stationary distribution, and we sample correctly.
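Both cases reduce to the same fact: the forward and backward flows each equal min(π(xt), π(x′)). This can be checked numerically for arbitrary pairs of target values (the pairs below are illustrative):

```python
# For any pair of target values, the forward flow pi(x_t) * alpha(x', x_t)
# equals the backward flow pi(x') * alpha(x_t, x') -- both are min(pi_t, pi_p).
for pi_t, pi_p in [(0.8, 0.4), (0.4, 0.8), (20.0, 10.0), (1e-6, 3.0)]:
    flow_forward = pi_t * min(1.0, pi_p / pi_t)
    flow_backward = pi_p * min(1.0, pi_t / pi_p)
    assert abs(flow_forward - flow_backward) < 1e-12
```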

1.7 Real Example: Inferring How Biased a Coin Is

Let’s apply Metropolis to a real problem.

We flip a coin 100 times and get 61 heads and 39 tails. We want to estimate the coin’s
bias—that is, the probability θ that it lands heads.

Setting Up the Problem

In Bayesian inference (which we learned in Week 4), we combine prior beliefs with data
using this formula:
Posterior = (Likelihood × Prior) / (Normalizing Constant)
Or, usually written as a proportionality:

P (θ|data) ∝ P (data|θ) · P (θ)

• Prior: We assume no bias toward heads or tails, so the prior is uniform:

P(θ) = 1

• Likelihood: Given a bias θ, the probability of 61 heads in 100 flips follows a binomial distribution:

P(data|θ) ∝ θ^61 (1 − θ)^39

• Posterior: Combining them:

P(θ|data) ∝ 1 · θ^61 (1 − θ)^39

So we sample from the target distribution:

f(θ) = θ^61 (1 − θ)^39

Running the Algorithm

We start at θ0 = 0.5 and run 11,000 iterations with proposal standard deviation σ = 0.05.
We discard the first 1,000 as burn-in.

Results:

• Acceptance rate: 43.1% (good! We want around 44% for one-dimensional problems).

• Posterior mean: 0.610 (our estimate of the coin’s bias).

• 95% Credible Interval: [0.516, 0.705] (We are 95% confident the true bias is in
this range).

Notice: the observed frequency (61/100 = 0.61) matches our posterior mean. Metropolis
found the right answer!
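The whole experiment fits in a short script. The sketch below follows the setup described above; the seed is our own arbitrary choice, so the exact decimals will differ slightly from the numbers quoted, but the posterior mean should land near 0.61 (with a uniform prior the exact posterior is Beta(62, 40), whose mean is 62/102 ≈ 0.608):

```python
import math
import random

def log_f(theta):
    """Log of the unnormalized posterior f(theta) = theta^61 * (1 - theta)^39."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf                      # posterior is zero outside (0, 1)
    return 61 * math.log(theta) + 39 * math.log(1.0 - theta)

rng = random.Random(1)                        # arbitrary seed
theta = 0.5                                   # starting point theta_0
chain = []
for t in range(11000):
    prop = theta + rng.gauss(0.0, 0.05)       # symmetric Gaussian proposal, sigma = 0.05
    if rng.random() < math.exp(min(0.0, log_f(prop) - log_f(theta))):
        theta = prop                          # accept; otherwise stay put
    if t >= 1000:                             # discard the first 1,000 as burn-in
        chain.append(theta)

post_mean = sum(chain) / len(chain)           # should land near 0.61
```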

2 Metropolis-Hastings: When We Have Biased Proposals

2.1 The Motivation: Sometimes Symmetry Isn’t Natural

Metropolis works great when we can use a symmetric proposal. But in many real situations, we have information suggesting an asymmetric (biased) proposal would be better.

For example:

• We might know roughly where the high-probability region is and want to propose
in that direction.

• We might have a preliminary estimate from a related problem and want to propose
near it.

• We might want to use gradient information (which direction increases probability).

The problem: If our proposal is asymmetric, the simple Metropolis formula no longer
maintains detailed balance. We need to correct for the asymmetry.

That’s where Metropolis-Hastings comes in.

Intuition: Two Biased Cities

Imagine two cities connected by transportation networks with different quality in each
direction.

Figure 3: Asymmetric Proposal

• Forward (City A to City B): There’s a nice highway. Many people travel this
way. q(B|A) is large.

• Backward (City B to City A): There’s a dirt road. Few people travel this way.
q(A|B) is small.

The proposal is biased forward. To maintain population balance (which is detailed balance), we must:

• Accept forward trips less often.

• Accept backward trips more often.

The correction is the inverse of the bias ratio. If forward is 10 times easier, we accept
forward moves 10 times less often.

2.2 Deriving the Acceptance Ratio for Asymmetric Proposals

Let’s derive the acceptance probability when proposals can be asymmetric.

The transition probability is still:

p(x′|xt) = q(x′|xt) · α(x′, xt)

Detailed balance requires:

π(xt ) · q(x′ |xt ) · α(x′ , xt ) = π(x′ ) · q(xt |x′ ) · α(xt , x′ )

Rearranging (and this is the key step):

α(x′, xt) / α(xt, x′) = [π(x′)/π(xt)] · [q(xt|x′)/q(x′|xt)]

Now we have two factors:

1. Target ratio: π(x′)/π(xt) (compares probabilities, like Metropolis).

2. Proposal ratio: q(xt|x′)/q(x′|xt) (corrects for proposal asymmetry).

To satisfy this for any target and proposal, we choose:

α(x′, xt) = min(1, [π(x′)/π(xt)] · [q(xt|x′)/q(x′|xt)])

This is the Metropolis-Hastings acceptance ratio.

2.3 Understanding the Proposal Ratio Correction

Let’s break down what’s happening:

Correction = q(backward) / q(forward)

The proposal ratio correction works like this:

• If the forward proposal is heavily biased (q(forward) > q(backward)), then the ratio
is small. This pulls down the acceptance probability, compensating for the forward
bias.

• If the backward proposal is favored (q(backward) > q(forward)), then the ratio is
large. This pulls up the acceptance, compensating for the backward bias.

In both cases, the correction ensures detailed balance.

Special Case: Metropolis is Metropolis-Hastings with Symmetry

If the proposal is symmetric, q(x′|xt) = q(xt|x′), so the ratio becomes 1:

α = min(1, [π(x′)/π(xt)] · 1)

This is exactly Metropolis! So Metropolis is just a special case of Metropolis-Hastings.

2.4 The Metropolis-Hastings Algorithm

Starting out:

1. Pick a starting point x0 .

2. Set t = 0.

3. Choose a proposal distribution q(x′ |xt ) (which can be asymmetric).

Main loop (for N iterations):

1. Propose:
x′ ∼ q(x′ |xt )

Draw from the proposal distribution.

2. Calculate acceptance:

α = min(1, [f(x′)/f(xt)] · [q(xt|x′)/q(x′|xt)])

where f is the unnormalized target.

3. Accept or reject:

• Draw u ∼ U (0, 1).


• If u < α, set xt+1 = x′ (accept).
• Otherwise, set xt+1 = xt (reject).

After N iterations:

• Discard burn-in.

• Use remaining samples.

2.5 Real Example: A Skewed Distribution

Consider a skewed distribution (asymmetric, concentrated on one side):

f(x) ∝ x^2 e^(−x) (for x > 0)

This is a Gamma(3, 1) distribution, with mode x = 2 and mean 3.

A symmetric Gaussian random walk proposal would waste many proposals on negative
values (which have zero probability). An asymmetric exponential proposal that favors
positive values would be much more efficient.

We use an asymmetric proposal. This proposal biases jumps toward the positive direction,
matching the distribution’s shape.

Results:

• Acceptance rate: 45.2%

• Empirical mean: 3.012 (theory = 3)

• Empirical std dev: 1.724 (theory = 1.732)

The asymmetric proposal concentrates sampling where the distribution is large, improving
efficiency. The proposal ratio correction ensures detailed balance.
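The text does not spell out the exact exponential proposal used for the quoted results, so the sketch below picks one illustrative option: an independence proposal x′ ∼ Exp(0.4), drawn without reference to the current state. Because q(x′|xt) = q(x′) generally differs from q(xt|x′) = q(xt), this proposal is asymmetric and the proposal-ratio correction is essential:

```python
import math
import random

LAM = 0.4  # proposal rate; Exp(0.4) has a heavier tail than the Gamma(3,1) target

def f(x):
    """Unnormalized target: f(x) = x^2 * exp(-x) for x > 0."""
    return x * x * math.exp(-x) if x > 0.0 else 0.0

def q(x):
    """Density of the Exp(LAM) proposal (always proposes positive values)."""
    return LAM * math.exp(-LAM * x) if x > 0.0 else 0.0

rng = random.Random(2)
x = 2.0
samples = []
for t in range(60000):
    x_prop = rng.expovariate(LAM)             # drawn independently of x: asymmetric
    # Metropolis-Hastings: target ratio times the proposal-ratio correction
    alpha = min(1.0, (f(x_prop) / f(x)) * (q(x) / q(x_prop)))
    if rng.random() < alpha:
        x = x_prop
    if t >= 5000:                             # burn-in
        samples.append(x)

mean = sum(samples) / len(samples)            # Gamma(3,1) has mean 3
```

Note that every proposal is automatically positive, so none are wasted on the zero-probability region x ≤ 0.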

3 Gibbs Sampling: When Conditional Distributions Are Easy [Optional Section]

Note: This entire Section 3 [Gibbs Sampling] is optional. There will be no questions in the assignment from this section. Interested participants [having basic knowledge of joint probability distributions] may read this section.

3.1 A Different Approach: Update One Variable at a Time

Metropolis and Metropolis-Hastings update all variables simultaneously. Gibbs takes a different approach: we update variables one at a time.

The Problem We’re Solving

In earlier sections, we discussed Metropolis-Hastings, which updates all variables together in one big jump. This works okay, but sometimes it's inefficient—imagine trying to find your way in a dark room by taking random jumps. You might miss the good spots.

Gibbs Sampling takes a different, cleverer approach: update variables one at a time.
It’s like feeling your way around the room step by step, which often works much better.

A Simple Two-Variable Scenario

Let’s say we’re trying to learn about two related quantities:

• X: How much it rains today

• Y : How wet the grass is

These two are obviously connected:

• If it rains a lot (high X), the grass is very wet (high Y ).

• If it doesn’t rain (low X), the grass stays dry (low Y ).

• But even without rain, a sprinkler could make the grass wet.

The key insight: X and Y depend on each other, but we can use this dependence to
our advantage in MCMC.

What Does It Mean for Variables to Depend on Each Other?

In probability, when two variables are connected, we describe this using conditional prob-
ability:

• P (X|Y ) = ”the probability distribution of X given that we know Y ”

– Example: ”If the grass is soaking wet, what are the likely rainfall amounts?”
The answer is a probability distribution over possible rainfall values.
– Even if we don’t know the exact rainfall, knowing the grass is wet gives us
information that makes high rainfall more likely than low rainfall.

• P (Y |X) = ”the probability distribution of Y given that we know X”

– Example: ”If it rained heavily, how wet would the grass likely be?” The answer
is a distribution.
– Knowing heavy rain occurred makes high grass wetness more likely.

This is different from P (X) alone (”how much does it typically rain?”) because we’re
conditioning on information about Y .

Sequential Conditional Updates

Here’s the beautiful idea behind Gibbs sampling: Instead of proposing new values
for both X and Y randomly (like in Metropolis-Hastings), we update them
sequentially using their conditional distributions.

The algorithm is simple!

Algorithm:

1. Start at (X0, Y0).

2. Sample X1 from P(X|Y0) [holding Y constant].

3. Sample Y1 from P(Y|X1) [holding X constant].

4. Go back to step 2 with the new (X1, Y1) and repeat the process again and again.

Example:

• Step 1: We have the current rainfall amount X and grass wetness Y .

• Step 2: We freeze the grass wetness at its current value. We ask: ”Given that the
grass is this wet, what should the new rainfall be?” We answer this by sampling
from P (X|Y ). The result is a new rainfall amount X1 .

• Step 3: Now we freeze the rainfall at its new value X1 . We ask: ”Given that
it rained this much, how wet should the grass be?” We answer by sampling from
P (Y |X1 ). The result is a new grass wetness Y1 .

• Step 4: Go back to Step 2 with the new (X1 , Y1 ) and repeat the process.

Why Does This Stay Consistent?

Notice what happens:

• After Step 2, we have a new X1 that’s consistent with the current Y (we asked
”given Y , what’s a good X?”).

• After Step 3, we have a new Y1 that’s consistent with the new X1 (we asked ”given
the new X, what’s a good Y ?”).

This back-and-forth ensures that X and Y stay in the realistic region—we don’t produce
impractical combinations like ”no rain but soaking wet grass” (well, almost never).

Compare this to Metropolis-Hastings [MH]:

• MH: Propose a random jump in any direction. Often lands in low-probability combinations that get rejected.

• Gibbs: Propose by respecting the X-Y relationship. Always accepted because it uses conditional distributions.

Visualisation of Gibbs Sampling

Unlike Metropolis-Hastings, which can jump diagonally, Gibbs sampling moves in straight
lines (orthogonally), updating one coordinate at a time. This creates a ”staircase” path
toward the high-probability region.

Figure 4: Gibbs Sampling Trajectory

The concentric ellipses represent the 2D landscape of (X, Y ) values.

Understanding the Ellipses

Each ellipse is a "contour line"—like a topographic map of a mountain. All points on the same ellipse are equally probable (X, Y) pairs. The centre (where ellipses are tightest) represents the most likely pairs. The outer ellipses represent less likely pairs.

Think of it this way:

• Centre: ”Moderate rain, moderately wet grass” (very realistic).

• Moving outward: ”Light rain, dry grass” or ”Heavy rain, very wet grass” (real-
istic, but less common).

• Far edges: ”No rain, soaking wet grass” (unrealistic—unlikely).

The tilt of the ellipse shows the correlation between X and Y :

• If rain and wetness are strongly linked, the ellipse is tilted (elongated along a
diagonal).

• If they’re independent, the ellipse is circular.

Intuitive Understanding of the figure

Let's trace through what's happening in Figure 4 using our rain/grass intuition. Let the X-axis represent Rainfall and the Y-axis represent Grass Wetness.

Iteration 1:

1. Start: (rainfall = -2.0, wetness = -1.5) Red Star.

2. Hold wetness at -1.5. Ask: ”Given grass wetness = -1.5, what’s the likely
rainfall?”

• Sample from P (X|Y = −1.5) → Get X1 = −0.9.


• Horizontal move (X changes) from red star → Blue Dot at (-0.9, -1.5).

3. Hold rainfall at -0.9. Ask: ”Given rainfall = -0.9, what’s the likely grass wetness?”

• Sample from P (Y |X = −0.9) → Get Y1 = −0.92.


• Vertical move (Y changes) → Green Dot at (-0.9, -0.92).

Repeat this process again and again to get iteration 2, 3, and so on.

Each step stays on the ellipse because we’re using the conditional distributions. The
alternating horizontal (update X) and vertical (update Y) moves create a ”staircase”
pattern that efficiently explores the realistic region.
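The figure's ellipses suggest a correlated bivariate normal. The staircase loop above can be sketched in Python, assuming (our choice, not stated in the text) a standard bivariate normal with correlation ρ = 0.8, for which the conditionals are X|Y=y ∼ N(ρy, 1−ρ²) and Y|X=x ∼ N(ρx, 1−ρ²); we start from the figure's red star at (−2.0, −1.5):

```python
import math
import random

RHO = 0.8                                   # assumed correlation between X and Y
SD = math.sqrt(1.0 - RHO * RHO)             # sd of each conditional: sqrt(1 - rho^2)

rng = random.Random(3)
x, y = -2.0, -1.5                           # the figure's starting point (red star)
xs, ys = [], []
for t in range(40000):
    x = rng.gauss(RHO * y, SD)              # sample X from P(X | Y = y): horizontal move
    y = rng.gauss(RHO * x, SD)              # sample Y from P(Y | X = x): vertical move
    if t >= 2000:                           # burn-in
        xs.append(x)
        ys.append(y)

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
corr = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n  # sample covariance ~ rho
```

Every draw is accepted; the only moves are the alternating horizontal and vertical conditional updates described above, and the sample means and correlation recover the assumed values.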

Why This Produces Valid Samples from the Joint Distribution

This is the deep theoretical result: Even though we update sequentially (not
jointly), the pairs (X, Y ) we generate are valid samples from the true joint
distribution P (X, Y ).

Intuitive reason:

• When we sample X from P (X|Y ), we’re asking: ”What X values are consistent
with this Y ?”

• When we sample Y from P (Y |X), we’re asking: ”What Y values are consistent
with this X?”

• This back-and-forth between consistent updates means the chain naturally explores
the high-probability region (the ellipse).

• After many iterations, the frequency of visiting each (X, Y ) pair in our MCMC
chain matches the true probability of that pair in the joint distribution.

This happens automatically; we don't need to reject proposals or apply any correction. The conditional distributions build in all the structure we need.

3.2 The Mathematical Insight: Gibbs Is a Special Case of Metropolis-Hastings

Here's something remarkable: Gibbs sampling is a special case of Metropolis-Hastings where we always accept every proposal (acceptance probability = 1).

When updating variable i, we propose by sampling from its conditional:

q(x′i |xt ) = P (Xi |all other variables at their current values)

The Metropolis-Hastings acceptance ratio becomes:


 
α = min(1, [P(new state)/P(old state)] · [P(conditional at old)/P(conditional at new)])

Due to how joint distributions factor (break into pieces), these terms cancel perfectly:

α = min(1, 1) = 1

We always accept! This is why Gibbs is called rejection-free.
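To see the cancellation explicitly for one update, write the state as x = (xi, x−i), factor the joint as π(x) = π(xi | x−i) π(x−i), and use the Gibbs proposal q(x′ | x) = π(x′i | x−i) (only coordinate i changes, so x′−i = x−i):

```latex
\alpha
 = \min\!\left(1,\,
     \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)
 = \min\!\left(1,\,
     \frac{\pi(x'_i \mid x_{-i})\,\pi(x_{-i})\cdot \pi(x_i \mid x_{-i})}
          {\pi(x_i \mid x_{-i})\,\pi(x_{-i})\cdot \pi(x'_i \mid x_{-i})}\right)
 = \min(1, 1) = 1
```

Every factor in the numerator appears in the denominator, so the ratio is exactly 1.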

3.3 Why This Works

The fact that Gibbs always accepts is not luck. By proposing directly from the conditional
distribution P (Xi |others), our proposal is perfectly aligned with the target.

In Metropolis and Metropolis-Hastings, we use an external proposal that might not match
the target. We need acceptance-rejection to correct the mismatch. Gibbs avoids this mis-
match by proposing from the conditional. The proposal is always correct, so acceptance
is automatic.

3.4 The Gibbs Sampling Algorithm

Gibbs sampling follows a simple pattern:

1. Start: Pick any initial values for your variables (they don’t matter much—Gibbs
will forget them).

2. Loop: Update each variable one at a time using its conditional distribution.

3. Stop: After many iterations, throw away the early ”learning phase” and keep the
rest.

The algorithm is clever because it accepts every proposed update—no rejections like in
Metropolis-Hastings.

The Detailed Algorithm

Initialization:

• Choose starting values for all variables: X1^(0), X2^(0), . . . , Xp^(0).

• These starting values can be anything (even random).

• Gibbs will converge to the right distribution anyway.

Main Loop (repeat for many iterations):

• Update X1: Sample a new value from P(X1 | X2^(t), X3^(t), . . . , Xp^(t)).
Use the current (old) values of all other variables.

• Update X2: Sample a new value from P(X2 | X1^(t+1), X3^(t), . . . , Xp^(t)).
Use the newly updated X1^(t+1), but old values of the rest.

• Update X3, . . . , Xp: Continue this pattern.
Always use the most recent values for variables already updated in this iteration;
use old values for variables not yet updated.

After Many Iterations:

• Discard the first several iterations (burn-in): These come from when the chain
was still ”learning” and had not yet converged.

• Keep the remaining iterations: These are valid samples from the true posterior
distribution.

3.5 Why This Works

Each update respects the relationship between the variable being updated and all the
others. By cycling through the variables this way, the chain naturally stays in the
high-probability region and converges to the true posterior.

3.6 Strengths and Limitations of Gibbs

Strengths:

• No tuning: No proposal variance to adjust.

• 100% acceptance: All samples are used, no waste.

• Fast mixing: Often explores the posterior very quickly.

• Natural for structured models: Graphical models, hierarchical Bayes, and mix-
ture models have built-in conditional structure.

Limitations:

• Requires easy conditionals: If P (Xi |others) is complicated, we can’t sample
from it.

• Slow for correlated variables: Updating one variable at a time means correlated
variables take many steps to exchange information.

4 Comparing the Three Algorithms

Core Idea.
Metropolis: uses a symmetric proposal to explore the distribution.
Metropolis-Hastings: uses an asymmetric proposal and corrects for the bias.
Gibbs Sampling: updates one variable at a time using conditional probabilities.

Proposal (q).
Metropolis: must be symmetric: q(x′|xt) = q(xt|x′).
Metropolis-Hastings: can be asymmetric: q(x′|xt) ̸= q(xt|x′).
Gibbs Sampling: no external proposal (uses P(Xi|others)).

Acceptance Ratio (α).
Metropolis: min(1, π(x′)/π(xt)).
Metropolis-Hastings: min(1, (π(x′)/π(xt)) · (q(xt|x′)/q(x′|xt))).
Gibbs Sampling: always 1 (100% acceptance).

Key Strength.
Metropolis: simple to implement; good for basic problems.
Metropolis-Hastings: flexible; handles biased proposals and complex spaces.
Gibbs Sampling: no tuning required; very efficient for hierarchical/structured models.

Limitation.
Metropolis: fails if the proposal isn’t symmetric; can be slow.
Metropolis-Hastings: requires calculating proposal ratios.
Gibbs Sampling: slow if variables are highly correlated; conditionals can be hard to derive.

Table 1: Comparison of the three MCMC sampling methods learned this week.

4.1 When to use each

Learning MCMC? Start with Metropolis. It’s simple and teaches the core idea: how
to derive acceptance rates from detailed balance.
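As a minimal illustration of that starting point, here is a random-walk Metropolis sampler targeting a standard normal. The target, names, and defaults are illustrative; working with log densities avoids numerical underflow:

```python
import math
import random

def metropolis_standard_normal(n_iter=50000, step=2.4, burn_in=5000):
    """Random-walk Metropolis targeting a standard normal.

    The Gaussian proposal is symmetric, so the acceptance probability
    reduces to min(1, pi(x') / pi(x_t)).
    """
    def log_target(z):               # log pi(z); unnormalised is fine
        return -0.5 * z * z

    x = 0.0
    samples = []
    for t in range(n_iter):
        proposal = x + random.gauss(0.0, step)
        # accept with probability min(1, pi(proposal) / pi(x))
        if math.log(random.random()) < log_target(proposal) - log_target(x):
            x = proposal
        if t >= burn_in:
            samples.append(x)
    return samples
```

After burn-in, the kept samples should have mean near 0 and variance near 1, matching the target.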

General-purpose sampling? Use Metropolis-Hastings. It handles any proposal and

corrects for asymmetry.

Structured Problems? Use Gibbs if conditionals are easy to form. It’s usually fastest
and needs no tuning.

Real applications? Combine them. Use Gibbs steps for variables with easy condi-
tionals and Metropolis-Hastings steps for the others.

5 Practical Reality: Making These Algorithms Work

5.1 Burn-In: Forgetting Where We Started

Every MCMC chain starts somewhere arbitrary. It takes time to ”forget” this starting
point and to converge to the stationary distribution.

Burn-in is the number of initial samples we throw away.

Rule of Thumb: Discard the first 10-15% of iterations.


But this is just a guideline. The real question is: how do we know we have enough burn-in?

Answer: We look at trace plots (see next section).

5.2 Tuning: Acceptance Rates and Step Sizes

For Metropolis and Metropolis-Hastings, the acceptance rate (fraction of proposals ac-
cepted) tells us something important.

Targets:

• In 1D: Aim for 44% acceptance

• In 10+ dimensions: Aim for 23.4% acceptance

Why these numbers?

If acceptance is too high (say 80%), we are making tiny steps and mixing slowly. If
acceptance is too low (say 5%), we are making huge steps and getting rejected too often.
The targets balance exploration and mixing.

How to tune:

• Too high acceptance? Increase proposal step size (propose bigger jumps)

• Too low acceptance? Decrease proposal step size (propose smaller jumps)
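That tuning rule can be automated during burn-in. The sketch below adjusts the step size in batches toward a target acceptance rate; the 1.1/0.9 scaling factors and all names are an illustrative choice, not a published adaptation scheme:

```python
import math
import random

def tune_step_size(log_target, x0=0.0, target_rate=0.44,
                   batch=200, n_batches=50, step=1.0):
    """Adapt a random-walk Metropolis step size during burn-in.

    After each batch of proposals, the step is scaled up when the
    observed acceptance rate exceeds the target (steps too timid) and
    scaled down when it falls below (steps too bold).
    """
    x = x0
    for _ in range(n_batches):
        accepted = 0
        for _ in range(batch):
            proposal = x + random.gauss(0.0, step)
            if math.log(random.random()) < log_target(proposal) - log_target(x):
                x = proposal
                accepted += 1
        rate = accepted / batch
        step *= 1.1 if rate > target_rate else 0.9
    return step, x
```

For a standard normal target the tuned step settles near the classic 1D optimum of roughly 2.4; note the adaptation happens only during burn-in, before samples are collected.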

5.3 Checking Convergence: Trace Plots

Figure 5: MCMC Convergence Diagnosis Using Trace Plots

[Left: Poor convergence shows systematic drift and long periods stuck at one value.
Right: Good convergence shows random fluctuations around a stable mean after
burn-in.]

A trace plot is a simple graph: horizontal axis is iteration number, vertical axis is pa-
rameter value.

Good convergence looks like white noise—random fluctuations around a stable mean.
No drift, no trends.

Poor convergence shows:

• Slow upward or downward drift.

• Long-term trends.

• Getting stuck at one value.

Visual inspection of trace plots is the first step in diagnosing MCMC quality. We visually
check if the chain has settled into its stationary distribution. If not, we increase burn-in
or run longer.
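A crude numeric companion to visual inspection is to compare the means of the two halves of the chain, scaled by the chain’s overall spread. This is only a heuristic sketch, not a formal diagnostic such as Geweke’s test or the R-hat statistic:

```python
import random

def halves_agree(chain, tol=0.1):
    """Compare the means of the two halves of a chain, scaled by the
    chain's overall standard deviation. A large gap suggests drift,
    i.e. the chain has probably not converged. Heuristic only.
    """
    n = len(chain)
    first, second = chain[: n // 2], chain[n // 2:]
    mean = lambda xs: sum(xs) / len(xs)
    overall = mean(chain)
    sd = (sum((v - overall) ** 2 for v in chain) / n) ** 0.5
    if sd == 0.0:
        return True                 # a constant chain has no drift
    return abs(mean(first) - mean(second)) <= tol * sd

# A well-mixed chain passes; a steadily drifting chain fails.
random.seed(0)
well_mixed = [random.gauss(0.0, 1.0) for _ in range(10000)]
drifting = [i / 10000 for i in range(10000)]
```

In practice one would run such a check on each parameter’s trace after discarding burn-in, and increase burn-in or run longer if it fails.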

5.4 Limitations: When MCMC Struggles

Problem 1: Distributions having Multiple Modes

Figure 6: MCMC on bimodal distribution

Figure 6 shows a bimodal probability density with two distant peaks and illustrates
how an MCMC chain can get stuck in one of them. When a posterior has multiple
distant peaks, basic MCMC can get stuck exploring one peak and rarely jumps to
the other.

Solution: Run very long chains, or use tempering/simulated annealing to help the
chain jump between modes [beyond our project’s scope].

Problem 2: High Dimensions

In 50+ dimensions, Metropolis and Metropolis-Hastings mix slowly without careful tun-
ing.

Solution: Use more sophisticated methods like HMC (mentioned at the end).

Problem 3: Correlated Variables

When variables are strongly correlated, updating one at a time (Gibbs) requires many
steps for information to flow.

Solution: Use blocked Gibbs (update correlated groups together) [Beyond our project’s
scope].

6 Looking Forward: Beyond the Basics

6.1 The Next Generation: Modern MCMC Methods

The three algorithms we learned—Metropolis, Metropolis-Hastings, and Gibbs—form the
foundation. We can use them for countless real problems.

But over the last 10–15 years, the field developed more powerful methods. Modern soft-
ware packages like Stan, PyMC, and TensorFlow Probability use these newer algorithms.

Key example: Hamiltonian Monte Carlo (HMC)

HMC uses gradient information (which direction probability increases) to propose
moves more intelligently. This leads to:

• Much higher acceptance rates (60–80% vs. 20–30%).

• Faster exploration.

• Better performance in high dimensions (50+ variables).

HMC is now the standard for professional Bayesian inference.

Important note: HMC is beyond this project’s scope and won’t be part of your ap-
plications. Learning HMC requires knowledge of physics (Hamiltonian dynamics) and
numerical integration methods—separate topics entirely.

However, HMC builds directly on what you have learnt. The detailed balance principle,
the acceptance ratio derivation—all the same. HMC just uses more clever proposals.

If you want to explore HMC independently, Michael Betancourt’s tutorial is excellent:


[Link]

Conclusion: The Foundation is Set

Over the past few weeks, we have built a complete understanding of sampling:

• Week 3 showed us why we need better methods (curse of dimensionality) and how
independent sampling works.

• The following weeks taught us the mathematics of Markov chains and detailed
balance—the theoretical engine.

• This current week showed us three practical algorithms that use detailed balance
to sample from any distribution we specify.

We now understand not just the algorithms, but the principles underneath. We know
why detailed balance matters and how it ensures correctness.

We have developed the foundations for our applications. Starting from next week, we
will apply these methods to real problems in various fields. We will learn to translate
scientific questions into Bayesian models, choose the right algorithm for each problem,
and validate our results using diagnostics.

The journey from understanding sampling to solving real problems is about to begin.

