MCMC Algorithms: Metropolis & Gibbs Sampling
Recap: What We Learnt Before
Before we dive into new material, let’s remember what we discovered in the previous
week. Don’t worry if these ideas feel fuzzy—we will go through them slowly again.
In the previous week, we met something called the stationary distribution. Let’s use
the frog example we know well.
Imagine a frog jumping between three lily pads: Lily Pad A, Lily Pad B, and Lily Pad
C. Every time the frog is at a lily pad, it follows a pattern of jumping to the next pad.
After the frog jumps around for a very long time (say, 1 million jumps), we count how
often it spends time at each lily pad:
This vector π (pronounced "pi") is called the stationary distribution. It tells us the
long-term probability of finding the frog at each lily pad.
Here’s what’s remarkable: no matter where the frog starts (at Lily Pad A, B, or C), after
enough time, these percentages stay the same. The frog forgets where it started and
settles into this pattern.
Let’s explain the symbols we will use so we are on the same page:
• πi is the probability the frog is at lily pad i. So π1 = 0.5 means 50% chance the
frog is at Lily Pad A.
• pij is the jumping rule: the probability of jumping from pad i to pad j. For example,
pAB is the probability of jumping from Lily Pad A to Lily Pad B.
All these jumping rules together form what we call the transition matrix P . When we
multiply the stationary distribution π by the transition matrix P , we get π back again.
In symbols:
π · P = π
• π on the left side is our current distribution (where the frog is likely to be right
now).
• After applying the jumping rules, we get π again on the right side.
In other words: if the frog follows this pattern long enough to reach the stationary
distribution, then following the jumping rules doesn’t change the distribution. The system
is in perfect balance.
Now comes an important detail. Not all jumping patterns lead to a stationary distribu-
tion. There’s a special condition that guarantees it will work.
That condition is called detailed balance, and here’s what it means in plain language:
The number of frogs jumping from Lily Pad A to Lily Pad B (in a given time period)
equals the number jumping from Lily Pad B back to Lily Pad A.
Imagine we have 1,000 frogs at Lily Pad A and 100 frogs at Lily Pad B. If detailed balance
holds, the total flow of frogs from A to B equals the total flow from B back to A.
In mathematical symbols, detailed balance is written as:
πi · pij = πj · pji
This is powerful: if we can design jumping rules that satisfy detailed balance, we auto-
matically get a stationary distribution.
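We can check this claim numerically. The sketch below uses a made-up set of jumping rules for the three lily pads (not the exact numbers from the lectures), constructed so that detailed balance holds for π = [0.5, 0.25, 0.25]. It verifies detailed balance pair by pair, confirms π · P = π, and shows the frog forgetting its starting pad:

```python
import numpy as np

# Hypothetical jumping rules for the three lily pads (rows: "from", cols: "to"),
# constructed so that detailed balance holds for pi = [0.5, 0.25, 0.25].
P = np.array([[0.6, 0.2, 0.2],    # from Lily Pad A
              [0.4, 0.4, 0.2],    # from Lily Pad B
              [0.4, 0.2, 0.4]])   # from Lily Pad C
pi = np.array([0.5, 0.25, 0.25])

flows = pi[:, None] * P           # flows[i, j] = pi_i * p_ij
print(np.allclose(flows, flows.T))            # detailed balance: True
print(np.allclose(pi @ P, pi))                # stationarity pi . P = pi: True

start = np.array([1.0, 0.0, 0.0])             # frog starts at Lily Pad A
print(start @ np.linalg.matrix_power(P, 50))  # converges to pi
```

Any transition matrix built this way (choose the flows, then back out the jump probabilities) passes both checks, which is exactly why detailed balance is such a convenient design tool.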
• We want to sample from some target distribution (like inferring a coin’s bias or
estimating a particle’s mass).
But how do we actually construct the jumping rules pij to make the stationary distribution
equal to the distribution we want to sample from?
This is the central question of our current week. The answer is elegant: we don’t specify
all the jumping rules directly. Instead, we use a proposal-and-accept-reject
strategy, which is the basis of Markov Chain Monte Carlo [MCMC] methods.
Introduction to The MCMC Strategy
This week, we will learn three practical algorithms that use the detailed balance principle
to construct Markov chains that sample from any target distribution we specify. These
algorithms form the foundation of Markov Chain Monte Carlo (MCMC) methods.
All the algorithms this week follow the same basic pattern. Let's understand this
pattern before we look at specific algorithms.
1. We are at some current state: Call it xt. This might be a guess for the coin's
bias, or a proposed parameter value.
After many iterations, we have a chain of states: x0, x1, x2, . . . , xN. If we designed the
acceptance probability correctly, these states behave as samples from the distribution we
want.
In each case we ask: "What acceptance probability makes detailed balance true?" And that's the acceptance probability we use.
Different algorithms use different proposal strategies, which leads to different acceptance
formulas. But all of them are derived from the same principle: detailed balance.
Our goal: explore this landscape so we spend time at each location proportional to its
height. We want to spend more time on the tall peaks and less time in the valleys.
Here’s the intuitive principle for Metropolis:
If we propose moving to a higher place (uphill), we should almost always go there. Spend
more time on tall peaks!
A Concrete Example
We are at a spot with probability π(xt) = 0.8. We propose moving to a spot with
probability π(x′) = 0.4.
This is downhill. The ratio is 0.4/0.8 = 0.5. So we accept the move with 50 percent
probability. Half the time we make the move, half the time we stay put.
If we proposed a spot with probability π(x′) = 0.9, this is uphill. We would accept with
probability min(1, 0.9/0.8) = 1, meaning we always go there.
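A few lines of code make the accept-or-reject rule concrete. The helper below is illustrative (the function name is my own, not from the lectures); it reproduces the two numbers above:

```python
import random

def metropolis_accept(pi_current, pi_proposed, rng=random.random):
    """Accept a proposed move with probability min(1, pi_proposed / pi_current)."""
    alpha = min(1.0, pi_proposed / pi_current)
    return rng() < alpha, alpha

# Downhill move from the example: pi(x_t) = 0.8 -> pi(x') = 0.4.
_, alpha_down = metropolis_accept(0.8, 0.4)
print(alpha_down)   # 0.5: accept half the time, stay put otherwise

# Uphill move: pi(x_t) = 0.8 -> pi(x') = 0.9.
_, alpha_up = metropolis_accept(0.8, 0.9)
print(alpha_up)     # 1.0: always accept
```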
x′ = xt + ϵ, where ϵ ∼ N(0, σ²)
• We add it to our current state xt .
• Since the distances are the same, the probabilities are equal!
If we jump +5 units forward with some probability, we can jump −5 units backward with
the exact same probability. That’s symmetry.
If our proposal were asymmetric (biased in one direction), that bias would interfere with
detailed balance. We would need to mathematically correct for it. Metropolis avoids this
complication by requiring symmetry from the start.
When the proposal is asymmetric, we use the more general Metropolis-Hastings algorithm
(coming next).
Now for the key derivation. We will figure out what acceptance probability ensures
detailed balance.
Figure 2: Detailed Balance
1. We propose a move using the proposal distribution. The probability is q(x′ |xt ).
Applying Detailed Balance
We want detailed balance to hold. Recall what that means: the flow from xt to x′ equals
the flow from x′ back to xt. In symbols:
π(xt) · q(x′ | xt) · α(x′, xt) = π(x′) · q(xt | x′) · α(xt, x′)
Since our proposal is symmetric, q(x′ | xt) = q(xt | x′). These cancel:
π(xt) · α(x′, xt) = π(x′) · α(xt, x′)
Rearranging:
α(x′, xt) / α(xt, x′) = π(x′) / π(xt)
There are many ways to satisfy this ratio. One elegant choice is:
α(x′, xt) = min(1, π(x′) / π(xt))
This reads as: "accept with probability equal to the ratio of the new probability to the
old probability, but cap it at 1."
• Uphill case: If π(x′) > π(xt), then π(x′)/π(xt) > 1, so α(x′, xt) = 1. We always accept.
And α(xt, x′) = π(xt)/π(x′) (which is less than 1).
Check: π(xt) · 1 = π(x′) · π(xt)/π(x′) = π(xt). Correct!
• Downhill case: If π(x′) < π(xt), then π(x′)/π(xt) < 1, so α = π(x′)/π(xt). We accept with this
probability. And α(xt, x′) = 1.
Check: π(xt) · π(x′)/π(xt) = π(x′) · 1 = π(x′). Correct!
Therefore:
α(x′, xt) = min(1, π(x′) / π(xt))
Concrete Example
Starting out:
1. Pick any starting point x0. It doesn't matter—the chain will forget where it started.
2. Set t = 0.
3. Choose a proposal standard deviation σ. (This controls how far we jump. We will
adjust this later.)
Then, at each iteration:
1. Propose a jump:
x′ = xt + ϵ, where ϵ ∼ N(0, σ²)
2. Calculate the acceptance probability:
α = min(1, π(x′) / π(xt))
If we only know the target up to a normalizing constant, π(x) ∝ f(x), the constant
cancels in the ratio:
α = min(1, f(x′) / f(xt))
3. Accept or reject:
• We have a chain: x0 , x1 , . . . , xN .
• Discard the first B samples (burn-in) to let the chain forget its starting point.
• Use the remaining N − B samples as our draws from the distribution we want.
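The recipe above can be sketched in a few lines of Python. Everything here is illustrative: the target (a standard normal, known only up to a constant), the starting point, the step size, and the burn-in length are all assumptions, not prescriptions:

```python
import math
import random

def metropolis(log_f, x0, sigma, n_steps, seed=0):
    """Random-walk Metropolis; log_f is the log of the (unnormalized) target density."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, sigma)           # 1. propose a symmetric jump
        log_alpha = log_f(x_prop) - log_f(x)         # 2. log acceptance ratio (before capping)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):   # 3. accept or reject
            x, accepted = x_prop, accepted + 1
        chain.append(x)                              # on rejection, the old x repeats
    return chain, accepted / n_steps

# Illustrative target: standard normal, known only up to its normalizing constant.
chain, rate = metropolis(lambda x: -0.5 * x * x, x0=5.0, sigma=1.0, n_steps=20000)
samples = chain[2000:]                               # discard burn-in
print(sum(samples) / len(samples))                   # close to 0, the true mean
```

Working with log densities, as here, avoids numerical underflow when π(x) is tiny; the ratio π(x′)/π(xt) becomes a difference of logs.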
We derived the acceptance ratio by requiring detailed balance. Let’s verify it actually
maintains that balance.
With acceptance ratio α and symmetric proposal, detailed balance becomes:
π(xt) · min(1, π(x′)/π(xt)) = π(x′) · min(1, π(xt)/π(x′))
Suppose π(x′) < π(xt); the opposite case works the same way.
• Left: π(xt) · π(x′)/π(xt) = π(x′)
• Right: π(x′) · 1 = π(x′)
Both sides match, so detailed balance is maintained.
We flip a coin 100 times and get 61 heads and 39 tails. We want to estimate the coin’s
bias—that is, the probability θ that it lands heads.
In Bayesian inference (which we learned in Week 4), we combine prior beliefs with data
using this formula:
Posterior = (Likelihood × Prior) / Normalizing Constant
Or, as it is usually written, as a proportionality:
P(θ | data) ∝ P(data | θ) · P(θ)
• Prior: We assume no bias toward heads or tails, so the prior is uniform:
P (θ) = 1
We start at θ0 = 0.5 and run 11,000 iterations with proposal standard deviation σ = 0.05.
We discard the first 1,000 as burn-in.
Results:
• Acceptance rate: 43.1% (good! We want around 44% for one-dimensional prob-
lems).
• 95% Credible Interval: [0.516, 0.705] (We are 95% confident the true bias is in
this range).
Notice: the observed frequency (61/100 = 0.61) matches our posterior mean. Metropolis
found the right answer!
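The coin experiment can be reproduced with a short script. This is a sketch: the uniform prior, the 61-heads-in-100-flips data, σ = 0.05, 11,000 iterations, and the 1,000-draw burn-in come from the text, while the seed and bookkeeping details are my own choices:

```python
import math
import random

def log_posterior(theta, heads=61, flips=100):
    """log(Likelihood x Prior) with a uniform prior; impossible outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

rng = random.Random(42)
theta, chain = 0.5, []
for _ in range(11000):
    prop = theta + rng.gauss(0.0, 0.05)              # symmetric proposal, sigma = 0.05
    log_alpha = log_posterior(prop) - log_posterior(theta)
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        theta = prop
    chain.append(theta)

samples = sorted(chain[1000:])                       # drop the 1,000 burn-in draws
mean = sum(samples) / len(samples)
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(mean, (lo, hi))                                # mean near 0.61, interval near [0.51, 0.70]
```

Proposals that land outside (0, 1) get log posterior −∞ and are always rejected, which is exactly how a uniform prior on [0, 1] should behave.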
2 Metropolis-Hastings: When We Have Biased Proposals
Metropolis works great when we can use a symmetric proposal. But in many real situa-
tions, we have information suggesting an asymmetric (biased) proposal would be better.
For example:
• We might know roughly where the high-probability region is and want to propose
in that direction.
• We might have a preliminary estimate from a related problem and want to propose
near it.
The problem: If our proposal is asymmetric, the simple Metropolis formula no longer
maintains detailed balance. We need to correct for the asymmetry.
Imagine two cities connected by transportation networks with different quality in each
direction.
Figure 3: Asymmetric Proposal
• Forward (City A to City B): There’s a nice highway. Many people travel this
way. q(B|A) is large.
• Backward (City B to City A): There’s a dirt road. Few people travel this way.
q(A|B) is small.
The proposal is biased forward. To maintain population balance (which is detailed
balance), we must accept backward moves more readily than forward moves.
The correction is the inverse of the bias ratio. If forward travel is 10 times easier, we
accept forward moves 10 times less often.
Detailed balance requires the Metropolis-Hastings acceptance probability:
α(x′, xt) = min(1, [π(x′) · q(xt | x′)] / [π(xt) · q(x′ | xt)])
The ratio inside has two parts:
1. Target ratio: π(x′)/π(xt) (compares probabilities, like Metropolis).
2. Proposal ratio: q(xt | x′)/q(x′ | xt) (corrects for proposal asymmetry).
The proposal ratio is the correction factor from the city analogy:
Correction = q(backward) / q(forward)
• If the forward proposal is heavily biased (q(forward) > q(backward)), then the ratio
is small. This pulls down the acceptance probability, compensating for the forward
bias.
• If the backward proposal is favored (q(backward) > q(forward)), then the ratio is
large. This pulls up the acceptance, compensating for the backward bias.
Special Case: Metropolis is Metropolis-Hastings with Symmetry
If the proposal is symmetric, q(x′ | xt) = q(xt | x′), so the ratio becomes 1:
α = min(1, π(x′)/π(xt) · 1) = min(1, π(x′)/π(xt))
and we recover the plain Metropolis rule.
Starting out:
1. Pick any starting point x0.
2. Set t = 0.
Then, at each iteration:
1. Propose:
x′ ∼ q(x′ | xt)
2. Calculate acceptance:
α = min(1, [π(x′) · q(xt | x′)] / [π(xt) · q(x′ | xt)])
3. Accept or reject:
After N iterations:
• Discard burn-in.
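The full loop can be sketched as one generic function. The names here (log_f for the log target, log_q for the log proposal density, sample_q for the proposal draw) are my own placeholders for the three ingredients the algorithm needs:

```python
import math
import random

def metropolis_hastings(log_f, log_q, sample_q, x0, n_steps, seed=0):
    """Generic MH loop. log_f: log of the (unnormalized) target.
    log_q(y, x): log proposal density of y given current x.
    sample_q(x, rng): draws one proposal given the current state."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        x_prop = sample_q(x, rng)
        # Acceptance = target ratio times proposal ratio (backward over forward).
        log_alpha = (log_f(x_prop) - log_f(x)) + (log_q(x, x_prop) - log_q(x_prop, x))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = x_prop
        chain.append(x)
    return chain

# Sanity check with a symmetric Gaussian proposal: the correction cancels,
# and the sampler behaves exactly like plain Metropolis.
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x * x,              # illustrative target: standard normal
    log_q=lambda y, x: -0.5 * (y - x) ** 2,    # symmetric in (y, x)
    sample_q=lambda x, rng: x + rng.gauss(0.0, 1.0),
    x0=0.0, n_steps=20000)
samples = chain[2000:]
print(sum(samples) / len(samples))             # close to 0
```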
2.5 Real Example: A Skewed Distribution
A symmetric Gaussian random walk proposal would waste many proposals on negative
values (which have zero probability). An asymmetric exponential proposal that favors
positive values would be much more efficient.
We use an asymmetric proposal. This proposal biases jumps toward the positive direction,
matching the distribution’s shape.
Results:
The asymmetric proposal concentrates sampling where the distribution is large, improving
efficiency. The proposal ratio correction ensures detailed balance.
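Since the text does not pin down the exact distributions, the sketch below assumes an Exponential(1) target and an independence proposal that always draws positive values from an Exponential(0.5); both rates are illustrative. Because the proposal ignores the current state, the correction q(backward)/q(forward) reduces to q(old)/q(new), and skipping it would give wrong answers:

```python
import math
import random

rng = random.Random(1)
lam_target, lam_prop = 1.0, 0.5    # assumed rates, for illustration only

def log_target(x):                 # Exp(1) target: zero density for x < 0
    return -lam_target * x if x >= 0 else -math.inf

def log_prop(x):                   # Exp(0.5) independence-proposal density
    return math.log(lam_prop) - lam_prop * x

x, chain = 1.0, []
for _ in range(20000):
    x_new = rng.expovariate(lam_prop)            # always proposes a positive value
    # Proposal ratio q(backward)/q(forward); for an independence proposal
    # this is q(old)/q(new), since the proposal ignores the current state.
    log_alpha = (log_target(x_new) - log_target(x)) + (log_prop(x) - log_prop(x_new))
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = x_new
    chain.append(x)

samples = chain[2000:]
print(sum(samples) / len(samples))               # close to 1, the mean of Exp(1)
```

No proposal is ever wasted on negative values, which is exactly the efficiency gain the text describes.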
The Problem We’re Solving
Gibbs Sampling takes a different, cleverer approach: update variables one at a time.
It’s like feeling your way around the room step by step, which often works much better.
• But even without rain, a sprinkler could make the grass wet.
The key insight: X and Y depend on each other, but we can use this dependence to
our advantage in MCMC.
In probability, when two variables are connected, we describe this using conditional prob-
ability:
– Example: ”If the grass is soaking wet, what are the likely rainfall amounts?”
The answer is a probability distribution over possible rainfall values.
– Even if we don’t know the exact rainfall, knowing the grass is wet gives us
information that makes high rainfall more likely than low rainfall.
• P (Y |X) = ”the probability distribution of Y given that we know X”
– Example: ”If it rained heavily, how wet would the grass likely be?” The answer
is a distribution.
– Knowing heavy rain occurred makes high grass wetness more likely.
This is different from P (X) alone (”how much does it typically rain?”) because we’re
conditioning on information about Y .
Here’s the beautiful idea behind Gibbs sampling: Instead of proposing new values
for both X and Y randomly (like in Metropolis-Hastings), we update them
sequentially using their conditional distributions.
Algorithm:
1. Start at (X0 , Y0 ).
2. Sample a new X1 from P (X|Y0 ).
3. Sample a new Y1 from P (Y |X1 ).
4. Go back to step 2 with new (X1 , Y1 ) and repeat the process again and again.
Example:
• Step 2: We freeze the grass wetness at its current value. We ask: ”Given that the
grass is this wet, what should the new rainfall be?” We answer this by sampling
from P (X|Y ). The result is a new rainfall amount X1 .
• Step 3: Now we freeze the rainfall at its new value X1 . We ask: ”Given that
it rained this much, how wet should the grass be?” We answer by sampling from
P (Y |X1 ). The result is a new grass wetness Y1 .
• Step 4: Go back to Step 2 with the new (X1 , Y1 ) and repeat the process.
Why Does This Stay Consistent?
• After Step 2, we have a new X1 that’s consistent with the current Y (we asked
”given Y , what’s a good X?”).
• After Step 3, we have a new Y1 that’s consistent with the new X1 (we asked ”given
the new X, what’s a good Y ?”).
This back-and-forth ensures that X and Y stay in the realistic region—we don’t produce
impractical combinations like ”no rain but soaking wet grass” (well, almost never).
Unlike Metropolis-Hastings, which can jump diagonally, Gibbs sampling moves in straight
lines (orthogonally), updating one coordinate at a time. This creates a ”staircase” path
toward the high-probability region.
Figure 4: Gibbs Sampling Trajectory
• Moving outward: ”Light rain, dry grass” or ”Heavy rain, very wet grass” (real-
istic, but less common).
The tilt of the ellipse shows the correlation between X and Y :
• If rain and wetness are strongly linked, the ellipse is tilted (elongated along a
diagonal).
Let’s trace through what’s happening in Figure 4 using our rain/grass intuition. Let the
X-axis represent Rainfall and the Y-axis represent Grass Wetness.
Iteration 1:
2. Hold wetness at -1.5. Ask: ”Given grass wetness = -1.5, what’s the likely
rainfall?”
3. Hold rainfall at -0.9. Ask: ”Given rainfall = -0.9, what’s the likely grass wetness?”
Repeat this process again and again to get iteration 2, 3, and so on.
Each step stays on the ellipse because we’re using the conditional distributions. The
alternating horizontal (update X) and vertical (update Y) moves create a ”staircase”
pattern that efficiently explores the realistic region.
This is the deep theoretical result: Even though we update sequentially (not
jointly), the pairs (X, Y ) we generate are valid samples from the true joint
distribution P (X, Y ).
Intuitive reason:
• When we sample X from P (X|Y ), we’re asking: ”What X values are consistent
with this Y ?”
• When we sample Y from P (Y |X), we’re asking: ”What Y values are consistent
with this X?”
• This back-and-forth between consistent updates means the chain naturally explores
the high-probability region (the ellipse).
• After many iterations, the frequency of visiting each (X, Y ) pair in our MCMC
chain matches the true probability of that pair in the joint distribution.
This happens automatically: we don’t need to reject proposals or apply any correction.
The conditional distributions build in all the structure we need.
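As a concrete sketch, take a standard bivariate normal for (rainfall X, wetness Y) with an assumed correlation ρ = 0.8 (matching the tilted ellipse in Figure 4, though the exact value there is not stated). Its conditionals are the Gaussians P(X | Y) = N(ρY, 1 − ρ²) and P(Y | X) = N(ρX, 1 − ρ²), so each Gibbs update is a single Gaussian draw with no accept/reject step:

```python
import random

rho = 0.8                          # assumed correlation between rainfall X and wetness Y
sd = (1 - rho * rho) ** 0.5        # standard deviation of each conditional
rng = random.Random(0)

x, y = -1.5, -1.5                  # arbitrary start; the chain forgets it
xs, ys = [], []
for _ in range(20000):
    x = rng.gauss(rho * y, sd)     # freeze Y, sample X from P(X | Y)
    y = rng.gauss(rho * x, sd)     # freeze the new X, sample Y from P(Y | X)
    xs.append(x)
    ys.append(y)

xs, ys = xs[2000:], ys[2000:]      # discard burn-in
n = len(xs)
mean_x = sum(xs) / n
corr = sum(a * b for a, b in zip(xs, ys)) / n   # E[XY] = rho for this target
print(mean_x, corr)                # mean near 0, correlation near 0.8
```

Every update is accepted, yet the chain recovers the joint structure: the empirical correlation of the (X, Y) pairs matches ρ.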
Due to how joint distributions factor (break into pieces), these terms cancel perfectly:
α = min(1, 1) = 1
3.3 Why This Works
The fact that Gibbs always accepts is not luck. By proposing directly from the conditional
distribution P (Xi |others), our proposal is perfectly aligned with the target.
In Metropolis and Metropolis-Hastings, we use an external proposal that might not match
the target. We need acceptance-rejection to correct the mismatch. Gibbs avoids this mis-
match by proposing from the conditional. The proposal is always correct, so acceptance
is automatic.
1. Start: Pick any initial values for your variables (they don’t matter much—Gibbs
will forget them).
2. Loop: Update each variable one at a time using its conditional distribution.
3. Stop: After many iterations, throw away the early ”learning phase” and keep the
rest.
The algorithm is clever because it accepts every proposed update—no rejections like in
Metropolis-Hastings.
Initialization:
• Update X1 : Sample a new value X1^(t+1) from P (X1 | X2^(t), X3^(t), . . . , Xp^(t)).
Use the current (old) values of all other variables.
• Update X2 : Sample a new value X2^(t+1) from P (X2 | X1^(t+1), X3^(t), . . . , Xp^(t)).
Use the newly updated X1 , but old values of the rest.
Always use the most recent values for variables already updated in this iteration.
Use old values for variables not yet updated.
• Discard the first several iterations (burn-in): These are from when the chain was
still ”learning” and had not yet converged.
• Keep the remaining iterations: These are valid samples from the true posterior
distribution.
Each update respects the relationship between the variable being updated and all others.
By cycling through variables this way, the chain naturally stays in the high-probability
region and converges to the true posterior.
Strengths:
• Natural for structured models: Graphical models, hierarchical Bayes, and mix-
ture models have built-in conditional structure.
Limitations:
• Requires easy conditionals: If P (Xi |others) is complicated, we can’t sample
from it.
• Slow for correlated variables: Updating one variable at a time means correlated
variables take many steps to exchange information.
Table 1: Comparison of the three MCMC sampling methods learned this week.
Learning MCMC? Start with Metropolis. It’s simple and teaches the core idea: how
to derive acceptance rates from detailed balance.
Asymmetric proposals? Use Metropolis-Hastings. The proposal ratio corrects for asymmetry.
Structured Problems? Use Gibbs if conditionals are easy to form. It’s usually fastest
and needs no tuning.
Real applications? Combine them. Use Gibbs steps for variables with easy conditionals,
and Metropolis-Hastings steps for the others.
Every MCMC chain starts somewhere arbitrary. It takes time to ”forget” this starting
point and to converge to the stationary distribution.
For Metropolis and Metropolis-Hastings, the acceptance rate (fraction of proposals ac-
cepted) tells us something important.
Targets: around 44% acceptance for one-dimensional problems, and roughly 23% in high
dimensions.
If acceptance is too high (80%), we are making tiny steps and mixing slowly. If ac-
ceptance is too low (5%), we are making huge steps and getting rejected too often. The
targets balance exploration and mixing.
How to tune:
• Too high acceptance? Increase proposal step size (propose bigger jumps)
• Too low acceptance? Decrease proposal step size (propose smaller jumps)
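This tuning loop can be automated during warm-up. The sketch below is one crude scheme, not a standard recipe: the batch size, the 1.2 adjustment factor, and the round count are arbitrary choices, and the target is an illustrative standard normal:

```python
import math
import random

def tune_sigma(log_f, x0, sigma0, target_rate=0.44, rounds=30, batch=200, seed=0):
    """Crude warm-up tuning: run short batches, nudge sigma toward the target rate."""
    rng = random.Random(seed)
    x, sigma = x0, sigma0
    for _ in range(rounds):
        accepted = 0
        for _ in range(batch):
            prop = x + rng.gauss(0.0, sigma)
            log_alpha = log_f(prop) - log_f(x)
            if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
                x, accepted = prop, accepted + 1
        if accepted / batch > target_rate:
            sigma *= 1.2           # accepting too often: steps too small, enlarge them
        else:
            sigma /= 1.2           # rejecting too often: steps too big, shrink them
    return sigma

# Illustrative: tune for a standard normal target, starting with far-too-small steps.
sigma = tune_sigma(lambda x: -0.5 * x * x, x0=0.0, sigma0=0.1)
print(sigma)                       # settles near the well-mixing range for this target
```

In practice, tuning like this is done only during warm-up; the adaptation is frozen before the samples you keep, so the final chain still satisfies detailed balance.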
[Left plot: Poor convergence shows systematic drift and long periods stuck at one value.
Right plot: Good convergence shows random fluctuations around a stable mean after
burn-in.]
A trace plot is a simple graph: horizontal axis is iteration number, vertical axis is pa-
rameter value.
Good convergence looks like white noise—random fluctuations around a stable mean.
No drift, no trends.
Poor convergence shows:
• Long-term trends.
Visual inspection of trace plots is the first step in diagnosing MCMC quality. We visually
check if the chain has settled into its stationary distribution. If not, we increase burn-in
or run longer.
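Short of plotting, a trace-plot-style check can be done numerically: after burn-in, the first and second halves of the chain should agree on the mean. A minimal sketch, with an illustrative standard normal target and a deliberately bad starting point so the burn-in matters:

```python
import math
import random

def halves_diagnostic(chain, burn_in):
    """Trace-plot-style check: post-burn-in, the two halves should have similar means."""
    kept = chain[burn_in:]
    half = len(kept) // 2
    return sum(kept[:half]) / half, sum(kept[half:]) / (len(kept) - half)

# Illustrative chain: random-walk Metropolis on a standard normal target.
rng = random.Random(0)
x, chain = 8.0, []                 # deliberately bad starting point
for _ in range(20000):
    prop = x + rng.gauss(0.0, 1.0)
    log_alpha = 0.5 * x * x - 0.5 * prop * prop
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = prop
    chain.append(x)

m1, m2 = halves_diagnostic(chain, burn_in=2000)
print(abs(m1 - m2))                # small once the chain has settled
```

A large gap between the two half-chain means is the numerical counterpart of visible drift in a trace plot: increase the burn-in or run longer.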
Figure 6 shows a bimodal probability density with two distant peaks, illustrating how an
MCMC chain can get stuck in one peak.
When a posterior has multiple distant peaks, basic MCMC can get stuck exploring one
peak and rarely jump to another.
Solution: Run very long chains, or use tempering/simulated annealing to help the chain
jump between modes.
In 50+ dimensions, Metropolis and Metropolis-Hastings mix slowly without careful tun-
ing.
Solution: Use more sophisticated methods like HMC (mentioned at the end).
When variables are strongly correlated, updating one at a time (Gibbs) requires many
steps for information to flow.
Solution: Use blocked Gibbs (update correlated groups together) [Beyond our project’s
scope].
But over the last 10–15 years, the field developed more powerful methods. Modern soft-
ware packages like Stan, PyMC, and TensorFlow Probability use these newer algorithms.
Key example: Hamiltonian Monte Carlo (HMC). HMC uses gradient information
(which direction probability increases) to propose moves more intelligently. This leads
to:
• Faster exploration.
• Better performance in high dimensions (50+ variables).
Important note: HMC is beyond this project’s scope and won’t be part of your ap-
plications. Learning HMC requires knowledge of physics (Hamiltonian dynamics) and
numerical integration methods—separate topics entirely.
However, HMC builds directly on what you have learnt. The detailed balance principle,
the acceptance ratio derivation—all the same. HMC just uses more clever proposals.
Over the past few weeks, we have built a complete understanding of sampling:
• Week 3 showed us why we need better methods (curse of dimensionality) and how
independent sampling works.
• The following weeks taught us the mathematics of Markov chains and detailed
balance—the theoretical engine.
• This current week showed us three practical algorithms that use detailed balance
to sample from any distribution we specify.
We now understand not just the algorithms, but the principles underneath. We know
why detailed balance matters and how it ensures correctness.
We have developed the foundations for our applications. Starting from next week, we
will apply these methods to real problems in various fields. We will learn to translate
scientific questions into Bayesian models, choose the right algorithm for each problem,
and validate our results using diagnostics.
The journey from understanding sampling to solving real problems is about to begin.