MCMC Algorithms: Metropolis & Gibbs Sampling
Recap: What We Learnt Before
Before we dive into new material, let’s remember what we discovered in the previous
week. Don’t worry if these ideas feel fuzzy—we will go through them slowly again.
In the previous week, we met something called the stationary distribution. Let’s use
the frog example we know well.
Imagine a frog jumping between three lily pads: Lily Pad A, Lily Pad B, and Lily Pad
C. Every time the frog is at a lily pad, it follows a pattern of jumping to the next pad.
After the frog jumps around for a very long time (say, 1 million jumps), we count how
often it spends time at each lily pad:
This vector π (pronounced "pi") is called the stationary distribution. It tells us the
long-term probability of finding the frog at each lily pad.
Here’s what’s remarkable: no matter where the frog starts (at Lily Pad A, B, or C), after
enough time, these percentages stay the same. The frog forgets where it started and
settles into this pattern.
Let’s explain the symbols we will use so we are on the same page:
• πi is the probability the frog is at lily pad i. So π1 = 0.5 means 50% chance the
frog is at Lily Pad A.
• pij is the jumping rule: the probability of jumping from pad i to pad j. For example,
pAB is the probability of jumping from Lily Pad A to Lily Pad B.
All these jumping rules together form what we call the transition matrix P . When we
multiply the stationary distribution π by the transition matrix P , we get π back again.
In symbols:
π · P = π
• π on the left side is our current distribution (where the frog is likely to be right
now).
• After applying the jumping rules, we get π again on the right side.
In other words: if the frog follows this pattern long enough to reach the stationary
distribution, then following the jumping rules doesn’t change the distribution. The system
is in perfect balance.
Now comes an important detail. Not all jumping patterns lead to a stationary distribu-
tion. There’s a special condition that guarantees it will work.
That condition is called detailed balance, and here’s what it means in plain language:
The number of frogs jumping from Lily Pad A to Lily Pad B (in a given time period)
equals the number jumping from Lily Pad B back to Lily Pad A.
Imagine we have 1,000 frogs at Lily Pad A and 100 frogs at Lily Pad B. If detailed balance
holds, the total flow of frogs from A to B equals the total flow from B back to A.
In mathematical symbols, detailed balance is written as:
πi · pij = πj · pji
This is powerful: if we can design jumping rules that satisfy detailed balance, we auto-
matically get a stationary distribution.
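We can check this claim numerically. The sketch below uses a made-up set of jumping rules for the three lily pads (not the exact numbers from the lectures), constructed so that detailed balance holds for π = [0.5, 0.25, 0.25]. It verifies detailed balance pair by pair, confirms π · P = π, and shows the frog forgetting its starting pad:

```python
import numpy as np

# Hypothetical jumping rules for the three lily pads (rows: "from", cols: "to"),
# constructed so that detailed balance holds for pi = [0.5, 0.25, 0.25].
P = np.array([[0.6, 0.2, 0.2],    # from Lily Pad A
              [0.4, 0.4, 0.2],    # from Lily Pad B
              [0.4, 0.2, 0.4]])   # from Lily Pad C
pi = np.array([0.5, 0.25, 0.25])

flows = pi[:, None] * P           # flows[i, j] = pi_i * p_ij
print(np.allclose(flows, flows.T))            # detailed balance: True
print(np.allclose(pi @ P, pi))                # stationarity pi . P = pi: True

start = np.array([1.0, 0.0, 0.0])             # frog starts at Lily Pad A
print(start @ np.linalg.matrix_power(P, 50))  # converges to pi
```

Any transition matrix built this way (choose the flows, then back out the jump probabilities) passes both checks, which is exactly why detailed balance is such a convenient design tool.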
• We want to sample from some target distribution (like inferring a coin’s bias or
estimating a particle’s mass).
But how do we actually construct the jumping rules pij to make the stationary distribution
equal to the distribution we want to sample from?
This is the central question of our current week. The answer is elegant: we don’t specify
all the jumping rules directly. Instead, we use a proposal-and-accept-reject
strategy, which is the basis of Markov Chain Monte Carlo [MCMC] methods.
Introduction to The MCMC Strategy
This week, we will learn three practical algorithms that use the detailed balance principle
to construct Markov chains that sample from any target distribution we specify. These
algorithms form the foundation of Markov Chain Monte Carlo (MCMC) methods.
All the algorithms this week follow the same basic pattern. Let's understand this
pattern before we look at specific algorithms.
1. We are at some current state: Call it xt. This might be a guess for the coin's
bias, or a proposed parameter value.
After many iterations, we have a chain of states: x0, x1, x2, . . . , xN. If we designed the
acceptance probability correctly, these states behave as samples from the distribution we
want.
In each case we ask: "What acceptance probability makes detailed balance true?" And that's the acceptance probability we use.
Different algorithms use different proposal strategies, which leads to different acceptance
formulas. But all of them are derived from the same principle: detailed balance.
Our goal: explore this landscape so we spend time at each location proportional to its
height. We want to spend more time on the tall peaks and less time in the valleys.
Here’s the intuitive principle for Metropolis:
If we propose moving to a higher place (uphill), we should almost always go there. Spend
more time on tall peaks!
A Concrete Example
We are at a spot with probability π(xt) = 0.8. We propose moving to a spot with
probability π(x′) = 0.4.
This is downhill. The ratio is 0.4/0.8 = 0.5. So we accept the move with 50 percent
probability. Half the time we make the move, half the time we stay put.
If we proposed a spot with probability π(x′) = 0.9, this is uphill. We would accept with
probability min(1, 0.9/0.8) = 1, meaning we always go there.
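A few lines of code make the accept-or-reject rule concrete. The helper below is illustrative (the function name is my own, not from the lectures); it reproduces the two numbers above:

```python
import random

def metropolis_accept(pi_current, pi_proposed, rng=random.random):
    """Accept a proposed move with probability min(1, pi_proposed / pi_current)."""
    alpha = min(1.0, pi_proposed / pi_current)
    return rng() < alpha, alpha

# Downhill move from the example: pi(x_t) = 0.8 -> pi(x') = 0.4.
_, alpha_down = metropolis_accept(0.8, 0.4)
print(alpha_down)   # 0.5: accept half the time, stay put otherwise

# Uphill move: pi(x_t) = 0.8 -> pi(x') = 0.9.
_, alpha_up = metropolis_accept(0.8, 0.9)
print(alpha_up)     # 1.0: always accept
```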
x′ = xt + ϵ, where ϵ ∼ N(0, σ²)
• We add it to our current state xt .
• Since the distances are the same, the probabilities are equal!
If we jump +5 units forward with some probability, we can jump −5 units backward with
the exact same probability. That’s symmetry.
If our proposal were asymmetric (biased in one direction), that bias would interfere with
detailed balance. We would need to mathematically correct for it. Metropolis avoids this
complication by requiring symmetry from the start.
When the proposal is asymmetric, we use the more general Metropolis-Hastings algorithm
(coming next).
Now for the key derivation. We will figure out what acceptance probability ensures
detailed balance.
Figure 2: Detailed Balance
1. We propose a move using the proposal distribution. The probability is q(x′ |xt ).
Applying Detailed Balance
We want detailed balance to hold. Recall what that means: the flow from xt to x′ equals
the flow from x′ back to xt. In symbols:
π(xt) · q(x′ | xt) · α(x′, xt) = π(x′) · q(xt | x′) · α(xt, x′)
Since our proposal is symmetric, q(x′ | xt) = q(xt | x′). These cancel:
π(xt) · α(x′, xt) = π(x′) · α(xt, x′)
Rearranging:
α(x′, xt) / α(xt, x′) = π(x′) / π(xt)
There are many ways to satisfy this ratio. One elegant choice is:
α(x′, xt) = min(1, π(x′) / π(xt))
This reads as: "accept with probability equal to the ratio of the new probability to the
old probability, but cap it at 1."
• Uphill case: If π(x′) > π(xt), then π(x′)/π(xt) > 1, so α(x′, xt) = 1. We always accept.
And α(xt, x′) = π(xt)/π(x′) (which is less than 1).
Check: π(xt) · 1 = π(x′) · π(xt)/π(x′) = π(xt). Correct!
• Downhill case: If π(x′) < π(xt), then π(x′)/π(xt) < 1, so α = π(x′)/π(xt). We accept with this
probability. And α(xt, x′) = 1.
Check: π(xt) · π(x′)/π(xt) = π(x′) · 1 = π(x′). Correct!
Therefore:
α(x′, xt) = min(1, π(x′) / π(xt))
Concrete Example
Starting out:
1. Pick any starting point x0. It doesn't matter—the chain will forget where it started.
2. Set t = 0.
3. Choose a proposal standard deviation σ. (This controls how far we jump. We will
adjust this later.)
Then, at each iteration:
1. Propose a jump:
x′ = xt + ϵ, where ϵ ∼ N(0, σ²)
2. Calculate the acceptance probability:
α = min(1, π(x′) / π(xt))
If we only know the target up to a normalizing constant, π(x) ∝ f(x), the constant
cancels in the ratio:
α = min(1, f(x′) / f(xt))
3. Accept or reject:
• We have a chain: x0 , x1 , . . . , xN .
• Discard the first B samples (burn-in) to let the chain forget its starting point.
• Use the remaining N − B samples as our draws from the distribution we want.
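The recipe above can be sketched in a few lines of Python. Everything here is illustrative: the target (a standard normal, known only up to a constant), the starting point, the step size, and the burn-in length are all assumptions, not prescriptions:

```python
import math
import random

def metropolis(log_f, x0, sigma, n_steps, seed=0):
    """Random-walk Metropolis; log_f is the log of the (unnormalized) target density."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, sigma)           # 1. propose a symmetric jump
        log_alpha = log_f(x_prop) - log_f(x)         # 2. log acceptance ratio (before capping)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):   # 3. accept or reject
            x, accepted = x_prop, accepted + 1
        chain.append(x)                              # on rejection, the old x repeats
    return chain, accepted / n_steps

# Illustrative target: standard normal, known only up to its normalizing constant.
chain, rate = metropolis(lambda x: -0.5 * x * x, x0=5.0, sigma=1.0, n_steps=20000)
samples = chain[2000:]                               # discard burn-in
print(sum(samples) / len(samples))                   # close to 0, the true mean
```

Working with log densities, as here, avoids numerical underflow when π(x) is tiny; the ratio π(x′)/π(xt) becomes a difference of logs.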
We derived the acceptance ratio by requiring detailed balance. Let’s verify it actually
maintains that balance.
With acceptance ratio α and symmetric proposal, detailed balance becomes:
π(xt) · min(1, π(x′)/π(xt)) = π(x′) · min(1, π(xt)/π(x′))
Suppose π(x′) < π(xt); the opposite case works the same way.
• Left: π(xt) · π(x′)/π(xt) = π(x′)
• Right: π(x′) · 1 = π(x′)
Both sides match, so detailed balance is maintained.
We flip a coin 100 times and get 61 heads and 39 tails. We want to estimate the coin’s
bias—that is, the probability θ that it lands heads.
In Bayesian inference (which we learned in Week 4), we combine prior beliefs with data
using this formula:
Posterior = (Likelihood × Prior) / Normalizing Constant
Or, as it is usually written, as a proportionality:
P(θ | data) ∝ P(data | θ) · P(θ)
• Prior: We assume no bias toward heads or tails, so the prior is uniform:
P (θ) = 1
We start at θ0 = 0.5 and run 11,000 iterations with proposal standard deviation σ = 0.05.
We discard the first 1,000 as burn-in.
Results:
• Acceptance rate: 43.1% (good! We want around 44% for one-dimensional prob-
lems).
• 95% Credible Interval: [0.516, 0.705] (We are 95% confident the true bias is in
this range).
Notice: the observed frequency (61/100 = 0.61) matches our posterior mean. Metropolis
found the right answer!
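The coin experiment can be reproduced with a short script. This is a sketch: the uniform prior, the 61-heads-in-100-flips data, σ = 0.05, 11,000 iterations, and the 1,000-draw burn-in come from the text, while the seed and bookkeeping details are my own choices:

```python
import math
import random

def log_posterior(theta, heads=61, flips=100):
    """log(Likelihood x Prior) with a uniform prior; impossible outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)

rng = random.Random(42)
theta, chain = 0.5, []
for _ in range(11000):
    prop = theta + rng.gauss(0.0, 0.05)              # symmetric proposal, sigma = 0.05
    log_alpha = log_posterior(prop) - log_posterior(theta)
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        theta = prop
    chain.append(theta)

samples = sorted(chain[1000:])                       # drop the 1,000 burn-in draws
mean = sum(samples) / len(samples)
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(mean, (lo, hi))                                # mean near 0.61, interval near [0.51, 0.70]
```

Proposals that land outside (0, 1) get log posterior −∞ and are always rejected, which is exactly how a uniform prior on [0, 1] should behave.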
2 Metropolis-Hastings: When We Have Biased Proposals
Metropolis works great when we can use a symmetric proposal. But in many real situa-
tions, we have information suggesting an asymmetric (biased) proposal would be better.
For example:
• We might know roughly where the high-probability region is and want to propose
in that direction.
• We might have a preliminary estimate from a related problem and want to propose
near it.
The problem: If our proposal is asymmetric, the simple Metropolis formula no longer
maintains detailed balance. We need to correct for the asymmetry.
Imagine two cities connected by transportation networks with different quality in each
direction.
Figure 3: Asymmetric Proposal
• Forward (City A to City B): There’s a nice highway. Many people travel this
way. q(B|A) is large.
• Backward (City B to City A): There’s a dirt road. Few people travel this way.
q(A|B) is small.
The proposal is biased forward. To maintain population balance (which is detailed
balance), we must accept backward moves more readily than forward moves.
The correction is the inverse of the bias ratio. If forward travel is 10 times easier, we
accept forward moves 10 times less often.
Detailed balance requires the Metropolis-Hastings acceptance probability:
α(x′, xt) = min(1, [π(x′) · q(xt | x′)] / [π(xt) · q(x′ | xt)])
The ratio inside has two parts:
1. Target ratio: π(x′)/π(xt) (compares probabilities, like Metropolis).
2. Proposal ratio: q(xt | x′)/q(x′ | xt) (corrects for proposal asymmetry).
The proposal ratio is the correction factor from the city analogy:
Correction = q(backward) / q(forward)
• If the forward proposal is heavily biased (q(forward) > q(backward)), then the ratio
is small. This pulls down the acceptance probability, compensating for the forward
bias.
• If the backward proposal is favored (q(backward) > q(forward)), then the ratio is
large. This pulls up the acceptance, compensating for the backward bias.
Special Case: Metropolis is Metropolis-Hastings with Symmetry
If the proposal is symmetric, q(x′ | xt) = q(xt | x′), so the ratio becomes 1:
α = min(1, π(x′)/π(xt) · 1) = min(1, π(x′)/π(xt))
and we recover the plain Metropolis rule.
Starting out:
1. Pick any starting point x0.
2. Set t = 0.
Then, at each iteration:
1. Propose:
x′ ∼ q(x′ | xt)
2. Calculate acceptance:
α = min(1, [π(x′) · q(xt | x′)] / [π(xt) · q(x′ | xt)])
3. Accept or reject:
After N iterations:
• Discard burn-in.
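The full loop can be sketched as one generic function. The names here (log_f for the log target, log_q for the log proposal density, sample_q for the proposal draw) are my own placeholders for the three ingredients the algorithm needs:

```python
import math
import random

def metropolis_hastings(log_f, log_q, sample_q, x0, n_steps, seed=0):
    """Generic MH loop. log_f: log of the (unnormalized) target.
    log_q(y, x): log proposal density of y given current x.
    sample_q(x, rng): draws one proposal given the current state."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        x_prop = sample_q(x, rng)
        # Acceptance = target ratio times proposal ratio (backward over forward).
        log_alpha = (log_f(x_prop) - log_f(x)) + (log_q(x, x_prop) - log_q(x_prop, x))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = x_prop
        chain.append(x)
    return chain

# Sanity check with a symmetric Gaussian proposal: the correction cancels,
# and the sampler behaves exactly like plain Metropolis.
chain = metropolis_hastings(
    log_f=lambda x: -0.5 * x * x,              # illustrative target: standard normal
    log_q=lambda y, x: -0.5 * (y - x) ** 2,    # symmetric in (y, x)
    sample_q=lambda x, rng: x + rng.gauss(0.0, 1.0),
    x0=0.0, n_steps=20000)
samples = chain[2000:]
print(sum(samples) / len(samples))             # close to 0
```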
2.5 Real Example: A Skewed Distribution
A symmetric Gaussian random walk proposal would waste many proposals on negative
values (which have zero probability). An asymmetric exponential proposal that favors
positive values would be much more efficient.
We use an asymmetric proposal. This proposal biases jumps toward the positive direction,
matching the distribution’s shape.
Results:
The asymmetric proposal concentrates sampling where the distribution is large, improving
efficiency. The proposal ratio correction ensures detailed balance.
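Since the text does not pin down the exact distributions, the sketch below assumes an Exponential(1) target and an independence proposal that always draws positive values from an Exponential(0.5); both rates are illustrative. Because the proposal ignores the current state, the correction q(backward)/q(forward) reduces to q(old)/q(new), and skipping it would give wrong answers:

```python
import math
import random

rng = random.Random(1)
lam_target, lam_prop = 1.0, 0.5    # assumed rates, for illustration only

def log_target(x):                 # Exp(1) target: zero density for x < 0
    return -lam_target * x if x >= 0 else -math.inf

def log_prop(x):                   # Exp(0.5) independence-proposal density
    return math.log(lam_prop) - lam_prop * x

x, chain = 1.0, []
for _ in range(20000):
    x_new = rng.expovariate(lam_prop)            # always proposes a positive value
    # Proposal ratio q(backward)/q(forward); for an independence proposal
    # this is q(old)/q(new), since the proposal ignores the current state.
    log_alpha = (log_target(x_new) - log_target(x)) + (log_prop(x) - log_prop(x_new))
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = x_new
    chain.append(x)

samples = chain[2000:]
print(sum(samples) / len(samples))               # close to 1, the mean of Exp(1)
```

No proposal is ever wasted on negative values, which is exactly the efficiency gain the text describes.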
The Problem We’re Solving
Gibbs Sampling takes a different, cleverer approach: update variables one at a time.
It’s like feeling your way around the room step by step, which often works much better.
• But even without rain, a sprinkler could make the grass wet.
The key insight: X and Y depend on each other, but we can use this dependence to
our advantage in MCMC.
In probability, when two variables are connected, we describe this using conditional prob-
ability:
– Example: ”If the grass is soaking wet, what are the likely rainfall amounts?”
The answer is a probability distribution over possible rainfall values.
– Even if we don’t know the exact rainfall, knowing the grass is wet gives us
information that makes high rainfall more likely than low rainfall.
• P (Y |X) = ”the probability distribution of Y given that we know X”
– Example: ”If it rained heavily, how wet would the grass likely be?” The answer
is a distribution.
– Knowing heavy rain occurred makes high grass wetness more likely.
This is different from P (X) alone (”how much does it typically rain?”) because we’re
conditioning on information about Y .
Here’s the beautiful idea behind Gibbs sampling: Instead of proposing new values
for both X and Y randomly (like in Metropolis-Hastings), we update them
sequentially using their conditional distributions.
Algorithm:
1. Start at (X0 , Y0 ).
2. Sample a new X1 from P (X|Y0 ).
3. Sample a new Y1 from P (Y |X1 ).
4. Go back to step 2 with new (X1 , Y1 ) and repeat the process again and again.
Example:
• Step 2: We freeze the grass wetness at its current value. We ask: ”Given that the
grass is this wet, what should the new rainfall be?” We answer this by sampling
from P (X|Y ). The result is a new rainfall amount X1 .
• Step 3: Now we freeze the rainfall at its new value X1 . We ask: ”Given that
it rained this much, how wet should the grass be?” We answer by sampling from
P (Y |X1 ). The result is a new grass wetness Y1 .
• Step 4: Go back to Step 2 with the new (X1 , Y1 ) and repeat the process.
Why Does This Stay Consistent?
• After Step 2, we have a new X1 that’s consistent with the current Y (we asked
”given Y , what’s a good X?”).
• After Step 3, we have a new Y1 that’s consistent with the new X1 (we asked ”given
the new X, what’s a good Y ?”).
This back-and-forth ensures that X and Y stay in the realistic region—we don’t produce
impractical combinations like ”no rain but soaking wet grass” (well, almost never).
Unlike Metropolis-Hastings, which can jump diagonally, Gibbs sampling moves in straight
lines (orthogonally), updating one coordinate at a time. This creates a ”staircase” path
toward the high-probability region.
Figure 4: Gibbs Sampling Trajectory
• Moving outward: ”Light rain, dry grass” or ”Heavy rain, very wet grass” (real-
istic, but less common).
The tilt of the ellipse shows the correlation between X and Y :
• If rain and wetness are strongly linked, the ellipse is tilted (elongated along a
diagonal).
Let’s trace through what’s happening in Figure 4 using our rain/grass intuition. Let the
X-axis represent Rainfall and the Y-axis represent Grass Wetness.
Iteration 1:
2. Hold wetness at -1.5. Ask: ”Given grass wetness = -1.5, what’s the likely
rainfall?”
3. Hold rainfall at -0.9. Ask: ”Given rainfall = -0.9, what’s the likely grass wetness?”
Repeat this process again and again to get iteration 2, 3, and so on.
Each step stays on the ellipse because we’re using the conditional distributions. The
alternating horizontal (update X) and vertical (update Y) moves create a ”staircase”
pattern that efficiently explores the realistic region.
This is the deep theoretical result: Even though we update sequentially (not
jointly), the pairs (X, Y ) we generate are valid samples from the true joint
distribution P (X, Y ).
Intuitive reason:
• When we sample X from P (X|Y ), we’re asking: ”What X values are consistent
with this Y ?”
• When we sample Y from P (Y |X), we’re asking: ”What Y values are consistent
with this X?”
• This back-and-forth between consistent updates means the chain naturally explores
the high-probability region (the ellipse).
• After many iterations, the frequency of visiting each (X, Y ) pair in our MCMC
chain matches the true probability of that pair in the joint distribution.
This happens automatically: we don’t need to reject proposals or apply any correction.
The conditional distributions build in all the structure we need.
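As a concrete sketch, take a standard bivariate normal for (rainfall X, wetness Y) with an assumed correlation ρ = 0.8 (matching the tilted ellipse in Figure 4, though the exact value there is not stated). Its conditionals are the Gaussians P(X | Y) = N(ρY, 1 − ρ²) and P(Y | X) = N(ρX, 1 − ρ²), so each Gibbs update is a single Gaussian draw with no accept/reject step:

```python
import random

rho = 0.8                          # assumed correlation between rainfall X and wetness Y
sd = (1 - rho * rho) ** 0.5        # standard deviation of each conditional
rng = random.Random(0)

x, y = -1.5, -1.5                  # arbitrary start; the chain forgets it
xs, ys = [], []
for _ in range(20000):
    x = rng.gauss(rho * y, sd)     # freeze Y, sample X from P(X | Y)
    y = rng.gauss(rho * x, sd)     # freeze the new X, sample Y from P(Y | X)
    xs.append(x)
    ys.append(y)

xs, ys = xs[2000:], ys[2000:]      # discard burn-in
n = len(xs)
mean_x = sum(xs) / n
corr = sum(a * b for a, b in zip(xs, ys)) / n   # E[XY] = rho for this target
print(mean_x, corr)                # mean near 0, correlation near 0.8
```

Every update is accepted, yet the chain recovers the joint structure: the empirical correlation of the (X, Y) pairs matches ρ.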
Due to how joint distributions factor (break into pieces), these terms cancel perfectly:
α = min(1, 1) = 1
3.3 Why This Works
The fact that Gibbs always accepts is not luck. By proposing directly from the conditional
distribution P (Xi |others), our proposal is perfectly aligned with the target.
In Metropolis and Metropolis-Hastings, we use an external proposal that might not match
the target. We need acceptance-rejection to correct the mismatch. Gibbs avoids this mis-
match by proposing from the conditional. The proposal is always correct, so acceptance
is automatic.
1. Start: Pick any initial values for your variables (they don’t matter much—Gibbs
will forget them).
2. Loop: Update each variable one at a time using its conditional distribution.
3. Stop: After many iterations, throw away the early ”learning phase” and keep the
rest.
The algorithm is clever because it accepts every proposed update—no rejections like in
Metropolis-Hastings.
Initialization:
• Update X1 : Sample a new value X1^(t+1) from P (X1 | X2^(t), X3^(t), . . . , Xp^(t)).
Use the current (old) values of all other variables.
• Update X2 : Sample a new value X2^(t+1) from P (X2 | X1^(t+1), X3^(t), . . . , Xp^(t)).
Use the newly updated X1 , but old values of the rest.
Always use the most recent values for variables already updated in this iteration.
Use old values for variables not yet updated.
• Discard the first several iterations (burn-in): These are from when the chain was
still ”learning” and had not yet converged.
• Keep the remaining iterations: These are valid samples from the true posterior
distribution.
Each update respects the relationship between the variable being updated and all others.
By cycling through variables this way, the chain naturally stays in the high-probability
region and converges to the true posterior.
Strengths:
• Natural for structured models: Graphical models, hierarchical Bayes, and mix-
ture models have built-in conditional structure.
Limitations:
• Requires easy conditionals: If P (Xi |others) is complicated, we can’t sample
from it.
• Slow for correlated variables: Updating one variable at a time means correlated
variables take many steps to exchange information.
Table 1: Comparison of the three MCMC sampling methods learned this week.
Learning MCMC? Start with Metropolis. It’s simple and teaches the core idea: how
to derive acceptance rates from detailed balance.
Asymmetric proposals? Use Metropolis-Hastings. The proposal ratio corrects for asymmetry.
Structured Problems? Use Gibbs if conditionals are easy to form. It’s usually fastest
and needs no tuning.
Real applications? Combine them. Use Gibbs steps for variables with easy conditionals,
and Metropolis-Hastings steps for the others.
Every MCMC chain starts somewhere arbitrary. It takes time to ”forget” this starting
point and to converge to the stationary distribution.
For Metropolis and Metropolis-Hastings, the acceptance rate (fraction of proposals ac-
cepted) tells us something important.
Targets: around 44% acceptance for one-dimensional problems, and roughly 23% in high
dimensions.
If acceptance is too high (80%), we are making tiny steps and mixing slowly. If ac-
ceptance is too low (5%), we are making huge steps and getting rejected too often. The
targets balance exploration and mixing.
How to tune:
• Too high acceptance? Increase proposal step size (propose bigger jumps)
• Too low acceptance? Decrease proposal step size (propose smaller jumps)
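This tuning loop can be automated during warm-up. The sketch below is one crude scheme, not a standard recipe: the batch size, the 1.2 adjustment factor, and the round count are arbitrary choices, and the target is an illustrative standard normal:

```python
import math
import random

def tune_sigma(log_f, x0, sigma0, target_rate=0.44, rounds=30, batch=200, seed=0):
    """Crude warm-up tuning: run short batches, nudge sigma toward the target rate."""
    rng = random.Random(seed)
    x, sigma = x0, sigma0
    for _ in range(rounds):
        accepted = 0
        for _ in range(batch):
            prop = x + rng.gauss(0.0, sigma)
            log_alpha = log_f(prop) - log_f(x)
            if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
                x, accepted = prop, accepted + 1
        if accepted / batch > target_rate:
            sigma *= 1.2           # accepting too often: steps too small, enlarge them
        else:
            sigma /= 1.2           # rejecting too often: steps too big, shrink them
    return sigma

# Illustrative: tune for a standard normal target, starting with far-too-small steps.
sigma = tune_sigma(lambda x: -0.5 * x * x, x0=0.0, sigma0=0.1)
print(sigma)                       # settles near the well-mixing range for this target
```

In practice, tuning like this is done only during warm-up; the adaptation is frozen before the samples you keep, so the final chain still satisfies detailed balance.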
[Left plot: Poor convergence shows systematic drift and long periods stuck at one value.
Right plot: Good convergence shows random fluctuations around a stable mean after
burn-in.]
A trace plot is a simple graph: horizontal axis is iteration number, vertical axis is pa-
rameter value.
Good convergence looks like white noise—random fluctuations around a stable mean.
No drift, no trends.
Poor convergence shows:
• Long-term trends.
Visual inspection of trace plots is the first step in diagnosing MCMC quality. We visually
check if the chain has settled into its stationary distribution. If not, we increase burn-in
or run longer.
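Short of plotting, a trace-plot-style check can be done numerically: after burn-in, the first and second halves of the chain should agree on the mean. A minimal sketch, with an illustrative standard normal target and a deliberately bad starting point so the burn-in matters:

```python
import math
import random

def halves_diagnostic(chain, burn_in):
    """Trace-plot-style check: post-burn-in, the two halves should have similar means."""
    kept = chain[burn_in:]
    half = len(kept) // 2
    return sum(kept[:half]) / half, sum(kept[half:]) / (len(kept) - half)

# Illustrative chain: random-walk Metropolis on a standard normal target.
rng = random.Random(0)
x, chain = 8.0, []                 # deliberately bad starting point
for _ in range(20000):
    prop = x + rng.gauss(0.0, 1.0)
    log_alpha = 0.5 * x * x - 0.5 * prop * prop
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        x = prop
    chain.append(x)

m1, m2 = halves_diagnostic(chain, burn_in=2000)
print(abs(m1 - m2))                # small once the chain has settled
```

A large gap between the two half-chain means is the numerical counterpart of visible drift in a trace plot: increase the burn-in or run longer.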
Figure 6 shows a bimodal probability density with two distant peaks, illustrating how an
MCMC chain can get stuck in one peak.
When a posterior has multiple distant peaks, basic MCMC can get stuck exploring one
peak and rarely jump to another.
Solution: Run very long chains, or use tempering/simulated annealing to help the chain
jump between modes.
In 50+ dimensions, Metropolis and Metropolis-Hastings mix slowly without careful tun-
ing.
Solution: Use more sophisticated methods like HMC (mentioned at the end).
When variables are strongly correlated, updating one at a time (Gibbs) requires many
steps for information to flow.
Solution: Use blocked Gibbs (update correlated groups together) [Beyond our project’s
scope].
But over the last 10–15 years, the field developed more powerful methods. Modern soft-
ware packages like Stan, PyMC, and TensorFlow Probability use these newer algorithms.
Key example: Hamiltonian Monte Carlo (HMC). HMC uses gradient information
(which direction probability increases) to propose moves more intelligently. This leads
to:
• Faster exploration.
• Better performance in high dimensions (50+ variables).
Important note: HMC is beyond this project’s scope and won’t be part of your ap-
plications. Learning HMC requires knowledge of physics (Hamiltonian dynamics) and
numerical integration methods—separate topics entirely.
However, HMC builds directly on what you have learnt. The detailed balance principle,
the acceptance ratio derivation—all the same. HMC just uses more clever proposals.
Over the past few weeks, we have built a complete understanding of sampling:
• Week 3 showed us why we need better methods (curse of dimensionality) and how
independent sampling works.
• The following weeks taught us the mathematics of Markov chains and detailed
balance—the theoretical engine.
• This current week showed us three practical algorithms that use detailed balance
to sample from any distribution we specify.
We now understand not just the algorithms, but the principles underneath. We know
why detailed balance matters and how it ensures correctness.
We have developed the foundations for our applications. Starting from next week, we
will apply these methods to real problems in various fields. We will learn to translate
scientific questions into Bayesian models, choose the right algorithm for each problem,
and validate our results using diagnostics.
The journey from understanding sampling to solving real problems is about to begin.