Introduction to Machine Learning
CMU-10701
Markov Chain Monte Carlo Methods
Barnabás Póczos & Aarti Singh
Contents
Markov Chain Monte Carlo Methods
• Goal & Motivation
Sampling
• Rejection
• Importance
Markov Chains
• Properties
MCMC sampling
• Hastings-Metropolis
• Gibbs
2
Monte Carlo Methods
3
The importance of MCMC
A recent survey places the Metropolis algorithm among the
10 algorithms that have had the greatest influence on the
development and practice of science and engineering in the 20th
century (Beichl & Sullivan, 2000).
The Metropolis algorithm is an instance of a large class of sampling
algorithms, known as Markov chain Monte Carlo (MCMC).
4
MCMC Applications
MCMC plays a significant role in statistics, econometrics, physics and
computing science.
Sampling from high-dimensional, complicated distributions
Bayesian inference and learning
Marginalization
Normalization
Expectation
Global optimization
5
The Monte Carlo principle
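Our goal is to estimate the following integral:
$I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx$
The idea of Monte Carlo simulation is to draw an i.i.d. set of samples
$\{x^{(i)}\}_{i=1}^{N}$ from a target density p(x) defined on a high-dimensional space X.
Estimator:
$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$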
6
The Monte Carlo principle
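Theorems
• Unbiased estimation: the estimator $I_N(f)$ satisfies $E[I_N(f)] = I(f)$
• Almost surely (a.s.) consistent: $I_N(f) \to I(f)$ as $N \to \infty$
• Asymptotically normal: $\sqrt{N}\,(I_N(f) - I(f)) \Rightarrow \mathcal{N}(0, \sigma_f^2)$ when the variance $\sigma_f^2$ of f(x) under p is finite
• The $1/\sqrt{N}$ convergence rate is independent of the dimension d!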
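A minimal sketch of the Monte Carlo estimator, assuming an illustrative target p = N(0, 1) and test function f(x) = x² (so the true value is 1). Neither choice comes from the slides; the helper name `mc_estimate` is hypothetical.

```python
import random
import statistics

def mc_estimate(f, sampler, n, rng):
    """Monte Carlo estimate of E_p[f(x)] from n i.i.d. draws of sampler(rng)."""
    return sum(f(sampler(rng)) for _ in range(n)) / n

rng = random.Random(0)
f = lambda x: x * x                    # assumed test function; E[x^2] = 1 under N(0, 1)
sampler = lambda r: r.gauss(0.0, 1.0)  # assumed target density p = N(0, 1)

# Repeat the estimate several times to see its spread shrink as N grows.
for n in (100, 2500):
    estimates = [mc_estimate(f, sampler, n, rng) for _ in range(40)]
    print(n, statistics.mean(estimates), statistics.stdev(estimates))
```

The spread of the repeated estimates should shrink roughly like 1/√N, whatever the dimension of x.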
7
The Monte Carlo principle
One “tiny” problem…
Monte Carlo methods need samples from the distribution p(x).
When p(x) has standard form, e.g. Uniform or Gaussian, it is
straightforward to sample from it using easily available routines.
However, when this is not the case, we need to introduce more
sophisticated sampling techniques. ⇒ MCMC sampling
8
Sampling
Rejection sampling
Importance sampling
9
Main Goal
Sample from distribution p(x) that is only known up
to a proportionality constant
For example,
p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
10
Rejection Sampling
11
Rejection Sampling Conditions
Suppose that
p(x) is known up to a proportionality constant
p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
It is easy to sample from q(x) that satisfies p(x) ≤ M q(x), M < ∞
M is known
12
Rejection Sampling Algorithm
13
Rejection Sampling
Theorem
The accepted x^(i) can be shown to be distributed according to p(x)
(Robert & Casella, 1999, p. 49).
Severe limitations:
It is not always possible to bound p(x)/q(x) with a reasonable
constant M over the whole space X.
If M is too large, the acceptance probability is too small.
In high-dimensional spaces it can be exponentially slow to sample
points. (Most proposed points will be rejected.)
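A sketch of rejection sampling for the bimodal example target. The proposal q = N(5, 10²) and the bound M = 25 are assumptions chosen for illustration (for this pair the ratio p̃(x)/q(x) stays below about 20); they are not taken from the slides.

```python
import math
import random

def p_tilde(x):
    """Unnormalized bimodal target from the slides."""
    return 0.3 * math.exp(-0.2 * x**2) + 0.7 * math.exp(-0.2 * (x - 10)**2)

MU, SIGMA, M = 5.0, 10.0, 25.0  # assumed proposal N(5, 10^2) and bound M

def q_pdf(x):
    """Density of the assumed Gaussian proposal."""
    return math.exp(-0.5 * ((x - MU) / SIGMA)**2) / (SIGMA * math.sqrt(2 * math.pi))

def rejection_sample(n, rng):
    """Draw n accepted samples; also return the number of proposals used."""
    samples, proposals = [], 0
    while len(samples) < n:
        x = rng.gauss(MU, SIGMA)             # propose x ~ q
        u = rng.random()                     # u ~ Uniform(0, 1)
        proposals += 1
        if u * M * q_pdf(x) < p_tilde(x):    # accept with prob. p_tilde(x) / (M q(x))
            samples.append(x)
    return samples, proposals

samples, proposals = rejection_sample(5000, random.Random(0))
print("acceptance rate:", len(samples) / proposals)
print("fraction in the mode near 10:", sum(x > 5 for x in samples) / len(samples))
```

A larger M keeps the bound valid but lowers the acceptance rate, which is the limitation listed above.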
14
Importance Sampling
15
Importance Sampling
Goal: Sample from distribution p(x) that is only known up to a
proportionality constant
Importance sampling is an alternative “classical” solution that goes
back to the 1940s.
Let us introduce, again, an arbitrary importance proposal distribution
q(x) such that its support includes the support of p(x).
Then we can rewrite I(f) as follows:
$I(f) = \int f(x)\, w(x)\, q(x)\, dx$, where $w(x) = p(x)/q(x)$ is the importance weight.
16
Importance Sampling
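Consequently, drawing N i.i.d. samples $x^{(i)}$ from q(x) gives the estimator
$\hat{I}_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})\, w(x^{(i)})$, with weights $w(x) = p(x)/q(x)$.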
17
Importance Sampling
Theorem
This estimator is unbiased
Under weak assumptions, the strong law of large numbers applies: the estimate converges to I(f) almost surely as N → ∞.
Some proposal distributions q(x) will obviously be preferable to others.
Which one should we choose?
18
Importance Sampling
Find one that minimizes the variance of the estimator!
Theorem
The variance is minimal when we adopt the following
optimal importance distribution:
$q^{*}(x) \propto |f(x)|\, p(x)$
20
Importance Sampling
The optimal proposal is not very useful in the sense that it is not easy to
sample from
High sampling efficiency is achieved when we focus on sampling from p(x)
in the important regions where |f (x)|p(x) is relatively large; hence the
name importance sampling
Importance sampling estimates can be super-efficient:
For a given function f (x), it is possible to find a distribution q(x)
that yields an estimate with a lower variance than when using
q(x)= p(x)!
In high dimensions it is not efficient either…
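The super-efficiency remark can be illustrated with a small sketch. It assumes p = N(0, 1), f(x) = 1 if x > 3 else 0 (a tail probability), and a shifted proposal q = N(4, 1); all three are illustrative choices, and `importance_estimate` is a hypothetical helper.

```python
import math
import random

def normal_pdf(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu)**2) / math.sqrt(2 * math.pi)

def importance_estimate(f, p_pdf, q_pdf, q_sample, n, rng):
    """Average of f(x) * p(x) / q(x) over n draws x ~ q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n

f = lambda x: 1.0 if x > 3 else 0.0          # assumed f: indicator of the tail x > 3
p_pdf = lambda x: normal_pdf(x, 0.0)         # assumed target p = N(0, 1)
exact = 0.5 * math.erfc(3 / math.sqrt(2))    # P(X > 3) for X ~ N(0, 1)

rng = random.Random(0)
# q = p: plain Monte Carlo, very few samples land in the tail.
plain = importance_estimate(f, p_pdf, p_pdf, lambda r: r.gauss(0.0, 1.0), 10_000, rng)
# q = N(4, 1): samples concentrate where |f(x)| p(x) is large.
shifted = importance_estimate(f, p_pdf, lambda x: normal_pdf(x, 4.0),
                              lambda r: r.gauss(4.0, 1.0), 10_000, rng)
print("exact:", exact, "q = p:", plain, "q = N(4, 1):", shifted)
```

With the same number of samples, the shifted proposal should give a much smaller error than sampling from p itself.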
21
MCMC sampling - Main ideas
Create a Markov chain, which has the desired limiting distribution!
22
Andrey Markov
Markov Chains
23
Markov Chains
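Markov chain: the next state depends only on the current state,
$P(x^{(i)} \mid x^{(i-1)}, \ldots, x^{(1)}) = T(x^{(i)} \mid x^{(i-1)})$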
Homogeneous Markov chain: the transition probabilities do not depend on the step i.
24
Markov Chains
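Assume that the state space is finite: {1, …, s}
1-Step state transition matrix: $T_{ij} = P(X_{n+1} = j \mid X_n = i)$
Lemma: The state transition matrix is stochastic: $T_{ij} \ge 0$ and $\sum_{j=1}^{s} T_{ij} = 1$ for every i
t-Step state transition matrix: $T^{(t)}_{ij} = P(X_{n+t} = j \mid X_n = i)$
Lemma: $T^{(t)} = T^{t}$ (the t-th matrix power of T)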
25
Markov Chains Example
Markov chain with three states (s = 3)
[Figure: transition matrix and transition graph]
26
Markov Chains, stationary distribution
Definition:
A probability vector π is a stationary distribution of the chain with transition matrix T if πT = π.
[also called invariant distribution or steady-state distribution]
The stationary distribution might not be unique (e.g. T = identity matrix).
27
Markov Chains, limit distributions
Some Markov chains have unique limit distribution:
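If the probability vector for the initial state is $\mu(x_1)$,
it follows that the distribution after one step is $\mu(x_1) T$
and, after several iterations (multiplications by T), $\mu(x_1) T^{t}$ converges to the
limit distribution,
no matter what the initial distribution $\mu(x_1)$ was.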
The chain has forgotten its past.
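A sketch of this forgetting behaviour. The 3-state transition matrix and the two initial vectors below are assumed for illustration only, since the slide's own numbers are not recoverable here.

```python
def step(mu, T):
    """One step on distributions: returns the row vector mu T."""
    n = len(T)
    return [sum(mu[i] * T[i][j] for i in range(n)) for j in range(n)]

def evolve(mu, T, steps):
    """Apply the transition matrix `steps` times to the distribution mu."""
    for _ in range(steps):
        mu = step(mu, T)
    return mu

# Assumed 3-state transition matrix (each row sums to 1).
T = [[0.0, 1.0, 0.0],
     [0.0, 0.1, 0.9],
     [0.6, 0.4, 0.0]]

a = evolve([0.5, 0.2, 0.3], T, 200)  # one assumed initial distribution
b = evolve([1.0, 0.0, 0.0], T, 200)  # a different one
print(a)
print(b)
```

Both runs should print the same vector, and applying `step` once more should leave it unchanged, so the limit is also a stationary distribution.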
28
Markov Chains
Our goal is to find conditions under which the Markov chain
converges to a unique limit distribution (independently of its
starting state distribution).
Observation:
If this limiting distribution exists, it has to be the stationary distribution.
29
Limit Theorem of Markov Chains
Theorem:
If the Markov chain is irreducible and aperiodic, then:
$\lim_{t \to \infty} P(X_t = j \mid X_0 = i) = \pi_j$ for all states i, j.
That is, the chain will converge to the unique stationary distribution π.
30
Markov Chains
Definition
Irreducibility:
For each pair of states (i, j), there is a positive probability that the process,
starting in state i, will eventually enter state j.
= The matrix T cannot be reduced to separate smaller matrices.
= The transition graph is strongly connected.
It is possible to get to any state from any state.
31
Markov Chains
Definition
Aperiodicity: The chain cannot get trapped in cycles.
Definition
A state i has period k if any return to state i must occur in multiples of
k time steps. Formally, the period of a state i is defined as
$k = \gcd\{ n > 0 : P(X_n = i \mid X_0 = i) > 0 \}$
(where "gcd" is the greatest common divisor).
For example, suppose it is possible to return to the state in
{6, 8, 10, 12, ...} time steps. Then k = 2.
32
Markov Chains
Definition
Aperiodicity: The chain cannot get trapped in cycles.
In other words,
a state i is aperiodic if there exists n such that for all n' ≥ n,
$P(X_{n'} = i \mid X_0 = i) > 0$.
Definition
A Markov chain is aperiodic if every state is aperiodic.
33
Markov Chains
Example for periodic Markov chain:
Let
In this case
If we start the chain from (1,0) or (0,1), then the chain gets
trapped in a cycle; it doesn't forget its past.
It has a stationary distribution, but no limiting distribution!
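A sketch of such a periodic chain, assuming the two-state matrix that deterministically swaps the states (the slide's own matrix is not recoverable here).

```python
def step(mu, T):
    """One step on distributions: returns the row vector mu T."""
    n = len(T)
    return [sum(mu[i] * T[i][j] for i in range(n)) for j in range(n)]

# Assumed two-state chain: always jump to the other state (period 2).
T = [[0.0, 1.0],
     [1.0, 0.0]]

mu = [1.0, 0.0]
history = []
for _ in range(4):
    history.append(mu)
    mu = step(mu, T)
print(history)                 # the distribution keeps alternating

uniform = [0.5, 0.5]
print(step(uniform, T))        # the uniform vector is left unchanged by T
```

The distribution started at (1, 0) oscillates forever, while (0.5, 0.5) is stationary, so a stationary distribution exists without any limit.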
34
Reversible Markov chains
(Detailed Balance Property)
How can we find the limiting distribution of an irreducible and aperiodic
Markov chain?
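Definition: reversibility / detailed balance condition:
$\pi_i T_{ij} = \pi_j T_{ji}$ for all states i, j, where $T_{ij}$ is the probability of moving from state i to state j.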
Theorem:
A sufficient, but not necessary, condition to ensure that a particular π is
the desired invariant distribution of the Markov chain is the detailed
balance condition.
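For a finite state space the theorem can be checked in one line. The notation $T_{ij}$ for the probability of moving from state i to state j is an assumption made here; the idea is to sum the detailed balance condition over i.

```latex
\sum_{i} \pi_i T_{ij}
  = \sum_{i} \pi_j T_{ji}
  = \pi_j \sum_{i} T_{ji}
  = \pi_j
\quad\Longrightarrow\quad \pi T = \pi .
```

The last equality uses the fact that each row of T sums to 1.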
35
How fast can Markov chains forget
the past?
MCMC samplers are
irreducible and aperiodic Markov chains
that have the target distribution as the invariant distribution.
One way to ensure this is to satisfy the detailed balance condition.
It is also important to design samplers that converge quickly.
36
Spectral properties
Theorem: If π is a stationary distribution (πT = π), then
π is a left eigenvector of the matrix T with eigenvalue 1.
For an irreducible and aperiodic chain, the Perron-Frobenius theorem from linear algebra tells us that the
remaining eigenvalues have absolute value less than 1.
The second largest eigenvalue (in absolute value), therefore, determines the rate of
convergence of the chain, and should be as small as possible.
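A sketch of the role of the second eigenvalue, assuming a two-state chain with switching probabilities a and b (illustrative values). For this chain the eigenvalues are 1 and 1 − a − b, and the stationary distribution is (b, a) / (a + b).

```python
def step(mu, T):
    """One step on distributions: returns the row vector mu T."""
    n = len(T)
    return [sum(mu[i] * T[i][j] for i in range(n)) for j in range(n)]

a, b = 0.3, 0.1                       # assumed switching probabilities
T = [[1 - a, a],
     [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]       # stationary distribution of this chain
lam2 = 1 - a - b                      # second eigenvalue of T

mu = [1.0, 0.0]
errors = []
for _ in range(6):
    errors.append(abs(mu[0] - pi[0]))  # distance to stationarity in the first coordinate
    mu = step(mu, T)

ratios = [errors[k + 1] / errors[k] for k in range(5)]
print("second eigenvalue:", lam2)
print("error ratios per step:", ratios)
```

Each step shrinks the error by the factor |1 − a − b|, so a smaller second eigenvalue means faster forgetting of the initial state.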
37
The Hastings-Metropolis Algorithm
38
The Hastings-Metropolis Algorithm
Our goal:
Generate samples from the following discrete distribution:
$\pi_j = b_j / B$, j = 1, …, m, where the $b_j > 0$ are known and $B = \sum_{j=1}^{m} b_j$
We don’t know B!
The main idea is to construct a time-reversible Markov chain
with (π_1, …, π_m) as its limit distribution.
Later we will discuss what to do when the distribution is continuous.
39
The Hastings-Metropolis Algorithm
Let {1,2,…,m} be the state space of a Markov chain that we
can simulate.
No rejection: we use all X1, X2, …, Xn, …
40
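A sketch of a Hastings-Metropolis chain on a small discrete state space. Everything specific here is an assumption for illustration: the unnormalized weights `b`, the proposal chain (a symmetric ±1 random walk on a ring), and the standard acceptance probability α = min(1, b_y q(y, x) / (b_x q(x, y))), which reduces to min(1, b_y / b_x) for a symmetric proposal.

```python
import random
from collections import Counter

def hastings_metropolis(b, n_steps, rng):
    """Sample states 0..m-1 with probability proportional to b, without using sum(b)."""
    m = len(b)
    x = 0
    chain = []
    for _ in range(n_steps):
        y = (x + rng.choice((-1, 1))) % m   # propose a neighbor on the ring (symmetric q)
        alpha = min(1.0, b[y] / b[x])       # acceptance probability
        if rng.random() < alpha:
            x = y                           # accept the proposal
        chain.append(x)                     # on rejection the current state is repeated
    return chain

b = [1.0, 2.0, 3.0, 4.0, 10.0]              # assumed unnormalized weights
chain = hastings_metropolis(b, 200_000, random.Random(0))
counts = Counter(chain)
freqs = [counts[j] / len(chain) for j in range(len(b))]
print(freqs)                                # compare with b_j / sum(b)
```

Only ratios of the weights enter the acceptance step, which is why the normalizing constant is never needed.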
Example for Large State Space
Let {1,2,…,m} be the state space of a Markov chain that we
can simulate.
d-dimensional grid:
Max 2d possible movements at each grid point (linear in d)
Exponentially large state space in dimension d
41
The Hastings-Metropolis Algorithm
Theorem
Proof
42
The Hastings-Metropolis Algorithm
Observation
Proof:
Corollary
Theorem
43
The Hastings-Metropolis Algorithm
Theorem
Proof:
Note:
44
The Hastings-Metropolis Algorithm
It is not rejection sampling: we use all the samples!
45
Continuous Distributions
The same algorithm can be used for
continuous distributions as well.
In this case, the state space is continuous.
46
Experiment with HM
An application for continuous distributions
Bimodal target distribution: p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
q(x | x^(i)) = N(x^(i), 100), 5000 iterations
47
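A sketch of this experiment with a random-walk proposal N(x^(i), 100), i.e. standard deviation 10. The starting point x = 0, the seed, and the function name `metropolis` are assumptions; because the proposal is symmetric, the acceptance ratio reduces to p̃(y) / p̃(x).

```python
import math
import random

def p_tilde(x):
    """Unnormalized bimodal target from the slides."""
    return 0.3 * math.exp(-0.2 * x**2) + 0.7 * math.exp(-0.2 * (x - 10)**2)

def metropolis(n_iter, proposal_sd, x0, rng):
    """Random-walk Hastings-Metropolis chain targeting p_tilde."""
    x = x0
    chain = []
    for _ in range(n_iter):
        y = rng.gauss(x, proposal_sd)                      # propose y ~ N(x, proposal_sd^2)
        if rng.random() < min(1.0, p_tilde(y) / p_tilde(x)):
            x = y                                          # accept
        chain.append(x)                                    # every iteration is kept
    return chain

chain = metropolis(5000, 10.0, 0.0, random.Random(0))
print("fraction of samples in the mode near 10:", sum(x > 5 for x in chain) / len(chain))
```

Rerunning with a much smaller or much larger `proposal_sd` is a quick way to see how the proposal width affects mixing between the two modes.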
A good proposal distribution is important
48
HM on Combinatorial Sets
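Generate uniformly distributed samples from the set of permutations (x_1, …, x_n) of (1, …, n) for which $\sum_{j=1}^{n} j\, x_j > a$.
Let n = 3 and a = 12:
{1,2,3}: 1+4+9=14
{1,3,2}: 1+6+6=13
{2,3,1}: 2+6+3=11
{2,1,3}: 2+2+9=13
{3,1,2}: 3+2+6=11
{3,2,1}: 3+4+3=10
Only the permutations whose sum exceeds 12, namely {1,2,3}, {1,3,2} and {2,1,3}, belong to the set.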
49
HM on Combinatorial Sets
To define a simple Markov chain on this set, we need the concept of
neighboring elements (permutations):
Definition: Two permutations are neighbors if one results from
the interchange of two of the positions of the other:
(1,2,3,4) and (1,2,4,3) are neighbors.
(1,2,3,4) and (1,3,4,2) are not neighbors.
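A sketch of a Hastings-Metropolis chain built on this neighbor relation. It assumes the set consists of permutations with Σ j·x_j > a, that the proposal picks uniformly among the neighbors that lie in the set, and that a move from s to t is accepted with probability min(1, |N(s)| / |N(t)|), which makes the uniform distribution satisfy detailed balance. Function names are hypothetical.

```python
import itertools
import random
from collections import Counter

def in_set(perm, a):
    """Membership test: sum of j * x_j (positions counted from 1) exceeds a."""
    return sum(j * x for j, x in enumerate(perm, start=1)) > a

def neighbors(perm, a):
    """Permutations in the set obtained by swapping two positions of perm."""
    result = []
    for i, j in itertools.combinations(range(len(perm)), 2):
        cand = list(perm)
        cand[i], cand[j] = cand[j], cand[i]
        cand = tuple(cand)
        if in_set(cand, a):
            result.append(cand)
    return result

def sample_permutations(start, a, n_steps, rng):
    x = start
    chain = []
    for _ in range(n_steps):
        nx = neighbors(x, a)
        y = rng.choice(nx)                            # uniform proposal among neighbors of x
        ny = neighbors(y, a)
        if rng.random() < min(1.0, len(nx) / len(ny)):
            x = y                                     # accept the move
        chain.append(x)
    return chain

chain = sample_permutations((1, 2, 3), 12, 60_000, random.Random(0))
counts = Counter(chain)
print({perm: counts[perm] / len(chain) for perm in counts})
```

For n = 3 and a = 12 the chain should visit the three admissible permutations about equally often.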
50
HM on Combinatorial Sets
That is what we wanted!
51
Gibbs Sampling: The Problem
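Suppose that we can generate samples from the full conditional distributions $p(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)$, j = 1, …, n.
Our goal is to generate samples from the joint distribution $p(x_1, \ldots, x_n)$.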
52
Gibbs Sampling: Pseudo Code
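A sketch of a Gibbs sampler on an assumed example target: a bivariate normal with zero means, unit variances and correlation ρ. For that target the full conditionals are x1 | x2 ~ N(ρ x2, 1 − ρ²) and symmetrically for x2; the target and the function name are illustrative, not taken from the slide.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, rng):
    """Alternately resample each coordinate from its full conditional."""
    x1 = x2 = 0.0
    sd = math.sqrt(1 - rho**2)          # conditional standard deviation
    chain = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)    # draw x1 | x2
        x2 = rng.gauss(rho * x1, sd)    # draw x2 | x1 (uses the fresh x1)
        chain.append((x1, x2))
    return chain

rho = 0.8                               # assumed correlation
chain = gibbs_bivariate_normal(rho, 20_000, random.Random(0))
mean_product = sum(u * v for u, v in chain) / len(chain)
print("empirical E[x1 x2]:", mean_product)   # should be close to rho
```

Every draw is kept: there is no accept/reject step, only sampling from the conditionals.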
53
Gibbs Sampling: Theory
Consider the following HM sampler:
Let
and let
Observation: By construction, this HM sampler would sample from
We will prove that this HM sampler = Gibbs sampler.
54
Gibbs Sampling is a Special HM
Theorem: Gibbs sampling is a special case of HM with
Proof:
By definition:
55
Gibbs Sampling is a Special HM
Proof:
56
Gibbs Sampling in Practice
57
Simulated Annealing
58
Simulated Annealing
Goal: Find a maximizer $x^{*} = \arg\max_{x} V(x)$ of a function V.
59
Simulated Annealing
Theorem:
Proof:
60
Simulated Annealing
Main idea
Let λ be big.
Generate a Markov chain with limit distribution Pλ(x).
In the long run, the Markov chain will jump among the maximum points of
Pλ(x).
Introduce the relationship of neighboring vectors:
61
Simulated Annealing
Uniform distribution
Use Hastings-Metropolis sampling:
62
Simulated Annealing: Pseudo Code
With prob. α, accept the new state;
with prob. (1 − α), reject it and stay in the current state.
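A sketch of simulated annealing under several assumptions that are not in the slides: an objective V on the states 0..49 with a local maximum at 10 and a global maximum at 35, neighbors x ± 1 on a ring with a uniform proposal, the schedule λ_n = c·log(1 + n), and acceptance probability α = min(1, exp(λ (V(y) − V(x)))).

```python
import math
import random

def V(x):
    """Assumed objective: local maximum at x = 10, global maximum at x = 35."""
    return 3.0 * math.exp(-((x - 10) / 4.0) ** 2) + 5.0 * math.exp(-((x - 35) / 4.0) ** 2)

def simulated_annealing(n_states, n_steps, c, x0, rng):
    """Return the final state and the best state visited along the way."""
    x = best = x0
    for n in range(1, n_steps + 1):
        lam = c * math.log(1 + n)                  # assumed schedule; temperature = 1 / lam
        y = (x + rng.choice((-1, 1))) % n_states   # uniform proposal over the two neighbors
        alpha = min(1.0, math.exp(lam * (V(y) - V(x))))
        if rng.random() < alpha:
            x = y                                  # accept; uphill moves always have alpha = 1
        if V(x) > V(best):
            best = x
    return x, best

final, best = simulated_annealing(50, 20_000, 0.3, 0, random.Random(0))
print("final state:", final, "best state:", best, "V(best):", V(best))
```

A slowly growing λ lets the chain cross low-V regions early on; whether it ends at the global rather than the local maximum depends on the schedule and the run length.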
63
Simulated Annealing: Special case
In this special case:
With prob. α=1 accept the new state since
we increased V
64
Simulated Annealing: Problems
65
Simulated Annealing
Temperature = 1/λ
66
Simulated Annealing
67
Monte Carlo EM
E Step:
Monte Carlo EM:
Then the integral can be approximated! ☺
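A sketch of the approximation, in assumed notation (x for the observed data, z for the hidden variables, θ for the parameters; these symbols are not recoverable from the slide). The E step requires an expectation under p(z | x, θ_old); it can be replaced by an average over samples z^(i) drawn from that distribution, for example with an MCMC sampler.

```latex
Q(\theta \mid \theta^{\text{old}})
  = \int p(z \mid x, \theta^{\text{old}}) \, \log p(x, z \mid \theta) \, dz
  \approx \frac{1}{N} \sum_{i=1}^{N} \log p(x, z^{(i)} \mid \theta),
\qquad z^{(i)} \sim p(z \mid x, \theta^{\text{old}}).
```

The M step then maximizes the sample average over θ in place of the exact integral.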
68
Monte Carlo EM
69
Thanks for the Attention! ☺
70