
14.384 Time Series Analysis, Fall 2007


Professor Anna Mikusheva
Paul Schrimpf, scribe
November 29, 2007

Lecture 25

MCMC: Metropolis Hastings Algorithm

A good reference is Chib and Greenberg (The American Statistician 1995).


Recall that the key object in Bayesian econometrics is the posterior distribution:

p(θ|Y_T) = f(Y_T|θ)p(θ) / ∫ f(Y_T|θ̃)p(θ̃) dθ̃

It is often difficult to compute this distribution. In particular, the integral in the denominator is
difficult. So far, we have gotten around this by using conjugate priors – classes of distributions for
which we know the form of the posterior. Generally, it is easy to compute the numerator, f(Y_T|θ)p(θ),
but it is hard to compute the normalizing constant, the integral in the denominator, ∫ f(Y_T|θ̃)p(θ̃) dθ̃.
One approach is to try to compute this integral in some clever way. Another, more common approach is
Markov Chain Monte Carlo (MCMC). The goal here is to generate a random sample θ_1, ..., θ_N from
p(θ|Y_T). We can then use moments from this sample to approximate moments of the posterior
distribution. For example,

E(θ|Y_T) ≈ (1/N) Σ_{n=1}^{N} θ_n
There are a number of methods for generating random samples from an arbitrary distribution.

Acceptance-Rejection Method (AR)


The goal is to simulate ξ ∼ π(x). We can calculate, for each x, the value of a function f(x) such that
π(x) = f(x)/k. The constant k is unknown. We have some candidate pdf h(x) from which we can simulate
draws, and there is a known constant c such that

f(x) ≤ ch(x) for all x
We simulate draws from π(x) as follows:


1. Draw z ∼ h(x), u ∼ U[0, 1]
2. If u ≤ f(z)/(ch(z)), then ξ = z. Otherwise repeat (1)

The intuition of the procedure is the following: let v = uch(z) and imagine the joint distribution of
(v, z). It is uniform on the region under the graph of ch(z) and above the z-axis, i.e. uniform on
{(v, z) : z ∈ Spt(h), 0 ≤ v ≤ ch(z)}. It is then fairly easy to see that if we accept ξ = z, the joint
distribution of (v, ξ) is uniform with support {(v, ξ) : ξ ∈ Spt(π), 0 ≤ v ≤ f(ξ)}. Then (for the same
reason that h(z) is the marginal density of (v, z)), the marginal density of ξ is f(ξ)/k = π(ξ). More
formally,

Cite as: Anna Mikusheva, course materials for 14.384 Time Series Analysis, Fall 2007.
MIT OpenCourseWare ([Link] Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Proof. Let ρ be the probability of rejecting a single draw. Then,

P(ξ ≤ x) = P(z_1 ≤ x, u_1 ≤ f(z_1)/(ch(z_1))) (1 + ρ + ρ² + ...)
         = (1/(1−ρ)) P(z_1 ≤ x, u_1 ≤ f(z_1)/(ch(z_1)))
         = (1/(1−ρ)) E_z[ P(u ≤ f(z)/(ch(z)) | z) 1{z ≤ x} ]     (iterated expectations)
         = (1/(1−ρ)) ∫_{−∞}^{x} (f(z)/(ch(z))) h(z) dz
         = ∫_{−∞}^{x} f(z)/(c(1−ρ)) dz     (f(z)/(c(1−ρ)) is a density, so c(1−ρ) = k)
         = ∫_{−∞}^{x} π(z) dz

A major drawback of this method is that it may lead us to reject many draws before we finally accept
one. This can make the procedure inefficient. If we choose c and h(z) poorly, then f(z)/(ch(z)) could be
very small for many z. It will be especially difficult to choose a good c and h(·) when we do not know
much about π(z).
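The steps above can be sketched in a few lines of code. This is a minimal illustration, not part of the
original notes: the target f, candidate h, and constant c below are assumptions chosen for the example
(a standard normal target up to scale, with a Laplace candidate).

```python
import random
import math

def accept_reject(f, h_sample, h_pdf, c, n):
    """Draw n samples from the density proportional to f via acceptance-rejection.

    f        : unnormalized target, f(x) = k * pi(x) with k unknown
    h_sample : function returning a draw z ~ h
    h_pdf    : candidate density h(x)
    c        : known constant with f(x) <= c * h(x) for all x
    """
    draws = []
    while len(draws) < n:
        z = h_sample()                    # step 1: candidate draw z ~ h
        u = random.random()               # step 1: u ~ U[0, 1]
        if u <= f(z) / (c * h_pdf(z)):    # step 2: accept with prob f(z)/(c h(z))
            draws.append(z)
    return draws

# Illustrative example: f(x) = exp(-x^2/2), so pi is standard normal with
# k = sqrt(2*pi). Candidate: Laplace(0,1), h(x) = exp(-|x|)/2. Then
# f(x)/h(x) = 2*exp(|x| - x^2/2) <= 2*exp(1/2), so c = 2*exp(1/2) works.
f = lambda x: math.exp(-x * x / 2)
h_pdf = lambda x: 0.5 * math.exp(-abs(x))
h_sample = lambda: random.expovariate(1.0) * random.choice([-1, 1])
c = 2 * math.exp(0.5)

random.seed(0)
sample = accept_reject(f, h_sample, h_pdf, c, 20000)
print(sum(sample) / len(sample))  # should be close to 0, the mean of pi
```

Note that the closer ch(x) hugs f(x), the fewer rejections; here the expected acceptance probability is
k/c, so a loose bound c directly wastes draws.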

Markov Chains
A Markov chain is a stochastic process where the distribution of x_{t+1} depends only on x_t:
P(x_{t+1} ∈ A | x_t, x_{t−1}, ...) = P(x_{t+1} ∈ A | x_t) for all A.
Definition 1. A transition kernel is a function, P (x, A), which is a probability measure in the second
argument. It gives the probability of moving from x into the set A.
We want to study the behavior of a sequence of draws x_1 → x_2 → ... where we move around according
to a transition kernel. Suppose the distribution of x_k is π*; then the distribution of y = x_{k+1}
satisfies

π̃(y) dy = ∫_ℝ π*(x) P(x, dy) dx

Definition 2. π* is an invariant measure (with respect to the transition kernel P(x, A)) if π̃ = π*.


Under some regularity conditions, the distribution of a Markov chain converges to its unique invariant
distribution.
In MCMC the goal is to simulate a draw from π. We need to find a transition kernel P(x, dy) such that
π is its invariant measure. Let us suppose that π is continuous. We will consider the class of kernels

P(x, dy) = p(x, y) dy + r(x) δ_x(dy)

where δ_x is the point mass at x; i.e. we stay at x with probability r(x), and otherwise y is
distributed according to a pdf proportional to p(x, y). Note that p(x, y) is not exactly a density
because it does not integrate to 1: ∫ P(x, dy) = 1 = ∫ p(x, y) dy + r(x), so ∫ p(x, y) dy ≤ 1.
Definition 3. A transition kernel is reversible if π(x)p(x, y) = π(y)p(y, x)
Theorem 4. If a transition kernel is reversible, then π is invariant.
There are more general conditions under which a Markov chain converges. Generally, if the transition
kernel is irreducible (it can reach any point from any other point) and aperiodic (not periodic, i.e.
the greatest common divisor of {n : y can be reached from x in n steps} is 1), then the chain converges
to an invariant distribution.
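Theorem 4 is easy to verify numerically for a finite-state chain. The sketch below (not from the notes;
the three-state target pi and the symmetric proposal Q are illustrative assumptions) builds a kernel
satisfying detailed balance π(x)p(x, y) = π(y)p(y, x) and checks that π is then invariant, i.e. πP = π.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])          # illustrative target distribution

# Symmetric proposal plus a Metropolis-style acceptance step gives a
# reversible kernel; r(i) collects the leftover "stay put" probability.
Q = np.full((3, 3), 1.0 / 3.0)          # symmetric proposal
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = Q[i, j] * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()          # r(i): probability of staying at i

# Detailed balance: pi[i] * P[i, j] == pi[j] * P[j, i] for all i, j
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)
# Hence pi is invariant: pi P = pi
assert np.allclose(pi @ P, pi)
print(pi @ P)                           # prints a vector equal to pi
```

The point of the proof of Theorem 4 is visible in `flows`: reversibility makes the probability flow
from i to j equal the flow from j to i, so total mass at each state is unchanged by one step.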


Metropolis-Hastings
Suppose we have a Markov chain in state x. We want to simulate a draw from a transition kernel p(x, y)
with invariant measure π, but we do not know the form of p(x, y). We do know how to compute a function
proportional to π, f(x) = kπ(x). Assume that we can draw y ∼ q(x, y), a pdf with respect to y (so
∫ q(x, y) dy = 1). Consider using this q as a transition kernel. Notice that if

π(x)q(x, y) > π(y)q(y, x)

then we would move from x to y too often. This suggests that rather than always moving to the new y we
draw, we should only move with some probability, α(x, y). If we construct α(x, y) such that

π(x)q(x, y)α(x, y) = π(y)q(y, x)α(y, x)

then we will have a reversible transition kernel with invariant measure π. We can take:
α(x, y) = min{1, π(y)q(y, x) / (π(x)q(x, y))}

We can calculate α(x, y) because although we do not know π(x), we do know f(x) = kπ(x), so we can
compute the ratio. In summary, the Metropolis-Hastings algorithm is: given x_j, we move to x_{j+1} by

1. Generate a draw, y, from q(x_j, ·)
2. Calculate α(x_j, y)
3. Draw u ∼ U[0, 1]
4. If u < α(x_j, y), then x_{j+1} = y. Otherwise x_{j+1} = x_j
Then the marginal distribution of x_j will converge to π. In practice, we begin the chain at an
arbitrary x_0, run the algorithm many, say M, times, then use the last N < M draws as a sample from π.
Note that although the marginal distribution of each x_j is (approximately) π, the x_j are
autocorrelated. This is not a problem for computing moments from the draws (although the higher the
autocorrelation, the more draws we need to get the same accuracy), but if we want to put standard
errors on these moments, we need to take the autocorrelation into account.
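The four steps above can be sketched as follows, for the common random-walk special case where q is
symmetric and the q-ratio cancels out of α. This is an illustrative sketch, not the notes' own code;
the target f, starting point, step size, and burn-in fraction are assumptions chosen for the example.

```python
import random
import math

def metropolis_hastings(f, x0, step, m):
    """Random-walk Metropolis-Hastings for a target known up to scale.

    f    : unnormalized target density, f(x) = k * pi(x)
    x0   : arbitrary starting point
    step : standard deviation of the N(0, step^2) proposal increment
    m    : total number of iterations M
    """
    x = x0
    chain = []
    for _ in range(m):
        y = x + random.gauss(0.0, step)   # 1. draw y ~ q(x, .)
        alpha = min(1.0, f(y) / f(x))     # 2. q symmetric, so the q-ratio cancels
        u = random.random()               # 3. u ~ U[0, 1]
        if u < alpha:                     # 4. accept, else stay at x
            x = y
        chain.append(x)
    return chain

# Illustrative target: pi(x) proportional to exp(-x^2/2) (standard normal).
# Start far from the mode and discard the first half of the chain as burn-in.
random.seed(1)
f = lambda x: math.exp(-x * x / 2)
draws = metropolis_hastings(f, x0=5.0, step=1.0, m=40000)
kept = draws[20000:]
print(sum(kept) / len(kept))  # should be close to E(x) = 0
```

Because only the ratio f(y)/f(x) enters, the unknown normalizing constant k never needs to be
computed, which is exactly what makes the method usable for posteriors.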

Choice of q()
• Random walk chain: q(x, y) = q1(y − x), i.e. y = x + ε, ε ∼ q1. This can be a nice choice because
if q1 is symmetric, q1(z) = q1(−z), then q(y, x)/q(x, y) drops out of α(x, y). Popular choices of q1
are normal and U[−a, a]. Note that there is a tradeoff between step size in the chain and rejection
probability when choosing σ² = Eε². Choosing σ² too large will lead to many draws of y from low
probability areas (low π), and as a result we will reject lots of draws. Choosing σ² too small will
lead us to accept most draws, but not move very much, and we will have difficulty covering the whole
support of π. In either case, the autocorrelation in our draws will be very high and we'll need more
draws to get a good sample from π.
• Independence chain: q(x, y) = q1(y)
• If π(y) ∝ ψ(y)h(y) and we can sample from h, take q(x, y) = h(y). This also simplifies α(·).
• Autocorrelated chain: y = a + B(x − a) + ε with B < 0; this leads to negative autocorrelation in y.
The hope is that this offsets some of the positive autocorrelation inherent in the procedure.
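The step-size tradeoff for the random walk chain is easy to see numerically. The sketch below (an
illustration with assumed values, not from the notes) runs random-walk Metropolis on a standard normal
target and reports the fraction of accepted moves for a very small and a very large σ.

```python
import random
import math

def acceptance_rate(step, m=20000, seed=2):
    """Fraction of accepted moves in random-walk Metropolis on a standard
    normal target (f(x) = exp(-x^2/2), known up to scale)."""
    random.seed(seed)
    f = lambda x: math.exp(-x * x / 2)
    x, accepted = 0.0, 0
    for _ in range(m):
        y = x + random.gauss(0.0, step)
        if random.random() < min(1.0, f(y) / f(x)):
            x = y
            accepted += 1
    return accepted / m

# Tiny steps are almost always accepted but barely move the chain;
# huge steps mostly propose y in low-pi regions and are rejected.
print(acceptance_rate(0.1))   # high acceptance, slow exploration
print(acceptance_rate(25.0))  # low acceptance
```

Neither extreme is good: in both cases the chain's autocorrelation is high, which is the point of the
tradeoff described in the first bullet.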

Cite as: Anna Mikusheva, course materials for 14.384 Time Series Analysis, Fall 2007.
MIT OpenCourseWare ([Link] Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].