Monte Carlo Methods for Statistical Inference
Sinan Yıldırım
August 4, 2021
Contents

1 Introduction
  1.1 Sample averages
  1.2 Monte Carlo: Generating your own samples
    1.2.1 Justification of Monte Carlo
    1.2.2 Toy example: Buffon's needle
    1.2.3 The need for more sophisticated methods
4 Bayesian Inference
  4.1 Conditional probabilities
  4.2 Deriving posterior distributions
    4.2.1 A note on future notational simplifications
    4.2.2 Conjugate priors
  4.3 Quantities of interest in Bayesian inference
    4.3.1 Posterior mean
    4.3.2 Maximum a posteriori estimation
    4.3.3 Posterior predictive distribution
5 Markov Chain Monte Carlo
  5.1 Introduction
  5.2 Discrete time Markov chains
    5.2.1 Properties of Markov(η, M)
      Irreducibility
      Recurrence and Transience
      Invariant distribution
      Reversibility and detailed balance
      Ergodicity
  5.3 Metropolis-Hastings
    5.3.1 Toy example: MH for the normal distribution
  5.4 Gibbs sampling
    5.4.1 Metropolis within Gibbs
  A.2.3 Moments, expectation and variance
  A.2.4 More than one random variable
  A.3 Conditional probability and Bayes' rule
B Solutions
  B.1 Exercises in Chapter 2
  B.2 Exercises in Chapter 3
  B.3 Exercises in Chapter 4
  B.4 Exercises in Chapter 7
List of Figures

3.1 p_X(x) and p_{X,Y}(x, y) vs x for the problem in Example 3.4 with n = 10 and a = 2
3.2 Histograms for the estimate of the posterior mean using two different importance sampling methods as described in Example 3.4 with n = 10 and a = 2
3.3 Source localisation problem with three sensors and one source
3.4 Source localisation problem with three sensors and one source: the likelihood terms, prior, and posterior. The parameters and the variables are s_1 = (0, 2), s_2 = (−2, −1), s_3 = (1, −2), y_1 = 2, y_2 = 1.6, y_3 = 2.5, σ_x² = 100, and σ_y² = 1
4.1 Directed acyclic graph showing the (hierarchical) dependency structure for X, Y, Z, U
5.8 MH for parameters of N(z, s). σ_q² = 1, α = 5, β = 10, m = 0, κ² = 100
5.9 An example data sequence of length n = 100 generated from the Poisson changepoint model with parameters τ = 30, λ_1 = 10 and λ_2 = 5
5.10 MH for parameters of the Poisson changepoint model
5.11 MH for the source localisation problem
Chapter 1
Introduction
Summary: This chapter provides the motivation for Monte Carlo methods. We first discuss averaging, which is the core of Monte Carlo integration, and then the theoretical and practical justifications of Monte Carlo. The chapter ends with a toy example, Buffon's needle experiment.
Mean value of the distribution: First, we are asked to provide an estimate of the mean value, i.e. the expectation of X with respect to P, using the sample set X^(1:N). If P has a probability density function p(x), this expectation can be written as

    E_P(X) = ∫_X x p(x) dx.    (1.1)
(Here, the notation P(ϕ) is introduced for simplicity.) This time, we replace our estimator with one that averages the values of ϕ evaluated at the samples, ϕ(X^(i)), i = 1, ..., N, instead of the X^(i) themselves:
    E_P(ϕ(X)) ≈ (1/N) Σ_{i=1}^N ϕ(X^(i)).    (1.4)
It is easy to see that the second problem is just a simple generalisation of the first: Put
ϕ(X) = X and you will come back to the first problem, which was to estimate the expected
value of X. The function ϕ can correspond to another moment of interest, for example
ϕ(x) = x2 , or a specific function of interest, for example ϕ(x) = log x.
Probability of a set: Another special case of ϕ is seen when we are interested in the probability of a certain set A ⊆ X. How do we write this probability

    P(A) := P(X ∈ A)

as an expectation of a function with respect to P? For this, consider the indicator function I_A : X → {0, 1} such that

    I_A(x) = 1 if x ∈ A, and I_A(x) = 0 if x ∉ A.    (1.5)
Now, let us consider ϕ = I_A and write the expectation of this function with respect to P:

    E_P(I_A(X)) = ∫_X I_A(x) p(x) dx = ∫_A p(x) dx = P(X ∈ A),    (1.6)

where the second equality follows from the fact that the integrand equals p(x) for x ∈ A and 0 for x ∉ A. Therefore, we know what to do when X^(1), ..., X^(N) ∼ P are given i.i.d. and we want to estimate P(X ∈ A) = E_P(I_A(X)): we simply apply equation (1.4) to the function I_A(X):

    P(X ∈ A) ≈ (1/N) Σ_{i=1}^N I_A(X^(i)).    (1.7)
Notice that this will output a value that is guaranteed to be in [0, 1] (since the sum is at least 0 and at most N), so it is a valid probability estimate.
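As a quick illustration of (1.7), here is a minimal sketch in Python (the function names are ours, not the text's); it estimates P(X > 1) for X ∼ Exp(1), whose true value is e^{−1} ≈ 0.3679.

```python
import math
import random

random.seed(0)

def mc_probability(sampler, indicator, n):
    """Estimate P(X in A) by averaging the indicator over i.i.d. samples, as in (1.7)."""
    return sum(indicator(sampler()) for _ in range(n)) / n

# Toy check: X ~ Exp(1) and A = (1, infinity), so P(X in A) = exp(-1) ~ 0.3679.
est = mc_probability(lambda: random.expovariate(1.0), lambda x: x > 1.0, 100_000)
print(est)
```

The output is guaranteed to lie in [0, 1], exactly as the text observes.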
In the following, we will talk in general about the expectation (1.3) and its estimate
(1.4), after hopefully having convinced you that the other expectations and their estimates
are just special cases.
• You can generate (draw) i.i.d. samples from P, as many as you want.
• You cannot compute the integral in (1.3), or you can only compute it in a very, very long time - so long that you do not want to! Another way of saying this is that the integral is intractable.

What would you do to estimate (1.3) in that case? Of course, you would generate your own samples X^(1), ..., X^(N) from P, so that the problem reduces to the one in the previous section. This simple idea is the core of Monte Carlo methods. Once you generate samples from P, you do not need to deal with the integral in order to implement the estimate (1.4).
The term Monte Carlo was coined in the 1940s, see Metropolis and Ulam (1949) for a
first use of the term, and Metropolis (1987); Eckhardt (1987) for a historical review.
It is easy to show that P_MC^N(ϕ) is an unbiased estimator of P(ϕ) for any N ≥ 1:

    E[P_MC^N(ϕ)] = E[(1/N) Σ_{i=1}^N ϕ(X^(i))]
                 = (1/N) Σ_{i=1}^N E_P(ϕ(X^(i)))
                 = (1/N) N E_P(ϕ(X))
                 = E_P(ϕ(X)) = P(ϕ).

However, unbiasedness by itself is not enough. Fortunately, we have results on the convergence of the estimator and on its decreasing variance as N increases.
Law of large numbers: If |P(ϕ)| < ∞, the law of large numbers (e.g. Shiryaev (1995), p. 391) ensures almost sure (a.s.) convergence of P_MC^N(ϕ) to P(ϕ) as the number of i.i.d. samples tends to infinity:

    P_MC^N(ϕ) → P(ϕ) a.s., as N → ∞.
Central limit theorem: The variance of P_MC^N(ϕ) is given by

    V[P_MC^N(ϕ)] = (1/N²) Σ_{i=1}^N V_P[ϕ(X^(i))] = (1/N) V_P[ϕ(X)],

which indicates the improvement in accuracy with increasing N, provided that V_P[ϕ(X)] is finite. Also, if V_P[ϕ(X)] is finite, the distribution of the estimator is well behaved in the limit, which is ensured by the central limit theorem (e.g. Shiryaev (1995), p. 335):

    √N (P_MC^N(ϕ) − P(ϕ)) →_d N(0, V_P[ϕ(X)]) as N → ∞.
    d / sin θ < 1/2.    (1.9)
Try to verify this by observing the needles in Figure 1.1. The variables d and θ are independent and uniformly distributed in [0, 1/2] and [0, π/2], respectively, so that their joint probability density is p(d, θ) = 4/π on [0, 1/2] × [0, π/2].
Figure 1.2: The set A that corresponds to the needle crossing a line
Now, define the set A = {(d, θ) : d/ sin θ < 1/2} = {(d, θ) : d < sin θ/2}. The set A
corresponds to the area under the curve in Figure 1.2. Letting X = (d, θ), the required
probability is
    P(X ∈ A) = ∫∫_A p(r, θ) dr dθ
             = ∫_0^{π/2} dθ ∫_0^{sin(θ)/2} (4/π) dr
             = (2/π) ∫_0^{π/2} sin(θ) dθ
             = (2/π) [−cos(π/2) + cos(0)]
             = 2/π,    (1.10)
where the dummy variable r is used for d (just to avoid writing dd in the integral!).
Monte Carlo approximation: Suppose it is not our day and we cannot carry out the calculation in (1.10). In order to find P(X ∈ A), we decide to run a Monte Carlo experiment instead. Let X = (d, θ) and observe P(X ∈ A) = E(I_A(X)). The idea is to generate samples X^(i) = (d^(i), θ^(i)), i = 1, ..., N, where each sample is generated independently as d^(i) ∼ Unif(0, 1/2) and θ^(i) ∼ Unif(0, π/2), and to compute

    P(X ∈ A) ≈ (1/N) Σ_{i=1}^N I(d^(i) < sin(θ^(i))/2)    (1.12)

(where we have introduced another notation-wise use of the indicator function I). In words, for each sample X^(i) = (d^(i), θ^(i)) we check whether d^(i) < sin(θ^(i))/2 (in other words, we check whether X^(i) ∈ A) and we divide the number of samples satisfying this condition by the total number of samples N. Observe that this is the implementation of (1.7) for this problem. The samples (d^(i), θ^(i)) can be generated by throwing a needle on a table with parallel lines - or using a computer! Figure 1.3 shows the described Monte Carlo experiment performed with N = 100 throws; the d, θ values corresponding to these throws are also shown in Figure 1.3.
Estimating π: We have already stated that P(X ∈ A) is known for this problem and in
fact it is 2/π. We have also described a Monte Carlo experiment to estimate this value.
With a little modification, our estimate can be used to estimate π. Since we have

    π = 2 / P(X ∈ A),

we can approximate π by

    π ≈ 2N / Σ_{i=1}^N I(d^(i) < sin(θ^(i))/2) = 2 × (total number of dots) / (number of red dots).
This is a pretty fancy way of estimating π with a needle and a table! Figure 1.5 shows the estimated value versus the number of samples N. Again, we see improvement in the estimate as N increases.
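The whole experiment can be simulated in a few lines; the sketch below (Python; variable names ours) throws N computer-generated needles and reports both the estimate of 2/π and the resulting estimate of π.

```python
import math
import random

random.seed(1)

N = 100_000
crossings = 0
for _ in range(N):
    d = random.uniform(0.0, 0.5)              # distance to the nearest line
    theta = random.uniform(0.0, math.pi / 2)  # needle angle
    if d < math.sin(theta) / 2:               # the event X in A: the needle crosses a line
        crossings += 1

p_hat = crossings / N        # estimates P(X in A) = 2/pi ~ 0.6366
pi_hat = 2 * N / crossings   # estimates pi, as in the text
print(p_hat, pi_hat)
```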
Buffon's needle, N = 100. The estimated prob is 0.6700. True value is 0.6366.
Figure 1.3: Top: Buffon's needle experiment with 100 independent throws. Bottom: (d, θ) values of the needle throws
Buffon's needle, N = 1000. The estimated prob is 0.6590. True value is 0.6366.
Buffon's needle, N = 10000. The estimated prob is 0.6378. True value is 0.6366.
Figure 1.4: (d, θ) values of N = 1000 and N = 10000 independent needle throws
Figure 1.5: Approximating π with Buffon's needle experiment. The plot shows the value of the estimate versus the number of needle throws, N.
it is either too costly or impossible to perform exact sampling. That explains the vast amount of literature on sophisticated Monte Carlo methods that aim to generate approximate samples. In this course we will cover some of these methods; among them, importance sampling and Markov chain Monte Carlo methods are worth mentioning as early as here.
Chapter 2
Exact Sampling Methods
If the coefficients a and c are chosen carefully (e.g. relatively prime to M), x_n will be roughly uniformly distributed between 0 and M − 1 (and, after normalisation by M, between 0 and 1). By "roughly uniformly" we mean that the sequence of numbers x_n will pass many reasonable tests for randomness. One such test suite is the so-called DIEHARD battery of statistical tests, developed by George Marsaglia, for measuring the quality of a random number generator.
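A minimal linear congruential generator can be sketched as follows (Python; the particular constants a, c, m are the widely used "Numerical Recipes" parameters, chosen here purely for illustration).

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator: x_{n+1} = (a*x_n + c) mod m,
    normalised to [0, 1) by dividing by m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(seed=42)
u = [next(gen) for _ in range(5)]
print(u)  # five "roughly uniform" numbers in [0, 1)
```

Since c is odd and the other full-period conditions hold for these constants, the generator only repeats after 2^32 draws.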
A more recently proposed generator is the Mersenne Twister algorithm, by Matsumoto
and Nishimura, 1997. It has several desirable features such as a long period and being very
fast. Many public domain implementations of the algorithm exist and it is the preferred
random number generator for statistical simulations and Monte Carlo computations.
In the following, we assume that the computer can produce a sample

    U ∼ Unif(0, 1)

every time we ask it to do so. The crucial part is how to transform one or more copies of U such that the resulting number is distributed according to a particular distribution that we want to sample from. In a more general context, how can one exploit the ability of the computer to generate uniform random variables so that we can obtain random numbers from any desired distribution?
In the following we will see some exact sampling methods.
Remark 2.1. Define the set S(u) = {x ∈ X : F(x) ≥ u}, and let G(u) := inf S(u) (the generalised inverse of F). We can show that, by right-continuity of F, S(u) actually attains its infimum; that is, the minimum of S(u) exists, hence inf S(u) = min S(u) and S(u) = [G(u), ∞).¹

¹ Proof: If x < G(u), then x ∉ S(u) by definition. If x > G(u), then there exists x′ ∈ S(u) with x′ < x; since F is non-decreasing, F(x) ≥ F(x′) ≥ u, so x ∈ S(u). Finally, by the right-continuity of F, we have F(G(u)) = inf{F(y) : y > G(u)} ≥ u. Therefore G(u) ∈ S(u) and S(u) = [G(u), ∞).
[Figure: illustration of the generalised inverse; a level u is marked on the vertical axis and G(u) on the horizontal axis x]
If X is continuous with a pdf p(x) > 0 for all x ∈ X (i.e. F has no jumps and no flat parts in X), then F is strictly increasing on X, its inverse F^{−1} exists, and we simply have G(u) = F^{−1}(u).
The following Lemma enables the method of inversion.
Proof. Since S(u) = [G(u), ∞) (see Remark 2.1), we have x ≥ G(u) if and only if F (x) ≥ u.
Hence, P(X ≤ x) = P(G(U ) ≤ x) = P(U ≤ F (x)) = F (x).
Lemma 2.1 suggests that we can sample X ∼ P by first sampling U ∼ Unif(0, 1) and then transforming X = G(U). This approach is called the method of inversion. It was considered by Ulam prior to 1947 (Eckhardt, 1987), and some extensions of the method are provided by Robert and Casella (2004).
Proof. Since we have S(u) = [G(u), ∞), x ≥ G(u) implies F (x) ≥ u. Moreover, if x < G(u)
then F (x) < u by definition of G. By continuity of F , we have F (G(u)) = u, so F (x) ≤ u
if and only if x ≤ G(u). Hence P(F (X) ≤ u) = P(X ≤ G(u)) = F (G(u)) = u, and we
conclude that the cdf of F (X) is the cdf of Unif(0, 1).
[Figure: illustration of the generalised inverse for a continuous F; a level u is marked on the vertical axis and G(u) on the horizontal axis x]
Example 2.1. Suppose we want to sample X ∼ P = Exp(λ) from the exponential distribution with rate parameter λ > 0. The pdf of Exp(λ) is

    p(x) = λ e^{−λx} for x ≥ 0, and p(x) = 0 otherwise.

The cdf is

    u = F(x) = ∫_0^x λ e^{−λt} dt = 1 − e^{−λx} for x ≥ 0, and F(x) = 0 for x < 0.
Therefore, we have x = − log(1 − u)/λ. So, we can generate U ∼ Unif(0, 1) and transform
X = − log(1 − U )/λ ∼ Exp(λ). See Figure 2.1 for an illustration.
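In code, the inversion step for Exp(λ) is one line; a sketch (Python, names ours):

```python
import math
import random

random.seed(2)

def sample_exponential(lam):
    """Method of inversion for Exp(lam): X = -log(1 - U)/lam with U ~ Unif(0, 1)."""
    u = random.random()
    return -math.log(1.0 - u) / lam

lam = 2.0
samples = [sample_exponential(lam) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to 1/lam = 0.5
```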
Example 2.2. Suppose we want to sample X ∼ P = Geo(ρ) from the geometric distribution on X = N with success rate parameter ρ ∈ (0, 1) and pmf²

    p(x) = (1 − ρ)^x ρ, x = 0, 1, 2, ....

Making use of Σ_{i=0}^x α^i = (1 − α^{x+1})/(1 − α) with α = 1 − ρ, the cdf at the support points is given by

    F(x) = 1 − (1 − ρ)^{x+1}.

Given U = u sampled from Unif(0, 1), the rule in (2.2) implies

    1 − (1 − ρ)^x < u ≤ 1 − (1 − ρ)^{x+1},
² This distribution is used for the number of trials prior to the first success in a Bernoulli process with success rate ρ. Another convention is to take the support as 1, 2, ... rather than 0, 1, 2, ... and to interpret X as the number of trials up to and including the successful one; the pmf then changes to p(x) = (1 − ρ)^{x−1} ρ, x ≥ 1.
or, taking logarithms and dividing by log(1 − ρ) < 0,

    log(1 − u)/log(1 − ρ) − 1 ≤ x < log(1 − u)/log(1 − ρ),

so we set X = ⌈log(1 − U)/log(1 − ρ)⌉ − 1.
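The resulting sampler picks the unique integer in that interval; a sketch (Python, names ours):

```python
import math
import random

random.seed(3)

def sample_geometric(rho):
    """Inversion for Geo(rho) on {0, 1, 2, ...}: the unique integer x with
    log(1-u)/log(1-rho) - 1 <= x < log(1-u)/log(1-rho)."""
    u = random.random()
    return math.ceil(math.log(1.0 - u) / math.log(1.0 - rho)) - 1

rho = 0.3
samples = [sample_geometric(rho) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # E[X] = (1 - rho)/rho ~ 2.33
```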
Example 2.3. If we want to sample from X ∼ Unif(a, b), a < b, we can sample U ∼ Unif(0, 1) and use the transformation

    X = a + (b − a)U.    (2.3)
Transformation can also be used for more complicated situations than in Example 2.3.
Suppose we have an m-dimensional random variable X ∈ X ⊆ Rm with pdf pX (x) and we
apply a transform to X using an invertible function g : X → Y, where Y ⊆ Rm to obtain
Y = (Y1 , . . . , Ym ) = g(X1 , . . . , Xm )
Since g is invertible, we have X = g −1 (Y ). What is the pdf of Y , pY (y)? This density can
be found as follows: Define the Jacobian determinant (or simply Jacobian) of the inverse
transformation g^{−1} as

    J(y) = det ∂g^{−1}(y)/∂y.    (2.4)

The usual practice to ease the notation is to introduce the shorthand (y_1, ..., y_m) = g(x_1, ..., x_m) and write J(y) with implicit reference to g as

    J(y) = det ∂x/∂y = det ∂(x_1, ..., x_m)/∂(y_1, ..., y_m)
         = det [ ∂x_1/∂y_1 ... ∂x_1/∂y_m ; ... ; ∂x_m/∂y_1 ... ∂x_m/∂y_m ],
where

    p_Y(y) = p_X(g^{−1}(y)) |J(y)|

is the pdf of Y.
Change of variables can be useful when P is difficult to sample from using the method of inversion, but sampling X ∼ P can be performed via a certain transformation of random variables that are easier to generate, such as uniform random variables.
Example 2.4. We describe the Box-Muller method for generating random variables from
the standard normal (Gaussian) distribution N (0, 1). The pdf for N (µ, σ 2 ) is
    φ(x; µ, σ²) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.
The method of inversion is not an easy option to sample from N (0, 1) since the cdf of
N (0, 1) is not easy to invert. Instead we use transformation.
The Box-Muller method generates a pair of independent standard normal random variables X_1, X_2 ∼ N(0, 1) as follows: First we generate R ∼ Exp(1/2) and Θ ∼ Unif(0, 2π) independently, and set X_1 = √R cos Θ and X_2 = √R sin Θ. If we wanted to start off from uniform random numbers, we could consider generating U_1, U_2 ∼ Unif(0, 1) i.i.d. and setting R = −2 log(U_1) and Θ = 2πU_2, so that R and Θ are distributed as desired. In other words,

    X_1 = √(−2 log U_1) cos(2πU_2),    X_2 = √(−2 log U_1) sin(2πU_2).
One way to see why this works is to use change of variables. Note that the inverse transformation is r = x_1² + x_2² and θ = arctan(x_2/x_1) (taken in the appropriate quadrant). Then the Jacobian at (x_1, x_2) = (√r cos θ, √r sin θ) is

    det ∂(r, θ)/∂(x_1, x_2) = 2,

so that

    p_{X_1,X_2}(x_1, x_2) = p_{R,Θ}(r, θ) · 2 = (1/2) e^{−(x_1²+x_2²)/2} · (1/(2π)) · 2 = φ(x_1; 0, 1) φ(x_2; 0, 1),

which is the product of the pdf of N(0, 1) evaluated at x_1 and x_2. Therefore, we conclude that X_1, X_2 ∼ N(0, 1) i.i.d.
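A direct implementation of the two formulas above (Python sketch, names ours; the 1 − U trick only avoids log(0), since U and 1 − U have the same distribution):

```python
import math
import random

random.seed(4)

def box_muller():
    """One Box-Muller draw: two independent N(0,1) variables from two uniforms."""
    u1 = 1.0 - random.random()   # in (0, 1], so log(u1) is finite
    u2 = random.random()
    r = -2.0 * math.log(u1)      # R ~ Exp(1/2)
    theta = 2.0 * math.pi * u2   # Theta ~ Unif(0, 2*pi)
    return math.sqrt(r) * math.cos(theta), math.sqrt(r) * math.sin(theta)

pairs = [box_muller() for _ in range(50_000)]
xs = [x for x, _ in pairs] + [y for _, y in pairs]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs)
print(mean, var)  # close to 0 and 1
```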
The pdf of this distribution is (using the same letter as for the pdf of the univariate normal distribution)

    φ(x; µ, Σ) = |2πΣ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ^{−1} (x − µ) ),    (2.9)

where | · | stands for the determinant.
Suppose X = (X_1, ..., X_n)ᵀ ∼ N(µ, Σ) and we have the transformation

    Y = AX + η.

A linear transformation of a normal random variable is again normal, and the normal distribution is completely characterised by its mean and covariance. Therefore, we can work out the mean and the covariance of Y in order to identify its distribution.
E(Y ) = E(AX + η)
= AE(X) + η
= Aµ + η
Cov(Y ) = E([Y − E(Y )][Y − E(Y )]T )
= E([AX + η − (Aµ + η)][AX + η − (Aµ + η)]T )
= E(A(X − µ)(X − µ)T AT ) = ACov(X)AT
= AΣAT
Example 2.5. The above derivation suggests a way to generate an n-dimensional multivariate sample X ∼ N(µ, Σ). We can first generate i.i.d. standard normal random variables R_1, ..., R_n ∼ N(0, 1), so that R = (R_1, ..., R_n) ∼ N(0_n, I_n), where 0_n is the n × 1 vector of zeros and I_n is the identity matrix of size n. Then, we decompose Σ = AAᵀ using the Cholesky decomposition. Finally, we let X = AR + µ. Observe that the mean of X is A0_n + µ = µ and the covariance matrix of X is A I_n Aᵀ = AAᵀ = Σ, so we are done.
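The recipe in Example 2.5 can be sketched in the bivariate case as follows (Python, names ours; the 2×2 Cholesky factor is written out by hand so the example stays self-contained):

```python
import math
import random

random.seed(5)

def chol2(S):
    """Cholesky factor A (lower-triangular) of a 2x2 covariance S = A A^T."""
    a = math.sqrt(S[0][0])
    b = S[1][0] / a
    c = math.sqrt(S[1][1] - b * b)
    return [[a, 0.0], [b, c]]

def sample_mvn2(mu, S):
    """X = A R + mu with R ~ N(0, I_2), so X ~ N(mu, S)."""
    A = chol2(S)
    r = [random.gauss(0, 1), random.gauss(0, 1)]
    return [mu[0] + A[0][0] * r[0],
            mu[1] + A[1][0] * r[0] + A[1][1] * r[1]]

mu, S = [1.0, -1.0], [[2.0, 0.8], [0.8, 1.0]]
xs = [sample_mvn2(mu, S) for _ in range(100_000)]
m0 = sum(x[0] for x in xs) / len(xs)
cov01 = sum((x[0] - mu[0]) * (x[1] - mu[1]) for x in xs) / len(xs)
print(m0, cov01)  # close to 1.0 and 0.8
```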
2.2.3 Composition
Let Z ∼ Π be a random variable taking values in the set Z, where Π has a pdf or pmf denoted π(z). Suppose also that, given z, X | z ∼ P_z, where each P_z admits either a pmf or a pdf denoted p_z(x). Then the marginal distribution P is a mixture distribution and, in the presence of pdfs or pmfs, we have

    p(x) = ∫ p_z(x) π(z) dz  if π(z) is a pdf,  and  p(x) = Σ_z p_z(x) π(z)  if π(z) is a pmf.    (2.10)
Whether p(x) is a pmf or a pdf depends on whether p_z(x) is a pmf or a pdf. The integral/sum may be hard to evaluate, and the mixture distribution may be hard to sample from directly. But if we can easily sample from Π and from each P_z, then we can just

1. sample Z ∼ Π,
2. sample X ∼ P_Z, and
3. ignore Z, keeping only X.
The random number we produce in this way will be an exact sample from P , i.e. X ∼ P .
This is the method of composition. Ignoring Z is also called marginalisation, by which we
overcome the difficulty of dealing with the tough integral/sum in (2.10).
Example 2.6. The density of a mixture of Gaussian distributions with K components, with means and variances (µ_1, σ_1²), ..., (µ_K, σ_K²) and probability weights w_1, ..., w_K for its components (such that w_1 + ··· + w_K = 1), is given by

    p(x) = Σ_{k=1}^K w_k φ(x; µ_k, σ_k²).

To sample from p(x), we first sample the component number k with probability w_k (for example using the method of inversion), and given k, we sample X ∼ N(µ_k, σ_k²).
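As a sketch of this composition step (Python, names ours): pick the component by inversion on the weights, then draw from the chosen normal.

```python
import random

random.seed(6)

def sample_mixture(weights, means, sds):
    """Composition: pick component k with probability w_k (inversion on the
    discrete weights), then sample from N(mu_k, sd_k^2)."""
    u, acc = random.random(), 0.0
    for w, m, s in zip(weights, means, sds):
        acc += w
        if u <= acc:
            return random.gauss(m, s)
    return random.gauss(means[-1], sds[-1])  # guard against rounding

weights, means, sds = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
xs = [sample_mixture(weights, means, sds) for _ in range(100_000)]
mean = sum(xs) / len(xs)
print(mean)  # E[X] = 0.3*(-2) + 0.7*3 = 1.5
```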
Example 2.7. A sales company decides to reveal the demand D for a product over a month. However, for privacy reasons, it shares a noised version of D, which results in the shared value X. It is given that the distribution of the revealed demand X has the pdf

    p(x) = Σ_d (e^{−λ} λ^d / d!) · (1/(2b)) exp(−|x − d| / b).
We want to perform a Monte Carlo simulation for this data sharing process. How do we
sample X ∼ P ?
Although p(x) looks hard, observe that the first factor in the sum is the pmf of PO(λ) evaluated at d (which can be viewed as the demand), and the second factor is the pdf of Laplace(d, b) evaluated at x (which can be viewed as the noisy demand). Therefore, X can be generated by the method of composition as follows:

1. Sample D ∼ PO(λ),
2. Sample X | D ∼ Laplace(D, b).
It is an exercise for you to figure out how one can sample from the Poisson and Laplace
distributions.
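A possible implementation of this composition (Python, names ours; the Poisson and Laplace samplers below are one solution to that exercise - Knuth's multiplication method and inversion, respectively - so treat them as our choices rather than the text's):

```python
import math
import random

random.seed(7)

def sample_poisson(lam):
    """Knuth's method: count uniform multiplications until the product
    drops below exp(-lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def sample_laplace(loc, b):
    """Inversion for Laplace(loc, b) via a symmetrised exponential."""
    u = random.random() - 0.5
    return loc - b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def sample_noisy_demand(lam, b):
    d = sample_poisson(lam)       # 1. demand D ~ PO(lam)
    return sample_laplace(d, b)   # 2. revealed value X | D ~ Laplace(D, b)

xs = [sample_noisy_demand(lam=10.0, b=1.0) for _ in range(50_000)]
mean = sum(xs) / len(xs)
print(mean)  # E[X] = E[D] = 10, since the Laplace noise has mean 0
```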
The rejection sampling method for obtaining one sample from P can be implemented with
any q(x) and M > 0 that satisfy the conditions above as in Algorithm 2.1.
How quickly do we obtain a sample with this method? Noting that the pdf of X′ is p_{X′}(x) = q(x), the acceptance probability can be derived as

    P(Accept) = ∫ P(Accept | X′ = x) p_{X′}(x) dx = ∫ (p(x) / (M q(x))) q(x) dx = (1/M) ∫ p(x) dx = 1/M,    (2.11)
which is also the long-term proportion of the number of accepted samples over the number of trials. Therefore, taking q(x) as close to p(x) as possible (to avoid large p(x)/q(x) ratios) and taking M = sup_x p(x)/q(x) are sensible choices for making the acceptance probability P(Accept) as high as possible.
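The general procedure can be sketched as follows (Python, names ours); the toy check targets Beta(2, 1), with density p(x) = 2x on (0, 1), using a uniform proposal and M = 2, so the acceptance probability is 1/M = 1/2.

```python
import random

random.seed(8)

def rejection_sample(p, q_sample, q_pdf, M):
    """Draw one sample from density p using proposal q, assuming
    p(x) <= M q(x) for all x. Returns the sample and the trials used."""
    trials = 0
    while True:
        trials += 1
        x = q_sample()
        u = random.random()
        if u <= p(x) / (M * q_pdf(x)):
            return x, trials

# Target Beta(2, 1): p(x) = 2x on (0, 1); proposal Unif(0, 1); M = 2.
draws = [rejection_sample(lambda x: 2 * x,
                          random.random,
                          lambda x: 1.0,
                          M=2.0) for _ in range(20_000)]
mean = sum(x for x, _ in draws) / len(draws)
avg_trials = sum(t for _, t in draws) / len(draws)
print(mean, avg_trials)  # ~2/3 (mean of Beta(2,1)) and ~2 (that is, M)
```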
The validity of the rejection sampling method can be verified by considering the distribution of the accepted samples; using Bayes' theorem, one can show that an accepted sample is distributed exactly according to P.
Example 2.8. Suppose we want to sample X ∼ Γ(α, 1) with α > 1, where Γ(α, β) is the Gamma distribution with shape parameter α and scale parameter β. The density of Γ(α, 1) is

    p(x) = x^{α−1} e^{−x} / Γ(α), x > 0.
As possible instrumental distributions, consider the family of exponential distributions Q_λ = Exp(λ), 0 < λ < 1, with pdf q_λ(x) = λ e^{−λx}, x > 0. Recall that M has to satisfy p(x) ≤ M q(x) for all x ∈ X; therefore, given q_λ(x), a sensible choice for M_λ is M_λ = sup_x p(x)/q_λ(x), and we wish to use the λ which minimises the required M_λ. Given 0 < λ < 1, the ratio

    p(x)/q_λ(x) = x^{α−1} e^{(λ−1)x} / (λ Γ(α))
is maximised at x = (α − 1)/(1 − λ), so we have

    M_λ = ((α − 1)/(1 − λ))^{α−1} e^{−(α−1)} / (λ Γ(α)),

resulting in the acceptance probability (for a proposed value x)

    p(x) / (q_λ(x) M_λ) = (x(1 − λ)/(α − 1))^{α−1} e^{(λ−1)x + α − 1}.
Now, we have to minimise M_λ with respect to λ so that P(Accept) = 1/M_λ is maximised. M_λ is minimised at λ* = 1/α, yielding

    M* = α^α e^{−(α−1)} / Γ(α).
Overall, the rejection sampling algorithm we choose to sample from Γ(α, 1) is:

1. Sample X′ ∼ Exp(1/α) and U ∼ Unif(0, 1).
2. If U ≤ (X′/α)^{α−1} e^{(1/α−1)X′+α−1}, accept X = X′; else go to 1.
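The two steps above translate directly into code (Python sketch, names ours):

```python
import math
import random

random.seed(9)

def sample_gamma(alpha):
    """Rejection sampler for Gamma(alpha, 1), alpha > 1: propose
    X' ~ Exp(1/alpha) and accept with probability
    (x/alpha)^(alpha-1) * exp((1/alpha - 1)*x + alpha - 1)."""
    while True:
        x = random.expovariate(1.0 / alpha)   # Exp with rate 1/alpha
        u = random.random()
        accept = (x / alpha) ** (alpha - 1) * math.exp((1.0 / alpha - 1.0) * x + alpha - 1.0)
        if u <= accept:
            return x

alpha = 2.0
xs = [sample_gamma(alpha) for _ in range(50_000)]
mean = sum(xs) / len(xs)
print(mean)  # E[X] = alpha = 2
```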
Check Figure 2.3 for the roles of optimum choice for λ and M . Also, Figure 2.4 illustrates
the computational advantage of choosing λ optimally.
[Figure 2.3: Left panel, "rejection sampling for Γ(2, 1) with optimal choice for λ": p(x), M* q_{λ*}(x), and M q_{λ*}(x) when M = 1.5 M*. Right panel, "rejection sampling for Γ(2, 1) with different values of λ": envelopes for λ = 0.50, 0.10, 0.70.]
Figure 2.4: Rejection sampling for Γ(2, 1): Histograms of samples generated using λ = 0.5 (68061 samples out of 10⁵ trials) and λ = 0.01 (2664 samples out of 10⁵ trials).
Example 2.9. Sometimes we want to sample from truncated versions of well-known distributions, i.e. where X is constrained to an interval, with density proportional to the density of the original distribution on that interval. For example, take the truncated standard normal distribution N_a(0, 1) with density

    p(x) = φ(x; 0, 1) I(|x| ≤ a) / ∫_{−a}^{a} φ(y; 0, 1) dy    (2.15)
         = p̂(x) / Z_p,    (2.16)

where p̂(x) = φ(x; 0, 1) I(|x| ≤ a) and Z_p = ∫_{−a}^{a} φ(y; 0, 1) dy. We can perform rejection sampling using q(x) = φ(x; 0, 1) (that is, q̂(x) = q(x) and Z_q = 1). Since p̂(x)/φ(x; 0, 1) = I(|x| ≤ a) ≤ 1, we can choose M = 1. Since Z_q = 1, the acceptance probability is

    (1/M)(Z_p/Z_q) = ∫_{−a}^{a} φ(y; 0, 1) dy.

The rejection sampling method for this distribution reduces to sampling Y ∼ N(0, 1) and accepting X = Y if |Y| ≤ a, which is intuitive.
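This accept-if-inside scheme is only a few lines (Python sketch, names ours):

```python
import random

random.seed(10)

def sample_truncated_normal(a):
    """Rejection sampling for N(0,1) truncated to [-a, a]: propose
    Y ~ N(0,1) and accept if |Y| <= a (the M = 1 scheme from the text)."""
    while True:
        y = random.gauss(0.0, 1.0)
        if abs(y) <= a:
            return y

a = 1.0
xs = [sample_truncated_normal(a) for _ in range(50_000)]
mean = sum(xs) / len(xs)
print(mean)  # symmetric around 0
```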
Example 2.10. The unknown normalising constant issue mostly arises in Bayesian inference, when we want to sample from the posterior distribution: the posterior density of X given Y = y is proportional to the prior times the likelihood, p_X(x) p_{Y|X}(y|x), and the proportionality (normalising) constant is typically intractable.
Squeezing
The drawback of rejection sampling is that in practice a rejection-based procedure is usually not viable when X is high-dimensional, since P(Accept) gets smaller and more computation is required to evaluate acceptance probabilities as the dimension increases. In the literature there exist approaches to improve the computational efficiency of rejection sampling. For example, assuming the densities exist, when it is difficult to compute p(x), tests like u ≤ (1/M) p(x)/q(x) can be slow to evaluate. In this case, one may use a squeezing function s : X → [0, ∞) such that s(x)/q(x) is cheap to evaluate and s(x)/p(x) is tightly bounded from above by 1. For such an s, not only would u ≤ (1/M) s(x)/q(x) guarantee u ≤ (1/M) p(x)/q(x), hence acceptance, but also, if u ≤ (1/M) p(x)/q(x), then u ≤ (1/M) s(x)/q(x) would hold with high probability. Therefore, one can first check the cheap test u ≤ (1/M) s(x)/q(x), and evaluate the expensive ratio p(x)/q(x) only when that test fails.
Exercises
1. Use change of variables to show that X defined in (2.3) in Example 2.3 is distributed
from Unif(a, b).
3. Suggest a way to sample from Laplace(a, b) using uniform random numbers. (Hint:
Notice the similarity between the Laplace distribution and the exponential distribu-
tion.)
4. Show that the modified rejection sampling method described above for unnormalised densities is valid, i.e. the accepted sample satisfies X ∼ P, and that it has the acceptance probability Z_p/(Z_q M) as claimed. The derivation is similar to those in (2.11), (2.12).
5. Write your own function that takes a vector of non-negative numbers w = [w1 . . . wK ]
of any size and outputs a matrix X (of the specified size, size1 × size2) of i.i.d.
integers in {1, . . . , K}, each with probability proportional to wk (i.e. their sum may
not be normalised to 1). In MATLAB, your function should look something like [X] = randsamp(w, size1, size2).
6. Learn the polar rejection method, another method used to sample from N(0, 1). Write two different functions that produce as many i.i.d. standard normal random variables as specified (as an input to the function): one using the Box-Muller method and the other using the polar rejection method. Plot the histograms of 10⁶ samples that you obtain from each function; make sure nothing strange happens in your code. Compare the speeds of your functions. Which method is faster? What do you think is the reason?
7. Write your own function for generating a given number N of samples (specified as an
input argument) from a multivariate normal distribution N (µ, Σ) with given mean
vector µ and covariance matrix Σ as inputs.
8. Derive the rejection sampling method for Beta(a, b), a, b ≥ 1, using the uniform distribution as the instrumental distribution. Write a function that implements this method. Is it still possible to use the uniform distribution as Q when a < 1 or b < 1? Why or why not?
Chapter 3
Monte Carlo Estimation
To compute Monte Carlo estimates, we need i.i.d. samples from P, and in the previous chapter we covered some exact sampling methods for generating X^(i) ∼ P, i = 1, ..., N.
However, there are many cases where sampling X ∼ P is either impossible, too difficult, or wasteful. For example, rejection sampling uses only about a fraction 1/M of the generated random samples to construct an approximation to P: in order to obtain N samples, we need on average N·M iterations of rejection sampling. The number M can be very large, especially in high dimensions, making rejection sampling wasteful.
The idea of importance sampling follows from the importance sampling fundamental identity (Robert and Casella, 2004): We can rewrite P(ϕ) as

    P(ϕ) = E_P(ϕ(X)) = ∫_X ϕ(x) p(x) dx = ∫_X ϕ(x) (p(x)/q(x)) q(x) dx = ∫_X ϕ(x) w(x) q(x) dx = E_Q(ϕ(X) w(X)) = Q(ϕw),

where w(x) := p(x)/q(x) and ϕw stands for the product of the functions ϕ and w. This identity can be used with a Q which is easy to sample from, which leads to the importance sampling method given in Algorithm 3.1.
    P_IS^N(ϕ) := (1/N) Σ_{i=1}^N ϕ(X^(i)) w(X^(i)).    (3.3)

The weights w(X^(i)) are known as the importance sampling weights. Note that P_IS^N(ϕ) is another plug-in estimator, but for a different distribution and function: namely, it is the plug-in estimator for Q(ϕw). Therefore the estimator in (3.3) is unbiased and justified by the strong law of large numbers and the central limit theorem, provided that Q(ϕw) = E_Q(ϕ(X)w(X)) and V_Q[w(X)ϕ(X)] are finite.
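A concrete sketch of (3.3) in Python (the choice of target N(0, 1), test function ϕ(x) = x², and proposal N(0, 2²) is ours, purely for illustration):

```python
import math
import random

random.seed(11)

# Target P = N(0, 1); estimate P(phi) with phi(x) = x^2, so P(phi) = 1,
# using proposal Q = N(0, 2^2) and weights w(x) = p(x)/q(x), as in (3.3).
def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

N = 100_000
total = 0.0
for _ in range(N):
    x = random.gauss(0.0, 2.0)                         # X^(i) ~ Q
    w = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 0.0, 2.0)  # importance weight
    total += w * x * x                                 # phi(x) * w(x)
est = total / N
print(est)  # close to E_P[X^2] = 1
```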
Example 3.1. Suppose we have two variables (X, Y) ∈ X × Y with joint pdf p_{X,Y}(x, y). As we recall, we can write the joint pdf as p_{X,Y}(x, y) = p_X(x) p_{Y|X}(y|x). In the Bayesian framework, where X is the unknown parameter and Y is the observed variable (or data), p_X(x) is called the prior density and is usually easy to sample from, and p_{Y|X}(y|x) is the conditional density of the data, or the likelihood, which is easy to compute.¹
¹ In fact, this is how one usually constructs the joint pdf in the Bayesian framework: First define the prior X ∼ p_X(x), then define the data likelihood Y | X = x ∼ p_{Y|X}(y|x), so that p_{X,Y}(x, y) is constructed as above. When the starting point is to define the prior and the likelihood, it is notationally convenient to define the marginal and conditional pdfs µ(x) := p_X(x) and g(y|x) := p_{Y|X}(y|x) and write p(x, y) = µ(x)g(y|x), p(x|y) ∝ µ(x)g(y|x), p(y) = ∫ µ(x)g(y|x) dx, etc.
However, we do not need to sample from p_X(x). In fact, we can use importance sampling with an importance density q(x):

    p_Y(y) ≈ (1/N) Σ_{i=1}^N (p_X(X^(i)) / q(X^(i))) p_{Y|X}(y|X^(i)),    X^(1), ..., X^(N) ∼ q(x).

Being able to approximate a marginal density such as p_Y(y) will play an important role later on, when we discuss sequential importance sampling methods.
This ratio diverges when k > 2, and unless ϕ(x)² balances it, the second moment Q_k(w_k² ϕ²) diverges. Therefore, let us confine k to k ∈ (0, 2). In that case, we can rewrite

    p(x)²/q_k(x) = (1/√(k(2 − k))) · √((2 − k)/(2πσ²)) e^{−(2−k)(x−µ)²/(2σ²)} = (1/√(k(2 − k))) q_{2−k}(x).

Therefore,

    Q_k(w_k² ϕ²) = (1/√(k(2 − k))) Q_{2−k}(ϕ²) = (1/√(k(2 − k))) E_{Q_{2−k}}(ϕ(X)²).
σ2
When ϕ(x) = x and µ = 0, Q2−k (ϕ2 ) = EQ2−k (X 2 ) = 2−k
. Therefore, we need to minimise
1 σ2
p = σ 2 (2 − k)−3/2 k −1/2 .
k(2 − k) 2 − k
The minimum is attained at k = 1/2 and is
$$\mathbb{V}_{Q_{1/2}}\left(P_{\mathrm{IS}}^{N}(\varphi)\right) = \frac{\sigma^2}{N} (2-k)^{-3/2} k^{-1/2} \Big|_{k=1/2} = 0.7698\, \frac{\sigma^2}{N}.$$
The variance of the plug-in estimator P^N_MC (ϕ) for P (ϕ) is σ²/N, which is larger!
Observe that
$$Q(w) = E_Q(w(X)) = \int \frac{\hat{p}(x)}{\hat{q}(x)}\, q(x)\, dx = \int \frac{p(x) Z_p}{q(x) Z_q}\, q(x)\, dx = Z_p / Z_q,$$
and
$$Q(w\varphi) = E_Q(w(X)\varphi(X)) = \int \frac{\hat{p}(x)}{\hat{q}(x)}\, \varphi(x)\, q(x)\, dx = \int \frac{p(x) Z_p}{q(x) Z_q}\, \varphi(x)\, q(x)\, dx = P(\varphi)\, Z_p / Z_q.$$
Therefore, we can write the importance sampling fundamental identity in terms of p̂ and
q̂ as
$$P(\varphi) = \frac{Q(\varphi w)}{Z_p / Z_q} = \frac{Q(w\varphi)}{Q(w)}.$$
The importance sampling method can be modified to approximate both the numerator, the
unnormalised estimate, and the denominator, the normalisation constant, by using Monte
Carlo. Sampling X^(1), . . . , X^(N) from Q, we have the approximation
$$P_{\mathrm{IS}}^{N}(\varphi) = \frac{\frac{1}{N}\sum_{i=1}^{N} \varphi(X^{(i)}) w(X^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)})} = \sum_{i=1}^{N} W^{(i)} \varphi(X^{(i)}), \tag{3.7}$$
where
$$W^{(i)} = \frac{w(X^{(i)})}{\sum_{j=1}^{N} w(X^{(j)})}$$
are called the normalised importance weights, as they sum to 1. The resulting method,
called self-normalised importance sampling, is given in Algorithm 3.2. Being the
ratio of two unbiased estimators, the self-normalised importance sampling estimator is
biased for finite N. However, its consistency and stability are provided by a strong law of
large numbers and a central limit theorem in Geweke (1989). In the same work, the variance
of the self-normalised importance sampling estimator is analysed and an approximation
is provided, which reveals that it can provide lower-variance estimates than the
unnormalised importance sampling method. The self-normalised version also has
the nice property of estimating a constant by the constant itself, unlike the unnormalised
method. Therefore, this method can be preferable to its unnormalised version
even when P and Q are known exactly, not only up to proportionality constants.
Self-normalised importance sampling is also called Bayesian importance sampling in
Geweke (1989), since in most Bayesian inference problems the normalising constant of the
posterior distribution is unknown.
3. for i = 1, . . . , N do
4. Set W^(i) = w(X^(i)) / Σ_{j=1}^{N} w(X^(j)).
$$P_{\mathrm{IS}}^{N}(\varphi) = \sum_{i=1}^{N} W^{(i)} \varphi(X^{(i)})$$
2. For i = 1, . . . , N, set W^(i) = w(X^(i)) / Σ_{j=1}^{N} w(X^(j)).
3. Approximate E(ϕ(X)|Y = y) ≈ Σ_{i=1}^{N} W^(i) ϕ(X^(i)).
If we choose q(x) = pX (x), i.e. the prior density, then the weight function reduces to the
likelihood, w(x) = pY |X (y|x). But this is not always a good idea, as we will see in the next example.
Example 3.4. Suppose we have an unknown mean parameter X ∈ R whose prior dis-
tribution is represented by X ∼ N (µ, σ²). Conditional on X = x, n data samples
Y = (Y₁, . . . , Yₙ) ∈ Rⁿ are generated independently:
$$Y_1, \ldots, Y_n \,|\, X = x \overset{\text{i.i.d.}}{\sim} \text{Unif}(x - a, x + a).$$
We want to estimate the posterior mean of X given Y = y = (y₁, . . . , yₙ), i.e. E(X|Y =
y) = ∫ p_{X|Y}(x|y) x dx, where
$$p_{X|Y}(x|y) \propto p_X(x)\, p_{Y|X}(y|x).$$
The prior density and likelihood are p_X(x) = φ(x; µ, σ²) and p_{Y|X}(y|x) = ∏_{i=1}^{n} (1/(2a)) I_{(x−a,x+a)}(y_i),
so the posterior distribution can be written as
$$p_{X|Y}(x|y) \propto \phi(x; \mu, \sigma^2)\, \frac{1}{(2a)^n} \prod_{i=1}^{n} I_{(x-a, x+a)}(y_i).$$
Densities pX (x) and pX,Y (x, y) versus x for a fixed Y = y = (y1 , . . . , yn ) with n = 10
generated from the marginal distribution of Y with a = 2, µ = 0, and σ 2 = 10 are given in
Figure 3.1. Note that the second plot is proportional to the posterior density.
We can use self-normalised importance sampling to estimate E(X|Y = y). The choice
of the importance density is critical here. Suppose we chose Q to be the prior distribution
of X, i.e. q(x) = φ(x; µ, σ²). This is a valid choice; however, if a is small and σ² is
relatively large, the resulting weight function
$$w(x) = \frac{1}{(2a)^n} \prod_{i=1}^{n} I_{(x-a, x+a)}(y_i)$$
is likely to be zero for most of the samples generated from Q and equal to 1/(2a)ⁿ for only a few
of them. This results in a high variance of the importance sampling estimator. What is
worse, it is possible for all the weights to be zero, in which case the denominator in (3.7) is
zero. Therefore the estimator is a poor one.
Let y_max = max_i y_i and y_min = min_i y_i. A careful inspection of p_{X|Y}(x|y) reveals that,
given y = (y₁, . . . , yₙ), X must be contained in (y_max − a, y_min + a). In other words,
$$x \in (y_{\max} - a, y_{\min} + a) \Leftrightarrow x - a < y_i < x + a, \quad \forall i = 1, \ldots, n.$$
Therefore, a better importance density does not waste its samples outside the interval (y_max −
a, y_min + a) and generates samples only in that interval. As an example, we can choose Q =
Unif(y_max − a, y_min + a). With that choice, the weight function is
$$w(x) = \begin{cases} \dfrac{\phi(x; \mu, \sigma^2)\, \frac{1}{(2a)^n}}{1/(2a + y_{\min} - y_{\max})}, & x \in (y_{\max} - a, y_{\min} + a) \\ 0, & \text{else} \end{cases}$$
Note that since we are using the self-normalised importance sampling estimator, and hence
we normalise the weights W^(i) = w(X^(i)) / Σ_{j=1}^{N} w(X^(j)), we do not need to calculate the
constant factor (2a + y_min − y_max)/(2a)ⁿ in the weights.
Figure 3.2 compares the importance sampling estimators with the two different im-
portance distributions mentioned above. The histograms are generated from 10000 Monte
Carlo runs (10000 independent estimates of the posterior mean) for each estimator. Observe
that the estimates obtained when the importance distribution is the prior are more
widespread, exhibiting a higher variance.
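A minimal sketch of the uniform-proposal estimator in this example (the data are generated inside the script with the same parameter values as in the figures; the constant factor in w is dropped, as noted above, so the weight is just the prior density):

```python
import math, random

# Synthetic data as in Example 3.4 (mu, sigma2, a, n as in the text's figures).
random.seed(0)
mu, sigma2, a, n = 0.0, 10.0, 2.0, 10
x_true = random.gauss(mu, math.sqrt(sigma2))
ys = [random.uniform(x_true - a, x_true + a) for _ in range(n)]
ymin, ymax = min(ys), max(ys)

def phi(x, m, var):
    # normal pdf
    return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Q = Unif(ymax - a, ymin + a); up to a constant, w(x) is just the prior density.
N = 10_000
xs = [random.uniform(ymax - a, ymin + a) for _ in range(N)]
ws = [phi(x, mu, sigma2) for x in xs]
post_mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
print(post_mean)  # lies inside (ymax - a, ymin + a)
```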
Figure 3.1: pX (x) and pX,Y (x, y) vs x for the problem in Example 3.4 with n = 10 and a = 2.
Figure 3.2: Histograms for the estimate of the posterior mean using the two different impor-
tance sampling methods described in Example 3.4 with n = 10 and a = 2. Left panel
(importance distribution is the prior): variance 0.00089. Right panel (importance distribution
is uniform): variance 0.00002.
Exercises
1. Consider Example 3.2, where importance sampling for N (µ, σ²) is discussed.
• This time, take µ = 0 and ϕ(x) = x². Find the optimum k for this ϕ and
calculate the gain due to variance reduction compared to the plug-in estimator
P^N_MC (ϕ).
• Implement importance sampling (e.g., in MATLAB) for both ϕ(x) = x and
ϕ(x) = x², and verify that in each case the variance of the IS estimator is
lower than that of the plug-in estimator P^N_MC (ϕ). Verify also that the k = 1/2
estimator is inferior for calculating the second moment, and likewise the k = 1/3
estimator is inferior for the first moment.
2. This example is based on the Project Evaluation and Review Technique (PERT), a
project planning tool.2 Consider the software project described in Table 3.1 with
10 tasks (activities), indexed by j = 1, . . . , 10. The project is completed when all
of the tasks are completed. A task can begin only after all of its predecessors have
been completed. The project starts at time 0. Task j starts at time Sj, takes time Tj,
and ends at time Ej = Sj + Tj. Any task j with no predecessors (here only task 1)
starts at Sj = 0. The start time for a task with predecessors is the maximum of the
ending times of its predecessors. For example, S4 = E2 and S9 = max(E5, E6, E7).
The project as a whole ends at time E10.
Table 3.1: PERT: Project tasks, predecessor-successor relations, and mean durations
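The forward pass described above can be sketched as follows. Since Table 3.1 is not reproduced here, the predecessor graph and durations below are hypothetical placeholders (they do respect the stated relations S4 = E2 and S9 = max(E5, E6, E7)):

```python
# Hypothetical task graph standing in for Table 3.1; durations are placeholders.
preds = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [3], 7: [4],
         8: [4], 9: [5, 6, 7], 10: [8, 9]}
T = {j: 1.0 for j in preds}          # task durations (illustrative values)

E = {}                               # ending times E_j = S_j + T_j
for j in sorted(preds):              # tasks numbered so predecessors come first
    S = max((E[i] for i in preds[j]), default=0.0)   # start time S_j
    E[j] = S + T[j]
print(E[10])                         # completion time of the whole project
```

With random task durations Tj in place of the constants, repeating this forward pass gives Monte Carlo samples of the completion time E10.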
3. This is a simple example that illustrates the source localisation problem. We have a
source (or target) on the 2-D plane whose unknown location
X = (X(1), X(2)) ∈ R2
we wish to find. We collect distance measurements for the source using three sensors,
located at positions s₁, s₂, and s₃; see Figure 3.3. The measured distances Y =
(Y₁, Y₂, Y₃), however, are noisy, with independent normally distributed noises of
equal variance:3
$$Y_i \,|\, X = x \sim \mathcal{N}(\|x - s_i\|, \sigma_y^2), \quad i = 1, 2, 3,$$
where ‖·‖ denotes the Euclidean distance. Letting r_i = ‖x − s_i‖, the likelihood
evaluated at y = (y₁, y₂, y₃) given x can be written as
$$p_{Y|X}(y|x) = \prod_{i=1}^{3} \phi(y_i; r_i, \sigma_y^2) \tag{3.8}$$
3
In this way we allow negative distances, which makes the normal distribution not the most proper
choice. However, for the sake of computational ease, we overlook this in this example.
Figure 3.3: Source localisation problem with three sensors and one source.
We do not have much a priori information about X; therefore we take the prior
distribution of X to be the bivariate normal distribution with zero mean vector and a
diagonal covariance matrix, X ∼ N (0₂, σ_x² I₂), so that the density is
$$p_X(x) = \phi(x; 0_2, \sigma_x^2 I_2) = \frac{1}{2\pi\sigma_x^2} \exp\left\{-\frac{\|x\|^2}{2\sigma_x^2}\right\}.$$
See Figure 3.4 for an illustration of prior, likelihood, and posterior densities for this
problem.
Given noisy measurements Y = y = (y₁, y₂, y₃), we want to locate X, so we are
interested in the posterior mean vector
$$E(X|Y = y) = \int x\, p_{X|Y}(x|y)\, dx.$$
Write a function that takes y, positions of the sensors s1 , s2 , s3 , the prior and
likelihood variances σx2 and σy2 , and the number of samples N as inputs, implements
self-normalised importance sampling (why this version?) in order to approximate
E(X|Y = y) and outputs its estimate. Try your code with s1 = (0, 2), s2 = (−2, −1),
s3 = (1, −2), y1 = 2, y2 = 1.6, y3 = 2.5, σx2 = 100, and σy2 = 1 which are the values
used to generate the plots in Figure 3.4.
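One possible sketch of such a function is below (it uses the prior as the proposal, so the self-normalised weights reduce to the likelihood (3.8); treat it as a starting point rather than a reference solution):

```python
import math, random

random.seed(0)

def phi(v, m, var):
    # normal pdf
    return math.exp(-(v - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_mean(y, sensors, var_x, var_y, N):
    # Self-normalised IS with the prior N(0, var_x I) as proposal,
    # so the unnormalised weights reduce to the likelihood (3.8).
    num = [0.0, 0.0]
    den = 0.0
    for _ in range(N):
        x = (random.gauss(0, math.sqrt(var_x)), random.gauss(0, math.sqrt(var_x)))
        w = 1.0
        for s, yi in zip(sensors, y):
            r = math.hypot(x[0] - s[0], x[1] - s[1])   # distance ||x - s_i||
            w *= phi(yi, r, var_y)
        den += w
        num[0] += w * x[0]
        num[1] += w * x[1]
    return [v / den for v in num]

est = posterior_mean([2.0, 1.6, 2.5], [(0, 2), (-2, -1), (1, -2)], 100.0, 1.0, 50_000)
print(est)
```

The self-normalised version is needed because the posterior is known only up to the evidence p(y), which this model does not give in closed form.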
Figure 3.4: Source localisation problem with three sensors and one source: the likelihood
terms, the prior pX (x), and the posterior ∝ prior × likelihood. The parameters and the
variables are s₁ = (0, 2), s₂ = (−2, −1), s₃ = (1, −2), y₁ = 2, y₂ = 1.6, y₃ = 2.5,
σ_x² = 100, and σ_y² = 1.
Chapter 4
Bayesian Inference
Summary: In this chapter, we provide a brief introduction to Bayesian statistics. Some
quantities of interest that are calculated from the posterior distribution will be explained.
We will see some examples where one can find the exact form of the posterior distribution.
In particular, we will discuss conjugate priors that are useful for deriving tractable posterior
distributions. This chapter also introduces a relaxation in the notation to be adopted in the
later chapters.
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) P(B|A)}{P(B)} \tag{4.1}$$
Below are some examples where Bayes' rule is used to calculate posterior probabilities.
Example 4.1 (Conditional probabilities of sets). A pair of fair (unbiased) dice are
rolled independently. Let the outcomes be X1 and X2 .
• It is observed that the sum S = X₁ + X₂ = 8. What is the probability that the outcome
of at least one of the dice is 3?
We apply Bayes' rule. Define the sets A = {(X₁, X₂) : X₁ = 3 or X₂ = 3} and
B = {(X₁, X₂) : S = 8}, so that the desired probability is P(A|B) = P(A ∩ B)/P(B).
Here B = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} and A ∩ B = {(3, 5), (5, 3)}.
Since the dice are fair, every outcome is equiprobable with probability 1/36. Therefore,
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{5/36} = \frac{2}{5}.$$
• It is observed that the sum is even. What is the probability that the sum is smaller
than or equal to 4? Similarly, we define the sets A = {(X1 , X2 ) : X1 + X2 ≤ 4}.
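The conditional probability in the first part can be verified by brute-force enumeration of the 36 equiprobable outcomes:

```python
from fractions import Fraction

outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
B = [o for o in outcomes if o[0] + o[1] == 8]      # sum equals 8
A_and_B = [o for o in B if 3 in o]                 # at least one die shows 3
p = Fraction(len(A_and_B), len(B))
print(p)  # 2/5
```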
CHAPTER 4. BAYESIAN INFERENCE 40
$$p(x|y) = \frac{p(x)\, p(y|x)}{p(y)}$$
In the rest of this document, we will use the aforementioned notations interchangeably,
choosing the most suitable one depending on the context.
When using p for everything makes the statistical model prone to misunderstandings, a
nicer approach is to start with different letters such as f, g, h, µ, etc. for the different
pdf's or pmf's when constructing the joint distribution of the random variables of interest.
Example 4.3. Consider random variables X, Y, Z, U and assume that Y and Z are con-
ditionally independent given X, and U is independent from X given Y and Z; see Figure
4.1. In such a case, it may be convenient to construct the joint density by first declaring
the density for X, µ(x). Next, define the conditional densities f (y|x) and g(z|x) for Y
given X and Z given X. Finally define the conditional density for U given Y and Z,
h(u|y, z). Now, we can generically use the letter p to express any desired density regarding
these variables. To start with, the joint density is
$$p(x, y, z, u) = \mu(x) f(y|x) g(z|x) h(u|y, z).$$
Once we have the joint distribution p(x, y, z, u), we can derive anything else from it.
Figure 4.1: Directed acyclic graph showing the (hierarchical) dependency structure for
X, Y, Z, U.
or in terms of densities
p(x) = µ(x), p(z|x, y) = p(z|x) = g(z|x), p(u|x, y, z) = p(u|y, z) = h(u|y, z), etc.
In Bayesian statistics, the usual first step to build a statistical model is to decide on the
likelihood, i.e. the conditional distribution of the data given the unknown parameter. The
likelihood represents the model choice for the data and it should reflect the real stochastic
dynamics/phenomena of the data generation process as accurately as possible.
For convenience, it is common to choose a family of parametric distributions for the
data likelihood. With such choices x in p(y|x) becomes (some or all of the) parameters of
the chosen distribution. For example, X = (µ, σ 2 ) may be the unknown parameters of a
normal distribution from which the data samples Y₁, . . . , Yₙ are assumed to be distributed,
i.e. p(y_{1:n}|x) = ∏_{i=1}^{n} φ(y_i; µ, σ²). As another example, let X = α be the shape parameter
of the gamma distribution Γ(α, β) with β known, so that p(y_{1:n}|x) = ∏_{i=1}^{n} e^{−βy_i} y_i^{α−1} β^α / Γ(α).
Bayesian inference for the unknown parameter requires assigning a prior distribution
to it. Given the family of distributions for the likelihood, it is sometimes useful to consider
a certain family of distributions for the prior so that the posterior distribution
has the same form as the prior but with different parameters, i.e. the posterior
is in the same family of distributions as the prior. When this is the case,
the prior and posterior are called conjugate distributions, and the prior is called a
conjugate prior for the likelihood p(y|x).
$$p(x|y) \propto p(x) p(y|x) = \frac{x^{a-1}(1-x)^{b-1}}{B(a, b)}\, \frac{n!}{k!(n-k)!}\, x^k (1-x)^{n-k}, \tag{4.3}$$
where B(a, b) = ∫ x^{a−1}(1 − x)^{b−1} dx.
Before continuing with deriving the expression, first note the important remark that
our aim here is to recognise the form of the density of a parametric distribution for x in
(4.3). Therefore, we can get rid of any multiplicative term that does not depend on x. That
is why we could start with the joint density, p(x|y) ∝ p(x, y); in fact, we can simplify further:
$$p(x|y) \propto x^{a+k-1} (1-x)^{b+n-k-1}.$$
Since we observe that this has the form of a beta distribution, we can conclude that the
posterior distribution has to be a beta distribution,
$$X \,|\, Y = k \sim \text{Beta}(a_{x|y}, b_{x|y}),$$
where, from similarity with the prior distribution, we conclude that a_{x|y} = a + k and
b_{x|y} = b + n − k.
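The conjugate update can be checked numerically: the posterior mean obtained by normalising x^{a+k−1}(1 − x)^{b+n−k−1} on a grid should match the mean of Beta(a + k, b + n − k). The values of a, b, n, k below are arbitrary illustrative choices:

```python
a, b, n, k = 2.0, 3.0, 10, 7

def unnorm_post(x):
    # prior times likelihood, keeping only the x-dependent factors
    return x ** (a + k - 1) * (1 - x) ** (b + n - k - 1)

xs = [(i + 0.5) / 10_000 for i in range(10_000)]   # midpoint grid on (0, 1)
Z = sum(unnorm_post(x) for x in xs)
num_mean = sum(x * unnorm_post(x) for x in xs) / Z
conj_mean = (a + k) / (a + b + n)                  # mean of Beta(a + k, b + n - k)
print(num_mean, conj_mean)
```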
written as
$$\begin{aligned}
p(x|y) \propto p(x, y) &= p(x) p(y|x) \\
&= \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left\{-\frac{x^2}{2\sigma_x^2}\right\} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - x)^2}{2\sigma^2}\right\} \\
&\propto \exp\left\{-\frac{x^2}{2\sigma_x^2} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x)^2\right\} \\
&= \exp\left\{-\frac{x^2}{2\sigma_x^2} - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} y_i^2 + n x^2 - 2x \sum_{i=1}^{n} y_i\right)\right\} \\
&\propto \exp\left\{-\frac{x^2}{2\sigma_x^2} - \frac{1}{2\sigma^2}\left(n x^2 - 2x \sum_{i=1}^{n} y_i\right)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[x^2\left(\frac{1}{\sigma_x^2} + \frac{n}{\sigma^2}\right) - 2x\, \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i\right]\right\}
\end{aligned}$$
Since we observe that this has the form of a normal distribution, we can conclude that the
posterior distribution has to be a normal distribution,
$$X \,|\, Y_{1:n} = y_{1:n} \sim \mathcal{N}(\mu_{x|y}, \sigma_{x|y}^2)$$
for some µ_{x|y} and σ²_{x|y}. In order to find µ_{x|y} and σ²_{x|y}, compare the expression above with
$$\phi(x; m, \kappa^2) \propto \exp\left\{-\frac{1}{2}\left[\frac{x^2}{\kappa^2} - 2x\frac{m}{\kappa^2} + \frac{m^2}{\kappa^2}\right]\right\}.$$
Therefore, we must have
$$\sigma_{x|y}^2 = \left(\frac{1}{\sigma_x^2} + \frac{n}{\sigma^2}\right)^{-1}, \qquad \frac{\mu_{x|y}}{\sigma_{x|y}^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n} y_i \;\Rightarrow\; \mu_{x|y} = \left(\frac{1}{\sigma_x^2} + \frac{n}{\sigma^2}\right)^{-1} \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i$$
Example 4.6 (Variance of the normal distribution). Consider the scenario in the
previous example above but this time µ is known and the variance σ 2 is unknown. The
prior for X = σ 2 is chosen as the conjugate prior of the normal likelihood for the variance
parameter, i.e. the inverse gamma distribution IG(α, β) with shape and scale parameters
α and β, having the probability density function
$$p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{-\alpha-1} \exp\left\{-\frac{\beta}{x}\right\}.$$
The joint density can be written as
$$\begin{aligned}
p(x|y) \propto p(x, y) &= p(x) p(y|x) \\
&= \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{-\alpha-1} \exp\left\{-\frac{\beta}{x}\right\} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi x}} \exp\left\{-\frac{(y_i - \mu)^2}{2x}\right\} \\
&\propto x^{-\alpha - n/2 - 1} \exp\left\{-\frac{\frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2 + \beta}{x}\right\}
\end{aligned}$$
Comparing this expression to the density of p(x), we observe that they have the same form
and therefore,
X|Y1:n = y1:n ∼ IG(αx|y , βx|y )
for some α_{x|y} and β_{x|y}. From similarity, we can conclude
$$\alpha_{x|y} = \alpha + \frac{n}{2}, \qquad \beta_{x|y} = \beta + \frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2.$$
Example 4.7 (Multivariate normal distribution). Let the likelihood for Y given X be
chosen as Y |X = x ∼ N (Ax, R), and the prior for the unknown X be chosen as X ∼ N (m, S).
The posterior p(x|y) is
$$\begin{aligned}
p(x|y) \propto p(x, y) &= p(x) p(y|x) \\
&= \frac{1}{|2\pi S|^{1/2}} \exp\left\{-\frac{1}{2}(x - m)^T S^{-1} (x - m)\right\} \frac{1}{|2\pi R|^{1/2}} \exp\left\{-\frac{1}{2}(y - Ax)^T R^{-1} (y - Ax)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left(x^T S^{-1} x - 2 m^T S^{-1} x + x^T A^T R^{-1} A x - 2 y^T R^{-1} A x\right)\right\} \\
&= \exp\left\{-\frac{1}{2}\left[x^T (S^{-1} + A^T R^{-1} A) x - 2 (m^T S^{-1} + y^T R^{-1} A) x\right]\right\} \\
&\propto \phi(x; m_{x|y}, S_{x|y}) \propto \exp\left\{-\frac{1}{2}\left[x^T S_{x|y}^{-1} x - 2 m_{x|y}^T S_{x|y}^{-1} x\right]\right\}
\end{aligned}$$
where the posterior covariance is
$$S_{x|y} = (S^{-1} + A^T R^{-1} A)^{-1}$$
and the posterior mean is
$$m_{x|y} = S_{x|y} (m^T S^{-1} + y^T R^{-1} A)^T = S_{x|y} (S^{-1} m + A^T R^{-1} y).$$
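A small 2-D numerical instance of these formulas, with A = I chosen to keep the algebra simple (all numbers below are illustrative):

```python
# With A = I: S_post = (S^-1 + R^-1)^-1 and m_post = S_post (S^-1 m + R^-1 y).
def inv2(M):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add2(M, N):
    return [[M[i][j] + N[i][j] for j in range(2)] for i in range(2)]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

S = [[4.0, 0.0], [0.0, 4.0]]   # prior covariance
R = [[1.0, 0.0], [0.0, 1.0]]   # observation noise covariance
m = [0.0, 0.0]                 # prior mean
y = [2.0, -1.0]                # observation

Si, Ri = inv2(S), inv2(R)
S_post = inv2(add2(Si, Ri))
rhs = [a + b for a, b in zip(matvec(Si, m), matvec(Ri, y))]
m_post = matvec(S_post, rhs)
print(m_post, S_post)  # approximately [1.6, -0.8] and 0.8 I
```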
Computing the evidence: We saw that when conjugate priors are used for the prior,
then p(x) and p(x|y) belong to the same family, i.e. their pdf/pmf have the same form.
This is nice: since we know p(x), p(y|x), and p(x|y) exactly, we can compute the evidence
p(y) for a given y as
$$p(y) = \frac{p(x, y)}{p(x|y)} = \frac{p(x)\, p(y|x)}{p(x|y)}$$
(the identity holds for any x with p(x|y) > 0).
Example 4.8 (Success probability of the Binomial distribution - ctd). Consider
the setting in Example 4.4. Since we know pX|Y (x|y) and pX,Y (x, y) exactly, the evidence
pY (y) for y = k can be found as
$$p_Y(k) = \frac{\dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}\, \dfrac{n!}{k!(n-k)!}\, x^k (1-x)^{n-k}}{\dfrac{x^{\alpha+k-1}(1-x)^{\beta+n-k-1}}{B(\alpha+k, \beta+n-k)}} \tag{4.4}$$
$$= \frac{n!}{k!(n-k)!}\, \frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}, \tag{4.5}$$
which is the pmf, evaluated at k, of the Beta-Binomial distribution with trial parameter n
and shape parameters α and β.
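These Beta-Binomial probabilities can be evaluated stably via log-gamma functions; as a check, the pmf over k = 0, . . . , n should sum to 1 (the values of n, α, β below are illustrative):

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    # (4.5): C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)
    return comb(n, k) * exp(log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta))

n, alpha, beta = 10, 2.0, 3.0
pmf = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
print(sum(pmf))  # should be 1 up to floating-point error
```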
where X̂(Y ) is the estimator for X and the expectation is taken with respect to the joint
distribution of X, Y .
In general, if we want to estimate ϕ(X) given Y , we can target the posterior mean of ϕ,
Z
E(ϕ(X)|Y = y) = p(x|y)ϕ(x)dx,
Although it has the nice statistical properties mentioned above, the posterior mean may
not always be a good choice. For example, suppose the posterior is a mixture of Gaussians
with pdf p(x|y) = 0.5φ(x; −10, 0.01) + 0.5φ(x; 10, 0.01). The posterior mean is 0, but the density
p(x|y) at 0 is almost 0 and the distribution has almost no mass around 0!
Note that this procedure is different from maximum likelihood estimation (MLE), which
yields the maximising argument of the likelihood,
$$\hat{x}_{\mathrm{MLE}}(y) = \arg\max_x p(y|x),$$
since in the MAP estimate there is an additional factor due to the prior p(x).
In many cases, Yn+1 is independent from Y1:n given X. This happens, for example,
when {Yi }i≥1 are i.i.d. given X, that is Yi |X = x ∼ p(y|x), i ≥ 1. In that case, the density
above reduces to
$$p(y_{n+1}|y_{1:n}) = \int p(y_{n+1}|x)\, p(x|y_{1:n})\, dx.$$
Note that this is equivalent to the expected value of the density of the new data point,
when the expectation is taken over the posterior distribution, i.e.
$$p(y_{n+1}|y_{1:n}) = E\left[\, p(y_{n+1}|X) \,\middle|\, Y_{1:n} = y_{1:n}\right].$$
Conjugate priors and posterior predictive density: We saw that when conjugate
priors are used for the prior, then p(x) and p(x|y) belong to the same family, i.e. their
pdf/pmf have the same form. This implies that, when Yi ’s are i.i.d. conditional on X, the
posterior predictive density p(y_{n+1}|y_{1:n}) has the same form as the marginal density of a
single sample,
$$p(y) = \int p(x)\, p(y|x)\, dx.$$
$$p_{Z|Y}(r|k) = \frac{m!}{r!(m-r)!}\, \frac{B(\alpha' + r, \beta' + m - r)}{B(\alpha', \beta')}.$$
Exercises
1. Consider the discrete random variables X ∈ {1, 2, 3} and Y ∈ {1, 2, 3, 4} whose joint
probabilities are given in Table 4.1. Compute the conditional pmf's pX|Y (x|y) and
pY |X (y|x) for all x = 1, 2, 3 and y = 1, 2, 3, 4.
2. Show that the gamma distribution is the conjugate prior of the exponential distri-
bution, i.e. if X ∼ Γ(α, β) and Y |X = x ∼ Exp(x), then X|Y = y ∼ Γ(α_{x|y}, β_{x|y})
for some α_{x|y} and β_{x|y}. Find α_{x|y} and β_{x|y} in terms of α, β, and y.
3. Prove Theorem 4.1 [Hint: write the estimator as X̂(Y ) = E(X|Y )+(X̂(Y )−E(X|Y ))
and consider conditional expectation of the MSE given Y = y first. You should
conclude that for any y, X̂(y) − E(X|Y = y) should be zero.]
4. Suppose we observe a noisy sinusoid with period T and unknown amplitude X for n
steps: Yₜ|X = x ∼ N (f (t; x), σ_y²) for t = 1, . . . , n, where f (t; x) = x sin(2πt/T ) is
the sinusoid. The prior for the amplitude is Gaussian: X ∼ N (0, σ_x²).
• Find p(y_{n+1}) and p(y_{n+1}|y_{1:n}). Compare their variances. What can you say about
the difference between the variances?
• Generate your own samples Y1:n up to time n = 100, with period T = 40,
σx2 = 100, σy2 = 10. Calculate p(x|y1:n ); plot p(yn+1 ) and p(yn+1 |y1:n ) on the
same axis.
Chapter 5
5.1 Introduction
Remark 5.1 (Change of notation). So far we have used P and p to denote the distri-
bution and its pdf/pmf we are ultimately interested in. We will make a change of notation
here, and denote the distribution as well as its pdf/pmf as π. This change of notation is
necessary since p will be used generically to denote the pdf/pmf of various distributions.
We have already discussed the difficulties of generating a large number of i.i.d. samples
from π. One alternative was importance sampling which involved weighting every gener-
ated sample in order not to waste it, but it has its own drawbacks mostly due to issues
related to controlling variance. Another alternative is to use Markov chain Monte Carlo
(MCMC) methods (Metropolis et al., 1953; Hastings, 1970; Gilks et al., 1996; Robert and
Casella, 2004). These methods are based on the design of a suitable ergodic Markov chain
whose stationary distribution is π. The idea is that if one simulates such a Markov chain,
after a long enough time the samples of the Markov chain will be approximately distributed
according to π. Although the samples generated from the Markov chain are not i.i.d.,
their use is justified by convergence results for dependent random variables in the litera-
ture. The first examples of MCMC can be found in Metropolis et al. (1953) and Hastings (1970),
and book-length reviews are available in Gilks et al. (1996) and Robert and Casella (2004).
CHAPTER 5. MARKOV CHAIN MONTE CARLO 52
by the relation of Markov chains to the topics covered in the course. For more details one
can see Meyn and Tweedie (2009) or Shiryaev (1995); more closely related introductions to Monte
Carlo methods are presented in Robert and Casella (2004, Chapter 6), Cappé et al. (2005,
Chapter 14), Tierney (1994), and Gilks et al. (1996, Chapter 4).
Definition 5.1 (Markov chain). A stochastic process {Xₙ}ₙ≥₁ on X is called a Markov
chain if its probability law is defined from the initial distribution η(x) and a sequence
of Markov transition (or transition, state transition) kernels (or probabilities, densities)
{Mₙ(x′|x)}ₙ≥₂ by the finite dimensional joint distributions
$$p(x_1, \ldots, x_n) = \eta(x_1) \prod_{t=2}^{n} M_t(x_t | x_{t-1})$$
for all n ≥ 1.
The random variable Xₜ is called the state of the chain at time t, and X is called
the state-space of the chain. For uncountable X, we have a discrete-time continuous-
state Markov chain, and η(·) and Mₙ(·|xₙ₋₁) are pdf's1. Similarly, if X is countable (finite
or infinite), then the chain is a discrete-time discrete-state Markov chain and η(·) and
Mₙ(·|xₙ₋₁) are pmf's. Moreover, when X = {x₁, . . . , xₘ} is finite with m states, the
transition kernel can sometimes be expressed in terms of an m × m transition matrix with entries
Mₙ(i, j) = P(Xₙ = j|Xₙ₋₁ = i).
The definition of the Markov chain leads to the characteristic property of a Markov
chain, which is also referred to as the weak Markov property: conditioned on the whole
history, the state of the chain at time n depends only on the previous state at time n − 1.
From now on, we will consider time-homogenous Markov chains where Mn = M for all
n ≥ 2, and we will denote them as Markov(η, M ).
Example 5.1. The simplest examples of a Markov chain are those with a finite state-
space, say of size m. Then, the transition rule can be expressed by an m × m transition
probability matrix M , which in this example is the following
$$M = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 1/4 & 1/2 & 1/4 \\ 0 & 1 & 0 \end{bmatrix}$$
Also, the state-transition diagram of such a Markov chain with m = 3 states is given in
Figure 5.1, where the state-space is simply {1, 2, 3}.
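This chain is irreducible and aperiodic (state 1 has a self-loop), so iterating π ← πM from any initial distribution converges to the invariant distribution, which for this matrix is (1/4, 1/2, 1/4) (one can check πM = π directly). A quick sketch:

```python
M = [[0.5, 0.0, 0.5],
     [0.25, 0.5, 0.25],
     [0.0, 1.0, 0.0]]

def step(pi):
    # one application of pi -> pi M (vector-matrix product)
    return [sum(pi[i] * M[i][j] for i in range(3)) for j in range(3)]

pi = [1.0, 0.0, 0.0]        # an arbitrary initial distribution
for _ in range(200):
    pi = step(pi)
print(pi)  # converges to (1/4, 1/2, 1/4)
```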
1
In fact, there are exceptions where the transition kernels do not have a probability density; this is
indeed the case for the transition kernel of the Markov chain of the Metropolis-Hastings algorithm, which
we will see in Section 5.3. However, for the sake of brevity we ignore this technical issue and, with abuse
of notation, pretend that we always have a density for Mₙ(·|xₙ₋₁) for continuous states.
Example 5.2. Let X = Z be the set of integers, X₁ = 0, and for n > 1 define Xₙ as
Xₙ = Xₙ₋₁ + Vₙ,
where the steps Vₙ ∈ {−1, 1} are i.i.d. with P(Vₙ = 1) = p and P(Vₙ = −1) = q = 1 − p.
This is a random walk on the integers.
Example 5.3. Consider again the recursion
Xₙ = Xₙ₋₁ + Vₙ,
but this time Vₙ ∈ R with Vₙ ∼ N (0, σ²). This is a Gaussian random walk process on R
with normally distributed step sizes, and it is a time-homogenous discrete-time continuous-
state Markov chain with η(x₁) = δ₀(x₁) and
M (x′|x) = φ(x′; x, σ²).
Example 5.4. A generalisation of the Gaussian random walk is the first-order autoregres-
sive process, or shortly AR(1). Let X = R be the set of real numbers, X₁ = 0, and for n > 1
define Xₙ as
Xₙ = aXₙ₋₁ + Vₙ,
for some a ∈ R, with Vₙ ∈ R and Vₙ ∼ N (0, σ²). AR(1) is a time-homogenous discrete-
time continuous-state Markov chain with η(x₁) = δ₀(x₁) and
M (x′|x) = φ(x′; ax, σ²).
When |a| < 1, another choice for the initial distribution is X₁ ∼ N (0, σ²/(1 − a²)), which is the
stationary distribution of {Xₜ}ₜ≥₁. We will see more on stationary distributions below.
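The stationary variance σ²/(1 − a²) can be checked by simulation: starting the chain from N(0, σ²/(1 − a²)), the empirical variance along a long trajectory stays close to that value (the values of a and σ² below are illustrative):

```python
import math, random

random.seed(0)
a, sigma2 = 0.8, 1.0                      # illustrative values with |a| < 1
stat_var = sigma2 / (1 - a * a)           # stationary variance sigma^2 / (1 - a^2)

x = random.gauss(0.0, math.sqrt(stat_var))   # start from the stationary distribution
N = 200_000
xs = []
for _ in range(N):
    x = a * x + random.gauss(0.0, math.sqrt(sigma2))   # AR(1) recursion
    xs.append(x)
mean = sum(xs) / N
var = sum((v - mean) ** 2 for v in xs) / N
print(var, stat_var)  # the two should be close
```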
Irreducibility
In a discrete-state Markov chain, for two states x, x′ ∈ X, we say x leads to x′, denoted
x → x′, if the chain can travel from x to x′ with positive probability, i.e.
∃n > 1 s.t. P(Xₙ = x′|X₁ = x) > 0.
If both x → x′ and x′ → x, we say x and x′ communicate, denoted x ↔ x′.
A subset of states C ⊆ X is called a communicating class, or simply a class, if (i) all
x, x′ ∈ C communicate, and (ii) x ∈ C and x ↔ y together imply y ∈ C (that is, there is
no y ∉ C such that x ↔ y for some x ∈ C).
A communicating class is closed if x ∈ C and x → y imply y ∈ C, that is, there is no
path with positive probability from the states of the class to any state outside the class.
Definition 5.2 (Irreducibility). A discrete-state Markov chain is called irreducible if the
whole X is a communicating class, i.e. all its states communicate.
For general state-spaces, we need to generalise the concept of irreducibility to φ-
irreducibility.
Example 5.5. Figure 5.3 shows two chains that are not irreducible. In the first chain, the
communicating classes are {1, 2, 3} and {4, 5}; both are closed. In the second chain, the
communicating classes are {1, 2} and {3, 4}; the first one is closed and the second one is
not.
Figure 5.3: State transition diagrams of two Markov chains that are not irreducible.
where we have assumed that {Xt }t≥1 is continuous (hence π is a pdf). When {Xt }t≥1 is
discrete (hence π is a pmf), this relation is written as
$$\pi(x) = \sum_{x'} \pi(x') M(x|x')$$
The expressions on the RHS of the two equations above are written in shorthand as πM,
so that for invariant π we have π = πM. In fact, when X = {x₁, . . . , xₘ} is finite with
M(i, j) = P(Xₙ = j|Xₙ₋₁ = i) and π = (π(1), . . . , π(m)), we can indeed write π = πM
using vector-matrix multiplication.
Example 5.9. The random walk on the integers in Example 5.2 is irreducible. However, it
does not have an invariant distribution, since it is not positive recurrent for any choice of
p = 1 − q.
Example 5.10. The Markov chain on top of Figure 5.3 has two invariant distributions,
π = (1/4, 1/2, 1/4, 0, 0) and π = (0, 0, 0, 1/3, 2/3), although every state is positive
recurrent. Note that the chain is not irreducible, with two isolated communicating classes;
that is why Theorem 5.1 is not applicable and uniqueness may not follow.
Example 5.11. The Markov chain at the bottom of Figure 5.3 is neither irreducible nor
are all of its states positive recurrent (the states of the second class are transient). However,
it has a unique invariant distribution, namely π = (1/3, 2/3, 0, 0). Note that for this
chain Theorem 5.1 is not applicable since the chain is not irreducible.
This immediately leads to the following necessary and sufficient condition for the reversibility
of M: the detailed balance condition.
CHAPTER 5. MARKOV CHAIN MONTE CARLO 58
Being a sufficient condition for stationarity, the detailed balance condition is quite
useful for designing transition kernels for MCMC algorithms.
Ergodicity
Let πₙ be the distribution of Xₙ for a Markov chain {Xₙ}ₙ≥₁ with initial distribution η and
transition kernel M. We have π₁(x₁) = η(x₁), and the rest can be written recursively as
πₙ = πₙ₋₁M, or explicitly
$$\pi_n(x_n) = \int \pi_{n-1}(x_{n-1}) M(x_n | x_{n-1})\, dx_{n-1}$$
for continuous state spaces; when the state space is finite, πₙ = πₙ₋₁M holds with π and M
considered as a vector and a matrix, respectively.
In MCMC methods that aim to approximately sample from π, we generate a Markov
chain {Xn }n≥1 with invariant distribution π and hope that for n large enough Xn is ap-
proximately distributed from π. This relies on the hope that πn converges to π.
We have shown the conditions for a unique stationary distribution of a Markov chain.
Note that having a unique invariant distribution does not mean that the chain will converge
to its stationary distribution. For that to happen, the Markov chain is required to be
aperiodic, a property which prevents the chain from getting trapped in cycles.
If the Markov chain is irreducible, then aperiodicity of one state implies the aperiodicity
of all the states.
Definition 5.7 (ergodic state). A state is called ergodic if it is positive recurrent and
aperiodic.
Definition 5.8 (ergodic Markov chain). An irreducible Markov chain is called ergodic
if it is positive recurrent and aperiodic.
Ergodic chains ensure that the sequence of distributions {πₙ}ₙ≥₁ of {Xₙ}ₙ≥₁ converges
to the invariant distribution π.
Theorem 5.2. Suppose {Xₙ}ₙ≥₁ is a discrete-state ergodic Markov chain with any initial
distribution η and Markov transition kernel M with invariant distribution π. Then,
πₙ(x) → π(x) as n → ∞ for all x ∈ X.
Example 5.12. The Markov chain illustrated in Figure 5.4 is irreducible and positive re-
current, so it has a unique invariant distribution, which is π = (1/3, 1/3, 1/3). However,
it is periodic with period 3, and as a result πₙ does not converge to π unless η = π. Indeed,
one can show that for η = (η(1), η(2), η(3)), the distribution πₙ cycles among the cyclic
shifts of η, returning to the same value every 3 steps.
5.3 Metropolis-Hastings
As previously stated, an MCMC method is based on a discrete-time ergodic Markov chain
which has π as its stationary distribution. The most widely used MCMC algorithm to
date is the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).
The Metropolis-Hastings algorithm requires a Markov transition kernel Q on X for
proposing new values from the old ones. Assume that the pdf/pmf of Q(·|x) is q(·|x) for
any x. Given the previous sample Xn−1 a new value for Xn is proposed as X 0 ∼ Q(·|Xn−1 ).
The proposed sample X 0 is accepted with the acceptance probability α(Xn−1 , X 0 ), where
the function α : X × X → [0, 1] is defined as
$$\alpha(x, x') = \min\left\{1, \frac{\pi(x')\, q(x|x')}{\pi(x)\, q(x'|x)}\right\}, \qquad x, x' \in \mathcal{X}.$$
If the proposal is accepted, Xn = X 0 is taken. Otherwise, the proposal is rejected and
Xn = Xn−1 is taken.
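As an illustration, one MH transition can be sketched as follows (a minimal sketch in Python; the callables `log_pi`, `propose` and `log_q` are hypothetical placeholders for the user-supplied target and proposal, and everything is done on the log scale for numerical stability):

```python
import math
import random

def mh_step(x, log_pi, propose, log_q):
    """One Metropolis-Hastings transition from state x.

    log_pi(x): log target density (up to an additive constant).
    propose(x): draws x' ~ Q(.|x); log_q(a, b): log q(a|b).
    """
    x_new = propose(x)
    # log acceptance ratio: log of pi(x')q(x|x') / (pi(x)q(x'|x))
    log_r = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    if math.log(random.random()) < min(0.0, log_r):
        return x_new   # accept the proposal
    return x           # reject: keep the old value
```

Working with log densities avoids numerical overflow and makes explicit that π is needed only up to a normalising constant, since the constant cancels in the ratio.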
That is, with X′ ∼ Q(·|Xn−1), the proposal is accepted with probability
$$\alpha(X_{n-1}, X') = \min\left\{1, \frac{\pi(X')\, q(X_{n-1}|X')}{\pi(X_{n-1})\, q(X'|X_{n-1})}\right\}.$$
The resulting transition kernel satisfies M(x′|x) = q(x′|x)α(x, x′) for x ≠ x′, so that
$$\pi(x)\, M(x'|x) = \min\left\{\pi(x)\, q(x'|x),\; \pi(x')\, q(x|x')\right\},$$
which is symmetric with respect to x and x′. Hence π(x)M(x′|x) = π(x′)M(x|x′), the detailed
balance condition holds for π, which implies that M is reversible with respect to π and π
is invariant for M.
Note that, as long as discrete-state chains are considered, existence of the invariant
distribution π for M ensures the positive recurrence of M. There are also various sufficient
conditions for the M of the Metropolis-Hastings algorithm to be irreducible and aperiodic.
For example, if Q is irreducible and α(x, x′) > 0 for all x, x′ ∈ X, then M is irreducible.
If the rejection probability satisfies pr(x) > 0 for all x, or Q is aperiodic, then M is aperiodic (Roberts and Smith, 1994).
More detailed results on the convergence of Metropolis-Hastings are also available, see e.g.
Tierney (1994); Roberts and Tweedie (1996) and Mengersen and Tweedie (1996).
Historically, the original MCMC algorithm was introduced by Metropolis et al. (1953)
for the purpose of optimisation on a discrete state-space. This algorithm, called the
Metropolis algorithm, used symmetric proposal kernels Q, that is, q(x′|x) = q(x|x′).
When a symmetric proposal is used, the acceptance probability involves only the ratio
of the target distribution evaluated at x and x′:
$$\alpha(x, x') = \min\left\{1, \frac{\pi(x')}{\pi(x)}\right\}, \qquad \text{if } q(x'|x) = q(x|x').$$
The Metropolis algorithm was later generalised by Hastings (1970) to permit continuous
state-spaces and asymmetric proposal kernels, preserving the Metropolis algorithm as a
special case. A historical survey of Metropolis-Hastings algorithms is provided by Hitchcock (2003).
Another version is the independence Metropolis-Hastings algorithm, where, as the name
suggests, the proposal kernel Q is chosen to be independent of the current value, i.e.
q(x′|x) = q(x′), in which case the acceptance probability is
$$\alpha(x, x') = \min\left\{1, \frac{\pi(x')\, q(x)}{\pi(x)\, q(x')}\right\}.$$
• Symmetric random walk: We can take q(x′|x) = φ(x′; x, σq²), that is, x′ is proposed
from the current value x by adding a normal random variable with zero mean and
variance σq², i.e. Q(·|x) = N(x, σq²). Since q is symmetric, for the target π(x) = φ(x; µ, σ²)
the acceptance ratio becomes
$$r(x, x') = \frac{\pi(x')\, q(x|x')}{\pi(x)\, q(x'|x)} = \frac{\phi(x'; \mu, \sigma^2)}{\phi(x; \mu, \sigma^2)} = \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x'-\mu)^2}}{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}} = e^{-\frac{1}{2\sigma^2}\left[(x'-\mu)^2 - (x-\mu)^2\right]}.$$
The choice of σq² is important for good performance of MH. We want the Markov
chain generated by the algorithm to mix well, that is, we want the samples to forget
the previous values fast. Consider the acceptance ratio above:
– Too small a value for σq² will result in the acceptance ratio r(x, x′) being very
close to 1, and hence the proposed values will be accepted with high probability.
However, the chain will mix very slowly, that is, the samples will be highly
correlated, because any accepted sample x′ will most likely be only slightly
different from the current x due to the small step-size of the random walk.
– Too large a value for σq² will likely result in the proposed value x′ being far
from the region where π has most of its mass; hence π(x′) will be very small
compared to π(x), and the chain will likely reject the proposed value and stick
to the old value x. This creates a sticky chain.
Therefore, the optimum value for σq² should be neither too small nor too large. See
Figure 5.5 for both bad choices and one in between. This phenomenon of having to
choose the variance of the random walk proposals neither too small nor too large
also holds for most distributions other than the normal distribution.
Figure 5.5: Random walk MH for π(x) = φ(x; 2, 1), showing trace plots of the first 1000
samples and histograms of the last 10000 samples for σq² = 0.01, 400 and 2. The left and
middle plots correspond to a too small and a too large value for σq², respectively. All
algorithms are run for 50000 iterations. Both the trace plots and the histograms show
that the last choice works the best.

• Another option for the proposal is to sample x′ independently of x, i.e. q(x′|x) =
q(x′). For example, suppose we choose q(x) = φ(x; µq, σq²). Then the acceptance ratio
is
$$r(x, x') = \frac{\pi(x')\, q(x)}{\pi(x)\, q(x')} = \frac{\phi(x'; \mu, \sigma^2)\, \phi(x; \mu_q, \sigma_q^2)}{\phi(x; \mu, \sigma^2)\, \phi(x'; \mu_q, \sigma_q^2)} = e^{-\frac{1}{2\sigma^2}\left[(x'-\mu)^2 - (x-\mu)^2\right] + \frac{1}{2\sigma_q^2}\left[(x'-\mu_q)^2 - (x-\mu_q)^2\right]}.$$
[Figure: independence MH trace plots of the first 1000 samples and histograms of the last 10000 samples for σq² = 0.8 and σq² = 100.]
For this problem, π(x) indeed lacks a well-known form, so it is justified to use a Monte
Carlo method for it.
[Figure: trace plots of the first 1000 samples and histograms of the last 10000 samples for γ = 0.01 and γ = 1.]
$$r(x, x') = \frac{\pi(x')\, q(x|x')}{\pi(x)\, q(x'|x)} = \frac{p(z')\, p(s') \left[\prod_{i=1}^{n} p(y_i|z', s')\right] \phi(z; z', \sigma_q^2)\, p(s)}{p(z)\, p(s) \left[\prod_{i=1}^{n} p(y_i|z, s)\right] \phi(z'; z, \sigma_q^2)\, p(s')}$$
are indicated by the random variables λi , i = 1, 2, which are a priori assumed to follow a
Gamma distribution
λi ∼ Γ(α, β), i = 1, 2.
Under regime i, the counts are assumed to be identically Poisson distributed:
$$Y_t \sim \begin{cases} \mathcal{PO}(\lambda_1), & 1 \le t \le \tau, \\ \mathcal{PO}(\lambda_2), & \tau < t \le n. \end{cases}$$
A typical draw from this model is shown in Figure 5.9. The inferential goal is, given
the count data Y1:n = y1:n, to sample from the posterior distribution of the changepoint
location τ and the intensities λ1, λ2; i.e., letting x = (τ, λ1, λ2), the target distribution is
π(x) = p(τ, λ1, λ2|y1:n), which is given by
Two choices for the proposal will be considered. Let x0 = (τ 0 , λ01 , λ02 ).
• The first one is to use an independent proposal distribution, which is the prior dis-
tribution for x
q(x0 |x) = q(x0 ) = p(x0 ) = p(τ 0 , λ01 , λ02 ).
This leads to the acceptance ratio being the ratio of the likelihoods
Figure 5.9: An example data sequence of length n = 100 generated from the Poisson
changepoint model with parameters τ = 30, λ1 = 10 and λ2 = 5.
Figure 5.10 illustrates the results obtained from the two algorithms. The initial value
for τ is taken as ⌊n/2⌋, and for λ1 and λ2 we start from the mean of y1:n. As we can see,
the symmetric proposal algorithm is able to explore the posterior distribution much more
efficiently. This is because the proposal distribution in independence MH, which is chosen
as the prior distribution, takes neither the posterior distribution (hence the data) nor the
previous sample into account, and as a result it has a large rejection rate. The independence
sampler would become even poorer if n were larger, since the posterior would then be more
concentrated, in contrast to the ignorance of the prior distribution.
Example 5.15 (MCMC for source localisation). Consider the source localisation
scenario in Question 3 of Exercises in Chapter 3. From the likelihood and the prior in
(3.8) and (3.9), the posterior distribution of the unknown position is
$$p(x|y) \propto \phi(x(1); 0, \sigma_x^2)\, \phi(x(2); 0, \sigma_x^2) \prod_{i=1}^{3} \phi(y_i; r_i, \sigma_y^2) \qquad (5.5)$$
[Figure 5.10: trace plots of τ, λ1 and λ2 over 10000 iterations for the symmetric random walk proposal (top row) and the independent proposal (bottom row).]
Due to the non-linearity in $r_i = \|x - s_i\| = \left[(x(1) - s_i(1))^2 + (x(2) - s_i(2))^2\right]^{1/2}$,
i = 1, 2, 3, p(x|y) does not admit a known distribution. We use the MH algorithm to
generate approximate samples from p(x|y). We use a symmetric random walk proposal
distribution with q(x′|x) = φ(x′; x, σq² I₂), so that q(x′|x) = q(x|x′). The resulting acceptance
ratio is
$$r(x, x') = \frac{p(x'|y)\, q(x|x')}{p(x|y)\, q(x'|x)} = \frac{p(x'|y)}{p(x|y)} = \frac{\phi(x'(1); 0, \sigma_x^2)\, \phi(x'(2); 0, \sigma_x^2) \prod_{i=1}^{3} \phi(y_i; r_i', \sigma_y^2)}{\phi(x(1); 0, \sigma_x^2)\, \phi(x(2); 0, \sigma_x^2) \prod_{i=1}^{3} \phi(y_i; r_i, \sigma_y^2)},$$
where r′i = ‖x′ − si‖, i = 1, 2, 3, is the distance between the proposed value x′ and the
location of the i’th source si. Figure 5.11 shows the samples and their histograms obtained
from 10000 iterations of the MH algorithm. The chain was started from X1 = (5, 5), and its
convergence to the posterior distribution is illustrated in the right panel of the figure, where
we see the first few samples of the chain travelling to the high-probability region of the
posterior distribution.
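This sampler can be sketched as below (a sketch; `log_post` evaluates the log of (5.5) up to an additive constant, and any sensor locations, observations and tuning values supplied to it are hypothetical):

```python
import math
import random

def mh_localise(y, sensors, sigma_x2, sigma_y2, sigma_q2, n_iter, x0, seed=0):
    """Random walk MH for the 2-D source localisation posterior (5.5)."""
    rng = random.Random(seed)
    sq = math.sqrt(sigma_q2)

    def log_post(x):
        # log of (5.5) up to an additive constant
        lp = -(x[0] ** 2 + x[1] ** 2) / (2.0 * sigma_x2)
        for yi, s in zip(y, sensors):
            r = math.hypot(x[0] - s[0], x[1] - s[1])  # distance to sensor
            lp -= (yi - r) ** 2 / (2.0 * sigma_y2)
        return lp

    x, lp = x0, log_post(x0)
    out = []
    for _ in range(n_iter):
        xp = (x[0] + rng.gauss(0.0, sq), x[1] + rng.gauss(0.0, sq))
        lpp = log_post(xp)
        if math.log(rng.random()) < min(0.0, lpp - lp):
            x, lp = xp, lpp
        out.append(x)
    return out
```

Because the proposal is a symmetric random walk, only the ratio of posterior values appears in the acceptance decision, exactly as in the displayed acceptance ratio.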
[Figure 5.11: trace plots and histograms of Xt(1) and Xt(2), and the samples plotted in the (x(1), x(2)) plane.]
If one can sample from each of the full conditional distributions πk(·|X1:k−1, Xk+1:d), then the
Gibbs sampler produces a Markov chain by updating one component at a time using the πk’s.
One cycle of the Gibbs sampler successively samples from the conditional distributions
π1, . . . , πd, conditioning on the most recent samples.
where y = (y1, . . . , yd). The justification of the transition kernel comes from the re-
versibility of each Mk with respect to π, which can be verified from the detailed balance
condition as follows.
where the third line follows from the second since δx−k(y−k) allows the interchange of x−k and
y−k. Therefore, the detailed balance condition for Mk is satisfied with π, and πMk = π. If
we apply M1, . . . , Md sequentially, we get
as shown in (5.6).
Reversibility of each Mk with respect to π does not suffice to establish proper conver-
gence of the Gibbs sampler, as none of the individual steps produces an irreducible chain.
Only the combination of the d moves in the complete cycle has a chance of producing
a φ-irreducible chain. We refer to Roberts and Smith (1994) for some simple conditions
for convergence of the classical Gibbs sampler. Note, also, that M is not reversible either,
although this is not a necessary condition for convergence. A way of guaranteeing both
φ-irreducibility and reversibility is to use a mixture of kernels
$$M_\beta = \sum_{k=1}^{d} \beta_k M_k, \qquad \beta_k > 0, \; k = 1, \ldots, d, \qquad \sum_{k=1}^{d} \beta_k = 1,$$
provided that at least one Mk is irreducible and aperiodic. This choice of kernel leads to
the random scan Gibbs sampler algorithm. We refer to Tierney (1994), Robert and Casella
(2004), and Roberts and Tweedie (1996) for more detailed convergence results pertaining
to these variants of the Gibbs sampler.
Example 5.16. Suppose we wish to sample from a bivariate normal distribution, where
$$\pi(x) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\frac{x_1^2 + x_2^2 - 2\rho x_1 x_2}{2(1-\rho^2)}\right\}, \qquad \rho \in (-1, 1).$$
The full conditional of x1 satisfies
$$\pi(x_1|x_2) \propto \pi(x_1, x_2) \propto \exp\left\{-\frac{(x_1 - \rho x_2)^2}{2(1-\rho^2)}\right\},$$
therefore π(x1|x2) = φ(x1; ρx2, 1 − ρ²) and X1|X2 = x2 ∼ N(ρx2, 1 − ρ²). Similarly,
we have X2|X1 = x1 ∼ N(ρx1, 1 − ρ²). So, iteration t ≥ 2 of the Gibbs sampling
algorithm for this π(x) is
• Sample X1,t ∼ N(ρ X2,t−1, 1 − ρ²),
• Sample X2,t ∼ N(ρ X1,t, 1 − ρ²).
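The two conditional draws above give the following Gibbs cycle (a minimal sketch; names are our own):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for the zero-mean bivariate normal with
    unit marginal variances and correlation rho."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho ** 2)    # conditional standard deviation
    x1, x2 = 0.0, 0.0
    out = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, s)  # X1 | X2 = x2 ~ N(rho x2, 1 - rho^2)
        x2 = rng.gauss(rho * x1, s)  # X2 | X1 = x1 ~ N(rho x1, 1 - rho^2)
        out.append((x1, x2))
    return out
```

The stronger the correlation ρ, the slower the cycle moves through the state space, since each conditional draw is tightly tied to the other component.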
Example 5.17 (Normal distribution with unknown mean and variance). Let
us return to the problem in Example 5.13 where we want to estimate the mean and the
variance of the normal distribution N(z, s) given samples y1, . . . , yn generated from it. We
use the same prior distributions for z and s, namely z ∼ N(m, κ²) and s ∼ IG(α, β). Note
that these are the conjugate priors for those parameters; when one of the parameters
is given, the posterior distribution of the other one has a known form. Indeed, in Examples
4.5 and 4.6, we derived these full conditional distributions. Example 4.5 can be revisited
(but this time with a non-zero prior mean m) to see that
$$Z | s, y_{1:n} \sim \mathcal{N}\left(\mu_{z|s,y}, \sigma^2_{z|s,y}\right)$$
where
$$\sigma^2_{z|s,y} = \left(\frac{1}{\kappa^2} + \frac{n}{s}\right)^{-1}, \qquad \mu_{z|s,y} = \left(\frac{1}{\kappa^2} + \frac{n}{s}\right)^{-1} \left(\frac{1}{s} \sum_{i=1}^{n} y_i + \frac{m}{\kappa^2}\right),$$
and from Example 4.6 we can deduce that
$$S | z, y_{1:n} \sim \mathcal{IG}\left(\alpha_{s|z,y}, \beta_{s|z,y}\right)$$
where
$$\alpha_{s|z,y} = \alpha + \frac{n}{2}, \qquad \beta_{s|z,y} = \beta + \frac{1}{2} \sum_{i=1}^{n} (y_i - z)^2.$$
Therefore, Gibbs sampling for Z, S given Y1:n = y1:n is
• Sample $Z_t \sim \mathcal{N}\left(\left(\frac{1}{\kappa^2} + \frac{n}{S_{t-1}}\right)^{-1} \left(\frac{1}{S_{t-1}} \sum_{i=1}^{n} y_i + \frac{m}{\kappa^2}\right), \left(\frac{1}{\kappa^2} + \frac{n}{S_{t-1}}\right)^{-1}\right)$
• Sample $S_t \sim \mathcal{IG}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (y_i - Z_t)^2\right).$
2. the likelihood is intractable (hard to compute, does not admit conjugacy, etc.), but
given some additional unobserved (real or fictitious) data it would be tractable.
Let yobs denote the observed data and ymis the missing data (sometimes ymis is called a
latent variable). We suppose we can easily sample x from the posterior given the augmented
data (yobs, ymis), and that we can sample ymis conditional on yobs and x (this only involves
the sampling distributions). Then we can run the Gibbs sampler on the pair (x, ymis) and
perform Monte Carlo marginalisation: if in the resulting joint distribution for x, ymis
given yobs we simply ignore ymis, we obtain samples from the posterior of x given
yobs alone.
Example 5.18 (Genetic linkage). In a genetic linkage problem, each animal can be
allocated to one of four categories, coded 1, 2, 3, and 4, having respective probabilities
$$\left(\frac{1}{2} + \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4}\right),$$
where θ is an unknown parameter in (0, 1). For a sample of 197 animals, the (multi-
nomial) counts of those falling in the 4 categories are represented by random variables
Y = (Y1, Y2, Y3, Y4), with observed values y = (y1, y2, y3, y4) = (125, 18, 20, 34). Suppose we
place a Beta(α, β) prior on θ. Then,
$$\pi(\theta) = p(\theta|y) \propto \underbrace{\left(\frac{1}{2} + \frac{\theta}{4}\right)^{125} \left(\frac{1-\theta}{4}\right)^{18+20} \left(\frac{\theta}{4}\right)^{34}}_{\text{Multinomial likelihood}} \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
$$\propto (2 + \theta)^{125}\, (1-\theta)^{38+\beta-1}\, \theta^{34+\alpha-1} \qquad (5.7)$$
How can we sample from this? We could use a rejection sampler (probably with a very high
rejection probability) or MH for this posterior distribution; in this example we seek a
suitable Gibbs sampler. Note that the problematic part in (5.7) is the first factor; were it
like the others, the posterior would reduce to a Beta distribution.
Suppose we divide category 1, with total probability 1/2 + θ/4, into two latent sub-
categories, a and b, with respective probabilities θ/4 and 1/2. We regard the number of
animals Z falling in subcategory a as missing data. If, as well as the observed data y, we
are given Z = z, we are in the situation of having observed counts (z, 125 − z, 18, 20, 34)
from a multinomial distribution with probabilities (θ/4, 1/2, (1 − θ)/4, (1 − θ)/4, θ/4). The
resulting joint distribution is
$$p(\theta, z|y) \propto p(\theta, z, y) \propto \left(\frac{1}{2}\right)^{125-z} \left(\frac{1-\theta}{4}\right)^{18+20} \left(\frac{\theta}{4}\right)^{34+z} \theta^{\alpha-1} (1-\theta)^{\beta-1}. \qquad (5.8)$$
Exercises
1. Consider the toy example in Section 5.3.1 for the MH algorithm for sampling from
the normal distribution N (µ, σ 2 ).
• Modify the code so that it stores the acceptance probability at each iteration in
a vector and returns the vector as one of the outputs. In the next part of the
exercise, you will use the last T − tb samples of the vector to find an estimate
of the overall expected acceptance probability
$$\alpha(\sigma_q) = \int \alpha(x, x')\, \pi(x)\, q_{\sigma_q}(x'|x)\, dx\, dx',$$
where tb = 1000 is the burn-in time up to which you ignore the samples generated
by the algorithm. Calculate an estimate α(i)(σq) of α(σq) in this way using the
last T − tb samples.
• Report the sample variance of µ(i) (σq )’s: This is approximately the variance of
the mean estimate of the MH algorithm that uses T − tb samples. We wish this
variance to be as small as possible. Also, report the average of α(i) (σq )’s.
• Repeat above for σq = 0.1, 0.2, . . . , 9.9, 10, and generate two plots: (i) sample
variance of µ(i) (σq )’s vs σq and (ii) average of α(i) (σq )’s vs σq . From the first
plot, suggest the (approximately) optimum value for σq and report the estimate
of α(σq ) for that σq .
2. Design and implement a symmetric random walk MH algorithm and the Gibbs sam-
pling algorithm for the genetic linkage problem in Example 5.18 with hyperparame-
ters α = β = 2.
3. Implement the Gibbs sampler in Example 5.17 with n = 100 and hyperparameters
α = 5 and β = 10.
• Download UK coal mining disaster [Link] from SUCourse. The data con-
sists of the day numbers of coal mining disasters between 1851 and 1962, where
the first day is the start of 1851. It is suspected that, due to a policy change,
the accident rate over the years is piecewise constant with a single changepoint
time around the time of the policy change.
• From the data, create another data vector of length 112, where the i’th element
contains the number of disasters in year i (starting from 1851). Note that some
years have 366 days!
• Implement the MH algorithm (given in SUCourse) for the changepoint model
given the data that you created. Take the priors for τ , λ1 and λ2 the same as
in Example 5.14, i.e. with hyperparameters α = 10 and β = 1. You can use the
symmetric random walk proposal for the parameters.
• Implement the Gibbs sampling algorithm for the same model given the same data
using the same priors. All the derivations you need are in Example 5.19.
5. Suppose we observe a noisy sinusoid with unknown amplitude a, angular fre-
quency ω, phase z, and noise variance σy² for n steps. Letting x = (a, ω, z, σy²),
The unknown parameters are a priori independent with a ∼ N(0, σa²), ω ∼ Γ(α, β),
z ∼ Unif(0, 2π), σy² ∼ IG(α, 1/β).
• Write down the likelihood of p(y1:n |x) and the joint density p(x, y1:n ).
• Download the data file sinusoid [Link] from SUCourse; the observations
in the file are your data y1:n . Use hyperparameters σa2 = 100, α = β = 0.01
and design and implement an MH algorithm for generating samples from the
posterior distribution π(x) = p(x|y1:n ).
• Bonus - worth 50% of the base mark: This time, design and implement an
MH within Gibbs algorithm where each loop contains four steps, in each of
which you update one component only, fixing the others, using an MH kernel
that targets the full conditionals. This is an example where you can still update
the components one by one even if the full conditional distributions are not easy
to sample from.
Chapter 6
Sequential Monte Carlo
6.1 Introduction
Let {Xn}n≥1 be a sequence of random variables where each Xn takes values in some space
X. Define the sequence of distributions {πn}n≥1 where πn is defined on X^n. Also, let
{ϕn}n≥1 be a sequence of functions where ϕn : X^n → R is a real-valued function on
X^n. We are interested in sequential inference, i.e. approximating the following integrals
sequentially in n:
$$\pi_n(\varphi_n) = \mathbb{E}_{\pi_n}\left[\varphi_n(X_{1:n})\right] = \int \pi_n(x_{1:n})\, \varphi_n(x_{1:n})\, dx_{1:n}, \qquad n = 1, 2, \ldots$$
Despite their versatility and success, it may be impractical to apply MCMC algorithms
to sequential inference problems. This chapter discusses sequential Monte Carlo (SMC)
methods, which provide approximation tools for a sequence of varying distributions.
Good tutorials on the subject are available; see for example Doucet et al. (2000b) for a
tutorial and Doucet et al. (2001) for a book-length review. Also, Robert and Casella (2004)
and Cappé et al. (2005) contain detailed summaries. Finally, the book by Del Moral (2004)
contains more theoretical work on the subject in a more general framework, namely
Feynman-Kac formulae.
sampling. The first uses of SIS can be traced back to works in the 1960s and 1970s, such as
Mayne (1966); Handschin and Mayne (1969); Handschin (1970); see Doucet et al. (2000b)
for a general formulation of the method for Bayesian filtering.
Naive approach: Consider the naive importance sampling approach to the sequential
problem, where we have a sequence of importance densities {qn(x1:n)}n≥1, each qn
defined on X^n, such that
$$w_n(x_{1:n}) = \frac{\pi_n(x_{1:n})}{q_n(x_{1:n})}.$$
It is obvious that we can approximate πn (ϕn ) by generating independent samples from qn
at each n and exploiting the relation
This approach would require the design of a separate qn (x1:n ) and sampling the whole path
X1:n at each n, which is obviously inefficient.
where q(x1) is some initial density that is easy to sample from, and q(xt|x1:t−1) are condi-
tional densities which we design so that it is possible to sample from q(·|x1:t−1) for any x1:t−1
and t ≥ 1. This selection of qn leads to the following useful recursion on the importance
weights:
$$w_n(x_{1:n}) = \frac{\pi_n(x_{1:n})}{q_n(x_{1:n})} = \frac{\pi_{n-1}(x_{1:n-1})}{q_{n-1}(x_{1:n-1})} \cdot \frac{\pi_n(x_{1:n})}{\pi_{n-1}(x_{1:n-1})\, q(x_n|x_{1:n-1})} = w_{n-1}(x_{1:n-1})\, \frac{\pi_n(x_{1:n})}{\pi_{n-1}(x_{1:n-1})\, q(x_n|x_{1:n-1})}. \qquad (6.2)$$
We remark that the sequence of distributions is usually known only up to a normalising
constant, as
$$\pi_n(x_{1:n}) = \frac{\hat{\pi}_n(x_{1:n})}{Z_{\pi_n}},$$
where we know π̂n(x1:n) for any x1:n but not Zπn. Hence, from now on we will only consider
self-normalised importance sampling, where πn−1 and πn are replaced by π̂n−1 and π̂n in
the calculation of (and the recursion for) wn(x1:n) in (6.2).
Note that this is indeed the same as the self-normalised importance sampling estimate, see
(3.7) for example.
The SIS algorithm: In many applications of (6.2), the importance density is designed
in such a way that the ratio
$$\frac{\pi_n(x_{1:n})}{\pi_{n-1}(x_{1:n-1})\, q(x_n|x_{1:n-1})}$$
is easy to calculate (at least up to a proportionality constant if we use the unnormalised
densities). For example, this may be due to the design of the q(xn|x1:n−1)’s in such a way
that the ratio depends only on xn−1 and xn. Hence, one can exploit this recursion by
sampling only Xn from q(·|x1:n−1) at time n and updating the weights with a small effort.
More explicitly, assume a set of N ≥ 1 samples, termed particles, $X^{(i)}_{1:n-1}$ with weights
$w_{n-1}(X^{(i)}_{1:n-1})$ and normalised weights $W^{(i)}_{n-1}$ for i = 1, . . . , N are available at time n − 1, so
that we have
$$\pi^N_{n-1}(x_{1:n-1}) = \sum_{i=1}^{N} W^{(i)}_{n-1}\, \delta_{X^{(i)}_{1:n-1}}(x_{1:n-1}).$$
The update from $\pi^N_{n-1}$ to $\pi^N_n$ can be performed by first sampling $X^{(i)}_n \sim q(\cdot|X^{(i)}_{1:n-1})$ and
computing the weights $w_n$ at points $X^{(i)}_{1:n} = (X^{(i)}_{1:n-1}, X^{(i)}_n)$ using the update rule in (6.2),
and finally obtaining the normalised weights $W^{(i)}_n$ using (6.4).
The SIS method is summarised in Algorithm 6.1. Being a special case of the importance
sampling approximation, the SIS estimate π^N_n(ϕn) converges almost surely to πn(ϕn) for
any n (under regularity conditions) as the number of particles N tends to infinity; it is also
possible to obtain a central limit theorem for π^N_n(ϕn) (Geweke, 1989).
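A generic SIS step can be sketched as below (a sketch; `sample_q` and `log_w_inc` are hypothetical callables supplying the proposal draw and the log incremental weight of (6.2), and the weights are normalised on the log scale for stability):

```python
import math
import random

def sis(n_particles, n_steps, sample_q, log_w_inc, seed=0):
    """Sequential importance sampling sketch.

    sample_q(t, x_prev, rng) draws X_t ~ q(.|x_{1:t-1}) (here only the last
    component is tracked); log_w_inc(t, x_prev, x_new) is the log of the
    incremental weight pi_n / (pi_{n-1} q) appearing in (6.2).
    Returns the last components and the normalised weights."""
    rng = random.Random(seed)
    xs = [sample_q(1, None, rng) for _ in range(n_particles)]
    logw = [log_w_inc(1, None, x) for x in xs]
    for t in range(2, n_steps + 1):
        xs_new = [sample_q(t, x, rng) for x in xs]
        logw = [lw + log_w_inc(t, x, xn)
                for lw, x, xn in zip(logw, xs, xs_new)]
        xs = xs_new
    mx = max(logw)                     # subtract max before exponentiating
    w = [math.exp(l - mx) for l in logw]
    sw = sum(w)
    return xs, [wi / sw for wi in w]
```

Running this with, say, an i.i.d. N(0, 1) proposal against a product of N(1, 1) targets illustrates the weight degeneracy discussed later in the chapter: as n grows, the normalised weights concentrate on ever fewer particles.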
7 for i = 1, . . . , N do
8 Calculate
$$W^{(i)}_n = \frac{w_n(X^{(i)}_{1:n})}{\sum_{j=1}^{N} w_n(X^{(j)}_{1:n})}.$$
In sequential Monte Carlo for {πn(x1:n)}n≥1, resampling is applied to $\pi^N_{n-1}(x_{1:n-1})$ before
proceeding to approximate πn(x1:n). Assume, again, that πn−1(x1:n−1) is approximated by
$$\pi^N_{n-1}(x_{1:n-1}) = \sum_{i=1}^{N} W^{(i)}_{n-1}\, \delta_{X^{(i)}_{1:n-1}}(x_{1:n-1}).$$
i=1
We draw N independent samples $\tilde{X}^{(i)}_{1:n-1}$, i = 1, . . . , N, from $\pi^N_{n-1}$, such that
$$\mathbb{P}\left(\tilde{X}^{(i)}_{1:n-1} = X^{(j)}_{1:n-1}\right) = W^{(j)}_{n-1}, \qquad i, j = 1, \ldots, N.$$
Obviously, this corresponds to drawing N independent samples from a multinomial distri-
bution; therefore this particular resampling scheme is called multinomial resampling. Now
the resampled particles form an equally weighted discrete distribution
$$\tilde{\pi}^N_{n-1}(x_{1:n-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\tilde{X}^{(i)}_{1:n-1}}(x_{1:n-1}).$$
We proceed to approximating πn(x1:n) using $\tilde{\pi}^N_{n-1}(x_{1:n-1})$ instead of $\pi^N_{n-1}(x_{1:n-1})$ as follows.
After resampling, for each i = 1, . . . , N we sample $X^{(i)}_n \sim q(\cdot|\tilde{X}^{(i)}_{1:n-1})$ and weight the particles
$X^{(i)}_{1:n} = (\tilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$ using
$$W^{(i)}_n \propto w_{n|n-1}(X^{(i)}_{1:n}) = \frac{\pi_n(X^{(i)}_{1:n})}{\pi_{n-1}(\tilde{X}^{(i)}_{1:n-1})\, q(X^{(i)}_n | \tilde{X}^{(i)}_{1:n-1})}, \qquad \sum_{i=1}^{N} W^{(i)}_n = 1.$$
The SISR method, also known as the particle filter, is summarised in Algorithm 6.2.
7 else
8 Resample from $\{X^{(i)}_{1:n-1}\}_{1 \le i \le N}$ according to the weights $\{W^{(i)}_{n-1}\}_{1 \le i \le N}$ to get
resampled particles $\{\tilde{X}^{(i)}_{1:n-1}\}_{1 \le i \le N}$, each with weight 1/N.
9 for i = 1, . . . , N do
10 Sample $X^{(i)}_n \sim q(\cdot|\tilde{X}^{(i)}_{1:n-1})$, set $X^{(i)}_{1:n} = (\tilde{X}^{(i)}_{1:n-1}, X^{(i)}_n)$
11 for i = 1, . . . , N do
12 Calculate
$$W^{(i)}_n \propto \frac{\pi_n(X^{(i)}_{1:n})}{\pi_{n-1}(\tilde{X}^{(i)}_{1:n-1})\, q(X^{(i)}_n|\tilde{X}^{(i)}_{1:n-1})}.$$
Path degeneracy: The importance of resampling in the context of SMC was first demon-
strated by Gordon et al. (1993), based on the ideas of Rubin (1987). Although the resam-
pling step alleviates the weight degeneracy problem, it has two drawbacks. Firstly, after
successive resampling steps, some of the distinct particles for X1:n are dropped in
favour of more copies of highly-weighted particles. This leads to the impoverishment of
particles, such that for k ≪ n, very few particles represent the marginal distribution of
X1:k under πn (Andrieu et al., 2005; Del Moral and Doucet, 2003; Olsson et al., 2008).
Hence, whatever the number of particles, πn(x1:k) will eventually be approximated
by a single unique particle for all (sufficiently large) n. As a result, any attempt to perform
integrations over the path space will suffer from this form of degeneracy, which is called
path degeneracy. The second drawback is the extra variance introduced by the resampling
step. There are a few ways of reducing the effects of resampling.
• One way is adaptive resampling, i.e. resampling only at iterations where the effective
sample size drops below a certain proportion of N. For a practical implementation,
the effective sample size at time n itself should be estimated from the particles as well.
One particle estimate of Neff,n is given in Liu (2001, pp. 35-36):
$$\hat{N}_{\text{eff},n} = \frac{1}{\sum_{i=1}^{N} \left(W^{(i)}_n\right)^2}.$$
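In code, this estimate and the resulting adaptive rule read (a sketch; the threshold τ = 0.5 is a common but arbitrary tuning choice):

```python
def ess(weights):
    """Effective sample size estimate of Liu (2001):
    1 over the sum of squared normalised weights."""
    return 1.0 / sum(w * w for w in weights)

def should_resample(weights, tau=0.5):
    """Adaptive resampling rule: trigger when ESS < tau * N."""
    return ess(weights) < tau * len(weights)
```

With uniform weights the ESS equals N, while a single dominant weight drives it down towards 1, which is exactly when resampling is worthwhile.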
• The final method we mention here for reducing path degeneracy is block sampling
(Doucet et al., 2006), where at time n one samples the components Xn−L+1:n for
some L > 1, and the previously sampled values for Xn−L+1:n−1 are simply discarded.
In return for the computational cost introduced by L, this procedure reduces the
variance of the weights and hence dramatically reduces the number of resampling
steps (if an adaptive resampling strategy is used). Therefore, path degeneracy is reduced.
Chapter 7
Bayesian Inference in Hidden Markov Models
7.1 Introduction
HMMs arguably constitute the widest class of time series models used for modelling the
stochastic behaviour of dynamic systems. In Section 7.2, we will introduce HMMs
using a formulation that is appropriate for filtering and parameter estimation problems.
We will restrict ourselves to discrete-time homogeneous HMMs whose dynamics for the hid-
den states and observables admit conditional probability densities parametrised by vector-
valued static parameters. However, this is our only restriction; we keep our framework
general enough to cover models with non-linear, non-Gaussian dynamics.
One of the main problems dealt with within the framework of HMMs is optimal Bayesian
filtering, which has many applications in signal processing and related areas such as speech
processing (Rabiner, 1989), finance (Pitt and Shephard, 1999), robotics (Gordon et al.,
1993), communications (Andrieu et al., 2001), etc. Due to the non-linearity and non-
Gaussianity of most models of interest in real-life applications, approximate solutions
are inevitable, and SMC is the main computational tool used for this; see e.g. Doucet et al.
(2001) for a wide selection of examples demonstrating the use of SMC. SMC methods having
already been presented in their general form in the previous chapter, we will present their
application to HMMs for optimal Bayesian filtering in Sections 7.3 and 7.3.4.
[Graphical model of an HMM: the hidden states X1 → X2 → · · · → Xt evolve according to the transition density f, and each observation Yt depends on Xt through the observation density g.]
Another important probability density, which will be pursued in detail, is the density of
the posterior distribution of X1:n given Y1:n = y1:n, which is obtained by using Bayes’
theorem:
$$p(x_{1:n}|y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}. \qquad (7.3)$$
In the time series literature, the term HMM has been widely associated with the case
of X being finite (Rabiner, 1989) and those models with continuous X are often referred
to as state-space models. Again, in some works the term ‘state-space models’ refers to
the case of linear Gaussian systems (Anderson and Moore, 1979). We emphasise at this
point that in this text we shall keep the framework as general as possible. We consider
the general case of measurable spaces and we avoid making any restrictive assumptions
on η(x1 ), f (xt |xt−1 ), and g(yt |xt ) that impose a certain structure on the dynamics of the
HMM. Also, we clarify that in contrast to previous restrictive use of terminology, we will
use both terms ‘HMM’ and ‘general state space model’ to describe exactly the same thing.
Example 7.1 (A finite state-space HMM for weather conditions). Assume that
the weather condition in terms of atmospheric pressure is simplified to have two states,
“Low” and “High”, and on day t, Xt ∈ X = {1, 2} denotes the state of the atmospheric
condition in terms of pressure, where 1 stands for “Low” and 2 stands for “High”. Further
{Xt }t≥1 is modelled as a Markov chain with some initial distribution η = [η(1), η(2)], and
transition probability matrix
$$F = \begin{bmatrix} 0.3 & 0.7 \\ 0.2 & 0.8 \end{bmatrix},$$
where F(i, j) = P(Xt+1 = j|Xt = i) = f(j|i). What we observe is not the atmospheric
pressure but whether a day is “Dry”, “Cloudy”, or “Rainy”, and these conditions are
enumerated 1, 2, and 3, respectively. Let Yt ∈ Y = {1, 2, 3} be the observed weather
condition on day t. It is known that low pressure is more likely to lead to clouds or precipi-
tation than high pressure, and it is assumed that, given Xt, Yt is conditionally independent
of Y1:t−1 and X1:t−1. The conditional observation matrix that relates Xt to Yt is given by
$$G = \begin{bmatrix} 0.3 & 0.4 & 0.3 \\ 0.6 & 0.3 & 0.1 \end{bmatrix},$$
where G(i, j) = P(Yt = j|Xt = i) = g(j|i). Then {Xt, Yt}t≥1 forms an HMM, and since X
is finite, it is called a finite state-space HMM.
Example 7.2 (Linear Gaussian HMM). A generic linear Gaussian HMM {Xt , Yt },
where Xt ∈ Rdx , and Yt ∈ Rdy are vector valued hidden and observed states, can be defined
via the following generative definitions for the random variables {Xt , Yt }:
X1 ∼ N (µ1 , Σ1 ), Xt = AXt−1 + Ut , Ut ∼ N (0, S), t>1 (7.4)
Yt = BXt + Vt , Vt ∼ N (0, R), (7.5)
Here, A, B are dx ×dx and dy ×dx matrices, and S and R are dx ×dx and dy ×dy covariance
matrices for the state and observation processes, respectively. In terms of densities, this
HMM can be described as
η(x1 ) = φ(x1 ; µ1 , Σ1 ), f (xt |xt−1 ) = φ(xt ; Axt−1 , S), g(yt |xt ) = φ(yt ; Bxt , R). (7.6)
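A scalar instance of (7.4)-(7.5) can be simulated as follows (a sketch with hypothetical parameter values; with |a| < 1 and the initial variance set to S/(1 - a²), the state process is stationary):

```python
import random

def simulate_lg_hmm(a, b, s_std, r_std, mu1, sig1, n_steps, seed=0):
    """Simulate a scalar linear Gaussian HMM:
    X_1 ~ N(mu1, sig1^2), X_t = a X_{t-1} + U_t with U_t ~ N(0, s_std^2),
    and Y_t = b X_t + V_t with V_t ~ N(0, r_std^2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    x = rng.gauss(mu1, sig1)
    for _ in range(n_steps):
        xs.append(x)
        ys.append(b * x + rng.gauss(0.0, r_std))  # observation equation (7.5)
        x = a * x + rng.gauss(0.0, s_std)         # state equation (7.4)
    return xs, ys
```

This is the rare HMM for which optimal filtering is exactly tractable (via the Kalman filter), which makes it a convenient benchmark for the SMC methods of Chapter 6.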
Example 7.3 (A partially observed moving target). We modify the source localisation
problem in Example 5.15 by adding to the scenario that the source is moving in a Markovian
fashion: the motion of the source is modelled as a Markov chain for its velocity and
position. Let Vt = (Vt(1), Vt(2)) and Pt = (Pt(1), Pt(2)) be the velocity and position
vectors (in the xy plane) of the source at time t, and assume that they evolve according to
the following stochastic dynamics:
$$V_1(i) \sim \mathcal{N}(0, \sigma_{bv}^2), \quad P_1(i) \sim \mathcal{N}(0, \sigma_{bp}^2), \quad i = 1, 2,$$
$$V_t(i) = a V_{t-1}(i) + U_t(i), \quad P_t(i) = P_{t-1}(i) + \Delta V_{t-1}(i) + Z_t(i), \quad i = 1, 2,$$
CHAPTER 7. BAYESIAN INFERENCE IN HIDDEN MARKOV MODELS 86
where $U_t(i) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_v^2)$ and $Z_t(i) \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_p^2)$. This model dictates that the velocity in each
direction changes independently according to an autoregressive model with regression
parameter a and driving variance σv², and that the position is the previous position plus the
previous velocity multiplied by the factor ∆, which corresponds to the time interval between
successive time steps t − 1 and t, plus some noise which accounts for the discretisation error.
Let Xt = (Vt , Pt ). Xt is a Markov chain with transition density
    f(xt|xt−1) = ∏_{i=1}^{2} φ(vt(i); avt−1(i), σv²) · ∏_{i=1}^{2} φ(pt(i); pt−1(i) + ∆vt−1(i), σp²).
The observations are generated as before, i.e. at each time t three distance measurements Rt,1, Rt,2, Rt,3 from three different sensors are collected in Gaussian noise with variance σy², and these measurements form Yt = (Yt,1, Yt,2, Yt,3) with

    Yt,i = Rt,i + Et,i,   Et,i ∼ N(0, σy²) i.i.d.,   i = 1, 2, 3,

so that

    g(yt|xt) = ∏_{i=1}^{3} φ(yt,i; rt,i, σy²).
This is an example of a non-linear HMM due to the non-linearity in its observation dynamics.
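The moving-target model above can be simulated in a few lines. In the sketch below the sensor locations are an illustrative assumption (the original locations from Example 5.15 are not restated here), and the function name is ours:

```python
import numpy as np

def simulate_target(n, a, dt, s2_v, s2_p, s2_bv, s2_bp, sensors, s2_y, rng=None):
    """Simulate the moving-target HMM of Example 7.3 with noisy distance
    measurements from the given sensor locations (shape (3, 2))."""
    rng = np.random.default_rng(rng)
    V = np.empty((n, 2))
    P = np.empty((n, 2))
    V[0] = rng.normal(0.0, np.sqrt(s2_bv), 2)     # V_1(i) ~ N(0, s2_bv)
    P[0] = rng.normal(0.0, np.sqrt(s2_bp), 2)     # P_1(i) ~ N(0, s2_bp)
    for t in range(1, n):
        V[t] = a * V[t-1] + rng.normal(0.0, np.sqrt(s2_v), 2)
        P[t] = P[t-1] + dt * V[t-1] + rng.normal(0.0, np.sqrt(s2_p), 2)
    # True distances R_{t,i} to each sensor, observed in additive Gaussian noise
    dists = np.linalg.norm(P[:, None, :] - sensors[None, :, :], axis=2)
    Y = dists + rng.normal(0.0, np.sqrt(s2_y), dists.shape)
    return V, P, Y
```

This simulator is reused verbatim when experimenting with the particle filter of Example 7.5.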
whereas for t′ < t the density p(x1:t′|y1:t) can be obtained simply by integrating out the variables xt′+1:t, i.e.

    p(x1:t′|y1:t) = ∫ p(x1:t|y1:t) dxt′+1:t.
However, one can restrict the focus to a problem of smaller size, such as the marginal distribution of the random variable Xk, k ≤ n′, given y1:n. The probability density of such a marginal posterior distribution p(xk|y1:n) is called a filtering, prediction, or smoothing
density if k = n, k > n and k < n, respectively. Indeed, there are many cases where one
is interested in calculating the expectations of functions ϕ : X → R of Xk given y1:n:

    E[ϕ(Xk)|Y1:n = y1:n] = ∫ ϕ(xk) p(xk|y1:n) dxk.
Although once we have p(x1:n′|y1:n) for n′ ≥ k the marginal density can directly be obtained by marginalisation, the recursion in (7.24) may be intractable or too expensive to calculate. Therefore it is useful to use alternative recursion techniques to evaluate the marginal densities sequentially and effectively.
Forward filtering (and prediction): We start with p(x1|y0) := p(x1) = η(x1), so that the first filtering density is p(x1|y1) = η(x1)g(y1|x1)/p(y1), where p(y1) = ∫ η(x1)g(y1|x1) dx1. Given the filtering density p(xt−1|y1:t−1) at time t − 1 and the new observation yt at time t, the filtering density at time t can be obtained
recursively in two stages, which are called prediction and update. These are given as
    p(xt|y1:t−1) = ∫ p(xt|xt−1, y1:t−1) p(xt−1|y1:t−1) dxt−1
                 = ∫ f(xt|xt−1) p(xt−1|y1:t−1) dxt−1,                     (7.7)

    p(xt|y1:t) = p(yt|xt, y1:t−1) p(xt|y1:t−1) / p(yt|y1:t−1)
               = g(yt|xt) p(xt|y1:t−1) / p(yt|y1:t−1),                    (7.8)

where this time we write the normalising constant as

    p(yt|y1:t−1) = ∫ p(xt|y1:t−1) g(yt|xt) dxt.                           (7.9)
Actually, (7.9) is important for its own sake, since it leads to two important quantities:
First, given y1:n , the posterior predictive density for Yn+1 is simply p(yn+1 |y1:n ), which
can be calculated from p(xn+1 |y1:n ). Secondly, (7.9) can be used to calculate the evidence
recursively.
    p(y1:n) = ∏_{t=1}^{n} p(yt|y1:t−1) = p(y1:n−1) p(yn|y1:n−1).          (7.10)
The problem of evaluating the recursion given by equations (7.7) and (7.8) is called the Bayesian optimal filtering (or, shortly, optimal filtering) problem in the literature.
Backward smoothing: Once we have the forward filtering recursion to calculate the
filtering and prediction densities p(xt |y1:t ) and p(xt |y1:t−1 ) for t = 1, . . . , n, where n is the
total number of observations, there is more than one way of performing smoothing in an HMM to calculate p(xt|y1:n), t = 1, . . . , n. We will see the one that corresponds to forward
filtering backward smoothing. As the name suggests, backward smoothing is performed via
a backward recursion in time, i.e. p(xt |y1:n ) is calculated in the order t = n, n − 1, . . . , 1.
Now let us see how one step of the backward recursion works: Given p(xt+1 |y1:n ), we find
p(xt |y1:n ) by exploiting the following relation
    p(xt|y1:n) = ∫ p(xt, xt+1|y1:n) dxt+1
               = ∫ p(xt+1|y1:n) p(xt|xt+1, y1:n) dxt+1,                   (7.11)
which can be written for any time series model. Thanks to the particular structure of the
HMM, given Xt+1 , Xt is conditionally independent from the rest of the future variables
(try to see this from Figure 7.1). Hence
    p(xt|xt+1, y1:n) = p(xt|xt+1, y1:t) = p(xt|y1:t) f(xt+1|xt) / p(xt+1|y1:t),   (7.12)
where the last expression is indeed exactly p(xt |xt+1 , y1:n ). Substituting this into (7.11),
we have
    p(xt|y1:n) = ∫ p(xt+1|y1:n) [ p(xt|y1:t) f(xt+1|xt) / p(xt+1|y1:t) ] dxt+1,   (7.13)
which involves the filtering and prediction distributions that we have calculated in the
forward filtering stage already.
There are cases when the optimal filtering problem can be solved exactly. One such case is when X is a finite set (Rabiner, 1989). Also, in linear Gaussian state-space models the densities in (7.7) and (7.8) are obtained by the Kalman filter (Kalman,
1960).
The full posterior p(x1:n|y1:n) admits the decomposition

    p(x1:n|y1:n) = p(xn|y1:n) ∏_{t=1}^{n−1} p(xt|xt+1, y1:n)
                 = p(xn|y1:n) ∏_{t=1}^{n−1} p(xt|xt+1, y1:t),             (7.14)

where the second line is crucial and it follows from the specific dependency structure of the HMM. Equation (7.14) suggests that we can start sampling Xn from the filtering
distribution at time n, and go backwards to sample Xn−1 , Xn−2 , . . . , X1 , using the backward
transition probabilities. Note that one needs all the filtering distributions up to time n in
order to perform this backward sampling. That is why the algorithm that executes this
scheme to sample from p(x1:n |y1:n ) is called forward filtering backward sampling.
The forward filtering backward smoothing algorithm for a finite-state HMM is given in
Algorithm 7.1. The recursions given in the algorithms are simply the discrete versions of
Equations (7.7), (7.8), and (7.13). In order to keep track of p(yt |y1:t−1 ) (hence p(y1:t )) as
well, one can include the following (with the convention p(y1 |y0 ) = p(y1 ))
    p(yt|y1:t−1) = Σ_{i=1}^{k} βt(i) g(yt|i).

Prediction (with the initialisation β1(i) = η(i)):

    βt(i) = Σ_{j=1}^{k} αt−1(j) f(i|j),   i = 1, . . . , k.

Filtering:

    αt(i) = βt(i) g(yt|i) / Σ_{j=1}^{k} βt(j) g(yt|j),   i = 1, . . . , k.

Backward smoothing: for t = n, . . . , 1: if t = n, set γn(i) = αn(i), i = 1, . . . , k; else

    γt(i) = αt(i) Σ_{j=1}^{k} γt+1(j) f(j|i) / βt+1(j),   i = 1, . . . , k.
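The recursions of Algorithm 7.1 can be written compactly in vectorised form. In the sketch below (our own implementation, not from the text) the conventions are F[i, j] = f(j|i) and G[i, y] = g(y|i):

```python
import numpy as np

def forward_backward(eta, F, G, ys):
    """Finite-state forward filtering backward smoothing (Algorithm 7.1).
    eta: (k,) initial pmf; F[i, j] = f(j|i); G[i, y] = g(y|i); ys: observations."""
    n, k = len(ys), len(eta)
    beta = np.empty((n, k))     # prediction pmfs beta_t
    alpha = np.empty((n, k))    # filtering pmfs alpha_t
    beta[0] = eta               # the prediction at t = 1 is the prior
    for t in range(n):
        if t > 0:
            beta[t] = alpha[t-1] @ F          # beta_t(i) = sum_j alpha_{t-1}(j) f(i|j)
        w = beta[t] * G[:, ys[t]]             # unnormalised beta_t(i) g(y_t|i)
        alpha[t] = w / w.sum()
    gamma = np.empty((n, k))                  # smoothing pmfs gamma_t
    gamma[-1] = alpha[-1]
    for t in range(n - 2, -1, -1):            # backward recursion
        gamma[t] = alpha[t] * (F @ (gamma[t+1] / beta[t+1]))
    return alpha, beta, gamma
```

Each row of `alpha` and `gamma` is a probability vector, so a quick sanity check is that they sum to one.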
Forward filtering backward sampling: For the finite state-space HMM, the backward
transition probabilities are given as
    P(Xt = i|Xt+1 = j, Y1:t = y1:t) = αt(i) f(j|i) / βt+1(j).
The resulting forward filtering backward sampling algorithm for finite state-space HMMs
to sample from the full posterior p(x1:n |y1:n ) is given in Algorithm 7.2.
Forward filtering: The prediction update from (µt−1|t−1 , Pt−1|t−1 ) to (µt|t−1 , Pt|t−1 ) can
be deduced from Equation (7.7), but a simpler way of achieving this is noticing that
the update is simply an application of linear transformation of Gaussian variables: Since
Xt = AXt−1 + Ut, and Ut is independent from all the other variables and Gaussian, too, we have

    µ_{t|t−1} = Aµ_{t−1|t−1},   P_{t|t−1} = AP_{t−1|t−1}A^T + S.
By using (7.9), we can derive the mean µ^y_{t|t−1} and the covariance P^y_{t|t−1} of the conditional density p(yt|y1:t−1), which we know to be a Gaussian density. An alternative way is to derive the moments directly:

    µ^y_{t|t−1} = E[Yt|Y1:t−1 = y1:t−1] = E[BXt + Vt|Y1:t−1 = y1:t−1] = Bµ_{t|t−1},

and
    P^y_{t|t−1} = Cov[Yt|Y1:t−1 = y1:t−1]
              = Cov[BXt + Vt|Y1:t−1 = y1:t−1]
              = Cov[BXt|Y1:t−1 = y1:t−1] + Cov[Vt|Y1:t−1 = y1:t−1]
              = B Cov[Xt|Y1:t−1 = y1:t−1] B^T + Cov[Vt]
              = BP_{t|t−1}B^T + R.                                        (7.19)
The filtering distribution p(xt |y1:t ) can be found by applying the Bayes theorem with
prior p(xt |y1:t−1 ) and likelihood g(yt |xt ). Since both are Gaussian and the relation between
Yt and Xt is linear, we can apply the conjugacy result for the mean parameter of the normal
distribution in Example 4.7 and deduce
    P_{t|t} = (P^{−1}_{t|t−1} + B^T R^{−1}B)^{−1}

and

    µ_{t|t} = P_{t|t}(P^{−1}_{t|t−1}µ_{t|t−1} + B^T R^{−1}yt).
Using the matrix inversion lemma¹ and letting P^xy_{t|t−1} = P_{t|t−1}B^T, we can rewrite

    P_{t|t} = P_{t|t−1} − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T

and

    µ_{t|t} = (P_{t|t−1} − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T)(P^{−1}_{t|t−1}µ_{t|t−1} + B^T R^{−1}yt)
            = µ_{t|t−1} + P^xy_{t|t−1}R^{−1}yt − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T P^{−1}_{t|t−1}µ_{t|t−1} − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T B^T R^{−1}yt
            = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^y_{t|t−1}R^{−1} − (P^xy_{t|t−1})^T B^T R^{−1})yt − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T P^{−1}_{t|t−1}µ_{t|t−1}
            = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}([BP_{t|t−1}B^T + R]R^{−1} − (P^xy_{t|t−1})^T B^T R^{−1})yt − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}Bµ_{t|t−1}
            = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(BP_{t|t−1}B^T R^{−1} + I − BP_{t|t−1}B^T R^{−1})yt − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}µ^y_{t|t−1}
            = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}yt − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}µ^y_{t|t−1}
            = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(yt − µ^y_{t|t−1}).      (7.21)
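The equality of the information-form update for the filtered covariance and the gain form obtained via the matrix inversion lemma can also be verified numerically. The helper below is only a sanity check of our own on randomly generated, well-conditioned matrices:

```python
import numpy as np

def kalman_cov_update_forms(P, B, R):
    """Return the filtered covariance P_{t|t} computed two equivalent ways:
    information form (P^{-1} + B^T R^{-1} B)^{-1}, and
    gain form P - P B^T (B P B^T + R)^{-1} B P."""
    info = np.linalg.inv(np.linalg.inv(P) + B.T @ np.linalg.inv(R) @ B)
    Pxy = P @ B.T                      # P^{xy}_{t|t-1}
    Py = B @ P @ B.T + R               # P^{y}_{t|t-1}
    gain = P - Pxy @ np.linalg.inv(Py) @ Pxy.T
    return info, gain
```

Agreement of the two outputs (up to floating-point error) is exactly the statement of the lemma in this special case.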
Backward smoothing: For backward smoothing, we start from µn|n and Pn|n , which
are already calculated in the last step of the forward filtering recursion, and go backwards
to derive µt|n and Pt|n from µt+1|n and Pt+1|n. Observing (7.13), we first need to derive the backward transition density

    p(xt|xt+1, y1:t) = p(xt|y1:t) f(xt+1|xt) / p(xt+1|y1:t).
This is in the form of the Bayes’ rule, with prior p(xt |y1:t ) = φ(xt ; µt|t , Pt|t ) and likelihood
f(xt+1|xt) = φ(xt+1; Axt, S). Since the relation is linear and the prior and likelihood densities are Gaussian, we know that p(xt|xt+1, y1:t) is Gaussian, too. In order to derive its mean µ^x_{t|t+1} and covariance P^x_{t|t+1}, we make use of the result in Example 4.7 again to
arrive at
    P^x_{t|t+1} = (P^{−1}_{t|t} + A^T S^{−1}A)^{−1},
    µ^x_{t|t+1} = P^x_{t|t+1}(P^{−1}_{t|t}µ_{t|t} + A^T S^{−1}x_{t+1}).
¹For invertible matrices A, B and any two matrices U and V of suitable size, the lemma states that (A + U BV)^{−1} = A^{−1} − A^{−1}U(B^{−1} + V A^{−1}U)^{−1}V A^{−1}.
Using the matrix inversion lemma again, and letting Γ_{t|t+1} = P_{t|t}A^T P^{−1}_{t+1|t}, we rewrite those moments as

    P^x_{t|t+1} = P_{t|t} − Γ_{t|t+1}AP_{t|t},
    µ^x_{t|t+1} = µ_{t|t} + Γ_{t|t+1}(x_{t+1} − µ_{t+1|t}).
One can (artificially but usefully) view this relation as follows: given Y1:n = y1:n, Xt can be written in terms of Xt+1 as

    Xt = µ_{t|t} + Γ_{t|t+1}(X_{t+1} − µ_{t+1|t}) + Et,

where Xt+1|Y1:n = y1:n ∼ N(µ_{t+1|n}, P_{t+1|n}), Et|Y1:n = y1:n ∼ N(0, P^x_{t|t+1}), and Et is independent from Xt+1 given Y1:n. From this, we use the composition rule for Gaussian distributions to derive µt|n and Pt|n from µt+1|n and Pt+1|n. We have
    µ_{t|n} = E[µ_{t|t} + Γ_{t|t+1}(X_{t+1} − µ_{t+1|t}) + Et | Y1:n = y1:n]
            = µ_{t|t} + Γ_{t|t+1}(E[X_{t+1}|Y1:n = y1:n] − µ_{t+1|t})
            = µ_{t|t} + Γ_{t|t+1}(µ_{t+1|n} − µ_{t+1|t}),                 (7.22)
and
    P_{t|n} = Cov[µ_{t|t} + Γ_{t|t+1}(X_{t+1} − µ_{t+1|t}) + Et | Y1:n = y1:n]
            = Γ_{t|t+1} Cov[X_{t+1} − µ_{t+1|t} | Y1:n = y1:n] Γ^T_{t|t+1} + P^x_{t|t+1}
            = Γ_{t|t+1} P_{t+1|n} Γ^T_{t|t+1} + P_{t|t} − Γ_{t|t+1} P_{t+1|t} Γ^T_{t|t+1}
            = P_{t|t} + Γ_{t|t+1}(P_{t+1|n} − P_{t+1|t})Γ^T_{t|t+1},

where we used P^x_{t|t+1} = P_{t|t} − Γ_{t|t+1}AP_{t|t} = P_{t|t} − Γ_{t|t+1}P_{t+1|t}Γ^T_{t|t+1}.
Forward filtering backward sampling: Similar to the finite state-space case, with the
help of backward transition distributions, we can sample from the full posterior p(x1:n |y1:n )
in a linear Gaussian HMM by using forward filtering backward sampling, which is given
in Algorithm 7.4.
Prediction:

    µ_{t|t−1} = Aµ_{t−1|t−1},
    P_{t|t−1} = AP_{t−1|t−1}A^T + S.

Filtering:

    P^y_{t|t−1} = BP_{t|t−1}B^T + R,
    µ^y_{t|t−1} = Bµ_{t|t−1},
    P^xy_{t|t−1} = P_{t|t−1}B^T,
    µ_{t|t} = µ_{t|t−1} + P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(yt − µ^y_{t|t−1}),
    P_{t|t} = P_{t|t−1} − P^xy_{t|t−1}(P^y_{t|t−1})^{−1}(P^xy_{t|t−1})^T.

Backward smoothing: for t = n − 1, . . . , 1:

    Γ_{t|t+1} = P_{t|t}A^T P^{−1}_{t+1|t},
    µ_{t|n} = µ_{t|t} + Γ_{t|t+1}(µ_{t+1|n} − µ_{t+1|t}),
    P_{t|n} = P_{t|t} + Γ_{t|t+1}(P_{t+1|n} − P_{t+1|t})Γ^T_{t|t+1}.
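The forward and backward recursions above are the Kalman filter and the backward (Rauch–Tung–Striebel) smoother. A self-contained sketch (the interface is our own):

```python
import numpy as np

def kalman_filter_smoother(ys, A, B, S, R, mu1, Sigma1):
    """Kalman filtering and backward smoothing for the linear Gaussian HMM."""
    n, dx = len(ys), A.shape[0]
    mu_f = np.empty((n, dx)); P_f = np.empty((n, dx, dx))   # mu_{t|t},   P_{t|t}
    mu_p = np.empty((n, dx)); P_p = np.empty((n, dx, dx))   # mu_{t|t-1}, P_{t|t-1}
    for t in range(n):
        if t == 0:
            mu_p[t], P_p[t] = mu1, Sigma1
        else:                                               # prediction step
            mu_p[t] = A @ mu_f[t-1]
            P_p[t] = A @ P_f[t-1] @ A.T + S
        Py = B @ P_p[t] @ B.T + R                           # P^y_{t|t-1}
        Pxy = P_p[t] @ B.T                                  # P^{xy}_{t|t-1}
        K = Pxy @ np.linalg.inv(Py)                         # Kalman gain
        mu_f[t] = mu_p[t] + K @ (ys[t] - B @ mu_p[t])       # update step
        P_f[t] = P_p[t] - K @ Pxy.T
    mu_s = mu_f.copy(); P_s = P_f.copy()                    # mu_{t|n}, P_{t|n}
    for t in range(n - 2, -1, -1):                          # backward smoothing
        Gam = P_f[t] @ A.T @ np.linalg.inv(P_p[t+1])        # Gamma_{t|t+1}
        mu_s[t] = mu_f[t] + Gam @ (mu_s[t+1] - mu_p[t+1])
        P_s[t] = P_f[t] + Gam @ (P_s[t+1] - P_p[t+1]) @ Gam.T
    return mu_f, P_f, mu_s, P_s
```

In the scalar case, the smoothed variances P_{t|n} never exceed the filtered variances P_{t|t}, which gives a convenient correctness check.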
As we saw in Sections 6.2 and 6.3, we can perform SIS and SISR methods targeting
{πn (x1:n )}n≥1 . The SMC proposal density at time n, denoted as qn (x1:n |y1:n ), is designed
conditional to the observations up to time n and state values up to time n − 1; and in the
most general case it can be written as
    qn(x1:n|y1:n) := q(x1|y1) ∏_{t=2}^{n} q(xt|x1:t−1, y1:t)
                   = qn−1(x1:n−1|y1:n−1) q(xn|x1:n−1, y1:n).              (7.26)
In fact, most of the time the transition densities q(xt|x1:t−1, y1:t) depend only on the current observation yt and the previous state xt−1, i.e. q(xt|x1:t−1, y1:t) = q(xt|xt−1, yt).
If we wanted to perform SMC using the target distribution πn directly, then we would have
We present the SIS algorithm in Algorithm 7.5 and SISR algorithm (or the particle
filter) for general state-space models in Algorithm 7.6, reminding that SIS is a special
type of SISR where there is no resampling. As in the general SISR algorithm, we can use
an optional resampling scheme, where we do resampling only when the estimated effective sample size decreases below a threshold value. In the following we list some of the aspects
of the particle filter.
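The estimated effective sample size mentioned above is commonly computed as ESS = 1/Σᵢ (W^(i))², with resampling triggered when it falls below a threshold such as N/2 (a common but arbitrary choice):

```python
import numpy as np

def ess(W):
    """Effective sample size estimate 1 / sum_i (W^{(i)})^2 of normalised weights."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W ** 2)

def should_resample(W, threshold_frac=0.5):
    """Optional-resampling rule: resample when ESS drops below a fraction of N."""
    return ess(W) < threshold_frac * len(W)
```

ESS equals N for uniform weights and 1 when a single particle carries all the weight, which matches the degeneracy discussion below.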
From the SIS algorithm (Algorithm 7.5), the weight normalisation step: for i = 1, . . . , N, calculate

    W^(i)_n = wn(X^(i)_{1:n}) / Σ_{j=1}^{N} wn(X^(j)_{1:n}).

From the SISR algorithm (Algorithm 7.6), the resampling branch: resample from {X^(i)_{1:n−1}}_{1≤i≤N} according to the weights {W^(i)_{n−1}}_{1≤i≤N} to get resampled particles {X̃^(i)_{1:n−1}}_{1≤i≤N}, each with weight 1/N. Then, for i = 1, . . . , N, sample X^(i)_n ∼ q(·|X̃^(i)_{1:n−1}, yn), set X^(i)_{1:n} = (X̃^(i)_{1:n−1}, X^(i)_n), and calculate

    W^(i)_n ∝ f(X^(i)_n|X̃^(i)_{n−1}) g(yn|X^(i)_n) / q(X^(i)_n|X̃^(i)_{n−1}, yn).
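A minimal sketch of these steps for a scalar-state model, using the transition density as the proposal and multinomial resampling at every step (the function names and interface are ours, not from the text):

```python
import numpy as np

def bootstrap_pf(ys, n_particles, sample_init, sample_trans, log_g, rng=None):
    """SISR with q = f (bootstrap particle filter), resampling at every step.
    Returns the filtering-mean estimates and a log-evidence estimate."""
    rng = np.random.default_rng(rng)
    N = n_particles
    x = sample_init(N, rng)                      # X_1^{(i)} ~ eta
    log_evidence, means = 0.0, []
    for t, y in enumerate(ys):
        if t > 0:
            x = sample_trans(x, rng)             # propagate through f
        logw = log_g(y, x)                       # incremental weight g(y_t | x_t)
        m = logw.max()                           # log-sum-exp trick for stability
        w = np.exp(logw - m)
        log_evidence += m + np.log(w.mean())     # accumulates log p(y_{1:t}) estimate
        W = w / w.sum()
        means.append(np.sum(W * x))              # E[X_t | y_{1:t}] estimate
        idx = rng.choice(N, size=N, p=W)         # multinomial resampling
        x = x[idx]
    return np.array(means), log_evidence
```

For the linear Gaussian example below, `sample_init`, `sample_trans` and `log_g` would be the prior, the AR(1) transition and the Gaussian log-likelihood, respectively.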
Therefore, it is easy to derive approximations to these distributions from each other: obtaining π^{p,N}_n from π^N_{n−1} requires a simple extension of the path X1:n−1 to X1:n through f; this is done by sampling X^(i)_n conditioned on the existing particle path X^(i)_{1:n−1}, for i = 1, . . . , N. Whereas, obtaining π^N_n from π^{p,N}_n requires a simple re-weighting of the particles according to g(yn|xn).
As a second example, the approximations to the marginal distributions πnN (xk ) are
simply obtained from the k’th components of the particles, e.g.
    π^N_n(x1:n) = Σ_{i=1}^{N} W^(i)_n δ_{X^(i)_{1:n}}(x1:n)  ⟹  π^N_n(xk) = Σ_{i=1}^{N} W^(i)_n δ_{X^(i)_{1:n}(k)}(xk).
Note that the optimal filtering problem corresponds to the case k = n. Therefore, it may
be sufficient to have a good approximation for the marginal posterior distribution of the
current state Xn rather than the whole path X1:n. This justifies the resampling step of the particle filter in practice, since resampling trades off accuracy for states Xk with k ≪ n for a good approximation for the marginal posterior distribution of Xn.
The optimal choice that minimises the variance of the incremental importance weights is,
from equation (6.5),
    q^opt(xn|xn−1, yn) = p(xn|xn−1, yn),

in which case the incremental weight becomes wn|n−1(xn−1, xn) = p(yn|xn−1), which is independent from the value of xn. First works where q^opt was used include Kong et al. (1994); Liu and Chen (1995); Liu (1996).
Another interesting choice is to take q(xn |xn−1 , yn ) = q(xn |yn ), which can be useful
when observations provide significant information about the hidden state but the state
dynamics are weak. This proposal was introduced in Lin et al. (2005) and the resulting particle filter was called the independent particle filter.
Example 7.4 (Linear Gaussian HMM). This is an illustrative example that is designed
to show both SIS and SISR (particle filter) algorithms applied to sequential Bayesian in-
ference in the following linear Gaussian HMM
η(x) = φ(x; 0, σ02 ), f (x0 |x) = φ(x0 ; ax, σx2 ), g(y|x) = φ(y; bx, σy2 ).
where Xt ∈ R and Yt ∈ R and hence a, b, σ02 , σx2 , and σy2 are all scalars.
In this example, we first generated y1:n with n = 10 using a = 0.99, b = 1, σx2 = 1,
σy² = 1 and σ0² = 4. Our task is to run and compare the SMC algorithms, namely SIS and SISR, for sequential approximation of πt(x1:t) = p(x1:t|y1:t), t = 1, . . . , n.
Since the HMM here is linear and Gaussian, the problem is analytically tractable, πt ’s are
all Gaussian, and we can find those πt ’s without any need to do Monte Carlo, for example
using the Kalman filter. We use SIS and SISR merely for illustrative purposes.
We ran SIS in Algorithm 7.5 with q1(x1) = η(x1) and q(xn|xn−1, yn) = f(xn|xn−1) for n > 1, so that wn|n−1(xn−1, xn) = g(yn|xn), which does not depend on xn−1, and hence W^(i)_n ∝ g(yn|X^(i)_n) for all n ≥ 1. The top row of Figure 7.2 shows the initialisation phase, both before and after weighting the initially generated particles X^(1)_1, . . . , X^(N)_1, whose locations are shown on the y-axis and whose weights are represented by the sizes of the balls centred around their values. The red curve represents the incremental weight function w1(x1) = g(y1|x1)
versus x1 (located on the y-axis). Some of the later steps of SIS are shown in Figure 7.2, from the
second row on. Starting from the second row, each row shows (i) the particles and their
weights from the previous time, (ii) The propagation and extension of the particles for the
new time step, and (iii) the update of the particle weights. Note that, due to our particular choice of the importance density q(xt|xt−1, yt), the incremental weights wt|t−1(xt−1, xt) = g(yt|xt) depend only on the current value xt, so for this example it is actually possible to show them as a function of xt, which we have done by the red curve in the plots. Note that the size of the ball around the value of a particle represents the weight of the whole path X^(i)_{1:t}. Also, notice the weight degeneracy problem in the SIS algorithm, since there is
no resampling procedure. At time t = 10, we have effectively only one useful particle to
approximate π10 (x1:10 ) = p(x1:10 |y1:10 ), which is not a good sign.
Figure 7.2: SIS: propagation and weighting of particles for several time steps. Each row
shows (i) the particles and their weights from the previous time, (ii) The propagation and
extension of the particles for the new time step, and (iii) Update of the particle weights.
Notice the weight degeneracy problem.
Next, we ran the SISR algorithm, i.e. the particle filter, in Algorithm 7.6 for the same problem. The initialisation is just the same as in SIS, see the top row of Figure 7.3. The remaining plots in Figure 7.3 show some later steps of SISR. Starting from the second row, each row shows (i) the particles and their weights from the previous time, (ii) the resampling, propagation and extension of the particles for the new time step, and (iii) the update of the particle weights. Notice how the weight degeneracy problem is alleviated by resampling: there are distinct particles having closer weights to each other than in SIS. But several resampling steps in succession lead to the path degeneracy problem: π^N_{10}(x1:10) has only one support point for x1.
Example 7.5 (Tracking a moving target). We consider the HMM for a moving target
in Example 7.3. Our objective is to estimate the position of the target at times t = 1, 2, . . .
given the observations up to time t. With the particle filter, we can approximate πt (x1:t ) =
p(x1:t |y1:t ) as
    π^N_t(x1:t) = Σ_{i=1}^{N} W^(i)_t δ_{X^(i)_{1:t}}(x1:t).
As discussed already, this approximation can be used to approximate the filtering distribu-
tion
    p(xt|y1:t) ≈ Σ_{i=1}^{N} W^(i)_t δ_{X^(i)_t}(xt).
We can use the position components of X^(i)_t = (V^(i)_t, P^(i)_t) in order to estimate the current position from the observations:

    E[Pt(j)|Y1:t = y1:t] ≈ P̂^N_t(j) = Σ_{i=1}^{N} W^(i)_t P^(i)_t(j),   j = 1, 2.
Figure 7.4 illustrates a target tracking scenario depicted above for 500 time steps. On the
top left plot we see the position (in red) and its filtering estimate P̂tN (in black) given the
sensor measurements on the right plot up to the current time t, with N = 1000 particles.
The lower plots show the performance of the particle filter at each direction separately.
• In the first example of such models, the process {Xn }n≥1 is still a Markov chain;
however the conditional distribution of Yn , given all past variables X1:n and Y1:n−1 ,
depends not only on the value of Xn but also on the values of past observations
Figure 7.3: SISR: resampling, propagation and weighting of particles for several time steps. Each row shows (i) the particles and their weights from the previous time, (ii) the resampling, propagation and extension of the particles for the new time step, and (iii) the update of the particle weights. Notice the path degeneracy problem: π^N_{10}(x1:10) has only one support point for x1.
(Figure 7.4: the position components P(1, t) and P(2, t) of the target and their estimates versus time, over 500 time steps.)
• In another type of time series models that are not HMMs, the latent process {Xn}n≥1 is, again, still a Markov chain; however, the observation at the current time depends on all the past values, i.e. Yn conditional on (X1:n, Y1:n−1) depends on all of these conditioned
random variables. Actually, these models are usually the result of marginalising an
extended HMM. Consider the HMM {(Xn , Zn ), Yn }n≥1 , where the latent joint process
{Xn, Zn}n≥1 is a Markov chain such that its transitional density can be factorised as

    f(xn, zn|xn−1, zn−1) = f1(xn|xn−1) f2(zn|xn, zn−1),

and the observation Yn depends only on Xn and Zn given all the past random variables and admits the probability density g(yn|xn, zn). Now, the reduced bivariate process {Xn, Yn}n≥1 is not an HMM and we express the joint density of (X1:n, Y1:n) as

    p(x1:n, y1:n) = η(x1) p1(y1|x1) ∏_{t=2}^{n} f1(xt|xt−1) pt(yt|x1:t, y1:t−1).
The reason {Xn, Yn}n≥1 might be of interest is that the conditional laws of Z1:n may be available in closed form and exact evaluation of the integral in (7.29) is available.
In that case, it can be more effective to perform Monte Carlo approximation for the
law of X1:n given observations Y1:n , which leads to the so called Rao-Blackwellised
particle filters in the literature (Doucet et al., 2000a).
The integration is indeed available in closed form for some time series models. One
example is the linear Gaussian switching state space models (Chen and Liu, 2000;
Doucet et al., 2000a; Fearnhead and Clifford, 2003), where Xn takes values on a finite
set whose elements are often called ‘labels’, and conditioned on {Xn }n≥1 , {Zn , Yn }n≥1
is a linear Gaussian state-space model.
We note that the computational tools developed for HMMs are generally applicable to
a more general class of time series models with some suitable modifications.
η(x1 , z1 ) = η1 (x1 )η2 (z1 |x1 ), f (xn , zn |xn−1 , zn−1 ) = f1 (xn |xn−1 )f2 (zn |xn , zn−1 ).
Also, conditioned on (xn, zn), the distribution of the observation Yn admits a density g(yn|xn, zn) with respect to ν. We are interested in the case where the posterior distribution πn(x1:n, z1:n) := p(x1:n, z1:n|y1:n) is to be approximated, along with expectations πn(ϕn) = E[ϕn(X1:n, Z1:n)|Y1:n = y1:n] for functions ϕn : X^n × Z^n → R. Obviously, one way to do this is to run an SMC filter for
{πn }n≥1 which obtains the approximation πnN at time n as
    π^N_n(x1:n, z1:n) = Σ_{i=1}^{N} W^(i)_n δ_{(X^(i)_{1:n}, Z^(i)_{1:n})}(x1:n, z1:n),   Σ_{i=1}^{N} W^(i)_n = 1.
However, when the conditional posterior distribution π2,n(z1:n|x1:n) := p(z1:n|x1:n, y1:n) is analytically tractable, there is a better SMC scheme for approximating πn and estimating πn(ϕn). This SMC scheme is called the Rao-Blackwellised particle filter (RBPF) (Doucet et al., 2000a). Consider the following decomposition, which follows from the chain rule:

    πn(x1:n, z1:n) = π1,n(x1:n) π2,n(z1:n|x1:n),   π1,n(x1:n) := p(x1:n|y1:n).
The RBPF is a particle filter for the sequence of marginal distributions {π1,n }n≥1 which
produces at time n the approximation
    π^N_{1,n}(x1:n) = Σ_{i=1}^{N} W^(i)_{1,n} δ_{X^(i)_{1:n}}(x1:n),   Σ_{i=1}^{N} W^(i)_{1,n} = 1,
and the Rao-Blackwellised approximation of the full posterior distribution combines the particle filter estimate π^N_{1,n} and the exact conditional distribution π2,n:

    π^N_n(x1:n, z1:n) = π^N_{1,n}(x1:n) π2,n(z1:n|x1:n).
Assuming q(x1:n |y1:n ) = q(x1:n−1 |y1:n−1 )q(xn |x1:n−1 , y1:n ) is used as the proposal distribu-
tion, the incremental importance weight for the RBPF is given by
    w_{1,n|n−1}(x1:n) = f1(xn|xn−1) p(yn|x1:n, y1:n−1) / q(xn|x1:n−1, y1:n).
Also, the optimum importance density, which minimises the variance of w1,n|n−1, is obtained when the incremental importance density q(xn|x1:n−1, y1:n) is taken to be p(xn|x1:n−1, y1:n), which results in w1,n|n−1(x1:n) being equal to p(yn|x1:n−1, y1:n−1).
The use of the RBPF whenever it is possible is intuitively justified by the fact that we
substitute particle approximation of some expectations with their exact values. Indeed,
the theoretical analysis in Doucet et al. (2000a) and Chopin (2004, Proposition 3) revealed
that the RBPF has better precision than the regular particle filter: the estimates of the
RBPF never have larger variances. These favourable results for the RBPF are basically due to the Rao-Blackwell theorem (see e.g. Blackwell (1947)), after which the proposed particle filter gets its name.
The RBPF was formulated by Doucet et al. (2000a) and has been implemented in various settings by Chen and Liu (2000); Andrieu and Doucet (2002); Särkkä et al. (2004), among many others.
Exercises
1. Write your own log sum exp function that takes an array of numbers a = [a1 , . . . , am ]
and returns
log (ea1 + . . . + eam ) .
in a numerically safe way. Try your code with
Compare each answer with the naive solution which we would have if we typed
directly log(sum(exp(a))).
[Hint: log(e^a + e^b) = log(e^{a−max{a,b}} + e^{b−max{a,b}}) + max{a, b}.]
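One possible implementation of the hint in Python (a reference sketch; you should first attempt your own version):

```python
import numpy as np

def log_sum_exp(a):
    """Numerically safe log(e^{a_1} + ... + e^{a_m}), using the hint:
    subtract the maximum before exponentiating, then add it back."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))
```

Unlike the naive `log(sum(exp(a)))`, this neither overflows for large entries nor underflows to −∞ for very negative ones.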
2. Consider again the linear Gaussian HMM of Example 7.4.

• Generate hidden states x1:n and observations y1:n for n = 2000, a = 0.99, b = 1, σ0² = 5, σx² = 1, σy² = 4.
• You already have the code for this model that performs the particle filter, i.e. the SISR algorithm with the importance (proposal) density being equal to the transition density, more explicitly q(xn|xn−1, yn) = f(xn|xn−1). Run the SISR algorithm for y1:n with N = 100 and this choice of importance density, and at each time step estimate the posterior mean E[Xt|Y1:t = y1:t] from the particles:
    X̂^N_t = Σ_{i=1}^{N} W^(i)_t X^(i)_t.

Calculate the mean squared error (MSE) for X̂t, that is, (1/n) Σ_{t=1}^{n} (X̂t − xt)².
• Now, set N = 1000 and calculate the MSE again. Compare it with the previous
one.
• This time take N = 100 and run the SISR algorithm with the optimum choice for the importance density, q(xn|xn−1, yn) = p(xn|xn−1, yn). Show that this leads to the incremental weight wn|n−1(xn−1, xn) = p(yn|xn−1).
(Since this is a linear Gaussian model, both p(xn |xn−1 , yn ) and p(yn |xn−1 ) are
available to sample from and calculate, respectively.) Calculate the MSE for X̂t
and compare it with the previous ones that you found.
3. Consider the target tracking problem in Example 7.5. It may not be a good modelling practice to take the noise in the measurements as additive Gaussian, for two reasons: (i) one may expect a bigger error when a longer distance is measured; (ii) when the noise is Gaussian, the noisy measurement is allowed to be negative, which does not make sense. That is why, instead of the existing observation model, consider the following one with multiplicative nonnegative noise:
    Yt,i = Rt,i Et,i,   Et,i ∼ ln N(0, σy²) i.i.d.,   i = 1, 2, 3,        (7.31)
where ln N(µ, σ²) denotes the lognormal distribution with location and scale parameters µ and σ². It can be shown that if Et,i ∼ ln N(µ, σ²), then log Et,i ∼ N(µ, σ²), so we
effectively have
log Yt,i ∼ N (log Rt,i , σy2 ).
• Generate data according to the new model for n = 500 using σy² = 0.1, a = 0.99, σp² = 0.001, σv² = 0.01, σbv² = 0.01, σbp² = 4, and the sensor locations being the same as in the previous examples.
same as in the previous examples.
• Write down the new observation density g(yt |xt ) according to (7.31).
• Run a particle filter for the data you have generated with N = 1000 particles,
using q(x1 |y1 ) = η(x1 ) and q(xt |xt−1 , yt ) = f (xt |xt−1 ). Calculate the posterior
mean estimates for the position versus t, E[Pt (i)|Y1:t = y1:t ], i = 1, 2. Generate
results similar to the ones in Example 7.5.
• Remove one of the sensors and repeat your experiments. Comment on the
results.
Appendix A

Some Basics of Probability

(A1) Non-negativity: The probability of an event is a non-negative real number:

    P(E) ∈ R,  P(E) ≥ 0,  ∀E ∈ F.
(A2) Unitarity: The probability that at least one of the elementary events in the entire
sample space will occur is 1
P(Ω) = 1.
(A3) σ-additivity: A countable sequence of disjoint sets (or mutually exclusive sets) E1, E2, . . . (Ei ∩ Ej = ∅ for all i ≠ j) satisfies

    P(∪_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} P(Ei).
Any function that satisfies those three axioms can be a probability measure. These axioms
lead to some useful properties of probability that we are familiar with.
(P1) The probability of the empty set:
P(∅) = 0.
(P2) Monotonicity:
P(A) ≤ P(B), ∀A, B ∈ F : A ⊆ B.
APPENDIX A. SOME BASICS OF PROBABILITY 112
A random variable X is a function

    X : Ω → R

such that {ω ∈ Ω : X(ω) ≤ x} ∈ F for all x ∈ R. We need this condition since we need the probability of this set in order to construct our cumulative distribution function.
• The use of ≤ (and not <) is important. Especially for discrete random variables,
this matters a lot.
• Note that X, written in capital letter, represents the randomness in the probability
statement while x is a given certain value in R.
(P2) F is right continuous (no jumps occur when the limit point is approached from the
right).
Hence, the cdf F of X is a step function where jumps occur at points xi with jump height
being p(xi ) = P(X = xi ) = F (xi ) − F (xi−1 ).
Some discrete distributions: Some well known distributions with a pmf (hence the cdf
is a step function): Bernoulli B(ρ), Geometric distribution Geo(ρ), Binomial distribution
Binom(n, ρ), Negative binomial NB(r, ρ), Poisson distribution PO(λ).
Also, if F is right differentiable, we can define the probability density function (pdf) for X:

    p(x) := lim_{h↓0} [F(x + h) − F(x)] / h = ∂+F(x)/∂x,   x ∈ R.
From the above equation, we can conclude that P(X = x) = 0 for any x ∈ R, because

    ∫_x^x p(u) du = F(x) − F(x) = 0.
The first moment (n = 1) is called the expectation of X, also sometimes referred to as the
mean of X.
If |E(X)| < ∞, the n'th central moment of X, n ≥ 1, is defined for discrete and
continuous random variables as follows:
E([X − E(X)]^n) := \begin{cases} \sum_i [x_i − E(X)]^n p(x_i), & \text{if } X \text{ is discrete}, \\ \int_{-\infty}^{\infty} [x − E(X)]^n p(x)\, dx, & \text{if } X \text{ is continuous}. \end{cases} \quad (A.2)
The second central moment is the most notable of them; it is called the variance of X
and is denoted by V(X):
V(X) := E([X − E(X)]^2).
A useful identity relating V(X) to the expectation and the second moment of X is
V(X) = E(X^2) − E(X)^2.
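As a quick numerical sanity check, not part of the original text, the identity V(X) = E(X²) − E(X)² can be verified empirically; the fair six-sided die below is an arbitrary illustrative choice of discrete random variable:

```python
import random

# Empirical check of the identity V(X) = E(X^2) - E(X)^2 on a fair
# six-sided die (an illustrative choice of discrete random variable).
random.seed(0)
samples = [random.randint(1, 6) for _ in range(200_000)]
n = len(samples)

mean = sum(samples) / n
second_moment = sum(x * x for x in samples) / n

# Variance computed directly from the definition of the second central moment ...
var_direct = sum((x - mean) ** 2 for x in samples) / n
# ... and via the identity.
var_identity = second_moment - mean ** 2

assert abs(var_direct - var_identity) < 1e-6
```

Both quantities also approach the exact variance 35/12 ≈ 2.917 as the sample size grows.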
The marginal cdf's for X and Y can be deduced from F_{X,Y}(x, y):
F_X(x) = \lim_{y \to \infty} F_{X,Y}(x, y), \quad F_Y(y) = \lim_{x \to \infty} F_{X,Y}(x, y).
The expectation of any function g of X, Y can be evaluated using the joint pmf, for example
E(g(X, Y)) = \sum_{i,j} p_{X,Y}(x_i, y_j)\, g(x_i, y_j).
Continuous variables: Similar to the joint pmf defined for discrete X and Y, one can
define the joint pdf for continuous X and Y, assuming F is right-differentiable,
p_{X,Y}(x, y) := \frac{\partial_+^2 F(x, y)}{\partial x\, \partial y},
so that for any a, b, we have
F_{X,Y}(a, b) = \int_{-\infty}^{b} \int_{-\infty}^{a} p_{X,Y}(x, y)\, dx\, dy.
The expectation of any function g of X, Y can be evaluated using the joint pdf,
E(g(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, g(x, y)\, dx\, dy.
The marginal pdf's for X and Y can be obtained as
p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dy, \quad p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dx.
Independence: We say random variables X and Y are independent if for all pairs of
sets A ⊆ R, B ⊆ R we have
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
If X and Y are discrete variables taking values x_i, i = 1, 2, . . . and y_j, j = 1, 2, . . ., then
independence between X and Y can be expressed as
p_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j) = p_X(x_i) p_Y(y_j), \quad ∀i, j.
If X and Y are continuous variables, then independence between X and Y can be expressed
as
pX,Y (x, y) = pX (x)pY (y), ∀x, y ∈ R.
The conditional probability of an event A given an event B with P(B) > 0 is defined as
P(A|B) = \frac{P(A ∩ B)}{P(B)}.
Bayes' rule is derived from this definition, and it relates the two conditional probabilities
P(A|B) and P(B|A):
P(A|B) = \frac{P(A)\, P(B|A)}{P(B)}. \quad (A.3)
This relation can be written in terms of two random variables. Suppose X, Y are discrete
random variables with joint pmf p_{X,Y}(x, y), where x ∈ X = {x_1, x_2, . . .} and y ∈ Y =
{y_1, y_2, . . .}, so that the marginal pmf's are
p_X(x) = \sum_{y ∈ Y} p_{X,Y}(x, y), \quad p_Y(y) = \sum_{x ∈ X} p_{X,Y}(x, y), \quad x ∈ X, y ∈ Y.
Then the conditional pmf's p_{Y|X}(y|x) and p_{X|Y}(x|y) are defined as
p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)}, \quad (A.4)
p_{X|Y}(x|y) = \frac{p_X(x)\, p_{Y|X}(y|x)}{p_Y(y)}. \quad (A.5)
When X, Y are continuous random variables taking values in X and Y, respectively,
with a joint pdf p_{X,Y}(x, y), similar definitions follow: the marginal pdf's are
p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dy, \quad p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, dx.
The conditional pdf’s are defined exactly the same way as in (A.4) and (A.5).
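Bayes' rule for discrete variables translates directly into code. The following sketch uses made-up pmf values purely for illustration:

```python
# Discrete Bayes' rule: prior pmf p_X and likelihood p_{Y|X} are made-up
# example numbers; posterior() computes p_{X|Y}(.|y) via Bayes' rule.
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

def posterior(y):
    """Conditional pmf p_{X|Y}(x|y) = p_X(x) p_{Y|X}(y|x) / p_Y(y)."""
    p_y = sum(p_x[x] * p_y_given_x[x][y] for x in p_x)  # marginal p_Y(y)
    return {x: p_x[x] * p_y_given_x[x][y] / p_y for x in p_x}

post = posterior(1)
assert abs(sum(post.values()) - 1.0) < 1e-12  # the posterior is a valid pmf
```

Observing y = 1 shifts the mass toward x = 1, the value under which y = 1 is most likely.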
Appendix B
Solutions
Y = Z with probability 1/2 and Y = −Z with probability 1/2. One can verify, using
composition, that Y has the pdf p_Y(y) given above. Therefore, we can generate
X ∼ Laplace(a, b) as follows:
• Generate Z ∼ Exp(b).
• Set Y = Z or Y = −Z, each with probability 1/2.
• Set X = Y + a.
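The three steps above can be sketched in Python, assuming Exp(b) denotes the exponential distribution with rate b (mean 1/b):

```python
import random

def sample_laplace(a, b, rng=random):
    """Generate X ~ Laplace(a, b) by the composition above:
    Z ~ Exp(b), flip the sign of Z with probability 1/2, then shift by a."""
    z = rng.expovariate(b)                 # Z ~ Exp(b), rate parameterisation
    y = z if rng.random() < 0.5 else -z    # Y = Z or Y = -Z, each w.p. 1/2
    return y + a                           # X = Y + a

random.seed(1)
xs = [sample_laplace(2.0, 1.5) for _ in range(100_000)]
mean = sum(xs) / len(xs)                   # should be close to a = 2
```

The sample mean concentrates around a, and the sample variance around 2/b², matching the Laplace distribution in this parameterisation.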
4. We have that X′ has pdf p_{X′}(x) = q(x), with p̂(x) = p(x) Z_p and q̂(x) = q(x) Z_q, and
P(Accept | X′ = x) = \frac{p̂(x)}{M q̂(x)} = \frac{1}{M} \frac{Z_p}{Z_q} \frac{p(x)}{q(x)}.
So the acceptance probability can be derived as
P(Accept) = \int P(Accept | X′ = x)\, p_{X′}(x)\, dx = \int \frac{1}{M} \frac{Z_p}{Z_q} \frac{p(x)}{q(x)}\, q(x)\, dx = \frac{1}{M} \frac{Z_p}{Z_q} \int p(x)\, dx = \frac{1}{M} \frac{Z_p}{Z_q}.
The validity can be verified by considering the distribution of the accepted samples,
using Bayes' theorem.
The target pdf is
p(x) = \frac{x^{a−1}(1−x)^{b−1}}{B(a, b)} ∝ x^{a−1}(1−x)^{b−1} =: p̂(x), \quad x ∈ (0, 1).
We have Q = Unif(0, 1), so q(x) = 1 and the ratio p̂(x)/q(x) = p̂(x).
• First, it can be seen that the ratio is unbounded for a < 1 or b < 1, so Q =
Unif(0, 1) cannot be used.
• When a = b = 1, X itself is uniform, so the case is trivial.
• For a ≥ 1 and b ≥ 1 with at least one of them strictly greater than 1, the first
derivative of p̂(x) is equal to 0 at x = \frac{a−1}{a+b−2}, and the second derivative
of \log p̂(x) at that value of x is −(a+b−2)^2 \left( \frac{1}{a−1} + \frac{1}{b−1} \right),
which is negative, so x^* = \frac{a−1}{a+b−2} is a maximum point, yielding
p̂(x)/q(x) ≤ p̂(x^*) = \left( \frac{a−1}{a+b−2} \right)^{a−1} \left( \frac{b−1}{a+b−2} \right)^{b−1},
so the smallest (hence the best) M we can choose is
M^* = \left( \frac{a−1}{a+b−2} \right)^{a−1} \left( \frac{b−1}{a+b−2} \right)^{b−1}.
Hence the rejection sampling algorithm can be applied as follows:
(a) Sample X′ ∼ Unif(0, 1) and U ∼ Unif(0, 1).
(b) If U ≤ \frac{(X′)^{a−1} (1 − X′)^{b−1}}{\left( \frac{a−1}{a+b−2} \right)^{a−1} \left( \frac{b−1}{a+b−2} \right)^{b−1}}, accept X = X′; else restart.
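Steps (a)-(b) can be sketched in Python; the bound M* is computed inside the function, and the a = b = 1 case falls back to plain uniform sampling:

```python
import random

def sample_beta_rejection(a, b, rng=random):
    """Rejection sampler for Beta(a, b), a, b >= 1, with a Unif(0,1) proposal
    and the bound M* = p_hat(x*) derived above."""
    if a == b == 1:
        return rng.random()                        # Beta(1,1) is Unif(0,1)
    mode = (a - 1) / (a + b - 2)                   # maximiser x* of p_hat
    m = mode ** (a - 1) * (1 - mode) ** (b - 1)    # M* = p_hat(x*)
    while True:
        x = rng.random()                           # proposal X' ~ Unif(0,1)
        u = rng.random()                           # U ~ Unif(0,1)
        if u * m <= x ** (a - 1) * (1 - x) ** (b - 1):
            return x                               # accept X = X'

random.seed(2)
xs = [sample_beta_rejection(2.0, 3.0) for _ in range(50_000)]
mean = sum(xs) / len(xs)                           # Beta(2,3) mean is a/(a+b) = 0.4
```

The sample mean agrees with the Beta(2, 3) mean 2/5; a smaller M* (i.e. a flatter target) would give a higher acceptance rate.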
The variance of the plug-in estimator P^N_{MC}(φ) = \frac{1}{N} \sum_{i=1}^{N} X_i^2, X_i ∼ P, is
V(P^N_{MC}(φ)) = \frac{E(X^4) − E(X^2)^2}{N} = \frac{3σ^4 − σ^4}{N} = \frac{2σ^4}{N}.
Therefore the IS estimator provides a variance-reduction factor of ≈ 0.22.
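The 2σ⁴/N formula can be checked by simulation, assuming, as in the calculation above where E(X⁴) = 3σ⁴, that P is the normal distribution N(0, σ²); the values of σ, N, and the number of repetitions are arbitrary:

```python
import random
import statistics

# Repeat the plug-in estimator many times and compare its empirical variance
# with the theoretical value 2*sigma^4/N (here sigma = 1, N = 100).
random.seed(3)
sigma, big_n, reps = 1.0, 100, 10_000

estimates = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma) for _ in range(big_n)]
    estimates.append(sum(x * x for x in xs) / big_n)   # (1/N) sum X_i^2

empirical_var = statistics.pvariance(estimates)
theoretical_var = 2 * sigma ** 4 / big_n               # = 0.02
```

The empirical variance of the repeated estimates matches 2σ⁴/N closely, while their mean matches the estimand E(X²) = σ².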
The estimator is the Monte Carlo average \frac{1}{N} \sum_{i=1}^{N} E_{10}^{(i)},
where E_{10}^{(i)} is the computed completion time for the i'th sample X^{(i)}.
which is in the same form as the pdf of Γ(α + 1, β + y). Therefore, α_{x|y} = α + 1 and
β_{x|y} = β + y.
3. Let us define
μ(y) = E(X|Y = y) = \int x\, p(x|y)\, dx
for brevity (we can define such a μ since E(X|Y = y) is a function of y only). We can
write any estimator as X̂(Y) = μ(Y) + X̂(Y) − μ(Y). Now, the expected MSE is
E((X̂(Y) − X)^2) = \int\!\!\int (X̂(y) − x)^2\, p(x, y)\, dx\, dy
= \int \left[ \int (X̂(y) − x)^2\, p(x|y)\, dx \right] p(y)\, dy
= \int \left[ \int [μ(y) − x + X̂(y) − μ(y)]^2\, p(x|y)\, dx \right] p(y)\, dy
= \int \left[ \int (μ(y) − x)^2\, p(x|y)\, dx + (X̂(y) − μ(y))^2 \right] p(y)\, dy, \quad (B.1)
where the last equation follows since the cross term is zero:
\int (μ(y) − x)\, p(x|y)\, dx = μ(y) − \int x\, p(x|y)\, dx = 0.
The first term in (B.1) does not depend on the estimator X̂(y), so we have control
only over the second term (X̂(y) − μ(y))^2, which is always nonnegative and therefore
minimal when X̂(y) − μ(y) = 0, i.e. X̂(y) = μ(y) = E(X|Y = y). Since this is true
for all y, we conclude that the estimator X̂(Y) = E(X|Y), as a random variable of
Y, minimises the expected MSE.
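The result can be illustrated numerically. In the made-up model below, X ∼ N(0, 1) and Y | X = x ∼ N(x, 1), for which standard Gaussian conjugacy gives μ(y) = E(X|Y = y) = y/2; the posterior-mean estimator should beat any competitor, e.g. X̂(Y) = Y:

```python
import random

# Compare the empirical MSE of the posterior mean mu(y) = y/2 against the
# naive estimator X_hat(y) = y in the model X ~ N(0,1), Y|X=x ~ N(x,1).
random.seed(4)
pairs = []
for _ in range(100_000):
    x = random.gauss(0.0, 1.0)
    pairs.append((x, random.gauss(x, 1.0)))

def mse(estimator):
    return sum((estimator(y) - x) ** 2 for x, y in pairs) / len(pairs)

mse_posterior_mean = mse(lambda y: y / 2)   # approx 1/2 (the posterior variance)
mse_naive = mse(lambda y: y)                # approx 1, strictly worse
assert mse_posterior_mean < mse_naive
```

The gap between the two MSEs is exactly E((X̂(Y) − μ(Y))²), the second term in (B.1).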
where S_{x|y} = (S^{−1} + A^T R^{−1} A)^{−1} and m_{x|y} = S_{x|y}(S^{−1}m + A^T R^{−1} y). By obser-
vation, we can see that X is univariate with m = 0 and S = σ_x^2, A is
an n × 1 vector with A(t) = \sin(2πt/T), t = 1, . . . , n, and R = σ_y^2 I_n. Therefore
p(x|y_{1:n}) = φ(x; m_{x|y}, S_{x|y}) with
S_{x|y} = \left( \frac{1}{σ_x^2} + \frac{1}{σ_y^2} \sum_{t=1}^{n} \sin^2(2πt/T) \right)^{−1}, \quad (B.2)
and
m_{x|y} = S_{x|y} \frac{1}{σ_y^2} \sum_{t=1}^{n} \sin(2πt/T)\, y_t.
It is possible to derive p(y_{1:n}) from p(y_{1:n}) = p(x, y_{1:n})/p(x|y_{1:n}), but there is
an easier way when the prior and the likelihood are Gaussian and the relation
between X and Y is linear. One can view
Y = AX + V,
so that, prior to conditioning on any data,
Y_{n+1} ∼ N(0, \sin^2(2π(n+1)/T)\, σ_x^2 + σ_y^2). \quad (B.3)
Moreover,
Y_{n+1} | X = x, Y_{1:n} = y_{1:n} ∼ N(x \sin(2π(n+1)/T), σ_y^2)
as before, since Y_{n+1} is conditionally independent of Y_{1:n} given X. We can
use the same mechanism as before and derive
Y_{n+1} | Y_{1:n} = y_{1:n} ∼ N(\sin(2π(n+1)/T)\, m_{x|y}, \sin^2(2π(n+1)/T)\, S_{x|y} + σ_y^2). \quad (B.4)
Comparing the variances in (B.3) and (B.4), we see that the second one is
smaller since S_{x|y} < σ_x^2, see (B.2). The decrease in the variance, hence in the
uncertainty, is due to the information that comes from Y_{1:n} = y_{1:n}.
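Formula (B.2) and the posterior mean are easy to evaluate numerically. In the sketch below the parameter values and the synthetic data, a true amplitude of 1.3 plus a fixed offset standing in for noise, are made up purely to exercise the formulas:

```python
import math

# Posterior for X in Y_t = X sin(2*pi*t/T) + V_t, with X ~ N(0, sigma_x^2)
# and V_t ~ N(0, sigma_y^2).  Data are synthetic, chosen only to exercise (B.2).
sigma_x2, sigma_y2, T, n = 4.0, 0.25, 20, 40
x_true = 1.3
y = [x_true * math.sin(2 * math.pi * t / T) + 0.1 for t in range(1, n + 1)]

s2 = sum(math.sin(2 * math.pi * t / T) ** 2 for t in range(1, n + 1))
S_xy = 1.0 / (1.0 / sigma_x2 + s2 / sigma_y2)           # posterior variance (B.2)
m_xy = S_xy * sum(math.sin(2 * math.pi * t / T) * y[t - 1]
                  for t in range(1, n + 1)) / sigma_y2  # posterior mean

assert S_xy < sigma_x2   # conditioning on the data shrinks the prior variance
```

With two full periods of data, the posterior mean lands very close to the amplitude used to generate y.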
Conditional on x_{t−1}, X_t has prior N(a x_{t−1}, σ_x^2) and Y_t | X_t = x_t ∼ N(b x_t, σ_y^2). There-
fore, using conjugacy, we have
p(x_t | x_{t−1}, y_t) = φ(x_t; μ_q, σ_q^2), \quad σ_q^2 = \left( 1/σ_x^2 + b^2/σ_y^2 \right)^{−1}, \quad μ_q = σ_q^2 (a x_{t−1}/σ_x^2 + b y_t/σ_y^2),
which only depends on x_{t−1} and y_t. It can be checked that Y_t = b X_t + V_t, V_t ∼ N(0, σ_y^2),
given X_{t−1} = x_{t−1} is Gaussian with mean b a x_{t−1} and variance b^2 σ_x^2 + σ_y^2, i.e.
w_t(x_{t−1}) = p(y_t | x_{t−1}) = φ(y_t; b a x_{t−1}, b^2 σ_x^2 + σ_y^2).
For t = 1, we get similar results by replacing f(x_t | x_{t−1}) with η(x_1) and considering
Y_1 = b X_1 + V_1 with V_1 ∼ N(0, σ_y^2). This results in
q(x_1 | y_1) = p(x_1 | y_1) = φ(x_1; μ_q, σ_q^2), \quad σ_q^2 = \left( 1/σ_0^2 + b^2/σ_y^2 \right)^{−1}, \quad μ_q = σ_q^2\, b y_1/σ_y^2,
and
w_1(x_1) = p(y_1) = φ(y_1; 0, b^2 σ_0^2 + σ_y^2).
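The conjugacy formulas above can be sketched in Python; the model constants in the example call at the end are arbitrary:

```python
import random

# Locally optimal proposal for the linear-Gaussian model
# X_t = a X_{t-1} + W_t, W_t ~ N(0, sigma_x^2);  Y_t = b X_t + V_t, V_t ~ N(0, sigma_y^2).
def optimal_proposal_params(x_prev, y_t, a, b, sigma_x2, sigma_y2):
    var_q = 1.0 / (1.0 / sigma_x2 + b * b / sigma_y2)
    mu_q = var_q * (a * x_prev / sigma_x2 + b * y_t / sigma_y2)
    return mu_q, var_q

def sample_optimal_proposal(x_prev, y_t, a, b, sigma_x2, sigma_y2, rng=random):
    """Draw x_t ~ p(x_t | x_{t-1}, y_t) = N(mu_q, var_q)."""
    mu_q, var_q = optimal_proposal_params(x_prev, y_t, a, b, sigma_x2, sigma_y2)
    return rng.gauss(mu_q, var_q ** 0.5)

mu_q, var_q = optimal_proposal_params(x_prev=0.5, y_t=1.0, a=0.9, b=1.0,
                                      sigma_x2=1.0, sigma_y2=1.0)
# Here var_q = 1/(1 + 1) = 0.5 and mu_q = 0.5 * (0.45 + 1.0) = 0.725.
```

Note that var_q is always smaller than σ_x², reflecting the information contributed by the current observation y_t.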
3. Since each Y_{t,i} is lognormally distributed with parameters \log R_{t,i} and σ_y^2, we have
g(y_t | x_t) = \prod_{i=1}^{3} \frac{1}{\sqrt{2π σ_y^2}\, y_{t,i}} \exp\left( −\frac{1}{2σ_y^2} (\log y_{t,i} − \log R_{t,i})^2 \right).
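In practice the likelihood is evaluated in log form for numerical stability. A sketch, taking the values R_{t,i}, which in the exercise depend on x_t, as given inputs:

```python
import math

def log_g(y_t, r_t, sigma_y):
    """Log-likelihood log g(y_t | x_t): sum of three lognormal log-densities
    with locations log r_t[i] and common scale sigma_y."""
    total = 0.0
    for y, r in zip(y_t, r_t):
        total += (-0.5 * math.log(2 * math.pi * sigma_y ** 2)
                  - math.log(y)
                  - (math.log(y) - math.log(r)) ** 2 / (2 * sigma_y ** 2))
    return total

# When y_{t,i} = R_{t,i}, the squared terms vanish and only the
# normalising terms remain.
val = log_g([1.0, 2.0, 0.5], [1.0, 2.0, 0.5], sigma_y=1.0)
```

Working with log g avoids underflow when the three density factors are individually tiny.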
References
Andrieu, C., Davy, M., and Doucet, A. (2001). Improved auxiliary particle filtering: applications to time-varying spectral analysis. In Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing, pages 309–312.
Andrieu, C. and Doucet, A. (2002). Particle filtering for partially observed Gaussian state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64:827–836.
Andrieu, C., Doucet, A., and Tadić, V. B. (2005). On-line parameter estimation in general state-space models. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 332–337.
Arulampalam, M., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188.
Cappé, O., Godsill, S., and Moulines, E. (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924.
Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in Hidden Markov Models. Springer.
Carpenter, J., Clifford, P., and Fearnhead, P. (1999). An improved particle filter for non-linear problems. IEE Proceedings: Radar, Sonar and Navigation, 146:2–7.
Chen, R. and Liu, J. (1996). Predictive updating methods with application to Bayesian classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58:397–415.
Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508.
Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics, 32(6):2385–2411.
Crisan, D., Del Moral, P., and Lyons, T. (1999). Discrete filtering using branching and interacting particle systems. Markov Processes and Related Fields, 5(3):293–318.
Del Moral, P. and Doucet, A. (2003). On a class of genealogical and interacting Metropolis models. In Azéma, J., Émery, M., Ledoux, M., and Yor, M., editors, Séminaire de Probabilités XXXVII, volume 1832 of Lecture Notes in Mathematics, pages 415–446. Springer Berlin Heidelberg.
Doucet, A. (1997). Monte Carlo methods for Bayesian estimation of hidden Markov models. Application to radiation signals (in French). PhD thesis, University Paris-Sud Orsay, France.
Doucet, A., Briers, M., and Sénécal, S. (2006). Efficient block sampling strategies for sequential Monte Carlo methods. Journal of Computational and Graphical Statistics, 15(3):693–711.
Doucet, A., De Freitas, J., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York.
Doucet, A., de Freitas, N., Murphy, K., and Russell, S. (2000a). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 176–183, San Francisco, CA. Morgan Kaufmann.
Doucet, A., Godsill, S., and Andrieu, C. (2000b). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208.
Eckhardt, R. (1987). Stan Ulam, John von Neumann, and the Monte Carlo method. Los Alamos Science, Special Issue, pages 131–137.
Fearnhead, P. and Clifford, P. (2003). On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65:887–899.
Fearnhead, P. and Liu, Z. (2007). On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):589–605.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339.
Gilks, W. R. and Berzuini, C. (2001). Following a moving target-Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127–146.
Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995). Adaptive rejection Metropolis sampling within Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 44(4):455–472.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC.
Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 41(2):337–348.
Handschin, J. E. (1970). Monte Carlo techniques for prediction and filtering of non-linear stochastic processes. Automatica, 6:555–563.
Handschin, J. E. and Mayne, D. (1969). Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. International Journal of Control, 9:547–559.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
Julier, S. J. and Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of the International Symposium on Aerospace/Defense Sensing, Simulation and Controls, pages 182–193.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Series D: Journal of Basic Engineering, 82:35–45.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25.
Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288.
Lin, M. T., Zhang, J. L., Cheng, Q., and Chen, R. (2005). Independent particle filters. Journal of the American Statistical Association, 100(472):1412–1421.
Liu, J. (2001). Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, New York, NY, USA.
Liu, J. and Chen, R. (1995). Blind deconvolution via sequential imputation. Journal of the American Statistical Association, 90:567–576.
Liu, J. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93:1032–1044.
Marsaglia, G. (1977). The squeeze method for generating gamma variates. Computers and Mathematics with Applications, 3(4):321–325.
Mayne, D. (1966). A solution of the smoothing problem for linear dynamic systems. Automatica, 4:73–92.
Metropolis, N. (1987). The beginning of the Monte Carlo method. Los Alamos Science, Special Issue, pages 125–130.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6):1087–1092.
Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341.
Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability. Cambridge University Press, New York, NY, USA, 2nd edition.
Olsson, J., Cappé, O., Douc, R., and Moulines, E. (2008). Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli, 14:155–179.
Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, New York, 2nd edition.
Roberts, G. and Smith, A. (1994). Simple conditions for the convergence of the Gibbs sampler and Metropolis-Hastings algorithms. Stochastic Processes and their Applications, 49(2):207–216.
Roberts, G. and Tweedie, R. (1996). Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83:95–110.
Särkkä, S., Vehtari, A., and Lampinen, J. (2004). Rao-Blackwellized Monte Carlo data association for multiple target tracking. In Proceedings of the Seventh International Conference on Information Fusion, pages 583–590.
Shiryaev, A. N. (1995). Probability. Springer-Verlag, New York, NY, USA, 2nd edition.
Sorenson, H. W. (1985). Kalman Filtering: Theory and Application. IEEE Press, reprint edition.
Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22:1701–1762.
von Neumann, J. (1951). Various techniques used in connection with random digits. Journal of Research of the National Bureau of Standards, 12:36–38.