
Monte Carlo:

Simulation Methods for Statistical Inference

Sinan Yıldırım

August 4, 2021
Contents

List of Abbreviations vii

1 Introduction 1
1.1 Sample averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Monte Carlo: Generating your own samples . . . . . . . . . . . . . . . . . 3
1.2.1 Justification of Monte Carlo . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Toy example: Buffon’s needle . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 The need for more sophisticated methods . . . . . . . . . . . . . . . 10

2 Exact Sampling Methods 11


2.1 Pseudo-random number generation . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Pseudo-random number generators for Unif(0, 1) . . . . . . . . . . . 11
2.2 Some exact sampling methods . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Method of inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Transformation (change of variables) . . . . . . . . . . . . . . . . . 15
2.2.3 Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4.1 When p(x) is known up to a normalising constant . . . 21
2.2.4.2 Squeezing . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Monte Carlo Estimation 26


3.1 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Variance reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Self-normalised importance sampling . . . . . . . . . . . . . . . . . 30

4 Bayesian Inference 39
4.1 Conditional probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Deriving Posterior distributions . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 A note on future notational simplifications . . . . . . . . . . . . . . 41
4.2.2 Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Quantities of interest in Bayesian inference . . . . . . . . . . . . . . . . . . 46
4.3.1 Posterior mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.2 Maximum a posteriori estimation . . . . . . . . . . . . . . . . . . . 47
4.3.3 Posterior predictive distribution . . . . . . . . . . . . . . . . . . . . 47

5 Markov Chain Monte Carlo 51
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Discrete time Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Properties of Markov(η, M ) . . . . . . . . . . . . . . . . . . . . . . 54
5.2.1.1 Irreducibility . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.1.2 Recurrence and Transience . . . . . . . . . . . . . . . 56
5.2.1.3 Invariant distribution . . . . . . . . . . . . . . . . . . 56
5.2.1.4 Reversibility and detailed balance . . . . . . . . . . . 57
5.2.1.5 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Toy example: MH for the normal distribution . . . . . . . . . . . . 61
5.4 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4.1 Metropolis within Gibbs . . . . . . . . . . . . . . . . . . . . . . . . 73

6 Sequential Monte Carlo 76


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2 Sequential importance sampling . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Sequential importance sampling resampling . . . . . . . . . . . . . . . . . . 80

7 Bayesian inference in Hidden Markov Models 83


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 Bayesian optimal filtering and smoothing . . . . . . . . . . . . . . . . . . . 86
7.3.1 Filtering, prediction, and smoothing . . . . . . . . . . . . . . . . . . 87
7.3.1.1 Forward filtering backward smoothing . . . . . . . . . 87
7.3.1.2 Sampling from the full posterior . . . . . . . . . . . . 89
7.3.2 Exact inference in finite state-space HMMs . . . . . . . . . . . . . . 90
7.3.3 Exact inference in linear Gaussian HMMs . . . . . . . . . . . . . . 91
7.3.4 Particle filters for optimal filtering in HMM . . . . . . . . . . . . . 94
7.3.4.1 Motivation for particle filters . . . . . . . . . . . . . . 96
7.3.4.2 Particle filtering for HMM . . . . . . . . . . . . . . . 97
7.3.4.3 Filtering, prediction, and smoothing densities . . . . . 98
7.3.4.4 Estimating the evidence . . . . . . . . . . . . . . . . . 100
7.3.4.5 Choice of the importance density . . . . . . . . . . . . 100
7.3.5 Extensions to HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3.6 The Rao-Blackwellised particle filter . . . . . . . . . . . . . . . . . 106

A Some Basics of Probability 111


A.1 Axioms and properties of probability . . . . . . . . . . . . . . . . . . . . . 111
A.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.2.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . 113
A.2.2 Continuous random variables . . . . . . . . . . . . . . . . . . . . . 113
A.2.2.1 Some continuous distributions . . . . . . . . . . . . . . . . 114

A.2.3 Moments, expectation and variance . . . . . . . . . . . . . . . . . . 114
A.2.4 More than one random variables . . . . . . . . . . . . . . . . . . . . 115
A.3 Conditional probability and Bayes’ rule . . . . . . . . . . . . . . . . . . . . 117

B Solutions 118
B.1 Exercises in Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.2 Exercises in Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.3 Exercises of Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.4 Exercises of Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

List of Figures

1.1 Buffon’s needles: 3 throws . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2 The set A that corresponds to the needle crossing a line . . . . . . . . . . . 6
1.3 Top: Buffon’s needle experiment with 100 independent throws. Bottom:
(d, θ) values of the needle throws . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 (d, θ) values of N = 1000 and N = 10000 independent needle throws . . . . 9
1.5 Approximating π with Buffon’s needle experiment. The plot shows the value
of the estimate versus the number of needle throws, N . . . . . . . . . . . . 9

2.1 Method of inversion for the exponential distribution . . . . . . . . . . . . . 13


2.2 Method of inversion for the geometric distribution . . . . . . . . . . . . . . 14
2.3 Rejection sampling for Γ(2, 1) . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Rejection sampling for Γ(2, 1): Histograms with λ = 0.5 (68061 samples out
of 105 ) and λ = 0.01 (2664 samples out of 105 trials). . . . . . . . . . . . . 22

3.1 pX (x) and pX,Y (x, y) vs x for the problem in Example 3.4 with n = 10 and
a=2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Histograms for the estimate of the posterior mean using two different im-
portance sampling methods as described in Example 3.4 with n = 10 and
a = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Source localisation problem with three sensors and one source . . . . . . . 37
3.4 Source localisation problem with three sensors and one source: The like-
lihood terms, prior, and the posterior. The parameters and the variables
are s1 = (0, 2), s2 = (−2, −1), s3 = (1, −2), y1 = 2, y2 = 1.6, y3 = 2.5,
σx2 = 100, and σy2 = 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1 Directed acyclic graph showing the (hierarchical) dependency structure for
X, Y, Z, U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.1 State transition diagram of a Markov chain with 3 states, 1, 2, 3. . . . . . . 53


5.2 State transition diagram of the symmetric random walk on Z. . . . . . . . 53
5.3 State transition diagrams of two Markov chains that are not irreducible. . . 55
5.4 An irreducible, positive recurrent, and periodic Markov chain. . . . . . . . 59
5.5 Random walk MH for π(x) = φ(x; 2, 1). The left and middle plots corre-
spond to a too small and a too large value for σq2 , respectively. All algorithms
are run for 50000 iterations. Both the trace plots and the histograms show
that the last choice works the best. . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Independence MH for π(x) = φ(x; 2, 1). . . . . . . . . . . . . . . . . . . . . 64
5.7 Gradient-guided MH for π(x) = φ(x; 2, 1). . . . . . . . . . . . . . . . . . . 65

5.8 MH for parameters of N (z, s). σq2 = 1, α = 5, β = 10, m = 0, κ2 = 100. . . 66
5.9 An example data sequence of length n = 100 generated from the Poisson
changepoint model with parameters τ = 30, λ1 = 10 and λ2 = 5. . . . . . . 67
5.10 MH for parameters of the Poisson changepoint model . . . . . . . . . . . . 68
5.11 MH for the source localisation problem. . . . . . . . . . . . . . . . . . . . . 69

6.1 Resampling in SISR. Circle sizes represent weights. . . . . . . . . . . . . . 80

7.1 Acyclic directed graph for HMM . . . . . . . . . . . . . . . . . . . . . . . . 84


7.2 SIS: propagation and weighting of particles for several time steps. Each row
shows (i) the particles and their weights from the previous time, (ii) The
propagation and extension of the particles for the new time step, and (iii)
Update of the particle weights. Notice the weight degeneracy problem. . . 102
7.3 SISR: resampling, propagation and weighting of particles for several time
steps. Each row shows (i) the particles and their weights from the previous
time, (ii) the resampling, propagation, and extension of the particles for
the new time step, and (iii) the update of the particle weights. Notice the
path degeneracy problem: π^N_{10}(x_{1:10}) has only one support point for x_1 . . 104
7.4 Particle filter for target tracking . . . . . . . . . . . . . . . . . . . . . . . . 105

List of Tables

3.1 PERT: Project tasks, predecessor-successor relations, and mean durations . 35

4.1 Joint probability table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

List of Abbreviations

a.s. almost surely


cdf cumulative distribution function
HMM hidden Markov model
i.i.d. independent and identically distributed
IS Importance sampling
MAP maximum a posteriori
MCMC Markov chain Monte Carlo
MH Metropolis-Hastings
MLE maximum likelihood estimate (or estimation)
MMSE minimum mean square error
MSE mean square error
pdf Probability density function
pmf Probability mass function
SIS sequential importance sampling
SISR sequential importance sampling resampling
SMC sequential Monte Carlo

Chapter 1

Introduction
Summary: This chapter provides a motivation for Monte Carlo methods. We will basically discuss averaging, which is the core of Monte Carlo integration. Then we will discuss theoretical and practical justifications of Monte Carlo. The chapter ends with a toy example, Buffon's needle experiment.

1.1 Sample averages


Suppose we are given N ≥ 1 random samples X^{(1)}, \ldots, X^{(N)}, each taking values in a set \mathcal{X} \subset \mathbb{R}^{d_x} for some d_x ≥ 1. The samples are independent and identically distributed (i.i.d.) according to some distribution P for a random variable X. We summarise this sentence as

X^{(1)}, \ldots, X^{(N)} \overset{\text{i.i.d.}}{\sim} P.

Also, the distribution P is unknown.

Mean value of the distribution: First, we are asked to provide an estimate of the mean value, i.e. the expectation of X with respect to P, using the sample set X^{(1:N)}. If P has a probability density function p(x), this expectation can be written as^1

E_P(X) = \int_{\mathcal{X}} x \, p(x) \, dx. \quad (1.1)

A reasonable estimate of this quantity would be


E_P(X) \approx \frac{1}{N} \sum_{i=1}^{N} X^{(i)}. \quad (1.2)

Expectation of a general function: Next, we are asked to provide an estimate of the expectation of a certain function ϕ : \mathcal{X} \to \mathbb{R} with respect to P, that is,^2

P(ϕ) := E_P(ϕ(X)) = \int_{\mathcal{X}} ϕ(x) \, p(x) \, dx. \quad (1.3)
^1 We assume continuity of X to avoid repetitions for continuous and discrete variables. For discrete distributions, change the integral to the sum \sum_i x_i \, p(x_i).
^2 Again, for discrete variables this would be \sum_i ϕ(x_i) \, p(x_i).

CHAPTER 1. INTRODUCTION 2

(Here, the notation P(ϕ) is introduced for simplicity.) This time, our estimator uses the values ϕ(X^{(i)}), i = 1, \ldots, N, of ϕ evaluated at the samples, instead of the X^{(i)} themselves:

E_P(ϕ(X)) \approx \frac{1}{N} \sum_{i=1}^{N} ϕ(X^{(i)}). \quad (1.4)
It is easy to see that the second problem is just a simple generalisation of the first: Put
ϕ(X) = X and you will come back to the first problem, which was to estimate the expected
value of X. The function ϕ can correspond to another moment of interest, for example ϕ(x) = x^2, or a specific function of interest, for example ϕ(x) = \log x.

Probability of a set: Another special case of ϕ is seen when we are interested in the
probability of a certain set A ⊆ X . How do we write this probability

P (A) := P(X ∈ A)

as an expectation of a function with respect to P? For this, consider the indicator function I_A : \mathcal{X} \to \{0, 1\} such that

I_A(x) = 1 \text{ if } x \in A, \text{ and } 0 \text{ if } x \notin A. \quad (1.5)
Now, let us consider ϕ = I_A and write the expectation of this function with respect to P:

E_P(I_A(X)) = \int_{\mathcal{X}} I_A(x) \, p(x) \, dx = \int_A p(x) \, dx = \mathbb{P}(X \in A), \quad (1.6)

where the second equality follows from the fact that the integrand equals p(x) for x \in A and 0 for x \notin A. Therefore, we know what to do when X^{(1)}, \ldots, X^{(N)} \overset{\text{i.i.d.}}{\sim} P are given and we want to estimate \mathbb{P}(X \in A) = E_P(I_A(X)): we simply apply equation (1.4) to the function I_A(X):

\mathbb{P}(X \in A) \approx \frac{1}{N} \sum_{i=1}^{N} I_A(X^{(i)}). \quad (1.7)
Notice that this will output a value that is guaranteed to be in [0, 1] (since the sum can
be at least 0 and at most N ), so it is a valid probability estimate.
In the following, we will talk in general about the expectation (1.3) and its estimate
(1.4), after hopefully having convinced you that the other expectations and their estimates
are just special cases.
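The three estimates (1.2), (1.4) and (1.7) are all instances of the same sample average. A minimal sketch in code, where the distribution P = Exp(1), the function ϕ(x) = x², and the set A = [1, ∞) are stand-in choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "given" i.i.d. samples X^(1), ..., X^(N) ~ P.
# Here P = Exp(1), chosen purely for illustration.
N = 100_000
X = rng.exponential(scale=1.0, size=N)

# (1.2): sample average estimating E_P(X)  (true value 1 for Exp(1)).
mean_est = X.mean()

# (1.4): the same average applied to phi(X), here phi(x) = x^2  (true value 2).
phi_est = (X ** 2).mean()

# (1.7): the same average applied to the indicator I_A, here A = [1, inf)
# (true value e^{-1} ≈ 0.368).
prob_est = (X >= 1.0).mean()

print(mean_est, phi_est, prob_est)
```

All three lines compute the same kind of quantity: an average of a function of the samples, which is why the general estimate (1.4) is the only one we need to study.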

1.2 Monte Carlo: Generating your own samples


Consider equation (1.4): this is what we would do to estimate a certain quantity related to the distribution P. However, notice that you do not need to know anything explicit about P in order to make that calculation.
from P . Perhaps we would be able to calculate the exact value of (1.3) if P were known.
However, we said P was unknown, and we proposed to use the estimate (1.4) instead.
Now consider a new scenario: This time you do know P , at least up to a certain extent;
but you are not given any samples from P . The following are what you can and cannot
do about P :

• You can generate (draw) i.i.d. samples from P , as many as you want.

• You cannot compute (one or more of) the integrals in (1.3), or computing them would take so long that you would not want to. Another way of saying this is that the integral is intractable.

What would you do to estimate (1.3) in that case? Of course, by generating your own
samples X (1) , . . . , X (N ) from P so that the problem reduces to the one in the previous
section.^3 This simple idea is the core of Monte Carlo methods. Once you generate samples
from P , you do not need to deal with it in order to implement the estimate (1.4).
The term Monte Carlo was coined in the 1940s, see Metropolis and Ulam (1949) for a
first use of the term, and Metropolis (1987); Eckhardt (1987) for a historical review.

1.2.1 Justification of Monte Carlo


Let P_{MC}^{N}(ϕ) denote the Monte Carlo estimate of P(ϕ) in (1.3) that is given in (1.4) using N samples, i.e.

P_{MC}^{N}(ϕ) = \frac{1}{N} \sum_{i=1}^{N} ϕ(X^{(i)}). \quad (1.8)
^3 An engineer is asked how to make tea using an empty kettle, a tea bag, a cup, and tap water. The engineer explains that first she would pour water into the kettle, boil it, pour it into a cup, put the tea bag in the cup, and wait until the tea brews. Next, a mathematician is asked how to make tea using a kettle full of water, a tea bag, a cup, and tap water. The mathematician proposes to empty the kettle first, so that they are back to the first problem.

It is easy to show that P_{MC}^{N}(ϕ) is an unbiased estimator of P(ϕ) for any N ≥ 1:

E[P_{MC}^{N}(ϕ)] = E\left[ \frac{1}{N} \sum_{i=1}^{N} ϕ(X^{(i)}) \right] = \frac{1}{N} \sum_{i=1}^{N} E_P(ϕ(X^{(i)})) = \frac{1}{N} N E_P(ϕ(X)) = E_P(ϕ(X)) = P(ϕ).

However, unbiasedness itself is not enough.^4 Fortunately, we have results on the convergence and decreasing variance as N increases.

Law of large numbers: If |P(ϕ)| < ∞, the law of large numbers (e.g. Shiryaev (1995), p. 391) ensures almost sure (a.s.) convergence of P_{MC}^{N}(ϕ) to P(ϕ) as the number of i.i.d. samples tends to infinity:

P_{MC}^{N}(ϕ) \overset{\text{a.s.}}{\to} P(ϕ), \quad \text{as } N \to \infty.

Central limit theorem: The variance of P_{MC}^{N}(ϕ) is given by

V[P_{MC}^{N}(ϕ)] = \frac{1}{N^2} \sum_{i=1}^{N} V_P[ϕ(X^{(i)})] = \frac{1}{N} V_P[ϕ(X)],

which indicates the improvement in accuracy with increasing N, provided that V_P[ϕ(X)] is finite. Also, if V_P[ϕ(X)] is finite, the distribution of the estimator is well behaved in the limit, which is ensured by the central limit theorem (e.g. Shiryaev (1995), p. 335):

\sqrt{N} \left( P_{MC}^{N}(ϕ) - P(ϕ) \right) \overset{d}{\to} \mathcal{N}(0, V_P[ϕ(X)]), \quad \text{as } N \to \infty.
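The 1/N variance scaling can be checked empirically. The following sketch (our own arbitrary choice of ϕ(x) = x² and P = Unif(0, 1)) repeats the Monte Carlo estimate many times for two sample sizes; multiplying N by 100 should divide the standard deviation of the estimator by roughly 10:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_estimate(n):
    """One Monte Carlo estimate of P(phi) = E[X^2] for X ~ Unif(0, 1)."""
    x = rng.random(n)
    return (x ** 2).mean()

# Standard deviation of the estimator over repeated runs, for two sample sizes.
reps = 2000
sd_small = np.std([mc_estimate(100) for _ in range(reps)])
sd_large = np.std([mc_estimate(10_000) for _ in range(reps)])

# V[P^N_MC(phi)] = V_P[phi(X)] / N: multiplying N by 100 should divide
# the standard deviation of the estimator by about 10.
print(sd_small / sd_large)
```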

Advantage over deterministic integration: There are deterministic numerical integration techniques available to approximate P(ϕ); however, these methods suffer from the so-called curse of dimensionality, since the amount of computation grows exponentially with the dimension d_x of X (Press, 2007). Therefore, they are practical and reliable only in low-dimensional problems. Monte Carlo integration is a powerful alternative to deterministic methods for integration problems. Compared to deterministic numerical integration algorithms, the performance of Monte Carlo does not depend on the dimension d_x (check the variance of the Monte Carlo estimate above, which does not depend on d_x). This makes the method particularly useful for high-dimensional integration (Newman and Barkema, 1999).
^4 Even ϕ(X^{(1)}) is an unbiased estimate of P(ϕ), but it is 'somewhat' inferior to taking the average over N samples.
Figure 1.1: Buffon's needles: 3 throws

1.2.2 Toy example: Buffon’s needle


This is an illustrative example of the use of Monte Carlo. In mathematics, Buffon's needle problem is a question first posed in the 18th century by Georges-Louis Leclerc, Comte de Buffon. Suppose we have a floor made of parallel strips of wood, each the same width, and we drop a needle onto the floor. What is the probability that the needle will lie across a line between two strips?
Buffon's needle was the earliest problem in geometric probability to be solved; it can be solved using integral geometry. The solution, in the case where the needle length is not greater than the width of the strips, can be used to design a Monte Carlo method for approximating the number π (although that was not the original motivation for Buffon's question).
First, let us try to answer the initial question: What is the probability of the needle
of length 1 (without loss of generality) crossing a line between the strips of width 1 if the
location and the direction of the needle are independent and uniformly distributed? The
probability can actually be calculated: Let d be the distance from the middle of the needle to the nearest line and θ be the acute angle between the parallel lines and the needle (between
0 and π/2). A needle touches a line if and only if

\frac{d}{\sin θ} < \frac{1}{2}. \quad (1.9)
Try to verify this by observing the needles in Figure 1.1. The variables d and θ are independent and uniformly distributed in [0, 1/2] and [0, π/2], respectively, so that their joint probability density can be written as

p(d, θ) = 4/π \text{ for } (d, θ) \in [0, 1/2] \times [0, π/2], \text{ and } 0 \text{ otherwise.}

Figure 1.2: The set A that corresponds to the needle crossing a line

Now, define the set A = {(d, θ) : d/ sin θ < 1/2} = {(d, θ) : d < sin θ/2}. The set A
corresponds to the area under the curve in Figure 1.2. Letting X = (d, θ), the required
probability is

\mathbb{P}(X \in A) = \int_A p(r, θ) \, dr \, dθ = \int_0^{π/2} \int_0^{\sin(θ)/2} \frac{4}{π} \, dr \, dθ = \frac{2}{π} \int_0^{π/2} \sin(θ) \, dθ = \frac{2}{π} \left[ -\cos(π/2) + \cos(0) \right] = \frac{2}{π}, \quad (1.10)

where the dummy variable r is used for d (just to avoid writing dd in the integral!).

Monte Carlo approximation: Suppose it is not our day and we cannot carry out the calculation in (1.10). In order to find \mathbb{P}(X \in A), we decide to run a Monte Carlo experiment instead. Let X = (d, θ) and observe \mathbb{P}(X \in A) = E(I_A(X)). The idea is to generate samples X^{(i)} = (d^{(i)}, θ^{(i)}), i = 1, \ldots, N, where each sample is generated independently as

d^{(i)} \sim \text{Unif}(0, 1/2), \quad θ^{(i)} \sim \text{Unif}(0, π/2), \quad (1.11)


and estimate \mathbb{P}(X \in A) as

\mathbb{P}(X \in A) \approx \frac{1}{N} \sum_{i=1}^{N} I_A(X^{(i)}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(d^{(i)} < \sin(θ^{(i)})/2) \quad (1.12)

(where we have introduced another notational use of the indicator function \mathbb{I}). In words, for each sample X^{(i)} = (d^{(i)}, θ^{(i)}) we check whether d^{(i)} < \sin(θ^{(i)})/2 (in other words, we check whether X^{(i)} \in A), and we divide the number of samples satisfying this condition by the total number of samples N. Observe that this is the implementation of (1.7) for this problem. The samples (d^{(i)}, θ^{(i)}) can be generated by throwing a needle on a table with parallel lines, or using a computer! Figure 1.3 shows the described Monte Carlo experiment performed with N = 100 throws; the (d, θ) values corresponding to these throws are also shown in Figure 1.3. Note that (1.12) is

\mathbb{P}(X \in A) \approx \frac{\text{number of red dots}}{\text{total number of dots}}.
The law of large numbers says that as N tends to infinity the estimate above converges to the true value \mathbb{P}(X \in A) = 2/π. This fact can be 'felt' by observing Figure 1.4, which shows the results of the same Monte Carlo experiment with larger N values. The estimate of \mathbb{P}(X \in A) improves with N.

Estimating π: We have already stated that P(X ∈ A) is known for this problem and in
fact it is 2/π. We have also described a Monte Carlo experiment to estimate this value.
With a little modification, our estimate can be used to estimate π. Since we have

π = \frac{2}{\mathbb{P}(X \in A)},

we can approximate π by

π \approx \frac{2N}{\sum_{i=1}^{N} \mathbb{I}(d^{(i)} < \sin(θ^{(i)})/2)} = 2 \times \frac{\text{total number of dots}}{\text{number of red dots}}.
This is a pretty fancy way of estimating π with a needle and a table! Figure 1.5 shows the estimated value versus the number of samples N. Again, we see improvement in the estimate as N increases.
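The whole experiment can be reproduced in a few lines (a sketch; the variable names and the random seed are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2021)

N = 100_000
d = rng.uniform(0.0, 0.5, size=N)            # midpoint-to-nearest-line distance
theta = rng.uniform(0.0, np.pi / 2, size=N)  # acute angle of the needle

crossings = d < np.sin(theta) / 2  # event X = (d, theta) in A, cf. (1.9)
prob_est = crossings.mean()        # estimate of P(X in A) = 2/pi, cf. (1.12)
pi_est = 2.0 / prob_est            # the corresponding estimate of pi

print(prob_est, pi_est)
```

With N = 100 000 throws, the printed values should land close to 2/π ≈ 0.6366 and π ≈ 3.1416.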
Figure 1.3: Top: Buffon's needle experiment with 100 independent throws (estimated probability 0.6700; true value 0.6366). Bottom: (d, θ) values of the needle throws.
Figure 1.4: (d, θ) values of N = 1000 and N = 10000 independent needle throws (estimated probabilities 0.6590 and 0.6378, respectively; true value 0.6366).

Figure 1.5: Approximating π with Buffon’s needle experiment. The plot shows the value
of the estimate versus the number of needle throws, N .

1.2.3 The need for more sophisticated methods


The distribution P in the toy example was the product of two uniform distributions for X = (d, θ). In many problems, however, P is not trivial to sample from. In the rest of the course, we will see some methods to generate exact samples from P (that is, samples that are exactly distributed according to P). These are the inverse transform method and the rejection sampling method.
However, the story does not end there: being able to sample exactly from the distribution of interest is rarely the case in real applications in engineering and science. Especially in Bayesian statistics, where the distribution we want to sample from is the posterior distribution of some variable X given Y = y, which is of the form

p_{X|Y}(x|y) = \frac{p_X(x) \, p_{Y|X}(y|x)}{\int p_X(x') \, p_{Y|X}(y|x') \, dx'} = \frac{p_{X,Y}(x, y)}{\int p_{X,Y}(x', y) \, dx'} \propto p_X(x) \, p_{Y|X}(y|x),

it is either too costly or impossible to perform exact sampling. That explains the vast literature on sophisticated Monte Carlo methods that aim to generate approximate samples. In the course, we will cover some of these methods; among them, importance sampling and Markov chain Monte Carlo methods are worth mentioning as early as here.
Chapter 2

Exact Sampling Methods


Summary: In order to obtain estimates as in (1.8), we need exact i.i.d. samples from P, that is, samples that are exactly distributed according to P. This chapter describes some exact sampling methods: the method of inversion, transformation, composition, and rejection sampling.

2.1 Pseudo-random number generation


"The generation of random numbers is too important to be left to chance" (a quip attributed to Robert Coveyou), and truly random numbers are impossible to generate on a deterministic computer. Published tables or other mechanical methods such as throwing dice, flipping coins, shuffling cards, or turning roulette wheels are clearly not very practical for generating the random numbers needed for computer simulations. Other techniques rely on chaotic behaviour, such as the thermal noise in Zener diodes or other analog circuits, as well as atmospheric noise (see, e.g., [Link]), or running a hash function against a frame of a video stream. Still, the vast majority of random numbers is obtained from pseudo-random number generators. Apart from being very efficient, one additional advantage of these techniques is that the sequences are reproducible by setting a seed; this property is key for debugging Monte Carlo code.

2.1.1 Pseudo-random number generators for Unif(0, 1)


Today, in most applications the task of random variable generation is performed on computers. In fact, a computer is mainly responsible for generating pseudo-random numbers that look as if they are independent and distributed uniformly between 0 and 1; hence the name "pseudo-random". That is, any sequence of pseudo-random numbers produced by the pseudo-random number generator should look like a sequence of i.i.d. uniformly distributed random numbers between 0 and 1, showing no correlation and spreading over the (0, 1) interval uniformly.
There already exist highly sophisticated numerical methods to generate such pseudo-random numbers that pass certain tests for uniformity and independence. The most well-known method for generating random numbers is based on a Linear Congruential Generator (LCG). The theory is well understood, and the method is easy to implement and fast. An LCG is defined by the recurrence relation

x_{n+1} = (a x_n + c) \bmod M.

If the coefficients a and c are chosen carefully (e.g. with c relatively prime to M), x_n will be roughly uniformly distributed between 0 and M − 1 (and, after normalisation by M, shrunk between 0 and 1). By "roughly uniformly" we mean that the sequence of numbers x_n will pass many reasonable tests for randomness. One such test suite is the so-called DIEHARD battery, developed by George Marsaglia: a collection of statistical tests for measuring the quality of a random number generator.
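A minimal LCG can be sketched as follows; the constants a = 1664525, c = 1013904223, M = 2³² are one well-known published choice (from Numerical Recipes), used here only as an example of a carefully chosen triple:

```python
def lcg(seed, a=1664525, c=1013904223, m=2 ** 32):
    """Yield pseudo-random numbers in [0, 1) via x_{n+1} = (a*x_n + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m  # normalise from {0, ..., m-1} into [0, 1)

# Setting the same seed reproduces the same sequence; this is the
# reproducibility property that makes Monte Carlo code debuggable.
gen = lcg(seed=12345)
u = [next(gen) for _ in range(5)]
print(u)
```

Such a toy generator is for illustration; serious simulations should use a vetted library generator.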
A more recently proposed generator is the Mersenne Twister algorithm (Matsumoto and Nishimura, 1997). It has several desirable features, such as a long period, and it is very fast. Many public-domain implementations of the algorithm exist, and it is the preferred random number generator for statistical simulations and Monte Carlo computations.

2.2 Some exact sampling methods


In the sequel, we will assume that a computer can produce for us an independent variable

U ∼ Unif(0, 1)

every time we ask it to do so. The crucial part is how to transform one or more copies of
U such that the resulting number is distributed according to a particular distribution that
we want to sample from. In a more general context, how can one exploit the ability of the
computer to generate uniform random variables so that we can obtain random numbers
from any desired distribution?
In the following we will see some exact sampling methods.

2.2.1 Method of inversion


Suppose X ∼ P takes values in \mathcal{X} \subseteq \mathbb{R}, with cdf F defined as F(x) = \mathbb{P}(X \leq x), x \in \mathbb{R}. Recall that F takes values in [0, 1]. Define the generalised inverse cdf G : (0, 1) \to \mathbb{R} as

G(u) := \inf\{x \in \mathcal{X} : F(x) \geq u\}. \quad (2.1)

Remark 2.1. Define the set S(u) = \{x \in \mathcal{X} : F(x) \geq u\}. We can show that, by right-continuity of F, S(u) actually attains its infimum; that is, the minimum of S(u) exists, hence \inf S(u) = \min S(u) and S(u) = [G(u), \infty).^1

^1 Proof: If x < G(u), then x \notin S(u) by definition. If x > G(u), then there exists x' < x with x' \in S(u); since F is non-decreasing, F(x) \geq F(x') \geq u, so x \in S(u). Finally, by the right-continuity of F, we have F(G(u)) = \inf\{F(y) : y > G(u)\} \geq u. Therefore G(u) \in S(u) and S(u) = [G(u), \infty).
Figure 2.1: Method of inversion for the exponential distribution Exp(1)

If X is discrete, taking values x_1, x_2, \ldots, this definition reduces to G(u) = x_{i^*} where i^* = \min\{i : F(x_i) \geq u\}. In other words, G(u) = x_{i^*} such that

F(x_{i^*-1}) < u \leq F(x_{i^*}). \quad (2.2)

If X is continuous with a pdf p(x) > 0 for all x \in \mathcal{X} (i.e. F has no jumps and no flat parts in \mathcal{X}), then F is strictly monotonic in \mathcal{X}, its inverse G = F^{-1} can be defined on \mathcal{X}, and we simply have G(u) = F^{-1}(u).
The following lemma enables the method of inversion.

Lemma 2.1. If U ∼ Unif(0, 1), then G(U) ∼ P.

Proof. Since S(u) = [G(u), \infty) (see Remark 2.1), we have x \geq G(u) if and only if F(x) \geq u. Hence, \mathbb{P}(G(U) \leq x) = \mathbb{P}(U \leq F(x)) = F(x).

Lemma 2.1 suggests that we can sample X ∼ P by first sampling U ∼ Unif(0, 1) and then transforming X = G(U). This approach is called the method of inversion. It was considered by Ulam prior to 1947 (Eckhardt, 1987), and some extensions to the method are provided by Robert and Casella (2004).

Corollary 2.1. Suppose F is continuous. If X ∼ P , then F (X) ∼ Unif(0, 1).

Proof. Since we have S(u) = [G(u), ∞), x ≥ G(u) implies F (x) ≥ u. Moreover, if x < G(u)
then F (x) < u by definition of G. By continuity of F , we have F (G(u)) = u, so F (x) ≤ u
if and only if x ≤ G(u). Hence P(F (X) ≤ u) = P(X ≤ G(u)) = F (G(u)) = u, and we
conclude that the cdf of F (X) is the cdf of Unif(0, 1).
[Figure omitted: pmf and cdf of Geo(0.3), with a value u on the vertical axis mapped to G(u) on the horizontal axis.]

Figure 2.2: Method of inversion for the geometric distribution

Example 2.1. Suppose we want to sample X ∼ P = Exp(λ) from the exponential distribution with rate parameter λ > 0. The pdf of Exp(λ) is

    p(x) = λe^{−λx} for x ≥ 0,  and p(x) = 0 otherwise.

The cdf is

    u = F (x) = ∫₀ˣ λe^{−λt} dt = 1 − e^{−λx} for x ≥ 0,  and F (x) = 0 for x < 0.
Therefore, we have x = − log(1 − u)/λ. So, we can generate U ∼ Unif(0, 1) and transform
X = − log(1 − U )/λ ∼ Exp(λ). See Figure 2.1 for an illustration.
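The inversion recipe above can be sketched in code. The following is a minimal Python illustration (the function name `sample_exp` is our own, not from the text):

```python
import math
import random

def sample_exp(lam, rng=random):
    """Draw X ~ Exp(lam) by inversion: X = -log(1 - U)/lam with U ~ Unif(0, 1)."""
    u = rng.random()                  # U in [0, 1)
    return -math.log(1.0 - u) / lam   # G(u) = -log(1 - u)/lam

rng = random.Random(0)
samples = [sample_exp(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)    # should be close to 1/lam = 0.5
```

With rate λ = 2, the sample mean should settle near 1/λ = 0.5 as the number of draws grows.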

Example 2.2. Suppose we want to sample X ∼ P = Geo(ρ) from the geometric distribution on X = N with success rate parameter ρ ∈ (0, 1) and pmf²

    p(x) = (1 − ρ)^x ρ,  x = 0, 1, 2, . . . .

Making use of Σ_{i=0}^{x} α^i = (1 − α^{x+1})/(1 − α) with α = 1 − ρ, the cdf at the support points is given by

    F (x) = 1 − (1 − ρ)^{x+1}.

Given U = u sampled from Unif(0, 1), the rule in (2.2) implies

    1 − (1 − ρ)^x < u ≤ 1 − (1 − ρ)^{x+1}.
² This distribution is used for the number of trials prior to the first success in a Bernoulli process with success rate ρ. Another convention is to take the support as 1, 2, . . . rather than 0, 1, 2, . . . and interpret X as the number of trials up to and including the first success. Then the pmf becomes p(x) = (1 − ρ)^{x−1} ρ, x ≥ 1.

Solving the inequality for x we arrive at

    log(1 − u)/log(1 − ρ) − 1 ≤ x < log(1 − u)/log(1 − ρ).

This is nothing but the round-up function written explicitly:

    x = ⌈ log(1 − u)/log(1 − ρ) − 1 ⌉.

See Figure 2.2 for an illustration.
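The geometric inversion rule translates directly into code. A minimal Python sketch (the function name `sample_geo` is our own):

```python
import math
import random

def sample_geo(rho, rng=random):
    """Draw X ~ Geo(rho) on {0, 1, 2, ...} by inversion:
    X = ceil(log(1 - U)/log(1 - rho) - 1)."""
    u = rng.random()
    return math.ceil(math.log(1.0 - u) / math.log(1.0 - rho) - 1.0)

rng = random.Random(1)
samples = [sample_geo(0.3, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)    # E[X] = (1 - rho)/rho = 7/3 for rho = 0.3
```

Note that `math.ceil` already returns an integer, and the formula returns 0 for small u, matching the support {0, 1, 2, . . .}.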

2.2.2 Transformation (change of variables)


The method of inversion can be seen as a transformation from U to X = G(U ). In fact,
one can use transformation in a more general sense than using G by considering a change
of variables via a suitable function g.

Example 2.3. If we want to sample from X ∼ Unif(a, b), a < b, we can sample U ∼
Unif(0, 1) and use the transformation

X = g(U ) := (b − a)U + a. (2.3)

Transformation can also be used for more complicated situations than in Example 2.3.
Suppose we have an m-dimensional random variable X ∈ X ⊆ Rm with pdf pX (x) and we
apply a transform to X using an invertible function g : X → Y, where Y ⊆ Rm to obtain

Y = (Y1 , . . . , Ym ) = g(X1 , . . . , Xm )

Since g is invertible, we have X = g −1 (Y ). What is the pdf of Y , pY (y)? This density can
be found as follows. Define the Jacobian determinant (or simply Jacobian) of the inverse transformation g⁻¹ as

    J(y) = det ∂g⁻¹(y)/∂y.    (2.4)

The usual practice to ease the notation is to introduce the shorthand (y1 , . . . , ym ) = g(x1 , . . . , xm ) and write J(y), with implicit reference to g, as

    J(y) = det ∂x/∂y = det ∂(x1 , . . . , xm )/∂(y1 , . . . , ym ),

the determinant of the m × m matrix whose (i, j)'th entry is ∂xi /∂yj .

The Jacobian is useful for integration: if we make a change of variables from x to y, we have to substitute dx = |J(y)|dy. When we apply this to the integral of any function ϕ : X → R with respect to pX (x), we have

    ∫ pX (x)ϕ(x)dx = ∫ pX (g⁻¹(y))ϕ(g⁻¹(y)) |J(y)| dy
                   = ∫ pX (g⁻¹(y)) |J(y)| ϕ(g⁻¹(y))dy
                   = ∫ pY (y)ϕ(g⁻¹(y))dy,

where

pY (y) := pX (g −1 (y)) |J(y)| (2.5)

is the pdf of Y .
Change of variables can be useful when P is difficult to sample from using the method of inversion, but sampling X ∼ P can be performed by a certain transformation of random variables that are easier to generate, such as uniform random variables.
Example 2.4. We describe the Box-Muller method for generating random variables from the standard normal (Gaussian) distribution N (0, 1). The pdf of N (µ, σ²) is

    φ(x; µ, σ²) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

The method of inversion is not an easy option for sampling from N (0, 1), since the cdf of N (0, 1) is not easy to invert. Instead we use transformation.
The Box-Muller method generates a pair of independent standard normal random variables X1 , X2 ∼ N (0, 1) i.i.d. as follows. First we generate

    R ∼ Exp(1/2),  Θ ∼ Unif(0, 2π),

and then apply the transformation

    X1 = √R cos(Θ),  X2 = √R sin(Θ).

If we wanted to start off from uniform random numbers, we could generate U1 , U2 ∼ Unif(0, 1) i.i.d. and set R = −2 log(U1 ) and Θ = 2πU2 , so that R and Θ are distributed as desired. In other words,

    X1 = √(−2 log(U1 )) cos(2πU2 ),  X2 = √(−2 log(U1 )) sin(2πU2 ).

One way to see why this works is to use change of variables. Note that³

    (R, Θ) = (X1² + X2² , arctan(X2 /X1 )).    (2.6)

³ To be precise, Θ = arctan(X2 /X1 ) + πI(X1 < 0) since Θ ∈ [0, 2π], but omitting the extra term πI(X1 < 0) does not change the results.
Then the Jacobian at (x1 , x2 ) = (√r cos θ, √r sin θ) is

    J(x1 , x2 ) = det | ∂r/∂x1   ∂r/∂x2 |  = det |      2x1              2x2        |  = 2.    (2.7)
                     | ∂θ/∂x1   ∂θ/∂x2 |        | −x2/(x1²+x2²)   x1/(x1²+x2²) |

Therefore, we can apply (2.5) to get

    pX1,X2 (x1 , x2 ) = pR (r) pΘ (θ) |J(x1 , x2 )|
                      = pR (x1² + x2²) pΘ (arctan(x2 /x1 )) |J(x1 , x2 )|
                      = (1/2) e^{−(x1² + x2²)/2} · (1/(2π)) · 2
                      = (1/√(2π)) e^{−x1²/2} · (1/√(2π)) e^{−x2²/2}
                      = φ(x1 ; 0, 1) φ(x2 ; 0, 1),    (2.8)

which is the product of the pdf of N (0, 1) evaluated at x1 and at x2 . Therefore, we conclude that X1 , X2 ∼ N (0, 1) i.i.d.
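The two-uniform version of the Box-Muller transform can be sketched as follows in Python (the function name `box_muller` is our own; we replace U1 by 1 − U1 only to keep the logarithm finite, which does not change the distribution):

```python
import math
import random

def box_muller(rng=random):
    """One Box-Muller draw: two i.i.d. N(0,1) variates from two uniforms."""
    u1 = 1.0 - rng.random()             # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))  # sqrt of R with R ~ Exp(1/2)
    theta = 2.0 * math.pi * u2          # Theta ~ Unif(0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(2)
xs = [x for _ in range(50_000) for x in box_muller(rng)]
mean = sum(xs) / len(xs)                # ~ 0
var = sum(x * x for x in xs) / len(xs)  # ~ 1
```

Each call returns a pair, so the loop above yields 100,000 standard normal draws whose sample mean and variance should be close to 0 and 1.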

Multivariate normal distribution: Another important transformation that we should be familiar with is a linear transformation of a multivariate normal random variable. We denote the distribution of an n × 1 multivariate normal random variable as X ∼ N (µ, Σ), where µ = E(X) is the n × 1 mean vector and

    Σ = Cov(X) = E[(X − µ)(X − µ)ᵀ]

is an n × n symmetric positive definite⁴ covariance matrix. The (i, j)'th element of Σ is

    σij = Cov(Xi , Xj ) = E[(Xi − µi )(Xj − µj )] = E(Xi Xj ) − µi µj .

The pdf of this distribution is (using the same letter as for the pdf of the univariate normal distribution)

    φ(x; µ, Σ) = (1/|2πΣ|^{1/2}) exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) ),    (2.9)
where | · | stands for determinant.
Suppose X = (X1 , . . . , Xn )ᵀ ∼ N (µ, Σ) and we have the transformation

Y = AX + η

where A is an m × n matrix with m ≤ n and rank m,⁵ and η is an m × 1 vector. We know for a fact that a linear transformation of X has to be normally distributed as well. Also,
⁴ In fact, positive semi-definite covariance matrices are also allowed; however, the distribution is then called degenerate and it does not have a pdf.
⁵ We constrain A to full row rank matrices since otherwise the resulting covariance matrix AΣAᵀ is no longer positive definite and Y is degenerate.
the normal distribution is completely characterised by its mean and covariance. Therefore,
we can work out the mean and the variance of Y in order to identify its distribution.

    E(Y ) = E(AX + η) = A E(X) + η = Aµ + η,
    Cov(Y ) = E([Y − E(Y )][Y − E(Y )]ᵀ)
            = E([AX + η − (Aµ + η)][AX + η − (Aµ + η)]ᵀ)
            = E(A(X − µ)(X − µ)ᵀ Aᵀ) = A Cov(X) Aᵀ = AΣAᵀ.

Therefore, Y ∼ N (Aµ + η, AΣAᵀ).

Example 2.5. The above derivation suggests a way to generate an n-dimensional multivariate sample X ∼ N (µ, Σ). We can first generate i.i.d. normal random variables R1 , . . . , Rn ∼ N (0, 1), so that R = (R1 , . . . , Rn ) ∼ N (0n , In ), where 0n is the n × 1 vector
of zeros and In is the identity matrix of size n. Then, we decompose Σ = AAT using the
Cholesky decomposition. Finally, we let X = AR + µ. Observe that the mean of X is
A0n + µ = µ and covariance matrix of X is AIn AT = AAT = Σ, so we are done.
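A stdlib-only Python sketch of Example 2.5, with a hand-rolled Cholesky factorisation (the function names `cholesky` and `sample_mvn` are our own; a library routine would normally be used instead):

```python
import math
import random

def cholesky(S):
    """Lower-triangular A with A A^T = S, for a symmetric positive definite S."""
    n = len(S)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            A[i][j] = math.sqrt(S[i][i] - s) if i == j else (S[i][j] - s) / A[j][j]
    return A

def sample_mvn(mu, Sigma, rng=random):
    """X = A R + mu with R ~ N(0, I_n), so that X ~ N(mu, Sigma)."""
    n = len(mu)
    A = cholesky(Sigma)
    r = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # A is lower triangular, so row i only uses r[0..i]
    return [mu[i] + sum(A[i][k] * r[k] for k in range(i + 1)) for i in range(n)]

rng = random.Random(3)
mu, Sigma = [1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]]
xs = [sample_mvn(mu, Sigma, rng) for _ in range(100_000)]
m0 = sum(x[0] for x in xs) / len(xs)                           # ~ mu[0] = 1.0
c01 = sum((x[0] - 1.0) * (x[1] + 2.0) for x in xs) / len(xs)   # ~ Sigma[0][1] = 0.6
```

The empirical mean of the first coordinate and the empirical cross-covariance should match µ and Σ up to Monte Carlo error.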

2.2.3 Composition
Let Z ∼ Π be a random variable taking values in the set Z, where Π has a pdf or pmf denoted π(z). Suppose also that, given z, X|z ∼ Pz , where each Pz admits either a pmf or a pdf denoted pz (x). Then the marginal distribution P is a mixture distribution and, in the presence of pdf's or pmf's, we have

    p(x) = ∫ pz (x)π(z)dz   if π(z) is a pdf,
    p(x) = Σz pz (x)π(z)    if π(z) is a pmf.    (2.10)

Whether p(x) is a pmf or a pdf depends on whether pz (x) is a pmf or a pdf. The integral/sum
may be hard to evaluate, and the mixture distribution may be hard to sample directly.
But if we can easily sample from Π and from each Pz , then we can just

1. sample Z ∼ Π,

2. sample X ∼ PZ , and

3. ignore Z and return X.

The random number we produce in this way will be an exact sample from P , i.e. X ∼ P .
This is the method of composition. Ignoring Z is also called marginalisation, by which we
overcome the difficulty of dealing with the tough integral/sum in (2.10).

Example 2.6. The density of a mixture of Gaussian distributions with K components, with mean and variance values (µ1 , σ1²), . . . , (µK , σK²) and probability weights w1 , . . . , wK for its components (such that w1 + · · · + wK = 1), is given by

    p(x) = Σ_{k=1}^{K} wk φ(x; µk , σk²).

To sample from p(x), we first sample the component number k with probability wk (for example using the method of inversion), and given k, we sample X ∼ N (µk , σk²).

Example 2.7. A sales company decides to reveal the demand D for a product over a month. However, for privacy reasons, it shares this quantity by adding some noise to D, which results in the shared value X. It is given that the distribution of the revealed demand X has the pdf

    p(x) = Σ_d ( e^{−λ} λ^d / d! ) · ( 1/(2b) ) exp( −|x − d|/b ).
We want to perform a Monte Carlo simulation for this data sharing process. How do we
sample X ∼ P ?
Although p(x) looks hard, observe that the first factor in the sum is the pmf of PO(λ) evaluated at d (which can be viewed as the demand) and the second factor is the pdf of Laplace(d, b) evaluated at x (which can be viewed as the noisy demand).⁶ Therefore, generation
of X is possible by the method of composition as

1. Sample D ∼ PO(λ),

2. Sample X ∼ Laplace(D, b) (equivalent to V ∼ Laplace(0, b) and X = D + V .).

3. Ignore D and return X.

It is an exercise for you to figure out how one can sample from the Poisson and Laplace
distributions.
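The three composition steps above can be sketched in Python. This is a minimal illustration (the function names are our own; the Poisson sampler uses Knuth's product-of-uniforms method, and the Laplace sampler uses an exponential magnitude with a random sign):

```python
import math
import random

def sample_poisson(lam, rng=random):
    """PO(lam) via Knuth's method (fine for moderate lam)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k - 1

def sample_laplace(mu, b, rng=random):
    """Laplace(mu, b): exponential magnitude with a random sign."""
    e = -b * math.log(1.0 - rng.random())   # Exp with mean b
    return mu + (e if rng.random() < 0.5 else -e)

def sample_noisy_demand(lam, b, rng=random):
    d = sample_poisson(lam, rng)             # 1. D ~ PO(lam)
    return d + sample_laplace(0.0, b, rng)   # 2-3. X = D + V, discard D

rng = random.Random(4)
xs = [sample_noisy_demand(4.0, 1.0, rng) for _ in range(100_000)]
mean = sum(xs) / len(xs)   # E[X] = E[D] = lam, since the Laplace noise has mean 0
```

Marginalisation happens implicitly: the function returns only X, so D is ignored, and X is an exact draw from the mixture density p(x).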

2.2.4 Rejection sampling


Another common method of obtaining i.i.d. samples from P with density p(x) is rejection
sampling. This method was first mentioned in a 1947 letter by von Neumann (Eckhardt,
1987); it was also presented a few years later in von Neumann (1951). The method is
available when there exists an instrumental distribution Q with density q(x) such that

• q(x) > 0 whenever p(x) > 0, and

• There exists M > 0 such that p(x) ≤ M q(x) for all x ∈ X .


CHAPTER 2. EXACT SAMPLING METHODS 20

Algorithm 2.1: Rejection sampling

1. Generate X′ ∼ Q and U ∼ Unif(0, 1).
2. If U ≤ p(X′)/(M q(X′)), accept X = X′; else go to 1.

The rejection sampling method for obtaining one sample from P can be implemented with
any q(x) and M > 0 that satisfy the conditions above as in Algorithm 2.1.
How quickly do we obtain a sample with this method? Noting that the pdf of X′ is pX′ (x) = q(x), the acceptance probability can be derived as

    P(Accept) = ∫ P(Accept|X′ = x) pX′ (x)dx
              = ∫ (p(x)/(M q(x))) q(x)dx
              = (1/M) ∫ p(x)dx
              = 1/M,    (2.11)
which is also the long-run proportion of accepted samples over the number of trials. Therefore, taking q(x) as close to p(x) as possible to avoid large p(x)/q(x) ratios, and taking M = sup_x p(x)/q(x), are sensible choices to make the acceptance probability P(Accept) as high as possible.
The validity of the rejection sampling method can be verified by considering the distribution of the accepted samples. Using Bayes' theorem,

    pX (x) = pX′ (x|Accept) = pX′ (x) P(Accept|X′ = x) / P(Accept)
           = ( q(x) · (1/M) p(x)/q(x) ) / (1/M) = p(x).    (2.12)

Example 2.8. Suppose we want to sample X ∼ Γ(α, 1) with α > 1, where Γ(α, β) is the Gamma distribution with shape parameter α and scale parameter β. The density of Γ(α, 1) is

    p(x) = x^{α−1} e^{−x} / Γ(α),  x > 0.

As possible instrumental distributions, consider the family of exponential distributions Qλ = Exp(λ), 0 < λ < 1,⁷ with pdf

    qλ (x) = λe^{−λx},  x > 0.


⁶ The pmf of PO(λ) evaluated at k is e^{−λ} λ^k /k!, and the pdf of Laplace(µ, b) evaluated at x is (1/(2b)) exp(−|x − µ|/b).
⁷ For λ > 1, the ratio p(x)/qλ (x) is unbounded in x, hence rejection sampling cannot be applied.

Recall that M has to satisfy p(x) ≤ M q(x) for all x ∈ X ; therefore, given qλ (x), a sensible choice for Mλ is Mλ = sup_x p(x)/qλ (x), and we wish to use the λ which minimises the required Mλ . Given 0 < λ < 1, the ratio

    p(x)/qλ (x) = x^{α−1} e^{(λ−1)x} / (λ Γ(α))

is maximised at x = (α − 1)/(1 − λ), so we have

    Mλ = ( (α − 1)/(1 − λ) )^{α−1} e^{−(α−1)} / (λ Γ(α)),

resulting in the acceptance probability

    p(x)/(qλ (x) Mλ ) = ( x(1 − λ)/(α − 1) )^{α−1} e^{(λ−1)x + α−1}.

Now we minimise Mλ with respect to λ so that P(Accept) = 1/Mλ is maximised. Mλ is minimised at λ∗ = 1/α,⁸ yielding

    M∗ = α^α e^{−(α−1)} / Γ(α).

Overall, the rejection sampling algorithm we choose to sample from Γ(α, 1) is:

1. Sample X′ ∼ Exp(1/α) and U ∼ Unif(0, 1).
2. If U ≤ (X′/α)^{α−1} e^{(1/α−1)X′ + α−1}, accept X = X′; else go to 1.

See Figure 2.3 for the roles of the optimal choices of λ and M . Figure 2.4 illustrates the computational advantage of choosing λ optimally.
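The two-step sampler above can be written out as follows; a minimal Python sketch (the function name `sample_gamma` is our own), using inversion for the Exp(1/α) proposal:

```python
import math
import random

def sample_gamma(alpha, rng=random):
    """Rejection sampler for Gamma(alpha, 1), alpha > 1, with the Exp(1/alpha)
    proposal and the optimal bound M* derived above."""
    lam = 1.0 / alpha
    while True:
        x = -math.log(1.0 - rng.random()) / lam   # X' ~ Exp(1/alpha) by inversion
        # acceptance probability p(x)/(M* q(x)) = (x/alpha)^(alpha-1) e^{(1/alpha-1)x + alpha-1}
        acc = (x / alpha) ** (alpha - 1.0) * math.exp((lam - 1.0) * x + alpha - 1.0)
        if rng.random() <= acc:
            return x

rng = random.Random(5)
xs = [sample_gamma(2.0, rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)                          # E[X] = alpha = 2
var = sum((x - mean) ** 2 for x in xs) / len(xs)  # Var[X] = alpha = 2
```

For α = 2 the sample mean and variance should both be near 2, and on average about M∗ = 4e⁻¹ ≈ 1.47 proposals are needed per accepted sample.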

When p(x) is known up to a normalising constant

One advantage of rejection sampling is that we can implement it even when we know p(x) and q(x) only up to some proportionality constants Zp and Zq , that is, when

    p(x) = p̂(x)/Zp ,   Zp = ∫ p̂(x)dx,    (2.13)
    q(x) = q̂(x)/Zq ,   Zq = ∫ q̂(x)dx.    (2.14)

(Usually q(x) is fully known, in which case the following should be read with q̂(x) = q(x) and Zq = 1.) It is easy to check that one can perform the rejection sampling method as in Algorithm 2.2 for any M such that p̂(x) ≤ M q̂(x) for all x ∈ X .
Justification of Algorithm 2.2 follows from steps similar to those in (2.12); in that case, the acceptance probability is (1/M)(Zp /Zq ).
⁸ That is why we constrain α > 1; otherwise λ∗ would be greater than 1, yielding an unbounded ratio.
[Figure omitted, two panels: left, p(x) for Γ(2, 1) with the envelopes M∗ qλ∗ (x) and M qλ∗ (x) for M = 1.5M∗ ; right, p(x) with the envelopes for λ = 0.50, 0.10, 0.70.]

Figure 2.3: Rejection sampling for Γ(2, 1)

[Figure omitted: histograms of accepted samples generated using λ = 0.5 and λ = 0.01.]

Figure 2.4: Rejection sampling for Γ(2, 1): histograms with λ = 0.5 (68061 accepted samples out of 10⁵ trials) and λ = 0.01 (2664 accepted samples out of 10⁵ trials).

Algorithm 2.2: Rejection sampling with unnormalised densities

1. Generate X′ ∼ Q and U ∼ Unif(0, 1).
2. If U ≤ p̂(X′)/(M q̂(X′)), accept X = X′; else go to 1.

Example 2.9. Sometimes we want to sample from truncated versions of well known distributions, i.e. where X is confined to an interval with density proportional to the density of the original distribution on that interval. For example, take the truncated standard normal distribution Na (0, 1) with density

    p(x) = φ(x; 0, 1) I(|x| ≤ a) / ∫_{−a}^{a} φ(y; 0, 1)dy    (2.15)
         = p̂(x)/Zp ,    (2.16)

where p̂(x) = φ(x; 0, 1) I(|x| ≤ a) and Zp = ∫_{−a}^{a} φ(y; 0, 1)dy. We can perform rejection sampling using q(x) = φ(x; 0, 1) (that is, q̂(x) = q(x) and Zq = 1). Since p̂(x)/φ(x; 0, 1) = I(|x| ≤ a) ≤ 1, we can choose M = 1. Since Zq = 1, the acceptance probability is (1/M)(Zp /Zq ) = ∫_{−a}^{a} φ(y; 0, 1)dy.
The rejection sampling method for this distribution reduces to sampling Y ∼ N (0, 1) and accepting X = Y if |Y | ≤ a, which is intuitive.
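In code, this reduces to a two-line accept/reject loop. A minimal Python sketch (the function name `sample_trunc_normal` is our own):

```python
import random

def sample_trunc_normal(a, rng=random):
    """N_a(0,1): propose Y ~ N(0,1) and accept X = Y when |Y| <= a (M = 1)."""
    while True:
        y = rng.gauss(0.0, 1.0)
        if abs(y) <= a:
            return y

rng = random.Random(6)
xs = [sample_trunc_normal(1.0, rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)   # symmetric truncation, so the mean is ~ 0
```

Every returned sample lies in [−a, a] by construction, and the symmetric truncation keeps the mean at 0.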
Example 2.10. The unknown normalising constant issue mostly arises in Bayesian inference when we want to sample from the posterior distribution. The posterior density of X given Y = y satisfies

    pX|Y (x|y) ∝ pX (x) pY |X (y|x),    (2.17)

where the normalising constant pY (y) = ∫ pX (x) pY |X (y|x)dx is usually intractable. Suppose we want to sample from pX|Y (x|y). When pX|Y (x|y) is not the density of a well known distribution, we may be able to use rejection sampling. If we can find M > 0 such that pY |X (y|x) ≤ M for all x ∈ X , and the prior distribution PX with density pX (x) is easy to sample from, then we can use rejection sampling with Q such that q(x) = pX (x).
1. Sample X′ ∼ Q and U ∼ Unif(0, 1).

2. If U ≤ pY |X (y|X′)/M , accept X = X′; otherwise go to step 1.

Squeezing

The drawback of rejection sampling is that in practice a rejection based procedure is usually not viable when X is high-dimensional, since P(Accept) gets smaller and more computation is required to evaluate acceptance probabilities as the dimension increases. In the literature there exist approaches to improve the computational efficiency of rejection sampling. For example, assuming the densities exist, when p(x) is difficult to compute, tests like u ≤ (1/M) p(x)/q(x) can be slow to evaluate. In this case, one may use a squeezing function s : X → [0, ∞) such that s(x)/q(x) is cheap to evaluate and s(x)/p(x) is tightly bounded from above by 1. For such an s, not only does u ≤ (1/M) s(x)/q(x) guarantee u ≤ (1/M) p(x)/q(x), and hence acceptance, but also, whenever u ≤ (1/M) p(x)/q(x), the cheaper test u ≤ (1/M) s(x)/q(x) holds with high probability. Therefore, in case of acceptance, evaluation of p(x)/q(x) can largely be avoided by checking u ≤ (1/M) s(x)/q(x) first. In Marsaglia (1977), the author proposed to squeeze p(x) from above and below by q(x) and s(x) respectively, where q(x) is easy to sample from and s(x) is easy to evaluate. There are also adaptive methods to squeeze p(x) from both below and above; they involve an adaptive scheme to gradually modify q(x) and s(x) using the samples that have already been obtained (Gilks, 1992; Gilks and Wild, 1992; Gilks et al., 1995).

Exercises
1. Use change of variables to show that X defined in (2.3) in Example 2.3 is distributed
from Unif(a, b).

2. Suggest a way to sample from PO(λ) using uniform random numbers.

3. Suggest a way to sample from Laplace(a, b) using uniform random numbers. (Hint: notice the similarity between the Laplace distribution and the exponential distribution.)

4. Show that the modified rejection sampling method for unnormalised densities described in Section 2.2.4 is valid, i.e. the accepted sample satisfies X ∼ P , and the method has the acceptance probability (1/M)(Zp /Zq ) as claimed. The derivation is similar to those in (2.11), (2.12).

5. Write your own function that takes a vector of non-negative numbers w = [w1 . . . wK ] of any size and outputs a size1 × size2 matrix X of i.i.d. integers in {1, . . . , K}, where each entry equals k with probability proportional to wk (i.e. the weights may not sum to 1). In MATLAB, your function should look something like [X] = randsamp(w, size1, size2).

6. Learn the polar rejection method, another method used to sample from N (0, 1). Write two different functions that produce a specified number (an input to the function) of i.i.d. standard normal random variables: one using the Box-Muller method and the other using the polar rejection method. Plot the histograms of 10⁶ samples obtained from each function; make sure nothing strange happens in your code. Compare the speeds of your functions. Which method is faster? Why do you think that is?

7. Write your own function for generating a given number N of samples (specified as an
input argument) from a multivariate normal distribution N (µ, Σ) with given mean
vector µ and covariance matrix Σ as inputs.

8. Derive the rejection sampling method for Beta(a, b), a, b ≥ 1, using the uniform distribution as the instrumental distribution. Write a function that implements this method. Is it still possible to use the uniform distribution as Q when a < 1 or b < 1? Why or why not?
Chapter 3

Monte Carlo Estimation


Summary: This is a small chapter on the use of Monte Carlo to estimate certain quantities regarding a given distribution. Specifically, we will look at the importance sampling method for Monte Carlo integration.
Let’s go back to the beginning and consider the expectation in (1.3) once again:

    P (ϕ) = EP (ϕ(X)) = ∫_X ϕ(x) p(x)dx.

In order to estimate P (ϕ) by the plug-in estimator (1.8),

    P^N_MC (ϕ) = (1/N) Σ_{i=1}^{N} ϕ(X (i) ),    (3.1)

we need i.i.d. samples from P , and in the previous chapter we covered some exact sampling methods for generating X (i) ∼ P , i = 1, . . . , N .
However, there are many cases where sampling X ∼ P is either impossible, too difficult, or wasteful. For example, rejection sampling uses only about 1/M of the generated random samples to construct an approximation to P . In order to generate N samples, we need on average N M iterations of rejection sampling. The number M can be very large, especially in high dimensions, and rejection sampling may be wasteful.

3.1 Importance sampling


In contrast to rejection sampling, importance sampling uses every sample but weights each
one according to the degree of similarity between the target and instrumental distributions.
We describe the importance sampling method for continuous variables where P has a pdf p(x); the discrete version should be easy to figure out afterwards.
Suppose there exists a distribution Q with density q(x) such that q(x) > 0 whenever p(x) > 0. Given p(x) and q(x), define the weight function w : X → R as

    w(x) := p(x)/q(x) if q(x) > 0,  and w(x) := 0 if q(x) = 0.    (3.2)


The idea of importance sampling follows from the importance sampling fundamental identity (Robert and Casella, 2004): we can rewrite P (ϕ) as

    P (ϕ) = EP (ϕ(X)) = ∫_X ϕ(x) p(x)dx
          = ∫_X ϕ(x) (p(x)/q(x)) q(x)dx
          = ∫_X ϕ(x) w(x) q(x)dx
          = EQ (ϕ(X)w(X)) = Q(ϕw),

where ϕw stands for the product of the functions ϕ and w. This identity can be used with a Q which is easy to sample from, which leads to the importance sampling method given in Algorithm 3.1.

Algorithm 3.1: Importance sampling

1. for i = 1, . . . , N do
2.    Sample X (i) ∼ Q, and calculate w(X (i) ) according to (3.2).
3. Calculate the approximation of the expectation P (ϕ) as

       P^N_IS (ϕ) := (1/N) Σ_{i=1}^{N} ϕ(X (i) ) w(X (i) ).    (3.3)

The weights w(X (i) ) are known as the importance sampling weights. Note that P^N_IS (ϕ) is another plug-in estimator, but for a different distribution and function: namely, it is the plug-in estimator of Q(ϕw). Therefore the estimator in (3.3) is unbiased and justified by the strong law of large numbers and the central limit theorem, provided that Q(ϕw) = EQ (ϕ(X)w(X)) and VQ [w(X)ϕ(X)] are finite.
Example 3.1. Suppose we have two variables (X, Y ) ∈ X × Y with joint pdf pX,Y (x, y).
As we recall, we can write the joint pdf as

pX,Y (x, y) = pX (x)pY |X (y|x)

In the Bayesian framework where X is the unknown parameter and Y is the observed variable (or data), pX (x) is called the prior density and it is usually easy to sample from, and pY |X (y|x) is the conditional density of the data, or the likelihood, which is easy to compute.¹
¹ In fact, this is how one usually constructs the joint pdf in the Bayesian framework: first define the prior X ∼ pX (x), then define the data likelihood Y |X = x ∼ pY |X (y|x), so that pX,Y (x, y) is constructed as above. When the starting point is to define the prior and the likelihood, it is notationally convenient to define the marginal and conditional pdfs µ(x) := pX (x) and g(y|x) := pY |X (y|x) and write p(x, y) = µ(x)g(y|x), p(x|y) ∝ µ(x)g(y|x), p(y) = ∫ µ(x)g(y|x)dx, etc.
In certain applications, we want to compute the evidence pY (y) at a given value y of the data. We can write pY as

    pY (y) = ∫_X pX,Y (x, y)dx    (3.4)
           = ∫_X pX (x) pY |X (y|x)dx    (3.5)
           = EPX (pY |X (y|X)),    (3.6)

where the last line highlights the crucial observation that, given y, the likelihood can be thought of as a function ϕ(x) = pY |X (y|x), and pY (y) can be written as an expectation of ϕ(X) with respect to the prior with density pX (x). Therefore, pY (y) can be estimated using a plug-in estimator where we sample X (1) , . . . , X (N ) ∼ pX (x) and estimate pY (y) as

    pY (y) ≈ (1/N) Σ_{i=1}^{N} pY |X (y|X (i) ),   X (1) , . . . , X (N ) ∼ pX (x).

However, we do not need to sample from pX (x). In fact, we can use importance sampling with an importance density q(x):

    pY (y) ≈ (1/N) Σ_{i=1}^{N} ( pX (X (i) ) / q(X (i) ) ) pY |X (y|X (i) ),   X (1) , . . . , X (N ) ∼ q(x).

Being able to approximate a marginal distribution as in pY (y) will have an important role
later on when we discuss sequential importance sampling methods.
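To make this concrete, here is a Python sketch on an assumed conjugate toy model (our own choice, not from the text): prior X ∼ N(0, 1) and likelihood y|x ∼ N(x, 1), for which the evidence pY(y) = φ(y; 0, 2) is known in closed form, so the importance sampling estimate can be checked against the truth:

```python
import math
import random

def phi(x, mu, var):
    """Pdf of N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

y = 1.5
rng = random.Random(7)
N = 200_000
est = 0.0
for _ in range(N):
    x = rng.gauss(y / 2.0, 1.0)                  # importance density q = N(y/2, 1)
    w = phi(x, 0.0, 1.0) / phi(x, y / 2.0, 1.0)  # weight: prior / importance density
    est += w * phi(y, x, 1.0)                    # times the likelihood p(y|x)
est /= N
truth = phi(y, 0.0, 2.0)   # exact evidence for this toy model
```

Here q is centred near the posterior mode rather than at the prior, which keeps the weights well behaved; the estimate should agree with the closed-form evidence to within Monte Carlo error.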

3.1.1 Variance reduction


As we have freedom to choose Q, we can control the variance of importance sampling (Robert and Casella, 2004):

    VQ [P^N_IS (ϕ)] = (1/N) VQ [w(X)ϕ(X)]
                    = (1/N) ( Q(w²ϕ²) − Q(wϕ)² )
                    = (1/N) ( Q(w²ϕ²) − P (ϕ)² ).

Therefore, minimising VQ [P^N_IS (ϕ)] is equivalent to minimising Q(w²ϕ²), which can be lower bounded as

    Q(w²ϕ²) ≥ Q(w|ϕ|)² = P (|ϕ|)²

using Jensen's inequality. Considering Q(w²ϕ²) = P (wϕ²), this bound is attainable if we choose q such that it satisfies

    w(x) = p(x)/q(x) = P (|ϕ|)/|ϕ(x)|,   x ∈ X , ϕ(x) ≠ 0.

This results in the optimum choice of q being

    q(x) = p(x) |ϕ(x)| / P (|ϕ|)

for points x ∈ X such that ϕ(x) ≠ 0, and the resulting minimum variance is given by

    min_Q VQ [P^N_IS (ϕ)] = (1/N) ( [P (|ϕ|)]² − [P (ϕ)]² ).
Note that this minimum value is 0 if ϕ(x) ≥ 0 for all x ∈ X . Therefore, importance sampling can in principle achieve a lower variance than perfect Monte Carlo. Of course, if we cannot compute P (ϕ) already, it is unlikely that we can compute P (|ϕ|). Also, it will be rare that we can easily simulate from the optimal Q even if we can construct it. Instead, we are guided to seek a Q close to the optimal one, but from which it is easy to sample.
Example 3.2. We wish to implement importance sampling in order to approximate E(ϕ(X)) where X ∼ P = N (µ, σ²). Instead of sampling from P directly, we want to sample from Qk = N (µ, σ²/k). We want to choose the best k for ϕ in terms of the variance of the importance sampling estimate. Recall that minimising VQk [P^N_IS (ϕ)] is equivalent to minimising Qk (wk²ϕ²), where wk (x) = p(x)/qk (x):

    Qk (wk²ϕ²) = ∫_X qk (x) (p(x)²/qk (x)²) ϕ(x)² dx = ∫_X (p(x)²/qk (x)) ϕ(x)² dx.

The ratio p(x)²/qk (x) is

    p(x)²/qk (x) = (1/(2πσ²)) e^{−(x−µ)²/σ²} · (√(2πσ²)/√k) e^{(k/2)(x−µ)²/σ²}
                 = (1/√k) (1/√(2πσ²)) e^{−(1/2)(2−k)(x−µ)²/σ²}.

This ratio diverges when k > 2, and unless ϕ(x)² balances it, the second moment Qk (wk²ϕ²) diverges. Therefore, let us confine k to k ∈ (0, 2). In that case, we can rewrite

    p(x)²/qk (x) = (1/√(k(2−k))) · (√(2−k)/√(2πσ²)) e^{−(1/2)(2−k)(x−µ)²/σ²}
                 = (1/√(k(2−k))) q_{2−k} (x).

Therefore,

    Qk (wk²ϕ²) = (1/√(k(2−k))) Q_{2−k} (ϕ²) = (1/√(k(2−k))) E_{Q_{2−k}} (ϕ(X)²).

When ϕ(x) = x and µ = 0, Q_{2−k} (ϕ²) = E_{Q_{2−k}} (X²) = σ²/(2 − k). Therefore, we need to minimise

    (1/√(k(2−k))) · σ²/(2 − k) = σ² (2 − k)^{−3/2} k^{−1/2}.

The minimum is attained at k = 1/2, where

    V_{Q_{1/2}} (P^N_IS (ϕ)) = (1/N) [ σ² (2 − k)^{−3/2} k^{−1/2} ]_{k=1/2} = 0.7698 σ²/N.

The variance of the plug-in estimator P^N_MC (ϕ) for P (ϕ) is σ²/N , which is larger!

Effective sample size: An approximation of the variance of the importance sampling estimator is proposed in Kong et al. (1994):

    VQ [P^N_IS (ϕ)] ≈ (1/N) VP [ϕ(X)] {1 + VQ [w(X)]}
                   = VP [P^N_MC (ϕ)] {1 + VQ [w(X)]}.

This approximation might be confusing at first, since it suggests that the variance of importance sampling is always greater than that of perfect Monte Carlo, which we have just seen is not the case. However, it is useful as it provides an easy way of monitoring the efficiency of the importance sampling method. Consider the ratio of the variances of the importance sampling method with N particles and perfect Monte Carlo with N′ particles, which according to this approximation is

    VQ [P^N_IS (ϕ)] / VP [P^{N′}_MC (ϕ)] ≈ (N′/N) {1 + VQ [w(X)]}.

The number N′ for which this ratio is 1 suggests how many samples of perfect Monte Carlo would be equivalent to N samples of importance sampling. For this reason, this number is defined as the effective sample size (Kong et al., 1994; Liu, 1996) and is given by

    Neff = N / (1 + VQ [w(X)]).

In practice, the term VQ [w(X)] itself is estimated using the samples X (1) , . . . , X (N ) with weights w(X (1) ), . . . , w(X (N ) ) obtained from the importance sampling method.
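A common empirical estimate of the effective sample size, consistent with the formula above when the weights average to 1 under Q, is (Σ w)²/Σ w². A minimal sketch (the function name is our own):

```python
def effective_sample_size(weights):
    """Empirical effective sample size (sum w)^2 / (sum w^2), which estimates
    N / (1 + V_Q[w(X)]) when the weights have mean 1 under Q."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

# Equal weights recover N exactly; one dominant weight collapses toward 1.
ess_uniform = effective_sample_size([1.0] * 100)              # = 100
ess_degenerate = effective_sample_size([1.0] + [1e-12] * 99)  # ~ 1
```

The estimate is invariant to rescaling the weights, so it can also be applied to the unnormalised weights of the self-normalised method described next.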

3.1.2 Self-normalised importance sampling


Like rejection sampling, the importance sampling method can be modified for the cases when p(x) = p̂(x)/Zp and/or q(x) = q̂(x)/Zq and we only have p̂(x) and q̂(x). This time, letting

    w(x) := p̂(x)/q̂(x) if q̂(x) > 0,  and w(x) := 0 if q̂(x) = 0,

observe that

    Q(w) = EQ (w(X)) = ∫ (p̂(x)/q̂(x)) q(x)dx = ∫ (p(x)Zp /(q(x)Zq )) q(x)dx = Zp /Zq ,

and

    Q(wϕ) = EQ (w(X)ϕ(X)) = ∫ (p̂(x)/q̂(x)) ϕ(x) q(x)dx = ∫ (p(x)Zp /(q(x)Zq )) ϕ(x) q(x)dx = P (ϕ) Zp /Zq .

Therefore, we can write the importance sampling fundamental identity in terms of p̂ and q̂ as

    P (ϕ) = Q(wϕ)/(Zp /Zq ) = Q(wϕ)/Q(w).
The importance sampling method can be modified to approximate both the numerator, the unnormalised estimate, and the denominator, the normalising constant, by Monte Carlo. Sampling X (1) , . . . , X (N ) from Q, we have the approximation

    P^N_IS (ϕ) = ( (1/N) Σ_{i=1}^{N} ϕ(X (i) ) w(X (i) ) ) / ( (1/N) Σ_{i=1}^{N} w(X (i) ) ) = Σ_{i=1}^{N} W (i) ϕ(X (i) ),    (3.7)

where

    W (i) = w(X (i) ) / Σ_{j=1}^{N} w(X (j) )
are called the normalised importance weights, as they sum to 1. The resulting method, called self-normalised importance sampling, is given in Algorithm 3.2. Being the ratio of two unbiased estimators, the self-normalised importance sampling estimator is biased for finite N . However, its consistency and stability are provided by a strong law of large numbers and a central limit theorem in Geweke (1989). In the same work, the variance of the self-normalised importance sampling estimator is analysed and an approximation is provided, which reveals that it can provide lower variance estimates than the unnormalised importance sampling method. Also, self-normalised importance sampling has the nice property of estimating a constant function exactly, unlike the unnormalised importance sampling method. Therefore, this method can be preferable to its unnormalised version even if it is not the case that P and Q are known only up to proportionality constants.
Self-normalised importance sampling is also called Bayesian importance sampling in Geweke (1989), since in most Bayesian inference problems the normalising constant of the posterior distribution is unknown.

Algorithm 3.2: Self-normalised importance sampling

1. for i = 1, . . . , N do
2.    Generate X (i) ∼ Q, calculate w(X (i) ) = p̂(X (i) )/q̂(X (i) ).
3. for i = 1, . . . , N do
4.    Set W (i) = w(X (i) ) / Σ_{j=1}^{N} w(X (j) ).
5. Calculate the approximation to the expectation

       P^N_IS (ϕ) = Σ_{i=1}^{N} W (i) ϕ(X (i) ).

Example 3.3. Let us consider the posterior distribution in Example 2.10

pX|Y (x|y) ∝ pX (x)pY |X (y|x)


R
and the unknown normalising constant is pY (y) = pX (x)pY |X (y|x)dx. Given the data
Y = y, we want to calculate the expectation of ϕ : X → R with respect to pX|Y (x|y)
Z
PX (ϕ|Y = y) = E(ϕ(X)|Y = y) = pX|Y (x|y)ϕ(x)dx.

Since we know pX|Y (x|y) only up to a proportionality constant, we use self-normalised


importance sampling. With the choice of Q with density q(x), self-normalised importance
sampling becomes
1. For i = 1, . . . , N : generate X^(i) ∼ Q, calculate

       w(X^(i)) = pX (X^(i)) pY |X (y|X^(i)) / q(X^(i)).

2. For i = 1, . . . , N : set W^(i) = w(X^(i)) / Σ_{j=1}^N w(X^(j)).

3. Approximate E(ϕ(X)|Y = y) ≈ Σ_{i=1}^N W^(i) ϕ(X^(i)).
If we choose q(x) = pX (x), i.e. the prior density, then w(x) = pY |X (y|x) reduces to the
likelihood. But this is not always a good idea as we will see in the next example.
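The three numbered steps can be sketched in code. The document's exercises use MATLAB; the following Python version is an illustrative sketch of ours, not part of the text. It computes the weights in log-space for numerical stability, and the toy model used to check it (prior N(0, 1), likelihood y|x ∼ N(x, 1) with y = 2, whose exact posterior is N(1, 1/2)) is our own choice.

```python
import numpy as np

def snis_posterior_mean(log_prior, log_lik, sample_q, log_q, phi, N, rng):
    """Self-normalised importance sampling estimate of E[phi(X) | Y = y].

    log_prior, log_lik, log_q are log-densities evaluated pointwise (up to
    constants, which cancel after normalisation); sample_q draws from Q."""
    x = np.array([sample_q(rng) for _ in range(N)])
    log_w = log_prior(x) + log_lik(x) - log_q(x)   # unnormalised log-weights
    log_w -= log_w.max()                           # stabilise before exponentiating
    W = np.exp(log_w)
    W /= W.sum()                                   # normalised weights W^(i)
    return np.sum(W * phi(x))

# Toy check with proposal Q = prior, so the weight reduces to the likelihood.
# The exact posterior here is N(1, 1/2), so the estimate should be close to 1.
rng = np.random.default_rng(0)
est = snis_posterior_mean(
    log_prior=lambda x: -0.5 * x**2,
    log_lik=lambda x: -0.5 * (2.0 - x) ** 2,
    sample_q=lambda r: r.normal(),
    log_q=lambda x: -0.5 * x**2,
    phi=lambda x: x,
    N=100_000, rng=rng)
```

Subtracting the maximum log-weight before exponentiating changes nothing after normalisation but avoids underflow when the weights are tiny.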
Example 3.4. Suppose we have an unknown mean parameter X ∈ R whose prior dis-
tribution is represented by X ∼ N (µ, σ 2 ). Conditional on X = x, n data samples
Y = (Y1 , . . . , Yn ) ∈ Rn are generated independently
i.i.d.
Y1 , . . . , Yn |X = x ∼ Unif(x − a, x + a).

We want to estimate the posterior mean of X given Y = y = (y1 , . . . , yn ), i.e. E(X|Y =
y) = ∫ pX|Y (x|y)x dx, where

    pX|Y (x|y) ∝ pX (x)pY |X (y|x).

The prior density and likelihood are pX (x) = φ(x; µ, σ^2) and pY |X (y|x) = Π_{i=1}^n (1/(2a)) I(x−a,x+a) (yi ),
so the posterior distribution can be written as

    pX|Y (x|y) ∝ φ(x; µ, σ^2) (1/(2a)^n) Π_{i=1}^n I(x−a,x+a) (yi ).

Densities pX (x) and pX,Y (x, y) versus x for a fixed Y = y = (y1 , . . . , yn ) with n = 10
generated from the marginal distribution of Y with a = 2, µ = 0, and σ 2 = 10 are given in
Figure 3.1. Note that the second plot is proportional to the posterior density.
We can use self-normalised importance sampling to estimate E(X|Y = y). The choice
of the importance density is critical here: Suppose we chose Q to be the prior distribution
for X, i.e. q(x) = φ(x; µ, σ 2 ). This is a valid choice, however if a is small and σ 2 is
relatively large, it is likely that the resulting weight function

    w(x) = (1/(2a)^n) Π_{i=1}^n I(x−a,x+a) (yi )

will end up being zero for most of the generated samples from Q, and it will be 1/(2a)^n for a few
samples. This results in a high variance in the importance sampling estimator. What is
worse, it is possible for all the weights to be zero, so that the denominator in (3.7) can
be zero. Therefore the estimator is a poor one.
Let ymax = maxi yi and ymin = mini yi . A careful inspection of pX|Y (x|y) reveals that
given y = (y1 , . . . , yn ), X must be contained in (ymax − a, ymin + a). In other words,
x ∈ (ymax − a, ymin + a) ⇔ x − a < yi < x + a, ∀i = 1, . . . , n
Therefore, a better importance density does not waste samples outside the interval (ymax −
a, ymin + a) and instead generates samples in that interval. As an example, we can choose Q =
Unif(ymax − a, ymin + a). With that choice, the weight function will be
    w(x) = { φ(x; µ, σ^2) (1/(2a)^n) / (1/(2a + ymin − ymax )),   x ∈ (ymax − a, ymin + a)
           { 0,                                                    else
Note that since we are using the self-normalised importance sampling estimator and hence
we normalise the weights W^(i) = w(X^(i)) / Σ_{j=1}^N w(X^(j)), we do not need to calculate the
constant factor (2a + ymin − ymax )/(2a)^n for the weights.
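As a sketch of this comparison (in Python rather than the MATLAB used for the figures; the variable names and random seed are our own), the following simulates one data set from the model with the example's values a = 2, µ = 0, σ^2 = 10, n = 10 and computes the self-normalised estimator under both proposals:

```python
import numpy as np

rng = np.random.default_rng(1)
a, mu, sigma2, n = 2.0, 0.0, 10.0, 10      # values used in the example

# Simulate one data set: X ~ N(mu, sigma2), Y_i | X=x ~ Unif(x-a, x+a).
x_true = rng.normal(mu, np.sqrt(sigma2))
y = rng.uniform(x_true - a, x_true + a, size=n)
ymin, ymax = y.min(), y.max()

def snis_mean(xs, w):
    """Self-normalised posterior-mean estimate from samples and raw weights."""
    return np.nan if w.sum() == 0 else np.sum(w * xs) / w.sum()

N = 10_000
# Proposal 1: the prior. The weight is the likelihood, an indicator up to a constant.
xs1 = rng.normal(mu, np.sqrt(sigma2), size=N)
w1 = ((xs1 > ymax - a) & (xs1 < ymin + a)).astype(float)

# Proposal 2: Unif(ymax - a, ymin + a). The weight is proportional to the prior density.
xs2 = rng.uniform(ymax - a, ymin + a, size=N)
w2 = np.exp(-0.5 * (xs2 - mu) ** 2 / sigma2)

est_prior, est_unif = snis_mean(xs1, w1), snis_mean(xs2, w2)
```

Running this many times and comparing the spread of the two estimators reproduces the qualitative behaviour shown in Figure 3.2: the prior proposal wastes most of its samples outside the support of the posterior.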
Figure 3.2 compares the importance sampling estimators with the two different im-
portance distributions mentioned above. The histograms are generated from 10000 Monte
Carlo runs (10000 independent estimates of the posterior mean) for each estimator. Ob-
serve that the estimates obtained when the importance distribution is the prior are more
widespread, exhibiting a higher variance.

Figure 3.1: pX (x) and pX,Y (x, y) vs x for the problem in Example 3.4 with n = 10 and
a=2

[Histogram panels: importance distribution = prior, variance 0.00089; importance distribution = uniform, variance 0.00002.]

Figure 3.2: Histograms for the estimate of the posterior mean using two different impor-
tance sampling methods as described in Example 3.4 with n = 10 and a = 2.

Exercises
1. Consider Example 3.2, where importance sampling for N (µ, σ^2) is discussed.

   • This time, take µ = 0 and ϕ(x) = x^2. Find the optimum k for this ϕ and
     calculate the gain due to variance reduction compared to the plug-in estimator
     P^N_MC(ϕ).
   • Implement importance sampling (e.g., in MATLAB) for both ϕ(x) = x and
     ϕ(x) = x^2, and verify that in each case the variance of the IS estimator is
     lower than that of the plug-in estimator P^N_MC(ϕ). Verify also that the k = 1/2
     estimator is inferior for calculating the second moment, and likewise the k = 1/3
     estimator is inferior for the first moment.
2. This example is based on Project Evaluation and Review Technique (PERT), a
project planning tool.2 Consider the software project described in Table 3.1 with
10 tasks (activities), indexed by j = 1, . . . , 10. The project is completed when all
of the tasks are completed. A task can begin only after all of its predecessors have
been completed. The project starts at time 0. Task j starts at time Sj , takes time Tj

j Task Predecessors mean duration θj


1 Planning None 4
2 Database Design 1 4
3 Module Layout 1 2
4 Database Capture 2 5
5 Database Interface 2 2
6 Input Module 3 3
7 Output Module 3 2
8 GUI Structure 3 3
9 I/O Interface Implementation 5, 6, 7 2
10 Final Testing 4, 8, 9 2

Table 3.1: PERT: Project tasks, predecessor-successor relations, and mean durations

and ends at time Ej = Sj + Tj . Any task j with no predecessors (here only task 1)
starts at Sj = 0. The start time for a task with predecessors is the maximum of the
ending times of its predecessors. For example, S4 = E2 and S9 = max(E5 , E6 , E7 ).
The project as a whole ends at time E10 .

• Using predecessor-successor relations in Table 3.1, draw a diagram (for example,


an acyclic directed graph) that shows the predecessor-successor relations in this
example, with a node for each activity.
2
The example is largely taken from [Link] The
original source can be reached at [Link]

• Write a MATLAB function that takes duration times Tj , j = 1, . . . , 10 and


outputs the completion time for the project.
• Assume the Tj ∼ Exp(1/θj ) are independent exponentially distributed random vari-
ables with means θj given in the final column of the table. Simulate this project
(i.e. task durations and completion time) and generate N = 10000 independent
realisations of the completion time. Plot the histogram of completion times and
estimate the mean completion time.
• The completion time E10 can be seen as a function of task times X = (T1 , . . . , T10 ),
and the function is what you just coded above. Now suppose that there will
be a severe penalty should the project miss a deadline in 70 days time. Derive
the Monte Carlo estimator for P(E10 > 70) = E(I(E10 > 70)) and implement
it M = 1000 times with N = 10000 samples each. Out of the M = 1000 estimates,
calculate the sample variance of P^N_MC(ϕ) with ϕ(X) = I(E10 > 70).
• This time, estimate the same probability using importance sampling, taking Q
the distribution of independent task times that are exponentially distributed
with means λj (instead of θj ), that is
      X^(i) = (T1^(i), . . . , T10^(i)) ∼ Exp(1/λ1 ) × . . . × Exp(1/λ10 )

   Write down the expression for P^N_IS(ϕ) in terms of the λj ’s, θj ’s, and Tj^(i) ’s. Try
   λj = κθj for various values of κ to see if you can come up with a better estimator,
   that is, one with lower variance than the plug-in estimator P^N_MC(ϕ).
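A possible starter for the first three bullets is sketched below. The exercise asks for MATLAB; this is an illustrative Python version of ours, encoding the predecessor relations of Table 3.1 and simulating the completion times.

```python
import numpy as np

# Predecessor lists from Table 3.1 (task j -> its predecessors).
PRED = {1: [], 2: [1], 3: [1], 4: [2], 5: [2], 6: [3], 7: [3], 8: [3],
        9: [5, 6, 7], 10: [4, 8, 9]}
THETA = np.array([4, 4, 2, 5, 2, 3, 2, 3, 2, 2], dtype=float)  # mean durations

def completion_time(T):
    """Project completion time E10 given task durations T[0..9] (task j at T[j-1])."""
    E = {}
    for j in range(1, 11):                                # tasks are topologically ordered
        S = max((E[i] for i in PRED[j]), default=0.0)     # start = max end of predecessors
        E[j] = S + T[j - 1]
    return E[10]

rng = np.random.default_rng(2)
N = 10_000
T = rng.exponential(THETA, size=(N, 10))      # T_j ~ Exp(1/theta_j), i.e. mean theta_j
E10 = np.array([completion_time(row) for row in T])
mean_completion = E10.mean()                  # plug-in estimate of E[E10]
p_late = (E10 > 70).mean()                    # plug-in estimate of P(E10 > 70)
```

Since P(E10 > 70) is a rare-event probability here, the plug-in estimate is often exactly zero, which is precisely the motivation for the importance sampling variant in the last bullet.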

3. This is a simple example that illustrates the source localisation problem. We have a
source (or target) on the 2-D plane whose unknown location

X = (X(1), X(2)) ∈ R2

we wish to find. We collect distance measurements for the source using three sensors,
located at positions s1 , s2 , and s3 , see Figure 3.3. The measured distances Y =
(Y1 , Y2 , Y3 ), however, are noisy with independent normally distributed noises with
equal variance:3
Yi |X = x ∼ N (||x − si ||, σy2 ), i = 1, 2, 3,
where || · || denotes the Euclidean distance. Letting ri = ||x − si ||, the likelihood
evaluated at y = (y1 , y2 , y3 ) given x can be written as
3
Y
pY |X (y|x) = φ(yi ; ri , σy2 ) (3.8)
i=1

3
In this way we allow negative distances, which makes the normal distribution not the most appropriate
choice. However, for the sake of computational convenience, we overlook that in this example.


Figure 3.3: Source localisation problem with three sensors and one source

We do not have much a priori information about X; therefore we take the prior
distribution of X to be the bivariate normal distribution with zero mean vector and a
diagonal covariance matrix, X ∼ N (02 , σx^2 I2 ), so that the density is

pX (x) = φ(x(1); 0, σx2 )φ(x(2); 0, σx2 ). (3.9)

See Figure 3.4 for an illustration of prior, likelihood, and posterior densities for this
problem.
Given noisy measurements, Y = y = (y1 , y2 , y3 ), we want to locate X, so we are
interested in the posterior mean vector

E(X|Y = y) = [E(X(1)|Y = y), E(X(2)|Y = y)].

Write a function that takes y, positions of the sensors s1 , s2 , s3 , the prior and
likelihood variances σx2 and σy2 , and the number of samples N as inputs, implements
self-normalised importance sampling (why this version?) in order to approximate
E(X|Y = y) and outputs its estimate. Try your code with s1 = (0, 2), s2 = (−2, −1),
s3 = (1, −2), y1 = 2, y2 = 1.6, y3 = 2.5, σx2 = 100, and σy2 = 1 which are the values
used to generate the plots in Figure 3.4.
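A minimal Python sketch of the requested function (the exercise asks for exactly these inputs; the function and variable names are our own) uses the prior as the proposal, so the self-normalised weights reduce to the likelihood (3.8), evaluated in log-space:

```python
import numpy as np

def snis_posterior_mean_xy(y, sensors, sigma2_x, sigma2_y, N, rng):
    """Self-normalised IS estimate of E[X | Y = y] for the localisation model.

    Proposal = prior N(0_2, sigma2_x * I_2), so the self-normalised weight
    reduces to the likelihood (3.8), computed here in log-space."""
    x = rng.normal(0.0, np.sqrt(sigma2_x), size=(N, 2))              # prior samples
    r = np.linalg.norm(x[:, None, :] - sensors[None, :, :], axis=2)  # N x 3 distances
    log_w = -0.5 * np.sum((y - r) ** 2, axis=1) / sigma2_y           # log-lik. up to const
    W = np.exp(log_w - log_w.max())
    W /= W.sum()
    return W @ x                                                     # posterior-mean estimate

sensors = np.array([[0.0, 2.0], [-2.0, -1.0], [1.0, -2.0]])
y = np.array([2.0, 1.6, 2.5])
est = snis_posterior_mean_xy(y, sensors, sigma2_x=100.0, sigma2_y=1.0,
                             N=200_000, rng=np.random.default_rng(3))
```

Self-normalisation is needed here because the posterior is known only up to the evidence pY (y); the normalising constant of the likelihood-times-prior product cancels in the weights.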


Figure 3.4: Source localisation problem with three sensors and one source: The likelihood
terms, prior, and the posterior. The parameters and the variables are s1 = (0, 2), s2 =
(−2, −1), s3 = (1, −2), y1 = 2, y2 = 1.6, y3 = 2.5, σx2 = 100, and σy2 = 1
Chapter 4

Bayesian Inference
Summary: In this chapter, we provide a brief introduction to Bayesian statistics. Some
quantities of interest that are calculated from the posterior distribution will be explained.
We will see some examples where one can find the exact form of the posterior distribution.
In particular, we will discuss conjugate priors that are useful for deriving tractable posterior
distributions. This chapter also introduces a relaxation in the notation to be adopted in the
later chapters.

4.1 Conditional probabilities


Recall Bayes’ rule from Appendix A.3. Consider the probability space (Ω, F, P). Given
two sets A, B ∈ F, the conditional distribution of A given B is

    P(A|B) = P(A ∩ B)/P(B) = P(A)P(B|A)/P(B)        (4.1)

Here we see some examples where Bayes’ rule is in action to calculate posterior probabili-
ties.

Example 4.1 (Conditional probabilities of sets). A pair of fair (unbiased) dice are
rolled independently. Let the outcomes be X1 and X2 .

• It is observed that the sum S = X1 +X2 = 8. What is the probability that the outcome
of at least one of the dice is 3?
We apply Bayes' rule. Define the sets A = {(X1 , X2 ) : X1 = 3 or X2 = 3} and
B = {(X1 , X2 ) : S = 8}, so that the desired probability is P(A|B) = P(A ∩ B)/P(B).

    B = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)},   A ∩ B = {(3, 5), (5, 3)}.

Since the dice are fair, every outcome is equiprobable, having probability 1/36. There-
fore,

    P(A|B) = P(A ∩ B)/P(B) = (2/36)/(5/36) = 2/5.
• It is observed that the sum is even. What is the probability that the sum is smaller
than or equal to 4? Similarly, we define the sets A = {(X1 , X2 ) : X1 + X2 ≤ 4}.


B = {(X1 , X2 ) : X1 + X2 is even}. Explicitly, we have

    B = {(X1 , X2 ) : X1 , X2 are both even} ∪ {(X1 , X2 ) : X1 , X2 are both odd},
    A ∩ B = {(1, 1), (1, 3), (3, 1), (2, 2)}.

Therefore,

    P(A|B) = P(A ∩ B)/P(B) = (4/36)/(3/6 × 3/6 + 3/6 × 3/6) = 2/9.
Example 4.2 (Model selection). There are two coins in an urn, one fair and one biased
with probability of tail ρ = 0.3. Someone picks up one of the coins at random (with half
probability for picking up either coin) and tosses it n times and reports the outcomes:
D = (H, T, H, H, T, H, H, H, T, H). Conditional on D, what is the probability that the fair
coin was picked up?
We have two hypotheses (models): H1 : The coin picked up was the fair one, H2 : The
coin picked was the biased one. The prior probabilities for these models are the same:
P(H1 ) = P(H2 ) = 0.5. The likelihood of the data, that is, the conditional probability of the
outcomes, is

    P(D|Hi ) = { 1/2^10 ,            i = 1,
               { ρ^nT (1 − ρ)^nH ,   i = 2,
where nT and nH are the number of times the coin showed tail and head, respectively. From
Bayes’ rule, we have
    P(H1 |D) = P(D, H1 )/P(D) = P(H1 )P(D|H1 ) / [P(D|H1 )P(H1 ) + P(D|H2 )P(H2 )]
             = (1/2 × 1/2^10) / (1/2 × 1/2^10 + 1/2 × ρ^nT (1 − ρ)^nH )
             = (1/2^10) / (1/2^10 + ρ^nT (1 − ρ)^nH )
and, of course, P(H2 |D) = 1 − P(H1 |D). Substituting ρ = 0.3 and nT = 3, we have
P(H1 |D) = 0.3052 and P(H2 |D) = 0.6948.
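The numbers above can be reproduced in a few lines; the following Python snippet (ours, not part of the text) follows the derivation directly:

```python
rho, n = 0.3, 10
nT, nH = 3, 7                              # tails and heads in D

lik_fair = (1 / 2) ** n                    # P(D | H1)
lik_biased = rho ** nT * (1 - rho) ** nH   # P(D | H2)

# Equal prior probabilities P(H1) = P(H2) = 0.5 cancel in Bayes' rule.
post_fair = lik_fair / (lik_fair + lik_biased)
post_biased = 1.0 - post_fair
```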

4.2 Deriving Posterior distributions


In this section, we study posterior distributions and discuss their use. We introduce the
notion of conjugacy, a very important tool for deriving exact posterior distributions for
some likelihood models. Then, we will look at some useful inferential quantities that are
calculated from the posterior distribution.
When random variables X ∈ X , Y ∈ Y with joint pdf/pmf pX,Y (x, y) are considered,
the conditional pdf/pmf pX|Y (x|y) is
    pX|Y (x|y) = pX,Y (x, y)/pY (y) = pX (x)pY |X (y|x)/pY (y)        (4.2)

4.2.1 A note on future notational simplifications


It may be tedious to keep the subscripts in pdf’s or pmf’s such as pX,Y , pX|Y , etc. Formally,
this is necessary to indicate what random variables are considered and what probability
distribution exactly we mean. However, it is common practice to drop the cumbersome
subscripts and use p(x, y), p(x), p(x|y), etc. whenever it is clear from the context what
distribution we mean. We will also adopt this simplification in this document. For example,
we will frequently write Bayes’ rule as

    p(x|y) = p(x)p(y|x)/p(y)

It is also common to use densities as well as distributions to indicate the distribution of


a random variable. For example, all the expressions below mean the same thing: X is
distributed from the distribution P , whose pdf or pmf is p(x)

X ∼ P, X ∼ p(·), X ∼ p(x), x ∼ P, x ∼ p(·), x ∼ p(x).

In the rest of this document, we will use the aforementioned notations interchangeably,
choosing the most suitable one depending on the context.
When the statistical model is prone to misunderstandings if p is used for every-
thing, a nicer approach is to start from the beginning
with different letters such as f, g, h, µ, etc. for different pdf’s or pmf’s when constructing
the joint distribution for the random variables of interest.

Example 4.3. Consider random variables X, Y, Z, U and assume that Y and Z are con-
ditionally independent given X, and U is independent from X given Y and Z; see Figure
4.1. In such a case, it may be convenient to construct the joint density by first declaring
the density for X, µ(x). Next, define the conditional densities f (y|x) and g(z|x) for Y
given X and Z given X. Finally define the conditional density for U given Y and Z,
h(u|y, z). Now, we can generically use the letter p to express any desired density regarding
these variables. To start with, the joint density is

p(x, y, z, u) = µ(x)f (y|x)g(z|x)h(u|y, z)

Once we have the joint distribution p(x, y, z, u), we can derive anything else from it in


Figure 4.1: Directed acyclic graph showing the (hierarchical) dependency structure for
X, Y, Z, U .

terms of the densities we defined µ, f, g, h. Some examples:


    p(y, z, u) = ∫ p(x, y, z, u)dx = ∫ µ(x)f (y|x)g(z|x)h(u|y, z)dx = h(u|y, z) ∫ µ(x)f (y|x)g(z|x)dx

    p(x) = ∫∫∫ p(x, y, z, u)dy dz du = ∫∫∫ µ(x)f (y|x)g(z|x)h(u|y, z)dy dz du = µ(x)

    p(y, z, u|x) = p(x, y, z, u)/p(x) = f (y|x)g(z|x)h(u|y, z)

    p(u|x) = ∫∫ p(y, z, u|x)dy dz = ∫∫ f (y|x)g(z|x)h(u|y, z)dy dz

    p(x|u) = p(x)p(u|x)/p(u) = µ(x)p(u|x) / ∫ µ(x)p(u|x)dx

The dependency structure of this model can be exemplified with

x ∼ µ(·), y|x ∼ f (·|x), z|x, y ∼ g(·|x), u|x, y, z ∼ h(·|y, z), etc.

or in terms of densities

p(x) = µ(x), p(z|x, y) = p(z|x) = g(z|x), p(u|x, y, z) = p(u|y, z) = h(u|y, z), etc.

4.2.2 Conjugate priors


Consider the variables X, Y and Bayes’ theorem for p(x|y) in words,

posterior ∝ prior × likelihood.

In Bayesian statistics, the usual first step to build a statistical model is to decide on the
likelihood, i.e. the conditional distribution of the data given the unknown parameter. The
likelihood represents the model choice for the data and it should reflect the real stochastic
dynamics/phenomena of the data generation process as accurately as possible.
For convenience, it is common to choose a family of parametric distributions for the
data likelihood. With such choices x in p(y|x) becomes (some or all of the) parameters of
the chosen distribution. For example, X = (µ, σ 2 ) may be the unknown parameters of a

normal distribution from which the data samples Y1 , . . . , Yn are assumed to be distributed,
i.e. p(y1:n |x) = Π_{i=1}^n φ(yi ; µ, σ^2). As another example, let X = α be the shape parameter
of the gamma distribution Γ(α, β), with β known, so that p(y1:n |x) = Π_{i=1}^n β^α yi^{α−1} e^{−β yi} /Γ(α).
Bayesian inference for the unknown parameter requires assigning a prior distribution
to it. Given the family of distributions for the likelihood, it is sometimes useful to consider
a certain family of distributions for the prior distribution so that the posterior distribution
has the same form as the prior distribution but with different parameters, i.e. the posterior
distribution is in the same family of distributions as the prior. When this is the case,
the prior and posterior are then called conjugate distributions, and the prior is called a
conjugate prior for the likelihood p(y|x).

Example 4.4 (Success probability of the Binomial distribution). A certain coin


has P(T) = ρ where ρ is unknown. The prior distribution is X = ρ ∼ Beta(a, b). The
coin is tossed n times; if Y is the number of tails, then the conditional
distribution for Y is Y |ρ ∼ Binom(n, ρ). We want to find the posterior distribution of ρ
given Y = k successes out of n trials.
The posterior density is proportional to

    p(x|y) ∝ p(x)p(y|x) = [x^{a−1} (1 − x)^{b−1} / B(a, b)] × [n!/(k!(n − k)!)] x^k (1 − x)^{n−k}        (4.3)

where B(a, b) = ∫ x^{a−1} (1 − x)^{b−1} dx.
Before continuing with deriving the expression, first note the important remark that
our aim here is to recognise the form of the density of a parametric distribution for x in
(4.3). Therefore, we can get rid of any multiplicative term that does not depend on x. That
is why we could start with the joint density as p(x|y) ∝ p(x, y); in fact we can do more
simplification:

    p(x|y) ∝ x^{a+k−1} (1 − x)^{b+n−k−1}
Since we observe that this has the form of a beta distribution, we can conclude that the
posterior distribution has to be a beta distribution

X|Y = k ∼ Beta(ax|y , bx|y )

where, from similarity with the prior distribution, we conclude that ax|y = a + k and
bx|y = b + n − k.
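This conjugate update is easy to check numerically. The following Python sketch (with illustrative values a = 2, b = 3, n = 10, k = 7 of our choosing) compares the closed-form Beta posterior mean with a grid approximation of the unnormalised density in (4.3):

```python
import numpy as np

a, b = 2.0, 3.0        # illustrative prior Beta(a, b)
n, k = 10, 7           # k tails out of n tosses

# Conjugate update: posterior is Beta(a + k, b + n - k).
a_post, b_post = a + k, b + n - k
post_mean = a_post / (a_post + b_post)

# Numerical cross-check: posterior mean from a grid over (0, 1) using the
# unnormalised density x^{a+k-1}(1-x)^{b+n-k-1}; the constant cancels in the ratio.
x = np.linspace(1e-6, 1 - 1e-6, 100_000)
unnorm = x ** (a + k - 1) * (1 - x) ** (b + n - k - 1)
grid_mean = (x * unnorm).sum() / unnorm.sum()
```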

Example 4.5 (Mean parameter of the normal distribution). It is believed that


Y1:n = y1:n are samples from a normal distribution with unknown µ and known variance
σ 2 . We want to estimate µ from y1:n . The prior for X = µ is chosen as N (0, σx2 ), the
conjugate prior of the normal likelihood for the mean parameter. The joint density can be

written as

    p(x|y) ∝ p(x, y) = p(x)p(y|x)
           = (2πσx^2)^{−1/2} exp{−x^2/(2σx^2)} × Π_{i=1}^n (2πσ^2)^{−1/2} exp{−(yi − x)^2/(2σ^2)}
           ∝ exp{−x^2/(2σx^2) − (1/(2σ^2)) Σ_{i=1}^n (yi − x)^2}
           = exp{−x^2/(2σx^2) − (1/(2σ^2)) (Σ_{i=1}^n yi^2 + nx^2 − 2x Σ_{i=1}^n yi )}
           ∝ exp{−x^2/(2σx^2) − (1/(2σ^2)) (nx^2 − 2x Σ_{i=1}^n yi )}
           ∝ exp{−(1/2) [x^2 (1/σx^2 + n/σ^2) − 2x (1/σ^2) Σ_{i=1}^n yi ]}

Since we observe that this has the form of a normal density in x, we can conclude that the
posterior distribution has to be a normal distribution

    X|Y1:n = y1:n ∼ N (µx|y , σx|y^2)

for some µx|y and σx|y^2. In order to find them, compare the expression above with
φ(x; m, κ^2) ∝ exp{−(1/2)[x^2/κ^2 − 2x m/κ^2 + m^2/κ^2]}. Therefore, we must have

    σx|y^2 = (1/σx^2 + n/σ^2)^{−1} ,    µx|y = σx|y^2 × (1/σ^2) Σ_{i=1}^n yi .
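The update formulas can be sketched as follows; this Python snippet is illustrative, and the parameter values and seed are our own choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2, sigma2_x, n = 1.5, 2.0, 10.0, 50   # illustrative values
y = rng.normal(mu_true, np.sqrt(sigma2), size=n)

# Conjugate update derived above: posterior is N(mean_post, var_post).
var_post = 1.0 / (1.0 / sigma2_x + n / sigma2)
mean_post = var_post * y.sum() / sigma2
# With moderately large n, the posterior mean is close to the sample mean,
# shrunk slightly towards the prior mean 0; the posterior variance is below sigma2/n.
```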

Example 4.6 (Variance of the normal distribution). Consider the scenario in the
previous example above but this time µ is known and the variance σ 2 is unknown. The
prior for X = σ 2 is chosen as the conjugate prior of the normal likelihood for the variance
parameter, i.e. the inverse gamma distribution IG(α, β) with shape and scale parameters
α and β, having the probability density function

    p(x) = [β^α /Γ(α)] x^{−α−1} exp(−β/x).
The joint density can be written as

    p(x|y) ∝ p(x, y) = p(x)p(y|x)
           = [β^α /Γ(α)] x^{−α−1} exp(−β/x) × Π_{i=1}^n (2πx)^{−1/2} exp{−(yi − µ)^2/(2x)}
           ∝ x^{−α−n/2−1} exp{−[(1/2) Σ_{i=1}^n (yi − µ)^2 + β]/x}

Comparing this expression to the density of p(x), we observe that they have the same form
and therefore,
X|Y1:n = y1:n ∼ IG(αx|y , βx|y )
for some αx|y and βx|y . From similarity, we can conclude

    αx|y = α + n/2 ,    βx|y = β + (1/2) Σ_{i=1}^n (yi − µ)^2 .
Example 4.7 (Multivariate normal distribution). Let the likelihood for Y given X be
Y |X = x ∼ N (Ax, R), and let the prior for the unknown X be X ∼ N (m, S).
The posterior p(x|y) is

    p(x|y) ∝ p(x, y) = p(x)p(y|x)
           = |2πS|^{−1/2} exp{−(1/2)(x − m)^T S^{−1}(x − m)} × |2πR|^{−1/2} exp{−(1/2)(y − Ax)^T R^{−1}(y − Ax)}
           ∝ exp{−(1/2)(x^T S^{−1} x − 2m^T S^{−1} x + x^T A^T R^{−1} Ax − 2y^T R^{−1} Ax)}
           = exp{−(1/2)[x^T (S^{−1} + A^T R^{−1} A)x − 2(m^T S^{−1} + y^T R^{−1} A)x]},

which matches φ(x; mx|y , Sx|y ) ∝ exp{−(1/2)[x^T Sx|y^{−1} x − 2 mx|y^T Sx|y^{−1} x]}, where the posterior covariance is

    Sx|y = (S^{−1} + A^T R^{−1} A)^{−1}

and the posterior mean is

    mx|y = Sx|y (m^T S^{−1} + y^T R^{−1} A)^T = Sx|y (S^{−1} m + A^T R^{−1} y).
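These two formulas translate directly into code. In the Python sketch below, the dimensions, prior parameters, and noise covariance are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative dimensions: x in R^2, y in R^3.
A = rng.normal(size=(3, 2))
R = np.diag([0.5, 1.0, 2.0])            # observation noise covariance
m, S = np.zeros(2), 4.0 * np.eye(2)     # prior mean and covariance
x_true = rng.normal(size=2)
y = A @ x_true + rng.multivariate_normal(np.zeros(3), R)

S_inv, R_inv = np.linalg.inv(S), np.linalg.inv(R)
S_post = np.linalg.inv(S_inv + A.T @ R_inv @ A)    # posterior covariance S_{x|y}
m_post = S_post @ (S_inv @ m + A.T @ R_inv @ y)    # posterior mean m_{x|y}
# Conditioning on data can only shrink uncertainty: S_post <= S in the Loewner order.
```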

Computing the evidence: We saw that when conjugate priors are used for the prior,
then p(x) and p(x|y) belong to the same family, i.e. their pdf/pmf have the same form.
This is nice: since we know p(x), p(y|x), and p(x|y) exactly, we can compute the evidence
p(y) for a given y as

    p(y) = p(x, y)/p(x|y) = p(x)p(y|x)/p(x|y)
Example 4.8 (Success probability of the Binomial distribution - ctd). Consider
the setting in Example 4.4. Since we know pX|Y (x|y) and pX,Y (x, y) exactly, the evidence
pY (y) for y = k can be found as

    pY (k) = [x^{α−1} (1 − x)^{β−1} /B(α, β)] × [n!/(k!(n − k)!)] x^k (1 − x)^{n−k}
             / [x^{α+k−1} (1 − x)^{β+n−k−1} /B(α + k, β + n − k)]                    (4.4)
           = [n!/(k!(n − k)!)] × B(α + k, β + n − k)/B(α, β),                        (4.5)

which is the pmf, evaluated at k, of the Beta-Binomial distribution with trial parameter n
and shape parameters α and β.
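A sketch of this evidence computation in Python (our own helper names; computed via log-gamma to avoid overflow for large parameters) verifies that the resulting Beta-Binomial pmf sums to one over k = 0, . . . , n:

```python
from math import comb, exp, lgamma

def log_B(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """Evidence p(Y = k) from (4.5), computed in log-space for stability."""
    return comb(n, k) * exp(log_B(a + k, b + n - k) - log_B(a, b))

# Sanity check with illustrative values: the pmf must sum to one.
n, a, b = 10, 2.0, 3.0
total = sum(beta_binomial_pmf(k, n, a, b) for k in range(n + 1))
```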

4.3 Quantities of interest in Bayesian inference


In Bayesian statistics, the ultimate goal is the posterior distribution p(x|y) of the unknown
variable given the available data Y = y. There are several quantities one might be inter-
ested in; all of these quantities are derived from p(x|y). The following are some examples
of such quantities.

4.3.1 Posterior mean


If we want to have a point estimate of X, one quantity we can look at is the posterior
mean,

    E(X|Y = y) = ∫ p(x|y)x dx

Other than being an intuitive choice, E(X|Y ), as a random function of Y , is justified in


the frequentist setting as well, due to the fact that E(X|Y ) minimises the expected mean
squared error

    MSE = E([X − X̂(Y )]^2) = ∫ (x − X̂(y))^2 p(x, y)dx dy

where X̂(Y ) is the estimator for X and the expectation is taken with respect to the joint
distribution of X, Y .

Theorem 4.1. X̂(Y ) = E(X|Y ) minimises MSE.

In general, if we want to estimate ϕ(X) given Y , we can target the posterior mean of ϕ,

    E(ϕ(X)|Y = y) = ∫ p(x|y)ϕ(x)dx,

which minimises the expected mean squared error for ϕ(X),

    E([ϕ(X) − ϕ̂(Y )]^2) = ∫ (ϕ(x) − ϕ̂(y))^2 p(x, y)dx dy.


Although it has nice statistical properties as mentioned above, the posterior mean may
not always be a good choice. For example, suppose the posterior is a mixture of Gaussians
with pdf p(x|y) = 0.5φ(x; −10, 0.01) + 0.5φ(x; 10, 0.01). The posterior mean is 0, but the
density p(x|y) at 0 is almost 0 and the distribution has almost no mass around 0!

4.3.2 Maximum a posteriori estimation


Another point estimate that is derived from the posterior is the maximum a posteriori
(MAP) estimate which is the maximising argument of p(x|y)

    x̂MAP = arg max_{x∈X} p(x|y) = arg max_{x∈X} p(x, y).

Note that this procedure is different from maximum likelihood estimation (MLE), which
yields the maximising argument of the likelihood,

    x̂MLE = arg max_{x∈X} p(y|x),

since the MAP estimate contains the additional factor due to the prior p(x).

4.3.3 Posterior predictive distribution


Assume we are interested in the distribution that a new data point Yn+1 would have, given
a set of n existing observations Y1:n = y1:n . In a frequentist context, this might be derived
by computing the maximum likelihood estimate x̂MLE (or some other point estimate) of X
given y1:n , and then plugging it into the distribution function of the new observation Yn+1
so that the predictive distribution is p(yn+1 |x̂MLE ).
In a Bayesian context, the natural answer to this is the posterior predictive distribu-
tion, which is the distribution of unobserved observations (prediction) conditional on the
observed data p(yn+1 |y1:n ). In order to find the posterior predictive distribution, we make
use of the entire posterior distribution of the parameter(s) given the observed data to yield
a probability distribution rather than simply a point estimate. Specifically, we compute
p(yn+1 |y1:n ) by marginalising over the unknown variable x, using its posterior distribution:

    p(yn+1 |y1:n ) = ∫ p(yn+1 , x|y1:n )dx = ∫ p(yn+1 |x, y1:n )p(x|y1:n )dx

In many cases, Yn+1 is independent from Y1:n given X. This happens, for example,
when {Yi }i≥1 are i.i.d. given X, that is Yi |X = x ∼ p(y|x), i ≥ 1. In that case, the density
above reduces to

    p(yn+1 |y1:n ) = ∫ p(yn+1 |x)p(x|y1:n )dx

Note that this is equivalent to the expected value of the distribution of the new data point,
when the expectation is taken over the posterior distribution, i.e.:

p(yn+1 |y1:n ) = E[p(yn+1 |X)|Y1:n = y1:n ].



Conjugate priors and posterior predictive density: We saw that when conjugate
priors are used for the prior, then p(x) and p(x|y) belong to the same family, i.e. their
pdf/pmf have the same form. This implies that, when Yi ’s are i.i.d. conditional on X, the
posterior predictive density p(yn+1 |y1:n ) has the same form as the marginal density of a
single sample,

    p(y) = ∫ p(x)p(y|x)dx.

Example 4.9 (Success probability of the Binomial distribution - ctd). Consider


the setting in Example 4.4. Given the prior X ∼ Beta(α, β) and Y = k successes out of n
trials, what is the probability of having Z = r successes out of the next m trials?
Here Z is the next sample that is to be predicted. We can employ the posterior predictive
probability for Z. We know from the derivation of Example 4.8 that Z will be distributed
from the Beta-Binomial distribution with parameters m (trials), α′ = α + k, and β′ =
β + n − k, since the prior and the posterior of X are in the same form and Z given X = x
and Y given X = x are both Binomial.

    pZ|Y (r|k) = [m!/(r!(m − r)!)] × B(α′ + r, β′ + m − r)/B(α′, β′).
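The identity between this closed form and the expectation E[p(Z = r|X) | Y1:n ] can be checked by Monte Carlo; in the Python sketch below, the posterior parameters and seed are illustrative choices of ours:

```python
import numpy as np
from math import comb, exp, lgamma

rng = np.random.default_rng(6)
alpha_p, beta_p = 5.0, 4.0          # illustrative posterior parameters (alpha', beta')
m, r = 6, 2

# Monte Carlo version of E[ p(Z = r | X) | Y ]: draw X from the Beta posterior
# and average the Binomial pmf evaluated at r.
x = rng.beta(alpha_p, beta_p, size=200_000)
mc = np.mean(comb(m, r) * x**r * (1 - x) ** (m - r))

# Closed form: the Beta-Binomial pmf with parameters (m, alpha', beta').
def log_B(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

exact = comb(m, r) * exp(log_B(alpha_p + r, beta_p + m - r) - log_B(alpha_p, beta_p))
```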

Exercises
1. Consider the discrete random variables X ∈ {1, 2, 3} and Y ∈ {1, 2, 3, 4} whose joint
probabilities are given in Table 4.1

pX,Y (x, y) y = 1 y = 2 y = 3 y = 4 pX (x)


x=1 1/40 3/40 4/40 2/40
x=2 5/40 7/40 6/40 5/40
x=3 1/40 2/40 2/40 2/40
pY (y)

Table 4.1: Joint probability table

• Find the marginal probabilities pX (x) and pY (y) for all x = 1, 2, 3, y = 1, 2, 3, 4


and fill in the rest of Table 4.1.
• Find the conditional probabilities pX|Y (x|y) and pY |X (y|x) for all x = 1, 2, 3,
y = 1, 2, 3, 4 and fill in the relevant empty tables.

pX|Y (x|y) y = 1 y = 2 y = 3 y = 4
x=1
x=2
x=3

pY |X (y|x) y = 1 y = 2 y = 3 y = 4
x=1
x=2
x=3

2. Show that the gamma distribution is the conjugate prior of the exponential distri-
bution, i.e. if X ∼ Γ(α, β) and Y |X = x ∼ Exp(x), then X|Y = y ∼ Γ(αx|y , βx|y )
for some αx|y and βx|y . Find αx|y and βx|y in terms of α, β, and y.

3. Prove Theorem 4.1 [Hint: write the estimator as X̂(Y ) = E(X|Y )+(X̂(Y )−E(X|Y ))
and consider conditional expectation of the MSE given Y = y first. You should
conclude that for any y, X̂(y) − E(X|Y = y) should be zero.]

4. Suppose we observe a noisy sinusoid with period T and unknown amplitude X for n
steps: Yt |X = x ∼ N (f (t; x), σy^2) for t = 1, . . . , n, where f (t; x) = x sin(2πt/T ) is
the sinusoid. The prior for the amplitude is Gaussian: X ∼ N (0, σx^2).

• Find p(x|y1:n ) and p(y1:n ).


• What is distribution of f (n + 1, X) given Y1:n = y1:n ?

• Find p(yn+1 ) and p(yn+1 |y1:n ). Compare their variances. What can you say
about the difference between the variances?
• Generate your own samples Y1:n up to time n = 100, with period T = 40,
σx2 = 100, σy2 = 10. Calculate p(x|y1:n ); plot p(yn+1 ) and p(yn+1 |y1:n ) on the
same axis.
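A possible Python starter for the simulation part of this exercise is sketched below. The conjugate update follows Example 4.7 with A equal to the vector of sinusoid values s_t = sin(2πt/T ) and R = σy^2 I; the names and random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T, sigma2_x, sigma2_y = 100, 40, 100.0, 10.0   # values from the exercise
t = np.arange(1, n + 1)
s = np.sin(2 * np.pi * t / T)                     # s_t = sin(2*pi*t / T)

# Simulate data from the model: X ~ N(0, sigma2_x), Y_t | X=x ~ N(x s_t, sigma2_y).
x_true = rng.normal(0.0, np.sqrt(sigma2_x))
y = x_true * s + rng.normal(0.0, np.sqrt(sigma2_y), size=n)

# Linear-Gaussian conjugate update (Example 4.7 with A = s, R = sigma2_y * I):
var_post = 1.0 / (1.0 / sigma2_x + (s @ s) / sigma2_y)
mean_post = var_post * (s @ y) / sigma2_y
```

The posterior mean should recover the true amplitude to within a few posterior standard deviations.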
Chapter 5

Markov Chain Monte Carlo


Summary: In this chapter, we will see an essential and vast family of methods in Monte
Carlo, namely Markov chain Monte Carlo methods, for approximately sampling from com-
plex distributions. We will start the chapter with a review of discrete time Markov chains,
which is required for understanding the working principles of Markov chain Monte Carlo
methods. Then, we will see two most commonly used Markov chain Monte Carlo methods
in the literature: Metropolis-Hastings and Gibbs sampling methods.

5.1 Introduction
Remark 5.1 (Change of notation). So far we have used P and p to denote the distri-
bution and its pdf/pmf we are ultimately interested in. We will make a change of notation
here, and denote the distribution as well as its pdf/pmf as π. This change of notation is
necessary since p will be used generically to denote the pdf/pmf of various distributions.

We have already discussed the difficulties of generating a large number of i.i.d. samples
from π. One alternative was importance sampling which involved weighting every gener-
ated sample in order not to waste it, but it has its own drawbacks mostly due to issues
related to controlling variance. Another alternative is to use Markov chain Monte Carlo
(MCMC) methods (Metropolis et al., 1953; Hastings, 1970; Gilks et al., 1996; Robert and
Casella, 2004). These methods are based on the design of a suitable ergodic Markov chain
whose stationary distribution is π. The idea is that if one simulates such a Markov chain,
after a long enough time the samples of the Markov chain will be approximately distributed
according to π. Although the samples generated from the Markov chain are not i.i.d.,
their use is justified by convergence results for dependent random variables in the litera-
ture. First examples of MCMC can be found in Metropolis et al. (1953) and Hastings (1970),
and book-length reviews are available in Gilks et al. (1996) and Robert and Casella (2004).

5.2 Discrete time Markov chains


In order to adequately summarise the MCMC methodology, we first need to review
the theory of discrete time Markov chains defined on general state spaces. Discrete time
Markov chains also constitute an important part of the rest of this course, especially when
we discuss sequential Monte Carlo methods. The review made here is very brief and limited


by the relation of Markov chains to the topics covered in the course. For more details, one
can see Meyn and Tweedie (2009) or Shiryaev (1995); introductions more closely tied to Monte
Carlo methods can be found in Robert and Casella (2004, Chapter 6), Cappé et al. (2005,
Chapter 14), Tierney (1994), and Gilks et al. (1996, Chapter 4).

Definition 5.1 (Markov chain). A stochastic process {Xn}n≥1 on X is called a Markov
chain if its probability law is defined from the initial distribution η(x) and a sequence
of Markov transition (or transition, state transition) kernels (or probabilities, densities)
{Mn(x′|x)}n≥2 by the finite dimensional joint distributions

p(x1, . . . , xn) = η(x1)M2(x2|x1) · · · Mn(xn|xn−1)

for all n ≥ 1.

The random variable Xt is called the state of the chain at time t and X is called
the state-space of the chain. For uncountable X, we have a discrete-time continuous-
state Markov chain, and η(·) and Mn(·|xn−1) are pdf's¹. Similarly, if X is countable (finite
or infinite), then the chain is a discrete-time discrete-state Markov chain and η(·) and
Mn(·|xn−1) are pmf's. Moreover, when X = {x1, . . . , xm} is finite with m states, the
transition kernel can sometimes be expressed in terms of an m × m transition matrix
Mn(i, j) = P(Xn = j|Xn−1 = i).
The definition of the Markov chain leads to the characteristic property of a Markov
chain, which is also referred to as the (weak) Markov property: conditionally on the past,
the state of the chain at time n depends only on the previous state at time n − 1,

p(xn|x1:n−1) = p(xn|xn−1) = Mn(xn|xn−1).

From now on, we will consider time-homogenous Markov chains where Mn = M for all
n ≥ 2, and we will denote them as Markov(η, M ).

Example 5.1. The simplest examples of a Markov chain are those with a finite state-
space, say of size m. Then, the transition rule can be expressed by an m × m transition
probability matrix M, which in this example is the following:

M = [ 1/2   0    1/2
      1/4   1/2  1/4
      0     1    0   ]

The state-transition diagram of this Markov chain with m = 3 states is given in
Figure 5.1, where the state-space is simply {1, 2, 3}.
¹ In fact, there are exceptions where the transition kernels do not have a probability density; this is
indeed the case for the transition kernel of the Markov chain of the Metropolis-Hastings algorithm, which
we will see in Section 5.3. However, for the sake of brevity we ignore this technical issue and, with abuse
of notation, pretend as if we always have a density Mn(·|xn−1) for continuous states.
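Definition 5.1 translates directly into a simulator: draw X1 from η and then repeatedly draw Xn from M(·|Xn−1). Below is a minimal sketch (assuming NumPy) for the three-state chain of Example 5.1, whose long-run state frequencies anticipate the invariant distribution (1/4, 1/2, 1/4) found later in Example 5.8:

```python
import numpy as np

# Transition matrix of Example 5.1: M[i, j] = P(X_{n+1} = j + 1 | X_n = i + 1)
M = np.array([[0.5, 0.0, 0.5],
              [0.25, 0.5, 0.25],
              [0.0, 1.0, 0.0]])

def simulate_chain(M, x1, n_steps, rng):
    """Simulate a time-homogeneous Markov chain on states {0, ..., m-1}."""
    xs = np.empty(n_steps, dtype=int)
    xs[0] = x1
    for n in range(1, n_steps):
        # Draw the next state from the row of M indexed by the current state
        xs[n] = rng.choice(M.shape[0], p=M[xs[n - 1]])
    return xs

rng = np.random.default_rng(0)
xs = simulate_chain(M, x1=0, n_steps=200_000, rng=rng)
freqs = np.bincount(xs, minlength=3) / len(xs)
print(freqs)  # long-run state frequencies, close to (1/4, 1/2, 1/4)
```

The state labels are shifted to {0, 1, 2} merely for indexing convenience.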
Figure 5.1: State transition diagram of a Markov chain with 3 states, 1, 2, 3.



Figure 5.2: State transition diagram of the symmetric random walk on Z.

Example 5.2. Let X = Z be the set of integers, X1 = 0, and for n > 1 define Xn as

Xn = Xn−1 + Vn,

where Vn ∈ {−1, 1} with p = P(Vn = 1) = 1 − P(Vn = −1) = 1 − q. This is a random
walk (of step-size 1) on Z and it is a time-homogenous discrete-time discrete-state Markov
chain with η(x1) = δ0(x1) and

M(x′|x) = { p,  x′ = x + 1
            q,  x′ = x − 1

When p = q, the process is called a symmetric random walk.


Example 5.3. Let X = R, X1 = 0, and for n > 1 define Xn as

Xn = Xn−1 + Vn,

but this time Vn ∈ R with Vn ∼ N(0, σ²). This is a Gaussian random walk process on R
with normally distributed step sizes, and it is a time-homogenous discrete-time continuous-
state Markov chain with η(x1) = δ0(x1) and

M(x′|x) = φ(x′; x, σ²).


Example 5.4. A generalisation of the Gaussian random walk is the first order autoregres-
sive process, or shortly AR(1). Let X = R, X1 = 0, and for n > 1 define Xn as

Xn = aXn−1 + Vn,

for some a ∈ R, and Vn ∈ R with Vn ∼ N(0, σ²). AR(1) is a time-homogenous discrete-
time continuous-state Markov chain with η(x1) = δ0(x1) and

M(x′|x) = φ(x′; ax, σ²).

When |a| < 1, another choice for the initial distribution is X1 ∼ N(0, σ²/(1 − a²)), which is the
stationary distribution of {Xt}t≥1. We will see more on stationary distributions below.
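The stationary distribution claim can be checked empirically; the following sketch (assuming NumPy) simulates a long AR(1) path and compares the sample variance with σ²/(1 − a²):

```python
import numpy as np

def ar1_path(a, sigma, n_steps, rng, x1=0.0):
    """Simulate the AR(1) chain X_n = a X_{n-1} + V_n with V_n ~ N(0, sigma^2)."""
    xs = np.empty(n_steps)
    xs[0] = x1
    for n in range(1, n_steps):
        xs[n] = a * xs[n - 1] + rng.normal(0.0, sigma)
    return xs

rng = np.random.default_rng(1)
a, sigma = 0.8, 1.0
xs = ar1_path(a, sigma, n_steps=200_000, rng=rng)
# After discarding a short burn-in, the marginal variance settles near sigma^2 / (1 - a^2)
print(xs[1000:].var(), sigma**2 / (1 - a**2))
```

The values a = 0.8 and σ = 1 are arbitrary illustrative choices; any |a| < 1 gives the same qualitative behaviour.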

5.2.1 Properties of Markov(η, M )


For MCMC, we require the Markov chain to have a unique invariant distribution π and to
converge to π. Before discussing that, we need to review some fundamental properties of a
discrete time Markov chain to understand when the existence of an invariant distribution
and convergence to it are ensured. These properties will be discussed specifically for discrete-
state Markov chains, for the sake of simplicity and to convey the intuition behind the
concepts. Although similar concepts exist for general state-space Markov chains, they
are more complicated and less intuitive, and for this reason we mostly omit them from our
review.

Irreducibility
In a discrete-state Markov chain, for two states x, x′ ∈ X, we say x leads to x′, written
x → x′, if the chain can travel from x to x′ with positive probability, i.e.

∃n > 1 s.t. P(Xn = x′|X1 = x) > 0.

If both x → x′ and x′ → x, we say x and x′ communicate, written x ↔ x′.
A subset of states C ⊆ X is called a communicating class, or simply class, if (i) all
x, x′ ∈ C communicate, and (ii) x ∈ C and x ↔ y together imply y ∈ C (that is, there is
no y ∉ C such that x ↔ y for some x ∈ C).
A communicating class C is closed if x ∈ C and x → y imply y ∈ C; that is, there is no
path with positive probability from the states of the class to anywhere outside the class.
Definition 5.2 (Irreducibility). A discrete-state Markov chain is called irreducible if the
whole X is a communicating class, i.e. all its states communicate.
For general state-spaces, we need to generalise the concept of irreducibility to φ-
irreducibility.
Example 5.5. Figure 5.3 shows two chains that are not irreducible. In the first chain, the
communicating classes are {1, 2, 3} and {4, 5}; both are closed. In the second chain, the
communicating classes are {1, 2} and {3, 4}; the first one is closed and the second one is
not.
Figure 5.3: State transition diagrams of two Markov chains that are not irreducible.

Recurrence and Transience


In the discrete state-space, we say that a Markov chain is recurrent if each of its states is
expected to be visited by the chain infinitely often; otherwise it is transient. More precisely,
define the return time

τx = min{n ≥ 1 : Xn+1 = x}.

Definition 5.3 (Recurrence). We say the state x ∈ X is recurrent if

P(τx < ∞|X1 = x) = 1,    (5.1)

or equivalently Σ_{n=1}^∞ P(Xn = x|X1 = x) = ∞. If a state is not recurrent, it is called
transient.
If M is irreducible, then either every state is recurrent (and M is said to be recurrent)
or every state is transient (and M is said to be transient).
Example 5.6. The random walk on integers in Example 5.2 is an irreducible chain. It
can be shown that, in the symmetric case when p = q = 1/2, the chain is recurrent; if
p ≠ q, the chain is transient.
Definition 5.4 (Positive recurrence and null recurrence). We say a state x ∈ X is
positive recurrent if
E(τx |X1 = x) < ∞ (5.2)
(Note that (5.2) is a stronger condition than (5.1).) If a recurrent state is not positive
recurrent, it is called null recurrent.
If M is irreducible and recurrent, then either every state is positive recurrent (and M
is said to be positive recurrent) or every state is null recurrent (and M is said to be null
recurrent).
To talk about recurrence in general state-space chains, instead of states we consider
accessible sets in relation to φ-irreducibility.
Example 5.7. It can be shown that the random walk on integers in Example 5.2 is a null
recurrent chain for p = q = 1/2.

Invariant distribution


A probability distribution π is called M-invariant if

π(x) = ∫ π(x′)M(x|x′) dx′,

where we have assumed that {Xt}t≥1 is continuous (hence π is a pdf). When {Xt}t≥1 is
discrete (hence π is a pmf), this relation is written as

π(x) = Σ_{x′} π(x′)M(x|x′).

The expressions on the RHS of the two equations above are written in shorthand as πM,
so that for an invariant π we have π = πM. In fact, when X = {x1, . . . , xm} is finite with
M(i, j) = P(Xn = j|Xn−1 = i) and π = (π(1), . . . , π(m)), we can indeed write π = πM
using the notation of vector-matrix multiplication.

Theorem 5.1 (Existence and uniqueness of invariant distribution). Suppose M is
irreducible. Then M has a unique invariant distribution if and only if it is positive recurrent.

Example 5.8. The chain in Example 5.1 has the invariant distribution π = (1/4, 1/2, 1/4).
By solving µ = µM, it can be shown that π is the only invariant distribution, so the chain
is positive recurrent.

Example 5.9. The random walk on integers in Example 5.2 is irreducible. However, it
is not positive recurrent for any choice of p = 1 − q; therefore, it does not have an invariant
distribution.

Example 5.10. The Markov chain on top of Figure 5.3 has two invariant distributions,
π = (1/4, 1/2, 1/4, 0, 0) and π = (0, 0, 0, 1/3, 2/3) (and, indeed, any convex combination of
the two), although every state is positive recurrent. Note that the chain is not irreducible,
with two isolated communicating classes; that is why Theorem 5.1 is not applicable and
uniqueness does not follow.

Example 5.11. The Markov chain at the bottom of Figure 5.3 is neither irreducible nor are all
of its states positive recurrent (the states of the second class are transient). However,
it has a unique invariant distribution, namely π = (1/3, 2/3, 0, 0). Note that for this
chain Theorem 5.1 is not applicable since the chain is not irreducible.

Reversibility and detailed balance


One useful way of spotting the existence of an invariant probability measure for a Markov
chain is to check its reversibility, which is a sufficient (but not necessary) condition for the
existence of a stationary distribution.

Definition 5.5 (reversibility). Let M be a transition kernel having an invariant dis-
tribution π and assume the associated Markov chain is started from π. We say that M is
reversible if the reversed process {Xn−m}0≤m<n is also Markov(π, M) for all n ≥ 1.

According to the definition above, M is reversible with respect to π if the backward
transition density of the process {Xn}n≥1 with X1 ∼ π is the same as its forward transition
density, i.e.

p(xn−1|xn) = p(xn−1)p(xn|xn−1) / p(xn) = p(xn−1)M(xn|xn−1) / ∫ p(xn−1)M(xn|xn−1) dxn−1 = M(xn−1|xn).

This immediately leads to the following necessary and sufficient condition for reversibility
of M, known as the detailed balance condition.

Proposition 5.1 (detailed balance). A Markov kernel M is reversible with respect to a
probability distribution π if and only if the following condition, known as the detailed
balance condition, holds:

π(x)M(y|x) = π(y)M(x|y),  x, y ∈ X.

Moreover, π is then an invariant distribution for M.

Being a sufficient condition for stationarity, the detailed balance condition is quite
useful for designing transition kernels for MCMC algorithms.

Ergodicity
Let πn be the distribution of Xn of a Markov chain {Xn}n≥1 with initial distribution η and
transition kernel M. We have π1(x1) = η(x1) and the rest can be written recursively as
πn = πn−1M, or explicitly

πn(xn) = ∫ πn−1(xn−1)M(xn|xn−1) dxn−1

for continuous-state chains, or

πn(xn) = Σ_{xn−1∈X} πn−1(xn−1)M(xn|xn−1)

for discrete-state chains, which reduces to the vector-matrix product

πn = πn−1M

when the state space is finite and π and M are considered as a vector and a matrix,
respectively.
In MCMC methods that aim to approximately sample from π, we generate a Markov
chain {Xn }n≥1 with invariant distribution π and hope that for n large enough Xn is ap-
proximately distributed from π. This relies on the hope that πn converges to π.
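For a finite state-space, this convergence can be observed directly by iterating the vector-matrix recursion πn = πn−1M; here is a sketch (assuming NumPy) with the ergodic chain of Example 5.1:

```python
import numpy as np

# Transition matrix of the ergodic chain of Example 5.1
M = np.array([[0.5, 0.0, 0.5],
              [0.25, 0.5, 0.25],
              [0.0, 1.0, 0.0]])

pi_n = np.array([1.0, 0.0, 0.0])  # eta: start deterministically in state 1
for n in range(100):
    pi_n = pi_n @ M  # pi_n = pi_{n-1} M

print(pi_n)  # converges to the invariant distribution (1/4, 1/2, 1/4)
```

Starting from any other initial distribution η gives the same limit, as Theorem 5.2 below asserts.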
We have shown the conditions for a unique stationary distribution of a Markov chain.
Note that having a unique invariant distribution does not mean that the chain will converge
to its stationary distribution. For that to happen, the Markov chain is additionally required
to be aperiodic, a property which prevents the chain from getting trapped in cycles.

Definition 5.6 (aperiodicity). In a discrete-state Markov chain, a state x ∈ X is called
aperiodic if the set

{n > 0 : P(Xn+1 = x|X1 = x) > 0}

has no common divisor other than 1. Otherwise, the state is periodic and its period is the
greatest common divisor of this set. The Markov chain is said to be aperiodic if all of its
states are aperiodic.

If the Markov chain is irreducible, then aperiodicity of one state implies the aperiodicity
of all the states.

Definition 5.7 (ergodic state). A state is called ergodic if it is positive recurrent and
aperiodic.

Finally, the definition of ergodicity for a Markov chain follows.

Definition 5.8 (ergodic Markov chain). An irreducible Markov chain is called ergodic
if it is positive recurrent and aperiodic.

Ergodicity ensures that the sequence of distributions {πn}n≥1 of {Xn}n≥1 converges
to the invariant distribution π.

Theorem 5.2. Suppose {Xn}n≥1 is a discrete-state ergodic Markov chain with any initial
distribution η and Markov transition kernel M with invariant distribution π. Then,

lim_{n→∞} πn(x) = π(x).    (5.3)

In particular, for all x, x′ ∈ X,

lim_{n→∞} P(Xn = x|X1 = x′) = π(x).

Example 5.12. The Markov chain illustrated in Figure 5.4 is irreducible and positive re-
current; so it has a unique invariant distribution, which is π = (1/3, 1/3, 1/3). However,
it is periodic with period 3, and as a result πn does not converge to π unless η = π. Indeed,
one can show that for η = (η(1), η(2), η(3)), we have

πn = ηM^{n−1} = (η(⟨n⟩), η(⟨n + 1⟩), η(⟨n + 2⟩)),  where ⟨k⟩ = mod(k − 1, 3) + 1.


 


Figure 5.4: An irreducible, positive recurrent, and periodic Markov chain.



5.3 Metropolis-Hastings
As previously stated, an MCMC method is based on a discrete-time ergodic Markov chain
which has π as its stationary distribution. The most widely used MCMC algorithm to
date is the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).
The Metropolis-Hastings algorithm requires a Markov transition kernel Q on X for
proposing new values from the old ones. Assume that the pdf/pmf of Q(·|x) is q(·|x) for
any x. Given the previous sample Xn−1, a new value is proposed as X′ ∼ Q(·|Xn−1).
The proposed sample X′ is accepted with the acceptance probability α(Xn−1, X′), where
the function α : X × X → [0, 1] is defined as

α(x, x′) = min{1, π(x′)q(x|x′) / (π(x)q(x′|x))},  x, x′ ∈ X.

If the proposal is accepted, we take Xn = X′. Otherwise, the proposal is rejected and
we take Xn = Xn−1.

Algorithm 5.1: Metropolis-Hastings

1 Begin with some X1 ∈ X.
2 for n = 2, 3, . . . do
3   Sample X′ ∼ Q(·|Xn−1).
4   Set Xn = X′ with probability

      α(Xn−1, X′) = min{1, π(X′)q(Xn−1|X′) / (π(Xn−1)q(X′|Xn−1))},

    else set Xn = Xn−1.

The ratio in the acceptance probability,

r(x, x′) = π(x′)q(x|x′) / (π(x)q(x′|x)),

is called the acceptance ratio, or the acceptance rate.
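Algorithm 5.1 can be implemented generically once log π and the proposal are supplied; the following minimal sketch (assuming NumPy) works on the log scale to avoid numerical underflow, and is demonstrated on the target π = N(2, 1) used later in Section 5.3.1:

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x1, n_iters, rng):
    """Algorithm 5.1, with densities handled on the log scale for stability.

    log_pi : log of the (possibly unnormalised) target pdf/pmf pi
    propose: function drawing X' ~ Q(.|x)
    log_q  : log q(x'|x); pass None for a symmetric proposal
    """
    xs = [x1]
    for _ in range(n_iters - 1):
        x = xs[-1]
        x_new = propose(x, rng)
        log_r = log_pi(x_new) - log_pi(x)
        if log_q is not None:  # asymmetric proposal: include the q-ratio
            log_r += log_q(x, x_new) - log_q(x_new, x)
        if np.log(rng.uniform()) < log_r:  # accept with probability min(1, r)
            xs.append(x_new)
        else:                              # reject: repeat the current value
            xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(2)
mu, sigma = 2.0, 1.0
samples = metropolis_hastings(
    log_pi=lambda x: -0.5 * ((x - mu) / sigma) ** 2,  # normalising constant not needed
    propose=lambda x, rng: x + rng.normal(0.0, np.sqrt(2.0)),  # random walk, sigma_q^2 = 2
    log_q=None, x1=0.0, n_iters=50_000, rng=rng)
print(samples[10_000:].mean(), samples[10_000:].std())  # roughly (2, 1)
```

Note that only the ratio π(x′)/π(x) is ever needed, so π may be known only up to a normalising constant.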
The invariant distribution of the Metropolis-Hastings algorithm described above exists and
it is π. In order to show this, we can check the detailed balance condition. According to
Algorithm 5.1, the transition kernel M of the Markov chain from which the samples are
obtained is

M(y|x) = q(y|x)α(x, y) + pr(x)δx(y),

where pr(x) is the rejection probability at x,

pr(x) = 1 − ∫ q(x′|x)α(x, x′) dx′,  or  pr(x) = 1 − Σ_{x′} q(x′|x)α(x, x′),

depending on the nature of the state-space. For all x, y ∈ X, we have

π(x)M(y|x) = π(x)q(y|x)α(x, y) + π(x)pr(x)δx(y)
           = π(x)q(y|x) min{1, π(y)q(x|y) / (π(x)q(y|x))} + π(x)pr(x)δx(y)
           = min{π(x)q(y|x), π(y)q(x|y)} + π(x)pr(x)δx(y)
           = min{π(y)q(x|y), π(x)q(y|x)} + π(y)pr(y)δy(x),

which is symmetric in x and y, so π(x)M(y|x) = π(y)M(x|y): the detailed
balance condition holds for π, which implies that M is reversible with respect to π and π
is invariant for M.
Note that, as long as irreducible discrete-state chains are considered, existence of the invariant
distribution π for M ensures the positive recurrence of M. There are also various sufficient
conditions for the M of the Metropolis-Hastings algorithm to be irreducible and aperiodic.
For example, if Q is irreducible and α(x, y) > 0 for all x, y ∈ X, then M is irreducible.
If pr(x) > 0 for all x, or Q is aperiodic, then M is aperiodic (Roberts and Smith, 1994).
More detailed results on the convergence of Metropolis-Hastings are also available; see e.g.
Tierney (1994); Roberts and Tweedie (1996) and Mengersen and Tweedie (1996).
Historically, the original MCMC algorithm was introduced by Metropolis et al. (1953)
for the purpose of optimisation on a discrete state-space. This algorithm, called the
Metropolis algorithm, used symmetric proposal kernels Q, that is, q(x′|x) = q(x|x′).
When a symmetric proposal is used, the acceptance probability involves only the ratio
of the target distribution evaluated at x and x′,

α(x, x′) = min{1, π(x′)/π(x)},  if q(x′|x) = q(x|x′).

The Metropolis algorithm was later generalised by Hastings (1970) to permit
continuous state-spaces and asymmetric proposal kernels, preserving the Metropolis al-
gorithm as a special case. A historical survey on Metropolis-Hastings algorithms is
provided by Hitchcock (2003).
Another version is the independence Metropolis-Hastings algorithm, where, as the name
suggests, the proposal kernel Q is chosen to be independent of the current value, i.e.
q(x′|x) = q(x′), in which case the acceptance probability is

α(x, x′) = min{1, π(x′)q(x) / (π(x)q(x′))}.

5.3.1 Toy example: MH for the normal distribution


This is a toy example where π(x) = φ(x; µ, σ²), for which we do not need MH since
we can obviously sample from N(µ, σ²) directly. But for the sake of illustration, assume that
we have decided to use MH to generate approximate samples from π.
For the proposal kernel, we have several options:

• Symmetric random walk: We can take q(x′|x) = φ(x′; x, σq²), that is, x′ is proposed
  from the current value x by adding a normal random variable with zero mean and
  variance σq², i.e. Q(·|x) = N(x, σq²). Since

  q(x′|x) = φ(x′; x, σq²) = φ(x; x′, σq²) = q(x|x′),

  this results in the acceptance ratio

  r(x, x′) = π(x′)q(x|x′) / (π(x)q(x′|x))
           = φ(x′; µ, σ²) / φ(x; µ, σ²)
           = exp{−[(x′ − µ)² − (x − µ)²]/(2σ²)}.

  The choice of σq² is important for good performance of MH. We want the Markov
  chain generated by the algorithm to mix well, that is, we want the samples to forget
  the previous values quickly. Consider the acceptance ratio above:

  – A too small value for σq² will result in the acceptance ratio r(x, x′) being very
    close to 1, and hence the proposed values will be accepted with high probability.
    However, the chain will mix very slowly, that is, the samples will be highly
    correlated, because any accepted sample x′ will most likely be only slightly
    different from the current x due to the small step-size of the random walk.

  – A too large value for σq² will likely cause the proposed value x′ to fall far
    from the region where π has most of its mass; hence π(x′) will be very small
    compared to π(x) and the chain will likely reject the proposed value and stick
    to the old value x. This creates a sticky chain.

  Therefore, the optimum value for σq² should be neither too small nor too large. See
  Figure 5.5 for both bad choices and one in between. This phenomenon of
  having to choose the variance of the random walk proposal neither too small nor
  too large holds for most target distributions, not just the normal distribution.

• Another option for the proposal is to sample x′ independently of x, i.e. q(x′|x) =
  q(x′). For example, suppose we choose q(x) = φ(x; µq, σq²). Then the acceptance ratio
Figure 5.5: Random walk MH for π(x) = φ(x; 2, 1). The left and middle plots correspond
to a too small and a too large value for σq², respectively. All algorithms are run for 50000
iterations. Both the trace plots and the histograms show that the last choice works the
best.

  is

  r(x, x′) = π(x′)q(x) / (π(x)q(x′))
           = φ(x′; µ, σ²)φ(x; µq, σq²) / (φ(x; µ, σ²)φ(x′; µq, σq²))
           = exp{−[(x′ − µ)² − (x − µ)²]/(2σ²) + [(x′ − µq)² − (x − µq)²]/(2σq²)}.

See Figure 5.6 for examples of MH with this choice.

• Another alternative is to use a gradient-guided proposal. We may want to 'guide' the
  chain towards the high-probability region of π(x); one proposal that can be chosen
  for that purpose is

  q(x′|x) = φ(x′; g(x), σq²),

  where the mean g(x) of the proposal is constructed using the gradient of the
Figure 5.6: Independence MH for π(x) = φ(x; 2, 1).

  logarithm of the target density,

  g(x) = x + γ ∂ log π(x)/∂x.

  Here, γ is a step-size parameter that needs to be adjusted. For π(x) = φ(x; µ, σ²),
  we have g(x) = x − (γ/σ²)(x − µ). The acceptance ratio for this choice of proposal becomes

  r(x, x′) = exp{−[(x′ − µ)² − (x − µ)²]/(2σ²)
             + [(x′ − x + (γ/σ²)(x − µ))² − (x − x′ + (γ/σ²)(x′ − µ))²]/(2σq²)}.

See Figure 5.7 for examples of MH with this choice.


Example 5.13 (Normal distribution with unknown mean and variance). We have
observations Y1, . . . , Yn ∼ N(z, s) where z and s are unknown. The parameters x = (z, s)
are a priori independent with z ∼ N(m, κ²) and s ∼ IG(α, β), so that the prior density is

p(x) = p(z)p(s) = (1/√(2πκ²)) exp{−(z − m)²/(2κ²)} · (β^α/Γ(α)) s^{−α−1} e^{−β/s}.

Given the data Y1:n = y1:n, we want to run the MH algorithm to sample from the posterior
distribution π(x) = p(x|y1:n), which is given by

π(x) = p(x|y1:n) ∝ p(x)p(y1:n|x) = p(z)p(s) ∏_{i=1}^n φ(yi; z, s).

For this problem, π(x) indeed lacks a well-known form, so the use of a Monte Carlo
method is justified.
Figure 5.7: Gradient-guided MH for π(x) = φ(x; 2, 1).

To run the MH algorithm, we need a proposal distribution for proposing x′ = (z′, s′). In
this example, given x = (z, s), we decide to propose z′ ∼ N(z, σq²) and s′ ∼ IG(α, β), i.e.
we use a random walk for the mean component and the prior distribution for the variance
parameter. With this choice, the proposal density becomes

q(x′|x) = φ(z′; z, σq²)p(s′) = φ(z′; z, σq²) (β^α/Γ(α)) (s′)^{−α−1} e^{−β/s′}.

The acceptance ratio in this case is

r(x, x′) = π(x′)q(x|x′) / (π(x)q(x′|x))
         = p(z′)p(s′)[∏_{i=1}^n p(yi|z′, s′)] φ(z; z′, σq²)p(s) / (p(z)p(s)[∏_{i=1}^n p(yi|z, s)] φ(z′; z, σq²)p(s′))
         = φ(z′; m, κ²) ∏_{i=1}^n φ(yi; z′, s′) / (φ(z; m, κ²) ∏_{i=1}^n φ(yi; z, s)).

See Figure 5.8 for results obtained from this MH algorithm.
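The sampler above can be sketched as follows (assuming NumPy), with the hyperparameters of Figure 5.8; the data-generating values z = 1, s = 4 are hypothetical choices for illustration. An IG(α, β) draw is obtained as the reciprocal of a Gamma draw with shape α and scale 1/β:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 100 draws from N(z = 1, s = 4); hyperparameters as in Figure 5.8
y = rng.normal(1.0, 2.0, size=100)
m, kappa2, alpha, beta, sigma_q2 = 0.0, 100.0, 5.0, 10.0, 1.0

def log_target(z, s):
    """log pi(z, s | y) up to an additive constant: log prior + log likelihood."""
    log_prior = -0.5 * (z - m) ** 2 / kappa2 + (-alpha - 1) * np.log(s) - beta / s
    log_lik = -0.5 * len(y) * np.log(s) - 0.5 * np.sum((y - z) ** 2) / s
    return log_prior + log_lik

def log_ig(s):
    """log IG(alpha, beta) density of s, up to a constant (the s-part of log q)."""
    return (-alpha - 1) * np.log(s) - beta / s

z, s = 0.0, 1.0
zs, ss = [], []
for _ in range(20_000):
    z_new = z + rng.normal(0.0, np.sqrt(sigma_q2))  # random walk move on z
    s_new = 1.0 / rng.gamma(alpha, 1.0 / beta)      # s' ~ IG(alpha, beta), independent
    # r = pi(x')q(x|x') / (pi(x)q(x'|x)); the symmetric z-part of q cancels
    log_r = log_target(z_new, s_new) - log_target(z, s) + log_ig(s) - log_ig(s_new)
    if np.log(rng.uniform()) < log_r:
        z, s = z_new, s_new
    zs.append(z)
    ss.append(s)

print(np.mean(zs[5000:]), np.mean(ss[5000:]))  # posterior means, roughly near (1, 4)
```

As the derivation above shows, the prior factor p(s′) in q cancels part of the target, but keeping the full ratio (as here) is the least error-prone way to code it.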

Example 5.14 (A changepoint model). In this example, we consider a changepoint
model. At each time t we observe the count of an event, Yt. All the counts
up to an unknown time τ come from the same distribution, after which the distribution
changes. We assume that the changepoint τ is uniformly distributed over {1, . . . , n}, where
n is the number of time steps.

Figure 5.8: MH for parameters of N(z, s). σq² = 1, α = 5, β = 10, m = 0, κ² = 100.

The two distributional regimes, up to τ and after τ, have intensities given by the
random variables λi, i = 1, 2, which are a priori assumed to follow a
Gamma distribution,

λi ∼ Γ(α, β), i = 1, 2.

Under regime i, the counts are assumed to be Poisson distributed:

Yt ∼ PO(λ1) for 1 ≤ t ≤ τ,  and  Yt ∼ PO(λ2) for τ < t ≤ n.
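Synthetic data from this generative model can be drawn as follows; a sketch (assuming NumPy) with τ, λ1, λ2 fixed to the values used for Figure 5.9 rather than drawn from the prior:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

# Fixed to the values of Figure 5.9; alternatively, draw tau ~ Uniform{1,...,n}
# and lambda_i ~ Gamma(alpha, beta) to sample from the full generative model.
tau, lam1, lam2 = 30, 10.0, 5.0

y = np.concatenate([rng.poisson(lam1, size=tau),       # regime 1: t = 1, ..., tau
                    rng.poisson(lam2, size=n - tau)])  # regime 2: t = tau+1, ..., n
print(y[:10])
```

The drop in the average count after t = τ is what the posterior sampler below must detect.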

A typical draw from this model is shown in Figure 5.9. The inferential goal is, given
Y1:n = y1:n, to sample from the posterior distribution of the changepoint location τ and the
intensities λ1, λ2, i.e., letting x = (τ, λ1, λ2), the target distribution is
π(x) = p(τ, λ1, λ2|y1:n), which is given by

p(τ, λ1, λ2|y1:n) ∝ p(τ, λ1, λ2, y1:n)
 = p(τ, λ1, λ2)p(y1:n|τ, λ1, λ2)
 = p(τ)p(λ1)p(λ2)p(y1:n|τ, λ1, λ2)
 = (1/n) · (β^α λ1^{α−1} e^{−βλ1}/Γ(α)) · (β^α λ2^{α−1} e^{−βλ2}/Γ(α)) · ∏_{t=1}^τ (e^{−λ1} λ1^{yt}/yt!) · ∏_{t=τ+1}^n (e^{−λ2} λ2^{yt}/yt!).    (5.4)

Two choices for the proposal will be considered. Let x′ = (τ′, λ1′, λ2′).

• The first one is to use an independent proposal distribution, namely the prior dis-
  tribution for x,

  q(x′|x) = q(x′) = p(x′) = p(τ′, λ1′, λ2′).

  This leads to the acceptance ratio being the ratio of the likelihoods,

  r(x, x′) = p(y1:n|τ′, λ1′, λ2′) / p(y1:n|τ, λ1, λ2).
Figure 5.9: An example data sequence of length n = 100 generated from the Poisson
changepoint model with parameters τ = 30, λ1 = 10 and λ2 = 5.

• The second choice is a symmetric proposal,

  q(x′|x) = (½ I_{τ+1}(τ′) + ½ I_{τ−1}(τ′)) φ(λ1′; λ1, σλ²) φ(λ2′; λ2, σλ²).

  The first factor involving τ′ indicates that we propose either τ′ = τ + 1 or τ′ = τ − 1,
  each with probability one half. Since q(x′|x) = q(x|x′), the acceptance ratio reduces to
  the ratio of the posteriors,

  r(x, x′) = p(τ′, λ1′, λ2′|y1:n) / p(τ, λ1, λ2|y1:n)
           = e^{−(τ+β)(λ1′−λ1)} e^{−(n−τ+β)(λ2′−λ2)} (λ1′/λ1)^{α−1+Σ_{t=1}^τ yt} (λ2′/λ2)^{α−1+Σ_{t=τ+1}^n yt}
             × { e^{−λ1′+λ2′} (λ1′/λ2′)^{y_{τ+1}},  τ′ = τ + 1,
                 e^{−λ2′+λ1′} (λ2′/λ1′)^{y_τ},      τ′ = τ − 1.

Figure 5.10 illustrates the results obtained from the two algorithms. The initial value
for τ is taken as ⌊n/2⌋, and λ1 and λ2 are started from the mean of y1:n. As we can see,
the symmetric proposal algorithm is able to explore the posterior distribution much more
efficiently. This is because the proposal distribution in independence MH, which is chosen
as the prior distribution, takes neither the posterior distribution (hence the data)
nor the previous sample into account, and as a result it has a large rejection rate. The
independence sampler would become even poorer if n were larger, so that the posterior would
be even more concentrated relative to the ignorant prior distribution.

Example 5.15 (MCMC for source localisation). Consider the source localisation
scenario in Question 3 of Exercises in Chapter 3. From the likelihood and the prior in
(3.8) and (3.9), the posterior distribution of the unknown position is

p(x|y) ∝ φ(x(1); 0, σx²) φ(x(2); 0, σx²) ∏_{i=1}^3 φ(yi; ri, σy²).    (5.5)
Figure 5.10: MH for parameters of the Poisson changepoint model.

Due to the non-linearity in ri = ||x − si|| = [(x(1) − si(1))² + (x(2) − si(2))²]^{1/2},
i = 1, 2, 3, p(x|y) does not admit a known distribution. We use the MH algorithm to
generate approximate samples from p(x|y), with a symmetric random walk proposal
distribution q(x′|x) = φ(x′; x, σq² I2), so that q(x′|x) = q(x|x′). The resulting acceptance
ratio is

r(x, x′) = p(x′|y)q(x|x′) / (p(x|y)q(x′|x))
         = p(x′|y) / p(x|y)
         = φ(x′(1); 0, σx²) φ(x′(2); 0, σx²) ∏_{i=1}^3 φ(yi; ri′, σy²) / (φ(x(1); 0, σx²) φ(x(2); 0, σx²) ∏_{i=1}^3 φ(yi; ri, σy²)),

where ri′ = ||x′ − si||, i = 1, 2, 3, is the distance between the proposed value x′ and the
location of the i'th source, si. Figure 5.11 shows the samples and their histograms obtained from
10000 iterations of the MH algorithm. The chain was started from X1 = (5, 5) and its
convergence to the posterior distribution is illustrated in the right pane of the figure, where
we can see the first few samples of the chain traveling towards the high probability region of the
posterior distribution.

5.4 Gibbs sampling


The Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) is one of the most
popular MCMC methods, which can be used when X has more than one dimension. If X
has d > 1 components (of possibly different dimensions) such that X = (X1 , . . . , Xd ), and
Figure 5.11: MH for the source localisation problem.

one can sample from each of the full conditional distributions πk (·|X1:k−1 , Xk+1:d ), then the
Gibbs sampler produces a Markov chain by updating one component at a time using πk ’s.
One cycle of the Gibbs sampler successively samples from the conditional distributions
π1 , . . . , πd by conditioning on the most recent samples.

Algorithm 5.2: The Gibbs sampler

1 Begin with some X1 ∈ X.
2 for n = 2, 3, . . . do
3   for k = 1, . . . , d do
4     Sample Xn,k ∼ πk(·|Xn,1:k−1, Xn−1,k+1:d).

For x ∈ X, let x−k = (x1:k−1, xk+1:d), k = 1, . . . , d, denote the components of
x excluding xk, and let us permit ourselves to write x = (xk, x−k). The corresponding
MCMC kernel of the Gibbs sampler can be written as M = M1M2 · · · Md, where each
transition kernel Mk, k = 1, . . . , d, is

Mk(y|x) = πk(yk|x−k)δ_{x−k}(y−k),

where y = (y1, . . . , yd). The justification of the transition kernel comes from the re-
versibility of each Mk with respect to π, which can be verified from the detailed balance

condition as follows:

π(x)Mk(y|x) = π(x)πk(yk|x−k)δ_{x−k}(y−k)
            = π(x−k)πk(xk|x−k)πk(yk|x−k)δ_{x−k}(y−k)
            = π(y−k)πk(yk|y−k)πk(xk|y−k)δ_{y−k}(x−k)
            = π(y)Mk(x|y),    (5.6)

where the third line follows from the second since δ_{x−k}(y−k) allows the interchange of x−k and
y−k. Therefore, the detailed balance condition for Mk is satisfied with π, and πMk = π. If
we apply M1, . . . , Md sequentially, we get

πM = πM1 · · · Md = (πM1)M2 · · · Md = πM2 · · · Md = · · · = π,

so π is indeed the invariant distribution for the Gibbs sampler.

Gibbs sampling as a special Metropolis-Hastings algorithm: An insightful inter-


pretation of (5.6) is that each step of a cycle of the Gibbs sampler is a Metropolis-Hastings
move whose MCMC kernel is equal to its proposal kernel which results in the acceptance
probability being 1 uniformly. Indeed, if the k’th component of X is to be updated with
Qk = Mk , i.e. if we propose the new value y as

qk (y|x) = Mk (y|x) = πk (yk |x−k )δx−k (y−k ),

the acceptance ratio αk (x, y) for this move is

αk (x, y) = min{ 1, π(y)qk (x|y) / [π(x)qk (y|x)] } = min{ 1, π(y)Mk (x|y) / [π(x)Mk (y|x)] } = 1,

as shown in (5.6).
Reversibility of each Mk with respect to π does not suffice to establish proper convergence
of the Gibbs sampler, as none of the individual steps produces an irreducible chain.
Only the combination of the d moves in the complete cycle has a chance of producing
a φ-irreducible chain. We refer to Roberts and Smith (1994) for some simple conditions for
convergence of the classical Gibbs sampler. Note, also, that M is not reversible either,
although this is not a necessary condition for convergence. A way of guaranteeing both
φ-irreducibility and reversibility is to use a mixture of kernels
Mβ = β1 M1 + · · · + βd Md ,   βk > 0, k = 1, . . . , d,   β1 + · · · + βd = 1,

provided that at least one Mk is irreducible and aperiodic. This choice of kernel leads to
the random scan Gibbs sampler algorithm. We refer to Tierney (1994), Robert and Casella
(2004), and Roberts and Tweedie (1996) for more detailed convergence results pertaining
to these variants of the Gibbs sampler.
CHAPTER 5. MARKOV CHAIN MONTE CARLO 71

Example 5.16. Suppose we wish to sample from a bivariate normal distribution, where

π(x) = (1 / (2π √(1 − ρ^2))) exp{ −(x1^2 + x2^2 − 2ρ x1 x2) / (2(1 − ρ^2)) },   ρ ∈ (−1, 1).

The full conditionals are

π(x1 |x2 ) ∝ π(x1 , x2 ) ∝ exp{ −(x1 − ρx2 )^2 / (2(1 − ρ^2)) },

therefore π(x1 |x2 ) = φ(x1 ; ρx2 , 1 − ρ^2 ) and X1 |X2 = x2 ∼ N (ρx2 , 1 − ρ^2 ). Similarly,
we have X2 |X1 = x1 ∼ N (ρx1 , 1 − ρ^2 ). So, iteration t ≥ 2 of the Gibbs sampling
algorithm for this π(x) is

• Sample Xt,1 ∼ N (ρXt−1,2 , 1 − ρ^2 ),

• Sample Xt,2 ∼ N (ρXt,1 , 1 − ρ^2 ).
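As a quick illustration, the two conditional draws above can be put into a few lines of Python. This is a sketch using only the standard library; the starting point, seed, and chain length are illustrative choices.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=5000, seed=1):
    """Gibbs sampler for the bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)        # conditional std dev, sqrt(1 - rho^2)
    x1, x2 = 0.0, 0.0                     # arbitrary starting point
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)      # X_{t,1} | X_{t-1,2} ~ N(rho*x2, 1 - rho^2)
        x2 = rng.gauss(rho * x1, sd)      # X_{t,2} | X_{t,1}   ~ N(rho*x1, 1 - rho^2)
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
mean1 = sum(s[0] for s in samples) / len(samples)         # should be near 0
cross = sum(s[0] * s[1] for s in samples) / len(samples)  # should be near rho
```

With ρ close to ±1 the two components are strongly dependent and the chain mixes slowly, a well-known weakness of one-at-a-time updates.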

Example 5.17 (Normal distribution with unknown mean and variance). Let
us get back to the problem in Example 5.13, where we want to estimate the mean and the
variance of the normal distribution N (z, s) given samples y1 , . . . , yn generated from it. Let
us use the same prior distributions for z and s, namely z ∼ N (m, κ^2 ) and s ∼ IG(α, β). Note
that these are the conjugate priors for those parameters; and when one of the parameters
is given, the posterior distribution of the other one has a known form. Indeed, in Examples
4.5 and 4.6, we derived these full conditional distributions. Example 4.5 can be revisited
(but this time with a non-zero prior mean m) to see that
Z|s, y1:n ∼ N (µ_{z|s,y} , σ^2_{z|s,y} ),

where

σ^2_{z|s,y} = (1/κ^2 + n/s)^{−1} ,    µ_{z|s,y} = (1/κ^2 + n/s)^{−1} ( (1/s) Σ^n_{i=1} yi + m/κ^2 ),

and from Example 4.6 we can deduce that

S|z, y1:n ∼ IG(α_{s|z,y} , β_{s|z,y} ),

where

α_{s|z,y} = α + n/2 ,    β_{s|z,y} = β + (1/2) Σ^n_{i=1} (yi − z)^2 .
Therefore, Gibbs sampling for Z, S given Y1:n = y1:n is

• Sample Zt ∼ N ( (1/κ^2 + n/St−1 )^{−1} ( (1/St−1 ) Σ^n_{i=1} yi + m/κ^2 ) , (1/κ^2 + n/St−1 )^{−1} ),

• Sample St ∼ IG( α + n/2 , β + (1/2) Σ^n_{i=1} (yi − Zt )^2 ).
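The two-step cycle above can be sketched in Python as follows, using only the standard library. The inverse-gamma draw is obtained as the reciprocal of a Gamma draw, and the synthetic data and hyperparameter values are illustrative assumptions only.

```python
import math
import random

def gibbs_normal_mean_var(y, m=0.0, kappa2=100.0, alpha=2.0, beta=2.0,
                          n_iter=5000, seed=1):
    """Gibbs sampler for (z, s) with priors z ~ N(m, kappa2), s ~ IG(alpha, beta)."""
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    z, s = m, beta / (alpha + 1.0)   # start at prior mean / prior mode (illustrative)
    draws = []
    for _ in range(n_iter):
        # z | s, y ~ N(mu, sigma2) with precision 1/kappa2 + n/s
        sigma2 = 1.0 / (1.0 / kappa2 + n / s)
        mu = sigma2 * (sum_y / s + m / kappa2)
        z = rng.gauss(mu, math.sqrt(sigma2))
        # s | z, y ~ IG(alpha + n/2, beta + 0.5 * sum (y_i - z)^2)
        a = alpha + n / 2.0
        b = beta + 0.5 * sum((yi - z) ** 2 for yi in y)
        s = 1.0 / rng.gammavariate(a, 1.0 / b)   # inverse gamma via 1/Gamma(a, scale=1/b)
        draws.append((z, s))
    return draws

# Synthetic data from N(2, 1):
rng = random.Random(0)
y = [rng.gauss(2.0, 1.0) for _ in range(100)]
draws = gibbs_normal_mean_var(y)
z_mean = sum(d[0] for d in draws) / len(draws)   # should be near the sample mean of y
s_mean = sum(d[1] for d in draws) / len(draws)   # should be near the sample variance of y
```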

Data augmentation: Data augmentation is an application of the Gibbs sampler. It is
useful if

1. there is missing data, and/or

2. the likelihood is intractable (hard to compute or does not admit conjugacy, etc), but
given some additional unobserved (real or fictitious) data it would be tractable.

Let yobs denote the observed data and ymis the missing data (sometimes ymis is called a
latent variable). We suppose we can easily sample x from the posterior given the augmented
data (yobs , ymis ). We also suppose that we can sample ymis conditional on yobs and x (this
involves only the sampling distributions). Then we can run the Gibbs sampler on the pair
(x, ymis ) and perform Monte Carlo marginalisation: if in the resulting joint distribution for
(x, ymis ) given yobs we simply ignore ymis , we shall have our sample from the posterior of x
given yobs alone.

Example 5.18 (Genetic linkage). Genetic linkage in an animal can be allocated to one
of four categories, coded 1, 2, 3, and 4, having respective probabilities

(1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4)

where θ is an unknown parameter in (0, 1). For a sample of 197 animals, the (multi-
nomial) counts of those falling in the 4 categories are represented by random variables
Y = (Y1 , Y2 , Y3 , Y4 ), with observed values y = (y1 , y2 , y3 , y4 ) = (125, 18, 20, 34). Suppose we
place a Beta(α, β) prior on θ. Then,
π(θ) = p(θ|y) ∝ (1/2 + θ/4)^{125} ((1 − θ)/4)^{18+20} (θ/4)^{34} · θ^{α−1} (1 − θ)^{β−1}
     ∝ (2 + θ)^{125} (1 − θ)^{38+β−1} θ^{34+α−1} ,        (5.7)

where the first three factors constitute the multinomial likelihood.

How can we sample from this? We can use a rejection sampler (probably with a very high
rejection probability) or MH for this posterior distribution; in this example we seek a
suitable Gibbs sampler. Note that the problematic part in (5.7) is the first factor; were it
like the others, the posterior would reduce to a Beta distribution.
Suppose we divide category 1, with total probability 1/2 + θ/4, into two latent sub-
categories, a and b, with respective probabilities θ/4 and 1/2. We regard the number of
animals Z falling in subcategory a as missing data. If, as well as the observed data y, we
are given Z = z, we are in the situation of having observed counts (z, 125 − z, 18, 20, 34)
from a multinomial distribution with probabilities (θ/4, 1/2, (1 − θ)/4, (1 − θ)/4, θ/4). The
resulting joint distribution is

p(θ, z|y) ∝ p(θ, z, y) ∝ (1/2)^{125−z} ((1 − θ)/4)^{18+20} (θ/4)^{34+z} · θ^{α−1} (1 − θ)^{β−1} .        (5.8)

This easily leads to the posterior distribution

θ|z, y ∼ Beta(z + 34 + α, 38 + β).        (5.9)

Also, simple properties of the multinomial distribution yield

Z|θ, y ∼ Binom( 125, (θ/4) / (1/2 + θ/4) ).        (5.10)
So we can now apply Gibbs sampling, cycling between updates given by (5.9) and (5.10).
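A sketch of this two-step cycle in Python (standard library only); the binomial draw is done naively as a sum of Bernoulli trials, and the hyperparameters are the ones used in the exercises below.

```python
import random

def gibbs_linkage(y=(125, 18, 20, 34), alpha=2.0, beta=2.0, n_iter=5000, seed=1):
    """Gibbs sampler cycling between (5.9) and (5.10) for the linkage model."""
    rng = random.Random(seed)
    theta = 0.5
    thetas = []
    for _ in range(n_iter):
        # Z | theta, y ~ Binomial(y1, (theta/4) / (1/2 + theta/4))
        p = (theta / 4.0) / (0.5 + theta / 4.0)
        z = sum(rng.random() < p for _ in range(y[0]))
        # theta | z, y ~ Beta(z + 34 + alpha, 38 + beta)
        theta = rng.betavariate(z + y[3] + alpha, y[1] + y[2] + beta)
        thetas.append(theta)
    return thetas

thetas = gibbs_linkage()
theta_mean = sum(thetas) / len(thetas)  # posterior mean of theta
```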
Example 5.19 (A changepoint model, ctd.). Consider the changepoint problem in
Example 5.14, with the same likelihood and priors. It is possible to run Gibbs sampling
algorithm for τ, λ1 , λ2 . Observing (5.4), where the full posterior distribution is written as
proportional to the full joint distribution
p(τ, λ1 , λ2 , y1:n ) = (1/n) · [β^α λ1^{α−1} e^{−βλ1} / Γ(α)] · [β^α λ2^{α−1} e^{−βλ2} / Γ(α)]
                        · Π^τ_{t=1} [e^{−λ1} λ1^{yt} / yt !] · Π^n_{t=τ+1} [e^{−λ2} λ2^{yt} / yt !] ,

from which we can derive all the full conditionals:

λ1 |τ, λ2 , y1:n ∼ Γ( α + Σ^τ_{t=1} yt , β + τ ),

λ2 |τ, λ1 , y1:n ∼ Γ( α + Σ^n_{t=τ+1} yt , β + n − τ ),

τ |λ1 , λ2 , y1:n ∼ Categorical(a1 , . . . , an ),


where the probabilities in the Categorical distribution (which is simply the discrete distri-
bution with probabilities a1 , . . . , an , the generalisation of the Bernoulli distribution to the
case of multiple (here, n) outcomes) are
ai = e^{−iλ1} λ1^{Σ^i_{t=1} yt} e^{−(n−i)λ2} λ2^{Σ^n_{t=i+1} yt} / Σ^n_{j=1} [ e^{−jλ1} λ1^{Σ^j_{t=1} yt} e^{−(n−j)λ2} λ2^{Σ^n_{t=j+1} yt} ] .

5.4.1 Metropolis within Gibbs

Having attractive computational properties, the Gibbs sampler is widely used. Its main
restriction is the requirement for easy-to-sample full conditional distributions. Fortunately,
though, replacing an exact simulation Xn,k ∼ πk (·|Xn,1:k−1 , Xn−1,k+1:d ) by a Metropolis-
Hastings step in a general MCMC algorithm does not violate its validity as long as the
Metropolis-Hastings step has the correct invariant distribution. When sampling from the
full conditional distribution πk (·|x−k ) is not directly feasible, the most natural alternative
to the Gibbs move in step k is a Metropolis-Hastings move that updates xk using a
Metropolis-Hastings kernel targeting πk (·|x−k ) (Tierney, 1994).
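A sketch of such a move in Python: a symmetric random-walk MH update of a single component whose target is the full conditional, evaluated through the joint log density. The helper names and step size are illustrative, and the demonstration reuses the bivariate normal of Example 5.16.

```python
import math
import random

def mh_step_full_conditional(x, k, log_cond, step, rng):
    """One random-walk MH update of component k, targeting the full
    conditional pi_k(. | x_{-k}) through its log density log_cond(xk, x)."""
    prop = list(x)
    prop[k] = x[k] + rng.gauss(0.0, step)            # symmetric proposal
    log_alpha = log_cond(prop[k], x) - log_cond(x[k], x)
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        return tuple(prop)                           # accept
    return tuple(x)                                  # reject

# Demonstration on the bivariate normal of Example 5.16 (rho = 0.8); the full
# conditional of each component is proportional to the joint density with x_{-k} fixed.
rho = 0.8

def log_joint(xk, x, k):
    x1, x2 = (xk, x[1]) if k == 0 else (x[0], xk)
    return -(x1 * x1 + x2 * x2 - 2.0 * rho * x1 * x2) / (2.0 * (1.0 - rho ** 2))

rng = random.Random(1)
x = (0.0, 0.0)
chain = []
for _ in range(5000):
    x = mh_step_full_conditional(x, 0, lambda xk, xc: log_joint(xk, xc, 0), 1.0, rng)
    x = mh_step_full_conditional(x, 1, lambda xk, xc: log_joint(xk, xc, 1), 1.0, rng)
    chain.append(x)
mean1 = sum(c[0] for c in chain) / len(chain)        # should be near 0
```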

Exercises
1. Consider the toy example in Section 5.3.1 for the MH algorithm for sampling from
the normal distribution N (µ, σ 2 ).

• Modify the code so that it stores the acceptance probability at each iteration in
a vector and returns the vector as one of the outputs. In the next part of the
exercise, you will use the last T − tb samples of the vector to find an estimate
of the overall expected acceptance probability

α(σq ) = ∫ α(x, x′ )π(x)qσq (x′ |x) dx dx′ ,

where qσq (x′ |x) = φ(x′ ; x, σq^2 ).


• Choose µ = 0, σ 2 = 1, and σq = 1. Run the MH algorithm (provided in SU-
Course) 100 times, each with T = 10000 iterations with the symmetric proposal
with various values for the proposal variance. For each run i = 1, . . . , 100, use
the samples X_1^{(i)} , . . . , X_T^{(i)} to calculate the mean estimate

µ^{(i)} (σq ) = (1/(T − tb )) Σ^T_{t=tb +1} X_t^{(i)} ,

where tb = 1000 is the burn-in time, up to which you ignore the samples generated
by the algorithm. Similarly, calculate an estimate α^{(i)} (σq ) of α(σq ) using the last
T − tb samples.
• Report the sample variance of µ(i) (σq )’s: This is approximately the variance of
the mean estimate of the MH algorithm that uses T − tb samples. We wish this
variance to be as small as possible. Also, report the average of α(i) (σq )’s.
• Repeat above for σq = 0.1, 0.2, . . . , 9.9, 10, and generate two plots: (i) sample
variance of µ(i) (σq )’s vs σq and (ii) average of α(i) (σq )’s vs σq . From the first
plot, suggest the (approximately) optimum value for σq and report the estimate
of α(σq ) for that σq .

2. Design and implement a symmetric random walk MH algorithm and the Gibbs sam-
pling algorithm for the genetic linkage problem in Example 5.18 with hyperparame-
ters α = β = 2.

3. Implement the Gibbs sampler in Example 5.17 with n = 100 and hyperparameters
α = 5 and β = 10.

4. Consider the changepoint problem in Example 5.14.

• Download UK coal mining disaster [Link] from SUCourse. The data con-
sists of the day numbers of coal mining disasters between 1851 and 1962, where

the first day is the start of 1851. It is suspected that, due to a policy change,
the accident rate over the years is piecewise constant with a single changepoint
time around the time of the policy change.
• From the data, create another data vector of length 112, where the i’th element
contains the number of disasters in year i (starting from 1851). Note that some
years are 366 days!
• Implement the MH algorithm (given in SUCourse) for the changepoint model
given the data that you created. Take the priors for τ , λ1 and λ2 the same as
in Example 5.14, i.e. with hyperparameters α = 10 and β = 1. You can use the
symmetric random walk proposal for the parameters.
• Implement Gibbs sampling algorithm for the same model given the same data
using the same priors. All the derivations you need are in Example 5.19.

5. Suppose we observe a noisy sinusoid with unknown amplitude a, angular frequency ω,
   phase z, and noise variance σy^2 for n steps. Letting x = (a, ω, z, σy^2 ),

   Yt |x ∼ N (a sin(ωt + z), σy^2 ),   t = 1, . . . , n.

   The unknown parameters are a priori independent with a ∼ N (0, σa^2 ), ω ∼ Γ(α, β),
   z ∼ Unif(0, 2π), σy^2 ∼ IG(α, 1/β).

• Write down the likelihood p(y1:n |x) and the joint density p(x, y1:n ).
• Download the data file sinusoid [Link] from SUCourse; the observations
in the file are your data y1:n . Use hyperparameters σa2 = 100, α = β = 0.01
and design and implement an MH algorithm for generating samples from the
posterior distribution π(x) = p(x|y1:n ).
• Bonus - worth 50% of the base mark: This time, design and implement an
MH within Gibbs algorithm where each loop contains four steps, in each of
which you update one component only, fixing the others, using an MH kernel
that targets the corresponding full conditional. This is an example where you
can still update the components one by one even if the full conditional
distributions are not easy to sample from.
Chapter 6

Sequential Monte Carlo


Summary: This chapter contains a brief and limited review of sequential Monte Carlo
methods, another large family of Monte Carlo methods that are used for many applications
including sequential inference, sampling from complex distributions, rare event analysis,
density estimation, optimisation, etc. In this chapter we will introduce two main meth-
ods, sequential importance sampling, and sequential importance sampling-resampling, in a
generic setting.

6.1 Introduction
Let {Xn }n≥1 be a sequence of random variables where each Xn takes values in some space
X . Define the sequence of distributions {πn }n≥1 where πn is defined on X^n . Also, let
{ϕn }n≥1 be a sequence of functions where ϕn : X^n → R is a real-valued function on
X^n .1 We are interested in sequential inference, i.e. approximating the following integrals
sequentially in n:

πn (ϕn ) = Eπn [ϕn (X1:n )] = ∫ πn (x1:n )ϕn (x1:n ) dx1:n ,   n = 1, 2, . . .

Despite their versatility and success, it might be impractical to apply MCMC algorithms
to sequential inference problems. This chapter discusses sequential Monte Carlo (SMC)
methods, which provide approximation tools for a sequence of varying distributions.
Good tutorials on the subject are available; see for example Doucet et al. (2000b), and
Doucet et al. (2001) for a book-length review. Also, Robert and Casella (2004) and Cappé
et al. (2005) contain detailed summaries. Finally, the book by Del Moral (2004) treats the
subject more theoretically in a more general framework, namely Feynman-Kac formulae.

6.2 Sequential importance sampling


The first method which is usually considered a sequential Monte Carlo (SMC) method
is sequential importance sampling (SIS), which is a sequential version of importance

1
  In a more general setting, Xn takes values in some space Xn which may not be the same set for all n.
Then the sequence of distributions {πn }n≥1 would be on Xn = Π^n_{i=1} Xi and we would have ϕn : Xn → R.


sampling. Early uses of SIS can be recognised in works dating back to the 1960s and 1970s,
such as Mayne (1966), Handschin and Mayne (1969), and Handschin (1970); see Doucet
et al. (2000b) for a general formulation of the method for Bayesian filtering.

Naive approach: Consider the naive importance sampling approach to the sequential
problem where we have a sequence of importance densities {qn (x1:n )}n≥1 where each qn is
defined on X n such that
wn (x1:n ) = πn (x1:n ) / qn (x1:n ).
It is obvious that we can approximate πn (ϕn ) by generating independent samples from qn
at each n and exploiting the relation

πn (ϕn ) = Eqn [wn (X1:n )ϕn (X1:n )] .

This approach would require the design of a separate qn (x1:n ) and sampling the whole path
X1:n at each n, which is obviously inefficient.

Sequential design of the importance density: An efficient alternative to the naive
approach is SIS, which can be used when it is possible to choose qn (x1:n ) to have the form

qn (x1:n ) = q(x1 ) Π^n_{t=2} q(xt |x1:t−1 ),        (6.1)

where q(x1 ) is some initial density that is easy to sample from and q(xt |x1:t−1 ) are condi-
tional densities which we design so that it is possible to sample from q(·|x1:t−1 ) for any x1:t−1
and t ≥ 1. This selection of qn leads to the following useful recursion on the importance
weights

wn (x1:n ) = πn (x1:n ) / qn (x1:n )
          = [ πn (x1:n ) / (qn−1 (x1:n−1 ) q(xn |x1:n−1 )) ] · [ πn−1 (x1:n−1 ) / πn−1 (x1:n−1 ) ]
          = wn−1 (x1:n−1 ) · πn (x1:n ) / [ πn−1 (x1:n−1 ) q(xn |x1:n−1 ) ].        (6.2)

We remark that the distributions in the sequence are usually known only up to a normalising
constant, as

πn (x1:n ) = π̂n (x1:n ) / Zπn ,
where we know π̂n (x1:n ) for any x1:n but not Zπn . Hence, from now on we will only consider
self-normalised importance sampling where πn−1 and πn are replaced by π̂n−1 and π̂n in
calculation of (and the recursion for) wn (x1:n ) in (6.2).

Approximation to πn : As long as self-normalised importance sampling is concerned,
given the samples and their weights, it is practical to define the weighted empirical
distribution

πn^N (x1:n ) := Σ^N_{i=1} Wn^{(i)} δ_{X1:n^{(i)}} (x1:n ),        (6.3)

as an approximation to πn , where Wn^{(i)} , i = 1, . . . , N are the self-normalised importance
weights

Wn^{(i)} = wn (X1:n^{(i)} ) / Σ^N_{j=1} wn (X1:n^{(j)} ).        (6.4)
This is another way of viewing self-normalised importance sampling: the self-normalised
importance sampling approximation of the desired expectation πn (ϕn ) is exactly the
expectation of ϕn with respect to πn^N . This expectation is given by

πn^N (ϕn ) = Σ^N_{i=1} Wn^{(i)} ϕn (X1:n^{(i)} ).

Note that this is indeed the same as the self-normalised importance sampling estimate, see
(3.7) for example.

The SIS algorithm: In many applications of (6.2), the importance density is designed
in such a way that the ratio

πn (x1:n ) / [ πn−1 (x1:n−1 ) q(xn |x1:n−1 ) ]

is easy to calculate (at least up to a proportionality constant if we use the unnormalised
densities). For example, this may be due to the design of the q(xn |x1:n−1 )'s in such a way
that the ratio depends only on xn−1 and xn . Hence, one can exploit this recursion by
sampling only Xn from q(·|x1:n−1 ) at time n and updating the weights with little effort.
More explicitly, assume that at time n − 1 a set of N ≥ 1 samples, termed particles,
X1:n−1^{(i)}, with weights wn−1 (X1:n−1^{(i)} ) and normalised weights Wn−1^{(i)} ,
i = 1, . . . , N , is available, so that we have

πn−1^N (x1:n−1 ) = Σ^N_{i=1} Wn−1^{(i)} δ_{X1:n−1^{(i)}} (x1:n−1 ).

The update from πn−1^N to πn^N can be performed by first sampling Xn^{(i)} ∼ q(·|X1:n−1^{(i)} ),
then computing the weights wn at the points X1:n^{(i)} = (X1:n−1^{(i)} , Xn^{(i)} ) using the
update rule in (6.2), and finally obtaining the normalised weights Wn^{(i)} using (6.4).
The SIS method is summarised in Algorithm 6.1. Being a special case of importance
sampling, the SIS approximation πn^N (ϕn ) converges almost surely to πn (ϕn ) for any n
(under regularity conditions) as the number of particles N tends to infinity; it is also
possible to obtain a central limit theorem for πn^N (ϕn ) (Geweke, 1989).

Algorithm 6.1: Sequential importance sampling (SIS)

1 for n = 1, 2, . . . do
2   for i = 1, . . . , N do
3     if n = 1 then
4       Sample X1^{(i)} ∼ q(·) and calculate w1 (X1^{(i)} ) = π1 (X1^{(i)} ) / q(X1^{(i)} ).
5     else
6       Sample Xn^{(i)} ∼ q(·|X1:n−1^{(i)} ), set X1:n^{(i)} = (X1:n−1^{(i)} , Xn^{(i)} ), and calculate

          wn (X1:n^{(i)} ) = wn−1 (X1:n−1^{(i)} ) πn (X1:n^{(i)} ) / [ πn−1 (X1:n−1^{(i)} ) q(Xn^{(i)} |X1:n−1^{(i)} ) ].

7   for i = 1, . . . , N do
8     Calculate Wn^{(i)} = wn (X1:n^{(i)} ) / Σ^N_{j=1} wn (X1:n^{(j)} ).
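To make the SIS recursion concrete, here is a sketch in Python on a toy sequence of targets πn (x1:n ) ∝ Π_t φ(xt ; 0, 1)φ(yt ; xt , 1), using the "prior" N (0, 1) as proposal so that the incremental weight reduces to the likelihood term. The target, data, and particle numbers are illustrative assumptions.

```python
import math
import random

def sis(n_particles, n_steps, sample_q, log_incr_weight, seed=1):
    """Generic SIS: sample_q(path, rng) proposes the next component and
    log_incr_weight(path) returns the log of the weight update in (6.2)."""
    rng = random.Random(seed)
    paths = [[] for _ in range(n_particles)]
    logw = [0.0] * n_particles
    for _ in range(n_steps):
        for i in range(n_particles):
            paths[i].append(sample_q(paths[i], rng))
            logw[i] += log_incr_weight(paths[i])
    mx = max(logw)                            # normalise in log space for stability
    w = [math.exp(lw - mx) for lw in logw]
    s = sum(w)
    return paths, [wi / s for wi in w]

y = [0.5] * 20
paths, weights = sis(
    n_particles=500, n_steps=len(y),
    sample_q=lambda path, rng: rng.gauss(0.0, 1.0),
    log_incr_weight=lambda path: -0.5 * (path[-1] - y[len(path) - 1]) ** 2,
)
est = sum(w * p[-1] for w, p in zip(weights, paths))  # weighted estimate of E[X_n]
```

Printing 1/Σ_i w_i^2 here as n grows would show the effective number of particles collapsing, which is exactly the weight degeneracy discussed in Section 6.3.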

Optimal choice of importance density: As in the non-sequential case, it is important
to choose {qn }n≥1 such that the variances of {πn^N (ϕn )}n≥1 are minimised. Recall that in
the SIS algorithm we restrict ourselves to {qn (x1:n )}n≥1 satisfying (6.1); therefore the
selection of the optimal proposal distributions suggested in Section 3.1 may not be possible.
Instead, define the incremental importance weights as

wn|n−1 (x1:n ) = πn (x1:n ) / [ πn−1 (x1:n−1 ) q(xn |x1:n−1 ) ].

A more relevant objective for {qn (x1:n )}n≥1 satisfying (6.1) is to minimise the variance of
wn|n−1 (X1:n ) conditional on X1:n−1 .
Note that the objective of minimising the conditional variance of wn|n−1 is more general
in the sense that it is not specific to ϕn . It was shown in Doucet (1997) that the
q^{opt} (xn |x1:n−1 ) by which this variance is minimised is given by

q^{opt} (xn |x1:n−1 ) = πn (xn |x1:n−1 ) = πn (x1:n ) / ∫ πn (x1:n ) dxn .        (6.5)

Before Doucet (1997), the optimum kernel was used in several works for particular
applications, see e.g. Kong et al. (1994); Liu and Chen (1995); Chen and Liu (1996). The
optimum kernel leads to the optimum incremental weight

w^{opt}_{n|n−1} (x1:n ) = πn (x1:n−1 ) / πn−1 (x1:n−1 ) = ∫ πn (x1:n ) dxn / πn−1 (x1:n−1 ),        (6.6)

which does not depend on the value of xn .

6.3 Sequential importance sampling resampling

Weight degeneracy: The SIS method is an efficient way of implementing importance
sampling sequentially. However, unless the proposal distribution is very close to the true
distribution, the importance weighting step will, over a number of iterations, lead to a
small number of particles with very large weights compared to the rest. This eventually
results in one of the normalised weights being ≈ 1 and the others being ≈ 0, effectively
leaving a particle approximation with a single particle; see Kong et al. (1994) and Doucet
et al. (2000b). This is called the weight degeneracy problem.

Resampling: In order to address the weight degeneracy problem, a resampling step is
introduced at iterations of the SIS method, leading to the sequential importance sampling
resampling (SISR) algorithm.
Generally, we can describe resampling as a method by which a weighted empirical
distribution is replaced with an equally weighted distribution, where the samples of the
equally weighted distribution are drawn from the weighted empirical distribution.

Figure 6.1: Resampling in SISR. Circle sizes represent weights.

In sequential Monte Carlo for {πn (x1:n )}n≥1 , resampling is applied to πn−1^N (x1:n−1 )
before proceeding to approximate πn (x1:n ). Assume, again, that πn−1 (x1:n−1 ) is
approximated by

πn−1^N (x1:n−1 ) = Σ^N_{i=1} Wn−1^{(i)} δ_{X1:n−1^{(i)}} (x1:n−1 ).

We draw N independent samples X̃1:n−1^{(i)} , i = 1, . . . , N from πn−1^N , such that

P( X̃1:n−1^{(i)} = X1:n−1^{(j)} ) = Wn−1^{(j)} ,   i, j = 1, . . . , N.

Obviously, this corresponds to drawing N independent samples from a multinomial
distribution; therefore this particular resampling scheme is called multinomial resampling.
The resampled particles now form the equally weighted discrete distribution

π̃n−1^N (x1:n−1 ) = (1/N ) Σ^N_{i=1} δ_{X̃1:n−1^{(i)}} (x1:n−1 ).
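As a code sketch, multinomial resampling is a one-liner around weighted sampling with replacement (standard library Python; the toy particles and weights are illustrative):

```python
import random

def multinomial_resample(particles, weights, rng):
    """Draw N = len(particles) times from the weighted empirical distribution,
    returning an equally weighted set of particles."""
    n = len(particles)
    idx = rng.choices(range(n), weights=weights, k=n)  # sampling with replacement
    return [particles[i] for i in idx]

rng = random.Random(0)
resampled = multinomial_resample(["a", "b", "c", "d"], [0.7, 0.1, 0.1, 0.1], rng)
```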

We proceed to approximating πn (x1:n ) using π̃n−1^N (x1:n−1 ) instead of πn−1^N (x1:n−1 )
as follows. After resampling, for each i = 1, . . . , N we sample Xn^{(i)} ∼ q(·|X̃1:n−1^{(i)} )
and weight the particles X1:n^{(i)} = (X̃1:n−1^{(i)} , Xn^{(i)} ) using

Wn^{(i)} ∝ wn|n−1 (X1:n^{(i)} ) = πn (X1:n^{(i)} ) / [ πn−1 (X̃1:n−1^{(i)} ) q(Xn^{(i)} |X̃1:n−1^{(i)} ) ],   Σ^N_{i=1} Wn^{(i)} = 1.

The SISR method, also known as the particle filter, is summarised in Algorithm 6.2.

Algorithm 6.2: Sequential importance sampling resampling (SISR)

1 for n = 1, 2, . . . do
2   if n = 1 then
3     for i = 1, . . . , N do
4       Sample X1^{(i)} ∼ q1 (·).
5     for i = 1, . . . , N do
6       Calculate W1^{(i)} ∝ π1 (X1^{(i)} ) / q1 (X1^{(i)} ).
7   else
8     Resample from {X1:n−1^{(i)} }1≤i≤N according to the weights {Wn−1^{(i)} }1≤i≤N to get
      resampled particles {X̃1:n−1^{(i)} }1≤i≤N , each with weight 1/N .
9     for i = 1, . . . , N do
10      Sample Xn^{(i)} ∼ q(·|X̃1:n−1^{(i)} ), set X1:n^{(i)} = (X̃1:n−1^{(i)} , Xn^{(i)} ).
11    for i = 1, . . . , N do
12      Calculate Wn^{(i)} ∝ πn (X1:n^{(i)} ) / [ πn−1 (X̃1:n−1^{(i)} ) q(Xn^{(i)} |X̃1:n−1^{(i)} ) ].
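The following is a sketch of SISR in Python, on a toy model where πn (x1:n ) ∝ Π_t φ(xt ; 0, 1)φ(yt ; xt , 1) and the proposal is the "prior" N (0, 1), so the weight reduces to the likelihood term; resampling is multinomial and all numerical values are illustrative assumptions.

```python
import math
import random

def sisr(n_particles, y, seed=1):
    """SISR (particle filter) sketch: resample, propagate with N(0, 1)
    proposals, reweight with the likelihood factor exp(-(x - y_t)^2 / 2)."""
    rng = random.Random(seed)
    paths = [[] for _ in range(n_particles)]
    w = [1.0 / n_particles] * n_particles
    for t, yt in enumerate(y):
        if t > 0:  # multinomial resampling of whole paths
            idx = rng.choices(range(n_particles), weights=w, k=n_particles)
            paths = [list(paths[i]) for i in idx]
        for p in paths:
            p.append(rng.gauss(0.0, 1.0))            # propose the next component
        logw = [-0.5 * (p[-1] - yt) ** 2 for p in paths]
        mx = max(logw)
        w = [math.exp(lw - mx) for lw in logw]
        s = sum(w)
        w = [wi / s for wi in w]                     # normalised weights
    return paths, w

paths, w = sisr(500, [0.5] * 20)
est = sum(wi * p[-1] for wi, p in zip(w, paths))     # E[X_n] is 0.25 exactly here
```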

Path degeneracy: The importance of resampling in the context of SMC was first
demonstrated by Gordon et al. (1993) based on the ideas of Rubin (1987). Although the
resampling step alleviates the weight degeneracy problem, it has two drawbacks. Firstly,
after successive resampling steps, some of the distinct particles for X1:n are dropped in
favour of more copies of highly weighted particles. This leads to the impoverishment of
particles, such that for k ≪ n, very few particles represent the marginal distribution of
X1:k under πn (Andrieu et al., 2005; Del Moral and Doucet, 2003; Olsson et al., 2008).

Hence, whatever the number of particles, πn (x1:k ) will eventually be approximated
by a single unique particle for all (sufficiently large) n. As a result, any attempt to perform
integrations over the path space will suffer from this form of degeneracy, which is called
path degeneracy. The second drawback is the extra variance introduced by the resampling
step. There are a few ways of reducing the effects of resampling.

• One way is adaptive resampling, i.e. resampling only at iterations where the effective
  sample size drops below a certain proportion of N . For a practical implementation,
  the effective sample size at time n itself should be estimated from the particles as well.
  One particle estimate of Neff,n , given in Liu (2001, pp. 35-36), is

  Ñeff,n = 1 / Σ^N_{i=1} (Wn^{(i)} )^2 .

• Another way to reduce the effects of resampling is to use resampling methods
  alternative to multinomial resampling. Let In (i) be the number of times the i'th
  particle is drawn from πn^N (x1:n ) in a resampling scheme. A number of resampling
  methods have been proposed in the literature that satisfy E [In (i)] = N Wn^{(i)} but
  have different V [In (i)]. The idea behind E [In (i)] = N Wn^{(i)} is that the mean of the
  particle approximation to πn (ϕn ) remains the same after resampling. Standard
  resampling schemes include multinomial resampling (Gordon et al., 1993), residual
  resampling (Whitley, 1994; Liu and Chen, 1998), stratified resampling (Kitagawa,
  1996), and systematic resampling (Whitley, 1994; Carpenter et al., 1999). There are
  also some non-standard resampling algorithms in which the particle size varies
  (randomly) after resampling (e.g. Crisan et al. (1999); Fearnhead and Liu (2007)),
  or the weights are not constrained to be equal after resampling (e.g. Fearnhead and
  Clifford (2003); Fearnhead and Liu (2007)).

• A third way of avoiding path degeneracy is provided by the resample-move algorithm
  (Gilks and Berzuini, 2001), where each resampled particle X̃1:n^{(i)} is moved according
  to an MCMC kernel Kn whose invariant distribution is πn (x1:n ). In fact, we could
  have included this MCMC move step in Algorithm 6.2 to make the algorithm more
  generic. The resample-move algorithm is a useful degeneracy reduction technique in a
  much more general setting, too. Although possible in principle, it is computationally
  infeasible to apply a kernel on the whole path space on which the current particles
  live, as the state space grows at every iteration of SISR.

• The final method we mention here for reducing path degeneracy is block sampling
  (Doucet et al., 2006), where at time n one samples the components Xn−L+1:n for
  some L > 1, and the previously sampled values for Xn−L+1:n−1 are simply discarded.
  In return for the computational cost introduced by L, this procedure reduces the
  variance of the weights and hence dramatically reduces the number of resampling
  steps (if an adaptive resampling strategy is used). Therefore, path degeneracy is
  reduced.
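The effective-sample-size estimate used in the adaptive resampling strategy above is easy to compute from the normalised weights; a minimal helper in Python:

```python
def effective_sample_size(norm_weights):
    """Particle estimate of N_eff: one over the sum of squared normalised weights."""
    return 1.0 / sum(w * w for w in norm_weights)

ess_uniform = effective_sample_size([0.25] * 4)               # equal weights -> 4.0
ess_degenerate = effective_sample_size([1.0, 0.0, 0.0, 0.0])  # one particle -> 1.0
```

A typical adaptive rule resamples only when this estimate drops below, say, N/2.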
Chapter 7

Bayesian inference in Hidden Markov


Models
Summary: One main application of sequential Monte Carlo methods is Bayesian opti-
mum filtering in hidden Markov models (HMM). We will first introduce HMMs. Then
we will see exact sequential inference techniques for finite-state space HMMs and linear
Gaussian HMMs, where we do not need SMC methods for certain distributions of interest.
Then, we will move on to the general case where the HMM can be non-linear and/or non-
Gaussian, and see sequential Monte Carlo methods in action for sequential inference in
such HMMs.

7.1 Introduction
HMMs arguably constitute the widest class of time series models that are used for mod-
elling stochastic behaviour of dynamic systems. In Section 7.2, we will introduce HMMs
using a formulation that is appropriate for filtering and parameter estimation problems.
We will restrict ourselves to discrete time homogeneous HMMs whose dynamics for the
hidden states and observables admit conditional probability densities which are
parametrised by vector valued static parameters. However, this is our only restriction; we
keep our framework general enough to cover models with non-linear non-Gaussian dynamics.
One of the main problems dealt with within the framework of HMMs is optimal Bayesian
filtering, which has many applications in signal processing and related areas such as speech
processing (Rabiner, 1989), finance (Pitt and Shephard, 1999), robotics (Gordon et al.,
1993), communications (Andrieu et al., 2001), etc. Due to the non-linearity and non-
Gaussianity of most models of interest in real life applications, approximate solutions
are inevitable and SMC is the main computational tool used for this; see e.g. Doucet et al.
(2001) for a wide selection of examples demonstrating the use of SMC. SMC methods were
presented in their general form in the previous chapter; we will present their application
to HMMs for optimal Bayesian filtering in Sections 7.3 and 7.3.4.


Figure 7.1: Acyclic directed graph for HMM

7.2 Hidden Markov models

We begin with the definition of a HMM. Let {Xt }t≥1 be a homogeneous Markov chain with
state-space X , initial density η(x1 ) and transition density f (xt |xt−1 ). Suppose that this
process is observed as another process {Yt }t≥1 on Y such that the conditional distribution
on Yt given all the other random variables depends only on Xt and has the conditional
density g(yt |xt ). Then the bivariate process {Xt , Yt }t≥1 is called a HMM. For any n ≥ 1,
the joint probability density of (X1:n , Y1:n ) is given by
p(x1:n , y1:n ) = η(x1 ) Π^n_{t=2} f (xt |xt−1 ) · Π^n_{t=1} g(yt |xt ),        (7.1)

where the first two factors describe the latent Markov process and the last factor the
observations.

Figure 7.1 shows the diagram for the HMM.


The joint law of all the variables of the HMM up to time n is summarised in (7.1) from
which we derive several probability densities of interest. One example is the evidence of
the observations up to time n, which can be derived as

p(y1:n ) = ∫ p(x1:n , y1:n ) dx1:n .        (7.2)

Another important probability density, which will be pursued in detail, is the density of
the posterior distribution of X1:n given Y1:n = y1:n , which is obtained by using Bayes'
theorem:

p(x1:n |y1:n ) = p(x1:n , y1:n ) / p(y1:n ).        (7.3)
In the time series literature, the term HMM has been widely associated with the case
of X being finite (Rabiner, 1989) and those models with continuous X are often referred
to as state-space models. Again, in some works the term ‘state-space models’ refers to
the case of linear Gaussian systems (Anderson and Moore, 1979). We emphasise at this
point that in this text we shall keep the framework as general as possible. We consider
the general case of measurable spaces and we avoid making any restrictive assumptions
on η(x1 ), f (xt |xt−1 ), and g(yt |xt ) that impose a certain structure on the dynamics of the

HMM. Also, we clarify that in contrast to previous restrictive use of terminology, we will
use both terms ‘HMM’ and ‘general state space model’ to describe exactly the same thing.
Example 7.1 (A finite state-space HMM for weather conditions). Assume that
the weather condition in terms of atmospheric pressure is simplified to have two states,
“Low” and “High”, and on day t, Xt ∈ X = {1, 2} denotes the state of the atmospheric
condition in terms of pressure, where 1 stands for “Low” and 2 stands for “High”. Further
{Xt }t≥1 is modelled as a Markov chain with some initial distribution η = [η(1), η(2)] and
transition matrix

F = [ 0.3  0.7 ]
    [ 0.2  0.8 ]

where F (i, j) = P(Xt+1 = j|Xt = i) = f (j|i). What we observe is not the atmospheric
pressure but whether a day is “Dry”, “Cloudy”, or “Rainy”, and these conditions are
enumerated with 1, 2, and 3, respectively. Let Yt ∈ Y = {1, 2, 3} be the observed weather
condition on day t. It is known that low pressure is more likely to lead to clouds or
precipitation than high pressure, and it is assumed that, given Xt , Yt is conditionally
independent from Y1:t−1 and X1:t−1 . The conditional observation matrix that relates Xt
to Yt is given by

G = [ 0.3  0.4  0.3 ]
    [ 0.6  0.3  0.1 ]

where G(i, j) = P(Yt = j|Xt = i) = g(j|i).
where G(i, j) = P(Yt = j|Xt = i) = g(j|i). Then, {Xt , Yt }t≥1 forms a HMM, and since X
is finite, it is called a finite state-space HMM.
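This finite HMM is easy to simulate; a sketch in Python with the states coded 0 and 1 (rather than 1 and 2) for convenient indexing:

```python
import random

def simulate_finite_hmm(F, G, eta, n, seed=1):
    """Simulate n steps of a finite state-space HMM with transition matrix F,
    observation matrix G and initial distribution eta (states coded 0, 1, ...)."""
    rng = random.Random(seed)
    x = rng.choices(range(len(eta)), weights=eta)[0]
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(rng.choices(range(len(G[x])), weights=G[x])[0])  # emit Y_t
        x = rng.choices(range(len(F[x])), weights=F[x])[0]         # move to X_{t+1}
    return xs, ys

F = [[0.3, 0.7], [0.2, 0.8]]             # P(X_{t+1} = j | X_t = i)
G = [[0.3, 0.4, 0.3], [0.6, 0.3, 0.1]]   # P(Y_t = j | X_t = i)
xs, ys = simulate_finite_hmm(F, G, eta=[0.5, 0.5], n=1000)
frac_high = sum(x == 1 for x in xs) / len(xs)  # stationary P(High) is 7/9
```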
Example 7.2 (Linear Gaussian HMM). A generic linear Gaussian HMM {Xt , Yt },
where Xt ∈ Rdx , and Yt ∈ Rdy are vector valued hidden and observed states, can be defined
via the following generative definitions for the random variables {Xt , Yt }:
X1 ∼ N (µ1 , Σ1 ), Xt = AXt−1 + Ut , Ut ∼ N (0, S), t>1 (7.4)
Yt = BXt + Vt , Vt ∼ N (0, R), (7.5)
Here, A, B are dx ×dx and dy ×dx matrices, and S and R are dx ×dx and dy ×dy covariance
matrices for the state and observation processes, respectively. In terms of densities, this
HMM can be described as
η(x1 ) = φ(x1 ; µ1 , Σ1 ), f (xt |xt−1 ) = φ(xt ; Axt−1 , S), g(yt |xt ) = φ(yt ; Bxt , R). (7.6)
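The generative equations (7.4)-(7.5) translate directly into a simulator. A small Python sketch is given below; the particular matrices A, B, S, R and the dimensions are illustrative assumptions, not taken from the text:

```python
import numpy as np

def simulate_lgssm(n, A, B, S, R, mu1, Sigma1, rng):
    """Simulate (x_{1:n}, y_{1:n}) from the linear Gaussian HMM in (7.4)-(7.5)."""
    dx, dy = A.shape[0], B.shape[0]
    xs = np.zeros((n, dx))
    ys = np.zeros((n, dy))
    for t in range(n):
        if t == 0:
            xs[0] = rng.multivariate_normal(mu1, Sigma1)           # X_1 ~ N(mu1, Sigma1)
        else:
            xs[t] = A @ xs[t - 1] + rng.multivariate_normal(np.zeros(dx), S)
        ys[t] = B @ xs[t] + rng.multivariate_normal(np.zeros(dy), R)
    return xs, ys

# A small 2-dimensional instance with illustrative parameter values
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[1.0, 0.0]])
S = 0.1 * np.eye(2)
R = np.array([[0.5]])
xs, ys = simulate_lgssm(100, A, B, S, R, np.zeros(2), np.eye(2), rng)
```

Data simulated this way will be useful later for checking the Kalman filtering and smoothing recursions of Section 7.3.3.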
Example 7.3 (A partially observed moving target). We modify the source localisation
problem in Example 5.15 by adding to the scenario that the source is moving in a Markovian
fashion: The motion of the source is modelled as a Markov chain for its velocity and
position. Let Vt = (Vt (1), Vt (2)) and Pt = (Pt (1), Pt (2)) be the velocity and the position
vectors (in the xy plane) of the source at time t and assume that they evolve according to
the following stochastic dynamics:

V1 (i) ∼ N (0, σ²bv ), P1 (i) ∼ N (0, σ²bp ), i = 1, 2,
Vt (i) = aVt−1 (i) + Ut (i), Pt (i) = Pt−1 (i) + ∆Vt−1 (i) + Zt (i), i = 1, 2, t > 1,
where Ut (i) ∼ N (0, σv2 ) and Zt (i) ∼ N (0, σp2 ) independently over t and i. This model dictates that the velocity in each
direction changes independently according to an autoregressive model with the regression
parameter a and driving variance σv2 and the position is the previous position plus the
previous velocity multiplied by the factor ∆ which corresponds to the time interval between
successive time steps t − 1 and t, plus some noise which accounts for the discretisation error.
Let Xt = (Vt , Pt ). Xt is a Markov chain with transition density
f (xt |xt−1 ) = ∏²ᵢ₌₁ φ(vt (i); avt−1 (i), σv2 ) ∏²ᵢ₌₁ φ(pt (i); pt−1 (i) + ∆vt−1 (i), σp2 )

or, in matrix form, f (xt |xt−1 ) = φ(xt ; F xt−1 , Σx ), where

        a  0  0  0              σv2   0    0    0
F  =    0  a  0  0 ,    Σx =    0    σv2   0    0
        ∆  0  1  0              0    0    σp2   0
        0  ∆  0  1              0    0    0    σp2
The observations are generated as before, i.e. at each time t three distance measurements
(Rt,1 , Rt,2 , Rt,3 ) with

Rt,i = [(Pt (1) − Si (1))2 + (Pt (2) − Si (2))2 ]1/2 , i = 1, 2, 3,

from three different sensors are collected in Gaussian noise with variance σy2 , and these
measurements form Yt = (Yt,1 , Yt,2 , Yt,3 ):

Yt,i = Rt,i + Et,i , Et,i ∼ N (0, σy2 ) i.i.d., i = 1, 2, 3,

so that

g(yt |xt ) = ∏³ᵢ₌₁ φ(yt,i ; rt,i , σy2 ).

This is an example of a non-linear HMM, due to the non-linearity in its observation dynamics.
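The state and observation dynamics of this example can likewise be simulated. In the sketch below, all parameter values and the sensor positions S1 , S2 , S3 (which belong to Example 5.15 and are not restated here) are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values; the sensor positions are assumptions.
a, Delta = 0.95, 1.0
s2_bv, s2_bp, s2_v, s2_p, s2_y = 1.0, 100.0, 0.1, 0.01, 0.5
sensors = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 50.0]])

def simulate_target(n, rng):
    """Simulate velocity, position, and noisy distance measurements."""
    V = np.zeros((n, 2)); P = np.zeros((n, 2)); Y = np.zeros((n, 3))
    V[0] = rng.normal(0.0, np.sqrt(s2_bv), 2)
    P[0] = rng.normal(0.0, np.sqrt(s2_bp), 2)
    for t in range(1, n):
        V[t] = a * V[t - 1] + rng.normal(0.0, np.sqrt(s2_v), 2)
        P[t] = P[t - 1] + Delta * V[t - 1] + rng.normal(0.0, np.sqrt(s2_p), 2)
    for t in range(n):
        r = np.linalg.norm(P[t] - sensors, axis=1)    # distances R_{t,1:3}
        Y[t] = r + rng.normal(0.0, np.sqrt(s2_y), 3)  # non-linear observation
    return V, P, Y

V, P, Y = simulate_target(100, rng)
```

Because the map from position to distances is non-linear, the Kalman filter of Section 7.3.3 does not apply exactly to this model; it is a natural test case for the particle filters of Section 7.3.4.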

7.3 Bayesian optimal filtering and smoothing


In a HMM, one is usually interested in sequential inference on the variables of the hidden
process {Xt }t≥1 given observations {Yt }t≥1 up to time n. For example, one pursues the
sequence of posterior distributions {p(x1:t |y1:t )}t≥1 , where p(x1:t |y1:t ) is given in equation
(7.3). It is also straightforward to generalise p(x1:t |y1:t ) to the posterior distribution of
X1:t′ for any t′ ≥ 1. For t′ > t we have

p(x1:t′ |y1:t ) = p(x1:t |y1:t ) ∏_{τ=t+1}^{t′} f (xτ |xτ−1 );
whereas for t′ < t the density p(x1:t′ |y1:t ) can be obtained simply by integrating out the
variables xt′+1:t , i.e.

p(x1:t′ |y1:t ) = ∫ p(x1:t |y1:t )dxt′+1:t .

7.3.1 Filtering, prediction, and smoothing


From a Bayesian point of view, the probability densities p(x1:n′ |y1:n ) are complete solutions
to the inference problems, as they contain all the information about the hidden states X1:n′
given the observations y1:n . For example, the expectation of a function ϕn′ : X^{n′} → R
conditional upon the observations y1:n can be evaluated as

E [ϕn′ (X1:n′ )|Y1:n = y1:n ] = ∫ ϕn′ (x1:n′ )p(x1:n′ |y1:n )dx1:n′ .

However, one can restrict their focus to a problem of smaller size, such as the marginal
distribution of the random variable Xk , k ≤ n′ , given y1:n . The probability density of such
a marginal posterior distribution, p(xk |y1:n ), is called a filtering, prediction, or smoothing
density if k = n, k > n, and k < n, respectively. Indeed, there are many cases where one
is interested in calculating the expectation of a function ϕ : X → R of Xk given y1:n :

E [ϕ(Xk )|Y1:n = y1:n ] = ∫ ϕ(xk )p(xk |y1:n )dxk .

Although the marginal density can be obtained directly by marginalising p(x1:n′ |y1:n ) for
any n′ ≥ k, the recursion in (7.24) may be intractable or too expensive to calculate.
Therefore it is useful to have alternative recursion techniques that evaluate the
marginal densities sequentially and efficiently.

Forward filtering backward smoothing
Here, we will see the technique called forward filtering backward smoothing that combines
the recursions for the filtering and one-step prediction densities as well as a backward
recursion for smoothing densities.

Forward filtering (and prediction): We start with p(x1 |y0 ) := p(x1 ) = η(x1 ) and

p(x1 |y1 ) = η(x1 )g(y1 |x1 ) / p(y1 ),

where p(y1 ) = ∫ η(x′1 )g(y1 |x′1 )dx′1 . Given the filtering density p(xt−1 |y1:t−1 ) at time t − 1
and the new observation yt at time t, the filtering density at time t can be obtained
recursively in two stages, which are called prediction and update. These are given as

p(xt |y1:t−1 ) = ∫ p(xt |xt−1 , y1:t−1 )p(xt−1 |y1:t−1 )dxt−1
             = ∫ f (xt |xt−1 )p(xt−1 |y1:t−1 )dxt−1 ,                     (7.7)

p(xt |y1:t ) = p(yt |xt , y1:t−1 )p(xt |y1:t−1 ) / p(yt |y1:t−1 )
            = g(yt |xt )p(xt |y1:t−1 ) / p(yt |y1:t−1 ),                  (7.8)

where this time we write the normalising constant as

p(yt |y1:t−1 ) = ∫ p(xt |y1:t−1 )g(yt |xt )dxt .                          (7.9)

Actually, (7.9) is important in its own right, since it leads to two important quantities.
First, given y1:n , the posterior predictive density for Yn+1 is simply p(yn+1 |y1:n ), which
can be calculated from p(xn+1 |y1:n ). Secondly, (7.9) can be used to calculate the evidence
recursively:

p(y1:n ) = ∏_{t=1}^{n} p(yt |y1:t−1 ) = p(y1:n−1 )p(yn |y1:n−1 ).         (7.10)
The problem of evaluating the recursion given by equations (7.7) and (7.8) is called the
Bayesian optimal filtering (or, in short, optimal filtering) problem in the literature.

Backward smoothing: Once we have the forward filtering recursion to calculate the
filtering and prediction densities p(xt |y1:t ) and p(xt |y1:t−1 ) for t = 1, . . . , n, where n is the
total number of observations, there is more than one way of performing smoothing in a
HMM, i.e. of calculating p(xt |y1:n ), t = 1, . . . , n. We will see the one that corresponds to forward
filtering backward smoothing. As the name suggests, backward smoothing is performed via
a backward recursion in time, i.e. p(xt |y1:n ) is calculated in the order t = n, n − 1, . . . , 1.
Now let us see how one step of the backward recursion works. Given p(xt+1 |y1:n ), we find
p(xt |y1:n ) by exploiting the following relation,

p(xt |y1:n ) = ∫ p(xt , xt+1 |y1:n )dxt+1
           = ∫ p(xt+1 |y1:n )p(xt |xt+1 , y1:n )dxt+1 ,                   (7.11)

which holds for any time series model. Thanks to the particular structure of the
HMM, given Xt+1 , Xt is conditionally independent from the rest of the future variables
(try to see this from Figure 7.1). Hence

p(xt |xt+1 , y1:n ) = p(xt |xt+1 , y1:t ) = p(xt |y1:t )f (xt+1 |xt ) / p(xt+1 |y1:t ).   (7.12)
In fact, one can derive this analytically as

p(xt |xt+1 , y1:n ) = p(xt |xt+1 , y1:t , yt+1:n )
                  = p(xt |y1:t )p(xt+1 , yt+1:n |xt , y1:t ) / p(xt+1 , yt+1:n |y1:t )
                  = p(xt |y1:t )p(xt+1 |xt , y1:t )p(yt+1:n |xt+1 , xt , y1:t ) / [p(xt+1 |y1:t )p(yt+1:n |xt+1 , y1:t )]
                  = p(xt |y1:t )f (xt+1 |xt )p(yt+1:n |xt+1 ) / [p(xt+1 |y1:t )p(yt+1:n |xt+1 )]
                  = p(xt |y1:t )f (xt+1 |xt ) / p(xt+1 |y1:t ),

where the last expression is indeed exactly p(xt |xt+1 , y1:n ). Substituting this into (7.11),
we have

p(xt |y1:n ) = ∫ p(xt+1 |y1:n ) [ p(xt |y1:t )f (xt+1 |xt ) / p(xt+1 |y1:t ) ] dxt+1 ,    (7.13)
which involves the filtering and prediction distributions that we have calculated in the
forward filtering stage already.
There are cases when the optimal filtering problem can be solved exactly. One such
case is when X is a finite set (Rabiner, 1989). Also, in linear Gaussian state-space
models the densities in (7.7) and (7.8) are obtained by the Kalman filter (Kalman, 1960).

Sampling from the full posterior
In order to sample from the posterior distribution p(x1:n |y1:n ), we can exploit the following
factorisation:

p(x1:n |y1:n ) = p(xn |y1:n ) ∏_{t=n−1}^{1} p(xt |xt+1:n , y1:n )
             = p(xn |y1:n ) ∏_{t=n−1}^{1} p(xt |xt+1 , y1:t ),            (7.14)
where the second line is crucial and it follows from the specific dependency structure
of the HMM. Equation (7.14) suggests that we can start sampling Xn from the filtering
distribution at time n, and go backwards to sample Xn−1 , Xn−2 , . . . , X1 , using the backward
transition probabilities. Note that one needs all the filtering distributions up to time n in
order to perform this backward sampling. That is why the algorithm that executes this
scheme to sample from p(x1:n |y1:n ) is called forward filtering backward sampling.
7.3.2 Exact inference in finite state-space HMMs


In a finite state-space HMM, as exemplified in Example 7.1, Xt takes values from a finite
set X of size k, and for simplicity we assume that the states are enumerated from 1 to k,
implying X = {1, . . . , k}. Define the 1 × k vectors αt , βt , γt for t = 1, . . . , n that represent
the filtering, prediction, and smoothing probabilities, respectively:

αt (i) := P(Xt = i|Y1:t = y1:t ), i = 1, . . . , k, t = 1, . . . , n,
βt (i) := P(Xt = i|Y1:t−1 = y1:t−1 ), i = 1, . . . , k, t = 1, . . . , n,
γt (i) := P(Xt = i|Y1:n = y1:n ), i = 1, . . . , k, t = 1, . . . , n.

The forward filtering backward smoothing algorithm for a finite-state HMM is given in
Algorithm 7.1. The recursions given in the algorithms are simply the discrete versions of
Equations (7.7), (7.8), and (7.13). In order to keep track of p(yt |y1:t−1 ) (hence p(y1:t )) as
well, one can include the following (with the convention p(y1 |y0 ) = p(y1 ))
p(yt |y1:t−1 ) = Σ_{i=1}^{k} βt (i)g(yt |i).
Algorithm 7.1: Forward filtering backward smoothing in finite state-space HMM

Input: Observations y1:n , HMM transition and observation probabilities η, f , g
Output: αt , βt , γt , t = 1, . . . , n.
Forward filtering
1 for t = 1, . . . , n do
2    Prediction: If t = 1, set β1 (i) = η(i), i = 1, . . . , k; else

        βt (i) = Σ_{j=1}^{k} αt−1 (j)f (i|j), i = 1, . . . , k.

     Filtering:

        αt (i) = βt (i)g(yt |i) / Σ_{j=1}^{k} βt (j)g(yt |j), i = 1, . . . , k.

Backward smoothing
3 for t = n, . . . , 1 do
4    Smoothing: If t = n, set γn (i) = αn (i), i = 1, . . . , k; else

        γt (i) = Σ_{j=1}^{k} γt+1 (j) αt (i)f (j|i) / βt+1 (j), i = 1, . . . , k.
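The recursions of Algorithm 7.1 vectorise naturally over the k states. A minimal Python sketch, demonstrated on Example 7.1 (states indexed from 0 and a uniform initial distribution assumed, since η is not given in the example):

```python
import numpy as np

def forward_backward(y, eta, F, G):
    """Algorithm 7.1: forward filtering backward smoothing for a finite
    state-space HMM. Returns filtering (alpha), prediction (beta), and
    smoothing (gamma) probabilities, each of shape (n, k)."""
    y = np.asarray(y)
    n, k = len(y), len(eta)
    alpha, beta, gamma = np.zeros((n, k)), np.zeros((n, k)), np.zeros((n, k))
    for t in range(n):                                  # forward filtering
        beta[t] = eta if t == 0 else alpha[t - 1] @ F   # prediction step
        unnorm = beta[t] * G[:, y[t]]                   # update with g(y_t | .)
        alpha[t] = unnorm / unnorm.sum()
    gamma[n - 1] = alpha[n - 1]
    for t in range(n - 2, -1, -1):                      # backward smoothing
        gamma[t] = alpha[t] * (F @ (gamma[t + 1] / beta[t + 1]))
    return alpha, beta, gamma

# Demo on Example 7.1 (uniform initial distribution assumed)
eta = np.array([0.5, 0.5])
F = np.array([[0.3, 0.7], [0.2, 0.8]])
G = np.array([[0.3, 0.4, 0.3], [0.6, 0.3, 0.1]])
alpha, beta, gamma = forward_backward([0, 2, 1, 0, 1], eta, F, G)
```

A useful sanity check is that every row of alpha, beta, and gamma sums to one, and that the last smoothing row equals the last filtering row.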
Forward filtering backward sampling: For the finite state-space HMM, the backward
transition probabilities are given as

P(Xt = i|Xt+1 = j, Y1:t = y1:t ) = αt (i)f (j|i) / βt+1 (j).
The resulting forward filtering backward sampling algorithm for finite state-space HMMs
to sample from the full posterior p(x1:n |y1:n ) is given in Algorithm 7.2.

Algorithm 7.2: Forward filtering backward sampling in finite state-space HMM

Input: Observations y1:n , HMM transition and observation probabilities η, f , g
Output: X1:n ∼ p(x1:n |y1:n ).
Forward filtering
1 Perform forward filtering in the first part of Algorithm 7.1 to obtain αt , βt ,
  t = 1, . . . , n.
Backward sampling
2 for t = n, . . . , 1 do
     If t = n, sample Xn = i with probability αn (i), i = 1, . . . , k; else, given
     Xt+1 = xt+1 , sample Xt = i with probability αt (i)f (xt+1 |i)/βt+1 (xt+1 ), i = 1, . . . , k.
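A sketch of Algorithm 7.2 in Python; the forward pass is repeated inline so the snippet is self-contained, and a uniform initial distribution is again assumed for the demo on Example 7.1:

```python
import numpy as np

def ffbs_finite(y, eta, F, G, rng):
    """Algorithm 7.2: draw X_{1:n} ~ p(x_{1:n} | y_{1:n}) in a finite HMM
    (states indexed from 0)."""
    y = np.asarray(y)
    n, k = len(y), len(eta)
    alpha = np.zeros((n, k)); beta = np.zeros((n, k))
    for t in range(n):                                  # forward filtering
        beta[t] = eta if t == 0 else alpha[t - 1] @ F
        unnorm = beta[t] * G[:, y[t]]
        alpha[t] = unnorm / unnorm.sum()
    x = np.empty(n, dtype=int)                          # backward sampling
    x[n - 1] = rng.choice(k, p=alpha[n - 1])
    for t in range(n - 2, -1, -1):
        probs = alpha[t] * F[:, x[t + 1]] / beta[t + 1, x[t + 1]]
        x[t] = rng.choice(k, p=probs / probs.sum())     # renormalise for safety
    return x

# Demo on Example 7.1 (uniform initial distribution assumed)
rng = np.random.default_rng(7)
eta = np.array([0.5, 0.5])
F = np.array([[0.3, 0.7], [0.2, 0.8]])
G = np.array([[0.3, 0.4, 0.3], [0.6, 0.3, 0.1]])
x_path = ffbs_finite([0, 2, 1, 0, 1], eta, F, G, rng)
```

Repeated calls produce i.i.d. draws of whole hidden trajectories from the full posterior, which is exactly what Gibbs-type samplers over HMMs require.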

7.3.3 Exact inference in linear Gaussian HMMs


Consider the linear Gaussian HMM in Example 7.2, where the initial, state transition, and
observation distributions are given in Equations 7.4 and 7.5, which are repeated here
X1 ∼ N (µ1 , Σ1 ), Xt = AXt−1 + Ut , Ut ∼ N (0, S), t>1
Yt = BXt + Vt , Vt ∼ N (0, R)
Since this is a linear and Gaussian HMM, the filtering, prediction, and smoothing distri-
butions have to be Gaussian as well. For any k and n, let the mean and the covariance of
the posterior distribution of Xk given Y1:n = y1:n be µk|n and Pk|n , respectively. Then, we
denote the distributions of interest as
Xt |Y1:t = y1:t ∼ N (µt|t , Pt|t ), t = 1, . . . , n,
Xt |Y1:t−1 = y1:t−1 ∼ N (µt|t−1 , Pt|t−1 ), t = 1, . . . , n,
Xt |Y1:n = y1:n ∼ N (µt|n , Pt|n ), t = 1, . . . , n;
or, in terms of densities,
p(xt |y1:t ) = φ(xt ; µt|t , Pt|t ), t = 1, . . . , n,
p(xt |y1:t−1 ) = φ(xt ; µt|t−1 , Pt|t−1 ), t = 1, . . . , n,
p(xt |y1:n ) = φ(xt ; µt|n , Pt|n ), t = 1, . . . , n.
Moreover, as we will see, the mean and the covariance of these distributions are tractable.
Forward filtering: The prediction update from (µt−1|t−1 , Pt−1|t−1 ) to (µt|t−1 , Pt|t−1 ) can
be deduced from Equation (7.7), but a simpler way of achieving this is to notice that
the update is simply an application of a linear transformation of Gaussian variables: since
Xt = AXt−1 + Ut , and Ut is independent from all the other variables and Gaussian, too,
we have

µt|t−1 = E[Xt |Y1:t−1 = y1:t−1 ]                                          (7.15)
       = E[AXt−1 + Ut |Y1:t−1 = y1:t−1 ]
       = E[AXt−1 |Y1:t−1 = y1:t−1 ] + E[Ut |Y1:t−1 = y1:t−1 ]
       = A E[Xt−1 |Y1:t−1 = y1:t−1 ] + E[Ut ]
       = Aµt−1|t−1 + 0
       = Aµt−1|t−1 .                                                     (7.16)

For the covariance of the prediction distribution, we have

Pt|t−1 = Cov[Xt |Y1:t−1 = y1:t−1 ]
       = Cov[AXt−1 + Ut |Y1:t−1 = y1:t−1 ]
       = Cov[AXt−1 |Y1:t−1 = y1:t−1 ] + Cov[Ut |Y1:t−1 = y1:t−1 ]
       = A Cov[Xt−1 |Y1:t−1 = y1:t−1 ]Aᵀ + Cov[Ut ]
       = APt−1|t−1 Aᵀ + S.                                               (7.17)

By using (7.9), we can derive the mean µ^y_t|t−1 and the covariance P^y_t|t−1 of the conditional
density p(yt |y1:t−1 ), which we know to be a Gaussian density. An alternative way is to
derive the moments as above:

µ^y_t|t−1 = E[Yt |Y1:t−1 = y1:t−1 ] = E[BXt + Vt |Y1:t−1 = y1:t−1 ]
         = E[BXt |Y1:t−1 = y1:t−1 ] + E[Vt |Y1:t−1 = y1:t−1 ]
         = B E[Xt |Y1:t−1 = y1:t−1 ] + E[Vt ]
         = Bµt|t−1                                                       (7.18)

and
y
Pt|t−1 = Cov[Yt |Y1:t−1 = y1:t−1 ]
= Cov[BXt + Vt |Y1:t−1 = y1:t−1 ]
= Cov[BXt |Y1:t−1 = y1:t−1 ] + Cov[Vt |Y1:t−1 = y1:t−1 ]
= BCov[Xt−1 |Y1:t−1 = y1:t−1 ]B T + Cov[Vt ]
= BPt|t−1 B T + R. (7.19)

The filtering distribution p(xt |y1:t ) can be found by applying the Bayes theorem with
prior p(xt |y1:t−1 ) and likelihood g(yt |xt ). Since both are Gaussian and the relation between
Yt and Xt is linear, we can apply the conjugacy result for the mean parameter of the normal
distribution in Example 4.7 and deduce
Pt|t = ((Pt|t−1 )⁻¹ + Bᵀ R⁻¹ B)⁻¹

and

µt|t = Pt|t ((Pt|t−1 )⁻¹ µt|t−1 + Bᵀ R⁻¹ yt ).
Using the matrix inversion lemma¹ and letting P^xy_t|t−1 = Pt|t−1 Bᵀ , we can rewrite

Pt|t = Pt|t−1 − Pt|t−1 Bᵀ (R + BPt|t−1 Bᵀ )⁻¹ BPt|t−1
     = Pt|t−1 − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ (P^xy_t|t−1 )ᵀ                     (7.20)

and

µt|t = (Pt|t−1 − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ (P^xy_t|t−1 )ᵀ )((Pt|t−1 )⁻¹ µt|t−1 + Bᵀ R⁻¹ yt )
     = µt|t−1 + P^xy_t|t−1 R⁻¹ yt − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ Bµt|t−1
       − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ BPt|t−1 Bᵀ R⁻¹ yt
     = µt|t−1 + P^xy_t|t−1 (P^y_t|t−1 )⁻¹ ([BPt|t−1 Bᵀ + R]R⁻¹ − BPt|t−1 Bᵀ R⁻¹ )yt
       − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ µ^y_t|t−1
     = µt|t−1 + P^xy_t|t−1 (P^y_t|t−1 )⁻¹ yt − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ µ^y_t|t−1
     = µt|t−1 + P^xy_t|t−1 (P^y_t|t−1 )⁻¹ (yt − µ^y_t|t−1 ).                  (7.21)

Backward smoothing: For backward smoothing, we start from µn|n and Pn|n , which
are already calculated in the last step of the forward filtering recursion, and go backwards
to derive µt|n and Pt|n from µt+1|n and Pt+1|n . Observing (7.13), first we need to derive the
backward transition density

p(xt |xt+1 , y1:t ) = p(xt |y1:t )f (xt+1 |xt ) / p(xt+1 |y1:t ).

This is in the form of the Bayes' rule, with prior p(xt |y1:t ) = φ(xt ; µt|t , Pt|t ) and likelihood
f (xt+1 |xt ) = φ(xt+1 ; Axt , S). Since both the prior and the likelihood densities are Gaussian
and the relation is linear, we know that p(xt |xt+1 , y1:t ) is Gaussian, too. In order to derive
its mean µ^x_t|t+1 and covariance P^x_t|t+1 , we make use of the result in Example 4.7 again to
arrive at

P^x_t|t+1 = ((Pt|t )⁻¹ + Aᵀ S⁻¹ A)⁻¹
µ^x_t|t+1 = P^x_t|t+1 ((Pt|t )⁻¹ µt|t + Aᵀ S⁻¹ xt+1 ).
¹ For invertible matrices A, B and any two matrices U and V of suitable size, the lemma states that
(A + U BV )⁻¹ = A⁻¹ − A⁻¹ U (B⁻¹ + V A⁻¹ U )⁻¹ V A⁻¹ .
Using the matrix inversion lemma again, and letting Γt|t+1 = Pt|t Aᵀ (Pt+1|t )⁻¹ , we rewrite those
moments as

P^x_t|t+1 = Pt|t − Γt|t+1 APt|t
µ^x_t|t+1 = µt|t + Γt|t+1 (xt+1 − µt+1|t ).
One can (artificially but usefully) view this relation as follows: given Y1:n = y1:n , Xt can
be written in terms of Xt+1 as

Xt = µt|t + Γt|t+1 (Xt+1 − µt+1|t ) + Et ,

where Xt+1 |Y1:n = y1:n ∼ N (µt+1|n , Pt+1|n ), Et |Y1:n = y1:n ∼ N (0, P^x_t|t+1 ), and Et is
independent from Xt+1 given Y1:n . From this, we use the composition rule for Gaussian
distributions to derive µt|n and Pt|n from µt+1|n and Pt+1|n . We have
µt|n = E[µt|t + Γt|t+1 (Xt+1 − µt+1|t ) + Et |Y1:n = y1:n ]
     = µt|t + Γt|t+1 (E[Xt+1 |Y1:n = y1:n ] − µt+1|t )
     = µt|t + Γt|t+1 (µt+1|n − µt+1|t )                                   (7.22)
and

Pt|n = Cov[µt|t + Γt|t+1 (Xt+1 − µt+1|t ) + Et |Y1:n = y1:n ]
     = Γt|t+1 Cov[Xt+1 − µt+1|t |Y1:n = y1:n ]Γᵀt|t+1 + P^x_t|t+1
     = Γt|t+1 Cov[Xt+1 |Y1:n = y1:n ]Γᵀt|t+1 + P^x_t|t+1
     = Γt|t+1 Pt+1|n Γᵀt|t+1 + Pt|t − Γt|t+1 APt|t
     = Pt|t + Γt|t+1 (Pt+1|n − Pt+1|t )Γᵀt|t+1 .                          (7.23)
The forward filtering backward smoothing recursions are given in Algorithm 7.3. In
order to keep track of p(yt |y1:t−1 ) (hence p(y1:t )) as well, one can include the following (with
the convention p(y1 |y0 ) = p(y1 )):

p(yt |y1:t−1 ) = φ(yt ; µ^y_t|t−1 , P^y_t|t−1 ).

Forward filtering backward sampling: Similar to the finite state-space case, with the
help of backward transition distributions, we can sample from the full posterior p(x1:n |y1:n )
in a linear Gaussian HMM by using forward filtering backward sampling, which is given
in Algorithm 7.4.

7.3.4 Particle filters for optimal filtering in HMM


We saw two cases where the optimal filtering problem can be solved exactly. In
general, however, these densities do not admit a closed-form expression and one has to use
methods based on numerical approximations. In the following, we will look at the SMC
methodology in the context of general HMMs and review how SMC methods have been
used to provide approximate solutions to the optimal filtering problem.
Algorithm 7.3: Forward filtering backward smoothing in linear Gaussian HMM

Input: Observations y1:n , HMM transition and observation parameters A, B, S, R, µ1 , Σ1
Output: µt|t−1 , Pt|t−1 , µt|t , Pt|t , µt|n , Pt|n , t = 1, . . . , n.
Forward filtering (Kalman filtering)
1 for t = 1, . . . , n do
2    Prediction: If t = 1, set µ1|0 = µ1 , P1|0 = Σ1 ; else

        µt|t−1 = Aµt−1|t−1
        Pt|t−1 = APt−1|t−1 Aᵀ + S

     Filtering:

        P^y_t|t−1 = BPt|t−1 Bᵀ + R
        µ^y_t|t−1 = Bµt|t−1
        P^xy_t|t−1 = Pt|t−1 Bᵀ
        µt|t = µt|t−1 + P^xy_t|t−1 (P^y_t|t−1 )⁻¹ (yt − µ^y_t|t−1 )
        Pt|t = Pt|t−1 − P^xy_t|t−1 (P^y_t|t−1 )⁻¹ (P^xy_t|t−1 )ᵀ

Backward smoothing
3 Take µn|n , Pn|n from the last filtering step; for t = n − 1, . . . , 1 do
4    Smoothing:

        Γt|t+1 = Pt|t Aᵀ (Pt+1|t )⁻¹
        µt|n = µt|t + Γt|t+1 (µt+1|n − µt+1|t )
        Pt|n = Pt|t + Γt|t+1 (Pt+1|n − Pt+1|t )Γᵀt|t+1
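Algorithm 7.3 can be sketched in Python as below. The demo at the bottom runs the recursions on an assumed scalar model with arbitrary observations, purely to exercise the code; it is not a model from the text:

```python
import numpy as np

def kalman_smoother(ys, A, B, S, R, mu1, Sigma1):
    """Algorithm 7.3: Kalman (forward) filtering followed by backward smoothing.
    ys has shape (n, dy); returns predicted, filtered, smoothed means/covariances."""
    n, dx = len(ys), len(mu1)
    mu_p = np.zeros((n, dx)); P_p = np.zeros((n, dx, dx))  # mu_{t|t-1}, P_{t|t-1}
    mu_f = np.zeros((n, dx)); P_f = np.zeros((n, dx, dx))  # mu_{t|t},   P_{t|t}
    for t in range(n):
        # Prediction
        mu_p[t] = mu1 if t == 0 else A @ mu_f[t - 1]
        P_p[t] = Sigma1 if t == 0 else A @ P_f[t - 1] @ A.T + S
        # Filtering (update)
        Py = B @ P_p[t] @ B.T + R            # P^y_{t|t-1}
        Pxy = P_p[t] @ B.T                   # P^{xy}_{t|t-1}
        K = Pxy @ np.linalg.inv(Py)          # Kalman gain
        mu_f[t] = mu_p[t] + K @ (ys[t] - B @ mu_p[t])
        P_f[t] = P_p[t] - K @ Pxy.T
    mu_s = mu_f.copy(); P_s = P_f.copy()     # mu_{t|n}, P_{t|n}
    for t in range(n - 2, -1, -1):           # backward smoothing
        Gam = P_f[t] @ A.T @ np.linalg.inv(P_p[t + 1])   # Gamma_{t|t+1}
        mu_s[t] = mu_f[t] + Gam @ (mu_s[t + 1] - mu_p[t + 1])
        P_s[t] = P_f[t] + Gam @ (P_s[t + 1] - P_p[t + 1]) @ Gam.T
    return mu_p, P_p, mu_f, P_f, mu_s, P_s

# Demo on a scalar model (a = 0.99, unit noise variances; values illustrative)
rng = np.random.default_rng(3)
ys = rng.normal(size=(20, 1))
A = np.array([[0.99]]); B = np.array([[1.0]])
S = np.array([[1.0]]); R = np.array([[1.0]])
mu_p, P_p, mu_f, P_f, mu_s, P_s = kalman_smoother(
    ys, A, B, S, R, np.array([0.0]), np.array([[4.0]]))
```

Note that the smoothed covariances never exceed the filtered ones, since conditioning on future observations can only reduce (or preserve) the uncertainty.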
Algorithm 7.4: Forward filtering backward sampling in linear Gaussian HMM

Input: Observations y1:n , HMM transition and observation parameters A, B, S, R, µ1 , Σ1
Output: X1:n ∼ p(x1:n |y1:n ).
Forward filtering (Kalman filtering)
1 Perform the forward filtering part in Algorithm 7.3 to calculate µt|t , Pt|t , µt|t−1 , and
  Pt|t−1 , t = 1, . . . , n.
Backward sampling
2 for t = n, . . . , 1 do
     If t = n, sample Xn ∼ N (µn|n , Pn|n ); else, given Xt+1 = xt+1 , calculate

        Γt|t+1 = Pt|t Aᵀ (Pt+1|t )⁻¹ ,
        P^x_t|t+1 = Pt|t − Γt|t+1 APt|t ,
        µ^x_t|t+1 = µt|t + Γt|t+1 (xt+1 − µt+1|t ),

     and sample Xt ∼ N (µ^x_t|t+1 , P^x_t|t+1 ).
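A sketch of Algorithm 7.4 for the scalar case (dx = dy = 1), where all matrix operations reduce to scalar arithmetic; the parameter values in the demo are illustrative:

```python
import numpy as np

def ffbs_lgssm(ys, a, b, s2x, s2y, mu1, sig1, rng):
    """Algorithm 7.4 for a scalar linear Gaussian HMM: Kalman forward
    filtering, then backward sampling of X_{1:n} ~ p(x_{1:n} | y_{1:n})."""
    n = len(ys)
    mu_p = np.zeros(n); P_p = np.zeros(n)   # predictive moments
    mu_f = np.zeros(n); P_f = np.zeros(n)   # filtering moments
    for t in range(n):
        mu_p[t] = mu1 if t == 0 else a * mu_f[t - 1]
        P_p[t] = sig1 if t == 0 else a * a * P_f[t - 1] + s2x
        K = P_p[t] * b / (b * b * P_p[t] + s2y)      # Kalman gain
        mu_f[t] = mu_p[t] + K * (ys[t] - b * mu_p[t])
        P_f[t] = P_p[t] - K * b * P_p[t]
    x = np.zeros(n)
    x[n - 1] = rng.normal(mu_f[n - 1], np.sqrt(P_f[n - 1]))
    for t in range(n - 2, -1, -1):                   # backward sampling
        Gam = P_f[t] * a / P_p[t + 1]                # Gamma_{t|t+1}
        m = mu_f[t] + Gam * (x[t + 1] - mu_p[t + 1])
        v = P_f[t] - Gam * a * P_f[t]                # P^x_{t|t+1}
        x[t] = rng.normal(m, np.sqrt(v))
    return x

rng = np.random.default_rng(4)
ys = rng.normal(size=12)
x_draw = ffbs_lgssm(ys, a=0.99, b=1.0, s2x=1.0, s2y=1.0, mu1=0.0, sig1=4.0, rng=rng)
```

Each call returns one exact joint draw of the whole hidden path given the observations.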

Motivation for particle filters
One approximation to the optimal filtering recursion is to use grid-based methods, where
the continuous X is approximated by a finite discretised version and the update rules
are used as in the case of finite state HMMs. Another approach is the extended Kalman
filter (Sorenson, 1985), which approximates a non-linear transition by a linear one and
performs the Kalman filter afterwards. The method fails if the nonlinearity in the HMM is
substantial. An improved approach based on the Kalman filter is the unscented Kalman filter
(Julier and Uhlmann, 1997), which is based on a deterministic selection of sigma-points
from the support of the state distribution of interest such that the mean and the variance of
the true distribution are preserved by the sample mean and covariance calculated at these
selected sigma-points. All of these methods are deterministic and not capable of dealing
with the most general state-space models; in particular, they will fail when the dimensions
or the nonlinearities increase.
As an alternative to the deterministic approximation methods, Monte Carlo can provide a
robust and efficient solution to the optimal filtering problem.
filtering, also known as particle filters, have been shown to produce more accurate estimates
than the deterministic methods mentioned (Liu and Chen, 1998; Doucet et al., 2000b;
Durbin and Koopman, 2000; Kitagawa, 1996). Some of the good tutorials on SMC methods
for filtering as well as smoothing in HMMs are Doucet et al. (2000b); Arulampalam et al.
(2002); Cappé et al. (2007); Fearnhead (2008); Doucet and Johansen (2009) from the
earliest to the most recent. One can also see Doucet et al. (2001) as a reference book,
although a bit outdated. Also, the book Del Moral (2004) contains a rigorous review of
numerous theoretical aspects of the SMC methodology in a different framework where a
SMC method is treated as an interacting particle system associated with the mean field
interpretation of a Feynman-Kac flow.
Particle filtering for HMM
Equations (7.1) and (7.3) reveal that we can write p(x1:t |y1:t ) in terms of p(x1:t−1 |y1:t−1 ) as

p(x1:t |y1:t ) = [ f (xt |xt−1 )g(yt |xt ) / p(yt |y1:t−1 ) ] p(x1:t−1 |y1:t−1 ).         (7.24)

The normalising constant p(yt |y1:t−1 ) can be written in terms of the known densities as

p(yt |y1:t−1 ) = ∫ p(x1:t−1 |y1:t−1 )f (xt |xt−1 )g(yt |xt )dx1:t ,                       (7.25)

where by convention p(y1 |y0 ) := p(y1 ) = ∫ g(y1 |x1 )η(x1 )dx1 . The recursion in (7.24) is
essential since it enables efficient sequential approximation of the distributions p(x1:t |y1:t ),
as we will see shortly.
With reference to the Monte Carlo methodology covered in Sections 6.2 and 6.3, the
filtering problem in state space models can be considered as a sequential inference problem
for the sequence of probability distributions

πn (x1:n ) := p(x1:n |y1:n ), n ≥ 1.

As we saw in Sections 6.2 and 6.3, we can perform SIS and SISR methods targeting
{πn (x1:n )}n≥1 . The SMC proposal density at time n, denoted qn (x1:n |y1:n ), is designed
conditional on the observations up to time n and the state values up to time n − 1; in the
most general case it can be written as

qn (x1:n |y1:n ) := q(x1 |y1 ) ∏_{t=2}^{n} q(xt |x1:t−1 , y1:t )
               = qn−1 (x1:n−1 |y1:n−1 )q(xn |x1:n−1 , y1:n ).                            (7.26)

In fact, most of the time the proposal density q(xt |x1:t−1 , y1:t ) depends only on the
current observation yt and the previous state xt−1 ,

q(xn |x1:n−1 , y1:n ) = q(xn |xn−1 , yn ).

Therefore, we can write

qn (x1:n |y1:n ) = q(x1 |y1 ) ∏_{t=2}^{n} q(xt |xt−1 , yt ).                             (7.27)

If we wanted to perform SMC using the target distribution πn directly, then we would have
to calculate the following incremental weight at time n:

πn (x1:n ) / [πn−1 (x1:n−1 )q(xn |xn−1 , yn )]
   = p(x1:n |y1:n ) / [p(x1:n−1 |y1:n−1 )q(xn |xn−1 , yn )]
   = [p(x1:n , y1:n ) / (p(x1:n−1 , y1:n−1 )q(xn |xn−1 , yn ))] · [p(y1:n−1 ) / p(y1:n )]
   = [f (xn |xn−1 )g(yn |xn ) / q(xn |xn−1 , yn )] · [1 / p(yn |y1:n−1 )]
   ∝ f (xn |xn−1 )g(yn |xn ) / q(xn |xn−1 , yn ).                                        (7.28)
In most applications p(yn |y1:n−1 ) cannot be calculated, hence the ratio above is not
available. For this reason, instead of πn (x1:n ), SMC methods use the joint density of X1:n
and Y1:n as the unnormalised measure for importance sampling,

π̂n (x1:n ) = p(x1:n , y1:n ),

where the normalising constant is p(y1:n ), the likelihood of the observations up to time n.
Define the incremental importance weight

wn|n−1 (xn−1 , xn ) = f (xn |xn−1 )g(yn |xn ) / q(xn |xn−1 , yn ).

The importance weight for the whole path X1:n is then given by

wn (x1:n ) = wn−1 (x1:n−1 )wn|n−1 (xn−1 , xn ).

We present the SIS algorithm in Algorithm 7.5 and the SISR algorithm (i.e. the particle
filter) for general state-space models in Algorithm 7.6, reminding the reader that SIS is a
special case of SISR with no resampling. As in the general SISR algorithm, we can use
an optional resampling scheme, where we resample only when the estimated effective
sample size decreases below a threshold value. In the following we list some of the aspects
of the particle filter.

Filtering, prediction, and smoothing densities
Although the particle filter presented in Algorithm 7.6 targets the path filtering distri-
butions πn (x1:n ) = p(x1:n |y1:n ), it can easily be modified, or used directly, to make inference
on other distributions that might be of interest. For example, consider the one-step path
prediction distribution
π^p_n (x1:n ) = p(x1:n |y1:n−1 ).
The following relation holds between πn and π^p_n :

π^p_n (x1:n ) = πn−1 (x1:n−1 )f (xn |xn−1 ),   πn (x1:n ) = π^p_n (x1:n ) g(yn |xn ) / p(yn |y1:n−1 ).
Algorithm 7.5: SIS for HMM

1 for n = 1, 2, . . . do
2    for i = 1, . . . , N do
3       if n = 1 then
4          sample X1^(i) ∼ q(·|y1 ), calculate w1 (X1^(i) ) = η(X1^(i) )g(y1 |X1^(i) ) / q(X1^(i) |y1 ).
5       else
6          sample Xn^(i) ∼ q(·|Xn−1^(i) , yn ), set X1:n^(i) = (X1:n−1^(i) , Xn^(i) ), and calculate

              wn (X1:n^(i) ) = wn−1 (X1:n−1^(i) ) f (Xn^(i) |Xn−1^(i) )g(yn |Xn^(i) ) / q(Xn^(i) |Xn−1^(i) , yn ).

7    for i = 1, . . . , N do
8       Calculate Wn^(i) = wn (X1:n^(i) ) / Σ_{j=1}^{N} wn (X1:n^(j) ).

Algorithm 7.6: SISR (Particle filter) for HMM

1 if n = 1 then
2    for i = 1, . . . , N do
3       sample X1^(i) ∼ q1 (·)
4    for i = 1, . . . , N do
5       Calculate W1^(i) ∝ η(X1^(i) )g(y1 |X1^(i) ) / q(X1^(i) |y1 ).
6 else
7    Resample from {X1:n−1^(i) }1≤i≤N according to the weights {Wn−1^(i) }1≤i≤N to get
     resampled particles {X̃1:n−1^(i) }1≤i≤N with weight 1/N .
8    for i = 1, . . . , N do
9       Sample Xn^(i) ∼ q(·|X̃1:n−1^(i) , yn ), set X1:n^(i) = (X̃1:n−1^(i) , Xn^(i) ).
10   for i = 1, . . . , N do
11      Calculate Wn^(i) ∝ f (Xn^(i) |X̃n−1^(i) )g(yn |Xn^(i) ) / q(Xn^(i) |X̃n−1^(i) , yn ).
Therefore, it is easy to derive approximations to these distributions from each other: ob-
taining π^{p,N}_n from π^N_{n−1} requires a simple extension of the path X1:n−1 to X1:n through f ;
this is done by sampling Xn^(i) conditioned on the existing particle path X1:n−1^(i) ,
respectively for i = 1, . . . , N . Whereas obtaining π^N_n from π^{p,N}_n requires a simple
re-weighting of the particles according to g(yn |xn ).

As a second example, approximations to the marginal distributions πn (xk ) are
simply obtained from the k’th components of the particles, e.g.

π^N_n (x1:n ) = Σ_{i=1}^{N} Wn^(i) δ_{X1:n^(i)} (x1:n )   ⇒   π^N_n (xk ) = Σ_{i=1}^{N} Wn^(i) δ_{X1:n^(i)(k)} (xk ).

Note that the optimal filtering problem corresponds to the case k = n. Therefore, it may
be sufficient to have a good approximation for the marginal posterior distribution of the
current state Xn rather than the whole path X1:n . This justifies the resampling step of the
particle filter in practice, since resampling trades off accuracy for states Xk with k ≪ n
for a good approximation for the marginal posterior distribution of Xn .

Estimating the evidence
A by-product of the particle filter is that it can provide unbiased estimates for unknown
normalising constants of the target distribution. For example, when SISR is used, an
unbiased estimator of p(y1:n ) can be obtained as

p(y1:n ) ≈ p^N (y1:n ) = ∏_{t=1}^{n} (1/N) Σ_{i=1}^{N} wt|t−1 (X̃t−1^(i) , Xt^(i) ).

Choice of the importance density
The choice of the kernel q for the importance distribution in the particle filter is important
to ensure an effective SMC approximation. The first genuine particle filter in the literature,
proposed by Gordon et al. (1993), involved sampling from the transition density of X1:n ,
hence taking

q(xn |xn−1 , yn ) = f (xn |xn−1 ),

and the resulting particle filter with this particular choice of q is called the bootstrap filter.
With this choice, the particles are weighted by how well they fit the observation, i.e. by the
observation density,

wn|n−1 (xn−1 , xn ) = f (xn |xn−1 )g(yn |xn ) / f (xn |xn−1 ) = g(yn |xn ).
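Putting the pieces together, a bootstrap filter (q = f, with resampling at every step) for a scalar linear Gaussian HMM can be sketched as below; it also accumulates, in log scale, the evidence estimate of the previous subsection. The default parameter values are illustrative assumptions:

```python
import numpy as np

def bootstrap_filter(ys, N, rng, a=0.99, b=1.0, s2x=1.0, s2y=1.0, s2_0=4.0):
    """Bootstrap particle filter (Algorithm 7.6 with q = f) for a scalar
    linear Gaussian HMM. Returns filtering means and log of p^N(y_{1:n})."""
    n = len(ys)
    means = np.zeros(n)
    log_evidence = 0.0
    W = np.full(N, 1.0 / N)
    x = np.zeros(N)
    for t in range(n):
        if t == 0:
            x = rng.normal(0.0, np.sqrt(s2_0), N)               # draw from eta
        else:
            idx = rng.choice(N, size=N, p=W)                    # resample
            x = a * x[idx] + rng.normal(0.0, np.sqrt(s2x), N)   # propagate via f
        # Incremental weight is g(y_t | x_t) for the bootstrap choice q = f
        logw = -0.5 * np.log(2 * np.pi * s2y) - 0.5 * (ys[t] - b * x) ** 2 / s2y
        m = logw.max()
        log_evidence += m + np.log(np.mean(np.exp(logw - m)))   # log (1/N) sum of w
        W = np.exp(logw - m)
        W /= W.sum()
        means[t] = np.sum(W * x)                                # filtering mean
    return means, log_evidence

rng = np.random.default_rng(5)
ys = rng.normal(size=10)
means, log_ev = bootstrap_filter(ys, N=500, rng=rng)
```

For a linear Gaussian model the output can be validated against the exact Kalman filtering means and the exact log-likelihood from (7.10).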

The optimal choice, which minimises the variance of the incremental importance weights, is,
from equation (6.5),

q^opt (xn |xn−1 , yn ) = p(xn |xn−1 , yn ).

This results in the optimal incremental weight

w^opt_n|n−1 (xn−1 , xn ) = p(yn |xn−1 ),

which is independent of the value of xn . The first works where q^opt was used include
Kong et al. (1994); Liu and Chen (1995); Liu (1996).
Another interesting choice is to take q(xn |xn−1 , yn ) = q(xn |yn ), which can be useful
when observations provide significant information about the hidden state but the state
dynamics are weak. This proposal was introduced in Lin et al. (2005) and the resulting
particle filter was called independent particle filter.

Example 7.4 (Linear Gaussian HMM). This is an illustrative example that is designed
to show both SIS and SISR (particle filter) algorithms applied to sequential Bayesian in-
ference in the following linear Gaussian HMM

η(x) = φ(x; 0, σ02 ), f (x0 |x) = φ(x0 ; ax, σx2 ), g(y|x) = φ(y; bx, σy2 ).

where Xt ∈ R and Yt ∈ R and hence a, b, σ02 , σx2 , and σy2 are all scalars.
In this example, we first generated y1:n with n = 10 using a = 0.99, b = 1, σx2 = 1,
σy2 = 1 and σ02 = 4. Our task is to run, compare, and contrast the SMC algorithms,
namely SIS and SISR, for sequential approximation of

π1 (x1 ) = p(x1 |y1 ), . . . , π10 (x1:10 ) = p(x1:10 |y1:10 ).

Since the HMM here is linear and Gaussian, the problem is analytically tractable, πt ’s are
all Gaussian, and we can find those πt ’s without any need to do Monte Carlo, for example
using the Kalman filter. We use SIS and SISR merely for illustrative purposes.
We ran SIS in Algorithm 7.5 with q_1(x_1) = η(x_1) and q(x_n|x_{n−1}, y_n) = f(x_n|x_{n−1}) for
n > 1, so that w_{n|n−1}(x_{n−1}, x_n) = g(y_n|x_n), which does not depend on x_{n−1}, and hence
W_n^{(i)} ∝ g(y_n|X_n^{(i)}) for all n ≥ 1. The top row of Figure 7.2 shows the initialisation phase, both
before and after weighting the initially generated particles X_1^{(1)}, . . . , X_1^{(N)}, whose locations
are shown on the y-axis and whose weights are represented by the sizes of the balls centred around
their values. The red curve represents the incremental weight function w_1(x_1) = g(y_1|x_1)
versus x_1 (located on the y-axis). Some of the later steps of SIS are shown in Figure 7.2, from the
second row on. Starting from the second row, each row shows (i) the particles and their
weights from the previous time, (ii) the propagation and extension of the particles for the
new time step, and (iii) the update of the particle weights. Note that, due to our particular
choice of the importance density q(x_t|x_{t−1}, y_t), the incremental weights w_{t|t−1}(x_{t−1}, x_t) =
g(y_t|x_t) depend only on the current value x_t, so for this example it is actually possible
to show them as a function of x_t, which we have done with the red curve in the plots. Note
that the size of the ball around the value of a particle represents the weight of the whole
path X_{1:t}^{(i)}. Also, notice the weight degeneracy problem in the SIS algorithm, since there is
no resampling procedure. At time t = 10, we effectively have only one useful particle to
approximate π_10(x_{1:10}) = p(x_{1:10}|y_{1:10}), which is not a good sign.
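To make the setup concrete, the SIS run described above can be sketched as follows. This is a minimal illustrative sketch, not Algorithm 7.5 verbatim: the data y_{1:10} are simulated inside the script with the stated parameter values, N = 1000 is an arbitrary choice, and log-weights are used for numerical safety. The effective sample size (ESS) makes the weight degeneracy visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model parameters from the text: a = 0.99, b = 1, sigma_x^2 = 1,
# sigma_y^2 = 1, sigma_0^2 = 4 (we store standard deviations).
a, b = 0.99, 1.0
sig0, sigx, sigy = 2.0, 1.0, 1.0
n, N = 10, 1000

# Simulate hidden states and observations from the linear Gaussian HMM.
x = np.zeros(n); y = np.zeros(n)
x[0] = rng.normal(0.0, sig0)
y[0] = rng.normal(b * x[0], sigy)
for t in range(1, n):
    x[t] = rng.normal(a * x[t - 1], sigx)
    y[t] = rng.normal(b * x[t], sigy)

def log_g(yt, xt):
    """Log observation density g(y_t | x_t)."""
    return -0.5 * ((yt - b * xt) / sigy) ** 2 - 0.5 * np.log(2 * np.pi * sigy**2)

# SIS with q1 = eta and q(x_t | x_{t-1}, y_t) = f(x_t | x_{t-1}):
# the incremental weight is g(y_t | x_t) and weights are never reset.
particles = rng.normal(0.0, sig0, size=N)        # X_1^{(i)} ~ eta
logw = log_g(y[0], particles)                    # w_1 = g(y_1 | x_1)
for t in range(1, n):
    particles = rng.normal(a * particles, sigx)  # extend each path via f
    logw += log_g(y[t], particles)               # multiply in the new weight

W = np.exp(logw - logw.max())
W /= W.sum()

# Effective sample size: close to N means healthy weights,
# close to 1 means weight degeneracy.
ess = 1.0 / np.sum(W ** 2)
print(f"ESS after {n} steps: {ess:.1f} out of {N}")
```

Running the loop for longer horizons drives the ESS towards 1, which is exactly the degeneracy seen in Figure 7.2.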
CHAPTER 7. BAYESIAN INFERENCE IN HIDDEN MARKOV MODELS 102

[Figure 7.2 panels: particle locations (y-axis: sample values) versus time step, shown before and after weighting, at t = 1, 2, 4, 5, 9, 10.]

Figure 7.2: SIS: propagation and weighting of particles for several time steps. Each row
shows (i) the particles and their weights from the previous time, (ii) the propagation and
extension of the particles for the new time step, and (iii) the update of the particle weights.
Notice the weight degeneracy problem.

Next, we ran the SISR algorithm, i.e. the particle filter, in Algorithm 7.6 on the same problem.
The initialisation is exactly the same as in SIS; see the top row of Figure 7.3. The remaining
plots in Figure 7.3 show some later steps of SISR. Starting from the second row, each
row shows (i) the particles and their weights from the previous time, (ii) the resampling,
propagation, and extension of the particles for the new time step, and (iii) the update of
the particle weights. Notice how the weight degeneracy problem is alleviated by resampling:
the distinct particles have weights that are much closer to each other than in SIS. However, several
resampling steps in succession lead to the path degeneracy problem: π_10^N(x_{1:10}) has only one
support point for x_1.
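The resample–propagate–reweight cycle of SISR can be sketched in the same setting. Again this is a hedged illustration, not Algorithm 7.6 verbatim: the data are self-generated, N = 1000 is an arbitrary choice, and multinomial resampling is performed at every step. Counting the distinct x_1 values among the surviving paths exposes the path degeneracy noted above.

```python
import numpy as np

rng = np.random.default_rng(1)

a, b = 0.99, 1.0
sig0, sigx, sigy = 2.0, 1.0, 1.0
n, N = 10, 1000

# Simulate data from the same linear Gaussian HMM.
x = np.zeros(n); y = np.zeros(n)
x[0] = rng.normal(0.0, sig0)
y[0] = rng.normal(b * x[0], sigy)
for t in range(1, n):
    x[t] = rng.normal(a * x[t - 1], sigx)
    y[t] = rng.normal(b * x[t], sigy)

def weights(yt, parts):
    """Normalised weights proportional to g(y_t | x_t)."""
    logw = -0.5 * ((yt - b * parts) / sigy) ** 2
    w = np.exp(logw - logw.max())
    return w / w.sum()

# SISR (bootstrap particle filter): each row of `paths` is one path X_{1:t}^{(i)}.
paths = rng.normal(0.0, sig0, size=(N, 1))
W = weights(y[0], paths[:, -1])
for t in range(1, n):
    idx = rng.choice(N, size=N, p=W)             # multinomial resampling
    paths = paths[idx]                           # whole paths are duplicated/killed
    new = rng.normal(a * paths[:, -1], sigx)     # propagate via f
    paths = np.column_stack([paths, new])
    W = weights(y[t], paths[:, -1])              # reweight with g(y_t | x_t)

ess = 1.0 / np.sum(W ** 2)
unique_x1 = np.unique(paths[:, 0]).size
print(f"ESS at t={n}: {ess:.1f};  distinct x_1 values among paths: {unique_x1}")
```

With repeated resampling the current-time weights stay healthy (large ESS), while the number of distinct ancestors at time 1 collapses, which is the path degeneracy trade-off.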

Example 7.5 (Tracking a moving target). We consider the HMM for a moving target
in Example 7.3. Our objective is to estimate the position of the target at times t = 1, 2, . . .
given the observations up to time t. With the particle filter, we can approximate πt (x1:t ) =
p(x1:t |y1:t ) as
π_t^N(x_{1:t}) = ∑_{i=1}^N W_t^{(i)} δ_{X_{1:t}^{(i)}}(x_{1:t})

As discussed already, this approximation can be used to approximate the filtering distribution:

p(x_t|y_{1:t}) ≈ ∑_{i=1}^N W_t^{(i)} δ_{X_t^{(i)}}(x_t)
We can use the position components of X_t^{(i)} = (V_t^{(i)}, P_t^{(i)}) to estimate the current
position from the observations:

E[P_t(j)|Y_{1:t} = y_{1:t}] ≈ P̂_t^N(j) = ∑_{i=1}^N W_t^{(i)} P_t^{(i)}(j),   j = 1, 2.

Figure 7.4 illustrates the target tracking scenario described above for 500 time steps. In the
top left plot we see the position (in red) and its filtering estimate P̂_t^N (in black), given the
sensor measurements on the right plot up to the current time t, with N = 1000 particles.
The lower plots show the performance of the particle filter in each direction separately.

7.3.5 Extensions to HMMs


Although HMMs are the most common class of time series models in the literature, there
are also many time series models which are not HMMs but are still of great importance.
These models differ from HMMs mostly in that they do not possess the conditional
independence of the observations. Here, we give two examples.

• In the first example of such models, the process {Xn }n≥1 is still a Markov chain;
however the conditional distribution of Yn , given all past variables X1:n and Y1:n−1 ,
depends not only on the value of Xn but also on the values of past observations

[Figure 7.3 panels: particle locations (y-axis: sample values) versus time step, shown before and after resampling and weighting, at t = 1, 2, 5, 6, 9, 10.]

Figure 7.3: SISR: resampling, propagation and weighting of particles for several time
steps. Each row shows (i) the particles and their weights from the previous time, (ii) the
resampling, propagation, and extension of the particles for the new time step, and (iii)
the update of the particle weights. Notice the path degeneracy problem: π_10^N(x_{1:10}) has only
one support point for x_1.

[Figure 7.4 panels: true position and estimated position, P(1, t) and P(2, t), versus time over 500 steps, one panel per dimension.]

Figure 7.4: Particle filter for target tracking

i.e. Y_{1:n−1}. If we denote the probability density of this conditional distribution by
g_n(y_n|x_n, y_{1:n−1}), the joint probability density of (X_{1:n}, Y_{1:n}) is

p(x_{1:n}, y_{1:n}) = η(x_1) g(y_1|x_1) ∏_{t=2}^n f(x_t|x_{t−1}) g_t(y_t|x_t, y_{1:t−1}).

If Y_n given X_n is independent of the past values of the observations prior to time
n − k, then we can define g_n(y_n|x_n, y_{1:n−1}) = g(y_n|x_n, y_{n−k:n−1}) for all n.
These models have much in common with basic HMMs in the sense that virtually
identical computational tools may be used for both models. In the particular context
of SMC, the similarity between these two types of models is more clearly exposed
in Del Moral (2004) via the Feynman-Kac representation of SMC methods, where
the conditional density of observation at time n is treated generally as a potential
function of xn .

• In another type of time series model that is not an HMM, the latent process {X_n}_{n≥1}
is, again, a Markov chain; however, the observation at the current time depends on all the
past values, i.e. Y_n conditional on (X_{1:n}, Y_{1:n−1}) depends on all of these conditioned
random variables. Actually, these models are usually the result of marginalising an
extended HMM. Consider the HMM {(X_n, Z_n), Y_n}_{n≥1}, where the latent joint process
{X_n, Z_n}_{n≥1} is a Markov chain such that its transition density can be factorised as

f (xn , zn |xn−1 , zn−1 ) = f1 (xn |xn−1 )f2 (zn |xn , zn−1 ).

and the observation Y_n depends only on X_n and Z_n given all the past random variables
and admits the probability density g(y_n|x_n, z_n). Now, the reduced bivariate process
{X_n, Y_n}_{n≥1} is not an HMM, and we express the joint density of (X_{1:n}, Y_{1:n}) as

p(x_{1:n}, y_{1:n}) = η(x_1) p_1(y_1|x_1) ∏_{t=2}^n f_1(x_t|x_{t−1}) p_t(y_t|x_{1:t}, y_{1:t−1})

where the density p_t(y_t|x_{1:t}, y_{1:t−1}) is given by

p_t(y_t|x_{1:t}, y_{1:t−1}) = ∫ p(z_{1:t−1}|x_{1:t−1}, y_{1:t−1}) f_2(z_t|x_t, z_{t−1}) g(y_t|x_t, z_t) dz_{1:t}.   (7.29)

The reason {X_n, Y_n}_{n≥1} might be of interest is that the conditional laws of Z_{1:n} may
be available in closed form, so that exact evaluation of the integral in (7.29) is possible.
In that case, it can be more effective to perform Monte Carlo approximation for the
law of X_{1:n} given the observations Y_{1:n}, which leads to the so-called Rao-Blackwellised
particle filters in the literature (Doucet et al., 2000a).
The integration is indeed available in closed form for some time series models. One
example is the class of linear Gaussian switching state-space models (Chen and Liu, 2000;
Doucet et al., 2000a; Fearnhead and Clifford, 2003), where X_n takes values in a finite
set whose elements are often called ‘labels’, and, conditioned on {X_n}_{n≥1}, {Z_n, Y_n}_{n≥1}
is a linear Gaussian state-space model.
We note that the computational tools developed for HMMs are generally applicable to
a more general class of time series models with some suitable modifications.

7.3.6 The Rao-Blackwellised particle filter


Assume we are given an HMM {(X_n, Z_n), Y_n}_{n≥1} where this time the hidden state at time
n is composed of two components X_n and Z_n. Suppose that the initial and transition
distributions of the Markov chain {X_n, Z_n}_{n≥1} have densities η and f and that they can be
factorised as follows:

η(x_1, z_1) = η_1(x_1) η_2(z_1|x_1),   f(x_n, z_n|x_{n−1}, z_{n−1}) = f_1(x_n|x_{n−1}) f_2(z_n|x_n, z_{n−1}).

Also, conditioned on (x_n, z_n), the distribution of the observation Y_n admits a density g(y_n|x_n, z_n)
with respect to ν. We are interested in the case where the posterior distribution

π_n(x_{1:n}, z_{1:n}) = p(x_{1:n}, z_{1:n}|y_{1:n})



is analytically intractable and we wish to approximate the expectations

πn (ϕn ) = E [ϕn (X1:n , Z1:n )|Y1:n = y1:n ]

for functions ϕ_n : X^n × Z^n → R. Obviously, one way to do this is to run an SMC filter for
{π_n}_{n≥1}, which obtains the approximation π_n^N at time n as

π_n^N(x_{1:n}, z_{1:n}) = ∑_{i=1}^N W_n^{(i)} δ_{(X_{1:n}^{(i)}, Z_{1:n}^{(i)})}(x_{1:n}, z_{1:n}),   ∑_{i=1}^N W_n^{(i)} = 1.

However, if the conditional posterior probability distribution

π2,n (z1:n |x1:n ) = p(z1:n |x1:n , y1:n )

is analytically tractable, there is a better SMC scheme for approximating πn and estimating
π_n(ϕ_n). This SMC scheme is called the Rao-Blackwellised particle filter (RBPF) (Doucet
et al., 2000a). Consider the following decomposition, which follows from the chain rule:

p(x1:n , z1:n |y1:n ) = p(x1:n |y1:n )p(z1:n |x1:n , y1:n )

and define the marginal posterior distribution of X1:n conditioned on y1:n as

π1,n (x1:n ) = p1 (x1:n |y1:n ).

The RBPF is a particle filter for the sequence of marginal distributions {π_{1,n}}_{n≥1} which
produces at time n the approximation

π_{1,n}^N(x_{1:n}) = ∑_{i=1}^N W_{1,n}^{(i)} δ_{X_{1:n}^{(i)}}(x_{1:n}),   ∑_{i=1}^N W_{1,n}^{(i)} = 1,

and the Rao-Blackwellised approximation of the full posterior distribution involves the
particle filter estimate π_{1,n}^N and the exact distribution π_{2,n}:

π_n^{RB,N}(x_{1:n}, z_{1:n}) = π_{1,n}^N(x_{1:n}) π_{2,n}(z_{1:n}|x_{1:n}).

Then, the estimator of the RBPF for π_n(ϕ_n) becomes

π_n^{RB,N}(ϕ_n) = E_{π_{1,n}^N} [ E_{π_{2,n}(·|X_{1:n})} [ϕ_n(X_{1:n}, Z_{1:n})] ]
              = ∑_{i=1}^N W_{1,n}^{(i)} ∫ π_{2,n}(z_{1:n}|X_{1:n}^{(i)}) ϕ_n(X_{1:n}^{(i)}, z_{1:n}) dz_{1:n}.   (7.30)

Assuming q(x_{1:n}|y_{1:n}) = q(x_{1:n−1}|y_{1:n−1}) q(x_n|x_{1:n−1}, y_{1:n}) is used as the proposal distribution, the incremental importance weight for the RBPF is given by

w_{1,n|n−1}(x_{1:n}) = f_1(x_n|x_{n−1}) p_n(y_n|x_{1:n}, y_{1:n−1}) / q(x_n|x_{1:n−1}, y_{1:n}),

where the density p_n(y_n|x_{1:n}, y_{1:n−1}) is given by

p_n(y_n|x_{1:n}, y_{1:n−1}) = ∫ p(z_{1:n−1}|x_{1:n−1}, y_{1:n−1}) f_2(z_n|x_n, z_{n−1}) g(y_n|x_n, z_n) dz_{1:n}.

Also, the optimal importance density, which minimises the variance of w_{1,n|n−1}, is obtained
when the incremental importance density q(x_n|x_{1:n−1}, y_{1:n}) is taken to be p(x_n|x_{1:n−1}, y_{1:n}),
which results in w_{1,n|n−1}(x_{1:n}) being equal to p_n(y_n|x_{1:n−1}, y_{1:n−1}).
The use of the RBPF whenever possible is intuitively justified by the fact that we
substitute the particle approximations of some expectations with their exact values. Indeed,
the theoretical analyses in Doucet et al. (2000a) and Chopin (2004, Proposition 3) revealed
that the RBPF has better precision than the regular particle filter: the estimates of the
RBPF never have larger variances. The favourable results for the RBPF are essentially due to
the Rao-Blackwell theorem (see e.g. Blackwell (1947)), after which the proposed particle
filter gets its name.
The RBPF was formulated by Doucet et al. (2000a) and has been implemented in
various settings by Chen and Liu (2000); Andrieu and Doucet (2002); Särkkä et al. (2004),
among many others.
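As an illustration, here is a minimal RBPF sketch for a hypothetical switching linear Gaussian model. The model, all parameter values, and the bootstrap proposal for the labels are assumptions made for this example: X_n is a two-state Markov chain, and given the labels, {Z_n, Y_n} is linear Gaussian, so each particle carries exact Kalman statistics (m, C) for Z and the weight is the predictive likelihood p_n(y_n | x_{1:n}, y_{1:n−1}).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical switching linear Gaussian model (illustration only):
# X_n in {0,1} Markov; Z_n = phi[X_n] Z_{n-1} + N(0, qz); Y_n = Z_n + N(0, ry).
P = np.array([[0.95, 0.05], [0.10, 0.90]])  # label transition matrix
phi = np.array([0.5, 0.99])
qz, ry = 0.5, 1.0
n, N = 100, 500

# Simulate data from the model.
xs, zs, ys = np.zeros(n, int), np.zeros(n), np.zeros(n)
for t in range(1, n):
    xs[t] = rng.choice(2, p=P[xs[t - 1]])
    zs[t] = phi[xs[t]] * zs[t - 1] + rng.normal(0, np.sqrt(qz))
    ys[t] = zs[t] + rng.normal(0, np.sqrt(ry))

# RBPF: particles only for the labels; (m, C) are per-particle Kalman
# mean/variance of Z_n given the particle's label path and y_{1:n}.
xp = np.zeros(N, int)
m, C = np.zeros(N), np.ones(N)
for t in range(1, n):
    # propagate labels with the prior f1 (bootstrap proposal)
    u = rng.random(N)
    xp = np.where(u < P[xp, 0], 0, 1)
    # Kalman prediction for each particle
    mp = phi[xp] * m
    Cp = phi[xp] ** 2 * C + qz
    # weight = predictive likelihood p_n(y_n | x_{1:n}, y_{1:n-1})
    S = Cp + ry
    logw = -0.5 * (ys[t] - mp) ** 2 / S - 0.5 * np.log(2 * np.pi * S)
    W = np.exp(logw - logw.max()); W /= W.sum()
    # exact Kalman update of the conditional law of Z_n
    K = Cp / S
    m = mp + K * (ys[t] - mp)
    C = (1 - K) * Cp
    # multinomial resampling of (label, Kalman stats) triplets
    idx = rng.choice(N, size=N, p=W)
    xp, m, C = xp[idx], m[idx], C[idx]

# Rao-Blackwellised filtering estimate of Z_n: average of exact means.
print("estimated z_n:", m.mean(), " true z_n:", zs[-1])
```

The Rao-Blackwellisation is visible in the last line: the estimate averages exact conditional means rather than sampled z-values, which is where the variance reduction comes from.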

Exercises
1. Write your own log-sum-exp function that takes an array of numbers a = [a_1, . . . , a_m]
and returns

log(e^{a_1} + . . . + e^{a_m})

in a numerically safe way. Try your code with

a = [100, 500, 1000],   a = [−1200, −1500, −1200],   a = [−1000, 1000]

Compare each answer with the naive solution, which we would have if we typed
log(sum(exp(a))) directly.
[Hint: log(e^a + e^b) = log(e^{a−max{a,b}} + e^{b−max{a,b}}) + max{a, b}.]
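One possible implementation of the max-subtraction trick from the hint (a sketch, not the only valid solution):

```python
import math

def log_sum_exp(a):
    """Compute log(exp(a_1) + ... + exp(a_m)) safely by factoring out
    max(a): log sum exp(a) = max(a) + log sum exp(a - max(a))."""
    m = max(a)
    if m == float("-inf"):  # all terms are exp(-inf) = 0
        return float("-inf")
    return m + math.log(sum(math.exp(ai - m) for ai in a))

print(log_sum_exp([100, 500, 1000]))       # ~1000.0 (naive version overflows)
print(log_sum_exp([-1200, -1500, -1200]))  # ~-1199.31 (naive version underflows to log(0))
print(log_sum_exp([-1000, 1000]))          # ~1000.0
```

After subtracting the maximum, every exponentiated term lies in (0, 1], so neither overflow nor total underflow can occur.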

2. Consider the linear Gaussian HMM in Example 7.4.

• Generate hidden states x1:n and observations y1:n for n = 2000, a = 0.99, b = 1,
σ02 = 5, σx2 = 1, σy2 = 4.
• You already have the code for this model that performs the particle filter i.e.
the SISR algorithm with the importance (proposal) density being equal to the
transition density, more explicitly

q1 (x1 ) = η(x1 ), qn (x1:n ) = qn−1 (x1:n−1 )f (xn |xn−1 )

Run the SISR algorithm for y_{1:n} with N = 100 and this choice of importance density,
and at each time step estimate the posterior mean E[X_t|Y_{1:t} = y_{1:t}] from
the particles:

X̂_t^N = ∑_{i=1}^N W_t^{(i)} X_t^{(i)}.

Calculate the mean squared error (MSE) for X̂_t^N, that is, (1/n) ∑_{t=1}^n (X̂_t^N − x_t)².
• Now, set N = 1000 and calculate the MSE again. Compare it with the previous
one.
• This time take N = 100 and run the SISR algorithm with the optimum choice
for the importance density which is obtained by

q(x1 ) = p(x1 |y1 ), q(xn |xn−1 , yn ) = p(xn |xn−1 , yn ),

Show that this leads to the incremental weight wn|n−1 (xn−1 , xn ) = p(yn |xn−1 ).
(Since this is a linear Gaussian model, both p(xn |xn−1 , yn ) and p(yn |xn−1 ) are
available to sample from and calculate, respectively.) Calculate the MSE for X̂t
and compare it with the previous ones that you found.

3. Consider the target tracking problem in Example 7.5. It may not be a good modelling
practice to take the noise in the measurements additive Gaussian for two reasons: (i)
One may expect to have a bigger error when a longer distance is measured. (ii) When
the noise is Gaussian, the noisy measurement is allowed to be negative, which does
not make sense. That is why, instead of the existing observation model, consider the
following one with multiplicative nonnegative noise:
Y_{t,i} = R_{t,i} E_{t,i},   E_{t,i} ∼ ln N(0, σ_y²) i.i.d.,   i = 1, 2, 3.   (7.31)

where ln N(µ, σ²) denotes the lognormal distribution with location and scale parameters
µ and σ². It can be shown that if E_{t,i} ∼ ln N(µ, σ²), then log E_{t,i} ∼ N(µ, σ²), so we
effectively have

log Y_{t,i} ∼ N(log R_{t,i}, σ_y²).

• Generate data according to the new model for n = 500 using σ_y² = 0.1, a = 0.99,
σ_p² = 0.001, σ_v² = 0.01, σ_{bv}² = 0.01, σ_{bp}² = 4, and the sensor locations being the
same as in the previous examples.
• Write down the new observation density g(yt |xt ) according to (7.31).
• Run a particle filter for the data you have generated with N = 1000 particles,
using q(x1 |y1 ) = η(x1 ) and q(xt |xt−1 , yt ) = f (xt |xt−1 ). Calculate the posterior
mean estimates for the position versus t, E[Pt (i)|Y1:t = y1:t ], i = 1, 2. Generate
results similar to the ones in Example 7.5.
• Remove one of the sensors and repeat your experiments. Comment on the
results.
Appendix A

Some Basics of Probability


Summary: This chapter provides some basics of probability related to the content
of this course. The covered concepts are probability, random variables, the cumulative distribution function, discrete and continuous distributions, the probability mass function, the probability
density function, expectation, independence, correlation and covariance, Bayes’ theorem,
and the posterior distribution.

A.1 Axioms and properties of probability


Let Ω be the sample space and F be the event space. (In a non-rigorous way, you can
think of F as the set of all subsets of Ω as an example.) A probability measure on (Ω, F)
is a function P : F → R that satisfies the following axioms of probability.
(A1) The probability of an event is a non-negative real number:

P(E) ∈ R, P(E) ≥ 0, ∀E ∈ F.

(A2) Unitarity: The probability that at least one of the elementary events in the entire
sample space will occur is 1:

P(Ω) = 1.

(A3) σ-additivity: A countable sequence of disjoint sets (or mutually exclusive sets) E_1, E_2, . . .
(E_i ∩ E_j = ∅ for all i ≠ j) satisfies

P(⋃_{i=1}^∞ E_i) = ∑_{i=1}^∞ P(E_i).

Any function that satisfies those three axioms can be a probability measure. These axioms
lead to some useful properties of probability that we are familiar with.
(P1) The probability of the empty set:

P(∅) = 0.

(P2) Monotonicity:
P(A) ≤ P(B), ∀A, B ∈ F : A ⊆ B.


(P3) The numeric bound:


0 ≤ P(E) ≤ 1, ∀E ∈ F.

(P4) Union of two sets:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B), ∀A, B ∈ F.

(P5) Complement of a set:

P(A^c) = 1 − P(A),   ∀A ∈ F.

A.2 Random variables


Suppose we are given the triple (Ω, F, P). A real-valued random variable is a function

X : Ω → R

such that {ω ∈ Ω : X(ω) ≤ x} ∈ F for all x ∈ R. We need this condition since we need
the probability of this set in order to construct our cumulative distribution function.
the probability of this set in order to construct our cumulative distribution function.

Cumulative distribution function The probability distribution of X is mainly characterised by its cumulative distribution function (cdf), denoted F, which is defined as

F (x) := P(X ≤ x) = P({ω ∈ Ω : X(ω) ≤ x}), x ∈ R.

There are three points to note here:

• The probability distribution of X is induced by P: there is always an implicit reference to (Ω, F, P) when one calculates P(X ≤ x), but we tend to forget it once we
have our cumulative distribution function F for X. This is because once we know F,
we know everything about the probability distribution of X and usually we do not
need to go back to the lower level and work with (Ω, F, P) in practice. However, it
may be useful to know what a random variable is in general.

• The use of ≤ (and not <) is important. Especially for discrete random variables,
this matters a lot.

• Note that X, written in capital letter, represents the randomness in the probability
statement while x is a given certain value in R.

By definition, F has the following properties:

(P1) F is a non-decreasing function: For any a, b ∈ R, if a < b, then F (a) ≤ F (b).

(P2) F is right continuous (no jumps occur when the limit point is approached from the
right).

(P3) limx→−∞ F (x) = 0.


(P4) limx→∞ F (x) = 1.
Any function that satisfies those four properties can be a cdf. Therefore, the definition
and the properties have an if and only if relation.
All the probability questions about X can be answered in terms of F . Examples:
• P(X ∈ (a, b]) = P(X ≤ b) − P(X ≤ a) = F (b) − F (a)
• P(X = a) = F (a) − limx→a− F (x). (the second term is a limit from the left)
• P(X ∈ [a, b]) = P(X ∈ (a, b]) + P(X = a) = F (b) − F (a) + [F (a) − limx→a− F (x)]
• P(X ∈ (a, b)) = P(X ∈ (a, b]) − P(X = b) = F (b) − F (a) − [F (b) − limx→b− F (x)]
Depending on the nature of the set of values X takes, it can be called a discrete or a
continuous random variable (sometimes neither of them!).

A.2.1 Discrete random variables


If X takes a finite or countably infinite number of possible values in R, then X is called a
discrete random variable. The possible values of X may be listed as x_1, x_2, . . ., where the
sequence terminates in the finite case but continues indefinitely in the countably infinite
case.
Let p(x_i) := P(X = x_i), i = 1, 2, . . . The function p(·) is called the probability mass
function (pmf) of X and has the following properties: p(x_i) ≥ 0, i = 1, 2, . . . and
∑_i p(x_i) = 1.
It can be shown that, for any x ∈ R,

F(x) = ∑_{i: x_i ≤ x} p(x_i).

Hence, the cdf F of X is a step function where jumps occur at points xi with jump height
being p(xi ) = P(X = xi ) = F (xi ) − F (xi−1 ).
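As a small illustration, the step-function cdf can be evaluated directly from the pmf by summing over support points up to x. The pmf below is a made-up example, not from the text:

```python
def step_cdf(support, pmf, x):
    """F(x) = sum of p(x_i) over support points x_i <= x (a step function)."""
    return sum(p for xi, p in zip(support, pmf) if xi <= x)

# Hypothetical pmf on {1, 2, 3}:
support, pmf = [1, 2, 3], [0.2, 0.5, 0.3]
print(step_cdf(support, pmf, 0.5))  # 0.0: left of all the mass
print(step_cdf(support, pmf, 2))    # ≈0.7: the jump at x = 2 is included (≤ matters)
print(step_cdf(support, pmf, 10))   # ≈1.0
```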

Some discrete distributions: Some well known distributions with a pmf (hence the cdf
is a step function): Bernoulli B(ρ), Geometric distribution Geo(ρ), Binomial distribution
Binom(n, ρ) Negative binomial NB(r, ρ), Poisson distribution PO(λ).

A.2.2 Continuous random variables


If X takes values on a continuous subset RX of R (such as R itself, an interval [a, b] or
union of such intervals), then X is said to be a continuous random variable. Furthermore,
if F for X is continuous (i.e. no jumps), we have
P(X ∈ (a, b)) = P(X ∈ (a, b]) = P(X ∈ [a, b)) = P(X ∈ [a, b]) = F (b) − F (a).

Also, if F is right differentiable, we can define the probability density function (pdf) for X:

p(x) := lim_{h→0⁺} [F(x + h) − F(x)] / h = ∂₊F(x)/∂x,   x ∈ R.

Since F is monotonic, we have p(x) ≥ 0 for all x ∈ R. Also, p integrates to 1, i.e.
∫_{−∞}^{∞} p(x)dx = ∫_{R_X} p(x)dx = 1. All probability statements for X can be calculated using
p, such as

P(X ∈ [a, b]) = F(b) − F(a) = ∫_a^b p(x)dx,
P(X ≤ a) = F(a) = ∫_{−∞}^a p(x)dx.

From the above, we can conclude that P(X = x) = 0 for any x ∈ R, because

∫_x^x p(u)du = F(x) − F(x) = 0.

A.2.2.1 Some continuous distributions


The following are some well known distributions with a continuous cdf (hence admitting a
pdf): Uniform distribution Unif(a, b), exponential distribution Exp(µ), gamma distribution
Γ(α, β), inverse gamma distribution IG(α, k), normal (Gaussian) distribution N (µ, σ 2 ),
Beta distribution Beta(α, β).

A.2.3 Moments, expectation and variance


If X is a random variable, the n’th moment of X, n ≥ 1, denoted by E(X^n), is defined for
discrete and continuous random variables as follows:

E(X^n) := ∑_i x_i^n p(x_i), if X is discrete;   ∫_{−∞}^{∞} x^n p(x)dx, if X is continuous.   (A.1)

The first moment (n = 1) is called the expectation of X, also sometimes referred to as the
mean of X.
If |E(X)| < ∞, the n’th central moment of X, n ≥ 1, is defined for discrete and
continuous random variables as follows:

E([X − E(X)]^n) := ∑_i [x_i − E(X)]^n p(x_i), if X is discrete;   ∫_{−∞}^{∞} [x − E(X)]^n p(x)dx, if X is continuous.   (A.2)

The second central moment is the most notable of them and is called the variance of X
and denoted by V (X):
V(X) := E([X − E(X)]2 ).

A useful identity relating V(X) to the expectation and the second moment of X is

V(X) = E(X 2 ) − E(X)2 .

Finally, the standard deviation of X is

σ_X := √V(X).

A.2.4 More than one random variable


Suppose we have two real-valued random variables, X : Ω → R and Y : Ω → R, both
defined on the same probability space (Ω, F, P).¹ The joint distribution of X and Y is
characterised by the joint cdf FX,Y which is defined as

FX,Y (x, y) := P(X ≤ x, Y ≤ y) = P({ω ∈ Ω : X(ω) ≤ x, Y (ω) ≤ y}).

The marginal cdf’s for X and Y can be deduced from FX,Y (x, y):

FX (x) = lim FX,Y (x, y), FY (y) = lim FX,Y (x, y).
y→∞ x→∞

Discrete variables: For discrete X and Y taking values xi , i = 1, 2, . . . and yj , j =


1, 2, . . ., we can define a joint pmf pX,Y for X and Y such that

pX,Y (xi , yj ) := P(X = xi , Y = yj )

so that for any x, y ∈ R, we have

F_{X,Y}(x, y) = ∑_{i,j: x_i ≤ x, y_j ≤ y} p_{X,Y}(x_i, y_j).

Expectation of any function g of X, Y can be evaluated using the joint pmf, for example

E(g(X, Y)) = ∑_{i,j} p_{X,Y}(x_i, y_j) g(x_i, y_j).

The marginal pmf’s for X and Y are given as follows:

p_X(x_i) = ∑_j p_{X,Y}(x_i, y_j),   p_Y(y_j) = ∑_i p_{X,Y}(x_i, y_j).

¹(X, Y) together can be called a bivariate random variable. A generalisation of this is a multivariate
random variable of dimension m, such as (X_1, X_2, . . . , X_m).

Continuous variables: Similar to the joint pmf defined for discrete X and Y, one can
define the joint pdf for continuous X and Y, assuming F is right-differentiable,

p_{X,Y}(x, y) := ∂²₊F(x, y) / ∂x∂y,

so that for any a, b, we have

F_{X,Y}(a, b) = ∫_{−∞}^b ∫_{−∞}^a p_{X,Y}(x, y) dx dy.

Expectation of any function g of X, Y can be evaluated using the joint pdf,

E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} p_{X,Y}(x, y) g(x, y) dx dy.

The marginal pdf’s for X and Y can be obtained as

p_X(x) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dy,   p_Y(y) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dx.

Independence: We say random variables X and Y are independent if for all pairs of
sets A ⊆ R, B ⊆ R we have
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
If X and Y are discrete variables taking values x_i, i = 1, 2, . . . and y_j, j = 1, 2, . . ., then independence between X and Y can be expressed as

p_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j) = p_X(x_i) p_Y(y_j),   ∀i, j.
If X and Y are continuous variables, then independence between X and Y can be expressed
as
pX,Y (x, y) = pX (x)pY (y), ∀x, y ∈ R.

Covariance and Correlation: The covariance between two random variables X and Y,
Cov(X, Y), is given as

Cov(X, Y) := E([X − E(X)][Y − E(Y)]) = E(XY) − E(X)E(Y).
A normalised version of covariance is the correlation ρ(X, Y). Provided that V(X) > 0 and
V(Y) > 0,

ρ(X, Y) := Cov(X, Y) / (σ_X σ_Y).

When one of V(X) and V(Y) is 0, we set ρ(X, Y) = 1 if X = Y and ρ(X, Y) = 0 if X ≠ Y.
One can show that
−1 ≤ ρ(X, Y ) ≤ 1.
The absolute value of ρ(X, Y) indicates the level of correlation. We say two random variables
X, Y are uncorrelated if Cov(X, Y ) = 0 (hence ρ(X, Y ) = 0).
Note: Independence implies uncorrelatedness, but the reverse is not always true.

A.3 Conditional probability and Bayes’ rule


Consider the probability space (Ω, F, P) again. Given two sets A, B ∈ F with P(B) > 0, the conditional
probability of A given B is denoted by P(A|B) and is defined as

P(A|B) = P(A ∩ B) / P(B).

Bayes’ rule is derived from this definition, and it relates the two conditional probabilities P(A|B) and P(B|A):

P(A|B) = P(A) P(B|A) / P(B).   (A.3)
This relation can be written in terms of two random variables. Suppose X, Y are discrete
random variables with joint pmf p_{X,Y}(x_i, y_j), where x ∈ X = {x_1, x_2, . . .} and y ∈ Y =
{y_1, y_2, . . .}, so that the marginal pmf’s are

p_X(x) = ∑_{y∈Y} p_{X,Y}(x, y),   p_Y(y) = ∑_{x∈X} p_{X,Y}(x, y),   x ∈ X, y ∈ Y.

Then the conditional pmf’s p_{X|Y}(x|y) and p_{Y|X}(y|x) are defined as

p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y),   p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x),   (A.4)

and Bayes’ rule relating them together is

p_{X|Y}(x|y) = p_X(x) p_{Y|X}(y|x) / p_Y(y).   (A.5)

When X, Y are continuous random variables taking values from X and Y, respectively,
with a joint pdf p_{X,Y}(x, y), similar definitions follow. The marginal pdf’s are

p_X(x) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dy,   p_Y(y) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dx.

The conditional pdf’s are defined exactly the same way as in (A.4) and (A.5).
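The discrete form of Bayes’ rule in (A.5) is easy to compute directly; a small sketch with a hypothetical prior and likelihood table (the numbers are invented for illustration):

```python
def bayes_posterior(prior, likelihood, y):
    """p_{X|Y}(x|y) = p_X(x) p_{Y|X}(y|x) / p_Y(y), where
    p_Y(y) = sum over x of p_X(x) p_{Y|X}(y|x)."""
    joint = {x: prior[x] * likelihood[x][y] for x in prior}
    p_y = sum(joint.values())  # marginal likelihood (the normaliser)
    return {x: joint[x] / p_y for x in prior}

# Hypothetical two-state example:
prior = {"x1": 0.6, "x2": 0.4}
likelihood = {"x1": {"y1": 0.9, "y2": 0.1},
              "x2": {"y1": 0.5, "y2": 0.5}}
post = bayes_posterior(prior, likelihood, "y1")
print(post)  # x1: 0.54/0.74 ≈ 0.730,  x2: 0.20/0.74 ≈ 0.270
```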
Appendix B

Solutions

B.1 Exercises in Chapter 2


1. We have g : (0, 1) → (a, b) with x = g(u) := (b − a)u + a, hence g^{−1} : (a, b) → (0, 1)
with

u = g^{−1}(x) = (x − a)/(b − a),   x ∈ (a, b).

The Jacobian is J(x) = ∂g^{−1}(x)/∂x = 1/(b − a) for all x. We apply the formula in (2.5) for
transformation from U to X to obtain

p_X(x) = p_U(g^{−1}(x)) |J(x)| = p_U((x − a)/(b − a)) · 1/(b − a) = 1/(b − a)

for x ∈ (a, b). So, we conclude that X ∼ Unif(a, b).
2. Probably the best method for generating from PO(λ) is the method of inversion.
The cdf at integer values is F(k) = e^{−λ} ∑_{i=0}^k λ^i/i!, so we can generate U ∼ Unif(0, 1),
and the smallest k such that F(k) > U is distributed as PO(λ).
One alternative is based on the Poisson process: when the interarrival times
of a process are i.i.d. and distributed as Exp(1), the number of arrivals in an
interval of length λ is PO(λ). From this, we can produce E_i ∼ Exp(1) i.i.d. and N := max{n :
∑_{i=1}^n E_i ≤ λ} ∼ PO(λ) (keep generating E_i’s until the sum exceeds λ). Equivalently,
N = max{n : ∏_{i=1}^n U_i ≥ e^{−λ}} with U_i ∼ Unif(0, 1) (why?). If λ is large, this requires
about λ uniform variables to generate one point.
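The uniform-product variant can be sketched directly (λ = 3 is an arbitrary choice for the demonstration):

```python
import math
import random

def sample_poisson(lam, rng=random):
    """Draw N ~ PO(lam) by multiplying uniforms until the product
    drops below e^{-lam}; N is the number of factors that kept the
    product at or above the threshold."""
    threshold = math.exp(-lam)
    n, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < threshold:
            return n
        n += 1

random.seed(0)
draws = [sample_poisson(3.0) for _ in range(100_000)]
print(sum(draws) / len(draws))  # sample mean, close to lam = 3
```

Note how each draw consumes roughly λ + 1 uniforms, matching the remark above about large λ.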
 
3. The pdf of the Laplace distribution Laplace(a, b) is p_X(x) = (1/(2b)) exp(−|x − a|/b). Notice
that Y = X − a is centred with

p_Y(y) = (1/(2b)) exp(−|y|/b).
This pdf is a two-sided version of the pdf of the exponential distribution, where
each side is multiplied by one half so that the integral is 1. Let Z ∼ Exp(b), and set
Y = Z with probability 1/2 and Y = −Z with probability 1/2. One can verify
using composition that Y has the pdf p_Y(y) given above. Therefore, we can generate
X ∼ Laplace(a, b) as follows:

• Generate Z ∼ Exp(b),
• Set Y = Z or Y = −Z, each with probability 1/2.
• Set X = Y + a.
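The three steps above can be implemented directly. In this sketch, Exp(b) is taken to mean the exponential distribution with mean b, matching the pdf above; the values a = 2, b = 1.5 are arbitrary:

```python
import random

def sample_laplace(a, b, rng=random):
    """Sample X ~ Laplace(a, b) by symmetrising an exponential draw:
    Z ~ Exp with mean b, flip its sign with probability 1/2, shift by a."""
    z = rng.expovariate(1.0 / b)  # expovariate takes the rate, 1/mean
    y = z if rng.random() < 0.5 else -z
    return y + a

random.seed(1)
xs = [sample_laplace(2.0, 1.5) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # close to a = 2 and to 2*b^2 = 4.5
```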

4. We have X′ ∼ q, i.e. p_{X′}(x) = q(x), with p̂(x) = p(x)Z_p and q̂(x) = q(x)Z_q, and

P(Accept | X′ = x) = p̂(x) / (M q̂(x)) = (1/M)(Z_p/Z_q)(p(x)/q(x)).

So, the acceptance probability can be derived as

P(Accept) = ∫ P(Accept | X′ = x) p_{X′}(x) dx
          = ∫ (1/M)(Z_p/Z_q)(p(x)/q(x)) q(x) dx
          = (1/M)(Z_p/Z_q) ∫ p(x) dx
          = (1/M)(Z_p/Z_q).

The validity can be verified by considering the distribution of the accepted samples.
Using Bayes’ theorem,

p_X(x) = p_{X′}(x | Accept) = p_{X′}(x) P(Accept | X′ = x) / P(Accept)
       = [q(x) · (1/M)(Z_p/Z_q)(p(x)/q(x))] / [(1/M)(Z_p/Z_q)] = p(x).

8. The pdf of Beta(a, b) is

p(x) = x^{a−1}(1 − x)^{b−1} / B(a, b) ∝ x^{a−1}(1 − x)^{b−1} =: p̂(x),   x ∈ (0, 1).

We have Q = Unif(0, 1), so q(x) = 1 and the ratio p̂(x)/q(x) = p̂(x).

• First, it can be seen that the ratio is unbounded for a < 1 or b < 1, so Q = Unif(0, 1) cannot be used.
• When a = b = 1, we have the uniform distribution for X, so it is trivial.
• For a ≥ 1 and b ≥ 1 with at least one of them strictly greater than 1, the first
derivative of p̂(x) is equal to 0 at x = (a − 1)/(a + b − 2), and the second derivative
of log p̂(x) at that point is −(a + b − 2)² (1/(a − 1) + 1/(b − 1)), which is negative, so
x* = (a − 1)/(a + b − 2) is a maximum point, yielding

p̂(x)/q(x) ≤ p̂(x*) = ((a − 1)/(a + b − 2))^{a−1} ((b − 1)/(a + b − 2))^{b−1},

so the smallest (hence the best) M we can choose is M* = ((a − 1)/(a + b − 2))^{a−1} ((b − 1)/(a + b − 2))^{b−1}.
Hence the rejection sampling algorithm can be applied as follows:
(a) Sample X′ ∼ Unif(0, 1) and U ∼ Unif(0, 1).
(b) If U ≤ (X′)^{a−1}(1 − X′)^{b−1} / M*, accept X = X′; else restart.
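The two-step algorithm above translates directly into code; a sketch with a = 2, b = 3 chosen arbitrarily:

```python
import random

def sample_beta(a, b, rng=random):
    """Rejection sampling for Beta(a, b) with a Unif(0,1) proposal
    (requires a >= 1 and b >= 1 with at least one strictly greater)."""
    assert a >= 1 and b >= 1 and a + b > 2
    x_star = (a - 1) / (a + b - 2)                    # mode of the unnormalised pdf
    m = x_star ** (a - 1) * (1 - x_star) ** (b - 1)   # best bound M* = p_hat(x*)
    while True:
        x, u = rng.random(), rng.random()
        # accept with probability p_hat(x)/M*
        if u * m <= x ** (a - 1) * (1 - x) ** (b - 1):
            return x

random.seed(2)
xs = [sample_beta(2.0, 3.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # close to a/(a+b) = 0.4
```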

B.2 Exercises in Chapter 3


1. For µ = 0 and ϕ(x) = x², we find the optimum k for this ϕ and calculate the gain
due to variance reduction compared to the plug-in estimator P_MC^N(ϕ).
When ϕ(x) = x² and µ = 0, Q_{2−k}(ϕ²) = E_{Q_{2−k}}(X⁴) = 3σ⁴/(2 − k)². Therefore, we need to
minimise

Q_k(w²ϕ²) = (1/√(k(2 − k))) · 3σ⁴/(2 − k)² = 3σ⁴ (2 − k)^{−5/2} k^{−1/2},

since V_{Q_k}(P_IS^N(ϕ)) = (1/N)[Q_k(w²ϕ²) − E(X²)²]. The minimum is attained at k = 1/3 and is

V_{Q_{1/3}}(P_IS^N(ϕ)) = (1/N) [ 3σ⁴ (2 − k)^{−5/2} k^{−1/2} |_{k=1/3} − E(X²)² ] = 0.4490 σ⁴/N,

where E(X²)² = σ⁴. The variance of the plug-in estimator P_MC^N(ϕ) = (1/N) ∑_{i=1}^N X_i², X_i ∼ P, is

V(P_MC^N(ϕ)) = [E(X⁴) − E(X²)²]/N = (3σ⁴ − σ⁴)/N = 2σ⁴/N.

Therefore the IS estimator provides a variance-reduction factor of ≈ 0.22.

2. • For an acyclic directed graph, see for example
[Link]
• Importance sampling part: The probability \( P(E_{10} > 70) = E(\varphi(X)) \), where \( X = (T_1, \ldots, T_{10}) \) and \( \varphi(X) = I(E_{10} > 70) \), can be estimated via importance sampling by sampling independent \( X^{(i)} \)'s where
\[
X^{(i)} = (T_1^{(i)}, \ldots, T_{10}^{(i)}), \quad T_j^{(i)} \sim \text{Exp}(1/\lambda_j).
\]
The proposal density for this choice at \( x = (t_1, \ldots, t_{10}) \) is \( q(x) = \prod_{j=1}^{10} \frac{1}{\lambda_j} e^{-t_j/\lambda_j} \), so the weights are
\[
w(X^{(i)}) = \frac{p(X^{(i)})}{q(X^{(i)})} = \prod_{j=1}^{10} \frac{\lambda_j}{\theta_j} \exp\left\{ -\left( \frac{1}{\theta_j} - \frac{1}{\lambda_j} \right) T_j^{(i)} \right\}.
\]
Therefore, the overall importance sampling estimate is
\[
P^N_{IS}(\varphi) = \frac{1}{N} \sum_{i=1}^{N} I\big(E_{10}^{(i)} > 70\big) \prod_{j=1}^{10} \frac{\lambda_j}{\theta_j} \exp\left\{ -\left( \frac{1}{\theta_j} - \frac{1}{\lambda_j} \right) T_j^{(i)} \right\},
\]
where \( E_{10}^{(i)} \) is the completion time computed from the i'th sample \( X^{(i)} \).
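The estimator can be sketched as below. The graph, the values of θ_j, and the proposal means λ_j are fixed by the exercise; here they are all hypothetical, and for illustration we assume the simplest serial network, where \( E_{10} = \sum_j T_j \).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical parameters; the exercise specifies the actual theta_j and the task graph.
theta = np.full(10, 5.0)   # target means: T_j ~ Exp(1/theta_j)
lam = np.full(10, 9.0)     # inflated proposal means, pushing E10 past 70 more often
N = 100_000

T = rng.exponential(lam, size=(N, 10))                       # T_j^(i) ~ Exp(1/lambda_j)
E10 = T.sum(axis=1)                                          # assumed serial graph
log_w = np.sum(np.log(lam / theta) - (1 / theta - 1 / lam) * T, axis=1)
w = np.exp(log_w)                                            # weights p/q, in log space
est = np.mean((E10 > 70) * w)                                # IS estimate of P(E10 > 70)
print(est)
```

Computing the weights in log space avoids underflow when the product runs over many tasks.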

B.3 Exercises of Chapter 4


1. The full tables are given below.

pX,Y (x, y) y=1 y=2 y = 3 y = 4 pX (x)


x=1 1/40 3/40 4/40 2/40 10/40
x=2 5/40 7/40 6/40 5/40 23/40
x=3 1/40 2/40 2/40 2/40 7/40
pY (y) 7/40 12/40 12/40 9/40

pX|Y (x|y) y=1 y=2 y=3 y=4


x=1 1/7 3/12 4/12 2/9
x=2 5/7 7/12 6/12 5/9
x=3 1/7 2/12 2/12 2/9
pY |X (y|x) y=1 y=2 y=3 y=4
x=1 1/10 3/10 4/10 2/10
x=2 5/23 7/23 6/23 5/23
x=3 1/7 2/7 2/7 2/7
2. We have \( p_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \) and \( p_{Y|X}(y|x) = x e^{-xy} \), so
\[
p_{X|Y}(x \mid y) \propto p_X(x)\, p_{Y|X}(y \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}\, x e^{-xy} \propto x^{\alpha} e^{-(\beta + y)x},
\]
which is in the same form as the pdf of Gamma(α + 1, β + y). Therefore, \( \alpha_{x|y} = \alpha + 1 \) and \( \beta_{x|y} = \beta + y \).
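The conjugacy result can be verified numerically by normalising the prior-times-likelihood product on a grid and comparing it with the Gamma(α + 1, β + y) density; the parameter values below are arbitrary.

```python
import numpy as np
from math import gamma as gamma_fn

alpha, beta, y = 3.0, 2.0, 1.5
dx = 1e-4
x = np.arange(dx, 25.0, dx)
# Prior times likelihood, then numerical normalisation
unnorm = (x**(alpha - 1) * np.exp(-beta * x)) * (x * np.exp(-x * y))
post = unnorm / (unnorm.sum() * dx)
# Gamma(alpha + 1, beta + y) density
target = (beta + y)**(alpha + 1) / gamma_fn(alpha + 1) * x**alpha * np.exp(-(beta + y) * x)
print(np.max(np.abs(post - target)))  # close to 0
```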

3. Let us define
\[
\mu(y) = E(X \mid Y = y) = \int x\, p(x \mid y)\, dx
\]
for brevity (we can define such a µ since E(X | Y = y) is a function of y only). We can write any estimator as \( \hat{X}(Y) = \mu(Y) + \hat{X}(Y) - \mu(Y) \). Now, the expected MSE is
\begin{align*}
E\big( (\hat{X}(Y) - X)^2 \big) &= \int (\hat{X}(y) - x)^2\, p(x, y)\, dx\, dy \\
&= \int \left( \int (\hat{X}(y) - x)^2\, p(x \mid y)\, dx \right) p(y)\, dy \\
&= \int \left( \int \big[ \mu(y) - x + \hat{X}(y) - \mu(y) \big]^2\, p(x \mid y)\, dx \right) p(y)\, dy.
\end{align*}
Let us focus on the inner integral first:
\begin{align}
\int \big[ \mu(y) - x &+ \hat{X}(y) - \mu(y) \big]^2\, p(x \mid y)\, dx \nonumber \\
&= \int \Big[ (\mu(y) - x)^2 + (\hat{X}(y) - \mu(y))^2 + 2(\mu(y) - x)(\hat{X}(y) - \mu(y)) \Big]\, p(x \mid y)\, dx \nonumber \\
&= \int (\mu(y) - x)^2\, p(x \mid y)\, dx + (\hat{X}(y) - \mu(y))^2 + 2(\hat{X}(y) - \mu(y)) \int (\mu(y) - x)\, p(x \mid y)\, dx \nonumber \\
&= \int (\mu(y) - x)^2\, p(x \mid y)\, dx + (\hat{X}(y) - \mu(y))^2, \tag{B.1}
\end{align}
where the last equality follows since the cross term is zero:
\[
\int (\mu(y) - x)\, p(x \mid y)\, dx = \mu(y) - \int x\, p(x \mid y)\, dx = 0.
\]
The first term in (B.1) does not depend on the estimator \( \hat{X}(y) \), so we have control only over the second term \( (\hat{X}(y) - \mu(y))^2 \), which is always nonnegative and therefore minimum when \( \hat{X}(y) - \mu(y) = 0 \), i.e. \( \hat{X}(y) = \mu(y) = E(X \mid Y = y) \). Since this is true for all y, we conclude that the estimator \( \hat{X}(Y) = E(X \mid Y) \), as a random variable of Y, minimises the expected MSE.
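The optimality of the conditional mean can be illustrated numerically on a simple example of our choosing: for X ∼ N(0, 1) and Y = X + V with V ∼ N(0, 1), the conditional mean is E(X | Y = y) = y/2, and any other estimator (here, Y itself) has larger MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 1.0, n)      # Y = X + V, V ~ N(0, 1)
mu = 0.5 * y                         # E(X | Y = y) = y / 2 for this model
mse_opt = np.mean((mu - x)**2)       # ~ posterior variance = 1/2
mse_naive = np.mean((y - x)**2)      # using Y itself as the estimate: ~ 1
print(mse_opt, mse_naive)
```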

4. • We can use the result in Example 4.7, which states
\[
Y \mid X = x \sim \mathcal{N}(Ax, R), \quad X \sim \mathcal{N}(m, S) \;\Rightarrow\; X \mid Y = y \sim \mathcal{N}(m_{x|y}, S_{x|y}),
\]
where \( S_{x|y} = (S^{-1} + A^T R^{-1} A)^{-1} \) and \( m_{x|y} = S_{x|y}(S^{-1} m + A^T R^{-1} y) \). By observation, we can see that X is a scalar with m = 0 and \( S = \sigma_x^2 \), A is an n × 1 vector with \( A(t) = \sin(2\pi t/T) \), t = 1, …, n, and \( R = \sigma_y^2 I_n \). Therefore \( p(x|y_{1:n}) = \phi(x; m_{x|y}, S_{x|y}) \) with
\[
S_{x|y} = \left( \frac{1}{\sigma_x^2} + \frac{1}{\sigma_y^2} \sum_{t=1}^{n} \sin^2(2\pi t/T) \right)^{-1}, \tag{B.2}
\]
and
\[
m_{x|y} = S_{x|y}\, \frac{1}{\sigma_y^2} \sum_{t=1}^{n} \sin(2\pi t/T)\, y_t.
\]
It is possible to derive \( p(y_{1:n}) \) from \( p(y_{1:n}) = p(x, y_{1:n})/p(x|y_{1:n}) \), but there is an easier way when the prior and the likelihood are Gaussian and the relation between X and Y is linear. One can view
\[
Y = AX + V,
\]
where X and A are as before and \( V \sim \mathcal{N}(0_n, \sigma_y^2 I_n) \). As covered already in Section 2.2.2, since X and V are Gaussian, \( Y_{1:n} \sim \mathcal{N}(m_n, \Sigma_n) \) with
\[
m_n = E(AX + V) = A \cdot 0 + 0_n = 0_n, \quad \Sigma_n = \mathrm{Cov}(AX + V) = \sigma_x^2 A A^T + \sigma_y^2 I_n.
\]
• \( f(n+1, X) = \sin(2\pi(n+1)/T)\, X \), and we can view \( \sin(2\pi(n+1)/T) \) as a scalar constant. Since \( X \mid Y_{1:n} = y_{1:n} \sim \mathcal{N}(m_{x|y}, S_{x|y}) \), we have
\[
f(n+1, X) \mid Y_{1:n} = y_{1:n} \sim \mathcal{N}\big( \sin(2\pi(n+1)/T)\, m_{x|y},\; \sin^2(2\pi(n+1)/T)\, S_{x|y} \big).
\]
• First, let us write
\[
Y_{n+1} \mid X \sim \mathcal{N}(f(n+1, X), \sigma_y^2) = \mathcal{N}\big( \sin(2\pi(n+1)/T)\, X, \sigma_y^2 \big).
\]
The unconditional (marginal) distribution of \( Y_{n+1} \) can be calculated in a similar way as done for \( Y_{1:n} \). Since the unconditional distribution of X is \( X \sim \mathcal{N}(0, \sigma_x^2) \), we have
\[
Y_{n+1} \sim \mathcal{N}\big( 0,\; \sin^2(2\pi(n+1)/T)\, \sigma_x^2 + \sigma_y^2 \big). \tag{B.3}
\]
Conditional on \( Y_{1:n} = y_{1:n} \), we have
\[
X \mid Y_{1:n} = y_{1:n} \sim \mathcal{N}(m_{x|y}, S_{x|y})
\]
and
\[
Y_{n+1} \mid X = x, Y_{1:n} = y_{1:n} \sim \mathcal{N}\big( x \sin(2\pi(n+1)/T), \sigma_y^2 \big)
\]
as before, since \( Y_{n+1} \) is conditionally independent of \( Y_{1:n} \) given X. We can use the same mechanism as before and derive
\[
Y_{n+1} \mid Y_{1:n} = y_{1:n} \sim \mathcal{N}\big( \sin(2\pi(n+1)/T)\, m_{x|y},\; \sin^2(2\pi(n+1)/T)\, S_{x|y} + \sigma_y^2 \big). \tag{B.4}
\]
When we compare the variances in (B.3) and (B.4), we see that the second one is smaller since \( S_{x|y} < \sigma_x^2 \); see (B.2). The decrease in the variance, hence in uncertainty, is due to the information that comes from \( Y_{1:n} = y_{1:n} \).
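The posterior and predictive quantities above are straightforward to compute; the sketch below simulates data from the model with arbitrary parameter values and checks that the predictive variance (B.4) is indeed smaller than the marginal one (B.3).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x, sigma_y, T, n = 2.0, 0.5, 20, 50
x_true = rng.normal(0.0, sigma_x)
t = np.arange(1, n + 1)
A = np.sin(2 * np.pi * t / T)
y = A * x_true + rng.normal(0.0, sigma_y, n)

# Posterior of X given y_{1:n}, eq. (B.2) and the formula for m_{x|y}
S_post = 1.0 / (1 / sigma_x**2 + (A**2).sum() / sigma_y**2)
m_post = S_post * (A @ y) / sigma_y**2

# Variances of Y_{n+1}: marginal (B.3) vs predictive given y_{1:n} (B.4)
a_next = np.sin(2 * np.pi * (n + 1) / T)
var_marginal = a_next**2 * sigma_x**2 + sigma_y**2
var_predictive = a_next**2 * S_post + sigma_y**2
print(m_post, var_predictive, var_marginal)
```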

B.4 Exercises of Chapter 7


2. For t > 1, the optimum proposal is
\[
q(x_t \mid x_{t-1}, y_t) = p(x_t \mid x_{t-1}, y_t) = \frac{f(x_t \mid x_{t-1})\, g(y_t \mid x_t)}{p(y_t \mid x_{t-1})}.
\]
Conditional on \( x_{t-1} \), \( X_t \) has prior \( \mathcal{N}(a x_{t-1}, \sigma_x^2) \) and \( Y_t \mid X_t = x_t \sim \mathcal{N}(b x_t, \sigma_y^2) \). Therefore, using conjugacy, we have
\[
p(x_t \mid x_{t-1}, y_t) = \phi(x_t; \mu_q, \sigma_q^2), \quad \sigma_q^2 = \left( 1/\sigma_x^2 + b^2/\sigma_y^2 \right)^{-1}, \quad \mu_q = \sigma_q^2 \left( a x_{t-1}/\sigma_x^2 + b y_t/\sigma_y^2 \right).
\]
The resulting incremental weight is
\[
w_{t|t-1}(x_{t-1}, x_t) = \frac{f(x_t \mid x_{t-1})\, g(y_t \mid x_t)}{p(x_t \mid x_{t-1}, y_t)} = p(y_t \mid x_{t-1}),
\]
which depends only on \( x_{t-1} \). It can be checked that \( Y_t = bX_t + V_t \), \( V_t \sim \mathcal{N}(0, \sigma_y^2) \), given \( X_{t-1} = x_{t-1} \) is Gaussian with mean \( a b x_{t-1} \) and variance \( b^2 \sigma_x^2 + \sigma_y^2 \), i.e.
\[
w_{t|t-1}(x_{t-1}, x_t) = p(y_t \mid x_{t-1}) = \phi(y_t; a b x_{t-1}, b^2 \sigma_x^2 + \sigma_y^2).
\]
For t = 1, we get similar results by replacing \( f(x_t \mid x_{t-1}) \) with \( \eta(x_1) \) and considering \( Y_1 = bX_1 + V_1 \) with \( V_1 \sim \mathcal{N}(0, \sigma_y^2) \). This results in
\[
q(x_1 \mid y_1) = p(x_1 \mid y_1) = \phi(x_1; \mu_q, \sigma_q^2), \quad \sigma_q^2 = \left( 1/\sigma_0^2 + b^2/\sigma_y^2 \right)^{-1}, \quad \mu_q = \sigma_q^2\, b y_1/\sigma_y^2,
\]
and
\[
w_1(x_1) = p(y_1) = \phi(y_1; 0, b^2 \sigma_0^2 + \sigma_y^2).
\]
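A minimal sketch of sequential importance sampling with this optimal proposal is given below; resampling is omitted for brevity, the parameter values are arbitrary, and \( \eta = \mathcal{N}(0, \sigma_0^2) \) is assumed for the initial distribution.

```python
import numpy as np

def normal_pdf(y, m, v):
    return np.exp(-(y - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

rng = np.random.default_rng(0)
a, b, sigma_x, sigma_y, sigma_0 = 0.9, 1.0, 1.0, 0.5, 1.0
N, T_steps = 2000, 10

# Simulate data from the model
x = rng.normal(0.0, sigma_0)
ys = []
for t in range(T_steps):
    if t > 0:
        x = a * x + rng.normal(0.0, sigma_x)
    ys.append(b * x + rng.normal(0.0, sigma_y))

# SIS with the optimal proposal; resampling omitted
sq2 = 1.0 / (1.0 / sigma_x**2 + b**2 / sigma_y**2)
logw = np.zeros(N)
for t, y in enumerate(ys):
    if t == 0:
        s2 = 1.0 / (1.0 / sigma_0**2 + b**2 / sigma_y**2)
        logw += np.log(normal_pdf(y, 0.0, b**2 * sigma_0**2 + sigma_y**2))
        particles = s2 * b * y / sigma_y**2 + np.sqrt(s2) * rng.normal(size=N)
    else:
        # incremental weight depends only on x_{t-1}
        logw += np.log(normal_pdf(y, a * b * particles, b**2 * sigma_x**2 + sigma_y**2))
        mu = sq2 * (a * particles / sigma_x**2 + b * y / sigma_y**2)
        particles = mu + np.sqrt(sq2) * rng.normal(size=N)

w = np.exp(logw - logw.max())
w /= w.sum()
est = np.sum(w * particles)
print(est, x)  # weighted filtering-mean estimate vs the true last state
```

Because the incremental weight depends only on \( x_{t-1} \), it can be computed before the particles are propagated, which is what enables fully adapted auxiliary particle filters.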

3. Since each \( Y_{t,i} \) is lognormally distributed with parameters \( \log R_{t,i} \) and \( \sigma_y^2 \), we have
\[
g(y_t \mid x_t) = \prod_{i=1}^{3} \frac{1}{\sqrt{2\pi\sigma_y^2}\; y_{t,i}} \exp\left\{ -\frac{1}{2\sigma_y^2} \big( \log y_{t,i} - \log R_{t,i} \big)^2 \right\}.
\]
This result can be reached by transformation of random variables, using the fact that \( \log Y_{t,i} \) is Gaussian with mean \( \log R_{t,i} \) and variance \( \sigma_y^2 \).
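In a particle filter this density would be evaluated for every particle, so it is convenient to implement it on the log scale; the function name below is ours.

```python
import numpy as np

def log_g(y_t, R_t, sigma_y):
    """Log observation density: Y_{t,i} ~ LogNormal(log R_{t,i}, sigma_y^2), i.i.d. over i."""
    y_t, R_t = np.asarray(y_t), np.asarray(R_t)
    return np.sum(
        -0.5 * np.log(2 * np.pi * sigma_y**2)
        - np.log(y_t)                                      # Jacobian of y -> log y
        - (np.log(y_t) - np.log(R_t))**2 / (2 * sigma_y**2)
    )

print(log_g([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], 1.0))  # = -1.5 * log(2*pi)
```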