Metropolis-Hastings Acceptance Probabilities
ABSTRACT. The use of auxiliary variables for generating proposals within a Metropolis–
Hastings setting has been suggested in many different contexts. This has in particular been of interest
for simulation from complex distributions such as multimodal distributions or in transdimensional
approaches. For many of these approaches, the acceptance probabilities that are used appear
somewhat magical, and different proofs for their validity have been given in each case. In this
article, we present a general framework for the construction of acceptance probabilities in auxiliary variable
proposal generation. In addition to showing the similarities between many of the algorithms
proposed in the literature, the framework also demonstrates that there is great flexibility in how
to construct acceptance probabilities. With this flexibility, alternative acceptance probabilities are
suggested. Some numerical experiments are also reported.
Key words: acceptance probabilities, auxiliary variables, Markov chain Monte Carlo,
Metropolis–Hastings algorithms
1. Introduction
Monte Carlo methods, and in particular Markov chain Monte Carlo (MCMC) algorithms,
are nowadays part of the standard set of techniques used by statisticians (Robert & Casella,
2004). Within statistics, the typical applications of Monte Carlo methods are evaluation of
likelihoods including random/missing effects or calculation of posterior distributions. In
general, Monte Carlo methods can be used to evaluate complex and high-dimensional integrals.
Monte Carlo methods involve sampling from some distribution, π(y) say, which in many
cases can be done efficiently through 'standard' MCMC algorithms generating a Markov
chain with π(y) as invariant distribution (Metropolis et al., 1953; Hastings, 1970; Gilks
et al., 1996; Robert & Casella, 2004). Most MCMC algorithms involve mixtures of different
Metropolis–Hastings (MH) steps (including Gibbs steps as particular cases). Given the current
state y, an MH step consists of generating a proposal y* ∼ q(y* | y) and switching to the proposal
with probability

α(y; y*) = min{ 1, [π(y*) q(y | y*)] / [π(y) q(y* | y)] }.
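As a concrete illustration of a single MH step (a sketch, not code from the article; the symmetric Gaussian random-walk proposal is an illustrative choice that makes q cancel from the ratio):

```python
import math
import random

def mh_step(y, log_pi, step=1.0, rng=random):
    """One Metropolis-Hastings step with a symmetric Gaussian random-walk
    proposal; q(y* | y) = q(y | y*), so the acceptance ratio reduces to
    pi(y*) / pi(y)."""
    y_star = y + rng.gauss(0.0, step)
    log_r = log_pi(y_star) - log_pi(y)   # log of pi(y*)/pi(y)
    if math.log(rng.random()) < min(0.0, log_r):
        return y_star                    # accept: move to the proposal
    return y                             # reject: keep the current state

# Example target: standard normal, pi(y) proportional to exp(-y^2/2)
log_pi = lambda y: -0.5 * y * y
```

Repeatedly applying `y = mh_step(y, log_pi)` produces a Markov chain with π as invariant distribution.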
General Markov chain theory can be used to show convergence and ergodicity results (e.g.
Tierney, 1994). In practice, slow mixing can occur when π(y) has a complex structure, that is,
when it is multimodal or covers several models in different spaces. The main problem is to choose
proposal distributions q(y∗ | y) that both allow flexible movements within the sample space
and give reasonable acceptance probabilities. Some guidelines and theoretical results for tuning
variances of proposal distributions are available (Roberts et al., 1997; Brooks et al., 2003).
Efficient sampling algorithms can in many cases be constructed by the use of auxiliary
variables. Several approaches have been suggested within the class of MCMC algorithms
Scand J Statist 38 Flexibility of MH acceptance probabilities 343
(Tanner & Wong, 1987; Edwards & Sokal, 1988; Besag & Green, 1993; Higdon, 1998;
Damien et al., 1999), but auxiliary variables have also been used in combination with importance sampling
(Neal, 2001). The most common way of doing this is to augment the sample space with
some extra variables to construct simpler or more efficient algorithms in this augmented
space. More recently, auxiliary variables have been used as a tool within an MH setting for
either generating better proposals (e.g. Tjelmeland & Hegstad, 2001; Jennison & Sharp, 2007)
or for calculation of acceptance probabilities (e.g., Beaumont, 2003; Andrieu & Roberts,
2009).
The flexibility in choosing proposal distributions has long been recognized, whereas
acceptance probabilities are given by the standard MH rule. When extending the state space by
introducing auxiliary variables, there is also flexibility in the choice of acceptance
probabilities that seems to have gone largely unnoticed. This flexibility is a consequence of the
possibility of choosing different target distributions in the augmented space, and is the focus
of this article.
Assume our aim is to sample y* ∼ π(y*). A proposal is generated by a simultaneous
simulation of (x*, y*) ∼ q(x*, y* | y). The flexibility follows in that the target distribution, π̄ say,
for the combined variables (x*, y*) can be any distribution with π as its marginal distribution.
For a given joint proposal distribution q, several possible acceptance probabilities can be
obtained by different choices of π̄. By taking this alternative viewpoint, a common
understanding of different algorithms suggested in the literature is obtained, and this also serves
as a framework for suggesting new algorithms. This framework will in particular be useful in
cases where standard MCMC algorithms fail and more complicated versions are needed. Typical
examples are algorithms for jumping between possible modes (Tjelmeland & Hegstad, 2001;
Jennison & Sharp, 2007) and improvements of acceptance rates within reversible jump algo-
rithms (Al-Awadhi et al., 2004).
An important special class of algorithms is where x* = (x_1*, …, x_t*) is generated in sequence,
typically with x_1* generated through a big jump (for jumping between modes or dimensions)
followed by a few steps of smaller jumps. Such x_i*'s can in principle live on any space, but in
most cases they live on the same space as y. For most such algorithms suggested in the
literature, acceptance ratios within an MH setting typically depend either only on the
first x_1* (e.g. Al-Awadhi et al., 2004) or the last x_t* (e.g. Tjelmeland & Hegstad, 2001;
Jennison & Sharp, 2007). By combining different target distributions, acceptance ratios
depending on averages over all the generated x_1*, …, x_t* can be constructed, which can give
higher and more stable acceptance probabilities.
Other algorithms that can be considered as special cases of this general framework are
mode or model jumping (Tjelmeland & Hegstad, 2001; Al-Awadhi et al., 2004; Jennison &
Sharp, 2007), multiple-try methods (Liu et al., 2000), proposals based on particle filters
(Andrieu et al., 2010), pseudo-marginal algorithms (Beaumont, 2003; Andrieu & Roberts, 2009),
sampling algorithms for distributions with intractable normalizing constants (Møller et al.,
2006) and delayed rejection sampling (Tierney & Mira, 1999; Green & Mira, 2001; Mira,
2001). Some of these will be considered in more detail later whereas others are treated in
Storvik (2009). Although our focus will be on MH algorithms, the framework can also be
applied within importance sampling, for example, in combination with annealed importance
sampling (Neal, 2001); see Storvik (2009) for details.
We start in section 2 by discussing the flexibility of acceptance probabilities in MH
algorithms when auxiliary variables are used for generating proposals. Section 3 focuses on
sequential generations of auxiliary variables. In section 4, we apply our general results to
some algorithms suggested in the literature and also consider alternative versions of these.
For some of these alternatives, numerical experiments are included, demonstrating that
alternative weights and acceptance probabilities can improve the performance of an algo-
rithm. We conclude the article with discussion and final remarks in section 5.
Proposition 1. Assume the current state y ∼ π(y) and that x | y ∼ h(x | y) for some arbitrary
conditional distribution h. Generate (x*, y*) ∼ q(x*, y* | x, y) and put (x', y') = (x*, y*) with
probability α(x, y; x*, y*) = min{1, r(x, y; x*, y*)}, and (x', y') = (x, y) otherwise, with

r(x, y; x*, y*) = [π(y*) h(x* | y*) q(x, y | x*, y*)] / [π(y) h(x | y) q(x*, y* | x, y)].   (1)

Then y' ∼ π(y').

If the reverse auxiliary variable need not be carried along between iterations, a corresponding acceptance ratio is

r(y; x*, y*, x) = [π(y*) q(x, y | y*)] / [π(y) q(x*, y* | y)].

When x* = (x_1*, …, x_t*) is generated sequentially, the reverse-move auxiliary variables can be taken in reversed order, giving

r(y; x*, y*, x) = [π(y*) q(x*_{t:1}, y | y*)] / [π(y) q(x*_{1:t}, y* | y)].
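A minimal sketch of a step accepted with ratio (1); all distributional choices here, h(x | y) = N(x; y, 1) and the two-stage Gaussian proposal, are illustrative assumptions, not the article's:

```python
import math
import random

def log_norm(v, mean, sd):
    # log of the N(mean, sd^2) density evaluated at v
    return -0.5 * ((v - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def aux_mh_step(x, y, log_pi, sigma=1.0, rng=random):
    """One auxiliary-variable MH step on the pair (x, y), targeting
    pi(y) h(x | y) so that y marginally follows pi.  Illustrative choices:
    h(x | y) = N(x; y, 1), q(x*, y* | x, y) = N(x*; x, sigma) N(y*; x*, 1)."""
    x_star = x + rng.gauss(0.0, sigma)
    y_star = x_star + rng.gauss(0.0, 1.0)
    # log of ratio (1): pi(y*) h(x*|y*) q(x,y|x*,y*) / [pi(y) h(x|y) q(x*,y*|x,y)]
    log_r = (log_pi(y_star) + log_norm(x_star, y_star, 1.0)
             + log_norm(x, x_star, sigma) + log_norm(y, x, 1.0)
             - log_pi(y) - log_norm(x, y, 1.0)
             - log_norm(x_star, x, sigma) - log_norm(y_star, x_star, 1.0))
    if math.log(rng.random()) < min(0.0, log_r):
        return x_star, y_star
    return x, y
```

Because the pair (x, y) is updated jointly with π(y)h(x | y) as invariant distribution, the y-marginal of the resulting chain is π.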
We will explore such possibilities further in sections 3 and 4 for specific choices of proposal
distributions.
For both propositions 1 and 2, the acceptance ratios can be written as ratios between
weight functions directly related to importance sampling; for instance, the ratio in (2)
equals a ratio of such weights. In addition to demonstrating that the presented results can
also be applied to importance sampling, weight functions will in some cases be used instead
of acceptance ratios to simplify notation.
If {h_s} is a collection of distributions, a mixture of the h_s's,

h(x | y, x*, y*) = ∑_s a_s h_s(x | y, x*, y*),   a_s ≥ 0,   ∑_s a_s = 1,

is itself a valid choice of h, giving acceptance ratios that combine the corresponding weights.
Similar combinations of weights can be obtained for proposition 1. Such mixtures will be
considered in the subsequent sections.
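The point can be checked exactly on a two-point toy space: any conditional h(u | y), and hence any mixture of such conditionals, yields a proper importance weight w(u, y) = π(y)h(u | y)/q(u, y). All distributions below are arbitrary illustrative choices, not from the article:

```python
# Exact (enumerated) check that w = pi(y) h(u|y) / q(u,y) is a proper
# weight for any h(u|y), and therefore also for a mixture of two h's.
pi = {0: 0.3, 1: 0.7}
q_u = {0: 0.5, 1: 0.5}
q_y_given_u = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # q(y|u), arbitrary
h1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}            # h1(u|y), arbitrary
h2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.6, 1: 0.4}}            # h2(u|y), arbitrary

def expectation(h, f):
    """E_q[w f(y)] with w = pi(y) h(u|y) / q(u,y), by full enumeration."""
    total = 0.0
    for u in (0, 1):
        for y in (0, 1):
            q_uy = q_u[u] * q_y_given_u[u][y]
            w = pi[y] * h[y][u] / q_uy
            total += q_uy * w * f(y)
    return total

f = lambda y: float(y == 1)
# mixture of h1 and h2 with a_1 = a_2 = 0.5, cf. the mixture construction
h_mix = {y: {u: 0.5 * h1[y][u] + 0.5 * h2[y][u] for u in (0, 1)} for y in (0, 1)}
for h in (h1, h2, h_mix):
    assert abs(expectation(h, f) - pi[1]) < 1e-12   # recovers E_pi[f] exactly
```

Each individual h, and the mixture, reproduces E_π[f] = π(1) = 0.7 exactly, since each h(· | y) sums to one.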
Properties of the acceptance ratios are given in proposition 3.
where x_0* ≡ y and x*_{t+1} ≡ y*. The typical setup we will consider is where for i ≥ 2 the proposal
densities satisfy the detailed balance criterion

π(x) q_i(y | x) = π(y) q_i(x | y).   (8)
Such proposals can for instance be used for target distributions with several modes where
q1 corresponds to a large jump whereas the subsequent moves are ordinary MCMC ones.
This particular setting will be further discussed in section 4.1. Another possibility is model
or dimension moves in a reversible jump setting, see Storvik (2009). Allowing the additional
moves also to depend on i gives the possibilities for different blocks of a high-dimensional
state vector to be updated at each step.
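Criterion (8) can be verified numerically for a small discrete chain whose kernel is itself a Metropolised move (a toy illustration, not the article's setup):

```python
import itertools

# Detailed balance pi(x) q(y|x) = pi(y) q(x|y), checked exactly on {0,1,2}.
# The kernel proposes one of the two other states uniformly and accepts
# with the MH probability for a symmetric proposal.
pi = [0.2, 0.3, 0.5]

def kernel(x, y):
    """Transition probability q(y|x) of the Metropolised move."""
    if x == y:
        # rejection mass: whatever is left over after the off-diagonal moves
        return 1.0 - sum(kernel(x, z) for z in range(3) if z != x)
    proposal = 0.5                       # uniform over the two other states
    accept = min(1.0, pi[y] / pi[x])     # MH acceptance, symmetric proposal
    return proposal * accept

for x, y in itertools.product(range(3), repeat=2):
    assert abs(pi[x] * kernel(x, y) - pi[y] * kernel(y, x)) < 1e-12
```

For x ≠ y the two sides both equal 0.5 · min{π(x), π(y)}, which is symmetric in (x, y), so (8) holds.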
A consequence of assumption (8) is that for s ≥ 1

∏_{i=s}^{t} [q_{i+1}(x_i* | x*_{i+1}) / q_{i+1}(x*_{i+1} | x_i*)] = π(x_s*) / π(x*_{t+1}),   (9)

which will be used repeatedly.
Our aim is to construct acceptance probabilities for the proposal y* = x*_{t+1} when y is the
current state. A complicating factor is that the sequential approach for generating y∗ will
not make the proposal distribution for y∗ given y directly available. Note that if q1 does
not depend on y, we obtain an independence sampler where the proposal is generated by a
sequence of internal MCMC steps, see section 4.1 for an example.
Writing x*_{1:t+1} = (x_1*, …, x*_{t+1}), the general weight function will have the form

w(y; x*_{1:t+1}, x) = [π(x*_{t+1}) h(x*_{1:t} | x*_{t+1}, x, y)] / ∏_{i=1}^{t+1} q_i(x_i* | x*_{i−1}).
A wide variety of weight functions can be considered by different choices of h. Consider the
class of distributions
h_s(x*_{1:t} | x*_{t+1}, x, y) = ∏_{i=1}^{s−1} q_i(x_i* | x*_{i−1}) ∏_{i=s}^{t} q_{i+1}(x_i* | x*_{i+1}).   (10)
converges in distribution to π, which should give a weight approaching 1. This can, however,
be accomplished by combining the weights. Using (4), a proper weight function is also given
by
w(y; x*_{1:t+1}) = ∑_{s=1}^{t+1} a_s w_s(y; x*_{1:t+1}),   (14)
4. Examples
In Storvik (2009), several algorithms suggested in the literature are shown to fit into the
general framework discussed in the previous sections. These include annealed importance
sampling (Neal, 2001), reversible jump MCMC for model selection (Al-Awadhi et al., 2004;
Andrieu & Roberts, 2009), multiple-try methods and particle proposals (Liu et al., 2000;
Andrieu et al., 2010) and the pseudo-marginal algorithm (Beaumont, 2003; Andrieu &
Roberts, 2009). Here, we will consider some specific examples where alternative weights are
proposed. For some of these examples, numerical experiments are performed to compare the
different weights.
= [π(y*) q_{t+1}(y | x_t)] / [π(y) q_{t+1}(y* | x_t*)],

which is equal to the acceptance ratio obtained by Jennison & Sharp (2007). To get a
computationally tractable density, Jennison & Sharp (2007) suggested choosing q_{t+1}(y* | x_t*) as a
local approximation to π(y).
As discussed in the Appendix, q_{t+1}(y* | x_t*) can also be defined through an MH step. Within
such a setting, alternative constructions can be considered. Assume q_{t+1} = q. Similar to (10),
define

h_s(x | y*, x*, y) = δ(x_1 − y*) ∏_{i=2}^{s−1} q(x_i | x_{i−1}) ∏_{i=s}^{t} q(x_i | x_{i+1}).
In general, if q is capable of moving efficiently around the whole state space, this ratio will
converge to 1. In more practical settings where q is only able to move within the current
mode, the ratio will converge towards the ratio between the masses of the corresponding
modes.
A numerical example on a Gaussian mixture model. Consider a model in R^p
and proposals are accepted using ordinary MH acceptance rates. The number of ‘outer’ MH
steps, that is, the number of generated y*'s, will be denoted by M.
The final proposals y* are accepted with probabilities given through the following
alternative acceptance ratios:
All the alternatives are used in combination with (23). The full algorithm was restarted N
times and the last sample was stored in each case, giving samples y_1, …, y_N. As a measure
of the performance of the different weight functions, a distance d(F̂_1, F_1)
was used, where F̂_1 is the empirical distribution function based on the first component of the
samples y_1, …, y_N whereas F_1 is the corresponding true marginal cumulative distribution.
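One standard way to compare an empirical distribution function with the true one is the Kolmogorov–Smirnov sup-distance, shown here as an illustrative stand-in (the article's exact d may differ):

```python
import math

def ks_distance(samples, cdf):
    """Sup-distance between the empirical CDF of `samples` and a true CDF
    `cdf` (the Kolmogorov-Smirnov statistic).  Illustrative stand-in for
    a distance d(F_hat, F) between empirical and true distributions."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

std_normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For example, `ks_distance(chain_samples, std_normal_cdf)` shrinks towards 0 as the sampler's output approaches a standard normal.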
Figure 1A–E shows boxplots of d(F̂_1, F_1) with N = 10,000, M = 20, t + 1 = 10 and with
σ_large = 15 and σ_small = 1.5. The different panels correspond to different dimensions p. The
boxplots are obtained by repeating the experiments 100 times. We see in all cases that the first
weight function does not perform that well, an expected property given the earlier discussion
about this particular choice. Both the other choices give significant improvements with the
third weight function performing best in all the cases. The difference between these seems
Fig. 1. Boxplot of d(F̂ 1 , F1 ) for the Gaussian mixture distribution. The different panels correspond to
results for different dimensions p = 1, 2, 3, 5, 10. The experimental setup is described in the text. Panels
(A)–(E) correspond to M = 20, whereas panel (F) corresponds to M = 40. The numbers on the x-axis
correspond to the different types of acceptance ratios.
Fig. 2. Mixture Gaussian distribution with dimension p = 10 and number of outer Markov chain Monte
Carlo iterations M = 20. True (shaded) and examples of estimated cumulative distributions of the first
component. Solid line corresponds to the first weight function, dashed line to the second weight func-
tion and dotted line to the third weight function.
to depend on dimension, though. In particular, the difference between them seems to decrease
with dimension but then increases again for p = 10. This is probably because, for p = 10,
20 iterations in the outer MCMC is not that many. Increasing M to 40 (Fig. 1F), the second
and the third weight functions again perform similarly.
Figure 2 shows the true cumulative distribution for the first component and typical ex-
amples of empirical distributions based on the three different weight functions for p = 10 and
M = 20. The differences between the choices of weight functions shown in Fig. 1 are
clearly seen. Note in particular that the third weight function produces estimates that are
indistinguishable from the truth, indicating that convergence has been reached even with this small
number of MCMC iterations.
q_{y|x}(y* | y, x_1*, x_2*) = α_1(y; x_1*) I(y* = x_1*) + [1 − α_1(y; x_1*)] I(y* = x_2*).
Assume first that in addition to (x1∗ , x2∗ ) also a pair (x1 , x2 ) is generated using the proposal
distribution h(x1 , x2 | y, x1∗ , x2∗ , y∗ ). To get the variables to work within the same spaces, we
make h(x1 , x2 | y, x1∗ , x2∗ , y∗ ) degenerate in the sense that either x1 or x2 is equal to y, that is,
h(x_1, x_2 | y, x_1*, x_2*, y*) = δ(x_1 − y) h_2(x_2 | y, x_2*, y*)   if y* = x_1*,
h(x_1, x_2 | y, x_1*, x_2*, y*) = δ(x_2 − y) h_1(x_1 | y, x_1*, y*)   if y* = x_2*.
We consider the two possibilities for y* separately. If y* = x_1*, then x_1 = y, and by applying
proposition 2, we obtain
z_i = μ + γ_i + ε_i,

where ε_1, …, ε_p are i.i.d. zero-mean Gaussian variables with precision τ_1, whereas γ_1, …, γ_p are
independent of ε_1, …, ε_p and spatially correlated variables with a conditional autoregressive
(CAR) structure (Besag, 1974; Besag & Kooperberg, 1995). The conditional distributions are
given by

γ_i | γ_{−i} ∼ N( n_i^{−1} ∑_{j∼i} γ_j , (n_i τ_2)^{−1} ),
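A Gibbs sweep over these full conditionals can be sketched as follows (illustrative code; the neighbourhood structure, the fixed update order, and ignoring the impropriety of the intrinsic CAR prior are simplifying assumptions):

```python
import math
import random

def car_gibbs_sweep(gamma, neighbours, tau2, rng=random):
    """One Gibbs sweep for the CAR field: each gamma_i is drawn from its
    full conditional N(mean of its neighbours, 1 / (n_i * tau2)), where
    neighbours[i] lists the indices j ~ i and n_i is their count."""
    for i in range(len(gamma)):
        nb = neighbours[i]
        mean = sum(gamma[j] for j in nb) / len(nb)     # neighbour average
        sd = 1.0 / math.sqrt(len(nb) * tau2)           # conditional st. dev.
        gamma[i] = mean + rng.gauss(0.0, sd)
    return gamma
```

For example, on a line of four sites the neighbour lists would be `[[1], [0, 2], [1, 3], [2]]`, and repeated sweeps explore the (improper) CAR field for a given τ_2.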
Fig. 3. Top panel: posterior distribution of (τ_1, τ_2) in the conditional autoregressive model based on 50
simulated observations on a 5 × 5 grid using μ = 0, = 0.25 and τ_1 = 1, τ_2 = 1/3. Bottom panel: distance,
as defined by (22), between true and empirical distribution of τ_2 as a function of the number of Markov
chain Monte Carlo iterations for the 'standard' delayed rejection sampling acceptance probability (19),
solid line, and the alternative acceptance probability (20), dashed line.
between the true cumulative distribution function for τ_2, F_{τ_2}(τ_2) (evaluated through numerical
integration), and the corresponding estimate based on the samples, F̂_{τ_2}(τ_2), was used.
The bottom panel of Fig. 3 shows the distance measure (22) as a function of the number of
MCMC iterations based on the 'standard' delayed rejection sampling acceptance
probability (19) (solid line) and the alternative acceptance probability (20) (dashed line), both
of which should converge towards 0 as the number of iterations increases to infinity. We clearly
see the improvement for the alternative acceptance probability.
As noted by an anonymous referee, this results in the exchange algorithm by Murray et al.
(2006) and thereby serves as a unification of the two methods for dealing with unknown
normalizing constants proposed by Møller et al. (2006) and Murray et al. (2006).
Acknowledgements
Part of this work was performed when the author was a visiting fellow at the Department of
Mathematics, University of Bristol. The author is grateful for valuable discussions with col-
leagues there, in particular Prof. Christophe Andrieu. The author also wants to thank two
anonymous referees, the associate editor and the editor for many valuable comments.
References
Al-Awadhi, F., Hurn, M. & Jennison, C. (2004). Improving the acceptance rate of reversible jump
MCMC proposals. Statist. Probab. Lett. 69, 189–198.
Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo compu-
tations. Ann. Statist. 37, 697–725.
Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. Roy.
Statist. Soc. Ser. B (Stat. Methodol.) 72, 269–342.
Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations.
Genetics 164, 1139–1160.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser.
B Stat. Methodol. 36, 192–236.
Besag, J. (1994). Comments on representations of knowledge in complex systems by U. Grenander and
M. I. Miller. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 56, 591–592.
Besag, J. & Green, P. J. (1993). Spatial statistics and Bayesian computation (with discussion). J. Roy.
Statist. Soc. Ser. B Stat. Methodol. 55, 25–37.
Besag, J. & Kooperberg, C. (1995). On conditional and intrinsic autoregressions. Biometrika 82, 733–
746.
Brooks, S., Giudici, P. & Roberts, G. (2003). Efficient construction of reversible jump Markov chain
Monte Carlo proposal distributions. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 65, 3–39.
Damien, P., Wakefield, J. & Walker, S. (1999). Gibbs sampling for Bayesian non-conjugate and hier-
archical models by using auxiliary variables. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 61, 331–344.
Edwards, R. G. & Sokal, A. D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang repre-
sentation and Monte Carlo algorithm. Phys. Rev. D 38, 2009–2012.
Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak convergence and optimal scaling of random
walk Metropolis algorithms. Ann. Appl. Probab. 7, 110–120.
Gilks, W., Richardson, S. & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. Chapman
& Hall, London.
Green, P. & Mira, A. (2001). Delayed rejection in reversible jump Metropolis-Hastings. Biometrika 88,
1035–1053.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97–109.
He, Y., Hodges, J. & Carlin, B. (2007). Re-considering the variance parameterization in multiple precision
models. Bayesian Anal. 2, 529–556.
Higdon, D. (1998). Auxiliary variable methods for Markov chain Monte Carlo with applications. J. Amer.
Statist. Assoc. 93, 585–595.
Jennison, C. & Sharp, R. (2007). Mode jumping in MCMC: adapting proposals to the local
environment. Talk at Conference to Honour Allan Seheult, Durham, March 2007. Available at:
[Link] (accessed July 10, 2010).
Knorr-Held, L. & Rue, H. (2002). On block updating in Markov random field models for disease map-
ping. Scand. J. Statist. 29, 597–614.
Liu, J., Liang, F. & Wong, W. (2000). The multiple-try method and local optimization in Metropolis
sampling. J. Amer. Statist. Assoc. 95, 121–134.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equation of state calcu-
lations by fast computing machines. J. Chem. Phys. 21, 1087–1092.
Mira, A. (2001). On Metropolis-Hastings algorithms with delayed rejection. Metron, 59, 231–241.
Møller, J., Pettitt, A., Reeves, R. & Berthelsen, K. (2006). An efficient Markov chain Monte Carlo
method for distributions with intractable normalising constants. Biometrika 93, 451–458.
Murray, I., Ghahramani, Z. & MacKay, D. (2006). MCMC for doubly-intractable distributions. In
Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), 359–366.
AUAI Press, Arlington, Virginia.
Neal, R. M. (2001). Annealed importance sampling. Statist. Comput. 11, 125–139.
Peskun, P. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60, 607–612.
Robert, C. P. & Casella, G. (2004). Monte Carlo statistical methods, 2nd edn. Springer, New York.
Roberts, G. & Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete
approximations. Bernoulli 2, 341–364.
Storvik, G. (2009). On the flexibility of Metropolis-Hastings acceptance probabilities in auxiliary variable
proposal generation. Statistical Research Report 1, Department of Mathematics, University of Oslo,
Oslo.
Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation. J. Amer.
Statist. Assoc. 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22, 1701–1728.
Tierney, L. & Mira, A. (1999). Some adaptive Monte Carlo methods for Bayesian inference. Stat. Med.
18, 2507–2515.
Tjelmeland, H. & Hegstad, B. (2001). Mode jumping proposals in MCMC. Scand. J. Statist. 28, 205–
223.
Geir Storvik, Department of Mathematics, University of Oslo, P.O. Box 1053, Blindern, N-0316 Oslo,
Norway.
E-mail: geirs@[Link]
∏_{i=1}^{t+1} q_i(x_i* | x*_{i−1}) q_i^{z|x}(z_i* | x*_{i−1}, x_i*),
h_s^{z|x}(z*_{1:t+1}, x*_{1:t} | y, x, y*) = ∏_{i=1}^{s−1} q_i(x_i* | x*_{i−1}) q_i^{z|x}(z_i* | x*_{i−1}, x_i*) × h_s^z(z_s* | x*_{s−1}, x_s*)
× ∏_{i=s}^{t} q_{i+1}(x_i* | x*_{i+1}) q_{i+1}^{z|x}(z*_{i+1} | x_i*, x*_{i+1}).
w_s(y; x*_{1:t+1}) = π(x_s*) / [q_s^z(z_s* | x*_{s−1}) α_{z,s}(x*_{s−1}; x_s*)]   if x_s* ≠ x*_{s−1},
w_s(y; x*_{1:t+1}) = π(x_s*) / [1 − α_{z,s}(x*_{s−1}; z_s*)]                     if x_s* = x*_{s−1}.   (23)
Another option is to choose

h_s^z(z_s* | x*_{s−1}, x_s*) = δ(z_s* − x_s*)         with probability 1 − α_{z,s}(x*_{s−1}, x_s*),
h_s^z(z_s* | x*_{s−1}, x_s*) = q_s(z_s* | x*_{s−1})   with probability α_{z,s}(x*_{s−1}, x_s*).
In that case, the weight function reduces to

w_s(y; z*_{1:t}, x*_{1:t+1}) = π(x_s*) / q_s(x_s* | x*_{s−1})   if x_s* ≠ x*_{s−1},   (24)
w_s(y; z*_{1:t}, x*_{1:t+1}) = 0                                otherwise.
In this case, only those iterations corresponding to acceptance are considered.
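This logic can be sketched as follows: run the inner MH chain and record a positive weight of the form π(x_s*)/q_s(x_s* | x*_{s−1}) only at iterations where the proposal was accepted, and 0 otherwise (the random-walk proposal and unnormalized target here are illustrative assumptions):

```python
import math
import random

def inner_chain_weights(x0, log_pi, t, step=1.0, rng=random):
    """Run t inner random-walk MH steps from x0.  For each step, record
    a weight pi(x_s) / q(x_s | x_{s-1}) if the proposal was accepted and
    0 if it was rejected, mimicking the structure of (24).  pi is
    unnormalized; the weights are meaningful only up to that constant."""
    def log_q(a, b):
        # log density of the N(b, step^2) proposal evaluated at a
        return -0.5 * ((a - b) / step) ** 2 - math.log(step) - 0.5 * math.log(2 * math.pi)

    x, ws = x0, []
    for _ in range(t):
        z = x + rng.gauss(0.0, step)
        if math.log(rng.random()) < min(0.0, log_pi(z) - log_pi(x)):
            ws.append(math.exp(log_pi(z) - log_q(z, x)))   # accepted move
            x = z
        else:
            ws.append(0.0)                                  # rejected move
    return x, ws
```

Only accepted inner moves contribute a non-zero term, matching the statement that only those iterations corresponding to acceptance are considered.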