
Scandinavian Journal of Statistics, Vol. 38: 342–358, 2011
doi: 10.1111/j.1467-9469.2010.00709.x
© 2010 Board of the Foundation of the Scandinavian Journal of Statistics. Published by Blackwell Publishing Ltd.

On the Flexibility of Metropolis–Hastings Acceptance Probabilities in Auxiliary Variable Proposal Generation

GEIR STORVIK
Department of Mathematics, University of Oslo and Centre for Statistics for Innovation

ABSTRACT. Use of auxiliary variables for generating proposal variables within a Metropolis–Hastings setting has been suggested in many different settings. This has in particular been of interest for simulation from complex distributions such as multimodal distributions or in transdimensional approaches. For many of these approaches, the acceptance probabilities that are used appear somewhat magical, and separate proofs of their validity have been given in each case. In this article, we present a general framework for construction of acceptance probabilities in auxiliary variable proposal generation. In addition to showing the similarities between many of the proposed algorithms in the literature, the framework also demonstrates that there is great flexibility in how to construct acceptance probabilities. With this flexibility, alternative acceptance probabilities are suggested. Some numerical experiments are also reported.

Key words: acceptance probabilities, auxiliary variables, Markov chain Monte Carlo,
Metropolis–Hastings algorithms

1. Introduction
Monte Carlo methods, and in particular Markov chain Monte Carlo (MCMC) algorithms,
are nowadays part of the standard set of techniques used by statisticians (Robert & Casella,
2004). Within statistics, the typical applications of Monte Carlo methods are evaluation of likelihoods involving random effects or missing data, and calculation of posterior distributions. In general, Monte Carlo methods can be used to evaluate complex and high-dimensional integrals.
Monte Carlo methods involve sampling from some distribution, π(y) say, which in many cases can efficiently be done through 'standard' MCMC algorithms generating a Markov chain with π(y) as invariant distribution (Metropolis et al., 1953; Hastings, 1970; Gilks et al., 1996; Robert & Casella, 2004). Most MCMC algorithms involve mixtures of different Metropolis–Hastings (MH) steps (including Gibbs steps as particular cases). Given the current state y, an MH step consists of generating a proposal y* ∼ q(y* | y) and switching to the proposal with probability

α(y; y*) = min{1, π(y*)q(y | y*) / [π(y)q(y* | y)]}.
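For concreteness, such an MH step can be sketched as follows (our own illustration, not from the paper; the target, proposal and tuning constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(y, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings step.

    propose(y, rng) draws y* ~ q(. | y); log_q(a, b) = log q(a | b)
    (up to a constant). Returns the next state and an accept flag.
    """
    y_star = propose(y, rng)
    # log acceptance ratio: pi(y*) q(y | y*) / [pi(y) q(y* | y)]
    log_r = (log_pi(y_star) + log_q(y, y_star)) - (log_pi(y) + log_q(y_star, y))
    if np.log(rng.uniform()) < log_r:
        return y_star, True
    return y, False

# Toy example: standard-normal target, random-walk proposal (symmetric, so
# the q-terms cancel, but they are kept for generality).
log_pi = lambda y: -0.5 * y**2
propose = lambda y, rng: y + rng.normal(scale=0.5)
log_q = lambda a, b: -0.5 * ((a - b) / 0.5) ** 2

chain = [0.0]
for _ in range(5000):
    y, _ = mh_step(chain[-1], log_pi, propose, log_q, rng)
    chain.append(y)
```

The chain's empirical mean and standard deviation should approach 0 and 1, the moments of the toy target.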

General Markov chain theory can be used to show convergence and ergodicity results (e.g.
Tierney, 1994). In practice, slow mixing can occur when π(y) has a complex structure, for example multimodality or support covering several models in different spaces. The main problem is to choose
proposal distributions q(y∗ | y) that both allow flexible movements within the sample space
and give reasonable acceptance probabilities. Some guidelines and theoretical results for tuning
variances of proposal distributions are available (Roberts et al., 1997; Brooks et al., 2003).
Efficient sampling algorithms can in many cases be constructed by the use of auxiliary
variables. Several approaches have been suggested within the class of MCMC algorithms

(Tanner & Wong, 1987; Edwards & Sokal, 1988; Besag & Green, 1993; Higdon, 1998;
Damien et al., 1999) but have also been used in combination with importance sampling
(Neal, 2001). The most common way of doing this is to augment the sample space with
some extra variables to construct simpler or more efficient algorithms in this augmented
space. More recently, auxiliary variables have been used as a tool within an MH setting for
either generating better proposals (e.g. Tjelmeland & Hegstad, 2001; Jennison & Sharp, 2007)
or for calculation of acceptance probabilities (e.g., Beaumont, 2003; Andrieu & Roberts,
2009).
The flexibility of choosing proposal distributions has long been recognized, whereas acceptance probabilities are given by the standard MH rule. When extending the state space by introducing auxiliary variables, there is also a flexibility in the choice of acceptance probabilities that to a large extent seems to have gone unnoticed. This flexibility is a consequence of the possibility of choosing different target distributions in the augmented space, and is the focus of this article.
Assume our aim is to sample y* ∼ π(y*). A proposal is generated by a simultaneous simulation of (x*, y*) ∼ q(x*, y* | y). The flexibility follows in that the target distribution, π̄ say, for the combined variables (x*, y*) can be any distribution with π as its marginal distribution. For a given joint proposal distribution q, several possible acceptance probabilities can be obtained by different choices of π̄. By taking this alternative viewpoint, a common understanding of different algorithms suggested in the literature is obtained, and this also serves
as a framework for suggesting new algorithms. This framework will in particular be useful in
cases where standard MCMC algorithms fail and more complicated versions are needed. Typical
examples are algorithms for jumping between possible modes (Tjelmeland & Hegstad, 2001;
Jennison & Sharp, 2007) and improvements of acceptance rates within reversible jump algo-
rithms (Al-Awadhi et al., 2004).
An important special class of algorithms is where x* = (x*_1, . . ., x*_t) is generated in sequence, typically with x*_1 generated through a big jump (for jumping between modes or dimensions) followed by a few steps of smaller jumps. Such x*_i's can in principle live on any space, but in most cases they live on the same space as y. For most such algorithms suggested in the literature, acceptance ratios within an MH setting typically depend only on either the first x*_1 (e.g. Al-Awadhi et al., 2004) or the last x*_t (e.g. Tjelmeland & Hegstad, 2001; Jennison & Sharp, 2007). By combining different target distributions, acceptance ratios depending on averages over all the generated x*_1, . . ., x*_t can be constructed, which can give higher and more stable acceptance probabilities.
Other algorithms that can be considered as special cases of this general framework are
mode or model jumping (Tjelmeland & Hegstad, 2001; Al-Awadhi et al., 2004; Jennison &
Sharp, 2007), multiple-try methods (Liu et al., 2000), proposals based on particle filters
(Andrieu et al., 2010), pseudo-marginal algorithms (Beaumont, 2003; Andrieu & Roberts, 2009),
sampling algorithms for distributions with intractable normalizing constants (Møller et al.,
2006) and delayed rejection sampling (Tierney & Mira, 1999; Green & Mira, 2001; Mira,
2001). Some of these will be considered in more detail later whereas others are treated in
Storvik (2009). Although our focus will be on MH algorithms, the framework can also be
applied within importance sampling, for example, in combination with annealed importance
sampling (Neal, 2001); see Storvik (2009) for details.
We start in section 2 by discussing the flexibility of acceptance probabilities in MH
algorithms when auxiliary variables are used for generating proposals. Section 3 focuses on
sequential generations of auxiliary variables. In section 4, we apply our general results to
some algorithms suggested in the literature and also consider alternative versions of these.
For some of these alternatives, numerical experiments are included, demonstrating that


alternative weights and acceptance probabilities can improve the performance of an algorithm. We conclude the article with discussion and final remarks in section 5.

2. Weight functions within MH algorithms


Our aim is to generate a sequence of variables having invariant distribution π(y). The typical
situation will be that a new y∗ is generated through an auxiliary x∗ with a joint proposal
density q(x∗ , y∗ | y). For weight functions, acceptance ratios and acceptance probabilities, we
will list the variables involved in the order they are generated with semicolon separating
variables that have been generated at the previous iteration and new variables generated.
The first result considers the situation where the auxiliary variables are stored at each iteration. To get full generality, we allow the proposals to depend on the previous x as well.

Proposition 1. Assume the current state y ∼ π(y) and that x | y ∼ h(x | y) for some arbitrary conditional distribution h. Generate (x*, y*) ∼ q(x*, y* | x, y) and put (x′, y′) = (x*, y*) with probability α(x, y; x*, y*) and (x′, y′) = (x, y) otherwise, where the acceptance probability is given by

α(x, y; x*, y*) = min{1, r(x, y; x*, y*)}

with

r(x, y; x*, y*) = π(y*)h(x* | y*)q(x, y | x*, y*) / [π(y)h(x | y)q(x*, y* | x, y)],   (1)

then y′ ∼ π(y′) and also x′ | y′ ∼ h(x′ | y′).

Proof. Consider π(y)h(x | y) as an augmented target distribution. Then (1) corresponds to the ordinary acceptance ratio in an MH step using q(x*, y* | x, y) as proposal. Since π(y) is the marginal distribution of the augmented target distribution, the result follows.

Proposition 1 describes an MH step keeping the target distribution π(y) invariant. There is flexibility in the choice of both q and h. To obtain convergence, irreducibility needs to be considered as well. Note, however, that if irreducibility is a problem, modifications of h can in many cases resolve this without the need for changing the proposal distributions.
The setting described above assumes that the auxiliary variable x generated in the previous
iterations is stored. An alternative version is to assume a new x is generated in a ‘reverse’
proposal. Andrieu & Roberts (2009) show that generating new x’s at each iteration rather than
reusing those generated at the previous iteration can improve the acceptance rates. Storage of
only y also makes it easier to incorporate the procedure into a hybrid MCMC algorithm
cycling through different steps (Robert & Casella, 2004, section 10.3.2), where irreducibility
can be obtained through some of the other steps. Proposition 2 considers this setting.

Proposition 2. Assume y ∼ π(y). Generate (x*, y*) ∼ q(x*, y* | y) and x ∼ h(x | y, x*, y*), where h(x | y, x*, y*) is an arbitrary distribution. Put y′ = y* with probability α(y; x*, y*, x) and y′ = y otherwise, where the acceptance probability is given by

α(y; x*, y*, x) = min{1, r(y; x*, y*, x)}

with

r(y; x*, y*, x) = π(y*)q(x, y | y*)h(x* | y*, x, y) / [π(y)q(x*, y* | y)h(x | y, x*, y*)],   (2)

then y′ ∼ π(y′).

Proof. Consider the joint distribution π̄(y, x*, y*, x) = π(y)q(x*, y* | y)h(x | y, x*, y*), which has π(y) as marginal distribution for the first component. Given y ∼ π(y), we directly obtain (y, x*, y*, x) ∼ π̄(y, x*, y*, x). Then (2) is the ordinary MH acceptance ratio for swapping (y, x*) and (y*, x), proving the result.

The proof shows that the procedure corresponds to a Gibbs step followed by an MH step. Also for proposition 2, irreducibility needs to be considered to obtain convergence and ergodicity results. Note that the proposition defines a Markov chain on y with invariant distribution π(y). The additional requirement for convergence is that the Markov chain is irreducible with respect to π (e.g. theorem 1 in Tierney, 1994). A minimal criterion is that the Markov chain induced by the transition kernel qy(y* | y) = ∫ q(x*, y* | y) dx* is irreducible. In addition, one needs to ensure that the acceptance structure allows a similar irreducibility. A sufficient criterion is that q(x, y | y*)h(x* | y*, x, y) > 0 for any (x*, y*, x) generated from y, but less restrictive conditions can also be constructed. In cases where all distributions involved have full support, irreducibility is directly obtained.
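To make proposition 2 concrete, here is a minimal sketch (our own toy construction, not from the paper): the target is a standard normal, the proposal q first draws x* from a random walk around y and then y* near x*, and h mirrors the first stage of q in the reverse direction. Any such choice of q and h leaves π invariant; the scales are arbitrary tuning choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_norm(a, mean, sd):
    # log density of N(mean, sd^2) at a
    return -0.5 * ((a - mean) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

log_pi = lambda y: log_norm(y, 0.0, 1.0)   # target: standard normal (toy choice)
S = 0.5                                    # scale of the second proposal stage

def log_q(x, y_new, y_cur):
    # log q(x, y_new | y_cur): x ~ N(y_cur, 1), then y_new ~ N(x, S^2)
    return log_norm(x, y_cur, 1.0) + log_norm(y_new, x, S)

def log_h(x, y_cur, x_star, y_star):
    # log h(x | y_cur, x*, y*): reverse first stage, centred at the proposal
    return log_norm(x, y_star, 1.0)

def prop2_step(y, rng):
    x_star = y + rng.normal()              # (x*, y*) ~ q(., . | y)
    y_star = x_star + S * rng.normal()
    x = y_star + rng.normal()              # x ~ h(. | y, x*, y*)
    # acceptance ratio (2), computed term by term
    log_r = (log_pi(y_star) + log_q(x, y, y_star) + log_h(x_star, y_star, x, y)
             - log_pi(y) - log_q(x_star, y_star, y) - log_h(x, y, x_star, y_star))
    return y_star if np.log(rng.uniform()) < log_r else y

y, chain = 0.0, []
for _ in range(20000):
    y = prop2_step(y, rng)
    chain.append(y)
```

Several of the q- and h-terms cancel here; the generic form is kept to match (2).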
The approach in proposition 2 assumes auxiliary variables generated both forwards (x∗ )
and backwards (x). Some algorithms suggested in the literature (e.g. Al-Awadhi et al., 2004)
only consider generation forwards while the same variables are used backwards. This can be
obtained by choosing h to be a degenerate distribution giving probability 1 to a specific value
of x, typically a function of (y*, x*, y). The most direct approach is to choose h(x | y, x*, y*) = δ(x − x*), where δ is Dirac's delta distribution, in which case the acceptance ratio reduces to

r(y; x*, y*, x) = π(y*)q(x, y | y*) / [π(y)q(x*, y* | y)].

If x* = (x*_1, . . ., x*_t) =: x*_{1:t} is generated in sequence, an alternative is to assume x_{1:t} = x*_{t:1} with probability 1. For such a choice,

r(y; x*, y*, x) = π(y*)q(x*_{t:1}, y | y*) / [π(y)q(x*_{1:t}, y* | y)].

We will explore such possibilities further in sections 3 and 4 for specific choices of proposal
distributions.
For both propositions 1 and 2, the acceptance ratios can be written as ratios between weight functions directly related to importance sampling. For instance, in (2),

r(y; x*, y*, x) = w(y; x*, y*, x) / w(y*; x, y, x*)   with   w(y; x*, y*, x) = π(y*)h(x* | y*, x, y) / q(x*, y* | y).   (3)

In addition to demonstrating that the results presented can also be applied to importance sampling, weight functions will in some cases be used instead of acceptance ratios to simplify notation.

If {h_s} is a collection of distributions, a mixture of the h_s's,

h(x | y, x*, y*) = Σ_s a_s h_s(x | y, x*, y*),   a_s ≥ 0,   Σ_s a_s = 1,

is also a distribution. Inserting this into (3), we obtain importance weights

w(y; x*, y*, x) = Σ_s a_s w_s(y; x*, y*, x)   with   w_s(y; x*, y*, x) = π(y*)h_s(x* | y*, x, y) / q(x*, y* | y).   (4)

Similar combinations of weights can be obtained for proposition 1. Such mixtures will be considered in the subsequent sections.

Properties of the acceptance ratios are given in proposition 3.

Proposition 3. Under the assumptions of proposition 2,

E[r(y; x*, y*, x) | y, y*] = π(y*)qy(y | y*) / [π(y)qy(y* | y)] =: r̄(y; y*),   (5)

where qy(y* | y) is the marginal proposal density for y* generated through x*, and

var[r(y; x*, y*, x)] ≥ var[r̄(y; y*)].   (6)

Proof. Define qx|y(x* | y, y*) to be the conditional distribution of x* given both y and y*, that is,

qx|y(x* | y, y*) = q(x*, y* | y) / qy(y* | y).

Then,

E[r(y; x*, y*, x) | y, y*] = ∫∫ r(y; x*, y*, x) qx|y(x* | y, y*) h(x | y, x*, y*) dx dx*,

which by direct insertion of the definition of r(y; x*, y*, x) becomes

∫∫ [π(y*)qy(y | y*) / (π(y)qy(y* | y))] qx|y(x | y*, y) h(x* | y*, x, y) dx dx* = π(y*)qy(y | y*) / [π(y)qy(y* | y)],

proving (5). Further,

var[r(y; x*, y*, x)] = var[E[r(y; x*, y*, x) | y, y*]] + E[var[r(y; x*, y*, x) | y, y*]]
≥ var[π(y*)qy(y | y*) / (π(y)qy(y* | y))] = var[r̄(y; y*)].

Note that since min{1, x} is a concave function, by Jensen's inequality,

E[α(y; x*, y*, x) | y, y*] ≤ min{1, r̄(y; y*)},

showing that the optimal acceptance ratio [in Peskun's (1973) sense] would be to use r̄(y; y*). In practice, qy(y* | y) will not be possible to evaluate, but by clever choices of h, we hopefully get close to this.

3. Sequential generation of auxiliary variables


In this section, we will discuss the use of different acceptance ratios in the case where x∗ is
generated through a sequence of steps. Such schemes have been considered by, for example,
Neal (2001) in his annealed importance sampling scheme, by Jennison & Sharp (2007) who
proposed a method for mode jumping by first applying one big jump and thereafter several
smaller moves and by Al-Awadhi et al. (2004) within a reversible jump MCMC framework.
In the notation of proposition 2, we assume

q(x*, y* | y) = ∏_{i=1}^{t+1} q_i(x*_i | x*_{i−1}),   (7)



where x*_0 ≡ y and x*_{t+1} ≡ y*. The typical setup we will consider is where for i ≥ 2 the proposal densities satisfy the detailed balance criterion

π(x)q_i(y | x) = π(y)q_i(x | y).   (8)

Such proposals can for instance be used for target distributions with several modes, where q_1 corresponds to a large jump whereas the subsequent moves are ordinary MCMC ones. This particular setting will be further discussed in section 4.1. Another possibility is model or dimension moves in a reversible jump setting, see Storvik (2009). Allowing the additional moves also to depend on i gives the possibility for different blocks of a high-dimensional state vector to be updated at each step.

A consequence of assumption (8) is that for s ≥ 1

∏_{i=s}^{t} q_{i+1}(x*_i | x*_{i+1}) / q_{i+1}(x*_{i+1} | x*_i) = π(x*_s) / π(x*_{t+1}),   (9)

which will be used repeatedly.
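Identity (9) is easy to verify numerically for any reversible kernel; the sketch below (ours, not from the paper) does so for a three-state MH kernel, where all transition probabilities are available in closed form, so the telescoping product can be checked to machine precision.

```python
import numpy as np

# A small reversible kernel: MH on three states with target pi and a uniform
# proposal over the two other states.
pi = np.array([0.2, 0.3, 0.5])
P = np.zeros((3, 3))
for a in range(3):
    for b in range(3):
        if a != b:
            P[a, b] = 0.5 * min(1.0, pi[b] / pi[a])
    P[a, a] = 1.0 - P[a].sum()

# detailed balance (8): pi(a) P(a -> b) = pi(b) P(b -> a)
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)

# simulate a path x_s, ..., x_{t+1} and form the telescoping product in (9)
rng = np.random.default_rng(4)
path = [0]
for _ in range(10):
    path.append(rng.choice(3, p=P[path[-1]]))

ratio = 1.0
for a, b in zip(path[:-1], path[1:]):
    ratio *= P[b, a] / P[a, b]   # q(x_i | x_{i+1}) / q(x_{i+1} | x_i)
# (9): the product equals pi(x_s) / pi(x_{t+1})
```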
Our aim is to construct acceptance probabilities for the proposal y* = x*_{t+1} when y is the current state. A complicating factor is that the sequential approach for generating y* will not make the proposal distribution for y* given y directly available. Note that if q_1 does not depend on y, we obtain an independence sampler where the proposal is generated by a sequence of internal MCMC steps, see section 4.1 for an example.

Writing x*_{1:t+1} = (x*_1, . . ., x*_{t+1}), the general weight function will have the form

w(y; x*_{1:t+1}, x) = π(x*_{t+1}) h(x*_{1:t} | x*_{t+1}, x, y) / ∏_{i=1}^{t+1} q_i(x*_i | x*_{i−1}).

A wide variety of weight functions can be considered by different choices of h. Consider the class of distributions

h_s(x*_{1:t} | x*_{t+1}, x, y) = ∏_{i=1}^{s−1} q_i(x*_i | x*_{i−1}) ∏_{i=s}^{t} q_{i+1}(x*_i | x*_{i+1}).   (10)

Under the aforegiven assumptions, from (9), we obtain

w_s(y; x*_{1:t+1}) = π(x*_{t+1}) ∏_{i=s}^{t} q_{i+1}(x*_i | x*_{i+1}) / [q_s(x*_s | x*_{s−1}) ∏_{i=s+1}^{t+1} q_i(x*_i | x*_{i−1})] = π(x*_s) / q_s(x*_s | x*_{s−1}).   (11)

The weights only depend on the ratio between the marginal distributions and the transition kernels for different components of the auxiliary variables. Note that the proposal distributions q_i, i ≤ s, can in fact be arbitrary [i.e. they do not need to satisfy detailed balance (8)]. The special case s = 1 gives

w_1(y; x*_{1:t+1}) = π(x*_1) / q_1(x*_1 | y),   (12)

which corrects for using q_1 to generate the first sample but then utilizes that the following samples keep the distribution invariant with respect to π. This weight function does, however, not account for the fact that subsequent samples will be closer to π.
Another special case is s = t + 1, which gives weight

w_{t+1}(y; x*_{1:t+1}) = π(y*) / q_{t+1}(y* | x*_t),   (13)

depending only on the target density and the transition to y*. Note that in this case no assumptions on q_i need to be made for any i, although q_{t+1}(y* | x*_t) must be possible to compute. Also this choice suffers from the weakness of not taking into account that, as t increases, x*_t

converges in distribution to π, which should give a weight approaching 1. This can, however, be accomplished by combining the weights. Using (4), a proper weight function is also given by

w(y; x*_{1:t+1}) = Σ_{s=1}^{t+1} a_s w_s(y; x*_{1:t+1}),   (14)

resulting in an acceptance ratio

r(y; x*_{1:t+1}) = Σ_{s=1}^{t+1} a_s w_s(y; x*_{1:t+1}) / Σ_{s=1}^{t+1} a_s w_s(y*; x_{1:t+1}).   (15)

A particularly interesting case is a_s = (t + 1 − b)^{−1} I(s ≥ b), with b corresponding to some 'burn-in' period for the proposal generation. In this case, w(y; x*_{1:t+1}) is a standard MCMC estimate of E[w(y; x*_{1:t+1})]. If the further requirements for the Markov chain using q_b, . . ., q_{t+1} as transition probabilities to be ergodic are satisfied, the weight function will converge towards its expectation. Since this expectation is equal to 1, this implies that the acceptance rates also converge towards 1. This indicates that the weight function indeed reflects that x*_{t+1} converges in distribution to π(y) as t grows.

In some cases, updates are made component-wise (or block-wise). Similar to the non-reversibility of the Gibbs sampler, some extra care needs to be taken under systematic scan updates. The weights (11) can, however, still be shown to be valid.
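A small numerical illustration of the weights (11) and a burn-in-averaged version in the spirit of (14) (our own sketch, not from the paper): the target is N(0, 1), q_1 is an arbitrary large jump, and the inner moves use an AR(1) kernel x′ | x ∼ N(ρx, 1 − ρ²), which is exactly π-reversible so that (8) holds. The constants ρ, the jump scale, t and the burn-in b are illustrative tuning choices; the check verifies that the averaged weight has expectation 1.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_norm(a, mean, sd):
    return -0.5 * ((a - mean) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

log_pi = lambda x: log_norm(x, 0.0, 1.0)   # toy target: N(0, 1)

RHO, S_BIG, T, B = 0.4, 5.0, 8, 4   # illustrative tuning choices
SD = np.sqrt(1 - RHO**2)            # AR(1) kernel N(rho*x, 1-rho^2) is
                                    # pi-reversible, so (8) and (11) hold

def averaged_weight(y, rng):
    """Forward path x*_1, ..., x*_{t+1}; weights (11) averaged over s >= B."""
    xs = [y + S_BIG * rng.normal()]          # q_1: big, arbitrary jump
    for _ in range(T):                       # inner AR(1) moves
        xs.append(RHO * xs[-1] + SD * rng.normal())
    w = [np.exp(log_pi(xs[0]) - log_norm(xs[0], y, S_BIG))]      # w_1, (12)
    for s in range(1, T + 1):                # w_s = pi(x*_s)/q_s(x*_s|x*_{s-1})
        w.append(np.exp(log_pi(xs[s]) - log_norm(xs[s], RHO * xs[s - 1], SD)))
    return np.mean(w[B - 1:])                # uniform a_s over s >= B

# Each w_s has expectation 1 over the proposal mechanism, so the averaged
# weight does too; a Monte Carlo sanity check:
wbar = np.mean([averaged_weight(0.0, rng) for _ in range(20000)])
```

Discarding the first B − 1 weights mirrors the burn-in choice a_s = I(s ≥ b) above and avoids the heavy-tailed early weights right after the big jump.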
To use the proposed weight functions, it must be possible to calculate the transition probabilities q_s(x*_s | x*_{s−1}). When x*_{1:t+1} is generated through internal MH steps, such transition probabilities are not always directly available (this is particularly the case when an internal proposal is not accepted). There are different ways around this, mainly based on the idea of including the internal proposals as part of the auxiliary variables. The details on this are given in the Appendix.

4. Examples
In Storvik (2009), several algorithms suggested in the literature are shown to fit into the
general framework discussed in the previous sections. These include annealed importance
sampling (Neal, 2001), reversible jump MCMC for model selection (Al-Awadhi et al., 2004;
Andrieu & Roberts, 2009), multiple-try methods and particle proposals (Liu et al., 2000;
Andrieu et al., 2010) and the pseudo-marginal algorithm (Beaumont, 2003; Andrieu &
Roberts, 2009). Here, we will consider some specific examples where alternative weights are
proposed. For some of these examples, numerical experiments are performed to compare the
different weights.

4.1. Mode jumping


Tjelmeland & Hegstad (2001) and Jennison & Sharp (2007) considered two quite similar
approaches for performing mode jumping within MCMC. Here, we will consider the approach
by Jennison & Sharp (2007), referring to Storvik (2009) for a discussion on the approach by
Tjelmeland & Hegstad (2001).
The approach is based on starting with a large jump followed by a sequence of (typically smaller) MCMC steps. To be specific, x*_1 = y + ε where ε ∼ f(ε), a distribution symmetric around zero, whereas x*_s ∼ q(x*_s | x*_{s−1}) for s = 2, . . ., t and finally y* ∼ q_{t+1}(y* | x*_t). Here q is a transition kernel leaving π invariant. A different transition at the last step is chosen for computational reasons, see below.


Consider the setting of proposition 2. In the specification of h(x | y, x*, y*), it is important to allow for a similar large jump in the 'backwards' move. We do so by putting x_1 = y* − ε. By choosing

h(x | y, x*, y*) = δ(x_1 − y* + ε) ∏_{i=2}^{t} q(x_i | x_{i−1}),

we obtain through proposition 2 (for ε = x*_1 − y = y* − x_1) the acceptance ratio

r(y; x*, y*, x) = [π(y*) f(−ε) ∏_{i=2}^{t} q(x_i | x_{i−1}) q_{t+1}(y | x_t) ∏_{i=2}^{t} q(x*_i | x*_{i−1})] / [π(y) f(ε) ∏_{i=2}^{t} q(x*_i | x*_{i−1}) q_{t+1}(y* | x*_t) ∏_{i=2}^{t} q(x_i | x_{i−1})]
= π(y*) q_{t+1}(y | x_t) / [π(y) q_{t+1}(y* | x*_t)],

which is equal to the acceptance ratio obtained by Jennison & Sharp (2007). To get a computationally tractable density, Jennison & Sharp (2007) suggested choosing q_{t+1}(y* | x*_t) as a local approximation to π(y).
As discussed in the Appendix, q_{t+1}(y* | x*_t) can also be defined through an MH step. Within such a setting, alternative constructions can be considered. Assume q_{t+1} = q. Similar to (10), define

h_s(x | y*, x*, y) = δ(x_1 − y* + ε) ∏_{i=2}^{s−1} q(x_i | x_{i−1}) ∏_{i=s}^{t} q(x_i | x_{i+1}),

resulting in an acceptance ratio

r_s(y; x*, y*, x) = [π(x*_s) / q(x*_s | x*_{s−1})] / [π(x_s) / q(x_s | x_{s−1})].   (16)

Using (15) with a_s = 1/t, we obtain an acceptance ratio equal to

r(y; x*, y*, x) = Σ_{s=2}^{t+1} [π(x*_s) / q(x*_s | x*_{s−1})] / Σ_{s=2}^{t+1} [π(x_s) / q(x_s | x_{s−1})].   (17)

In general, if q is capable of moving efficiently around the whole state space, this ratio will converge to 1. In more practical settings where q is only able to move within the current mode, the ratio will converge towards the ratio between the masses of the corresponding modes.
A numerical example on a Gaussian mixture model. Consider a model in R^p,

π(y) = λ N(y; μ_1, I) + (1 − λ) N(y; μ_2, I),

where μ_{1,1} = −10, μ_{2,1} = 10, whereas μ_{i,j} = 0 for i = 1, 2 and j = 2, . . ., p. This corresponds to a model with multiple modes so separated that ordinary MCMC methods will get stuck in one of the modes.

We will consider an MH algorithm where proposals are generated sequentially as described in section 3 with a starting value generated from N(μ_1, I).
A proposal y* is generated by simulating a sequence x*_1, . . ., x*_{t+1} with y* = x*_{t+1}. Here, x*_1 ∼ N(0, σ²_large I), followed by a set of t 'inner' MH steps using a discrete version of the Langevin diffusion (Besag, 1994; Roberts & Tweedie, 1996), where proposals z*_j are generated by

z*_j ∼ N(x*_{j−1} + ½ σ²_small ∇ log π(x*_{j−1}), σ²_small I)

and proposals are accepted using ordinary MH acceptance rates. The number of 'outer' MH steps, that is, the number of generated y*'s, will be denoted by M.
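A simplified, runnable 1-D sketch of this kind of sampler (our own, taking liberties with the paper's setup): to keep all transition densities tractable we use unadjusted Langevin kernels for the inner moves instead of full MH steps, and accept with the ratio π(y*)q(y | x_t)/[π(y)q(y* | x*_t)] from the derivation above, which remains valid for arbitrary inner kernels. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# two well-separated modes (1-D version of the experiment's target)
MU, S_LARGE, H, T = 10.0, 15.0, 1.0, 5

def log_pi(x):
    return np.logaddexp(np.log(0.5) - 0.5 * (x + MU) ** 2,
                        np.log(0.5) - 0.5 * (x - MU) ** 2)

def grad_log_pi(x):
    a1 = np.log(0.5) - 0.5 * (x + MU) ** 2
    a2 = np.log(0.5) - 0.5 * (x - MU) ** 2
    r1 = np.exp(a1 - np.logaddexp(a1, a2))   # responsibility of mode -MU
    return -r1 * (x + MU) - (1 - r1) * (x - MU)

def langevin_mean(x):            # mean of the (unadjusted) Langevin kernel
    return x + 0.5 * H * grad_log_pi(x)

def log_q(a, b):                 # log q(a | b), variance H (constants cancel)
    return -0.5 * (a - langevin_mean(b)) ** 2 / H

def mode_jump_step(y, rng):
    eps = S_LARGE * rng.normal()             # big symmetric jump
    x = y + eps
    for _ in range(T):                       # forward inner Langevin moves
        x = langevin_mean(x) + np.sqrt(H) * rng.normal()
    y_star = langevin_mean(x) + np.sqrt(H) * rng.normal()
    xb = y_star - eps                        # mirrored backward jump
    for _ in range(T):                       # backward inner moves
        xb = langevin_mean(xb) + np.sqrt(H) * rng.normal()
    # acceptance ratio pi(y*) q(y | x_t) / [pi(y) q(y* | x*_t)]
    log_r = log_pi(y_star) + log_q(y, xb) - log_pi(y) - log_q(y_star, x)
    return y_star if np.log(rng.uniform()) < log_r else y

y, chain = -MU, []
for _ in range(3000):
    y = mode_jump_step(y, rng)
    chain.append(y)
frac_right = np.mean(np.array(chain) > 0)   # both modes should be visited
```

Starting in the left mode, the chain should spend a substantial fraction of its time in each mode.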
The final proposals y* are accepted with probabilities given through the following alternative acceptance ratios:

(i) equation (16) with s = 1,
(ii) equation (16) with s = t + 1, or
(iii) equation (17).

All the alternatives are used in combination with (23). The full algorithm was restarted N times and the last sample was stored in each case, giving samples y_1, . . ., y_N. As a measure of the performance of the different weight functions,

d(F̂_1, F_1) = sup_{y_1} |F̂_1(y_1) − F_1(y_1)|   (18)

was used, where F̂_1 is the empirical distribution function based on the first component of the samples y_1, . . ., y_N, whereas F_1 is the corresponding true marginal cumulative distribution.
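The distance (18) is the Kolmogorov–Smirnov statistic; a small sketch of how it can be computed (our code, with the two-mode target of this section as the reference distribution):

```python
import math
import numpy as np

def phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mixture_cdf(x, mu1=-10.0, mu2=10.0, lam=0.5):
    # true marginal CDF of the first component under the two-mode target
    return lam * phi(x - mu1) + (1 - lam) * phi(x - mu2)

def ks_distance(samples, cdf):
    """sup_x |F_hat(x) - F(x)| for the empirical CDF of `samples`."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    F = np.array([cdf(v) for v in s])
    upper = np.arange(1, n + 1) / n - F   # F_hat just after each jump
    lower = F - np.arange(0, n) / n       # F_hat just before each jump
    return float(max(upper.max(), lower.max()))

# sanity check: exact draws from the mixture should give a small distance
rng = np.random.default_rng(2)
comp = rng.uniform(size=5000) < 0.5
y1 = np.where(comp, rng.normal(-10, 1, 5000), rng.normal(10, 1, 5000))
d = ks_distance(y1, mixture_cdf)
```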
Figure 1A–E shows boxplots of d(F̂_1, F_1) with N = 10,000, M = 20, t + 1 = 10 and with σ_large = 15, σ_small = 1.5. The different panels correspond to different dimensions p. The boxplots are obtained by repeating the experiments 100 times. We see in all cases that the first weight function does not perform that well, an expected property given the earlier discussion about this particular choice. Both the other choices give significant improvements, with the third weight function performing best in all the cases. The difference between these seems

Fig. 1. Boxplot of d(F̂ 1 , F1 ) for the Gaussian mixture distribution. The different panels correspond to
results for different dimensions p = 1, 2, 3, 5, 10. The experimental setup is described in the text. Panels
(A)–(E) correspond to M = 20, whereas panel (F) corresponds to M = 40. The numbers on the x-axis
correspond to the different types of acceptance ratios.



Fig. 2. Mixture Gaussian distribution with dimension p = 10 and number of outer Markov chain Monte Carlo iterations M = 20. True (shaded) and examples of estimated cumulative distributions of the first component. Solid line corresponds to the first weight function, dashed line to the second weight function and dotted line to the third weight function.

to depend on dimension, though. In particular, it seems that the differences between them decrease with dimension, but then increase for p = 10. This is probably because, for p = 10, 20 iterations of the outer MCMC are not that many. Increasing M to 40 (Fig. 1F), the similarity between the second and the third weight functions is again obtained.
Figure 2 shows the true cumulative distribution for the first component and typical examples of empirical distributions based on the three different weight functions for p = 10 and M = 20. The differences between the choices of weight functions shown in Fig. 1 are clearly seen. Note in particular that the third weight function produces estimates that are indistinguishable from the truth, indicating that convergence has been reached even with this small number of MCMC iterations.

4.2. Delayed rejection sampling


Tierney & Mira (1999) suggested a method for composing a new (possibly different) proposal
in an MH setting when the first proposal was rejected. This approach was further discussed
and extended in Mira (2001) and Green & Mira (2001).
Consider a situation where, given the current state y, a first proposal x*_1 is generated by q_1(x*_1 | y) and accepted with a probability α_1(y; x*_1). If rejected, a new proposal x*_2 is generated by q_2(x*_2 | y, x*_1), and this new proposal is accepted with probability α_2(y, x*_1; x*_2). We will see how this approach can be seen as a special case of proposition 2 and that alternative acceptance probabilities can be derived.

Define x* = (x*_1, x*_2) to be our auxiliary variable and q_{y|x}(y* | y, x*_1, x*_2) to be a discrete distribution with y* = x*_1 with probability α_1(y; x*_1) and y* = x*_2 otherwise:

q_{y|x}(y* | y, x*_1, x*_2) = α_1(y; x*_1) I(y* = x*_1) + [1 − α_1(y; x*_1)] I(y* = x*_2).

Assume first that, in addition to (x*_1, x*_2), a pair (x_1, x_2) is also generated, using the proposal distribution h(x_1, x_2 | y, x*_1, x*_2, y*). To get the variables to work within the same spaces, we make h(x_1, x_2 | y, x*_1, x*_2, y*) degenerate in the sense that either x_1 or x_2 is equal to y, that is,

h(x_1, x_2 | y, x*_1, x*_2, y*) = δ(x_1 − y) h_2(x_2 | y, x*_2, y*) if y* = x*_1,
h(x_1, x_2 | y, x*_1, x*_2, y*) = δ(x_2 − y) h_1(x_1 | y, x*_1, y*) if y* = x*_2.

We consider the two possibilities for y* separately.

y* = x*_1: In this case x_1 = y, and by applying proposition 2, we obtain


r(y; x*, y*, x) = [π(y*) q_1(y | y*) q_2(x_2 | y*, y) α_1(y*; y) h_2(x*_2 | y*, x_2, y)] / [π(y) q_1(y* | y) q_2(x*_2 | y, y*) α_1(y; y*) h_2(x_2 | y, x*_2, y*)]
= [q_2(x_2 | y*, y) h_2(x*_2 | y*, x_2, y)] / [q_2(x*_2 | y, y*) h_2(x_2 | y, x*_2, y*)],

where we have used that π(y*) q_1(y | y*) α_1(y*; y) = π(y) q_1(y* | y) α_1(y; y*) for a standard MH choice of α_1(y; y*). For h_2(x*_2 | y*, x_2, y) = q_2(x*_2 | y, y*), this reduces to 1, which corresponds nicely to accepting the first proposal x*_1 with probability α_1(y; x*_1).
y* = x₂*: In this case x₂ = y and we obtain

   r(y; x*, y*, x) = [π(y*) q₁(x₁ | y*) q₂(y | y*, x₁) [1 − α₁(y*; x₁)] h₁(x₁* | y*, x₁, y)]
                     / [π(y) q₁(x₁* | y) q₂(y* | y, x₁*) [1 − α₁(y; x₁*)] h₁(x₁ | y, x₁*, y*)].

Choosing now h₁(x₁ | y, x₁*, y*) to be degenerate in that x₁ = x₁* with probability 1, this reduces
to

   r(y; x*, y*, x) = [π(y*) q₁(x₁* | y*) q₂(y | y*, x₁*) [1 − α₁(y*; x₁*)]]
                     / [π(y) q₁(x₁* | y) q₂(y* | y, x₁*) [1 − α₁(y; x₁*)]],      (19)
which is equal to the acceptance ratio given by Tierney & Mira (1999). An alternative in this
case is to choose h₁(x₁ | y, x₁*, y*) = q₁(x₁ | y*), in which case the acceptance ratio reduces to

   r(y; x*, y*, x) = [π(y*) q₂(y | y*, x₁) [1 − α₁(y*; x₁)]] / [π(y) q₂(y* | y, x₁*) [1 − α₁(y; x₁*)]].      (20)

This acceptance probability might be a reasonable alternative when one can choose between
a (computationally) cheap proposal q₁ and a more expensive q₂ that is close to π. Note that
in the special case where q₂ = π, the acceptance ratio reduces to

   r(y; x*, y*, x) = [1 − α₁(y*; x₁)] / [1 − α₁(y; x₁*)].

With q₁ a bad approximation to π, both acceptance probabilities α₁(y; x₁*) and
α₁(y*; x₁) will typically be small, making r(y; x*, y*, x) close to 1, as it should be.
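To make the two-stage mechanism concrete, the following sketch implements delayed rejection with the standard acceptance ratio (19) for a one-dimensional standard-normal target. The target, the Gaussian random-walk proposals and the scales sd1, sd2 are illustrative assumptions, not taken from the text; with symmetric proposals the q₂ terms in (19) cancel, leaving the q₁ density ratio and the 1 − α₁ terms.

```python
import math
import random

def log_target(x):
    # illustrative target: standard normal, up to an additive constant
    return -0.5 * x * x

def norm_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def alpha1(y, x1):
    # first-stage acceptance probability; the symmetric q1 densities cancel
    return math.exp(min(0.0, log_target(x1) - log_target(y)))

def dr_step(y, sd1=5.0, sd2=0.5):
    # stage 1: a bold symmetric random-walk proposal
    x1 = y + sd1 * random.gauss(0.0, 1.0)
    a1 = alpha1(y, x1)
    if random.random() < a1:
        return x1
    # stage 2: a timid proposal, accepted with ratio (19); the symmetric q2
    # densities cancel, so only pi, q1 and the 1 - alpha1 terms remain
    x2 = y + sd2 * random.gauss(0.0, 1.0)
    a1_rev = alpha1(x2, x1)
    if a1_rev >= 1.0:
        return y  # the numerator of (19) vanishes
    log_r = (log_target(x2) - log_target(y)
             + norm_logpdf(x1, x2, sd1) - norm_logpdf(x1, y, sd1)
             + math.log(1.0 - a1_rev) - math.log(1.0 - a1))
    if math.log(random.random()) < log_r:
        return x2
    return y

random.seed(1)
chain = [0.0]
for _ in range(20000):
    chain.append(dr_step(chain[-1]))
m = sum(chain) / len(chain)
v = sum((c - m) ** 2 for c in chain) / len(chain)
print(round(m, 2), round(v, 2))
```

The bold first stage explores widely and the timid second stage rescues rejected moves locally; on this toy target the chain's mean and variance should settle near 0 and 1.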
A numerical experiment on a multiple precision model. Consider a model where observations
z ∈ R^p follow

   zᵢ = μ + λᵢ + εᵢ,

where ε₁, ..., ε_p are i.i.d. zero-mean Gaussian variables with precision τ₁, whereas λ₁, ..., λ_p are
independent of ε₁, ..., ε_p and are spatially correlated variables with a conditional autoregressive
(CAR) structure (Besag, 1974; Besag & Kooperberg, 1995). The conditional distributions are
given by

   λᵢ | λ₋ᵢ ~ N( nᵢ⁻¹ ∑_{j∼i} λⱼ , (nᵢ τ₂)⁻¹ ),

where λ₋ᵢ is the collection of all λⱼs except λᵢ, j ∼ i means that j is a neighbour of i and nᵢ is the
number of neighbours of i. Our interest will be in the posterior distribution of τ = (τ₁, τ₂).
Assuming independent Gamma priors for τ₁ and τ₂, both with shape parameter α and rate
parameter β, the posterior distribution has the form

   π(τ) ∝ τ₁^{α−1} exp(−βτ₁) τ₂^{α−1} exp(−βτ₂) |Σ|^{−1/2} exp[−½ (z − μ)ᵀ Σ⁻¹ (z − μ)].

He et al. (2007) state that such distributions are in many cases 'L-shaped with two long arms
pressed tightly along one or both coordinate axes', which is demonstrated in the top panel


Fig. 3. Top panel: posterior distribution of (τ₁, τ₂) in the conditional autoregressive model based on 50
simulated observations on a 5 × 5 grid using μ = 0, α = β = 0.25 and τ₁ = 1, τ₂ = 1/3. Bottom panel: distance,
as defined by (22), between the true and empirical distribution of τ₂ as a function of the number of Markov
chain Monte Carlo iterations for the 'standard' delayed rejection sampling acceptance probability (19),
solid line, and the alternative acceptance probability (20), dashed line.

of Fig. 3, which shows the posterior distribution obtained from simulated values of z₁, ..., z_p on
a 5 × 5 grid using μ = 0, α = β = 0.25, τ₁ = 1 and τ₂ = 1/3.
We will consider a delayed rejection sampling algorithm where, at the first stage, proposals
τ*_{j,1}, j = 1, 2, are generated through the scale proposals suggested by Knorr-Held & Rue (2002),
that is, τ*_{j,1} = τⱼ fⱼ, j = 1, 2, where the scale factor fⱼ ~ q_f(f) ∝ (1 + f⁻¹) I[F⁻¹ ≤ f ≤ F], and
where F > 1 is a tuning parameter. At the second stage, a proposal is generated through the
reparametrization

   κ = τ₁τ₂ / (τ₁ + τ₂),   r = τ₂ / (τ₁ + τ₂),      (21)

where a new r is generated uniformly on [0, 1], whereas κ is generated through the posterior of
κ given r, which can be shown to be a Gamma distribution with shape parameter 2α + n and
rate parameter [β + 0.5·ssq]/[r(1 − r)], where ssq is the sum of squares using the covariance
matrix divided by κ. New proposals τ*_{1,2}, τ*_{2,2} are then given by the inverse transformation of
(21). This second proposal utilizes more of the structure involved and can be considered
a proposal closer to the target distribution.
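The reparametrization (21) and its inverse can be sketched as follows; the name kappa is our label for the first transformed parameter (the original symbol is not recoverable from the extracted text), and the numbers merely check invertibility at the values used in Fig. 3:

```python
def to_kappa_r(tau1, tau2):
    # forward transformation (21)
    s = tau1 + tau2
    return tau1 * tau2 / s, tau2 / s

def to_tau(kappa, r):
    # inverse transformation: tau1 = kappa / r, tau2 = kappa / (1 - r)
    return kappa / r, kappa / (1.0 - r)

kappa, r = to_kappa_r(1.0, 1.0 / 3.0)  # tau1 = 1, tau2 = 1/3 as in Fig. 3
tau1, tau2 = to_tau(kappa, r)
print(kappa, r, tau1, tau2)
```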
Our aim will be to compare the two alternative acceptance probabilities (19) and (20). As
a measure of the performance of the different weight functions, the distance

   d(F̂_{τ₂}, F_{τ₂}) = sup_{τ₂} | F̂_{τ₂}(τ₂) − F_{τ₂}(τ₂) |      (22)


between the true cumulative distribution function for τ₂, F_{τ₂}(τ₂) (evaluated through numerical
integration), and the corresponding estimate based on the samples, F̂_{τ₂}(τ₂), was used.
The bottom panel of Fig. 3 shows the distance measure (22) as a function of the number of
MCMC iterations for the 'standard' delayed rejection sampling acceptance probability (19)
(solid line) and the alternative acceptance probability (20) (dashed line), both of which
should converge towards 0 as the number of iterations increases to infinity. We clearly
see the improvement for the alternative acceptance probability.
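The distance (22) is the Kolmogorov–Smirnov statistic between the empirical CDF of the sampled values and a reference CDF. A minimal sketch, using an Exp(1) reference in place of the CAR posterior (an illustrative assumption):

```python
import math
import random

def sup_distance(samples, cdf):
    # sup_x |F_hat(x) - F(x)|; for a step-function empirical CDF the
    # supremum is attained just before or at a jump, so it suffices to
    # scan the order statistics
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

random.seed(0)
draws = [random.expovariate(1.0) for _ in range(2000)]
d = sup_distance(draws, lambda x: 1.0 - math.exp(-x))
print(round(d, 3))
```

For i.i.d. draws from the reference, d shrinks at rate n^{-1/2}; for a correlated MCMC sample it decreases more slowly, which is exactly what the bottom panel of Fig. 3 tracks.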

4.3. Distributions with intractable normalizing constants


Møller et al. (2006) considered the problem of drawing from a posterior distribution

   π(y) = p(y | z) = C⁻¹ p(y) p(z | y),

where the likelihood for data z is

   p(z | y) = Z⁻¹(y) p̃(z | y).

Here, both the normalizing constant of the posterior, C, and the normalizing constant
defining the likelihood, Z(y), are unknown (a problem often encountered in spatial
modelling). It is, however, assumed that it is possible to simulate from p(z | y). The data z
are dropped from the notation π(y) since they are considered fixed. Møller et al. (2006)
proposed an MCMC algorithm that we now describe in the setting of proposition 1.
Define an augmented distribution π̄(y, x) = π(y) h(x | y). Assume (x, y) is the current state
and generate a proposal (x*, y*) through q(x*, y* | x, y) = q_y(y* | x, y) p(x* | y*). From proposi-
tion 1, we obtain the weight function

   w(x, y; x*, y*) = [C⁻¹ p(y*) Z⁻¹(y*) p̃(z | y*) h(x* | y*)] / [q_y(y* | x, y) Z⁻¹(y*) p̃(x* | y*)]
                   = [C⁻¹ p(y*) p̃(z | y*) h(x* | y*)] / [q_y(y* | x, y) p̃(x* | y*)]

and an acceptance ratio

   r(x, y; x*, y*) = [p(y*) p̃(z | y*) h(x* | y*) q_y(y | x*, y*) p̃(x | y)] / [p(y) p̃(z | y) h(x | y) q_y(y* | x, y) p̃(x* | y*)].
Note that neither C nor Z⁻¹(·) is involved in this expression. Here, h(x | y) has the same state
space as z but may otherwise be arbitrary (and may even depend on z). According to
proposition 1, the algorithm has π(y)h(x | y) as invariant distribution. This is the
same ratio as obtained by Møller et al. (2006), who also discussed several options for choosing
the h distribution.
Consider now the case where q_y(y* | y, x) = q_y(y* | y), the special case that Møller et al. (2006)
considered further after their general description of the algorithm. Through proposi-
tion 2 we are then able to construct an alternative algorithm in which x does not need to be
stored from one iteration to the next. Using the same q function as before, but now simulating
x ~ h(x | y, x*, y*) in addition to x*, y*, we obtain

   r(x*, y* | x, y) = [h(x* | y*, x, y) p(y*) p̃(z | y*) q_y(y | y*) p̃(x | y)] / [h(x | y, x*, y*) p(y) p̃(z | y) q_y(y* | y) p̃(x* | y*)].
In this case, more general choices of the h distribution can be considered. One option is
to choose h(x | y, x*, y*) = δ(x − x*), making the generation of an extra x unnecessary. This is an
important special case, since simulation from p(x | y) can be computationally costly. In this
case, the acceptance ratio reduces to

   r(x*, y* | x, y) = [p(y*) p̃(z | y*) q_y(y | y*) p̃(x* | y)] / [p(y) p̃(z | y) q_y(y* | y) p̃(x* | y*)].


As noted by an anonymous referee, this results in the exchange algorithm by Murray et al.
(2006) and thereby serves as a unification of the two methods for dealing with unknown
normalizing constants proposed by Møller et al. (2006) and Murray et al. (2006).
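To illustrate the reduced ratio, here is a sketch of the resulting exchange-type algorithm on a toy model with p̃(z | y) = exp(−yz) and Z(y) = 1/y, so that exact simulation from p(· | y) is an Exp(y) draw. The Exp(1) prior, the observed value z and the random-walk proposal are illustrative assumptions; with a symmetric proposal the q_y terms cancel.

```python
import math
import random

z = 0.5  # hypothetical observed data point

def log_ptilde(u, y):
    # unnormalized log-likelihood, log p~(u | y) = -y * u
    return -y * u

def exchange_step(y, sd=0.5):
    y_star = y + sd * random.gauss(0.0, 1.0)  # symmetric random walk
    if y_star <= 0.0:
        return y
    x_star = random.expovariate(y_star)  # exact draw from p(. | y_star)
    # reduced acceptance ratio: prior x likelihood x auxiliary terms,
    # with Z(y), Z(y_star) and the symmetric q_y terms cancelled
    log_r = ((-y_star) + log_ptilde(z, y_star) + log_ptilde(x_star, y)
             - ((-y) + log_ptilde(z, y) + log_ptilde(x_star, y_star)))
    if math.log(random.random()) < log_r:
        return y_star
    return y

random.seed(2)
chain = [1.0]
for _ in range(30000):
    chain.append(exchange_step(chain[-1]))
post_mean = sum(chain) / len(chain)
print(round(post_mean, 2))
```

The exact posterior in this toy model is Gamma(2, 1 + z), with mean 2/(1 + z) ≈ 1.33, so the chain's mean provides a quick sanity check.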

5. Summary and discussion


In this article we have presented a framework for the construction of auxiliary variable proposal
generation within MH algorithms. Many algorithms proposed in the literature fit into the
framework, and by using it, alternative acceptance probabilities are suggested.
Numerical experiments demonstrate that in some cases significant improvements can be
obtained by using these alternatives.
A variety of options for acceptance probabilities has also been suggested elsewhere. In
this article, we have shown that these options can be related to choices of augmented state
spaces for the variables to be generated. These choices come in addition to the flexibility in
how proposals are generated.
Although the added flexibility makes it possible to define alternative and hopefully
better algorithms, it also increases the number of choices to be made when constructing an
efficient algorithm. In the case of sequential generation of auxiliary variables, acceptance ratios
can depend on the ratio between the marginal distributions and the transition kernels for any
component of the auxiliary variables. Previous approaches in the literature have concentrated
on either the first or the last component, whereas we have shown here that averages
over all components have preferable theoretical properties. Numerical experiments supported
this, although the improvements over using the last component were small. For the more
general case, general guidelines are at this point difficult to give. In cases where generation
of the auxiliary variables is computationally costly, as for the intractable normaliz-
ing constants discussed in section 4.3, degenerate distributions avoiding one simulation step
can be beneficial. In other cases where such generations are cheap, as for the delayed
rejection sampling discussed in section 4.2, generation of additional 'backwards' auxiliary
variables can improve the acceptance probabilities. More research is needed on this topic.

Acknowledgements
Part of this work was performed when the author was a visiting fellow at the Department of
Mathematics, University of Bristol. The author is grateful for valuable discussions with col-
leagues there, in particular Prof. Christophe Andrieu. The author also wants to thank two
anonymous referees, the associate editor and the editor for many valuable comments.

References
Al-Awadhi, F., Hurn, M. & Jennison, C. (2004). Improving the acceptance rate of reversible jump
MCMC proposals. Statist. Probab. Lett. 69, 189–198.
Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo compu-
tations. Ann. Statist. 37, 697–725.
Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. Roy.
Statist. Soc. Ser. B (Stat. Methodol.) 72, 269–342.
Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations.
Genetics 164, 1139–1160.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser.
B Stat. Methodol. 36, 192–236.
Besag, J. (1994). Comments on representations of knowledge in complex systems by U. Grenander and
M. I. Miller. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 56, 591–592.


Besag, J. & Green, P. J. (1993). Spatial statistics and Bayesian computation (with discussion). J. Roy.
Statist. Soc. Ser. B Stat. Methodol. 55, 25–37.
Besag, J. & Kooperberg, C. (1995). On conditional and intrinsic autoregressions. Biometrika 82, 733–
746.
Brooks, S., Giudici, P. & Roberts, G. (2003). Efficient construction of reversible jump Markov chain
Monte Carlo proposal distributions. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 65, 3–39.
Damien, P., Wakefield, J. & Walker, S. (1999). Gibbs sampling for Bayesian non-conjugate and hier-
archical models by using auxiliary variables. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 61, 331–344.
Edwards, R. G. & Sokal, A. D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang repre-
sentation and Monte Carlo algorithm. Phys. Rev. D 38, 2009–2012.
Gelman, A., Gilks, W. R. & Roberts, G. O. (1997). Weak convergence and optimal scaling of random
walk Metropolis algorithms. Ann. Appl. Probab. 7, 110–120.
Gilks, W., Richardson, S. & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. Chapman
& Hall, London.
Green, P. & Mira, A. (2001). Delayed rejection in reversible jump Metropolis-Hastings. Biometrika 88,
1035–1053.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97–109.
He, Y., Hodges, J. & Carlin, B. (2007). Re-considering the variance parameterization in multiple precision
models. Bayesian Anal. 2, 529–556.
Higdon, D. (1998). Auxiliary variable methods for Markov chain Monte Carlo with applications. J. Amer.
Statist. Assoc. 93, 585–595.
Jennison, C. & Sharp, R. (2007). Mode jumping in MCMC: adapting proposals to the local
environment. Talk at Conference to Honour Allan Seheult, Durham, March 2007. Available at:
[Link] (accessed July 10, 2010).
Knorr-Held, L. & Rue, H. (2002). On block updating in Markov random field models for disease map-
ping. Scand. J. Statist. 29, 597–614.
Liu, J., Liang, F. & Wong, W. (2000). The multiple-try method and local optimization in Metropolis
sampling. J. Amer. Statist. Assoc. 95, 121–134.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equation of state calcu-
lations by fast computing machines. J. Chem. Phys. 21, 1087–1092.
Mira, A. (2001). On Metropolis-Hastings algorithms with delayed rejection. Metron, 59, 231–241.
Møller, J., Pettitt, A., Reeves, R. & Berthelsen, K. (2006). An efficient Markov chain Monte Carlo
method for distributions with intractable normalising constants. Biometrika 93, 451–458.
Murray, I., Ghahramani, Z. & MacKay, D. (2006). MCMC for doubly-intractable distributions. In
Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), 359–366.
AUAI Press, Arlington, Virginia.
Neal, R. M. (2001). Annealed importance sampling. Statist. Comput. 11, 125–139.
Peskun, P. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60, 607–612.
Robert, C. P. & Casella, G. (2004). Monte Carlo statistical methods, 2nd edn. Springer, New York.
Roberts, G. & Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete
approximations. Bernoulli 2, 341–364.
Storvik, G. (2009). On the flexibility of Metropolis-Hastings acceptance probabilities in auxiliary variable
proposal generation. Statistical Research Report 1, Department of Mathematics, University of Oslo,
Oslo.
Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation. J. Amer.
Statist. Assoc. 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22, 1701–1728.
Tierney, L. & Mira, A. (1999). Some adaptive Monte Carlo methods for Bayesian inference. Stat. Med.
18, 2507–2515.
Tjelmeland, H. & Hegstad, B. (2001). Mode jumping proposals in MCMC. Scand. J. Statist. 28, 205–
223.

Received March 2009, in final form April 2010

Geir Storvik, Department of Mathematics, University of Oslo, P.O. Box 1053, Blindern, N-0316 Oslo,
Norway.
E-mail: geirs@[Link]


Appendix: Transition densities for internal MH moves


In this section, we discuss the use of weight functions when proposals are generated by a
fixed number of internal MH steps. The general difficulty in this case is that the transition
probabilities q_s(x_s* | x*_{s−1}) are not always directly available.
One possibility, explored by both Tjelmeland & Hegstad (2001) and Jennison & Sharp
(2007), is to consider transition probabilities only for the last proposal and to use a computable
transition for this part. In the case where only parts of the state vector are changed at a time
and at least one of the blocks can be moved through a Gibbs sampler update, using a_s > 0
in (15) only for those steps at which a Gibbs sampler is used avoids the need for calculating
more complicated transition kernels.
To utilize all the generated auxiliary variables, an alternative is to include also the proposed
values at each iteration in the set of auxiliary variables. Assume that x*_{1:t+1} (with proposals
z*_{1:t+1}) are generated by MH steps, that is,

   zᵢ* ~ qᵢ^z(zᵢ* | x*_{i−1}),
   xᵢ* = zᵢ*        with probability αᵢ(x*_{i−1}; zᵢ*),
         x*_{i−1}   otherwise.
The generating distribution can be written as

   ∏_{i=1}^{t+1} qᵢ(xᵢ* | x*_{i−1}) qᵢ^{z|x}(zᵢ* | x*_{i−1}, xᵢ*),

where qᵢ(xᵢ* | x*_{i−1}), i = 1, ..., t + 1, is the transition kernel for the Markov chain {x*_{1:t+1}},
whereas

   qᵢ^{z|x}(zᵢ* | x*_{i−1}, xᵢ*) = qᵢ^z(zᵢ* | x*_{i−1}) [αᵢ(x*_{i−1}; zᵢ*) I(xᵢ* = zᵢ*) + [1 − αᵢ(x*_{i−1}; zᵢ*)] I(xᵢ* = x*_{i−1})] / qᵢ(xᵢ* | x*_{i−1})

is the conditional distribution for zᵢ* given (x*_{i−1}, xᵢ*). Now, consider the choice


   h_s(z*_{1:t+1}, x*_{1:t} | y, x, y*) = ∏_{i=1}^{s−1} qᵢ(xᵢ* | x*_{i−1}) qᵢ^{z|x}(zᵢ* | x*_{i−1}, xᵢ*) × h_s^z(z_s* | x*_{s−1}, x_s*)
                                          × ∏_{i=s}^{t} q_{i+1}(xᵢ* | x*_{i+1}) q_{i+1}^{z|x}(z*_{i+1} | xᵢ*, x*_{i+1}).

Using (9), we obtain

   w_s(y; z*_{1:t+1}, x*_{1:t+1}) = [π(y*) h_s^z(z_s* | x*_{s−1}, x_s*)] / [q_s(x_s* | x*_{s−1}) q_s^{z|x}(z_s* | x*_{s−1}, x_s*)]
                                    × ∏_{i=s}^{t} [q_{i+1}(xᵢ* | x*_{i+1}) / q_{i+1}(x*_{i+1} | xᵢ*)]
                                  = [π(x_s*) h_s^z(z_s* | x*_{s−1}, x_s*)]
                                    / {q_s^z(z_s* | x*_{s−1}) [α_{z,s}(x*_{s−1}; z_s*) I(x_s* = z_s*) + [1 − α_{z,s}(x*_{s−1}; z_s*)] I(x_s* = x*_{s−1})]}.

For the choice of h_s^z(z_s* | x*_{s−1}, x_s*), some care should be taken to increase the probability of
non-zero acceptance probabilities. In particular, the distribution should reflect that x_s* ≠ x*_{s−1}
implies z_s* = x_s*. Therefore, assume

   h_s^z(z_s* | x*_{s−1}, x_s*) = δ(z_s* − x_s*)                     if x_s* ≠ x*_{s−1},
                                  q̃_s^{z|x}(z_s* | x*_{s−1}, x_s*)   if x_s* = x*_{s−1},

where q̃_s^{z|x}(z_s* | x*_{s−1}, x_s*) is a general density with the same support as q_s^z(z_s* | x*_{s−1}). The obvious
option is to choose q̃_s^{z|x}(z_s* | x*_{s−1}, x_s*) = q_s^z(z_s* | x*_{s−1}), in which case the weight function reduces
to

© 2010 Board of the Foundation of the Scandinavian Journal of Statistics.


358 G. Storvik Scand J Statist 38


   w_s(y; x*_{1:t+1}) = π(x_s*) / [q_s^z(z_s* | x*_{s−1}) α_{z,s}(x*_{s−1}; x_s*)]   if x_s* ≠ x*_{s−1},      (23)
                        π(x_s*) / [1 − α_{z,s}(x*_{s−1}; z_s*)]                      if x_s* = x*_{s−1}.
Another option is to choose

   h_s^z(z_s* | x*_{s−1}, x_s*) = δ(z_s* − x_s*)        with probability 1 − α_{z,s}(x*_{s−1}; x_s*),
                                  q_s(z_s* | x*_{s−1})   with probability α_{z,s}(x*_{s−1}; x_s*).

In that case, the weight function reduces to

   w_s(y; z*_{1:t+1}, x*_{1:t+1}) = π(x_s*) / q_s(x_s* | x*_{s−1})   if x_s* ≠ x*_{s−1},      (24)
                                    0                                otherwise.

In this case, only those iterations corresponding to acceptance are considered.
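The weights (23) can be accumulated while the internal chain itself is run. The sketch below does so for a Gaussian random-walk inner kernel on an unnormalized standard-normal target; all concrete choices (target, kernel, number of steps) are illustrative assumptions:

```python
import math
import random

def pi_u(x):
    # unnormalized target density
    return math.exp(-0.5 * x * x)

def q_z(z, x, sd):
    # inner random-walk proposal density q_s^z(z | x)
    return math.exp(-0.5 * ((z - x) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def inner_chain_weights(x0, t=5, sd=1.0):
    # run t + 1 internal MH steps, returning the final state and the
    # per-step weights (23)
    x, ws = x0, []
    for _ in range(t + 1):
        zs = x + sd * random.gauss(0.0, 1.0)
        alpha = min(1.0, pi_u(zs) / pi_u(x))
        if random.random() < alpha:
            # accepted step: x_s* != x_{s-1}*, first branch of (23)
            ws.append(pi_u(zs) / (q_z(zs, x, sd) * alpha))
            x = zs
        else:
            # rejected step: x_s* = x_{s-1}*, second branch of (23);
            # alpha < 1 is guaranteed here, since alpha = 1 always accepts
            ws.append(pi_u(x) / (1.0 - alpha))
    return x, ws

random.seed(3)
x_final, weights = inner_chain_weights(0.0)
print(len(weights), all(w > 0.0 for w in weights))
```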
