Metropolis-Hastings Acceptance Probabilities
ABSTRACT. The use of auxiliary variables for generating proposals within a Metropolis–
Hastings setting has been suggested in many different contexts. This has in particular been of interest
for simulation from complex distributions such as multimodal distributions or in transdimensional
approaches. For many of these approaches, the acceptance probabilities that are used appear
somewhat magical, and different proofs for their validity have been given in each case. In this
article, we present a general framework for the construction of acceptance probabilities in auxiliary variable
proposal generation. In addition to showing the similarities between many of the algorithms
proposed in the literature, the framework also demonstrates that there is great flexibility in how
to construct acceptance probabilities. With this flexibility, alternative acceptance probabilities are
suggested. Some numerical experiments are also reported.
Key words: acceptance probabilities, auxiliary variables, Markov chain Monte Carlo,
Metropolis–Hastings algorithms
1. Introduction
Monte Carlo methods, and in particular Markov chain Monte Carlo (MCMC) algorithms,
are nowadays part of the standard set of techniques used by statisticians (Robert & Casella,
2004). Within statistics, the typical applications of Monte Carlo methods are evaluation of
likelihoods including random/missing effects or calculation of posterior distributions. In
general, Monte Carlo methods can be used to evaluate complex and high-dimensional integrals.
Monte Carlo methods involve sampling from some distribution, π(y) say, which in many
cases can be done efficiently through 'standard' MCMC algorithms generating a Markov
chain with π(y) as invariant distribution (Metropolis et al., 1953; Hastings, 1970; Gilks
et al., 1996; Robert & Casella, 2004). Most MCMC algorithms involve mixtures of different
Metropolis–Hastings (MH) steps (including Gibbs steps as particular cases). Given the current
state y, an MH step consists of generating a proposal y* ∼ q(y* | y) and switching to the proposal
with probability

α(y; y*) = min{ 1, [π(y*) q(y | y*)] / [π(y) q(y* | y)] }.
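As a concrete illustration of a single MH step (a sketch, not code from the article; the symmetric Gaussian random-walk proposal is an illustrative choice that makes q cancel from the ratio):

```python
import math
import random

def mh_step(y, log_pi, step=1.0, rng=random):
    """One Metropolis-Hastings step with a symmetric Gaussian random-walk
    proposal; q(y* | y) = q(y | y*), so the acceptance ratio reduces to
    pi(y*) / pi(y)."""
    y_star = y + rng.gauss(0.0, step)
    log_r = log_pi(y_star) - log_pi(y)   # log of pi(y*)/pi(y)
    if math.log(rng.random()) < min(0.0, log_r):
        return y_star                    # accept: move to the proposal
    return y                             # reject: keep the current state

# Example target: standard normal, pi(y) proportional to exp(-y^2/2)
log_pi = lambda y: -0.5 * y * y
```

Repeatedly applying `y = mh_step(y, log_pi)` produces a Markov chain with π as invariant distribution.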
General Markov chain theory can be used to show convergence and ergodicity results (e.g.
Tierney, 1994). In practice, slow mixing can occur when π(y) has a complex structure, that is,
when it is multimodal or covers several models in different spaces. The main problem is to choose
proposal distributions q(y∗ | y) that both allow flexible movements within the sample space
and give reasonable acceptance probabilities. Some guidelines and theoretical results for tuning
variances of proposal distributions are available (Roberts et al., 1997; Brooks et al., 2003).
Efficient sampling algorithms can in many cases be constructed by the use of auxiliary
variables. Several approaches have been suggested within the class of MCMC algorithms
Scand J Statist 38 Flexibility of MH acceptance probabilities 343
(Tanner & Wong, 1987; Edwards & Sokal, 1988; Besag & Green, 1993; Higdon, 1998;
Damien et al., 1999), but auxiliary variables have also been used in combination with importance sampling
(Neal, 2001). The most common way of doing this is to augment the sample space with
some extra variables to construct simpler or more efficient algorithms in this augmented
space. More recently, auxiliary variables have been used as a tool within an MH setting for
either generating better proposals (e.g. Tjelmeland & Hegstad, 2001; Jennison & Sharp, 2007)
or for calculation of acceptance probabilities (e.g., Beaumont, 2003; Andrieu & Roberts,
2009).
The flexibility in choosing proposal distributions has long been recognized, whereas
acceptance probabilities are given by the standard MH rule. When extending the state space by
introducing auxiliary variables, there is also flexibility in the choice of acceptance
probabilities that seems to have gone largely unnoticed. This flexibility is a consequence of the
possibility of choosing different target distributions in the augmented space, and is the focus
of this article.
Assume our aim is to sample y* ∼ π(y*). A proposal is generated by a simultaneous
simulation of (x*, y*) ∼ q(x*, y* | y). The flexibility follows in that the target distribution, π̄ say,
for the combined variables (x*, y*) can be any distribution with π as its marginal distribution.
For a given joint proposal distribution q, several possible acceptance probabilities can be
obtained by different choices of π̄. By taking this alternative viewpoint, a common
understanding of different algorithms suggested in the literature is obtained, and this also serves
as a framework for suggesting new algorithms. This framework will in particular be useful in
cases where standard MCMC algorithms fail and more complicated versions are needed. Typical
examples are algorithms for jumping between possible modes (Tjelmeland & Hegstad, 2001;
Jennison & Sharp, 2007) and improvements of acceptance rates within reversible jump algo-
rithms (Al-Awadhi et al., 2004).
An important special class of algorithms is where x* = (x_1*, …, x_t*) is generated in sequence,
typically with x_1* generated through a big jump (for jumping between modes or dimensions)
followed by a few steps of smaller jumps. Such x_i*'s can in principle live on any space, but in
most cases they live on the same space as y. For most such algorithms suggested in the
literature, acceptance ratios within an MH setting typically depend either only on the
first x_1* (e.g. Al-Awadhi et al., 2004) or the last x_t* (e.g. Tjelmeland & Hegstad, 2001;
Jennison & Sharp, 2007). By combining different target distributions, acceptance ratios
depending on averages over all the generated x_1*, …, x_t* can be constructed, which can give
higher and more stable acceptance probabilities.
Other algorithms that can be considered as special cases of this general framework are
mode or model jumping (Tjelmeland & Hegstad, 2001; Al-Awadhi et al., 2004; Jennison &
Sharp, 2007), multiple-try methods (Liu et al., 2000), proposals based on particle filters
(Andrieu et al., 2010), pseudo-marginal algorithms (Beaumont, 2003; Andrieu & Roberts, 2009),
sampling algorithms for distributions with intractable normalizing constants (Møller et al.,
2006) and delayed rejection sampling (Tierney & Mira, 1999; Green & Mira, 2001; Mira,
2001). Some of these will be considered in more detail later whereas others are treated in
Storvik (2009). Although our focus will be on MH algorithms, the framework can also be
applied within importance sampling, for example, in combination with annealed importance
sampling (Neal, 2001); see Storvik (2009) for details.
We start in section 2 by discussing the flexibility of acceptance probabilities in MH
algorithms when auxiliary variables are used for generating proposals. Section 3 focuses on
sequential generations of auxiliary variables. In section 4, we apply our general results to
some algorithms suggested in the literature and also consider alternative versions of these.
For some of these alternatives, numerical experiments are included, demonstrating that
alternative weights and acceptance probabilities can improve the performance of an algo-
rithm. We conclude the article with discussion and final remarks in section 5.
Proposition 1. Assume the current state y ∼ π(y) and that x | y ∼ h(x | y) for some arbitrary
conditional distribution h. Generate (x*, y*) ∼ q(x*, y* | x, y) and put (x', y') = (x*, y*) with
probability α(x, y; x*, y*) = min{1, r(x, y; x*, y*)}, and (x', y') = (x, y) otherwise, with

r(x, y; x*, y*) = [π(y*) h(x* | y*) q(x, y | x*, y*)] / [π(y) h(x | y) q(x*, y* | x, y)].   (1)

Then y' ∼ π(y').

If the reverse auxiliary variable need not be carried along between iterations, a corresponding acceptance ratio is

r(y; x*, y*, x) = [π(y*) q(x, y | y*)] / [π(y) q(x*, y* | y)].

When x* = (x_1*, …, x_t*) is generated sequentially, the reverse-move auxiliary variables can be taken in reversed order, giving

r(y; x*, y*, x) = [π(y*) q(x*_{t:1}, y | y*)] / [π(y) q(x*_{1:t}, y* | y)].
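A minimal sketch of a step accepted with ratio (1); all distributional choices here, h(x | y) = N(x; y, 1) and the two-stage Gaussian proposal, are illustrative assumptions, not the article's:

```python
import math
import random

def log_norm(v, mean, sd):
    # log of the N(mean, sd^2) density evaluated at v
    return -0.5 * ((v - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)

def aux_mh_step(x, y, log_pi, sigma=1.0, rng=random):
    """One auxiliary-variable MH step on the pair (x, y), targeting
    pi(y) h(x | y) so that y marginally follows pi.  Illustrative choices:
    h(x | y) = N(x; y, 1), q(x*, y* | x, y) = N(x*; x, sigma) N(y*; x*, 1)."""
    x_star = x + rng.gauss(0.0, sigma)
    y_star = x_star + rng.gauss(0.0, 1.0)
    # log of ratio (1): pi(y*) h(x*|y*) q(x,y|x*,y*) / [pi(y) h(x|y) q(x*,y*|x,y)]
    log_r = (log_pi(y_star) + log_norm(x_star, y_star, 1.0)
             + log_norm(x, x_star, sigma) + log_norm(y, x, 1.0)
             - log_pi(y) - log_norm(x, y, 1.0)
             - log_norm(x_star, x, sigma) - log_norm(y_star, x_star, 1.0))
    if math.log(rng.random()) < min(0.0, log_r):
        return x_star, y_star
    return x, y
```

Because the pair (x, y) is updated jointly with π(y)h(x | y) as invariant distribution, the y-marginal of the resulting chain is π.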
We will explore such possibilities further in sections 3 and 4 for specific choices of proposal
distributions.
For both propositions 1 and 2, the acceptance ratios can be written as ratios between
weight functions directly related to importance sampling; for instance, the ratio in (2)
equals a ratio of such weights. In addition to demonstrating that the presented results can
also be applied to importance sampling, weight functions will in some cases be used instead
of acceptance ratios to simplify notation.
If {h_s} is a collection of distributions, a mixture of the h_s's,

h(x | y, x*, y*) = ∑_s a_s h_s(x | y, x*, y*),   a_s ≥ 0,   ∑_s a_s = 1,

is itself a valid choice of h, giving acceptance ratios that combine the corresponding weights.
Similar combinations of weights can be obtained for proposition 1. Such mixtures will be
considered in the subsequent sections.
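The point can be checked exactly on a two-point toy space: any conditional h(u | y), and hence any mixture of such conditionals, yields a proper importance weight w(u, y) = π(y)h(u | y)/q(u, y). All distributions below are arbitrary illustrative choices, not from the article:

```python
# Exact (enumerated) check that w = pi(y) h(u|y) / q(u,y) is a proper
# weight for any h(u|y), and therefore also for a mixture of two h's.
pi = {0: 0.3, 1: 0.7}
q_u = {0: 0.5, 1: 0.5}
q_y_given_u = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # q(y|u), arbitrary
h1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}            # h1(u|y), arbitrary
h2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.6, 1: 0.4}}            # h2(u|y), arbitrary

def expectation(h, f):
    """E_q[w f(y)] with w = pi(y) h(u|y) / q(u,y), by full enumeration."""
    total = 0.0
    for u in (0, 1):
        for y in (0, 1):
            q_uy = q_u[u] * q_y_given_u[u][y]
            w = pi[y] * h[y][u] / q_uy
            total += q_uy * w * f(y)
    return total

f = lambda y: float(y == 1)
# mixture of h1 and h2 with a_1 = a_2 = 0.5, cf. the mixture construction
h_mix = {y: {u: 0.5 * h1[y][u] + 0.5 * h2[y][u] for u in (0, 1)} for y in (0, 1)}
for h in (h1, h2, h_mix):
    assert abs(expectation(h, f) - pi[1]) < 1e-12   # recovers E_pi[f] exactly
```

Each individual h, and the mixture, reproduces E_π[f] = π(1) = 0.7 exactly, since each h(· | y) sums to one.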
Properties of the acceptance ratios are given in proposition 3.
where x_0* ≡ y and x*_{t+1} ≡ y*. The typical setup we will consider is where for i ≥ 2 the proposal
densities satisfy the detailed balance criterion

π(x) q_i(y | x) = π(y) q_i(x | y).   (8)
Such proposals can for instance be used for target distributions with several modes where
q1 corresponds to a large jump whereas the subsequent moves are ordinary MCMC ones.
This particular setting will be further discussed in section 4.1. Another possibility is model
or dimension moves in a reversible jump setting, see Storvik (2009). Allowing the additional
moves also to depend on i gives the possibilities for different blocks of a high-dimensional
state vector to be updated at each step.
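Criterion (8) can be verified numerically for a small discrete chain whose kernel is itself a Metropolised move (a toy illustration, not the article's setup):

```python
import itertools

# Detailed balance pi(x) q(y|x) = pi(y) q(x|y), checked exactly on {0,1,2}.
# The kernel proposes one of the two other states uniformly and accepts
# with the MH probability for a symmetric proposal.
pi = [0.2, 0.3, 0.5]

def kernel(x, y):
    """Transition probability q(y|x) of the Metropolised move."""
    if x == y:
        # rejection mass: whatever is left over after the off-diagonal moves
        return 1.0 - sum(kernel(x, z) for z in range(3) if z != x)
    proposal = 0.5                       # uniform over the two other states
    accept = min(1.0, pi[y] / pi[x])     # MH acceptance, symmetric proposal
    return proposal * accept

for x, y in itertools.product(range(3), repeat=2):
    assert abs(pi[x] * kernel(x, y) - pi[y] * kernel(y, x)) < 1e-12
```

For x ≠ y the two sides both equal 0.5 · min{π(x), π(y)}, which is symmetric in (x, y), so (8) holds.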
A consequence of assumption (8) is that for s ≥ 1

∏_{i=s}^{t} [q_{i+1}(x_i* | x*_{i+1}) / q_{i+1}(x*_{i+1} | x_i*)] = π(x_s*) / π(x*_{t+1}),   (9)

which will be used repeatedly.
Our aim is to construct acceptance probabilities for the proposal y* = x*_{t+1} when y is the
current state. A complicating factor is that the sequential approach for generating y∗ will
not make the proposal distribution for y∗ given y directly available. Note that if q1 does
not depend on y, we obtain an independence sampler where the proposal is generated by a
sequence of internal MCMC steps, see section 4.1 for an example.
Writing x*_{1:t+1} = (x_1*, …, x*_{t+1}), the general weight function will have the form

w(y; x*_{1:t+1}, x) = [π(x*_{t+1}) h(x*_{1:t} | x*_{t+1}, x, y)] / ∏_{i=1}^{t+1} q_i(x_i* | x*_{i−1}).
A wide variety of weight functions can be considered by different choices of h. Consider the
class of distributions
h_s(x*_{1:t} | x*_{t+1}, x, y) = ∏_{i=1}^{s−1} q_i(x_i* | x*_{i−1}) ∏_{i=s}^{t} q_{i+1}(x_i* | x*_{i+1}).   (10)
converges in distribution to π, which should give a weight approaching 1. This can, however,
be accomplished by combining the weights. Using (4), a proper weight function is also given
by
w(y; x*_{1:t+1}) = ∑_{s=1}^{t+1} a_s w_s(y; x*_{1:t+1}),   (14)
4. Examples
In Storvik (2009), several algorithms suggested in the literature are shown to fit into the
general framework discussed in the previous sections. These include annealed importance
sampling (Neal, 2001), reversible jump MCMC for model selection (Al-Awadhi et al., 2004;
Andrieu & Roberts, 2009), multiple-try methods and particle proposals (Liu et al., 2000;
Andrieu et al., 2010) and the pseudo-marginal algorithm (Beaumont, 2003; Andrieu &
Roberts, 2009). Here, we will consider some specific examples where alternative weights are
proposed. For some of these examples, numerical experiments are performed to compare the
different weights.
= [π(y*) q_{t+1}(y | x_t)] / [π(y) q_{t+1}(y* | x_t*)],

which is equal to the acceptance ratio obtained by Jennison & Sharp (2007). To get a
computationally tractable density, Jennison & Sharp (2007) suggested choosing q_{t+1}(y* | x_t*) as a
local approximation to π(y).
As discussed in the Appendix, q_{t+1}(y* | x_t*) can also be defined through an MH step. Within
such a setting, alternative constructions can be considered. Assume q_{t+1} = q. Similar to (10),
define

h_s(x | y*, x*, y) = δ(x_1 − y*) ∏_{i=2}^{s−1} q(x_i | x_{i−1}) ∏_{i=s}^{t} q(x_i | x_{i+1}).
In general, if q is capable of moving efficiently around the whole state space, this ratio will
converge to 1. In more practical settings where q is only able to move within the current
mode, the ratio will converge towards the ratio between the masses of the corresponding
modes.
A numerical example on a Gaussian mixture model. Consider a model in R^p
and proposals are accepted using ordinary MH acceptance rates. The number of ‘outer’ MH
steps, that is, the number of generated y*'s, will be denoted by M.
The final proposals y* are accepted with probabilities given through the following
alternative acceptance ratios:
All the alternatives are used in combination with (23). The full algorithm was restarted N
times and the last sample was stored in each case, giving samples y_1, …, y_N. As a measure
of the performance of the different weight functions, a distance d(F̂_1, F_1)
was used, where F̂_1 is the empirical distribution function based on the first component of the
samples y_1, …, y_N whereas F_1 is the corresponding true marginal cumulative distribution.
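One standard way to compare an empirical distribution function with the true one is the Kolmogorov–Smirnov sup-distance, shown here as an illustrative stand-in (the article's exact d may differ):

```python
import math

def ks_distance(samples, cdf):
    """Sup-distance between the empirical CDF of `samples` and a true CDF
    `cdf` (the Kolmogorov-Smirnov statistic).  Illustrative stand-in for
    a distance d(F_hat, F) between empirical and true distributions."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs(cdf(x) - i / n), abs(cdf(x) - (i + 1) / n))
    return d

std_normal_cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For example, `ks_distance(chain_samples, std_normal_cdf)` shrinks towards 0 as the sampler's output approaches a standard normal.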
Figure 1A–E shows boxplots of d(F̂_1, F_1) with N = 10,000, M = 20, t + 1 = 10 and with
σ_large = 15 and σ_small = 1.5. The different panels correspond to different dimensions p. The
boxplots are obtained by repeating the experiments 100 times. We see in all cases that the first
weight function does not perform that well, an expected property given the earlier discussion
about this particular choice. Both the other choices give significant improvements with the
third weight function performing best in all the cases. The difference between these seems
Fig. 1. Boxplot of d(F̂ 1 , F1 ) for the Gaussian mixture distribution. The different panels correspond to
results for different dimensions p = 1, 2, 3, 5, 10. The experimental setup is described in the text. Panels
(A)–(E) correspond to M = 20, whereas panel (F) corresponds to M = 40. The numbers on the x-axis
correspond to the different types of acceptance ratios.
Fig. 2. Mixture Gaussian distribution with dimension p = 10 and number of outer Markov chain Monte
Carlo iterations M = 20. True (shaded) and examples of estimated cumulative distributions of the first
component. Solid line corresponds to the first weight function, dashed line to the second weight func-
tion and dotted line to the third weight function.
to depend on dimension, though. In particular, the difference between them seems to decrease
with dimension but then increases again for p = 10. This is probably because, for p = 10,
20 iterations in the outer MCMC is not that many. Increasing M to 40 (Fig. 1F), the second
and the third weight functions again perform similarly.
Figure 2 shows the true cumulative distribution for the first component and typical ex-
amples of empirical distributions based on the three different weight functions for p = 10 and
M = 20. The differences between the choices of weight functions shown in Fig. 1 are
clearly seen. Note in particular that the third weight function produces estimates that are
indistinguishable from the truth, indicating that convergence has been reached even with this small
number of MCMC iterations.
q_{y|x}(y* | y, x_1*, x_2*) = α_1(y; x_1*) I(y* = x_1*) + [1 − α_1(y; x_1*)] I(y* = x_2*).
Assume first that in addition to (x1∗ , x2∗ ) also a pair (x1 , x2 ) is generated using the proposal
distribution h(x1 , x2 | y, x1∗ , x2∗ , y∗ ). To get the variables to work within the same spaces, we
make h(x1 , x2 | y, x1∗ , x2∗ , y∗ ) degenerate in the sense that either x1 or x2 is equal to y, that is,
h(x_1, x_2 | y, x_1*, x_2*, y*) = δ(x_1 − y) h_2(x_2 | y, x_2*, y*)   if y* = x_1*,
h(x_1, x_2 | y, x_1*, x_2*, y*) = δ(x_2 − y) h_1(x_1 | y, x_1*, y*)   if y* = x_2*.
We consider the two possibilities for y* separately. If y* = x_1*, then x_1 = y, and by applying
proposition 2, we obtain
z_i = μ + γ_i + ε_i,

where ε_1, …, ε_p are i.i.d. zero-mean Gaussian variables with precision τ_1, whereas γ_1, …, γ_p are
independent of ε_1, …, ε_p and spatially correlated variables with a conditional autoregressive
(CAR) structure (Besag, 1974; Besag & Kooperberg, 1995). The conditional distributions are
given by

γ_i | γ_{−i} ∼ N( n_i^{−1} ∑_{j∼i} γ_j , (n_i τ_2)^{−1} ),
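A Gibbs sweep over these full conditionals can be sketched as follows (illustrative code; the neighbourhood structure, the fixed update order, and ignoring the impropriety of the intrinsic CAR prior are simplifying assumptions):

```python
import math
import random

def car_gibbs_sweep(gamma, neighbours, tau2, rng=random):
    """One Gibbs sweep for the CAR field: each gamma_i is drawn from its
    full conditional N(mean of its neighbours, 1 / (n_i * tau2)), where
    neighbours[i] lists the indices j ~ i and n_i is their count."""
    for i in range(len(gamma)):
        nb = neighbours[i]
        mean = sum(gamma[j] for j in nb) / len(nb)     # neighbour average
        sd = 1.0 / math.sqrt(len(nb) * tau2)           # conditional st. dev.
        gamma[i] = mean + rng.gauss(0.0, sd)
    return gamma
```

For example, on a line of four sites the neighbour lists would be `[[1], [0, 2], [1, 3], [2]]`, and repeated sweeps explore the (improper) CAR field for a given τ_2.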
Fig. 3. Top panel: posterior distribution of (τ_1, τ_2) in the conditional autoregressive model based on 50
simulated observations on a 5 × 5 grid using μ = 0, = 0.25 and τ_1 = 1, τ_2 = 1/3. Bottom panel: distance,
as defined by (22), between true and empirical distribution of τ_2 as a function of the number of Markov
chain Monte Carlo iterations for the 'standard' delayed rejection sampling acceptance probability (19),
solid line, and the alternative acceptance probability (20), dashed line.
between the true cumulative distribution function for τ_2, F_{τ_2}(τ_2) (evaluated through numerical
integration), and the corresponding estimate based on the samples, F̂_{τ_2}(τ_2), was used.
The bottom panel of Fig. 3 shows the distance measure (22) as a function of the number of
MCMC iterations based on the 'standard' delayed rejection sampling acceptance
probability (19) (solid line) and the alternative acceptance probability (20) (dashed line), both
of which should converge towards 0 as the number of iterations increases to infinity. We clearly
see the improvement for the alternative acceptance probability.
As noted by an anonymous referee, this results in the exchange algorithm by Murray et al.
(2006) and thereby serves as a unification of the two methods for dealing with unknown
normalizing constants proposed by Møller et al. (2006) and Murray et al. (2006).
Acknowledgements
Part of this work was performed when the author was a visiting fellow at the Department of
Mathematics, University of Bristol. The author is grateful for valuable discussions with col-
leagues there, in particular Prof. Christophe Andrieu. The author also wants to thank two
anonymous referees, the associate editor and the editor for many valuable comments.
References
Al-Awadhi, F., Hurn, M. & Jennison, C. (2004). Improving the acceptance rate of reversible jump
MCMC proposals. Statist. Probab. Lett. 69, 189–198.
Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo compu-
tations. Ann. Statist. 37, 697–725.
Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. Roy.
Statist. Soc. Ser. B (Stat. Methodol.) 72, 269–342.
Beaumont, M. (2003). Estimation of population growth or decline in genetically monitored populations.
Genetics 164, 1139–1160.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser.
B Stat. Methodol. 36, 192–236.
Besag, J. (1994). Comments on representations of knowledge in complex systems by U. Grenander and
M. I. Miller. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 56, 591–592.
Besag, J. & Green, P. J. (1993). Spatial statistics and Bayesian computation (with discussion). J. Roy.
Statist. Soc. Ser. B Stat. Methodol. 55, 25–37.
Besag, J. & Kooperberg, C. (1995). On conditional and intrinsic autoregressions. Biometrika 82, 733–
746.
Brooks, S., Giudici, P. & Roberts, G. (2003). Efficient construction of reversible jump Markov chain
Monte Carlo proposal distributions. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 65, 3–39.
Damien, P., Wakefield, J. & Walker, S. (1999). Gibbs sampling for Bayesian non-conjugate and hier-
archical models by using auxiliary variables. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 61, 331–344.
Edwards, R. G. & Sokal, A. D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang repre-
sentation and Monte Carlo algorithm. Phys. Rev. D 38, 2009–2012.
Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak convergence and optimal scaling of random
walk Metropolis algorithms. Ann. Appl. Probab. 7, 110–120.
Gilks, W., Richardson, S. & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. Chapman
& Hall, London.
Green, P. & Mira, A. (2001). Delayed rejection in reversible jump Metropolis-Hastings. Biometrika 88,
1035–1053.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57, 97–109.
He, Y., Hodges, J. & Carlin, B. (2007). Re-considering the variance parameterization in multiple precision
models. Bayesian Anal. 2, 529–556.
Higdon, D. (1998). Auxiliary variable methods for Markov chain Monte Carlo with applications. J. Amer.
Statist. Assoc. 93, 585–595.
Jennison, C. & Sharp, R. (2007). Mode jumping in MCMC: adapting proposals to the local
environment. Talk at Conference to Honour Allan Seheult, Durham, March 2007. Available at:
[Link] (accessed July 10, 2010).
Knorr-Held, L. & Rue, H. (2002). On block updating in Markov random field models for disease map-
ping. Scand. J. Statist. 29, 597–614.
Liu, J., Liang, F. & Wong, W. (2000). The multiple-try method and local optimization in Metropolis
sampling. J. Amer. Statist. Assoc. 95, 121–134.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. & Teller, E. (1953). Equation of state calcu-
lations by fast computing machines. J. Chem. Phys. 21, 1087–1092.
Mira, A. (2001). On Metropolis-Hastings algorithms with delayed rejection. Metron, 59, 231–241.
Møller, J., Pettitt, A., Reeves, R. & Berthelsen, K. (2006). An efficient Markov chain Monte Carlo
method for distributions with intractable normalising constants. Biometrika 93, 451–458.
Murray, I., Ghahramani, Z. & MacKay, D. (2006). MCMC for doubly-intractable distributions. In
Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI), 359–366.
AUAI Press, Arlington, Virginia.
Neal, R. M. (2001). Annealed importance sampling. Statist. Comput. 11, 125–139.
Peskun, P. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60, 607–612.
Robert, C. P. & Casella, G. (2004). Monte Carlo statistical methods, 2nd edn. Springer, New York.
Roberts, G. & Tweedie, R. (1996). Exponential convergence of Langevin distributions and their discrete
approximations. Bernoulli 2, 341–364.
Storvik, G. (2009). On the flexibility of Metropolis-Hastings acceptance probabilities in auxiliary variable
proposal generation. Statistical Research Report 1, Department of Mathematics, University of Oslo,
Oslo.
Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation. J. Amer.
Statist. Assoc. 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22, 1701–1728.
Tierney, L. & Mira, A. (1999). Some adaptive Monte Carlo methods for Bayesian inference. Stat. Med.
18, 2507–2515.
Tjelmeland, H. & Hegstad, B. (2001). Mode jumping proposals in MCMC. Scand. J. Statist. 28, 205–
223.
Geir Storvik, Department of Mathematics, University of Oslo, P.O. Box 1053, Blindern, N-0316 Oslo,
Norway.
E-mail: geirs@[Link]
∏_{i=1}^{t+1} q_i(x_i* | x*_{i−1}) q_i^{z|x}(z_i* | x*_{i−1}, x_i*),
h_s^{z|x}(z*_{1:t+1}, x*_{1:t} | y, x, y*) = ∏_{i=1}^{s−1} q_i(x_i* | x*_{i−1}) q_i^{z|x}(z_i* | x*_{i−1}, x_i*) × h_s^z(z_s* | x*_{s−1}, x_s*)
× ∏_{i=s}^{t} q_{i+1}(x_i* | x*_{i+1}) q_{i+1}^{z|x}(z*_{i+1} | x_i*, x*_{i+1}).
w_s(y; x*_{1:t+1}) = π(x_s*) / [q_s^z(z_s* | x*_{s−1}) α_{z,s}(x*_{s−1}; x_s*)]   if x_s* ≠ x*_{s−1},
w_s(y; x*_{1:t+1}) = π(x_s*) / [1 − α_{z,s}(x*_{s−1}; z_s*)]                     if x_s* = x*_{s−1}.   (23)
Another option is to choose

h_s^z(z_s* | x*_{s−1}, x_s*) = δ(z_s* − x_s*)         with probability 1 − α_{z,s}(x*_{s−1}, x_s*),
h_s^z(z_s* | x*_{s−1}, x_s*) = q_s(z_s* | x*_{s−1})   with probability α_{z,s}(x*_{s−1}, x_s*).
In that case, the weight function reduces to

w_s(y; z*_{1:t}, x*_{1:t+1}) = π(x_s*) / q_s(x_s* | x*_{s−1})   if x_s* ≠ x*_{s−1},   (24)
w_s(y; z*_{1:t}, x*_{1:t+1}) = 0                                otherwise.
In this case, only those iterations corresponding to acceptance are considered.
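This logic can be sketched as follows: run the inner MH chain and record a positive weight of the form π(x_s*)/q_s(x_s* | x*_{s−1}) only at iterations where the proposal was accepted, and 0 otherwise (the random-walk proposal and unnormalized target here are illustrative assumptions):

```python
import math
import random

def inner_chain_weights(x0, log_pi, t, step=1.0, rng=random):
    """Run t inner random-walk MH steps from x0.  For each step, record
    a weight pi(x_s) / q(x_s | x_{s-1}) if the proposal was accepted and
    0 if it was rejected, mimicking the structure of (24).  pi is
    unnormalized; the weights are meaningful only up to that constant."""
    def log_q(a, b):
        # log density of the N(b, step^2) proposal evaluated at a
        return -0.5 * ((a - b) / step) ** 2 - math.log(step) - 0.5 * math.log(2 * math.pi)

    x, ws = x0, []
    for _ in range(t):
        z = x + rng.gauss(0.0, step)
        if math.log(rng.random()) < min(0.0, log_pi(z) - log_pi(x)):
            ws.append(math.exp(log_pi(z) - log_q(z, x)))   # accepted move
            x = z
        else:
            ws.append(0.0)                                  # rejected move
    return x, ws
```

Only accepted inner moves contribute a non-zero term, matching the statement that only those iterations corresponding to acceptance are considered.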