Smart-Dumb MCMC Algorithm for Clustering
(If the proposal is not accepted, the new state is the same as the current state.) The ratio appearing in this expression is called the MH ratio; we have written it as the product of the state ratio π(x′)/π(x) and the proposal ratio q(x|x′)/q(x′|x). The only "free parameter" in designing an MH algorithm is the proposal distribution q(x′|x).

We assume N mentions and L possible tokens. An EMM model is composed from the following variables and conditional distributions:

• K, the number of entities, drawn from a prior P(K).

• For each entity k, a dictionary θk, i.e., a categorical distribution over the L tokens, drawn from a Dirichlet(αk). The set of dictionaries {θ1, ..., θK} is represented by Θ.

• For each mention mn, the entity Sn for that mention is drawn u.a.r. from the set of K entities, and the token for that mention is drawn from θSn. The (unknown) entities {S1, ..., SN} are represented by S and the observed mentions {m1, ..., mN} by m.

Algorithm 1 Gibbs sampling for an entity/mention model
1: procedure GibbsSampling
2:   for n = 1 to N do
3:     Sample Sn u.a.r. from {1, ..., K}
4:   end for
5:   for i = 1 to I do
6:     for n = 1 to N do
7:       Sample Sn from P(Sn | K, S−n, m)
8:     end for
9:   end for
10: end procedure
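Algorithm 1 can be sketched in a few lines of Python. The sketch below is illustrative only (the function name and fixed-K interface are ours, not the paper's); it resamples each Sn from the collapsed per-token conditional P(Sn | K, S−n, m) described in Section 3.2:

```python
import random

def gibbs_emm(mentions, K, alpha, I, seed=0):
    """Collapsed Gibbs sampling for the EMM (Algorithm 1), with K held fixed.

    mentions: list of token ids in {0, ..., L-1}
    alpha: symmetric Dirichlet prior count for each token
    I: number of sweeps over all mentions
    """
    rng = random.Random(seed)
    N = len(mentions)
    L = max(mentions) + 1
    # counts[k][l]: number of mentions of token l currently assigned to entity k
    counts = [[0] * L for _ in range(K)]
    # lines 2-4: initialize each S_n uniformly at random
    S = [rng.randrange(K) for _ in range(N)]
    for n, l in enumerate(mentions):
        counts[S[n]][l] += 1
    # lines 5-9: I sweeps, resampling S_n from P(S_n | K, S_-n, m)
    for _ in range(I):
        for n, l in enumerate(mentions):
            counts[S[n]][l] -= 1  # remove mention n from its current entity
            # collapsed conditional: proportional to (alpha + n_kl) / (L*alpha + n_k)
            w = [(alpha + counts[k][l]) / (L * alpha + sum(counts[k]))
                 for k in range(K)]
            S[n] = rng.choices(range(K), weights=w)[0]
            counts[S[n]][l] += 1  # re-insert mention n
    return S
```

As noted below, such a sampler can reallocate mentions among a fixed set of entities but can never change K itself.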
The EMM resembles a topic model for a single document
with an unknown number of topics (entities).
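The broad discretized log-normal prior P(K) used in the experiments is easy to tabulate. A minimal sketch, where the truncation point k_max (needed to compute the normalizer C in practice) is our illustrative addition, not part of the model in the text:

```python
import math

def discretized_log_normal(mu, sigma, k_max):
    """P(K) for K = 1..k_max: log-normal density at each integer K, renormalized.

    k_max truncates the support so that the normalizer C can be computed;
    it is an illustrative choice, not part of the model in the text.
    """
    w = [math.exp(-(math.log(k) - mu) ** 2 / (2 * sigma ** 2))
         / (k * sigma * math.sqrt(2 * math.pi))
         for k in range(1, k_max + 1)]
    C = sum(w)  # normalization constant arising from discretizing (and truncating)
    return [wk / C for wk in w]
```

For example, `discretized_log_normal(math.log(10), 0.5, 200)` puts most of its mass on K between roughly 4 and 27, making it a broad prior centered near K = 10.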
In the experiments described below, we use a broad prior for K, namely a discretized log-normal distribution with location parameter µ and scale parameter σ on a logarithmic scale:

    P(K) = (1/C) · (1/(Kσ√(2π))) · e^(−(log K − µ)² / (2σ²))

where C is an additional normalization factor arising from discretizing the distribution.

Because the entity for any given mention is assumed to be chosen u.a.r. from the available entities, we have

    P(S|K) = 1/K^N.

Rather than sample the dictionaries, we will integrate them out exactly, taking advantage of properties of the Dirichlet. In particular, we have

    ∫_Θ P(m, Θ | K, S) dΘ = ∏_{k∈{1:K}} B(αk + nk) / B(αk)

where B(·) is the Beta function and αk and nk are both vectors of size L, representing respectively the Dirichlet prior counts and the observed counts of each token in the dictionary.

3.2 Gibbs sampling for the EMM

Basic Gibbs sampling samples one variable at a time, which means, in our entity/mention model, sampling a new entity assignment Sn for some mention mn, conditioned on all other current assignments S−n of mentions to entities. The initial assignment is chosen at random; the sampler then runs for I iterations, each cycling through all the mentions; see Algorithm 1.

The probability P(Sn | K, S−n, m) is calculated by:

    P(Sn = k | K, S−n, m) ∝ (αk,l + nk,l) / Σ_{l′∈L} (αk,l′ + nk,l′)

where αk,l and nk,l are respectively the prior count for the token l (the mention mn = l) and the observed count of token l assigned to Ek; the numerator is divided by the sum of the prior and observed counts over all L tokens.

Each step of the Gibbs sampler is relatively easy to compute, but of course the algorithm cannot change the number of entities; moreover, it tends to get stuck on local maxima because of the local nature of the changes [Celeux et al., 2000]. Despite these drawbacks, Gibbs steps are an important element used in association with split–merge steps in order to optimize the allocation of mentions to entities.

3.3 Split–merge MCMC for the EMM

One way to explore states with different numbers of entities is to use birth and death moves. A birth move creates a new entity with no mentions, while a death move kills off an entity that has no mentions. While such moves, combined with Gibbs moves, do connect the entire state space, they lead to very slow mixing, because death moves can only occur when Gibbs moves have removed all the mentions from an entity, which is astronomically unlikely when N is much larger than K, unless the entity being killed is one that was just born.

Split and merge moves simultaneously change the number of entities and the assignments of multiple mentions in one go. The simplest approach is to pick an entity at random and split it into two, randomly assigning its mentions to the two new entities; the merge move operates in reverse, picking two entities and merging them into one, along with their mentions. A naive implementation of this idea often fails to work, because random splits often yield a state with very low probability, preventing acceptance. Pasula et al. [2003] suggested a random mixing procedure that chooses two entities and randomly assigns each of their mentions to one of two new entities. A split occurs when the two chosen entities happen to coincide, and a merge happens when one of the new entities receives no mentions and is discarded. The approach was effective for the medium-sized data sets in their experiments (300-400 mentions, 60-80 entities) but fails when each entity has many mentions: merges become exponentially unlikely to be proposed.

Jain and Neal [2004] proposed a Restricted Gibbs Split–Merge (RGSM) algorithm to generate splits that are consistent with the data. Two random elements are chosen at the beginning. A split is proposed if these two elements belong to the same component; otherwise a merge is proposed. To split the component ck, the RGSM algorithm first assigns the elements randomly to two new components c1^launch and c2^launch as the launch state. Restricted Gibbs sampling is then applied t times inside the launch state, re-assigning elements to one of the two components. The modified launch state after t Restricted Gibbs steps is used for generating the split. The resulting split reflects the data to some extent and tends to have a higher likelihood. However, these intermediate Restricted Gibbs steps are computationally expensive, especially for large data sets. Dahl [2003] proposed an allocation procedure, which works by assigning elements sequentially to two components. It starts by creating two new components c1 and c2 from two random elements. The remaining elements are sequentially allocated to either c1 or c2 using the Restricted Gibbs sampler conditioned on the previously assigned elements. This procedure is more efficient than preparing the launch state in RGSM.

Split–merge has been applied to different models such as the Beta Process Hidden Markov Model [Hughes et al., 2012] and the Hierarchical Dirichlet Process [Wang and Blei, 2012; Rana et al., 2013]; on the other hand, the split–merge method itself has not been improved since RGSM was first proposed in 2004. In the next section, we examine the interaction of the MH algorithm with split–merge moves and propose a new combination of moves that seems to work better.

4 SMART AND DUMB PROPOSALS

In general, smart proposals that lead to high-probability states are preferred, as they lead to faster convergence of MCMC. What Jain and Neal [2004] did with RGSM was to avoid the low-probability states generated by random splits. As we mentioned in Section 1, "smart" proposals propose states with higher probability according to the target distribution. The Gibbs sampler is smart in this sense, because its proposal distribution is proportional to the state probability and hence the MH acceptance probability is always 1 (Eq. 2). Consequently, we might instinctively conclude that convergence could be significantly improved by concentrating on the design of smart proposals. However, this is not true for MH in general. We analyze the MH ratio in the case of a smart merge proposal as an example.

Let q(x′|x) and q(x|x′) be respectively a smart merge proposal and a smart split proposal. Each first picks a subset of the variables to merge (or split) with probability Pm (or Ps). It then proposes a particular merge (or split) according to the target distribution fm (or fs), where fm and fs are proportional to state probabilities:

    fm = π(x′) / Σ_{ω∈W(x)} π(ω),    fs = π(x) / Σ_{ω∈W(x′)} π(ω),    (3)

where W(x) and W(x′) are respectively the set of states for all possible merges and the set of states for all possible splits, given Ps and Pm.

The MH ratio is then given by

    (π(x′)/π(x)) · (q(x|x′)/q(x′|x))
        = (π(x′)/π(x)) · [Ps · π(x) / Σ_{ω∈W(x′)} π(ω)] / [Pm · π(x′) / Σ_{ω∈W(x)} π(ω)]
        = (Ps/Pm) · Σ_{ω∈W(x)} π(ω) / Σ_{ω∈W(x′)} π(ω)    (4)

When both split and merge are smart, the ratio Ps/Pm will be very low because Ps is quite small: the smart split does not give much probability mass to the configuration that the smart merge prefers to merge (detailed examples are given in the next subsection). The second factor in Formula 4, namely the space ratio, is important in the split–merge case because of the space asymmetry. The sum Σ_{ω∈W(x)} π(ω) in the numerator reduces to π(x′), since once we have chosen two entities to merge (with probability Pm) there is only one possible merge. However, W(x′) contains 2^n possible splits, where n is the number of mentions assigned to the picked entity.

The Ps/Pm ratio gives a first idea of why smart proposals conflict with their inverse smart proposals. In the case of space asymmetry, a smart–smart proposal suffers from the space ratio in addition to the Ps/Pm ratio.

4.1 Why smart proposals do not work by themselves: A simple example

Let us take a concrete example of a split and a merge to illustrate the problem stated above for smart proposals. Assume there are three entities in state x as below:

    x : {E1 : {A A B B}; E2 : {C C}; E3 : {C C}}.

A desired split would divide E1 into two entities as follows:

    x′1 : {E1 : {A A}; E4 : {B B}; E2 : {C C}; E3 : {C C}}.

A smart split proposal distribution should propose the state x′1 with quite high probability, as illustrated in the left part of Figure 1, where the width of the arrow indicates the probability value. However, a smart merge proposal, starting from state x′1, would rather assign a high probability q(x″1|x′1) to proposing the state x″1:

    x″1 : {E1 : {A A}; E4 : {B B}; E2 : {C C C C}}.

From the point of view of a smart merge, inverting the smart split's preferred move, q(x|x′1), is highly unlikely, as represented by the dashed arrow in the figure. Therefore, considering the high value of q(x′1|x), the proposal ratio becomes extremely low, which leads to a very low acceptance rate for smart split proposals.

An important property of MCMC methods is that detailed balance is guaranteed when several different proposals are combined, on condition that each proposal individually satisfies the Metropolis-Hastings acceptance rule [Tierney, 1994]. This suggests a solution that combines smart and dumb proposals, namely the Smart-Dumb Dumb-Smart proposals (SDDS), as illustrated in Figure 2.
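The toy example can be made quantitative with the EMM's collapsed entity score ∏k B(αk + nk)/B(αk) introduced earlier. The sketch below is illustrative only: it uses a three-token vocabulary, uniform α = 1, and scores only the dictionary factor of the target (the P(K) and 1/K^N factors are omitted):

```python
from math import lgamma, exp

def log_B(v):
    """log of the multivariate Beta function B(v) = prod Gamma(v_i) / Gamma(sum v_i)."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def entity_score(tokens, vocab=("A", "B", "C"), alpha=1.0):
    """Collapsed dictionary marginal B(alpha + n) / B(alpha) for one entity."""
    n = [tokens.count(t) for t in vocab]
    a = [alpha] * len(vocab)
    return exp(log_B([ai + ni for ai, ni in zip(a, n)]) - log_B(a))

def state_score(entities):
    """Product of per-entity scores (dictionary factor of the target only)."""
    s = 1.0
    for tokens in entities:
        s *= entity_score(tokens)
    return s

# The split the smart proposal should favour: E1 = {A A B B} -> {A A} | {B B}
desired = state_score([list("AA"), list("BB")])
# A mixed split of the same mentions: {A B} | {A B}
mixed = state_score([list("AB"), list("AB")])
# Leaving E1 unsplit: {A A B B}
unsplit = state_score([list("AABB")])
```

Under this factor alone, the desired split {A A}|{B B} scores 1/36 ≈ 0.028, four times the mixed split's 1/144, so a target-proportional split proposal concentrates its mass on x′1; by the same score, a smart merge from x′1 prefers combining E2 and E3 into {C C C C} (entity score 1/15) over undoing the split (entity score 1/90), which is the conflict sketched in Figure 1.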
Figure 3: The mean of K during the iterations for different data sizes, with initial K0 = 5. (a) Random Mixing (b) Restricted Gibbs Split-Merge (c) Smart Split Smart Merge (d) Smart-Dumb Dumb-Smart

Figure 4: The mean of K during the iterations for different initial values K0 = 1, 5, 10, 20, with data size N = 500. (a) Random Mixing (b) Restricted Gibbs Split-Merge (c) Smart Split Smart Merge (d) Smart-Dumb Dumb-Smart
N = 100, the RGSM sampler has high acceptance rates for both split (7.8%) and merge (2.1%). However, when the data size is N = 200, the acceptance rates drop sharply, to only 1.1% for split and 0.08% for merge. When the data size grows to N = 500, almost no effective split or merge happens. In contrast, the smart proposals in the SDDS algorithm, whether smart split or smart merge, maintain satisfactory acceptance rates even for the data set with N = 500: 1.9% for smart split and 4.8% for smart merge. For the SS sampler, the drop in acceptance rates is similar to the RGSM sampler's, since there are no dumb proposals to support the corresponding smart proposals.

Time per iteration As we stated previously, the allocation procedure in our smart split proposal is less time-consuming than the split proposal based on Restricted Gibbs sampling. The running time per iteration of each algorithm is shown in Figure 5. For the RGSM sampler, the time spent per iteration grows quickly with the data size, whereas the time for the SDDS algorithm remains stable. In the case of N = 500, the RGSM sampler takes 488.63 milliseconds per step while the SDDS algorithm takes only 10.02 milliseconds per step. The times per iteration for the Random Mixing sampler and the SS sampler (not shown in the figure) are on the same scale as the SDDS sampler's and stay stable across data sizes.

Figure 5: Differences in acceptance rates and time per iteration for different data sizes. (a) Acceptance rates for split (b) Acceptance rates for merge (c) Time per iteration

5.2 Applying SDDS to the conjugate Dirichlet Process Mixture Model

The RGSM algorithm was originally proposed for the Dirichlet Process Mixture Model (DPMM). When conjugate priors are used, a Gibbs sampling procedure for the DPMM is easily constructed. A particular Gibbs sampling method is adapted in this DPMM context so that the Gibbs sampler can create new components [Neal, 1992]. The proposed SDDS algorithm is also applied to the DPMM and then compared to the Gibbs sampler and the RGSM sampler.

The experiments use high-dimensional Bernoulli data. Given the independently and identically distributed data set y = (y1, y2, ..., yN), each observation yi has m Bernoulli attributes, (yi1, yi2, ..., yim). Given the component ci to which item yi belongs, its attributes are independent of each other. The mixture components are regarded as the latent classes that produce the observed data.

The simulated data for our experiments are generated in the same way as Jain and Neal [2004] did for one high-dimensional data set: 5 components with attribute dimension 15. Experiments are run for different sizes, N = 100, 1K, and 10K. The Dirichlet process prior and the Beta prior for attributes are set to 1 and 0.1 respectively. (See Jain and Neal [2004] for further details on this model.)

Jain and Neal [2004] demonstrated their results by plotting the traces of the five highest-weight components to verify whether their algorithm can cover most of the data and detect five components. We have provided the same plots for the SDDS sampler. We observed the same relative performance for SDDS and RGSM on the DPMM as for the EMM, in terms of time per iteration and acceptance rates (results not shown here). We also compared the evolution of the likelihoods for SDDS, RGSM, and Gibbs as a function of running time (Figure 6). We see that, for N = 100, RGSM arrives at high-probability states about 0.5 seconds after the SDDS algorithm does. When N = 1K, SDDS gains 70 seconds over RGSM. When N = 10K, SDDS outperforms RGSM by more than 2000 seconds. We conclude from this limited sample of runs that the advantage of SDDS over RGSM increases with data set size.

Figure 6: Comparison of the evolution of likelihood over time for different data sizes. (a) Data size N=100 (b) Data size N=1K (c) Data size N=10K

6 CONCLUSION

We have described the SDDS algorithm, which achieves efficient split–merge inference by combining smart and dumb proposals. The idea is illustrated on the entity/mention model. For the smart split proposal, we proposed a fast and exact way of generating splits that are consistent with the data. Experiments on the entity/mention model and Dirichlet process mixture models suggest that the SDDS algorithm mixes faster than previously known algorithms, and that its advantage increases with data set size.

References

Gilles Celeux, Merrilee Hurn, and Christian P. Robert. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 2000.

J. Chang and J. W. Fisher. Parallel sampling of DP mixture models using sub-cluster splits. In NIPS, 2013.

David B. Dahl. An improved merge-split sampler for conjugate Dirichlet process mixture models. Technical report, University of Wisconsin - Madison, 2003.

Andrew Gelman. Iterative and non-iterative simulation algorithms. In Computing Science and Statistics: Proceedings of the 13th Symposium on the Interface, 1992.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 1984.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.

Michael Hughes, Emily Fox, and Erik Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In NIPS, 2012.

Sonia Jain and Radford Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 2004.
Zia Khan, T. Balch, and F. Dellaert. Multitarget tracking with split and merged measurements. In CVPR, 2005.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 1953.

Radford M. Neal. Bayesian mixture modeling. In Proceedings of the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, 1992.

Mathias Niepert and Pedro Domingos. Exchangeable variable models. In ICML, 2014.

Hanna Pasula, Stuart Russell, Michael Ostland, and Ya'acov Ritov. Tracking many objects with many sensors. In IJCAI, 1999.

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In NIPS. MIT Press, 2003.

Santu Rana, Dinh Phung, and Svetha Venkatesh. Split-merge augmented Gibbs sampling for hierarchical Dirichlet processes. In Advances in Knowledge Discovery and Data Mining. Springer, 2013.

Luke Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 1994.

Chong Wang and David M. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. CoRR, 2012.

Sinead Williamson, Avinava Dubey, and Eric P. Xing. Parallel Markov chain Monte Carlo for nonparametric mixture models. In ICML, 2013.