Smart-Dumb MCMC Algorithm for Clustering

1) The paper proposes a new "smart-dumb/dumb-smart" (SDDS) algorithm for efficient split-merge Markov chain Monte Carlo (MCMC). 2) The SDDS algorithm combines two kernels - one with a smart split move and dumb merge move, and the other with a dumb split move and smart merge move. 3) Previous approaches that used only smart split and merge moves did not lead to high acceptance probabilities or rapid convergence due to asymmetry between state spaces. 4) The SDDS algorithm maintains high acceptance rates and converges reasonably quickly even for large datasets, outperforming previous methods like restricted Gibbs split-merge.

Uploaded by

Ciprian Cornea
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views10 pages

Smart-Dumb MCMC Algorithm for Clustering

1) The paper proposes a new "smart-dumb/dumb-smart" (SDDS) algorithm for efficient split-merge Markov chain Monte Carlo (MCMC). 2) The SDDS algorithm combines two kernels - one with a smart split move and dumb merge move, and the other with a dumb split move and smart merge move. 3) Previous approaches that used only smart split and merge moves did not lead to high acceptance probabilities or rapid convergence due to asymmetry between state spaces. 4) The SDDS algorithm maintains high acceptance rates and converges reasonably quickly even for large datasets, outperforming previous methods like restricted Gibbs split-merge.

Uploaded by

Ciprian Cornea
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

A Smart-Dumb/Dumb-Smart Algorithm for Efficient Split-Merge MCMC

Wei Wang
LIP6, Université Pierre et Marie Curie, 75005 Paris
[Link]@[Link]

Stuart Russell
Computer Science Division, University of California, Berkeley, CA 94720
russell@[Link]

Abstract

Split-merge moves are a standard component of MCMC algorithms for tasks such as multi-target tracking and fitting mixture models with unknown numbers of components. Achieving rapid mixing for split-merge MCMC has been notoriously difficult, and state-of-the-art methods do not scale well. We explore the reasons for this and propose a new split-merge kernel consisting of two sub-kernels: one combines a "smart" split move that proposes plausible splits of heterogeneous clusters with a "dumb" merge move that proposes merging random pairs of clusters; the other combines a dumb split move with a smart merge move. We show that the resulting smart-dumb/dumb-smart (SDDS) algorithm outperforms previous methods. Experiments with entity-mention models and Dirichlet process mixture models demonstrate much faster convergence and better scaling to large data sets.

1 INTRODUCTION

Markov chain Monte Carlo (MCMC) algorithms have become a central pillar of statistical inference and machine learning. MCMC algorithms repeatedly apply a stochastic kernel transformation to an initial state, generating a random walk through the sample space whose stationary distribution matches a desired target distribution. Within the general Metropolis-Hastings (MH) family of MCMC methods, which includes many specific algorithms that have proven useful in practice, the kernel is built from a proposal step followed by a stochastic acceptance step whose probability is determined by the chosen proposal distribution. One key to efficient MH inference, then, is proposal design.¹

This paper focuses on MH proposal design for split-merge moves. Split and merge moves, which form a complementary pair comprising a kernel, are useful for problems where an MCMC state can be thought of as consisting of a number of components or clusters, each of which is responsible for some subset of observations. The canonical example is the family of mixture models: a split move in such a model converts one mixture component into two, dividing the observations of the original component between the two new components; a merge move combines two components and their observations into a single component [Dahl, 2003; Pasula et al., 2003; Jain and Neal, 2004]. Split-merge moves are also common in multitarget tracking, where a "component" is a single track joining together observations of an object over multiple time steps [Pasula et al., 1999; Khan et al., 2005]. The state of the art for split-merge MCMC in mixture models is considered to be the restricted Gibbs split-merge (RGSM) algorithm of Jain and Neal [2004].

The general idea followed in most MH proposal designs is to make the proposal "smart" by preferentially proposing states with higher probability according to the target distribution. This tends to give higher acceptance probabilities. The Gibbs sampler [Geman and Geman, 1984] is the quintessential smart proposal: by proposing values for some subset of variables exactly in proportion to their probability, Gibbs sampling has an acceptance probability of 1. In the context of split-merge moves, a smart split would be one that favors splitting a heterogeneous cluster to produce two more homogeneous ones, and a smart merge would favor merging similar clusters to ensure a homogeneous result. As our analysis and experiments show, however, combining smart split and merge moves does not lead to high acceptance probabilities and rapid convergence. The reason is the asymmetry between subspaces with K and K+1 components: there are far more states in the latter than the former, whereas in the case of Gibbs sampling the source and target subspaces are identical.

¹ Other approaches include parallelization [Chang and Fisher, 2013; Williamson et al., 2013] and taking advantage of symmetry [Niepert and Domingos, 2014]. The work reported in this paper can easily be combined with these approaches.
Our proposed solution, the smart-dumb/dumb-smart (SDDS) algorithm, combines two kernels in parallel: one has a smart split move and a "dumb" merge move that proposes merging random pairs of clusters; the other combines a dumb split move with a smart merge move. We show that the resulting algorithm performs well in practice, maintaining high acceptance rates and converging reasonably quickly even for large data sets with many clusters, where RGSM and other algorithms fail. To our knowledge, the closest relative of SDDS is a parallel-computation MCMC algorithm due to Chang and Fisher [2013], who mention the use of a random split move as the complement of a merge move in one part of a rather complex algorithm. A second contribution of our paper is a fast, exact method for sampling a split move from the posterior over all possible splits of a given component, i.e., an efficient block-Gibbs proposal for splits.

The paper begins (Section 2) with background material on MCMC. Section 3 describes, for expository purposes, the entity/mention model (EMM), a very simple Bayesian mixture model with observations that are discrete tokens, and examines split-merge MCMC in the context of the EMM. Section 4 describes the SDDS algorithm in detail. Finally, Section 5 evaluates SDDS in comparison to other approaches, both on EMM data and on data from a Dirichlet process mixture model (DPMM).

2 MCMC METHODS

Here we provide a brief review of the relevant aspects of Metropolis-Hastings MCMC. Let X be a sample space (a set of possible worlds); a sample point x ∈ X will be called a state. Let π(·) be a target distribution of interest, such as the posterior distribution on X given some evidence. The goal is to generate samples from π, or something close to it, so as to answer queries. A standard MCMC algorithm constructs a Markov chain from a transition kernel P(x′|x), such that the unique stationary distribution of the chain is π; of primary concern is the rate of convergence of the Markov chain to its stationary distribution.

The Metropolis-Hastings (MH) algorithm [Metropolis et al., 1953; Hastings, 1970] is a general template for building transition kernels with the desired property. Each transition is built from two steps: first, a new state x′ is proposed from a proposal distribution q(x′|x); then the new state is accepted with a probability given by

    α(x′|x) = min{ 1, [π(x′) q(x|x′)] / [π(x) q(x′|x)] }.    (1)

(If the proposal is not accepted, the new state is the same as the current state.) The ratio appearing in this expression is called the MH ratio; we have written it as the product of the state ratio π(x′)/π(x) and the proposal ratio q(x|x′)/q(x′|x). The only "free parameter" in designing an MH algorithm is the proposal q(·|·), and it is this that determines the rate of convergence.

Gibbs sampling can be understood as a special case of MH [Gelman, 1992]. In its simplest form, it chooses a variable Xi uniformly at random from the set of n variables whose values define the state x and proposes a value x′i from the distribution π(Xi | x−i), where x−i denotes the current values for all variables other than Xi. Because this proposal distribution is proportional to the state probability, the proposal ratio for Gibbs is exactly the inverse of the state ratio. For the case where xi and x′i coincide, both ratios are 1; when they differ, we have

    q(x|x′)/q(x′|x) = q(xi, x−i | x′i, x−i) / q(x′i, x−i | xi, x−i)
                    = [(1/n) π(xi | x−i)] / [(1/n) π(x′i | x−i)]
                    = π(xi, x−i) / π(x′i, x−i) = π(x) / π(x′).    (2)

Hence the acceptance probability in (1) is exactly 1 for Gibbs sampling.

Having a high acceptance probability, or at least one bounded away from zero, is a necessary but not sufficient condition for rapid mixing in MH. Moves can be accepted with high probability, but if those moves fail to lead the chain from one local maximum in π to another, overall mixing may still be slow. Thus, good proposal design is concerned with both acceptance probabilities and the ability to traverse the state space without getting stuck in local maxima.

3 THE ENTITY/MENTION MODEL

The entity/mention model or EMM is a very simple form of mixture model, defined here for the purposes of exposition. The EMM posits a certain (unknown) number of entities that are referred to by some set of mentions. For example, there is a person who may variously be referred to as "Barack Obama", "the President", "POTUS", and so on. Given a collection of mentions of various entities, for example in newspaper text, the task is to figure out how many entities exist, which mentions refer to which entities, and thence the ways in which any given entity may be mentioned. In its simplest form, the EMM assumes that each mention is a token with no internal structure, drawn from a fixed, known set of tokens. This renders the model less interesting than the models used in NLP research, but has the advantage of simplifying the analysis.

3.1 The EMM probability model

We assume N mentions and L possible tokens. An EMM model is composed from the following variables and conditional distributions:

• K, the number of entities, drawn from a prior P(K).
• For each entity k, a dictionary θk, i.e., a categorical distribution over L tokens, drawn from a Dirichlet(αk). The set of dictionaries {θ1, ..., θK} is represented by Θ.

• For each mention mn, the entity Sn for that mention is drawn u.a.r. from the set of K entities, and the token for that mention is drawn from θSn. The (unknown) entities {S1, ..., SN} are represented by S and the observed mentions {m1, ..., mN} by m.

The EMM resembles a topic model for a single document with an unknown number of topics (entities).

In the experiments described below, we use a broad prior for K, namely a discretized log-normal distribution with location parameter µ and scale parameter σ on a logarithmic scale:

    P(K) = (1/C) · (1 / (K σ √(2π))) · exp(−(log K − µ)² / (2σ²))

where C is an additional normalization factor arising from discretizing the distribution.

Because the entity for any given mention is assumed to be chosen u.a.r. from the available entities, we have

    P(S|K) = (1/K)^N.

Rather than sample the dictionaries, we will integrate them out exactly, taking advantage of properties of the Dirichlet. In particular, we have

    P(m | K, S) = ∫_Θ P(m, Θ | K, S) dΘ = ∏_{k∈{1:K}} B(αk + nk) / B(αk)

where B(·) is the Beta function and αk and nk are both vectors of size L, representing respectively the Dirichlet prior counts and the observed counts of each token in the dictionary.

3.2 Gibbs sampling for the EMM

Basic Gibbs sampling samples one variable at a time, which means, in our entity/mention model, sampling a new entity assignment Sn for some mention mn, conditioned on all other current assignments S−n of mentions to entities. The initial assignment is chosen at random, then the Gibbs sampler is repeated for I iterations, each cycling through all the mentions; see Algorithm 1.

Algorithm 1 Gibbs sampling for an entity/mention model
 1: procedure GIBBSSAMPLING
 2:   for n = 1 to N do
 3:     Sample Sn u.a.r. from {1, ..., K}
 4:   end for
 5:   for i = 1 to I do
 6:     for n = 1 to N do
 7:       Sample Sn from P(Sn | K, S−n, m)
 8:     end for
 9:   end for
10: end procedure

The probability P(Sn | K, S−n, m) is calculated by:

    P(Sn = k | K, S−n, m) ∝ (αk,l + nk,l) / Σ_{l′∈L} (αk,l′ + nk,l′)

where αk,l and nk,l are respectively the prior count for token l (the mention mn = l) and the observed count of token l assigned to Ek. It is divided by the sum of the prior and observed counts over all L tokens.

Each step of the Gibbs sampler is relatively easy to compute, but of course the algorithm cannot change the number of entities; moreover, it tends to get stuck on local maxima because of the local nature of the changes [Celeux et al., 2000]. Despite these drawbacks, Gibbs steps are an important element used in association with split-merge steps in order to optimize the allocation of mentions to entities.

3.3 Split-merge MCMC for the EMM

One way to explore states with different numbers of entities is to use birth and death moves. A birth move creates a new entity with no mentions, while a death move kills off an entity that has no mentions. While such moves, combined with Gibbs moves, do connect the entire state space, they lead to very slow mixing, because death moves can only occur when Gibbs moves have removed all the mentions from an entity, which is astronomically unlikely when N is much larger than K, unless the entity being killed is one that was just born.

Split and merge moves simultaneously change the number of entities and change the assignments of multiple mentions in one go. The simplest approach is to pick an entity at random and split it into two, randomly assigning its mentions to the two new entities; the merge move operates in reverse, picking two entities and merging them into one, along with their mentions. A naive implementation of this idea often fails to work, because random splits often yield a state with very low probability, preventing acceptance. Pasula et al. [2003] suggested a random mixing procedure that chooses two entities and randomly assigns each of their mentions to one of two new entities. A split occurs when the two chosen entities happen to coincide, and a merge happens when one of the new entities receives no mentions and is discarded. The approach was effective for the medium-sized data sets in their experiments (300-400 mentions, 60-80 entities) but fails when each entity has many mentions: merges become exponentially unlikely to be proposed.

Jain and Neal [2004] proposed the Restricted Gibbs Split-Merge (RGSM) algorithm to generate splits that are consistent with the data. Two random elements are chosen in the beginning. A split is proposed if these two elements belong to the same component; otherwise a merge is proposed. To split the component ck, the RGSM algorithm first assigns the elements randomly into two new components c1launch and c2launch as the launch state. Restricted Gibbs is then applied t times inside the launch state, re-assigning elements to one of the two components. The modified launch state after t restricted Gibbs steps is used for generating the split. The resulting split reflects the data to some extent and tends to have a higher likelihood. However, these intermediate restricted Gibbs steps are rather computationally expensive, especially for large data sets. Dahl [2003] proposed an allocation procedure, which works by assigning elements sequentially to two components. It starts by creating two new components c1 and c2 with two random elements. The remaining elements are sequentially allocated to either c1 or c2 using the restricted Gibbs sampler conditioned on the previously assigned elements. This procedure is more efficient than preparing the launch state in RGSM.

Split-merge has been applied to different models such as the Beta Process Hidden Markov Model [Hughes et al., 2012] and the Hierarchical Dirichlet Process [Wang and Blei, 2012; Rana et al., 2013]; on the other hand, the split-merge method itself has not been improved since RGSM was first proposed in 2004. In the next section, we examine the interaction of the MH algorithm with split-merge moves and propose a new combination of moves that seems to work better.

4 SMART AND DUMB PROPOSALS

In general, smart proposals that lead to high-probability states are preferred, as they lead to faster convergence of MCMC. What Jain and Neal [2004] did for RGSM is to avoid the low-probability states generated by random splits. As we mentioned in Section 1, "smart" proposals propose states with higher probability according to the target distribution. The Gibbs sampler is smart in this sense, because its proposal distribution is proportional to the state probability and hence the MH acceptance probability is always 1 (Eq. 2). Consequently, we might instinctively conclude that convergence efficiency can be significantly improved if we concentrate on the design of smart proposals. However, this is not true for MH in general. The MH ratio in the case of a smart merge proposal will be analyzed as an example.

Let q(x′|x) and q(x|x′) be respectively a smart merge proposal and a smart split proposal. Each first picks a subset of the variables to merge (or split) with probability Pm (or Ps). It then proposes a particular merge (or split) according to the target distribution fm (or fs), where fm and fs are proportional to state probabilities:

    fm = π(x′) / Σ_{ω∈W(x)} π(ω),    fs = π(x) / Σ_{ω∈W(x′)} π(ω),    (3)

where W(x) and W(x′) are respectively the set of states for all possible merges and the set of states for all possible splits, given Pm and Ps.

The MH ratio is then given by

    [π(x′)/π(x)] · [q(x|x′)/q(x′|x)]
      = [π(x′)/π(x)] · [Ps π(x) / Σ_{ω∈W(x′)} π(ω)] / [Pm π(x′) / Σ_{ω∈W(x)} π(ω)]
      = (Ps / Pm) · [Σ_{ω∈W(x)} π(ω) / Σ_{ω∈W(x′)} π(ω)].    (4)

When both split and merge are smart, the ratio Ps/Pm will be very low due to a quite small Ps. This is because the smart split does not give much probability mass to the partition that the smart merge prefers to merge (detailed examples are given in the next subsection). The second factor in Formula 4, namely the space ratio, is important in the split-merge case because of the space asymmetry. The sum Σ_{ω∈W(x)} π(ω) in the numerator is just π(x), since once we have chosen two entities to merge (Pm), there is only one merge possibility. However, W(x′) contains 2^n possible splits, where n is the number of mentions assigned to the picked entity.

The Ps/Pm ratio gives a first indication of why smart proposals conflict with their inverse smart proposals. In the presence of space asymmetry, a smart-smart proposal also suffers from the space ratio in addition to the Ps/Pm ratio.

4.1 Why smart proposals do not work by themselves: a simple example

Let us take a concrete example of split and merge to illustrate the problem stated above for smart proposals. Assume there are three entities in state x:

    x: {E1: {A A B B}; E2: {C C}; E3: {C C}}.

A desired split would divide E1 into two entities as follows:

    x′1: {E1: {A A}; E4: {B B}; E2: {C C}; E3: {C C}}.

A smart split proposal distribution should propose the state x′1 with quite high probability, as illustrated in the left part of Figure 1, where the width of an arrow indicates the probability value. However, a smart merge proposal starting from state x′1 would rather assign a high probability q(x″1|x′1) to proposing the state x″1:

    x″1: {E1: {A A}; E4: {B B}; E2: {C C C C}}.

From the point of view of a smart merge, inverting the smart split's preferred move, q(x|x′1), is highly unlikely, as represented by the dashed arrow in the figure.
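The space asymmetry can be illustrated numerically. The sketch below (our own illustration, not code from the paper; a symmetric Dirichlet prior α = 1 over a two-token alphabet) enumerates the 2^n binary splits of the entity {A A B B} and weights each by the Dirichlet-multinomial marginal B(α + n)/B(α) used throughout Section 3:

```python
from itertools import product
from math import lgamma, exp

ALPHA = 1.0          # illustrative symmetric Dirichlet prior per token
TOKENS = ("A", "B")

def marginal(mentions):
    """Dirichlet-multinomial marginal B(alpha + n) / B(alpha) for one entity."""
    counts = [mentions.count(t) for t in TOKENS]
    log_num = (sum(lgamma(ALPHA + c) for c in counts)
               - lgamma(len(TOKENS) * ALPHA + sum(counts)))
    log_den = len(TOKENS) * lgamma(ALPHA) - lgamma(len(TOKENS) * ALPHA)
    return exp(log_num - log_den)

entity = ["A", "A", "B", "B"]
# W(x'): the 2^n binary allocations of the mentions -- the split neighbourhood.
weights = {}
for assign in product((0, 1), repeat=len(entity)):
    part0 = [m for m, a in zip(entity, assign) if a == 0]
    part1 = [m for m, a in zip(entity, assign) if a == 1]
    weights[assign] = marginal(part0) * marginal(part1)

total = sum(weights.values())
best = max(weights, key=weights.get)   # the homogeneous split {A A} | {B B}
print(len(weights), round(weights[best] / total, 3))   # prints: 16 0.152
```

Even the single most plausible split, {A A} | {B B}, receives only about 15% of the smart split's mass here, while a chosen merge has exactly one candidate state; this is the space ratio Σ_{ω∈W(x)} π(ω) / Σ_{ω∈W(x′)} π(ω) at work in Formula 4.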
Therefore, considering the high value of q(x′1|x), the proposal ratio becomes extremely low, which leads to a very low acceptance rate for smart split proposals.

Figure 1: Conflicts between smart split and smart merge. (a) smart split; (b) smart merge. (Red lines denote splits and green lines merges; thick and dashed lines denote moves preferred and dispreferred by smart proposals.)

The situation is similar for smart merge proposals, as illustrated in the right part of Figure 1. A smart merge proposal will give a high probability to generating the state x′2:

    x′2: {E1: {A A B B}; E2: {C C C C}}.

On the other hand, the smart split from the state x′2 will be more likely to generate a state such as

    x″2: {E1: {A A}; E3: {B B}; E2: {C C C C}},

leaving only a tiny probability for the move from x′2 back to x. The same phenomenon, a low acceptance rate, arises.

For the particular case of split and merge, a new entity created by the split proposal changes the parameters of the space, which means that, for one split and its inverse merge, there are many more possibilities for splits than for merges. This neighborhood issue makes the smart merge proposal even harder to accept, as shown in Formula 4.

4.2 Coupling smart and dumb proposals

Smart proposals are not effective in this case because the entity which we choose to split does not correspond to the entities that we prefer to merge in the reverse direction. However, if we want to give a higher probability to the reverse move, q(x|x′1) for instance, it can be treated as a dumb proposal rather than a smart one. "Dumb" proposals can be considered as distributions that give uniform probability mass over all possible moves.

An important property of MCMC methods is that detailed balance can be guaranteed when several different proposals are adopted, on condition that each proposal individually satisfies the Metropolis-Hastings condition [Tierney, 1994]. Therefore, there is a way to combine smart and dumb proposals, namely the Smart-Dumb/Dumb-Smart (SDDS) proposals, as illustrated in Figure 2.

Figure 2: Smart-Dumb Dumb-Smart design for proposals.

As a consequence, there are two separate pairs of proposal distributions. In each pair, the dumb proposal gives a uniform distribution over all possible moves. The existence of the dumb proposals helps the acceptance of the smart proposals. In the context of these two pairs, both smart split and smart merge can produce higher acceptance rates and faster mixing.

4.3 An SDDS split-merge algorithm

Inside the SDDS algorithm, we want to propose high-probability states with smart splits and smart merges, and to do so efficiently. The algorithm for the smart split proposal with dumb merge and the one for the smart merge proposal with dumb split are described in Algorithm 2 and Algorithm 3 respectively.

Smart split/dumb merge proposal. Algorithm 2 begins by choosing randomly between a smart split proposal and a dumb merge proposal. If a smart split is picked, it first chooses one entity Ek according to a function fsplit(Ek). The function is inversely proportional to the likelihood of the mentions mEk associated with this entity Ek, which implies that the proposal tends to choose large and mixed entities. The likelihood is given by B(αk + nk)/B(αk), where αk and nk are respectively the vectors of Dirichlet prior counts and observed counts of each token in mEk.

Once the entity Ek is chosen, the smart split procedure allocates each mention sequentially to one of two newly created entities E′1 and E′2, according to the likelihood of the previously assigned mentions. Given the entity Ek = {A A B B C C}, for example, the procedure is illustrated in Table 1.
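A minimal Python sketch of this sequential allocation (the function name and the fixed mention order are ours; α = 1 per token, as in the worked example) computes the proposal probability of a given allocation as the product of the per-step factors:

```python
ALPHA = 1.0                       # alpha_{1,l} = alpha_{2,l} = 1, as in the example
TOKENS = ("A", "B", "C")

def allocation_probability(mentions, assignment):
    """Probability of sequentially allocating `mentions` (in the fixed order
    given) to entities 1 or 2 according to `assignment`, under the per-token
    allocation rule: each weight is (prior + observed count of the token)
    normalized by the total prior and observed counts in that entity."""
    counts = {1: dict.fromkeys(TOKENS, 0), 2: dict.fromkeys(TOKENS, 0)}
    prob = 1.0
    for token, target in zip(mentions, assignment):
        w = {}
        for e in (1, 2):
            w[e] = (ALPHA + counts[e][token]) / (
                ALPHA * len(TOKENS) + sum(counts[e].values()))
        prob *= w[target] / (w[1] + w[2])   # normalize across the two entities
        counts[target][token] += 1          # counts are updated at each step
    return prob

# The split of {A A B B C C} into E1' = {A A}, E2' = {B B C C}:
p = allocation_probability("AABBCC", (1, 1, 2, 2, 2, 2))
print(round(p, 4))   # prints: 0.0419
```

Because the order of the mentions is fixed, the same function also gives the exact reverse-split probability needed when a merge move must be inverted.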
Algorithm 2 Smart Split and Dumb Merge proposal
 1: procedure SMARTSPLITDUMBMERGE
 2:   choose a move type: type ∼ (split, merge)
 3:   if type == split then                ▷ smart split
 4:     choose one entity to split:
 5:       Ek ∼ fsplit(Ek),  where fsplit(Ek) ∝ 1 / P(mEk | Ek)
 6:     create two new empty entities E′1 and E′2
 7:     assign each mention in Ek sequentially to E′1 or E′2 according to
        the likelihood of previously assigned mentions
 8:   else                                 ▷ dumb merge
 9:     choose one entity uniformly from the K entities
10:     choose another entity uniformly from the remaining K−1 entities
11:   end if
12:   calculate the acceptance ratio α
13:   apply the proposal with probability α
14: end procedure

Algorithm 3 Smart Merge and Dumb Split proposal
 1: procedure SMARTMERGEDUMBSPLIT
 2:   choose a move type: type ∼ (split, merge)
 3:   if type == merge then                ▷ smart merge
 4:     choose one entity Ei uniformly from the K entities
 5:     choose another entity:
          Ej ∼ fmerge(Ej | Ei),  where fmerge(Ej | Ei) ∝ P(mEi,Ej, Ej | Ei)
 6:   else                                 ▷ dumb split
 7:     choose one entity Ek uniformly from the K entities
 8:     create two new empty entities E′1 and E′2
 9:     assign each mention in Ek sequentially to E′1 or E′2 with equal probability
10:   end if
11:   calculate the acceptance ratio α
12:   apply the proposal with probability α
13: end procedure

Table 1: Splitting one entity into two entities sequentially

    step   0    1     2     3      4      5     6
    E′1    –    A     AA    AA     AA     AA    AA
    E′2    –    –     –     B      BB     BBC   BBCC
    P      –    0.5   0.6   0.625  0.714  0.5   0.625

The two newly created entities are empty at the beginning. At each step, the allocation probability for a mention mi is calculated by:

    P(E′1 | mi) ∝ (α1,l + n1,l) / Σ_{l′∈L} (α1,l′ + n1,l′),
    P(E′2 | mi) ∝ (α2,l + n2,l) / Σ_{l′∈L} (α2,l′ + n2,l′),    (5)

where α1,l and α2,l are the Dirichlet priors for the token l (the mention mi = l, which is A, B, or C in this case) being assigned to entity E′1 or E′2, and n1,l and n2,l are the current observed counts of token l assigned to E′1 and E′2 (counts are updated at each step). The denominator sums the prior and observed counts over all possible tokens (A, B, and C in this case). The probability of each step is given in the table, taking all α1,l = α2,l = 1 as an example (a smaller alpha makes this procedure more discriminating). The probability of the whole allocation procedure is then the product of the probabilities of each step. This procedure avoids the time-consuming restricted Gibbs sampling adopted by Jain and Neal [2004]. Dahl [2003] proposed a similar sequential procedure, but it starts by creating two new entities with two random mentions, which causes an initial bias when these two random mentions are supposed to be associated with the same entity.

The reverse dumb merge proposal generates random merges. It picks one entity uniformly from the K entities and then picks another from the remaining K−1 entities. The probability of this reverse proposal is then 2/(K(K−1)) (there are two orders of choosing the two entities).

If a dumb merge proposal is chosen in the beginning, the choice of two entities has probability 1/(K(K−1)). We then need to know the probability of the split proposal that reverses this move, i.e., allocates the mentions exactly into the two given sets. The order of the mentions during allocation influences the final probability, and in principle the probabilities from all possible orders should be summed, which is not feasible in practice. Dahl [2003] applied a random permutation to the order of mentions, which may be critical to the correctness of the MCMC method: in this way we obtain a random probability for the reverse split when a merge is proposed, and it may not correspond to the exact probability of the inverse move. In our case, we fix a unique order for all mentions, so that there is only one way of applying the procedure and thus only one possible probability value for the same split result, whether for the real split procedure or the imagined reverse one. This unique order is chosen arbitrarily. The choice of any particular order has no influence on the inference, but the same order should be kept throughout the experiments.
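Putting these pieces together, the proposal ratio q(x|x′)/q(x′|x) for the smart-split/dumb-merge kernel can be assembled as in the sketch below (our own bookkeeping helper, not code from the paper; here K counts the entities before the split, and the ½ factors for the move-type choice cancel but are kept for clarity):

```python
def split_proposal_ratio(K, f_split_k, p_alloc):
    """Proposal ratio q(x|x') / q(x'|x) for a smart split from a state with
    K entities, reversed by a dumb merge over the resulting K + 1 entities.

    f_split_k -- probability that f_split picks the chosen entity E_k
    p_alloc   -- probability of the sequential allocation (product of per-step factors)
    """
    q_forward = 0.5 * f_split_k * p_alloc    # choose 'split', pick E_k, allocate
    q_reverse = 0.5 * 2.0 / ((K + 1) * K)    # choose 'merge', pick the pair (two orders)
    return q_reverse / q_forward
```

For example, with K = 10, f_split mass 0.2 on the chosen entity, and allocation probability 0.0419, the ratio is about 2.17; the MH acceptance probability then multiplies this by the state ratio π(x′)/π(x), as in Eq. (1).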
Smart merge/dumb split proposal. Algorithm 3 also begins by making a random choice between merge and split. If the smart merge proposal is picked, it first chooses one entity Ei uniformly at random from the K entities. The choice of the second entity Ej is based on how likely the result is when Ej is merged with Ei. The entity Ej is drawn from a distribution given by the function fmerge(Ej|Ei), which is proportional to the likelihood of the merge of Ej into Ei. The likelihood of the merge is calculated by:

    P(mEi,Ej, Ej | Ei) = B(αi + ni + nj) / B(αi)    (6)

where mEi,Ej refers to all mentions assigned to Ei and Ej, αi is the vector of Dirichlet priors, and ni + nj is the vector of observed counts of each token in mEi,Ej.

The reverse dumb split proposal chooses one entity Ek uniformly from the K entities and allocates each mention randomly to one of two newly created entities; its probability is 1/(K · 2^n), where n is the number of mentions assigned to entity Ek.

If the dumb split proposal is chosen, it generates a random split with the same probability 1/(K · 2^n). As for the reverse smart merge proposal, we need to consider the merge in two different orders, namely fmerge(Ej|Ei) and fmerge(Ei|Ej), where fmerge(Ei|Ej) is proportional to P(mEj,Ei, Ei|Ej), which can be calculated similarly to Formula 6.

5 EXPERIMENTS WITH SDDS

5.1 Applying SDDS to the EMM

We applied the SDDS algorithm to the entity/mention model. During each inference step, the SDDS sampler chooses uniformly between the smart-split/dumb-merge proposal and the dumb-split/smart-merge proposal; this is then complemented by a single Gibbs sampling step.² It is worth emphasizing again that detailed balance is satisfied when each sub-kernel fulfills the detailed balance condition individually. Three other algorithms are tested for comparison. The first is a random mixing (RM) sampler inspired by [Pasula et al., 2003]. It works by choosing two random entities (which could be the same one) and then distributing the corresponding mentions into two newly created entities at a randomly chosen split point. The second is the RGSM sampler [Jain and Neal, 2004].³ The third is a smart-smart (SS) sampler, which has the same smart split and merge moves as SDDS but lacks the paired dumb moves.

The simulated data sets use 10 different tokens (A, B, C, etc.). Different configurations were tested, including different data set sizes (N = 100, 200, 500) and different initial entity numbers (K0 = 1, 5, 10, 20). All Dirichlet priors α are set to 0.001. For the log-normal prior on the number of entities K, the location parameter is set to 0.5 (10 entities) and the scale parameter is set to 1. Experiments were run for 10K/50K/200K iterations, with a time-out at 10 hours.

We are interested in the posterior distribution of K, the number of entities, given the available evidence. We illustrate the evolution of the expected value of K for the various algorithms as a function of the number of iterations, for different values of N and K0. We also give the acceptance rates and the time needed per iteration for comparison.

Posterior distribution of entity numbers. The posterior distributions of entity numbers are analyzed via the mean of the value K over iterations. First of all, we fix the initial value K0 = 5 to compare the results of the four algorithms for different data sizes, as shown in Figure 3. We can see that for N = 100 and N = 200, all samplers except the random mixing sampler converge to the correct posterior K = 10. However, when the larger data set N = 500 is used, the RGSM sampler fails to achieve successful splits or merges, and the value of K therefore remains trapped at its initial value. The SS sampler is capable of applying several effective splits, whereas its merge proposal does not work well, for the reason we discussed in Section 4.1.

For the data set N = 500, we have also analyzed the sensitivity of the samplers to the initial value K0, as illustrated in Figure 4. It is easy to observe that the RGSM sampler is always trapped at the initial K0 in this case. The SS sampler can generate well-assigned mentions for entities during the smart split procedure, but it cannot converge to the true posterior. It is interesting to see from the results of the random mixing sampler that it works well when the initial value is twice as large as the true value. In fact, in the RM sampler, the random split procedure encourages the acceptance of merge proposals. However, the random split proposals themselves are not likely to be accepted, because they generate very low-probability states. The SDDS sampler outperforms all the others, converging to the true posterior regardless of the initial value.

It is worth mentioning that the inferences start from random allocations, which generate an ensemble of messed-up entities. It is not really useful to merge messed-up entities, and the acceptance rates for merging them should be very low as well. It is therefore logical to observe many accepted splits at the beginning of the inference with the SDDS algorithm. The SDDS algorithm generates more well-formed entities and then yields higher probabilities for merging them.

² The different ways of combining a designed sampler with the Gibbs sampler is an issue to be investigated; we adopted this configuration for all tested algorithms for comparison.
3
The stable version (5, 1, 1) is adopted for comparison, which Acceptance rates The relation between the size of data
contains 5 intermediate Restricted Gibbs steps inside one MH step set and the acceptance rates for both split and merge are
and then one complete Gibbs sampling. shown in Figure 5. We can observe that, for the data set
(a) N =100 (b) N =200 (c) N =500

Figure 3: The mean of K during iteration for different data size, with the initial K0 =5

(a) Random Mixing (b) Restricted Gibbs Split-Merge (c) Smart Split Smart Merge (d) Smart-Dumb Dumb-Smart

Figure 4: The mean of K during iteration for different initial K0 =1,5,10,20, with the data size N=500

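As a concrete reminder of what these rates measure, here is a minimal sketch (illustrative names, not the paper's code) of the Metropolis-Hastings acceptance test for a smart-split move whose reverse move is the paired dumb merge, together with the dumb split probability 1/(K · 2^n) described above.

```python
import math
import random

def mh_accept(log_p_new, log_p_old, log_q_reverse, log_q_forward):
    """Metropolis-Hastings test: accept x -> x' with probability
    min(1, p(x') q(x|x') / (p(x) q(x'|x))), computed in log space.
    For a smart split, log_q_forward is the smart-split proposal
    probability and log_q_reverse is that of the paired dumb merge."""
    log_alpha = (log_p_new - log_p_old) + (log_q_reverse - log_q_forward)
    return math.log(random.random()) < min(0.0, log_alpha)

def log_q_dumb_split(K, n):
    """Dumb split probability 1/(K * 2**n): pick one of K entities
    uniformly, then send each of its n mentions to one of two new
    entities by a fair coin flip."""
    return -math.log(K) - n * math.log(2.0)
```

Pairing each smart move with a dumb reverse move keeps the proposal ratio q(x|x') / q(x'|x) tractable, which is what sustains the acceptance rates reported here.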
For the data set N = 100, the RGSM sampler has high acceptance rates for both split (7.8%) and merge (2.1%). However, when the data size is N = 200, the acceptance rates drop sharply, to only 1.1% for split and 0.08% for merge. When the data size grows to N = 500, almost no effective split or merge occurs. In contrast, the smart proposals in the SDDS algorithm, both smart split and smart merge, maintain satisfactory acceptance rates even for N = 500: 1.9% for smart split and 4.8% for smart merge. For the SS sampler, the drop in acceptance rates is similar to that of the RGSM sampler, since there are no dumb proposals to support the corresponding smart proposals.

Time per iteration: As stated previously, the allocation procedure in our smart split proposal is less time-consuming than the restricted-Gibbs-based split proposal. The running time per iteration of each algorithm is shown in Figure 5. For the RGSM sampler, the time spent per iteration grows quickly with the data size, whereas the time for the SDDS algorithm remains stable. In the case of N = 500, the RGSM sampler takes 488.63 milliseconds per step, while the SDDS algorithm takes only 10.02 milliseconds per step. The times per iteration for the random mixing sampler and the SS sampler (not shown in the figure) are on the same scale as the SDDS sampler's and remain stable across data sizes.

Figure 5: Acceptance rates and time per iteration for different data sizes ((a) acceptance rates for split, (b) acceptance rates for merge, (c) time per iteration).

5.2 Applying SDDS to the conjugate Dirichlet Process Mixture Model

The RGSM algorithm was originally proposed for the Dirichlet Process Mixture Model (DPMM). When conjugate priors are used, a Gibbs sampling procedure for the DPMM is easily constructed; in particular, a Gibbs sampling method adapted to this setting allows the sampler to create new components [Neal, 1992]. The proposed SDDS algorithm is also applied to the DPMM and compared to the Gibbs sampler and the RGSM sampler.

The experiments use high-dimensional Bernoulli data. Given the independently and identically distributed data set y = (y1, y2, ..., yN), each observation yi has m Bernoulli attributes (yi1, yi2, ..., yim). Conditional on the component ci to which item yi belongs, its attributes are independent of each other. The mixture components are considered as the latent classes that produce the observed data.

The simulated data for our experiments are generated in the same way as Jain and Neal [2004] did for one high-dimensional data set: 5 components with attribute dimension 15. Experiments are run for different sizes, N = 100, 1K, and 10K. The Dirichlet process prior and the Beta prior for the attributes are set to 1 and 0.1, respectively. (See Jain and Neal [2004] for further details on this model.)

Jain and Neal [2004] demonstrated their results by plotting the traces of the five highest-weight components.

Figure 6: Comparison of the evolution of the likelihood over time for different data sizes ((a) N = 100, (b) N = 1K, (c) N = 10K).
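A minimal sketch of generating such synthetic Bernoulli mixture data (illustrative code, not from the paper; it assumes uniform mixing weights and a symmetric Beta(0.1, 0.1) prior on the per-component attribute probabilities, matching the 5-component, 15-dimensional setup above):

```python
import random

def sample_bernoulli_mixture(n_items, n_components=5, dim=15,
                             beta=0.1, seed=1):
    """Draw items from a finite Bernoulli mixture: each component has
    `dim` independent Bernoulli attributes whose success probabilities
    are sampled from a symmetric Beta(beta, beta) prior."""
    rng = random.Random(seed)
    theta = [[rng.betavariate(beta, beta) for _ in range(dim)]
             for _ in range(n_components)]
    data, labels = [], []
    for _ in range(n_items):
        c = rng.randrange(n_components)  # uniform mixing weights (assumed)
        data.append([1 if rng.random() < theta[c][j] else 0
                     for j in range(dim)])
        labels.append(c)
    return data, labels

data, labels = sample_bernoulli_mixture(100)
```

The ground-truth labels are kept only for evaluation; the samplers see just the binary matrix.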
Such trace plots verify whether an algorithm covers most of the data and detects the five components; we have produced the same plots for the SDDS sampler. We observed the same relative performance for SDDS and RGSM on the DPMM as for the EMM, in terms of time per iteration and acceptance rates (results not shown here). We also compared the evolution of the likelihoods for SDDS, RGSM, and Gibbs as a function of running time (Figure 6). We see that, for N = 100, RGSM arrives at high-probability states about 0.5 second after the SDDS algorithm does. When N = 1K, SDDS gains 70 seconds over RGSM; when N = 10K, SDDS outperforms RGSM by more than 2000 seconds. We conclude from this limited sample of runs that the advantage of SDDS over RGSM increases with data set size.

6 CONCLUSION

We have described the SDDS algorithm, which achieves efficient split-merge inference by combining smart and dumb proposals. The idea is illustrated on the entity/mention model. For the smart split proposal, we proposed a fast and exact way of generating splits that are consistent with the data. Experiments on the entity/mention model and Dirichlet process mixture models suggest that the SDDS algorithm mixes faster than previously known algorithms, and that its advantage increases with data set size.

References

Gilles Celeux, Merrilee Hurn, and Christian P. Robert. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 2000.

J. Chang and J. W. Fisher. Parallel sampling of DP mixture models using sub-cluster splits. In NIPS, 2013.

David B. Dahl. An improved merge-split sampler for conjugate Dirichlet process mixture models. Technical report, University of Wisconsin - Madison, 2003.

Andrew Gelman. Iterative and non-iterative simulation algorithms. In Computing Science and Statistics: Proceedings of the 13th Symposium on the Interface, 1992.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 1984.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.

Michael Hughes, Emily Fox, and Erik Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In NIPS, 2012.

Sonia Jain and Radford Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 2004.
Zia Khan, T. Balch, and F. Dellaert. Multitarget tracking
with split and merged measurements. In CVPR, 2005.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 1953.
Radford M. Neal. Bayesian mixture modeling. In Proceed-
ings of the 11th International Workshop on Maximum
Entropy and Bayesian Methods of Statistical Analysis,
1992.
Mathias Niepert and Pedro Domingos. Exchangeable vari-
able models. In ICML, 2014.
Hanna Pasula, Stuart Russell, Michael Ostland, and
Ya’acov Ritov. Tracking many objects with many sen-
sors. In IJCAI, 1999.
Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Rus-
sell, and Ilya Shpitser. Identity uncertainty and citation
matching. In NIPS. MIT Press, 2003.
Santu Rana, Dinh Phung, and Svetha Venkatesh. Split-merge augmented Gibbs sampling for hierarchical Dirichlet processes. In Advances in Knowledge Discovery and Data Mining. Springer, 2013.
Luke Tierney. Markov chains for exploring posterior dis-
tributions. Annals of Statistics, 1994.
Chong Wang and David M. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. CoRR, 2012.
Sinead Williamson, Avinava Dubey, and Eric P. Xing. Par-
allel Markov chain Monte Carlo for nonparametric mix-
ture models. In ICML, 2013.
