To appear in: Pattern Recognition (accepted manuscript). Reference: PR5574.
Received date: 2 April 2015. Revised date: 16 October 2015. Accepted date: 15 November 2015.
Cite this article as: Adil M. Bagirov, Sona Taheri and Julien Ugon, Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems, Pattern Recognition.
Nonsmooth DC programming approach to the minimum
sum-of-squares clustering problems
Adil M. Bagirov∗ Sona Taheri Julien Ugon
Abstract
This paper introduces an algorithm for solving the minimum sum-of-squares clustering problems using their difference of convex representations. A nonsmooth nonconvex optimization formulation of the clustering problem is used to design the algorithm. Characterizations of critical points, stationary points in the sense of generalized gradients and inf-stationary points of the clustering problem are given. The proposed algorithm is tested and compared with other clustering algorithms using large real-world data sets.
1 Introduction
Clustering is an unsupervised partitioning technique dealing with the problems of organizing a collection of patterns into clusters based on similarity. Most clustering algorithms are based on the hierarchical and partitional approaches. Algorithms based on the hierarchical approach generate a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change [22]. Partitional clustering algorithms find the partition that optimizes a clustering criterion [22]. In this paper we develop a partitional clustering algorithm; more specifically, we develop an algorithm for solving the minimum sum-of-squares clustering (MSSC) problems.
To date various heuristics such as the k-means algorithm and its variations have been
developed to solve the MSSC problem (see, for example, [23, 24] and references therein).
The global k-means algorithm and its many modifications are among the most efficient
algorithms for solving the MSSC problem [6, 12, 14, 25, 27, 28, 30].
The MSSC problem can be formulated as a mixed integer nonlinear programming or a nonconvex nonsmooth optimization problem [10, 13, 33]. Different optimization techniques have been applied to solve it. These techniques include branch and bound [18] and interior point methods [19], nonsmooth optimization algorithms [8, 11, 13], algorithms based on the hyperbolic smoothing technique [9, 36, 37], the variable neighborhood search [20], simulated annealing [31], tabu search [1] and genetic algorithms [29]. Not all of these algorithms are efficient for solving the MSSC in very large data sets.
The objective functions of clustering problems, called cluster functions, can be represented as differences of convex (DC) functions. The above-mentioned algorithms do not exploit this special structure of the clustering problem. There are several papers where
∗ Faculty of Science and Technology, Federation University Australia, Victoria, Australia; Phone: +61353276306; Fax: +61353279289; Email: [Link]@[Link]
the DC representation of the MSSC problems is used to design algorithms. In [16], the truncated codifferential method is applied to solve the MSSC using its DC representation. The branch and bound method was modified for such problems in [34] using their DC representation. In [2] an algorithm based on DC programming and DC Algorithms (DCA) is introduced. In [5], the authors use the hard combinatorial optimization model to formulate the MSSC as a DC program and propose an algorithm based on the DCA. Such an approach allows one to make simpler and less expensive computations in the resulting DCA. In [3], the DCA and a Gaussian kernel are applied to design an algorithm to solve the MSSC problem.
In this paper, we propose a new approach for solving the MSSC problems using their
DC representations. The main contributions of this paper are: (i) the characterization of
critical points, stationary points in the sense of generalized gradients and inf-stationary
points of the MSSC problem; (ii) an algorithm for solving the MSSC problem based on
its DC representation; (iii) convergence results for the algorithm. Results of numerical
experiments on some real world data sets are reported and the proposed algorithm is
compared with several other clustering algorithms. It is demonstrated that the proposed
algorithm is especially efficient for solving the MSSC problems in very large data sets.
The rest of the paper is organized as follows. Section 2 provides some preliminaries on
DC functions and nonsmooth analysis. DC representations of cluster functions are given
in Section 3. Optimality conditions for the auxiliary clustering problem are studied in
Section 4 and for the clustering problem in Section 5. Section 6 presents an algorithm
for solving clustering problems. An incremental algorithm is described in Section 7. The
implementation of algorithms is discussed in Section 8. Numerical results are reported in
Section 9 and Section 10 contains some concluding remarks.
2 Preliminaries
In this section we give some results on nonsmooth analysis and DC functions used throughout the paper. In what follows we denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space with the inner product $\langle u, v \rangle = \sum_{i=1}^{n} u_i v_i$ and the associated norm $\|u\| = \langle u, u \rangle^{1/2}$, $u, v \in \mathbb{R}^n$. $B_\varepsilon(x) = \{y \in \mathbb{R}^n : \|y - x\| < \varepsilon\}$ is the open ball centered at $x$ with radius $\varepsilon > 0$.
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Its subdifferential at $x \in \mathbb{R}^n$ is defined as:
$$\partial_c f(x) = \left\{\xi \in \mathbb{R}^n : f(y) - f(x) \ge \langle \xi, y - x \rangle \ \forall y \in \mathbb{R}^n\right\}.$$
A function $f : \mathbb{R}^n \to \mathbb{R}$ is called locally Lipschitz on $\mathbb{R}^n$ if for any bounded subset $X \subset \mathbb{R}^n$ there exists $L > 0$ such that
$$|f(x) - f(y)| \le L\|x - y\| \quad \forall x, y \in X.$$
The generalized derivative of a locally Lipschitz function $f$ at a point $x$ with respect to a direction $u \in \mathbb{R}^n$ is defined as [15]:
$$f^\circ(x, u) = \limsup_{\alpha \downarrow 0,\ y \to x} \frac{f(y + \alpha u) - f(y)}{\alpha}.$$
The subdifferential $\partial f(x)$ of the function $f$ at $x$ is:
$$\partial f(x) = \left\{\xi \in \mathbb{R}^n : f^\circ(x, u) \ge \langle \xi, u \rangle \ \forall u \in \mathbb{R}^n\right\}.$$
According to Rademacher's theorem any locally Lipschitz function defined on $\mathbb{R}^n$ is differentiable almost everywhere and its subdifferential can also be defined as:
$$\partial f(x) = \mathrm{conv}\left\{\lim_{i \to \infty} \nabla f(x^i) : x^i \to x \text{ and } \nabla f(x^i) \text{ exists}\right\}.$$
Here "conv" denotes the convex hull of a set. Each vector $\xi \in \partial f(x)$ is called a subgradient.
For a convex function $f : \mathbb{R}^n \to \mathbb{R}$ one has $\partial f(x) = \partial_c f(x)$, $x \in \mathbb{R}^n$. From now on we use the notation $\partial f$ for subdifferentials of convex functions.
Now assume that the function $f$ is directionally differentiable at $x$, that is, the limit
$$f'(x, u) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha u) - f(x)}{\alpha}$$
exists for any $u \in \mathbb{R}^n$. The function $f$ is called regular at $x$ if $f^\circ(x, u) = f'(x, u)$ for all $u \in \mathbb{R}^n$.
Definition 1. f : Rn → R is called a DC function if there exist convex functions g, h :
Rn → R such that:
f (x) = g(x) − h(x), x ∈ Rn .
Here g − h is called a DC decomposition of f while g and h are DC components of f .
A function $f$ is locally DC if for any $x_0 \in \mathbb{R}^n$ there exists $\varepsilon > 0$ such that $f$ is DC on the ball $B_\varepsilon(x_0)$. It is well known that every locally DC function is DC [21]. Note that a DC function has infinitely many DC decompositions.
An unconstrained DC program is an optimization problem of the form:
$$\text{minimize } f(x) = g(x) - h(x) \text{ subject to } x \in \mathbb{R}^n. \qquad (1)$$
In general, nonsmooth DC functions are not regular and the Clarke subdifferential calculus exists for such functions only in the form of inclusions:
$$\partial f(x) \subseteq \partial g(x) - \partial h(x), \quad x \in \mathbb{R}^n. \qquad (2)$$
Necessary optimality conditions for the problem (1) can be written as:
$$\partial h(x^*) \subseteq \partial g(x^*), \qquad (3)$$
$$0 \in \partial f(x^*) \qquad (4)$$
and
$$\partial h(x^*) \cap \partial g(x^*) \ne \emptyset. \qquad (5)$$
Points satisfying (3) are called inf-stationary, points satisfying (4) are called Clarke stationary and points satisfying (5) are called critical points of the problem (1). In general, any inf-stationary point is also a Clarke stationary and a critical point. Furthermore, any Clarke stationary point is also a critical point.
3 DC programming approach to clustering problems
In this section we give a nonsmooth optimization formulation of clustering problems and
their DC representations.
In cluster analysis we assume that we are given a finite set of points $A$ in the $n$-dimensional space $\mathbb{R}^n$, that is, $A = \{a^1, \ldots, a^m\}$, where $a^i \in \mathbb{R}^n$, $i = 1, \ldots, m$. The hard unconstrained clustering problem is the distribution of the points of the set $A$ into a given number $k$ of disjoint subsets $A^j$, $j = 1, \ldots, k$ such that:
1. $A^j \ne \emptyset$ and $A^j \cap A^l = \emptyset$, $j, l = 1, \ldots, k$, $j \ne l$;
2. $A = \bigcup_{j=1}^{k} A^j$.
The sets Aj , j = 1, . . . , k are called clusters and each cluster Aj can be identified by its
center xj ∈ Rn , j = 1, . . . , k. The problem of finding these centers is called the k-clustering
(or k-partition) problem. In order to formulate the clustering problem one needs to define
the similarity (or dissimilarity) measure. In this paper, the similarity measure is defined
using the squared Euclidean ($L_2$) distance:
$$d_2(x, a) = \sum_{i=1}^{n} (x_i - a_i)^2.$$
The MSSC problem is then formulated as the following optimization problem:
$$\text{minimize } f_k(x^1, \ldots, x^k) \text{ subject to } (x^1, \ldots, x^k) \in \mathbb{R}^{nk}, \qquad (6)$$
where
$$f_k(x^1, \ldots, x^k) = \frac{1}{m} \sum_{a \in A} \min_{j=1,\ldots,k} d_2(x^j, a). \qquad (7)$$
The function $f_k$ admits the DC representation
$$f_k(x) = f_{k1}(x) - f_{k2}(x), \qquad (8)$$
where
$$f_{k1}(x) = \frac{1}{m} \sum_{a \in A} \sum_{j=1}^{k} d_2(x^j, a), \qquad f_{k2}(x) = \frac{1}{m} \sum_{a \in A} \max_{j=1,\ldots,k} \sum_{s=1, s \ne j}^{k} d_2(x^s, a).$$
Since the function $d_2$ is convex in $x$, the function $f_{k1}$, as a sum of convex functions, is also convex. The function $f_{k2}$ is a sum of maxima of sums of convex functions: the functions under the maximum are convex as sums of convex functions, the maximum of a finite number of convex functions is convex, and hence $f_{k2}$, as a sum of convex functions, is also convex.
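The decomposition above is easy to verify numerically. The following sketch uses a hypothetical toy data set and centers; only formula (7) and the two DC components are taken from the text:

```python
# Numerical check of the DC decomposition f_k = f_k1 - f_k2 on a toy
# 2-D data set (data and centers are made up for illustration).
import math

def d2(x, a):
    """Squared Euclidean distance."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))

def f_k(centers, A):
    """Cluster function (7): mean distance to the closest center."""
    return sum(min(d2(x, a) for x in centers) for a in A) / len(A)

def f_k1(centers, A):
    """Convex component: mean of the sum of distances to all centers."""
    return sum(sum(d2(x, a) for x in centers) for a in A) / len(A)

def f_k2(centers, A):
    """Convex component: mean over a of max over j of the sum over s != j."""
    total = 0.0
    for a in A:
        dists = [d2(x, a) for x in centers]
        s = sum(dists)
        total += max(s - dj for dj in dists)  # drop one center at a time
    return total / len(A)

A = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
centers = [(0.5, 0.0), (4.5, 4.0)]
assert math.isclose(f_k(centers, A), f_k1(centers, A) - f_k2(centers, A))
print(f_k(centers, A))  # 0.25 for this configuration
```

Note that for $k = 2$ the inner maximum in $f_{k2}$ reduces to the distance to the farther center, which the code obtains by dropping one term at a time.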
Note that the similarity measure can also be defined using other norms and in par-
ticular, the L1 -norm. However, in this case the clustering function fk is more complex
and both functions fk1 and fk2 are nonsmooth whereas for d2 only the function fk2 is
nonsmooth.
Problem (6) is a global optimization problem: the objective function $f_k$ has many local minimizers and only its global minimizers provide the best cluster structure of a data set with the least number of clusters. In general, conventional global optimization methods cannot be applied to solve this problem in large data sets, so in such data sets heuristics and deterministic local search algorithms are the only choice. However, the success of these algorithms heavily depends on the choice of starting cluster centers, and the development of efficient procedures for generating starting cluster centers is crucial for the success of such algorithms. We apply an approach introduced in [28] to find starting cluster centers. This approach involves the solution of the so-called auxiliary clustering problem.
Assume that the solution $x^1, \ldots, x^{k-1}$, $k \ge 2$, to the $(k-1)$-clustering problem is known. Denote by $r_{k-1}^a$ the distance between the data point $a \in A$ and the closest cluster center among the $k-1$ centers $x^1, \ldots, x^{k-1}$:
$$r_{k-1}^a = \min\left\{d_2(x^1, a), \ldots, d_2(x^{k-1}, a)\right\}. \qquad (9)$$
Define the $k$-th auxiliary cluster function
$$\bar{f}_k(y) = \frac{1}{m} \sum_{a \in A} \min\left\{r_{k-1}^a, d_2(y, a)\right\}. \qquad (10)$$
The problem
$$\text{minimize } \bar{f}_k(y) \text{ subject to } y \in \mathbb{R}^n \qquad (11)$$
is called the $k$-th auxiliary clustering problem [6]. The DC representation of the function $\bar{f}_k$ is as follows:
$$\bar{f}_k(y) = \bar{f}_{k1}(y) - \bar{f}_{k2}(y), \qquad (12)$$
where
$$\bar{f}_{k1}(y) = \frac{1}{m} \sum_{a \in A} \left(r_{k-1}^a + d_2(y, a)\right), \qquad \bar{f}_{k2}(y) = \frac{1}{m} \sum_{a \in A} \max\left\{r_{k-1}^a, d_2(y, a)\right\}.$$
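The identity $\min\{p, q\} = (p + q) - \max\{p, q\}$ underlying (12) can be checked numerically. The following sketch uses a hypothetical 1-D data set and a made-up previous center; the formulas for $\bar{f}_k$, $\bar{f}_{k1}$ and $\bar{f}_{k2}$ follow the text:

```python
# Numerical check of the DC decomposition (12) of the auxiliary cluster
# function on a toy 1-D data set (data and the previous center are
# hypothetical).
import math

def d2(x, a):
    return (x - a) ** 2

A = [0.0, 1.0, 4.0, 5.0]
prev_centers = [0.5]                                      # 1-clustering solution
r = {a: min(d2(x, a) for x in prev_centers) for a in A}   # (9)

def f_bar(y):    # (10): mean of min{r_a, d2(y, a)}
    return sum(min(r[a], d2(y, a)) for a in A) / len(A)

def f_bar1(y):   # convex component of (12)
    return sum(r[a] + d2(y, a) for a in A) / len(A)

def f_bar2(y):   # convex component of (12)
    return sum(max(r[a], d2(y, a)) for a in A) / len(A)

for y in (-1.0, 0.5, 4.5):
    assert math.isclose(f_bar(y), f_bar1(y) - f_bar2(y))
print(f_bar(4.5))  # 0.25 for this configuration
```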
The function $\bar{f}_{k2}$, in general, is nondifferentiable. To write its subdifferential at a given point $y \in \mathbb{R}^n$, introduce the following sets:
$$\bar{A}_1(y) = \{a \in A : r_{k-1}^a > d_2(y, a)\}, \quad \bar{A}_2(y) = \{a \in A : r_{k-1}^a < d_2(y, a)\},$$
$$\bar{A}_3(y) = \{a \in A : r_{k-1}^a = d_2(y, a)\}.$$
We rewrite the function $\bar{f}_{k2}$ at $y$ as:
$$\bar{f}_{k2}(y) = \frac{1}{m}\left[\sum_{a \in \bar{A}_1(y)} r_{k-1}^a + \sum_{a \in \bar{A}_2(y)} d_2(y, a) + \sum_{a \in \bar{A}_3(y)} \max\left\{r_{k-1}^a, d_2(y, a)\right\}\right].$$
Proposition 2. The generalized subdifferential $\partial \bar{f}_k(y)$ of the function $\bar{f}_k$ at $y \in \mathbb{R}^n$ can be given as:
$$\partial \bar{f}_k(y) = \nabla \bar{f}_{k1}(y) - \partial \bar{f}_{k2}(y).$$
Proof. Denote $D(y) = \nabla \bar{f}_{k1}(y) - \partial \bar{f}_{k2}(y)$. The inclusion $\partial \bar{f}_k(y) \subseteq D(y)$ follows from (2), therefore we prove only the opposite inclusion. As finite valued convex functions defined on $\mathbb{R}^n$, $\bar{f}_{k1}$ and $\bar{f}_{k2}$ are directionally differentiable. Therefore, the function $\bar{f}_k$ is also directionally differentiable and
$$\bar{f}_k'(y, u) = \bar{f}_{k1}'(y, u) - \bar{f}_{k2}'(y, u), \quad u \in \mathbb{R}^n.$$
Since the function $\bar{f}_k$ is DC, the function $-\bar{f}_k$ is also DC and
$$(-\bar{f}_k)'(y, u) = \bar{f}_{k2}'(y, u) - \langle \nabla \bar{f}_{k1}(y), u \rangle \ge \langle \xi, u \rangle - \langle \nabla \bar{f}_{k1}(y), u \rangle$$
for any $\xi \in \partial \bar{f}_{k2}(y)$. Then
$$(-\bar{f}_k)'(y, u) \ge \langle \xi - \nabla \bar{f}_{k1}(y), u \rangle$$
for all $u \in \mathbb{R}^n$. This means that $(-\bar{f}_k)'(y, u) \ge \langle v, u \rangle$ for all $u \in \mathbb{R}^n$ and $v \in -D(y)$, which in turn, due to the convexity of the set $D(y)$, implies that $-D(y) \subseteq \partial(-\bar{f}_k)(y) = -\partial \bar{f}_k(y)$. Therefore $D(y) \subseteq \partial \bar{f}_k(y)$.
Proof. The subdifferential $\partial \bar{f}_{k1}(y)$ is a singleton at any $y \in \mathbb{R}^n$, and therefore it follows from the definition of inf-stationary points (3) that the set $\partial \bar{f}_{k2}(y^*)$ must be a singleton at all such points.
It is clear that the first term in (14) is a singleton. Therefore at the inf-stationary point $y^*$ the second term must also be a singleton (namely, it has to be $\{0\}$). That is, $y^* = a$ for every $a \in \bar{A}_3(y^*)$, meaning that the subdifferential $\partial \bar{f}_{k2}(y^*)$ is given by (15).
Remark 1. The condition $y^* = a$ for every $a \in \bar{A}_3(y^*)$ implies that $y^* = x^j$ for some $j = 1, \ldots, k-1$. Furthermore, since the set $\bar{A}_3(y^*)$ is a singleton, the cluster $A^j$ contains only one data point.
Proof. Any local minimizer of Problem (11) is an inf-stationary point of this problem. The proof then follows from the expression for the gradient $\nabla \bar{f}_{k1}(y^*)$, Proposition 3 and Remark 1.
Proposition 5. The sets of Clarke stationary and critical points of Problem (11) coincide and they are given by:
$$S = \left\{y \in \mathbb{R}^n : \nabla \bar{f}_{k1}(y) \in \partial \bar{f}_{k2}(y)\right\}. \qquad (17)$$
Proof. Assume the point $\bar{y}$ is a critical point of Problem (11). Since the subdifferential of $\bar{f}_{k1}$ at any $y \in \mathbb{R}^n$ is a singleton, we get that $\nabla \bar{f}_{k1}(\bar{y}) \in \partial \bar{f}_{k2}(\bar{y})$. Then it follows from Proposition 2 that $0 \in \partial \bar{f}_k(\bar{y})$, that is, $\bar{y}$ is Clarke stationary.
Now assume that $\bar{y}$ is Clarke stationary. Then Proposition 2 implies that $\nabla \bar{f}_{k1}(\bar{y}) \in \partial \bar{f}_{k2}(\bar{y})$ and therefore $\bar{y}$ is a critical point. The expression (17) for the set of Clarke stationary points is obvious.
Remark 2. It is obvious that any inf-stationary point of Problem (11) is also a Clarke stationary and a critical point of this problem. In general, the set of inf-stationary points is a strict subset of these two sets.
Proposition 6. Let x ∈ Rnk be a local minimizer of the problem (6). Then the objective
function fk is continuously differentiable at this point and ∇fk (x) = 0.
Proof. First we derive expressions for the subdifferentials of the functions $f_{k1}$ and $f_{k2}$. The function $f_{k1}$ is differentiable and its gradient is:
$$\nabla f_{k1}(x) = 2(x - \hat{A}),$$
where $\hat{A} = (\hat{A}^1, \ldots, \hat{A}^k)$, $\hat{A}^1 = \cdots = \hat{A}^k = (\hat{a}_1, \ldots, \hat{a}_n)$ and
$$\hat{a}_t = \frac{1}{m} \sum_{a \in A} a_t.$$
This means that the subdifferential $\partial f_{k1}(x)$ is a singleton for any $x \in \mathbb{R}^{nk}$.
In general, the function $f_{k2}$ is nonsmooth. To compute its subdifferential, consider the following function and set [28]:
$$\varphi_a(x) = \max_{j=1,\ldots,k} \sum_{s=1, s \ne j}^{k} d_2(x^s, a), \qquad (19)$$
and
$$R_a(x) = \left\{j \in \{1, \ldots, k\} : \sum_{s=1, s \ne j}^{k} d_2(x^s, a) = \varphi_a(x)\right\}. \qquad (20)$$
For $a = a^i$ the subdifferential of $\varphi_a$ at $x$ is
$$\partial \varphi_a(x) = \mathrm{conv}\left\{2\left(\tilde{x}_j - \tilde{A}_j^i\right) : j \in R_a(x)\right\}, \qquad (21)$$
where
$$\tilde{x}_j = \left(x^1, \ldots, x^{j-1}, 0_n, x^{j+1}, \ldots, x^k\right), \quad \tilde{A}_j^i = \left(\tilde{A}_{j1}^i, \ldots, \tilde{A}_{jk}^i\right) \in \mathbb{R}^{nk},$$
and
$$\tilde{A}_{jt}^i = a^i, \ t = 1, \ldots, k, \ t \ne j, \qquad \tilde{A}_{jj}^i = 0_n.$$
Then the subdifferential $\partial f_{k2}(x)$ can be expressed as:
$$\partial f_{k2}(x) = \frac{1}{m} \sum_{a \in A} \partial \varphi_a(x). \qquad (22)$$
The local minimizer $x$ of the problem (6) is also its inf-stationary point. Since the subdifferential $\partial f_{k1}(x)$ is a singleton at any $x$, it follows from (3) that the subdifferential $\partial f_{k2}(x)$ is a singleton at any inf-stationary point $x$. This means that the subdifferentials $\partial \varphi_a(x)$, $a \in A$, are also singletons at any such point, which in turn means that the index sets $R_a(x)$ are singletons for all $a \in A$. This implies that for each $a \in A$ there exists a unique $j \in \{1, \ldots, k\}$ such that $R_a(x) = \{j\}$. It follows from the DC representation of the function $f_k$ that this $j$ is the index of the cluster to which the data point $a$ belongs. Thus, if $x$ is an inf-stationary point, then for each data point $a \in A$ there exists only one cluster center $x^j$ such that
$$d_2(x^j, a) < d_2(x^s, a) \ \text{ for any other } s = 1, \ldots, k, \ s \ne j.$$
This means that the clustering function $f_k$ is continuously differentiable at any inf-stationary point $x$ of Problem (6) and the Clarke subdifferential of this function is a singleton at such points, that is, $\partial f_k(x) = \{\nabla f_k(x)\}$, where
$$\nabla f_k(x) = \frac{2}{m} \sum_{a \in A} \sum_{j \in R_a(x)} (x^j - a).$$
Then the necessary condition for a minimum is $\nabla f_k(x) = 0$ and, in addition, each cluster center $x^j$, $j = 1, \ldots, k$, attracts the data points $a \in A$ such that $j \in R_a(x)$.
Proposition 7. The generalized subdifferential $\partial f_k(x)$ of the function $f_k$ at $x \in \mathbb{R}^{nk}$ is:
$$\partial f_k(x) = \nabla f_{k1}(x) - \partial f_{k2}(x).$$
Proposition 8. The sets of Clarke stationary and critical points of Problem (6) coincide and at these points
$$\nabla f_{k1}(x) \in \partial f_{k2}(x).$$
Proof. The proof follows from Proposition 7 and the definitions of Clarke stationary and critical points.
6 An algorithm for solving clustering problems
Both the clustering problem (6) and the auxiliary clustering problem (11) can be written as the unconstrained DC program:
$$\text{minimize } f(x) \text{ subject to } x \in \mathbb{R}^n, \qquad (23)$$
where $f(x) = f_1(x) - f_2(x)$, the function $f_1$ is continuously differentiable and convex, and the function $f_2$ is, in general, a nonsmooth convex function.
According to Propositions 2 and 7, for the objective function $f$ in the problem (23) we have
$$\partial f(x) = \nabla f_1(x) - \partial f_2(x). \qquad (24)$$
For $x \in \mathbb{R}^n$ and $\lambda > 0$ consider the set
$$Q_1(x, \lambda) = \mathrm{conv}\left\{\nabla f_1(x + \lambda u) : u \in S_1\right\}.$$
Here $S_1 = \{u \in \mathbb{R}^n : \|u\| = 1\}$ is the unit sphere. It is obvious that the set $Q_1(x, \lambda)$ is convex and compact for any $x \in \mathbb{R}^n$ and $\lambda > 0$. A point $x \in \mathbb{R}^n$ is called a $(\lambda, \delta)$-stationary point, $\lambda, \delta > 0$, if
$$\partial f_2(x) \cap \left(Q_1(x, \lambda) + \bar{B}_\delta(0)\right) \ne \emptyset.$$
Moreover, due to the convexity of $f_1$,
$$f_1(x + \lambda u) - f_1(x) \le \lambda \langle \nabla f_1(x + \lambda u), u \rangle \le \lambda \max_{\eta \in Q_1(x, \lambda)} \langle \eta, u \rangle.$$
Since $f_2(x + \lambda u) - f_2(x) \ge \lambda \langle \xi, u \rangle$ for any $\xi \in \partial f_2(x)$, we get
$$f(x + \lambda u) - f(x) \le \lambda \max_{\eta \in Q_1(x, \lambda)} \langle \eta - \xi, u \rangle.$$
Assume that a point $x \in \mathbb{R}^n$ is not a $(\lambda, \delta)$-stationary point. Then $\|\xi - z\| \ge \delta$ for all $\xi \in \partial f_2(x)$ and $z \in Q_1(x, \lambda)$. Take any $\xi \in \partial f_2(x)$ and define the following set:
$$\bar{Q}_\xi(x, \lambda) = \left\{\eta - \xi : \eta \in Q_1(x, \lambda)\right\}.$$
Then we have
$$f(x + \lambda u) - f(x) \le \lambda \max_{z \in \bar{Q}_\xi(x, \lambda)} \langle z, u \rangle \quad \forall u \in \mathbb{R}^n. \qquad (28)$$
Proposition 9. Assume that the point $x$ is not $(\lambda, \delta)$-stationary. Then the direction
$$u_0 = -\frac{z_0}{\|z_0\|}, \qquad z_0 = \operatorname*{argmin}\left\{\|z\| : z \in \bar{Q}_\xi(x, \lambda)\right\},$$
satisfies
$$f(x + \lambda u_0) - f(x) \le -\lambda\|z_0\| \le -\lambda\delta.$$
Proof. Since $z_0$ is the least norm element of the convex compact set $\bar{Q}_\xi(x, \lambda)$, the necessary condition for a minimum implies $\langle z_0, z - z_0 \rangle \ge 0$, or $\langle z_0, z \rangle \ge \|z_0\|^2$, $\forall z \in \bar{Q}_\xi(x, \lambda)$. Dividing both sides by $-\|z_0\|$ we have $\langle z, u_0 \rangle \le -\|z_0\|$, $\forall z \in \bar{Q}_\xi(x, \lambda)$. Then the proof follows from (28) and the fact that $\|z_0\| \ge \delta$.
It follows from Proposition 9 that if the point $x$ is not $(\lambda, \delta)$-stationary then the set $\bar{Q}_\xi(x, \lambda)$ can be used to find a direction of sufficient decrease of the function $f$ at $x$. However, the computation of the set $\bar{Q}_\xi(x, \lambda)$ is not always possible. Next we design an algorithm which uses a finite number of elements from $\bar{Q}_\xi(x, \lambda)$ to compute descent directions.
Let $\lambda > 0$, $\delta > 0$ be given numbers.
Algorithm 1 Computation of descent directions.
Step 1: (Initialization). Select a search control parameter $c \in (0, 1)$ and an initial direction $u^1 \in S_1$, compute the gradient $\nabla f_1(x + \lambda u^1)$ and a subgradient $\xi \in \partial f_2(x)$. Set $\bar{Q}_2^1 := \{\nabla f_1(x + \lambda u^1) - \xi\}$ and $j := 1$.
Step 2: (Computation of the least distance subgradient). Compute
$$z^j = \operatorname*{argmin}\left\{\tfrac{1}{2}\|z\|^2 : z \in \bar{Q}_2^j\right\}.$$
Step 3: (Stopping criterion). If $\|z^j\| \le \delta$ then stop: $x$ is a $(\lambda, \delta)$-stationary point.
Step 4: (Computation of a search direction). Compute $u^{j+1} = -\|z^j\|^{-1}z^j$. If
$$\|z^j\| > \delta \qquad (33)$$
and
$$f(x + \lambda u^{j+1}) - f(x) > -c\lambda\|z^j\| \qquad (34)$$
then set $\bar{Q}_2^{j+1} := \bar{Q}_2^j \cup \{\nabla f_1(x + \lambda u^{j+1}) - \xi\}$, $j := j + 1$ and go to Step 2. Otherwise stop: $u^{j+1}$ is a descent direction.
Proposition 10. Assume that there exists $M > \delta$ such that
$$\|\nabla f_1(x + \lambda u) - \xi\| \le M \quad \forall u \in S_1, \ \xi \in \partial f_2(x). \qquad (31)$$
Then Algorithm 1 terminates after at most $j_0$ iterations, where
$$j_0 = \left\lceil \frac{M^2 - \delta^2}{C(\delta)} \right\rceil, \qquad C(\delta) = \frac{(1 - c)^2\delta^4}{4M^2}. \qquad (32)$$
Proof. First, we show that if the algorithm does not terminate at the $j$-th iteration then $w^{j+1} = \nabla f_1(x + \lambda u^{j+1}) - \xi \notin \bar{Q}_2^j$. Since $z^j$ is the least norm element of $\bar{Q}_2^j$, it is obvious that
$$\langle z^j, z - z^j \rangle \ge 0 \quad \forall z \in \bar{Q}_2^j,$$
that is,
$$\langle z, z^j \rangle \ge \|z^j\|^2 \quad \forall z \in \bar{Q}_2^j. \qquad (36)$$
Since the algorithm does not terminate at the $j$-th iteration, the inequality (34) holds:
$$f(x + \lambda u^{j+1}) - f(x) > -c\lambda\|z^j\|. \qquad (37)$$
On the other hand, the convexity of $f_1$ and $f_2$ yields $f(x + \lambda u^{j+1}) - f(x) \le \lambda\langle w^{j+1}, u^{j+1} \rangle$, and since $u^{j+1} = -\|z^j\|^{-1}z^j$, (37) can be rewritten as:
$$\langle w^{j+1}, z^j \rangle < c\|z^j\|^2. \qquad (38)$$
Since $c \in (0, 1)$, the vector $w^{j+1}$ does not satisfy (36) and therefore $w^{j+1} \notin \bar{Q}_2^j$.
It is clear that for any $t \in [0, 1]$
$$\|z^{j+1}\|^2 \le \|z^j + t(w^{j+1} - z^j)\|^2 \le \|z^j\|^2 - 2t(1 - c)\|z^j\|^2 + 4t^2M^2, \qquad (39)$$
where the last inequality follows from (38) and from (31), which implies $\|w^{j+1} - z^j\| \le 2M$. Select $t$ as
$$t = \frac{(1 - c)\|z^j\|^2}{4M^2}.$$
It is clear that $t \in (0, 1)$. Putting this $t$ in (39) we get
$$\|z^{j+1}\|^2 \le \|z^j\|^2 - \frac{(1 - c)^2\|z^j\|^4}{4M^2},$$
which together with (33) implies that
$$\|z^{j+1}\|^2 \le \|z^j\|^2 - C(\delta).$$
For $j \ge 1$ we therefore have
$$\|z^{j+1}\|^2 \le \|z^1\|^2 - jC(\delta).$$
Since $C(\delta)$ is a positive constant, this means that the algorithm must terminate after a finite number of iterations. To estimate this number, notice that according to (31), $\|z^1\|^2 \le M^2$. Then the inequality $\|z^1\|^2 - jC(\delta) \le \delta^2$ is satisfied after at most $j_0$ iterations, where $j_0$ is given by (32).
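A compact sketch of Algorithm 1 follows. Two simplifications are assumed and not taken from the text: the min-norm subproblem of Step 2 is solved only approximately, by minimizing over the segment between the previous least-norm vector and the newest element of $\bar{Q}_2^j$ (the same bound the convergence proof uses), and the initial direction is fixed; `f`, `grad_f1` and `subgrad_f2` are user-supplied callables, so this is an illustration, not the authors' implementation:

```python
# Sketch of Algorithm 1: find a descent direction for f = f1 - f2 at x,
# or report that x is (lambda, delta)-stationary. Points are tuples.
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def sub(u, v):
    return tuple(ui - vi for ui, vi in zip(u, v))

def axpy(x, a, u):  # x + a*u
    return tuple(xi + a * ui for xi, ui in zip(x, u))

def descent_direction(x, lam, delta, f, grad_f1, subgrad_f2, c=0.2, maxit=100):
    """Return ('stationary', None) or ('descent', u) with
    f(x + lam*u) - f(x) <= -c*lam*||z||."""
    xi = subgrad_f2(x)
    u = tuple([1.0] + [0.0] * (len(x) - 1))        # fixed initial direction in S1
    z = sub(grad_f1(axpy(x, lam, u)), xi)          # first element of Q2-bar
    for _ in range(maxit):
        nz = norm(z)
        if nz <= delta:
            return 'stationary', None              # Step 3
        u = tuple(-zi / nz for zi in z)            # Step 4: u = -z/||z||
        if f(axpy(x, lam, u)) - f(x) <= -c * lam * nz:
            return 'descent', u                    # sufficient decrease found
        w = sub(grad_f1(axpy(x, lam, u)), xi)      # new element of Q2-bar
        # approximate Step 2: min-norm point on the segment [z, w]
        d = sub(w, z)
        dd = sum(di * di for di in d)
        t = 1.0 if dd == 0 else max(0.0, min(1.0,
            -sum(zi * di for zi, di in zip(z, d)) / dd))
        z = axpy(z, t, d)
    return 'stationary', None                      # give up (approximation stalled)

# Example: f(x) = x^2 - |x| on R, so f1(x) = x^2 and f2(x) = |x|.
status, u = descent_direction(
    (0.3,), lam=0.1, delta=1e-3,
    f=lambda x: x[0] ** 2 - abs(x[0]),
    grad_f1=lambda x: (2 * x[0],),
    subgrad_f2=lambda x: (1.0 if x[0] >= 0 else -1.0,))
print(status, u)
```

For the example, moving from $x = 0.3$ toward larger $x$ decreases $f$, and the sketch returns a descent direction on its first pass through Step 4.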
Next we describe an algorithm for finding $(\lambda, \delta)$-stationary points, $\lambda, \delta > 0$.
Algorithm 2 Finding $(\lambda, \delta)$-stationary points.
Step 1: (Initialization). Select any starting point $x^1 \in \mathbb{R}^n$ and numbers $c_1 \in (0, 1)$ and $c_2 \in (0, c_1]$. Set $j := 1$.
Step 2: Apply Algorithm 1 with $c = c_1$ to find a search direction at the point $x^j$. This algorithm terminates after a finite number of iterations $l_j > 0$ either with $\|z^{l_j}\| \le \delta$, in which case $x^j$ is a $(\lambda, \delta)$-stationary point and the algorithm stops, or with a descent direction $u^{l_j} \in S_1$ such that
$$f(x^j + \lambda u^{l_j}) - f(x^j) \le -c_1\lambda\|z^{l_j}\|.$$
Step 3: (Line search). Compute
$$\alpha_j = \operatorname*{argmax}\left\{\alpha \ge 0 : f(x^j + \alpha u^{l_j}) - f(x^j) \le -c_2\alpha\|z^{l_j}\|\right\}. \qquad (40)$$
Step 4: Set $x^{j+1} := x^j + \alpha_j u^{l_j}$, $j := j + 1$ and go to Step 2.
Proposition 11. Assume that $f_* = \inf\{f(x) : x \in \mathbb{R}^n\} > -\infty$. Then Algorithm 2 finds a $(\lambda, \delta)$-stationary point in at most $K$ iterations, where
$$K = \frac{f(x^1) - f_*}{c_2\lambda\delta}. \qquad (41)$$
Proof. Assume the contrary, that is, the sequence $\{x^j\}$ is infinite and the points $x^j$ are not $(\lambda, \delta)$-stationary for all $j = 1, 2, \ldots$. This means that $\|z^{l_j}\| > \delta$, $j = 1, 2, \ldots$. Since $c_2 \le c_1$, it follows from (40) that $\alpha_j \ge \lambda$ for any $j > 0$. Then
$$f(x^{j+1}) - f(x^j) \le -c_2\lambda\delta,$$
and therefore
$$f(x^{j+1}) - f(x^1) \le -c_2 j\lambda\delta,$$
which means that $f(x^j) \to -\infty$ as $j \to \infty$. This contradiction shows that the algorithm is finitely convergent. Since $f_* \le f(x^{j+1})$, it is obvious that the maximum number of iterations $K$ is given by (41).
Corollary 1. For both the clustering and the auxiliary clustering problems $f_* \ge 0$, and the estimate (41) can be replaced by:
$$K = \frac{f(x^1)}{c_2\lambda\delta}.$$
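To get a feel for this bound, here is a hypothetical numerical instance (the values of $f(x^1)$, $\lambda$ and $\delta$ are made up for illustration; $c_2 = 0.01$ is the value used later in Section 8):

```python
# Worst-case iteration bound of Corollary 1 for hypothetical parameters.
f_x1 = 100.0        # value of f at the starting point (made up)
c2, lam, delta = 0.01, 1.0, 1e-7

K = round(f_x1 / (c2 * lam * delta))
print(K)  # 100000000000 -- the bound is extremely loose
```

The bound is pessimistic: it scales like $1/\delta$, whereas in practice far fewer iterations are observed.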
Now we can design an algorithm for finding Clarke stationary points of Problem (23).
Algorithm 3 Finding Clarke stationary points.
Step 1: (Initialization). Select any starting point $x^1 \in \mathbb{R}^n$, sequences $\lambda_j \downarrow 0$, $\delta_j \downarrow 0$ and a tolerance $\varepsilon \ge 0$. Set $j := 1$.
Step 2: Apply Algorithm 2 with $\lambda = \lambda_j$ and $\delta = \delta_j$, starting from the point $x^j$, to find a $(\lambda_j, \delta_j)$-stationary point $x^{j+1}$.
Step 3: If $\|z^{l_j}\| \le \varepsilon$ then stop. Otherwise set $j := j + 1$ and go to Step 2.
Proposition 12. Assume the level set $\mathcal{J}(x^1) = \{x \in \mathbb{R}^n : f(x) \le f(x^1)\}$ is bounded and $\varepsilon = 0$. Then all limit points of the sequence $\{x^j\}$ generated by Algorithm 3 are Clarke stationary points of Problem (23).
Proof. Since Algorithm 3 is a descent algorithm, the sequence $\{x^j\} \subset \mathcal{J}(x^1)$, and because $\mathcal{J}(x^1)$ is a compact set this sequence has at least one limit point. Assume $\bar{x}$ is a limit point of $\{x^j\}$ and the subsequence $\{x^{j_l}\}$ is such that $x^{j_l} \to \bar{x}$ as $l \to \infty$.
After each iteration $j_l$ we get a $(\lambda_{j_l}, \delta_{j_l})$-stationary point $x^{j_l+1}$, which means that there exists $\xi^{j_l+1} \in \partial f_2(x^{j_l+1})$ such that
$$\xi^{j_l+1} \in Q_1(x^{j_l+1}, \lambda_{j_l}) + B_{\delta_{j_l}}(0).$$
Replacing $j_l$ by $j_l - 1$ we have
$$\xi^{j_l} \in Q_1(x^{j_l}, \lambda_{j_l-1}) + B_{\delta_{j_l-1}}(0). \qquad (42)$$
Continuity of the gradient $\nabla f_1(x)$ implies that for any $\gamma > 0$ there exists $l_0 > 0$ such that
$$\|\nabla f_1(x^{j_l} + \lambda_{j_l-1}u) - \nabla f_1(\bar{x})\| < \gamma \quad \forall l > l_0, \ u \in S_1.$$
This means that
$$Q_1(x^{j_l}, \lambda_{j_l-1}) \subset \nabla f_1(\bar{x}) + B_\gamma(0) \quad \forall l > l_0. \qquad (43)$$
From (42) and (43) we get
$$\xi^{j_l} \in \nabla f_1(\bar{x}) + B_{\gamma+\delta_{j_l-1}}(0) \quad \forall l > l_0. \qquad (44)$$
The mapping $x \mapsto \partial f_2(x)$ is upper semicontinuous. Then for any $\theta > 0$ there exists $\bar{l} > 0$ such that
$$\xi^{j_l} \in \partial f_2(\bar{x}) + B_\theta(0) \quad \forall l > \bar{l}. \qquad (45)$$
Without loss of generality assume that there exists $\bar{\xi} \in \partial f_2(\bar{x})$ such that $\|\xi^{j_l} - \bar{\xi}\| < \theta$ for all $l > \bar{l}$. Then it follows from (44) that
$$\|\nabla f_1(\bar{x}) - \bar{\xi}\| < \theta + \gamma + \delta_{j_l-1} \quad \forall l > \hat{l} = \max\{l_0, \bar{l}\}.$$
Since $\gamma > 0$ and $\theta > 0$ are arbitrary and $\delta_{j_l} \downarrow 0$ as $l \to \infty$, we obtain $\nabla f_1(\bar{x}) = \bar{\xi} \in \partial f_2(\bar{x})$, and therefore $0 \in \partial f(\bar{x})$, that is, $\bar{x}$ is a Clarke stationary point of Problem (23).
Finally, we design an algorithm for finding inf-stationary points of Problem (23). This algorithm involves a special procedure to escape from Clarke stationary points which are not inf-stationary.
Assume that the point $x^*$ is a Clarke stationary point found by Algorithm 3. If the function $f_2$ is differentiable at this point then $x^*$ is an inf-stationary point. Otherwise one can compute a subgradient $\xi \in \partial f_2(x^*)$ such that $\|\xi - \nabla f_1(x^*)\| > \varepsilon$ for some $\varepsilon > 0$. If $\partial f_2(x^*) \subset \nabla f_1(x^*) + B_\varepsilon(0)$ for some sufficiently small $\varepsilon > 0$ then the point $x^*$ can be considered an approximate inf-stationary point.
Our main assumption is the following: if the subdifferential $\partial f_2(x)$ is not a singleton at a point $x \in \mathbb{R}^n$ then we can always compute two subgradients $\xi^1, \xi^2 \in \partial f_2(x)$ such that $\xi^1 \ne \xi^2$. We will show that this assumption is satisfied for Problems (6) and (11).
Proposition 13. Assume that the subdifferential $\partial f_2(x)$ is not a singleton at $x$ and the subgradients $\xi^1, \xi^2 \in \partial f_2(x)$ are such that $\xi^1 \ne \xi^2$. Consider the direction $\bar{u} = -\|\bar{v}\|^{-1}\bar{v}$, where
$$\bar{v} = \operatorname*{argmax}\left\{\|\nabla f_1(x) - \xi^i\| : i = 1, 2\right\}. \qquad (46)$$
Then
$$f'(x, \bar{u}) \le -\|\bar{v}\|.$$
Proof. Since the subdifferential $\partial f_2(x)$ is not a singleton and $\xi^1 \ne \xi^2$, it follows that $\bar{v} \ne 0$. For simplicity we assume that $\bar{v} = \nabla f_1(x) - \xi^2$. The convexity of the functions $f_1$ and $f_2$ implies that for $\alpha > 0$
$$f(x + \alpha\bar{u}) - f(x) \le \alpha\langle \nabla f_1(x + \alpha\bar{u}), \bar{u} \rangle - \alpha\langle \xi^2, \bar{u} \rangle,$$
and therefore
$$f'(x, \bar{u}) \le \langle \nabla f_1(x) - \xi^2, \bar{u} \rangle = \langle \bar{v}, \bar{u} \rangle = -\|\bar{v}\|.$$
Proposition 14. Let $x \in \mathbb{R}^n$ be a Clarke stationary point of Problem (23) and assume that the subdifferential $\partial f_2(x)$ is a singleton at this point. Then $x$ is an inf-stationary point of Problem (23).
Proof. The proof follows from (24) and the definition of inf-stationary points.
Corollary 2. If the subdifferential $\partial f_2(x)$ is not a singleton at a Clarke stationary point $x$ of Problem (23), then $x$ is not an inf-stationary point of this problem.
Remark 3. Propositions 13, 14 and Corollary 2 show how one can design an algorithm for finding inf-stationary points of Problem (23). Applying Algorithm 3 we can find a Clarke stationary point $x$ of the problem (23). If at this point the subdifferential $\partial f_2(x)$ is a singleton then according to Proposition 14 it is also inf-stationary. If this subdifferential is not a singleton then Corollary 2 implies that this point is not inf-stationary, and according to Proposition 13 we can find a descent direction from this point, which in turn allows us to find a new starting point for Algorithm 3.
Algorithm 4 Finding inf-stationary points.
Step 1: (Initialization). Choose numbers $c_1 \in (0, 1)$, $c_2 \in (0, c_1]$, $c_3 \in (0, 1/2]$ and an optimality tolerance $\varepsilon > 0$. Select any starting point $x^1 \in \mathbb{R}^n$ and set $j := 1$.
Step 2: Apply Algorithm 3 starting from the point $x^j$ and using the constants $c_1, c_2$ to find a Clarke stationary point $\bar{x}$ with the optimality tolerance $\varepsilon$.
Step 3: If $\partial f_2(\bar{x}) \subset \nabla f_1(\bar{x}) + B_\varepsilon(0)$ then stop: $\bar{x}$ is an (approximate) inf-stationary point.
Step 4: Compute subgradients $\xi^1, \xi^2 \in \partial f_2(\bar{x})$ such that $\xi^1 \ne \xi^2$, the vector $\bar{v}$ by (46) and the direction $\bar{u} = -\|\bar{v}\|^{-1}\bar{v}$.
Step 5: Compute $x^{j+1} := \bar{x} + \alpha_j\bar{u}$, where
$$\alpha_j = \operatorname*{argmax}\left\{\alpha \ge 0 : f(\bar{x} + \alpha\bar{u}) - f(\bar{x}) \le -c_3\alpha\|\bar{v}\|\right\},$$
set $j := j + 1$ and go to Step 2.
Proposition 15. Assume that $f_* > -\infty$ and the gradient $\nabla f_1 : \mathbb{R}^n \to \mathbb{R}^n$ satisfies the Lipschitz condition with a constant $L > 0$. Then Algorithm 4 terminates after a finite number of iterations at an inf-stationary point of Problem (23).
Proof. For simplicity assume that at the $j$-th iteration $\bar{u}^j = -\|\bar{v}^j\|^{-1}\bar{v}^j$ and $\bar{v}^j = \nabla f_1(\bar{x}) - \xi^1$, $\xi^1 \in \partial f_2(\bar{x})$. Applying the mean value theorem to the function $f_1$, we get that for some $\sigma_j \in (0, 1)$
$$f(\bar{x} + \alpha\bar{u}^j) - f(\bar{x}) = [f_1(\bar{x} + \alpha\bar{u}^j) - f_1(\bar{x})] - [f_2(\bar{x} + \alpha\bar{u}^j) - f_2(\bar{x})]$$
$$\le \alpha\langle \nabla f_1(\bar{x} + \alpha\sigma_j\bar{u}^j), \bar{u}^j \rangle - \alpha\langle \xi^1, \bar{u}^j \rangle$$
$$\le \alpha\langle \nabla f_1(\bar{x}) - \xi^1, \bar{u}^j \rangle + \alpha\langle \nabla f_1(\bar{x} + \alpha\sigma_j\bar{u}^j) - \nabla f_1(\bar{x}), \bar{u}^j \rangle$$
$$\le -\alpha r + \alpha^2 L,$$
where $r = \|\bar{v}^j\| > \varepsilon$ and the last inequality follows from the Lipschitz continuity of $\nabla f_1$. Taking $\bar{\alpha} = r/(2L)$ and using $c_3 \le 1/2$ we obtain
$$f(\bar{x} + \bar{\alpha}\bar{u}^j) - f(\bar{x}) \le -\frac{r^2}{4L} \le -c_3\bar{\alpha}r \le -c_3\bar{\alpha}\varepsilon.$$
This means that at each iteration of Algorithm 4, $\alpha_j \ge \bar{\alpha} \ge \varepsilon/(2L)$ and the function $f$ decreases by at least $c_3\varepsilon^2/(2L) > 0$. Since $f_* > -\infty$, the algorithm must stop after a finite number of iterations.
Next we show that the gradients of the functions $\bar{f}_{k1}$ and $f_{k1}$ are Lipschitz continuous.
Proposition 16. The gradient of the function $\bar{f}_{k1}$ satisfies the Lipschitz condition with the constant $L = 2$.
Proof. The gradient of $\bar{f}_{k1}$ is
$$\nabla \bar{f}_{k1}(y) = \frac{1}{m}\sum_{a \in A} 2(y - a) = 2(y - \bar{a}),$$
where $\bar{a}$ is the center of the set $A$. Then $\|\nabla \bar{f}_{k1}(y^1) - \nabla \bar{f}_{k1}(y^2)\| = 2\|y^1 - y^2\|$, that is, the gradient $\nabla \bar{f}_{k1}$ satisfies the Lipschitz condition on $\mathbb{R}^n$ with the constant $L = 2$.
Proposition 17. The gradient of the function $f_{k1}$ satisfies the Lipschitz condition with the constant $L = 2$.
Proof. The proof is similar to that of Proposition 16, using the expression $\nabla f_{k1}(x) = 2(x - \hat{A})$.
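The computation behind Proposition 16 can be checked numerically. The following sketch (with a hypothetical toy data set) evaluates $\nabla \bar{f}_{k1}(y) = 2(y - \bar{a})$ and verifies that gradient differences have exactly twice the norm of the point differences:

```python
# Numerical check that grad f_bar_k1(y) = 2(y - abar), so the gradient
# is Lipschitz with constant exactly L = 2. Data are made up.
import math
import random

A = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
m = len(A)
abar = tuple(sum(a[t] for a in A) / m for t in range(2))  # centroid of A

def grad_f_bar1(y):
    # the r-terms in f_bar_k1 are constants and do not affect the gradient
    return tuple(2.0 * (y[t] - abar[t]) for t in range(2))

random.seed(0)
y1 = (random.random(), random.random())
y2 = (random.random(), random.random())
gdiff = tuple(g1 - g2 for g1, g2 in zip(grad_f_bar1(y1), grad_f_bar1(y2)))
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))
assert math.isclose(math.sqrt(sum(d * d for d in gdiff)), 2.0 * dist)
```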
Finally, we demonstrate how two different subgradients from $\partial \bar{f}_{k2}(y)$, $y \in \mathbb{R}^n$, and $\partial f_{k2}(x)$, $x \in \mathbb{R}^{nk}$, can be computed when these subdifferentials are not singletons. First consider the function $\bar{f}_{k2}$. We can choose $\xi^1, \xi^2 \in \partial \bar{f}_{k2}(y)$ using (14) as follows:
$$\xi^1 = \frac{2}{m}\sum_{a \in \bar{A}_2(y)} (y - a)$$
and
$$\xi^2 = \xi^1 + \frac{2}{m}(y - \bar{a}), \qquad \bar{a} = \operatorname*{argmax}_{a \in \bar{A}_3(y)} \|y - a\|.$$
Now consider the function $f_{k2}$. Define the sets
$$A_1 = \{a \in A : |R_a(x)| = 1\}, \qquad A_2 = \{a \in A : |R_a(x)| \ge 2\}.$$
(The sets $R_a(x)$ are defined in Section 5.) If $A_1 = A$ then $\partial f_{k2}(x)$ is a singleton. If $|A_2| \ge 1$ then $\partial f_{k2}(x)$ is not a singleton. Take any $a \in A_2$. Since $|R_a(x)| \ge 2$, the point is attracted by at least two cluster centers. Using two such cluster centers, we can compute two different subgradients for the function $\varphi_a$ defined by (19).
7 Incremental algorithm
In this section we present an incremental algorithm for solving Problems (6) and (11)
using the DC approach. An important part of this algorithm is a procedure for finding
starting points for the l-th cluster center where 1 ≤ l ≤ k. This procedure was described
in detail in [28].
Algorithm 5 An incremental clustering algorithm.
Step 1: (Initialization). Compute the center x1 ∈ Rn of the set A. Set l := 1.
Step 2: (Stopping criterion). Set l := l + 1. If l > k then stop. The k-partition problem
has been solved.
Step 3: (Computation of a set of starting points for the auxiliary clustering problem).
Apply the procedure from [28] to find the set S1 ⊂ Rn of starting points for solving
the auxiliary clustering problem (11) for k = l.
Step 4: (Computation of a set of starting points for the l-th cluster center). Apply
Algorithm 4 to solve Problem (11) starting from each point y ∈ S1 . This algorithm
generates a set S2 ⊂ Rn of starting points for the l-th cluster center.
Step 5: (Computation of a set of cluster centers). For each ȳ ∈ S2 apply Algorithm
4 to solve Problem (6) starting from the point (x1 , . . . , xl−1 , ȳ) and find a solution
(ŷ 1 , . . . , ŷ l ). Denote by S3 ⊂ Rnl a set of all such solutions.
Step 6: (Computation of the best solution). Compute
$$f_l^{\min} = \min\left\{f_l(\hat{y}^1, \ldots, \hat{y}^l) : (\hat{y}^1, \ldots, \hat{y}^l) \in S_3\right\}$$
and the collection of cluster centers $(\bar{y}^1, \ldots, \bar{y}^l)$ such that $f_l(\bar{y}^1, \ldots, \bar{y}^l) = f_l^{\min}$.
Step 7: (Solution to the l-partition problem). Set xj := ȳ j , j = 1, . . . , l as a solution to
the l-th partition problem and go to Step 2.
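A much-simplified, hypothetical sketch of this incremental scheme follows. Two substitutions are assumed and not taken from the text: the starting-point procedure of [28] is replaced by simply trying every data point, and Algorithm 4 is replaced by the DCA fixed-point iterations described later in Section 8; the data set is a toy example:

```python
# Simplified sketch of the incremental scheme of Algorithm 5.
def d2(x, a):
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))

def centroid(points):
    return tuple(sum(p[t] for p in points) / len(points)
                 for t in range(len(points[0])))

def f_k(centers, A):
    # cluster function (7)
    return sum(min(d2(x, a) for x in centers) for a in A) / len(A)

def dca_aux(y, centers, A, iters=50):
    # DCA fixed-point iteration for the auxiliary problem (11)
    r = {a: min(d2(x, a) for x in centers) for a in A}
    m = len(A)
    for _ in range(iters):
        near = [a for a in A if d2(y, a) <= r[a]]   # points a with d2(y,a) <= r_a
        n2 = m - len(near)
        y = tuple((n2 * yt + sum(a[t] for a in near)) / m
                  for t, yt in enumerate(y))
    return y

def dca_full(centers, A, iters=100):
    # DCA iteration for the clustering problem (6)
    m = len(A)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for a in A:
            j = min(range(len(centers)), key=lambda i: d2(centers[i], a))
            clusters[j].append(a)
        centers = [c if not cl else
                   tuple((1 - len(cl) / m) * ct + (len(cl) / m) * centroid(cl)[t]
                         for t, ct in enumerate(c))
                   for c, cl in zip(centers, clusters)]
    return centers

def incremental_clustering(A, k):
    centers = [centroid(A)]                          # Step 1
    for l in range(2, k + 1):                        # Steps 2-7
        candidates = [dca_aux(a, centers, A) for a in A]
        trials = [dca_full(centers + [y], A) for y in candidates]
        centers = min(trials, key=lambda c: f_k(c, A))
    return centers

A = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
centers = incremental_clustering(A, 2)
print(sorted(centers))
```

For this toy data set the two centers converge near the two obvious group centroids $(0.5, 0)$ and $(4.5, 4)$.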
8 Implementation of algorithms
We compare the proposed algorithm, DCClust (Algorithm 5), with the following algorithms:
1. The global k-means algorithm (GKM) [27].
8.1 Implementation of DCClust
This algorithm contains a special procedure to generate starting cluster centers (Step
3) which is described in detail in [9, 28]. The choice of parameters in this procedure
is the same as in [9]. Algorithm 4 is applied in Step 4 of the DCClust to solve the auxiliary clustering problem and in Step 5 to solve the clustering problem. The parameters in this algorithm are chosen as follows: $c_1 = 0.2$, $c_2 = 0.01$, $c_3 = 0.4$, $\varepsilon = 10^{-5}$, $\delta_i \equiv 10^{-7}$ and $\lambda_1 = 1$, $\lambda_{i+1} = 0.2\lambda_i$, $i = 1, 2, \ldots$. The algorithm from [35] is applied to solve the quadratic programming problem in Step 2 of Algorithm 1.
In general, the sequence $\{x^j\}$ generated by the DCA converges to critical points of the problem (23). Since for this problem the sets of Clarke stationary and critical points coincide, the limit points of the sequence $\{x^j\}$ are also Clarke stationary. At each iteration the DCA computes a subgradient $\xi^j \in \partial f_2(x^j)$ and finds $x^{j+1}$ as a solution to the convex problem
$$\text{minimize } f_1(x) - \langle \xi^j, x - x^j \rangle \text{ subject to } x \in \mathbb{R}^n. \qquad (47)$$
In order to apply the DCA to solve Problem (11), the subgradient $\xi^j$ in Step 2 is computed as (see (14)):
$$\xi^j = \frac{2}{m}\sum_{a \in \bar{A}_2(x^j)} (x^j - a), \quad x^j \in \mathbb{R}^n.$$
Then the solution $x^{j+1}$ to the problem (47) in Step 4 can be expressed as follows:
$$x^{j+1} = \frac{1}{m}\left(|\bar{A}_2(x^j)|\,x^j + \sum_{a \in \bar{A}_1(x^j) \cup \bar{A}_3(x^j)} a\right).$$
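This closed form can be verified numerically: it zeroes the gradient of the convex overestimate in (47). The sketch below uses a hypothetical 1-D data set, made-up $r^a$ values (from a previous center at $0.5$) and a made-up current iterate:

```python
# Check that x_{j+1} = (|A2bar| x_j + sum over A1bar and A3bar of a)/m
# zeroes the gradient of f_bar_k1(x) - <xi_j, x - x_j> (problem (47)).
import math

A = [0.0, 1.0, 4.0, 5.0]
m = len(A)
r = {0.0: 0.25, 1.0: 0.25, 4.0: 12.25, 5.0: 20.25}  # r^a for center 0.5
xj = 3.0                                            # current iterate (made up)

A2 = [a for a in A if (xj - a) ** 2 > r[a]]         # A2bar(x_j)
xi = 2.0 / m * sum(xj - a for a in A2)              # subgradient xi_j
x_next = (len(A2) * xj + sum(a for a in A if a not in A2)) / m

# gradient of the overestimate at x_next: (2/m) sum (x - a) - xi_j
grad = 2.0 / m * sum(x_next - a for a in A) - xi
assert math.isclose(grad, 0.0, abs_tol=1e-12)
print(x_next)  # 3.75 for this instance
```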
Next we describe the application of the DCA to solve the clustering problem (6). Let
xj = (x1j , . . . , xkj ) ∈ Rnk be a vector of cluster centers at the iteration j and A1 , . . . , Ak be
the cluster partition of the data set A given by these centers.
In order to compute the subgradient $\xi^j$ in Step 2, for each $a \in A$ we compute the set $R_a(x_j)$ given by (20), take any $p \in R_a(x_j)$ and compute the subgradient $v_a^j \in \partial\varphi_a(x_j)$ using (21). Then we apply (22) to compute the subgradient $\xi^j$. Thus we get the following formula for the subgradient $\xi^j$:
$$\xi^j = \frac{2}{m}\left(\sum_{a \in A \setminus A^1} (x_j^1 - a), \ldots, \sum_{a \in A \setminus A^k} (x_j^k - a)\right)$$
$$= \frac{2}{m}\Big((m - |A^1|)x_j^1 - (m\bar{a} - |A^1|\bar{a}^1), \ldots, (m - |A^k|)x_j^k - (m\bar{a} - |A^k|\bar{a}^k)\Big),$$
where $\bar{a}^l$ is the center of the cluster $A^l$, $l = 1, \ldots, k$, and $\bar{a}$ is the center of the whole set $A$. The solution $x_{j+1} = (x_{j+1}^1, \ldots, x_{j+1}^k)$ to the problem (47) in Step 4 is given by:
$$x_{j+1}^t = \left(1 - \frac{|A^t|}{m}\right)x_j^t + \frac{|A^t|}{m}\bar{a}^t, \quad t = 1, \ldots, k.$$
Finally, the stopping criterion in Step 3 can be given by
$$x_j^t = \left(1 - \frac{|A^t|}{m}\right)x_j^t + \frac{|A^t|}{m}\bar{a}^t, \quad t = 1, \ldots, k,$$
that is, the DCA stops when the current iterate is a fixed point of the update.
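One DCA step for the clustering problem is easy to illustrate: each center moves a fraction $|A^t|/m$ of the way toward the centroid of its cluster, so the fixed points are exactly the cluster centroids. The data and centers below are hypothetical:

```python
# One DCA step (Section 8) for the clustering problem in 1-D.
A = [0.0, 1.0, 4.0, 5.0]
m = len(A)
centers = [0.0, 6.0]   # current iterate (made up)

# assign each point to its closest center
clusters = [[], []]
for a in A:
    t = min((0, 1), key=lambda i: (centers[i] - a) ** 2)
    clusters[t].append(a)

# x_t <- (1 - |A_t|/m) x_t + (|A_t|/m) abar_t
new_centers = []
for x, cl in zip(centers, clusters):
    abar = sum(cl) / len(cl)
    new_centers.append((1 - len(cl) / m) * x + (len(cl) / m) * abar)
print(new_centers)  # [0.25, 5.25]: each center moves halfway to its centroid
```

Repeating the step drives the centers toward the centroids $0.5$ and $4.5$, the fixed points of the update.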
To design the version of Algorithm 5 with the DCA in Steps 4 and 5, Algorithm 4 is replaced by the above described version of the DCA.
Remark 4. Despite some similarities between the DCA and Algorithm 3, these two algorithms are designed in different ways. In the DCA a subgradient $\xi^j \in \partial f_2(x^j)$ is computed only once per iteration, and $x^{j+1}$ is a global minimizer of the overestimation $f_1(x) - \langle \xi^j, x - x^j \rangle$; in Algorithm 3, however, this subgradient is updated at each iteration.
9 Numerical results
To test the DCClust algorithm and compare it with other three clustering algorithms
numerical experiments with a number of real-world data sets have been carried out. Algo-
rithms were implemented in Fortran 95 and compiled using the gfortran compiler. Com-
putational results were obtained on a PC with the CPU Intel(R) Core(TM) i5-3470S 2.90
GHz and RAM 8 GB. Eight data sets have been used in numerical experiments. Their
brief description is given in Table 1. The detailed description can be found in [26]. All
data sets contain only numeric features and they do not have missing values. To get as
more comprehensive picture about the performance of the algorithms as possible the data
sets were chosen so that: (i) the number of attributes is ranging from very few (2) to large
(128); (ii) the number of data points is ranging from tens of thousands (smallest 13,910)
to hundred of thousands (largest 434,874).
We computed up to 25 clusters in all data sets. The CPU time used by the algorithms was limited to 20 hours. Since all algorithms compute clusters incrementally, we present results with the maximum number of clusters obtained by each algorithm within this time limit. Results for the cluster function values found by the different algorithms are presented in Tables 2 and 3. In these tables we use the following notation:
Table 1: Brief description of the data sets
• $f_{best}$ (multiplied by the number shown after the names of the data sets) is the best known value of the cluster function (7) (multiplied by $m$) for the corresponding number of clusters;

• $E_A$ is the error of the value $\bar{f}$ of the cluster function found by an algorithm relative to $f_{best}$:
$$E_A = \frac{\bar{f} - f_{best}}{f_{best}} \times 100\%;$$
• The sign “-” in the tables shows that an algorithm failed to compute clusters within the given time frame.
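As a small worked example of the accuracy measure $E_A$ (the values below are illustrative, not taken from the tables): a found value 1% above the best known value yields $E_A = 1\%$.

```python
def relative_error_percent(f_found, f_best):
    """E_A = (f_found - f_best) / f_best * 100, in percent."""
    return (f_found - f_best) / f_best * 100.0

# e.g. relative_error_percent(1.01e9, 1.0e9) gives 1.0 (percent)
```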
All data sets can be divided into two groups. The first group contains data sets with a small number of attributes (2 or 3): the D15112, Pla85900, Skin Segmentation and 3D Road Network data sets belong to this group. The number of points in these data sets ranges from 15,112 to 434,874. Results presented in Tables 2 and 3 demonstrate that on these data sets the performance of the algorithms is similar in terms of accuracy. All algorithms find at least near best known solutions on these data sets. The only exception is the case k = 25 in the 3D Road Network data set, where the DCClust algorithm failed to find the best solution. The results also demonstrate that the GKM and MS-MGKM algorithms are not efficient, within the given time frame, for solving clustering problems in data sets with hundreds of thousands of points.
The second group contains data sets with a relatively large number of attributes: the Gas Sensor Array Drift, EEG Eye State, KEGG Metabolic Relation Network and Shuttle Control data sets belong to this group. The number of attributes in these data sets ranges from 9 to 128. The results show that all algorithms are able to find (near) best known solutions in the Gas Sensor Array Drift and EEG Eye State data sets. The MS-MGKM algorithm failed to find such solutions in the KEGG Metabolic Relation Network data set for k = 15, and the GKM algorithm for k = 15, 20. Results for the Shuttle Control data set
show that the DCClust algorithm failed to find an accurate solution for k = 12, and the three other algorithms for k = 25. In this data set most points are very close to each other, and the clusters are not well separated when their number is large. In this situation some clustering algorithms may fail to find accurate solutions. In all other cases the algorithms are able to find such solutions.
Figures 1(a)–1(h) illustrate the dependence of the number of distance function evaluations ($N_d$) on the number of clusters for the four algorithms in all data sets. These figures demonstrate that the MS-MGKM algorithm requires the least number of distance function evaluations in all data sets except the 3D Road Network data set. In this data set the MS-MGKM algorithm computed only 6 clusters within the 20-hour time limit, and its $N_d$ is similar to that of the MS-DCA and DCClust algorithms. For the GKM algorithm $N_d$ depends linearly on the number of clusters; however, this algorithm requires significantly more distance function evaluations than the three other algorithms in the three largest data sets: Pla85900, Skin Segmentation and 3D Road Network. A comparison of the DCClust and MS-DCA algorithms shows that the latter requires more distance function evaluations than the former in all data sets except the 3D Road Network data set. The difference between these two algorithms is significant in data sets with fewer than 100,000 instances, which is not the case for the two largest data sets.
Figures 2(a)–2(h) illustrate the dependence of the CPU time on the number of clusters for the four algorithms in all data sets. It is obvious that the MS-MGKM algorithm requires less CPU time than any other algorithm for almost all numbers of clusters in five data sets: Gas Sensor Array Drift, D15112, EEG Eye State, KEGG Metabolic Relation Network and Shuttle Control. However, as the size (the number of data points) of a data set increases, this algorithm requires more CPU time than the DCClust and MS-DCA algorithms. The GKM algorithm is more time-consuming than the MS-MGKM algorithm, and it requires significantly more CPU time than the DCClust and MS-DCA algorithms in the three largest data sets. Both the MS-MGKM and GKM algorithms computed only 6 clusters in the 3D Road Network data set within the 20-hour time limit. Results for the two largest data sets show that these algorithms are not efficient for solving clustering problems in data sets with hundreds of thousands of data points. A comparison of the DCClust and MS-DCA algorithms shows that the former requires less CPU time than the latter in all data sets except D15112.
10 Conclusion
In this paper the minimum sum-of-squares clustering problems are studied using their DC representation. Inf-stationary points, stationary points in the sense of generalized subgradients and critical points of the minimum sum-of-squares clustering problems are characterized using such a representation. An incremental algorithm based on the DC representation is designed to solve the minimum sum-of-squares clustering problems. A special algorithm is designed to solve the nonsmooth optimization problems arising at each iteration of the incremental algorithm. It is proved that this algorithm converges to inf-stationary points of the clustering problems. A similar incremental algorithm is developed where the well-known DCA algorithm is applied to solve the optimization problems.
The proposed algorithms are tested using real-world data sets with the number of data points ranging from tens of thousands to hundreds of thousands. The results clearly demonstrate that the use of the DC representation of clustering problems significantly improves the ability of incremental algorithms to solve clustering problems on very large data sets in a reasonable time. Furthermore, the use of nonsmooth optimization algorithms can extend this ability to even larger data sets.
Acknowledgement. This research by Dr. Adil Bagirov and Dr. Sona Taheri was supported under the Australian Research Council's Discovery Projects funding scheme (Project No. DP140103213). The authors thank the two anonymous referees for their comments, which helped to improve the quality of the paper.
References
[1] K.S. Al-Sultan. A tabu search approach to the clustering problem. Pattern Recogni-
tion, 28(9):1443–1451, 1995.
[2] L.T.H. An, M.T. Belghiti, and P.D. Tao. A new efficient algorithm based on DC
programming and DCA for clustering. J. of Global Optim., 37(4):593–608, 2007.
[3] L.T.H. An, L.H. Minh, and P.D. Tao. New and efficient DCA based algorithms for
minimum sum-of-squares clustering. Pattern Recognition, 47:388–401, 2014.
[4] L.T.H. An, H.V. Ngai, and P.D. Tao. Exact penalty and error bounds in DC pro-
gramming. Journal of Global Optimization, 52(3):509–535, 2012.
[6] A.M. Bagirov. Modified global k-means algorithm for minimum sum-of-squares clus-
tering problems. Pattern Recognition, 41(10):3192–3199, 2008.
[8] A.M. Bagirov and E. Mohebi. Nonsmooth optimization based algorithms in cluster
analysis. In Emre M. Celebi, editor, Partitional Clustering Algorithms, pages 99–146.
Springer.
[9] A.M. Bagirov, B. Ordin, G. Ozturk, and A.E. Xavier. An incremental clustering algo-
rithm based on hyperbolic smoothing. Computational Optimization and Applications,
61:219–241, 2015.
[10] A.M. Bagirov, A.M. Rubinov, N.V. Soukhoroukova, and J. Yearwood. Unsupervised
and supervised data classification via nonsmooth and global optimization. Top, 11:1–
93, 2003.
[11] A.M. Bagirov and J. Ugon. An algorithm for minimizing clustering functions. Optimization, 54(4-5):351–368, 2005.
[12] A.M. Bagirov, J. Ugon, and D. Webb. Fast modified global k-means algorithm for
incremental cluster construction. Pattern Recognition, 44(4):866–876, April 2011.
[13] A.M. Bagirov and J. Yearwood. A new nonsmooth optimization algorithm for mini-
mum sum-of-squares clustering problems. European Journal of Operational Research,
170(2):578–596, 2006.
[14] L. Bai, J. Liang, C. Sui, and Ch. Dang. Fast global k-means clustering based on local geometrical information. Information Sciences, 245:168–180, 2013.
[15] F.H. Clarke. Optimization and Nonsmooth Analysis. Canadian Mathematical Society
series of monographs and advanced texts. Wiley, 1983.
[16] V.F. Demyanov, A.M. Bagirov, and A.M. Rubinov. A method of truncated codif-
ferential with application to some problems of cluster analysis. Journal of Global
Optimization, 23(1):63–80, 2002.
[17] V.F. Demyanov and A.M. Rubinov. Constructive Nonsmooth Analysis. Peter Lang,
Frankfurt am Main, 1995.
[18] G. Diehr. Evaluation of a branch and bound algorithm for clustering. SIAM J.
Scientific and Statistical Computing, (6):268–284, 1985.
[22] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput.
Surv., 31(3):264–323, 1999.
[23] A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters,
31(8):651–666, 2010.
[25] J.Z.C. Lai and T.-J. Huang. Fast global k-means clustering using cluster membership and inequality. Pattern Recognition, 43(5):1954–1963, 2010.
[27] A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Pattern
Recognition, 36(2):451–461, 2003.
[28] B. Ordin and A.M. Bagirov. A heuristic algorithm for solving the minimum sum-of-
squares clustering problems. Journal of Global Optimization, 61:341–361, 2015.
[30] R. Scitovski and S. Scitovski. A fast partitioning algorithm and its application to
earthquake investigation. Computers & Geosciences, 59:124–131, 2013.
[31] S.Z. Selim and K.S. Al-Sultan. A simulated annealing algorithm for the clustering problem. Pattern Recognition, 24(10):1003–1008, 1991.
[32] P.D. Tao and L.T.H. An. Convex analysis approach to DC programming: theory,
algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
[34] H. Tuy, A.M. Bagirov, and A.M. Rubinov. Clustering via DC optimization. In Advances in Convex Analysis and Global Optimization, pages 221–234. Springer, 2001.
[35] P.H. Wolfe. Finding the nearest point in a polytope. Mathematical Programming,
11(2):128–149, 1976.
[36] A.E. Xavier. The hyperbolic smoothing clustering method. Pattern Recognition,
43:731–737, 2010.
[37] A.E. Xavier and V.L. Xavier. Solving the minimum sum-of-squares clustering problem by hyperbolic smoothing and partition into boundary and gravitational regions. Pattern Recognition, 44(1):70–77, 2011.
Table 2: Cluster function values obtained by algorithms
Table 3: Cluster function values obtained by algorithms (cont.)
Figure 1: Number of distance function evaluations ($N_d$) versus the number of clusters for the DCClust, GKM, MS-MGKM and MS-DCA algorithms: (a) Gas Sensor Array Drift, (b) EEG Eye State, (c) D15112, (d) KEGG Metabolic Relation Network, (e) Shuttle Control, (f) Pla85900, (g) Skin Segmentation, (h) 3D Road Network.
Figure 2: CPU time versus the number of clusters for the DCClust, GKM, MS-MGKM and MS-DCA algorithms: (a) Gas Sensor Array Drift, (b) EEG Eye State, (c) D15112, (d) KEGG Metabolic Relation Network, (e) Shuttle Control, (f) Pla85900, (g) Skin Segmentation, (h) 3D Road Network.