Clustering
• Basic idea: group together similar instances
• Example: 2D point patterns
• What could similar mean?
– One option: small Euclidean distance (squared)
dist(x, y) = ||x − y||₂²
– Clustering results are crucially dependent on the measure of
similarity (or distance) between “points” to be clustered
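As a concrete check of the distance above, here is a minimal sketch of the squared Euclidean distance in numpy (the helper name `sq_euclidean` is ours, not from the slides):

```python
import numpy as np

# Squared Euclidean distance from the slide: dist(x, y) = ||x - y||_2^2
def sq_euclidean(x, y):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(d, d))

print(sq_euclidean([0.0, 0.0], [3.0, 4.0]))  # 25.0
```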
Clustering algorithms
• Partition algorithms (flat)
– K-means
– Mixture of Gaussians
– Spectral clustering
• Hierarchical algorithms
– Bottom up: agglomerative
– Top down: divisive
Clustering examples
Image segmentation
Goal: Break up the image into meaningful or perceptually similar regions
[Slide from James Hayes]
K-Means
• An iterative clustering algorithm
– Initialize: Pick K random points as cluster centers
– Alternate:
1. Assign data points to the closest cluster center
2. Change each cluster center to the average of its assigned points
– Stop when no point assignments change
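The alternation above can be sketched directly in numpy. This is a minimal illustration, not a production implementation; the `init_idx` parameter (letting us fix the initial centers for a deterministic demo) is our addition, while the slide picks K random points:

```python
import numpy as np

def kmeans(X, K, init_idx=None, seed=0):
    """K-means as on the slide: pick K points as centers, then alternate
    (1) assign points to the nearest center and (2) recompute each center
    as the mean of its points, stopping when assignments no longer change."""
    rng = np.random.default_rng(seed)
    if init_idx is None:
        init_idx = rng.choice(len(X), size=K, replace=False)
    centers = X[np.asarray(init_idx)].astype(float).copy()
    assign = np.full(len(X), -1)
    while True:
        # Step 1: squared distances from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # stop: no assignment changed
            return centers, assign
        assign = new_assign
        # Step 2: move each center to the mean of its assigned points
        for k in range(K):
            pts = X[assign == k]
            if len(pts):                         # leave empty clusters in place
                centers[k] = pts.mean(axis=0)

# Two well-separated groups; start one center in each group
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, assign = kmeans(X, K=2, init_idx=[0, 4])
```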
K-means clustering: Example
• Pick K random points as cluster centers (means)
Shown here for K = 2
K-means clustering: Example
Iterative Step 1
• Assign data points to the closest cluster center
K-means clustering: Example
Iterative Step 2
• Change the cluster center to the average of the assigned points
K-means clustering: Example
• Repeat until convergence
Properties of K-means algorithm
• Guaranteed to converge in a finite number of iterations
• Running time per iteration:
1. Assign data points to closest cluster center: O(KN) time
2. Change the cluster center to the average of its assigned points: O(N) time
What properties should a distance measure have?
• Symmetric
– D(x, y) = D(y, x)
– Otherwise we could say x looks like y, but y does not look like x
• Positivity, and self-similarity
– D(x, y) ≥ 0, and D(x, y) = 0 iff x = y
– Otherwise there will be different objects that we cannot tell apart
• Triangle inequality
– D(x, y) ≤ D(y, z) + D(x, z)
– Otherwise one can say “x is like y, y is like z, but x is not like z at all”
[Slide from Alan Fern]
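The three axioms above can be spot-checked in code for the Euclidean distance. This is only a check on a handful of points, not a proof:

```python
import itertools
import math

# Spot-check the three metric axioms from the slide for Euclidean
# distance on a few 2-D points.
def D(x, y):
    return math.dist(x, y)  # Euclidean distance (Python 3.8+)

pts = [(0, 0), (1, 0), (0, 2), (3, 4), (-1, 5)]
for x, y, z in itertools.product(pts, repeat=3):
    assert D(x, y) == D(y, x)                            # symmetry
    assert D(x, y) >= 0 and (D(x, y) == 0) == (x == y)   # positivity / self-similarity
    assert D(x, y) <= D(y, z) + D(x, z) + 1e-12          # triangle inequality
```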
K-means: Idea
• K clusters, each summarized by a prototype µ_k.
• Assignment of data x_i to a cluster is represented by responsibilities r_ik ∈ {0, 1} with Σ_{k=1}^{K} r_ik = 1.
• An example with 4 data points and 3 clusters:
         | 1 0 0 |
(r_ik) = | 0 0 1 |
         | 0 1 0 |
         | 0 0 1 |
• Loss function J = Σ_{i=1}^{n} Σ_{k=1}^{K} r_ik ||x_i − µ_k||₂².
Sriram Sankararaman, Clustering
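Evaluating the loss J for the responsibility matrix on the slide can be done in a few lines. The responsibility matrix is the slide's; the data points and prototypes below are made up for illustration:

```python
import numpy as np

# Loss J = sum_i sum_k r_ik ||x_i - mu_k||_2^2 for the slide's
# 4-point / 3-cluster responsibility matrix.
r = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [6.0, 0.0]])  # made-up data
mu = np.array([[0.0, 0.0], [0.0, 3.0], [5.0, 0.0]])             # made-up prototypes

sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # ||x_i - mu_k||^2
J = float((r * sq).sum())
print(J)  # only the terms with r_ik = 1 contribute
```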
K-means: minimizing the loss function
• How do we minimize J w.r.t. (r_ik, µ_k)?
• Chicken-and-egg problem:
• If prototypes are known, we can assign responsibilities.
• If responsibilities are known, we can compute prototypes.
• We use an iterative procedure.
K-means: minimizing the loss function
• E-step: Fix µ_k, minimize J w.r.t. r_ik.
• Assign each data point to its nearest prototype.
• M-step: Fix r_ik, minimize J w.r.t. µ_k.
• Set each prototype to the mean of the points in that cluster, i.e., µ_k = (Σ_i r_ik x_i) / (Σ_i r_ik).
• This procedure is guaranteed to converge.
• It converges to a local minimum.
• Use different initializations and pick the best solution.
• This may still be insufficient for large search spaces.
• Other ways include a split-merge approach.
How do we initialize K-means?
• Some heuristics:
• Randomly pick K data points as prototypes.
• Pick prototype i + 1 to be the point farthest from prototypes {1, . . . , i}.
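The farthest-point heuristic above can be sketched as follows. We interpret "farthest from prototypes {1, . . . , i}" as maximizing the distance to the nearest already-chosen prototype (the usual reading); the function name and the demo data are ours:

```python
import numpy as np

# Farthest-point initialization: pick a first prototype, then repeatedly
# pick the point whose distance to its nearest chosen prototype is largest.
def farthest_point_init(X, K, first=0):
    chosen = [first]
    for _ in range(K - 1):
        # squared distances from every point to every chosen prototype
        d = ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    return X[chosen]

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [5.0, 5.0]])
protos = farthest_point_init(X, K=3, first=0)
```

Already-chosen points have distance 0 to themselves, so they are never re-selected.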
Hierarchical Clustering
The goal of hierarchical clustering is to create a sequence of nested partitions,
which can be conveniently visualized via a tree or hierarchy of clusters, also called
the cluster dendrogram.
The clusters in the hierarchy range from the fine-grained to the coarse-grained –
the lowest level of the tree (the leaves) consists of each point in its own cluster,
whereas the highest level (the root) consists of all points in one cluster.
Agglomerative hierarchical clustering methods work in a bottom-up manner.
Starting with each of the n points in a separate cluster, they repeatedly merge the
most similar pair of clusters until all points are members of the same cluster.
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 14: Hierarchical Clustering 2 / 16
Hierarchical Clustering: Nested Partitions
Given a dataset D = {x 1 , . . . , x n }, where x i ∈ Rd , a clustering C = {C1 , . . . , Ck } is
a partition of D.
A clustering A = {A1 , . . . , Ar } is said to be nested in another clustering
B = {B1 , . . . , Bs } if and only if r > s, and for each cluster Ai ∈ A, there exists a
cluster Bj ∈ B, such that Ai ⊆ Bj .
Hierarchical clustering yields a sequence of n nested partitions C1 , . . . , Cn . The
clustering Ct −1 is nested in the clustering Ct . The cluster dendrogram is a rooted
binary tree that captures this nesting structure, with edges between cluster
Ci ∈ Ct −1 and cluster Cj ∈ Ct if Ci is nested in Cj , that is, if Ci ⊂ Cj .
Hierarchical Clustering Dendrogram
The dendrogram (a binary tree with leaves A, B, C, D, E and internal nodes AB, CD, ABCD, ABCDE) represents the following sequence of nested partitions:

Clustering   Clusters
C1           {A}, {B}, {C}, {D}, {E}
C2           {AB}, {C}, {D}, {E}
C3           {AB}, {CD}, {E}
C4           {ABCD}, {E}
C5           {ABCDE}

with C_{t−1} ⊂ C_t for t = 2, . . . , 5. We assume that A and B are merged before C and D.
Agglomerative Hierarchical Clustering
In agglomerative hierarchical clustering, we begin with each of the n points in a
separate cluster. We repeatedly merge the two closest clusters until all points are
members of the same cluster.
Given a set of clusters C = {C1, C2, . . . , Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.
Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.
We repeat the process until C contains only one cluster. If specified, we can stop the merging process when there are exactly k clusters remaining.
Agglomerative Hierarchical Clustering Algorithm
AgglomerativeClustering(D, k):
1 C ← {Ci = {x i } | x i ∈ D} // Each point in separate cluster
2 ∆ ← {δ(x i , x j ): x i , x j ∈ D} // Compute distance matrix
3 repeat
4 Find the closest pair of clusters Ci , Cj ∈ C
5 Cij ← Ci ∪ Cj // Merge the clusters
6 C ← C \ {Ci , Cj } ∪ {Cij } // Update the clustering
7 Update distance matrix ∆ to reflect new clustering
8 until |C| = k
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 14: Hierarchical Clustering 7 / 16
Distance between Clusters
Single, Complete and Average
A typical distance between two points is the Euclidean distance or L2-norm:
||x − y||₂ = ( Σ_{i=1}^{d} (x_i − y_i)² )^{1/2}
Single Link: The minimum distance between a point in Ci and a point in Cj:
δ(Ci, Cj) = min{ ||x − y|| : x ∈ Ci, y ∈ Cj }
Complete Link: The maximum distance between points in the two clusters:
δ(Ci, Cj) = max{ ||x − y|| : x ∈ Ci, y ∈ Cj }
Group Average: The average pairwise distance between points in Ci and Cj:
δ(Ci, Cj) = ( Σ_{x∈Ci} Σ_{y∈Cj} ||x − y|| ) / (ni · nj)
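The three linkage measures above can be written as short functions over all cross-pairs of two clusters. The function names and the two small example clusters are our own:

```python
import itertools
import math

# The three inter-cluster distances from the slide, computed over all
# cross-pairs (x, y) with x in Ci and y in Cj.
def pairwise(Ci, Cj):
    return [math.dist(x, y) for x, y in itertools.product(Ci, Cj)]

def single_link(Ci, Cj):
    return min(pairwise(Ci, Cj))       # closest cross-pair

def complete_link(Ci, Cj):
    return max(pairwise(Ci, Cj))       # farthest cross-pair

def group_average(Ci, Cj):
    d = pairwise(Ci, Cj)
    return sum(d) / len(d)             # divides by n_i * n_j

Ci = [(0.0, 0.0), (0.0, 1.0)]
Cj = [(3.0, 0.0), (3.0, 4.0)]
```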
Single Link Agglomerative Clustering
Initial distance matrix:
δ    B  C  D  E
A    1  3  2  4
B       3  2  3
C          1  3
D             5
Merge A and B at δ = 1:
δ    C  D  E
AB   3  2  3
C       1  3
D          5
Merge C and D at δ = 1:
δ    CD  E
AB    2  3
CD        3
Merge AB and CD at δ = 2:
δ     E
ABCD  3
Finally, merge ABCD and E at δ = 3. The resulting dendrogram has leaves A, B, C, D, E and merge heights 1, 1, 2, 3.
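The single-link merge sequence above can be reproduced by a small loop over the example's point-to-point distances (the code below is a sketch of the agglomerative algorithm, with the slide's distance matrix hard-coded):

```python
import itertools

# Point-to-point distances from the slide's example.
d = {frozenset(p): v for p, v in {
    ("A", "B"): 1, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 2, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 3, ("D", "E"): 5}.items()}

def dist(Ci, Cj):
    # single link: minimum over all cross-pairs of points
    return min(d[frozenset((x, y))] for x in Ci for y in Cj)

C = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]   # start: singleton clusters
heights = []
while len(C) > 1:
    # find and merge the closest pair of clusters
    Ci, Cj = min(itertools.combinations(C, 2), key=lambda p: dist(*p))
    heights.append(dist(Ci, Cj))
    C = [c for c in C if c is not Ci and c is not Cj] + [Ci | Cj]
```

The recorded merge heights match the dendrogram: 1, 1, 2, 3 (ties broken so that A and B merge before C and D, as the slides assume).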
Covariance
• Variance and Covariance are a measure of the “spread” of a set
of points around their center of mass (mean)
• Variance – measure of the deviation from the mean for points in
one dimension e.g. heights
• Covariance is a measure of how much each of the dimensions varies from the mean with respect to the others.
• Covariance is measured between 2 dimensions to see if there is
a relationship between the 2 dimensions e.g. number of hours
studied & marks obtained.
• The covariance between one dimension and itself is the variance
Covariance
cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)
• So, if you had a 3-dimensional data set (x,y,z), then you could
measure the covariance between the x and y dimensions, the y
and z dimensions, and the x and z dimensions. Measuring the
covariance between x and x , or y and y , or z and z would give
you the variance of the x , y and z dimensions respectively.
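The formula above translates directly into code. The hours/marks numbers below are made up; they only illustrate the slide's example of a positive relationship:

```python
# Sample covariance with the (n - 1) denominator from the slide.
def covariance(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    return sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (n - 1)

hours = [1.0, 2.0, 3.0, 4.0]      # made-up hours studied
marks = [20.0, 40.0, 50.0, 70.0]  # made-up marks obtained
print(covariance(hours, marks))   # positive: more hours, higher marks
print(covariance(hours, hours))   # covariance of a dimension with itself = variance
```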
Covariance Matrix
• Representing Covariance between dimensions as a
matrix e.g. for 3 dimensions:
    | cov(x,x) cov(x,y) cov(x,z) |
C = | cov(y,x) cov(y,y) cov(y,z) |
    | cov(z,x) cov(z,y) cov(z,z) |
• The diagonal entries are the variances of x, y and z
• cov(x,y) = cov(y,x) hence matrix is symmetrical about
the diagonal
• N-dimensional data will result in NxN covariance
matrix
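Both properties above (symmetry, variances on the diagonal) can be checked numerically; the 3-dimensional data below is made up:

```python
import numpy as np

# Build the 3x3 covariance matrix for (x, y, z) data and check the two
# properties from the slide: symmetry, and variances on the diagonal.
data = np.array([[1.0, 2.0, 0.5],
                 [2.0, 1.0, 1.5],
                 [3.0, 4.0, 0.0],
                 [4.0, 3.0, 2.0]])   # rows = observations, cols = x, y, z

C = np.cov(data, rowvar=False)       # uses the (n - 1) denominator by default
assert np.allclose(C, C.T)           # cov(x,y) = cov(y,x): symmetric
assert np.allclose(np.diag(C), data.var(axis=0, ddof=1))  # diagonal = variances
```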
Covariance
• What is the interpretation of covariance
calculations?
e.g.: 2 dimensional data set
x: number of hours studied for a subject
y: marks obtained in that subject
covariance value is say: 104.53
what does this value mean?
Covariance
• The exact value is not as important as its sign.
• A positive value of covariance indicates both
dimensions increase or decrease together e.g. as the
number of hours studied increases, the marks in that
subject increase.
• A negative value indicates while one increases the
other decreases, or vice-versa e.g. active social life
at PSU vs performance in CS dept.
• If the covariance is zero, the two dimensions are uncorrelated (though not
necessarily independent), e.g. heights of students vs
the marks obtained in a subject
Covariance
• Why bother with calculating covariance
when we could just plot the 2 values to
see their relationship?
Covariance calculations are used to find
relationships between dimensions in high
dimensional data sets (usually greater
than 3) where visualization is difficult.
PCA
• Principal components analysis (PCA) is a technique
that can be used to simplify a dataset
• It is a linear transformation that chooses a new
coordinate system for the data set such that the
greatest variance under any projection of the data
comes to lie on the first axis (then called the
first principal component),
the second greatest variance on the second axis,
and so on.
• PCA can be used for reducing dimensionality by
eliminating the later principal components.
PCA
• By finding the eigenvalues and eigenvectors of the
covariance matrix, we find that the eigenvectors with
the largest eigenvalues correspond to the dimensions
that have the strongest correlation in the dataset.
• The eigenvector with the largest eigenvalue is the first principal component.
• PCA is a useful statistical technique that has found
application in:
– fields such as face recognition and image compression
– finding patterns in data of high dimension.
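The eigenvector recipe above can be sketched end to end on synthetic, nearly one-dimensional data (all data and thresholds below are our own illustration, not from the slides):

```python
import numpy as np

# PCA as described: eigendecompose the covariance matrix and sort the
# eigenvectors by decreasing eigenvalue; projecting onto the first
# eigenvector keeps the direction of greatest variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])  # nearly 1-D data

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(C)          # eigh: C is symmetric
order = np.argsort(vals)[::-1]          # largest eigenvalue first
vals, vecs = vals[order], vecs[:, order]

# Dimensionality reduction: keep only the first principal component
Z = Xc @ vecs[:, :1]
explained = float(vals[0] / vals.sum())  # fraction of variance retained
```

Because the second coordinate is almost a multiple of the first, the first principal component captures nearly all the variance, so dropping the second component loses little information.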