Understanding Clustering Algorithms

The document discusses clustering, focusing on the K-Means and Hierarchical clustering algorithms. K-Means is an iterative method that groups data points based on their proximity to cluster centers, while Hierarchical clustering creates a tree-like structure of nested clusters. The document also covers distance measures and the importance of similarity in clustering results.

Uploaded by

Deepak Joshi
Clustering

• Basic idea: group together similar instances


• Example: 2D point patterns

• What could similar mean?


– One option: small Euclidean distance (squared)
dist(x, y) = ‖x − y‖₂²
– Clustering results are crucially dependent on the measure of
similarity (or distance) between “points” to be clustered
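As a concrete illustration of the squared Euclidean distance above, a minimal sketch (the helper name `sq_euclidean` is my own):

```python
import numpy as np

def sq_euclidean(x, y):
    """Squared Euclidean distance: dist(x, y) = ||x - y||_2^2."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ d)

# Example: two 2D points
print(sq_euclidean([0.0, 0.0], [3.0, 4.0]))  # 25.0
```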
Clustering algorithms
• Hierarchical algorithms
– Bottom-up: agglomerative
– Top-down: divisive
• Partition algorithms (flat)
– K-means
– Mixture of Gaussians
– Spectral clustering
Clustering examples
• Image segmentation
Goal: Break up the image into meaningful or perceptually similar regions

[Slide from James Hayes]


K-Means
• An iterative clustering algorithm
– Initialize: Pick K random points as cluster centers
– Alternate:
1. Assign data points to closest cluster center
2. Change the cluster center to the average of its assigned points
– Stop when no point assignments change
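The three steps above can be sketched in NumPy. This is an illustrative implementation, not the slides' code; the function name `kmeans` and the tie-handling details are my choices.

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0)):
    """K-means as described above: pick K random data points as centers,
    then alternate assignment / mean-update until no assignment changes."""
    # Initialize: pick K distinct random data points as cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.full(len(X), -1)
    while True:
        # Step 1: assign each point to the closest cluster center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # stop: no assignments changed
            return centers, assign
        assign = new_assign
        # Step 2: move each center to the average of its assigned points
        for k in range(K):
            pts = X[assign == k]
            if len(pts):  # keep the old center if a cluster goes empty
                centers[k] = pts.mean(axis=0)
```

On two well-separated point clouds this recovers the obvious grouping after a few iterations.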
K-means clustering: Example

• Pick K random points as cluster centers (means)

Shown here for K=2
K-means clustering: Example
Iterative Step 1
• Assign data points to closest cluster center
K-means clustering: Example
Iterative Step 2
• Change the cluster center to the average of the assigned points
K-means clustering: Example
• Repeat until convergence
Properties of K-means algorithm
• Guaranteed to converge in a finite number of iterations

• Running time per iteration:
1. Assign data points to closest cluster center: O(KN) time
2. Change the cluster center to the average of its assigned points: O(N) time
What properties should a distance measure have?
• Symmetric
– D(x, y) = D(y, x)
– Otherwise we could say "x looks like y, but y does not look like x"
• Positivity, and self-similarity
– D(x, y) ≥ 0, and D(x, y) = 0 iff x = y
– Otherwise there will be different objects that we cannot tell apart
• Triangle inequality
– D(x, y) + D(y, z) ≥ D(x, z)
– Otherwise one could say "x is like y, y is like z, but x is not like z at all"

[Slide from Alan Fern]


K-means: Idea
• K clusters, each summarized by a prototype µk.
• Assignment of data xi to a cluster represented by responsibilities rik ∈ {0, 1} with Σ_{k=1}^{K} rik = 1.
• An example with 4 data points and 3 clusters:

        1 0 0
(rik) = 0 0 1
        0 1 0
        0 0 1

• Loss function J = Σ_{i=1}^{n} Σ_{k=1}^{K} rik ‖xi − µk‖₂²

Sriram Sankararaman Clustering
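The responsibility matrix and loss above can be checked numerically. The data points and prototypes below are made-up values for illustration; only the one-hot matrix comes from the slide.

```python
import numpy as np

# Responsibility matrix from the slide: 4 data points, 3 clusters,
# each row one-hot (r_ik in {0, 1}, rows summing to 1).
R = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 1]])

def loss(X, R, mu):
    """J = sum_i sum_k r_ik * ||x_i - mu_k||_2^2"""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x K distances
    return float((R * d2).sum())

# Invented 1D example: 4 points, 3 prototypes
X = np.array([[0.0], [2.0], [1.0], [3.0]])
mu = np.array([[0.0], [1.0], [2.5]])
print(loss(X, R, mu))  # 0.5: only the two points assigned to mu_3 contribute 0.25 each
```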



K-means: minimizing the loss function

• How do we minimize J w.r.t. (rik, µk)?
• Chicken and egg problem:
• If prototypes known, can assign responsibilities.
• If responsibilities known, can compute prototypes.
• We use an iterative procedure.


K-means: minimizing the loss function

• E-step: Fix µk, minimize J w.r.t. rik.
• Assign each data point to its nearest prototype.
• M-step: Fix rik, minimize J w.r.t. µk.
• Set each prototype to the mean of the points in that cluster, i.e., µk = Σ_i rik xi / Σ_i rik.
• This procedure is guaranteed to converge.
• Converges to a local minimum.
• Use different initializations and pick the best solution.
• May still be insufficient for large search spaces.
• Other ways include a split-merge approach.


How do we initialize K-means?

• Some heuristics
• Randomly pick K data points as prototypes.
• Pick prototype i + 1 to be farthest from prototypes {1, . . . , i }.

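The second heuristic (pick the next prototype farthest from those already chosen) can be sketched as follows; `farthest_point_init` is a hypothetical name, and the very first prototype is a random data point, as in the first heuristic.

```python
import numpy as np

def farthest_point_init(X, K, rng=np.random.default_rng(0)):
    """Heuristic from the slide: pick one prototype at random, then pick
    prototype i+1 to be the point farthest from prototypes {1, ..., i}."""
    protos = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # distance of every point to its nearest already-chosen prototype
        d2 = np.min([((X - p) ** 2).sum(axis=1) for p in protos], axis=0)
        protos.append(X[int(d2.argmax())])  # farthest point becomes next prototype
    return np.array(protos)
```

Starting K-means from such spread-out prototypes tends to avoid the worst local minima of a purely random pick.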


Hierarchical Clustering

The goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram.

The clusters in the hierarchy range from the fine-grained to the coarse-grained –
the lowest level of the tree (the leaves) consists of each point in its own cluster,
whereas the highest level (the root) consists of all points in one cluster.

Agglomerative hierarchical clustering methods work in a bottom-up manner. Starting with each of the n points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster.

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 14: Hierarchical Clustering 2 / 16
Hierarchical Clustering: Nested Partitions

Given a dataset D = {x1, ..., xn}, where xi ∈ R^d, a clustering C = {C1, ..., Ck} is a partition of D.

A clustering A = {A1, ..., Ar} is said to be nested in another clustering B = {B1, ..., Bs} if and only if r > s, and for each cluster Ai ∈ A there exists a cluster Bj ∈ B such that Ai ⊆ Bj.

Hierarchical clustering yields a sequence of n nested partitions C1, ..., Cn. The clustering Ct−1 is nested in the clustering Ct. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between cluster Ci ∈ Ct−1 and cluster Cj ∈ Ct if Ci is nested in Cj, that is, if Ci ⊂ Cj.
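The nesting definition translates directly into code; representing clusters as frozensets of point labels is my own encoding choice, not the text's.

```python
def is_nested(A, B):
    """Clustering A is nested in clustering B iff A has more clusters
    (r > s) and every cluster in A is a subset of some cluster in B.
    A and B are lists of frozensets of point labels."""
    if len(A) <= len(B):
        return False
    return all(any(a <= b for b in B) for a in A)
```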

Hierarchical Clustering Dendrogram

The dendrogram (a tree over the points A, B, C, D, E that merges A with B, C with D, then AB with CD, and finally ABCD with E) represents the following sequence of nested partitions:

Clustering   Clusters
C1           {A}, {B}, {C}, {D}, {E}
C2           {AB}, {C}, {D}, {E}
C3           {AB}, {CD}, {E}
C4           {ABCD}, {E}
C5           {ABCDE}

with Ct−1 ⊂ Ct for t = 2, ..., 5. We assume that A and B are merged before C and D.
Agglomerative Hierarchical Clustering

In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster.

Given a set of clusters C = {C1, C2, ..., Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.

Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.

We repeat the process until C contains only one cluster. If specified, we can stop the merging process when there are exactly k clusters remaining.
Agglomerative Hierarchical Clustering Algorithm

AgglomerativeClustering(D, k):
1 C ← {Ci = {x i } | x i ∈ D} // Each point in separate cluster
2 ∆ ← {δ(x i , x j ): x i , x j ∈ D} // Compute distance matrix
3 repeat
4 Find the closest pair of clusters Ci , Cj ∈ C
5 Cij ← Ci ∪ Cj // Merge the clusters
6 C ← C \ {Ci , Cj } ∪ {Cij } // Update the clustering
7 Update distance matrix ∆ to reflect new clustering
8 until |C| = k

Distance between Clusters
Single, Complete and Average

A typical distance between two points is the Euclidean distance or L2-norm:

‖x − y‖₂ = ( Σ_{i=1}^{d} (xi − yi)² )^{1/2}

Single Link: The minimum distance between a point in Ci and a point in Cj:

δ(Ci, Cj) = min{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Complete Link: The maximum distance between points in the two clusters:

δ(Ci, Cj) = max{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Group Average: The average pairwise distance between points in Ci and Cj:

δ(Ci, Cj) = ( Σ_{x∈Ci} Σ_{y∈Cj} ‖x − y‖ ) / (ni · nj)
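The three linkage measures can be sketched with NumPy broadcasting; `pairwise` is a hypothetical helper that computes all ‖x − y‖ between two clusters, from which each linkage is a one-line reduction.

```python
import numpy as np

def pairwise(Ci, Cj):
    """Matrix of Euclidean distances ||x - y|| for all x in Ci, y in Cj."""
    Ci, Cj = np.atleast_2d(Ci), np.atleast_2d(Cj)
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

single_link   = lambda Ci, Cj: pairwise(Ci, Cj).min()   # minimum pairwise distance
complete_link = lambda Ci, Cj: pairwise(Ci, Cj).max()   # maximum pairwise distance
group_average = lambda Ci, Cj: pairwise(Ci, Cj).mean()  # average over n_i * n_j pairs
```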
Single Link Agglomerative Clustering

Initial distance matrix:

δ    B  C  D  E
A    1  3  2  4
B       3  2  3
C          1  3
D             5

After merging A and B (δ = 1):

δ    C  D  E
AB   3  2  3
C       1  3
D          5

After merging C and D (δ = 1):

δ    CD E
AB   2  3
CD      3

After merging AB and CD (δ = 2):

δ    E
ABCD 3

Finally ABCD merges with E (δ = 3), giving the dendrogram over A, B, C, D, E with merge heights 1, 1, 2, 3.
Covariance
• Variance and covariance are measures of the "spread" of a set of points around their center of mass (mean).

• Variance: a measure of the deviation from the mean for points in one dimension, e.g. heights.

• Covariance: a measure of how much each of the dimensions varies from the mean with respect to the others.

• Covariance is measured between 2 dimensions to see if there is a relationship between them, e.g. number of hours studied & marks obtained.

• The covariance between one dimension and itself is the variance.

Covariance
cov(X, Y) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / (n − 1)

• So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.
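A direct transcription of the covariance formula above; the hours/marks numbers are made up for illustration.

```python
def covariance(X, Y):
    """cov(X, Y) = sum_i (X_i - Xbar)(Y_i - Ybar) / (n - 1)"""
    n = len(X)
    xbar, ybar = sum(X) / n, sum(Y) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)

# Invented data: hours studied vs marks obtained
hours = [1, 2, 3, 4, 5]
marks = [10, 20, 27, 41, 50]
print(covariance(hours, hours))  # 2.5  -- cov(X, X) is the variance of X
print(covariance(hours, marks))  # 25.25 -- positive: they increase together
```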
Covariance Matrix
• Representing covariance between dimensions as a matrix, e.g. for 3 dimensions:

    | cov(x,x)  cov(x,y)  cov(x,z) |
C = | cov(y,x)  cov(y,y)  cov(y,z) |
    | cov(z,x)  cov(z,y)  cov(z,z) |

• The diagonal holds the variances of x, y and z
• cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal
• N-dimensional data will result in an N×N covariance matrix
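The stated properties (symmetry, variances on the diagonal, N×N shape) can be verified with NumPy's `np.cov`; the 3-dimensional sample data below is invented.

```python
import numpy as np

# Five observations of three hypothetical dimensions x, y, z
data = np.array([[ 2.0, 8.0, 1.0],
                 [ 4.0, 6.0, 3.0],
                 [ 6.0, 5.0, 2.0],
                 [ 8.0, 3.0, 5.0],
                 [10.0, 1.0, 4.0]])

C = np.cov(data, rowvar=False)  # rows are observations, columns are dimensions

assert C.shape == (3, 3)                                  # N dims -> N x N matrix
assert np.allclose(C, C.T)                                # cov(x,y) = cov(y,x)
assert np.allclose(np.diag(C), data.var(axis=0, ddof=1))  # diagonal = variances
```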
Covariance
• What is the interpretation of covariance calculations?
e.g.: a 2-dimensional data set:
x: number of hours studied for a subject
y: marks obtained in that subject
Say the covariance value is 104.53. What does this value mean?
Covariance
• The exact value is not as important as its sign.

• A positive covariance indicates that both dimensions increase or decrease together, e.g. as the number of hours studied increases, the marks in that subject increase.

• A negative value indicates that while one increases the other decreases, or vice versa, e.g. active social life at PSU vs performance in the CS dept.

• If the covariance is zero, the two dimensions are uncorrelated, e.g. heights of students vs the marks obtained in a subject. (Zero covariance by itself does not guarantee full independence.)
Covariance
• Why bother with calculating covariance when we could just plot the 2 values to see their relationship?
Covariance calculations are used to find relationships between dimensions in high-dimensional data sets (usually greater than 3) where visualization is difficult.
PCA
• Principal components analysis (PCA) is a technique that can be used to simplify a dataset.
• It is a linear transformation that chooses a new coordinate system for the data set such that:
– the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),
– the second greatest variance on the second axis, and so on.
• PCA can be used for reducing dimensionality by eliminating the later principal components.
PCA
• By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset.
• These are the principal components.
• PCA is a useful statistical technique that has found application in:
– fields such as face recognition and image compression
– finding patterns in data of high dimension.
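A minimal PCA sketch following the eigendecomposition route described above (the function name `pca` is my own; note `np.linalg.eigh` returns eigenvalues in ascending order, so we re-sort descending):

```python
import numpy as np

def pca(X, n_components):
    """PCA via the covariance matrix: eigenvectors with the largest
    eigenvalues are the principal components; project onto them to
    reduce dimensionality."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = np.cov(Xc, rowvar=False)            # covariance matrix
    evals, evecs = np.linalg.eigh(C)        # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]         # sort eigenvalues descending
    components = evecs[:, order[:n_components]]
    return Xc @ components                  # projected (reduced) data
```

On nearly collinear 2D data, the single retained component captures almost all of the variance, which is exactly the dimensionality-reduction claim above.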
