Birla Institute of Technology and Science Pilani, Hyderabad Campus
27.09.2024
BITS F464: Machine Learning (1st Sem 2024-25)
LINEAR DISCRIMINANT ANALYSIS
Chittaranjan Hota, Sr. Professor
Dept. of Computer Sc. and Information Systems
hota@[Link]
Linear Discriminant Functions: Applications
• Fisher’s Linear Discriminant Analysis for reducing the number of features
required for Face Recognition.
• Classifying a patient’s disease state as Mild, Moderate, or Severe.
• Identifying the type of customers who might buy a particular product.
Linear Discriminant Functions
• Used to discriminate between two or more classes based on
a set of predictor variables.
x = [x1, …, xD]T → Ck (feature vector mapped to a class label)
Discriminant Function
• Learns the mapping between feature vector and class labels.
• Does it not create a decision boundary?
Hyperplanes
(1-D: a point/threshold; 2-D: a line; 3-D: a plane)
Logistic Regression may be unstable for well-separated classes and few examples. Why?
Two-class Linear Discriminant Functions (K=2)
• y(x) = wT x + w0 = Σi wi xi + w0
Where wT is the weight vector and w0 is the bias. The negative of the bias (i.e. -w0) is sometimes called the threshold.
(Figure: geometry of an LDF in 2 dimensions. The region with y(x) > 0 is assigned to C1 and the region with y(x) < 0 to C2; w0 determines the location of the decision surface and w its orientation.)
Question: if w0 = 0 in 3 dimensions, the hyperplane will pass through what?
Distance of Origin to Decision Surface (Bias: w0)
• Let xA and xB be points that lie on the decision surface: y(xA) = y(xB) = 0.
• Then wT xA + w0 = wT xB + w0 = 0, so wT (xA - xB) = 0.
• xA - xB is an arbitrary vector parallel to the decision surface, hence w is orthogonal to every vector lying on the decision surface: w determines its orientation.
• If x is a point on the decision surface, y(x) = 0, so wT x = -w0. The normal distance from the origin to the decision surface is therefore wT x / ||w|| = -w0 / ||w||: the bias w0 determines its location.
Distance of a point ‘x’ to the Decision surface ( r )
Let ‘x’ be an arbitrary point and x⊥ be its orthogonal projection on the decision surface. By vector addition:
x = x⊥ + r (w / ||w||)
The second term is a unit vector normal to the decision surface, collinear with w; since ||w / ||w|| || = 1, we scale it by r.
Multiplying both sides by wT and adding w0, and using y(x⊥) = 0 and wT w = ||w||²:
y(x) = wT x + w0 = wT x⊥ + w0 + r (wT w / ||w||) = 0 + r ||w||
Hence r = y(x) / ||w||.
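A minimal NumPy check of r = y(x) / ||w||, using hypothetical values for w and w0 (the same numbers as the worked example later in these slides):

import numpy as np

w = np.array([2.0, 3.0])   # hypothetical weight vector
w0 = -15.0                 # hypothetical bias

def signed_distance(x, w, w0):
    # Signed normal distance from x to the hyperplane w^T x + w0 = 0
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([3.0, 4.0])
print(signed_distance(x, w, w0))  # 3 / sqrt(13) ≈ 0.832; positive: x is on the C1 side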
Multi-class Linear Discriminant Functions (K>2)
Approach 1: Combine a number of two-class discriminant functions.
• One-versus-the-rest: K - 1 classifiers, each one separating points in a particular class Ck from points not in that class.
• One-versus-one (the alternative): K(K - 1)/2 classifiers, one for every possible pair of classes; each point is classified according to a majority vote.
(Both constructions leave ambiguous regions of input space, marked ‘?’ in the figure.)
Another Example…
For a training point xp belonging to class c, we want:
bc + xpT wc > 0 and bj + xpT wj < 0, for j = 1, …, C, j ≠ c
A new point is classified by taking the global maximum over the discriminants:
y = argmax j=1,…,C (bj + xT wj)
Solution: Using K-discriminant functions
• Building a single K-class discriminant comprising K linear functions of the form:
• yk(x) = wkT x + wk0
• Then, assigning a point x to class Ck if yk(x) > yj(x), ∀ j ≠ k (see the sketch below).
• The decision boundary between Ck and Cj : yk(x) = yj(x)
• Defined by: (wk - wj)T x + (wk0 - wj0) = 0 (same form as the 2-class case)
• Hence, same geometrical properties apply.
Decision regions of such discriminants are always singly
connected and convex.
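A short sketch of this argmax decision rule; the array shapes and names are assumptions, not the lecture’s code:

import numpy as np

def predict(X, W, w0):
    # X: (N, D) inputs, W: (D, K) matrix with w_k as columns, w0: (K,) biases
    scores = X @ W + w0               # (N, K) discriminant values y_k(x) = w_k^T x + w_k0
    return np.argmax(scores, axis=1)  # assign each point to the class with the largest y_k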
Proof of Convexity Next…
Proof of Convexity of Decision Region
Consider two points xA and xB that both lie inside the decision region Rk, and any point x on the line segment joining them:
x = λ xA + (1 - λ) xB, where 0 ≤ λ ≤ 1
From the linearity of the discriminant functions:
yk(x) = λ yk(xA) + (1 - λ) yk(xB)
As xA and xB lie inside Rk, yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k, hence yk(x) > yj(x): x must also lie in Rk.
Therefore Rk is singly connected and convex.
(Figure: decision regions Ri, Rj, Rk, with the segment from xA to xB lying inside Rk.)
Multi-class Classification using LDA (sklearn)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lda.fit(X_train, y_train)
Features: alcohol, magnesium, hue, proline, …
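A complete, runnable version of this sketch, assuming the listed features (alcohol, magnesium, hue, proline, …) refer to sklearn’s built-in Wine dataset:

from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Assumption: the slide's X, y come from the Wine dataset (13 features, 3 classes)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(lda.score(X_test, y_test))  # mean accuracy on the held-out 30% split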
Least Squares for Classification
• Straightforward way to adapt regression techniques for
classification tasks.
• How do we compute y(x) and the weights w0, w1, w2, …, wD?
• Each class Ck is described by its own linear model:
• yk(x) = wkT x + wk0 (where x and w have D dimensions each)
• We can group these together using vector notation:
y(x) = W̃T x̃
where x̃ = (1, xT)T is the augmented input vector and W̃ is the parameter matrix whose kth column is the (D+1)-dimensional vector w̃k = (wk0, wkT)T.
• A new input x is then assigned to the class for which the output yk(x) = w̃kT x̃ is largest. W̃ is obtained by minimizing the sum-of-squares error.
An example classification using LDF
• Suppose we have a dataset of two classes: Class A and Class B. Each data point has two
features, x1 and x2. Our task is to classify new data points into either Class A or Class B using
a linear discriminant function.
• Class A (positive class): XA = {(2,3), (3,3), (4,5), (5,6)} & Class B (negative class): XB =
{(1,1),(2,2),(3,1),(4,2)}
• Step 1: Define the linear discriminant function:
g(x) = w1 x1 + w2 x2 + w0. Decision rule: if g(x) > 0, classify as Class A, else Class B.
• Step 2: Train the classifier (e.g., by least squares):
Suppose after training we obtain the LDF: g(x) = 2x1 + 3x2 - 15
• Step 3: Classify new points:
x = (3, 4): g(3, 4) = 2×3 + 3×4 - 15 = 3
As g(3, 4) = 3 > 0, we classify it as Class A.
Which class will the point (2, 1) belong to? (See the sketch below.)
• Decision boundary:
g(x) = 0 ⇒ x2 = (-2/3) x1 + 5
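A few lines of Python reproducing this worked example (coefficients taken from Step 2):

def g(x1, x2):
    # Trained LDF from Step 2: g(x) = 2*x1 + 3*x2 - 15
    return 2 * x1 + 3 * x2 - 15

print(g(3, 4))  #  3 > 0  -> Class A
print(g(2, 1))  # -8 < 0  -> Class B (answer to the question above)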
Minimizing sum-of-squares error func
Let there be a training dataset {xn, tn}, where n = 1, …, N.
Define a matrix T whose nth row is the vector tnT, and a matrix X̃ whose nth row is x̃nT.
Then the sum-of-squares error function can be written as:
E(W̃) = (1/2) Tr{ (X̃W̃ - T)T (X̃W̃ - T) }
Multiplying a matrix by its transpose results in a square matrix; taking the trace of this square matrix (the sum of the elements on the main diagonal) gives the scalar error.
LSE Computation: An Example
Minimizing Sum-of-Squares
To minimize the error, set the derivative with respect to W̃ equal to zero and solve:
W̃ = (X̃T X̃)^-1 X̃T T = X̃† T
where X̃† is the pseudoinverse of X̃ (the pseudoinverse solution).
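A minimal NumPy sketch of this pseudoinverse solution, assuming one-of-K (one-hot) target rows in T; the helper names are illustrative, not from the lecture:

import numpy as np

def lsq_fit(X, y, K):
    # Augmented design matrix: nth row is x_tilde_n^T = [1, x_n^T]
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    T = np.eye(K)[y]                    # (N, K) one-hot target matrix (y holds integer labels)
    return np.linalg.pinv(X_tilde) @ T  # W_tilde = pinv(X_tilde) @ T

def lsq_predict(X, W_tilde):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)  # class with the largest output wins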
Least-squares: highly sensitive to outliers
Magenta: Least squares, Green: Logistic regression
Least squares: more severe problems
Least squares is unsuitable for certain datasets (here: 3 classes in a 2-D space, synthetic data)
(Least-squares classification) (Logistic Regression classification)
The region of input space assigned to the green class is too small and so
most of the points from this class are misclassified.
Fisher’s Linear Discriminant: Motivation
• Why do we need it?
Question: How difficult are these transformations to figure out?
Image source: [Link]
Fisher’s Linear Discriminant
Ronald A. Fisher
• View classification in terms of dimensionality reduction
• Project D-dimensional input vector x into one
dimension using: y = wTx
• Place a threshold on y: classify y >= -w0 as class C1, else as class C2
• We get a standard linear classifier
• Classes well separated in D-dimensional space may strongly overlap in 1 dimension
• Adjust the components of the weight vector w:
• Select the projection that maximizes class separation
FLD seeks to maximize the ratio of between-class variance to within-class
variance, thus maximizing class discrimination.
An illustration of Fisher’s LDF
(Projection onto the line joining (Projection based on Fisher’s
the class means) Linear discriminant function)
What is the degree of class overlap? Is the class separation improved?
Maximizing Mean Separation
• Let us consider a two-class problem with N1 points of C1
class and N2 points of C2 class
• Mean vectors: m1 = (1/N1) Σ n∈C1 xn and m2 = (1/N2) Σ n∈C2 xn
• Choose w to best separate the class means:
• Maximize m2 - m1 = wT(m2 - m1), where mk = wT mk is the mean of the projected data from class Ck
• This can be made arbitrarily large simply by increasing the magnitude of w:
• So constrain w to be of unit length, i.e. Σi wi² = 1
• Using a Lagrange multiplier to maximize subject to this constraint, we find w ∝ (m2 - m1)
There is still a problem with this approach…
Illustration of the problem
Image source: [Link]
After projection, the data exhibit some class overlap, shown by the yellow ellipse in the plot.
This difficulty arises from the strongly non-diagonal co-variances of the
class distributions.
Minimizing Variance and Optimizing
• Project D-dimensional input vector x into one dimension
using: yn = wTxn
• The within-class variance of the transformed data from class Ck is given by:
sk² = Σ n∈Ck (yn - mk)²
• Total within-class variance for the whole dataset is: s1² + s2²
• Fisher’s criterion: J(w) = (m2 - m1)² / (s1² + s2²)
Rewriting to make the dependence on w explicit:
J(w) = (wT SB w) / (wT SW w)
where SB = (m2 - m1)(m2 - m1)T is the between-class covariance matrix and
SW = Σ n∈C1 (xn - m1)(xn - m1)T + Σ n∈C2 (xn - m2)(xn - m2)T is the within-class covariance matrix.
Differentiating with respect to w, J(w) is maximized when:
(wT SB w) SW w = (wT SW w) SB w
Dropping scalar factors, noting that SB w is in the same direction as (m2 - m1), and multiplying both sides by SW⁻¹:
w ∝ SW⁻¹ (m2 - m1)    (Fisher’s linear discriminant)
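A minimal two-class NumPy sketch of w ∝ SW⁻¹ (m2 - m1), solving the linear system instead of forming an explicit inverse; the function name is illustrative:

import numpy as np

def fisher_direction(X1, X2):
    # X1: (N1, D) samples from C1, X2: (N2, D) samples from C2
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class covariance S_W: sum over both classes of (x_n - m_k)(x_n - m_k)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)  # w ∝ S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)       # unit length for convenience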
Optimization of J(w)
Applying the quotient rule and setting the derivative of J(w) to zero, to maximize J(w):
SB w = λ SW w, i.e. SW⁻¹ SB w = λ w
This is a (generalized) eigenvalue problem, where λ is the eigenvalue and w is the eigenvector.
Problems: non-linear models; small sample sizes.
Illustration with Fisher’s LDF
Image source: [Link]
Fisher’s Linear Discriminant Functions: LDA on IRIS Dataset
Assignment 3
Ref: [Link]
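A starter sketch for this demonstration, assuming sklearn’s built-in Iris dataset (illustrative only, not the official assignment solution):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # project the 4-D iris features to 2-D
X_proj = lda.fit_transform(X, y)                  # Fisher projection maximizing class separation
print(X_proj.shape)                               # (150, 2)
print(lda.score(X, y))                            # training accuracy of the LDA classifier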
Thank you!