Linear Discriminant Analysis Overview

The document discusses Linear Discriminant Analysis (LDA) and its applications in various fields, such as face recognition and disease classification. It explains the mathematical foundations of LDA, including the formulation of discriminant functions, decision boundaries, and the optimization of class separation using Fisher's criterion. Additionally, it highlights the implementation of LDA in Python and the challenges associated with least squares methods in classification tasks.


Birla Institute of Technology and Science Pilani, Hyderabad Campus

27.09.2024

BITS F464: Machine Learning (1st Sem 2024-25)


LINEAR DISCRIMINANT ANALYSIS
Chittaranjan Hota, Sr. Professor
Dept. of Computer Sc. and Information Systems
hota@[Link]
Linear Discriminant Functions: Applications

• Fisher’s Linear Discriminant Analysis for reducing the number of features required for face recognition.
• Classifying a patient’s disease state as Mild, Moderate, or Severe.
• Identifying the type of customers who might buy a particular product.
Linear Discriminant Functions
• Used to discriminate between two or more classes based on
a set of predictor variables.

x = [x1, …, xD]T → Ck

Discriminant Function

• Learns the mapping between a feature vector and class labels.

• Does it not also create a decision boundary? (Yes: the surface where the discriminant value is zero.)

Hyperplanes:
(in 1-D, a point/threshold) (in 2-D, a line) (in 3-D, a plane)

Logistic Regression may be unstable for well-separated classes and few examples. Why? (With perfectly separable data, the maximum-likelihood weights grow without bound.)
Two-class Linear Discriminant Functions (K=2)

• y(x) = wT x + w0 = Σi wi xi + w0

Where w is the weight vector and w0 is the bias. The negative of the bias (i.e., -w0) is sometimes called the threshold.

• Decide C1 if y(x) > 0, and C2 if y(x) < 0.
• If w0 = 0, then the hyperplane will pass through what? (The origin.)
• w determines the orientation of the decision surface; w0 determines its location.

(geometry of an LDF in 2 dimensions)

Distance of Origin to Decision Surface (Bias: w0)

• Let xA and xB be two points that lie on the decision surface, so y(xA) = y(xB) = 0.

wT xA + w0 = wT xB + w0 = 0  ⟹  wT (xA – xB) = 0

• xA – xB is an arbitrary vector parallel to the decision surface. Hence w is orthogonal to every vector lying on the decision surface: w determines its orientation.

• If x is a point on the decision surface, y(x) = 0 ⟹ wT x = -w0. So the normal distance from the origin to the decision surface is:

wT x / ||w|| = -w0 / ||w||
Distance of a point ‘x’ to the Decision surface ( r )

• Let x be an arbitrary point and x⊥ its orthogonal projection onto the decision surface. By vector addition:

x = x⊥ + r (w / ||w||)

• The second term is a normal vector to the decision surface, collinear with w. As ||w / ||w|| || = 1, we scale it by the signed distance r.

• As y(x⊥) = 0 and wT w = ||w||2:

y(x) = wT x + w0 = wT (x⊥ + r w/||w||) + w0 = wT x⊥ + w0 + r (wT w)/||w|| = 0 + r ||w||

⟹ r = y(x) / ||w||
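The result r = y(x) / ||w|| can be checked numerically; the sketch below uses an arbitrary illustrative hyperplane (w, w0, and the test point x are made-up values):

```python
# Numeric check of the signed-distance formula r = y(x) / ||w||.
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -5.0                  # illustrative bias
x = np.array([2.0, 6.0])   # illustrative test point

y_x = w @ x + w0                       # y(x) = wT x + w0
r = y_x / np.linalg.norm(w)            # signed distance from the formula

# Geometric check: x - r * w/||w|| is the projection x_perp,
# which must lie on the decision surface, i.e. y(x_perp) = 0.
x_perp = x - r * w / np.linalg.norm(w)
print("r =", r)                        # 5.0
print("y(x_perp) =", w @ x_perp + w0)  # ~0
```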
Multi-class Linear Discriminant Functions (K>2)
Approach 1: By combining a number of two-class discriminant functions.

• One-versus-the-rest: K – 1 classifiers, with each one separating points in a particular class Ck from points not in that class.
• One-versus-one (the alternative): K(K – 1)/2 classifiers, with one for every possible pair of classes. Each point is classified according to a majority vote.

Both schemes leave ambiguous regions of input space (the “?” regions in the figure) where the outputs conflict.
Another Example…

bc + xpT wc > 0
bj + xpT wj < 0,  j = 1…c, j ≠ c

y = argmax j=1,…,c (bj + xT wj)

(choose the class whose discriminant attains the global maximum)
Solution: Using K-discriminant functions
• Building a single K-class discriminant comprising K-linear
functions of the form:
• yk (x) = wkT x + wk0

• Then, assigning a point x to class Ck if yk(x) > yj(x) ∀ j ≠ k.

• The decision boundary between Ck and Cj : yk(x) = yj(x)

• Defined by: (wk – wj)T x + (wk0 – wj0) = 0, the same form as the 2-class case.


• Hence, same geometrical properties apply.
Decision regions of such discriminants are always singly
connected and convex.

Proof of Convexity Next…


Proof of Convexity of Decision Region

Consider two points xA and xB that both lie inside region Rk, and any point x̂ on the line segment joining them:

x̂ = λ xA + (1 – λ) xB,  where 0 ≤ λ ≤ 1

From the linearity of the discriminant functions:

yk(x̂) = λ yk(xA) + (1 – λ) yk(xB)

As xA and xB lie inside Rk, yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k. Hence yk(x̂) > yj(x̂), so x̂ must also lie in Rk: Rk is singly connected and convex.
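A quick numeric illustration of this argument, with arbitrary illustrative weights for a 3-class linear discriminant: if two points fall in the same decision region, every convex combination of them does too.

```python
# Convexity check for the decision regions of a linear multi-class discriminant.
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows: w_k for 3 classes
b = np.array([0.0, 0.0, 0.5])                          # biases w_k0

def region(x):
    """Index k of the region containing x: argmax_k (w_k^T x + w_k0)."""
    return int(np.argmax(W @ x + b))

xA, xB = np.array([3.0, 1.0]), np.array([5.0, 2.0])
assert region(xA) == region(xB) == 0                   # both in R_0

for lam in np.linspace(0, 1, 11):
    x_hat = lam * xA + (1 - lam) * xB
    assert region(x_hat) == 0                          # stays in R_0
print("all convex combinations stay in the same region")
```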


Multi-class Classification using LDA (sklearn)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
[Link](X_train, y_train)

(features: Alcohol, magnesium, hue, proline, …)

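The snippet above is truncated; a complete runnable sketch follows, assuming the slide's example uses the Wine dataset (its features include alcohol, magnesium, hue, and proline, matching those listed):

```python
# Multi-class LDA with sklearn on the Wine dataset (3 classes, 13 features).
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("test accuracy:", lda.score(X_test, y_test))
```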

Least Squares for Classification
• Straightforward way to adapt regression techniques for
classification tasks.
• How do we compute y(x) and the weights w0, w1, w2, …, wD?
• Each class Ck is described by its own linear model:
• yk(x) = wkT x + wk0 (where x and wk have D dimensions each)
• We can group these together using a vector notation: y(x) = W̃T x̃, where x̃ = (1, xT)T is the augmented input vector and W̃ is a parameter matrix whose kth column is the (D+1)-dimensional vector w̃k = (wk0, wkT)T.
• A new input x is then assigned to the class for which the output yk(x) is largest. Get W̃ by minimizing the sum-of-squares error.
An example classification using LDF
• Suppose we have a dataset of two classes: Class A and Class B. Each data point has two
features, x1 and x2. Our task is to classify new data points into either Class A or Class B using
a linear discriminant function.
• Class A (positive class): XA = {(2,3), (3,3), (4,5), (5,6)} & Class B (negative class): XB =
{(1,1),(2,2),(3,1),(4,2)}
• Step 1: Define the Linear discriminant function:
g(x) = w1x1 + w2x2 + w0. Decision rule: If g(x) > 0, classify as Class A, else Class B.
• Step 2: Train the classifier (e.g., by least squares):
Suppose, after training, we get the LDF as: g(x) = 2x1 + 3x2 – 15
• Step 3: Classify new points:
x = (3,4) ⟹ g(3,4) = 2·3 + 3·4 – 15 = 3
As g(3,4) = 3 > 0, we classify it as Class A.
What class will the point (2,1) belong to?
• Decision boundary:
g(x) = 0 ⟹ x2 = (-2/3)·x1 + 5
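The worked example, including the answer to the question about the point (2,1), can be sketched in a few lines:

```python
# The trained LDF from the example: g(x) = 2*x1 + 3*x2 - 15.
# Classify as Class A when g(x) > 0, else Class B.
def g(x1, x2):
    """Linear discriminant function from the example."""
    return 2 * x1 + 3 * x2 - 15

def classify(x1, x2):
    return "Class A" if g(x1, x2) > 0 else "Class B"

print(g(3, 4), classify(3, 4))  # 3 -> Class A, as in Step 3
print(g(2, 1), classify(2, 1))  # -8 -> Class B, answering the question
```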
Minimizing sum-of-squares error func
Let there be a training dataset {xn, tn}, where n = 1, …, N and tn is a 1-of-K target vector.
Define a matrix T whose nth row is the vector tnT, and a matrix X̃ whose nth row is x̃nT.
Then, the sum-of-squares error function can be written as:

ED(W̃) = ½ Tr{ (X̃W̃ – T)T (X̃W̃ – T) }

Multiplying a matrix with its transpose results in a square matrix; taking the trace of this square matrix (the sum of the elements on the main diagonal) turns the error into a scalar.
LSE Computation: An Example
Minimizing Sum-of-Squares

To minimize the error, set the derivative with respect to W̃ equal to zero:

X̃T (X̃W̃ – T) = 0

Solving for W̃:

W̃ = (X̃T X̃)-1 X̃T T = X̃† T

where X̃† is the pseudoinverse of X̃ (the pseudoinverse solution).
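The pseudoinverse solution above can be sketched directly in NumPy; the two-class Gaussian data below is illustrative, standing in for the slides' synthetic examples:

```python
# Least-squares classification: W = pinv(X_aug) @ T with 1-of-K targets,
# then predict by argmax over the K linear outputs.
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian classes (synthetic, illustrative data)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),
               rng.normal([4, 4], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_aug = np.hstack([np.ones((100, 1)), X])  # augmented inputs (bias column)
T = np.eye(2)[y]                           # 1-of-K target matrix
W = np.linalg.pinv(X_aug) @ T              # pseudoinverse solution

pred = np.argmax(X_aug @ W, axis=1)
print("training accuracy:", (pred == y).mean())
```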
Least-squares: highly sensitive to outliers

Magenta: Least squares, Green: Logistic regression


Least squares: more severe problems
Least squares is unsuitable for certain datasets (here: 3 classes, 2-D space, synthetic data).

(Least-squares classification) (Logistic Regression classification)

The region of input space assigned to the green class is too small and so
most of the points from this class are misclassified.
Fisher’s Linear Discriminant: Motivation
• Why do we need it?

Question: How difficult are these transformations to figure out?


Image source: [Link]
Fisher’s Linear Discriminant
Ronald A. Fisher
• View classification in terms of dimensionality reduction
• Project the D-dimensional input vector x into one dimension using: y = wTx
• Place a threshold on y to classify: y ≥ -w0 as class C1, else class C2
• We get a standard linear classifier
• Classes well separated in the D-dimensional space may strongly overlap in 1 dimension
• Adjust the components of the weight vector w
• Select the projection that maximizes class separation
FLD seeks to maximize the ratio of between-class variance to within-class
variance, thus maximizing class discrimination.
An illustration of Fisher’s LDF

(Left: projection onto the line joining the class means. Right: projection based on Fisher’s linear discriminant function.)

What is the degree of class overlap? Is the class separation improved?


Maximizing Mean Separation
• Let us consider a two-class problem with N1 points of C1
class and N2 points of C2 class
• Mean vectors: mk = (1/Nk) Σn∈Ck xn

• Choose w to best separate the class means:

• Maximize m2 – m1 = wT(m2 – m1), where mk = wTmk is the mean of the projected data from class Ck
• This can be made arbitrarily large simply by increasing the magnitude of w:
• So we constrain w to be of unit length, i.e. Σi wi2 = 1
• Using a Lagrange multiplier to maximize subject to this constraint gives w ∝ (m2 – m1)
There is still a problem with this approach…
Illustration of the problem

Image source: [Link]

After projection, the data exhibit some class overlap, shown by the yellow ellipse on the plot. This difficulty arises from the strongly non-diagonal covariances of the class distributions.
Minimizing Variance and Optimizing
• Project the D-dimensional input vector x into one dimension using: yn = wTxn
• The within-class variance of the transformed data from class Ck is given by:
sk2 = Σn∈Ck (yn – mk)2
• Total within-class variance for the whole dataset is: s12 + s22
• Fisher’s criterion: J(w) = (m2 – m1)2 / (s12 + s22)
Rewriting (to make the dependence on w explicit):
J(w) = (wT SB w) / (wT SW w)
where SB = (m2 – m1)(m2 – m1)T is the between-class covariance matrix and SW = Σn∈C1 (xn – m1)(xn – m1)T + Σn∈C2 (xn – m2)(xn – m2)T is the within-class covariance matrix.
Differentiating with respect to w, J(w) is maximized when:
(wT SB w) SW w = (wT SW w) SB w
Dropping scalar factors, noting that SB w is always in the direction of m2 – m1, and multiplying both sides by SW-1:
w ∝ SW-1 (m2 – m1)  (Fisher’s linear discriminant)
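The closed form w ∝ SW-1(m2 – m1) is short enough to compute from scratch; the sketch below uses illustrative synthetic 2-D data:

```python
# Fisher's linear discriminant for two classes: w = Sw^{-1} (m2 - m1).
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], [1.0, 0.3], (100, 2))  # class C1 (synthetic)
X2 = rng.normal([3, 1], [1.0, 0.3], (100, 2))  # class C2 (synthetic)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of per-class centered outer products
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(Sw, m2 - m1)               # Fisher direction
w /= np.linalg.norm(w)

# The projected class means are separated along w
print("projected mean gap:", w @ (m2 - m1))
```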
Optimization of J(w)

Applying the quotient rule to J(w) = (wT SB w)/(wT SW w) and setting the derivative to zero:

(wT SB w) SW w = (wT SW w) SB w

To maximize J(w): SB w = λ SW w

This is a generalized eigenvalue problem, where λ is the eigenvalue and w is the eigenvector.

Problems with Fisher’s LDF: non-linear models (the method only finds linear projections) and the small-sample-size problem (SW becomes singular when there are fewer samples than dimensions).

Illustration with Fisher’s LDF
Image source: [Link]
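The eigenvalue view can be checked against the closed form: the top eigenvector of SW-1 SB coincides (up to sign and scale) with SW-1(m2 – m1). The data below is illustrative:

```python
# Solving Sb w = lambda Sw w via the top eigenvector of Sw^{-1} Sb.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], 1.0, (80, 2))  # synthetic class C1
X2 = rng.normal([4, 2], 1.0, (80, 2))  # synthetic class C2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

Sb = np.outer(m2 - m1, m2 - m1)                          # between-class scatter
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter

vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])       # top eigenvector

# Agreement (up to sign/scale) with the closed form Sw^{-1}(m2 - m1)
w_closed = np.linalg.solve(Sw, m2 - m1)
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print("alignment with closed form:", cos)  # ~1.0
```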
Fisher’s Linear Discriminant Functions: LDA on Iris Dataset

Assignment 3
Ref: [Link]
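A hedged sketch of the "LDA on Iris dataset" demo referenced above, reducing the 4-D Iris features to the (at most) K – 1 = 2 discriminant axes:

```python
# LDA on the Iris dataset: project to 2 discriminant axes and classify.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)         # at most K-1 = 2 components

print("projected shape:", X_proj.shape)  # (150, 2)
print("training accuracy:", lda.score(X, y))
```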
Thank you!
