Supervised Learning: Linear Methods (1/2)

Applied Multivariate Statistics – Spring 2012


Overview

• Review: Conditional probability
• LDA / QDA: Theory
• Fisher's Discriminant Analysis
• LDA: Example
• Quality control: Test set and cross-validation
• Case study: Digit recognition
Conditional Probability

Sample space: all patients.
• T: medical test is positive; C: patient has cancer.
• (Marginal) probabilities: P(T), P(C).

Conditioning restricts the sample space:
• New sample space "people with cancer": the conditional probability P(T|C) of a positive test is typically large.
• New sample space "people with a positive test": the conditional probability P(C|T) of cancer is typically small.

Bayes' theorem connects the two:

$$P(C \mid T) = \frac{P(T \mid C)\,P(C)}{P(T)}$$

where P(C|T) is the posterior, P(C) the prior, and P(T|C) the class-conditional probability.
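A quick numerical illustration (hypothetical numbers, not from the slides): suppose P(C) = 0.01, P(T|C) = 0.9 and P(T|not C) = 0.05. Then

$$P(C \mid T) = \frac{0.9 \cdot 0.01}{0.9 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.15,$$

so even with a sensitive test (P(T|C) large), the posterior P(C|T) stays small because the prior P(C) is small.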
One approach to supervised learning

$$P(C \mid X) = \frac{P(C)\,P(X \mid C)}{P(X)} \propto P(C)\,P(X \mid C)$$

• Prior / prevalence P(C): the fraction of samples in that class.
• Class-conditional distribution: assume $X \mid C \sim N(\mu_C, \Sigma_C)$ and find some estimate of its parameters.

Bayes rule: choose the class for which P(C|X) is maximal (this rule is "optimal" if all types of error are equally costly).

Special case with two classes (0/1):
• choose c = 1 if P(C=1|X) > 0.5, or equivalently
• choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1.

In practice: estimate $P(C)$, $\mu_C$, $\Sigma_C$ from the training data, as sketched below.
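A minimal sketch of this rule in R (hypothetical one-dimensional data; all names are illustrative, not from the slides):

set.seed(1)
x0 <- rnorm(50, mean = 0)  # training samples, class 0
x1 <- rnorm(30, mean = 2)  # training samples, class 1

# Estimate prior, mean and standard deviation per class
prior <- c(length(x0), length(x1)) / (length(x0) + length(x1))
mu    <- c(mean(x0), mean(x1))
s     <- c(sd(x0), sd(x1))

# Bayes rule: maximize P(C) * P(x|C); the common factor 1/P(x) is dropped
classify <- function(x) {
  post <- prior * dnorm(x, mean = mu, sd = s)
  which.max(post) - 1  # returns the class label, 0 or 1
}
classify(0.5)  # -> 0: closer to the class-0 mean, and class 0 is more prevalent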
QDA: Doing the math

For $X \mid C \sim N(\mu_C, \Sigma_C)$ the class-conditional density is

$$P(X \mid C) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma_C|}} \exp\left(-\tfrac{1}{2}(x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C)\right)$$

• $P(C \mid X) \propto P(C)\,P(X \mid C)$
• Use the fact that maximizing $P(C \mid X)$ is the same as maximizing $\log P(C \mid X)$. Up to a constant c that does not depend on the class,

$$\delta_C(x) = \log P(C) + \log P(X \mid C) = \log P(C) - \tfrac{1}{2}\log|\Sigma_C| - \tfrac{1}{2}(x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C) + c$$

The three terms are the prior, an additional (covariance) term, and the squared Mahalanobis distance.

• Choose the class for which $\delta_C(x)$ is maximal; a small sketch follows below.

• Special case with two classes: the decision boundary, i.e. the values of x where $\delta_0(x) = \delta_1(x)$, is quadratic in x.

⇒ Quadratic Discriminant Analysis (QDA)
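The discriminant score is easy to write down directly; here is a small R sketch (the function name and arguments are my own, not from the slides):

# delta_C(x): log prior - 0.5*log|Sigma| - 0.5 * squared Mahalanobis distance
qda_score <- function(x, prior, mu, Sigma) {
  d <- x - mu
  as.numeric(log(prior)
             - 0.5 * determinant(Sigma, logarithm = TRUE)$modulus
             - 0.5 * t(d) %*% solve(Sigma) %*% d)
}
# Classify x to the class with the largest qda_score(x, ...).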
Simplification

• Assume the same covariance matrix in all classes, i.e. fix Σ for all classes:

$$X \mid C \sim N(\mu_C, \Sigma)$$

• The discriminant score then simplifies, since $\log|\Sigma|$ is the same for every class and can be absorbed into the constant:

$$\delta_C(x) = \log P(C) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + c = \log P(C) - \tfrac{1}{2}(x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + d$$

i.e. the prior plus the squared Mahalanobis distance. Expanding the quadratic and dropping the class-independent term $-\tfrac{1}{2}x^T \Sigma^{-1} x$ gives

$$\left(= \log P(C) + x^T \Sigma^{-1} \mu_C - \tfrac{1}{2}\mu_C^T \Sigma^{-1} \mu_C + e\right)$$

• The decision boundary is linear in x.

⇒ Linear Discriminant Analysis (LDA)

[Illustration: a point lies at equal physical (Euclidean) distance from the means of classes 0 and 1; assuming equal priors, it is classified to class 0, since its Mahalanobis distance to class 0 is smaller.]
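The picture's point can be reproduced numerically; a minimal sketch in base R (the numbers are hypothetical):

# Equal Euclidean distances, very different Mahalanobis distances
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)  # shared, correlated covariance
mu0 <- c(1, 1); mu1 <- c(1, -1)               # class means
x   <- c(0, 0)                                # point to classify

sqrt(sum((x - mu0)^2))      # Euclidean distance to class 0: 1.41
sqrt(sum((x - mu1)^2))      # Euclidean distance to class 1: 1.41
mahalanobis(x, mu0, Sigma)  # squared Mahalanobis distance: ~1.1
mahalanobis(x, mu1, Sigma)  # squared Mahalanobis distance: 10
# With equal priors, LDA assigns x to class 0.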
LDA vs. QDA

LDA:
+ Only few parameters to estimate; accurate estimates
− Inflexible (linear decision boundary)

QDA:
− Many parameters to estimate; less accurate estimates
+ More flexible (quadratic decision boundary)
Fisher's Discriminant Analysis: Idea

Find the direction(s) in which the groups are separated best.

• Class Y, predictors $X = (X_1, \dots, X_d)$; project the data onto a direction w:

$$U = w^T X$$

• Find w so that the groups are separated best along U. In analogy to the 1st Principal Component (direction of largest variance), this direction is called the 1st Linear Discriminant (= 1st Canonical Variable).
• Measure of separation: the Rayleigh coefficient

$$J(w) = \frac{D(U)}{Var(U)}, \qquad D(U) = \left(E[U \mid Y=0] - E[U \mid Y=1]\right)^2$$

• With $E[X \mid Y=j] = \mu_j$ and $Var(X \mid Y=j) = \Sigma$, this gives $E[U \mid Y=j] = w^T \mu_j$ and $Var(U) = w^T \Sigma w$.
• The concept extends to more than two groups.

[Illustration: J(w) is large when the projected group means are far apart (large D(U)) relative to the within-group spread Var(U), and small otherwise.]
LDA and Linear Discriminants

• Direction with the largest J(w): 1st Linear Discriminant (LD 1)
• Orthogonal to LD 1, again with the largest J(w): LD 2; etc.
• At most min(number of dimensions, number of groups − 1) LDs; e.g., for 3 groups in 10 dimensions we need 2 LDs.
• Computed using an Eigenvalue Decomposition or a Singular Value Decomposition.
• Proportion of trace: the percentage of the variance between the group means captured by each LD.
• R: the function «lda» in package MASS does LDA and computes the linear discriminants («qda» is also available); see the sketch below.
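A minimal usage sketch (assuming the MASS package is installed; the iris data anticipates the next slide):

library(MASS)

# 3 groups in 4 dimensions, hence at most 2 linear discriminants
fit <- lda(Species ~ ., data = iris)
fit  # the printout includes "Proportion of trace:" for LD1 and LD2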
Example: Classification of Iris flowers

Three species: Iris setosa, Iris versicolor, Iris virginica.
[Figure: one photograph of each species.]

Classify according to sepal/petal length/width.
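Continuing the sketch above, prediction with the fitted model works as follows (class, posterior and x are the components documented for MASS):

pred <- predict(fit, iris)
head(pred$class)      # predicted species
head(pred$posterior)  # posterior probabilities P(C|X)
head(pred$x)          # coordinates on the linear discriminants LD1, LD2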
Quality of classification

• Using the training data also as test data leads to overfitting: the estimated error is too optimistic for the error on new data.
• Separate test data: split the rows of the data set into a training part, used to fit the classifier, and a test part, used only for evaluation.
• Cross-validation (CV; e.g., "leave-one-out" cross-validation): every row is the test case once, with the rest used as training data.
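In R, «lda» can do leave-one-out cross-validation directly via the CV argument; a minimal sketch on iris:

# Each sample is predicted from a model fitted to all remaining samples
cv <- lda(Species ~ ., data = iris, CV = TRUE)
head(cv$class)  # leave-one-out predicted classes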
Measures for prediction error

• Confusion matrix (e.g., 100 samples):

               Truth = 0   Truth = 1   Truth = 2
Estimate = 0       23           7           6
Estimate = 1        3          27           4
Estimate = 2        3           1          26

• Error rate:
1 − sum(diagonal entries) / (number of samples) = 1 − 76/100 = 0.24
• We expect our classifier to predict 24% of new observations incorrectly (this is just a rough estimate).
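With the leave-one-out predictions from the previous sketch, the same quantities can be computed in base R:

tab <- table(Estimate = cv$class, Truth = iris$Species)  # confusion matrix
tab
1 - sum(diag(tab)) / sum(tab)                            # error rate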
Example: Digit recognition

• 7129 hand-written digits [figure: a sample of the digits]
• Each (centered) digit was put in a 16×16 grid.
• Measure the grey value in each part of the grid, i.e. 256 grey values per digit. [figure: example with an 8×8 grid]
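This representation is just a flattened matrix; a hypothetical sketch of the feature extraction in R (random values stand in for a real scanned digit):

img <- matrix(runif(16 * 16), nrow = 16)  # stand-in for one 16x16 digit image
x   <- as.vector(img)                     # 256 grey values as a feature vector
length(x)  # 256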


Concepts to know

• Idea of LDA / QDA
• Meaning of linear discriminants
• Cross-validation
• Confusion matrix, error rate
R functions to know

• lda (and qda)
