Supervised Learning: Linear Methods (1/2)
Applied Multivariate Statistics – Spring 2012
Overview
Review: Conditional Probability
LDA / QDA: Theory
Fisher’s Discriminant Analysis
LDA: Example
Quality control: Test set and cross-validation
Case study: Digit recognition
Conditional Probability

(Figure: Venn diagram of the sample space.)
- T: medical test is positive; C: patient has cancer
- Marginal probability: P(T), P(C) — relative to the full sample space
- Conditional probability: P(T|C), P(C|T) — relative to a new, restricted
  sample space: people with cancer for P(T|C), people with a positive test
  for P(C|T)
- P(T|C) can be large while P(C|T) is small
- Bayes’ theorem:

      P(C|T) = P(T|C) P(C) / P(T)

  where P(C|T) is the posterior, P(C) the prior, and P(T|C) the class
  conditional probability
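A quick numeric illustration of Bayes’ theorem; the probabilities below are invented for illustration (a rare disease and a sensitive test), not taken from the lecture:

```python
# Bayes' theorem: P(C|T) = P(T|C) * P(C) / P(T)
p_C = 0.01         # prior: prevalence of cancer
p_T_given_C = 0.9  # class conditional: positive test given cancer
p_T = 0.05         # marginal probability of a positive test

p_C_given_T = p_T_given_C * p_C / p_T  # posterior
print(round(p_C_given_T, 2))  # → 0.18
```

Note how P(T|C) = 0.9 is large while the posterior P(C|T) = 0.18 is small.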
One approach to supervised learning

Bayes’ theorem for a class C and predictors X:

      P(C|X) = P(C) P(X|C) / P(X)  ∝  P(C) P(X|C)

- Prior / prevalence P(C): estimate as the fraction of samples in that class
- Class conditional P(X|C): assume X|C ~ N(μ_C, Σ_C)
- Bayes rule: choose the class where P(C|X) is maximal
  (this rule is “optimal” if all types of error are equally costly)
- Special case, two classes (0/1):
  - choose c = 1 if P(C=1|X) > 0.5, or equivalently
  - choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1
- In practice: estimate P(C), μ_C, Σ_C from the training data
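The Bayes rule above can be sketched for two classes with one-dimensional Gaussian class conditionals; the means, variances, and priors below are invented for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def classify(x, priors, mus, sigma2s):
    """Bayes rule: choose the class c maximizing P(C=c) * P(X=x | C=c)."""
    scores = [p * gaussian_pdf(x, m, s2)
              for p, m, s2 in zip(priors, mus, sigma2s)]
    return max(range(len(scores)), key=lambda c: scores[c])

# Hypothetical classes: class 0 ~ N(0, 1), class 1 ~ N(3, 1), equal priors
print(classify(0.5, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]))  # → 0
print(classify(2.9, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]))  # → 1
```

With equal priors and equal variances the decision boundary lies halfway between the two means, at x = 1.5.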
QDA: Doing the math…

The class conditional density of X|C ~ N(μ_C, Σ_C) is

      P(X|C) = 1 / sqrt((2π)^d |Σ_C|) · exp(−½ (x − μ_C)^T Σ_C^{-1} (x − μ_C))

P(C|X) ∝ P(C) P(X|C). Use the fact that maximizing P(C|X) is the same as
maximizing log P(C|X):

      δ_c(x) = log P(C|X) = log P(C) + log P(X|C) + const
             = log P(C) − ½ log|Σ_C| − ½ (x − μ_C)^T Σ_C^{-1} (x − μ_C) + const

        (prior)  (additional term)   (squared Mahalanobis distance)

Choose the class where δ_c(x) is maximal.

Special case, two classes: the decision boundary — the set of x where
δ_0(x) = δ_1(x) — is quadratic in x:
Quadratic Discriminant Analysis (QDA)
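The discriminant score δ_c(x) can be written down directly; a numpy-based sketch (the class parameters are invented, and this is an illustration, not how R’s qda is implemented):

```python
import numpy as np

def qda_delta(x, mu, Sigma, prior):
    """delta_c(x) = log P(C) - 1/2 log|Sigma_C| - 1/2 (x-mu_C)' Sigma_C^{-1} (x-mu_C)."""
    diff = x - mu
    mahal2 = diff @ np.linalg.solve(Sigma, diff)  # squared Mahalanobis distance
    return np.log(prior) - 0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * mahal2

# Two hypothetical classes with different covariances, equal priors
mu0, S0 = np.array([0.0, 0.0]), np.eye(2)
mu1, S1 = np.array([2.0, 2.0]), 4 * np.eye(2)
x = np.array([1.0, 1.0])

d0 = qda_delta(x, mu0, S0, prior=0.5)
d1 = qda_delta(x, mu1, S1, prior=0.5)
print(0 if d0 > d1 else 1)  # → 0
```

The point (1, 1) is equally far from both means in Euclidean distance, but the extra −½ log|Σ_C| term and the covariance-scaled distance tip the decision.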
Simplification

Assume the same covariance matrix in all classes, i.e.

      X|C ~ N(μ_C, Σ)      (Σ fixed for all classes)

      δ_c(x) = log P(C) − ½ log|Σ| − ½ (x − μ_C)^T Σ^{-1} (x − μ_C) + const
             = log P(C) − ½ (x − μ_C)^T Σ^{-1} (x − μ_C) + const
            (= log P(C) + x^T Σ^{-1} μ_C − ½ μ_C^T Σ^{-1} μ_C + const)

        (prior)   (squared Mahalanobis distance)

(The term −½ log|Σ| is now the same for every class and can be dropped;
expanding the square, the term −½ x^T Σ^{-1} x is also class independent.)

The decision boundary is linear in x:
Linear Discriminant Analysis (LDA)

(Figure: two classes 0 and 1 with equal priors; a point at equal physical
distance in space from both class means is classified to class 0, since its
Mahalanobis distance to class 0 is smaller.)
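With a shared Σ the score is linear in x, which a short sketch makes explicit (covariance and means invented for illustration):

```python
import numpy as np

def lda_delta(x, mu, Sigma, prior):
    """delta_c(x) = log P(C) + x' Sigma^{-1} mu_C - 1/2 mu_C' Sigma^{-1} mu_C."""
    Sinv_mu = np.linalg.solve(Sigma, mu)  # Sigma^{-1} mu_C, shared Sigma
    return np.log(prior) + x @ Sinv_mu - 0.5 * mu @ Sinv_mu

# One shared covariance matrix, two hypothetical class means
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
x = np.array([0.5, 0.2])

print(0 if lda_delta(x, mu0, Sigma, 0.5) > lda_delta(x, mu1, Sigma, 0.5) else 1)  # → 0
```

Because δ_c(x) is affine in x, the set where two scores are equal is a hyperplane — the linear decision boundary.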
LDA vs. QDA

LDA:
+ only few parameters to estimate; accurate estimates
− inflexible (linear decision boundary)

QDA:
− many parameters to estimate; less accurate estimates
+ more flexible (quadratic decision boundary)
Fisher’s Discriminant Analysis: Idea

Find the direction(s) in which the groups are separated best.

- Class Y, predictors X = (X_1, …, X_d); project onto U = w^T X
  (analogous to the 1st Principal Component: here U is called the
  1st Linear Discriminant = 1st Canonical Variable)
- Find w so that the groups are separated best along U
- Measure of separation: Rayleigh coefficient

      J(w) = D(U) / Var(U),   where D(U) = (E[U|Y=0] − E[U|Y=1])²

- With E[X|Y=j] = μ_j and Var(X|Y=j) = Σ:
  E[U|Y=j] = w^T μ_j and Var(U) = w^T Σ w
- The concept is extendable to many groups

(Figure: two projections of the same groups — one where D(U) is large
relative to Var(U), i.e. J(w) large, and one where J(w) is small.)
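For two groups the maximizer of the Rayleigh coefficient has the closed form w ∝ Σ⁻¹(μ₀ − μ₁); a sketch with sample estimates on invented Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical Gaussian groups with a shared (identity) covariance
X0 = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X1 = rng.normal([2.0, 1.0], 1.0, size=(200, 2))

# Pooled within-group covariance estimate (both groups equally sized)
S = 0.5 * (np.cov(X0.T) + np.cov(X1.T))
# Fisher direction: w proportional to S^{-1} (mean0 - mean1)
w = np.linalg.solve(S, X0.mean(axis=0) - X1.mean(axis=0))

# Projected onto U = w'X, the group means separate well relative to Var(U)
U0, U1 = X0 @ w, X1 @ w
separation = abs(U0.mean() - U1.mean()) / np.sqrt(0.5 * (U0.var() + U1.var()))
print(separation > 2)
```

The standardized separation along w equals (up to estimation error) the Mahalanobis distance between the group means, here about √5 ≈ 2.24.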
LDA and Linear Discriminants

- Direction with the largest J(w): 1st Linear Discriminant (LD 1)
- Direction orthogonal to LD 1 with the largest J(w): LD 2
- etc.

At most min(number of dimensions, number of groups − 1) LDs exist;
e.g., for 3 groups in 10 dimensions only min(10, 2) = 2 LDs are needed.

The LDs are computed using an Eigenvalue Decomposition or a Singular
Value Decomposition.

“Proportion of trace”: the percentage of the variance between the group
means captured by each LD.

R: the function «lda» in package MASS does LDA and computes the linear
discriminants («qda» is also available).
Example: Classification of Iris flowers

(Figure: photos of the three species Iris setosa, Iris versicolor, and
Iris virginica.)

Classify according to sepal/petal length/width.
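The lecture carries out this example with «lda» from R’s MASS package; an analogous sketch in Python with scikit-learn (assuming that library is available) shows the same quantities:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 flowers, 4 predictors (sepal/petal length/width), 3 species
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()  # counterpart of R's MASS::lda
lda.fit(X, y)

# 3 groups in 4 dimensions -> min(4, 3 - 1) = 2 linear discriminants
print(lda.transform(X).shape)          # → (150, 2)
print(lda.explained_variance_ratio_)   # "proportion of trace" per LD
```

The two discriminant coordinates are what one would plot to visualize the group separation.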
Quality of classification

Using the training data also as test data leads to overfitting:
the estimated error on new data is too optimistic.

Better: separate test data.
(Figure: the data set split into a training part and a test part.)

Cross validation (CV; e.g. “leave-one-out” cross validation):
every row is the test case exactly once, with the remaining rows used as
training data.
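Leave-one-out CV can be sketched generically: each observation is held out once, the classifier is fit on the rest, and the errors are averaged. A nearest-class-mean rule stands in for the classifier here, and the tiny data set is invented:

```python
import numpy as np

def nearest_mean_predict(x, X_train, y_train):
    """Toy classifier: assign x to the class with the closest training mean."""
    classes = np.unique(y_train)
    means = [X_train[y_train == c].mean(axis=0) for c in classes]
    dists = [np.sum((x - m) ** 2) for m in means]
    return classes[int(np.argmin(dists))]

def loo_cv_error(X, y):
    """Leave-one-out CV: every row is the test case exactly once."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i  # all rows except i form the training set
        errors += nearest_mean_predict(X[i], X[mask], y[mask]) != y[i]
    return errors / len(X)

# Tiny invented data set: two well-separated classes
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(loo_cv_error(X, y))  # → 0.0
```

The same loop works for any classifier; only the fit/predict step changes.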
Measures for prediction error

Confusion matrix (e.g. 100 samples):

               Truth = 0   Truth = 1   Truth = 2
Estimate = 0      23           7           6
Estimate = 1       3          27           4
Estimate = 2       3           1          26

Error rate:

1 − sum(diagonal entries) / (number of samples) = 1 − 76/100 = 0.24

We expect our classifier to predict 24% of new observations incorrectly
(this is just a rough estimate).
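The error-rate computation from the confusion matrix above, as a one-liner:

```python
import numpy as np

# Confusion matrix from the slide: rows = estimate, columns = truth
conf = np.array([[23,  7,  6],
                 [ 3, 27,  4],
                 [ 3,  1, 26]])

# 1 - sum(diagonal entries) / (number of samples) = 1 - 76/100
error_rate = 1 - np.trace(conf) / conf.sum()
print(error_rate)  # → 0.24
```

The diagonal holds the correctly classified samples; everything off the diagonal is an error.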
Example: Digit recognition

7129 hand-written digits.
(Figure: a sample of digits.)

Each (centered) digit was put in a 16×16 grid; the grey value is measured
in each cell of the grid, i.e. 256 grey values per digit.
(Figure: example with an 8×8 grid.)
Concepts to know
Idea of LDA / QDA
Meaning of Linear Discriminants
Cross Validation
Confusion matrix, error rate
R functions to know
lda