Support Vector Machines Explained
Basic principles of classification

• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.
[Figure: boats and houses plotted by latitude and longitude. Some boats past the coast line are misclassified as houses, and unlabeled objects ("?") on the house side of the line are classified as houses.]
Main ideas of SVMs

[Figure: samples plotted by Gene X and Gene Y; a kernel maps the "Normal" and other samples so that they become separable.]
• If such a linear decision surface does not exist, the data are mapped into a much higher-dimensional space ("feature space") where the separating decision surface is found.
• The feature space is constructed via a very clever mathematical projection (the "kernel trick").
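A minimal sketch of this idea, assuming numpy is available: 1-D data that no threshold can separate becomes linearly separable after the (illustrative, hand-picked) mapping x → (x, x²).

```python
import numpy as np

# Toy 1-D data: class +1 sits between the two groups of class -1,
# so no single threshold (a "hyperplane" in 1-D) separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map each point into a 2-D feature space: x -> (x, x^2).
features = np.column_stack([x, x ** 2])

# In the feature space the classes ARE separable by a linear
# decision surface: the horizontal line x^2 = 2.
pred = np.where(features[:, 1] < 2.0, 1, -1)
separable = bool(np.all(pred == y))
```

Real kernels avoid computing `features` explicitly; this example only illustrates why a higher-dimensional space can make the classes linearly separable.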
Necessary mathematical concepts
How to represent samples geometrically?
Vectors in n-dimensional space (ℝⁿ)

• Assume that a sample/patient is described by n characteristics ("features" or "variables").
• Representation: every sample/patient is a vector in ℝⁿ with its tail at the point with all-zero coordinates and its arrow-head at the point given by the feature values.
• Example: consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as the vector (110, 29) in ℝ².
[Figure: the vector from (0, 0) to (110, 29) in the Systolic BP vs. Age plane.]
How to represent samples geometrically?
Vectors in n-dimensional space (ℝⁿ)

[Figure: Patients 1–4 shown as vectors in a 3-D space with axes including Age (years) and two lab measurements.]
Basic operations on vectors in ℝⁿ

1. Multiplication by a scalar
Consider a vector a = (a₁, a₂, ..., aₙ) and a scalar c.
Define: ca = (ca₁, ca₂, ..., caₙ)
When you multiply a vector by a scalar, you "stretch" it in the same or the opposite direction, depending on whether the scalar is positive or negative.
Example: for a = (1, 2), c = 2 gives ca = (2, 4), while c = −1 gives ca = (−1, −2).
Basic operations on vectors in ℝⁿ

2. Addition
Consider vectors a = (a₁, a₂, ..., aₙ) and b = (b₁, b₂, ..., bₙ).
Define: a + b = (a₁ + b₁, a₂ + b₂, ..., aₙ + bₙ)
Example: a = (1, 2) and b = (3, 0) give a + b = (4, 2). Recall the addition of forces in classical mechanics.
Basic operations on vectors in ℝⁿ

3. Subtraction
Consider vectors a = (a₁, a₂, ..., aₙ) and b = (b₁, b₂, ..., bₙ).
Define: a − b = (a₁ − b₁, a₂ − b₂, ..., aₙ − bₙ)
Example: a = (1, 2) and b = (3, 0) give a − b = (−2, 2). What vector do we need to add to b to get a? I.e., this is analogous to subtraction of real numbers.
Basic operations on vectors in ℝⁿ

4. Euclidean length (norm)
Define the length (norm) of a vector a: ||a|| = √(a₁² + a₂² + ... + aₙ²)
Basic operations on vectors in ℝⁿ

5. Dot product
Consider vectors a = (a₁, a₂, ..., aₙ) and b = (b₁, b₂, ..., bₙ).
Define the dot product: a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ = Σᵢ₌₁ⁿ aᵢbᵢ
Example: with b = (3, 0), the vector a = (1, 2) gives a · b = 3, while a = (0, 2) gives a · b = 0 (a and b are perpendicular).
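The vector operations above can be sketched with numpy (a minimal illustration using the example vectors from these slides):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

scaled = 2 * a                      # scalar multiplication: (2, 4)
summed = a + b                      # addition: (4, 2)
diff = a - b                        # subtraction: (-2, 2)
dot = float(a @ b)                  # dot product: 1*3 + 2*0 = 3
length = float(np.linalg.norm(a))   # Euclidean length: sqrt(1 + 4)
```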
Basic operations on vectors in ℝⁿ

Note that a · a = a₁² + a₂² + ... + aₙ² = ||a||², i.e., the dot product of a vector with itself equals its squared length.
Hyperplanes as decision surfaces

• A hyperplane is a linear decision surface that splits the space into two parts.
• A hyperplane is therefore a binary classifier.
• A hyperplane in ℝ² is a line; a hyperplane in ℝ³ is a plane.
[Figure: a line splitting the plane in 2-D, and a plane splitting the space in 3-D.]
An equation of a hyperplane is defined by a point P₀ (with position vector x₀) and a vector w perpendicular to the plane at that point: for any point P on the plane with position vector x,
  w · (x − x₀) = 0, i.e., w · x + b = 0 where b = −w · x₀.

Distance between two parallel hyperplanes w · x + b₁ = 0 and w · x + b₂ = 0:
Take a point x₁ on the first hyperplane and travel along w to a point x₂ = x₁ + tw on the second; the distance is then D = |t| ||w||.
  w · x₂ + b₂ = 0
  w · (x₁ + tw) + b₂ = 0
  w · x₁ + t ||w||² + b₂ = 0
Since w · x₁ + b₁ = 0, we have w · x₁ = −b₁, so
  −b₁ + t ||w||² + b₂ = 0
  t = (b₁ − b₂) / ||w||²
  D = |t| ||w|| = |b₁ − b₂| / ||w||
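The derivation above can be checked numerically (a small sketch with illustrative values of w, b₁, b₂, assuming numpy):

```python
import numpy as np

# Two parallel hyperplanes w·x + b1 = 0 and w·x + b2 = 0.
w = np.array([3.0, 4.0])      # ||w|| = 5
b1, b2 = 2.0, -8.0

# Closed-form distance from the derivation: D = |b1 - b2| / ||w||
D = abs(b1 - b2) / np.linalg.norm(w)

# Cross-check: start from a point x1 on the first hyperplane and
# travel along w by t = (b1 - b2) / ||w||^2 to land on the second.
x1 = np.array([0.0, -b1 / w[1]])       # satisfies w·x1 + b1 = 0
t = (b1 - b2) / np.dot(w, w)
x2 = x1 + t * w
on_second = bool(abs(np.dot(w, x2) + b2) < 1e-12)
```

Here |b₁ − b₂| / ||w|| = 10 / 5 = 2, and x₂ indeed lies on the second hyperplane.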
Recap
We know…
• How to represent patients (as “vectors”)
• How to define a linear decision surface (“hyperplane”)
We need to know…
• How to efficiently compute the hyperplane that separates
two classes with the largest “gap”?
Basics of optimization:
Quadratic programming (QP)
• Quadratic programming (QP) is a special
optimization problem: the function to optimize
(“objective”) is quadratic, subject to linear
constraints.
• Convex QP problems have convex objective
functions.
• These problems can be solved easily and efficiently
by greedy algorithms (because every local
minimum is a global minimum).
Basics of optimization:
Example QP problem

Consider x = (x₁, x₂).
Minimize ½ ||x||₂² subject to x₁ + x₂ − 1 = 0
(quadratic objective, linear constraint)
This is a QP problem, and in fact a convex QP, as we will see later.
Basics of optimization:
Example QP problem

[Figure: the bowl-shaped surface f(x₁, x₂) = ½(x₁² + x₂²) with the constraint line x₁ + x₂ − 1 = 0; the constrained minimum lies where the line touches the lowest contour of the bowl.]
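This particular QP has equality constraints only, so it can be solved exactly from the optimality (KKT) conditions with one linear solve — a minimal sketch, assuming numpy:

```python
import numpy as np

# Minimize 1/2 ||x||^2 subject to a·x = 1 with a = (1, 1).
# Lagrangian optimality: x + lam*a = 0 and a·x = 1, i.e. the
# linear KKT system  [I  a; a^T 0] [x; lam] = [0; 1].
a = np.array([1.0, 1.0])
K = np.zeros((3, 3))
K[:2, :2] = np.eye(2)
K[:2, 2] = a
K[2, :2] = a
rhs = np.array([0.0, 0.0, 1.0])

sol = np.linalg.solve(K, rhs)
x_opt, lam = sol[:2], sol[2]   # x_opt = (0.5, 0.5), lam = -0.5
```

Because the objective is convex, this stationary point is the global minimum — the property that makes convex QPs easy to solve.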
Quiz

1) Consider the hyperplane shown in white, defined by the equation w · x + 10 = 0. Which of the three other hyperplanes can be defined by the equation w · x + 3 = 0?
- Orange
- Green
- Yellow

2) What is the dot product between the vectors a and b shown in the figure?
[Figure: vectors a and b drawn on a grid.]
Quiz

3) What is the dot product between the vectors a = (3, 3) and b = (1, 0)?
[Figure: a and b drawn on a grid.]

4) What is the length of the vector a = (2, 0), and what is the length of all the other red vectors in the figure?
[Figure: several red vectors of equal length pointing in different directions.]
Quiz

5) Which of the four functions shown is/are convex?
[Figure: four function plots, labeled 1–4.]
Support vector machines for binary classification: classical formulation
Case 1: Linearly separable data;
"Hard-margin" linear SVM

Given training data: x₁, x₂, ..., x_N ∈ ℝⁿ with labels y₁, y₂, ..., y_N ∈ {−1, +1}

• We want to find a classifier (hyperplane) that separates the negative instances from the positive ones.
• An infinite number of such hyperplanes exists.
• SVMs find the hyperplane that maximizes the gap between the data points on the boundaries (the so-called "support vectors").
• If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.
[Figure: negative instances (y = −1) and positive instances (y = +1) separated by a maximum-margin hyperplane.]
Statement of linear SVM classifier

The separating hyperplane is w · x + b = 0. The gap is the distance between the parallel hyperplanes
  w · x + b = −1 and w · x + b = +1,
or equivalently
  w · x + (b + 1) = 0 and w · x + (b − 1) = 0.

We know that the distance between parallel hyperplanes is D = |b₁ − b₂| / ||w||. Therefore:
  D = 2 / ||w||

Since we want to maximize the gap, we need to minimize ||w||, or equivalently minimize ½||w||² (the ½ and the square are convenient for taking derivatives later on).
Statement of linear SVM classifier

In addition, we need to impose constraints that all instances are correctly classified. In our case:
  w · xᵢ + b ≤ −1 if yᵢ = −1
  w · xᵢ + b ≥ +1 if yᵢ = +1
Equivalently:
  yᵢ(w · xᵢ + b) ≥ 1

In summary, we want to minimize ½||w||² subject to yᵢ(w · xᵢ + b) ≥ 1 for i = 1, ..., N.
Then, given a new instance x, the classifier is f(x) = sign(w · x + b).
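A minimal sketch of this classifier in code, assuming scikit-learn and an illustrative toy dataset; a very large C makes the soft-margin solver behave like the hard-margin SVM described here:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (illustrative).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# The classifier is f(x) = sign(w·x + b); the gap is 2/||w||.
pred = np.sign(X @ w + b)
all_correct = bool(np.all(pred == y))
margin = 2.0 / np.linalg.norm(w)
```

For this data the maximum-margin hyperplane lies between the lines x₁ + x₂ = 3 and x₁ + x₂ = 8, so the gap is 5/√2 ≈ 3.54.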
SVM optimization problem:
Primal formulation

Minimize ½ Σⱼ₌₁ⁿ wⱼ² subject to yᵢ(w · xᵢ + b) − 1 ≥ 0 for i = 1, ..., N

This is a convex quadratic programming problem in the n + 1 variables w₁, ..., wₙ, b.
SVM optimization problem:
Dual formulation

• The previous problem can be recast in the so-called "dual form", giving rise to the "dual formulation of linear SVMs".
• It is also a convex quadratic programming problem, but with N variables (αᵢ, i = 1, ..., N), where N is the number of samples.

Maximize Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ₌₁ᴺ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ) subject to αᵢ ≥ 0 and Σᵢ₌₁ᴺ αᵢyᵢ = 0.
(Derivation of dual formulation)

If we set the derivatives of the Lagrangian L_P(w, b, α) with respect to w and b to 0, we obtain:
  ∂L_P(w, b, α)/∂b = 0  ⇒  Σᵢ₌₁ᴺ αᵢyᵢ = 0
  ∂L_P(w, b, α)/∂w = 0  ⇒  w = Σᵢ₌₁ᴺ αᵢyᵢxᵢ

We substitute the above into the equation for L_P(w, b, α) and obtain the "dual formulation of linear SVMs":
  L_D = Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ₌₁ᴺ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)
Case 2: Not linearly separable data;
"Soft-margin" linear SVM

Approach: assign a "slack variable" ξᵢ ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if the instance is misclassified, and 0 otherwise.

We want to minimize ½||w||² + C Σᵢ₌₁ᴺ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ for i = 1, ..., N.
Then, given a new instance x, the classifier is f(x) = sign(w · x + b).
Two formulations of soft-margin linear SVM

Primal formulation:
Minimize ½||w||² + C Σᵢ₌₁ᴺ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ for i = 1, ..., N

Dual formulation:
Maximize Σᵢ₌₁ᴺ αᵢ − ½ Σᵢ,ⱼ₌₁ᴺ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ) subject to 0 ≤ αᵢ ≤ C and Σᵢ₌₁ᴺ αᵢyᵢ = 0
Parameter C in soft-margin SVM

Minimize ½||w||² + C Σᵢ₌₁ᴺ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ for i = 1, ..., N

C controls the trade-off between maximizing the gap and minimizing the training error: a small C allows more margin violations, while a large C penalizes them heavily.
[Figure: Tumor vs. Normal data separated with different values of C.]
Kernel trick

A mapping Φ : ℝᴺ → H takes the original data x (in input space) to Φ(x) in a higher-dimensional feature space:
  f(x) = sign(w · x + b)  becomes  f(x) = sign(w · Φ(x) + b)
  w = Σᵢ₌₁ᴺ αᵢyᵢxᵢ  becomes  w = Σᵢ₌₁ᴺ αᵢyᵢΦ(xᵢ)
so
  f(x) = sign(Σᵢ₌₁ᴺ αᵢyᵢ Φ(xᵢ) · Φ(x) + b)
and, replacing the inner product by a kernel function K(xᵢ, x) = Φ(xᵢ) · Φ(x):
  f(x) = sign(Σᵢ₌₁ᴺ αᵢyᵢ K(xᵢ, x) + b)
The kernel is computed directly in the input space, so the mapping Φ never has to be carried out explicitly.
Understanding the Gaussian kernel

Consider the Gaussian kernel: K(x, xⱼ) = exp(−||x − xⱼ||²)
Geometrically, this is a "bump" or "cavity" centered at the training data point xⱼ:
[Figure: a "bump" and a "cavity" in the mapped surface.]
The resulting mapping function is a combination of bumps and cavities.
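A minimal numerical sketch of this "bump", assuming numpy and an illustrative center point: the kernel value is 1 exactly at xⱼ and decays with distance from it.

```python
import numpy as np

def gaussian_kernel(x, xj):
    # K(x, xj) = exp(-||x - xj||^2): a "bump" centered at xj.
    d = np.asarray(x, float) - np.asarray(xj, float)
    return float(np.exp(-np.dot(d, d)))

center = [1.0, 1.0]
at_center = gaussian_kernel(center, center)   # equals 1.0 at xj itself
near = gaussian_kernel([1.5, 1.0], center)
far = gaussian_kernel([3.0, 3.0], center)

# The bump decays monotonically with distance from xj.
decays = bool(at_center > near > far > 0.0)
```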
Understanding the Gaussian kernel

[Figure: several more views of the data mapped to the feature space by the Gaussian kernel.]
Understanding the Gaussian kernel

[Figure: a linear hyperplane in the feature space that separates the two classes.]
Understanding the polynomial kernel

Consider the polynomial kernel: K(xᵢ, xⱼ) = (1 + xᵢ · xⱼ)³
For inputs x = (x₍₁₎, x₍₂₎) in 2-dimensional space, the kernel implicitly maps the data into a 10-dimensional feature space with coordinates built from:
  1, x₍₁₎, x₍₂₎, x₍₁₎², x₍₂₎², x₍₁₎x₍₂₎, x₍₁₎³, x₍₂₎³, x₍₁₎x₍₂₎², x₍₁₎²x₍₂₎
Example of benefits of using a kernel

• The data are not linearly separable in the input space (ℝ²).
• Apply the kernel K(x, z) = (x · z)² to map the data to a higher-dimensional (3-dimensional) space where they are linearly separable.
[Figure: points x₁, ..., x₄ arranged so that no line separates the classes in the (x₍₁₎, x₍₂₎) plane.]

K(x, z) = (x · z)² = (x₍₁₎z₍₁₎ + x₍₂₎z₍₂₎)²
        = x₍₁₎²z₍₁₎² + 2x₍₁₎z₍₁₎x₍₂₎z₍₂₎ + x₍₂₎²z₍₂₎²
        = (x₍₁₎², √2 x₍₁₎x₍₂₎, x₍₂₎²) · (z₍₁₎², √2 z₍₁₎z₍₂₎, z₍₂₎²)
        = Φ(x) · Φ(z)
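The identity above can be verified numerically (a small sketch with illustrative vectors, assuming numpy): the kernel computed in ℝ² equals the dot product of the explicit feature maps in ℝ³.

```python
import numpy as np

def phi(v):
    # Explicit feature map for K(x, z) = (x·z)^2 in 2-D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_input_space = float(np.dot(x, z)) ** 2          # computed in R^2
k_feature_space = float(np.dot(phi(x), phi(z)))   # computed in R^3
match = bool(abs(k_input_space - k_feature_space) < 1e-9)
```

Here x · z = 1, so both sides equal 1; the kernel does in ℝ² what the explicit mapping would require ℝ³ to do.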
Example of benefits of using a kernel

Therefore, the explicit mapping is Φ(x) = (x₍₁₎², √2 x₍₁₎x₍₂₎, x₍₂₎²).
[Figure: after the mapping, the points x₁, x₂ and x₃, x₄ land in regions of the (x₍₁₎², √2 x₍₁₎x₍₂₎, x₍₂₎²) space that are linearly separable.]
Comparison with methods from classical statistics & regression

• Suppose a full polynomial model needs to be estimated, with roughly 5 samples required per estimated parameter:

Number of variables | Polynomial degree | Number of parameters | Required sample
2                   | 3                 | 10                   | 50
10                  | 3                 | 286                  | 1,430
10                  | 5                 | 3,003                | 15,015
100                 | 3                 | 176,851              | 884,255
100                 | 5                 | 96,560,646           | 482,803,230
Generalization and overfitting
• Generalization: A classifier or a regression algorithm
learns to correctly predict output from given inputs
not only in previously seen samples but also in
previously unseen samples.
• Overfitting: A classifier or a regression algorithm
learns to correctly predict output from given inputs
in previously seen samples but fails to do so in
previously unseen samples.
• Overfitting ⇒ poor generalization.
Example of overfitting and generalization

There is a linear relationship between the predictor and the outcome (plus some Gaussian noise).
[Figure: Predictor X vs. Outcome of Interest Y for training data and test data; Algorithm 1 and Algorithm 2 fit the training data differently — one captures the linear trend, the other overfits the noise and fails on the test data.]
SVMs in "loss + penalty" form

SVMs build classifiers of the form f(x) = sign(w · x + b).
Consider the soft-margin linear SVM formulation: find w and b that
  Minimize ½||w||² + C Σᵢ₌₁ᴺ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ for i = 1, ..., N
Here C Σᵢ ξᵢ plays the role of the loss (the "hinge loss") and ½||w||² plays the role of the penalty.
Meaning of SVM loss function

[Figure: the hyperplanes w · x + b = −1, 0, +1 divide the space into regions 1–4.]
• If an instance is negative, it is penalized only in regions 2, 3, 4.
• If an instance is positive, it is penalized only in regions 1, 2, 3.
Flexibility of "loss + penalty" framework

Minimize (Loss + Penalty). For example, the hinge loss Σᵢ₌₁ᴺ [1 − yᵢf(xᵢ)]₊ combined with the 1-norm penalty λ||w||₁ gives the 1-norm SVM.
Part 2

Model selection for SVMs
Need for model selection for SVMs

[Figures: two Gene 1 vs. Gene 2 datasets of tumor and normal samples.]
• Left: it is impossible to find a linear SVM classifier that separates tumors from normals. A non-linear SVM classifier is needed; e.g., an SVM with a polynomial kernel of degree 2 solves this problem without errors.
• Right: we should not apply a non-linear SVM classifier when we can perfectly solve the problem using a linear SVM classifier.
A data-driven approach for model selection for SVMs

• We do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset.
• We need to examine various combinations of parameters, e.g., by searching the following grid of (C, polynomial degree d) pairs:

       C=0.1    C=1    C=10    C=100    C=1000
d=1   (0.1,1)  (1,1)  (10,1)  (100,1)  (1000,1)
d=2   (0.1,2)  (1,2)  (10,2)  (100,2)  (1000,2)
d=3   (0.1,3)  (1,3)  (10,3)  (100,3)  (1000,3)
d=4   (0.1,4)  (1,4)  (10,4)  (100,4)  (1000,4)
d=5   (0.1,5)  (1,5)  (10,5)  (100,5)  (1000,5)

• Each combination is evaluated with a train/validation/test split: train on the training set, pick the best combination on the validation set, and estimate the final accuracy on the test set.
Example of nested cross-validation

Suppose we use 3-fold cross-validation and want to optimize the parameter C, which takes the values "1" and "2". The data are split into three parts P1, P2, P3.

Outer loop (estimates accuracy):
Training set | Testing set | Chosen C | Accuracy          Average accuracy
P1, P2       | P3          | 1        | 89%
P1, P3       | P2          | 2        | 84%               83%
P2, P3       | P1          | 1        | 76%

Inner loop (chooses C; shown for the outer split that trains on P1, P2):
Training set | Validation set | C | Accuracy | Average accuracy
P1           | P2             | 1 | 86%      | 85%  → choose C = 1
P2           | P1             | 1 | 84%      |
P1           | P2             | 2 | 70%      | 80%
P2           | P1             | 2 | 90%      |
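The nested scheme above can be sketched with scikit-learn (a minimal illustration on a synthetic dataset; the dataset and the 3×3 fold counts are assumptions for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Inner loop: a 3-fold grid search chooses C from {1, 2}.
inner = GridSearchCV(SVC(kernel="linear"), {"C": [1, 2]}, cv=3)

# Outer loop: 3-fold CV estimates the accuracy of the WHOLE
# procedure (training + parameter selection), avoiding the
# optimistic bias of selecting C on the same data used for testing.
outer_scores = cross_val_score(inner, X, y, cv=3)
mean_accuracy = float(outer_scores.mean())
```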
On use of cross-validation

SVMs for multicategory data
One-versus-rest multicategory SVM method

[Figure: Gene 1 vs. Gene 2 plot with three classes (Tumor I, Tumor II, Tumor III); one binary SVM separates each class from the rest, and query points "?" are assigned accordingly.]
One-versus-one multicategory SVM method

[Figure: the same three-class data; a binary SVM is trained for every pair of classes, and a query point "?" is assigned by combining the pairwise classifiers.]
DAGSVM multicategory SVM method

[Figure: a decision directed acyclic graph for three classes (AML, ALL B-cell, ALL T-cell). The root node applies AML vs. ALL T-cell; depending on the outcome ("Not AML" or "Not ALL T-cell"), the next node applies ALL B-cell vs. ALL T-cell or AML vs. ALL B-cell, until a single class remains.]
SVM multicategory methods by Weston and Watkins and by Crammer and Singer

[Figure: the same three-class data, here separated by a single SVM optimization over all classes rather than a combination of binary classifiers.]
Support vector regression
ε-Support vector regression (ε-SVR)

Given training data: x₁, x₂, ..., x_N ∈ ℝⁿ with y₁, y₂, ..., y_N ∈ ℝ

Main idea: find a function f(x) = w · x + b that approximates y₁, ..., y_N:
• it has at most ε deviation from the true values yᵢ;
• it is as "flat" as possible (to avoid overfitting).
[Figure: data points and the fitted line with an ε-ribbon (±ε) around it.]
E.g., build a model to predict survival of cancer patients that can admit a one-month error (= ε).
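A minimal sketch of ε-SVR with scikit-learn, on illustrative noise-free linear data (the C, ε, and data values are assumptions for the example):

```python
import numpy as np
from sklearn.svm import SVR

# Noise-free linear data: y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# epsilon is the half-width of the tube: errors smaller than
# epsilon are ignored; C penalizes points outside the tube.
model = SVR(kernel="linear", C=100.0, epsilon=0.5).fit(X, y)
residuals = np.abs(model.predict(X) - y)

# Every training point should lie inside (or very near) the
# epsilon-tube; a little numerical slack is allowed here.
inside_tube = bool(np.all(residuals <= 0.55))
```

Note that the fitted slope can be slightly smaller than 2: ε-SVR trades approximation error inside the tube for flatness of f.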
Formulation of "hard-margin" ε-SVR

Find f(x) = w · x + b by minimizing ½||w||² subject to the constraints:
  yᵢ − (w · xᵢ + b) ≤ ε
  (w · xᵢ + b) − yᵢ ≤ ε
for i = 1, ..., N.
[Figure: all points lie inside the ±ε ribbon around the fitted line.]

If we have points outside the ribbon (e.g., outliers or noise), we can either:
a) increase ε to ensure that these points are within the new ε-ribbon, or
b) assign a penalty ("slack" variable) to each of these points (as was done for "soft-margin" SVMs).
Formulation of "soft-margin" ε-SVR

Find f(x) = w · x + b by minimizing ½||w||² + C Σᵢ₌₁ᴺ (ξᵢ + ξᵢ*) subject to the constraints:
  yᵢ − (w · xᵢ + b) ≤ ε + ξᵢ
  (w · xᵢ + b) − yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0
for i = 1, ..., N.
[Figure: points outside the ε-ribbon incur slack ξᵢ or ξᵢ*.]

In "loss + penalty" terms, the loss here is the "linear ε-insensitive loss": zero when the error in approximation is at most ε, growing linearly beyond that.
Comparing ε-SVR with popular regression methods

Comparing loss functions of regression methods:
[Figure: loss-function value vs. error in approximation for the linear ε-insensitive loss and the quadratic ε-insensitive loss; both are zero on the interval [−ε, +ε].]
Applying ε-SVR to real data

In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid search):
• parameter C
• parameter ε
• kernel parameters (e.g., degree of polynomial)
Novelty detection with SVM-based methods
What is it about?

• Find the simplest and most compact region in the space of predictors where the majority of the data samples "live" (i.e., the region with the highest density of samples).
• Build a decision function that takes the value +1 in this region and −1 elsewhere.
• Once we have such a decision function, we can identify novel or outlier samples/patients in the data.
[Figure: Predictor X vs. Predictor Y; a dense cloud of samples where the decision function = +1, and a few isolated points outside it where the decision function = −1.]
Key assumptions
Sample applications

[Figure: a "normal" cluster of samples with several "novel" points outside it. Modified from: [Link]/course/2004/learns/[Link]]
Sample applications

Discover deviations in sample handling protocol when doing quality control of assays.
[Figure: Protein X vs. Protein Y; a main cluster of normally processed samples, with outlying groups of samples with low quality of processing — from the lab of Dr. Smith, from infants, from ICU patients, and from patients with lung cancer.]
Sample applications

Identify websites that discuss benefits of proven cancer treatments.
[Figure: weighted frequency of word X vs. word Y; a main cluster of websites that discuss benefits of proven cancer treatments, with outlying groups — websites that discuss cancer prevention methods, websites that discuss side-effects of proven cancer treatments, websites that discuss unproven cancer treatments, and blogs of cancer patients.]
One-class SVM

Main idea: find the maximal-gap hyperplane that separates the data from the origin (i.e., the only member of the second class is the origin).

Find f(x) = sign(w · x − b) by minimizing
  ½||w||² + (1/(νN)) Σᵢ₌₁ᴺ ξᵢ − b
subject to the constraints:
  w · xᵢ ≥ b − ξᵢ
  ξᵢ ≥ 0
for i = 1, ..., N;
i.e., the decision function should be positive for all training samples except for small deviations. The parameter ν is an upper bound on the fraction of outliers (points outside the decision surface) allowed in the data.
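A minimal sketch with scikit-learn's one-class SVM on illustrative synthetic data (the cluster, the two planted outliers, and the γ and ν values are assumptions for the example):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# A tight cluster of "normal" samples plus two far-away outliers.
normal = rng.randn(100, 2) * 0.3 + 5.0
outliers = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.vstack([normal, outliers])

# nu upper-bounds the fraction of training points treated as outliers.
det = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
labels = det.predict(X)          # +1 inside the region, -1 outside

outliers_flagged = bool(np.all(labels[-2:] == -1))
frac_rejected = float(np.mean(labels == -1))
```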
Formulation of one-class SVM:
linear and non-linear cases

Linear case: find f(x) = sign(w · x − b) by minimizing ½||w||² + (1/(νN)) Σᵢ₌₁ᴺ ξᵢ − b subject to w · xᵢ ≥ b − ξᵢ and ξᵢ ≥ 0 for i = 1, ..., N.

Non-linear case (use the "kernel trick"): find f(x) = sign(w · Φ(x) − b) by minimizing ½||w||² + (1/(νN)) Σᵢ₌₁ᴺ ξᵢ − b subject to w · Φ(xᵢ) ≥ b − ξᵢ and ξᵢ ≥ 0 for i = 1, ..., N.
More about one-class SVM

Support vector clustering
Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x₁, x₂, ..., x_N ∈ ℝⁿ, assign labels y₁, y₂, ..., y_N ∈ {1, 2, ..., K} such that points with the same label are highly "similar" to each other and are distinctly different from the rest.
[Figure: the clustering process groups an unlabeled point cloud into clusters.]
Support vector domain description

• The Support Vector Domain Description (SVDD) of the data is the set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space.
  – These surface points are called Support Vectors.
[Figure: a kernel maps the data to a feature space, where a sphere of radius R centered at a encloses all points.]
SVDD optimization criterion

Formulation with hard constraints: minimize R² subject to ||Φ(xᵢ) − a||² ≤ R² for all i, where a is the center and R the radius of the enclosing hyper-sphere.
[Figure: shrinking a sphere of radius R′ with center a′ down to the minimal enclosing sphere of radius R with center a.]
Main idea behind Support Vector Clustering

• Cluster boundaries in the input space are formed by the set of points that, when mapped from the input space to the feature space, fall exactly on the surface of the minimal enclosing hyper-sphere.
  – The SVs identified by SVDD are a subset of the cluster boundary points.
[Figure: the sphere surface in feature space maps back to closed contours around the clusters in input space.]
Cluster assignment in SVC

• Two points xᵢ and xⱼ belong to the same cluster (i.e., receive the same label) if every point of the line segment (xᵢ, xⱼ), projected to the feature space, lies within the hyper-sphere.
[Figure: for points in different clusters, some points of the connecting segment map outside the hyper-sphere.]
Effects of noise and cluster overlap

• In practice, data often contain noise, outlier points, and overlapping clusters, which would prevent contour separation and result in all points being assigned to the same cluster.
[Figure: typical data exhibiting noise, outliers, and cluster overlap.]
SVDD with soft constraints

• SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSVs), to lie outside the hyper-sphere.
  – BSVs are not considered cluster boundary points and are not assigned to clusters by SVC.
[Figure: with soft constraints, the enclosing sphere ignores noise and outliers, and overlapping clusters can be separated.]
Soft SVDD optimization criterion

Primal formulation with soft constraints: minimize R² + C Σᵢ ξᵢ subject to ||Φ(xᵢ) − a||² ≤ R² + ξᵢ and ξᵢ ≥ 0 for all i.

Dual formulation of soft SVDD:
Maximize W = Σᵢ βᵢ K(xᵢ, xᵢ) − Σᵢ,ⱼ βᵢβⱼ K(xᵢ, xⱼ)
subject to 0 ≤ βᵢ ≤ C and Σᵢ βᵢ = 1.
SVM-based variable selection
Understanding the weight vector w

Recall the standard SVM formulation: the decision surface is w · x + b = 0.

Examples in ℝ² and ℝ³:
• w = (1, 1): 1·x₁ + 1·x₂ + b = 0 — X1 and X2 are equally important.
• w = (1, 0): 1·x₁ + 0·x₂ + b = 0 — X1 is important, X2 is not.
• w = (0, 1): 0·x₁ + 1·x₂ + b = 0 — X2 is important, X1 is not.
• w = (1, 1, 0): 1·x₁ + 1·x₂ + 0·x₃ + b = 0 — X1 and X2 are equally important, X3 is not.
Understanding the weight vector w

[Figure: Gene X1 vs. Gene X2 for melanoma and nevi samples. The SVM decision surface has w = (1, 1), i.e., 1·x₁ + 1·x₂ + b = 0; another classifier's decision surface has w = (1, 0), i.e., 1·x₁ + 0·x₂ + b = 0. The true causal model is shown as a small graph over X1, X2, and Phenotype.]

• In the true model, X1 is causal and X2 is redundant.
• The SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent.
• There exists a causally consistent decision surface for this example.
• Causal discovery algorithms can identify that X1 is causal and X2 is redundant.
Simple SVM-based variable selection algorithm

Algorithm:
1. Train an SVM classifier using data for all variables to estimate the vector w.
2. Rank each variable based on the magnitude of the corresponding element in vector w.
3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.
Simple SVM-based variable selection algorithm

Suppose we have 7 variables: X1, X2, X3, X4, X5, X6, X7, and the estimated vector w is (0.1, 0.3, 0.4, 0.01, 0.9, −0.99, 0.2). The ranking of the variables by |wᵢ| is then: X6, X5, X3, X2, X7, X1, X4.

Subset of variables      | Classification accuracy
X6 X5 X3 X2 X7 X1 X4     | 0.920  ← best classification accuracy
X6 X5 X3 X2 X7 X1        | 0.920
X6 X5 X3 X2 X7           | 0.919  ← statistically indistinguishable from the best
X6 X5 X3 X2              | 0.852
X6 X5 X3                 | 0.843
X6 X5                    | 0.832
X6                       | 0.821
SVM-RFE variable selection algorithm

Algorithm:
1. Initialize V to all variables in the data.
2. Repeat:
3.   Train an SVM classifier using data for the variables in V to estimate the vector w.
4.   Estimate the prediction accuracy of the variables in V using the above SVM classifier (e.g., by cross-validation).
5.   Remove from V the variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w.
6. Until there are no variables left in V.
7. Select the smallest subset of variables with the best prediction accuracy.
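A minimal sketch of recursive feature elimination with a linear SVM in scikit-learn (the synthetic dataset and the target of 5 features are assumptions for the example; scikit-learn's `RFE` implements the "drop the smallest |w|" loop of steps 2–6):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# 20 features, of which 5 are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# At each step, train a linear SVM and drop the feature with the
# smallest magnitude of the corresponding element of w.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=1).fit(X, y)
selected = np.where(selector.support_)[0]
n_selected = int(selector.support_.sum())
```

Step 7 of the algorithm (choosing the subset size by prediction accuracy) corresponds to scikit-learn's `RFECV` variant.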
SVM-RFE variable selection algorithm

[Figure: of 5,000 genes, 2,500 are discarded at each step as not important for classification; the kernel maps the remaining genes from input space to feature space, where tumor and normal samples are separated.]
SVM variable selection in feature space

• We have data for 100 SNPs (X1, ..., X100) and some phenotype.
• We allow up to 3rd-order interactions, i.e., we consider:
  • X1, ..., X100
  • X1², X1X2, X1X3, ..., X1X100, ..., X100²
  • X1³, X1X2X3, X1X2X4, ..., X1X99X100, ..., X100³
• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy for the phenotype.
• Challenge: with a limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as is done in classical statistics.
SVM variable selection in feature space

Heuristic solution: apply the SVM-FSMB algorithm, which:
1. Uses SVMs with a polynomial kernel of degree 3 and selects the M features (not necessarily input variables!) that have the largest weights in the feature space.
   E.g., the algorithm can select features like X10, (X1X2), (X9X2X22), (X72X98), and so on.
2. Applies the HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using the M features from step 1.
Computing posterior class probabilities for SVM classifiers
Output of SVM classifier

1. SVMs output a class label (positive or negative) for each sample: sign(w · x + b).
2. For each sample in the validation set, compute its distance to the hyperplane, e.g.:

Sample #  1   2  3  4  5  ...  98  99   100
Distance  2  −1  8  3  4  ...  −2  0.3  0.8

3. Create a histogram with Q (say, 10) bins using the above distances. Each bin has an upper and a lower value in terms of distance.
[Figure: number of validation-set samples per distance bin, for distances from −15 to 15.]
Simple binning method

4. Given a new sample from the testing set, place it in the corresponding bin. E.g., sample #382 has distance to the hyperplane = 1, so it is placed in the bin [0, 2.5].
5. The fraction of positive validation samples in that bin serves as the estimate of P(positive class | sample).
[Figures: the validation-set histogram over distance, and the resulting curve of P(positive class | sample) versus distance, rising from near 0 at distance −10 to about 0.9 at distance +10.]
Platt's method

1. Train the SVM classifier on the training set.
2. Compute the distance to the hyperplane for each sample in the validation set:

Sample #  1   2  3  4  5  ...  98  99   100
Distance  2  −1  8  3  4  ...  −2  0.3  0.8

3. Fit a sigmoid P(positive class | sample) = 1 / (1 + exp(A·d + B)) to the validation-set distances d, estimating the parameters A and B.
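In scikit-learn, `SVC(probability=True)` fits a Platt-style sigmoid to the decision values via internal cross-validation — a minimal sketch on an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# probability=True calibrates a sigmoid on the SVM decision values
# (Platt-style scaling) using internal cross-validation.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:10])

# Each row gives P(class | sample) over the two classes.
rows_sum_to_one = bool(np.allclose(proba.sum(axis=1), 1.0))
```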
Part 3
• Case studies (taken from our research)
1. Classification of cancer gene expression microarray data
2. Text categorization in biomedicine
3. Prediction of clinical laboratory values
4. Modeling clinical judgment
5. Using SVMs for feature selection
6. Outlier detection in ovarian cancer proteomics data
• Software
• Conclusions
• Bibliography
1. Classification of cancer gene expression microarray data
Comprehensive evaluation of algorithms
for classification of cancer microarray data
Main goals:
Find the best performing decision support
algorithms for cancer diagnosis from
microarray gene expression data;
Investigate benefits of using gene selection
and ensemble classification methods.
Classifiers

• K-Nearest Neighbors (KNN) — instance-based
• Backpropagation Neural Networks (NN) — neural networks
• Probabilistic Neural Networks (PNN) — neural networks
• Multi-Class SVM: One-Versus-Rest (OVR) — kernel-based
• Multi-Class SVM: One-Versus-One (OVO) — kernel-based
• Multi-Class SVM: DAGSVM — kernel-based
• Multi-Class SVM by Weston & Watkins (WW) — kernel-based
• Multi-Class SVM by Crammer & Singer (CS) — kernel-based
• Weighted Voting: One-Versus-Rest — voting
• Weighted Voting: One-Versus-One — voting
• Decision Trees: CART — decision trees
Ensemble classifiers

[Figure: the dataset is given to multiple classifiers; their outputs are combined by an ensemble classifier to produce the final prediction.]
Gene selection methods

[Figure: genes ranked from highly discriminatory to uninformative.]
1. Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
2. Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of genes' between-categories to within-category sum of squares (BW).
Performance metrics and statistical comparison

1. Accuracy
   + can compare to previous studies
   + easy to interpret & simplifies statistical comparison
Microarray datasets

Dataset name | Samples | Variables (genes) | Categories | Reference
11_Tumors    | 174     | 12,533            | 11         | Su, 2001

[Figure: experimental design combining the 11 classifiers (MC-SVM: one-versus-rest, one-versus-one, DAGSVM, and the methods by WW and CS; WV: one-versus-rest and one-versus-one; decision trees; and others), gene selection methods (non-parametric ANOVA, BW ratio), majority voting based on the outputs of all classifiers, and 11 gene expression datasets (11_Tumors, 14_Tumors, 9_Tumors, Brain_Tumor1, Lung_Cancer, Prostate_Tumors, DLBCL, among others) covering multicategory and binary diagnoses.]
Results without gene selection

[Figure: accuracy (%) of each method (MC-SVM: OVR, OVO, DAGSVM, WW, CS; KNN; NN; PNN) on the datasets 9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, and DLBCL.]
Results with gene selection

[Figures: improvement of diagnostic performance by gene selection (averages for four datasets, including 9_Tumors and 14_Tumors), and diagnostic performance before and after gene selection for SVM and non-SVM methods (OVR, OVO, DAGSVM, WW, CS vs. KNN, NN, PNN). Accuracies are also compared against multiple specialized classification methods from the original primary studies across 9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, and DLBCL.]
Summary of results
Random Forest (RF) classifiers

• Appealing properties
  – Work when # of predictors > # of samples
  – Embedded gene selection
  – Incorporate interactions
  – Based on theory of ensemble learning
  – Can work with binary & multiclass tasks
  – Require little fine-tuning of parameters
• Strong theoretical claims
• Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
Key principles of RF classifiers

[Figure: training data are used to grow an ensemble of decision trees; testing data are classified by each tree and the predictions are aggregated.]
2. Text categorization in biomedicine

Models to categorize content and quality:
Main idea

[Figure: pipeline steps 1–4.]
2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available; occasionally addition of Metamap CUIs, author info) as "bag-of-words".
Models to categorize content and quality:
Main idea

[Figure: labeled examples train models that are then applied to unseen examples.]
4. Evaluate models' performances
   - with nested cross-validation or other appropriate error estimators
   - use primarily AUC as well as other metrics (sensitivity, specificity, PPV, precision/recall curves, HIT curves, etc.)

[Figures: results on the SSOAB gold standard — area under the ROC curve for query filters vs. learning models on treatment, diagnosis, prognosis, and etiology tasks, and comparisons at fixed specificity/sensitivity; learning models are also compared with PageRank, Impact Factor (2005: 0.558), Quackometer (0.67), and Google (0.63).]
3. Prediction of clinical laboratory values

Dataset generation and experimental design

• The StarPanel database contains ~8·10⁶ lab measurements of ~100,000 in-patients from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.
• Data from 01/1998–05/2001 were used for training (with 25% of the training data held out for validation), and data from 06/2001–10/2002 for testing.
Query-based approach for prediction of clinical lab values

[Figure: for every data model, an SVM classifier is trained on the training data and evaluated on the validation data to select the optimal data model; the resulting classifier then produces a prediction for every testing sample.]
Classification results

Area under the ROC curve (without feature selection) for each laboratory test, under six experimental settings.

Including cases with K=0 (i.e., samples with no prior lab measurements):
Ca    67.5%  80.4%  55.0%  77.4%  70.8%  60.0%
CaIo  63.5%  52.9%  58.8%  46.4%  66.3%  58.7%
CO2   77.3%  88.0%  53.4%  77.5%  90.5%  58.1%
Creat 62.2%  88.4%  83.5%  88.4%  94.9%  83.8%
Mg    58.4%  71.8%  64.2%  67.0%  72.5%  62.1%
Osmol 77.9%  64.8%  65.2%  79.2%  82.4%  71.5%
PCV   62.3%  91.6%  69.7%  76.5%  84.6%  70.2%
Phos  70.8%  75.4%  60.4%  68.0%  81.8%  65.9%

Excluding cases with K=0 (i.e., samples with no prior lab measurements):
Ca    72.8%  93.4%  55.6%  81.4%  81.4%  63.4%
CaIo  74.1%  60.0%  50.1%  64.7%  72.3%  57.7%
CO2   82.0%  93.6%  59.8%  84.4%  94.5%  56.3%
Creat 62.8%  97.7%  89.1%  91.5%  98.1%  87.7%
Mg    56.9%  70.0%  49.1%  58.6%  76.9%  59.1%
Osmol 50.9%  60.8%  60.8%  91.0%  90.5%  68.0%
PCV   74.9%  99.2%  66.3%  80.9%  80.6%  67.1%
Phos  74.5%  93.6%  64.4%  71.7%  92.2%  69.7%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
Improving predictive power and parsimony of a BUN model using feature selection

Model description:
  Test name: BUN
  Range of normal values: < 99th percentile
  Data modeling: SRT
  Number of previous measurements: 5
  Use variables corresponding to hospitalization units? Yes
  Number of prior hospitalizations used: 2

Dataset description:
  Training set: 3,749 samples (78 abnormal)
[Figure: histogram of test BUN values, split into normal and abnormal values.]

Classification performance (area under ROC curve):
                   All      RFE_Linear  RFE_Poly  HITON_PC  HITON_MB
Validation set     95.29%   98.78%      98.76%    99.12%    98.90%
Testing set        94.72%   99.66%      99.63%    99.16%    99.05%
Number of features 3442     26          3         11        17

[Table: the individual features selected by each method, e.g., LAB: PM_1(BUN), LAB: PM_5(Creat), LAB: PM_1(Phos), LAB: PM_1(Mg), LAB: PM_3(PCV), indicator and time-difference variables such as LAB: Indicator(PM_5(Creat)) and LAB: DT(PM_4(Creat)), test-unit variables such as LAB: Test Unit 7SCC (Test Ca, PM 1) and LAB: Test Unit RADR (Test Ca, PM 5), and demographics (Gender, Age).]
4. Modeling clinical judgment
166
Methodological framework and study
outline
[Diagram: methodological framework. Each patient i is described by features
f1…fm, a clinical diagnosis cd_i, and a histological gold standard hd_i;
physician-specific models (Physician 1, …) are compared against guidelines to
identify predictors ignored by physicians.]

Features
  Lesion location   Family history of melanoma   Irregular border             Streaks (radial streaming, pseudopods)
  Max-diameter      Fitzpatrick's photo-type     Number of colors             Slate-blue veil
  Min-diameter      Sunburn                      Atypical pigmented network   Whitish veil
  Age               Lentigos                     Regression-Erythema          Comedo-like openings, milia-like cysts
  Gender            Asymmetry                    Hypo-pigmentation            Telangiectasia
Method to explain physician-specific
SVM models

[Diagram: regular learning performs feature selection (FS) and then builds an
SVM on the patient features; meta-learning then builds a decision tree (DT)
that mimics the SVM's outputs, yielding an interpretable explanation of the
model.]
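The meta-learning step can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data; the features, labels, and tree depth are placeholders, not the study's actual models:

```python
# Minimal meta-learning sketch (scikit-learn, synthetic data): fit an SVM to a
# physician's judgments, then fit an interpretable decision tree to the SVM's
# own predictions so the tree explains what the black-box model has learned.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))               # stand-in lesion features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in physician diagnoses

# Regular learning: fit an SVM that predicts the physician's judgments.
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Meta-learning: train a shallow decision tree on the SVM's outputs.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, svm.predict(X))

# Fidelity: how often the tree reproduces the SVM's decision.
fidelity = (tree.predict(X) == svm.predict(X)).mean()
print(f"tree-to-SVM fidelity: {fidelity:.2f}")
```

The tree's fidelity to the SVM plays the role of the R2-style agreement reported on the results slides: a high-fidelity tree can stand in for the SVM as an explanation.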
Results: Predicting physicians’ judgments
  Expert 1: AUC=0.92, R2=99%
  Expert 3: AUC=0.95, R2=99%
Results: Explain physician disagreement
  Features: blue veil, irregular border, number of colors, streaks, evolution
  Expert 1: AUC=0.92, R2=99%
  Expert 3: AUC=0.95, R2=99%
Results: Guideline compliance
176
Feature selection methods
Feature selection methods (non-causal):
• SVM-RFE (an SVM-based feature selection method)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection
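As an illustration of the first method, here is a minimal SVM-RFE sketch using scikit-learn's RFE wrapper around a linear SVM; the data is synthetic and the target number of features is arbitrary:

```python
# SVM-RFE sketch: recursively fit a linear SVM and eliminate the features with
# the smallest |weights| until the desired number of features remains.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # only features 0 and 1 are relevant

rfe = RFE(LinearSVC(C=1.0, max_iter=10000), n_features_to_select=2).fit(X, y)
print(np.flatnonzero(rfe.support_))       # indices of the surviving features
```

On this toy problem RFE should recover the two truly relevant features; on real high-dimensional data the number of features to keep is usually tuned by cross-validation.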
Dataset        Domain           N variables  N samples  Task                                      Data type
Sylva          Ecology          216          14,394     Ponderosa pine vs. everything else        continuous & discrete
Thrombin       Drug discovery   139,351      2,543      Binding to thrombin                       discrete (binary)
Breast_Cancer  Gene expression  17,816       286        Estrogen-receptor positive (ER+) vs. ER-  continuous
Hiva           Drug discovery   1,617        4,229      Activity to AIDS HIV infection            discrete (binary)
Nova           Text             16,969       1,929      Separate politics from religion topics    discrete (binary)
[Figure: classification performance (AUC) vs. proportion of selected features
for HITON-PC with the G2 test and for RFE; the right panel magnifies the
region of small proportions (0.05 to 0.2).]
179
Statistical comparison of predictivity and
reduction of features
Predictivity   Reduction
181
Comparison of SVM-RFE and HITON-PC
182
Comparison of all methods in terms of
causal graph distance
[Figure: causal graph distance of SVM-RFE and of HITON-PC-based causal
methods.]
183
Summary results
[Figure: summary results for HITON-PC-based causal methods, HITON-PC-FDR
methods, and SVM-RFE.]
184
Statistical comparison of graph distance
185
6. Outlier detection in ovarian cancer
proteomics data
186
Ovarian cancer data

[Figure: proteomic profiles of Cancer, Normal, and Other patients; Data Set 1
(top) was obtained using the Ciphergen H4 ProteinChip array, and Data Set 2
(bottom) using the Ciphergen WCX2 ProteinChip array. x-axis: clock tick
(4000 to 12000).]

The gross break at the “benign disease” juncture in dataset 1 and the
similarity of the profiles to those in dataset 2 suggest a change of protocol
in the middle of the first experiment.
Experiments with one-class SVM

Assume that sets {A, B} are Data Set 1 (top) and Data Set 2 (bottom).
188
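A one-class SVM learns a boundary around "normal" data and flags points that fall outside it, which is how it can expose a protocol change. A minimal sketch with scikit-learn on synthetic 2-D data (not the actual proteomics spectra):

```python
# One-class SVM outlier detection sketch: train only on "in-protocol" samples,
# then score new samples as inliers (+1) or outliers (-1).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # training samples
X_test = np.vstack([rng.normal(size=(5, 2)),              # similar samples
                    rng.normal(loc=6.0, size=(5, 2))])    # shifted samples

# nu upper-bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = ocsvm.predict(X_test)    # +1 = inlier, -1 = outlier
print(pred)
```

The shifted samples, far from the training distribution, fall outside the learned boundary and are labeled -1.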
Software
189
Interactive media and animations
SVM Applets
• [Link]
• [Link]
• [Link]
• [Link]
• [Link] (requires Java 3D)
Animations
• Support Vector Machines:
[Link]
[Link]
[Link]
• Support Vector Regression:
[Link]
190
Several SVM implementations for
beginners
• GEMS: [Link]
• Weka: [Link]
191
Several SVM implementations for
intermediate users
• LibSVM: [Link]
General purpose
Implements binary SVM, multiclass SVM, SVR, one-class SVM
Command-line interface
Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: [Link]
General purpose (designed for text categorization)
Implements binary SVM, multiclass SVM, SVR
Command-line interface
Code/interface for C/C++, Java, Matlab, Python, Perl
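For readers who prefer Python, scikit-learn's SVC class is built on top of LibSVM, so a few lines suffice to get started; the toy data below is purely illustrative:

```python
# Minimal LibSVM-via-scikit-learn example: train a linear SVM on four
# 2-D points whose class depends only on the first coordinate.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]                      # class = 1 when first coordinate is large

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.predict([[0.2, 0.8], [0.9, 0.1]]))   # → [0 1]
```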
192
Conclusions
193
Strong points of SVM-based learning
methods
• Empirically achieve excellent results in high-dimensional data
with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear
functions by using “kernel trick”
• Robust to outliers and noise (use “slack variables”)
• Convex QP optimization problem (thus, it has global minimum
and can be solved efficiently)
• Solution is defined only by a small subset of training points
(“support vectors”)
• Number of free parameters is bounded by the number of
support vectors and not by the number of variables
• Do not require direct access to data, work only with dot-
products of data-points. 194
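Two of these points can be checked directly. The sketch below (scikit-learn, synthetic data) fits an RBF SVM, counts the support vectors, and rebuilds the decision function from the support vectors and kernel values alone:

```python
# Verifying two properties of SVMs: (1) only a small subset of training points
# (support vectors) defines the solution, and (2) the decision function uses
# the data only through kernel values: f(x) = sum_i alpha_i y_i K(sv_i, x) + b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),   # class 0 cluster
               rng.normal(+2, 1, size=(50, 2))])  # class 1 cluster
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X))

# Rebuild f(x) by hand from support vectors and RBF kernel values.
x = np.array([[0.5, -0.5]])
K = np.exp(-0.5 * ((clf.support_vectors_ - x) ** 2).sum(axis=1))
f = clf.dual_coef_[0] @ K + clf.intercept_[0]
print(np.isclose(f, clf.decision_function(x)[0]))   # → True
```

Because the two clusters are well separated, only the points near the margin become support vectors, and the hand-computed f(x) matches the library's decision function exactly.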
Weak points of SVM-based learning
methods
• Measures of uncertainty of parameters are not
currently well-developed
• Interpretation is less straightforward than classical
statistics
• Lack of parametric statistical significance tests
• Power size analysis and research design considerations
are less developed than for classical statistics
195
Bibliography
196
Part 1: Support vector machines for binary
classification: classical formulation
• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers.
Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT)
1992:144-152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery 1998, 2:121-167.
• Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-
based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press;
2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning.
Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK:
Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
197
Part 1: Basic principles of statistical
machine learning
• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass-throughput data:
a technical commentary from the statistical machine learning perspective. Cancer
Informatics 2006, 2:133-162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
198
Part 2: Model selection for SVMs
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model
selection. Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence (IJCAI) 1995, 2:1137-1145.
• Scheffer T: Error estimation and model selection. [Link], Technische Universität
Berlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer
diagnosis and biomarker discovery from microarray gene expression data. Int J Med
Inform 2005, 74:491-503.
199
Part 2: SVMs for multicategory data
• Crammer K, Singer Y: On the learnability and design of output codes for multiclass
problems. Proceedings of the Thirteenth Annual Conference on Computational Learning
Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe-Taylor J: Large margin DAGs for multiclass classification.
Advances in Neural Information Processing Systems (NIPS) 2000, 12:547-553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning.
Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of
multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition.
Proceedings of the Seventh European Symposium On Artificial Neural Networks 1999, 4:6.
200
Part 2: Support vector regression
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing
2004, 14:199-222.
201
Part 2: SVM-based variable selection
• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine
Learning Research 2003, 3:1157-1182.
• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using
support vector machines. Machine Learning 2002, 46:389-422.
• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based
feature selection. Proceedings of the Twenty First International Conference on Machine
Learning (ICML) 2004.
• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally
relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on
Causality and Feature Selection 2006.
• Tsamardinos I, Brown LE: Markov Blanket-Based Variable Selection in Feature Space.
Technical report DSL-08-01 2008.
• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for
SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668-674.
• Weston J, Elisseeff A, Scholkopf B, Tipping M: Use of the zero-norm with linear models
and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.
• Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. Advances in
Neural Information Processing Systems (NIPS) 2004, 16.
202
Part 2: Computing posterior class
probabilities for SVM classifiers
• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized
likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B,
Scholkopf B, Schuurmans D. Cambridge, MA: MIT press; 2000.
204
Part 3: Modeling clinical judgment
(Case Study 4)
• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the
diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium
Proceedings 2005:664-668.
205
Part 3: Outlier detection in ovarian cancer
proteomics data (Case Study 6)
• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in
serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777-785.
• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C,
Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer.
Lancet 2002, 359:572-577.
206
Thank you for your attention!
Questions/Comments?
Email: [Link]@[Link]
URL: [Link]
207