
Introduction

3
Basic principles of classification

• Want to classify objects as boats and houses.

10
Basic principles of classification

• All objects before the coast line are boats and all objects after the
coast line are houses.
• The coast line serves as a decision surface that separates the two classes.
Basic principles of classification
These boats will be misclassified as houses

12
Basic principles of classification

[Figure: boats and houses plotted by longitude and latitude]

• The methods that build classification models (i.e., "classification algorithms")
operate very similarly to the previous example.
• First, all objects are represented geometrically.
Basic principles of classification

[Figure: boats and houses plotted by longitude and latitude, separated by a line]

• Then the algorithm seeks to find a decision
surface that separates the classes of objects.
Basic principles of classification

[Figure: unseen objects (marked "?") plotted by longitude and latitude on either side of the decision surface]

• Unseen (new) objects are classified as "boats"
if they fall below the decision surface and as
"houses" if they fall above it.
The Support Vector Machine (SVM)
approach
• Support vector machines (SVMs) are a binary classification
method that offers a solution to problem #1.
• Extensions of the basic SVM algorithm can be applied to
solve problems #1-#5.
• SVMs are important because of (a) theoretical reasons:
- Robust to very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.

16
Main ideas of SVMs

[Figure: normal patients and cancer patients plotted by the expression of gene X and gene Y]

• Consider an example dataset described by 2 genes, gene X and gene Y.
• Represent patients geometrically (by "vectors").
Main ideas of SVMs

[Figure: normal and cancer patients separated by a hyperplane with a wide margin; the border-line patients are the support vectors]

• Find a linear decision surface ("hyperplane") that can separate
the patient classes and has the largest distance (i.e., largest "gap" or
"margin") between the border-line patients (i.e., "support vectors").
Main ideas of SVMs

[Figure: normal and cancer samples that are not linearly separable in the input space become separable by a decision surface after a kernel mapping]

• If such a linear decision surface does not exist, the data is mapped
into a much higher dimensional space ("feature space") where the
separating decision surface is found;
• The feature space is constructed via a very clever mathematical
projection ("kernel trick").
Necessary mathematical concepts

How to represent samples geometrically?
Vectors in n-dimensional space (R^n)

• Assume that a sample/patient is described by n characteristics
("features" or "variables").
• Representation: Every sample/patient is a vector in R^n with
its tail at the point with 0 coordinates and its arrow-head at the point with
the feature values.
• Example: Consider a patient described by 2 features:
Systolic BP = 110 and Age = 29.
This patient can be represented as a vector in R^2:

[Figure: the vector from the origin (0, 0) to the point (110, 29) in the Systolic BP-Age plane]
How to represent samples geometrically?
Vectors in n-dimensional space (R^n)

[Figure: four patients plotted as vectors in the 3-dimensional space spanned by Cholesterol, Systolic BP, and Age]

Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1          | 150                 | 110                | 35          | (0,0,0)            | (150, 110, 35)
2          | 250                 | 120                | 30          | (0,0,0)            | (250, 120, 30)
3          | 140                 | 160                | 65          | (0,0,0)            | (140, 160, 65)
4          | 300                 | 180                | 45          | (0,0,0)            | (300, 180, 45)
How to represent samples geometrically?
Vectors in n-dimensional space (R^n)

[Figure: the same four patient vectors depicted as points in 3-dimensional space]

Since we assume that the tail of each vector is at the point with 0
coordinates, we will also depict vectors as points (where the
arrow-head is pointing).
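As a concrete illustration (a minimal sketch added here, not part of the original slides), the four patients in the table above can be stored as NumPy vectors:

```python
import numpy as np

# Each patient is a vector in R^3: (Cholesterol, Systolic BP, Age),
# with its tail at the origin (0, 0, 0).
patients = np.array([
    [150, 110, 35],   # patient 1
    [250, 120, 30],   # patient 2
    [140, 160, 65],   # patient 3
    [300, 180, 45],   # patient 4
], dtype=float)

print(patients.shape)   # (4, 3): 4 patients, 3 features each
```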
Purpose of vector representation

• Having represented each sample/patient as a vector allows us
to geometrically represent the decision surface that
separates the two groups of samples/patients.

[Figure: a decision surface (a line) in R^2 and a decision surface (a plane) in R^3]

• In order to define the decision surface, we need to introduce
some basic math elements…
Basic operations on vectors in R^n

1. Multiplication by a scalar
Consider a vector a = (a1, a2, ..., an) and a scalar c.
Define: ca = (ca1, ca2, ..., can).
When you multiply a vector by a scalar, you "stretch" it in the
same or opposite direction depending on whether the scalar is
positive or negative.

Example: a = (1,2), c = 2  =>  ca = (2,4)
Example: a = (1,2), c = -1  =>  ca = (-1,-2)
Basic operations on vectors in R^n

2. Addition
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a + b = (a1 + b1, a2 + b2, ..., an + bn).

Example: a = (1,2), b = (3,0)  =>  a + b = (4,2)

Recall the addition of forces in classical mechanics.
Basic operations on vectors in R^n

3. Subtraction
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define: a - b = (a1 - b1, a2 - b2, ..., an - bn).

Example: a = (1,2), b = (3,0)  =>  a - b = (-2,2)

What vector do we need to add to b to get a? I.e., this is similar to
subtraction of real numbers.
Basic operations on vectors in R^n

4. Euclidean length or L2-norm
Consider a vector a = (a1, a2, ..., an).
Define the L2-norm: ||a||_2 = sqrt(a1^2 + a2^2 + ... + an^2).
We often denote the L2-norm without the subscript, i.e. ||a||.

Example: a = (1,2)  =>  ||a||_2 = sqrt(5) ≈ 2.24 (the length of this vector)

The L2-norm is a typical way to measure the length of a vector;
other methods to measure length also exist.
Basic operations on vectors in R^n

5. Dot product
Consider vectors a = (a1, a2, ..., an) and b = (b1, b2, ..., bn).
Define the dot product: a · b = a1b1 + a2b2 + ... + anbn = Σ_{i=1..n} ai bi.

The law of cosines says that a · b = ||a||_2 ||b||_2 cos θ, where
θ is the angle between a and b. Therefore, when the vectors
are perpendicular, a · b = 0.

Example: a = (1,2), b = (3,0)  =>  a · b = 3
Example: a = (0,2), b = (3,0)  =>  a · b = 0 (perpendicular vectors)
Basic operations on vectors in R^n

5. Dot product (continued)

a · b = a1b1 + a2b2 + ... + anbn = Σ_{i=1..n} ai bi

• Property: a · a = a1a1 + a2a2 + ... + anan = ||a||_2^2.
• In the classical regression equation y = w · x + b,
the response variable y is just the dot product of the
vector representing the patient characteristics (x) and
the regression weights vector (w), which is common
across all patients, plus an offset b.
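These basic operations map directly onto NumPy; the following minimal sketch (added here, not from the original slides) reproduces the numerical examples above:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(2 * a)               # scalar multiplication: [2. 4.]
print(a + b)               # addition:              [4. 2.]
print(a - b)               # subtraction:           [-2. 2.]
print(np.linalg.norm(a))   # L2-norm: sqrt(5) ~ 2.236
print(np.dot(a, b))        # dot product: 1*3 + 2*0 = 3.0

# Property a.a = ||a||^2 (True, up to floating-point rounding)
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))
```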
Hyperplanes as decision surfaces

• A hyperplane is a linear decision surface that splits the space
into two parts;
• It is obvious that a hyperplane is a binary classifier.

[Figure: a hyperplane in R^2 is a line; a hyperplane in R^3 is a plane]

A hyperplane in R^n is an (n-1)-dimensional subspace.


Equation of a hyperplane

Consider the case of R^3:

[Figure: a plane through the point P0 with normal vector w; P is an arbitrary point on the plane]

An equation of a hyperplane is defined
by a point (P0) and a vector (w) perpendicular
to the plane at that point.

Define the vectors x0 = OP0 and x = OP, where P is an arbitrary point on the hyperplane.
A condition for P to be on the plane is that the vector x - x0 is perpendicular to w:
    w · (x - x0) = 0, or
    w · x - w · x0 = 0.
Defining b = -w · x0, we obtain
    w · x + b = 0.
The above equations also hold for R^n when n > 3.
Equation of a hyperplane

Example:
    w = (4, -1, 6), P0 = (0, 1, -7)
    b = -w · x0 = -(0 - 1 - 42) = 43
    w · x + 43 = 0
    (4, -1, 6) · (x(1), x(2), x(3)) + 43 = 0
    4x(1) - x(2) + 6x(3) + 43 = 0

[Figure: the hyperplane w · x + 43 = 0 together with the parallel hyperplanes w · x + 50 = 0 (shifted in the + direction of w) and w · x + 10 = 0 (shifted in the - direction)]

What happens if the b coefficient changes?
The hyperplane moves along the direction of w;
we obtain "parallel hyperplanes".

The distance between two parallel hyperplanes w · x + b1 = 0 and w · x + b2 = 0
is D = |b1 - b2| / ||w||_2.
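A quick numerical check of this example (a sketch added here, not from the slides), using NumPy:

```python
import numpy as np

w = np.array([4.0, -1.0, 6.0])   # normal vector of the hyperplane
P0 = np.array([0.0, 1.0, -7.0])  # a point known to lie on the hyperplane

b = -np.dot(w, P0)
print(b)                          # 43.0, so the hyperplane is w.x + 43 = 0

# Distance between the parallel hyperplanes w.x + 50 = 0 and w.x + 10 = 0
b1, b2 = 50.0, 10.0
D = abs(b1 - b2) / np.linalg.norm(w)
print(D)                          # |50 - 10| / ||w|| ~ 5.49
```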
(Derivation of the distance between two parallel hyperplanes)

[Figure: two parallel hyperplanes w · x + b1 = 0 and w · x + b2 = 0; a point x1 on the first is translated along w by t·w to a point x2 on the second]

    x2 = x1 + t·w
    D = ||t·w|| = |t| ||w||
    w · x2 + b2 = 0
    w · (x1 + t·w) + b2 = 0
    w · x1 + t ||w||^2 + b2 = 0
    (w · x1 + b1) - b1 + t ||w||^2 + b2 = 0
    -b1 + t ||w||^2 + b2 = 0
    t = (b1 - b2) / ||w||^2
    D = |t| ||w|| = |b1 - b2| / ||w||
Recap

We know…
• How to represent patients (as "vectors")
• How to define a linear decision surface ("hyperplane")

We need to know…
• How to efficiently compute the hyperplane that separates
two classes with the largest "gap"?

[Figure: normal and cancer patients in the gene X-gene Y plane, separated by a maximum-margin hyperplane]

We need to introduce the basics of the relevant optimization theory.
Basics of optimization:
Convex functions

• A function is called convex if, for any two points in the interval,
the function lies below the straight line segment connecting those two points.
• Property: Any local minimum is a global minimum!

[Figure: a convex function with a single global minimum, and a non-convex function with a local minimum distinct from the global minimum]
Basics of optimization:
Quadratic programming (QP)
• Quadratic programming (QP) is a special
optimization problem: the function to optimize
(“objective”) is quadratic, subject to linear
constraints.
• Convex QP problems have convex objective
functions.
• These problems can be solved easily and efficiently
by greedy algorithms (because every local
minimum is a global minimum).

39
Basics of optimization:
Example QP problem

Consider x = (x1, x2).

Minimize (1/2)||x||_2^2   subject to   x1 + x2 - 1 >= 0
  (quadratic objective)                (linear constraint)

This is a QP problem, and it is a convex QP, as we will see later.

We can rewrite it as:

Minimize (1/2)(x1^2 + x2^2)   subject to   x1 + x2 - 1 >= 0
  (quadratic objective)                    (linear constraint)
Basics of optimization:
Example QP problem

[Figure: the paraboloid f(x1, x2) = (1/2)(x1^2 + x2^2) and the constraint line x1 + x2 - 1 = 0; the constrained minimum lies where the line touches the lowest contour]

The solution is x1 = 1/2 and x2 = 1/2.
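This toy problem can be checked numerically; the following sketch (added for illustration, using SciPy's generic SLSQP solver rather than a dedicated QP solver) recovers the same solution:

```python
import numpy as np
from scipy.optimize import minimize

# Objective: (1/2)(x1^2 + x2^2)
def objective(x):
    return 0.5 * np.sum(x ** 2)

# Linear inequality constraint: x1 + x2 - 1 >= 0
constraint = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}

result = minimize(objective, x0=np.zeros(2), constraints=[constraint])
print(result.x)   # approximately [0.5, 0.5]
```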
Quiz

1) Consider the hyperplane shown in white. It is defined by the
equation: w · x + 10 = 0.
Which of the three other hyperplanes can be defined by the
equation: w · x + 3 = 0?

- Orange
- Green
- Yellow

[Figure: the white hyperplane through P0 with normal vector w, and three parallel hyperplanes (orange, green, yellow)]

2) What is the dot product between the
vectors a = (3,3) and b = (1,-1)?

[Figure: the vectors a and b drawn in the plane]
Quiz

3) What is the dot product between the
vectors a = (3,3) and b = (1,0)?

[Figure: the vectors a and b drawn in the plane]

4) What is the length of the vector
a = (2,0), and what is the length of
all the other red vectors in the figure?

[Figure: the vector a = (2,0) and several other red vectors]
Quiz

5) Which of the four functions is/are convex?

[Figure: four function graphs labeled 1-4]
Support vector machines for binary
classification: classical formulation

46
Case 1: Linearly separable data;
"Hard-margin" linear SVM

Given training data: x1, x2, ..., xN ∈ R^n with labels y1, y2, ..., yN ∈ {-1, +1}.

• We want to find a classifier
(hyperplane) that separates the
negative instances from the
positive ones.
• An infinite number of such
hyperplanes exist.
• SVMs find the hyperplane that
maximizes the gap between the
data points on the boundaries
(the so-called "support vectors").
• If the points on the boundaries
are not informative (e.g., due to
noise), SVMs will not do well.

[Figure: negative instances (y = -1) and positive instances (y = +1) separated by a maximum-margin hyperplane]
Statement of linear SVM classifier

[Figure: the separating hyperplane w · x + b = 0 with the two margin hyperplanes w · x + b = -1 and w · x + b = +1 passing through the support vectors]

The gap is the distance between the parallel hyperplanes:
    w · x + b = -1  and  w · x + b = +1,
or equivalently:
    w · x + (b + 1) = 0  and  w · x + (b - 1) = 0.

We know that the distance between parallel hyperplanes is
    D = |b1 - b2| / ||w||_2.
Therefore:
    D = 2 / ||w||_2.

Since we want to maximize the gap,
we need to minimize ||w||_2,
or equivalently minimize (1/2)||w||_2^2
(the 1/2 is convenient for taking derivatives later on).
Statement of linear SVM classifier

[Figure: the separating hyperplane w · x + b = 0 and the margin hyperplanes w · x + b = ±1]

In addition we need to
impose constraints that all
instances are correctly
classified. In our case:
    w · xi + b >= +1  if yi = +1
    w · xi + b <= -1  if yi = -1
Equivalently:
    yi (w · xi + b) >= 1

In summary, we want to minimize (1/2)||w||_2^2 subject to yi (w · xi + b) >= 1 for i = 1,…,N.
Then, given a new instance x, the classifier is f(x) = sign(w · x + b).
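As an illustration (a minimal sketch using scikit-learn, which is not referenced in the slides), hard-margin behaviour can be approximated by a linear SVM with a very large C:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters in R^2
X = np.array([[1, 1], [2, 1], [1, 2],      # negative class
              [4, 4], [5, 4], [4, 5]])     # positive class
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)     # the learned w and b
print(clf.support_vectors_)          # the border-line points ("support vectors")
print(clf.predict([[3, 3]]))         # f(x) = sign(w.x + b)
```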
SVM optimization problem:
Primal formulation

Minimize (1/2) Σ_{i=1..n} wi^2   subject to   yi (w · xi + b) - 1 >= 0  for i = 1,…,N
  (objective function)                        (constraints)

• This is called the "primal formulation of linear SVMs".
• It is a convex quadratic programming (QP)
optimization problem with n variables (wi, i = 1,…,n),
where n is the number of features in the dataset.
SVM optimization problem:
Dual formulation

• The previous problem can be recast in the so-called "dual
form", giving rise to the "dual formulation of linear SVMs".
• It is also a convex quadratic programming problem, but with
N variables (αi, i = 1,…,N), where N is the number of
samples.

Maximize Σ_{i=1..N} αi - (1/2) Σ_{i,j=1..N} αi αj yi yj (xi · xj)
  subject to αi >= 0 and Σ_{i=1..N} αi yi = 0.
  (objective function and constraints)

Then the w-vector is defined in terms of the αi:  w = Σ_{i=1..N} αi yi xi.
And the solution becomes:  f(x) = sign( Σ_{i=1..N} αi yi (xi · x) + b ).
SVM optimization problem:
Benefits of using dual formulation

1) No need to access the original data; we only need to access dot
products.

Objective function:  Σ_{i=1..N} αi - (1/2) Σ_{i,j=1..N} αi αj yi yj (xi · xj)
Solution:            f(x) = sign( Σ_{i=1..N} αi yi (xi · x) + b )

2) The number of free parameters is bounded by the number
of support vectors and not by the number of variables
(beneficial for high-dimensional problems).
E.g., if a microarray dataset contains 20,000 genes and 100
patients, then we need to find only up to 100 parameters!
(Derivation of dual formulation)

Minimize (1/2) Σ_{i=1..n} wi^2   subject to   yi (w · xi + b) - 1 >= 0  for i = 1,…,N
  (objective function)                        (constraints)

Apply the method of Lagrange multipliers.

Define the Lagrangian
    L_P(w, b, α) = (1/2)||w||_2^2 - Σ_{i=1..N} αi [ yi (w · xi + b) - 1 ],
where w is a vector with n elements and α is a vector with N elements.

We need to minimize this Lagrangian with respect to w and b, and simultaneously
require that the derivative with respect to α vanishes, all subject to the
constraints αi >= 0.
(Derivation of dual formulation)

If we set the derivatives with respect to w and b to 0, we obtain:

    ∂L_P(w, b, α)/∂b = 0   =>   Σ_{i=1..N} αi yi = 0
    ∂L_P(w, b, α)/∂w = 0   =>   w = Σ_{i=1..N} αi yi xi

We substitute the above into the equation for L_P(w, b, α) and obtain the "dual
formulation of linear SVMs":

    L_D = Σ_{i=1..N} αi - (1/2) Σ_{i,j=1..N} αi αj yi yj (xi · xj)

We seek to maximize the above Lagrangian with respect to α, subject to the
constraints that αi >= 0 and Σ_{i=1..N} αi yi = 0.
Case 2: Not linearly separable data;
"Soft-margin" linear SVM

What if the data is not linearly
separable? E.g., there are
outliers or noisy measurements,
or the data is slightly non-linear.

[Figure: two classes that overlap slightly, so no hyperplane separates them perfectly]

We want to handle this case without changing
the family of decision functions.

Approach:
Assign a "slack variable" ξi >= 0 to each instance; it can be thought of as the distance from
the separating hyperplane if the instance is misclassified, and 0 otherwise.

Minimize (1/2)||w||^2 + C Σ_{i=1..N} ξi   subject to   yi (w · xi + b) >= 1 - ξi  for i = 1,…,N.

Then, given a new instance x, the classifier is f(x) = sign(w · x + b).
Two formulations of soft-margin
linear SVM

Primal formulation:

Minimize (1/2) Σ_{i=1..n} wi^2 + C Σ_{i=1..N} ξi
  subject to yi (w · xi + b) >= 1 - ξi  for i = 1,…,N.

Dual formulation:

Maximize Σ_{i=1..N} αi - (1/2) Σ_{i,j=1..N} αi αj yi yj (xi · xj)
  subject to 0 <= αi <= C and Σ_{i=1..N} αi yi = 0, for i = 1,…,N.
Parameter C in soft-margin SVM

Minimize (1/2)||w||^2 + C Σ_{i=1..N} ξi   subject to   yi (w · xi + b) >= 1 - ξi  for i = 1,…,N.

• When C is very large, the soft-
margin SVM is equivalent to the
hard-margin SVM;
• When C is very small, we
admit misclassifications in the
training data in exchange for
a w-vector with small norm;
• C has to be selected for the
distribution at hand, as will
be discussed later in this
tutorial.

[Figure: decision boundaries and margins obtained with C = 100, C = 1, C = 0.15, and C = 0.1]
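To see the effect of C in practice, a small scikit-learn sketch (added here for illustration) fits the same data with different C values and compares the margin width 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

for C in [100, 1, 0.15, 0.1]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)   # the gap between the margin hyperplanes
    n_sv = len(clf.support_)           # number of support vectors
    print(f"C={C:>6}: margin={margin:.2f}, support vectors={n_sv}")
```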
Case 3: Not linearly separable data;
Kernel trick

[Figure: left panel - tumor and normal samples in the gene 1-gene 2 plane, not linearly separable in the input space; right panel - the same data, linearly separable in the feature space obtained by a kernel mapping Φ: R^N → H]
Kernel trick

Original data x (in input space):
    f(x) = sign(w · x + b),   w = Σ_{i=1..N} αi yi xi

Data in a higher-dimensional feature space Φ(x):
    f(x) = sign(w · Φ(x) + b),   w = Σ_{i=1..N} αi yi Φ(xi)

    f(x) = sign( Σ_{i=1..N} αi yi Φ(xi) · Φ(x) + b )
    f(x) = sign( Σ_{i=1..N} αi yi K(xi, x) + b )

Therefore, we do not need to know Φ explicitly; we just need to
define a kernel function K(·,·): R^N × R^N → R.
Not every function R^N × R^N → R can be a valid kernel; it has to satisfy the so-called
Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
Popular kernels

A kernel is a dot product in some feature space:
    K(xi, xj) = Φ(xi) · Φ(xj)

Examples:
    K(xi, xj) = xi · xj                                    Linear kernel
    K(xi, xj) = exp(-γ ||xi - xj||^2)                      Gaussian kernel
    K(xi, xj) = exp(-γ ||xi - xj||)                        Exponential kernel
    K(xi, xj) = (p + xi · xj)^q                            Polynomial kernel
    K(xi, xj) = (p + xi · xj)^q exp(-γ ||xi - xj||^2)      Hybrid kernel
    K(xi, xj) = tanh(k xi · xj)                            Sigmoidal kernel
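These kernels are easy to write down directly; the sketch below (illustrative only, not from the slides) implements the Gaussian and polynomial kernels by hand and checks them against scikit-learn's built-in versions:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 samples, 3 features
gamma, p, q = 0.5, 1.0, 3            # example kernel parameters

def gaussian(xi, xj):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def polynomial(xi, xj):
    return (p + np.dot(xi, xj)) ** q

# Compare the hand-written kernel values with scikit-learn's implementations
print(np.isclose(gaussian(X[0], X[1]),
                 rbf_kernel(X[[0]], X[[1]], gamma=gamma)[0, 0]))
print(np.isclose(polynomial(X[0], X[1]),
                 polynomial_kernel(X[[0]], X[[1]], degree=q, gamma=1.0, coef0=p)[0, 0]))
```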
Understanding the Gaussian kernel

Consider the Gaussian kernel: K(x, xj) = exp(-γ ||x - xj||^2).
Geometrically, this is a "bump" or "cavity" centered at the
training data point xj.

[Figure: a "bump" and a "cavity" in the kernel surface; the resulting mapping function is a combination of bumps and cavities]
Understanding the Gaussian kernel

[Figure: several more views of how the data is mapped to the feature space by the Gaussian kernel]

Understanding the Gaussian kernel

[Figure: a linear hyperplane in the feature space that separates the two classes]
Understanding the polynomial kernel

Consider the polynomial kernel: K(xi, xj) = (1 + xi · xj)^3.

Assume that we are dealing with 2-dimensional data
(i.e., in R^2). Where will this kernel map the data?

The kernel maps the 2-dimensional space (x(1), x(2)) into a
10-dimensional space whose coordinates correspond (up to scaling factors) to the monomials:
    1, x(1), x(2), x(1)^2, x(2)^2, x(1)x(2), x(1)^3, x(2)^3, x(1)x(2)^2, x(1)^2x(2)
Example of benefits of using a kernel

[Figure: four points x1, x2, x3, x4 in the (x(1), x(2)) plane that are not linearly separable]

• The data is not linearly separable
in the input space (R^2).
• Apply the kernel K(x, z) = (x · z)^2
to map the data to a higher-dimensional
space (3-dimensional) where it is
linearly separable.

K(x, z) = (x · z)^2 = (x(1)z(1) + x(2)z(2))^2
        = x(1)^2 z(1)^2 + 2 x(1)z(1) x(2)z(2) + x(2)^2 z(2)^2
        = (x(1)^2, sqrt(2) x(1)x(2), x(2)^2) · (z(1)^2, sqrt(2) z(1)z(2), z(2)^2)
        = Φ(x) · Φ(z)
Example of benefits of using a kernel

Therefore, the explicit mapping is
    Φ(x) = ( x(1)^2, sqrt(2) x(1)x(2), x(2)^2 ).

[Figure: the four points x1, x2, x3, x4, which are not linearly separable in the (x(1), x(2)) plane, become linearly separable after the mapping to the (x(1)^2, sqrt(2) x(1)x(2), x(2)^2) space]
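The identity K(x, z) = Φ(x) · Φ(z) for this kernel can be verified numerically; the following short sketch (added for illustration) does so for random 2-dimensional points:

```python
import numpy as np

def K(x, z):
    """Polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map for K: R^2 -> R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)

print(K(x, z))                 # kernel computed in the input space
print(np.dot(phi(x), phi(z)))  # same value computed via the explicit mapping
```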
Comparison with methods from classical
statistics & regression

• Classical polynomial regression needs roughly 5 samples per parameter of the
model to be estimated:

Number of variables | Polynomial degree | Number of parameters | Required sample
2                   | 3                 | 10                   | 50
10                  | 3                 | 286                  | 1,430
10                  | 5                 | 3,003                | 15,015
100                 | 3                 | 176,851              | 884,255
100                 | 5                 | 96,560,646           | 482,803,230

• SVMs do not have such a requirement and often require a
much smaller sample than the number of variables, even
when a high-degree polynomial kernel is used.
Basic principles of statistical
machine learning

68
Generalization and overfitting
• Generalization: A classifier or a regression algorithm
learns to correctly predict output from given inputs
not only in previously seen samples but also in
previously unseen samples.
• Overfitting: A classifier or a regression algorithm
learns to correctly predict output from given inputs
in previously seen samples but fails to do so in
previously unseen samples.
• Overfitting => poor generalization.

69
Example of overfitting and generalization

There is a linear relationship between the predictor and the outcome (plus some Gaussian noise).

[Figure: training data and test data plotted as Outcome of Interest Y versus Predictor X, with the fits of Algorithm 1 and Algorithm 2]

• Algorithm 1 learned non-reproducible peculiarities of the specific sample
available for learning but did not learn the general characteristics of the function
that generated the data. Thus, it is overfitted and has poor generalization.
• Algorithm 2 learned general characteristics of the function that produced the
data. Thus, it generalizes.
"Loss + penalty" paradigm for learning to
avoid overfitting and ensure generalization

• Many statistical learning algorithms (including SVMs)
search for a decision function by solving the following
optimization problem:

    Minimize (Loss + λ Penalty)

• Loss measures the error of fitting the data
• Penalty penalizes the complexity of the learned function
• λ is a regularization parameter that balances Loss and Penalty
SVMs in "loss + penalty" form

SVMs build classifiers of the form f(x) = sign(w · x + b).

Consider the soft-margin linear SVM formulation:

Find w and b that
    minimize (1/2)||w||^2 + C Σ_{i=1..N} ξi
    subject to yi (w · xi + b) >= 1 - ξi  for i = 1,…,N.

This can also be stated as: find w and b that

    minimize Σ_{i=1..N} [1 - yi f(xi)]_+  +  λ ||w||_2^2
             (Loss: "hinge loss")            (Penalty)

(in fact, one can show that λ = 1/(2C)).
Meaning of SVM loss function

Consider the loss function: Σ_{i=1..N} [1 - yi f(xi)]_+

• Recall that [...]_+ indicates the positive part
• For a given sample/patient i, the loss is non-zero if 1 - yi f(xi) > 0
• In other words, if yi f(xi) < 1
• Since yi ∈ {-1, +1}, this means that the loss is non-zero if
    f(xi) < 1 for yi = +1
    f(xi) > -1 for yi = -1
• In other words, the loss is non-zero if
    w · xi + b < 1 for yi = +1
    w · xi + b > -1 for yi = -1
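A minimal sketch (added for illustration) of computing the hinge loss for a set of predictions:

```python
import numpy as np

def hinge_loss(y, scores):
    """Total hinge loss: sum over samples of max(0, 1 - y_i * f(x_i))."""
    return np.sum(np.maximum(0.0, 1.0 - y * scores))

y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.3, -1.5, 0.4])   # f(x_i) = w.x_i + b for each sample

# Only samples with y_i * f(x_i) < 1 contribute:
# sample 2 contributes 0.7 and sample 4 contributes 1.4, so the total is 2.1
print(hinge_loss(y, scores))
```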
Meaning of SVM loss function

[Figure: the hyperplanes w · x + b = -1, w · x + b = 0, and w · x + b = 1 divide the space into regions labeled 1-4, with negative instances (y = -1) on one side and positive instances (y = +1) on the other]

• If an instance is negative, it is penalized only in regions 2, 3, 4.
• If an instance is positive, it is penalized only in regions 1, 2, 3.
Flexibility of "loss + penalty" framework

Minimize (Loss + λ Penalty)

Loss function                                     | Penalty function           | Resulting algorithm
Hinge loss: Σ_{i=1..N} [1 - yi f(xi)]_+           | λ ||w||_2^2                | SVMs
Mean squared error: Σ_{i=1..N} (yi - f(xi))^2     | λ ||w||_2^2                | Ridge regression
Mean squared error: Σ_{i=1..N} (yi - f(xi))^2     | λ ||w||_1                  | Lasso
Mean squared error: Σ_{i=1..N} (yi - f(xi))^2     | λ1 ||w||_1 + λ2 ||w||_2^2  | Elastic net
Hinge loss: Σ_{i=1..N} [1 - yi f(xi)]_+           | λ ||w||_1                  | 1-norm SVM
Part 2

• Model selection for SVMs


• Extensions to the basic SVM model:
1. SVMs for multicategory data
2. Support vector regression
3. Novelty detection with SVM-based methods
4. Support vector clustering
5. SVM-based variable selection
6. Computing posterior class probabilities for SVM
classifiers

76
Model selection for SVMs

77
Need for model selection for SVMs

[Figure: two gene-expression datasets in the gene 1-gene 2 plane; in the left dataset tumors and normals cannot be separated by a line, while in the right dataset they can]

• Left: it is impossible to find a linear SVM classifier that separates tumors from normals!
We need a non-linear SVM classifier; e.g., an SVM with a polynomial kernel of degree 2 solves
this problem without errors.
• Right: we should not apply a non-linear SVM classifier when we can perfectly solve
this problem using a linear SVM classifier!
A data-driven approach for
model selection for SVMs

• We do not know a priori what type of SVM kernel and what kernel
parameter(s) to use for a given dataset.
• We need to examine various combinations of parameters, e.g.
consider searching the following grid of (parameter C, polynomial degree d) pairs:

(0.1, 1) (1, 1) (10, 1) (100, 1) (1000, 1)
(0.1, 2) (1, 2) (10, 2) (100, 2) (1000, 2)
(0.1, 3) (1, 3) (10, 3) (100, 3) (1000, 3)
(0.1, 4) (1, 4) (10, 4) (100, 4) (1000, 4)
(0.1, 5) (1, 5) (10, 5) (100, 5) (1000, 5)

• How do we search this grid while producing an unbiased estimate
of classification performance?
Nested cross-validation

Recall the main idea of cross-validation: the data is split into folds, and each fold in
turn serves as the test set while the remaining folds are used for training.

What combination of SVM parameters should be applied on the training data?
Perform a "grid search" using another nested loop of cross-validation: each training set
is further split into training and validation parts in order to select the parameters.
Example of nested cross-validation

Consider that we use 3-fold cross-validation (folds P1, P2, P3) and we want to
optimize the parameter C, which takes values "1" and "2".

Outer loop:
Training set | Testing set | C | Accuracy | Average accuracy
P1, P2       | P3          | 1 | 89%      |
P1, P3       | P2          | 2 | 84%      | 83%
P2, P3       | P1          | 1 | 76%      |

Inner loop (shown for the first outer split, training data P1 and P2):
Training set | Validation set | C | Accuracy | Average accuracy
P1           | P2             | 1 | 86%      | 85%  -> choose C = 1
P2           | P1             | 1 | 84%      |
P1           | P2             | 2 | 70%      | 80%
P2           | P1             | 2 | 90%      |
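A compact way to implement this scheme (a sketch using scikit-learn utilities, added here for illustration) is to place a grid search inside an outer cross-validation loop:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: grid search over C and polynomial degree (the grid from the slide)
param_grid = {"svc__C": [0.1, 1, 10, 100, 1000], "svc__degree": [1, 2, 3, 4, 5]}
model = make_pipeline(MinMaxScaler(), SVC(kernel="poly"))
inner = GridSearchCV(model, param_grid, cv=3)

# Outer loop: unbiased estimate of the performance of the tuned classifier
scores = cross_val_score(inner, X, y, cv=3)
print(scores.mean())
```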
On use of cross-validation

• Empirically we found that cross-validation works well


for model selection for SVMs in many problem
domains;
• Many other approaches that can be used for model
selection for SVMs exist, e.g.:
Generalized cross-validation
Bayesian information criterion (BIC)
Minimum description length (MDL)
Vapnik-Chervonenkis (VC) dimension
Bootstrap

82
SVMs for multicategory data

83
One-versus-rest multicategory
SVM method

[Figure: three tumor classes (Tumor I, Tumor II, Tumor III) in the gene 1-gene 2 plane; each class is separated from the remaining classes by its own hyperplane, and query samples are marked "?"]
One-versus-one multicategory
SVM method

[Figure: three tumor classes in the gene 1-gene 2 plane; a separate hyperplane is built for every pair of classes, and a query sample is marked "?"]
DAGSVM multicategory
SVM method

[Figure: a decision DAG over the classes AML, ALL B-cell, and ALL T-cell. The root node applies the classifier "AML vs. ALL T-cell"; depending on the outcome ("Not AML" or "Not ALL T-cell"), the next node applies "ALL B-cell vs. ALL T-cell" or "AML vs. ALL B-cell", and the leaves give the final prediction (ALL T-cell, ALL B-cell, or AML)]
SVM multicategory methods by Weston
and Watkins and by Crammer and Singer

[Figure: three classes in the gene 1-gene 2 plane separated by decision surfaces obtained from a single multi-class optimization; a query sample is marked "?"]
Support vector regression

88
ε-Support vector regression (ε-SVR)

Given training data: x1, x2, ..., xN ∈ R^n with responses y1, y2, ..., yN ∈ R.

[Figure: data points and a fitted line surrounded by a tube of width ε on each side]

Main idea:
Find a function f(x) = w · x + b that approximates y1,…,yN:
• it has at most ε deviation from the true values yi
• it is as "flat" as possible (to avoid overfitting)

E.g., build a model to predict the survival of cancer patients that
can admit a one-month error (= ε).
Formulation of "hard-margin" ε-SVR

[Figure: data points inside the ε-ribbon around the fitted function f(x) = w · x + b]

Find f(x) = w · x + b by minimizing (1/2)||w||^2
subject to the constraints:
    yi - (w · xi + b) <= ε
    (w · xi + b) - yi <= ε
for i = 1,…,N.

I.e., the difference between yi and the fitted function should be smaller
than ε and larger than -ε: all points yi should be in the "ε-ribbon"
around the fitted function.
Formulation of "soft-margin" ε-SVR

[Figure: data points, a few of which fall outside the ε-ribbon around the fitted function]

If we have points like this
(e.g., outliers or noise), we
can either:
a) increase ε to ensure that
these points are within the
new ε-ribbon, or
b) assign a penalty ("slack"
variable) to each of these
points (as was done for
"soft-margin" SVMs).
Formulation of "soft-margin" ε-SVR

[Figure: points outside the ε-ribbon, each with a slack ξi or ξi* measuring how far outside it lies]

Find f(x) = w · x + b by minimizing
    (1/2)||w||^2 + C Σ_{i=1..N} (ξi + ξi*)
subject to the constraints:
    yi - (w · xi + b) <= ε + ξi
    (w · xi + b) - yi <= ε + ξi*
    ξi, ξi* >= 0
for i = 1,…,N.

Notice that only points outside the ε-ribbon are penalized!
Nonlinear ε-SVR

[Figure: a strongly non-linear relationship between x and y that cannot be approximated well with a small ε by a linear function; after a kernel mapping Φ(x), the same data is fitted by a linear ε-ribbon in the feature space]

We cannot approximate this function well with a small ε in the input space;
applying the kernel trick, the linear ε-SVR is fitted in the feature space.
ε-Support vector regression in
"loss + penalty" form

Build a decision function of the form f(x) = w · x + b.
Find w and b that

    minimize Σ_{i=1..N} max(0, |yi - f(xi)| - ε)  +  λ ||w||_2^2
             (Loss: "linear ε-insensitive loss")     (Penalty)

[Figure: the linear ε-insensitive loss plotted against the approximation error; it is zero inside [-ε, ε] and grows linearly outside]
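A short scikit-learn sketch (added for illustration) of fitting a linear ε-SVR and inspecting which training points fall outside the ε-ribbon:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=80)   # linear trend + noise

eps = 1.0
svr = SVR(kernel="linear", C=10.0, epsilon=eps).fit(X, y)

residuals = np.abs(y - svr.predict(X))
print(np.sum(residuals > eps))      # number of points outside the eps-ribbon
print(svr.coef_, svr.intercept_)    # the fitted w and b
```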
Comparing ε-SVR with popular
regression methods

Loss function                                                       | Penalty function | Resulting algorithm
Linear ε-insensitive loss: Σ_{i=1..N} max(0, |yi - f(xi)| - ε)      | λ ||w||_2^2      | ε-SVR
Quadratic ε-insensitive loss: Σ_{i=1..N} max(0, |yi - f(xi)| - ε)^2 | λ ||w||_2^2      | Another variant of ε-SVR
Mean squared error: Σ_{i=1..N} (yi - f(xi))^2                       | λ ||w||_2^2      | Ridge regression
Mean linear error: Σ_{i=1..N} |yi - f(xi)|                          | λ ||w||_2^2      | Another variant of ridge regression
Comparing loss functions of regression
methods

[Figure: four loss functions plotted against the approximation error: linear ε-insensitive loss, quadratic ε-insensitive loss, mean squared error, and mean linear error; the ε-insensitive losses are zero inside the interval [-ε, ε]]
Applying ε-SVR to real data

In the absence of domain knowledge about decision
functions, it is recommended to optimize the following
parameters (e.g., by cross-validation using grid-search):

• parameter C
• parameter ε
• kernel parameters (e.g., degree of polynomial)

Notice that the parameter ε depends on the ranges of the
variables in the dataset; therefore it is recommended to
normalize/re-scale the data prior to applying ε-SVR.
Novelty detection with SVM-based
methods

98
What is it about?

• Find the simplest and most
compact region in the space of
predictors where the majority
of data samples "live" (i.e.,
with the highest density of
samples).
• Build a decision function that
takes value +1 in this region
and -1 elsewhere.
• Once we have such a decision
function, we can identify novel
or outlier samples/patients in
the data.

[Figure: a dense cloud of samples in the Predictor X-Predictor Y plane; the decision function equals +1 inside the region covering the cloud and -1 elsewhere, where a few isolated samples lie]
Key assumptions

• We do not know the classes/labels of samples (positive
or negative) in the data available for learning
=> this is not a classification problem.
• All positive samples are similar, but each negative
sample can be different in its own way.

Thus, we do not need to collect data for negative samples!
Sample applications

[Figure: images of "normal" objects and several "novel" ones]

Modified from: [Link]/course/2004/learns/[Link]
Sample applications

Discover deviations in the sample handling protocol when
doing quality control of assays.

[Figure: samples plotted by Protein X and Protein Y. A dense cluster of samples with high-quality processing from the lab of Dr. Smith is surrounded by smaller groups of samples with low-quality processing: generic low-quality samples, samples from infants, samples from ICU patients, and samples from patients with lung cancer]
Sample applications

Identify websites that discuss benefits of proven cancer
treatments.

[Figure: websites plotted by the weighted frequency of word X and word Y. A dense cluster of websites that discuss benefits of proven cancer treatments is surrounded by smaller groups: websites that discuss cancer prevention methods, websites that discuss side-effects of proven cancer treatments, blogs of cancer patients, and websites that discuss unproven cancer treatments]
One-class SVM

Main idea: Find the maximal-gap hyperplane that separates the data from
the origin (i.e., the only member of the second class is the origin).

[Figure: data points separated from the origin by the hyperplane w · x - b = 0; "slack variables" ξj are used, as in soft-margin SVMs, to penalize the instances that fall on the origin's side]
Formulation of one-class SVM:
linear case

Given training data: x1, x2, ..., xN ∈ R^n.

Find f(x) = sign(w · x - b)
by minimizing
    (1/2)||w||^2 + (1/(νN)) Σ_{i=1..N} ξi - b
subject to the constraints:
    w · xi >= b - ξi
    ξi >= 0
for i = 1,…,N.

Here ν is an upper bound on the fraction of outliers
(i.e., points outside the decision surface) allowed in the data.

I.e., the decision function should be positive in all training samples
except for small deviations.

[Figure: the hyperplane w · x - b = 0 separating the training data from the origin, with a few instances (slacks ξj) on the wrong side]
Formulation of one-class SVM:
linear and non-linear cases

Linear case:
Find f(x) = sign(w · x - b) by minimizing
    (1/2)||w||^2 + (1/(νN)) Σ_{i=1..N} ξi - b
subject to the constraints:
    w · xi >= b - ξi,  ξi >= 0,  for i = 1,…,N.

Non-linear case (use the "kernel trick"):
Find f(x) = sign(w · Φ(x) - b) by minimizing
    (1/2)||w||^2 + (1/(νN)) Σ_{i=1..N} ξi - b
subject to the constraints:
    w · Φ(xi) >= b - ξi,  ξi >= 0,  for i = 1,…,N.
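A minimal sketch (added for illustration) using scikit-learn's OneClassSVM, where the nu parameter plays the role of the outlier-fraction bound described above:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" samples only

# nu: upper bound on the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

X_new = np.array([[0.1, -0.2],    # looks like the training data
                  [6.0, 6.0]])    # far away -> novel
print(detector.predict(X_new))    # +1 for "normal", -1 for "novel"
```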
More about one-class SVM

• One-class SVMs inherit most of the properties of SVMs for
binary classification (e.g., the "kernel trick", sample
efficiency, ease of finding a solution by efficient
optimization methods, etc.);
• The choice of the ν parameter significantly affects
the resulting decision surface.
• The choice of origin is arbitrary and also significantly
affects the decision surface returned by the algorithm.
Support vector clustering

Contributed by Nikita Lytkin

108
Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x1, x2, ..., xN ∈ R^n,
assign labels y1, y2, ..., yN ∈ {1, 2, ..., K} such that points
with the same label are highly "similar" to each other
and are distinctly different from the rest.

[Figure: an unlabeled cloud of points before the clustering process and the same points grouped into clusters afterwards]

Support vector domain description

• The Support Vector Domain Description (SVDD) of the data is
the set of vectors lying on the surface of the smallest
hyper-sphere enclosing all data points in a feature space.
– These surface points are called Support Vectors.

[Figure: the data mapped by a kernel into a feature space, where it is enclosed by the smallest hyper-sphere of radius R; the points on the sphere's surface are the support vectors]
SVDD optimization criterion

Formulation with hard constraints:

Minimize R^2   subject to   ||Φ(xi) - a||^2 <= R^2  for i = 1,…,N
  (squared radius of the sphere)   (constraints)

[Figure: two enclosing spheres; the smallest sphere of radius R centered at a encloses all the mapped points, while a larger sphere of radius R' centered at a' is not minimal]
Main idea behind Support Vector
Clustering

• Cluster boundaries in the input space are formed by the set of
points that, when mapped from the input space to the feature
space, fall exactly on the surface of the minimal enclosing
hyper-sphere.
– The SVs identified by SVDD are a subset of the cluster boundary points.

[Figure: the minimal enclosing hyper-sphere of radius R centered at a in the feature space, and the corresponding cluster contours in the input space]
Cluster assignment in SVC

• Two points xi and xj belong to the same cluster (i.e., have
the same label) if every point of the line segment (xi, xj),
projected to the feature space, lies within the hyper-sphere.

[Figure: for a pair of points in different clusters, some points of the connecting segment map outside the hyper-sphere; for a pair in the same cluster, every point of the segment is within the hyper-sphere in the feature space]
Cluster assignment in SVC (continued)

• A point-wise adjacency matrix is
constructed by testing the line
segments between every pair of points.
• Connected components are extracted.
• Points belonging to the same
connected component are
assigned the same label.

Example adjacency matrix for points A-E (upper triangle; 1 = segment stays inside the hyper-sphere):

    A B C D E
A   1 1 1 0 0
B     1 1 0 0
C       1 0 0
D         1 1
E           1

[Figure: points A, B, C form one connected component (one cluster) and D, E form another]
Effects of noise and cluster overlap

• In practice, data often contains noise, outlier points, and
overlapping clusters, which would prevent contour
separation and result in all points being assigned to the
same cluster.

[Figure: "ideal data" with well-separated clusters versus "typical data" containing noise, outliers, and overlapping clusters]
SVDD with soft constraints

• SVC can be used on noisy data by allowing a fraction of points,
called Bounded SVs (BSVs), to lie outside the hyper-sphere.
– BSVs are not considered cluster boundary points and are not
assigned to clusters by SVC.

[Figure: typical data with noise, outliers, and overlap mapped by the kernel into the feature space; the minimal hyper-sphere of radius R centered at a now leaves a few bounded SVs outside]
Soft SVDD optimization criterion

Primal formulation with soft constraints:

Minimize R^2 + C Σ_{i=1..N} ξi   subject to   ||Φ(xi) - a||^2 <= R^2 + ξi,  ξi >= 0  for i = 1,…,N
  (squared radius of the sphere plus penalty)   (soft constraints)

The introduction of the slack
variables ξi mitigates
the influence of noise
and overlap on the
clustering process.

[Figure: the hyper-sphere of radius R centered at a; an outlier lies outside the sphere at a distance measured by its slack ξi]
Dual formulation of soft SVDD

Maximize W = Σ_i βi K(xi, xi) - Σ_{i,j} βi βj K(xi, xj)
subject to 0 <= βi <= C for i = 1,…,N
  (constraints)

• As before, K(xi, xj) = Φ(xi) · Φ(xj) denotes a kernel function.
• The parameter 0 < C <= 1 gives a trade-off between the volume of the sphere and
the number of errors (C = 1 corresponds to hard constraints).
• The Gaussian kernel K(xi, xj) = exp(-q ||xi - xj||^2) tends to yield tighter
contour representations of clusters than the polynomial kernel.
• The Gaussian kernel width parameter q > 0 influences the tightness of
cluster boundaries, the number of SVs, and the number of clusters.
• Increasing q causes an increase in the number of clusters.
SVM-based variable selection

119
Understanding the weight vector w

Recall the standard SVM formulation:
Find w and b that minimize (1/2)||w||^2
subject to yi (w · xi + b) >= 1 for i = 1,…,N.
Use the classifier: f(x) = sign(w · x + b).

[Figure: negative instances (y = -1) and positive instances (y = +1) separated by the hyperplane w · x + b = 0]

• The weight vector w contains as many elements as there are input
variables in the dataset, i.e. w ∈ R^n.
• The magnitude of each element denotes the importance of the
corresponding variable for the classification task.
Understanding the weight vector w

w = (1, 1):  the decision surface is 1·x1 + 1·x2 + b = 0
             => X1 and X2 are equally important

w = (1, 0):  the decision surface is 1·x1 + 0·x2 + b = 0
             => X1 is important, X2 is not

w = (0, 1):  the decision surface is 0·x1 + 1·x2 + b = 0
             => X2 is important, X1 is not

w = (1, 1, 0):  the decision surface is 1·x1 + 1·x2 + 0·x3 + b = 0
                => X1 and X2 are equally important, X3 is not
Understanding the weight vector w

[Figure: melanoma and nevi samples in the gene X1-gene X2 plane. The SVM decision surface has w = (1,1), i.e. 1·x1 + 1·x2 + b = 0; the decision surface of another classifier has w = (1,0), i.e. 1·x1 + 0·x2 + b = 0. A small graph shows the true causal model relating X1, X2, and the phenotype.]

• In the true model, X1 is causal and X2 is redundant.
• The SVM decision surface implies that X1 and X2 are equally
important; thus it is locally causally inconsistent.
• There exists a causally consistent decision surface for this example.
• Causal discovery algorithms can identify that X1 is causal and X2 is redundant.
Simple SVM-based variable selection
algorithm
Algorithm:
1. Train SVM classifier using data for all variables to
estimate vector w
2. Rank each variable based on the magnitude of the
corresponding element in vector w
3. Using the above ranking of variables, select the
smallest nested subset of variables that achieves the
best SVM prediction accuracy.

123
Simple SVM-based variable selection
algorithm

Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7.
The vector w is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2).
The ranking of variables (by magnitude) is: X6, X5, X3, X2, X7, X1, X4.

Subset of variables      | Classification accuracy
X6 X5 X3 X2 X7 X1 X4     | 0.920   <- best classification accuracy
X6 X5 X3 X2 X7 X1        | 0.920
X6 X5 X3 X2 X7           | 0.919   <- statistically indistinguishable from the best one
X6 X5 X3 X2              | 0.852
X6 X5 X3                 | 0.843
X6 X5                    | 0.832
X6                       | 0.821

Select the following variable subset: X6, X5, X3, X2, X7.


Simple SVM-based variable selection
algorithm

• SVM weights are not locally causally consistent, so we
may end up with a variable subset that is not causal
and not necessarily the most compact one.
• The magnitude of a variable in vector w estimates
the effect of removing that variable on the objective
function of the SVM (i.e., the function that we want to
minimize). However, this algorithm becomes sub-optimal
when considering the effect of removing several
variables at a time… This pitfall is addressed in the
SVM-RFE algorithm that is presented next.
SVM-RFE variable selection algorithm
Algorithm:
1. Initialize V to all variables in the data
2. Repeat
3. Train SVM classifier using data for variables in V to
estimate vector w
4. Estimate prediction accuracy of variables in V using
the above SVM classifier (e.g., by cross-validation)
5. Remove from V a variable (or a subset of variables)
with the smallest magnitude of the corresponding
element in vector w
6. Until there are no variables in V
7. Select the smallest subset of variables with the best
prediction accuracy
126
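scikit-learn ships a generic recursive feature elimination wrapper that can be combined with a linear SVM; the sketch below (illustrative only, not the authors' implementation) follows the same idea as SVM-RFE:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC

# Synthetic data: 200 samples, 50 variables, only a few of them informative
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Recursively drop the variables with the smallest |w|, keeping the subset
# whose cross-validated accuracy is best
selector = RFECV(LinearSVC(C=1.0, max_iter=10000), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)              # size of the selected subset
print(selector.support_.nonzero()[0])    # indices of the selected variables
```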
SVM-RFE variable selection algorithm

[Figure: schematic of SVM-RFE. Starting from 10,000 genes, an SVM model is fit and its prediction accuracy is estimated; the genes not important for classification are discarded, leaving 5,000 genes; the process is repeated (5,000 -> 2,500 -> …), each time keeping the genes important for classification]

• Unlike the simple SVM-based variable selection algorithm, SVM-RFE
estimates the vector w many times to establish the ranking of the variables.
• Notice that the prediction accuracy should be estimated at
each step in an unbiased fashion, e.g. by cross-validation.
SVM variable selection in feature space

The real power of SVMs comes with the application of the kernel
trick, which maps the data to a much higher dimensional space
("feature space") where the data is linearly separable.

[Figure: tumor and normal samples that are not linearly separable in the input space (gene 1, gene 2) become linearly separable in the feature space]
SVM variable selection in feature space

• We have data for 100 SNPs (X1,…,X100) and some phenotype.
• We allow up to 3rd-order interactions, e.g. we consider:
  • X1,…,X100
  • X1^2, X1X2, X1X3,…, X1X100,…, X100^2
  • X1^3, X1X2X3, X1X2X4,…, X1X99X100,…, X100^3
• Task: find the smallest subset of features (either SNPs or
their interactions) that achieves the best predictive
accuracy for the phenotype.
• Challenge: If we have a limited sample, we cannot explicitly
construct and evaluate all SNPs and their interactions
(176,851 features in total) as is done in classical statistics.
SVM variable selection in feature space

Heuristic solution: Apply the algorithm SVM-FSMB, which:
1. Uses SVMs with a polynomial kernel of degree 3 and
selects M features (not necessarily input variables!)
that have the largest weights in the feature space.
E.g., the algorithm can select features like: X10,
(X1X2), (X9X2X22), (X72X98), and so on.
2. Applies the HITON-MB Markov blanket algorithm to find
the Markov blanket of the phenotype using the M
features from step 1.
Computing posterior class
probabilities for SVM classifiers

131
Output of SVM classifier

1. SVMs output a class label
(positive or negative) for each
sample: sign(w · x + b).
2. One can also compute the distance
from the hyperplane that
separates the classes, e.g. w · x + b.
These distances can be used to
compute performance metrics
like area under the ROC curve.

[Figure: negative samples (y = -1) and positive samples (y = +1) on either side of the hyperplane w · x + b = 0]

Question: How can one use SVMs to estimate posterior
class probabilities, i.e., P(class positive | sample x)?
Simple binning method

1. Train the SVM classifier on the Training set.
2. Apply it to the Validation set and compute the distance
from the hyperplane to each sample.

Sample #   1    2   3   4   5  ...  98   99   100
Distance   2   -1   8   3   4  ...  -2   0.3  0.8

3. Create a histogram with Q (e.g., say 10) bins using the
above distances. Each bin has an upper and lower
value in terms of distance.

[Figure: the data is split into Training, Validation, and Testing sets; a histogram shows the number of validation samples falling into each distance bin]
Simple binning method

4. Given a new sample from the Testing set, place it in
the corresponding bin.
E.g., sample #382 has distance to the hyperplane = 1, so it
is placed in the bin [0, 2.5].

5. Compute the probability P(positive class | sample #382) as
the fraction of true positives in this bin.
E.g., this bin has 22 samples (from the Validation set),
out of which 17 are true positives, so we compute
P(positive class | sample #382) = 17/22 = 0.77.
Platt's method

Convert the distances output by the SVM to probabilities by passing them
through the sigmoid filter:

    P(positive class | sample) = 1 / (1 + exp(A·d + B)),

where d is the distance from the hyperplane and A and B are parameters.

[Figure: the sigmoid P(positive class | sample) plotted against the distance d]
Platt's method

1. Train the SVM classifier on the Training set.
2. Apply it to the Validation set and compute the distances
from the hyperplane to each sample.

Sample #   1    2   3   4   5  ...  98   99   100
Distance   2   -1   8   3   4  ...  -2   0.3  0.8

3. Determine parameters A and B of the sigmoid
function by minimizing the negative log-likelihood of
the data from the Validation set.
4. Given a new sample from the Testing set, compute its
posterior probability using the sigmoid function.
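In scikit-learn, a sigmoid of this form can be fitted on held-out data with CalibratedClassifierCV; the sketch below (illustrative only) mirrors the steps above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# method="sigmoid" fits Platt's A and B on validation folds of the training data
svm = LinearSVC(C=1.0, max_iter=10000)
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=3).fit(X_train, y_train)

print(platt.predict_proba(X_test[:5]))   # P(negative class), P(positive class) per sample
```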
Part 3
• Case studies (taken from our research)
1. Classification of cancer gene expression microarray data
2. Text categorization in biomedicine
3. Prediction of clinical laboratory values
4. Modeling clinical judgment
5. Using SVMs for feature selection
6. Outlier detection in ovarian cancer proteomics data
• Software
• Conclusions
• Bibliography

137
1. Classification of cancer gene
expression microarray data

138
Comprehensive evaluation of algorithms
for classification of cancer microarray data
Main goals:
Find the best performing decision support
algorithms for cancer diagnosis from
microarray gene expression data;
Investigate benefits of using gene selection
and ensemble classification methods.

139
Classifiers

Instance-based:
• K-Nearest Neighbors (KNN)
Neural networks:
• Backpropagation Neural Networks (NN)
• Probabilistic Neural Networks (PNN)
Kernel-based:
• Multi-Class SVM: One-Versus-Rest (OVR)
• Multi-Class SVM: One-Versus-One (OVO)
• Multi-Class SVM: DAGSVM
• Multi-Class SVM by Weston & Watkins (WW)
• Multi-Class SVM by Crammer & Singer (CS)
Voting:
• Weighted Voting: One-Versus-Rest
• Weighted Voting: One-Versus-One
Decision trees:
• Decision Trees: CART
Ensemble classifiers

[Figure: the dataset is given to Classifier 1, Classifier 2, …, Classifier N; their predictions (Prediction 1, …, Prediction N), together with the dataset, are fed into an Ensemble Classifier, which outputs the final prediction]
Gene selection methods

[Figure: a heat map of genes ordered from highly discriminatory to uninformative]

1. Signal-to-noise (S2N) ratio in
one-versus-rest (OVR) fashion;
2. Signal-to-noise (S2N) ratio in
one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric
one-way ANOVA (KW);
4. Ratio of genes' between-categories
to within-category sum of squares (BW).
Performance metrics and
statistical comparison

1. Accuracy
+ can compare to previous studies
+ easy to interpret & simplifies statistical comparison

2. Relative classifier information (RCI)
+ easy to interpret & simplifies statistical comparison
+ not sensitive to the distribution of classes
+ accounts for the difficulty of a decision problem

Randomized permutation testing was used to compare the accuracies
of the classifiers (α = 0.05).
Microarray datasets

Dataset name   | Samples | Variables (genes) | Categories | Reference
11_Tumors      | 174     | 12533             | 11         | Su, 2001
14_Tumors      | 308     | 15009             | 26         | Ramaswamy, 2001
9_Tumors       | 60      | 5726              | 9          | Staunton, 2001
Brain_Tumor1   | 90      | 5920              | 5          | Pomeroy, 2002
Brain_Tumor2   | 50      | 10367             | 4          | Nutt, 2003
Leukemia1      | 72      | 5327              | 3          | Golub, 1999
Leukemia2      | 72      | 11225             | 3          | Armstrong, 2002
Lung_Cancer    | 203     | 12600             | 5          | Bhattacherjee, 2001
SRBCT          | 83      | 2308              | 4          | Khan, 2001
Prostate_Tumor | 102     | 10509             | 2          | Singh, 2002
DLBCL          | 77      | 5469              | 2          | Shipp, 2002

Total: ~1300 samples and 74 diagnostic categories
(41 cancer types and 12 normal tissue types).
Summary of methods and datasets

Cross-validation designs (2): 10-fold CV; LOOCV.
Gene selection methods (4): S2N one-versus-rest; S2N one-versus-one; non-parametric ANOVA (KW); BW ratio.
Performance metrics (2): accuracy; RCI.
Statistical comparison: randomized permutation testing.

Classifiers (11):
• MC-SVM: one-versus-rest, one-versus-one, DAGSVM, method by WW, method by CS
• KNN, backpropagation NN, probabilistic NN, decision trees
• Weighted voting (WV): one-versus-rest, one-versus-one

Ensemble classifiers (7):
• Based on MC-SVM outputs: majority voting, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM, decision trees
• Based on outputs of all classifiers: majority voting, decision trees

Gene expression datasets (11):
• Multicategory diagnosis: 11_Tumors, 14_Tumors, 9_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT
• Binary diagnosis: Prostate_Tumor, DLBCL
Results without gene selection

[Figure: bar chart of accuracy (%) for the MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and the non-SVM methods (KNN, backpropagation NN, PNN) on the 11 microarray datasets, without gene selection]
Results with gene selection

[Figure: left panel - improvement in accuracy (%) achieved by gene selection, shown separately for SVM and non-SVM methods on 9_Tumors, 14_Tumors, Brain_Tumor1, and Brain_Tumor2; right panel - diagnostic performance before and after gene selection (averages for the four datasets) for OVR, OVO, DAGSVM, WW, CS, KNN, NN, and PNN]

Average reduction in the number of genes is 10-30 times.
Comparison with previously
published results

[Figure: accuracy (%) on each of the 11 datasets for the multiclass SVMs of this study versus the multiple specialized classification methods of the original primary studies]
Summary of results

Multi-class SVMs are the best family among the


tested algorithms outperforming KNN, NN, PNN, DT,
and WV.
Gene selection in some cases improves classification
performance of all classifiers, especially of non-SVM
algorithms;
Ensemble classification does not improve
performance of SVM and other classifiers;
Results obtained by SVMs favorably compare with the
literature.

149
Random Forest (RF) classifiers
• Appealing properties
– Work when # of predictors > # of samples
– Embedded gene selection
– Incorporate interactions
– Based on theory of ensemble learning
– Can work with binary & multiclass tasks
– Does not require much fine-tuning of parameters
• Strong theoretical claims
• Empirical evidence: (Diaz-Uriarte and Alvarez de
Andres, BMC Bioinformatics, 2006) reported
superior classification performance of RFs compared
to SVMs and other methods
150
Key principles of RF classifiers

[Figure: schematic of a random forest. 1) Generate bootstrap samples from the training data; 2) perform random gene selection; 3) fit unpruned decision trees; 4) apply the trees to the testing data and combine their predictions]
Results without gene selection

• SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the
algorithms perform exactly the same in 3 datasets.
• In 7 datasets SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
Results with gene selection

• SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the
algorithms perform exactly the same in 2 datasets.
• In 1 dataset SVMs outperform RFs statistically significantly.
• On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
2. Text categorization in biomedicine

154
Models to categorize content and quality:
Main idea

1. Utilize existing (or easy to build) training corpora.
2. Use simple document representations (i.e., typically
stemmed and weighted words in title and abstract,
MeSH terms if available, occasionally addition of
MetaMap CUIs and author info) as "bag-of-words".
Models to categorize content and quality:
Main idea

3. Train SVM models on the labeled examples that capture
implicit categories of meaning or quality criteria, and apply
them to unseen examples.
4. Evaluate the models' performance
- with nested cross-validation or other appropriate error estimators
- using primarily AUC as well as other metrics
(sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.).
5. Evaluate performance prospectively &
compare to prior cross-validation estimates.

[Figure: bar chart comparing estimated performance with 2005 prospective performance for the Treatment, Diagnosis, Prognosis, and Etiology categories]
Models to categorize content and quality:
Some notable results

Category  | Average AUC over n folds | Range of AUC
Treatment | 0.97*                    | 0.96 - 0.98
Etiology  | 0.94*                    | 0.89 - 0.95
Prognosis | 0.95*                    | 0.92 - 0.97
Diagnosis | 0.95*                    | 0.93 - 0.98

1. SVM models have excellent ability to identify high-quality PubMed
documents according to the ACPJ gold standard.

Method                      | Treatment AUC | Etiology AUC | Prognosis AUC | Diagnosis AUC
Google PageRank             | 0.54          | 0.54         | 0.43          | 0.46
Yahoo Webranks              | 0.56          | 0.49         | 0.52          | 0.52
Impact Factor 2005          | 0.67          | 0.62         | 0.51          | 0.52
Web page hit count          | 0.63          | 0.63         | 0.58          | 0.57
Bibliometric citation count | 0.76          | 0.69         | 0.67          | 0.60
Machine learning models     | 0.96          | 0.95         | 0.95          | 0.95

2. SVM models have better classification performance than PageRank,
Yahoo ranks, Impact Factor, Web page hit counts, and bibliometric
citation counts on the Web, according to the ACPJ gold standard.
Models to categorize content and quality:
Some notable results

Gold standard: SSOAB       | Area under the ROC curve*
SSOAB-specific filters     | 0.893
Citation count             | 0.791
ACPJ Txmt-specific filters | 0.548
Impact Factor (2001)       | 0.549
Impact Factor (2005)       | 0.558

3. SVM models have better classification performance than PageRank,
Impact Factor and citation count in Medline for the SSOAB gold standard.

[Figure: bar charts comparing query filters and learning models at fixed sensitivity and at fixed specificity for the Treatment, Etiology, Diagnosis, and Prognosis categories]

4. SVM models have better sensitivity/specificity in PubMed than CQFs at
comparable thresholds according to the ACPJ gold standard.
Other applications of SVMs to text
categorization

Model                   | Area under the curve
Machine learning models | 0.93
Quackometer*            | 0.67
Google                  | 0.63

1. Identifying Web pages with misleading treatment information, according
to a special-purpose gold standard (Quack Watch). SVM models outperform
Quackometer and Google ranks in the tested domain of cancer treatment.
2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis,
AMIA 2008).
3. Prediction of clinical laboratory
values

160
Dataset generation and
experimental design

• The StarPanel database contains ~8·10^6 lab measurements of ~100,000 in-patients
from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.

For each combination of lab test and normal range, we generated
the following datasets:

Period            | Dataset
01/1998 - 05/2001 | Training (with a Validation set = 25% of Training)
06/2001 - 10/2002 | Testing
Query-based approach for
prediction of clinical lab values

[Figure: workflow. For every candidate data model, training data is extracted from the database and an SVM classifier is trained; the validation data is used to select the optimal SVM classifier and data model; for every testing sample, the selected classifier is applied and a prediction is produced, from which performance is computed]
Classification results

Area under the ROC curve (without feature selection), by range of normal values.

Including cases with K=0 (i.e. samples with no prior lab measurements):

Lab test | >1    | <99   | [1, 99] | >2.5  | <97.5 | [2.5, 97.5]
BUN      | 75.9% | 93.4% | 68.5%   | 81.8% | 92.2% | 66.9%
Ca       | 67.5% | 80.4% | 55.0%   | 77.4% | 70.8% | 60.0%
CaIo     | 74.1% | 60.0% | 50.1%   | 64.7% | 72.3% | 57.7%
CO2      | 77.3% | 88.0% | 53.4%   | 77.5% | 90.5% | 58.1%
Creat    | 62.2% | 88.4% | 83.5%   | 88.4% | 94.9% | 83.8%
Mg       | 58.4% | 71.8% | 64.2%   | 67.0% | 72.5% | 62.1%
Osmol    | 77.9% | 64.8% | 65.2%   | 79.2% | 82.4% | 71.5%
PCV      | 62.3% | 91.6% | 69.7%   | 76.5% | 84.6% | 70.2%
Phos     | 70.8% | 75.4% | 60.4%   | 68.0% | 81.8% | 65.9%

Excluding cases with K=0 (i.e. samples with no prior lab measurements):

Lab test | >1    | <99   | [1, 99] | >2.5  | <97.5 | [2.5, 97.5]
BUN      | 80.4% | 99.1% | 76.6%   | 87.1% | 98.2% | 70.7%
Ca       | 72.8% | 93.4% | 55.6%   | 81.4% | 81.4% | 63.4%
CaIo     | 63.5% | 52.9% | 58.8%   | 46.4% | 66.3% | 58.7%
CO2      | 82.0% | 93.6% | 59.8%   | 84.4% | 94.5% | 56.3%
Creat    | 62.8% | 97.7% | 89.1%   | 91.5% | 98.1% | 87.7%
Mg       | 56.9% | 70.0% | 49.1%   | 58.6% | 76.9% | 59.1%
Osmol    | 50.9% | 60.8% | 60.8%   | 91.0% | 90.5% | 68.0%
PCV      | 74.9% | 99.2% | 66.3%   | 80.9% | 80.6% | 67.1%
Phos     | 74.5% | 93.6% | 64.4%   | 71.7% | 92.2% | 69.7%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
Improving predictive power and parsimony
of a BUN model using feature selection

Model description:
Test name: BUN
Range of normal values: < 99th percentile
Data modeling: SRT
Number of previous measurements: 5
Use variables corresponding to hospitalization units? Yes
Number of prior hospitalizations used: 2

[Figure: histogram of BUN test values, showing the normal and abnormal ranges]

Dataset description (N variables for all sets: 3442):
               | N samples (total) | N abnormal samples
Training set   | 3749              | 78
Validation set | 1251              | 27
Testing set    | 836               | 16

Classification performance (area under ROC curve):
                   | All    | RFE_Linear | RFE_Poly | HITON_PC | HITON_MB
Validation set     | 95.29% | 98.78%     | 98.76%   | 99.12%   | 98.90%
Testing set        | 94.72% | 99.66%     | 99.63%   | 99.16%   | 99.05%
Number of features | 3442   | 26         | 3        | 11       | 17
Classification performance (area under ROC curve):
                   | All    | RFE_Linear | RFE_Poly | HITON_PC | HITON_MB
Validation set     | 95.29% | 98.78%     | 98.76%   | 99.12%   | 98.90%
Testing set        | 94.72% | 99.66%     | 99.63%   | 99.16%   | 99.05%
Number of features | 3442   | 26         | 3        | 11       | 17

Features selected by each method (in rank order):

RFE_Poly (3): LAB: PM_1(BUN); Indicator(PM_1(Mg)); Test Unit NO_TEST_MEASUREMENT (Test CaIo, PM 1).

HITON_PC (11): LAB: PM_1(BUN); PM_5(Creat); PM_1(Phos); Indicator(PM_1(BUN)); Indicator(PM_5(Creat));
Indicator(PM_1(Mg)); DT(PM_4(Creat)); Test Unit 7SCC (Test Ca, PM 1); Test Unit RADR (Test Ca, PM 5);
Test Unit 7SMI (Test PCV, PM 4); DEMO: Gender.

HITON_MB (17): LAB: PM_1(BUN); PM_5(Creat); PM_3(PCV); PM_1(Mg); PM_1(Phos); Indicator(PM_4(Creat));
Indicator(PM_5(Creat)); Indicator(PM_3(PCV)); Indicator(PM_1(Phos)); DT(PM_4(Creat));
Test Unit 11NM (Test BUN, PM 2); Test Unit 7SCC (Test Ca, PM 1); Test Unit RADR (Test Ca, PM 5);
Test Unit 7SMI (Test PCV, PM 4); Test Unit CCL (Test Phos, PM 1); DEMO: Gender; DEMO: Age.

RFE_Linear (26): LAB: PM_1(BUN); PM_2(Cl); DT(PM_3(K)); DT(PM_3(Creat)); Test Unit J018 (Test Ca, PM 3);
DT(PM_4(Cl)); DT(PM_3(Mg)); PM_1(Cl); PM_3(Gluc); DT(PM_1(CO2)); DT(PM_4(Gluc)); PM_3(Mg); DT(PM_5(Mg));
PM_1(PCV); PM_2(BUN); Test Unit 11NM (Test PCV, PM 2); Test Unit 7SCC (Test Mg, PM 3); DT(PM_2(Phos));
DT(PM_3(CO2)); DT(PM_2(Gluc)); DT(PM_5(CaIo)); DEMO: Hospitalization Unit TVOS; PM_1(Phos); PM_2(Phos);
Test Unit 11NM (Test K, PM 5); Test Unit VHR (Test CaIo, PM 1).
4. Modeling clinical judgment

166
Methodological framework and study
outline

[Figure: study outline. Patients are described by features f1…fm (same across physicians); each of six physicians provides a clinical diagnosis (cd) for every patient, and the histological diagnosis (hd) serves as the gold standard (different across physicians only in the clinical diagnoses). Together with published guidelines, these data are used to: predict clinical decisions, identify predictors ignored by physicians, explain each physician's diagnostic model, and compare physicians with each other and with the guidelines.]
Clinical context of experiment
Malignant melanoma is the most dangerous form of skin cancer; its incidence and mortality have been increasing steadily over the last decades.
Physicians and patients
• Patients: N = 177 (76 melanomas, 101 nevi)
• Dermatologists: N = 6 (3 experts, 3 non-experts); diagnoses made in 2004
• Data collection: patients seen prospectively from 1999 to 2002 at the Department of Dermatology, [Link] Hospital, Trento, Italy
• Inclusion criteria: histological diagnosis and >1 digital image available

Features
The 24 features: Lesion location, Max-diameter, Min-diameter, Evolution, Age, Gender, Family history of melanoma, Fitzpatrick's Photo-type, Sunburn, Ephelis, Lentigos, Asymmetry, Irregular Border, Number of colors, Atypical pigmented network, Abrupt network cut-off, Regression-Erythema, Hypo-pigmentation, Streaks (radial streaming, pseudopods), Slate-blue veil, Whitish veil, Globular elements, Comedo-like openings / milia-like cysts, Telangiectasia.
Method to explain physician-specific
SVM models

[Diagram: regular learning — feature selection (FS) followed by building an SVM, which acts as a "black box"; meta-learning — the SVM is applied to the data and a decision tree (DT) is built on the SVM's outputs to explain each physician-specific model.]
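A minimal sketch of this meta-learning step, using scikit-learn on synthetic data (the feature names, data, and tree depth are assumptions for illustration, not the study's actual code): the SVM is trained on the selected features, and a shallow decision tree is then fit to the SVM's predicted labels so that the tree mimics, and thereby explains, the black-box model.

```python
# Sketch of the meta-learning idea: explain an SVM with a decision tree
# trained on the SVM's own predictions (synthetic data, illustrative only).
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
X = rng.randn(177, 5)                             # e.g., 177 lesions, 5 selected features
y = (X[:, 0] - X[:, 2] > 0).astype(int)           # hypothetical physician judgments

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)  # the "black box"
svm_labels = svm.predict(X)                       # meta-learning targets

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, svm_labels)
fidelity = (tree.predict(X) == svm_labels).mean() # how well the tree mimics the SVM
print(f"Tree fidelity to SVM predictions: {fidelity:.2%}")
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```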
Results: Predicting physicians’ judgments

Physicians    All AUC (features)    HITON_PC AUC (features)    HITON_MB AUC (features)    RFE AUC (features)

Expert 1 0.94 (24) 0.92 (4) 0.92 (5) 0.95 (14)

Expert 2 0.92 (24) 0.89 (7) 0.90 (7) 0.90 (12)

Expert 3 0.98 (24) 0.95 (4) 0.95 (4) 0.97 (19)

NonExpert 1 0.92 (24) 0.89 (5) 0.89 (6) 0.90 (22)

NonExpert 2 1.00 (24) 0.99 (6) 0.99 (6) 0.98 (11)

NonExpert 3 0.89 (24) 0.89 (4) 0.89 (6) 0.87 (10)


Results: Physician-specific models
Results: Explaining physician agreement
[Table and decision trees: feature values for Patient 001 (blue veil, irregular border, streaks) and the decision-tree explanations for Expert 1 (AUC = 0.92, R2 = 99%) and Expert 3 (AUC = 0.95, R2 = 99%).]
Results: Explain physician disagreement
[Table and decision trees: feature values for Patient 002 (blue veil, irregular border, streaks, number of colors, evolution) and the decision-tree explanations for Expert 1 (AUC = 0.92, R2 = 99%) and Expert 3 (AUC = 0.95, R2 = 99%).]
Results: Guideline compliance

Physician | Reported guidelines | Compliance
Experts 1, 2, 3 and non-expert 1 | Pattern analysis | Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.
Non-expert 2 | ABCDE rule | Non-compliant: asymmetry, irregular border and evolution are ignored.
Non-expert 3 | Non-standard; reports using 7 features | Non-compliant: 2 out of 7 reported features are ignored, while some non-reported ones are not.

On the contrary, in all guidelines the more predictors are present, the higher the likelihood of melanoma; all physicians were compliant with this principle.
5. Using SVMs for feature selection

176
Feature selection methods
Feature selection methods (non-causal)
• SVM-RFE (an SVM-based feature selection method; see the sketch after this slide)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection

Causal feature selection methods


• HITON-PC
• HITON-MB (outputs a Markov blanket of the response variable, under assumptions)
• IAMB
• BLCD
• K2MB

177
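As referenced in the list above, here is a minimal sketch of SVM-RFE: a linear SVM is trained repeatedly, and at each iteration the features with the smallest absolute weights are discarded. The code uses scikit-learn on synthetic data; the sample sizes, feature counts, and parameters are illustrative assumptions.

```python
# Sketch of SVM-RFE: recursively eliminate the features with the smallest
# absolute weights of a linear SVM (synthetic data, illustrative only).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 100)                                   # 300 samples, 100 candidate features
y = (X[:, 0] + 0.8 * X[:, 1] - X[:, 2] > 0).astype(int)  # only 3 features are relevant

svm = SVC(kernel="linear", C=1.0)                         # its weight vector drives the ranking
rfe = RFE(estimator=svm, n_features_to_select=5, step=0.1)  # drop 10% of features per iteration
rfe.fit(X, y)

selected = np.where(rfe.support_)[0]
print("Selected feature indices:", selected)              # should recover features 0, 1, 2
```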
13 real datasets were used to evaluate
feature selection methods
Dataset name | Domain | Number of variables | Number of samples | Target | Data type
Infant_Mortality | Clinical | 86 | 5,337 | Died within the first year | discrete
Ohsumed | Text | 14,373 | 5,000 | Relevant to neonatal diseases | continuous
ACPJ_Etiology | Text | 28,228 | 15,779 | Relevant to etiology | continuous
Lymphoma | Gene expression | 7,399 | 227 | 3-year survival: dead vs. alive | continuous
Gisette | Digit recognition | 5,000 | 7,000 | Separate 4 from 9 | continuous
Dexter | Text | 19,999 | 600 | Relevant to corporate acquisitions | continuous
Sylva | Ecology | 216 | 14,394 | Ponderosa pine vs. everything else | continuous & discrete
Ovarian_Cancer | Proteomics | 2,190 | 216 | Cancer vs. normals | continuous
Thrombin | Drug discovery | 139,351 | 2,543 | Binding to thrombin | discrete (binary)
Breast_Cancer | Gene expression | 17,816 | 286 | Estrogen-receptor positive (ER+) vs. ER- | continuous
Hiva | Drug discovery | 1,617 | 4,229 | Activity to AIDS HIV infection | discrete (binary)
Nova | Text | 16,969 | 1,929 | Separate politics from religion topics | discrete (binary)
Bankruptcy | Financial | 147 | 7,063 | Personal bankruptcy | continuous & discrete


178
Classification performance vs. proportion
of selected features
[Figure: two panels (original and magnified) plotting classification performance (AUC) against the proportion of selected features for HITON-PC with G2 test and for RFE.]

179
Statistical comparison of predictivity and
reduction of features
Comparison: SVM-RFE (4 variants) vs. HITON-PC

Variant | Predictivity: P-value | Predictivity: nominal winner | Reduction: P-value | Reduction: nominal winner
1 | 0.9754 | SVM-RFE | 0.0046 | HITON-PC
2 | 0.8030 | SVM-RFE | 0.0042 | HITON-PC
3 | 0.1312 | HITON-PC | 0.3634 | HITON-PC
4 | 0.1008 | HITON-PC | 0.6816 | SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same;


• Use permutation-based statistical test with alpha = 0.05.
180
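The permutation-based comparison can be sketched as a paired permutation (sign-flip) test on per-dataset performance differences. The numbers below are synthetic placeholders, and the procedure is a generic sketch rather than the authors' exact test.

```python
# Sketch of a paired permutation test: is the mean performance difference
# between two feature selection methods significantly different from zero?
# AUC values below are synthetic and purely illustrative.
import numpy as np

rng = np.random.RandomState(0)
auc_rfe   = np.array([0.91, 0.88, 0.95, 0.83, 0.90, 0.87, 0.92, 0.85, 0.89, 0.93, 0.86, 0.90, 0.94])
auc_hiton = np.array([0.92, 0.89, 0.94, 0.85, 0.91, 0.88, 0.92, 0.86, 0.90, 0.93, 0.88, 0.91, 0.93])

diffs = auc_hiton - auc_rfe                  # one paired difference per dataset
observed = diffs.mean()

n_perm = 20000
count = 0
for _ in range(n_perm):
    signs = rng.choice([-1, 1], size=diffs.size)       # randomly flip which method "won"
    if abs((signs * diffs).mean()) >= abs(observed):
        count += 1
p_value = count / n_perm

print(f"Observed mean AUC difference: {observed:.4f}, permutation p-value: {p_value:.4f}")
print("Reject H0 at alpha = 0.05" if p_value < 0.05 else "Fail to reject H0 at alpha = 0.05")
```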
Simulated datasets with known causal
structure used to compare algorithms

181
Comparison of SVM-RFE and HITON-PC

182
Comparison of all methods in terms of
causal graph distance

[Figure: causal graph distance of SVM-RFE compared with HITON-PC-based causal methods.]

183
Summary results
[Figure: summary comparison of HITON-PC-based causal methods, HITON-PC-FDR methods, and SVM-RFE.]

184
Statistical comparison of graph distance

Comparison: average HITON-PC-FDR with G2 test vs. average SVM-RFE

Sample size | P-value | Nominal winner
200 | <0.0001 | HITON-PC-FDR
500 | 0.0028 | HITON-PC-FDR
5000 | <0.0001 | HITON-PC-FDR

• Null hypothesis: SVM-RFE and HITON-PC-FDR perform the same;


• Use permutation-based statistical test with alpha = 0.05.

185
6. Outlier detection in ovarian cancer
proteomics data

186
Ovarian cancer data
Same set of 216 patients, obtained using the Ciphergen H4 ProteinChip array (dataset 1) and using the Ciphergen WCX2 ProteinChip array (dataset 2).

[Figure: Data Set 1 (top) and Data Set 2 (bottom); sample profiles grouped as Cancer, Normal, and Other, plotted against clock tick (4000 to 12000).]

The gross break at the "benign disease" juncture in dataset 1 and the similarity of the profiles to those in dataset 2 suggest a change of protocol in the middle of the first experiment.
Experiments with one-class SVM
Assume that sets {A, B} are normal and {C, D, E, F} are outliers. Also assume that we do not know which samples are normal and which are outliers.

• Experiment 1: Train a one-class SVM on {A, B, C} and test on {A, B, C}: area under ROC curve = 0.98.
• Experiment 2: Train a one-class SVM on {A, C} and test on {B, D, E, F}: area under ROC curve = 0.98.

[Figure: Data Set 1 (top) and Data Set 2 (bottom) profiles, as on the previous slide.]

188
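A minimal sketch of this kind of experiment with a one-class SVM, using scikit-learn on synthetic data standing in for the proteomics profiles (the shift applied to the outliers and all parameter values are illustrative assumptions):

```python
# Sketch of one-class SVM outlier detection: train on assumed-normal samples,
# then score a mixed test set and evaluate with AUC (synthetic data only).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
normal_train = rng.randn(100, 20)                 # training: assumed-normal samples
normal_test  = rng.randn(50, 20)                  # held-out normal samples
outlier_test = rng.randn(30, 20) + 4.0            # outliers from a shifted distribution

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(normal_train)

X_test = np.vstack([normal_test, outlier_test])
y_test = np.concatenate([np.ones(len(normal_test)), np.zeros(len(outlier_test))])  # 1 = normal

scores = oc_svm.decision_function(X_test)         # higher score = more "normal"
print(f"Area under ROC curve: {roc_auc_score(y_test, scores):.3f}")
```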
Software

189
Interactive media and animations
SVM Applets
• [Link]
• [Link]
• [Link]
• [Link]
• [Link] (requires Java 3D)

Animations
• Support Vector Machines:
[Link]
[Link]
[Link]
• Support Vector Regression:
[Link]
190
Several SVM implementations for
beginners
• GEMS: [Link]

• Weka: [Link]

• Spider (for Matlab): [Link]

• CLOP (for Matlab): [Link]

191
Several SVM implementations for
intermediate users
• LibSVM: [Link]
General purpose
Implements binary SVM, multiclass SVM, SVR, one-class SVM
Command-line interface
Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: [Link]
General purpose (designed for text categorization)
Implements binary SVM, multiclass SVM, SVR
Command-line interface
Code/interface for C/C++, Java, Matlab, Python, Perl

More software links at [Link]


and [Link]

192
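For orientation, a minimal example of the train/predict workflow these packages support, shown with scikit-learn's SVC (which wraps LibSVM); the dataset and parameter values are illustrative, and the command-line tools follow the same pattern with their own syntax.

```python
# Minimal binary SVM workflow of the kind supported by the packages above,
# illustrated with scikit-learn's SVC (a wrapper around LibSVM).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")     # RBF kernel with typical default parameters
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```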
Conclusions

193
Strong points of SVM-based learning
methods
• Empirically achieve excellent results in high-dimensional data
with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear
functions by using “kernel trick”
• Robust to outliers and noise (use “slack variables”)
• Convex QP optimization problem (thus, it has a global minimum
and can be solved efficiently)
• Solution is defined only by a small subset of training points
(“support vectors”)
• Number of free parameters is bounded by the number of
support vectors and not by the number of variables
• Do not require direct access to data, work only with dot-
products of data-points.
194
Weak points of SVM-based learning
methods
• Measures of uncertainty of parameters are not
currently well-developed
• Interpretation is less straightforward than classical
statistics
• Lack of parametric statistical significance tests
• Power/sample-size analysis and research design considerations
are less developed than for classical statistics

195
Bibliography

196
Part 1: Support vector machines for binary
classification: classical formulation
• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers.
Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT)
1992:144-152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery 1998, 2:121-167.
• Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-
based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press;
2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning.
Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK:
Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.

197
Part 1: Basic principles of statistical
machine learning
• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass-throughput data:
a technical commentary from the statistical machine learning perspective. Cancer
Informatics 2006, 2:133-162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.

198
Part 2: Model selection for SVMs
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model
selection. Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence (IJCAI) 1995, 2:1137-1145.
• Scheffer T: Error estimation and model selection. [Link], Technische Universität
Berlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer
diagnosis and biomarker discovery from microarray gene expression data. Int J Med
Inform 2005, 74:491-503.

199
Part 2: SVMs for multicategory data
• Crammer K, Singer Y: On the learnability and design of output codes for multiclass
problems. Proceedings of the Thirteenth Annual Conference on Computational Learning
Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe-Taylor J: Large margin DAGs for multiclass classification.
Advances in Neural Information Processing Systems (NIPS) 2000, 12:547-553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning.
Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of
multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition.
Proceedings of the Seventh European Symposium On Artificial Neural Networks 1999, 4:6.

200
Part 2: Support vector regression
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining,
inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing
2004, 14:199-222.

Part 2: Novelty detection with SVM-based


methods and Support Vector Clustering
• Scholkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the Support of a
High-Dimensional Distribution. Neural Computation 2001, 13:1443-1471.
• Tax DMJ, Duin RPW: Support vector domain description. Pattern Recognition Letters 1999,
20:1191-1199.
• Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support vector clustering. Journal of Machine
Learning Research 2001, 2:125–137.

201
Part 2: SVM-based variable selection
• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine
Learning Research 2003, 3:1157-1182.
• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using
support vector machines. Machine Learning 2002, 46:389-422.
• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based
feature selection. Proceedings of the Twenty First International Conference on Machine
Learning (ICML) 2004.
• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally
relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on
Causality and Feature Selection 2006.
• Tsamardinos I, Brown LE: Markov Blanket-Based Variable Selection in Feature Space.
Technical report DSL-08-01 2008.
• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for
SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668-674.
• Weston J, Elisseeff A, Scholkopf B, Tipping M: Use of the zero-norm with linear models
and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.
• Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. Advances in
Neural Information Processing Systems (NIPS) 2004, 16.

202
Part 2: Computing posterior class
probabilities for SVM classifiers
• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized
likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B,
Scholkopf B, Schuurmans D. Cambridge, MA: MIT press; 2000.

Part 3: Classification of cancer gene


expression microarray data (Case Study 1)
• Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data
using random forest. BMC Bioinformatics 2006, 7:3.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of
multicategory classification methods for microarray gene expression cancer diagnosis.
Bioinformatics 2005, 21:631-643.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer
diagnosis and biomarker discovery from microarray gene expression data. Int J Med
Inform 2005, 74:491-503.
• Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and
support vector machines for microarray-based cancer classification. BMC Bioinformatics
2008, 9:319.
203
Part 3: Text Categorization in Biomedicine
(Case Study 2)
• Aphinyanaphongs Y, Aliferis CF: Learning Boolean queries for article quality filtering.
Medinfo 2004 2004, 11:263-267.
• Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF: Text categorization
models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc
2005, 12:207-216.
• Aphinyanaphongs Y, Statnikov A, Aliferis CF: A comparison of citation metrics to machine
learning filters for the identification of high quality MEDLINE documents. J Am Med
Inform Assoc 2006, 13:446-455.
• Aphinyanaphongs Y, Aliferis CF: Prospective validation of text categorization models for
identifying high-quality content-specific articles in PubMed. AMIA 2006 Annual
Symposium Proceedings 2006.
• Aphinyanaphongs Y, Aliferis C: Categorization Models for Identifying Unproven Cancer
Treatments on the Web. MEDINFO 2007.
• Fu L, Aliferis C: Models for Predicting and Explaining Citation Count of Biomedical
Articles. AMIA 2008 Annual Symposium Proceedings 2008.

204
Part 3: Modeling clinical judgment
(Case Study 4)
• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the
diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium
Proceedings 2005:664-668.

Part 3: Using SVMs for feature selection


(Case Study 5)
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov
Blanket Induction for Causal Discovery and Feature Selection for Classification. Part II:
Analysis and Extensions. Journal of Machine Learning Research 2008.
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov
Blanket Induction for Causal Discovery and Feature Selection for Classification. Part I:
Algorithms and Empirical Evaluation. Journal of Machine Learning Research 2008.

205
Part 3: Outlier detection in ovarian cancer
proteomics data (Case Study 6)
• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in
serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777-785.
• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C,
Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer.
Lancet 2002, 359:572-577.

206
Thank you for your attention!
Questions/Comments?

Email: [Link]@[Link]

URL: [Link]

207
