Chapter 5 – Support Vector Machine
Prepared by: Shier Nee, SAW
Support Vector Machine
• Supervised learning models with associated learning algorithms
that analyze data for classification and regression analysis.
• The original SVM algorithm was invented by Vladimir N. Vapnik and
Alexey Ya. Chervonenkis in 1963
Support Vector Machine
• Robust algorithm for classification problems
• Medical problems
• Text and hypertext categorization
• Image classification
• Advantages:
• Training reduces to a convex optimization problem (a global optimum is guaranteed)
• Easy to use
• Excellent performance on different types of datasets
Projection Concept Recap
x = datapoint
w = projection vector

To compute the projection of x onto w (assuming intercept = 0):

$w^T x = \|w\|\,\|x\|\cos\theta$

If $w^T x$ is +ve, x is on the same side as w.
If $w^T x$ is -ve, x is on the opposite side of w.
Projection Concept Recap
x = datapoint
w = normal vector to the plane

We can determine which side of the plane x lies on by computing the projection of x onto w (still assuming the plane passes through the origin):

If $w^T x$ is +ve, x is on the same side as w.
If $w^T x$ is -ve, x is on the opposite side of w.

What if $w^T x = 0$? The point lies on the plane.
Linear Classifier
Assuming we have these points, we want to classify them into black and white points.

We need a hyperplane, which is defined by
• the normal vector to the plane, w
• the intercept, b

We want a classifier that is a function of w and b:

x → f(w, b) → y

We define x as the point and y ∈ {-1, 1}:
black point = -1 ; white point = +1

$y = f(x; w, b) = \mathrm{sgn}(w^T x + b)$
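As a small illustration (not from the original slides), this decision rule can be written directly in NumPy; the weight vector, intercept, and sample points below are made-up values:

```python
import numpy as np

def linear_classify(X, w, b):
    """Return +1 (white) or -1 (black) for each row of X using sgn(w.x + b)."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# made-up example values
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0], [-1.0, 1.0]])
print(linear_classify(X, w, b))   # [ 1 -1]
```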
Linear Classifier
Questions:
1. We can have many different hyperplanes.
2. How do we choose w and b so that we have an optimum classifier?

What we need is an evaluation metric to determine a good classifier.

In SVM, we evaluate the margin: we want to maximize the margin between the points and the hyperplane.
Linear Classifier
Tight margin vs. maximized margin (figure).

To find which hyperplane is best, we measure the margin: the distance between the hyperplane and the points that touch the margin.
Compute the margin
We first define our decision boundary function:

$y = f(x; w, b) = \mathrm{sgn}(w^T x + b)$

The two margins and the hyperplane are the sets
$\{x : \langle w, x\rangle + b = +1\}$, $\{x : \langle w, x\rangle + b = -1\}$, and $\{x : \langle w, x\rangle + b = 0\}$.

What we know is:
• +ve value points are on the same side as w
• -ve value points are on the opposite side of w
• zero value points are on the hyperplane
Compute the margin
We need to compute the distance between x and the hyperplane, so we project x onto the hyperplane, giving the foot point x'.

Projection of (x - x') onto w: $\dfrac{w^T(x - x')}{\|w\|}$

We know $\langle w, x'\rangle + b = 0$ (x' lies on the hyperplane), so the distance between x and x' is

$\dfrac{w^T x + b}{\|w\|}$
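A minimal NumPy sketch of this distance formula, together with the margin width $2/\|w\|$ used on the next slide (all values are illustrative only):

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
x = np.array([2.0, 3.0])
print(signed_distance(x, w, b))   # (6 + 12 - 5) / 5 = 2.6
print(2 / np.linalg.norm(w))      # margin width 2/||w|| = 0.4
```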
Compute the margin
For a point $x_1$ on the margin $\langle w, x_1\rangle + b = +1$:  $|x_1 - x'| = \dfrac{|\langle w, x_1\rangle + b|}{\|w\|} = \dfrac{1}{\|w\|}$

For a point $x_2$ on the margin $\langle w, x_2\rangle + b = -1$:  $|x_2 - x'| = \dfrac{|\langle w, x_2\rangle + b|}{\|w\|} = \dfrac{1}{\|w\|}$

Margin width: $|x_1 - x_2| = \dfrac{2}{\|w\|}$

We want to maximize this, which is equivalent to minimizing the inverse, $\dfrac{\|w\|}{2}$, or even better, minimizing the convex form

$\dfrac{\|w\|^2}{2}$   ← objective function

The next thing to take note of is to ensure the points sit on the correct side, so we add a constraint to the objective function.
Objective function

Margins: $\{x : \langle w, x\rangle + b = +1\}$ and $\{x : \langle w, x\rangle + b = -1\}$; hyperplane: $\{x : \langle w, x\rangle + b = 0\}$.

3 scenarios:
1. Correct side but inside the margin: $-1 < \langle w, x_i\rangle + b < 1$
2. Correct side and outside the margin:
   For $y_i = +1$: $\langle w, x_i\rangle + b \geq +1$
   For $y_i = -1$: $\langle w, x_i\rangle + b \leq -1$
3. Wrong side:
   For $y_i = +1$: $\langle w, x_i\rangle + b < 0$
   For $y_i = -1$: $\langle w, x_i\rangle + b > 0$

Only scenario 2 is allowed, and both of its cases can be written compactly as the constraint $y_i(\langle w, x_i\rangle + b) \geq 1$.
To solve the constrained optimization –
Primal Form
Finding the optimal hyperplane is an optimization problem:

$\min_{w,b} \dfrac{\|w\|^2}{2}$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1$

• N + 1 parameters (N: dimension of the data)
• M constraints (M: number of datapoints)
• This is the primal problem.
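For illustration only, the primal problem can be handed to a generic convex solver. The sketch below assumes the cvxpy package is available and uses a tiny made-up, linearly separable data set:

```python
import cvxpy as cp
import numpy as np

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))      # min ||w||^2 / 2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (<w, x_i> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```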
To solve the constrained
optimization – Dual Form
We want to minimize the following, with constraints:

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2}$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq +1$

Such a problem is not trivial; we can use Lagrangian optimization.

Rewrite the constraint as a non-positive inequality:

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2}$  s.t.  $1 - y_i(\langle w, x_i\rangle + b) \leq 0$
To solve the constrained
optimization – Dual Form
$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2}$  s.t.  $1 - y_i(\langle w, x_i\rangle + b) \leq 0$

Introduce M Lagrange multipliers, $\alpha_i \geq 0$, one per constraint. For a problem "minimize $f(\theta)$ s.t. $g_i(\theta) \leq 0$", the Lagrangian is

$L(\theta, \alpha) = f(\theta) + \sum_i \alpha_i\, g_i(\theta)$

and the constrained minimum of $f$ equals

$\min_\theta \max_{\alpha \geq 0} L(\theta, \alpha)$
Lagrange Dual Form
To solve the constrained
optimization
We use M Lagrange multipliers, $\alpha_i$, with $\alpha_i \geq 0$:

$L(w, b, \alpha) = \underbrace{\dfrac{\|w\|^2}{2}}_{\text{maximize the margin}} + \sum_{i=1}^{M} \alpha_i \underbrace{\left(1 - y_i(\langle w, x_i\rangle + b)\right)}_{\text{ensure point is on the correct side}}$

To minimize $L$ over $w, b$ (while maximizing over $\alpha$), we need a stationary point that satisfies

$(w^*, b^*, \alpha^* \geq 0) = \arg \max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)$
To solve the constrained
optimization
We find the partial derivatives of $L(w, b, \alpha)$ and set them to zero:

$\dfrac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Leftrightarrow\; w = \sum_{i=1}^{M} \alpha_i y_i x_i$

$\dfrac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Leftrightarrow\; \sum_{i=1}^{M} \alpha_i y_i = 0$

Plugging these into the Lagrangian yields the dual objective:

$L(\alpha) = \sum_{i=1}^{M} \alpha_i - \dfrac{1}{2} \sum_{i}^{M} \sum_{j}^{M} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$

This is normally solved using the Sequential Minimal Optimization (SMO) algorithm.
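As an alternative illustration (not the SMO algorithm itself), the dual can also be solved with a generic QP solver such as cvxpy; the toy data is made up, and a tiny ridge is added to the Gram matrix purely so the solver accepts it as positive semidefinite:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = len(y)

# Q_ij = y_i y_j <x_i, x_j>
Q = np.outer(y, y) * (X @ X.T)
Q += 1e-9 * np.eye(M)                       # tiny ridge for numerical PSD-ness

alpha = cp.Variable(M)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = np.argmax(a)                           # a point with alpha_i > 0 is a support vector
b = y[sv] - X[sv] @ w                       # from y_i(<w, x_i> + b) = 1
print(a, w, b)
```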
Inequalities to be solved

$\dfrac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Leftrightarrow\; w = \sum_{i=1}^{M} \alpha_i y_i x_i$   (stationarity)

$\dfrac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Leftrightarrow\; \sum_{i=1}^{M} \alpha_i y_i = 0$   (stationarity)

$\alpha_i \geq 0$   (dual feasibility)

$1 - y_i(\langle w, x_i\rangle + b) \leq 0$   (primal feasibility)

$\alpha_i \left(1 - y_i(\langle w, x_i\rangle + b)\right) = 0$   (complementary slackness)

Combining all of these gives the Karush-Kuhn-Tucker (KKT) conditions.
Inequalities to be solved

Must satisfy the Karush-Kuhn-Tucker (KKT) conditions; in particular, complementary slackness requires either $\alpha_i = 0$ or $1 - y_i(\langle w, x_i\rangle + b) = 0$.

2 scenarios:
1. If $\alpha_i \neq 0$, then $1 - y_i(\langle w, x_i\rangle + b)$ must be zero, so the point sits exactly on the margin. These points are the support vectors.
2. Points with $\alpha_i = 0$ are 'ignored' (they do not affect the solution).
Geometrical Interpretation
Points with $\alpha_i = 0$ are NOT considered in defining the hyperplane; only points with $\alpha_i \neq 0$ (the support vectors) are considered in defining the hyperplane.
Decision Function
$y = f(x; w, b) = \mathrm{sgn}(w^T x + b)$

We know

$w = \sum_{i=1}^{M} \alpha_i y_i x_i$

Substituting w into the decision function, the prediction of y is obtained via

$f(x; w, b) = \mathrm{sgn}\left(\sum_{i=1}^{M} \alpha_i y_i \langle x, x_i\rangle + b\right)$

Solve the following (for any support vector $x_i$) to get b:

$1 - y_i \left(\left\langle \sum_{j=1}^{M} \alpha_j y_j x_j,\; x_i\right\rangle + b\right) = 0$
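A sketch (not from the slides) of how scikit-learn exposes these quantities: `SVC.dual_coef_` stores $\alpha_i y_i$ for the support vectors, so the dual decision function above can be reproduced by hand on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i for each support vector
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([1.0, 0.5])
manual = np.sign(np.sum(alpha_y * (sv @ x_new)) + b)   # sgn(sum_i alpha_i y_i <x, x_i> + b)
print(manual, clf.predict(x_new.reshape(1, -1))[0])    # the two should agree
```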
SVM Theory:

Maximize the margin to best classify the two classes:

$(w^*, b^*, \alpha^* \geq 0) = \arg \max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)$

No data point is allowed on the wrong side (hard margin):

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2}$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1$
Non separable cases
Soft margin: give some slack, allow some error.

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2} + C \sum_{i=1}^{n} \xi_i$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0$

The slack variables $\xi_i$ allow some points to fall inside the margin or on the wrong side, while the penalty term $C \sum_i \xi_i$ constrains the number (and amount) of data that are wrong.
Non separable cases
Soft margin: give some slack, allow some error.

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2} + C \sum_{i=1}^{n} \xi_i$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0$

Example (figure): the planes $w \cdot x + b = -1$, $w \cdot x + b = 0$, and $w \cdot x + b = +1$ are drawn; two points violate the constraint with slacks $\xi_i = 1.5$ and $\xi_i = 0.75$, so the total error is

$\sum_{i=1}^{n} \xi_i = 1.5 + 0.75 = 2.25$
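The slack of each point is $\xi_i = \max(0,\, 1 - y_i(\langle w, x_i\rangle + b))$. A small NumPy sketch, with made-up w, b, and points chosen to reproduce the slacks above:

```python
import numpy as np

def slacks(X, y, w, b):
    """xi_i = max(0, 1 - y_i(<w, x_i> + b)) for every data point."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# made-up values: two points end up with xi = 1.5 and xi = 0.75
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[-0.5, 1.0], [0.25, -2.0], [-3.0, 0.0]])
y = np.array([1, 1, -1])

xi = slacks(X, y, w, b)
print(xi, xi.sum())   # [1.5  0.75 0.  ]  2.25
```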
Non separable cases
Soft margin: give some slack, allow some error.

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2} + C \sum_{i=1}^{n} \xi_i$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0$

Small C → softer margin, more errors are allowed.
Big C → approaches the hard margin, (almost) no errors are allowed.
Example
C= 10
All points satisfy the constraint
3 support vectors
Example
C = 3: some points violate the constraint.
C = 2: more points violate the constraint; 10 support vectors.
Example
C = 0.01

$w^* = \min_{w,b} \sum_j \dfrac{w_j^2}{2} + C \sum_{i=1}^{n} \xi_i$  s.t.  $y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0$

Small C → softer margin, more errors are allowed.
Big C → approaches the hard margin, (almost) no errors are allowed.

All points violate the constraint; N support vectors (every point becomes a support vector).
Which C to use?
C        Training error   Testing error   # support vectors
0.01     42.1             30.1            2341
0.1      31.3             16.5            1756
1        22.1             16.2            1230
10       15.2             17.2            1000
100      10.2             17.9            987
1000     1.2              17.8            854
Cross Validation

Which C to use? Plot the validation error against C and choose the C with the lowest validation error.
SKLEARN - SVM
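The original code listing for this slide is not reproduced here; below is a minimal scikit-learn sketch in the same spirit, fitting an SVC and choosing C by cross-validation as discussed above (the synthetic data set is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# search over C with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100, 1000]},
                    cv=5)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_["C"])
print("test accuracy:", grid.score(X_test, y_test))
print("# support vectors:", grid.best_estimator_.n_support_.sum())
```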
Kernel
Actual boundary / decision function (figure in the $x_1$–$x_2$ plane):

$x_1 = x_2^2 + 5$

$x_1 - x_2^2 - 5 = 0 \;\rightarrow\;$ quadratic in x

We can rewrite it in linear form by transforming x to $\Phi(x)$ using basis expansion:

$w \cdot \Phi(x) + b = 0 \;\rightarrow\;$ linear in $\Phi(x)$
Kernel
$x_1 = x_2^2 + 5$, i.e. $x_1 - x_2^2 - 5 = 0 \;\rightarrow\;$ quadratic

$w \cdot \Phi(x) + b = 0 \;\rightarrow\;$ linear

We embedded the 2-dimensional space into a 5-dimensional space, and in the 5-D space the boundary is linear, so we can solve the equation.
Kernel
What if we have d-dimensional data and have to deal with a quadratic boundary?

Features: $x = (x_1, x_2, \ldots, x_d)$

Quadratic boundary: $(x_1, x_2, \ldots, x_d,\; x_1^2, x_2^2, \ldots,\; x_1 x_2, \ldots, x_{d-1} x_d)$

Linear: $\Phi(x) = (x_1, x_2, \ldots, x_d,\; x_1^2, x_2^2, \ldots,\; x_1 x_2, \ldots, x_{d-1} x_d)$
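A sketch of this basis expansion for the 2-dimensional case; scikit-learn's `PolynomialFeatures` produces exactly this kind of mapping (the sample values are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 0.5]])

# all terms up to degree 2 (no constant column; the bias b plays that role)
phi = PolynomialFeatures(degree=2, include_bias=False)
print(phi.fit_transform(X))
# columns: x1, x2, x1^2, x1*x2, x2^2
print(phi.get_feature_names_out(["x1", "x2"]))
```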
Example
What is the dimension of $\Phi(x)$ if $x = (x_1, x_2, x_3)$?

Dimension = 9
Example
What is the dimension of $\Phi(x)$ if $x = (x_1, \ldots, x_d)$?

d linear terms + d squared terms + $\binom{d}{2}$ cross terms

Dimension = $2d + \dfrac{d(d-1)}{2} \;\sim\; O(d^2)$
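A quick check of this count (assuming d = 3, which matches the "Dimension = 9" example above):

```python
from math import comb

def quad_feature_dim(d):
    """2d linear + squared terms plus C(d, 2) cross terms."""
    return 2 * d + comb(d, 2)

print(quad_feature_dim(3))    # 9
print(quad_feature_dim(100))  # 5150, grows like O(d^2)
```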
Kernel in SVM
Recall SVM loss function
$L(w, b, \alpha) = \sum_{i=1}^{M} \alpha_i - \dfrac{1}{2} \sum_{i}^{M} \sum_{j}^{M} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$

Replace $x_i^T x_j$ with $\Phi(x_i)^T \Phi(x_j)$:

$L(w, b, \alpha) = \sum_{i=1}^{M} \alpha_i - \dfrac{1}{2} \sum_{i}^{M} \sum_{j}^{M} \alpha_i \alpha_j y_i y_j\, \Phi(x_i)^T \Phi(x_j)$
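In practice $\Phi$ is never formed explicitly; the kernel value $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j)\rangle$ is plugged in directly (the kernel trick). A scikit-learn sketch on made-up data, comparing the built-in RBF kernel with an equivalent precomputed Gram matrix:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# built-in RBF kernel
clf_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# same model with an explicitly precomputed Gram matrix K_ij = K(x_i, x_j)
K = rbf_kernel(X, X, gamma=1.0)
clf_pre = SVC(kernel="precomputed", C=1.0).fit(K, y)

print(clf_rbf.score(X, y))
print(clf_pre.score(rbf_kernel(X, X, gamma=1.0), y))   # identical decision rule
```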
Kernel in SVM
Pros and Cons
Pros:
• Good linear classifier because it finds the best decision boundary (consider the hard-margin case).
• Easy to transform into a non-linear model.

Cons:
• Not suited for large datasets; computational cost ~ O(d²).