Understanding Support Vector Machines

Chapter 5 – Support Vector Machine

Prepared by: Shier Nee, SAW


Support Vector Machine
• Supervised learning models with associated learning algorithms
  that analyze data for classification and regression analysis.
• The original SVM algorithm was invented by Vladimir N. Vapnik and
  Alexey Ya. Chervonenkis in 1963.
Support Vector Machine
• Robust algorithm for classification problems:
  • Medical problems
  • Text and hypertext categorization
  • Image classification
• Advantages:
  • SVM depends on convex optimization
  • Easy to use
  • Excellent performance on different types of datasets
Projection Concept Recap
x = data point
w = projection vector

To compute the projection of x onto w, assuming intercept = 0:

  wᵀx = |w| |x| cos θ

• If θ < 90°, wᵀx is +ve → x is on the same side as w
• If θ > 90°, wᵀx is −ve → x is on the opposite side of w
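The sign rule above can be checked numerically; a minimal sketch (vectors chosen for illustration):

```python
import numpy as np

w = np.array([1.0, 2.0])        # projection vector
x_same = np.array([2.0, 1.0])   # roughly aligned with w
x_opp = -x_same                 # points the other way

# w^T x = |w||x| cos(theta): the sign of the dot product is the sign of cos(theta)
print(np.dot(w, x_same))  # positive -> same side as w
print(np.dot(w, x_opp))   # negative -> opposite side
```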
Projection Concept Recap
x = data point
w = normal projection vector from the plane

We can determine which side of the plane x is on by
computing the projection of x onto w:

• If wᵀx is +ve → x is on the same side as w
• If wᵀx is −ve → x is on the opposite side of w

What if wᵀx = 0?
 → The point is on the plane.
Linear Classifier
Assuming we have these points, we want to
classify them into black and white points.

We need a hyperplane, which is defined by
• the normal vector to the plane, w
• the intercept, b

We want a classifier that is a function of w and b:

  x → f(w, b) → y

We define x as the point and y ∈ {−1, 1}:
black point = −1 ; white point = +1

  y = f(x; w, b) = sgn(wᵀx + b)
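The decision function y = sgn(wᵀx + b) is a one-liner; a minimal sketch with illustrative w and b:

```python
import numpy as np

def linear_classifier(x, w, b):
    """y = sgn(w^T x + b); returns -1 (black) or +1 (white)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# illustrative hyperplane: x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(linear_classifier(np.array([2.0, 2.0]), w, b))  # +1
print(linear_classifier(np.array([0.0, 0.0]), w, b))  # -1
```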
Linear Classifier
Questions:
1. We can have many different hyperplanes.
2. How do we choose w and b so that we
   have an optimum classifier?

What we need is an evaluation metric to
determine a good classifier.

In SVM, we evaluate the margin: we
want to maximize the margin between the
points and the hyperplane.
Linear Classifier
Tight margin vs. maximized margin

To find which margin is the best, we need to measure the margin: the distance
between the points that touch the margin and the plane.
Compute the margin
We first define our decision boundary function:

  y = f(x; w, b) = sgn(wᵀx + b)

The hyperplane and the two margins:

  {x : ⟨w, x⟩ + b = +1}
  {x : ⟨w, x⟩ + b = 0}
  {x : ⟨w, x⟩ + b = −1}

What we know is:
• +ve value → the point is on the same side as w
• −ve value → the point is on the opposite side of w
• zero value → the point is on the hyperplane
Compute the margin
We need to compute the distance between x and the
hyperplane.

So we need the projection of x onto the hyperplane → x′

Projection of (x − x′) onto w:  wᵀ(x − x′) / |w|

We know x′ is on the hyperplane, so ⟨w, x′⟩ + b = 0, and therefore

Projection of (x − x′) onto w = (⟨w, x⟩ + b) / |w|

The distance between x and x′ is |⟨w, x⟩ + b| / |w|
Compute the margin
For a point x₁ on the margin ⟨w, x₁⟩ + b = +1:

  |x₁ − x′| = |⟨w, x₁⟩ + b| / |w| = 1 / |w|

For a point x₂ on the margin ⟨w, x₂⟩ + b = −1:

  |x₂ − x′| = |⟨w, x₂⟩ + b| / |w| = 1 / |w|

So the margin width is  |x₁ − x₂| = 2 / |w|

We want to maximize this.

Equivalent to minimizing the inverse, |w| / 2

Or even better, to minimizing the convex form, |w|² / 2

Objective function
Or even better, to minimizing the convex form, |w|² / 2

The next thing to take note of is to ensure the points sit on the correct side.

Add a constraint to the objective function:  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1
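The margin width 2/|w| can be verified numerically; a small sketch with an illustrative w:

```python
import numpy as np

# hyperplane <w, x> + b = 0 with illustrative w = (3, 4), b = 0
w = np.array([3.0, 4.0])

# points touching the +1 and -1 margins: <w, x> = +/-1
x1 = w / np.dot(w, w)    # <w, x1> = +1
x2 = -w / np.dot(w, w)   # <w, x2> = -1

margin_width = np.linalg.norm(x1 - x2)
print(margin_width)                 # 2 / |w| = 2 / 5 = 0.4
print(2 / np.linalg.norm(w))        # 0.4, same value
```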


Objective function
3 scenarios (margins: ⟨w, x⟩ + b = −1, 0, +1):
1. Correct side but inside the margin:
   0 ≤ yᵢ(⟨w, xᵢ⟩ + b) < 1
2. Correct side and outside the margin:
   yᵢ(⟨w, xᵢ⟩ + b) ≥ 1
3. Wrong side:
   yᵢ(⟨w, xᵢ⟩ + b) < 0

The constraint we impose:  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 for all i
To solve the constrained optimization –
Primal Form
Finding the optimal hyperplane is an optimization problem:

  min over w, b of  |w|² / 2
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1

• N + 1 parameters (N: dimension of the data)
• M constraints (M: number of data points)
• Primal problem
To solve the constrained
optimization – Dual Form
We want to minimize the following with constraints:

  w* = argmin over w, b of  Σⱼ wⱼ² / 2   s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1

Such a problem is not trivial; we can use Lagrangian optimization.

Rewrite the constraint so that it is a non-positive inequality:

  w* = argmin over w, b of  Σⱼ wⱼ² / 2   s.t.  1 − yᵢ(⟨w, xᵢ⟩ + b) ≤ 0
To solve the constrained
optimization – Dual Form
  w* = argmin over w, b of  Σⱼ wⱼ² / 2   s.t.  1 − yᵢ(⟨w, xᵢ⟩ + b) ≤ 0

Introduce M Lagrange multipliers, αᵢ ≥ 0, and
minimize over (w, b) while maximizing over α:

  L(w, b, α) = |w|² / 2 + Σᵢ₌₁ᴹ αᵢ (1 − yᵢ(⟨w, xᵢ⟩ + b))

Lagrange Dual Form
To solve the constrained
optimization
We use M Lagrange multipliers, αᵢ, with αᵢ ≥ 0:

  L(w, b, α) = |w|² / 2 + Σᵢ₌₁ᴹ αᵢ (1 − yᵢ(⟨w, xᵢ⟩ + b))

• first term: maximize the margin
• second term: ensure points are on the correct side

To solve, we need to find a saddle point that satisfies

  (w*, b*, α*) = argmin over w, b of  max over α ≥ 0 of  L(w, b, α)
To solve the constrained
optimization
We find the partial derivatives of L:

  ∂L(w, b, α)/∂w = 0  ⇔  w = Σᵢ₌₁ᴹ αᵢ yᵢ xᵢ

  ∂L(w, b, α)/∂b = 0  ⇔  Σᵢ₌₁ᴹ αᵢ yᵢ = 0

Plugging these into the Lagrangian yields the following:

  L(α) = Σᵢ₌₁ᴹ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀ xⱼ

This is normally solved using the Sequential Minimal Optimization (SMO) algorithm.
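As a sketch of what SMO ultimately computes, the dual can also be solved directly as a small constrained QP; here a generic solver (scipy's SLSQP, not SMO itself) is used on an illustrative four-point dataset:

```python
import numpy as np
from scipy.optimize import minimize

# toy dataset (illustrative): two classes in 2D
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = len(y)

K = X @ X.T                              # Gram matrix x_i^T x_j
Q = (y[:, None] * y[None, :]) * K

def neg_dual(a):                          # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(M), method="SLSQP",
               bounds=[(0, None)] * M,                      # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
print(np.round(w, 3))                     # approx (0.25, 0.25) for this toy set
```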
Inequalities to solve
Stationarity:

  ∂L(w, b, α)/∂w = 0  ⇔  w = Σᵢ₌₁ᴹ αᵢ yᵢ xᵢ

  ∂L(w, b, α)/∂b = 0  ⇔  Σᵢ₌₁ᴹ αᵢ yᵢ = 0

Dual feasibility:  αᵢ ≥ 0

Primal feasibility:  1 − yᵢ(⟨w, xᵢ⟩ + b) ≤ 0

Complementary slackness:  αᵢ (1 − yᵢ(⟨w, xᵢ⟩ + b)) = 0

Combined, these are the Karush-Kuhn-Tucker (KKT) conditions.
Inequalities to solve
The solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions.

2 scenarios: either αᵢ = 0 or 1 − yᵢ(⟨w, xᵢ⟩ + b) = 0.

1. αᵢ ≠ 0: the constraint must be zero → the point sits on the margin → these points
   are considered support vectors.

2. Points with αᵢ = 0 will be 'ignored'.

Geometrical Interpretation
Points with α = 0 will NOT be considered in defining the hyperplane;
points with α ≠ 0 WILL be considered in defining the hyperplane.
Decision Function
Decision Function:

  y = f(x; w, b) = sgn(wᵀx + b)

We know

  w = Σᵢ₌₁ᴹ αᵢ yᵢ xᵢ

The prediction of y can be obtained by substituting w into the decision function:

  f(x; w, b) = sgn( Σᵢ₌₁ᴹ αᵢ yᵢ ⟨x, xᵢ⟩ + b )

Solve the following for any support vector xᵢ (αᵢ ≠ 0) to get b:

  1 − yᵢ ( ⟨ Σⱼ₌₁ᴹ αⱼ yⱼ xⱼ , xᵢ ⟩ + b ) = 0
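A minimal sketch of recovering b and predicting from the dual solution; the support vectors and α values below are illustrative (a hand-solved two-point problem):

```python
import numpy as np

# illustrative support vectors and multipliers for a toy problem:
# x1 = (2, 2) with y = +1 and x2 = (-2, -2) with y = -1, alpha = 0.0625 each
X_sv = np.array([[2.0, 2.0], [-2.0, -2.0]])
y_sv = np.array([1.0, -1.0])
alpha = np.array([0.0625, 0.0625])

w = (alpha * y_sv) @ X_sv           # w = sum_i alpha_i y_i x_i
b = y_sv[0] - np.dot(w, X_sv[0])    # from 1 - y_i(<w, x_i> + b) = 0

def predict(x):
    # f(x) = sgn(sum_i alpha_i y_i <x, x_i> + b)
    return np.sign((alpha * y_sv) @ (X_sv @ x) + b)

print(w, b)                           # w = [0.25 0.25], b = 0.0
print(predict(np.array([4.0, 1.0])))  # 1.0
```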
SVM Theory :
Maximize the margin to best classify the two classes.

Maximize the margin:

  (w*, b*, α*) = argmin over w, b of  max over α ≥ 0 of  L(w, b, α)

No data point is allowed to be on the wrong side → Hard margin

  w* = argmin over w, b of  Σⱼ wⱼ² / 2   s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1
Non separable cases
Soft Margin → Give some slack ξᵢ, allow some error

  w* = argmin over w, b of  Σⱼ wⱼ² / 2  +  C Σᵢ₌₁ⁿ ξᵢ
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ ,  ξᵢ ≥ 0

• the ξᵢ term allows some slack
• the C term constrains the number of data points that are wrong
Non separable cases
Soft Margin → Give some slack, allow some error

  w* = argmin over w, b of  Σⱼ wⱼ² / 2  +  C Σᵢ₌₁ⁿ ξᵢ
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ ,  ξᵢ ≥ 0

Example: two misclassified points with ξ₁ = 1.5 and ξ₂ = 0.75.
The total error is

  Σᵢ₌₁ⁿ ξᵢ = 1.5 + 0.75 = 2.25

(hyperplanes: w·x + b = −1, 0, +1)
Non separable cases
Soft Margin → Give some slack, allow some error

  w* = argmin over w, b of  Σⱼ wⱼ² / 2  +  C Σᵢ₌₁ⁿ ξᵢ
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ ,  ξᵢ ≥ 0

Small C → soft margin → allows more error
Big C → hard margin → allows NO error
Example
C = 10
• All points satisfy the constraint
• 3 support vectors
Example
C = 3: some points violate the constraint.
C = 2: more points violate the constraint; 10 support vectors.
Example
C = 0.01

  w* = argmin over w, b of  Σⱼ wⱼ² / 2  +  C Σᵢ₌₁ⁿ ξᵢ
  s.t.  yᵢ(⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ ,  ξᵢ ≥ 0

Small C → soft margin → allows more error
Big C → hard margin → allows NO error

• All points violate the constraint
• N support vectors
Which C to use?

  C      Training error   Testing error   # support vectors
  0.01   42.1             30.1            2341
  0.1    31.3             16.5            1756
  1      22.1             16.2            1230
  10     15.2             17.2            1000
  100    10.2             17.9             987
  1000    1.2             17.8             854
Cross Validation
Which C to use?
Plot the validation error against C and choose the C with the
lowest validation error.
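The C search above is typically automated with cross-validation; a sketch using scikit-learn's GridSearchCV on synthetic data (the dataset and the C grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# search over C with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100, 1000]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_["C"], grid.best_score_)
```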
SKLEARN - SVM
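A minimal scikit-learn usage sketch (the toy data is illustrative); `support_vectors_` exposes the points with α ≠ 0:

```python
import numpy as np
from sklearn.svm import SVC

# small linearly separable toy set
X = np.array([[2, 2], [3, 3], [-2, -2], [-3, -3]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=10.0)
clf.fit(X, y)

print(clf.support_vectors_)        # points with alpha != 0
print(clf.coef_, clf.intercept_)   # w and b of the hyperplane
print(clf.predict([[4.0, 1.0]]))
```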
Kernel
Actual boundary decision function:

  x₁ = x₂² + 5 , i.e.  x₁ − x₂² − 5 = 0  → Quadratic

We can rewrite it in linear form, by transforming x
using basis expansion:

  w · Φ(x) + b = 0  → Linear
Kernel
  x₁ − x₂² − 5 = 0  → Quadratic

  w · Φ(x) + b = 0  → Linear

We embedded the 2-dimensional space into a 5-dimensional space.

And in that 5-D space the boundary is linear, so we can solve the equation.
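The basis expansion can be written out directly; a sketch for the quadratic boundary above (the 5-D expansion Φ(x) = (x₁, x₂, x₁², x₂², x₁x₂) follows the general form given on the next slide):

```python
import numpy as np

def phi(x):
    """Quadratic basis expansion of a 2-D point into 5-D."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# the boundary x1 - x2^2 - 5 = 0 becomes w . phi(x) + b = 0 with
w = np.array([1.0, 0.0, 0.0, -1.0, 0.0])  # picks out x1 and -x2^2
b = -5.0

x_on = np.array([9.0, 2.0])      # 9 - 2^2 - 5 = 0, on the boundary
print(np.dot(w, phi(x_on)) + b)  # 0.0
```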


Kernel
What if we have d dimensions and deal with a quadratic boundary?

  Features:  x = (x₁, x₂, …, x_d)

  Quadratic boundary:  (x₁, x₂, x₁², x₂², x₁x₂, …, x_{d−1}x_d)

  Linear:  Φ(x) = (x₁, x₂, x₁², x₂², x₁x₂, …, x_{d−1}x_d)
Example
What is the dimension of Φ(x) if d = 3?

Dimension = 9
Example
What is the dimension of Φ(x) for general d?

  d linear terms
  d squared terms
  (d choose 2) cross terms

Dimension = 2d + d(d−1)/2
~ O(d²)
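The dimension formula can be checked by counting the terms directly:

```python
from itertools import combinations

def phi_dim(d):
    """Dimension of the quadratic basis expansion: 2d + d(d-1)/2."""
    linear = d
    squares = d
    cross = len(list(combinations(range(d), 2)))  # d choose 2 cross terms
    return linear + squares + cross

print(phi_dim(3))   # 9, matching the worked example
print(phi_dim(10))  # 65
```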
Kernel in SVM
Recall the SVM dual objective:

  L(α) = Σᵢ₌₁ᴹ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀ xⱼ

Replace xᵢᵀ xⱼ with Φ(xᵢ)ᵀ Φ(xⱼ):

  L(α) = Σᵢ₌₁ᴹ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ Φ(xᵢ)ᵀ Φ(xⱼ)
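The payoff of this replacement is the kernel trick: K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ) can often be evaluated without ever constructing Φ. A standard illustration (not from the slides), using Φ(x) = (x₁², x₂², √2·x₁x₂) whose inner product equals (xᵀz)²:

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def kernel(x, z):
    # same inner product computed directly, without building phi
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(np.dot(phi(x), phi(z)))  # 121.0
print(kernel(x, z))            # 121.0, identical
```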
Pros and Cons

Pros:
• Good linear classifier because it finds the best decision boundary
  (consider the hard margin case)
• Easy to transform into a non-linear model

Cons:
• Not suited for large datasets
• Computational cost ~ O(d²)