Understanding Support Vector Machines
6/22/2024

Support Vector Machines

Logistic Regression


Max margin classification


Instead of fitting all the points, focus on boundary points
Aim: learn a boundary that leads to the largest margin
(buffer) from points on both sides

Why: intuition; theoretical support; and works well in practice


The subset of training points that support (i.e. determine) the boundary are
called the support vectors

Linear SVM
Max margin classifier: inputs that fall inside the margin are of unknown class
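As an illustrative sketch (not from the slides), scikit-learn's SVC exposes the support vectors directly; the toy data and the very large C (to approximate a hard margin) are assumptions for illustration:

```python
# Sketch: fit a (near) hard-margin linear SVM and inspect the
# support vectors -- the boundary points that determine the separator.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters in 2-D.
X = np.array([[0.0, 0.0], [0.5, 0.5], [0.0, 1.0],
              [3.0, 3.0], [3.5, 2.5], [4.0, 4.0]])
t = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, t)

# Only a subset of the training points end up as support vectors.
print(clf.support_vectors_)
print(clf.predict([[1.0, 1.0], [3.5, 3.5]]))
```

Points far from the boundary can be moved or removed without changing the learned separator; only the support vectors matter.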


Maximizing the Margin


First note that the weight vector w is orthogonal to the +1 plane:
if u and v are two points on that plane, then wᵀ(u − v) = 0.
The same is true for the −1 plane.

Also, for a point x+ on the +1 plane and the nearest point x− on the −1 plane:

x+ = λw + x−


Computing the Margin


Define the margin M to be the distance between the +1 and −1 planes.

We can now express this in terms of w: substituting x+ = λw + x− into the two
plane equations gives M = 2/||w||,

so to maximize the margin we minimize the length of w
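A quick numeric sanity check of this claim (the particular w, b, and starting point are assumptions chosen for illustration):

```python
# Sketch verifying that the distance between the planes w·x + b = +1 and
# w·x + b = -1 equals 2 / ||w||, for an arbitrary w and b.
import numpy as np

w = np.array([3.0, 4.0])                # ||w|| = 5
b = 2.0

# Pick any point x_minus on the -1 plane, then step along the unit
# normal w/||w||; if the claim holds, a step of length 2/||w||
# lands exactly on the +1 plane.
x_minus = np.array([0.0, -0.75])        # w·x_minus + b = -1
u = w / np.linalg.norm(w)               # unit normal to both planes
M = 2.0 / np.linalg.norm(w)             # claimed margin
x_plus = x_minus + M * u

print(np.isclose(w @ x_minus + b, -1.0))   # → True
print(np.isclose(w @ x_plus + b, 1.0))     # → True
```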

Learning a Margin-Based Classifier


We can search for the optimal parameters (w and b) by
finding a solution that:
1. Correctly classifies the training examples {xi, ti}, i = 1, …, n
2. Maximizes the margin (the same as minimizing wᵀw):

min_{w,b} (1/2)||w||²   s.t.   (wᵀxi + b)ti ≥ 1 ∀i

Can optimize via projected gradient descent, etc.

Apply Lagrange multipliers: formulate equivalent problem


Learning a Linear SVM


Convert the constrained minimization to an unconstrained
optimization problem: represent constraints as penalty
terms:

For data {xi, ti}, use the following penalty term:

penalty_i = { 0 if (wᵀxi + b)ti ≥ 1;  ∞ otherwise } = max_{αi ≥ 0} αi[1 − (wᵀxi + b)ti]

Rewrite the minimization problem:

min_{w,b} { (1/2)||w||² + Σ_{i=1..n} max_{αi ≥ 0} αi[1 − (wᵀxi + b)ti] }

= min_{w,b} max_{αi ≥ 0} { (1/2)||w||² + Σ_{i=1..n} αi[1 − (wᵀxi + b)ti] }

where the {αi} are the Lagrange multipliers
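This penalty is the familiar hinge loss, and the objective can be optimized by simple (sub)gradient descent. A minimal sketch, not from the slides: the finite penalty weight C and the toy data are assumptions (the slides' hard-margin penalty is the C → ∞ limit):

```python
# (Sub)gradient descent sketch for the penalized objective
#   min_{w,b}  1/2 ||w||^2 + C * sum_i max(0, 1 - (w·x_i + b) t_i)
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
t = np.array([-1] * 20 + [1] * 20)

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = (X @ w + b) * t
    viol = margins < 1                       # points violating the margin
    # Subgradient of the hinge term is -t_i x_i on violating points.
    grad_w = w - C * (t[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * t[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == t)
print(acc)    # should approach 1.0 on this well-separated toy set
```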

What if data is not linearly separable?


• Introduce slack variables ξi, subject to the constraints (for all i):

ti(w · xi + b) ≥ 1 − ξi
ξi ≥ 0

• If an example lies on the wrong side of the hyperplane then ξi > 1, so Σi ξi
is an upper bound on the number of training errors

• λ trades off training error versus model complexity

• This is known as the soft-margin extension
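A sketch of the soft-margin trade-off in scikit-learn, whose C parameter plays roughly the inverse role of the λ above (the toy overlapping data is an assumption):

```python
# Soft-margin sketch: scikit-learn's C penalizes slack; a small C
# tolerates many margin violations, a large C penalizes them hard.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)          # overlapping classes

loose = SVC(kernel="linear", C=0.01).fit(X, t)   # many slacks allowed
tight = SVC(kernel="linear", C=100.0).fit(X, t)  # slacks heavily penalized

# A looser margin typically retains more support vectors.
print(len(loose.support_), len(tight.support_))
```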


Non-linear SVMs: Feature spaces

• General idea: the original feature space can always


be mapped to some higher-dimensional feature
space where the training set is separable:

Φ: x → φ(x)

Learning via Quadratic Programming


• The optimal separating hyperplane can be found by solving

argmax_{αj ≥ 0} ( Σj αj − (1/2) Σ_{j,k} αj αk yj yk (xj · xk) )

where the (xj, yj) are the training samples
- This is a quadratic function of the αj
- Once the αj are found, the weight vector is

w = Σj αj yj xj

and the decision function is h(x) = sign( Σj αj yj (x · xj) + b )

• This optimization problem can be solved by quadratic programming
QP is a well-studied class of optimization algorithms to maximize a quadratic
function subject to linear constraints
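A small sketch of solving this dual with a generic solver (SciPy's SLSQP here; a dedicated QP solver would be the usual choice). The equality constraint Σj αj yj = 0, which arises once the bias b is optimized, is added here as an assumption since the slide lists only αj ≥ 0:

```python
# Sketch: solve the dual QP for a tiny separable dataset, then recover
# the weight vector w = Σ_j α_j y_j x_j and the decision function.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = X @ X.T                          # Gram matrix of dot products
Q = (y[:, None] * y[None, :]) * K

def neg_dual(a):                     # minimize the negated dual objective
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

w = (alpha * y) @ X                  # weight vector from the multipliers
sv = alpha > 1e-6                    # support vectors have α_j > 0
b = np.mean(y[sv] - X[sv] @ w)       # b from the margin conditions
print(np.sign(X @ w + b))            # should reproduce the labels y
```

Note that only the support vectors end up with nonzero αj, matching the earlier claim that boundary points alone determine the separator.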


Non-linear decision boundaries


• Note that both the learning objective and the decision function depend only
on dot products between patterns

• How to form non-linear decision boundaries in input space?

1. Map data into feature space

2. Replace dot products between inputs with dot products between feature vectors:

x i ⋅ x j → φ(x i )⋅ φ(xj )

3. Find a linear decision boundary in feature space

• Problem: what is a good feature function φ(x)?

Kernel trick + QP
• The max margin classifier can be found by solving

argmax_{αj ≥ 0} ( Σj αj − (1/2) Σ_{j,k} αj αk yj yk (φ(xj) · φ(xk)) )
= argmax_{αj ≥ 0} ( Σj αj − (1/2) Σ_{j,k} αj αk yj yk K(xj, xk) )

• the weight vector (no need to compute and store it):

w = Σj αj yj φ(xj)

• the decision function is

h(x) = sign( Σj αj yj (φ(x) · φ(xj)) + b ) = sign( Σj αj yj K(x, xj) + b )

Copyright © 2001, 2003, Andrew W. Moore


Kernel Trick
• Kernel trick: dot-products in feature space can be
computed as a kernel function
φ (x i)⋅ φ (x j ) = K (x i , x j )

• Idea: work directly on x, avoid having to compute ϕ(x)


Kernels
Examples of kernels (kernels measure similarity):
1. Polynomial: K(x1, x2) = (x1 · x2 + 1)²
2. Gaussian: K(x1, x2) = exp(−||x1 − x2||² / 2σ²)
3. Sigmoid: K(x1, x2) = tanh(κ(x1 · x2) + a)

Each kernel computation corresponds to a dot-product calculation for a
particular mapping φ(x): it implicitly maps to a high-dimensional space
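For the degree-2 polynomial kernel, the implicit map can be written out explicitly for 2-D inputs. The hand-expanded φ below is an assumption for illustration; the check confirms K(x1, x2) = φ(x1) · φ(x2):

```python
# Check that the degree-2 polynomial kernel (x·z + 1)^2 is a dot product
# in an explicit 6-dimensional feature space (for 2-D inputs).
import numpy as np

def poly_kernel(x, z):
    return (x @ z + 1.0) ** 2

def phi(x):
    # Expanding (ac + bd + 1)^2 for x = (a, b), z = (c, d) yields the
    # monomials 1, 2ac, 2bd, a²c², b²d², 2abcd -- i.e. this feature map:
    a, b = x
    return np.array([1.0,
                     np.sqrt(2) * a, np.sqrt(2) * b,
                     a * a, b * b,
                     np.sqrt(2) * a * b])

x1 = np.array([0.3, -1.2])
x2 = np.array([2.0, 0.5])
print(np.isclose(poly_kernel(x1, x2), phi(x1) @ phi(x2)))   # → True
```

One kernel evaluation costs a single 2-D dot product, yet it equals a dot product in the 6-D space; for higher degrees and dimensions the saving grows rapidly.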

Why is this useful?


1. Rewrite training examples using more complex features
2. Dataset not linearly separable in original space may be
linearly separable in higher dimensional space


Input transformation
Mapping to a feature space can produce problems:
• High computational burden due to high dimensionality
• Many more parameters

SVM solves these two issues simultaneously


• Kernel trick produces efficient classification
• Dual formulation only assigns parameters to samples, not
features

Doing multi-class classification


• Basic SVMs can only handle two-class outputs (i.e. a categorical output
variable with arity 2).
• To extend to output arity N, learn N SVMs:
– SVM 1 learns “Output==1” vs “Output != 1”
– SVM 2 learns “Output==2” vs “Output != 2”
– …
– SVM N learns “Output==N” vs “Output != N”
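A sketch of this one-vs-rest scheme, spelled out manually (scikit-learn's SVC handles multi-class internally; the three toy blobs are assumptions):

```python
# One-vs-rest sketch: train N binary SVMs, then predict the class whose
# SVM gives the largest decision value.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
centers = np.array([[0, 0], [5, 0], [0, 5]])
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# SVM k learns "Output == k" vs "Output != k".
models = [LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in range(3)]

def predict(Xnew):
    # Stack each binary SVM's decision value; pick the most confident one.
    scores = np.column_stack([m.decision_function(Xnew) for m in models])
    return scores.argmax(axis=1)

print(np.mean(predict(X) == y))   # training accuracy on separable blobs
```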


Summary
Advantages:
• Kernels allow very flexible hypotheses
• Soft-margin extension permits misclassified examples
• Excellent results

Disadvantages:
• Must choose kernel parameters
• Very large problems computationally intractable

