Understanding Support Vector Machines

This lecture discusses Support Vector Machines (SVMs), focusing on their ability to find optimal linear separators with the largest margin, which enhances generalization and robustness to outliers. Key concepts include the use of optimization techniques, the introduction of slack variables for non-linearly separable data, and the kernel trick to efficiently handle high-dimensional feature spaces. The lecture also covers the dual formulation of SVMs and various kernel functions that can be employed.


Support vector machines (SVMs)

Lecture 2

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin
Geometry of linear separators
(see blackboard)
A plane can be specified as the set of all points given by:

    x = x0 + α u + β v

where x0 is a vector from the origin to a point in the plane, and u, v are two non-parallel directions in the plane.

Alternatively, it can be specified by a normal vector (we will call this w): a point x lies in the plane exactly when the dot product w · x equals a fixed scalar. We only need to specify this scalar (we will call it the offset, b):

    w · x = b

Barber, Section 29.1.1-4


Linear Separators
 If training data is linearly separable, the perceptron is guaranteed to find some linear separator
 Which of these is optimal?
Support Vector Machine (SVM)
 SVMs (Vapnik, 1990’s) choose the linear separator with the largest margin

Robust to outliers!

V. Vapnik

• Good according to intuition, theory, practice

• SVM became famous when, using images as input, it gave accuracy comparable to a neural network with hand-designed features on a handwriting recognition task
Support vector machines: 3 key ideas

1. Use optimization to find a solution (i.e., a hyperplane) with few errors

2. Seek a large margin separator to improve generalization

3. Use the kernel trick to make large feature spaces computationally efficient
Finding a perfect classifier (when one exists)
using linear programming
For every data point (xt, yt), enforce the constraint

    w · xt + b ≥ +1   for yt = +1,

and

    w · xt + b ≤ -1   for yt = -1.

Equivalently, we want to satisfy all of the linear constraints

    yt (w · xt + b) ≥ 1.

This linear program can be efficiently solved using algorithms such as simplex, interior point, or ellipsoid.
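As a sketch, the feasibility problem above can be handed to an off-the-shelf LP solver. Here scipy's `linprog` is used with a zero objective; the dataset is made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical linearly separable 2-D dataset (values made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables z = (w1, w2, b).  Each constraint y_t (w·x_t + b) >= 1 becomes
# -y_t * [x_t, 1] · z <= -1; the objective is zero (pure feasibility).
A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
b_ub = -np.ones(len(X))
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)

w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)
print(res.success, margins.min())
```

Any feasible (w, b) returned this way separates the data with functional margin at least 1.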
Finding a perfect classifier (when one exists)
using linear programming

Weight space

Example of a 2-dimensional linear programming (feasibility) problem. For SVMs, each data point gives one inequality:

    yt (w · xt + b) ≥ 1

What happens if the data set is not linearly separable?


Minimizing number of errors (0-1 loss)
• Try to find weights that violate as few constraints as possible? Formalize this using the 0-1 loss:

    min_{w,b} Σj ℓ0,1(yj, w · xj + b)

  where ℓ0,1(y, ŷ) = 1[y ≠ sign(ŷ)]

• Unfortunately, minimizing the 0-1 loss is NP-hard in the worst case
  – Non-starter. We need another approach.
Key idea #1: Allow for slack

Minimize Σj ξj subject to

    yj (w · xj + b) ≥ 1 - ξj,   ξj ≥ 0   (“slack variables”)

We now have a linear program again, and can efficiently find its optimum.

For each data point:
• If functional margin ≥ 1, don’t care
• If functional margin < 1, pay linear penalty
Key idea #1: Allow for slack

Minimize Σj ξj subject to

    yj (w · xj + b) ≥ 1 - ξj,   ξj ≥ 0   (“slack variables”)

What is the optimal value ξj* as a function of w* and b*?

If yj (w · xj + b) ≥ 1, then ξj = 0.
If yj (w · xj + b) < 1, then ξj = 1 - yj (w · xj + b).

Sometimes written as ξj = [1 - yj (w · xj + b)]+
Equivalent hinge loss formulation

    min_{w,b,ξ} Σj ξj  subject to  yj (w · xj + b) ≥ 1 - ξj,  ξj ≥ 0

Substituting the optimal ξj into the objective, we get:

    min_{w,b} Σj max(0, 1 - yj (w · xj + b))

The hinge loss is defined as ℓhinge(y, ŷ) = max(0, 1 - ŷ y), so this is

    min_{w,b} Σj ℓhinge(yj, w · xj + b)

This is empirical risk minimization, using the hinge loss.
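A quick numeric check of the substitution above: for any fixed (w, b), the optimal slack for each point equals its hinge loss. The weights and data below are made up for illustration:

```python
import numpy as np

# For fixed (w, b), the optimal slack is exactly the hinge loss:
#     xi_j* = max(0, 1 - y_j (w·x_j + b)).
# The weights and data below are made up for illustration.
X = np.array([[2.0, 1.0], [0.2, 0.1], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

functional_margin = y * (X @ w + b)           # y_j (w·x_j + b)
xi = np.maximum(0.0, 1.0 - functional_margin)  # optimal slack per point
print(xi)
```

Points with functional margin at least 1 get zero slack; only the point inside the margin pays a penalty.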
Hinge loss vs. 0/1 loss

Hinge loss:  ℓhinge(y, ŷ) = max(0, 1 - ŷ y)

0-1 loss:  ℓ0,1(y, ŷ) = 1[y ≠ sign(ŷ)]

The hinge loss upper bounds the 0/1 loss! It is the tightest convex upper bound on the 0/1 loss.
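The upper-bound claim can be verified numerically for a positive example (y = +1) across a grid of raw scores ŷ:

```python
import numpy as np

# For a positive example (y = +1), compare the two losses across raw
# scores yhat: the hinge loss should dominate the 0/1 loss everywhere.
yhat = np.linspace(-3, 3, 201)
hinge = np.maximum(0.0, 1.0 - yhat)              # max(0, 1 - y*yhat), y = +1
zero_one = (np.sign(yhat) != 1).astype(float)    # 1[y != sign(yhat)]
print(bool(np.all(hinge >= zero_one)))
```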
Key idea #2: seek large margin

• Suppose again that the data is linearly separable and we are solving a feasibility problem, with constraints yj (w · xj + b) ≥ 1

• If the length of the weight vector ||w|| is too small, the optimization problem is infeasible! Why? As ||w|| (and |b|) get smaller, the lines w·x + b = +1 and w·x + b = -1 move farther apart, until data points fall strictly between them and no assignment satisfies all of the constraints.
What is γ (the geometric margin) as a function of w?

    γi = distance from the i’th data point to the separator
       = yi (w · xi + b) / ||w||

    γ = mini γi

So, assuming there is a data point on the w·x + b = +1 or -1 line,

    γ = 1 / ||w||

Final result: we can maximize γ by minimizing ||w||²!!!
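A small numeric check of γ = 1/||w||; the weights and data below are made up, with one point placed exactly on a canonical line:

```python
import numpy as np

# Numeric check of gamma = 1 / ||w|| (data and weights are made up, with
# one point placed exactly on the canonical line w·x + b = -1).
w, b = np.array([3.0, 4.0]), -1.0          # ||w|| = 5
X = np.array([[1.0, 0.5], [2.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])

gamma = y * (X @ w + b) / np.linalg.norm(w)   # signed distance to separator
print(gamma.min(), 1.0 / np.linalg.norm(w))
```

The minimum over the data points matches 1/||w|| because the closest point sits on a canonical line.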


(Hard margin) support vector machines

    min_{w,b} ||w||²  subject to  yj (w · xj + b) ≥ 1 for all j

• Example of a convex optimization problem
  – A quadratic program
  – Polynomial-time algorithms to solve!

• Hyperplane defined by support vectors
  – Could use them as a lower-dimensional basis to write down the line, although we haven’t seen how yet
  – More on these later

margin = 2γ

Support vectors: data points on the canonical lines (w·x + b = ±1)
Non-support vectors: everything else; moving them will not change w
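Since this is a small QP, it can be sketched with a generic constrained solver; the dataset below is made up, and real SVM implementations use specialized QP solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch: solve the hard-margin QP
#     min ||w||^2  s.t.  y_j (w·x_j + b) >= 1
# with a generic solver (SLSQP) on a made-up 2-D dataset.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

cons = [{'type': 'ineq',
         'fun': lambda z, j=j: y[j] * (X[j] @ z[:2] + z[2]) - 1.0}
        for j in range(len(X))]

res = minimize(lambda z: z[0]**2 + z[1]**2, x0=np.zeros(3),
               constraints=cons, method='SLSQP')
w, b = res.x[:2], res.x[2]
print(w, b, (y * (X @ w + b)).min())
```

For this symmetric dataset the optimum is w = (0.5, 0), b = 0: the support vectors (±2, 0) land exactly on the canonical lines.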
Allowing for slack: “Soft margin SVM”

    min_{w,b,ξ} ||w||² + C Σj ξj  subject to  yj (w · xj + b) ≥ 1 - ξj,  ξj ≥ 0   (“slack variables”)

Slack penalty C > 0:
• C = ∞  have to separate the data!
• C = 0  ignores the data entirely!
• Select C using cross-validation

For each data point:
• If margin ≥ 1, don’t care
• If margin < 1, pay linear penalty
Equivalent formulation using hinge loss

    min_{w,b,ξ} ||w||² + C Σj ξj  subject to  yj (w · xj + b) ≥ 1 - ξj,  ξj ≥ 0

Substituting the optimal ξj into the objective, we get:

    min_{w,b} ||w||₂² + C Σj ℓhinge(yj, w · xj + b)

where the hinge loss is ℓhinge(y, ŷ) = max(0, 1 - ŷ y).

The first term is called regularization; it is used to prevent overfitting! The second term is empirical risk minimization, using the hinge loss.
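Because the unconstrained objective above is convex (though not differentiable at the hinge), it can be minimized by subgradient descent. A minimal sketch; the synthetic data, C, step sizes, and iteration count are all arbitrary choices, not part of the lecture:

```python
import numpy as np

# Subgradient descent on the soft-margin objective
#     ||w||^2 + C * sum_j max(0, 1 - y_j (w·x_j + b)).
# Synthetic two-cluster data; all hyperparameters are arbitrary choices.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
C = 1.0

w, b = np.zeros(2), 0.0
for t in range(1, 2001):
    eta = 1.0 / t                            # decaying step size
    viol = y * (X @ w + b) < 1               # points paying hinge penalty
    grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= eta * grad_w
    b -= eta * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```

Only points inside the margin contribute to the subgradient, which is exactly the “pay linear penalty when margin < 1” rule from the slide.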
What if the data is not linearly separable?

Use features of features of features of features….

    φ(x) = [ x(1), ..., x(n), x(1)x(2), x(1)x(3), ..., e^{x(1)}, ... ]ᵀ

Feature space can get really large really quickly!

Example

[Tommi Jaakkola]
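To see how quickly, one can count the features: the number of distinct monomials of degree at most d in n variables is C(n + d, d), which explodes with the input dimension:

```python
from math import comb

# The number of distinct monomials of degree <= d in n variables is
# C(n + d, d), so an "all polynomials up to degree 4" feature map blows
# up quickly with the input dimension n.
for n in [10, 100, 1000]:
    print(n, comb(n + 4, 4))
```

For n = 1000 this is already over 40 billion features, so computing φ(x) explicitly is hopeless.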
What’s Next!
• Learn one of the most interesting and exciting recent advancements in machine learning
  – Key idea #3: the “kernel trick”
  – High-dimensional feature spaces at no extra cost
• But first, a detour
  – Constrained optimization!
Constrained optimization

Minimize x² subject to:

    No constraint  →  x* = 0
    x ≥ -1         →  x* = 0
    x ≥ 1          →  x* = 1

How do we solve with constraints?
 Lagrange Multipliers!!!
Lagrange multipliers – Dual variables
Consider min_x x² subject to x ≥ b. Rewrite the constraint as x - b ≥ 0, and add a Lagrange multiplier α with the new constraint α ≥ 0. Introduce the Lagrangian (objective):

    L(x, α) = x² - α(x - b)

We will solve:

    min_x max_{α≥0} L(x, α)

Why is this equivalent? min is fighting max!
• x < b  (x - b) < 0  max_{α≥0} -α(x - b) = ∞; min won’t let this happen!
• x > b  (x - b) > 0  max_{α≥0} -α(x - b) = 0, α* = 0; min is cool with 0, and L(x, α) = x² (original objective)
• x = b  α can be anything, and L(x, α) = x² (original objective)

The min on the outside forces max to behave, so constraints will be satisfied.
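The min-max argument above can be checked on a grid for the concrete case b = 1 (the finite alpha grid is a stand-in for “all α ≥ 0”):

```python
import numpy as np

# Grid-based sketch of min_x max_{a>=0} L(x, a) for L(x, a) = x^2 - a(x - 1),
# i.e. minimizing x^2 subject to x >= 1.  The alpha grid is a finite
# stand-in for "all a >= 0".
b = 1.0
xs = np.linspace(-2, 2, 401)
alphas = np.linspace(0, 50, 501)

L = xs[:, None]**2 - alphas[None, :] * (xs[:, None] - b)
inner = L.max(axis=1)        # inner max over alpha, for each x
x_star = xs[inner.argmin()]  # outer min over x
print(x_star, inner.min())
```

The inner max blows up for every x < 1, so the outer min lands at x* = 1 with value 1, matching the constrained optimum.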
Dual SVM derivation (1) – the linearly separable case (hard margin SVM)

Original optimization problem:

    min_{w,b} ½ ||w||²  subject to  yj (w · xj + b) ≥ 1 for all j

Rewrite the constraints as 1 - yj (w · xj + b) ≤ 0, with one Lagrange multiplier αj ≥ 0 per example.

Lagrangian:

    L(w, b, α) = ½ ||w||² - Σj αj [ yj (w · xj + b) - 1 ]

Our goal now is to solve:

    min_{w,b} max_{α≥0} L(w, b, α)
Dual SVM derivation (2) – the linearly separable case (hard margin SVM)

(Primal)    min_{w,b} max_{α≥0} L(w, b, α)

Swap min and max:

(Dual)      max_{α≥0} min_{w,b} L(w, b, α)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

(Dual)      max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

    ∂L/∂w = w - Σj αj yj xj = 0   ⇒   w = Σj αj yj xj

    ∂L/∂b = -Σj αj yj = 0   ⇒   Σj αj yj = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)      max_α Σj αj - ½ Σj Σk αj αk yj yk (xj · xk)
            subject to  αj ≥ 0,  Σj αj yj = 0

The sums are over all training examples; the αj and yj are scalars, and xj · xk is a dot product.

Dual formulation only depends on dot-products of the features!

First, we introduce a feature mapping φ(x). Next, replace the dot product with an equivalent kernel function:

    K(xj, xk) = φ(xj) · φ(xk)
SVM with kernels

• Never compute the features explicitly!!!
  – Compute dot products in closed form

• Predict with:  f(x) = sign( Σj αj yj K(xj, x) + b )

• O(n²) time in size of dataset to compute the objective
  – much work on speeding this up
Common kernels
• Polynomials of degree exactly d:  K(x, x′) = (x · x′)^d

• Polynomials of degree up to d:  K(x, x′) = (1 + x · x′)^d

• Gaussian kernels:  K(x, x′) = exp(-||x - x′||² / 2σ²)

• Sigmoid:  K(x, x′) = tanh(η (x · x′) + ν)

• And many others: very active area of research!
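These kernels are simple to write down directly; the parameter values and test vectors below are arbitrary choices:

```python
import numpy as np

# Sketches of the kernels listed above; the parameter values and test
# vectors below are arbitrary choices.
def poly_exact(x, z, d):    return (x @ z) ** d            # degree exactly d
def poly_up_to(x, z, d):    return (1.0 + x @ z) ** d      # degree up to d
def gaussian(x, z, sigma):  return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
def sigmoid(x, z, eta, nu): return np.tanh(eta * (x @ z) + nu)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_exact(x, z, 2), poly_up_to(x, z, 2), gaussian(x, z, 1.0))
```

Note that the Gaussian kernel always evaluates to 1 when its two arguments are equal, which is one quick sanity check on an implementation.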


Quadratic kernel

[Tommi Jaakkola]
Quadratic kernel

Feature mapping given by (for 2-D input):

    φ(x) = ( x(1)², √2 x(1) x(2), x(2)² )

[Cynthia Rudin]
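The quadratic-kernel identity can be verified numerically: for 2-D inputs, (x · z)² equals the dot product of the explicit feature maps, so the kernel computes the same quantity without ever building φ. The test vectors are arbitrary:

```python
import numpy as np

# For 2-D inputs, the quadratic kernel (x·z)^2 equals a dot product of
# explicit features phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) -- the kernel
# computes it without ever building phi.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 3.0]), np.array([2.0, -1.0])
lhs = (x @ z) ** 2
rhs = phi(x) @ phi(z)
print(lhs, rhs)
```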
