Support vector machines (SVMs)
Lecture 2
David Sontag
New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate,
and Carlos Guestrin
Geometry of linear separators
(see blackboard)
A plane can be specified as the set of all points
  x = p + α·u + β·v
where p is a vector from the origin to a point in the plane, and u, v are two non-parallel directions in the plane.
Alternatively, it can be specified by its normal vector (we will call this w): the plane is the set of points x satisfying w·x = b. We only need to specify this dot product, a scalar (we will call this the offset, b).
Barber, Section 29.1.1-4
Linear Separators
If the training data is linearly separable, the perceptron is
guaranteed to find some linear separator
Which of these is optimal?
Support Vector Machine (SVM)
SVMs (Vapnik, 1990s) choose the linear separator with the
largest margin
• Robust to outliers!
• Good according to intuition, theory, practice
• SVMs became famous when, using images as input, they gave
accuracy comparable to neural networks with hand-designed
features on a handwriting recognition task
Support vector machines: 3 key ideas
1. Use optimization to find solution (i.e. a
hyperplane) with few errors
2. Seek large margin separator to improve
generalization
3. Use kernel trick to make large feature
spaces computationally efficient
Finding a perfect classifier (when one exists)
using linear programming
For every data point (x_t, y_t), enforce the constraint
  w·x_t + b ≥ +1 for y_t = +1, and
  w·x_t + b ≤ −1 for y_t = −1.
Equivalently, we want to satisfy all of the linear constraints
  y_t(w·x_t + b) ≥ 1 for all t.
(Figure: separating hyperplane w·x + b = 0 with the lines w·x + b = +1 and w·x + b = −1 on either side.)
This linear program can be efficiently
solved using algorithms such as simplex,
interior point, or ellipsoid
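The feasibility problem above can be sketched with an off-the-shelf LP solver. This is a minimal illustration using scipy's `linprog` on hypothetical toy data (the points and labels below are made up for the example); the constraints y_t(w·x_t + b) ≥ 1 are rewritten in the ≤ form `linprog` expects.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

# Variables are [w1, w2, b]. Each point gives y_t*(w.x_t + b) >= 1,
# which linprog wants as -y_t*(w.x_t + b) <= -1.
A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
b_ub = -np.ones(len(X))

# Zero objective: we only ask for any feasible point (a perfect separator).
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)
w, b = res.x[:2], res.x[2]
print(res.status == 0)                       # feasible: a separator exists
print(np.all(y * (X @ w + b) >= 1 - 1e-9))   # all constraints satisfied
```

Any feasible (w, b) is a perfect classifier on this data; the LP makes no attempt to pick a "good" one, which is exactly the gap the margin idea fills later.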
Finding a perfect classifier (when one exists)
using linear programming
Weight space: an example of a 2-dimensional
linear programming (feasibility) problem.
For SVMs, each data point gives one inequality:
  y_t(w·x_t + b) ≥ 1
What happens if the data set is not linearly separable?
Minimizing number of errors (0-1 loss)
• Try to find weights that violate as few
constraints as possible, i.e., minimize #(mistakes)?
• Formalize this using the 0-1 loss:

  min_{w,b} Σ_j ℓ_{0,1}(y_j, w·x_j + b)

  where ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
• Unfortunately, minimizing 0-1 loss is
NP-hard in the worst-case
– Non-starter. We need another
approach.
Key idea #1: Allow for slack
Minimize over w, b, ξ:  Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j
(The ξ_j are called "slack variables"; in the figure, ξ_1, …, ξ_4 mark points that violate the margin.)
We now have a linear program again,
and can efficiently find its optimum
For each data point:
• If functional margin ≥ 1, don't care
• If functional margin < 1, pay linear penalty
Key idea #1: Allow for slack
Minimize over w, b, ξ:  Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j  ("slack variables")
What is the optimal value ξ_j* as a function of w* and b*?
• If y_j(w*·x_j + b*) ≥ 1, then ξ_j* = 0
• If y_j(w*·x_j + b*) < 1, then ξ_j* = 1 − y_j(w*·x_j + b*)
Sometimes written as ξ_j* = max(0, 1 − y_j(w*·x_j + b*)) = (1 − y_j(w*·x_j + b*))₊
Equivalent hinge loss formulation
Minimize over w, b, ξ:  Σ_j ξ_j  subject to  y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0
Substituting the optimal slack into the objective, we get:

  min_{w,b} Σ_j max(0, 1 − y_j(w·x_j + b))

The hinge loss is defined as ℓ_hinge(y, ŷ) = max(0, 1 − ŷy), so this is

  min_{w,b} Σ_j ℓ_hinge(y_j, w·x_j + b)

This is empirical risk minimization,
using the hinge loss
Hinge loss vs. 0/1 loss
Hinge loss:  ℓ_hinge(y, ŷ) = max(0, 1 − ŷy)
0-1 loss:  ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
(Figure: both losses plotted against ŷy; the 0-1 loss steps from 1 to 0 at ŷy = 0, while the hinge loss decreases linearly until ŷy = 1.)
Hinge loss upper bounds 0/1 loss!
It is the tightest convex upper bound on the 0/1 loss
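The upper-bound claim is easy to check numerically. This short sketch evaluates both losses at a few margin values (the values chosen are arbitrary examples):

```python
import numpy as np

def loss_01(y, yhat):
    # 0-1 loss: 1 iff the sign of the prediction disagrees with the label.
    return float(y != np.sign(yhat))

def loss_hinge(y, yhat):
    # Hinge loss: max(0, 1 - yhat*y).
    return max(0.0, 1.0 - yhat * y)

# The hinge loss upper bounds the 0-1 loss at every margin value:
for margin in [-2.0, -0.5, 0.3, 0.9, 1.0, 2.5]:
    y, yhat = 1, margin          # take y = +1, so yhat*y = margin
    assert loss_hinge(y, yhat) >= loss_01(y, yhat)
    print(margin, loss_01(y, yhat), loss_hinge(y, yhat))
```

Note the convexity of the hinge loss in ŷ is what makes the optimization tractable, in contrast to the NP-hard 0-1 loss.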
Key idea #2: seek large margin
• Suppose again that the data is linearly separable and we are solving
a feasibility problem, with constraints y_t(w·x_t + b) ≥ 1 for all t
• If the length of the weight vector ||w|| is too small, the optimization
problem is infeasible! Why?
(Figure: as ||w|| (and |b|) get smaller, the lines w·x + b = +1 and w·x + b = −1 move farther apart, until the data points can no longer satisfy the constraints.)
What is γ (the geometric margin) as a function of w?
  γ_i = distance to the i'th data point = y_i(w·x_i + b) / ||w||
  γ = min_i γ_i
We also know that for a point x_1 on the line w·x + b = +1 and a point x_2 on the line w·x + b = −1,
  w·(x_1 − x_2) = 2.
So, (assuming there is a data point on the w·x + b = +1 or −1 line)
  γ = 1 / ||w||
Final result: we can maximize γ by minimizing ||w||²!!!
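The relation γ = 1/||w|| can be verified on a small example. The weight vector and points below are hypothetical, chosen so that the closest point lies exactly on the canonical line w·x + b = +1:

```python
import numpy as np

# Hypothetical separator in canonical scaling: min_j y_j (w.x_j + b) = 1.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

X = np.array([[1.0, 0.0],    # w.x + b = 1  (on the canonical line)
              [2.0, 1.0],    # w.x + b = 8
              [1.0, 1.0]])   # w.x + b = 5
y = np.array([1, 1, 1])

# Geometric margin of each point: signed distance to the hyperplane w.x + b = 0.
gammas = y * (X @ w + b) / np.linalg.norm(w)
gamma = gammas.min()
print(gamma, 1 / np.linalg.norm(w))   # both equal 0.2
```

Scaling (w, b) by any positive constant leaves the hyperplane and γ unchanged, which is why the canonical scaling (closest point at functional margin 1) can be assumed without loss of generality.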
(Hard margin) support vector machines
  min_{w,b} ½||w||₂²  subject to  y_j(w·x_j + b) ≥ 1 for all j

• Example of a convex optimization problem
  – A quadratic program
  – Polynomial-time algorithms to solve!
• Hyperplane defined by support vectors
  – Could use them as a lower-dimensional basis to write down the line, although we haven't seen how yet
  – More on these later
• The margin is 2γ = 2/||w||
Support vectors: data points on the canonical lines w·x + b = ±1
Non-support vectors: everything else; moving them will not change w
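A hard-margin fit can be approximated with scikit-learn's `SVC` by making the slack penalty C very large (the data below is a made-up separable toy set; C = 1e8 as a stand-in for C = ∞ is an assumption of this sketch, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM (no slack allowed).
clf = SVC(kernel="linear", C=1e8)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margins = y * (X @ w + b)
print(clf.support_)                                # indices of the support vectors
print(np.isclose(margins.min(), 1.0, atol=1e-2))   # canonical scaling holds
```

Here only the two closest opposing points end up as support vectors; nudging either of the other two points (without crossing the margin) would leave w unchanged, matching the slide.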
Allowing for slack: “Soft margin SVM”
Minimize over w, b, ξ:  ||w||₂² + C Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j  ("slack variables" ξ_1, …, ξ_4 in the figure)
Slack penalty C > 0:
• C = ∞: have to separate the data!
• C = 0: ignores the data entirely!
• Select C using cross-validation
For each data point:
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
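The cross-validation selection of C can be sketched with scikit-learn's `GridSearchCV`. The data, the noise level, and the candidate C grid below are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical noisy 2-D data: label is the sign of x1 plus noise.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.where(X[:, 0] + 0.3 * rng.randn(60) > 0, 1, -1)

# Cross-validate over the slack penalty C, as the slide suggests.
search = GridSearchCV(SVC(kernel="linear"),
                      {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print(search.best_params_["C"], round(search.best_score_, 3))
```

Small C tolerates margin violations (more regularization); large C pushes toward separating the training data exactly, risking overfitting when the data is noisy.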
Equivalent formulation using hinge loss
Minimize over w, b, ξ:  ||w||₂² + C Σ_j ξ_j  subject to  y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0
Substituting the optimal slack ξ_j = max(0, 1 − y_j(w·x_j + b)) into the objective, we get:

  min_{w,b} ||w||₂² + C Σ_j ℓ_hinge(y_j, w·x_j + b)

where the hinge loss is ℓ_hinge(y, ŷ) = max(0, 1 − ŷy).
The first term is called regularization; it is used to prevent overfitting. The second term is empirical risk minimization, using the hinge loss.
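Because the regularized hinge-loss objective is unconstrained and convex, it can be minimized directly by subgradient descent. This is a minimal sketch (step size, iteration count, and the toy data are arbitrary assumptions, not part of the lecture):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, iters=2000):
    """Minimize ||w||^2 + C * sum_j hinge(y_j, w.x_j + b) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                 # points paying a hinge penalty
        # Subgradient: 2w from the regularizer; each active point
        # contributes -y_j x_j (and -y_j for b) from its hinge term.
        gw = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical toy data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y, C=10.0)
print(np.all(np.sign(X @ w + b) == y))   # separates the toy data
```

The hinge loss is not differentiable at margin 1, so this uses a subgradient (treating the kink as contributing zero); stochastic variants of this update are essentially the Pegasos algorithm.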
What if the data is not linearly
separable?
Use features of features
of features of features….
  φ(x) = [ x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … ]ᵀ
Feature space can get really large really quickly!
Example
[Tommi Jaakkola]
What’s Next!
• Learn one of the most interesting and
exciting recent advances in machine
learning
– Key idea #3: the “kernel trick”
– High dimensional feature spaces at no extra
cost
• But first, a detour
– Constrained optimization!
Constrained optimization
Minimize x² under different constraints:
  No constraint: x* = 0
  x ≥ −1: x* = 0
  x ≥ 1: x* = 1
How do we solve with constraints?
Lagrange multipliers!!!
Lagrange multipliers – Dual variables
Problem: minimize x² subject to x ≥ b.
Rewrite the constraint as x − b ≥ 0, and add a Lagrange multiplier with the new constraint α ≥ 0.
Introduce the Lagrangian (objective):  L(x, α) = x² − α(x − b)
We will solve:  min_x max_{α≥0} L(x, α)
Why is this equivalent? min is fighting max!
• If x < b: (x − b) < 0, so max_{α≥0} −α(x − b) = ∞; min won't let this happen.
• If x > b: (x − b) > 0, so max_{α≥0} −α(x − b) = 0 with α* = 0, and L(x, α) = x² (original objective).
• If x = b: α can be anything, and L(x, α) = x² (original objective).
The min on the outside forces max to behave, so constraints will be satisfied.
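The toy problem (taking b = 1) can be checked numerically: a constrained solver recovers x* = 1, and a brute-force inner max over a grid of α values shows the min-max Lagrangian landing at the same point. The grid ranges are arbitrary assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize x^2 subject to x >= 1 (the toy problem with b = 1).
res = minimize(lambda x: x[0] ** 2, x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1}])
print(round(res.x[0], 4))   # matches the table: x* = 1

# The inner max of L(x, a) = x^2 - a*(x - 1) enforces x >= 1:
def inner_max(x, alphas=np.linspace(0, 50, 501)):
    return max(x ** 2 - a * (x - 1) for a in alphas)

xs = np.linspace(-2, 3, 501)
vals = [inner_max(x) for x in xs]
x_star = xs[int(np.argmin(vals))]
print(round(x_star, 2))     # min-max recovers the constrained optimum x* = 1
```

For x < 1 the inner max grows without bound as the α grid extends, which is the finite-grid version of the "= ∞" case on the slide.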
Dual SVM derivation (1) – the linearly
separable case (hard margin SVM)
Original optimization problem:  min_{w,b} ½||w||₂²  subject to  y_j(w·x_j + b) ≥ 1 for all j
Rewrite the constraints as 1 − y_j(w·x_j + b) ≤ 0, with one Lagrange multiplier α_j ≥ 0 per example.
Lagrangian:  L(w, b, α) = ½||w||₂² − Σ_j α_j [y_j(w·x_j + b) − 1]
Our goal now is to solve:  min_{w,b} max_{α≥0} L(w, b, α)
Dual SVM derivation (2) – the linearly
separable case (hard margin SVM)
(Primal)  min_{w,b} max_{α≥0} L(w, b, α)
Swap min and max:
(Dual)  max_{α≥0} min_{w,b} L(w, b, α)
Slater's condition from convex optimization guarantees that
these two optimization problems are equivalent!
Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

(Dual)  max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

  ∂L/∂w = w − Σ_j α_j y_j x_j = 0  ⇒  w = Σ_j α_j y_j x_j
  ∂L/∂b = −Σ_j α_j y_j = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)  max_{α≥0} Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k)  subject to  Σ_j α_j y_j = 0

The sums run over all training examples; the α_j and y_j are scalars, and x_j · x_k is a dot product.
Dual formulation only depends on
dot-products of the features!
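The stationarity condition w = Σ_j α_j y_j x_j can be checked with scikit-learn, which exposes the products α_j y_j of the support vectors as `dual_coef_` (the toy data is hypothetical; the large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data; near-hard-margin via a large C.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e8).fit(X, y)

# sklearn stores alpha_j * y_j for the support vectors in dual_coef_,
# so w = sum_j alpha_j y_j x_j is one matrix product away:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True

# Dual feasibility: sum_j alpha_j y_j = 0.
print(np.isclose(clf.dual_coef_.sum(), 0.0))
```

Only support vectors appear in the sum (non-support points have α_j = 0), which is why the hyperplane is "defined by support vectors".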
First, we introduce a feature mapping:  x ↦ φ(x)
Next, replace the dot product with an equivalent kernel function:  K(x_j, x_k) = φ(x_j) · φ(x_k)
SVM with kernels
• Never compute features explicitly!!!
  – Compute dot products in closed form
• Predict with:  ŷ = sign(Σ_j α_j y_j K(x_j, x) + b)
• O(n²) time in the size of the dataset to compute the objective
  – much work on speeding this up
Common kernels
• Polynomials of degree exactly d:  K(x, x') = (x · x')^d
• Polynomials of degree up to d:  K(x, x') = (1 + x · x')^d
• Gaussian kernels:  K(x, x') = exp(−||x − x'||² / 2σ²)
• Sigmoid:  K(x, x') = tanh(η x · x' + ν)
• And many others: very active area of research!
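Each of these kernels is a one-line function of dot products or distances. A minimal sketch (the test points x, z are arbitrary):

```python
import numpy as np

# Common kernels written directly as functions of dot products / distances.
def poly_exact(x, z, d):     return (x @ z) ** d
def poly_up_to(x, z, d):     return (1 + x @ z) ** d
def gaussian(x, z, sigma):   return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
def sigmoid_k(x, z, eta, nu): return np.tanh(eta * (x @ z) + nu)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_exact(x, z, 2))   # (x.z)^2 = (-1.5)^2 = 2.25
print(poly_up_to(x, z, 2))   # (1 + x.z)^2 = (-0.5)^2 = 0.25
print(round(gaussian(x, z, 1.0), 4))
```

None of these evaluations ever builds the (possibly infinite-dimensional, in the Gaussian case) feature vector φ(x) explicitly.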
Quadratic kernel
[Tommi Jaakkola]
Quadratic kernel
Feature mapping given by (for x ∈ ℝ² and K(x, z) = (1 + x · z)²):
  φ(x) = ( 1, √2·x(1), √2·x(2), x(1)², x(2)², √2·x(1)x(2) )
[Cynthia Rudin]
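The identity φ(x)·φ(z) = (1 + x·z)² can be verified directly for 2-D inputs (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel (1 + x.z)^2 in 2-D.
    # The sqrt(2) factors make the dot product of mapped points equal
    # the kernel value exactly.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(np.isclose(phi(x) @ phi(z), (1 + x @ z) ** 2))   # True
```

The kernel evaluation costs one 2-D dot product; the explicit map already needs 6 features, and for n-D inputs it grows to O(n²) features, which is the point of the kernel trick.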