Support vector machines (SVMs)
Lecture 2
David Sontag
New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate,
and Carlos Guestrin
Geometry of linear separators
(see blackboard)
A plane can be specified as the set of all points
  x = p + α·u + β·v
where p is a vector from the origin to a point in the plane, and u, v are two non-parallel directions in the plane.
Alternatively, it can be specified by its normal vector (we will call this w): the plane is the set of points x satisfying w·x = b. We only need to specify this dot product, a scalar (we will call this the offset, b).
Barber, Section 29.1.1-4
Linear Separators
If the training data is linearly separable, the perceptron is
guaranteed to find some linear separator
Which of these is optimal?
Support Vector Machine (SVM)
SVMs (Vapnik, 1990s) choose the linear separator with the
largest margin
• Robust to outliers!
• Good according to intuition, theory, practice
• SVMs became famous when, using images as input, they gave
accuracy comparable to neural networks with hand-designed
features on a handwriting recognition task
Support vector machines: 3 key ideas
1. Use optimization to find solution (i.e. a
hyperplane) with few errors
2. Seek large margin separator to improve
generalization
3. Use kernel trick to make large feature
spaces computationally efficient
Finding a perfect classifier (when one exists)
using linear programming
For every data point (x_t, y_t), enforce the constraint
  w·x_t + b ≥ +1 for y_t = +1, and
  w·x_t + b ≤ −1 for y_t = −1.
Equivalently, we want to satisfy all of the linear constraints
  y_t(w·x_t + b) ≥ 1 for all t.
(Figure: separating hyperplane w·x + b = 0 with the lines w·x + b = +1 and w·x + b = −1 on either side.)
This linear program can be efficiently
solved using algorithms such as simplex,
interior point, or ellipsoid
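The feasibility problem above can be sketched with an off-the-shelf LP solver. This is a minimal illustration using scipy's `linprog` on hypothetical toy data (the points and labels below are made up for the example); the constraints y_t(w·x_t + b) ≥ 1 are rewritten in the ≤ form `linprog` expects.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

# Variables are [w1, w2, b]. Each point gives y_t*(w.x_t + b) >= 1,
# which linprog wants as -y_t*(w.x_t + b) <= -1.
A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
b_ub = -np.ones(len(X))

# Zero objective: we only ask for any feasible point (a perfect separator).
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3)
w, b = res.x[:2], res.x[2]
print(res.status == 0)                       # feasible: a separator exists
print(np.all(y * (X @ w + b) >= 1 - 1e-9))   # all constraints satisfied
```

Any feasible (w, b) is a perfect classifier on this data; the LP makes no attempt to pick a "good" one, which is exactly the gap the margin idea fills later.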
Finding a perfect classifier (when one exists)
using linear programming
Weight space: an example of a 2-dimensional
linear programming (feasibility) problem.
For SVMs, each data point gives one inequality:
  y_t(w·x_t + b) ≥ 1
What happens if the data set is not linearly separable?
Minimizing number of errors (0-1 loss)
• Try to find weights that violate as few
constraints as possible, i.e., minimize #(mistakes)?
• Formalize this using the 0-1 loss:

  min_{w,b} Σ_j ℓ_{0,1}(y_j, w·x_j + b)

  where ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
• Unfortunately, minimizing 0-1 loss is
NP-hard in the worst-case
– Non-starter. We need another
approach.
Key idea #1: Allow for slack
Minimize over w, b, ξ:  Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j
(The ξ_j are called "slack variables"; in the figure, ξ_1, …, ξ_4 mark points that violate the margin.)
We now have a linear program again,
and can efficiently find its optimum
For each data point:
• If functional margin ≥ 1, don't care
• If functional margin < 1, pay linear penalty
Key idea #1: Allow for slack
Minimize over w, b, ξ:  Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j  ("slack variables")
What is the optimal value ξ_j* as a function of w* and b*?
• If y_j(w*·x_j + b*) ≥ 1, then ξ_j* = 0
• If y_j(w*·x_j + b*) < 1, then ξ_j* = 1 − y_j(w*·x_j + b*)
Sometimes written as ξ_j* = max(0, 1 − y_j(w*·x_j + b*)) = (1 − y_j(w*·x_j + b*))₊
Equivalent hinge loss formulation
Minimize over w, b, ξ:  Σ_j ξ_j  subject to  y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0
Substituting the optimal slack into the objective, we get:

  min_{w,b} Σ_j max(0, 1 − y_j(w·x_j + b))

The hinge loss is defined as ℓ_hinge(y, ŷ) = max(0, 1 − ŷy), so this is

  min_{w,b} Σ_j ℓ_hinge(y_j, w·x_j + b)

This is empirical risk minimization,
using the hinge loss
Hinge loss vs. 0/1 loss
Hinge loss:  ℓ_hinge(y, ŷ) = max(0, 1 − ŷy)
0-1 loss:  ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]
(Figure: both losses plotted against ŷy; the 0-1 loss steps from 1 to 0 at ŷy = 0, while the hinge loss decreases linearly until ŷy = 1.)
Hinge loss upper bounds 0/1 loss!
It is the tightest convex upper bound on the 0/1 loss
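The upper-bound claim is easy to check numerically. This short sketch evaluates both losses at a few margin values (the values chosen are arbitrary examples):

```python
import numpy as np

def loss_01(y, yhat):
    # 0-1 loss: 1 iff the sign of the prediction disagrees with the label.
    return float(y != np.sign(yhat))

def loss_hinge(y, yhat):
    # Hinge loss: max(0, 1 - yhat*y).
    return max(0.0, 1.0 - yhat * y)

# The hinge loss upper bounds the 0-1 loss at every margin value:
for margin in [-2.0, -0.5, 0.3, 0.9, 1.0, 2.5]:
    y, yhat = 1, margin          # take y = +1, so yhat*y = margin
    assert loss_hinge(y, yhat) >= loss_01(y, yhat)
    print(margin, loss_01(y, yhat), loss_hinge(y, yhat))
```

Note the convexity of the hinge loss in ŷ is what makes the optimization tractable, in contrast to the NP-hard 0-1 loss.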
Key idea #2: seek large margin
• Suppose again that the data is linearly separable and we are solving
a feasibility problem, with constraints y_t(w·x_t + b) ≥ 1 for all t
• If the length of the weight vector ||w|| is too small, the optimization
problem is infeasible! Why?
(Figure: as ||w|| (and |b|) get smaller, the lines w·x + b = +1 and w·x + b = −1 move farther apart, until the data points can no longer satisfy the constraints.)
What is γ (the geometric margin) as a function of w?
  γ_i = distance to the i'th data point = y_i(w·x_i + b) / ||w||
  γ = min_i γ_i
We also know that for a point x_1 on the line w·x + b = +1 and a point x_2 on the line w·x + b = −1,
  w·(x_1 − x_2) = 2.
So, (assuming there is a data point on the w·x + b = +1 or −1 line)
  γ = 1 / ||w||
Final result: we can maximize γ by minimizing ||w||²!!!
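The relation γ = 1/||w|| can be verified on a small example. The weight vector and points below are hypothetical, chosen so that the closest point lies exactly on the canonical line w·x + b = +1:

```python
import numpy as np

# Hypothetical separator in canonical scaling: min_j y_j (w.x_j + b) = 1.
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0

X = np.array([[1.0, 0.0],    # w.x + b = 1  (on the canonical line)
              [2.0, 1.0],    # w.x + b = 8
              [1.0, 1.0]])   # w.x + b = 5
y = np.array([1, 1, 1])

# Geometric margin of each point: signed distance to the hyperplane w.x + b = 0.
gammas = y * (X @ w + b) / np.linalg.norm(w)
gamma = gammas.min()
print(gamma, 1 / np.linalg.norm(w))   # both equal 0.2
```

Scaling (w, b) by any positive constant leaves the hyperplane and γ unchanged, which is why the canonical scaling (closest point at functional margin 1) can be assumed without loss of generality.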
(Hard margin) support vector machines
  min_{w,b} ½||w||₂²  subject to  y_j(w·x_j + b) ≥ 1 for all j

• Example of a convex optimization problem
  – A quadratic program
  – Polynomial-time algorithms to solve!
• Hyperplane defined by support vectors
  – Could use them as a lower-dimensional basis to write down the line, although we haven't seen how yet
  – More on these later
• The margin is 2γ = 2/||w||
Support vectors: data points on the canonical lines w·x + b = ±1
Non-support vectors: everything else; moving them will not change w
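A hard-margin fit can be approximated with scikit-learn's `SVC` by making the slack penalty C very large (the data below is a made-up separable toy set; C = 1e8 as a stand-in for C = ∞ is an assumption of this sketch, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM (no slack allowed).
clf = SVC(kernel="linear", C=1e8)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margins = y * (X @ w + b)
print(clf.support_)                                # indices of the support vectors
print(np.isclose(margins.min(), 1.0, atol=1e-2))   # canonical scaling holds
```

Here only the two closest opposing points end up as support vectors; nudging either of the other two points (without crossing the margin) would leave w unchanged, matching the slide.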
Allowing for slack: “Soft margin SVM”
Minimize over w, b, ξ:  ||w||₂² + C Σ_j ξ_j
subject to  y_j(w·x_j + b) ≥ 1 − ξ_j  and  ξ_j ≥ 0  for all j  ("slack variables" ξ_1, …, ξ_4 in the figure)
Slack penalty C > 0:
• C = ∞: have to separate the data!
• C = 0: ignores the data entirely!
• Select C using cross-validation
For each data point:
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
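The cross-validation selection of C can be sketched with scikit-learn's `GridSearchCV`. The data, the noise level, and the candidate C grid below are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical noisy 2-D data: label is the sign of x1 plus noise.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.where(X[:, 0] + 0.3 * rng.randn(60) > 0, 1, -1)

# Cross-validate over the slack penalty C, as the slide suggests.
search = GridSearchCV(SVC(kernel="linear"),
                      {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print(search.best_params_["C"], round(search.best_score_, 3))
```

Small C tolerates margin violations (more regularization); large C pushes toward separating the training data exactly, risking overfitting when the data is noisy.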
Equivalent formulation using hinge loss
Minimize over w, b, ξ:  ||w||₂² + C Σ_j ξ_j  subject to  y_j(w·x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0
Substituting the optimal slack ξ_j = max(0, 1 − y_j(w·x_j + b)) into the objective, we get:

  min_{w,b} ||w||₂² + C Σ_j ℓ_hinge(y_j, w·x_j + b)

where the hinge loss is ℓ_hinge(y, ŷ) = max(0, 1 − ŷy).
The first term is called regularization; it is used to prevent overfitting. The second term is empirical risk minimization, using the hinge loss.
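Because the regularized hinge-loss objective is unconstrained and convex, it can be minimized directly by subgradient descent. This is a minimal sketch (step size, iteration count, and the toy data are arbitrary assumptions, not part of the lecture):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, iters=2000):
    """Minimize ||w||^2 + C * sum_j hinge(y_j, w.x_j + b) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                 # points paying a hinge penalty
        # Subgradient: 2w from the regularizer; each active point
        # contributes -y_j x_j (and -y_j for b) from its hinge term.
        gw = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical toy data.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y, C=10.0)
print(np.all(np.sign(X @ w + b) == y))   # separates the toy data
```

The hinge loss is not differentiable at margin 1, so this uses a subgradient (treating the kink as contributing zero); stochastic variants of this update are essentially the Pegasos algorithm.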
What if the data is not linearly
separable?
Use features of features
of features of features….
  φ(x) = [ x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … ]ᵀ
Feature space can get really large really quickly!
Example
[Tommi Jaakkola]
What’s Next!
• Learn one of the most interesting and
exciting recent advances in machine
learning
– Key idea #3: the “kernel trick”
– High dimensional feature spaces at no extra
cost
• But first, a detour
– Constrained optimization!
Constrained optimization
Minimize x² under different constraints:
  No constraint: x* = 0
  x ≥ −1: x* = 0
  x ≥ 1: x* = 1
How do we solve with constraints?
Lagrange multipliers!!!
Lagrange multipliers – Dual variables
Problem: minimize x² subject to x ≥ b.
Rewrite the constraint as x − b ≥ 0, and add a Lagrange multiplier with the new constraint α ≥ 0.
Introduce the Lagrangian (objective):  L(x, α) = x² − α(x − b)
We will solve:  min_x max_{α≥0} L(x, α)
Why is this equivalent? min is fighting max!
• If x < b: (x − b) < 0, so max_{α≥0} −α(x − b) = ∞; min won't let this happen.
• If x > b: (x − b) > 0, so max_{α≥0} −α(x − b) = 0 with α* = 0, and L(x, α) = x² (original objective).
• If x = b: α can be anything, and L(x, α) = x² (original objective).
The min on the outside forces max to behave, so constraints will be satisfied.
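The toy problem (taking b = 1) can be checked numerically: a constrained solver recovers x* = 1, and a brute-force inner max over a grid of α values shows the min-max Lagrangian landing at the same point. The grid ranges are arbitrary assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize x^2 subject to x >= 1 (the toy problem with b = 1).
res = minimize(lambda x: x[0] ** 2, x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1}])
print(round(res.x[0], 4))   # matches the table: x* = 1

# The inner max of L(x, a) = x^2 - a*(x - 1) enforces x >= 1:
def inner_max(x, alphas=np.linspace(0, 50, 501)):
    return max(x ** 2 - a * (x - 1) for a in alphas)

xs = np.linspace(-2, 3, 501)
vals = [inner_max(x) for x in xs]
x_star = xs[int(np.argmin(vals))]
print(round(x_star, 2))     # min-max recovers the constrained optimum x* = 1
```

For x < 1 the inner max grows without bound as the α grid extends, which is the finite-grid version of the "= ∞" case on the slide.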
Dual SVM derivation (1) – the linearly
separable case (hard margin SVM)
Original optimization problem:  min_{w,b} ½||w||₂²  subject to  y_j(w·x_j + b) ≥ 1 for all j
Rewrite the constraints as 1 − y_j(w·x_j + b) ≤ 0, with one Lagrange multiplier α_j ≥ 0 per example.
Lagrangian:  L(w, b, α) = ½||w||₂² − Σ_j α_j [y_j(w·x_j + b) − 1]
Our goal now is to solve:  min_{w,b} max_{α≥0} L(w, b, α)
Dual SVM derivation (2) – the linearly
separable case (hard margin SVM)
(Primal)  min_{w,b} max_{α≥0} L(w, b, α)
Swap min and max:
(Dual)  max_{α≥0} min_{w,b} L(w, b, α)
Slater's condition from convex optimization guarantees that
these two optimization problems are equivalent!
Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

(Dual)  max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

  ∂L/∂w = w − Σ_j α_j y_j x_j = 0  ⇒  w = Σ_j α_j y_j x_j
  ∂L/∂b = −Σ_j α_j y_j = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)  max_{α≥0} Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k)  subject to  Σ_j α_j y_j = 0

The sums run over all training examples; the α_j and y_j are scalars, and x_j · x_k is a dot product.
Dual formulation only depends on
dot-products of the features!
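The stationarity condition w = Σ_j α_j y_j x_j can be checked with scikit-learn, which exposes the products α_j y_j of the support vectors as `dual_coef_` (the toy data is hypothetical; the large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data; near-hard-margin via a large C.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e8).fit(X, y)

# sklearn stores alpha_j * y_j for the support vectors in dual_coef_,
# so w = sum_j alpha_j y_j x_j is one matrix product away:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True

# Dual feasibility: sum_j alpha_j y_j = 0.
print(np.isclose(clf.dual_coef_.sum(), 0.0))
```

Only support vectors appear in the sum (non-support points have α_j = 0), which is why the hyperplane is "defined by support vectors".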
First, we introduce a feature mapping:  x ↦ φ(x)
Next, replace the dot product with an equivalent kernel function:  K(x_j, x_k) = φ(x_j) · φ(x_k)
SVM with kernels
• Never compute features explicitly!!!
  – Compute dot products in closed form
• Predict with:  ŷ = sign(Σ_j α_j y_j K(x_j, x) + b)
• O(n²) time in the size of the dataset to compute the objective
  – much work on speeding this up
Common kernels
• Polynomials of degree exactly d:  K(x, x') = (x · x')^d
• Polynomials of degree up to d:  K(x, x') = (1 + x · x')^d
• Gaussian kernels:  K(x, x') = exp(−||x − x'||² / 2σ²)
• Sigmoid:  K(x, x') = tanh(η x · x' + ν)
• And many others: very active area of research!
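Each of these kernels is a one-line function of dot products or distances. A minimal sketch (the test points x, z are arbitrary):

```python
import numpy as np

# Common kernels written directly as functions of dot products / distances.
def poly_exact(x, z, d):     return (x @ z) ** d
def poly_up_to(x, z, d):     return (1 + x @ z) ** d
def gaussian(x, z, sigma):   return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
def sigmoid_k(x, z, eta, nu): return np.tanh(eta * (x @ z) + nu)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_exact(x, z, 2))   # (x.z)^2 = (-1.5)^2 = 2.25
print(poly_up_to(x, z, 2))   # (1 + x.z)^2 = (-0.5)^2 = 0.25
print(round(gaussian(x, z, 1.0), 4))
```

None of these evaluations ever builds the (possibly infinite-dimensional, in the Gaussian case) feature vector φ(x) explicitly.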
Quadratic kernel
[Tommi Jaakkola]
Quadratic kernel
Feature mapping given by (for x ∈ ℝ² and K(x, z) = (1 + x · z)²):
  φ(x) = ( 1, √2·x(1), √2·x(2), x(1)², x(2)², √2·x(1)x(2) )
[Cynthia Rudin]
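The identity φ(x)·φ(z) = (1 + x·z)² can be verified directly for 2-D inputs (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel (1 + x.z)^2 in 2-D.
    # The sqrt(2) factors make the dot product of mapped points equal
    # the kernel value exactly.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(np.isclose(phi(x) @ phi(z), (1 + x @ z) ** 2))   # True
```

The kernel evaluation costs one 2-D dot product; the explicit map already needs 6 features, and for n-D inputs it grows to O(n²) features, which is the point of the kernel trick.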