SVM: Maximizing Margin in Classification
Alaa Othman
[Figure: 2-D plot (x1 vs. x2) of the SVM margin. The hyperplane wT x + b = 0 lies midway between H1: wT x + b = +1 and H2: wT x + b = −1; the margin width is 2/∥w∥, with each plane at distance 1/∥w∥ (distances D1, D2) from the boundary. Support vectors on the margin have C > αi > 0, ϵi = 0; correctly classified samples outside the margin have αi = 0, ϵi = 0; samples inside the margin have αi = C, 1 > ϵi > 0; misclassified samples have αi = C, ϵi ≥ 1.]
Alaa Othman June 10, 2025 5 / 64
Key Concepts in SVM
Changing w and b generates infinite hyperplanes
SVM goal: Find hyperplane maximally distant from closest samples of
both classes
This optimal hyperplane has the largest margin
Critical samples are called support vectors
Mathematical Formulation
Define two parallel planes:
H1 : wT xi + b = +1 for yi = +1 (1)
H2 : wT xi + b = −1 for yi = −1 (2)
with constraints:
wT xi + b ≥ +1 ∀ yi = +1
wT xi + b ≤ −1 ∀ yi = −1
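Since yi = ±1, the two class constraints above collapse into the single condition yi(wT xi + b) ≥ 1. A minimal sketch checking this (the data points and the weights w = [0.5], b = −2 are the 1-D toy values used later in these slides):

```python
def satisfies_margin(w, b, x, y):
    """Check the combined SVM constraint y * (w.x + b) >= 1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return y * score >= 1

# 1-D example with w = [0.5], b = -2:
print(satisfies_margin([0.5], -2.0, [6.0], +1))  # point on H1 -> True
print(satisfies_margin([0.5], -2.0, [2.0], -1))  # point on H2 -> True
print(satisfies_margin([0.5], -2.0, [5.0], +1))  # inside the margin -> False
```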
Key Properties of SVM Hyperplanes
Geometric Characteristics
The defining planes H1 and H2 satisfy:
Parallel with same normal vector w
No training samples between them
Margin width = 2/∥w∥ (next slides)
wT x1 + b = +1 (x1 on H1 )
wT x2 + b = −1 (x2 on H2 )
Other samples satisfy strict inequality:
yi (wT xi + b) > 1
A unit vector example with A = [2, 2]T:

Â = A/∥A∥ = [2, 2]T / √(2² + 2²) = [1/√2, 1/√2]T

The norm of the unit vector¹ is always one:

∥Â∥ = √((1/√2)² + (1/√2)²) = √1 = 1

¹ The unit vector keeps the direction of the original vector A while standardizing its length to 1. It indicates the orientation without affecting the magnitude.
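The normalization above can be checked numerically; a minimal sketch with the same vector A = [2, 2]:

```python
import math

A = [2.0, 2.0]
norm_A = math.sqrt(sum(a * a for a in A))        # sqrt(2^2 + 2^2) = 2*sqrt(2)
A_hat = [a / norm_A for a in A]                  # [1/sqrt(2), 1/sqrt(2)]
norm_A_hat = math.sqrt(sum(a * a for a in A_hat))
print(norm_A_hat)  # 1.0 (up to floating point)
```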
Margin Calculation:
Using support vectors x1 (class +1) and x2 (class −1):

M = |D1 − D2| = |x1 · w/∥w∥ − x2 · w/∥w∥|
  = |wT x1 − wT x2| / ∥w∥
  = |(1 − b) − (−1 − b)| / ∥w∥
  = 2/∥w∥
Key Insight
Maximizing margin M ⇔ Minimizing ∥w∥
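The two expressions for the margin (projected distance between support vectors vs. the closed form 2/∥w∥) can be compared on an assumed toy configuration, here w = [1, 0], x1 = [1, 0] on H1 and x2 = [−1, 0] on H2:

```python
import math

# Assumed 2-D toy values: support vectors on H1 and H2 for w = [1, 0], b = 0.
w = [1.0, 0.0]
x1, x2 = [1.0, 0.0], [-1.0, 0.0]

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
norm_w = math.sqrt(dot(w, w))

M_projection = abs(dot(w, x1) - dot(w, x2)) / norm_w  # |D1 - D2|
M_formula = 2.0 / norm_w                              # 2 / ||w||
print(M_projection, M_formula)  # both 2.0
```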
Problem Setup

[Figure: 1-D dataset with support vectors x1 ∈ ω+ and x2 ∈ ω− on H1 and H2, margin M, and decision boundary x = 4.]

x1 lies on H1: wx + b = +1
x2 lies on H2: wx + b = −1

Margin M = 2/∥w∥

Effect of ∥w∥:
∥w∥ = 1 ⇒ M = 2
∥w∥ = 2 ⇒ M = 1
∥w∥ = 4 ⇒ M = 0.5

Discriminant function:
wT x + b = 0 ⇒ 0.5x − 2 = 0 ⇒ x = 4

Margin width:
M = 2/∥w∥ = 2/0.5 = 4

Decision process:
sign(wT x + b) = +1 ⇒ classify to ω+
Since 0.5(8) − 2 = 2 > 0, the sample x = 8 lies above the decision boundary
Primal problem:

min(w,b) (1/2)∥w∥²  subject to  yi(wT xi + b) ≥ 1, ∀i = 1, 2, . . . , N

Key Properties
Hard Margin: strict separation requirement
Quadratic Programming: convex optimization problem
Convex: single global minimum guaranteed
Linearly Constrained: N linear inequality constraints
Squaring ∥w∥ removes the square root for differentiability
The scaling factor 1/2 simplifies derivatives
Constraints are linear inequalities
Lagrangian Formulation for SVM
From Constrained to Unconstrained Optimization
Transform the constrained problem using Lagrange multipliers αi :
Original objective: min 12 ∥w∥2
Constraints: yi (wT xi + b) ≥ 1
Introduce αi ≥ 0 for each constraint
Lagrange Function

L(w, b, α) = (1/2)∥w∥² − Σ_{i=1}^N αi [yi(wT xi + b) − 1]

(the original objective minus the constraint term)

Expanded form:

L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi    (4)
L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi    (5)
Optimization Strategy
The solution minimizes L w.r.t. w, b and maximizes it w.r.t. α:

min(w,b) max(α≥0) L(w, b, α)

For w: ∂L/∂w = 0 ⇒ w − Σ_{i=1}^N αi yi xi = 0 ⇒ w = Σ_{i=1}^N αi yi xi

For b: ∂L/∂b = 0 ⇒ Σ_{i=1}^N αi yi = 0

Lagrange multipliers (αi): ∂L/∂αi = 0 ⇒ yi(wT xi + b) − 1 = 0

Splitting the sum by class (yi = ±1):

w = Σ_{i=1}^N αi yi xi = Σ_{yi=+1} αi xi − Σ_{yi=−1} αi xi
Objective function:

min(w,b) L(w, b, α) = ∥w∥²/2 − α1 g1 − α2 g2
  = ∥w∥²/2 − α1 (y1(wx1 + b) − 1) − α2 (y2(wx2 + b) − 1)

Primal feasibility:

∂L/∂α1 = 0 ⇒ 6w + b − 1 = 0
∂L/∂α2 = 0 ⇒ 2w + b + 1 = 0

With w = 4α (from w = α1 y1 x1 + α2 y2 x2 and α1 = α2 = α), solve the system:

6(4α) + b − 1 = 0
2(4α) + b + 1 = 0

Subtract the equations:

(24α + b − 1) − (8α + b + 1) = 0
16α − 2 = 0 ⇒ α = 1/8

Thus:

w = 4 × (1/8) = 0.5
b = −1 − 2(0.5) = −2

Equivalently, from the stationarity condition:

w = Σ_{i=1}^2 αi yi xi = α1 y1 x1 + α2 y2 x2
  = (1/8)(1)(6) + (1/8)(−1)(2) = 6/8 − 2/8 = 0.5

Bias term calculation:
Using x2: 0.5(2) + b = −1 ⇒ b = −2
Using x1: 0.5(6) + b = 1 ⇒ b = −2

Discriminant function: 0.5x − 2 = 0 (matches the previous solution)
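The worked example above can be verified numerically; a minimal sketch using the same data (x1 = 6 with y1 = +1, x2 = 2 with y2 = −1) and the derived α = 1/8:

```python
# Recover w and b from the dual variables of the 1-D example.
alpha = 1 / 8
x1, y1 = 6.0, +1
x2, y2 = 2.0, -1

w = alpha * y1 * x1 + alpha * y2 * x2   # 6/8 - 2/8 = 0.5
b_from_x1 = y1 - w * x1                 # 1 - 3 = -2
b_from_x2 = y2 - w * x2                 # -1 - 1 = -2
print(w, b_from_x1, b_from_x2)          # 0.5 -2.0 -2.0

# Both support vectors sit exactly on their margin planes:
print(y1 * (w * x1 + b_from_x1))        # 1.0
print(y2 * (w * x2 + b_from_x1))        # 1.0
```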
Support Vector Significance
Key observations:
Both samples are support vectors (αi > 0)
Adding/removing non-support vectors (outside the margin) doesn't affect the solution
Lagrange multipliers for non-support vectors: αi = 0
Removing support vectors changes the decision boundary
Challenge:
d + 1 variables (w ∈ Rd, b ∈ R)
Computationally heavy when d is large (high dimensions)
Dual advantage:
Only N variables (α1, α2, . . . , αN)
Efficient when d ≫ N (common in NLP, bioinformatics)
Key insight:
Most αi = 0! Only support vectors matter (αi > 0)
Automatic feature weighting
Maximize:

LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj (xiT xj)

Subject to:

αi ≥ 0 ∀i
Σ_{i=1}^N αi yi = 0
Key Properties
Convex Quadratic Program (guaranteed global optimum)
αi > 0 only for support vectors (critical points near margin)
Substituting w = Σi αi yi xi and Σi αi yi = 0 into L:

max LD(α) = (1/2)∥w∥² − Σi αi (yi(wT xi + b) − 1)
  = (1/2)(Σi αi yi xi)² + Σi αi − Σ_{i,j} αi αj yi yj xiT xj
  = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Subject to: αi ≥ 0, Σ_{i=1}^N αi yi = 0
Dual Problem Specification

max(α) 1T α − (1/2) αT Hα
s.t. α ⪰ 0, αT y = 0

Where:
Hij = yi yj ⟨xi, xj⟩
1 = [1, 1, . . . , 1]T
α = [α1, α2, . . . , αN]T
y = [y1, y2, . . . , yN]T

Equivalent minimization form:

min(α) (1/2) αT Hα − 1T α
Substituting w = 4α1 (with α2 = α1):

LD = (1/2)(4α1)² − α1(24α1 + b − 1) − α1(−8α1 − b − 1)
   = 8α1² − 24α1² − bα1 + α1 + 8α1² + bα1 + α1
   = −8α1² + 2α1

Optimality condition:

∂LD/∂α1 = −16α1 + 2 = 0 ⇒ α1 = α2 = 1/8

Parameter recovery:
w = 4α1 = 0.5
b = 1 − 6w = −2
In matrix form, with H = [36 −12; −12 4] (Hij = yi yj xi xj for x1 = 6, x2 = 2):

min LD = (1/2) [α1 α2] [36 −12; −12 4] [α1; α2] − (α1 + α2) − λ(α1 − α2)

∂LD/∂α1 = 36α1 − 12α2 − 1 − λ = 0
∂LD/∂α2 = −12α1 + 4α2 − 1 + λ = 0

With α1 = α2 ⇒ λ = 2, α1 = α2 = 1/8

Important note:
When N > d (more samples than features), solutions may not be unique
Many α combinations may satisfy the constraints
The dual formulation remains valid but may have multiple solutions
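With the equality constraint enforced as α1 = α2 = a, the dual above reduces to the one-variable function LD(a) = 2a − 8a², so a crude grid search recovers the analytic optimum; a minimal sketch:

```python
# Reduced 1-D dual: L_D(a) = 2a - 8a^2 with a = alpha1 = alpha2 >= 0.
def L_D(a):
    return 2 * a - 8 * a * a

grid = [i / 10000 for i in range(0, 10001)]  # a in [0, 1]
a_star = max(grid, key=L_D)
print(a_star)      # 0.125, i.e. alpha = 1/8
print(4 * a_star)  # w = 4 * alpha = 0.5
```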
Example Setup
Data Points:
x1 = [1, 1], x2 = [−1, 1], x3 = [−1, −1], x4 = [1, −1]
Labels:
y = [+1, −1, −1, +1]

[Figure: the four points at the corners of the square with vertices (±1, ±1).]

Hessian Matrix (Hij = yi yj xiT xj):

H = [2 0 2 0;
     0 2 0 2;
     2 0 2 0;
     0 2 0 2]
Dual Problem Formulation
Dual Objective:

min LD(α) = (1/2) αT Hα − f T α − λ(α1 − α2 − α3 + α4)
  = (1/2)(2α1² + 2α2² + 2α3² + 2α4² + 4α1α3 + 4α2α4)
    − (α1 + α2 + α3 + α4) − λ(α1 − α2 − α3 + α4)

f = [1, 1, 1, 1]T
λ: Lagrange multiplier for the constraint Σi αi yi = 0

Observations:
The system of equations has multiple solutions
Different α combinations yield the same w and b
One solution: α1 = α2 = α3 = α4 = 1/4 (λ = 0)
Compute w:

w = Σ_{i=1}^4 αi yi xi
  = (1/4) [(+1)[1, 1] + (−1)[−1, 1] + (−1)[−1, −1] + (+1)[1, −1]]
  = [1, 0]

Compute b:

b = (1/4) Σ_{i=1}^4 (1/yi − wT xi) = 0
Another valid solution uses only two support vectors, e.g. α1 = α2 = 1/2, α3 = α4 = 0:

w = (1/2) [(+1)[1, 1] + (−1)[−1, 1]] = [1, 0]

b = (1/2) Σ_{i=1}^2 (1/yi − wT xi) = 0

Model equivalence: different α values, yet all solutions yield the same decision boundary.
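The equivalence of the two dual solutions can be checked directly; a minimal sketch with the four corner points from this example:

```python
# Two different dual solutions recover the same (w, b) for the corner data.
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]

def recover_w(alphas):
    w = [0.0, 0.0]
    for a, y, x in zip(alphas, ys, xs):
        w[0] += a * y * x[0]
        w[1] += a * y * x[1]
    return w

w_a = recover_w([0.25, 0.25, 0.25, 0.25])  # all four points as support vectors
w_b = recover_w([0.5, 0.5, 0.0, 0.0])      # only x1, x2 as support vectors
print(w_a, w_b)  # both [1.0, 0.0]

# b from any support vector: b = y - w.x
b = ys[0] - (w_a[0] * xs[0][0] + w_a[1] * xs[0][1])
print(b)  # 0.0
```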
Constraint Relaxation:

yi(wT xi + b) − 1 + ϵi ≥ 0, ϵi ≥ 0

Primal Lagrangian:

min LP(w, b, ϵ, α, µ) = (1/2)∥w∥² + (C/k) Σi ϵi^k − Σi αi [yi(wT xi + b) − 1 + ϵi] − Σi µi ϵi

KKT Conditions:
1 Stationarity:
∇w LP = 0 ⇒ w = Σ αi yi xi
∂LP/∂b = 0 ⇒ Σ αi yi = 0
∂LP/∂ϵi = 0 ⇒ C = αi + µi (k = 1)
2 Complementary slackness:
αi [yi(wT xi + b) − 1 + ϵi] = 0
µi ϵi = (C − αi) ϵi = 0
Dual Problem:

max LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Constraints:

0 ≤ αi ≤ C, Σ_{i=1}^N αi yi = 0
Key Insight:
Identical to the separable case except αi ≤ C
ϵi and µi vanish in the dual form
Three types of support vectors:

αi          | ϵi          | Position                        | Classification
0           | 0           | Outside margin                  | Correct
0 < αi < C  | 0           | On margin                       | Correct (Free SV)
C           | ϵi > 1      | Anywhere                        | Misclassified (Bounded SV)
C           | 0 < ϵi < 1  | Between margin and hyperplane,  | Correct (Bounded SV)
            |             | on the wrong side of its margin |

Table: Support vector types and their characteristics
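For separable data the soft-margin solution coincides with the hard-margin one whenever C is large enough; a minimal sketch (assumed C = 10, reusing the four-corner example and its solution w = [1, 0], b = 0) checking that every slack ϵi = max(0, 1 − yi f(xi)) is zero and the box constraint holds:

```python
# Soft-margin sanity check on separable data.
C = 10.0
alphas = [0.25, 0.25, 0.25, 0.25]
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]
w, b = (1.0, 0.0), 0.0  # solution recovered earlier in the slides

slacks = [max(0.0, 1 - y * (w[0] * x[0] + w[1] * x[1] + b))
          for x, y in zip(xs, ys)]
print(slacks)                            # [0.0, 0.0, 0.0, 0.0]
print(all(0 <= a <= C for a in alphas))  # True: box constraint satisfied
```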
Properties:
Positive definite Hessian (H + (1/C) I)
Numerically stable solutions
More support vectors than L1-SVM
Comparison: L1-SVM vs L2-SVM
Property Advantage
Sparsity L1-SVM
Stability L2-SVM
Solution uniqueness L2-SVM
Feature map Φ(x) = (x, x²) = (z1, z2):
x = 1 → (1, 1)   x = 2 → (2, 4)
x = −1 → (−1, 1)   x = −2 → (−2, 4)
After the mapping, the samples are linearly separable.

Linear kernel: equivalent to no transformation; the input space remains unchanged.
RBF kernel: infinite-dimensional feature space; universal approximator.
Polynomial Kernel:
K (xi , xj ) = (⟨xi , xj ⟩ + c)d
d : Degree of polynomial
c : Constant term
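The kernel trick can be illustrated by checking that the degree-2 polynomial kernel in 1-D equals an explicit feature-map inner product; the map Φ(x) = (x², √2·x, 1) used here is an assumption (one standard choice for d = 2, c = 1):

```python
import math

def K(x, y):
    """Degree-2 polynomial kernel, K(x, y) = (x*y + 1)^2."""
    return (x * y + 1) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces K (assumed form)."""
    return (x * x, math.sqrt(2) * x, 1.0)

x, y = 3.0, 5.0
lhs = K(x, y)                                     # (15 + 1)^2 = 256
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # 225 + 30 + 1 = 256 (up to fp)
print(lhs, rhs)
```

The kernel evaluates the inner product in the mapped space without ever constructing Φ(x) explicitly, which is what makes high- or infinite-dimensional feature spaces tractable.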
The Kernel Trick
Advantages:
Avoid explicit high-dimensional computation
Handle "curse of dimensionality"
Efficient calculation of inner products
Dataset (1-D): the positive class includes x2 = 6 and x3 = 8.
Kernel: K(x, y) = (x · y + 1)²
[Figure: the resulting decision function plotted over x ∈ [0, 9].]
Key Insights
Practical Considerations:
RBF kernel: generally the most flexible
Polynomial kernels useful for structured data
Kernel selection via cross-validation
Margin Calculation:

γ = min(n) yn(w⊤xn + b) / ∥w∥
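The geometric margin γ above can be computed directly; a minimal sketch reusing the four-corner example (w = [1, 0], b = 0 from the earlier slides):

```python
import math

# gamma = min_n y_n (w.x_n + b) / ||w||
w, b = (1.0, 0.0), 0.0
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]

norm_w = math.sqrt(w[0] ** 2 + w[1] ** 2)
gamma = min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm_w
            for x, y in zip(xs, ys))
print(gamma)  # 1.0, i.e. half of the margin width 2/||w|| = 2
```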
Hyperplane Form: w⊤ x + b = 0.
Notation: w = [w1 , w2 , . . . , wd ]⊤ , b is the bias (sometimes called
w0 ).
Purpose: Separating b allows the hyperplane to shift, not just pass
through the origin.
Example: In 2D, w1 x1 + w2 x2 + b = 0 can fit data offset from the
origin, unlike w1 x1 + w2 x2 = 0.
In Practice: b is optimized with w to maximize the margin.