SVM: Maximizing Margin in Classification

This document is a lecture on Support Vector Machines (SVM) by Alaa Othman, focusing on the concept of maximum margin hyperplanes for classifying linearly separable data. It explains the importance of maximizing the margin to improve classification accuracy and discusses the mathematical formulation of SVM, including the optimization problem and the role of support vectors. Key properties and geometric interpretations of SVM are also covered, along with the transformation to Lagrangian formulation for optimization.

Learning from Data

Lecture 8: Support Vector Machines (SVM)

Alaa Othman

June 10, 2025

Alaa Othman June 10, 2025 1 / 64



Motivation: Why Maximum Margin?
For linearly separable data, many hyperplanes can separate the classes. Which hyperplane is best?
In-Sample Error: All separating hyperplanes achieve zero error.
Model Complexity: All linear models have identical complexity.




Key Idea: The hyperplane with the maximum margin is optimal.
Why?
Margin: The distance from the hyperplane to the nearest data point.
Why Better?: A larger margin increases the likelihood that new points are classified correctly.


Dichotomies with Fat Margin
Fat Margin: Excluding dichotomies whose margin falls below a threshold reduces the number of possible dichotomies.
Growth Function: Counts the distinct labelings a model can produce; a fat margin restricts it, and a more restricted growth function improves generalization.


Defining and Maximizing the Margin
Goal: Find the weight vector w that maximizes the margin.
Hyperplane: Defined by w⊤x + b = 0.
Margin: Distance from the hyperplane to the nearest point xn.
Figure: The optimal hyperplane with the maximum margin. Support vectors lie on the margin planes H1 (wT x + b = +1) and H2 (wT x + b = −1); the margin width is 2/∥w∥. Correctly classified samples outside the margin have αi = 0 and ϵi = 0, free support vectors on the margin have C > αi > 0 and ϵi = 0, and misclassified samples have αi = C and ϵi ≥ 1.
Key Concepts in SVM
Changing w and b generates infinitely many hyperplanes
SVM goal: Find the hyperplane maximally distant from the closest samples of both classes
This optimal hyperplane has the largest margin
The critical samples are called support vectors

Mathematical Formulation
Define two parallel planes:
H1: wT xi + b = +1 for yi = +1   (1)
H2: wT xi + b = −1 for yi = −1   (2)
with constraints:
wT xi + b ≥ +1 ∀ yi = +1
wT xi + b ≤ −1 ∀ yi = −1
Key Properties of SVM Hyperplanes

Geometric Characteristics
The defining planes H1 and H2 satisfy:
Parallel, with the same normal vector w
No training samples between them
Margin width = 2/∥w∥ (derived on the next slides)

Unified Classification Constraint

All training samples satisfy:
yi (wT xi + b) − 1 ≥ 0   ∀i = 1, 2, . . . , N   (3)


Support Vector Identification
Support vectors lie exactly on the margin planes:
yi (wT xi + b) = 1

wT x1 + b = +1   (x1 on H1)
wT x2 + b = −1   (x2 on H2)
Other samples satisfy the strict inequality:
yi (wT xi + b) > 1

Support vectors define the orientation and position of the optimal hyperplane


Margin Geometry in SVM
Key Properties:
Margin M = d1 + d2, where:
d1: distance to the closest positive sample
d2: distance to the closest negative sample
The hyperplane is equidistant: d1 = d2
M is measured in the normal direction w

Euclidean Norm: ∥w∥ = √(wT w) = √(Σi wi²)
Unit vector: ŵ = w/∥w∥

Example: For the vector A = [2, 2]T, the unit vector Â is calculated as follows:
Â = A/∥A∥ = [2, 2]T / √(2² + 2²) = [1/√2, 1/√2]T
The norm of a unit vector is always one:
∥Â∥ = √((1/√2)² + (1/√2)²) = √1 = 1
A unit vector keeps the direction of the original vector A while standardizing its length to 1: it indicates orientation without the magnitude.
Margin Calculation:
Using support vectors x1 (class +1) and x2 (class −1), project each onto the w direction:
Dk = xk · w/∥w∥

M = |D1 − D2| = |x1 · w/∥w∥ − x2 · w/∥w∥|
  = |wT x1 − wT x2| / ∥w∥ = |(1 − b) − (−1 − b)| / ∥w∥ = 2/∥w∥

Key Insight
Maximizing the margin M ⇔ minimizing ∥w∥
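The projection derivation above can be checked numerically; a minimal numpy sketch with an illustrative weight vector and two support vectors placed on the planes wT x + b = ±1 (the values are chosen for the example, not taken from the lecture):

```python
import numpy as np

w = np.array([0.6, 0.8])        # ||w|| = 1 for this illustrative choice
b = 0.0
x_pos = np.array([0.6, 0.8])    # lies on w.x + b = +1
x_neg = np.array([-0.6, -0.8])  # lies on w.x + b = -1

u = w / np.linalg.norm(w)       # unit normal direction
M = abs(x_pos @ u - x_neg @ u)  # |D1 - D2|
print(M, 2 / np.linalg.norm(w)) # both are approximately 2
```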
Problem Setup

Assume two classes ω− and ω+:

Negative class: ω− = {x2} where x2 = 2
Positive class: ω+ = {x1} where x1 = 6
From the SVM constraints:
x1: w · 6 + b ≥ +1
x2: w · 2 + b ≤ −1

Figure: 1-D example with support vectors x2 = 2 and x1 = 6; the planes H2 and H1 bracket the margin M around the decision boundary at x = 4.


Geometric Interpretation

x1 lies on H1: wx + b = +1
x2 lies on H2: wx + b = −1
Margin M = 2/∥w∥

Effect of ∥w∥:
∥w∥ = 1 ⇒ M = 2
∥w∥ = 2 ⇒ M = 1
∥w∥ = 4 ⇒ M = 0.5


Constraint Violation Example

Consider w = 0.1 (∥w∥ = 0.1):

Requires M = 20
But x2 would then lie between H1 and H2
Mathematical contradiction:
From x2: 0.1 × 2 + b ≤ −1 ⇒ b ≤ −1.2
From x1: 0.1 × 6 + b ≥ 1 ⇒ b ≥ 0.4
⇒ No feasible b

w = 0.5 gives the maximum feasible margin


Optimal Solution

Solve for the parameters:

x1: 0.5 × 6 + b = 1 ⇒ 3 + b = 1 ⇒ b = −2
x2: 0.5 × 2 + b = −1 ⇒ 1 + b = −1 ⇒ b = −2

Discriminant function:
wT x + b = 0 ⇒ 0.5x − 2 = 0 ⇒ x = 4

Margin width:
M = 2/∥w∥ = 2/0.5 = 4


Classification of a New Sample

Test sample xtest = 8:

ytest = sign(wT xtest + b) = sign(0.5 × 8 − 2) = sign(2) = +1

Decision process:
+1 ⇒ classify to ω+
Since 0.5(8) − 2 = 2 > 0, the sample lies above the decision boundary
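The whole worked example can be verified in a few lines; a minimal sketch of the final classifier f(x) = sign(0.5x − 2) from the slides:

```python
# Parameters from the worked example: w = 0.5, b = -2.
w, b = 0.5, -2.0

def classify(x):
    """Return the predicted class sign(w*x + b)."""
    return 1 if w * x + b > 0 else -1

# The support vectors satisfy y_i (w x_i + b) = 1 exactly:
assert 1 * (w * 6 + b) == 1 and -1 * (w * 2 + b) == 1

print(2 / abs(w))   # margin M = 4.0
print(classify(8))  # new sample x = 8 -> +1
```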


Optimization Problem: Primal Form

Challenge with Many Samples

With large datasets, manual hyperplane selection becomes intractable. We need mathematical optimization:

Primal Optimization Problem

minimize f(w) = ∥w∥
subject to g(w, b) = yi (wT xi + b) − 1 ≥ 0   ∀i = 1, 2, . . . , N


Equivalent Quadratic Form
A convenient transformation for optimization:

minimize f(w) = (1/2)∥w∥²
subject to yi (wT xi + b) ≥ 1   ∀i = 1, 2, . . . , N

Key Properties
Hard Margin: Strict separation requirement
Quadratic Programming: Convex optimization problem
Convex: A single global minimum is guaranteed
Linearly Constrained: N linear inequality constraints
Squaring removes the square root for differentiability
The scaling factor 1/2 simplifies the derivatives
Lagrangian Formulation for SVM
From Constrained to Unconstrained Optimization
Transform the constrained problem using Lagrange multipliers αi:
Original objective: min (1/2)∥w∥²
Constraints: yi (wT xi + b) ≥ 1
Introduce αi ≥ 0 for each constraint

Lagrange Function
L(w, b, α) = (1/2)∥w∥² − Σ_{i=1}^N αi [yi (wT xi + b) − 1]
(the first term is the original objective; the sum is the constraint term)

Expanded form:
L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi   (4)
L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi   (5)

Minimize L w.r.t. w and b; maximize it w.r.t. α
α = [α1, . . . , αN]T: Lagrange multipliers
αi ≥ 0: non-negativity constraint
αi [yi (wT xi + b) − 1] = 0 (complementary slackness)
αi > 0 only for support vectors, and αi = 0 for non-support vectors


Optimal Solution

Optimization Strategy
The solution minimizes L w.r.t. w, b and maximizes it w.r.t. α:
min_{w,b} max_{α≥0} L(w, b, α)

Found by setting the partial derivatives to zero:

For w: ∂L/∂w = 0 ⇒ w − Σ_{i=1}^N αi yi xi = 0 ⇒ w = Σ_{i=1}^N αi yi xi
For b: ∂L/∂b = 0 ⇒ Σ_{i=1}^N αi yi = 0
For the Lagrange multipliers αi: ∂L/∂αi = 0 ⇒ yi (wT xi + b) − 1 = 0


w = Σ_{i=1}^N αi yi xi: w is a linear combination of the training vectors

w = Σ_{i=1}^N αi yi xi = Σ_{xi∈ω+} αi xi − Σ_{xi∈ω−} αi xi

Σ_{i=1}^N αi yi = 0: the bias condition balances the class contributions

Σ_{ω+} αi = Σ_{ω−} αi

yi (wT xi + b) − 1 = 0: support vectors satisfy the margin condition.

The decision function depends only on the support vectors:
f(x) = sign( Σ_{SVs} αi yi xiT x + b )


Constraint Formulation
For our two-sample example:
g1(w, b) = (w x1 + b) − 1   (y1 = +1)
g2(w, b) = −(w x2 + b) − 1   (y2 = −1)

Objective function:
min_{w,b} L(w, b, α) = ∥w∥²/2 − α1 g1 − α2 g2
 = ∥w∥²/2 − α1 (y1 (w x1 + b) − 1) − α2 (y2 (w x2 + b) − 1)

With x1 = 6, y1 = +1 and x2 = 2, y2 = −1:

L = w²/2 − α1 (6w + b − 1) − α2 (−(2w + b) − 1)
Derivatives and Optimality Conditions

Take the partial derivatives and set them to zero:

∂L/∂w = 0 ⇒ w − 6α1 + 2α2 = 0 ⇒ w = 6α1 − 2α2
∂L/∂b = 0 ⇒ −α1 + α2 = 0 ⇒ α1 = α2

Primal feasibility:
∂L/∂α1 = 0 ⇒ 6w + b − 1 = 0
∂L/∂α2 = 0 ⇒ 2w + b + 1 = 0


Solution and Interpretation
From α1 = α2 and w = 6α1 − 2α2 :
w = 6α − 2α = 4α (let α = α1 = α2 )

Solve system:
6(4α) + b − 1 = 0
2(4α) + b + 1 = 0
Subtract equations:
(24α + b − 1) − (8α + b + 1) = 0
16α − 2 = 0 ⇒ α = 1/8
Thus:
w = 4 × (1/8) = 0.5
b = −1 − 2(0.5) = −2

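The system above is small enough to solve by direct back-substitution; a minimal sketch that reproduces w = 0.5, b = −2, α = 1/8:

```python
# Solve the stationarity/feasibility system for the two-point example
# x1 = 6 (y1 = +1), x2 = 2 (y2 = -1):
#   6w + b - 1 = 0,  2w + b + 1 = 0,  w = 6a1 - 2a2,  a1 = a2
w = (1 - (-1)) / (6 - 2)   # subtracting the two feasibility equations: 4w = 2
b = 1 - 6 * w              # back-substitute into 6w + b = 1
alpha = w / 4              # from w = 6a - 2a = 4a with a1 = a2 = a
print(w, b, alpha)         # 0.5 -2.0 0.125
```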


Support Vector Representation
Weight vector as a linear combination:

w = Σ_{i=1}^N αi yi xi = α1 y1 x1 + α2 y2 x2
  = (1/8)(+1)(6) + (1/8)(−1)(2)
  = 6/8 − 2/8 = 0.5

Bias term calculation:
Using x2: 0.5(2) + b = −1 ⇒ b = −2
Using x1: 0.5(6) + b = 1 ⇒ b = −2
Discriminant function: 0.5x − 2 = 0 (matches the previous solution)
Support Vector Significance

Key observations:
Both samples are support vectors (αi > 0)
Adding or removing non-support vectors (outside the margin) doesn't affect the solution
Lagrange multipliers for non-support vectors: αi = 0
Removing support vectors changes the decision boundary

Only support vectors contribute to the weight vector

The solution depends only on these critical samples
Why Solve the Dual Problem?

Primal problem (original SVM):

min_{w,b} (1/2)∥w∥²   s.t. yi (w⊤ xi + b) ≥ 1

Challenge:
d + 1 variables (w ∈ Rd, b ∈ R)
Computationally heavy when d is large (high dimensions)
Dual advantage:
Only N variables (α1, α2, . . . , αN)
Efficient when d ≫ N (common in NLP, bioinformatics)
Key insight:
Most αi = 0! Only support vectors matter (αi > 0)
Automatic feature weighting



Dual Optimization Problem

Maximize:
LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj (xi⊤ xj)
Subject to:
αi ≥ 0   ∀i
Σ_{i=1}^N αi yi = 0

Key Properties
Convex quadratic program (guaranteed global optimum)
αi > 0 only for support vectors (critical points near the margin)


Primal to Dual Formulation

The primal SVM problem is convex quadratic programming:

min_{w,b} (1/2)∥w∥²   s.t. yi (wT xi + b) ≥ 1

Convexity guarantees strong duality
The dual form is obtained by:
1 Forming the Lagrangian LP
2 Setting ∇w,b LP = 0
3 Substituting back into LP
The dual problem has αi variables (one per data point)


Deriving the Dual Form
Substituting the stationarity conditions
w = Σ_{i=1}^N αi yi xi,   Σ_{i=1}^N αi yi = 0
into the Lagrangian:

max LD(α) = (1/2)∥w∥² − Σ_i αi (yi (wT xi + b) − 1)
          = (1/2)(Σ_i αi yi xi)² + Σ_i αi − Σ_{i,j} αi αj yi yj xiT xj
          = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Subject to: αi ≥ 0, Σ_{i=1}^N αi yi = 0
Dual Problem Specification

The dual optimization problem:

max_α Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj
s.t. αi ≥ 0   ∀i
Σ_{i=1}^N αi yi = 0

Maximize with respect to the αi (Lagrange multipliers)

Number of variables = number of training samples (N)
After optimization, the non-zero αi correspond to support vectors


Matrix Formulation
Compact matrix representation:

max_α 1T α − (1/2) αT H α
s.t. α ⪰ 0, αT y = 0
Where:
Hij = yi yj ⟨xi, xj⟩
1 = [1, 1, . . . , 1]T
α = [α1, α2, . . . , αN]T
y = [y1, y2, . . . , yN]T
Equivalent minimization form:

min_α (1/2) αT H α − 1T α


Geometric Interpretation

The kernel matrix H encodes similarity:

Hij = yi yj ⟨xi, xj⟩

Dissimilar points (⟨xi, xj⟩ ≈ 0): minimal contribution
Similar points with the same label (yi = yj): decrease LD
Similar points with different labels (yi ̸= yj): increase LD
Critical patterns (borderline samples) maximize the objective and define the margin


Recovering the Hyperplane

From the dual solution α*:

1 Weight vector:
w = Σ_{i=1}^N αi* yi xi
(only support vectors contribute)

2 Bias term:
b = (1/NSV) Σ_{i=1}^{NSV} (1/yi − wT xi)
where NSV = number of support vectors

3 Final classifier:
f(x) = sign(wT x + b)


Dual SVM: Detailed Example
Consider a 1-D dataset with two points:
Positive sample: x1 = 6, y1 = +1
Negative sample: x2 = 2, y2 = −1
Primal relationships:
w = 6α1 − 2α2
α1 = α2 (from the constraint Σ αi yi = 0)
⇒ w = 4α1
b = 1 − 6w
Dual objective derivation:
max LD = ∥w∥²/2 − α1 ((6w + b) − 1) − α2 (−(2w + b) − 1)
       = (4α1)²/2 − α1 (24α1 + b − 1) − α1 (−8α1 − b − 1)
       = 8α1² − 24α1² − bα1 + α1 + 8α1² + bα1 + α1
       = −8α1² + 2α1


Example Solution and Verification

Optimality condition:
∂LD/∂α1 = −16α1 + 2 = 0
α1 = α2 = 1/8

Parameter recovery:
w = 4α1 = 0.5
b = 1 − 6w = −2
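The one-variable dual LD(α) = −8α² + 2α can also be maximized iteratively; a minimal sketch using projected gradient ascent (the learning rate and iteration count are illustrative choices):

```python
# Maximize L_D(alpha) = 2*alpha - 8*alpha^2 (the two-point example with
# alpha1 = alpha2 = alpha) by projected gradient ascent.
alpha, lr = 0.0, 0.01
for _ in range(2000):
    alpha += lr * (2 - 16 * alpha)   # dL_D/dalpha = 2 - 16*alpha
    alpha = max(alpha, 0.0)          # dual feasibility: alpha >= 0
w = 4 * alpha                        # w = 6*alpha - 2*alpha
b = 1 - 6 * w                        # support-vector condition at x1 = 6
print(alpha, w, b)                   # converges to 0.125, 0.5, -2.0
```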


Matrix form verification:

min LD = (1/2) αT H α − 1T α − λ(α1 − α2)

H = [ y1 y1 x1 x1   y1 y2 x1 x2 ] = [ 1 · 1 · 36     1 · (−1) · 12   ] = [  36  −12 ]
    [ y2 y1 x2 x1   y2 y2 x2 x2 ]   [ (−1) · 1 · 12  (−1) · (−1) · 4 ]   [ −12    4 ]

∂LD/∂α1 = 36α1 − 12α2 − 1 − λ = 0
∂LD/∂α2 = −12α1 + 4α2 − 1 + λ = 0

With α1 = α2 ⇒ λ = 2, α1 = α2 = 1/8
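The same matrix-form solution can be obtained by solving the KKT linear system numerically; a minimal numpy sketch for this two-point example:

```python
import numpy as np

x = np.array([6.0, 2.0])
y = np.array([1.0, -1.0])
H = np.outer(y, y) * np.outer(x, x)   # H_ij = y_i y_j x_i x_j

# Stationarity H a - 1 - lam*y = 0 together with the constraint y.a = 0,
# written as one linear system in (a1, a2, lam):
A = np.block([[H, -y[:, None]],
              [y[None, :], np.zeros((1, 1))]])
a1, a2, lam = np.linalg.solve(A, np.array([1.0, 1.0, 0.0]))

w = a1 * y[0] * x[0] + a2 * y[1] * x[1]   # w = sum_i alpha_i y_i x_i
b = 1 / y[0] - w * x[0]                   # support-vector condition at x1
print(a1, a2, lam, w, b)                  # alphas 1/8, lambda 2, w 0.5, b -2
```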


Key Observations

The dual solution matches the primal solution (w = 0.5, b = −2)

Both methods (direct and matrix) yield identical α values
In this case both points are support vectors

Important note:
When N > d (more samples than features), the solution may not be unique
Many α combinations may satisfy the constraints
The dual formulation remains valid but may have multiple solutions


Solution Uniqueness in SVM
Key Insight: The solution may not be unique when the Hessian is only positive semidefinite.
Uniqueness Conditions:
Positive semidefinite Hessian ⇒ multiple solutions (different α values possible)
Positive definite Hessian ⇒ unique w, b

Figure: Four training points x1, x2, x3, x4 at the corners of the square [−1, 1]²
Example Setup
Data Points:
x1 = [1, 1]
x2 = [−1, 1]
x3 = [−1, −1]
x4 = [1, −1]
Labels:
y = [+1, −1, −1, +1]

Hessian Matrix:
H = [ 2 0 2 0
      0 2 0 2
      2 0 2 0
      0 2 0 2 ]
Dual Problem Formulation

Dual Objective:
min LD(α) = (1/2) αT H α − f T α − λ(α1 − α2 − α3 + α4)
          = (1/2)(2α1² + 2α2² + 2α3² + 2α4² + 4α1α3 + 4α2α4)
            − (α1 + α2 + α3 + α4) − λ(α1 − α2 − α3 + α4)
f = [1, 1, 1, 1]T
λ: Lagrange multiplier for the equality constraint Σ αi yi = 0


Optimality Conditions

Set the partial derivatives to zero:

∂LD/∂α1 = 0 ⇒ 2α1 + 2α3 = 1 + λ
∂LD/∂α2 = 0 ⇒ 2α2 + 2α4 = 1 − λ
∂LD/∂α3 = 0 ⇒ 2α1 + 2α3 = 1 − λ
∂LD/∂α4 = 0 ⇒ 2α2 + 2α4 = 1 + λ

Observations:
The system of equations has multiple solutions
Different α combinations yield the same w and b


Solution 1: All Support Vectors

α1 = α2 = α3 = α4 = 1/4   (λ = 0)
Compute w:
w = Σ_{i=1}^4 αi yi xi
  = (1/4) [ (+1)[1, 1]T + (−1)[−1, 1]T + (−1)[−1, −1]T + (+1)[1, −1]T ]
  = [1, 0]T
Compute b:
b = (1/4) Σ_{i=1}^4 (1/yi − wT xi) = 0


Solution 2: Partial Support Vectors

α1 = α2 = 1/2, α3 = α4 = 0   (λ = 0)
Compute w:
w = Σ_{i=1}^4 αi yi xi
  = (1/2) [ (+1)[1, 1]T + (−1)[−1, 1]T ]
  = [1, 0]T
Compute b:
b = (1/2) Σ_{i=1}^2 (1/yi − wT xi) = 0

Key Insight: Same w and b with different α values
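Both α configurations can be checked with a few lines of numpy; a minimal sketch recovering (w, b) from each:

```python
import numpy as np

X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, -1, 1], dtype=float)

def recover_wb(alpha):
    """Recover (w, b) from the dual variables: w = sum_i alpha_i y_i x_i."""
    w = (alpha * y) @ X
    sv = alpha > 0                       # support vectors
    b = np.mean(1 / y[sv] - X[sv] @ w)   # average over the support vectors
    return w, b

w1, b1 = recover_wb(np.array([0.25, 0.25, 0.25, 0.25]))  # Solution 1
w2, b2 = recover_wb(np.array([0.5, 0.5, 0.0, 0.0]))      # Solution 2
print(w1, b1)   # [1. 0.] 0.0
print(w2, b2)   # [1. 0.] 0.0  -> identical hyperplane
```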
Critical Observations

Multiple α configurations can produce identical w and b

All valid solutions satisfy:
αi ≥ 0 (non-negativity)
Σ_{i=1}^N αi yi = 0 (equality constraint)

Support vectors can vary between solutions:

Solution 1: All 4 points are support vectors
Solution 2: Only the first 2 points are support vectors
Geometric interpretation: The separating hyperplane wT x + b = 0 is identical in both solutions


Implications for SVM Optimization

Solution space: Convex but not strictly convex
Optimization algorithms may find different α values
Model equivalence: All solutions yield the same decision boundary
Computational efficiency: Solutions with fewer support vectors are preferred

Figure: Identical decision boundary wT x + b = x1 = 0 in both solutions

Practical note: Solution uniqueness depends on the Hessian properties and the data configuration
Challenges with Non-Separable Data

Problem: Strict separation is impossible with overlapping classes

Consequence: Misclassified samples increase the αi values
Effect: More support vectors affect the decision boundary


Slack Variables: Key Concept

Constraint Relaxation:
yi (wT xi + b) − 1 + ϵi ≥ 0,   ϵi ≥ 0

Slack Variable ϵi:
Distance to the margin plane
Permits margin relaxation (soft margin)
Represents the classification error

Objective Function:
min_{w,b,ϵ} (1/2)∥w∥² + (C/k) Σ_{i=1}^N ϵi^k

Regularization Parameter C:
Small C: larger margin, more errors allowed
Large C: smaller margin, fewer errors
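The slacks and the role of C can be made concrete by evaluating the soft-margin objective (k = 1) for a fixed boundary; a minimal sketch reusing the earlier 1-D boundary w = 0.5, b = −2, with an added overlapping sample at x = 3 (an illustrative point, not from the lecture):

```python
import numpy as np

x = np.array([2.0, 6.0, 3.0])   # 3.0 is an overlapping positive sample
y = np.array([-1.0, 1.0, 1.0])
w, b = 0.5, -2.0

# Slack of each sample: eps_i = max(0, 1 - y_i (w x_i + b))
slack = np.maximum(0.0, 1 - y * (w * x + b))
print(slack)                    # [0. 0. 1.5] -> only the new point needs slack

for C in (0.1, 10.0):
    obj = 0.5 * w**2 + C * slack.sum()
    print(C, obj)               # small C keeps the violation cheap
```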


Primal Optimization Problem

min LP(w, b, ϵ, α, µ) = (1/2)∥w∥² + (C/k) Σ ϵi^k − Σ αi [yi (wT xi + b) − 1 + ϵi] − Σ µi ϵi

KKT Conditions:
1 Stationarity:
∇w LP = 0 ⇒ w = Σ αi yi xi
∂LP/∂b = 0 ⇒ Σ αi yi = 0
∂LP/∂ϵi = 0 ⇒ C = αi + µi   (k = 1)
2 Complementary slackness:
αi [yi (wT xi + b) − 1 + ϵi] = 0
µi ϵi = (C − αi) ϵi = 0


Dual Formulation (L1-SVM)

Dual Problem:
max LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Constraints:
0 ≤ αi ≤ C   ∀i,   Σ_{i=1}^N αi yi = 0

Key Insight:
Identical to the separable case except for the upper bound αi ≤ C
ϵi and µi vanish in the dual form
Three types of support vectors


Support Vector Types (L1-SVM)

αi           ϵi           Position                        Classification
0            0            Outside margin                  Correct
0 < αi < C   0            On margin                       Correct (Free SV)
C            ϵi > 1       Anywhere                        Misclassified (Bounded SV)
C            0 < ϵi < 1   Between margin and hyperplane   Correct (Bounded SV)
C            0            On wrong margin side            (Bounded SV)

Table: Support vector types and their characteristics


L2-SVM Formulation
Primal Problem:
min LP(w, b, ϵ, α) = (1/2)∥w∥² + (C/2) Σ ϵi² − Σ αi [yi (wT xi + b) − 1 + ϵi]

Stationarity Condition:
∂LP/∂ϵi = 0 ⇒ αi = C ϵi

Dual Problem:
min LD(α) = (1/2) αT (H + (1/C) I) α − f T α

Properties:
Positive definite Hessian (H + (1/C) I)
Numerically stable solutions
More support vectors than L1-SVM
Comparison: L1-SVM vs L2-SVM

L1-SVM (k = 1):
Sparser solution
αi bounded by C
Three SV types
Hard-margin interpretation

L2-SVM (k = 2):
Denser solution
αi unbounded
Numerically stable
1/C added to the Hessian diagonal

Property              Advantage
Sparsity              L1-SVM
Stability             L2-SVM
Solution uniqueness   L2-SVM


Practical Considerations

C Selection: Crucial trade-off parameter

Cross-validation recommended
Grid search common practice
Kernel Compatibility: Both L1/L2 extend to nonlinear cases
Implementation:
L1-SVM: More efficient for large sparse datasets
L2-SVM: Better for noisy, overlapping datasets
Solution Path: The optimal C can be found via regularization-path algorithms
Geometric Insight: The soft margin allows "margin errors" to handle class overlap


Handling Nonlinear Separability

Core Idea: Transform the data to a higher-dimensional space where linear separation is possible
Kernel Function: K(xi, xj) = ϕ(xi)T ϕ(xj)
SVM Objective:
min_{w,b} (1/2)∥w∥² + (C/k) Σ_{i=1}^N ϵi^k

Subject to: yi (wT ϕ(xi) + b) − 1 + ϵi ≥ 0

Figure: Mapping from input space to feature space. 1-D samples that are non-linearly separable become linearly separable under ϕ(x) = (x, x²) = (z1, z2): 1 → (1, 1), 2 → (2, 4), −1 → (−1, 1), −2 → (−2, 4).
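A minimal sketch of the figure's feature map ϕ(x) = (x, x²); the class assignment below ({−1, +1} vs {−2, +2}) is an assumption, since the labels are not explicit on the slide:

```python
# Feature map from the figure: phi(x) = (x, x^2) = (z1, z2).
def phi(x):
    return (x, x * x)

inner = [-1, 1]    # assumed one class
outer = [-2, 2]    # assumed other class
# No single threshold on x separates the classes, but in feature space
# the horizontal line z2 = 2.5 does:
print([phi(v)[1] < 2.5 for v in inner])   # [True, True]
print([phi(v)[1] < 2.5 for v in outer])   # [False, False]
```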


Popular Kernel Functions

Linear Kernel:
K(xi, xj) = ⟨xi, xj⟩
Equivalent to no transformation; the input space remains unchanged

Polynomial Kernel:
K(xi, xj) = (⟨xi, xj⟩ + c)^d
d: degree of the polynomial
c: constant term

Gaussian (RBF) Kernel:
K(xi, xj) = exp(−∥xi − xj∥² / (2σ²))
Infinite-dimensional feature space
Universal approximator
The Kernel Trick

Challenge: The explicit mapping ϕ(x) can be computationally expensive

Solution: Compute the inner products directly in input space
Example:

ϕ(x) = [x1², √2 x1 x2, x2²]
K(x, y) = ϕ(x)T ϕ(y) = (x1 y1 + x2 y2)² = (xT y)²

Advantages:
Avoids explicit high-dimensional computation
Handles the "curse of dimensionality"
Efficient calculation of inner products
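The identity ϕ(x)Tϕ(y) = (xTy)² for the degree-2 map can be checked numerically; a minimal sketch (the sample vectors are arbitrary):

```python
import math

# phi maps R^2 into R^3 so that the ordinary dot product there equals
# the polynomial kernel (x.y)^2 in the input space.
def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, 4.0]
print(dot(phi(x), phi(y)), dot(x, y) ** 2)   # both approximately 121
```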


Kernel Example: 1D Classification

Dataset:
Positive class: x2 = 6
Negative class: x1 = 2, x3 = 8

Polynomial Kernel:
K(x, y) = (x · y + 1)²

Mapping:
ϕ(x) = [x², √2 x, 1]

Figure: Nonlinear decision boundary


Solution Calculation
Hessian Matrix:
H = [  25   −169    289
     −169   1369  −2401
      289  −2401   4225 ]

Lagrange Multipliers:
∂LD/∂α1 = 0 : 25α1 − 169α2 + 289α3 = 1 + λ
∂LD/∂α2 = 0 : −169α1 + 1369α2 − 2401α3 = 1 − λ
∂LD/∂α3 = 0 : 289α1 − 2401α2 + 4225α3 = 1 + λ

Solution:
α1 = 0.7396, α2 = 1.5938, α3 = 0.8542
All samples are support vectors
Decision Boundary
Hyperplane Equation:
f(x) = Σ_{i=1}^3 yi αi K(xi, x) + b
     = −0.25x² + 2.5x − 5

Verification using the explicit mapping:
w = Σ yi αi ϕ(xi) = [−1/4, 2.5/√2, 0], so wT ϕ(x) + b gives the same quadratic boundary.

Figure: The quadratic decision function over the interval [0, 9]
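Evaluating the kernel expansion confirms the boundary; a minimal sketch using the α values from the slide and b = −5 (the constant implied by the stated boundary −0.25x² + 2.5x − 5):

```python
# Kernel decision function f(x) = sum_i y_i alpha_i K(x_i, x) + b with
# K(x, y) = (x*y + 1)^2 for the 1-D example.
xs =     [2.0,     6.0,    8.0]
ys =     [-1.0,    1.0,   -1.0]
alphas = [0.7396, 1.5938, 0.8542]   # values from the slide (rounded)
b = -5.0

def f(x):
    return sum(a * y * (xi * x + 1) ** 2
               for a, y, xi in zip(alphas, ys, xs)) + b

# The positive sample lies where f > 0, the negatives where f < 0:
print(f(6) > 0, f(2) < 0, f(8) < 0)   # True True True
```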
Key Insights

Universal Separability: Kernels can make any dataset linearly separable
Computational Efficiency: The kernel trick avoids explicit high-dimensional computation
Support Vectors: Determine the complexity of the decision boundary
Hyperparameter Choice:
Kernel type
Kernel parameters (σ for RBF, d for polynomial)
Regularization C

Practical Considerations:
The RBF kernel is generally the most flexible
Polynomial kernels are useful for structured data
Kernel selection via cross-validation


Defining and Maximizing the Margin

Goal: Find the weight vector w that maximizes the margin.

Hyperplane: Defined by w⊤x + b = 0.
Margin: Distance from the hyperplane to the nearest point xn.
Distance Formula:
Distance = |w⊤xn + b| / ∥w∥

Margin Calculation:
γ = min_n yn (w⊤xn + b) / ∥w∥

where yn = +1 or −1 is the class label.

Objective: Maximize γ.
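The margin formula translates directly to a few lines of numpy; a minimal sketch with an illustrative 2-D dataset and candidate hyperplane (the values are chosen for the example):

```python
import numpy as np

# gamma = min_n y_n (w . x_n + b) / ||w||
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(gamma)   # sqrt(2): the closest points are [1, 1] and [-1, -1]
```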
Normalization of the Weight Vector

Step: Normalize w so that |w⊤xn + b| = 1 for the nearest point.

Why?: Simplifies the margin to γ = 1/∥w∥.
Support Vectors: The nearest points satisfy yn (w⊤xn + b) = 1.
Consequence: Maximizing γ becomes minimizing ∥w∥.
Optimization Problem: Minimize (1/2)∥w∥² subject to:
yn (w⊤xn + b) ≥ 1, ∀n


Separating the Bias Term

Hyperplane Form: w⊤x + b = 0.
Notation: w = [w1, w2, . . . , wd]⊤; b is the bias (sometimes called w0).
Purpose: Separating b allows the hyperplane to shift, not just pass through the origin.
Example: In 2D, w1 x1 + w2 x2 + b = 0 can fit data offset from the origin, unlike w1 x1 + w2 x2 = 0.
In Practice: b is optimized together with w to maximize the margin.