SVM: Maximizing Margin in Classification
Alaa Othman
[Figure: 2-D plot (x1 vs. x2) of the SVM margin. The hyperplane wT x + b = 0 lies midway between H1: wT x + b = +1 and H2: wT x + b = −1; the margin width is 2/∥w∥, with each plane at distance 1/∥w∥ (distances D1, D2) from the boundary. Support vectors on the margin have C > αi > 0, ϵi = 0; correctly classified samples outside the margin have αi = 0, ϵi = 0; samples inside the margin have αi = C, 1 > ϵi > 0; misclassified samples have αi = C, ϵi ≥ 1.]
Alaa Othman June 10, 2025 5 / 64
Key Concepts in SVM
Changing w and b generates infinite hyperplanes
SVM goal: Find hyperplane maximally distant from closest samples of
both classes
This optimal hyperplane has the largest margin
Critical samples are called support vectors
Mathematical Formulation
Define two parallel planes:
H1 : wT xi + b = +1 for yi = +1 (1)
H2 : wT xi + b = −1 for yi = −1 (2)
with constraints:
wT xi + b ≥ +1 ∀ yi = +1
wT xi + b ≤ −1 ∀ yi = −1
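Since yi = ±1, the two class constraints above collapse into the single condition yi(wT xi + b) ≥ 1. A minimal sketch checking this (the data points and the weights w = [0.5], b = −2 are the 1-D toy values used later in these slides):

```python
def satisfies_margin(w, b, x, y):
    """Check the combined SVM constraint y * (w.x + b) >= 1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return y * score >= 1

# 1-D example with w = [0.5], b = -2:
print(satisfies_margin([0.5], -2.0, [6.0], +1))  # point on H1 -> True
print(satisfies_margin([0.5], -2.0, [2.0], -1))  # point on H2 -> True
print(satisfies_margin([0.5], -2.0, [5.0], +1))  # inside the margin -> False
```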
Key Properties of SVM Hyperplanes
Geometric Characteristics
The defining planes H1 and H2 satisfy:
Parallel with same normal vector w
No training samples between them
Margin width = 2/∥w∥ (next slides)
wT x1 + b = +1 (x1 on H1 )
wT x2 + b = −1 (x2 on H2 )
Other samples satisfy strict inequality:
yi (wT xi + b) > 1
A unit vector example with A = [2, 2]T:

Â = A/∥A∥ = [2, 2]T / √(2² + 2²) = [1/√2, 1/√2]T

The norm of the unit vector¹ is always one:

∥Â∥ = √((1/√2)² + (1/√2)²) = √1 = 1

¹ The unit vector keeps the direction of the original vector A while standardizing its length to 1. It indicates the orientation without affecting the magnitude.
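The normalization above can be checked numerically; a minimal sketch with the same vector A = [2, 2]:

```python
import math

A = [2.0, 2.0]
norm_A = math.sqrt(sum(a * a for a in A))        # sqrt(2^2 + 2^2) = 2*sqrt(2)
A_hat = [a / norm_A for a in A]                  # [1/sqrt(2), 1/sqrt(2)]
norm_A_hat = math.sqrt(sum(a * a for a in A_hat))
print(norm_A_hat)  # 1.0 (up to floating point)
```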
Margin Calculation:
Using support vectors x1 (class +1) and x2 (class −1):

M = |D1 − D2| = |x1 · w/∥w∥ − x2 · w/∥w∥|
  = |wT x1 − wT x2| / ∥w∥
  = |(1 − b) − (−1 − b)| / ∥w∥
  = 2/∥w∥
Key Insight
Maximizing margin M ⇔ Minimizing ∥w∥
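The two expressions for the margin (projected distance between support vectors vs. the closed form 2/∥w∥) can be compared on an assumed toy configuration, here w = [1, 0], x1 = [1, 0] on H1 and x2 = [−1, 0] on H2:

```python
import math

# Assumed 2-D toy values: support vectors on H1 and H2 for w = [1, 0], b = 0.
w = [1.0, 0.0]
x1, x2 = [1.0, 0.0], [-1.0, 0.0]

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
norm_w = math.sqrt(dot(w, w))

M_projection = abs(dot(w, x1) - dot(w, x2)) / norm_w  # |D1 - D2|
M_formula = 2.0 / norm_w                              # 2 / ||w||
print(M_projection, M_formula)  # both 2.0
```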
Problem Setup

[Figure: 1-D dataset with support vectors x1 ∈ ω+ and x2 ∈ ω− on H1 and H2, margin M, and decision boundary x = 4.]

x1 lies on H1: wx + b = +1
x2 lies on H2: wx + b = −1

Margin M = 2/∥w∥

Effect of ∥w∥:
∥w∥ = 1 ⇒ M = 2
∥w∥ = 2 ⇒ M = 1
∥w∥ = 4 ⇒ M = 0.5

Discriminant function:
wT x + b = 0 ⇒ 0.5x − 2 = 0 ⇒ x = 4

Margin width:
M = 2/∥w∥ = 2/0.5 = 4

Decision process:
sign(wT x + b) = +1 ⇒ classify to ω+
Since 0.5(8) − 2 = 2 > 0, the sample x = 8 lies above the decision boundary
Primal problem:

min(w,b) (1/2)∥w∥²  subject to  yi(wT xi + b) ≥ 1, ∀i = 1, 2, . . . , N

Key Properties
Hard Margin: strict separation requirement
Quadratic Programming: convex optimization problem
Convex: single global minimum guaranteed
Linearly Constrained: N linear inequality constraints
Squaring ∥w∥ removes the square root for differentiability
The scaling factor 1/2 simplifies derivatives
Constraints are linear inequalities
Lagrangian Formulation for SVM
From Constrained to Unconstrained Optimization
Transform the constrained problem using Lagrange multipliers αi :
Original objective: min 12 ∥w∥2
Constraints: yi (wT xi + b) ≥ 1
Introduce αi ≥ 0 for each constraint
Lagrange Function

L(w, b, α) = (1/2)∥w∥² − Σ_{i=1}^N αi [yi(wT xi + b) − 1]

(the original objective minus the constraint term)

Expanded form:

L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi    (4)
L(w, b, α) = (1/2) wT w − Σ_{i=1}^N αi yi wT xi − b Σ_{i=1}^N αi yi + Σ_{i=1}^N αi    (5)
Optimization Strategy
The solution minimizes L w.r.t. w, b and maximizes it w.r.t. α:

min(w,b) max(α≥0) L(w, b, α)

For w: ∂L/∂w = 0 ⇒ w − Σ_{i=1}^N αi yi xi = 0 ⇒ w = Σ_{i=1}^N αi yi xi

For b: ∂L/∂b = 0 ⇒ Σ_{i=1}^N αi yi = 0

Lagrange multipliers (αi): ∂L/∂αi = 0 ⇒ yi(wT xi + b) − 1 = 0

Splitting the sum by class (yi = ±1):

w = Σ_{i=1}^N αi yi xi = Σ_{yi=+1} αi xi − Σ_{yi=−1} αi xi
Objective function:

min(w,b) L(w, b, α) = ∥w∥²/2 − α1 g1 − α2 g2
  = ∥w∥²/2 − α1 (y1(wx1 + b) − 1) − α2 (y2(wx2 + b) − 1)

Primal feasibility:

∂L/∂α1 = 0 ⇒ 6w + b − 1 = 0
∂L/∂α2 = 0 ⇒ 2w + b + 1 = 0

With w = 4α (from w = α1 y1 x1 + α2 y2 x2 and α1 = α2 = α), solve the system:

6(4α) + b − 1 = 0
2(4α) + b + 1 = 0

Subtract the equations:

(24α + b − 1) − (8α + b + 1) = 0
16α − 2 = 0 ⇒ α = 1/8

Thus:

w = 4 × (1/8) = 0.5
b = −1 − 2(0.5) = −2

Equivalently, from the stationarity condition:

w = Σ_{i=1}^2 αi yi xi = α1 y1 x1 + α2 y2 x2
  = (1/8)(1)(6) + (1/8)(−1)(2) = 6/8 − 2/8 = 0.5

Bias term calculation:
Using x2: 0.5(2) + b = −1 ⇒ b = −2
Using x1: 0.5(6) + b = 1 ⇒ b = −2

Discriminant function: 0.5x − 2 = 0 (matches the previous solution)
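The worked example above can be verified numerically; a minimal sketch using the same data (x1 = 6 with y1 = +1, x2 = 2 with y2 = −1) and the derived α = 1/8:

```python
# Recover w and b from the dual variables of the 1-D example.
alpha = 1 / 8
x1, y1 = 6.0, +1
x2, y2 = 2.0, -1

w = alpha * y1 * x1 + alpha * y2 * x2   # 6/8 - 2/8 = 0.5
b_from_x1 = y1 - w * x1                 # 1 - 3 = -2
b_from_x2 = y2 - w * x2                 # -1 - 1 = -2
print(w, b_from_x1, b_from_x2)          # 0.5 -2.0 -2.0

# Both support vectors sit exactly on their margin planes:
print(y1 * (w * x1 + b_from_x1))        # 1.0
print(y2 * (w * x2 + b_from_x1))        # 1.0
```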
Support Vector Significance
Key observations:
Both samples are support vectors (αi > 0)
Adding/removing non-support vectors (outside the margin) doesn't affect the solution
Lagrange multipliers for non-support vectors: αi = 0
Removing support vectors changes the decision boundary
Challenge:
d + 1 variables (w ∈ Rd, b ∈ R)
Computationally heavy when d is large (high dimensions)
Dual advantage:
Only N variables (α1, α2, . . . , αN)
Efficient when d ≫ N (common in NLP, bioinformatics)
Key insight:
Most αi = 0! Only support vectors matter (αi > 0)
Automatic feature weighting
Maximize:

LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj (xiT xj)

Subject to:

αi ≥ 0 ∀i
Σ_{i=1}^N αi yi = 0
Key Properties
Convex Quadratic Program (guaranteed global optimum)
αi > 0 only for support vectors (critical points near margin)
Substituting w = Σi αi yi xi and Σi αi yi = 0 into L:

max LD(α) = (1/2)∥w∥² − Σi αi (yi(wT xi + b) − 1)
  = (1/2)(Σi αi yi xi)² + Σi αi − Σ_{i,j} αi αj yi yj xiT xj
  = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Subject to: αi ≥ 0, Σ_{i=1}^N αi yi = 0
Dual Problem Specification

max(α) 1T α − (1/2) αT Hα
s.t. α ⪰ 0, αT y = 0

Where:
Hij = yi yj ⟨xi, xj⟩
1 = [1, 1, . . . , 1]T
α = [α1, α2, . . . , αN]T
y = [y1, y2, . . . , yN]T

Equivalent minimization form:

min(α) (1/2) αT Hα − 1T α
Substituting w = 4α1 (with α2 = α1):

LD = (1/2)(4α1)² − α1(24α1 + b − 1) − α1(−8α1 − b − 1)
   = 8α1² − 24α1² − bα1 + α1 + 8α1² + bα1 + α1
   = −8α1² + 2α1

Optimality condition:

∂LD/∂α1 = −16α1 + 2 = 0 ⇒ α1 = α2 = 1/8

Parameter recovery:
w = 4α1 = 0.5
b = 1 − 6w = −2
In matrix form, with H = [36 −12; −12 4] (Hij = yi yj xi xj for x1 = 6, x2 = 2):

min LD = (1/2) [α1 α2] [36 −12; −12 4] [α1; α2] − (α1 + α2) − λ(α1 − α2)

∂LD/∂α1 = 36α1 − 12α2 − 1 − λ = 0
∂LD/∂α2 = −12α1 + 4α2 − 1 + λ = 0

With α1 = α2 ⇒ λ = 2, α1 = α2 = 1/8

Important note:
When N > d (more samples than features), solutions may not be unique
Many α combinations may satisfy the constraints
The dual formulation remains valid but may have multiple solutions
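With the equality constraint enforced as α1 = α2 = a, the dual above reduces to the one-variable function LD(a) = 2a − 8a², so a crude grid search recovers the analytic optimum; a minimal sketch:

```python
# Reduced 1-D dual: L_D(a) = 2a - 8a^2 with a = alpha1 = alpha2 >= 0.
def L_D(a):
    return 2 * a - 8 * a * a

grid = [i / 10000 for i in range(0, 10001)]  # a in [0, 1]
a_star = max(grid, key=L_D)
print(a_star)      # 0.125, i.e. alpha = 1/8
print(4 * a_star)  # w = 4 * alpha = 0.5
```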
Example Setup
Data Points:
x1 = [1, 1], x2 = [−1, 1], x3 = [−1, −1], x4 = [1, −1]
Labels:
y = [+1, −1, −1, +1]

[Figure: the four points at the corners of the square with vertices (±1, ±1).]

Hessian Matrix (Hij = yi yj xiT xj):

H = [2 0 2 0;
     0 2 0 2;
     2 0 2 0;
     0 2 0 2]
Dual Problem Formulation
Dual Objective:

min LD(α) = (1/2) αT Hα − f T α − λ(α1 − α2 − α3 + α4)
  = (1/2)(2α1² + 2α2² + 2α3² + 2α4² + 4α1α3 + 4α2α4)
    − (α1 + α2 + α3 + α4) − λ(α1 − α2 − α3 + α4)

f = [1, 1, 1, 1]T
λ: Lagrange multiplier for the constraint Σi αi yi = 0

Observations:
The system of equations has multiple solutions
Different α combinations yield the same w and b
One solution: α1 = α2 = α3 = α4 = 1/4 (λ = 0)
Compute w:

w = Σ_{i=1}^4 αi yi xi
  = (1/4) [(+1)[1, 1] + (−1)[−1, 1] + (−1)[−1, −1] + (+1)[1, −1]]
  = [1, 0]

Compute b:

b = (1/4) Σ_{i=1}^4 (1/yi − wT xi) = 0
Another valid solution uses only two support vectors, e.g. α1 = α2 = 1/2, α3 = α4 = 0:

w = (1/2) [(+1)[1, 1] + (−1)[−1, 1]] = [1, 0]

b = (1/2) Σ_{i=1}^2 (1/yi − wT xi) = 0

Model equivalence: different α values, yet all solutions yield the same decision boundary.
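The equivalence of the two dual solutions can be checked directly; a minimal sketch with the four corner points from this example:

```python
# Two different dual solutions recover the same (w, b) for the corner data.
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]

def recover_w(alphas):
    w = [0.0, 0.0]
    for a, y, x in zip(alphas, ys, xs):
        w[0] += a * y * x[0]
        w[1] += a * y * x[1]
    return w

w_a = recover_w([0.25, 0.25, 0.25, 0.25])  # all four points as support vectors
w_b = recover_w([0.5, 0.5, 0.0, 0.0])      # only x1, x2 as support vectors
print(w_a, w_b)  # both [1.0, 0.0]

# b from any support vector: b = y - w.x
b = ys[0] - (w_a[0] * xs[0][0] + w_a[1] * xs[0][1])
print(b)  # 0.0
```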
Constraint Relaxation:

yi(wT xi + b) − 1 + ϵi ≥ 0, ϵi ≥ 0

Primal Lagrangian:

min LP(w, b, ϵ, α, µ) = (1/2)∥w∥² + (C/k) Σi ϵi^k − Σi αi [yi(wT xi + b) − 1 + ϵi] − Σi µi ϵi

KKT Conditions:
1 Stationarity:
∇w LP = 0 ⇒ w = Σ αi yi xi
∂LP/∂b = 0 ⇒ Σ αi yi = 0
∂LP/∂ϵi = 0 ⇒ C = αi + µi (k = 1)
2 Complementary slackness:
αi [yi(wT xi + b) − 1 + ϵi] = 0
µi ϵi = (C − αi) ϵi = 0
Dual Problem:

max LD(α) = Σ_{i=1}^N αi − (1/2) Σ_{i,j} αi αj yi yj xiT xj

Constraints:

0 ≤ αi ≤ C, Σ_{i=1}^N αi yi = 0
Key Insight:
Identical to the separable case except αi ≤ C
ϵi and µi vanish in the dual form
Three types of support vectors:

αi          | ϵi          | Position                        | Classification
0           | 0           | Outside margin                  | Correct
0 < αi < C  | 0           | On margin                       | Correct (Free SV)
C           | ϵi > 1      | Anywhere                        | Misclassified (Bounded SV)
C           | 0 < ϵi < 1  | Between margin and hyperplane,  | Correct (Bounded SV)
            |             | on the wrong side of its margin |

Table: Support vector types and their characteristics
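For separable data the soft-margin solution coincides with the hard-margin one whenever C is large enough; a minimal sketch (assumed C = 10, reusing the four-corner example and its solution w = [1, 0], b = 0) checking that every slack ϵi = max(0, 1 − yi f(xi)) is zero and the box constraint holds:

```python
# Soft-margin sanity check on separable data.
C = 10.0
alphas = [0.25, 0.25, 0.25, 0.25]
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]
w, b = (1.0, 0.0), 0.0  # solution recovered earlier in the slides

slacks = [max(0.0, 1 - y * (w[0] * x[0] + w[1] * x[1] + b))
          for x, y in zip(xs, ys)]
print(slacks)                            # [0.0, 0.0, 0.0, 0.0]
print(all(0 <= a <= C for a in alphas))  # True: box constraint satisfied
```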
Properties:
Positive definite Hessian (H + (1/C) I)
Numerically stable solutions
More support vectors than L1-SVM
Comparison: L1-SVM vs L2-SVM
Property Advantage
Sparsity L1-SVM
Stability L2-SVM
Solution uniqueness L2-SVM
Feature map Φ(x) = (x, x²) = (z1, z2):
x = 1 → (1, 1)   x = 2 → (2, 4)
x = −1 → (−1, 1)   x = −2 → (−2, 4)
After the mapping, the samples are linearly separable.

Linear kernel: equivalent to no transformation; the input space remains unchanged.
RBF kernel: infinite-dimensional feature space; universal approximator.
Polynomial Kernel:
K (xi , xj ) = (⟨xi , xj ⟩ + c)d
d : Degree of polynomial
c : Constant term
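The kernel trick can be illustrated by checking that the degree-2 polynomial kernel in 1-D equals an explicit feature-map inner product; the map Φ(x) = (x², √2·x, 1) used here is an assumption (one standard choice for d = 2, c = 1):

```python
import math

def K(x, y):
    """Degree-2 polynomial kernel, K(x, y) = (x*y + 1)^2."""
    return (x * y + 1) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces K (assumed form)."""
    return (x * x, math.sqrt(2) * x, 1.0)

x, y = 3.0, 5.0
lhs = K(x, y)                                     # (15 + 1)^2 = 256
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # 225 + 30 + 1 = 256 (up to fp)
print(lhs, rhs)
```

The kernel evaluates the inner product in the mapped space without ever constructing Φ(x) explicitly, which is what makes high- or infinite-dimensional feature spaces tractable.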
The Kernel Trick
Advantages:
Avoid explicit high-dimensional computation
Handle "curse of dimensionality"
Efficient calculation of inner products
Dataset (1-D): the positive class includes x2 = 6 and x3 = 8.
Kernel: K(x, y) = (x · y + 1)²
[Figure: the resulting decision function plotted over x ∈ [0, 9].]
Key Insights
Practical Considerations:
RBF kernel: generally the most flexible
Polynomial kernels useful for structured data
Kernel selection via cross-validation
Margin Calculation:

γ = min(n) yn(w⊤xn + b) / ∥w∥
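The geometric margin γ above can be computed directly; a minimal sketch reusing the four-corner example (w = [1, 0], b = 0 from the earlier slides):

```python
import math

# gamma = min_n y_n (w.x_n + b) / ||w||
w, b = (1.0, 0.0), 0.0
xs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
ys = [+1, -1, -1, +1]

norm_w = math.sqrt(w[0] ** 2 + w[1] ** 2)
gamma = min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm_w
            for x, y in zip(xs, ys))
print(gamma)  # 1.0, i.e. half of the margin width 2/||w|| = 2
```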
Hyperplane Form: w⊤ x + b = 0.
Notation: w = [w1 , w2 , . . . , wd ]⊤ , b is the bias (sometimes called
w0 ).
Purpose: Separating b allows the hyperplane to shift, not just pass
through the origin.
Example: In 2D, w1 x1 + w2 x2 + b = 0 can fit data offset from the
origin, unlike w1 x1 + w2 x2 = 0.
In Practice: b is optimized with w to maximize the margin.