Support Vector Machines Overview
DATE-30-April-2025
SUPPORT VECTOR MACHINE
SUBMITTED BY
Devanshu Rai (V24002)
Khushi (V24004)
Arun (V24007)
KM Sony (V24013)
Seema Alaria (V24022)
Yash Kumar (V24024)
Neha Hidko (V24028)
SUBMITTED TO
Dr. ARUNA BOMMAREDDI
Acknowledgment
Contents
Abstract
1. Introduction
2. Understanding Margin
3. Functional and Geometric Margin
4. The Most Effective Margin Classifier
5. Adding a Scaling Limitation
6. Top Margin Classifiers
7. Reiterating the Issue in Two Ways
8. Making Predictions With the Dual Formulation
9. Kernels
10. Efficiency of Kernel Computation
11. Polynomial Kernels
12. Kernel Similarity
13. Application of Kernel Methods
14. Regularization and Non-Separable Case
15. Coordinate Ascent
16. The SMO Methodology
17. Conclusion
18. References
Abstract
This paper provides a thorough examination of Support Vector Machines (SVMs), a powerful class of supervised learning algorithms commonly used for classification and regression tasks. It begins by introducing the basic concept of margins and their impact on classification confidence.
It then examines the mechanics of constructing an optimal margin classifier using Lagrange duality, which is crucial for formulating and solving SVM optimization problems. To make SVMs suitable for complex, non-linearly separable data, the paper also examines the role of kernel functions in enabling SVMs to operate in high-dimensional or even infinite-dimensional feature spaces.
Additionally, it shows that SVMs can be trained very efficiently using the Sequential Minimal Optimization (SMO) technique, which reduces computational complexity and improves scalability.
Through theoretical explanations and instructive examples, this paper aims to provide a clear, structured understanding of SVM methodology, its mathematical foundations, and practical implementation.
Support Vector Machines
1 Introduction
This paper presents the Support Vector Machine (SVM) learning algorithm. SVMs are widely regarded as among the most powerful supervised learning algorithms currently in use. Before covering the fundamentals of SVMs, we investigate the concept of margins and the significance of producing a discernible separation between data points. We then discuss the optimal margin classifier, which leads to Lagrange duality, a mathematical tool essential to solving SVM optimization problems. Since kernels enable SVMs to operate efficiently in high-dimensional or even infinite-dimensional feature spaces, their role will also be examined. Finally, the discussion concludes with the Sequential Minimal Optimization (SMO) approach, which provides a computationally efficient SVM implementation.
2 Understanding Margin
Since margins provide insight into prediction confidence, they initiate the discussion of SVMs. Before their formal definition in a subsequent section, this section aims to establish an intuitive understanding of margins.
Consider logistic regression, which uses h(x) = g(kT x) to model the probability of class y = 1 given input x and parameter k. Class "1" is predicted when h(x) ≥ 0.5, which is equivalent to kT x ≥ 0. For a training example with y = 1, a higher value of kT x corresponds to a higher estimated probability for this class, increasing the prediction's confidence. Conversely, the model is fairly certain that y = 0 if kT x is substantially less than zero.
Given a training dataset, an ideal model would ensure that the value of kT x is significantly greater than zero for all positive cases (y = 1) and significantly lower than zero for negative cases (y = 0). Achieving this separation would demonstrate a high level of confidence in every prediction; functional margins will later formalize this idea. To further illustrate the concept, consider a scenario in which "x" marks positive training samples and "o" marks negative ones. The equation kT x = 0 describes the decision boundary separating the two classes. Examining three specific points, A, B, and C, shows how distance from the decision boundary influences prediction confidence.
Point A lies far from the decision boundary. If the objective is to predict y for A, the model can be quite confident that y = 1. Point C, however, is very close to the boundary. Although it lies on the side of the hyperplane where y = 1 is predicted, a slight change to the boundary could easily flip its classification, so the confidence in predicting y = 1 at C is very low. Point B sits between these two extremes, lending credence to the idea that data points farther from the decision boundary are associated with higher prediction confidence.
A classification model should therefore ideally identify a decision boundary that not only accurately partitions the training data but also maximizes the distance between the boundary and the data points, ensuring that predictions are made with high confidence. Geometric margins will later formally define this concept.
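This intuition can be sketched numerically: the farther a point lies from the boundary, the larger its signed distance and the more confident the prediction. A minimal numpy sketch; the weight vector k and the points below are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical decision boundary kT x = 0 (weights chosen for illustration).
k = np.array([2.0, 1.0])

def signed_distance(x, k):
    """Distance from x to the hyperplane kT x = 0; the sign gives the side."""
    return (k @ x) / np.linalg.norm(k)

A = np.array([4.0, 3.0])   # far from the boundary -> confident y = 1
C = np.array([0.3, -0.5])  # near the boundary -> low confidence

print(signed_distance(A, k))  # large positive value
print(signed_distance(C, k))  # small value, close to zero
```

A point like A, with a large positive distance, would be classified y = 1 with high confidence, while C could flip sides under a small change to the boundary.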
Understanding Notations
Before delving deeper into Support Vector Machines (SVMs), a suitable notation must be introduced to facilitate the discussion of classification. This paper examines a linear classifier for a binary classification problem in which the data consists of features x and corresponding labels y. In this framework, the class labels are represented as y ∈ {−1, 1} instead of the standard {0, 1}. The linear classifier is specified using the parameters w and b rather than the parameter vector θ, and is expressed as follows:
g(z) = 1,  if z ≥ 0
g(z) = −1, otherwise
This notation clearly separates the bias term b from the weight vector w, in contrast to the earlier convention that added an extra feature x0 = 1 to the input vector. In this formulation, w represents the remaining parameters, previously written as [k1, ..., kn]T, and b represents what was previously k0.
In contrast to logistic regression, which estimates the probability of y = 1 before making a decision, this classifier outputs class labels in {−1, 1} directly, without an intermediate probability estimate.
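A minimal sketch of this classifier; the parameters w and b below are illustrative, not from a trained model:

```python
import numpy as np

def g(z):
    """Threshold function: returns +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

def classify(x, w, b):
    """Predict a label in {-1, 1} directly, with no probability in between."""
    return g(w @ x + b)

# Illustrative parameters (not from a trained model).
w, b = np.array([1.0, -1.0]), 0.5
print(classify(np.array([2.0, 1.0]), w, b))   # 1
print(classify(np.array([0.0, 2.0]), w, b))   # -1
```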
The term pT x(i) + b should be a large positive value for a correctly and confidently classified positive example (y(i) = 1). Conversely, for a negative example (y(i) = −1), a large negative value is preferred. Moreover, if y(i)(pT x(i) + b) > 0, the classification of that example is correct. The functional margin of an example is defined as:
t̂(i) = y(i)(pT x(i) + b). (1)
A greater functional margin therefore indicates more confidence in the classification.
Using functional margins as a confidence gauge, however, has drawbacks. Since g depends only on the sign of pT x + b and not on its magnitude, the functional margin can be raised arbitrarily without influencing the classification result. For example, scaling both p and b by a factor of two doubles the functional margin but leaves the classifier unchanged. This shows that, without a normalization restriction, the functional margin alone does not serve as a valid measure of classification confidence. One alternative normalization technique is to constrain the weight vector so that ||p||2 = 1, which normalizes the functional margin. This idea will be revisited later.
Given a dataset S = {(x(i), y(i))}, i = 1, ..., m, the classifier's functional margin relative to the entire dataset is the smallest functional margin over all training examples:
t̂ = min_{i=1,...,m} t̂(i). (2)
This ensures that the classifier maintains a guaranteed level of confidence across all training samples.
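The per-example functional margins and their minimum over the dataset can be computed directly; the data and parameters below are illustrative:

```python
import numpy as np

def functional_margin(p, b, X, y):
    """Per-example functional margins t_i = y_i (pT x_i + b)."""
    return y * (X @ p + b)

# Toy separable data (illustrative values, not from the paper).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
p, b = np.array([1.0, 1.0]), -1.0

t = functional_margin(p, b, X, y)
t_hat = t.min()  # the classifier's functional margin w.r.t. the whole set
print(t, t_hat)
```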
Geometric Margins
The decision boundary defined by (p, b) is accompanied by the weight vector p, which is always orthogonal to the hyperplane separating the two classes. The concept of geometric margins is illustrated with a training example at point A with label y(i) = 1. The shortest distance between this point and the decision boundary is the segment AB, whose length is denoted t(i).
The value of t(i) can be determined by noting that the normalized vector p/||p|| has unit length and points in the same direction as p. Point B's coordinates can therefore be computed as follows:
x(i) − t(i) · p/||p||. (3)
Because B lies exactly on the decision boundary, every point on this hyperplane satisfies the equation:
pT (x(i) − t(i) · p/||p||) + b = 0. (4)
Solving for t(i) yields:
t(i) = pT x(i)/||p|| + b/||p|| = (p/||p||)T x(i) + b/||p||. (5)
This derivation was made for a positive training example, though the definition applies in all cases. For a training example (x(i), y(i)), the classifier's geometric margin is:
t(i) = y(i) ( (p/||p||)T x(i) + b/||p|| ). (6)
One noteworthy feature is the invariance of the geometric margin under rescaling of p and b. If both are scaled by the same factor, the geometric margin remains unchanged, in contrast to the functional margin. This property is useful later in optimization, because it allows arbitrary scaling constraints on p without altering the classification behavior; for instance, one may impose ||p|| = 1 or |p1| = 5 without changing the classifier.
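This invariance can be checked numerically: scaling (p, b) by 2 doubles the functional margin but leaves the geometric margin unchanged. The point and parameters below are illustrative:

```python
import numpy as np

def functional_margin(p, b, x, y):
    """y (pT x + b): grows when (p, b) are scaled up."""
    return y * (p @ x + b)

def geometric_margin(p, b, x, y):
    """y ((p/||p||)T x + b/||p||): invariant to rescaling of (p, b)."""
    return y * ((p @ x + b) / np.linalg.norm(p))

x, y = np.array([2.0, 1.0]), 1
p, b = np.array([1.0, 1.0]), -1.0

# Scaling (p, b) by 2 doubles the functional margin...
print(functional_margin(p, b, x, y), functional_margin(2 * p, 2 * b, x, y))
# ...but leaves the geometric margin unchanged.
print(geometric_margin(p, b, x, y), geometric_margin(2 * p, 2 * b, x, y))
```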
For a dataset S = {(x(i), y(i))}, i = 1, ..., m, the classifier's geometric margin relative to the entire dataset is defined as:
t = min_{i=1,...,m} t(i). (7)
This measures the classifier's margin by its worst-case training example.
max_{p,b} t (8)
subject to:
y(i)(pT x(i) + b) ≥ t, i = 1, ..., m, (9)
||p|| = 1. (10)
The constraint ||p|| = 1 ensures that the functional margin
is equal to the geometric margin. Solving this optimization problem thus yields the parameters (p, b) that achieve the maximum possible geometric margin on the given dataset.
While the desired classifier would ideally be obtained by solving the above optimization problem directly, the constraint ||p|| = 1 poses a significant challenge: it is non-convex, so the problem cannot be solved with standard optimization techniques. To move forward, we consider an alternative formulation:
max_{p,b} t̂/||p|| (11)
subject to:
y(i)(pT x(i) + b) ≥ t̂, i = 1, ..., m. (12)
This is still non-convex, but the scaling freedom of (p, b) can be used to impose:
t̂ = 1. (13)
This restriction is legitimate because multiplying p and b by a constant scales the functional margin proportionally, so appropriate rescaling of the parameters can always satisfy it. Substituting this constraint into the previous formulation, and noting that maximizing t̂/||p|| = 1/||p|| is equivalent to minimizing ||p||², we obtain the following optimization problem:
min_{p,b} (1/2)||p||² (14)
subject to:
y(i)(pT x(i) + b) ≥ 1, i = 1, ..., m. (15)
This problem has a convex quadratic objective with linear constraints, allowing the best margin classifier to be found using standard quadratic programming (QP) techniques.
Although this method already provides a good solution, we take a slight detour through Lagrange duality, which leads to the dual form of our optimization problem. The dual formulation is crucial because it enables the use of kernel methods, which allow support vector machines (SVMs) to operate efficiently in high-dimensional feature spaces. It also leads to a fast optimization algorithm that significantly outperforms conventional QP solvers. We will take a closer look at this alternative approach.
Consider a generalized primal problem of minimizing f(p) subject to inequality constraints gi(p) ≤ 0, i = 1, . . . , k, and equality constraints:
hi(p) = 0, ∀i = 1, . . . , l. (17)
The dual optimization problem is:
max_{s,m} min_p L(p, s, m)
subject to:
si ≥ 0, i = 1, . . . , k.
The Connection Between Dual and Primal Problems
The optimal value of the dual problem is always less than or equal to
the optimal value of the primal problem, according to a fundamental
optimization result:
d∗ = max_{s,m : si ≥ 0} min_p L(p, s, m) ≤ min_p max_{s,m : si ≥ 0} L(p, s, m) = p∗. (29)
Under suitable conditions (for example, when f and the gi are convex and the constraints are strictly feasible), equality holds, and there exist optimal values p∗, s∗, m∗ such that:
p∗ = d∗ = L(p∗, s∗, m∗). (30)
Additionally, the following Karush-Kuhn-Tucker (KKT) criteria
are satisfied by these optimal values:
1. Stationarity:
∂L/∂pi (p∗, s∗, m∗) = 0, i = 1, . . . , n, (31)
∂L/∂mi (p∗, s∗, m∗) = 0, i = 1, . . . , l. (32)
2. Complementary Slackness:
si∗ gi(p∗) = 0, i = 1, . . . , k. (33)
3. Primal Feasibility:
gi(p∗) ≤ 0, i = 1, . . . , k. (34)
4. Dual Feasibility:
si∗ ≥ 0, i = 1, . . . , k. (35)
Equation (33) is particularly important because it implies that if si∗ > 0, then gi(p∗) = 0 must hold with equality. Support vector machines rely heavily on this fact, since it shows that only a subset of the training data, the support vectors, shapes the decision boundary.
Additionally, the KKT criteria provide a basis for determining
convergence in algorithms such as Sequential Minimal
Optimization (SMO).
The constraints of the optimal margin problem take the form:
y(i)(pT x(i) + b) ≥ 1, ∀i. (37)
Each training sample has a corresponding constraint. According
to the Karush-Kuhn-Tucker (KKT) dual complementarity condition,
the Lagrange multipliers are only strictly positive in situations
where the functional margin is precisely equal to one, meaning
that the constraints are satisfied as equalities.
In the figure illustrating this concept, the maximum margin hyperplane is indicated by a solid line. The data points closest to the decision boundary, lying on the dashed lines parallel to the hyperplane, have the smallest margins. These instances, two positive and one negative, are the support vectors. Only these points have nonzero Lagrange multipliers at the optimal solution. The support vectors are typically far fewer than the entire training set, which will prove useful for future developments.
Setting the derivative of the Lagrangian with respect to p to zero gives:
∂L/∂p = p − Σi si y(i) x(i) = 0  ⇒  p = Σi si y(i) x(i). (39)
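Since p is a weighted sum of training points, predictions need only inner products with those points. A small numpy sketch with hypothetical multipliers (illustrative values, not from a solved dual problem):

```python
import numpy as np

# Given dual multipliers s_i, the weight vector is p = sum_i s_i y_i x_i,
# so predictions need only inner products with the training points.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
s = np.array([0.5, 0.0, 0.5])   # hypothetical multipliers; s_i = 0 off the margin

p = (s * y) @ X                 # p = sum_i s_i y_i x_i
x_new = np.array([3.0, 1.0])

direct = p @ x_new                          # explicit weight vector
via_kernels = np.sum(s * y * (X @ x_new))   # same value, using only <x_i, x_new>
print(direct, via_kernels)
```

The two quantities agree, which is what makes the kernel substitution in the next section possible.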
9 Kernels
In our earlier discussion of linear regression, we encountered a scenario in which the input variable represented a home's living area, and we considered using quantities like x² and x³ to fit a cubic function. To distinguish between the two sets of variables, we call the "original" input of the problem, in this case the living area, the input attributes. When these attributes are mapped to a new set of quantities used by the learning algorithm, we call the new quantities the input features.
We use ϕ(x) to denote the feature mapping that transforms input attributes into input features. In the example above:
ϕ(x) = (x, x², x³)
Instead of applying Support Vector Machines (SVMs) directly to the original input attributes, we can learn using the mapped features by replacing every occurrence of x in the algorithm with ϕ(x). Since the algorithm can be written entirely in terms of inner products, it suffices to replace those inner products with K(x, x′). The kernel function is thus defined as:
K(x, x′) = ϕ(x)T ϕ(x′).
10 Efficiency of Kernel Computation
Even when ϕ(x) is expensive to compute explicitly, direct computation of K(x, x′) often requires only O(d) time, offering significant efficiency improvements.
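As a concrete illustration (not from the paper), take the degree-2 kernel K(x, x′) = (xT x′)². Its explicit feature map has d² coordinates, yet the kernel itself needs only a single O(d) inner product; a small numpy check that both routes agree:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial map: all products x_i * x_j (d^2 features)."""
    return np.outer(x, x).ravel()

def K(x, z):
    """Kernel (xT z)^2, computed in O(d) time without forming phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(phi(x) @ phi(z), K(x, z))  # both give the same inner product
```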
11 Polynomial Kernels
One commonly used polynomial kernel function is:
K(x, x′) = (xT x′)^d,
which corresponds to a feature mapping ϕ(x) whose coordinates are all monomials of degree d in the entries of x (terms of the form x_{i1} x_{i2} · · · x_{id}). The kernel function thus allows a transformation into a very high-dimensional feature space to be used efficiently.
12 Kernel Similarity
If ϕ(x) and ϕ(x′) are close together in feature space, then K(x, x′) will be large. However, if they are nearly orthogonal, K(x, x′) will be small. A popular kernel for measuring similarity is the Gaussian kernel:
K(x, x′) = exp( −||x − x′||² / (2σ²) ).
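A quick numerical illustration of this similarity behavior (the points are illustrative and σ = 1 is an arbitrary choice):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)): near 1 for close points."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x))                      # identical points -> 1.0
print(gaussian_kernel(x, np.array([1.1, 2.1])))   # close -> near 1
print(gaussian_kernel(x, np.array([5.0, -3.0])))  # far apart -> near 0
```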
14 Regularization and Non-Separable Case
To handle data that is not linearly separable (or contains outliers), the optimization problem is reformulated using ℓ1 regularization with slack variables ξi:
min_{p,b,ξ} (1/2)||p||² + C Σ_{i=1}^m ξi
with:
y(i)(pT x(i) + b) ≥ 1 − ξi, i = 1, . . . , m, (48)
ξi ≥ 0, i = 1, . . . , m. (49)
With this method, some data points are permitted to have a functional margin smaller than 1. If an example has functional margin 1 − ξi (with ξi > 0), the objective function is increased by the penalty Cξi. The parameter C controls the trade-off between minimizing ||p||², which maximizes the margin, and ensuring that most examples maintain a functional margin of at least one.
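The slacks and the regularized objective can be computed directly; the data, parameters, and C below are illustrative, not a solved problem:

```python
import numpy as np

def soft_margin_objective(p, b, X, y, C):
    """0.5 ||p||^2 + C * sum(xi), with slacks xi = max(0, 1 - y (pT x + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ p + b))
    return 0.5 * p @ p + C * xi.sum(), xi

X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -1.0]])
y = np.array([1, 1, -1])
p, b = np.array([1.0, 1.0]), 0.0

obj, xi = soft_margin_objective(p, b, X, y, C=1.0)
print(xi)   # only the point inside the margin gets a positive slack
print(obj)
```

Raising C makes margin violations more expensive; lowering it favors a wider margin at the cost of more violations.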
Construction of Lagrangians
As before, we construct the Lagrangian function:
L(p, b, ξ, s, r) = (1/2) pT p + C Σ_{i=1}^m ξi − Σ_{i=1}^m si [ y(i)(pT x(i) + b) − 1 + ξi ] − Σ_{i=1}^m ri ξi. (50)
Setting the derivatives with respect to the primal variables to zero and substituting back yields the dual problem:
max_s W(s) = Σ_{i=1}^m si − (1/2) Σ_{i,j=1}^m y(i) y(j) si sj ⟨x(i), x(j)⟩ (51)
subject to:
0 ≤ si ≤ C, i = 1, . . . , m, (52)
Σ_{i=1}^m si y(i) = 0. (53)
As previously mentioned, equation (39) allows the weight vector p to be expressed in terms of the si, so once the dual optimization problem has been solved, predictions can still be made using only inner products with the training examples.
It is interesting to note that adding ℓ1 regularization changes the dual problem in only one way: the constraint on si goes from 0 ≤ si to 0 ≤ si ≤ C. The calculation of b∗ must also be modified, since the earlier expression for it is no longer valid; further details can be found in Platt's article or in the section that follows.
Karush-Kuhn-Tucker Conditions
The following are the KKT dual complementarity requirements,
which will later aid in verifying convergence in the Sequential
Minimal Optimization (SMO) algorithm:
1. If si = 0, then:
y(i)(pT x(i) + b) ≥ 1. (54)
2. If si = C, then:
y(i)(pT x(i) + b) ≤ 1. (55)
3. If 0 < si < C,
then:
y(i)(pT x(i) + b) = 1. (56)
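These three cases can be turned into a simple numerical check of the kind used when testing convergence in SMO. The function below is a sketch; the tolerance and the sample values are assumptions for illustration:

```python
def kkt_case(s_i, margin_i, C, tol=1e-6):
    """Check the KKT dual-complementarity condition for one example.

    margin_i is y_i (pT x_i + b); s_i is the Lagrange multiplier.
    Returns True if the pair (s_i, margin_i) satisfies the conditions."""
    if s_i < tol:                    # s_i = 0   -> margin must be >= 1
        return margin_i >= 1 - tol
    if s_i > C - tol:                # s_i = C   -> margin must be <= 1
        return margin_i <= 1 + tol
    return abs(margin_i - 1) < tol   # 0 < s_i < C -> margin exactly 1

print(kkt_case(0.0, 2.3, C=1.0))  # non-support vector, margin > 1: OK
print(kkt_case(0.5, 1.0, C=1.0))  # on-margin support vector: OK
print(kkt_case(0.0, 0.4, C=1.0))  # violation: s_i = 0 but margin < 1
```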
15 Coordinate Ascent
Consider the problem of solving an unconstrained optimization problem max_s W(s1, . . . , sm). Coordinate ascent repeatedly picks one variable and re-optimizes over it while holding the others fixed:
si := arg max_{sˆi} W(s1, ..., si−1, sˆi, si+1, ..., sm). (58)
In this process, optimization focuses on one variable at a time
while holding the others constant. The simplest implementation
uses a sequential update order (s1, s2, . . . , sm), but more advanced
versions may prioritize updates based on their expected impact on
the objective function W (s).
Coordinate ascent proves to be a very effective technique when W is structured so that each single-variable maximization can be computed rapidly. In the figure depicting a coordinate ascent run, the curves are the level sets of the quadratic function being optimized, and the trajectory toward the global maximum, starting from an initial point, proceeds parallel to one of the axes at each step, since only one variable is altered at a time.
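Coordinate ascent is easy to demonstrate on a small concave quadratic. The function W below is an arbitrary example (not from the paper); each coordinate update is solved in closed form by setting the corresponding partial derivative to zero:

```python
# Maximize W(s1, s2) = -s1^2 - s2^2 + s1*s2 + 3*s1 by coordinate ascent.
# Each step maximizes W over one coordinate, holding the other fixed;
# the true maximizer is (s1, s2) = (2, 1).

def coordinate_ascent(iters=50):
    s1, s2 = 0.0, 0.0
    for _ in range(iters):
        s1 = (s2 + 3.0) / 2.0  # argmax over s1: dW/ds1 = -2 s1 + s2 + 3 = 0
        s2 = s1 / 2.0          # argmax over s2: dW/ds2 = -2 s2 + s1 = 0
    return s1, s2

s1, s2 = coordinate_ascent()
print(s1, s2)  # converges to (2.0, 1.0)
```

Each update moves parallel to one axis, exactly as in the figure described above.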
16 The SMO Methodology
Because the constraint (53) ties the si together, a single multiplier cannot be changed on its own. SMO therefore repeats the following two steps until convergence:
1. Select a pair si and sj to update next, using a heuristic that picks the pair expected to make the most progress toward the global maximum.
2. Optimize W(s) with respect to si and sj, while holding all other sk constant (for k ≠ i, j).
Since all other sk values are held constant, and the linear constraint (53) expresses s1 in terms of s2, W reduces to a quadratic function of s2 alone:
a·s2² + b·s2 + c. (68)
Setting the derivative to zero yields the optimal s2 when the box constraints are ignored. If the resulting value falls outside the interval [R, U], it is clipped as follows:
s2_new = U,            if s2_unclipped > U
s2_new = s2_unclipped, if R ≤ s2_unclipped ≤ U
s2_new = R,            if s2_unclipped < R (69)
After s2_new has been set, the linear constraint (53) is used to determine s1_new.
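The clipping rule of equation (69) is straightforward in code (R and U denote the box endpoints for s2; the sample values are illustrative):

```python
def clip(s_unclipped, R, U):
    """Clip the unconstrained optimum of s2 back into the box [R, U]."""
    if s_unclipped > U:
        return U
    if s_unclipped < R:
        return R
    return s_unclipped

print(clip(1.7, R=0.0, U=1.0))   # above the box -> 1.0
print(clip(-0.3, R=0.0, U=1.0))  # below the box -> 0.0
print(clip(0.6, R=0.0, U=1.0))   # inside -> unchanged
```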
Additional implementation details, such as methods for selecting si, sj, and
altering the bias term b, are covered in Platt’s article.
Conclusion
Support Vector Machines (SVMs) are among the most effective supervised learning algorithms, particularly helpful for classification in high-dimensional spaces. Their central goal is to identify the hyperplane that maximally separates the classes and therefore generalizes best to new data.
SVMs introduce the ideas of functional and geometric margins and use convex optimization to find the optimal margin classifier. Through Lagrange duality, the kernel trick employs kernel functions to operate efficiently in high-dimensional feature spaces without explicitly computing the feature transformation.
To handle non-linearly separable data, SVMs use slack variables and a regularization parameter that balances margin size against misclassification.
The dual problem of the SVM can be solved efficiently with the Sequential Minimal Optimization (SMO) algorithm.
References
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
• Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research, Technical Report MSR-TR-98-14.