SUPPORT VECTOR MACHINE

TECHNICAL COMMUNICATION (HS-541)

DATE: 30 April 2025

SUBMITTED BY
• Devanshu Rai (V24002)
• Khushi (V24004)
• Arun (V24007)
• KM Sony (V24013)
• Seema Alaria (V24022)
• Yash Kumar (V24024)
• Neha Hidko (V24028)

SUBMITTED TO
Dr. ARUNA BOMMAREDDI
Acknowledgment

We would like to sincerely thank the authors of the paper "Support Vector Machines" for their thorough and academic explanation of the basic ideas behind SVMs. Additionally, we would like to sincerely thank Dr. Aruna Bommareddi, our supervisor.

The research described in this report was critically based on the document's thorough treatment of fundamental subjects, such as margin theory, optimization using Lagrange duality, kernel method development, and the Sequential Minimal Optimization (SMO) algorithm.

Our comprehension and use of Support Vector Machines were substantially improved by the methodical presentation of mathematical formulations coupled with understandable, intuitive explanations. Our work's theoretical and practical components were enhanced by the thoroughness and accuracy with which difficult ideas were handled.

We owe a debt of gratitude to the authors for their invaluable contribution, which was essential to the accomplishment of this study.

Preface

A thorough and organized investigation of Support Vector Machines (SVMs), a crucial technique in supervised machine learning, is provided in this report. The content is arranged according to major themes, each of which contributes to a better comprehension of the SVM framework:

• Overview of Support Vector Machines: The report starts by outlining the rationale for SVMs and the importance of establishing a distinct margin between classes in order to enhance classification performance.
• Understanding Margins: To establish the intuition behind how SVMs produce dependable and confident predictions, the concepts of functional and geometric margins are examined.
• Maximum Margin Optimization: The SVM optimization problem formulation is shown, explaining how to convert a complicated non-convex problem into a convex one that can be solved computationally. Lagrange duality is explained in detail, emphasizing its significance in deriving the dual form of the SVM optimization problem and enabling effective solutions.
• Kernel Methods: By introducing the kernel trick, SVMs' flexibility is significantly increased, and they can operate in high-dimensional feature spaces without explicit computation of the feature mapping.
• Sequential Minimal Optimization (SMO): The dual optimization problem for large-scale datasets is solved by the efficient SMO algorithm, which is described in detail along with how it updates pairs of variables under the constraints.

Each section of this report is intended to provide the reader with a thorough understanding of Support Vector Machines by offering both theoretical insights and real-world applications. This work is intended to be a useful tool for practitioners, researchers, and students who want to learn more about machine learning.
Table of Contents

Abstract
1. Introduction
2. Understanding Margin
3. Functional and Geometric Margin
4. The Most Effective Margin Classifier
5. Adding a Scaling Limitation
6. Top Margin Classifiers
7. Reiterating the Issue in Two Ways
8. Making Predictions with the Dual Formulation
9. Kernels
10. Efficiency of Kernel Computation
11. Polynomial Kernels
12. Kernel Similarity
13. Applications of Kernel Methods
14. Regularization and Non-Separable Case
15. Coordinate Ascent
16. The SMO Methodology
17. Conclusion
18. References

Abstract
Support Vector Machines (SVMs), a powerful class of supervised learning algorithms commonly used for regression and classification tasks, are thoroughly examined in this paper. The basic concept of margins and their impact on classification confidence is introduced at the beginning of the study.

The mechanics of constructing an optimal margin classifier using Lagrange duality, which is crucial for formulating and solving SVM optimization problems, are then examined. In order to make SVMs more suitable for complex, non-linearly separable data, the study also examines the significance of kernel functions in enabling SVMs to operate in high-dimensional or infinite-dimensional feature spaces.

Additionally, SVMs can be trained very effectively using the Sequential Minimal Optimization (SMO) technique, which reduces computational complexity and improves scalability.

Through theoretical explanations and instructive examples, this paper aims to provide a clear, structured understanding of SVM methodology, its mathematical foundations, and its practical implementation.

Support Vector Machines

1 Introduction
This paper presents the Support Vector Machine (SVM) learning algorithm. SVMs are widely regarded as one of the most powerful supervised learning algorithms currently in use. Investigating the concept of margins and the significance of producing a discernible separation between data points is necessary before comprehending the fundamentals of SVMs. The optimal margin classifier will then be discussed, leading to Lagrange duality, a mathematical tool essential to solving SVM optimization problems. Since kernels enable SVMs to operate efficiently in high-dimensional or even infinite-dimensional feature spaces, their function will also be examined. Finally, the discussion will conclude with the Sequential Minimal Optimization (SMO) approach, which provides a computationally efficient SVM implementation.

2 Understanding Margin
Since they provide insight into prediction confidence, the concept of margins initiates the discussion of SVMs. Prior to their formal definition in a subsequent section, this section aims to establish an intuitive understanding of margins.

Consider logistic regression, which uses h(x) = g(kᵀx) to model the probability of class y = 1 given input x and parameter vector k. Class "1" is predicted when h(x) ≥ 0.5, which is equivalent to kᵀx ≥ 0. In a training example where y = 1, a larger value of kᵀx corresponds to a higher likelihood of classification into this class, increasing the prediction's confidence. In contrast, the model is fairly certain that y = 0 if kᵀx is substantially less than zero.
Given a training dataset, an ideal model would ensure that the value of kᵀx is significantly greater than zero for all positive cases (y = 1) and remains significantly lower than zero for negative cases (y = 0). Achieving this distinction would demonstrate a high level of confidence in each classification. Functional margins will later formalize this idea. To further elucidate the concept, consider a scenario in which "x" marks positive training samples and "o" marks negative ones. The formula kᵀx = 0 describes the decision boundary that separates the two classes. One can better understand how distance from the decision boundary influences prediction confidence by looking at three specific points, A, B, and C.

The decision boundary is far from Point A. The model can be
pretty sure that y = 1 if the objective is to predict y for A. Point
C, however, is fairly close to the border. Despite being on the side
of the hyperplane that predicts y = 1, a slight alteration to the
boundary could easily yield a different classification. As a result,
the forecasting confidence level at C for y = 1 is extremely low.
Since Point B is situated in the middle of these two possibilities,
it lends credence to the idea that data points that are further
away from the decision boundary are associated with higher
prediction confidence.
By maximizing the distance between the boundary and the data points, a classification model should ideally identify a decision boundary that not only accurately partitions the training data but also ensures that predictions are generated with high confidence. This concept will later be formally defined by geometric margins.

Understanding Notations
Before delving deeper into Support Vector Machines (SVMs), a well-designed notation system must be presented to facilitate the discussion of classification. This paper examines a linear classifier designed for a binary classification problem in which the data consists of features x with corresponding labels y. In this framework, the class labels will be represented as y ∈ {−1, 1} instead of the standard {0, 1}. The linear classifier will be specified using the parameters w and b rather than the parameter vector θ, and is expressed as h(x) = g(wᵀx + b), where

g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.

This notation clearly separates the bias term b from the weight vector w, in contrast to the previous convention of appending an additional feature x₀ = 1 to the input vector. In this formulation, w represents the remaining parameters, previously written as [k₁, ..., kₙ]ᵀ, and b represents what was previously known as k₀.

In contrast to logistic regression, which estimates the probability of y = 1 before making a decision, this classifier outputs a class label in {−1, 1} directly, without an intermediate probability estimate.

3 Functional and Geometric Margin


Understanding classification confidence can be improved by defining both geometric and functional margins. Given a training example (x(i), y(i)), the functional margin of the classifier parameters (p, b) is expressed as follows:

t(i) = y(i)(pᵀx(i) + b). (1)

The term pᵀx(i) + b should be a large positive value for a correctly and confidently classified positive example (y(i) = 1). On the other hand, for a negative example (y(i) = −1), a large negative value is preferred. Furthermore, if y(i)(pᵀx(i) + b) > 0, the classification for that particular case is correct. A greater functional margin therefore indicates more confidence in the classification.

Using functional margins as a confidence gauge, however, has its drawbacks. Since the function g depends only on the sign of pᵀx + b and not on its magnitude, the functional margin can be raised arbitrarily without influencing the classification result. For example, if both p and b are scaled by a factor of two, the functional margin is doubled but the classifier stays the same. This shows that without a normalization restriction, the functional margin alone does not serve as a valid measure of classification confidence. Requiring the weight vector to satisfy ||p||₂ = 1 is one such normalization. This idea will be revisited later.
Given a dataset S = {(x(i), y(i))}_{i=1}^{m}, the classifier's functional margin relative to the entire dataset is the smallest functional margin over all training examples:

t̂ = min_{i=1,...,m} t(i). (2)
This ensures that throughout all training samples, the classifier
maintains a constant level of confidence.
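The functional margins in equations (1) and (2) are direct to compute. A minimal numpy sketch on a made-up toy set (the data and parameters are illustrative, not from the report):

```python
import numpy as np

# Toy 2-D dataset with labels y in {-1, +1} (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Candidate classifier parameters (p, b)
p = np.array([1.0, 1.0])
b = 0.0

# Per-example functional margins t(i) = y(i)(p . x(i) + b), equation (1)
t = y * (X @ p + b)

# Functional margin of the whole dataset: the minimum, equation (2)
t_hat = t.min()

# Scaling (p, b) by 2 doubles every functional margin
# without changing the classifier's decisions
t_scaled = y * (X @ (2 * p) + 2 * b)
```

The last line illustrates the drawback discussed above: the functional margin can be inflated arbitrarily by rescaling (p, b).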

Geometric Margins
The decision boundary denoted by (p, b) is accompanied by the weight vector p, which is always orthogonal to the hyperplane separating the two classes. The concept of geometric margins will be illustrated with the aid of a training example at point A with label y(i) = 1. The shortest distance between this point and the decision boundary is represented by the segment AB, whose length is the geometric margin t(i).

To determine the value of t(i), note that the normalized vector p/||p|| has unit length and points in the same direction as p. Point B's coordinates can therefore be computed as:
x(i) − t(i) · p/||p||. (3)
Because B is exactly on the decision boundary, every point on this
hyperplane satisfies the equation:
pᵀ(x(i) − t(i) · p/||p||) + b = 0. (4)
Solving for t(i) yields:

t(i) = pᵀx(i)/||p|| + b/||p|| = (p/||p||)ᵀ x(i) + b/||p||. (5)

This derivation was carried out for a positive training example, although the definition applies in all cases. For a training example (x(i), y(i)), the classifier's geometric margin is given by:

t(i) = y(i)((p/||p||)ᵀ x(i) + b/||p||). (6)
One noteworthy feature is the invariance of the geometric margin under rescaling of p and b. If both p and b are scaled by the same amount, the geometric margin remains constant, in contrast to the functional margin. Because it allows one to impose an arbitrary scaling constraint on p without altering the classification behavior, this feature is useful later in optimization. For instance, it is possible to apply restrictions such as ||p|| = 1 or |p₁| = 5 without altering the classification properties.
For a dataset S = {(x(i), y(i))}_{i=1}^{m}, the classifier's geometric margin relative to the entire dataset is defined as:

t = min_{i=1,...,m} t(i). (7)
This is the worst-case geometric margin, so the separation it measures holds for every training sample.
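Equations (6) and (7), along with the scale invariance just noted, can be checked numerically; a small sketch with illustrative toy data:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
p = np.array([1.0, 1.0])
b = 0.0

def geometric_margin(p, b, X, y):
    # Dataset geometric margin, equations (6)-(7):
    # t = min_i y(i)((p/||p||) . x(i) + b/||p||)
    norm = np.linalg.norm(p)
    return (y * (X @ p + b) / norm).min()

g1 = geometric_margin(p, b, X, y)
g2 = geometric_margin(10 * p, 10 * b, X, y)  # rescaling (p, b) leaves it unchanged
```

Unlike the functional margin, rescaling (p, b) has no effect: g1 and g2 are equal.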

4 The Most Effective Margin Classifier


Finding a decision boundary that maximizes the geometric margin on a given training dataset is a natural objective based on the earlier discussion. When a classifier achieves this, it indicates a strong fit to the training data and provides a high level of confidence in its predictions. For such a classifier, a significant gap ensures a distinct separation between positive and negative training instances.
For the purposes of this discussion, assume that the given training set is linearly separable, i.e., that the two classes can be exactly separated by at least one hyperplane. Finding the hyperplane with the largest geometric margin is our goal. This leads to the following optimization formulation:

max_{p,b} t (8)

subject to:

y(i)(pᵀx(i) + b) ≥ t, ∀i = 1, . . . , m (9)

||p|| = 1. (10)

In order to ensure that each training example has a functional margin of at least t, the goal here is to maximize t. All geometric margins are guaranteed to be at least t since the condition ||p|| = 1 ensures that the functional margin

is equal to the geometric margin. The best parameters (p, b) will thus be obtained by solving this optimization problem, providing the maximum achievable geometric margin for the given dataset.

While the desired classifier would ideally be obtained by solving the aforementioned optimization problem, the constraint ||p|| = 1 poses a significant challenge. Because this constraint is non-convex, the problem cannot easily be solved using standard optimization techniques. To move forward, we consider an alternative formulation:

max_{p,b} t̂/||p|| (11)

subject to:

y(i)(pᵀx(i) + b) ≥ t̂, ∀i = 1, . . . , m. (12)

With this formulation, we aim to maximize t̂/||p||, provided that all functional margins are at least t̂. Because t = t̂/||p|| connects the geometric and functional margins, maximizing this expression yields the desired result. The problematic ||p|| = 1 constraint is also removed by this reformulation.
However, the new formulation introduces a new challenge: the objective t̂/||p|| is still non-convex, making it difficult to solve with conventional optimization techniques. To address this, we use the fact that p and b can be scaled arbitrarily without altering the classification boundary. This observation enables us to impose a scaling restriction, which transforms the problem into a more manageable form.

5 Adding a Scaling Limitation


We include a constraint that sets the functional margin of (p, b) with respect to the training set to exactly 1:

t̂ = 1. (13)

This restriction is legitimate because multiplying p and b by a constant scales the functional margin proportionally, so appropriate parameter rescaling can always satisfy it. Substituting this constraint into the previous formulation, and noting that maximizing t̂/||p|| = 1/||p|| is equivalent to minimizing ||p||², we obtain the following optimization problem:

min_{p,b} (1/2)||p||² (14)

subject to:

y(i)(pᵀx(i) + b) ≥ 1, ∀i = 1, . . . , m. (15)


This restated problem is convex and can now be solved efficiently. In particular, the constraints are linear and the objective function is quadratic, allowing the optimal margin classifier to be found using standard quadratic programming (QP) techniques.
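As a sketch of how problem (14)-(15) can be handed to a general-purpose solver (assuming SciPy is available; the solver choice and toy data are ours, not the report's):

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables z = (p_1, p_2, b)
def objective(z):
    p = z[:2]
    return 0.5 * p @ p  # (1/2)||p||^2, equation (14)

# One inequality per training example: y(i)(p . x(i) + b) - 1 >= 0, equation (15)
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
p_opt, b_opt = res.x[:2], res.x[2]

# At the solution, every training example has functional margin >= 1
margins = y * (X @ p_opt + b_opt)
```

A dedicated QP solver would exploit the quadratic structure directly; SLSQP is used here only because it is widely available.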
We take a slight detour to explore Lagrange duality, which will yield the dual form of our optimization problem, even though the QP approach already provides a workable solution. The dual formulation is crucial because it enables the use of kernel methods, which allow support vector machines (SVMs) to operate efficiently in high-dimensional feature spaces. Additionally, a fast optimization algorithm that significantly outperforms conventional QP solvers can be derived from the dual formulation. We will take a closer look at this alternative approach.

Lagrange Duality


To better understand the mathematical foundation of support vector machines (SVMs) and maximum margin classifiers, we first explore the general problem of constrained optimization and its solution via Lagrange duality.

Consider the following type of optimization problem:

min_p f(p) (16)
subject to the equality constraints:

h_i(p) = 0, ∀i = 1, . . . , l. (17)

One common technique for solving such problems is the Lagrange multiplier method. This method introduces the Lagrangian function, defined as:

L(p, m) = f(p) + Σ_{i=1}^{l} m_i h_i(p). (18)
The parameters m_i here are the Lagrange multipliers. The problem can be solved by computing the partial derivatives of L and setting them to zero:

∂L/∂p_i = 0, ∂L/∂m_i = 0. (19)
The optimal values of p and m are obtained by solving this set of equations.
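As a worked illustration of equations (16)-(19) (a standard textbook example, not one from the report): minimize f(p) = p₁² + p₂² subject to h(p) = p₁ + p₂ − 1 = 0. Setting the partial derivatives of the Lagrangian to zero gives a linear system:

```python
import numpy as np

# Stationarity of L(p, m) = p1^2 + p2^2 + m (p1 + p2 - 1):
#   dL/dp1 = 2 p1 + m     = 0
#   dL/dp2 = 2 p2 + m     = 0
#   dL/dm  = p1 + p2 - 1  = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

p1, p2, m = np.linalg.solve(A, rhs)  # optimum at p1 = p2 = 0.5, m = -1
```

The solution p₁ = p₂ = 0.5 is the closest point to the origin on the line p₁ + p₂ = 1, as geometric intuition suggests.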

Adding Inequality Constraints


Constraints involving both equality and inequality conditions are present in many optimization problems. To handle such situations, we extend the Lagrange formulation to the following primal optimization problem:

min_p f(p) (20)

subject to:

gi(p) ≤ 0, i = 1, . . . , k, hi(p) = 0, i = 1, . . . , l. (21)

To solve this, we define the generalized Lagrangian function:

L(p, s, m) = f(p) + Σ_{i=1}^{k} s_i g_i(p) + Σ_{i=1}^{l} m_i h_i(p). (22)

The parameters s_i (associated with the inequality constraints) and m_i (associated with the equality constraints) are the Lagrange multipliers here. The primal function is now defined as:

k_P(p) = max_{s,m : s_i ≥ 0} L(p, s, m). (23)
The primal formulation is indicated by the subscript “P.”
If the value of p does not satisfy the constraints, i.e., if g_i(p) > 0 or h_i(p) ≠ 0 for some i, then k_P(p) = ∞, since the inner maximization can drive the Lagrangian to infinity. On the other hand, if the constraints are satisfied, we obtain k_P(p) = f(p). Thus, the primal problem can be stated as follows:

min_p k_P(p) = min_p max_{s,m : s_i ≥ 0} L(p, s, m). (24)

The optimal objective value for the primal formulation, p*, is given by:

p* = min_p k_P(p). (25)

The Dual Formulation

Now, we define the dual function as:

k_D(s, m) = min_p L(p, s, m), (26)

where the subscript "D" indicates the dual formulation. Here, we first minimize over p, in contrast to the primal function k_P, where we maximized over s, m. The resulting dual optimization problem is:

max_{s,m : s_i ≥ 0} k_D(s, m) = max_{s,m : s_i ≥ 0} min_p L(p, s, m). (27)
17
This mirrors the primal problem except that the order of minimization and maximization is reversed. The optimal value for the dual problem is given by:

d* = max_{s,m : s_i ≥ 0} k_D(s, m). (28)

The Connection Between Dual and Primal Problems
The optimal value of the dual problem is always less than or equal to the optimal value of the primal problem, by a fundamental result in optimization:

d* = max_{s,m : s_i ≥ 0} min_p L(p, s, m) ≤ min_p max_{s,m : s_i ≥ 0} L(p, s, m) = p*. (29)

This inequality follows from the general fact that a max-min is always less than or equal to the corresponding min-max. In some cases, however, we obtain strong duality, meaning that:

d* = p*

Conditions of Strong Duality


Strong duality holds under specific assumptions, such as:

• The objective function f(p) and the inequality constraints g_i(p) are convex.
• The equality constraints h_i(p) are affine (linear) functions.
• There is at least one p such that g_i(p) < 0 for all i, i.e., the constraints are strictly feasible.

Under these conditions, there exist optimal values p*, s*, m* such that:

p* = d* = L(p*, s*, m*). (30)
Additionally, these optimal values satisfy the following Karush-Kuhn-Tucker (KKT) conditions:

1. Stationarity:

∂L/∂p_i (p*, s*, m*) = 0, i = 1, . . . , n. (31)
∂L/∂m_i (p*, s*, m*) = 0, i = 1, . . . , l. (32)

2. Complementary Slackness:

s_i* g_i(p*) = 0, i = 1, . . . , k. (33)

3. Primal Feasibility:

g_i(p*) ≤ 0, i = 1, . . . , k. (34)

4. Dual Feasibility:

s_i* ≥ 0, i = 1, . . . , k. (35)

Equation (33) is particularly important because it implies that if s_i* > 0, then g_i(p*) = 0 must hold with equality. Support vector machines rely heavily on this fact, since it demonstrates that only a subset of the training data, the support vectors, shapes the decision boundary.

Additionally, the KKT conditions provide a basis for determining convergence in algorithms such as Sequential Minimal Optimization (SMO).

6 Top Margin Classifiers


Finding the optimal margin classifier can be framed as a constrained optimization problem. The primal form of this optimization problem is as follows:

min_{p,b} (1/2)||p||² (36)

subject to the constraint:

y_i(p · x_i + b) ≥ 1, ∀i (37)
Each training sample has a corresponding constraint. According
to the Karush-Kuhn-Tucker (KKT) dual complementarity condition,
the Lagrange multipliers are only strictly positive in situations
where the functional margin is precisely equal to one, meaning
that the constraints are satisfied as equalities.
In the figure illustrating the concept, the maximum margin hyperplane is indicated by a solid line. The data points on the dashed lines parallel to the hyperplane are the ones closest to the decision boundary and have the smallest margins. These specific instances, two positive and one negative, are the support vectors. Only these points have nonzero Lagrange multipliers at the optimal solution. The support vectors are typically a much smaller set than the entire training set, which will be helpful in later developments.

7 Reiterating the Issue in Two Ways


One important insight in obtaining the dual form is that the optimization problem can be expressed entirely in terms of inner products of the input feature vectors. This formulation enables the kernel trick, a crucial component of more powerful classification techniques.

Building the Lagrangian function for the primal problem yields:
L(p, b, s) = (1/2)||p||² − Σ_i s_i [y_i(p · x_i + b) − 1] (38)

Since the problem involves only inequality constraints, only the multipliers s_i appear; there are no equality-constraint multipliers. To obtain the dual problem, we minimize with respect to p and b, holding s fixed, by taking the partial derivatives and setting them to zero:

∂L/∂p = p − Σ_i s_i y_i x_i = 0 ⇒ p = Σ_i s_i y_i x_i (39)

Similarly, differentiating with respect to b leads to:

Σ_i s_i y_i = 0 (40)
Substituting these results back into the Lagrangian simplifies the expression to:

L(s) = Σ_i s_i − (1/2) Σ_{i,j} s_i s_j y_i y_j (x_i · x_j) (41)

With these constraints in place, the dual optimization problem becomes:

max_s Σ_i s_i − (1/2) Σ_{i,j} s_i s_j y_i y_j (x_i · x_j) (42)

subject to:

Σ_i s_i y_i = 0, s_i ≥ 0, ∀i (43)
i
Since the strong duality theorem and KKT conditions are met,
solving the dual formulation is the same as solving the primal
problem. The optimal weight vector can be reconstructed from the s values as:

p* = Σ_i s_i* y_i x_i (44)
Additionally, the optimal intercept term is given by:

b* = y_s − p* · x_s, for any support vector x_s (45)

8 Making Predictions with the Dual Formulation
Predictions for new input data points are made using:

f(x) = sign( Σ_i s_i* y_i (x_i · x) + b* ) (46)

Since only the support vectors have nonzero s_i*, the summation is actually computed over these points alone, allowing for computational efficiency.
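To make equations (44)-(46) concrete, consider a two-point toy problem (ours, not the report's) small enough to solve the dual by hand: x₁ = (1, 0) with y₁ = 1 and x₂ = (−1, 0) with y₂ = −1. The constraint Σ_i s_i y_i = 0 forces s₁ = s₂ = s, the dual objective (42) reduces to 2s − 2s², and its maximum is at s = 1/2:

```python
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
s = np.array([0.5, 0.5])       # dual solution found by hand above

# Equation (44): p* = sum_i s_i* y_i x_i
p_star = (s * y) @ X           # both points are support vectors here

# Equation (45): b* = y_s - p* . x_s for any support vector x_s
b_star = y[0] - p_star @ X[0]

# Equation (46): f(x) = sign(sum_i s_i* y_i (x_i . x) + b*)
def predict(x):
    return np.sign((s * y * (X @ x)).sum() + b_star)

pred = predict(np.array([0.7, 3.0]))
```

As expected by symmetry, the recovered hyperplane is p* = (1, 0), b* = 0, i.e. the vertical axis, and the query point with positive first coordinate is classified as +1.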

9 Kernels
In our earlier discussion of linear regression, we encountered a scenario in which the input variable represented a home's living area, and we considered using features like x² and x³ to obtain a cubic function and enhance the regression model. To distinguish between the two sets of variables, we call the "original" input variable the input attributes of the problem, in this case the living area. When these attributes are transformed into a new set of variables used in the learning process, we refer to the new variables as input features.

We use ϕ(x) to denote the feature mapping that transforms input attributes into input features. In the example above:

ϕ(x) = (x, x², x³)

Instead of applying Support Vector Machines (SVMs) directly to the original input attributes, we can run them on the transformed features, replacing x with ϕ(x) wherever it occurs in the algorithm. Since the algorithm is expressed entirely in terms of inner products, we can replace those inner products with K(x, x′). The kernel function is thus defined as:

K(x, x′) = ⟨ ϕ(x), ϕ(x′)⟩

10 Efficiency of Kernel Computation


Given this definition, one could compute K(x, x′) by first evaluating ϕ(x) and ϕ(x′) and then computing their inner product. Although this can be computationally expensive when ϕ(x) is high-dimensional, K(x, x′) itself can often be computed cheaply, which is exactly when the kernel view is most useful.

Consider the kernel function:

K(x, x′) = (xᵀx′ + 1)²

which expands to:

K(x, x′) = 1 + 2xᵀx′ + (xᵀx′)²

For a scalar input x, this corresponds to the feature mapping:

ϕ(x) = (1, √2·x, x²)

This demonstrates that computing K(x, x′) directly requires only O(d) time in the input dimension d, even though the corresponding feature space is much larger.

11 Polynomial Kernels
One commonly used polynomial kernel function is:

K(x, x′) = (xT x′ + c)d

which corresponds to a feature mapping whose entries are the (suitably scaled) monomials of degree up to d in the input coordinates, such as x₁^d, x₁^{d−1}x₂, . . . , x_d^d. The kernel function thus gives access to a very high-dimensional feature space while remaining cheap to evaluate.

12 Kernel Similarity
If ϕ(x) and ϕ(x′) are close together in feature space, then K(x, x′) will be large, whereas if they are nearly orthogonal, K(x, x′) will be small. A popular kernel for measuring similarity is the Gaussian kernel:

K(x, x′) = exp( −||x − x′||² / (2σ²) )

This function provides a measure of similarity that approaches 1 when x and x′ are close, and 0 when they are far apart.
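A quick numerical check of this behavior (the bandwidth σ below is an illustrative choice of ours):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
near = gaussian_kernel(x, x + 1e-3)    # nearly identical points -> close to 1
far = gaussian_kernel(x, x + 100.0)    # distant points -> close to 0
```

The bandwidth σ controls how quickly the similarity decays with distance; it is a hyperparameter chosen by the practitioner.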

13 Applications of Kernel Methods


Kernels have many applications beyond SVMs. For example:

• SVMs with polynomial or Gaussian kernels perform well in digit recognition.
• In bioinformatics, kernel methods classify strings by counting substring occurrences.

The kernel trick allows algorithms relying on inner products to operate in high-dimensional spaces efficiently, making kernel methods a cornerstone of modern machine learning.

14 Regularization and Non-Separable Case


The Support Vector Machine (SVM) has been developed thus far under the assumption that the given data is linearly separable. Separability is not always guaranteed, even though mapping the data to a higher-dimensional feature space via the function ϕ typically increases the likelihood of achieving it. Furthermore, because it can be highly sensitive to outliers, finding a fully separating hyperplane might not always be the best option. An ideal margin classifier is depicted in the left image below; however, when an outlier is added in the upper-left region (right figure), the decision boundary shifts drastically and a classifier with a significantly smaller margin results.
In order to fit non-linearly separable datasets and lessen sensitivity to outliers, we modify the optimization problem by adding ℓ₁ regularization as follows:

min_{p,b,ξ} (1/2)||p||² + C Σ_{i=1}^{m} ξ_i (47)
28
with:

y(i)(pᵀx(i) + b) ≥ 1 − ξ_i, i = 1, . . . , m (48)
ξ_i ≥ 0, i = 1, . . . , m. (49)

With this method, some data points are allowed a functional margin smaller than 1. If an example has functional margin 1 − ξ_i (with ξ_i > 0), the objective function pays a penalty of Cξ_i. The parameter C controls the trade-off between minimizing ||p||², which maximizes the margin, and ensuring that most examples maintain a functional margin of at least 1.
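The objective (47) is easy to evaluate once the slacks are set to their smallest feasible values, ξ_i = max(0, 1 − y(i)(pᵀx(i) + b)), which is the least slack satisfying constraints (48)-(49). A sketch with illustrative data containing one outlier:

```python
import numpy as np

def soft_margin_objective(p, b, X, y, C):
    # Smallest slacks satisfying constraints (48)-(49)
    xi = np.maximum(0.0, 1.0 - y * (X @ p + b))
    # Objective (47): (1/2)||p||^2 + C sum_i xi_i
    return 0.5 * p @ p + C * xi.sum(), xi

# Toy data whose last point is a mislabeled outlier a hard margin could not absorb
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
p, b = np.array([1.0, 1.0]), 0.0

obj_small_C, xi = soft_margin_objective(p, b, X, y, C=0.1)
obj_large_C, _ = soft_margin_objective(p, b, X, y, C=10.0)
```

Only the outlier receives a nonzero slack, and raising C makes that single violation dominate the objective, which is exactly the margin-versus-violation trade-off described above.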

Construction of Lagrangians
As before, we construct the Lagrangian function:

L(p, b, ξ, s, r) = (1/2)pᵀp + C Σ_{i=1}^{m} ξ_i − Σ_{i=1}^{m} s_i [y(i)(pᵀx(i) + b) − 1 + ξ_i] − Σ_{i=1}^{m} r_i ξ_i. (50)

Here, s_i and r_i are Lagrange multipliers, constrained to be non-negative (s_i, r_i ≥ 0).
The dual formulation follows a derivation similar to the linearly separable case: we set the derivatives with respect to p and b to zero, substitute these values back into the Lagrangian, and simplify:

max_s W(s) = Σ_{i=1}^{m} s_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y(i) y(j) s_i s_j K(x(i), x(j)) (51)

subject to:

0 ≤ s_i ≤ C, i = 1, . . . , m (52)

Σ_{i=1}^{m} s_i y(i) = 0. (53)
i=1
As previously mentioned, equation (9) allows the weight vector p to be expressed in terms of the si. Once the dual optimization problem has been solved, equation (13) can still be used for prediction.
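As an illustration, the dual objective (51) can be evaluated directly once a kernel matrix is available; the sketch below uses a plain linear kernel and toy multipliers chosen (hypothetically) to satisfy the equality constraint (53).

```python
import numpy as np

def dual_objective(s, y, K):
    """W(s) = sum_i s_i - (1/2) sum_ij y_i y_j s_i s_j K_ij, as in (51)."""
    v = s * y
    return s.sum() - 0.5 * v @ K @ v

# Toy data with a linear kernel K_ij = <x(i), x(j)>.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
K = X @ X.T
s = np.array([0.5, 0.5, 1.0])   # chosen so that sum_i s_i y(i) = 0
W = dual_objective(s, y, K)
```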
It’s interesting to note that adding ℓ1 regularization makes only one change to the dual problem: the constraint on si goes from 0 ≤ si to 0 ≤ si ≤ C. Additionally, the calculation of b∗ must be modified (equation (11) is no longer applicable); further details can be found in Platt’s article or the section that follows.

Karush-Kuhn-Tucker Conditions
The following are the KKT dual complementarity conditions, which will later help verify convergence of the Sequential Minimal Optimization (SMO) algorithm:

1. If si = 0, then y(i)(pT x(i) + b) ≥ 1.    (54)

2. If si = C, then y(i)(pT x(i) + b) ≤ 1.    (55)

3. If 0 < si < C, then y(i)(pT x(i) + b) = 1.    (56)

The next step is to solve the dual optimization problem, which is the subject of the following sections on Sequential Minimal Optimization (SMO).
Sequential Minimal Optimization (SMO)

The Sequential Minimal Optimization (SMO) technique, first presented by John Platt, provides a fast solution to the dual optimization problem that arises in the derivation of Support Vector Machines (SVMs). Before delving into the specifics of SMO, it is helpful to review the coordinate ascent method, since it clarifies the core concepts behind SMO.

15 Coordinate Ascent
Consider the problem of solving the following unconstrained optimization problem:

max W (s1, s2, ..., sm). (57)

Here, W is some function of the parameters si; for now, we ignore any connection to SVMs. Although we have previously studied optimization strategies such as gradient ascent and Newton’s method, the technique we now introduce is known as coordinate ascent:

Repeat until convergence:

1. For i = 1, . . . , m:

    si := arg max_{ŝi} W(s1, . . . , si−1, ŝi, si+1, . . . , sm).    (58)
In this process, optimization focuses on one variable at a time
while holding the others constant. The simplest implementation
uses a sequential update order (s1, s2, . . . , sm), but more advanced
versions may prioritize updates based on their expected impact on
the objective function W (s).

Coordinate ascent proves to be a very effective technique when W is structured so that the maximization in each iteration can be computed rapidly. A run of coordinate ascent is depicted in the following figure: the curves in the graph are the contours of a quadratic function being optimized, and the trajectory toward the global maximum is shown from an initial point. In particular, each step runs parallel to one of the axes, since only one variable is altered at a time.
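As a concrete illustration of update (58), consider maximizing a made-up concave quadratic W(s) = −(1/2) sT A s + bT s (unrelated to the SVM dual). Setting ∂W/∂si = 0 gives a closed-form per-coordinate update, and the iterates approach the true maximizer A⁻¹b.

```python
import numpy as np

# Maximize W(s) = -0.5 s^T A s + b^T s with A positive definite;
# the unique maximizer is s* = A^{-1} b (values here are made up).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])

s = np.zeros(2)
for _ in range(100):                  # repeat until (practical) convergence
    for i in range(len(s)):           # optimize one coordinate at a time
        # arg max over s_i alone: solve dW/ds_i = b_i - (A s)_i = 0 for s_i
        s[i] = (b[i] - A[i] @ s + A[i, i] * s[i]) / A[i, i]

s_star = np.linalg.solve(A, b)        # exact maximizer, for comparison
```

Each inner update is cheap because the one-dimensional maximization has a closed form, which is exactly the property SMO exploits.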

16 The SMO Methodology


To conclude our discussion of SVMs, we present the derivation of the SMO algorithm. Some of the more nuanced details are left for further reading, including Platt’s original paper.
The optimization problem that needs to be solved is the following dual form:

    max_s W(s) = Σ_{i=1}^m si − (1/2) Σ_{i=1}^m Σ_{j=1}^m y(i) y(j) si sj ⟨x(i), x(j)⟩    (59)

subject to:

    0 ≤ si ≤ C,   i = 1, . . . , m,    (60)

    Σ_{i=1}^m si y(i) = 0.             (61)
Suppose we have a set of si values that satisfy these constraints. Attempting to maximize W by altering only s1 while holding s2, . . . , sm fixed makes no progress, because the equality constraint forces:

    s1 y(1) = − Σ_{i=2}^m si y(i).    (62)

Multiplying both sides by y(1) gives:

    s1 = −y(1) Σ_{i=2}^m si y(i).    (63)

Since y(1) is either +1 or −1, we have (y(1))² = 1, ensuring that the remaining si values fully determine s1. As a result, s1 cannot be changed independently without violating the constraint.
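A quick numeric check of (63), with toy values chosen only for illustration: any s satisfying the equality constraint forces s1 to the value given by the remaining multipliers.

```python
import numpy as np

y = np.array([1.0, -1.0, 1.0, -1.0])
s = np.array([0.7, 0.4, 0.2, 0.5])   # satisfies sum_i s_i y(i) = 0
# Equation (63): s1 is fully determined by the other multipliers.
s1_forced = -y[0] * (s[1:] @ y[1:])
```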
The SMO approach, which maintains constraint satisfaction by
updating at least two variables at once, is motivated by this
discovery. The method works as follows:

1. Select a pair of parameters, si and sj, for updating by


employing a strategy that determines which pair is most likely
to improve W (s).

2. Optimize W (s) with respect to si and sj, while holding all other sk constant (for k ̸= i, j).

To determine whether the algorithm has converged, we check whether the Karush-Kuhn-Tucker (KKT) conditions hold within a specified tolerance tol, typically between 0.001 and 0.01. Additional implementation details, including pseudocode, can be found in Platt’s article.
SMO’s Effective Updates

SMO’s efficiency comes from the ability to update si and sj in a computationally cheap manner.
Suppose the current set of si values satisfies the constraints, and suppose we choose to optimize W (s) with respect to s1 and s2. The equality constraint gives:

    s1 y(1) + s2 y(2) = − Σ_{i=3}^m si y(i).    (64)

Since the right-hand side is fixed, we denote it by the constant ζ:

    s1 y(1) + s2 y(2) = ζ.    (65)

This constraint restricts s1 and s2 to lie on a line within the feasible region 0 ≤ si ≤ C. To ensure feasibility within the [0, C] × [0, C] box, the permitted values of s2 must also satisfy lower and upper limits L and H.
Rewriting s1 as a function of s2:
s1 = (ζ − s2y(2))y(1), (66)

The objective function W (s) can be expressed as:

W (s1, s2, . . . , sm) = W ((ζ − s2y(2))y(1), s2, . . . , sm). (67)

Since all other si values are constant, this becomes a quadratic function in s2:

    a s2² + b s2 + c.    (68)
Setting the derivative to zero yields the optimal s2 when the box limits are ignored. If the resulting value falls outside [L, H], it is changed (or ”clipped”) as follows:

    s2^new = H,             if s2^unclipped > H,
             s2^unclipped,  if L ≤ s2^unclipped ≤ H,        (69)
             L,             if s2^unclipped < L.

Once s2^new has been set, equation (20) is used to determine s1^new.
Additional implementation details, such as methods for selecting si, sj, and
altering the bias term b, are covered in Platt’s article.
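The bounds L and H and the clipping rule (69) can be sketched as follows. The bound formulas (which depend on whether y(1) = y(2)) follow Platt’s article; variable names here are illustrative.

```python
def clip_s2(s2_unclipped, s1, s2, y1, y2, C):
    """Clip the unconstrained optimizer of the 1-D quadratic onto [L, H].
    L and H keep (s1, s2) on the constraint line inside the
    [0, C] x [0, C] box (bounds as in Platt's article)."""
    if y1 != y2:        # constraint line has the form s1 - s2 = const
        L, H = max(0.0, s2 - s1), min(C, C + s2 - s1)
    else:               # constraint line has the form s1 + s2 = const
        L, H = max(0.0, s1 + s2 - C), min(C, s1 + s2)
    return min(max(s2_unclipped, L), H)

# Example: the unconstrained optimum 1.4 exceeds H = 1.0, so it is clipped.
s2_new = clip_s2(1.4, s1=0.2, s2=0.6, y1=1, y2=-1, C=1.0)
```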

Conclusion

Support Vector Machines (SVMs) are among the most effective supervised learning algorithms, particularly useful for classification in high-dimensional spaces. Their central goal is to identify the hyperplane that maximally separates the classes, which yields the strongest generalization on new data.

SVMs introduce the ideas of functional and geometric margins and use convex optimization to identify the optimal margin classifier. Through Lagrange duality, the kernel trick allows kernel functions to operate efficiently in high-dimensional feature spaces without explicitly computing the transformation.
To handle non-linearly separable data, SVMs use slack variables and a regularization parameter to balance margin size against misclassification.
The dual problem of the SVM can be solved efficiently with the help of the SMO (Sequential Minimal Optimization) algorithm.

References
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

• Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training


algorithm for optimal margin classifiers. Proceedings of the
Fifth Annual Workshop on Computational Learning Theory
(COLT).
• Cortes, C., & Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20(3), 273–297.
• Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction
to Support Vector Machines and Other Kernel-based
Learning Methods. Cambridge University Press.
• Platt, J. (1998). Sequential Minimal Optimization: A Fast
Algorithm for Training Support Vector Machines. Microsoft
Research Technical Report MSR-TR-98-14.
• Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
• Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference,
and Prediction (2nd ed.). Springer.
• Smola, A. J., & Schölkopf, B. (2004). A Tutorial on Support
Vector Regression. Statistics and Computing, 14(3), 199–
222.
• Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
