
Foundations of Machine Learning

Module 5:
Part A: Logistic Regression

Sudeshna Sarkar
IIT Kharagpur
Logistic Regression for classification
• Linear Regression: $h(x) = \sum_{i=0}^{n} \beta_i x_i = \beta^T x$
• Logistic Regression for classification:
  $h_\beta(x) = \dfrac{1}{1 + e^{-\beta^T x}} = g(\beta^T x)$
  where $g(z) = \dfrac{1}{1 + e^{-z}}$ is called the logistic function or the sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• 𝑔(𝑧) → 1 as 𝑧 → ∞
• 𝑔(𝑧) → 0 as 𝑧 → −∞

• Derivative:
  $\dfrac{d}{dz} g(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2} = \dfrac{1}{1 + e^{-z}} \left( 1 - \dfrac{1}{1 + e^{-z}} \right) = g(z)\,\left(1 - g(z)\right)$
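The identity $g'(z) = g(z)(1 - g(z))$ can be checked numerically; the snippet below is an illustrative sketch (not from the slides) that compares the analytic derivative with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.allclose(numeric, sigmoid_grad(z), atol=1e-6))      # True
```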
Logistic Regression
• In logistic regression, we learn the conditional distribution $P(y \mid x)$.
• Let $p_y(x; \beta)$ be our estimate of $P(y \mid x)$, where $\beta$ is a vector of adjustable parameters.
• Assume there are two classes, $y = 0$ and $y = 1$, and
  $P(y = 1 \mid x) = h_\beta(x)$
  $P(y = 0 \mid x) = 1 - h_\beta(x)$
• This can be written more compactly as
  $P(y \mid x) = h(x)^y \left(1 - h(x)\right)^{1 - y}$
• We can use a gradient method to fit $\beta$ by maximum likelihood.
Maximize likelihood
$L(\beta) = p(\vec{y} \mid X; \beta) = \prod_{i=1}^{m} p(y_i \mid x_i; \beta) = \prod_{i=1}^{m} h(x_i)^{y_i} \left(1 - h(x_i)\right)^{1 - y_i}$

$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i) \log\left(1 - h(x_i)\right)$

• How do we maximize the likelihood? Gradient ascent
– Updates: 𝛽 = 𝛽 + 𝛼𝛻𝛽 𝑙(𝛽)
Assume one training example (x,y), and take derivatives to derive the
stochastic gradient ascent rule.
$\dfrac{\partial}{\partial \beta_j} \ell(\beta) = \left( \dfrac{y}{g(\beta^T x)} - \dfrac{1 - y}{1 - g(\beta^T x)} \right) \dfrac{\partial}{\partial \beta_j} g(\beta^T x)$

$= \left( \dfrac{y}{g(\beta^T x)} - \dfrac{1 - y}{1 - g(\beta^T x)} \right) g(\beta^T x)\left(1 - g(\beta^T x)\right) \dfrac{\partial}{\partial \beta_j} \beta^T x$

$= \left( y\left(1 - g(\beta^T x)\right) - (1 - y)\, g(\beta^T x) \right) x_j$

$= \left( y - h_\beta(x) \right) x_j$

This gives the update $\beta := \beta + \alpha \nabla_\beta \ell(\beta)$, i.e., component-wise over training examples,
$\beta_j := \beta_j + \alpha \left( y_i - h_\beta(x_i) \right) x_{ij}$
where $x_{ij}$ is the $j$-th component of $x_i$.
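A minimal sketch of this stochastic gradient ascent rule in NumPy (illustrative only; the learning rate, epoch count, and synthetic data are assumptions, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, epochs=100):
    """Stochastic gradient ascent on the log-likelihood l(beta)."""
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])          # prepend x_0 = 1 for the intercept
    beta = np.zeros(n + 1)
    for _ in range(epochs):
        for i in range(m):
            h = sigmoid(X[i] @ beta)
            beta += alpha * (y[i] - h) * X[i]    # beta_j += alpha (y_i - h_beta(x_i)) x_ij
    return beta

# toy usage on a synthetic linearly separable set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
beta = fit_logistic_sgd(X, y)
preds = (sigmoid(np.hstack([np.ones((100, 1)), X]) @ beta) >= 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```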
Foundations of Machine Learning
Module 5:
Part B: Introduction to Support
Vector Machine
Sudeshna Sarkar
IIT Kharagpur

Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.

Logistic Regression and Confidence
• Logistic Regression:
  $p(y = 1 \mid x) = h_\beta(x) = g(\beta^T x)$
• Predict 1 on an input $x$ iff $h_\beta(x) \ge 0.5$, equivalently, $\beta^T x \ge 0$.
• The larger the value of $h_\beta(x)$, the larger the probability and the higher the confidence.
• Similarly, we make a confident prediction of $y = 0$ if $\beta^T x \ll 0$.
• We are more confident of predictions for points (instances) located far from the decision surface.
Preventing overfitting with many features
• Suppose we have a large set of features. What is the best separating line to use?
• Bayesian answer:
  – Use all of them.
  – Weight each line by its posterior probability.
• Can we approximate the correct answer efficiently?
Support Vectors
• The line that maximizes the
minimum margin.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– called “support vectors”.
– we use the support vectors to
decide which side of the
separator a test case is on. The support vectors are
indicated by the circles
around them.

5
Functional Margin
• Functional margin of a point $(x_i, y_i)$ w.r.t. $(w, b)$:
  – measures how far the point $(x_i, y_i)$ lies from the decision boundary $(w, b)$:
    $\gamma^{(i)} = y_i \left( w^T x_i + b \right)$
  – A larger functional margin means more confidence in a correct prediction.
  – Problem: $w$ and $b$ can be rescaled to make this value arbitrarily large.
• Functional margin of a training set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ w.r.t. $(w, b)$:
  $\gamma = \min_{1 \le i \le m} \gamma^{(i)}$
Geometric Margin
• For a decision surface $(w, b)$, the vector orthogonal to it is $w$.
• The unit-length orthogonal vector is $\dfrac{w}{\lVert w \rVert}$.
• For a point $P = (a_1, a_2)$ whose projection onto the surface is $Q = (b_1, b_2)$:
  $P = Q + \gamma \dfrac{w}{\lVert w \rVert}$
Geometric Margin
$P = Q + \gamma \dfrac{w}{\lVert w \rVert} \;\Rightarrow\; (b_1, b_2) = (a_1, a_2) - \gamma \dfrac{w}{\lVert w \rVert}$

Since $Q$ lies on the decision surface, $w^T Q + b = 0$:
$w^T \left( (a_1, a_2) - \gamma \dfrac{w}{\lVert w \rVert} \right) + b = 0$
$\Rightarrow \gamma = \dfrac{w^T (a_1, a_2) + b}{\lVert w \rVert} = \left( \dfrac{w}{\lVert w \rVert} \right)^T (a_1, a_2) + \dfrac{b}{\lVert w \rVert}$

For a labeled point, the geometric margin is
$\gamma = y \left( \left( \dfrac{w}{\lVert w \rVert} \right)^T (a_1, a_2) + \dfrac{b}{\lVert w \rVert} \right)$

The geometric margin equals the functional margin when $\lVert w \rVert = 1$.
The geometric margin of $(w, b)$ w.r.t. $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ is the smallest of the geometric margins of the individual points.
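A small illustrative sketch (not part of the slides) that computes functional and geometric margins of a dataset for a given $(w, b)$:

```python
import numpy as np

def functional_margins(w, b, X, y):
    """gamma_i = y_i (w^T x_i + b) for each training point."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Smallest y_i ((w/||w||)^T x_i + b/||w||) over the training set."""
    return np.min(functional_margins(w, b, X, y) / np.linalg.norm(w))

# toy usage: two points on either side of the line x1 + x2 = 0
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0
print(geometric_margin(w, b, X, y))   # sqrt(2): distance of each point to the line
```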
Maximize margin width
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and thus has strong generalization ability.
(Figure: positive and negative examples in the $x_1$–$x_2$ plane with the maximum-margin separator and its margin.)
Maximize Margin Width
• Maximize $\dfrac{\gamma}{\lVert w \rVert}$ subject to
  $y_i \left( w^T x_i + b \right) \ge \gamma$ for $i = 1, 2, \ldots, m$
• Rescale $w, b$ so that $\gamma = 1$.
• Maximizing $\dfrac{1}{\lVert w \rVert}$ is the same as minimizing $\lVert w \rVert^2$.
• Minimize $w \cdot w$ subject to the constraints: for all $(x_i, y_i)$, $i = 1, \ldots, m$:
  $w^T x_i + b \ge 1$ if $y_i = 1$
  $w^T x_i + b \le -1$ if $y_i = -1$
Large Margin Linear Classifier
• Formulation:
  minimize $\dfrac{1}{2} \lVert w \rVert^2$
  such that $y_i \left( w^T x_i + b \right) \ge 1$, $i = 1, \ldots, m$
(Figure: the margin around the separating line, with points $x^+$ and $x^-$ lying on the margin boundaries.)
Solving the Optimization Problem
minimize $\dfrac{1}{2} \lVert w \rVert^2$
s.t. $y_i \left( w^T x_i + b \right) \ge 1$

• This is an optimization problem with a convex quadratic objective and linear constraints.
• It can be solved using quadratic programming (QP).
• Lagrange duality gives the optimization problem's dual form, which:
  – allows us to use kernels so that optimal margin classifiers work efficiently in very high dimensional spaces;
  – allows us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.
Foundations of Machine Learning
Module 5:
Part C: Support Vector Machine:
Dual
Sudeshna Sarkar
IIT Kharagpur

Solving the Optimization Problem
minimize $\dfrac{1}{2} \lVert w \rVert^2$
s.t. $y_i \left( w^T x_i + b \right) \ge 1$

• This is an optimization problem with a convex quadratic objective and linear constraints.
• It can be solved using quadratic programming (QP).
• Lagrange duality gives the optimization problem's dual form, which:
  – allows us to use kernels so that optimal margin classifiers work efficiently in very high dimensional spaces;
  – allows us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.
Lagrangian Duality in brief
The primal problem:
$\min_w f(w)$
s.t. $g_i(w) \le 0,\ i = 1, \ldots, k$
     $h_i(w) = 0,\ i = 1, \ldots, l$

The generalized Lagrangian:
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$

The $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrange multipliers.

Lemma:
$\max_{\alpha, \beta:\ \alpha_i \ge 0} L(w, \alpha, \beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$

A re-written primal:
$\min_w \max_{\alpha, \beta:\ \alpha_i \ge 0} L(w, \alpha, \beta)$
Lagrangian Duality, cont.
The primal problem: $p^* = \min_w \max_{\alpha, \beta:\ \alpha_i \ge 0} L(w, \alpha, \beta)$
The dual problem: $d^* = \max_{\alpha, \beta:\ \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$

Theorem (weak duality):
$d^* = \max_{\alpha, \beta:\ \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\ \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$

Theorem (strong duality):
Iff there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.
The KKT conditions
If there exists some saddle point of L, then it satisfies the
following "Karush-Kuhn-Tucker" (KKT) conditions:

L( w, a , b )  0, i  1, , k
wi

L( w, a , b )  0, i  1,, l
b i
αi g i ( w)  0, i  1,, m
g i ( w)  0, i  1,, m
a i  0, i  1,, m
Theorem: If w*, a* and b* satisfy the KKT condition, then it
is also a solution to the primal and the dual problems.
Support Vectors
• Only a few $\alpha_i$'s can be nonzero.
• The training data points whose $\alpha_i$'s are nonzero are called the support vectors.
• Complementary slackness, $\alpha_i\, g_i(w) = 0$ for $i = 1, \ldots, k$, implies that if $\alpha_i > 0$ then $g_i(w) = 0$.
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize $\dfrac{1}{2} \lVert w \rVert^2$
s.t. $y_i \left( w^T x_i + b \right) \ge 1$

Lagrangian function:
minimize $L_p(w, b, \alpha_i) = \dfrac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 \right]$
s.t. $\alpha_i \ge 0$
Solving the Optimization Problem
minimize $L_p(w, b, \alpha_i) = \dfrac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 \right]$, s.t. $\alpha_i \ge 0$

Minimize w.r.t. $w$ and $b$ for fixed $\alpha$:
$\dfrac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$
$\dfrac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$

Substituting back into $L_p$:
$L_p(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i^T x_j \right) - b \sum_{i=1}^{m} \alpha_i y_i$
$= \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i^T x_j \right)$
The Dual problem
Now we have the following dual optimization problem:
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i^T x_j \right)$
s.t. $\alpha_i \ge 0,\ i = 1, \ldots, m$
     $\sum_{i=1}^{m} \alpha_i y_i = 0$

• This is a quadratic programming problem.
• A global maximum of the $\alpha_i$ can always be found.
Support vector machines
• Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
• For testing with a new data point $z$:
  – compute
    $w^T z + b = \sum_{i \in SV} \alpha_i y_i \left( x_i^T z \right) + b$
  – and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
• Note: $w$ need not be formed explicitly.
Solving the Optimization Problem
• The discriminant function is:
  $g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i \left( x_i^T x \right) + b$
• It relies on a dot product between the test point $x$ and the support vectors $x_i$.
• Solving the optimization problem involved computing the dot products $x_i^T x_j$ between all pairs of training points.
• The optimal $w$ is a linear combination of a small number of data points.
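Because the dual is a standard QP, it can be handed to a generic QP solver. The sketch below is illustrative only: it assumes the cvxopt package and a hard-margin, linearly separable toy dataset, neither of which is prescribed by the slides. It solves the dual, extracts the support vectors, and recovers $w$ and $b$.

```python
import numpy as np
from cvxopt import matrix, solvers   # pip install cvxopt

solvers.options['show_progress'] = False

def hard_margin_svm(X, y):
    """Solve the hard-margin SVM dual with a generic QP solver, then recover w, b."""
    m = X.shape[0]
    K = X @ X.T                                    # Gram matrix of dot products x_i . x_j
    P = matrix(np.outer(y, y) * K)                 # P_ij = y_i y_j (x_i . x_j)
    q = matrix(-np.ones(m))                        # maximizing sum(alpha) == minimizing -1^T alpha
    G = matrix(-np.eye(m))                         # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))     # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                              # support vectors have nonzero alpha_i
    w = (alpha[sv] * y[sv]) @ X[sv]                # w = sum_i alpha_i y_i x_i
    b0 = y[sv][0] - X[sv][0] @ w                   # from y_k (w . x_k + b) = 1 for any SV
    return w, b0, sv

# toy usage on a separable dataset (labels must be +1 / -1)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b0, sv = hard_margin_svm(X, y)
print("support vectors:", np.flatnonzero(sv), " predictions:", np.sign(X @ w + b0))
```

Note that only the dot products $x_i^T x_j$ enter the problem, which is what later allows the kernel substitution.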
Foundations of Machine Learning
Module 5: Support Vector Machine
Part D: SVM – Maximum Margin
with Noise
Sudeshna Sarkar
IIT Kharagpur

Linear SVM formulation
Find $w$ and $b$ such that
  $\dfrac{2}{\lVert w \rVert}$ is maximized,
and for each of the $m$ training points $(x_i, y_i)$:
  $y_i \left( w \cdot x_i + b \right) \ge 1$

Equivalently, find $w$ and $b$ such that
  $\lVert w \rVert^2 = w \cdot w$ is minimized,
and for each of the $m$ training points $(x_i, y_i)$:
  $y_i \left( w \cdot x_i + b \right) \ge 1$
Limitations of the previous SVM formulation
• What if the data is not linearly separable?
• What about noisy data points?
• Extend the definition of maximum margin to allow non-separating planes.
How to formulate?
• Minimize $\lVert w \rVert^2 = w \cdot w$ together with the number of misclassifications, i.e., minimize
  $w \cdot w + \#(\text{training errors})$
• This is no longer a QP formulation.
Objective to be minimized
• Minimize
  $w \cdot w + C \cdot (\text{distance of error points to their correct place})$

Figure from [Link]
Maximum Margin with Noise
Margin width: $M = \dfrac{2}{\sqrt{w \cdot w}}$

Minimize
  $w \cdot w + C \sum_{k=1}^{m} \xi_k$
subject to the $m$ constraints:
  $w \cdot x_k + b \ge 1 - \xi_k$ if $y_k = 1$
  $w \cdot x_k + b \le -1 + \xi_k$ if $y_k = -1$
equivalently,
  $y_k \left( w \cdot x_k + b \right) \ge 1 - \xi_k$ and $\xi_k \ge 0$, $k = 1, \ldots, m$

$C$ controls the relative importance of maximizing the margin and fitting the training data; it controls overfitting.
(Figure: a separating line with its margin and three labeled points x1, x2, x3.)
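To see the role of $C$ in practice, here is a small sketch using scikit-learn's SVC (an assumption; the slides do not prescribe a library, and the synthetic data and $C$ values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

# noisy, overlapping classes: not perfectly separable
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6}  support vectors={len(clf.support_):3d}  "
          f"margin width={2 / np.linalg.norm(clf.coef_):.2f}")
```

A small $C$ tolerates margin violations and gives a wide margin with many support vectors; a large $C$ penalizes slack heavily and narrows the margin.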
Lagrangian
$L(w, b, \xi, \alpha, \beta) = \dfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( x_i \cdot w + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i$

The $\alpha_i$'s and $\beta_i$'s are Lagrange multipliers ($\ge 0$).
Dual Formulation
Find $\alpha_1, \alpha_2, \ldots, \alpha_m$ that solve
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i^T x_j \right)$

• Linear SVM (separable case):
  s.t. $\alpha_i \ge 0,\ i = 1, \ldots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
• Noise accounted for (soft margin):
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
Solution to Soft Margin Classification
• The $x_i$ with non-zero $\alpha_i$ are the support vectors.
• The solution to the dual problem gives:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
  $b = y_k \left( 1 - \xi_k \right) - \sum_{i=1}^{m} \alpha_i y_i \left( x_i \cdot x_k \right)$ for any $k$ s.t. $\alpha_k > 0$
• For classification,
  $f(x) = \sum_{i=1}^{m} \alpha_i y_i \left( x_i \cdot x \right) + b$
  (no need to compute $w$ explicitly).
Thank You

Foundations of Machine Learning
Module 5: Support Vector Machine
Part E: Nonlinear SVM and Kernel
function
Sudeshna Sarkar
IIT Kharagpur

Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
  – Won't using a much bigger set of features make the computation slow?
  – The "kernel" trick makes the computation fast.
Non-linear SVMs: Feature Space
Map the input to a feature space: $\Phi: x \to \phi(x)$
(Figure slides: input space vs. feature space.)
This slide is from [Link]/~pift6080/documents/papers/svm_tutorial.ppt
Kernel
• The original input attributes are mapped to a new set of input features via a feature mapping $\Phi$.
• Since the algorithm can be written in terms of scalar products, we replace $x_a \cdot x_b$ with $\phi(x_a) \cdot \phi(x_b)$.
• For certain $\Phi$'s there is a simple operation on two vectors in the low-dimensional space that can be used to compute the scalar product of their two images in the high-dimensional space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
• Let the kernel do the work rather than computing the scalar product in the high-dimensional space.
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function is now:
  $g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y_i\, \phi(x_i)^T \phi(x) + b$
• We only use the dot product of feature vectors in both training and testing.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
The kernel trick

$K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
Often $K(x_a, x_b)$ is very inexpensive to compute even when $\phi(x_a)$ is extremely high dimensional.
Kernel Example
Consider 2-dimensional vectors $x = [x_1\ x_2]$ and let $K(x_i, x_j) = (1 + x_i \cdot x_j)^2$.
We need to show that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$:

$K(x_i, x_j) = (1 + x_i \cdot x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2\, x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2\, x_{i1} x_{j1} + 2\, x_{i2} x_{j2}$
$= [1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2}] \cdot [1,\ x_{j1}^2,\ \sqrt{2}\, x_{j1} x_{j2},\ x_{j2}^2,\ \sqrt{2}\, x_{j1},\ \sqrt{2}\, x_{j2}]$
$= \phi(x_i) \cdot \phi(x_j)$, where $\phi(x) = [1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2]$
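This identity can be checked numerically; the snippet below is an illustrative sketch (not from the slides) comparing the kernel value with the explicit dot product in feature space:

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel of degree 2: (1 + xi . xj)^2."""
    return (1 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))   # both give the same value (4.0)
```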
Commonly-used kernel functions
• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Polynomial of power $p$: $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
• Gaussian (radial-basis function): $K(x_i, x_j) = e^{-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}}$
• Sigmoid: $K(x_i, x_j) = \tanh(\beta_0\, x_i \cdot x_j + \beta_1)$
In general, functions that satisfy Mercer's condition can be kernel functions.
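For reference, these kernels might be implemented as follows (a sketch; the parameter names p, sigma, beta0, beta1 follow the formulas above, and the naive O(m²) Gram-matrix loop is for illustration only):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```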
Kernel Functions
• A kernel function can be thought of as a similarity measure between the input objects.
• Not every similarity measure can be used as a kernel function.
• Mercer's condition states that any positive semi-definite kernel $K(x, y)$, i.e. one for which
  $\sum_{i,j} K(x_i, x_j)\, c_i c_j \ge 0$
  for all choices of points $x_i$ and real coefficients $c_i$, can be expressed as a dot product in a high-dimensional space.
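Mercer's condition can be probed empirically by checking that the Gram matrix built from a candidate kernel is positive semi-definite; the sketch below (illustrative, with an arbitrary random sample) does this via its eigenvalues:

```python
import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    """Empirical Mercer check: the Gram matrix must be positive semi-definite."""
    m = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return np.all(np.linalg.eigvalsh(K) >= -tol)

X = np.random.default_rng(0).normal(size=(20, 2))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)
print(is_psd_gram(X, rbf))                        # True: the Gaussian kernel is PSD
print(is_psd_gram(X, lambda a, b: -(a @ b)))      # False: not a valid kernel
```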
SVM examples

© Eric Xing @ CMU, 2006-2010


Examples for Non Linear SVMs –
Gaussian Kernel

© Eric Xing @ CMU, 2006-2010


Nonlinear SVM: Optimization
• Formulation (Lagrangian dual problem):
  maximize $\sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
• The solution of the discriminant function is
  $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$
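For example, a Gaussian-kernel SVM separates data that no linear classifier can. The sketch below uses scikit-learn, which the slides do not prescribe; the dataset and hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not linearly separable in the original 2-D input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # gamma plays the role of 1 / (2 sigma^2)
print("linear kernel accuracy:", linear.score(X, y))  # well below 1.0
print("RBF kernel accuracy:   ", rbf.score(X, y))     # close to 1.0
```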
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in
the original space.
Multi-class classification
• Basic SVMs can only handle two-class outputs.
• Learn N SVMs:
  – SVM 1 learns Class 1 vs. REST
  – SVM 2 learns Class 2 vs. REST
  – :
  – SVM N learns Class N vs. REST
• Then, to predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region (see the sketch below).
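A minimal one-vs-rest sketch (assuming scikit-learn's SVC for the per-class binary models; the blob data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def fit_one_vs_rest(X, y, classes, **svc_kwargs):
    """Train one binary SVM per class: class k vs. the rest."""
    return {k: SVC(**svc_kwargs).fit(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_rest(models, X):
    """Pick the class whose SVM pushes the point furthest into its positive region."""
    scores = np.column_stack([models[k].decision_function(X) for k in models])
    return np.array(list(models))[np.argmax(scores, axis=1)]

# toy usage with three Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (-3, 0, 3)])
y = np.repeat([0, 1, 2], 30)
models = fit_one_vs_rest(X, y, classes=[0, 1, 2], kernel="rbf", C=1.0)
print("accuracy:", (predict_one_vs_rest(models, X) == y).mean())
```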
Thank You

Foundations of Machine Learning
Module 5: Support Vector Machine
Part F: SVM – Solution to the Dual
Problem
Sudeshna Sarkar
IIT Kharagpur

The SMO algorithm
The SMO algorithm can efficiently solve the dual problem.
First we discuss Coordinate Ascent.
Coordinate Ascent
• Consider solving the unconstrained optimization problem:
  $\max_\alpha W(\alpha_1, \alpha_2, \ldots, \alpha_n)$

  Loop until convergence: {
    for $i = 1$ to $n$ {
      $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \hat{\alpha}_i, \ldots, \alpha_n)$
    }
  }
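A tiny numeric illustration of this loop (a sketch; it approximates each inner arg max by a single gradient step along coordinate i, a simplification made here for brevity rather than an exact 1-D maximizer):

```python
import numpy as np

def coordinate_ascent(grad_w, alpha0, lr=0.1, n_iters=200):
    """Maximize W by updating one coordinate alpha_i at a time."""
    alpha = alpha0.astype(float).copy()
    for _ in range(n_iters):
        for i in range(alpha.size):
            alpha[i] += lr * grad_w(alpha)[i]   # move only coordinate i
    return alpha

# toy usage: maximize W(a) = -(a1 - 1)^2 - 2*(a2 + 3)^2, optimum at (1, -3)
grad = lambda a: np.array([-2 * (a[0] - 1), -4 * (a[1] + 3)])
print(coordinate_ascent(grad, np.zeros(2)))   # approximately [ 1. -3.]
```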
Coordinate ascent

• Ellipses are the contours of the function.


• At each step, the path is parallel to one of the axes.
Sequential minimal optimization
• Constrained optimization:
  $\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i \cdot x_j \right)$
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
• Question: can we do coordinate ascent along one direction at a time (i.e., hold all $\alpha_{[-i]}$ fixed, and update $\alpha_i$)?
The SMO algorithm
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( x_i \cdot x_j \right)$
s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
     $\sum_{i=1}^{m} \alpha_i y_i = 0$

• Suppose we choose a set of $\alpha_i$'s satisfying the constraints.
• Then any single $\alpha_i$ is exactly determined by the other $\alpha$'s through the equality constraint $\sum_i \alpha_i y_i = 0$.
• We therefore have to update at least two of them simultaneously to keep satisfying the constraints.
The SMO algorithm
Repeat till convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Re-optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.
}
• The update to $\alpha_i$ and $\alpha_j$ can be computed very efficiently.
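The pair update in step 2 has a closed form. Below is a sketch of the analytic two-variable update for the soft-margin dual, following a commonly taught simplified variant of SMO; the pair-selection heuristic, the update of the threshold b, and convergence checks are omitted, and the function signature is an illustrative assumption:

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, E):
    """One analytic SMO pair update.

    alpha: current multipliers; y: labels (+1/-1); K: Gram (kernel) matrix;
    C: box constraint; E[k] = f(x_k) - y_k are the current prediction errors.
    """
    # feasible interval [L, H] for alpha_j implied by 0 <= alpha <= C
    # and the equality constraint sum_k alpha_k y_k = 0
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]      # curvature along the constraint line
    if L == H or eta >= 0:
        return alpha                            # no progress possible on this pair
    aj_old = alpha[j]
    alpha[j] = np.clip(aj_old - y[j] * (E[i] - E[j]) / eta, L, H)
    alpha[i] += y[i] * y[j] * (aj_old - alpha[j])   # keep sum_k alpha_k y_k unchanged
    return alpha
```

A full solver would recompute the errors E and the threshold b after each update and loop over heuristically chosen pairs until the KKT conditions hold.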
Thank You
