Foundations of Machine Learning
Module 5:
Part A: Logistic Regression
Sudeshna Sarkar
IIT Kharagpur
Logistic Regression for classification
• Linear Regression:
  $h(x) = \sum_{i=0}^{n} \beta_i x_i = \beta^T x$
• Logistic Regression for classification:
  $h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}} = g(\beta^T x)$
  where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or the
  sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• 𝑔(𝑧) → 1 as 𝑧 → ∞
• 𝑔(𝑧) → 0 as 𝑧 → −∞
• Derivative:
  $g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}}
  = \frac{e^{-z}}{(1 + e^{-z})^2}
  = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right)
  = g(z)\,(1 - g(z))$
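As a quick numerical sanity check of the identity $g'(z) = g(z)(1 - g(z))$, here is a minimal NumPy sketch (the function names are ours, not from the lecture) comparing the analytic derivative with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # analytic derivative: g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central differences
print(np.max(np.abs(numeric - sigmoid_grad(z))))             # ~1e-10, as expected
```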
Logistic Regression
• In logistic regression, we learn the conditional distribution
P(y|x)
• Let $p_y(x; \beta)$ be our estimate of P(y|x), where $\beta$ is a vector of
adjustable parameters.
• Assume there are two classes, y = 0 and y = 1, and
  $P(y = 1 \mid x) = h_\beta(x)$
  $P(y = 0 \mid x) = 1 - h_\beta(x)$
• This can be written more compactly as
  $P(y \mid x) = h_\beta(x)^{y}\,(1 - h_\beta(x))^{1-y}$
• We can use a gradient method to fit $\beta$ by maximizing the likelihood.
Maximize likelihood
$L(\beta) = p(\vec{y} \mid X; \beta)
= \prod_{i=1}^{m} p(y_i \mid x_i; \beta)
= \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}$
Taking logs gives the log-likelihood:
$\ell(\beta) = \log L(\beta)
= \sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))$
• How do we maximize the likelihood? Gradient ascent
  – Update: $\beta \leftarrow \beta + \alpha \nabla_\beta \ell(\beta)$
Assume one training example (x,y), and take derivatives to derive the
stochastic gradient ascent rule.
$\frac{\partial}{\partial \beta_j} \ell(\beta)
= \left(\frac{y}{g(\beta^T x)} - \frac{1 - y}{1 - g(\beta^T x)}\right)
  \frac{\partial}{\partial \beta_j} g(\beta^T x)$
$= \left(\frac{y}{g(\beta^T x)} - \frac{1 - y}{1 - g(\beta^T x)}\right)
  g(\beta^T x)\,(1 - g(\beta^T x))\,\frac{\partial}{\partial \beta_j} \beta^T x$
$= \big(y\,(1 - g(\beta^T x)) - (1 - y)\,g(\beta^T x)\big)\, x_j$
$= (y - h_\beta(x))\, x_j$
With $\beta \leftarrow \beta + \alpha \nabla_\beta \ell(\beta)$, this gives the stochastic gradient ascent rule
$\beta_j \leftarrow \beta_j + \alpha\,\big(y^{(i)} - h_\beta(x^{(i)})\big)\, x_j^{(i)}$
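The update rule above translates directly into code. Below is a minimal sketch (variable names are ours; a constant feature $x_0 = 1$ is prepended to absorb the intercept) of batch gradient ascent on the log-likelihood, applied to a made-up one-dimensional dataset.

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood l(beta).

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend x_0 = 1 for the intercept
    beta = np.zeros(n + 1)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-Xb @ beta))  # h_beta(x) = g(beta^T x)
        beta += alpha * Xb.T @ (y - h)        # beta_j += alpha * sum_i (y_i - h(x_i)) x_ij
    return beta

# toy usage: two well-separated 1-D clusters
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))   # decision boundary beta_0 + beta_1 x = 0 near x ~ 2.25
```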
Foundations of Machine Learning
Module 5:
Part B: Introduction to Support
Vector Machine
Sudeshna Sarkar
IIT Kharagpur
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.
Logistic Regression and Confidence
• Logistic Regression:
  $p(y = 1 \mid x) = h_\beta(x) = g(\beta^T x)$
• Predict 1 on an input x iff $h_\beta(x) \ge 0.5$, equivalently, $\beta^T x \ge 0$.
• The larger the value of $h_\beta(x)$, the larger the probability and the
  higher the confidence.
• Similarly, we make a confident prediction of $y = 0$ if $\beta^T x \ll 0$.
• Predictions are more confident for points (instances) located
  far from the decision surface.
Preventing overfitting with many features
• Suppose we have a big set of features.
• What is the best separating line
to use?
• Bayesian answer:
– Use all
– Weight each line by its posterior
probability
• Can we approximate the correct
answer efficiently?
Support Vectors
• Choose the line that maximizes the
  minimum margin.
• This maximum-margin separator is
  determined by a subset of the
  datapoints.
  – These points are called "support vectors".
  – We use the support vectors to
    decide which side of the
    separator a test case is on.
(In the figure, the support vectors are indicated by circles around them.)
Functional Margin
• Functional margin of a point $(x_i, y_i)$ w.r.t. $(w, b)$:
  a signed measure of how far the point lies on the correct side of the
  decision boundary $(w, b)$,
  $\gamma_i = y_i (w^T x_i + b)$
  – A larger functional margin means more confidence in a correct prediction.
  – Problem: $w$ and $b$ can be scaled to make this value arbitrarily large.
• Functional margin of a training set
  $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ w.r.t. $(w, b)$:
  $\gamma = \min_{1 \le i \le m} \gamma_i$
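A short sketch (hypothetical data, our own variable names) that computes the functional margin of each point and of the whole training set for a given $(w, b)$, and shows the scaling problem: multiplying $w$ and $b$ by 10 inflates the margin without changing the classifier.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -0.5

gamma_i = y * (X @ w + b)   # functional margin of each point
gamma = gamma_i.min()       # functional margin of the training set
print(gamma_i, gamma)       # [3.5 3.5 2.5 5.5], 2.5

# scaling (w, b) by 10 gives the same classifier but a 10x larger "margin"
print((y * (X @ (10 * w) + 10 * b)).min())   # 25.0
```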
Geometric Margin
• For a decision surface $(w, b)$, the vector orthogonal to it is given by $w$.
• The unit-length orthogonal vector is $\frac{w}{\|w\|}$.
• If $P = (a_1, a_2)$ is a point at geometric distance $\gamma$ from the surface
  and $Q = (b_1, b_2)$ is its projection onto the surface, then
  $P = Q + \gamma \frac{w}{\|w\|}$
Geometric Margin
Since $P = Q + \gamma \frac{w}{\|w\|}$, the projection is
$(b_1, b_2) = (a_1, a_2) - \gamma \frac{w}{\|w\|}$
$Q$ lies on the decision surface, so
$w^T\!\left((a_1, a_2) - \gamma \frac{w}{\|w\|}\right) + b = 0$
$\Rightarrow\; \gamma = \frac{w^T (a_1, a_2) + b}{\|w\|}
= \left(\frac{w}{\|w\|}\right)^{T} (a_1, a_2) + \frac{b}{\|w\|}$
Taking the label into account,
$\gamma = y \left(\left(\frac{w}{\|w\|}\right)^{T} (a_1, a_2) + \frac{b}{\|w\|}\right)$
The geometric margin equals the functional margin when $\|w\| = 1$.
Geometric margin of $(w, b)$ w.r.t. $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$:
the smallest of the geometric margins of the individual points.
Maximize margin width
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and
  thus has strong generalization ability.
(Figure: a linear separator in the $(x_1, x_2)$ plane with the margin marked; the markers denote the +1 and −1 classes.)
Maximize Margin Width
• Maximize $\frac{\gamma}{\|w\|}$ subject to
  $y_i (w^T x_i + b) \ge \gamma$ for $i = 1, 2, \dots, m$
• Scale $w, b$ so that $\gamma = 1$.
• Maximizing $\frac{1}{\|w\|}$ is the same as minimizing $\|w\|^2$.
• Minimize $w \cdot w$ subject to the constraints,
  for all $(x_i, y_i)$, $i = 1, \dots, m$:
  $w^T x_i + b \ge 1$ if $y_i = 1$
  $w^T x_i + b \le -1$ if $y_i = -1$
Large Margin Linear Classifier
• Formulation:
  minimize $\frac{1}{2}\|w\|^2$
  such that $y_i (w^T x_i + b) \ge 1$, $i = 1, \dots, m$
(Figure: the maximum-margin separator in the $(x_1, x_2)$ plane, with points $x^+$ and $x^-$ on the margin boundaries; the markers denote the +1 and −1 classes.)
Solving the Optimization Problem
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
• An optimization problem with a convex quadratic objective and
  linear constraints.
• It can be solved using quadratic programming (QP).
• Lagrange duality gives the optimization problem's dual form, which
  – allows us to use kernels so that optimal margin classifiers work
    efficiently in very high dimensional spaces, and
  – allows us to derive an efficient algorithm for solving the above
    optimization problem that typically does much better than generic
    QP software.
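As an illustration of "can be solved using QP", here is a small sketch that hands the primal problem (minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \ge 1$) to a generic constrained solver, scipy.optimize.minimize with SLSQP. This assumes SciPy is available and uses a made-up separable dataset; it is only a toy, not the efficient dual/kernel route developed in the next part.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:-1]                       # v = [w_1, ..., w_n, b]
    return 0.5 * w @ w               # (1/2) ||w||^2

constraints = [{'type': 'ineq',      # y_i (w.x_i + b) - 1 >= 0
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method='SLSQP', constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b, y * (X @ w + b))         # all constraint values should be >= 1
```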
Foundations of Machine Learning
Module 5:
Part C: Support Vector Machine:
Dual
Sudeshna Sarkar
IIT Kharagpur
Solving the Optimization Problem (recap)
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
• An optimization problem with a convex quadratic objective and linear
  constraints; it can be solved using QP.
• Lagrange duality gives the problem's dual form, which
  – allows us to use kernels so that optimal margin classifiers work
    efficiently in very high dimensional spaces, and
  – allows us to derive an efficient algorithm that typically does much
    better than generic QP software.
Lagrangian Duality in brief
The primal problem:
$\min_w f(w)$
s.t. $g_i(w) \le 0,\; i = 1, \dots, k$
     $h_i(w) = 0,\; i = 1, \dots, l$
The generalized Lagrangian:
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrange multipliers.
Lemma:
$\max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) =
\begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$
A re-written primal:
$\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
Lagrangian Duality, cont.
The primal problem: $p^* = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
The dual problem: $d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
Theorem (weak duality):
$d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)
\;\le\; \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
Theorem (strong duality):
If and only if there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.
The KKT conditions
If there exists a saddle point of L, then it satisfies the
following "Karush-Kuhn-Tucker" (KKT) conditions:
$\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0,\; i = 1, \dots, n$
$\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0,\; i = 1, \dots, l$
$\alpha_i g_i(w) = 0,\; i = 1, \dots, k$
$g_i(w) \le 0,\; i = 1, \dots, k$
$\alpha_i \ge 0,\; i = 1, \dots, k$
Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they
also give a solution to the primal and the dual problems.
Support Vectors
• Only a few 𝛼𝑖 ’s can be nonzero
• Call the training data points whose 𝛼𝑖 ’s are
nonzero the support vectors
  By complementary slackness, $\alpha_i g_i(w) = 0,\; i = 1, \dots, k$:
  if $\alpha_i > 0$ then $g_i(w) = 0$ (the constraint is active).
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
Lagrangian function:
minimize $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2
- \sum_{i=1}^{m} \alpha_i \big(y_i (w^T x_i + b) - 1\big)$
s.t. $\alpha_i \ge 0$
Solving the Optimization Problem
minimize $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2
- \sum_{i=1}^{m} \alpha_i \big(y_i (w^T x_i + b) - 1\big)$, s.t. $\alpha_i \ge 0$
Minimize w.r.t. $w$ and $b$ for fixed $\alpha$:
$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$
$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$
Substituting back,
$L_p(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)
- b \sum_{i=1}^{m} \alpha_i y_i$
$= \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
The Dual Problem
Now we have the following dual optimization problem:
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
s.t. $\alpha_i \ge 0,\; i = 1, \dots, m$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
This is a quadratic programming problem.
– A global maximum of the $\alpha_i$ can always be found.
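A minimal sketch of solving this dual with a generic solver (scipy.optimize.minimize with SLSQP, assuming SciPy is available, on a made-up separable dataset; variable names are ours). Dedicated SVM solvers are far more efficient, as the slides note.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T    # Q_ij = y_i y_j (x_i . x_j)

def neg_J(a):                                 # minimize -J(alpha)
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_J, x0=np.zeros(m), method='SLSQP',
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x
print(np.round(alpha, 4))    # only the support vectors get alpha_i > 0
```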
Support vector machines
• Once we have the Lagrange multipliers $\{\alpha_i\}$, we can
  reconstruct the parameter vector $w$ as a weighted
  combination of the training examples:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
• For testing with a new data point $z$:
  – Compute
    $w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b$
    and classify $z$ as class 1 if the sum is positive, and class 2
    otherwise.
Note: $w$ need not be formed explicitly.
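Given multipliers $\alpha$ from the dual (for instance from the sketch above), a short sketch of prediction that never forms $w$ explicitly; the bias is recovered from any support vector via $b = y_k - \sum_i \alpha_i y_i\, x_i \cdot x_k$ (hard-margin case; helper names are ours).

```python
import numpy as np

def svm_bias(X, y, alpha):
    # b = y_k - sum_i alpha_i y_i (x_i . x_k) for any support vector k (alpha_k > 0)
    k = int(np.argmax(alpha))
    return y[k] - np.sum(alpha * y * (X @ X[k]))

def svm_decision(z, X, y, alpha, b, tol=1e-8):
    # w^T z + b = sum_{i in SV} alpha_i y_i (x_i . z) + b, using only support vectors
    sv = alpha > tol
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b

# usage: classify z as class 1 if svm_decision(z, X, y, alpha, svm_bias(X, y, alpha)) > 0
```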
Solving the Optimization Problem
The discriminant function is:
$g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T x + b$
It relies on a dot product between the test point $x$ and the
support vectors $x_i$.
Solving the optimization problem involved computing the
dot products $x_i^T x_j$ between all pairs of training points.
The optimal $w$ is a linear combination of a small number
of data points.
Foundations of Machine Learning
Module 5: Support Vector Machine
Part D: SVM – Maximum Margin
with Noise
Sudeshna Sarkar
IIT Kharagpur
Linear SVM formulation
Find $w$ and $b$ such that
$\frac{2}{\|w\|}$ is maximized,
and for each of the $m$ training points $(x_i, y_i)$:
$y_i (w \cdot x_i + b) \ge 1$
Equivalently: find $w$ and $b$ such that
$\|w\|^2 = w \cdot w$ is minimized,
and for each of the $m$ training points $(x_i, y_i)$:
$y_i (w \cdot x_i + b) \ge 1$
Limitations of the previous SVM formulation
• What if the data is not linearly separable?
• Or there are noisy data points?
Extend the definition of maximum margin to allow
non-separating planes.
How to formulate?
• Minimize $\|w\|^2 = w \cdot w$ and the number of
misclassifications, i.e., minimize
  $w \cdot w + \#(\text{training errors})$
• This is no longer a QP formulation.
Objective to be minimized
• Minimize
  $w \cdot w + C \cdot (\text{distance of error points to their correct side of the margin})$
(Figure from [Link].)
Maximum Margin with Noise
Margin width: $M = \frac{2}{\sqrt{w \cdot w}}$
Minimize
$w \cdot w + C \sum_{k=1}^{m} \xi_k$
subject to the $m$ constraints
$w \cdot x_k + b \ge 1 - \xi_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \xi_k$ if $y_k = -1$
equivalently,
$y_k (w \cdot x_k + b) \ge 1 - \xi_k$, k = 1, …, m
$\xi_k \ge 0$, k = 1, …, m
C controls the relative importance of maximizing the margin and fitting
the training data, and hence controls overfitting.
(Figure: points $x_1$, $x_2$, $x_3$ lie inside or beyond the margin, with slacks $\xi_k$.)
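In practice one rarely writes this QP by hand: a library such as scikit-learn exposes the soft-margin formulation directly through the C parameter. A small sketch (assuming scikit-learn is available; the data is made up) showing that a small C tolerates margin violations while a large C fits the training data more tightly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)),      # class +1 cluster
               rng.normal([-2, -2], 1.0, (30, 2))])   # class -1 cluster
y = np.hstack([np.ones(30), -np.ones(30)])

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"train accuracy={clf.score(X, y):.2f}")
```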
Lagrangian
$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{m} \xi_i
- \sum_{i=1}^{m} \alpha_i \big(y_i (x_i \cdot w + b) - 1 + \xi_i\big)
- \sum_{i=1}^{m} \beta_i \xi_i$
The $\alpha_i$'s and $\beta_i$'s are Lagrange multipliers ($\ge 0$).
Dual Formulation
Find $\alpha_1, \alpha_2, \dots, \alpha_m$ that
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
Linear SVM (separable case):
s.t. $\alpha_i \ge 0,\; i = 1, \dots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
Linear SVM with noise accounted for (soft margin):
s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
Solution to Soft Margin Classification
• The $x_i$ with non-zero $\alpha_i$ will be the support vectors.
• Solution to the dual problem:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
  $b = y_k (1 - \xi_k) - \sum_{i=1}^{m} \alpha_i y_i\, x_i \cdot x_k$ for any $k$ s.t. $\alpha_k > 0$
For classification,
$f(x) = \sum_{i=1}^{m} \alpha_i y_i\, x_i \cdot x + b$
(no need to compute $w$ explicitly)
Thank You
Foundations of Machine Learning
Module 5: Support Vector Machine
Part E: Nonlinear SVM and Kernel
function
Sudeshna Sarkar
IIT Kharagpur
Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
  – Won't using a bigger set of features make the computation slow?
  – The "kernel" trick makes the computation fast.
Non-linear SVMs: Feature Space
$\Phi: x \to \phi(x)$
This slide is from [Link]/~pift6080/documents/papers/svm_tutorial.ppt
Kernel
• The original input attributes are mapped to a new set of
  input features via a feature mapping Φ.
• Since the algorithm can be written in terms of the
  scalar product, we replace $x_a \cdot x_b$ with $\phi(x_a) \cdot \phi(x_b)$.
• For certain Φ's there is a simple operation on two
  vectors in the low-dimensional space that can be used to
  compute the scalar product of their two images in the
  high-dimensional space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
Let the kernel do the work rather than computing the scalar
product in the high-dimensional space.
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function is now:
  $g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y_i\, \phi(x_i)^T \phi(x) + b$
• We only use the dot product of feature vectors, in both
  training and testing.
• A kernel function is defined as a function that
  corresponds to a dot product of two feature vectors in
  some expanded feature space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
The kernel trick
$K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
Often $K(x_a, x_b)$ is very inexpensive to compute even though
$\phi(x_a)$ may be extremely high dimensional.
Kernel Example
2-dimensional vectors $x = [x_1\; x_2]$; let $K(x_i, x_j) = (1 + x_i \cdot x_j)^2$.
We need to show that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$:
$K(x_i, x_j) = (1 + x_i \cdot x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1\;\; x_{i1}^2\;\; \sqrt{2}\, x_{i1} x_{i2}\;\; x_{i2}^2\;\; \sqrt{2} x_{i1}\;\; \sqrt{2} x_{i2}]
\cdot [1\;\; x_{j1}^2\;\; \sqrt{2}\, x_{j1} x_{j2}\;\; x_{j2}^2\;\; \sqrt{2} x_{j1}\;\; \sqrt{2} x_{j2}]$
$= \phi(x_i) \cdot \phi(x_j)$,
where $\phi(x) = [1\;\; x_1^2\;\; \sqrt{2}\, x_1 x_2\;\; x_2^2\;\; \sqrt{2} x_1\;\; \sqrt{2} x_2]$
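The algebra above is easy to check numerically. A small sketch (helper names are ours) confirming that the polynomial kernel equals the dot product of the explicit feature vectors:

```python
import numpy as np

def phi(x):
    # explicit feature map for K(a, b) = (1 + a.b)^2 in 2 dimensions
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K = (1 + xi @ xj) ** 2          # kernel evaluated in the original 2-D space
print(K, phi(xi) @ phi(xj))     # both print 4.0
```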
Commonly-used kernel functions
• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Polynomial of power p:
  $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
• Gaussian (radial-basis function):
  $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Sigmoid:
  $K(x_i, x_j) = \tanh(\beta_0\, x_i \cdot x_j + \beta_1)$
In general, functions that satisfy Mercer's condition can
be kernel functions.
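These kernels are one-liners in code; a sketch (parameter names are ours):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```

Note that the sigmoid "kernel" satisfies Mercer's condition only for some parameter settings, which leads into the next slide.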
Kernel Functions
• A kernel function can be thought of as a similarity measure
  between the input objects.
• Not every similarity measure can be used as a kernel function.
• Mercer's condition states that any positive semi-definite
  kernel K(x, y), i.e. one for which
  $\sum_{i,j} K(x_i, x_j)\, c_i c_j \ge 0$
  for all finite sets of points $\{x_i\}$ and real coefficients $\{c_i\}$,
  can be expressed as a dot product in a high-dimensional space.
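Positive semi-definiteness can be probed numerically on a sample of points by checking that the Gram matrix $K_{ij} = K(x_i, x_j)$ has no negative eigenvalues. A rough sketch (it only tests the sampled points, not Mercer's condition in full; the data is random):

```python
import numpy as np

def gram_matrix(kernel, X):
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K = gram_matrix(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0), X)  # Gaussian kernel
eigvals = np.linalg.eigvalsh(K)      # Gram matrix is symmetric
print(eigvals.min() >= -1e-10)       # True: no (significantly) negative eigenvalues
```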
SVM examples
(Figure: example SVM decision boundaries. © Eric Xing @ CMU, 2006-2010)
Examples for Non-Linear SVMs – Gaussian Kernel
(Figure: non-linear decision boundaries obtained with the Gaussian kernel. © Eric Xing @ CMU, 2006-2010)
Nonlinear SVM: Optimization
Formulation (Lagrangian dual problem):
maximize $\sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
such that $0 \le \alpha_i \le C$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
The resulting discriminant function is
$g(x) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, x) + b$
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in
the original space.
Multi-class classification
• SVMs can only handle two-class outputs.
• One-vs-rest: learn N SVMs
  – SVM 1 learns Class1 vs REST
  – SVM 2 learns Class2 vs REST
  – :
  – SVM N learns ClassN vs REST
• Then, to predict the output for a new input, predict with each
  SVM and choose the class whose SVM puts the prediction
  furthest into the positive region (see the sketch below).
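A hedged sketch of the one-vs-rest scheme described above, built on scikit-learn's binary SVC (scikit-learn also provides this wrapping itself, e.g. OneVsRestClassifier; the version here is written out for clarity, and the helper names are ours):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, **svc_kwargs):
    # one binary SVM per class: class k vs the rest
    return {k: SVC(**svc_kwargs).fit(X, np.where(y == k, 1, -1))
            for k in np.unique(y)}

def predict_one_vs_rest(models, X):
    # pick the class whose SVM pushes the point furthest into the positive region
    classes = sorted(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```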
Thank You
Foundations of Machine Learning
Module 5: Support Vector Machine
Part F: SVM – Solution to the Dual
Problem
Sudeshna Sarkar
IIT Kharagpur
The SMO algorithm
The SMO algorithm can efficiently solve the dual problem.
First we discuss Coordinate Ascent.
Coordinate Ascent
• Consider solving the unconstrained optimization problem:
  $\max_\alpha W(\alpha_1, \alpha_2, \dots, \alpha_n)$
Loop until convergence: {
    for $i = 1$ to $n$ {
        $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \hat{\alpha}_i, \dots, \alpha_n)$
    }
}
Coordinate ascent
• (Figure: the ellipses are the contours of the function being maximized.)
• At each step, the optimization path moves parallel to one of the coordinate axes.
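A minimal sketch of unconstrained coordinate ascent on a toy concave quadratic W(α) (the inner arg-max is available in closed form here; in general any one-dimensional maximizer would do; the objective and names are our own):

```python
import numpy as np

# toy concave objective: W(a) = -0.5 a^T A a + b^T a, with A positive definite
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])

def coordinate_ascent(a, n_sweeps=20):
    for _ in range(n_sweeps):
        for i in range(len(a)):
            # holding the other coordinates fixed, the 1-D maximum satisfies dW/da_i = 0:
            # A[i, i] * a_i = b[i] - sum_{j != i} A[i, j] * a_j
            a[i] = (b[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]
    return a

print(coordinate_ascent(np.zeros(2)))   # converges to the maximizer
print(np.linalg.solve(A, b))            # closed-form maximizer, for comparison
```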
Sequential minimal optimization
• Constrained optimization:
  $\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
  s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$
  $\sum_{i=1}^{m} \alpha_i y_i = 0$
• Question: can we do coordinate ascent along one
  direction at a time (i.e., hold all $\alpha_{-i}$ fixed and
  update $\alpha_i$)?
The SMO algorithm
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
• Suppose we have a set of $\alpha_i$'s satisfying the constraints.
• If we hold $\alpha_2, \dots, \alpha_m$ fixed, then $\alpha_1$ is exactly determined
  by the other $\alpha$'s (from $\sum_i \alpha_i y_i = 0$), so it cannot be updated alone.
• We therefore have to update at least two of them
  simultaneously to keep satisfying the constraints.
The SMO algorithm
Repeat till convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next
     (using a heuristic that tries to pick the two that
     will allow us to make the biggest progress
     towards the global maximum).
  2. Re-optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while
     holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.
}
• The update to $\alpha_i$ and $\alpha_j$ can be computed very
  efficiently (in closed form).
Thank You