Foundations of Machine Learning
Module 5:
Part A: Logistic Regression
Sudeshna Sarkar
IIT Kharagpur
Logistic Regression for classification
• Linear Regression:
  $h(x) = \sum_{i=0}^{n} \beta_i x_i = \beta^T x$
• Logistic Regression for classification:
  $h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}} = g(\beta^T x)$
  where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or the
  sigmoid function.
Sigmoid function properties
• Bounded between 0 and 1
• 𝑔(𝑧) → 1 as 𝑧 → ∞
• 𝑔(𝑧) → 0 as 𝑧 → −∞
• Derivative:
  $g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}}
  = \frac{e^{-z}}{(1 + e^{-z})^2}
  = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right)
  = g(z)\,(1 - g(z))$
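As a quick numerical sanity check of the identity $g'(z) = g(z)(1 - g(z))$, here is a minimal NumPy sketch (the function names are ours, not from the lecture) comparing the analytic derivative with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # analytic derivative: g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central differences
print(np.max(np.abs(numeric - sigmoid_grad(z))))             # ~1e-10, as expected
```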
Logistic Regression
• In logistic regression, we learn the conditional distribution
P(y|x)
• Let $p_y(x; \beta)$ be our estimate of P(y|x), where $\beta$ is a vector of
adjustable parameters.
• Assume there are two classes, y = 0 and y = 1, and
  $P(y = 1 \mid x) = h_\beta(x)$
  $P(y = 0 \mid x) = 1 - h_\beta(x)$
• This can be written more compactly as
  $P(y \mid x) = h_\beta(x)^{y}\,(1 - h_\beta(x))^{1-y}$
• We can use a gradient method to fit $\beta$ by maximizing the likelihood.
Maximize likelihood
$L(\beta) = p(\vec{y} \mid X; \beta)
= \prod_{i=1}^{m} p(y_i \mid x_i; \beta)
= \prod_{i=1}^{m} h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}$
Taking logs gives the log-likelihood:
$\ell(\beta) = \log L(\beta)
= \sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))$
• How do we maximize the likelihood? Gradient ascent
  – Update: $\beta \leftarrow \beta + \alpha \nabla_\beta \ell(\beta)$
Assume one training example (x,y), and take derivatives to derive the
stochastic gradient ascent rule.
$\frac{\partial}{\partial \beta_j} \ell(\beta)
= \left(\frac{y}{g(\beta^T x)} - \frac{1 - y}{1 - g(\beta^T x)}\right)
  \frac{\partial}{\partial \beta_j} g(\beta^T x)$
$= \left(\frac{y}{g(\beta^T x)} - \frac{1 - y}{1 - g(\beta^T x)}\right)
  g(\beta^T x)\,(1 - g(\beta^T x))\,\frac{\partial}{\partial \beta_j} \beta^T x$
$= \big(y\,(1 - g(\beta^T x)) - (1 - y)\,g(\beta^T x)\big)\, x_j$
$= (y - h_\beta(x))\, x_j$
With $\beta \leftarrow \beta + \alpha \nabla_\beta \ell(\beta)$, this gives the stochastic gradient ascent rule
$\beta_j \leftarrow \beta_j + \alpha\,\big(y^{(i)} - h_\beta(x^{(i)})\big)\, x_j^{(i)}$
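The update rule above translates directly into code. Below is a minimal sketch (variable names are ours; a constant feature $x_0 = 1$ is prepended to absorb the intercept) of batch gradient ascent on the log-likelihood, applied to a made-up one-dimensional dataset.

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood l(beta).

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend x_0 = 1 for the intercept
    beta = np.zeros(n + 1)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-Xb @ beta))  # h_beta(x) = g(beta^T x)
        beta += alpha * Xb.T @ (y - h)        # beta_j += alpha * sum_i (y_i - h(x_i)) x_ij
    return beta

# toy usage: two well-separated 1-D clusters
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))   # decision boundary beta_0 + beta_1 x = 0 near x ~ 2.25
```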
Foundations of Machine Learning
Module 5:
Part B: Introduction to Support
Vector Machine
Sudeshna Sarkar
IIT Kharagpur
Support Vector Machines
• SVMs have a clever way to prevent overfitting
• They can use many features without requiring
too much computation.
Logistic Regression and Confidence
• Logistic Regression:
  $p(y = 1 \mid x) = h_\beta(x) = g(\beta^T x)$
• Predict 1 on an input x iff $h_\beta(x) \ge 0.5$, equivalently, $\beta^T x \ge 0$.
• The larger the value of $h_\beta(x)$, the larger the probability and the
  higher the confidence.
• Similarly, we make a confident prediction of $y = 0$ if $\beta^T x \ll 0$.
• Predictions are more confident for points (instances) located
  far from the decision surface.
Preventing overfitting with many features
• Suppose we have a big set of features.
• What is the best separating line
to use?
• Bayesian answer:
– Use all
– Weight each line by its posterior
probability
• Can we approximate the correct
answer efficiently?
Support Vectors
• Choose the line that maximizes the
  minimum margin.
• This maximum-margin separator is
  determined by a subset of the
  datapoints.
  – These points are called "support vectors".
  – We use the support vectors to
    decide which side of the
    separator a test case is on.
(In the figure, the support vectors are indicated by circles around them.)
Functional Margin
• Functional margin of a point $(x_i, y_i)$ w.r.t. $(w, b)$:
  a signed measure of how far the point lies on the correct side of the
  decision boundary $(w, b)$,
  $\gamma_i = y_i (w^T x_i + b)$
  – A larger functional margin means more confidence in a correct prediction.
  – Problem: $w$ and $b$ can be scaled to make this value arbitrarily large.
• Functional margin of a training set
  $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ w.r.t. $(w, b)$:
  $\gamma = \min_{1 \le i \le m} \gamma_i$
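A short sketch (hypothetical data, our own variable names) that computes the functional margin of each point and of the whole training set for a given $(w, b)$, and shows the scaling problem: multiplying $w$ and $b$ by 10 inflates the margin without changing the classifier.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -0.5

gamma_i = y * (X @ w + b)   # functional margin of each point
gamma = gamma_i.min()       # functional margin of the training set
print(gamma_i, gamma)       # [3.5 3.5 2.5 5.5], 2.5

# scaling (w, b) by 10 gives the same classifier but a 10x larger "margin"
print((y * (X @ (10 * w) + 10 * b)).min())   # 25.0
```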
Geometric Margin
• For a decision surface $(w, b)$, the vector orthogonal to it is given by $w$.
• The unit-length orthogonal vector is $\frac{w}{\|w\|}$.
• If $P = (a_1, a_2)$ is a point at geometric distance $\gamma$ from the surface
  and $Q = (b_1, b_2)$ is its projection onto the surface, then
  $P = Q + \gamma \frac{w}{\|w\|}$
Geometric Margin
Since $P = Q + \gamma \frac{w}{\|w\|}$, the projection is
$(b_1, b_2) = (a_1, a_2) - \gamma \frac{w}{\|w\|}$
$Q$ lies on the decision surface, so
$w^T\!\left((a_1, a_2) - \gamma \frac{w}{\|w\|}\right) + b = 0$
$\Rightarrow\; \gamma = \frac{w^T (a_1, a_2) + b}{\|w\|}
= \left(\frac{w}{\|w\|}\right)^{T} (a_1, a_2) + \frac{b}{\|w\|}$
Taking the label into account,
$\gamma = y \left(\left(\frac{w}{\|w\|}\right)^{T} (a_1, a_2) + \frac{b}{\|w\|}\right)$
The geometric margin equals the functional margin when $\|w\| = 1$.
Geometric margin of $(w, b)$ w.r.t. $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$:
the smallest of the geometric margins of the individual points.
Maximize margin width
• Assume linearly separable training examples.
• The classifier with the maximum margin width is robust to outliers and
  thus has strong generalization ability.
(Figure: a linear separator in the $(x_1, x_2)$ plane with the margin marked; the markers denote the +1 and −1 classes.)
Maximize Margin Width
• Maximize $\frac{\gamma}{\|w\|}$ subject to
  $y_i (w^T x_i + b) \ge \gamma$ for $i = 1, 2, \dots, m$
• Scale $w, b$ so that $\gamma = 1$.
• Maximizing $\frac{1}{\|w\|}$ is the same as minimizing $\|w\|^2$.
• Minimize $w \cdot w$ subject to the constraints,
  for all $(x_i, y_i)$, $i = 1, \dots, m$:
  $w^T x_i + b \ge 1$ if $y_i = 1$
  $w^T x_i + b \le -1$ if $y_i = -1$
Large Margin Linear Classifier
• Formulation:
  minimize $\frac{1}{2}\|w\|^2$
  such that $y_i (w^T x_i + b) \ge 1$, $i = 1, \dots, m$
(Figure: the maximum-margin separator in the $(x_1, x_2)$ plane, with points $x^+$ and $x^-$ on the margin boundaries; the markers denote the +1 and −1 classes.)
Solving the Optimization Problem
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
• An optimization problem with a convex quadratic objective and
  linear constraints.
• It can be solved using quadratic programming (QP).
• Lagrange duality gives the optimization problem's dual form, which
  – allows us to use kernels so that optimal margin classifiers work
    efficiently in very high dimensional spaces, and
  – allows us to derive an efficient algorithm for solving the above
    optimization problem that typically does much better than generic
    QP software.
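As an illustration of "can be solved using QP", here is a small sketch that hands the primal problem (minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \ge 1$) to a generic constrained solver, scipy.optimize.minimize with SLSQP. This assumes SciPy is available and uses a made-up separable dataset; it is only a toy, not the efficient dual/kernel route developed in the next part.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:-1]                       # v = [w_1, ..., w_n, b]
    return 0.5 * w @ w               # (1/2) ||w||^2

constraints = [{'type': 'ineq',      # y_i (w.x_i + b) - 1 >= 0
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method='SLSQP', constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b, y * (X @ w + b))         # all constraint values should be >= 1
```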
Foundations of Machine Learning
Module 5:
Part C: Support Vector Machine:
Dual
Sudeshna Sarkar
IIT Kharagpur
Solving the Optimization Problem (recap)
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
• An optimization problem with a convex quadratic objective and linear
  constraints; it can be solved using QP.
• Lagrange duality gives the problem's dual form, which
  – allows us to use kernels so that optimal margin classifiers work
    efficiently in very high dimensional spaces, and
  – allows us to derive an efficient algorithm that typically does much
    better than generic QP software.
Lagrangian Duality in brief
The primal problem:
$\min_w f(w)$
s.t. $g_i(w) \le 0,\; i = 1, \dots, k$
     $h_i(w) = 0,\; i = 1, \dots, l$
The generalized Lagrangian:
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
The $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrange multipliers.
Lemma:
$\max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) =
\begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$
A re-written primal:
$\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
Lagrangian Duality, cont.
The primal problem: $p^* = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
The dual problem: $d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
Theorem (weak duality):
$d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)
\;\le\; \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
Theorem (strong duality):
If and only if there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.
The KKT conditions
If there exists a saddle point of L, then it satisfies the
following "Karush-Kuhn-Tucker" (KKT) conditions:
$\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0,\; i = 1, \dots, n$
$\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0,\; i = 1, \dots, l$
$\alpha_i g_i(w) = 0,\; i = 1, \dots, k$
$g_i(w) \le 0,\; i = 1, \dots, k$
$\alpha_i \ge 0,\; i = 1, \dots, k$
Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they
also give a solution to the primal and the dual problems.
Support Vectors
• Only a few 𝛼𝑖 ’s can be nonzero
• Call the training data points whose 𝛼𝑖 ’s are
nonzero the support vectors
  By complementary slackness, $\alpha_i g_i(w) = 0,\; i = 1, \dots, k$:
  if $\alpha_i > 0$ then $g_i(w) = 0$ (the constraint is active).
Solving the Optimization Problem
Quadratic programming with linear constraints:
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
Lagrangian function:
minimize $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2
- \sum_{i=1}^{m} \alpha_i \big(y_i (w^T x_i + b) - 1\big)$
s.t. $\alpha_i \ge 0$
Solving the Optimization Problem
minimize $L_p(w, b, \alpha) = \frac{1}{2}\|w\|^2
- \sum_{i=1}^{m} \alpha_i \big(y_i (w^T x_i + b) - 1\big)$, s.t. $\alpha_i \ge 0$
Minimize w.r.t. $w$ and $b$ for fixed $\alpha$:
$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$
$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$
Substituting back,
$L_p(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)
- b \sum_{i=1}^{m} \alpha_i y_i$
$= \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
The Dual Problem
Now we have the following dual optimization problem:
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
s.t. $\alpha_i \ge 0,\; i = 1, \dots, m$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
This is a quadratic programming problem.
– A global maximum of the $\alpha_i$ can always be found.
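A minimal sketch of solving this dual with a generic solver (scipy.optimize.minimize with SLSQP, assuming SciPy is available, on a made-up separable dataset; variable names are ours). Dedicated SVM solvers are far more efficient, as the slides note.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T    # Q_ij = y_i y_j (x_i . x_j)

def neg_J(a):                                 # minimize -J(alpha)
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_J, x0=np.zeros(m), method='SLSQP',
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x
print(np.round(alpha, 4))    # only the support vectors get alpha_i > 0
```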
Support vector machines
• Once we have the Lagrange multipliers $\{\alpha_i\}$, we can
  reconstruct the parameter vector $w$ as a weighted
  combination of the training examples:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
• For testing with a new data point $z$:
  – Compute
    $w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b$
    and classify $z$ as class 1 if the sum is positive, and class 2
    otherwise.
Note: $w$ need not be formed explicitly.
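Given multipliers $\alpha$ from the dual (for instance from the sketch above), a short sketch of prediction that never forms $w$ explicitly; the bias is recovered from any support vector via $b = y_k - \sum_i \alpha_i y_i\, x_i \cdot x_k$ (hard-margin case; helper names are ours).

```python
import numpy as np

def svm_bias(X, y, alpha):
    # b = y_k - sum_i alpha_i y_i (x_i . x_k) for any support vector k (alpha_k > 0)
    k = int(np.argmax(alpha))
    return y[k] - np.sum(alpha * y * (X @ X[k]))

def svm_decision(z, X, y, alpha, b, tol=1e-8):
    # w^T z + b = sum_{i in SV} alpha_i y_i (x_i . z) + b, using only support vectors
    sv = alpha > tol
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b

# usage: classify z as class 1 if svm_decision(z, X, y, alpha, svm_bias(X, y, alpha)) > 0
```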
Solving the Optimization Problem
The discriminant function is:
$g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T x + b$
It relies on a dot product between the test point $x$ and the
support vectors $x_i$.
Solving the optimization problem involved computing the
dot products $x_i^T x_j$ between all pairs of training points.
The optimal $w$ is a linear combination of a small number
of data points.
Foundations of Machine Learning
Module 5: Support Vector Machine
Part D: SVM – Maximum Margin
with Noise
Sudeshna Sarkar
IIT Kharagpur
Linear SVM formulation
Find $w$ and $b$ such that
$\frac{2}{\|w\|}$ is maximized,
and for each of the $m$ training points $(x_i, y_i)$:
$y_i (w \cdot x_i + b) \ge 1$
Equivalently: find $w$ and $b$ such that
$\|w\|^2 = w \cdot w$ is minimized,
and for each of the $m$ training points $(x_i, y_i)$:
$y_i (w \cdot x_i + b) \ge 1$
Limitations of the previous SVM formulation
• What if the data is not linearly separable?
• Or there are noisy data points?
Extend the definition of maximum margin to allow
non-separating planes.
How to formulate?
• Minimize $\|w\|^2 = w \cdot w$ and the number of
misclassifications, i.e., minimize
  $w \cdot w + \#(\text{training errors})$
• This is no longer a QP formulation.
Objective to be minimized
• Minimize
  $w \cdot w + C \cdot (\text{distance of error points to their correct side of the margin})$
(Figure from [Link].)
Maximum Margin with Noise
Margin width: $M = \frac{2}{\sqrt{w \cdot w}}$
Minimize
$w \cdot w + C \sum_{k=1}^{m} \xi_k$
subject to the $m$ constraints
$w \cdot x_k + b \ge 1 - \xi_k$ if $y_k = 1$
$w \cdot x_k + b \le -1 + \xi_k$ if $y_k = -1$
equivalently,
$y_k (w \cdot x_k + b) \ge 1 - \xi_k$, k = 1, …, m
$\xi_k \ge 0$, k = 1, …, m
C controls the relative importance of maximizing the margin and fitting
the training data, and hence controls overfitting.
(Figure: points $x_1$, $x_2$, $x_3$ lie inside or beyond the margin, with slacks $\xi_k$.)
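In practice one rarely writes this QP by hand: a library such as scikit-learn exposes the soft-margin formulation directly through the C parameter. A small sketch (assuming scikit-learn is available; the data is made up) showing that a small C tolerates margin violations while a large C fits the training data more tightly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)),      # class +1 cluster
               rng.normal([-2, -2], 1.0, (30, 2))])   # class -1 cluster
y = np.hstack([np.ones(30), -np.ones(30)])

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"train accuracy={clf.score(X, y):.2f}")
```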
Lagrangian
$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{m} \xi_i
- \sum_{i=1}^{m} \alpha_i \big(y_i (x_i \cdot w + b) - 1 + \xi_i\big)
- \sum_{i=1}^{m} \beta_i \xi_i$
The $\alpha_i$'s and $\beta_i$'s are Lagrange multipliers ($\ge 0$).
Dual Formulation
Find $\alpha_1, \alpha_2, \dots, \alpha_m$ that
$\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
Linear SVM (separable case):
s.t. $\alpha_i \ge 0,\; i = 1, \dots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
Linear SVM with noise accounted for (soft margin):
s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$
Solution to Soft Margin Classification
• The $x_i$ with non-zero $\alpha_i$ will be the support vectors.
• Solution to the dual problem:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
  $b = y_k (1 - \xi_k) - \sum_{i=1}^{m} \alpha_i y_i\, x_i \cdot x_k$ for any $k$ s.t. $\alpha_k > 0$
For classification,
$f(x) = \sum_{i=1}^{m} \alpha_i y_i\, x_i \cdot x + b$
(no need to compute $w$ explicitly)
Thank You
Foundations of Machine Learning
Module 5: Support Vector Machine
Part E: Nonlinear SVM and Kernel
function
Sudeshna Sarkar
IIT Kharagpur
Non-linear decision surface
• We saw how to deal with datasets which are linearly
separable with noise.
• What if the decision boundary is truly non-linear?
• Idea: Map data to a high dimensional space where it
is linearly separable.
  – Won't using a bigger set of features make the computation slow?
  – The "kernel" trick makes the computation fast.
Non-linear SVMs: Feature Space
$\Phi: x \to \phi(x)$
This slide is from [Link]/~pift6080/documents/papers/svm_tutorial.ppt
Kernel
• The original input attributes are mapped to a new set of
  input features via a feature mapping Φ.
• Since the algorithm can be written in terms of the
  scalar product, we replace $x_a \cdot x_b$ with $\phi(x_a) \cdot \phi(x_b)$.
• For certain Φ's there is a simple operation on two
  vectors in the low-dimensional space that can be used to
  compute the scalar product of their two images in the
  high-dimensional space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
Let the kernel do the work rather than computing the scalar
product in the high-dimensional space.
Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function is now:
  $g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y_i\, \phi(x_i)^T \phi(x) + b$
• We only use the dot product of feature vectors, in both
  training and testing.
• A kernel function is defined as a function that
  corresponds to a dot product of two feature vectors in
  some expanded feature space:
  $K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
The kernel trick
$K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$
Often $K(x_a, x_b)$ is very inexpensive to compute even though
$\phi(x_a)$ may be extremely high dimensional.
Kernel Example
2-dimensional vectors $x = [x_1\; x_2]$; let $K(x_i, x_j) = (1 + x_i \cdot x_j)^2$.
We need to show that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$:
$K(x_i, x_j) = (1 + x_i \cdot x_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1\;\; x_{i1}^2\;\; \sqrt{2}\, x_{i1} x_{i2}\;\; x_{i2}^2\;\; \sqrt{2} x_{i1}\;\; \sqrt{2} x_{i2}]
\cdot [1\;\; x_{j1}^2\;\; \sqrt{2}\, x_{j1} x_{j2}\;\; x_{j2}^2\;\; \sqrt{2} x_{j1}\;\; \sqrt{2} x_{j2}]$
$= \phi(x_i) \cdot \phi(x_j)$,
where $\phi(x) = [1\;\; x_1^2\;\; \sqrt{2}\, x_1 x_2\;\; x_2^2\;\; \sqrt{2} x_1\;\; \sqrt{2} x_2]$
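The algebra above is easy to check numerically. A small sketch (helper names are ours) confirming that the polynomial kernel equals the dot product of the explicit feature vectors:

```python
import numpy as np

def phi(x):
    # explicit feature map for K(a, b) = (1 + a.b)^2 in 2 dimensions
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K = (1 + xi @ xj) ** 2          # kernel evaluated in the original 2-D space
print(K, phi(xi) @ phi(xj))     # both print 4.0
```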
Commonly-used kernel functions
• Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
• Polynomial of power p:
  $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
• Gaussian (radial-basis function):
  $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Sigmoid:
  $K(x_i, x_j) = \tanh(\beta_0\, x_i \cdot x_j + \beta_1)$
In general, functions that satisfy Mercer's condition can
be kernel functions.
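These kernels are one-liners in code; a sketch (parameter names are ours):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```

Note that the sigmoid "kernel" satisfies Mercer's condition only for some parameter settings, which leads into the next slide.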
Kernel Functions
• A kernel function can be thought of as a similarity measure
  between the input objects.
• Not every similarity measure can be used as a kernel function.
• Mercer's condition states that any positive semi-definite
  kernel K(x, y), i.e. one for which
  $\sum_{i,j} K(x_i, x_j)\, c_i c_j \ge 0$
  for all finite sets of points $\{x_i\}$ and real coefficients $\{c_i\}$,
  can be expressed as a dot product in a high-dimensional space.
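Positive semi-definiteness can be probed numerically on a sample of points by checking that the Gram matrix $K_{ij} = K(x_i, x_j)$ has no negative eigenvalues. A rough sketch (it only tests the sampled points, not Mercer's condition in full; the data is random):

```python
import numpy as np

def gram_matrix(kernel, X):
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K = gram_matrix(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0), X)  # Gaussian kernel
eigvals = np.linalg.eigvalsh(K)      # Gram matrix is symmetric
print(eigvals.min() >= -1e-10)       # True: no (significantly) negative eigenvalues
```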
SVM examples
(Figure: example SVM decision boundaries. © Eric Xing @ CMU, 2006-2010)
Examples for Non-Linear SVMs – Gaussian Kernel
(Figure: non-linear decision boundaries obtained with the Gaussian kernel. © Eric Xing @ CMU, 2006-2010)
Nonlinear SVM: Optimization
Formulation (Lagrangian dual problem):
maximize $\sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
such that $0 \le \alpha_i \le C$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
The resulting discriminant function is
$g(x) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, x) + b$
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• The kernel trick can also be used to do PCA in a much higher-
dimensional space, thus giving a non-linear version of PCA in
the original space.
Multi-class classification
• SVMs can only handle two-class outputs.
• One-vs-rest: learn N SVMs
  – SVM 1 learns Class1 vs REST
  – SVM 2 learns Class2 vs REST
  – :
  – SVM N learns ClassN vs REST
• Then, to predict the output for a new input, predict with each
  SVM and choose the class whose SVM puts the prediction
  furthest into the positive region (see the sketch below).
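A hedged sketch of the one-vs-rest scheme described above, built on scikit-learn's binary SVC (scikit-learn also provides this wrapping itself, e.g. OneVsRestClassifier; the version here is written out for clarity, and the helper names are ours):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, **svc_kwargs):
    # one binary SVM per class: class k vs the rest
    return {k: SVC(**svc_kwargs).fit(X, np.where(y == k, 1, -1))
            for k in np.unique(y)}

def predict_one_vs_rest(models, X):
    # pick the class whose SVM pushes the point furthest into the positive region
    classes = sorted(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```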
Thank You
Foundations of Machine Learning
Module 5: Support Vector Machine
Part F: SVM – Solution to the Dual
Problem
Sudeshna Sarkar
IIT Kharagpur
The SMO algorithm
The SMO algorithm can efficiently solve the dual problem.
First we discuss Coordinate Ascent.
Coordinate Ascent
• Consider solving the unconstrained optimization problem:
  $\max_\alpha W(\alpha_1, \alpha_2, \dots, \alpha_n)$
Loop until convergence: {
    for $i = 1$ to $n$ {
        $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \hat{\alpha}_i, \dots, \alpha_n)$
    }
}
Coordinate ascent
• (Figure: the ellipses are the contours of the function being maximized.)
• At each step, the optimization path moves parallel to one of the coordinate axes.
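A minimal sketch of unconstrained coordinate ascent on a toy concave quadratic W(α) (the inner arg-max is available in closed form here; in general any one-dimensional maximizer would do; the objective and names are our own):

```python
import numpy as np

# toy concave objective: W(a) = -0.5 a^T A a + b^T a, with A positive definite
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])

def coordinate_ascent(a, n_sweeps=20):
    for _ in range(n_sweeps):
        for i in range(len(a)):
            # holding the other coordinates fixed, the 1-D maximum satisfies dW/da_i = 0:
            # A[i, i] * a_i = b[i] - sum_{j != i} A[i, j] * a_j
            a[i] = (b[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]
    return a

print(coordinate_ascent(np.zeros(2)))   # converges to the maximizer
print(np.linalg.solve(A, b))            # closed-form maximizer, for comparison
```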
Sequential minimal optimization
• Constrained optimization:
  $\max_\alpha J(\alpha) = \sum_{i=1}^{m} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
  s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$
  $\sum_{i=1}^{m} \alpha_i y_i = 0$
• Question: can we do coordinate ascent along one
  direction at a time (i.e., hold all $\alpha_{-i}$ fixed and
  update $\alpha_i$)?
The SMO algorithm
$\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i
- \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $0 \le \alpha_i \le C,\; i = 1, \dots, m$
$\sum_{i=1}^{m} \alpha_i y_i = 0$
• Suppose we have a set of $\alpha_i$'s satisfying the constraints.
• If we hold $\alpha_2, \dots, \alpha_m$ fixed, then $\alpha_1$ is exactly determined
  by the other $\alpha$'s (from $\sum_i \alpha_i y_i = 0$), so it cannot be updated alone.
• We therefore have to update at least two of them
  simultaneously to keep satisfying the constraints.
The SMO algorithm
Repeat till convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next
     (using a heuristic that tries to pick the two that
     will allow us to make the biggest progress
     towards the global maximum).
  2. Re-optimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while
     holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.
}
• The update to $\alpha_i$ and $\alpha_j$ can be computed very
  efficiently (in closed form).
Thank You