Unit 2 : Classification
What is Classification?
• Classification is the process of categorizing a given
set of data into classes. It can be performed on
both structured and unstructured data.
• The process starts with predicting the class of
given data points. The classes are often referred
to as targets, labels, or categories.
• Classification predictive modeling is the task
of approximating a mapping function from
input variables to discrete output variables.
• The main goal is to identify which class/category
the new data will fall into.
Example:
• Heart disease detection can be framed as a
classification problem. It is a binary
classification, since there are only two classes:
has heart disease or does not have heart
disease.
• The classifier needs training data to
understand how the given input variables are
related to the class. Once the classifier is
trained accurately, it can be used to detect
whether heart disease is present for a
particular patient.
• Since classification is a type of supervised learning,
the targets are provided along with the input data.
Basic Terminologies used
• Classifier – It is an algorithm that is used to
map the input data to a specific category.
• Classification Model – The model draws a
conclusion from the input data given for
training; it predicts the class or category
for new data.
• Feature – A feature is an individual
measurable property of the phenomenon
being observed.
• Label – The output variable (the class to be predicted).
Example of Linear Classification
• Red points: patterns
belonging to class C1.
• Blue points: patterns
belonging to class C2.
• Goal: find a linear decision
boundary separating C1
from C2.
• Points on one side of the line will be classified as belonging to
C1, points on the other side will be classified as C2.
• The red line is one example of such a decision boundary;
it misclassifies a few patterns.
• The green line is another example.
Linear Classification
• Mathematically, assume
input patterns are
D-dimensional vectors x ∈ R^D.
• We are looking for a decision
boundary in the form of a
(D-1)-dimensional hyperplane
separating the two classes.
• Points on one side of the
hyperplane will be classified
as belonging to C1, points on the
other side will be classified as C2.
• If inputs are 2-dimensional vectors, the decision boundary
is a line.
• If inputs are 3-dimensional vectors, the decision boundary
is a 2-dimensional surface.
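As a sketch of the decision rule described above, the following classifies 2-dimensional points by which side of a hyperplane they fall on. The weight vector `w` and offset `b` are invented for illustration, not values from the text.

```python
# Hedged sketch of a linear decision boundary in 2-D.
# The weights w and offset b are assumptions chosen for illustration.

def linear_classify(x, w, b):
    """Classify x as C1 or C2 depending on which side of the
    hyperplane w . x - b = 0 it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return "C1" if score >= 0 else "C2"

w = [1.0, 1.0]   # assumed normal vector of the boundary
b = 3.0          # assumed offset

print(linear_classify([4.0, 2.0], w, b))  # score 4 + 2 - 3 = 3    -> C1
print(linear_classify([0.5, 1.0], w, b))  # score 0.5 + 1 - 3 = -1.5 -> C2
```

In 2-D this boundary is the line x1 + x2 = 3; with D-dimensional inputs the same code describes a (D-1)-dimensional hyperplane.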
Types of Learners
• Lazy Learners –
– Lazy learners simply store the training
data and wait until test data
appears.
– Classification is then done using the
most closely related data in the stored
training data.
– They take more time to predict compared
to eager learners. E.g. k-nearest neighbor,
case-based reasoning.
Types of Learners
• Eager Learners –
– Eager learners construct a
classification model based on the
given training data before getting
data for predictions.
– They must be able to commit to a single
hypothesis that covers the entire instance
space.
– Due to this, they take a lot of time in
training and less time for prediction. E.g.
Decision Tree, Naive Bayes, Artificial Neural
Networks.
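The contrast can be sketched in a few lines: a 1-nearest-neighbour "model" (a lazy learner) does no real work at training time and defers everything to prediction time. The toy points and labels below are invented for illustration.

```python
# Lazy learning in miniature: "training" only memorises the examples,
# and all the work happens when a prediction is requested.
# The toy data and labels are invented for illustration.

def nn_train(X, y):
    return list(zip(X, y))          # lazy: just store the training data

def nn_predict(model, x):
    # deferred work: scan the stored data for the closest point
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda pair: dist2(pair[0], x))
    return label

model = nn_train([(0, 0), (1, 0), (5, 5)],
                 ["no_disease", "no_disease", "disease"])
print(nn_predict(model, (4, 4)))    # closest stored point is (5, 5)
```

An eager learner such as a decision tree would instead spend its time up front building a hypothesis, making each later prediction cheap.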
Types of Classification
• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification
Types of Classification Algorithms
• Linear Models
– Logistic Regression
– Support Vector Machines
• Nonlinear models
– K-nearest Neighbors (KNN)
– Kernel Support Vector Machines
(SVM)
– Naïve Bayes
– Decision Tree Classification
– Random Forest Classification
Binary Classification
• Binary classification refers to those
classification tasks that have two class
labels.
• Examples include:
– Email spam detection (spam or not)
– Churn prediction (churn or not).
– Conversion prediction (buy or not).
• Typically, binary classification tasks involve
one class that is the normal state and
another class that is the abnormal state.
Binary Classification – Example
• For example “not spam” is the normal state and
“spam” is the abnormal state. Another example is
“cancer not detected” is the normal state of a task
that involves a medical test and “cancer detected”
is the abnormal state.
• The class for the normal state is assigned the
class label 0 and the class with the abnormal
state is assigned the class label 1.
• It is common to model a binary classification
task with a model that predicts a Bernoulli
probability distribution for each example.
Binary Classification – Algorithms
• Popular algorithms that can be used for
binary classification include:
– Logistic Regression
– k-Nearest Neighbors
– Decision Trees
– Support Vector Machine
– Naive Bayes
Evaluation of Binary Classifier
• There are many metrics that can be used to measure
the performance of a classifier or predictor; different
fields have different preferences for specific metrics
due to different goals.
• In medicine, sensitivity and specificity are often
used, while in information retrieval, precision and
recall are preferred.
• An important distinction is between metrics that are
independent of how often each category occurs in the
population (the prevalence), and metrics that depend
on the prevalence – both types are useful, but they
have very different properties.
Evaluation of Binary Classifier
• Given a classification of a specific data set, there
are four basic combinations of actual data
category and assigned category: true positives TP
(correct positive assignments), true negatives TN
(correct negative assignments), false positives FP
(incorrect positive assignments), and false
negatives FN (incorrect negative assignments).
Confusion Matrix
• In the field of machine learning, and specifically the
problem of statistical classification, a confusion matrix,
also known as an error matrix, is a specific table layout
that allows visualization of the performance of an
algorithm, typically a supervised learning one (in
unsupervised learning it is usually called a matching
matrix).
• Each row of the matrix represents the instances in
a predicted class, while each column represents
the instances in an actual class (or vice versa).
• The name stems from the fact that it makes it easy to see
whether the system is confusing two classes (i.e.
commonly mislabeling one as another).
True Positive (TP)
•The predicted value matches the actual value
•The actual value was positive and the model predicted a positive value
True Negative (TN)
•The predicted value matches the actual value
•The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
•The predicted value was falsely predicted
•The actual value was negative but the model predicted a positive value
•Also known as the Type 1 error
False Negative (FN) – Type 2 error
•The predicted value was falsely predicted
•The actual value was positive but the model predicted a negative value
•Also known as the Type 2 error
Confusion Matrix
• Given a sample of 13 pictures, 8 of cats and 5 of dogs,
where cats belong to class 1 and dogs belong to class 0:
– actual = [1,1,1,1,1,1,1,1,0,0,0,0,0]
• Assume that a classifier that distinguishes between cats and
dogs is trained, and we take the 13 pictures and run them
through the classifier. The classifier makes 8 accurate
predictions and misses 5: 3 cats wrongly predicted as dogs (first
3 predictions) and 2 dogs wrongly predicted as cats (last 2
predictions).
– prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]
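These counts can be verified with a few lines of plain Python, using the actual and prediction lists from the slide above:

```python
# Counting TP/TN/FP/FN by hand for the cat/dog example
# (cats = class 1 = positive, dogs = class 0 = negative).

actual     = [1,1,1,1,1,1,1,1,0,0,0,0,0]
prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]

tp = sum(1 for a, p in zip(actual, prediction) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, prediction) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, prediction) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, prediction) if a == 1 and p == 0)

print(tp, tn, fp, fn)   # 5 cats right, 3 dogs right, 2 FP, 3 FN
```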
Confusion Matrix
F1 Score / Harmonic Mean
F1 = 2 · (precision · recall) / (precision + recall), i.e. the
harmonic mean of precision and recall.
Sklearn has two
functions: confusion_matrix() and classification_report().
• Sklearn confusion_matrix() returns the values of the confusion
matrix.
• Sklearn classification_report() outputs precision, recall and f1-
score for each target class. In addition, it also reports some
extra values: micro avg, macro avg, and weighted avg.
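As a quick illustration (assuming scikit-learn is installed), the two helpers can be applied to the cat/dog labels from the earlier example:

```python
# Illustrative use of the two sklearn helpers named above, on the
# cat/dog labels from the earlier confusion-matrix example.
from sklearn.metrics import confusion_matrix, classification_report

actual     = [1,1,1,1,1,1,1,1,0,0,0,0,0]
prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]

# sklearn's convention: rows = actual class, columns = predicted class
print(confusion_matrix(actual, prediction))

# per-class precision, recall, f1, plus macro/weighted averages
print(classification_report(actual, prediction, target_names=["dog", "cat"]))
```

For these labels the matrix is [[3, 2], [3, 5]]: 3 dogs correct, 2 dogs predicted as cats, 3 cats predicted as dogs, 5 cats correct.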
Confusion Matrix for Multi-Class Classification
The true positive, true negative, false positive and false negative for
each class are calculated by adding the cell values as follows.
How to calculate FN, FP, TN, TP for multi-class:
Class Setosa
TP: The actual value and predicted value should be the same. So for the Setosa
class, the value of cell 1 is the TP value.
FN: The sum of the values of the corresponding row except the TP value.
FN = (cell 2 + cell 3)
= (0 + 0)
= 0
FP: The sum of the values of the corresponding column except the TP value.
FP = (cell 4 + cell 7)
= (0 + 0)
= 0
TN: The sum of the values of all columns and rows except those of the class
we are calculating for.
TN = (cell 5 + cell 6 + cell 8 + cell 9)
= 17 + 1 + 0 + 11
= 29
Similarly, for the Versicolor class the metrics are calculated as below:
TP: 17 (cell 5)
FN: 0 + 1 = 1 (cell 4 + cell 6)
FP: 0 + 0 = 0 (cell 2 + cell 8)
TN: 16 + 0 + 0 + 11 = 27 (cell 1 + cell 3 + cell 7 + cell 9)
You can try the same calculation for the Virginica class.
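The per-class bookkeeping above generalizes to any multi-class confusion matrix; the 3×3 matrix below reproduces the Iris-style cell values from the slide (rows = actual, columns = predicted):

```python
# Per-class TP/FN/FP/TN derived from a multi-class confusion matrix.
# Cell values follow the slide's Iris example: rows = actual class,
# columns = predicted class (Setosa, Versicolor, Virginica).

cm = [[16,  0, 0],    # Setosa
      [ 0, 17, 1],    # Versicolor
      [ 0,  0, 11]]   # Virginica

def per_class_counts(cm, k):
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                     # rest of the row
    fp = sum(row[k] for row in cm) - tp      # rest of the column
    tn = sum(map(sum, cm)) - tp - fn - fp    # everything else
    return tp, fn, fp, tn

print(per_class_counts(cm, 0))   # Setosa:     (16, 0, 0, 29)
print(per_class_counts(cm, 1))   # Versicolor: (17, 1, 0, 27)
print(per_class_counts(cm, 2))   # Virginica:  (11, 0, 1, 33)
```

The Setosa and Versicolor results match the hand calculations on the slide (TN = 29 and TN = 27 respectively).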
ROC (Receiver Operating Characteristic)
• TruePositiveRate = TruePositive / (TruePositive + FalseNegative)
• FalsePositiveRate = FalsePositive / (FalsePositive + TrueNegative)
The ROC curve shows the trade-off between sensitivity (or TPR) and
specificity (1 – FPR).
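Both rates follow directly from the four counts; here they are computed for the cat/dog numbers worked out earlier (TP = 5, FN = 3, FP = 2, TN = 3):

```python
# TPR and FPR from the four confusion-matrix counts, using the
# cat/dog numbers from the earlier example.

tp, fn, fp, tn = 5, 3, 2, 3

tpr = tp / (tp + fn)    # sensitivity / recall
fpr = fp / (fp + tn)    # 1 - specificity

print(tpr, fpr)   # 0.625 0.4
```

Sweeping the classifier's decision threshold and plotting (FPR, TPR) at each setting traces out the ROC curve.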
Multi Class Classification
• Multi-class classification refers to those
classification tasks that have more than two class
labels.
• Examples include:
– Face classification.
– Plant species classification.
– Optical character recognition.
• Unlike binary classification, multi-class
classification does not have the notion of normal
and abnormal outcomes. Instead, examples are
classified as belonging to one among a range of
known classes.
Multi Class Classification
• The number of class labels may be very large on some
problems. For example, a model may predict a photo as
belonging to one among thousands or tens of
thousands of faces in a face recognition system.
• Problems that involve predicting a sequence of words,
such as text translation models, may also be considered
a special type of multi-class classification.
• Each word in the sequence of words to be predicted
involves a multi-class classification where the size of
the vocabulary defines the number of possible classes
that may be predicted and could be tens or hundreds
of thousands of words in size.
Multi Class Classification –
Algorithms
• Many algorithms used for binary
classification can be used for multi-class
classification.
• Popular algorithms that can be used for
multi-class classification include:
– k-Nearest Neighbors.
– Decision Trees.
– Naive Bayes.
– Random Forest.
– Gradient Boosting.
Multi Class Classification
• This involves using a strategy of fitting multiple binary
classification models for each class vs. all other classes
(called one-vs-rest) or one model for each pair of
classes (called one-vs-one).
– One-vs-Rest: Fit one binary classification model for
each class vs. all other classes.
– One-vs-One: Fit one binary classification model for
each pair of classes.
• Binary classification algorithms that can use
these strategies for multi-class classification
include:
– Logistic Regression.
– Support Vector Machine.
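The two strategies differ in how many binary models they fit; a small arithmetic sketch of the counts for n classes:

```python
# Number of binary models each decomposition strategy needs
# for an n-class problem.

def ovr_models(n):
    return n                    # one-vs-rest: one model per class

def ovo_models(n):
    return n * (n - 1) // 2     # one-vs-one: one model per pair of classes

print(ovr_models(10), ovo_models(10))   # 10 45
```

One-vs-one grows quadratically with the number of classes, but each of its models is trained on only the two classes involved, so the individual training sets are much smaller.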
Multi-Label Classification
• Multi-label classification refers to those classification
tasks that have two or more class labels, where one
or more class labels may be predicted for each
example.
• Consider the example of photo classification, where
a given photo may have multiple objects in the
scene and a model may predict the presence of
multiple known objects in the photo, such as
“bicycle,” “apple,” “person,” etc.
• This is unlike binary classification and multi-class
classification, where a single class label is
predicted for each example.
Imbalanced Classification
• Imbalanced classification refers to classification
tasks where the number of examples in each class
is unequally distributed.
• Typically, imbalanced classification tasks are binary
classification tasks where the majority of examples
in the training dataset belong to the normal class
and a minority of examples belong to the abnormal
class.
• Examples include:
– Fraud detection.
– Outlier detection.
– Medical diagnostic tests.
Imbalanced Classification
• These problems are modeled as binary
classification tasks, although they may
require specialized techniques.
• Specialized techniques may be used to
change the composition of samples in the
training dataset by undersampling the
majority class or oversampling the minority
class.
• Examples include:
– Random Undersampling.
– SMOTE Oversampling.
Imbalanced Classification
• Specialized modeling algorithms may be used
that pay more attention to the minority class
when fitting the model on the training
dataset, such as cost-sensitive machine
learning algorithms.
• Examples include:
– Cost-sensitive Logistic Regression.
– Cost-sensitive Decision Trees.
– Cost-sensitive Support Vector Machines.
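One common cost-sensitive recipe is to weight each class inversely to its frequency; the "balanced" heuristic used by several libraries is weight = n_samples / (n_classes × class_count). A hand computation on an invented imbalanced label vector:

```python
# "Balanced" class weights computed by hand for an imbalanced
# label vector (90 normal vs 10 abnormal examples, invented data).
from collections import Counter

y = [0] * 90 + [1] * 10

counts = Counter(y)
n, k = len(y), len(counts)

# weight = n_samples / (n_classes * count_of_class)
weights = {cls: n / (k * c) for cls, c in counts.items()}
print(weights)   # the minority class gets the larger weight
```

Passing such weights to a cost-sensitive learner makes each minority-class mistake cost proportionally more during training.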
Introduction to Support
Vector Machines
History of SVM
• SVM is related to statistical learning theory
• SVM was first introduced in 1992
• SVM becomes popular because of its success in handwritten digit
recognition
• 1.1% test error rate for SVM. This is the same as the error rates of a carefully
constructed neural network, LeNet 4.
• See Section 5.11 in [2] or the discussion in [3] for details
• SVM is now regarded as an important example of “kernel methods”,
one of the key areas in machine learning
• Note: the meaning of “kernel” is different from the “kernel” function for
Parzen windows
07/30/2025 39
Linear Classifiers
x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x − b)
• w: weight vector
• x: data vector
[Figure: a 2-D dataset with one marker denoting +1 and another denoting −1,
and several candidate separating lines drawn through it]
How would you classify this data?
Any of these lines would be fine...
...but which is best?
Classifier Margin
f(x, w, b) = sign(w · x − b)
Define the margin of a linear classifier as the width
that the boundary could be increased by before
hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear
classifier with the, um, maximum margin.
This is the simplest kind of SVM (called a Linear
SVM, LSVM).
Maximum Margin
f(x, w, b) = sign(w · x − b)
Support vectors are those datapoints that the
margin pushes up against.
The maximum margin linear classifier is the linear
classifier with the maximum margin; this is the
simplest kind of SVM (called a Linear SVM, LSVM).
Why Maximum Margin?
f(x, w, b) = sign(w · x − b)
• The maximum margin linear classifier is the linear classifier
with the maximum margin; the support vectors are the
datapoints that the margin pushes up against.
• Intuitively, the widest margin leaves the most room for error:
small perturbations of the boundary or of the datapoints are
least likely to cause misclassification.
How to calculate the distance from a point to a line?
[Figure: a point x and the line w · x + b = 0, with normal vector w]
• x – the data vector
• w – the normal vector of the line
• b – the scale (offset) value
In our case, w1*x1 + w2*x2 + b = 0,
thus w = (w1, w2), x = (x1, x2)
Estimate the Margin
[Figure: a point x and the hyperplane w · x + b = 0, with normal vector w]
• What is the distance expression for a point x to the line wx + b = 0?

    d(x) = |x · w + b| / ||w||₂ = |x · w + b| / sqrt( Σ_{i=1}^{d} w_i² )
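The distance formula can be checked numerically; the point and hyperplane below are made up for illustration:

```python
# Distance from a point x to the hyperplane w . x + b = 0,
# following d(x) = |x . w + b| / ||w||_2.
# The point, weights, and offset are invented for illustration.
import math

def distance(x, w, b):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# ||(0.6, 0.8)||_2 = 1, so the distance is just |0.6*3 + 0.8*4 - 1| = 4
print(distance([3.0, 4.0], [0.6, 0.8], -1.0))   # 4.0
```

For a maximum-margin classifier, this is the quantity being maximized over the support vectors on either side of the boundary.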
Soft-Margin Classification
Transforming the Data
[Figure: points mapped by f(·) from the input space to the feature space]
Input space → Feature space
Note: the feature space is of higher dimension
than the input space in practice.
• Computation in the feature space can be costly because it is high
dimensional
• The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue
Non-linear SVMs
Datasets that are linearly separable with noise work
out great:
[Figure: 1-D points along the x-axis, separable by a single threshold]
But what are we going to do if the dataset is just
too hard?
[Figure: 1-D points along the x-axis that no single threshold can separate]
Kernel Trick!!!
SVM = Linear SVM + Kernel Trick
This slide is courtesy of [Link]/~pift6080/documents/papers/svm_tutorial.ppt
Kernel Trick Motivation
• Linear classifiers are well understood, widely used,
and efficient.
• How can we use linear classifiers to build non-linear
ones?
• Neural networks: construct non-linear classifiers by
using a network of linear classifiers (perceptrons).
• Kernels:
o Map the problem from the input space to a new higher-dimensional
space (called the feature space) by doing a non-linear
transformation using a special function called the kernel.
o Then use a linear model in this new high-dimensional feature space.
The linear model in the feature space corresponds to a non-linear
model in the input space.
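The kernel idea in miniature: for the degree-2 polynomial kernel K(x, z) = (x · z)² on 2-D inputs, the explicit feature map is φ(x) = (x1², √2·x1·x2, x2²), and the inner product in feature space equals the kernel evaluated in the original input space. The test points below are invented.

```python
# The kernel trick in miniature: computing K(x, z) directly in input
# space gives the same number as an inner product after the explicit
# degree-2 feature map. Test points are invented for illustration.
import math

def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)

lhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # feature-space dot product
print(lhs, kernel(x, z))   # both equal 16.0
```

The kernel evaluates a 3-dimensional inner product without ever constructing the mapped vectors; for richer kernels (e.g. RBF) the implicit feature space is infinite-dimensional, which is exactly why the trick matters.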
Non-linear SVMs: Feature Space
General idea: the original input space can be
mapped to some higher-dimensional feature
space where the training set is separable:
Φ: x → φ(x)
SVM Limitations
• Uses a binary (yes/no) decision rule
• Generates a distance from the hyperplane, but this distance is often not a good
measure of our “confidence” in the classification
• Can produce a “probability” as a function of the distance (e.g. using sigmoid fits),
but these are often inadequate
• The number of support vectors grows linearly with the size of the data set
• Requires the estimation of the trade-off parameter, C, via held-out sets
[Figure: training-set error and open-loop (test) error plotted against model
complexity, with the optimum at intermediate complexity]
Logistic Regression
Logistic Regression is a machine learning algorithm
used for classification problems. It is a
predictive analysis algorithm based on the concept
of probability.
Logistic Regression and cost function
• The cost function represents the optimization objective:
we create a cost function and minimize the error.
• The cost function of linear regression:
    J(θ) = (1/2m) Σ_{i=1}^{m} ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²
• The cost function of logistic regression (per example):
    Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
    Cost(h_θ(x), y) = −log(1 − h_θ(x))    if y = 0
  where h_θ(x) = 1 / (1 + e^(−θᵀx)); its graph is the S-shaped sigmoid
  curve, which squashes any input into the range (0, 1).
• The above two cases can be compressed into a single function:
    J(θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
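The compressed cost can be evaluated by hand on a tiny invented example of three predicted probabilities h and their true labels y:

```python
# Evaluating the compressed logistic-regression cost
# J = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
# on a tiny invented example.
import math

def logistic_cost(h, y):
    m = len(y)
    return -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                for hi, yi in zip(h, y)) / m

h = [0.9, 0.2, 0.8]   # predicted probabilities (invented)
y = [1,   0,   1]     # true labels

print(round(logistic_cost(h, y), 4))
```

Note how each example contributes only one of the two terms, since either y or (1 − y) is zero; confident wrong predictions (h near 0 when y = 1) are punished with a very large cost.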