Unit 4 : Linear Discriminants for Machine Learning
Introduction to Linear Discriminants:
Linear discriminants (LDs) are a statistical method used in machine learning to reduce the
dimensionality of data and achieve optimal discrimination among different classes. The
method involves finding a linear combination of features that can effectively separate two or
more classes of objects.
LDs are widely used in various applications like pattern recognition and image retrieval.
The method is based on discriminant functions that are estimated from a set of data called the
training set. These discriminant functions are linear with respect to the characteristic vector,
and usually have the form
f(x) = wTx + b0,
where w represents the weight vector, x the characteristic vector, and b0 a threshold.
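As a tiny illustration, the discriminant above can be evaluated directly; the weight vector, threshold, and feature values below are hypothetical choices, not values from the text:

```python
import numpy as np

# Hypothetical weight vector, threshold, and feature vector for illustration.
w = np.array([2.0, -1.0])   # weight vector w
b0 = 0.5                    # threshold b0
x = np.array([1.0, 3.0])    # characteristic (feature) vector x

# Evaluate the linear discriminant f(x) = w^T x + b0.
f = w @ x + b0              # 2*1 + (-1)*3 + 0.5 = -0.5

# A common decision rule: class 1 if f(x) >= 0, else class 2.
label = 1 if f >= 0 else 2
print(f, label)             # -0.5 -> class 2
```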
Linear Discriminants for Classification:
Linear discriminants are linear functions used in classification to separate data points
belonging to different classes. They aim to find a linear combination of features that best
separates these classes, often used in Linear Discriminant Analysis (LDA).
LDA is a supervised learning technique that also serves as a dimensionality reduction
method.
Why they are used:
Linear discriminants are linear functions that act as decision boundaries in
classification.
They aim to find a line (in 2D) or a hyperplane (in higher dimensions) that best
separates different classes.
They are used in algorithms like Linear Discriminant Analysis (LDA).
Let’s suppose we have d-dimensional data points x1, …, xn belonging to two
classes Ci, i = 1, 2, having N1 and N2 samples respectively.
Consider W as a unit vector onto which we will project the data points. Since we
are only concerned with the direction, we choose a unit vector for this purpose.
Number of samples : N = N1 + N2
If x(n) are the samples in the feature space, then WTx(n) denotes the data points
after projection.
Means of classes before projection: mi
Means of classes after projection: Mi = WTmi
Figure: a data point X before and after projection.
Scatter matrix: used to make estimates of the covariance matrix. It is an m × m positive
semi-definite matrix.
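The projection direction that best separates the two classes (Fisher's criterion) can be sketched as follows; the toy data points are hypothetical, and w is obtained as the solution of S_W w = (m1 − m2), normalized to a unit vector since only the direction matters:

```python
import numpy as np

# Toy 2-D data for two classes (hypothetical values for illustration).
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 1, N1 = 3
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class 2, N2 = 3

# Class means before projection (m_i).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W = S_1 + S_2 (sum of per-class scatter).
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2

# Fisher's criterion is maximized by w proportional to SW^{-1} (m1 - m2).
w = np.linalg.solve(SW, m1 - m2)
w = w / np.linalg.norm(w)          # unit vector: only the direction matters

# Means after projection: M_i = w^T m_i; they are well separated.
M1, M2 = w @ m1, w @ m2
print(M1, M2)
```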
Perceptron Classifier:
One of the earliest and most basic machine learning methods used for binary classification is
the perceptron. Frank Rosenblatt created it in the late 1950s, and it is a key component of
more intricate neural network topologies.
A simple binary linear classifier called a perceptron generates predictions based on
the weighted average of the input data. Based on whether the weighted total exceeds a
predetermined threshold, a threshold function determines whether to output a 0 or a 1.
Figure: a single-layer perceptron.
Components of a Perceptron:
1. Input Features (x): the characteristics or attributes of the input data on which
predictions are based. Each feature is represented by a numeric value. In binary
classification, the two classes are commonly represented by the numbers 0 (negative
class) and 1 (positive class).
2. Input Weights (w): each input feature has a weight (w), which establishes its
significance when formulating predictions.
The weights are also numeric values and are initialized either to zeros or to
small random values.
3. Weighted Sum (∑f(x)): the dot product of the input features (x) and their
associated weights (w).
Mathematically, it is written as ∑f(x) = w1x1 + w2x2 + ... + wnxn.
4. Activation Function (Step Function): the activation function, commonly a step
function, is applied to the weighted sum ∑f(x). The step function decides the
perceptron's output based on whether the weighted sum exceeds a predetermined
threshold.
The output is 1 (positive class) if ∑f(x) is greater than or equal to the
threshold, and 0 (negative class) otherwise.
Working of the Perceptron:
1. Initialization: the weights (w) are initialized, frequently with small random
values or zeros.
2. Prediction: to make a prediction for a particular input, the perceptron calculates
the weighted sum ∑f(x) of the input features and weights.
3. Activation Function: after the weighted sum ∑f(x) is computed, the step
activation function is applied. The perceptron outputs 1 (positive class)
if ∑f(x) is greater than or equal to a specific threshold; otherwise, it outputs 0
(negative class).
4. Updating Weights: if the perceptron makes a misclassification (an inaccurate
prediction), the weights are updated to reduce prediction error in the future.
The update rule shifts the weights in a direction that lowers the error. The most
widely used rule is the perceptron learning rule, which is based on the
discrepancy between the target and predicted class labels.
5. Repeat: steps 2 through 4 are repeated for each input data point in the training
dataset. This process continues until the model converges and correctly
classifies the training data, which may take a number of iterations.
Perceptron Learning Algorithm:
Steps:
1. Initialize weights wi and bias b (usually with zeros or small random numbers)
2. For each training sample (x,t):
o Compute the predicted output: y = step(w⋅x + b)
o Update the weights and bias:
wi = wi + η(t − y)xi
b = b + η(t − y)
o Where:
t = target output
y = predicted output
η = learning rate (typically small, e.g., 0.1)
3. Repeat until convergence or a maximum number of iterations is reached.
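The steps above can be sketched as a minimal NumPy perceptron; the AND function used as training data is a hypothetical (but linearly separable) toy task:

```python
import numpy as np

# Minimal perceptron trained on the AND function (a linearly separable toy task).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])        # target labels

w = np.zeros(2)                   # Step 1: weights initialized to zeros
b = 0.0                           # bias
eta = 0.1                         # learning rate

def step(z):
    """Step activation: 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if z >= 0 else 0

# Step 3: repeat the update rule until convergence (or an iteration cap).
for epoch in range(100):
    errors = 0
    for xi, ti in zip(X, t):
        y = step(w @ xi + b)      # Step 2: predicted output
        w += eta * (ti - y) * xi  # wi = wi + eta * (t - y) * xi
        b += eta * (ti - y)      # b  = b  + eta * (t - y)
        errors += int(y != ti)
    if errors == 0:               # converged: every sample classified correctly
        break

preds = [step(w @ xi + b) for xi in X]
print(preds)                      # matches the AND targets: [0, 0, 0, 1]
```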
Support Vector Machines:
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. SVM is particularly well-suited for classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to separate data points
into different classes. The algorithm maximizes the margin between the closest points of
different classes.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates two
classes by maximizing the margin between them. This margin is the distance from the
hyperplane to the nearest data points (support vectors) on each side.
Figure: multiple hyperplanes separating the data from two classes.
The best hyperplane, also known as the “hard margin” hyperplane, is the one that
maximizes the distance between the hyperplane and the nearest data points from both
classes. This ensures a clear separation between the classes. So, in the figure above, L2 is
chosen as the hard-margin hyperplane.
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We have a
training dataset consisting of input feature vectors X and their corresponding class labels Y.
The equation for the linear hyperplane can be written as: wTx+b=0
Where:
w is the normal vector to the hyperplane (the direction perpendicular to it).
b is the offset or bias term, representing the distance of the hyperplane from the origin
along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point x_i and the decision boundary can be calculated as:
d_i = (wTx_i + b) / ||w||
where ||w|| represents the Euclidean norm of the normal vector w.
Linear SVM Classifier
Prediction (decision rule):
ŷ = 1 if wTx + b ≥ 0
ŷ = 0 if wTx + b < 0
Where ŷ is the predicted label of a data point.
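A minimal sketch of the distance formula and decision rule above; the values of w and b are hypothetical:

```python
import numpy as np

# Hypothetical hyperplane parameters (w, b) for illustration.
w = np.array([3.0, 4.0])
b = -12.0

def distance(x):
    """Distance from point x to the hyperplane w^T x + b = 0: |w^T x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

def predict(x):
    """Decision rule: class 1 if w^T x + b >= 0, else class 0."""
    return 1 if w @ x + b >= 0 else 0

x = np.array([2.0, 3.0])
print(distance(x), predict(x))   # |6 + 12 - 12| / 5 = 1.2, class 1
```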
Support Vector Machine (SVM) Terminology
Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation wx + b = 0 in linear classification.
Support Vectors: The closest data points to the hyperplane, crucial for determining
the hyperplane and margin in SVM.
Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
Kernel: A function that maps data to a higher-dimensional space, enabling SVM to
handle non-linearly separable data.
Hard Margin: A maximum-margin hyperplane that perfectly separates the data
without misclassifications.
Soft Margin: Allows some misclassifications by introducing slack variables,
balancing margin maximization and misclassification penalties when data is not
perfectly separable.
Linearly Non-Separable Case:
When data is not linearly separable (i.e., it can’t be divided by a straight line), SVM uses a
technique called kernels to map the data into a higher-dimensional space where it becomes
separable. This transformation helps SVM find a decision boundary even for non-linear data.
Figure: original 1D dataset for classification.
A kernel is a function that maps data points into a higher-dimensional space without
explicitly computing the coordinates in that space. This allows SVM to work efficiently with
non-linear data by implicitly performing the mapping.
For example, consider data points that are not linearly separable. By applying a kernel
function, SVM transforms the data points into a higher-dimensional space where they
become linearly separable.
Linear Kernel: For linear separability.
Polynomial Kernel: Maps data into a polynomial space.
Radial Basis Function (RBF) Kernel: Transforms data into a space based on
distances between data points.
The SVM algorithm is of two types:
Linear SVM: When the data points are linearly separable into two classes, the data is
called linearly-separable data. We use the linear SVM classifier to classify such data.
Non-linear SVM: When the data is not linearly separable, we use the non-linear SVM
classifier to separate the data points.
Non-linear SVM:
Nonlinear SVM was introduced when the data cannot be separated by a linear decision
boundary in the original feature space.
The kernel function computes the similarity between data points, allowing SVM to
capture complex patterns and nonlinear relationships between features. This enables
nonlinear SVM to form curved or circular decision boundaries with the help of kernels.
For example:
Suppose there are two classes of data points that a straight line cannot separate, but that a
circular boundary can. We can then introduce a new coordinate Z, derived from X and Y,
where Z = X² + Y². After introducing this third dimension, the two classes become
separable by a plane.
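A small numeric sketch of this transformation, assuming two hypothetical rings of points at radii 1 and 3:

```python
import numpy as np

# Points on an inner ring (class 0) and an outer ring (class 1): not separable
# by a straight line in the (X, Y) plane.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]   # radius 1
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]   # radius 3

# Add the third coordinate Z = X^2 + Y^2.
z_inner = inner[:, 0]**2 + inner[:, 1]**2                 # all equal to 1
z_outer = outer[:, 0]**2 + outer[:, 1]**2                 # all equal to 9

# In the new dimension a single threshold (e.g. Z = 5) separates the classes.
print(z_inner.max(), z_outer.min())
```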
Types of Kernels used in SVM
Here are some common types of kernels used by SVM. Let’s understand them one by one:
1. Linear Kernel
A linear kernel is the simplest form of kernel used in SVM. It is suitable when the
data is linearly separable meaning that a straight line (or hyperplane in higher
dimensions) can effectively separate the classes.
It is represented as: K(x,y)=x.y
It is used for text classification problems such as spam detection
2. Polynomial Kernel
The polynomial kernel allows SVM to model more complex relationships by
introducing polynomial terms. It is useful when the data is not linearly separable but
still follows a pattern. The formula of Polynomial kernel is:
K(x,y) = (x.y + c)^d, where c is a constant and d is the polynomial degree.
It is used in Complex problems like image recognition where relationships between
features can be non-linear.
3. Radial Basis Function Kernel (RBF) Kernel
The RBF kernel is the most widely used kernel in SVM. It maps the data into an
infinite-dimensional space making it highly effective for complex classification
problems. The formula of RBF kernel is:
K(x,y) = exp(−γ||x − y||²), where γ is a parameter that controls the influence of each
training example.
We use the RBF kernel when the decision boundary is highly non-linear and no prior
knowledge about the data’s structure is available.
4. Gaussian Kernel
The Gaussian kernel is a special case of the RBF kernel and is widely used for non-
linear data classification. It provides smooth and continuous transformations of data
into higher dimensions. It can be represented by:
K(x,y) = exp(−||x − y||² / (2σ²)), where σ is a parameter that controls the spread of the
kernel function.
It is used when data has a smooth, continuous distribution and requires a flexible
boundary.
5. Sigmoid Kernel
The sigmoid kernel is inspired by neural networks and behaves similarly to the
activation function of a neuron. It is based on the hyperbolic tangent function and is
suitable for neural networks and other non-linear classifiers. It is represented as:
K(x,y)=tanh(γ.xTy+r)
It is often used in neural networks and non-linear classifiers.
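The kernel formulas above can be written as plain functions; the parameter values (γ, c, d, r, σ) below are hypothetical choices for illustration:

```python
import numpy as np

# The five kernels listed above, as plain functions of two feature vectors.
def linear(x, y):
    return x @ y                                  # K(x,y) = x . y

def polynomial(x, y, c=1.0, d=3):
    return (x @ y + c) ** d                       # K(x,y) = (x . y + c)^d

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))  # exp(-gamma ||x - y||^2)

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid(x, y, gamma=0.1, r=0.0):
    return np.tanh(gamma * (x @ y) + r)           # tanh(gamma x^T y + r)

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(linear(x, y), polynomial(x, y), rbf(x, y))  # 4.0, 125.0, exp(-1)
```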
Kernel Trick:
The kernel trick computes the dot product of data points in the higher-dimensional
space directly, without ever transforming the data explicitly. This helps a model find
patterns in complex data: the data behaves as if it had been mapped into a higher-dimensional
space where it becomes easier to separate different classes or detect relationships.
For example, imagine we have data points shaped like two concentric circles: one circle
represents one class and the other circle represents another class. If we try to separate these
classes with a straight line it can’t be done because the data is not linearly separable in its
current form.
When we use a kernel function, it transforms the original 2D data (like the concentric circles)
into a higher-dimensional space where the data becomes linearly separable. In that higher-
dimensional space, the SVM finds a simple straight-line decision boundary to separate the
classes.
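A minimal sketch of why the trick works, using a degree-2 polynomial kernel K(x, y) = (x · y)² as a hypothetical example: its value equals the dot product under an explicit quadratic feature map φ, yet the map never has to be computed:

```python
import numpy as np

def phi(v):
    """Explicit map of a 2-D point to 3-D: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def K(x, y):
    """Same value, computed directly in the original 2-D space."""
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(K(x, y), phi(x) @ phi(y))   # both equal 25.0
```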
Logistic Regression:
Logistic regression is another supervised learning algorithm, used to solve classification
problems. In classification problems, the dependent variable is in a binary or discrete
format, such as 0 or 1.
Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
It is a predictive analysis algorithm that works on the concept of probability. Logistic
regression is a type of regression, but it differs from the linear regression algorithm in
how it is used.
Logistic regression uses the sigmoid function, also called the logistic function, to model
the data. It is represented as:
f(x) = 1 / (1 + e^(−x))
Where f(x) = output between the 0 and 1 value, x = input to the function, and e = base of
the natural logarithm.
When we provide the input values (data) to the function, it produces an S-shaped curve.
It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
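A minimal sketch of the sigmoid and threshold rule described above, using 0.5 as a hypothetical threshold:

```python
import numpy as np

# The sigmoid f(x) = 1 / (1 + e^{-x}) squashes any input into (0, 1); a
# threshold (0.5 here) then rounds the probability to a class label.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(x, threshold=0.5):
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0.0))                     # 0.5: the midpoint of the S-curve
print(classify(2.0), classify(-2.0))    # 1, 0
```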
There are three types of logistic regression:
• Binary (0/1, pass/fail) – only two possible outcomes. Example: assessing cancer risk
as either high or low.
• Multinomial (cats, dogs, lions) – two or more unordered outcomes.
• Ordinal (low, medium, high) – two or more ordered outcomes.
Linear Regression:
Linear regression is a statistical regression method which is used for predictive analysis.
It is one of the simplest algorithms; it works on regression and shows
the relationship between continuous variables.
It is used for solving the regression problem in machine learning.
Linear regression shows the linear relationship between the independent variable(X-
axis) and the dependent variable (Y-axis), hence called linear regression.
If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
The relationship between variables in the linear regression model can be illustrated with
an example: predicting the salary of an employee on the basis of years of experience.
Below is the mathematical equation for linear regression: Y = aX + b
Here, Y = dependent variable (target variable), X = independent variable (predictor variable),
and a and b are the linear coefficients.
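A short sketch of fitting Y = aX + b by least squares; the experience/salary numbers below are hypothetical:

```python
import numpy as np

# Toy experience/salary data (hypothetical numbers for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # years of experience
Y = np.array([30.0, 35.0, 40.0, 45.0, 50.0])  # salary (in thousands)

# np.polyfit(X, Y, 1) returns the slope a and intercept b of the best-fit line.
a, b = np.polyfit(X, Y, 1)
print(a, b)                  # 5.0, 25.0 for this exactly linear data

# Predict the salary for 6 years of experience.
print(a * 6 + b)             # 55.0
```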
Some popular applications of linear regression are:
• Analyzing trends and sales estimates
• Salary forecasting
• Real estate prediction
• Arriving at ETAs in traffic.
Types of Perceptron Models
Based on the number of layers, perceptrons are broadly classified into two major categories:
Single Layer Perceptron Model:
It is the simplest Artificial Neural Network (ANN) model. A single-layer perceptron model
consists of a feed-forward network and includes a threshold transfer function applied to the
output. The main objective of the single-layer perceptron model is to classify linearly
separable data with binary labels.
Multi-Layer Perceptron Model:
The multi-layer perceptron learning algorithm has the same basic structure as a single-layer
perceptron but includes one or more additional hidden layers between the input and output
layers, which a single-layer perceptron lacks.
Multi-Layer Perceptrons (MLPs):
The main shortcoming of simple feed-forward networks such as the perceptron was
their inability to learn non-linear decision boundaries. Multi-layer perceptrons are
neural networks which incorporate multiple hidden layers and activation
functions. The learning takes place in a supervised manner, where the weights
are updated by means of gradient descent. A multi-layer perceptron is bi-
directional: forward propagation of the inputs, and backward
propagation of the weight updates. The activation function can be changed
with respect to the type of target: softmax is usually used for multi-class
classification, sigmoid for binary classification, and so on. These are also
called dense networks because all the neurons in a layer are connected to all the
neurons in the next layer.
Multilayer networks learned by the BACKPROPAGATION algorithm
are capable of expressing a rich variety of nonlinear decision surfaces.
A Differentiable Threshold Unit (Sigmoid unit)
Sigmoid unit: a unit very much like a perceptron, but based on a smoothed, differentiable
threshold function.
• The sigmoid unit first computes a linear combination of its inputs, then
applies a threshold to the result; here the thresholded output is a continuous
function of its input.
• More precisely, the sigmoid unit computes its output o as
o = σ(w ⋅ x), where σ(y) = 1 / (1 + e^(−y))
Backpropagation for Training an MLP:
Backpropagation is an algorithm that propagates the errors from the output
nodes back to the input nodes. Therefore, it is simply referred to as backward
propagation of errors. It is used in a wide range of neural-network applications,
such as character recognition and signature verification.
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from input
vectors that the network operates on. The network compares the generated output to the
desired output and computes an error if the two do not match. It then adjusts the
weights according to this error so as to approach the desired output.
Backpropagation Algorithm:
Step 1: Inputs X arrive through the preconnected path.
Step 2: The input is modeled using actual weights W. Weights are usually chosen
randomly.
Step 3: Calculate the output of each neuron from the input layer, through the
hidden layer, to the output layer.
Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to
reduce the error.
Step 6: Repeat the process until the desired output is achieved.
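Steps 1 through 5 can be sketched for a tiny 2-2-1 network with sigmoid units; the input, target, and random initialization below are hypothetical, and a single gradient-descent update is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-2-1 network and one training pair (hypothetical values for illustration).
x = np.array([1.0, 0.0])         # Step 1: input
t = np.array([1.0])              # desired output

W1 = rng.normal(0, 0.5, (2, 2))  # Step 2: input -> hidden weights (random init)
W2 = rng.normal(0, 0.5, (1, 2))  # hidden -> output weights
eta = 0.5                        # learning rate

def forward(W1, W2):
    h = sigmoid(W1 @ x)          # Step 3: layer-by-layer outputs
    o = sigmoid(W2 @ h)
    return h, o

h, o = forward(W1, W2)
err_before = 0.5 * np.sum((t - o) ** 2)   # Step 4: squared error

# Step 5: propagate the error backward and adjust the weights.
delta_o = (o - t) * o * (1 - o)           # output-layer error term
delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden-layer error term
W2 -= eta * np.outer(delta_o, h)
W1 -= eta * np.outer(delta_h, x)

_, o_new = forward(W1, W2)
err_after = 0.5 * np.sum((t - o_new) ** 2)
print(err_before, err_after)     # the error shrinks after one update
```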
• The BACKPROPAGATION Algorithm learns the weights for a multilayer
network, given a network with a fixed set of units and interconnections. It
employs gradient descent to attempt to minimize the squared error between the
network output values and the target values for these outputs.
• In the BACKPROPAGATION algorithm, we consider networks with multiple
output units rather than single units as before, so we redefine E to sum the
errors over all of the network output units.