Support Vector Machine Overview
• Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and
regression.
• The goal of the SVM algorithm is to create the best line or decision boundary that segregates n-dimensional
space into classes, so that new data points can easily be placed in the correct category in the future. This
best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine.
Two types:
• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes with a single
straight line, it is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified with a straight line,
it is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
[Link] Govindarajan
Hyperplane:
• There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to
find the best decision boundary for classifying the data points. This best boundary is known as the hyperplane
of SVM.
• The dimensionality of the hyperplane depends on the number of features in the dataset: with 2 features, the
hyperplane is a straight line; with 3 features, it is a 2-dimensional plane.
• We always create the hyperplane with the maximum margin, i.e., the maximum distance to the nearest data
points of each class; this is the Maximal Margin Hyperplane.
Support Vectors:
• The data points or vectors that are closest to the hyperplane and that affect its position are termed
support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
Linear SVM:
• The working of the SVM algorithm can be understood with an example. Suppose we have a dataset with two
tags (green and blue) and two features, x1 and x2. We want a classifier that assigns each coordinate
pair (x1, x2) to either green or blue.
• Since this is a 2-D space, we can separate the two classes with a single straight line, but multiple lines
could separate them.
• The SVM algorithm finds the best line or decision boundary; this best boundary is called a hyperplane. The
algorithm finds the points of both classes closest to the boundary; these points are called support vectors.
The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize it. The hyperplane with the maximum margin is called the optimal hyperplane.
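As a minimal sketch of the idea (assuming scikit-learn and a made-up toy dataset), a linear SVM can be fit and its support vectors inspected directly:

```python
# Minimal sketch: fitting a linear SVM on a toy 2-D dataset (assumes scikit-learn).
import numpy as np
from sklearn.svm import SVC

# Two linearly separable tags (blue = 0, green = 1) with features x1, x2.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The points closest to the maximal-margin hyperplane:
print(clf.support_vectors_)
print(clf.predict([[2.0, 2.0], [7.0, 6.0]]))  # new points -> classes
```

Only the support vectors enter the decision function; the remaining training points could be removed without changing the learned boundary.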
Non-Linear SVM:
• If the data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot
draw a single straight line.
• To separate these data points, we need to add one more dimension. For linear data we used two dimensions,
x and y; for non-linear data we add a third dimension, z. By adding the third dimension, the sample space
becomes three-dimensional.
• SVM then divides the dataset into classes with a boundary that, in this 3-D space, looks like a plane
parallel to the x-axis.
• If we convert it back to 2-D space with z = 1, the decision boundary appears as a circle around the data.
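The "add a dimension" idea above can be sketched with synthetic data: points on two concentric circles are not linearly separable in (x, y), but an RBF-kernel SVM separates them by working implicitly in a higher-dimensional space (data and parameters below are illustrative assumptions, not from the slides):

```python
# Two concentric circles: not separable by a line in (x, y), but an RBF SVM
# implicitly maps to a higher-dimensional space (assumes scikit-learn).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner / outer circle
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(100, int), np.ones(100, int)])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))  # near-perfect separation on this toy data
```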
Advantages of SVM
• SVM is memory-efficient, since it uses only a subset of the training points, called support vectors, in the
decision function.
Example: Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
We first train the model with many images of cats and dogs so that it learns the distinguishing features of
each, and then test it on this strange creature. The SVM creates a decision boundary between the two classes
(cat and dog) using the extreme cases (support vectors), and on the basis of these support vectors it
classifies the creature as a cat. The SVM algorithm can be used for face detection, image classification,
text categorization, etc.
Kernel Optimization
Kernel optimization in machine learning focuses on finding the best kernel function and parameters to improve
model performance, particularly in tasks involving non-linear data and high-dimensional spaces.
This involves techniques like cross-validation, hyperparameter tuning, and exploring different kernel types to
achieve optimal accuracy and efficiency.
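One common way to carry out this search (cross-validation over kernel types and hyperparameters) is a grid search; the sketch below assumes scikit-learn and uses the built-in iris dataset purely for illustration:

```python
# Kernel optimization sketch: grid-search over kernel type and hyperparameters
# with 5-fold cross-validation (assumes scikit-learn; toy iris data).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {
    "kernel": ["linear", "rbf", "poly"],   # candidate kernel functions
    "C": [0.1, 1, 10],                     # regularization strength
    "gamma": ["scale", "auto"],            # RBF/poly kernel coefficient
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```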
Kernel Methods :
• Kernel methods are a class of machine learning algorithms that leverage the "kernel trick" to perform
complex, non-linear operations in high-dimensional spaces without explicitly mapping the data.
• They are particularly useful for algorithms like Support Vector Machines (SVMs).
• The kernel function acts as a similarity measure between data points, allowing the algorithm to find patterns
and relationships in a higher-dimensional space.
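The kernel trick can be verified numerically on a small example: the degree-2 polynomial kernel (x·y)² gives the same value as an inner product in an explicit higher-dimensional feature space, without ever computing that mapping (the feature map below is the standard one for 2-D inputs):

```python
# Kernel-trick sketch: (x.y)^2 equals an inner product in an explicit
# higher-dimensional feature space, computed without mapping the data.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

implicit = (x @ y) ** 2        # kernel trick: work in input space
explicit = phi(x) @ phi(y)     # same value via the explicit mapping
print(implicit, explicit)
```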
Need of Kernel Optimization:
Non-Linear Data:
• Many real-world datasets are not linearly separable, and kernel methods provide a powerful way to handle such
data by implicitly mapping it to a higher-dimensional space where it might be linearly separable.
Computational Efficiency:
• The kernel trick allows algorithms to operate in high-dimensional spaces without the computational cost
of explicitly mapping the data, making them efficient.
Flexibility:
• Kernel methods are versatile and can be applied to various types of data and tasks, as long as a suitable
kernel function can be defined.
Neural networks learning
Neural network learning refers to how artificial neural networks (ANNs) adjust their internal parameters (weights
and biases) to improve performance on a given task. The learning process is typically based on optimization
algorithms and data-driven training.
How it works:
1. Forward Propagation
• The input data passes through multiple layers of neurons.
• Each neuron applies a weighted sum of its inputs, followed by an activation function (e.g., ReLU, Sigmoid).
ReLU (Rectified Linear Unit) is a popular activation function in deep learning that outputs the input directly
if it is positive, and zero otherwise.
• The final output is produced based on the learned parameters.
2. Loss Calculation
• The output of the network is compared to the actual target values using a loss function (e.g., Mean Squared
Error for regression, Cross-Entropy Loss for classification).
• The loss function quantifies how far the predictions are from the true labels.
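Steps 1 and 2 can be sketched in a few lines of numpy; the weights and biases below are illustrative values only, not learned parameters:

```python
# One forward pass plus loss calculation (illustrative weights, numpy only).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # ReLU: pass positives, zero out negatives

x = np.array([0.5, -1.0])              # one sample with two features
W1 = np.array([[0.2, -0.3], [0.4, 0.1]])   # hidden-layer weights (assumed)
b1 = np.array([0.1, 0.0])
W2 = np.array([[0.7, -0.5]])               # output-layer weights (assumed)
b2 = np.array([0.05])

h = relu(W1 @ x + b1)                  # weighted sum + activation
y_hat = W2 @ h + b2                    # network output
y_true = np.array([0.3])
mse = np.mean((y_hat - y_true) ** 2)   # loss quantifies prediction error
print(y_hat, mse)
```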
3. Backpropagation & Gradient Descent
• Backpropagation: The gradient of the loss function is computed with respect to each weight using the chain
rule of calculus.
• Gradient Descent: The network updates the weights in the opposite direction of the gradient to reduce the
loss.
4. Iterative Optimization
• Steps 1–3 are repeated for multiple epochs (full passes through the dataset).
• The network gradually learns patterns from data and improves predictions.
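The full loop of steps 1-3 repeated over epochs can be sketched with a single linear neuron trained by gradient descent (learning rate and data are arbitrary choices for illustration):

```python
# Iterative optimization sketch: forward pass, loss, gradients, weight update,
# repeated over many epochs (numpy only; a single linear neuron).
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0                # target relation: y = 2x + 1
w, b, lr = 0.0, 0.0, 0.05

for epoch in range(2000):
    y_hat = w * X[:, 0] + b            # forward pass
    err = y_hat - y
    loss = np.mean(err ** 2)           # loss calculation (MSE)
    dw = 2 * np.mean(err * X[:, 0])    # gradients via the chain rule
    db = 2 * np.mean(err)
    w -= lr * dw                       # update opposite the gradient direction
    b -= lr * db

print(round(w, 3), round(b, 3))        # parameters approach 2 and 1
```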
Non-Linear hypothesis
Methods for Non-linear Hypothesis Testing:
• Wald Test:
A statistical test that uses the estimated parameters and their standard errors to test a hypothesis.
• Bayesian Methods:
Methods that use Bayes' theorem to update prior beliefs about a hypothesis based on observed data.
• Surrogate Data Testing:
A method for testing nonlinearity in time series data by generating surrogate data sets that are consistent with
a linear process and comparing them to the original data.
Software and Tools:
Stata:
The testnl command in Stata can be used to test non-linear hypotheses after estimation.
R:
The hypothesis function in the brms package in R can be used for non-linear hypothesis testing in Bayesian models.
Example:
A classic example of a non-linear hypothesis in machine learning is using a polynomial feature expansion in a
linear regression model to fit a curve that a straight line cannot capture, or using decision trees or neural
networks to classify data that is not linearly separable.
Neural Networks:
Neural networks are powerful models that can learn complex, non-linear relationships in data.
They consist of interconnected "neurons" that process information and make predictions.
The neurons use activation functions (like sigmoid or ReLU) that introduce non-linearity, allowing the
network to model complex patterns.
By adjusting the weights and biases of the connections between neurons, the network can learn to map
inputs to outputs in a non-linear way.
Perceptrons
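The slide content here appears to be figures; as a minimal illustration, the classic perceptron learning rule can be sketched on the linearly separable AND function (data and learning rate are standard textbook choices, not from the slides):

```python
# Perceptron learning rule on the AND function (numpy only).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                # AND is linearly separable
w = np.zeros(2); b = 0.0; lr = 0.1

for _ in range(20):                       # a few passes suffice here
    for xi, ti in zip(X, y):
        pred = int(w @ xi + b > 0)        # threshold (step) activation
        w += lr * (ti - pred) * xi        # update weights only on mistakes
        b += lr * (ti - pred)

preds = [int(w @ xi + b > 0) for xi in X]
print(preds)
```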
Backpropagation Algorithm
Backpropagation is a crucial algorithm in machine learning, particularly for training artificial neural
networks: it drives the learning process by minimizing error through forward and backward passes, adjusting
weights and biases to improve accuracy.
What it is:
• Backpropagation is a supervised learning method used to train artificial neural networks. It's
essentially a method for fine-tuning the weights of a neural network based on the error rate (or loss)
obtained in the previous epoch (or iteration).
How it works:
• Forward Pass: Input data flows through the network, and predictions are generated.
• Backward Pass: The error (the difference between predicted and actual outputs) is calculated and
propagated backward through the network.
• Weight Adjustment: The algorithm uses this error to adjust the weights and biases of the network,
aiming to minimize the error in future predictions.
Explanation with an exemplar
• Once the gradients are known, one can update all the weights.
• Initially, the total error on the network is 0.298371109 when the inputs 0.05 and 0.1 are fed forward.
After the first round of backpropagation, the total error is down to 0.291027924.
• After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the
output neurons generate 0.015912196 and 0.984065734, i.e., close to the target values, when 0.05 and 0.1
are fed forward.
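The numbers above match a well-known step-by-step backpropagation worked example; the sketch below reproduces it in numpy, assuming the standard starting weights (0.15-0.55), biases (0.35, 0.60), targets (0.01, 0.99), and learning rate 0.5, since the slide does not list them:

```python
# 2-2-2 sigmoid network trained by backpropagation (assumed initial values).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])                      # assumed targets
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])     # assumed initial weights
b1 = 0.35
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])
b2 = 0.60
lr = 0.5
errors = []

for _ in range(10000):
    h = sigmoid(W1 @ x + b1)                    # forward pass
    o = sigmoid(W2 @ h + b2)
    errors.append(0.5 * np.sum((t - o) ** 2))   # total error
    delta_o = (o - t) * o * (1 - o)             # output-layer error (chain rule)
    delta_h = (W2.T @ delta_o) * h * (1 - h)    # hidden-layer error
    W2 -= lr * np.outer(delta_o, h)             # gradient-descent updates
    W1 -= lr * np.outer(delta_h, x)             # (biases held fixed, as in the example)

print(errors[0], errors[1], errors[-1])
```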
Why it's important
• Training Neural Networks: Backpropagation is a fundamental algorithm for training neural networks, enabling
them to learn from data.
• Gradient Descent: It enables the use of gradient descent algorithms to update network weights, which is how deep
learning models "learn".
• Efficiency: It's efficient in training multi-layered networks and handling non-linear relationships, making it
suitable for complex tasks like image recognition and language processing.
Key Concepts
• Loss Function: Measures the difference between predicted and actual outputs.
• Gradient: Indicates the direction of the steepest increase in the loss function.
• Chain Rule: Backpropagation is an efficient application of the chain rule of calculus to compute gradients.
Advantages:
Scalability: Backpropagation is scalable and can be used with large networks.
Automation: It automates the calculation of gradients, simplifying the training process.
Generalization: Trained networks using backpropagation can generalize well to unseen data.
Limitations:
Bayesian networks
• Bayesian networks, a type of probabilistic graphical model, are used in machine learning to represent and
reason about uncertainty, capturing probabilistic relationships between variables using nodes and edges,
enabling inference and learning from data.
• A Bayesian Network is a directed acyclic graph and:
- its vertices (or nodes) are random variables
- each of its arrows corresponds to a conditional dependency relation: an arrow B → A indicates that A
depends on B
- moreover, we attach to each node A the conditional probability distribution of the corresponding random
variable A given its parents (i.e. given the nodes B for which there is an arrow B → A).
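The definition above can be made concrete with a tiny hand-coded network (the Rain → WetGrass structure and probability values below are hypothetical, chosen only to illustrate the factorization and Bayes'-rule inference):

```python
# Tiny Bayesian network: Rain -> WetGrass (hypothetical CPT values).
p_rain = {True: 0.2, False: 0.8}                       # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},     # P(Wet | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    """Factorize over the DAG: P(Rain, Wet) = P(Rain) * P(Wet | Rain)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

# Inference by enumeration: P(Rain | Wet) via Bayes' rule.
p_wet = joint(True, True) + joint(False, True)
p_rain_given_wet = joint(True, True) / p_wet
print(round(p_rain_given_wet, 4))
```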
Applications:
Unsupervised learning : clustering
• In unsupervised learning, clustering is a technique used to group unlabeled data points based on their
similarities, revealing underlying patterns and structures within the data.
• Clustering is a fundamental unsupervised machine learning task where algorithms identify groups or clusters
within a dataset without any prior knowledge or labels.
How it works:
• Clustering algorithms analyze the data to find similarities or differences between data points, grouping
those that are more alike into the same cluster.
Unsupervised nature:
• Unlike supervised learning, where algorithms learn from labeled data, unsupervised learning, including
clustering, operates on unlabeled data, allowing it to discover hidden patterns and structures.
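A minimal sketch of this (assuming scikit-learn and a made-up 2-D dataset): k-means receives only the points, no labels, and groups similar ones together:

```python
# k-means grouping unlabeled 2-D points into two clusters (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # points in the same group share a label
print(km.cluster_centers_)
```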
Cluster Analysis
• Clustering analyzes data objects without consulting class labels. Clustering can be used to generate class
labels for a group of data. The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
Clustering is a powerful and versatile tool in machine learning, especially in unsupervised learning,
where it helps uncover hidden patterns in data. By grouping similar data points together, clustering
enables better data exploration and decision-making in a variety of fields like finance, marketing,
healthcare, and beyond.
Spectral clustering
The three major steps involved in Spectral Clustering Algorithm are: constructing a similarity graph, projecting data onto a lower-
dimensional space, and clustering the data. Given a set of points S in a higher-dimensional space, it can be elaborated as follows:
1. Construct a distance matrix for the points in S.
2. Transform the distance matrix into an affinity matrix A.
3. Compute the degree matrix D and the Laplacian matrix L = D − A.
4. Find the eigenvalues and eigenvectors of L.
5. Form a matrix whose columns are the eigenvectors corresponding to the k smallest eigenvalues from the
previous step.
6. Normalize the row vectors of this matrix.
7. Cluster the data points in the resulting k-dimensional space.
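These steps can be sketched directly in numpy on a trivially separable toy dataset. Assumptions: a Gaussian affinity with bandwidth σ = 2 (chosen arbitrarily), the unnormalized Laplacian L = D − A (for which the smallest-eigenvalue eigenvectors are used), and a simple sign split on the second eigenvector in place of the final k-means step:

```python
# Spectral clustering pipeline sketch (numpy only; k = 2).
import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 4.9], [5.0, 5.2]])

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # distance matrix
A = np.exp(-dist**2 / (2 * 2.0**2))                            # affinity matrix
np.fill_diagonal(A, 0.0)
D = np.diag(A.sum(axis=1))                                     # degree matrix
L = D - A                                                      # graph Laplacian

vals, vecs = np.linalg.eigh(L)              # eigenvalues/eigenvectors of L
embedding = vecs[:, :2]                     # 2 smallest-eigenvalue eigenvectors
labels = (embedding[:, 1] > 0).astype(int)  # sign split on the Fiedler vector
print(labels)
```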
An affinity matrix, also known as a similarity matrix or kernel, is a square, symmetric matrix that
represents the pairwise similarities between objects in a dataset. Affinity matrices capture the relationships
between data points, highlighting which points are similar or dissimilar.
The degree matrix of an undirected graph is a diagonal matrix which contains information about the degree of
each vertex—that is, the number of edges attached to each vertex.
The Laplacian matrix, also called the graph Laplacian, admittance matrix, Kirchhoff matrix or discrete Laplacian,
is a matrix representation of a graph.
An eigenvector of a matrix A is a non-zero vector v such that when multiplied by A, the resulting vector is a
scalar multiple of the original vector v.
An eigenvalue (also known as a characteristic value) is a scalar value associated with an eigenvector.
Spectral Clustering Matrix Representation
Subspace clustering
Subspace clustering is a machine learning technique used to identify clusters within a dataset by analyzing
specific subsets of dimensions, or subspaces, of the data. It's particularly useful for high-dimensional data
where traditional clustering methods can struggle due to the "curse of dimensionality". The core idea is to find
clusters that exist within a subset of relevant features, rather than requiring agreement across all features.
How Subspace Clustering Works:
1. Identifying Subspaces:
Subspace clustering algorithms aim to find the relevant subspaces within the data.
2. Clustering within Subspaces:
Once the subspaces are identified, clustering algorithms can be applied to the data points within those subspaces.
3. Combining Results:
The results from clustering within different subspaces can be combined to obtain a final clustering of the entire dataset.
Types of Subspace Clustering Algorithms
• Algebraic Methods:
These methods use linear algebra techniques to find the underlying subspaces, such as finding the
eigenvectors of a matrix representing the data.
• Iterative Methods:
These methods iteratively refine the subspace representation by updating the projection matrices or cluster
assignments.
• Statistical Methods:
These methods use statistical models to describe the data distribution within subspaces.
Challenges:
• Computational complexity: Some algorithms can be computationally expensive, especially for very large
datasets.
• Parameter tuning: Some algorithms require tuning of parameters, which can be challenging.
• Finding the right number of subspaces: Determining the optimal number of subspaces can be difficult.
Dimensionality Reduction
Dimensionality reduction is a method for representing a given dataset using a lower number of features (that is,
dimensions) while still capturing the original data’s meaningful properties. This amounts to removing irrelevant
or redundant features, or simply noisy data, to create a model with a lower number of variables.
Dimensionality reduction covers an array of feature selection and data compression methods used during
preprocessing. While dimensionality reduction methods differ in operation, they all transform high-
dimensional spaces into low-dimensional spaces through variable extraction or combination.
• When working with machine learning models, datasets with too many features can cause issues like slow
computation and overfitting. Dimensionality reduction helps by reducing the number of features while
retaining key information.
• Techniques like principal component analysis (PCA), singular value decomposition (SVD) and linear
discriminant analysis (LDA) project data onto a lower-dimensional space, preserving important details.
Advantages of Dimensionality Reduction
•Faster Computation: With fewer features, machine learning algorithms can process data more quickly. This results
in faster model training and testing, which is particularly useful when working with large datasets.
•Better Visualization: As we saw in the earlier figure, reducing dimensions makes it easier to visualize data,
revealing hidden patterns.
•Prevent Overfitting: With fewer features, models are less likely to memorize the training data and overfit. This
helps the model generalize better to new, unseen data, improving its ability to make accurate predictions.
Disadvantages of Dimensionality Reduction
•Data Loss & Reduced Accuracy: Some important information may be lost during dimensionality reduction,
potentially affecting model performance.
•Choosing the Right Components: Deciding how many dimensions to keep is difficult, as keeping too few may lose
valuable information, while keeping too many can lead to overfitting.
Data Compression
• Data compression is the process of encoding, restructuring or otherwise modifying data in order to
reduce its size. Fundamentally, it involves re-encoding information using fewer bits than the
original representation.
• Compression is done by a program that uses functions or an algorithm to effectively discover how to
reduce the size of the data.
• A good example of this often occurs with image compression. When a sequence of colors, like 'blue,
red, red, blue', recurs throughout the image, the algorithm can replace that data string with a much
shorter code while still maintaining the underlying information.
• Text compression typically works by removing unnecessary characters, inserting a single reference
character in place of a string of repeated characters, and substituting shorter bit strings for more
common ones. With proper techniques, data compression can shrink a text file by 50% or more, greatly
reducing its overall size.
• For data transmission, compression can be run on the content or on the entire transmission. When
information is sent or received via the internet, larger files, on their own or as part of an archive,
may be transmitted in one of many compressed formats, such as ZIP, RAR, or 7z.
Lossy vs Lossless
• Lossless compression: Removes bits by locating and removing statistical redundancies. Because of this
technique, no information is actually removed. Lossless compression will often have a smaller compression
ratio, with the benefit of not losing any data in the file. This is often very important when needing to maintain
absolute quality, as with database information or professional media files. Formats such as FLAC (audio) and
PNG offer lossless compression options.
• Lossy compression: Lowers size by deleting unnecessary information, and reducing the complexity of
existing information. Lossy compression can achieve much higher compression ratios, at the cost of possible
degradation of file quality. JPEG offers lossy compression options, and MP3 is based on lossy compression.
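Lossless compression is easy to demonstrate with Python's standard zlib module: redundant data compresses well, and decompression recovers the original bytes exactly:

```python
# Lossless compression round-trip with zlib (Python standard library).
import zlib

data = b"blue red red blue " * 200          # highly redundant input
packed = zlib.compress(data, level=9)
assert zlib.decompress(packed) == data       # no information is lost
print(len(data), len(packed))                # compressed size is much smaller
```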
• Compression reduces the cost of storage, increases the speed of algorithms, and reduces the transmission cost.
Compression is achieved by removing redundancy, that is repetition of unnecessary data. Coding redundancy refers to
the redundant data caused due to suboptimal coding techniques.
Variable-Length Codes:
Frequently occurring characters get shorter codes, while rare ones get longer codes. This reduces the average
number of bits used per symbol.
Fixed-length: A=00, B=01, C=10, D=11. Variable-length (Huffman): A=0, B=10, C=110, D=111.
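The saving from the variable-length code above can be computed directly; the symbol frequencies below are assumed (more frequent symbols get shorter codes):

```python
# Average bits per symbol: fixed- vs variable-length coding (assumed frequencies).
freq = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}   # assumed probabilities
fixed = {"A": "00", "B": "01", "C": "10", "D": "11"}
huffman = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_fixed = sum(freq[s] * len(fixed[s]) for s in freq)    # 2.0 bits/symbol
avg_var = sum(freq[s] * len(huffman[s]) for s in freq)    # 1.75 bits/symbol
print(avg_fixed, avg_var)
```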
Coding techniques are related to the concepts of entropy and information content, which are studied in a field
called information theory. Information theory also deals with the uncertainty present in a message, which is
called its information content.
•Faster Training & Inference: Smaller datasets mean quicker I/O and computation.
•Reduced Storage: Essential when working with large-scale datasets.
•Better Generalization: Compressed representations can reduce overfitting by removing noise or irrelevant features.
•Enables Edge Deployment: Compressed models/data can run on mobile devices or IoT systems.
Advantages and Disadvantages:
• The main advantages of compression are reductions in storage hardware, data transmission time, and
communication bandwidth. This can result in significant cost savings. Compressed files require significantly
less storage capacity than uncompressed files, meaning a significant decrease in expenses for storage. A
compressed file also requires less time for transfer while consuming less network bandwidth. This can also help
with costs, and also increases productivity.
• The main disadvantage of data compression is the increased use of computing resources to apply
compression to the relevant data. Because of this, compression vendors prioritize speed and resource
efficiency optimizations in order to minimize the impact of intensive compression tasks.
Principal Components Analysis
Principal components analysis (PCA; also called the Karhunen-Loève, or K-L, method) searches for k
n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data
are thus projected onto a much smaller space, resulting in dimensionality reduction.
1. The input data are normalized, so that each attribute falls within the same range. This step helps ensure that
attributes with large domains will not dominate attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input data. These are unit
vectors that each point in a direction perpendicular to the others. These vectors are referred to as the
principal components. The input data are a linear combination of the principal components.
These vectors represent the directions (principal components) where the data has the most variance, and PCA finds the
orthonormal basis that maximizes this variance.
Orthogonal vectors are perpendicular to each other, meaning their dot product is zero. In PCA, this implies that
different principal components are uncorrelated.
• The principal components are sorted in order of decreasing “significance” or strength. The principal
components essentially serve as a new set of axes for the data, providing important information about variance.
That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis
shows the next highest variance, and so on.
• For example, a figure might show the first two principal components, Y1 and Y2, for a given set of data
originally mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.
• Because the components are sorted in decreasing order of “significance,” the data size can be reduced by
eliminating the weaker components, that is, those with low variance. Using the strongest principal
components, it should be possible to reconstruct a good approximation of the original data.
• PCA can be applied to ordered and unordered attributes. Multidimensional data of more than two dimensions
can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to
multiple regression and cluster analysis.
Geometrically speaking, principal components
represent the directions of the data that explain
a maximal amount of variance, that is to say, the
lines that capture most information of the data.
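The steps above (normalize, find orthonormal directions of maximal variance, sort, project) can be sketched in numpy on synthetic data; the dataset below is an assumed example with one dominant direction:

```python
# PCA via eigendecomposition of the covariance matrix (numpy only).
import numpy as np

rng = np.random.default_rng(0)
# 100 points stretched along the direction (3, 1), plus small noise.
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]]) \
    + rng.normal(scale=0.3, size=(100, 2))

Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: normalize the attributes
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)            # orthonormal principal directions
order = np.argsort(vals)[::-1]              # sort by decreasing variance
vals, vecs = vals[order], vecs[:, order]

k = 1
Z = Xc @ vecs[:, :k]                        # project onto the top k components
print(vals / vals.sum())                    # fraction of variance explained
```

Dropping the weaker component here loses little information, since the first component carries most of the variance.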
Core Assumptions of LDA
• Equal Covariance Matrices: Covariance matrices of the different classes should be equal.
• Linear Separability: A linear decision boundary should be sufficient to separate the classes.