AI:ML:DL
• When a machine completes tasks by following a set of stipulated problem-solving rules (algorithms), such
“intelligent” behavior is what is called artificial intelligence.
• Machine learning can be loosely interpreted to mean empowering computer systems with the ability to
“learn”
• Deep learning is a major breakthrough in the field of artificial intelligence.
• The term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986.
• Deep learning is a subset of ML; in fact, it’s simply a technique for realizing machine learning. In other
words, DL is the next evolution of machine learning.
04/22/2025 1
AI:ML:DL
DL Applications
• Automatic Handwriting Generation
• Image Recognition
• Automatic Image Caption Generation
• Automatic Colorization
• Advertising
• Finance
DL Applications (cont.)
• Self Driving Cars
• Health Care
• Voice Search & Voice-Activated Assistants
• Automatically Adding Sounds To Silent Movies
• Automatic Machine Translation
• Deep Neural Network (DNN)
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
Convolutional Neural Networks
• Convolution: a mathematical operation that combines two functions to produce a third
• A CNN is a regularized version of the multilayer perceptron.
Convolutional Neural Networks
CNN
• Convolution is a mathematical operation used in Convolutional Neural Networks (CNNs)
to extract features from input data, such as images or signals. It involves sliding a filter
(or kernel) over the input data and computing a dot product at each position to produce an
output known as a feature map.
• Filter/Kernel: A small matrix (e.g., 3x3 or 5x5) of weights used to detect patterns like
edges or textures.
• Stride: The step size of the filter as it slides over the input.
• Feature Map: The output matrix after applying the convolution operation.
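The sliding dot product described above can be sketched in NumPy (the `conv2d` helper name, the 4x4 input, and the edge-detecting kernel are illustrative choices, not from the slides):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, computing a dot product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1          # output height (no padding)
    ow = (iw - kw) // stride + 1          # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # a simple vertical-edge detector
feature_map = conv2d(image, edge_kernel, stride=1)
print(feature_map.shape)  # (2, 2): a 3x3 filter over a 4x4 input with stride 1
```

A 4x4 input with a 3x3 filter and stride 1 yields a 2x2 feature map, matching the (n - k)/s + 1 size rule.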
Input 4x4x3
In convolutional layers, channels are the depth of
the input or output tensor:
Input Channels:
Represent the depth of the input, e.g., color
channels or feature maps from a previous layer.
For an RGB image, the input typically has 3
channels (R, G, B).
Output Channels:
Represent the number of feature maps
produced by applying filters (kernels).
Each filter detects specific features (e.g.,
edges, textures).
The number of output channels corresponds
to the number of filters applied.
Kernel operation and padding
• Kernel operation: significant region extraction
• An operator applied to the entirety of the image such that it
transforms the information encoded in the pixels.
• The kernels are then convolved with the input volume to obtain so-
called ‘activation maps’. Activation maps indicate ‘activated’ regions.
• Padding is used to adjust the input size.
• Padding refers to adding extra border values (usually zeros) around the input data before applying a convolution.
Padding helps control the size of the output feature map.
• Types of Padding:
• Valid Padding (No Padding):
– No extra values are added.
– Output size is smaller than the input size.
• Same Padding (Zero Padding):
– Zeros are added to ensure the output size matches the input size.
• Custom Padding:
– Arbitrary padding values can be added based on requirements.
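With stride s, kernel size k, and padding p, the output size follows the standard formula out = (n + 2p - k)/s + 1; a small sketch of the three padding styles (the helper name is illustrative):

```python
import numpy as np

def conv_output_size(n, k, p=0, s=1):
    """Output size of a convolution: (n + 2p - k) // s + 1."""
    return (n + 2 * p - k) // s + 1

# Valid padding: no zeros added, so the output is smaller than the input.
print(conv_output_size(5, 3, p=0))  # 3

# Same padding: with stride 1 and odd k, p = (k - 1) // 2 preserves the size.
print(conv_output_size(5, 3, p=1))  # 5

# np.pad adds the zero border explicitly (custom padding just varies pad_width):
x = np.ones((3, 3))
x_padded = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(x_padded.shape)  # (5, 5)
```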
Receptive field and ReLU
• The area of the input that influences a specific output value is called the receptive field.
• Each neuron receives input from some number of locations in the previous layer. In a
fully connected layer, each neuron receives input from every element of the previous
layer. In a convolutional layer, neurons receive input from only a restricted subarea of the
previous layer.
• ReLU refers to the Rectified Linear Unit, the most commonly deployed activation function for the
outputs of CNN neurons.
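ReLU itself is a one-liner: it passes positive values through and zeroes out negatives.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x), applied elementwise."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negatives become 0
```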
Softmax Function
• The softmax function is an activation function used in
machine learning and deep learning, typically in the final layer
of a classification model. It converts a vector of raw scores
(logits) into a probability distribution over n classes, ensuring
that the probabilities sum up to 1.
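A minimal implementation of the softmax function (with the standard max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution summing to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)            # the largest logit gets the highest probability
print(probs.sum())      # probabilities sum to 1
```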
Max Pooling
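Max pooling downsamples a feature map by keeping only the maximum of each window; a minimal NumPy sketch (the `max_pool2d` helper and the 4x4 example are illustrative):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Downsample by keeping the maximum of each size x size window."""
    h, w = x.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 1.],
              [3., 4., 0., 8.]])
print(max_pool2d(x))  # [[6. 4.]
                      #  [7. 9.]]
```

Each 2x2 window contributes its largest value, halving each spatial dimension while preserving the strongest activations.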
Fully connected layer
Example
CNN Applications
• Image data
• Classification prediction problems
• Regression prediction problems
Recurrent Neural network
• The core reason that recurrent nets are more exciting is that
they allow us to operate over sequences of vectors : Sequences
in the input, the output, or in the most general case both.
RNN Types
1. Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image
classification).
2. Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
3. Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or
negative sentiment).
4. Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and
then outputs a sentence in French).
5. Synced sequence input and output (e.g. video classification where we wish to label each frame of the
video).
RNN Types
RNN Layers Unfold
LSTM
• Every LSTM network contains three gates that control the flow of information, and cells that hold
information. The cell state carries information from early to later time steps without vanishing.
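The three gates and the persistent cell state can be sketched for a single time step (the dimensions, initialization, and packed weight layout here are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: three gates control what the cell state keeps."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = len(h_prev)
    f = sigmoid(z[0*H:1*H])        # forget gate: what to drop from the cell state
    i = sigmoid(z[1*H:2*H])        # input gate: what new information to store
    o = sigmoid(z[2*H:3*H])        # output gate: what to expose as hidden state
    g = np.tanh(z[3*H:4*H])        # candidate cell values
    c = f * c_prev + i * g         # cell state carries information across steps
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
X_DIM, H_DIM = 3, 4
W = rng.standard_normal((4 * H_DIM, X_DIM + H_DIM)) * 0.1
b = np.zeros(4 * H_DIM)
h, c = np.zeros(H_DIM), np.zeros(H_DIM)
for t in range(5):                 # run a short random sequence through the cell
    h, c = lstm_step(rng.standard_normal(X_DIM), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how `c` is updated only through elementwise gating, which is what lets information survive many steps without vanishing.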
Layer Wise LSTM
At t=1
GRU
• GRUs (Gated Recurrent Units) are similar to LSTM networks. The GRU is a newer kind of
recurrent unit. However, there are some differences between GRU and LSTM:
– GRU doesn’t contain a cell state
– GRU uses its hidden state to transport information
– It contains only 2 gates (Reset and Update Gate)
– GRU is faster than LSTM because it performs fewer tensor operations
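The two gates and the absence of a separate cell state can be sketched for a single step (sizes and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU time step: only two gates, and no separate cell state."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                                     # update gate
    r = sigmoid(Wr @ xh)                                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde  # hidden state itself carries information

rng = np.random.default_rng(1)
X_DIM, H_DIM = 3, 4
Wz, Wr, Wh = (rng.standard_normal((H_DIM, X_DIM + H_DIM)) * 0.1 for _ in range(3))
h = np.zeros(H_DIM)
for t in range(5):
    h = gru_step(rng.standard_normal(X_DIM), h, Wz, Wr, Wh)
print(h.shape)  # (4,)
```

Compared with the LSTM step, there is no `c` to maintain and one fewer gate, which is where the speed advantage comes from.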
Text processing
Sentiment classification
Applicability
• Text data
• Speech data
• Classification prediction problems
• Regression prediction problems
• Generative models
Bidirectional LSTM
• LSTM (Long Short-Term Memory): LSTM is a type of RNN designed to overcome the
limitations of traditional RNNs in capturing long-term dependencies in sequential data. It
introduces memory cells and gating mechanisms to selectively retain and forget information
over time. LSTMs have an internal memory state that can store information for long
durations, allowing them to capture dependencies that may span across many time steps.
• Bidirectional Processing: Unlike traditional RNNs that process input sequences in only one
direction (either forward or backward), Bi-LSTM processes the sequence in both directions
simultaneously. It consists of two LSTM layers: one processing the sequence in the forward
direction and the other in the backward direction. Each layer maintains its own hidden states
and memory cells.
• Forward Pass: During the forward pass, the input sequence is fed into the forward LSTM layer from the first
time step to the last. At each time step, the forward LSTM computes its hidden state and updates its memory
cell based on the current input and the previous hidden state and memory cell.
• Backward Pass: Simultaneously, the input sequence is also fed into the backward LSTM layer in reverse
order, from the last time step to the first. Similar to the forward pass, the backward LSTM computes its hidden
state and updates its memory cell based on the current input and the previous hidden state and memory cell.
• Combining Forward and Backward States: Once the forward and backward passes are complete, the hidden
states from both LSTM layers are combined at each time step. This combination can be as simple as
concatenating the hidden states or applying some other transformation.
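The forward pass, backward pass, and concatenation described above can be sketched with a plain tanh RNN cell (for brevity this sketch shares one set of weights between the two directions; a real Bi-LSTM uses separate parameters per direction and LSTM cells):

```python
import numpy as np

def rnn_pass(xs, W, U):
    """Run a simple tanh RNN over `xs`, returning the hidden state at each step."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

rng = np.random.default_rng(2)
X_DIM, H_DIM, T = 3, 4, 6
W = rng.standard_normal((H_DIM, X_DIM)) * 0.1
U = rng.standard_normal((H_DIM, H_DIM)) * 0.1
xs = [rng.standard_normal(X_DIM) for _ in range(T)]

forward = rnn_pass(xs, W, U)                # first time step to last
backward = rnn_pass(xs[::-1], W, U)[::-1]   # last step to first, then re-aligned
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(len(combined), combined[0].shape)  # 6 (8,)
```

At every time step the combined state sees both the past (forward half) and the future (backward half), which is the whole point of the bidirectional design.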
Bidirectional RNN
• Inputting a sequence: A sequence of data points, each represented as a vector of the same dimensionality, is fed into a BRNN. The
sequences may have different lengths.
• Dual processing: The data is processed in both the forward and backward directions. In the forward direction, the hidden state at
time step t is computed from the input at step t and the hidden state at step t-1. In the backward direction, the hidden state at step t
is computed from the input at step t and the hidden state at step t+1.
• Computing the hidden state: A non-linear activation function on the weighted sum of the input and previous hidden state is used to calculate
the hidden state at each step. This creates a memory mechanism that enables the network to remember data from earlier steps in the process.
• Determining the output: A non-linear activation function is used to determine the output at each step from the weighted sum of the hidden
state and a number of output weights. This output has two options: it can be the final output or input for another layer in the network.
• Training: The network is trained through a supervised learning approach where the goal is to minimize the discrepancy between the predicted
output and the actual output. The network adjusts its weights in the input-to-hidden and hidden-to-output connections during training through
backpropagation.
Advantages of Bidirectional RNN
• Context from both past and future: With the ability to process sequential input both forward and backward, BRNNs
provide a thorough grasp of the full context of a sequence. Because of this, BRNNs are effective at tasks like sentiment
analysis and speech recognition.
• Enhanced accuracy: BRNNs frequently yield more precise answers since they take both historical and upcoming data
into account.
• Efficient handling of variable-length sequences: When compared to conventional RNNs, which require padding to
have a constant length, BRNNs are better equipped to handle variable-length sequences.
• Resilience to noise and irrelevant information: BRNNs may be resistant to noise and irrelevant data that are present
in the data. This is so because both the forward and backward paths offer useful information that supports the
predictions made by the network.
• Ability to handle sequential dependencies: BRNNs can capture long-term links between sequence pieces, making
them extremely adept at handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
• Computational complexity: Given that they analyze data both forward and backward, BRNNs can be
computationally expensive due to the increased amount of calculations needed.
• Long training time: BRNNs can also take a while to train because there are many parameters to optimize,
especially when using huge datasets.
• Difficulty in parallelization: Due to the requirement for sequential processing in both the forward and
backward directions, BRNNs can be challenging to parallelize.
• Overfitting: BRNNs are prone to overfitting since they include many parameters, which can result in overly
complicated models, especially when trained on small datasets.
• Interpretability: Due to the processing of data in both forward and backward directions, BRNNs can be tricky
to interpret since it can be difficult to comprehend what the model is doing and how it is producing predictions.
CNN
• LeNet and AlexNet are foundational architectures in the field of
Convolutional Neural Networks (CNNs) because they introduced
and popularized key concepts that have shaped the evolution of
deep learning for computer vision.
• Their importance stems from their impact on research, applications,
and the development of modern deep learning techniques.
LeNet
• Also known as the classic neural network, this architecture was designed by Yann LeCun, Léon
Bottou, Yoshua Bengio and Patrick Haffner in the 1990s for handwritten and machine-printed character
recognition; they called it LeNet-5. The architecture was designed to identify handwritten digits in the
MNIST dataset, and it is straightforward and simple to understand. The input images were grayscale with
dimensions 32x32x1, followed by two pairs of a convolution layer with stride 1 and an average-pooling
layer with stride 2. Finally, fully connected layers with softmax activation in the output layer.
Traditionally, this network had about 60,000 parameters in total.
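As a rough check on that figure, the layer sizes above can be tallied (this sketch assumes full connectivity in the second convolutional layer; the original LeNet-5 used a partial connection table there, which is why the classic count is slightly lower):

```python
# Parameters in a conv layer: (kernel_h * kernel_w * in_channels + 1 bias) per filter
def conv_params(kh, kw, in_ch, out_ch):
    return (kh * kw * in_ch + 1) * out_ch

# Parameters in a dense layer: one weight per input plus a bias, per output unit
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

c1 = conv_params(5, 5, 1, 6)     # 32x32x1 -> 28x28x6
c3 = conv_params(5, 5, 6, 16)    # 14x14x6 -> 10x10x16 (full connectivity assumed)
c5 = conv_params(5, 5, 16, 120)  # 5x5x16  -> 1x1x120
f6 = dense_params(120, 84)
out = dense_params(84, 10)
total = c1 + c3 + c5 + f6 + out
print(total)  # 61706, close to the ~60,000 quoted above
```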
Applications of LeNet
Digit Recognition: Handwritten digit classification (e.g., postal code recognition).
Character Recognition: Used in OCR (Optical Character Recognition) systems.
Banking Systems: Digit recognition on checks.
Entry-Level Computer Vision Tasks: Small datasets with simple patterns.
AlexNet
• This network was very similar to LeNet-5 but deeper, with 8 layers, more filters,
stacked convolutional layers, max pooling, dropout, data augmentation, ReLU and SGD.
• AlexNet was the winner of the ImageNet ILSVRC-2012 competition, designed by Alex
Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton.
• It was trained on two Nvidia GeForce GTX 580 GPUs; therefore, the network was split into two
pipelines.
• AlexNet has 5 convolution layers and 3 fully connected layers.
• AlexNet consists of approximately 60 million parameters.
• A major drawback of this network is that it has many hyperparameters.
Applications of AlexNet
Image Classification: General-purpose classification tasks (e.g., ImageNet).
Feature Extraction: Pretrained AlexNet is used to extract features for other vision tasks.
Object Detection: As a backbone for object detection models.
Transfer Learning: Fine-tuned for smaller datasets (e.g., medical imaging).
Video Analysis: Classification and recognition in video streams.
Generative Model: Boltzmann Machine
• A Boltzmann Machine is a generative unsupervised model, which involves learning a probability
distribution from an original dataset and using it to make inferences about never-before-seen data.
• A Boltzmann Machine has an input layer (also referred to as the visible layer) and one or several
hidden layers.
• A Boltzmann Machine uses neural networks with neurons that are connected not only to neurons in
other layers but also to neurons within the same layer.
• Everything is connected to everything: connections are bidirectional, visible neurons are connected to
each other, and hidden neurons are also connected to each other.
• A Boltzmann Machine doesn’t just expect input data; it generates data. Neurons generate information
regardless of whether they are hidden or visible.
• To a Boltzmann Machine all neurons are the same; it doesn’t discriminate between hidden and visible
neurons. The whole thing is a system, and the machine generates states of that system.
• The way this system works, we feed our training data into the Boltzmann Machine as input to help the
system adjust its weights. The machine then comes to resemble our particular system (in the classic example,
one specific nuclear power station, not any power station in the world).
• It learns from the input what the possible connections between all these parameters are and how they
influence each other, and therefore it becomes a machine that represents our system.
• We can use this Boltzmann Machine to monitor our system.
• A Boltzmann Machine learns how the system works in its normal states through good examples.
• A Boltzmann Machine consists of a neural network with an input layer and one or several hidden layers.
The neurons in the neural network make stochastic decisions about whether to turn on or off based on
the data we feed during training and the cost function the Boltzmann Machine is trying to minimize.
• By doing so, the Boltzmann Machine discovers interesting features about the data, which help model the
complex underlying relationships and patterns present in the data.
• This Boltzmann Machine uses neural networks with neurons that are connected
not only to other neurons in other layers but also to neurons within the same layer.
That makes training an unrestricted Boltzmann machine very inefficient and
Boltzmann Machine had very little commercial success.
• Boltzmann Machines are primarily divided into two categories: Energy-based
Models (EBMs) and Restricted Boltzmann Machines (RBM). When these
RBMs are stacked on top of each other, they are known as Deep Belief Networks
(DBN).
Restricted Boltzmann Machines(RBM)
• Generative model: RBMs are generative models, meaning they model how the data is
generated in terms of a probability distribution. They are good at learning the underlying
distribution of input data and can generate new samples based on that learned distribution.
• Basic Neural Network:
• Discriminative model: Basic neural networks are discriminative models. They focus on
modeling the boundary between different classes (in classification tasks) or predicting the target
output (in regression tasks). The goal is to differentiate between classes or predict the correct
output given input data.
• In an RBM, we have a symmetric bipartite graph where no two units within the same group are connected.
Multiple RBMs can also be stacked and can be fine-tuned through the process of gradient descent and back-
propagation.
• Such a network is called a Deep Belief Network. Although RBMs are occasionally used, most people in the
deep-learning community have started replacing them with Generative Adversarial Networks or Variational
Autoencoders.
• RBM is a Stochastic Neural Network which means that each neuron will have some random behavior when
activated. There are two other layers of bias units (hidden bias and visible bias) in an RBM. This is what
makes RBMs different from autoencoders.
• The hidden bias produces the activation on the forward pass, and the visible bias helps the RBM
reconstruct the input during the backward pass. The reconstructed input is always different from the actual
input, as there are no connections among the visible units and therefore no way of transferring information
among themselves.
Restricted Boltzmann Machines
Learning of RBM
• Initially, the weights and biases are set to random values.
• During training, the RBM is shown samples from the dataset (e.g., images, user ratings, etc.).
• The RBM follows this iterative learning process:
– Input to Hidden:
• The visible layer (input data) is passed to the hidden layer.
• The hidden units "turn on" based on the strength of their connection to the visible units.
– Hidden to Reconstruction:
• The hidden layer attempts to reconstruct the input by activating the visible units.
• This reconstruction reflects the RBM’s current understanding of the data.
– Comparison:
• The RBM compares the reconstructed input to the original input.
• The difference between the two indicates how well the RBM is modeling the data.
– Adjustment:
• The weights and biases are adjusted to reduce the difference between the original and reconstructed inputs.
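The iterative process above can be sketched as one contrastive-divergence-style update in NumPy (the layer sizes, learning rate, and random binary sample are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
N_VIS, N_HID, LR = 6, 3, 0.1
W = rng.standard_normal((N_VIS, N_HID)) * 0.1
b_vis, b_hid = np.zeros(N_VIS), np.zeros(N_HID)

v0 = rng.integers(0, 2, N_VIS).astype(float)  # a binary training sample

# Input to hidden: hidden units turn on stochastically given the visible data
p_h0 = sigmoid(v0 @ W + b_hid)
h0 = (rng.random(N_HID) < p_h0).astype(float)

# Hidden to reconstruction: visible units are reactivated from the hidden code
p_v1 = sigmoid(h0 @ W.T + b_vis)
v1 = (rng.random(N_VIS) < p_v1).astype(float)
p_h1 = sigmoid(v1 @ W + b_hid)

# Comparison + adjustment: push weights toward the data statistics (v0, p_h0)
# and away from the reconstruction statistics (v1, p_h1)
W += LR * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
b_vis += LR * (v0 - v1)
b_hid += LR * (p_h0 - p_h1)
print(W.shape)  # (6, 3)
```

Repeating this update over many samples is what gradually shrinks the gap between the original and reconstructed inputs.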
• Over time, the RBM becomes better at reconstructing the input.
• Applications of RBMs
• Dimensionality Reduction:
– RBMs can project high-dimensional data into a lower-dimensional latent space.
• Feature Learning:
– They extract meaningful features from data in an unsupervised manner.
• Collaborative Filtering:
– Used for recommendation systems, such as movie or product recommendations.
• Generative Modeling:
– RBMs can generate new samples by sampling from the learned distribution.
• Pretraining for Deep Networks:
– RBMs are often used to pretrain weights in deeper networks like Deep Belief Networks (DBNs).
• In traditional CNNs, the model is trained deterministically using methods like
backpropagation and gradient descent. However, there is growing interest in adding a
probabilistic layer to CNNs to estimate uncertainty in the predictions.
• Bayesian CNNs: As mentioned earlier, MCMC can be used in Bayesian CNNs to sample
from the posterior distributions of the network's weights. This enables the network to account
for uncertainty in its predictions and improve decision-making, especially in tasks like
medical image analysis or autonomous driving, where uncertainty quantification is critical.
• Uncertainty in CNNs: MCMC methods like Gibbs sampling can be used to sample different
possible configurations of the network's parameters and generate multiple outputs, which can
be averaged to give more robust predictions that incorporate uncertainty.
• In the context of Bayesian Neural Networks (BNNs), MCMC and Gibbs sampling can be applied to
approximate the posterior distribution of the model's parameters.
• Bayesian Neural Networks: These are neural networks where, instead of having fixed parameters
(weights), the model assumes that the weights are drawn from a probability distribution. The goal is to
estimate the posterior distribution of the weights given the data. This allows the model to express
uncertainty about its predictions.
• Role of MCMC and Gibbs Sampling:
– MCMC methods like Gibbs sampling are used to sample from the posterior distribution of the weights in a
Bayesian Neural Network. This is particularly important when the exact posterior distribution is intractable (which
is often the case in complex models like CNNs).
– By using MCMC to sample from this distribution, we can incorporate uncertainty into the predictions of a CNN. For
instance, instead of predicting a single output, a Bayesian CNN can output a distribution over possible outputs,
reflecting the uncertainty in the model's parameters.
• Markov Chain Monte Carlo (MCMC) refers to a family of
algorithms used to generate samples from a probability distribution
by constructing a Markov chain whose equilibrium distribution is
the target distribution we want to sample from. Once the chain has
"converged," we can use the samples to estimate properties of the
target distribution (e.g., mean, variance, quantiles).
• Initialization: Start the Markov chain at some initial point in the parameter space.
• Transition: Use a transition rule to move from the current state to the next state in the chain.
• Sampling: After running the chain for a sufficient amount of time (i.e., the chain has "burned in" and
reached equilibrium), collect samples from the chain.
• Convergence: The chain must converge to the target distribution; after that, samples can be used for
estimating properties of the distribution.
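These steps can be sketched with a random-walk Metropolis-Hastings sampler, one common MCMC algorithm (the standard-normal target, proposal width, and burn-in length here are illustrative choices):

```python
import numpy as np

def target_density(x):
    """Unnormalized density of a standard normal, our target distribution."""
    return np.exp(-0.5 * x * x)

rng = np.random.default_rng(4)
x = 0.0                      # Initialization: starting point of the chain
samples = []
for step in range(20000):
    proposal = x + rng.normal(0, 1.0)   # Transition: random-walk proposal
    accept_prob = min(1.0, target_density(proposal) / target_density(x))
    if rng.random() < accept_prob:
        x = proposal
    if step >= 2000:         # discard burn-in before the chain equilibrates
        samples.append(x)    # Sampling: collect states after convergence

samples = np.array(samples)
print(samples.mean(), samples.std())  # near 0 and 1, the target's mean and std
```

The collected samples estimate properties of the target distribution (mean, variance, quantiles) without ever normalizing it.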
• Gibbs Sampling is a special case of MCMC where the target distribution is sampled by
sequentially sampling from the conditional distributions of each variable, given the other
variables. It's particularly useful when the joint distribution of the variables is difficult to
sample from directly, but the conditional distributions are easier to handle.
• Key Concepts of Gibbs Sampling:
• Conditional Distribution: Given all but one of the variables in the system, the Gibbs sampler
samples the one variable conditioned on the others. This process is repeated for each variable
in the model.
• Sequential Sampling: Gibbs sampling works by iterating through each variable, updating it
by sampling from its conditional distribution, and repeating this process multiple times.
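The sequential conditional sampling can be sketched for a bivariate normal with correlation rho, where each conditional is a simple 1-D normal (the correlation value and chain lengths are illustrative):

```python
import numpy as np

# For a bivariate standard normal with correlation RHO, each variable's
# conditional given the other is N(RHO * other, 1 - RHO^2): easy to sample.
RHO = 0.8
rng = np.random.default_rng(5)
x, y = 0.0, 0.0
samples = []
for step in range(20000):
    # Sequential sampling: update each variable from its conditional in turn
    x = rng.normal(RHO * y, np.sqrt(1 - RHO**2))   # sample x | y
    y = rng.normal(RHO * x, np.sqrt(1 - RHO**2))   # sample y | x
    if step >= 2000:          # discard burn-in
        samples.append((x, y))

xs, ys = np.array(samples).T
corr = np.corrcoef(xs, ys)[0, 1]
print(corr)  # the sample correlation approaches RHO = 0.8
```

Note that we never sample from the joint distribution directly; only the two conditionals are used, which is exactly the setting where Gibbs sampling shines.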
• MCMC is a general family of sampling methods, and Gibbs Sampling is a specific, more
structured MCMC method that works well when you can easily compute the conditional
distributions for each variable.
• While MCMC can be applied to a wide range of problems, Gibbs Sampling is highly efficient
in certain types of models where the conditional distributions are easy to sample from.
• If the problem involves complex dependencies between variables, general MCMC methods
might be preferred, while Gibbs sampling is ideal when each variable has a simple conditional
distribution.
Thank You