Deep Learning
Fundamentals of Deep Learning, Perceptron Learning Algorithm, Probabilistic Modeling, Early Neural
Networks, How Deep Learning Differs from Machine Learning, Scalars, Vectors, Matrices, Higher
Dimensional Tensors, Manipulating Tensors, Vector Data, Time Series Data, Image Data, Video Data.
About Neural Network, Building Blocks of Neural Network, Optimizers, Activation Functions, Loss
Functions, Data Pre-processing for neural networks, Feature Engineering, Overfitting and Underfitting,
Hyperparameters.
About CNN, Linear Time Invariant, Image Processing Filtering, Building a convolutional neural network,
Input Layers, Convolution Layers, Pooling Layers, Dense Layers, Backpropagation Through the
Convolutional Layer, Filters and Feature Maps, Backpropagation Through the Pooling Layers, Dropout
Layers and Regularization, Batch Normalization, Various Activation Functions, Various Optimizers, LeNet,
AlexNet, VGG16, ResNet, Transfer Learning with Image Data, Transfer Learning using Inception Oxford
VGG Model, Google Inception Model, Microsoft ResNet Model, R-CNN, Fast R-CNN, Faster R-CNN,
Mask-RCNN, YOLO.
About NLP & its Toolkits, Language Modeling, Vector Space Model (VSM), Continuous Bag of Words
(CBOW), Skip-Gram Model for Word Embedding, Part of Speech (PoS) Global Co-occurrence Statistics-
based Word Vectors, Transfer Learning, Word2Vec, Global Vectors for Word Representation (GloVe),
Backpropagation Through Time, Bidirectional RNNs (BRNN), Long Short-Term Memory (LSTM),
Bidirectional LSTM, Sequence-to-Sequence Models (Seq2Seq), Gated Recurrent Unit (GRU).
About Deep Reinforcement Learning, Q-Learning, Deep Q-Network (DQN), Policy Gradient Methods,
Actor-Critic Algorithm, About Autoencoding, Convolutional Auto Encoding, Variational Auto Encoding,
Generative Adversarial Networks, Autoencoders for Feature Extraction, Auto Encoders for Classification,
Denoising Autoencoders, Sparse Autoencoders.
UNIT I DEEP LEARNING CONCEPTS
Fundamentals of Deep Learning, Perceptron Learning Algorithm, Probabilistic Modeling, Early Neural
Networks, How Deep Learning Differs from Machine Learning, Scalars, Vectors, Matrices, Higher
Dimensional Tensors, Manipulating Tensors, Vector Data, Time Series Data, Image Data, Video Data.
1.1 Fundamentals of Deep Learning
In the fast-evolving era of artificial intelligence, Deep Learning stands as a cornerstone technology,
revolutionizing how machines understand, learn, and interact with complex data. At its essence, Deep
Learning AI mimics the intricate neural networks of the human brain, enabling computers to autonomously
discover patterns and make decisions from vast amounts of unstructured data. This transformative field has
propelled breakthroughs across various domains, from computer vision and natural language processing to
healthcare diagnostics and autonomous driving.
As we dive into this introductory exploration of Deep Learning, we uncover its foundational
principles, applications, and the underlying mechanisms that empower machines to achieve human-like
cognitive abilities. This article serves as a gateway into understanding how Deep Learning is reshaping
industries, pushing the boundaries of what’s possible in AI, and paving the way for a future where intelligent
systems can perceive, comprehend, and innovate autonomously.
Deep learning is the branch of machine learning that is based on artificial neural network
architectures. An artificial neural network, or ANN, uses layers of interconnected nodes called neurons
that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the input layer.
The output of one neuron becomes the input to other neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network. The layers of the neural network
transform the input data through a series of nonlinear transformations, allowing the network to learn complex
representations of the input data.
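The layer-by-layer transformation described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a library implementation; the layer sizes and the ReLU nonlinearity are arbitrary choices:

```python
import numpy as np

def relu(x):
    # a common nonlinear activation: max(0, x) elementwise
    return np.maximum(0.0, x)

def forward(x, layers):
    """Pass input x through a list of (W, b) fully connected layers."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)   # each hidden layer applies a nonlinear transformation
    W, b = layers[-1]
    return W @ x + b          # the final layer produces the network output

# a tiny network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = forward(np.array([1.0, 2.0, 3.0]), layers)
print(y.shape)  # (2,)
```

Each hidden layer's output becomes the next layer's input, exactly as the paragraph above describes.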
Scope of Deep Learning
Today Deep learning AI has become one of the most popular and visible areas of machine learning,
due to its success in a variety of applications, such as computer vision, natural language processing, and
Reinforcement learning.
Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning, and it
uses different approaches for each:
• Supervised Machine Learning: Supervised learning is the technique in which a neural network learns
to make predictions or to classify data based on labeled datasets. Here we supply both the input
features and the target variables. The network learns by adjusting its weights to reduce the cost, or
error, computed from the difference between the predicted and actual targets; this weight-update
process uses backpropagation. Deep learning architectures such as convolutional neural networks
and recurrent neural networks are used for many supervised tasks like image classification and
recognition, sentiment analysis, and language translation.
• Unsupervised Machine Learning: The neural network learns to discover patterns in or to cluster a
dataset without labeled targets; autoencoders are a common deep learning model for such tasks.
• Reinforcement Machine Learning: The network acts as an agent that learns to take actions in an
environment so as to maximize a reward signal; deep Q-networks are a well-known example.
Machine learning and deep learning AI both are subsets of artificial intelligence but there are many
similarities and differences between them.
Machine Learning | Deep Learning
Takes less time to train the model. | Takes more time to train the model.
A model is created from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images; it is an end-to-end learning process.
Can work on a CPU, or requires less computing power compared to deep learning. | Requires a high-performance computer with a GPU.
Types of neural networks
Deep Learning models are able to automatically learn features from the data, which makes them well-
suited for tasks such as image recognition, speech recognition, and natural language processing. The most
widely used architectures in deep learning are feedforward neural networks, convolutional neural networks
(CNNs), and recurrent neural networks (RNNs).
1. Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of information
through the network. FNNs have been widely used for tasks such as image classification, speech
recognition, and natural language processing.
2. Convolutional Neural Networks (CNNs) are designed specifically for image and video recognition tasks.
CNNs are able to automatically learn features from the images, which makes them well-suited for
tasks such as image classification, object detection, and image segmentation.
3. Recurrent Neural Networks (RNNs) are a type of neural network that is able to process sequential
data, such as time series and natural language. RNNs are able to maintain an internal state that
captures information about the previous inputs, which makes them well-suited for tasks such as
speech recognition, natural language processing, and language translation.
The main applications of deep learning AI can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.
1. Computer vision
The first deep learning application area is computer vision, in which deep learning models enable
machines to identify and understand visual data. Some of the main applications of deep learning in
computer vision include:
• Object detection and recognition: Deep learning models can be used to identify and locate objects
within images and videos, enabling applications such as self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into categories such as
animals, plants, and buildings. This is used in applications such as medical imaging, quality control,
and image retrieval.
• Image segmentation: Deep learning models can be used to segment images into different regions,
making it possible to identify specific features within images.
2. Natural language processing (NLP):
The second major deep learning application area is NLP. In NLP, deep learning models enable
machines to understand and generate human language. Some of the main applications of deep learning
in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of text, and new text such
as summaries or essays can then be generated automatically using these trained models.
• Language translation: Deep learning models can translate text from one language to another,
making it possible to communicate with people from different linguistic backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text, making it
possible to determine whether the text is positive, negative, or neutral. This is used in applications
such as customer service, social media monitoring, and political analysis.
• Speech recognition: Deep learning models can recognize and transcribe spoken words, making it
possible to perform tasks such as speech-to-text conversion, voice search, and voice-controlled
devices.
3. Reinforcement learning:
In reinforcement learning, deep learning is used to train agents that take actions in an environment so
as to maximize a reward. Some of the main applications of deep learning in reinforcement learning include:
• Game playing: Deep reinforcement learning models have been able to beat human experts at games
such as Go, chess, and Atari video games.
• Robotics: Deep reinforcement learning models can be used to train robots to perform complex tasks
such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control complex systems such
as power grids, traffic management, and supply chain optimization.
Deep learning has made significant advancements in various fields, but there are still some challenges
that need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough
training data is a major concern.
2. Computational resources: Training a deep learning model is computationally expensive and
typically requires specialized hardware such as GPUs and TPUs.
3. Time-consuming: Depending on the computational resources available, training on sequential data
can take a very long time, sometimes days or even months.
4. Interpretability: Deep learning models are complex and work like a black box, so it is very difficult
to interpret their results.
5. Overfitting: When a model is trained for too long, it becomes too specialized to the training data,
leading to overfitting and poor performance on new data.
Advantages of deep learning:
1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in various tasks,
such as image recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically discover and learn
relevant features from data without the need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets, and can learn
from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can handle various
types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their performance as
more data becomes available.
Disadvantages of deep learning:
1. High computational requirements: Deep Learning models require large amounts of data and
computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a large amount of
labeled data for training, which can be expensive and time-consuming to acquire.
3. Black-box nature: Deep Learning models are often treated as black boxes, making it difficult to
understand how they work and how they arrived at their predictions.
1.2 Perceptron Learning Algorithm
Our goal is to find a weight vector w that can perfectly classify the positive inputs and negative
inputs in our data. I will get straight to the algorithm. Here goes:
We initialize w with some random vector. We then iterate over all the examples in the data (P ∪ N),
both positive and negative. If an input x belongs to P, the dot product w.x should ideally be greater than
or equal to 0, because that's what our perceptron wants at the end of the day, so let's give it that. And
if x belongs to N, the dot product MUST be less than 0. The training loop therefore updates w in only
two cases:
Case 1: When x belongs to P and its dot product w.x < 0
Case 2: When x belongs to N and its dot product w.x ≥ 0
Only in these cases do we update our randomly initialized w; otherwise, we don't touch w at all,
because Case 1 and Case 2 are violating the very rule of a perceptron. So we add x to w (ahem, vector
addition, ahem) in Case 1 and subtract x from w in Case 2.
But why would this work? If you already see why, you've got the entire gist of my post and you can
now move on with your life, thanks for reading, bye. But if you are not sure why these seemingly
arbitrary operations on x and w would help you learn that perfect w that can perfectly classify P and N,
stick with me.
We have already established that when x belongs to P, we want w.x ≥ 0, the basic perceptron rule.
What we also mean by that is that when x belongs to P, the angle between w and x should be ____ than
90 degrees. Fill in the blank.
Answer: The angle between w and x should be less than 90 degrees, because w.x = |w||x|cos(α), so the
sign of the dot product is the sign of the cosine of the angle.
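A quick numerical check of this relationship (the two example vectors are arbitrary choices):

```python
import numpy as np

w = np.array([1.0, 0.0])
x_pos = np.array([1.0, 1.0])   # 45 degrees from w  -> positive cosine
x_neg = np.array([-1.0, 1.0])  # 135 degrees from w -> negative cosine

# w.x = |w||x|cos(angle), so the sign of the dot product tracks the angle
print(np.dot(w, x_pos))  # 1.0  (angle < 90 degrees)
print(np.dot(w, x_neg))  # -1.0 (angle > 90 degrees)
```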
So whatever the w vector may be, as long as it makes an angle of less than 90 degrees with the positive
example data vectors (x ∈ P) and an angle of more than 90 degrees with the negative example data vectors
(x ∈ N), we are fine.
So when we add x to w, which we do when x belongs to P and w.x < 0 (Case 1), we are essentially
increasing the cos(α) value, which means we are decreasing α, the angle between w and x, which is what
we desire. A similar intuition works for the case when x belongs to N and w.x ≥ 0 (Case 2).
Here's a toy simulation of how we might end up learning a w that makes an angle of less than 90
degrees with the positive examples and more than 90 degrees with the negative examples.
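The update loop described above can be sketched in NumPy. This is an illustrative sketch: the toy data is made up, and the iteration cap is an added safeguard (the classic algorithm simply loops until no example is misclassified):

```python
import numpy as np

def perceptron_learn(P, N, max_iters=1000):
    """Learn w such that w.x >= 0 for x in P and w.x < 0 for x in N."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(len(P[0]))   # random initial w
    for _ in range(max_iters):
        updated = False
        for x in P:
            if np.dot(w, x) < 0:         # Case 1: add x to w
                w = w + x
                updated = True
        for x in N:
            if np.dot(w, x) >= 0:        # Case 2: subtract x from w
                w = w - x
                updated = True
        if not updated:                  # every example classified correctly
            return w
    return w

# toy data, linearly separable by a line through the origin
P = [np.array([2.0, 1.0]), np.array([1.0, 2.0])]
N = [np.array([-1.0, -2.0]), np.array([-2.0, -1.0])]
w = perceptron_learn(P, N)
print(all(np.dot(w, x) >= 0 for x in P))  # True
print(all(np.dot(w, x) < 0 for x in N))   # True
```

For linearly separable data, the perceptron convergence theorem guarantees this loop terminates after a finite number of updates.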
1.3 Probabilistic Modeling
Probabilistic models are an essential component of machine learning, which aims to learn patterns
from data and make predictions on new, unseen data. They are statistical models that capture the inherent
uncertainty in data and incorporate it into their predictions. Probabilistic models are used in various
applications such as image and speech recognition, natural language processing, and recommendation
systems. In recent years, significant progress has been made in developing probabilistic models that can
handle large datasets efficiently.
Probabilistic models fall into three broad categories:
• Generative models
• Discriminative models
• Graphical models
Generative models:
Generative models aim to model the joint distribution of the input and output variables. These models
generate new data based on the probability distribution of the original dataset. Generative models are
powerful because they can generate new data that resembles the training data. They can be used for tasks
such as image and speech synthesis, language translation, and text generation.
Discriminative models
The discriminative model aims to model the conditional distribution of the output variable given the
input variable. They learn a decision boundary that separates the different classes of the output variable.
Discriminative models are useful when the focus is on making accurate predictions rather than generating
new data. They can be used for tasks such as image recognition, speech recognition, and sentiment analysis.
Graphical models
These models use graphical representations to show the conditional dependence between variables.
They are commonly used for tasks such as image recognition, natural language processing, and causal
inference.
The Naive Bayes algorithm is a widely used approach in probabilistic models, demonstrating
remarkable efficiency and effectiveness in solving classification problems. By leveraging the power of the
Bayes theorem and making simplifying assumptions about feature independence, the algorithm calculates
the probability of the target class given the feature set. This method has found diverse applications across
various industries, ranging from spam filtering to medical diagnosis. Despite its simplicity, the Naive Bayes
algorithm has proven to be highly robust, providing rapid results in a multitude of real-world problems.
Naive Bayes is a probabilistic algorithm that is used for classification problems. It is based on the
Bayes theorem of probability and assumes that the features are conditionally independent of each other given
the class. The Naive Bayes Algorithm is used to calculate the probability of a given sample belonging to a
particular class. This is done by calculating the posterior probability of each class given the sample and then
selecting the class with the highest posterior probability as the predicted class.
1. Collect a labeled dataset of samples, where each sample has a set of features and a class label.
2. For each feature in the dataset, calculate the conditional probability of the feature given the class:
count the number of times the feature occurs in samples of the class and divide by the total number
of samples in the class.
3. Calculate the prior probability of each class by counting the number of samples in each class and
dividing by the total number of samples in the dataset.
4. Given a new sample with a set of features, calculate the posterior probability of each class using the
Bayes theorem and the conditional and prior probabilities calculated in steps 2 and 3.
5. Select the class with the highest posterior probability as the predicted class for the new sample.
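The steps above can be sketched for categorical features. This is an illustrative implementation: the tiny weather dataset is made up, and no smoothing is applied for feature values unseen in a class:

```python
from collections import Counter

def train_nb(samples, labels):
    """Return priors P(c) and per-feature conditional tables P(feature_i = v | c)."""
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}   # step 3
    cond = {}
    for c in priors:                                          # step 2
        rows = [s for s, y in zip(samples, labels) if y == c]
        cond[c] = []
        for i in range(len(samples[0])):
            counts = Counter(r[i] for r in rows)
            cond[c].append({v: k / len(rows) for v, k in counts.items()})
    return priors, cond

def predict_nb(priors, cond, x):
    """Pick the class with the highest posterior (steps 4 and 5)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, v in enumerate(x):
            p *= cond[c][i].get(v, 0.0)   # unseen value -> probability 0
        scores[c] = p
    return max(scores, key=scores.get)

samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
priors, cond = train_nb(samples, labels)
print(predict_nb(priors, cond, ("rainy", "mild")))  # yes
```

In practice a small smoothing constant (Laplace smoothing) is added to the counts so that one unseen feature value does not zero out an entire class.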
Deep learning, a subset of machine learning, also relies on probabilistic models. Probabilistic models
are used to optimize complex models with many parameters, such as neural networks. By incorporating
uncertainty into the model training process, deep learning algorithms can provide higher accuracy and
generalization capabilities. One popular technique is variational inference, which allows for efficient
estimation of posterior distributions.
• Probabilistic models play a crucial role in the field of machine learning, providing a framework for
understanding the underlying patterns and complexities in massive datasets.
• Probabilistic models provide a natural way to reason about the likelihood of different outcomes and
can help us understand the underlying structure of the data.
• Probabilistic models help enable researchers and practitioners to make informed decisions when
faced with uncertainty.
• Probabilistic models allow us to perform Bayesian inference, which is a powerful method for
updating our beliefs about a hypothesis based on new data. This can be particularly useful in
situations where we need to make decisions under uncertainty.
• Probabilistic models are an increasingly popular method in many fields, including artificial
intelligence, finance, and healthcare.
• The main advantage of these models is their ability to take into account uncertainty and variability
in data. This allows for more accurate predictions and decision-making, particularly in complex and
unpredictable situations.
• Probabilistic models can also provide insights into how different factors influence outcomes and can
help identify patterns and relationships within data.
• One of the disadvantages is the potential for overfitting, where the model is too specific to the training
data and doesn’t perform well on new data.
• Not all data fits well into a probabilistic framework, which can limit the usefulness of these models
in certain applications.
• Another challenge is that probabilistic models can be computationally intensive and require
significant resources to develop and implement.
1.4 Early Neural Networks
Since the 1940s, there have been a number of noteworthy advancements in the field of neural networks:
• 1960s-1970s: Perceptrons
This era is defined by Rosenblatt's work on perceptrons. Perceptrons are single-layer networks
whose applicability was limited to problems that are linearly separable.
1.5 How Deep Learning Differs from Machine Learning
Deep Learning and Machine Learning are both subsets of artificial intelligence, but they differ in
several fundamental ways. Here's a concise comparison:
Machine Learning (ML)
1. Definition:
- A field of artificial intelligence that uses statistical techniques to give computers the ability to learn from
data without being explicitly programmed.
2. Algorithms:
- Includes a variety of algorithms such as linear regression, logistic regression, decision trees, support
vector machines (SVM), k-nearest neighbors (KNN), and clustering algorithms like k-means.
3. Feature Engineering:
- Requires significant manual feature extraction. Humans need to identify and handcraft features from raw
data to improve model performance.
4. Data Requirements:
- Can perform well with relatively small datasets, depending on the complexity of the problem and the
algorithm used.
5. Applications:
- Used in applications like fraud detection, recommendation systems, predictive maintenance, and simple
image classification.
Deep Learning (DL)
1. Definition:
- A subset of machine learning that uses neural networks with many layers (deep neural networks) to model
complex patterns in large datasets.
2. Algorithms:
- Primarily involves neural network architectures such as Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Generative
Adversarial Networks (GANs).
3. Feature Engineering:
- Automates feature extraction. Deep learning models automatically discover representations needed for
feature detection, eliminating the need for manual intervention.
4. Data Requirements:
- Requires large amounts of labeled data and significant computational power for training, often leveraging
GPUs and TPUs.
5. Applications:
- Used in more complex and data-intensive applications such as image and speech recognition, natural
language processing, autonomous driving, and deep reinforcement learning.
Key Differences
1. Model Complexity:
- ML: Typically uses simpler models with relatively few parameters.
- DL: Utilizes deep models with many layers (hence "deep" learning), allowing for the learning of more
complex data representations.
2. Data Handling:
- ML: May struggle with very large and complex datasets; performance often relies heavily on feature
engineering.
- DL: Excels with large datasets, as deep models can capture intricate patterns and dependencies.
3. Training Time:
- ML: Generally faster to train, particularly with simpler models and smaller datasets.
- DL: Requires longer training times due to the complexity and size of the networks, though advancements
like transfer learning and pre-trained models help mitigate this.
4. Computational Resources:
- ML: Can typically run on standard CPUs.
- DL: Often requires specialized hardware (GPUs/TPUs) to efficiently handle the computation.
5. Interpretability:
- ML: Models like decision trees and linear regression are more interpretable, making it easier to
understand the decision process.
- DL: Deep learning models are often seen as "black boxes" due to their complexity, making them harder
to interpret.
Summary
- Machine Learning involves using algorithms to parse data, learn from it, and make decisions based on what
has been learned, with an emphasis on feature engineering and interpretability.
- Deep Learning is a subset of machine learning that focuses on using deep neural networks to automatically
learn representations from data, excelling in handling large-scale, complex datasets with minimal human
intervention in feature selection.
1.6 Scalars
• Scalars are singular numerical entities within the realm of Data Science, devoid of any directional
attributes.
• They serve as the elemental components utilized in mathematical computations and algorithmic
frameworks across these domains. In practical terms, scalars often represent fundamental quantities
such as constants, probabilities, or error metrics.
• For instance, within Machine Learning, a scalar may denote the accuracy of a model or the value of
a loss function. Similarly, in Data Science, scalars are employed to encapsulate statistical metrics
like mean, variance, or correlation coefficients. Despite their apparent simplicity, scalars assume a
critical role in various AI-ML-DS tasks, spanning optimization, regression analysis, and
classification algorithms. Proficiency in understanding scalars forms the bedrock for comprehending
more intricate concepts prevalent in these fields.
1.7 Vectors
• Vectors, within the context of Data Science, represent ordered collections of numerical values
endowed with both magnitude and directionality. They serve as indispensable tools for representing
features, observations, and model parameters within AI-ML-DS workflows.
• In Artificial Intelligence, vectors find application in feature representation, where each dimension
corresponds to a distinct feature of the dataset.
• In Machine Learning, vectors play a pivotal role in encapsulating data points, model parameters, and
gradient computations during the training process. Moreover, within DS, vectors facilitate tasks like
data visualization, clustering, and dimensionality reduction. Mastery over vector concepts is
paramount for engaging in activities like linear algebraic operations, optimization via gradient
descent, and the construction of complex neural network architectures.
1.8 Matrices
• Matrices, as two-dimensional arrays of numerical values, enjoy widespread utility across AI-ML-DS
endeavors. They serve as foundational structures for organizing and manipulating tabular data,
wherein rows typically represent observations and columns denote features or variables.
• In the domain of AI, matrices find application in representing weight matrices within neural
networks, with each element signifying the synaptic connection strength between neurons. Similarly,
within ML, matrices serve as repositories for datasets, building kernel matrices for support vector
machines, and implementing dimensionality reduction techniques such as principal component
analysis. Within DS, matrices are indispensable for data preprocessing, transformation, and model
assessment tasks.
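The three objects described in sections 1.6-1.8 correspond directly to NumPy arrays of increasing rank; a minimal sketch (the values are arbitrary examples):

```python
import numpy as np

s = np.array(12)                        # scalar: a 0D tensor
v = np.array([12, 3, 6, 14])            # vector: a 1D tensor
m = np.array([[5, 78, 2], [6, 79, 3]])  # matrix: a 2D tensor

print(s.ndim, v.ndim, m.ndim)  # 0 1 2
print(m.shape)                 # (2, 3)
```

The `ndim` attribute gives the rank (number of axes) of each tensor, and `shape` gives its size along each axis.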
1.9 Higher Dimensional Tensors
These matrices can be combined into a new array to create a 3D tensor, which can be seen as a cube
of numbers. Listed below is a NumPy 3D tensor of shape (3, 3, 5):

x = np.array([[[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]],
              [[5, 78, 2, 34, 0],
               [6, 79, 3, 35, 1],
               [7, 80, 4, 36, 2]]])

A 4D tensor can be produced by stacking 3D tensors in an array, and so on. In deep learning, you typically
work with tensors that range from 0D to 4D, though if you're processing video data, you might go as high
as 5D.
1.10 Manipulating Tensors
One common operation on tensors in deep learning is to change the tensor shape. For example, you
may want to convert a 2D tensor into a 1D tensor, or add a dummy dimension to a tensor. You may also
want to extract a sub-tensor from a larger tensor.
In PyTorch, you can create a random tensor and inspect it:

import torch

a = torch.randn(3, 4, 5)
print(a.shape)

This prints:

torch.Size([3, 4, 5])

Indexing with an integer removes an axis. If you use:

print(a[1])

you get a sub-tensor of shape (4, 5). Slicing keeps the axes, so:

print(a[1:, 2:4])

gives a sub-tensor of shape (2, 2, 5).
You can also use the same slicing syntax to add a new dimension, using None to insert a new axis at a
specific place. For example:

print(a[:, None, :, None].shape)

prints torch.Size([3, 1, 4, 1, 5]). This is useful if, for example, you need to convert an image into a batch
of only one image. If you're familiar with NumPy, you may recall there is a function expand_dims() for
this purpose, but PyTorch doesn't provide it. A similar function is unsqueeze(), which is demonstrated
below:

b = torch.unsqueeze(a, dim=2)
print(a.shape)
print(b.shape)

This prints:

torch.Size([3, 4, 5])
torch.Size([3, 4, 1, 5])
One powerful feature of NumPy slicing syntax is Boolean indexing, which is also supported with PyTorch
tensors. For example:

a = torch.randn(3, 4)
print(a)
print(a[:, (a > -1).all(dim=0)])

The above selects the columns where all elements are greater than -1. You can also manipulate the tensor
by selecting specific columns with an index list:

print(a[:, [1, 0, 0, 1]])
To flatten a tensor into one dimension, use ravel():

a = torch.randn(3, 4)
print(a)
print(a.ravel())

You may also use the reshape() function to achieve the same:

print(a.reshape(-1))

The result should be the same as that of ravel(). But usually, the reshape() function is for more complicated
target shapes:

print(a.reshape(3, 2, 2))

tensor([[[-0.2718, -0.8309],
         [ 0.6263, -0.2499]],

        [[-0.1780,  1.1735],
         [-1.3530, -1.2374]],

        [[-0.6050, -1.5524],
         [-0.1008, -1.2782]]])
One common case of reshaping tensors is to perform a matrix transpose. For a 2D matrix, it is easily done
in the same way as in NumPy:

print(a.T)

which prints:

tensor([[-0.2718, -0.1780, -0.6050],
        [-0.8309,  1.1735, -1.5524],
        [ 0.6263, -1.3530, -0.1008],
        [-0.2499, -1.2374, -1.2782]])

But the transpose() function in PyTorch requires you to specify which axes to swap explicitly:

print(a.transpose(0, 1))
This result is the same as above. If you have multiple tensors, you can combine them by stacking them
(vstack() for vertical and hstack() for horizontal stacking). For example:

a = torch.randn(3, 3)
b = torch.randn(3, 3)
print(a)
print(b)
print(torch.vstack([a, b]))

A stacked tensor can also be split back apart:

c = torch.vstack([a, b])
print(c)
print(torch.chunk(c, 2))

The chunk() function tells how many tensors to split into, rather than what size each tensor is. The latter
is indeed more useful in deep learning (e.g., to split a tensor holding a large dataset into many tensors of
small batches). The equivalent function would be:

print(torch.split(c, 3, dim=0))

This should give you the same result as before: split(c, 3, dim=0) means to split on dimension 0 such that
each resulting tensor will be of size 3.
1.11 Vector Data:
The most typical scenario is this. A batch of data in a dataset will be stored as a 2D tensor (i.e., an
array of vectors), in which the first axis is the samples axis and the second axis is the features axis. Each
individual data point in such a dataset is stored as a vector. Let’s focus on two instances:
• A statistical dataset of consumers, where each individual’s age, height, and gender are taken into
account. Since each individual may be represented as a vector of three values, the full dataset of 100
individuals can be stored in a 2D tensor of the shape (100, 3).
• A collection of textual information in which each article is represented by the number of times each
word occurs in it (out of a dictionary of 2000 common words). A full dataset of 50 articles can be
kept in a tensor of shape (50, 2000), since each article can be represented as a vector of 2000 values
(one count per word in the dictionary).
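The consumer dataset above can be built as a 2D tensor like this (the random values are placeholders for real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 80, size=100)    # feature 1
heights = rng.normal(170, 10, size=100)  # feature 2
genders = rng.integers(0, 2, size=100)   # feature 3

# samples axis first, features axis second
dataset = np.stack([ages, heights, genders], axis=1)
print(dataset.shape)  # (100, 3)
```

Each row `dataset[i]` is the 3-value vector describing one individual.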
1.12 Time Series Data
Whenever time (or the idea of sequence order) is important, data should be stored in a 3D tensor
with an explicit time axis. Each instance is represented as a series of vectors (a 2D tensor), so a batch of
data is encoded as a 3D tensor. By convention, the time axis is the second axis (axis of index 1). Let's
examine a couple of instances:
• A stock price dataset: we keep track of the stock's current market price as well as its highest and
lowest prices from the previous hour. Since there are 390 minutes in a trading day and each minute
is encoded as a vector of 3 values, a trading day can be represented as a 2D tensor of shape (390, 3),
and 250 days' worth of data can be kept in a 3D tensor of shape (250, 390, 3). Each sample, in this
case, corresponds to one day's worth of data.
• Tweets dataset: suppose each tweet is represented by 300 characters drawn from an alphabet of 125
distinct characters. Each character can then be encoded as a binary vector of size 125 that is all zeros
except for a one at the character-specific index. Encoding each tweet as a 2D tensor of shape
(300, 125), a dataset of 10 million tweets can be kept in a tensor of shape (10000000, 300, 125).
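The stock price example maps to a 3D tensor as follows (random numbers stand in for real prices):

```python
import numpy as np

rng = np.random.default_rng(0)
# (samples, timesteps, features): 250 trading days, 390 minutes, 3 price features
stock_data = rng.random((250, 390, 3))

one_day = stock_data[0]   # one sample: a 2D tensor of shape (390, 3)
one_minute = one_day[0]   # one timestep: a vector of 3 values
print(stock_data.shape, one_day.shape, one_minute.shape)  # (250, 390, 3) (390, 3) (3,)
```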
1.13 Image Data
Height, width, and colour depth are the three dimensions that most images have. By convention,
image tensors are always 3D, with a one-dimensional colour channel for grayscale images, even though
grayscale images (like the MNIST digits) have only a single colour channel and could therefore be stored
in 2D tensors. Thus, a tensor of shape (32, 64, 64, 1) might be used to store a batch of 32 grayscale photos
of size 64 x 64, while a tensor of shape (32, 64, 64, 3) could be used to store a batch of 32 colour images.
The channels-last format (used by TensorFlow) and the channels-first format (used by Theano) are
the two standards for the shapes of image tensors. TensorFlow places the colour-depth axis at the end of the
list of dimensions: (samples, height, width, colour-depth). Theano places the colour-depth axis right after
the batch axis: (samples, colour-depth, height, width). The previous examples would then become
(32, 1, 64, 64) and (32, 3, 64, 64). Both formats are supported by the Keras framework.
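Converting between the two conventions is a simple axis permutation; a minimal NumPy sketch:

```python
import numpy as np

# A batch of 32 colour images of size 64 x 64 in channels-last order (TensorFlow).
batch_last = np.zeros((32, 64, 64, 3))

# Move the colour-depth axis right after the batch axis (Theano's channels-first).
batch_first = np.transpose(batch_last, (0, 3, 1, 2))

print(batch_first.shape)  # (32, 3, 64, 64)
```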
Among the few real-world data types for which you’ll need 5D tensors is video data. A video can
be viewed as a sequence of coloured images called frames. Since each frame can be kept in a 3D tensor
(height, width, colour-depth), a sequence of frames can be saved in a 4D tensor (frames, height, width,
colour-depth), and a batch of different videos can be saved in a 5D tensor of shape (samples, frames, height,
width, colour-depth).
For instance, 240 frames would be present in a 60-second, 144 x 256 YouTube video clip sampled
at 4 frames per second. Four such video clips would be saved as a batch in a tensor of shape (4, 240, 144,
256, 3). That is 106,168,320 values in all. If the dtype of the tensor were float32, each value would be
recorded in 32 bits, so the tensor would occupy 405 MB. Heavy! Videos you encounter in real life are
significantly lighter, because they aren’t stored in float32 and are often greatly compressed (such as in the
MPEG format).
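The 405 MB figure can be checked directly:

```python
# Number of values in the (4, 240, 144, 256, 3) batch described above.
values = 4 * 240 * 144 * 256 * 3
print(values)  # 106168320

# Stored as float32, each value takes 4 bytes (32 bits).
megabytes = values * 4 / (1024 * 1024)
print(megabytes)  # 405.0
```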
UNIT II NEURAL NETWORKS
About Neural Network, Building Blocks of Neural Network, Optimizers, Activation Functions, Loss
Functions, Data Pre-processing for neural networks, Feature Engineering, Over fitting and Under fitting,
Hyper parameters.
Neural networks extract identifying features from data, lacking pre-programmed understanding.
Network components include neurons, connections, weights, biases, propagation functions, and a learning
rule. Neurons receive inputs, governed by thresholds and activation functions. Connections involve weights
and biases regulating information transfer. Learning, adjusting weights and biases, occurs in three stages:
input computation, output generation, and iterative refinement enhancing the network’s proficiency in
diverse tasks.
These include:
1. First, the neural network is stimulated by an environment.
2. Then the free parameters of the neural network are changed as a result of this stimulation.
3. The neural network then responds in a new way to the environment because of the changes in its free
parameters.
The ability of neural networks to identify patterns, solve intricate puzzles, and adjust to changing
surroundings is essential. Their capacity to learn from data has far-reaching effects, ranging from
revolutionizing technology like natural language processing and self-driving automobiles to automating
decision-making processes and increasing efficiency in numerous industries.
The development of artificial intelligence is largely dependent on neural networks, which also drive
innovation and influence the direction of technology.
Consider a neural network for email classification. The input layer takes features like email content,
sender information, and subject. These inputs, multiplied by adjusted weights, pass through hidden layers.
The network, through training, learns to recognize patterns indicating whether an email is spam or not. The
output layer, with a binary activation function, predicts whether the email is spam (1) or not (0). As the
network iteratively refines its weights through backpropagation, it becomes adept at distinguishing between
spam and legitimate emails, showcasing the practicality of neural networks in real-world applications like
email filtering.
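A minimal sketch of that final output step, with hypothetical hand-set weights (in a trained network, backpropagation would learn these values from labeled examples):

```python
import math

# Hypothetical weights for three features:
# (suspicious-word count, unknown-sender flag, all-caps-subject flag).
weights = [1.2, 0.8, 0.6]
bias = -1.0

def spam_probability(features):
    # Weighted sum of the inputs, passed through a sigmoid activation.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 / (1 + math.exp(-z))

print(spam_probability([3, 1, 1]) > 0.5)  # True  -> classified as spam (1)
print(spam_probability([0, 0, 0]) > 0.5)  # False -> classified as not spam (0)
```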
Neural networks are complex systems that mimic some features of the functioning of the human
brain. A network is composed of layers of coupled artificial neurons: an input layer, one or more hidden
layers, and an output layer. The two stages of the basic process are called forward propagation and
backpropagation.
Forward Propagation
• Input Layer: Each feature in the input layer is represented by a node on the network, which receives
input data.
• Weights and Connections: The weight of each neuronal connection indicates how strong the
connection is. Throughout training, these weights are changed.
• Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by weights, adding
them up, and then passing them through an activation function. By doing this, non-linearity is
introduced, enabling the network to recognize intricate patterns.
• Output: The final result is produced by repeating the process until the output layer is reached.
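The forward pass described above can be sketched in a few lines of NumPy (random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 4 input features -> 3 hidden neurons -> 1 output neuron.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

hidden = relu(W1 @ x + b1)          # weighted sum + non-linear activation
output = sigmoid(W2 @ hidden + b2)  # final result at the output layer
print(output.shape)  # (1,)
```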
Backpropagation
• Loss Calculation: The network’s output is evaluated against the real goal values, and a loss function
is used to compute the difference. For a regression problem, the Mean Squared Error (MSE) is
commonly used as the cost function.
Loss Function: MSE = (1/n) * Σ (y_i - ŷ_i)^2, where y_i is the target value and ŷ_i is the network’s
prediction for sample i.
• Gradient Descent: Gradient descent is then used by the network to reduce the loss. To lower the
inaccuracy, weights are changed based on the derivative of the loss with respect to each weight.
• Adjusting weights: The weights are adjusted at each connection by applying this iterative process,
or backpropagation, backward across the network.
• Training: During training with different data samples, the entire process of forward propagation,
loss calculation, and backpropagation is done iteratively, enabling the network to adapt and learn
patterns from the data.
• Activation Functions: Model non-linearity is introduced by activation functions like the rectified
linear unit (ReLU) or sigmoid. Whether a neuron “fires” is decided based on its whole
weighted input.
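For a single linear neuron with a squared-error loss, one full cycle of forward propagation, loss calculation, and weight adjustment looks like this (all numbers are illustrative):

```python
# One gradient-descent update for a single linear neuron trained with MSE.
w, b = 0.0, 0.0          # weight and bias before the update
x, y = 2.0, 3.0          # one training sample (input, target)
lr = 0.1                 # learning rate

y_hat = w * x + b        # forward propagation
loss = (y_hat - y) ** 2  # squared error

# Backpropagation: derivative of the loss with respect to each parameter.
dw = 2 * (y_hat - y) * x
db = 2 * (y_hat - y)

# Adjust the parameters against the gradient to lower the error.
w -= lr * dw
b -= lr * db
print(w, b)  # 1.2 0.6
```

Repeating this cycle over many samples is exactly the iterative training loop described in the bullets above.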
In supervised learning, the neural network is guided by a teacher who has access to both input-output
pairs. The network creates outputs based on inputs without taking into account the surroundings. By
comparing these outputs to the teacher-known desired outputs, an error signal is generated. In order to reduce
errors, the network’s parameters are changed iteratively and stop when performance is at an acceptable level.
Equivalent output variables are absent in unsupervised learning. Its main goal is to comprehend
incoming data’s (X) underlying structure. No instructor is present to offer advice. Modeling data patterns
and relationships is the intended outcome instead. Words like regression and classification are related to
supervised learning, whereas unsupervised learning is associated with clustering and association.
In reinforcement learning, the network gains knowledge through interaction with the environment
and feedback in the form of rewards or penalties. The goal for the network is to find a policy or strategy
that optimizes cumulative rewards over time. This kind is frequently utilized in gaming and decision-making
applications.
Types of Neural Networks
• Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or more
layers, including an input layer, one or more hidden layers, and an output layer. It uses nonlinear
activation functions.
• Recurrent Neural Network (RNN): An artificial neural network type intended for sequential data
processing is called a Recurrent Neural Network (RNN). It is appropriate for applications where
contextual dependencies are critical, such as time series prediction and natural language processing,
since it makes use of feedback loops, which enable information to survive within the network.
• Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome the
vanishing gradient problem in training RNNs. It uses memory cells and gates to selectively read,
write, and erase information.
Neural networks are made of shorter modules or building blocks, same as atoms in matter and logic
gates in electronic circuits. Once we know what the blocks are, we can combine them to solve a variety of
problems.
Processing of an artificial neural network depends upon the following three building blocks:
• Network Topology
• Activation functions
• Adjustments of weights, or learning
Network Topology:
The topology of a neural network refers to the way neurons are connected, and it is a significant
factor in network functioning and learning. A common topology in unsupervised learning is a direct mapping
of inputs to a group of units that represents categories, for example, self-organizing maps. The most widely
used topology in supervised learning is the fully connected, three-layer, feedforward network
(Backpropagation, Radial Basis Function networks). All input values are connected to all neurons in the
hidden layer (hidden because they are not visible in the input or the output), the outputs of the hidden
neurons are connected to all neurons in the output layer, and the activation functions of the output neurons
determine the output of the entire network. Such networks are popular partly because, theoretically, they
are known to be universal function approximators when given a suitable activation function, for example, a
sigmoid or Gaussian.
Feedforward network:
The development of layered feed-forward networks began in the late 1950s, with
Rosenblatt's perceptron and Widrow's Adaptive Linear Element (ADALINE). The perceptron and
ADALINE can be defined as single-layer networks and are usually referred to as single-layer perceptrons.
Single-layer perceptrons can only solve linearly separable problems. The limitations of the single-layer
network prompted the development of multi-layer feed-forward networks with at least one hidden
layer, called multi-layer perceptron (MLP) networks. MLP networks overcome various limitations of single-
layer perceptrons and can be trained using the backpropagation algorithm. The backpropagation
method was invented independently several times.
In 1974, Werbos created a backpropagation training algorithm. However, Werbos's work remained
unknown in the scientific community, and in 1985, Parker rediscovered the technique. Soon after Parker
published his findings, Rumelhart, Hinton, and Williams also rediscovered the method. It is the efforts
of Rumelhart and the other members of the Parallel Distributed Processing (PDP) group that made the
backpropagation method a pillar of neurocomputing.
Rosenblatt first constructed the single-layer feedforward network in the late 1950s and early 1960s.
It is the simplest feedforward artificial neural network, having just one weighted layer. In other words, the
input layer is fully connected to the output layer.
Multilayer feedforward network:
A multilayer feedforward network contains one or more hidden layers between the input and output
layers; signals flow strictly forward, from the inputs through the hidden layers to the outputs, with no
feedback paths.
Feedback network:
A feedback-based prediction refers to an approximation of an outcome obtained iteratively, where each
iteration's operation depends on the present outcome. Feedback is a common way of making predictions in
different fields, ranging from control theory to psychology. Feedback connections are also widely used
by biological organisms, and a vital role for feedback in complex cognition has been proposed for the brain.
In other words, a feedback network has feedback paths, which means the signal can
flow in both directions using loops. This makes it a non-linear dynamic system, which changes continuously
until it reaches an equilibrium state. It may be divided into the following types:
Recurrent network:
The human brain is a recurrent neural network: a network of neurons with feedback
connections. It can learn many behaviors, sequence-processing tasks, algorithms, and programs that are
not learnable by conventional learning techniques. This explains the rapidly growing interest in artificial
recurrent networks for technical applications: for example, general computers that can learn algorithms to
map input sequences to output sequences, with or without a teacher. They are computationally
more powerful and biologically more plausible than other adaptive approaches such as Hidden
Markov models (no continuous internal states), feedforward networks, and support vector machines (no
internal states).
Fully recurrent network:
The most straightforward form of a fully recurrent neural network is a Multi-Layer Perceptron (MLP)
with the previous set of hidden unit activations feeding back along with the inputs. In other words, it is the
simplest design in the sense that all nodes are connected to all other nodes, with every node acting
as both input and output.
Note that the time 't' has to be discretized, with the activations updated at each time interval. The
time scale may correspond to the activity of real neurons, or, for artificial systems, any step size appropriate
for the given problem can be used. A delay unit must be introduced to hold activations until they are
processed at the next time interval.
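A minimal sketch of such a discretized recurrent update, with random weights and a delayed hidden state fed back alongside the inputs at each time step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal recurrent step: hidden activations feed back with the inputs.
n_in, n_hidden = 3, 5
W_x = rng.normal(size=(n_hidden, n_in))      # input -> hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (feedback) weights

h = np.zeros(n_hidden)                       # delayed activations, held one step
for t in range(4):                           # discretized time steps
    x_t = rng.normal(size=n_in)
    h = np.tanh(W_x @ x_t + W_h @ h)         # new state depends on the old state

print(h.shape)  # (5,)
```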
Jordan network:
The Jordan network refers to a simple neural structure in which only one value of the process input
signal (from the previous sampling instant) and only one value of the delayed output signal of the model
(from the previous sampling instant) are used as inputs of the network. In order to obtain a computationally
simple MPC (Model Predictive Control) algorithm, the nonlinear Jordan neural model is repeatedly
linearized on-line around an operating point, which leads to a quadratic optimization problem. The
effectiveness of the described MPC algorithm is compared with that of the nonlinear MPC scheme with
on-line nonlinear optimization performed at each sampling instant.
Adjustments of Weights or Learning:
Learning in ANN is the technique for changing the weights of associations between the neurons of a
specified network. Learning in artificial neural networks can be characterized into three different categories,
namely supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning:
Supervised learning consists of two words, supervised and learning. To supervise means to guide: a
supervisor's duty is to guide and show the way. The same holds in learning. Here the machine or
program learns with the help of an existing data set: we have a data set, and we predict the results for new
data relying upon the behavior of the existing data. In effect, the existing data set acts as a supervisor for
the new data. A basic example is electronic gadget price prediction, where the price of a gadget is predicted
based on the observed prices of other digital gadgets.
During the training of artificial neural networks under supervised learning, the input vector is given
to the network, which offers an output vector. Afterward, the output vector is compared with the desired
output vector. An error signal is produced if there is a difference between the actual output and the desired
output vector. Based on this error signal, the weight is adjusted until the actual output is matched with the
desired output.
Unsupervised learning:
As the name suggests, unsupervised learning refers to predicting something without any supervision or
help from existing labeled data. In this learning, the program learns by dividing the data into groups with
similar characteristics. Unlike supervised learning, there are no existing labels to look to for direction; in
other words, there is no supervisor. During the training of an artificial neural network under unsupervised
learning, input vectors of a similar type are joined to form clusters. When a new input pattern is applied,
the neural network gives an output response indicating the class to which the input pattern belongs. There
is no feedback from the environment about what the ideal output should be or whether it is correct or
incorrect. Consequently, in this type of learning, the network itself must discover the patterns and features
in the input data and the relation between the input data and the output.
Reinforcement learning:
Reinforcement Learning (RL) is a technique that helps to solve control optimization problems. By using
control optimization, we can identify the best action in each state visited by the system in order to optimize
some objective function. Typically, reinforcement learning comes into play when the system has a huge
number of states and a complex stochastic structure that is not amenable to closed-form analysis. If a
problem has a relatively small number of states and a relatively simple stochastic structure, one can use
dynamic programming instead.
As the name suggests, this kind of learning is used to strengthen the network based on critic feedback.
The procedure resembles supervised learning, but we may have very little information. In
reinforcement learning, during training, the network receives some feedback from the environment.
This makes it somewhat like supervised learning. The feedback acquired here is evaluative, not instructive,
which means there is no teacher as in supervised learning. After receiving the feedback, the network adjusts
its weights to obtain better critic feedback in the future.
Activation Function:
Activation functions are the functions used in neural networks to act on the weighted sum of
inputs and biases, deciding whether a neuron fires or not. They shape the information passed on during
gradient-based training, normally gradient descent, and produce the output of the neural network from the
parameters and the data.
An activation function can be either linear or non-linear, depending on the function it represents. It is used
to control the output of neural networks across various areas, such as speech recognition, segmentation,
fingerprint detection, cancer detection systems, etc.
In the artificial neural network, we can use activation functions over the input to get the precise
output. These are some activation functions that are used in ANN.
The equation of the linear activation function is the same as the equation of a straight line i.e.
Y= MX+ C
If we have many layers and all the layers are linear in nature, then the whole stack is equivalent to a
single linear layer: the activation of the last layer is just a linear function of the first layer's input. The range
of a linear function is -infinity to +infinity.
A linear activation function is typically used in only one place: the output layer.
Sigmoid function:
A = 1/(1 + e^-x)
This function is non-linear, and its output lies between 0 and 1. It is steepest for x roughly between
-2 and +2; in that region, a slight change in the value of x brings about a significant change in
the value of y.
Tanh Function:
The activation function, which is more efficient than the sigmoid function is Tanh function. Tanh
function is also known as Tangent Hyperbolic Function. It is a mathematical updated version of the sigmoid
function. Sigmoid and Tanh function are similar to each other and can be derived from each other.
Tanh(x) = (e^x - e^-x) / (e^x + e^-x)
OR
Tanh (x) = 2 * sigmoid(2x) - 1
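A small sketch of the three activation functions discussed above, including a check of the tanh-sigmoid identity:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    return 2 * sigmoid(2 * x) - 1   # the identity stated above

def linear(x, m=1.0, c=0.0):
    return m * x + c                 # Y = MX + C

print(sigmoid(0))                          # 0.5
print(linear(2, m=3, c=1))                 # 7.0
print(abs(tanh(1) - math.tanh(1)) < 1e-12) # True: matches the library tanh
```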
1.3 Optimizers
An Optimizer is an extended class in TensorFlow that is initialized with the parameters of the model, but
no tensor is given to it. The base optimizer class provided by TensorFlow is never used directly;
instead, its sub-classes are instantiated.
Before explaining these, let's first learn about the algorithm on top of which the others are built, i.e.,
gradient descent. Gradient descent links weights and loss functions: since a gradient is a measure of change,
the gradient descent algorithm determines, using partial derivatives, what should be done to minimize the
loss function (for example, add 0.7 or subtract 0.27). An obstacle arises, however, when it gets stuck at a
local minimum instead of the global minimum, as can happen with large multi-dimensional datasets.
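A minimal sketch of the update rule just described, on a one-dimensional loss L(w) = (w - 3)^2 with an illustrative learning rate:

```python
# Plain gradient descent: repeatedly step against the derivative of the loss.
def grad(w):
    return 2 * (w - 3)   # dL/dw for L(w) = (w - 3)^2

w = 0.0
lr = 0.1
for _ in range(100):
    w -= lr * grad(w)    # e.g. "add 0.6" when the gradient is -6

print(round(w, 4))       # 3.0, the minimum of the loss
```

On this convex loss there is a single minimum; the local-minima problem mentioned above only appears on more complicated, multi-dimensional loss surfaces.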
Syntax: [Link](learning_rate,
use_locking,
name = 'GradientDescent')
Parameters:
Tensorflow predominantly supports 9 optimizer classes including its base class (Optimizer).
• Gradient Descent
• SGD
• AdaGrad
• RMSprop
• Adadelta
• Adam
• AdaMax
• NAdam
• FTRL
The stochastic Gradient Descent (SGD) optimization method executes a parameter update for every
training example. In the case of huge datasets, SGD performs redundant calculations resulting in frequent
updates having high variance causing the objective function to vary heavily.
Syntax: [Link](learning_rate=0.01,
momentum=0.0,
nesterov=False,
name='SGD',
**kwargs)
Parameters:
Advantages:
Disadvantages:
1. High Variance
2. Computationally Expensive
AdaGrad Optimizer
AdaGrad stands for Adaptive Gradient Algorithm. The AdaGrad optimizer adapts the learning rate
for individual features, i.e., some weights in the dataset may have different learning rates than
others.
Syntax: [Link](learning_rate=0.001,
initial_accumulator_value=0.1,
epsilon=1e-07,
name="Adagrad",
**kwargs)
Parameters:
Advantages:
Disadvantages:
RMSprop Optimizer
RMSprop stands for Root Mean Square Propagation. The RMSprop optimizer doesn’t let gradients
accumulate indefinitely for momentum; instead, it only accumulates gradients over a particular fixed
window. It can be considered an updated version of AdaGrad with a few improvements. RMSprop uses
simple momentum instead of Nesterov momentum.
Syntax: [Link](learning_rate=0.001,
rho=0.9,
momentum=0.0,
epsilon=1e-07,
centered=False,
name='RMSprop',
**kwargs)
Parameters:
Advantages:
Adadelta Optimizer
Adadelta is an extension of AdaGrad that restricts the accumulation of past gradients to a fixed-size
window, which counters AdaGrad's continually shrinking learning rate.
Syntax: [Link](learning_rate=0.001,
rho=0.95,
epsilon=1e-07,
name='Adadelta',
**kwargs)
Parameters:
Adam Optimizer
Adaptive Moment Estimation (Adam) is among the top-most optimization techniques used today. In
this method, the adaptive learning rate for each parameter is calculated. This method combines advantages
of both RMSprop and momentum .i.e. stores decaying average of previous gradients and previously squared
gradients.
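A single Adam update on one parameter can be sketched as follows, using the commonly cited defaults (beta_1=0.9, beta_2=0.999, epsilon=1e-07) that also appear in the syntax below:

```python
import math

lr, beta_1, beta_2, eps = 0.001, 0.9, 0.999, 1e-7

def adam_step(w, m, v, t, g):
    # m: decaying average of past gradients (the momentum part).
    # v: decaying average of past squared gradients (the RMSprop part).
    t += 1
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g * g
    m_hat = m / (1 - beta_1 ** t)  # bias correction for zero-initialized averages
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, t

w, m, v, t = adam_step(w=0.0, m=0.0, v=0.0, t=0, g=2.0)
print(round(w, 6))  # -0.001: the first step has magnitude ~lr, whatever the gradient scale
```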
Syntax: [Link](learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-07,
amsgrad=False,
name='Adam',
**kwargs)
Parameters:
Advantages:
1. Easy implementation
2. Requires little memory
3. Computationally efficient
Disadvantages:
AdaMax Optimizer
AdaMax is a variant of the Adam optimizer. It is built on the adaptive estimation of low-order
moments, based on the infinity norm. In some cases, such as models with embeddings, AdaMax is
considered better than Adam.
Syntax: [Link](learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-07,
name='Adamax',
**kwargs)
Parameters:
Advantages:
NAdam Optimizer
NAdam is short for Nesterov and Adam optimizer. NAdam uses Nesterov momentum to
update the gradients, rather than the vanilla momentum used by Adam.
Syntax: [Link](learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-07,
name='Nadam',
**kwargs)
Parameters:
Advantages:
1. Gives better results for gradients with high curvature or noisy gradients.
2. Learns faster
FTRL Optimizer
Follow The Regularized Leader (FTRL) is an optimization algorithm best suited for shallow models
having sparse and large feature spaces. This version supports both shrinkage-type L2 regularization
(summation of L2 penalty and loss function) and online L2 regularization.
Syntax: [Link](learning_rate=0.001,
learning_rate_power=-0.5,
initial_accumulator_value=0.1,
l1_regularization_strength=0.0,
l2_regularization_strength=0.0,
name='Ftrl',
l2_shrinkage_regularization_strength=0.0,
beta=0.0,
**kwargs)
Parameters:
l1_regularization_strength: Stabilization penalty.
Disadvantages:
A loss function is a function that compares the target and predicted output values; it measures how
well the neural network models the training data. When training, we aim to minimize this loss between the
predicted and target outputs.
The model's parameters are adjusted to minimize the average loss: we find the weights, w, and
biases, b, that minimize the value of J (the average loss).
We can think of this akin to residuals, in statistics, which measure the distance of the actual y values
from the regression line (predicted values) — the goal being to minimize the net distance.
For this section, we will use Google’s TensorFlow library to implement different loss functions, which
makes it easy to demonstrate how loss functions are used in models.
In TensorFlow, the loss function the neural network uses is specified as a parameter in
[Link]() —the final method that trains the neural network.
[Link](loss='mse', optimizer='sgd')
The loss function can be input either as a String, as shown above, or as a function object,
either imported from TensorFlow or written as a custom loss function, as we will discuss later.
It must be formatted this way because the [Link]() method expects only two input parameters
for the loss attribute.
In supervised learning, there are two main types of loss functions, corresponding to the two major
types of neural network tasks, regression and classification:
1. Regression Loss Functions — used in regression neural networks; given an input value, the model
predicts a corresponding output value (rather than pre-selected labels); Ex. Mean Squared Error,
Mean Absolute Error
2. Classification Loss Functions — used in classification neural networks; given an input, the neural
network produces a vector of probabilities of the input belonging to various pre-set categories — can
then select the category with the highest probability of belonging; Ex. Binary Cross-Entropy,
Categorical Cross-Entropy
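The two families can be sketched directly in NumPy; here MSE stands in for the regression losses and binary cross-entropy for the classification losses:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: a regression loss.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred):
    # Binary Cross-Entropy: a classification loss for 0/1 targets.
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])

print(round(float(mse(y_true, y_pred)), 4))                   # 0.02
print(round(float(binary_cross_entropy(y_true, y_pred)), 4))  # 0.1446
```

Passing `loss='mse'` to `compile`, as shown earlier, selects the same computation by name.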
Data pre-processing is crucial for training effective neural network models. It involves several steps
to ensure that the data is clean, relevant, and in a format suitable for the model.
[Link] Cleaning:
- Removing Noise and Outliers: Using statistical methods to identify and remove anomalies.
[Link] Scaling:
- Ensures that all features contribute equally to the model performance and accelerates convergence in
gradient descent.
[Link] Data:
- Dividing the dataset into training, validation, and test sets to evaluate the model's performance.
Data Augmentation:
- Techniques like rotation, translation, flipping, and cropping to increase the diversity of the training set
and prevent overfitting.
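Two of the steps above, scaling and splitting, can be sketched with NumPy alone (random data stands in for a real dataset; an 80/10/10 split is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))  # 100 samples, 3 features

# Normalization/Scaling: zero mean, unit variance per feature.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Splitting: shuffle, then take 80% train, 10% validation, 10% test.
idx = rng.permutation(len(X_scaled))
train, val, test = np.split(X_scaled[idx], [80, 90])

print(train.shape, val.shape, test.shape)  # (80, 3) (10, 3) (10, 3)
```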
Feature engineering involves creating new features or modifying existing ones to improve the
performance of the model.
[Link] Transformation:
- Binning: Converting continuous variables into categorical ones by grouping values into bins.
[Link] Selection:
- Wrapper Methods: Using model-based approaches to evaluate feature subsets (e.g., recursive feature
elimination).
- Embedded Methods: Methods that perform feature selection as part of the model training process (e.g.,
Lasso).
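As a small sketch of the binning transformation mentioned above, with illustrative bin edges:

```python
import numpy as np

# Binning: convert a continuous variable (age) into categories.
ages = np.array([5, 17, 25, 42, 67, 80])
bins = [18, 40, 65]                 # edges between the four categories
labels = np.digitize(ages, bins)    # index of the bin each value falls into

print(labels.tolist())  # [0, 0, 1, 2, 3, 3]
```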
2.8 Overfitting and Underfitting
These are common problems in machine learning and neural networks, impacting model
performance.
[Link]:
- Occurs when the model learns not only the underlying pattern but also the noise in the training data.
- Solutions:
- Simplifying the Model: Reducing the model complexity by decreasing the number of layers/neurons.
[Link]:
- Occurs when the model is too simple to capture the underlying pattern in the data.
- Solutions:
2.9 Hyperparameters
Hyperparameters are settings used to control the learning process and model architecture. Unlike
parameters learned during training, hyperparameters are set before training.
[Link] Rate:
- Too High: Might cause the model to converge too quickly to a suboptimal solution.
[Link] Size:
[Link] of Epochs:
- The number of times the entire training dataset passes through the network.
[Link] Parameters:
[Link] Architecture:
- Number of Layers and Neurons: Determines the capacity of the network to learn complex patterns.
- Activation Functions: Functions like ReLU, Sigmoid, or Tanh that introduce non-linearity into the model.
[Link] Algorithm:
- Algorithms like SGD, Adam, RMSprop, each with their own parameters like momentum, beta1, beta2.
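Hyperparameters are often collected in a single configuration set before training; a small sketch showing how batch size and epochs together determine the number of weight updates:

```python
import math

# Hyperparameters are fixed before training begins (illustrative values).
hyperparams = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10,
}

n_samples = 1000
steps_per_epoch = math.ceil(n_samples / hyperparams["batch_size"])
total_updates = steps_per_epoch * hyperparams["epochs"]

print(steps_per_epoch, total_updates)  # 32 320
```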
UNIT III CONVOLUTIONAL NEURAL NETWORK
About CNN, Linear Time Invariant, Image Processing Filtering, Building a convolutional neural network,
Input Layers, Convolution Layers, Pooling Layers, Dense Layers, Backpropagation Through the
Convolutional Layer, Filters and Feature Maps, Backpropagation Through the Pooling Layers, Dropout
Layers and Regularization, Batch Normalization, Various Activation Functions, Various Optimizers, LeNet,
AlexNet, VGG16, ResNet, Transfer Learning with Image Data, Transfer Learning using Inception Oxford
VGG Model, Google Inception Model, Microsoft ResNet Model, R-CNN, Fast R-CNN, Faster R-CNN,
Mask-RCNN, YOLO.
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that enables a
computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural
Networks are used in various datasets like images, audio, and text. Different types of Neural Networks are
used for different purposes, for example for predicting the sequence of words we use Recurrent Neural
Networks more precisely an LSTM, similarly for image classification we use Convolution Neural networks.
In this section, we are going to build the basic building blocks of a CNN.
1. Input Layers: It’s the layer in which we give input to our model. The number of neurons in this
layer is equal to the total number of features in our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden layer. There can be many
hidden layers depending on our model and data size. Each hidden layer can have different numbers
of neurons which are generally greater than the number of features. The output from each layer is
computed by matrix multiplication of the output of the previous layer with learnable weights of that
layer and then by the addition of learnable biases followed by activation function which makes the
network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or
softmax which converts the output of each class into the probability score of each class.
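The softmax function mentioned above can be sketched in a few lines; it turns raw output scores into class probabilities:

```python
import numpy as np

def softmax(scores):
    # Convert raw output scores into a probability distribution over classes.
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(round(float(probs.sum()), 6))  # 1.0: probabilities sum to one
print(int(np.argmax(probs)))         # 0: the class with the highest score wins
```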
Feeding the data through the model and obtaining the output of each layer as in the steps above is
called feedforward. We then calculate the error using an error function; some common error functions are
cross-entropy, squared-error loss, etc. The error function measures how well the network is performing. After
that, we backpropagate through the model by calculating the derivatives. This step, called
Backpropagation, is used to minimize the loss.
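The feedforward and backpropagation steps described above can be demonstrated with PyTorch's autograd. This is a minimal sketch on a tiny fully connected network (the layer sizes and random data are illustrative, not part of any particular model in this section):

```python
import torch
import torch.nn as nn

# A minimal sketch: one hidden layer, feedforward then backpropagation.
torch.manual_seed(0)
x = torch.randn(4, 3)            # batch of 4 samples, 3 features each
y = torch.tensor([0, 1, 0, 1])   # class labels

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()  # cross-entropy error function

logits = model(x)                # feedforward pass
loss = loss_fn(logits, y)        # measures how well the network performs
loss.backward()                  # backpropagation: compute derivatives

# every learnable weight now carries a gradient used to minimize the loss
print(model[0].weight.grad.shape)
```

After `loss.backward()`, an optimizer would use these gradients to update the weights.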
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network
(ANN), predominantly used to extract features from grid-like matrix datasets, for example
visual datasets like images or videos where data patterns play an extensive role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features, the Pooling layer
downsamples the image to reduce computation, and the fully connected layer makes the final prediction.
The network learns the optimal filters through backpropagation and gradient descent.
Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine
you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a
height (i.e., the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel, on it, with say K outputs, represented vertically. Now slide that neural network across
the whole image; as a result, we get another image with a different width, height, and depth. Instead of
just R, G, and B channels we now have more channels but smaller width and height. This operation is
called Convolution. If the patch size were the same as that of the image, it would be a regular neural network.
Because of this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
• Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights
and the same depth as that of input volume (3 if the input layer is image input).
• For example, if we have to run convolution on an image with dimensions 34×34×3, the possible size
of filters can be a×a×3, where ‘a’ can be anything like 3, 5, or 7, but smaller than the image
dimensions.
• During the forward pass, we slide each filter across the whole input volume step by step where each
step is called stride (which can have a value of 2, 3, or even 4 for high-dimensional images) and
compute the dot product between the kernel weights and patch from input volume.
• As we slide our filters, we get a 2-D output for each filter; stacking these together, we get an output
volume having a depth equal to the number of filters. The network will learn all the
filters.
• Input Layers: This is the layer in which we give input to our model. In a CNN, the input will generally
be an image or a sequence of images. This layer holds the raw input image, for example with width 32,
height 32, and depth 3.
• Convolutional Layers: This is the layer used to extract features from the input dataset.
It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are
small matrices, usually of 2×2, 3×3, or 5×5 shape. Each filter slides over the input image data and computes
the dot product between the kernel weights and the corresponding input image patch. The output of this
layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an output
volume of dimension 32×32×12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation
layers add nonlinearity to the network. It applies an element-wise activation function to the output
of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky
ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32×32×12.
• Pooling Layer: This layer is periodically inserted in the convnet, and its main function is to reduce the
size of the volume, which makes computation fast, reduces memory, and also prevents overfitting. Two
common types of pooling layers are max pooling and average pooling. If we use a max pool with 2×2
filters and stride 2, the resultant volume will be of dimension 16×16×12.
• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the
convolution and pooling layers so they can be passed into a fully connected layer for classification
or regression.
• Fully Connected Layers: These take the input from the previous layer and compute the final
classification or regression output.
• Output Layer: The output from the fully connected layers is then fed into a logistic function, such
as sigmoid or softmax, which converts the output for each class into a
probability score for that class.
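The shapes claimed in the layer descriptions above can be verified with a short PyTorch sketch, assuming a 32×32 RGB input and 12 filters of size 3×3 (padding=1 keeps the width and height unchanged):

```python
import torch
import torch.nn as nn

# Sketch of the layer stack described above: 32x32x3 input, 12 filters.
x = torch.randn(1, 3, 32, 32)                      # input layer: one RGB image

conv = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # convolutional layer
act = nn.ReLU()                                    # activation layer
pool = nn.MaxPool2d(kernel_size=2, stride=2)       # pooling layer

x = conv(x)   # -> (1, 12, 32, 32): depth equals the number of filters
x = act(x)    # -> (1, 12, 32, 32): shape unchanged by the activation
x = pool(x)   # -> (1, 12, 16, 16): width and height halved
print(x.shape)
```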
Example:
Let's consider an image and apply the convolution layer, activation layer, and pooling layer operations
to extract its features.
[Figure: sample input image and the resulting feature-map output]
Advantages of Convolutional Neural Networks (CNNs):
1. Good at detecting patterns and features in images, videos, and audio signals.
Disadvantages of Convolutional Neural Networks (CNNs):
1. Interpretability is limited; it is hard to understand what the network has learned.
A system that possesses two basic properties, namely linearity and time invariance, is known as a linear
time-invariant system or LTI system.
There are two major reasons behind the use of LTI systems −
• Many physical processes, though not absolutely LTI systems, can be approximated with the properties
of linearity and time invariance.
• A rich set of mathematical tools exists for analyzing LTI systems.
LTI Systems and the Convolution Sum
LTI systems are always characterized by their impulse response: the input is
the impulse signal and the output is the impulse response.

x(t) = δ(t)  (continuous time)
x(n) = δ(n)  (discrete time)

According to the shifting property of signals, any signal can be expressed as a combination of weighted and
shifted impulse signals, i.e.,

x(n) = Σₖ₌₋∞^∞ x(k) δ(n − k)

This expression is known as the convolution sum. If h(n) is the impulse response of the system, the
convolution sum can be represented symbolically as y(n) = x(n) * h(n). Also,

y(n) = Σₖ₌₋∞^∞ x(k) h(n − k) = Σₖ₌₋∞^∞ x(n − k) h(k)
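The convolution sum can be checked numerically. The following NumPy sketch evaluates it both directly from the definition and with `np.convolve` (the short signals x and h are illustrative):

```python
import numpy as np

# Evaluating the convolution sum y(n) = sum_k x(k) h(n-k) directly for
# short finite-length signals (values outside the signals are zero).
x = np.array([1.0, 2.0, 3.0])   # input signal x(n)
h = np.array([0.5, 0.5])        # impulse response h(n)

n_out = len(x) + len(h) - 1
y_manual = [sum(x[k] * h[n - k] for k in range(len(x)) if 0 <= n - k < len(h))
            for n in range(n_out)]

y = np.convolve(x, h)           # numpy evaluates the same sum
print(y)                        # [0.5 1.5 2.5 1.5]
```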
3.3 Image Processing Filtering
Image processing techniques play a pivotal role in enhancing, restoring, and analyzing digital images.
This article delves into fundamental image filtering techniques, unveiling the mechanisms behind their
operations, applications, and outcomes. In this first part, we explore linear image filters, with a focus on the
Sobel filter, Gaussian filter, and mean filter.
Linear image filters are essential tools in the realm of image processing, particularly when it comes
to noise reduction. These filters operate by applying a convolution operation on an image, using a predefined
kernel or filter mask. The beauty of linear filters lies in their simplicity and effectiveness in reducing various
types of noise, such as salt-and-pepper noise or Gaussian noise.
The process of applying linear image filters for noise reduction involves a concept known as
convolution. Convolution is the mathematical operation where the filter mask is successively overlaid on
each pixel of the image, and the product of the filter’s coefficients and the corresponding pixel values is
computed. This process is repeated for every pixel, resulting in a new filtered image.
One of the significant advantages of linear image filters for noise reduction is their ability to process
images in near-real-time, making them suitable for applications where computational efficiency is crucial.
Linear filters can be implemented using simple mathematical operations, allowing for fast processing on
both hardware and software platforms. Despite their effectiveness in many scenarios, linear filters may
struggle with more complex and nonlinear noise patterns, where non-linear filters come into play.
Mean filtering is a widely employed technique within the realm of image processing, serving as a
crucial tool for noise reduction and smoothing. By adopting spatial filtering principles, mean filtering aids
in eliminating random variations or noise present in an image, all while retaining its essential features. In
this context, we will explore the mechanics, advantages, and limitations of mean filtering.
The fundamental concept behind mean filtering is elegantly simple: each pixel in an image is
substituted with the average value derived from neighboring pixels confined within a specified window or
kernel. The size of this kernel determines the degree of smoothing, with larger kernels inducing more potent
smoothing effects at the potential cost of finer detail loss.
Several iterations of mean smoothing exist, among which is Threshold Averaging. Here, smoothing
is contingent upon a condition: the central pixel value undergoes alteration only if the difference between its
initial value and the computed average surpasses a preset threshold. This approach efficiently reduces noise
while preserving more image detail than traditional mean filtering.
In the realm of computational simplicity, mean filtering stands as an exemplar. Its straightforward
implementation and efficiency render it an easily accessible tool. However, mean filtering’s suitability is not
universal; images containing sharp edges or intricate details may experience edge blurring due to the
averaging procedure. Notably effective against Gaussian or salt-and-pepper noise, mean filtering harnesses
the context of surrounding pixels to mitigate random noise values.
While mean filtering holds its place in preprocessing images for subsequent intricate tasks like edge
detection or object recognition, its significance is underscored in image restoration and enhancement
processes. Notably, in domains like medical imaging, mean filtering’s application contributes to enhanced
diagnostic precision and improved visual quality.
In summary, mean filtering emerges as an uncomplicated yet potent technique in image processing.
Its prowess in noise reduction and texture smoothing bestows it with versatility across diverse applications.
Nevertheless, judicious selection of kernel size remains pivotal to striking a balance between noise reduction
and vital image feature preservation. When judiciously employed, mean filtering profoundly elevates image
quality and facilitates comprehensive image analysis. Despite its computational efficiency, it is imperative
to acknowledge its limitations; alternative convolution filters, such as the Gaussian smoothing filter, present
distinct trade-offs between noise reduction and detail preservation.
Gaussian filtering is a critical tool in the field of image processing, especially for noise reduction. By
applying a Gaussian kernel, the filter gives central pixels more weight than surrounding regions, effectively
reducing noise while preserving image structure. The essential parameter σ controls the filter’s scope and
subsequent smoothing. Gaussian filters are excellent at removing random, subtle image noise patterns,
making them vital in many image processing applications.
Among the many linear smoothing filters used for noise reduction, the Gaussian filter stands out as
a crucial tool. It uses weights derived from the Gaussian distribution to effectively reduce noise in images.
However, before using the Gaussian filter, it is important to preprocess the image and remove any outliers
or spikes. While it is adept at handling random noise, the Gaussian filter has its limitations. It tends to blend
noise into the result, causing indiscriminate smoothing along edges.
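A Gaussian kernel of the kind described above can be built directly from the Gaussian distribution. This sketch constructs and normalizes one; the 5×5 size and σ = 1 are illustrative choices:

```python
import numpy as np

# Building a Gaussian kernel by hand: weights fall off with distance
# from the centre, controlled by sigma.
def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / kernel.sum()   # normalise so the weights sum to 1

k = gaussian_kernel(5, sigma=1.0)
print(k[2, 2])   # centre pixel gets the largest weight
print(k.sum())   # weights sum to 1
```

Increasing σ spreads the weight toward the borders of the kernel, producing stronger smoothing.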
The Sobel filter, introduced by Irwin Sobel and Gary M. Feldman in 1968, is a crucial tool in image
processing and computer vision. Its primary purpose is to detect edges within images, playing a foundational
role in various edge detection algorithms. By enhancing image edges, the Sobel filter significantly
contributes to tasks such as object detection, image segmentation, and edge enhancement.
At its core, the Sobel filter is ingeniously crafted as a compact, integer-valued filter that operates
along both the horizontal and vertical directions of an image. This well-thought-out design choice
underscores computational efficiency, rendering the Sobel filter an indispensable cornerstone of edge
detection methodologies.
The essence of the Sobel operator lies in its ability to compute the gradient of the intensity function
at each discrete point within the image. This process entails two separate convolutions: one with the
horizontal kernel, often referred to as Gx, and the other with the vertical kernel, known as Gy.
These convolution kernels play a fundamental role in the image convolution process. As the kernels
traverse the image pixel by pixel, they perform multiplication between the pixel values and corresponding
kernel coefficients. The resulting products are then summed, yielding the filtered pixel value. The Horizontal
Sobel Kernel excels at identifying vertical edges, whereas the Vertical Sobel Kernel adeptly highlights
horizontal edges.
Upon completing the convolutions, the Sobel filter computes the gradient magnitude using the
Pythagorean theorem: G = √(Gx² + Gy²).
This magnitude computation unveils insights into the rate of intensity variation and the presence of
edges throughout the image. Moreover, the orientation of these edges is determined counterclockwise
concerning the direction of maximum contrast, extending from black to white.
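The two convolutions and the magnitude computation can be sketched with NumPy and SciPy. The 5×5 test image with a vertical edge is an illustrative assumption:

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels Gx and Gy, and the gradient magnitude sqrt(Gx^2 + Gy^2),
# demonstrated on a tiny image with a vertical edge.
Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # horizontal kernel Gx
Ky = Kx.T                                            # vertical kernel Gy

img = np.zeros((5, 5))
img[:, 3:] = 1.0                    # dark left half, bright right half

gx = convolve2d(img, Kx, mode="same")
gy = convolve2d(img, Ky, mode="same")
magnitude = np.sqrt(gx**2 + gy**2)  # Pythagorean combination
print(magnitude[2])                 # strongest response along the edge
```

Because the edge is vertical, the response comes almost entirely from the horizontal kernel Gx, as described above.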
In summation, the Sobel filter assumes a foundational role in the landscape of image processing and
computer vision. Through its compact design, strategic use of convolution kernels (Gx and Gy), and astute
gradient calculation mechanisms, the Sobel filter emerges as a pivotal player in the realm of edge detection
algorithms. By harnessing the power of the Sobel filter, practitioners can delve deeper into comprehending
image structure and extracting invaluable insights from visual data.
The simplest use case of a convolutional neural network is for classification. You will find it to contain
three types of layers:
1. Convolutional layers
2. Pooling layers
3. Fully-connected layers
The neurons of a convolutional layer are called filters. Usually it is a 2D convolutional layer in image
applications. The filter is a 2D patch (e.g., 3×3 pixels) that is applied to the input image pixels. The size of this
2D patch is also called the receptive field, meaning how large a portion of the image it can see at a time.
A convolutional layer's filter multiplies with the input pixels and then sums up the result. This
result is one pixel value at the output. The filter moves around the input image to fill out all pixel values
at the output. Usually multiple filters are applied to the same input, producing multiple output tensors. These
output tensors are called the feature maps produced by this layer. They are stacked together as one tensor and
passed on to the next layer as input.
The output of a convolutional layer is called feature maps because it usually learns features
of the input image, for example, whether there are vertical lines at a position. Learning features from
pixels helps understand the image at a higher level. Multiple convolutional layers are stacked together
in order to infer higher-level features from lower-level details.
A pooling layer downsamples the previous layer's feature map. It is usually used after a
convolutional layer to consolidate the features learned; it can compress and generalize the feature representations.
A pooling layer also has a receptive field, and usually it takes the average (average pooling) or the
maximum (max pooling) over all values in the receptive field.
Fully connected layers are usually the final layers in a network. They take the features consolidated
by previous convolutional and pooling layers as input to produce a prediction. There might be multiple fully
connected layers stacked together. In the case of classification, the output of the final fully
connected layer usually has a softmax function applied to produce probability-like classification scores.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor()])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

batch_size = 32
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=True)

class CIFAR10Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=(3,3), stride=1, padding=1)
        self.act1 = nn.ReLU()
        self.drop1 = nn.Dropout(0.3)

        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3,3), stride=1, padding=1)
        self.act2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=(2, 2))

        self.flat = nn.Flatten()

        self.fc3 = nn.Linear(8192, 512)
        self.act3 = nn.ReLU()
        self.drop3 = nn.Dropout(0.5)

        self.fc4 = nn.Linear(512, 10)

    def forward(self, x):
        # input 3x32x32, output 32x32x32
        x = self.act1(self.conv1(x))
        x = self.drop1(x)
        # input 32x32x32, output 32x32x32
        x = self.act2(self.conv2(x))
        # input 32x32x32, output 32x16x16
        x = self.pool2(x)
        # input 32x16x16, output 8192
        x = self.flat(x)
        # input 8192, output 512
        x = self.act3(self.fc3(x))
        x = self.drop3(x)
        # input 512, output 10
        x = self.fc4(x)
        return x

model = CIFAR10Model()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

n_epochs = 20
for epoch in range(n_epochs):
    for inputs, labels in trainloader:
        # forward, backward, and then weight update
        y_pred = model(inputs)
        loss = loss_fn(y_pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    acc = 0
    count = 0
    for inputs, labels in testloader:
        y_pred = model(inputs)
        acc += (torch.argmax(y_pred, 1) == labels).float().sum()
        count += len(labels)
    acc /= count
    print("Epoch %d: model accuracy %.2f%%" % (epoch, acc*100))

torch.save(model.state_dict(), "cifar10model.pth")
The CIFAR-10 dataset provides images in 32×32 pixels in RGB color (i.e., 3 color channels). There
are 10 classes, labelled in integers 0 to 9. Whenever you are working on PyTorch neural network models for
images, you will find the sister library torchvision useful. In the above, you used it to download the CIFAR-
10 dataset from the Internet and transform it into a PyTorch tensor:
...
transform = transforms.Compose([transforms.ToTensor()])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
You also used a DataLoader in PyTorch to help create batches for training. Training optimizes
the cross-entropy loss of the model using stochastic gradient descent. Since this is a classification model,
accuracy of classification is more intuitive than cross entropy; it is computed at the end of each epoch by
comparing the position of the maximum value in the output logits to the dataset's labels:
...
acc += (torch.argmax(y_pred, 1) == labels).float().sum()
It takes time to run the program above to train the network. This network should be able to achieve
above 70% accuracy in classification.
There are two convolutional layers in the network defined above. They are both defined with a kernel
size of 3×3, hence each is looking at 9 pixels at a time to produce one output pixel. Note that the first convolutional
layer takes the RGB image as input, so each pixel has three channels. The second convolutional layer
takes a feature map with 32 channels as input, so each "pixel" it sees has 32 values. Thus the second
convolutional layer has more parameters even though they have the same receptive field.
Let’s see what is in the feature map. Let’s say we pick one input sample from the training set:
You should see that this is an image of a horse, in 32×32 pixels with RGB channels:
First, you need to convert this into a PyTorch tensor and make it a batch of one image. PyTorch models
expect each image as a tensor in the format of (channel, height, width) but the data you read is in the format
of (height, width, channel). If you use torchvision to transform the image into PyTorch tensors, this format
conversion is done automatically. Otherwise, you need to permute the dimensions before use.
Afterward, pass it on through the model’s first convolution layer and capture the output. You need to
tell PyTorch that no gradient is needed for this calculation as you are not going to optimize the model weight:
X = torch.tensor([trainset.data[7]], dtype=torch.float32).permute(0,3,1,2)
model.eval()
with torch.no_grad():
    feature_maps = model.conv1(X)
The feature maps are in one tensor. You can visualize them using matplotlib:
You can see that they are called feature maps because they are highlighting certain features from the
input image. A feature is identified using a small window (in this case, over a 3×3 pixels filter). The input
image has three color channels. Each channel has a different filter applied, and their results are combined for
an output feature.
You can similarly display the feature map from the output of the second convolutional layer as follows:
X = torch.tensor([trainset.data[7]], dtype=torch.float32).permute(0,3,1,2)

model.eval()
with torch.no_grad():
    feature_maps = model.act1(model.conv1(X))
    feature_maps = model.drop1(feature_maps)
    feature_maps = model.conv2(feature_maps)

fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(feature_maps[0][i])
plt.show()
Which shows:
Compared to the output of the first convolutional layer, the feature maps from the second convolutional
layer look blurrier and more abstract. But these are more useful for the model to identify the objects.
Putting everything together, the code below loads the saved model from the previous section and
generates the feature maps:
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)

class CIFAR10Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=(3,3), stride=1, padding=1)
        self.act1 = nn.ReLU()
        self.drop1 = nn.Dropout(0.3)

        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3,3), stride=1, padding=1)
        self.act2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=(2, 2))

        self.flat = nn.Flatten()

        self.fc3 = nn.Linear(8192, 512)
        self.act3 = nn.ReLU()
        self.drop3 = nn.Dropout(0.5)

        self.fc4 = nn.Linear(512, 10)

    def forward(self, x):
        # input 3x32x32, output 32x32x32
        x = self.act1(self.conv1(x))
        x = self.drop1(x)
        # input 32x32x32, output 32x32x32
        x = self.act2(self.conv2(x))
        # input 32x32x32, output 32x16x16
        x = self.pool2(x)
        # input 32x16x16, output 8192
        x = self.flat(x)
        # input 8192, output 512
        x = self.act3(self.fc3(x))
        x = self.drop3(x)
        # input 512, output 10
        x = self.fc4(x)
        return x

model = CIFAR10Model()
model.load_state_dict(torch.load("cifar10model.pth"))

plt.imshow(trainset.data[7])
plt.show()

X = torch.tensor([trainset.data[7]], dtype=torch.float32).permute(0,3,1,2)
model.eval()
with torch.no_grad():
    feature_maps = model.conv1(X)
fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(feature_maps[0][i])
plt.show()

with torch.no_grad():
    feature_maps = model.act1(model.conv1(X))
    feature_maps = model.drop1(feature_maps)
    feature_maps = model.conv2(feature_maps)
fig, ax = plt.subplots(4, 8, sharex=True, sharey=True, figsize=(16,8))
for i in range(0, 32):
    row, col = i//8, i%8
    ax[row][col].imshow(feature_maps[0][i])
plt.show()
Input layers serve as the entry point for the data in a neural network, representing the initial step in
processing raw data. In image recognition tasks, for example, the input layer takes pixel values as its input.
Each node in this layer corresponds to a feature or variable in the data. The primary role of the input layer is
to shape the data appropriately for subsequent layers, maintaining its structure for further processing.
Convolution layers are the core building blocks of Convolutional Neural Networks (CNNs). They
apply a convolution operation to the input data, which involves sliding a filter or kernel across the input to
produce feature maps. This process captures local patterns such as edges and textures, making it especially
effective for image and signal processing tasks. Convolution layers reduce the spatial dimensions of the data
while preserving the relationship between pixels, leading to a hierarchical understanding of the input.
Pooling layers are used to downsample the spatial dimensions of the feature maps generated by
convolution layers. The most common types are max pooling and average pooling. Max pooling selects the
maximum value from each window of the feature map, while average pooling computes the average value.
Pooling reduces the computational load, controls overfitting, and retains important information by focusing
on the most prominent features.
Dense layers, also known as fully connected layers, are where each node is connected to every node
in the previous layer. They are typically used at the end of CNNs to integrate the features extracted by
convolution and pooling layers into final decision-making. Dense layers perform classification based on
these features by applying weights and biases to inputs and then passing the results through activation
functions like ReLU or softmax.
Backpropagation through the convolutional layer involves computing the gradient of the loss
function with respect to the filters and input data. This process adjusts the filters to minimize the loss,
effectively learning which features are important for the given task. The gradients are computed using the
chain rule, propagating the error backward from the output layer through each convolutional layer.
Filters, or kernels, are small matrices used in convolution layers to detect features such as edges,
textures, or colors in the input data. When a filter convolves with the input, it produces a feature map, which
is a new representation emphasizing the detected features. Different filters can learn to recognize different
aspects of the input, enabling the network to build a rich hierarchical representation of the data.
Backpropagation through pooling layers involves distributing the error gradient back to the locations
in the input feature map that contributed to the pooled output. In max pooling, the gradient is passed to the
location of the maximum value, while in average pooling, the gradient is distributed evenly. This process
ensures that the network learns which regions are important for making predictions.
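This gradient-routing behavior can be observed directly with PyTorch's autograd. The following minimal sketch applies max pooling to a single 2×2 feature map and inspects the resulting gradient:

```python
import torch
import torch.nn as nn

# Gradient routing through max pooling: only the location that held the
# maximum in each window receives the upstream gradient.
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]], requires_grad=True)  # one 2x2 feature map

out = nn.MaxPool2d(kernel_size=2)(x)  # single window, max is 4
out.backward()                        # upstream gradient of 1.0

print(x.grad)   # gradient is 1 only at the position of the 4
```

For average pooling the same experiment would show the gradient split evenly (0.25 at each position).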
Dropout layers are a regularization technique used to prevent overfitting in neural networks. During
training, dropout randomly sets a fraction of the input units to zero at each update step. This prevents the
network from becoming too reliant on specific nodes and encourages it to learn more robust features.
Regularization techniques, such as L1 and L2 regularization, add a penalty to the loss function to constrain
the magnitude of the weights, promoting simpler models that generalize better to new data.
Batch normalization was introduced by Sergey Ioffe and Christian Szegedy in 2015 to mitigate the
internal covariate shift problem in neural networks. The normalization process involves calculating
the mean and variance of each feature in a mini-batch and then scaling and shifting the features using these
statistics. This ensures that the input to each layer remains roughly in the same distribution, regardless of
changes in the distribution of earlier layers' outputs. Consequently, Batch Normalization helps in stabilizing
the training process, enabling higher learning rates and faster convergence.
Batch Normalization extends the concept of normalization from just the input layer to the
activations of each hidden layer throughout the neural network. By normalizing the activations of each layer,
Batch Normalization helps to alleviate the internal covariate shift problem, which can hinder the convergence
of the network during training.
In traditional neural networks, as the input data propagates through the network, the distribution of
each layer’s inputs changes. This phenomenon, known as internal covariate shift, can slow down the
training process. Batch Normalization aims to mitigate this issue by normalizing the inputs of each layer.
The inputs to each hidden layer are the activations from the previous layer. If these activations are
normalized, it ensures that the network is consistently presented with inputs that have a similar distribution,
regardless of the training stage. This stability in the distribution of inputs allows for smoother and more
efficient training.
By applying Batch Normalization into the hidden layers of the network, the gradients propagated
during backpropagation are less likely to vanish or explode, leading to more stable training dynamics. This
ultimately facilitates faster convergence and better performance of the neural network on the given task.
In this section, we are going to discuss the steps taken to perform batch normalization.
Step 1: Compute the Mini-Batch Mean and Variance
For a mini-batch of activations x₁, x₂, …, x_m, the mean μ_B and variance σ_B² of the
mini-batch are computed:

μ_B = (1/m) Σᵢ xᵢ,   σ_B² = (1/m) Σᵢ (xᵢ − μ_B)²

Step 2: Normalization
Each activation xᵢ is normalized using the computed mean and variance of the mini-batch.
The normalization subtracts the mean μ_B from each activation and divides by the square
root of the variance σ_B², ensuring that the normalized activations have zero mean and unit variance.
Additionally, a small constant ϵ is added to the denominator for numerical stability, particularly to
prevent division by zero:

x̂ᵢ = (xᵢ − μ_B) / √(σ_B² + ϵ)

Step 3: Scaling and Shifting
The normalized activations x̂ᵢ are then scaled by a learnable parameter γ and shifted by another
learnable parameter β. These parameters allow the model to learn the optimal scaling and shifting of the
normalized activations, giving the network additional flexibility:

yᵢ = γ x̂ᵢ + β
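The three batch-normalization steps can be sketched in a few lines of NumPy; the mini-batch values and the γ, β initializations here are illustrative:

```python
import numpy as np

# Batch normalization on a mini-batch of activations, step by step.
x = np.array([1.0, 2.0, 3.0, 4.0])     # mini-batch of m = 4 activations
gamma, beta, eps = 1.0, 0.0, 1e-5      # learnable params (initial values)

mu = x.mean()                          # step 1: mini-batch mean
var = x.var()                          # step 1: mini-batch variance
x_hat = (x - mu) / np.sqrt(var + eps)  # step 2: normalize
y = gamma * x_hat + beta               # step 3: scale and shift

print(y.mean(), y.std())               # ~zero mean, ~unit variance
```

During training, γ and β are updated by gradient descent like any other weights, so the network can undo the normalization where that helps.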
To put in simple terms, an artificial neuron calculates the ‘weighted sum’ of its inputs and adds a
bias, as shown in the figure below by the net input.
Now the value of the net input can be anything from −∞ to +∞. The neuron does not know
how to bound the value and thus cannot decide the firing pattern. This is why the activation function is an
important part of an artificial neural network: it decides whether a neuron should be activated or
not, thereby bounding the value of the net input. The activation function is a non-linear transformation that we
apply to the input before sending it to the next layer of neurons or finalizing it as output. Types of Activation
Functions – Several different types of activation functions are used in Deep Learning. Some of them are
explained below:
Step Function: The step function is one of the simplest kinds of activation functions. Here, we consider a
threshold value, and if the value of the net input, say y, is greater than the threshold, then the neuron is activated.
Graphically,
Sigmoid Function: This is a smooth function that is continuously differentiable. The biggest advantage it has over
the step and linear functions is that it is non-linear. This means that when multiple neurons have the sigmoid
function as their activation function, the output is non-linear as well. The function ranges from 0 to 1 and has an
S shape.
ReLU: The ReLU function is the Rectified Linear Unit. It is the most widely used activation function.
Graphically,
The main advantage of using the ReLU function over other activation functions is that it does not
activate all the neurons at the same time. What does this mean? If the input to the ReLU function
is negative, it is converted to zero and the neuron does not get activated.
Leaky ReLU: The Leaky ReLU function is an improved version of the ReLU function. Instead of
defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x.
Graphically,
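The activation functions above can be sketched in a few lines of NumPy (the threshold and leak-slope values here are illustrative defaults):

```python
import numpy as np

def step(x, threshold=0.0):
    return np.where(x > threshold, 1.0, 0.0)   # fires only above the threshold

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # smooth S-shape in (0, 1)

def relu(x):
    return np.maximum(0.0, x)                  # negative inputs become 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope for x < 0

z = np.array([-2.0, 0.5, 3.0])
print(step(z))    # [0. 1. 1.]
print(relu(z))    # negative entry clipped to 0, positives pass through
```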
Many people use optimizers while training a neural network without knowing that the method is
known as optimization. Optimizers are algorithms or methods used to change the attributes of your neural
network, such as weights and learning rate, in order to reduce the losses.
Optimizers help to get results faster.
How you should change the weights or learning rate of your neural network to reduce the losses is
defined by the optimizer you use. Optimization algorithms or strategies are responsible for reducing the
losses and providing the most accurate results possible.
1. Gradient Descent
Gradient Descent is the most basic but most used optimization algorithm. It’s used heavily in linear
regression and classification algorithms. Backpropagation in neural networks also uses a gradient descent
algorithm.
Gradient descent is a first-order optimization algorithm that depends on the first-order derivative of
the loss function. It calculates which way the weights should be altered so that the function can reach a
minimum. Through backpropagation, the loss is transferred from one layer to another, and the model's
parameters, also known as weights, are modified depending on the losses so that the loss can be
minimized.
algorithm: θ=θ−α⋅∇J(θ)
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get trapped at a local minimum.
2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is
too large, it may take very long to converge to the minima.
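The update rule θ = θ − α·∇J(θ) can be illustrated on a simple quadratic loss (a toy example; the loss and its gradient are chosen only for demonstration):

```python
import numpy as np

# Batch gradient descent on J(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])

def grad_J(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)
alpha = 0.1                                 # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta = theta - alpha * grad J(theta)

# theta converges toward the minimizer [3, -1]
```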
2. Stochastic Gradient Descent (SGD)
It's a variant of Gradient Descent that updates the model's parameters more frequently. In this, the
model parameters are altered after computing the loss on each training example. So, if the dataset
contains 1000 rows, SGD updates the model parameters 1000 times in one cycle through the dataset instead
of one time as in Gradient Descent.
Advantages:
1. Frequent updates of the model parameters give a quicker picture of the model's performance.
2. Requires less memory, since the gradient is computed one example at a time.
Disadvantages:
1. High variance in the model parameter updates.
2. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.
Mini-Batch Gradient Descent
It's the best among all the variations of gradient descent algorithms and an improvement on both SGD
and standard gradient descent. It updates the model parameters after every batch: the dataset is divided
into various batches and, after every batch, the parameters are updated.
Advantages:
1. Frequently updates the model parameters and also has less variance.
Disadvantages:
1. Choosing an optimum value of the learning rate. If the learning rate is too small, gradient descent
may take ages to converge.
2. Have a constant learning rate for all the parameters. There may be some parameters which we may
not want to change at the same rate.
3. Momentum
Momentum was invented to reduce the high variance in SGD and soften the convergence. It
accelerates convergence in the relevant direction and reduces fluctuation in irrelevant directions.
One more hyperparameter, known as momentum and symbolized by ‘γ’, is used in this method.
V(t)=γV(t−1)+α.∇J(θ)
Now, the weights are updated by θ=θ−V(t).
Advantages:
1. Reduces the oscillations and high variance of the parameter updates.
2. Converges faster than gradient descent.
Disadvantages:
1. One more hyper-parameter is added which needs to be selected manually and accurately.
4. Nesterov Accelerated Gradient (NAG)
Momentum may be a good method, but if the momentum is too high the algorithm may miss the local
minima and continue past them. To resolve this issue, the NAG algorithm was developed. It is a look-ahead
method: we know we'll be using γV(t−1) to modify the weights, so θ−γV(t−1) approximately tells us the
future location. We then calculate the cost based on this future parameter rather than the current
one.
V(t)=γV(t−1)+α. ∇J( θ−γV(t−1) ) and then update the parameters using θ=θ−V(t).
Advantages:
1. Does not overshoot the minima as easily as plain momentum, since it slows down as the minimum
is approached.
Disadvantages:
1. Still requires manual selection of the momentum hyper-parameter.
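The momentum and NAG update equations above can be sketched as follows (a toy 1-D quadratic loss; the hyperparameter values are illustrative):

```python
# Momentum and Nesterov (NAG) updates on J(theta) = theta^2,
# whose gradient is 2 * theta, purely as an illustration.
def grad_J(theta):
    return 2.0 * theta

def momentum_step(theta, v, alpha=0.1, gamma=0.9):
    v = gamma * v + alpha * grad_J(theta)      # V(t) = gamma*V(t-1) + alpha*grad J(theta)
    return theta - v, v                        # theta = theta - V(t)

def nag_step(theta, v, alpha=0.1, gamma=0.9):
    lookahead = theta - gamma * v              # approximate future position
    v = gamma * v + alpha * grad_J(lookahead)  # gradient at the look-ahead point
    return theta - v, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nag_step(theta, v)
# theta approaches the minimum at 0
```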
5. Adagrad
One of the disadvantages of all the optimizers explained so far is that the learning rate is constant for
all parameters and for each cycle. This optimizer adapts the learning rate: it changes the learning rate ‘η’ for
each parameter and at every time step ‘t’. It is a first-order optimization algorithm that works on the
derivative of the error function:
θ(t+1) = θ(t) − (η / √(G(t) + ϵ)) · g(t)
η is a learning rate which is modified for given parameter θ(i) at a given time based on previous
gradients calculated for given parameter θ(i).
We store the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ϵ is a smoothing
term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root
operation, the algorithm performs much worse.
It makes big updates for less frequent parameters and a small step for frequent parameters.
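A minimal sketch of the Adagrad update, assuming a simple quadratic loss for illustration:

```python
import numpy as np

# Adagrad sketch: each parameter's effective step shrinks with the
# accumulated squared gradients for that parameter.
def adagrad(grad_fn, theta, eta=0.5, eps=1e-8, steps=300):
    G = np.zeros_like(theta)                        # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter adapted step
    return theta

target = np.array([2.0, -3.0])
theta = adagrad(lambda t: 2.0 * (t - target), np.zeros(2))
# theta moves toward [2, -3]; rarely-updated parameters keep larger steps
```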
Advantages:
1. The learning rate changes adaptively for each parameter; no manual tuning is needed.
Disadvantages:
1. The accumulated squared gradients keep growing, so the learning rate keeps decaying and training
can slow to a halt.
2. Computationally expensive, as the gradient history must be maintained.
6. AdaDelta
It is an extension of AdaGrad that removes its decaying learning rate problem. Instead of
accumulating all previously squared gradients, Adadelta restricts the window of accumulated past
gradients to some fixed size w: an exponentially moving average is used rather than the sum of all the
gradients.
E[g²](t)=γ.E[g²](t−1)+(1−γ).g²(t)
Advantages:
1. Now the learning rate does not decay and the training does not stop.
Disadvantages:
1. Computationally expensive.
7. Adam
Adam (Adaptive Moment Estimation) works with momentums of first and second order. The
intuition behind the Adam is that we don’t want to roll so fast just because we can jump over the minimum,
we want to decrease the velocity a little bit for a careful search. In addition to storing an exponentially
decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying
average of past gradients M(t).
M(t) and V(t) are the first moment (the mean) and the second moment (the uncentered variance) of
the gradients, respectively.
Because M(t) and V(t) are initialized at zero, they are biased toward zero early in training, so bias-
corrected estimates m̂(t) = M(t)/(1 − β1^t) and v̂(t) = V(t)/(1 − β2^t) are used, so that E[m̂(t)] matches
E[g(t)], where E[f(x)] is the expected value of f(x).
Typical values are 0.9 for β1, 0.999 for β2, and 10^(−8) for ϵ.
Advantages:
1. Fast and converges rapidly.
2. Rectifies the vanishing learning rate and high variance problems.
Disadvantages:
1. Computationally costly.
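A minimal NumPy sketch of the Adam update with bias correction, again on a toy quadratic loss:

```python
import numpy as np

# Adam sketch combining the first moment M(t) and second moment V(t)
# with the bias corrections described above.
def adam(grad_fn, theta, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # EMA of gradients (first moment)
        v = beta2 * v + (1 - beta2) * g ** 2   # EMA of squared gradients (second moment)
        m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

target = np.array([1.0, -2.0])
theta = adam(lambda t: 2.0 * (t - target), np.zeros(2))
# theta settles near the minimizer [1, -2]
```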
3.16 LeNet
LeNet, developed by Yann LeCun in the late 1980s and early 1990s, is one of the earliest
convolutional neural networks (CNNs) designed for handwritten digit recognition. The LeNet architecture
consists of two convolutional layers, each followed by a subsampling (pooling) layer, and two fully
connected layers at the end. The use of convolutional layers allows LeNet to automatically learn spatial
hierarchies of features, making it highly effective for image recognition tasks. It laid the groundwork for
many modern CNN architectures by showcasing the potential of deep learning in computer vision.
3.17 AlexNet
AlexNet, created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large
Scale Visual Recognition Challenge in 2012. This network consists of five convolutional layers followed by
three fully connected layers. It introduced several innovations, such as using Rectified Linear Units (ReLUs)
for activation, dropout layers for regularization, and overlapping max pooling. AlexNet demonstrated the
power of deep learning on large datasets and spurred significant interest and research in deep learning for
computer vision.
3.18 VGG16
VGG16, developed by the Visual Geometry Group at the University of Oxford, is known for its
simplicity and depth. The network architecture consists of 16 layers, including 13 convolutional layers and
three fully connected layers. VGG16 uses small 3x3 filters with stride 1 and pad 1, which allows the network
to capture fine details in the images. The design choice to use small filters stacked together significantly
increased the depth and capacity of the model, leading to improved performance on various image
recognition benchmarks.
3.19 ResNet
ResNet, introduced by Kaiming He and his colleagues at Microsoft Research, addresses the
degradation problem that occurs when training very deep networks. ResNet uses residual connections, also
known as skip connections, which allow the gradient to bypass certain layers and be directly backpropagated.
This innovation enables the training of much deeper networks, such as ResNet-50, ResNet-101, and even
ResNet-152. ResNet models have achieved state-of-the-art results on various benchmarks and are widely
used in both academia and industry.
3.20 Transfer Learning with Image Data
Transfer learning involves leveraging pre-trained models on large datasets to adapt to specific tasks
with limited data. Instead of training a model from scratch, which requires substantial computational
resources and data, transfer learning allows one to fine-tune an existing pre-trained model. This approach
not only speeds up the training process but also improves performance, especially when the target dataset is
small. Common pre-trained models used for transfer learning include VGG16, ResNet, and Inception.
Inception V3, part of Google's Inception family, is widely used for transfer learning due to its
efficiency and accuracy. The Inception architecture employs a combination of factorized convolutions,
aggressive regularization techniques, and careful balancing of computation and accuracy. By fine-tuning
Inception V3 on a specific dataset, researchers and practitioners can leverage its powerful feature extraction
capabilities to achieve high performance on diverse image classification tasks.
3.21 Google Inception Model
Google's Inception models, starting from Inception V1 (also known as GoogLeNet), introduced the
concept of Inception modules, which apply parallel convolutions of different sizes within the same layer.
This architecture captures a wide variety of features at different scales, enhancing the model's ability to
recognize complex patterns. Subsequent versions, like Inception V2 and Inception V3, improved on this
design with more efficient architectures and better performance on large-scale image recognition tasks.
3.22 Microsoft ResNet Model
Microsoft's ResNet models, such as ResNet-50, ResNet-101, and ResNet-152, introduced deep
residual learning, which allows the training of extremely deep networks without the vanishing gradient
problem. ResNet models have become the standard in the field for their robustness and ability to generalize
well across various computer vision tasks. They are used extensively in both research and real-world
applications, demonstrating superior performance in tasks like image classification, object detection, and
segmentation.
3.23 R-CNN
R-CNN (Region-based Convolutional Neural Network), developed by Ross Girshick and colleagues,
generates region proposals using selective search and applies a CNN to each proposal for object detection.
This method significantly improved object detection accuracy but was computationally expensive due to the
need to run the CNN on each region proposal independently. Despite its computational drawbacks, R-CNN
set a new standard for object detection performance.
3.24 Fast R-CNN
Fast R-CNN, also developed by Ross Girshick, improves upon R-CNN by applying the CNN to the
entire image first, then using a Region of Interest (RoI) pooling layer to extract fixed-size feature maps for
each region proposal. These feature maps are then used for classification and bounding box regression. This
approach reduces the computational burden and speeds up both training and inference times while
maintaining high accuracy.
3.25 Faster R-CNN
Faster R-CNN, introduced by Shaoqing Ren and colleagues, integrates a Region Proposal Network
(RPN) with Fast R-CNN, eliminating the need for an external region proposal generation step. The RPN
shares convolutional layers with the detection network, providing a unified, end-to-end trainable architecture
for object detection. Faster R-CNN significantly improves the speed and efficiency of the detection pipeline
while maintaining high accuracy.
3.26 Mask R-CNN
Mask R-CNN, developed by Kaiming He and colleagues, extends Faster R-CNN by adding a branch
for predicting segmentation masks for each Region of Interest (RoI). This addition enables instance
segmentation, allowing the network to delineate objects at the pixel level. Mask R-CNN maintains the
accuracy of object detection while providing detailed segmentation, making it highly effective for
applications requiring precise object boundaries.
3.27 YOLO
YOLO (You Only Look Once), developed by Joseph Redmon and colleagues, frames object
detection as a single regression problem, directly predicting bounding boxes and class probabilities from full
images in one forward pass. YOLO is known for its speed and efficiency, making it suitable for real-time
object detection applications. YOLO's design simplifies the detection pipeline and provides a good balance
between speed and accuracy, although it may struggle with detecting small objects compared to more
complex models like Faster R-CNN.
UNIT IV NATURAL LANGUAGE PROCESSING USING RNN
About NLP & its Toolkits, Language Modeling, Vector Space Model (VSM), Continuous Bag of Words
(CBOW), Skip-Gram Model for Word Embedding, Part of Speech (PoS) Global Co-occurrence Statistics-
based Word Vectors, Transfer Learning, Word2Vec, Global Vectors for Word Representation GloVe,
Backpropagation Through Time, Bidirectional RNNs (BRNN), Long Short Term Memory (LSTM), Bi-
directional LSTM, Sequence-to-sequence Models (Seq2Seq), Gated recurrent unit GRU.
Natural language processing (NLP) is a field of computer science and a subfield of artificial
intelligence that aims to make computers understand human language. NLP uses computational linguistics,
which is the study of how language works, and various models based on statistics, machine learning, and
deep learning. These technologies allow computers to analyze and process text or voice data, and to grasp
their full meaning, including the speaker’s or writer’s intentions and emotions.
NLP powers many applications that use language, such as text translation, voice recognition, text
summarization, and chatbots. You may have used some of these applications yourself, such as voice-operated
GPS systems, digital assistants, speech-to-text software, and customer service bots. NLP also helps
businesses improve their efficiency, productivity, and performance by simplifying complex tasks that
involve language.
NLP Techniques
NLP encompasses a wide array of techniques aimed at enabling computers to process and
understand human language. These tasks can be categorized into several broad areas, each addressing
different aspects of language processing. Here are some of the key NLP techniques:
1. Text Preprocessing
• Stopword Removal: Removing common words (like “and”, “the”, “is”) that may not carry
significant meaning.
• Text Normalization: Standardizing text, including case normalization, removing punctuation, and
correcting spelling errors.
2. Syntactic Analysis
• Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun
phrases, verb phrases).
3. Semantic Analysis
• Named Entity Recognition (NER): Identifying and classifying entities in text, such as names of
people, organizations, locations, dates, etc.
• Word Sense Disambiguation (WSD): Determining which meaning of a word is used in a given
context.
• Coreference Resolution: Identifying when different words refer to the same entity in a text (e.g.,
“he” refers to “John”).
4. Information Extraction
• Entity Extraction: Identifying specific entities and their relationships within the text.
• Relation Extraction: Identifying and categorizing the relationships between entities in a text.
5. Sentiment Analysis
• Sentiment Analysis: Determining the sentiment or emotional tone expressed in a text (e.g., positive,
negative, neutral).
6. Language Generation
7. Speech Processing
8. Question Answering
• Retrieval-Based QA: Finding and returning the most relevant text passage in response to a query.
• Generative QA: Generating an answer based on the information available in a text corpus.
9. Dialogue Systems
• Chatbots and Virtual Assistants: Enabling systems to engage in conversations with users,
providing responses and performing tasks based on user input.
• Opinion Mining: Analyzing opinions or reviews to understand public sentiment toward products,
services, or topics.
Working in natural language processing (NLP) typically involves using computational techniques to
analyze and understand human language. This can include tasks such as language understanding, language
generation, and language interaction.
1. Data Collection
• Data Storage: Storing the collected text data in a structured format, such as a database or a collection
of documents.
2. Text Preprocessing
Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing steps
include:
• Stopword Removal: Removing common words that do not contribute significant meaning, such as
“and,” “the,” “is.”
• Stemming and Lemmatization: Reducing words to their base or root forms. Stemming cuts off
suffixes, while lemmatization considers the context and converts words to their meaningful base
form.
• Text Normalization: Standardizing text format, including correcting spelling errors, expanding
contractions, and handling special characters.
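The preprocessing steps above can be sketched in plain Python (the stopword list here is a tiny illustrative subset, not a standard list):

```python
import re

STOPWORDS = {"and", "the", "is", "a", "of", "to"}  # illustrative subset only

def preprocess(text):
    text = text.lower()                        # case normalization
    text = re.sub(r"[^a-z\s]", " ", text)      # strip punctuation/special characters
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat, and the DOG, chased a ball!"))
# ['cat', 'dog', 'chased', 'ball']
```

Libraries such as NLTK or spaCy provide standard stopword lists, stemmers, and lemmatizers for production use.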
3. Text Representation
• Bag of Words (BoW): Representing text as a collection of words, ignoring grammar and word order
but keeping track of word frequency.
• Term Frequency-Inverse Document Frequency (TF-IDF): A statistic that reflects the importance
of a word in a document relative to a collection of documents.
• Word Embeddings: Using dense vector representations of words where semantically similar words
are closer together in the vector space (e.g., Word2Vec, GloVe).
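Bag-of-words counts and TF-IDF weights can be sketched in plain Python (using the common idf = log(N/df) variant; real libraries add smoothing):

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

def tf_idf(docs):
    N = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    # tf is the raw count of the word in the document (the bag-of-words value).
    return [
        {w: c * math.log(N / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

weights = tf_idf(docs)
# "the" appears in every document, so its idf is log(3/3) = 0
```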
4. Feature Extraction
Extracting meaningful features from the text data that can be used for various NLP tasks.
• N-grams: Capturing sequences of N words to preserve some context and word order.
• Syntactic Features: Using parts of speech tags, syntactic dependencies, and parse trees.
• Semantic Features: Leveraging word embeddings and other representations to capture word
meaning and context.
5. Model Selection and Training
Selecting and training a machine learning or deep learning model to perform specific NLP tasks.
• Supervised Learning: Using labeled data to train models like Support Vector Machines (SVM),
Random Forests, or deep learning models like Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs).
• Unsupervised Learning: Applying techniques like clustering or topic modeling (e.g., Latent
Dirichlet Allocation) on unlabeled data.
• Pre-trained Models: Utilizing pre-trained language models such as BERT, GPT, or transformer-
based models that have been trained on large corpora.
6. Model Deployment and Inference
Deploying the trained model and using it to make predictions or extract insights from new text data.
• Text Classification: Categorizing text into predefined classes (e.g., spam detection, sentiment
analysis).
• Named Entity Recognition (NER): Identifying and classifying entities in the text.
• Question Answering: Providing answers to questions based on the context provided by text data.
7. Evaluation and Optimization
Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-score,
and others.
• Error Analysis: Analyzing errors to understand model weaknesses and improve robustness.
8. Iteration and Improvement
Continuously improving the algorithm by incorporating new data, refining preprocessing techniques,
experimenting with different models, and optimizing features.
There are a variety of technologies related to natural language processing (NLP) that are used to
analyze and understand human language. Some of the most common include:
1. Machine learning: NLP relies heavily on machine learning techniques such as supervised and
unsupervised learning, deep learning, and reinforcement learning to train models to understand and
generate human language.
2. Natural Language Toolkits (NLTK) and other libraries: NLTK is a popular open-source library in
Python that provides tools for NLP tasks such as tokenization, stemming, and part-of-speech tagging.
Other popular libraries include spaCy, OpenNLP, and CoreNLP.
3. Parsers: Parsers are used to analyze the syntactic structure of sentences, such as dependency parsing
and constituency parsing.
4. Text-to-Speech (TTS) and Speech-to-Text (STT) systems: TTS systems convert written text into
spoken words, while STT systems convert spoken words into written text.
5. Named Entity Recognition (NER) systems: NER systems identify and extract named entities such
as people, places, and organizations from the text.
7. Machine Translation: NLP is used for language translation from one language to another through
a computer.
8. Chatbots: NLP is used for chatbots that communicate with other chatbots or humans through
auditory or textual methods.
• Spam Filters: One of the most irritating things about email is spam. Gmail uses natural language
processing (NLP) to discern which emails are legitimate and which are spam. These spam filters look
at the text in all the emails you receive and try to figure out what it means to see if it’s spam or not.
• Algorithmic Trading: Algorithmic trading is used for predicting stock market conditions. Using
NLP, this technology examines news headlines about companies and stocks and attempts to
comprehend their meaning in order to determine if you should buy, sell, or hold certain stocks.
• Question Answering: NLP can be seen in action by using Google Search or Siri Services. A major
use of NLP is to make search engines understand the meaning of what we are asking and generate
natural language in return to give us the answers.
• Summarizing Information: On the internet, there is a lot of information, and a lot of it comes in the
form of long documents or articles. NLP is used to decipher the meaning of the data and then provides
shorter summaries of the data so that humans can comprehend it more quickly.
Future Scope:
• Bots: Chatbots assist clients in getting to the point quickly by answering inquiries and referring them
to relevant resources and products at any time of day or night. To be effective, chatbots must be fast,
smart, and easy to use. To accomplish this, chatbots employ NLP to understand language, usually
over text or voice-recognition interactions.
• Supporting Invisible UI: Almost every connection we have with machines involves human
communication, both spoken and written. Amazon’s Echo is only one illustration of the trend toward
putting humans in closer contact with technology in the future. The concept of an invisible or zero
user interface will rely on direct communication between the user and the machine, whether by voice,
text, or a combination of the two. NLP helps to make this concept a real-world thing.
• Smarter Search: NLP’s future also includes improved search, something we’ve been discussing at
Expert System for a long time. Smarter search, which allows a chatbot to understand a customer’s
request, can enable “search like you talk” functionality (much like you could query Siri) rather than
focusing on keywords or topics. Google recently announced that NLP capabilities have been added
to Google Drive, allowing users to search for documents and content using natural language.
Future Enhancements:
• Companies like Google are experimenting with Deep Neural Networks (DNNs) to push the limits of
NLP and make it possible for human-to-machine interactions to feel just like human-to-human
interactions.
• Basic words can be further subdivided into proper semantics and used in NLP algorithms.
• The NLP algorithms can be used in various languages that are currently unavailable such as regional
languages or languages spoken in rural areas etc.
• Translation of a sentence in one language to the same sentence in another Language at a broader
scope.
Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine
the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of
text data to provide a basis for their word predictions.
Language modeling is used in artificial intelligence (AI), natural language processing (NLP), natural
language understanding and natural language generation systems, particularly ones that perform text
generation, machine translation and question answering.
Large language models (LLMs) also use language modeling. These are advanced language models,
such as OpenAI's GPT-3 and Google's PaLM 2, that handle billions of parameters and generate
text output.
Natural language processing incorporates natural language generation and natural language understanding.
Language models determine word probability by analyzing text data. They interpret this data by
feeding it through an algorithm that establishes rules for context in natural language. Then, the model applies
these rules in language tasks to accurately predict or produce new sentences. The model essentially learns
the features and characteristics of basic language and uses those features to understand new phrases.
There are several different probabilistic approaches to modeling language. They vary depending on
the purpose of the language model. From a technical perspective, the various language model types differ in
the amount of text data they analyze and the math they use to analyze it. For example, a language model
designed to generate sentences for an automated social media bot might use different math and analyze text
data in different ways than a language model designed for determining the likelihood of a search query.
Language modeling types
There are several approaches to building language models. Some common statistical language
modeling types are the following:
• N-gram. This simple approach to a language model creates a probability distribution for a sequence
of n. The n can be any number and defines the size of the gram, or sequence of words or random
variables being assigned a probability. This allows the model to accurately predict the next word or
variable in a sentence. For example, if n = 5, a gram might look like this: "can you please call me."
The model then assigns probabilities using sequences of n size. Basically, n can be thought of as the
amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams,
trigrams and so on. N-grams can also help detect malware by analyzing strings in a file.
• Unigram. This is the simplest type of language model. It doesn't look at any conditioning context in
its calculations. It evaluates each word or term independently. Unigram models commonly handle
language processing tasks such as information retrieval. The unigram is the foundation of a more
specific model variant called the query likelihood model, which uses information retrieval to examine
a pool of documents and match the most relevant one to a specific query.
• Bidirectional. Unlike n-gram models, which analyze text in one direction, backward, bidirectional
models analyze text in both directions, backward and forward. These models can predict any word
in a sentence or body of text by using every other word in the text. Examining text bidirectionally
increases result accuracy. This type is often used in machine learning models and speech generation
applications. For example, Google uses a bidirectional model to process search queries.
• Exponential. Also known as maximum entropy models, exponential models are more complex than
n-grams. Simply put, it evaluates text using an equation that combines feature functions and n-grams.
Basically, this type of model specifies features and parameters of the desired results and, unlike n-
grams, leaves the analysis parameters more ambiguous --- it doesn't specify individual gram sizes,
for example. The model is based on the principle of entropy, which states that the probability
distribution with the most entropy is the best choice. In other words, the model with the most chaos,
and least room for assumptions, is the most accurate. Exponential models are designed to maximize
cross-entropy, which minimizes the amount of statistical assumptions that can be made. This lets
users have more trust in the results they get from these models.
• Neural language models. Neural language models use deep learning techniques to overcome the
limitations of n-gram models. These models use neural networks, such as recurrent neural networks
(RNNs), and transformers to capture complex patterns and dependencies in text. RNN language
models include long short-term memory and gated recurrent unit models. These models can consider
all previous words in a sentence when predicting the next word. This allows them to capture long-
range dependencies and generate more contextually relevant text. Transformers use self-attention
mechanisms to weigh the importance of different words in a sentence, enabling them to capture
global dependencies. Generative AI models, such as GPT-3 and PaLM 2, are based on the transformer
architecture.
• Continuous space. This is another type of neural language model that represents words as a
nonlinear combination of weights in a neural network. The process of assigning a weight to a word
is also known as word embedding. This type of model becomes especially useful as data sets get
bigger, because larger data sets often include more unique words. The presence of a lot of unique or
rarely used words can cause problems for linear models such as n-grams. This is because the amount
of possible word sequences increases, and the patterns that inform results become weaker. By
weighting words in a nonlinear, distributed way, this model can "learn" to approximate words and
not be misled by any unknown values. Its "understanding" of a given word isn't as tightly tethered to
the immediate surrounding words as it is in n-gram models.
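The n-gram approach described above can be sketched as a tiny bigram (n = 2) model in plain Python, estimating P(next word | previous word) from counts in a toy corpus:

```python
from collections import Counter

# Toy corpus; "." acts as a sentence boundary token.
corpus = "can you please call me . can you help me .".split()

pair_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
context_counts = Counter(corpus[:-1])           # counts of each preceding word

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return pair_counts[(prev, word)] / context_counts[prev]

# "can" is followed by "you" both times it occurs:
print(bigram_prob("can", "you"))   # 1.0
# "you" is followed by "please" once out of its two occurrences:
print(bigram_prob("you", "please"))   # 0.5
```

Real n-gram models add smoothing (e.g., add-one) so that unseen pairs do not receive zero probability.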
The models listed above are more general statistical approaches from which more specific variant
language models are derived. For example, as mentioned in the n-gram description, the query likelihood
model is a more specific or specialized model that uses the n-gram approach. Model types can be used in
conjunction with one another.
The models listed also vary in complexity. Broadly speaking, more complex language models are
better at NLP tasks because language itself is extremely complex and always evolving. Therefore, an
exponential model or continuous space model might be better than an n-gram for NLP tasks because they're
designed to account for ambiguity and variation in language.
A good language model should also be able to process long-term dependencies, handling words that
might derive their meaning from other words that occur in far-away, disparate parts of the text. A language
model should be able to understand when a word is referencing another word from a long distance, as
opposed to always relying on proximal words within a certain fixed history. This requires a more complex
model.
Language modeling is crucial in modern NLP applications. It's the reason that machines can
understand qualitative information. Each language model type, in one way or another, turns qualitative
information into quantitative information. This allows people to communicate with machines as they do with
each other, to a limited extent.
The roots of language modeling can be traced back to 1948. That year, Claude Shannon published a
paper titled "A Mathematical Theory of Communication." In it, he detailed the use of a stochastic model
called the Markov chain to create a statistical model for the sequences of letters in English text. This paper
had a large impact on the telecommunications industry and laid the groundwork for information theory and
language modeling. The Markov model is still used today, and n-grams are tied closely to the concept.
Language models are the backbone of NLP. Below are some NLP use cases and tasks that employ
language modeling:
• Speech recognition. This involves a machine being able to process speech audio. Voice assistants
such as Siri and Alexa commonly use speech recognition.
• Text generation. This application uses prediction to generate coherent and contextually relevant
text. It has applications in creative writing, content generation, and summarization of structured data
and other text.
• Chatbots. These bots engage in humanlike conversations with users as well as generate accurate
responses to questions. Chatbots are used in virtual assistants, customer support applications and
information retrieval systems.
• Machine translation. This involves the translation of one language to another by a machine. Google
Translate and Microsoft Translator are two programs that do this. Another is SDL Government,
which is used to translate foreign social media feeds in real time for the U.S. government.
• Parts-of-speech tagging. This use involves the markup and categorization of words by certain
grammatical characteristics. This model is used in the study of linguistics. It was first and perhaps
most famously used in the study of the Brown Corpus, a body of random English prose that was
designed to be studied by computers. This corpus has been used to train several important language
models, including one used by Google to improve search quality.
• Parsing. This use involves analysis of any string of data or sentence that conforms to formal grammar
and syntax rules. In language modeling, this can take the form of sentence diagrams that depict each
word's relationship to the others. Spell-checking applications use language modeling and parsing.
• Optical character recognition. This application involves the use of a machine to convert images of
text into machine-encoded text. The image can be a scanned document or document photo, or a photo
with text somewhere in it -- on a sign, for example. Optical character recognition is often used in
data entry when processing old paper records that need to be digitized. It can also be used to analyze
and identify handwriting samples.
• Information retrieval. This approach involves searching in a document for information, searching
for documents in general and searching for metadata that corresponds to a document. Web browsers
are the most common information retrieval applications.
• Observed data analysis. These language models analyze observed data such as sensor data,
telemetric data and data from experiments.
• Sentiment analysis. This application involves determining the sentiment behind a given phrase.
Specifically, sentiment analysis is used to understand opinions and attitudes expressed in a text.
Businesses use it to analyze unstructured data, such as product reviews and general posts about their
product, as well as analyze internal data such as employee surveys and customer support chats. Some
services that provide sentiment analysis tools are Repustate and HubSpot's Service Hub. Google's
NLP tool Bert is also used for sentiment analysis.
Sentiment analysis uses language modeling technology to detect and analyze keywords in customer reviews
and posts.
State-of-the-art LLMs have demonstrated impressive capabilities in generating human language and
humanlike text and understanding complex language patterns. Leading models such as those that
power ChatGPT and Bard have billions of parameters and are trained on massive amounts of data. Their
success has led them to being implemented into Bing and Google search engines, promising to change the
search experience.
New data science techniques, such as fine-tuning and transfer learning, have become essential in
language modeling. Rather than training a model from scratch, fine-tuning lets developers take a pre-trained
language model and adapt it to a task or domain. This approach has reduced the amount of labeled data
required for training and improved overall model performance.
As language models and their techniques become more powerful and capable, ethical considerations
become increasingly important. Issues such as bias in generated text, misinformation and the potential
misuse of AI-driven language models have led many AI experts and developers such as Elon Musk to warn
against their unregulated development.
Language modeling is one of the leading techniques in generative AI.
4.3 Vector Space Model (VSM)
The Vector Space Model (VSM) is a mathematical framework used in information retrieval and
natural language processing (NLP) to represent and analyze textual data. It is fundamental in text
mining and in text-based machine learning tasks such as document classification, information retrieval,
and text similarity analysis.
The Vector Space Model represents documents and terms as vectors in a multi-dimensional space.
Each dimension corresponds to a unique term in the entire corpus of documents, and both documents
and queries can be represented as vectors within that space.
1. Document-Term Matrix: To create the vector representation of a collection of documents, you first
construct a Document-Term Matrix (DTM) or Term-Document Matrix (TDM). Rows in this matrix
represent documents, and columns represent terms (words or phrases). Each cell contains a numerical
value representing a term’s frequency or importance within a document.
2. Term Frequency-Inverse Document Frequency (TF-IDF): Once you have the DTM, you often
apply a TF-IDF transformation to the raw term frequencies. TF-IDF stands for Term Frequency-
Inverse Document Frequency and is a measure that reflects the importance of a term within a document
relative to its importance across all documents in the corpus. It helps in highlighting important terms
while downplaying common terms.
3. Cosine Similarity: To compare documents or perform text retrieval, you can use cosine similarity as
a metric to measure the similarity between two document vectors. Cosine similarity calculates the
cosine of the angle between two vectors and ranges from -1 (entirely dissimilar) to 1 (completely
similar). A higher cosine similarity indicates a more significant similarity between documents.
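The DTM → TF-IDF → cosine-similarity pipeline above can be sketched in plain Python. The toy corpus and the unsmoothed TF-IDF variant below are illustrative assumptions; libraries such as scikit-learn provide production implementations.

```python
import math

# A toy corpus; the documents and the smoothing-free TF-IDF variant below
# are illustrative assumptions, not a fixed standard.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tf_idf_vector(doc):
    """Raw term frequency weighted by inverse document frequency."""
    n_docs = len(tokenized)
    vec = []
    for term in vocab:
        tf = doc.count(term)
        df = sum(1 for d in tokenized if term in d)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vectors = [tf_idf_vector(d) for d in tokenized]
# Documents 0 and 1 share several terms, so their similarity exceeds that
# of documents 0 and 2, which share none.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```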
4.4 Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) is a popular natural language processing technique used to
generate word embeddings. Word embeddings are important for many NLP tasks because they capture
semantic and syntactic relationships between words in a language. CBOW is a neural network-based
algorithm that predicts a target word given its surrounding context words. It is a type of “unsupervised”
learning, meaning that it can learn from unlabeled data, and it is often used to pre-train word embeddings
that can be used for various NLP tasks such as sentiment analysis, text classification, and machine
translation.
• The Bag-of-Words model and the Continuous Bag-of-Words model are both techniques used in
natural language processing to represent text in a computer-readable format, but they differ in how
they capture context.
• The BoW model represents text as a collection of words and their frequency in a given document or
corpus. It does not consider the order or context in which the words appear, and therefore, it may not
capture the full meaning of the text. The BoW model is simple and easy to implement, but it has
limitations in capturing the meaning of language.
• In contrast, the CBOW model is a neural network-based approach that captures the context of words.
It learns to predict the target word based on the words that appear before and after it in a given context
window. By considering the surrounding words, the CBOW model can better capture the meaning
of a word in a given context.
The CBOW model uses the context words around a target word in order to predict it. Consider the
example "She is a great dancer." The CBOW model converts this phrase into pairs of context words and
target words. With a window size of 2, the word pairings would be ([she, a], is), ([is, great], a) and ([a, dancer],
great).
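The pairing scheme in the example can be reproduced with a short helper (the function name cbow_pairs is ours, chosen for illustration):

```python
def cbow_pairs(tokens, half_window=1):
    """Build (context, target) pairs with `half_window` words on each side.
    Only positions with a full context on both sides become targets, which
    matches the worked example in the text (window size = 2 context words)."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        context = tokens[i - half_window:i] + tokens[i + 1:i + 1 + half_window]
        pairs.append((context, tokens[i]))
    return pairs

sentence = "She is a great dancer".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# -> (['She', 'a'], 'is'), (['is', 'great'], 'a'), (['a', 'dancer'], 'great')
```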
CBOW Architecture
The model considers the context words and tries to predict the target word. If four context words are
used to predict one target word, four 1×V input vectors (one-hot encodings, where V is the vocabulary
size) are passed to the input layer. The hidden layer receives the input vectors and multiplies each by a
V×N weight matrix. The resulting 1×N vectors from the hidden layer then enter the sum layer, where they
are element-wise summed before a final activation is carried out and the output is obtained from the
output layer.
4.5 Skip-Gram Model for Word Embedding
We'll train the Skip-gram model by creating a "fake" task for the neural network. We won't be interested
in the inputs and outputs of this network; the goal is actually just to learn the weights of the hidden layer,
which are the word embeddings themselves.
The fake task for the Skip-gram model is: given a word, predict its neighboring words.
In each training window, one word is taken as the source word and the surrounding words as its
neighboring words. For example, given the sentence fragment "have orange juice and eggs"
and a window size of 2, if the target word is juice, its neighboring words will be (have, orange, and,
eggs). Our input and target word pairs would be (juice, have), (juice, orange), (juice, and), (juice, eggs).
Also note that within the sample window, proximity of the words to the source word plays no role, so have,
orange, and, and eggs are treated the same while training.
Architecture for skip-gram model. Source: McCormickml tutorial
The dimensions of the input vector will be 1×V, where V is the number of words in the vocabulary,
i.e. a one-hot representation of the word. The single hidden layer has a V×E weight matrix, where E is the size
of the word embedding and is a hyper-parameter. The output from the hidden layer is of dimension
1×E, which we feed into a softmax layer. The dimensions of the output layer will be 1×V, where each
value in the vector is the probability score of the target word at that position.
For example, if the output is the vector [0.2, 0.1, 0.3, 0.4] over the vocabulary (mango, strawberry,
city, Delhi), the probability of the target word being mango is 0.2, strawberry is 0.1, city is 0.3 and Delhi is 0.4.
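A minimal sketch of this forward pass in plain Python, with toy vocabulary and embedding sizes and random illustrative weights:

```python
import math
import random

random.seed(0)
V, E = 6, 3  # vocabulary size and embedding size (illustrative toy values)

# Input->hidden weights (V x E) hold the word embeddings; hidden->output
# weights (E x V) produce one score per vocabulary word.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(E)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(E)]

def skipgram_forward(word_index):
    # A one-hot 1xV input simply selects row `word_index` of W_in:
    # this is the 1xE hidden layer.
    hidden = W_in[word_index]
    # 1xE times ExV gives the 1xV scores.
    scores = [sum(hidden[e] * W_out[e][v] for e in range(E)) for v in range(V)]
    # Softmax turns the scores into a probability over the vocabulary.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

probs = skipgram_forward(2)
print(sum(probs))  # probabilities over the vocabulary sum to 1
```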
The backpropagation for all training samples corresponding to a source word is done in one backward pass. So
for juice, we complete the forward pass for all 4 target words (have, orange, and, eggs). We then
calculate the error vectors (of dimension 1×V) corresponding to each target word. We now have four 1×V error
vectors and perform an element-wise sum to get a single 1×V vector. The weights of the hidden layer are then
updated by backpropagating this accumulated error.
4.6 Part of Speech (PoS) Global Co-occurrence Statistics-based Word Vectors
Part of Speech (PoS) global co-occurrence statistics-based word vectors are a type of word
representation that leverages the distributional properties of words in a large corpus, combined with their
syntactic roles. These vectors capture semantic and syntactic information by considering both the words
themselves and their parts of speech.
Concept
1. Word Vectors: These are dense, continuous-valued vectors representing words in a high-dimensional
space. They encode semantic relationships, such as similarity and analogy, by the relative positions of words.
2. Global Co-occurrence: This refers to the statistical properties of how words appear together across an
entire corpus. By examining these patterns, one can derive meaningful relationships between words.
3. Parts of Speech (PoS): PoS tagging classifies words into categories like nouns, verbs, adjectives, etc.
Incorporating PoS into word vectors adds syntactic context, refining the semantic representation.
Methodology
1. PoS Tagging: Apply a PoS tagger to annotate each word in the corpus with its corresponding part of
speech.
2. Co-occurrence Matrix Construction: Build a matrix where rows represent target words and columns
represent context words, populated by their co-occurrence counts. Separate matrices can be constructed for
different PoS tags.
3. Weighting Scheme: Use weighting schemes such as Positive Pointwise Mutual Information (PPMI) to
enhance the co-occurrence matrix, emphasizing informative co-occurrences.
4. Dimensionality Reduction: Apply techniques like Singular Value Decomposition (SVD) to reduce the
dimensionality of the co-occurrence matrix, yielding dense word vectors.
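Steps 2 and 3 of the methodology (co-occurrence counting and PPMI weighting) can be sketched in plain Python. The toy corpus and window size are assumptions; a full pipeline would additionally keep separate matrices per PoS tag and apply SVD to the result.

```python
import math
from collections import Counter

# Toy corpus and window size -- illustrative assumptions for this sketch.
corpus = "the cat sat on the mat the dog sat on the log".split()
window = 2

# Step 2: co-occurrence counts within a symmetric window.
pair_counts = Counter()
word_counts = Counter(corpus)
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pair_counts[(target, corpus[j])] += 1

total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())

# Step 3: PPMI weighting of one cell of the matrix.
def ppmi(target, context):
    """Positive pointwise mutual information for a (target, context) cell."""
    joint = pair_counts[(target, context)] / total_pairs
    if joint == 0:
        return 0.0
    p_t = word_counts[target] / total_words
    p_c = word_counts[context] / total_words
    return max(0.0, math.log2(joint / (p_t * p_c)))

# "sat" and "on" always occur together here, so their PPMI is positive.
print(ppmi("sat", "on"))
```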
Advantages
-Enhanced Semantic Understanding: Incorporating PoS information helps distinguish between different
meanings of a word depending on its syntactic role.
-Contextual Nuance: Captures subtle differences in word usage across different contexts and syntactic
structures.
Applications
1. Natural Language Processing (NLP): Improves tasks such as word sense disambiguation, syntactic
parsing, and semantic role labeling.
2. Information Retrieval: Enhances search algorithms by understanding user queries in a more nuanced
way.
3. Machine Translation: Provides better context for translating words with multiple meanings depending
on their PoS.
4.7 Transfer Learning
We humans are very good at transferring knowledge between tasks. Whenever we encounter a new
problem or task, we recognize it and apply relevant knowledge from our previous learning experiences.
This makes our work easier and faster to finish. For instance, if you know how to ride a bicycle and are
asked to ride a motorbike you have never ridden before, your experience with the bicycle will come into
play for tasks like balancing the bike, steering, etc. This makes things easier compared to starting as a
complete beginner. Such transfer is very useful in real life, as it lets us build on accumulated experience.
Following the same approach, the term Transfer Learning was introduced in the field of machine learning.
This approach involves using knowledge learned in some task and applying it to solve a problem in a
related target task.
While most machine learning is designed to address a single task, the development of algorithms that
facilitate transfer learning is a topic of ongoing interest in the machine-learning community.
Transfer learning is a technique in machine learning where a model trained on one task is used as the
starting point for a model on a second task. This can be useful when the second task is similar to the first
task, or when there is limited data available for the second task. By using the learned features from the first
task as a starting point, the model can learn more quickly and effectively on the second task. This can also
help to prevent overfitting, as the model will have already learned general features that are likely to be useful
in the second task.
Many deep neural networks trained on images have a curious phenomenon in common: in the early
layers of the network, a deep learning model learns low-level features, like edges, colours, variations of
intensity, etc. Such features appear not to be specific to a particular dataset or task: no matter whether we
are processing images to detect a lion or a car, in both cases we have to detect these low-level features.
They occur regardless of the exact cost function or image dataset. Thus, the features learned in one task
of detecting lions can be reused in other tasks like detecting humans.
• Pre-trained Model: Start with a model that has previously been trained for a certain task using a
large dataset. Having been trained on extensive data, this model has identified general features
and patterns relevant to numerous related tasks.
• Base Model: The pre-trained model is known as the base model. It is made up of layers
that have used the incoming data to learn hierarchical feature representations.
• Transfer Layers: In the pre-trained model, find a set of layers that capture generic information
relevant to the new task as well as the previous one. Because they tend to learn low-level
features, these layers are frequently the early layers, near the input of the network.
• Fine-tuning: Retrain the chosen layers on the dataset from the new task. This procedure is
called fine-tuning. The goal is to preserve the knowledge from pre-training while enabling
the model to adjust its parameters to better suit the demands of the current task.
Transfer Learning
Low-level features learned for task A should be beneficial for learning a model for task B.
This is what transfer learning is. Nowadays, it is very rare to see people training whole convolutional
neural networks from scratch; it is common to take a model pre-trained on a large dataset for a similar task,
e.g. models trained on ImageNet (1.2 million images with 1000 categories), and use its features to solve
a new task. When dealing with transfer learning, we come across a phenomenon called freezing of layers.
A layer (a CNN layer, a hidden layer, a block of layers, or any subset of all layers) is said to be frozen
when it is no longer trained. The weights of frozen layers are not updated during training, while layers
that are not frozen follow the regular training procedure. When we use transfer learning to solve a
problem, we select a pre-trained model as our base model. There are then two possible approaches to
using knowledge from the pre-trained model. The first is to freeze a few layers of the pre-trained model
and train the other layers on our new dataset for the new task. The second is to make a new model, but
also take out some features from layers in the pre-trained model and use them in the newly created model.
In both cases, we take out some of the learned features and train the rest of the model. This ensures that
the features likely to be common to both tasks are taken from the pre-trained model, and the rest of the
model is adapted to the new dataset by training.
Now, one may ask how to determine which layers we need to freeze and which layers need to be trained.
The answer is simple: the more you want to inherit features from the pre-trained model, the more layers
you have to freeze. For instance, if the pre-trained model detects some flower species and we need to
detect some new species, the new dataset contains a lot of features similar to the pre-trained model's.
Thus, we freeze a larger number of layers, so that we can reuse most of its knowledge in the new model.
Now, consider another case: if there is a pre-trained model which detects humans in images, and we want
to use that knowledge to detect cars, the dataset is entirely different, and it is not good to freeze many
layers, because a large number of frozen layers would carry over not only low-level features but also
high-level features like noses and eyes, which are useless for the new dataset (car detection). Thus, we
only copy the low-level features from the base network and train the rest of the network on the new dataset.
Let’s consider all situations where the size and dataset of the target task vary from the base network.
• The target dataset is small and similar to the base network dataset: Since the target dataset is
small, that means we can fine-tune the pre-trained network with the target dataset. But this may lead
to a problem of overfitting. Also, there may be some changes in the number of classes in the target
task. So, in such a case we remove the fully connected layers from the end, maybe one or two, and
add a new fully connected layer satisfying the number of new classes. Now, we freeze the rest of the
model and only train newly added layers.
• The target dataset is large and similar to the base training dataset: When the dataset is large
enough to fine-tune a pre-trained model, there is little chance of overfitting. Here also the last
fully-connected layer is removed, and a new fully-connected layer is added with the proper number of
classes. Then the entire model is trained on the new dataset. This tunes the model on the new,
large dataset while keeping the model architecture the same.
• The target dataset is small and different from the base network dataset: Since the target dataset
is different, using the high-level features of the pre-trained model will not be useful. In such a case,
remove most of the layers from the end of the pre-trained model, and add new layers matching the
number of classes in the new dataset. This way we can use the low-level features from the pre-trained
model and train the remaining layers to fit the new dataset. Sometimes, it is beneficial to train the
entire network after adding the new layers at the end.
• The target dataset is large and different from the base network dataset: Since the target dataset
is large and different, the best approach is to remove the last layers from the pre-trained network, add
layers matching the new number of classes, and then train the entire network without freezing any layer.
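The freezing idea can be illustrated framework-free: during the update step, the parameters of frozen layers are simply skipped. In practice one would set `requires_grad=False` in PyTorch or `layer.trainable = False` in Keras; the layer names and numbers below are toy values chosen for illustration.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD update to every parameter group except those frozen."""
    for name in params:
        if name in frozen:
            continue  # frozen layer: weights keep their pre-trained values
        params[name] = [w - lr * g for w, g in zip(params[name], grads[name])]
    return params

# A pre-trained "base" layer plus a newly added classification "head".
pretrained = {"base_layer": [0.5, -0.3], "head_layer": [0.1, 0.2]}
grads = {"base_layer": [1.0, 1.0], "head_layer": [1.0, 1.0]}

updated = sgd_step(dict(pretrained), grads, frozen={"base_layer"})
print(updated["base_layer"])  # unchanged: [0.5, -0.3]
print(updated["head_layer"])  # each weight reduced by lr * grad
```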
4.8 Word2Vec
Word2Vec revolutionized natural language processing by transforming words into dense vector
representations that capture semantic relationships. This section explores its fundamentals: what
Word2Vec is, why word embeddings are crucial for NLP tasks, and the mechanics of the CBOW and
Skip-gram models.
What is Word2Vec?
Word2Vec creates vectors of the words that are distributed numerical representations of word features;
these word features could comprise words that represent the context of the individual words present in
our vocabulary. Word embeddings eventually help in establishing the association of a word with other
similar-meaning words: when word embeddings are plotted, words with similar meanings appear closer
together in the vector space.
Two different model architectures that can be used by Word2Vec to create the word embeddings are
the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
4.9 Global Vectors for Word Representation GloVe
GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm
developed at Stanford that generates word embeddings by aggregating a global word-word co-occurrence
matrix from a corpus. The resulting embeddings show interesting linear substructures of words in the
vector space.
4.10 Backpropagation Through Time
Recurrent Neural Networks are networks that deal with sequential data. They predict outputs
using not only the current inputs but also by taking into consideration those that occurred before. In other
words, the current output depends on the current input as well as a memory element (which takes into
account the past inputs). For training such networks, we use good old backpropagation but with a slight
twist: we don't train the system only at a specific time "t"; we train it at time "t" as well as for all that
has happened before time "t", like t-1, t-2, t-3. Consider the following representation of an RNN:
RNN Architecture
S1, S2, S3 are the hidden states or memory units at time t1, t2, t3 respectively, and Ws is the weight
matrix associated with it. X1, X2, X3 are the inputs at time t1, t2, t3 respectively, and Wx is the weight
matrix associated with it. Y1, Y2, Y3 are the outputs at time t1, t2, t3 respectively, and Wy is the weight
matrix associated with it.
We use the squared error here, E3 = (d3 - Y3)^2, where d3 is the desired output at time t = 3. To perform
backpropagation, we have to adjust the weights associated with the inputs, the memory units and the
outputs.
Adjusting Wy: For better understanding, let us consider the following representation:
Adjusting Wy
Explanation: E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3. Y3 is a function of WY. Hence,
we differentiate Y3 w.r.t WY. Adjusting Ws For better understanding, let us consider the following
representation:
Adjusting Ws
Explanation: E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3. Y3 is a function of S3. Hence,
we differentiate Y3 w.r.t S3. S3 is a function of WS. Hence, we differentiate S3 w.r.t WS. But we can’t stop
with this; we also have to take into consideration, the previous time steps. So, we differentiate (partially)
the Error function with respect to memory units S2 as well as S1 taking into consideration the weight
matrix WS. We have to keep in mind that a memory unit, say S t is a function of its previous memory unit
St-1. Hence, we differentiate S3 with S2 and S2 with S1. Generally, we can express this formula as:
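The general formula the text refers to can be written out explicitly. This is a reconstruction from the chain of derivatives just described, assuming the squared error E3 = (d3 - Y3)^2 and the notation Y3, S1..S3, WS used above:

```latex
\frac{\partial E_3}{\partial W_S}
  = \sum_{k=1}^{3}
    \frac{\partial E_3}{\partial Y_3}\,
    \frac{\partial Y_3}{\partial S_3}\,
    \frac{\partial S_3}{\partial S_k}\,
    \frac{\partial S_k}{\partial W_S},
\qquad
\frac{\partial S_3}{\partial S_k}
  = \prod_{j=k+1}^{3} \frac{\partial S_j}{\partial S_{j-1}}
```

The repeated product of ∂Sj/∂Sj-1 terms is exactly what shrinks (or explodes) as the number of time steps grows.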
Adjusting WX: For better understanding, let us consider the following representation:
Adjusting Wx
Explanation: E3 is a function of Y3. Hence, we differentiate E3 w.r.t Y3. Y3 is a function of S3. Hence,
we differentiate Y3 w.r.t S3. S3 is a function of WX. Hence, we differentiate S3 w.r.t WX. Again we can’t
stop with this; we also have to take into consideration, the previous time steps. So, we differentiate
(partially) the Error function with respect to memory units S2 as well as S1 taking into consideration the
weight matrix WX.
Limitations: This method of backpropagation through time (BPTT) can only be used up to a limited number
of time steps, like 8 or 10. If we backpropagate further, the gradient becomes too small. This problem is
called the "vanishing gradient" problem: the contribution of information decays geometrically over time.
So if the number of time steps is greater than, say, 10, that information will effectively be discarded.
Going Beyond RNNs: One of the famous solutions to this problem is by using what is called Long Short-
Term Memory (LSTM for short) cells instead of the traditional RNN cells. But there might arise yet
another problem here, called the exploding gradient problem, where the gradient grows uncontrollably
large.
Solution: A popular method called gradient clipping can be used where in each time step, we can check
if the gradient > threshold. If yes, then normalize it.
4.11 Bidirectional RNNs (BRNN)
A bidirectional recurrent neural network (BRNN) is a neural network architecture made to process
sequential data. BRNNs process input sequences in both the forward and backward directions, so that
the network can use information from both past and future context in its predictions. This is the
main distinction between BRNNs and conventional recurrent neural networks.
A BRNN has two distinct recurrent hidden layers, one of which processes the input sequence forward
and the other of which processes it backward. After that, the results from these hidden layers are collected
and input into a prediction-making final layer. Any recurrent neural network cell, such as Long Short-Term
Memory (LSTM) or Gated Recurrent Unit, can be used to create the recurrent hidden layers.
The BRNN functions similarly to conventional recurrent neural networks in the forward direction,
updating the hidden state depending on the current input and the prior hidden state at each time step. The
backward hidden layer, on the other hand, analyses the input sequence in the opposite manner, updating the
hidden state based on the current input and the hidden state of the next time step.
Compared to conventional unidirectional recurrent neural networks, the accuracy of the BRNN is
improved since it can process information in both directions and account for both past and future contexts.
Because the two hidden layers can complement one another and give the final prediction layer more data,
using two distinct hidden layers also offers a type of model regularisation.
In order to update the model parameters, gradients are computed for both the forward and
backward passes of the backpropagation through time technique that is typically used to train BRNNs.
At inference time, the input sequence is processed by the BRNN in a single forward pass, and predictions
are made based on the combined outputs of the two hidden layers.
The forward and backward hidden states are computed as
Ht(Forward) = A(Xt * WXH(Forward) + Ht-1(Forward) * WHH(Forward) + b(Forward))
Ht(Backward) = A(Xt * WXH(Backward) + Ht+1(Backward) * WHH(Backward) + b(Backward))
where,
A = activation function,
W = weight matrix,
b = bias.
The hidden state at time t is given by a combination of Ht(Forward) and Ht(Backward). The output at any
given hidden state is:
Yt = Ht * WHY + by
The training of a BRNN is similar to the backpropagation through time (BPTT) algorithm. The BPTT
algorithm works as follows:
• Roll out the network and calculate errors at each time step.
• Update weights and roll up the network.
However, because forward and backward passes in a BRNN occur simultaneously, updating the weights for
the two processes could occur at the same time, which produces inaccurate outcomes. Thus, a BRNN is
trained so that the forward and backward passes are accommodated individually.
Applications of Bidirectional Recurrent Neural Network
Bi-RNNs have been applied to various natural language processing (NLP) tasks, including:
1. Sentiment Analysis: By taking into account both the prior and subsequent context, BRNNs
can be utilized to categorize the sentiment of a particular sentence.
2. Named Entity Recognition: By considering the context both before and after the stated thing,
BRNNs can be utilized to identify those entities in a sentence.
3. Part-of-Speech Tagging: The classification of words in a phrase into their corresponding parts
of speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
4. Machine Translation: BRNNs can be used in encoder-decoder models for machine
translation, where the decoder creates the target sentence and the encoder analyses the source
sentence in both directions to capture its context.
5. Speech Recognition: When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech recognition systems.
Advantages of Bidirectional RNN
• Context from both past and future: With the ability to process sequential input both forward
and backward, BRNNs provide a thorough grasp of the full context of a sequence. Because of
this, BRNNs are effective at tasks like sentiment analysis and speech recognition.
• Enhanced accuracy: BRNNs frequently yield more precise answers since they take both
historical and upcoming data into account.
• Efficient handling of variable-length sequences: When compared to conventional RNNs,
which require padding to have a constant length, BRNNs are better equipped to handle variable-
length sequences.
• Resilience to noise and irrelevant information: BRNNs may be resistant to noise and
irrelevant data that are present in the data. This is so because both the forward and backward
paths offer useful information that supports the predictions made by the network.
• Ability to handle sequential dependencies: BRNNs can capture long-term links between
sequence pieces, making them extremely adept at handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
• Computational complexity: Given that they analyze data both forward and backward,
BRNNs can be computationally expensive due to the increased amount of calculations needed.
• Long training time: BRNNs can also take a while to train because there are many parameters
to optimize, especially when using huge datasets.
• Difficulty in parallelization: Due to the requirement for sequential processing in both the
forward and backward directions, BRNNs can be challenging to parallelize.
• Overfitting: BRNNs are prone to overfitting since they include many parameters that might
result in too complicated models, especially when trained on short datasets.
• Interpretability: Due to the processing of data in both forward and backward directions,
BRNNs can be tricky to interpret since it can be difficult to comprehend what the model is doing
and how it is producing predictions.
4.12 Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM) networks are a variant of recurrent neural networks designed to retain
information over long sequences. Information is retained by the cells, and the memory manipulations are
done by the gates. There are three gates: the forget gate, the input gate and the output gate.
Forget Gate
Forget Gate
The forget gate removes information that is no longer useful in the cell state. Two
inputs, xt (input at the particular time) and ht-1 (previous cell output), are fed to the gate and multiplied with
weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation
function, which gives an output between 0 and 1. If for a particular cell state the output is close to 0, the
piece of information is forgotten, and for an output close to 1, the information is retained for future use:
ft = σ(Wf · [ht-1, xt] + bf)
where:
• Wf represents the weight matrix associated with the forget gate.
• [ht-1, xt] denotes the concatenation of the current input and the previous hidden state.
• bf is the bias of the forget gate.
• σ is the sigmoid activation function.
Input gate
The input gate adds useful information to the cell state. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs h_t-1 and x_t:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Then, a vector of candidate values is created using the tanh function, which gives outputs between -1 and +1:
C̃_t = tanh(W_C · [h_t-1, x_t] + b_C)
Finally, the candidate values and the regulated values are multiplied to obtain the useful information, and the cell state is updated as:
C_t = f_t ⊙ C_t-1 + i_t ⊙ C̃_t
Multiplying the previous state by f_t disregards the information we had previously chosen to ignore; adding i_t ⊙ C̃_t includes the candidate values, scaled by how much we chose to update each state value.
• ⊙ denotes element-wise multiplication
• tanh is tanh activation function
Output gate
The output gate extracts useful information from the current cell state to be presented as the output. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs h_t-1 and x_t. Finally, the values of the vector and the regulated values are multiplied to be sent as the output and as the input to the next cell. The equations for the output gate are:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
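Taken together, the three gates can be sketched as a single LSTM time step in plain NumPy. This is a minimal illustration of the gate computations described above; the weight shapes and variable names are illustrative assumptions, not any library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step. Each W_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to discard
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to write
    c_hat = np.tanh(W_c @ z + b_c)      # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_hat    # updated cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t
```

Running this repeatedly over a sequence, feeding each step's h_t and c_t into the next, reproduces the recurrence described in this section.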
4.13 Bidirectional LSTM (BiLSTM)
Bidirectional LSTM or BiLSTM is a term used for a sequence model which contains
two LSTM layers, one for processing input in the forward direction and the other for processing in the
backward direction. It is usually used in NLP-related tasks. The intuition behind this approach is that by
processing data in both directions, the model is able to better understand the relationship between sequences
(e.g. knowing the following and preceding words in a sentence).
To better understand this let us see an example. The first statement is “Server can you bring me this
dish” and the second statement is “He crashed the server”. In both these statements, the word server has
different meanings and this relationship depends on the following and preceding words in the statement.
The bidirectional LSTM helps the machine understand this relationship better than a unidirectional LSTM. This ability makes BiLSTM a suitable architecture for tasks like sentiment analysis, text classification, and machine translation.
Architecture
The architecture of bidirectional LSTM comprises two unidirectional LSTMs which process the sequence in the forward and backward directions. This architecture can be interpreted as having two separate LSTM networks: one gets the sequence of tokens as it is, while the other gets it in reverse order. Both of these LSTM networks return a probability vector as output, and the final output is the combination of both of these probabilities.
Bidirectional LSTM layer Architecture
Figure 1 describes the architecture of the BiLSTM layer: each input token is fed to both a forward LSTM node and a backward LSTM node, and the corresponding output token is the combination of the outputs of the two nodes.
Now, let us look into an implementation of a review system using BiLSTM layers in Python using
the Tensorflow library. We would be performing sentiment analysis on the IMDB movie review dataset.
We would implement the network from scratch and train it to identify if the review is positive or negative.
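A full TensorFlow/Keras IMDB implementation is beyond this excerpt, but the bidirectional wiring itself can be sketched framework-free. In the sketch below, simple tanh recurrent cells stand in for the LSTM cells, and all function and parameter names are illustrative assumptions; in practice one would wrap an LSTM layer in a bidirectional wrapper provided by the framework.

```python
import numpy as np

def rnn_pass(xs, W_x, W_h, b):
    """Run a simple tanh recurrent cell over a sequence; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate forward-pass and re-aligned backward-pass states per token."""
    fwd = rnn_pass(xs, *fwd_params)
    bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]  # reverse input, then re-align outputs
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output vector has twice the hidden size, since it combines the forward state (summarising the preceding context) with the backward state (summarising the following context) for that token.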
4.14 Sequence-to-sequence Models (Seq2Seq)
The Seq2Seq model is a kind of machine learning model that takes sequential data as input and generates sequential data as output. Before the arrival of Seq2Seq models, machine translation systems relied on statistical methods and phrase-based approaches. The most popular approach was the use of phrase-based statistical machine translation (SMT) systems, which were not able to handle long-distance dependencies or capture global context.
Seq2Seq models addressed the issues by leveraging the power of neural networks,
especially recurrent neural networks (RNN). The concept of seq2seq model was introduced in the paper
titled “Sequence to Sequence Learning with Neural Networks” by Google. The architecture discussed in this paper is a fundamental framework for natural language processing tasks. The seq2seq models
are encoder-decoder models. The encoder processes the input sequence and transforms it into a fixed-size
hidden representation. The decoder uses the hidden representation to generate output sequence. The
encoder-decoder structure allows them to handle input and output sequences of different lengths, making them well suited to sequential data. Seq2Seq models are trained using a dataset of input-output pairs,
where the input is a sequence of tokens, and the output is also a sequence of tokens. The model is trained to
maximize the likelihood of the correct output sequence given the input sequence.
The advancement in neural networks architectures led to the development of more capable seq2seq
model named the transformer. “Attention Is All You Need” was the research paper that first introduced the transformer model, after which language-related models took a huge leap.
The main idea behind the transformers model was that of attention layers and different encoder and decoder
stacks which were highly efficient to perform language-related tasks.
Seq2Seq models have been widely used in NLP tasks due to their ability to handle variable-length
input and output sequences. Additionally, the attention mechanism is often used in Seq2Seq models to
improve performance and it allows the decoder to focus on specific parts of the input sequence when
generating the output.
What is Encoder and Decoder in Seq2Seq model?
In the seq2seq model, the Encoder and the Decoder architecture plays a vital role in converting input
sequences into output sequences. Let’s explore about each block:
Encoder Block
The main purpose of the encoder block is to process the input sequence and capture information in
a fixed-size context vector.
Architecture:
• The input sequence is put into the encoder.
• The encoder processes each element of the input sequence using neural networks (or
transformer architecture).
• Throughout this process, the encoder keeps an internal state, and the ultimate hidden state
functions as the context vector that encapsulates a compressed representation of the entire input
sequence. This context vector captures the semantic meaning and important information of the
input sequence.
The final hidden state of the encoder is then passed as the context vector to the decoder.
Decoder Block
The decoder block is similar to the encoder block. The decoder processes the context vector from the encoder to generate the output sequence incrementally.
Architecture:
• In the training phase, the decoder receives both the context vector and the desired target output
sequence (ground truth).
• During inference, the decoder relies on its own previously generated outputs as inputs for
subsequent steps.
The decoder uses the context vector to comprehend the input sequence and create the corresponding
output sequence. It engages in autoregressive generation, producing individual elements sequentially. At
each time step, the decoder uses the current hidden state, the context vector, and the previous output token
to generate a probability distribution over the possible next tokens. The token with the highest probability
is then chosen as the output, and the process continues until the end of the output sequence is reached.
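The autoregressive generation loop described above can be sketched as follows. Here step_fn is a stand-in for the decoder cell, and the token handling is an illustrative assumption rather than any library's API.

```python
import numpy as np

def greedy_decode(step_fn, context, start_token, end_token, max_len=20):
    """Autoregressive greedy decoding: feed each chosen token back as input.

    step_fn(context, h, token) -> (probs, h) stands in for one decoder step,
    returning a distribution over the vocabulary and the new hidden state.
    """
    h = np.zeros_like(context)
    token, out = start_token, []
    for _ in range(max_len):
        probs, h = step_fn(context, h, token)
        token = int(np.argmax(probs))   # pick the most likely next token
        if token == end_token:
            break
        out.append(token)
    return out
```

In practice beam search often replaces the argmax step, keeping several candidate sequences instead of committing to the single most likely token at each step.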
RNN based Seq2Seq Model
The decoder and encoder architecture utilizes RNNs to generate desired outputs. Let’s look at the
simplest seq2seq model.
Recurrent Neural Networks can easily map sequences to sequences when the alignment between the inputs and the outputs is known in advance. The vanilla version of the RNN is rarely used; its more advanced variants, i.e. LSTM or GRU, are used instead, because the vanilla RNN suffers from the vanishing gradient problem. The LSTM develops the context of the word by taking two inputs at each point in time: one from the user and the other from its previous output, hence the name recurrent (the output goes back in as input).
Advantages of seq2seq Models
• Flexibility: Seq2Seq models can handle a wide range of tasks such as machine translation, text
summarization, and image captioning, as well as variable-length input and output sequences.
• Handling Sequential Data: Seq2Seq models are well-suited for tasks that involve sequential
data such as natural language, speech, and time series data.
• Handling Context: The encoder-decoder architecture of Seq2Seq models allows the model to
capture the context of the input sequence and use it to generate the output sequence.
• Attention Mechanism: Using attention mechanisms allows the model to focus on specific
parts of the input sequence when generating the output, which can improve performance for long
input sequences.
Disadvantages of seq2seq Models
• Computationally Expensive: Seq2Seq models require significant computational resources to
train and can be difficult to optimize.
• Limited Interpretability: The internal workings of Seq2Seq models can be difficult to
interpret, which can make it challenging to understand why the model is making certain
decisions.
• Overfitting: Seq2Seq models can overfit the training data if they are not properly regularized,
which can lead to poor performance on new data.
• Handling Rare Words: Seq2Seq models can have difficulty handling rare words that are not
present in the training data.
• Handling Long input Sequences: Seq2Seq models can have difficulty handling input
sequences that are very long, as the context vector may not be able to capture all the information
in the input sequence.
Applications of Seq2Seq model
Throughout the article, we have seen that machine translation is a flagship real-world application of the seq2seq model. Let’s explore more applications:
• Text Summarization: The seq2seq model effectively understands the input text which makes
it suitable for news and document summarization.
• Speech Recognition: Seq2Seq models, especially those with attention mechanisms, excel at processing audio waveforms for automatic speech recognition (ASR). They are able to capture spoken language patterns effectively.
• Image Captioning: The seq2seq model integrates image features from CNNs with text generation capabilities for image captioning, and can describe images in a human-readable format.
4.15 Gated recurrent unit GRU
Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) that was introduced by
Cho et al. in 2014 as a simpler alternative to Long Short-Term Memory (LSTM) networks. Like LSTM,
GRU can process sequential data such as text, speech, and time-series data.
The basic idea behind GRU is to use gating mechanisms to selectively update the hidden state of the
network at each time step. The gating mechanisms are used to control the flow of information in and out of
the network. The GRU has two gating mechanisms, called the reset gate and the update gate.
The reset gate determines how much of the previous hidden state should be forgotten, while the
update gate determines how much of the new input should be used to update the hidden state. The output of
the GRU is calculated based on the updated hidden state.
The equations used to calculate the reset gate, update gate, and hidden state of a GRU are as follows:
Reset gate: r_t = sigmoid(W_r * [h_{t-1}, x_t])
Update gate: z_t = sigmoid(W_z * [h_{t-1}, x_t])
Candidate hidden state: h_t’ = tanh(W_h * [r_t * h_{t-1}, x_t])
Hidden state: h_t = (1 – z_t) * h_{t-1} + z_t * h_t’
where W_r, W_z, and W_h are learnable weight matrices, x_t is the input at time step t, h_{t-1} is the
previous hidden state, and h_t is the current hidden state.
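The four equations above translate directly into code. The following minimal NumPy sketch omits bias terms, as the equations do; the variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step. Each W_* has shape (hidden, hidden + input)."""
    z_in = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ z_in)              # reset gate
    z_t = sigmoid(W_z @ z_in)              # update gate
    # candidate hidden state: reset gate scales the previous hidden state
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # interpolate between the previous state and the candidate
    return (1 - z_t) * h_prev + z_t * h_cand
```

Note how the update gate z_t interpolates between keeping the old hidden state and adopting the candidate, which is the GRU's replacement for the LSTM's separate forget and input gates.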
In summary, GRU networks are a type of RNN that use gating mechanisms to selectively update the
hidden state at each time step, allowing them to effectively model sequential data. They have been shown
to be effective in various natural language processing tasks, such as language modeling, machine translation,
and speech recognition
Prerequisites: Recurrent Neural Networks, Long Short Term Memory Networks
To solve the Vanishing-Exploding gradients problem often encountered during the operation of a basic
Recurrent Neural Network, many variations were developed. One of the most famous variations is
the Long Short Term Memory Network(LSTM). One of the lesser-known but equally effective
variations is the Gated Recurrent Unit Network(GRU).
Unlike the LSTM, the GRU has only two gates and does not maintain an internal cell state. The information that an LSTM stores in its internal cell state is incorporated into the hidden state of the Gated Recurrent Unit, and this collective information is passed on to the next Gated Recurrent Unit. The gates of a GRU are described below:
1. Update Gate (z): Determines how much of the past knowledge should be passed along into the future. It plays a role analogous to the combination of the Input Gate and the Forget Gate in an LSTM recurrent unit.
2. Reset Gate (r): Determines how much of the past knowledge to forget when computing the candidate hidden state.
1. For each gate, compute a parameterized vector from the current input and the previous hidden state by multiplying the concatenated vector by the respective weight matrix for that gate.
2. Apply the respective activation function element-wise to the parameterized vectors. The activation to be applied for each gate is:
Update Gate : Sigmoid function
Reset Gate : Sigmoid function
• The calculation of the candidate state (sometimes called the current memory gate) is a little different. First, the Hadamard product of the reset gate and the previous hidden state vector is calculated. This vector is then parameterized and added to the parameterized current input vector, and tanh is applied.
• To calculate the current hidden state, first define a vector of ones with the same dimensions as the hidden state, denoted mathematically by 1. Calculate the Hadamard product of the update gate and the candidate state. Then generate a new vector by subtracting the update gate from ones and calculate its Hadamard product with the previous hidden state. Finally, add the two vectors to get the current hidden state, matching h_t = (1 – z_t) * h_{t-1} + z_t * h_t’.
Note that in diagrams of the GRU cell, blue circles denote element-wise multiplication; a positive sign in a circle denotes vector addition, while a negative sign denotes vector subtraction (vector addition with a negated value). The weight matrix W contains different weights for the current input vector and the previous hidden state for each gate.
Just like Recurrent Neural Networks, a GRU network also generates an output at each time step and
this output is used to train the network using gradient descent.
Note that just like the workflow, the training process for a GRU network is also diagrammatically
similar to that of a basic Recurrent Neural Network and differs only in the internal working of each recurrent
unit.
The Back-Propagation Through Time Algorithm for a Gated Recurrent Unit Network is similar to
that of a Long Short Term Memory Network and differs only in the differential chain formation.
UNIT V DEEP REINFORCEMENT & UNSUPERVISED LEARNING
About Deep Reinforcement Learning, Q-Learning, Deep Q-Network (DQN), Policy Gradient Methods,
Actor-Critic Algorithm, About Autoencoding, Convolutional Auto Encoding, Variational Auto Encoding,
Generative Adversarial Networks, Autoencoders for Feature Extraction, Auto Encoders for Classification,
Denoising Autoencoders, Sparse Autoencoders.
Deep Reinforcement Learning (DRL) is the crucial fusion of two powerful artificial intelligence
fields: deep neural networks and reinforcement learning. By combining the benefits of data-driven neural
networks and intelligent decision-making, it has sparked an evolutionary change that crosses traditional
boundaries. In this article, we take a detailed look at the interesting evolution, enormous challenges, and
dynamic trendy situation of DRL. We reveal DRL’s revolutionary power by going into its core and
following its progression from conquering Atari games to addressing difficult real-world situations. We
discover the collaborative efforts of researchers, practitioners, and policymakers that advance DRL towards
responsible and substantial applications as we navigate its hurdles, which vary from instability during
training to the exploration-exploitation paradox.
Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) is a revolutionary Artificial Intelligence methodology that
combines reinforcement learning and deep neural networks. By iteratively interacting with an environment
and making choices that maximise cumulative rewards, it enables agents to learn sophisticated strategies.
Agents are able to directly learn rules from sensory inputs thanks to DRL, which makes use of deep
learning’s ability to extract complex features from unstructured data. DRL relies heavily on Q-learning,
policy gradient methods, and actor-critic systems. The notions of value networks, policy networks, and
exploration-exploitation trade-offs are crucial. The uses for DRL are numerous and include robotics,
gaming, banking, and healthcare. Its development from Atari games to real-world difficulties emphasises
how versatile and potent it is. Sample effectiveness, exploratory tactics, and safety considerations are
difficulties. The collaboration aims to drive DRL responsibly, promising an inventive future that will change
how decisions are made and problems are solved.
Core Components of Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) building blocks include all the aspects that power learning and
empower agents to make wise judgements in their surroundings. Effective learning frameworks are
produced by the cooperative interactions of these elements. The following are the essential elements:
• Agent: The decision-maker or learner who engages with the environment. The agent acts in
accordance with its policy and gains experience over time to improve its ability to make
decisions.
• Environment: The system outside of the agent that it interacts with. Based on the actions the agent takes, it gives the agent feedback in the form of rewards or penalties.
• State: A depiction of the current circumstance or environmental state at a certain moment. The
agent chooses its activities and makes decisions based on the state.
• Action: A choice the agent makes that causes a change in the state of the system. The policy of the agent guides the selection of actions.
• Reward: A scalar feedback signal from the environment that shows whether an agent’s behaviour in a specific state is desirable. The agent is guided by rewards to learn positive behaviour.
• Policy: A plan that directs the agent’s decision-making by mapping states to actions. Finding
an ideal policy that maximises cumulative rewards is the objective.
• Value Function: This function calculates the anticipated cumulative reward an agent can
obtain from a specific state while adhering to a specific policy. It is beneficial in assessing and
contrasting states and policies.
• Model: A depiction of the dynamics of the environment that enables the agent to simulate
potential results of actions and states. Models are useful for planning and forecasting.
• Exploration-Exploitation Strategy: A method of making decisions that strikes a balance between trying new actions to learn more (exploration) and choosing well-known actions to reap immediate benefits (exploitation).
• Learning Algorithm: The process by which the agent modifies its value function or policy in
response to experiences gained from interacting with the environment. Learning in DRL is fueled
by a variety of algorithms, including Q-learning, policy gradient, and actor-critic.
• Deep Neural Networks: DRL can handle high-dimensional state and action spaces by acting
as function approximators in deep neural networks. They pick up intricate input-to-output
mappings.
• Experience Replay: A method that randomly selects from stored prior experiences (state,
action, reward, and next state) during training. As a result, learning stability is improved and the
association between subsequent events is decreased.
These core components collectively form the foundation of Deep Reinforcement Learning,
empowering agents to learn strategies, make intelligent decisions, and adapt to dynamic environments.
How Deep Reinforcement Learning works?
In Deep Reinforcement Learning (DRL), an agent interacts with an environment to learn how to
make optimal decisions. Steps:
1. Initialization: Construct an agent and set up the issue.
2. Interaction: The agent interacts with its surroundings through acting, which results in states and
rewards.
3. Learning: The agent keeps track of its experiences and updates its method for making decisions.
4. Policy Update: Based on data, algorithms modify the agent’s approach.
5. Exploration-Exploitation: The agent strikes a balance between using well-known actions and
trying out new ones.
6. Reward Maximization: The agent learns to select activities that will yield the greatest possible
total rewards.
7. Convergence: The agent’s policy becomes better and stays the same over time.
8. Extrapolation: Skilled agents can use what they’ve learned in fresh circumstances.
9. Evaluation: Unknown surroundings are used to assess the agent’s performance.
10. Use of the trained agent in practical situations.
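The interaction loop in steps 2-6 can be sketched with a hand-made toy environment. ToyEnv, its reward scheme, and the gym-style reset/step interface below are assumptions for illustration, not part of any specific framework.

```python
class ToyEnv:
    """A minimal stand-in environment: reach state 3 for a reward of +1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                 # action: +1 (forward) or -1 (back)
        self.state = max(0, self.state + action)
        done = self.state >= 3
        reward = 1.0 if done else -0.1      # small penalty per step taken
        return self.state, reward, done

def run_episode(env, policy, max_steps=50):
    """Interaction loop: act, observe state and reward, accumulate return."""
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total
```

A learning algorithm would use the (state, action, reward, next state) transitions generated inside this loop to update its policy or value function, which is what steps 3-4 above describe.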
Applications of Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) is used in a wide range of fields, demonstrating its adaptability
and efficiency in solving difficult problems. Several well-known applications consist of:
1. Entertainment and gaming: DRL has mastered games like Go, Chess, and Dota 2 with ease. Also,
it’s used to develop intelligent, realistic game AI, which improves user experiences.
2. Robotics and autonomous systems: DRL allows robots to pick up skills like navigation, object
identification, and manipulation. It is essential to the development of autonomous vehicles, drones,
and industrial automation.
3. Finance and Trading: DRL enhances decision-making and profitability by optimising trading tactics,
portfolio management, and risk assessment in financial markets.
4. Healthcare and Medicine: DRL helps develop individualised treatment plans, discover new
medications, analyse medical images, identify diseases, and even perform robotically assisted
procedures.
5. Energy Management: DRL makes sustainable energy solutions possible by optimising energy use,
grid management, and the distribution of renewable resources.
6. Recommendation Systems: By learning user preferences and adjusting to shifting trends, DRL improves suggestions in e-commerce, content streaming, and advertising.
7. Industrial Process Optimization: DRL streamlines supply chain management, quality control, and manufacturing procedures to cut costs and boost productivity.
8. Agricultural and Environmental Monitoring: Through enhancing crop production forecasting, pest control, and irrigation, DRL supports precision agriculture. Additionally, it strengthens conservation and environmental monitoring initiatives.
9. Education and Training: DRL is utilised to create adaptive learning platforms, virtual trainers, and intelligent tutoring systems that tailor learning experiences.
These uses highlight the adaptability and influence of DRL across several industries. It is a
transformative instrument for addressing practical issues and influencing the direction of technology because
of its capacity for handling complexity, adapting to various situations, and learning from unprocessed data.
DRL’s journey began with the marriage of two powerful fields: deep learning and reinforcement
learning. Deep Q-Networks (DQN) by DeepMind were unveiled as a watershed moment. DQN reached and surpassed human-level performance on many Atari games, demonstrating the benefits of integrating Q-learning with deep neural networks. This breakthrough heralded a new era in which DRL could perform difficult tasks by directly learning from unprocessed sensory inputs.
In order to accelerate learning, researchers are investigating methods to incorporate prior knowledge
into DRL algorithms. By dividing challenging tasks into smaller subtasks, reinforcement in hierarchical
learning increases learning effectiveness. DRL uses pre-trained models to encourage fast learning in
unfamiliar scenarios, bridging the gap between simulations and real-world situations.
The use of model-based and model-free hybrid approaches is growing. By developing a model of the
environment to guide decision-making, model-based solutions aim to increase sampling efficiency. Two
exploration tactics that try to more successfully strike a balance between exploration and exploitation are
curiosity-driven exploration and intrinsic motivation.
5.2 Q-Learning
Q-learning is a popular model-free reinforcement learning algorithm used in machine learning and
artificial intelligence applications. It falls under the category of temporal difference learning techniques, in
which an agent picks up new information by observing results, interacting with the environment, and getting
feedback in the form of rewards.
1. Q-Values or Action-Values: Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take the action A at the state S. This estimate of Q(S, A) is iteratively computed using the TD-update rule, which we will see in the upcoming sections.
2. Rewards and Episodes: An agent throughout its lifetime starts from a start state, and makes several
transitions from its current state to a next state based on its choice of action and also the environment
the agent is interacting in. At every step of transition, the agent from a state takes an action, observes
a reward from the environment, and then transits to another state. If at any point in time, the agent
ends up in one of the terminating states that means there are no further transitions possible. This is
said to be the completion of an episode.
3. Temporal Difference (TD) Update: After each transition, the Q-value is updated using the TD-update rule:
Q(S, A) ← Q(S, A) + α (R + γ · Q(S′, A′) − Q(S, A))
where α is the step size (learning rate), and:
• A′ – the next best action, picked using the current Q-value estimate, i.e. the action with the maximum Q-value in the next state.
• γ (0 < γ ≤ 1) – the discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since the Q-value is an estimate of the expected rewards from a state, the discounting rule applies here as well.
4. Selecting the Course of Action with the ϵ-greedy policy: A simple method for selecting an action based on the current Q-value estimates is the ϵ-greedy policy. It operates as follows:
• With probability 1 − ϵ, the agent exploits: given its current understanding, it chooses the action it believes is optimal, i.e. the one with the highest Q-value.
• With probability ϵ, the agent explores: rather than selecting the action with the highest Q-value, it picks an action at random, in order to learn about the possible benefits of new actions.
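The ϵ-greedy rule is only a few lines of code. This sketch assumes the Q-values for one state are stored in a plain list; the function name is illustrative.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit
```

Typically ϵ starts high (close to 1) and is decayed over training, shifting the agent gradually from exploration toward exploitation.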
Q-learning models engage in an iterative process where various components collaborate to train the
model. This iterative procedure encompasses the agent exploring the environment and continuously updating
the model based on this exploration. The key components of Q-learning include:
1. Agents: Entities that operate within an environment, making decisions and taking actions.
2. Rewards: Positive or negative responses provided to the agent based on its actions.
3. Episodes: Instances where an agent concludes its actions, marking the end of an episode.
Temporal Difference: Calculated by comparing the current state and action values with the previous ones.
Bellman’s Equation: A recursive formula introduced by Richard Bellman in 1957 for calculating the value of a given state in a Markov Decision Process (MDP). It is particularly influential in the context of Q-learning and optimal decision-making:
Q(s, a) = R(s, a) + γ · max_a′ Q(s′, a′)
Where,
• max_a′ Q(s′, a′) is the maximum Q-value over all possible actions a′ in the next state s′.
What is Q-table?
The Q-table functions as a repository of rewards associated with optimal actions for each state in a
given environment. It serves as a guide for the agent, indicating which actions are likely to yield positive
outcomes in various scenarios.
Each row in the Q-table corresponds to a distinct situation the agent might face, while the columns
represent the available actions. Through interactions with the environment and the receipt of rewards or
penalties, the Q-table is dynamically updated to capture the model’s evolving understanding.
Reinforcement learning aims to enhance performance by refining the Q-table, enabling the agent to
make informed decisions. As the Q-table undergoes continuous updates with more feedback, it becomes a
more accurate resource, empowering the agent to make optimal choices and achieve superior results.
Crucially, the Q-table is closely tied to the Q-function, a mathematical expression that considers the
current state and action, generating outputs that include anticipated future rewards for that specific state-
action pair. By consulting the Q-table, the agent can retrieve expected future rewards, guiding it toward
optimized decision-making and states.
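The Q-table update described above can be sketched in a few lines. The dict-based table and the function name below are illustrative; each update nudges Q(s, a) toward the Bellman target.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    target = r + gamma * best_next              # Bellman target
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Q-table: (state, action) -> value, zero-initialised on first access
Q = defaultdict(float)
```

Repeated over many episodes, with actions chosen ϵ-greedily from the table, the entries converge toward the expected future rewards that guide the agent's decisions.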
The Q in DQN stands for ‘Q-Learning’, an off-policy temporal difference method that also considers future rewards while updating the value function for a given state-action pair. An advantage of value-based methods is that we don’t need to wait until the end of the episode to compute the final discounted reward, as is the case with the REINFORCE algorithm. Using the Bellman equation, we update the value function of all actions as we move ahead.
In MountainCar-v0, an underpowered car must climb a steep hill by building enough momentum.
The car’s engine is not strong enough to drive directly up the hill (acceleration is limited), so it must
learn to rock back and forth, building momentum with each swing until it can eventually reach the top of the
mountain.
Talking about the rewards, the agent (the car) gets a reward of -1 for every action taken until it reaches the flag, where it gets a 0. The episode also ends if the agent is unable to reach the flag within 200 steps.
Action space:
The agent can take 3 actions: accelerate to the left, accelerate to the right, or do nothing.
State space:
The state consists of two continuous variables:
• Position: the position of the car along the x-axis of the environment.
• Velocity: the velocity of the car along the x-axis of the environment.
Why have we imported deque? Double Ended Queue is a useful data structure that allows insertion and
deletion from both ends. This will help us to implement Experience Replay.
input_shape = env.observation_space.shape[0]
num_actions = env.action_space.n
As mentioned earlier, the state is an array with 2 elements and num_actions = 3.
The value_network is a shallow neural network that takes the state (a 1-d array with 2 elements) and outputs a Q-value for each possible action (a 1-d array with 3 values). The loss function used is mean squared error.
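The deque imported earlier supports a minimal experience replay buffer. The class and method names below are a sketch of the idea, not taken from a specific framework: transitions are stored as they occur and sampled at random to break the correlation between consecutive steps.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: the oldest transitions drop off the end."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # random -> decorrelated
        # transpose rows into columns: states, actions, rewards, next_states, dones
        return list(zip(*batch))
```

During training, each sampled batch is used to compute Bellman targets with the value network and fit it by mean squared error, exactly as described above.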
5.4 Policy Gradient Methods
Policy Gradient (PG) methods are a class of algorithms in reinforcement learning (RL) that aim to
directly optimize the policy, which is a mapping from states to actions. Unlike value-based methods that
estimate value functions to derive policies, policy gradient methods adjust the policy parameters directly to
maximize some measure of long-term reward. These methods are particularly powerful in environments with
continuous action spaces and where the policy can be represented by a parameterized function, such as a
neural network.
2. Objective Function: The objective in PG methods is to maximize the expected return J(θ), defined as:
J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} r_t ]
where τ is a trajectory (a sequence of states, actions, and rewards) generated by following the policy π_θ.
3. Gradient Estimation: To optimize J(θ), we need the gradient ∇_θ J(θ). Using the policy
gradient theorem, this gradient can be estimated as:
∇_θ J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) · R(τ) ]
where R(τ) is the return of trajectory τ.
1. REINFORCE: The simplest policy gradient algorithm, also known as Monte Carlo Policy Gradient.
It updates the policy parameters θ using the gradient:
θ ← θ + α Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) · R_t
where α is the learning rate and R_t is the return from time step t.
2. Actor-Critic Methods: These methods use two function approximators: an actor π_θ(a|s) that
decides the actions, and a critic V_w(s) that estimates the value function. The policy gradient
update is:
θ ← θ + α Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) · (R_t − V_w(s_t))
where R_t − V_w(s_t) is the advantage function, representing the relative value of taking action a_t
in state s_t.
3. Trust Region Policy Optimization (TRPO): TRPO ensures that the new policy after the update is
not too far from the old policy, providing more stable learning. It solves:
max_θ E_{s,a∼π_{θ_old}}[ (π_θ(a|s) / π_{θ_old}(a|s)) · A^{π_{θ_old}}(s, a) ]
subject to a constraint on the KL divergence between the new and old policies.
4. Proximal Policy Optimization (PPO): PPO is a simplified version of TRPO, using a clipped
surrogate objective to keep the policy update within a safe range:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1−ε, 1+ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio and A_t is the advantage estimate at time step t.
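Two of the updates above can be made concrete in numpy: a single REINFORCE ascent step for a toy softmax policy, and the PPO clipped objective. The logits, returns, and ε = 0.2 below are illustrative values, not taken from the text.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# --- REINFORCE: one ascent step for a toy 3-action, state-independent policy ---
theta = np.zeros(3)          # policy logits
alpha = 0.1                  # learning rate
a, R = 1, 2.0                # sampled action and its observed return
pi = softmax(theta)
grad_log_pi = np.eye(3)[a] - pi            # gradient of log pi(a) w.r.t. the logits
theta = theta + alpha * grad_log_pi * R    # theta <- theta + alpha * grad * R
print(softmax(theta)[1])     # probability of the rewarded action rises above 1/3

# --- PPO: clipped surrogate objective ---
def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the min caps the incentive to push the policy far from the old one
    return np.minimum(unclipped, clipped)

obj_pos = ppo_clip_objective(np.array([1.5]), np.array([1.0]))
obj_neg = ppo_clip_objective(np.array([0.5]), np.array([-1.0]))
print(obj_pos, obj_neg)      # -> [1.2] [-0.8]
```

Note how the clip caps the positive-advantage objective at 1.2 but keeps the pessimistic value -0.8 for the negative advantage: the objective is a lower bound on the unclipped one.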
Advantages
1. Continuous Action Spaces: PG methods are well-suited to problems with continuous action spaces,
where value-based methods struggle.
2. Stochastic Policies: They can naturally handle stochastic policies, which are useful in exploration
and when dealing with uncertain environments.
3. Direct Optimization: By directly optimizing the policy, PG methods can find deterministic policies
that are often more effective than those derived indirectly from value functions.
Challenges
1. High Variance: PG estimates can have high variance, leading to unstable updates. Solutions include
using baselines (e.g., value functions) to reduce variance and employing techniques like reward
normalization.
2. Sample Inefficiency: PG methods often require a large number of samples. Techniques like
experience replay and using more efficient exploration strategies can help.
3. Local Optima: Due to the nature of gradient ascent, PG methods can get stuck in local optima.
Advanced algorithms like TRPO and PPO mitigate this by ensuring smoother policy updates.
Applications
1. Robotics: PG methods are used for training robots to perform complex tasks involving continuous
control.
2. Games: They are employed in training agents for games requiring strategic planning and real-time
decision making.
3. Autonomous Vehicles: PG methods assist in the development of policies for autonomous driving,
where actions are continuous and environments are dynamic.
5.5 Actor-Critic Algorithm
The actor-critic algorithm is a type of reinforcement learning algorithm that combines aspects of
both policy-based methods (the actor) and value-based methods (the critic). This hybrid approach is designed
to address the limitations of each method when used individually.
In the actor-critic framework, an agent (the “actor”) learns a policy to make decisions, and a value
function (the “Critic”) evaluates the actions taken by the Actor.
Simultaneously, the critic evaluates these actions by estimating their value or quality. This dual role
allows the method to strike a balance between exploration and exploitation, leveraging the strengths of both
policy and value functions.
Before delving into the actor-critic method, it’s crucial to understand the fundamental components
of reinforcement learning (RL):
• Agent: The entity making decisions and interacting with the environment.
• Actor: The actor makes decisions by selecting actions based on the current policy. Its responsibility
lies in exploring the action space to maximize expected cumulative rewards. By continuously refining
the policy, the actor adapts to the dynamic nature of the environment.
• Critic: The critic evaluates the actions taken by the actor. It estimates the value or quality of these
actions by providing feedback on their performance. The critic’s role is pivotal in guiding the actor
towards actions that lead to higher expected returns, contributing to the overall improvement of the
learning process.
• Policy (Actor):
• The policy, denoted as π(a|s), represents the probability of taking action a in state s.
• The actor seeks to maximize the expected return by optimizing this policy.
• The policy is modeled by the actor network, and its parameters are denoted by θ.
• Value Function (Critic):
• The value function, denoted as V(s), estimates the expected cumulative reward starting
from state s.
• The value function is modeled by the critic network, and its parameters are denoted by w.
• The objective function for the actor-critic algorithm is a combination of the policy gradient (for the
actor) and the value function loss (for the critic).
• The overall objective function is typically expressed as the sum of two components:
∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇_θ log π_θ(a_i | s_i) · A(s_i, a_i)
Here,
• A(s, a) is the advantage function, representing the advantage of taking action a in state s.
∇_w J(w) ≈ (1/N) Σ_{i=1}^{N} ∇_w ( V_w(s_i) − Q_w(s_i, a_i) )²
Here,
• ∇_w J(w) is the gradient of the loss function with respect to the critic's parameters w.
• N is the number of samples.
Update Rules
The update rules for the actor and critic involve adjusting their respective parameters using gradient
ascent (for the actor) and gradient descent (for the critic).
Actor Update
θ_{t+1} = θ_t + α ∇_θ J(θ_t)
Here, α is the actor's learning rate.
Critic Update
w_{t+1} = w_t − β ∇_w J(w_t)
Here, β is the critic's learning rate.
Advantage Function
The advantage function, A(s, a), measures the advantage of taking action a in state s over the
expected value of the state under the current policy:
A(s, a) = Q(s, a) − V(s)
The advantage function, then, provides a measure of how much better or worse an action is compared
to the average action.
These mathematical expressions highlight the essential computations involved in the Actor-Critic
method. The actor is updated based on the policy gradient, encouraging actions with higher advantages,
while the critic is updated to minimize the difference between the estimated value and the action-value.
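The computations above can be sketched as a toy one-step actor-critic update, using the TD error as a one-sample advantage estimate. The tabular critic, softmax actor, and all sizes and learning rates here are illustrative.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Toy problem: 2 states, 3 actions.
theta = np.zeros((2, 3))   # actor: per-state action logits
V = np.zeros(2)            # critic: tabular state values
alpha, beta, gamma = 0.1, 0.5, 0.9

def actor_critic_step(s, a, r, s_next):
    # TD error doubles as a one-sample advantage estimate A(s,a)
    td_error = r + gamma * V[s_next] - V[s]
    # Critic: gradient-descent step on the squared value error
    V[s] += beta * td_error
    # Actor: policy-gradient ascent step weighted by the advantage
    pi = softmax(theta[s])
    theta[s] += alpha * (np.eye(3)[a] - pi) * td_error

actor_critic_step(s=0, a=1, r=1.0, s_next=1)
print(V[0])                    # critic moved toward the reward: -> 0.5
print(softmax(theta[0])[1])    # the chosen action became more probable
```

One step raises both the value estimate of the visited state and the probability of the action that earned the positive advantage, which is the essence of the two simultaneous updates.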
A2C (Advantage Actor-Critic)
A2C (Advantage Actor-Critic) is a specific variant of the Actor-Critic algorithm that introduces the
concept of the advantage function. This function measures how much better an action is compared to the
average action in a given state. By incorporating this advantage information, A2C focuses the learning
process on actions that have a significantly higher value than the typical action taken in that state.
While both leverage the actor-critic architecture, here’s a key distinction between them:
• Learning from the Average: The base Actor-Critic method uses the difference between the actual
reward and the estimated value (critic’s evaluation) to update the actor.
• Learning from the Advantage: A2C leverages the advantage function, incorporating the difference
between the action’s value and the average value of actions in that state. This additional information
refines the learning process further.
The Actor-Critic algorithm combines these mathematical principles into a coherent learning
framework. The algorithm involves:
1. Initialization:
• Initialize the policy parameters θ (actor) and the value function parameters ϕ (critic).
2. Interaction:
• The agent interacts with the environment by taking actions according to the current policy
and receiving observations and rewards in return.
3. Advantage Computation:
• Compute the advantage function A(s, a) based on the current policy and value estimates.
4. Actor Update:
• Update the actor's parameters (θ) using the policy gradient. The policy
gradient is derived from the advantage function and guides the actor to increase the
probabilities of actions that lead to higher advantages.
5. Critic Update:
• Simultaneously update the critic's parameters (ϕ) using a value-based method. This often
involves minimizing the temporal difference (TD) error, which is the difference between the
observed rewards and the predicted values.
The actor learns a policy, and the critic evaluates the actions taken by the actor. The actor is updated
using the policy gradient, and the critic is updated using a value-based method. This combination allows for
more stable and efficient learning in complex environments.
5.6 About Autoencoding
Autoencoders are a specialized class of algorithms that can learn efficient representations of input
data with no need for labels. They are a class of artificial neural networks designed for unsupervised learning.
Learning to compress and effectively represent input data without specific labels is the essential principle of
an autoencoder. This is accomplished using a two-fold structure that consists of an encoder and a
decoder. The encoder transforms the input data into a reduced-dimensional representation, which is often
referred to as the “latent space” or “encoding”. From that representation, the decoder rebuilds the initial input.
This process of encoding and decoding forces the network to learn the essential features of the data.
The general architecture of an autoencoder includes an encoder, decoder, and bottleneck layer.
1. Encoder
• The hidden layers progressively reduce the dimensionality of the input, capturing important
features and patterns. These layers compose the encoder.
• The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is
significantly reduced. This layer represents the compressed encoding of the input data.
2. Decoder
• The decoder takes the encoded representation from the bottleneck layer and expands it back to the
dimensionality of the original input.
• The hidden layers progressively increase the dimensionality and aim to reconstruct the
original input.
• The output layer produces the reconstructed output, which ideally should be as close as
possible to the input data.
The loss function used during training is typically a reconstruction loss, measuring the difference
between the input and the reconstructed output. Common choices include mean squared error (MSE) for
continuous data or binary cross-entropy for binary data.
During training, the autoencoder learns to minimize the reconstruction loss, forcing the network to
capture the most important features of the input data in the bottleneck layer.
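As a minimal illustration of the reconstruction loss for continuous data (the vectors here are made up):

```python
import numpy as np

# Reconstruction loss: how far the decoder's output is from the original input.
def mse_reconstruction_loss(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

x = np.array([0.0, 0.5, 1.0])        # original input
x_hat = np.array([0.1, 0.5, 0.8])    # imperfect reconstruction
loss = mse_reconstruction_loss(x, x_hat)
print(loss)   # (0.01 + 0.00 + 0.04) / 3 ≈ 0.0167
```

Minimizing this quantity over the training set is what drives the bottleneck to retain the informative features.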
After the training process, only the encoder part of the autoencoder is retained to encode a similar
type of data used in the training process. The different ways to constrain the network are:
• Keep small Hidden Layers: If the size of each hidden layer is kept as small as possible, then the
network will be forced to pick up only the representative features of the data thus encoding the data.
• Regularization: In this method, a loss term is added to the cost function which encourages the
network to train in ways other than copying the input.
• Denoising: Another way of constraining the network is to add noise to the input and teach the
network how to remove the noise from the data.
• Tuning the Activation Functions: This method involves changing the activation functions of
various nodes so that a majority of the nodes are dormant thus, effectively reducing the size of the
hidden layers.
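The denoising constraint (corrupt the input, keep the clean input as the target) can be sketched as follows; the noise level and array sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_denoising_pair(x, noise_std=0.1):
    # Corrupt the input with Gaussian noise; the clean x stays as the training target.
    x_noisy = x + rng.normal(scale=noise_std, size=x.shape)
    return x_noisy, x

x = np.linspace(0.0, 1.0, 8)
x_noisy, target = make_denoising_pair(x)
# A denoising autoencoder would then be fit as model.fit(x_noisy, target),
# i.e. it must learn to undo the corruption rather than copy the input.
print(x_noisy.shape, np.array_equal(target, x))  # -> (8,) True
```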
Convolutional autoencoders are a type of autoencoder that use convolutional neural networks
(CNNs) as their building blocks. The encoder consists of multiple layers that take an image or a grid as input
and pass it through different convolution layers, forming a compressed representation of the input. The
decoder is the mirror image of the encoder: it deconvolves the compressed representation and tries to
reconstruct the original image.
Advantages
1. A convolutional autoencoder can reconstruct missing parts of an image. It can also handle images with
slight variations in object position or orientation.
Disadvantages
1. These autoencoders are prone to overfitting. Proper regularization techniques should be used to tackle
this issue.
2. Compression of data can cause data loss which can result in reconstruction of a lower quality image.
Variational autoencoders make strong assumptions about the distribution of latent variables and use
the Stochastic Gradient Variational Bayes estimator in the training process.
Advantages
1. Variational Autoencoders are used to generate new data points that resemble the original training
data. These samples are learned from the latent space.
Disadvantages
1. Variational Autoencoders use approximations to estimate the true distribution of the latent variables.
This approximation introduces some level of error, which can affect the quality of generated samples.
2. The generated samples may only cover a limited subset of the true data distribution. This can result
in a lack of diversity in generated samples.
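The sampling step at the heart of VAE training is the reparameterization trick, z = μ + σ·ε, which keeps sampling differentiable with respect to the encoder outputs. A numpy sketch of it together with the KL regularizer from the VAE loss (the latent dimension of 4 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I): gradients flow through mu and log_var
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), the regularizer in the VAE loss
    return float(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var)))

mu = np.zeros(4)
log_var = np.zeros(4)        # sigma = 1
z = reparameterize(mu, log_var)
print(z.shape)               # -> (4,)
# KL is exactly zero when the approximate posterior equals the prior N(0, I)
print(kl_divergence(mu, log_var) == 0.0)   # -> True
```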
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used for
unsupervised learning. GANs are made up of two neural networks, a discriminator and a
generator. They use adversarial training to produce artificial data that closely resembles actual data.
• The generator attempts to fool the discriminator, which is tasked with accurately distinguishing
between produced and genuine data, by turning random noise samples into synthetic data.
• Realistic, high-quality samples are produced as a result of this competitive interaction, which drives
both networks toward advancement.
• GANs are proving to be highly versatile artificial intelligence tools, as evidenced by their extensive
use in image synthesis, style transfer, and text-to-image synthesis.
Through adversarial training, these models engage in a competitive interplay until the generator
becomes adept at creating realistic samples, fooling the discriminator approximately half the time.
Generative Adversarial Networks (GANs) can be broken down into three parts:
• Generative: To learn a generative model, which describes how data is generated in terms of a
probabilistic model.
• Adversarial: The word adversarial refers to setting one thing up against another. This means that, in
the context of GANs, the generative result is compared with the actual images in the data set. A
mechanism known as a discriminator is used to apply a model that attempts to distinguish between
real and fake images.
• Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.
Types of GANs
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are
simple multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to
optimize the mathematical equation using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some
conditional parameters are put into place.
• In CGAN, an additional parameter ‘y’ is added to the Generator for generating the
corresponding data.
• Labels are also put into the input to the Discriminator in order for the Discriminator to help
distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most
successful implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons.
• The ConvNets are implemented without max pooling, which is in fact replaced by
convolutional stride.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency
residual.
• This approach uses multiple numbers of Generator and Discriminator networks and
different levels of the Laplacian Pyramid.
• This approach is mainly used because it produces very high-quality images. The image is
down-sampled at first at each layer of the pyramid and then it is again up-scaled at each layer
in a backward pass where the image acquires some noise from the Conditional GAN at these
layers until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a GAN in
which a deep neural network is used along with an adversarial network in order to produce higher-
resolution images. This type of GAN is particularly useful in optimally up-scaling native low-
resolution images to enhance their details minimizing errors while doing so.
Architecture of GANs
A Generative Adversarial Network (GAN) is composed of two primary parts, which are the
Generator and the Discriminator.
Generator Model
A key element responsible for creating fresh, accurate data in a Generative Adversarial Network
(GAN) is the generator model. The generator takes random noise as input and converts it into complex data
samples, such as text or images. It is commonly depicted as a deep neural network.
Through training, layers of learnable parameters in its design capture the underlying distribution of the
training data. As it is trained, the generator uses backpropagation to fine-tune its parameters, adjusting its
output to produce samples that closely mimic real data.
The generator’s ability to generate high-quality, varied samples that can fool the discriminator is
what makes it successful.
Generator Loss
The objective of the generator in a GAN is to produce synthetic samples that are realistic enough to
fool the discriminator. The generator achieves this by minimizing its loss function J_G. The loss is
minimized when the log probability is maximized, i.e., when the discriminator is highly likely to classify the
generated samples as real. The loss is given by:
J_G = −(1/m) Σ_{i=1}^{m} log D(G(z_i))
Where,
• log D(G(z_i)) represents the log probability of the discriminator being correct for generated
samples.
• The generator aims to minimize this loss, encouraging the production of samples that the
discriminator classifies as real (D(G(z_i)) close to 1).
Discriminator Model
An artificial neural network called a discriminator model is used in Generative Adversarial Networks
(GANs) to differentiate between generated and actual input. By evaluating input samples and allocating
probability of authenticity, the discriminator functions as a binary classifier.
Over time, the discriminator learns to differentiate between genuine data from the dataset and
artificial samples created by the generator. This allows it to progressively hone its parameters and increase
its level of proficiency.
Convolutional layers or pertinent structures for other modalities are usually used in its architecture
when dealing with picture data. Maximizing the discriminator’s capacity to accurately identify generated
samples as fraudulent and real samples as authentic is the aim of the adversarial training procedure. The
discriminator grows increasingly discriminating as a result of the generator and discriminator’s interaction,
which helps the GAN produce extremely realistic-looking synthetic data overall.
Discriminator Loss
The discriminator reduces the negative log likelihood of correctly classifying both produced and real
samples. This loss incentivizes the discriminator to accurately categorize generated samples as fake and real
samples as real, with the following equation:
J_D = −(1/m) Σ_{i=1}^{m} log D(x_i) − (1/m) Σ_{i=1}^{m} log(1 − D(G(z_i)))
• J_D assesses the discriminator's ability to discern between produced and actual samples.
• The log likelihood that the discriminator will accurately categorize real data is represented
by log D(x_i).
• The log likelihood that the discriminator will correctly categorize generated samples as fake is
represented by log(1 − D(G(z_i))).
• The discriminator aims to reduce this loss by accurately identifying artificial and real samples.
MinMax Loss
In a Generative Adversarial Network (GAN), the minimax loss formula is provided by:
min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
Where,
• Actual data samples obtained from the true data distribution p_data(x) are represented by x.
• D(x) represents the discriminator’s likelihood of correctly identifying actual data as real.
• D(G(z)) is the likelihood that the discriminator will identify generated data coming from the
generator as authentic.
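The generator and discriminator losses above can be evaluated directly. At the ideal equilibrium, where the discriminator outputs 0.5 everywhere, they reduce to log 2 and 2·log 2; the batch of probabilities below is made up for illustration.

```python
import numpy as np

def generator_loss(d_on_fake):
    # J_G = -(1/m) sum_i log D(G(z_i))
    return float(-np.mean(np.log(d_on_fake)))

def discriminator_loss(d_on_real, d_on_fake):
    # J_D = -(1/m) sum_i log D(x_i) - (1/m) sum_i log(1 - D(G(z_i)))
    return float(-np.mean(np.log(d_on_real)) - np.mean(np.log(1.0 - d_on_fake)))

# At equilibrium the discriminator outputs 0.5 for every sample:
half = np.full(4, 0.5)
d_loss = discriminator_loss(half, half)
g_loss = generator_loss(half)
print(d_loss)   # 2 * log 2 ≈ 1.3863
print(g_loss)   # log 2 ≈ 0.6931
```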
1. Initialization: Two neural networks are created: a Generator (G) and a Discriminator (D).
• G is tasked with creating new data, like images or text, that closely resembles real data.
• D acts as a critic, trying to distinguish between real data (from a training dataset) and the data
generated by G.
2. Generator’s First Move: G takes a random noise vector as input. This noise vector contains random
values and acts as the starting point for G’s creation process. Using its internal layers and learned
patterns, G transforms the noise vector into a new data sample, like a generated image.
3. Discriminator's Turn: D receives the data samples generated by G in the previous step, mixed with
real data. D's job is to analyze each input and determine whether it's real data or something G cooked
up. It outputs a probability score between 0 and 1. A score of 1 indicates the data is likely real, and 0
suggests it's fake.
4. The Learning Process:
• If D correctly identifies real data as real (score close to 1) and generated data as fake (score
close to 0), both G and D are rewarded to a small degree. This is because they're both doing
their jobs well.
5. Generator's Improvement:
• When D mistakenly labels G's creation as real (score close to 1), it's a sign that G is on the
right track. In this case, G receives a significant positive update, while D receives a penalty
for being fooled.
• This feedback helps G improve its generation process to create more realistic data.
6. Discriminator’s Adaptation:
• Conversely, if D correctly identifies G's fake data (score close to 0), G receives no reward and
D is further strengthened in its discrimination abilities.
• This ongoing duel between G and D refines both networks over time.
As training progresses, G gets better at generating realistic data, making it harder for D to tell the difference.
Ideally, G becomes so adept that D can’t reliably distinguish real from fake data. At this point, G is
considered well-trained and can be used to generate new, realistic data samples.
An autoencoder is a neural network model that seeks to learn a compressed representation of an input.
An autoencoder is a neural network that is trained to attempt to copy its input to its output.
They are an unsupervised learning method, although technically, they are trained using supervised
learning methods, referred to as self-supervised.
Autoencoders are typically trained as part of a broader model that attempts to recreate the input.
For example:
• X = model.predict(X)
The design of the autoencoder model purposefully makes this challenging by restricting the
architecture to a bottleneck at the midpoint of the model, from which the reconstruction of the input data is
performed.
There are many types of autoencoders, and their use varies, but perhaps the more common use is as
a learned or automatic feature extraction model.
In this case, once the model is fit, the reconstruction aspect of the model can be discarded and the
model up to the point of the bottleneck can be used. The output of the model at the bottleneck is a fixed-
length vector that provides a compressed representation of the input data.
Usually they are restricted in ways that allow them to copy only approximately, and to copy only
input that resembles the training data. Because the model is forced to prioritize which aspects of the input
should be copied, it often learns useful properties of the data.
Input data from the domain can then be provided to the model and the output of the model at the
bottleneck can be used as a feature vector in a supervised learning model, for visualization, or more generally
for dimensionality reduction.
In this section, we will develop an autoencoder to learn a compressed representation of the input
features for a classification predictive modeling problem.
We will use the make_classification() scikit-learn function to define a synthetic binary (2-class)
classification task with 100 input features (columns) and 1,000 examples (rows). Importantly, we will define
the problem in such a way that most of the input variables are redundant (90 of the 100 or 90 percent), allowing
the autoencoder later to learn a useful compressed representation.
The example below defines the dataset and summarizes its shape.
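The example referenced above defines the dataset and summarizes its shape; it matches the dataset definition used in the complete listing later in this section.

```python
# define a synthetic classification dataset with mostly redundant input features
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=90, random_state=1)
# summarize the shape of the arrays
print(X.shape, y.shape)   # -> (1000, 100) (1000,)
```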
Running the example defines the dataset and prints the shape of the arrays, confirming the number of
rows and columns.
Next, we can develop the autoencoder model. The model will take all of the input columns, then output the
same values. It will learn to recreate the input pattern exactly.
The autoencoder consists of two parts: the encoder and the decoder. The encoder learns how to
interpret the input and compress it to an internal representation defined by the bottleneck layer. The decoder
takes the output of the encoder (the bottleneck layer) and attempts to recreate the input.
Once the autoencoder is trained, the decoder is discarded and we only keep the encoder and use it to
compress examples of input to vectors output by the bottleneck layer.
In this first autoencoder, we won’t compress the input at all and will use a bottleneck layer the same
size as the input. This should be an easy problem that the model will learn nearly perfectly and is intended to
confirm our model is implemented correctly.
We will define the model using the functional API; if this is new to you, I recommend this tutorial:
• How to Use the Keras Functional API for Deep Learning
Prior to defining and fitting the model, we will split the data into train and test sets and scale the input
data by normalizing the values to the range 0-1, a good practice with MLPs.
...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
We will define the encoder to have two hidden layers, the first with two times the number of inputs
(e.g. 200) and the second with the same number of inputs (100), followed by the bottleneck layer with the
same number of inputs as the dataset (100).
To ensure the model learns well, we will use batch normalization and leaky ReLU activation.
...
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)
Next, we can define the decoder. It will have two hidden layers, the first with the number of inputs in the
dataset (e.g. 100) and the second with double the number of inputs (e.g. 200). The output layer will have the
same number of nodes as there are columns in the input data and will use a linear activation function to
output numeric values.
...
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
The model will be fit using the efficient Adam version of stochastic gradient descent and will minimize
the mean squared error, given that reconstruction is a type of multi-output regression problem.
...
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
We can plot the layers in the autoencoder model to get a feeling for how the data flows through the
model.
...
# plot the autoencoder
plot_model(model, 'autoencoder_no_compress.png', show_shapes=True)
...
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2,
validation_data=(X_test,X_test))
After training, we can plot the learning curves for the train and test sets to confirm the model learned
the reconstruction problem well.
...
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Finally, we can save the encoder model for use later, if desired.
...
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_no_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')
As part of saving the encoder, we will also plot the encoder model to get a feeling for the shape of the
output of the bottleneck layer, e.g. a 100 element vector.
Running the example fits the model and reports loss on the train and test sets along the way.
Note: if you have problems creating the plots of the model, you can comment out the import and the calls to
the plot_model() function.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
In this case, we see that loss gets low, but does not go to zero (as we might have expected) with no
compression in the bottleneck layer. Perhaps further tuning the model architecture or learning
hyperparameters is required.
...
42/42 - 0s - loss: 0.0032 - val_loss: 0.0016
Epoch 196/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0024
Epoch 197/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0015
Epoch 198/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0014
Epoch 199/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0020
Epoch 200/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0017
A plot of the learning curves is created showing that the model achieves a good fit in reconstructing
the input, which holds steady throughout training, not overfitting.
Next, we can change the model so that the bottleneck layer has half the number of nodes of the input
(e.g. 50); only the bottleneck definition changes:
...
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)
Tying this together, the complete example is listed below.
# train autoencoder for classification with compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90,
random_state=1)
# number of input columns
n_inputs = X.shape[1]
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# plot the autoencoder
plot_model(model, 'autoencoder_compress.png', show_shapes=True)
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2,
validation_data=(X_test,X_test))
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')
Running the example fits the model and reports loss on the train and test sets along the way.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the average
outcome.
In this case, we see that the loss gets about as low as in the example above without compression,
suggesting that the model may perform just as well with a bottleneck half the size.
...
42/42 - 0s - loss: 0.0029 - val_loss: 0.0010
Epoch 196/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0013
Epoch 197/200
42/42 - 0s - loss: 0.0030 - val_loss: 9.4472e-04
Epoch 198/200
42/42 - 0s - loss: 0.0028 - val_loss: 0.0015
Epoch 199/200
42/42 - 0s - loss: 0.0033 - val_loss: 0.0021
Epoch 200/200
42/42 - 0s - loss: 0.0027 - val_loss: 8.7731e-04
A plot of the learning curves is created, again showing that the model achieves a good fit in
reconstructing the input, which holds steady throughout training, not overfitting.
The trained encoder is saved to the file “encoder.h5” that we can load and use later.
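As a sketch of how the saved encoder can be reloaded and used as a feature extractor, consider the snippet below. It is an illustrative, self-contained example: a tiny stand-in encoder is built and saved so the snippet runs on its own, whereas in practice you would load the "encoder.h5" file produced by the listing above.

```python
# sketch: save an encoder, load it back, and use it to compress new samples
# (a tiny stand-in encoder is built here so the example is self-contained;
# in practice you would load the 'encoder.h5' saved by the listing above)
import numpy as np
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense

n_inputs, n_bottleneck = 8, 4
visible = Input(shape=(n_inputs,))
bottleneck = Dense(n_bottleneck)(visible)
encoder = Model(inputs=visible, outputs=bottleneck)
encoder.save('encoder.h5')

# later: load the encoder and compress new samples to the bottleneck size
loaded = load_model('encoder.h5')
X_new = np.random.rand(5, n_inputs)
X_encoded = loaded.predict(X_new, verbose=0)
print(X_encoded.shape)  # (5, 4)
```

The compressed bottleneck features can then be fed to any downstream model, such as a classifier.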
5.12 Denoising Autoencoders
A denoising autoencoder works on a partially corrupted input and is trained to recover the original,
undistorted input. As mentioned above, this is an effective way to prevent the network from
simply copying its input, forcing it to learn the underlying structure and important features of the data.
Advantages
1. This type of autoencoder can extract important features while reducing noise and useless features.
2. Denoising autoencoders can be used as a form of data augmentation: the restored samples can serve
as augmented data, generating additional training examples.
Disadvantages
1. Selecting the right type and level of noise to introduce can be challenging and may require domain
knowledge.
2. The denoising process can result in the loss of some information that is needed from the original
input, which can reduce the accuracy of the output.
5.13 Sparse Autoencoders
This type of autoencoder typically contains more hidden units than inputs, but only a few are
allowed to be active at once. This property is called the sparsity of the network. The sparsity of the
network can be controlled by manually zeroing the required hidden units, by tuning the activation
functions, or by adding a sparsity penalty term to the cost function.
Advantages
1. The sparsity constraint in sparse autoencoders helps in filtering out noise and irrelevant features
during the encoding process.
2. These autoencoders often learn important and meaningful features due to their emphasis on sparse
activations.
Disadvantages
1. The choice of hyperparameters plays a significant role in the performance of this autoencoder,
since different inputs should result in the activation of different nodes of the network.
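One common way to add the sparsity penalty mentioned above is an L1 activity regularizer on an over-complete hidden layer. The sketch below illustrates this in Keras on random data; the penalty weight (1e-4) and layer sizes are arbitrary values for the example, not tuned settings.

```python
# minimal sparse autoencoder sketch: an L1 activity regularizer on the
# (over-complete) hidden layer penalizes dense activations, encouraging
# only a few units to stay active per input (illustrative, not tuned)
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.regularizers import l1

rng = np.random.default_rng(1)
X = rng.random((200, 20))

visible = Input(shape=(20,))
# more hidden units than inputs, with a sparsity penalty on the activations
hidden = Dense(40, activation='relu', activity_regularizer=l1(1e-4))(visible)
output = Dense(20, activation='linear')(hidden)
model = Model(inputs=visible, outputs=output)
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, epochs=5, batch_size=16, verbose=0)

# inspect the hidden activations for the training inputs
encoder = Model(inputs=visible, outputs=hidden)
H = encoder.predict(X, verbose=0)
print(H.shape)  # (200, 40)
```

With a stronger penalty and more training, most entries of the hidden activation matrix are driven toward zero, which is the sparsity property described above.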