UNIT – II
INTRODUCTION TO DEEP LEARNING
Final Year B.Tech., Subject: Deep Learning (PE3)
Unit II : Contents
2
Introduction to Deep Learning
Introduction to Deep Learning, Architecture, Multilayer Perceptron
(MLP), Training MLP, Chain Rule, Backpropagation, Optimization
Methods, Feedforward Neural Network, Examples of Deep Learning.
Introduction to Deep learning, Architecture, Multilayer
Perceptron (MLP), Training MLP, Chain Rule
Sources:
• Machine Learning with Neural Networks: An Introduction for
Scientists and Engineers, Bernhard Mehlig
Introduction to Deep learning
4
• Deep learning comes under the artificial intelligence
umbrella, either alongside machine learning or as a
subset of it.
• The difference is that machine learning uses
algorithms developed for specific tasks.
• Deep learning instead learns a representation of the data
through multiple layers, where each layer uses the
output of the previous layer as its input.
Source (Figure) : [Link]
Deep learning Architecture
5
Source (Figure) : [Link]
Deep Neural Network
6
An ANN is a feedforward neural network, as it processes inputs in the forward
direction only. Artificial neural networks are capable of learning non-linear functions;
the activation function is what lets an ANN learn complex relationships between
input and output.
An RNN adds the looping connections in the hidden layers that a plain feedforward
ANN lacks. Deep recurrent networks are capable of solving problems related to audio
data, text data, and time-series data: recurrent neural networks capture the sequential
information available in the input data. RNNs work on parameter sharing.
CNN-based models in deep neural networks are used in video and image processing.
Filters, or kernels, are the building blocks of a CNN; by applying convolution
operations, the kernels extract the relevant features from the input data.
[Link]
Deep Neural Network
7
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig (page no.130)
Introduction to Deep learning
8
Why is it sometimes necessary to have a hidden layer?
In order to solve problems that are not linearly separable.
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig (page no.91-123)
Difference between Machine Learning and Deep Learning
Machine Learning | Deep Learning
Works with small amounts of data for accuracy. | Requires large amounts of data.
Can run on a low-end machine. | Heavily dependent on a high-end machine.
Divides the task into sub-tasks, solves them individually, and finally combines the results. | Solves the problem end to end.
Takes less time to train. | Takes longer to train.
Testing time may increase. | Takes less time to test.
[Link]
Difference Between Neural Network And
Deep Neural Network
10
• A deep neural network is more complex and more capable than a shallow neural
network. Deep neural network algorithms can recognize sounds and voice commands,
make predictions, and perform analysis; they act more like the human brain.
• A shallow neural network gives one result: an action, a word, or a solution. Deep
neural networks, on the other hand, provide solutions by solving problems globally,
based on the information given.
• A specific data input and algorithm are required for a shallow neural network, whereas
deep neural networks are capable of solving problems without a task-specific pipeline
or a fixed amount of data.
[Link]
Multilayer Perceptron (MLP)
11
• A multilayer perceptron (MLP) is a class of feedforward artificial neural
network (ANN).
• The term MLP is used ambiguously: sometimes loosely, to mean any feedforward
ANN, and sometimes strictly, to refer to networks composed of multiple layers
of perceptrons (with threshold activation).
• Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural
networks, especially when they have a single hidden layer.
• An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an
output layer. Except for the input nodes, each node is a neuron that uses a
nonlinear activation function.
• An MLP utilizes a supervised learning technique called backpropagation for
training. Its multiple layers and non-linear activations distinguish an MLP from a
linear perceptron.
• It can distinguish data that is not linearly separable.
The multilayer perceptron (MLP) is used for a variety of tasks, such as stock analysis,
image identification, spam detection, and election-vote prediction.
[Link]
Multilayer Perceptron (MLP)
12
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig (page no.92)
Solving XOR using Multilayer Perceptron
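The XOR problem can be solved by hand with one hidden layer. The following is a minimal sketch, with hand-picked (not learned) weights and threshold units; the specific weight values are illustrative choices, not taken from the figure:

```python
def step(z):
    # Threshold (Heaviside) activation: fires when the weighted sum is >= 0
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: h1 acts as OR, h2 acts as AND (weights chosen by hand)
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    # Output fires when OR is true but AND is false, i.e. XOR
    return step(h1 - h2 - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))  # 0, 1, 1, 0
```

No single-layer perceptron can produce this table, but the hidden layer makes the classes linearly separable for the output unit.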
Multilayer Perceptron (MLP)
14
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig (page no.100)
Multilayer Perceptron (MLP)
15
The MLP is a feedforward neural network, which means that the data is transmitted from
the input layer to the output layer in the forward direction.
The value of the h5 node, for example, is the weighted sum of its inputs:
h5 = h1·w8 + h2·w9
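A sketch of this computation, with illustrative values for h1, h2 and the weights (the names follow the slide; none of these numbers come from the figure):

```python
# Forward pass for one node: the weighted sum of the previous layer's outputs.
h1, h2 = 0.6, 0.4    # outputs of the previous layer (illustrative)
w8, w9 = 0.5, -0.3   # weights on the incoming edges (illustrative)
h5 = h1 * w8 + h2 * w9
print(round(h5, 2))  # 0.18
```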
[Link]
Backpropagation in
Multilayer Perceptron (MLP)
16
⮚ Backpropagation is a technique used to optimize the weights of an MLP by
feeding the error at the outputs back through the network.
⮚ In a conventional MLP, random weights are assigned to all the connections.
These random weights propagate values through the network to produce the
actual output.
⮚ Naturally, this output would differ from the expected output. The difference
between the two values is called the error.
⮚ Backpropagation refers to the process of sending this error back through the
network, readjusting the weights automatically so that eventually, the error
between the actual and expected output is minimized.
⮚ In this way, the output of the current iteration becomes the input and affects the
next output. This is repeated until the correct output is produced. The weights at
the end of the process would be the ones on which the neural network works
correctly.
[Link]
Backpropagation in
Multilayer Perceptron (MLP)
17
How does backpropagation work?
[Link]
Chain Rule
18
⮚ The chain rule is covered in differential calculus.
⮚ The derivatives of the energy (loss) function are evaluated using the chain rule.
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig.
Chain Rule
19
Chain Rule
20
[Link]
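To make this concrete, here is a minimal sketch of the chain rule applied to a single sigmoid neuron with squared error, checked against a finite-difference approximation; all input values are illustrative:

```python
import math

# E = 1/2 (y - t)^2 with y = sigma(w * x); by the chain rule,
# dE/dw = (y - t) * y * (1 - y) * x
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, t, w = 1.5, 0.0, 0.8
y = sigmoid(w * x)
grad = (y - t) * y * (1 - y) * x              # chain rule, term by term

# Numerical check: finite-difference approximation of the same derivative
eps = 1e-6
E = lambda w: 0.5 * (sigmoid(w * x) - t) ** 2
grad_numeric = (E(w + eps) - E(w - eps)) / (2 * eps)
print(abs(grad - grad_numeric) < 1e-6)  # True
```

The analytic gradient and the numeric one agree, which is exactly the check backpropagation implementations use to validate their chain-rule code.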
Training Multilayer Perceptron (MLP)
21
A neural network needs to be trained on a dataset:
1. Data Preparation
2. Weight Update
3. Prediction
Training MLP- Data Preparation
22
● Data must be numerical (real values).
● Categorical data, such as a sex attribute with the values "male" and
"female", can be converted to a real-valued representation called a
one-hot encoding.
● One-hot encoding can also be used on the output variable in classification
problems with more than one class.
● This creates a binary vector from a single column that is easy to
compare directly with the output of the network's output layer which,
as described above, outputs one value for each class.
● Data Preprocessing
1. Normalization
2. Standardization
3. Scaling
Training MLP- Weight Update Algorithm
23
Stochastic Gradient Descent
● The classical and still preferred training algorithm for neural networks is
called stochastic gradient descent.
● One row of data is exposed to the network at a time as input; the network
processes the input upward, activating neurons as it goes, to finally
produce an output value.
● The output of the network is compared to the expected output and an error
is calculated.
● This error is then propagated back through the network, one layer at a
time, and the weights are updated according to the amount that they
contributed to the error. This is called the backpropagation algorithm.
● The process is repeated for all of the examples in your training data.
● One round of updating the network for the entire training dataset is
called an epoch.
● A network may be trained for tens, hundreds or many thousands of epochs.
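The loop described above can be sketched for a single linear neuron, a deliberately simplified stand-in for a full MLP; all names, the toy dataset, and the hyperparameter values are illustrative:

```python
def train(dataset, w, lr, epochs):
    # One epoch = one full pass over the entire training dataset
    for _ in range(epochs):
        for x, expected in dataset:       # one row at a time
            output = w * x                # forward pass
            error = output - expected     # compare with the expected output
            w -= lr * error * x           # update w by its contribution to the error
    return w

# Learn y = 2x from a toy dataset
w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], w=0.0, lr=0.05, epochs=100)
print(round(w, 3))  # 2.0
```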
Training MLP- Weight Update Algorithm
24
● The weights in the network can be updated from the errors calculated for
each training example
● The errors can be saved up across all of the training examples and the
network can be updated at the end.
● Because datasets are large, and for computational efficiency, the batch size
(the number of examples the network is shown before an update) is often
reduced to a small number, such as tens or hundreds of examples.
● The amount by which the weights are updated is controlled by a configuration
parameter called the learning rate. It is also called the step size, and it controls
the step, or change, made to a network weight for a given error.
● Often small learning rates are used, such as 0.1, 0.01, or smaller.
Training MLP- Weight Update Algorithm
25
● The update equation can be complemented with additional configuration terms
that you can set.
● Momentum is a term that incorporates the properties from the previous weight
update to allow the weights to continue to change in the same direction even
when there is less error being calculated.
● Learning Rate Decay is used to decrease the learning rate over epochs to allow
the network to make large changes to the weights at the beginning and smaller
fine tuning changes later in the training schedule.
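A sketch of how these two terms modify the plain update, on a single weight with a made-up loss E(w) = w² (all constants are illustrative):

```python
w, velocity = 0.5, 0.0
lr, momentum, decay = 0.1, 0.9, 0.01
for epoch in range(3):
    gradient = 2 * w                       # pretend loss E(w) = w^2
    velocity = momentum * velocity - lr * gradient
    w += velocity                          # momentum keeps the update moving
    lr *= 1.0 / (1.0 + decay)              # decay the learning rate each epoch
print(round(w, 4))  # weight has moved from 0.5 toward the minimum at 0
```

Momentum reuses part of the previous step; decay shrinks the steps as training proceeds, matching the description above.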
Training MLP- Prediction
26
Stochastic Gradient Descent
Machine learning with neural networks An introduction for scientists and engineers, Bernhard Mehlig, pg no-106
Gradient Descent
27
Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, Nikhil Buduma, pg no. 28
Gradient Descent
28
• How might we minimize the squared error over
all of the training examples? Start by simplifying the
problem.
• E.g., a linear neuron that only has two inputs (and thus
only two weights, w1 and w2).
• Imagine a three-dimensional space where the
horizontal dimensions correspond to the weights
w1 and w2, and the vertical dimension
corresponds to the value of the error function E.
• In this space, points in the horizontal plane
correspond to different settings of the weights,
and the height at those points corresponds to
the incurred error.
• If we consider the errors we make over all
possible weights, we get a surface in this three-
dimensional space, in particular a quadratic bowl,
as shown in the figure.
Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, Nikhil Buduma, pg no. 28
Gradient Descent
29
• We can conveniently visualize this surface as a set of elliptical
contours, where the minimum error is at the center of the
ellipses.
• The contours lie in a two-dimensional plane whose dimensions
correspond to the two weights.
• Contours correspond to settings of w1 and w2 that
evaluate to the same value of E.
• The closer the contours are to each other, the steeper the
slope. In fact, it turns out that the direction of the steepest
descent is always perpendicular to the contours.
• This direction is expressed as a vector known as the
gradient.
• We can now develop a high-level strategy for finding the values
of the weights that minimize the error function.
• Randomly initialize the weights of the network, placing us
somewhere on the horizontal plane.
Gradient Descent
30
• By evaluating the gradient at the current position, we
can find the direction of steepest descent and take a
step in that direction.
• Find a new position that’s closer to the minimum
than before.
• Reevaluate the direction of steepest descent by
taking the gradient at this new position and taking a
step in this new direction.
• It’s easy to see that, as shown in Figure, following
this strategy will eventually get us to the point of
minimum error.
• This algorithm is known as gradient descent, and it
is used to tackle the problem of training individual
neurons and the more general challenge of training
entire networks.
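The strategy above can be sketched on the quadratic bowl from the previous slides, using E(w1, w2) = w1² + w2² as an illustrative error surface (so the gradient is (2·w1, 2·w2)):

```python
# Gradient of the illustrative bowl E(w1, w2) = w1^2 + w2^2
def gradient(w1, w2):
    return 2 * w1, 2 * w2

w1, w2 = 3.0, -4.0          # "random" starting point on the horizontal plane
lr = 0.1
for _ in range(100):
    g1, g2 = gradient(w1, w2)
    w1 -= lr * g1           # step in the direction of steepest descent
    w2 -= lr * g2
print(abs(w1) < 1e-6 and abs(w2) < 1e-6)  # True: reached the minimum at (0, 0)
```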
Convergence
31
Convergence
32
Learning Rate (Hyper-parameter)
33
• In practice, at each step of moving
perpendicular to the contour, we need
to determine how far to walk
before recalculating our new
direction.
• This distance needs to depend on the
steepness of the surface.
• Picking the learning rate is a hard
problem:
• if we pick a learning rate that's too
small, we risk taking too long during
the training process;
• but if we pick a learning rate that's
too big, we'll most likely start
diverging away from the minimum.
Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms, Nikhil Buduma, pg no. 30
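A small sketch of both failure modes, minimizing E(w) = w² (gradient 2w) with a small and a too-large learning rate; the rates and step count are illustrative:

```python
# Gradient descent on E(w) = w^2, whose gradient is 2w
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(0.1)) < 0.02)    # True: small rate converges toward 0
print(abs(descend(1.1)) > 1.0)     # True: too-large rate overshoots and diverges
```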
Learning Rate
Learning rate selection. Source: [Link]
Optimization
35
⮚ Optimization is the most essential ingredient in the recipe of
machine learning algorithms.
⮚ It starts with defining some kind of loss function/cost
function and ends with minimizing it using one or the other
optimization routine.
⮚ The choice of optimization algorithm can make a difference
between getting a good accuracy in hours or days.
Optimization
36
⮚ For a deep learning problem, we will usually define a loss function first. Once we
have the loss function, we can use an optimization algorithm in an attempt to
minimize the loss.
⮚ In optimization, a loss function is often referred to as the objective function of the
optimization problem.
⮚ By tradition and convention most optimization algorithms are concerned
with minimization. If we ever need to maximize an objective there is a simple
solution: just flip the sign on the objective.
⮚ The objective function of the optimization algorithm is usually a loss function based
on the training dataset. The goal of optimization is to reduce the training error, but
the goal of learning is to reduce the generalization error.
⮚ To accomplish the latter we need to pay attention to overfitting in addition to using
the optimization algorithm to reduce the training error.
⮚ In deep learning, generally, to approach the optimal value, gradient descent is
applied to the weights, and optimization is achieved by running many epochs with
large datasets. [Link]
[Link]
How Learning Differs from Pure Optimization
⮚ Goal: find the parameters θ that reduce a cost function J(θ).
⮚ Typically, the cost function with respect to the training set can be
written as:
J(θ) = E(x,y)∼p̂data [ L(f(x; θ), y) ]
⮚ where L is the per-example loss function, f(x; θ) is the predicted output when
the input is x, and p̂data is the empirical distribution.
⮚ In the supervised learning case, y is the target output.
⮚ The objective is to minimize the corresponding objective function.
[Link]
Gradient descent
Most optimization algorithms now are based on Gradient Descent (GD),
which requires a derivative calculation and hence may not work with certain
loss functions like the 0–1 loss (as it is not differentiable).
Gradient descent is a way to minimize an objective function J(θ):
θ ∈ Rd : model parameters
η : learning rate
∇θ J(θ) : gradient of the objective function with respect to the parameters
Update equation: θ = θ − η · ∇θ J(θ)
Figure: Optimization with gradient descent
Sebastian Ruder
[Link]
Gradient descent variants
Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
Sebastian Ruder; Gradient Descent algorithm and its variants - GeeksforGeeks
Batch gradient descent
This is a type of gradient descent which processes all the training examples for each
iteration of gradient descent.
If the number of training examples is large, batch gradient descent is computationally
very expensive and hence is not preferred; instead, use stochastic gradient descent or
mini-batch gradient descent.
Pros:
Guaranteed to converge to the global minimum for convex error surfaces, and to a local
minimum for non-convex surfaces.
Cons:
Very slow.
Intractable for datasets that do not fit in memory.
No online learning.
Gradient Descent algorithm and its variants - GeeksforGeeks
Batch gradient descent
Batch gradient descent Algorithm
Batch gradient descent
Stochastic gradient descent
• This is a type of gradient descent which processes 1 training example per
iteration.
• Hence, the parameters are updated even after a single iteration, in which only
one example has been processed.
• Hence this is quite faster than batch gradient descent. But when the number of
training examples is large, it still processes only one example at a time, which
can add overhead for the system, since the number of iterations will be quite large.
Pros
Much faster than batch gradient descent.
Allows online learning.
Cons
High variance updates.
Figure: SGD fluctuation (Source: Wikipedia)
Gradient Descent algorithm and its variants - GeeksforGeeks
Stochastic gradient descent
Mini-batch gradient descent
• Mini-batch gradient descent: this is a type of gradient descent which
works faster than both batch gradient descent and stochastic gradient
descent.
• Here b examples, where b < m, are processed per iteration, so even if the
number of training examples is large, it is processed in batches of b training
examples in one go.
• Thus, it works for larger training sets, and with a smaller number
of iterations.
Pros
Reduces the variance of updates.
Can exploit matrix multiplication primitives.
Cons
Mini-batch size is a hyperparameter. Common sizes are 50–256.
Usually referred to as SGD even when mini-batches are used.
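The three variants differ only in how many examples feed each gradient computation. A sketch on a toy linear-regression loss, where only the batching changes (the dataset, learning rate, and epoch count are all illustrative):

```python
# Toy data following y = 2x; mean-squared-error gradient for a linear model w*x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def grad(w, batch):
    # Gradient of the mean squared error over the given batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batches, w=0.0, lr=0.05, epochs=50):
    for _ in range(epochs):
        for batch in batches:
            w -= lr * grad(w, batch)
    return w

w_batch = train([data])                 # batch GD: all m = 4 examples per update
w_sgd   = train([[ex] for ex in data])  # stochastic GD: 1 example per update
w_mini  = train([data[:2], data[2:]])   # mini-batch GD: b = 2 < m = 4
print([round(w, 3) for w in (w_batch, w_sgd, w_mini)])  # [2.0, 2.0, 2.0]
```

All three recover the true slope here; they differ in cost per update and in the variance of the path taken, as the slides describe.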
Mini-batch gradient descent
Comparison of trade-offs of gradient descent variants
[Link]
Optimization Methods
49
Commonly used optimization methods
⮚ Stochastic Gradient Descent with Momentum
⮚ AdaGrad
⮚ RMSProp
⮚ Adam Optimizer
The choice of which algorithm to use, at this point, seems to depend largely on the
user's familiarity with the algorithm (for ease of hyperparameter tuning).
[Link]
[Link]
Optimization Methods
50
Optimization Challenges in Deep Learning
⮚ The most vexing challenges are local minima, saddle points, and vanishing gradients.
(Figure: local minima, saddle points, vanishing gradients)
• The optimization problem may have many local minima.
• The problem may have even more saddle points, since in general the problem is not convex.
• Vanishing gradients can cause optimization to stall. Often a reparameterization of the
problem helps; good initialization of the parameters can be beneficial, too.
[Link]
Backpropagation
Feedforward Neural Network
Sources:
1. [Link]
2. Lecture slides for Chapter 6 of Deep Learning,
[Link], Ian Goodfellow
The Neuron Metaphor
52
⮚ Neurons [1]
• accept information from multiple inputs
• transmit information to other neurons
⮚ Multiply inputs by weights along edges
⮚ Apply some function to the set of inputs at each node
Types of Neurons
53
Linear Neuron
Logistic Neuron
Perceptron
Multilayer Networks
54
⮚ Cascade neurons together
⮚ The output from one layer is the input to the next
⮚ Each layer has its own set of weights
Linear Neural Networks
55
⮚ Arrange linear neurons in a multilayer network
Linear Neural Networks
56
⮚ The composition of two linear transformations is itself a linear transformation
Neural Networks
57
⮚ Introduce non-linearities to the network
• Non-linearities allow a network to identify complex regions
in space
Linear Separability
58
⮚ A 1-layer network cannot handle XOR
⮚ More layers can handle more complicated regions of space, but
require more parameters
⮚ Each node splits the feature space with a hyperplane
⮚ If the second layer computes an AND, a 2-layer network can represent
any convex hull
Feed-Forward Networks
59
⮚ Predictions are fed forward through the network to
classify
Forward Activity- Backward Error
65
Backward propagation
66
⮚ Backpropagation has two phases:
• Forward pass: computes the 'functional signal' — feedforward
propagation of the input pattern signals through the network.
• Backward pass: computes the 'error signal' and propagates the error
backwards through the network, starting at the output units
(where the error is the difference between the actual and desired output values).
Error Backpropagation
67
⮚ Apply gradient descent on the whole network
⮚ Training will proceed from the last layer to the first
Error Backpropagation
68
⮚ Introduce variables over the neural network
Error Backpropagation
69
⮚ Introduce variables over the neural network
• Distinguish the input and output of each node
Error Backpropagation
70
Error Backpropagation
71
Training: Take the gradient of the last component and iterate backwards
Error Backpropagation
72
Empirical Risk Function
Error Backpropagation
73
Optimize last layer weights wkl (calculus chain rule)
Error Backpropagation
78
Optimize last hidden weights wjk (multivariate chain rule)
Error Backpropagation
82
⮚ Repeat for all previous layers
Error Backpropagation
83
Now that we have well defined gradients for each parameter, update using Gradient Descent
Error Backpropagation
84
⮚ Error backprop unravels the multivariate chain rule and
solves the gradient for each partial component separately.
⮚ The target values for each layer come from the next layer.
⮚ This feeds the errors back along the network.
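Putting the pieces together, here is a compact sketch of the whole procedure: a forward pass, a backward pass that unravels the chain rule layer by layer (with the last layer optimized first, and its error signal serving as the "target" for the hidden layer), and gradient-descent updates. The network is deliberately tiny (one sigmoid hidden unit, one linear output), and all values are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 0.5
w1, w2 = 0.4, 0.7          # first-layer and last-layer weights (illustrative)
lr = 0.5
for _ in range(200):
    # Forward pass: input -> hidden -> output
    h = sigmoid(w1 * x)
    y = w2 * h
    # Backward pass: gradient of E = 1/2 (y - target)^2, last layer first
    dE_dy = y - target
    dE_dw2 = dE_dy * h                 # last-layer weight gradient
    dE_dh = dE_dy * w2                 # error fed back to the hidden layer
    dE_dw1 = dE_dh * h * (1 - h) * x   # sigmoid derivative, chain rule
    # Gradient-descent updates for both layers
    w2 -= lr * dE_dw2
    w1 -= lr * dE_dw1
print(round(w2 * sigmoid(w1 * x), 3))  # 0.5, the target output
```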