
Deep Learning: Feedforward Networks & Optimization

This syllabus covers the fundamentals of deep learning, focusing on deep feed-forward neural networks, gradient descent, back-propagation, and optimization techniques. It discusses the architecture of neural networks, training processes, challenges like the vanishing gradient problem, and regularization methods. Additionally, it highlights real-world applications and the importance of back-propagation in training neural networks to minimize prediction errors.

Uploaded by

anithakumaran29
Copyright
© © All Rights Reserved

SYLLABUS

UNIT - II
INTRODUCTION TO DEEP LEARNING

Deep Feed-Forward Neural Networks – Gradient
Descent – Back-Propagation and Other Differentiation
Algorithms – Vanishing Gradient Problem – Mitigation
– Rectified Linear Unit (ReLU) – Heuristics for
Avoiding Bad Local Minima – Heuristics for Faster
Training – Nesterov's Accelerated Gradient Descent –
Regularization for Deep Learning – Dropout –
Adversarial Training – Optimization for Training Deep
Models.
Deep Feed-Forward Networks
Activation Function
Deep Feed-Forward Networks
1. Input Layer
•Image as input (grayscale or RGB)
•Pixels flattened into a 1D array (e.g., 28x28 = 784
pixels)
•Neurons in input layer correspond to each pixel
2. Hidden Layers
•Multiple hidden layers for feature extraction
•First layers detect basic features (e.g., edges)
•Deeper layers extract complex patterns (e.g., shapes,
textures)
•Activation functions (ReLU, sigmoid) introduce non-
linearity
Deep Feed-Forward Networks
3. Output Layer
•Neurons represent possible classes (e.g., "cat", "dog")
•Softmax activation for classification
•Produces probabilities for each class
4. Training Process
•Forward pass: data flows through the network
•Loss calculation: error between predicted and true
labels
•Backpropagation: computes the gradient of the loss with
respect to each weight
•Optimization: gradient descent or Adam uses these
gradients to update the weights and reduce the loss
Deep Feed-Forward Networks
Training Phase
1. Initialize Weights and Biases
•Randomly initialize weights and biases
•Each neuron has associated weights and biases
2. Forward Pass
•Input image is flattened into a vector
•Pass through each hidden layer (linear
transformation + activation)
•Hidden layers extract features (e.g., edges, shapes)
•Output layer produces probabilities for each class
(e.g., cat or dog)
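The initialization and forward pass described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the layer sizes (784 inputs, 16 hidden neurons, 2 classes), the random pixel values, the zero-initialized biases, and the helper names init_layer, dense, and forward are all made up for this example and do not come from the slides.

```python
import math
import random

def relu(values):
    # ReLU activation: negative values become zero
    return [max(0.0, v) for v in values]

def softmax(values):
    # Subtract the maximum for numerical stability
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases):
    # Linear transformation: one weight row and one bias per neuron
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    # Small random weights; biases start at zero (an assumption of this sketch)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(image, layers):
    x = [pixel for row in image for pixel in row]   # flatten 28x28 -> 784
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))         # hidden layers
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))       # class probabilities

rng = random.Random(0)
image = [[rng.random() for _ in range(28)] for _ in range(28)]  # fake grayscale image
layers = [init_layer(784, 16, rng), init_layer(16, 2, rng)]
probs = forward(image, layers)
print(probs)  # two probabilities, e.g. for "cat" and "dog"
```

With untrained weights the two probabilities are close to each other; training is what moves them toward the correct class.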
Deep Feed-Forward Networks
3. Loss Calculation
•Compare predicted output with true label
•Use cross-entropy loss for classification tasks
•Higher loss indicates a larger error
4. Backpropagation
•Compute the gradients of the loss w.r.t. weights
•Gradients propagate from the output layer back to the
input layer
•Determine how to adjust weights to reduce the loss
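As a small illustration of the loss and gradient steps, the sketch below computes the cross-entropy loss for one example and the gradient of that loss with respect to the output-layer inputs (the logits). It relies on the standard result that, for softmax followed by cross-entropy, this gradient is the predicted probabilities minus the one-hot true label. The logit values and the function names are assumptions chosen for the example.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    # Loss is the negative log-probability assigned to the true class
    return -math.log(probs[true_class])

def grad_wrt_logits(probs, true_class):
    # Softmax + cross-entropy: gradient = probabilities - one-hot label
    return [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]

logits = [2.0, 0.5]     # hypothetical output-layer inputs
true_class = 1          # the network currently favours the wrong class
probs = softmax(logits)
loss = cross_entropy(probs, true_class)
grad = grad_wrt_logits(probs, true_class)

# Finite-difference check of the first gradient component
eps = 1e-6
bumped = [logits[0] + eps, logits[1]]
numeric = (cross_entropy(softmax(bumped), true_class) - loss) / eps
print(loss, grad, numeric)
```

The gradient is positive for the wrongly favoured class and negative for the true class, so a gradient descent step lowers the first logit and raises the second.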
Deep Feed-Forward Networks
5. Weight Update
•Use optimization algorithms like Stochastic Gradient
Descent (SGD) or Adam
•Update weights based on gradients
•Learning rate controls the size of weight updates
6. Repeat for Multiple Epochs
•Process entire dataset through multiple epochs
•Use mini-batch gradient descent for faster updates
•Iterate until the model converges
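The weight update and epoch loop can be sketched with mini-batch SGD on a deliberately tiny model. To keep the example short it fits a straight line rather than a neural network; the update rule is the same. The toy data (y = 3x + 1), the learning rate, the batch size, and the number of epochs are all assumptions made for illustration.

```python
import random

rng = random.Random(0)
# Toy, noise-free data: y = 3x + 1 (hypothetical)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.1   # controls the size of each update
batch_size = 20
epoch_losses = []

for epoch in range(50):                      # one epoch = one pass over the data
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradients of the mean squared error on this mini-batch
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    epoch_losses.append(sum((w * x + b - y) ** 2 for x, y in data) / len(data))

print(w, b, epoch_losses[-1])
```

Each epoch here performs 10 updates (200 samples / 20 per batch), which is why mini-batches give many more updates per pass than full-batch gradient descent.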
Deep Feed-Forward Networks
7. Monitoring Performance
•Track training and validation loss
•Use validation data to check generalization
performance
•Early stopping to prevent overfitting
8. Regularization
•Dropout: Randomly drop neurons to prevent
overfitting
•L2 Regularization: Add penalty for large weights to
simplify the model
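The two regularizers named above can be sketched as follows. The dropout shown is the "inverted" variant, which rescales the surviving activations during training so nothing changes at inference time; this is a common choice but an assumption here, since the slides do not say which variant is meant. The drop probability and the penalty strength lam are hypothetical values.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    # At inference time, or with no dropout, activations pass through unchanged
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Zero each activation with probability drop_prob, rescale the survivors
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

def l2_penalty(weights, lam):
    # Added to the loss: grows with the squared size of the weights
    return lam * sum(w * w for w in weights)

rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, 0.5, rng)
mean_after_dropout = sum(dropped) / len(dropped)
penalty = l2_penalty([3.0, -4.0], lam=0.01)
print(mean_after_dropout, penalty)
```

About half of the activations are zeroed, yet the average stays near 1.0 because the survivors are scaled up.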
Deep Feed-Forward Networks
5. Inference (Prediction)
•New image input passed through the network
•Features extracted in hidden layers
•Output layer produces predicted label (highest
probability)
6. Challenges
•Overfitting: risk of fitting noise in data
•Regularization: techniques like dropout, L2
regularization
•Data Augmentation: enhances model robustness with
varied inputs
Deep Feed-Forward Networks
7. Real-World Applications
•Image classification (e.g., cat vs dog, handwritten digit
recognition)
•Object detection, facial recognition, and medical
image analysis
Gradient based Optimization
• Most deep learning algorithms involve optimization
of some sort.
• Optimization refers to the task of either
minimizing or maximizing some function f (x) by
altering x.
Objective Function
• The function we want to minimize or maximize is
called the objective function or criterion.
• It quantifies how well the model's predictions
match the actual outcomes.
Gradient based Optimization
• We often denote the value that minimizes or
maximizes a function with a superscript ∗. For
example, we might say x∗ = arg min f(x).
• Most optimization problems are framed as
minimization problems.
• If a problem is about maximization, we can
convert it to a minimization problem by
minimizing the negative of the objective
function.
• When we are minimizing it, we may also call it the
cost function, loss function, or error function.
Gradient based Optimization
• Suppose we have a function y = f (x), where both x
and y are real numbers.
• The derivative of this function is denoted as f’(x)
or as dy/dx.
• The derivative f’(x) gives the slope of f (x) at the
point x.
• It shows the rate of change of the function's value
with respect to changes in x.
• In other words, it specifies how to scale a small
change in the input in order to obtain the
corresponding change in the output:
f(x + ϵ) ≈ f(x) + ϵ f’(x).
Gradient based Optimization
• Gradient descent is an iterative optimization technique
in which we update the variable x in the direction opposite
to the gradient of the objective function.
• This reduces the value of the function, provided the
step is small enough. The update rule is
x ← x − α f’(x)
where
α is a small step size or learning rate.
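A minimal sketch of this update rule, assuming the example function f(x) = (x − 3)², whose minimum is at x = 3. The function, the starting point, and the value of α are chosen only for illustration.

```python
def f(x):
    return (x - 3.0) ** 2

def f_prime(x):
    # Derivative of (x - 3)^2
    return 2.0 * (x - 3.0)

def gradient_descent(x, alpha, steps):
    history = [x]
    for _ in range(steps):
        x = x - alpha * f_prime(x)   # move against the derivative
        history.append(x)
    return history

history = gradient_descent(x=0.0, alpha=0.1, steps=100)
print(history[:4], history[-1])
```

Each step shrinks the distance to the minimum by a constant factor here, so the value of f never increases along the way.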
Gradient based Optimization
Figure: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum.
Figure: Uphill and downhill of the gradient problem.
Gradient based Optimization
• The derivative is therefore useful for minimizing a
function because it tells us how to change x in order
to make a small improvement in y.
• For example, we know that f(x-ϵsign(f’(x))) is less
than f (x) for small enough ϵ.
• We can thus reduce f (x) by moving x in small steps
with the opposite sign of the derivative.
Gradient based Optimization
• When f’(x) = 0, the derivative provides no information
about which direction to move.
• Points where f’(x) = 0 are known as critical points or
stationary points.
• A local minimum is a point where f (x) is lower than
at all neighboring points, so it is no longer possible
to decrease f(x) by making infinitesimal steps.
• A local maximum is a point where f (x) is higher
than at all neighboring points, so it is not possible
to increase f(x) by making infinitesimal steps.
Gradient based Optimization
Local Minimum:
•A point where the function value is lower than at all
neighboring points.
•It's a point where we can't decrease the function value
by making infinitesimal changes.
Local Maximum:
• A point where the function value is higher than at all
neighboring points.
• It's a point where we can't increase the function value
by making infinitesimal changes.
Saddle Point:
•A critical point that is neither a local minimum nor a
local maximum.
•The function might have a higher value in one direction
and a lower value in another direction, resembling a
saddle.
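A saddle point can be checked numerically. The sketch below assumes the textbook example f(x, y) = x² − y², which has a critical point at the origin; the function and the probe distance h are illustrative choices.

```python
def f(x, y):
    return x ** 2 - y ** 2

h = 1e-3                      # small probe distance
center = f(0.0, 0.0)          # value at the critical point

along_x = [f(-h, 0.0), f(h, 0.0)]   # neighbours in the x direction
along_y = [f(0.0, -h), f(0.0, h)]   # neighbours in the y direction

higher_along_x = all(v > center for v in along_x)
lower_along_y = all(v < center for v in along_y)
is_saddle = higher_along_x and lower_along_y
print(is_saddle)
```

The origin looks like a minimum along x and like a maximum along y, so it is neither a local minimum nor a local maximum.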
Gradient based Optimization
• A point that obtains the absolute lowest value of f (x)
is a global minimum. It is possible for there to be only
one global minimum or multiple global minima of the
function.
• It is also possible for there to be local minima that are
not globally optimal.
• In the context of deep learning, we optimize functions
that may have many local minima that are not optimal,
and many saddle points surrounded by very flat
regions.
• All of this makes optimization very difficult, especially
when the input to the function is multidimensional.
We therefore usually settle for finding a value of f that
is very low, but not necessarily minimal in any formal
sense.
Gradient based Optimization

Figure: Minimum, maximum, and saddle point


Back-Propagation
• After a neural network is defined with initial weights
and a forward pass is performed to generate the
initial prediction, an error function measures how far
the model's prediction is from the true value.
• There are many possible algorithms that can
minimize the error function—for example, one could
do a brute force search to find the weights that
generate the smallest error.
• However, for large neural networks, a training
algorithm is needed that is very computationally
efficient.
• Backpropagation is that algorithm: it efficiently
computes the gradient of the error with respect to every
weight, so good weights can be found relatively quickly,
even for a network with millions of weights.
Back-Propagation
Training algorithm of BPNN:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights
are usually randomly selected.
3. Calculate the output for every neuron from the
input layer, to the hidden layers, to the output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer
to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is
achieved.
Back-Propagation

Architecture of back propagation network:


As shown in the diagram, the architecture of BPN has
three interconnected layers having weights on them.
The hidden layer and the output layer each also have a
bias unit, whose input is always 1. As is clear from
the diagram, the working of BPN is in two phases. One
phase sends the signal from the input layer to the
output layer, and the other phase back-propagates the
error from the output layer to the input layer.
Back-Propagation
1. Forward pass: weights are initialized and inputs
from the training set are fed into the network. The
forward pass is carried out and the model generates
its initial prediction.
2. Error function: the error function is computed by
checking how far away the prediction is from the
known true value.
3. Backpropagation with gradient descent: the
backpropagation algorithm calculates how much the
output values are affected by each of the weights in
the model. To do this, it calculates partial derivatives,
going back from the error function to a specific neuron
and its weight.
Back-Propagation
• This provides complete traceability from the total error
back to each specific weight that contributed to it. The
result of backpropagation is the gradient of the error
with respect to every weight, which gradient descent
uses to move the weights toward lower error.
4. Weight update: weights can be updated after
every sample in the training set, but this is usually not
practical. Typically, a batch of samples is run in one big
forward pass, and then backpropagation is performed on
the aggregate result.
• The batch size and the number of batches used in
training, called iterations, are important
hyperparameters that are tuned to get the best
results. Running the entire training set through the
backpropagation process is called an epoch.
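The relationship between batch size, iterations, and epochs is simple arithmetic. The numbers below (10,000 samples, batches of 32, 5 epochs) are hypothetical and only show how the three quantities fit together.

```python
import math

num_samples = 10_000   # size of the training set (hypothetical)
batch_size = 32        # samples per forward/backward pass (hypothetical)
epochs = 5             # full passes over the training set (hypothetical)

# The last batch of an epoch may be smaller, hence the ceiling
iterations_per_epoch = math.ceil(num_samples / batch_size)
total_weight_updates = iterations_per_epoch * epochs
print(iterations_per_epoch, total_weight_updates)  # 313 1565
```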
Vanishing Gradient
• Neural networks are trained using backpropagation
and gradient-based learning methods.
• During training, we want to reach the weight values
that result in minimum loss.
• Each weight is updated repeatedly during training.
• The update is proportional to the partial
derivative of the error function with respect to
the current weight in each training iteration.
• However, sometimes this update becomes so
small that the weight effectively stops changing.
• The result is very little or practically no training of the
network. This is referred to as the vanishing
gradient problem.
Vanishing Gradient
• The figure shows that the sigmoid function can cause
the vanishing gradient problem: its derivative is at most
0.25 and approaches zero for large positive or negative
inputs. ReLU and Leaky ReLU largely avoid the problem,
because their derivative is 1 for positive inputs.
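A rough numerical sketch of why this happens. By the chain rule, the gradient reaching an early layer contains one activation-derivative factor per layer. The example below multiplies those factors for a 10-layer chain while ignoring the weight terms, which is a simplifying assumption; the depth and the input values are also illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # largest possible value is 0.25, at z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 10
# Best case for sigmoid: every layer sits at z = 0, where the derivative peaks
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU with positive inputs passes the gradient through unchanged
relu_chain = relu_grad(1.0) ** depth
print(sigmoid_chain, relu_chain)
```

Even in the best case the sigmoid factor shrinks below one millionth after 10 layers, while the ReLU factor stays at 1. Note that ReLU has a zero derivative for negative inputs, which is what Leaky ReLU is meant to address.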
Back-Propagation
• The backpropagation algorithm is a fundamental
concept in training artificial neural networks,
including deep learning models.
• It is used to adjust the network's weights and
biases during the training process to minimize the
error between the predicted and actual outputs.
1. Forward Propagation:
• The process begins with forward propagation.
• Input data is passed through the neural network to
compute the predicted outputs.
• Each neuron in the network calculates a weighted
sum of its inputs and applies an activation function to
produce an output.
Back-Propagation

2. Loss Function:
•A loss function (also known as a cost function or error
function) is used to quantify the error between the
predicted outputs and the actual target values.
•Common loss functions include Mean Squared Error
(MSE) for regression tasks and cross-entropy for
classification tasks.
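Both loss functions are short enough to write out. The sketch below is illustrative only: the predictions, targets, and function names are made up, and the cross-entropy version assumes each prediction is a list of class probabilities and each label is a class index.

```python
import math

def mse(predictions, targets):
    # Mean of squared differences, used for regression
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(prob_rows, labels):
    # Mean negative log-probability of the true class, used for classification
    return -sum(math.log(row[k]) for row, k in zip(prob_rows, labels)) / len(labels)

mse_value = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
ce_confident = cross_entropy([[0.9, 0.1]], [0])   # true class gets 0.9
ce_wrong = cross_entropy([[0.1, 0.9]], [0])       # true class gets 0.1
print(mse_value, ce_confident, ce_wrong)
```

Cross-entropy is small when the true class receives a high probability and grows quickly when the network is confidently wrong.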
Back-Propagation

3. Backpropagation:
•The core of the backpropagation algorithm involves
calculating the gradients of the loss function with
respect to the network's parameters, primarily the
weights and biases.
•The gradients represent the sensitivity of the loss to
changes in the parameters. They indicate how much the
loss would change if the parameters were adjusted.
Back-Propagation

4. Gradient Descent:
•The computed gradients are used to update the
network's weights and biases.
•A common optimization algorithm used with
backpropagation is gradient descent.
•Gradient descent adjusts the weights and biases in the
direction that reduces the loss, allowing the network to
learn from its mistakes.
Back-Propagation
5. Iterative Process:
•The forward propagation, loss calculation, gradient
computation, and weight updates are performed
iteratively for a specified number of epochs or until
convergence.
•During training, the network gradually improves its
ability to make accurate predictions and minimize the
loss.
6. Mini-Batches:
•To improve efficiency, training is often performed
using mini-batches of data rather than the entire
dataset. This approach reduces the computational load
and can lead to faster convergence.
Back-Propagation
7. Activation Functions:
•In deep learning, various activation functions are used
within neural network layers, such as ReLU (Rectified
Linear Unit), sigmoid, and tanh. These functions
introduce non-linearity, which is essential for the
network's ability to learn complex patterns.
8. Backpropagation Through Layers:
•Backpropagation works by computing gradients layer
by layer, starting from the output layer and moving
backward through the hidden layers.
•The chain rule from calculus is used to efficiently
calculate the gradients for each layer.
Back-Propagation
9. Regularization Techniques:
•To prevent overfitting, regularization techniques like
dropout and weight decay are often employed during
training.

•The backpropagation algorithm is a key component of
deep learning, enabling neural networks to learn from
data, make predictions, and adapt their parameters to
minimize errors. It has been instrumental in the success
of various deep learning architectures, such as
convolutional neural networks (CNNs) for image
processing and recurrent neural networks (RNNs) for
sequential data.
Back-Propagation
• Using the Back propagation network, find the new
weights for the net shown below. It is presented with
the input pattern [0,1] and the target output 1. Use a
learning rate of 0.25 and binary sigmoidal activation
function
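The network figure for this exercise is not reproduced in this text, so its weights are unknown. The sketch below shows how one backpropagation step of this kind could be computed, using the values stated in the problem (input [0, 1], target 1, learning rate 0.25, binary sigmoid) and assuming a 2-2-1 network whose weights are placeholders chosen purely for illustration. The layout and every weight value are assumptions, not the values from the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.0, 1.0]      # input pattern from the problem statement
target = 1.0        # target output from the problem statement
lr = 0.25           # learning rate from the problem statement

# PLACEHOLDER weights (assumed): V[i][j] connects input i to hidden unit j
V = [[0.6, -0.3], [-0.1, 0.4]]
v_bias = [0.3, 0.5]
W = [0.4, 0.1]      # hidden unit j -> output
w_bias = -0.2

def forward(V, v_bias, W, w_bias):
    hidden = [sigmoid(v_bias[j] + sum(x[i] * V[i][j] for i in range(2)))
              for j in range(2)]
    output = sigmoid(w_bias + sum(hidden[j] * W[j] for j in range(2)))
    return hidden, output

hidden, output = forward(V, v_bias, W, w_bias)
error_before = 0.5 * (target - output) ** 2

# Error terms: sigmoid derivative is out * (1 - out)
delta_out = (target - output) * output * (1.0 - output)
delta_hidden = [delta_out * W[j] * hidden[j] * (1.0 - hidden[j]) for j in range(2)]

# New weights: old weight + learning rate * error term * incoming signal
new_W = [W[j] + lr * delta_out * hidden[j] for j in range(2)]
new_w_bias = w_bias + lr * delta_out
new_V = [[V[i][j] + lr * delta_hidden[j] * x[i] for j in range(2)] for i in range(2)]
new_v_bias = [v_bias[j] + lr * delta_hidden[j] for j in range(2)]

_, new_output = forward(new_V, new_v_bias, new_W, new_w_bias)
error_after = 0.5 * (target - new_output) ** 2
print(output, new_output, error_before, error_after)
```

Because the first input is 0, the weights leaving it receive a zero update; only the weights fed by the second input and the biases change in the hidden layer.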
CS 404/504, Fall 2021

Activation: ReLU
• Most modern deep NNs use ReLU activations
• ReLU is fast to compute
o Compared to sigmoid, tanh
o Simply threshold a matrix at zero
• Accelerates the convergence of gradient descent
o Due to linear, non-saturating form
• Mitigates the vanishing gradient problem
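"Threshold a matrix at zero" can be written directly. The sketch below uses nested Python lists as a stand-in for a matrix; the example values and function names are illustrative.

```python
def relu(matrix):
    # Replace every negative entry with zero
    return [[max(0.0, v) for v in row] for row in matrix]

def relu_grad(matrix):
    # Derivative is 1 where the input was positive, 0 elsewhere
    return [[1.0 if v > 0 else 0.0 for v in row] for row in matrix]

pre_activations = [[-1.0, 2.0], [0.0, -3.5]]
activated = relu(pre_activations)
gradients = relu_grad(pre_activations)
print(activated)   # [[0.0, 2.0], [0.0, 0.0]]
print(gradients)   # [[0.0, 1.0], [0.0, 0.0]]
```

No exponentials are needed, which is why ReLU is cheaper to compute than sigmoid or tanh.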
Heuristics for Avoiding Bad Local Minima
• To avoid bad local minima, various heuristics
and techniques are employed:
1. Random Initialization of Weights
2. Momentum-Based Gradient Descent
3. Stochastic Gradient Descent (SGD)
4. Regularization Techniques (L2, Dropout)
5. Ensemble Methods
6. Batch Normalization
7. Adaptive Optimization Algorithms (Adam, RMSprop)
8. Simulated Annealing
9. Learning Rate Annealing and Schedulers
Heuristics for Avoiding Bad Local Minima
10. Noise Injection
11. Perturbation Methods
12. Over-Parameterized Networks

Avoiding bad local minima in deep learning is crucial to
achieving good performance. Techniques such as
random initialization, momentum, adaptive optimizers,
batch normalization, and noise injection provide
powerful tools to escape or mitigate the effects of local
minima during training. Each of these heuristics
addresses different aspects of the optimization process,
helping the model reach a better solution.
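As one concrete example from the list, momentum-based gradient descent keeps a running velocity so that past gradients continue to push the parameters along. The sketch below shows only the update rule, on an assumed convex bowl f(x, y) = 0.5 (x² + 25 y²) that has no bad local minima, so it demonstrates how momentum carries the search through a shallow direction rather than an actual escape. The function, learning rate, and momentum coefficient beta are illustrative choices.

```python
def loss(point):
    x, y = point
    return 0.5 * (x * x + 25.0 * y * y)

def grad(point):
    x, y = point
    return (x, 25.0 * y)          # steep in y, shallow in x

def plain_gd(point, lr, steps):
    x, y = point
    for _ in range(steps):
        gx, gy = grad((x, y))
        x, y = x - lr * gx, y - lr * gy
    return (x, y)

def momentum_gd(point, lr, beta, steps):
    x, y = point
    vx, vy = 0.0, 0.0             # velocity starts at rest
    for _ in range(steps):
        gx, gy = grad((x, y))
        vx = beta * vx - lr * gx  # accumulate a decaying sum of past gradients
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
    return (x, y)

start = (1.0, 1.0)
loss_plain = loss(plain_gd(start, lr=0.02, steps=100))
loss_momentum = loss(momentum_gd(start, lr=0.02, beta=0.9, steps=100))
print(loss_plain, loss_momentum)
```

With the same learning rate and step count, the momentum run ends at a lower loss because the velocity builds up along the shallow x direction, where plain gradient descent crawls.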
Heuristics for Faster Training
Reduce Precision (Mixed Precision Training)
•Use lower-precision arithmetic (like 16-bit floating-
point) to speed up computation while largely preserving
accuracy.

•These heuristics help reduce training time while
maintaining or even improving model performance.
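The trade-off behind reduced precision can be seen with Python's standard struct module, which supports the IEEE 754 16-bit format. This sketch only illustrates storage size and rounding; it is not a mixed-precision training setup, and the helper name to_float16 and the sample values are assumptions.

```python
import struct

def to_float16(value):
    # Round a Python float to the nearest 16-bit (half-precision) value
    return struct.unpack('<e', struct.pack('<e', value))[0]

bytes_fp16 = struct.calcsize('<e')   # storage per 16-bit float
bytes_fp32 = struct.calcsize('<f')   # storage per 32-bit float

rounded = to_float16(0.1)
rounding_error = abs(rounded - 0.1)  # precision lost by the narrower format

tiny_gradient = to_float16(1e-8)     # too small to represent in 16 bits
print(bytes_fp16, bytes_fp32, rounded, rounding_error, tiny_gradient)
```

Half precision halves the memory per value, but very small numbers round to zero. That is one reason mixed precision training keeps some quantities in 32-bit rather than converting everything.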
Regularization
• A central problem in machine learning is how to
make an algorithm that will perform well not just on
the training data, but also on new inputs. Many
strategies used in machine learning are explicitly
designed to reduce the test error, possibly at the
expense of increased training error. These strategies
are known collectively as regularization.
Regularization
Generalization Error
•Generalization error (also called test error) is the
expected error of a machine learning model on new,
unseen data. The difference between it and the error on
the training data, the generalization gap, measures how
well the model generalizes to data it has not been
trained on.
•A small gap together with a low training error indicates
that the model is not overfitting and performs well on
both training and test data.
•Regularization techniques are often used to reduce
generalization error by preventing overfitting and
improving the model’s ability to handle unseen data.
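The difference between training and test performance can be made concrete with a model that simply memorizes its training data. The sketch below uses a 1-nearest-neighbour classifier on made-up one-dimensional data with 20% label noise; the data, the noise level, and the function names are all assumptions for illustration.

```python
import random

rng = random.Random(0)

def sample(n):
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        label = 1 if x > 0 else 0
        if rng.random() < 0.2:        # flip 20% of the labels (noise)
            label = 1 - label
        points.append((x, label))
    return points

train = sample(100)
test = sample(200)

def predict(x):
    # 1-nearest neighbour: copy the label of the closest training point
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def error_rate(data):
    return sum(predict(x) != y for x, y in data) / len(data)

train_error = error_rate(train)
test_error = error_rate(test)
gap = test_error - train_error
print(train_error, test_error, gap)
```

The memorizing model makes no mistakes on the training set, including on the noisy labels, yet it errs on unseen points. Regularization aims to shrink exactly this gap.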