UNIT 3
CHAPTER 1
OPTIMIZATION FOR TRAINING DEEP MODELS
CONTENTS
• How Learning Differs from Pure Optimization
• Challenges in Neural Network Optimization
• Basic Algorithms
• Parameter Initialization Strategies
• Algorithms with Adaptive Learning Rates
In deep learning, optimization means finding the best values for the
model’s parameters (θ) that reduce the error [cost function J(θ)].
This error tells us how wrong the model's predictions are.
Training neural networks is hard because it is a complex optimization
problem.
Special algorithms and strategies are used to solve it efficiently and
accurately.
How Learning Differs from Pure Optimization
Optimization means finding the best value for something to get the best
result.
Learning (Machine Learning):
• Machine learning usually acts indirectly.
• In most machine learning scenarios, we care about some performance
measure P (like accuracy), that is defined with respect to the test set and
may also be intractable.
• We therefore optimize P only indirectly.
• We reduce a different cost function J(θ) in the hope that doing so will
improve P.
Pure Optimization:
The goal is to directly minimize a cost function J(θ).
Cost Function in Learning:
J(θ) = E_{(x,y)~p̂_data} [L(f(x; θ), y)]
Where,
• L is the per-example loss function (i.e., how wrong the prediction is)
• f (x; θ) is the predicted output when the input is x
• p̂_data is the empirical data distribution (i.e., the training data)
• In the supervised learning case, y is the target output
• θ is the set of model parameters that the training process adjusts to minimize error
The goal is to make the model’s prediction f(x;θ) as close as possible to the true y.
Ideal Cost Function:
Ideally, we would prefer to minimize the corresponding objective function
J*(θ) = E_{(x,y)~p_data} [L(f(x; θ), y)]
where the expectation is taken across the true data distribution p_data
rather than just over the finite training set.
Empirical Risk Minimization
• Risk is the expected error of a model on the true data distribution.
• The goal of a machine learning algorithm is to reduce the expected
generalization error which means make accurate predictions on unseen data.
• Since we don’t know the true data distribution in the real world, we train our
model using a sample of data (training set).
• A training process where we minimize the average training error to help the
model perform better on real-world data is known as empirical risk
minimization (ERM).
• However, empirical risk minimization is prone to overfitting and might not be
feasible.
E_{(x,y)~p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))
where m is the number of training examples.
Imagine a student (model) preparing for a final exam (real-world
data). The student practices with several sample question papers
(training data). His average mistake on these sample papers is his
empirical risk: he tries to minimize it, hoping it will also improve his
performance on the final exam.
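Empirical risk minimization can be sketched numerically. The linear model f(x; θ) = θx, the squared loss, and the toy dataset below are illustrative assumptions, not from any particular source:

```python
import numpy as np

def empirical_risk(theta, xs, ys):
    """Average per-example squared loss over the m training examples."""
    m = len(xs)
    losses = [(theta * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / m

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])  # generated by the "true" theta = 2

print(empirical_risk(2.0, xs, ys))  # a perfect fit gives empirical risk 0.0
print(empirical_risk(1.5, xs, ys))  # a worse theta gives a higher average loss
```

Training searches over θ for the value that minimizes this average, hoping the same θ also does well on unseen data.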
Surrogate Loss Functions
• Sometimes, the loss function we actually care about is not one that can be
optimized efficiently. (e.g., 0-1 loss)
• The 0-1 loss just checks whether the prediction is right or wrong (0 for correct,
1 for wrong). It is not smooth and does not give useful signals to update the
model.
• So we use a surrogate loss instead.
• In machine learning and deep learning, surrogate loss functions are
alternative loss functions that are easier to optimize than the original (often
non-differentiable or combinatorial) objective, but still closely related to it.
• A surrogate loss is a proxy that we optimize instead of the real loss.
For instance, Negative log-likelihood (NLL) is a common surrogate for
0-1 loss in classification problems.
• NLL helps in estimating the probability of each class. A model that
assigns the highest conditional probability to the correct class often
performs well on 0-1 classification tasks too.
• Surrogate loss continues learning after 0-1 loss reaches 0:
o Even if a model gets all training predictions right (0-1 loss = 0), minimizing the
surrogate (log-likelihood) can still improve confidence and robustness—
separating classes better for generalization.
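The point above can be illustrated with a small sketch comparing 0-1 loss with its NLL surrogate for binary classification; the probabilities and labels are made-up values:

```python
import numpy as np

def zero_one_loss(probs, labels):
    # 0-1 loss: fraction of thresholded predictions that are wrong
    preds = (probs >= 0.5).astype(int)
    return np.mean(preds != labels)

def nll_loss(probs, labels):
    # negative log-likelihood of the correct class (the surrogate)
    p_correct = np.where(labels == 1, probs, 1 - probs)
    return -np.mean(np.log(p_correct))

labels = np.array([1, 0, 1])
confident = np.array([0.99, 0.01, 0.95])  # all correct, high confidence
hesitant = np.array([0.60, 0.40, 0.55])   # all correct, low confidence

# Both classifiers already achieve 0-1 loss = 0 ...
print(zero_one_loss(confident, labels), zero_one_loss(hesitant, labels))
# ... but the surrogate still prefers the confident one, so minimizing
# it keeps pushing the classes apart even after 0-1 loss reaches 0.
print(nll_loss(confident, labels), nll_loss(hesitant, labels))
```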
Early Stopping
• Overfitting refers to a model that learns the training data too closely, including
noise and outliers, as a result of which the model performs well on the training
data but poorly on unseen test data.
• A practical technique where we stop training early to avoid overfitting is called
early stopping.
• Stops when the model stops improving on the validation set, even if the
training surrogate loss continues to decrease.
• Typically uses surrogate loss on validation set.
Batch and Minibatch Algorithms
• In ML optimization algorithms, the objective function is defined as the sum (or average)
over all training examples, and we can perform updates by randomly sampling a batch of
examples and taking the average over the examples in that batch.
• Suppose we take n independent samples from the distribution with mean μ and standard
deviation σ.
• The sample mean is a random variable: it varies depending on the n samples drawn.
• The standard error of the mean, SE = σ/√n, tells how much the sample mean is likely
to fluctuate around the true mean μ.
• For example :
- Taking 5 people gives a rough estimate.
- Taking 500 people gives a much better estimate.
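A short sketch of how the standard error σ/√n shrinks with sample size, which is why larger minibatches give more reliable gradient estimates (the distribution parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0  # illustrative population mean and std deviation

for n in (5, 500):
    se = sigma / np.sqrt(n)  # theoretical standard error of the mean
    sample_mean = rng.normal(mu, sigma, size=n).mean()
    print(f"n={n}: standard error {se:.3f}, one sample mean {sample_mean:.3f}")
```

Note that going from 5 to 500 samples costs 100 times more computation but reduces the standard error only by a factor of 10, which is part of why minibatches are preferred over full batches.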
3 types of sampling-based algorithms:
1. Batch gradient descent (BGD)
Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
2. Stochastic gradient descent (SGD)
Optimization algorithms that use only a single example at a time are sometimes called
stochastic method.
3. Mini-batch gradient descent (MBGD)
Most algorithms used for deep learning fall somewhere in between, using more than
one but less than all of the training examples. These were traditionally called
minibatch or minibatch stochastic methods and it is now common to simply call them
stochastic methods.
Challenges in Neural
Network Optimization
Optimization helps find the best weights for a neural network.
Goal: Minimize the error (loss function).
Challenge:
Neural networks often involve complex, high-dimensional, and non-
convex loss functions that contain multiple local minima, saddle points,
and flat regions, making it difficult for gradient-based methods to find
the optimal solution efficiently.
Ill Conditioning
• Ill-conditioning (of the Hessian matrix) means that small changes in the
parameters can cause disproportionately large changes in the gradient, so
even very small update steps can increase the cost.
• The ill-conditioning problem is generally believed to be present when
training neural networks using SGD (Stochastic Gradient Descent).
• Ill-conditioning makes the model get stuck or slow down.
• The network doesn’t improve much, even though we are updating
weights.
Local Minima
• In optimization problems, local minima are points where the function value is lower than nearby points,
but not necessarily the lowest possible (global minimum).
• For convex functions, any local minimum is also a global minimum. However, neural networks use non-
convex functions, which can have many local minima due to their complex structure and high-
dimensional space.
• In large neural networks, most local minima have almost the same low error.
• These local minima don’t affect performance since they are basically equivalent solutions.
• The real problem would be bad local minima—those that give high error.
Plateaus, Saddle Points and Other Flat Regions
Saddle point:
• It is a point on the loss surface where the gradient is zero, but the point is not a minimum or
maximum.
• These saddle points can trap optimization algorithms because the gradient becomes very
small, making progress slow.
A plateau is a flat region in the loss surface where the gradient is almost zero across a wide
area.
Flat regions are wide, flat areas where both the gradient and the curvature are zero or near-
zero. The value of the function is nearly constant. These regions may not even correspond to
a good solution.
Cliffs and Exploding Gradients
• In deep neural networks, especially recurrent neural networks (RNNs), we often multiply
many weights together across layers or time steps.
• When some of these weights are large, this repeated multiplication can result in very
steep regions in the loss surface called cliffs.
• When the optimizer computes the gradient in such a steep area, it
may suggest a very large update step, causing the model to “jump off
the cliff” and leading to exploding gradients: sudden, huge changes
in the model parameters that destabilize training.
• To handle this, we use a method called gradient clipping, where we
limit the size of the gradient update.
• This helps maintain stable and controlled learning, especially in
models handling long sequences like RNNs.
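Gradient clipping by norm can be sketched as follows; the threshold and gradient values are illustrative:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """If the gradient's norm exceeds max_norm, rescale it to max_norm.
    The direction is preserved; only the step size is limited."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])    # norm 50: a "cliff" gradient
print(clip_by_norm(g, 5.0))   # rescaled to norm 5 -> [3. 4.]

g_small = np.array([1.0, 1.0])
print(clip_by_norm(g_small, 5.0))  # already within the limit, unchanged
```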
Long-Term Dependencies
• Deep networks or RNNs apply operations repeatedly
• Reusing the same weights causes gradients to:
o Vanish (become too small)
o Explode (become too large)
• Makes learning unstable or very slow
• This problem makes it hard to learn dependencies over long sequences — hence the name
long-term dependency problem
Solution:
• Gradient clipping is one solution for exploding gradients.
• Deep feedforward networks avoid this better than RNNs because they don’t reuse the same
weights.
• Understanding this helps in designing better architectures like LSTMs that tackle this issue.
Inexact Gradients
• Most optimization algorithms in deep learning assume access to exact gradients. However, this
assumption is often not true in practice.
• Gradients are inexact because of:
o Computing gradients using mini-batch, leading to noisy estimates
o Intractable objectives (e.g., the log-likelihood in Boltzmann machines) that require
approximations
o Biased approximations in algorithms
Solution:
• Use robust optimization algorithms.
• Replace the original loss with a surrogate (easier) loss function.
Poor Correspondence Between Local and Global
Structure
• The direction of local gradient may not guide the optimizer toward the global minimum of
the loss function.
• Optimization can slow down or stall in flat regions or saddle points, where the gradient is
near zero but not at a minimum.
• When the loss surface has sharp curves in some directions and flatness in others, it leads to
inefficient updates and unstable training.
• Many loss functions in deep learning do not have a clear global minimum.
• Optimization algorithms like gradient descent may follow unnecessarily long or indirect paths
in weight space, increasing training time.
• Proper weight initialization can place the model in a well-behaved region of the loss surface,
improving convergence and stability.
Solution:
• Use smart strategies (like Xavier/He) for better weight initialization.
• Use of advanced optimizers to adjust learning rates automatically for
each parameter, handling poor curvature better.
• Use batch normalization to speed up training and improve gradient
flow.
Theoretical Limits of Optimization
• Some neural network problems are provably hard.
• Theoretical limits mostly apply to networks with discrete outputs.
• Even if a small network has no solution, a larger network with more parameters often can
solve the same task.
• Exact minimum not needed.
• It is very difficult to mathematically prove how well an optimizer will work in practice.
Theoretical limits often do not affect practical deep learning because,
• Most deep learning models use continuous outputs.
• It is often acceptable to find a sufficiently low loss rather than the exact minimum, as
long as it ensures good accuracy and low generalization error.
• We use larger models with more parameters.
Basic Algorithms
Stochastic Gradient Descent
(SGD)
It is an iterative optimization algorithm which updates the model’s
weights using a small random subset of the training data instead of the
entire dataset.
• It is faster and more memory-efficient than full-batch gradient
descent.
• A proper learning rate schedule is critical for stable convergence.
• High learning rate speeds up training but risks divergence.
• Learning rate decay helps avoid overshooting and improves
convergence.
• Convergence is slow theoretically, but SGD is effective in early training
stages.
• SGD generalizes well when updates are noisy and gradual.
• Mini-batch size and learning rate should be tuned based on training
curves.
1. Initialize weights (θ) and set a learning rate (ε)
2. Repeat until convergence:
• Sample a mini-batch of training examples.
• Compute the gradient of the loss for that batch.
• Update the weights
3. Stop when a stopping criterion is met (like reaching a certain number of epochs or a
low enough loss)
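The steps above can be sketched on a toy least-squares problem; the data, learning rate, batch size, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                        # noiseless toy targets

theta = np.zeros(3)                       # 1. initialize weights
eps = 0.1                                 #    and set a learning rate
for step in range(500):                   # 2. repeat until stopping criterion
    idx = rng.choice(len(X), size=16)     #    sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient of batch loss
    theta -= eps * grad                   #    update the weights

print(np.round(theta, 2))                 # should be close to true_theta
```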
Momentum
• Momentum is a technique used with gradient descent to make
learning faster and more stable by accumulating past gradients.
• It helps the optimizer move in consistent directions and avoid getting
stuck or slowed down in flat or zigzag regions of the loss surface.
Update rule:
Velocity: v ← αv − ϵg
Weights: θ ← θ + v
Where,
• θ = Model weights
• ϵ = Learning rate
• v = Velocity (momentum term)
• α = Momentum coefficient (usually 0.9 or 0.99)
• g = Gradient
1. Initialize weights θ, velocity v=0, learning rate ϵ, and momentum factor α
2. Repeat until stopping condition is met:
• Sample a mini-batch of training data.
• Compute the gradient g of the loss using the mini-batch.
• Update velocity and weights
3. End
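A minimal sketch of the momentum update v ← αv − ϵg, θ ← θ + v, using the one-dimensional function f(θ) = θ² (gradient 2θ) as an illustrative objective:

```python
theta, v = 5.0, 0.0        # weights and velocity
eps, alpha = 0.1, 0.9      # learning rate and momentum coefficient

for _ in range(100):
    g = 2 * theta          # gradient of f(theta) = theta^2
    v = alpha * v - eps * g  # accumulate past gradients in the velocity
    theta = theta + v        # move the weights along the velocity

print(round(theta, 4))     # approaches the minimum at 0
```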
Nesterov Momentum
• It is a variant of the momentum algorithm that was inspired by
Nesterov’s accelerated gradient method.
• It is an optimization technique that improves upon classical
momentum by calculating the gradient not at the current parameters
but at the anticipated future position of the parameters.
• This helps in more accurate and stable updates.
Update Rule:
Velocity: v ← αv − ϵ∇θ J(θ + αv)
Weights: θ ← θ + v
1. Initialize θ, velocity v, learning rate ϵ, and momentum factor α
2. Repeat until stopping condition is met:
• Predict future position.
• Compute gradient.
• Update velocity and weights
3. End
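The same toy objective f(θ) = θ² can illustrate Nesterov momentum; the only change from classical momentum is that the gradient is evaluated at the look-ahead point θ + αv:

```python
def grad(theta):               # gradient of f(theta) = theta^2
    return 2 * theta

theta, v = 5.0, 0.0
eps, alpha = 0.1, 0.9

for _ in range(100):
    g = grad(theta + alpha * v)  # gradient at the anticipated future position
    v = alpha * v - eps * g
    theta = theta + v

print(round(theta, 4))           # converges toward the minimum at 0
```

On this toy problem the look-ahead gradient acts as a correction that damps overshooting, which is why the updates are more stable than classical momentum.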
Parameter Initialization
Strategies
Deep learning models need starting values for weights before training
begins.
Poor initialization can cause:
• Slow learning or no convergence (algorithm fails).
• Vanishing/exploding gradients in deep networks.
• Poor generalization to unseen data.
Many initialization methods are heuristic-based (trial and error).
Common Initialization Strategies:
• Random Initialization
o Breaks symmetry among neurons.
• Normalized Initialization (Glorot/Xavier)
o Adjusts weight values to keep activation variance consistent across layers.
• He Initialization
o Similar to Xavier but scaled for ReLU activations.
• Sparse and orthogonal initialization help in deep and recurrent
networks.
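The Xavier/Glorot and He strategies can be sketched for a layer with fan_in inputs and fan_out outputs; the normal-distribution variants and the layer sizes below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # variance 2 / (fan_in + fan_out): keeps activation variance
    # roughly constant across layers (suited to tanh/sigmoid)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # variance 2 / fan_in: rescaled for ReLU, which zeroes half the units
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(256, 128)
print(W.shape, round(W.std(), 3))  # empirical std close to sqrt(2/256)
```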
Algorithms with Adaptive
Learning Rates
• Learning rate is one of the most important and sensitive
hyperparameters in deep learning.
• Fixed learning rates may not work well for all parameters.
• Adaptive methods adjust learning rates automatically for each
parameter during training.
Adaptive Algorithms:
• AdaGrad
• RMSProp
• Adam
AdaGrad
• It adjusts learning rates individually for each parameter.
• Parameters with large updates get smaller learning rates over time.
• Parameters with small updates retain larger learning rates.
• It accumulates squared gradients for each parameter.
• Learning rate shrinks proportionally to the accumulated gradients.
• It is good for sparse data.
• A major drawback of AdaGrad is that the learning rate keeps shrinking
during training, which may cause the model to stop learning too early.
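A sketch of the AdaGrad update on a toy quadratic objective; the learning rate and step count are illustrative:

```python
import numpy as np

theta = np.array([5.0, 5.0])
r = np.zeros_like(theta)           # accumulated squared gradients
eps, delta = 0.5, 1e-8             # learning rate and numerical stabilizer

for _ in range(200):
    g = 2 * theta                  # gradient of sum(theta^2)
    r += g * g                     # accumulation only ever grows ...
    theta -= eps * g / (np.sqrt(r) + delta)  # ... so the step keeps shrinking

print(np.round(theta, 3))
```

The running sum r never decreases, which is exactly the drawback noted above: the effective learning rate ϵ/√r can shrink so much that learning stops too early.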
RMSProp
• Improves AdaGrad by avoiding the learning rate shrinking too much.
• It is robust and widely used for training deep networks.
• Uses an exponentially decaying average of squared gradients.
• It works well for non-stationary and deep networks.
• It maintains a balance between fast and slow updates.
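Compared with AdaGrad, RMSProp swaps the running sum for an exponentially decaying average; the hyperparameter values below are illustrative:

```python
import numpy as np

theta = np.array([5.0, 5.0])
r = np.zeros_like(theta)
eps, rho, delta = 0.1, 0.9, 1e-8   # learning rate, decay rate, stabilizer

for _ in range(200):
    g = 2 * theta
    r = rho * r + (1 - rho) * g * g  # decaying average, not a running sum,
                                     # so old gradients are forgotten
    theta -= eps * g / (np.sqrt(r) + delta)

print(np.round(theta, 3))
```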
Adam
• It combines the strengths of RMSProp and momentum.
• Maintains two moving averages:
• First moment estimate (mean of gradients)
• Second moment estimate (variance of gradients)
• Includes bias correction for both estimates to address initialization
issues.
• Adam is better than RMSProp because it applies bias correction to
moment estimates at the start of training, which helps stabilize
learning, especially in the early iterations.
• Works well without much tuning.
• Handles sparse gradients and noisy data effectively.
• Works well for a wide range of deep learning problems.
• A drawback of Adam is that it may sometimes require adjusting the
default learning rate for optimal performance.
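A sketch of the Adam update with both bias-corrected moment estimates, again on a toy quadratic with illustrative hyperparameters:

```python
import numpy as np

theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)   # first moment estimate (mean of gradients)
v = np.zeros_like(theta)   # second moment estimate (uncentered variance)
eps, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = 2 * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction: m, v start at zero,
    v_hat = v / (1 - beta2 ** t)   # so early estimates are biased low
    theta -= eps * m_hat / (np.sqrt(v_hat) + delta)

print(np.round(theta, 3))
```

The bias-correction terms matter most for small t, which is exactly the early-iteration stabilization described above.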
Convolutional Networks
Contents
• Introduction
• The Convolution Operation
• Motivation
• Pooling
• Convolution and Pooling as an Infinitely Strong Prior
• Variants of the Basic Convolution Function
• Structured Outputs
• Data Types
• Efficient Convolution Algorithms
• Random or Unsupervised Features
Introduction
• Convolutional networks are also known as convolutional neural networks
(CNNs).
• Specialized for data having a grid-like topology
• 1D grid – time series data
• 2D grid – image data
• Definition
• Convolutional Networks use convolution in place of general matrix multiplication
in at least one layer.
• Neural network convolution does not correspond exactly to the convolution used in
engineering and mathematics; most deep learning libraries actually implement
cross-correlation, i.e., convolution without flipping the kernel.
1. The Convolution Operation
What is Convolution ?
• Convolution is a mathematical operation used to combine two functions:
-One function is typically the signal (e.g., a laser sensor output).
-The other is a kernel or filter (e.g., a smoothing or averaging function).
General convolution example:
• Scenario:
• You are trying to detect a spaceship using a laser sensor that returns a signal (a noisy
one), and you want to smooth this signal to clearly identify the spaceship's location
The Convolution Operation
• CNN convolutions(not general convolution)
• First function is network input x, second is kernel w
• Tensors refer to the multidimensional arrays
• E.g., input data and parameter arrays, thus TensorFlow
• The convolution kernel is usually a sparse matrix in contrast to the usual fully-
connected weight matrix
2D Convolution
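A minimal sketch of a valid 2D cross-correlation, the operation CNNs actually use (no kernel flipping); the input and kernel are toy values:

```python
import numpy as np

def conv2d(x, k):
    """Valid cross-correlation: slide kernel k over input x,
    summing the elementwise products at each position."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])  # a simple diagonal-difference kernel
print(conv2d(x, k))  # 3x3 output; every entry is x[i,j] - x[i+1,j+1] = -5 here
```

Note that the 4×4 input shrinks to a 3×3 output; the zero padding discussed later avoids this layer-to-layer shrinking.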
Motivation
• Convolution leverages three important ideas that help improve machine learning
systems.
• Three advantages of convolution
• Sparse Interactions
• Parameter sharing
• Equivariant representations
Motivation
Sparse Interactions
• Fully connected traditional networks
• m inputs in a layer and n outputs in next layer
• requires O(m x n) runtime (per example)
• Sparse interactions
• Also called sparse connectivity or sparse weights
• Accomplished by making kernel smaller than input
• k << m requires O(k x n) runtime (per example)
• k is typically several orders of magnitude smaller than m
Motivation
Sparse Connectivity
Sparse connections due to a small
convolution kernel
Dense Connections
Fully Connected
Growing Receptive Fields
Deeper layer units have larger receptive fields
Parameter Sharing
• In traditional neural networks
• Each element of the weight matrix is unique
• Parameter sharing means using the same parameters for more than one function in a model
• The network has tied weights
• Reduces storage requirements to k parameters
• Does not affect forward prop runtime O(k x n)
Parameter Sharing
Equivariant Representation
• For an equivariant function, if the input changes, the output changes in the same way
• For convolution, a particular form of parameter sharing causes equivariance to translation
• For example, as the dog moves in the input image, the detected edges move in the same
way
• In image processing, detecting edges is useful in the first layer, and edges appear more or
less everywhere in the image
Pooling
• The pooling function replaces the output of the net at a certain location with a summary
statistic of the nearby outputs
• Max pooling reports the maximum output within a rectangular neighborhood
• Average pooling reports the average output
• Pooling helps make the representation approximately invariant to small input translations
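Max pooling over non-overlapping 2×2 neighborhoods can be sketched as follows; the feature-map values are arbitrary:

```python
import numpy as np

def max_pool_2x2(x):
    """Replace each non-overlapping 2x2 neighborhood with its maximum."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()
    return out

x = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(max_pool_2x2(x))  # [[6. 8.] [9. 7.]]
```

Because only the maximum of each neighborhood survives, shifting the input by one pixel often leaves most pooled outputs unchanged, which is the approximate translation invariance described above.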
Convolutional Network Components
Max Pooling and Invariance to Translation
Cross-Channel Pooling and Invariance to Learned Transformations
Pooling with Downsampling
Convolution and Pooling as an Infinitely Strong Prior
• Convolution and pooling can cause underfitting
• The prior is useful only when the assumptions made by the prior are reasonably accurate
• If a task relies on preserving precise spatial information, then pooling can increase training
error
• The prior imposed by convolution must be appropriate
Variants of the Basic Convolution Function
• Stride is the amount of downsampling
• Can have separate strides in different directions
• Zero padding avoids layer-to-layer shrinking
• Unshared convolution
• Like convolution but without sharing
• Partial connectivity between channels
• Tiled convolution
• Cycle between shared parameter groups
Zero Padding Controls Size
Kinds of Connectivity
Partial Connectivity Between Channels
Tiled Convolution
Structured Outputs
• Convolutional networks are usually used for classification
• They can also be used to output a high-dimensional, structured object
• The object is typically a tensor
Data Types
• Single channel examples:
• 1D audio waveform
• 2D audio data after Fourier transform
• Frequency versus time
• Multi-channel examples:
• 2D color image data
• Three channels: red pixels, green pixels, blue pixels
• Each channel is 2D for the image
Efficient Convolution Algorithms
• Devising faster ways of performing convolution or approximate convolution without harming
the accuracy of the model is an active area of research
• However, most dissertation work concerns feasibility and not efficiency
Random or Unsupervised Features
• One way to reduce the cost of convolutional network training is to use features that are not
trained in a supervised fashion
• Three methods (Rosenblatt used first two)
• Simply initialize convolutional kernels randomly
• Design them by hand
• Learn the kernels using unsupervised methods