UNIT 3
CHAPTER 1
OPTIMIZATION FOR TRAINING DEEP MODELS
CONTENTS
• How Learning Differs from Pure Optimization
• Challenges in Neural Network Optimization
• Basic Algorithms
• Parameter Initialization Strategies
• Algorithms with Adaptive Learning Rates
In deep learning, optimization means finding the best values for the
model’s parameters (θ) that reduce the error [cost function J(θ)].
This error tells us how wrong the model's predictions are.
Training neural networks is hard because it is a complex optimization
problem.
Special algorithms and strategies are used to solve it efficiently and
accurately.
How Learning Differs from Pure Optimization
Optimization means finding the best value for something to get the best
result.
Learning (Machine Learning):
• Machine learning usually acts indirectly.
• In most machine learning scenarios, we care about some performance
measure P (like accuracy), that is defined with respect to the test set and
may also be intractable.
• We therefore optimize P only indirectly.
• We reduce a different cost function J(θ) in the hope that doing so will
improve P.
Pure Optimization:
The goal is to directly minimize a cost function J(θ).
Cost Function in Learning:
J(θ) = E_{(x,y)~p̂_data} [L(f(x; θ), y)]
Where,
• L is the per-example loss function (i.e., how wrong the prediction is)
• f (x; θ) is the predicted output when the input is x
• p̂_data is the empirical data distribution (i.e., the training data)
• In the supervised learning case, y is the target output
• θ is the set of model parameters that the training process adjusts to minimize error
The goal is to make the model’s prediction f(x;θ) as close as possible to the true y.
Ideal Cost Function:
Ideally, we would prefer to minimize the corresponding objective function
J*(θ) = E_{(x,y)~p_data} [L(f(x; θ), y)]
where the expectation is taken across the true data distribution p_data
rather than just over the finite training set.
Empirical Risk Minimization
• Risk is the expected error of a model on the true data distribution.
• The goal of a machine learning algorithm is to reduce the expected
generalization error which means make accurate predictions on unseen data.
• Since we don’t know the true data distribution in the real world, we train our
model using a sample of data (training set).
• A training process where we minimize the average training error to help the
model perform better on real-world data is known as empirical risk
minimization (ERM).
• However, empirical risk minimization is prone to overfitting and might not be
feasible.
E_{(x,y)~p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))
where m is the number of training examples.
Imagine a student (model) preparing for a final exam (real-world
data). The student practices with several sample question papers
(training data). His average mistake on these sample papers is his
empirical risk: he tries to minimize it, hoping it will also improve his
performance on the final exam.
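Empirical risk minimization can be sketched numerically. The linear model f(x; θ) = θx, the squared loss, and the toy dataset below are illustrative assumptions, not from any particular source:

```python
import numpy as np

def empirical_risk(theta, xs, ys):
    """Average per-example squared loss over the m training examples."""
    m = len(xs)
    losses = [(theta * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / m

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])  # generated by the "true" theta = 2

print(empirical_risk(2.0, xs, ys))  # a perfect fit gives empirical risk 0.0
print(empirical_risk(1.5, xs, ys))  # a worse theta gives a higher average loss
```

Training searches over θ for the value that minimizes this average, hoping the same θ also does well on unseen data.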
Surrogate Loss Functions
• Sometimes, the loss function we actually care about is not one that can be
optimized efficiently. (e.g., 0-1 loss)
• The 0-1 loss just checks whether the prediction is right or wrong (0 for correct,
1 for wrong). It is not smooth and does not give useful signals to update the
model.
• So we use a surrogate loss instead.
• In machine learning and deep learning, surrogate loss functions are
alternative loss functions that are easier to optimize than the original (often
non-differentiable or combinatorial) objective, but still closely related to it.
• A surrogate loss is a proxy that we optimize instead of the real loss.
For instance, Negative log-likelihood (NLL) is a common surrogate for
0-1 loss in classification problems.
• NLL helps in estimating the probability of each class. A model that
assigns the highest conditional probability to the correct class often
performs well on 0-1 classification tasks too.
• Surrogate loss continues learning after 0-1 loss reaches 0:
o Even if a model gets all training predictions right (0-1 loss = 0), minimizing the
surrogate (log-likelihood) can still improve confidence and robustness—
separating classes better for generalization.
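The point above can be illustrated with a small sketch comparing 0-1 loss with its NLL surrogate for binary classification; the probabilities and labels are made-up values:

```python
import numpy as np

def zero_one_loss(probs, labels):
    # 0-1 loss: fraction of thresholded predictions that are wrong
    preds = (probs >= 0.5).astype(int)
    return np.mean(preds != labels)

def nll_loss(probs, labels):
    # negative log-likelihood of the correct class (the surrogate)
    p_correct = np.where(labels == 1, probs, 1 - probs)
    return -np.mean(np.log(p_correct))

labels = np.array([1, 0, 1])
confident = np.array([0.99, 0.01, 0.95])  # all correct, high confidence
hesitant = np.array([0.60, 0.40, 0.55])   # all correct, low confidence

# Both classifiers already achieve 0-1 loss = 0 ...
print(zero_one_loss(confident, labels), zero_one_loss(hesitant, labels))
# ... but the surrogate still prefers the confident one, so minimizing
# it keeps pushing the classes apart even after 0-1 loss reaches 0.
print(nll_loss(confident, labels), nll_loss(hesitant, labels))
```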
Early Stopping
• Overfitting refers to a model that learns the training data too closely, including
noise and outliers, as a result of which the model performs well on the training
data but poorly on unseen test data.
• A practical technique where we stop training early to avoid overfitting is called
early stopping.
• Stops when the model stops improving on the validation set, even if the
training surrogate loss continues to decrease.
• Typically uses surrogate loss on validation set.
Batch and Minibatch Algorithms
• In ML optimization algorithms, the objective function is defined as the sum (or average)
over all training examples, and we can perform updates by randomly sampling a batch of
examples and taking the average over the examples in that batch.
• Suppose we take n independent samples from the distribution with mean μ and standard
deviation σ.
• The sample mean is a random variable: it varies depending on the n samples drawn.
• The standard error of the mean, SE = σ/√n, tells how much the sample mean is likely
to fluctuate around the true mean μ.
• For example :
- Taking 5 people gives a rough estimate.
- Taking 500 people gives a much better estimate.
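A short sketch of how the standard error σ/√n shrinks with sample size, which is why larger minibatches give more reliable gradient estimates (the distribution parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0  # illustrative population mean and std deviation

for n in (5, 500):
    se = sigma / np.sqrt(n)  # theoretical standard error of the mean
    sample_mean = rng.normal(mu, sigma, size=n).mean()
    print(f"n={n}: standard error {se:.3f}, one sample mean {sample_mean:.3f}")
```

Note that going from 5 to 500 samples costs 100 times more computation but reduces the standard error only by a factor of 10, which is part of why minibatches are preferred over full batches.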
3 types of sampling-based algorithms:
1. Batch gradient descent (BGD)
Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
2. Stochastic gradient descent (SGD)
Optimization algorithms that use only a single example at a time are sometimes called
stochastic method.
3. Mini-batch gradient descent (MBGD)
Most algorithms used for deep learning fall somewhere in between, using more than
one but less than all of the training examples. These were traditionally called
minibatch or minibatch stochastic methods and it is now common to simply call them
stochastic methods.
Challenges in Neural
Network Optimization
Optimization helps find the best weights for a neural network.
Goal: Minimize the error (loss function).
Challenge:
Neural networks often involve complex, high-dimensional, and non-
convex loss functions that contain multiple local minima, saddle points,
and flat regions, making it difficult for gradient-based methods to find
the optimal solution efficiently.
Ill Conditioning
• Ill-conditioning (of the Hessian matrix) means that small changes in the
parameters can cause disproportionately large changes in the gradient, so
even very small update steps can increase the cost.
• The ill-conditioning problem is generally believed to be present when
training neural networks using SGD (Stochastic Gradient Descent).
• Ill-conditioning makes the model get stuck or slow down.
• The network doesn’t improve much, even though we are updating
weights.
Local Minima
• In optimization problems, local minima are points where the function value is lower than nearby points,
but not necessarily the lowest possible (global minimum).
• For convex functions, any local minimum is also a global minimum. However, neural networks use non-
convex functions, which can have many local minima due to their complex structure and high-
dimensional space.
• In large neural networks, most local minima have almost the same low error.
• These local minima don’t affect performance since they are basically equivalent solutions.
• The real problem would be bad local minima—those that give high error.
Plateaus, Saddle Points and Other Flat Regions
Saddle point:
• It is a point on the loss surface where the gradient is zero, but the point is not a minimum or
maximum.
• These saddle points can trap optimization algorithms because the gradient becomes very
small, making progress slow.
A plateau is a flat region in the loss surface where the gradient is almost zero across a wide
area.
Flat regions are wide, flat areas where both the gradient and the curvature are zero or near-
zero. The value of the function is nearly constant. These regions may not even correspond to
a good solution.
Cliffs and Exploding Gradients
• In deep neural networks, especially recurrent neural networks (RNNs), we often multiply
many weights together across layers or time steps.
• When some of these weights are large, this repeated multiplication can result in very
steep regions in the loss surface called cliffs.
• When the optimizer computes the gradient in such a steep area, it
may suggest a very large update step, causing the model to “jump off
the cliff” and leading to exploding gradients: sudden, huge changes
in the model parameters that destabilize training.
• To handle this, we use a method called gradient clipping, where we
limit the size of the gradient update.
• This helps maintain stable and controlled learning, especially in
models handling long sequences like RNNs.
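Gradient clipping by norm can be sketched as follows; the threshold and gradient values are illustrative:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """If the gradient's norm exceeds max_norm, rescale it to max_norm.
    The direction is preserved; only the step size is limited."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])    # norm 50: a "cliff" gradient
print(clip_by_norm(g, 5.0))   # rescaled to norm 5 -> [3. 4.]

g_small = np.array([1.0, 1.0])
print(clip_by_norm(g_small, 5.0))  # already within the limit, unchanged
```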
Long-Term Dependencies
• Deep networks or RNNs apply operations repeatedly
• Reusing the same weights causes gradients to:
o Vanish (become too small)
o Explode (become too large)
• Makes learning unstable or very slow
• This problem makes it hard to learn dependencies over long sequences — hence the name
long-term dependency problem
Solution:
• Gradient clipping is one solution for exploding gradients.
• Deep feedforward networks avoid this better than RNNs because they don’t reuse the same
weights.
• Understanding this helps in designing better architectures like LSTMs that tackle this issue.
Inexact Gradients
• Most optimization algorithms in deep learning assume access to exact gradients. However, this
assumption is often not true in practice.
• Gradients are inexact because of:
o Computing gradients using mini-batch, leading to noisy estimates
o Intractable objectives (e.g., the log-likelihood in Boltzmann machines) that require
approximations
o Biased approximations in algorithms
Solution:
• Use robust optimization algorithms.
• Replace the original loss with a surrogate (easier) loss function.
Poor Correspondence Between Local and Global
Structure
• The direction of local gradient may not guide the optimizer toward the global minimum of
the loss function.
• Optimization can slow down or stall in flat regions or saddle points, where the gradient is
near zero but not at a minimum.
• When the loss surface has sharp curves in some directions and flatness in others, it leads to
inefficient updates and unstable training.
• Many loss functions in deep learning do not have a clear global minimum.
• Optimization algorithms like gradient descent may follow unnecessarily long or indirect paths
in weight space, increasing training time.
• Proper weight initialization can place the model in a well-behaved region of the loss surface,
improving convergence and stability.
Solution:
• Use smart strategies (like Xavier/He) for better weight initialization.
• Use of advanced optimizers to adjust learning rates automatically for
each parameter, handling poor curvature better.
• Use batch normalization to speed up training and improve gradient
flow.
Theoretical Limits of Optimization
• Some neural network problems are provably hard.
• Theoretical limits mostly apply to networks with discrete outputs.
• Even if a small network has no solution, a larger network with more parameters often can
solve the same task.
• Exact minimum not needed.
• It is very difficult to mathematically prove how well an optimizer will work in practice.
Theoretical limits often do not affect practical deep learning because,
• Most deep learning models use continuous outputs.
• It is often acceptable to find a sufficiently low loss rather than the exact minimum, as
long as it ensures good accuracy and low generalization error.
• We use larger models with more parameters.
Basic Algorithms
Stochastic Gradient Descent
(SGD)
It is an iterative optimization algorithm which updates the model’s
weights using a small random subset of the training data instead of the
entire dataset.
• It is faster and more memory-efficient than full-batch gradient
descent.
• A proper learning rate schedule is critical for stable convergence.
• High learning rate speeds up training but risks divergence.
• Learning rate decay helps avoid overshooting and improves
convergence.
• Convergence is slow theoretically, but SGD is effective in early training
stages.
• SGD generalizes well when updates are noisy and gradual.
• Mini-batch size and learning rate should be tuned based on training
curves.
1. Initialize weights (θ) and set a learning rate (ε)
2. Repeat until convergence:
• Sample a mini-batch of training examples.
• Compute the gradient of the loss for that batch.
• Update the weights
3. Stop when a stopping criterion is met (like reaching a certain number of epochs or a
low enough loss)
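The steps above can be sketched on a toy least-squares problem; the data, learning rate, batch size, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                        # noiseless toy targets

theta = np.zeros(3)                       # 1. initialize weights
eps = 0.1                                 #    and set a learning rate
for step in range(500):                   # 2. repeat until stopping criterion
    idx = rng.choice(len(X), size=16)     #    sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient of batch loss
    theta -= eps * grad                   #    update the weights

print(np.round(theta, 2))                 # should be close to true_theta
```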
Momentum
• Momentum is a technique used with gradient descent to make
learning faster and more stable by accumulating past gradients.
• It helps the optimizer move in consistent directions and avoid getting
stuck or slowed down in flat or zigzag regions of the loss surface.
Update rule:
Velocity: v ← αv − ϵg
Weights: θ ← θ + v
Where,
• θ = Model weights
• ϵ = Learning rate
• v = Velocity (momentum term)
• α = Momentum coefficient (usually 0.9 or 0.99)
• g = Gradient
1. Initialize weights θ, velocity v=0, learning rate ϵ, and momentum factor α
2. Repeat until stopping condition is met:
• Sample a mini-batch of training data.
• Compute the gradient g of the loss using the mini-batch.
• Update velocity and weights
3. End
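A minimal sketch of the momentum update v ← αv − ϵg, θ ← θ + v, using the one-dimensional function f(θ) = θ² (gradient 2θ) as an illustrative objective:

```python
theta, v = 5.0, 0.0        # weights and velocity
eps, alpha = 0.1, 0.9      # learning rate and momentum coefficient

for _ in range(100):
    g = 2 * theta          # gradient of f(theta) = theta^2
    v = alpha * v - eps * g  # accumulate past gradients in the velocity
    theta = theta + v        # move the weights along the velocity

print(round(theta, 4))     # approaches the minimum at 0
```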
Nesterov Momentum
• It is a variant of the momentum algorithm that was inspired by
Nesterov’s accelerated gradient method.
• It is an optimization technique that improves upon classical
momentum by calculating the gradient not at the current parameters
but at the anticipated future position of the parameters.
• This helps in more accurate and stable updates.
Update Rule:
Velocity: v ← αv − ϵ∇θ J(θ + αv)
Weights: θ ← θ + v
1. Initialize θ, velocity v, learning rate ϵ, and momentum factor α
2. Repeat until stopping condition is met:
• Predict future position.
• Compute gradient.
• Update velocity and weights
3. End
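The same toy objective f(θ) = θ² can illustrate Nesterov momentum; the only change from classical momentum is that the gradient is evaluated at the look-ahead point θ + αv:

```python
def grad(theta):               # gradient of f(theta) = theta^2
    return 2 * theta

theta, v = 5.0, 0.0
eps, alpha = 0.1, 0.9

for _ in range(100):
    g = grad(theta + alpha * v)  # gradient at the anticipated future position
    v = alpha * v - eps * g
    theta = theta + v

print(round(theta, 4))           # converges toward the minimum at 0
```

On this toy problem the look-ahead gradient acts as a correction that damps overshooting, which is why the updates are more stable than classical momentum.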
Parameter Initialization
Strategies
Deep learning models need starting values for weights before training
begins.
Poor initialization can cause:
• Slow learning or no convergence (algorithm fails).
• Vanishing/exploding gradients in deep networks.
• Poor generalization to unseen data.
Many initialization methods are heuristic-based (trial and error).
Common Initialization Strategies:
• Random Initialization
o Breaks symmetry among neurons.
• Normalized Initialization (Glorot/Xavier)
o Adjusts weight values to keep activation variance consistent across layers.
• He Initialization
o Similar to Xavier but scaled for ReLU activations.
• Sparse and orthogonal initialization help in deep and recurrent
networks.
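The Xavier/Glorot and He strategies can be sketched for a layer with fan_in inputs and fan_out outputs; the normal-distribution variants and the layer sizes below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # variance 2 / (fan_in + fan_out): keeps activation variance
    # roughly constant across layers (suited to tanh/sigmoid)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # variance 2 / fan_in: rescaled for ReLU, which zeroes half the units
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(256, 128)
print(W.shape, round(W.std(), 3))  # empirical std close to sqrt(2/256)
```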
Algorithms with Adaptive
Learning Rates
• Learning rate is one of the most important and sensitive
hyperparameters in deep learning.
• Fixed learning rates may not work well for all parameters.
• Adaptive methods adjust learning rates automatically for each
parameter during training.
Adaptive Algorithms:
• AdaGrad
• RMSProp
• Adam
AdaGrad
• It adjusts learning rates individually for each parameter.
• Parameters with large updates get smaller learning rates over time.
• Parameters with small updates retain larger learning rates.
• It accumulates squared gradients for each parameter.
• Learning rate shrinks proportionally to the accumulated gradients.
• It is good for sparse data.
• A major drawback of AdaGrad is that the learning rate keeps shrinking
during training, which may cause the model to stop learning too early.
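A sketch of the AdaGrad update on a toy quadratic objective; the learning rate and step count are illustrative:

```python
import numpy as np

theta = np.array([5.0, 5.0])
r = np.zeros_like(theta)           # accumulated squared gradients
eps, delta = 0.5, 1e-8             # learning rate and numerical stabilizer

for _ in range(200):
    g = 2 * theta                  # gradient of sum(theta^2)
    r += g * g                     # accumulation only ever grows ...
    theta -= eps * g / (np.sqrt(r) + delta)  # ... so the step keeps shrinking

print(np.round(theta, 3))
```

The running sum r never decreases, which is exactly the drawback noted above: the effective learning rate ϵ/√r can shrink so much that learning stops too early.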
RMSProp
• Improves AdaGrad by avoiding the learning rate shrinking too much.
• It is robust and widely used for training deep networks.
• Uses an exponentially decaying average of squared gradients.
• It works well for non-stationary and deep networks.
• It maintains a balance between fast and slow updates.
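Compared with AdaGrad, RMSProp swaps the running sum for an exponentially decaying average; the hyperparameter values below are illustrative:

```python
import numpy as np

theta = np.array([5.0, 5.0])
r = np.zeros_like(theta)
eps, rho, delta = 0.1, 0.9, 1e-8   # learning rate, decay rate, stabilizer

for _ in range(200):
    g = 2 * theta
    r = rho * r + (1 - rho) * g * g  # decaying average, not a running sum,
                                     # so old gradients are forgotten
    theta -= eps * g / (np.sqrt(r) + delta)

print(np.round(theta, 3))
```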
Adam
• It combines the strengths of RMSProp and momentum.
• Maintains two moving averages:
• First moment estimate (mean of gradients)
• Second moment estimate (variance of gradients)
• Includes bias correction for both estimates to address initialization
issues.
• Adam is better than RMSProp because it applies bias correction to
moment estimates at the start of training, which helps stabilize
learning, especially in the early iterations.
• Works well without much tuning.
• Handles sparse gradients and noisy data effectively.
• Works well for a wide range of deep learning problems.
• A drawback of Adam is that it may sometimes require adjusting the
default learning rate for optimal performance.
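A sketch of the Adam update with both bias-corrected moment estimates, again on a toy quadratic with illustrative hyperparameters:

```python
import numpy as np

theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)   # first moment estimate (mean of gradients)
v = np.zeros_like(theta)   # second moment estimate (uncentered variance)
eps, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = 2 * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction: m, v start at zero,
    v_hat = v / (1 - beta2 ** t)   # so early estimates are biased low
    theta -= eps * m_hat / (np.sqrt(v_hat) + delta)

print(np.round(theta, 3))
```

The bias-correction terms matter most for small t, which is exactly the early-iteration stabilization described above.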
Convolutional Networks
Contents
• Introduction
• The Convolution Operation
• Motivation
• Pooling
• Convolution and Pooling as an Infinitely Strong Prior
• Variants of the Basic Convolution Function
• Structured Outputs
• Data Types
• Efficient Convolution Algorithms
• Random or Unsupervised Features
Introduction
• Convolutional networks are also known as convolutional neural networks
(CNNs).
• Specialized for data having a grid-like topology
• 1D grid – time series data
• 2D grid – image data
• Definition
• Convolutional Networks use convolution in place of general matrix multiplication
in at least one layer.
• Neural network convolution does not correspond exactly to the convolution used in
engineering and mathematics; most deep learning libraries actually implement
cross-correlation, i.e., convolution without flipping the kernel.
1. The Convolution Operation
What is Convolution ?
• Convolution is a mathematical operation used to combine two functions:
-One function is typically the signal (e.g., a laser sensor output).
-The other is a kernel or filter (e.g., a smoothing or averaging function).
General convolution example:
• Scenario:
• You are trying to detect a spaceship using a laser sensor that returns a signal (a noisy
one), and you want to smooth this signal to clearly identify the spaceship's location
The Convolution Operation
• CNN convolutions(not general convolution)
• First function is network input x, second is kernel w
• Tensors refer to the multidimensional arrays
• E.g., input data and parameter arrays, thus TensorFlow
• The convolution kernel is usually a sparse matrix in contrast to the usual fully-
connected weight matrix
2D Convolution
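A minimal sketch of a valid 2D cross-correlation, the operation CNNs actually use (no kernel flipping); the input and kernel are toy values:

```python
import numpy as np

def conv2d(x, k):
    """Valid cross-correlation: slide kernel k over input x,
    summing the elementwise products at each position."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])  # a simple diagonal-difference kernel
print(conv2d(x, k))  # 3x3 output; every entry is x[i,j] - x[i+1,j+1] = -5 here
```

Note that the 4×4 input shrinks to a 3×3 output; the zero padding discussed later avoids this layer-to-layer shrinking.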
Motivation
• Convolution leverages three important ideas that help improve machine learning
systems.
• Three advantages of convolution
• Sparse Interactions
• Parameter sharing
• Equivariant representations
Motivation
Sparse Interactions
• Fully connected traditional networks
• m inputs in a layer and n outputs in next layer
• requires O(m x n) runtime (per example)
• Sparse interactions
• Also called sparse connectivity or sparse weights
• Accomplished by making kernel smaller than input
• k << m requires O(k x n) runtime (per example)
• k is typically several orders of magnitude smaller than m
Motivation
Sparse Connectivity
Sparse connections due to a small
convolution kernel
Dense Connections
Fully Connected
Growing Receptive Fields
Deeper layer units have larger receptive fields
Parameter Sharing
• In traditional neural networks
• Each element of the weight matrix is unique
• Parameter sharing means using the same parameters for more than one function in a model
• The network has tied weights
• Reduces storage requirements to k parameters
• Does not affect forward prop runtime O(k x n)
Parameter Sharing
Equivariant Representation
• For an equivariant function, if the input changes, the output changes in the same way
• For convolution, a particular form of parameter sharing causes equivariance to translation
• For example, as the dog moves in the input image, the detected edges move in the same
way
• In image processing, detecting edges is useful in the first layer, and edges appear more or
less everywhere in the image
Pooling
• The pooling function replaces the output of the net at a certain location with a summary
statistic of the nearby outputs
• Max pooling reports the maximum output within a rectangular neighborhood
• Average pooling reports the average output
• Pooling helps make the representation approximately invariant to small input translations
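Max pooling over non-overlapping 2×2 neighborhoods can be sketched as follows; the feature-map values are arbitrary:

```python
import numpy as np

def max_pool_2x2(x):
    """Replace each non-overlapping 2x2 neighborhood with its maximum."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()
    return out

x = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(max_pool_2x2(x))  # [[6. 8.] [9. 7.]]
```

Because only the maximum of each neighborhood survives, shifting the input by one pixel often leaves most pooled outputs unchanged, which is the approximate translation invariance described above.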
Convolutional Network Components
Max Pooling and Invariance to Translation
Cross-Channel Pooling and Invariance to Learned Transformations
Pooling with Downsampling
Convolution and Pooling as an Infinitely Strong Prior
• Convolution and pooling can cause underfitting
• The prior is useful only when the assumptions made by the prior are reasonably accurate
• If a task relies on preserving precise spatial information, then pooling can increase training
error
• The prior imposed by convolution must be appropriate
Variants of the Basic Convolution Function
• Stride is the amount of downsampling
• Can have separate strides in different directions
• Zero padding avoids layer-to-layer shrinking
• Unshared convolution
• Like convolution but without sharing
• Partial connectivity between channels
• Tiled convolution
• Cycle between shared parameter groups
Zero Padding Controls Size
Kinds of Connectivity
Partial Connectivity Between Channels
Tiled Convolution
Structured Outputs
• Convolutional networks are usually used for classification
• They can also be used to output a high-dimensional, structured object
• The object is typically a tensor
Data Types
• Single channel examples:
• 1D audio waveform
• 2D audio data after Fourier transform
• Frequency versus time
• Multi-channel examples:
• 2D color image data
• Three channels: red pixels, green pixels, blue pixels
• Each channel is 2D for the image
Efficient Convolution Algorithms
• Devising faster ways of performing convolution or approximate convolution without harming
the accuracy of the model is an active area of research
• However, most dissertation work concerns feasibility and not efficiency
Random or Unsupervised Features
• One way to reduce the cost of convolutional network training is to use features that are not
trained in a supervised fashion
• Three methods (Rosenblatt used first two)
• Simply initialize convolutional kernels randomly
• Design them by hand
• Learn the kernels using unsupervised methods