Introduction to Deep Learning
Introduction
• Machine Learning enables systems (shallow or deep) to learn from
data and improve with experience, similar to humans. Raw data is
processed to extract useful information for decision-making.
• Types of Learning Approaches
1) Supervised
2) Unsupervised
3) Semi-Supervised Learning
4) Reinforcement Learning
Shallow Learning
• Shallow learning refers to machine learning models that use a
relatively simple architecture, usually consisting of only one or two
layers of feature transformation before making predictions.
• Input -> Feature Engineering -> Model -> Output.
• Deep Learning -> multi-layered models + automatic feature learning -> good for
complex, unstructured data.
• Difference: Shallow vs. Deep Learning
• Shallow Learning:
• Uses only 1–2 layers.
• Works for simple tasks.
• Example: Linear Regression, Decision Trees, SVMs.
• Deep Learning:
• Uses tens or hundreds of layers.
• Learns complex features automatically.
• Examples: image classification, speech recognition, large language models such as ChatGPT.
Deep Learning
• Deep Learning is a subfield of Machine Learning that uses
neural networks with multiple hidden layers
• Learns hierarchical representations of data
• Replaces manual feature engineering with
automatic learning
• Layer 1 (low-level): detects edges, gradients, textures.
• Layer 2 (mid-level): combines edges into motifs like corners, contours.
• Layer 3+ (high-level): captures object parts, semantics (face, car,
digit).
• Similar to human brain’s visual cortex organization.
Applications
• Computer Vision: face recognition, object detection, medical imaging.
• Natural Language Processing: translation, sentiment analysis,
chatbots.
• Speech Processing: voice recognition, speaker identification.
• Autonomous Systems: self-driving cars, robotics control.
Why Use Deep Learning?
•Automatic feature extraction → Learns useful representations directly from raw data.
•High representational power → Captures complex nonlinear relationships.
•Scales effectively → Performs better with large datasets and powerful compute resources.
•Transfer learning advantage → Reuse pre-trained models across tasks, reducing effort.
•Domain adaptability → Works across multiple applications with minimal customization.
•Superior performance → Achieves state-of-the-art results in vision, speech, and NLP.
•Outperforms traditional ML → Higher accuracy, better generalization, less manual engineering.
1.5 How Deep Learning Works
• Deep networks map input to target via a sequence of layered transformations, and these
layered transformations are learned by exposure to the training examples.
• Transformations implemented by a layer are parameterized by its weights.
• Learning can be defined as the process of finding the values of the weights of all layers in the
network in such a manner that input examples can be correctly mapped to their associated
targets.
• A deep learning network contains thousands of parameters, and finding the right values of these
parameters is not an easy task, particularly when the value of one parameter has an impact on
the value of another parameter.
• In order to train a deep network, one needs to measure how far the calculated output of the
network is from the desired value. This measure is obtained using a loss function, also called
an objective function.
• The objective of the training is to find the values for the weights that minimize the chosen error
function.
• The difference obtained is then used as a feedback signal to adjust the weights of the
network in a way that lowers the loss score for the current example. This adjustment
is done by the optimizer via the backpropagation algorithm, the central algorithm in deep
learning.
• The backpropagation algorithm involves assigning random values to the weight vectors
initially, so that the network just implements a series of random transformations.
Initially, the output obtained from the network can be far from what it should be, and
accordingly the loss score may be very high.
• With every example that is fed to the network, the weights are adjusted in a
direction that makes the loss score decrease.
• This process is repeated a number of times, until the weight values that minimize the
loss function are obtained. A network is said to have learned when the output values
obtained from the network are as close as they can be to the target values.
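The training loop described above can be sketched in a few lines of Python. This is a minimal illustration with a single-weight linear model and squared-error loss (the function names and data are invented for the example), not a full deep network:

```python
# Minimal illustration of the training loop described above:
# forward pass -> loss score -> gradient (feedback signal) -> weight update.
# One weight and a squared-error loss; a real deep net does this per layer.

def train(examples, lr=0.1, epochs=50):
    w = 0.5  # initial weight: the network starts as a random transformation
    for _ in range(epochs):
        for x, target in examples:
            pred = w * x                    # forward pass
            grad = 2 * (pred - target) * x  # gradient of (pred - target)**2 w.r.t. w
            w -= lr * grad                  # adjust the weight to lower the loss
    return w

# The data follows target = 3 * x, so the learned weight should approach 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train(data)
print(round(w, 3))  # prints 3.0
```

Each pass lowers the loss on the current example; repeating over many examples and epochs drives the weight toward the value that minimizes the loss, which is the sense in which the network "learns".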
1.6 Challenges in Deep Learning
• Vanishing/exploding gradients → solved with ReLU, residuals, batch
norm.
• Requires large datasets and high compute (GPUs/TPUs).
• Overfitting risks due to high parameter count.
• Optimization is difficult (complex loss surfaces).
• Interpretability: Deep models are often black-boxes.
• Deployment concerns: latency, memory, robustness.
Optimization for Training Deep Models
• 1. Optimization in Deep Learning
• Optimization means finding the best values for parameters of a model (like
weights of a neural network) so that the model performs well on a given task.
• Example: In PCA (Principal Component Analysis), optimization is used to find
directions (principal components) that maximize variance.
• In neural networks, optimization is about minimizing a cost function (also
called loss function) like cross-entropy or mean squared error.
• But compared to classical methods (like PCA), optimization in deep learning is
much harder and requires special techniques.
• This chapter focuses on one particular case of optimization: finding the
parameters θ of a neural network that significantly reduce a cost function
J(θ), which typically includes a performance measure evaluated on the entire
training set as well as additional regularization terms.
• Why Optimization for Deep Models is Difficult
• Training deep neural networks is not just a math exercise; it is computationally
very expensive and comes with challenges:
• High cost: Training can take days or months on clusters of GPUs/TPUs.
• Non-convex cost function: Unlike simple convex problems (like linear regression),
neural networks have many local minima and saddle points.
• Large parameter space: Modern networks have millions or billions of parameters.
• Vanishing/Exploding gradients: Gradients can become too small or too large,
making training unstable.
• Overfitting: Optimization must balance between fitting training data and
generalizing (hence regularization).
How Learning Differs from Pure Optimization
•Pure Optimization:
•Goal = minimize a function J(θ) directly, with no other concerns.
•Example: Finding the minimum of a quadratic function.
•Machine Learning Optimization:
•Goal = improve performance P (e.g., test accuracy, F1 score).
•But we can’t optimize P directly because it’s usually:
•Defined on the test set (which we don’t use during training).
•Sometimes intractable (too complex to compute exactly).
•So instead, we optimize a proxy cost function J(θ) (the training loss)
with the hope that reducing J(θ) also improves P.
• The Cost Function in Machine Learning
• The standard cost function used in training is written as:
• J(θ) = E_(x,y)∼p̂_data [L(f(x; θ), y)]
• where:
• θ= model parameters (weights, biases).
• f(x;θ) = model’s prediction for input x.
• y = true label (in supervised learning).
• L(⋅) = loss function (per-example error, e.g., cross-entropy, MSE).
• p̂_data = empirical distribution (the training dataset).
• So, training minimizes average loss over the training set.
• Ideal Case: True Data Distribution
• In reality, what we would like to minimize is:
• J*(θ) = E_(x,y)∼p_data [L(f(x; θ), y)]
• where:
• p_data = true data distribution (all possible examples, not just the training set).
• But we only have access to the finite training set, so we use the
approximation J(θ).
• This is why generalization matters: minimizing training loss does not
guarantee test performance.
Empirical Risk Minimization
• The Goal of Machine Learning
• The ultimate goal is to minimize the expected generalization risk, i.e.:
• R(θ) = E_(x,y)∼p_data [L(f(x; θ), y)], where:
• f(x; θ) = model’s prediction.
• L(⋅) = loss function (error per sample).
• p_data = true underlying distribution of the data.
• Problem: we don’t know p_data. We only have a finite dataset.
• From True Risk to Empirical Risk
• Since we can’t compute the expectation over the unknown p_data, we
approximate it with the empirical distribution p̂_data, based on the training
samples.
• This gives the empirical risk:
• R̂(θ) = (1/m) Σᵢ L(f(x^(i); θ), y^(i)), where:
• m = number of training examples.
• (x^(i), y^(i)) = training samples.
• This is simply the average training loss.
• So ERM = minimize training error, hoping it also reduces test error.
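The empirical risk defined above is just the average per-example loss over the training set. A tiny sketch (the toy model and data are invented for illustration):

```python
# Empirical risk R̂(θ) = (1/m) Σ L(f(x^(i); θ), y^(i)): the average
# per-example loss over the training set. Squared error plays the role of L.

def empirical_risk(predict, loss, data):
    return sum(loss(predict(x), y) for x, y in data) / len(data)

model = lambda x: 2 * x                    # f(x; θ) with θ fixed at 2
sq_loss = lambda pred, y: (pred - y) ** 2  # L

data = [(1, 2), (2, 5), (3, 6)]            # the middle point is "noisy"
r = empirical_risk(model, sq_loss, data)
print(r)  # (0 + 1 + 0) / 3
```

Minimizing this quantity over θ is exactly empirical risk minimization; the hope is that the minimizer also does well under the true distribution p_data.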
• Problems with ERM
• Even though ERM looks simple, in deep learning it’s not ideal:
• (a) Overfitting
• High-capacity models (deep nets) can just memorize the training set.
• Minimizing empirical risk too aggressively → poor generalization.
• (b) Non-differentiable loss functions
• Many useful loss functions (e.g., 0-1 loss: counts misclassifications) are not differentiable.
• Gradient descent requires smooth loss functions.
• 0-1 loss → derivative = 0 or undefined → not usable.
• (c) Gap between training and true objective
• What we really want to minimize = true risk over p_data.
• What we minimize = empirical risk over finite training samples.
• These can diverge, especially with limited data.
Surrogate Loss Functions and Early Stopping
• Definition
• A surrogate loss function is an alternative loss function that is easier to optimize
than the one we actually care about.
• It serves as a proxy (replacement) for the true loss.
• It is chosen because it is differentiable, smooth, and works well with gradient-
based methods.
• Example:
• The true goal in classification: minimize 0–1 loss (just count how many
predictions are wrong).
• Problem: 0–1 loss is non-differentiable and computationally intractable.
• Solution: Use cross-entropy (negative log-likelihood) as a surrogate.
• Why do we use surrogates?
• 0–1 loss difficulty:
1. 0–1 loss is a step function → gradient is 0 or undefined → gradient descent
can’t work.
• Surrogate advantages:
1. Smooth, differentiable → works well with gradient-based optimizers.
2. Often provides more useful information (like confidence of prediction).
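The contrast between the 0–1 loss and its cross-entropy surrogate can be seen numerically; a toy sketch for a binary prediction p = P(y = 1):

```python
import math

# 0–1 loss vs its surrogate (cross-entropy) for a binary prediction
# p = P(y = 1). The 0–1 loss is flat almost everywhere, so it gives
# gradient descent no signal; cross-entropy is smooth and still
# distinguishes barely-correct from confidently-correct answers.

def zero_one_loss(p, y):
    return 0.0 if (p >= 0.5) == (y == 1) else 1.0

def cross_entropy(p, y):
    return -math.log(p) if y == 1 else -math.log(1 - p)

for p in (0.6, 0.9, 0.99):
    print(zero_one_loss(p, 1), round(cross_entropy(p, 1), 3))
# The 0–1 loss is 0.0 for all three predictions, while cross-entropy
# keeps decreasing as the correct prediction becomes more confident.
```

This extra information (the "confidence" gradient) is exactly why the surrogate works with gradient-based optimizers while the 0–1 loss does not.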
• Early Stopping
We stop training before the loss reaches its minimum (typically by monitoring a validation set) in order to avoid overfitting.
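A minimal sketch of an early-stopping rule; the patience-based criterion and loss values here are illustrative assumptions, not prescribed by the text:

```python
# Early-stopping sketch: stop when the validation loss has not improved
# for `patience` consecutive epochs, instead of driving the training
# loss all the way down. The loss values below are invented for the demo.

def early_stop(val_losses, patience=2):
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch  # epoch whose weights we would keep

# Validation loss falls, then rises again as overfitting sets in.
print(early_stop([1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]))  # prints 3
```

The returned epoch is where validation performance peaked; weights from that point generalize better than the fully-trained ones.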
• We want to compare pure optimization and learning optimization.
1) Pure Optimization Case:
Minimize a function J(θ) directly: find the value of θ that minimizes J(θ).
• 2) Learning Optimization Case:
Suppose the true relationship between input and output is y = f(x),
but we only observe noisy data: y = f(x) + ε.
Batch and Minibatch Algorithms
• One aspect of machine learning algorithms that separates them from
general optimization algorithms is that the objective function usually
decomposes as a sum over the training examples.
• In machine learning, the objective function (like likelihood or loss)
usually splits into a sum over all training examples.
• Maximum Likelihood Estimation(MLE) :
• In most machine learning problems, the objective function (cost or
likelihood) is the sum of contributions from each training example.
• Formula: θ_ML = argmax_θ Σᵢ log p_model(x^(i), y^(i); θ)
• Expectation Form of the Objective:
• Instead of summing explicitly, we view it as an expectation under the
empirical data distribution
• J(θ) = E_(x,y)∼p̂_data [log p_model(x, y; θ)]
• Gradient of the Objective:
• Optimization needs the gradient of the objective
• ∇_θ J(θ) = E_(x,y)∼p̂_data [∇_θ log p_model(x, y; θ)]
• Statistical Error of Sampling:
• The standard error (uncertainty) of estimating the gradient from n samples is: SE = σ/√n
• σ = true standard deviation.
• n = number of samples.
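The SE = σ/√n relationship can be checked empirically; a small simulation (parameters chosen only for the demo) showing why larger batches give diminishing returns in gradient accuracy:

```python
import random
import statistics

# SE = σ/√n in action: the spread of a sample-mean estimate shrinks
# only like 1/√n, so multiplying the sample size by 100 divides the
# error by just 10 (the diminishing returns of larger batches).

random.seed(0)
SIGMA = 1.0  # true standard deviation of the underlying distribution

def mean_estimate(n):
    return statistics.fmean(random.gauss(0.0, SIGMA) for _ in range(n))

spreads = {}
for n in (100, 10_000):
    # empirical spread of the estimator over 200 repeated draws
    spreads[n] = statistics.stdev(mean_estimate(n) for _ in range(200))
    print(n, round(spreads[n], 4))  # roughly SIGMA / sqrt(n)
```

With 100 samples the spread is about 0.1; with 10,000 samples it is about 0.01: a 100x increase in compute for a 10x gain in accuracy.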
• Redundancy in Dataset:
• If dataset has repeated or highly similar examples, computing gradient on all
is wasteful.
• Sampling avoids redundancy.
Types of Optimization Algorithms
1) Batch Gradient Descent (BGD):
Uses entire dataset for every update.
•Pros: Accurate gradient.
•Cons: Very slow for large datasets.
2) Stochastic Gradient Descent (SGD):
Uses one sample per update.
•Pros: Very fast.
•Cons: Gradient very noisy.
3) Minibatch Gradient Descent:
Uses small batch (e.g., 32–256).
•Pros: Balance between speed & stability.
•Cons: Still approximate, but works best in practice.
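The minibatch variant above can be sketched as follows; a toy one-weight model with invented data, not a production implementation:

```python
import random

# Minibatch gradient descent on a one-weight linear model: each update
# averages the gradient over a small random batch, trading off the
# accuracy of full-batch GD against the noise of one-sample SGD.

def minibatch_gd(data, batch_size=4, lr=0.05, steps=300, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # average gradient of the squared error over the minibatch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

# Data drawn from y = 3x plus small noise; w should end up near 3.
rng = random.Random(1)
data = [(x, 3 * x + rng.gauss(0, 0.1)) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
w_hat = minibatch_gd(data)
print(round(w_hat, 2))
```

Setting `batch_size=len(data)` recovers batch GD and `batch_size=1` recovers SGD, which makes the trade-off between the three methods a single tunable knob.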
Factors Affecting Minibatch Size
• Gradient accuracy: Larger batches reduce noise but with diminishing
returns.
• Hardware: Very small batches underutilize CPU/GPU.
• Memory: Batch size limited by GPU/TPU memory.
• GPU efficiency: Powers of 2 (32, 64, 128, 256) often best.
• Generalization effect:
• Small batches add noise → acts as regularization.
• Sometimes improves test accuracy.
• Too small (batch size = 1) requires very small learning rate → inefficient.
| Method | Batch Size | Pros | Cons | Use Case |
|---|---|---|---|---|
| Batch GD | All data | Accurate gradient, stable | Very slow, memory-heavy | Small datasets |
| SGD | 1 | Fast updates, strong regularization | Noisy gradients, unstable | Online learning, streaming data |
| Minibatch GD | 32–256 | Efficient on GPUs, balance of speed & accuracy | Gradient still approximate | Standard choice in deep learning |
Sensitivity of Algorithms
• First-order methods (e.g., SGD): only need the gradient g.
• Robust, work fine with small batches (~100).
• Second-order methods (e.g., Newton’s method): use the Hessian H.
• Update rule: θ ← θ − H⁻¹g.
• Require huge batches (~10,000), otherwise errors in the gradient
estimate are amplified by multiplication with H⁻¹.
Importance of Random Sampling
• If minibatches are not random, gradients may be biased.
• Example: If dataset grouped by patient → one minibatch may contain only one
patient’s data.
• Solution: Shuffle dataset before training.
Large datasets: Even shuffling once is enough; true randomness not always needed.
Parallelization
• Different minibatches can be computed in parallel (asynchronous/distributed
training).
• Crucial for very large-scale learning.
Minibatch SGD and Generalization Error
• In theory, SGD with minibatches follows the gradient of true generalization error
(how well model performs on unseen data).
• This is true only if no examples are repeated (like in online learning).
• But in practice, we reuse the dataset over multiple epochs → slight bias, but the reduction
in training error outweighs it.
Online Learning
• Special case: Data arrives as a stream (never repeats).
• Learner updates model in real-time (like humans learning continuously).
• 👉 Use case: Real-time applications (stock prices, user activity streams).
• Extremely Large Datasets
• When datasets are huge (billions of examples):
• Sometimes only one pass (or less) through data is possible.
• Overfitting not a problem → main issue is computation efficiency &
underfitting.
• From cost function → to expectation → to gradient → to computation
challenge → to sampling idea → to batch types → to minibatch
considerations → to randomness → to parallelization & generalization → to
very large datasets.
Challenges in Neural Network Optimization
1) Ill-Conditioning
• What it means:
In optimization, the Hessian matrix (matrix of 2nd derivatives) can be ill-conditioned –
meaning some directions have very steep curvature and some are very flat.
Think of a long, narrow valley: moving straight down is easy, but moving across is very slow.
• Effect on training:
• SGD may get “stuck”: even small steps make the cost increase.
• The learning rate must be very small, slowing down learning.
• Gradient norms (‖g‖²) don’t shrink much, while curvature term (gᵀHg) grows → unstable training.
• Key idea:
Even with strong gradients, training becomes slow because the curvature forces small steps.
• Why Newton’s method fails here:
Newton’s method works well in convex problems, but in neural nets (non-convex, huge
parameter space) it needs major modification before being useful.
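The "long, narrow valley" picture above can be made concrete with gradient descent on an ill-conditioned quadratic; the function and learning rates are illustrative choices:

```python
# Ill-conditioning demo: gradient descent on f(x, y) = 0.5*(x**2 + 100*y**2).
# The y-direction is 100x steeper, so the largest stable learning rate is
# capped by it (lr < 2/100), and the flat x-direction then crawls along:
# exactly the long, narrow valley described above.

def gd_steps_to_converge(lr, tol=1e-3, max_steps=100_000):
    x, y = 1.0, 1.0
    for step in range(max_steps):
        if abs(x) < tol and abs(y) < tol:
            return step
        x -= lr * x        # gradient of f w.r.t. x is x
        y -= lr * 100 * y  # gradient of f w.r.t. y is 100*y
    return max_steps  # never converged (diverged or too slow)

print(gd_steps_to_converge(0.01))  # hundreds of steps for the flat direction
print(gd_steps_to_converge(0.03))  # diverges: returns max_steps
```

The condition number of the Hessian (here 100) directly controls this slowdown, which is why preconditioning and adaptive optimizers help.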
• 8.2.2 Local Minima
• Convex case:
Any local minimum is also the global minimum → no problem.
• Neural networks (non-convex):
• Many local minima exist because of weight space symmetry:
• E.g., swapping two hidden units with their input/output weights doesn’t change the function → many equivalent
minima.
• Scaling symmetries in ReLU/Maxout also create infinite equivalent minima.
• Are local minima dangerous?
• Early belief: Yes, they trap optimization.
• Now: Not really – most local minima in large networks have low cost and are “good enough.”
• What matters is reaching a point with low training loss, not the true global minimum.
• Test:
If gradient norm doesn’t shrink to ~0, the problem is not due to local minima.
• 8.2.3 Plateaus, Saddle Points, and Flat Regions
• Saddle point = a point where gradient = 0, but:
• In some directions cost increases (like a hill).
• In other directions cost decreases (like a valley).
• Example: a horse saddle – up in one direction, down in another.
• High-dimensional fact:
• In low dimensions → local minima common.
• In high dimensions → saddle points much more common than local minima.
• The ratio grows exponentially with dimension.
• Implication:
• Optimization often slows near saddle points because gradients vanish.
• SGD can usually escape, but second-order Newton’s method may get stuck (since it looks for zero gradient points).
• Visuals:
• Training often shows “flat valleys” (plateaus) instead of deep pits.
• Much time is wasted crossing wide flat areas.
• Cliffs and Exploding Gradients
• What happens:
In deep/recurrent nets, multiplying many large weights creates very steep
surfaces (“cliffs”).
• Small movement in parameters → huge jump in cost.
• Gradient step can overshoot → undoing previous learning.
• Solution:
• Gradient clipping: Limit the maximum step size.
• Keeps learning stable near cliffs.
• Most common in:
Recurrent Neural Networks (RNNs) due to repeated multiplications across
many time steps.
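Gradient clipping by norm, as mentioned above, can be sketched in a few lines (the threshold value is an illustrative choice):

```python
import math

# Gradient clipping by norm: if ||g|| exceeds max_norm, rescale g so its
# norm equals max_norm. The direction of the step is preserved; only its
# size is capped, which prevents the overshoot near cliffs described above.

def clip_by_norm(grad, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

print([round(g, 3) for g in clip_by_norm([3.0, 4.0])])  # [0.6, 0.8]
print(clip_by_norm([0.1, 0.2]))                         # unchanged
```

Because only the magnitude is rescaled, the update still moves downhill in the same direction; it just cannot jump off the cliff.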
• 8.2.5 Long-Term Dependencies
• Problem:
In very deep nets or RNNs, gradients pass through long computational graphs.
• Equivalent to multiplying by Wᵗ (matrix to the power t).
• Eigenvalues > 1 → exploding gradients.
• Eigenvalues < 1 → vanishing gradients.
• Effect:
• Vanishing → network forgets long-term info (can’t learn dependencies across long sequences).
• Exploding → unstable updates, cost jumps.
• Examples:
• RNN trying to learn dependencies across 100 time steps.
• Only recent steps affect learning; distant steps vanish.
• Solutions (not in your section but practical):
• LSTM/GRU (special RNN architectures).
• Proper initialization.
• Gradient clipping.
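The Wᵗ effect above can be seen with a scalar stand-in for the eigenvalues of W; a toy demonstration, not an actual RNN:

```python
# Scalar analogue of multiplying by W^t in an unrolled RNN: repeatedly
# multiplying by the same weight makes the signal explode if the
# eigenvalue is above 1 and vanish if it is below 1.

def after_t_steps(w, t=100, x=1.0):
    for _ in range(t):
        x *= w
    return x

print(after_t_steps(1.1))  # ~1.4e4: exploding gradient
print(after_t_steps(0.9))  # ~2.7e-5: vanishing gradient
```

Even eigenvalues close to 1 diverge from each other exponentially in t, which is why gradients across 100 time steps either swamp or lose the contribution of distant inputs.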
• 8.2.6 Inexact Gradients
• Ideal theory: Optimization assumes exact gradient/Hessian.
• Reality in deep learning:
• Gradients are estimated using mini-batches → noisy, approximate.
• Some models (e.g., Boltzmann machines) have intractable gradients, so
approximations (like Contrastive Divergence) are used.
• Effect:
• Learning becomes noisy, not perfectly downhill.
• Still works if estimates are unbiased on average.
• Sometimes surrogate loss functions are used that are easier to optimize.
• 8.2.7 Poor Correspondence Between Local and Global Structure
• Main idea:
Even if you can move locally in the best direction, it may not lead toward the global solution.
• Example:
• Imagine being on the wrong side of a mountain.
• Gradient descent keeps pushing you downhill locally, but you can’t cross the mountain to reach the
global valley.
• You must go around, taking a very long trajectory.
• Observation:
• Neural networks often don’t converge to a critical point at all (gradient never shrinks fully).
• Training paths are often long arcs around obstacles.
• Research direction:
• Instead of new algorithms, much focus is on finding good initializations so that local descent naturally
leads to good solutions.
• 8.2.8 Theoretical Limits of Optimization
• What theory says:
• Some results prove neural net training is NP-hard (Judd, 1989; Blum & Rivest, 1992).
• “No free lunch theorem”: No optimization algorithm is best for all problems (Wolpert &
Macready, 1997).
• But in practice:
• These worst-case results don’t stop us from training useful networks.
• Why?
• We don’t need the exact minimum, only a low-enough value for good generalization.
• Larger networks often make finding a good solution easier (more parameter settings give acceptable results).
• Most practical problems are not worst-case.
• Key takeaway:
Theory gives pessimistic limits, but in real life, neural nets are trainable and effective.
| Challenge | Cause | Effect on Training | Typical Solution |
|---|---|---|---|
| Ill-conditioning | Hessian has very different curvatures | Slow learning, tiny steps | Preconditioning, adaptive optimizers |
| Local minima | Symmetries & non-convexity | Many minima exist, but usually low-cost | Large nets, good init, test gradient norm |
| Saddle points & plateaus | Zero-gradient regions in high-dim | Slowdown, stuck | SGD noise helps escape, avoid Newton |
| Cliffs & exploding grads | Multiplying large weights | Overshooting, unstable | Gradient clipping |
| Long-term dependencies | Repeated multiplication (RNNs) | Vanishing/exploding gradients | LSTM/GRU, clipping |
| Inexact gradients | Mini-batch estimates or intractable loss | Noisy updates | Larger batches, surrogate loss |
| Poor global vs local structure | Local steps don’t lead globally | Long, inefficient trajectories | Good initialization |
| Theoretical limits | NP-hardness, no free lunch | Exact minimization impossible | Approximate “good enough” solutions |