Weight Initialization
• Initializations are surprisingly important.
– Poor initializations can lead to bad convergence behavior.
– Instability across different layers (vanishing and exploding gradients).
• More sophisticated initializations, such as pretraining, are covered in a later lecture.
• Even simple rules for initialization can help with conditioning.
Symmetry Breaking
• Bad idea to initialize weights to the same value.
– Results in weights being updated in lockstep.
– Creates redundant features.
• Initializing weights to random values breaks symmetry.
• Average magnitude of the random variables is important for
stability.
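As a toy illustration of lockstep updates (a NumPy sketch; the setup and the assumed upstream gradient are illustrative, not from the lecture), two hidden units that start with identical weights see the same input, produce identical activations, and therefore receive identical weight updates:

```python
import numpy as np

# Two hidden units receive the same input and start with identical weights.
x = np.array([1.0, 2.0])
W = np.full((2, 2), 0.5)          # both rows (units) are identical
h = np.maximum(W @ x, 0.0)        # identical activations

# With identical activations (and, by symmetry, identical upstream
# gradients dL/dh), the weight updates for the two rows are identical:
# the units remain clones of each other forever.
dL_dh = np.ones(2)
dW = np.outer(dL_dh * (h > 0), x)
print(np.allclose(dW[0], dW[1]))  # the rows update in lockstep
```

Drawing the initial weights from a random distribution makes the two rows of dW differ, which is exactly what symmetry breaking means here.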
Sensitivity to Number of Inputs
• More inputs increase the output's sensitivity to the average weight.
– Additive effect of multiple inputs: variance increases linearly with the number of inputs r.
– Standard deviation scales with the square root of the number of inputs r.
• Each weight is initialized from a Gaussian distribution with standard deviation 1/√r (or √(2/r) for ReLU).
• More sophisticated: Use standard deviation √(2/(r_in + r_out)).
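These fan-in/fan-out rules can be sketched in a few lines of NumPy (the function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def init_weights(r_in, r_out, mode="xavier", rng=None):
    """Draw an (r_in, r_out) weight matrix from a zero-mean Gaussian.

    mode="fan_in":  std = sqrt(1 / r_in)         (basic rule)
    mode="he":      std = sqrt(2 / r_in)         (ReLU networks)
    mode="xavier":  std = sqrt(2 / (r_in + r_out))
    """
    rng = rng or np.random.default_rng(0)
    if mode == "fan_in":
        std = np.sqrt(1.0 / r_in)
    elif mode == "he":
        std = np.sqrt(2.0 / r_in)
    elif mode == "xavier":
        std = np.sqrt(2.0 / (r_in + r_out))
    else:
        raise ValueError(mode)
    return rng.normal(0.0, std, size=(r_in, r_out))

W = init_weights(256, 128, mode="he")
print(W.std())  # close to sqrt(2/256)
```

The empirical standard deviation of the sampled matrix matches the target closely because the matrix contains many independent draws.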
Tuning Hyperparameters
• Hyperparameters are parameters such as the number of layers, nodes per layer, learning rate, and regularization parameter.
• Use a separate validation set for tuning.
• Do not use the same data set for backpropagation training as for tuning.
Grid Search
• Perform grid search over parameter space.
– Select a set of values for each parameter in some “reasonable” range.
– Test over all combinations of values.
• Be careful about parameters at the borders of the selected range.
• Optimization: Search over a coarse grid first, and then drill down into the region of interest with finer grids.
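The coarse-to-fine idea can be sketched as follows (a Python toy; the quadratic "validation loss" and all names are made up for illustration):

```python
from itertools import product

def grid_search(evaluate, grids):
    """Evaluate every combination of values; return best params and score."""
    best, best_score = None, float("inf")
    for combo in product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        score = evaluate(params)
        if score < best_score:
            best, best_score = params, score
    return best, best_score

# toy validation "loss" with optimum at lr = 0.3, reg = 2
f = lambda p: (p["lr"] - 0.3) ** 2 + (p["reg"] - 2) ** 2

# pass 1: coarse grid over a wide range
coarse = {"lr": [0.001, 0.01, 0.1, 1.0], "reg": [0, 1, 5, 10]}
best, _ = grid_search(f, coarse)

# pass 2: finer grid centered on the coarse optimum
fine = {"lr": [best["lr"] / 2, best["lr"], best["lr"] * 2],
        "reg": [max(best["reg"] - 1, 0), best["reg"], best["reg"] + 1]}
best_fine, _ = grid_search(f, fine)
print(best, best_fine)
```

The second pass spends its evaluation budget only near the region the first pass identified, which is the point of the optimization above.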
How to Select Values for Each Parameter
• The natural approach is to select uniformly distributed values of the parameters.
– Not the best approach in many cases! ⇒ Log-uniform intervals.
– Search uniformly over a reasonable range of log-values and then exponentiate.
– Example: Uniformly sample the (base-10) log-learning rate between −3 and −1, and then raise 10 to that power.
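A minimal sketch of this recipe in Python (names and the exponent range follow the example above; the interface is illustrative):

```python
import random

rng = random.Random(0)

def sample_learning_rate(rng, low_exp=-3, high_exp=-1):
    """Sample log10(lr) uniformly, then raise 10 to that power."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

rates = [sample_learning_rate(rng) for _ in range(1000)]
# every sample lies in [1e-3, 1e-1], spread evenly on a log scale
print(min(rates), max(rates))
```

Sampling the exponent rather than the rate itself means that each decade (0.001–0.01 and 0.01–0.1) receives roughly the same number of samples.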
Sampling versus Grid Search
• With a large number of parameters, grid search is still expensive.
• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59049 possibilities.
• A more flexible choice is to sample over the grid space.
• Used more commonly in large-scale settings, with good results.
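One possible sketch of random sampling over the hyperparameter space (the toy objective and all names are illustrative; each trial draws every parameter independently instead of enumerating a grid):

```python
import random

rng = random.Random(42)

def random_search(evaluate, samplers, n_trials=50):
    """Draw each hyperparameter independently per trial; keep the best."""
    best, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in samplers.items()}
        score = evaluate(params)
        if score < best_score:
            best, best_score = params, score
    return best, best_score

samplers = {
    "lr": lambda r: 10.0 ** r.uniform(-3, -1),   # log-uniform
    "layers": lambda r: r.randint(2, 10),        # uniform over integers
}
f = lambda p: abs(p["layers"] - 4) + abs(p["lr"] - 0.05)  # toy loss
best, score = random_search(f, samplers)
print(best, score)
```

Unlike a grid, the trial budget here (50 evaluations) is fixed regardless of how many hyperparameters are added, which is why sampling scales better.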
Revisiting Feature Normalization
• In the previous lecture, we discussed feature normalization.
• When features have very different magnitudes, gradient ratios of different weights are likely very different.
• Feature normalization helps even out gradient ratios to some extent.
– Exact behavior depends on the target variable and loss function.
The Vanishing and Exploding Gradient Problems
• An extreme manifestation of varying sensitivity occurs in deep
networks.
• The weights/activation derivatives in different layers affect
the backpropagated gradient in a multiplicative way.
– With increasing depth this effect is magnified.
– The partial derivatives can either increase or decrease with
depth.
Example
[Figure: a chain network with one node per layer — input x, weights w1, w2, …, wm, hidden values h1, h2, …, h(m−1), and output o.]
• Neural network with one node per layer.
• Forward propagation multiplicatively depends on each weight
and activation function evaluation.
• Backpropagated partial derivatives get multiplied by weights and activation-function derivatives.
• Unless the values are exactly one, the partial derivatives will
either continuously increase (explode) or decrease (vanish).
• Hard to initialize weights exactly right.
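A quick numerical illustration of this multiplicative effect (assuming a constant per-layer factor w and ignoring activation derivatives, for simplicity):

```python
depth = 50
for w in (0.9, 1.1):
    # the backpropagated gradient picks up roughly a factor of w per
    # layer, so its magnitude behaves like w ** depth
    grad = w ** depth
    print(f"w = {w}: gradient factor after {depth} layers ~ {grad:.2e}")
```

Even per-layer factors very close to 1 (0.9 vs. 1.1) diverge to vanishing or exploding magnitudes within 50 layers, which is why initializing the weights "exactly right" is so hard.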
Activation Function Propensity to Vanishing Gradients
• Partial derivative of sigmoid with output o ⇒ o(1 − o).
– Maximum value of 0.25, attained at o = 0.5.
– For 10 layers, the activation functions alone multiply the gradient by at most 0.25^10 ≈ 10^−6.
• At extremes of output values, the partial derivative is close
to 0, which is called saturation.
• The tanh activation function, with partial derivative (1 − o^2), has a maximum value of 1 at o = 0, but saturation will still cause problems.
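The shrinkage is easy to verify numerically (a small Python check of the numbers above):

```python
def sigmoid_grad(o):
    """Derivative of the sigmoid, expressed via its output o."""
    return o * (1 - o)

# even in the best case o = 0.5, each layer contributes only 0.25
best_case = sigmoid_grad(0.5)
ten_layers = best_case ** 10          # below 1e-6
# near saturation (o close to 0 or 1), the factor is nearly zero
saturated = sigmoid_grad(0.99)
print(best_case, ten_layers, saturated)
```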
Exploding Gradients
• Initializing weights to very large values to compensate for the
activation functions can cause exploding gradients.
• Exploding gradients can also occur when weights across different layers are shared (e.g., recurrent neural networks).
– The effect of a finite change in a weight is extremely unpredictable across different layers.
– A small finite change may alter the loss negligibly, while a slightly larger one might change it drastically.
A Partial Fix to Vanishing Gradients
• The ReLU has linear activation for nonnegative values and
otherwise sets outputs to 0.
• The ReLU has a partial derivative of 1 for nonnegative inputs.
• However, it can have a partial derivative of 0 in some cases
and never get updated.
– Neuron is permanently dead!
Leaky ReLU
• For negative inputs, the leaky ReLU can still propagate some
gradient backwards.
– At a reduced rate of α < 1 times the rate for nonnegative inputs:

        Φ(v) = { α · v   if v ≤ 0
               { v       otherwise          (14)
• The value of α is a hyperparameter chosen by the user.
• The gains with the leaky ReLU are not guaranteed.
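A minimal NumPy implementation of Equation (14) and its derivative (function names are illustrative):

```python
import numpy as np

def leaky_relu(v, alpha=0.01):
    """Phi(v) = alpha * v for v <= 0, and v otherwise."""
    return np.where(v <= 0, alpha * v, v)

def leaky_relu_grad(v, alpha=0.01):
    """Derivative: alpha (not 0) for v <= 0, and 1 otherwise."""
    return np.where(v <= 0, alpha, 1.0)

v = np.array([-2.0, 0.5])
print(leaky_relu(v))
print(leaky_relu_grad(v))
```

Because the derivative on the negative side is α rather than 0, a unit that receives only negative inputs still gets (small) gradient updates instead of dying permanently.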
Maxout
• The activation used is max{W1 ·X, W2 ·X} with two coefficient
vectors.
• One can view the maxout as a generalization of the ReLU.
– The ReLU is obtained by setting one of the coefficient
vectors to 0.
– The leaky ReLU can also be simulated by setting the other
coefficient vector to W2 = αW1.
• Main disadvantage is that it doubles the number of parameters.
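A small sketch of the maxout unit and its two special cases (NumPy, with coefficient vectors for simplicity; names are illustrative):

```python
import numpy as np

def maxout(x, w1, w2):
    """Maxout activation: max{w1 . x, w2 . x} for coefficient vectors w1, w2."""
    return max(np.dot(w1, x), np.dot(w2, x))

x = np.array([1.0, -2.0])
w = np.array([0.5, 1.0])

# ReLU as a special case: set the second coefficient vector to 0,
# giving max{w . x, 0}
relu_like = maxout(x, w, np.zeros_like(w))

# leaky ReLU with slope alpha: set w2 = alpha * w1
alpha = 0.1
leaky_like = maxout(x, w, alpha * w)
print(relu_like, leaky_like)
```

Here w · x = −1.5, so the ReLU case outputs 0 while the leaky case outputs α · (−1.5), matching the two special cases described above.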
Gradient Clipping for Exploding Gradients
• Try to make the different components of the partial derivatives more even.
– Value-based clipping: All partial derivatives outside a range are set to the range boundaries.
– Norm-based clipping: The entire gradient vector is normalized by its L2-norm.
• One can achieve a better conditioning of the values, so that the updates from mini-batch to mini-batch are roughly similar.
• Prevents an anomalous gradient explosion during the course
of training.
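Both clipping variants can be sketched in a few lines of NumPy (names and thresholds are illustrative):

```python
import numpy as np

def clip_by_value(grad, limit=1.0):
    """Value-based: clamp each component to [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=1.0):
    """Norm-based: rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])     # L2 norm 5
print(clip_by_value(g))      # each component clamped independently
print(clip_by_norm(g))       # direction preserved, magnitude rescaled
```

Note the difference: value-based clipping can change the gradient's direction (both components here become 1), while norm-based clipping preserves the direction and only shrinks the magnitude.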
Other Comments on Vanishing and Exploding Gradients
• The methods discussed above are only partial fixes.
• Other fixes discussed in later lectures:
– Stronger initializations with pretraining.
– Second-order learning methods that make use of second-order derivatives (or the curvature of the loss function).