Weight Initialization and Gradient Issues


Weight Initialization

• Initializations are surprisingly important.

– Poor initializations can lead to bad convergence behavior.

– Instability across different layers (vanishing and exploding gradients).

• More sophisticated initializations, such as pretraining, are covered in a later lecture.

• Even some simple rules in initialization can help in conditioning.
Symmetry Breaking

• Bad idea to initialize weights to the same value.

– Results in weights being updated in lockstep.

– Creates redundant features.

• Initializing weights to random values breaks symmetry.

• Average magnitude of the random variables is important for stability.
Sensitivity to Number of Inputs

• More inputs increase output sensitivity to the average weight.

– Additive effect of multiple inputs: variance linearly increases with number of inputs r.

– Standard deviation scales with the square-root of number of inputs r.

• Each weight is initialized from a Gaussian distribution with standard deviation 1/√r (√(2/r) for ReLU).

• More sophisticated: Use standard deviation of √(2/(r_in + r_out)).
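A minimal NumPy sketch of these rules (the layer sizes r_in and r_out below are illustrative, not from the slides):

import numpy as np

r_in, r_out = 256, 128   # illustrative fan-in and fan-out for one layer

# Standard deviation 1/sqrt(r_in), as suggested above for general activations
W = np.random.randn(r_out, r_in) / np.sqrt(r_in)

# Standard deviation sqrt(2/r_in), the variant suggested for ReLU units
W_relu = np.random.randn(r_out, r_in) * np.sqrt(2.0 / r_in)

# Standard deviation sqrt(2/(r_in + r_out)), using both fan-in and fan-out
W_fan = np.random.randn(r_out, r_in) * np.sqrt(2.0 / (r_in + r_out))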


Tuning Hyperparameters

• Hyperparameters represent the parameters like number of layers, nodes per layer, learning rate, and regularization parameter.

• Use separate validation set for tuning.

• Do not use the same data set for backpropagation training as for tuning.
Grid Search

• Perform grid search over parameter space.

– Select a set of values for each parameter in some “reasonable” range.

– Test over all combinations of values.

• Careful about parameters at borders of selected range.

• Optimization: Search over coarse grid first, and then drill down into region of interest with finer grids.
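A minimal sketch of a coarse grid search; the candidate values and the evaluate_on_validation_set helper are illustrative assumptions, not part of the slides:

import itertools

# Illustrative candidate values for each hyperparameter over a coarse grid
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "nodes_per_layer": [64, 128, 256],
    "regularization": [1e-5, 1e-4, 1e-3],
}

best_config, best_error = None, float("inf")
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    error = evaluate_on_validation_set(config)   # hypothetical helper: train, then score on the validation set
    if error < best_error:
        best_config, best_error = config, error
# A finer grid can then be searched around best_config.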
How to Select Values for Each Parameter

• Natural approach is to select uniformly distributed values of parameters.

– Not the best approach in many cases! ⇒ Log-uniform intervals.

– Search uniformly over a reasonable range of log-values and then exponentiate.

– Example: Uniformly sample the log-learning rate between −3 and −1, and then use 10 raised to that power as the learning rate.
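A minimal sketch of the log-uniform sampling in the example above:

import numpy as np

# Sample the exponent uniformly, then exponentiate to get the learning rate
exponent = np.random.uniform(-3, -1)   # uniform over the log-scale interval [-3, -1]
learning_rate = 10.0 ** exponent       # resulting rate lies between 0.001 and 0.1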
Sampling versus Grid Search

• With a large number of parameters, grid search is still expensive.

• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59049 possibilities.

• Flexible choice is to sample over grid space.

• Used more commonly in large-scale settings with good results.
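A minimal sketch of sampling configurations instead of enumerating the full grid; the parameter ranges and the budget of 50 samples are illustrative assumptions:

import numpy as np

def sample_config():
    # Each hyperparameter is drawn independently; log-uniform where appropriate
    return {
        "learning_rate": 10.0 ** np.random.uniform(-3, -1),
        "nodes_per_layer": int(np.random.choice([64, 128, 256, 512])),
        "regularization": 10.0 ** np.random.uniform(-6, -2),
    }

# Evaluate a fixed budget of sampled configurations instead of all grid points
configs = [sample_config() for _ in range(50)]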
Revisiting Feature Normalization

• In the previous lecture, we discussed feature normalization.

• When features have very different magnitudes, gradient ratios of different weights are likely very different.

• Feature normalization helps even out gradient ratios to some extent.

– Exact behavior depends on target variable and loss function.
The Vanishing and Exploding Gradient Problems

• An extreme manifestation of varying sensitivity occurs in deep networks.

• The weights/activation derivatives in different layers affect the backpropagated gradient in a multiplicative way.

– With increasing depth this effect is magnified.

– The partial derivatives can either increase or decrease with depth.
Example

[Figure: a chain network x → h1 → h2 → ... → h(m−1) → o with one node per layer and weights w1, w2, ..., wm on the connections.]

• Neural network with one node per layer.

• Forward propagation multiplicatively depends on each weight and activation function evaluation.

• Backpropagated partial derivatives get multiplied by weights and activation function derivatives.

• Unless the values are exactly one, the partial derivatives will
either continuously increase (explode) or decrease (vanish).

• Hard to initialize weights exactly right.
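A small numeric sketch of this chain: the backpropagated derivative is a product of per-layer factors (weight times activation derivative), so it vanishes or explodes unless each factor is close to one. The weight values and depth below are illustrative:

def chain_factor_product(w, depth, activation_derivative=0.25):
    # Each of the `depth` layers contributes a factor (weight * activation derivative)
    return (w * activation_derivative) ** depth

print(chain_factor_product(w=1.0, depth=20))   # 0.25**20, about 9e-13: the gradient vanishes
print(chain_factor_product(w=8.0, depth=20))   # 2**20, about 1e6: the gradient explodes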


Activation Function Propensity to Vanishing Gradients

• Partial derivative of sigmoid with output o ⇒ o(1 − o).

– Maximum value at o = 0.5 of 0.25.

– For 10 layers, the activation functions alone will multiply the backpropagated gradient by less than 0.25^10 ≈ 10^−6.

• At extremes of output values, the partial derivative is close to 0, which is called saturation.

• The tanh activation function with partial derivative (1 − o^2) has a maximum value of 1 at o = 0, but saturation will still cause problems.
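A quick numeric check of the bound stated above:

o = 0.5
max_derivative = o * (1 - o)    # maximum of o(1 - o) is 0.25, attained at o = 0.5
print(max_derivative ** 10)     # 9.54e-07, i.e., roughly 10^-6 across 10 layers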
Exploding Gradients

• Initializing weights to very large values to compensate for the activation functions can cause exploding gradients.

• Exploding gradients can also occur when weights across different layers are shared (e.g., recurrent neural networks).

– The effect of a finite change in weight is extremely unpredictable across different layers.

– Small finite change changes loss negligibly, but a slightly larger value might change loss drastically.
A Partial Fix to Vanishing Gradients

• The ReLU has linear activation for nonnegative values and otherwise sets outputs to 0.

• The ReLU has a partial derivative of 1 for nonnegative inputs.

• However, it can have a partial derivative of 0 in some cases and never get updated.

– Neuron is permanently dead!


Leaky ReLU

• For negative inputs, the leaky ReLU can still propagate some
gradient backwards.

– At a reduced rate of α < 1 times that for nonnegative inputs:

Φ(v) = α · v  if v ≤ 0,  and  Φ(v) = v  otherwise.     (14)

• The value of α is a hyperparameter chosen by the user.

• The gains with the leaky ReLU are not guaranteed.
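A minimal NumPy sketch of the leaky ReLU in equation (14) and its derivative; the value of α below is illustrative:

import numpy as np

alpha = 0.01   # illustrative value of the user-chosen hyperparameter

def leaky_relu(v):
    # Equation (14): alpha * v for v <= 0, v otherwise
    return np.where(v <= 0, alpha * v, v)

def leaky_relu_derivative(v):
    # A gradient of alpha (rather than 0) still flows back for negative inputs
    return np.where(v <= 0, alpha, 1.0)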


Maxout

• The activation used is max{W1 · X, W2 · X} with two coefficient vectors.

• One can view the maxout as a generalization of the ReLU.

– The ReLU is obtained by setting one of the coefficient vectors to 0.

– The leaky ReLU can also be simulated by setting the other coefficient vector to W2 = αW1.

• Main disadvantage is that it doubles the number of parameters.
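A minimal NumPy sketch of a maxout unit and its ReLU and leaky ReLU special cases; the vectors and the value of α are illustrative:

import numpy as np

def maxout(X, W1, W2):
    # The activation is the larger of the two linear responses
    return np.maximum(W1 @ X, W2 @ X)

X, W1 = np.random.randn(5), np.random.randn(5)
alpha = 0.01

relu_like = maxout(X, W1, np.zeros(5))     # W2 = 0 recovers the ReLU
leaky_like = maxout(X, W1, alpha * W1)     # W2 = alpha * W1 recovers the leaky ReLU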
Gradient Clipping for Exploding Gradients

• Try to make the different components of the partial derivatives more even.

– Value-based clipping: All partial derivatives outside ranges are set to range boundaries.

– Norm-based clipping: The entire gradient vector is normalized by the L2-norm of the entire vector.

• One can achieve a better conditioning of the values, so that the updates from mini-batch to mini-batch are roughly similar.

• Prevents an anomalous gradient explosion during the course of training.
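A minimal NumPy sketch of the two clipping variants; the clipping thresholds are illustrative:

import numpy as np

def clip_by_value(gradient, low=-1.0, high=1.0):
    # Value-based clipping: components outside [low, high] are set to the boundaries
    return np.clip(gradient, low, high)

def clip_by_norm(gradient, max_norm=5.0):
    # Norm-based clipping: rescale the whole vector when its L2-norm exceeds max_norm
    norm = np.linalg.norm(gradient)
    return gradient * (max_norm / norm) if norm > max_norm else gradient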
Other Comments on Vanishing and Exploding Gradients

• The methods discussed above are only partial fixes.

• Other fixes discussed in later lectures:

– Stronger initializations with pretraining.

– Second-order learning methods that make use of second-order derivatives (or curvature of the loss function).
