Weight Initialization
• Initializations are surprisingly important.
– Poor initializations can lead to bad convergence behavior.
– Instability across different layers (vanishing and exploding gradients).
• More sophisticated initializations, such as pretraining, are covered in a later lecture.
• Even simple rules for initialization can help with conditioning.
Symmetry Breaking
• Bad idea to initialize weights to the same value.
– Results in weights being updated in lockstep.
– Creates redundant features.
• Initializing weights to random values breaks symmetry.
• Average magnitude of the random variables is important for
stability.
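As a toy illustration of lockstep updates (a NumPy sketch; the setup and the assumed upstream gradient are illustrative, not from the lecture), two hidden units that start with identical weights see the same input, produce identical activations, and therefore receive identical weight updates:

```python
import numpy as np

# Two hidden units receive the same input and start with identical weights.
x = np.array([1.0, 2.0])
W = np.full((2, 2), 0.5)          # both rows (units) are identical
h = np.maximum(W @ x, 0.0)        # identical activations

# With identical activations (and, by symmetry, identical upstream
# gradients dL/dh), the weight updates for the two rows are identical:
# the units remain clones of each other forever.
dL_dh = np.ones(2)
dW = np.outer(dL_dh * (h > 0), x)
print(np.allclose(dW[0], dW[1]))  # the rows update in lockstep
```

Drawing the initial weights from a random distribution makes the two rows of dW differ, which is exactly what symmetry breaking means here.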
Sensitivity to Number of Inputs
• More inputs increase the output's sensitivity to the average weight.
– Additive effect of multiple inputs: variance increases linearly with the number of inputs r.
– Standard deviation scales with the square root of the number of inputs r.
• Each weight is initialized from a Gaussian distribution with standard deviation 1/√r (or √(2/r) for ReLU).
• More sophisticated: Use standard deviation √(2/(r_in + r_out)).
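These fan-in/fan-out rules can be sketched in a few lines of NumPy (the function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def init_weights(r_in, r_out, mode="xavier", rng=None):
    """Draw an (r_in, r_out) weight matrix from a zero-mean Gaussian.

    mode="fan_in":  std = sqrt(1 / r_in)         (basic rule)
    mode="he":      std = sqrt(2 / r_in)         (ReLU networks)
    mode="xavier":  std = sqrt(2 / (r_in + r_out))
    """
    rng = rng or np.random.default_rng(0)
    if mode == "fan_in":
        std = np.sqrt(1.0 / r_in)
    elif mode == "he":
        std = np.sqrt(2.0 / r_in)
    elif mode == "xavier":
        std = np.sqrt(2.0 / (r_in + r_out))
    else:
        raise ValueError(mode)
    return rng.normal(0.0, std, size=(r_in, r_out))

W = init_weights(256, 128, mode="he")
print(W.std())  # close to sqrt(2/256)
```

The empirical standard deviation of the sampled matrix matches the target closely because the matrix contains many independent draws.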
Tuning Hyperparameters
• Hyperparameters are parameters such as the number of layers, nodes per layer, learning rate, and regularization parameter.
• Use a separate validation set for tuning.
• Do not use the same data set for backpropagation training as for tuning.
Grid Search
• Perform grid search over parameter space.
– Select a set of values for each parameter in some “reasonable” range.
– Test over all combinations of values.
• Be careful about parameters at the borders of the selected range.
• Optimization: Search over a coarse grid first, and then drill down into the region of interest with finer grids.
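The coarse-to-fine idea can be sketched as follows (a Python toy; the quadratic "validation loss" and all names are made up for illustration):

```python
from itertools import product

def grid_search(evaluate, grids):
    """Evaluate every combination of values; return best params and score."""
    best, best_score = None, float("inf")
    for combo in product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        score = evaluate(params)
        if score < best_score:
            best, best_score = params, score
    return best, best_score

# toy validation "loss" with optimum at lr = 0.3, reg = 2
f = lambda p: (p["lr"] - 0.3) ** 2 + (p["reg"] - 2) ** 2

# pass 1: coarse grid over a wide range
coarse = {"lr": [0.001, 0.01, 0.1, 1.0], "reg": [0, 1, 5, 10]}
best, _ = grid_search(f, coarse)

# pass 2: finer grid centered on the coarse optimum
fine = {"lr": [best["lr"] / 2, best["lr"], best["lr"] * 2],
        "reg": [max(best["reg"] - 1, 0), best["reg"], best["reg"] + 1]}
best_fine, _ = grid_search(f, fine)
print(best, best_fine)
```

The second pass spends its evaluation budget only near the region the first pass identified, which is the point of the optimization above.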
How to Select Values for Each Parameter
• The natural approach is to select uniformly distributed values of the parameters.
– Not the best approach in many cases! ⇒ Log-uniform intervals.
– Search uniformly over a reasonable range of log-values and then exponentiate.
– Example: Uniformly sample the (base-10) log-learning rate between −3 and −1, and then raise 10 to that power.
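A minimal sketch of this recipe in Python (names and the exponent range follow the example above; the interface is illustrative):

```python
import random

rng = random.Random(0)

def sample_learning_rate(rng, low_exp=-3, high_exp=-1):
    """Sample log10(lr) uniformly, then raise 10 to that power."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

rates = [sample_learning_rate(rng) for _ in range(1000)]
# every sample lies in [1e-3, 1e-1], spread evenly on a log scale
print(min(rates), max(rates))
```

Sampling the exponent rather than the rate itself means that each decade (0.001–0.01 and 0.01–0.1) receives roughly the same number of samples.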
Sampling versus Grid Search
• With a large number of parameters, grid search is still expensive.
• With 10 parameters, choosing just 3 values for each parameter leads to 3^10 = 59049 possibilities.
• A more flexible choice is to sample over the grid space.
• Used more commonly in large-scale settings, with good results.
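One possible sketch of random sampling over the hyperparameter space (the toy objective and all names are illustrative; each trial draws every parameter independently instead of enumerating a grid):

```python
import random

rng = random.Random(42)

def random_search(evaluate, samplers, n_trials=50):
    """Draw each hyperparameter independently per trial; keep the best."""
    best, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in samplers.items()}
        score = evaluate(params)
        if score < best_score:
            best, best_score = params, score
    return best, best_score

samplers = {
    "lr": lambda r: 10.0 ** r.uniform(-3, -1),   # log-uniform
    "layers": lambda r: r.randint(2, 10),        # uniform over integers
}
f = lambda p: abs(p["layers"] - 4) + abs(p["lr"] - 0.05)  # toy loss
best, score = random_search(f, samplers)
print(best, score)
```

Unlike a grid, the trial budget here (50 evaluations) is fixed regardless of how many hyperparameters are added, which is why sampling scales better.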
Revisiting Feature Normalization
• In the previous lecture, we discussed feature normalization.
• When features have very different magnitudes, gradient ratios of different weights are likely very different.
• Feature normalization helps even out gradient ratios to some extent.
– Exact behavior depends on the target variable and loss function.
The Vanishing and Exploding Gradient Problems
• An extreme manifestation of varying sensitivity occurs in deep
networks.
• The weights/activation derivatives in different layers affect
the backpropagated gradient in a multiplicative way.
– With increasing depth this effect is magnified.
– The partial derivatives can either increase or decrease with
depth.
Example
[Figure: a chain network with one node per layer — input x, weights w1, w2, …, wm, hidden values h1, h2, …, h(m−1), and output o.]
• Neural network with one node per layer.
• Forward propagation multiplicatively depends on each weight
and activation function evaluation.
• Backpropagated partial derivatives get multiplied by weights and activation-function derivatives.
• Unless the values are exactly one, the partial derivatives will
either continuously increase (explode) or decrease (vanish).
• Hard to initialize weights exactly right.
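A quick numerical illustration of this multiplicative effect (assuming a constant per-layer factor w and ignoring activation derivatives, for simplicity):

```python
depth = 50
for w in (0.9, 1.1):
    # the backpropagated gradient picks up roughly a factor of w per
    # layer, so its magnitude behaves like w ** depth
    grad = w ** depth
    print(f"w = {w}: gradient factor after {depth} layers ~ {grad:.2e}")
```

Even per-layer factors very close to 1 (0.9 vs. 1.1) diverge to vanishing or exploding magnitudes within 50 layers, which is why initializing the weights "exactly right" is so hard.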
Activation Function Propensity to Vanishing Gradients
• Partial derivative of sigmoid with output o ⇒ o(1 − o).
– Maximum value of 0.25, attained at o = 0.5.
– For 10 layers, the activation functions alone multiply the gradient by at most 0.25^10 ≈ 10^−6.
• At extremes of output values, the partial derivative is close
to 0, which is called saturation.
• The tanh activation function, with partial derivative (1 − o^2), has a maximum value of 1 at o = 0, but saturation will still cause problems.
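The shrinkage is easy to verify numerically (a small Python check of the numbers above):

```python
def sigmoid_grad(o):
    """Derivative of the sigmoid, expressed via its output o."""
    return o * (1 - o)

# even in the best case o = 0.5, each layer contributes only 0.25
best_case = sigmoid_grad(0.5)
ten_layers = best_case ** 10          # below 1e-6
# near saturation (o close to 0 or 1), the factor is nearly zero
saturated = sigmoid_grad(0.99)
print(best_case, ten_layers, saturated)
```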
Exploding Gradients
• Initializing weights to very large values to compensate for the
activation functions can cause exploding gradients.
• Exploding gradients can also occur when weights across different layers are shared (e.g., recurrent neural networks).
– The effect of a finite change in a weight is extremely unpredictable across different layers.
– A small finite change may alter the loss negligibly, while a slightly larger one might change it drastically.
A Partial Fix to Vanishing Gradients
• The ReLU has linear activation for nonnegative values and
otherwise sets outputs to 0.
• The ReLU has a partial derivative of 1 for nonnegative inputs.
• However, it can have a partial derivative of 0 in some cases
and never get updated.
– Neuron is permanently dead!
Leaky ReLU
• For negative inputs, the leaky ReLU can still propagate some
gradient backwards.
– At a reduced rate of α < 1 times the rate for nonnegative inputs:

        Φ(v) = { α · v   if v ≤ 0
               { v       otherwise          (14)
• The value of α is a hyperparameter chosen by the user.
• The gains with the leaky ReLU are not guaranteed.
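A minimal NumPy implementation of Equation (14) and its derivative (function names are illustrative):

```python
import numpy as np

def leaky_relu(v, alpha=0.01):
    """Phi(v) = alpha * v for v <= 0, and v otherwise."""
    return np.where(v <= 0, alpha * v, v)

def leaky_relu_grad(v, alpha=0.01):
    """Derivative: alpha (not 0) for v <= 0, and 1 otherwise."""
    return np.where(v <= 0, alpha, 1.0)

v = np.array([-2.0, 0.5])
print(leaky_relu(v))
print(leaky_relu_grad(v))
```

Because the derivative on the negative side is α rather than 0, a unit that receives only negative inputs still gets (small) gradient updates instead of dying permanently.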
Maxout
• The activation used is max{W1 ·X, W2 ·X} with two coefficient
vectors.
• One can view the maxout as a generalization of the ReLU.
– The ReLU is obtained by setting one of the coefficient
vectors to 0.
– The leaky ReLU can also be simulated by setting the other
coefficient vector to W2 = αW1.
• Main disadvantage is that it doubles the number of parameters.
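A small sketch of the maxout unit and its two special cases (NumPy, with coefficient vectors for simplicity; names are illustrative):

```python
import numpy as np

def maxout(x, w1, w2):
    """Maxout activation: max{w1 . x, w2 . x} for coefficient vectors w1, w2."""
    return max(np.dot(w1, x), np.dot(w2, x))

x = np.array([1.0, -2.0])
w = np.array([0.5, 1.0])

# ReLU as a special case: set the second coefficient vector to 0,
# giving max{w . x, 0}
relu_like = maxout(x, w, np.zeros_like(w))

# leaky ReLU with slope alpha: set w2 = alpha * w1
alpha = 0.1
leaky_like = maxout(x, w, alpha * w)
print(relu_like, leaky_like)
```

Here w · x = −1.5, so the ReLU case outputs 0 while the leaky case outputs α · (−1.5), matching the two special cases described above.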
Gradient Clipping for Exploding Gradients
• Try to make the different components of the partial derivatives more even.
– Value-based clipping: All partial derivatives outside a range are set to the range boundaries.
– Norm-based clipping: The entire gradient vector is normalized by its L2-norm.
• One can achieve a better conditioning of the values, so that the updates from mini-batch to mini-batch are roughly similar.
• Prevents an anomalous gradient explosion during the course
of training.
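Both clipping variants can be sketched in a few lines of NumPy (names and thresholds are illustrative):

```python
import numpy as np

def clip_by_value(grad, limit=1.0):
    """Value-based: clamp each component to [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=1.0):
    """Norm-based: rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])     # L2 norm 5
print(clip_by_value(g))      # each component clamped independently
print(clip_by_norm(g))       # direction preserved, magnitude rescaled
```

Note the difference: value-based clipping can change the gradient's direction (both components here become 1), while norm-based clipping preserves the direction and only shrinks the magnitude.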
Other Comments on Vanishing and Exploding Gradients
• The methods discussed above are only partial fixes.
• Other fixes discussed in later lectures:
– Stronger initializations with pretraining.
– Second-order learning methods that make use of second-order derivatives (or the curvature of the loss function).