Deep Learning Optimization Techniques
Deep Learning Optimization Techniques
Optimization Algorithms
If you read the book in sequence up to this point you already used a number of advanced optimization
algorithms to train deep learning models. They were the tools that allowed us to continue updating model
parameters and to minimize the value of the loss function, as evaluated on the training set. Indeed,
anyone content with treating optimization as a black box device to minimize objective functions in a
simple setting might well content oneself with the knowledge that there exists an array of incantations of
such a procedure (with names such as Adam’, NAG’, or SGD’).
To do well, however, some deeper knowledge is required. Optimization algorithms are important for deep
learning. On one hand, training a complex deep learning model can take hours, days, or even weeks. The
performance of the optimization algorithm directly affects the model’s training efficiency. On the other
hand, understanding the principles of different optimization algorithms and the role of their parameters
will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep
learning models.
In this chapter, we explore common deep learning optimization algorithms in depth. Almost all optimiza-
tion problems arising in deep learning are nonconvex. Nonetheless, the design and analysis of algorithms
in the context of convex problems has proven to be very instructive. It is for that reason that this section
includes a primer on convex optimization and the proof for a very simple stochastic gradient descent
algorithm on a convex objective function.
419
10.1 Optimization and Deep Learning
In this section, we will discuss the relationship between optimization and deep learning as well as the
challenges of using optimization in deep learning. For a deep learning problem, we will usually define
a loss function first. Once we have the loss function, we can use an optimization algorithm in attempt
to minimize the loss. In optimization, a loss function is often referred to as the objective function of
the optimization problem. By tradition and convention most optimization algorithms are concerned with
minimization. If we ever need to maximize an objective there’s a simple solution - just flip the sign on the
objective.
Although optimization provides a way to minimize the loss function for deep learning, in essence, the
goals of optimization and deep learning are fundamentally different. The former is primarily concerned
with minimizing an objective whereas the latter is concerned with finding a suitable model, given a
finite amount of data. In the section on Model Selection, Underfitting and Overfitting we discussed the
difference between these two goals in detail. For instance, training error and generalization error generally
differ: since the objective function of the optimization algorithm is usually a loss function based on the
training data set, the goal of optimization is to reduce the training error. However, the goal of statistical
inference (and thus of deep learning) is to reduce the generalization error. To accomplish the latter we
need to pay attention to overfitting in addition to using the optimization algorithm to reduce the training
error. We begin by importing a few libraries.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
from mpl_toolkits import mplot3d
import numpy as np
The graph below illustrates the issue in some more detail. Since we have only a finite amount of data the
minimum of the training error may be at a different location than the minimum of the expected error (or
of the test error).
In [2]: def f(x): return x * [Link]([Link] * x)
def g(x): return f(x) + 0.2 * [Link](5 * [Link] * x)
d2l.set_figsize((4.5, 2.5))
x = [Link](0.5, 1.5, 0.01)
fig, = [Link](x, f(x))
fig, = [Link](x, g(x))
[Link]('empirical risk', xy=(1.0, -1.2), xytext=(0.5, -1.1),
arrowprops=dict(arrowstyle='->'))
[Link]('expected risk', xy=(1.1, -1.05), xytext=(0.95, -0.5),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('risk');
In this chapter, we are going to focus specifically on the performance of the optimization algorithm in
minimizing the objective function, rather than a model’s generalization error. In the section on Linear
Regression we distinguished between analytical solutions and numerical solutions in optimization prob-
lems. In deep learning, most objective functions are complicated and do not have analytical solutions.
Instead, we must use numerical optimization algorithms. The optimization algorithms below all fall into
this category.
There are many challenges in deep learning optimization. Some of the most vexing ones are local minima,
saddle points and vanishing gradients. Let’s have a look at a few of them.
Local Minima
For the objective function f (x), if the value of f (x) at x is smaller than the values of f (x) at any other
points in the vicinity of x, then f (x) could be a local minimum. If the value of f (x) at x is the minimum
of the objective function over the entire domain, then f (x) is the global minimum.
For example, given the function
we can approximate the local minimum and global minimum of this function.
In [3]: x = [Link](-1.0, 2.0, 0.01)
fig, = [Link](x, f(x))
[Link]('local minimum', xy=(-0.3, -0.25), xytext=(-0.77, -1.0),
arrowprops=dict(arrowstyle='->'))
[Link]('global minimum', xy=(1.1, -0.95), xytext=(0.6, 0.8),
The objective function of deep learning models usually has many local optima. When the numerical
solution of an optimization problem is near the local optimum, the numerical solution obtained by the
final iteration may only minimize the objective function locally, rather than globally, as the gradient of the
objective function’s solutions approaches or becomes zero. Only some degree of noise might knock the
parameter out of the local minimum. In fact, this is one of the beneficial properties of stochastic gradient
descent where the natural variation of gradients over minibatches is able to dislodge the parameters from
local minima.
Saddle Points
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point is any
location where all gradients of a function vanish but which is neither a global nor a local minimum.
Consider the function f (x) = x3 . Its first and second derivative vanish for x = 0. Optimization might
stall at the point, even though it is not a minimum.
In [4]: x = [Link](-2.0, 2.0, 0.01)
fig, = [Link](x, x**3)
[Link]('saddle point', xy=(0, -0.2), xytext=(-0.52, -5.0),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('f(x)');
z = x**2 - y**2
ax = [Link]().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
[Link]([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
[Link](ticks)
[Link](ticks)
ax.set_zticks(ticks)
[Link]('x')
[Link]('y');
Vanishing Gradients
Probably the most insidious problem to encounter are vanishing gradients. For instance, assume that we
want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4. As we can see,
the gradient of f is close to nil. More specifically f ′ (x) = 1 − tanh2 (x) and thus f ′ (4) = 0.0013.
Consequently optimization will get stuck for a long time before we make progress. This turns out to be
one of the reasons that training deep learning models was quite tricky prior to the introduction of the
ReLu activation function.
In [6]: x = [Link](-2.0, 5.0, 0.01)
fig, = [Link](x, [Link](x))
[Link]('vanishing gradient', xy=(4, 1), xytext=(2, 0.0),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('f(x)');
Summary
• Minimizing the training error does not guarantee that we find the best set of parameters to minimize
the expected error.
• The optimization problems may have many local minima.
• The problem may have even more saddle points, as generally the problems are not convex.
• Vanishing gradients can cause optimization to stall. Often a reparametrization of the problem helps.
Good initialization of the parameters can be beneficial, too.
Exercises
1. Consider a simple multilayer perceptron with a single hidden layer of, say, d dimensions in the
hidden layer and a single output. Show that for any local minimum there are at least d! equivalent
solutions that behave identically.
2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are each drawn
from some probability distribution pij . Furthermore assume that pij (x) = pij (−x), i.e. that the
distribution is symmetric (see e.g. [1] for details).
• Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector v
the probability that the associated eigenvalue λ satisfies Pr(λ > 0) = Pr(λ < 0).
References
[1] Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of
Mathematics, 325-327.
In this section, we are going to introduce the basic principles of gradient descent. Although it is not
common for gradient descent to be used directly in deep learning, an understanding of gradients and
the reason why the value of an objective function might decline when updating the independent variable
along the opposite direction of the gradient is the foundation for future studies on optimization algorithms.
Next, we are going to introduce stochastic gradient descent (SGD).
Here, we will use a simple gradient descent in one-dimensional space as an example to explain why the
gradient descent algorithm may reduce the value of the objective function. We assume that the input
and output of the continuously differentiable function f : R → R are both scalars. Given ϵ with a
small enough absolute value, according to the Taylor’s expansion formula from the Mathematical basics
section, we get the following approximation:
f (x + ϵ) ≈ f (x) + ϵf ′ (x).
f (x − ηf ′ (x)) ≲ f (x).
x ← x − ηf ′ (x)
to iterate x, the value of function f (x) might decline. Therefore, in the gradient descent, we first choose
an initial value x and a constant η > 0 and then use them to continuously iterate x until the stop condition
is reached, for example, when the value of f ′ (x)2 is small enough or the number of iterations has reached
a certain value.
Now we will use the objective function f (x) = x2 as an example to see how gradient descent is imple-
mented. Although we know that x = 0 is the solution to minimize f (x), here we still use this simple
function to observe how x is iterated. First, import the packages or modules required for the experiment
in this section.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
import math
from mxnet import nd
import numpy as np
Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to iterate x 10
times, we can see that, eventually, the value of x approaches the optimal solution.
In [2]: def gd(eta):
x = 10
results = [x]
for i in range(10):
x -= eta * 2 * x # f(x) = x* the derivative of x is f'(x) = 2 * x
[Link](x)
print('epoch 10, x:', x)
return results
res = gd(0.2)
epoch 10, x: 0.06046617599999997
show_trace(res)
The positive η in the above gradient descent algorithm is usually called the learning rate. This is a hyper-
parameter and needs to be set manually. If we use a learning rate that is too small, it will cause x to
update at a very slow speed, requiring more iterations to get a better solution. Here, we have the iterative
trajectory of the independent variable x with the learning rate η = 0.05. As we can see, after iterating 10
times when the learning rate is too small, there is still a large deviation between the final value of x and
the optimal solution.
In [4]: show_trace(gd(0.05))
epoch 10, x: 3.4867844009999995
Now that we understand gradient descent in one-dimensional space, let us consider a more general case:
the input of the objective function is a vector and the output is a scalar. We assume that the input of the
For brevity, we use ∇f (x) instead of ∇x f (x). Each partial derivative element ∂f (x)/∂xi in the gradient
indicates the rate of change of f at x with respect to the input xi . To measure the rate of change of f in
the direction of the unit vector u (∥u∥ = 1), in multivariate calculus, the directional derivative of f at x
in the direction of u is defined as
f (x + hu) − f (x)
Du f (x) = lim .
h→0 h
According to the property of directional derivatives [1Chapter 14.6 Theorem 3], the aforementioned di-
rectional derivative can be rewritten as
Du f (x) = ∇f (x) · u.
The directional derivative Du f (x) gives all the possible rates of change for f along x. In order to
minimize f , we hope to find the direction that will allow us to reduce f in the fastest way. Therefore, we
can use the unit vector u to minimize the directional derivative Du f (x).
For Du f (x) = ∥∇f (x)∥ · ∥u∥ · cos(θ) = ∥∇f (x)∥ · cos(θ), Here, θ is the angle between the gradient
∇f (x) and the unit vector u. When θ = π, cos (θ) gives us the minimum value −1. So when u is in a
direction that is opposite to the gradient direction ∇f (x), the direction derivative Du f (x) is minimized.
Therefore, we may continue to reduce the value of objective function f by the gradient descent algorithm:
x ← x − η∇f (x).
Similarly, η (positive) is called the learning rate.
Now we are going to construct an objective function f (x) = x21 + 2x22 with a two-dimensional vector
x = [x1 , x2 ]⊤ as input and a scalar as the output. So we have the gradient ∇f (x) = [2x1 , 4x2 ]⊤ . We
will observe the iterative trajectory of independent variable x by gradient descent from the initial position
[−5, −2]. First, we are going to define two helper functions. The first helper uses the given independent
variable update function to iterate independent variable x a total of 20 times from the initial position
[−5, −2]. The second helper will visualize the iterative trajectory of independent variable x.
In [6]: # This function is saved in the d2l package for future use
def train_2d(trainer):
# s1 and s2 are states of the independent variable and will be used later
# in the chapter
x1, x2, s1, s2 = -5, -2, 0, 0
results = [(x1, x2)]
for i in range(20):
x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
[Link]((x1, x2))
print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
return results
Next, we observe the iterative trajectory of the independent variable at learning rate 0.1. After iterating
the independent variable x 20 times using gradient descent, we can see that. eventually, the value of x
approaches the optimal solution [0, 0].
In [7]: eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d))
epoch 20, x1 -0.057646, x2 -0.000073
In deep learning, the objective function is usually the average of the loss functions for each example in the
training data set. We assume that fi (x) is the loss function of the training data instance with n examples,
an index of i, and parameter vector of x, then we have the objective function
1∑
n
f (x) = fi (x).
n i=1
1∑
n
∇f (x) = ∇fi (x).
n i=1
If gradient descent is used, the computing cost for each independent variable iteration is O(n), which
grows linearly with n. Therefore, when the model training data instance is large, the cost of gradient
descent for each iteration will be very high.
Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration of
stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data instances at random,
and compute the gradient ∇fi (x) to update x:
x ← x − η∇fi (x).
Here, η is the learning rate. We can see that the computing cost for each iteration drops from O(n) of
the gradient descent to the constant O(1). We should mention that the stochastic gradient ∇fi (x) is the
unbiased estimate of gradient ∇f (x).
1∑
n
Ei ∇fi (x) = ∇fi (x) = ∇f (x).
n i=1
This means that, on average, the stochastic gradient is a good estimate of the gradient.
Now, we will compare it to gradient descent by adding random noise with a mean of 0 to the gradient to
simulate a SGD.
In [8]: def sgd_2d(x1, x2, s1, s2):
return (x1 - eta * (2 * x1 + [Link](0.1)),
x2 - eta * (4 * x2 + [Link](0.1)), 0, 0)
show_trace_2d(f_2d, train_2d(sgd_2d))
epoch 20, x1 -0.294433, x2 -0.176149
Summary
• If we use a more suitable learning rate and update the independent variable in the opposite direction
of the gradient, the value of the objective function might be reduced. Gradient descent repeats this
update process until a solution that meets the requirements is obtained.
• Problems occur when the learning rate is too small or too large. A suitable learning rate is usually
found only after multiple experiments.
• When there are more examples in the training data set, it costs more to compute each iteration for
gradient descent, so SGD is preferred in these cases.
Exercises
• Using a different objective function, observe the iterative trajectory of the independent variable in
gradient descent and the SGD.
• In the experiment for gradient descent in two-dimensional space, try to use different learning rates
to observe and analyze the experimental phenomena.
In each iteration, the gradient descent uses the entire training data set to compute the gradient, so it is
sometimes referred to as batch gradient descent. Stochastic gradient descent (SGD) only randomly select
one example in each iteration to compute the gradient. Just like in the previous chapters, we can perform
random uniform sampling for each iteration to form a mini-batch and then use this mini-batch to compute
the gradient. Now, we are going to discuss mini-batch stochastic gradient descent.
Set objective function f (x) : Rd → R. The time step before the start of iteration is set to 0. The
independent variable of this time step is x0 ∈ Rd and is usually obtained by random initialization. In
each subsequent time step t > 0, mini-batch SGD uses random uniform sampling to get a mini-batch Bt
made of example indices from the training data set. We can use sampling with replacement or sampling
without replacement to get a mini-batch example. The former method allows duplicate examples in the
same mini-batch, the latter does not and is more commonly used. We can use either of the two methods
1 ∑
g t ← ∇fBt (xt−1 ) = ∇fi (xt−1 )
|B|
i∈Bt
to compute the gradient g t of the objective function at xt−1 with mini-batch Bt at time step t. Here, |B|
is the size of the batch, which is the number of examples in the mini-batch. This is a hyper-parameter.
Just like the stochastic gradient, the mini-batch SGD g t obtained by sampling with replacement is also
the unbiased estimate of the gradient ∇f (xt−1 ). Given the learning rate ηt (positive), the iteration of the
mini-batch SGD on the independent variable is as follows:
xt ← xt−1 − ηt g t .
The variance of the gradient based on random sampling cannot be reduced during the iterative process, so
in practice, the learning rate of the (mini-batch) SGD can self-decay during the iteration, such as ηt = ηtα
(usually α = −1 or −0.5), ηt = ηαt (e.g α = 0.95), or learning rate decay once per iteration or after
several iterations. As a result, the variance of the learning rate and the (mini-batch) SGD will decrease.
Gradient descent always uses the true gradient of the objective function during the iteration, without the
need to self-decay the learning rate.
The cost for computing each iteration is O(|B|). When the batch size is 1, the algorithm is an SGD; when
the batch size equals the example size of the training data, the algorithm is a gradient descent. When the
batch size is small, fewer examples are used in each iteration, which will result in parallel processing and
In this chapter, we will use a data set developed by NASA to test the wing noise from different aircraft to
compare these optimization algorithms[1]. We will use the first 1500 examples of the data set, 5 features,
and a normalization method to preprocess the data.
In [1]: #!pip install matplotlib
In [2]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, nd
from [Link] import nn, data as gdata, loss as gloss
import numpy as np
import time
We have already implemented the mini-batch SGD algorithm in the Linear Regression Implemented From
Scratch section. We have made its input parameters more generic here, so that we can conveniently use the
same input for the other optimization algorithms introduced later in this chapter. Specifically, we add the
status input states and place the hyper-parameter in dictionary hyperparams. In addition, we will
average the loss of each mini-batch example in the training function, so the gradient in the optimization
algorithm does not need to be divided by the batch size.
In [3]: def sgd(params, states, hyperparams):
for p in params:
p[:] -= hyperparams['lr'] * [Link]
Next, we are going to implement a generic training function to facilitate the use of the other optimization
algorithms introduced later in this chapter. It initializes a linear regression model and can then be used to
train the model with the mini-batch SGD and other algorithms introduced in subsequent sections.
def eval_loss():
return loss(net(features, w, b), labels).mean().asscalar()
When the batch size equals 1500 (the total number of examples), we use gradient descent for optimization.
The model parameters will be iterated only once for each epoch of the gradient descent. As we can see,
the downward trend of the value of the objective function (training loss) flattened out after 6 iterations.
In [5]: def train_sgd(lr, batch_size, num_epochs=2):
return train_ch7(
sgd, None, {'lr': lr}, features, labels, batch_size, num_epochs)
Reduce the batch size to 10, the time for each epoch increases because the workload for each batch is less
efficient to execute.
In [8]: mini2_res = train_sgd(.05, 10)
loss: 0.247321, 0.231905 sec per epoch
Finally, we compare the time versus loss for the preview four experiments. As can be seen, despite SGD
converges faster than GD in terms of number of examples processed, it uses more time to reach the same
In Gluon, we can use the Trainer class to call optimization algorithms. Next, we are going to imple-
ment a generic training function that uses the optimization name trainer name and hyperparameter
trainer_hyperparameter to create the instance Trainer.
In [10]: # This function is saved in the d2l package for future use
def train_gluon_ch9(trainer_name, trainer_hyperparams, features, labels,
batch_size=10, num_epochs=2):
# Initialize model parameters
net = [Link]()
[Link]([Link](1))
[Link]([Link](sigma=0.01))
loss = gloss.L2Loss()
def eval_loss():
return loss(net(features), labels).mean().asscalar()
Summary
• Mini-batch stochastic gradient uses random uniform sampling to get a mini-batch training example
for gradient computation.
Exercises
• Modify the batch size and learning rate and observe the rate of decline for the value of the objective
function and the time consumed in each epoch.
• Read the MXNet documentation and use the Trainer class set_learning_rate function to
reduce the learning rate of the mini-batch SGD to 1/10 of its previous value after each epoch.
Reference
10.4 Momentum
In the Gradient Descent and Stochastic Gradient Descent section, we mentioned that the gradient of
the objective function’s independent variable represents the direction of the objective function’s fastest
descend at the current position of the independent variable. Therefore, gradient descent is also called
steepest descent. In each iteration, the gradient descends according to the current position of the indepen-
dent variable while updating the latter along the current position of the gradient. However, this can lead
to problems if the iterative direction of the independent variable relies exclusively on the current position
of the independent variable.
Now, we will consider an objective function f (x) = 0.1x21 + 2x22 , whose input and output are a two-
dimensional vector x = [x1 , x2 ] and a scalar, respectively. In contrast to the Gradient Descent and
Stochastic Gradient Descent section, here, the coefficient x21 is reduced from 1 to 0.1. We are going to
%matplotlib inline
import d2l
from mxnet import nd
eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.943467, x2 -0.000073
As we can see, at the same position, the slope of the objective function has a larger absolute value in
the vertical direction (x2 axis direction) than in the horizontal direction (x1 axis direction). Therefore,
given the learning rate, using gradient descent for interaction will cause the independent variable to move
more in the vertical direction than in the horizontal one. So we need a small learning rate to prevent
the independent variable from overshooting the optimal solution for the objective function in the vertical
direction. However, it will cause the independent variable to move slower toward the optimal solution in
the horizontal direction.
Now, we try to make the learning rate slightly larger, so the independent variable will continuously
overshoot the optimal solution in the vertical direction and gradually diverge.
In [2]: eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
The momentum method was proposed to solve the gradient descent problem described above. Since
mini-batch stochastic gradient descent is more general than gradient descent, the subsequent discussion
in this chapter will continue to use the definition for mini-batch stochastic gradient descent g t at time
step t given in the Mini-batch Stochastic Gradient Descent section. We set the independent variable at
time step t to xt and the learning rate to ηt . At time step 0, momentum creates the velocity variable v 0
and initializes its elements to zero. At time step t > 0, momentum modifies the steps of each iteration as
follows:
v t ← γv t−1 + ηt g t ,
xt ← xt−1 − v t ,
In order to understand the momentum method mathematically, we must first explain the exponentially
weighted moving average (EWMA). Given hyperparameter 0 ≤ γ < 1, the variable yt of the current time
step t is the linear combination of variable yt−1 from the previous time step t − 1 and another variable xt
of the current step.
yt = γyt−1 + (1 − γ)xt .
We can expand yt :
yt = (1 − γ)xt + γyt−1
= (1 − γ)xt + (1 − γ) · γxt−1 + γ 2 yt−2
= (1 − γ)xt + (1 − γ) · γxt−1 + (1 − γ) · γ 2 xt−2 + γ 3 yt−3
...
n
Let n = 1/(1 − γ), so (1 − 1/n) = γ 1/(1−γ) . Because
( )n
1
lim 1 − = exp(−1) ≈ 0.3679,
n→∞ n
when γ → 1, γ 1/(1−γ) = exp(−1). For example, 0.9520 ≈ exp(−1). If we treat exp(−1) as a relatively
small number, we can ignore all the terms that have γ 1/(1−γ) or coefficients of higher order than γ 1/(1−γ)
in them. For example, when γ = 0.95,
∑
19
yt ≈ 0.05 0.95i xt−i .
i=0
Therefore, in practice, we often treat yt as the weighted average of the xt values from the last 1/(1 − γ)
time steps. For example, when γ = 0.95, yt can be treated as the weighted average of the xt values from
the last 20 time steps; when γ = 0.9, yt can be treated as the weighted average of the xt values from the
last 10 time steps. Additionally, the closer the xt value is to the current time step t, the greater the value’s
weight (closer to 1).
By the form of EWMA, velocity variable v t is actually an EWMA of time series {ηt−i g t−i /(1 − γ) :
i = 0, . . . , 1/(1 − γ) − 1}. In other words, considering mini-batch SGD, the update of an independent
variable with momentum at each time step approximates the EWMA of the updates in the last 1/(1 − γ)
Compared with mini-batch SGD, the momentum method needs to maintain a velocity variable of the same
shape for each independent variable and a momentum hyperparameter is added to the hyperparameter
category. In the implementation, we use the state variable states to represent the velocity variable in a
more general sense.
In [5]: features, labels = d2l.get_data_ch7()
def init_momentum_states():
v_w = [Link](([Link][1], 1))
v_b = [Link](1)
return (v_w, v_b)
When we set the momentum hyperparameter momentum to 0.5, it can be treated as a mini-batch SGD:
the mini-batch gradient here is the weighted average of twice the mini-batch gradient of the last two time
steps.
In [6]: d2l.train_ch9(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.5}, features, labels)
loss: 0.247092, 0.212257 sec per epoch
We can see that the value change of the objective function is not smooth enough at later stages of iteration.
Intuitively, ten times the mini-batch gradient is five times larger than two times the mini-batch gradient,
so we can try to reduce the learning rate to 1/5 of its original value. Now, the value change of the objective
function becomes smoother after its period of decline.
In [8]: d2l.train_ch9(sgd_momentum, init_momentum_states(),
In Gluon, we only need to use momentum to define the momentum hyperparameter in the Trainer
instance to implement momentum.
In [9]: d2l.train_gluon_ch9('sgd', {'learning_rate': 0.004, 'momentum': 0.9},
features, labels)
loss: 0.243254, 0.178137 sec per epoch
• The momentum method uses the EWMA concept. It takes the weighted average of past time steps,
with weights that decay exponentially by the time step.
• Momentum makes independent variable updates for adjacent time steps more consistent in direc-
tion.
Exercises
• Use other combinations of momentum hyperparameters and learning rates and observe and analyze
the different experimental results.
10.5 Adagrad
In the optimization algorithms we introduced previously, each element of the objective function’s in-
dependent variables uses the same learning rate at the same time step for self-iteration. For example,
if we assume that the objective function is f and the independent variable is a two-dimensional vector
[x1 , x2 ]⊤ , each element in the vector uses the same learning rate when iterating. For example, in gradient
descent with the learning rate η, element x1 and x2 both use the same learning rate η for iteration:
∂f ∂f
x1 ← x1 − η , x2 ← x 2 − η .
∂x1 ∂x2
In the Momentum section, we can see that, when there is a big difference between the gradient values x1
and x2 , a sufficiently small learning rate needs to be selected so that the independent variable will not
diverge in the dimension of larger gradient values. However, this will cause the independent variables to
iterate too slowly in the dimension with smaller gradient values. The momentum method relies on the
exponentially weighted moving average (EWMA) to make the direction of the independent variable more
consistent, thus reducing the possibility of divergence. In this section, we are going to introduce Adagrad,
an algorithm that adjusts the learning rate according to the gradient value of the independent variable in
each dimension to eliminate problems caused when a unified learning rate has to adapt to all dimensions.
The Adadelta algorithm uses the cumulative variable st obtained from a square by element operation on
the mini-batch stochastic gradient g t . At time step 0, Adagrad initializes each element in s0 to 0. At time
step t, we first sum the results of the square by element operation for the mini-batch gradient g t to get the
variable st :
st ← st−1 + g t ⊙ g t ,
Here, ⊙ is the symbol for multiplication by element. Next, we re-adjust the learning rate of each element
in the independent variable of the objective function using element operations:
η
xt ← xt−1 − √ ⊙ gt ,
st + ϵ
Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−6 .
Here, the square root, division, and multiplication operations are all element operations. Each element in
the independent variable of the objective function will have its own learning rate after the operations by
elements.
10.5.2 Features
We should emphasize that the cumulative variable st produced by a square by element operation on the
mini-batch stochastic gradient is part of the learning rate denominator. Therefore, if an element in the
independent variable of the objective function has a constant and large partial derivative, the learning rate
of this element will drop faster. On the contrary, if the partial derivative of such an element remains small,
then its learning rate will decline more slowly. However, since st accumulates the square by element
gradient, the learning rate of each element in the independent variable declines (or remains unchanged)
during iteration. Therefore, when the learning rate declines very fast during early iteration, yet the current
solution is still not desirable, Adagrad might have difficulty finding a useful solution because the learning
rate will be too small at later stages of iteration.
Below we will continue to use the objective function f (x) = 0.1x21 + 2x22 as an example to observe the
iterative trajectory of the independent variable in Adagrad. We are going to implement Adagrad using
the same learning rate as the experiment in last section, 0.4. As we can see, the iterative trajectory of
the independent variable is smoother. However, due to the cumulative effect of st , the learning rate
continuously decays, so the independent variable does not move as much during later stages of iteration.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
import math
from mxnet import nd
eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -2.382563, x2 -0.158591
Now, we are going to increase the learning rate to 2. As we can see, the independent variable approaches
the optimal solution more quickly.
In [2]: eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -0.002295, x2 -0.000000
Like the momentum method, Adagrad needs to maintain a state variable of the same shape for each
independent variable. We use the formula from the algorithm to implement Adagrad.
In [3]: features, labels = d2l.get_data_ch7()
def init_adagrad_states():
s_w = [Link](([Link][1], 1))
s_b = [Link](1)
return (s_w, s_b)
Compared with the experiment in the Mini-Batch Stochastic Gradient Descent section, here, we use a
larger learning rate to train the model.
In [4]: d2l.train_ch9(adagrad, init_adagrad_states(), {'lr': 0.1}, features, labels)
loss: 0.242672, 0.314801 sec per epoch
Using the Trainer instance of the algorithm named adagrad, we can implement the Adagrad algorithm
with Gluon to train models.
In [5]: d2l.train_gluon_ch9('adagrad', {'learning_rate': 0.1}, features, labels)
loss: 0.244517, 0.391792 sec per epoch
• Adagrad constantly adjusts the learning rate during iteration to give each element in the independent
variable of the objective function its own learning rate.
• When using Adagrad, the learning rate of each element in the independent variable decreases (or
remains unchanged) during iteration.
Exercises
• When introducing the features of Adagrad, we mentioned a potential problem. What solutions can
you think of to fix this problem?
• Try to use other initial learning rates in the experiment. How does this change the results?
Reference
[1] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
10.6 RMSProp
In the experiment in the Adagrad section, the learning rate of each element in the independent variable
of the objective function declines (or remains unchanged) during iteration because the variable st in
the denominator is increased by the square by element operation of the mini-batch stochastic gradient,
adjusting the learning rate. Therefore, when the learning rate declines very fast during early iteration, yet
the current solution is still not desirable, Adagrad might have difficulty finding a useful solution because
the learning rate will be too small at later stages of iteration. To tackle this problem, the RMSProp
algorithm made a small modification to Adagrad[1].
We introduced EWMA (exponentially weighted moving average) in the Momentum section. Unlike in
Adagrad, the state variable st is the sum of the square by element all the mini-batch stochastic gradients
g t up to the time step t, RMSProp uses the EWMA on the square by element results of these gradients.
Specifically, given the hyperparameter 0 ≤ γ < 1, RMSProp is computed at time step t > 0.
st ← γst−1 + (1 − γ)g t ⊙ g t .
Like Adagrad, RMSProp re-adjusts the learning rate of each element in the independent variable of the
objective function with element operations and then updates the independent variable.
η
xt ← xt−1 − √ ⊙ gt ,
st + ϵ
Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−6 .
Because the state variable of RMSProp is an EWMA of the squared term g t ⊙ g t , it can be seen as the
weighted average of the mini-batch stochastic gradient’s squared terms from the last 1/(1 − γ) time steps.
Therefore, the learning rate of each element in the independent variable will not always decline (or remain
unchanged) during iteration.
By convention, we will use the objective function f (x) = 0.1x21 + 2x22 to observe the iterative trajectory
of the independent variable in RMSProp. Recall that in the Adagrad section, when we used Adagrad with
a learning rate of 0.4, the independent variable moved less in later stages of iteration. However, at the
same learning rate, RMSProp can approach the optimal solution faster.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
import math
from mxnet import nd
We set the initial learning rate to 0.01 and the hyperparameter γ to 0.9. Now, the variable st can be treated
as the weighted average of the square term g t ⊙ g t from the last 1/(1 − 0.9) = 10 time steps.
In [3]: d2l.train_ch9(rmsprop, init_rmsprop_states(), {'lr': 0.01, 'gamma': 0.9},
features, labels)
loss: 0.247826, 0.375890 sec per epoch
From the Trainer instance of the algorithm named rmsprop, we can implement the RMSProp algorithm
with Gluon to train models. Note that the hyperparameter γ is assigned by gamma1.
In [4]: d2l.train_gluon_ch9('rmsprop', {'learning_rate': 0.01, 'gamma1': 0.9},
features, labels)
loss: 0.246442, 0.223892 sec per epoch
• The difference between RMSProp and Adagrad is that RMSProp uses an EWMA on the squares of
elements in the mini-batch stochastic gradient to adjust the learning rate.
Exercises
Reference
[1] Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26-31.
10.7 Adadelta
In addition to RMSProp, Adadelta is another common optimization algorithm that helps improve the
chances of finding useful solutions at later stages of iteration, which is difficult to do when using the
Adagrad algorithm for the same purpose[1]. The interesting thing is that there is no learning rate hyper-
parameter in the Adadelta algorithm.
Like RMSProp, the Adadelta algorithm uses the variable st , which is an EWMA on the squares of
elements in mini-batch stochastic gradient g t . At time step 0, all the elements are initialized to 0. Given
the hyperparameter 0 ≤ ρ < 1 (counterpart of γ in RMSProp), at time step t > 0, compute using the
same method as RMSProp:
st ← ρst−1 + (1 − ρ)g t ⊙ g t .
Here, ϵ is a constant added to maintain the numerical stability, such as 10−5 . Next, we update the
independent variable:
xt ← xt−1 − g ′t .
Finally, we use ∆x to record the EWMA on the squares of elements in g ′ , which is the variation of the
independent variable.
Adadelta needs to maintain two state variables for each independent variable, st and ∆xt . We use the
formula from the algorithm to implement Adadelta.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
from mxnet import nd
def init_adadelta_states():
s_w, s_b = [Link](([Link][1], 1)), [Link](1)
delta_w, delta_b = [Link](([Link][1], 1)), [Link](1)
return ((s_w, delta_w), (s_b, delta_b))
From the Trainer instance for the algorithm named adadelta, we can implement Adadelta in Gluon. Its
hyperparameters can be specified by rho.
In [3]: d2l.train_gluon_ch9('adadelta', {'rho': 0.9}, features, labels)
loss: 0.244886, 0.572499 sec per epoch
• Adadelta has no learning rate hyperparameter, it uses an EWMA on the squares of elements in the
variation of the independent variable to replace the learning rate.
Exercises
Reference
[1] Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
10.8 Adam
Created on the basis of RMSProp, Adam also uses EWMA on the mini-batch stochastic gradient[1].
Here, we are going to introduce this algorithm.
Adam uses the momentum variable v t and variable st , which is an EWMA on the squares of elements
in the mini-batch stochastic gradient from RMSProp, and initializes each element of the variables to 0 at
time step 0. Given the hyperparameter 0 ≤ β1 < 1 (the author of the algorithm suggests a value of 0.9),
the momentum variable v t at time step t is the EWMA of the mini-batch stochastic gradient g t :
v t ← β1 v t−1 + (1 − β1 )g t .
Just as in RMSProp, given the hyperparameter 0 ≤ β2 < 1 (the author of the algorithm suggests a value
of 0.999), After taken the squares of elements in the mini-batch stochastic gradient, find g t ⊙ g t and
st ← β2 st−1 + (1 − β2 )g t ⊙ g t .
∑t
Since we initialized elements in v 0 and s0 to 0, we get v t = (1−β1 ) i=1 β1t−i g i at time step t. Sum the
∑t
mini-batch stochastic gradient weights from each previous time step to get (1 − β1 ) i=1 β1t−i = 1 − β1t .
Notice that when t is small, the sum of the mini-batch stochastic gradient weights from each previous
time step will be small. For example, when β1 = 0.9, v 1 = 0.1g 1 . To eliminate this effect, for any time
step t, we can divide v t by 1 − β1t , so that the sum of the mini-batch stochastic gradient weights from
each previous time step is 1. This is also called bias correction. In the Adam algorithm, we perform bias
corrections for variables v t and st :
vt
v̂ t ← ,
1 − β1t
st
ŝt ← .
1 − β2t
Next, the Adam algorithm will use the bias-corrected variables v̂ t and ŝt from above to re-adjust the
learning rate of each element in the model parameters using element operations.
ηv̂ t
g ′t ← √ ,
ŝt + ϵ
Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−8 .
Just as for Adagrad, RMSProp, and Adadelta, each element in the independent variable of the objective
function has its own learning rate. Finally, use g ′t to iterate the independent variable:
xt ← xt−1 − g ′t .
We use the formula from the algorithm to implement Adam. Here, time step t uses hyperparams to
input parameters to the adam function.
In [1]: import sys
[Link](0, '..')
%matplotlib inline
import d2l
from mxnet import nd
def init_adam_states():
v_w, v_b = [Link](([Link][1], 1)), [Link](1)
s_w, s_b = [Link](([Link][1], 1)), [Link](1)
return ((v_w, s_w), (v_b, s_b))
From the Trainer instance of the algorithm named adam, we can implement Adam with Gluon.
In [3]: d2l.train_gluon_ch9('adam', {'learning_rate': 0.01}, features, labels)
loss: 0.246612, 0.194802 sec per epoch
• Created on the basis of RMSProp, Adam also uses EWMA on the mini-batch stochastic gradient
• Adam uses bias correction.
Exercises
• Adjust the learning rate and observe and analyze the experimental results.
• Some people say that Adam is a combination of RMSProp and momentum. Why do you think they
say this?
Reference
[1] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.