0% found this document useful (0 votes)

73 views46 pages

Deep Learning Optimization Techniques

Uploaded by

Bao Huynh Long Thien

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

73 views46 pages

Deep Learning Optimization Techniques

Uploaded by

Bao Huynh Long Thien

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

10

Optimization Algorithms

If you read the book in sequence up to this point you already used a number of advanced optimization
algorithms to train deep learning models. They were the tools that allowed us to continue updating model
parameters and to minimize the value of the loss function, as evaluated on the training set. Indeed,
anyone content with treating optimization as a black box device to minimize objective functions in a
simple setting might well content oneself with the knowledge that there exists an array of incantations of
such a procedure (with names such as Adam’, NAG’, or SGD’).
To do well, however, some deeper knowledge is required. Optimization algorithms are important for deep
learning. On one hand, training a complex deep learning model can take hours, days, or even weeks. The
performance of the optimization algorithm directly affects the model’s training efficiency. On the other
hand, understanding the principles of different optimization algorithms and the role of their parameters
will enable us to tune the hyperparameters in a targeted manner to improve the performance of deep
learning models.
In this chapter, we explore common deep learning optimization algorithms in depth. Almost all optimiza-
tion problems arising in deep learning are nonconvex. Nonetheless, the design and analysis of algorithms
in the context of convex problems has proven to be very instructive. It is for that reason that this section
includes a primer on convex optimization and the proof for a very simple stochastic gradient descent
algorithm on a convex objective function.

419
10.1 Optimization and Deep Learning

In this section, we will discuss the relationship between optimization and deep learning as well as the
challenges of using optimization in deep learning. For a deep learning problem, we will usually define
a loss function first. Once we have the loss function, we can use an optimization algorithm in attempt
to minimize the loss. In optimization, a loss function is often referred to as the objective function of
the optimization problem. By tradition and convention most optimization algorithms are concerned with
minimization. If we ever need to maximize an objective there’s a simple solution - just flip the sign on the
objective.

10.1.1 Optimization and Estimation

Although optimization provides a way to minimize the loss function for deep learning, in essence, the
goals of optimization and deep learning are fundamentally different. The former is primarily concerned
with minimizing an objective whereas the latter is concerned with finding a suitable model, given a
finite amount of data. In the section on Model Selection, Underfitting and Overfitting we discussed the
difference between these two goals in detail. For instance, training error and generalization error generally
differ: since the objective function of the optimization algorithm is usually a loss function based on the
training data set, the goal of optimization is to reduce the training error. However, the goal of statistical
inference (and thus of deep learning) is to reduce the generalization error. To accomplish the latter we
need to pay attention to overfitting in addition to using the optimization algorithm to reduce the training
error. We begin by importing a few libraries.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
from mpl_toolkits import mplot3d
import numpy as np

The graph below illustrates the issue in some more detail. Since we have only a finite amount of data the
minimum of the training error may be at a different location than the minimum of the expected error (or
of the test error).
In [2]: def f(x): return x * [Link]([Link] * x)
def g(x): return f(x) + 0.2 * [Link](5 * [Link] * x)

d2l.set_figsize((4.5, 2.5))
x = [Link](0.5, 1.5, 0.01)
fig, = [Link](x, f(x))
fig, = [Link](x, g(x))
[Link]('empirical risk', xy=(1.0, -1.2), xytext=(0.5, -1.1),
arrowprops=dict(arrowstyle='->'))
[Link]('expected risk', xy=(1.1, -1.05), xytext=(0.95, -0.5),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('risk');

420 10. Optimization Algorithms

10.1.2 Optimization Challenges in Deep Learning

In this chapter, we are going to focus specifically on the performance of the optimization algorithm in
minimizing the objective function, rather than a model’s generalization error. In the section on Linear
Regression we distinguished between analytical solutions and numerical solutions in optimization prob-
lems. In deep learning, most objective functions are complicated and do not have analytical solutions.
Instead, we must use numerical optimization algorithms. The optimization algorithms below all fall into
this category.
There are many challenges in deep learning optimization. Some of the most vexing ones are local minima,
saddle points and vanishing gradients. Let’s have a look at a few of them.

Local Minima

For the objective function f (x), if the value of f (x) at x is smaller than the values of f (x) at any other
points in the vicinity of x, then f (x) could be a local minimum. If the value of f (x) at x is the minimum
of the objective function over the entire domain, then f (x) is the global minimum.
For example, given the function

f (x) = x · cos(πx) for − 1.0 ≤ x ≤ 2.0,

we can approximate the local minimum and global minimum of this function.
In [3]: x = [Link](-1.0, 2.0, 0.01)
fig, = [Link](x, f(x))
[Link]('local minimum', xy=(-0.3, -0.25), xytext=(-0.77, -1.0),
arrowprops=dict(arrowstyle='->'))
[Link]('global minimum', xy=(1.1, -0.95), xytext=(0.6, 0.8),

10.1. Optimization and Deep Learning 421

arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('f(x)');

The objective function of deep learning models usually has many local optima. When the numerical
solution of an optimization problem is near the local optimum, the numerical solution obtained by the
final iteration may only minimize the objective function locally, rather than globally, as the gradient of the
objective function’s solutions approaches or becomes zero. Only some degree of noise might knock the
parameter out of the local minimum. In fact, this is one of the beneficial properties of stochastic gradient
descent where the natural variation of gradients over minibatches is able to dislodge the parameters from
local minima.

Saddle Points

Besides local minima, saddle points are another reason for gradients to vanish. A saddle point is any
location where all gradients of a function vanish but which is neither a global nor a local minimum.
Consider the function f (x) = x3 . Its first and second derivative vanish for x = 0. Optimization might
stall at the point, even though it is not a minimum.
In [4]: x = [Link](-2.0, 2.0, 0.01)
fig, = [Link](x, x**3)
[Link]('saddle point', xy=(0, -0.2), xytext=(-0.52, -5.0),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('f(x)');

422 10. Optimization Algorithms

Saddle points in higher dimensions are even more insidious, as the example below shows. Consider the
function f (x, y) = x2 − y 2 . It has its saddle point at (0, 0). This is a maximum with respect to y and a
minimum with respect to x. Moreover, it looks like a saddle, which is where this mathematical property
got its name.
In [5]: x, y = [Link][-1: 1: 101j, -1: 1: 101j]

z = x**2 - y**2

ax = [Link]().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
[Link]([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
[Link](ticks)
[Link](ticks)
ax.set_zticks(ticks)
[Link]('x')
[Link]('y');

10.1. Optimization and Deep Learning 423

We assume that the input of a function is a k-dimensional vector and its output is a scalar, so its Hessian
matrix will have k eigenvalues (see the Mathematical Foundation). The solution of the function could be
a local minimum, a local maximum, or a saddle point at a position where the function gradient is zero:
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all positive,
we have a local minimum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are all negative,
we have a local maximum for the function.
• When the eigenvalues of the function’s Hessian matrix at the zero-gradient position are negative
and positive, we have a saddle point for the function.
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative is quite
high. This makes saddle points more likely than local minima. We will discuss some exceptions to this
situation in the next section when introducing convexity. In short, convex functions are those where the
eigenvalues of the Hessian are never negative. Sadly, though, most deep learning problems do not fall
into this category. Nonetheless it’s a great tool to study optimization algorithms.

Vanishing Gradients

Probably the most insidious problem to encounter are vanishing gradients. For instance, assume that we
want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4. As we can see,
the gradient of f is close to nil. More specifically f ′ (x) = 1 − tanh2 (x) and thus f ′ (4) = 0.0013.
Consequently optimization will get stuck for a long time before we make progress. This turns out to be
one of the reasons that training deep learning models was quite tricky prior to the introduction of the
ReLu activation function.
In [6]: x = [Link](-2.0, 5.0, 0.01)
fig, = [Link](x, [Link](x))
[Link]('vanishing gradient', xy=(4, 1), xytext=(2, 0.0),
arrowprops=dict(arrowstyle='->'))
[Link]('x')
[Link]('f(x)');

424 10. Optimization Algorithms

As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust range
of algorithms that perform well and that are easy to use even for beginners. Furthermore, it isn’t really
necessary to find the best solution. Local optima or even approximate solutions thereof are still very
useful.

Summary

• Minimizing the training error does not guarantee that we find the best set of parameters to minimize
the expected error.
• The optimization problems may have many local minima.
• The problem may have even more saddle points, as generally the problems are not convex.
• Vanishing gradients can cause optimization to stall. Often a reparametrization of the problem helps.
Good initialization of the parameters can be beneficial, too.

Exercises

1. Consider a simple multilayer perceptron with a single hidden layer of, say, d dimensions in the
hidden layer and a single output. Show that for any local minimum there are at least d! equivalent
solutions that behave identically.
2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are each drawn
from some probability distribution pij . Furthermore assume that pij (x) = pij (−x), i.e. that the
distribution is symmetric (see e.g. [1] for details).
• Prove that the distribution over eigenvalues is also symmetric. That is, for any eigenvector v
the probability that the associated eigenvalue λ satisfies Pr(λ > 0) = Pr(λ < 0).

10.1. Optimization and Deep Learning 425

• Why does the above not imply Pr(λ > 0) = 0.5?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
• Why is this hard?
• Can you exploit this effect also for optimization algorithms?

References

[1] Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of
Mathematics, 325-327.

Scan the QR Code to Discuss

10.2 Gradient Descent and Stochastic Gradient Descent

In this section, we are going to introduce the basic principles of gradient descent. Although it is not
common for gradient descent to be used directly in deep learning, an understanding of gradients and
the reason why the value of an objective function might decline when updating the independent variable
along the opposite direction of the gradient is the foundation for future studies on optimization algorithms.
Next, we are going to introduce stochastic gradient descent (SGD).

10.2.1 Gradient Descent in One-Dimensional Space

Here, we will use a simple gradient descent in one-dimensional space as an example to explain why the
gradient descent algorithm may reduce the value of the objective function. We assume that the input
and output of the continuously differentiable function f : R → R are both scalars. Given ϵ with a
small enough absolute value, according to the Taylor’s expansion formula from the Mathematical basics
section, we get the following approximation:

f (x + ϵ) ≈ f (x) + ϵf ′ (x).

426 10. Optimization Algorithms

Here, f ′ (x) is the gradient of function f at x. The gradient of a one-dimensional function is a scalar, also
known as a derivative.
Next, find a constant η > 0, to make |ηf ′ (x)| sufficiently small so that we can replace ϵ with −ηf ′ (x)
and get

f (x − ηf ′ (x)) ≈ f (x) − ηf ′ (x)2 .

If the derivative f ′ (x) ̸= 0, then ηf ′ (x)2 > 0, so

f (x − ηf ′ (x)) ≲ f (x).

This means that, if we use

x ← x − ηf ′ (x)

to iterate x, the value of function f (x) might decline. Therefore, in the gradient descent, we first choose
an initial value x and a constant η > 0 and then use them to continuously iterate x until the stop condition
is reached, for example, when the value of f ′ (x)2 is small enough or the number of iterations has reached
a certain value.
Now we will use the objective function f (x) = x2 as an example to see how gradient descent is imple-
mented. Although we know that x = 0 is the solution to minimize f (x), here we still use this simple
function to observe how x is iterated. First, import the packages or modules required for the experiment
in this section.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
import math
from mxnet import nd
import numpy as np

Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to iterate x 10
times, we can see that, eventually, the value of x approaches the optimal solution.
In [2]: def gd(eta):
x = 10
results = [x]
for i in range(10):
x -= eta * 2 * x # f(x) = x* the derivative of x is f'(x) = 2 * x
[Link](x)
print('epoch 10, x:', x)
return results

res = gd(0.2)
epoch 10, x: 0.06046617599999997

The iterative trajectory of the independent variable x is plotted as follows.

10.2. Gradient Descent and Stochastic Gradient Descent 427

In [3]: def show_trace(res):
n = max(abs(min(res)), abs(max(res)), 10)
f_line = [Link](-n, n, 0.1)
d2l.set_figsize()
[Link](f_line, [x * x for x in f_line])
[Link](res, [x * x for x in res], '-o')
[Link]('x')
[Link]('f(x)')

show_trace(res)

10.2.2 Learning Rate

The positive η in the above gradient descent algorithm is usually called the learning rate. This is a hyper-
parameter and needs to be set manually. If we use a learning rate that is too small, it will cause x to
update at a very slow speed, requiring more iterations to get a better solution. Here, we have the iterative
trajectory of the independent variable x with the learning rate η = 0.05. As we can see, after iterating 10
times when the learning rate is too small, there is still a large deviation between the final value of x and
the optimal solution.
In [4]: show_trace(gd(0.05))
epoch 10, x: 3.4867844009999995

428 10. Optimization Algorithms

If we use an excessively high learning rate, |ηf ′ (x)| might be too large for the first-order Taylor expansion
formula mentioned above to hold. In this case, we cannot guarantee that the iteration of x will be able to
lower the value of f (x). For example, when we set the learning rate to η = 1.1, x overshoots the optimal
solution x = 0 and gradually diverges.
In [5]: show_trace(gd(1.1))
epoch 10, x: 61.917364224000096

10.2.3 Gradient Descent in Multi-Dimensional Space

Now that we understand gradient descent in one-dimensional space, let us consider a more general case:
the input of the objective function is a vector and the output is a scalar. We assume that the input of the

10.2. Gradient Descent and Stochastic Gradient Descent 429

target function f : Rd → R is the d-dimensional vector x = [x1 , x2 , . . . , xd ]⊤ . The gradient of the
objective function f (x) with respect to x is a vector consisting of d partial derivatives:
[ ]⊤
∂f (x) ∂f (x) ∂f (x)
∇x f (x) = , ,..., .
∂x1 ∂x2 ∂xd

For brevity, we use ∇f (x) instead of ∇x f (x). Each partial derivative element ∂f (x)/∂xi in the gradient
indicates the rate of change of f at x with respect to the input xi . To measure the rate of change of f in
the direction of the unit vector u (∥u∥ = 1), in multivariate calculus, the directional derivative of f at x
in the direction of u is defined as
f (x + hu) − f (x)
Du f (x) = lim .
h→0 h
According to the property of directional derivatives [1Chapter 14.6 Theorem 3], the aforementioned di-
rectional derivative can be rewritten as

Du f (x) = ∇f (x) · u.

The directional derivative Du f (x) gives all the possible rates of change for f along x. In order to
minimize f , we hope to find the direction that will allow us to reduce f in the fastest way. Therefore, we
can use the unit vector u to minimize the directional derivative Du f (x).
For Du f (x) = ∥∇f (x)∥ · ∥u∥ · cos(θ) = ∥∇f (x)∥ · cos(θ), Here, θ is the angle between the gradient
∇f (x) and the unit vector u. When θ = π, cos (θ) gives us the minimum value −1. So when u is in a
direction that is opposite to the gradient direction ∇f (x), the direction derivative Du f (x) is minimized.
Therefore, we may continue to reduce the value of objective function f by the gradient descent algorithm:
x ← x − η∇f (x).
Similarly, η (positive) is called the learning rate.
Now we are going to construct an objective function f (x) = x21 + 2x22 with a two-dimensional vector
x = [x1 , x2 ]⊤ as input and a scalar as the output. So we have the gradient ∇f (x) = [2x1 , 4x2 ]⊤ . We
will observe the iterative trajectory of independent variable x by gradient descent from the initial position
[−5, −2]. First, we are going to define two helper functions. The first helper uses the given independent
variable update function to iterate independent variable x a total of 20 times from the initial position
[−5, −2]. The second helper will visualize the iterative trajectory of independent variable x.
In [6]: # This function is saved in the d2l package for future use
def train_2d(trainer):
# s1 and s2 are states of the independent variable and will be used later
# in the chapter
x1, x2, s1, s2 = -5, -2, 0, 0
results = [(x1, x2)]
for i in range(20):
x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
[Link]((x1, x2))
print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
return results

430 10. Optimization Algorithms

# This function is saved in the d2l package for future use
def show_trace_2d(f, results):
[Link](*zip(*results), '-o', color='#ff7f0e')
x1, x2 = [Link]([Link](-5.5, 1.0, 0.1), [Link](-3.0, 1.0, 0.1))
[Link](x1, x2, f(x1, x2), colors='#1f77b4')
[Link]('x1')
[Link]('x2')

Next, we observe the iterative trajectory of the independent variable at learning rate 0.1. After iterating
the independent variable x 20 times using gradient descent, we can see that. eventually, the value of x
approaches the optimal solution [0, 0].
In [7]: eta = 0.1

def f_2d(x1, x2): # Objective function

return x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):

return (x1 - eta * 2 * x1, x2 - eta * 4 * x2, 0, 0)

show_trace_2d(f_2d, train_2d(gd_2d))
epoch 20, x1 -0.057646, x2 -0.000073

10.2.4 Stochastic Gradient Descent (SGD)

In deep learning, the objective function is usually the average of the loss functions for each example in the
training data set. We assume that fi (x) is the loss function of the training data instance with n examples,
an index of i, and parameter vector of x, then we have the objective function

1∑
n
f (x) = fi (x).
n i=1

10.2. Gradient Descent and Stochastic Gradient Descent 431

The gradient of the objective function at x is computed as

1∑
n
∇f (x) = ∇fi (x).
n i=1

If gradient descent is used, the computing cost for each independent variable iteration is O(n), which
grows linearly with n. Therefore, when the model training data instance is large, the cost of gradient
descent for each iteration will be very high.
Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration of
stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data instances at random,
and compute the gradient ∇fi (x) to update x:

x ← x − η∇fi (x).

Here, η is the learning rate. We can see that the computing cost for each iteration drops from O(n) of
the gradient descent to the constant O(1). We should mention that the stochastic gradient ∇fi (x) is the
unbiased estimate of gradient ∇f (x).

1∑
n
Ei ∇fi (x) = ∇fi (x) = ∇f (x).
n i=1

This means that, on average, the stochastic gradient is a good estimate of the gradient.
Now, we will compare it to gradient descent by adding random noise with a mean of 0 to the gradient to
simulate a SGD.
In [8]: def sgd_2d(x1, x2, s1, s2):
return (x1 - eta * (2 * x1 + [Link](0.1)),
x2 - eta * (4 * x2 + [Link](0.1)), 0, 0)

show_trace_2d(f_2d, train_2d(sgd_2d))
epoch 20, x1 -0.294433, x2 -0.176149

432 10. Optimization Algorithms

As we can see, the iterative trajectory of the independent variable in the SGD is more tortuous than in
the gradient descent. This is due to the noise added in the experiment, which reduced the accuracy of
the simulated stochastic gradient. In practice, such noise usually comes from individual examples in the
training data set.

Summary

• If we use a more suitable learning rate and update the independent variable in the opposite direction
of the gradient, the value of the objective function might be reduced. Gradient descent repeats this
update process until a solution that meets the requirements is obtained.
• Problems occur when the learning rate is too small or too large. A suitable learning rate is usually
found only after multiple experiments.
• When there are more examples in the training data set, it costs more to compute each iteration for
gradient descent, so SGD is preferred in these cases.

Exercises

• Using a different objective function, observe the iterative trajectory of the independent variable in
gradient descent and the SGD.
• In the experiment for gradient descent in two-dimensional space, try to use different learning rates
to observe and analyze the experimental phenomena.

10.2. Gradient Descent and Stochastic Gradient Descent 433

Scan the QR Code to Discuss

10.3 Mini-batch Stochastic Gradient Descent

In each iteration, the gradient descent uses the entire training data set to compute the gradient, so it is
sometimes referred to as batch gradient descent. Stochastic gradient descent (SGD) only randomly select
one example in each iteration to compute the gradient. Just like in the previous chapters, we can perform
random uniform sampling for each iteration to form a mini-batch and then use this mini-batch to compute
the gradient. Now, we are going to discuss mini-batch stochastic gradient descent.
Set objective function f (x) : Rd → R. The time step before the start of iteration is set to 0. The
independent variable of this time step is x0 ∈ Rd and is usually obtained by random initialization. In
each subsequent time step t > 0, mini-batch SGD uses random uniform sampling to get a mini-batch Bt
made of example indices from the training data set. We can use sampling with replacement or sampling
without replacement to get a mini-batch example. The former method allows duplicate examples in the
same mini-batch, the latter does not and is more commonly used. We can use either of the two methods
1 ∑
g t ← ∇fBt (xt−1 ) = ∇fi (xt−1 )
|B|
i∈Bt

to compute the gradient g t of the objective function at xt−1 with mini-batch Bt at time step t. Here, |B|
is the size of the batch, which is the number of examples in the mini-batch. This is a hyper-parameter.
Just like the stochastic gradient, the mini-batch SGD g t obtained by sampling with replacement is also
the unbiased estimate of the gradient ∇f (xt−1 ). Given the learning rate ηt (positive), the iteration of the
mini-batch SGD on the independent variable is as follows:

xt ← xt−1 − ηt g t .

The variance of the gradient based on random sampling cannot be reduced during the iterative process, so
in practice, the learning rate of the (mini-batch) SGD can self-decay during the iteration, such as ηt = ηtα
(usually α = −1 or −0.5), ηt = ηαt (e.g α = 0.95), or learning rate decay once per iteration or after
several iterations. As a result, the variance of the learning rate and the (mini-batch) SGD will decrease.
Gradient descent always uses the true gradient of the objective function during the iteration, without the
need to self-decay the learning rate.
The cost for computing each iteration is O(|B|). When the batch size is 1, the algorithm is an SGD; when
the batch size equals the example size of the training data, the algorithm is a gradient descent. When the
batch size is small, fewer examples are used in each iteration, which will result in parallel processing and

434 10. Optimization Algorithms

reduce the RAM usage efficiency. This makes it more time consuming to compute examples of the same
size than using larger batches. When the batch size increases, each mini-batch gradient may contain more
redundant information. To get a better solution, we need to compute more examples for a larger batch
size, such as increasing the number of epochs.

10.3.1 Reading Data

In this chapter, we will use a data set developed by NASA to test the wing noise from different aircraft to
compare these optimization algorithms[1]. We will use the first 1500 examples of the data set, 5 features,
and a normalization method to preprocess the data.
In [1]: #!pip install matplotlib
In [2]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, nd
from [Link] import nn, data as gdata, loss as gloss
import numpy as np
import time

# This function is saved in the d2l package for future use

def get_data_ch7():
data = [Link]('../data/airfoil_self_noise.dat', delimiter='\t')
data = (data - [Link](axis=0)) / [Link](axis=0)
return [Link](data[:1500, :-1]), [Link](data[:1500, -1])

features, labels = get_data_ch7()

[Link]
Out[2]: (1500, 5)

10.3.2 Implementation from Scratch

We have already implemented the mini-batch SGD algorithm in the Linear Regression Implemented From
Scratch section. We have made its input parameters more generic here, so that we can conveniently use the
same input for the other optimization algorithms introduced later in this chapter. Specifically, we add the
status input states and place the hyper-parameter in dictionary hyperparams. In addition, we will
average the loss of each mini-batch example in the training function, so the gradient in the optimization
algorithm does not need to be divided by the batch size.
In [3]: def sgd(params, states, hyperparams):
for p in params:
p[:] -= hyperparams['lr'] * [Link]

Next, we are going to implement a generic training function to facilitate the use of the other optimization
algorithms introduced later in this chapter. It initializes a linear regression model and can then be used to
train the model with the mini-batch SGD and other algorithms introduced in subsequent sections.

10.3. Mini-batch Stochastic Gradient Descent 435

In [4]: # This function is saved in the d2l package for future use
def train_ch7(trainer_fn, states, hyperparams, features, labels,
batch_size=10, num_epochs=2):
# Initialize model parameters
net, loss = [Link], d2l.squared_loss
w = [Link](scale=0.01, shape=([Link][1], 1))
b = [Link](1)
w.attach_grad()
b.attach_grad()

def eval_loss():
return loss(net(features, w, b), labels).mean().asscalar()

ls, ts = [eval_loss()], [0,]

data_iter = [Link](
[Link](features, labels), batch_size, shuffle=True)
start = [Link]()
for _ in range(num_epochs):
for batch_i, (X, y) in enumerate(data_iter):
with [Link]():
l = loss(net(X, w, b), y).mean() # Average the loss
[Link]()
# Update model parameters
trainer_fn([w, b], states, hyperparams)
if (batch_i + 1) * batch_size % 10 == 0:
# Record the current training error for every 10 examples
[Link]([Link]() - start + ts[-1])
[Link](eval_loss())
start = [Link]()

# Print and plot the results.

print('loss: %f, %f sec per epoch' % (ls[-1], ts[-1]/num_epochs))
d2l.set_figsize()
[Link]([Link](0, num_epochs, len(ls)), ls)
[Link]('epoch')
[Link]('loss')
return ts, ls

When the batch size equals 1500 (the total number of examples), we use gradient descent for optimization.
The model parameters will be iterated only once for each epoch of the gradient descent. As we can see,
the downward trend of the value of the objective function (training loss) flattened out after 6 iterations.
In [5]: def train_sgd(lr, batch_size, num_epochs=2):
return train_ch7(
sgd, None, {'lr': lr}, features, labels, batch_size, num_epochs)

gd_res = train_sgd(1, 1500, 6)

loss: 0.246095, 0.027745 sec per epoch

436 10. Optimization Algorithms

When the batch size equals 1, we use SGD for optimization. In order to simplify the implementation, we
did not self-decay the learning rate. Instead, we simply used a small constant for the learning rate in the
(mini-batch) SGD experiment. In SGD, the independent variable (model parameter) is updated whenever
an example is processed. Thus it is updated 1500 times in one epoch. As we can see, the decline in the
value of the objective function slows down after one epoch.
Although both the procedures processed 1500 examples within one epoch, SGD consumes more time
than gradient descent in our experiment. This is because SGD performed more iterations on the indepen-
dent variable within one epoch, and it is harder for single-example gradient computation to use parallel
computing effectively.
In [6]: sgd_res = train_sgd(0.005, 1)
loss: 0.242362, 2.050656 sec per epoch

10.3. Mini-batch Stochastic Gradient Descent 437

When the batch size equals 100, we use mini-batch SGD for optimization. The time required for one
epoch is between the time needed for gradient descent and SGD to complete the same epoch.
In [7]: mini1_res = train_sgd(.4, 100)
loss: 0.250407, 0.049163 sec per epoch

Reduce the batch size to 10, the time for each epoch increases because the workload for each batch is less
efficient to execute.
In [8]: mini2_res = train_sgd(.05, 10)
loss: 0.247321, 0.231905 sec per epoch

Finally, we compare the time versus loss for the preview four experiments. As can be seen, despite SGD
converges faster than GD in terms of number of examples processed, it uses more time to reach the same

438 10. Optimization Algorithms

loss than GD because that computing gradient example by example is not efficient. Mini-batch SGD is
able to trade-off the convergence speed and computation efficiency. Here, a batch size 10 improves SGD,
and a batch size 100 even outperforms GD.
In [9]: d2l.set_figsize([6, 3])
for res in [gd_res, sgd_res, mini1_res, mini2_res]:
[Link](res[0], res[1])
[Link]('time (sec)')
[Link]('loss')
[Link]('log')
[Link]([1e-3, 1])
[Link](['gd', 'sgd', 'batch size=100', 'batch size=10']);

10.3.3 Concise Implementation

In Gluon, we can use the Trainer class to call optimization algorithms. Next, we are going to imple-
ment a generic training function that uses the optimization name trainer name and hyperparameter
trainer_hyperparameter to create the instance Trainer.
In [10]: # This function is saved in the d2l package for future use
def train_gluon_ch9(trainer_name, trainer_hyperparams, features, labels,
batch_size=10, num_epochs=2):
# Initialize model parameters
net = [Link]()
[Link]([Link](1))
[Link]([Link](sigma=0.01))
loss = gloss.L2Loss()

def eval_loss():
return loss(net(features), labels).mean().asscalar()

10.3. Mini-batch Stochastic Gradient Descent 439

ls = [eval_loss()]
data_iter = [Link](
[Link](features, labels), batch_size, shuffle=True)
# Create the instance "Trainer" to update model parameter(s)
trainer = [Link](
net.collect_params(), trainer_name, trainer_hyperparams)
for _ in range(num_epochs):
start = [Link]()
for batch_i, (X, y) in enumerate(data_iter):
with [Link]():
l = loss(net(X), y)
[Link]()
# Average the gradient in the "Trainer" instance
[Link](batch_size)
if (batch_i + 1) * batch_size % 100 == 0:
[Link](eval_loss())
# Print and plot the results
print('loss: %f, %f sec per epoch' % (ls[-1], [Link]() - start))
d2l.set_figsize()
[Link]([Link](0, num_epochs, len(ls)), ls)
[Link]('epoch')
[Link]('loss')

Use Gluon to repeat the last experiment.

In [11]: train_gluon_ch9('sgd', {'learning_rate': 0.05}, features, labels, 10)
loss: 0.244754, 0.232388 sec per epoch

Summary

• Mini-batch stochastic gradient uses random uniform sampling to get a mini-batch training example
for gradient computation.

440 10. Optimization Algorithms

• In practice, learning rates of the (mini-batch) SGD can self-decay during iteration.
• In general, the time consumption per epoch for mini-batch stochastic gradient is between what
takes for gradient descent and SGD to complete the same epoch.

Exercises

• Modify the batch size and learning rate and observe the rate of decline for the value of the objective
function and the time consumed in each epoch.
• Read the MXNet documentation and use the Trainer class set_learning_rate function to
reduce the learning rate of the mini-batch SGD to 1/10 of its previous value after each epoch.

Reference

[1] Aircraft wing noise data set. [Link]

Scan the QR Code to Discuss

10.4 Momentum

In the Gradient Descent and Stochastic Gradient Descent section, we mentioned that the gradient of
the objective function’s independent variable represents the direction of the objective function’s fastest
descend at the current position of the independent variable. Therefore, gradient descent is also called
steepest descent. In each iteration, the gradient descends according to the current position of the indepen-
dent variable while updating the latter along the current position of the gradient. However, this can lead
to problems if the iterative direction of the independent variable relies exclusively on the current position
of the independent variable.

10.4.1 Exercises with Gradient Descent

Now, we will consider an objective function f (x) = 0.1x21 + 2x22 , whose input and output are a two-
dimensional vector x = [x1 , x2 ] and a scalar, respectively. In contrast to the Gradient Descent and
Stochastic Gradient Descent section, here, the coefficient x21 is reduced from 1 to 0.1. We are going to

10.4. Momentum 441

implement gradient descent based on this objective function, and demonstrate the iterative trajectory of
the independent variable using the learning rate 0.4.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
from mxnet import nd

eta = 0.4

def f_2d(x1, x2):

return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):

return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.943467, x2 -0.000073

As we can see, at the same position, the slope of the objective function has a larger absolute value in
the vertical direction (x2 axis direction) than in the horizontal direction (x1 axis direction). Therefore,
given the learning rate, using gradient descent for interaction will cause the independent variable to move
more in the vertical direction than in the horizontal one. So we need a small learning rate to prevent
the independent variable from overshooting the optimal solution for the objective function in the vertical
direction. However, it will cause the independent variable to move slower toward the optimal solution in
the horizontal direction.
Now, we try to make the learning rate slightly larger, so the independent variable will continuously
overshoot the optimal solution in the vertical direction and gradually diverge.
In [2]: eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))

442 10. Optimization Algorithms

epoch 20, x1 -0.387814, x2 -1673.365109

10.4.2 The Momentum Method

The momentum method was proposed to solve the gradient descent problem described above. Since
mini-batch stochastic gradient descent is more general than gradient descent, the subsequent discussion
in this chapter will continue to use the definition for mini-batch stochastic gradient descent g t at time
step t given in the Mini-batch Stochastic Gradient Descent section. We set the independent variable at
time step t to xt and the learning rate to ηt . At time step 0, momentum creates the velocity variable v 0
and initializes its elements to zero. At time step t > 0, momentum modifies the steps of each iteration as
follows:
v t ← γv t−1 + ηt g t ,
xt ← xt−1 − v t ,

Here, the momentum hyperparameter γ satisfies 0 ≤ γ < 1. When γ = 0, momentum is equivalent to a

mini-batch SGD.
Before explaining the mathematical principles behind the momentum method, we should take a look at
the iterative trajectory of the gradient descent after using momentum in the experiment.
In [3]: def momentum_2d(x1, x2, v1, v2):
v1 = gamma * v1 + eta * 0.2 * x1
v2 = gamma * v2 + eta * 4 * x2
return x1 - v1, x2 - v2, v1, v2

eta, gamma = 0.4, 0.5

d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 -0.062843, x2 0.001202

10.4. Momentum 443

As we can see, when using a smaller learning rate (η = 0.4) and momentum hyperparameter (γ = 0.5),
momentum moves more smoothly in the vertical direction and approaches the optimal solution faster in
the horizontal direction. Now, when we use a larger learning rate (η = 0.6), the independent variable will
no longer diverge.
In [4]: eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 0.007188, x2 0.002553

444 10. Optimization Algorithms

Exponentially Weighted Moving Average (EWMA)

In order to understand the momentum method mathematically, we must first explain the exponentially
weighted moving average (EWMA). Given hyperparameter 0 ≤ γ < 1, the variable yt of the current time
step t is the linear combination of variable yt−1 from the previous time step t − 1 and another variable xt
of the current step.

yt = γyt−1 + (1 − γ)xt .

We can expand yt :

yt = (1 − γ)xt + γyt−1
= (1 − γ)xt + (1 − γ) · γxt−1 + γ 2 yt−2
= (1 − γ)xt + (1 − γ) · γxt−1 + (1 − γ) · γ 2 xt−2 + γ 3 yt−3
...
n
Let n = 1/(1 − γ), so (1 − 1/n) = γ 1/(1−γ) . Because
( )n
1
lim 1 − = exp(−1) ≈ 0.3679,
n→∞ n

when γ → 1, γ 1/(1−γ) = exp(−1). For example, 0.9520 ≈ exp(−1). If we treat exp(−1) as a relatively
small number, we can ignore all the terms that have γ 1/(1−γ) or coefficients of higher order than γ 1/(1−γ)
in them. For example, when γ = 0.95,

∑
19
yt ≈ 0.05 0.95i xt−i .
i=0

Therefore, in practice, we often treat yt as the weighted average of the xt values from the last 1/(1 − γ)
time steps. For example, when γ = 0.95, yt can be treated as the weighted average of the xt values from
the last 20 time steps; when γ = 0.9, yt can be treated as the weighted average of the xt values from the
last 10 time steps. Additionally, the closer the xt value is to the current time step t, the greater the value’s
weight (closer to 1).

Understanding Momentum through EWMA

Now, we are going to deform the velocity variable of momentum:

( )
ηt
v t ← γv t−1 + (1 − γ) gt .
1−γ

By the form of EWMA, velocity variable v t is actually an EWMA of time series {ηt−i g t−i /(1 − γ) :
i = 0, . . . , 1/(1 − γ) − 1}. In other words, considering mini-batch SGD, the update of an independent
variable with momentum at each time step approximates the EWMA of the updates in the last 1/(1 − γ)

10.4. Momentum 445

time steps without momentum, divided by 1 − γ. Thus, with momentum, the movement size at each
direction not only depends on the current gradient, but also depends on whether the past gradients are
aligned at each direction. In the optimization problem mentioned earlier in this section, all the gradients
are positive in the horizontal direction (rightward), but are occasionally positive (up) or negative (down)
in the vertical direction. As a result, we can use a larger learning rate to allow the independent variable
move faster towards the optimum.

10.4.3 Implementation from Scratch

Compared with mini-batch SGD, the momentum method needs to maintain a velocity variable of the same
shape for each independent variable and a momentum hyperparameter is added to the hyperparameter
category. In the implementation, we use the state variable states to represent the velocity variable in a
more general sense.
In [5]: features, labels = d2l.get_data_ch7()

def init_momentum_states():
v_w = [Link](([Link][1], 1))
v_b = [Link](1)
return (v_w, v_b)

def sgd_momentum(params, states, hyperparams):

for p, v in zip(params, states):
v[:] = hyperparams['momentum'] * v + hyperparams['lr'] * [Link]
p[:] -= v

When we set the momentum hyperparameter momentum to 0.5, it can be treated as a mini-batch SGD:
the mini-batch gradient here is the weighted average of twice the mini-batch gradient of the last two time
steps.
In [6]: d2l.train_ch9(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.5}, features, labels)
loss: 0.247092, 0.212257 sec per epoch

446 10. Optimization Algorithms

When we increase the momentum hyperparameter momentum to 0.9, it can still be treated as a mini-batch
SGD: the mini-batch gradient here will be the weighted average of ten times the mini-batch gradient of
the last 10 time steps. Now we keep the learning rate at 0.02.
In [7]: d2l.train_ch9(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.9}, features, labels)
loss: 0.268377, 0.217876 sec per epoch

We can see that the value change of the objective function is not smooth enough at later stages of iteration.
Intuitively, ten times the mini-batch gradient is five times larger than two times the mini-batch gradient,
so we can try to reduce the learning rate to 1/5 of its original value. Now, the value change of the objective
function becomes smoother after its period of decline.
In [8]: d2l.train_ch9(sgd_momentum, init_momentum_states(),

10.4. Momentum 447

{'lr': 0.004, 'momentum': 0.9}, features, labels)
loss: 0.242917, 0.213023 sec per epoch

10.4.4 Concise Implementation

In Gluon, we only need to use momentum to define the momentum hyperparameter in the Trainer
instance to implement momentum.
In [9]: d2l.train_gluon_ch9('sgd', {'learning_rate': 0.004, 'momentum': 0.9},
features, labels)
loss: 0.243254, 0.178137 sec per epoch

448 10. Optimization Algorithms

Summary

• The momentum method uses the EWMA concept. It takes the weighted average of past time steps,
with weights that decay exponentially by the time step.
• Momentum makes independent variable updates for adjacent time steps more consistent in direc-
tion.

Exercises

• Use other combinations of momentum hyperparameters and learning rates and observe and analyze
the different experimental results.

Scan the QR Code to Discuss

10.5 Adagrad

In the optimization algorithms we introduced previously, each element of the objective function’s in-
dependent variables uses the same learning rate at the same time step for self-iteration. For example,
if we assume that the objective function is f and the independent variable is a two-dimensional vector
[x1 , x2 ]⊤ , each element in the vector uses the same learning rate when iterating. For example, in gradient
descent with the learning rate η, element x1 and x2 both use the same learning rate η for iteration:

∂f ∂f
x1 ← x1 − η , x2 ← x 2 − η .
∂x1 ∂x2
In the Momentum section, we can see that, when there is a big difference between the gradient values x1
and x2 , a sufficiently small learning rate needs to be selected so that the independent variable will not
diverge in the dimension of larger gradient values. However, this will cause the independent variables to
iterate too slowly in the dimension with smaller gradient values. The momentum method relies on the
exponentially weighted moving average (EWMA) to make the direction of the independent variable more
consistent, thus reducing the possibility of divergence. In this section, we are going to introduce Adagrad,
an algorithm that adjusts the learning rate according to the gradient value of the independent variable in
each dimension to eliminate problems caused when a unified learning rate has to adapt to all dimensions.

10.5. Adagrad 449

10.5.1 The Algorithm

The Adadelta algorithm uses the cumulative variable st obtained from a square by element operation on
the mini-batch stochastic gradient g t . At time step 0, Adagrad initializes each element in s0 to 0. At time
step t, we first sum the results of the square by element operation for the mini-batch gradient g t to get the
variable st :

st ← st−1 + g t ⊙ g t ,

Here, ⊙ is the symbol for multiplication by element. Next, we re-adjust the learning rate of each element
in the independent variable of the objective function using element operations:
η
xt ← xt−1 − √ ⊙ gt ,
st + ϵ

Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−6 .
Here, the square root, division, and multiplication operations are all element operations. Each element in
the independent variable of the objective function will have its own learning rate after the operations by
elements.

10.5.2 Features

We should emphasize that the cumulative variable st produced by a square by element operation on the
mini-batch stochastic gradient is part of the learning rate denominator. Therefore, if an element in the
independent variable of the objective function has a constant and large partial derivative, the learning rate
of this element will drop faster. On the contrary, if the partial derivative of such an element remains small,
then its learning rate will decline more slowly. However, since st accumulates the square by element
gradient, the learning rate of each element in the independent variable declines (or remains unchanged)
during iteration. Therefore, when the learning rate declines very fast during early iteration, yet the current
solution is still not desirable, Adagrad might have difficulty finding a useful solution because the learning
rate will be too small at later stages of iteration.
Below we will continue to use the objective function f (x) = 0.1x21 + 2x22 as an example to observe the
iterative trajectory of the independent variable in Adagrad. We are going to implement Adagrad using
the same learning rate as the experiment in last section, 0.4. As we can see, the iterative trajectory of
the independent variable is smoother. However, due to the cumulative effect of st , the learning rate
continuously decays, so the independent variable does not move as much during later stages of iteration.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
import math
from mxnet import nd

def adagrad_2d(x1, x2, s1, s2):

450 10. Optimization Algorithms

# The first two terms are the independent variable gradients
g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
s1 += g1 ** 2
s2 += g2 ** 2
x1 -= eta / [Link](s1 + eps) * g1
x2 -= eta / [Link](s2 + eps) * g2
return x1, x2, s1, s2

def f_2d(x1, x2):

return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -2.382563, x2 -0.158591

Now, we are going to increase the learning rate to 2. As we can see, the independent variable approaches
the optimal solution more quickly.
In [2]: eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -0.002295, x2 -0.000000

10.5. Adagrad 451

10.5.3 Implementation from Scratch

Like the momentum method, Adagrad needs to maintain a state variable of the same shape for each
independent variable. We use the formula from the algorithm to implement Adagrad.
In [3]: features, labels = d2l.get_data_ch7()

def init_adagrad_states():
s_w = [Link](([Link][1], 1))
s_b = [Link](1)
return (s_w, s_b)

def adagrad(params, states, hyperparams):

eps = 1e-6
for p, s in zip(params, states):
s[:] += [Link]()
p[:] -= hyperparams['lr'] * [Link] / (s + eps).sqrt()

Compared with the experiment in the Mini-Batch Stochastic Gradient Descent section, here, we use a
larger learning rate to train the model.
In [4]: d2l.train_ch9(adagrad, init_adagrad_states(), {'lr': 0.1}, features, labels)
loss: 0.242672, 0.314801 sec per epoch

452 10. Optimization Algorithms

10.5.4 Concise Implementation

Using the Trainer instance of the algorithm named adagrad, we can implement the Adagrad algorithm
with Gluon to train models.
In [5]: d2l.train_gluon_ch9('adagrad', {'learning_rate': 0.1}, features, labels)
loss: 0.244517, 0.391792 sec per epoch

10.5. Adagrad 453

Summary

• Adagrad constantly adjusts the learning rate during iteration to give each element in the independent
variable of the objective function its own learning rate.
• When using Adagrad, the learning rate of each element in the independent variable decreases (or
remains unchanged) during iteration.

Exercises

• When introducing the features of Adagrad, we mentioned a potential problem. What solutions can
you think of to fix this problem?
• Try to use other initial learning rates in the experiment. How does this change the results?

Reference

[1] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.

Scan the QR Code to Discuss

10.6 RMSProp

In the experiment in the Adagrad section, the learning rate of each element in the independent variable
of the objective function declines (or remains unchanged) during iteration because the variable st in
the denominator is increased by the square by element operation of the mini-batch stochastic gradient,
adjusting the learning rate. Therefore, when the learning rate declines very fast during early iteration, yet
the current solution is still not desirable, Adagrad might have difficulty finding a useful solution because
the learning rate will be too small at later stages of iteration. To tackle this problem, the RMSProp
algorithm made a small modification to Adagrad[1].

454 10. Optimization Algorithms

10.6.1 The Algorithm

We introduced EWMA (exponentially weighted moving average) in the Momentum section. Unlike in
Adagrad, the state variable st is the sum of the square by element all the mini-batch stochastic gradients
g t up to the time step t, RMSProp uses the EWMA on the square by element results of these gradients.
Specifically, given the hyperparameter 0 ≤ γ < 1, RMSProp is computed at time step t > 0.

st ← γst−1 + (1 − γ)g t ⊙ g t .

Like Adagrad, RMSProp re-adjusts the learning rate of each element in the independent variable of the
objective function with element operations and then updates the independent variable.
η
xt ← xt−1 − √ ⊙ gt ,
st + ϵ

Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−6 .
Because the state variable of RMSProp is an EWMA of the squared term g t ⊙ g t , it can be seen as the
weighted average of the mini-batch stochastic gradient’s squared terms from the last 1/(1 − γ) time steps.
Therefore, the learning rate of each element in the independent variable will not always decline (or remain
unchanged) during iteration.
By convention, we will use the objective function f (x) = 0.1x21 + 2x22 to observe the iterative trajectory
of the independent variable in RMSProp. Recall that in the Adagrad section, when we used Adagrad with
a learning rate of 0.4, the independent variable moved less in later stages of iteration. However, at the
same learning rate, RMSProp can approach the optimal solution faster.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
import math
from mxnet import nd

def rmsprop_2d(x1, x2, s1, s2):

g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
s1 = gamma * s1 + (1 - gamma) * g1 ** 2
s2 = gamma * s2 + (1 - gamma) * g2 ** 2
x1 -= eta / [Link](s1 + eps) * g1
x2 -= eta / [Link](s2 + eps) * g2
return x1, x2, s1, s2

def f_2d(x1, x2):

return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9

d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))
epoch 20, x1 -0.010599, x2 0.000000

10.6. RMSProp 455

10.6.2 Implementation from Scratch

Next, we implement RMSProp with the formula in the algorithm.

In [2]:
features, labels = d2l.get_data_ch7()
def init_rmsprop_states():
s_w = [Link](([Link][1], 1))
s_b = [Link](1)
return (s_w, s_b)

def rmsprop(params, states, hyperparams):

gamma, eps = hyperparams['gamma'], 1e-6
for p, s in zip(params, states):
s[:] = gamma * s + (1 - gamma) * [Link]()
p[:] -= hyperparams['lr'] * [Link] / (s + eps).sqrt()

We set the initial learning rate to 0.01 and the hyperparameter γ to 0.9. Now, the variable st can be treated
as the weighted average of the square term g t ⊙ g t from the last 1/(1 − 0.9) = 10 time steps.
In [3]: d2l.train_ch9(rmsprop, init_rmsprop_states(), {'lr': 0.01, 'gamma': 0.9},
features, labels)
loss: 0.247826, 0.375890 sec per epoch

456 10. Optimization Algorithms

10.6.3 Concise Implementation

From the Trainer instance of the algorithm named rmsprop, we can implement the RMSProp algorithm
with Gluon to train models. Note that the hyperparameter γ is assigned by gamma1.
In [4]: d2l.train_gluon_ch9('rmsprop', {'learning_rate': 0.01, 'gamma1': 0.9},
features, labels)
loss: 0.246442, 0.223892 sec per epoch

10.6. RMSProp 457

Summary

• The difference between RMSProp and Adagrad is that RMSProp uses an EWMA on the squares of
elements in the mini-batch stochastic gradient to adjust the learning rate.

Exercises

• What happens to the experimental results if we set the value of γ to 1? Why?

• Try using other combinations of initial learning rates and γ hyperparameters and observe and ana-
lyze the experimental results.

Reference

[1] Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26-31.

Scan the QR Code to Discuss

10.7 Adadelta

In addition to RMSProp, Adadelta is another common optimization algorithm that helps improve the
chances of finding useful solutions at later stages of iteration, which is difficult to do when using the
Adagrad algorithm for the same purpose[1]. The interesting thing is that there is no learning rate hyper-
parameter in the Adadelta algorithm.

10.7.1 The Algorithm

Like RMSProp, the Adadelta algorithm uses the variable st , which is an EWMA on the squares of
elements in mini-batch stochastic gradient g t . At time step 0, all the elements are initialized to 0. Given
the hyperparameter 0 ≤ ρ < 1 (counterpart of γ in RMSProp), at time step t > 0, compute using the
same method as RMSProp:

st ← ρst−1 + (1 − ρ)g t ⊙ g t .

458 10. Optimization Algorithms

Unlike RMSProp, Adadelta maintains an additional state variable, ∆xt the elements of which are also
initialized to 0 at time step 0. We use ∆xt−1 to compute the variation of the independent variable:
√
′ ∆xt−1 + ϵ
gt ← ⊙ gt ,
st + ϵ

Here, ϵ is a constant added to maintain the numerical stability, such as 10−5 . Next, we update the
independent variable:

xt ← xt−1 − g ′t .

Finally, we use ∆x to record the EWMA on the squares of elements in g ′ , which is the variation of the
independent variable.

∆xt ← ρ∆xt−1 + (1 − ρ)g ′t ⊙ g ′t .

ϵ is not considered here, Adadelta differs from RMSProp in its replacement

As we can see, if the impact of√
of the hyperparameter η with ∆xt−1 .

10.7.2 Implementation from Scratch

Adadelta needs to maintain two state variables for each independent variable, st and ∆xt . We use the
formula from the algorithm to implement Adadelta.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
from mxnet import nd

features, labels = d2l.get_data_ch7()

def init_adadelta_states():
s_w, s_b = [Link](([Link][1], 1)), [Link](1)
delta_w, delta_b = [Link](([Link][1], 1)), [Link](1)
return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):

rho, eps = hyperparams['rho'], 1e-5
for p, (s, delta) in zip(params, states):
s[:] = rho * s + (1 - rho) * [Link]()
g = ((delta + eps).sqrt() / (s + eps).sqrt()) * [Link]
p[:] -= g
delta[:] = rho * delta + (1 - rho) * g * g

Then, we train the model with the hyperparameter ρ = 0.9.

In [2]: d2l.train_ch9(adadelta, init_adadelta_states(), {'rho': 0.9}, features,
labels)

10.7. Adadelta 459

loss: 0.244664, 0.492153 sec per epoch

10.7.3 Concise Implementation

From the Trainer instance for the algorithm named adadelta, we can implement Adadelta in Gluon. Its
hyperparameters can be specified by rho.
In [3]: d2l.train_gluon_ch9('adadelta', {'rho': 0.9}, features, labels)
loss: 0.244886, 0.572499 sec per epoch

460 10. Optimization Algorithms

Summary

• Adadelta has no learning rate hyperparameter, it uses an EWMA on the squares of elements in the
variation of the independent variable to replace the learning rate.

Exercises

• Adjust the value of ρ and observe the experimental results.

Reference

[1] Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Scan the QR Code to Discuss

10.8 Adam

Created on the basis of RMSProp, Adam also uses EWMA on the mini-batch stochastic gradient[1].
Here, we are going to introduce this algorithm.

10.8.1 The Algorithm

Adam uses the momentum variable v t and variable st , which is an EWMA on the squares of elements
in the mini-batch stochastic gradient from RMSProp, and initializes each element of the variables to 0 at
time step 0. Given the hyperparameter 0 ≤ β1 < 1 (the author of the algorithm suggests a value of 0.9),
the momentum variable v t at time step t is the EWMA of the mini-batch stochastic gradient g t :

v t ← β1 v t−1 + (1 − β1 )g t .

Just as in RMSProp, given the hyperparameter 0 ≤ β2 < 1 (the author of the algorithm suggests a value
of 0.999), After taken the squares of elements in the mini-batch stochastic gradient, find g t ⊙ g t and

10.8. Adam 461

perform EWMA on it to obtain st :

st ← β2 st−1 + (1 − β2 )g t ⊙ g t .
∑t
Since we initialized elements in v 0 and s0 to 0, we get v t = (1−β1 ) i=1 β1t−i g i at time step t. Sum the
∑t
mini-batch stochastic gradient weights from each previous time step to get (1 − β1 ) i=1 β1t−i = 1 − β1t .
Notice that when t is small, the sum of the mini-batch stochastic gradient weights from each previous
time step will be small. For example, when β1 = 0.9, v 1 = 0.1g 1 . To eliminate this effect, for any time
step t, we can divide v t by 1 − β1t , so that the sum of the mini-batch stochastic gradient weights from
each previous time step is 1. This is also called bias correction. In the Adam algorithm, we perform bias
corrections for variables v t and st :
vt
v̂ t ← ,
1 − β1t
st
ŝt ← .
1 − β2t
Next, the Adam algorithm will use the bias-corrected variables v̂ t and ŝt from above to re-adjust the
learning rate of each element in the model parameters using element operations.

ηv̂ t
g ′t ← √ ,
ŝt + ϵ

Here, η is the learning rate while ϵ is a constant added to maintain numerical stability, such as 10−8 .
Just as for Adagrad, RMSProp, and Adadelta, each element in the independent variable of the objective
function has its own learning rate. Finally, use g ′t to iterate the independent variable:

xt ← xt−1 − g ′t .

10.8.2 Implementation from Scratch

We use the formula from the algorithm to implement Adam. Here, time step t uses hyperparams to
input parameters to the adam function.
In [1]: import sys
[Link](0, '..')

%matplotlib inline
import d2l
from mxnet import nd

features, labels = d2l.get_data_ch7()

def init_adam_states():
v_w, v_b = [Link](([Link][1], 1)), [Link](1)
s_w, s_b = [Link](([Link][1], 1)), [Link](1)
return ((v_w, s_w), (v_b, s_b))

462 10. Optimization Algorithms

def adam(params, states, hyperparams):
beta1, beta2, eps = 0.9, 0.999, 1e-6
for p, (v, s) in zip(params, states):
v[:] = beta1 * v + (1 - beta1) * [Link]
s[:] = beta2 * s + (1 - beta2) * [Link]()
v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
p[:] -= hyperparams['lr'] * v_bias_corr / (s_bias_corr.sqrt() + eps)
hyperparams['t'] += 1

Use Adam to train the model with a learning rate of 0.01.

In [2]: d2l.train_ch9(adam, init_adam_states(), {'lr': 0.01, 't': 1}, features,
labels)
loss: 0.246945, 0.369802 sec per epoch

10.8.3 Concise Implementation

From the Trainer instance of the algorithm named adam, we can implement Adam with Gluon.
In [3]: d2l.train_gluon_ch9('adam', {'learning_rate': 0.01}, features, labels)
loss: 0.246612, 0.194802 sec per epoch

10.8. Adam 463

Summary

• Created on the basis of RMSProp, Adam also uses EWMA on the mini-batch stochastic gradient
• Adam uses bias correction.

Exercises

• Adjust the learning rate and observe and analyze the experimental results.
• Some people say that Adam is a combination of RMSProp and momentum. Why do you think they
say this?

Reference

[1] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.

Scan the QR Code to Discuss

464 10. Optimization Algorithms

Deep Learning Optimization Algorithms
No ratings yet
Deep Learning Optimization Algorithms
28 pages
Neural Network Training Optimization Techniques
No ratings yet
Neural Network Training Optimization Techniques
13 pages
Neural Network Optimization Challenges
No ratings yet
Neural Network Optimization Challenges
11 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
81 pages
Feedforward Neural Network Optimization
No ratings yet
Feedforward Neural Network Optimization
83 pages
AL3502 Deep Learning for Vision Syllabus
75% (4)
AL3502 Deep Learning for Vision Syllabus
79 pages
Deep Learning Optimization Concepts
No ratings yet
Deep Learning Optimization Concepts
33 pages
Deep Learning: Feedforward Networks & Optimization
No ratings yet
Deep Learning: Feedforward Networks & Optimization
76 pages
Deep Learning and Gradient Descent Overview
No ratings yet
Deep Learning and Gradient Descent Overview
84 pages
Deep Learning Optimization Challenges
No ratings yet
Deep Learning Optimization Challenges
5 pages
Deep Learning Model Optimization Techniques
No ratings yet
Deep Learning Model Optimization Techniques
75 pages
Optimization Techniques in Deep Learning
No ratings yet
Optimization Techniques in Deep Learning
50 pages
Neural Network Optimization Challenges
No ratings yet
Neural Network Optimization Challenges
17 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
27 pages
Deep Learning Model Optimization Techniques
No ratings yet
Deep Learning Model Optimization Techniques
27 pages
Ravindra Kumar Sharma-Thesis
No ratings yet
Ravindra Kumar Sharma-Thesis
215 pages
Deep Learning Model Optimization Strategies
100% (1)
Deep Learning Model Optimization Strategies
81 pages
Deep Learning Optimization Challenges
No ratings yet
Deep Learning Optimization Challenges
5 pages
Deep Learning Optimization Challenges
No ratings yet
Deep Learning Optimization Challenges
12 pages
Optimization Challenges in Deep Learning
No ratings yet
Optimization Challenges in Deep Learning
15 pages
Understanding Empirical Risk Minimization
No ratings yet
Understanding Empirical Risk Minimization
11 pages
Neural Network Optimization Methods
No ratings yet
Neural Network Optimization Methods
90 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
49 pages
Deep Learning Optimization Concepts Explained
No ratings yet
Deep Learning Optimization Concepts Explained
13 pages
Deep Learning Model Optimization Guide
No ratings yet
Deep Learning Model Optimization Guide
5 pages
Numerical Computation in Deep Learning
No ratings yet
Numerical Computation in Deep Learning
39 pages
Challenges in Deep Learning Optimization
No ratings yet
Challenges in Deep Learning Optimization
26 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
31 pages
Deep Learning Optimization Techniques
No ratings yet
Deep Learning Optimization Techniques
48 pages
Gradient Descent in Deep Learning
No ratings yet
Gradient Descent in Deep Learning
28 pages
Lecture 7 - Optimimzation in ML
No ratings yet
Lecture 7 - Optimimzation in ML
17 pages
Gradient Descent Convergence Explained
No ratings yet
Gradient Descent Convergence Explained
4 pages
Understanding Deep Neural Networks Basics
No ratings yet
Understanding Deep Neural Networks Basics
6 pages
A Survey of Deep Learning Optimizers - First and Second Order Methods
No ratings yet
A Survey of Deep Learning Optimizers - First and Second Order Methods
24 pages
Deep Learning Optimization Insights
No ratings yet
Deep Learning Optimization Insights
50 pages
Learning vs. Optimization in Deep Learning
No ratings yet
Learning vs. Optimization in Deep Learning
80 pages
Deep Neural Networks: Key Concepts & Techniques
No ratings yet
Deep Neural Networks: Key Concepts & Techniques
6 pages
DL 12
No ratings yet
DL 12
55 pages
Understanding Optimization Algorithms
No ratings yet
Understanding Optimization Algorithms
33 pages
第2课-机器学习进阶-张景昭
No ratings yet
第2课-机器学习进阶-张景昭
70 pages
Activation Functions and Loss in ML
No ratings yet
Activation Functions and Loss in ML
29 pages
Deep Learning Regularization Techniques
No ratings yet
Deep Learning Regularization Techniques
30 pages
Numerical Computation in Deep Learning
No ratings yet
Numerical Computation in Deep Learning
46 pages
Gradient Descent and Optimization Techniques
No ratings yet
Gradient Descent and Optimization Techniques
201 pages
RMSprop: Adaptive Learning Rate Insights
No ratings yet
RMSprop: Adaptive Learning Rate Insights
26 pages
Deep Learning Model Optimization Techniques
No ratings yet
Deep Learning Model Optimization Techniques
12 pages
Challenges in Deep Learning Optimization
No ratings yet
Challenges in Deep Learning Optimization
46 pages
Optimization Techniques in Deep Learning
No ratings yet
Optimization Techniques in Deep Learning
20 pages
Deep Model Training Optimization Challenges
No ratings yet
Deep Model Training Optimization Challenges
89 pages
Big Data Optimization Techniques
No ratings yet
Big Data Optimization Techniques
4 pages
Analytical vs Numerical Solutions in Optimization
No ratings yet
Analytical vs Numerical Solutions in Optimization
14 pages
Challenges in Deep Learning Optimization
No ratings yet
Challenges in Deep Learning Optimization
46 pages
Linear Models in Machine Learning
No ratings yet
Linear Models in Machine Learning
68 pages
DL 2ND IA
No ratings yet
DL 2ND IA
4 pages
Mle Optimization Guide
No ratings yet
Mle Optimization Guide
16 pages
DLbook
No ratings yet
DLbook
165 pages
Loss Function Optimization in Neural Networks
100% (1)
Loss Function Optimization in Neural Networks
24 pages
Optimization in Deep Learning
No ratings yet
Optimization in Deep Learning
21 pages
Transforming Low-Performing Schools
No ratings yet
Transforming Low-Performing Schools
4 pages
Important Questions on Relations & Functions
90% (10)
Important Questions on Relations & Functions
3 pages
Determinants: Concepts and Properties
71% (7)
Determinants: Concepts and Properties
466 pages
Deep Learning for Urban Image Classification
No ratings yet
Deep Learning for Urban Image Classification
14 pages
Electronics Lab Design Projects
No ratings yet
Electronics Lab Design Projects
2 pages
E6.2H 5000 Air Circuit Breaker Specifications
No ratings yet
E6.2H 5000 Air Circuit Breaker Specifications
3 pages
Maths Revision Questions for Parents
No ratings yet
Maths Revision Questions for Parents
5 pages
Tomato Ripening Stages Detection System
No ratings yet
Tomato Ripening Stages Detection System
12 pages
SM 2014 46
No ratings yet
SM 2014 46
10 pages
RDM 2 Materials Resistance Exam Guide
No ratings yet
RDM 2 Materials Resistance Exam Guide
6 pages
STAAD PRO Overview for Civil Officers
No ratings yet
STAAD PRO Overview for Civil Officers
32 pages
Understanding C Expressions and Types
No ratings yet
Understanding C Expressions and Types
3 pages
Grade 8 Math Lesson on Triangle Congruence
No ratings yet
Grade 8 Math Lesson on Triangle Congruence
6 pages
Advanced Excel Exercises Notebook
No ratings yet
Advanced Excel Exercises Notebook
100 pages
Class VIII Admission Test Sample Paper
No ratings yet
Class VIII Admission Test Sample Paper
6 pages
Force vs. Pressure Explained
No ratings yet
Force vs. Pressure Explained
1 page
B.Tech CSE Syllabus 2018 Onwards
No ratings yet
B.Tech CSE Syllabus 2018 Onwards
204 pages
The PHILOSOPHY of Information An Introduction
No ratings yet
The PHILOSOPHY of Information An Introduction
220 pages
Understanding Percentage Concepts
No ratings yet
Understanding Percentage Concepts
34 pages
Check Beams with Plan Analyst Guide
No ratings yet
Check Beams with Plan Analyst Guide
3 pages
Free Probability Theory PDF Download
No ratings yet
Free Probability Theory PDF Download
2 pages
Techniques for Answering UPSR Maths
No ratings yet
Techniques for Answering UPSR Maths
24 pages
JEE Practice Problems: 40 Min Time
No ratings yet
JEE Practice Problems: 40 Min Time
14 pages
CNC Alarm Codes and Solutions
100% (2)
CNC Alarm Codes and Solutions
78 pages
TS PGECET 2025 Mechanical Syllabus
No ratings yet
TS PGECET 2025 Mechanical Syllabus
3 pages
Linear Transformations in Computer Vision
No ratings yet
Linear Transformations in Computer Vision
20 pages
Area Measurement Techniques by Tape
No ratings yet
Area Measurement Techniques by Tape
22 pages
Complementary Pairs in Syllogism
No ratings yet
Complementary Pairs in Syllogism
17 pages
Cement Grout for Low-Level Waste Encapsulation
No ratings yet
Cement Grout for Low-Level Waste Encapsulation
11 pages
Measurement System Analysis Overview
100% (1)
Measurement System Analysis Overview
42 pages