ReZero: Fast Convergence in Deep Networks

This document summarizes the ReZero technique for improving the training of deep neural networks. ReZero addresses vanishing and exploding gradients by adding a residual connection around each layer and scaling the layer's transformation by a single trainable parameter that is initialized to zero. This simple change enables training of fully connected networks with thousands of layers and outperforms more complex normalization and initialization schemes. When applied to 12-layer Transformers, ReZero converges 56% faster to a target performance level. The key advantages are that ReZero is widely applicable, enables training of deeper networks, and accelerates convergence compared to regular residual networks.


ReZero is All You Need:

Fast Convergence at Large Depth

Thomas Bachlechner∗, Bodhisattwa Prasad Majumder∗, Huanru Henry Mao∗,
Garrison W. Cottrell, Julian McAuley
UC San Diego
{tbachlechner@physics, bmajumde@eng, hhmao@eng, gary@eng, jmcauley@eng}.[Link]
arXiv:2003.04887v2 [[Link]] 25 Jun 2020

Abstract
Deep networks often suffer from vanishing or exploding gradients due to inefficient signal propagation, leading to long training times or convergence difficulties. Various architecture designs, sophisticated residual-style networks, and initialization schemes have been shown to improve deep signal propagation. Recently, Pennington et al. used free probability theory to show that dynamical isometry plays an integral role in efficient deep learning. We show that the simplest architecture change of gating each residual connection using a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches. Although much simpler than its predecessors, this gate enables training thousands of fully connected layers with fast convergence and better test performance for ResNets trained on CIFAR-10. We apply this technique to language modeling and find that we can easily train 120-layer Transformers. When applied to 12 layer Transformers, it converges 56% faster on enwiki8.

1 Introduction
Deep learning has enabled significant improvements in state-of-the-art performance across domains [1, 2, 3, 4]. The expressivity of neural networks typically grows exponentially with depth [5], enabling strong generalization performance, but often induces vanishing/exploding gradients and poor signal propagation through the model [6]. Practitioners have relied on residual [2] connections along with complex gating mechanisms in highway networks [7], careful initialization [8, 9, 10] and normalization such as BatchNorm [11] and LayerNorm [12] to mitigate this issue.

[Figure 1: ReZero. Diagram of a ReZero layer between "Input" and "Output".]

Recent theoretical work [13] applied free probability theory to randomly initialized networks and demonstrated that dynamical isometry is a key indicator of trainability. Motivated by this theory, we study the simplest modification of deep networks that ensures initial dynamical isometry, which we call ReZero. ReZero is a small addition to any network that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation². The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, we introduce a residual connection for the input signal $x$ and one trainable parameter $\alpha$ that modulates the non-trivial transformation of a layer $F(x)$,

$$x_{i+1} = x_i + \alpha_i F(x_i), \tag{1}$$

∗ Authors contributed equally, ordered by last name.
² Code for ReZero applied to various neural architectures: [Link]

Preprint. Under review.


Table 1: Various forms of normalization and residual connections. $F$ represents the transformation of an arbitrary layer and "Norm" is a normalization (e.g., LayerNorm or BatchNorm).

(1) Deep Network                      $x_{i+1} = F(x_i)$
(2) Residual Network                  $x_{i+1} = x_i + F(x_i)$
(3) Deep Network + Norm               $x_{i+1} = \mathrm{Norm}(F(x_i))$
(4) Residual Network + Pre-Norm       $x_{i+1} = x_i + F(\mathrm{Norm}(x_i))$
(5) Residual Network + Post-Norm      $x_{i+1} = \mathrm{Norm}(x_i + F(x_i))$
(6) ReZero                            $x_{i+1} = x_i + \alpha_i F(x_i)$

where α = 0 at the beginning of training. Initially the gradients for all parameters defining F
vanish, but dynamically evolve to suitable values during initial stages of training. We illustrate the
architecture in Figure 1. ReZero provides several benefits:
Widely Applicable: Unlike more complex schemes, ReZero is simple and architecture agnostic,
making its implementation widely applicable to any residual-style architectures without much tuning
and only a few lines of code.
Deeper learning: While simpler than existing approaches [7, 10], ReZero effectively propagates signals through deep networks, which allows for learning in otherwise untrainable networks. ReZero successfully trains 10,000 layers of fully-connected networks, and we are the first to train Transformers over 100 layers without learning rate warm-up, LayerNorm [12] or auxiliary losses [14].
Faster convergence: We observe accelerated convergence in ReZero networks compared to regular residual networks with normalization. When ReZero is applied to Transformers, we converge 56% faster than the vanilla Transformer to reach 1.2 BPB on the enwiki8 language modeling benchmark. When applied to ResNets, we obtain a 32% speed-up to reach 85% accuracy on CIFAR-10.
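
To make the "few lines of code" claim concrete, here is a minimal sketch of the ReZero update of Equation (1) in plain Python. It is not the authors' released code: `rezero_layer`, `forward` and `toy_transform` are hypothetical names, and `toy_transform` is an arbitrary stand-in for the layer transformation F. The sketch only illustrates that a stack of such layers is exactly the identity while every residual weight is zero.

```python
import math

def rezero_layer(x, f, alpha):
    """Return x + alpha * f(x) element-wise for a list-valued signal x."""
    fx = f(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]

def toy_transform(x):
    """Hypothetical stand-in for F: an arbitrary width-preserving nonlinearity."""
    return [math.tanh(3.0 * xi + 1.0) for xi in x]

def forward(x, alphas, f=toy_transform):
    """Propagate x through len(alphas) ReZero layers, one alpha per layer."""
    for alpha in alphas:
        x = rezero_layer(x, f, alpha)
    return x

x0 = [0.5, -1.0, 2.0]
identity_out = forward(x0, alphas=[0.0] * 1000)  # alpha = 0: each layer is the identity
changed_out = forward(x0, alphas=[0.01] * 1000)  # nonzero alpha: the signal is transformed
```

In a real framework the residual weights would be trainable scalars registered with the optimizer; here they are plain floats so the identity property can be checked directly.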

2 Background and related work


Networks with a depth of $L$ layers and width $w$ often have an expressive power that scales exponentially in depth, but not in width [15, 5]. Large depth often comes with difficulty in training via gradient-based methods. During training of a deep model, a signal in the training data has to propagate forward from the input to the output layer, and subsequently, the cost function gradients have to propagate backwards in order to provide a meaningful weight update. If the magnitude of a perturbation is changed by a factor $r$ in each layer, both signals and gradients vanish or explode at a rate of $r^L$, rendering many deep networks untrainable in practice.
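
As a quick numerical illustration of this exponential rate, the sketch below evaluates the per-layer factor raised to the depth for a few values of r. The function name `perturbation_scale` and the chosen values of r and depth are illustrative assumptions, not taken from the paper.

```python
def perturbation_scale(r, depth):
    """Factor applied to a perturbation after `depth` layers,
    if every layer scales it by the same factor r."""
    return r ** depth

# Illustrative values: slightly contracting, neutral, slightly expanding layers.
for r in (0.9, 1.0, 1.1):
    print(r, perturbation_scale(r, 100))
```

Even a 10% deviation from r = 1 per layer changes the signal by several orders of magnitude over 100 layers, which is why the per-layer factor has to stay very close to one.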
To be specific, consider a deep network that propagates an input signal $x_0$ of width $w$ through $L$ layers that perform the non-trivial, but width preserving functions $F[W_i] : \mathbb{R}^w \to \mathbb{R}^w$, where $W_i$ denotes all parameters at layer $i = 1, \ldots, L$. The signal propagates through the network according to
$$x_{i+1} = F[W_i](x_i). \tag{2}$$
There have been many attempts to improve signal propagation through deep networks and they often
fall into one of three categories—initialization schemes, normalization, and residual connections. We
show some of the popular ways to combine residual networks with normalization in Table 1.

2.1 Careful initialization

The dynamics of signal propagation in randomly initialized deep and wide neural networks have been formalized via mean field theory [13, 9, 16]. For some deep neural networks, including fully connected and convolutional architectures, the cosine distance of two distinct signals, $x_i \cdot x'_i / (\|x_i\| \|x'_i\|)$, approaches a fixed point that either vanishes or approaches unity at large depths. If this fixed point is 1 the behavior of the network is stable and every input is mapped to the same output, leading to vanishing weight updates. If this fixed point is 0 the behavior of the network is chaotic and even similar inputs are mapped to very different outputs, leading to exploding weight updates. To understand whether a network is in a stable or chaotic phase we consider the input-output Jacobian
$$J_{io} \equiv \frac{\partial x_L}{\partial x_0}. \tag{3}$$

The mean squared singular values χ of this matrix determine the growth/decay of an average input signal perturbation as it propagates through the network. The network exhibits a boundary between the ordered and the chaotic phase, the edge of chaos at χ = 1. Training proceeds efficiently at the edge of chaos. This behavior was recognized in [17, 6], which motivated a re-scaling of the weights such that χ ≈ 1 and on average signal strengths are neither enhanced nor attenuated.
Pennington et al. [13, 16] recognized that a unit mean squared singular value of the input-output Jacobian is insufficient to guarantee trainability. For example, if the singular vectors of $J_{io}$ corresponding to very large/small singular values align well with the perturbations in the data, training will still be inefficient. They proposed the stronger condition of dynamical isometry [18], which requires that all singular values of $J_{io}$ are close to one. This means that all perturbations of the input signal propagate through the network equally well. The ReLU activation function maps to zero for some perturbations of the input signal, and it is therefore intuitive that deep networks with ReLU activations cannot possibly satisfy dynamical isometry, as was rigorously established in [13]. For some activation functions and network architectures, elaborate initialization schemes allow the network to satisfy dynamical isometry at initialization, which significantly improves training dynamics [19, 5, 20, 21].
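
The gap between the two conditions can be seen on hand-picked singular value spectra. The sketch below is only an illustration: both spectra are hypothetical, and the helper names are not from the paper. It shows two Jacobians with the same mean squared singular value of one, of which only the first has all singular values equal to one.

```python
def mean_squared_singular_value(singular_values):
    """chi: the mean of the squared singular values of a Jacobian."""
    return sum(s * s for s in singular_values) / len(singular_values)

def max_deviation_from_one(singular_values):
    """Largest |s - 1|; this is close to zero under dynamical isometry."""
    return max(abs(s - 1.0) for s in singular_values)

# Hypothetical spectra, chosen by hand for illustration only.
isometric = [1.0, 1.0, 1.0, 1.0]
skewed = [2.0, 0.0, 0.0, 0.0]  # one amplified direction, three dead ones

chi_isometric = mean_squared_singular_value(isometric)
chi_skewed = mean_squared_singular_value(skewed)
dev_isometric = max_deviation_from_one(isometric)
dev_skewed = max_deviation_from_one(skewed)
```

Both spectra sit at the edge of chaos in the mean squared sense, yet the second one discards three of four input directions entirely, which is the failure mode that dynamical isometry rules out.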

2.2 Normalization

An alternative approach to improve the trainability of deep networks is to incorporate layers that
explicitly provide normalization. Many normalization modules have been proposed, with the two
most popular ones being BatchNorm [11] and LayerNorm [12]. In general, normalization aims to
ensure that initially, signals have zero mean and unit variance as they propagate through a network,
reducing covariate shift [11].
Normalization methods have shown success in accelerating the training of deep networks, but they
do incur a computational cost to the network and pose additional hyperparameters to tune (e.g. where
to place the normalization). In contrast to normalization methods, our proposed method is simple
and cheap to implement. ReZero alone is sufficient to train deeper networks, even in the absence of
various norms. Although ReZero makes normalization superfluous for convergence, we have found
the regularizing effect of BatchNorm to be complementary to our approach.

2.3 Residual connections

The identity mappings introduced for ResNet in [2] enabled a deep residual learning framework in
the context of convolutional networks for image recognition that significantly increased the trainable
depth. In [22] it was recognized that identity (pre-activation) residual connections allow for improved
signal propagation. Residual connections in ResNets allowed for training of extremely deep networks,
but the same has not been the case for Transformer architectures. Deep Transformer architectures
have thus far required extreme compute budgets or auxiliary losses.
Careful initialization has been employed in conjunction with residual connections. It has been proposed to initialize residual blocks around zero in order to facilitate better signal propagation [7, 22, 23, 24, 25, 10]. Concurrently with our work, SkipInit [26], an alternative to BatchNorm that is similar to ReZero, was proposed for ResNet architectures. The authors find that in deep ResNets without BatchNorm, a scalar multiplier is needed to ensure convergence. We arrive at a similar conclusion for the specific case considered in [26], and study more generally signal propagation in deeper networks across multiple architectures and beyond BatchNorm.

3 ReZero
We propose ReZero (residual with zero initialization), a simple change to the architecture of deep residual networks that facilitates dynamical isometry and enables the efficient training of extremely deep networks. Rather than propagating the signal through each of the non-trivial functions $F[W_i]$ at initialization, we add a skip connection and rescale the function by $L$ learnable parameters $\alpha_i$ (which we call residual weights) that are initialized to zero. The signal now propagates according to
$$x_{i+1} = x_i + \alpha_i F[W_i](x_i). \tag{4}$$
At initialization the network represents the identity function and it trivially satisfies dynamical isometry. We demonstrate below for a toy model that this architecture can exponentially accelerate training. The architecture modification allows for the training of deep networks even when the individual layers' Jacobian has vanishing singular values, as is the case for ReLU activation functions or self-attention [27].

3.1 A toy example

To illustrate how the ReZero connection accelerates training let us consider the toy model of a deep neural network described by $L$ single-neuron hidden layers that have no bias and all share the same weight $w$ and $\alpha_i = \alpha \;\forall i$. The network then simply maps an input $x_0$ to the output
$$x_L = (1 + \alpha w)^L x_0. \tag{5}$$
Fixing the parameter $\alpha = 1$ would represent a toy model for a fully connected residual network, while initializing $\alpha = 0$ and treating $\alpha$ as a learned parameter corresponds to a ReZero network. The input-output Jacobian is given by $J_{io} = (1 + \alpha w)^L$, indicating that for initialization with $w \approx 1$ and $\alpha = 1$ the output signal of a deep (e.g., $L \gg 1$) network is extremely sensitive to any small perturbations of the input, while with $\alpha = 0$ the input signal magnitude is preserved. While this example is too simple to exhibit an order/chaos phase transition, it does accurately model the vanishing and exploding gradient problem familiar in deep networks. Assuming a learning rate $\lambda$ and a cost function $C$, gradient descent updates the weights $w$ according to
$$w \leftarrow w - \lambda L \alpha x_0 (1 + \alpha w)^{L-1} \, \partial_x C(x)|_{x = x_L}. \tag{6}$$
For $\alpha = 1$, convergence of gradient descent with an initial weight $w \approx 1$ requires steps no larger than 1, and hence a learning rate that is exponentially small in depth $L$
$$\lambda \propto L^{-1} (1 + w)^{-(L-1)}, \tag{7}$$
where we only retained the parametric dependence on $w$ and $L$. For $w \gg 1$ the gradients in Equation 6 explode, while for $w \approx -1$ the gradients vanish. Initializing $\alpha = 0$ solves both of these problems: assuming a sufficiently well-conditioned cost function, the first step of gradient descent will update the residual weights $\alpha$ to a value that avoids large outputs and keeps the parameter trajectory within a well-conditioned region while retaining the expressive power of the network. The first non-trivial steps of the residual weight updates are given by
$$\alpha \leftarrow -\lambda L w x_0 \, \partial_x C(x)|_{x = x_L}, \tag{8}$$
and gradient descent will converge with a learning rate that is polynomial in the depth $L$ of the network. In this simple example, the ReZero connection, therefore, allows for convergence with dramatically fewer optimization steps compared to a vanilla residual network. We illustrate the training dynamics and gradients in Figure 2.

Figure 2: Contour plot of the log gradient norm, $\log \|\partial C\|_2$, over the network weight $w$ and the residual weight $\alpha$ during the training of the linear function $x_{L=5} = 50 \times x_0$ via gradient descent using a training set of $x_0 = \{1., 1.1, \ldots, 1.8\}$. Gradient descent trajectories initialized at $\alpha = 0$ are shown in red for five different initial $w$'s. The trajectory dynamics avoid the poorly conditioned regions around $\alpha \approx 1$ and converge to the solution $\alpha w \approx 1.2$.
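
The toy model is small enough to evaluate directly. The following sketch computes the output of Equation (5), the input-output Jacobian, and the gradient factors that appear in Equations (6) and (8). The quadratic cost, the depth of 50 and the target of 50 are assumptions made for this illustration, and the function names are hypothetical.

```python
def output(x0, w, alpha, depth):
    """Equation (5): x_L = (1 + alpha * w)^L * x0."""
    return (1.0 + alpha * w) ** depth * x0

def jacobian(w, alpha, depth):
    """Input-output Jacobian of the toy model, (1 + alpha * w)^L."""
    return (1.0 + alpha * w) ** depth

def grad_w(x0, w, alpha, depth, dC_dx):
    """dC/dw, the factor multiplying the learning rate in Equation (6)."""
    return depth * alpha * x0 * (1.0 + alpha * w) ** (depth - 1) * dC_dx

def grad_alpha(x0, w, alpha, depth, dC_dx):
    """dC/dalpha; at alpha = 0 it reduces to L * w * x0 * dC/dx (Equation 8)."""
    return depth * w * x0 * (1.0 + alpha * w) ** (depth - 1) * dC_dx

# Assumed quadratic cost C = (x_L - target)^2 / 2, so dC/dx = x_L - target.
depth, w, x0, target = 50, 1.0, 1.0, 50.0
for alpha in (1.0, 0.0):
    err = output(x0, w, alpha, depth) - target
    print(alpha, jacobian(w, alpha, depth),
          grad_w(x0, w, alpha, depth, err),
          grad_alpha(x0, w, alpha, depth, err))
```

With α = 1 the Jacobian and both gradients grow like 2^L, so only an exponentially small learning rate keeps the first step bounded. With α = 0 the Jacobian is exactly one, the gradient with respect to w vanishes, and the gradient with respect to α grows only linearly in L.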

4 Training deep fully-connected networks faster


We now study the effect of ReZero on deep ReLU networks, and compare it with some of the approaches that facilitate deep learning listed in the rows of Table 1. Specifically, we will compare a vanilla deep fully connected network (FC, row 1), a deep network with residual connections (FC+Res, row 2), a deep network with LayerNorm (FC+Norm, row 3), and finally our proposed ReZero (row 6). We choose the initial weights $W_i$ to be normally distributed with variances optimal for training [6, 20], e.g., $\sigma_W^2 = 2/w$ for all but the vanilla residual network where $\sigma_W^2 \approx 0.25/w$.
As a sample toy task, we train four different 32-layer network architectures on the CIFAR-10 data set for supervised image classification. We are only interested in the training dynamics and investigate how many iterations it takes for the model to fit the data.

We show the evolution of the training loss in Figure 3. In our simple experiment, the ReZero architecture converges to fit the training data between 7 and 15 times faster than other techniques. Note that without an additional normalization layer the residual connection decreases convergence speed compared to a plain fully connected network. We speculate that this is because at initialization the variance of the signal is not independent of depth, see [20].
With increasing depth, the advantages of the ReZero architecture become more apparent. To verify that this architecture ensures trainability to large depth we successfully trained fully connected ReZero networks with up to 10,000 layers on a laptop with one GPU³ to overfit the training set.

[Figure 3 plot: Cross Entropy Loss versus Epoch for FC (1), FC + Res (2), FC + Norm (3) and FC + ReZero (6).]
Figure 3: Cross entropy loss during training of four variants of 32 layer fully-connected networks with width 256 and ReLU activations. Numbers in parentheses refer to the architectures in the corresponding rows of Table 1. We average over five runs each and show 1σ error bands. We train using Adagrad [28] with learning rate 0.01.

5 Training Convolutional ResNets faster

Residual connections enabled the first widely used feed-forward networks for image recognition with hundreds of layers [7, 2]. It was quickly realized that applying the residual connection after the activation function (in PreAct-ResNets [22], see Table 1), as well as initializing the network closer to the identity mapping (in [29, 23, 24, 25, 10, 26]) leads to improved performance. ReZero combines the benefits of both approaches and is the simplest implementation in this sequence of papers.
In the previous section, we saw how ReZero connections enable the training of networks with
thousands of layers that would otherwise be untrainable. In this section, we apply ReZero connections
to deep convolutional residual networks for image recognition [2]. While these networks are trainable
without modification, we observe that ReZero accelerates training and improves accuracies.
In order to compare different methods (ResNet [2] modified by Gated ResNet [7, 29], zero γ
[23, 24], FixUp [10], ReZero and Pre-Act ResNet [22] modified with ReZero) to improve deep signal
propagation, we trained various versions of residual networks on the CIFAR-10 image classification
dataset, each with identical hyperparameters. We discuss the architectures and hyperparameters in
detail in Appendices D and E. In Table 2 we present results for the validation error, the number
of epochs to reach 80% accuracy, and loss on the training data. ReZero performs better than the
other methods for ensuring deep signal propagation: it accelerates training as much as FixUp, but
retains the regularizing properties of the BatchNorm layer. Gated ResNets initially train very fast, but
perform significantly worse than ReZero on test data.
In order to demonstrate the accelerated training of ReZero networks, we implemented superconvergence [30] in a Pre-activation ResNet-18 with ReZero connections. The phenomenon of superconvergence uses one cycle of an increasing and decreasing learning rate, in which a large maximum learning rate serves as a regularizer. This yields very fast convergence for some networks. We find that the training duration to achieve 94% accuracy decreases from the 60 epochs for the baseline⁴ model to 45 epochs for a ReZero model.
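
For readers unfamiliar with one-cycle schedules, the sketch below shows one possible shape: a linear ramp up to a maximum learning rate followed by a linear ramp back down. The piecewise-linear form, the function name `one_cycle_lr` and all numeric values are assumptions for illustration; the schedule actually used in the experiments is not specified in this section.

```python
def one_cycle_lr(step, total_steps, max_lr, min_lr):
    """Piecewise-linear one-cycle schedule: ramp up during the first half
    of training, ramp back down during the second half."""
    half = total_steps / 2.0
    if step <= half:
        frac = step / half
    else:
        frac = (total_steps - step) / half
    return min_lr + (max_lr - min_lr) * frac

# Hypothetical settings, not the values used in the paper.
schedule = [one_cycle_lr(s, 100, max_lr=1.0, min_lr=0.1) for s in range(101)]
```

A schedule like this would apply to the network weights only if, as in the paper's ResNet-18 experiment, the residual weights are kept on a small constant learning rate.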

³ To train at these extreme depths we used the Adagrad optimizer with a learning rate of 0.003.
⁴ Our implementation was inspired by the codes from [Link] available at [Link] imagenet-fast/tree/master/cifar10. We replicated this model and added ReZero. It was important to have a small, constant learning rate for the residual weights, otherwise the ReZero model diverges easily.

Table 2: Comparison of ResNet variants on CIFAR-10, see Appendix E for implementation details. The uncertainties correspond to standard error.

Model                            Val. Error [%]   Change    Epochs to 80% Acc.   Train Loss ×1000
ResNet-56 [2]                    6.27 ± 0.06      –         20 ± 1               5.9 ± 0.1
+ Gated ResNet [7, 29]           6.80 ± 0.09      + 0.53    9 ± 2                4.6 ± 0.3
+ zero γ [23, 24]                7.84 ± 0.05      + 1.57    39 ± 4               31.2 ± 0.5
+ FixUp [10]                     7.26 ± 0.10      + 0.99    13 ± 1               4.6 ± 0.2
+ ReZero                         6.58 ± 0.07      + 0.31    15 ± 2               4.5 ± 0.3
ResNet-110 [2]                   6.24 ± 0.29      –         23 ± 4               4.0 ± 0.1
+ Gated ResNet [7, 29]           6.71 ± 0.05      + 0.47    10 ± 2               2.8 ± 0.2
+ zero γ [23, 24]                7.49 ± 0.07      + 1.25    36 ± 5               18.5 ± 0.9
+ FixUp [10]                     7.10 ± 0.22      + 0.86    15 ± 1               3.3 ± 0.5
+ ReZero                         5.93 ± 0.12      − 0.31    14 ± 1               2.6 ± 0.1
Pre-activation ResNet-18 [22]    6.38 ± 0.01      –         26 ± 2               4.1 ± 0.3
+ ReZero                         5.43 ± 0.06      − 0.95    12 ± 1               1.9 ± 0.3
Pre-activation ResNet-50 [22]    5.37 ± 0.02      –         26 ± 3               2.6 ± 0.1
+ ReZero                         4.80 ± 0.08      − 0.57    17 ± 1               2.2 ± 0.1

6 Training deeper Transformers faster

In this section, we study the signal propagation and application of ReZero to the Transformer architecture [27]. Transformers gained significant popularity and success both in supervised and unsupervised NLP tasks [31, 14]. Transformers are built by stacking modules that first perform self-attention, then a point-wise feed-forward transformation.
The original Transformer [27] implementation can be seen as a residual network with post-normalization (row 5 in Table 1). Inside a Transformer module the output of each sublayer is added via a residual connection and then normalized by LayerNorm,
$$x_{i+1} = \mathrm{LayerNorm}(x_i + \mathrm{sublayer}(x_i)), \tag{9}$$
where sublayer ∈ {self-attention, feed-forward}, as illustrated in the left panel of Figure 4.

6.1 Signal propagation in Transformers

Two crucial components relevant to the signal propagation in the original Transformer layers include LayerNorm [12] and (multi-head) self attention [27]. Neither component by itself or in conjunction with a vanilla residual connection can satisfy dynamical isometry for all input signals, as we show with a theoretical argument in Appendix A. We verify these claims in practice by evaluating the change of the attention output under an infinitesimal variation of each input element, which yields the input-output Jacobian. We show the input-output Jacobian for Transformer encoder layers of various depths with Xavier uniform initialized weights in Figure 5a. While shallow Transformers exhibit a singular value distribution peaked around unity, we clearly observe that the Jacobian of deeper Transformers has a large number of singular values that vanish to machine precision. While the distribution varies depending on the details of the initialization scheme, the qualitative statement holds more broadly. These results are consistent with the common observation that deep Transformer networks are extremely challenging to train.

[Figure 4: ReZero for Transformers]

In order to facilitate deep signal propagation we apply ReZero by replacing LayerNorm and re-scaling the self-attention block. Specifically, this modifies equation (9) to
$$x_{i+1} = x_i + \alpha_i \, \mathrm{sublayer}(x_i), \tag{10}$$
where $\alpha_i$ is the learned residual weight parameter as in the right panel of Figure 4. We share the same $\alpha_i$ parameter for a pair of multi-head self-attention and feed-forward network within a Transformer layer. At initialization, $\alpha_i = 0$, which allows for unimpeded signal propagation: All singular values of the input-output Jacobian are 1 and the model trivially satisfies dynamical isometry.
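
The difference between the post-norm update of Equation (9) and the ReZero update of Equation (10) can be sketched on a single token vector. The code below is a simplified illustration, not the paper's implementation: `toy_attention` and `toy_feed_forward` are hypothetical stand-ins for the real sublayers, and the LayerNorm omits the learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_layer(x, sublayers):
    """Equation (9): x <- LayerNorm(x + sublayer(x)) for each sublayer."""
    for sub in sublayers:
        x = layer_norm([a + b for a, b in zip(x, sub(x))])
    return x

def rezero_transformer_layer(x, sublayers, alpha):
    """Equation (10): x <- x + alpha * sublayer(x), one alpha shared per layer."""
    for sub in sublayers:
        x = [a + alpha * b for a, b in zip(x, sub(x))]
    return x

# Hypothetical stand-ins for self-attention and the feed-forward network.
def toy_attention(x):
    mean = sum(x) / len(x)
    return [mean for _ in x]

def toy_feed_forward(x):
    return [max(0.0, 2.0 * v - 1.0) for v in x]

x = [3.0, -1.0, 0.5, 2.5]
sublayers = [toy_attention, toy_feed_forward]
rezero_out = rezero_transformer_layer(x, sublayers, alpha=0.0)
post_norm_out = post_norm_layer(x, sublayers)
```

With `alpha=0.0` the ReZero layer returns its input unchanged, whereas the post-norm layer already rescales and recenters the signal at initialization.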

Figure 5: Histograms of log singular values ($\log(\lambda_{io})$) for the input-output Jacobian: (a) Transformer encoder at initialization; depths 4, 12 and 64 layers. (b) ReZero Transformer encoder with 64 layers before/during training. Deep ReZero Transformers remain much closer to dynamical isometry, where mean singular value $\lambda_{io} \approx 1$.

To verify that the model remains close to dynamical isometry throughout training and for larger αi ,
we show a histogram of the Jacobian singular values during the training of a 64 layer Transformer
decoder language model on WikiText-2 [32] in Figure 5b. During training the weight of the residual
connection gradually increases, allowing the Transformer to model extremely complex functions
while maintaining signal propagation properties close to dynamical isometry.

6.2 Convergence speed

We select language modeling on enwiki8 [33] as a benchmark because strong language models are a
good indicator of downstream NLP task performance [4]. Our aim in these experiments is to measure
the convergence speed of each method by measuring the number of iterations it takes for a 12 layer
Transformer to reach 1.2 bits per byte (BPB) on enwiki8.
Since the introduction of Transformers [27], there have been several competing placements of the LayerNorm [34, 35] within the Transformer to achieve better convergence [4, 36]. We experiment with 3 Transformer normalization methods and compare against the ReZero Transformer. The Post-Norm (Row 5 in Table 1) method is equivalent to the vanilla Transformer in [27], the Pre-Norm (Row 4 in Table 1) method was recently introduced in [36] and the GPT2-Norm ($x_{i+1} = x_i + \mathrm{Norm}(F(x_i))$) was used in the training of GPT2 [4], which has successfully trained Transformers up to 48 layers. Finally, we experiment with our proposed ReZero method with α initialized to either zero or one. The hyperparameters are in Appendix B.
Our results (Table 3) show that Post-Norm diverges during training while all other models are able
to converge. This is not surprising as the original Transformer implementation required a learning
rate warm-up likely to overcome its poor initial signal propagation, as confirmed in [36]. To verify
this, we re-ran the Post-Norm setup with 100 steps of learning rate warm-up and find that the
model is able to converge to 1.2 BPB in 13,690 iterations. Under this setting, we compared other
LayerNorm placement schemes against Post-Norm. We find that the other placements led to initially
faster convergence, but ultimately Post-Norm catches up in performance, resulting in relatively
slower convergence for Pre-Norm and GPT2-Norm. However, other LayerNorm placements have
an advantage over Post-Norm in that they do not require learning rate warm-up, thus have fewer
hyperparameters to tune. ReZero with α = 1 does not show an improvement over the vanilla
Transformer, indicating the importance of initializing α = 0. With our proposed initialization of
α = 0, ReZero converges 56% faster than the vanilla Transformer.
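
The speedup column of Table 3 follows from the iteration counts. The short sketch below recomputes it, taking Post-Norm with warm-up as the baseline; the iteration counts are copied from Table 3, while the dictionary keys are just labels chosen for this example.

```python
# Iterations to reach 1.2 BPB on enwiki8, copied from Table 3.
iterations = {
    "Post-Norm + Warm-up": 13690,
    "Pre-Norm": 17765,
    "GPT2-Norm": 21187,
    "ReZero alpha=1": 14506,
    "ReZero alpha=0": 8800,
}

baseline = iterations["Post-Norm + Warm-up"]
# Speedup = baseline iterations / method iterations.
speedup = {name: baseline / n for name, n in iterations.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```

The "56% faster" figure is thus a ratio of iteration counts (13,690 / 8,800 ≈ 1.56), which corresponds to needing roughly 36% fewer iterations than the baseline.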

6.3 Deeper Transformers

Deeper Transformers require significantly more compute to train, with 78 layer Transformers requiring a cluster of 256 GPUs [37]. This cost comes from an increase in memory requirements and poor signal propagation. The Character Transformer [14] mitigated this issue by having intermediate layers predict the target objective as an auxiliary loss, thus circumventing vanishing gradients. In this section, we extend our 12 layer ReZero Transformer from Section 6.2 to 64 and 128 layers and compare against the vanilla Transformer (Post-Norm) and the Character Transformer. Our results (Table 4) indicate that a 12 layer ReZero Transformer attains the same BPB as a regular Transformer after convergence, which shows that we do not lose any representational expressivity in our model by replacing LayerNorm with ReZero. We find that trying to train deep vanilla Transformers leads to convergence difficulties. When scaled to 64 layers, the vanilla Transformer fails to converge even with a warm-up schedule. A ReZero Transformer with initialization of α = 1 diverges, supporting our theoretically motivated initialization at α = 0. The deeper ReZero Transformers are able to attain better performance than the shallower Transformers.

Table 3: Comparison of various 12 layer Transformers normalization variants against ReZero and the training iterations required to reach 1.2 BPB on enwiki8 validation set.

Model             Iterations   Speedup
Post-Norm [27]    Diverged     -
+ Warm-up         13,690       1×
Pre-Norm          17,765       0.77×
GPT2-Norm [4]     21,187       0.65×
ReZero α = 1      14,506       0.94×
ReZero α = 0      8,800        1.56×

Table 4: Comparison of Transformers (TX) on the enwiki8 test set. Char-TX refers to the Character Transformer [14] and uses additional auxiliary losses to achieve its performance.

Model                Layers   Parameters   BPB
Char-TX [14]         12       41M          1.11
TX + Warm-up         12       38M          1.17
TX + ReZero α = 1    12       34M          1.17
TX + ReZero α = 0    12       34M          1.17
Char-TX [14]         64       219M         1.06
TX                   64       51M          Diverged
TX + Warm-up         64       51M          Diverged
TX + ReZero α = 1    64       51M          Diverged
TX + ReZero α = 0    64       51M          1.11
TX + ReZero          128      101M         1.08
We also display results from Character Transformer [14], which had a similar setup, but required
more parameters and used intermediate and other auxiliary losses to achieve their performance. In
contrast, our 128 layer Transformer achieves similar performance and learns effectively without
any intermediate losses. We did not tune our hyperparameters (Appendix C) and our models can
potentially achieve better results with stronger regularization.
To probe deeper into our model, we examine the behavior of residual weights $\alpha_i$ during training for our 12 and 64 layer ReZero Transformers. The results for the 12 and 64 layer Transformer are qualitatively similar, and we show the 64 layer result in Figure 6. It is useful to view $|\alpha_i|$ as the amount of contribution each layer provides to the overall signal of the network. We see that an interesting pattern emerges: During the early iterations of training, the residual weights quickly increase to a peak value, then slowly decay to a small value later in training. Early in training, the higher layers tend to be dominant (they peak earlier) and towards the end of training each layer is used to a similar degree. The average $|\alpha_i|$ at the end of training is 0.0898 and 0.0185 for the 12 and 64 layer models respectively, which is approximately $1/L$. Interestingly, this pattern also occurs in the 12 layer ReZero Transformer when we initialized α to 1. The difference is that the model spends the first ≈ 50 iterations forcing the α's to small values, before reaching a similar pattern to that in Figure 6. This empirical finding supports our proposal that we should initialize α = 0 even for shallow models.

[Figure 6 plot: Layer (1 to 64) versus Training Iterations (0 to 1400), color scale from 0.0 to 0.6.]
Figure 6: Heat map for residual weight $|\alpha_i|$ evolution during training for 64L ReZero Transformer.
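
The "approximately 1/L" observation can be checked with a line of arithmetic. The sketch below compares the reported averages with 1/L for both depths; the averages are the values quoted in the text, and the variable names are illustrative.

```python
# Average |alpha_i| at the end of training, as reported in the text.
reported = {12: 0.0898, 64: 0.0185}

# Ratio of the reported average to 1/L; a value near 1 means "approximately 1/L".
ratios = {depth: avg_alpha * depth for depth, avg_alpha in reported.items()}

for depth, avg_alpha in reported.items():
    print(depth, avg_alpha, 1.0 / depth, ratios[depth])
```

Both ratios come out within about 20% of one, so the reported averages are close to 1/12 ≈ 0.083 and 1/64 ≈ 0.016 respectively.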

7 Conclusion

We introduced ReZero, a simple architectural modification that facilitates signal propagation in deep
networks and helps the network maintain dynamical isometry. Applying ReZero to various residual
architectures – fully connected networks, Transformers and ResNets – we observed significantly
improved convergence speeds. Furthermore, we were able to efficiently train Transformers with
hundreds of layers, which has been difficult with the original architecture. We believe deeper
Transformers will open the door to future exploration.

While training models with ReZero Transformers, we discovered interesting patterns in the values of
residual weights of each layer |αi | over the course of training. These patterns may hint towards some
form of curriculum learning and allow for progressive stacking of layers to further accelerate training
[38]. Patterns of residual weights can be crucial to understanding the training dynamics of such deeper
networks and might be important to model performance, which we will explore in future work.

Broader Impact
Recent work [39] has shown that increasing model capacity in terms of parameter count has a
substantial impact on improving model performance. However, this increase in model capacity has
also led to much longer training times and requires large GPU clusters in order to run experiments,
which indirectly contributes to the carbon footprint of model training. In addition,
[40] showed that, along with the cost in carbon emissions, training deep models incurs significant economic
costs. This makes it more difficult for less well-funded research labs and start-ups to effectively train
state-of-the-art models. Our work enables faster convergence without trading off model performance
and is in line with recent efforts [41] to make models more environment-friendly to train.

References
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In CVPR, 2016.
[3] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing
neural networks. In NIPS, 2017.
[4] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[5] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli.
Exponential expressivity in deep neural networks through transient chaos. In NIPS, pages
3360–3368, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[7] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv
preprint arXiv:1505.00387, 2015.
[8] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422,
2015.
[9] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pen-
nington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla
convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.
[10] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning
without normalization. arXiv preprint arXiv:1901.09321, 2019.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[13] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep
learning through dynamical isometry: theory and practice. In NIPS, 2017.
[14] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level
language modeling with deeper self-attention. In AAAI, volume 33, 2019.

[15] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of
linear regions of deep neural networks. In NIPS, 2014.
[16] Jeffrey Pennington, Samuel S Schoenholz, and Surya Ganguli. The emergence of spectral
universality in deep networks. arXiv preprint arXiv:1802.09979, 2018.
[17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In NIPS, 2010.
[18] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[19] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep informa-
tion propagation. arXiv preprint arXiv:1611.01232, 2016.
[20] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In NIPS,
2017.
[21] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S Schoenholz, Ed H Chi, and Jeffrey
Pennington. Dynamical isometry and a mean field theory of lstms and grus. arXiv preprint
arXiv:1901.08987, 2019.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision. Springer, 2016.
[23] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training
imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[24] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint
arXiv:1611.04231, 2016.
[25] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for
image classification with convolutional neural networks. In CVPR, 2019.
[26] Soham De and Samuel L Smith. Batch normalization biases deep residual networks towards
shallow paths. arXiv preprint arXiv:2002.10444, 2020.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[28] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of machine learning research, 12(Jul), 2011.
[29] Pedro HP Savarese, Leonardo O Mazza, and Daniel R Figueiredo. Learning identity mappings
with residual gates. arXiv preprint arXiv:1611.01260, 2016.
[30] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks
using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain
Operations Applications, volume 11006, page 1100612. International Society for Optics and
Photonics, 2019.
[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of
deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and
Thamar Solorio, editors, NAACL, 2019.
[32] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture
models. In ICLR, 2017.
[33] Matt Mahoney. Large text compression benchmark, 2009.
[34] Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization
of self-attention. CoRR, abs/1910.05895, 2019.

[35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. CoRR, abs/1910.10683, 2019.

[36] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai
Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer
architecture. arXiv preprint arXiv:2002.04745, 2020.

[37] Microsoft. Turing-nlg: A 17-billion-parameter language model, 2020.

[38] Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tie-Yan Liu. Efficient training
of BERT by progressively stacking. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
ICML, volume 97 of Proceedings of Machine Learning Research, 2019.

[39] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.

[40] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations
for deep learning in NLP. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors,
Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL
2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers. ACL, 2019.

[41] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. CoRR,
abs/1907.10597, 2019.

[42] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with
gaussian error linear units. CoRR, abs/1606.08415, 2016.

[43] Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing
BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019.

[44] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8),
1997.

A Vanishing singular values in Transformers

Two crucial components relevant to the signal propagation in the Transformer include LayerNorm
[12] and (multi-head) self attention [27]. In this section, we argue that neither component by itself or
in conjunction with a vanilla residual connection can satisfy dynamical isometry for all input signals.

A.1 LayerNorm

Layer normalization removes the mean and scales the variance over all neurons of a given layer and
introduces learnable parameters γ and β to re-scale the variance and shift the mean according to
$$\mathrm{LayerNorm}(x) = \frac{x - \mathrm{E}(x)}{\sqrt{\mathrm{Var}(x)}} \times \gamma + \beta . \qquad (11)$$
It is clear from this definition that perturbing an input x by a transformation that purely shifts either
its mean or variance will leave the output unchanged. These perturbations, therefore, give rise to
two vanishing singular values of the input-output Jacobian. In the Transformer architecture [27], the
norm is applied to each of the n elements of the input sentence, leading to a total of 2 × n vanishing
singular values of the Jacobian for each Transformer layer.
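The two vanishing singular values per normalized vector can be observed numerically. The following is a minimal sketch, assuming NumPy, a single 8-dimensional input, γ = 1, β = 0, and a finite-difference Jacobian; the helper names `layer_norm` and `jacobian` are hypothetical and not part of any library.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0):
    """LayerNorm of a single vector, following Eq. (11)."""
    return (x - x.mean()) / np.sqrt(x.var()) * gamma + beta

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x (illustrative helper)."""
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=8)

# Perturbations that only change the mean or only change the variance.
x_shifted = x + 3.0
x_rescaled = x.mean() + 2.5 * (x - x.mean())

singular_values = np.linalg.svd(jacobian(layer_norm, x), compute_uv=False)
num_vanishing = int(np.sum(singular_values < 1e-6))
print("vanishing singular values:", num_vanishing)
```

Applied independently to each of the n positions of a sequence, the same count gives the 2 × n vanishing singular values mentioned above.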

A.2 Self-Attention

Self-attention allows the model to relate content located across different positions by computing a
weighted sum of an input sequence. Specifically, the n × d matrix x contains an input sequence of
n rows containing d-dimensional embedding vectors, from which we can evaluate the query, key
and value matrices $Q, K, V = x \cdot W^{Q,K,V}$, where the $W^{Q,K,V}$ matrices are d × d. The scaled
dot-product attention then is given by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( Q \cdot K^\top / \sqrt{d} \right) \cdot V . \qquad (12)$$
In general, the singular value spectrum of the Jacobian of this attention process is complicated. Rather
than studying it in full generality, we now merely argue that for some inputs x and weights $W^{Q,K,V}$
the Jacobian has a large number of vanishing singular values (a claim we evaluate empirically in the
paper). Consider weights or inputs such that each of the arguments of the softmax function is small
compared to 1. The softmax function then simply returns an n × n dimensional matrix filled with
entries that all approximate 1/n. This means that the attention function projects all embedding vectors
of the input sequence onto a single diagonal direction. This implies that out of the n × d Jacobian
singular values only d are non-vanishing and hence much of the input signal is lost. A residual
connection can restore some of the lost signals, but even then some perturbations are amplified while
others are attenuated. This example demonstrates that self-attention is incompatible with dynamical
isometry and unimpeded signal propagation in deep Transformer networks. It is easy to verify that
the same conclusion holds for multi-head attention. A careful initialization of the weights might
alleviate some of these issues, but we are not aware of any initialization scheme that would render a
Transformer layer consistent with dynamical isometry.
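The counting argument for the small-argument limit can be illustrated numerically. This is a sketch under stated assumptions: NumPy, a toy sequence with n = 4 and d = 3, random weights, and attention weights treated as exactly uniform (so their dependence on x is neglected) when forming the Jacobian. All variable names are illustrative.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 3
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Tiny query/key weights make every softmax argument small compared to 1.
scale = 1e-3
Q, K = x @ (scale * W_q), x @ (scale * W_k)
A = softmax(Q @ K.T / np.sqrt(d))
max_dev = np.abs(A - 1.0 / n).max()  # distance from the uniform matrix

# Uniform-attention limit: out = A_uniform @ x @ W_v is linear in x.
# With row-major flattening, vec(out) = kron(A_uniform, W_v.T) @ vec(x).
A_uniform = np.full((n, n), 1.0 / n)
J = np.kron(A_uniform, W_v.T)
singular_values = np.linalg.svd(J, compute_uv=False)
num_nonvanishing = int(np.sum(singular_values > 1e-10))
print(f"{num_nonvanishing} of {n * d} singular values are non-vanishing")
```

Because the uniform matrix has rank one, the Kronecker product has at most d non-zero singular values, in line with the statement that only d of the n × d singular values survive.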

B Convergence speed experimental hyperparameters


For all model variants in Section 6.2, we control the batch size to be 1080, number of layers to 12,
feed-forward and attention dropout to 20%, hidden and embedding size to 512 units, context length
to 512, the attention heads to 2, and GELU [42] activation in the point-wise feed-forward layer. To
accommodate large batch training we use the LAMB optimizer [43] with a fixed learning rate of
0.016. Although learning rate schedules tend to improve performance [31], we omit them to simplify
our training process. Training is performed on 8x V100 GPUs for at most a few days.

C Deep Transformers experimental hyperparameters


In Section 6.3, in order to examine whether our approach scales to deeper Transformers, we scale our
12 layer ReZero Transformer from Section 6.2 to 64 layers and 128 layers and compare it against the
vanilla Transformer (Post-Norm). Due to memory constraints, we decreased the hidden size from
512 to 256 and reduced batch size to 304 and 144 for the 64 layer and 128 layer model respectively.
Following guidelines from [43] we also adjusted the learning rate according to $0.0005 \times \sqrt{\text{batch size}}$.
For all models in our experiments we limit training to a maximum of 100 training epochs. Training is
performed on 8x V100 GPUs for at most a few days.
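As a worked example of the square-root scaling rule, the sketch below evaluates it for the batch sizes quoted in Appendices B and C. The function name `scaled_learning_rate` is hypothetical, and the resulting values for batch sizes 304 and 144 are computed here rather than quoted from the paper.

```python
import math

BASE_LR = 0.0005  # prefactor quoted in the text

def scaled_learning_rate(batch_size, base_lr=BASE_LR):
    """Learning rate given by base_lr * sqrt(batch_size)."""
    return base_lr * math.sqrt(batch_size)

for batch_size in (1080, 304, 144):
    print(batch_size, round(scaled_learning_rate(batch_size), 4))
```

For a batch size of 1080 the rule gives roughly 0.016, which agrees with the fixed learning rate quoted for the 12 layer runs.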

D Review of residual gates for deep signal propagation


In this section we give a chronological but light review of residual gates that are designed to preserve
signals as they propagate deep into neural networks.

D.1 Highway Networks

Highway Networks [7], based on ideas from LSTM [44], were the first feedforward neural networks
with hundreds of layers. This architecture employs gating units that learn to regulate signal flow
through the network. Specifically, the authors define transform and carry gates $T[W_T](x)$ and
$C[W_C](x)$, with weights $W_{T,C}$ that act explicitly non-linearly on the signal x. When combined
with some block $F(x_i)$ of a deep network, this gives the transformation
$$x_{i+1} = C[W_C](x) \cdot x_i + T[W_T](x) \cdot F(x_i) . \qquad (13)$$

The authors further propose to simplify the architecture according to $C \equiv 1 - T$, using a simple
transform gate of the form $T[W_T](x) \equiv \sigma(W_T^\top \cdot x + b_T)$, where σ denotes some activation
function. The bias is initialized to be negative, so as to bias the network towards carry behavior, i.e., C,
but the network is not initialized as the identity map.
The proposal of Highway Networks led to Gated ResNet [29], in which there exists a single
additional parameter that parametrizes the gates as $W_T = 0$, $b_T = \alpha$, $C = 1 - T$.
Any feed-forward network could be written in the form (13), and ReZero corresponds to the simple
choice $W_T = W_C = 0$, $b_T = \alpha$, $b_C = 1$. In contrast to Highway Networks, in ReZero the gates
are independent of the input signal. We compare the performance of Gated ResNets to ReZero
ResNets in Section 5.
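To make the relation between the gated form (13) and its special cases concrete, here is a minimal NumPy sketch. It assumes a logistic sigmoid for σ, a tanh block for F, and toy weights; the helper names (`gated_step`, `highway_gates`, `rezero_gates`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x, F, T, C):
    """One layer of the general form (13): C(x) * x + T(x) * F(x)."""
    return C(x) * x + T(x) * F(x)

def highway_gates(W_T, b_T):
    """Input-dependent transform gate with carry gate C = 1 - T."""
    T = lambda x: sigmoid(W_T.T @ x + b_T)
    C = lambda x: 1.0 - T(x)
    return T, C

def rezero_gates(alpha):
    """Input-independent gates: T = alpha, C = 1."""
    T = lambda x: alpha
    C = lambda x: 1.0
    return T, C

rng = np.random.default_rng(0)
F = np.tanh                      # stand-in for an arbitrary block
x = np.array([1.0, -2.0, 0.5])

# Highway layer with a negative bias: biased towards carry, but not identity.
T_hw, C_hw = highway_gates(W_T=0.1 * rng.normal(size=(3, 3)), b_T=-2.0)
highway_out = gated_step(x, F, T_hw, C_hw)

# ReZero layer with alpha = 0: exactly the identity map.
T_rz, C_rz = rezero_gates(alpha=0.0)
rezero_out = gated_step(x, F, T_rz, C_rz)
```

The comparison highlights the two differences named in the text: the ReZero gates do not depend on the input, and with α = 0 the layer reduces exactly to the identity.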

D.2 ResNets

ResNets [2] introduced the simple residual connection
$$x_{i+1} = \sigma\left( x_i + F(x_i) \right) , \qquad (14)$$
that has been extremely successful in training deep networks. Just as with Highway Networks, these
residual connections are not initialized to give the identity map.

D.3 Pre-activation ResNets

Soon after the introduction of ResNets it was realized in [22] that applying the activation function
σ prior to the residual connection allows for better performance. Schematically, we have the pre-
activation connection
$$x_{i+1} = x_i + F(x_i) , \qquad (15)$$
where we absorbed the activation function into the block $F(x_i)$. This finding of improved performance
is consistent with improved signal propagation, since the residual connection is not modulated
by the activation function.
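The difference between the two connections is easiest to see when the block contributes nothing. The sketch below assumes σ is a ReLU and uses a hypothetical `zero_block` that always returns zeros; it is an illustration of the signal-propagation argument, not a reproduction of any experiment.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def post_activation_step(x, F):
    """Original ResNet connection (14): activation applied after the sum."""
    return relu(x + F(x))

def pre_activation_step(x, F):
    """Pre-activation connection (15): the activation lives inside F."""
    return x + F(x)

zero_block = lambda x: np.zeros_like(x)  # hypothetical block with no output
x = np.array([1.5, -0.7, 0.0, -2.0])

post = post_activation_step(x, zero_block)  # negative entries are clipped
pre = pre_activation_step(x, zero_block)    # skip path passes x unchanged
print(post, pre)
```

Even with an inactive block, the post-activation form discards the negative components of the signal, whereas the pre-activation skip path leaves the signal untouched.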

D.4 Zero γ

Residual networks often contain normalization layers in which the signal is rescaled by learnable
parameters γ. Initializing these parameters to zero is referred to as Zero γ [23, 24, 25]. If the last
element before a residual connection happens to be a normalization layer, then this initialization has
been found to improve convergence speed and accuracy. This method is in spirit very similar to the
ReZero architecture. However, it potentially zero-initializes many parameters for each block, and is
only applicable when a normalization layer exists.

D.5 FixUp

FixUp initialization [10] carefully rescales the initialization scheme in order to avoid vanishing
or exploding gradients, without the use of normalization techniques. In particular, this scheme is
implemented via the following Rules (verbatim from [10]):

1. Initialize the classification layer and the last layer of each residual branch to 0.
2. Initialize every other layer using a standard method (e.g., He et al. [6]), and scale only the
weight layers inside residual branches by $L^{-1/(2m-2)}$.
3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0)
before each convolution, linear, and element-wise activation layer.

The authors emphasize that Rule 2 is the essential part. ReZero is simpler and similar to the first part
of Rule 3, but the initialization differs.
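The scaling factor in Rule 2 can be evaluated directly. The symbols L and m are not defined in this section; the sketch below assumes the reading from [10], where L is the number of residual branches and m the number of layers inside each branch. The function name `fixup_branch_scale` is hypothetical.

```python
def fixup_branch_scale(num_branches, layers_per_branch):
    """Return L ** (-1 / (2m - 2)), the Rule 2 scaling factor.

    Assumes L = num_branches and m = layers_per_branch (reading of [10]).
    """
    if layers_per_branch < 2:
        raise ValueError("the exponent is undefined for m < 2")
    return num_branches ** (-1.0 / (2 * layers_per_branch - 2))

for L, m in [(16, 2), (16, 3), (64, 2)]:
    print(f"L={L}, m={m}: scale={fixup_branch_scale(L, m):.4f}")
```

The factor shrinks as the number of branches grows, which is how the scheme keeps the summed branch contributions bounded without normalization.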

D.6 SkipInit

[26] proposes to replace BatchNorm layers with a single scalar initialized at a small value. SkipInit is
only applicable when normalization layers exist.

D.7 ReZero

ReZero is the simplest iteration achieving the goal of deep signal propagation. Schematically, the
ReZero architecture is
$$x_{i+1} = x_i + \alpha_i F(x_i) . \qquad (16)$$
The rule to implement ReZero is

1. For every block add a scalar multiplier α (initialized at 0) and a residual connection.
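The rule above can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' implementation: the block F is assumed to be a random tanh layer, the class name `ReZeroBlock` is hypothetical, and no training is performed, so α is set by hand.

```python
import numpy as np

class ReZeroBlock:
    """x -> x + alpha * F(x), with the residual weight alpha initialized at 0."""

    def __init__(self, width, rng):
        # Assumed toy block: a single random linear map followed by tanh.
        self.W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        self.alpha = 0.0  # trainable in practice; fixed here for illustration

    def F(self, x):
        return np.tanh(self.W @ x)

    def __call__(self, x):
        return x + self.alpha * self.F(x)

def forward(blocks, x):
    for block in blocks:
        x = block(x)
    return x

rng = np.random.default_rng(0)
width, depth = 16, 1000
blocks = [ReZeroBlock(width, rng) for _ in range(depth)]
x = rng.normal(size=width)

# At initialization every block is the identity, whatever the depth.
out_at_init = forward(blocks, x)

# Once the residual weights move away from zero, the blocks contribute.
for block in blocks:
    block.alpha = 0.01
out_after = forward(blocks, x)
```

With α = 0 the 1000-block stack returns its input exactly, which is the sense in which the signal propagates unimpeded at the start of training.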

E CIFAR-10 experiments
In this section we briefly describe the hyperparameters used in the image recognition experiments
performed in §5. For all these experiments we used PyTorch version 1.2.0 (we have observed
some inconsistencies in results with other PyTorch versions that may be due to different default
initializations). CIFAR10 experiments tend to take less than an hour on a single RTX 2080TI GPU.

E.1 Step-down schedule

In Table 1 we compare the inference accuracy of different network architectures after training with
identical hyperparameters and a learning-rate schedule that decreases in three steps, as in [2]. The initial
learning rate is 0.1 and decreases by a factor of 10 at 100 and 150 epochs. The models are trained for
a total of 200 epochs. We use the SGD optimizer with a batch size of 128, weight decay of $5 \times 10^{-4}$
and momentum 0.9.
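The step-down schedule can be written as a small function. The sketch assumes epochs are counted from 0 and that each drop takes effect at the start of the milestone epoch; the text does not specify the boundary convention, and the function name `step_down_lr` is hypothetical.

```python
def step_down_lr(epoch, initial_lr=0.1, milestones=(100, 150), factor=0.1):
    """Learning rate after applying a 10x drop at each passed milestone."""
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:  # assumed convention: drop applies from this epoch on
            lr *= factor
    return lr

for epoch in (0, 99, 100, 149, 150, 199):
    print(epoch, f"{step_down_lr(epoch):.4f}")
```

Over the 200 training epochs this yields three plateaus: 0.1, then 0.01, then 0.001.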

E.2 Superconvergence schedule

To demonstrate superconvergence we use a one-cycle learning rate schedule, as described in [30] and
closely follow the setup by Fast AI referenced in the text. In particular, the learning rate of the SGD
optimizer evolves as follows over 45 epochs. The initial learning rate is 0.032 and linearly increases
with each iteration to reach 1.2 after 10% of the total number of iterations has been reached. Then,
the learning rate linearly decreases to return to 0.032 when 90% of the total steps have been completed.
Thereafter, the learning rate linearly decays to a final value of 0.001 at the end of training. The SGD
momentum varies between 0.85 and 0.95, mirroring the triangular learning rate, as is standard for the
one-cycle policy in this setup [30]. Weight decay is set to $2 \times 10^{-4}$ and the batch size is 512.
The residual weights cannot tolerate the extremely large learning rates required for the
superconvergence phenomenon. For this reason we keep the learning rate of the residual weights at
0.1 throughout training.
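The piecewise-linear learning rate described above can be sketched as follows. Only the learning rate is modeled (the momentum cycle is omitted), the step count of 1000 is an arbitrary illustration, and the names `one_cycle_lr` and `RESIDUAL_WEIGHT_LR` are hypothetical.

```python
def one_cycle_lr(step, total_steps, lr_start=0.032, lr_peak=1.2, lr_final=0.001,
                 warmup_frac=0.1, anneal_end_frac=0.9):
    """Piecewise-linear one-cycle learning rate using the values from the text."""
    def lerp(a, b, t):
        return a + (b - a) * t

    frac = step / total_steps
    if frac <= warmup_frac:
        # Linear increase from lr_start to lr_peak over the first 10% of steps.
        return lerp(lr_start, lr_peak, frac / warmup_frac)
    if frac <= anneal_end_frac:
        # Linear decrease back to lr_start by 90% of the steps.
        t = (frac - warmup_frac) / (anneal_end_frac - warmup_frac)
        return lerp(lr_peak, lr_start, t)
    # Final linear decay to lr_final over the last 10% of steps.
    t = (frac - anneal_end_frac) / (1.0 - anneal_end_frac)
    return lerp(lr_start, lr_final, t)

# The residual weights keep a constant learning rate throughout training.
RESIDUAL_WEIGHT_LR = 0.1

total_steps = 1000  # arbitrary illustrative value
for step in (0, 100, 500, 900, 1000):
    print(step, round(one_cycle_lr(step, total_steps), 4))
```

In an optimizer with parameter groups, the schedule would apply to the ordinary weights while the residual weights stay at the constant rate.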

F Datasets
We note that for all our experiments, we follow the official training, validation and test splits of the
respective datasets.
