EvoNorms: Unified Normalization-Activation Layers
Abstract
1 Introduction
Normalization layers and activation functions are fundamental building blocks in deep networks
for stable optimization and improved generalization. Although they frequently co-locate, they are
designed separately in previous works. There are several heuristics widely adopted during the design
process of these building blocks. For example, a common heuristic for normalization layers is to
use mean subtraction and variance division [1–4], while a common heuristic for activation functions
is to use scalar-to-scalar transformations [5–11]. These heuristics may not be optimal as they treat
normalization layers and activation functions as separate. Can automated machine learning discover
a novel building block to replace these layers and go beyond the existing heuristics?
Here we revisit the design of normalization layers and activation functions using an automated
approach. Instead of designing them separately, we unify them into a normalization-activation layer.
With this unification, we can formulate the layer as a tensor-to-tensor computation graph consisting
of basic mathematical functions such as addition, multiplication and cross-dimensional statistical
moments. These low-level mathematical functions form a highly sparse and large search space,
in contrast to mainstream NAS which uses high-level modules (e.g., Conv-BN-ReLU). To address
the challenge of the size and sparsity of the search space, we develop novel rejection protocols
to efficiently filter out candidate layers that do not work well. To promote strong generalization
across different architectures, we use multi-objective evolution to explicitly optimize each layer’s
performance over multiple architectures. Our method leads to the discovery of a set of novel layers,
dubbed EvoNorms, with surprising structures that go beyond expert designs (an example layer is
shown in Table 1 and Figure 1). For example, our most performant layers allow normalization and
activation functions to interleave, in contrast to BatchNorm-ReLU or ReLU-BatchNorm [1, 12] where
normalization and activation function are applied sequentially. Some EvoNorms do not attempt to
center the feature maps, and others require no explicit activation functions.

Table 1: An example discovered layer (EvoNorm-B0) compared against BN-ReLU.

| Layer | Expression |
| BN-ReLU | $\max\!\left(\frac{x-\mu_{b,w,h}(x)}{\sqrt{s^2_{b,w,h}(x)}}\gamma+\beta,\ 0\right)$ |
| EvoNorm-B0 | $\frac{x}{\max\!\left(\sqrt{s^2_{b,w,h}(x)},\ v_1 x+\sqrt{s^2_{w,h}(x)}\right)}\gamma+\beta$ |

Code for EvoNorms on ResNets: [Link]
EvoNorms consist of two series: B series and S series. The B series are batch-dependent and were
discovered by our method without any constraint. The S series work on individual samples, and were
discovered by rejecting any batch-dependent operations. We verify their performance on a number of
image classification architectures, including ResNets [13], MobileNetV2 [14] and EfficientNets [15].
We also study their interactions with a range of data augmentations, learning rate schedules and
batch sizes, from 32 to 1024 on ResNets. Our experiments show that EvoNorms can substantially
outperform popular layers such as BatchNorm-ReLU and GroupNorm-ReLU. On Mask R-CNN [16]
with FPN [17] and with SpineNet [18], EvoNorms achieve consistent gains on COCO instance
segmentation with negligible computation overhead. To further verify their generalization, we pair
EvoNorms with a BigGAN model [19] and achieve promising results on image synthesis.
Our contributions can be summarized as follows:
• We are the first to search for a combined normalization-activation layer. Our proposal to
unify them into a single graph, and the resulting search space, are novel. Our work tackles a
missing link in AutoML by providing evidence that it is possible to use AutoML to discover
a new building block from low-level mathematical operations (see Sec. 2 for a comparison
with related works). A combination of our work with traditional NAS may realize the full
potential of AutoML in automatically designing machine learning models from scratch.
• We propose novel rejection protocols to filter out candidates that do not work well or fail
to work, based on both their performance and stability. We are also the first to address
strong generalization of layers by pairing each candidate with multiple architectures and to
explicitly optimize their cross-architecture performance. Our techniques can be used by other
AutoML methods that have large, sparse search spaces and require strong generalization.
• Our discovered layers, EvoNorms, by themselves are novel contributions because they are
different from previous works. They work well on a diverse set of architectures, including
ResNets [12, 13], MobileNetV2 [14], MnasNet [20], EfficientNets [15], and transfer well to
Mask R-CNN [16], SpineNet [21] and BigGAN-deep [19]. E.g., on Mask R-CNN our gains
are +1.9 AP over BN-ReLU and +1.3 AP over GN-ReLU. EvoNorms have a high potential
impact because normalization and activation functions are central to deep learning.
• EvoNorms shed light on the design of normalization and activation layers. E.g., their struc-
tures suggest the potential benefits of non-centered normalization schemes, mixed variances
and tensor-to-tensor rather than scalar-to-scalar activation functions (these properties can be
seen in Table 1). Some of these insights can be used by experts to design better layers.
2 Related Works
Separate efforts have been devoted to designing better activation functions and normalization layers,
either manually [1–4, 7–10, 22, 23] or automatically [11, 24]. Different from the previous works, we
eliminate the boundary between normalization and activation layers and search them jointly as a
unified building block. Our search space is more challenging than those in the existing automated
approaches [11, 24]. For example, we avoid relying on common heuristics like mean subtraction
or variance division in handcrafted normalization schemes; we search for general tensor-to-tensor
transformations instead of scalar-to-scalar transformations in activation function search [11].
Our approach is inspired by recent works on neural architecture search, e.g., [20,25–50], but has a very
different goal. While existing works aim to specialize an architecture built upon well-defined building
blocks such as Conv-BN-ReLU or inverted bottlenecks [14], we aim to discover new building blocks
starting from low-level mathematical functions. Our motivation is similar to AutoML-Zero [51] but
has a more practical focus: the method not only leads to novel building blocks with new insights, but
also achieves competitive results across many large-scale computer vision tasks.
Our work is also related to recent efforts on improving the initialization conditions for deep net-
works [52–54] in terms of challenging the necessity of traditional normalization layers. While those
techniques are usually architecture-specific, our method discovers layers that generalize well across a
variety of architectures and tasks without specialized initialization strategies.
3 Search Space
Primitive Operations. Table 2 shows the primitive operations in the search space, including
element-wise operations and aggregation operations that enable communication across different
axes of the tensor.

Here we explain the notations $I$, $\mu$ and $s$ in the table for the aggregation ops. First, an
aggregation op needs to know the axes (index set) over which it operates, which is denoted by $I$.
Let $x$ be a 4-dimensional tensor of feature maps. We use $b, w, h, c$ to refer to its batch, width,
height and channel dimensions, respectively. We use $x_I$ to represent a subset of $x$'s elements
along the dimensions indicated by $I$. For example, $I = (b, w, h)$ indexes all the elements of $x$
along the batch, width and height dimensions; $I = (w, h)$ refers to elements along the spatial
dimensions only.

The notations $\mu$ and $s$ indicate ops that compute statistical moments, a natural way to aggregate
over a set of elements. Let $\mu_I(x)$ be a mapping that replaces each element in $x$ with the 1st
order moment (mean) of $x_I$. Likewise, let $s^2_I(x)$ be a mapping that transforms each element of
$x$ into the 2nd order centered moment (variance) among the elements in $x_I$. Note both $\mu_I$ and
$s^2_I$ preserve the shape of the original tensor.

Table 2: Search space primitives. The index set $I$ can take any value among $\{(b, w, h), (w, h, c),
(w, h), (w, h, c/g)\}$. A small $\epsilon$ is inserted as necessary for numerical stability. All the
operations preserve the shape of the input tensor.

| Element-wise Ops | Expression | Arity |
| Add | $x + y$ | 2 |
| Mul | $x \times y$ | 2 |
| Div | $x / y$ | 2 |
| Max | $\max(x, y)$ | 2 |
| Neg | $-x$ | 1 |
| Sigmoid | $\sigma(x)$ | 1 |
| Tanh | $\tanh(x)$ | 1 |
| Exp | $e^x$ | 1 |
| Log | $\mathrm{sign}(x) \cdot \ln(|x|)$ | 1 |
| Abs | $|x|$ | 1 |
| Square | $x^2$ | 1 |
| Sqrt | $\mathrm{sign}(x) \cdot \sqrt{|x|}$ | 1 |
| Aggregation Ops | Expression | Arity |
| 1st order | $\mu_I(x)$ | 1 |
| 2nd order | $\sqrt{\mu_I(x^2)}$ | 1 |
| 2nd order, centered | $s^2_I(x)$ | 1 |
Finally, we use ·/g to indicate that aggregation is carried out in a grouped manner along a dimension.
We allow I to take values among (b, w, h), (w, h, c), (w, h) and (w, h, c/g). Combinations like
(b, w/g, h, c) or (b, w, h/g, c) are not considered to ensure the model remains fully convolutional.
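To make the aggregation ops concrete, below is a minimal NumPy sketch of shape-preserving moments
over an index set, assuming an NHWC tensor layout; the function name, signature and grouping details
are our own illustration rather than the exact implementation used in the search.

import numpy as np

def moments(x, index_set, groups=32):
    # Shape-preserving 1st/2nd-order moments of an NHWC tensor over an index
    # set. index_set is one of 'bwh', 'whc', 'wh', 'whc/g'; 'whc/g' splits
    # the channel dimension into `groups` groups before aggregating.
    # Returns (mu, var), both broadcast back to the shape of x.
    if index_set == 'whc/g':
        b, h, w, c = x.shape
        xg = x.reshape(b, h, w, groups, c // groups)
        mu = xg.mean(axis=(1, 2, 4), keepdims=True)
        var = ((xg - mu) ** 2).mean(axis=(1, 2, 4), keepdims=True)
        bc = np.broadcast_to
        return (bc(mu, xg.shape).reshape(x.shape),
                bc(var, xg.shape).reshape(x.shape))
    axes = {'bwh': (0, 1, 2), 'whc': (1, 2, 3), 'wh': (1, 2)}[index_set]
    mu = x.mean(axis=axes, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=axes, keepdims=True)
    return np.broadcast_to(mu, x.shape), np.broadcast_to(var, x.shape)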
Random Graph Generation. A random computation graph in our search space can be generated
in a sequential manner. Starting from the initial nodes, we generate each new node by randomly
sampling a primitive op and then randomly sampling its input nodes according to the op’s arity. The
process is repeated multiple times and the last node is used as the output.
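As a concrete illustration, a minimal sketch of this sequential sampling procedure is given below; the
op table, the choice of initial nodes and all names are our own simplification, not the exact setup
used in the search.

import random

# (name, arity) pairs summarizing Table 2; only arities matter for sampling,
# since the expressions are applied when the graph is executed.
PRIMITIVES = [('add', 2), ('mul', 2), ('div', 2), ('max', 2),
              ('neg', 1), ('sigmoid', 1), ('tanh', 1), ('exp', 1),
              ('log', 1), ('abs', 1), ('square', 1), ('sqrt', 1),
              ('mu_I', 1), ('rms_I', 1), ('var_I', 1)]

def random_graph(num_nodes, initial_nodes=('x', 'gamma', 'beta', 'v1')):
    # Sample a computation graph as a list of (op, input_indices) tuples;
    # the initial nodes are leaves and the last node is the layer's output.
    nodes = [(name, ()) for name in initial_nodes]
    for _ in range(num_nodes):
        op, arity = random.choice(PRIMITIVES)
        inputs = tuple(random.randrange(len(nodes)) for _ in range(arity))
        nodes.append((op, inputs))
    return nodes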
With the above search space, we perform several small-scale experiments with random sampling to
understand its behavior. Our observations are as follows:
Observation 1: High sparsity of the search space. While our search space can be expanded
further, it is already large enough to be challenging. As an exercise, we took 5000 random samples
from the search space and plugged them into three architectures on CIFAR-10. Figure 2 shows that
none of the 5000 samples can outperform BN-ReLU. The accuracies for the vast majority of them are
no better than a random guess (note the y-axis is in log scale). A typical random layer looks like
$\mathrm{sign}(z)\sqrt{|z|}\,\gamma + \beta$ with $z = \sqrt{s^2_{w,h}(\sigma(|x|))}$, and leads to near-zero ImageNet accuracies. Although random
search does not seem to work well, we will demonstrate later that with a better search method, this
search space is interesting enough to yield performant layers with highly novel structures.
Observation 2: Weak generalization across architectures. It is our goal to develop layers that
work well across many architectures, e.g., ResNets, MobileNets, EfficientNets etc. We refer to this as
strong generalization because it is a desired property of BatchNorm-ReLU. As another exercise, we
pair each of the 5000 samples with three different architectures, and plot the accuracy calibrations on
CIFAR-10. The results are shown in Figure 3, which indicate that layers that perform well on one
architecture can fail completely on the other ones. Specifically, a layer that performs well on ResNets
may not enable meaningful learning on MobileNets or EfficientNets at all.
4 Search Method
In this section, we propose to use evolution as the search method (Sec. 4.1), and modify it to address
the sparsity and achieve strong generalization. In particular, to address the sparsity of the search
space, we propose efficient rejection protocols to filter out a large number of undesirable layers based
on their quality and stability (Sec. 4.2). To achieve strong generalization, we propose to evaluate each
layer over many different architectures and use multi-objective optimization to explicitly optimize its
cross-architecture performance (Sec. 4.3).
4.1 Evolution
Here we propose to use evolution to search for better layers. The implementation is based on a variant
of tournament selection [55]. At each step, a tournament is formed based on a random subset of the
population. The winner of the tournament is allowed to produce offspring, which will be evaluated
and added into the population. The overall quality of the population hence improves as the process
repeats. We also regularize the evolution by maintaining a sliding window of only the most recent
portion of the population [41]. To produce an offspring, we mutate the computation graph of the
winning layer in three steps. First, we select an intermediate node uniformly at random. Then we
replace its current operation with a new one in Table 2 uniformly at random. Finally, we select new
predecessors for this node among existing nodes in the graph uniformly at random.
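A minimal sketch of this three-step mutation, assuming the (op, input_indices) graph representation
from the sampling sketch in Sec. 3 (again, a simplification rather than the exact implementation):

import copy
import random

def mutate(graph, primitives, num_initial):
    # 1) pick an intermediate node, 2) pick a new primitive op for it,
    # 3) pick new predecessors among earlier nodes (restricting predecessors
    # to earlier nodes keeps the graph acyclic in this representation).
    child = copy.deepcopy(graph)
    idx = random.randrange(num_initial, len(child))
    op, arity = random.choice(primitives)
    inputs = tuple(random.randrange(idx) for _ in range(arity))
    child[idx] = (op, inputs)
    return child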
Although evolution improves sample efficiency over random search, it does not resolve the high
sparsity issue of the search space. This motivates us to develop two rejection protocols to filter out
bad layers after short training. A layer must pass both tests to be considered by evolution.
Quality. We discard layers that achieve less than 20% CIFAR-10 validation accuracy (i.e., twice the
accuracy of a random guess) after training for 100 steps. Since the vast majority of candidate layers
attain poor accuracies, this simple mechanism ensures that compute resources are concentrated on the
full training of a small number of promising candidates. Empirically, this speeds up the search by up
to two orders of magnitude.
Stability. In addition to quality, we reject layers that are subject to numerical instability. The basic
idea is to stress-test the candidate layer by adversarially adjusting the model weights θ towards
the direction that maximizes the network's gradient norm. Formally, let $\ell(\theta, G)$ be the training loss
of a model paired with computation graph $G$, computed on a small batch of images. Instability of
training is reflected by the worst-case gradient norm, $\max_\theta \|\partial \ell(\theta, G)/\partial \theta\|_2$. We seek to maximize
this value by performing gradient ascent along the direction $\partial \|\partial \ell(\theta, G)/\partial \theta\|_2 / \partial \theta$ for up to
100 steps. Layers whose gradient norm exceeds $10^8$ are rejected. The stability test focuses on
robustness because it considers the worst case, hence it is complementary to the quality test. The test
is also highly efficient: the gradients of many candidate layers blow up within 5 steps.
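A hedged TensorFlow 2 sketch of this stress test is given below; the function names, the ascent step
size and the assumption that every weight receives a gradient are ours, and loss_fn is presumed to
evaluate the candidate model on a fixed small batch.

import tensorflow as tf

def stability_test(loss_fn, weights, steps=100, ascent_lr=0.01, threshold=1e8):
    # Push the weights in the direction that increases the gradient norm and
    # reject the layer if the norm exceeds the threshold at any point.
    for _ in range(steps):
        with tf.GradientTape() as outer_tape:
            with tf.GradientTape() as inner_tape:
                loss = loss_fn()
            grads = inner_tape.gradient(loss, weights)
            grad_norm = tf.sqrt(sum(tf.reduce_sum(g ** 2) for g in grads))
        if grad_norm > threshold:
            return False                       # numerically unstable: reject
        ascent = outer_tape.gradient(grad_norm, weights)
        for w, g in zip(weights, ascent):
            w.assign_add(ascent_lr * g)        # gradient *ascent* on the norm
    return True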
Tournament Selection Criterion. As each layer is paired with multiple architectures, and therefore
has multiple scores, there are multiple ways to decide the tournament winner within evolution:
Average: The layer with the highest average accuracy wins (e.g., B in Figure 5 wins because it has the
highest average performance on the two models).
Pareto: A random layer on the Pareto frontier wins (e.g., A, B and C in Figure 5 are equally likely to
win, as none of them is dominated by any other candidate).
Empirically we observe that different architectures have different accuracy variations. For example,
ResNet has been observed to have a higher accuracy variance than MobileNet and EfficientNet on
CIFAR-10 proxy tasks. Hence, under the Average criterion, the search would be biased towards helping
ResNet rather than MobileNet or EfficientNet. We therefore propose to use the Pareto frontier criterion
to avoid this bias. Our method is novel, but reminiscent of NSGA-II [57], a well-established multi-
objective genetic algorithm, in terms of simultaneously optimizing all the non-dominated solutions.
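For concreteness, a minimal sketch of the Pareto criterion (the helper name is ours; each candidate
carries one proxy accuracy per anchor architecture):

import random

def pareto_winner(candidates):
    # candidates: list of (layer, scores) pairs, where scores is a tuple of
    # proxy-task accuracies. A candidate is dominated if some other candidate
    # is at least as good everywhere and strictly better somewhere.
    def dominated(s, all_scores):
        return any(all(o >= x for o, x in zip(other, s)) and other != s
                   for other in all_scores)

    all_scores = [s for _, s in candidates]
    frontier = [c for c in candidates if not dominated(c[1], all_scores)]
    return random.choice(frontier)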
Footnotes:
3. We always use the v2 instantiation of ResNets [13], where ReLUs are adjacent to BatchNorm layers.
4. For each model, a custom layer is used to replace BN-ReLU/SiLU/Swish in the original architecture.
Each custom layer is followed by a channel-wise affine transform. See pseudocode in Appendix A for details.
5 Experiments
We include details of our experiments in Appendix C, including settings and hyperparameters of the
proxy task, search method, reranking, and full-fledged evaluations. In summary, we did the search on
CIFAR-10, and re-ranked the top-10 layers on a held-out set of ImageNet to obtain the best 3 layers.
The top-10 layers are reported in Appendix D. For all results below, our layers and baseline layers
are compared under an identical training setup. Hyperparameters are directly inherited from the original
implementations (usually in favor of BN-based layers) without tuning w.r.t. EvoNorms.
Batch-dependent Layers. In Table 3, we compare the three discovered layers against some widely
used normalization-activation layers on ImageNet, including strong baselines with the SiLU/Swish
activation function [9, 11, 22]. We refer to our layers as the EvoNorm-B series, as they involve Batch
aggregations ($\mu_{b,w,h}$ and $s^2_{b,w,h}$) and hence require maintaining moving-average statistics for inference.
The table shows that EvoNorms are no worse than BN-ReLU across all cases, and perform better on
average than the strongest baseline. It is worth emphasizing that the hyperparameters and architectures
used in the table are implicitly optimized for BatchNorm due to historical reasons.
Table 3: ImageNet accuracy of batch-dependent layers. R-50: ResNet-50 under four training settings
(original, +aug, +aug+2×, +aug+2×+cos); MV2: MobileNetV2; MN: MnasNet; EN: EfficientNet.

| Layer | Expression | R-50 original | R-50 +aug | R-50 +aug+2× | R-50 +aug+2×+cos | MV2 | MN | EN-B0 | EN-B5 |
| BN-ReLU | $\max(z, 0)$, $z = \frac{x-\mu_{b,w,h}(x)}{\sqrt{s^2_{b,w,h}(x)}}\gamma+\beta$ | 76.3±0.1 | 76.2±0.1 | 77.6±0.1 | 77.7±0.1 | 73.4±0.1 | 74.6±0.2 | 76.4 | 83.6 |
| BN-SiLU/Swish | $z\sigma(v_1 z)$, $z = \frac{x-\mu_{b,w,h}(x)}{\sqrt{s^2_{b,w,h}(x)}}\gamma+\beta$ | 76.6±0.1 | 77.3±0.1 | 78.2±0.1 | 78.2±0.0 | 74.5±0.1 | 75.3±0.1 | 77.0 | 83.5 |
| Random | $\mathrm{sign}(z)\sqrt{|z|}\,\gamma+\beta$, $z = \sqrt{s^2_{w,h}(\sigma(|x|))}$ | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| Random + rej | $\tanh(\max(x, \tanh(x)))\gamma+\beta$ | 71.7±0.2 | 70.8±0.1 | 63.6±18.9 | 55.3±17.5 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| RS + rej | $\frac{\max(x, 0)}{\sqrt{\mu_{b,w,h}(x^2)}}\gamma+\beta$ | 75.8±0.1 | 76.3±0.0 | 77.4±0.1 | 77.5±0.1 | 73.5±0.1 | 74.6±0.1 | 76.4 | 83.2 |
| EvoNorm-B0 | $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ v_1 x+\sqrt{s^2_{w,h}(x)})}\gamma+\beta$ | 76.6±0.0 | 77.7±0.1 | 77.9±0.1 | 78.4±0.1 | 75.0±0.1 | 75.3±0.0 | 76.8 | 83.6 |
| EvoNorm-B1 | $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ (x+1)\sqrt{\mu_{w,h}(x^2)})}\gamma+\beta$ | 76.1±0.1 | 77.5±0.0 | 77.7±0.0 | 78.0±0.1 | 74.6±0.1 | 75.1±0.1 | 76.5 | 83.6 |
| EvoNorm-B2 | $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ \sqrt{\mu_{w,h}(x^2)}-x)}\gamma+\beta$ | 76.6±0.2 | 77.7±0.1 | 78.0±0.1 | 78.4±0.1 | 74.6±0.1 | 75.0±0.1 | 76.6 | 83.4 |
Table 3 also shows that a random layer in our search space only
achieves near-zero accuracy on ImageNet. It then shows that with
our proposed rejection rules in Sec. 4.2 (Random + rej), one can
find a layer with meaningful accuracies on ResNets. Finally, with compute comparable
to that of evolution, random search with rejection
(RS + rej) can discover a compact variant of BN-ReLU. This layer
achieves promising results across all architectures, albeit clearly
worse than EvoNorms. The search progress of random search
relative to evolution is shown in Figure 6 and in Appendix E.
Table 4: Left: ImageNet results of batch-independent layers with large and small batch sizes. Learning
rates are scaled linearly relative to the batch sizes [60]. For ResNet-50 we report results under both
the standard training setting and the fancy setting (+aug+2×+cos). We also report full results under
four different training settings in Appendix E. Right: Performance as the batch size decreases. For
ResNet, solid and dashed lines are obtained with standard setting and fancy setting, respectively.
To investigate if our discovered layers generalize beyond the classification task that we searched on,
we pair them with Mask R-CNN [16] for object detection and instance segmentation on COCO [61].
We consider two different types of backbones: ResNet-FPN [17] and SpineNet [21]. The latter is
particularly interesting because the architecture has a highly non-linear layout which is very different
from the classification models used during search. In all of our experiments, EvoNorms are applied
to both the backbone and the heads to replace their original activation-normalization layers. Detailed
training settings of these experiments are available in Appendix C.
Table 5: Mask R-CNN object detection and instance segmentation results on COCO val2017.
Results are summarized in Table 5. With both ResNet-FPN and SpineNet backbones, EvoNorms
significantly improve the APs with negligible impact on FLOPs or model sizes. While EvoNorm-B0
offers the strongest results, EvoNorm-S0 outperforms commonly used layers, including GN-ReLU and
BN-ReLU, by a clear margin without requiring moving-average statistics. These results demonstrate
strong generalization beyond the classification task that the layers were searched on.
resolution using Inception Score (IS, [63]) and Fréchet Inception distance (FID, [64]). We compare
two of our most performant layers, B0 and S0, against the baseline BN-ReLU and GN-ReLU, as
well as LayerNorm-ReLU [2], and PixelNorm-ReLU [65], a layer designed for a different GAN
architecture. We sweep the number of groups in GN-ReLU over 8, 16 and 32, and report results using 16
groups. Consistent with BigGAN training, we report results at peak performance in Table 6. Note
higher is better for IS, lower is better for FID.
Swapping BN-ReLU out for most other layers substantially cripples training, but both EvoNorm-B0
and S0 achieve comparable (albeit worse) IS and improved FID relative to the BN-ReLU baseline. Notably,
EvoNorm-S0 outperforms all the other per-sample normalization-activation layers in both IS and FID.
This result further confirms that EvoNorms transfer to visual tasks in multiple domains.
EvoNorm-S0. It is interesting to observe the SiLU/Swish activation function [9, 11, 22] as a part
of EvoNorm-S0. The algorithm also learns to divide the post-activation features by the standard
deviation part of GroupNorm [4] (GN). Note this is not equivalent to applying GN and SiLU/Swish
sequentially. The full expression for GN-SiLU/Swish is
$\frac{x-\mu_{w,h,c/g}(x)}{\sqrt{s^2_{w,h,c/g}(x)}}\,\sigma\!\left(v_1\,\frac{x-\mu_{w,h,c/g}(x)}{\sqrt{s^2_{w,h,c/g}(x)}}\right)$,
whereas the expression for EvoNorm-S0 is $\frac{x}{\sqrt{s^2_{w,h,c/g}(x)}}\,\sigma(v_1 x)$ (omitting $\gamma$ and $\beta$).
The latter is more compact and efficient.
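For reference, a minimal TensorFlow sketch of EvoNorm-S0 following the expression above, assuming
NHWC inputs with static shapes, a channel count divisible by the number of groups, and per-channel
parameters gamma, beta, v1 (variable creation omitted):

import tensorflow as tf

def evonorm_s0(x, gamma, beta, v1, groups=32, eps=1e-5):
    # EvoNorm-S0: x * sigmoid(v1 * x) / group_std(x) * gamma + beta.
    # gamma, beta, v1 are assumed broadcastable to (1, 1, 1, C).
    _, h, w, c = x.shape
    xg = tf.reshape(x, [-1, h, w, groups, c // groups])
    var = tf.math.reduce_variance(xg, axis=[1, 2, 4], keepdims=True)
    group_std = tf.sqrt(var + eps)                     # one std per group
    group_std = tf.reshape(tf.broadcast_to(group_std, tf.shape(xg)), tf.shape(x))
    return x * tf.sigmoid(v1 * x) / group_std * gamma + beta

The same code runs in both training and inference, since no batch statistics are involved.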
The overall structure of EvoNorm-S0 offers an interesting hint that SiLU/Swish-like nonlinearities and
grouped normalizers may be complementary to each other. Although both GN and Swish have
been popular in the literature, their combination is under-explored to the best of our knowledge. In
Table 8 we evaluate GN-SiLU/Swish and compare it with other layers that are batch-independent.
The results confirm that both EvoNorm-S0 and GN-SiLU/Swish can indeed outperform GN-ReLU
by a clear margin, though EvoNorm-S0 generalizes better on MobileNetV2.
6 Conclusion
In this work, we unify the normalization layer and the activation function into a single tensor-to-tensor
computation graph consisting of basic mathematical functions. Unlike mainstream NAS works that specialize
a network based on existing layers (Conv-BN-ReLU), we aim to discover new layers that can general-
ize well across many different architectures. We first identify challenges including high search space
sparsity and the weak generalization issue. We then propose techniques to overcome these challenges
using efficient rejection protocols and multi-objective evolution. Our method discovered novel layers
with surprising structures that achieve strong generalization across many architectures and tasks.
Acknowledgements
The authors would like to thank Gabriel Bender, Chen Liang, Esteban Real, Sergey Ioffe, Prajit
Ramachandran, Pengchong Jin, Xianzhi Du, Ekin D. Cubuk, Barret Zoph, Da Huang, and Mingxing
Tan for their comments and support.
References
[1] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In International Conference on Machine Learning, 2015.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[3] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient
for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[4] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 3–19, 2018.
[5] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In
Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[6] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network
acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing,
2013.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international conference
on computer vision, pages 1026–1034, 2015.
[8] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning
by exponential linear units (elus). International Conference on Learning Representations, 2016.
[9] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415,
2016.
[10] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural
networks. In Advances in neural information processing systems, pages 971–980, 2017.
[11] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. In ICLR Workshop,
2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks.
In European conference on computer vision, pages 630–645. Springer, 2016.
[14] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2:
Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4510–4520, 2018.
[15] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks.
In International Conference on Machine Learning, 2019.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[17] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature
pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2117–2125, 2017.
[18] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiao-
dan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[19] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural
image synthesis. In International Conference on Learning Representations, 2019.
[20] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le.
Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
[21] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and
Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. arXiv
preprint arXiv:1912.05027, 2019.
[22] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function
approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
[23] Saurabh Singh and Shankar Krishnan. Filter response normalization layer: Eliminating batch dependence
in the training of deep neural networks. arXiv preprint arXiv:1911.09737, 2019.
[24] Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Differentiable learning-to-normalize
via switchable normalization. International Conference on Learning Represenations, 2019.
[25] Kenneth O Stanley and Risto Miikkulainen. Efficient reinforcement learning through evolving neural net-
work topologies. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation,
pages 569–577, 2002.
[26] Justin Bayer, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber. Evolving memory cell structures
for sequence learning. In International Conference on Artificial Neural Networks, pages 755–764. Springer,
2009.
[27] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International
Conference on Learning Representations, 2017.
[28] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures
using reinforcement learning. In International Conference on Learning Representations, 2017.
[29] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures
for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 8697–8710, 2018.
[30] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical
representations for efficient architecture search. In International Conference on Learning Representations,
2018.
[31] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot model architecture
search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.
[32] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille,
Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 19–34, 2018.
[33] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint
arXiv:1806.09055, 2018.
[34] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In
Advances in neural information processing systems, pages 7816–7827, 2018.
[35] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding
and simplifying one-shot architecture search. In International Conference on Machine Learning, pages
550–559, 2018.
[36] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv
preprint arXiv:1812.09926, 2018.
[37] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and
hardware. arXiv preprint arXiv:1812.00332, 2018.
[38] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv
preprint arXiv:1808.05377, 2018.
[39] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural architecture search
via lamarckian evolution. arXiv preprint arXiv:1804.09081, 2018.
[40] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search
via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[41] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier
architecture search. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages
4780–4789, 2019.
[42] Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural
networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision,
pages 1284–1293, 2019.
[43] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter
Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable
neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 10734–10742, 2019.
[44] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single
path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[45] Hanzhang Hu, John Langford, Rich Caruana, Saurajit Mukherjee, Eric J Horvitz, and Debadeepta Dey.
Efficient forward architecture search. In Advances in Neural Information Processing Systems, pages
10122–10131, 2019.
[46] Han Cai, Chuang Gan, and Song Han. Once for all: Train one network and specialize it for efficient
deployment. arXiv preprint arXiv:1908.09791, 2019.
[47] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. arXiv
preprint arXiv:1902.07638, 2019.
[48] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging
the depth gap between search and evaluation. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1294–1303, 2019.
[49] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and
Diana Marculescu. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv
preprint arXiv:1904.02877, 2019.
[50] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang,
Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the
IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
[51] Esteban Real, Chen Liang, David R So, and Quoc V Le. Automl-zero: Evolving machine learning
algorithms from scratch. arXiv preprint arXiv:2003.03384, 2020.
[52] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without
normalization. In International Conference on Learning Representations, 2019.
[53] Soham De and Samuel L Smith. Batch normalization biases deep residual networks towards shallow paths.
arXiv preprint arXiv:2002.10444, 2020.
[54] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W Cottrell, and Julian
McAuley. Rezero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
[55] David E Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic
algorithms. In Foundations of genetic algorithms, volume 1, pages 69–93. Elsevier, 1991.
[56] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis,
Department of Computer Science, University of Toronto, 2009.
[57] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective
genetic algorithm: Nsga-ii. IEEE transactions on evolutionary computation, 6(2):182–197, 2002.
[58] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation
with no separate search. arXiv preprint arXiv:1909.13719, 2019.
[59] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International
Conference on Learning Representations, 2017.
[60] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew
Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv
preprint arXiv:1706.02677, 2017.
[61] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on
computer vision, pages 740–755. Springer, 2014.
[62] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing
systems, pages 2672–2680, 2014.
[63] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In Advances in neural information processing systems, pages 2234–2242,
2016.
[64] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural
information processing systems, pages 6626–6637, 2017.
[65] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved
quality, stability, and variation. In International Conference on Learning Representations, 2018.
[66] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normaliza-
tion schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170,
2018.
[67] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. In International
Conference on Learning Representations, 2020.
A Code Snippets in TensorFlow
The following pseudocode relies on broadcasting to make sure the tensor shapes are compatible.
BN-ReLU
def batchnorm_relu(x, gamma, beta, nonlinearity, training):
    # batch_mean_and_std is a helper that returns the batch statistics during
    # training and the moving averages during inference.
    mean, std = batch_mean_and_std(x, training)
    z = (x - mean) / std * gamma + beta
    if nonlinearity:
        return tf.nn.relu(z)  # the original snippet links to the ReLU op here
    else:
        return z
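Below is an analogous, hedged training-mode sketch of EvoNorm-B0 following Table 1. It reuses the
batch_mean_and_std helper from the snippet above and adds an instance_std helper for the per-sample
spatial standard deviation; the helper name and epsilon placement are our own.

EvoNorm-B0 (sketch)

import tensorflow as tf

def instance_std(x, eps=1e-5):
    # std over the spatial dims (h, w) of each sample, NHWC layout
    var = tf.math.reduce_variance(x, axis=[1, 2], keepdims=True)
    return tf.sqrt(var + eps)

def evonorm_b0(x, gamma, beta, v1, training, eps=1e-5):
    # x / max(batch_std(x), v1 * x + instance_std(x)) * gamma + beta
    # batch_mean_and_std is the same helper as in batchnorm_relu above; at
    # inference time it is assumed to return moving-average statistics.
    _, batch_std = batch_mean_and_std(x, training)
    den = tf.maximum(batch_std, v1 * x + instance_std(x, eps))
    return x / den * gamma + beta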
C Implementation Details
Proxy Task on CIFAR-10. We use the same training setup for all the architectures. Specifically,
we use 24×24 random crops on CIFAR-10 with a batch size of 128 for training, and use the original
32×32 image with a batch size of 512 for validation. We use SGD with learning rate 0.1, Nesterov
momentum 0.9 and weight decay 10−4 . Each model is trained for 2000 steps with a constant learning
rate for the EvoNorm-B experiment. Each model is trained for 5000 steps following a cosine learning
rate schedule for the EvoNorm-S experiment. These are chosen to ensure the majority of the models
can achieve reasonable convergence quickly. With our implementation, it takes 3-10 hours to train
each model on a single CPU worker with two cores.
Evolution. We regularize the evolution [41] by considering a sliding window of only the most
recent 2500 genotypes. Each tournament is formed by a random subset of 5% of the active popu-
lation. The winner is determined as a random candidate on the Pareto frontier w.r.t. the three accuracy
scores (Sec. 4.1), and is mutated twice in order to promote structural diversity. To encourage
exploration, we further inject noise into the evolution process by replacing the offspring with a
completely new random architecture with probability 0.5. Each search experiment takes 2 days to
complete with 5000 CPU workers.
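Putting the pieces together, a compact sketch of this loop with the hyperparameters above; every helper
is a placeholder for a component described in Sec. 4 rather than the actual implementation.

import random

def evolve(seed_population, evaluate, mutate, random_genotype, select_winner,
           num_steps, window=2500, tournament_frac=0.05, explore_prob=0.5):
    # evaluate(g) returns a tuple of proxy accuracies, or None if g fails the
    # quality/stability rejection tests; select_winner picks a random
    # Pareto-optimal candidate from the tournament.
    population = [(g, evaluate(g)) for g in seed_population]
    for _ in range(num_steps):
        k = max(2, int(tournament_frac * len(population)))
        winner, _ = select_winner(random.sample(population, k))
        child = mutate(mutate(winner))          # mutate the winner twice
        if random.random() < explore_prob:      # exploration noise
            child = random_genotype()
        scores = evaluate(child)
        if scores is not None:                  # survived the rejection tests
            population.append((child, scores))
        population = population[-window:]       # sliding window of recent genotypes
    return population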
Reranking. After search, we take the top-10 candidates from evolution and pair each of them with
fully-fledged ResNet-50, MobileNetV2 and EfficientNet-B0. We then rerank the layers based on their
averaged ImageNet accuracies of the three models. To avoid overfitting the validation/test metric,
each model is trained using 90% of the ImageNet training set and evaluated over the rest of the
10%. The top-3 reranked layers are then used to obtain our main results. The reranking task is more
computationally expensive than the proxy task we search with, but is accordingly more representative
of the downstream tasks of interest, allowing for better distinguishing between top candidates.
Classification on ImageNet. For ImageNet results presented in Table 3 (layers with batch statis-
tics), we use a base learning rate of 0.1 per 256 images for ResNets, MobileNetV2 and MnasNet, and
a base learning rate of 0.016 per 256 images for EfficientNets following the official implementation.
Note these learning rates have been heavily optimized w.r.t. batch normalization. For results presented
in Table 4 (layers without batch statistics), the base learning rate for MobileNetV2 is lowered to 0.03
(tuned w.r.t. GN-ReLU among 0.01, 0.03, 0.1). We use the standard multi-stage learning rate schedule
in the basic training setting for ResNets, cosine schedule [59] for MobileNetV2 and MnasNet, and
the original polynomial schedule for EfficientNets. These learning rate schedules also come with
a linear warmup phase [60]. For all architectures, the batch size per worker is 128 and the input
resolution is 224×224. The only exception is EfficientNet-B5, which uses batch size 64 per worker
and input resolution 456×456. The number of workers is 8 for ResNets and 32 for the others. We use
32 groups for the grouped aggregation ops $\mu_{w,h,c/g}$ and $s^2_{w,h,c/g}$.
Instance Segmentation on COCO. The training is carried out over 8 workers with 8 images/worker
and the image resolution is 1024×1024. The models are trained from scratch for 135K steps using
SGD with momentum 0.9, weight decay 4e-5 and an initial learning rate of 0.1, which is reduced
by 10× at step 120K and step 125K. For SpineNet experiments, we follow the training setup of the
original paper [21].
D Top-10 Layers from the Search

Top-10 layers from the B-series search:

2. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = x + \sqrt{\mu_{w,h,c}(x^2)}$
3. $-\frac{x\,\sigma(x)}{\sqrt{s^2_{b,w,h}(x)}}\gamma+\beta$
4. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = x\sqrt{\mu_{w,h}(x^2)}$
5. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = \sqrt{\mu_{w,h,c}(x^2)} - x$
6. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = v_1 x + \sqrt{s^2_{w,h}(x)}$
7. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = \sqrt{\mu_{w,h}(x^2)} - x$
8. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = x\sqrt{\mu_{w,h}(x^2)}$ (duplicate)
9. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = x + \sqrt{s^2_{w,h}(x)}$
10. $\frac{x}{\max(\sqrt{s^2_{b,w,h}(x)},\ z)}\gamma+\beta$, $z = x + \sqrt{s^2_{w,h}(x)}$ (duplicate)

Top-10 layers from the S-series search:

1. $\frac{x}{\sqrt{\mu_{w,h,c/g}(x^2)}}\tanh(\sigma(x))\,\gamma+\beta$
2. $\frac{x\,\sigma(x)}{\sqrt{\mu_{w,h,c/g}(x^2)}}\gamma+\beta$
3. $\frac{x\,\sigma(x)}{\sqrt{\mu_{w,h,c/g}(x^2)}}\gamma+\beta$ (duplicate)
4. $\frac{x\,\sigma(x)}{\sqrt{\mu_{w,h,c/g}(x^2)}}\gamma+\beta$ (duplicate)
5. $\frac{x\,\sigma(x)}{\sqrt{s^2_{w,h,c/g}(x)}}\gamma+\beta$
6. $\frac{x\,\sigma(v_1 x)}{\sqrt{s^2_{w,h,c/g}(x)}}\gamma+\beta$
7. $\frac{x\,\sigma(x)}{\sqrt{s^2_{w,h,c/g}(x)}}\gamma+\beta$ (duplicate)
8. $\frac{x\,\sigma(x)}{\sqrt{\mu_{w,h,c/g}(x^2)}}\gamma+\beta$ (duplicate)
9. $\frac{x\,\sigma(x)}{\sqrt{s^2_{w,h,c/g}(x)}}\gamma+\beta$ (duplicate)
10. $z\,\sigma(\max(x, z))\,\gamma+\beta$, $z = \frac{x}{\sqrt{\mu_{w,h,c/g}(x^2)}}$
E Additional Results
Figure 7: Search progress of evolution vs. random search vs. a fixed baseline (BN-ReLU) on the
proxy task. Each curve denotes the mean and standard deviation of the top-10 architectures in the
population. Only valid samples that survived the rejection phase are reported.
Figure 8: Training/eval loss curves for ResNet-50 (+aug, batch size 1024) and MobileNetV2 (batch
size 4096) on ImageNet. The corresponding test accuracy for each layer is reported in the legend:
ResNet-50: BN-ReLU 76.1, EvoNorm-B0 77.8, GN-ReLU 75.4, EvoNorm-S0 77.4; MobileNetV2:
BN-ReLU 73.1, EvoNorm-B0 75.1, GN-ReLU 71.5, EvoNorm-S0 74.0.
Table 13: MobileNetV2 results as the batch size (images per worker) decreases.

| Layer | 128 | 64 | 32 | 16 | 8 | 4 |
| BN-ReLU | 73.3 | 73.2 | 72.7 | 70.0 | 64.5 | 60.4 |
| GN-ReLU | 71.5 | 72.6 | 72.9 | 72.5 | 72.7 | 72.6 |
| FRN | 73.3 | 73.4 | 73.5 | 73.6 | 73.5 | 73.5 |
| EvoNorm-S0 | 74.0 | 74.0 | 73.9 | 73.9 | 73.9 | 73.8 |
| EvoNorm-S1 | 73.7 | 74.0 | 73.6 | 73.7 | 73.7 | 73.8 |
| EvoNorm-S2 | 73.8 | 73.4 | 73.7 | 73.9 | 73.9 | 73.8 |

Figure 9: Selected samples from BigGAN-deep + EvoNorm-B0.
Batch normalization is efficient thanks to its simplicity and the fact that both the moving averages and
affine parameters can be fused into adjacent convolutions during inference. EvoNorms, despite being
more powerful, can come with more sophisticated expressions. While the overhead is negligible even
for medium-sized models (Table 5), it can be nontrivial for lightweight, mobile-sized models.
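For reference, the standard folding identity behind this statement (inference-mode moving statistics
$\hat\mu$, $\hat\sigma^2$ and affine parameters $\gamma$, $\beta$ absorbed into a preceding convolution with
weights $W$ and bias $b$):

\[
\gamma \cdot \frac{(Wx + b) - \hat\mu}{\sqrt{\hat\sigma^2 + \epsilon}} + \beta
\;=\; W' x + b',
\qquad
W' = \frac{\gamma}{\sqrt{\hat\sigma^2 + \epsilon}}\, W,
\qquad
b' = \frac{\gamma\,(b - \hat\mu)}{\sqrt{\hat\sigma^2 + \epsilon}} + \beta .
\]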
We study this subject in detail using MobileNetV2 as an example, showing that EvoNorm-B0
in fact substantially outperforms BN-ReLU in terms of both accuracy-parameters trade-off and
accuracy-FLOPs trade-off (Figure 10). This is because the cost overhead of EvoNorms can be largely
compensated by their performance gains.

Figure 10: ImageNet accuracy vs. params and accuracy vs. FLOPs for MobileNetV2 paired with
different normalization-activation layers. Each layer is evaluated over three model variants with
channel multipliers 0.9×, 1.0× and 1.1×. We consider the inference-mode FLOPs for both BN-ReLU
and EvoNorm-B0, allowing parameter fusion with adjacent convolutions whenever possible.
It is intriguing to see that EvoNorm-B0 keeps the scale invariance property of traditional layers, which
means rescaling the input x would not affect its output. EvoNorm-S0 is asymptotically scale-invariant
in the sense that it reduces to either $\frac{x}{\sqrt{s^2_{w,h,c/g}(x)}}$ or a constant zero when the magnitude of $x$
becomes large, depending on the sign of $v_1$. The same pattern is also observed for most of the
EvoNorm candidates presented in Appendix D. These observations are aligned with previous findings
that scale invariance might play a useful role in optimization [66, 67].
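A short elaboration of the asymptotic claim above (our derivation, using the fact that the grouped
standard deviation scales linearly with the input): writing EvoNorm-S0 without $\gamma$ and $\beta$ as
$f(x) = x\,\sigma(v_1 x)\,/\sqrt{s^2_{w,h,c/g}(x)}$ and rescaling the input by $\alpha > 0$,

\[
f(\alpha x)
= \frac{\alpha x\,\sigma(v_1 \alpha x)}{\alpha \sqrt{s^2_{w,h,c/g}(x)}}
= \frac{x\,\sigma(v_1 \alpha x)}{\sqrt{s^2_{w,h,c/g}(x)}}
\;\xrightarrow{\;\alpha \to \infty\;}\;
\begin{cases}
x \,/ \sqrt{s^2_{w,h,c/g}(x)} & \text{where } v_1 x > 0,\\[2pt]
0 & \text{where } v_1 x < 0,
\end{cases}
\]

so the layer becomes scale-invariant in the limit, matching the observation above.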