
DEGREE PROJECT IN TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Insights into Model-Agnostic Meta-Learning on Reinforcement Learning Tasks

Konstantinos Saitas-Zarkias

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Author
Konstantinos Saitas-Zarkias <kosz@[Link]>
Machine Learning
KTH Royal Institute of Technology

Place of Project
Stockholm, Sweden
Research Institutes of Sweden, Kista

Examiner
Pawel Herman
KTH Royal Institute of Technology

Supervisor

Alexandre Proutiere
KTH Royal Institute of Technology

Abstract

Meta-learning has been gaining traction in the Deep Learning field as an approach to build models that are able to efficiently adapt to new tasks after deployment. Contrary to conventional Machine Learning approaches, which are trained on a specific task (e.g. image classification on a set of labels), meta-learning methods are meta-trained across multiple tasks (e.g. image classification across multiple sets of labels). Their end objective is to learn how to solve unseen tasks with just a few samples. One of the most renowned methods of the field is Model-Agnostic Meta-Learning (MAML). The objective of this thesis is to supplement the latest relevant research with novel observations regarding the capabilities, limitations and network dynamics of MAML. To this end, experiments were performed on the meta-reinforcement learning benchmark Meta-World. Additionally, a comparison with a recent variation of MAML, called Almost No Inner Loop (ANIL), was conducted, providing insights into the changes of the network's representation during adaptation (meta-testing). The results of this study indicate that MAML is able to outperform the baselines on the challenging Meta-World benchmark but shows little sign of actual "rapid learning" during meta-testing, thus supporting the hypothesis that it reuses features learnt during meta-training.

Keywords

Meta­Learning, Reinforcement Learning, Deep Learning

Acknowledgements

I would like to thank the researchers at the Research Institutes of Sweden, who provided me with a welcoming environment to work in; the professors and classmates at KTH for insightful discussions and support; and the Foundation for Education and European Culture (IPEP), which trusted my work and offered financial aid during my Masters. Most importantly, I would like to thank everyone who stood by me and supported me during the period of my thesis and in the difficult times of the Covid-19 pandemic.

Acronyms

ALE Arcade Learning Environment
ANIL Almost No Inner Loop
ANN Artificial Neural Network
CCA Canonical Correlation Analysis
CNN Convolutional Neural Network
DL Deep Learning
MAML Model-Agnostic Meta-Learning
MDP Markov Decision Process
ML Machine Learning
MLP Multi-Layer Perceptron
GAE Generalised Advantage Estimator
lr learning rate
RL Reinforcement Learning
RNN Recurrent Neural Network
VPG Vanilla Policy Gradient
PPO Proximal Policy Optimisation
TRPO Trust-Region Policy Optimisation

Contents

1 Introduction
   1.1 Background
   1.2 Problem
   1.3 Goal & Contribution
   1.4 Delimitations
   1.5 Outline

2 Theoretical Background
   2.1 Deep Learning
       2.1.1 Artificial Neural Networks
       2.1.2 Gradient-Based Learning & Back-propagation
   2.2 Meta Learning
       2.2.1 Defining Meta-Learning
       2.2.2 Types of Meta-Learning
   2.3 Reinforcement Learning
       2.3.1 Model-based (or World Model) vs Model-free
       2.3.2 Off-policy vs On-policy
       2.3.3 Meta-Reinforcement Learning
   2.4 Related Work
       2.4.1 Model-Agnostic Meta-Learning
       2.4.2 Variations
       2.4.3 Insights on MAML
       2.4.4 Almost No Inner Loop Variation

3 Methodology
   3.1 Experimental setup
       3.1.1 Model-Agnostic & Problem-Agnostic Formulation
       3.1.2 Representation Similarity Experiments
   3.2 Few-shot Image Classification
       3.2.1 Classification Formulation
       3.2.2 Omniglot
       3.2.3 Mini-ImageNet
   3.3 Meta-Reinforcement Learning
       3.3.1 Meta-RL Formulation
       3.3.2 Particles2D
       3.3.3 Meta-World

4 Results
   4.1 Few-Shot Image Classification
   4.2 Meta-Reinforcement Learning
       4.2.1 Particles 2D
       4.2.2 ML1: Push
       4.2.3 ML10

5 Discussion
   5.1 Further insights
   5.2 Limitations
   5.3 Future Work
   5.4 Ethical concerns
   5.5 Sustainability

6 Conclusions

References

Chapter 1

Introduction

Neural networks have proven to be a highly useful tool in many settings, from detecting various types of cancer in humans [64], [19] to enabling robots to perform various physical tasks [83]. However, developing such models is not an easy process and is often met with considerable limitations. In contrast to how humans acquire new knowledge and skills, neural networks require a great amount of data to train with [30]. In addition, they are particularly sensitive when trying to incorporate new information after they have already been trained on a task, which often leads to the infamous phenomenon of catastrophic forgetting. This occurs when a model is trained on one task, then trained on a second task, and subsequently fails completely on the first task [31]. For these reasons, neural networks are generally unsuitable for problems where large-scale data collection is inaccessible (e.g. X-rays) or where new tasks are introduced after the model has been trained and re-training from scratch is too costly (e.g. detecting a new type of disease after being deployed to detect a previous one).

A research field that has been gaining traction due to promising recent publications that could potentially combat these issues is Meta-Learning. As with many ideas in Machine Learning (ML), the core concept was loosely inspired by a field that studies humans, this time the area of educational psychology. When studying the learning abilities of students, John Biggs describes meta-learning as "one's awareness over their learning process and actively taking control over it" [8]. It is easy to see why applying this idea to ML models is very intriguing. Developing algorithms that are able to self-assess their own performance and improve on it could be profoundly valuable. Such algorithms could alleviate the need for humans to fine-tune models and their hyper-parameters in order to adapt them to a specific task, and they could generalise across multiple, different tasks. These are two problems that are still quite present in the current ML development process.

The end goal of meta-learning is to create models that have the ability to quickly adapt to new tasks they have not seen before, using their past experience of training on similar tasks. For example, an experienced musician who has spent many years learning how to play the guitar, violin and contrabass has a considerably easier time picking up a new instrument like the cello and learning to play it moderately well, compared to someone who has never played an instrument before. One reason for this is that similar tasks share similar dynamics and structure [60] (all these instruments have strings and follow the same rules of physics), but also that a skilful musician knows what learning direction to follow when learning an instrument (e.g. which exercises will help them familiarise themselves with it) based on their experience with the previous instruments, a process that often happens subconsciously. This is the high-level rationale meta-learning tries to leverage when training models.

This thesis focuses on examining one of the most influential state-of-the-art meta-learning algorithms, Model-Agnostic Meta-Learning (MAML) [22], expanding on the latest relevant research and providing insights based on experimental results on a new reinforcement learning benchmark, Meta-World [90].

1.1 Background
Most supervised ML models are presented with large quantities of data for a specific task they should try to solve. During testing, new data samples of the same task are presented to the model, with the expectation that the test data contain feature patterns similar to those in the training data, which the model has already identified, and that it can thus make correct predictions. A common example would be feeding an Artificial Neural Network (ANN) images that contain cats and dogs and training the model to distinguish which image contains which animal. Then, to evaluate its accuracy, different pictures of cats and dogs that were not part of the training set are fed to the model, and in return it tries to answer which is which. However, if a new task were introduced, for example distinguishing between cats, dogs and lions, the network would most likely fail, and re-training it from scratch with many additional images of lions would probably be necessary to account for the new class of animal. Even such a seemingly trivial problem is still challenging for many ML models.

Meta-learning methods try to tackle this issue by becoming efficient learners, with the aim of rapid adaptation to new tasks while requiring only a few training samples. As previously mentioned, conventional ML models try to leverage similarities between training and test data to make predictions during evaluation. Similarly, meta-learning models also try to leverage similarities, but between training and test tasks, in order to generalise to new tasks. Different meta-learning algorithms follow different training procedures to achieve this, but most current algorithms adhere to the same principle. During meta-training¹, there are two learning systems being optimised. Firstly, a lower-level base learner tries to adapt rapidly to new data, meaning it completes the task with only a few samples. Secondly, a higher-level meta-learner is optimised to fine-tune and improve the base learner based on how well the adaptation was performed. During meta-testing, the parameters of the network (weights, biases or even hyper-parameters) are updated in just a few iterations² to values that can solve the new task at hand [38].

One of the most prominent meta-learning algorithms, which has sparked a wave of similar approaches, is MAML [22]. Following the basic principles of meta-learning, MAML aims to train a neural network with the purpose of finding an initialisation of the model parameters that is suitable for fast adaptation to new tasks. It also consists of a two-level learning system: an inner loop, or base learner, for fast adaptation, and an outer loop, or meta-learner, for improving the parameter initialisation³. Specifically, during the inner loop, the network starts from an initial set of parameters and performs a brief "learning session" (a few iterations) over a small set of different tasks. Next, during the outer loop, the network initialisation is updated with respect to the general direction of the adapted inner-loop parameters across all the tasks. The goal of this approach is to explicitly optimise the model's parameters for rapid learning.

¹ The term meta-training is used similarly to training in traditional ML models, but in the context of meta-learning approaches.
² A simple tutorial with visualisation can be found in this link.
³ The terms inner and outer loop refer to the programming practice of nested loops during meta-training.
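As an illustration of this two-loop structure, the update can be sketched in a few lines of Python. This is a minimal, first-order sketch on a toy one-parameter regression family, not the implementation used in this thesis; the task family and names such as `sample_task`, `inner_lr` and `meta_lr` are assumptions made for the example (full MAML additionally back-propagates through the inner-loop update itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # a toy task family: fit y = a * x for a task-specific slope a
    a = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=10)
    return x, a * x

def grad(w, x, y):
    # gradient of the MSE loss mean((w*x - y)^2) w.r.t. the scalar weight w
    return 2.0 * np.mean((w * x - y) * x)

def maml_step(w0, inner_lr=0.1, meta_lr=0.01, n_tasks=8):
    meta_grad = 0.0
    for _ in range(n_tasks):
        x, y = sample_task()
        w_adapted = w0 - inner_lr * grad(w0, x, y)   # inner loop: fast adaptation
        meta_grad += grad(w_adapted, x, y)           # loss gradient after adaptation
    # outer loop: move the initialisation in the average post-adaptation direction
    return w0 - meta_lr * meta_grad / n_tasks

w0 = 0.0
for _ in range(100):
    w0 = maml_step(w0)
```

In the full algorithm, the adaptation data and the post-adaptation evaluation would come from separate support and query sets of each task.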


1.2 Problem

Since the publication of MAML, many studies have tried to examine its inner mechanisms and its effects on the learning procedure of the models [61], [23], [57]. A recent study made an interesting observation by asking the question: "is the effectiveness of MAML due to the meta-initialisation being primed for rapid learning (efficient changes in the network representation) or due to feature reuse, with the meta-initialisation already containing high quality features?" [62]. Their experiments presented evidence for the latter, suggesting that the adaptation phase of the inner loop is of little value, and thus they proposed a variation titled Almost No Inner Loop (ANIL).
According to the notion of meta­learning, such models lead to efficient learners,


suggesting they are able to perform better than conventional methods when exposed to
new tasks. Conventional ML methods are expected to perform poorly when evaluated
on tasks that they were not trained for (e.g train a robot on pushing a button but tested
on opening a door). However, meta­learning methods are expected to be able to learn
the new task rapidly and perform sufficiently.

The objective of the thesis is to develop a framework to assess the performance of MAML on the premise of it becoming an efficient learner, while examining the question of "rapid learning or feature reuse". To evaluate this, experiments were performed in a few-shot image classification setting and a meta-reinforcement learning setting. For the former, the Omniglot and Mini-ImageNet datasets were used; for the latter, the Particles2D and Meta-World environments. Overall, the research contribution of the degree project can be summarised by the following questions:

1. What does MAML actually learn? Is it a high quality feature representation of the training data, which it then utilises to solve the test tasks? Or does it learn to rapidly solve the test tasks by acquiring new skills?

2. Does MAML outperform conventional ML approaches in Meta-World? Does it manage to adapt efficiently to unseen tasks even in a more challenging setting?

3. Does the Almost No Inner Loop (ANIL) approach perform similarly to MAML in more complex environments?

4. Is there a computational cost difference between MAML and ANIL?


The first question is inspired by [62], where evidence was found for the case of feature reuse. We evaluate the same question both in the same settings (Omniglot, Mini-ImageNet & Particles2D) and in a new setting (Meta-World), in order to reproduce the results and add new observations from a more challenging environment. The second question arose from the lack of comparison between meta-learning methods and conventional (non meta-learning) methods in terms of performance in the Meta-World environment, as reported in [90]. By asking this question, we challenge the need for meta-learning and inspect the limitations and capabilities of conventional Reinforcement Learning (RL) methods. In [62], as a result of their findings regarding MAML's feature reuse, the authors propose ANIL as an equivalent alternative, which they also evaluate on Omniglot, Mini-ImageNet & Particles2D, finding performance similar to MAML. However, these tasks are quite limited and not complex enough to test the capabilities of MAML and ANIL. For this reason, we examine ANIL's performance on Meta-World in order to evaluate whether it is actually equivalent to MAML. Lastly, one of the considerable benefits of ANIL over MAML, as reported in [62], is that it is computationally cheaper, offering significant speedups during training and testing. We again evaluate this claim by reproducing the results on the same environments, while also providing new insights about computational costs on Meta-World.

1.3 Goal & Contribution


The purpose of this research-oriented thesis is to provide a nuanced understanding of the MAML method by trying to answer open questions in the meta-learning field. The aim is to expand the discussion regarding a well-established algorithm, but also to act as an introduction to the field for readers unfamiliar with the topic. It is structured as a stand-alone report such that minimal previous knowledge or study of external sources is needed to understand it.

The contributions of this project are both theoretical and practical. An experimental analysis of MAML provides valuable insights into its internal mechanisms, performance capabilities and, equally importantly, its limitations. The novelty of this project is a comparison between MAML and one of its recently published variations, ANIL, on the meta-reinforcement learning benchmark Meta-World. The results of these experiments shift the focus of the evaluation of meta-learning methods to a more challenging framework.

Lastly, the implementations of the algorithms presented in this thesis and the code required to reproduce these experiments are open-sourced. Various open-source programming libraries and frameworks were used for the realisation of this thesis, some of which are still in active development. During the implementation of the experiments, contributions were made to public GitHub repositories in the form of bug fixes, bug reports and additional feature implementations. One of these repositories was the learn2learn PyTorch meta-learning library⁴, which acted as a core part of this thesis' code base. During the development of the thesis, a collaboration with the lead developer of the library led to contributions on GitHub and, finally, a technical paper publication [3].

1.4 Delimitations
One frequent issue when developing methods and experiments with the latest environments and benchmarks is the bugs and performance shortcomings that come along with them. This project made extensive use of open-source libraries and frameworks, most of which are still in active development. The basic components were the meta-learning library learn2learn for the algorithmic implementation and the multi-task & meta-learning environment Meta-World⁵ for the realisation of the experiments. Both of these frameworks went through many iterations, and even complete reworks of their API, from the start of the thesis until its end. They are still active projects maintained by their own developer teams and the open-source meta-learning community, and it is possible that further bug reports will arise and optimisation improvements will take place. This is to say that the reported results cannot be assumed to have been generated by bug-free software. Thus, in order to guarantee replication of the findings of this thesis, the specific versions of the software used need to be installed.

Another relevant delimitation is the high computational cost required by extensive experiments and hyper-parameter searches for these models. Due to limited resources, the hyper-parameter and architecture search of the models was kept to only the most crucial components. Thus, no guarantees can be made about the results to be expected outside of the configurations tested. A discussion regarding this topic is presented in appendix C and as future work in section 5.3.

⁴ learn2learn GitHub repository: [Link]
⁵ Meta-World GitHub Repository: [Link]

The most significant limitation of this degree project relates to the evaluation. In order to draw accurate and concrete conclusions from the experimental results, statistical hypothesis tests are necessary. Due to the computational costs required, however, too few models were trained to carry out such tests. Thus, the results of the thesis showcase trends observed when training and evaluating the algorithms. Making definite conclusions based on statistically sound results would require a more extensive evaluation, the precondition for which is access to considerable computational power.

1.5 Outline
The structure of this project's report is as follows. In Chapter 2, an extensive analysis of the background work is presented to lay the theoretical foundations of the thesis and cover related research. Next, in Chapter 3, the methodology and research approach are introduced, along with the technical details of the evaluation setup. In Chapter 4, the results of the experiments are presented through tables and figures. Chapter 5 discusses the overall outcome of the degree project, including comments on future work, ethical concerns and sustainability. Chapter 6 summarises the results and provides some conclusive remarks. In addition to these main chapters, an Appendix section (A-C) is attached to provide technical details and additional work relevant for reproducibility purposes.

Chapter 2

Theoretical Background

This degree project is based on the combination of three different fields: Deep Learning, Reinforcement Learning and Meta-Learning. Deep Learning is part of the Machine Learning field and is primarily focused on the study of artificial neural networks. Reinforcement Learning and Meta-Learning have both been studied in the contexts of Psychology, Neuroscience and Machine Learning. In the rest of the thesis, mentions of Reinforcement Learning and Meta-Learning refer to their application within the Deep Learning field, unless otherwise specified¹.

The first three sections (2.1-2.3) of this chapter provide a gentle introduction to the relevant fields without delving into too much detail on the thesis' specifics. Section 2.4 then focuses solely on providing a thorough theoretical understanding of all the parts of the project and the latest relevant research.

2.1 Deep Learning


Even though ANNs have been around since the 1960s, the term Deep Learning has only recently come into use to describe related research. This is mainly due to the success of deeper (multi-layer) neural networks, enabled by the increasing availability of computational power. Deep networks have achieved state-of-the-art results in various applications due to their exceptional ability to express and approximate incredibly complex functions and probability distributions [49].

¹ Meaning that the terms Deep RL and RL may be used interchangeably.


Figure 2.1.1: Artificial Neural Networks. (a) A McCulloch and Pitts neuron. (b) A Multi-Layer Perceptron. Figures from [51].

Although a thorough description of neural networks is out of the scope of this thesis report, it is worthwhile to focus on some vital components in order to get a better grasp of the basis of meta-learning and what this project tries to answer.

2.1.1 Artificial Neural Networks

Loosely inspired by the mammalian brain, the basic concept of an ANN follows a similar structure [30]. A simple artificial neuron receives some input data x and produces an output y based on its weights w and an activation function f. These types of artificial neurons are also known as McCulloch and Pitts neurons (figure 2.1.1a). A set of these neurons makes up a Perceptron layer, and a set of Perceptron layers makes up one of the most standard ANN architectures, the Multi-Layer Perceptron (MLP). The layers are stacked (from left to right, as seen in figure 2.1.1b) and the output of each layer becomes the input of the next.

The process of getting from the input data x to the output values y through the whole network is called forward propagation (or the forward pass) and consists of the following steps:

1. Feed the data x to the first input layer.

   (a) Multiply the data x with the weights w of each neuron.

   (b) (Optionally) Add a bias term b.

   (c) Sum the results: h = ∑ᵢ xᵢwᵢ + b

   (d) Pass h through the activation function: y = f(h)


Figure 2.1.2: Left: Convolutional Neural Network architecture, from [27]. Right: Recurrent Neural Network architecture, from [30].

2. The results y of the first layer now become the input of the next layer and the same steps (a)-(d) are followed for x ← y.

3. The output of the network is the output of the final layer.
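The steps above can be sketched directly in numpy (a minimal illustration with assumed layer sizes and a tanh activation; a real implementation would normally rely on a framework such as PyTorch):

```python
import numpy as np

def forward_pass(x, layers, f=np.tanh):
    # layers: list of (weights, bias) pairs; steps 1(a)-1(d) per layer
    y = x
    for w, b in layers:
        h = y @ w + b   # multiply by the weights and add the bias term
        y = f(h)        # pass the sum through the activation function
    return y            # the output of the final layer

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 4)), np.zeros(4)),   # 3 inputs -> 4 hidden
          (rng.standard_normal((4, 2)), np.zeros(2))]   # 4 hidden -> 2 outputs
out = forward_pass(rng.standard_normal(3), layers)
```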

The adjustable variables (weights & biases) of the network are also called the parameters θ of the network. Multiple variations of these networks have been developed to better tackle different problems. For example, Convolutional Neural Networks (CNN) are best suited when the input data consist of images or video (figure 2.1.2a), whereas Recurrent Neural Networks (RNN) are best suited when the data are sequentially dependent, such as text or time series (figure 2.1.2b).

2.1.2 Gradient­Based Learning & Back­propagation

In order to get meaningful results from these networks, they need to be trained,
meaning their parameters θ need to be updated based on a cost function. This can
be achieved by the backward pass, also known as the back­propagation algorithm
[68], to compute the gradients using a gradient­based optimizer. Training an MLP is
an iterative process of many (usually thousands or millions) consecutive forward and
backward passes.

Cost function (or loss function): The cost function describes the error between the output values (predictions) of the network and the objective values (target) the network tries to optimise for. Depending on the problem, the cost function can take many forms. For classification problems, this usually means minimising the cross-entropy loss (equation 2.1), based on the principle of maximum likelihood [30]. For regression, and sometimes RL, it usually means minimising the mean squared error, the average squared difference between the predictions and the target values (equation 2.2).

CE = −∑ᵢ yᵢ log ŷᵢ (2.1)

MSE = (1/n) ∑ᵢ (yᵢ − ŷᵢ)² (2.2)
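The two cost functions can be written directly from equations 2.1 and 2.2 (a small numpy sketch; the `eps` guard against log(0) is an addition for numerical safety, not part of the equations):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # equation 2.1: CE = -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat + eps))

def mse(y, y_hat):
    # equation 2.2: average squared difference over the n values
    return np.mean((y - y_hat) ** 2)
```

For a one-hot target [0, 1, 0] and prediction [0.1, 0.8, 0.1], the cross-entropy reduces to -log(0.8).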

Backward pass: The back-propagation algorithm, backprop for short, is a simple and efficient way to calculate the gradients of a neural network, in most cases the gradients ∇θ J(θ) of a cost function J with respect to the network parameters θ. The algorithm can be summarised in the following steps (Algorithm 6.4 in [30]):

Given ŷ the output of the network and y the target values:

1. Calculate the gradient at the output layer:

   g ← ∇ŷ J (2.3)

2. From the last layer to the first:

   (a) Calculate the gradient of the layer with respect to its output before the activation function (element-wise multiplication ⊙, since f is applied element-wise):

   g ← ∇h J = g ⊙ f′(h) (2.4)

   (b) Compute the gradients of the weights and biases (Ω denotes an optional regularisation term with coefficient λ, and y here is the input to the layer):

   ∇b J = g + λ∇b Ω(θ) (2.5)

   ∇w J = g yᵀ + λ∇w Ω(θ) (2.6)

   (c) Propagate the gradient to the previous layer:

   g ← ∇y J = Wᵀ g (2.7)
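The listed steps can be implemented by hand for a small MLP (a numpy sketch assuming the MSE cost, a tanh activation and no regularisation term, i.e. λ = 0; not production code):

```python
import numpy as np

def backward_pass(x, target, layers, f=np.tanh,
                  f_prime=lambda h: 1.0 - np.tanh(h) ** 2):
    # forward pass, caching each layer's input and pre-activation h
    inputs, pre_acts, y = [], [], x
    for w, b in layers:
        inputs.append(y)
        h = y @ w + b
        pre_acts.append(h)
        y = f(h)
    # step 1 / eq. (2.3): gradient of the MSE cost at the output
    g = 2.0 * (y - target) / y.size
    grads = []
    # step 2: from the last layer to the first
    for (w, b), h, a in zip(reversed(layers), reversed(pre_acts),
                            reversed(inputs)):
        g = g * f_prime(h)                 # eq. (2.4): through the activation
        grads.append((np.outer(a, g), g))  # eqs. (2.5)-(2.6), with lambda = 0
        g = g @ w.T                        # eq. (2.7): to the previous layer
    return list(reversed(grads))           # (dW, db) per layer, first to last

rng = np.random.default_rng(1)
layers = [(rng.standard_normal((3, 4)), np.zeros(4)),
          (rng.standard_normal((4, 2)), np.zeros(2))]
grads = backward_pass(rng.standard_normal(3), rng.standard_normal(2), layers)
```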


Optimizers: The optimizers are the algorithms that actually update the network parameters, using the gradients computed by backprop. One of the most commonly used optimizers is Adam [44], due to its adaptive learning rate mechanism. Other common optimizers are Stochastic Gradient Descent [67] and AdaDelta [91].
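To make the distinction concrete, the plain SGD update and Adam's moment-based update can be written out for a single scalar parameter (a simplified sketch of the published update rules; in practice one would use the optimizers provided by a framework such as PyTorch):

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    # plain gradient descent: step against the gradient at a fixed rate
    return theta - lr * grad

def adam_update(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps running first/second moments of the gradient, giving
    # each parameter its own adaptive effective learning rate [44]
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# minimise J(theta) = theta^2 (gradient 2*theta) with Adam
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_update(theta, 2.0 * theta, m, v, t)
```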

2.2 Meta Learning

The idea of Meta Learning was first conceived in the field of psychology [8] and was then adopted in other fields such as neuroscience [14]. In the setting of educational psychology, meta-learning describes "a person [who is] aware and capable to assess their own learning approach and adjust it according to the requirements of the specific task". Kenji Doya defines meta-learning in the context of neuroscience as "a brain mechanism with the capability of dynamically adjusting its own hyper-parameters (e.g. the dopaminergic system) through neuromodulators (e.g. dopamine, serotonin, etc.)" [14]. Even though the idea of Meta-Learning has been around since the 80s [70], it has only recently seen rapid growth in the field of Deep Learning, with exciting new research and progress in different directions every year. As such, definitions of what exactly constitutes meta-learning can be inconsistent depending on the setting.

2.2.1 Defining Meta­Learning

Generally, meta-learning aims to improve the model’s generalisation capabilities after
deployment. Undoubtedly, there are various other approaches that follow the same
idea but do not fall under the meta-learning term, leading to confusion as to which
methods can be classified as ”meta-learning”.

One such example is automatic hyper-parameter tuning based on model evaluation
and cross-validation. This could be considered a naive form of meta-learning;
however, such approaches are usually considered part of automated machine learning
(AutoML) [41]. Similar confusion can arise with the fields of ensemble, transfer,
continual and multi-task learning. However, contemporary meta-learning is focused
on explicitly defining a meta-objective (e.g. few-shot classification of images) and then
optimising the base model end-to-end for that specific goal. In order to provide
better insight into the meta-learning setting and distinguish it from similar approaches and

conventional ML, the following formulation is presented (adapted from [38]).

Meta­knowledge

In a common meta-learning classification example, given a training dataset D =
{(x1, y1), ..., (xN, yN)}, the goal is to train a model ŷ = fθ(x) with parameters θ by solving
2.8, which minimises the cost function Ltask for a fixed ω.

θ∗ = arg min_θ Ltask(θ; ω; D)    (2.8)

The condition ω is defined as the factors leading to the solution θ; it encompasses the
”how to learn” knowledge and is also known as meta-knowledge. This could be the initial
parameter values θ, the choice of the optimiser, hyper-parameter values, the function
class for f, etc. In this setting, for every dataset D that is given, the optimisation starts
from scratch with a hand-tuned, pre-specified ω. Hence, meta-learning can be defined
as learning the condition ω, leading to improved performance and the ability to
generalise across tasks.

Usually, in the meta-learning literature, the algorithm is trained over a distribution of
tasks p(T ) in order to learn ω. A task can be a subset of a dataset or even a whole
dataset [84], paired with a cost function: T = {D, L}. Thus, the meta-learning objective
can be written as:

min_ω E_{T∼p(T )} Lmeta(θ∗; ω; D)    (2.9)

where Lmeta(θ∗; ω; D) represents the model performance on a dataset D with fixed
parameters θ∗, optimising for ω.

Meta­training

The process of training a model this way over a set of M tasks is called meta-training, in
which the training tasks are formalised as Dtrain = {(D^support_train, D^query_train)i}, i = 1, ..., M.
Each train task has its own support and query subset, used as a ”mini-train” and a
”mini-validation” set. Optimising for ω, or learning to learn, is then written:

ω∗ = arg max_ω log p(ω | Dtrain)    (2.10)


That is, given a set of M datasets in Dtrain, the goal is to pick the ω∗ that maximises
the log probability of the meta-knowledge across all M datasets.

Two­level optimisation system

A typical way of meta-training a model, and the way this thesis adopts in chapter 3, is
through a two-level optimisation consisting of an inner and an outer loop. This approach
follows the idea of hierarchical optimisation, where the model is optimising for one goal
while being constrained by a second optimisation goal [28]. The inner loop optimises
the base model, or learner, for θi∗ of task i while being conditioned on ω (2.11). The
inner optimisation is based on the support set of the dataset. Note that the notation
θi∗(ω) is used to represent that each θi∗ shares the same ω. The outer loop optimises
the meta-learner for the condition ω, with the objective of producing a model θi∗ which
performs optimally on the query set (2.12).

θi∗(ω) = arg min_θ Li(θ, ω, (D^support_train)i)    (2.11)

ω∗ = arg min_ω Σ_{i=1}^{M} Lmeta(θi∗(ω), ω, (D^query_train)i)    (2.12)

where Li is the cost function of Ti for a fixed ω and Lmeta is the meta­learning objective
for a fixed θi∗ (ω).

Meta­testing

Evaluating a meta-learning algorithm’s performance means evaluating the quality
of the learned meta-knowledge ω. Instead of the conventional approach of directly
testing on a Dtest set, in meta-testing the test set consists of Q tasks and is split into
Dtest = {(D^support_test, D^query_test)j}, j = 1, ..., Q. Each unseen task j is split into a
support and a query set, in which the former is used to leverage the meta-learned ω and
the latter to finally evaluate the accuracy of the adapted model. The size of the support
set for each test task is usually small, since the objective is rapid adaptation with few
samples, not re-training. A common meta-learning evaluation setting is that of ”K-shot
learning” for classification, meaning that each new task comes with K samples (”shots”).
The meta-learnt model then needs to adapt to the task using only these K samples in
order to correctly predict the labels of the query set. Thus, the meta-testing process of


Figure 2.2.1: An overview of the meta­learning field divided by algorithm design and
applications. Figure from [38].

adapting to (D^support_test)j can be summarised as:

θj∗ = arg max_θ log p(θ | ω∗, (D^support_test)j)    (2.13)

where θj∗ is used to evaluate the model performance on (D^query_test)j.
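As an illustration of the meta-testing split described above, a hypothetical N-way K-shot episode sampler (class names and sizes are invented for the example):

```python
import random

def make_k_shot_task(examples_by_class, n_way, k_shot, n_query, rng):
    """Sample an N-way K-shot episode: K support and n_query query
    examples per class, with no overlap between the two sets."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        pool = rng.sample(examples_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in pool[:k_shot]]
        query += [(x, label) for x in pool[k_shot:]]
    return support, query

rng = random.Random(0)
# Made-up dataset: 4 classes with 20 examples each
data = {c: [f"{c}_{i}" for i in range(20)] for c in ["cat", "dog", "car", "bus"]}
support, query = make_k_shot_task(data, n_way=2, k_shot=5, n_query=3, rng=rng)
print(len(support), len(query))  # 10 6
```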

2.2.2 Types of Meta­Learning

To provide a general picture of the meta-learning landscape, this section briefly
outlines ways of understanding the current category trends being developed.
Meta-learning algorithms can be divided into categories based on what parameters
they optimise for, what type of networks they are based on, or, more generally, on the
steps of the procedure used to achieve their goal.

A common way to distinguish between meta-learning algorithms is by
categorising them into one of the following approaches: metric-based, model-based and
optimisation-based [53], [47]. However, Hospedales et al. propose a more intuitive
taxonomy based on the formulation in 2.2.1 and, more generally, on the questions
”what”, ”how” and ”why” [38]. It is interesting to note that a meta-learning algorithm
can fall into more than one of the categories that follow.

Meta-Representations (”What?”): The first and maybe the most apparent way
to categorise meta-learning approaches is by what the meta-knowledge ω entails.
Specifically, which parts of the learning process are to be learned and then deployed


and leveraged for generalising during meta-testing. This class of meta-learning
algorithms is vast (figure 2.2.1), contains some of the most popular approaches and
keeps expanding every year. MAML is part of the parameter-initialisation methods,
which try to find a good initial parameter space for the model in order to adapt rapidly,
in a few gradient steps, to the task at hand (more relevant literature in section 2.4).
Another example would be to explicitly define and learn the gradient-based optimizer,
for example by learning its hyper-parameters (e.g. the step size [48]). Often, such
approaches are ”model agnostic”, meaning that the actual base model parameters θ
can vary and are dependent on the application. For example, MAML and similar
approaches have been used with MLPs, CNNs, RNNs etc.

Meta-Optimiser (”How?”): The next category of methods can be defined by the
optimisation strategy used to converge to ω∗ in the outer loop. The most
common and standard approach is a gradient-based method that explicitly
calculates the gradients with respect to the meta-parameters ω (MAML again falls into
this category). Even though the formulation for calculating such derivatives using the
chain rule might be straightforward,

dLmeta/dω = (dLmeta/dθ) · (dθ/dω),

it comes with some caveats. Such an optimisation system of two loops can quickly
spiral into a complex computational graph in which tractable higher-order derivatives
are needed, and vanishing or exploding gradients can quickly become a problem when
the inner loop consists of many iterations [2]. Other interesting approaches go through
RL, by optimising the meta-objective Lmeta with a policy gradient algorithm [39], or
use evolutionary algorithms to optimise the objective [7].
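The meta-gradient chain rule above can be checked on a scalar example. With hypothetical quadratic losses (not from the thesis) and one inner step θ = ω − α L_task′(ω), the meta-gradient is L_meta′(θ) · (1 − α L_task″(ω)):

```python
# Hypothetical scalar losses: L_task(t) = (t - a)^2, L_meta(t) = (t - b)^2.
alpha, a, b, omega = 0.1, 1.0, 3.0, 0.0

l_task_grad = lambda t: 2.0 * (t - a)
l_task_hess = lambda t: 2.0
l_meta = lambda t: (t - b) ** 2
l_meta_grad = lambda t: 2.0 * (t - b)

theta = omega - alpha * l_task_grad(omega)               # one inner-loop step
# Chain rule: dL_meta/d_omega = L_meta'(theta) * d_theta/d_omega
analytic = l_meta_grad(theta) * (1.0 - alpha * l_task_hess(omega))

outer = lambda w: l_meta(w - alpha * l_task_grad(w))     # L_meta as a function of omega
eps = 1e-6
numeric = (outer(omega + eps) - outer(omega - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True
```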

Meta-Objective (”Why?”): Lastly, meta-learning methods can be labelled by how
they define their meta-objective Lmeta and by the way the inner and outer loop interact.
As previously mentioned in 2.2.1, a common way to define the meta-objective is through
an evaluation metric on the validation set D^query_train, which measures the effectiveness
of the meta-knowledge ω. In this view, the few-shot learning setting is one of the
most popular ones and the one most often used as a benchmark for meta-learning
algorithms [48], [54]. However, there are cases that do not have such a restriction,
in which a task can have many samples [1]. Another distinction is whether the
outer-loop optimisation of the base learner happens after the inner loop or whether the
meta-optimisation is performed online, with θ and ω updated in parallel within an episode
[5].


2.3 Reinforcement Learning

The field of Reinforcement Learning lies between supervised learning, where the
model is trained on labelled data, and unsupervised learning, where the model aims
to find and leverage similarities and patterns in the data. Instead, RL algorithms
are driven by interacting with an environment and receiving rewards on whether
the agent’s² actions managed to perform well enough or complete a goal. The agent then
picks the actions that maximise the cumulative rewards, leading to an optimal strategy
through a long series of trials and errors, often within a given time constraint. Based
on this simple setting, many methods have been developed in recent years with
remarkable results, especially in the field of games. One of the first papers to re-ignite
the interest of the research community was by a team of researchers from DeepMind,
who managed to develop a simple agent algorithm that could surpass human-level
performance in many of the classic Atari 2600 games [55]. Since then, research on the
field has grown rapidly, with multiple advancements in similar game-like environments
such as chess and go [76], [77]. Progress in real-world applications has also started to
expand [58], [59], but at a slower rate due to a series of challenges, which mainly involve
the immense number of interaction samples with the environment needed to achieve
satisfactory results [17].

Figure 2.3.1: Overview of the typical RL setting of an agent interacting with the
environment. Figure from [80].

In order to understand the choices of algorithms in chapter 3 and get a better
understanding of the experiments in general, it is worthwhile to briefly go over some
basic RL theory.

² Agent is the term for the policy / model in the RL setting.


2.3.1 Model based (or World Model) vs Model free

The most standard way to formulate an RL setting is as a Markov Decision Process
(MDP) of four (or five) components:

1. S: the state of the environment

2. A: the action of the agent

3. R(s, a): the (unknown) reward function, which indicates how good or bad action
a was for state s

4. S′: the next state of the environment after action a has been taken

5. T (s′|s, a): the transition probability function / matrix, which indicates the
probability of the environment transitioning to state s′ when the agent takes
action a in state s.

Additionally, environments can either have finite horizons (the agent has H
time steps to solve the task) or infinite horizons (in which case, to motivate
the agent to solve the task, future rewards are discounted by a factor γ).

The last component, T (s′|s, a), is not always mentioned, because in most deep-RL
problems the transition function T is not known. When T is known, the
solution can be computed directly, without the need for the agent to interact with the
environment, through planning algorithms such as policy iteration, value iteration etc.
[79]. However, usually the RL objective is defined as finding an optimal policy π which
maximises the expected future discounted reward without knowledge of the T or
R functions. There are two approaches for dealing with a problem where the
transition function is unknown.
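When T and R are known, planning alone recovers the optimal behaviour. A minimal value-iteration sketch on a made-up two-state MDP (all numbers illustrative, not from the thesis):

```python
GAMMA = 0.9
# T[s][a] = list of (probability, next_state); R[s][a] = immediate reward
T = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # iterate the Bellman optimality backup to (near) its fixed point
    V = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]) for a in T[s])
         for s in T}

# Greedy policy with respect to the converged values
policy = {s: max(T[s], key=lambda a, s=s: R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]))
          for s in T}
print(policy)  # {0: 1, 1: 0}
```

No environment interaction is needed: the solution is read off directly from the model, which is exactly what makes planning with a known T so sample efficient.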

The model-based way is to try to model the environment by approximating this
function through a long series of environment interactions (thus building a ”World
Model”). This can be done either by first learning the model and then using a
highly sample-efficient planning algorithm³ to solve the problem, or by simultaneously
learning the model while also approximating a policy [13]. This approach has seen
exceptional success in games, where the dynamics of the environment might not be as
complex as in the real world and where an almost perfect simulator (World Model)
can be estimated by paying a high computational cost [71]. However, such methods
suffer from severe performance loss when transferring the learned policy to the real
world, due to biases embedded in the model [42].

³ Such algorithms can be trained offline, meaning they do not need to sample new interactions
with the environment during training, and are thus more cost efficient, while also finding and
evaluating a solution before actually executing it.

Model-free approaches, on the other hand, bypass modelling the environment
altogether and directly estimate an optimal policy. Free of the constraint of learning
a model, they are typically easier to implement and tune, albeit they tend to be highly
sample inefficient. Numerous methods follow this idea, and they
are usually the first choice in most RL settings. The most prevalent families of model-
free methods are Q-learning and Policy Optimisation. Q-learning algorithms aim to
learn an approximator Qθ(s, a) (θ are usually the neural network’s parameters) for the
optimal action-value function Q∗(s, a) via the update 2.14; the actions of the policy are
then calculated from 2.15.

Q(s_t, a_t) ← Q(s_t, a_t) + α[R_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (2.14)

a(s) = arg max_a Qθ(s, a)    (2.15)
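A minimal tabular sketch of update 2.14 and greedy rule 2.15 on a made-up five-state corridor (illustrative, not one of the thesis's environments):

```python
import random

N, GOAL, ALPHA, GAMMA, EPS = 5, 4, 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N)]       # Q[s][a]; a = 0 is left, a = 1 is right
rng = random.Random(0)

for _ in range(1000):
    s = rng.randrange(GOAL)              # each episode starts in a random non-goal state
    while s != GOAL:
        greedy = 0 if Q[s][0] > Q[s][1] else 1
        a = rng.randrange(2) if rng.random() < EPS else greedy   # epsilon-greedy exploration
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == GOAL else 0.0   # reward only at the right end of the corridor
        # Equation 2.14: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# Equation 2.15: the learned policy picks a(s) = argmax_a Q(s, a)
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1] — always move towards the goal
```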

Policy optimisation methods, however, try to explicitly learn a policy πθ(a|s) by
approximating an on-policy value function Vπθ(s). This family of methods includes
many variations on the details of how this is achieved, with the most notable ones
being Vanilla Policy Gradient (VPG) [81], Proximal Policy Optimisation (PPO) [74]
and Trust-Region Policy Optimisation (TRPO) [72]. It is worth noting that some
algorithms, such as Soft Actor-Critic (SAC) [34] or Twin Delayed Deep Deterministic
Policy Gradient (TD3) [29], try to combine the best of both Q-learning and Policy
Optimisation, with promising results.

2.3.2 Off­policy vs On­policy

Furthermore, RL algorithms can be split into two categories depending on how they
update their policy from the samples of the environment. Off-policy methods reuse
previous samples, collected through the environment-agent interactions during training,
regardless of the exploration policy⁴ of the agent. This gives them the advantage of
being more sample efficient. Almost all Q-learning methods are trained off-policy. In
contrast, on-policy methods do not reuse previous data and only depend on sampling
online from the learnt policy. Even though this approach might be less sample efficient,
it is notably more stable, since it keeps optimising for the objective based on the latest
policy. Usually, Policy Optimisation methods are trained on-policy. In this thesis,
all of the experiments were performed with model-free, on-policy algorithms, namely
PPO and TRPO (as seen in chapter 3).

⁴ The choice of whether the agent will choose an action randomly or based on the currently learned
policy.

Figure 2.3.2: Overview of a standard meta-RL setting. Figure from [9].
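The off-policy reuse of past samples described above is typically implemented with a replay buffer; a minimal sketch (illustrative, not from the thesis's code):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly re-sample stored transitions, regardless of which policy produced them
        return self.rng.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(250):                # the oldest 150 transitions fall out of the buffer
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(4)
print(len(buf.buffer), len(batch))  # 100 4
```

An on-policy method, by contrast, would discard the buffer after every policy update and collect fresh samples from the latest policy.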

2.3.3 Meta­Reinforcement Learning

A recurring issue in RL is developing agents that show robust
generalisation capabilities. An obvious direction, then, is the application of meta-
learning methods to RL, which leads to the field of Meta-Reinforcement Learning. The
setting is similar to the standard RL setting, with the difference that instead of trying
to learn just one task / environment, the agent is trained on a distribution of tasks /
environments. During the inner-loop adaptation, a set of various tasks is presented
and, for each one, a new policy is approximated that might be slightly different from
the next. Then, in the outer loop, a general strategy is optimised with the aim of
learning the dynamics of transitioning between states, rewards and actions, in order to
quickly learn new tasks [88].

Generally, even though the tasks might be different, it is important that they share
similar internal dynamics and come from the same task distribution [85]. This leads
to a strong connection between task similarity, or how broad the task distribution
is, and the performance of the meta-RL agent. For example, training a robot hand
to grasp different types of objects can be considered a reasonable meta-RL problem,
since all the tasks share the dependency of learning the dynamics of the joints and
the movement of each individual finger. However, when the distribution becomes
too broad, such as including both teaching a robot how to walk and solving a Rubik’s
cube with its hand, it is expected that finding a shared meta-RL strategy becomes
considerably harder.

The number of different methods developed is extensive, along all three axes of meta-
learning (mentioned in 2.2.2) and RL’s algorithmic landscape. In many cases, MAML
has been used as the base meta-learning framework, with different methods adapting
and modifying it in new ways, such as incorporating recurrent methods and model-
based RL [56] or extending it into a probabilistic framework [75]. A large portion of
this field has focused on developing methods that make use of RNNs, or on trying to
give the agent some sort of memory in order to incorporate the learned knowledge
[87], [16], [54]. Furthermore, when learning a policy, latent representations or task
descriptors have been used efficiently to distinguish between the learnt skills and tasks
[35], [20], [85].

Occasionally, meta-RL is confused with multi-task RL, since they share many core
characteristics, such as training on a distribution of tasks. However, their objectives
are different: multi-task RL tries to optimise a single policy that solves the presented
tasks more efficiently than learning them individually, whereas meta-RL aims to learn
the dynamics of the training tasks in order to adapt fast to new tasks [90].

2.4 Related Work

2.4.1 Model­Agnostic Meta­Learning

Finn et al. proposed Model-Agnostic Meta-Learning (MAML) as an intuitive and
straightforward method to learn the initial parameters of a neural network such that,
in a few updates, the model is able to generalise and adapt to new tasks (figure
2.4.1). The concept is based on the idea that a network’s internal representation θ of a
distribution of tasks p(T ) will share features across different tasks and can be quickly
adjusted for new tasks from the same distribution.

Following the notation introduced in 2.2.1, the initial model is represented as a
function fθ with parameters θ, with the base objective (inner loop) of adapting to a task
Ti = {Di, Li} from the task distribution p(T ). In this case, the meta-learnt knowledge
ω is the initial values of the parameters θ (the representation of the network). Thus,
the parameters start from the point ω and evolve to θi∗ for each Ti during the inner
loop. For simplicity, since ω and θ refer to the same component of the model, in the
following equations the initial point of the parameters will simply be noted as θ (figure
2.4.1).

Figure 2.4.1: High-level concept of MAML’s initial representation θ adapting to three
different tasks. Figure from [22].

Adaptation is performed with gradient descent starting from the parameters θ, while not
modifying them directly (this set is kept for the meta-optimisation update) but keeping
a separate set θi∗ for each task Ti, which can be updated either once or multiple times⁵.
This can be perceived as one meta-learner θ evolving into multiple learners θi, each
specialised for a different task:

θi∗ ← θi∗ − α ∇θ Li(θ; (D^support_train)i)    (2.16)

where α stands for the inner-loop (adaptation) learning rate.

⁵ MAML supports multiple consecutive gradient-descent updates on each learner on each task.

The meta-objective of this method is to ”optimise for the performance of fθi∗ with
respect to θ across tasks sampled from p(T )” [22], as seen in 2.17:

min_θ Σ_{Ti∼p(T )} Li(θi∗; (D^query_train)i)    (2.17)

The meta-optimisation (outer loop) updates the meta-learner’s parameter initialisation θ with


gradient descent based on all the learners’ parameters θi∗ (now fixed):

θ ← θ − β ∇θ Σ_{Ti∼p(T )} Li(θi∗; (D^query_train)i)    (2.18)

where β stands for the outer-loop (meta) learning rate.
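Equations 2.16-2.18 can be illustrated numerically on toy 1-D quadratic ”tasks” Li(θ) = (θ − ai)², whose gradients are available in closed form (a made-up example, not the thesis's setup):

```python
import random

ALPHA, BETA = 0.1, 0.05                        # inner (alpha) and outer (beta) rates
rng = random.Random(0)
sample_task = lambda: rng.uniform(-1.0, 1.0)   # a task is identified by its optimum a_i
grad = lambda theta, a: 2.0 * (theta - a)      # dL_i/dtheta

theta = 5.0                                    # meta-parameters, deliberately far off
for _ in range(2000):
    tasks = [sample_task() for _ in range(4)]
    # Inner loop (eq 2.16): one adaptation step per task, starting from theta
    adapted = [theta - ALPHA * grad(theta, a) for a in tasks]
    # Outer loop (eq 2.18): gradient of L_i at the adapted point w.r.t. the initial
    # theta; for one inner step on a quadratic it is grad(theta_i*, a_i) * (1 - 2*ALPHA)
    meta_grad = sum(grad(t_i, a) * (1.0 - 2.0 * ALPHA) for t_i, a in zip(adapted, tasks))
    theta -= BETA * meta_grad

a_new = sample_task()                          # unseen meta-test task
before = (theta - a_new) ** 2                  # loss at the meta-learned initialisation
after = (theta - ALPHA * grad(theta, a_new) - a_new) ** 2  # loss after one adaptation step
print(abs(theta) < 0.5, after < before)  # True True: theta sits near the task mean
```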

2.4.2 Variations

A long list of MAML variations has been proposed since its publication, focusing on
different components of the training process.

Regarding the meta-optimiser: During the outer-loop update, the meta-
optimisation requires computing a gradient through a gradient (a ”meta-gradient”)
to update the final θ∗ parameters. One way of calculating these second-
order derivatives is directly, through a Hessian-vector product using automatic
differentiation. However, this method is usually highly computationally expensive.
The ”cheaper” alternative is a first-order approximation of these derivatives,
which has been shown not to cause any significant decrease in MAML’s performance
when ReLU activation functions are used [22]. This approach has been further studied
and improved upon, leading to a MAML variation named Reptile [57].

Regarding the meta-knowledge (parameter space): Fast Context Adaptation
via Meta-Learning (CAVIA) follows an intuitive idea: across different tasks, some
parameters will be similar and can possibly be shared, whereas others might be more
fitting to the context of each task. During the adaptation phase, only the context
parameters are updated for the task, while the meta-learned shared parameters are
still used [93].

Regarding general improvements: One of the recurring issues of MAML
is its high computational cost and its sensitivity to hand-picked hyper-parameter
values, leading to considerable learning instabilities. Antoniou et al. try to tackle
these problems with their variation, MAML++. This includes learning the inner-loop
hyper-parameters, using cosine annealing for the outer learning rate, stabilising the
gradients’ volatility through multi-step loss optimisation and more, to improve the
overall optimisation process [2].

Regarding applications in other fields: MAML has seen adaptations to fields such
as imitation learning [26], online learning [24], latent embedding optimisation [69],
and probabilistic [25] and Bayesian frameworks [32]. In addition to its leading success in
few-shot classification⁶ and regression, MAML has been applied in different ways in
various RL settings [56], [33], [89], [78].

2.4.3 Insights on MAML

Finn et al. suggested that MAML’s final meta-learnt initialisation parameters are
primed for rapid learning, meaning that changes in the representation will lead
to significant improvements when adapting to a task drawn from the same task
distribution. They also presented MAML as a way to ”maximise the sensitivity of
the loss functions of new tasks with respect to the parameters”. This perspective
suggests that the knowledge gained by the model concerns efficient gradient-
based learning, not necessarily the dataset’s features themselves. This
view is further supported by additional experiments with MAML examining
its resilience to over-fitting and comparing it with conventional Deep Learning (DL)
methods [23].

2.4.4 Almost No Inner Loop Variation

A recent publication by Raghu et al. argues that the view of MAML presented
above (2.4.3) is not accurate and that the model does not actually learn to learn.
Instead, they argue, it already incorporates high-quality features of the task distribution,
and this is what leads to the successful adaptation to new test tasks [62]. They
provided evidence for this hypothesis by running experiments where they froze the
network body representation (hidden layers) and only let the head (final output layer)
adjust during the inner-loop adaptation phase. Additional experiments indicated
that during meta-testing the network body barely needs to be updated for the same
performance to be maintained, and that it is even possible to completely remove the
network head, relying only on the learnt representations. This simplified and
computationally cheaper MAML variation, called ANIL, contributed insightful
observations on the inner mechanisms of MAML.

⁶ Top-10 performance in few-shot classification benchmarks on [Link] as of this date,
despite being one of the oldest methods.

In the ANIL paper, experiments were performed for few-shot image classification
and meta-RL. For the former setting, the two datasets Omniglot and MiniImageNet
were used. For the meta­RL setting, the MuJoCo environments of HalfCheetah­
Direction, HalfCheetah­Velocity and 2D­Navigation were used. Even though these
frameworks are part of the most popular benchmarks for meta-learning methods,
they might not be sufficient to prove the hypothesis stated previously. As one of the
ANIL paper’s reviewers criticises⁷, the train and test tasks are from the same, relatively
narrow dataset, where feature reuse might be enough to provide good performance.
Specifically for image classification, both Omniglot and MiniImageNet are rather
trivial tasks compared to the recently proposed Meta-Dataset benchmark, which was
explicitly developed for the evaluation of meta-learning methods [84]. Moreover, the
2D-Navigation environment was introduced in [22] as a toy baseline RL problem, and the
HalfCheetah benchmark is known to be easily solvable with simple linear policies or
random search [50].

This leads to the question of whether ANIL would be able to perform similarly to
MAML on more challenging environments with a broader task distribution. Currently,
some of the best candidates for examining its robustness would be
Meta-Dataset and Meta-World, for image classification and reinforcement learning
respectively. Due to Meta-Dataset’s immense size and the incredible computational
power required to train on it, it was deemed out of the scope of this thesis and is
left for future work. Therefore, this thesis is mainly focused on experiments on the
meta-RL benchmark titled Meta-World, as examined in chapter 3.

⁷ Comment on OpenReview by an anonymous expert reviewer on ANIL: [Link]forum?id=rkgMkCEtPB&noteId=H1xctUU2oB

Chapter 3

Methodology

3.1 Experimental setup


In order to carry out experiments and comparisons between MAML and ANIL
(following the questions outlined in section 1.2), robust implementations of the two
algorithms were required. These were developed with the help of the meta-learning
library learn2learn [46], using the PyTorch DL framework. Since these meta-learning
algorithms are model-agnostic, they are used in combination with other neural-
network-based methods, depending on the problem setting. In this project, a standard
CNN approach was used for image classification and a fully connected MLP was used
for reinforcement learning (as described in more detail in 3.2.2, 3.2.3 and 3.3.1).

The main focus of the thesis is on the RL experiments but, because of the high
complexity and computational cost of implementing and running meta-RL
methods, the baselines were first set on image classification tasks. Since the methods
are ”model-agnostic”, the basic core structure of the code is shared across the different
application fields. After establishing reliable implementations of the algorithms, by
reproducing the results of their original papers on the baseline tasks, they were then
evaluated on RL.

The evaluation of the algorithms was performed in three parts. First, the actual
meta-training progress is monitored by logging training and validation metrics
during learning. These metrics can provide insights into the stability and convergence
rate of each algorithm, showcasing their robustness or signs of over-fitting or under-
fitting. Secondly, their final performance is evaluated during meta-testing, where the
meta-trained policies are examined on a series of unseen tasks. The resulting metrics
measure the success of the method on the problem at hand (accuracy for classification,
accumulated reward & success rate for RL). Finally, another experimental setting was
proposed to measure the representation similarity of the two models after training, as
described in 3.1.2.

Smoothing factor: In cases where the results were too noisy, a smoothing coefficient
was used for better visibility (as seen in the example in figure 3.1.1). This factor can be
adjusted with a value from 0 (no smoothing) to 1 (maximum smoothing) and its formula
is based on the exponential moving average.

Figure 3.1.1: Example of using a smoothing factor. Figure 3.1.1a does not utilise
smoothing (value 0) and figure 3.1.1b uses a smoothing value of 0.9.
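The smoothing factor described above is, in effect, an exponential moving average s_t = w · s_{t−1} + (1 − w) · x_t with weight w in [0, 1]. A minimal sketch (one common variant, initialised with the first value):

```python
def smooth(values, weight):
    """Exponential moving average: s_t = weight * s_{t-1} + (1 - weight) * x_t."""
    smoothed, last = [], values[0]   # initialise from the first value
    for x in values:
        last = weight * last + (1.0 - weight) * x
        smoothed.append(last)
    return smoothed

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(smooth(noisy, 0.0) == noisy)                # True: weight 0 leaves the curve unchanged
print([round(v, 3) for v in smooth(noisy, 0.9)])  # a heavily damped version of the curve
```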

3.1.1 Model­Agnostic & Problem­Agnostic Formulation

The objective of MAML is to learn the parameters θ of a model fθ in order to achieve fast
adaptation to new, unseen tasks during meta-testing. Specifically, the idea of MAML
is based on updating the gradient-based learning rule in a way that leads to rapid
progress on any task drawn from the same task distribution p(T ). That is, the model
parameters are sensitive to changes across different tasks, such that small changes in
the parameter space lead to a great gain in performance on any task of p(T ). On the
other hand, using an almost identical method, ANIL is based on the idea that such
a meta-learning procedure leads to learning a representation of the data strong enough
that no change in the representation is needed in order to adapt to new test tasks
from p(T ). A general, model-agnostic formulation of MAML and ANIL


is given in Algorithms 1 and 2, respectively.

Algorithm 1: Model-Agnostic Meta-Learning (MAML)

p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
Random initialisation of model parameters θ
while not converged do
    Sample a batch of tasks Ti = {(Dtrain)i, Li} ∼ p(T )
    for all Ti do
        For K samples evaluate: ∇θi Li(θi; (D_train^support)i)
        Update parameters of inner learner: θi ← θi − α ∇θi Li(θi; (D_train^support)i)
    end
    Update meta-learner's parameters: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θ; (D_train^query)i)
end
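The update rule above can be sketched on a toy distribution of 1-D linear-regression tasks. The sketch below uses the first-order approximation (often called FOMAML), which drops the second-derivative term of the full MAML update; the task family, hyper-parameter values and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(theta, x, y):
    # Squared-error loss L(theta) = mean((theta*x - y)^2) and dL/dtheta
    err = theta * x - y
    return float(np.mean(err ** 2)), float(np.mean(2 * err * x))

def meta_train(alpha=0.1, beta=0.05, meta_iters=300, k=10, tasks_per_batch=4):
    theta = rng.normal()                        # random initialisation of theta
    for _ in range(meta_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):        # sample a batch of tasks T_i
            a = rng.uniform(-2.0, 2.0)          # task T_i: y = a * x
            x_s, x_q = rng.normal(size=k), rng.normal(size=k)
            # inner loop: one adaptation step on the support set
            _, g = loss_and_grad(theta, x_s, a * x_s)
            theta_i = theta - alpha * g
            # outer loop: gradient of the query loss at the adapted theta_i
            # (first-order: evaluated at theta_i instead of differentiating
            # through the inner update)
            _, g_q = loss_and_grad(theta_i, x_q, a * x_q)
            meta_grad += g_q
        theta -= beta * meta_grad / tasks_per_batch
    return theta
```

After meta-training, a single inner-loop step at the returned θ should already reduce the loss on a freshly sampled task.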

Algorithm 2: Almost No Inner Loop (ANIL)

p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
Random initialisation of network body parameters θb
Random initialisation of network head (output layer) parameters θh
while not converged do
    Sample a batch of tasks Ti = {(Dtrain)i, Li} ∼ p(T )
    for all Ti do
        For K samples evaluate: ∇θhi Li(θhi; (D_train^support)i)
        Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; (D_train^support)i)
    end
    Update all of the network's parameters: θb+h ← θb+h − β ∇θb+h ∑_{Ti∼p(T )} Li(θb+h; (D_train^query)i)
end
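The same toy setup can illustrate ANIL's split: the inner loop touches only the head parameter while the body stays frozen. The two-parameter model below (a tanh "body" feeding a linear "head") is purely illustrative:

```python
import numpy as np

def predict(theta_b, theta_h, x):
    # "body" extracts a feature, "head" (output layer) maps it to a prediction
    return theta_h * np.tanh(theta_b * x)

def anil_adapt(theta_b, theta_h, x_support, y_support, alpha=0.1, steps=1):
    """Inner loop of ANIL: gradient steps on the HEAD only;
    the body parameter theta_b stays fixed during adaptation."""
    for _ in range(steps):
        features = np.tanh(theta_b * x_support)
        err = theta_h * features - y_support
        g_h = np.mean(2 * err * features)   # d(mean squared error)/d(theta_h)
        theta_h = theta_h - alpha * g_h
    return theta_h
```

The outer loop would then back-propagate the query loss through both θb and the adapted θh, exactly as in the MAML sketch.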

3.1.2 Representation Similarity Experiments

The objective of this experiment is to measure and compare the latent representations
of a network trained with MAML. This can be achieved by employing the Canonical
Correlation Analysis (CCA) metric [37]. Given the representations of two layers L1, L2
of a neural network, CCA outputs a similarity score indicating whether L1 and L2
share no similarity at all (value 0) or are identical (value 1).
By comparing the representations of the network before and after the inner loop
adaptation phase (during meta-testing), this metric illustrates whether the model
undergoes significant changes in its representation (rapid learning) or minimal
changes (feature reuse). This experiment follows a procedure similar to the one
presented in the ANIL paper [62] and aims to answer the first question of "What
does MAML actually learn: learning to learn or a high-quality feature
representation?".
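A minimal sketch of such a similarity score, computed as the mean canonical correlation between two activation matrices via SVD. Note this is a simplified stand-in for the analysis in [37] (which uses the closely related SVCCA), not the thesis implementation:

```python
import numpy as np

def cca_similarity(A, B, eps=1e-10):
    """Mean canonical correlation between two representations.

    A, B: (n_samples, n_features) activation matrices of two layers
    (or the same layer before/after adaptation). Returns a value in
    [0, 1]; 1 means the representations are identical up to an
    invertible linear transform, 0 means no linear relationship."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # Orthonormalise each representation via its SVD (whitening)
    Ua, Sa, _ = np.linalg.svd(A, full_matrices=False)
    Ub, Sb, _ = np.linalg.svd(B, full_matrices=False)
    Ua = Ua[:, Sa > eps]
    Ub = Ub[:, Sb > eps]
    # Singular values of Ua^T Ub are the canonical correlations
    corrs = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float(np.mean(corrs))
```

A layer compared with itself (or with any invertible linear transform of itself) scores close to 1, while two independent random representations score much lower.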

3.2 Few­shot Image Classification

The first experiments of this thesis concerned MAML and ANIL on image classification
tasks, specifically on the relatively small Omniglot and Mini-ImageNet datasets.
Reproducing the results of the original papers on these datasets provides some
confidence that the implementations were correct. The most common way to evaluate
meta-learning algorithms on classification tasks is through few-shot learning, where
the goal is to approximate a function from just a handful of input-output pairs per
task / label in order to classify new images that share features with previously seen
ones. For example, the model is introduced to a dataset of pictures of mammals where
each species has only a limited number of samples. Despite the constraint of limited
data per species, when a new species is encountered, the model is expected to identify
it easily from just a few pictures (few shots).
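The N-way, K-shot episode construction described above can be sketched as follows; the dict-based dataset layout, the extra query-set size and the function name are illustrative assumptions:

```python
import random

def sample_task(dataset, n_ways=5, k_shots=1, k_query=5, rng=random):
    """Sample one few-shot episode.

    dataset: dict mapping class label -> list of samples.
    Returns (support, query) lists of (sample, episode_label) pairs,
    where episode labels are re-indexed 0..n_ways-1 so the model
    cannot memorise global class identities."""
    classes = rng.sample(sorted(dataset), n_ways)
    support, query = [], []
    for label, cls in enumerate(classes):
        # draw support and query samples without replacement
        picks = rng.sample(dataset[cls], k_shots + k_query)
        support += [(s, label) for s in picks[:k_shots]]
        query += [(s, label) for s in picks[k_shots:]]
    return support, query
```

The support set drives the inner-loop adaptation and the query set the meta-optimisation, matching the (Dsupport)i / (Dquery)i split used in the formulation.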

3.2.1 Classification Formulation

In the case of few-shot learning, the model-agnostic formulation described in 3.1.1 is
adapted to use the cross-entropy loss, as seen in equation 3.1, for K shots (input-output
pairs) per class and N classes, making it an N-way, K-shot classification problem. Each
task Ti consists of a support set (Dsupport)i (inner loop adaptation) and a query set
(Dquery)i (meta-optimisation), each containing a total of N · K samples. The model fθ
in this case is a standard CNN trained with the Adam optimiser. The initial
hyper-parameter values were chosen based on the values reported in [62]. A limited
hyper-parameter search


was also employed, as seen in chapter 4.


Li(θ) = ∑_{xj,yj∼Ti} yj log fθ(xj) + (1 − yj) log(1 − fθ(xj))        (3.1)

The complete MAML and ANIL methods for few-shot classification are described
in detail in algorithms 3 and 4 respectively.
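For the N-way case, equation 3.1 generalises to the categorical cross-entropy over softmax outputs; a numerically stable numpy sketch (illustrative, not the thesis code):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean categorical cross-entropy, the N-way generalisation of eq. 3.1.

    logits: (batch, n_classes) raw network outputs,
    labels: (batch,) integer class indices."""
    # log-softmax with the max-subtraction trick for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # negative log-likelihood of the correct class, averaged over the batch
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

For uniform logits over N classes the loss equals log N, the familiar chance-level value.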

Algorithm 3: MAML for Few-Shot Classification

K: number of samples per class to adapt (shots)
N: number of classes per inner loop iteration (ways)
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
j: number of Ti tasks per outer loop iteration (meta batch size)
E: number of outer loop iterations (epochs)
Random initialisation of model's parameters θ
for E iterations do
    Sample j tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the parameters for this specific Ti: θi ← θ
        for m steps do
            Evaluate loss: ∇θi Li(θi; (D_train^support)i)
            Update inner learner: θi ← θi − α ∇θi Li(θi; (D_train^support)i)
        end
        Evaluate loss: ∇θi Li(θi; (D_train^query)i)
    end
    Update meta-learner: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θi; θ; (D_train^query)i)
end

3.2.2 Omniglot

The first vision dataset, Omniglot [45], is one of the most popular for evaluating meta-
learning algorithms in few-shot classification. It contains 1623 distinct handwritten
characters from 50 different alphabets, with 20 samples each. Each character sample is
a 28(H) x 28(W) x 1 (grayscale) image drawn by a different person (as seen in figure
3.2.1). As implemented in [22], 1200 characters were used for training, irrespective of
the alphabet, the rest were used for testing, and the dataset was augmented by adding
variations of the images rotated in multiples of 90 degrees. The few-shot setting
evaluated was 20-way (20 distinct tasks / characters)


Algorithm 4: ANIL for Few-Shot Classification

K: number of samples per class to adapt (shots)
N: number of classes per inner loop iteration (ways)
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
j: number of Ti tasks per outer loop iteration (meta batch size)
E: number of outer loop iterations (epochs)
Random initialisation of network body parameters θb
Random initialisation of network head parameters θh
for E iterations do
    Sample j tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the head parameters for this specific Ti: θhi ← θh
        for m steps do
            Evaluate loss: ∇θhi Li(θhi; (D_train^support)i)
            Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; (D_train^support)i)
        end
        Evaluate loss: ∇θhi Li(θhi; (D_train^query)i)
    end
    Update all of the network's parameters:
    θb+h ← θb+h − β ∇θ(b+h) ∑_{Ti∼p(T )} Li(θhi; θb+h; (D_train^query)i)
end


with 1 or 5 shots (1 or 5 samples per task / character). The model used was the same
network as described in [22]: a standard CNN with a 1x28x28 single-channel input,
4 convolutional layers of 64 units each with no max pooling, and a final fully connected
layer of 64 units with an output size of 20 (one for each class).

Figure 3.2.1: Samples from the Omniglot dataset.

3.2.3 Mini­ImageNet

The Mini-ImageNet dataset is a subset of the ImageNet dataset, specifically tuned for
the few-shot learning setting [86]. It requires fewer computational resources than the
original ImageNet but remains a difficult problem due to the large variety of images
included. Mini-ImageNet contains 100 classes, each with 600 samples of 84(H) x
84(W) x 3 (RGB) images (as seen in figure 3.2.2). The few-shot setting evaluated was
5-way (5 distinct tasks / classes) with 1 or 5 shots (1 or 5 samples per task / class). In
the case of Mini-ImageNet, the standard network described in [65] was used, with a
3x84x84 three-channel input, 4 convolutional layers of 32 units each with max pooling
(max pooling factor = 1), and a final fully connected layer of 800 units with 5 outputs.


Figure 3.2.2: Samples from the Mini­ImageNet dataset.

3.3 Meta­Reinforcement Learning


The goal of the meta-RL setting is to develop an agent that rapidly learns a policy for a
new test task with only a few iterations of experience, utilising previous knowledge
gained from similar tasks of the same distribution p(T ). Training such an agent
usually requires considerable computational power due to the extensive number of
interactions it has to perform with the environment. Moreover, RL environments
that fit the meta-learning setting and are practical to train on are very limited. Apart
from the Particles2D and Meta-World environments, the recently published Procgen
environment by OpenAI [11] was evaluated in this project. However, due to poor
results and the high computational cost required to successfully train models, it was
eventually deemed out of the scope of this project and the relevant study has been
moved to Appendix B.

3.3.1 Meta­RL Formulation

In the setting of meta-RL, each task Ti is defined as an MDP with a finite horizon H,
an initial state distribution qi(s1) and a transition distribution qi(st+1|st, at). During
the inner loop / adaptation phase, the agent fθi is able to sample a limited number of
episodes from each task Ti with the goal of quickly developing a policy πi for each loss
Li. Note that there is no restriction on which properties of the MDP need to be shared
across tasks, meaning that different tasks can have different reward functions or
transition distributions.


As mentioned in 2.3.1, in most problems the reward function and transition
distribution are unknown. Additionally, the unknown dynamics of the environment
usually make the expected reward, which we want to maximise, non-differentiable.
In such cases, policy gradient methods can be used to approximate the gradients for
the inner (adaptation phase) and outer (meta-optimisation) loops. A significant
point, which also increases complexity, is that since such methods are on-policy, new
roll-outs from each individual policy πi need to be sampled for each additional
adaptation update / gradient step. The most common general form of a policy
gradient method defines the loss objective for a specific task Ti as the negative
expectation, over a batch of samples, of the log-policy weighted by the reward:

Li(πi) = − E_{st,at∼πi} [ ∑_{t=1}^{H} log πi(at|st) Ri(st, at) ]        (3.2)

where Ri is the reward function of task Ti.

Along with the policy, another model approximating the value function Vϕ^π(st) is
updated in parallel so that it always reflects the most recent policy. Such models
estimating Vϕ^π(st) are also called baselines. A simple linear feature baseline model
was used in this case, fit to each task in each iteration to compute the state-value
function Vϕ^π(s) by minimising the least-squares distance, as first proposed in [15]
and later adopted in [22]:

ϕ = arg min_ϕ E_{st,R̂t∼π} [ (Vϕ^π(st) − R̂t)² ]        (3.3)
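A sketch of such a least-squares baseline fit. The concrete feature map (observations, their squares, and powers of normalised time) follows the spirit of the linear feature baseline of [15], but the exact features and interface here are illustrative assumptions:

```python
import numpy as np

def fit_linear_baseline(obs, returns):
    """Least-squares fit of V(s_t) ~ phi(s_t) @ w (eq. 3.3).

    obs:     (T, obs_dim) observations of one episode,
    returns: (T,) empirical discounted returns R_t.
    Returns the fitted state values V(s_t) for the episode."""
    t = np.arange(len(obs))[:, None] / 100.0      # normalised time step
    phi = np.concatenate(
        [obs, obs ** 2, t, t ** 2, t ** 3, np.ones_like(t)], axis=1)
    # minimum-norm least-squares solution of phi @ w ~ returns
    w, *_ = np.linalg.lstsq(phi, returns, rcond=None)
    return phi @ w
```

The fitted values are then subtracted from the returns to reduce the variance of the policy gradient, as discussed next.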

To reduce the variance of the policy gradient estimates of the state values, while
keeping the bias at a tolerable level through a bias-variance trade-off parameter, the
Generalised Advantage Estimator (GAE) method proposed in [73] was used. The
basic concept of the advantage function is to provide further insight into how much
better or worse the current policy is relative to the average action. It is defined in
equations 3.4-3.8 and is based on the Temporal Difference (TD) error
rt + γV(st+1) − V(st) [82].

Ât^{(1)} := δt^V = −V(st) + rt + γV(st+1)        (3.4)

Ât^{(2)} := δt^V + γδ_{t+1}^V = −V(st) + rt + γrt+1 + γ²V(st+2)        (3.5)

Ât^{(3)} := δt^V + γδ_{t+1}^V + γ²δ_{t+2}^V = −V(st) + rt + γrt+1 + γ²rt+2 + γ³V(st+3)        (3.6)

leading to a sum consisting of the returns (γ-discounted rewards) and the negative
baseline term −V(st) for a horizon of length H (eq. 3.7), and finally adding a bias-
variance trade-off factor τ (eq. 3.8):

Ât^{(H)} = ∑_{l=0}^{H} γ^l δ_{t+l}^V = −V(st) + ∑_{l=0}^{H} γ^l rt+l        (3.7)

Ât^{GAE(γ,τ)} := (1 − τ)(Ât^{(1)} + τÂt^{(2)} + τ²Ât^{(3)} + ...) = ∑_{l=0}^{H} (γτ)^l δ_{t+l}^V        (3.8)

Thus, the loss becomes:

Li(πi) = − E_{st,at∼πi} [ ∑_{t=1}^{H} log πi(at|st) Ât^{GAE} ]        (3.9)

Network architecture: The agent model fθ used to approximate the optimal policy π
for all Ti was a standard MLP, as also used in the MAML and ANIL papers, with two
fully connected layers of 100 units each and a ReLU activation function for
Particles2D and tanh for Meta-World. In order to train the meta-RL agent, MAML
and ANIL were tested with two different policy gradient methods, TRPO [74] and
PPO [74] (as seen in algorithms 5 and 6).

In contrast to vanilla policy gradient methods, where an update does not change the
policy much in terms of parameter values, TRPO tries to take the largest update step
possible to improve performance, while following a KL-divergence constraint to
control the "distance" between the old and new policy. Even though this approach
involves a complex second-order method with multiple tunable hyper-parameters, it
significantly improves sample efficiency and usually speeds up convergence. The loss
objective, or surrogate advantage, L(θt, θ) is defined as the measure of performance
of policy πθ relative to the previous policy πθt on episodes from the old policy²:

L(θt, θ) = E_{s,a∼πθt} [ (πθ(a|s) / πθt(a|s)) A^{πθt}(s, a) ]        (3.10)

and the KL-divergence constraint between the two policies, on states taken from the
old policy, is defined as:

DKL(θ||θt) = E_{s∼πθt} [ DKL(πθ(·|s) || πθt(·|s)) ]        (3.11)

Then, a general TRPO update can be summed up as:

θt+1 = arg max_θ L(θt, θ)   s.t.   DKL(θ||θt) ≤ ϵ        (3.12)

where ϵ is a parameter, usually a small value (around 0.01), that controls the KL-
divergence limit for the TRPO update³.

Based on the same motivation of improving sample efficiency by cautiously taking
large steps towards better policies while avoiding performance collapse, PPO follows
a much more straightforward first-order method, while offering results competitive
with TRPO. In this project, the PPO-Clip variant was used, which omits the KL-
divergence constraint term and instead motivates the new policy to stay close to the
old one through the advantage function⁴. The complete loss objective can be defined
as:

L(s, a, θt, θ) = min( (πθ(a|s) / πθt(a|s)) A^{πθt}(s, a),  g(ϵ, A^{πθt}(s, a)) )        (3.13)

where the clipping term g, defined as:

g(ϵ, A) = (1 + ϵ)A  for A ≥ 0,   (1 − ϵ)A  for A < 0        (3.14)

acts like a regulariser that limits how much the objective can increase, stopping the
new policy from diverging too far. Then, the PPO-Clip update can be summed up as:

θt+1 = arg max_θ E_{s,a∼πθt} [ L(s, a, θt, θ) ]        (3.15)

² Here we explicitly denote that the policy π depends on the parameters θ by writing πθ.
³ More details on the TRPO algorithm in OpenAI's Spinning Up documentation: https://[Link]/en/latest/algorithms/[Link]
⁴ More details on the PPO algorithm in OpenAI's Spinning Up documentation: https://[Link]/en/latest/algorithms/[Link]

Algorithm 5: MAML for RL using a Policy Gradient method

K: number of episodes per task to adapt (adapt batch size | 'shots')
N: number of tasks per inner loop iteration (meta batch size | 'ways')
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
E: number of outer loop iterations (epochs)
Random initialisation of linear baseline parameters ϕ
Random initialisation of model's parameters θ
for E iterations do
    Sample N tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the parameters for this specific Ti: θi ← θ
        for m steps do
            Sample K support episodes on task Ti with policy πi^support
            Update the baseline model Vϕ^π(s) (eq. 3.3)
            Compute advantage estimates Ât^{GAE} (eq. 3.8)
            Compute loss: ∇θi Li(θi; πi^support), eq. 3.10 (TRPO) or eq. 3.13 (PPO)
            Compute parameters θi for the new policy πi^support: eq. 3.12 (TRPO) or eq. 3.15 (PPO)
            Update parameters of inner learner: θi ← θi − α ∇θi Li(θi; πi^support)
        end
        Sample K query episodes on task Ti with the adapted model θi for policy πi^query
        Compute loss on query episodes: ∇θi Li(θi; πi^query)
    end
    Update meta-learner: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θ; πi^query)
end

3.3.2 Particles2D

The first RL setting the algorithms were tested on was a trivial 2D game-like problem
developed by Finn et al. in [22]. Given a random point-goal in a 2D square, the agent
should move to these coordinates, with some velocity, without having explicit access
to them. The observation is simply the agent's current coordinates in the square and
the actions correspond to velocity values for movement within the range

Algorithm 6: ANIL for RL using a Policy Gradient method

K: number of episodes per task to adapt (adapt batch size | 'shots')
N: number of tasks per inner loop iteration (meta batch size | 'ways')
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
E: number of outer loop iterations (epochs)
Random initialisation of linear baseline parameters ϕ
Random initialisation of network body parameters θb
Random initialisation of network head parameters θh
for E iterations do
    Sample N tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the head parameters for this specific Ti: θhi ← θh
        for m steps do
            Sample K support episodes on task Ti with policy πi^support
            Update the baseline model Vϕ^π(s) (eq. 3.3)
            Compute advantage estimates Ât^{GAE} (eq. 3.8)
            Compute loss: ∇θhi Li(θhi; πi^support), eq. 3.10 (TRPO) or eq. 3.13 (PPO)
            Compute parameters θhi for the new policy πi^support: eq. 3.12 (TRPO) or eq. 3.15 (PPO)
            Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; πi^support)
        end
        Sample K query episodes on task Ti with the adapted model θhi for policy πi^query
        Compute loss on query episodes: ∇θhi Li(θhi; πi^query)
    end
    Update all of the network's parameters: θb+h ← θb+h − β ∇θb+h ∑_{Ti∼p(T )} Li(θb+h; πi^query)
end


Figure 3.3.1: Sample of a MAML-trained agent adapting to a position in Particles2D.
Figure taken from [22].

[−0.1, 0.1]. The reward function is the negative squared distance of the agent from the
point, and an episode ends either when the agent is within 0.01 of the goal coordinates
or when the time horizon H = 100 is reached. The initial hyper-parameter values
were chosen based on the values reported in [62].
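A minimal re-implementation sketch of this environment; the goal-sampling bounds and the interface are approximations for illustration, not the exact implementation of [22]:

```python
import numpy as np

class Particles2D:
    """Toy point-goal navigation task in the spirit of Particles2D [22]."""

    def __init__(self, horizon=100, rng=None):
        self.horizon = horizon
        self.rng = rng or np.random.default_rng()

    def reset_task(self):
        # sample a new point-goal; the agent never observes it directly
        self.goal = self.rng.uniform(-0.5, 0.5, size=2)

    def reset(self):
        self.pos = np.zeros(2)
        self.t = 0
        return self.pos.copy()              # observation = agent coordinates

    def step(self, action):
        # actions are velocities clipped to [-0.1, 0.1]
        self.pos += np.clip(action, -0.1, 0.1)
        self.t += 1
        dist2 = float(np.sum((self.pos - self.goal) ** 2))
        reward = -dist2                     # negative squared distance
        done = np.sqrt(dist2) < 0.01 or self.t >= self.horizon
        return self.pos.copy(), reward, done
```

Each call to `reset_task` defines a new task Ti from p(T ), while `reset`/`step` provide the usual episode interface.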

3.3.3 Meta­World

A notable gap in the meta-RL research community is a standard benchmark that is
easy to use and computationally reasonable, yet challenging enough to evaluate the
generalisation capabilities of algorithms. The most commonly used benchmark
environments in meta-learning so far have been the Half-Cheetah and "ant" tasks of
MuJoCo, which have been shown to be limited in complexity and not suitable for
benchmarking meta-learning [50].

To address this, Yu et al. have recently developed an open source simulated
environment, compatible with the OpenAI Gym interface, specifically designed to
benchmark meta-learning and multi-task learning algorithms [90]. Meta-World
consists of 50 unique simulated manipulation tasks for a Sawyer robot arm, such as
opening a door handle, grasping an object, or reaching for a point on a table (as seen
in figure 3.3.2). Even though these tasks are diverse, they share a common structure
which enables similar tasks to be learned efficiently. Based on the premise that meta-
learning algorithms should be able to learn multiple tasks and generalise to new
ones, this environment sets high standards by broadening the task distribution and
providing an extensive evaluation protocol of variable difficulty.

Figure 3.3.2: Screenshot of the task "push", in which the robot arm needs to push an
object (puck) to specific coordinates (goal).

Specifically, all the tasks require the arm to interact with different objects and shapes,
using different joints, in order to combine acts of pushing, grasping and reaching. The
observation and action spaces are shared across tasks, meaning their dimensions
remain the same. The observations consist of a 9-dimensional 3-tuple of the 3D
Cartesian positions of the end-effector, the object (or object #1) and the goal (or
object #2). The actions consist of 3D end-effector positions of the robot arm.
Depending on the task, a specific reward function has been engineered. The initial
hyper-parameter values were chosen based on the values reported in [90].
Meta-World provides three scenarios of increasing difficulty:

ML1: As a more computationally feasible baseline, the first setting consists of one
specific task (in the case of this thesis the task "push" was chosen) with 50 random
initial object and goal positions and another set of 10 held-out positions, resembling
the previous setting of Particles2D.

ML10: To properly evaluate generalisation to new tasks, the algorithm is meta-trained
on 10 specific tasks and then evaluated on a held-out set of 5 more tasks with similar
structure. For each task, the positions of the object and the goal are randomised.

ML45: Similar to ML10 but with 45 training tasks instead. Due to its demanding
complexity and the respective computational cost required, this setting was deemed
outside the scope of the thesis and is left for future work.

Success rate metric: For Meta-World, another metric is introduced to monitor the
performance of the agents. Because reward values do not always provide a clear
picture of the effectiveness of a policy, Yu et al. propose measuring an agent's
performance on Meta-World by its success rate on a task. This metric is defined based
on the distance of the object o (depending on the task) from the final goal position g,
e.g. ||o − g||₂ < ϵ, where ϵ is a small distance threshold, holding for a consecutive
number of steps⁵.

⁵ For more details on each of the success metrics, see the complete list in [90].
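The metric can be sketched as follows; the threshold and streak-length values are illustrative, since each Meta-World task defines its own:

```python
import numpy as np

def is_success(obj_pos, goal_pos, eps=0.05):
    """Single-step success test: object within eps of the goal,
    i.e. ||o - g||_2 < eps. The threshold 0.05 is an example value."""
    o = np.asarray(obj_pos, dtype=float)
    g = np.asarray(goal_pos, dtype=float)
    return bool(np.linalg.norm(o - g) < eps)

class SuccessTracker:
    """Counts how many consecutive steps the object stayed within eps
    of the goal; success fires once the streak reaches `needed` steps."""

    def __init__(self, eps=0.05, needed=3):
        self.eps, self.needed, self.streak = eps, needed, 0

    def update(self, obj_pos, goal_pos):
        near = is_success(obj_pos, goal_pos, self.eps)
        self.streak = self.streak + 1 if near else 0
        return self.streak >= self.needed
```

An episode is then scored as a success if the tracker ever fires before the horizon is reached.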

Chapter 4

Results

4.1 Few­Shot Image Classification

This section presents results regarding the meta-training, meta-testing and
representation similarity of MAML and ANIL on the Omniglot and Mini-ImageNet
datasets for few-shot classification. These models were developed for the purpose of
reproducing the results reported in [62], ensuring the implementations were correct
and validating their findings. The experiments consisted first of hyper-parameter
searches for each dataset (more in C.1) and then of reporting the results of the best
models, as seen next.

Firstly, in the case of Omniglot, both methods showed similarly stable performance
during meta-training and converged quickly to their final performance, as seen from
their validation accuracy in figure 4.1.1. In the case of Mini-ImageNet, the methods
again managed to converge in fewer than 5,000 epochs, but not as quickly or as
stably as on Omniglot, as seen from the more volatile validation accuracy in both
panels of figure 4.1.2. This is speculated to be due to the larger variance across
coloured RGB images in the Mini-ImageNet dataset compared to the more uniform
grayscale images in Omniglot. Further tests are needed to support this speculation.


(a) 20­way, 5­shot setting

(b) 20­way, 1­shot setting

Figure 4.1.1: Validation accuracy for MAML and ANIL on the Omniglot dataset for
20-way, 5-shot (4.1.1a) and 20-way, 1-shot (4.1.1b) classification. Each line is the
mean value and the shaded area is the standard deviation across three seeds.
Smoothing factor 0.8.


(a) 5­way, 5­shot setting

(b) 5­way, 1­shot setting

Figure 4.1.2: Validation accuracy for MAML and ANIL on the Mini-ImageNet dataset
for 5-way, 5-shot (4.1.2a) and 5-way, 1-shot (4.1.2b) classification. Each line is the
mean value and the shaded area is the standard deviation across three seeds.
Smoothing factor 0.8.

During meta-testing, MAML showed slightly better performance than ANIL in most
cases and the results were similar to the findings of [62], as seen in table 4.1.1.
However, further statistical tests would be required to provide concrete evidence for
these observations: comparing mean test accuracies across only three seeds gives too
small a sample size for meaningful statistical hypothesis tests.


Dataset            Omniglot                   Mini-ImageNet
Setting        20-w, 1-s    20-w, 5-s     5-w, 1-s     5-w, 5-s
MAML          91.7 ± 0.4   96.8 ± 0.1   48.3 ± 1.2   63.4 ± 1.3
ANIL          91.3 ± 0.8   96.9 ± 0.6   43.7 ± 0.9   58.9 ± 1.9
MAML [62]     93.7 ± 0.7   96.4 ± 0.1   46.9 ± 0.2   63.1 ± 0.4
ANIL [62]     96.2 ± 0.5   98.0 ± 0.3   46.7 ± 0.4   61.5 ± 0.5

Table 4.1.1: Meta-testing results of MAML and ANIL on few-shot image classification.
The reported results for each model are the mean and standard deviation across three
different seeds. The bottom two rows are as reported in [62].

One important factor to consider when comparing these algorithms is the difference
in computational resources required to train each one. Training MAML on
Mini-ImageNet for 5-way, 1-shot classification took approximately 1 hour 50
minutes¹, whereas ANIL took only 55 minutes for the same setting, half the training
time. In other words, similar, or only slightly worse, performance could be achieved
with ANIL at half the computational cost.

Finally, when compared to the results presented in [62], the representation of the
neural network seemed to change even less during adaptation: most of the network
remains unchanged before and after the inner loop (figures 4.1.3a-4.1.3b).

¹ On a CUDA-enabled RTX 2060 S GPU.


(a) Figure from [62].

(b) Our reproduced results.

Figure 4.1.3: Representation change before and after adaptation of MAML in a 5-way,
5-shot setting on Mini-ImageNet. Reported results are the mean and standard
deviation over 5 different test tasks with 3 different seeds. 4.1.3a is from [62]; 4.1.3b
shows our reproduced results.


4.2 Meta­Reinforcement Learning

Since meta-RL is difficult to train and complex to debug, the experiments progressed
gradually from a trivial toy problem, to an easy baseline, and finally to the most
challenging setting. Additionally, a long series of experiments needed to be run at
each stage before results that could provide any insights were obtained. Such results
are reported in Appendix C instead of this section, to prevent an overload of
information and a disruption of the flow of the report.

4.2.1 Particles 2D

As a sanity check of the performance of the methods, the basic 2D environment
"Particles2D" introduced with MAML [22] was used. In figure 4.2.1, a comparison
between the PPO and TRPO versions of MAML is shown. As a baseline, the
performance of an agent taking random actions is also included in figure 4.2.1 to
provide a point of comparison for the rest of the models. As illustrated in figure 4.2.1,
the methods with the best and most consistent results were the TRPO versions of
MAML and ANIL, whereas PPO's performance fluctuated and was less stable.

Figure 4.2.1: Performance comparison of different methods on Particles2D.
Smoothing factor 0.9.


4.2.2 ML1: Push

In the Meta-World environment, there are three single-task benchmarks used as
baselines: "push", "pick-and-place" and "reach" (as seen in the original paper [90]).
For this project, the setting "push" was used as a baseline before moving to the more
demanding ML10 benchmark. Firstly, in figure 4.2.2 a quick comparison between the
PPO and TRPO versions of MAML and ANIL indicates that, again, the TRPO variant
provides better and more consistent performance.

Figure 4.2.2: Comparison of methods for the task ML1: Push.

The models with the best performance during the search were then trained for 3,000
epochs, as shown in figure 4.2.3a. Both models achieved a 100% success rate on 10
unseen tasks (new goal positions) during meta-testing. Another interesting note is
that MAML-TRPO in this case shows better performance than the MAML-PPO
results reported in the original Meta-World paper (as seen in figure 4.2.3b).


(a) MAML-TRPO and ANIL-TRPO on ML1: Push trained until convergence.
Smoothing factor 0.9.

(b) Figure from [90] illustrating the performance of MAML on ML1: Push.

Figure 4.2.3: MAML performance on ML1: Push compared to ANIL (a) and other
methods (b) from [90].


4.2.3 ML10

Finally, the results for the most challenging problem of Meta-World, ML10, are
presented. In figure 4.2.4a, a comparison of agents trained with different policies is
shown. In addition to the meta-learning methods, the vanilla TRPO and PPO
methods were also trained, to showcase the limitations of non-meta-learning
methods in an environment where a single policy needs to perform well on multiple
tasks. Following the trend from the previous datasets and environments, the
MAML-TRPO and ANIL-TRPO methods outperform the rest during meta-training,
as seen in figure 4.2.4a. After this comparison, they were trained even longer, for
3,000 epochs (figure 4.2.4b), each one taking approximately 13 days to complete².
Contrary to the few-shot classification setting, in which ANIL showed high
computational efficiency, in the case of Meta-World the difference in computational
cost is negligible³. This is probably because most of the computational resources are
spent sampling interactions of the agent with the environment, so back-propagating
through the whole network versus just the final layer does not amount to a
considerable difference.

Specifically, in ML10 the agents were trained on the 10 tasks pick-and-place,
reaching, button-press, window-opening, pushing, sweep-into, drawer-closing,
dial-turning, peg-insertion-side and basketball, and then tested on drawer-opening,
door-closing, shelf-placing, sweep and lever-pulling. Meta-testing on these 5 new
tasks consisted of the agent having only a single inner loop (gradient descent) update
to adapt to each task, sampling 10 episodes ("shots") from each (as indicated in
[90]). Meta-testing was also performed on the train tasks to establish the final
performance of the meta-learnt policy. When a task is sampled, the arm / object /
goal positions are randomly generated every time, which means that for some
settings the agent might not always manage to adapt within one step. To account for
this, the meta-testing process above was repeated three times (trials) for three
different configuration settings per task. Moreover, collecting high rewards does not
always translate to a perfect success rate, and the scale of the reward values differs
from task to task. However, a 0% success rate does not mean complete failure either,
since if the reward values are high it might mean

² The PPO agent was "early-stopped" at 1,000 epochs after not showing signs of improvement for 500 epochs.
³ For both MAML and ANIL on Meta-World ML10, each epoch took the same amount of time, approximately 390 seconds.


(a) Performance comparison of methods on ML10.

(b) PPO, MAML­TRPO and ANIL­TRPO on ML10. Smoothing factor 0.95.

Figure 4.2.4: Comparison of methods in ML10.


that the task is almost completed, but the agent did not cross the specific threshold
needed for the task to be classified as a "success". This is due to the way the reward
functions and the success metrics are hand-engineered in Meta-World. In the case of
sweep, for example, the reward function is:

−||h − o||₂ + I[||h − o||₂ < 0.05] · 1000 · e^{−||h − g||₂² / 0.01}        (4.1)

and the success metric is calculated based on:

I[||o − g||₂ < 0.05]        (4.2)

where o is the object position, h is the robot arm (gripper) position and g is the goal
position. A complete list of the reward functions for each task and their success
metrics can be found in Appendix C.10. In the following figures, both the reward
values and the success rates are illustrated, in order to avoid misinterpretation of the
results.

Figures 4.2.5 (a)-(b) illustrate a small lead in performance for MAML-TRPO
compared to the other methods on the train tasks, both in average accumulated
rewards and in higher success rates. It is interesting that the success rate of
MAML-TRPO drops relative to the other methods at test time, even though its
accumulated rewards are still higher. However, due to the limited sample size of our
experiments, we cannot confidently conclude that this difference is statistically
significant rather than the outcome of noise or randomness.

This discrepancy between high rewards and a low success rate could be attributed to
two factors. Firstly, as mentioned before, the success metrics are hand­engineered
with arbitrary threshold values, so the agent could almost be completing the task, with
the arm or object moving around the goal but not within the threshold (thus gaining
high rewards but not high success rates). Secondly, it could be a sign of MAML­TRPO
overfitting to the train tasks and thus failing to perform as well on the test tasks.

Another issue with these results is the remarkably high variance of the rewards and the
success rates. This is due to the diversity of the tasks, since each task has its own
reward function. In figures 4.2.6, for example, MAML­TRPO performs better than the
other methods on the train task of basketball but struggles to outperform them on the
test task of drawer-open. Out of all the tasks, MAML­TRPO


Figure 4.2.5: Comparison of PPO, MAML­TRPO and ANIL­TRPO on ML10 train (a) and test (b) tasks. Reported results are mean and standard deviation over all the tasks, with 5 different seeds of 3 trials of 10 episodes per task.


manages to show signs of better performance than the other methods on 9 out of the
10 train tasks but on just 2 out of the 5 test tasks, further indicating signs of overfitting
(worse performance during testing than training). It is important to note again that
the reported findings come from a small sample size of experiments, preventing us
from drawing statistically significant conclusions. Detailed results of the performance
of the methods in a per­task format can be found in Appendix C.8.


Figure 4.2.6: Comparison of PPO, MAML­TRPO and ANIL­TRPO on the basketball (a) and sweep (b) tasks. Reported results are mean and standard deviation with 5 different seeds of 3 trials of 10 episodes per task.

A noteworthy observation can be made when rendering the policies in the environment
to see how they perform in action. Even in cases in which the MAML­TRPO agent fails,


Figure 4.2.7: Representation similarity across layers of the MAML­TRPO network before and after adaptation during meta­testing. Reported results are mean and standard deviation across the 5 test tasks.

it seems that the movements of the robot arm are smoother and less hectic compared
to the other methods. These rendered animations can be found in the code repository
on GitHub (link).

Lastly, in figure 4.2.7, the representation change in percentage based on the CCA metric
is presented for the network before and after the inner­loop adaptation on the meta­
testing tasks of ML10. These results show that minimal changes take place in the
representation space, indicating that the performance of MAML during meta­testing
relies on feature reuse.
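The similarity measurement behind figure 4.2.7 can be sketched as follows (a minimal numpy implementation of the mean canonical correlation between two activation matrices; the exact CCA variant and preprocessing used for the reported results may differ):

```python
import numpy as np

def mean_cca_similarity(X, Y):
    """Mean canonical correlation between activation matrices X, Y (samples x units).
    Values near 1 mean the two representations span nearly the same subspace."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)  # orthonormal basis for X's column space
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False).mean()

# Activations of one layer on the same inputs before/after the inner-loop step:
rng = np.random.default_rng(0)
before = rng.standard_normal((200, 16))
after = before @ rng.standard_normal((16, 16))   # same features, linearly remixed
print(mean_cca_similarity(before, after))        # approximately 1.0: feature reuse
```

A value close to 1 before versus after adaptation is what the feature-reuse interpretation above predicts: the inner loop barely moves the representation.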

Chapter 5

Discussion

5.1 Further insights


In addition to the answers given in chapter 6, a few other insightful observations can
be made regarding the algorithms and the environment.

Regarding training such models: The process of training models in complex
environments with a considerably high number of hyper­parameters can be quite
challenging. Meta­learning methods, and especially meta­reinforcement learning
methods, seem to be quite sensitive to hyper­parameter values. Coupled with the
substantial computational resources required to perform proper hyper­parameter
searches that provide results with high statistical confidence, this leads to an issue
that is difficult to solve within the scope of such a project.

Regarding MAML: MAML showed great performance on the Meta­World
benchmark, being able to almost master 9 out of 10 tasks with a single general
meta­learnt policy in a relatively small 2­layer neural network, outperforming the
previous state­of­the­art results of [90]. Additionally, the agent trained with MAML
seemed to show smoother and more sensitive movements, both when completing a
task and even when failing to adapt to a new one. Its performance during
meta­testing does, however, indicate some limitations of the algorithm and a possible
sensitivity to overfitting, as well as limited meaningful change to its weights
during adaptation. To reach a high­confidence verdict regarding its adaptation
capabilities on unseen tasks, a more thorough study of the hyper­parameter values
would be advised.


Regarding ANIL: The results of ANIL are quite intriguing. Even though its
performance in relatively straightforward settings, such as few­shot classification, is
comparable to MAML, it is interesting to see that it does not manage to achieve the
same level of success in a more challenging environment (as questioned in ANIL’s
review 1 ). This indicates that adapting to complex tasks requires more than an update
of the head of the network based on a general meta­initialisation of its weights.
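The difference between the two inner loops can be made concrete in a toy setting (our own simplified numpy illustration with a squared loss on a linear two­layer network; the actual algorithms take policy­gradient steps):

```python
import numpy as np

def inner_update(W_body, W_head, X, y, lr=0.1, anil=True):
    """One inner-loop gradient step on a linear 2-layer net with squared loss.
    MAML (anil=False) adapts both layers; ANIL (anil=True) adapts only the head."""
    H = X @ W_body                       # body features (the learned representation)
    err = H @ W_head - y                 # prediction error
    new_head = W_head - lr * (H.T @ err) / len(X)
    if anil:
        return W_body, new_head          # representation frozen during adaptation
    new_body = W_body - lr * (X.T @ (err @ W_head.T)) / len(X)
    return new_body, new_head
```

With anil=True the representation cannot change during adaptation, so all task­specific structure must already be present in the meta­learned body; the ML10 results above suggest this is not enough for complex tasks.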

Regarding Meta­World: This recently published meta­reinforcement learning
environment was developed with the purpose of introducing a more challenging
framework for meta­learning algorithms. It is an open source project based on the
proprietary framework of MuJoCo that during the period of this thesis went through
many updates and bug fixes. One issue that came up during evaluation was the
inconsistency between high rewards and low success rates. Due to the hand­crafted
success metrics, agents could accumulate high rewards while almost solving a task
but still receive a low success score, indicating failure to adapt. This can lead to
contradictory or confusing results, requiring further analysis of the agents’ actual
performance and possibly rendering the animations of the environment to observe
the policy in action. It would be advisable for any future experiments on Meta­World
to report both the success rate and the accumulated rewards, to provide a clearer
picture of the model’s performance.
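Such joint reporting amounts to a small aggregation step; a hypothetical helper (our own sketch, not part of any benchmark API) could look like:

```python
import numpy as np

def summarise_episodes(rewards, successes):
    """Report accumulated reward and success rate side by side,
    so that neither metric is interpreted in isolation."""
    rewards = np.asarray(rewards, dtype=float)
    successes = np.asarray(successes, dtype=float)
    return {
        "mean_reward": float(rewards.mean()),
        "std_reward": float(rewards.std()),
        "success_rate": float(successes.mean()),
    }

# High mean reward with a 0.0 success rate flags a near-miss policy:
print(summarise_episodes([650.0, 700.0, 20.0], [0, 0, 0]))
```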

5.2 Limitations
Training neural networks can be a computationally expensive process that can last
weeks or months. Taking into consideration the scope of this degree project and the
limited resources at hand, only the models that were deemed to be the most relevant
and most essential to the research questions were developed. Even so, more than 200
models were trained in this project, across 2 vision datasets and 3 RL problems, in a
time span of 4 months, with more than 11,000 hours spent on training models. This is
because, in order to validate the integrity of the results, a hyper­parameter
search was needed for each problem case and every model was trained with 3 different
seeds.

Even so, the performed hyper­parameter searches were relatively limited, given that
the search was performed on 2 or 3 parameters in each case, when in some settings
(e.g. ANIL­TRPO on ML10) there were more than 15 hyper­parameters that could be
tuned. Moreover, it is reasonable to assume that 3 seeds are not enough to provide a
high level of confidence in the results of such complex algorithms with a high number
of hyper­parameters [12]. Thus, in order to further verify the findings of this thesis,
a larger­scale statistical analysis of the hyper­parameter space is suggested to
support our results, especially for the meta­RL tasks, which are known to be sources
of high instability and irreproducibility due to their numerous tunable
hyper­parameters and increased computational cost.

1 Comment from a reviewer of ICLR 2020 on ANIL, criticising the limitations of meta­learning benchmarks: link

5.3 Future Work


Given the rate at which deep learning is advancing, such a research project can have
many possible extensions: from simple, more thorough hyper­parameter searches and
ablation studies, to algorithmic modifications or additions, to investigations of even
more state­of­the­art methods on new datasets and environments. A collection of
interesting ideas is listed below, in order of significance for further supporting the
objective of the project:

1. Thorough hyper­parameter search: As previously discussed, a more
extensive search for the optimal hyper­parameter values would be necessary
to provide results with high statistical confidence. Such a task would
require powerful workstations in order to run grid­search or random­search
investigations.

2. Meta­Train on ML45: The next obvious step after MAML’s success on the
ML10 benchmark of Meta­World would be to evaluate its performance on the
final and most challenging ML45 benchmark.

3. Investigate task similarity as a factor of success during meta­testing:
One factor that could be of significant value to study is how dependent the
success of meta­learning methods is on the similarity between tasks.
For example, such insights could be obtained by using heat maps of the
representation of the network and highlighting which areas are commonly shared
when completing different tasks.

4. Comparison of other meta­learning algorithms: Another reasonable
extension would be to investigate, in the same way, more meta­learning methods
that have shown promising results, such as RL2 [16], PEARL [63] or
Meta­Q­Learning [21].

5. Investigate the performance of MAML in even more challenging
environments: Another benchmark environment that could be interesting to
evaluate MAML on would be Procgen. During this project a few experiments
were performed on Procgen (as seen in Appendix B); however, due to limited
computational resources, the agents could not be fully trained and the results
were inconclusive.

5.4 Ethical concerns

Given that this thesis is focused mainly on academic research with no explicit
application, one could at first think that the ethical aspect of such a project is
minuscule. Besides, the field of meta­learning is still quite young and real­life
applications, though potentially vast, are currently almost non­existent. If one were to consider that the
ethical challenges of scientific contributions depend on their practical applications
and that scientific methods are just tools which can be used either benevolently
or malevolently, they could completely rid themselves of any responsibility for this
project. Unfortunately, such beliefs are not uncommon and one of the most crucial
components of scientific contributions is often, intentionally or not, easily dismissed:
the awareness of the social responsibility of research & development.

Brunner and Ascher argue that ”science in the aggregate has not lived up to its promise
to work for the benefit of society as a whole.[... For that reason it] is appropriate to hold
science responsible for the public expectations that science creates and depends upon
it” [10]. Opponents of socially responsible science would argue that the development
of science is predetermined and is a force that cannot be influenced by the choices
of its community or by individual scientists. In such a setting, scientists are merely
the instruments of a greater body that moves in a single direction, and they can only
control its rate of acceleration through their skills and work.

However, there is plenty of evidence to suggest otherwise. Collective decisions and
actions do matter, both in cases of benign and malicious intent. The General Data


Protection Regulation (GDPR)2 is a recent example of how governments, along
with the scientific community, came together to defend society’s interest against
privacy­breaching technology. If we were to accept this example as part of an inevitable
progress, would we also accept as ”inevitable” and ”expected” all the other times
humans have used science in evil and horrific ways? Where would we draw the line
between inevitable actions and responsible decisions?

Thus, it is our belief that it would be not only naive but also dangerous to assume that
science is a disconnected, nonpartisan body, self­contained from the rest of society.
Such an equation would hand a get­out­of­jail­free card to whatever can be identified
as science, unburdening individual decisions and actions of any accountability. It
becomes apparent that the impact science has on society is rarely ever neutral [66].

To return to the context of this particular thesis, the responsibility lies in honest,
reproducible and accountable research. It is commonly accepted that science
has been facing a reproducibility crisis over the last decade, with more than 70% of
researchers failing to reproduce another scientist’s experiments [4]. This is an
even more apparent problem in the machine learning community, where research code
is scarcely open­sourced due to private agreements, profit or unaccountability
[40]. To avoid contributing to the problem, the approach of this thesis was based
on thorough background research, open­source code (wherever possible) and honest
result reporting, clearly stating any assumptions made and limiting non­confident
conclusions. The code used for this project, along with the trained models and a
technical guide to reproduce the results, is open­sourced and published online at
GitHub3.

5.5 Sustainability
As with most DL problems, training such models requires considerable amounts of
computational power. This means personal computers, remote workstations or even
large data centres operating at high capacity for days, weeks or even months. These
data centres can quickly amount to significant electrical power consumption (1% of
global electricity use), which further increases the need for more electrical power
2 Link to the history of GDPR here.
3 Link to the code repository: [Link]


generation [52]. It has been shown that such increasing demand for power generation
has direct negative implications for the environment, since most energy sources
are still not environmentally friendly [36].

For the experiments of this project, a personal computer and a remote server
workstation in a data centre (ICE North of RISE SICS) were used over a span of 6
months. For some experiments, the remote workstation operated continuously for
three months. An approximate estimate of the power consumption footprint of this
thesis project can be made. In equation 5.1, E_kWh denotes the energy consumption
in kilowatt­hours, P_W the power consumption of the device in watts, and t_hr the
time the device spent operating, in hours4.

E_kWh = (P_W × t_hr) / 1000    (5.1)

The personal computer can be assumed to have been running 8 hours a day for 6
months at 500 watts, which amounts to 500 W × 8 hr/day × 180 days / 1000 = 720 kWh.
The remote workstation can be assumed to have been running 24/7 for 3 months at
2500 watts, which amounts to 2500 W × 24 hr/day × 90 days / 1000 = 5400 kWh. Thus
the total can be estimated at around 6120 kWh over 6 months. To put this number
into perspective, according to [Link]5, the average electricity consumption
per capita in Sweden is 13,000 kWh per year. The electrical energy consumption
footprint of this degree project was therefore roughly equivalent to 47% of an average
Swedish citizen’s consumption for a year.
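The estimate can be reproduced directly from equation 5.1 (with the power ratings and operating times assumed above):

```python
def energy_kwh(power_w, hours):
    # Eq. 5.1: E_kWh = P_W * t_hr / 1000
    return power_w * hours / 1000.0

personal = energy_kwh(500, 8 * 180)       # 500 W, 8 h/day for ~6 months
workstation = energy_kwh(2500, 24 * 90)   # 2500 W, 24/7 for ~3 months
print(personal, workstation, personal + workstation)  # 720.0 5400.0 6120.0
```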

On the other hand, the development of meta­learning methods could potentially have
a significant decrease in the power consumption needed to train ML models compared
to traditional methods. This would be due to the fact that meta­trained models do not
require re­training from scratch in light of new tasks and can thus reduce the number of
many models needed for each task to fewer models where each one can handle multiple
tasks. By developing data­efficient models with high generalisation capabilities the
computational cost of training ML models could be reduced.

4 Equations to calculate electric energy: [Link]
5 Sweden’s energy consumption data: [Link]

Chapter 6

Conclusions

The objective of this degree project was to supplement the latest research on MAML
and ANIL with novel observations and insights regarding their performance on
the recently published meta­reinforcement learning benchmark Meta­World. To
summarise, this thesis tried to answer the following questions:

Question 1 What does MAML actually learn: is it a high­quality feature representation of
the training data, or does it learn to rapidly adapt to new tasks from the same
distribution as the training tasks?

Answer: Before and after adaptation of MAML on ML10 there seem to be minimal
changes in the representation space of the neural network, and minimal changes
in performance as well (C.9). This indicates that MAML mostly re­uses features
learnt during meta­training and does not actually acquire new knowledge during
meta­testing.

Question 2 How well does MAML adapt to new tasks compared to standard RL approaches?

Answer: Our results show a trend of better performance from MAML (higher rewards and
success rate) than PPO and ANIL on the train tasks of the ML10 benchmark.
During meta­testing, however, this trend is not as clear, making it difficult to
draw any conclusions.

Question 3 Does ANIL perform similarly to MAML, even in more complex environments?

Answer: ANIL shows comparable accuracy to MAML on the few­shot image
classification benchmarks. On the more challenging and complex benchmark
of Meta­World ML10, it fails to show comparable performance during
meta­training. During meta­testing, however, it is not clear whether there is a
significant difference between the algorithms. These findings require further
supporting evidence from more extensive training and evaluation of the models
in order to carry out statistical significance tests.

Question 4 Is there a significant computational cost difference between MAML and ANIL?

Answer: The computational cost difference between MAML and ANIL is indeed significant on
the vision datasets (ANIL required almost 50% less computational resources during
training) but not on the RL benchmarks. This is probably due to the high cost of
sampling and interacting with the environment, which is much more substantial than
the difference between back­propagating through a whole network and through one
layer of the network.

Bibliography

[1] Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew


W., Pfau, David, Schaul, Tom, Shillingford, Brendan, and Freitas, Nando
de. “Learning to learn by gradient descent by gradient descent”. en. In:
arXiv:1606.04474 [cs] (Nov. 2016). arXiv: 1606.04474.

[2] Antoniou, Antreas, Edwards, Harrison, and Storkey, Amos. “How to train your
MAML”. In: arXiv:1810.09502 [cs, stat] (Mar. 2019). arXiv: 1810.09502.

[3] Arnold, Sébastien M. R., Mahajan, Praateek, Datta, Debajyoti, Bunner, Ian,
and Zarkias, Konstantinos Saitas. “learn2learn: A Library for Meta­Learning
Research”. In: arXiv:2008.12284 [cs, stat] (Aug. 2020). arXiv: 2008.12284.

[4] Baker, Monya. “1,500 scientists lift the lid on reproducibility”. en. In: Nature
News 533.7604 (May 2016). Section: News Feature, p. 452. DOI: 10 . 1038 /
533452a.

[5] Baydin, Atilim Gunes, Cornish, Robert, Rubio, David Martinez, Schmidt,
Mark, and Wood, Frank. “Online learning rate adaptation with hypergradient
descent”. In: arXiv preprint arXiv:1703.04782 (2017).

[6] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. “The Arcade Learning
Environment: An Evaluation Platform for General Agents”. en. In: Journal of
Artificial Intelligence Research 47 (June 2013), pp. 253–279. ISSN: 1076­9757.
DOI: 10.1613/jair.3912.

[7] Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. “Neural optimizer
search with reinforcement learning”. In: arXiv preprint arXiv:1709.07417
(2017).


[8] Biggs, J. B. “The Role of Metalearning in Study Processes”. en. In: British
Journal of Educational Psychology 55.3 (1985), pp. 185–212. ISSN: 2044­8279.
DOI: 10.1111/j.2044-8279.1985.tb02625.x.

[9] Botvinick, Matthew, Ritter, Sam, Wang, Jane X., Kurth­Nelson, Zeb, Blundell,
Charles, and Hassabis, Demis. “Reinforcement Learning, Fast and Slow”. en. In:
Trends in Cognitive Sciences 23.5 (May 2019), pp. 408–422. ISSN: 13646613.
DOI: 10.1016/[Link].2019.02.006.

[10] Brunner, Ronald D and Ascher, William. “Science and social responsibility”. en.
In: (), p. 37.

[11] Cobbe, Karl, Hesse, Christopher, Hilton, Jacob, and Schulman, John.
“Leveraging Procedural Generation to Benchmark Reinforcement Learning”.
In: arXiv:1912.01588 [cs, stat] (Dec. 2019). arXiv: 1912.01588.

[12] Colas, Cédric, Sigaud, Olivier, and Oudeyer, Pierre­Yves. “How Many Random
Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments”.
In: arXiv:1806.08295 [cs, stat] (July 2018). arXiv: 1806.08295.

[13] Deisenroth, Marc Peter, Rasmussen, Carl Edward, and Fox, Dieter. “Learning
to control a low­cost manipulator using data­efficient reinforcement learning”.
In: Robotics: Science and Systems VII (2011), pp. 57–64.

[14] Doya, Kenji. “Metalearning and neuromodulation”. en. In: Neural Networks
15.4 (June 2002), pp. 495–506. ISSN: 0893­6080. DOI: 10 . 1016 / S0893 -
6080(02)00044-8.

[15] Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter.
“Benchmarking Deep Reinforcement Learning for Continuous Control”. en. In:
arXiv:1604.06778 [cs] (May 2016). arXiv: 1604.06778.

[16] Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and
Abbeel, Pieter. “RL$^2$: Fast Reinforcement Learning via Slow Reinforcement
Learning”. In: arXiv:1611.02779 [cs, stat] (Nov. 2016). arXiv: 1611.02779.

[17] Dulac­Arnold, Gabriel, Mankowitz, Daniel, and Hester, Todd. “Challenges of


Real­World Reinforcement Learning”. en. In: arXiv:1904.12901 [cs, stat] (Apr.
2019). arXiv: 1904.12901.


[18] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih,
Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain,
Legg, Shane, and Kavukcuoglu, Koray. “IMPALA: Scalable Distributed Deep­RL
with Importance Weighted Actor­Learner Architectures”. In: arXiv:1802.01561
[cs] (June 2018). arXiv: 1802.01561.

[19] Esteva, Andre, Kuprel, Brett, Novoa, Roberto A., Ko, Justin, Swetter, Susan
M., Blau, Helen M., and Thrun, Sebastian. “Dermatologist­level classification
of skin cancer with deep neural networks”. en. In: Nature 542.7639 (Feb. 2017).
Number: 7639 Publisher: Nature Publishing Group, pp. 115–118. ISSN: 1476­
4687. DOI: 10.1038/nature21056.

[20] Eysenbach, Benjamin, Gupta, Abhishek, Ibarz, Julian, and Levine, Sergey.
“Diversity is All You Need: Learning Skills without a Reward Function”. In:
arXiv:1802.06070 [cs] (Oct. 2018). arXiv: 1802.06070.

[21] Fakoor, Rasool, Chaudhari, Pratik, Soatto, Stefano, and Smola, Alexander
J. “Meta­Q­Learning”. In: arXiv:1910.00125 [cs, stat] (Apr. 2020). arXiv:
1910.00125.

[22] Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. “Model­Agnostic Meta­
Learning for Fast Adaptation of Deep Networks”. en. In: arXiv:1703.03400 [cs]
(July 2017). arXiv: 1703.03400.

[23] Finn, Chelsea and Levine, Sergey. “Meta­Learning and Universality: Deep
Representations and Gradient Descent can Approximate any Learning Algorithm”.
In: arXiv:1710.11622 [cs] (Feb. 2018). arXiv: 1710.11622.

[24] Finn, Chelsea, Rajeswaran, Aravind, Kakade, Sham, and Levine, Sergey.
“Online Meta­Learning”. en. In: arXiv:1902.08438 [cs, stat] (July 2019). arXiv:
1902.08438.

[25] Finn, Chelsea, Xu, Kelvin, and Levine, Sergey. “Probabilistic Model­Agnostic
Meta­Learning”. In: Advances in Neural Information Processing Systems 31.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa­Bianchi, and
R. Garnett. Curran Associates, Inc., 2018, pp. 9516–9527.

[26] Finn, Chelsea, Yu, Tianhe, Zhang, Tianhao, Abbeel, Pieter,


and Levine, Sergey. “One­Shot Visual Imitation Learning via Meta­Learning”.
In: arXiv:1709.04905 [cs] (Sept. 2017). arXiv: 1709.04905.


[27] “Forecasting short­term data center network traffic load with convolutional
neural networks”. en. In: PLOS ONE 13.2 (Feb. 2018). Publisher: Public Library
of Science, e0191939. ISSN: 1932­6203. DOI: 10.1371/[Link].0191939.

[28] Franceschi, Luca, Frasconi, Paolo, Salzo, Saverio, Grazzi, Riccardo, and
Pontil, Massimilano. “Bilevel Programming for Hyperparameter Optimization
and Meta­Learning”. In: arXiv:1806.04910 [cs, stat] (July 2018). arXiv:
1806.04910.

[29] Fujimoto, Scott, Hoof, Herke van, and Meger, David. “Addressing Function
Approximation Error in Actor­Critic Methods”. en. In: arXiv:1802.09477 [cs,
stat] (Oct. 2018). arXiv: 1802.09477.

[30] Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. The
MIT Press, 2016. ISBN: 978­0­262­03561­3.

[31] Goodfellow, Ian J., Mirza, Mehdi, Xiao, Da, Courville, Aaron, and Bengio,
Yoshua. “An Empirical Investigation of Catastrophic Forgetting in Gradient­
Based Neural Networks”. en. In: arXiv:1312.6211 [cs, stat] (Mar. 2015). arXiv:
1312.6211.

[32] Grant, Erin, Finn, Chelsea, Levine, Sergey, Darrell, Trevor, and Griffiths,
Thomas. “Recasting gradient­based meta­learning as hierarchical bayes”. In:
arXiv preprint arXiv:1801.08930 (2018).

[33] Gupta, Abhishek, Mendonca, Russell, Liu, YuXuan, Abbeel, Pieter, and Levine,
Sergey. “Meta­Reinforcement Learning of Structured Exploration Strategies”.
en. In: (), p. 10.

[34] Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey. “Soft
Actor­Critic: Off­Policy Maximum Entropy Deep Reinforcement Learning with
a Stochastic Actor”. en. In: arXiv:1801.01290 [cs, stat] (Aug. 2018). arXiv:
1801.01290.

[35] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, and Norouzi, Mohammad.
“Dream to Control: Learning Behaviors by Latent Imagination”. In:
arXiv:1912.01603 [cs] (Mar. 2020). arXiv: 1912.01603.


[36] Hammons, T. J. “Impact of electric power generation on green house gas


emissions in Europe: Russia, Greece, Italy and views of the EU power plant
supply industry – A critical analysis”. en. In: International Journal of Electrical
Power & Energy Systems 28.8 (Oct. 2006), pp. 548–564. ISSN: 0142­0615.
DOI: 10.1016/[Link].2006.04.001.

[37] Hardoon, David R., Szedmak, Sandor, and Shawe­Taylor, John. “Canonical
correlation analysis: an overview with application to learning methods”. eng.
In: Neural Computation 16.12 (Dec. 2004), pp. 2639–2664. ISSN: 0899­7667.
DOI: 10.1162/0899766042321814.

[38] Hospedales, Timothy, Antoniou, Antreas, Micaelli, Paul, and Storkey, Amos.
“Meta­Learning in Neural Networks: A Survey”. en. In: arXiv:2004.05439 [cs,
stat] (Apr. 2020). arXiv: 2004.05439.

[39] Houthooft, Rein, Chen, Yuhua, Isola, Phillip, Stadie, Bradly, Wolski, Filip, Ho,
OpenAI Jonathan, and Abbeel, Pieter. “Evolved policy gradients”. In: 2018,
pp. 5400–5409.

[40] Hutson, Matthew. “Artificial intelligence faces reproducibility crisis”. en.


In: Science 359.6377 (Feb. 2018). Publisher: American Association for the
Advancement of Science Section: In Depth, pp. 725–726. ISSN: 0036­8075,
1095­9203. DOI: 10.1126/science.359.6377.725.

[41] Hutter, Frank, Kotthoff, Lars, and Vanschoren, Joaquin. Automated Machine
Learning: Methods, Systems, Challenges. English. Springer Nature, 2019.
DOI: 10.1007/978-3-030-05318-5. URL: [Link]/handle/20.500.12657/23012
(visited on 06/24/2020).

[42] Janner, Michael, Fu, Justin, Zhang, Marvin, and Levine, Sergey. “When to
Trust Your Model: Model­Based Policy Optimization”. In: arXiv:1906.08253
[cs, stat] (Nov. 2019). arXiv: 1906.08253. (Visited on 07/03/2020).

[43] Juliani, Arthur, Khalifa, Ahmed, Berges, Vincent­Pierre, Harper, Jonathan,


Teng, Ervin, Henry, Hunter, Crespi, Adam, Togelius, Julian, and Lange, Danny.
“Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning”.
en. In: arXiv:1902.01378 [cs] (July 2019). arXiv: 1902.01378.

[44] Kingma, Diederik P. and Ba, Jimmy. “Adam: A Method for Stochastic
Optimization”. en. In: arXiv:1412.6980 [cs] (Jan. 2017). arXiv: 1412.6980.


[45] Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. “Human­
level concept learning through probabilistic program induction”. en. In: Science
350.6266 (Dec. 2015). Publisher: American Association for the Advancement
of Science Section: Research Article, pp. 1332–1338. ISSN: 0036­8075, 1095­
9203. DOI: 10.1126/science.aab3050.

[46] learnables/learn2learn. original­date: 2019­08­08T[Link]Z. July 2020.


URL: [Link] (visited on 07/13/2020).

[47] Lee, Yoonho and Choi, Seungjin. “Gradient­Based Meta­Learning with Learned
Layerwise Metric and Subspace”. In: arXiv:1801.05558 [cs, stat] (June 2018).
arXiv: 1801.05558.

[48] Li, Zhenguo, Zhou, Fengwei, Chen, Fei, and Li, Hang. “Meta­SGD: Learning
to Learn Quickly for Few­Shot Learning”. en. In: arXiv:1707.09835 [cs] (Sept.
2017). arXiv: 1707.09835. (Visited on 05/26/2020).

[49] Lin, Henry W., Tegmark, Max, and Rolnick, David. “Why does deep and cheap
learning work so well?” en. In: Journal of Statistical Physics 168.6 (Sept. 2017).
arXiv: 1608.08225, pp. 1223–1247. ISSN: 0022­4715, 1572­9613. DOI: 10 .
1007/s10955-017-1836-5.

[50] Mania, Horia, Guy, Aurelia, and Recht, Benjamin. “Simple random search
provides a competitive approach to reinforcement learning”. In:
arXiv:1803.07055 [cs, math, stat] (Mar. 2018). arXiv: 1803.07055.

[51] Marsland, Stephen. Machine learning: an algorithmic perspective. CRC press,


2015. ISBN: 1­4987­5978­5.

[52] Masanet, Eric, Shehabi, Arman, Lei, Nuoa, Smith, Sarah, and Koomey,
Jonathan. “Recalibrating global data center energy­use estimates”. en. In:
Science 367.6481 (Feb. 2020). Publisher: American Association for the
Advancement of Science Section: Policy Forum, pp. 984–986. ISSN: 0036­
8075, 1095­9203. DOI: 10.1126/science.aba3758.

[53] Meta­Learning: Learning to Learn Fast. en. Library Catalog:


[Link]. Nov. 2018. URL: [Link]
log/2018/11/30/[Link] (visited on 06/17/2020).

[54] Mishra, Nikhil, Rohaninejad, Mostafa, Chen, Xi, and Abbeel, Pieter. “A simple
neural attentive meta­learner”. In: arXiv preprint arXiv:1707.03141 (2017).


[55] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness,
Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas
K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou,
Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane,
and Hassabis, Demis. “Human­level control through deep reinforcement
learning”. en. In: Nature 518.7540 (Feb. 2015). Number: 7540 Publisher:
Nature Publishing Group, pp. 529–533. ISSN: 1476­4687. DOI: 10 . 1038 /
nature14236.

[56] Nagabandi, Anusha, Clavera, Ignasi, Liu, Simin, Fearing, Ronald S., Abbeel,
Pieter, Levine, Sergey, and Finn, Chelsea. “Learning to Adapt in Dynamic,
Real­World Environments Through Meta­Reinforcement Learning”. In:
arXiv:1803.11347 [cs, stat] (Feb. 2019). arXiv: 1803.11347.

[57] Nichol, Alex, Achiam, Joshua, and Schulman, John. “On First­Order
Meta­Learning Algorithms”. In: arXiv:1803.02999 [cs] (Oct. 2018). arXiv:
1803.02999.

[58] OpenAI, Akkaya, Ilge, Andrychowicz, Marcin, Chociej, Maciek, Litwin, Mateusz,
McGrew, Bob, Petron, Arthur, Paino, Alex, Plappert, Matthias, Powell, Glenn,
Ribas, Raphael, Schneider, Jonas, Tezak, Nikolas, Tworek, Jerry, Welinder,
Peter, Weng, Lilian, Yuan, Qiming, Zaremba, Wojciech, and Zhang, Lei. “Solving
Rubik’s Cube with a Robot Hand”. In: arXiv:1910.07113 [cs, stat] (Oct. 2019).
arXiv: 1910.07113.

[59] Pan, Xinlei, You, Yurong, Wang, Ziyan, and Lu, Cewu. “Virtual to Real
Reinforcement Learning for Autonomous Driving”. In: arXiv:1704.03952 [cs]
(Sept. 2017). arXiv: 1704.03952.

[60] Perkins, David and Salomon, Gavriel. “Transfer Of Learning”. In: 11 (July 1999).

[61] Rabinowitz, Neil C. “Meta­learners’ learning dynamics are unlike learners’”. In:
arXiv:1905.01320 [cs, stat] (May 2019). arXiv: 1905.01320. URL: http : / /
[Link]/abs/1905.01320 (visited on 05/28/2020).

[62] Raghu, Aniruddh, Raghu, Maithra, Bengio, Samy, and Vinyals, Oriol. “Rapid
Learning or Feature Reuse? Towards Understanding the Effectiveness of
MAML”. In: arXiv:1909.09157 [cs, stat] (Feb. 2020). arXiv: 1909.09157
version: 2.

[63] Rakelly, Kate, Zhou, Aurick, Quillen, Deirdre, Finn, Chelsea, and Levine, Sergey.
“Efficient Off­Policy Meta­Reinforcement Learning via Probabilistic Context
Variables”. In: arXiv:1903.08254 [cs, stat] (Mar. 2019). arXiv: 1903.08254.

[64] Rakhlin, Alexander, Shvets, Alexey, Iglovikov, Vladimir, and Kalinin, Alexandr
A. “Deep Convolutional Neural Networks for Breast Cancer Histology Image
Analysis”. In: Image Analysis and Recognition. Ed. by Aurélio Campilho,
Fakhri Karray, and Bart ter Haar Romeny. Cham: Springer International
Publishing, 2018, pp. 737–744. ISBN: 978­3­319­93000­8.

[65] Ravi, Sachin and Larochelle, Hugo. “Optimization as a Model for Few­Shot
Learning”. In: (Nov. 2016).

[66] Rose, Steven and Rose, Hilary. “Can Science Be Neutral?” en. In: Perspectives in Biology and Medicine 16.4 (1973), pp. 605–624. ISSN: 1529-8795. DOI: 10.1353/pbm.1973.0035. URL: http://muse.jhu.edu/content/crossref/journals/perspectives_in_biology_and_medicine/v016/[Link] (visited on 06/11/2020).

[67] Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” In: Psychological review 65.6 (1958). Publisher: American Psychological Association, p. 386. ISSN: 1939-1471.

[68] Rumelhart, David E., Durbin, Richard, Golden, Richard, and Chauvin,
Yves. “Backpropagation: The basic theory”. In: Backpropagation: Theory,
architectures and applications (1995), pp. 1–34.

[69] Rusu, Andrei A., Rao, Dushyant, Sygnowski, Jakub, Vinyals, Oriol, Pascanu,
Razvan, Osindero, Simon, and Hadsell, Raia. “Meta­Learning with Latent
Embedding Optimization”. In: arXiv:1807.05960 [cs, stat] (Mar. 2019). arXiv:
1807.05960.

[70] Schmidhuber, Jürgen. “Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook”. In: (1987). Publisher: Technische Universität München.

[71] Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, Lillicrap, Timothy, and Silver, David. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”. In: arXiv:1911.08265 [cs, stat] (Feb. 2020). arXiv: 1911.08265.

[72] Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and
Abbeel, Pieter. “Trust Region Policy Optimization”. en. In: arXiv:1502.05477
[cs] (Apr. 2017). arXiv: 1502.05477.

[73] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel,
Pieter. “High­Dimensional Continuous Control Using Generalized Advantage
Estimation”. In: arXiv:1506.02438 [cs] (Oct. 2018). arXiv: 1506.02438.

[74] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov,
Oleg. “Proximal Policy Optimization Algorithms”. In: arXiv:1707.06347 [cs]
(Aug. 2017). arXiv: 1707.06347.

[75] Al­Shedivat, Maruan, Bansal, Trapit, Burda, Yuri, Sutskever, Ilya, Mordatch,
Igor, and Abbeel, Pieter. “Continuous Adaptation via Meta­Learning in
Nonstationary and Competitive Environments”. In: arXiv:1710.03641 [cs] (Feb.
2018). arXiv: 1710.03641.

[76] Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai,
Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan,
Graepel, Thore, Lillicrap, Timothy, Simonyan, Karen, and Hassabis, Demis. “A
general reinforcement learning algorithm that masters chess, shogi, and Go
through self­play”. en. In: Science 362.6419 (Dec. 2018). Publisher: American
Association for the Advancement of Science Section: Report, pp. 1140–1144.
ISSN: 0036­8075, 1095­9203. DOI: 10.1126/science.aar6404.

[77] Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap, Timothy, Hui, Fan, Sifre, Laurent, Driessche, George van den, Graepel, Thore, and Hassabis, Demis. “Mastering the game of Go without human knowledge”. en. In: Nature 550.7676 (Oct. 2017). Number: 7676 Publisher: Nature Publishing Group, pp. 354–359. ISSN: 1476-4687. DOI: 10.1038/nature24270.

[78] Stadie, Bradly, Yang, Ge, Houthooft, Rein, Chen, Peter, Duan, Yan, Wu, Yuhuai, Abbeel, Pieter, and Sutskever, Ilya. “The Importance of Sampling in Meta-Reinforcement Learning”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 9280–9290.

[79] Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. Vol. 135. MIT Press Cambridge, 1998.

[80] Sutton, Richard S and Barto, Andrew G. “Reinforcement Learning: An Introduction”. en. In: (), p. 352.

[81] Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. en. In: (), p. 7.

[82] Sutton, Richard S. “Learning to predict by the methods of temporal differences”. en. In: Machine Learning 3.1 (Aug. 1988), pp. 9–44. ISSN: 1573-0565. DOI: 10.1007/BF00115009.

[83] Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba, Wojciech,
and Abbeel, Pieter. “Domain randomization for transferring deep neural
networks from simulation to the real world”. In: 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS). ISSN: 2153­0866. Sept.
2017, pp. 23–30. DOI: 10.1109/IROS.2017.8202133.

[84] Triantafillou, Eleni, Zhu, Tyler, Dumoulin, Vincent, Lamblin, Pascal, Evci,
Utku, Xu, Kelvin, Goroshin, Ross, Gelada, Carles, Swersky, Kevin, Manzagol,
Pierre­Antoine, and Larochelle, Hugo. “Meta­Dataset: A Dataset of Datasets for
Learning to Learn from Few Examples”. In: arXiv:1903.03096 [cs, stat] (Apr.
2020). arXiv: 1903.03096.

[85] Venkitaraman, Arun and Wahlberg, Bo. “Task-similarity Aware Meta-learning through Nonparametric Kernel Regression”. en. In: arXiv:2006.07212 [cs, stat] (June 2020). arXiv: 2006.07212.

[86] Vinyals, Oriol, Blundell, Charles, and Lillicrap, Timothy. “Matching Networks
for One Shot Learning”. en. In: (), p. 9.

[87] Wang, Jane X., Kurth­Nelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo,
Joel Z., Munos, Remi, Blundell, Charles, Kumaran, Dharshan, and Botvinick,
Matt. “Learning to reinforcement learn”. en. In: arXiv:1611.05763 [cs, stat]
(Jan. 2017). arXiv: 1611.05763.

[88] Weng, Lilian. “Meta Reinforcement Learning”. In: [Link]/lil-log (2019). URL: [Link]log/2019/06/23/meta-[Link].

[89] Yang, Yuxiang, Caluwaerts, Ken, Iscen, Atil, Tan, Jie, and Finn, Chelsea.
“NoRML: No­Reward Meta Learning”. In: arXiv:1903.01063 [cs, stat] (Mar.
2019). arXiv: 1903.01063.

[90] Yu, Tianhe, Quillen, Deirdre, He, Zhanpeng, Julian, Ryan, Hausman, Karol,
Finn, Chelsea, and Levine, Sergey. “Meta­World: A Benchmark and Evaluation
for Multi­Task and Meta Reinforcement Learning”. In: arXiv:1910.10897 [cs,
stat] (Oct. 2019). arXiv: 1910.10897.

[91] Zeiler, Matthew D. “ADADELTA: An Adaptive Learning Rate Method”. In: arXiv:1212.5701 [cs] (Dec. 2012). arXiv: 1212.5701.

[92] Zhang, Chiyuan, Vinyals, Oriol, Munos, Remi, and Bengio, Samy. “A Study on
Overfitting in Deep Reinforcement Learning”. In: arXiv:1804.06893 [cs, stat]
(Apr. 2018). arXiv: 1804.06893.

[93] Zintgraf, Luisa M., Shiarlis, Kyriacos, Kurin, Vitaly, Hofmann, Katja, and Whiteson, Shimon. “Fast Context Adaptation via Meta-Learning”. In: arXiv:1810.03642 [cs, stat] (June 2019). arXiv: 1810.03642.

Appendix ­ Contents

A Technical details

B Additional environment: Procgen

C Additional hyper-parameter searches and results
C.1 Hyper-parameter search for Few-Shot Image Classification
C.2 Commonly shared hyper-parameter values for RL
C.3 Instabilities of MAML-PPO
C.4 Hyper-parameter search on Particles2D
C.5 Hyper-parameter search on ML1: Push
C.6 Hyper-parameter search on ML10
C.7 Vanilla PPO & TRPO on ML10
C.8 Performance on ML10 per task
C.9 Performance on ML10 test tasks before and after adaptation
C.10 Meta-World Reward functions & success metrics

Appendix A

Technical details

This project was developed in Python 3.7 using PyTorch and is based on the meta-learning library learn2learn [46]. For further technical details regarding the code base, visit the open-source repository of this project on GitHub¹.

The experiments were conducted on a personal computer with an i7 CPU, 24GB RAM and an RTX 2060 Super 8GB GPU, and on a remote workstation, a Dell PowerEdge R730 (24 cores, 256GB RAM, 1 x GTX 1080 Ti) at ICE NORTH SICS, over a 6-month period.

¹ Link: [Link]

Appendix B

Additional environment: Procgen

During the first stages of the thesis, while related work was being researched and an experimental evaluation framework was being set up, another reinforcement learning framework was considered instead of Meta-World: OpenAI's newest RL environment, Procgen, which was developed as a challenging benchmark to evaluate the sample efficiency and generalisation of RL algorithms [11].

Popular video-game benchmarks such as the Arcade Learning Environment (ALE) have met great success in testing RL agents' performance [6], [55]. However, ALE and similar environments have been shown to be prone to overfitting and are not suitable for testing the generalisation capabilities of RL agents [43], [92], which is a vital component of this project.

Procgen tries to combat this issue by leveraging procedural generation to create an almost infinite number of randomised levels in 16 games. This means that each run of an agent playing a game is never the same (as it usually is in ALE), since the level layout, the locations of the player and enemies, and other game-specific details are modified. Additionally, it can be easily integrated into an existing RL code base since it follows the OpenAI Gym framework. In [11], they also state: ”It is also experimentally convenient: training for 200 million timesteps with PPO on a single Procgen environment requires approximately 24 GPU-hrs and 60 CPU-hrs. We consider this a reasonable and practical computational cost. To further reduce training time at the cost of experimental complexity, environments can be set to the easy difficulty. We recommend training in easy difficulty environments for 25M timesteps, which requires approximately 3 GPU-hrs with our implementation of

PPO.” Their implementation of PPO is based on the scalable and distributed IMPALA algorithm [18].

Figure B.0.1: Samples of each of the 16 games of Procgen. Figure from [11].

For this project a semi-distributed PPO and a MAML-PPO agent were implemented. The term semi-distributed is used to distinguish them from the fully distributed IMPALA implementation. In our case, a sampler was implemented that could fetch different episodes from the same agent in parallel. However, this meant that the only distributed part is the forward pass (sampling); back-propagation still happens synchronously and needs to wait for every parallel agent to finish. In the case of IMPALA the whole process (forward and backward pass) is distributed and asynchronous¹. The networks were based on the A2C model [55], following an architecture similar to the standard convolutional neural network used in [55] (which [11] call Nature-CNN), as seen in figure B.0.2.

¹ For further explanation, see figure 2 in [18], where scenario (a) is our implementation and (c) is the implementation of [11].
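
The layer widths of such a Nature-CNN can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only (not the thesis code): it assumes the standard (32, 8, 4), (64, 4, 2), (64, 3, 1) convolutional specs of [55] with unpadded convolutions, and computes the size of the flattened feature vector that feeds the 512-unit fully connected layer, for both the canonical 84x84 Atari input and Procgen's 64x64 frames.

```python
# Spatial sizes produced by the three Nature-CNN convolutions [55].
def conv_out(size, kernel, stride):
    # Output spatial size of an unpadded ("valid") convolution.
    return (size - kernel) // stride + 1

def flat_features(size, layers=((32, 8, 4), (64, 4, 2), (64, 3, 1))):
    # layers: (channels, kernel, stride) per convolutional layer.
    channels = None
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return channels * size * size  # units entering the 512-unit FC layer

print(flat_features(84))  # 3136 (84 -> 20 -> 9 -> 7)
print(flat_features(64))  # 1024 (64 -> 15 -> 6 -> 4)
```

This also hints at why the network is considered small: nearly all of its capacity sits in the single FC layer after flattening.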

(a) PPO network architecture. (b) MAML­PPO network architecture.

Figure B.0.2: Screenshot of the model’s architecture including number of parameters

Unfortunately, even after much trial and error, Procgen proved incredibly difficult to train on. The PPO agent, trained for 5 million timesteps on 8 parallel environments in the easiest setting possible (one game with only one level on easy difficulty), required 14 CPU-hrs and 50GB RAM with no sign of progress (figure B.0.3). In comparison, [11] mention that for the agent to develop sufficient generalisation skills to adapt to new levels, it needs to be trained on at least 10.000 different levels.

(a) Validation reward progress. (b) Validation A2C loss progress.

Figure B.0.3: Performance of PPO agent in Starpilot for one level.

Another experiment was training a MAML-PPO agent in the same easy setting, sampling from only one environment and running on a GTX 1080 Ti GPU for 25M timesteps. After 57 hours of training, the results were still poor, as seen in figure B.0.4.

(a) Validation reward progress. (b) Validation A2C loss progress.

Figure B.0.4: Performance of MAML­PPO agent in Starpilot for one level.

These results are also aligned with Section 4 of [11], where they conclude that smaller architectures (like the Nature-CNN) can sometimes completely fail to train compared to larger and distributed implementations, which are more sample efficient and lead to agents with better generalisation capabilities.

In conclusion, even though Procgen seems to be a promising direction of research for RL, these experiments highlighted some technical difficulties when training a standard RL (PPO) or meta-RL (MAML-PPO) agent. Specifically, network size and a distributed algorithmic implementation appear to be crucial components for the success of an agent in this environment. The problem of network size could be addressed by adding more layers or neurons to increase the number of trainable parameters, though this would also increase the need for computational resources. Furthermore, a highly efficient distributed implementation of MAML for this environment, based on IMPALA or a similar architecture, is speculated to drastically improve the performance of the agent. However, such an implementation was deemed exceptionally complex and outside the scope of the thesis, and is left for future work.

Appendix C

Additional hyper-parameter searches and results

During the development of the models in section 4, many hyper-parameter values had to be tested in order to draw conclusions when comparing the performance of the different methods. Here, a series of searches that were part of the project but deemed of less significance for section 4 are presented. Moreover, more detailed results on the Meta-World tasks are included for further investigation and transparency purposes.

C.1 Hyper-parameter search for Few-Shot Image Classification
A basic hyper-parameter search was performed on the Mini-ImageNet dataset and, afterwards, a smaller-scale search over only the most influential hyper-parameters was performed on the Omniglot dataset. Since computational cost is a considerable limitation to the scale of the hyper-parameter search, some experiments were performed only on ANIL, which required fewer computational resources.

Number of iterations¹: Different datasets and hyper-parameter values will lead to slower or faster model convergence, so in every setting a few metrics are taken into consideration to draw a conclusion. An appropriate number of training epochs can be chosen by monitoring the training and validation

¹ Iterations and epochs are used interchangeably, unless explicitly stated otherwise.

loss and accuracy metrics during meta-training, and by meta-testing different model snapshots at various checkpoints of the training. For example, for the optimal hyper-parameter values on Omniglot, figures C.1.1, C.1.2 and C.1.3 indicate that there is no gain in performance after the 2.000 mark and the models have managed to converge, whereas for Mini-ImageNet the mark is a bit later, around 5.000.

Figure C.1.1: Comparing ANIL and MAML training metrics. Each line is the mean
value and the shaded area is the standard deviation across three seeds.

Figure C.1.2: Comparing ANIL and MAML validation metrics. Each line is the mean
value and the shaded area the standard deviation across three seeds.

Figure C.1.3: Meta-testing model snapshots of MAML and ANIL at different iteration checkpoints for Omniglot and Mini-ImageNet. Each line is the mean value and the shaded area the standard deviation across three seeds.

Since such conclusions can usually be derived from just one of these figures, in the rest of the section only the minimum number of figures necessary to support them is included. As a baseline, for both the Omniglot and Mini-ImageNet datasets, the models were trained for 10.000 epochs.

Number of adaptation steps: While fixing all the other hyper-parameters to standard default values, figure C.1.4 shows that changing the number of inner-loop iterations does not significantly affect the performance of the ANIL model on Mini-ImageNet. Specifically, ANIL was tested with 1, 3 and 5 adaptation steps during the inner loop, and all models led to the same performance.

Ways 5
Shots 1
Outer LR 0.001
Inner LR 0.1
Adapt Steps [1, 3, 5]
Meta Batch Size 32
Seed 1
Table C.1.1: HP values of C.1.4

Figure C.1.4: Comparison of three models with different numbers of inner loop updates. Smoothing factor: 0.8.

Learning Rates: A coarse search for the inner (α) and outer (β) learning rates was performed with ANIL on Mini-ImageNet. The inner learning rate controls how quickly the learner's weights adapt to new data, whereas the outer learning rate controls the rate at which the meta-initialisation weights are updated. For this reason, the inner learning rate (lr) needs a high value (rapid learning) and the outer lr a lower one (steady convergence to a general enough meta-initialisation).
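
To make the two roles concrete, the sketch below runs a first-order MAML-style loop (in the spirit of [57]) on a toy one-parameter regression family, with a high inner lr and a low outer lr. It is an illustrative sketch only, not the ANIL/MAML implementation used in this project; the task family and all constants are arbitrary.

```python
import random

# Toy task family: each task is a target a ~ U(-1, 1); the learner fits a
# single parameter theta with loss (theta - a)^2.
random.seed(0)
alpha, beta, adapt_steps = 0.1, 0.01, 3  # high inner lr, low outer lr
theta = 5.0                              # deliberately bad meta-initialisation

for _ in range(5000):
    a = random.uniform(-1.0, 1.0)            # sample a task
    fast = theta
    for _ in range(adapt_steps):             # inner loop: rapid adaptation
        fast -= alpha * 2.0 * (fast - a)     # gradient of (fast - a)^2
    # First-order meta-update: apply the post-adaptation gradient directly
    # to the initialisation (ignoring second derivatives, as in FOMAML).
    theta -= beta * 2.0 * (fast - a)

# theta drifts toward 0.0, the mean of the task distribution: a steady,
# slow outer update toward an initialisation from which the fast inner
# updates can reach any individual task.
```

Swapping the two rates breaks both properties: adaptation becomes too slow to fit a task in a few steps, while the initialisation chases each sampled task instead of converging.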

ANIL was trained with the configurations shown in table C.1.2 on the Mini-ImageNet dataset for 5-ways, 5-shots and 5-ways, 1-shot classification. For the Omniglot dataset, two different learning rate settings were tested with ANIL for the cases of 20-ways, 5-shots (figure C.1.6) and 20-ways, 1-shot (figure C.1.7).

#1 #2 #3
Ways 5
Shots 1
Outer LR 0.003 0.001 0.05
Inner LR 0.5 0.1 0.1
Adapt Steps 1
Meta Batch Size 32
Seeds [1,2,3]

Table C.1.2: HP values of C.1.5

Figure C.1.5: Comparison of three ANIL models on 5-ways Mini-ImageNet with different inner and outer learning rates. Each line is the mean value and the shaded area the standard deviation across three seeds. 20% accuracy is the same as random for a 5-way classification setting.

Figure C.1.6: Results of ANIL for 20­ways, 5­shots in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.

Figure C.1.7: Results of ANIL for 20­ways, 1­shot in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.

C.2 Commonly shared hyper-parameter values for RL

Due to the complexity of these methods and their wide range of hyper­parameters, not
all of them could be tested with different values. In table C.2.1, hyper­parameter values
that were left as­is in all of the experiments are presented.

For all RL policies


tau (τ ) 1.0
discount factor / gamma (γ) 0.99
horizon length H for ML10 tasks 150
For all TRPO policies
backtrack factor 0.5
ls max steps 15
max kl 0.01
For all PPO policies
ppo epochs 3
ppo clip ratio 0.1

Table C.2.1: Commonly shared hyper-parameter values across different environments / methods.
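
The shared γ and τ values plug into Generalized Advantage Estimation [73], which the TRPO/PPO policies rely on. The following is a minimal illustrative sketch, not the project's implementation:

```python
# Generalized Advantage Estimation (GAE) [73] with the shared defaults
# gamma = 0.99 and tau (the GAE lambda) = 1.0 from table C.2.1.
def gae(rewards, values, gamma=0.99, tau=1.0):
    """values has len(rewards) + 1 entries (bootstrap value last)."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        adv = delta + gamma * tau * adv
        advantages.append(adv)
    return advantages[::-1]

# With tau = 1.0, GAE reduces to discounted returns minus the value baseline.
print(gae([1.0, 1.0], [0.0, 0.0, 0.0]))  # [1.99, 1.0]
```

Lowering τ below 1.0 (as in the "extended" meta-testing setting of table C.9.1, where γ is reduced instead) trades variance for bias by shortening the effective credit-assignment horizon.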

C.3 Instabilities of MAML­PPO

This method was concluded to be too unstable to research further, possibly due to an error in the implementation or bad hyper-parameter configurations. Even though a few configurations were tested, MAML-PPO performed quite unstably, as seen in figure C.3.1 for ML1 and figure C.3.2 for ML10. Some models were stopped manually since they did not show promising signs of convergence or learning.

#1 #2 #3 #4
Outer LR 0.01 0.1 0.01 0.01
Inner LR 0.01 0.1 0.05 0.01
Adapt Batch Size 20 10
Meta Batch Size 40 20
Adapt Steps 1
Seeds 42
Figure C.3.1
Table C.3.1: HP values of the models in C.3.1

ANIL MAML
#1 #2 #3 #1
Outer LR 0.01
Inner LR 0.01 0.05 0.01 0.01
Adapt Batch Size 20 10 10 20
Meta Batch Size 40 20 20 40
Adapt Steps 1
Seeds 42
Figure C.3.2
Table C.3.2: HP values of the models in C.3.2

C.4 Hyper­parameter search on Particles2D

A basic hyper-parameter search was also performed on MAML-TRPO, as shown in table C.4.1 and figure C.4.1. The average meta-testing reward of the models is reported across 10 new tasks (10 episodes each) and 5 adaptation steps. In this case, reporting the meta-testing results is important because it shows that even if during validation the best model seems to be #2 (the one with a higher number of adaptation steps), it fails to perform as well during adaptation to new test tasks.

#1 #2 #3 #4 #5 #6 #7 #8
Outer LR 0.1 0.1 0.05 0.05 0.2 0.05 0.1 0.05
Inner LR 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.2
Adapt Steps 1 5 1
Adapt Batch Size 20 32 10 20 20 20 32 20
Meta Batch Size 40 16 16 64 16 16 16 32
Seed 42
Av. Test Reward ­27 ­38 ­40 ­29 ­34 ­24 ­26 ­28

Table C.4.1: HP search for MAML­TRPO.

Figure C.4.1: Validation rewards of a hyper-parameter search for MAML-TRPO on Particles2D. Smoothing factor 0.9.

C.5 Hyper­parameter search on ML1: Push

A basic hyper-parameter search was performed for MAML-TRPO (figures C.5.1 - C.5.3) and ANIL-TRPO (figure C.5.4). Due to the computational cost of fully training the agents until stable convergence, they were trained for only 250 iterations, and the average validation return was used as an indicator for comparing the different hyper-parameter values. Firstly, a small search was performed for the inner learning rate (α), then for the outer learning rate (β), and finally for the adaptation steps, each time picking the best value from the previous search.

Outer LR 0.1
Inner LR [0.01, 0.001, 0.0001]
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps 1
Seeds 42
Figure C.5.1: MAML-TRPO inner lr search on ML1: Push.
Table C.5.1: HP values of the models in C.5.1

Outer LR [0.001, 0.01, 0.1, 0.3, 0.5, 0.9]


Inner LR 0.001
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps 1
Seeds 42
Figure C.5.2: MAML-TRPO outer lr search on ML1: Push.
Table C.5.2: HP values of the models in C.5.2

Outer LR 0.3
Inner LR 0.001
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps [1, 3, 5]
Seeds 42
Figure C.5.3: MAML-TRPO adapt step search on ML1: Push.
Table C.5.3: HP values of the models in C.5.3

#1 #2 #3 #4 #5
Outer LR 0.3 0.3 0.1 0.1 0.1
Inner LR 0.001 0.001 0.01 0.001 0.01
Adapt Steps 3 1 3 1 1
Adapt BS 20
Meta BS 40
Seeds 42

Table C.5.4: HP values of the models in C.5.4

Figure C.5.4: ANIL-TRPO inner lr search on ML1: Push. Smoothing factor 0.8.

In figure C.5.5, a comparison between two MAML-TRPO models on ML1: Push is illustrated, one with meta batch size=40 & adapt batch size=20 and one with meta batch size=20 & adapt batch size=10. As expected, the one with the bigger batch sizes (more data per inner and outer loop iteration) performs better and is less noisy. It is important to note, however, that for that model the training time was approximately 4.5 minutes per epoch, whereas for the model with the smaller batch sizes it was 1 minute per epoch, a considerable difference in computational cost.

Figure C.5.5: Comparison of MAML models during training with different batch sizes
on ML1: Push.

C.6 Hyper­parameter search on ML10

Similarly, for ML10 different batch sizes were tested as seen in figure C.6.1.

Figure C.6.1: Comparison of MAML models with different batch sizes on ML10.

Additionally, for MAML-TRPO on ML10, the models in C.6.2a and C.6.2b were trained to find a good pair of learning rates and to test whether the number of inner steps affects the performance of the model. From C.6.2b, it seems that the performance difference is minuscule, while using additional inner steps increases the computational cost.

(a) Performance comparison of various learning rates. (b) Comparison of using 1 inner loop step (adaptation step) and 3 steps.

C.7 Vanilla PPO & TRPO on ML10

Training vanilla RL policies (not meta-RL) on ML10 is expected to perform poorly, due to the volatile setting of optimising for multiple losses at the same time. In figure C.7.1, all of the TRPO and PPO models developed in this project for ML10 are shown, with their respective hyper-parameter values in table C.7.1.

Figure C.7.1: Vanilla RL policies trained on ML10.

Method TRPO PPO


Model #1 #2 #3 #4 #1 #2 #3 #4 #5
LR 0.0001 0.001 0.001 0.1 0.001 0.001 0.001 0.0001 0.001
# Tasks 2 20 10 40 20 10 100 40 20 20
# Episodes 3 20 50 20 10 20 10 20 10 10
Seeds 42

Table C.7.1: HP values of the models in C.7.1

C.8 Performance on ML10 per task

Due to the diversity of the tasks in Meta-World, averaging the accumulated rewards and success rates across all tasks to report the performance of a method can be misleading. In the following figures, the performance of the trained models is shown for each task separately.

(a) (b) (c)

(d) (e) (f)

(g) (h) (i)

(j)

Figure C.8.1: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the train tasks of ML10. Reported results are mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.

(a) (b) (c)

(d) (e)

Figure C.8.2: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the test tasks of ML10. Reported results are mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.

C.9 Performance on ML10 test tasks before and after adaptation

One possible question that arises regarding the meta-testing procedure is whether such limited interaction with unseen tasks is sufficient for the agents to adapt. The methods were updated only once, based on 10 episodes, with a small learning rate and a high discount factor. This makes the performance highly dependent on the randomised seed that sets the configuration of the environment, and there is a chance that the agents could not adapt quickly with these hyper-parameter values.

In order to investigate the significance of the hyper-parameter values for the performance of the algorithms, the same meta-testing procedure was performed with a much larger batch size (episodes sampled per task) and more gradient updates, giving the policies a few more tries to converge. The hyper-parameter settings of the meta-testing are presented in table C.9.1. The results of this test, shown in figure C.9.1, indicate little difference between the ”default” and the ”extended” hyper-parameter values during meta-testing.

Hyper­Parameters Default Extended


Adapt Steps 1 5
Adapt Batch Size 10 300
Inner lr 0.001 0.05
γ 0.99 0.95

Table C.9.1: Hyper­parameter values for the meta­testing phase of the algorithms on
the ML10 test tasks.

(a) (b) (c)

Figure C.9.1: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the test tasks of ML10. The ”Before” results are the agents
evaluated on the test tasks without any change to their weights after training. The
”After 1 Step” results are based on the ”Default” values of the table C.9.1 and the ”After
5 Steps” are based on the ”Extended” values. Reported results are mean and standard
deviation of 5 different seeds with 3 trials of 10 episodes per task.

C.10 Meta-World Reward functions & success metrics

The specific reward functions and their success metrics from Meta-World: ML10 are presented, as given in [90].

Task: Reward function

Train tasks
basketball: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
button-press: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
door-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
drawer-close: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
peg-insert-side: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
pick-place: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
push: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
reach: 1000 · e^(−||h−g||₂² / 0.01)
sweep: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
window-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)

Test tasks
door-close: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
drawer-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
lever-pull: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
shelf-place: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
sweep-into: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)

Table C.10.1: Reward functions of the ML10 tasks.
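
To make the notation concrete, the push-style reward can be written out as a short function. This is a hypothetical helper, not Meta-World code: h, o and g are assumed to be 3-D hand, object and goal positions, and the minus sign in the exponent is assumed so that the 1000-point bonus decays with distance, consistent with [90].

```python
import math

def push_reward(h, o, g):
    """Push-style ML10 reward: a shaped reach term plus a gated place bonus.

    h: hand position, o: object position, g: goal position (3-D tuples).
    Illustrative reconstruction of the formula in table C.10.1, not the
    actual Meta-World implementation.
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    reach = -dist(h, o)                        # move the hand toward the object
    gate = 1.0 if dist(h, o) < 0.05 else 0.0   # indicator I[||h-o||_2 < 0.05]
    bonus = 1000.0 * math.exp(-dist(h, g) ** 2 / 0.01)
    return reach + gate * bonus

print(push_reward((0, 0, 0), (0, 0, 0), (0, 0, 0)))  # 1000.0: at object and goal
print(push_reward((1, 0, 0), (0, 0, 0), (0, 0, 0)))  # -1.0: hand 1 m from object
```

The gating explains the sparse-reward difficulty discussed in the hyper-parameter searches above: until the hand is within 5 cm of the object, the agent only sees the weak negative distance shaping.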

Task: Success metric

Train tasks
basketball: I[||o−g||₂ < 0.08]
button-press: I[||o−g||₂ < 0.02]
door-open: I[||o−g||₂ < 0.08]
drawer-close: I[||o−g||₂ < 0.08]
peg-insert-side: I[||o−g||₂ < 0.07]
pick-place: I[||o−g||₂ < 0.07]
push: I[||o−g||₂ < 0.07]
reach: I[||o−g||₂ < 0.05]
sweep: I[||o−g||₂ < 0.05]
window-open: I[||o−g||₂ < 0.05]

Test tasks
door-close: I[||o−g||₂ < 0.08]
drawer-open: I[||o−g||₂ < 0.08]
lever-pull: I[||o−g||₂ < 0.05]
shelf-place: I[||o−g||₂ < 0.08]
sweep-into: I[||o−g||₂ < 0.05]

Table C.10.2: Success metrics of the ML10 tasks. The static values represent distance in meters.

TRITA-EECS-EX-2021:15
