
DEGREE PROJECT IN TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Insights into Model-Agnostic Meta-Learning on Reinforcement Learning Tasks

Konstantinos Saitas-Zarkias

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Author
Konstantinos Saitas-Zarkias <kosz@[Link]>
Machine Learning
KTH Royal Institute of Technology

Place of Project
Stockholm, Sweden
Research Institutes of Sweden, Kista

Examiner
Pawel Herman
KTH Royal Institute of Technology

Supervisor

Alexandre Proutiere
KTH Royal Institute of Technology

Abstract

Meta-learning has been gaining traction in the Deep Learning field as an approach to build models that are able to efficiently adapt to new tasks after deployment. Contrary to conventional Machine Learning approaches, which are trained on a specific task (e.g. image classification on a set of labels), meta-learning methods are meta-trained across multiple tasks (e.g. image classification across multiple sets of labels). Their end objective is to learn how to solve unseen tasks with just a few samples. One of the most renowned methods of the field is Model-Agnostic Meta-Learning (MAML). The objective of this thesis is to supplement the latest relevant research with novel observations regarding the capabilities, limitations and network dynamics of MAML. To this end, experiments were performed on the meta-reinforcement learning benchmark Meta-World. Additionally, a comparison with a recent variation of MAML, called Almost No Inner Loop (ANIL), was conducted, providing insights into the changes of the network's representation during adaptation (meta-testing). The results of this study indicate that MAML is able to outperform the baselines on the challenging Meta-World benchmark but shows little sign of actual "rapid learning" during meta-testing, thus supporting the hypothesis that it reuses features learnt during meta-training.

Keywords

Meta­Learning, Reinforcement Learning, Deep Learning

Acknowledgements

I would like to thank the researchers at the Research Institutes of Sweden, who provided me with a welcoming environment to work in; the professors and classmates at KTH for insightful discussions and support; and the Foundation for Education and European Culture (IPEP), which trusted my work and offered financial aid during my Masters. Most importantly, I would like to thank everyone who stood by me and supported me during the period of my thesis and in the difficult times of the Covid-19 pandemic.

Acronyms

ALE Arcade Learning Environment
ANIL Almost No Inner Loop
ANN Artificial Neural Network
CCA Canonical Correlation Analysis
CNN Convolutional Neural Network
DL Deep Learning
MAML Model-Agnostic Meta-Learning
MDP Markov Decision Process
ML Machine Learning
MLP Multi-Layer Perceptron
GAE Generalised Advantage Estimator
lr learning rate
RL Reinforcement Learning
RNN Recurrent Neural Network
VPG Vanilla Policy Gradient
PPO Proximal Policy Optimisation
TRPO Trust-Region Policy Optimisation

Contents

1 Introduction
   1.1 Background
   1.2 Problem
   1.3 Goal & Contribution
   1.4 Delimitations
   1.5 Outline

2 Theoretical Background
   2.1 Deep Learning
       2.1.1 Artificial Neural Networks
       2.1.2 Gradient-Based Learning & Back-propagation
   2.2 Meta Learning
       2.2.1 Defining Meta-Learning
       2.2.2 Types of Meta-Learning
   2.3 Reinforcement Learning
       2.3.1 Model-based (or World Model) vs Model-free
       2.3.2 Off-policy vs On-policy
       2.3.3 Meta-Reinforcement Learning
   2.4 Related Work
       2.4.1 Model-Agnostic Meta-Learning
       2.4.2 Variations
       2.4.3 Insights on MAML
       2.4.4 Almost No Inner Loop Variation

3 Methodology
   3.1 Experimental setup
       3.1.1 Model-Agnostic & Problem-Agnostic Formulation
       3.1.2 Representation Similarity Experiments
   3.2 Few-shot Image Classification
       3.2.1 Classification Formulation
       3.2.2 Omniglot
       3.2.3 Mini-ImageNet
   3.3 Meta-Reinforcement Learning
       3.3.1 Meta-RL Formulation
       3.3.2 Particles2D
       3.3.3 Meta-World

4 Results
   4.1 Few-Shot Image Classification
   4.2 Meta-Reinforcement Learning
       4.2.1 Particles 2D
       4.2.2 ML1: Push
       4.2.3 ML10

5 Discussion
   5.1 Further insights
   5.2 Limitations
   5.3 Future Work
   5.4 Ethical concerns
   5.5 Sustainability

6 Conclusions

References

Chapter 1

Introduction

Neural networks have proven to be a highly useful tool in many settings, from detecting various types of cancer in humans [64], [19] to enabling robots to perform various physical tasks [83]. However, developing such models is not an easy process and is often met with considerable limitations. In contrast to how humans acquire new knowledge and skills, neural networks require a great amount of data to train with [30]. In addition, they are particularly sensitive when trying to incorporate new information after they have already been trained on a task, which often leads to the infamous phenomenon of catastrophic forgetting. This occurs when a model is trained on one task, then trained on a second task, and subsequently fails completely on the first task [31]. For these reasons, neural networks are generally unsuitable for problems where large-scale data collection is inaccessible (e.g. X-rays) or where new tasks are introduced after the model has been trained and re-training from scratch is too costly (e.g. detecting a new type of disease after being deployed to detect a previous one).

A research field that has been gaining traction due to promising recent publications that could potentially combat these issues is Meta-Learning. As with many ideas in Machine Learning (ML), the core concept was loosely inspired by a field that studies humans, this time the area of educational psychology. When studying the learning abilities of students, John Biggs describes meta-learning as "one's awareness over their learning process and actively taking control over it" [8]. It is easy to see why applying this idea to ML models is very intriguing. Developing algorithms that are able to self-assess their own performance and improve on it could be profoundly valuable. Such algorithms could alleviate the need for humans to fine-tune models and their hyper-parameters in order to adapt them to a specific task, and they could generalise across multiple, different tasks. These are two problems that are still quite present in the current ML development process.

The end goal of meta-learning is to create models that have the ability to quickly adapt to new tasks they have not seen before, using their past experience of training on similar tasks. For example, an experienced musician who has spent many years learning how to play the guitar, violin and contrabass has a considerably easier time picking up a new instrument like the cello and learning to play it moderately well, compared to someone who has never played an instrument before. One reason for this is that similar tasks share similar dynamics and structure [60] (all these instruments have strings and follow the same rules of physics), but also that a skilful musician knows what learning direction to follow when learning an instrument (e.g. which exercises will help them familiarise themselves with it) based on their experience with the previous instruments, a process that often happens subconsciously. This is the high-level rationale meta-learning tries to leverage when training models.

This thesis focuses on examining one of the most influential state-of-the-art meta-learning algorithms, Model-Agnostic Meta-Learning (MAML) [22], expanding on the latest relevant research and providing insights based on experimental results on a new reinforcement learning benchmark, Meta-World [90].

1.1 Background
Most supervised ML models are presented with large quantities of data for a specific task they should try to solve. During testing, new data samples of the same task are presented to the model, with the expectation that the test data contain feature patterns similar to those in the training data, which the model has already identified, and that it can thus make correct predictions. A common example would be feeding an Artificial Neural Network (ANN) images that contain cats and dogs and training the model to distinguish which image contains which animal. Then, to evaluate its accuracy, different pictures of cats and dogs that were not part of the training set are fed to the model, and in return it tries to answer which is which. However, if a new task were introduced, for example distinguishing between cats, dogs and lions, the network would most likely fail, and re-training it from scratch with many additional images of lions would probably be necessary to account for the new class of animal. Even such a seemingly trivial problem is still challenging for many ML models.

Meta-learning methods try to tackle this issue by becoming efficient learners, with the aim of rapid adaptation to new tasks while requiring only a few training samples. As previously mentioned, conventional ML models try to leverage similarities between training and test data to make predictions during evaluation. Similarly, meta-learning models also try to leverage similarities, but between training and test tasks, in order to generalise to new tasks. Different meta-learning algorithms follow different training procedures to achieve this, but most current algorithms adhere to the same principle. During meta-training¹, there are two learning systems being optimised. Firstly, a lower-level base learner tries to adapt rapidly to new data, meaning it completes the task with only a few samples. Secondly, a higher-level meta-learner is optimised to fine-tune and improve the base learner based on how well the adaptation was performed. During meta-testing, the parameters of the network (weights, biases or even hyper-parameters) are updated in just a few iterations² to values that can solve the new task at hand [38].

One of the most prominent meta-learning algorithms, which has sparked a wave of similar approaches, is MAML [22]. Following the basic principles of meta-learning, MAML aims to train a neural network with the purpose of finding an initialisation of the model parameters that is suitable for fast adaptation to new tasks. It also consists of a two-level learning system: an inner loop, or base learner, for fast adaptation, and an outer loop, or meta-learner, for improving the parameter initialisation³. Specifically, during the inner loop, the network starts from an initial set of parameters and performs a brief "learning session" (a few iterations) over a small set of different tasks. Next, during the outer loop, the network initialisation is updated with respect to the general direction of the adapted inner-loop parameters across all the tasks. The goal of this approach is to explicitly optimise the model's parameters for rapid learning.

¹ The term meta-training is used similarly to training in traditional ML models, but in the context of meta-learning approaches.
² A simple tutorial with visualisation can be found in this link.
³ The terms inner and outer loop refer to the programming practice of nested loops during meta-training.
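As an illustration of this two-loop structure, the update can be sketched in a few lines of Python. This is a minimal, first-order sketch on a toy one-parameter regression family, not the implementation used in this thesis; the task family and names such as `sample_task`, `inner_lr` and `meta_lr` are assumptions made for the example (full MAML additionally back-propagates through the inner-loop update itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # a toy task family: fit y = a * x for a task-specific slope a
    a = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=10)
    return x, a * x

def grad(w, x, y):
    # gradient of the MSE loss mean((w*x - y)^2) w.r.t. the scalar weight w
    return 2.0 * np.mean((w * x - y) * x)

def maml_step(w0, inner_lr=0.1, meta_lr=0.01, n_tasks=8):
    meta_grad = 0.0
    for _ in range(n_tasks):
        x, y = sample_task()
        w_adapted = w0 - inner_lr * grad(w0, x, y)   # inner loop: fast adaptation
        meta_grad += grad(w_adapted, x, y)           # loss gradient after adaptation
    # outer loop: move the initialisation in the average post-adaptation direction
    return w0 - meta_lr * meta_grad / n_tasks

w0 = 0.0
for _ in range(100):
    w0 = maml_step(w0)
```

In the full algorithm, the adaptation data and the post-adaptation evaluation would come from separate support and query sets of each task.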


1.2 Problem

Since the publication of MAML, many studies have tried to examine its inner mechanisms and its effects on the learning procedure of the models [61], [23], [57]. A recent study made an interesting observation by asking the question: "is the effectiveness of MAML due to the meta-initialisation being primed for rapid learning (efficient changes in the network representation) or due to feature reuse, with the meta-initialisation already containing high quality features?" [62]. Their experiments presented evidence for the latter, suggesting that the adaptation phase of the inner loop is of little value, and thus they proposed a variation titled Almost No Inner Loop (ANIL).
According to the notion of meta­learning, such models lead to efficient learners,


suggesting they are able to perform better than conventional methods when exposed to
new tasks. Conventional ML methods are expected to perform poorly when evaluated
on tasks that they were not trained for (e.g train a robot on pushing a button but tested
on opening a door). However, meta­learning methods are expected to be able to learn
the new task rapidly and perform sufficiently.

The objective of the thesis is to develop a framework to assess the performance of MAML on the premise of it becoming an efficient learner, while examining the question of "rapid learning or feature reuse". To evaluate this, experiments were performed in a few-shot image classification setting and a meta-reinforcement learning setting. For the former, the Omniglot and Mini-ImageNet datasets were used; for the latter, the Particles2D and Meta-World environments. Overall, the research contribution of the degree project can be summarised by the following questions:

1. What does MAML actually learn? Is it a high quality feature representation of the training data, which it then utilises to solve the test tasks? Or does it learn to rapidly solve the test tasks by acquiring new skills?

2. Does MAML outperform conventional ML approaches in Meta-World? Does it manage to adapt efficiently to unseen tasks even in a more challenging setting?

3. Does the Almost No Inner Loop (ANIL) approach perform similarly to MAML in more complex environments?

4. Is there a computational cost difference between MAML and ANIL?


The first question is inspired by [62], where evidence was found for the case of feature reuse. We evaluate the same question both in the same settings (Omniglot, Mini-ImageNet & Particles2D) and in a new setting (Meta-World), in order to reproduce the results and add new observations from a more challenging environment. The second question arose from the lack of comparison between meta-learning methods and conventional (non meta-learning) methods in terms of performance in the Meta-World environment, as reported in [90]. By asking this question, we challenge the need for meta-learning and inspect the limitations and capabilities of conventional Reinforcement Learning (RL) methods. In [62], as a result of their findings regarding MAML's feature reuse, the authors propose ANIL as an equivalent alternative, which they also evaluate on Omniglot, Mini-ImageNet & Particles2D, finding performance similar to MAML. However, these tasks are quite limited and not complex enough to test the capabilities of MAML and ANIL. For this reason, we examine ANIL's performance on Meta-World in order to evaluate whether it is actually equivalent to MAML. Lastly, one of the considerable benefits of ANIL over MAML, as reported in [62], is that it is computationally cheaper, offering significant speedups during training and testing. We again evaluate this claim by reproducing the results on the same environments, while also providing new insights about computational costs on Meta-World.

1.3 Goal & Contribution


The purpose of this research-oriented thesis is to provide a nuanced understanding of the MAML method by trying to answer open questions in the meta-learning field. The aim is to expand the discussion regarding a well-established algorithm, but also to act as an introduction to the field for readers unfamiliar with the topic. It is structured as a stand-alone report such that minimal previous knowledge or study of external sources is needed to understand it.

The contributions of this project are both theoretical and practical. An experimental analysis of MAML provides valuable insights into its internal mechanisms, performance capabilities and, equally importantly, its limitations. The novelty of this project is a comparison between MAML and one of its recently published variations, ANIL, on the meta-reinforcement learning benchmark Meta-World. The results of these experiments shift the focus of the evaluation of meta-learning methods to a more challenging framework.

Lastly, the implementations of the algorithms presented in this thesis and the code required to reproduce these experiments are open-sourced. Various open-source programming libraries and frameworks were used for the realisation of this thesis, some of which are still in active development. During the implementation of the experiments, contributions were made to public GitHub repositories in the form of bug fixes, bug reports and additional feature implementations. One of these repositories was the learn2learn PyTorch meta-learning library⁴, which acted as a core part of this thesis' code base. During the development of the thesis, a collaboration with the lead developer of the library led to contributions on GitHub and, finally, a technical paper publication [3].

1.4 Delimitations
One frequent issue when developing methods and experiments with the latest environments and benchmarks is the bugs and performance shortcomings that come along with them. This project made extensive use of open-source libraries and frameworks, most of which are still in active development. The basic components were the meta-learning library learn2learn for the algorithmic implementation and the multi-task & meta-learning environment Meta-World⁵ for the realisation of the experiments. Both of these frameworks went through many iterations, and even complete reworks of their API, from the start of the thesis until its end. They are still active projects maintained by their own developer teams and the open-source meta-learning community, and it is possible that further bug reports will arise and optimisation improvements will take place. This is to say that the reported results cannot be assumed to have been generated by bug-free software. Thus, in order to guarantee replication of the findings of this thesis, the specific versions of the software used need to be installed.

Another relevant delimitation is the high computational cost required by extensive experiments and hyper-parameter searches for these models. Due to limited resources, the hyper-parameter and architecture search of the models was kept to only the most crucial components. Thus, no guarantees can be made about the results to be expected outside of the configurations tested. A discussion regarding this topic is presented in appendix C and as future work in section 5.3.

⁴ learn2learn GitHub repository: [Link]
⁵ Meta-World GitHub Repository: [Link]

The most significant limitation of this degree project relates to the evaluation. In order to draw accurate and concrete conclusions from the experimental results, statistical hypothesis tests are necessary. Due to the computational costs required, however, too few models were trained to carry out such tests. Thus, the results of the thesis showcase trends observed when training and evaluating the algorithms. Making definite conclusions based on statistically sound results would require a more extensive evaluation, the precondition for which is access to considerable computational power.

1.5 Outline
The structure of this project's report is as follows. In Chapter 2, an extensive analysis of the background work is presented to lay the theoretical foundations of the thesis and cover related research. Next, in Chapter 3, the methodology and research approach are introduced, along with the technical details of the evaluation setup. In Chapter 4, the results of the experiments are presented through tables and figures. Chapter 5 discusses the overall outcome of the degree project, including comments on future work, ethical concerns and sustainability. Chapter 6 summarises the results and provides some conclusive remarks. In addition to these main chapters, an Appendix section (A-C) is attached to provide technical details and additional work relevant for reproducibility purposes.

Chapter 2

Theoretical Background

This degree project is based on the combination of three different fields: Deep Learning, Reinforcement Learning and Meta-Learning. Deep Learning is part of the Machine Learning field and is primarily focused on the study of artificial neural networks. Reinforcement Learning and Meta-Learning have both been studied in the contexts of Psychology, Neuroscience and Machine Learning. In the rest of the thesis, mentions of Reinforcement Learning and Meta-Learning refer to their application within the Deep Learning field, unless otherwise specified¹.

The first three sections (2.1-2.3) of this chapter provide a gentle introduction to the relevant fields without delving into too much detail on the thesis' specifics. Section 2.4 then focuses solely on providing a thorough theoretical understanding of all the parts of the project and the latest relevant research.

2.1 Deep Learning


Even though ANNs have been around since the 1960s, the term Deep Learning has only recently come into use to describe related research. This is mainly due to the success of deeper (multi-layer) neural networks, enabled by the increasing availability of computational power. Deep networks have achieved state-of-the-art results in various applications due to their exceptional ability to express and approximate incredibly complex functions and probability distributions [49].

¹ Meaning that the terms Deep RL and RL may be used interchangeably.


Figure 2.1.1: Artificial Neural Networks. (a) A McCulloch and Pitts neuron. (b) A Multi-Layer Perceptron. Figures from [51].

Although a thorough description of neural networks is out of the scope of this thesis report, it is worthwhile to focus on some vital components in order to get a better grasp of the basis of meta-learning and what this project tries to answer.

2.1.1 Artificial Neural Networks

Loosely inspired by the mammalian brain, the basic concept of an ANN follows a similar structure [30]. A simple artificial neuron receives some input data x and produces an output y based on its weights w and an activation function f. These types of artificial neurons are also known as McCulloch and Pitts neurons (figure 2.1.1a). A set of these neurons makes up a Perceptron layer, and a set of Perceptron layers makes up one of the most standard ANN architectures, the Multi-Layer Perceptron (MLP). The layers are stacked (from left to right, as seen in figure 2.1.1b) and the output of each layer becomes the input of the next.

The process of getting from the input data x to the output values y through the whole network is called forward propagation (or the forward pass) and consists of the following steps:

1. Feed the data x to the first input layer.

   (a) Multiply the data x with the weights w of each neuron.

   (b) (Optionally) Add a bias term b.

   (c) Sum the results: h = ∑ᵢ xᵢwᵢ + b

   (d) Pass h through the activation function: y = f(h)


Figure 2.1.2: Left: Convolutional Neural Network architecture, from [27]. Right: Recurrent Neural Network architecture, from [30].

2. The results y of the first layer now become the input of the next layer and the same steps (a)-(d) are followed for x ← y.

3. The output of the network is the output of the final layer.
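The steps above can be sketched directly in numpy (a minimal illustration with assumed layer sizes and a tanh activation; a real implementation would normally rely on a framework such as PyTorch):

```python
import numpy as np

def forward_pass(x, layers, f=np.tanh):
    # layers: list of (weights, bias) pairs; steps 1(a)-1(d) per layer
    y = x
    for w, b in layers:
        h = y @ w + b   # multiply by the weights and add the bias term
        y = f(h)        # pass the sum through the activation function
    return y            # the output of the final layer

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 4)), np.zeros(4)),   # 3 inputs -> 4 hidden
          (rng.standard_normal((4, 2)), np.zeros(2))]   # 4 hidden -> 2 outputs
out = forward_pass(rng.standard_normal(3), layers)
```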

The adjustable variables (weights & biases) of the network are also called the parameters θ of the network. Multiple variations of these networks have been developed to better tackle different problems. For example, Convolutional Neural Networks (CNN) are best suited when the input data consist of images or video (figure 2.1.2a), whereas Recurrent Neural Networks (RNN) are best suited when the data are sequentially dependent, such as text or time series (figure 2.1.2b).

2.1.2 Gradient­Based Learning & Back­propagation

In order to get meaningful results from these networks, they need to be trained,
meaning their parameters θ need to be updated based on a cost function. This can
be achieved by the backward pass, also known as the back­propagation algorithm
[68], to compute the gradients using a gradient­based optimizer. Training an MLP is
an iterative process of many (usually thousands or millions) consecutive forward and
backward passes.

Cost function (or loss function): The cost function describes the error between the output values (predictions) of the network and the objective values (target) the network tries to optimise for. Depending on the problem, the cost function can take many forms. For classification problems, this usually means minimising the cross-entropy loss (equation 2.1), based on the principle of maximum likelihood [30]. For regression, and sometimes RL, it usually means minimising the mean squared error, the average squared difference between the predictions and the target values (equation 2.2).

CE = −∑ᵢ yᵢ log ŷᵢ (2.1)

MSE = (1/n) ∑ᵢ (yᵢ − ŷᵢ)² (2.2)
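The two cost functions can be written directly from equations 2.1 and 2.2 (a small numpy sketch; the `eps` guard against log(0) is an addition for numerical safety, not part of the equations):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # equation 2.1: CE = -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat + eps))

def mse(y, y_hat):
    # equation 2.2: average squared difference over the n values
    return np.mean((y - y_hat) ** 2)
```

For a one-hot target [0, 1, 0] and prediction [0.1, 0.8, 0.1], the cross-entropy reduces to -log(0.8).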

Backward pass: The back-propagation algorithm, backprop for short, is a simple and efficient way to calculate the gradients of a neural network, in most cases the gradients ∇θ J(θ) of a cost function J with respect to the network parameters θ. The algorithm can be summarised in the following steps (Algorithm 6.4 in [30]):

Given ŷ the output of the network and y the target values:

1. Calculate the gradient at the output layer:

   g ← ∇ŷ J (2.3)

2. From the last layer to the first:

   (a) Calculate the gradient of the layer with respect to its output before the activation function (element-wise multiplication ⊙, since f is applied element-wise):

   g ← ∇h J = g ⊙ f′(h) (2.4)

   (b) Compute the gradients of the weights and biases (Ω denotes an optional regularisation term with coefficient λ, and y here is the input to the layer):

   ∇b J = g + λ∇b Ω(θ) (2.5)

   ∇w J = g yᵀ + λ∇w Ω(θ) (2.6)

   (c) Propagate the gradient to the previous layer:

   g ← ∇y J = Wᵀ g (2.7)
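The listed steps can be implemented by hand for a small MLP (a numpy sketch assuming the MSE cost, a tanh activation and no regularisation term, i.e. λ = 0; not production code):

```python
import numpy as np

def backward_pass(x, target, layers, f=np.tanh,
                  f_prime=lambda h: 1.0 - np.tanh(h) ** 2):
    # forward pass, caching each layer's input and pre-activation h
    inputs, pre_acts, y = [], [], x
    for w, b in layers:
        inputs.append(y)
        h = y @ w + b
        pre_acts.append(h)
        y = f(h)
    # step 1 / eq. (2.3): gradient of the MSE cost at the output
    g = 2.0 * (y - target) / y.size
    grads = []
    # step 2: from the last layer to the first
    for (w, b), h, a in zip(reversed(layers), reversed(pre_acts),
                            reversed(inputs)):
        g = g * f_prime(h)                 # eq. (2.4): through the activation
        grads.append((np.outer(a, g), g))  # eqs. (2.5)-(2.6), with lambda = 0
        g = g @ w.T                        # eq. (2.7): to the previous layer
    return list(reversed(grads))           # (dW, db) per layer, first to last

rng = np.random.default_rng(1)
layers = [(rng.standard_normal((3, 4)), np.zeros(4)),
          (rng.standard_normal((4, 2)), np.zeros(2))]
grads = backward_pass(rng.standard_normal(3), rng.standard_normal(2), layers)
```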


Optimizers: The optimizers are the algorithms that actually update the network parameters, using the gradients computed by backprop. One of the most commonly used optimizers is Adam [44], due to its adaptive learning rate mechanism. Other common optimizers are Stochastic Gradient Descent [67] and AdaDelta [91].
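To make the distinction concrete, the plain SGD update and Adam's moment-based update can be written out for a single scalar parameter (a simplified sketch of the published update rules; in practice one would use the optimizers provided by a framework such as PyTorch):

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    # plain gradient descent: step against the gradient at a fixed rate
    return theta - lr * grad

def adam_update(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps running first/second moments of the gradient, giving
    # each parameter its own adaptive effective learning rate [44]
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# minimise J(theta) = theta^2 (gradient 2*theta) with Adam
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_update(theta, 2.0 * theta, m, v, t)
```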

2.2 Meta Learning

The idea of Meta Learning was first conceived in the field of psychology [8] and was then adopted in other fields such as neuroscience [14]. In the setting of educational psychology, meta-learning describes "a person [who is] aware and capable to assess their own learning approach and adjust it according to the requirements of the specific task". Kenji Doya defines meta-learning in the context of neuroscience as "a brain mechanism with the capability of dynamically adjusting its own hyper-parameters (e.g. the dopaminergic system) through neuromodulators (e.g. dopamine, serotonin, etc.)" [14]. Even though the idea of Meta-Learning has been around since the 80s [70], it has only recently seen rapid growth in the field of Deep Learning, with exciting new research and progress in different directions every year. As such, definitions of what exactly constitutes meta-learning can be inconsistent depending on the setting.

2.2.1 Defining Meta­Learning

Generally, meta-learning aims to improve the model’s generalisation capabilities after
deployment. Undoubtedly, there are various other approaches that follow the same
idea but do not fall under the meta-learning term, leading to confusion as to which
methods can be classified as ”meta-learning”.

One such example is automatic hyper-parameter tuning based on model evaluation
and cross-validation. This could be considered a naive form of meta-learning;
however, such approaches are usually considered part of automated machine learning
(AutoML) [41]. Similar confusion can arise with the fields of ensemble, transfer,
continual and multi-task learning. However, contemporary meta-learning is focused
on explicitly defining a meta-objective (e.g. few-shot classification of images) and then
optimising the base model end-to-end for that specific goal. In order to provide
better insight into the meta-learning setting and distinguish it from similar approaches and

conventional ML, the following formulation is presented (adapted from [38]).

Meta­knowledge

In a common meta-learning classification example, given a training dataset D =
{(x1, y1), ..., (xN, yN)}, the goal is to train a model ŷ = fθ(x) with parameters θ by solving
2.8, which minimises the cost function Ltask for a fixed ω.

θ∗ = arg min_θ Ltask(θ; ω; D)    (2.8)

The condition ω is defined as the factors leading to the solution θ; it encompasses the
”how to learn” knowledge and is also known as meta-knowledge. This could be the initial
parameter values θ, the choice of the optimiser, hyper-parameter values, the function
class for f, etc. In this setting, for every dataset D that is given, the optimisation starts
from scratch with a hand-tuned, pre-specified ω. Hence, meta-learning can be defined
as learning the condition ω, leading to improved performance and the ability to
generalise across tasks.

Usually, in the meta-learning literature, the algorithm is trained over a distribution of
tasks p(T ) in order to learn ω. A task can be a subset of a dataset or even a whole
dataset [84], paired with a cost function: T = {D, L}. Thus, the meta-learning objective
can be written as:

min_ω E_{T∼p(T )} Lmeta(θ∗; ω; D)    (2.9)

where Lmeta(θ∗; ω; D) represents the model performance on a dataset D with fixed
parameters θ∗, optimising for ω.

Meta­training

The process of training a model this way over a set of M tasks is called meta-training, in
which the training tasks are formalised as Dtrain = {(D^support_train, D^query_train)i}, i = 1, ..., M.
Each train task has its own support and query subset, used as a ”mini-train” and a
”mini-validation” set. Optimising for ω, or learning to learn, is then written:

ω∗ = arg max_ω log p(ω | Dtrain)    (2.10)


That is, given a set of M datasets in Dtrain, the goal is to pick the ω∗ that maximises
the log probability of the meta-knowledge across all M datasets.

Two­level optimisation system

A typical way of meta-training a model, and the way this thesis adopts in chapter 3, is
through a two-level optimisation consisting of an inner and an outer loop. This approach
follows the idea of hierarchical optimisation, where the model is optimising for one goal
while being constrained by a second optimisation goal [28]. The inner loop optimises
the base model, or learner, for θi∗ of task i while being conditioned on ω (2.11). The
inner optimisation is based on the support set of the dataset. Note that the notation
θi∗(ω) is used to represent that each θi∗ shares the same ω. The outer loop optimises
the meta-learner for the condition ω, with the objective of producing a model θi∗ which
performs optimally on the query set (2.12).

θi∗(ω) = arg min_θ Li(θ, ω, (D^support_train)i)    (2.11)

ω∗ = arg min_ω Σ_{i=1}^{M} Lmeta(θi∗(ω), ω, (D^query_train)i)    (2.12)

where Li is the cost function of Ti for a fixed ω and Lmeta is the meta­learning objective
for a fixed θi∗ (ω).

Meta­testing

Evaluating a meta-learning algorithm’s performance means evaluating the quality
of the learned meta-knowledge ω. Instead of the conventional approach of directly
testing on a Dtest set, in meta-testing the test set consists of Q tasks and is split into
Dtest = {(D^support_test, D^query_test)j}, j = 1, ..., Q. Each unseen task j is split into a
support and a query set, in which the former is used to leverage the meta-learned ω and
the latter to finally evaluate the accuracy of the adapted model. The size of the support
set for each test task is usually small, since the objective is rapid adaptation with few
samples, not re-training. A common meta-learning evaluation setting is that of ”K-shot
learning” for classification, meaning that each new task comes with K samples (”shots”).
The meta-learnt model then needs to adapt to the task using only these K samples in
order to correctly predict the labels of the query set. Thus, the meta-testing process of


Figure 2.2.1: An overview of the meta­learning field divided by algorithm design and
applications. Figure from [38].

adapting to (D^support_test)j can be summarised as:

θj∗ = arg max_θ log p(θ | ω∗, (D^support_test)j)    (2.13)

where θj∗ is used to evaluate the model performance on (D^query_test)j.
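As an illustration of the meta-testing split described above, a hypothetical N-way K-shot episode sampler (class names and sizes are invented for the example):

```python
import random

def make_k_shot_task(examples_by_class, n_way, k_shot, n_query, rng):
    """Sample an N-way K-shot episode: K support and n_query query
    examples per class, with no overlap between the two sets."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        pool = rng.sample(examples_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in pool[:k_shot]]
        query += [(x, label) for x in pool[k_shot:]]
    return support, query

rng = random.Random(0)
# Made-up dataset: 4 classes with 20 examples each
data = {c: [f"{c}_{i}" for i in range(20)] for c in ["cat", "dog", "car", "bus"]}
support, query = make_k_shot_task(data, n_way=2, k_shot=5, n_query=3, rng=rng)
print(len(support), len(query))  # 10 6
```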

2.2.2 Types of Meta­Learning

To provide a general picture of the meta-learning landscape, this section briefly
outlines ways of understanding the current category trends being developed.
Meta-learning algorithms can be divided into categories based on what parameters
they optimise for, what type of networks they are based on, or, more generally, on the
steps of the procedure used to achieve their goal.

A common way to distinguish between meta-learning algorithms is by
categorising them into one of the following approaches: metric-based, model-based and
optimisation-based [53], [47]. However, Hospedales et al. propose a more intuitive
taxonomy based on the formulation in 2.2.1 and, more generally, on the questions
”what”, ”how” and ”why” [38]. It is interesting to note that a meta-learning algorithm
can fall into more than one of the categories that follow.

Meta-Representations (”What?”): The first and maybe the most apparent way
to categorise meta-learning approaches is by what the meta-knowledge ω entails.
Specifically, which parts of the learning process are to be learned and then deployed


and leveraged for generalising during meta-testing. This class of meta-learning
algorithms is vast (figure 2.2.1), contains some of the most popular approaches and
keeps expanding every year. MAML is part of the parameter-initialisation methods,
which try to find a good initial parameter space for the model in order to adapt rapidly,
in a few gradient steps, to the task at hand (more relevant literature in section 2.4).
Another example would be to explicitly define and learn the gradient-based optimizer,
for example by learning its hyper-parameters (e.g. the step size [48]). Often, such
approaches are ”model agnostic”, meaning that the actual base model parameters θ
can vary and are dependent on the application. For example, MAML and similar
approaches have been used with MLPs, CNNs, RNNs etc.

Meta-Optimiser (”How?”): The next category of methods can be defined by the
optimisation strategy used to converge to ω∗ in the outer loop. The most
common and standard approach is a gradient-based method that explicitly
calculates the gradients with respect to the meta-parameters ω (MAML again falls into
this category). Even though the formulation for calculating such derivatives using the
chain rule might be straightforward,

dLmeta/dω = (dLmeta/dθ) · (dθ/dω),

it comes with some caveats. Such an optimisation system of two loops can quickly
spiral into a complex computational graph in which tractable higher-order derivatives
are needed, and vanishing or exploding gradients can quickly become a problem when
the inner loop consists of many iterations [2]. Other interesting approaches go through
RL, by optimising the meta-objective Lmeta with a policy gradient algorithm [39], or
use evolutionary algorithms to optimise the objective [7].
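The meta-gradient chain rule above can be checked on a scalar example. With hypothetical quadratic losses (not from the thesis) and one inner step θ = ω − α L_task′(ω), the meta-gradient is L_meta′(θ) · (1 − α L_task″(ω)):

```python
# Hypothetical scalar losses: L_task(t) = (t - a)^2, L_meta(t) = (t - b)^2.
alpha, a, b, omega = 0.1, 1.0, 3.0, 0.0

l_task_grad = lambda t: 2.0 * (t - a)
l_task_hess = lambda t: 2.0
l_meta = lambda t: (t - b) ** 2
l_meta_grad = lambda t: 2.0 * (t - b)

theta = omega - alpha * l_task_grad(omega)               # one inner-loop step
# Chain rule: dL_meta/d_omega = L_meta'(theta) * d_theta/d_omega
analytic = l_meta_grad(theta) * (1.0 - alpha * l_task_hess(omega))

outer = lambda w: l_meta(w - alpha * l_task_grad(w))     # L_meta as a function of omega
eps = 1e-6
numeric = (outer(omega + eps) - outer(omega - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True
```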

Meta-Objective (”Why?”): Lastly, meta-learning methods can be labelled by how
they define their meta-objective Lmeta and by the way the inner and outer loop interact.
As previously mentioned in 2.2.1, a common way to define the meta-objective is through
an evaluation metric on the validation set D^query_train, which measures the effectiveness
of the meta-knowledge ω. In this view, the few-shot learning setting is one of the
most popular ones and the one most often used as a benchmark for meta-learning
algorithms [48], [54]. However, there are cases that do not have such a restriction,
in which a task can have many samples [1]. Another distinction is whether the
outer-loop optimisation of the base learner happens after the inner loop or whether the
meta-optimisation is performed online, with θ and ω updated in parallel within an episode
[5].


2.3 Reinforcement Learning

The field of Reinforcement Learning lies between supervised learning, where the
model is trained on labelled data, and unsupervised learning, where the model aims
to find and leverage similarities and patterns in the data. Instead, RL algorithms
are driven by interacting with an environment and receiving rewards on whether
the agent’s² actions managed to perform well enough or complete a goal. The agent then
picks the actions that maximise the cumulative rewards, leading to an optimal strategy
through a long series of trials and errors, often within a given time constraint. Based
on this simple setting, many methods have been developed in recent years with
remarkable results, especially in the field of games. One of the first papers to re-ignite
the interest of the research community was by a team of researchers from DeepMind,
who managed to develop a simple agent algorithm that could surpass human-level
performance in many of the classic Atari 2600 games [55]. Since then, research on the
field has grown rapidly, with multiple advancements in similar game-like environments
such as chess and go [76], [77]. Progress in real-world applications has also started to
expand [58], [59], but at a slower rate due to a series of challenges, which mainly involve
the immense number of interaction samples with the environment needed to achieve
satisfactory results [17].

Figure 2.3.1: Overview of the typical RL setting of an agent interacting with the
environment. Figure from [80].

In order to understand the choices of algorithms in chapter 3 and get a better
understanding of the experiments in general, it is worthwhile to briefly go over some
basic RL theory.

² Agent is the term for the policy / model in the RL setting.


2.3.1 Model based (or World Model) vs Model free

The most standard way to formulate an RL setting is as a Markov Decision Process
(MDP) of four (or five) components:

1. S: the state of the environment

2. A: the action of the agent

3. R(s, a): the (unknown) reward function, which indicates how good or bad action
a was for state s

4. S′: the next state of the environment after action a has been taken

5. T (s′|s, a): the transition probability function / matrix, which indicates the
probability of the environment transitioning to state s′ when the agent takes
action a in state s.

Additionally, environments can either have finite horizons (the agent has H
time steps to solve the task) or infinite horizons (in which case, to motivate
the agent to solve the task, future rewards are discounted by a factor γ).

The last component, T (s′|s, a), is not always mentioned, because in most deep-RL
problems the transition function T is not known. When T is known, the
solution can be computed directly, without the need for the agent to interact with the
environment, through planning algorithms such as policy iteration, value iteration etc.
[79]. However, usually the RL objective is defined as finding an optimal policy π which
maximises the expected future discounted reward without knowledge of the T or
R functions. There are two approaches for dealing with a problem where the
transition function is unknown.
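When T and R are known, planning alone recovers the optimal behaviour. A minimal value-iteration sketch on a made-up two-state MDP (all numbers illustrative, not from the thesis):

```python
GAMMA = 0.9
# T[s][a] = list of (probability, next_state); R[s][a] = immediate reward
T = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # iterate the Bellman optimality backup to (near) its fixed point
    V = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]) for a in T[s])
         for s in T}

# Greedy policy with respect to the converged values
policy = {s: max(T[s], key=lambda a, s=s: R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]))
          for s in T}
print(policy)  # {0: 1, 1: 0}
```

No environment interaction is needed: the solution is read off directly from the model, which is exactly what makes planning with a known T so sample efficient.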

The model-based way is to try to model the environment by approximating this
function through a long series of environment interactions (thus building a ”World
Model”). This can be done either by first learning the model and then using a
highly sample-efficient planning algorithm³ to solve the problem, or by simultaneously
learning the model while also approximating a policy [13]. This approach has seen
exceptional success in games, where the dynamics of the environment might not be as
complex as in the real world and where an almost perfect simulator (World Model)
can be estimated by paying a high computational cost [71]. However, such methods
suffer from severe performance loss when transferring the learned policy to the real
world, due to biases embedded in the model [42].

³ Such algorithms can be trained offline, meaning they do not need to sample new interactions
with the environment during training, and are thus more cost efficient, while also finding and
evaluating a solution before actually executing it.

Model-free approaches, on the other hand, bypass modelling the environment
altogether and directly estimate an optimal policy. Free of the constraint of learning
a model, they are typically easier to implement and tune, albeit they tend to be highly
sample inefficient. Numerous methods follow this idea, and they
are usually the first choice in most RL settings. The most prevalent families of model-
free methods are Q-learning and Policy Optimisation. Q-learning algorithms aim to
learn an approximator Qθ(s, a) (θ are usually the neural network’s parameters) for the
optimal action-value function Q∗(s, a) via the update 2.14; the actions of the policy are
then calculated from 2.15.

Q(s_t, a_t) ← Q(s_t, a_t) + α[R_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (2.14)

a(s) = arg max_a Qθ(s, a)    (2.15)
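A minimal tabular sketch of update 2.14 and greedy rule 2.15 on a made-up five-state corridor (illustrative, not one of the thesis's environments):

```python
import random

N, GOAL, ALPHA, GAMMA, EPS = 5, 4, 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(N)]       # Q[s][a]; a = 0 is left, a = 1 is right
rng = random.Random(0)

for _ in range(1000):
    s = rng.randrange(GOAL)              # each episode starts in a random non-goal state
    while s != GOAL:
        greedy = 0 if Q[s][0] > Q[s][1] else 1
        a = rng.randrange(2) if rng.random() < EPS else greedy   # epsilon-greedy exploration
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == GOAL else 0.0   # reward only at the right end of the corridor
        # Equation 2.14: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# Equation 2.15: the learned policy picks a(s) = argmax_a Q(s, a)
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1] — always move towards the goal
```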

Policy optimisation methods, however, try to explicitly learn a policy πθ(a|s) by
approximating an on-policy value function Vπθ(s). This family of methods includes
many variations on the details of how this is achieved, with the most notable ones
being Vanilla Policy Gradient (VPG) [81], Proximal Policy Optimisation (PPO) [74]
and Trust-Region Policy Optimisation (TRPO) [72]. It is worth noting that some
algorithms, such as Soft Actor-Critic (SAC) [34] or Twin Delayed Deep Deterministic
Policy Gradient (TD3) [29], try to combine the best of both Q-learning and Policy
Optimisation, with promising results.

2.3.2 Off­policy vs On­policy

Furthermore, RL algorithms can be split into two categories depending on how they
update their policy from the samples of the environment. Off-policy methods reuse
previous samples, collected through the environment-agent interactions during training,
regardless of the exploration policy⁴ of the agent. This gives them the advantage of
being more sample efficient. Almost all Q-learning methods are trained off-policy. In
contrast, on-policy methods do not reuse previous data and only depend on sampling
online from the learnt policy. Even though this approach might be less sample efficient,
it is notably more stable, since it keeps optimising for the objective based on the latest
policy. Usually, Policy Optimisation methods are trained on-policy. In this thesis,
all of the experiments were performed with model-free, on-policy algorithms, namely
PPO and TRPO (as seen in chapter 3).

⁴ The choice of whether the agent will choose an action randomly or based on the currently learned
policy.

Figure 2.3.2: Overview of a standard meta-RL setting. Figure from [9].
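The off-policy reuse of past samples described above is typically implemented with a replay buffer; a minimal sketch (illustrative, not from the thesis's code):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly re-sample stored transitions, regardless of which policy produced them
        return self.rng.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(250):                # the oldest 150 transitions fall out of the buffer
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(4)
print(len(buf.buffer), len(batch))  # 100 4
```

An on-policy method, by contrast, would discard the buffer after every policy update and collect fresh samples from the latest policy.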

2.3.3 Meta­Reinforcement Learning

A recurring issue in RL is developing agents that show robust
generalisation capabilities. An obvious direction, then, is the application of meta-
learning methods to RL, which leads to the field of Meta-Reinforcement Learning. The
setting is similar to the standard RL setting, with the difference that instead of trying
to learn just one task / environment, the agent is trained on a distribution of tasks /
environments. During the inner-loop adaptation, a set of various tasks is presented
and, for each one, a new policy is approximated that might be slightly different from
the next. Then, in the outer loop, a general strategy is optimised with the aim of
learning the dynamics of transitioning between states, rewards and actions, in order to
quickly learn new tasks [88].

Generally, even though the tasks might be different, it is important that they share
similar internal dynamics and come from the same task distribution [85]. This leads
to a strong connection between task similarity, or how broad the task distribution
is, and the performance of the meta-RL agent. For example, training a robot hand
to grasp different types of objects can be considered a reasonable meta-RL problem,
since all the tasks share the dependency of learning the dynamics of the joints and
the movement of each individual finger. However, when the distribution becomes
too broad, such as including both teaching a robot how to walk and solving a Rubik’s
cube with its hand, it is expected that finding a shared meta-RL strategy becomes
considerably harder.

The number of different methods developed is extensive, along all three axes of meta-
learning (mentioned in 2.2.2) and RL’s algorithmic landscape. In many cases, MAML
has been used as the base meta-learning framework, with different methods adapting
and modifying it in new ways, such as incorporating recurrent methods and model-
based RL [56] or extending it into a probabilistic framework [75]. A large portion of
this field has focused on developing methods that make use of RNNs, or on trying to
give the agent some sort of memory in order to incorporate the learned knowledge
[87], [16], [54]. Furthermore, when learning a policy, latent representations or task
descriptors have been used efficiently to distinguish between the learnt skills and tasks
[35], [20], [85].

Occasionally, meta-RL is confused with multi-task RL, since they share many core
characteristics, such as training on a distribution of tasks. However, their objectives
are different: multi-task RL tries to optimise a single policy that solves the presented
tasks more efficiently than learning them individually, whereas meta-RL aims to learn
the dynamics of the training tasks in order to adapt fast to new tasks [90].

2.4 Related Work

2.4.1 Model­Agnostic Meta­Learning

Finn et al. proposed Model-Agnostic Meta-Learning (MAML) as an intuitive and
straightforward method to learn the initial parameters of a neural network such that,
in a few updates, the model is able to generalise and adapt to new tasks (figure
2.4.1). The concept is based on the idea that a network’s internal representation θ of a
distribution of tasks p(T ) will share features across different tasks and can be quickly
adjusted for new tasks from the same distribution.

Following the notation introduced in 2.2.1, the initial model is represented as a
function fθ with parameters θ, with the base objective (inner loop) of adapting to a task
Ti = {Di, Li} from the task distribution p(T ). In this case, the meta-learnt knowledge
ω is the initial values of the parameters θ (the representation of the network). Thus,
the parameters start from the point ω and evolve to θi∗ for each Ti during the inner
loop. For simplicity, since ω and θ refer to the same component of the model, in the
following equations the initial point of the parameters will simply be noted as θ (figure
2.4.1).

Figure 2.4.1: High-level concept of MAML’s initial representation θ adapting to three
different tasks. Figure from [22].

Adaptation is performed with gradient descent starting from the parameters θ, while not
modifying them directly (this set is kept for the meta-optimisation update) but keeping
a separate set θi∗ for each task Ti, which can be updated either once or multiple times⁵.
This can be perceived as one meta-learner θ evolving into multiple learners θi, each
specialised for a different task:

θi∗ ← θi∗ − α ∇θ Li(θ; (D^support_train)i)    (2.16)

where α stands for the inner-loop (adaptation) learning rate.

⁵ MAML supports multiple consecutive gradient-descent updates on each learner on each task.

The meta-objective of this method is to ”optimise for the performance of fθi∗ with
respect to θ across tasks sampled from p(T )” [22], as seen in 2.17:

min_θ Σ_{Ti∼p(T )} Li(θi∗; (D^query_train)i)    (2.17)

The meta-optimisation (outer loop) updates the meta-learner’s parameter initialisation θ with


gradient descent based on all the learners’ parameters θi∗ (now fixed):

θ ← θ − β ∇θ Σ_{Ti∼p(T )} Li(θi∗; (D^query_train)i)    (2.18)

where β stands for the outer-loop (meta) learning rate.
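Equations 2.16-2.18 can be illustrated numerically on toy 1-D quadratic ”tasks” Li(θ) = (θ − ai)², whose gradients are available in closed form (a made-up example, not the thesis's setup):

```python
import random

ALPHA, BETA = 0.1, 0.05                        # inner (alpha) and outer (beta) rates
rng = random.Random(0)
sample_task = lambda: rng.uniform(-1.0, 1.0)   # a task is identified by its optimum a_i
grad = lambda theta, a: 2.0 * (theta - a)      # dL_i/dtheta

theta = 5.0                                    # meta-parameters, deliberately far off
for _ in range(2000):
    tasks = [sample_task() for _ in range(4)]
    # Inner loop (eq 2.16): one adaptation step per task, starting from theta
    adapted = [theta - ALPHA * grad(theta, a) for a in tasks]
    # Outer loop (eq 2.18): gradient of L_i at the adapted point w.r.t. the initial
    # theta; for one inner step on a quadratic it is grad(theta_i*, a_i) * (1 - 2*ALPHA)
    meta_grad = sum(grad(t_i, a) * (1.0 - 2.0 * ALPHA) for t_i, a in zip(adapted, tasks))
    theta -= BETA * meta_grad

a_new = sample_task()                          # unseen meta-test task
before = (theta - a_new) ** 2                  # loss at the meta-learned initialisation
after = (theta - ALPHA * grad(theta, a_new) - a_new) ** 2  # loss after one adaptation step
print(abs(theta) < 0.5, after < before)  # True True: theta sits near the task mean
```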

2.4.2 Variations

A long list of MAML variations has been proposed since its publication, focusing on
different components of the training process.

Regarding the meta-optimiser: During the outer-loop update, the meta-
optimisation requires computing a gradient through a gradient (a ”meta-gradient”)
to update the final θ∗ parameters. One way of calculating these second-
order derivatives is directly, through a Hessian-vector product using automatic
differentiation. However, this method is usually highly computationally expensive.
The ”cheaper” alternative is a first-order approximation of these derivatives,
which has been shown not to cause any significant decrease in MAML’s performance
when ReLU activation functions are used [22]. This approach has been further studied
and improved upon, leading to a MAML variation named Reptile [57].

Regarding the meta-knowledge (parameter space): Fast Context Adaptation
via Meta-Learning (CAVIA) follows an intuitive idea: across different tasks, some
parameters will be similar and can possibly be shared, whereas others might be more
fitting to the context of each task. During the adaptation phase, only the context
parameters are updated for the task, while the meta-learned shared parameters are
still used [93].

Regarding general improvements: One of the recurring issues of MAML
is its high computational cost and its sensitivity to hand-picked hyper-parameter
values, leading to considerable learning instabilities. Antoniou et al. try to tackle
these problems with their variation, MAML++. This includes learning the inner-loop
hyper-parameters, using cosine annealing for the outer learning rate, stabilising the
gradients’ volatility through multi-step loss optimisation and more, to improve the
overall optimisation process [2].

Regarding applications in other fields: MAML has seen adaptations to fields such
as imitation learning [26], online learning [24], latent embedding optimisation [69],
and probabilistic [25] and Bayesian frameworks [32]. In addition to its leading success in
few-shot classification⁶ and regression, MAML has been applied in different ways in
various RL settings [56], [33], [89], [78].

2.4.3 Insights on MAML

Finn et al. suggested that MAML’s final meta-learnt initialisation parameters are
primed for rapid learning, meaning that changes in the representation will lead
to significant improvements when adapting to a task drawn from the same task
distribution. They also presented MAML as a way to ”maximise the sensitivity of
the loss functions of new tasks with respect to the parameters”. This perspective
suggests that the knowledge gained by the model concerns efficient gradient-
based learning, not necessarily the dataset’s features themselves. This
view is further supported by additional experiments with MAML examining
its resilience to over-fitting and comparing it with conventional Deep Learning (DL)
methods [23].

2.4.4 Almost No Inner Loop Variation

A recent publication by Raghu et al. argues that the view of MAML presented
above (2.4.3) is not accurate and that the model does not actually learn to learn.
Instead, they argue, it already incorporates high-quality features of the task distribution,
and this is what leads to the successful adaptation to new test tasks [62]. They
provided evidence for this hypothesis by running experiments where they froze the
network body representation (hidden layers) and only let the head (final output layer)
adjust during the inner-loop adaptation phase. Additional experiments indicated
that during meta-testing the network body barely needs to be updated for the same
performance to be maintained, and that it is even possible to completely remove the
network head, relying only on the learnt representations. This simplified and
computationally cheaper MAML variation, called ANIL, contributed insightful
observations on the inner mechanisms of MAML.

⁶ Top-10 performance in few-shot classification benchmarks on [Link] as of this date,
despite being one of the oldest methods.

In the ANIL paper, experiments were performed for few-shot image classification
and meta-RL. For the former setting, the two datasets Omniglot and MiniImageNet
were used. For the meta­RL setting, the MuJoCo environments of HalfCheetah­
Direction, HalfCheetah­Velocity and 2D­Navigation were used. Even though these
frameworks are part of the most popular benchmarks for meta-learning methods,
they might not be sufficient to prove the hypothesis stated previously. As one of the
ANIL paper’s reviewers criticises⁷, the train and test tasks are from the same, relatively
narrow dataset, where feature reuse might be enough to provide good performance.
Specifically for image classification, both Omniglot and MiniImageNet are rather
trivial tasks compared to the recently proposed Meta-Dataset benchmark, which was
explicitly developed for the evaluation of meta-learning methods [84]. Moreover, the
2D-Navigation environment was introduced in [22] as a toy baseline RL problem, and the
HalfCheetah benchmark is known to be easily solvable with simple linear policies or
random search [50].

This leads to the question of whether ANIL would be able to perform similarly to
MAML on more challenging environments with a broader task distribution. Currently,
some of the best candidates for examining its robustness would be
Meta-Dataset and Meta-World, for image classification and reinforcement learning
respectively. Due to Meta-Dataset’s immense size and the incredible computational
power required to train on it, it was deemed out of the scope of this thesis and is
left for future work. Therefore, this thesis is mainly focused on experiments on the
meta-RL benchmark titled Meta-World, as examined in chapter 3.

⁷ Comment on OpenReview by an anonymous expert reviewer on ANIL: [Link]forum?id=rkgMkCEtPB&noteId=H1xctUU2oB

Chapter 3

Methodology

3.1 Experimental setup


In order to carry out experiments and comparisons between MAML and ANIL
(following the questions outlined in section 1.2), robust implementations of the two
algorithms were required. These were developed with the help of the meta-learning
library learn2learn [46], using the PyTorch DL framework. Since these meta-learning
algorithms are model-agnostic, they are used in combination with other neural-
network-based methods, depending on the problem setting. In this project, a standard
CNN approach was used for image classification and a fully connected MLP was used
for reinforcement learning (as described in more detail in 3.2.2, 3.2.3 and 3.3.1).

The main focus of the thesis is on the RL experiments but, because of the high
complexity and computational cost of implementing and running meta-RL
methods, the baselines were first set on image classification tasks. Since the methods
are ”model-agnostic”, the basic core structure of the code is shared across the different
application fields. After establishing reliable implementations of the algorithms, by
reproducing the results of their original papers on the baseline tasks, they were then
evaluated on RL.

The evaluation of the algorithms was performed in three parts. First, the actual
meta-training progress is monitored by logging training and validation metrics
during learning. These metrics can provide insights into the stability and convergence
rate of each algorithm, showcasing their robustness or signs of over-fitting or under-
fitting. Secondly, their final performance is evaluated during meta-testing, where the
meta-trained policies are examined on a series of unseen tasks. The resulting metrics
measure the success of the method on the problem at hand (accuracy for classification,
accumulated reward & success rate for RL). Finally, another experimental setting was
proposed to measure the representation similarity of the two models after training, as
described in 3.1.2.

Smoothing factor: In cases where the results were too noisy, a smoothing coefficient
was used for better visibility (as seen in the example in figure 3.1.1). This factor can be
adjusted with a value from 0 (no smoothing) to 1 (maximum smoothing) and its formula
is based on the exponential moving average.

Figure 3.1.1: Example of using a smoothing factor. Figure 3.1.1a does not utilise
smoothing (value 0) and figure 3.1.1b uses a smoothing value of 0.9.
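The smoothing factor described above is, in effect, an exponential moving average s_t = w · s_{t−1} + (1 − w) · x_t with weight w in [0, 1]. A minimal sketch (one common variant, initialised with the first value):

```python
def smooth(values, weight):
    """Exponential moving average: s_t = weight * s_{t-1} + (1 - weight) * x_t."""
    smoothed, last = [], values[0]   # initialise from the first value
    for x in values:
        last = weight * last + (1.0 - weight) * x
        smoothed.append(last)
    return smoothed

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(smooth(noisy, 0.0) == noisy)                # True: weight 0 leaves the curve unchanged
print([round(v, 3) for v in smooth(noisy, 0.9)])  # a heavily damped version of the curve
```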

3.1.1 Model­Agnostic & Problem­Agnostic Formulation

The objective of MAML is to learn the parameters θ of a model fθ in order to achieve fast
adaptation to new, unseen tasks during meta-testing. Specifically, the idea of MAML
is based on updating the gradient-based learning rule in a way that leads to rapid
progress on any task drawn from the same task distribution p(T ). That is, the model
parameters are sensitive to changes across different tasks, such that small changes in
the parameter space lead to a great gain in performance on any task of p(T ). On the
other hand, using an almost identical method, ANIL is based on the idea that such
a meta-learning procedure leads to learning a representation of the data strong enough
that no change in the representation is needed in order to adapt to new test tasks
from p(T ). A general, model-agnostic formulation of MAML and ANIL


is given in Algorithms 1 and 2, respectively.

Algorithm 1: Model-Agnostic Meta-Learning (MAML)

p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
Random initialisation of model parameters θ
while not converged do
    Sample a batch of tasks Ti = {(Dtrain)i, Li} ∼ p(T )
    for all Ti do
        For K samples evaluate: ∇θi Li(θi; (D_train^support)i)
        Update parameters of inner learner: θi ← θi − α ∇θi Li(θi; (D_train^support)i)
    end
    Update meta-learner's parameters: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θ; (D_train^query)i)
end
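The update rule above can be sketched on a toy distribution of 1-D linear-regression tasks. The sketch below uses the first-order approximation (often called FOMAML), which drops the second-derivative term of the full MAML update; the task family, hyper-parameter values and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(theta, x, y):
    # Squared-error loss L(theta) = mean((theta*x - y)^2) and dL/dtheta
    err = theta * x - y
    return float(np.mean(err ** 2)), float(np.mean(2 * err * x))

def meta_train(alpha=0.1, beta=0.05, meta_iters=300, k=10, tasks_per_batch=4):
    theta = rng.normal()                        # random initialisation of theta
    for _ in range(meta_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):        # sample a batch of tasks T_i
            a = rng.uniform(-2.0, 2.0)          # task T_i: y = a * x
            x_s, x_q = rng.normal(size=k), rng.normal(size=k)
            # inner loop: one adaptation step on the support set
            _, g = loss_and_grad(theta, x_s, a * x_s)
            theta_i = theta - alpha * g
            # outer loop: gradient of the query loss at the adapted theta_i
            # (first-order: evaluated at theta_i instead of differentiating
            # through the inner update)
            _, g_q = loss_and_grad(theta_i, x_q, a * x_q)
            meta_grad += g_q
        theta -= beta * meta_grad / tasks_per_batch
    return theta
```

After meta-training, a single inner-loop step at the returned θ should already reduce the loss on a freshly sampled task.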

Algorithm 2: Almost No Inner Loop (ANIL)

p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
Random initialisation of network body parameters θb
Random initialisation of network head (output layer) parameters θh
while not converged do
    Sample a batch of tasks Ti = {(Dtrain)i, Li} ∼ p(T )
    for all Ti do
        For K samples evaluate: ∇θhi Li(θhi; (D_train^support)i)
        Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; (D_train^support)i)
    end
    Update all of the network's parameters: θb+h ← θb+h − β ∇θb+h ∑_{Ti∼p(T )} Li(θb+h; (D_train^query)i)
end
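The same toy setup can illustrate ANIL's split: the inner loop touches only the head parameter while the body stays frozen. The two-parameter model below (a tanh "body" feeding a linear "head") is purely illustrative:

```python
import numpy as np

def predict(theta_b, theta_h, x):
    # "body" extracts a feature, "head" (output layer) maps it to a prediction
    return theta_h * np.tanh(theta_b * x)

def anil_adapt(theta_b, theta_h, x_support, y_support, alpha=0.1, steps=1):
    """Inner loop of ANIL: gradient steps on the HEAD only;
    the body parameter theta_b stays fixed during adaptation."""
    for _ in range(steps):
        features = np.tanh(theta_b * x_support)
        err = theta_h * features - y_support
        g_h = np.mean(2 * err * features)   # d(mean squared error)/d(theta_h)
        theta_h = theta_h - alpha * g_h
    return theta_h
```

The outer loop would then back-propagate the query loss through both θb and the adapted θh, exactly as in the MAML sketch.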

3.1.2 Representation Similarity Experiments

The objective of this experiment is to measure and compare the latent representations
of a network trained with MAML. This can be achieved by employing the Canonical
Correlation Analysis (CCA) metric [37]. Given the representations of two layers L1, L2
of a neural network, CCA outputs a similarity score indicating whether L1 and L2
share no similarity at all (value 0) or are identical (value 1).
By comparing the representations of the network before and after the inner loop
adaptation phase (during meta-testing), this metric illustrates whether the model
undergoes significant changes in its representation (rapid learning) or minimal
changes (feature reuse). This experiment follows a procedure similar to the one
presented in the ANIL paper [62] and aims to answer the first question of "What
does MAML actually learn: learning to learn or a high-quality feature
representation?".
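A minimal sketch of such a similarity score, computed as the mean canonical correlation between two activation matrices via SVD. Note this is a simplified stand-in for the analysis in [37] (which uses the closely related SVCCA), not the thesis implementation:

```python
import numpy as np

def cca_similarity(A, B, eps=1e-10):
    """Mean canonical correlation between two representations.

    A, B: (n_samples, n_features) activation matrices of two layers
    (or the same layer before/after adaptation). Returns a value in
    [0, 1]; 1 means the representations are identical up to an
    invertible linear transform, 0 means no linear relationship."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # Orthonormalise each representation via its SVD (whitening)
    Ua, Sa, _ = np.linalg.svd(A, full_matrices=False)
    Ub, Sb, _ = np.linalg.svd(B, full_matrices=False)
    Ua = Ua[:, Sa > eps]
    Ub = Ub[:, Sb > eps]
    # Singular values of Ua^T Ub are the canonical correlations
    corrs = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float(np.mean(corrs))
```

A layer compared with itself (or with any invertible linear transform of itself) scores close to 1, while two independent random representations score much lower.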

3.2 Few­shot Image Classification

The first experiments of this thesis concerned MAML and ANIL on image classification
tasks, specifically on the relatively small Omniglot and Mini-ImageNet datasets.
Reproducing the results of the original papers on these datasets provides some
confidence that the implementations were correct. The most common way to evaluate
meta-learning algorithms on classification tasks is through few-shot learning, where
the goal is to approximate a function from just a handful of input-output pairs per
task / label in order to classify new images that share features with previously seen
ones. For example, the model is introduced to a dataset of pictures of mammals where
each species has only a limited number of samples. Despite the constraint of limited
data per species, when a new species is encountered, the model is expected to identify
it easily from just a few pictures (few shots).
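The N-way, K-shot episode construction described above can be sketched as follows; the dict-based dataset layout, the extra query-set size and the function name are illustrative assumptions:

```python
import random

def sample_task(dataset, n_ways=5, k_shots=1, k_query=5, rng=random):
    """Sample one few-shot episode.

    dataset: dict mapping class label -> list of samples.
    Returns (support, query) lists of (sample, episode_label) pairs,
    where episode labels are re-indexed 0..n_ways-1 so the model
    cannot memorise global class identities."""
    classes = rng.sample(sorted(dataset), n_ways)
    support, query = [], []
    for label, cls in enumerate(classes):
        # draw support and query samples without replacement
        picks = rng.sample(dataset[cls], k_shots + k_query)
        support += [(s, label) for s in picks[:k_shots]]
        query += [(s, label) for s in picks[k_shots:]]
    return support, query
```

The support set drives the inner-loop adaptation and the query set the meta-optimisation, matching the (Dsupport)i / (Dquery)i split used in the formulation.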

3.2.1 Classification Formulation

In the case of few-shot learning, the model-agnostic formulation described in 3.1.1 is
adapted to use the cross-entropy loss, as seen in equation 3.1, for K shots (input-output
pairs) per class and N classes, making it an N-way, K-shot classification problem. Each
task Ti consists of a support set (Dsupport)i (inner loop adaptation) and a query set
(Dquery)i (meta-optimisation), each containing a total of N · K samples. The model fθ
in this case is a standard CNN trained with the Adam optimiser. The initial
hyper-parameter values were chosen based on the values reported in [62]. A limited
hyper-parameter search


was also employed, as seen in chapter 4.


Li(θ) = ∑_{xj,yj∼Ti} yj log fθ(xj) + (1 − yj) log(1 − fθ(xj))        (3.1)

The complete MAML and ANIL methods for few-shot classification are described
in detail in algorithms 3 and 4 respectively.
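For the N-way case, equation 3.1 generalises to the categorical cross-entropy over softmax outputs; a numerically stable numpy sketch (illustrative, not the thesis code):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean categorical cross-entropy, the N-way generalisation of eq. 3.1.

    logits: (batch, n_classes) raw network outputs,
    labels: (batch,) integer class indices."""
    # log-softmax with the max-subtraction trick for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # negative log-likelihood of the correct class, averaged over the batch
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

For uniform logits over N classes the loss equals log N, the familiar chance-level value.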

Algorithm 3: MAML for Few-Shot Classification

K: number of samples per class to adapt (shots)
N: number of classes per inner loop iteration (ways)
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
j: number of Ti tasks per outer loop iteration (meta batch size)
E: number of outer loop iterations (epochs)
Random initialisation of model's parameters θ
for E iterations do
    Sample j tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the parameters for this specific Ti: θi ← θ
        for m steps do
            Evaluate loss: ∇θi Li(θi; (D_train^support)i)
            Update inner learner: θi ← θi − α ∇θi Li(θi; (D_train^support)i)
        end
        Evaluate loss: ∇θi Li(θi; (D_train^query)i)
    end
    Update meta-learner: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θi; θ; (D_train^query)i)
end

3.2.2 Omniglot

The first vision dataset, Omniglot [45], is one of the most popular for evaluating meta-
learning algorithms in few-shot classification. It contains 1623 distinct handwritten
characters from 50 different alphabets, with 20 samples each. Each character sample is
a 28(H) x 28(W) x 1 (grayscale) image drawn by a different person (as seen in figure
3.2.1). As implemented in [22], 1200 characters were used for training, irrespective of
the alphabet, the rest were used for testing, and the dataset was augmented by adding
variations of the images rotated in multiples of 90 degrees. The few-shot setting
evaluated was 20-way (20 distinct tasks / characters)


Algorithm 4: ANIL for Few-Shot Classification

K: number of samples per class to adapt (shots)
N: number of classes per inner loop iteration (ways)
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
j: number of Ti tasks per outer loop iteration (meta batch size)
E: number of outer loop iterations (epochs)
Random initialisation of network body parameters θb
Random initialisation of network head parameters θh
for E iterations do
    Sample j tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the head parameters for this specific Ti: θhi ← θh
        for m steps do
            Evaluate loss: ∇θhi Li(θhi; (D_train^support)i)
            Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; (D_train^support)i)
        end
        Evaluate loss: ∇θhi Li(θhi; (D_train^query)i)
    end
    Update all of the network's parameters:
    θb+h ← θb+h − β ∇θ(b+h) ∑_{Ti∼p(T )} Li(θhi; θb+h; (D_train^query)i)
end


with 1 or 5 shots (1 or 5 samples per task / character). The model used was the same
network as described in [22]: a standard CNN with a 1x28x28 single-channel input,
4 convolutional layers of 64 units each with no max pooling, and a final fully connected
layer of 64 units with an output size of 20 (one for each class).

Figure 3.2.1: Samples from the Omniglot dataset.

3.2.3 Mini­ImageNet

The Mini-ImageNet dataset is a subset of the ImageNet dataset, specifically tuned for
the few-shot learning setting [86]. It requires fewer computational resources than the
original ImageNet but remains a difficult problem due to the large variety of images
included. Mini-ImageNet contains 100 classes, each with 600 samples of 84(H) x
84(W) x 3 (RGB) images (as seen in figure 3.2.2). The few-shot setting evaluated was
5-way (5 distinct tasks / classes) with 1 or 5 shots (1 or 5 samples per task / class). In
the case of Mini-ImageNet, the standard network described in [65] was used, with a
3x84x84 three-channel input, 4 convolutional layers of 32 units each with max pooling
(max pooling factor = 1), and a final fully connected layer of 800 units with 5 outputs.


Figure 3.2.2: Samples from the Mini­ImageNet dataset.

3.3 Meta­Reinforcement Learning


The goal of the meta-RL setting is to develop an agent that rapidly learns a policy for a
new test task with only a few iterations of experience, utilising previous knowledge
gained from similar tasks of the same distribution p(T ). Training such an agent
usually requires considerable computational power due to the extensive number of
interactions it has to perform with the environment. Moreover, RL environments
that fit the meta-learning setting and are practical to train on are very limited. Apart
from the Particles2D and Meta-World environments, the recently published Procgen
environment by OpenAI [11] was evaluated in this project. However, due to poor
results and the high computational cost required to successfully train models, it was
eventually deemed out of the scope of this project and the relevant study has been
moved to Appendix B.

3.3.1 Meta­RL Formulation

In the setting of meta-RL, each task Ti is defined as an MDP with a finite horizon H,
an initial state distribution qi(s1) and a transition distribution qi(st+1|st, at). During
the inner loop / adaptation phase, the agent fθi is able to sample a limited number of
episodes from each task Ti with the goal of quickly developing a policy πi for each loss
Li. Note that there is no restriction on which properties of the MDP need to be shared
across tasks, meaning that different tasks can have different reward functions or
transition distributions.


As mentioned in 2.3.1, in most problems the reward function and transition
distribution are unknown. Additionally, the unknown dynamics of the environment
usually make the expected reward, which we want to maximise, non-differentiable.
In such cases, policy gradient methods can be used to approximate the gradients for
the inner (adaptation phase) and outer (meta-optimisation) loops. A significant
point, which also increases complexity, is that since such methods are on-policy, new
roll-outs from each individual policy πi need to be sampled for each additional
adaptation update / gradient step. The most common general form of a policy
gradient method defines the loss objective for a specific task Ti as the negative
expectation, over a batch of samples, of the log-policy weighted by the reward:

Li(πi) = − E_{st,at∼πi} [ ∑_{t=1}^{H} log πi(at|st) Ri(st, at) ]        (3.2)

where Ri is the reward function of task Ti.

Along with the policy, another model approximating the value function Vϕ^π(st) is
updated in parallel so that it always reflects the most recent policy. Such models
estimating Vϕ^π(st) are also called baselines. A simple linear feature baseline model
was used in this case, fit to each task in each iteration to compute the state-value
function Vϕ^π(s) by minimising the least-squares distance, as first proposed in [15]
and later adopted in [22]:

ϕ = arg min_ϕ E_{st,R̂t∼π} [ (Vϕ^π(st) − R̂t)² ]        (3.3)
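A sketch of such a least-squares baseline fit. The concrete feature map (observations, their squares, and powers of normalised time) follows the spirit of the linear feature baseline of [15], but the exact features and interface here are illustrative assumptions:

```python
import numpy as np

def fit_linear_baseline(obs, returns):
    """Least-squares fit of V(s_t) ~ phi(s_t) @ w (eq. 3.3).

    obs:     (T, obs_dim) observations of one episode,
    returns: (T,) empirical discounted returns R_t.
    Returns the fitted state values V(s_t) for the episode."""
    t = np.arange(len(obs))[:, None] / 100.0      # normalised time step
    phi = np.concatenate(
        [obs, obs ** 2, t, t ** 2, t ** 3, np.ones_like(t)], axis=1)
    # minimum-norm least-squares solution of phi @ w ~ returns
    w, *_ = np.linalg.lstsq(phi, returns, rcond=None)
    return phi @ w
```

The fitted values are then subtracted from the returns to reduce the variance of the policy gradient, as discussed next.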

To reduce the variance of the policy gradient estimates of the state values, while
keeping the bias at a tolerable level through a bias-variance trade-off parameter, the
Generalised Advantage Estimator (GAE) method proposed in [73] was used. The
basic concept of the advantage function is to provide further insight into how much
better or worse the current policy is relative to the average action. It is defined in
equations 3.4-3.8 and is based on the Temporal Difference (TD) error
rt + γV(st+1) − V(st) [82].

Ât^{(1)} := δt^V = −V(st) + rt + γV(st+1)        (3.4)

Ât^{(2)} := δt^V + γδ_{t+1}^V = −V(st) + rt + γrt+1 + γ²V(st+2)        (3.5)

Ât^{(3)} := δt^V + γδ_{t+1}^V + γ²δ_{t+2}^V = −V(st) + rt + γrt+1 + γ²rt+2 + γ³V(st+3)        (3.6)

leading to a sum consisting of the returns (γ-discounted rewards) and the negative
baseline term −V(st) for a horizon of length H (eq. 3.7), and finally adding a bias-
variance trade-off factor τ (eq. 3.8):

Ât^{(H)} = ∑_{l=0}^{H} γ^l δ_{t+l}^V = −V(st) + ∑_{l=0}^{H} γ^l rt+l        (3.7)

Ât^{GAE(γ,τ)} := (1 − τ)(Ât^{(1)} + τÂt^{(2)} + τ²Ât^{(3)} + ...) = ∑_{l=0}^{H} (γτ)^l δ_{t+l}^V        (3.8)

Thus, the loss becomes:

Li(πi) = − E_{st,at∼πi} [ ∑_{t=1}^{H} log πi(at|st) Ât^{GAE} ]        (3.9)

Network architecture: The agent model fθ used to approximate the optimal policy π
for all Ti was a standard MLP, as also used in the MAML and ANIL papers, with two
fully connected layers of 100 units each and a ReLU activation function for
Particles2D and tanh for Meta-World. In order to train the meta-RL agent, MAML
and ANIL were tested with two different policy gradient methods, TRPO [74] and
PPO [74] (as seen in algorithms 5 and 6).

In contrast to vanilla policy gradient methods, where an update does not change the
policy much in terms of parameter values, TRPO tries to take the largest update step
possible to improve performance, while following a KL-divergence constraint to
control the "distance" between the old and new policy. Even though this approach
involves a complex second-order method with multiple tunable hyper-parameters, it
significantly improves sample efficiency and usually speeds up convergence. The loss
objective, or surrogate advantage, L(θt, θ) is defined as the measure of performance
of policy πθ relative to the previous policy πθt on episodes from the old policy²:

L(θt, θ) = E_{s,a∼πθt} [ (πθ(a|s) / πθt(a|s)) A^{πθt}(s, a) ]        (3.10)

and the KL-divergence constraint between the two policies, on states taken from the
old policy, is defined as:

DKL(θ||θt) = E_{s∼πθt} [ DKL(πθ(·|s) || πθt(·|s)) ]        (3.11)

Then, a general TRPO update can be summed up as:

θt+1 = arg max_θ L(θt, θ)   s.t.   DKL(θ||θt) ≤ ϵ        (3.12)

where ϵ is a parameter, usually a small value (around 0.01), that controls the KL-
divergence limit for the TRPO update³.

Based on the same motivation of improving sample efficiency by cautiously taking
large steps towards better policies while avoiding performance collapse, PPO follows
a much more straightforward first-order method, while offering results competitive
with TRPO. In this project, the PPO-Clip variant was used, which omits the KL-
divergence constraint term and instead motivates the new policy to stay close to the
old one through the advantage function⁴. The complete loss objective can be defined
as:

L(s, a, θt, θ) = min( (πθ(a|s) / πθt(a|s)) A^{πθt}(s, a),  g(ϵ, A^{πθt}(s, a)) )        (3.13)

where the clipping term g, defined as:

g(ϵ, A) = (1 + ϵ)A  for A ≥ 0,   (1 − ϵ)A  for A < 0        (3.14)

acts like a regulariser that limits how much the objective can increase, stopping the
new policy from diverging too far. Then, the PPO-Clip update can be summed up as:

θt+1 = arg max_θ E_{s,a∼πθt} [ L(s, a, θt, θ) ]        (3.15)

² Here we explicitly denote that the policy π depends on the parameters θ by writing πθ.
³ More details on the TRPO algorithm in OpenAI's Spinning Up documentation: https://[Link]/en/latest/algorithms/[Link]
⁴ More details on the PPO algorithm in OpenAI's Spinning Up documentation: https://[Link]/en/latest/algorithms/[Link]

Algorithm 5: MAML for RL using a Policy Gradient method

K: number of episodes per task to adapt (adapt batch size | 'shots')
N: number of tasks per inner loop iteration (meta batch size | 'ways')
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
E: number of outer loop iterations (epochs)
Random initialisation of linear baseline parameters ϕ
Random initialisation of model's parameters θ
for E iterations do
    Sample N tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the parameters for this specific Ti: θi ← θ
        for m steps do
            Sample K support episodes on task Ti with policy πi^support
            Update the baseline model Vϕ^π(s) (eq. 3.3)
            Compute advantage estimates Ât^{GAE} (eq. 3.8)
            Compute loss: ∇θi Li(θi; πi^support), eq. 3.10 (TRPO) or eq. 3.13 (PPO)
            Compute parameters θi for the new policy πi^support: eq. 3.12 (TRPO) or eq. 3.15 (PPO)
            Update parameters of inner learner: θi ← θi − α ∇θi Li(θi; πi^support)
        end
        Sample K query episodes on task Ti with the adapted model θi for policy πi^query
        Compute loss on query episodes: ∇θi Li(θi; πi^query)
    end
    Update meta-learner: θ ← θ − β ∇θ ∑_{Ti∼p(T )} Li(θ; πi^query)
end

3.3.2 Particles2D

The first RL setting the algorithms were tested on was a trivial 2D game-like problem
developed by Finn et al. in [22]. Given a random point-goal in a 2D square, the agent
should move to these coordinates, with some velocity, without having explicit access
to them. The observation is simply the agent's current coordinates in the square and
the actions correspond to velocity values for movement within the range

Algorithm 6: ANIL for RL using a Policy Gradient method

K: number of episodes per task to adapt (adapt batch size | 'shots')
N: number of tasks per inner loop iteration (meta batch size | 'ways')
p(T ): task distribution
α: Inner loop learning rate
β: Outer loop learning rate
m: number of inner loop updates (adaptation steps)
E: number of outer loop iterations (epochs)
Random initialisation of linear baseline parameters ϕ
Random initialisation of network body parameters θb
Random initialisation of network head parameters θh
for E iterations do
    Sample N tasks: Ti ∼ p(T )
    for every Ti do
        Clone a copy of the head parameters for this specific Ti: θhi ← θh
        for m steps do
            Sample K support episodes on task Ti with policy πi^support
            Update the baseline model Vϕ^π(s) (eq. 3.3)
            Compute advantage estimates Ât^{GAE} (eq. 3.8)
            Compute loss: ∇θhi Li(θhi; πi^support), eq. 3.10 (TRPO) or eq. 3.13 (PPO)
            Compute parameters θhi for the new policy πi^support: eq. 3.12 (TRPO) or eq. 3.15 (PPO)
            Update head parameters of inner learner: θhi ← θhi − α ∇θhi Li(θhi; πi^support)
        end
        Sample K query episodes on task Ti with the adapted model θhi for policy πi^query
        Compute loss on query episodes: ∇θhi Li(θhi; πi^query)
    end
    Update all of the network's parameters: θb+h ← θb+h − β ∇θb+h ∑_{Ti∼p(T )} Li(θb+h; πi^query)
end


Figure 3.3.1: Sample of a MAML-trained agent adapting to a position in Particles2D.
Figure taken from [22].

[−0.1, 0.1]. The reward function is the negative squared distance of the agent from the
point, and an episode ends either when the agent is within 0.01 of the goal coordinates
or when the time horizon H = 100 is reached. The initial hyper-parameter values
were chosen based on the values reported in [62].
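A minimal re-implementation sketch of this environment; the goal-sampling bounds and the interface are approximations for illustration, not the exact implementation of [22]:

```python
import numpy as np

class Particles2D:
    """Toy point-goal navigation task in the spirit of Particles2D [22]."""

    def __init__(self, horizon=100, rng=None):
        self.horizon = horizon
        self.rng = rng or np.random.default_rng()

    def reset_task(self):
        # sample a new point-goal; the agent never observes it directly
        self.goal = self.rng.uniform(-0.5, 0.5, size=2)

    def reset(self):
        self.pos = np.zeros(2)
        self.t = 0
        return self.pos.copy()              # observation = agent coordinates

    def step(self, action):
        # actions are velocities clipped to [-0.1, 0.1]
        self.pos += np.clip(action, -0.1, 0.1)
        self.t += 1
        dist2 = float(np.sum((self.pos - self.goal) ** 2))
        reward = -dist2                     # negative squared distance
        done = np.sqrt(dist2) < 0.01 or self.t >= self.horizon
        return self.pos.copy(), reward, done
```

Each call to `reset_task` defines a new task Ti from p(T ), while `reset`/`step` provide the usual episode interface.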

3.3.3 Meta­World

A notable gap in the meta-RL research community is a standard benchmark that is
easy to use and computationally reasonable, yet challenging enough to evaluate the
generalisation capabilities of algorithms. The most commonly used benchmark
environments in meta-learning so far have been the Half-Cheetah and "ant" tasks of
MuJoCo, which have been shown to be limited in complexity and not suitable for
benchmarking meta-learning [50].

To address this, Yu et al. have recently developed an open source simulated
environment, compatible with the OpenAI Gym interface, specifically designed to
benchmark meta-learning and multi-task learning algorithms [90]. Meta-World
consists of 50 unique simulated manipulation tasks for a Sawyer robot arm, such as
opening a door handle, grasping an object, or reaching for a point on a table (as seen
in figure 3.3.2). Even though these tasks are diverse, they share a common structure
which enables similar tasks to be learned efficiently. Based on the premise that meta-
learning algorithms should be able to learn multiple tasks and generalise to new
ones, this environment sets high standards by broadening the task distribution and
providing an extensive evaluation protocol of variable difficulty.

Figure 3.3.2: Screenshot of the task "push", in which the robot arm needs to push an
object (puck) to specific coordinates (goal).

Specifically, all the tasks require the arm to interact with different objects and shapes,
using different joints, in order to combine acts of pushing, grasping and reaching. The
observation and action spaces are shared across tasks, meaning their dimensions
remain the same. The observations consist of a 9-dimensional 3-tuple of the 3D
Cartesian positions of the end-effector, the object (or object #1) and the goal (or
object #2). The actions consist of 3D end-effector positions of the robot arm.
Depending on the task, a specific reward function has been engineered. The initial
hyper-parameter values were chosen based on the values reported in [90].
Meta-World provides three scenarios of increasing difficulty:

ML1: As a more computationally feasible baseline, the first setting consists of one
specific task (in the case of this thesis the task "push" was chosen) with 50 random
initial object and goal positions and another set of 10 held-out positions, resembling
the previous setting of Particles2D.

ML10: To properly evaluate generalisation to new tasks, the algorithm is meta-trained
on 10 specific tasks and then evaluated on a held-out set of 5 more tasks with similar
structure. For each task, the positions of the object and the goal are randomised.

ML45: Similar to ML10 but with 45 training tasks instead. Due to its demanding
complexity and the respective computational cost required, this setting was deemed
outside the scope of the thesis and is left for future work.

Success rate metric: For Meta-World, another metric is introduced to monitor the
performance of the agents. Because reward values do not always provide a clear
picture of the effectiveness of a policy, Yu et al. propose measuring an agent's
performance on Meta-World by its success rate on a task. This metric is defined based
on the distance of the object o (depending on the task) from the final goal position g,
e.g. ||o − g||₂ < ϵ, where ϵ is a small distance threshold, holding for a consecutive
number of steps⁵.

⁵ For more details on each of the success metrics, see the complete list in [90].
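The metric can be sketched as follows; the threshold and streak-length values are illustrative, since each Meta-World task defines its own:

```python
import numpy as np

def is_success(obj_pos, goal_pos, eps=0.05):
    """Single-step success test: object within eps of the goal,
    i.e. ||o - g||_2 < eps. The threshold 0.05 is an example value."""
    o = np.asarray(obj_pos, dtype=float)
    g = np.asarray(goal_pos, dtype=float)
    return bool(np.linalg.norm(o - g) < eps)

class SuccessTracker:
    """Counts how many consecutive steps the object stayed within eps
    of the goal; success fires once the streak reaches `needed` steps."""

    def __init__(self, eps=0.05, needed=3):
        self.eps, self.needed, self.streak = eps, needed, 0

    def update(self, obj_pos, goal_pos):
        near = is_success(obj_pos, goal_pos, self.eps)
        self.streak = self.streak + 1 if near else 0
        return self.streak >= self.needed
```

An episode is then scored as a success if the tracker ever fires before the horizon is reached.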

Chapter 4

Results

4.1 Few­Shot Image Classification

This section presents results regarding the meta-training, meta-testing and
representation similarity of MAML and ANIL on the Omniglot and Mini-ImageNet
datasets for few-shot classification. These models were developed for the purpose of
reproducing the results reported in [62], ensuring the implementations were correct
and validating their findings. The experiments consisted first of hyper-parameter
searches for each dataset (more in C.1) and then of reporting the results of the best
models, as seen next.

Firstly, in the case of Omniglot, both methods showed similarly stable performance
during meta-training and converged quickly to their final performance, as seen from
their validation accuracy in figure 4.1.1. In the case of Mini-ImageNet, the methods
again managed to converge in fewer than 5,000 epochs, but not as quickly or as
stably as on Omniglot, as seen from the more volatile validation accuracy in both
panels of figure 4.1.2. This is speculated to be due to the larger variance across
coloured RGB images in the Mini-ImageNet dataset compared to the more uniform
grayscale images in Omniglot. Further tests are needed to support this speculation.


(a) 20­way, 5­shot setting

(b) 20­way, 1­shot setting

Figure 4.1.1: Validation accuracy for MAML and ANIL on the Omniglot dataset for
20-way, 5-shot (4.1.1a) and 20-way, 1-shot (4.1.1b) classification. Each line is the
mean value and the shaded area is the standard deviation across three seeds.
Smoothing factor 0.8.


(a) 5­way, 5­shot setting

(b) 5­way, 1­shot setting

Figure 4.1.2: Validation accuracy for MAML and ANIL on the Mini-ImageNet dataset
for 5-way, 5-shot (4.1.2a) and 5-way, 1-shot (4.1.2b) classification. Each line is the
mean value and the shaded area is the standard deviation across three seeds.
Smoothing factor 0.8.

During meta-testing, MAML showed slightly better performance than ANIL in most
cases and the results were similar to the findings of [62], as seen in table 4.1.1.
However, further statistical tests would be required to provide concrete evidence for
these observations: comparing mean test accuracies across only three seeds gives too
small a sample size for meaningful statistical hypothesis tests.


Dataset            Omniglot                   Mini-ImageNet
Setting        20-w, 1-s    20-w, 5-s     5-w, 1-s     5-w, 5-s
MAML          91.7 ± 0.4   96.8 ± 0.1   48.3 ± 1.2   63.4 ± 1.3
ANIL          91.3 ± 0.8   96.9 ± 0.6   43.7 ± 0.9   58.9 ± 1.9
MAML [62]     93.7 ± 0.7   96.4 ± 0.1   46.9 ± 0.2   63.1 ± 0.4
ANIL [62]     96.2 ± 0.5   98.0 ± 0.3   46.7 ± 0.4   61.5 ± 0.5

Table 4.1.1: Meta-testing results of MAML and ANIL on few-shot image classification.
The reported results for each model are the mean and standard deviation across three
different seeds. The bottom two rows are as reported in [62].

One important factor to consider when comparing these algorithms is the difference
in computational resources required to train each one. Training MAML on
Mini-ImageNet for 5-way, 1-shot classification took approximately 1 hour 50
minutes¹, whereas ANIL took only 55 minutes for the same setting, half the training
time. In other words, similar, or only slightly worse, performance could be achieved
with ANIL at half the computational cost.

Finally, when compared to the results presented in [62], the representation of the
neural network seemed to change even less during adaptation: most of the network
remains unchanged before and after the inner loop (figures 4.1.3a-4.1.3b).

¹ On a CUDA-enabled RTX 2060 S GPU.


(a) Figure from [62].

(b) Our reproduced results.

Figure 4.1.3: Representation change before and after adaptation of MAML in a 5-way,
5-shot setting on Mini-ImageNet. Reported results are the mean and standard
deviation over 5 different test tasks with 3 different seeds. 4.1.3a is from [62]; 4.1.3b
shows our reproduced results.


4.2 Meta­Reinforcement Learning

Since meta-RL is difficult to train and complex to debug, the experiments progressed
gradually from a trivial toy problem, to an easy baseline, and finally to the most
challenging setting. Additionally, a long series of experiments needed to be run at
each stage before results that could provide any insights were obtained. Such results
are reported in Appendix C instead of this section, to prevent an overload of
information and a disruption of the flow of the report.

4.2.1 Particles 2D

As a sanity check of the performance of the methods, the basic 2D environment
"Particles2D" introduced with MAML [22] was used. In figure 4.2.1, a comparison
between the PPO and TRPO versions of MAML is shown. As a baseline, the
performance of an agent taking random actions is also included in figure 4.2.1 to
provide a point of comparison for the rest of the models. As illustrated in figure 4.2.1,
the methods with the best and most consistent results were the TRPO versions of
MAML and ANIL, whereas PPO's performance fluctuated and was less stable.

Figure 4.2.1: Performance comparison of different methods on Particles2D.
Smoothing factor 0.9.


4.2.2 ML1: Push

In the Meta-World environment, there are three single-task benchmarks used as
baselines: "push", "pick-and-place" and "reach" (as seen in the original paper [90]).
For this project, the setting "push" was used as a baseline before moving to the more
demanding ML10 benchmark. Firstly, in figure 4.2.2 a quick comparison between the
PPO and TRPO versions of MAML and ANIL indicates that, again, the TRPO variant
provides better and more consistent performance.

Figure 4.2.2: Comparison of methods for the task ML1: Push.

The models with the best performance during the search were then trained for 3,000
epochs, as shown in figure 4.2.3a. Both models achieved a 100% success rate on 10
unseen tasks (new goal positions) during meta-testing. Another interesting note is
that MAML-TRPO in this case shows better performance than the MAML-PPO
results reported in the original Meta-World paper (as seen in figure 4.2.3b).


(a) MAML-TRPO and ANIL-TRPO on ML1: Push trained until convergence.
Smoothing factor 0.9.

(b) Figure from [90] illustrating the performance of MAML on ML1: Push.

Figure 4.2.3: MAML performance on ML1: Push compared to ANIL (a) and other
methods (b) from [90].


4.2.3 ML10

Finally, the results for the most challenging problem of Meta-World, ML10, are
presented. In figure 4.2.4a, a comparison of agents trained with different policies is
shown. In addition to the meta-learning methods, the vanilla TRPO and PPO
methods were also trained, to showcase the limitations of non-meta-learning
methods in an environment where a single policy needs to perform well on multiple
tasks. Following the trend from the previous datasets and environments, the
MAML-TRPO and ANIL-TRPO methods outperform the rest during meta-training,
as seen in figure 4.2.4a. After this comparison, they were trained even longer, for
3,000 epochs (figure 4.2.4b), each one taking approximately 13 days to complete².
Contrary to the few-shot classification setting, in which ANIL showed high
computational efficiency, in the case of Meta-World the difference in computational
cost is negligible³. This is probably because most of the computational resources are
spent sampling interactions of the agent with the environment, so back-propagating
through the whole network versus just the final layer does not amount to a
considerable difference.

Specifically, in ML10 the agents were trained on the 10 tasks pick-and-place,
reaching, button-press, window-opening, pushing, sweep-into, drawer-closing,
dial-turning, peg-insertion-side and basketball, and then tested on drawer-opening,
door-closing, shelf-placing, sweep and lever-pulling. Meta-testing on these 5 new
tasks consisted of the agent having only a single inner loop (gradient descent) update
to adapt to each task, sampling 10 episodes ("shots") from each (as indicated in
[90]). Meta-testing was also performed on the train tasks to establish the final
performance of the meta-learnt policy. When a task is sampled, the arm / object /
goal positions are randomly generated every time, which means that for some
settings the agent might not always manage to adapt within one step. To account for
this, the meta-testing process above was repeated three times (trials) for three
different configuration settings per task. Moreover, collecting high rewards does not
always translate to a perfect success rate, and the scale of the reward values differs
from task to task. However, a 0% success rate does not mean complete failure either,
since if the reward values are high it might mean

² The PPO agent was "early-stopped" at 1,000 epochs after not showing signs of improvement for 500 epochs.
³ For both MAML and ANIL on Meta-World ML10, each epoch took the same amount of time, approximately 390 seconds.


(a) Performance comparison of methods on ML10.

(b) PPO, MAML­TRPO and ANIL­TRPO on ML10. Smoothing factor 0.95.

Figure 4.2.4: Comparison of methods in ML10.


that the task is almost completed, but the agent did not cross the specific threshold
needed for the task to be classified as a "success". This is due to the way the reward
functions and the success metrics are hand-engineered in Meta-World. In the case of
sweep, for example, the reward function is:

−||h − o||₂ + I[||h − o||₂ < 0.05] · 1000 · e^{−||h − g||₂² / 0.01}        (4.1)

and the success metric is calculated based on:

I[||o − g||₂ < 0.05]        (4.2)

where o is the object position, h is the robot arm (gripper) position and g is the goal
position. A complete list of the reward functions for each task and their success
metrics can be found in Appendix C.10. In the following figures, both the reward
values and the success rates are illustrated, in order to avoid misinterpretation of the
results.

Figures 4.2.5 (a)-(b) illustrate a small lead in performance for MAML-TRPO
compared to the other methods on the train tasks, both in average accumulated
rewards and in higher success rates. It is interesting that the success rate of
MAML-TRPO drops relative to the other methods at test time, even though its
accumulated rewards are still higher. However, due to the limited sample size of our
experiments, we cannot confidently conclude that this difference is statistically
significant rather than the outcome of noise or randomness.

This discrepancy between high rewards and a low success rate could be attributed to
two factors. Firstly, as mentioned before, the success metrics are hand­engineered
with arbitrary threshold values, so the agent could almost be completing the task, with
the arm or object moving around the goal but not within the threshold (thus gaining
high rewards but not high success rates). Secondly, it could be a sign of MAML­TRPO
overfitting to the train tasks and thus failing to perform as well on the test tasks.

Another issue with these results is the remarkably high variance of the rewards and the
success rates. This is due to the diversity of the tasks, since each task has its own
reward function. In figures 4.2.6, for example, MAML­TRPO performs better than the
other methods on the train task of basketball but struggles to outperform them on the
test task of drawer-open. Out of all the tasks, MAML­TRPO


Figure 4.2.5: Comparison of PPO, MAML­TRPO and ANIL­TRPO on ML10 train (a) and test (b) tasks. Reported results are mean and standard deviation over all the tasks, with 5 different seeds of 3 trials of 10 episodes per task.


manages to show signs of better performance than the other methods on 9 out of the
10 train tasks but on just 2 out of the 5 test tasks, further indicating signs of overfitting
(worse performance during testing than training). It is important to note again that
the reported findings come from a small sample size of experiments, preventing us
from drawing statistically significant conclusions. Detailed results of the performance
of the methods in a per­task format can be found in Appendix C.8.


Figure 4.2.6: Comparison of PPO, MAML­TRPO and ANIL­TRPO on the basketball (a) and sweep (b) tasks. Reported results are mean and standard deviation with 5 different seeds of 3 trials of 10 episodes per task.

A noteworthy observation can be made when rendering the policies in the environment
to see how they perform in action. Even in cases in which the MAML­TRPO agent fails,


Figure 4.2.7: Representation similarity across layers of the MAML­TRPO network before and after adaptation during meta­testing. Reported results are mean and standard deviation across the 5 test tasks.

it seems that the movements of the robot arm are smoother and less hectic compared
to the other methods. These rendered animations can be found in the code repository
on GitHub (link).

Lastly, in figure 4.2.7, the representation change in percentage based on the CCA metric
is presented for the network before and after the inner­loop adaptation on the meta­
testing tasks of ML10. These results show that minimal changes take place in the
representation space, indicating that the performance of MAML during meta­testing
relies on feature reuse.
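The similarity measurement behind figure 4.2.7 can be sketched as follows (a minimal numpy implementation of the mean canonical correlation between two activation matrices; the exact CCA variant and preprocessing used for the reported results may differ):

```python
import numpy as np

def mean_cca_similarity(X, Y):
    """Mean canonical correlation between activation matrices X, Y (samples x units).
    Values near 1 mean the two representations span nearly the same subspace."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)  # orthonormal basis for X's column space
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False).mean()

# Activations of one layer on the same inputs before/after the inner-loop step:
rng = np.random.default_rng(0)
before = rng.standard_normal((200, 16))
after = before @ rng.standard_normal((16, 16))   # same features, linearly remixed
print(mean_cca_similarity(before, after))        # approximately 1.0: feature reuse
```

A value close to 1 before versus after adaptation is what the feature-reuse interpretation above predicts: the inner loop barely moves the representation.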

Chapter 5

Discussion

5.1 Further insights


In addition to the answers given in chapter 6, a few other insightful observations can
be made regarding the algorithms and the environment.

Regarding training such models: The process of training models in complex
environments with a considerably high number of hyper­parameters can be quite
challenging. Meta­learning methods, and especially meta­reinforcement learning
methods, seem to be quite sensitive to hyper­parameter values. Coupled with the
substantial computational resources required to perform proper hyper­parameter
searches that provide results with high statistical confidence, this leads to an issue
that is difficult to solve within the scope of such a project.

Regarding MAML: MAML showed great performance on the Meta­World
benchmark, being able to almost master 9 out of 10 tasks with a single general
meta­learnt policy in a relatively small 2­layer neural network, outperforming the
previous state­of­the­art results of [90]. Additionally, the agent trained with MAML
seemed to show smoother and more sensitive movements, both when completing a
task and even when failing to adapt to a new one. Its performance during
meta­testing does, however, indicate some limitations of the algorithm and a possible
sensitivity to overfitting, as well as limited meaningful change to its weights
during adaptation. To reach a high­confidence verdict regarding its adaptation
capabilities on unseen tasks, a more thorough study of the hyper­parameter values
would be advised.


Regarding ANIL: The results of ANIL are quite intriguing. Even though its
performance in relatively straightforward settings, such as few­shot classification, is
comparable to MAML, it is interesting to see that it does not manage to achieve the
same level of success in a more challenging environment (as questioned in ANIL’s
review 1 ). This indicates that adapting to complex tasks requires more than an update
of the head of the network based on a general meta­initialisation of its weights.
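The difference between the two inner loops can be made concrete in a toy setting (our own simplified numpy illustration with a squared loss on a linear two­layer network; the actual algorithms take policy­gradient steps):

```python
import numpy as np

def inner_update(W_body, W_head, X, y, lr=0.1, anil=True):
    """One inner-loop gradient step on a linear 2-layer net with squared loss.
    MAML (anil=False) adapts both layers; ANIL (anil=True) adapts only the head."""
    H = X @ W_body                       # body features (the learned representation)
    err = H @ W_head - y                 # prediction error
    new_head = W_head - lr * (H.T @ err) / len(X)
    if anil:
        return W_body, new_head          # representation frozen during adaptation
    new_body = W_body - lr * (X.T @ (err @ W_head.T)) / len(X)
    return new_body, new_head
```

With anil=True the representation cannot change during adaptation, so all task­specific structure must already be present in the meta­learned body; the ML10 results above suggest this is not enough for complex tasks.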

Regarding Meta­World: This recently published meta­reinforcement learning
environment was developed with the purpose of introducing a more challenging
framework for meta­learning algorithms. It is an open source project based on the
proprietary framework of MuJoCo that during the period of this thesis went through
many updates and bug fixes. One issue that came up during evaluation was the
inconsistency between high rewards and low success rates. Due to the hand­crafted
success metrics, agents could accumulate high rewards while almost solving a task
but still receive a low success score, indicating failure to adapt. This can lead to
contradictory or confusing results, requiring further analysis of the agents’ actual
performance and possibly rendering the animations of the environment to observe
the policy in action. It would be advisable for any future experiments on Meta­World
to report both the success rate and the accumulated rewards, to provide a clearer
picture of the model’s performance.
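Such joint reporting amounts to a small aggregation step; a hypothetical helper (our own sketch, not part of any benchmark API) could look like:

```python
import numpy as np

def summarise_episodes(rewards, successes):
    """Report accumulated reward and success rate side by side,
    so that neither metric is interpreted in isolation."""
    rewards = np.asarray(rewards, dtype=float)
    successes = np.asarray(successes, dtype=float)
    return {
        "mean_reward": float(rewards.mean()),
        "std_reward": float(rewards.std()),
        "success_rate": float(successes.mean()),
    }

# High mean reward with a 0.0 success rate flags a near-miss policy:
print(summarise_episodes([650.0, 700.0, 20.0], [0, 0, 0]))
```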

5.2 Limitations
Training neural networks can be a computationally expensive process that can last
weeks or months. Taking into consideration the scope of this degree project and the
limited resources at hand, only the models that were deemed to be the most relevant
and most essential to the research questions were developed. Even so, more than 200
models were trained in this project, across 2 vision datasets and 3 RL problems, in a
time span of 4 months, with more than 11,000 hours spent on training models. This is
because, in order to validate the integrity of the results, a hyper­parameter
search was needed for each problem case and every model was trained with 3 different
seeds.

Even so, the performed hyper­parameter searches were relatively limited, given that
the search was performed on 2 or 3 parameters in each case, when in some settings
(e.g. ANIL­TRPO on ML10) there were more than 15 hyper­parameters that could be
tuned. Moreover, it is reasonable to assume that 3 seeds are not enough to provide a
high level of confidence in the results of such complex algorithms with a high number
of hyper­parameters [12]. Thus, in order to further verify the findings of this thesis,
a larger­scale statistical analysis of the hyper­parameter space is suggested to
support our results, especially for the meta­RL tasks, which are known to be sources
of high instability and irreproducibility due to their numerous tunable
hyper­parameters and increased computational cost.

1 Comment from a reviewer of ICLR 2020 on ANIL, criticising the limitations of meta­learning benchmarks: link

5.3 Future Work


Given the rate at which deep learning is advancing, such a research project can have
many possible extensions: from simple, more thorough hyper­parameter searches and
ablation studies, to algorithmic modifications or additions, to investigations of even
more state­of­the­art methods on new datasets and environments. A collection of
interesting ideas is listed below, in order of significance for further supporting the
objective of the project:

1. Thorough hyper­parameter search: As previously discussed, a more
extensive search for the optimal hyper­parameter values would be necessary
to provide results with high statistical confidence. Such a task would
require powerful workstations in order to run grid­search or random­search
investigations.

2. Meta­Train on ML45: The next obvious step after MAML’s success on the
ML10 benchmark of Meta­World would be to evaluate its performance on the
final and most challenging ML45 benchmark.

3. Investigate task similarity as a factor of success during meta­testing:
One factor that could be of significant value to study is how dependent the
success of meta­learning methods is on the similarity between tasks.
For example, such insights could be obtained by using heat maps of the
representation of the network and highlighting which areas are commonly shared
when completing different tasks.

4. Comparison of other meta­learning algorithms: Another reasonable
extension would be to investigate, in the same way, more meta­learning methods
that have shown promising results, such as RL2 [16], PEARL [63] or
Meta­Q­Learning [21].

5. Investigate the performance of MAML in even more challenging
environments: Another benchmark environment that could be interesting to
evaluate MAML on would be Procgen. During this project a few experiments
were performed on Procgen (as seen in Appendix B); however, due to limited
computational resources, the agents could not be fully trained and the results
were inconclusive.

5.4 Ethical concerns

Given that this thesis is focused mainly on academic research with no explicit
application, one could at first think that the ethical aspect of such a project is
minuscule. Besides, the field of meta­learning is still quite young and real­life
applications, though potentially vast, are currently almost non­existent. If one were to consider that the
ethical challenges of scientific contributions depend on their practical applications
and that scientific methods are just tools which can be used either benevolently
or malevolently, they could completely rid themselves of any responsibility for this
project. Unfortunately, such beliefs are not uncommon and one of the most crucial
components of scientific contributions is often, intentionally or not, easily dismissed:
the awareness of the social responsibility of research & development.

Brunner and Ascher argue that ”science in the aggregate has not lived up to its promise
to work for the benefit of society as a whole.[... For that reason it] is appropriate to hold
science responsible for the public expectations that science creates and depends upon
it” [10]. Opponents of socially responsible science would argue that the development
of science is predetermined and is a force that cannot be influenced by the choices
of its community or by individual scientists. In such a setting, scientists are merely
the instruments of a greater body that moves in a single direction, and they can only
control its rate of acceleration through their skills and work.

However, there is plenty of evidence to suggest otherwise. Collective decisions and
actions do matter, both in cases of benign and malicious intent. The General Data


Protection Regulation (GDPR)2 is a recent example of how governments, along
with the scientific community, came together to defend society’s interest against
privacy­breaching technology. If we were to accept this example as part of an inevitable
progress, would we also accept as ”inevitable” and ”expected” all the other times
humans have used science in evil and horrific ways? Where would we draw the line
between inevitable actions and responsible decisions?

Thus, it is our belief that it would be not only naive but also dangerous to assume that
science is a disconnected, nonpartisan body, self­contained from the rest of society.
Such an equation would hand a get­out­of­jail­free card to whatever can be identified
as science, unburdening individual decisions and actions of any accountability. It
becomes apparent that the impact science has on society is rarely ever neutral [66].

To return to the context of this particular thesis, the responsibility lies in honest,
reproducible and accountable research. It is commonly accepted that science
has been facing a reproducibility crisis over the last decade, with more than 70% of
researchers failing to reproduce another scientist’s experiments [4]. This is an
even more apparent problem in the machine learning community, where research code
is scarcely open­sourced due to private agreements, profit or unaccountability
[40]. To avoid contributing to the problem, the approach of this thesis was based
on thorough background research, open­source code (wherever possible) and honest
result reporting, clearly stating any assumptions made and limiting non­confident
conclusions. The code used for this project, along with the trained models and a
technical guide to reproduce the results, is open­sourced and published online at
GitHub3.

5.5 Sustainability
As with most DL problems, training such models requires considerable amounts of
computational power. This means personal computers, remote workstations or even
large data centres operating at high capacity for days, weeks or even months. These
data centres can quickly amount to significant electrical power consumption (1% of
global electricity use), which further increases the need for more electrical power
2 Link to the history of GDPR here.
3 Link to the code repository: [Link]


generation [52]. It has been shown that such increasing demand for power generation
has direct negative implications for the environment, since most energy sources
are still not environmentally friendly [36].

For the experiments of this project, a personal computer and a remote server
workstation in a data centre (ICE North of RISE SICS) were used over a span of 6
months. For some experiments, the remote workstation operated continuously for
three months. An approximate estimate of the power consumption footprint of this
thesis project can be made. In equation 5.1, E_kWh denotes the energy consumption
in kilowatt­hours, P_W the power consumption of the device in watts, and t_hr the
time the device spent operating, in hours4.

E_kWh = (P_W × t_hr) / 1000    (5.1)

The personal computer can be assumed to have been running 8 hours a day for 6
months at 500 watts, which amounts to 500 W × 8 hr/day × 180 days / 1000 = 720 kWh.
The remote workstation can be assumed to have been running 24/7 for 3 months at
2500 watts, which amounts to 2500 W × 24 hr/day × 90 days / 1000 = 5400 kWh. Thus
the total can be estimated at around 6120 kWh over 6 months. To put this number
into perspective, according to [Link]5, the average electricity consumption
per capita in Sweden is 13,000 kWh per year. The electrical energy consumption
footprint of this degree project was therefore roughly equivalent to 47% of an average
Swedish citizen’s consumption for a year.
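The estimate can be reproduced directly from equation 5.1 (with the power ratings and operating times assumed above):

```python
def energy_kwh(power_w, hours):
    # Eq. 5.1: E_kWh = P_W * t_hr / 1000
    return power_w * hours / 1000.0

personal = energy_kwh(500, 8 * 180)       # 500 W, 8 h/day for ~6 months
workstation = energy_kwh(2500, 24 * 90)   # 2500 W, 24/7 for ~3 months
print(personal, workstation, personal + workstation)  # 720.0 5400.0 6120.0
```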

On the other hand, the development of meta­learning methods could potentially have
a significant decrease in the power consumption needed to train ML models compared
to traditional methods. This would be due to the fact that meta­trained models do not
require re­training from scratch in light of new tasks and can thus reduce the number of
many models needed for each task to fewer models where each one can handle multiple
tasks. By developing data­efficient models with high generalisation capabilities the
computational cost of training ML models could be reduced.

4 Equations to calculate electric energy: [Link]
5 Sweden’s energy consumption data: [Link]

Chapter 6

Conclusions

The objective of this degree project was to supplement the latest research on MAML
and ANIL with novel observations and insights regarding their performance on
the recently published meta­reinforcement learning benchmark Meta­World. To
summarise, this thesis tried to answer the following questions:

Question 1 What does MAML actually learn: is it a high­quality feature representation of
the training data, or does it learn to rapidly adapt to new tasks from the same
distribution as the training tasks?

Answer: Before and after adaptation of MAML on ML10 there seem to be minimal
changes in the representation space of the neural network, and minimal changes
in performance as well (C.9). This indicates that MAML mostly re­uses features
learnt during meta­training and does not actually acquire new knowledge during
meta­testing.

Question 2 How well does MAML adapt to new tasks compared to standard RL approaches?

Answer: Our results show a trend of better performance from MAML (higher rewards and
success rate) than PPO and ANIL on the train tasks of the ML10 benchmark.
During meta­testing, however, this trend is not as clear, making it difficult to
draw any conclusions.

Question 3 Does ANIL perform similarly to MAML, even in more complex environments?

Answer: ANIL shows comparable accuracy to MAML on the few­shot image
classification benchmarks. On the more challenging and complex benchmark
of Meta­World ML10, it fails to show comparable performance during
meta­training. During meta­testing, however, it is not clear whether there is a
significant difference between the algorithms. These findings require further
supporting evidence from more extensive training and evaluation of the models
in order to carry out statistical significance tests.

Question 4 Is there a significant computational cost difference between MAML and ANIL?

Answer: The computational cost difference between MAML and ANIL is indeed significant on
the vision datasets (ANIL required almost 50% less computational resources during
training) but not on the RL benchmarks. This is probably due to the high cost of
sampling and interacting with the environment, which is much more substantial than
the difference between back­propagating through a whole network and through one
layer of the network.

Bibliography

[1] Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew


W., Pfau, David, Schaul, Tom, Shillingford, Brendan, and Freitas, Nando
de. “Learning to learn by gradient descent by gradient descent”. en. In:
arXiv:1606.04474 [cs] (Nov. 2016). arXiv: 1606.04474.

[2] Antoniou, Antreas, Edwards, Harrison, and Storkey, Amos. “How to train your
MAML”. In: arXiv:1810.09502 [cs, stat] (Mar. 2019). arXiv: 1810.09502.

[3] Arnold, Sébastien M. R., Mahajan, Praateek, Datta, Debajyoti, Bunner, Ian,
and Zarkias, Konstantinos Saitas. “learn2learn: A Library for Meta­Learning
Research”. In: arXiv:2008.12284 [cs, stat] (Aug. 2020). arXiv: 2008.12284.

[4] Baker, Monya. “1,500 scientists lift the lid on reproducibility”. en. In: Nature
News 533.7604 (May 2016). Section: News Feature, p. 452. DOI: 10 . 1038 /
533452a.

[5] Baydin, Atilim Gunes, Cornish, Robert, Rubio, David Martinez, Schmidt,
Mark, and Wood, Frank. “Online learning rate adaptation with hypergradient
descent”. In: arXiv preprint arXiv:1703.04782 (2017).

[6] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. “The Arcade Learning
Environment: An Evaluation Platform for General Agents”. en. In: Journal of
Artificial Intelligence Research 47 (June 2013), pp. 253–279. ISSN: 1076­9757.
DOI: 10.1613/jair.3912.

[7] Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. “Neural optimizer
search with reinforcement learning”. In: arXiv preprint arXiv:1709.07417
(2017).


[8] Biggs, J. B. “The Role of Metalearning in Study Processes”. en. In: British
Journal of Educational Psychology 55.3 (1985), pp. 185–212. ISSN: 2044­8279.
DOI: 10.1111/j.2044-8279.1985.tb02625.x.

[9] Botvinick, Matthew, Ritter, Sam, Wang, Jane X., Kurth­Nelson, Zeb, Blundell,
Charles, and Hassabis, Demis. “Reinforcement Learning, Fast and Slow”. en. In:
Trends in Cognitive Sciences 23.5 (May 2019), pp. 408–422. ISSN: 13646613.
DOI: 10.1016/[Link].2019.02.006.

[10] Brunner, Ronald D and Ascher, William. “Science and social responsibility”. en.
In: (), p. 37.

[11] Cobbe, Karl, Hesse, Christopher, Hilton, Jacob, and Schulman, John.
“Leveraging Procedural Generation to Benchmark Reinforcement Learning”.
In: arXiv:1912.01588 [cs, stat] (Dec. 2019). arXiv: 1912.01588.

[12] Colas, Cédric, Sigaud, Olivier, and Oudeyer, Pierre­Yves. “How Many Random
Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments”.
In: arXiv:1806.08295 [cs, stat] (July 2018). arXiv: 1806.08295.

[13] Deisenroth, Marc Peter, Rasmussen, Carl Edward, and Fox, Dieter. “Learning
to control a low­cost manipulator using data­efficient reinforcement learning”.
In: Robotics: Science and Systems VII (2011), pp. 57–64.

[14] Doya, Kenji. “Metalearning and neuromodulation”. en. In: Neural Networks
15.4 (June 2002), pp. 495–506. ISSN: 0893­6080. DOI: 10 . 1016 / S0893 -
6080(02)00044-8.

[15] Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter.
“Benchmarking Deep Reinforcement Learning for Continuous Control”. en. In:
arXiv:1604.06778 [cs] (May 2016). arXiv: 1604.06778.

[16] Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and
Abbeel, Pieter. “RL$^2$: Fast Reinforcement Learning via Slow Reinforcement
Learning”. In: arXiv:1611.02779 [cs, stat] (Nov. 2016). arXiv: 1611.02779.

[17] Dulac­Arnold, Gabriel, Mankowitz, Daniel, and Hester, Todd. “Challenges of


Real­World Reinforcement Learning”. en. In: arXiv:1904.12901 [cs, stat] (Apr.
2019). arXiv: 1904.12901.


[18] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih,
Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain,
Legg, Shane, and Kavukcuoglu, Koray. “IMPALA: Scalable Distributed Deep­RL
with Importance Weighted Actor­Learner Architectures”. In: arXiv:1802.01561
[cs] (June 2018). arXiv: 1802.01561.

[19] Esteva, Andre, Kuprel, Brett, Novoa, Roberto A., Ko, Justin, Swetter, Susan
M., Blau, Helen M., and Thrun, Sebastian. “Dermatologist­level classification
of skin cancer with deep neural networks”. en. In: Nature 542.7639 (Feb. 2017).
Number: 7639 Publisher: Nature Publishing Group, pp. 115–118. ISSN: 1476­
4687. DOI: 10.1038/nature21056.

[20] Eysenbach, Benjamin, Gupta, Abhishek, Ibarz, Julian, and Levine, Sergey.
“Diversity is All You Need: Learning Skills without a Reward Function”. In:
arXiv:1802.06070 [cs] (Oct. 2018). arXiv: 1802.06070.

[21] Fakoor, Rasool, Chaudhari, Pratik, Soatto, Stefano, and Smola, Alexander
J. “Meta­Q­Learning”. In: arXiv:1910.00125 [cs, stat] (Apr. 2020). arXiv:
1910.00125.

[22] Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. “Model­Agnostic Meta­
Learning for Fast Adaptation of Deep Networks”. en. In: arXiv:1703.03400 [cs]
(July 2017). arXiv: 1703.03400.

[23] Finn, Chelsea and Levine, Sergey. “Meta­Learning and Universality: Deep
Representations and Gradient Descent can Approximate any Learning Algorithm”.
In: arXiv:1710.11622 [cs] (Feb. 2018). arXiv: 1710.11622.

[24] Finn, Chelsea, Rajeswaran, Aravind, Kakade, Sham, and Levine, Sergey.
“Online Meta­Learning”. en. In: arXiv:1902.08438 [cs, stat] (July 2019). arXiv:
1902.08438.

[25] Finn, Chelsea, Xu, Kelvin, and Levine, Sergey. “Probabilistic Model­Agnostic
Meta­Learning”. In: Advances in Neural Information Processing Systems 31.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa­Bianchi, and
R. Garnett. Curran Associates, Inc., 2018, pp. 9516–9527.

[26] Finn, Chelsea, Yu, Tianhe, Zhang, Tianhao, Abbeel, Pieter,


and Levine, Sergey. “One­Shot Visual Imitation Learning via Meta­Learning”.
In: arXiv:1709.04905 [cs] (Sept. 2017). arXiv: 1709.04905.


[27] “Forecasting short­term data center network traffic load with convolutional
neural networks”. en. In: PLOS ONE 13.2 (Feb. 2018). Publisher: Public Library
of Science, e0191939. ISSN: 1932­6203. DOI: 10.1371/[Link].0191939.

[28] Franceschi, Luca, Frasconi, Paolo, Salzo, Saverio, Grazzi, Riccardo, and
Pontil, Massimilano. “Bilevel Programming for Hyperparameter Optimization
and Meta­Learning”. In: arXiv:1806.04910 [cs, stat] (July 2018). arXiv:
1806.04910.

[29] Fujimoto, Scott, Hoof, Herke van, and Meger, David. “Addressing Function
Approximation Error in Actor­Critic Methods”. en. In: arXiv:1802.09477 [cs,
stat] (Oct. 2018). arXiv: 1802.09477.

[30] Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. The
MIT Press, 2016. ISBN: 978­0­262­03561­3.

[31] Goodfellow, Ian J., Mirza, Mehdi, Xiao, Da, Courville, Aaron, and Bengio,
Yoshua. “An Empirical Investigation of Catastrophic Forgetting in Gradient­
Based Neural Networks”. en. In: arXiv:1312.6211 [cs, stat] (Mar. 2015). arXiv:
1312.6211.

[32] Grant, Erin, Finn, Chelsea, Levine, Sergey, Darrell, Trevor, and Griffiths,
Thomas. “Recasting gradient­based meta­learning as hierarchical bayes”. In:
arXiv preprint arXiv:1801.08930 (2018).

[33] Gupta, Abhishek, Mendonca, Russell, Liu, YuXuan, Abbeel, Pieter, and Levine,
Sergey. “Meta­Reinforcement Learning of Structured Exploration Strategies”.
en. In: (), p. 10.

[34] Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey. “Soft
Actor­Critic: Off­Policy Maximum Entropy Deep Reinforcement Learning with
a Stochastic Actor”. en. In: arXiv:1801.01290 [cs, stat] (Aug. 2018). arXiv:
1801.01290.

[35] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, and Norouzi, Mohammad.
“Dream to Control: Learning Behaviors by Latent Imagination”. In:
arXiv:1912.01603 [cs] (Mar. 2020). arXiv: 1912.01603.


[36] Hammons, T. J. “Impact of electric power generation on green house gas


emissions in Europe: Russia, Greece, Italy and views of the EU power plant
supply industry – A critical analysis”. en. In: International Journal of Electrical
Power & Energy Systems 28.8 (Oct. 2006), pp. 548–564. ISSN: 0142­0615.
DOI: 10.1016/[Link].2006.04.001.

[37] Hardoon, David R., Szedmak, Sandor, and Shawe­Taylor, John. “Canonical
correlation analysis: an overview with application to learning methods”. eng.
In: Neural Computation 16.12 (Dec. 2004), pp. 2639–2664. ISSN: 0899­7667.
DOI: 10.1162/0899766042321814.

[38] Hospedales, Timothy, Antoniou, Antreas, Micaelli, Paul, and Storkey, Amos.
“Meta­Learning in Neural Networks: A Survey”. en. In: arXiv:2004.05439 [cs,
stat] (Apr. 2020). arXiv: 2004.05439.

[39] Houthooft, Rein, Chen, Yuhua, Isola, Phillip, Stadie, Bradly, Wolski, Filip, Ho,
OpenAI Jonathan, and Abbeel, Pieter. “Evolved policy gradients”. In: 2018,
pp. 5400–5409.

[40] Hutson, Matthew. “Artificial intelligence faces reproducibility crisis”. en.


In: Science 359.6377 (Feb. 2018). Publisher: American Association for the
Advancement of Science Section: In Depth, pp. 725–726. ISSN: 0036­8075,
1095­9203. DOI: 10.1126/science.359.6377.725.

[41] Hutter, Frank, Kotthoff, Lars, and Vanschoren, Joaquin. Automated Machine
Learning: Methods, Systems, Challenges. English. Springer Nature, 2019.
DOI: 10.1007/978-3-030-05318-5. URL: [Link]/handle/20.500.12657/23012
(visited on 06/24/2020).

[42] Janner, Michael, Fu, Justin, Zhang, Marvin, and Levine, Sergey. “When to
Trust Your Model: Model­Based Policy Optimization”. In: arXiv:1906.08253
[cs, stat] (Nov. 2019). arXiv: 1906.08253. (Visited on 07/03/2020).

[43] Juliani, Arthur, Khalifa, Ahmed, Berges, Vincent­Pierre, Harper, Jonathan,


Teng, Ervin, Henry, Hunter, Crespi, Adam, Togelius, Julian, and Lange, Danny.
“Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning”.
en. In: arXiv:1902.01378 [cs] (July 2019). arXiv: 1902.01378.

[44] Kingma, Diederik P. and Ba, Jimmy. “Adam: A Method for Stochastic
Optimization”. en. In: arXiv:1412.6980 [cs] (Jan. 2017). arXiv: 1412.6980.


[45] Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. “Human­
level concept learning through probabilistic program induction”. en. In: Science
350.6266 (Dec. 2015). Publisher: American Association for the Advancement
of Science Section: Research Article, pp. 1332–1338. ISSN: 0036­8075, 1095­
9203. DOI: 10.1126/science.aab3050.

[46] learnables/learn2learn. original­date: 2019­08­08T[Link]Z. July 2020.


URL: [Link] (visited on 07/13/2020).

[47] Lee, Yoonho and Choi, Seungjin. “Gradient­Based Meta­Learning with Learned
Layerwise Metric and Subspace”. In: arXiv:1801.05558 [cs, stat] (June 2018).
arXiv: 1801.05558.

[48] Li, Zhenguo, Zhou, Fengwei, Chen, Fei, and Li, Hang. “Meta­SGD: Learning
to Learn Quickly for Few­Shot Learning”. en. In: arXiv:1707.09835 [cs] (Sept.
2017). arXiv: 1707.09835. (Visited on 05/26/2020).

[49] Lin, Henry W., Tegmark, Max, and Rolnick, David. “Why does deep and cheap
learning work so well?” en. In: Journal of Statistical Physics 168.6 (Sept. 2017).
arXiv: 1608.08225, pp. 1223–1247. ISSN: 0022­4715, 1572­9613. DOI: 10 .
1007/s10955-017-1836-5.

[50] Mania, Horia, Guy, Aurelia, and Recht, Benjamin. “Simple random search
provides a competitive approach to reinforcement learning”. In:
arXiv:1803.07055 [cs, math, stat] (Mar. 2018). arXiv: 1803.07055.

[51] Marsland, Stephen. Machine learning: an algorithmic perspective. CRC press,


2015. ISBN: 1­4987­5978­5.

[52] Masanet, Eric, Shehabi, Arman, Lei, Nuoa, Smith, Sarah, and Koomey,
Jonathan. “Recalibrating global data center energy­use estimates”. en. In:
Science 367.6481 (Feb. 2020). Publisher: American Association for the
Advancement of Science Section: Policy Forum, pp. 984–986. ISSN: 0036­
8075, 1095­9203. DOI: 10.1126/science.aba3758.

[53] Meta­Learning: Learning to Learn Fast. en. Library Catalog:


[Link]. Nov. 2018. URL: [Link]
log/2018/11/30/[Link] (visited on 06/17/2020).

[54] Mishra, Nikhil, Rohaninejad, Mostafa, Chen, Xi, and Abbeel, Pieter. “A simple
neural attentive meta­learner”. In: arXiv preprint arXiv:1707.03141 (2017).


[55] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness,
Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas
K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou,
Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane,
and Hassabis, Demis. “Human­level control through deep reinforcement
learning”. en. In: Nature 518.7540 (Feb. 2015). Number: 7540 Publisher:
Nature Publishing Group, pp. 529–533. ISSN: 1476­4687. DOI: 10 . 1038 /
nature14236.

[56] Nagabandi, Anusha, Clavera, Ignasi, Liu, Simin, Fearing, Ronald S., Abbeel,
Pieter, Levine, Sergey, and Finn, Chelsea. “Learning to Adapt in Dynamic,
Real­World Environments Through Meta­Reinforcement Learning”. In:
arXiv:1803.11347 [cs, stat] (Feb. 2019). arXiv: 1803.11347.

[57] Nichol, Alex, Achiam, Joshua, and Schulman, John. “On First­Order
Meta­Learning Algorithms”. In: arXiv:1803.02999 [cs] (Oct. 2018). arXiv:
1803.02999.

[58] OpenAI, Akkaya, Ilge, Andrychowicz, Marcin, Chociej, Maciek, Litwin, Mateusz,
McGrew, Bob, Petron, Arthur, Paino, Alex, Plappert, Matthias, Powell, Glenn,
Ribas, Raphael, Schneider, Jonas, Tezak, Nikolas, Tworek, Jerry, Welinder,
Peter, Weng, Lilian, Yuan, Qiming, Zaremba, Wojciech, and Zhang, Lei. “Solving
Rubik’s Cube with a Robot Hand”. In: arXiv:1910.07113 [cs, stat] (Oct. 2019).
arXiv: 1910.07113.

[59] Pan, Xinlei, You, Yurong, Wang, Ziyan, and Lu, Cewu. “Virtual to Real
Reinforcement Learning for Autonomous Driving”. In: arXiv:1704.03952 [cs]
(Sept. 2017). arXiv: 1704.03952.

[60] Perkins, David and Salomon, Gavriel. “Transfer Of Learning”. In: 11 (July 1999).

[61] Rabinowitz, Neil C. “Meta­learners’ learning dynamics are unlike learners’”. In:
arXiv:1905.01320 [cs, stat] (May 2019). arXiv: 1905.01320. URL: http : / /
[Link]/abs/1905.01320 (visited on 05/28/2020).

[62] Raghu, Aniruddh, Raghu, Maithra, Bengio, Samy, and Vinyals, Oriol. “Rapid
Learning or Feature Reuse? Towards Understanding the Effectiveness of
MAML”. In: arXiv:1909.09157 [cs, stat] (Feb. 2020). arXiv: 1909.09157
version: 2.

[63] Rakelly, Kate, Zhou, Aurick, Quillen, Deirdre, Finn, Chelsea, and Levine, Sergey.
“Efficient Off­Policy Meta­Reinforcement Learning via Probabilistic Context
Variables”. In: arXiv:1903.08254 [cs, stat] (Mar. 2019). arXiv: 1903.08254.

[64] Rakhlin, Alexander, Shvets, Alexey, Iglovikov, Vladimir, and Kalinin, Alexandr
A. “Deep Convolutional Neural Networks for Breast Cancer Histology Image
Analysis”. In: Image Analysis and Recognition. Ed. by Aurélio Campilho,
Fakhri Karray, and Bart ter Haar Romeny. Cham: Springer International
Publishing, 2018, pp. 737–744. ISBN: 978­3­319­93000­8.

[65] Ravi, Sachin and Larochelle, Hugo. “Optimization as a Model for Few­Shot
Learning”. In: (Nov. 2016).

[66] Rose, Steven and Rose, Hilary. “Can Science Be Neutral?” en. In: Perspectives in Biology and Medicine 16.4 (1973), pp. 605–624. ISSN: 1529-8795. DOI: 10.1353/pbm.1973.0035. URL: http://muse.jhu.edu/content/crossref/journals/perspectives_in_biology_and_medicine/v016/[Link] (visited on 06/11/2020).

[67] Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” In: Psychological review 65.6 (1958). Publisher: American Psychological Association, p. 386. ISSN: 1939-1471.

[68] Rumelhart, David E., Durbin, Richard, Golden, Richard, and Chauvin,
Yves. “Backpropagation: The basic theory”. In: Backpropagation: Theory,
architectures and applications (1995), pp. 1–34.

[69] Rusu, Andrei A., Rao, Dushyant, Sygnowski, Jakub, Vinyals, Oriol, Pascanu,
Razvan, Osindero, Simon, and Hadsell, Raia. “Meta­Learning with Latent
Embedding Optimization”. In: arXiv:1807.05960 [cs, stat] (Mar. 2019). arXiv:
1807.05960.

[70] Schmidhuber, Jürgen. “Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook”. In: (1987). Publisher: Technische Universität München.

[71] Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, Lillicrap, Timothy, and Silver, David. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”. In: arXiv:1911.08265 [cs, stat] (Feb. 2020). arXiv: 1911.08265.

[72] Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and
Abbeel, Pieter. “Trust Region Policy Optimization”. en. In: arXiv:1502.05477
[cs] (Apr. 2017). arXiv: 1502.05477.

[73] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel,
Pieter. “High­Dimensional Continuous Control Using Generalized Advantage
Estimation”. In: arXiv:1506.02438 [cs] (Oct. 2018). arXiv: 1506.02438.

[74] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov,
Oleg. “Proximal Policy Optimization Algorithms”. In: arXiv:1707.06347 [cs]
(Aug. 2017). arXiv: 1707.06347.

[75] Al­Shedivat, Maruan, Bansal, Trapit, Burda, Yuri, Sutskever, Ilya, Mordatch,
Igor, and Abbeel, Pieter. “Continuous Adaptation via Meta­Learning in
Nonstationary and Competitive Environments”. In: arXiv:1710.03641 [cs] (Feb.
2018). arXiv: 1710.03641.

[76] Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai,
Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan,
Graepel, Thore, Lillicrap, Timothy, Simonyan, Karen, and Hassabis, Demis. “A
general reinforcement learning algorithm that masters chess, shogi, and Go
through self­play”. en. In: Science 362.6419 (Dec. 2018). Publisher: American
Association for the Advancement of Science Section: Report, pp. 1140–1144.
ISSN: 0036­8075, 1095­9203. DOI: 10.1126/science.aar6404.

[77] Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap, Timothy, Hui, Fan, Sifre, Laurent, Driessche, George van den, Graepel, Thore, and Hassabis, Demis. “Mastering the game of Go without human knowledge”. en. In: Nature 550.7676 (Oct. 2017). Number: 7676 Publisher: Nature Publishing Group, pp. 354–359. ISSN: 1476-4687. DOI: 10.1038/nature24270.

[78] Stadie, Bradly, Yang, Ge, Houthooft, Rein, Chen, Peter, Duan, Yan, Wu, Yuhuai, Abbeel, Pieter, and Sutskever, Ilya. “The Importance of Sampling in Meta-Reinforcement Learning”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 9280–9290.

[79] Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. Vol. 135. MIT Press Cambridge, 1998.

[80] Sutton, Richard S and Barto, Andrew G. “Reinforcement Learning: An Introduction”. en. In: (), p. 352.

[81] Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. en. In: (), p. 7.

[82] Sutton, Richard S. “Learning to predict by the methods of temporal differences”. en. In: Machine Learning 3.1 (Aug. 1988), pp. 9–44. ISSN: 1573-0565. DOI: 10.1007/BF00115009.

[83] Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba, Wojciech,
and Abbeel, Pieter. “Domain randomization for transferring deep neural
networks from simulation to the real world”. In: 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS). ISSN: 2153­0866. Sept.
2017, pp. 23–30. DOI: 10.1109/IROS.2017.8202133.

[84] Triantafillou, Eleni, Zhu, Tyler, Dumoulin, Vincent, Lamblin, Pascal, Evci,
Utku, Xu, Kelvin, Goroshin, Ross, Gelada, Carles, Swersky, Kevin, Manzagol,
Pierre­Antoine, and Larochelle, Hugo. “Meta­Dataset: A Dataset of Datasets for
Learning to Learn from Few Examples”. In: arXiv:1903.03096 [cs, stat] (Apr.
2020). arXiv: 1903.03096.

[85] Venkitaraman, Arun and Wahlberg, Bo. “Task-similarity Aware Meta-learning through Nonparametric Kernel Regression”. en. In: arXiv:2006.07212 [cs, stat] (June 2020). arXiv: 2006.07212.

[86] Vinyals, Oriol, Blundell, Charles, and Lillicrap, Timothy. “Matching Networks
for One Shot Learning”. en. In: (), p. 9.

[87] Wang, Jane X., Kurth­Nelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo,
Joel Z., Munos, Remi, Blundell, Charles, Kumaran, Dharshan, and Botvinick,
Matt. “Learning to reinforcement learn”. en. In: arXiv:1611.05763 [cs, stat]
(Jan. 2017). arXiv: 1611.05763.

[88] Weng, Lilian. “Meta Reinforcement Learning”. In: [Link]/lil-log (2019). URL: [Link]log/2019/06/23/meta-[Link].

[89] Yang, Yuxiang, Caluwaerts, Ken, Iscen, Atil, Tan, Jie, and Finn, Chelsea.
“NoRML: No­Reward Meta Learning”. In: arXiv:1903.01063 [cs, stat] (Mar.
2019). arXiv: 1903.01063.

[90] Yu, Tianhe, Quillen, Deirdre, He, Zhanpeng, Julian, Ryan, Hausman, Karol,
Finn, Chelsea, and Levine, Sergey. “Meta­World: A Benchmark and Evaluation
for Multi­Task and Meta Reinforcement Learning”. In: arXiv:1910.10897 [cs,
stat] (Oct. 2019). arXiv: 1910.10897.

[91] Zeiler, Matthew D. “ADADELTA: An Adaptive Learning Rate Method”. In: arXiv:1212.5701 [cs] (Dec. 2012). arXiv: 1212.5701.

[92] Zhang, Chiyuan, Vinyals, Oriol, Munos, Remi, and Bengio, Samy. “A Study on
Overfitting in Deep Reinforcement Learning”. In: arXiv:1804.06893 [cs, stat]
(Apr. 2018). arXiv: 1804.06893.

[93] Zintgraf, Luisa M., Shiarlis, Kyriacos, Kurin, Vitaly, Hofmann, Katja, and Whiteson, Shimon. “Fast Context Adaptation via Meta-Learning”. In: arXiv:1810.03642 [cs, stat] (June 2019). arXiv: 1810.03642.

Appendix ­ Contents

A Technical details

B Additional environment: Procgen

C Additional hyper-parameter searches and results
C.1 Hyper-parameter search for Few-Shot Image Classification
C.2 Commonly shared hyper-parameter values for RL
C.3 Instabilities of MAML-PPO
C.4 Hyper-parameter search on Particles2D
C.5 Hyper-parameter search on ML1: Push
C.6 Hyper-parameter search on ML10
C.7 Vanilla PPO & TRPO on ML10
C.8 Performance on ML10 per task
C.9 Performance on ML10 test tasks before and after adaptation
C.10 Meta-World Reward functions & success metrics

Appendix A

Technical details

This project was developed in Python 3.7 using PyTorch and is based on the meta-learning library learn2learn [46]. For further technical details regarding the code base, visit the open-source repository of this project on GitHub¹.

The experiments were conducted on a personal computer with an i7 CPU, 24GB RAM and an RTX 2060 Super 8GB GPU, and on a remote workstation, a Dell PowerEdge R730 (24 cores, 256GB RAM, 1 x GTX 1080 Ti) at ICE NORTH SICS, over a 6-month period.

¹ Link: [Link]

Appendix B

Additional environment: Procgen

During the first stages of the thesis, while related work was being researched and an experimental evaluation framework was being set up, another reinforcement learning framework was considered instead of Meta-World: OpenAI's newest RL environment, Procgen, which was developed as a challenging benchmark to evaluate the sample efficiency and generalisation of RL algorithms [11].

Popular video-game benchmarks such as the Arcade Learning Environment (ALE) have met great success in testing RL agents' performance [6], [55]. However, ALE and similar environments have been shown to be prone to overfitting and are not suitable for testing the generalisation capabilities of RL agents [43], [92], which is a vital component of this project.

Procgen tries to combat this issue by leveraging procedural generation to create an almost infinite number of randomised levels in 16 games. This means that each run of an agent playing a game is never the same (as it usually is in ALE), since the level layout, the locations of the player and enemies, and other game-specific details are modified. Additionally, it can be easily integrated into an existing RL code base since it follows the OpenAI Gym framework. In [11], they also state: ”It is also experimentally convenient: training for 200 million timesteps with PPO on a single Procgen environment requires approximately 24 GPU-hrs and 60 CPU-hrs. We consider this a reasonable and practical computational cost. To further reduce training time at the cost of experimental complexity, environments can be set to the easy difficulty. We recommend training in easy difficulty environments for 25M timesteps, which requires approximately 3 GPU-hrs with our implementation of

PPO.” Their implementation of PPO is based on the scalable and distributed IMPALA algorithm [18].

Figure B.0.1: Samples of each of the 16 games of Procgen. Figure from [11].

For this project a semi-distributed PPO and a MAML-PPO agent were implemented. The term semi-distributed is used to distinguish them from the fully distributed IMPALA implementation. In our case, a sampler was implemented that could fetch different episodes from the same agent in parallel. However, this meant that the only distributed part is the forward pass (sampling); back-propagation still happens synchronously and needs to wait for every parallel agent to finish. In the case of IMPALA the whole process (forward and backward pass) is distributed and asynchronous¹. The networks were based on the A2C model [55], following an architecture similar to the standard convolutional neural network used in [55] (which [11] call Nature-CNN), as seen in figure B.0.2.

¹ For further explanation, see figure 2 in [18], where scenario (a) is our implementation and (c) is the implementation of [11].
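
The layer widths of such a Nature-CNN can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only (not the thesis code): it assumes the standard (32, 8, 4), (64, 4, 2), (64, 3, 1) convolutional specs of [55] with unpadded convolutions, and computes the size of the flattened feature vector that feeds the 512-unit fully connected layer, for both the canonical 84x84 Atari input and Procgen's 64x64 frames.

```python
# Spatial sizes produced by the three Nature-CNN convolutions [55].
def conv_out(size, kernel, stride):
    # Output spatial size of an unpadded ("valid") convolution.
    return (size - kernel) // stride + 1

def flat_features(size, layers=((32, 8, 4), (64, 4, 2), (64, 3, 1))):
    # layers: (channels, kernel, stride) per convolutional layer.
    channels = None
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return channels * size * size  # units entering the 512-unit FC layer

print(flat_features(84))  # 3136 (84 -> 20 -> 9 -> 7)
print(flat_features(64))  # 1024 (64 -> 15 -> 6 -> 4)
```

This also hints at why the network is considered small: nearly all of its capacity sits in the single FC layer after flattening.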

(a) PPO network architecture. (b) MAML­PPO network architecture.

Figure B.0.2: Screenshot of the model’s architecture including number of parameters

Unfortunately, even after much trial and error, Procgen proved incredibly difficult to train on. The PPO agent, trained for 5 million timesteps on 8 parallel environments in the easiest setting possible (one game with only one level on easy difficulty), required 14 CPU-hrs and 50GB RAM with no sign of progress (figure B.0.3). In comparison, [11] mention that for the agent to develop sufficient generalisation skills to adapt to new levels, it needs to be trained on at least 10.000 different levels.

(a) Validation reward progress. (b) Validation A2C loss progress.

Figure B.0.3: Performance of PPO agent in Starpilot for one level.

Another experiment was training a MAML-PPO agent in the same easy setting, sampling from only one environment and running on a GTX 1080 Ti GPU for 25M timesteps. After 57 hours of training, the results were still poor, as seen in figure B.0.4.

(a) Validation reward progress. (b) Validation A2C loss progress.

Figure B.0.4: Performance of MAML­PPO agent in Starpilot for one level.

These results are also aligned with Section 4 of [11], where they conclude that smaller architectures (like the Nature-CNN) can sometimes completely fail to train compared to larger and distributed implementations, which are more sample efficient and lead to agents with better generalisation capabilities.

In conclusion, even though Procgen seems to be a promising direction of research for RL, these experiments highlighted some technical difficulties when training a standard RL (PPO) or meta-RL (MAML-PPO) agent. Specifically, network size and a distributed algorithmic implementation appear to be crucial components for the success of an agent in this environment. The problem of network size could be addressed by adding more layers or neurons to increase the number of trainable parameters, though this would also increase the need for computational resources. Furthermore, a highly efficient distributed implementation of MAML for this environment, based on IMPALA or a similar architecture, is speculated to drastically improve the performance of the agent. However, such an implementation was deemed exceptionally complex and outside the scope of the thesis, and is left for future work.

Appendix C

Additional hyper-parameter searches and results

During the development of the models in section 4, many hyper-parameter values had to be tested in order to draw conclusions when comparing the performance of the different methods. Here, a series of searches that were part of the project but deemed of less significance for section 4 are presented. Moreover, more detailed results on the Meta-World tasks are included for further investigation and transparency purposes.

C.1 Hyper-parameter search for Few-Shot Image Classification
A basic hyper-parameter search was performed on the Mini-ImageNet dataset and, afterwards, a smaller-scale search over only the most influential hyper-parameters was performed on the Omniglot dataset. Since computational cost is a considerable limitation to the scale of the hyper-parameter search, some experiments were performed only on ANIL, which required fewer computational resources.

Number of iterations¹: Different datasets and hyper-parameter values will lead to slower or faster model convergence, so in every setting a few metrics are taken into consideration to draw a conclusion. An appropriate number of training epochs can be chosen by monitoring the training and validation

¹ Iterations and epochs are used interchangeably, unless explicitly stated otherwise.

loss and accuracy metrics during meta-training, and by meta-testing different model snapshots at various checkpoints of the training. For example, for the optimal hyper-parameter values on Omniglot, figures C.1.1, C.1.2 and C.1.3 indicate that there is no gain in performance after the 2.000 mark and the models have managed to converge, whereas for Mini-ImageNet the mark is a bit later, around 5.000.

Figure C.1.1: Comparing ANIL and MAML training metrics. Each line is the mean
value and the shaded area is the standard deviation across three seeds.

Figure C.1.2: Comparing ANIL and MAML validation metrics. Each line is the mean
value and the shaded area the standard deviation across three seeds.

Figure C.1.3: Meta-testing model snapshots of MAML and ANIL at different iteration checkpoints for Omniglot and Mini-ImageNet. Each line is the mean value and the shaded area the standard deviation across three seeds.

Since such conclusions can usually be derived from just one of these figures, in the rest of the section only the minimum number of figures necessary to support them is included. As a baseline, for both the Omniglot and Mini-ImageNet datasets, the models were trained for 10.000 epochs.

Number of adaptation steps: While fixing all the other hyper-parameters to standard default values, figure C.1.4 shows that changing the number of inner-loop iterations does not significantly affect the performance of the ANIL model on Mini-ImageNet. Specifically, ANIL was tested with 1, 3 and 5 adaptation steps during the inner loop, and all models led to the same performance.

Ways 5
Shots 1
Outer LR 0.001
Inner LR 0.1
Adapt Steps [1, 3, 5]
Meta Batch Size 32
Seed 1
Table C.1.1: HP values of C.1.4

Figure C.1.4: Comparison of three models with different numbers of inner loop updates. Smoothing factor: 0.8.

Learning Rates: A coarse search for the inner (α) and outer (β) learning rates was performed with ANIL on Mini-ImageNet. The inner learning rate controls how quickly the learner's weights adapt to new data, whereas the outer learning rate controls the rate at which the meta-initialisation weights are updated. For this reason, the inner learning rate (lr) needs a high value (rapid learning) and the outer lr a lower one (steady convergence to a general enough meta-initialisation).
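
To make the two roles concrete, the sketch below runs a first-order MAML-style loop (in the spirit of [57]) on a toy one-parameter regression family, with a high inner lr and a low outer lr. It is an illustrative sketch only, not the ANIL/MAML implementation used in this project; the task family and all constants are arbitrary.

```python
import random

# Toy task family: each task is a target a ~ U(-1, 1); the learner fits a
# single parameter theta with loss (theta - a)^2.
random.seed(0)
alpha, beta, adapt_steps = 0.1, 0.01, 3  # high inner lr, low outer lr
theta = 5.0                              # deliberately bad meta-initialisation

for _ in range(5000):
    a = random.uniform(-1.0, 1.0)            # sample a task
    fast = theta
    for _ in range(adapt_steps):             # inner loop: rapid adaptation
        fast -= alpha * 2.0 * (fast - a)     # gradient of (fast - a)^2
    # First-order meta-update: apply the post-adaptation gradient directly
    # to the initialisation (ignoring second derivatives, as in FOMAML).
    theta -= beta * 2.0 * (fast - a)

# theta drifts toward 0.0, the mean of the task distribution: a steady,
# slow outer update toward an initialisation from which the fast inner
# updates can reach any individual task.
```

Swapping the two rates breaks both properties: adaptation becomes too slow to fit a task in a few steps, while the initialisation chases each sampled task instead of converging.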

ANIL was trained with the configurations shown in table C.1.2 on the Mini-ImageNet dataset for 5-ways, 5-shots and 5-ways, 1-shot classification. For the Omniglot dataset, two different learning rate settings were tested with ANIL for the cases of 20-ways, 5-shots (figure C.1.6) and 20-ways, 1-shot (figure C.1.7).

#1 #2 #3
Ways 5
Shots 1
Outer LR 0.003 0.001 0.05
Inner LR 0.5 0.1 0.1
Adapt Steps 1
Meta Batch Size 32
Seeds [1,2,3]

Table C.1.2: HP values of C.1.5

Figure C.1.5: Comparison of three ANIL models on 5-ways Mini-ImageNet with different inner and outer learning rates. Each line is the mean value and the shaded area the standard deviation across three seeds. 20% accuracy is the same as random for a 5-way classification setting.

Figure C.1.6: Results of ANIL for 20­ways, 5­shots in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.

Figure C.1.7: Results of ANIL for 20­ways, 1­shot in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.

C.2 Commonly shared hyper-parameter values for RL

Due to the complexity of these methods and their wide range of hyper­parameters, not
all of them could be tested with different values. In table C.2.1, hyper­parameter values
that were left as­is in all of the experiments are presented.

For all RL policies


tau (τ ) 1.0
discount factor / gamma (γ) 0.99
horizon length H for ML10 tasks 150
For all TRPO policies
backtrack factor 0.5
ls max steps 15
max kl 0.01
For all PPO policies
ppo epochs 3
ppo clip ratio 0.1

Table C.2.1: Commonly shared hyper-parameter values across different environments / methods.
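
The shared γ and τ values plug into Generalized Advantage Estimation [73], which the TRPO/PPO policies rely on. The following is a minimal illustrative sketch, not the project's implementation:

```python
# Generalized Advantage Estimation (GAE) [73] with the shared defaults
# gamma = 0.99 and tau (the GAE lambda) = 1.0 from table C.2.1.
def gae(rewards, values, gamma=0.99, tau=1.0):
    """values has len(rewards) + 1 entries (bootstrap value last)."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        adv = delta + gamma * tau * adv
        advantages.append(adv)
    return advantages[::-1]

# With tau = 1.0, GAE reduces to discounted returns minus the value baseline.
print(gae([1.0, 1.0], [0.0, 0.0, 0.0]))  # [1.99, 1.0]
```

Lowering τ below 1.0 (as in the "extended" meta-testing setting of table C.9.1, where γ is reduced instead) trades variance for bias by shortening the effective credit-assignment horizon.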

C.3 Instabilities of MAML­PPO

This method was concluded to be too unstable to research further, possibly due to an error in the implementation or bad hyper-parameter configurations. Even though a few configurations were tested, MAML-PPO performed quite unstably, as seen in figure C.3.1 for ML1 and figure C.3.2 for ML10. Some models were stopped manually since they did not show promising signs of convergence or learning.

#1 #2 #3 #4
Outer LR 0.01 0.1 0.01 0.01
Inner LR 0.01 0.1 0.05 0.01
Adapt Batch Size 20 10
Meta Batch Size 40 20
Adapt Steps 1
Seeds 42
Figure C.3.1
Table C.3.1: HP values of the models in C.3.1

ANIL MAML
#1 #2 #3 #1
Outer LR 0.01
Inner LR 0.01 0.05 0.01 0.01
Adapt Batch Size 20 10 10 20
Meta Batch Size 40 20 20 40
Adapt Steps 1
Seeds 42
Figure C.3.2
Table C.3.2: HP values of the models in C.3.2

C.4 Hyper­parameter search on Particles2D

A basic hyper-parameter search was also performed on MAML-TRPO, as shown in table C.4.1 and figure C.4.1. The average meta-testing reward of the models is reported across 10 new tasks (10 episodes each) and 5 adaptation steps. In this case, reporting the meta-testing results is important because it shows that even if during validation the best model seems to be #2 (the one with a higher number of adaptation steps), it fails to perform as well during adaptation to new test tasks.

#1 #2 #3 #4 #5 #6 #7 #8
Outer LR 0.1 0.1 0.05 0.05 0.2 0.05 0.1 0.05
Inner LR 0.1 0.2 0.1 0.1 0.1 0.1 0.2 0.2
Adapt Steps 1 5 1
Adapt Batch Size 20 32 10 20 20 20 32 20
Meta Batch Size 40 16 16 64 16 16 16 32
Seed 42
Av. Test Reward ­27 ­38 ­40 ­29 ­34 ­24 ­26 ­28

Table C.4.1: HP search for MAML­TRPO.

Figure C.4.1: Validation rewards of a hyper-parameter search for MAML-TRPO on Particles2D. Smoothing factor 0.9.

C.5 Hyper­parameter search on ML1: Push

A basic hyper-parameter search was performed for MAML-TRPO (figures C.5.1 - C.5.3) and ANIL-TRPO (figure C.5.4). Due to the computational cost of fully training the agents until stable convergence, they were trained for only 250 iterations, and the average validation return was used as an indicator for comparing the different hyper-parameter values. Firstly, a small search was performed for the inner learning rate (α), then for the outer learning rate (β), and finally for the adaptation steps, each time picking the best value from the previous search.

Outer LR 0.1
Inner LR [0.01, 0.001, 0.0001]
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps 1
Seeds 42
Figure C.5.1: MAML-TRPO inner lr search on ML1: Push.
Table C.5.1: HP values of the models in C.5.1

Outer LR [0.001, 0.01, 0.1, 0.3, 0.5, 0.9]


Inner LR 0.001
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps 1
Seeds 42
Figure C.5.2: MAML-TRPO outer lr search on ML1: Push.
Table C.5.2: HP values of the models in C.5.2

Outer LR 0.3
Inner LR 0.001
Adapt Batch Size 20
Meta Batch Size 40
Adapt Steps [1, 3, 5]
Seeds 42
Figure C.5.3: MAML-TRPO adapt step search on ML1: Push.
Table C.5.3: HP values of the models in C.5.3

#1 #2 #3 #4 #5
Outer LR 0.3 0.3 0.1 0.1 0.1
Inner LR 0.001 0.001 0.01 0.001 0.01
Adapt Steps 3 1 3 1 1
Adapt BS 20
Meta BS 40
Seeds 42

Table C.5.4: HP values of the models in C.5.4

Figure C.5.4: ANIL-TRPO inner lr search on ML1: Push. Smoothing factor 0.8.

In figure C.5.5, a comparison between two MAML-TRPO models on ML1: Push is illustrated, one with meta batch size=40 & adapt batch size=20 and one with meta batch size=20 & adapt batch size=10. As expected, the one with the bigger batch sizes (more data per inner and outer loop iteration) performs better and is less noisy. It is important to note, however, that for that model the training time was approximately 4.5 minutes per epoch, whereas for the model with the smaller batch sizes it was 1 minute per epoch, a considerable difference in computational cost.

Figure C.5.5: Comparison of MAML models during training with different batch sizes
on ML1: Push.

C.6 Hyper­parameter search on ML10

Similarly, for ML10 different batch sizes were tested as seen in figure C.6.1.

Figure C.6.1: Comparison of MAML models with different batch sizes on ML10.

Additionally, for MAML-TRPO on ML10, the models in C.6.2a and C.6.2b were trained to find a good pair of learning rates and to test whether the number of inner steps affects the performance of the model. From C.6.2b, it seems that the performance difference is minuscule, while using additional inner steps increases the computational cost.

(a) Performance comparison of various learning rates. (b) Comparison of using 1 inner loop step (adaptation step) and 3 steps.

C.7 Vanilla PPO & TRPO on ML10

Training vanilla RL policies (not meta-RL) on ML10 is expected to perform poorly, due to the volatile setting of optimising for multiple losses at the same time. In figure C.7.1, all of the TRPO and PPO models developed in this project for ML10 are shown, with their respective hyper-parameter values in table C.7.1.

Figure C.7.1: Vanilla RL policies trained on ML10.

Method TRPO PPO


Model #1 #2 #3 #4 #1 #2 #3 #4 #5
LR 0.0001 0.001 0.001 0.1 0.001 0.001 0.001 0.0001 0.001
# Tasks 2 20 10 40 20 10 100 40 20 20
# Episodes 3 20 50 20 10 20 10 20 10 10
Seeds 42

Table C.7.1: HP values of the models in C.7.1

C.8 Performance on ML10 per task

Due to the diversity of the tasks in Meta-World, averaging the accumulated rewards and success rates across all tasks to report the performance of a method can be misleading. In the following figures, the performance of the trained models is shown for each task separately.

(a) (b) (c)

(d) (e) (f)

(g) (h) (i)

(j)

Figure C.8.1: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the train tasks of ML10. Reported results are mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.

(a) (b) (c)

(d) (e)

Figure C.8.2: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the test tasks of ML10. Reported results are mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.

C.9 Performance on ML10 test tasks before and after adaptation

One possible question that arises regarding the meta-testing procedure is whether such limited interaction with unseen tasks is sufficient for the agents to adapt. The methods were updated only once, based on 10 episodes, with a small learning rate and a high discount factor. This makes the performance highly dependent on the randomised seed that sets the configuration of the environment, and there is a chance that the agents could not adapt quickly with these hyper-parameter values.

In order to investigate the significance of the hyper-parameter values for the performance of the algorithms, the same meta-testing procedure was performed with a much larger batch size (episodes sampled per task) and more gradient updates, giving the policies a few more tries to converge. The hyper-parameter settings of the meta-testing are presented in table C.9.1. The results of this test, shown in figure C.9.1, indicate little difference between the ”default” and the ”extended” hyper-parameter values during meta-testing.

Hyper­Parameters Default Extended


Adapt Steps 1 5
Adapt Batch Size 10 300
Inner lr 0.001 0.05
γ 0.99 0.95

Table C.9.1: Hyper­parameter values for the meta­testing phase of the algorithms on
the ML10 test tasks.

(a) (b) (c)

Figure C.9.1: Performance (Accumulated rewards & success rate) of PPO, MAML­
TRPO and ANIL­TRPO on the test tasks of ML10. The ”Before” results are the agents
evaluated on the test tasks without any change to their weights after training. The
”After 1 Step” results are based on the ”Default” values of the table C.9.1 and the ”After
5 Steps” are based on the ”Extended” values. Reported results are mean and standard
deviation of 5 different seeds with 3 trials of 10 episodes per task.

C.10 Meta-World Reward functions & success metrics

The specific reward functions and their success metrics from Meta-World: ML10 are presented, as given in [90].

Task: Reward function

Train tasks
basketball: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
button-press: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
door-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
drawer-close: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
peg-insert-side: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
pick-place: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
push: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
reach: 1000 · e^(−||h−g||₂² / 0.01)
sweep: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
window-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)

Test tasks
door-close: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
drawer-open: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
lever-pull: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
shelf-place: −||h−o||₂ + I[||h−o||₂ < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)
sweep-into: −||h−o||₂ + I[||h−o||₂ < 0.05] · 1000 · e^(−||h−g||₂² / 0.01)

Table C.10.1: Reward functions of the ML10 tasks.
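
To make the notation concrete, the push-style reward can be written out as a short function. This is a hypothetical helper, not Meta-World code: h, o and g are assumed to be 3-D hand, object and goal positions, and the minus sign in the exponent is assumed so that the 1000-point bonus decays with distance, consistent with [90].

```python
import math

def push_reward(h, o, g):
    """Push-style ML10 reward: a shaped reach term plus a gated place bonus.

    h: hand position, o: object position, g: goal position (3-D tuples).
    Illustrative reconstruction of the formula in table C.10.1, not the
    actual Meta-World implementation.
    """
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    reach = -dist(h, o)                        # move the hand toward the object
    gate = 1.0 if dist(h, o) < 0.05 else 0.0   # indicator I[||h-o||_2 < 0.05]
    bonus = 1000.0 * math.exp(-dist(h, g) ** 2 / 0.01)
    return reach + gate * bonus

print(push_reward((0, 0, 0), (0, 0, 0), (0, 0, 0)))  # 1000.0: at object and goal
print(push_reward((1, 0, 0), (0, 0, 0), (0, 0, 0)))  # -1.0: hand 1 m from object
```

The gating explains the sparse-reward difficulty discussed in the hyper-parameter searches above: until the hand is within 5 cm of the object, the agent only sees the weak negative distance shaping.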

Task: Success metric

Train tasks
basketball: I[||o−g||₂ < 0.08]
button-press: I[||o−g||₂ < 0.02]
door-open: I[||o−g||₂ < 0.08]
drawer-close: I[||o−g||₂ < 0.08]
peg-insert-side: I[||o−g||₂ < 0.07]
pick-place: I[||o−g||₂ < 0.07]
push: I[||o−g||₂ < 0.07]
reach: I[||o−g||₂ < 0.05]
sweep: I[||o−g||₂ < 0.05]
window-open: I[||o−g||₂ < 0.05]

Test tasks
door-close: I[||o−g||₂ < 0.08]
drawer-open: I[||o−g||₂ < 0.08]
lever-pull: I[||o−g||₂ < 0.05]
shelf-place: I[||o−g||₂ < 0.08]
sweep-into: I[||o−g||₂ < 0.05]

Table C.10.2: Success metrics of the ML10 tasks. The static values represent distance in meters.

TRITA-EECS-EX-2021:15
