Insights into Model-Agnostic Meta-Learning on Reinforcement Learning Tasks
Konstantinos Saitas Zarkias
Place of Project
Stockholm, Sweden
Research Institutes of Sweden, Kista
Examiner
Pawel Herman
KTH Royal Institute of Technology
Supervisor
Alexandre Proutiere
KTH Royal Institute of Technology
ii
Abstract
Meta-learning has been gaining traction in the Deep Learning field as an approach to build models that are able to efficiently adapt to new tasks after deployment. Contrary to conventional Machine Learning approaches, which are trained on a specific task (e.g. image classification on a set of labels), meta-learning methods are meta-trained across multiple tasks (e.g. image classification across multiple sets of labels). Their end objective is to learn how to solve unseen tasks with just a few samples. One of the most renowned methods of the field is Model-Agnostic Meta-Learning (MAML). The objective of this thesis is to supplement the latest relevant research with novel observations regarding the capabilities, limitations and network dynamics of MAML. To this end, experiments were performed on the meta-reinforcement learning benchmark Meta-World. Additionally, a comparison with a recent variation of MAML, called Almost No Inner Loop (ANIL), was conducted, providing insights into the changes of the network's representation during adaptation (meta-testing). The results of this study indicate that MAML is able to outperform the baselines on the challenging Meta-World benchmark but shows little sign of actual "rapid learning" during meta-testing, thus supporting the hypothesis that it reuses features learnt during meta-training.
Keywords
Acknowledgements
I would like to thank the researchers at the Research Institutes of Sweden, who provided me with a welcoming environment to work in, the professors and classmates at KTH for insightful discussions and support, and the Foundation for Education and European Culture (IPEP), who trusted my work and offered financial aid during my Master's. Most importantly, I would like to thank everyone who stood next to me and supported me during the period of my thesis and in the difficult times of the Covid-19 pandemic.
Acronyms
Contents
1 Introduction
1.1 Background
1.2 Problem
1.3 Goal & Contribution
1.4 Delimitations
1.5 Outline
2 Theoretical Background
2.1 Deep Learning
2.1.1 Artificial Neural Networks
2.1.2 Gradient-Based Learning & Backpropagation
2.2 Meta-Learning
2.2.1 Defining Meta-Learning
2.2.2 Types of Meta-Learning
2.3 Reinforcement Learning
2.3.1 Model-based (or World Model) vs Model-free
2.3.2 Off-policy vs On-policy
2.3.3 Meta-Reinforcement Learning
2.4 Related Work
2.4.1 Model-Agnostic Meta-Learning
2.4.2 Variations
2.4.3 Insights on MAML
2.4.4 Almost No Inner Loop Variation
3 Methodology
3.1 Experimental setup
3.1.1 Model-Agnostic & Problem-Agnostic Formulation
3.1.2 Representation Similarity Experiments
4 Results
4.1 Few-Shot Image Classification
4.2 Meta-Reinforcement Learning
4.2.1 Particles 2D
4.2.2 ML1: Push
4.2.3 ML10
5 Discussion
5.1 Further insights
5.2 Limitations
5.3 Future Work
5.4 Ethical concerns
5.5 Sustainability
6 Conclusions
References
Chapter 1
Introduction
Neural networks have proven to be a highly useful tool in many settings, from detecting various types of cancer in humans [64], [19] to efficiently developing robots that perform various physical tasks [83]. However, developing such models is not an easy process, and it is often met with considerable limitations. In contrast to how humans acquire new knowledge and skills, neural networks require a great amount of data to train with [30]. In addition, they are particularly sensitive when trying to incorporate new information after they have already been trained on a task, which often leads to the infamous phenomenon of catastrophic forgetting. This occurs when a model is trained on one task, then trained on a second task, and subsequently fails completely on the first task [31]. For these reasons, neural networks are generally unsuitable for problems where large-scale data collection is inaccessible (e.g. X-rays) or where new tasks are introduced after the model has been trained and retraining from scratch is too costly (e.g. detecting a new type of disease after being deployed to detect a previous disease).
A research field that has been gaining traction due to promising recent publications, and that could potentially combat these issues, is Meta-Learning. As with many ideas in Machine Learning (ML), the core concept was loosely inspired by a field that studies humans, this time the area of educational psychology. When studying the learning abilities of students, John Biggs describes the term meta-learning as "one's awareness over their learning process and actively taking control over it" [8]. It is easy to see why applying this idea to ML models becomes very intriguing. Developing algorithms that are able to self-assess their own performance and improve on it could be profoundly valuable. Such algorithms could alleviate the need for humans to fine-tune models and their hyperparameters in order to adapt them to a specific task, and they could generalise across multiple, different tasks; these are two problems that are still quite present in the current ML development process.
The end goal of meta-learning is to create models that have the ability to quickly adapt to new tasks they have not seen before, using their past experience from training on similar tasks. For example, an experienced musician who has spent many years learning to play the guitar, violin and contrabass has a considerably easier time picking up a new instrument like the cello and learning to play it moderately well, compared to someone who has never played an instrument before. One reason for this is that similar tasks share similar dynamics and structure [60] (all these instruments have strings and follow the same rules of physics), but also that a skilful musician is aware of what learning direction to follow when learning an instrument (e.g. which exercises will help them familiarise themselves with it), based on their experience with the previous instruments; a process that often happens subconsciously. This is the high-level rationale meta-learning tries to leverage when training models.
This thesis focuses on examining one of the most influential state-of-the-art meta-learning algorithms, called Model-Agnostic Meta-Learning (MAML) [22], expanding on the latest relevant research and providing insights based on experimental results on a new reinforcement learning benchmark, Meta-World [90].
1.1 Background
Most supervised ML models are presented with large quantities of data for a specific task they should try to solve. During testing, new data samples of the same task are presented to the model, with the expectation that the test data contain feature patterns similar to the training data, which the model has already identified, so that it can make correct predictions. A common example would be feeding an Artificial Neural Network (ANN) images that contain cats and dogs and training the model to distinguish which image contains which animal. Then, to evaluate its accuracy, different pictures of cats and dogs that were not part of the training set are fed to the model, and in return it tries to answer which is which. However, if a new task were to be introduced, for example distinguishing between cats, dogs and lions, the network would most likely fail, and retraining it from scratch with many additional images of lions would probably be necessary to account for the new class of animal. Even such a seemingly trivial problem is still challenging for many ML models.
Meta-learning methods try to tackle this issue by becoming efficient learners, with the aim of rapid adaptation to new tasks while requiring only a few training samples. As previously mentioned, conventional ML models try to leverage similarities between training and test data to make predictions during evaluation. Similarly, meta-learning models also try to leverage similarities, but between training and test tasks, in order to generalise to new tasks. Different meta-learning algorithms follow different training procedures for how this can be achieved, but most current algorithms adhere to the same principle. During meta-training¹, there are two learning systems being optimised. Firstly, a lower-level base learner tries to adapt rapidly to new data, meaning to complete the task with only a few samples. Secondly, a higher-level meta-learner is optimised to fine-tune and improve the base learner based on how well the adaptation process was performed. During meta-testing, the parameters of the network (weights, biases or even hyperparameters) are updated in just a few iterations² to values that can solve the new task at hand [38].
One of the most prominent meta-learning algorithms, which has sparked a wave of new similar approaches, is MAML [22]. Following the basic principles of meta-learning, MAML aims to train a neural network with the purpose of finding an initialisation of the model parameters that is suitable for fast adaptation to new tasks. It also consists of a two-level learning system: an inner loop, or base learner, for fast adaptation, and an outer loop, or meta-learner, for improving the parameter initialisation³. Specifically, during the inner loop, the network starts from the initial set of parameters and performs a brief "learning session" (a few iterations) on each task in a small set of different tasks. Next, during the outer loop, the network initialisation is updated with respect to the general direction of the adapted inner-loop parameters of all the tasks. The goal of this approach is to explicitly optimise the model's parameters for rapid learning.
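As an illustration of the two-level loop described above, the following sketch runs a first-order variant of the MAML update on a toy family of 1-D linear regression tasks. The single-parameter "network", the task distribution and the learning rates are all illustrative choices, not taken from the thesis or the original paper (which uses full second-order gradients through the inner loop):

```python
import numpy as np

# Each task is linear regression y = a * x with a task-specific slope a.
# The "network" is a single parameter theta; the loss is mean squared error.

def loss_and_grad(theta, x, y):
    pred = theta * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)   # d(loss)/d(theta)
    return loss, grad

def maml_meta_step(theta, tasks, inner_lr=0.01, outer_lr=0.1, inner_steps=1):
    """One outer-loop (meta) update over a batch of tasks."""
    meta_grad = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:       # support / query splits
        phi = theta
        for _ in range(inner_steps):           # inner loop: fast adaptation
            _, g = loss_and_grad(phi, x_s, y_s)
            phi = phi - inner_lr * g
        _, g_q = loss_and_grad(phi, x_q, y_q)  # outer loop: loss on the query set
        meta_grad += g_q                       # first-order approximation
    return theta - outer_lr * meta_grad / len(tasks)

rng = np.random.default_rng(0)
theta = 0.0
for _ in range(200):                           # meta-training
    tasks = []
    for _ in range(4):                         # a batch of 4 sampled tasks
        a = rng.uniform(1.0, 3.0)              # task = slope a
        x = rng.normal(size=10)
        tasks.append(((x[:5], a * x[:5]), (x[5:], a * x[5:])))
    theta = maml_meta_step(theta, tasks)
print(round(theta, 2))  # the meta-learned initialisation
```

With these settings the initialisation drifts towards the middle of the slope range (around 2), a point from which one inner-loop gradient step can adapt cheaply to any task in the family.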
¹ The term meta-training is used similarly to training in traditional ML models, but in the context of meta-learning approaches.
² A simple tutorial with visualisation can be found in this link.
³ The terms inner and outer loop refer to the programming practice of nested loops during meta-training.
1.2 Problem
Since the publication of MAML, many studies have tried to examine its inner mechanisms and the effects it has on the learning procedure of models [61], [23], [57]. A recent study made an interesting observation by asking the question: "is the effectiveness of MAML due to the meta-initialisation being primed for rapid learning (efficient changes in the network representation) or due to feature reuse, with the meta-initialisation already containing high quality features?" [62]. Their experiments presented evidence for the latter, suggesting that the adaptation phase of the inner loop is of little value, and thus proposing a variation titled Almost No Inner Loop (ANIL).
These observations lead to the following research questions:
1. Is the effectiveness of MAML due to rapid learning during adaptation, or due to reuse of features learnt during meta-training?
2. How do conventional (non meta-learning) Reinforcement Learning methods perform compared to meta-learning methods on the Meta-World benchmark?
3. Does the Almost No Inner Loop (ANIL) approach perform similarly to MAML in more complex environments?
The first question is inspired by [62], where evidence was found for the case of feature reuse. We evaluate the same question both in the same settings (Omniglot, Mini-ImageNet & Particles2D) and in a new setting (Meta-World), in order to reproduce the results and also add new observations from a more challenging environment. The second question arose from the lack of comparison between meta-learning methods and conventional (non meta-learning) methods in terms of performance in the Meta-World environment, as reported in [90]. By asking this question, we challenge the need for meta-learning and inspect the limitations and capabilities of conventional Reinforcement Learning (RL) methods. In [62], as a result of their findings regarding MAML's feature reuse, the authors propose ANIL as an equivalent alternative method, which they also evaluate on the tasks of Omniglot, Mini-ImageNet & Particles2D, finding performance similar to MAML. However, these tasks are quite limited and not complex enough to test the capabilities of MAML and ANIL. For this reason, we examine ANIL's performance on Meta-World in order to evaluate whether it is actually equivalent to MAML. Lastly, one of the considerable benefits of ANIL over MAML, as reported in [62], is that it is computationally cheaper, offering significant speed-ups during training and testing. We evaluate this claim again by reproducing the results on the same environments while also providing new insights about computational costs on Meta-World.
The contributions of this project are both theoretical and practical. An experimental analysis of MAML provides valuable insights into its internal mechanisms, its performance capabilities and, equally important, its limitations. The novelty of this project is a comparison between MAML and one of its recently published variations, called ANIL, on the meta-reinforcement learning benchmark Meta-World. The results of these experiments drive the focus of the evaluation of meta-learning methods to a
Lastly, the implementations of the algorithms presented in this thesis and the code associated with reproducing these experiments are open-sourced. Various open-source programming libraries and frameworks were used for the realisation of this thesis, some of which are still in active development. During the implementation of the experiments, contributions were made to public repositories on GitHub in the form of bug fixes, bug reports and additional feature implementations. One of these repositories was the learn2learn PyTorch meta-learning library⁴, which acted as a core part of this thesis' code base. During the development of the thesis, a collaboration with the lead developer of the library led to contributions on GitHub and finally a technical paper publication ([3]).
1.4 Delimitations
One frequent issue when developing methods and experiments with the latest environments and benchmarks is the possible bugs or performance shortcomings that come along with them. This project made extensive use of open-source libraries and frameworks, most of which are still in active development. The basic components were the meta-learning library learn2learn for the algorithmic implementation and the multi-task & meta-learning environment Meta-World⁵ for the realisation of the experiments. Both of these frameworks went through many iterations, and even complete reworks of their API, between the start of the thesis and its end. They are still active projects maintained by their own developer teams and the open-source meta-learning community. It is also possible that further bug reports will arise and optimisation improvements will take place. This is to say that the reported results cannot be assumed to have been generated by bug-free software. Thus, in order to guarantee replication of the findings of this thesis, the specific versions of the software used need to be installed.
The most significant limitation of this degree project is related to the evaluation. In order to provide accurate and concrete conclusions based on the experimental results, statistical hypothesis tests are necessary. However, due to the computational costs required, too few models were developed to be able to carry out such tests. Thus, the results of the thesis showcase trends observed by training and evaluating the algorithms. In order to draw definite conclusions based on statistically sound results, a more extensive evaluation is required. The precondition for carrying out such a test would be access to considerable computational power.
1.5 Outline
The structure of this project's report is as follows. In Chapter 2, an extensive analysis of the background work is presented to lay the theoretical foundations of the thesis and cover related research. Next, in Chapter 3, the methodology and the research approach are introduced, along with the technical details of the evaluation setup. In Chapter 4, the results of the experiments are presented through tables and figures. Chapter 5 discusses the overall outcome of the degree project while also commenting on future work, ethical concerns and sustainability. Lastly, Chapter 6 summarises the results of the previous chapters and provides some conclusive remarks. In addition to these main chapters, an Appendix section (A-C) is attached to provide technical details and additional work that might be relevant for reproducibility purposes.
Chapter 2
Theoretical Background
This degree project is based on the combination of three different fields: Deep Learning, Reinforcement Learning and Meta-Learning. Deep Learning is a part of the Machine Learning field that is primarily focused on the study of artificial neural networks. Reinforcement Learning and Meta-Learning are both fields that have been studied in the contexts of Psychology, Neuroscience and Machine Learning. In the rest of the thesis, mentions of the fields of Reinforcement Learning and Meta-Learning refer to their application within the Deep Learning field, unless otherwise specified¹.

The first three sections (2.1, 2.2, 2.3) of this chapter provide a gentle introduction to the relevant fields without delving into too much detail on the thesis' specifics. Section 2.4 then focuses solely on providing a thorough theoretical understanding of all the parts of the project and the latest relevant research.
Loosely inspired by the mammalian brain, the basic concept of an ANN follows a similar structure [30]. A simple artificial neuron receives some input data x and produces an output y based on its weights w and an activation function f. These types of artificial neurons are also known as McCulloch and Pitts neurons (figure 2.1.1a). A set of these neurons makes up a Perceptron layer, and a set of Perceptron layers finally makes up one of the most standard ANN architectures, the Multi-Layer Perceptron (MLP). The layers are stacked (from left to right, as seen in figure 2.1.1b) and the output of each layer becomes the input of the next.
The process of getting from the input data x to the output values y through the whole network is called forward propagation (or forward pass) and consists of the following steps:
1. For each neuron of the first layer:
(a) Multiply the data x with the weights w of the neuron.
(b) Sum the weighted inputs and add the bias to obtain the pre-activation h.
(c) Apply the activation function f to produce the neuron's output y = f(h).
2. The results y of the first layer become the input of the next layer, and the same steps are followed for x ← y.
3. The output of the network is the output of the final layer.
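The steps above can be sketched in a few lines; the layer sizes and the ReLU activation below are illustrative choices, not specifics from the thesis:

```python
import numpy as np

def relu(h):
    # Element-wise activation function f
    return np.maximum(0.0, h)

def forward(x, layers):
    # Each layer: weighted sum of inputs plus bias, then activation;
    # the output of one layer becomes the input of the next.
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input dim 3 -> hidden dim 4
    (rng.normal(size=(2, 4)), np.zeros(2)),  # hidden dim 4 -> output dim 2
]
y = forward(np.array([1.0, -0.5, 0.2]), layers)
print(y.shape)  # (2,)
```

In practice the final layer often uses a different activation (e.g. softmax for classification), but the stacked structure is the same.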
The adjustable variables (weights & biases) of the network are also called the parameters θ of the network. Multiple variations of these networks have been developed to better tackle different problems. For example, Convolutional Neural Networks (CNN) are best suited when the input data consist of images or video (2.1.2a), whereas Recurrent Neural Networks (RNN) are best suited when the data are sequentially dependent, such as text, time-series etc. (2.1.2b).

In order to get meaningful results from these networks, they need to be trained, meaning their parameters θ need to be updated based on a cost function. This is achieved by the backward pass, also known as the backpropagation algorithm [68], which computes the gradients used by a gradient-based optimizer. Training an MLP is an iterative process of many (usually thousands or millions of) consecutive forward and backward passes.
Cost function (or loss function): The cost function describes the error between the output values (predictions) of the network and the target values the network tries to optimise for. Depending on the problem, the cost function can take many forms. For classification problems, this usually means trying to minimise the cross-entropy loss (equation 2.1), based on the principle of maximum likelihood [30]. For regression, and sometimes RL, this usually means trying to minimise the mean-squared error, which is the average squared difference between the predictions and the target values (equation 2.2).

CE = − Σᵢ yᵢ log ŷᵢ    (2.1)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²    (2.2)
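Equations (2.1) and (2.2) transcribe directly into code; the target and prediction values below are illustrative:

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Equation (2.1): y is a one-hot target, y_hat the predicted probabilities
    return -np.sum(y * np.log(y_hat))

def mse(y, y_hat):
    # Equation (2.2): average squared difference over n samples
    return np.mean((y - y_hat) ** 2)

y_cls = np.array([0.0, 1.0, 0.0])          # one-hot target (class 1)
p = np.array([0.1, 0.8, 0.1])              # predicted class probabilities
print(round(cross_entropy(y_cls, p), 3))   # 0.223, i.e. -log(0.8)

y_reg = np.array([1.0, 2.0, 3.0])
pred = np.array([1.5, 2.0, 2.0])
print(round(mse(y_reg, pred), 4))          # 0.4167, i.e. (0.25 + 0 + 1) / 3
```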
g ← ∇ŷ J    (2.3)

(a) Calculate the gradient of the layer with respect to its output before the activation function (element-wise multiplication ⊙, if f is element-wise):

g ← ∇h J = g ⊙ f′(h)    (2.4)

(b) Calculate the gradients with respect to the layer's biases and weights (λ weights a regularisation term Ω(θ)):

∇b J = g + λ∇b Ω(θ)    (2.5)

∇w J = g yᵀ + λ∇w Ω(θ)    (2.6)

(c) Propagate the gradient back to the previous layer's output:

g ← ∇y J = Wᵀ g    (2.7)
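The backward pass for a single fully connected layer can be sketched as follows; the sigmoid activation and the omission of the regularisation term (λ = 0) are illustrative simplifications:

```python
import numpy as np

# Shapes: x is the layer input, h = W x + b the pre-activation, y = f(h).
# g is the gradient flowing from the loss back towards the input.

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def layer_backward(g, W, x, h):
    g = g * sigmoid(h) * (1 - sigmoid(h))  # g <- grad_h J = g ⊙ f'(h)
    grad_b = g                             # grad_b J = g
    grad_W = np.outer(g, x)                # grad_W J = g x^T
    g = W.T @ g                            # g <- grad_x J = W^T g
    return grad_W, grad_b, g

# Sanity check against a finite-difference estimate of d(sum(y))/dW[0, 0].
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
x = rng.normal(size=2)
h = W @ x + b
grad_W, _, _ = layer_backward(np.ones(3), W, x, h)

eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
numeric = (np.sum(sigmoid(W2 @ x + b)) - np.sum(sigmoid(W @ x + b))) / eps
print(abs(grad_W[0, 0] - numeric) < 1e-4)  # True
```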
Optimizers: The optimizers are the actual algorithms that perform the updating of the network parameters, using the gradients computed by backpropagation. One of the most commonly used optimizers is Adam [44], due to its adaptive learning rate mechanism. Other common optimizers are Stochastic Gradient Descent [67] and AdaDelta [91].
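The two most cited update rules above can be sketched compactly; the toy objective f(θ) = θ² and the hyperparameter values (the common Adam defaults) are illustrative:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # Plain stochastic gradient descent
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and its
    # square (v), with bias correction for the early steps.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimise f(theta) = theta^2 (gradient 2 * theta) with each rule.
theta_sgd = theta_adam = 5.0
m = v = 0.0
for t in range(1, 501):
    theta_sgd = sgd_step(theta_sgd, 2 * theta_sgd)
    theta_adam, m, v = adam_step(theta_adam, 2 * theta_adam, m, v, t)
print(round(theta_sgd, 6), round(theta_adam, 2))
```

With these settings SGD has essentially reached the minimum, while Adam, whose effective step size is roughly lr per iteration here, has moved about 500 · 0.001 = 0.5 towards it; the point is only to contrast the update rules, not their tuning.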
The idea of Meta-Learning was first conceived in the field of psychology [8] and was then adopted in other fields such as neuroscience [14]. In the setting of educational psychology, meta-learning is defined as "a person [who is] aware and capable to assess their own learning approach and adjust it according to the requirements of the specific task". Kenji Doya defines meta-learning in the context of neuroscience as "a brain mechanism with the capability of dynamically adjusting its own hyperparameters (e.g. dopaminergic system) through neuromodulators (e.g. dopamine, serotonin etc.)" [14]. Even though the idea of Meta-Learning has been around since the 80s [70], it has only recently seen rapid growth in the field of Deep Learning, with exciting new research and progress in different directions every year. As such, definitions of what exactly constitutes meta-learning can be inconsistent depending on the setting.
Meta-knowledge
The condition ω is defined as the factors leading to the solution θ, which encompass the "how to learn" knowledge, and is also known as meta-knowledge. This could be the initial parameter values θ, the choice of the optimiser, hyperparameter values, the function class for f, etc. In this setting, for every dataset D that is given, the optimisation starts from scratch with a hand-tuned, pre-specified ω. Hence, meta-learning can be defined as learning the condition ω, leading to improved performance and the ability to generalise across tasks,

where Lᵐᵉᵗᵃ(θ*; ω; D) represents the model performance on a dataset D with fixed parameters θ*, optimising for ω.
Meta-training
The process of training a model this way over a set of M tasks is called meta-training, in which the training tasks are formalised as D_train = {(D_trainˢᵘᵖᵖᵒʳᵗ, D_trainᵠᵘᵉʳʸ)ᵢ}ᵢ₌₁ᴹ. Each training task has its own support and query subset, used as "mini-train" and "mini-validation" sets. Optimising for ω, or learning to learn, is then written:

ω* = arg maxω log p(ω | D_train)
That is, given the set of M datasets in D_train, the goal is to pick the ω* that maximises the log probability of the meta-knowledge across all M datasets.
A typical way of meta-training a model, and the one this thesis adopts in chapter 3, is through a two-level optimisation with an inner and an outer loop. This approach follows the idea of hierarchical optimisation, where the model is optimising for one goal while being constrained by a second optimisation goal [28]. The inner loop optimises the base model, or learner, for the parameters θᵢ* of task i while being conditioned on ω (2.11). The inner optimisation is based on the support set of the dataset. Note that the notation θᵢ*(ω) is used to represent that each θᵢ* shares the same ω. The outer loop optimises the meta-learner for the condition ω, with the objective of producing models θᵢ* that perform optimally on the query sets (2.12).

ω* = arg minω Σᵢ₌₁ᴹ Lᵐᵉᵗᵃ(θᵢ*(ω), ω, (D_trainᵠᵘᵉʳʸ)ᵢ)    (2.12)

where Lᵢ is the cost function of task Tᵢ for a fixed ω, and Lᵐᵉᵗᵃ is the meta-learning objective for a fixed θᵢ*(ω).
Meta-testing
During meta-testing, each new task similarly provides a support and a query set, in which the former is used to leverage the meta-learned ω and the latter to finally evaluate the accuracy of the adapted model. The size of the support set for each test task is usually small, since the objective is rapid adaptation with few samples, not retraining. A common meta-learning evaluation setting for classification is "K-shot learning", meaning that each new task comes with K samples ("shots") per class. The meta-learnt model then needs to adapt to the task using only these K samples in order to correctly predict the labels of the query set. Thus, the meta-testing process consists of adapting to D_testˢᵘᵖᵖᵒʳᵗ and then evaluating on D_testᵠᵘᵉʳʸ.

Figure 2.2.1: An overview of the meta-learning field divided by algorithm design and applications. Figure from [38].
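The way a K-shot, N-way evaluation episode is assembled can be sketched as follows; the synthetic "dataset" and the episode sizes are illustrative placeholders:

```python
import numpy as np

def make_episode(data_by_class, n_way=3, k_shot=5, n_query=5, rng=None):
    # Sample n_way classes; for each, K labelled support samples are set
    # aside for adaptation and n_query held-out samples for evaluation.
    rng = rng or np.random.default_rng()
    classes = rng.choice(len(data_by_class), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        samples = rng.permutation(data_by_class[c])
        support += [(x, label) for x in samples[:k_shot]]
        query += [(x, label) for x in samples[k_shot:k_shot + n_query]]
    return support, query

# Ten synthetic "classes" of 20 scalar samples each.
rng = np.random.default_rng(0)
data = [rng.normal(loc=c, size=20) for c in range(10)]
support, query = make_episode(data, rng=rng)
print(len(support), len(query))  # 15 15  (N*K support, N*n_query query samples)
```

The model adapts on `support` and is scored only on `query`, mirroring the meta-testing procedure described above.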
Meta-Representations ("What?"): The first and maybe the most apparent way to categorise meta-learning approaches is by what the meta-knowledge ω entails; specifically, which parts of the learning process are to be learned and then deployed
The field of Reinforcement Learning lies between supervised learning, where the model is trained on labelled data, and unsupervised learning, where the model aims to find and leverage similarities and patterns in the data. Instead, RL algorithms are driven by interacting with an environment and receiving rewards based on whether the agent's² actions performed well enough or completed a goal. The agent then picks the actions that maximise the cumulative reward, leading to an optimal strategy through a long series of trials and errors, often within a given time constraint. Based on this simple setting, many methods have been developed in recent years with remarkable results, especially in the field of games. One of the first papers to reignite the interest of the research community came from a team of researchers at DeepMind, who managed to develop a simple agent algorithm that could surpass human-level performance in many of the classic Atari 2600 games [55]. Since then, research on the field has grown rapidly, with multiple advancements in similar game-like environments such as chess and Go [76], [77]. Progress in real-world applications has also started to expand [58], [59], but at a slower rate due to a series of challenges, mainly involving the immense number of interaction samples with the environment needed to achieve satisfactory results [17].
Figure 2.3.1: Overview of the typical RL setting of an agent interacting with the
environment. Figure from [80].
² Agent is the term for the policy / model in the RL setting.
3. R(s, a): the (unknown) reward function, which indicates how good or bad action a was in state s
4. s′: the next state of the environment after action a has been taken
5. T(s′|s, a): the transition probability function / matrix, which indicates the probability of the environment transitioning to state s′ when the agent takes action a in state s.
Additionally, environments can either have finite horizons (assuming the agent has H time steps to solve the task) or infinite horizons (in which case, to motivate the agent to solve the task, future rewards are discounted by a factor γ).
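The discounted cumulative reward for a (truncated) trajectory is a one-liner; the reward values here are illustrative:

```python
def discounted_return(rewards, gamma=0.9):
    # G = sum_t gamma^t * r_t, computed backwards: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(round(discounted_return([1.0, 1.0, 1.0]), 2))  # 2.71 = 1 + 0.9 + 0.81
```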
The last component, T(s′|s, a), is not always mentioned, because in most deep-RL problems the transition function T is not known. When T is known, the solution can be computed directly, without the need for the agent to interact with the environment, through planning algorithms such as policy iteration, value iteration etc. [79]. Usually, however, the RL objective is defined as finding an optimal policy π that maximises the expected future discounted reward without knowledge of the T or R functions. There are two approaches for dealing with a problem where the transition function is unknown.
The model-based way is to try to model the environment by approximating this function through a long series of environment interactions (thus building a "World Model"). This can be done either by first learning the model and then using a highly sample-efficient planning algorithm³ to solve the problem, or by simultaneously learning the model while also approximating a policy [13]. This approach has seen exceptional success in games, where the dynamics of the environment might not be as complex as in the real world and where an almost perfect simulator (World Model) can be estimated by paying a high computational cost [71]. However, such methods suffer from severe performance loss when the learned policy is transferred to the real world, due to biases embedded in the model [42].

³ Such algorithms can be trained offline, meaning they do not need to sample new interactions with the environment during training, and are thus more cost efficient, while also finding and evaluating a solution before actually executing it.
Furthermore, RL algorithms can be split into two categories depending on how they update their policy from the samples of the environment. Off-policy methods reuse previous samples collected from the environment-agent interactions during training, regardless of the exploration policy of the agent (i.e. whether the agent chooses an action randomly or based on the currently learned policy). This gives them the advantage of being more sample efficient. Almost all Q-learning methods are trained off-policy. In contrast, on-policy methods do not reuse previous data and depend only on sampling online from the learnt policy. Even though this approach might be less sample efficient, it is notably more stable, since it keeps optimising the objective based on the latest policy. Usually, policy optimisation methods are trained on-policy. In this thesis, all of the experiments were performed with model-free, on-policy algorithms such as PPO and TRPO (as seen in section 3).
Generally, even though the tasks might be different, it is important that they share similar internal dynamics and come from the same task distribution [85]. This leads to a strong connection between task similarity (or how broad the task distribution is) and the performance of the meta-RL agent. For example, training a robot hand to grasp different types of objects can be considered a reasonable meta-RL problem, since all the tasks share the dependency of learning the dynamics of the joints and the movement of each individual finger. However, when the distribution becomes too broad, such as including both teaching a robot how to walk and solving a Rubik's cube with its hand, it is expected that finding a shared meta-RL strategy becomes considerably harder.
The number of different methods developed is extensive along all three axes of meta-learning (mentioned in 2.2.2) and across RL's algorithmic landscape. In many cases, MAML has been used as the meta-learning base framework, with different methods adapting and modifying it in new ways, such as incorporating recurrent methods and model-based RL [56] or extending it into a probabilistic framework [75]. A large portion of this field has focused on developing methods that make use of RNNs or that try to include some sort of memory in the agent in order to incorporate the learned knowledge [87], [16], [54]. Furthermore, when learning a policy, latent representations or task descriptors have been used efficiently to distinguish between the learnt skills and tasks [35], [20], [85].
Occasionally, meta-RL is confused with multi-task RL, since they share many core characteristics, such as training on a distribution of tasks. However, their objectives are different: multi-task RL tries to optimise a single policy that solves the presented tasks more efficiently than learning the tasks individually, whereas meta-RL aims to learn the dynamics of the training tasks in order to adapt fast to new tasks [90].
ω is the initial values of the θ parameters (the representation of the network). Thus, the parameters start from the point ω and evolve to θ_i^* for each T_i during the inner loop. For simplicity, since ω and θ refer to the same component of the model, in the following equations the initial point of the parameters will simply be noted as θ (figure 2.4.1).

Adaptation is performed with gradient descent starting from the θ parameters, without modifying them directly (this set is kept for the meta-optimisation update); instead, a separate set θ_i^* is kept for each task T_i, which can be updated either once or multiple times (MAML supports multiple consecutive gradient descent updates on each learner on each task). This can be perceived as one meta-learner θ evolving into multiple learners θ_i^*, each specialised for a different task.

where α stands for the inner loop (adaptation) learning rate. The meta-objective of this method is to "optimise for the performance of f_{θ_i^*} with respect to θ across tasks sampled from p(T)" [22], as seen in equation 2.17:

min_θ Σ_{T_i∼p(T)} L_i(θ_i^*; (D^{query}_{train})_i)        (2.17)

The meta-optimisation (outer loop) updates the meta-learner's parameter initialisation θ with gradient descent, based on all the learners' parameters θ_i^* (now fixed):

θ ← θ − β ∇_θ Σ_{T_i∼p(T)} L_i(θ_i^*; (D^{query}_{train})_i)        (2.18)
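As a hypothetical illustration of the two-level structure of equations 2.17 and 2.18, the sketch below runs MAML on 1-D toy tasks with quadratic losses L_i(θ) = (θ − c_i)², where both the adaptation step and the meta-gradient (including its second-order factor) can be written in closed form; the task constants and learning rates are made up:

```python
# Hypothetical 1-D illustration of MAML's inner/outer loops (eqs 2.17-2.18).
# Each task T_i is a toy quadratic loss L_i(theta) = (theta - c_i)^2, so the
# adaptation step and the meta-gradient are available in closed form.

ALPHA, BETA = 0.1, 0.05           # inner (adaptation) and outer (meta) learning rates
tasks = [-2.0, 0.5, 3.0]          # the optima c_i of each task, standing in for p(T)

theta = 0.0                       # meta-learner initialisation theta
for _ in range(500):              # outer loop (meta-optimisation)
    meta_grad = 0.0
    for c in tasks:               # inner loop: adapt theta -> theta_i* (one step)
        theta_i = theta - ALPHA * 2 * (theta - c)
        # dL_i(theta_i*)/dtheta = 2 (theta_i* - c) * dtheta_i*/dtheta, where
        # dtheta_i*/dtheta = 1 - 2 * ALPHA is the second-order correction.
        meta_grad += 2 * (theta_i - c) * (1 - 2 * ALPHA)
    theta -= BETA * meta_grad     # eq 2.18: update the initialisation
print(round(theta, 3))            # -> 0.5, minimiser of the average post-adaptation loss
```

For symmetric quadratic tasks the optimal initialisation is the mean of the task optima, which is exactly where the meta-update converges.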
2.4.2 Variations

A long list of MAML variations has been proposed since its publication, focusing on different components of the training process.

Regarding the meta-optimiser: During the outer loop update, the meta-optimisation requires computing a gradient through a gradient (a "meta-gradient") to update the final θ* parameters. One way of calculating these second-order derivatives is directly, through a Hessian-vector product using automatic differentiation. However, this method is usually highly computationally expensive. The "cheaper" alternative is a first-order approximation of these derivatives, which has been shown not to cause any significant decrease in MAML's performance when ReLU activation functions are used [22]. This approach has been further studied and improved upon, leading to a MAML variation named Reptile [57].
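A minimal sketch of the first-order idea behind Reptile [57], on the same kind of 1-D toy quadratic tasks (all constants here are illustrative): adapt with plain SGD for a few steps, then move the initialisation a fraction of the way toward the adapted parameters, with no gradient-through-gradient computation:

```python
# Illustrative sketch of a Reptile-style first-order update on 1-D toy quadratic
# tasks L_i(theta) = (theta - c_i)^2: adapt with plain SGD, then nudge the
# initialisation toward the adapted parameters. All constants are made up.

ALPHA, EPSILON, INNER_STEPS = 0.1, 0.5, 5
tasks = [-2.0, 0.5, 3.0]              # toy task optima, standing in for p(T)

theta = 10.0                          # deliberately poor initialisation
for _ in range(300):                  # outer loop over task visits
    for c in tasks:
        phi = theta
        for _ in range(INNER_STEPS):  # inner loop: SGD on the task loss
            phi -= ALPHA * 2 * (phi - c)
        theta += EPSILON * (phi - theta)  # Reptile: move theta toward phi
print(round(theta, 3))                # ends up between the task optima
```

Note that no second derivative ever appears: the update direction is simply the difference between the adapted and the initial parameters.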
Regarding application in other fields: MAML has seen adaptations to fields such as imitation learning [26], online learning [24], latent embedding optimisation [69], and probabilistic [25] and Bayesian frameworks [32]. In addition to its leading success in few-shot classification (top-10 performance on the few-shot classification benchmarks on [Link] as of this date, despite being one of the oldest methods) and regression, MAML has been applied in different ways in various RL settings [56], [33], [89], [78].
Finn et al. suggested that MAML's final meta-learnt initialisation parameters are primed for rapid learning, meaning that changes in the representation will lead to significant improvements when adapting to a task drawn from the same task distribution. They also presented MAML as a way to "maximise the sensitivity of the loss functions of new tasks with respect to the parameters". This perspective suggests that the knowledge gained by the model concerns efficient gradient-based learning, and not necessarily the dataset's features themselves. This view is further supported by additional experiments with MAML examining its resilience to overfitting and comparing it with conventional Deep Learning (DL) methods [23].
A recent publication by Raghu et al. argues that the view of MAML presented above (2.4.3) is not accurate and that the model does not actually learn to learn. Instead, they argue, it already incorporates high quality features of the task distribution, and this is what leads to successful adaptation to new test tasks [62]. They provided evidence for this hypothesis by running experiments in which they froze the network body representation (hidden layers) and only let the head (final output layer) adjust during the inner loop adaptation phase. Additional experiments were presented indicating that during meta-testing the network body barely needs to be updated to maintain the same performance, and that it is even possible to completely remove the network head, relying only on the learnt representations. This simplified and computationally cheaper MAML variation, called ANIL, contributed insightful observations about the inner mechanisms of MAML.
In the ANIL paper, experiments were performed for few-shot image classification and meta-RL. For the former setting, two datasets were used: Omniglot and MiniImageNet. For the meta-RL setting, the MuJoCo environments HalfCheetah-Direction, HalfCheetah-Velocity and 2D-Navigation were used. Even though these frameworks are among the most popular benchmarks for meta-learning methods, they might not be sufficient to prove the hypothesis stated previously. As one of the ANIL paper's reviewers criticises, the train and test tasks come from the same, relatively narrow dataset, where feature reuse might be enough to provide good performance. Specifically for image classification, both Omniglot and MiniImageNet are rather trivial tasks compared to the recently proposed Meta-Dataset benchmark, which was developed explicitly for evaluating meta-learning methods [84]. Moreover, the 2D-Navigation environment was introduced in [22] as a toy baseline RL problem, and the HalfCheetah benchmark is known to be easily solvable with simple linear policies or random search [50].
This leads to the question of whether ANIL would be able to perform similarly to MAML in more challenging environments with a broader task distribution. Currently, some of the best candidates for examining the robustness of its performance are Meta-Dataset and MetaWorld, for image classification and reinforcement learning respectively. Due to Meta-Dataset's immense size and the substantial computational power required to train on it, it was deemed out of the scope of this thesis and is left for future work. Therefore, this thesis mainly focuses on experiments on the meta-RL benchmark MetaWorld, as examined in section 3.
(Comment on OpenReview by an anonymous expert reviewer on ANIL: [Link]forum?id=rkgMkCEtPB&noteId=H1xctUU2oB)
Chapter 3
Methodology
The main focus of the thesis is on the RL experiments, but because of the high complexity and high computational cost of implementing and running meta-RL methods, the baselines were first set on image classification tasks. Since the methods are "model-agnostic", the basic core structure of the code is shared across the different application fields. After establishing reliable implementations of the algorithms by reproducing the results of their original papers on the baseline tasks, they were then evaluated on RL.
The evaluation of the algorithms was performed in three parts. First, the actual meta-training progress is monitored by logging training and validation metrics during learning. These metrics can provide insights into the stability and convergence rate of each algorithm, showcasing their robustness or signs of overfitting or underfitting. Secondly, their final performance is evaluated during meta-testing, where the meta-trained policies are examined on a series of unseen tasks. The resulting metrics measure the success of the method on the problem at hand (accuracy for classification, accumulated reward and success rate for RL). Finally, another experimental setting was proposed to measure the representation similarity of the two models after training, as described in 3.1.2.
Smoothing factor: In cases where the results were too noisy, a smoothing coefficient was used for better visibility (as seen in the example figure 3.1.1). This factor can be adjusted with a value from 0 (no smoothing) to 1 (maximum smoothing), and its formula is based on the exponential moving average.
Figure 3.1.1: Example of using a smoothing factor. Figure 3.1.1a does not use smoothing (value 0) and figure 3.1.1b uses a smoothing value of 0.9.
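A minimal sketch of such a smoothing factor, assuming the TensorBoard-style exponential moving average s_t = f · s_{t−1} + (1 − f) · x_t with s_0 = x_0 (the exact formula used in the thesis may differ slightly):

```python
# Sketch of the smoothing factor, assuming a TensorBoard-style exponential
# moving average s_t = f * s_{t-1} + (1 - f) * x_t with s_0 = x_0.

def smooth(values, factor):
    """factor = 0 leaves the curve unchanged; values near 1 smooth heavily."""
    smoothed, last = [], values[0]
    for x in values:
        last = factor * last + (1.0 - factor) * x
        smoothed.append(last)
    return smoothed

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(smooth(noisy, 0.0))  # unchanged
print(smooth(noisy, 0.9))  # oscillation heavily damped
```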
The objective of MAML is to learn the parameters θ of a model f_θ so as to achieve fast adaptation to new unseen tasks during meta-testing. Specifically, the idea of MAML is based on updating the gradient-based learning rule in a way that leads to rapid progress on any task drawn from the same task distribution p(T). This means the model parameters are made sensitive to changes across different tasks, such that small changes in the parameter space lead to large gains in performance on any task of p(T). On the other hand, using an almost identical method, ANIL is based on the idea that such a meta-learning procedure leads to learning a representation of the data strong enough that no change in the representation is needed in order to adapt to new test tasks from p(T). A general, model-agnostic formulation of MAML and ANIL
The objective of this experiment is to measure and compare the latent representations of a network trained with MAML. This can be achieved by employing the Canonical Correlation Analysis (CCA) metric [37]. Given the representations of two layers L1, L2 of a neural network, CCA outputs a similarity score which indicates whether L1 and L2 share no similarity at all (value 0) or are identical (value 1). By comparing the representations of the network before and after the inner loop adaptation phase (during meta-testing), this metric illustrates whether the model undergoes significant changes in its representation (rapid learning) or minimal changes (feature reuse). This experiment follows a procedure similar to the one presented in the ANIL paper [62] and tries to answer the first question: "What does MAML actually learn: learning to learn or a high quality feature representation?"
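To illustrate the intuition behind the similarity score, the sketch below computes the one-dimensional special case, where CCA reduces to the absolute Pearson correlation; the full metric used in [62] additionally whitens multi-dimensional activations and averages per-direction correlations, and the activation values here are made up:

```python
# Minimal sketch of the CCA similarity intuition: for one-dimensional
# representations the score reduces to the absolute Pearson correlation.
import math

def cca_1d(a, b):
    """Similarity in [0, 1] between two scalar activations over the same inputs."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - ma) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mb) ** 2 for y in b))
    return abs(cov / (norm_a * norm_b))

before = [0.1, 0.4, 0.35, 0.8]        # a unit's activations before adaptation
after_same = [0.2, 0.8, 0.7, 1.6]     # identical up to scale: similarity ~1 (feature reuse)
after_diff = [0.5, 0.1, 0.9, 0.2]     # decorrelated: much lower (rapid learning)
print(cca_1d(before, after_same), cca_1d(before, after_diff))
```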
The first experiments of this thesis concerned MAML and ANIL on image classification tasks, specifically on the relatively small datasets Omniglot and MiniImageNet. Reproducing the results of the original papers on these datasets provides some confidence that the implementations were correct. The most common way to evaluate meta-learning algorithms on classification tasks is through few-shot learning, where the goal is to approximate a function from just a handful of input-output pairs of a task / label, in order to classify new images that share similar features with previously seen ones. For example, the model is introduced to a dataset of pictures of mammals where each species has only a limited number of picture samples. Due to the constraint of limited data per species, when a new species is found, the model is expected to easily identify it even with just a few pictures (few shots).
L_i(θ) = − Σ_{(x_j, y_j)∼T_i} [ y_j log f_θ(x_j) + (1 − y_j) log(1 − f_θ(x_j)) ]        (3.1)
The complete MAML and ANIL methods for few-shot classification are described analytically in algorithms 3 and 4 respectively.
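A small sketch of the loss of equation 3.1 in its binary form, with made-up predicted probabilities f_θ(x_j) and labels y_j:

```python
# Sketch of the cross-entropy loss of eq 3.1 for a single task T_i, written in
# the binary form shown above; y_j is the label and f_theta(x_j) the model's
# predicted probability for the positive class (values here are made up).
import math

def task_loss(predictions, labels):
    """Negative log-likelihood summed over the task's samples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(predictions, labels))

preds = [0.9, 0.2, 0.8]   # hypothetical f_theta(x_j) outputs
labels = [1, 0, 1]
print(round(task_loss(preds, labels), 4))
```

The loss shrinks toward zero as the predicted probabilities approach the true labels.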
3.2.2 Omniglot
The first vision dataset, Omniglot [45], is one of the most popular for evaluating meta-learning algorithms in few-shot classification. It contains 1623 distinct handwritten characters from 50 different alphabets, with 20 samples each. Each character is an image of 28(H) x 28(W) x 1(grayscale) pixels, with each sample drawn by a different person (as seen in figure 3.2.1). As implemented in [22], 1200 characters were used for training, irrespective of the alphabet; the rest were used for testing, and the dataset was augmented by adding rotated variations of the images in multiples of 90 degrees. The few-shot setting evaluated was 20-way (20 distinct tasks / characters), 1- or 5-shot (1 or 5 samples per task / character). The model used was the same network as described in [22]: a standard CNN with a 1x28x28 input, 4 convolutional layers of 64 units each with no max pooling and 1 input channel, and a final fully connected layer of 64 units with an output size of 20 (one for each class).
3.2.3 MiniImageNet
The MiniImageNet dataset is a subset of the ImageNet dataset, specifically tuned for the few-shot learning setting [86]. It requires fewer computational resources than the original ImageNet, but it still remains a difficult problem to solve due to the large variety of images included. MiniImageNet contains 100 classes, each with 600 samples of 84(H) x 84(W) x 3(RGB) images (as seen in figure 3.2.2). The few-shot setting evaluated was 5-way (5 distinct tasks / classes), 1- or 5-shot (1 or 5 samples per task / class). In the case of MiniImageNet, the standard network described in [65] was used, with an input shape of 3x84x84, 4 convolutional layers of 32 units each with max pooling (max pooling factor = 1) and 3 input channels, and a final fully connected layer of 800 units with 5 outputs.
In the setting of meta-RL, each task T_i is defined as an MDP with a finite horizon H, an initial state distribution q_i(s_1) and a transition distribution q_i(s_{t+1}|s_t, a_t). During the inner loop / adaptation phase, the agent f_{θ_i} is able to sample a limited number of episodes from each task T_i, with the goal of quickly developing a policy π_i for each loss L_i. Note that there is no limitation as to which properties of the MDP need to be shared across tasks, meaning that different tasks can have different reward functions or transition distributions.
As mentioned in 2.3.1, in most problems the reward function and transition distribution are unknown. Additionally, the unknown dynamics of the environment usually make the expected reward, which we want to maximise, not differentiable. In such cases policy gradient methods can be used to approximate the gradients for the inner (adaptation phase) and outer (meta-optimisation) loop. A significant point, which also increases complexity, is that since such methods are on-policy, new rollouts from each individual policy π_i need to be sampled for each additional adaptation update / gradient step. The most common general form of a policy gradient method defines the loss objective for a specific task T_i as the negative expected sum, over a batch of samples, of the log-probability of the policy weighted by the reward:

L_i(π_i) = − E_{s_t,a_t∼π_i} [ Σ_{t=1}^{H} log π_i(a_t|s_t) R_i(s_t, a_t) ]        (3.2)
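The objective of equation 3.2 can be sketched for a single sampled trajectory as follows (the action probabilities and rewards are made-up rollout values):

```python
# Sketch of the policy gradient loss of eq 3.2 for one sampled episode of task
# T_i: the negative sum of log-probabilities of the taken actions, each weighted
# by the reward. log-probs and rewards below are made-up rollout values.
import math

def pg_loss(action_probs, rewards):
    """L_i = - sum_t log pi(a_t|s_t) * R_i(s_t, a_t) for one trajectory."""
    return -sum(math.log(p) * r for p, r in zip(action_probs, rewards))

probs = [0.5, 0.8, 0.9]      # pi(a_t|s_t) of the actions actually taken
rewards = [0.0, 1.0, 2.0]    # R_i(s_t, a_t) at each step
loss = pg_loss(probs, rewards)
print(round(loss, 4))        # lower when high-reward actions were likely
```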
Along with the policy, another model approximating the value function V^π_ϕ(s_t) is updated in parallel, so that it always reflects the most recent policy. Such models that estimate V^π_ϕ(s_t) are also called baselines. A simple linear feature baseline model was used in this case, fit to each task at each iteration to compute the state value function V^π_ϕ(s) by minimising the least-squares distance, as first proposed in [15] and later adopted in [22]:

ϕ = arg min_ϕ E_{s_t,R̂_t∼π} [ (V^π_ϕ(s_t) − R̂_t)² ]        (3.3)
To reduce the variance of the policy gradient estimates of the state values, while keeping the bias at a tolerable level through a bias-variance trade-off parameter, the Generalised Advantage Estimator (GAE) method proposed in [73] was used. The basic concept of the advantage function is to provide further insight into how much better or worse the current policy is relative to the average action. It is defined in equations 3.4-3.8 and is based on the Temporal Difference (TD) error r_t + γV(s_{t+1}) − V(s_t) [82]:

Â_t^{(1)} := δ_t^V = −V(s_t) + r_t + γV(s_{t+1})        (3.4)
Â_t^{(2)} := δ_t^V + γδ_{t+1}^V = −V(s_t) + r_t + γr_{t+1} + γ²V(s_{t+2})        (3.5)

Â_t^{(3)} := δ_t^V + γδ_{t+1}^V + γ²δ_{t+2}^V = −V(s_t) + r_t + γr_{t+1} + γ²r_{t+2} + γ³V(s_{t+3})        (3.6)

leading to a sum consisting of the returns (γ-discounted rewards) and the negative baseline term −V(s_t) for an H-length horizon (eq 3.7), and finally adding a bias-variance trade-off factor τ (eq 3.8):

Â_t^{(H)} = Σ_{l=0}^{H} γ^l δ_{t+l}^V = −V(s_t) + Σ_{l=0}^{H} γ^l r_{t+l}        (3.7)

Â_t^{GAE} = Σ_{l=0}^{H} (γτ)^l δ_{t+l}^V        (3.8)

This advantage estimate then replaces the raw reward in the policy gradient objective:

L_i(π_i) = − E_{s_t,a_t∼π_i} [ Σ_{t=1}^{H} log π_i(a_t|s_t) Â_t^{GAE} ]        (3.9)
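The estimator of equations 3.4-3.8 can be sketched as a backward recursion over the TD errors, using Â_t = δ_t + γτ Â_{t+1}; the rewards, value estimates and hyperparameter values below are illustrative:

```python
# Sketch of the Generalised Advantage Estimator (eqs 3.4-3.8): compute the TD
# errors delta_t, then their (gamma * tau)-discounted sum via a backward
# recursion. All numbers are illustrative, not from the thesis experiments.

GAMMA, TAU = 0.99, 0.95

def gae(rewards, values, last_value=0.0):
    """Return advantage estimates A_t for one finite-horizon trajectory."""
    vals = values + [last_value]
    deltas = [r + GAMMA * vals[t + 1] - vals[t] for t, r in enumerate(rewards)]
    advantages, acc = [], 0.0
    for d in reversed(deltas):        # A_t = delta_t + gamma * tau * A_{t+1}
        acc = d + GAMMA * TAU * acc
        advantages.append(acc)
    return advantages[::-1]

rewards = [1.0, 0.0, 2.0]             # hypothetical per-step rewards
values = [0.5, 0.4, 1.0]              # hypothetical baseline V(s_t) estimates
print([round(a, 3) for a in gae(rewards, values)])
```

Setting τ = 0 recovers the one-step TD error of eq 3.4, while τ = 1 recovers the full discounted return of eq 3.7.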
In contrast with vanilla policy gradient methods, where an update does not change the policy much in terms of parameter values, TRPO tries to update with the largest step possible to improve performance, while following a KL-divergence constraint to control the "distance" between the old and new policy. Even though this approach involves a complex second-order method with multiple tunable hyperparameters, it significantly improves sample efficiency and usually speeds up convergence. The loss objective, or surrogate advantage, L(θ_t, θ) is defined as the measure of performance of policy π_θ relative to the previous policy π_{θ_t} on episodes sampled from the previous policy (here the dependence of the policy π on the parameters θ is denoted explicitly by writing π_θ):

L(θ_t, θ) = E_{s,a∼π_{θ_t}} [ (π_θ(a|s) / π_{θ_t}(a|s)) A^{π_{θ_t}}(s, a) ]        (3.10)
and the KL-divergence constraint between the two policies, on states sampled from the old policy, is defined as:

D_KL(θ ∥ θ_t) = E_{s∼π_{θ_t}} [ D_KL(π_θ(·|s) ∥ π_{θ_t}(·|s)) ]        (3.11)

The TRPO update then maximises the surrogate advantage subject to this constraint:

θ_{t+1} = arg max_θ L(θ_t, θ)   subject to   D_KL(θ ∥ θ_t) ≤ ϵ        (3.12)

where ϵ is a parameter, usually a small value (≈ 0.01), that controls the KL-divergence limit for the TRPO update (more details on the TRPO algorithm can be found in OpenAI's Spinning Up documentation).
Based on the same motivation of improving sample efficiency by cautiously taking large steps towards better policies while avoiding performance collapse, PPO follows a much more straightforward first-order method, while still offering results competitive with TRPO. In this project, the PPO-Clip variant was used, which omits the KL-divergence constraint term and instead motivates the new policy to stay close to the old one through a clipped, advantage-based objective (more details on the PPO algorithm can be found in OpenAI's Spinning Up documentation). The complete loss objective can be defined as:

L(s, a, θ_t, θ) = min( (π_θ(a|s) / π_{θ_t}(a|s)) A^{π_{θ_t}}(s, a),  g(ϵ, A^{π_{θ_t}}(s, a)) )        (3.13)

g(ϵ, A) = (1 + ϵ)A  for A ≥ 0,   (1 − ϵ)A  for A < 0        (3.14)
The clipping function g acts like a regulariser, limiting how much the objective can increase and stopping the new policy from diverging too far. The PPO-Clip update can then be summed up as:

θ_{t+1} = arg max_θ E_{s,a∼π_{θ_t}} [ L(s, a, θ_t, θ) ]        (3.15)
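The clipping behaviour of equations 3.13-3.14 can be sketched for a single (s, a) sample as follows; the clip range ϵ = 0.2 is a common default, not necessarily the value used in the thesis:

```python
# Sketch of the PPO-Clip objective of eqs 3.13-3.15 for a single (s, a) sample:
# the probability ratio is clipped so the advantage-weighted objective cannot
# push the new policy too far from the old one. Numbers are illustrative.

EPS = 0.2  # clip range (a common default; the thesis's exact value may differ)

def ppo_clip_objective(p_new, p_old, advantage):
    ratio = p_new / p_old
    if advantage >= 0:
        g = (1 + EPS) * advantage      # eq 3.14, positive-advantage branch
    else:
        g = (1 - EPS) * advantage      # eq 3.14, negative-advantage branch
    return min(ratio * advantage, g)   # eq 3.13

# A positive-advantage action: the objective saturates once the ratio exceeds 1 + eps.
print(ppo_clip_objective(0.9, 0.5, 2.0))   # ratio 1.8 is clipped: 1.2 * 2.0 = 2.4
print(ppo_clip_objective(0.55, 0.5, 2.0))  # ratio 1.1 within range: 1.1 * 2.0 = 2.2
```

Because the objective flattens outside the clip range, gradient ascent has no incentive to move the ratio far beyond [1 − ϵ, 1 + ϵ].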
3.3.2 Particles2D

The first RL setting the algorithms were tested on was a trivial 2D game-like problem developed by Finn et al. in [22]. Given a random point-goal in a 2D square, the agent should move to these coordinates, without having explicit access to them, by outputting velocities. The observation is simply the current coordinates of the agent in the square, and the actions correspond to velocity values for movement within the range [−0.1, 0.1]. The reward function is the negative squared distance of the agent from the goal point, and the episode ends either when the agent is within 0.01 of the goal coordinates or when the time horizon H = 100 is reached. The initial hyperparameter values were chosen based on the values reported in [62].
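A minimal environment sketch matching the description above; the goal-sampling range and starting position are inferred for illustration, not taken from the reference implementation of [22]:

```python
# Sketch of the Particles2D problem as described above: observations are the
# agent's coordinates, actions are velocities clipped to [-0.1, 0.1], the reward
# is the negative squared distance to a hidden goal, and an episode ends within
# 0.01 of the goal or after H = 100 steps. Goal range and start are assumptions.
import random

class Particles2D:
    H = 100

    def reset(self):
        self.goal = (random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5))
        self.pos = [0.0, 0.0]
        self.t = 0
        return tuple(self.pos)                 # observation: own coordinates only

    def step(self, action):
        vx = min(max(action[0], -0.1), 0.1)    # clip velocities to the valid range
        vy = min(max(action[1], -0.1), 0.1)
        self.pos[0] += vx
        self.pos[1] += vy
        self.t += 1
        dist_sq = (self.pos[0] - self.goal[0]) ** 2 + (self.pos[1] - self.goal[1]) ** 2
        done = dist_sq ** 0.5 < 0.01 or self.t >= self.H
        return tuple(self.pos), -dist_sq, done

env = Particles2D()
obs = env.reset()
obs, reward, done = env.step((0.05, -0.2))     # the -0.2 is clipped to -0.1
print(obs, round(reward, 4), done)
```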
3.3.3 MetaWorld

Specifically, all the tasks require the arm to interact with different objects and shapes, using different joints, in order to combine acts of pushing, grasping and reaching.

Figure 3.3.2: Screenshot of the task "push", in which the robot arm needs to push an object (puck) to specific coordinates (goal).

The observation and action space is shared across tasks, meaning the dimensions remain the same. The observations consist of a 9-dimensional 3-tuple of the 3D Cartesian positions of the end-effector, the object (or object #1) and the goal (or object #2). The actions consist of 3D end-effector positions of the robot arm. Depending on the task, a specific reward function has been engineered. The initial hyperparameter values were chosen based on the values reported in [90]. MetaWorld provides three scenarios of increasing difficulty:
ML1: As a more computationally feasible baseline, the first setting consists of one specific task (in this thesis the task "push" was chosen) with 50 random initial object and goal positions and another set of 10 held-out positions, resembling the previous setting of Particles2D.

ML10: To properly evaluate the generalisation capabilities to new tasks, the algorithm is meta-trained on 10 specific tasks and then evaluated on another held-out set of 5 further tasks with similar structure. For each task, the positions of the object and the goal are randomised.

ML45: Similar to ML10, but with 45 training tasks instead. Due to its demanding complexity and the respective computational cost required, this setting was deemed outside the scope of the thesis and is left for future work.
Success rate metric: For MetaWorld, another metric is introduced to monitor the performance of the agents. Because reward values do not always provide a clear picture of the effectiveness of a policy, Yu et al. propose measuring an agent's performance on MetaWorld by its success rate on a task. This metric is defined based on the distance of the (task-dependent) object position o from the final goal position g, e.g. ||o − g||_2 < ϵ, where ϵ is a small distance threshold that must be satisfied for a consecutive number of steps (for more details on each task's success metric, see the complete list in [90]).
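The metric can be sketched as follows; the threshold ϵ and the number of consecutive steps here are illustrative placeholders, not MetaWorld's engineered values:

```python
# Sketch of the success-rate criterion described above: a trial counts as a
# success if the object stays within a small distance epsilon of the goal for a
# number of consecutive steps. Threshold values are illustrative placeholders.

def is_success(object_positions, goal, eps=0.05, hold_steps=3):
    """object_positions: per-step 3D object coordinates of one evaluation episode."""
    streak = 0
    for o in object_positions:
        dist = sum((oc - gc) ** 2 for oc, gc in zip(o, goal)) ** 0.5
        streak = streak + 1 if dist < eps else 0
        if streak >= hold_steps:
            return True
    return False

goal = (0.1, 0.8, 0.2)
near = [(0.1, 0.8, 0.21)] * 4                       # stays by the goal: success
passing = [(0.1, 0.8, 0.21), (0.5, 0.5, 0.5)] * 3   # only brushes past: failure
print(is_success(near, goal), is_success(passing, goal))  # True False
```

The success rate over an evaluation run is then the fraction of episodes for which this check holds.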
Chapter 4
Results
Firstly, in the case of Omniglot, both methods showed similar, stable performance during meta-training and converged quickly to their final performance, as seen from their validation accuracy in figures 4.1.1. In the case of MiniImageNet, the methods again managed to converge, in fewer than 5.000 epochs, but not as fast or as stably as on Omniglot, as seen from the more volatile validation accuracy in both figures of 4.1.2. This is speculated to be attributable to the larger variance across coloured RGB images in the MiniImageNet dataset compared to the more uniform grayscale images in Omniglot; further tests are needed to support this speculation.
Figure 4.1.1: Validation accuracy for MAML and ANIL on the Omniglot dataset for 20-way, 5-shot (4.1.1a) and 20-way, 1-shot (4.1.1b) classification. Each line is the mean value and the shaded area is the standard deviation across three seeds. Smoothing factor: 0.8.
Figure 4.1.2: Validation accuracy for MAML and ANIL on the MiniImageNet dataset for 5-way, 5-shot (4.1.2a) and 5-way, 1-shot (4.1.2b) classification. Each line is the mean value and the shaded area is the standard deviation across three seeds. Smoothing factor: 0.8.
During meta-testing, MAML showed slightly better performance than ANIL in most cases, and the results were similar to the findings of [62], as seen in table 4.1.1. However, further statistical tests are required to provide concrete evidence for the observations noted above: comparing mean test accuracies across only three seeds provides too small a sample size for meaningful statistical hypothesis tests.
Table 4.1.1: Metatesting results of MAML and ANIL on fewshot image classification.
The reported results for each model are the mean and standard deviation across three
different seeds. The bottom results are shown as reported in the paper [62].
One important factor to consider when comparing these algorithms is the difference in computational resources required to train each one. Training MAML on MiniImageNet for 5-way, 1-shot classification took approximately 1 hour 50 minutes (on a CUDA-enabled RTX 2060 S GPU), whereas ANIL took only 55 minutes for the same setting, half the training time. This means that for half the computational cost, a similar, or only slightly worse, performance could be achieved with ANIL.
Figure 4.1.3: Representation change before and after adaptation of MAML in a 5-way, 5-shot setting on MiniImageNet. Results reported are the mean and standard deviation over 5 different test tasks with 3 different seeds. 4.1.3a from [62], 4.1.3b reproduced results.
Since meta-RL is difficult to train and complex to debug, the experiments progressed gradually from a trivial toy problem, to an easy baseline, and finally to the most challenging setting. Additionally, a long series of experiments needed to be run at each stage before results that could provide any insight were obtained. Such results are reported in Appendix C instead of this section, to prevent an overload of information and the disruption of the flow of the report.
4.2.1 Particles2D
In the MetaWorld environment, there are three single-task benchmarks that are used as baselines: "push", "pick-and-place" and "reach" (as seen in the original paper [90]). For this project, the setting "push" was used as a baseline before moving to the more demanding benchmark of ML10. Firstly, in figure 4.2.2, a quick comparison between the PPO and TRPO versions of MAML and ANIL indicates that, again, the TRPO variant provides better and more consistent performance.

The models with the best performance during the search were then trained for 3.000 epochs, as shown in figure 4.2.3a. Both models achieved a 100% success rate on 10 unseen tasks (new goal positions) during meta-testing. Another interesting note is that the results of MAML-TRPO in this case show better performance than the MAML-PPO evaluated in the original MetaWorld paper (as seen in figure 4.2.3b).
(b) Figure from [90] illustrating the performance of MAML on ML1: Push.
Figure 4.2.3: MAML performance on ML1: Push compared to ANIL (a) and other
methods (b) from [90].
4.2.3 ML10

Finally, the results for the most challenging problem of MetaWorld, ML10, are presented. In figure 4.2.4a, a comparison of agents trained with different policies is shown. In addition to the meta-learning methods, the vanilla TRPO and PPO methods were also trained, to showcase the limitations of non-meta-learning methods in an environment where a single policy needs to perform well in multiple tasks. Following the trend from the previous datasets and environments, the MAML-TRPO and ANIL-TRPO methods outperform the rest during meta-training, as seen in figure 4.2.4a. After this comparison, they were trained even longer, for 3.000 epochs (figure 4.2.4b), each taking approximately 13 days to complete. Contrary to the few-shot classification setting, in which ANIL showed high efficiency in computational cost, in the case of MetaWorld the computational cost difference is negligible. This is probably because most of the computational resources are spent on sampling interactions of the agent with the environment, and backpropagating through the whole network versus just the final layer does not amount to a considerable difference.
that the task is almost completed, but the agent did not cross the specific threshold needed for the task to be classified as a "success". This is due to the way the reward functions and the success metrics are hand-engineered in MetaWorld. In the case of sweep, for example, the reward function is:

−||h − o||_2 + I_{||h−o||_2 < 0.05} · 1000 · e^{−||h−g||_2² / 0.01}        (4.1)

where o is the object position, h is the robot arm (gripper) position and g is the goal position. A complete list of the reward functions for each task and their success metrics can be found in Appendix C.10. In the following figures, both the reward values and the success rates are illustrated in order to avoid misinterpretation of the results.
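Equation 4.1, as reconstructed above (the negative sign in the exponent is inferred so that the bonus grows as the gripper approaches the goal), can be sketched as follows, with made-up positions:

```python
# Sketch of the sweep reward of eq 4.1 as reconstructed above; the negative
# exponent is an inference, and the positions below are made up for illustration.
# h = gripper position, o = object position, g = goal position.
import math

def sweep_reward(h, o, g):
    dist_ho = math.dist(h, o)         # reach term: get close to the object
    bonus = 0.0
    if dist_ho < 0.05:                # indicator: gripper effectively at the object
        bonus = 1000.0 * math.exp(-math.dist(h, g) ** 2 / 0.01)
    return -dist_ho + bonus

g = (0.0, 0.6, 0.02)
far = sweep_reward((0.0, 0.2, 0.3), (0.0, 0.6, 0.02), g)    # away from object: small negative
near = sweep_reward((0.0, 0.6, 0.03), (0.0, 0.6, 0.02), g)  # at object and goal: large bonus
print(round(far, 3), round(near, 3))
```

This shape makes the discrepancy discussed above plausible: the dense distance term accumulates reward even when the success threshold is never crossed.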
This discrepancy between high rewards and a low success rate could be attributed to two factors. Firstly, as mentioned before, the success metrics are hand-engineered with arbitrary threshold values, so the agent could be almost completing the task, with the arm / object moving around the goal but not within the threshold (thus gaining high rewards but not high success rates). Secondly, it could be a sign of MAML-TRPO overfitting to the train tasks and thus failing to perform as well on the test tasks.
Another issue with these results is the remarkably high variance of the rewards and success rates. This stems from the diversity of the tasks, since each task has its own reward function. In figures 4.2.6, for example, MAML-TRPO performs better than the other methods on the train task basketball but struggles to outperform them on the test task drawer-open. Out of all the tasks, MAML-TRPO
Figure 4.2.5: Comparison of PPO, MAML-TRPO and ANIL-TRPO on ML10 train (a) and test (b) tasks. Reported results are the mean and standard deviation over all tasks, with 5 different seeds of 3 trials of 10 episodes per task.
53
CHAPTER 4. RESULTS
manages to show signs of better performance than the other methods on 9 out of the 10 train tasks but on just 2 out of the 5 test tasks, further indicating signs of overfitting (worse performance during testing than during training). It is important to note again that the reported findings come from a small sample size of experiments, preventing us from drawing statistically significant conclusions. Detailed results of the performance of the methods in a per-task format can be found in Appendix C.8.
Figure 4.2.6: panels (a) and (b).
A noteworthy observation can be made when rendering the policies in the environment to see how they perform in action. Even in cases in which the MAML-TRPO agent fails, it seems that the movements of the robot arm are smoother and less hectic compared to the other methods. These rendered animations can be found in the code repository at GitHub (link).
Lastly, in figure 4.2.7, the representation change (in percent, based on the CCA metric) is presented for the network before and after the inner-loop adaptation on the meta-testing tasks of ML10. These results show that minimal changes take place in the representation space, indicating that the performance of MAML during meta-testing relies on feature reuse.
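As a sketch of how such a representation-change metric can be computed, the mean canonical correlation between two activation matrices can be obtained with a few lines of linear algebra. This is a simplified stand-in for the exact CCA-based metric used in the experiments:

```python
import numpy as np

def mean_cca_similarity(X, Y):
    """Mean canonical correlation between two activation matrices
    (rows = examples, columns = neurons), via QR + SVD. A value near 1
    means the representations are nearly identical up to a linear
    transform, i.e. minimal representation change."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return rho.mean()

rng = np.random.default_rng(0)
pre = rng.normal(size=(500, 32))              # activations before adaptation
post_same = pre @ rng.normal(size=(32, 32))   # linear transform of pre
post_rand = rng.normal(size=(500, 32))        # unrelated activations

print(mean_cca_similarity(pre, post_same))    # close to 1
print(mean_cca_similarity(pre, post_rand))    # noticeably lower
```

Since CCA is invariant to invertible linear maps, activations that only undergo a linear re-mixing during adaptation still score near 1, which is exactly the feature-reuse signature discussed above.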
Chapter 5
Discussion
Regarding ANIL: The results of ANIL are quite intriguing. Even though its performance in relatively straightforward settings, such as few-shot classification, is comparable to MAML, it is interesting to see that it does not manage to achieve the same level of success in a more challenging environment (as questioned in ANIL's review1). This indicates that adapting to complex tasks requires more than an update of the head of the network based on a general meta-initialisation of its weights.
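The difference between the two inner loops can be sketched as follows; the parameter names, shapes and values are illustrative, not those of the actual implementation:

```python
import numpy as np

def inner_adapt(params, grads, lr=0.1, anil=False):
    """One inner-loop adaptation step on a dict of parameter arrays.
    MAML updates every parameter; ANIL updates only the final layer
    ('head'), reusing the meta-trained body features unchanged."""
    adapted = {}
    for name, p in params.items():
        if anil and not name.startswith("head"):
            adapted[name] = p            # body stays frozen under ANIL
        else:
            adapted[name] = p - lr * grads[name]
    return adapted

params = {"body.w": np.ones((8, 32)), "head.w": np.ones((32, 4))}
grads  = {"body.w": np.full((8, 32), 0.5), "head.w": np.full((32, 4), 0.5)}

maml_params = inner_adapt(params, grads, anil=False)
anil_params = inner_adapt(params, grads, anil=True)
# Under ANIL the body is untouched; under MAML both parts move.
```

The observation above suggests that on complex tasks the frozen body of ANIL is not always a sufficient basis for adaptation.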
5.2 Limitations
Training neural networks can be a computationally expensive process that can last weeks or months. Taking into consideration the scope of this degree project and the limited resources at hand, only the models deemed most relevant and most essential to the research questions were developed. Even so, more than 200 models were trained in this project, across 2 vision datasets and 3 RL problems, in a time span of 4 months, with more than 11.000 hours spent on training models. This is because, in order to validate the integrity of the results, a hyperparameter search was needed for each problem case and every model was trained with 3 different seeds.
Even so, the performed hyperparameter searches were relatively limited: the search covered 2 or 3 parameters in each case, when in some settings (e.g. ANIL-TRPO on ML10) there were more than 15 hyperparameters that could be tuned. Moreover, it is reasonable to assume that 3 seeds are not enough to provide sufficient statistical confidence for results of such complex algorithms with a high number of hyperparameters [12]. Thus, in order to further verify the findings of this thesis, a larger-scale statistical analysis of the hyperparameter space is suggested, especially for the meta-RL tasks, which are known to be sources of high instability and irreproducibility due to their numerous tunable hyperparameters and increased computational cost.

1 Comment from a reviewer of ICLR 2020 on ANIL, criticising the limitations of meta-learning benchmarks: link
2. Meta-train on ML45: The next obvious step after MAML's success on the ML10 benchmark of Meta-World would be to evaluate its performance on the final and most challenging ML45 benchmark.
Given that this thesis is focused mainly on academic research with no explicit application, one could at first think that the ethical aspect of such a project is minuscule. Besides, the field of meta-learning is still quite young, and real-life applications, even though potentially vast, are currently almost nonexistent. If one were to consider that the ethical challenges of scientific contributions depend on their practical applications, and that scientific methods are just tools which can be used either benevolently or malevolently, they could completely rid themselves of any responsibility for this project. Unfortunately, such beliefs are not uncommon, and one of the most crucial components of scientific contributions is often, intentionally or not, easily dismissed: the awareness of the social responsibility of research & development.
Brunner and Ascher argue that ”science in the aggregate has not lived up to its promise to work for the benefit of society as a whole. [... For that reason it] is appropriate to hold science responsible for the public expectations that science creates and depends upon it” [10]. Opponents of socially responsible science would argue that the development of science is predetermined, a force that cannot be influenced by the choices of its community or by individual scientists. In such a setting, scientists are merely the instruments of a greater body that can move in a single direction, and they can only control its rate of acceleration based on their skills and work.
However, history suggests otherwise, with the General Data Protection Regulation (GDPR)2 being a recent example of how governments, along with the scientific community, came together to defend society's interest against privacy-breaching technology. If we were to accept this example as part of an inevitable progress, would we also accept as ”inevitable” and ”expected” all the other times humans have used science in evil and horrific ways? Where would we draw the line between inevitable actions and responsible decisions?
Thus, it is our belief that it would be not only naive but also dangerous to assume that science is a disconnected, nonpartisan and self-contained body, separate from the rest of society. Such an equation would hand a get-out-of-jail-free card to whatever can be identified as science and unburden it of any accountability for individual decisions or actions. It becomes apparent that the impact science has on society is rarely ever neutral [66].
To return to the context of this particular thesis, the responsibility lies in honest, reproducible and accountable research. It is a commonly accepted belief that science has been facing a reproducibility crisis over the last decade, with more than 70% of researchers failing to reproduce another scientist's experiments [4]. This is an even more apparent problem in the machine learning community, where research code is rarely open-sourced due to private agreements, profit or unaccountability [40]. To avoid contributing to the problem, the approach of this thesis was based on thorough background research, open-source code (wherever possible) and honest result reporting, clearly stating any assumptions made and limiting non-confident conclusions. The code used for this project, along with the trained models and a technical guide to reproduce the results, is open-sourced and published online at GitHub3.
5.5 Sustainability
As with most DL problems, training such models requires considerable amounts of computational power. This means personal computers, remote workstations or even large data centres operating at high capacity for days, weeks or even months. These data centres can quickly amount to significant electrical power consumption (1% of
the global electricity use), which further increases the need for more electrical power generation [52]. It has been shown that such an increasing demand for power generation has direct negative implications for the environment, since most energy sources are still not environmentally friendly [36].

2 Link to the history of GDPR here.
3 Link to the code repository: [Link]
For the experiments of this project, a personal computer and a remote workstation in a data centre (ICE North of RISE SICS) were used over a span of 6 months. For some experiments, the remote workstation operated constantly for three months. An approximate estimate of the power consumption footprint of this thesis project can be made. In equation 5.1, E_kWh denotes the energy consumption in kilowatt-hours, P_W the power consumption of the device in watts, and t_hr the time the device spent operating, in hours4.

E_kWh = (P_W × t_hr) / 1000    (5.1)
The personal computer can be assumed to have been running 8 hours a day for 6 months at 500 watts. This amounts to 500(W) × 8(hr) × 180(days) / 1000 = 720 kWh. The remote workstation can be assumed to have been running 24/7 for 3 months at 2500 watts. This amounts to 2500(W) × 24(hr) × 90(days) / 1000 = 5400 kWh. Thus the total can be estimated at around 6120 kWh in 6 months, or 3060 kWh per year. To put this number into perspective, according to [Link]5, the average energy consumption per capita in Sweden is 13.000 kWh per year. This means that the electrical energy consumption footprint of this degree project was approximately the same as 23% of the consumption of an average Swedish citizen over a year.
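The back-of-the-envelope estimate above can be reproduced directly from equation 5.1:

```python
def energy_kwh(power_w, hours):
    """Equation 5.1: energy in kWh from power in watts and operating hours."""
    return power_w * hours / 1000

pc_kwh = energy_kwh(500, 8 * 180)     # personal computer: 8 h/day for 180 days
ws_kwh = energy_kwh(2500, 24 * 90)    # workstation: 24/7 for 90 days
total_kwh = pc_kwh + ws_kwh           # 720.0 + 5400.0 = 6120.0 kWh
```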
On the other hand, the development of meta-learning methods could lead to a significant decrease in the power consumption needed to train ML models compared to traditional methods. This is because meta-trained models do not require retraining from scratch in light of new tasks, and can thus replace the many models needed (one per task) with fewer models, each able to handle multiple tasks. By developing data-efficient models with high generalisation capabilities, the computational cost of training ML models could be reduced.
4 Equations to calculate electric energy: [Link]
5 Sweden's energy consumption data: [Link]
Chapter 6
Conclusions
The objective of this degree project was to supplement the latest research on MAML and ANIL with novel observations and insights regarding their performance on the recently published meta-reinforcement learning benchmark Meta-World. To summarise, this thesis tried to answer the following questions:
Question 1 What does MAML actually learn: is it a high-quality feature representation of the training data? Or does it learn to rapidly adapt to new tasks from the same distribution as the training tasks?
Answer: Before and after adaptation of MAML on ML10, there seem to be minimal changes in the representation space of the neural network, but also in terms of performance (C.9). This indicates that MAML is mostly reusing features learnt during meta-training rather than acquiring new knowledge during meta-testing.
Question 2 How well does MAML adapt to new tasks compared to standard RL approaches?
Answer: Our results show a trend of better performance from MAML (higher rewards & success rates) than PPO and ANIL on the train tasks of the ML10 benchmark. During meta-testing, however, this trend is not as clear, making it difficult to come to any conclusions.
Question 3 Does ANIL perform similarly to MAML, even in more complex environments?
Question 4 Is there a significant computational cost difference between MAML and ANIL?
Answer: The difference in computational cost between MAML and ANIL is indeed significant on the vision datasets (ANIL required almost 50% less computational resources during training) but not on the RL benchmarks. This is probably due to the high cost of sampling and interacting with the environment, which is much more substantial than the difference between backpropagating through the whole network vs. one layer of the network.
Bibliography
[2] Antoniou, Antreas, Edwards, Harrison, and Storkey, Amos. “How to train your
MAML”. In: arXiv:1810.09502 [cs, stat] (Mar. 2019). arXiv: 1810.09502.
[3] Arnold, Sébastien M. R., Mahajan, Praateek, Datta, Debajyoti, Bunner, Ian,
and Zarkias, Konstantinos Saitas. “learn2learn: A Library for MetaLearning
Research”. In: arXiv:2008.12284 [cs, stat] (Aug. 2020). arXiv: 2008.12284.
[4] Baker, Monya. “1,500 scientists lift the lid on reproducibility”. en. In: Nature
News 533.7604 (May 2016). Section: News Feature, p. 452. DOI: 10 . 1038 /
533452a.
[5] Baydin, Atilim Gunes, Cornish, Robert, Rubio, David Martinez, Schmidt,
Mark, and Wood, Frank. “Online learning rate adaptation with hypergradient
descent”. In: arXiv preprint arXiv:1703.04782 (2017).
[6] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. “The Arcade Learning
Environment: An Evaluation Platform for General Agents”. en. In: Journal of
Artificial Intelligence Research 47 (June 2013), pp. 253–279. ISSN: 10769757.
DOI: 10.1613/jair.3912.
[7] Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. “Neural optimizer
search with reinforcement learning”. In: arXiv preprint arXiv:1709.07417
(2017).
[8] Biggs, J. B. “The Role of Metalearning in Study Processes”. en. In: British Journal of Educational Psychology 55.3 (1985), pp. 185–212. ISSN: 2044-8279. DOI: 10.1111/j.2044-8279.1985.tb02625.x.
[9] Botvinick, Matthew, Ritter, Sam, Wang, Jane X., KurthNelson, Zeb, Blundell,
Charles, and Hassabis, Demis. “Reinforcement Learning, Fast and Slow”. en. In:
Trends in Cognitive Sciences 23.5 (May 2019), pp. 408–422. ISSN: 13646613.
DOI: 10.1016/[Link].2019.02.006.
[10] Brunner, Ronald D and Ascher, William. “Science and social responsibility”. en.
In: (), p. 37.
[11] Cobbe, Karl, Hesse, Christopher, Hilton, Jacob, and Schulman, John.
“Leveraging Procedural Generation to Benchmark Reinforcement Learning”.
In: arXiv:1912.01588 [cs, stat] (Dec. 2019). arXiv: 1912.01588.
[12] Colas, Cédric, Sigaud, Olivier, and Oudeyer, Pierre-Yves. “How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments”. In: arXiv:1806.08295 [cs, stat] (July 2018). arXiv: 1806.08295.
[13] Deisenroth, Marc Peter, Rasmussen, Carl Edward, and Fox, Dieter. “Learning
to control a lowcost manipulator using dataefficient reinforcement learning”.
In: Robotics: Science and Systems VII (2011), pp. 57–64.
[14] Doya, Kenji. “Metalearning and neuromodulation”. en. In: Neural Networks
15.4 (June 2002), pp. 495–506. ISSN: 08936080. DOI: 10 . 1016 / S0893 -
6080(02)00044-8.
[15] Duan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter.
“Benchmarking Deep Reinforcement Learning for Continuous Control”. en. In:
arXiv:1604.06778 [cs] (May 2016). arXiv: 1604.06778.
[16] Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and
Abbeel, Pieter. “RL$^2$: Fast Reinforcement Learning via Slow Reinforcement
Learning”. In: arXiv:1611.02779 [cs, stat] (Nov. 2016). arXiv: 1611.02779.
[18] Espeholt, Lasse, Soyer, Hubert, Munos, Remi, Simonyan, Karen, Mnih,
Volodymir, Ward, Tom, Doron, Yotam, Firoiu, Vlad, Harley, Tim, Dunning, Iain,
Legg, Shane, and Kavukcuoglu, Koray. “IMPALA: Scalable Distributed DeepRL
with Importance Weighted ActorLearner Architectures”. In: arXiv:1802.01561
[cs] (June 2018). arXiv: 1802.01561.
[19] Esteva, Andre, Kuprel, Brett, Novoa, Roberto A., Ko, Justin, Swetter, Susan
M., Blau, Helen M., and Thrun, Sebastian. “Dermatologistlevel classification
of skin cancer with deep neural networks”. en. In: Nature 542.7639 (Feb. 2017).
Number: 7639 Publisher: Nature Publishing Group, pp. 115–118. ISSN: 1476
4687. DOI: 10.1038/nature21056.
[20] Eysenbach, Benjamin, Gupta, Abhishek, Ibarz, Julian, and Levine, Sergey.
“Diversity is All You Need: Learning Skills without a Reward Function”. In:
arXiv:1802.06070 [cs] (Oct. 2018). arXiv: 1802.06070.
[21] Fakoor, Rasool, Chaudhari, Pratik, Soatto, Stefano, and Smola, Alexander
J. “MetaQLearning”. In: arXiv:1910.00125 [cs, stat] (Apr. 2020). arXiv:
1910.00125.
[22] Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. “ModelAgnostic Meta
Learning for Fast Adaptation of Deep Networks”. en. In: arXiv:1703.03400 [cs]
(July 2017). arXiv: 1703.03400.
[24] Finn, Chelsea, Rajeswaran, Aravind, Kakade, Sham, and Levine, Sergey.
“Online MetaLearning”. en. In: arXiv:1902.08438 [cs, stat] (July 2019). arXiv:
1902.08438.
[25] Finn, Chelsea, Xu, Kelvin, and Levine, Sergey. “Probabilistic ModelAgnostic
MetaLearning”. In: Advances in Neural Information Processing Systems 31.
Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and
R. Garnett. Curran Associates, Inc., 2018, pp. 9516–9527.
[27] “Forecasting shortterm data center network traffic load with convolutional
neural networks”. en. In: PLOS ONE 13.2 (Feb. 2018). Publisher: Public Library
of Science, e0191939. ISSN: 19326203. DOI: 10.1371/[Link].0191939.
[28] Franceschi, Luca, Frasconi, Paolo, Salzo, Saverio, Grazzi, Riccardo, and
Pontil, Massimilano. “Bilevel Programming for Hyperparameter Optimization
and MetaLearning”. In: arXiv:1806.04910 [cs, stat] (July 2018). arXiv:
1806.04910.
[29] Fujimoto, Scott, Hoof, Herke van, and Meger, David. “Addressing Function
Approximation Error in ActorCritic Methods”. en. In: arXiv:1802.09477 [cs,
stat] (Oct. 2018). arXiv: 1802.09477.
[30] Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. The
MIT Press, 2016. ISBN: 9780262035613.
[31] Goodfellow, Ian J., Mirza, Mehdi, Xiao, Da, Courville, Aaron, and Bengio,
Yoshua. “An Empirical Investigation of Catastrophic Forgetting in Gradient
Based Neural Networks”. en. In: arXiv:1312.6211 [cs, stat] (Mar. 2015). arXiv:
1312.6211.
[32] Grant, Erin, Finn, Chelsea, Levine, Sergey, Darrell, Trevor, and Griffiths,
Thomas. “Recasting gradientbased metalearning as hierarchical bayes”. In:
arXiv preprint arXiv:1801.08930 (2018).
[33] Gupta, Abhishek, Mendonca, Russell, Liu, YuXuan, Abbeel, Pieter, and Levine,
Sergey. “MetaReinforcement Learning of Structured Exploration Strategies”.
en. In: (), p. 10.
[34] Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey. “Soft
ActorCritic: OffPolicy Maximum Entropy Deep Reinforcement Learning with
a Stochastic Actor”. en. In: arXiv:1801.01290 [cs, stat] (Aug. 2018). arXiv:
1801.01290.
[35] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, and Norouzi, Mohammad. “Dream to Control: Learning Behaviors by Latent Imagination”. In: arXiv:1912.01603 [cs] (Mar. 2020). arXiv: 1912.01603.
[37] Hardoon, David R., Szedmak, Sandor, and ShaweTaylor, John. “Canonical
correlation analysis: an overview with application to learning methods”. eng.
In: Neural Computation 16.12 (Dec. 2004), pp. 2639–2664. ISSN: 08997667.
DOI: 10.1162/0899766042321814.
[38] Hospedales, Timothy, Antoniou, Antreas, Micaelli, Paul, and Storkey, Amos.
“MetaLearning in Neural Networks: A Survey”. en. In: arXiv:2004.05439 [cs,
stat] (Apr. 2020). arXiv: 2004.05439.
[39] Houthooft, Rein, Chen, Yuhua, Isola, Phillip, Stadie, Bradly, Wolski, Filip, Ho,
OpenAI Jonathan, and Abbeel, Pieter. “Evolved policy gradients”. In: 2018,
pp. 5400–5409.
[41] Hutter, Frank, Kotthoff, Lars, and Vanschoren, Joaquin. Automated Machine
Learning : Methods, Systems, Challenges. English. Accepted: 20200318
[Link] Journal Abbreviation: Methods, Systems, Challenges. Springer Nature,
2019. DOI: 10.1007/978-3-030-05318-5. URL: [Link]
handle/20.500.12657/23012 (visited on 06/24/2020).
[42] Janner, Michael, Fu, Justin, Zhang, Marvin, and Levine, Sergey. “When to
Trust Your Model: ModelBased Policy Optimization”. In: arXiv:1906.08253
[cs, stat] (Nov. 2019). arXiv: 1906.08253. (Visited on 07/03/2020).
[44] Kingma, Diederik P. and Ba, Jimmy. “Adam: A Method for Stochastic
Optimization”. en. In: arXiv:1412.6980 [cs] (Jan. 2017). arXiv: 1412.6980.
[45] Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. “Human
level concept learning through probabilistic program induction”. en. In: Science
350.6266 (Dec. 2015). Publisher: American Association for the Advancement
of Science Section: Research Article, pp. 1332–1338. ISSN: 00368075, 1095
9203. DOI: 10.1126/science.aab3050.
[47] Lee, Yoonho and Choi, Seungjin. “GradientBased MetaLearning with Learned
Layerwise Metric and Subspace”. In: arXiv:1801.05558 [cs, stat] (June 2018).
arXiv: 1801.05558.
[48] Li, Zhenguo, Zhou, Fengwei, Chen, Fei, and Li, Hang. “MetaSGD: Learning
to Learn Quickly for FewShot Learning”. en. In: arXiv:1707.09835 [cs] (Sept.
2017). arXiv: 1707.09835. (Visited on 05/26/2020).
[49] Lin, Henry W., Tegmark, Max, and Rolnick, David. “Why does deep and cheap learning work so well?” en. In: Journal of Statistical Physics 168.6 (Sept. 2017). arXiv: 1608.08225, pp. 1223–1247. ISSN: 0022-4715, 1572-9613. DOI: 10.1007/s10955-017-1836-5.
[52] Masanet, Eric, Shehabi, Arman, Lei, Nuoa, Smith, Sarah, and Koomey,
Jonathan. “Recalibrating global data center energyuse estimates”. en. In:
Science 367.6481 (Feb. 2020). Publisher: American Association for the
Advancement of Science Section: Policy Forum, pp. 984–986. ISSN: 0036
8075, 10959203. DOI: 10.1126/science.aba3758.
[54] Mishra, Nikhil, Rohaninejad, Mostafa, Chen, Xi, and Abbeel, Pieter. “A simple
neural attentive metalearner”. In: arXiv preprint arXiv:1707.03141 (2017).
[55] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness,
Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas
K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou,
Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane,
and Hassabis, Demis. “Humanlevel control through deep reinforcement
learning”. en. In: Nature 518.7540 (Feb. 2015). Number: 7540 Publisher:
Nature Publishing Group, pp. 529–533. ISSN: 14764687. DOI: 10 . 1038 /
nature14236.
[58] OpenAI, Akkaya, Ilge, Andrychowicz, Marcin, Chociej, Maciek, Litwin, Mateusz,
McGrew, Bob, Petron, Arthur, Paino, Alex, Plappert, Matthias, Powell, Glenn,
Ribas, Raphael, Schneider, Jonas, Tezak, Nikolas, Tworek, Jerry, Welinder,
Peter, Weng, Lilian, Yuan, Qiming, Zaremba, Wojciech, and Zhang, Lei. “Solving
Rubik’s Cube with a Robot Hand”. In: arXiv:1910.07113 [cs, stat] (Oct. 2019).
arXiv: 1910.07113.
[59] Pan, Xinlei, You, Yurong, Wang, Ziyan, and Lu, Cewu. “Virtual to Real
Reinforcement Learning for Autonomous Driving”. In: arXiv:1704.03952 [cs]
(Sept. 2017). arXiv: 1704.03952.
[60] Perkins, David and Salomon, Gavriel. “Transfer Of Learning”. In: 11 (July 1999).
[61] Rabinowitz, Neil C. “Metalearners’ learning dynamics are unlike learners’”. In:
arXiv:1905.01320 [cs, stat] (May 2019). arXiv: 1905.01320. URL: http : / /
[Link]/abs/1905.01320 (visited on 05/28/2020).
[62] Raghu, Aniruddh, Raghu, Maithra, Bengio, Samy, and Vinyals, Oriol. “Rapid
Learning or Feature Reuse? Towards Understanding the Effectiveness of
MAML”. In: arXiv:1909.09157 [cs, stat] (Feb. 2020). arXiv: 1909.09157
version: 2.
[63] Rakelly, Kate, Zhou, Aurick, Quillen, Deirdre, Finn, Chelsea, and Levine, Sergey.
“Efficient OffPolicy MetaReinforcement Learning via Probabilistic Context
Variables”. In: arXiv:1903.08254 [cs, stat] (Mar. 2019). arXiv: 1903.08254.
[64] Rakhlin, Alexander, Shvets, Alexey, Iglovikov, Vladimir, and Kalinin, Alexandr
A. “Deep Convolutional Neural Networks for Breast Cancer Histology Image
Analysis”. In: Image Analysis and Recognition. Ed. by Aurélio Campilho,
Fakhri Karray, and Bart ter Haar Romeny. Cham: Springer International
Publishing, 2018, pp. 737–744. ISBN: 9783319930008.
[65] Ravi, Sachin and Larochelle, Hugo. “Optimization as a Model for FewShot
Learning”. In: (Nov. 2016).
[66] Rose, Steven and Rose, Hilary. “Can Science Be Neutral?” en. In: Perspectives in Biology and Medicine 16.4 (1973), pp. 605–624. ISSN: 1529-8795. DOI: 10.1353/pbm.1973.0035. (Visited on 06/11/2020).
[68] Rumelhart, David E., Durbin, Richard, Golden, Richard, and Chauvin,
Yves. “Backpropagation: The basic theory”. In: Backpropagation: Theory,
architectures and applications (1995), pp. 1–34.
[69] Rusu, Andrei A., Rao, Dushyant, Sygnowski, Jakub, Vinyals, Oriol, Pascanu,
Razvan, Osindero, Simon, and Hadsell, Raia. “MetaLearning with Latent
Embedding Optimization”. In: arXiv:1807.05960 [cs, stat] (Mar. 2019). arXiv:
1807.05960.
[72] Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and
Abbeel, Pieter. “Trust Region Policy Optimization”. en. In: arXiv:1502.05477
[cs] (Apr. 2017). arXiv: 1502.05477.
[73] Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel,
Pieter. “HighDimensional Continuous Control Using Generalized Advantage
Estimation”. In: arXiv:1506.02438 [cs] (Oct. 2018). arXiv: 1506.02438.
[74] Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov,
Oleg. “Proximal Policy Optimization Algorithms”. In: arXiv:1707.06347 [cs]
(Aug. 2017). arXiv: 1707.06347.
[75] AlShedivat, Maruan, Bansal, Trapit, Burda, Yuri, Sutskever, Ilya, Mordatch,
Igor, and Abbeel, Pieter. “Continuous Adaptation via MetaLearning in
Nonstationary and Competitive Environments”. In: arXiv:1710.03641 [cs] (Feb.
2018). arXiv: 1710.03641.
[76] Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai,
Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan,
Graepel, Thore, Lillicrap, Timothy, Simonyan, Karen, and Hassabis, Demis. “A
general reinforcement learning algorithm that masters chess, shogi, and Go
through selfplay”. en. In: Science 362.6419 (Dec. 2018). Publisher: American
Association for the Advancement of Science Section: Report, pp. 1140–1144.
ISSN: 00368075, 10959203. DOI: 10.1126/science.aar6404.
[78] Stadie, Bradly, Yang, Ge, Houthooft, Rein, Chen, Peter, Duan, Yan, Wu, Yuhuai, Abbeel, Pieter, and Sutskever, Ilya. “The Importance of Sampling in Meta-Reinforcement Learning”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 9280–9290.
[83] Tobin, Josh, Fong, Rachel, Ray, Alex, Schneider, Jonas, Zaremba, Wojciech,
and Abbeel, Pieter. “Domain randomization for transferring deep neural
networks from simulation to the real world”. In: 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS). ISSN: 21530866. Sept.
2017, pp. 23–30. DOI: 10.1109/IROS.2017.8202133.
[84] Triantafillou, Eleni, Zhu, Tyler, Dumoulin, Vincent, Lamblin, Pascal, Evci,
Utku, Xu, Kelvin, Goroshin, Ross, Gelada, Carles, Swersky, Kevin, Manzagol,
PierreAntoine, and Larochelle, Hugo. “MetaDataset: A Dataset of Datasets for
Learning to Learn from Few Examples”. In: arXiv:1903.03096 [cs, stat] (Apr.
2020). arXiv: 1903.03096.
[86] Vinyals, Oriol, Blundell, Charles, and Lillicrap, Timothy. “Matching Networks
for One Shot Learning”. en. In: (), p. 9.
[87] Wang, Jane X., KurthNelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo,
Joel Z., Munos, Remi, Blundell, Charles, Kumaran, Dharshan, and Botvinick,
Matt. “Learning to reinforcement learn”. en. In: arXiv:1611.05763 [cs, stat]
(Jan. 2017). arXiv: 1611.05763.
[89] Yang, Yuxiang, Caluwaerts, Ken, Iscen, Atil, Tan, Jie, and Finn, Chelsea.
“NoRML: NoReward Meta Learning”. In: arXiv:1903.01063 [cs, stat] (Mar.
2019). arXiv: 1903.01063.
[90] Yu, Tianhe, Quillen, Deirdre, He, Zhanpeng, Julian, Ryan, Hausman, Karol,
Finn, Chelsea, and Levine, Sergey. “MetaWorld: A Benchmark and Evaluation
for MultiTask and Meta Reinforcement Learning”. In: arXiv:1910.10897 [cs,
stat] (Oct. 2019). arXiv: 1910.10897.
[92] Zhang, Chiyuan, Vinyals, Oriol, Munos, Remi, and Bengio, Samy. “A Study on
Overfitting in Deep Reinforcement Learning”. In: arXiv:1804.06893 [cs, stat]
(Apr. 2018). arXiv: 1804.06893.
Appendix Contents
A Technical details
Appendix A
Technical details
This project was developed in Python 3.7 using PyTorch and based on the meta-learning library learn2learn [46]. For further technical details regarding the code base, visit the open-source repository of this project at GitHub1.
The experiments were conducted on a personal computer with an i7 CPU, 24GB RAM and an RTX 2060 Super 8GB GPU, and on a remote workstation, a Dell PowerEdge R730 (24 cores, 256GB RAM, 1 x GTX 1080 Ti), at ICE NORTH SICS, across a 6-month period.
1 Link: [Link]
Appendix B

Additional environment: Procgen
During the first stages of the thesis, in which research on related work and the setup of an experimental evaluation framework were conducted, another reinforcement learning framework was initially considered instead of Meta-World. OpenAI's newest RL environment, Procgen, was developed as a challenging benchmark to evaluate the sample efficiency and generalisation of RL algorithms [11].
APPENDIX B. ADDITIONAL ENVIRONMENT: PROCGEN
PPO.”. Their implementation of PPO is based on the scalable and distributed IMPALA
algorithm [18].
Figure B.0.1: Samples of each of the 16 games of Procgen. Figure from [11].
For this project, a semi-distributed PPO and a MAML-PPO agent were implemented. The term semi-distributed is used to distinguish this from the fully distributed IMPALA implementation. In our case, a sampler was implemented that could fetch different episodes from the same agent in parallel. However, this means that the only distributed part is the forward pass (sampling); backpropagation still happens synchronously and needs to wait for every parallel agent to finish. In the case of IMPALA, the whole process (forward and backward pass) is distributed and asynchronous1.
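The scheme can be sketched as follows; the rollout function is a stand-in for the real sampler, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_episode(worker_id):
    """Stand-in for one environment rollout (forward pass only)."""
    return [float(worker_id)] * 5  # fake trajectory

def semi_distributed_iteration(n_workers=4):
    """Semi-distributed scheme: sampling runs in parallel, but the
    policy update waits for every worker to finish before a single
    synchronous backward pass (unlike IMPALA, where both passes are
    distributed and asynchronous)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        episodes = list(pool.map(collect_episode, range(n_workers)))
    # ... a single synchronous policy update on all episodes would go here ...
    return episodes

episodes = semi_distributed_iteration()
```

The implicit barrier at the end of `pool.map` is what makes the update synchronous: the slowest worker gates every iteration.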
The networks were based on the A2C model [55], following an architecture similar to the standard convolutional neural network used in [55] (which [11] calls the Nature CNN), as seen in figure B.0.2.
1 For further explanation, see figure 2 in [18], where scenario (a) is our implementation and (c) is the implementation of [11].
Another experiment was training a MAML-PPO agent on the same easy setting, sampling from only one environment and running it on a GTX 1080 Ti GPU for 25M timesteps. After 57 hours of training, the results were still poor, as seen in figure B.0.4.
These results also align with Section 4 of [11], where the authors conclude that smaller architectures (like the Nature CNN) can sometimes completely fail to train compared to larger and distributed implementations, which are more sample efficient and lead to agents with better generalisation capabilities.
Appendix C

Additional hyperparameter searches and results
During the development of the models in section 4, many hyperparameter values had to be tested in order to draw conclusions when comparing the performance of the different methods. Here, a series of searches that were part of the project but deemed of less significance for section 4 is presented. Moreover, more detailed results on the Meta-World tasks are presented for further investigation and transparency purposes.
APPENDIX C. ADDITIONAL HYPERPARAMETER SEARCHES AND RESULTS
loss and accuracy metrics during meta-training, and by meta-testing different model snapshots at various checkpoints of training. For example, for the optimal hyperparameter values on Omniglot, figures C.1.1, C.1.2 and C.1.3 indicate that there is no gain in performance after the 2.000 mark and the models have managed to converge, whereas for MiniImageNet the mark is a bit later, around 5.000.
Figure C.1.1: Comparing ANIL and MAML training metrics. Each line is the mean
value and the shaded area is the standard deviation across three seeds.
Figure C.1.2: Comparing ANIL and MAML validation metrics. Each line is the mean
value and the shaded area the standard deviation across three seeds.
Figure C.1.3: Meta-testing model snapshots of MAML and ANIL at different iteration checkpoints for Omniglot and MiniImageNet. Each line is the mean value and the shaded area the standard deviation across three seeds.
Since such conclusions can usually be derived from just one of these figures, in the rest of the section only the minimum number of figures necessary to support them is shown. As a baseline, for both the Omniglot and MiniImageNet datasets, the models were trained for 10.000 epochs.
Figure C.1.4: Comparison of three models with different numbers of inner loop updates.
Smoothing factor: 0.8.

Table C.1.1: HP values of C.1.4.
Ways              5
Shots             1
Outer LR          0.001
Inner LR          0.1
Adapt Steps       [1, 3, 5]
Meta Batch Size   32
Seed              1
Learning Rates: A coarse search for the inner (α) and outer (β) learning rate was
performed with ANIL on MiniImageNet. The inner learning rate controls how quickly
the learner's weights adapt to new data, whereas the outer parameter controls the
rate at which the meta-initialisation weights are updated. For this reason, the
inner learning rate (lr) needs to have a high value (rapid learning) and the outer lr a
lower one (steady convergence to a general enough meta-initialisation).
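The bi-level update described above can be sketched as follows. This is a minimal, hypothetical toy setup (a scalar parameter, quadratic task losses and a first-order approximation of the meta-gradient), not the thesis implementation: the inner loop adapts with the high learning rate α, while the outer loop moves the meta-initialisation with the lower rate β.

```python
def grad(theta, target):
    """Gradient of the toy quadratic task loss L(theta) = (theta - target)^2."""
    return 2.0 * (theta - target)

def maml_step(theta, tasks, alpha=0.1, beta=0.001):
    """One meta-update: adapt per task (inner loop), then update the init (outer loop)."""
    meta_grad = 0.0
    for target in tasks:
        # Inner loop: one fast adaptation step with the high learning rate alpha.
        adapted = theta - alpha * grad(theta, target)
        # First-order approximation: use the gradient at the adapted parameters.
        meta_grad += grad(adapted, target)
    # Outer loop: small step with beta towards a good shared initialisation.
    return theta - beta * meta_grad / len(tasks)

theta = 0.0
for _ in range(5000):
    theta = maml_step(theta, tasks=[-1.0, 1.0, 3.0])
```

With these quadratic tasks the meta-initialisation converges towards the task-target mean, from which every task is reachable in a single inner step.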
ANIL was trained with the configurations shown in table C.1.2 on the MiniImageNet
dataset for 5-ways, 5-shots and 5-ways, 1-shot classification. For the Omniglot dataset,
two different learning rate settings were tested with ANIL for the cases of 20-ways,
5-shots (figure C.1.6) and 20-ways, 1-shot (figure C.1.7).
Figure C.1.5: Comparison of three ANIL models on 5-ways MiniImageNet with
different inner and outer learning rates. Each line is the mean value and the shaded
area the standard deviation across three seeds. 20% accuracy is the same as random
for a 5-way classification setting.

Table C.1.2: HP values of C.1.5.
                  #1     #2     #3
Ways              5
Shots             1
Outer LR          0.003  0.001  0.05
Inner LR          0.5    0.1    0.1
Adapt Steps       1
Meta Batch Size   32
Seeds             [1, 2, 3]
Figure C.1.6: Results of ANIL for 20ways, 5shots in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.
Figure C.1.7: Results of ANIL for 20ways, 1shot in Omniglot. The reported results
for each model are the mean and standard deviation across three different seeds.
Due to the complexity of these methods and their wide range of hyperparameters, not
all of them could be tested with different values. Table C.2.1 presents the
hyperparameter values that were left as-is in all of the experiments.
This method was concluded to be too unstable to research any further. This could be
due to an error in the implementation or bad hyperparameter configurations. Even
though a few configurations were tested, the MAML-PPO method performed quite
unstably, as seen in figure C.3.1 for ML1 and in figure C.3.2 for ML10. Some models
were manually stopped since they did not show promising signs of convergence or
learning.
Figure C.3.1

Table C.3.1: HP values of the models in C.3.1.
                  #1     #2     #3     #4
Outer LR          0.01   0.1    0.01   0.01
Inner LR          0.01   0.1    0.05   0.01
Adapt Batch Size  20     10
Meta Batch Size   40     20
Adapt Steps       1
Seeds             42
Figure C.3.2

Table C.3.2: HP values of the models in C.3.2.
                  ANIL                 MAML
                  #1     #2     #3     #1
Outer LR          0.01
Inner LR          0.01   0.05   0.01   0.01
Adapt Batch Size  20     10     10     20
Meta Batch Size   40     20     20     40
Adapt Steps       1
Seeds             42
                  #1    #2    #3    #4    #5    #6    #7    #8
Outer LR          0.1   0.1   0.05  0.05  0.2   0.05  0.1   0.05
Inner LR          0.1   0.2   0.1   0.1   0.1   0.1   0.2   0.2
Adapt Steps       1     5     1
Adapt Batch Size  20    32    10    20    20    20    32    20
Meta Batch Size   40    16    16    64    16    16    16    32
Seed              42
Av. Test Reward   27    38    40    29    34    24    26    28
A basic hyperparameter search was performed for MAML-TRPO (figures C.5.1 to C.5.3)
and ANIL-TRPO (figure C.5.4). Due to the computational cost of fully training the
agents until stable convergence, they were trained only for 250 iterations and the
average validation return was used as an indicator for comparing the different
hyperparameter values. First, a small search was performed for the inner learning rate
(α), then for the outer learning rate (β) and finally for the adaptation steps, each time
picking the best value from the previous search.
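This stage-wise procedure can be sketched as a greedy, one-parameter-at-a-time search. The `evaluate` function below is a hypothetical stand-in for training an agent for 250 iterations and returning its average validation return; the toy scoring is invented for illustration only.

```python
def evaluate(config):
    # Toy stand-in for "train 250 iterations, return average validation return".
    # Peaks at inner_lr=0.001, outer_lr=0.3, adapt_steps=3 for illustration.
    return (-abs(config["inner_lr"] - 0.001)
            - abs(config["outer_lr"] - 0.3)
            - abs(config["adapt_steps"] - 3))

def greedy_search(base, stages):
    """Search one hyperparameter per stage, keeping previous winners fixed."""
    config = dict(base)
    for name, candidates in stages:
        config[name] = max(candidates,
                           key=lambda v: evaluate({**config, name: v}))
    return config

best = greedy_search(
    base={"inner_lr": 0.01, "outer_lr": 0.1, "adapt_steps": 1},
    stages=[
        ("inner_lr", [0.01, 0.001, 0.0001]),   # stage 1: alpha
        ("outer_lr", [0.1, 0.3]),              # stage 2: beta
        ("adapt_steps", [1, 3, 5]),            # stage 3: adaptation steps
    ],
)
```

The obvious trade-off of this greedy scheme is that it ignores interactions between hyperparameters, but it keeps the number of training runs linear in the number of candidate values rather than combinatorial.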
Figure C.5.1: MAML-TRPO inner lr search on ML1: Push.

Table C.5.1: HP values of the models in C.5.1.
Outer LR          0.1
Inner LR          [0.01, 0.001, 0.0001]
Adapt Batch Size  20
Meta Batch Size   40
Adapt Steps       1
Seeds             42

Figure C.5.3: MAML-TRPO adapt step search on ML1: Push.

Table C.5.3: HP values of the models in C.5.3.
Outer LR          0.3
Inner LR          0.001
Adapt Batch Size  20
Meta Batch Size   40
Adapt Steps       [1, 3, 5]
Seeds             42
Figure C.5.4: ANIL-TRPO inner lr search on ML1: Push. Smoothing factor 0.8.

Table C.5.4: HP values of the models in C.5.4.
                  #1     #2     #3     #4     #5
Outer LR          0.3    0.3    0.1    0.1    0.1
Inner LR          0.001  0.001  0.01   0.001  0.01
Adapt Steps       3      1      3      1      1
Adapt BS          20
Meta BS           40
Seeds             42
Figure C.5.5: Comparison of MAML models during training with different batch sizes
on ML1: Push.
Similarly, for ML10 different batch sizes were tested as seen in figure C.6.1.
Figure C.6.1: Comparison of MAML models with different batch sizes on ML10.
(a) Performance comparison of various learning rates. (b) Comparison of using 1 inner
loop step (adaptation step) and 3 steps.
Training vanilla RL policies (not meta-RL) on ML10 is expected to perform poorly, due
to the volatile setting of optimising for multiple losses at the same time. Figure C.7.1
shows all of the TRPO and PPO models developed in this project for ML10, with their
respective hyperparameter values in table C.7.1.
Due to the diversity of the tasks in Meta-World, averaging the accumulated rewards
and success rates across all tasks to report the performance of a method can be
misleading. In the following figures, the performance of the trained models is shown
for each task separately.
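The point can be illustrated with a small per-task summary. The reward values below are made up for illustration: an easy task with large rewards dominates the single cross-task average, while the per-task means and standard deviations reveal that the other tasks are barely solved.

```python
import statistics

# Hypothetical accumulated rewards per task (3 seeds each), for illustration only.
rewards = {
    "reach": [900, 950, 920],   # easy task with large rewards dominates the average
    "push":  [30, 40, 35],
    "sweep": [5, 10, 0],
}

# A single cross-task average hides the failures on the harder tasks.
overall = statistics.mean(r for task in rewards.values() for r in task)

# Per-task mean and standard deviation make the disparity visible.
per_task = {task: (statistics.mean(rs), statistics.stdev(rs))
            for task, rs in rewards.items()}
```

Here `overall` lands above 300 even though two of the three tasks collect almost no reward, which is exactly why the per-task figures below are reported.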
Figure C.8.1: Performance (accumulated rewards & success rate) of PPO, MAML-TRPO
and ANIL-TRPO on the train tasks of ML10. Reported results are the mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.
Figure C.8.2: Performance (accumulated rewards & success rate) of PPO, MAML-TRPO
and ANIL-TRPO on the test tasks of ML10. Reported results are the mean and
standard deviation of 5 different seeds with 3 trials of 10 episodes per task.
One possible question that arises regarding the meta-testing procedure is whether
such limited interaction with unseen tasks is sufficient for the agents to adapt. The
methods were updated only once, based on 10 episodes, with a small learning rate
and a high discount factor. This makes the performance highly dependent on the
random seed that sets the configuration of the environment, and leaves a chance that
the agents were not able to adapt quickly under these hyperparameter values.
Table C.9.1: Hyperparameter values for the meta-testing phase of the algorithms on
the ML10 test tasks.

Figure C.9.1: Performance (accumulated rewards & success rate) of PPO, MAML-TRPO
and ANIL-TRPO on the test tasks of ML10. The "Before" results are the agents
evaluated on the test tasks without any change to their weights after training. The
"After 1 Step" results are based on the "Default" values of table C.9.1 and the "After
5 Steps" results on the "Extended" values. Reported results are the mean and standard
deviation of 5 different seeds with 3 trials of 10 episodes per task.
The specific reward functions and their success metrics for the Meta-World ML10
tasks are presented below, as defined in [90].
Table C.10.1: Reward functions of the ML10 tasks.

Train tasks
basketball        −||h − o||_2 + I[||h − o||_2 < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
button-press      −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
door-open         −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
drawer-close      −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
peg-insert-side   −||h − o||_2 + I[||h − o||_2 < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
pick-place        −||h − o||_2 + I[||h − o||_2 < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
push              −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
reach             1000 · e^(−||h − g||_2^2 / 0.01)
sweep             −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
window-open       −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)

Test tasks
door-close        −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
drawer-open       −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
lever-pull        −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
shelf-place       −||h − o||_2 + I[||h − o||_2 < 0.05] · 100 · min{o_z, z_target} + I[|o_z − z_target| < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
sweep-into        −||h − o||_2 + I[||h − o||_2 < 0.05] · 1000 · e^(−||h − g||_2^2 / 0.01)
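For concreteness, the common shaped-reward form above (reach the object with the hand h, then move towards the goal g) and the success indicator can be sketched as follows. This is a minimal illustration of the formulas, not the Meta-World source code; the function and parameter names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance ||a - b||_2 between two 3D positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shaped_reward(h, o, g, c=0.01):
    """-||h - o||_2 + I[||h - o||_2 < 0.05] * 1000 * exp(-||h - g||_2^2 / c).

    h: hand position, o: object position, g: goal position.
    The exponential goal bonus only activates once the hand is near the object.
    """
    reward = -dist(h, o)
    if dist(h, o) < 0.05:
        reward += 1000.0 * math.exp(-dist(h, g) ** 2 / c)
    return reward

def success(o, g, threshold=0.07):
    """Success metric: I[||o - g||_2 < threshold], threshold in meters."""
    return dist(o, g) < threshold
```

The sharp scale difference between the distance penalty (order 1) and the gated bonus (up to 1000) explains why accumulated rewards jump by orders of magnitude once an agent starts completing a task.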
Train tasks
basketball        I[||o − g||_2 < 0.08]
button-press      I[||o − g||_2 < 0.02]
door-open         I[||o − g||_2 < 0.08]
drawer-close      I[||o − g||_2 < 0.08]
peg-insert-side   I[||o − g||_2 < 0.07]
pick-place        I[||o − g||_2 < 0.07]
push              I[||o − g||_2 < 0.07]
reach             I[||o − g||_2 < 0.05]
sweep             I[||o − g||_2 < 0.05]
window-open       I[||o − g||_2 < 0.05]

Test tasks
door-close        I[||o − g||_2 < 0.08]
drawer-open       I[||o − g||_2 < 0.08]
lever-pull        I[||o − g||_2 < 0.05]
shelf-place       I[||o − g||_2 < 0.08]
sweep-into        I[||o − g||_2 < 0.05]
Table C.10.2: Success metrics of the ML10 tasks. The static values represent distance
in meters.
TRITA-EECS-EX-2021:15