>>Let's get started again. I hope you
enjoyed your lunch.
So the next speaker is again a guest speaker,
from Oxford and DeepMind, Yee Whye Teh,
speaking on Distill and Transfer Learning for Robust Multitask Reinforcement Learning.
>>Thanks, Sebastian. So this is joint work
with a bunch of people at DeepMind, and it was just presented at the most recent NIPS.
Let me start with a bit of an introduction to reinforcement learning.
We have an agent, the agent is trying to act in an environment,
the environment is sending observations to the agent, and there's
some reward that the agent gets, and what the agent would like to do
is to maximize its reward by acting in this environment.
Okay. So the way people typically formalize reinforcement learning
is using what are called Markov decision processes.
So we have the actions of the agent and the states of the environment
over multiple time steps: s is going to be the states of the environment,
a is going to be the actions of the agent, and this is for time step one,
and then we have time step two, time step three, and so forth.
We're going to call that the trajectory of state and action pairs,
and we're going to denote it by tau.
The way Markov decision processes work is that we're going to assume the following.
We're going to assume that we have some transition dynamics: given that the
environment is at some state s and the agent takes an action a, what's the
distribution of the next state of the environment? And we're going to assume
that this is Markov, and what that means is that, given the current state,
the future state is independent of past states. So basically the current state
captures all the information about the world. So that's the environment dynamics.
We also have rewards that the agent receives. We can think of this as
just a function that is high if the agent gets a high reward
when it takes that action in that state. So that's the reward.
And finally, the last ingredient here is the policy of the agent,
and this is the thing that we want to learn. We're going to assume that
the policy is basically a distribution over the actions that the agent
can take in state s. Together these define a joint distribution over
trajectories, and the problem of reinforcement learning is basically
to find policies that maximize this expected reward here. So we have a
sum over time of the rewards at each step, and we're going to assume
that there's some discount factor gamma between 0 and 1, which discounts
future rewards. Basically we get the agent to pay more attention to
rewards that are going to arrive in the near future rather than far away;
if a reward is only going to be seen a very long time from now, the agent
is basically not going to worry about it. And this is an expectation over
the environment dynamics and the policy that the agent takes.
So that's reinforcement learning.
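For reference, the objective just described can be written out as follows. This is the standard discounted-return formulation; the notation is reconstructed from the verbal description rather than taken from the slides:

```latex
p_\pi(\tau) = p(s_1) \prod_{t \ge 1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\qquad
J(\pi) = \mathbb{E}_{\tau \sim p_\pi}\!\left[\sum_{t \ge 1} \gamma^{t-1} R(a_t, s_t)\right]
```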
What we're interested in here is multitask reinforcement learning,
and what I'm going to take that to mean is that we have a set of
tasks indexed by i, and for every task we have a different transition
dynamics and a different reward function, and we're going to learn a
different stochastic policy as well. This is the thing that we want
to maximize for each task i.
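Per task, the quantity being maximized is the same discounted return as before, now indexed by the task (again, reconstructed notation):

```latex
J_i(\pi_i) = \mathbb{E}_{\tau \sim p_{\pi_i}}\!\left[\sum_{t \ge 1} \gamma^{t-1} R_i(a_t, s_t)\right]
```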
And you can see here that of course these are multiple different tasks,
but at the same time we want them to share some similarity, and I'm
already assuming here that the state space of the environment and the
action space of the agent are the same across the different tasks.
So imagine that you have different tasks. In the experiments later
I'll show some experiments where the agent is basically in something
like a first-person shooter game, a three-dimensional visual environment.
We're going to assume that the state consists of these three-dimensional
visual observations, and the actions that the agent can take will be
things like move forward, move backward, turn left and right, shoot,
jump, and so forth. So basically we're going to assume that the state
space and the action space are the same across these different tasks.
And of course we're going to assume more than that, because what we'd
like to do is to say: well, if we have multiple tasks and they are all
related to each other in some way, we should be able to learn them
faster if we learn them jointly as opposed to learning them separately.
The idea that we have for this distill and transfer learning is the
following. We're going to assume that there are separate policies for
each task, and we're also going to assume that there's some centroid
policy, or you can think of it as a prior policy, and this policy is
going to serve as a communication channel between the tasks. One way
you could think about this is that each π_i here is going to be a
policy that we're going to learn to act well in its task, while π_0
in the middle here is going to be the policy that the agent would take
if it doesn't know which task it's in. So it's almost like a default
behavior if it doesn't know which task it's trying to solve.
Okay, and it turns out that, so, we're going to formalize an objective
that I think makes a lot of sense for something of this form, and
interestingly, the algorithm that we end up deriving has quite
interesting behavior, which is the following. Basically, this centroid
policy is going to be learnt by a distillation process from each of
the individual task-specific policies, and that's going to learn some
common behavior across the different tasks. And the way this π_0 is
going to affect the individual task-specific policies is via
regularization, which is I guess why we call this distill and transfer
learning: we're kind of distilling towards this central policy, and
then that's going to allow us to transfer knowledge, or transfer
behavior, across the different tasks. An intuition that maybe you
could have is: imagine that in one of the tasks the agent has learned
that if it sees a corridor that's quite long, it should just walk down
the corridor, as opposed to dithering around in the corridor. Then
that behavior could be useful in other tasks as well, which may have
corridors. So if π_1 learns that behavior, then that gets distilled
into π_0, and then what that's going to do is encourage the other
agents, which maybe haven't learnt that behavior yet, to reproduce
that behavior.
So what's the objective that we use? I think I've used this trick a
few times across my career, but basically we're going to take a
hierarchical Bayesian approach. So we have the task-specific policies
π_i, and this bit here is the expected reward that we would like to
maximize for task i, and what we're going to do is introduce a KL
penalty term, which says that by default we're going to prefer that
the task-specific policies are similar, in a KL sense, to this
default policy. So this bit here is the KL: if you take the
expectation over π_i in here, you can see that it's a KL between π_i
and π_0, but it's not the usual one, in that there's a discount factor
here as well. So that's basically the setup.
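The "KL with a discount factor" being referred to can be reconstructed as follows (the name KL_γ is editorial shorthand, not the paper's):

```latex
\mathrm{KL}_{\gamma}(\pi_i \,\|\, \pi_0) = \mathbb{E}_{\pi_i}\!\left[\sum_{t \ge 1} \gamma^{t-1} \log \frac{\pi_i(a_t \mid s_t)}{\pi_0(a_t \mid s_t)}\right]
```

and it is subtracted, weighted by a constant, from each task's expected reward.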
Okay, so if you think about now optimizing this joint objective over
all the different tasks and over this default policy, there are kind
of two things that you want to optimize: one is the default policy
and the other is the task-specific policies, so I'll go through each
of those in turn. If you think about learning the default policy,
fixing the task-specific policies, then you see that the only term
that the default policy depends on is this log π_0 term, and if you
try to optimize this, it's basically just a supervised learning
problem; it's a maximum likelihood sort of objective. So what the
default policy will try to learn is: if π_i has high probability for
an action, then π_0 is also going to learn to give high probability
to the same action in the same state. So this is basically just
distillation.
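As a rough illustration of that update, here is a minimal PyTorch-style sketch, assuming a hypothetical pi0_net that maps a batch of states to action logits; this is an editorial sketch, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def distill_step(pi0_net, optimizer, states, actions):
    """One supervised 'distillation' update for the default policy:
    maximize log pi_0(a | s) on state-action pairs generated by the
    task-specific policies (i.e. maximum likelihood / cross-entropy)."""
    logits = pi0_net(states)                 # unnormalized log-probs over actions
    loss = F.cross_entropy(logits, actions)  # = -mean_t log pi_0(a_t | s_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```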
And this is something that people have exploited in the past: there's
work by Rusu et al. and Parisotto et al., which they call policy
distillation and actor-mimic. The difference from those two works is
that there the learning is a two-stage procedure, where in the first
stage we learn the task-specific policies independently, and once
those are learnt there's a distillation process that learns π_0, and
then that's it. What we're doing here is that, once we've learned π_0,
it itself is also used to regularize the π_i's towards π_0 as well.
So it's kind of an iterative procedure.
For those of you who are familiar with reinforcement learning, you can
see that this objective here is actually very similar to a lot of the
recent KL- or entropy-based policy regularization techniques, which
people use to make sure that the policies the agents learn don't
prematurely converge to some suboptimal behavior. So there it's
basically used as a way to regularize so that the agent can explore
the space properly. But the fact that we're actually learning π_0
actually breaks its use as a regularizer for exploration, because
typically in those applications π_0 is just a uniform distribution
over actions, and what that's saying is that if you haven't learned
what to do in a particular state, then you should just try all the
actions. If instead π_0 has itself learnt to give high probability to
one of the actions and low probability to the rest, then that's going
to discourage exploration.
So what we did is actually very simple: we're going to have the
original expected reward term, we're going to have this KL
regularization, which implements this distillation and transfer
process that shares information across the different tasks, and in
addition we're going to have an additional entropy regularization
term that encourages the agents to explore further. Of course there
are hyperparameters c_KL and c_Ent, and I'm going to rewrite this:
basically we reparametrize it using two different parameters, alpha
and beta, just to make the math simpler later.
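Putting the pieces together, the objective being described, with the reparametrization, is (this matches how the Distral paper, Teh et al., NIPS 2017, writes it; reconstructed here from the talk):

```latex
J(\pi_0, \{\pi_i\}) = \sum_i \mathbb{E}_{\pi_i}\!\left[\sum_{t \ge 1} \gamma^{t-1}\Big(R_i(a_t, s_t)
  - c_{\mathrm{KL}} \log\frac{\pi_i(a_t \mid s_t)}{\pi_0(a_t \mid s_t)}
  - c_{\mathrm{Ent}} \log \pi_i(a_t \mid s_t)\Big)\right],
\qquad
\alpha = \frac{c_{\mathrm{KL}}}{c_{\mathrm{KL}} + c_{\mathrm{Ent}}},
\quad
\beta = \frac{1}{c_{\mathrm{KL}} + c_{\mathrm{Ent}}}
```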
So I've described how the learning for π_0 works, but I haven't really
described how the learning for the π_i's works. There's actually been
a lot of work in the literature now that has explored this idea of
doing entropy-regularized reinforcement learning, and there are very
interesting connections between that work and probabilistic inference.
If you go back to this equation here, and if you're familiar with
variational inference and you squint a bit, you can see that π_i here
can be thought of as a variational posterior distribution, this term
here is basically the log likelihood that you want to maximize, and
this term here is basically a KL between the prior and the posterior.
So this is just a variational objective, and people have made that
connection before. You can actually solve for the optimal variational
posterior distribution in the context that we have here, and in our
case what we get is the following.
the following that what we find is that
the optimal PI given PI naught is going
to be of a form or given by 4 so
basically PI is going to be PI naught
raised to the Alpha times another term
here which is e today some advantage
function raised to the beta
and basically you can think of this as a
prior and think of that as likelihood
and this pie is just computes the
posterior distribution I have a quality
term here because the normalization term
is actually kind of roll into this
advantage function it turns out that the
advantage function here is you can think
of this as the value of taking action a
in state s relative to the value of the
state s and turns out that this this
value of the state s actually serves as
a lot of normalization constant in this
in this equation and these are all
softened in the sense that we're kind of
softening it by adding this additional
entropy regularization lkl
regularization so this is also quite
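Written out, the form being described is (as in the Distral paper, with the soft advantage A_i = Q_i − V_i, and the soft value V_i acting as the log normalizer):

```latex
\pi_i(a \mid s) = \pi_0(a \mid s)^{\alpha}\, e^{\beta A_i(s, a)},
\qquad
V_i(s) = \frac{1}{\beta} \log \sum_a \pi_0(a \mid s)^{\alpha} e^{\beta Q_i(s, a)}
```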
So this is also quite interesting from the perspective of how we
parameterize our policies, in particular how we parameterize our
central policy and how we parameterize the task-specific policies.
In deep reinforcement learning it's very typical to say that we're
going to parameterize our policies using neural networks: given a
state s, we're going to learn a neural network that gives us a
distribution over actions if the agent finds itself in that state.
And similarly we might say that we can also parameterize the advantage
functions using neural networks, so we have a neural network that
computes the advantage of action a if the agent finds itself in state
s. And basically what this optimal form tells us is that the optimal
form for π_i, for the task-specific policies, should have the
following structure: given some observations of the states of the
environment, we're going to have a neural network that computes the
default policy, and we're going to have a separate neural network
that computes the advantages, and we're just going to do a weighted
sum of these two things in log space and then exponentiate, basically
passing it through a softmax, and that's going to give us our optimal
form for the task-specific policy for task i. The notation here is a
bit of a cross between neural networks and graphical models: we have
a plate over this neural network for each task, so we basically
replicate it with different parameters. And basically that's how we
propose to parameterize the task-specific policies.
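A minimal sketch of that two-column head, combining the shared default-policy column and a task-specific advantage column in log space; the module names here are hypothetical, and the actual architecture in the paper has convolutional torsos and so on:

```python
import torch
import torch.nn as nn

class TwoColumnPolicy(nn.Module):
    """Task-specific policy pi_i(a|s) = softmax(alpha * log pi_0(a|s) + beta * A_i(s,a)):
    a weighted sum of the two columns in log space, then a softmax."""
    def __init__(self, pi0_net, adv_net, alpha, beta):
        super().__init__()
        self.pi0_net = pi0_net  # shared across tasks: the default/centroid policy column
        self.adv_net = adv_net  # replicated per task: the "plate" in the figure
        self.alpha, self.beta = alpha, beta

    def forward(self, state):
        log_pi0 = torch.log_softmax(self.pi0_net(state), dim=-1)  # log default policy
        adv = self.adv_net(state)                                 # task-specific advantages
        return torch.softmax(self.alpha * log_pi0 + self.beta * adv, dim=-1)
```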
And this is actually quite interesting, because if you start off by
looking at this network architecture, we could say: well, let's just
learn our stochastic policies using standard reinforcement learning
techniques like policy gradients or actor-critic methods, and we
parameterize our task-specific policies using this network
architecture, but let's not worry about this additional KL
regularization between the π_i's and π_0. If we do something like
this, you could imagine that there's some form of non-identifiability
between these two columns of the network. What the addition of this
KL regularization between the π_i's and π_0 does is that it forces
this column here to learn everything that it can learn across all the
different tasks, and so what this other column is actually forced to
do is to learn the delta between this default policy behavior and
anything that makes it work on that particular task. So you remove
some of the non-identifiability, and that actually makes the learning
more stable.
So I'll just skip through this. Basically, we've compared four
different variants of our algorithm with the standard at the time,
which was asynchronous advantage actor-critic (A3C) methods, and I'll
keep to a summary of results: basically it works pretty well. So this
is an example where the solid lines are the different variants of our
method, and the two variants that we're actually proposing basically
learn pretty fast, in terms of reward, and get to better policies in
the end. This little example on the grid world task is quite
interesting, because it shows that what the default policy learns is
quite interesting. This is a case where we have two rooms, two parts
of the space, and a long narrow corridor that connects the two parts,
and the different tasks are basically just getting the agent to a
different position in this environment. So these are four different
tasks here, where the green is just a goal state, and the agent has to
move from anywhere in the space to the goal. And you can see that what
the default policy has learned is the following: if the agent finds
itself in this part of the space, and it finds that it is going down
this corridor, then it should keep on walking down the corridor in
the same direction; and if it finds that it's moving right along the
corridor, then it has learned to keep on moving right along the
corridor as well, which kind of makes sense in this context. From the
perspective of hierarchical RL, this basically looks like learning an
option, without it really looking like an option, which I think is
kind of interesting. And this works on more complex domains as well.
Anyway, so what we've proposed here is a faster, more robust algorithm
for multitask reinforcement learning, and basically what it does is
assume that the tasks have the same physics, and so they are solved
by policies with similar structure, and that similar structure is
captured by this default behavior that we learn alongside each of the
task-specific behaviors. Thanks.

So I think that there are lots of different ways in which one could
think about transfer across different tasks, because different sets
of tasks are similar in different ways, and I think that would be
another way in which we could have commonality across different tasks
and transfer across them. This is one way; I'm not saying that this
is the best way, it's just one way.
