- Hi everyone.
Welcome to Lecture 20 of
CS 287: Advanced Robotics.
Couple of extra exciting
things for today's lecture.
One, we have a second camera crew today
recording some B-roll for
a business school course.
So for those of you over there,
if you'd like to be shot,
then sit more to the front.
If you'd like to not be
shot, sit more to the back.
We also have the CTO of Bolt and,
I think, the entire technical team
for this specific project.
Keith, you wanna stand up for a moment?
- [Keith] Hello.
- So Keith leads
Bolt Scooters as CTO,
and is here today specifically because
they've actually built an
autonomous version of the scooter.
And so after lecture today,
stick around to get a demo
and learn more about autonomous scooters.
And yeah, looking
forward to seeing it too.
- Thank you very much sir.
And how I got into this was
Pieter's helicopter videos
back in 2004, so.
(all laughing)
- Correct.
(laughter)
Be careful or this thing might fly.
(laughter)
So, let's see, logistically for the class,
yesterday your final
project proposals were due.
So hopefully you did that.
If you didn't do that, please let me know,
and hopefully we can
still find a solution.
And there is one more
homework that's coming out
in the next half week-ish,
probably by Monday,
and that's your last homework.
And then final project and last homework,
that's it for the class.
All right, any logistical questions?
So today we're gonna cover Model-Based RL,
but actually we still have about half
of Off-Policy, Model-Free
left to do first,
so let's recap the storyline there,
and in the second half of lecture
we'll go to Model-Based RL.
So we covered on-policy model-free RL,
and the main algorithms
there were TRPO and PPO.
And the beauty is that
they're very stable,
relatively easy to work with,
but they don't heavily
reuse past experience.
They essentially just
use the latest samples
to optimize the surrogate
loss and that's it,
and so limited off-policy
reuse of your samples.
If you don't care about off-policy reuse
'cause samples are cheap to
collect, then it doesn't matter
because you can use the
latest samples at all times
and just collect more.
But if samples are expensive to collect,
because maybe you need to
collect samples in the real world
or you need to pay money to collect them,
then you might wanna reuse them more so
than PPO or TRPO would do.
What we can do is actually
something called Q-learning,
which we started last lecture
and which I'll quickly recap now.
From there we'll
move to Soft Q-Learning,
Soft Actor-Critic, and DDPG.
All right, so quick recap of Q-learning
with function approximation.
(mechanical humming)
So today, we'll see DQN, we'll see Soft Q,
and then we'll see DDPG, and SAC.
And they're all quite
related so I'm gonna try
to have them all on the
board at the same time.
So, DQN. This is really two parts.
One is collecting data.
The other thing is then,
after data has been collected
and put in the replay buffer,
updating the Q function that
you learned from that data.
So first part, and this
will iterate: collect data,
which is tuples of the form
(s_t, a_t, r_t, s_{t+1}),
and put it into the replay buffer.
And then after a bunch of
new data has been collected,
or possibly asynchronously
of a data collection process
and a learning process
working asynchronously,
you do a few updates to your Q function.
So, the loss you put forward is
a sum over some experiences,
indexed by k,
'cause it doesn't need to be
exactly the most recent ones,
experiences you get
from your replay buffer,
of, often, a squared loss or Huber loss of
the target, often called y(s_k),
minus what you currently have,
which is your Q value, Q_theta(s_k, a_k).
And what's your target?
Your target, y(s_k), is
equal to the reward you got
in that experience, when you were
in state s_k taking action a_k,
plus gamma times the max over actions
at the next time.
So not necessarily the action you took,
but the action that maximizes
the Q value at the next time,
so max over a' of
Q at the state you were in
at the next time, s_{k+1},
and then the a' you get to maximize over.
Now, naively, you might
use also Q theta here,
but in practice that doesn't
work in a very stable way
because now you have this
feedback loop that's very rapid
and might destabilize the learning.
So in practice, often you
have a delayed theta minus,
is a delayed version of theta,
which you could just periodically update.
So this you iterate frequently,
and then every now and then
you send theta minus equal
to the current theta,
and then you go around the
inner loop many times again.
So that's standard DQN.
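To make that concrete, here is a minimal sketch of the update loop in PyTorch, assuming a small fully connected Q network, discrete actions, and minibatches of (s, a, r, s', done) tensors already drawn from the replay buffer; all names and sizes are illustrative, and the Atari version would use a convolutional network instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())      # theta_minus starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on a minibatch sampled from the replay buffer."""
    with torch.no_grad():
        # y_k = r_k + gamma * max_a' Q_{theta_minus}(s_{k+1}, a'), no bootstrap at terminal states.
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s_k, a_k)
    loss = F.smooth_l1_loss(q_sa, y)                        # Huber loss against the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every now and then, set theta_minus equal to the current theta:
# target_net.load_state_dict(q_net.state_dict())
```

The commented-out copy at the bottom is the periodic theta-minus update mentioned above.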
Any questions about this?
All right, so, this algorithm
is essentially what succeeded
at putting deep reinforcement learning
on the map back in 2013,
when DeepMind showed that
this approach is able
to learn to play Atari games
from its own trial and error.
And data collection was done
by just epsilon-greedy at the time.
So most of the time, take the action
that maximizes your Q value,
but then with some small probability,
you take your random action.
So, Atari results from 2013
on a wide range of games,
the exact same code was able
to train a neural network
to reliably play these games.
The network itself was much like AlexNet,
just a bunch of convolutional layers
followed by a few fully connected layers.
Maybe the main thing to pay attention to
is that it had an output
for every possible action.
So the Q network took in images,
output a Q value for each possible action.
So this only works if you
have discrete action spaces.
Otherwise, you can't really do this.
Results were in about
two thirds of the games,
the DQN network learned to
play better than human players,
and in about one third, not so much.
The one third where it didn't do as well
is typically characterized
by needing better exploration.
So where random exploration
just doesn't really expose
the interesting parts of the game,
then you don't get any
signal in your Q-Learning,
and so you can't actually
learn the optimal Q values
because you haven't seen the things
that lead to optimal behavior.
Now, what can you do to
further improve this?
One thing is that if you think about
the target values being computed
with this max over actions,
well, initially your Q
function is pretty noisy.
And if you take the maximum over actions
of this thing that's pretty noisy,
then it might just be due to noise
that a certain action looks
better than other actions.
And you take the max,
and so you get a drift
towards higher Q values
that's not really justified.
That's just really the noise propagating
rather than something
that's actual signal.
So how to avoid this.
In Double DQN, what you do is
you compute the argmax on one Q network,
and then you take the
value in another network.
But that value is using
the argmax action you get
from the first Q network.
So you use two networks,
one gives you the argmax,
the other one gives you the value
associated with the action you get there.
That way, if in one network due to noise
you pick a certain action,
well, in the other network,
very likely the noise
will not be the same,
and it won't be a highly
overestimated Q value
that's propagating.
That helps a lot in stabilizing
and ensuring that you don't
overestimate your Q values,
and gives quite a bit better
results on the Atari games.
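As a sketch of that target, reusing the q_net and target_net from the snippet above; this is illustrative rather than the exact published form.

```python
def double_dqn_target(r, s_next, done, gamma=0.99):
    """Pick the argmax action with one network, evaluate it with the other."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax from the first network
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # value from the second network
        return r + gamma * (1.0 - done) * q_eval
```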
Another thing you can do
to make this work better
is prioritized experience replay.
It's a very simple idea
when you think about the
loss function in Q-Learning:
it's a loss to get Q values
closer to target values.
Now, if your Q value is already very close
to the target value, you're
kind of just wasting time
computing the gradients of that loss
and doing the updates that
don't contribute much.
So prioritized experience replay will check
whether that error is
large, keep track of that,
and prioritize data from the replay buffer
where the Bellman error is large
to get more informed updates.
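A toy sketch of the idea follows; it leaves out the importance-sampling corrections the full method uses, and the class name and priority exponent are illustrative.

```python
import numpy as np

class PrioritizedReplay:
    """Sample transitions in proportion to their last observed Bellman error."""
    def __init__(self, alpha=0.6, eps=1e-3):
        self.data, self.priorities = [], []
        self.alpha, self.eps = alpha, eps

    def add(self, transition, td_error=1.0):
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After a learning step, refresh priorities with the new Bellman errors.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha
```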
There's a couple of other
things you might wanna look at.
Dueling DQN is an architecture
where the Q network is architected
to explicitly also have
a value function in it.
And we know that the Q function and the
value function are closely related.
So now I have one network
that has both Q and value function in it,
which regularizes the
learning quite a bit.
Distributional DQN is
a very different idea
where instead of having a Q value
that comes out as a real number,
you have a categorical
output for a Q value.
So instead of saying,
my Q value is this number,
you say, I have bins,
maybe a hundred bins,
possible intervals
my Q value can lie in,
and you just try to predict
which bin it falls into.
So you get a classification problem
rather than a regression problem.
It actually works quite a bit better.
It's not 100% clear yet why.
A lot of people suspect
it's because you propagate more signal.
'Cause if all you propagate
is the average Q value,
you lose a lot of information.
But imagine if you
propagated specific Q values
that can happen at times.
For example, maybe sometimes
you have a reward in the game
that's plus 10, sometimes plus 30.
That's different from
getting 20 on average.
And so you propagate a
more informative signal
through distributional Q-Learning,
looking at more detail of the future.
It's closer to almost
predicting future states.
'Cause if the reward is
distinct at future times,
it's like trying to track
what states you're gonna
visit in the future
and keeping track of that.
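A minimal sketch of the categorical representation, assuming a small fully connected network; the projected Bellman update used to train it (as in C51) is omitted.

```python
import torch
import torch.nn as nn

n_atoms, v_min, v_max = 51, -10.0, 10.0
support = torch.linspace(v_min, v_max, n_atoms)   # the bins the return can fall into

class CategoricalQNet(nn.Module):
    """Outputs a categorical distribution over return bins for every action."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions * n_atoms))

    def forward(self, s):
        logits = self.net(s).view(-1, self.n_actions, n_atoms)
        return torch.softmax(logits, dim=-1)       # a classification over bins, per action

    def q_values(self, s):
        # A scalar Q per action (for acting greedily) is the expectation over the bins.
        return (self.forward(s) * support).sum(dim=-1)
```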
Noisy DQN is a different way to explore
rather than exploring
by taking epsilon-greedy random actions,
you perturb the weights in the Q function.
Why might you do this?
Well, random actions
can jitter you around,
which is good to
introduce some randomness.
But let's say you're
an agent in a hallway,
and you're just being jittered around.
You're never gonna make it
to one end or the other end
or it's gonna take a very long time.
But if you shift the weights,
then you might shift them in
favor of going to the left
or in favor of going to the right.
If you keep the shifted weights
for a few steps in a row,
you'll actually make some progress
in one or the other direction,
rather than jittering back and forth.
And so this can really
help with exploration.
It's not a full solution to exploration,
but it can make a difference.
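A rough sketch of the idea: NoisyNet proper learns a noise scale per weight, but the simpler parameter-noise variant below just perturbs a copy of the weights and acts with it for several steps.

```python
import copy
import torch

def perturbed_copy(q_net, sigma=0.05):
    """Exploration by shifting the weights: act with this noisy copy for a few steps in a row."""
    noisy = copy.deepcopy(q_net)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy
```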
Now, if we wanna use
continuous action spaces,
what I just described is
actually not gonna work any more
because you cannot output a Q value
for every possible
continuous action value.
So what can we do?
Let's work through the overall
algorithm on the board.
(mechanical whirring)
So, the challenge we're
trying to address is that
we cannot have an output for every action
'cause we have a continuous action space.
Now you might say, well, question.
- [Student] Can't you just use a Q network
where you
can pass any action into it?
- That's exactly what we're gonna do.
Absolutely.
So we cannot output a Q value
for every action any more.
What we can do is
feed in the actions.
And if we feed in the actions,
we can still have a Q value
for every possible state action pair.
Now the issue with feeding in the action
is now when we're supposed to act,
let's say when we're acting
here, collecting data,
I wanna find the action that
maximizes the Q function,
we have to solve an optimization problem.
We have to find the action
that maximizes the value
for the current state.
The optimization problem
can be expensive to solve,
and so what we need to
find is a solution to that,
a way to not have to worry about
doing expensive calculations
in every step that we take
when collecting data.
So, that's one thing that Soft
Q-Learning's gonna resolve.
The other thing it's gonna
resolve at the same time,
it's gonna introduce
entropy bonuses again.
So it's gonna try to ensure
that we have more exploration
thanks to entropy bonuses.
So, to resolve the main challenge,
when we're acting, we're
gonna have a policy.
So even though we're doing Q-Learning,
we're gonna explicitly
paramaterize a policy
that we're learning.
And it's gonna be a policy
where we try to find
a distribution over actions given state.
Now, what might be the way to get a policy
if we have a Q network?
Well, one way to get a policy
when we have a Q network
is to say what we really want
is the maximum entropy policy
corresponding to the Q values,
and we know from an earlier lecture
what that is:
it's gonna be a policy that's
proportional to exp(Q(s_t, a_t)).
Now, how do we find that policy?
We can actually run gradient
descent on this objective.
So we'll have a parameterized
policy, pi theta,
or, since our Q function, let's say,
still uses theta for Q,
we'll use phi for the policy:
a parameterized policy pi phi.
And we can find this
by minimizing over phi
the KL divergence between the,
let's see which direction,
between the policy, pi phi, at s_t,
and the un-normalized distribution
exp(Q_theta(s_t, ·)).
Being un-normalized is fine:
the normalizer will just be a constant
in a KL divergence loss, so
we don't have to worry about that.
And so, we'll have this part here,
where we minimize this summed
over multiple data points
where we update our policy.
Once we have our updated policy,
we can act according to it,
get new data.
And we'll still have a Q-Learning update.
But since we're doing Soft Q-Learning now
to introduce entropy bonuses,
we'll have an update for Q
that is how our Q value should
equal to reward at time t
plus expected value over
state at time t plus one
of the value at time t plus one.
And again, this is another
thing we'll need to learn
is value function here.
And our value function will be equal
to the soft max over Q values:
the log of the integral of exp(Q(s_t, a)),
integrated over possible actions
we could take at that time.
And so now we're keeping
track of three things:
a value function, a Q
function, and a policy.
This will turn into a loss.
So for this thing, we'll
have a negative sign here,
we'll square this thing, and
sum it over many time slices,
and minimize over theta.
This one here, we'll minimize over psi,
sum over many t of this thing squared.
Now there's a little bit of an issue here.
This thing here, the
integral over all actions,
is often intractable to compute.
So what we'll do for this one here,
we'll use a sample based approximation.
So rather than integrating,
we just sample a few actions.
We can sample them from
the policy, which
would be on-policy sampling,
if you believe your
policy is pretty good.
But you can actually also
do it completely off-policy
because it's Q-Learning.
You can just sample actions
and estimate the integral
with an empirical
average over the actions.
And so we have three learning problems
that we're solving in every
iteration of learning updates.
Then we collect more data.
We repeat.
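Here is a minimal sketch of those three losses, using a simple Gaussian policy trained with reparameterized gradients (the actual soft Q-learning work uses Stein variational gradient descent for the policy step, which comes up below); the sizes, the names, and the uniform action sampling for the soft max are all illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 8, 2, 0.99

q_net  = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # Q_theta(s, a)
v_net  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))            # V_psi(s)
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))      # mean of pi_phi(a|s)
log_std = torch.zeros(act_dim, requires_grad=True)

def q_of(s, a):
    return q_net(torch.cat([s, a], dim=-1)).squeeze(-1)

def soft_v_target(s, n_samples=16):
    """Sample-based soft max: V(s) ~= log (1/M) sum_i exp Q(s, a_i) over sampled actions a_i."""
    qs = torch.stack([q_of(s, torch.rand(s.shape[0], act_dim) * 2 - 1) for _ in range(n_samples)])
    return torch.logsumexp(qs, dim=0) - math.log(n_samples)

def soft_q_losses(s, a, r, s_next):
    loss_q = F.mse_loss(q_of(s, a), (r + gamma * v_net(s_next).squeeze(-1)).detach())  # Q <- r + gamma V
    loss_v = F.mse_loss(v_net(s).squeeze(-1), soft_v_target(s).detach())               # V <- soft max of Q
    # Policy: minimize KL(pi_phi(.|s) || exp(Q)/Z) via reparameterized samples; log Z is a constant.
    mean, std = pi_net(s), log_std.exp()
    a_pi = mean + std * torch.randn_like(mean)
    log_pi = (-0.5 * ((a_pi - mean) / std) ** 2 - log_std).sum(-1)  # Gaussian log-density up to a constant
    loss_pi = (log_pi - q_of(s, a_pi)).mean()
    return loss_q, loss_v, loss_pi
```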
Any questions about this?
Yes.
- [Student] Why do we have to do
a log and exponent of Q?
Why can't we just use the Q value itself?
- So question is why do we
have to use the log exp here?
'Cause we're doing Soft Q learning here.
So I'm making two modifications in one go.
We're making it Soft Q-Learning
where you maximize sum
of rewards plus entropy,
and the plus entropy leads
to this being the thing
to soft max rather than the hard max.
And in principle,
there could be a temperature that you insert,
but you can also rescale the
reward, which is equivalent.
And so imagine we rescale the reward
to account for the temperature,
then this is what it will look like,
just log-sum-exp.
- [Aria] Thank you.
- Thanks, Aria.
All right, let's look at some
results from this method.
(mechanical whirring)
So this is the Soft Q-Learning equations
that we saw on the board.
Moving the policy closer uses
something called Stein
Variational Gradient Descent,
and it is more general
than just Q-Learning.
Essentially, whenever you have
a distribution implicitly defined
by an energy function,
and you wanna explicitly sample from it,
'cause when you just
have an energy function,
it's hard to sample
from that distribution,
you can try to find a policy,
or any kind of conditional distribution
you can sample from, by minimizing the KL
between the thing you can actually sample from,
which in this case is the policy,
and the energy-based distribution,
in this case defined through Q functions.
All right, so, lighting is still off.
Here is what we have when
the robot starts out,
random exploration.
The goal is to put the Lego
block, the yellow block,
on top of the red block.
Then, after 12 minutes,
after 30 minutes.
So 30 minutes, actually
getting pretty good.
It's finding the target reasonably well.
And after one hour,
it's able to consistently stack the block.
So it takes just an hour of
training on the real robot,
which is actually pretty fast,
to be able to acquire this skill.
We're at two hours here.
You can actually perturb it,
which is pretty interesting.
And that speaks to the robustness
of the policy it's learned.
And you might wonder, well,
why is this so special?
Well, think about regular Q-Learning.
At the point you've converged,
with that or any non-maximum-entropy version,
you might
have found a very specific path
you try to always follow,
and the only thing your policy has learned
is following that very specific path.
But by using max entropy,
whenever there is freedom
to introduce noise and still succeed,
it's gonna do that during training,
and it gets used to introducing that
noise while staying stable.
And so that means that
then if during testing
extra noise gets introduced externally,
it's actually ready for that
'cause it's explored the entire space
and learned something robust
across the entire space
rather than just one
very specific solution.
Okay, before we switch to DDPG,
any questions about
Q-Learning or Soft Q-Learning?
Keith.
- [Keith] Yeah, does it need to know
which state it's in at all times?
- So a good question.
So what information does the
robot have available here?
So the robot knows its own joint angles.
So for each joint, it
knows the angle it's at.
It also has a reward function.
And often, a lot of trickery comes into it
to make sure you define
the right reward function.
The reward function here
is based on distance
to the target location.
So it knows distance to target.
And so because it knows
distance to target,
the reward is very well-shaped
for the initial learning.
It's gonna very quickly learn
to get closer to the target.
What is not well-shaped,
'cause it's just Euclidean
distance to target,
is once you're close.
'Cause you need to come in from the top
and the reward function
doesn't capture the notion
of coming in from the top.
It's just straight-line
distance to the target.
And so a lot of the learning
that happens in the second half
is all about how to get the block
to approach the right way
so you get onto the target,
'cause if you're coming from
the side or the wrong angle,
you're never gonna get there.
You're gonna get stuck.
- [Student] So what if the
joint slips a little bit?
So the, your sensor is off after an hour?
- That's a good question.
In a real robot, that's very easy
for maybe your sensor measurements
to be wrong at some point,
and there's some drifting.
In that case, there's a couple of things.
One thing is maybe do
vision based learning,
and the camera can compensate for it.
But this one is not using
vision in these experiments.
But there's no reason in principle
you couldn't also include vision.
And the learning would
probably be a lot slower
'cause visual information
is very high dimensional
and takes a long time to
distill the information from,
but in principle, there's no
reason you couldn't.
The other thing you could do
is you could continue learning.
So the canonical usage shown here
is you run Q-Learning for two hours,
and then you say here is
my policy, I'm good to go,
let's just keep using it.
But in principle, there's no reason
you couldn't just let it
run after it's deployed
and continue to improve, which
means that if anything slips
in terms of, let's say, encoder readings,
your encoders are off now, you
can actually keep adapting.
Now, the question then does remain,
is your reward function
measured accurately?
'Cause if your reward is
not measured accurately,
well, then you're learning
against a wrong reward.
So you gotta make sure that,
at least for the reward function,
you got something that's
fairly close to ground truth.
Otherwise, you learn
against the wrong thing.
A simple way to do this, though,
could be to have periodic
human interaction.
'Cause in any of these methods,
we don't need a differentiable
reward function.
It's all just taking reward
that was experienced, a number,
and so a human could watch
it and, say, give it a score.
And like this was this good, this good,
and it could directly learn from that.
So you don't need to necessarily
program the reward function to it
as long as somebody provides
the numbers periodically
to the system.
Now of course, in research,
we tend to program the
reward function into it
'cause otherwise you gotta
sit there all the time
and score the robot, which
is very, very tedious.
But in practice, maybe
that's what you need to do
if you wanna really solve for
the human's reward function
and not for something that might be off.
Now, one thing with Q-Learning is that
while it's perfectly off-policy
in theory, which is nice,
in practice
it's not always gonna learn
as off-policy as you want it to.
Off policy learning tends
to be somewhat unstable
'cause off-policy learning means
that you try to learn something
about state-action pairs
you have barely visited
or have not visited at all
from data that's coming
from visits to other parts
of the state space,
and you can't really expect
to get a lotta signal there.
And so in practice,
when you run Q-Learning,
well, you actually run it
pretty close to on policy
'cause you run the max action
or the soft max action most of the time,
and then maybe sometime some exploration.
Given you're doing that anyway,
you could also look at algorithms
that are more explicitly
closer to on policy,
like DDPG and Soft Actor Critic,
which learn the Q function
for the current policy
rather than trying to directly
learn the optimal Q function,
which you might not have
the data support for.
And so often, it can be a bit more stable
to run DDPG or SAC.
So let's take a look at those algorithms.
Now, at this point,
(mechanical whirring)
SAC and PPO are probably
the two most popular
off the shelf algorithms
for people to run,
and so we'll actually first see DDPG.
But DDPG is, I would say,
kind of a precursor to SAC,
if you think of SAC as the
maximum entropy version of DDPG.
That's not 100% right.
There's a couple of subtle differences,
but essentially one is the hard-max
version and the other the soft-max version.
All right, so, in DDPG, we're
gonna first execute roll outs,
as in most of these.
So we do some roll outs.
These roll outs are gonna
be under the current policy.
These are on policy methods,
so we're largely on policy,
under current policy.
And there'll be some
noise for exploration.
So actually, a lot like Q-learning:
you mostly follow the argmax of the Q
values described over there,
and then every now and then some noise.
Now, the updates we do
here are going to be
to learn a Q function
that is the Q function
of the current policy.
So, we have our objective is
sum over multiple time
steps we've collected.
Q phi, our parameter as Q function,
at S-T after taking action U-T,
should be close to our target,
which here is called Q hat, for S-T U-T.
And then maybe a squared loss.
What is our target here?
Our target here is Q hat S-T U-T
equals the reward we
experienced in that roll out
plus gamma times Q phi S-T plus
one, U-T plus one, for now.
We'll make a slight change to this later.
And then we do a policy update.
So we'll only have DDPG.
For now, I'm writing up
the predecessor, SVG0,
and then I'll tell you
what to change to get DDPG.
Now, to get our actions, again
we're hoping to deal with
continuous action spaces,
so we're gonna have a
parameterized policy again
that extracts the max
actions in parametric form
so we don't have to solve
an optimization problem
every time we try to take an action.
And so we do gradient updates on the policy,
and what's special
about what DDPG is doing
is that rather than
minimizing these KL divergences
that we saw over here,
which is maybe a fairly general thing,
it does something a
little more specialized
that only works under the assumption
that you have essentially
a continuous action space.
So the gradient update says: theta,
well, theta becomes theta plus alpha times
a gradient,
which looks at the Q values.
Remember, Q is parameterized by phi.
This gradient is with
respect to the policy.
It's gonna say I have Q values.
For each state, I can
look at what is my Q value
as a function of what my
policy prescribes me to do.
And so I can directly optimize the policy
by looking at if I change my policy,
how much would I improve my Q values
in the states that I visited so far.
Now of course, you can only do this
if this has continuous action output.
Otherwise, this gradient
with respect to theta,
which has to go through the gradient of Q
with respect to its actions,
is only gonna work with the chain rule
if you have a continuous action space.
Now we said we're gonna have
some noise for exploration
so the policy might have some
Gaussian noise, called V-T,
that goes into it.
And so when you act on a current policy,
you'll sample some noise that
gets fed into the policy,
and your action will be
somewhat noisy as a consequence.
But you remember what noise went in
and reuse that specific noise
as you do your back propagation here.
And so that's essentially the
reparameterization trick,
which we've seen before,
but applied here to extracting
a policy from the Q function.
This is SVG0.
What else can we do beyond what's here?
Well, one thing we can do is
instead of just looking at the
roll outs we currently have,
which SVG0 does, now I'm
gonna transition to DDPG,
we can retain a replay buffer.
So instead of just taking the recent ones,
we put this into a replay buffer.
Once we put it in a replay
buffer, some of the data is old,
and we need to be aware
that this data is old
and it's off policy data.
Can we correct for it?
Well, the way we can
correct for it is to say,
well, actually as we look
at the target Q value here,
we're trying to get the Q
value under the current policy,
but the action U-T plus one
coming from the replay buffer
is an old action from an old policy.
So what can we do?
We don't actually put U-T plus one here.
We replace this with the current
pi theta of S-T plus one,
potentially with some noise
also if you want to do that,
but can also put a zero there.
And so this is what allows you now
to do off policy learning.
You get to reuse your old
data to learn the Q value
for the current policy.
If you run this long
enough over enough data,
it will actually converge
to the Q values of your current policy.
And then this is essentially
the policy extraction,
again, from those Q values.
In the very early lectures
we covered essentially
policy iteration and
the policy improvement update.
This is the policy improvement update
of policy iteration happening right there.
So keep in mind, at this point
the change we just made here
makes it off-policy,
which has good benefits.
'Cause that means that you can use
all your past data, in principle,
to update your Q values.
So let's say you
collected a bunch of data,
you wanna be sample efficient,
you can do a lot of
updates on your old data
to update the Q function,
update the policy, and repeat.
Compare this to PPO,
which would just use the most recent data.
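As a sketch of the two updates just written down, here is the DDPG inner loop with the replay-buffer target, in PyTorch; exploration noise and the target networks people usually add are omitted, and the names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 8, 2, 0.99

q_net  = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))          # Q_phi(s, u)
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())   # pi_theta(s)
q_opt  = torch.optim.Adam(q_net.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=1e-4)

def ddpg_update(s, u, r, s_next, done):
    # Critic target: Q_hat = r + gamma * Q_phi(s', pi_theta(s')).
    # Re-evaluating the next action with the current policy, instead of the logged u_{t+1},
    # is what lets us reuse old replay-buffer data off-policy.
    with torch.no_grad():
        q_next = q_net(torch.cat([s_next, pi_net(s_next)], dim=-1)).squeeze(-1)
        q_hat = r + gamma * (1.0 - done) * q_next
    q_loss = F.mse_loss(q_net(torch.cat([s, u], dim=-1)).squeeze(-1), q_hat)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: ascend Q by pushing the gradient through the action, i.e. the chain rule
    # dQ/dtheta = dQ/du * du/dtheta, which is why this needs a continuous action space.
    pi_loss = -q_net(torch.cat([s, pi_net(s)], dim=-1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```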
Any questions about DDPG?
Yes.
- [Student] If we just
went back to off-policy,
then why do we care about
on-policy roll outs?
- That's a good question.
Why do we roll out in the current policy?
I would say that's the main difference
between the pure Q-Learning approaches
and the Q based policy gradient
slash Q based actor critics.
So I would call these two,
there's Q based policy gradient
or Q based actor critic.
It's kind of not clear what
exactly you'd wanna call it
'cause there's so many components to it,
whereas that's pure Q-Learning.
The reason you might wanna do this
as opposed to direct Q-Learning
is the same reason you might
wanna do policy gradients
rather than Q-Learning.
It's much easier to track
performance of these
because as you do updates,
your policy is supposed to
gradually improve over time
whereas in Q-Learning, essentially
what's supposed to happen
is your Bellman error gets reduced.
But it doesn't necessarily
directly translate
in better performance,
and it's harder to kinda get
this amount of improvement,
harder to debug at times.
Q-Learning tends to be very,
if you collect your data
completely off policy,
you have just a harder time
learning good Q values.
And so this will often work more stably,
with a caveat, 'cause
DDPG itself is not that easy
to run stably.
But once we go to the soft version,
that's maybe the most stable
current algorithm.
If you just wanna use something out of the box,
have to worry about it the least,
and just hope it's gonna
work out of the box,
the soft version of this
is probably gonna be
the one that you wanna use.
And so why is it more stable?
Because it's more directly
optimizing a policy
compared to Q-Learning.
So the data for the Q function
learning is more on policy,
even though in principle,
it's allowed to be a bit off policy,
and you can track progress more directly.
All right, let's take a look
at results from SVG and DDPG.
(mechanical whirring)
So we saw this one on the board.
This is the basic version of
DDPG, which is really SVG0.
Turns out there's other versions of SVG.
The zero refers to the fact
that we only have one reward here,
and then the Q function,
if there's a one there,
you look at reward plus
multiple time steps,
and a k means you have
k time steps of reward.
Once k is not zero, what happens is
you can no longer do off-policy learning.
'Cause if you look at the
sequence of rewards encountered,
well, that sequence of rewards
encountered is on-policy,
and so it often can't be as data-efficient.
The zero version will often
be more data-efficient.
So they tested on a bunch of 2D robotics tasks,
and it actually works quite well.
At the time, these were some
of the most sample-efficient results
for reinforcement learning.
Here's a swimmer going to a target,
and same for the other environment.
So it actually works quite well
with relatively small number of samples.
Now, when you start making it off policy,
it becomes even more efficient.
So DDPG puts the off policy update
into the target value for the Q function.
Here is the full approach
that we wrote on the board,
and it's more sample efficient
than the SVG approaches
'cause it reuses the
data in a off policy way.
Here's DDPG in action on
a couple of environments.
Here is a reacher.
At the time, DDPG was the
first to successfully learn
to control these kinds
of robots from pixels.
Why?
Well, likely because when
you learn from pixels,
you need more signal,
you need to train for a longer time.
If you use an on-policy gradient method,
you need a lot more data.
DDPG, on the other hand, could learn the Q functions,
and then quite efficiently
learn to control these robots
just from pixels.
Let's see, go to the pixel-based version.
Here's based on pixel features,
meaning it's just getting
the images as input
rather than the underlying state.
And again, it's learned
to succeed at the task.
Now, at the time, this
is kind of interesting,
DDPG was super sample efficient,
and if you looked at the paper,
it was like very, very efficient,
but anybody who tried to
reproduce the DDPG results
ran into trouble, and it
actually didn't really work
all that well.
And so it was like great results,
and if you are Tim
Lillicrap from DeepMind,
it was like working amazing,
and then pretty much everybody else
would not have the same amazing results.
And so, it's kind
of nice to have an algorithm
that's super-efficient, but
if it's not very robust,
and if it's hard to reproduce the results
at the level that you hope for,
then maybe you need
something a little different
to make it more stable.
And that's really where
Soft Actor Critic comes in.
So amazing results, but very hard
to get a stable implementation.
And also across random seeds,
some random seed initializations would do
really, really well,
and others wouldn't do so well.
And so the question is
how do we stabilize this.
That's where Soft Actor Critic comes in.
The key difference is
that now in the objective
we add an entropy term.
What does the entropy term do?
Well, what it does is
encourages exploration.
It also makes the policies more robust,
'cause as you optimize
against the maximum entropy objective,
you will try to find the policies
that can see a lot of variation
and deal with that variation.
In fact, they'll introduce
their own variation
during execution, their own random noise,
to introduce more variation,
and make sure they're robust through it.
So, let's take a look
at Soft Actor Critic,
which is one of the most
widely used algorithms today,
and which achieves similar or better
sample efficiency than DDPG.
But keep in mind, in many
ways it's the same algorithm.
It's just the maximum entropy version.
(mechanical whirring)
So, SAC, what's our objective?
We try to find the
policy, pi phi,
that maximizes this objective,
a sum over times in the roll out.
Then we have an expected value
of states and actions we encounter
that depends on essentially
the distribution you encounter
due to our policy, and
we encounter rewards,
and we also evaluate the entropy
of our policy, pi phi, action given state.
So everything else being equal,
you'd rather randomize over your actions.
But of course, if to get a high reward,
you just choose a very specific
action instead of state,
you'll still want to do it
because of the reward term
in the objective.
What does the algorithm look like?
We'll iterate roll outs from pi phi
then to go into our replay buffer.
And then, just like in Soft Q-Learning,
we're gonna be learning three functions.
We're gonna learn a policy, a value function,
and a Q function.
So, what's the cost function
for the value function?
We have expected value
with states coming from our replay buffer.
Then we have the value in our state, S-T,
should be close to the expected value
if you take actions according
to our policy, pi phi,
of, well, what is the value?
It's really the value
under our current policy.
Again, evaluating the
value of the current policy
is really the Q function averaged
over actions taken in the state
according to the policy.
And so far, this would just
be a regular value function,
but we have a max entropy objective.
So in addition here, we
have minus log pi phi
action given state.
So a state has higher
value if from that state,
you tend to take a high-entropy
distribution over actions.
This, then we try to minimize over psi.
So this is the thing we're
minimizing over here,
choice of psi.
How about our target Q values?
Our target Q value is Q hat S-T comma A-T
equals the reward we encountered
plus gamma times expected
value of S-T plus one
of V psi S-T plus one.
Okay?
That's our target Q value.
In practice, for that expected value,
we might not have anything available
except for just
the next data point,
s_{t+1}, and we
just have to use that.
That's all that's available to you.
The action, by the way,
is fixed here.
It comes from our replay
buffer, which only has the one
sample of s_{t+1}.
If you have a dynamics
model by any chance,
you could resample from that
and augment your data pool.
But if you don't have a dynamics model,
you would just have your S-T plus one
from your replay buffer.
And then the policy would be optimized
by looking at the KL divergence
between pi phi at S-T
coming from the replay buffer
relative to the exponentiated Q values,
Q theta S-T, over some
normalizing constant.
But again, it's not gonna matter.
This is constant.
And so we have the same objective
as we had in Soft
Q-Learning here, same thing.
In fact, they look very, very similar.
If you look at what's
different between the two,
well, it's pretty much the same.
We're learning three entities:
policy, value function, Q function.
We learn them in a soft way.
The difference is really in the fact
when we learn our Q function,
we learn it on policy.
So we learn a on policy Q function here.
As a consequence, the
value function we learn
is also on policy.
And so other than that, extremely similar.
But Soft Actor Critic
learns the on policy value
and Q function.
That one tries to learn the
optimal Q function directly,
but it's often much harder to stabilize
and make sure it works well.
And then you can play
similar tricks, of course,
as I talked about with the original DQN.
You can say, oh, I should do Double DQN,
which means essentially you'd
keep two Q functions around,
or two value functions,
to stabilize things.
Then you can do Polyak averaging
to keep an old one around to regress onto,
rather than having this
very tight feedback loop
where you might have
something destabilize.
'Cause your Q targets here
come from the value,
and in the same way the value targets come from Q,
so if you're not careful,
this could destabilize.
So you keep kind of a running
average over past Q values
and V values to stabilize this thing.
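As a sketch, here are the three losses just described, with a reparameterized Gaussian policy and a Polyak-averaged value target; the temperature alpha and all names are illustrative, and details like squashing the action through a tanh are left out.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma, alpha = 8, 2, 0.99, 0.2   # alpha is the entropy temperature

q_net   = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # Q_theta
v_net   = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))            # V_psi
v_tgt   = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))            # slow copy of V_psi
v_tgt.load_state_dict(v_net.state_dict())
pi_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
log_std = torch.zeros(act_dim, requires_grad=True)

def sample_action(s):
    """Reparameterized Gaussian policy; returns an action and its log-probability."""
    mean, std = pi_mean(s), log_std.exp()
    a = mean + std * torch.randn_like(mean)
    log_pi = (-0.5 * ((a - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
    return a, log_pi

def sac_losses(s, a, r, s_next):
    q = lambda s_, a_: q_net(torch.cat([s_, a_], dim=-1)).squeeze(-1)
    # Q target: r + gamma * V_target(s'), using the single s' we have from the buffer.
    loss_q = F.mse_loss(q(s, a), (r + gamma * v_tgt(s_next).squeeze(-1)).detach())
    # V target: E_{a ~ pi}[Q(s, a) - alpha * log pi(a|s)], the entropy-augmented value.
    a_pi, log_pi = sample_action(s)
    loss_v = F.mse_loss(v_net(s).squeeze(-1), (q(s, a_pi) - alpha * log_pi).detach())
    # Policy: E[alpha * log pi - Q], i.e. the KL to exp(Q/alpha) up to scale and a constant.
    loss_pi = (alpha * log_pi - q(s, a_pi)).mean()
    return loss_q, loss_v, loss_pi

# After each update, Polyak-average each value-target parameter toward the current V parameters.
```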
Any questions about SAC?
Olivia.
- [Olivia] Is having that entropy bonus
always a good idea, or is there any case
where you wouldn't want to use it?
- That's a good question.
'Cause in some sense, some algorithms,
like if you look at the original PPO,
it does not necessarily
have the entropy bonus,
but you can add it.
It's not that hard to add it in,
and you get, typically, you get
better exploratory behavior.
So I would say, generally
speaking, it's a good thing to do.
It adds another hyperparameter,
which is the temperature
in front of the entropy,
and you might have to
anneal that over time
in some kinda clever way.
And then you might say, well,
if that's what we're doing anyway,
maybe I should just directly have
a temperature hyperparameter
in my policy,
or introduce random actions in the mix
if it's an off-policy method,
and you're in some sense
choosing something similar.
But the beauty of having entropy is
that it's a little more
attuned to the state.
'Cause when you have entropy
introducing the randomness,
you will in some states
start decreasing randomness pretty quickly
'cause those states are critical
for you to take the right action,
whereas in other states
it's much more open-ended
what you get to do, and so it'll
have more randomness there.
So sometimes it seems
a cleaner way of doing it,
but it doesn't necessarily always help,
because you still have a
temperature hyperparameter,
where otherwise you'd just have
a kind of hyperparameter
on how much randomness you
might introduce in your policy.
Keith.
- [Keith] So is it possible
to come up with a policy
for these Atari games?
Or do these
things not really work
on those?
- Oh, I see.
- [Keith] Or what, how
does the policy look
on an Atari game?
Like how would that work?
- So DQN, of course, became
famous around the Atari games.
And looking at it kinda historically
after the Atari game results:
first there was DQN in 2013.
It worked well in the Atari games.
It didn't work well on
continuous control problems
because of the issues we
kind of resolved here,
which is you need continuous action,
and continuous actions are not compatible
with the original DQN architecture.
And parallel, the policy
gradient methods got developed,
and actually people found
them easier to get to work.
TRPO, PPO, more stable 'cause
more of a direct optimization.
And so those worked well,
both from continuous control
and for the Atari games.
And so all of these work just
fine for the Atari games,
no problem.
You can, well, not all of them.
DDPG cannot deal
with discrete actions.
It requires continuous actions.
So DDPG is not particularly
compatible with Atari games.
So once you do this KL divergence
way of updating the policy
rather than the gradient
going through the Q function,
you can, again, deal
with discrete actions.
And so Soft Actor Critic
and Soft Q can be applied
to discrete action spaces no problem.
And I've had people run Soft Actor Critic
on the Atari games at times,
and it'll work just fine.
It really depends a lot on
what you're trying to do.
So, if you're, I would say,
if you look at the spectrum of algorithms,
there is policy gradient methods,
with PPO probably being
the most used one today,
there is these Q based policy gradient
slash Actor Critic methods,
and there is more Q learning,
direct Q learning based methods.
And as you go from policy
gradient to Q learning,
it all gets a little harder to get to work
'cause you go more and more off-policy.
But also, you can often
get more data efficient.
So say you're trying to generate a plot
where the horizontal axis is
the number of samples collected
in your environment
and the vertical axis is performance.
Then Q-learning is often
gonna give you the best plot
because you're reusing
your samples maximally.
But it's sometimes hard to get to work,
and you actually spend a lotta compute
on all your samples.
On the other hand, if
your horizontal axis is
wall clock time, vertical
axis still performance,
and if your samples are cheap to collect,
then often PPO will be
your method of choice
'cause it's more stable to run,
and it always uses the latest samples,
and the latest are more
informative than the old ones.
So you might as well use the latest
if they're cheap to collect.
And Soft Actor Critic is
kind of a happy medium.
And that happy medium, I would say,
makes it one of the most
popular methods today.
If you look at most
papers coming out today,
almost every one compares
to Soft Actor Critic
whenever they come up with a new idea
to see how it fares against that
or have an augmentation
of Soft Actor Critic
to see if their augmentation
makes it maybe better
than just a simpler like original version.
All right, let's look at some
Soft Actor Critic results and videos.
So, we put this on the board.
Oh, the screen is not down.
(mechanical whirring)
We showed this on the board.
We saw the kind of cleaned-up version.
Let's look at some learning curves.
So, here we have exactly the
phenomenon I was talking about.
We have Soft Actor Critic learning
in a very sample efficient
way shown in yellow.
We have DDPG, which we know
is essentially the same algorithm,
just the non-soft version,
and probably if somebody tunes
it really, really right,
they can get that green
curve somewhere close
to the Soft Actor Critic one,
but it's just harder to
get it really tuned right.
And if you just run it out of the box,
or with some limited tuning,
here is what we end up with:
it doesn't learn anything
on the humanoid problem.
Then PPO is shown over here.
Well, it's less sample efficient.
Why?
Because it doesn't reuse
the data in an off-policy way
the way Soft Actor Critic does.
And so you expect it to be
a bit less sample efficient.
But ultimately, they all get
to the same level of performance.
Now, one of the interesting
things is Tuomas Haarnoja,
a PhD student at Berkeley who
came up with this method,
at the time, the way he
would showcase the work,
he would essentially say,
I run all the other model-free methods,
and I show you the best
performance, the best policy.
I run, let's say, five random seeds,
and of the five random
seeds, I picked the best one
and show you the result of that one.
So those algorithms are somewhat unstable.
Some of these random seeds
will find no good results,
but the best one is
often still pretty decent.
Then he would show the worst
one for Soft Actor Critic,
and show that it is actually
as good or better
than the best one from
the other algorithms,
showcasing how essentially
robust and stable
the Soft Actor Critic approach is
relative to most RL approaches.
And so here you see the learning curves
on a wide range of problems.
And again, if you look carefully
at the learning curves,
you'll see that the variance
on these Soft Actor Critic
curves is pretty small
while being well above everybody else.
And some other methods in some of the runs
have really high variance,
meaning across different runs
there was a very different
level of performance achieved.
Here is Soft Actor Critic
policies in action.
So this is hopper, found a
really good, robust strategy.
Two-legged walker.
Cheetah, and it really aced cheetah.
(laughter)
Some of the very, very best
results ever on cheetah.
(laughter)
And pretty interesting strategy here
where it mostly uses kind of two legs
for propelling itself forward
and the other two just keep it from
falling forward or backwards,
which is not exactly
how you imagine running
for this robot, but it's
optimizing some reward
that's velocity off to the right,
and this just happens to
be the way it optimizes it.
Humanoid.
There's nothing in their reward function
about how to hold your
arms as you can see.
(laughter)
So it has found an interesting
way to hold its arms.
Here, found a different thing.
(laughter)
That actually was tested
on some real robots.
So here's a four-legged real robot
and Soft Actor Critic
reliably working on that,
even when there is perturbation
to the environment.
And again, that's kind of
illustrating some of the benefits
you often get from a
maximum entropy approach:
you're more robust to perturbation
because of how you trained.
You've been perturbing yourself,
and so you're used to being perturbed.
It didn't know that obstacle was there.
It's just kind of
trying to get through.
It doesn't have a camera.
It's just doing some regular locomotion
with perturbation forces
caused by those blocks.
Okay?
Then, turning a valve with a robot hand.
The bottom right is what the robot sees.
So lower resolution
version, not what we see,
but somewhat the same view.
And then Lego block stacking
again, and also mastered that.
So quick recap of the storyline.
TRPO and PPO use importance sampling
on the surrogate loss
to be able to do more
than just a single gradient step.
But they cannot reuse samples as
much as if you start looking
at the dynamic programming formulation,
where you look at one-step transitions
rather than full roll-outs.
By leveraging those
one-step transitions,
you can go off-policy.
Q learning does that in
the most extreme way,
often not as stable as a
policy gradient method.
Soft Actor Critic is kinda
the happy intermediate
in many ways.
It looks at the one step transitions
to allow for off policy learning,
but then still updates the policy
and learns largely on policy
in terms of data collection
and the data it's then
using in the replay buffer
to do stable learning,
yet more sample efficient
than if you don't look at
the one step transitions
and don't get that off policy benefit.
Something very interesting
we wrote up two years ago
is that there's actually
a very close equivalence
between policy gradients
and Soft Q-learning.
So you might think policy
gradients and Soft-Q learning
are so different.
But actually if you look at
the details of Soft Q-Learning,
if you assume that in Soft
Q-Learning you collect your data
according to the Boltzmann policy,
which is the exponentiated Q function
probability distribution,
if you collect your data
according to that policy,
well, then effectively you're following
a very specific policy.
You're following a policy and
you're updating a Q function,
which in turn updates the policy
through the KL divergence.
And if you look at the details of that,
it actually starts
looking, as a whole, a lot
like a policy gradient update.
And so in practice, even
though we all talk about
Q-Learning as being so different,
in practice a lot of people
use the Boltzmann softmax policy
for their data collection,
and at that point,
it brings Q-learning a lot
closer to the soft methods here,
as well as policy gradient methods.
All right, let's take
just a one minute break
'cause I wanna cover a bit of
model-based RL still today.
So a one minute break and then
we'll start model-based RL.
(students conversing)
All right, let's restart.
Model-based RL.
What are we gonna cover?
We're gonna cover the basics,
then ensemble methods,
we're gonna look at the
Asynchronous Model-based RL
and Vision-based Model-based RL.
The Asynchronous and
Vision-based will for sure
be for next week.
So, quick reminder.
In RL, what do we do?
We have an agent
interacting with the world,
collecting experiences,
hoping to maximize reward.
What's Model-based RL?
You collect data under the current policy,
learn the dynamics model from
all data collected so far,
and then you effectively run
RL inside the dynamics model
'cause you now have a simulator.
And so you can run the
RL in the simulator,
which might make you more data efficient
'cause instead of using your samples
to directly update your
Q function or policy,
use your samples to build a simulator,
and then you can do all the
learning in the simulator.
Now, it might be that the
policy you find this way
is not the best yet because
your simulator isn't perfect.
You run that policy in the
real world, get new data.
Where that policy thought the
good stuff was going to be,
if it's right, you're done.
You collect your high rewards, that's it.
In practice, often you're not done.
You see that it's not that
great, but you collect data.
That data will always improve your model.
Now in the new model, you run
reinforcement learning again.
That new model will know
that the past policy
was not so great, 'cause
the data now supports it
not being so great, and
it'll find a new policy
it'll try next time in the
real world, and repeat.
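In pseudo-code, that outer loop looks roughly like this; collect_rollouts, fit_dynamics_model, and run_rl_in_model are hypothetical placeholder names for the three steps just described.

```python
def model_based_rl(env, policy, n_iterations=20):
    """Outer loop of model-based RL: collect, fit a model, optimize in the model, repeat."""
    dataset = []
    for _ in range(n_iterations):
        # 1. Collect real-world data under the current policy.
        dataset += collect_rollouts(env, policy)
        # 2. Learn a dynamics model s_{t+1} ~ f(s_t, a_t) from all data collected so far.
        model = fit_dynamics_model(dataset)
        # 3. Run RL (for example a policy gradient method) inside the learned simulator.
        policy = run_rl_in_model(model, policy)
    return policy
```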
Why would you do things this way?
You would anticipate
better data efficiency
because you get a model out of data
rather than directly updating
policies or Q functions,
and you learn a model, which could be nice
'cause now if your
reward function changes,
you could use that same model
to learn against a new reward function,
and have a new policy against
a new reward function.
So again, here is the algorithm.
Benefits, better sample
efficiency hopefully.
So why is it not used all the time?
If we have a method that's
more sample efficient,
and a lotta people love to plot
horizontal axis sample complexity,
vertical axis performance,
why wouldn't you do this all the time?
Well, often there are instabilities,
and we'll look at that more in a moment,
and fixes to it.
And often, it does not achieve
the same level of performance
as model-free methods.
And so that's a problem
'cause now you say,
oh, I'm very sample efficient,
but I saturate at a lower
level of performance.
You might not be happy with the outcome,
and say, well yeah, sample
efficient, but bad outcome.
What's the point?
I'd rather take more samples
and get to a better outcome.
So we'll see fixes to both of those.
And both will rely on learning
ensembles of dynamics models
rather than learning a
single dynamics model.
So, what's under the hood?
Under the hood, there's
overfitting that's happening
in Model-based RL.
Now there's two types of
overfitting that can happen.
First type of overfitting is
Model-based RL learns a model
from state in action to next state.
That's a supervised learning problem.
When you use supervised learning,
you could overfit your training data,
memorize the training data,
do poorly on test data.
But we know how to avoid that.
So we should just avoid that.
That's not the real challenge
specific to Model-based RL.
Just do that right.
The real challenge is that your policy,
when you optimize that in your simulator,
tends to find regions in simulation
where your policy looks like it's getting
real good performance, but
that part of the simulator
is not reflective of reality.
So it's your policy
optimization overfitting
to the quirks of the learned simulator
and taking advantage
of some of those quirks
to get really good performance simulation,
but bad performance in the real world.
Now if you think about it,
what can you do about this?
We think about a mismatch
between real world
and learned simulator.
How do we measure that mismatch?
Well, we could train an ensemble.
If we train an ensemble of models,
if the models in the ensemble agree,
it means the data supports
what they're saying.
If the models in the ensemble disagree,
it means the data does not
support what's happening
in that part of the space,
and you maybe should not
have your policy go there
because it's a part of the space
where you cannot trust
your learned simulators.
So that's exactly what
model-ensemble trust-region
policy optimization does.
There's a lot of detail here,
but really what it comes down to is
instead of learning one model,
learning an ensemble of models,
and when you train,
you keep track of
agreement or disagreement
between those models, and
you have in your objective
something that keeps away from
where these models disagree.
You don't wanna have
policy gradient updates
based on the parts of the space
where the models disagree.
So it's very simple.
You collect your data,
do policy gradient
updates in the simulator,
but when the models disagree, you downweight
or get rid of that data,
'cause it's not the
signal that you wanna use.
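A toy sketch of the disagreement signal; the model.predict interface, the transition fields, and the thresholding are hypothetical, and the actual method differs in the details.

```python
import numpy as np

def ensemble_disagreement(models, s, a):
    """Spread of next-state predictions across the ensemble; large spread means the data
    does not support this part of the space and the learned simulators can't be trusted there."""
    preds = np.stack([m.predict(s, a) for m in models])   # hypothetical model.predict(s, a)
    return preds.std(axis=0).mean()

def filter_simulated_data(transitions, models, threshold):
    """Drop (or, in a fuller version, downweight) simulated transitions where the models disagree."""
    return [t for t in transitions
            if ensemble_disagreement(models, t.state, t.action) < threshold]
```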
Okay, so we can in
simulation run TRPO or PPO.
This paper happened to run TRPO.
And here are the learning
curves for these environments.
So we have in green the proposed method.
In other colors are PPO and TRPO,
standard model-free methods.
And then in purple, DDPG and SVG infinity,
which is another kind of
older model-based method.
And we see here that the
approach they described
does a lot better in terms
of learning a lot faster,
logarithmic scale on the horizontal axis.
So being a little bit sooner up
is often a factor 10 or 100 faster.
And we see that it does
actually pretty well
in most of these environments,
though in some of them, like cheetah,
it doesn't really get to the
same level of performance
as the model-free methods.
So, we could also ablate this.
You can do TRPO in a simulator
or you can do backpropagation
through time in the simulator,
different ways of
optimizing your objective.
We see here that running TRPO
or generally a policy gradient method
rather than a backprop-through-time method
leads to more robust
optimization of your policy.
Why might that be?
I think that's still a
bit of an open problem.
One reason could be that
the policy gradient methods
introduce noise,
and so you get a more
robust optimization process,
like the maximum entropy thing.
Maybe backprop through
time with max entropy
will get something similar.
Who knows?
Or it could be that backprop through time
has exploding or vanishing gradient issues,
so you don't get the
signal that you really want
whereas policy gradient
methods with likelihood ratio
don't have that issue.
Here's another ablation.
It's looking at the number
of models in the ensemble.
If you have one model in the ensemble,
that's shown in green, it
doesn't do nearly as well.
So having this ensemble
and staying away from where
the ensemble doesn't agree
is really critical, 'cause
with just one model,
you can't do this, and the
performance is a lot less.
Do you need 20 models, the yellow curve?
Not really.
If you have 10 models, it's enough.
In terms of computational efficiency,
you might just wanna work with 10 models,
not more than that for at least snake,
half cheetah, and ant.
Of course, this is gonna
depend on the complexity
of your environment.
Very complex environments
might require a larger ensemble
to understand the kind of intricacies
of where you already learned
enough about the environment
versus where you don't
have enough support yet.
So, we get this to work very stably,
but actually if you look carefully
at some of these problems,
it does not get you
to the same level of performance
as model-free methods.
So what to do about this?
Well, the reason this happens
is 'cause the learned
ensemble of models is imperfect.
Even as an ensemble,
it doesn't perfectly match the real world.
There's actually two complications there.
One is it doesn't match the real world.
Two, you're trying to find a policy
that works across every
member of the ensemble,
and that's a robust policy.
Robust policies are great
'cause they're robust,
they can deal with variation,
but if there's a lot of variation,
you'll end up with a
very conservative policy
that can also not leverage the specifics
of your environment
that you're actually in
'cause it's trying to find a
solution that works everywhere.
So one thing is you say, well,
just learn better dynamics models,
but people have not
really succeeded at that.
Second fix is to do something
called meta-learning.
In meta-learning, the idea is
that instead of learning a single policy
that is good across all
members of the ensemble,
you're gonna learn an adaptive policy
that, whenever dropped into
any of the ensemble members,
gets to quickly adapt to the
specifics of that ensemble member
and be specialized to
whatever environment it's in.
And then the hope is
that when it's dropped
in the real world, it'll also
specialize to the real world.
And so you learn to be an adaptive policy.
So it's gonna be a recurrent
neural network now
rather than a feed-forward network.
And this recurrent neural
network will take in
state, action, reward, state over time
and will be trained
such that when deployed
in a new environment, it can quickly adapt
to however that environment
tends to behave.
What does it look like as a whole?
We still iterate in some outer
loop over data collection
with our current adaptive policies.
Don't worry about the details too much.
Just assume for now this is one policy.
Our current policy collects data.
Then we learn ensemble of K simulators
from all past data.
We run meta-policy
optimization over the ensemble,
learn a recurrent neural network
that no matter which
member of the ensemble
it gets dropped into, can
achieve high performance
thanks to the ability to
adapt to the specific member
of the ensemble it's dropped into.
This gives you a meta-policy
that is a recurrent neural network
that is ready to adapt to
hopefully the real world.
Now, in reinforcement learning,
exploration really matters.
So we do actually a slight extra thing.
Rather than dropping this
meta-policy in the real world
hoping it will adapt
quickly enough and be done,
we know that likely we need to
collect more exploratory data
before our simulators are gonna support
ending up with the right policy.
So we actually do an adaptation
step to each of the models
in the ensemble, which will now
lead to a range of policies.
And that range of policies
gets tried out in the real world.
That's the adaptive policy
shown at the top, and we repeat.
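In pseudo-code, the loop looks roughly like this; every helper name here (collect_rollouts, fit_dynamics_ensemble, meta_policy_optimization, adapt) is a hypothetical placeholder for the steps described above.

```python
def mb_mpo(env, meta_policy, n_models=10, n_iterations=20):
    dataset, adapted = [], [meta_policy]           # start by collecting with the initial policy
    for _ in range(n_iterations):
        for pol in adapted:                        # 1. real-world rollouts with the adaptive policies
            dataset += collect_rollouts(env, pol)
        models = fit_dynamics_ensemble(dataset, n_models)             # 2. ensemble of learned simulators
        meta_policy = meta_policy_optimization(models, meta_policy)   # 3. recurrent policy that adapts fast
        adapted = [adapt(meta_policy, m) for m in models]             # 4. one adapted policy per model
    return meta_policy
```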
If you wanna see this in more detail,
you'll have the slides, you
can stare at this later.
That's what I just showed,
but laid out in pseudo-code.
So, we tested this on the standard
OpenAI Gym environments and
here's what it looks like.
On the left, we have MB-MPO.
That's model-based RL with
meta-policy optimization;
on the right, PPO.
We see PPO is still exploring.
It's really not learned anything yet
about to control this cheetah robot
whereas MB-MPO after about 45 minutes
of real world interaction time
has mastered running the
cheetah off to the right.
And so you see in about 45 minutes,
we can learn to control this robot.
Now, one quick caveat here.
Of course, if all we care about
is controlling the cheetah robot,
why would we learn a model?
We already have a model,
MuJoCo gives us the model.
This is just a research experiment.
We say, assuming we don't
have unlimited access
to the MuJoCo simulator, we wanna minimize
how many samples we collect
from the MuJoCo simulator,
we run model-based RL, learn simulators,
neural net simulators, and
see if we can through that
more efficiently learn to
control the cheetah robot.
Here, same thing for the
two-d walker and for ant.
Again, in about 45 minutes of interaction
with the MuJoCo simulator,
it's possible for the meta-policy
optimization approach,
model-based meta-policy optimization
to learn a policy that can
adapt to the actual simulator
and be successful there
whereas PPO, of course,
in 45 minutes or an hour will not yet know
how to control these robots.
More quantitatively, these
are the learning curves.
In red is the model-based
meta-policy optimization.
And in the other colors are
different model-free methods.
ACKTR, which is a natural gradient version
of policy gradient, TRPO, PPO, DDPG.
What are the main things
we wanna look at here?
One is that the model-based
method shoots up more quickly.
So it's more data efficient.
But that's always been true
for model-based methods.
What's so special about
this specific result
is that it's the first
model-based approach
that achieves the same asymptotic
performance as model-free.
So we're not compromising on
the asymptotic performance,
we're getting the same
asymptotic performance,
yet a lot more sample efficient.
We can also compare with
what I just covered,
the model-ensemble trust-region
policy optimization
where we learn a single
feed forward network
that is robust across all
learned models in the ensemble,
which is gonna learn something
a little more conservative,
and that's why it's gonna saturate lower
and actually learn a little slower.
So here, previous state of the art
model ensemble TRPO in blue,
and then the approach learning an
adaptive policy, MB-MPO, in red.
And in most environments it
makes a significant difference
in terms of performance
you end up achieving.
You can also run this on real robots.
So here is a PR2 robot
learning to stack the block.
On the left, with model-based
meta-policy optimization.
On the right, with PPO.
And in about 10 minutes
of real world interaction,
it's learned the policy
that can reliably stack
the red block onto the yellow block there.
So 10 minutes, that's it.
So very, very fast learning
compared to really any past work
except for some very
highly specialized work
for this kind of setting.
So, are we done?
Not really.
The reason we're not done,
even though that video
when you watch it looks very appealing,
actually if you go watch
the PR2 run the algorithm,
most of the time it's doing nothing.
It'll just collect data for 30 seconds
and do a massive amount of compute,
and that's what's happening
behind the scenes.
If you look at these
learning curves, it's okay,
horizontal axis is just
the data collected,
but in model-based methods,
the amount of compute you do
behind the scenes means
that your wall clock time
is actually pretty long often
before you get to good performance
because, well, that's what you're doing.
You're trading off: you want fewer samples
in exchange for
just a lot more compute.
So, question is can we make
this closer to real time?
The results achieved in
model-based reinforcement learning
so far tend to be limited
to shorter horizons
than model-free methods
tend to succeed at.
That's because when
you learn a simulator,
the simulator accumulates error over time.
And so a simulator might not be as precise
over long horizons, making
it hard to learn things
about long horizon
problems you need to solve.
Then, what I just showed
you was from state.
But maybe you wanna do things from images.
So, for the limit to short horizons,
I don't know of any good fixes yet,
but next lecture we'll look at
how to make it real time,
and how to do it from images
rather than from state.
So, that's it for today.
Right now, we have some cool demos.
Keith, do you wanna?
I think we can stop the
recording for the lecture
in the back.
Can Sherry hear me?
Stop the recording of the lecture.
And then we can do some of the demos.
