So let's get started. First, some logistics. Homework 3 is due tonight, and that's the last homework assignment; beyond that there are just your projects and the project milestone. The only milestone is due next Wednesday, and after that you'll just have the poster session and the final presentation or the final report. We'll be sending out feedback on your project proposals within the next few days.
Great. So let's go through the plan for today. Today, we're going to be talking about model-based reinforcement learning, how it can be used for multitask learning and meta-learning, and also how it contrasts with model-free learning, which is the kind of reinforcement learning that we've been talking about thus far in the course. We'll also be talking about how we can extend model-based reinforcement learning to image observations or other high-dimensional inputs. This is a very challenging use case for model-based reinforcement learning, so we'll be covering it in more detail. Then, we'll also talk about model-based meta-reinforcement learning and the settings where it might be applicable. By the end of this lecture, some of the things you'll hopefully have learned are how to understand, use, and implement model-based reinforcement learning methods, the challenges and strategies for model-based reinforcement learning with high-dimensional inputs, and also how this relates to multitask learning and meta-learning.
Okay. So first, let's talk about reinforcement learning algorithms from a broader view. In previous lectures, we showed this diagram where we looked at a reinforcement learning algorithm as iterating between generating samples, fitting a model or estimating the return from those samples, and then using that to improve the policy. We talked about Q-learning-based methods and policy-gradient-based methods, which correspond to estimating a return or fitting a Q-function. In contrast, model-based approaches try to fit a model of the dynamics. Each of these approaches then improves the policy either by applying the policy gradient, by taking a max over your Q-function, or, in the model-based case, by optimizing actions or optimizing the parameters of your policy using your model. In previous lectures, we focused on model-free methods like policy gradients and Q-learning. In this lecture, we'll be focusing on what are known as model-based methods. They're known as model-based methods because you're trying to fit this model of the dynamics, what's known as the dynamics model.
Okay. So the main idea of model-based reinforcement learning is to learn a model of the environment. You might wonder: why do we want to do this? The previous reinforcement learning methods seem to work pretty well too. Or maybe they don't, depending on what you found in your project or in your homework. There are two main reasons, I think, at least from what I've seen in my own experiments. First, model-based reinforcement learning tends to lead to better sample efficiency. If you care about learning without a lot of interaction with the environment, fitting a model of the environment and then using that model to optimize your policy can reduce the amount of data that you need from the environment. This isn't true in all cases, but it has at least been empirically true in a number of different works. Second, the model can be reused for different tasks and different objectives, and we'll talk a bit about what that means later in the lecture.
And so, at a high level, what these algorithms are trying to do is estimate a model of the dynamics. This corresponds to a supervised learning problem where you want to maximize the likelihood of the next state given the current state and the current action, for all of the transitions in your buffer of data. For example, if you have continuous states and you want to model the likelihood using a Gaussian, you could use the following optimization problem, where you minimize the squared error between the predictions of your model and the observed next states: min over f of the sum over transitions (s, a, s') in your buffer of ||f(s, a) - s'||^2. This would be an example of a deterministic model. You could also imagine using probabilistic or stochastic models that actually model the full distribution over next states. So there are different model classes that you can use, and different ways you'd go about maximizing that likelihood, but it typically just amounts to a supervised learning problem.
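To make that concrete, here's a minimal sketch (not from the lecture) of fitting a deterministic dynamics model by minimizing squared error; the names `DynamicsModel` and `fit_dynamics` are illustrative, not from any particular paper:

```python
# Minimal sketch: fit a deterministic dynamics model f(s, a) -> s'
# by minimizing squared error over a buffer of transitions.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        # Predicting the state delta (s' - s) is a common variant;
        # here we predict s' directly for simplicity.
        return self.net(torch.cat([s, a], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Supervised learning: min_f sum ||f(s, a) - s'||^2 over the buffer."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```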
Then, once we have our model, we use it to improve our policy, and I'll talk a bit about the different ways that we can do that in a minute. And then you can use that policy, or use the actions that you optimized, to generate more samples, and repeat this process.
Okay.
So now, what does this have to do with multitask learning and meta-learning? Let's go back to our notion of what a reinforcement learning task is. In particular, a reinforcement learning task is basically an MDP, where different tasks may have different state spaces, different action spaces, different initial state distributions, dynamics, and rewards. Essentially, these tasks correspond to MDPs. One key observation is that in many practical scenarios that we might care about in multitask reinforcement learning and meta-reinforcement learning, the dynamics don't actually vary across tasks: there's basically one single dynamics model that governs the world that your agent is living in. If this is true, then we may be able to exploit that property.
For example, in the real world, if your agent is manipulating objects, or walking around on the ground, or navigating in an environment, the underlying dynamics of the world, whether for doing different things with different objects or for getting somewhere in the environment through locomotion or navigation, aren't necessarily changing. Of course, that's when the environment is fully observed. If you can't fully observe physical information about the objects, or about locomotion, then there may be some variation across tasks or across objects.
Another example of this is character animation. If you want to animate a character to do things like spin kicks, or cartwheels, or running, or backflips, the agent lives in a single world with consistent dynamics, and what varies is just the reward function, not the dynamics. Likewise, if you have an agent that wants to converse and accomplish a certain task through dialogue, such as helping you order dinner, or helping you make a car reservation, the underlying dynamics of interacting with that person may be the same, but the reward function, what you want to accomplish, varies.
So those are a few examples of where the dynamics might be consistent across tasks, and in all of these cases, estimating the model is a single-task problem: if there's just a single model, we only need to estimate a single function. As a result, this learning problem may actually be easier than some of the multitask model-free methods, because we only have to solve the single-task learning problem, and once we've solved it, we can use the model to find policies that optimize the different tasks. Any questions on this?
All right. So how do we actually go about using our model to optimize for actions? We want to optimize actions using the model, and our objective might be to maximize our reward summed over time. One way we might think about doing this is to use a computation graph where actions are passed into our model, the model predicts the next state and estimates the reward, and the predicted next state is fed forward to produce the next action and the next reward. If we want to optimize over the sequence of actions that maximizes our reward, we could imagine backpropagating the signal from the reward into our actions through this computation graph. For example, we could use gradient-based optimization over the actions. What this might look like is: run some policy, for example a random policy, to collect some data; fit a model to that data to minimize the model's prediction error; and then backpropagate through that model to optimize a sequence of actions. Once you have those actions, you can just execute them to accomplish the task that you'd like to perform. So this is pretty straightforward.
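Here's a hedged sketch of that gradient-based planner, assuming a differentiable `model` and `reward_fn` like the ones above (both names are illustrative); note that gradients through long rollouts can be ill-conditioned, which motivates the sampling-based alternative discussed next:

```python
# Sketch: backpropagate the (negative) sum of predicted rewards through
# a learned, differentiable model into the action sequence itself.
import torch

def plan_by_backprop(model, reward_fn, s0, horizon, action_dim,
                     iters=100, lr=0.1):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s = s0
        total_reward = 0.0
        for t in range(horizon):
            s = model(s, actions[t])              # predicted next state
            total_reward = total_reward + reward_fn(s, actions[t])
        loss = -total_reward                      # gradient ascent on reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```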
Another way we could do this, say we don't want to use backpropagation, for example because the model that we learn doesn't have well-conditioned gradients, or maybe it's discontinuous in some way, is to optimize the actions via sampling. This would essentially be gradient-free optimization over our actions: still the same underlying loss function and the same underlying optimization problem, just a different optimization approach for acquiring a sequence of actions that will maximize reward. What this might look like is: run some policy; learn a model to minimize model error; and then iteratively sample action sequences, run those action sequences through the model, and execute the actions of the sequence that achieves the best predicted reward. And there are ways to sample action sequences in a more intelligent way. You could imagine just sampling from some uniform distribution over action sequences and taking the best one. But you could also, after sampling from a uniform distribution, take the best 10%, for example, rather than the best one, refit a distribution around that top 10% of action sequences, resample from that distribution, and repeat this process to iteratively refine the sampling distribution over actions. That's known as the cross-entropy method, and it allows you to perform a slightly more powerful optimization over your actions.
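Here's a minimal sketch of that iterative refinement; `rollout_reward` is an assumed helper that runs an action sequence through the learned model and returns the total predicted reward:

```python
# Sketch of the cross-entropy method (CEM) over action sequences:
# sample, keep the top "elite" fraction, refit a Gaussian, resample.
import numpy as np

def cem_plan(rollout_reward, horizon, action_dim,
             n_samples=500, n_elite=50, n_iters=5):
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current distribution.
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        rewards = np.array([rollout_reward(seq) for seq in samples])
        # Keep the best ~10% and refit the sampling distribution to them.
        elites = samples[np.argsort(rewards)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the refined action sequence
```

Taking the single best sample from one round would be plain random shooting; the elite-refitting loop is what makes it the cross-entropy method.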
Okay. So here are a couple of different approaches. What's something that might go wrong with these approaches? Any thoughts on that?
Sparse reward.
Mm-hmm.
It's probably a vector.
Yeah. So if your reward function is sparse and your optimization process isn't able to sample action sequences that lead to that reward, that ever actually see any reward, then your optimization won't be powerful enough to find a good sequence of actions. So that's one good example. What's another good example? Yeah?
So in the previous example, you need to solve an optimization problem in order to get one single action?
Yeah, so in both of these cases, you need to solve an optimization problem in order to get a sequence of actions that will try to maximize that reward.
[inaudible] propagated exponentially forward.
Sorry, can you repeat that?
When there are inaccuracies in the model, you [inaudible].
Yeah, absolutely. I think there are two things here. One is that if your model is inaccurate, then the optimization can exploit that and be overly optimistic about whether or not an action sequence will achieve high reward. The second is that if you're optimizing an open-loop sequence of actions and then executing that sequence, and one of those actions reaches a state that's slightly different from what you thought you would reach, then when you take the next action, you'll have compounding errors, such that you move further away from the trajectory that you thought you were going to follow according to your model.
Okay. Any other thoughts on what might go wrong? Okay, cool. So what I had written here are the couple of different things that we discussed. The first is that you can have imprecisions in your model, and that will cause you to be overly optimistic about what will happen. And second, these errors can compound and cause you to go off track, with increasing amounts of error as you roll out your sequence of actions. One thing that will help with this... actually, do any of you have thoughts on how we might try to avoid these two issues in particular? Yep.
So I have a question. Intuitively it makes sense why this is bad. But from a theoretical standpoint, isn't this like minimizing regret, by making the most optimistic possible choice and then negating that if it turned out not to be true?
Yeah. So the key thing is the second part of what you said: if it turned out not to be true, we need to actually take that into account, right? In the algorithm that I listed on the previous slide, we're just fitting a model to our data and then executing actions according to that model. But if it turns out that we take those actions and they were actually not the correct thing to do, as you're mentioning, then we should refit our model using the new data and use that to continuously improve the model in settings where we were overly optimistic. So what you can do is go back to this previous algorithm, actually execute those planned actions in the world, append the data that you observed to your dataset, and use that growing dataset to refit your model. Does that answer your question?
I guess it seems like the answer is that it's really not that bad, is what you're saying?
Um, sorry, you're asking...
Just that if we make an overly optimistic decision, we can use that to our advantage through this?
So if you make an overly optimistic decision, it can certainly be bad: if you're done running learning and you're not going to collect much more data, then it can be very bad, because you won't actually be taking actions that maximize reward, and it can be arbitrarily bad if the model is erroneous outside of the states that you visited. But if you have the ability to collect more data, then in principle you should be able to correct for those errors.
Okay. So this is one thing that we can do. But still, while this will certainly help, it doesn't completely solve the problem. In general, learning a good global model is pretty hard, especially if you want a model that's good everywhere, in all possible states, for example. Another trick that can be helpful here is what's called replanning; a fancier name for this is model predictive control, or MPC. What we can do is first run our policy to collect some data, fit a model to that data, and use this model to optimize over an action sequence. Then we execute only the first planned action, observe the resulting state, and append it to our dataset. After we take that one action, we actually replan and reoptimize the sequence of actions from the state that we just observed. What this does is that if you end up in a state that you didn't expect, as a result of model error, you won't keep executing actions as if you were in the state you thought you would reach; you'll replan according to the state you actually reached, to try to correct for the mistake at that first time step. And so this can help address some of the compounding errors that we talked about before. Yeah.
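Here's a sketch of that replanning loop; `env`, `planner`, and `refit` details are stand-ins for the pieces just described, and I'm assuming a simplified environment interface whose `step` returns the next state:

```python
# Sketch of the MPC / replanning loop: plan a full action sequence,
# execute only the first action, observe, append to the dataset, replan.
def mpc_episode(env, model, planner, dataset, horizon, episode_len):
    s = env.reset()
    for _ in range(episode_len):
        # Optimize an H-step action sequence from the current state...
        action_seq = planner(model, s, horizon)
        # ...but execute only the first action, then observe and replan.
        a = action_seq[0]
        s_next = env.step(a)
        dataset.append((s, a, s_next))
        s = s_next
    return dataset  # used later (in a slower outer loop) to refit the model
```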
What if it's the same dataset, or what if they're different, like a meta-dataset?
Um.
[inaudible] or something like that.
Yeah. So we haven't gotten to the meta-learning setting yet, but you could also imagine... so in this setting, there might be a loop here that refits your model in a slower outer loop. In the inner loop, the only thing you're doing is observing the state and then using your model to replan, to reoptimize the action sequence from that state. So you're using the new state to update your actions; you're not actually using it to update your model. You could also imagine using that state to update your model, and we'll get to that towards the end of the lecture. So we're a few steps ahead.
So intuitively, this can help with model errors, because if you go off track from where your model thinks you will end up, you can in principle start to correct for that. The benefit of this is that you can correct some model errors. One of the downsides of this approach, which I think was alluded to in one of the other questions, is that planning for actions is in general a fairly compute-intensive process: if you're planning at every single time step, you need to do an optimization in real time.
One thing that can help with this is, instead of optimizing over a sequence of actions, you can backpropagate into your policy. If you have a parametric form for your policy, instead of backpropagating gradients into the actions, you can backpropagate gradients into the policy parameters. Likewise, for a sampling-based approach, if you optimize for a sequence of actions, you can use those actions as targets to train a policy to produce them. That can reduce some of the compute-intensive challenges of things like MPC.
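As a sketch of the first variant, here's what backpropagating through the model into policy parameters might look like, again assuming differentiable `model`, `reward_fn`, and `policy` modules (all illustrative names):

```python
# Sketch: amortize planning into a policy by rolling the policy out
# through the differentiable model and backpropagating reward gradients
# into the policy parameters rather than into the actions themselves.
import torch

def train_policy_through_model(policy, model, reward_fn, init_states,
                               horizon, iters=1000, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iters):
        s = init_states                 # a batch of starting states
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(s)               # policy proposes actions
            s = model(s, a)             # model predicts next states
            total_reward = total_reward + reward_fn(s, a).mean()
        loss = -total_reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

At test time, the trained policy produces an action with a single forward pass, so no per-time-step optimization is needed.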
Now, I also want to get back to the problem of sparse rewards that was mentioned. Learning a policy can also help with sparse rewards, in that if, in aggregate, you at some point see a good reward, then the policy parameters will be trained to try to achieve it. In practice, with very sparse rewards, model-based optimization and model-free optimization will run into the same sorts of issues if they don't ever actually observe rewards. And things like relabeling, as you saw in the homework assignment, can also help with that. Okay. Any questions on the basic algorithms before we talk about the multitask setting? Yeah.
My understanding is that people who are advocates of model-free methods say that model-free sometimes allows behaviors to emerge that wouldn't have been possible with model-based methods. But in the way that you've presented it here, where does that fall apart for model-based, such that you can't get everything that you have with model-free?
So I actually haven't heard that argument before. The argument you're making is that there are behaviors that will emerge with model-free methods that you won't be able to have emerge with a model-based method?
Yeah, you're essentially confining yourself too much, and confining the agent too much, by using the model-based method.
Yeah, I don't actually see why that would be the case. In both cases, you're optimizing some objective, which is to maximize reward, and in both cases you're going to learn some behavior that maximizes that reward function. I think that what comes out of that optimization is more a function of how powerful the optimization process is. If you have a very strong optimization process, and a reward function that's interesting, then more interesting behaviors will emerge. I don't think there's any inherent difference; any difference that you see in the outcomes of these approaches will have more to do with the strength of the optimization, and with how hard it is to fit the model versus fit the policy. There are definitely settings where it's harder to fit the model than to fit the policy. For example, if you want to pour water from one container into another container, modeling the fluid dynamics is a hard problem, but just twisting your arm is a relatively simple function to learn. In those settings, a model-free approach may be easier. In other settings, such as if you want to be able to push an object to any possible position, the dynamics may be relatively simple, because it just corresponds to one object on the table, whereas the policy may be more complex, because you have to represent the policy for every possible goal, while the dynamics are just a single-task problem.
Okay. So we can look at what this actually looks like, returning to the question of what this has to do with multitask RL and meta-RL. How you actually apply this to the multitask RL and meta-RL problem statements depends on whether or not you know the reward function. In particular, there are some instances where you actually know the form of the reward function exactly. We'll see an example of this in a second. If you know the reward function for each task, then you can just learn a single model and plan with respect to each task's reward function at test time.
For example, here's work done by Anusha Nagabandi and colleagues. They wanted to learn how to write different trajectories with a pen; they were controlling the hand in simulation, and different reward functions corresponded to different trajectories of the pen tip. In this case, they assumed that they could observe the tip of the pen, and then the reward function can be derived as trying to track a particular trajectory with the pen tip. So the form of the reward function is known, but optimizing that reward function by actually using the hand to write is a very challenging problem. So they learned a model, collecting data with fairly random trajectories, and then at test time they gave it reward functions for writing different digits. It's a little bit hard to see; I think the first digit is a six, the second one is a seven, the next one is a nine or a four, and the last one is a five. And you can basically use that model to plan to follow these different trajectories.
Another example of this is that maybe you want the hand to be able to manipulate these Baoding balls, for example moving them in a circle. Here the reward function is also known; it corresponds to the trajectory of the two balls. You can also have a reward function corresponding to moving a ball to a particular location in the palm, or rotating the balls in the opposite direction.
Cool. So here's an example of multitask reinforcement learning with a single model and different reward functions. One caveat that I'd like to mention here is that even though the dynamics may be consistent across all of the tasks, different tasks may require you to visit different state distributions. And so if you collect data for one task, and another task has a very different state distribution, then the model that you learned for the first task may not actually generalize to the second task if it doesn't visit the same states. So the reward may change how you collect the data, and may affect the quality of your model in other states. In this work, they found that if you train a model only on this first task, that model can actually be reused for the second two tasks, because the distribution over states is sufficiently diverse.
What is the state representation in that case?
In this case, the state representation corresponds to the positions of the two balls, plus state information about the hand, such as the joint angles. Actually, thinking about this a bit more, I'm not quite sure how they go from two balls to one ball, since the state representation changes in that case. I would guess that maybe they just use the part of the model corresponding to one of the balls and ignore the second one, but I'd have to check the details of the paper.
They all seem like different sizes.
It's possible that they're different sizes. My impression was that they were the same size. Actually, the hand is also bigger in the left video, so I think it's just that the video has been scaled differently.
Another cool thing about this approach is that because it's able to learn a model pretty quickly, with a relatively small amount of data, they were actually able to run this method on a real robot: actually run the reinforcement learning process, collect all the data on a real robot, fit a model to it, and get a real Shadow Hand to perform this task.
Cool. Yeah.
When did the video come out? When did this get done?
It's actually pretty recent work. I know Anusha well, so I've seen it for a while, but I think the video came out within the last month. It was published at CoRL, which was last week, which is why I was actually not here. Yeah.
Okay, cool. So that's what happens if we know the reward function. What if we don't know the reward function, or at least the form of the reward function? One thing we can do is just learn a reward function conditioned on the task, and then use that reward function to plan to accomplish tasks. And, I have a typo on the next item, which should say meta-RL: the other thing you can do is meta-learn a reward function from a small amount of data, and use that learned reward function to plan to accomplish goals.
This is pretty straightforward. One example of the second case is some work where the training dataset corresponds to a few examples of the goal. In this case, the goal is to place the pencil case on top of, or behind, the notebook, and given a few positive examples, you want to be able to learn a classifier, a binary reward function, indicating whether or not the task has been accomplished in the image. You can do this with meta-learning, by collecting a dataset of a bunch of positive and negative examples for different tasks, and meta-learning your classifier such that given a small number of positive examples, it can quickly learn a new reward function. And then, once you have that reward function, you can plan using your model to maximize reward. Here's the result of running the planner and executing those actions on the robot to accomplish the task with respect to this meta-learned reward function.
Okay. And I guess the bigger takeaway here is that model-based RL solves both the multitask RL and meta-RL problem statements, with these different types of approaches depending on whether or not you have the reward function. Okay, any questions before we move on to image observations? One thing worth mentioning here, which I didn't actually tell you, is that this is all from images. So how do we actually go about doing model-based RL when we have image observations?
Right. So in particular, if you only have access to images, you might have a graphical model like this, where you can't actually observe the low-dimensional underlying state of the world; you can only observe the O's shown here. For example, maybe you have a robot that looks like this, and you want it to be able to use the spatula to lift up an object and put it into the bowl, and all it can observe is this image. So first, with these images, we have to deal with learning models in some space, or at least learning how to predict. And second, we also don't necessarily have any reward function if we only have those observations, so we need to think about how we might go about learning our reward function, such as using the meta-learning approach that I showed on the previous slide.
Okay. So one option for the reward function is learning an image classifier, like I showed before, or meta-learning an image classifier. Another option is to provide an image of the goal. This corresponds to the goal-conditioned reinforcement learning setting that we've covered previously and that you looked at in your homework. For example, you could give the robot an image of the goal like this, saying, "I want you to accomplish this goal state," and have it try to reach that goal state.
So how might we go about doing model-based RL in this setting? There are a few different classes of approaches that we'll cover. The first is to learn some latent representation and then learn a model in that latent space. The second is to try to learn a model of your observations directly. And the last one is to try to predict alternative quantities other than your raw observations. We'll talk about all three of these approaches.
So first, let's talk about latent-space models. The key idea of learning in a latent space is that you learn some embedding of your observation, which we'll denote g(o), and then learn a model inside that embedding space. If we take the graphical model that we showed before, this corresponds to learning some form of inference network that maps from your observations back to your low-dimensional states, or back to some representation of your state space.
There are a couple of papers that have looked at this kind of approach, shown here, that we'll talk about, and more recently a couple of other approaches have taken this route as well. The way this algorithm works is: first, you run some policy to collect some data; then you learn this latent space of your observations, and a model in that latent space. So you learn a g that goes from o to s, and then you learn a model that goes from s and a to a prediction of s'. Then you use your model to optimize over a sequence of actions, execute those planned actions, and append the visited tuples, which in this case may be (state, action, next state), or may be (observation, action, next observation). You then add those tuples to your dataset, and retrain your embedding and your model.
All right. So this is pretty straightforward, but there are a couple of questions. The first question is: what is your reward function when you're trying to optimize over your actions? We talked a little bit about how your reward function could correspond to a classifier. In the case where you're given an image of the goal, one thing you can do is use a reward signal on your actions, such as trying to minimize effort or torque, plus a distance term with a negative sign in front of it: the negative distance between the representation of your current observation and the representation of your goal observation. So you can basically use the negative distance in your latent space as a reward function for planning. This makes the assumption that distance in your latent space is an accurate metric for the things that you care about, and this assumption may or may not be true depending on the form of your latent representation.
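Concretely, that reward might look something like the following sketch, where `encoder` is the learned embedding g(o) and the effort weight `c` is an illustrative hyperparameter:

```python
# Sketch of a latent-space reward: a control-effort penalty plus the
# negative distance between the embeddings of the current and goal
# observations. All names here are illustrative.
import torch

def latent_reward(encoder, obs, goal_obs, action, c=0.1):
    z, z_goal = encoder(obs), encoder(goal_obs)
    effort = c * (action ** 2).sum()          # penalize large torques
    distance = torch.norm(z - z_goal)         # distance in latent space
    return -effort - distance                 # reward = negative cost
```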
Okay. And then the second big question, maybe the most salient question, is: how do you actually get this latent representation space? There are a couple of different approaches that people have taken. One of the more popular approaches, which was looked at in 2015 and also more recently, is to form a graphical model of your transitions. This basically corresponds to a variational autoencoder over transitions, over pairs of states, rather than a variational autoencoder for a single state. As a result, you're jointly optimizing for the latent representation of your variational autoencoder as well as for the transition distribution, your model in that latent space. If you do this, you can get a representation space that is both low-dimensional and effectively satisfies your model.
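In the spirit of that idea, here's a simplified, deterministic stand-in for the joint objective (not the exact variational objective from the paper); `encoder`, `decoder`, and `latent_model` are illustrative modules:

```python
# Sketch: jointly train the embedding and a latent dynamics model.
# The reconstruction terms are what rule out degenerate embeddings
# (more on that later in the lecture).
import torch

def joint_loss(encoder, decoder, latent_model, o, a, o_next):
    z = encoder(o)
    z_next = encoder(o_next)
    # Reconstruction: the latent codes must explain the observations.
    recon = ((decoder(z) - o) ** 2).mean() + \
            ((decoder(z_next) - o_next) ** 2).mean()
    # Transition: the latent model must predict the next latent code.
    dynamics = ((latent_model(z, a) - z_next.detach()) ** 2).mean()
    return recon + dynamics
```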
For example, they showed that you could use the algorithm with this latent space for accomplishing different kinds of goals. The goal image was shown at the beginning of the video, and in each case the system is executing actions that try to reach that image. We'll see another example in a minute. Here the goal state is to curl up the arm; the left shows the executed actions, and the right shows the one-step predictions of the reconstructed image through that generative model. One of the things you can see is that it can use a single model and a single latent representation to accomplish these different goals. The other thing worth mentioning, as I alluded to at the beginning of this lecture, is that model-based methods tend to be fairly efficient. This approach required about 300 trials to learn these skills from pixels, which, if you were to run it in the real world, would correspond to about 25 minutes of real time. That's pretty fast as reinforcement learning goes. In practice, some of the more recent model-free methods take around two to three hours to learn. Yeah.
[inaudible] I'm guessing you're also referring to SOLAR as a more recent work?
There are a couple of more recent works; SOLAR is one of them. There's also Stochastic Latent Actor-Critic, which came out more recently. That one's more of a hybrid method, but it also has this form of graphical model.
[inaudible]
Yeah.
[inaudible]
Yeah.
[inaudible]
Yeah. So the question is whether this method, and some of its predecessors and successors, place an assumption on the latent space, which is that you can basically have a locally linear model on that space that can accurately predict the next state given the current state, where "local" means that you may have a time-varying linear model. I think that, theoretically speaking, if you have a universal function approximator producing your state, it should be possible to find a latent space that actually satisfies that constraint. In practice, it may be a very difficult optimization problem to actually find that latent space, and we haven't yet seen these kinds of methods perform well in very diverse settings where you have many different objects in the scene, with the kind of diversity and complexity that you see in natural images, like ImageNet images, for example. Actually, I'll come back to that point in a second as well, when we talk about modeling in latent space versus modeling in observation space. Yeah.
[inaudible] like PlaNet?
Yes, PlaNet is also learning a latent representation and doing planning in that representation. I can't remember the exact details of which planning approach they use; I think it may have corresponded to a model-free algorithm. Sorry, what?
[inaudible]
Oh, okay, right. Yeah, so it's using the cross-entropy method, which is basically the iterative sampling-based approach that we talked about. PlaNet also has a very similar form of graphical model, although in addition to the stochastic pathway, they also have a deterministic pathway in their model. To my knowledge, PlaNet wasn't tested in the multitask case, just in the single-task case, but in principle the model that's learned could also be used in the multitask setting.
Yeah. So there have been a number of approaches recently that follow this form of learning a latent space with some sort of probabilistic or semi-probabilistic approach, and then doing the learning in that latent space. One other example of a latent space, which we used in 2016 and which has also been studied more recently, is representations that have structure to them. In this case, the structure we were looking at was latent spaces where the dimensions of the latent space correspond to key points in the image. For example, here are two example key points in the representation, and this is the trajectory that the representation follows as the robot executes a trajectory. More recently, there's been a trend of approaches that try to learn object-centric representations, or key-point-based representations, of images, and then perform planning or model-based RL on top of those representations.
So this is maybe an alternative view; I think both of them have their merits. Specifically, for this approach, the way you actually get those feature points is: you take the last convolutional layer of your network and perform a softmax over the spatial extent of the image, to get a distribution over 2D positions in the image. Here's an example of the softmaxes over the x position and the y position. Once you have these distributions over key points, you take an expectation over the 2D distribution to get the (x, y) coordinate of, approximately, the key point of maximal activation. You can essentially view this as a form of spatial softmax: instead of doing a softmax over a one-dimensional operation, you do a softmax over a 2D space and get a key point out of it. And I guess the important part is that this operation is fully differentiable, so you can optimize these kinds of representations with respect to the objective that you care about, such as reconstruction, or the objective of your task.
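Here's a minimal sketch of that spatial softmax operation; the normalized [-1, 1] coordinate convention is one common choice, not prescribed by the lecture:

```python
# Sketch of a spatial softmax: softmax each feature map over its spatial
# extent, then take the expected (x, y) coordinate under that distribution.
# Every step is differentiable, so gradients flow into the conv features.
import torch
import torch.nn.functional as F

def spatial_softmax(features):
    """features: (batch, channels, H, W) activations from the last conv layer.
    Returns (batch, channels, 2) expected (x, y) key points in [-1, 1]."""
    b, c, h, w = features.shape
    # Softmax over all H*W spatial positions, per channel.
    probs = F.softmax(features.view(b, c, h * w), dim=-1).view(b, c, h, w)
    # Coordinate grids for taking the expectation.
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    expected_x = (probs * xs).sum(dim=(2, 3))   # E[x] per channel
    expected_y = (probs * ys).sum(dim=(2, 3))   # E[y] per channel
    return torch.stack([expected_x, expected_y], dim=-1)
```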
There are also other, more recent approaches that have looked at other ways of getting these key-point-like representations, or object-centric representations, in unsupervised or weakly supervised ways. The result you get out of this, if you train it to do reconstruction, is shown here: two of the feature points you might get, out of the 16 feature points that were used in this case. I also said "smooth" here because there's an auxiliary loss that encourages these representations to have roughly constant velocity through time, but I think that detail is maybe a little bit less important.
Yeah.
[inaudible].
So in this case, the total number of feature points needs to be predefined as part of your network architecture. Here the number of feature points is 16, so the dimensionality is 16 times 2. One thing you could imagine doing, which we actually explored in this paper, is to have a larger number of feature points and then prune the feature points according to some metric. For example, if some feature points are observed to be very noisy, you could prune those feature points out, in an automatic way, if you don't think they correspond to things relevant to the task. Yeah.
[inaudible]
So in this case, it was a completely unsupervised setting. It's an autoencoder, so its loss is to reconstruct the image, and it needs to find 2D key points that allow it to reconstruct the image. Because different images have different positions of objects, extracting the positions of objects, of the things that move in the image, leads to good reconstructions. So essentially, it's just a constraint on your latent space: it has to represent these 2D positions, and how it uses that representation space is up to the model.
What if you have [inaudible]?
Yeah, so occlusions are challenging. One thing you could imagine doing is having a recurrent model, which in principle could try to track the point. In this work, one of the things we did to deal with occlusions was to use a filtering-based approach. You can basically look at the softmax distribution: if it's very peaked, then it's likely that the point is in view; if it's not peaked, then it's likely that the point is occluded, and if it is occluded, then you can actually use your model to do a Kalman-filter-style update to fill in where you think that point is.
Yeah. So there are a lot of potential details here that I wasn't planning to cover, but there are different things you could imagine doing. Another challenge with this type of approach that is worth mentioning is that if you have two objects that look identical, this softmax, since it's over the entire extent of the image, is going to have two peaks. And therefore, if you take the expectation, you'll either get the average of the two points, or the point that dominates, or you might get flickering between the two points, and there isn't necessarily a satisfying way to deal with that. Yeah.
Is there a paper reference for this, or is this, um-
Oh, this is actually from 2016.
Oh.
So the paper reference is there.
So here's an example of the learning process. In this work, we actually gave it both the goal image and the goal position of the arm, and we initialized it with a policy that could reach the goal position of the arm, but not the goal image. This is the course of model-based reinforcement learning, where in this case it's actually optimizing for a policy that tries to reach both the goal image and the goal position of the arm. This is a toy task where the goal is just to push the cube over to the left. The final policy you get looks something like this, where it's able to push the cube to the specified position on the mat. The colors actually seem off; this is supposed to be green. But anyway, you can also perform the spatula task that I showed before, where the goal is to get the spatula into the bowl.
Cool. So one of the nice things about this approach, with these sorts of key-point-based representations... I guess I'm spending more time on the key-point-based representations, but I don't think there's necessarily one approach that's better than the other... one thing that is quite convenient is that you can actually visualize the key points on the image. In this case, the X's correspond to the goal positions, and the circles correspond to the current positions of the key points. With this ability to visualize them directly on the image, you get a very interpretable representation, and this is very useful for debugging: if your representation isn't capturing and tracking the objects that you care about, then it's likely that your algorithm won't work, because your representation isn't capturing those things. Yeah.
So the key points are in the latent space, right?
Yes. The key points are the latent representation.
Okay. So these skills were learned with about 125 trials, which corresponds to about 11 minutes of robot time per task. In this case, the representation was actually learned per environment, or per task, so it's learning an environment-specific latent representation. As a result, the representations can become somewhat specific to that environment. For example, if you took these representations and tried to use them for a different task, like the Lego block task, they wouldn't necessarily track the objects, because they weren't trained on those images.
Okay, cool. So, one thought exercise. Both of the approaches that we looked at were autoencoder-type approaches, generative models where we're predicting the image: we're trying to generate the image through some bottleneck, and we may also be learning a model on that bottleneck. So one question is: why do we need to reconstruct the image at all? Why not just learn some embedding space, like the feature points, also learn a model on those feature points, and train the representation such that the model is accurate? This seems like a somewhat reasonable approach. Why is it maybe not a good one?
Is it just difficult in practice?
Optimizing the representation with respect to model error is actually definitely a solvable problem.
Is the [inaudible]?
Right. So the embedding does depend on the problem that you're solving. And if you're optimizing with respect to model error, then it will capture things about the model, and not necessarily about the task. Yeah.
If it's in latent space, it may not capture some small deviations in the real image?
Um.
[inaudible].
Yeah, that's actually also a problem; it's definitely a problem. It's also actually a problem with reconstruction-based approaches, and we'll see that in a second. Yeah.
[inaudible].
Mm-hmm. And so what does that mean?
So that means [inaudible].
Mm-hmm. And so what happens if you optimize both the embedding and the model with respect to the error of the model?
[inaudible].
Sorry, what?
[inaudible].
Yes. Yeah. So there is a degenerate solution to this model-error objective, which is that if your embedding is always the same thing, if it's a constant, for example always zero, then it's very easy to predict the next state, right? Because it's just always zero. That embedding isn't very useful, because it's a constant, but it achieves perfect model error. So there's basically this degenerate solution that comes up if you try to optimize both the embedding and the model with respect to model error, and it's not really a good idea. [LAUGHTER] This is why we need other forms of objectives for optimizing these representations, in addition to model error, so that you can avoid that degenerate solution. Okay. Does that make sense?
What about adding an entropy [inaudible]?
Yeah, so that's actually an interesting point. If you add an entropy term, interestingly, that will actually correspond to maximizing the mutual information between your representation and your observation. If you want to maximize the mutual information between your image observation and your latent representation z, you can show that this is equal to H(z) minus H(z given s). The first term corresponds to maximizing the entropy, like you said, and the second corresponds to being able to predict z from s. So that's actually a pretty good thing to do, and a lot of people have looked at these types of objectives for learning representations.
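As a quick sketch of that identity in symbols, writing z for the latent representation and s for the quantity it encodes, as in the lecture:

```latex
% Mutual information decomposes into an entropy term and a
% predictability term:
I(z; s) \;=\; H(z) \;-\; H(z \mid s)
% Maximizing H(z) spreads the codes out (the entropy term mentioned
% above); minimizing H(z | s) makes z predictable from s. Together
% they maximize the mutual information.
```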
Cool. So to wrap up the latent-space approaches: the benefit of these approaches is that you can learn pretty complex skills very efficiently, and structured representations enable very effective learning of these tasks. The downside, or I guess one of the main downsides, is that we need good objectives for learning these representations, and things like reconstruction objectives may not actually recover the right representation. As an example of this: when I was doing those experiments a few years ago with the spatula task, we also wanted the robot to do another task, which was to manipulate a ping-pong ball, basically transferring it from one container to another container. Here's a downsampled image from that experiment, where the white dot corresponds to the ping-pong ball, and you can see the arm of the robot as well. I trained an autoencoder on these images, and the reconstructions that I got out of it look like this. What you can see is that it learns to be a very good eraser of the ping-pong ball, and instead just learns to reconstruct the arm, because that's the thing that's larger in the image. So there's this mismatch between the objective of the representation learning and the objective of the task that you might care about. The takeaway here is that we may need better unsupervised representation learning methods, be they reconstruction-based methods or mutual-information-based objectives.
Okay. One other side note is that low-dimensional embeddings can also be very useful for model-free approaches. For example, you could learn a low-dimensional embedding and do model-free RL in that latent space. There's work back in 2012 that did this for a slot car: they trained an autoencoder down to a two-dimensional representation, and then did fitted Q-iteration on top of that two-dimensional representation. This work actually predates things like DQN, so they were doing deep RL back in 2012. Here's an approach that was able to run TRPO, which is typically an algorithm that requires a very large number of samples; they learned a latent space of both the states and the actions, and were able to run TRPO on a real robot to throw an object to hit the Pikachu.
There are also methods that use an embedding for their reward function. We talked a little bit about how the previous approaches used the embedding both for the state representation and for the reward representation. In this case, this work looked at acquiring a reward function from ImageNet features; this is actually a supervised representation learning method. They took a video of a human opening a door, ran it through an ImageNet-trained network, and then used that as a reward function for a robot, to try to reach the same features that the video on the left was reaching. One other thing worth mentioning is that if you have a reward, you can actually predict it in order to form a better latent space; this is one way to help rule out the degenerate solution that we observed when optimizing only for model error, and there are a number of approaches that have looked into that as well. One reason why you may not want to predict reward is that you may not have a good reward function. In the case of the spatula, and in the case of the embed-to-control paper, if you just have goal images, you don't actually have reward functions. But if you do have a reward function, it's good to try to use it.
Okay. Now that we've talked about latent-space models, let's talk about modeling things directly in your observation space. Recall the model-based RL approaches that we mentioned before; in this case, this is just the same MPC algorithm that I showed before, but with all the states replaced by our observations o. What we can do is learn a model of our observations and plan with that model. So first, we run some policy to collect some data. For example, we could collect data that looks like this: robots randomly interacting with objects, picking them up and such, where the data corresponds to the images and actions. It's very easy to collect data like this; you don't need reward functions, you can just run your robot or your agent in whatever environment. Then you learn a model to minimize prediction error, and this corresponds to a video prediction model. You may get predictions that look like this for different actions run through your model. Also, because we're not imposing any representation on our state, we can apply these sorts of models to deformable objects as well, because we're just predicting our raw sensory observations. Then, once you have that model, you can use it to optimize over actions, by sampling actions, running them through the model, and picking the ones that we think will accomplish our goal.
So this is pretty straightforward. There are a couple of challenges, though: we need to learn these models, which are pretty challenging to learn, and we need to be able to plan by optimizing through these large video prediction models. So, first question: how do we actually predict video? We want to learn this model, and it's a fairly complex model, because it's a model of how images transform as a consequence of our actions. This is a problem that people have been studying for a little while now, maybe 5 to 10 years at least, although back in 2016, it turned out that the models were pretty bad.
So an example of a model that works a bit better looks like this. It's just a big neural network; the main points are that it's a deep neural network and it's recurrent, where each of the yellow arrows corresponds to recurrence and each of the green and blue boxes corresponds to convolutions. It performs multi-frame prediction, meaning it predicts multiple frames into the future. And it conditions on actions, so the actions are passed in here, as well as any state information you might have, like the position of the robot's arm.
The other thing about this model is that it explicitly models the motion of pixels. Rather than trying to generate pixel values directly with a neural network, the model takes the previous image, predicts multiple convolution kernels, applies those kernels to the image to generate multiple transformations of it, and then composes those transformed images, using predicted masks, into a single image prediction. So it's essentially predicting how the previous image will transform into the next image, in a way that's differentiable and can be backpropagated through.
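Here's a minimal sketch of that transform-and-compose step, assuming the rest of the network has already produced the kernel and mask logits. The shapes and names are illustrative, in the spirit of motion-predicting models like CDNA rather than a faithful reimplementation:

```python
import torch
import torch.nn.functional as F

def transform_and_compose(prev_frame, kernels, mask_logits):
    """Compose a next-frame prediction from predicted pixel motion.

    prev_frame:  (B, C, H, W) previous image
    kernels:     (B, K, k, k) predicted transformation kernels (k odd)
    mask_logits: (B, K+1, H, W) predicted compositing masks
    """
    B, C, H, W = prev_frame.shape
    K, k = kernels.shape[1], kernels.shape[-1]
    # Normalize each kernel so it expresses a distribution over pixel motion.
    kernels = F.softmax(kernels.view(B, K, -1), dim=-1).view(B, K, k, k)
    # Apply every kernel to every example via a grouped convolution.
    inp = prev_frame.reshape(1, B * C, H, W)
    w = kernels[:, None].expand(B, C, K, k, k).reshape(B * C * K, 1, k, k)
    shifted = F.conv2d(inp, w, padding=k // 2, groups=B * C)
    shifted = shifted.view(B, C, K, H, W).permute(0, 2, 1, 3, 4)  # (B,K,C,H,W)
    # Masks choose, per pixel, among the K shifted copies and the untouched
    # previous frame; the whole operation stays differentiable.
    masks = F.softmax(mask_logits, dim=1)                         # (B,K+1,H,W)
    candidates = torch.cat([prev_frame.unsqueeze(1), shifted], dim=1)
    return (masks.unsqueeze(2) * candidates).sum(dim=1)           # (B,C,H,W)
```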
And here are some examples of videos from a robot. If you took some of the models from back in 2015, for example, you would get predictions that look like this, which don't look very good. Whereas if you have recurrent models that predict multiple frames and explicitly model motion, you get predictions that look much cleaner, though still blurry. In general, the video predictions we're getting out of these models, even in 2019, still leave something to be desired, but we've found they can still be useful for control. Yeah?
Does this model predict the controls as well?
Yeah. So let's talk about the planning approach. Once we have our model, we need to actually optimize the action sequence, and the way you can do this is basically with the sampling-based optimization that we described previously.
Say this is our initial image. We consider potential action sequences, probably a hundred or a couple hundred of them, including these two. Then we predict the future for each action sequence by running those actions through the model, to get video predictions that look like this. Then you can pick the future that you like best and execute the corresponding action; or, instead of picking the best one right away, you can iteratively re-sample and then pick the best one. And then you can repeat these first three steps in real time in order to re-plan and do MPC, basically planning at every single time step.
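A minimal sketch of this sampling-based planner, with the iterative re-sampling implemented as a CEM-style refit; `predict` and `cost` are assumed interfaces to the learned model and the goal, not a specific codebase:

```python
import numpy as np

def plan_actions(predict, cost, obs, horizon=15, act_dim=4,
                 num_samples=200, num_iters=3, num_elites=20):
    """One visual-MPC planning step: sample action sequences, imagine their
    futures with the learned video model, score them, refit the sampling
    distribution to the best ones, and return the first action to execute.
    `predict(obs, actions)` returns predicted frames; `cost` scores them."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(num_iters):
        actions = mean + std * np.random.randn(num_samples, horizon, act_dim)
        scores = np.array([cost(predict(obs, a)) for a in actions])
        elites = actions[np.argsort(scores)[:num_elites]]     # lowest cost
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then re-plan (MPC)
```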
We've found this is practical to do, but with video prediction models it can be a bit slow. The sampling-based approach corresponds to rolling out these big convolutional neural networks in batches of a hundred or more action sequences, and as a result, the time it takes to plan can be on the order of one hertz, depending on how many GPUs you parallelize across; that one-hertz figure is with something like two to four GPUs.
Okay. So you can essentially view this as MPC in visual space; it's like visual MPC. So that's how you can do these sorts of model-based RL methods in the raw observation space. The way it works at test time is that you need to specify some goal, and there are a few different ways to specify goals, as we talked about before: you could learn an image classifier, or you could provide an image of the goal.
One of the things that we did in some of this work is specify the goal by clicking on a pixel and clicking on where that pixel should be moved to. So for example, in this case, the goal would be to fold the left pant leg of these shorts by moving the red pixel to the green pixel. We also specify another pixel right here to indicate that the parts of the shorts that aren't being folded should stay in place. Once you have this goal, you can run MPC with respect to it. This is the video prediction corresponding to the action plan found by MPC, and then we execute the corresponding actions on the robot to try to accomplish that goal.
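One natural cost for this kind of pixel goal, sketched below, assumes the model can also transport a probability map of the designated pixel through its predicted transformations (an assumption about the model interface): the cost is then the expected distance between the pixel's predicted location and the goal pixel.

```python
import numpy as np

def pixel_goal_cost(pixel_probs, goal_rc):
    """Expected distance of the designated pixel from its goal location.

    pixel_probs: (T, H, W) predicted probability of where the designated
                 pixel has moved at each future step (obtained by pushing
                 a one-hot map through the same predicted transformations
                 as the image -- an assumed capability of the model)
    goal_rc:     (row, col) of the goal pixel
    """
    T, H, W = pixel_probs.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    dist = np.sqrt((rows - goal_rc[0]) ** 2 + (cols - goal_rc[1]) ** 2)
    # Sum the expected distance over the planning horizon.
    return sum((pixel_probs[t] * dist).sum() for t in range(T))
```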
And here's an example of what the robot could try to accomplish with respect to the goal of moving the pixel upward.
Okay. So getting back to the multitask learning aspect of model-based RL, one of the things you can do is use a single video prediction model to accomplish multiple tasks. For example, if your goal is to pick up an object, you can click on the object and click on where you want it moved, and the robot can figure out how to pick it up. If you want to manipulate the sleeve of a shirt, it can figure that out, or a task like putting an apple onto a plate. We can also look at a few other examples, like folding the shorts, rearranging objects, or covering an object with a towel.
So one of the nice things about this is that it allows us to accomplish many different goals, or many different reward functions, with a single model, without having to retrain the model for every single task. The other nice thing is that the model training is self-supervised: you don't need to provide reward functions or supervision; the robot can just collect data and train the model on that data.
Okay. So the benefits of this kind of approach: it was able to scale to real images fairly effectively; there's very limited human involvement, since the model training was fully self-supervised; and it was able to accomplish many different tasks with a single model. These pros are shared with many of the latent space approaches as well, although in practice we've found that latent-space approaches have trouble modeling some of the diversity of videos like this, because you have to capture all of the objects you might see in a compact latent space. Some of the downsides: despite the fact that these are real images, there's somewhat limited background variability; this is more variability than the spatula example, but still less than something like ImageNet. It can't yet handle skills as complex as the spatula example; these are just pick-and-place-style tasks. And it's also very compute intensive at test time.
Okay. Any questions on how that works?
Okay. One other quick aside, because I think we have a bit of time: how can we think about doing more complex skills, rather than things like pick and place? One thing you could imagine doing, as we talked about before, is using your planner to collect more data and then using that data to improve your model. I would expect something like that to perform pretty well, although in practice, one of the challenges is that if your planner is very compute intensive, it may be very expensive to collect more data with it.
So one approach we've looked at, in the context of this work, is whether we can incorporate some forms of supervision, such as demonstrations, in order to learn more complex skills. What you could do is collect demonstrations from many different tasks, and potentially use those demonstrations to increase the complexity of the skills you can learn with this approach. In particular, there are a few different ways you could use these demonstrations. The first is to append them to your data set and use them to improve your model, but you can also use them to improve the other two components as well.
What you could do is fit a model to the behavior of the demonstrator, to predict the kinds of actions the demonstrator might take given an initial image. If you have this model of the kinds of tasks that are interesting to perform, you can use it, first, to direct your data collection process toward the more interesting kinds of behaviors and tasks. And you can also use it to guide the planning process: if you know the task you'll be doing may resemble some of the tasks you saw in the demonstrations, you can sample actions similar to those the human would take, in addition to the actions you'd sample from some random distribution, as in the sketch below.
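A minimal sketch of that mixed sampling strategy; the proposal model's interface here is an assumption for illustration:

```python
import numpy as np

def sample_action_sequences(proposal_model, obs, num_samples=200,
                            horizon=15, act_dim=4, proposal_frac=0.5):
    """Mix demonstration-like and random action samples for the planner.

    `proposal_model.sample(obs, n, horizon)` is an assumed interface to a
    model fit to demonstrator behavior, conditioned on the initial image.
    Sampling from both distributions lets planning exploit demo-like
    behavior without being limited to it."""
    n_prop = int(proposal_frac * num_samples)
    proposed = proposal_model.sample(obs, n_prop, horizon)  # (n_prop, T, A)
    random = np.random.uniform(-1.0, 1.0,
                               (num_samples - n_prop, horizon, act_dim))
    return np.concatenate([proposed, random], axis=0)
```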
An example of this: one setting where we study this problem is having a robot manipulate tools. Here are some examples of the demonstrations that the user collected, using different tools to push objects in different ways.
Here are some examples of samples from the action proposal model. These are the types of actions that the robot thinks the user might perform, and they correspond to grasping toward objects and moving them. These actions are then passed through the video prediction model; so these aren't actual videos, but rather actions that the model thinks might be interesting, passed as inputs to the video prediction model.
Then you can specify your goal as before, run planning guided by that action proposal model with respect to your goal, to get a prediction that looks like this and the corresponding sequence of actions, and then execute those actions on the robot to try to accomplish the goal shown in the top left.
Okay. So by incorporating these diverse demonstrations, the robot can now perform more complex tasks that involve grasping an object and then using that object to perform the task, rather than just pick-and-place and pushing tasks.
Okay. And again, because the model is trained on a diverse set of objects and tasks, and because the demonstrations are also diverse, this single model is reusable across these different kinds of tasks. The model can be used to solve tasks that weren't seen in the demonstrations, such as using a broom to push objects into a dustpan, or using a hook to bring out-of-reach objects closer to the robot. In that case, the robot was constrained to move within the green shaded region, so that it actually had to use the hook to accomplish the task of moving the blue object closer.
It can also generalize to unseen tools, by nature of the fact that it has a large, diverse data set, including unconventional tools like water bottles. And because you're sampling from both the action proposal distribution and a random distribution, it can also figure out when to use a tool, such as when there are two objects that need to be pushed, versus when not to use a tool, when only a single object needs to be pushed.
Okay, cool. So that was one way you might go about incorporating demonstration data, or other forms of supervision, in order to perform more complex tasks.
The last kind of approach I'd like to talk about with regard to image observations is predicting alternative quantities. It may be that you don't want to reconstruct images the way video prediction models and autoencoders do, and it may be that you have some supervision, or some other auxiliary information, that you care about for performing your task. In these contexts, you can try to predict those quantities instead.
So for example, if you want to learn how to grasp objects, then given a sequence of actions, you can predict whether or not that sequence will lead to a successful grasp. Given, say, one of these yellow action sequences, you can predict the binary event of grasping or not grasping. And grasping is something you can actually measure on the robot, by checking whether, after performing those actions, the robot is actually holding something.
Another example: if you care about collision avoidance, you can predict, given a sequence of actions, whether you will collide with an object. If you have a sensor that can measure whether or not you've collided, this is something you can predict relatively easily. You can also predict things like your health or damage in a video game, or other information about the environment.
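A minimal sketch of such an event predictor, with a placeholder architecture; it takes the current observation's features and a candidate action sequence, and would be trained with binary cross-entropy on labels measured by a sensor:

```python
import torch
import torch.nn as nn

class EventPredictor(nn.Module):
    """Predict a task-relevant binary event (grasp success, collision, ...)
    from the current observation and a candidate action sequence, instead
    of reconstructing future images. Architecture is a placeholder."""

    def __init__(self, obs_feat_dim=256, act_dim=4, horizon=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_feat_dim + act_dim * horizon, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, obs_features, action_seq):
        # Returns P(event | o, a_{1:T}); train with binary cross-entropy
        # against labels measured by a sensor (e.g., gripper state after
        # executing the actions).
        x = torch.cat([obs_features, action_seq.flatten(1)], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)
```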
So this is very nice, because in these settings you don't have to generate images. It also has a very close connection to Q-learning: if you're predicting whether these types of events will happen in the future, then you're essentially trying to predict the probability of some event happening, which may correspond to your reward function if you care about whether or not that event happens.
Okay. So the benefit of this approach, in general, is that it allows you to predict only the task-relevant quantities, and in a multi-task setting, you can predict the things that are relevant for different tasks. The downside is that you need to be able to observe those quantities, which isn't true in the general case: you don't always know, for example, where objects are, or, if you're in a dialogue setting, you don't automatically observe the sentiment of the other person. And of course, you also need to manually pick the quantities that you think might be relevant to your task.
Okay, cool. So that was all I had on model-based RL with image observations. Let's see, we have eight minutes; I think we have time to cover the last part rather than moving it to next week. Are there any questions on this before I move on?
Okay. So what about model-based meta-RL?
In some sense, we've already been doing some form of meta-RL, and I'll talk about that in a second. We talked a bit before about how, in many situations, we have a dynamics model that doesn't vary across tasks, and in those cases, estimating the model is a single-task problem.
But what if the dynamics are actually changing across tasks? For example, if you're interacting with objects and you see an object on the table, you don't know a priori how that object is going to move: from just an image, you don't know its center of mass or its friction, so you don't necessarily know how it will move until you start interacting with it. In that context, it's actually somewhat of a partially observed problem, and you need to adapt your model based on a small amount of data in order to accurately predict how that object will move.
So if the dynamics are changing across tasks, you can turn the model learning problem from a supervised learning problem into a meta-learning problem, where you condition your model on some data and use that data to learn a better model. In this context, any of the meta-learning approaches that we talked about before could be applied.
For example, one meta-learning approach we talked about before is using LSTMs or models with memory, and we were actually already using LSTMs, in the recurrent models, to make predictions. So you can essentially view those vision-based models as solving a meta-learning problem, in a sense, because they take in the context of the past few frames and predict into the future. There's somewhat of a blurred line between what constitutes a single model and what constitutes a meta-learned model.
Okay. So if you want to turn this into a meta-learning problem, one thing you could do is simply collect data in different environments and adapt your model to each environment given a small amount of data.
You can also do this in a more online fashion. Say you have some robot interacting with an environment where different parts of the environment have different dynamics, such as a terrain change or a motor malfunction that causes the dynamics to change. One thing you can do is flatten out the experience of the robot and view these changes in dynamics as happening at different points in time.
If you take this viewpoint, you can view the few-shot learning problem, or the meta-learning problem, as one of taking a window of time and using that slice of data to predict what will happen in the following slice. So you can essentially view this problem of adapting your model to your environment online as a few-shot learning problem, where different tasks correspond to different slices of experience: k time steps of experience might correspond to the training set for one task, and the following k (or n) time steps to the corresponding test set for that task. That's one window of experience, and you can continuously slide the window to get different tasks, assuming you have some temporal continuity in the dynamics you're encountering, as in the sketch below.
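Here's a minimal sketch of that sliding-window task construction; the array shapes and window sizes are illustrative assumptions:

```python
import numpy as np

def sliding_window_tasks(states, actions, k=8, n=8):
    """Yield (support, query) splits from one continuous experience stream.

    states:  (T+1, state_dim) array of observed states
    actions: (T, act_dim) array of actions
    Each task's support set is a window of k transitions; its query set is
    the n transitions that immediately follow. This assumes the dynamics
    change slowly enough that adjacent windows share a 'task'."""
    T = len(actions)
    for t in range(T - k - n):
        support = (states[t:t + k], actions[t:t + k],
                   states[t + 1:t + k + 1])
        query = (states[t + k:t + k + n], actions[t + k:t + k + n],
                 states[t + k + 1:t + k + n + 1])
        yield support, query
```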
Okay. So you can basically use your favorite meta-learning method to solve this kind of problem. What this might look like is: you take your last k time steps of experience, adapt your model using that training data and your prior, to get a model that's specifically adapted to those k points in time, and then use that model to take actions and plan with MPC. For example, the update rule may correspond to one step of gradient descent, and theta-star may correspond to the initialization, if you're using an algorithm like MAML.
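A minimal sketch of that test-time procedure with a MAML-style inner step; `model`, `plan_fn`, and the functional parameter interface are assumptions for illustration:

```python
import torch

def adapt_and_plan(model, theta_star, recent_batch, plan_fn, alpha=0.01):
    """One MAML-style inner step at test time, then plan with the result.

    theta_star:   list of tensors (requires_grad=True), the meta-learned
                  initialization
    recent_batch: (s, a, s_next) transitions from the last k time steps
    `model(params, s, a) -> s_next_pred` and `plan_fn(params)` are assumed
    functional interfaces."""
    s, a, s_next = recent_batch
    loss = ((model(theta_star, s, a) - s_next) ** 2).mean()
    grads = torch.autograd.grad(loss, theta_star)
    # theta' = theta* - alpha * grad: a model specialized to the dynamics
    # observed in the most recent window of experience.
    theta_prime = [p - alpha * g for p, g in zip(theta_star, grads)]
    return plan_fn(theta_prime)  # e.g., MPC with the adapted model
```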
The way this works is you collect data on different terrains, such as the terrain shown here. This is a little six-legged robot called the VelociRoACH, and its dynamics actually vary drastically across different terrains and across different battery levels. You then train it to estimate the dynamics with only k time steps of experience, where k is something like eight time steps, which in practice corresponds to less than a second of experience if you're running at around 10 to 20 hertz. Then you evaluate the robot's ability to adapt to other types of dynamics, like being on a slope, missing a leg, carrying a payload, or having calibration errors.
What you can see is that if you put the robot on a narrow slope, here's a visual of the slope up close, and you try to run a single model across these settings, the robot will veer off course: it won't be able to run in a straight line, because it won't have learned an accurate model. Whereas if you use meta-learning and actually adapt the model online with each window of experience, and use that to plan, it can run in a straight line.
You could also do something like take off the front right leg of the robot: if you try to fit a single model, it isn't able to model the dynamics of these different situations, whereas if you train it to quickly adapt, and then do that adaptation at test time, it can effectively follow a straight line.
And this actually gets back to one of the questions at the beginning of the lecture: you're not only using the observed state to re-plan, you're also using the observed state to update your model at every single time step.
Okay, cool.
So I think we're basically out of time, so here are some takeaways for model-based versus model-free learning. Some of the benefits of model-based learning: it's very easy to collect data in a scalable way, without rewards; it's pretty easy to transfer across different reward functions, because the model only depends on the data it was trained on and has a less direct relationship to the reward than the policy does; and it typically requires a smaller amount of data, or at least a smaller amount of data that's supervised with the reward.
The downsides of models: they don't optimize for task performance, so there may be a mismatch between the objective you're optimizing and the objective you actually care about; we saw the same thing in the representation learning setting, when learning a representation for reconstruction versus for the task we care about. Sometimes it's also harder to learn the model than to learn the policy, such as in the pouring example, where you have to model fluid dynamics. And sometimes you may also need assumptions to learn complex skills, as in the spatula example.
Then for model-free methods, the benefits are that they make very few assumptions beyond a reward function, and they're very effective for learning complex policies and complex skills. The downsides are that they require a lot of experience and can be slower to learn, and in the multitask learning setting in particular, it's a harder optimization problem, because you have to learn a policy that performs all the tasks, rather than just learning a model and inverting that model for each individual task at test time.
And then, I guess the last thing is that I don't think we necessarily have this dichotomy. I think that ultimately we probably want elements of both, such that when we're pouring water, we use a model-free approach, and when we're, say, pushing objects around, we use more of a model-based approach.
Okay. As for the next few weeks: this is, I guess, the conclusion of the RL section of the course. On Monday next week, we'll be talking about seeing tasks in sequence, and we'll cover this both in the supervised setting and in the reinforcement learning setting. On Wednesday, we'll have paper presentations on some miscellaneous topics that are interesting, relating to task interference, differentiability, sim2real methods, and hybrid reinforcement learning methods.
And then the following three lectures will be about the current frontiers of these approaches. We'll have a guest lecture from Jeff Clune, who works on evolutionary methods, lifelong learning, and meta-learning; a guest lecture from Sergey Levine on information-theoretic exploration approaches and how they can be used for task-agnostic reinforcement learning; and on the Monday a couple of weeks after Thanksgiving, I'll be giving some perspectives on challenges and frontiers of these topics.
And then, just a couple of reminders: Homework 3 is due tonight, and the project milestone is due next week. Yes, question?
The question is, with regard to model-based versus model-free, especially in a meta-learning context, which one is more beneficial [inaudible], like, which one is easier?
For sim2real, I've actually seen both used. I've seen model-based methods that try to learn a model that's robust to different contexts and then use that model to plan. I've also seen model-free approaches, and I think we'll see a model-free approach in the paper that's covered on Wednesday next week.
[inaudible] adaptations and model changes.
Yeah. So there are definitely very recent papers that have looked at meta-learning for sim2real, where instead of learning a robust model, you try to learn an adaptable model, such that you can adapt to any possible simulator; then at test time you're given the real world and you want to be able to adapt to it. There's at least one paper that came out in the last two months or so that studied meta-learning for that problem, so I think that's right at the cusp of where current research is.
Okay. Great. See everyone on Monday.
