All right. Welcome back everybody.
Um, before we get started today,
I- does anybody have any questions about logistics,
or midterm, or anything like that?
We'll be doing a midterm review on Monday,
and the midterm will be on Wednesday.
Because there's a number of people in the class,
we're gonna be spreading everybody across a couple of rooms,
and we'll be, er, sending instructions out about that.
Does anybody have any other logistics questions? Yeah.
Midterm is during class time right?
Midterm is during class time.
Um, instructions will also be on the web,
but you're allowed to bring a one-page, um, of
written notes, um, aside from that everything is closed book. Yeah.
Is it okay to type said notes [NOISE] or does it have to be handwritten?
I think we've already issued a policy on that,
Let me just double-check and see what it is.
Okay.
All right. Okay. Let's go ahead and get started.
Um, before we do that,
I just want to say thank you to all of you that, ah,
participated in the class feedback survey.
It's really helpful to me and to everybody else to
understand what's helping you learn and what's not helping you learn.
Okay. So in terms of,
um, the responses- so note,
you know, with all of these things, there are about 230 people registered in the class.
Um, so for some of you if you didn't give me feedback,
it's hard for me to know what- what's helping you or not helping you learn.
So we're just gonna go with what people gave us feedback on.
Um, about 65% of you thought it's the right pace,
about 27% of you thought it's going too fast,
and there's only about 8% of people that think it's going too slowly.
Um, so we're gonna keep roughly in the same pace as what we've had before.
Um, a number of people noted that they wished it was like a semester-long course.
Um, I will mention there's a number of other classes that
do reinforcement learning and I highly encourage you to take them.
I- I offer an advanced class,
and also Ben Van Roy offers a class normally in the spring that's more theoretical.
This was super controversial.
I didn't think it would be this split.
Um, so we offer sessions on Thursday and Friday.
Attendance has been really low.
Um, I think we've had around like,
three to seven people showing up for these sessions.
Um, so we thought this was gonna be something that everybody wanted.
Because it's about evenly split,
I asked all the TAs to compare the number of people that are coming
to their office hours versus the people that are coming to the sessions.
Um, we probably have
about 4 to 5x people trying to go to office hours than trying to go to sessions.
So we're going to switch this to office hours.
Just so that we can kind of,
serve as many people as we can.
Um, so let me just write that there.
So we're gonna switch to office hours.
Now, the sessions will still be offered on the other days,
and they'll still be recorded.
So for anybody who wanted to go to them on Thursday and Friday, you can still go,
you can still participate on Zoom or you can watch the recorded lecture.
Um, but we're gonna switch this to office hours
because my TAs have been saying they've had
a number of office hours where they've either had to stay
really late or they feel like they're not getting to some people.
And so again, just in terms of serving the most people.
I will say when I was going through these responses,
it really made me think about the fact that in
reinforcement learning and sort of sequential decision making in general,
um, we're always optimizing for expected rewards [LAUGHTER]
And so that's kind of exactly the same thing we're doing here is that,
I know that everybody needs slightly different things,
and we're just trying to do our best job in expectation.
Um, but it's exactly why things like
intelligent tutoring systems and other stuff might be better.
Um, okay.
In terms of things that people thought were working well for them,
we've got a number of positive remarks about
doing worked examples in class, doing derivations.
Um, a lot of people really like the fact that my iPad had problems on Monday,
um, and so we did things on the board.
So we'll try to keep doing the same amount or more of derivations.
Um, people also were generally really positive about the homeworks.
In terms of things, er,
we saw repeatedly, so- so what I did is I got,
I just co- collated,
sort of all the free responses and tried to look for common themes.
And anything that came up, you know,
three or five times or more,
I considered was a common issue that people would love addressed.
Um, people would love even more focus on big-picture explanations,
um, as well as connecting the toy examples to real-world examples.
So we'll try to do that where we can.
Um, I'll also try to make sure that I'm speaking loudly throughout.
Several people said that sometimes it was occasionally hard to hear,
so I'll try to do a better job on that.
If you can't hear me in the back please feel free to raise your hand.
Um, and people would like even more examples, um, worked examples.
And so in particular,
we're going to try to make sure that in sessions,
we emphasize worked examples even more.
And let me just say again, if this was not one of
the things that you were most concerned about,
I'm sorry, we can't address all of them this term.
Um, er, we definitely, and it was kind of amusing to go through this,
had people saying exactly opposite things.
Sometimes right in a row, um,
in terms of how it was collated about things like,
some people don't like the fact that the slides have gaps and I do derivations in class.
And a number of other people really like that I do derivations in class.
Um, some people felt like it was moving too slowly,
other people said it was moving way too fast.
Um, so again, we're just gonna try to do the best job we can to address
everybody's needs. All right.
So today we're going to continue talking about policy search,
which as I said before, is probably the most important
reinforcement learning thing you'll learn this term. [LAUGHTER]
Um, I think this is used really very widely right now,
um, in order to optimize functions.
And we can again think of policy search here as
a lot of things are gonna sound similar to
when we were doing value function approximation.
And what we're thinking about here is having a parameterized policy.
Often we're going to use theta to parameterize it,
but we could use W or anything else.
But we have a policy that's parameterized.
Um, and then we have some value of that policy.
And what we're going to want to be doing,
is trying to find, you know,
a good optimum, trying to maximize the value,
um, of that policy.
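Written out compactly, the setup just described looks like this (a sketch in the lecture's notation; the reward sum is the undiscounted episodic case):

```latex
% Policy search setup, in the notation used in lecture
\pi_\theta(a \mid s): \text{a policy parameterized by } \theta \\[4pt]
V(\theta) \;=\; V^{\pi_\theta}(s_0)
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t} r(s_t, a_t)\Big] \\[4pt]
\text{goal:}\quad \max_\theta \; V(\theta)
```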
And one of the reasons why we did this right after
imitation learning was to connect it with the idea of, um,
you have to choose a policy class,
a way to specify that parameterization.
And so because of that inherently,
it's a place to put in structure.
Okay. So just as a recap,
I mentioned that we've done a lot of work so far,
on model-free value-based methods.
We're starting to do work on direct policy search methods now.
Um, and today we'll also start to talk more about Actor-Critic methods,
where we maintain both.
We maintain both an explicit parameterized policy
and an explicit parameterized value function.
And also throughout all of, you know,
the last couple of weeks including yesterday or Monday and today,
we're mostly going to be talking about cases where we want to
be able to work in really really large state spaces.
So I'm just gonna do a brief refresher of last time.
Why do we want to do this?
Well, we're going to generally be able to guarantee that we converge to a local optimum.
We don't always have those guarantees for value function based methods.
Um, er, and that might be important.
Er, it's a nice property to have.
Um, the downside is that,
if you use policy gradient methods,
typically we only converge to a local optimum.
Last time I showed you the exoskeleton example,
where they were using a global optimization approach.
So there are other approaches, which means policy-based RL
isn't inherently always going to get you only a local solution,
but the gradient-based methods typically will.
And the other issue that we were talking about, uh,
um, as ways to try to address this is, um,
or- or as tools to address the fact that evaluating
a policy itself might be rather inefficient, and high-variance.
So what we were defining before is a policy gradient where, um, er,
we think of the policy as being parameterized by a theta,
so we can write, uh,
pi sub theta or pi of theta to mean the parameterized policy.
But we're often going to talk about value functions because ultimately,
the value function depends on the policy and the policy depends on the parameters.
And when we think about what we want out of these algorithms,
typically what we'd really like is to try to converge to a really good local optimum.
Often we don't have very much control over that.
Um, but the things we often do have control over is things
like how quickly we converge to that local optimum.
Um, and so we want to, sort of, go as quickly as we can down that gradient,
if we're doing a gradient-based method,
um, and use our data as well as we can.
So one of the things we're gonna talk more about today,
is when we're doing this sort of policy gradient technique.
So we're going to be sort of moving down.
Now, we're gonna have our gradient.
We're gonna have our functions.
This is V of pi theta, and this is our parameterized pi.
And as we're moving down towards some local optimum,
um, it would be nice if when we update our policy,
that it is monotonically improving.
So can anybody give me a reason why we might want monotonic improvement? Yeah.
To help guarantee convergence.
Answer is right. Can help guarantee convergence absolutely.
And while I love math as much as many of you,
um, that- that is a great reason.
But perhaps, I was also thinking of like
an empirical reason why we might want that as well.
Yeah, in the back.
[inaudible] like in a high-stakes situation.
So what we've seen before in fact,
um, one of my students, ah,
[inaudible] was giving a practice job talk yesterday,
and he was showing this graph for DQN,
which looks something like this.
Where this is, like, the performance,
this is the reward, and this is time.
Of course, it doesn't always look like this.
And typically when you go to- um, when you read papers,
people smooth over many,
many runs, but often it looks something like that.
That as you're going across multiple episodes or across multiple time steps, like,
you're really getting a very jagged up and down performance of your-
of your val- of the policy that you're running as you do DQN.
So why might that not be good in,
like, a high-stakes situation?
Yeah, over there. And name first, please.
If, like, it's something high-stakes and you have something good and then it goes down,
people are going to be upset with you that now it's done something worse,
even if it will later go back up.
Yeah, what was said is that if the system is a high-stakes scenario,
um, if you do- you know,
if your policy works pretty well and then the next one,
next episode, it works really badly,
even if it might go up later, um, you know,
your boss still might fire you [LAUGHTER].
I mean, I'm joking, but I think that people are often
loss averse and also it's often not tolerable,
um, in- it might not be okay, you know,
in a company to say, well,
this quarter we did really well and next quarter we're going to do, you know,
worse, but then eventually, you know,
after many, ah, quarters we're going to do well.
Like we often may wanna ensure that we're sort of monotonically going up.
Um, and in the case of something like patient treatment or
other sorts of high-stakes scenarios or airplanes or stuff like that,
um, it just probably will not be tolerable to people,
if you say we're- we're going to do much worse for this period of time.
Ah, no- now there are exceptions to all of this,
but I think there are many cases,
where you'd really like monotonic improvement,
i- if you can.
Ah, so I think it's a really- in addition to the theoretical,
ah, benefits, ah, it can help us prove things.
Ah, it also can just be,
um, something that's sort of appealing for people to actually be able to deploy.
And we know that in general that people are very risk- like, very loss averse.
So having policies that are monotonically improving, um,
can be very nice and DQN and a lot of the value based methods do not have those guarantees.
Um, we can talk more also about whether that's always possible,
um, in terms of if you wanted to get to a global optimum. Yes.
Just to be clear, the monotonic improvement in these cases
are data that you have access to or have seen, right?
So technically, like, if there's a distribution in terms of
your live environment, where it may differ somewhat from your actual simulation or,
ah, environment, you may not necessarily quantify or improve it,
given all that unseen data. Is that right?
Yeah, which is to say,
when we're gonna be- what's- you know,
what is this monotonic improvement?
What are the conditions under which this will be guaranteed or possible?
And are we sort of doing this based on our previous data and
making some assumptions about the future da- future data that's collected.
Absolutely. We're gonna assume that we're still on
the same decision process and that it's stationary.
And what I mean by that is that the transition model and
the reward model is the same across- you know,
you might not have observed all the states yet,
but it's the same across episodes.
So we're not dealing with the fact that, you know, um,
customer preferences have totally changed or,
um, you know, climate change is changing your environment.
We are dealing with the fact that if the world is stationary,
that then we're going to be guaranteed to have monotonic improvement.
Now, the other thing is that I'm going to show you that in some cases we can guarantee that.
Um, the other really important thing to know is,
this is going to- we're going to hope to show monotonic improvement in expectation.
So- so the value function has expected reward.
So what we're going to be able to hope to say is
the series of policies that we're deploying in
the environment that their value function is going up. So what does that mean?
That means that V_Pi1 we would like that to be less than or equal
to V_Pi2 less than or equal to V_Pi3 dot-dot-dot,
where this is, sort of, um,
the policy we deploy on each iteration or each round.
Ah, but it doesn't guarantee that for a single run this policy is better.
So you could easily have it that it on average you're deploying a policy that is better,
you know, for your airplanes or for patient treatment,
et cetera, but for individual patients that might be worse.
A- and I think a really interesting active area of research right now is,
um, safe reinforcement learning,
um, and safe exploration.
And a lot of different people are thinking about this,
um, including a number of people here at Stanford.
And one of the things that we're looking at in our group is,
how do you really efficiently get to a safe solution?
What do you mean by safe in this case?
I mean you might not want to max- maximize expected reward.
You might want to be able to maximize,
um, some sort of risk averse criteria.
And we'd like to figure out ways to really efficiently get to that solution.
But there's lots of really interesting stuff that says,
you know, how do we try to do policy search?
Or how do we do this improvement in cases
where we don't just care about expected outcomes?
All right. So what we're gonna be trying to do today is move towards,
sort of, ideally, not just monotonic improvements,
but large monotonic improvements.
Um, as you might guess,
it is easier to try to achieve
small monotonic improvements than it would be to
guarantee really large monotonic improvements.
Um, does anybody have any intuition for why that would be true?
That might be harder.
Um, this kind of goes back to the state distributions.
So if you change your policy a lot,
um, does the state- can the state
distribution change a lot in terms of the states you visit?
So- so intuitively that should- the answer should be yes.
So we've talked some about how any policy induces, um,
a state distribution, like if you run it for a long time
you're going to have sort of a stationary distribution over states.
Um, and if your policy is really different than your old policy,
then that state distribution might look really different,
which means you might not have very much data.
Um, whereas if you have almost exactly the same policy as before,
um, you're probably going to be able to have
a really good notion of what that value is in the estimate.
But we'll get more into that later.
So what we're going to do today is to try to think
about moving beyond what we were talking about last time,
where we're trying to do policy gradient methods.
And we're trying to do it in a way that was sort of efficient.
We're going to talk about other ways to make it more efficient,
ah, and less noisy, and then try to go towards monotonic improvement.
All right. So the things that we talked about last time:
is we started off when we said what can we do in terms of policy gradient?
One thing we could do is,
kind of, you do Monte Carlo returns.
So these are Monte Carlo returns.
And sometimes people use big R of Tau,
where Tau, this is a trajectory.
So you could just run out your policy till
the world terminates, or some number of steps, or however you're defining your episodes,
um, and then look at the reward, and you'll look at the reward across,
you know, each time step.
We can also use, um,
G_i_t to denote the return we get from time-step t onwards in Episode i.
And what we've said before is that, um,
this is an unbiased estimate of the gradient, but it's really noisy.
And so we started talking about additional structure we
could use in the reinforcement learning problem,
where we were assuming the world is Markov,
um, to try to reduce the variance of this estimate.
And so what we talked about last time was using temporal structure,
which we did some on the board.
And the intuition there was the reward you get at some time point,
um, uh, can be- is not influenced by the later decisions you make.
So you don't have to take kind of
this complete product of the probability of action given the states,
because future actions don't retroactively change
our earlier rewards; that's the intuition.
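The reward-from-time-step-t-onwards quantity G_t that this temporal-structure argument operates on can be sketched like this (a minimal illustration; the function name is mine, and gamma defaults to 1 to match the undiscounted sums on the board):

```python
def reward_to_go(rewards, gamma=1.0):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_{t'}.

    Using the reward-to-go from each step, rather than the full-trajectory
    return R(tau), lowers the variance of the gradient estimate: rewards
    earned before time t are never multiplied by grad log pi(a_t | s_t).
    """
    G = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the later steps' sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

For example, `reward_to_go([1.0, 1.0, 1.0])` gives `[3.0, 2.0, 1.0]`: the first time step still has three rewards ahead of it, the last only one.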
So what we're going to start talking about now is, um, other things,
which is baselines and alternatives to Monte Carlo. Okay.
So what's the baseline?
Well, a baseline is- let's still think about looking at,
um, the sum of rewards we get from this time-step onwards.
So this is the same thing that we've often called GT.
So this is the reward we get from this time-step till the end of the episode.
And we're gonna subtract a baseline that depends on the state.
And what I'm going to show shortly is that by
subtracting this baseline which depends only on the state,
your resulting gradient estimator is still unbiased,
but it can have much lower variance.
And in particular often a really good choice is the expected return,
which is basically the value function.
And so why would we do this?
Well, then we can kind of look at increasing the log probability of an action,
proportional to how much better it is than a baseline.
Which in general is going to end up being sort of a
little bit like an advantage function.
So why is this true?
Okay. So, um, what are we going to try to do here?
We're gonna say we have this high-variance estimate right now.
If we don't have, so imagine we didn't have this, we don't have that.
We have our, the standard estimate we were talking about last time.
And what I want to convince you is that if we
subtract off this thing which is a function of the state,
that an expectation that additional term we're subtracting off is zero.
Meaning that our estimator is still unbiased.
So I said our original estimator is unbiased.
We're subtracting off this weird thing,
we want to show that the resulting estimator is still unbiased.
And the way we do that is by showing there's,
the goal is to show,
[NOISE] that this is equal to zero.
So that's what we're gonna try and do,
and if we can show that then that's gonna justify why we can subtract this random thing.
And then we can start to talk about what that random thing should be.
But first, we're just going to show no matter what that random thing is.
If it's a function only of the state,
that this, um, expectation is 0.
So how do we do this?
Well, first of all, note that on the outside there's an expectation over tau.
That is all the trajectories we might encounter by running our current policy.
Okay. So what I'm gonna do first is I'm just going to split it into two parts.
So this is still tau.
And all I've done here is I've split it into
the first part which is all the way up to time step t,
and the second part which is on time step,
um, t, all the way to the end.
So I've just sort of decomposed, that,
I-I I've just written out what, um, ah,
a trajectory is, and then decomposed into two parts.
So I'm just decomposing the trajectory.
[NOISE] And once we do that,
then we can see that the baseline term is only a function of S_t.
[NOISE] So we can pull it out of this inner term.
Right. So we're gonna pull this out because it doesn't depend on
any of these future time steps.
It's independent of those.
Okay.
Then the next thing we're gonna do is we're going to,
uh, write out or note the fact that in this case,
all we have in this inner term here is S_t and a_t.
So we can drop all the future terms.
Again, that's sort of the prob- the only thing in
here is the probability of the action at time-step t,
given the state at time-step t and theta.
So we don't need to worry about the future states or the future actions that are taken,
um, we- we're independent of those.
So now- so we just pull it out;
first we pulled out the baseline.
And now, we're going to drop those things that we don't need to depend on.
So all we have here is an expectation over the action that's taken.
Okay. And now what I'm gonna do is I'm going to rewrite it,
so what is this expectation?
It's an expectation over a_t.
What is the probability, you know, in that expectation?
We're just going to write that out explicitly,
that depends on the policy that we have.
So we're going to sum over a_t,
the probability of that a_t,
which of course just depends on the policy that we're following,
times the derivative of log.
So that is me writing out the expectation,
and I'm gonna take the derivative of the log.
So that's just gonna be the derivative with respect to the policy itself.
Divided by pi of a_t given s_t, theta.
Okay. But now we note that there's a term in the numerator and a term in the
denominator that we can cancel.
So this starts to simplify [NOISE] b of S_t times the sum over a_t,
just the derivative of the policy.
We just canceled the numerator and denominator there.
And now we note that we can reverse the sum and the derivative.
This is the other, sort of critical step of this proof.
So now what we're gonna do here is we're going to move the derivative out.
[NOISE] Well, this is just 1,
because the probability that we select some action,
under our policy always has to be 1.
And so now we see that we're just taking the derivative of 1.
So we are trying to take the derivative of 1,
and of course that's a constant so this is equal to 0.
[NOISE] So that's pretty cool.
So that means that we have added in this baseline.
That is some function that depends on the state,
and we haven't talked about all the different ways we
could compute it, but we can say it doesn't matter what it is.
No matter what you added there it's always unbiased.
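The derivation we just did on the board, written compactly (same steps: pull out the baseline, drop the future terms, expand the expectation, cancel, and swap the sum and derivative):

```latex
% The baseline term contributes zero in expectation:
\mathbb{E}_{\tau}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  = \mathbb{E}_{s_{0:t},\, a_{0:t-1}}\Big[\, b(s_t)\,
      \mathbb{E}_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big] \Big] \\[6pt]
% and the inner expectation vanishes:
\mathbb{E}_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]
  = \sum_{a_t} \pi_\theta(a_t \mid s_t)\,
      \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}
  = \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t)
  = \nabla_\theta\, 1 = 0
```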
So just to check our understanding for a second if we go back to this equation,
if I set b of S_t,
to be a constant everywhere,
is the gradient estimator still unbiased?
Just take like one minute and talk to your neighbor,
and say,
so based on what I just said if b of S_t is equal to a constant,
this is like a constant,
for all S_t is the gradient estimator unbiased,
just take like one second or one minute and talk to your neighbor.
All right. So let's start with,
um, everybody here thinks it's still unbiased, vote?
Yes. Great. Okay, yes.
It has to still be unbiased,
now it's a function not even of s, just a constant.
And so it's definitely still a function of s,
it's a tri- trivial function where it doesn't matter what the value of s is,
and so it's still unbiased.
Just to note here, um,
if s was a fu- if, um,
the baseline was a function of state and action,
do you think this proof would go through?
No.
No.
No. Right. Because one of the things that we did at the very beginning is
we moved b of S_t all the way out.
And if it depended on the action too, we couldn't have done that.
So this is specific to this being only a function of the state. Yes.
[inaudible] functions b that do not give you an un-
an unbiased estimate?
So I don't know, are there any such functions b?
b is only a function of the state, all of them are unbiased.
Yeah, so it is always unbiased.
There could be really- I mean,
just like what I put here,
um, you could just put in a constant and it might not reduce your variance at all.
So there's certainly unuseful definitions of a baseline,
um, but all of them are unbiased.
So they're not gonna affect, ah,
whether or not your estimator is unbiased.
They could make your estimator potentially worse if they're really bad,
um, or [NOISE] if they're really themselves, um,
very bad estimators potentially, um,
and you certainly could make it better [NOISE] by choosing good choices.
Okay. All right.
So this ends up allowing us
to define what's sort of known as the vanilla policy gradient algorithm.
Um, so vanilla policy gradient operates by,
we collect a bunch of trajectories using our current policy.
And then for each time step inside of the trajectory we
compute the return from that time step to the end.
Um, and then we compute the advantage estimate.
So, all right, we'll write out vanilla policy gradient.
[NOISE] Okay.
So vanilla policy gradient works as follows.
You start up, you initialize the policy with
some parameter theta and you need to start with some estimate of the baseline.
Okay. So what happens with vanilla policy gradient,
is for iterations i = 1,
2 dot, dot, dot.
We're gonna collect a set of trajectories using your current policy.
And then for each time step,
for t = 1 dot,
dot, dot, the length of your trajectory i,
then you do two things.
You compute the return,
which is just equal to the sum of all the rewards till the end of the episode.
And then you compute the advantage A-hat_i_t,
which is just equal to this return.
I'll parameterize it with i just to show that that's the i'th trajectory.
Um, [NOISE] - b of S_t.
So just to note here for a second,
um, this is a return of the sum of rewards till the end of the episode.
This is a baseline which is like a fixed function.
Um, so this could be, you know,
a deep neural network,
this could be a table lookup.
Ah, but this is a function and you input
the state at time step t and trajectory i [NOISE] and you output a scalar.
So that's what the baseline is doing there.
And then wha- um, what we do next in vanilla policy gradient is,
then we refit the baseline.
So in this case,
the baseline is gonna be an estimate of the average of the Gs [NOISE].
So in vanilla policy base,
bu- um, vanilla policy gradient.
What we do is the next step is,
we sum over all the trajectories we've got so far.
We just sum over all the time steps.
Um, we do basically, just a least squares fit.
So note this can be done with like su- this is supervised learning.
We just have some baseline function that can be parameterized [NOISE].
I'll make sure to put an i there.
So the baseline function that can be
parameterized with some totally other weights or parameters,
um, and then we have our returns g that we've
seen so far and we just try to minimize that distance.
And so then the baseline is really, er,
representing the expected sum of rewards.
Um, note that this is in some ways a little bit funny, right?
Because we're using all of our data that we've ever seen.
So this can either be done over, um,
all the data you you've ever,
ever seen or it can be done over just the most recent round.
There's lots of choices for how to do the baseline.
Um, I- if you use all the data you've ever,
ever seen, um, then which is what this would do.
Um, then you could be averaging over lots of different policies,
because you've got data from different policies.
If you just do this over the most recent round,
then you're just gonna be getting an estimate of essentially V_Pi i- V_Pi i,
like the- the iteration.
This is gonna end up approximating. All right.
If you ju- are only doing it over, um,
the trajectories, if you don't do this,
but you sum it over the trajectories for this round.
I guess the way I've written this is a little bit unclear.
So let me see if I can make that a bit clearer.
So let's say that we have a- a- a- we have d trajectories.
So if we do it this way,
then that's exactly equal to V_Pi i.
So i now is the iteration,
d is the trajectories we've gathered just on iteration i.
So this is only averaging over, um,
the policies for this particular- the- the trajectories for this particular policy.
I said that a little bit out of order.
Does anybody have any questions about exactly what we're doing in this case?
So normally, in this situation there's a number of series of rounds.
This is for each- So we're gonna have a series of pi's, basically.
And then for each policy,
we're gonna have a set of trajectories.
And for each trajectory we have a set of time steps.
And what this is saying here is average
over all the trajectories you have for the current policy,
and fit the baseline to that.
All right. And then once you have that,
so this gives you the baseline.
Um, and then we do update the policy using your gradient.
And it's gonna be a sum of terms that include these derivatives with respect to
the policy and your advantage function.
Okay. So you're gonna take in this advantage function that you computed over here.
And then you're gonna be multiplying it by what was
the probability of the actions given the state and theta.
The log derivative of that.
And then we plug that,
this gr- this estimate of the gradient,
into something like stochastic gradient descent or ADAM or something else.
So this has been vanilla policy gradient.
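To tie those steps together, here is a minimal runnable sketch of the vanilla policy gradient template on a toy one-step problem. Everything about the setup is an illustrative assumption, not from the lecture: the two-action environment, the linear softmax policy, the scalar baseline (there is only one state), and the batch size and learning rate.

```python
import math
import random

random.seed(0)

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_return(action):
    # Toy one-step "environment": action 1 pays reward 1, action 0 pays 0.
    return 1.0 if action == 1 else 0.0

def vanilla_policy_gradient(iters=300, batch=32, lr=0.5):
    theta = [0.0, 0.0]   # policy parameters: one logit per action
    baseline = 0.0       # b(s): a single scalar, since there is one state
    for _ in range(iters):
        probs = softmax(theta)
        # 1. Collect a batch of (one-step) trajectories with the current policy.
        actions = [0 if random.random() < probs[0] else 1 for _ in range(batch)]
        returns = [sample_return(a) for a in actions]      # the G_{i,t}
        # 2. Advantage estimates: A_hat = G - b(s_t), with last round's baseline.
        advs = [g - baseline for g in returns]
        # 3. Refit the baseline: the least-squares fit is the mean return.
        baseline = sum(returns) / batch
        # 4. Gradient step: sum over steps of grad log pi(a_t | s_t) * A_hat_t.
        grad = [0.0, 0.0]
        for a, adv in zip(actions, advs):
            for j in range(2):
                # Derivative of log softmax w.r.t. logit j: 1{j == a} - pi(j|s).
                grad[j] += ((1.0 if j == a else 0.0) - probs[j]) * adv
        theta = [th + lr * g / batch for th, g in zip(theta, grad)]
    return softmax(theta)

probs = vanilla_policy_gradient()  # should put most probability on action 1
```

In a real implementation the policy and baseline would be neural networks and the update would go through something like stochastic gradient descent or ADAM, as mentioned above, but the loop structure is the same.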
And what we're going to see during the rest of today
is just a number of different slight variants on this basic template.
So I'll get to you in just a second,
but I just want to emphasize that if you- if
you- when you walk away from unders- like from what I'd like you to understand,
from the- the main idea for policy gradient,
is essentially what's on the board right now.
Is that, what we're doing is we are running,
we take one policy,
we get a bunch of data from it,
and then we have to fit something like an advantage,
and there's going to be different ways to compute that.
We could end up doing bootstrapping,
to do some sort of TD estimate,
or we can just directly use the returns.
We often use a baseline,
um, that we're fitting over time.
And then we're going to update the policy,
and have to choose some step size with respect to the gradient.
So this is sort of the most important thing.
Is to say, hey, there's this basic template for almost all policy gradient algorithms,
I can choose different things to kind of plug in here,
and I can choose different ways to take my step sizes.
Um, and that's going to define
a whole bunch of the different policy gradient algorithms that you see.
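As one example of plugging a different advantage estimator into that template, the bootstrapped TD option mentioned above can be sketched like this (the function and argument names are mine, not from the lecture):

```python
def td_advantage(reward, value_s, value_s_next, gamma=0.99, terminal=False):
    """One-step TD (bootstrapped) advantage estimate:

        A_hat = r + gamma * V(s') - V(s),

    with the V(s') term dropped at terminal states. This has lower variance
    than the Monte Carlo estimate G_t - b(s_t), at the cost of bias from the
    learned value function.
    """
    target = reward if terminal else reward + gamma * value_s_next
    return target - value_s
```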
So what function are we using to represent pi so that we can take its gradient?
Great question. So, um, the question is, you know,
how- how do we represent,
um, the policy so we can take its gradient.
We have to be able to take this here.
Um, we talked briefly about this last time, um,
but it was also on the board near the end: Gaussians and softmaxes both work-
for both of those you can analytically take the derivative-
and often we use shallow or deep neural networks.
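As a sketch of what "analytically take the derivative" means, here are closed-form score functions (gradients of log pi) for a softmax and a Gaussian policy (the Gaussian with respect to its mean), with the softmax one checked against finite differences. The specific parameter values are made up.

```python
import numpy as np

def log_softmax_prob(theta, a):
    e = np.exp(theta - theta.max())
    return np.log(e[a] / e.sum())

def softmax_score(theta, a):
    """d/dtheta log softmax(theta)[a] = onehot(a) - softmax(theta)."""
    e = np.exp(theta - theta.max())
    p = e / e.sum()
    g = -p
    g[a] += 1.0
    return g

def gaussian_score(mu, sigma, a):
    """d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2."""
    return (a - mu) / sigma ** 2

# sanity-check the softmax score against finite differences
theta = np.array([0.2, -0.1, 0.5])
eps = 1e-6
for i in range(3):
    t2 = theta.copy()
    t2[i] += eps
    numeric = (log_softmax_prob(t2, 0) - log_softmax_prob(theta, 0)) / eps
    assert abs(numeric - softmax_score(theta, 0)[i]) < 1e-4
print("analytic scores match finite differences")
```

With a neural network policy, the same quantity is what autodiff computes for you.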
Yes. I saw a question back there? And name first, please.
Ah, I was wondering if there's any, ah,
issue with like non-states- if we're
getting like b of- the baseline with the neural networks,
there's like non-stationary issues with that?
Yeah, it's a great question.
So, ah, the question I believe is to say,
um, you're asking me about the baseline, right?
So like how- are there non-stationary issues with that?
Empirically what a lot of people, including myself,
have wondered is, um,
we have all this other data.
So when we're estimating the gradient right now,
typically we're running the policy, just you know,
for D trajectories and then we're estimating a gradient with that.
Um, and could we maybe use other data to do that,
but then ends up being off policy,
because then you're mixing together data you've gathered from different policies.
Empirically, I think people often end up using only the data from the current run,
and then you're essentially just estimating V_Pi,
with this and you're not necessarily mixing data for many other policies.
Empirically, it seems like often,
it's really helpful to be on policy.
A- and you could reweight the old data,
ah, but that introduces variance.
And so empirically often, it's best.
I think the jury is still out on it.
There's ongoing research on it.
We've looked at it, Sergey Levine's group has looked at it,
but most of the time using the on-policy data makes sense.
Yeah, is there another question? And name first, please.
I just want to confirm, so when you saying refit baseline,
we're setting baseline equal to the value that minimizes the function error.
[NOISE]
Perfect. Yeah, for the error. If we're
only averaging over the data points that we have for this current policy,
when we do this, it's
essentially the same as when we were doing Monte Carlo policy evaluation.
So this is almost exactly like Monte Carlo policy eval.
Where we have a fixed policy and then we have a parameterized function to represent it.
Um, and then we just want to fit those parameters so we can best
estimate the policy value using Monte Carlo.
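A tiny numerical illustration of that point, with made-up returns: the baseline that minimizes the squared error to the Monte Carlo returns is just their sample mean, which is exactly the Monte Carlo estimate of V pi at that state.

```python
import numpy as np

# Monte Carlo returns observed from several visits to one state
# under the current policy (numbers invented for illustration)
returns = np.array([4.0, 6.0, 5.0, 7.0])

# refitting the baseline b(s) = argmin_b sum_i (G_i - b)^2:
# the least-squares minimizer is the sample mean ...
b = np.mean(returns)

# ... which is the Monte Carlo policy-evaluation estimate of V^pi(s)
print(b)
```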
Okay. So I'm going to- there's a little bit of
information about auto diff you can check it in the slides.
Um, uh, the things we're going to go through next, um,
is thinking about this aspect as I was saying,
and then we'll talk some about this.
So this part is going to be where we think about monotonic improvement.
Because once we have a gradient,
we have to figure out how far to go.
An- and can we guarantee, um,
depending on how far we go,
whether or not we're going to get a monotonic improvement.
Um, and this part is about sort of giving better estimates of our gradient,
ideally with less data,
um, and reducing the variance of it.
So they're both important,
they're doing slightly different things. All right.
So let's talk about,
ah, could we move this up, please? Thank you.
Okay. So like we sort of started talking about before,
um, well, let's- let's talk about first the baseline.
So how should we choose the baseline?
Um, one thing that we can do for the baseline,
is just to- like what that what we're seeing there,
which is an empirical estimate of V_Pi i.
So we could say, in general,
we wanna just have- use V_Pi i as a baseline.
That means we have to compute it somehow.
And the way we estimate that could be from Monte Carlo
or it could be from TD methods. All right.
So what we've seen so far,
is using these as a- so- so I guess just to be clear here,
there's a couple different places we're going to be able to maybe switch
between doing Monte Carlo returns and doing something TD-like.
One is here, and another is our baseline.
Okay. So we have a baseline function here that we're sort of subtracting off.
And we also have a G_t prime here, okay?
And so if we think about our general equation again,
so what we have in this case is we have nabla theta of V of theta.
This is our parameter, this is specifying our policy parameters.
And we've said this is approximately equal to 1 over m sum over i = 1
to m of some reward- Well, I'll put this inside.
Sum over t = 0 to T - 1,
of,- I change my mind.
Okay. I'm going to put this out here because it's going to end up being
sort of a function we can use in lots of different ways.
Okay.
So this is
our basic equation we've been working with.
We've said the derivative of the value with respect to our policy parameter is
approximately as sum over m trajectories,
where we've sampled those trajectories from that policy,
times the total reward we've gotten on that trajectory,
times the sum over all time steps of the derivative of the policy with respect to,
um- given the action we took in the state we were in.
All right. And we said it was very noisy, um, but unbiased.
And now we can think of changing this as a target.
So this here was an unbiased estimator of the value of the policy.
And now we can think about substituting other things in. All right.
So we can imagine doing all sorts of things here.
We cou- we could do, um, you know,
TD or MC methods.
If we do it with a value function or if we try to
explicitly compute a value function or a- or a state action value function,
then we typically call this a critic.
So a critic computes V or Q.
So when we talk about actor-critic methods,
that's when we have an explicit parameterized representation of the policy,
and we have an explicit generally parameterized representation
of the value or the state-action value.
And if we have that,
then we can imagine using that to change what our target is.
I want to emphasize here that so,
actor-critic methods combine these two,
combine policy plus critics.
And probably the most popular one of this is A3C,
which is by Mnih et al.;
this was introduced at ICML 2016.
And it's been hugely popular.
This is a version for deep neural networks.
Um, but actor-critic ideas themselves have been around for a lot longer than that.
But A3C is one of the most popular versions of this for deep neural nets. All right.
So how do we do sort of policy gradient formulas with value functions?
What you could do instead here, I shall put on this side.
What you could do is, you could have almost the same equation as we had before.
So derivative with respect to the value function is equal to
an expectation with respect to the trajectories that you might encounter,
the sum over all the time steps in that trajectory,
times the derivative with respect to your policy parameters,
times Q of S_t, a_t, with parameters
w, minus b of S_t.
So instead of having your Monte Carlo estimates in here,
you could plug in your estimate of the Q function.
And another way to represent that here is if we think this is an estimate of the value,
and this is basically our advantage function again.
But it could be our- so we had an advantage function over here.
You define that advantage function here,
but this one was a function of the Monte Carlo returns for that episode.
This is a different advantage function, which
is the Q function, where this could be maintained by a critic,
minus your baseline, which is an estimate of the value function.
So they look pretty similar,
but you can plug in different choices here.
And these are going to have different trade offs.
So the Monte Carlo estimate of
the return is an unbiased estimate of the value of the current policy.
This is going to be biased generally,
but lower variance. All right.
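A small made-up example of the two choices of target just described: the Monte Carlo advantage, G_t minus V(s_t), and the critic-based advantage, Q(s_t, a_t) minus V(s_t). All numbers are invented for illustration; a real critic's estimates would come from a fitted function.

```python
import numpy as np

rewards = np.array([0.0, 1.0, 0.0, 2.0])   # rewards along one episode
gamma = 0.9
V = np.array([1.5, 1.8, 2.0, 2.2])         # critic's estimates V(s_t)
Q = np.array([1.4, 2.6, 1.7, 2.2])         # critic's estimates Q(s_t, a_t)

# Monte Carlo advantage: unbiased for the current policy, high variance
G = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    G[t] = running
adv_mc = G - V

# critic-based advantage: generally biased (the critic may be wrong),
# but lower variance
adv_critic = Q - V

print(adv_mc, adv_critic)
```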
So I also want to emphasize here that when we think about,
um, kind of getting this estimator,
which we often say the critic is going to compute this estimator, um,
It doesn't have to be only either a TD estimate or a Monte Carlo estimate,
but you can interpolate between these.
It's often known as n-step returns.
So what does that mean?
So let's call- let's write this in a slightly, well, I'll call this here.
So let's put this is a hat.
Okay? Just to note that you can think of this as kind of just a function.
It's going to be an estimate of your state action value function.
And so what we could have is an estimator of the value from time-step t onwards,
which I'm going to call R of i, 1- the one-step version for episode i.
And this is going to be the actual reward you got on time-step t in episode i,
plus gamma times V of S_t + 1 in episode i.
So this should look almost exactly like TD(0)-style estimates.
We talked about this before.
So this says, I got- I look at the actual,
immediate one-step reward I got and then I bootstrap.
I add in the value.
So this would be- I get this value function for my critic.
And I would plug that in and then that would be my target.
That then I would use in this equation.
Okay. So that would be one thing you do- you can do.
So we've seen this,
and we've seen a lot of this,
which I'm going to call the infinite or Monte Carlo version.
And this one is: you sum over all t prime,
from t all the way to the end of the episode,
of gamma to the (t prime minus t), times r at t prime.
So this is the Monte Carlo return,
where we just sum up, we don't do any bootstrapping,
and we sum up all the rewards at the end of the episodes.
But as you can see here, there should be, you know,
there's probably some way to interpolate between these two.
And these are often known as n step returns.
And so for example,
you could do this, you could say,
I'm going to add in the reward at time step t and
the reward I got at time step t + 1.
And then I'm going to bootstrap.
So this is just sort of one of the estimators that are in between these two extremes.
One is you only take in one step of reward,
another one is do you sum up all the rewards,
and then there's a whole bunch of interpolation you could do
between those. Why would you want to do that?
Well, this one is generally going to be somewhat biased, but low variance.
This is going to be unbiased,
but really high variance.
And there's no reason to assume that
the best solution is on either of those two extremes.
And so you could interpolate between a TD estimate and a Monte Carlo estimate.
And all of these just form
returns that then you could subtract off a baseline for too.
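The n-step interpolation described above can be sketched as a single function; the rewards and critic values below are made up for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): sum n real rewards from time t, then bootstrap with
    the critic's value estimate if the episode hasn't ended."""
    T = len(rewards)
    end = min(t + n, T)
    R = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                           # bootstrap from V(s_{t+n})
        R += gamma ** (end - t) * values[end]
    return R

rewards = [1.0, 0.0, 0.0, 1.0]            # made-up episode
values  = [0.5, 0.4, 0.6, 0.9]            # made-up critic estimates V(s_t)
gamma = 0.9

# n = 1 recovers the TD(0)-style target: r_t + gamma * V(s_{t+1})
print(n_step_return(rewards, values, 0, 1, gamma))
# large n recovers the Monte Carlo return: all rewards, no bootstrap
print(n_step_return(rewards, values, 0, 100, gamma))
```

The choice of n trades off bias (from the critic) against variance (from the sampled rewards).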
So traditionally, all this would probably be
a sort of hyper-parameter that we'd choose through validation or cross-validation,
right? Is that what people do here?
Or is that kind of too computationally intensive,
so you just have to pick something?
Question is if we were doing standard machine-learning,
this would be just considered as some sort of hyper parameter.
You could treat this as an n, and you would decide
how many steps to do.
And the question was: do people do that in
reinforcement learning, or is that considered too expensive?
You certainly could. I- I think it's an interesting question.
I feel like, for the tricks that people do in this case,
[NOISE] I think I probably see more of this, but it varies: more using the TD(0) estimate,
doing a lot of bootstrapping.
But it probably depends on your application domain.
Another thing that it would likely depend on is if your domain is really Markov or not.
So this case this is still working,
and this is giving you a real estimate of the return of
your policy even if your domain isn't Markov.
This case you're making a much stronger Markovian assumption.
So you also might want to make- do different things depending on your domain.
And also how expensive it is to collect data. All right.
So this gives us sort of a different way to plug things
in to that vanilla policy gradient algorithm I put over there.
So you could plug in these sort of targets instead,
over there to trade off between bias and variance,
when you're doing this estimate of the gradient.
So what this is doing here is it's changing what our targets are,
and it's changing how we're computing our gradient [NOISE].
But then the next thing I wanna talk about is this part.
Which is once we've actually got our gradient,
however we've chosen to get it.
We have to actually figure out how far to go along that gradient.
[NOISE] So why might this be important?
Well, it might be important because,
this is just a local approximation.
You're giving your local estimate of the gradient. Yeah.
How often do you update the parameters of the critic?
The question was how often do you update the parameters of the critic?
It's a great question, again, it depends.
So you can either- you can do this often asynchronously.
So you can have different threads and different networks for
your critic and for your policy and principle,
you could just be updating your critic all the time like, you know,
you can be using DQN for this and doing lots and lots of backups.
In general, it depends,
I think you'd have a schedule.
Yeah. So often you might do something like 10 or 100,
it really varies by application, um.
Uh, but there's no reason that the critic
needs to be updated only on the same schedule at which you're updating the policy.
And doing it asynchronously often makes a lot of sense.
All right. So if we think about what's happening here,
here's our parameterized policy.
Here's our value. We have some crazy function.
Okay. And then we are computing our gradient.
And this gradient, locally, is pretty good.
So kind of around here, things look linear and pretty good.
But of course, as we get further out, like here, it's gonna be bad.
Like if we- if we try to follow the gradient too far,
we're going to get an estimate that's very different
than the real va- real value function.
So when we're taking step sizes in this case,
it ends up being important to consider this fact of sort of
how far out do you want your step sizes to be.
Let me just get this back to one [inaudible] Okay.
So we want to figure out how far we should go in the gradient and that's important.
Now, you might say,
okay this is always true.
Right? Like you always need to be careful when you're doing
gradient descent or ascent in any supervised learning problem.
Whenever you're using stochastic gradient descent.
Of course, you don't wanna go too far along your step size because you could overshoot
and you're using this linear approximation and it's bad.
Why does anybody- does anybody have any sense about why this might
be even worse in the reinforcement learning case?
Why might it be even more important to think about this step size.
And it has to do with where the data comes from. Yeah.
So when you have a bad policy,
that affects the data you collect, and you might just go down a bad road.
Sure. So she's exactly correct.
In a supervised learning case,
your data is being generated by an IID distribution,
it doesn't matter what choices you just made for your stochastic gradient descent.
In RL that is determining the next policy we're
using inside of our iteration to gather data.
So we're not going to, you know, if we take a really bad, if we get really,
really bad policies, we just may be getting
no data towards the actual optima of this function.
So it's even more important to sort of carefully
think about where we're going along here and ideally hopefully get monotonic improvement.
So this is- this is the really,
it's very important in the reinforcement learning case,
to think about how we're doing this step size, because
this determines the data we collect:
theta determines pi, and therefore the data.
And one version of, um,
one of my colleagues talks about a similar problem.
He sort of has the picture of the Road Runner running off the cliff, right?
And like that, if you're in a region where
your policy is just really, really bad,
you may get no more useful data.
Then you can't get a good estimate there of the gradient and then you're just stuck.
You might get in a really really bad optima.
Okay. So we'd like to think carefully about this part.
So one thing that we could do,
is do something like line search.
So we're talking about right now sort of how do we do,
so how to do step sizes.
So one thing we could do is try to do some sort of line search along the gradient.
[BACKGROUND]
And this is, um, okay
but it's a little bit expensive.
So it's simple but it's expensive.
And it tends to sort of ignore where the linear approximation is good.
So we'd like to do better than this.
Okay. So now we're gonna go back to that point that I mentioned at the start
which is what we'd really like to be able to do is when we're doing
this updating we would like to ensure monotonic improvement.
And so can we kind of choose our step sizes in a way or
choose how far along the gradient to go in order to achieve monotonic improvement.
So what our goal is gonna be is we'd like to have it.
So that V pi of i + 1 is greater than or equal to V pi i.
And we're hoping to achieve this by changing how big of a step size we take.
All right. So let's think about what our- our objective function is again.
Um so we're getting- we have our value, our parameterized value.
So V of theta is equal to the expected value, under the policy that's defined by theta,
of the sum over t = 0 to infinity of gamma to the t, times r of S_t, a_t.
Okay. And this is where we just sort of
look at the series of states we get under our policy.
So that's our basic equation here in terms of
expressing the value of a parameterized policy.
And what we would like to do here is we would like
to get a new policy that has a better value.
But the problem is that we have samples from an old policy.
So when we're doing this we're gathering policies with
pi i and then we're trying to figure out what our pi i + 1 should be.
So this is fundamentally going to involve off-policy estimation.
Um so we have access to- we have access to
trajectories that are sampled from pi of theta.
And we now wanna sort of predict the value of v of pi of theta.
I'll put pi i, i + 1.
So we'd like to sort of now figure out what a new value would be
if we update these, update these parameters in some way and take like a max.
You know we'd like to figure out what the new parameters are.
But this is fundamentally an off policy problem because we have
data from our last policy and we wanna figure out what our next policy should be.
Okay. So what we're gonna do is we're gonna first re-express um the value
of our policy in terms of the advantage oh- the value
of our new policy in terms of the advantage over our old policy.
So I'm gonna move down to vanilla policy gradient for now.
[NOISE]
Okay. So what we have is we have V of Theta tilde.
So that could be like our new parameterized policy is
gonna be equal to the value of our old parameterized policy.
So whatever we had before plus the following.
The distribution over the states and actions we'd get if we
were to run our new policy. Now we don't know that.
But let's ignore that for a second: a sum over
t = 0 to infinity of gamma to the t, times the advantage under pi.
Okay. So this- this just generally holds.
Um, this doesn't have to do anything necessarily with being um, parameterized.
This is just saying the value of any policy which is here
parameterized by pi tilde is equal to the value of another policy plus
the sum over the states and actions you'd reach under
your target policy of the advantage
you get of taking this new policy over the old policy.
Okay. So that um, that just expresses how we can say what the
va- how uh, the value of a new policy relates to the value of the old policy.
It's exactly the same as the old values policy
plus the advantage you'd get if you were to
run the new policy and look at the state action distribution you'd encounter. Yes?
Should the subscript be Pi tilde on the advantage?
Should the subscript be Pi tilde on the advantage?
[OVERLAPPING] policy?
Yeah. So we're doing in it- let me write this up thing.
So [inaudible] question is a good one, let me just write this out in full.
Okay, so we've got V of theta plus sum over all the states
and we're gonna use mu pi tilde of s. So remember this was the stationary distribution.
Um we use this to denote the stationary distribution over states that we'd
reach um if we were to run our new policy which is parameterized tilde.
This is theta tilde. Okay? Um times the advantage function.
Okay. So what this is saying here is this S_t and a_t are under our desired policy.
And the advantage here is using the old one.
Okay? So this is allowing us to compare.
So what does this do? It's allowing us to compare
Q of S_t, a_t minus
V of S_t, under our old policy.
So it's allowing us to compare how much better it is if we take our new action.
Okay? All right.
But one of the problems of this is we don't know this.
Yeah?
Shouldn't there be a sum over t = 0 to infinity somewhere?
Oh, thank you um.
And answering- yeah. Yeah.
Yeah.
And also again [OVERLAPPING]
So the question is is [inaudible]?
No. And thank you for allowing me to clarify that.
So in this case what we're doing is we're taking um
an expectation over all time-steps and
this is saying over the trajectories that we'd get to under our new policy.
I've now reformulated and said well we have a stationary distribution over states.
If we look at what is the probability of reaching
those states and then we weight that by the advantage.
So we've gone from a time averaging to a state averaging.
Does that makes sense?
So we can either think of our value function or averaging our value function across
time-steps where we can think here is
averaging across all the states and what is for each state.
What is the relative value you get by following your new policy versus your old policy?
[inaudible].
Oh sorry. Thank you. Then that- those are typos.
Okay. All right.
So this would be under-
so we look at the states that- for each state what is
the probability we'd reached that state under
our new policy and what is the relative advantage?
The thing that you're pointing to should not have
[inaudible].
Oh sorry. We have a tilde over here, this is our value of our original one
plus this we get the advantage term over all the states. Yeah.
Is there a difference between pi theta tilde and tilde pi?
I was just doing this here to make it clear.
Pi tilde is
parameterized by- this is a policy parameterized by theta tilde.
Yeah. I'm just saying it's like in the expectation they wrote in the first slide
[inaudible]
Okay. We can vote on that side.
But either- I- I wanted to just be clear um, often this notation goes back
and forth between using do you wanna make
the policy explicit as opposed to just the parameters.
Um I think it's more clear to have a policy um and parameters here but often we also
use the- you can just directly parameterize the value function in terms of
theta as opposed to V pi tilde of theta.
But I'm gonna just use any of these is fine.
Is anybody confused about what this is?
I mean if it's easier I can just go like this.
Okay. So I can just remove all of this.
Okay. So this is just whenever I say pi tilde
that's the policy that's parameterized by the new- that's your new policy.
Okay. And that's this.
I know it's a lot of notation.
Does anybody have any questions about that notation last? Yeah?
Yeah.
Name first please.
So I guess I'm a little bit confused mostly
just because it's a little bit different from the slides.
And I'm just wondering-
It shouldn't be different from the slides, but I'm trying to go-
[LAUGHTER] I was wondering sort in this case do we
sum over the possible actions for a given state or
as we've noted here do we assume that we take
a single action per uh by using the policy which is what we have here on the board?
You are right. I forgot that right.
I'm gonna go. Repeat that again.
Okay. Okay. Let's say V of theta
tilde- I'll go with the exact same notation as the slides- is equal
to V of theta plus a sum over states of
the stationary distribution mu pi tilde of s. This is the distribution we get.
Um, this is the discounted weighting over states under our target policy.
Under pi tilde. Okay. Sum over A.
Okay.
So this is the weighted average,
the weighted distribution over the states.
We went from the time domain to thinking about the distribution over
states times looking at all the actions we might take under
our target policy and the relative advantage of each of those
over our previous policy, [NOISE] okay?
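Written out in the lecture's notation, the identity on the board (sometimes called the performance-difference lemma) is:

```latex
V(\tilde{\theta})
  \;=\; V(\theta)
  \;+\; \mathbb{E}_{\tau \sim \pi_{\tilde{\theta}}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, A_{\pi}(s_t, a_t)\right]
  \;=\; V(\theta)
  \;+\; \sum_{s} \mu_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)
```

where mu pi-tilde is the discounted state visitation distribution under the new policy and A_pi(s, a) = Q_pi(s, a) - V_pi(s) is the advantage under the old policy.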
And this should look very, very similar to imitation learning in certain ways, right?
Like, so we're again sort of thinking about,
um, instead of thinking about subbing rewards over time steps,
we're thinking about what is the stationary distribution we might get to under
a new policy and how that compares to
the stationary distribution we would have had under our old policy.
Um, and what we're looking at so far is,
we're looking at the [NOISE] stationary distribution under our target policy.
The problem is, we don't know this,
so we can't calculate this.
This is just an expression.
Um, uh, but this is unknown [NOISE] because we don't have any samples from pi tilde,
we only have samples from pi, okay?
So we can't compute this.
Why would we, and just to go back,
why are we trying to do any of this?
We're trying to do this because when we do vanilla policy gradient,
[NOISE] we're gonna be trying to figure out a new policy at a,
that has a value that's better than our old policy.
What we did here is, we tried to estimate the
derivative out of the current pol- of the current policy,
um, but we don't know anything yet about the value once we take that step.
And so what we're trying to do here is to say, well,
can we somehow understand what the value will be of a new policy before we execute it?
And we're gonna do that by trying to relate it to what is our value of
the previous policy plus some sort of distance between the old policy and the new policy,
ideally computed in terms of things we can actually evaluate using our current samples.
That's where we're trying to go to, okay?
[NOISE] Right.
So we have this nice expression, but we can't compute it.
So we're gonna make up a new objective function, okay?
We're gonna do this one sort of backwards because we're [NOISE] gonna make it up,
and then, we're gonna show why it's a good thing to do.
So what are we trying to do?
We'd like to use something like this.
If we had this,
then we could compare the value of the new policy to the value of the old policy.
The problem is, we don't have this because we don't have the, uh,
the stationary distribution under the new policy.
So what we're gonna do instead is,
we're going to define an objective function L_Pi,
[NOISE] which is as follows.
It's the value of your old policy plus a sum over all your states,
the stationary distribu- discounted distribution of your old policy.
This is where it's different.
So this is [NOISE] the old,
your current policy, okay?
And then, the rest of the expression looks the same.
[NOISE] Now, notice, we can compute this, okay?
This is, um, we could just
average over all the trajectories we have for the current policy,
we could estimate our stationary distribution from our current data,
we could know for a new policy what its action or,
so if someone gives me a new policy pi,
I could evaluate this,
and I could also evaluate the advantage, okay,
because this advantage is defined only in terms of my previous Pi.
So as long as if I have a, uh,
a representation of the state action value function
for my old policy, I could evaluate this.
So now, all of this is evaluatable.
[NOISE] This might not be a good thing to do,
but we can compute this.
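As a sketch with entirely made-up numbers, here is why L_pi is computable from on-policy data: the state distribution and the advantages come from the old policy, and only the action probabilities come from the candidate policy. Note also that plugging the old policy itself in gives back V of theta, since true advantages average to zero under the old policy.

```python
import numpy as np

# quantities estimable from the OLD policy's data (numbers invented):
mu_old = np.array([0.6, 0.4])             # state visitation under pi
A_old = np.array([[-0.5, 0.5],            # advantages A_pi(s, a),
                  [0.5, -0.5]])           # zero-mean under uniform pi
V_old = 3.0                               # estimated value of pi

def L(pi_new):
    """Surrogate objective: pi_new[s, a] = prob of a in s under the
    candidate policy. Everything else uses old-policy estimates."""
    return V_old + np.sum(mu_old[:, None] * pi_new * A_old)

pi_same = np.array([[0.5, 0.5], [0.5, 0.5]])   # the old (uniform) policy
pi_cand = np.array([[0.1, 0.9], [0.9, 0.1]])   # a candidate new policy

print(L(pi_same))   # recovers V_old
print(L(pi_cand))   # computable without running pi_cand
```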
[inaudible]
and then, like, giving that to pi itself?
Yes. [inaudible] in terms of sort of notation where, like, yeah,
I'm using pi tilde interchangeably with,
you can, they- they're just some new parameters for computing.
So our policy is always parameterized by some set of parameters.
You can either think of that as just being a new policy,
or you can think of that as the new parameters, either is fine.
Uh, so could quickly explain why [NOISE] if you're given, um,
the pi tilde, why you won't be able to calculate the, uh, new [OVERLAPPING]?
You don't have to. So, um, so the question is,
if you're given pi to, uh, the new policy,
why can you not compute mu?
It's a good question because you don't want any data from that.
So someone has given you a new polic- [NOISE]
the- there could be ways to approximate [NOISE] this,
but the only data that you have right now is from [NOISE]
the current old policy, [NOISE] from Pi.
So you've run this out M times,
you've got M trajectories,
which are M trajectories gathered under your old policy pi, okay?
You don't have [NOISE] any data from pi tilde.
And in general, if pi tilde is not the same as pi,
you're going to get different trajectories.
So you don't have any direct estimate of this.
Does that makes sense to everybody?
So if we go back to [NOISE] the vanilla policy gradient, what do we do?
We had a policy pi_i,
we ran it out,
and we got D trajectories from that pi_i.
We could use that to estimate that mu.
That just gives us on policy data of what are
the states and actions we experience when we're [NOISE] following Pi_i.
We don't have any data of Pi_i + 1 yet,
we haven't run it. Okay. Yeah.
Uh, just to be clear,
to get an estimate of the stationary distribution for the old policy, [NOISE] uh,
you basically [NOISE] look at all the data that you have, uh, like,
all the trajectories, and see basically what fraction of
the time you're spending at one state. Okay.
Exactly. So [inaudible] is exactly right.
[NOISE] How would you go from this just raw data,
these te- D trajectories to mu?
You could just count, you know.
I mean, in general, if you're in really high dimensions,
you want to do something smoother than that,
you want to approximate the density function.
Um, but essentially, in a tabular setting,
you could just directly count: [NOISE]
how many times did I get to this state and take this action? And then,
that would give you a direct estimate of, um, the mu's.
[NOISE] In general, you're gonna want some sort of
parametric function in high dimensions,
but you couldn't fit that using, you could,
you could imagine this is parameterized itself,
and you can fit that using your existing on policy data.
Uh, intuitively, does this work because we
assume that distribution of the states [NOISE] won't change too much between policies?
Oh, yeah. The question is, intuitively, why does this work?
I've not told you why this works,
I've just said this is something we could do and that it's computable.
[NOISE] Um, and I haven't told you yet why this is a good thing to do.
But we're gonna show that, um,
that this is going to allow us to get to something which is a lower bound,
um, and then, we can improve on those lower bounds.
Okay. [NOISE] So just a quick thing to notice here,
which is, if you do this,
if you do L_pi of pi,
[NOISE] so that's just what is
this objective function if you plug in the old policy there,
this is just equal to [NOISE] V of theta, [NOISE] okay?
So if you evaluate [NOISE] this function under the same policy,
it just gives you the value, okay? All right?
[NOISE] All right.
So conservative, [NOISE] I'll just briefly,
we'll have to continue this further later, but, um,
[NOISE] so we can use this to do what's known as conservative policy iteration.
So [NOISE] conservative policy iteration,
um, and the intuition here is,
let's just first just [NOISE] start with mixed,
um, a mixed policy.
[NOISE] So imagine that you have a new policy,
which is a mix of an,
of, um, an old policy and something different.
So you have 1 minus alpha times
your old policy plus alpha times some new policy pi prime, okay?
So that just means, you take some ol-
your current existing policy and you [NOISE] mix in something else, okay?
Then, in this case,
you can guarantee that the value [NOISE] of your new policy is greater than or equal to,
if you'd to take this objective function here and you evaluate it with your new policy,
so you take your new policy,
you evaluate under your old policy, you plug that in,
that's computable because you have data from
your old policy, minus [NOISE] 2 epsilon gamma,
over (1 minus gamma) squared, times alpha squared, okay?
So you can lower bound [NOISE] the value of
your new policy in terms of whatever this objective function
is when you compute it minus this expression, okay?
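Collecting that bound: for the mixture pi_new = (1 - alpha) pi_old + alpha pi', the guarantee stated on the board is

```latex
V^{\pi_{\text{new}}}
  \;\ge\; L_{\pi_{\text{old}}}(\pi_{\text{new}})
  \;-\; \frac{2\,\epsilon\,\gamma}{(1-\gamma)^{2}}\,\alpha^{2},
\qquad
\epsilon = \max_{s}\,\Bigl|\,\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\bigl[A_{\pi_{\text{old}}}(s,a)\bigr]\Bigr|
```

The definition of epsilon here follows the conservative policy iteration result of Kakade and Langford; the lecture did not write it out explicitly.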
Um, I just wanna close with two other thoughts, which is,
note again that if you plug in alpha = 0,
[NOISE] that means that pi new is the same as pi old,
and this goes to 0,
[NOISE] which means that,
and since we know that this is equal to that,
that just says that your new policy has to be greater than or equal to your old policy.
And since the policies are all the same, this is tight.
Okay. So we, um, this is a,
a little bit different than we expected because of,
um, [NOISE] the technical challenges with PDF.
Uh, so what I'll just close with here is that the next steps we'll go from this is to
show we can use this to essentially derive a lower bound on the new value function.
And we can show basically that if we improve across the lower bounds,
that we're guaranteed that the actual value function is monotonically improving.
[NOISE] So we will go through that.
Um, I haven't decided yet whether we'll go through
that on Monday because that's [NOISE] the midterm review
or if we'll wait on that until the following week after the midterm.
Um, the policy gradient,
uh, homework won't be released until after the midterm.
So we have a bit more time for that.
Um, and I'll go through also, like, the main takeaways
with policy gradient stuff when we conclude this part. Thanks.
