All right. We're going to go ahead and get started.
Um, uh, before I get into the technical stuff,
we'll do a little bit of logistics.
Um, so, we are starting these things called sessions.
Um, we announced them on Piazza.
If you're not getting our Piazza posts, definitely make sure you've signed up for Piazza, or send us a note.
Um, the sessions are designed to go into the material a little bit deeper,
also to discuss something about the homework.
Um, these are structured sessions, as opposed to office hours, where you can ask one-on-one questions about the homework. The sessions are designed to go a little bit deeper into the material,
and they were prompted both by feedback from some colleagues of mine about how much students have liked them in their other classes, as well as some requests from last year for opportunities to go deeper into the material.
So, we've announced these on Piazza.
The idea is that you will sign up for a session.
They are optional, you don't have to do them.
We will be giving one percent extra credit for attending them if you attend a sufficient number of them. Um, the details for that have also been announced on Piazza; if I've got any of that wrong, just email me.
Um, so, if you go to Piazza,
there's a number of different sessions you can sign up for.
The point of signing up for them is to make sure that we have room capacity,
but I'm pretty sure we'll be able to accommodate almost any session you want to go to.
The last session will be done via Zoom,
and it's particularly targeted at SCPD students,
but anyone is welcome to do it.
Um, the way that we'll be keeping track of whether or not people are going to sessions is, we will have a code mentioned inside of the material, and you will just write in that code to indicate your attendance. Um, we'll record the last session so that if, for some reason,
your schedule is such that you can't attend
any of these but you want to participate in session,
you can go through the material later and then
record that you attended it by using that code.
And we'll be relying on the Stanford Honor Code, that only people who actually do this will upload the codes. Does anybody have any questions about sessions and what those involve? Again, they're optional; they're a way to go deeper into the material. Um, students in other classes have really liked this sort of thing. You can see what you think; it's an experiment.
All right.
Any questions about anything else outside of sessions?
So, homework's been released, and office hours are happening as usual this week.
Feel free to come talk to us or use Piazza for any questions that you have.
All right. We're gonna go ahead and get started now then.
Um, as usual, I'd really appreciate it if you use your name whenever you're asking a question or making a comment.
So, today, we're gonna finally start to get into
making decisions where we don't have a model of the world,
and in particular, we are going to be focusing on model-free control.
So, the things that we're going to be covering today really focus on: how can an agent start to make good decisions when it doesn't know how the world works and it's not going to be explicitly constructing a model?
Um, and remember, a model, in this case,
is going to be a reward and/or a dynamics model of the environment.
So, today, we're gonna be looking at methods that do not involve constructing an explicit dynamics or reward model; instead, we're just going to be learning directly from experience.
So, last time, we were mostly talking about,
well, maybe we don't know how the world works,
we don't have these explicit dynamics and reward models,
but we're going to be trying to evaluate a policy that was provided to us.
And now, we're going to be thinking about the real problem that often comes up in reinforcement learning, which is: how should an agent make decisions when it doesn't know how the world works, and it still wants to maximize its expected discounted sum of rewards?
So, as soon as we have information about how good a policy is, then we can start to think about how to learn a good policy instead. And, in fact, at the very start of the class, we talked about how you would compute good decisions if you were given a model of the world. And so, that's what we're gonna be going back to now.
So, in particular, now,
we can think of starting to get at this issue of this optimization and exploration.
We're still not going to get into generalization yet.
Um, this will be happening soon.
Um, we've already seen this a little bit, it came up with planning, but now, we're going to start to think about how do we explore and how do we do optimization.
Well, I think I'm just gonna go through more of these as we start to go into this area.
So, um, again,
we're going to be thinking about how do we identify a policy
that has a high expected discounted sum of rewards.
There are going to be delayed consequences, which means our agent takes actions and may not see whether those actions were good or bad for a while,
and we're going to start to think about this exploration aspect.
Okay. So, let's start with where these types of problems come up, where people model things as Markov decision processes and maybe don't build a model.
So, I think probably one of the first really big examples of success for doing reinforcement learning in this sort of model-free way was for Backgammon, which was roughly 1994. They trained an agent to play Backgammon, the board game, actually using a neural network.
Um, neural networks went sort of out of fashion for probably around 10, 15 years,
and then came back, but in the early '90s,
people were using neural networks.
Gerald Tesauro used it for Backgammon and got some very nice results, and that was one of the first demonstrations of reinforcement learning in a larger setting, showing that you could solve these sorts of complicated games.
Um, many other problems can also be modeled as MDPs, whether games or robots or customer ad selection or invasive species management. And in many of these cases, we don't know the models in advance.
So, what we're going to be thinking about today is mostly situations where the model is unknown, but we can sample from it.
But there are occasionally cases, too, where you do know the model, but it's really, really expensive.
So, for something like computational sustainability or climate modeling, you might be able to write down a good model of the world, but it's really expensive to run, because actually simulating the climate is computationally really hard, and even then, your model will probably be approximate.
I raise this second point because, when we think about learning from the world, we mostly think of a robot running around in the world, and that being an expensive thing to do, because robots take real time to do this.
But you can also think about
agents that are learning to sort of interact with a simulator,
where that's also really costly.
All right.
So, what we're going to be thinking about mostly
today is what is known as on-policy learning,
where we get direct experience about the world, and then we try to use it to estimate and evaluate a policy from that experience. But we're also going to start to talk more about off-policy learning, where we get data about the world and we use it to estimate alternative ways of doing things. So, we can combine experience from trying out different things, to learn about something we never did by itself.
And the second thing is really important, so I'm just gonna talk about it briefly up here.
So, imagine you have a case where, say, there's only a single state for now: you're in S1, you do A1, you stay in S1, and then you do A1 again. Or you're in S1, you do A2, you're back in S1, and then you do A2 again. You'd like to be able to combine these experiences so you could learn about, say, doing A1 and then A2, even though you've never done that in the world. You've never experienced that full trajectory, but you'd like to be able to extrapolate from that prior experience. Learning about that sort of policy would be off-policy learning, because it's different from the previous policies we've tried.
We'll go into that more when we think about Q-learning.
All right. So, let's start with generalized policy iteration.
Okay. So, if we go back to policy iteration,
we talked about that a couple lectures ago,
and policy iteration, we originally saw it when we knew the model of the world.
So, it's a way for us to compute the right thing to do, where "right thing" means the policy that maximizes our expected discounted sum of rewards.
So, how do we do this when we know how the world works?
We're given our dynamics and reward model.
In that case, we initialize some policy, probably randomly. So, initializing would mean setting pi of s equal to some action a for all s, generally chosen at random.
Um, and then,
we did this policy evaluation procedure,
where we first computed the current value of the policy,
and then, we updated the policy.
So, we took whatever we had, and then,
we did this sort of one more thing,
which you can think of as kinda doing one more Bellman backup,
where we said, okay,
we're taking that V pi,
we're plugging it in over to here,
we're using the fact that we know the dynamics model and we know the reward model,
and we're computing this one-step updated pi prime.
And we talked about the fact that when we do this,
we actually get monotonic policy improvement,
which is sometimes referred to as the policy improvement theorem.
So, this procedure, in this case where we knew the dynamics model and the reward model, is guaranteed to always give us a policy that is at least as good as the previous policy, or better.
Um, and, eventually, it was guaranteed to converge at least in
the case where we have finite states and
actions because there's only a finite number of policies.
So, in this case, there were only |A| to the |S| possible policies. So, we only need to do this whole procedure at most |A| to the |S| times. Each time, we either pick a better policy, or we stay with the same one, and once you find the same policy, you're done.
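As a refresher, the known-model loop just described can be sketched in code. This is a minimal tabular sketch, not the lecture's notation: the array layout for `P` and `R` and the function name are my own assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration with a known model.

    P: (S, A, S) array of transition probabilities P[s, a, s'].
    R: (S, A) array of expected rewards.
    Returns a deterministic policy (length-S action array) and its value V.
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(S), pi]           # (S, S) dynamics under pi
        R_pi = R[np.arange(S), pi]           # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: one Bellman backup, then argmax over actions.
        Q = R + gamma * P @ V                # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # same policy twice: converged
            return pi, V
        pi = new_pi
```

Because there are only |A| to the |S| deterministic policies and each pass either improves or repeats, the loop terminates.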
So, now, we want to do all of this, um,
but we don't have access to the dynamics or reward model.
So, does anybody have any ideas of how we might be able to do
the same thing now that we don't know the dynamics or reward?
We can maintain a matrix of transition probabilities that you calculate as you experience the world.
Yeah. The suggestion, if I interpret it correctly, is: what if we try to estimate the dynamics model and the reward model from the world? Then you could still compute your value function, maybe using some of the methods we saw last time, and then you could do this update, as policy improvement, using your estimated dynamics and reward model of the world. That's a completely reasonable thing to do. Does anybody have any other ideas of what we could do? Yeah, name first.
Uh, instead of having to compute a model, can we do away with the model and directly try to estimate the value of a particular state or state-action pair?
So, estimate the value of a particular state?
A state and an action.
Yes, exactly.
What she said is exactly the path that we're going to look at today, which is, we're going to focus on model-free control. So we're not going to directly estimate a model today. I'm actually personally very partial to models, which can be very simple and efficient, but for today we're not gonna look at that. We're gonna do exactly what was just proposed, which is to compute a Q function. Just to remember: Q is always a function of state and action. We're gonna estimate the Q function directly, and after we have that, we can do policy improvement directly using that Q function.
So how would we do that?
So, this is Monte Carlo for on-policy Q evaluation, and it's gonna look very similar to Monte Carlo for on-policy value evaluation, but we have to make a couple of modifications. So before, if we were doing this for V, and I'm just gonna write it out to contrast: for V, we just had a count for each state. Now we have a count for each state-action pair. Before, we kept track of G of s, which was the sum of the returns we've seen across all episodes for state s. Now we're gonna do that for each s, a, and we're gonna end up with a state-action value function, Q pi.
So, essentially, almost everywhere where we just had s before, now we have s, a, and it's gonna look very similar. We assume that we're still provided a policy, we can sample an episode, and then we compute G of i, t for every single time step. Remember, each return is now associated with a particular state and a particular action at that time step; it was before too, but now we're going to use that. Then, for every state-action pair visited in the episode, instead of just every state, we update either the first time we saw that state-action pair or every time we saw it; just like before, we can do either first visit or every visit. There, we just update our counts, update our sum of total returns, and then estimate our Q function. It's basically exactly the same as before, except that now we're doing everything over state-action pairs.
Now, once we have that, policy improvement is even simpler than before. We're given this estimate of the Q function, and we can just directly take an argmax over it. So, we define our new policy at each state to simply be the argmax over actions of the previous policy's Q function.
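Putting the two steps together, here's a sketch of Monte Carlo Q evaluation followed by the argmax improvement step. The episode format, lists of (state, action, reward) tuples, and the function names are my own assumptions, not the lecture's notation.

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma=1.0, first_visit=True):
    """Monte Carlo on-policy Q evaluation over state-action pairs."""
    N = defaultdict(int)        # visit counts N(s, a)
    G_sum = defaultdict(float)  # summed returns over all episodes
    Q = defaultdict(float)      # unseen pairs default to 0
    for episode in episodes:
        # Walk backwards so G accumulates the return from each step onward.
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, a, G))
        returns.reverse()
        seen = set()
        for (s, a, G) in returns:
            if first_visit and (s, a) in seen:
                continue        # first-visit: count only the first occurrence
            seen.add((s, a))
            N[(s, a)] += 1
            G_sum[(s, a)] += G
            Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]
    return Q

def greedy_improvement(Q, states, actions):
    """Policy improvement: pi(s) = argmax over a of Q(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```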
Alright, so does anybody see any problems with doing this so far, for the type of policies we've been thinking about in the class? So far, the policies we've been thinking about are all deterministic, which means they are mappings from states to actions: we always pick a particular action for a particular state.
Yeah, in the back, and name first please.
The problem is we're never exploring.
Which is correct, but what's the problem with not exploring?
We only sample one path over and over again, and we never actually learn anything about the rest of the world that we don't see. So we don't know whether there's a better policy.
So, what he's saying is that maybe we're only going to sample one path. I think what he means is more than that: you can still sample different paths, because your state transitions can be stochastic, but you are only ever gonna try one action from a given state. So you're never gonna learn what it would be like if you took A2 instead of A1 in that state, which means that for any particular state, you'll only ever see one corresponding action. Whenever you see state S1, the only action you'll see will be A1, or whatever your policy says to do there. Which means you're not gonna have any information about doing anything else there, which means your policy improvement is gonna be pretty boring, because you're not gonna get any information about things you should be doing instead.
So, we're going to have to do some form of exploration now. We're gonna have to have some sort of stochasticity in our policy, or it needs to change over time, so that we can actually try different things, even from the same state, and learn what to do. Yes, name first.
My name is . Do we know the whole action space beforehand?
Great question. The question is: do we know the whole action space beforehand? Yeah, we're gonna assume that we do, at least for all of this lecture, and in general, yes.
What if you make the actions initially have high values? Then, after an action's value is computed, it'll probably come out lower, so the next time you see that state, you would try the other actions.
has made a very nice suggestion, and it relates to how we're initializing the Qs. So, one thing you can do, which is what he just suggested, is to initialize your Q function really high everywhere, to do what's known as optimistic initialization. That can actually be a really useful strategy for exploration, and if you initialize it in a particular way, then you can have a provable guarantee on how much data you're going to need in order to converge to the optimal policy. So, optimistic initialization is often a really good thing to do. You need to be a little bit careful about how you initialize things, like what those values should be, but generally, empirically, it's really good.
And formally, it can be very good too. We're not gonna talk about optimistic initialization today, but we will later in the class, when we talk more about exploration.
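As a concrete sketch of the idea just suggested (the numbers here are hypothetical, not from the lecture): initialize every Q value to an upper bound on the achievable return, so every untried action looks worth trying.

```python
import numpy as np

n_states, n_actions = 7, 2   # hypothetical sizes
r_max, gamma = 10.0, 0.9     # hypothetical per-step reward bound and discount

# With rewards bounded by r_max, the quantity r_max / (1 - gamma) upper-bounds
# any discounted return, so initializing there is optimistic everywhere.
Q = np.full((n_states, n_actions), r_max / (1 - gamma))
```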
So, doesn't this rely on the Markov assumption to be able to estimate Q, right?
Yes.
But my question is: we only define the policy in terms of the state, and if the reward that you get depends on the history you have, then that brings the Markov assumption in. Even though the reward is not Markov, we're defining a policy as if it were.
Yeah, so your real world may or may not be Markov; all the policies we're talking about right now assume the world is Markov. The policies are only mappings from the current state to an action. They are not a function of history. So those may or may not work well, because your real world may or may not be an MDP, and if it's not, then you're essentially considering a restricted policy class: only mappings from the immediate state to the action. And if what you should do really depends on the whole history, then you might not make good decisions. Good point.
Okay, so this is the basic way you would extend Monte Carlo to be able to start estimating Q, and once you have that, you can do policy improvement. But now it's clear that we need to do something about how we gather experience, so that we can actually improve when we try to do this policy improvement, because now we don't know how the real dynamics of the world work. So we need some sort of interleaving of policy evaluation and improvement, and we also need to think about how we're doing this exploration.
In general, it might seem a little bit subtle. We've already got one nice suggestion: maybe you could initialize everything optimistically, and maybe that would help you explore. It does. But in general, it might seem a little hard to get a good estimate of Q pi, because a really good estimate of Q pi of s, a, for all s and all a, would require you to get to every different state, take every possible action, and then follow pi from then onwards. So how do I make sure that I visit all of those things? What we're gonna talk about today is a very simple strategy to make sure that you visit things, which works generally under some mild conditions on the underlying process.
So, the really simple idea is to balance exploration and exploitation by being random some of the time. Let's imagine that there's a finite number of actions; we'll call that cardinality |A|. Then an ε-greedy policy with respect to a state-action value function is as follows. With probability 1 minus ε, you take the best action according to your current state-action value function, and otherwise, you take an action a with probability ε divided by |A|. So, with probability 1 minus ε, you take what you currently think is best according to your estimate of the Q function, and with probability ε, you select an action at random. It's a pretty simple strategy, and the nice thing is that it's still sufficient.
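In code, the action-selection rule just described looks roughly like this (a sketch; the dictionary-keyed Q and the function name are my own assumptions). Note that drawing uniformly with probability ε gives every action at least ε/|A| probability, with the remaining 1 minus ε mass on the greedy action.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon, explore uniformly over all actions;
    otherwise exploit the current Q estimate with an argmax."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit
```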
But before we do that, why don't we do a brief example to make sure that we're on the same page. So, let's think about how we would do Monte Carlo for on-policy Q evaluation for our little Mars rover. Now our Mars rover has two actions it can take, and we're going to be reasoning about both of them.
So, I've written down the reward function here. If you take action A1, you get the same rewards we've been talking about before, which are 1, 0, 0, 0, 0, 0, plus 10. And now I'm changing it: your rewards do depend on your state and the action you take, and the rewards for A2 are now going to be 0 everywhere, except you get a plus 5 at the end. Gamma is one, and let's assume that our current greedy policy is to take action A1 everywhere, and that we're using an ε of 0.5.
And we sample a trajectory from an ε-greedy policy. Again, what an ε-greedy policy means here is: I set ε equal to 0.5, which means that half the time we take our current greedy action, A1, and the other half of the time we pick randomly between A1 and A2. So, an example trajectory would be: state S3, action A1, reward 0, state S2. And here is a case where we sampled randomly: we flipped a coin and said, oh, this time I'm gonna be random; then I had to flip a coin again to see whether I'm taking action A1 or A2, and I took A2 there. I got a reward of 0, and then the rest of the trajectory is as follows. My question to you, and feel free to talk to a neighbor of course, is: what is now the Q estimate for all states, for both action A1 and action A2, at the end of this trajectory, using Monte Carlo estimates? We're doing first visit in this case.
Yeah.
Uh, I have a question about the action we choose in the ε case. When we act randomly, should we include the action the policy would have taken, or should we leave that action out?
The question is whether or not, when you decide to do something random, you should include the action that you would be taking normally if you were being greedy. Um, you could. In some ways, that's just like picking a different ε.
Yeah. I hear less talking than normal, so are there any clarification questions about this, or other questions? Good. [LAUGHTER] Sorry.
I have an idea.
Okay, yeah. So, if everybody's ready: what did you guys think?
Uh, so, in this case, every state-action pair you did not see will remain zero.
Yeah.
And in particular, Q of S3, A1 will be zero, because you saw it and the rewards along the way were zero; Q of S2, A2 will also be zero; and the only one that will be non-zero will be Q of S1, A1, which in this case will be 1, because you saw it once, and the reward that you got when you saw it was 1.
That's one answer.
Anybody with a different answer?
So, uh, all of the state-action pairs that we've seen will be one, and all other state-action pairs will be zero.
That's another answer. So, what was said first would be right for the TD case. That answer would have been right for last week, or for Monday.
Anyone else who may have a third answer?
Could you repeat what the second choice was?
The first choice is that we only update S1, A1. The second choice is that everything that we saw will now be 1, and maybe I misunderstood over there.
So, we're gonna have two vectors now. We have Q of A1, and we have Q of A2, and they're not gonna look identical. Sometimes we take action A1, and sometimes we take action A2, and we can only update, for the action we took, the returns that we saw. So, what actions do we take for S3?
A1.
Just A1, right? So that means, for S3, Q of S3, A1 is gonna be 1, and I'll fill in all the ones that are zero: one, two, three, four. Um, do we ever take A2 in S3? No. So that also has to be zero, because we never started there, took action A2, and got a return. And what about S2, what action do we take from there?
Right. So for that one, we get a 1.
So we're basically distributing our experience across the actions. If you were to take a max over those, you would get the same thing that we saw last time for Monte Carlo, which would be 1, 1, 1, 0, 0, 0, 0 to the end, but here we're subdividing our samples. You only get experience for the action that you actually took in the state. And because we're in the Monte Carlo case, as opposed to the TD case, or Q-learning as we'll call it later, we get to add up all the rewards to the end of the episode. So G here is gonna be the sum of rewards over all these steps, and we're keeping gamma equal to 1 here just to make the math simple: we're just adding. Yeah?
Sorry, would it just be one half for Q of S1, A1 or Q of S3, A1?
He's asking whether anything would change here if we did every visit. It would not change in this case, because both times you visited S3, the sum of rewards to the end of the episode was 1. So you'd have two counts of 1, and then we divide by 2. It can actually be different, but mostly when you get a different sum of rewards from that point to the end of the episode. Yes?
Sorry, remind me: I thought we were supposed to say that everything was 1, and I missed that. Did you say that it was different for the two actions?
It was 1 for the pairs on the trajectory, and zero otherwise.
I understand. Sorry about that.
Okay.
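To make the bookkeeping concrete, here's a tiny first-visit Monte Carlo calculation. The transcript doesn't spell out the full sampled trajectory, so the episode below is a hypothetical stand-in ending in a +1 reward; the point is that every state-action pair on the trajectory gets the same return, the sum of rewards to the end of the episode.

```python
# Hypothetical Mars-rover episode: (state, action, reward) steps, gamma = 1.
episode = [(3, 'A1', 0), (2, 'A2', 0), (2, 'A1', 0), (1, 'A1', 1)]

Q, N, G_sum = {}, {}, {}
G, returns = 0, []
for (s, a, r) in reversed(episode):    # accumulate returns backwards
    G = r + G                          # gamma = 1: just add the rewards
    returns.append((s, a, G))
seen = set()
for (s, a, G) in reversed(returns):    # forward pass, first visits only
    if (s, a) in seen:
        continue
    seen.add((s, a))
    N[(s, a)] = N.get((s, a), 0) + 1
    G_sum[(s, a)] = G_sum.get((s, a), 0) + G
    Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]

# Every pair on the trajectory ends up with Q = 1 (the only reward is the
# final +1); pairs never visited, like (3, 'A2'), stay at their default of 0.
```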
Okay. So now we're gonna show formally that this does the right thing. We're gonna show, provably, like what we did before with policy improvement, where picking the policy that is greedy with respect to your Q function was guaranteed to yield monotonic improvement, that the same thing is true here too when you're ε-greedy. So, if you use an ε-greedy policy to gather data, then the new ε-greedy policy you extract with respect to the resulting Q function is guaranteed to be at least as good a policy.
Okay. So let's say that we have an ε-greedy policy Pi i, and then we take the ε-greedy policy with respect to Q Pi i, which is gonna be Pi i plus 1. So we had an ε-greedy policy Pi i that was doing some amount of exploration and some amount of greediness; we used that to gather data, we then evaluated that policy and got this Q Pi i, and now we're gonna extract a new policy. We're going to do policy improvement, and I'm gonna show that that's a monotonic improvement.
Okay? Does anyone have any questions about what we are showing? Okay. So, what does this mean? Right now, we're trying to show that this Q function, Q Pi i of s, Pi i plus 1 of s, is gonna be at least as good as, or better than, the previous value of our old policy Pi i.
So, the way we define this is: because our policy is stochastic, the Q function here is a sum over actions of Pi i plus 1 of a given s, the probability we take each action in the state, times Q Pi i of s, a. Then we're gonna expand that out and re-express it using the definition of an ε-greedy policy. Remember, in an ε-greedy policy, we either take an action at random, with probability ε, splitting that probability mass evenly across all the actions, or we're greedy. So that's how we get this equation.
So this says: this is the random part. With probability ε over the size of A, we take each of the actions, and then we follow Pi i from then onwards, so that's just Q Pi i of s, a. And then, with probability 1 minus ε, we're greedy, and we follow the best action according to our current Q Pi i.
So, now what we're gonna do is rewrite that. The first term isn't gonna change, and I'm gonna expand the second. I haven't done anything here; I just multiplied the last term by 1, but I expressed the 1 as (1 minus ε) divided by (1 minus ε), and now I'm gonna re-express that part. So, I'm gonna keep the first term, plus 1 minus ε times the max over a, and here's what I'm gonna rewrite this as.
It's gonna use the fact that when we define our ε-greedy policy, if you sum over all actions the probabilities of taking those actions in a state, they have to sum to one; a policy always has to put total probability 1 on taking some action in a state. So I first multiplied by one, expressed as (1 minus ε) divided by (1 minus ε), and then I re-expressed the numerator, 1 minus ε, as the sum over a of Pi i of a given s minus ε over the size of A. And then, because by definition the best action has to be at least as good as taking any of the other actions, we're gonna do the following: we're gonna push that Q inside the sum, replacing the max.
So that has to be smaller than or equal to what we saw before, because we just pushed the Q inside, and we're no longer taking a max. All the Q values are, at best, equal to the max, and in general they'll be worse.
Okay? But then once we have that, we can cancel the (1 minus ε) divided by (1 minus ε), and what do we have? We have two different terms here that look very similar. Let's see, we'll take that apart, and keep this up. Yeah, there is an ε over the size of A right there.
Okay. So now I'm going to pull that out.
If I split those terms up, the first term and the third term are identical, and one is subtracted and one is added. Make sure that's clear. So, this just ends up becoming the middle term, and that was just the previous value, V Pi i of s.
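Written out, the chain of steps above is:

```latex
\begin{aligned}
Q^{\pi_i}(s, \pi_{i+1}(s))
  &= \sum_a \pi_{i+1}(a \mid s)\, Q^{\pi_i}(s, a) \\
  &= \frac{\epsilon}{|A|} \sum_a Q^{\pi_i}(s, a)
     + (1-\epsilon) \max_{a} Q^{\pi_i}(s, a) \\
  &\ge \frac{\epsilon}{|A|} \sum_a Q^{\pi_i}(s, a)
     + (1-\epsilon) \sum_a \frac{\pi_i(a \mid s) - \frac{\epsilon}{|A|}}{1-\epsilon}\, Q^{\pi_i}(s, a) \\
  &= \sum_a \pi_i(a \mid s)\, Q^{\pi_i}(s, a) \;=\; V^{\pi_i}(s),
\end{aligned}
```

where the inequality holds because the weights (Pi i of a given s, minus ε over |A|, divided by 1 minus ε) sum to one and are non-negative (since Pi i is itself ε-greedy), so the weighted average is at most the max.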
Yeah?
On the first line, where we changed it: instead of the sum over a of Pi i of a given s, minus ε, it's minus ε over the cardinality of A in this case?
Yes.
Um, and is that still one minus ε?
That's all the same. Got it.
Does that answer your question?
I think so.
Yeah. So, what we did from the one minus ε to the next line: we had a one minus ε divided by one minus ε, and I re-expressed the numerator as the sum over a of Pi i of a given s, minus ε over the size of A. If you sum over a, the second term is just ε, and the first term is one. Okay? Yes.
Can you remind me your name?
[The student asks about the Pi i plus 1 notation on the second line, and then about where the greater-than-or-equal comes from.]
Yeah, so good question.
The greater than or equal happen because we push that Q Pi we had a max over
A Q Pi I of s, a we pushed it inside of the sum.
And so that sum now no longer includes a max.
And so, now that the max is always greater than or equal to any of the other elements.
So, that's where you got the,
greater than or equal to. Yeah?
So, I was just wondering if you could explain
intuitively why taking random or greedy actions ends up with monotonic improvement.
Yeah. Can we get some intuition beyond the algebraic derivation?
I think intuitively the idea is that by doing
some e-greedy exploration you're gonna get evidence about some other state-action pairs.
Um, and then you can use this to estimate your Q function, and
when you do that, that can improve your policy - you can have evidence that there is something
better you could do than the current thing you're doing.
If you don't do any exploration, your Q function wouldn't change from before.
But now, because you're doing exploration, you can learn about
other stuff, and if it's better you'll see that in your Q function.
And if the Q function from your exploration is not as good, then you just take the old one?
Yeah. Um, yeah.
So, this is now, um,
this is saying that you'll get
this monotonic improvement if you're computing this exactly.
So that's an important part.
So what this shows here is
that if you get a Q function and it looks like there's
some improvement from some other actions that you're not taking right now, you're
gonna shift your policy over towards focusing on those actions.
This is assuming, in terms of
the monotonic improvement, that the Q Pi i's have been computed exactly.
So that's what we saw when we were doing planning, where we knew what
the dynamics model was and the reward model, and we were
using that to compute a value function.
Um, so in that case we had
the guaranteed monotonic improvement because we had
the exact value of V Pi, and similarly here, if we have the exact value of Q Pi i,
then when you do this improvement you're guaranteed to
be monotonically improving. If you didn't -
like, if you have just an approximation of Q Pi i -
then it may not be monotonic.
Like, let's say you tried another action once in that state:
you may have a bad estimate of how good things are from that point.
So this is an important aspect.
And this is going to be really important when we start to think about
function approximation, because we almost never will have computed Q Pi i exactly.
But if you do - let's say
you can iterate through this a ton of times,
like you're learning in a tabular environment,
and you've converged, so you know your Q Pi i is perfect -
then when you do
policy improvement you can get a benefit. Though
there's going to be this interesting question of how often do you improve
your policy versus how much time do you spend evaluating your current policy.
Yeah?
Uh, yes.
Yeah.
So does this mean that it definitely converges to, like, the optimal Q function?
Overall? Perfect.
Yeah, we'll talk about that - that question is great too.
So, this is just saying, like, one-step monotonic improvement;
what's gonna happen in terms of total convergence we'll talk about in just a second.
Yes? Remind me of your name, please.
When I think of V Pi, I think of it as being a function of a state - but Pi is an action given a state?
Uh, so, to sort of refresh:
what is a Pi, and how do we define the function?
Now we're thinking of it as a mapping from states
to actions, but it can be a stochastic function.
So, it can be a probability distribution over actions.
So, I can select action A1 with 50 percent probability,
or action A2 with 50 percent probability, for example.
Okay.
It depends how you want to implement it.
Essentially I think of it as: you're in a state, and then you have
some probability distribution over actions, and you
sample from that to decide what action you take.
So what we're doing here, when we
expanded this, is we asked: what is the policy for an action given a state?
We said with one minus Epsilon probability we will be taking this max action -
so that one - and with Epsilon probability we would be taking one of the actions at random.
And so then we summed over each of the actions we could take.
So, what we did there is we split this sum up
into the probability of taking one action and what would be
the Q function of that action, and the probability of
taking each of the other actions and what would be the value of those.
So, it's like an expected value.
Yeah, at the back. So when we talked about the Bellman operator, we said that if you got to the same value function -
You can stop iterating.
Here, would you have to have tried every, ah,
action to know that you are done?
That's a great question. Before, in policy improvement, if you got to the same policy,
you are done - you don't have to do any more improvement.
The question is, in this case, is that true, or are there some additional conditions?
Um, this is very related to that question, too.
So why don't we go onto the next part, which is asking, you know,
under what conditions are these going to converge, and converge to optimal?
Um, do you have a question before that?
Yeah - does this also say that the only time we get strict equality is when Epsilon is 1, so you just act purely randomly?
Uh, the question is whether or not,
um, if the policy is random,
you would get equality here.
Um, you can get equality whenever you've converged -
like, if your Q function has converged, your policy is optimal.
Are you guaranteed equality if you're acting totally randomly?
No, I don't think so. Because if you're acting totally randomly - in fact,
that's often how you start off - then you want to improve from there.
I mean, if it's uniform, some things are going to look better than others.
So even if acting randomly,
some actions are going to have higher rewards than
others, and that can be reflected in your Q function.
Any other questions before we get on to convergence? In the back.
Yeah, another question.
Um, do you exclude, um, the argmax when you explore?
Do we what?
Do we exclude the argmax action?
Like, you know, in the exploration, um, e-greedy part.
And what is your name?
Um, no, you don't exclude it.
You don't exclude the argmax action when you explore - you pick among all of them.
Um, if you wanted to, you could do that - that would be equivalent to sort of defining a different policy.
But in the simplest version, including in this proof here,
we assume that when you're, um,
acting randomly you just sample from any of the actions.
It's often easier for implementation, too.
Okay, great questions.
Let's, um,
write that up here as well.
Er, okay, so this other really great question that's coming up from several people here:
um, er, okay,
what does this mean over time?
Um, I have called it monotonic improvement - what guarantees do we have?
So, the guarantee that we have is: um,
if you sample all state-action pairs an infinite number of times,
and your behavior policy converges to the greedy policy.
So what do I mean by that?
Um, so the behavior policy here is sort of what policy you're
using, versus what policy is greedy with respect to your current Q.
So, this is the case if, in the limit as i goes to infinity,
Pi of a given s
goes to the argmax over a of Q of s,
a, with probability one.
Which means that in the limit you converge to
always taking the greedy action with respect to your Q function.
Then, um, then you are greedy in the limit with infinite exploration.
That's called GLIE, often.
So that means you visit all the state-action pairs
an infinite number of times, but you are also
converging in the limit to be greedy with respect to your Q function.
Um, and there's different ways to do this. The simple way to do it is to sort of
decay your, um, your Epsilon in your e-greedy policy over time.
Um, so you can reduce your Epsilon towards zero at a rate of, like, one over i,
for example - that's sufficient. It's not necessary.
This is separate from what you wanna do empirically;
this is just to sort of show that, under these conditions,
then, um, we're generally going to be able to show that we are going to converge to
the optimal policy and optimal value for Monte Carlo and TD methods.
So, generally - we'll talk about this again as we talk about
some of the other algorithms - generally, when you're GLIE, um,
and you have some conditions over how you're learning the Q functions,
um, then you will be guaranteed to converge to the optimal policy.
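As a concrete sketch of that kind of GLIE schedule (the function name here is just illustrative, not from the lecture):

```python
def glie_epsilon(i):
    """Epsilon for episode i (1-indexed); epsilon_i = 1/i is one
    sufficient GLIE schedule: it goes to 0, so the policy becomes
    greedy in the limit, yet it stays positive at every finite i,
    so every action keeps probability at least epsilon_i/|A| > 0."""
    return 1.0 / i

assert glie_epsilon(1) == 1.0   # start fully random
assert glie_epsilon(4) == 0.25  # shrinking toward greedy over time
```
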
Yeah.
Um, is decaying Epsilon, like we've seen, the only way to guarantee it?
Yeah. So the question is: is
this the only way to guarantee it? Um,
there's sort of interesting different things that are happening here.
Um, you could be guaranteed that you're converging to
the optimal Q function without converging to the optimal policy.
So, you could keep Epsilon really high, um,
and you could get a lot of information - you will be
learning about what the optimal Q function is,
but you might not be following that policy.
And we'll talk more about that in a, in a minute.
All right, so let's talk a little bit more about Monte Carlo control,
given this precursor.
So, if we wanted to do Monte Carlo online control,
instead of just the evaluation we talked about before,
we can kind of combine these ideas of learning the Q function and doing this, er,
improvement at the same time.
So we can initialize our Q functions and
our counts in the same way we were talking about before.
Um, and then what we could do is we can construct an e-greedy policy.
So the e-greedy policy in this case is always going to be
that with probability one minus Epsilon
we pick the argmax with respect to Q, and with probability Epsilon we select an action -
let me write it this way: with probability Epsilon over the cardinality of A, we select action a.
So we're just mixing between this random, um,
choice or being greedy. Yeah.
If I heard that right - so,
actually, like, the optimal action in this case you are selecting with
probability one minus Epsilon plus Epsilon over the cardinality of A, right?
Yeah.
Okay.
Several people have asked about this.
So essentially, you're being greedy with
probability one minus Epsilon plus Epsilon over the cardinality of A.
And then the remaining part of your probability is going to be exploration,
because when you're being random you could also select what's currently the best action.
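That bookkeeping can be made concrete with a small sketch (an assumed helper, not lecture code) that returns the full action distribution of an e-greedy policy in one state:

```python
def egreedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state.
    Every action gets epsilon/|A|; the greedy (argmax) action also
    gets the remaining 1 - epsilon, for a total probability of
    1 - epsilon + epsilon/|A|, exactly as discussed above."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs

p = egreedy_probs([0.0, 2.0, 1.0], epsilon=0.3)
# greedy action (index 1) gets 1 - 0.3 + 0.3/3 = 0.8; the others 0.1 each
```
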
So, um, it looks pretty similar to what we saw before.
We're going to sample an episode;
after we finish the episode, then - in this case I'm defining it as first-visit,
but you could make this every-visit.
I could do every-visit.
The same, um, benefits and restrictions apply here
as what we had before, in the sense that you could be getting
a slightly more biased estimator if you're doing
every-visit, but you're generally going to be able to use more data,
so it's going to be sort of less noisy.
Um, so in this case what we're doing is we're
just maintaining counts over state-action pairs and we're updating our Q function.
And then after we finish that episode, then we can update, um,
our k and our Epsilon -
in this case we're just using Epsilon equal to one over k -
and then we redefine our new e-greedy policy with respect to Q,
and then we get another episode.
So that's just sort of Monte Carlo online control.
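Putting those pieces together, here's a minimal first-visit Monte Carlo online control sketch. The `sample_episode` callback is an assumed stand-in for "sample the k-th episode under the current policy"; none of these names come from the slides.

```python
import random
from collections import defaultdict

def mc_control(sample_episode, n_actions, num_episodes, gamma=1.0):
    """First-visit Monte Carlo online control with epsilon = 1/k.
    sample_episode(policy) must roll out one episode under `policy`
    and return a list of (state, action, reward) tuples."""
    Q = defaultdict(float)   # state-action value estimates
    N = defaultdict(int)     # state-action visit counts
    k = 0                    # episode counter

    def policy(s):
        eps = 1.0 / k        # epsilon decays as 1/k (GLIE)
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        k += 1
        episode = sample_episode(policy)
        G, first_returns = 0.0, {}
        for s, a, r in reversed(episode):    # returns, computed backward
            G = r + gamma * G
            first_returns[(s, a)] = G        # earliest visit overwrites later ones
        for sa, g in first_returns.items():  # incremental mean update
            N[sa] += 1
            Q[sa] += (g - Q[sa]) / N[sa]
    return Q
```

On a trivial one-state, two-action problem where action 1 always pays reward 1 and action 0 pays nothing, this drives Q toward (0, 1) and the greedy policy toward action 1.
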
So why don't we go back to that Mars Rover example?
So, in the Mars Rover example, this is what our Q values look like for the two actions.
So, at this point, why don't we
just spend a minute and say what would be our new policy, um,
at the end of this episode - and it's fine to just write down "tie" if
the Q values are exactly the same, um,
for the same state for two different actions; it's just a tie,
and then you can choose how to break the tie.
Um, and then also write down what the new e-greedy policy is.
We'll just take a minute to do that.
Okay, what's our greedy policy?
What is the greedy policy for S1?
A1.
What is the greedy policy for S2?
Two.
And then what's our greedy policy for S3?
One.
And then what is it for everything else?
Tie. Okay. And depending on your implementation, you could
either sort of predefine
your greedy policy, or you could just
break ties randomly and keep track of that.
You could constantly be breaking ties randomly.
That would probably be better empirically - like,
instead of predefining one greedy policy,
you can probably just always be, er,
querying what the argmax of Q is,
and if you're getting ties, just break them randomly to get more exploration.
Um, so then if we define
an e-greedy policy where k is three and our Epsilon is one over k,
with what probability do we act randomly?
Random. So k is three,
Epsilon is equal to one over three.
So that would mean that with one-third probability
we select something random, and with two-thirds probability
we select the greedy policy.
And then that would be the update for that particular episode.
So, if you do this -
if you have greedy-in-the-limit-of-infinite-exploration Monte Carlo -
then you're gonna converge to the optimal state-action value.
So, now we're gonna start to talk about TD methods.
So, similar to what we were seeing, um,
for Monte Carlo,
there is gonna be sort of this simple analogy that moves us over to TD.
So, remember, for TD, what we had before is
we have our V Pi of S. It was equal to our previous V Pi of
S plus Alpha times R plus Gamma V Pi of
S prime minus V Pi of S. And this was where we were
sampling an expectation, because we're only getting one sample of S prime,
and we were bootstrapping, because we're using our previous estimate of V Pi.
So, those were kind of the two key aspects
of TD learning: that we're both bootstrapping and sampling.
In Monte Carlo, we were sampling, but not bootstrapping.
Um, and one of the nice aspects of TD learning is that we can
update it after every tuple, instead of waiting till the end of the episode.
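In code, that update is one line - a hedged sketch, with a dictionary-based V just for illustration:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) update: V(s) <- V(s) + alpha*(r + gamma*V(s') - V(s)).
    Sampling: only the single observed next state s' is used.
    Bootstrapping: the current estimate V[s_next] stands in for
    the rest of the return."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {"s1": 0.0, "s2": 1.0}
td0_update(V, "s1", r=0.5, s_next="s2", alpha=0.5, gamma=0.9)
# target = 0.5 + 0.9 * 1.0 = 1.4, so V["s1"] moves halfway there: 0.7
```
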
So, just like what we did with Monte Carlo,
where we kinda replaced all of our Vs with Qs,
we're gonna do exactly the same thing here.
So, now we're gonna think about
what's often known as temporal difference methods for control.
So, what we're gonna do now is,
we can estimate the Q Pi function using temporal difference updating with,
like, an e-greedy policy,
um, and then we could do policy improvement
by setting Pi to an e-greedy version of Q Pi.
That would be one thing we can do.
There's an algorithm called SARSA,
which stands for state, action, reward, next state, next action - so, SARSA.
Um, how does SARSA work?
So, what we do is, we initialize our e-greedy policy randomly.
Uh, we take an action,
we observe a reward and next state, and then
we take another action,
and we observe another reward and next state,
and then we update our Q as follows. We say our value of Q for S t,
A t is gonna be whatever our previous value was -
actually, I'm gonna be careful with this:
we're not going to index them with Pi
anymore, because we sort of have this running estimate,
and our policy is gonna be changing, too.
So the Q function that we get here is now not just for one policy;
we're going to be averaging it over different samples,
and we can be changing how we're acting over time.
So, Q of S t, A t
is gonna be equal to Q of S t,
A t, plus Alpha times, R t plus
Gamma Q of S t plus
one, A t plus one, minus Q of S t, A t.
The important thing about this equation is
that I am plugging in the actual action that was taken next.
So, you're in a state, you do an action,
you get a reward, you go to a next state,
and then you do another action.
And so, once you know what the next action is that you've done,
then you can do this update -
in SARSA you're actually plugging in the action that was taken.
And then, once you have that, you can do policy improvement in the normal way.
So, you can have Pi of S t equal
to the argmax over a of Q of S t, a -
with, like, the e-greedy wrapper around that.
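The SARSA Q update just described can be sketched for a single (s, a, r, s', a') tuple like this (dictionary-based Q; the names are illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA update. The target plugs in the action a_next that
    was actually taken in s_next (on-policy), not a max:
    Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 0.0}
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.1, gamma=0.5)
# target = 1.0 + 0.5 * Q[(1, 0)] = 2.0, so Q[(0, 1)] = 0.1 * 2.0 = 0.2
```
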
Now, this is a little bit different than Monte Carlo for two reasons.
Um, we're doing these tuple updates:
we see the state, action, reward, next state,
next action tuples, and then,
once we have those, we can update our Q function.
Um, we can do those along the way -
we don't have to wait till the end of the episode -
and similarly, we don't have to wait till the end of
the episode to change how we're acting in the world.
So, like, in the, um,
trajectory that we saw before,
we saw some states multiple times.
In this case, we could actually be changing our policy
for how we act in those states during the same episode.
So, if your episodes are really long, this can be really helpful.
So, in general,
um, I think it's often extremely helpful to,
um, update the policy a lot. Yeah?
Is there a reason it's written with the one minus Alpha?
Oh, yeah. So, they're both the same;
it's just whether you write it with the V pulled into the next part or not.
So, you can either have it as one minus Alpha times
your old value, plus Alpha times the reward plus Gamma times your next value,
or you can have it as V plus Alpha times
the reward plus Gamma times the next value, minus V.
They're the same. If you notice that I've made a typo,
just let me know. Yeah.
Uh, and is there a reason we use, like,
the next state-action pair that we chose - uh,
A t plus one - rather than the max over actions?
The question's about why we use
the next state-action pair you chose instead of the max.
Q-learning is going to use the max;
we'll see that in about a slide.
Um, SARSA's basically updating on-policy.
Um, generally
you want to do Q-learning,
which is going to be doing the max.
Sometimes there's some benefits to SARSA,
particularly in cases where, um,
you could have a lot of negative outcomes,
where the optimism of taking the max can end up sort of causing your agent to make a lot of
bad decisions early on, because it's really optimistic about what it
could do instead of what it's actually doing.
Um, there's a nice cliff-walk example inside Sutton and Barto
where they show that SARSA actually is doing better in sort of the early
stages - early samples - compared to Q-learning,
because SARSA is realistic about what happens
if you take certain actions next, as opposed to optimistic.
And if you're doing a lot of randomness, um,
that means that SARSA can be more realistic in the early stages.
But empirically, generally, you want to do Q-learning,
and both will converge to the same thing. Yeah.
Um, so, on the board,
should that be, um, Q of S t plus one, A t plus one?
Yeah. Thank you. Yes.
This might be a basic question, but you're talking about
how it's getting the information from the future action,
but you have to have already done that action.
So, why is it called, um, er,
next state, next action,
when it's really the past one that you're updating, from what I'm understanding?
Because you're doing this one and you're using the information you learned
to update the one in the back.
So, why is it - why are we talking about it like
it's a future action? What's the purpose of that?
Um, all right. I don't think it's about the particular terms used to define SARSA;
I mean, I guess
it's really just that you have to wait till you get
that last A - that's important.
Before, with TD learning, if you were in a state, took an action, saw a reward and
next state, then you could update your value function;
now we're just saying you have to wait till you've
actually decided what to do in that next state.
Okay.
Because that's how you're choosing to update your Q function here -
that's what you're plugging in for your target.
So, in terms of the convergence properties,
um, it requires a couple of different things;
uh, we need sort of two things.
We're gonna need the fact that we're updating
our Q function and it's being updated incrementally,
and so, like what we talked about before,
we're gonna need some conditions over the Alphas.
Um, if Alpha is equal to 1, uh,
generally your Q function is not gonna converge,
because it means you're not remembering anything about the past.
Um, if Alpha is 0,
then you're not updating anymore.
So, generally, you need something in terms of the step sizes
which allows you to sort of slowly be incrementing but still be converging.
So these are one sufficient set of conditions, um,
so stuff like Alpha t
equal to one over t works. Now,
empirically, often you're going to want to pick very different forms of learning rates.
So Alpha t is often referred to as,
like, the learning rate parameter,
and empirically you are often not gonna wanna use this -
generally not gonna use this.
You're often gonna wanna use different things empirically;
you could end up using sometimes small constants,
or slowly decaying constants.
Often that depends on the domain,
but this is, from a theoretical side, what is sufficient to ensure convergence.
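For reference, the step-size conditions being described here are the standard Robbins-Monro conditions on the learning rate sequence; Alpha t equal to one over t satisfies both, since the harmonic series diverges while the sum of one over t squared converges:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

The first sum being infinite means the updates never run out of the ability to move the estimate; the second being finite means the noise in the updates eventually averages out.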
And then the other aspect is that
your policy itself has to satisfy the condition of GLIE,
which means that you are, sort of,
slowly getting more greedy over time, but you're doing so in a way that
you're still sampling all state-action pairs an infinite number of times.
Now, just note for a second that that's not always possible -
like, if you have a domain, um,
where, uh, things are not reachable after a point - it's not ergodic,
you can't get back to certain states after you leave them.
Let's say you're flying a helicopter and you break the helicopter.
You can't get back up there, um,
so you're not gonna be able to satisfy GLIE,
because at some point you broke your helicopter and
then you have no idea what it would have been like
if you had continued to fly your helicopter in the air.
So, there can be some domains for which it is very hard to satisfy GLIE, um,
but we generally are going to ignore those, even
though there is some really interesting work on
how we deal with those cases as well.
In those cases, you might assume that it's more of an episodic problem,
so maybe you have, like, a hundred helicopters, and when you
crash one, that's considered a termination condition, and then you get out your next one.
Um, so you may or may not be able to be greedy
in the limit of infinite exploration there, but you can,
sort of, have a bounded amount of exploration.
And we're going to talk a lot more later about, sort of,
how to do this exploration in a much smarter manner, and in a way that can give us
finite-sample guarantees on how much data we need to learn a good policy.
So, this is just what I said before, which is,
you know, we generally are not gonna use these step sizes empirically.
Is this for the update where you have
Q plus Alpha - or one minus Alpha - times the target?
Yeah. Okay, yeah. So this is for SARSA.
So this is the condition for SARSA, assuming
that particular update of how we're updating our Q functions.
Okay? So, yeah.
Uh, so in the Monte Carlo case,
we had a sufficient condition with the Pi
being GLIE, with the Epsilon going down as one over k.
Do we have anything similar in general?
Um, great question. Uh, the question is about whether,
for Monte Carlo, we have a similar sufficient condition.
If you're, um,
if you're doing first-visit, GLIE alone is sufficient,
because you're getting an unbiased estimator that's converging
for all of your returns, as long as you visit
all the state-action pairs an infinite number of times.
If you're doing it in this incremental fashion, um,
then if you're playing around with what Alpha is,
you need to have similar conditions to make sure it's guaranteed.
What I mean is, uh, how do you,
like, how do you know that condition holds?
How do you know that things are GLIE?
Yes - like in Monte Carlo we did have a condition;
it was that Epsilon decays as long as it-
Oh, so, um, great question:
how do you make sure something's GLIE?
Um, one sufficient condition is that Epsilon is one over-
uh, one over t, or one over i.
And do you know, uh,
whether that would work in general?
Oh, whether there are other sufficient conditions like this?
Yeah - like, will it be GLIE if and only
if Epsilon goes to zero but its sum diverges, or something like that?
Yeah, I think it's quite similar to the Alpha sequence -
similar in that you're, sort of,
essentially ensuring that you're doing an infinite number of updates,
an infinite amount of
random exploration, but it's still going down fast enough to converge.
I think it's probably exactly the same kind of condition.
Okay. So then we get into Q-learning, which
is related to the question which was asked: okay,
why are we picking that particular action
next - why don't we just pick the max? Um, yeah.
We could just pick the max instead.
So SARSA is picking this particular action next;
Q-learning is picking the max action next. Yeah.
So, as you said, SARSA can do better
early on - is there any way that we could mix SARSA and Q-learning?
Um, the question is whether we could mix SARSA and Q-learning.
You certainly could, um,
but then that also means that maybe I wasn't being clear enough with the earlier part.
So SARSA can do better in some domains early
on, particularly if there's a lot of really negative rewards, because it's being realistic.
Um, in other cases, Q-learning
will be better even early on,
because you're being more optimistic, and as we talked about a little bit before,
often optimism is really helpful for exploration.
The cliff-walk example in Sutton and Barto is a case where
some actions lead the agent to, like, fall off a cliff, and so
some actions are really bad, and so
being optimistic early on means that you're gonna take
a lot of really bad decisions and suffer a lot of negative rewards for a while.
Many other domains are not like that, so it depends on a lot.
And yes, you could certainly mix them.
Alright. So I guess, in terms of Q-learning, one thing that's interesting here is,
uh, we can again sort of think about how we're improving this: we're gonna,
sort of, be e-greedy with respect to the current estimate of the optimal Q,
and really this is quite similar to what we were doing in
SARSA, except for now, when we update this Q, we're really just gonna be doing this max.
So Q of S t, A t is gonna be equal to the previous value,
plus Alpha times, R t plus Gamma times the max over a of Q of S t plus one, a,
minus Q of S t, A t.
So, now also note that you can update this a little bit earlier -
you don't have to wait until the next action is taken.
So, you only need to observe this part;
you don't need to actually see the next action that's taken, and then
you can perform policy improvement.
And in general, in this case,
you only need
to update the policy for the state that you were just in.
So you can just update Pi for S t, the state you were just in.
You don't need to update it everywhere - particularly in large state spaces, that can be helpful.
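The Q-learning update above, side by side with the SARSA sketch (again an illustrative dictionary-based version):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning update. Unlike SARSA, the target uses the max
    over next actions, so you don't need to wait to see a_{t+1}:
    Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 5.0}
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1],
                  alpha=0.1, gamma=0.5)
# target = 1.0 + 0.5 * max(2.0, 5.0) = 3.5, so Q[(0, 0)] = 0.35
```
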
So we actually ended up talking about this a little bit
already - whether or not how you initialize Q matters.
It doesn't asymptotically - I mean,
if you have a case where your Q function is gonna converge to the right thing,
it will still converge to the right thing no matter how you
initialize it, as long as it satisfies these other conditions -
but it certainly matters a lot empirically, and so even though
often we think of just initializing it randomly or initializing it with zero,
initializing it optimistically is often really helpful.
So we'll talk more about that when we talk about exploration. Yeah?
On the previous slide, line six -
should that be max or argmax? Thank you.
So now, um, if we do Q-learning -
um, let's see.
I'm gonna leave this as just an exercise you can do later,
but you can do the exact same exercise for Q-learning,
um, and see how these updates propagate.
Um, so just like Monte Carlo versus TD for policy evaluation,
there are some of the same issues with Q-learning.
Q-learning is only gonna update your Q function for the state you were just in.
So, even if it turns out later in the same episode
you get a really high reward,
you're not gonna backpropagate
that information at the end of the episode in the way that you would with Monte Carlo.
So Q-learning updates can often be much slower,
um, than Monte Carlo,
and that has implications for how quickly you
can learn to make better decisions.
So, the conditions that are sufficient to
ensure that Q-learning with e-greedy exploration converges
are basically the same as for SARSA.
We need to make sure that things are, um, GLIE -
and let me slightly revise this.
If you just wanna make sure that the Q values converge,
you need to visit all s, a pairs infinitely often,
and you need to have these conditions on the Alphas.
So if you look at the same conditions:
in order for the Q functions to converge,
you need these conditions on how you're
updating - like, what your learning rates are -
and you need to visit all state-action pairs infinitely often.
But that's just sufficient to allow you to converge to the optimal Q values.
And then, if you want to actually make sure that the policy
you're following is really the optimal policy,
then you need to be GLIE -
you also need the policy to become more and more greedy.
All right, let me just briefly get into the maximization bias before we finish.
The maximization bias is an interesting issue.
Ah, so why are we going to talk about this?
Well, okay, let's go back to this one.
So in Q-learning, what are we doing?
In Q-learning, we're computing
the Q function and then we're being e-greedy with respect to it.
Then we're getting some more data, we're re-updating
our Q function, and we're being greedy with respect to it -
e-greedy with respect to it.
And so we're always sort of doing
this dance between updating stuff, getting more evidence,
and then trying to kind of exploit that knowledge, up to some random exploration.
And the maximization bias points out that there can be some problems with this.
Okay. So, let's just consider a particular example.
Imagine there is a single-state MDP, which means there's only one state,
um, but there are two actions, and both of them actually have zero-mean random rewards.
So you can think of these as being, like, Gaussians.
Right now, we're mostly talking about cases where the reward is
actually deterministic, but it doesn't have to be -
it could be a stochastic reward.
In this case, you would imagine that whether you take action a1 or action a2,
your expected value is zero,
but the value you get on any particular step might not be zero -
it might be one, or minus one, or things like that.
The average is still zero, but on any particular step
you could get something different, okay?
But the expected value is zero, um,
and so the Q value for s, a1 and
the Q value for s, a2 are both zero, which is the same as the value of the state.
And these are the optimal Q and V values.
So let's imagine there are some prior samples: you've tried action a1 a bunch of times, you've tried action a2 a bunch of times, and you compute an empirical estimate from those. Here again there's just a single state, and we can just average over these samples. Let's make it super simple and say gamma is equal to zero, so we're really just estimating the immediate reward; there are no future rewards. We're just asking: over all the times we've tried this action before, what is the average reward we got?
And now what we want to do is take our empirical estimates of the Q function for a1 and a2 and figure out what the greedy policy is. The problem is that this can be biased. Even though each of the two estimates of Q is itself unbiased, when you take a max over them, the result can be biased.
Let's just write out what that is. Our estimate V-hat-pi(s) is equal to E[max(Q-hat(s, a1), Q-hat(s, a2))]. I'm taking the expected value of the max of these two things because that's how I defined my policy: pick whichever of the two empirically looks best. But we know from Jensen's inequality that this is greater than or equal to what you get if you swap the max and the expectation: max(E[Q-hat(s, a1)], E[Q-hat(s, a2)]) = max(0, 0) = 0, which is the true V-pi(s). So that means whatever we compute can be an upward-biased estimator of the true V-pi.
So why did this happen? Well, if you only have a finite number of samples, say I've tried action a1 a finite number of times, then on those samples it might happen to look slightly positive, like 0.1 instead of zero. And when I form my policy, I maximize over those estimates, so I immediately exploit whichever one happens to look better, even if it only looks better by statistical chance. That's why you can get this maximization bias.
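To make the single-state example concrete, here's a quick simulation of it (my own sketch, not code from the lecture): both actions have true mean reward zero, each empirical Q value is the average of a handful of plus-or-minus-one rewards, and the expected max of the two estimates comes out clearly positive.

```python
import random
import statistics

random.seed(0)

def empirical_max_bias(n_samples=5, n_trials=100_000):
    """Estimate E[max(Qhat(s,a1), Qhat(s,a2))] when both true Q values are 0.

    Each Qhat is the average of n_samples rewards drawn as +1 or -1
    with equal probability (a zero-mean stochastic reward).
    """
    total = 0.0
    for _ in range(n_trials):
        q1 = statistics.mean(random.choice([-1, 1]) for _ in range(n_samples))
        q2 = statistics.mean(random.choice([-1, 1]) for _ in range(n_samples))
        total += max(q1, q2)  # the greedy policy exploits whichever looks better
    return total / n_trials

bias = empirical_max_bias()
print(bias)  # clearly positive, even though the true V(s) is 0
```

With more samples per action the bias shrinks: try `n_samples=50` and the estimate moves much closer to zero, which matches the intuition that the problem comes from maxing over noisy finite-sample estimates.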
And the same thing can happen in full MDPs, so this is a general phenomenon. You can also look at some nice examples in a paper by John Tsitsiklis and Shie Mannor and colleagues, where they show how this can happen in Markov decision processes: if your estimates of these Q functions are noisy, then you're going to be biased toward whatever happened to look good in your data, and so you can have a maximization bias.
So one thing that was proposed to deal with this is called double Q-learning. The idea is that instead of having one Q function, we maintain two different Q functions, giving us two independent unbiased estimators of Q. You use one of them for your decision-making, to select the maximizing action, and the other to estimate the value of that action. That's going to give us an estimate that isn't biased upward. And the reason you might want to do this is that it helps with the issue of being overly biased toward things that merely happened to look good.
Yes, you're now separating the samples you use to estimate how good an action is from the samples you use to select your policy.
I'm going to be a little brief with this because of time. Double Q-learning basically means we have these two different Q functions, and on each step, with 50% probability we update one, and with 50% probability we update the other.
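As a sketch of that update rule (following van Hasselt's tabular formulation; the variable names and defaults here are my own), the table chosen by the coin flip picks the argmax action at the next state, while the other table evaluates it:

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular double Q-learning update.

    With probability 1/2 we update Q1, using Q1 to select the argmax
    action at s_next but Q2 to evaluate it (and symmetrically for Q2).
    This decouples action selection from value estimation.
    """
    if random.random() < 0.5:
        a_star = max(actions, key=lambda ap: Q1[(s_next, ap)])
        td_target = r + gamma * Q2[(s_next, a_star)]
        Q1[(s, a)] += alpha * (td_target - Q1[(s, a)])
    else:
        a_star = max(actions, key=lambda ap: Q2[(s_next, ap)])
        td_target = r + gamma * Q1[(s_next, a_star)]
        Q2[(s, a)] += alpha * (td_target - Q2[(s, a)])

def greedy_action(Q1, Q2, s, actions):
    """For acting, one common choice is to be greedy in Q1 + Q2."""
    return max(actions, key=lambda ap: Q1[(s, ap)] + Q2[(s, ap)])
```

The tables can be plain `defaultdict(float)` keyed by (state, action); behavior would typically be epsilon-greedy with respect to the sum of the two tables.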
So, in this case, I'm going to skip the rest and just show you the final slide with the difference. The difference here can sometimes be significant. This is looking at the percent of the time that we take the bad action, in a domain where one action is actually the wrong thing to do but its reward is stochastic. With a small amount of data, that action can end up looking better than another option whose reward is deterministic and actually better, and Q-learning can suffer quite a lot from this maximization bias.
That happens when you use the same Q function both to immediately define your policy and to estimate the value of that policy, whereas double Q-learning does a lot better in this case. So it's something to consider when you're implementing these things, and it's pretty small overhead too, because you just maintain two different Q functions.
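As a rough illustration of why the double estimator helps (my own toy version with made-up numbers, not the exact domain on the slide): action 0 has a stochastic reward with true mean -0.1, action 1 a deterministic reward of 0, so the true value of the best action is max(-0.1, 0) = 0. A single estimator that maxes over empirical means overestimates that value; the double estimator, which picks the argmax on one half of the data and evaluates it on the other, does not.

```python
import random

random.seed(2)

def estimates(n=10):
    """Return (single-estimator, double-estimator) value estimates."""
    r0 = [random.gauss(-0.1, 1.0) for _ in range(2 * n)]  # stochastic, truly worse
    r1 = [0.0] * (2 * n)                                  # deterministic, truly better
    # Single estimator: max over empirical means of all samples.
    single = max(sum(r0) / (2 * n), sum(r1) / (2 * n))
    # Double estimator: pick the argmax on one half of the data,
    # then evaluate that action on the other, independent half.
    qa = (sum(r0[:n]) / n, sum(r1[:n]) / n)
    qb = (sum(r0[n:]) / n, sum(r1[n:]) / n)
    a_star = 0 if qa[0] > qa[1] else 1
    double = qb[a_star]
    return single, double

trials = [estimates() for _ in range(50_000)]
single_mean = sum(s for s, _ in trials) / len(trials)
double_mean = sum(d for _, d in trials) / len(trials)
print(single_mean, double_mean)  # single is clearly positive; double is not
```

The double estimator trades the upward bias for a slight downward one, which is typically much less harmful when you then act greedily.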
Right, I know that was a little bit fast, but the details will be in the additional slides we upload today. The main things you should take away from today: be able to understand how you do Monte Carlo on-policy control, and the same for SARSA and Q-learning. It's useful to understand how quickly they update, both in terms of whether you have to wait until the end of the episode and how quickly information propagates back. And also understand how to define the conditions under which the algorithms converge to the optimal Q function. Thanks.
