All right. So, last time we were starting a general overview of what reinforcement learning involves, and we introduced the notions of a model, a value, and a policy. So it's good to refresh your brain right now about what those three things are. Can anybody remember off the top of their head what a value, a model, or a policy was in the context of reinforcement learning?
Um, so a policy is a set of actions that the agent should take in the world.
Exactly right. So, the definition of a policy is a mapping from the state you're in to the action to take. And it might be a good policy or a bad policy, and the way we evaluate that is in terms of its expected discounted sum of rewards.
Does anybody remember what a model was? Yeah?
A model is, uh, a representation of the world and how that changes in response to the agent's actions.
Yeah, that's right. So normally we think of a model as incorporating either a reward model or a dynamics model, which specifies, in response to the current state and an action, how the world might change; it could be a stochastic model or a deterministic model. And the reward model specifies the expected reward the agent receives from taking a particular action in a particular state.
So what we're gonna talk about today is: if you know a model of the world, so you know what happens if you take an action in a particular state, or what the distribution of next states might be if you take an action, how should we make decisions? That is, how do we do the planning problem? We're not gonna talk about learning today. We're just gonna talk about the problem of figuring out what is the right thing to do when your actions may have delayed consequences, which means that you may have to sacrifice immediate reward in order to maximize long-term reward.
So as we just stated, the models we're generally gonna think about are statistical or mathematical models of the dynamics and the reward function. A policy is a function that maps the agent's states to actions, and the value function is the expected discounted sum of rewards from being in a state, and/or taking an action, and then following a particular policy.
So what we're gonna do today is build up from Markov Processes to Markov Decision Processes. And this build-up, I think, is a nice one, because it allows one to think about what happens in the cases where you might not have control over the world, but the world might still be evolving in some way, and to think about what the reward might be in those sorts of processes for an agent that is passively experiencing the world. Then we can start to think about the control problem of how the agent should choose to act in the world in order to maximize its expected discounted sum of rewards.
So, what we're gonna focus on today, and in most of the rest of the class, is this Markov Decision Process, where we think about an agent interacting with the world. The agent gets to take actions, typically denoted by a; those affect the state of the world in some way, and then the agent receives back a state and a reward. Last time we talked about the fact that this could in fact be an observation instead of a state. But when we think about the world being Markov, we're going to think of an agent just focusing on the current state, the most recent observation, like whether or not the robot's laser range finder is saying that there are walls to the left or right of it, as opposed to thinking of the full prior history of the sequences of actions taken and observations received. As we talked about last time, you can always incorporate the full history to make something Markov, but most of the time today we'll be thinking about immediate sensors. If it's not clear, feel free to reach out.
So, what does it mean for a process to be Markov? The Markov property says that the state the agent is using to make its decisions is a sufficient statistic of the history. Which means that, in order to predict the distribution over states on the next time step (here we're using t to denote the time step), conditioning on our current state s_t and the action taken a_t is equivalent to conditioning on the entire history, where the history, recall, is the sequence of all the previous actions, rewards, and states we have seen up until the current time point: p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t). Essentially, it allows us to say that the future is independent of the past given some current aggregate statistic of the present.
So when we think about a Markov Process, or Markov Chain, we don't think of there being any control yet; there are no actions. But the idea is that you might have a stochastic process that's evolving over time. Whether or not I invest in the stock market, the stock market is changing over time, and you could think of that as a Markov Process: I could just be passively observing how the stock value for a particular stock is changing over time. A Markov Chain is just a sequence of random states where the transition dynamics satisfies this Markov property.
So formally, the definition of a Markov Process is that you have a finite, or potentially infinite, set of states, and you have a dynamics model which specifies the probability of the next state given the previous state. There are no rewards and no actions yet. And if you have a finite set of states, you can just write this down as a matrix: a transition matrix that says, starting from some state, what's the probability distribution over next states that you could reach?
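To make the matrix view concrete, here's a minimal sketch in Python using NumPy. The 3-state chain and its numbers are made up for illustration; they are not the Mars rover values from the lecture.

```python
import numpy as np

# Hypothetical 3-state Markov chain for illustration.
# P[i, j] = probability of moving from state i to state j
# on the next time step, so each row sums to one.
P = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
assert np.allclose(P.sum(axis=1), 1.0)

# A one-hot row vector says "we are in state 0 right now";
# one step of the chain is a vector-matrix product.
s = np.array([1.0, 0.0, 0.0])
next_dist = s @ P
print(next_dist)  # [0.6 0.4 0. ]
```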
So if we go back to the Mars Rover example that we talked about last time. In this little example, we thought of a Mars Rover landing on Mars, and there might be different sorts of landing sites, so maybe our Mars Rover starts off here. Then it can go to the left or right under different actions, or we could just think of those actions as being a_1 or a_2, where it's trying to act in the world. In this case we don't actually have actions yet, so we just think of it as, maybe it already has some way it's moving in the world; the motors are just working. And so in this case the transition dynamics looks like this. The way you could read this is: the probability that I start in a particular state s_1 and transition to the next state on the next time step is 0.4, and there is a 0.6 chance that I stay in the same state on the next time step. Yeah?
Um, which dimension represents the start state?
So, this is a great question: which state is the start state? I'm not specifying that here. In general, when we think about Markov chains, we think about looking at their steady-state distribution. Their stationary distribution will converge to some distribution over states that is independent of the start state, if you run the chain for long enough.
Oh, sorry, I meant to ask: on that matrix, which dimension represents the initial state?
Oh, you mean where you are right now? Yeah. In this particular case, you would represent your initial state as a one-hot vector, so 1, 0, 0, 0, 0, 0, 0 for starting in state s_1, and then you take the product of that with P, and that gives you your next-state distribution s'. Depending on the convention, it may need to be transposed, or sit on one side of P or the other. Yeah?
Um, are those probabilities computed based on the rewards, like the reward of going from state 1 to state 2?
Great question: do these transition probabilities relate to the reward? In this case, we're just thinking of Markov Chains, so there's no reward yet, and there are no actions. This is just specifying that there's some state of the process. So it's as if your agent had some configuration of its motors; you don't know what that is, it was set down on Mars, and then it just starts moving about. And what this says is: these are the transition probabilities, so if the agent starts in state s_1, then the probability that it stays in state s_1 is 0.6. So the probability that, starting in this particular state, on the next time step you're still there, is 0.6, because of whatever the configuration of the motors was for that robot.
So this is just how the world works?
Yes, this is specifying how the world works; that's a great question. We're assuming right now that the environment you're in is just described as a Markov Process, and this describes the dynamics of that process. We're not talking about how you would estimate those. This is really as if this is how that world works; this is the world of the fake little Mars Rover.
Any questions about that? Yeah?
Uh, does the s_1 one-hot vector need to be transposed when you multiply it by P? [inaudible]
Yes. Let me just write down the correct vector notation. It would be like this: [1, 0, 0, 0, 0, 0, 0]. That would be a sample starting state you could be in, for example; this could be your initial state, and it would mean that your agent is initially in state s_1. Then, if you want to know where it might be on the next step, you would multiply that by the transition model P. Depending on the notation, and whether you take the transpose of this transition model, it will be on the left or the right; it should always be obvious from context, but if it's not clear, feel free to ask us. And what would that say? If you took the matrix multiplication of this vector, which just says you're starting in state s_1, what would that look like? Afterwards, it would say that you are still in state s_1 with probability 0.6, and in state s_2 with probability 0.4. This would be your new state distribution; it's just a vector which specifies the distribution over next states that you would be in. Any questions about that?
Okay. All right. So, this is just specifying the transition model over how the world works over time, and I've written it in matrix notation there to be compact. But if it's easier, it's fine to just think about it in terms of these probabilities of next states given the previous state. You can just enumerate those, and you can write it in matrix form if the number of states happens to be finite.
So, what would this look like if you wanted to think of what might happen to the agent over time, or what the process might look like? You could just sample episodes. Let's say that your initial starting state is s_4. You could write that as a one-hot vector, multiply it by P, and that gives you some probability distribution over the next states you might be in, and the world will sample one of those. Your agent can't be in multiple states at the same time. So, for example, if we were looking at state s_1, it has a 0.6 chance of staying in s_1 and a 0.4 chance of transitioning; the world will sample one of those two outcomes for you. In this case, we have similar dynamics from s_4: from s_4, there's a probability of 0.4 of going to state s_3, a probability of 0.4 of going to state s_5, and a probability of 0.2 of staying in the same place. So, if we were going to sample an episode of what might happen to the agent over time, you could start with s_4, then maybe it transitions to s_5, then maybe s_6, s_7, s_7, s_7. You're just sampling from this transition matrix to generate a particular trajectory. You know what the dynamics of the world are, and then nature picks one of those outcomes; it's like sampling from a probability distribution. Any questions about that?
Okay. So, that just gives you a particular episode. And we're going to be interested in episodes because later we're gonna be thinking about rewards over those episodes, and how we compare the rewards we might achieve over them. But for right now, this is just a process; it's just giving you a sequence of states.
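Sampling episodes like this is easy to sketch in code. Here's an illustrative example in Python/NumPy: the 7-state chain below is a made-up stand-in for the Mars rover dynamics (stay with probability 0.2, move toward each neighbor with probability 0.4), not the exact matrix from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 7-state chain: from each state, stay with prob 0.2
# and try to move left/right with prob 0.4 each (at the edges, the
# blocked move adds its probability back onto the boundary state).
n = 7
P = np.zeros((n, n))
for s in range(n):
    P[s, max(s - 1, 0)] += 0.4
    P[s, min(s + 1, n - 1)] += 0.4
    P[s, s] += 0.2

def sample_episode(P, start, horizon, rng):
    """Generate one trajectory by repeatedly drawing s' ~ P[s, :]."""
    traj = [start]
    s = start
    for _ in range(horizon - 1):
        s = rng.choice(len(P), p=P[s])
        traj.append(s)
    return traj

# Nature picks one outcome per step, so each call can differ.
episode = sample_episode(P, start=3, horizon=6, rng=rng)
print(episode)
```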
So, next we're gonna add in rewards. That was just a Markov chain; so now, what is a Markov Reward Process? Again, we don't have actions yet, just like before, but now we also have a reward function. We still have a dynamics model like before, and now we have a reward function that says: if you're in a particular state, what is the expected reward you get from being in that state? We also now have a discount factor, which allows us to think about how much we weight immediate rewards versus future rewards. Just like before, if we have a finite number of states, R can be represented in matrix notation; in this case it's just a vector, because it's just the expected reward we get for being in each state. So, if we look at the Mars Rover MRP, we could say that the reward for being in s_1 is equal to 1, the reward for being in s_7 is equal to 10, and everywhere else the reward is zero. Yeah?
Are the rewards always just tied to the state you're in? I think last time you talked about it also being a function of the action. Why are we not considering that here?
Great question. I mentioned last time that rewards for a Markov Decision Process can be a function of the state, the state and action, or the state, action, and next state. Right now we're still in Markov Reward Processes, so there's no action. So in this case, the ways you could define rewards would be over the immediate state, or the state and next state.
So, once we start to think about there being rewards, we can start to think about there being returns and expected returns. First of all, let's define what a horizon is. A horizon is just the number of time steps in an episode; it's how long the agent is acting for, or how long this process goes on, and it could be infinite. If it's not infinite, then we call it a finite Markov Reward Process; we talked about those briefly last time. But often we think about the case where an agent might be acting forever, or this process might be going on forever, with no termination. The stock market is up today, it'll be up tomorrow, and we expect it to be up for a long time; we're not necessarily trying to evaluate it over a short time period. One might want to evaluate it over a very long time period. So, with that defined:
The definition of a return G_t is the discounted sum of rewards you get from the current time step to the horizon, and that horizon could be infinite: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... A return says: if I start off at time step t, I get the immediate reward, then I transition, maybe to a new state, and I weigh that reward by gamma; then I transition again and weigh that one by gamma squared, et cetera. The definition of the value function is just the expected return, V(s) = E[G_t | s_t = s]. If the process is deterministic, these two things are identical; but in general, if the process is stochastic, they will be different. What I mean by deterministic is that you always go to the same next state: if from each state there's only a single next state you can go to, then the expectation is equivalent to a single return. But in the general case, we are gonna be interested in stochastic processes, which means averages will be different from particular runs.
As an example of that, well, let me first talk about the discount factor, and then I'll give an example. Discount factors are a little bit tricky: they're both somewhat motivated and somewhat used for mathematical convenience. We'll see later that one of the mathematical benefits of discount factors is that we can be sure the value function, the expected discounted sum of rewards, is bounded, as long as the reward function is bounded. And people empirically often act as if there is a discount factor: we typically weigh future rewards lower than immediate rewards, and businesses often do the same. If gamma is equal to 0, you only care about immediate reward, so the agent is acting myopically; it's not thinking about what could happen later on. And if gamma is equal to 1, then future rewards are exactly as beneficial to you as immediate rewards. Now, one thing to note: if you're only using discount factors for mathematical convenience, and your horizon is always guaranteed to be finite, it's fine to use gamma equal to 1 from the perspective of mathematical convenience.
Any questions about discount factors? Yeah?
My question is: does the discount factor gamma always have to progress in a geometric fashion, or is there a reason why we do that?
It's a great question: is it necessary to use a gamma that discounts in this exponential, geometric fashion? It's one choice that ends up having very nice mathematical properties. One could try other discounting schemes, but this is certainly the most common one, and we'll see later why it has some really nice mathematical properties. Any other questions? Okay.
So, what would be some examples of this? If we go back to our Mars Rover, and we now have this definition of reward, what would be a sample return? Let's imagine that we start off in state s_4 and then transition to s_5, s_6, s_7, and we only have four-step returns. What that means here is that our process only continues for four time steps, and then maybe it resets. Why might something like that be reasonable? Well, particularly when we start to get into decision-making: maybe customers interact with a website for, on average, two or three time steps. There's often a bounded episode length, so in many, many cases the horizon is naturally bounded. So, in this scenario we start off in s_4; s_4, s_5, and s_6 all have zero rewards by definition, and then in state s_7 we get a reward of 10. But that has to be weighed down by the discount factor, which here is 1/2, so it's 10 times 1/2 to the power of 3. And so the sample return for this particular episode is just 1.25.
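Just to check that arithmetic, here's the return computation spelled out as a few lines of Python, using the same numbers as the example (gamma = 1/2, rewards of 0, 0, 0, 10 along s_4, s_5, s_6, s_7):

```python
# Sample return G for the episode s_4 -> s_5 -> s_6 -> s_7
# with gamma = 1/2; only s_7 carries reward (10), the rest are 0.
gamma = 0.5
rewards = [0, 0, 0, 10]

# G = r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1.25, i.e. 10 * (1/2)**3
```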
And of course we could define this for any particular episode, and these episodes generally might go through different states, even if they start in the same initial state, because we have a stochastic transition model. So in one case maybe the agent just stays near s_4 (s_4, s_5, s_4) and doesn't get any reward, and in other cases it might go all the way to the left. So, if we then think about what the expected value function would be, it would involve averaging over a lot of these, and as we average over all of these we can get different rewards at different time steps.
So, how would we compute this? One thing you could do, which is motivated by what I was just showing, is estimate it by simulation. You could take, say, an initial starting state distribution, which could be a single starting state or many starting states, and just roll out your process. Right now we're assuming that we have a transition model (the transition matrix) and a reward model, and you could just roll this out, like what we were showing on the previous couple of time steps, many, many times, and then average. That would asymptotically converge to the value function, because the value function is just the expected return. So one thing you can do is simulation, and there are mathematical bounds you can use to say how many simulations you would need in order for your empirical average to be close to the true expected value. The error roughly goes down on the order of one over the square root of N, where N is the number of roll-outs you've done. So that tells you that if you want to figure out the value of your Markov Reward Process, you could just do simulations, and that would give you an estimate of the value.
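As a sketch of that Monte Carlo idea, here's a tiny hypothetical two-state MRP in Python/NumPy (the numbers are invented for illustration). We roll out many episodes from state 0 and average the discounted returns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state MRP: state 0 moves to the absorbing state 1
# half the time; only state 1 gives reward.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([0.0, 1.0])
gamma = 0.9

def rollout_return(start, horizon=200):
    """One simulated episode: accumulate gamma**t * R[s_t].
    The horizon truncates the infinite sum; gamma**200 is negligible."""
    s, G = start, 0.0
    for t in range(horizon):
        G += gamma**t * R[s]
        s = rng.choice(2, p=P[s])
    return G

N = 5000
estimate = np.mean([rollout_return(0) for _ in range(N)])
print(round(estimate, 2))  # close to the true value V(0) = 4.5 / 0.55 ~= 8.18
```

The empirical average converges to V(0), with error shrinking on the order of 1/sqrt(N).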
The nice thing about doing this is that it requires no assumption of Markov structure. It's not actually using the fact that it's a Markov Reward Process at all; it's just a way to estimate sums of rewards. That's nice in the sense that, if you're using this in a process that you estimated from some data, where the dynamics model is also estimated from data and might be wrong, then if you can really roll out in the world, you can get nice estimates of how the process is really working. But it doesn't leverage the fact that, if the world really is Markov, there's additional structure we could use to get better estimates. What do I mean by better estimates here? I mean computationally cheaper ways of estimating the value of the process.
So, what the Markov structure allows us to do, through the fact that the future is independent of the past given the present, is decompose the value function. The value function of a Markov Reward Process is simply the immediate reward the agent gets from the current state it's in, plus the discounted sum of future rewards, where we can express that discounted sum of future rewards with V(s'): V(s) = R(s) + gamma * sum over s' of P(s' | s) * V(s'). So we say: whatever state you're in right now, you're going to get your immediate reward, then you're going to transition to some state s', and then you're going to get the value of whatever state s' you ended up in, discounted by our discount factor.
So, if we're in a finite-state MRP, we can express this using matrix notation. We can say that the value function, which is a vector, is equal to the reward plus gamma times the transition model times V: V = R + gamma * P * V. Note that, because of the way we're defining the transition model, P here gives the next state given the previous state, and we multiply that by the value function. The nice thing is that, once we've done that, we can just analytically solve for the value function. Remember, all of this is known: P is known, and R is known, and what we're trying to do is compute V. So what we can do is move the gamma * P * V term over to the other side: V - gamma * P * V = R, or (I - gamma * P) * V = R, where I is the identity matrix; these are all matrices. Which means V is just equal to the inverse of that matrix times R: V = (I - gamma * P)^{-1} * R.
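Here's that analytic solution as a few lines of Python/NumPy, on a made-up two-state MRP (the numbers are hypothetical). In practice you'd use np.linalg.solve on the linear system rather than explicitly forming the inverse:

```python
import numpy as np

# Hypothetical 2-state MRP (invented numbers): state 1 is absorbing
# with reward 1 per step, state 0 has no reward.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([0.0, 1.0])
gamma = 0.9

# V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)  # V(1) = 1 / (1 - gamma) = 10; V(0) = 4.5 / 0.55 ~= 8.18
```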
So, if one of the transitions can be back to the same state, wouldn't it become circular to try to express V(s) in terms of V(s)?
The question was whether it's possible to have self-loops: could this be circularly defined in this case? Because we're thinking about processes that are infinite horizon, the value function is stationary, and it's fine if you include self-loops; there's no problem if some states can transition back to themselves. You do need this matrix to be well-defined, so that you can take the inverse of it, but for most processes it is.
So, if we want to solve this directly, this is nice because it's analytic, but it requires taking a matrix inverse. And if you have N states, that's generally on the order of somewhere between N squared and N cubed, depending on which matrix-inversion algorithm you're using. Yeah?
Is it ever actually possible for that matrix not to have an inverse, or does a property like the rows summing to one make that impossible?
The question was: is it ever possible for this not to have an inverse? It's a good question. I think it's basically never possible for this not to have an inverse; I'm trying to think whether that can be violated in some cases. I think there are a couple of edge cases, for example if this ends up being the zero matrix, depending on how things are defined. But I'll double-check and then send a note on Piazza. Yeah.
Well, actually I think the biggest [inaudible] about the transition matrix [inaudible]
Let me just double-check so I don't say anything that's incorrect, and then I'll send a note on Piazza. It's a good question.
So, that's the analytic way of computing this. The other way is to use dynamic programming. In this case, it's an iterative algorithm instead of a one-shot computation. The idea is that you initialize the value function to be zero everywhere; in fact, you can initialize it to anything, and it doesn't matter if you're doing this until convergence. Then we're going to do something close to what we'll see later as a Bellman backup. The idea is that, because of the Markov property, we've said that the value of a state is exactly equal to the immediate reward we get plus the discounted sum of future rewards. We can simply use that to derive an iterative equation, where we use the previous value of each state in order to bootstrap and compute the next value of that state, and we do that for all states: V_k(s) = R(s) + gamma * sum over s' of P(s' | s) * V_{k-1}(s'). The computational complexity of each iteration is a little bit lower: it's only |S| squared, because you're doing this for each of the states and summing over all the possible next states. When I say we do this until convergence, generally what we do is define a norm. So, generally we would do something like ||V_k - V_{k-1}|| and iterate until it's lower than some epsilon.
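And here's the dynamic-programming version in Python/NumPy, on the same kind of made-up two-state MRP (hypothetical numbers): initialize V to zero and apply the backup until the change falls below epsilon:

```python
import numpy as np

# Hypothetical 2-state MRP (invented numbers), as in the analytic example.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([0.0, 1.0])
gamma = 0.9

# Bellman backup iterated to convergence: V_k = R + gamma * P @ V_{k-1}.
V = np.zeros(2)
epsilon = 1e-8
while True:
    V_new = R + gamma * (P @ V)
    if np.max(np.abs(V_new - V)) < epsilon:
        break
    V = V_new
print(V_new)  # converges to the analytic solution (I - gamma P)^{-1} R
```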
So, the advantage of this is that each of the iteration updates is cheaper, and there will also be some benefits later, when we start to think about actions. The analytic approach does not apply as easily once we have actions, but we'll see where it can still be relevant. So, those are really three different ways to compute the value of a Markov Reward Process: one is simulation, the second is the analytic solution, which requires a finite set of states, and the third is dynamic programming. Right now we're defining all of these for when the state space is finite, but we'll talk about the infinite state-space case later on.
So, now we can finally get to Markov Decision Processes. Markov Decision Processes are the same as Markov Reward Processes, except now we have actions. We still have a dynamics model, but now it's specified for each action separately, and we also have a reward function. As was asked before, by Camilla I think, the reward can be a function of the immediate state, the state and action, or the state, action, and next state; for most of the rest of today we'll take it to be a function of both the state and action. So, the agent is in a state, they take an action, they get an immediate reward, and then they transition to the next state. If you looked at the sequence of observations, you'd see something like this: s, a, r, and then a transition to state s'. And so a Markov Decision Process is typically described as a tuple: the set of states, actions, the reward model, the dynamics model, and the discount factor.
Because of the way you've defined the dynamics model, is it the case that if you take a specific action intended to move you to a state s', you won't always successfully move to that state? I guess I'm curious about why there's a probability at all. If you're in a state and take an action, why isn't it deterministic what the next state is?
The question is, I think: why are these processes stochastic at all? There are a lot of cases where we don't have perfect models of the environment. Maybe if we had better models, then things would be deterministic; so we're going to approximate our uncertainty over those models with stochasticity. Maybe you have a robot that's a little bit faulty, so sometimes it gets stuck on carpet and sometimes it goes forward, and we can write that down as a stochastic transition matrix where sometimes it stays in the same place and sometimes it advances to the next state. Or maybe you're on sand, or things like that. Maybe when you're trying to drive to SFO, sometimes you hit traffic and sometimes you don't. You can imagine putting a lot more variables into your state space to try to make that a deterministic outcome, or you could just say, "Hey, sometimes when I try to go to work I hit some number of red lights and I'm late, and other times I don't hit those red lights and I'm fine."
So, if we think about our Mars Rover MDP: now, let's define there being two actions, a_1 and a_2. You can think of these as the agent trying to move left or right, but for this particular example it's also perhaps easiest just to think of them as deterministic actions. We can write down what the transition matrix would be in each of these two cases, which shows us exactly what the next state would be given the action. What's happening in this case is: if the agent tries to do a_1 in state s_1, it stays in that state; otherwise, under action a_1, it will generally move to the next state over. And for action a_2, it'll move to the right, unless it hits s_7, and then it'll stay there.
So, like we said at the beginning of class,
a Markov Decision Process policy specifies what action to take in each state.
And the policies themselves can be deterministic or stochastic,
meaning that you could either have a distribution over the next action you might
take given the state you're in, or you could have a deterministic mapping.
It says whenever I'm in this state I always,
you know, do action a_1.
Now- and a lot of this class we'll be thinking about
deterministic policies but later on when we get into
policy search we'll talk a lot more about stochastic policies.
So, if you have an MDP plus a policy
then that immediately specifies a Markov Reward Process.
Because once you have specified the policy, you can think of that as
inducing a Markov Reward Process: you've specified your distribution over actions for
each state, and so you can define the expected reward you get under
that policy for any state, and similarly you can define
your transition model for the Markov Reward Process by
averaging across your transition models
according to the probability with which you would take those different actions.
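As a sketch of this reduction (function and variable names are my own): given the MDP models and a possibly stochastic policy, with pi[s][a] the probability of taking action a in state s, the induced MRP reward and transition models are just policy-weighted averages:

```python
# Collapse an MDP plus a policy into a Markov Reward Process:
#   R_pi(s)     = sum_a pi(a|s) * R(s, a)
#   P_pi(s, s') = sum_a pi(a|s) * P(s'|s, a)

def induce_mrp(P, R, pi, actions):
    n = len(R)
    R_pi = [sum(pi[s][a] * R[s][a] for a in actions) for s in range(n)]
    P_pi = [[sum(pi[s][a] * P[a][s][s2] for a in actions) for s2 in range(n)]
            for s in range(n)]
    return R_pi, P_pi

# Tiny two-state example: a1 always goes to state 0, a2 always to state 1,
# and the policy flips a fair coin in every state.
P = {"a1": [[1.0, 0.0], [1.0, 0.0]], "a2": [[0.0, 1.0], [0.0, 1.0]]}
R = [{"a1": 1.0, "a2": 0.0}, {"a1": 0.0, "a2": 0.0}]
pi = [{"a1": 0.5, "a2": 0.5}, {"a1": 0.5, "a2": 0.5}]
R_pi, P_pi = induce_mrp(P, R, pi, ["a1", "a2"])
```

For a deterministic policy, the weights are just 1 for the chosen action and 0 for the rest.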
So, the reason why it's useful to think about these connections between
Markov Decision Processes and Markov Reward Processes is it implies that if
you have a fixed policy you could just use all the techniques that
we just described for Markov Reward Processes, namely simulation,
analytic solution, or dynamic
programming, in order to compute what is the value of a policy.
So, if we go back to the iterative algorithm then it's exactly the same as before,
exactly the same as the Markov Reward Process except
for now we're indexing our reward by the policy.
So, in order to learn what is the value of a particular policy we
instantiate the reward function by always picking the action that the policy would take.
So, in this case, I'm doing it for simplicity for
deterministic policy and then
similarly just indexing which transition model
to look up based on the action that we would take in that state.
And this is also known as a Bellman backup for a particular policy.
So, it allows us to state what the value of the state under
this policy is: it's just the immediate reward I would get by
following the policy in the current state, plus
the expected discounted sum of rewards I get by continuing to follow this policy
from whatever state I end up in next.
So that's what the V^pi_k-1 term specifies:
the expected discounted sum of rewards we get by
continuing to follow the policy from whatever state we just transitioned to.
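A minimal sketch of this iterative policy evaluation, assuming a deterministic policy and the hypothetical Mars Rover encoding used earlier (a1 = move left, a2 = move right; +1 in s_1, +10 in s_7; all names are mine):

```python
# Iterative policy evaluation: repeat the Bellman backup for policy pi,
#   V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s'),
# until the values stop changing (to within a tolerance).

def evaluate_policy(P, R, pi, gamma, tol=1e-8):
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[s][pi[s]] + gamma * sum(P[pi[s]][s][s2] * V[s2]
                                           for s2 in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

# Mars Rover-style setup: 7 states, a1 deterministically moves left, a2 right;
# +1 in s_1 and +10 in s_7 for any action, 0 elsewhere.
N = 7
P = {a: [[0.0] * N for _ in range(N)] for a in ("a1", "a2")}
for s in range(N):
    P["a1"][s][max(s - 1, 0)] = 1.0
    P["a2"][s][min(s + 1, N - 1)] = 1.0
R = [{"a1": 0.0, "a2": 0.0} for _ in range(N)]
R[0] = {"a1": 1.0, "a2": 1.0}
R[6] = {"a1": 10.0, "a2": 10.0}

V = evaluate_policy(P, R, ["a1"] * N, gamma=0.5)
```

Under the always-a1 policy with gamma = 0.5, the agent parks in s_1 collecting +1 forever, so V(s_1) converges to 1/(1 - 0.5) = 2.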
So, if we go to
the Markov chain- or now
the Markov Decision Process for the Mars Rover,
then let's look at the case now where we have these two actions.
The reward function is still that, for any action, if you're in state
s_1 you get +1, and for any action in state s_7 you get +10.
Everything else is zero.
So, imagine your policy is always to do action a_1 and your discount factor is zero.
So, in this case,
what is the value of the policy
and this is just to remind you of what like the iterative way of computing it would be.
Yeah in the back.
Um, and I think that will be zero for everything
except s_1 and s_7 where it's +1 and +10.
That's exactly right. So this is a little bit of a trick question
because I didn't show you again what the transition model is.
That is exactly correct.
The- it doesn't matter what the transition model is here,
um, because gamma is equal to zero.
So that means that all of this goes away,
um, and so you just have the immediate reward.
So if your discount factor is zero then you just care about immediate reward.
And so the value is just the immediate reward for this policy:
the reward for all actions in state s_1 is always +1,
and the reward for all actions in all other states is zero, except
for state s_7 where it's always +10 no matter which action you take.
So V(s_1) is just equal to one.
That's the value function there.
Okay. So let's, um, look at another one.
So now we've got exactly the same process.
Um, I've written down a particular choice of the dynamics model for ah, state s_6.
So let's imagine that when you're in
state s_6 which is almost all the way to the right, um,
you have a 50% probability of staying there under action A1
or a 50% probability of going to state s_7.
That's what this top line says.
And then there's a whole bunch of other dynamics models that we're
not going to need to worry about to do this computation.
And then the reward is still +1 for state s_1,
+10 in state s_7,
zero for all the states in the middle.
And then let's imagine that, um,
we're still trying to evaluate the policy where you're always taking action a_1.
Um, and we've just said that V_k is equal to [1, 0, 0, 0, 0, 0, 10],
um, and now what we wanna do is do one more backup essentially.
So we want to move from V_k=1 and now compute V_k=2.
So how [NOISE] about everybody take a second and figure [NOISE] out what would be
the value under this particular policy, okay, for s_6.
So you can use this equation, um,
to figure out, given that I know what
my previous value function is because I've specified it there: it's [1, 0, 0, 0, 0, 0, 10].
Um, and now I'm going to be doing one backup,
and I'm only asking you to do it for one state,
you could do it for others if you want.
Um, what would be the new value of s_6 if you use this equation to compute it?
And it just requires plugging in the value of the reward
and the particular numbers for the dynamics and the old value function.
And the reason that I bring this up as an example is to show sort of
essentially how information flows as you do this computation.
So you start off in the very initial.
Let me just go over here first.
So when you start off, you're going to initialize
the value function to be zero everywhere.
The first backup you do basically initializes
the value function to be the immediate reward everywhere.
And then after that you're going to continue to
do these backups and essentially you're trying to
compute its expected discounted sum of
future rewards for each of the states under this policy.
So if you think about looking at this,
the information that state s_7 is good
is going to kinda flow backwards to the other states, because they're saying, "Okay well,
I'm in state s_4, I don't have any reward right now, but in a couple of timesteps
under this process I might, because I might reach that really great +10 state."
So as we do these iterations of policy evaluation,
we start to propagate the information about future rewards back to earlier states.
And so what I'm asking you to do here is to just do that for one, one more step.
Just say for state s_6,
what would its new value be?
Its previous value was zero.
Now we're going to do one backup and what's this new value.
So, uh,
let's take a question, then we can all take a second to work on it.
I'm just wondering, er, if repeating the same process to find the value function.
I guess if you don't necessarily know the value function of s,
you could just like reversibly follow it down.
The question was: if you don't know what
the value function is- I guess I'm not totally sure.
Wait, your question is asking- because this is a way to compute the value function.
So what we've done here is we've said,
we've initialized the value function to be zero everywhere.
That is not the real value function,
that just sort of an initialization.
And what this process is allowing us to do is we keep
updating the values of every single state until they stop changing.
And then that gives us the expected discounted sum of rewards.
Now you might ask, okay well they- are they ever guaranteed to stop changing?
And we'll get to that part later.
We'll get to the fact that this whole process is guaranteed
to be a contraction so it's not going to go on forever.
So the distance between the value functions is going to be shrinking.
And that's one of the benefits of the discount factor.
So if people don't have any more immediate questions,
I suggest we all take a minute and then just compare with
your neighbor of what number you get when you do this computation.
Just to quickly check that the Bellman equation makes sense.
[NOISE] All right. So, um,
wherever you got to, um,
hope you got a chance to sort of compare and check
your understanding with anybody that was next to you.
Um, before we go on I just want to, um,
answer a question that was asked before about whether or
not the analytics solution is always possible,
um, to invert. Let's go back to that.
So in this case, um,
because p is a stochastic matrix,
its eigenvalues are always going to be less than or equal to one.
If your discount factor is less than one,
then I minus gamma times P, where I is the identity matrix, is always going to be invertible.
That's the answer to that question.
So this matrix is always invertible as long as gamma is less than one. All right.
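As a sketch of the analytic route (the solver below is hand-rolled Gauss-Jordan elimination so the example stays dependency-free; in practice you would call a linear-algebra library): since I - gamma*P is invertible for gamma < 1, the MRP value is the solution of (I - gamma*P)V = R.

```python
# Analytic MRP evaluation: V = (I - gamma * P)^{-1} R, computed by solving
# the linear system (I - gamma * P) V = R.

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[r][n] / M[r][r] for r in range(n)]

def mrp_value_analytic(P, R, gamma):
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R)

# Two-state MRP: from state 0 you stay (reward 1) or move to the
# absorbing zero-reward state 1, each with probability 0.5.
V = mrp_value_analytic([[0.5, 0.5], [0.0, 1.0]], [1.0, 0.0], gamma=0.5)
```

By hand: V(1) = 0, and V(0) = 1 + 0.5 * 0.5 * V(0), so V(0) = 4/3, matching the solver.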
So let's go back to this one, um,
which we're going to require anyway for some of the other important properties we want.
So in this case what is that?
So this is: the immediate reward,
zero, plus gamma times [NOISE] [the 0.5 probability that we stay in
that state times the previous V of s_6, plus the 0.5 probability that we go to s_7 times V of s_7].
And this is going to be equal to zero plus gamma times [0.5 times zero plus 0.5 times 10].
So that's just an example of, um,
how you would compute one Bellman backup.
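Worked numerically, under the assumption that R(s_6, a1) = 0 and the previous values are V(s_6) = 0 and V(s_7) = 10. Note the transcript leaves gamma implicit in the final arithmetic, so it is kept as a parameter here:

```python
# One Bellman backup for s_6 under the always-a1 policy, using the numbers
# above: R(s_6, a1) = 0, P(s_6|s_6, a1) = P(s_7|s_6, a1) = 0.5,
# V_k(s_6) = 0 and V_k(s_7) = 10.

def backup_s6(gamma):
    r = 0.0                 # immediate reward in s_6
    v_s6, v_s7 = 0.0, 10.0  # previous value estimates
    return r + gamma * (0.5 * v_s6 + 0.5 * v_s7)

print(backup_s6(1.0))  # 5.0
print(backup_s6(0.5))  # 2.5
```

So with gamma = 1 the backup gives 5, and with, say, gamma = 0.5 it gives 2.5.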
And that's back to my original question which is you seem to be using
V_k without the superscript pi to evaluate it.
Oh, sorry this should, yes.
This should have been pi.
That's just a typo. And that's that was correct in there.
Question was just whether or not that was supposed to be pi up there.
Yes it was, thanks for catching.
All right, so now we can start to talk about Markov Decision Process control.
Now just to note there: I led us through policy evaluation
in an iterative way; you could have also
done it analytically or you could have done it with simulation.
But it's a particularly nice analogy now that we're going to start to think about control.
So again what do I mean by control?
Control here is going to be the fact that ultimately
we don't care about just evaluating policies;
typically we want our agent to actually be learning policies.
And in this case we're not going to talk about learning policies,
we're just going to be talking about computing optimal policies.
So the important thing is that there exists a unique optimal value function.
And the optimal policy for
an infinite horizon, finite state MDP is deterministic.
So that's one really good reason why it's
sufficient for us to just focus on deterministic policies,
with finite state MDPs,
um, in infinite horizons.
Okay. So how do we compute it?
Well first before we do this let's think about how many policies there might be.
So there are seven discrete states.
In this case it's the locations that the robot can be in.
There are two actions.
I won't call them left and right,
I'm just going to call them a_1 and a_2.
Because left and right kind of implies that you will definitely achieve that.
We can also just think of these as generally being stochastic scenarios.
So let's just call them a_1 and a_2.
Then the question is how many deterministic policies are
there and is the optimal policy for MDP always unique?
So why don't we just take, like,
one minute, or say one or two minutes; feel free to talk to a neighbor
about how [NOISE] many deterministic policies there are for
this particular case. And then, um,
once you've answered that it's fine to think about in general if you
have |S| states and |A| actions,
and this is the cardinality of those sets.
How many possible deterministic policies are there?
Um, and then the second question which is whether or not these are always unique.
[NOISE] Can anyone I'd
take a guess at how many deterministic policies that are in this case?
[NOISE].
It's a mapping from states to actions so it's gonna be 2 to the 7th.
That's exactly right. That is it's a mapping.
Er, if we remember back to our definition of what a policy is,
a mapping is going to be a map from states to actions.
So what that means in this case is that there are
two choices for every state and there are seven states.
And more generally, the [NOISE] number of policies is |A| to the |S|. So this can be large,
it's exponential in the state space, but it's finite.
So it's bounded. Um, anyone
want to take a guess of whether or not the optimal policy is always unique?
I told you the value function is unique.
Is the policy unique?
Yeah.
I think there might be cases where it's not.
Exactly right, um. It's not always unique.
The value function is unique, but there may be cases where you get ties.
And so it might be that there are two actions,
um, or two policies, that have the same value.
So no, it depends on the process.
You mean like unique optimal value function?
Ah, yes.
So the question is can I explain what I mean by there's a unique optimal value function.
I mean that the optimal value of the state.
So the expected discounted sum of returns, um,
there is- there may be more than one optimal policy but there
exists at least one optimal policy which leads to the maximum value for that state.
Um, and there's a single value of that.
We'll talk about- it'll probably be a little bit clearer
when we talk about contraction properties later.
Um, that there's- so for each state it's just a scalar value.
It says exactly what is the expected discounted sum of returns and this
is the maximum expected discounted sum of returns under the optimal policy.
Yeah.
And on the [inaudible] policies in our-
When we first defined policies, I thought we were describing the- um,
the entire hash table, with sort of
one action per state, rather than all possible combinations.
It's a little surprising that it's 2 to the 7th rather than
just being the number of states, with each one mapped to an action.
Let me sort of better clarify, you know,
how many policies there are, and why
maybe it looked like it
was going to be linear when it's actually exponential.
Um, the way that we're defining a decision policy here,
um, a deterministic decision policy is a mapping from a state to an action.
And so that means for each state we get to choose an action and so
just as an illustration of why this ends up being exponential.
Um, so, in this case let's imagine that instead of having
seven states we just have two states.
Now we have s_1 and s_2.
[NOISE] So, you could either have action a_1-a_1,
you could have action a_1-a_2,
you could have action a_2-a_1 or action a_2-a_2.
And all of those are distinct policies.
So, that's why the space ends up being exponential. Sure.
When you have like A to the power S. I'm assuming that A refers to
legal actions per state assuming like
you could have different actions depending on the state.
The question is whether or not you might be able to have
different constraints on the action space for state, absolutely.
So, in this case, today for simplicity,
we're going to assume that all actions are applicable in all states.
Um, in reality that's often not true.
Um, in many real-world cases,
um, some of the actions might be specific to the state.
For example, there's a huge space of medical interventions.
Um, er, for many of them,
they might not be at all reasonable to even consider,
um, in certain states.
Um, so, in general,
you can have different action subspaces per
state, and then you would take the product over
the cardinalities of the action sets that are relevant for each of the states.
But for right now, I think it's simplest just to think of it as
one uniform action space whose actions can be applied in any state.
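The counting argument from a moment ago can be checked by direct enumeration; a small sketch (names are mine):

```python
# Every deterministic policy is one choice of action per state, so the
# number of policies is |A| ** |S|.

from itertools import product

def all_deterministic_policies(states, actions):
    for choice in product(actions, repeat=len(states)):
        yield dict(zip(states, choice))

# The two-state example spelled out above: a1-a1, a1-a2, a2-a1, a2-a2.
two_state = list(all_deterministic_policies(["s1", "s2"], ["a1", "a2"]))
# The seven-state Mars Rover case: 2 ** 7 = 128 policies.
seven_state = list(all_deterministic_policies(range(7), ["a1", "a2"]))
```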
Okay. So, um, the optimal policy for an MDP in an
infinite horizon problem, where the agent acts forever,
um, is deterministic.
It's stationary which means it doesn't depend on the time-step.
We started talking about that a little bit last time.
Um, so, it means that if I'm in this state- if I'm in state s_7,
there is an optimal action for being in
state s_7 whether I encountered it at time-step one,
time-step 37, or time-step 242. It's stationary.
Um, er one of the intuitions for this is that if you get to act
forever there's always like
an infinite number of future time steps no matter when you're at.
So, if you would always do action a_1 from state s_7 now,
um then if you encounter it again in 50 time-steps you still have
an infinite amount of time to go from there and so you'd still
take the same action if that was the optimal thing to do.
As we were just discussing, the optimal policy is not necessarily unique,
um, because you might have more than one policy with the same value function.
So, how would we compute this?
One option is policy search uh and we'll talk a lot more about this in
a few weeks when we're talking about
function approximation and having really really large state spaces.
Um, but even in tabular cases,
er we can just think of searching.
So, the number of deterministic policies, as we just discussed, is |A| to the |S|,
um, and policy iteration is a technique that is generally better than enumeration.
So, what do I mean by enumeration in this context?
I mean there's a finite number of policies.
You could just evaluate each of them separately and then pick the max.
So, if you have a lot of compute,
you might just want to do this exhaustively- and this might be better if you really
care about wall-clock time and you have many, many processors.
You could just try all of your policies,
evaluate all of them either analytically or iteratively or
whatever scheme you want to use and then take the max over all of them.
But if you don't have kind of infinite compute,
it's generally more computationally efficient if you have to do
this serially to do policy iteration and so we'll talk about what that is.
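For contrast, here is a sketch of the exhaustive enumeration just described, on the hypothetical Mars Rover encoding (a1 = left, a2 = right): evaluate every deterministic policy and keep the best. It works for this 7-state, 2-action problem (128 policies) but scales as |A|**|S|.

```python
# Brute-force policy search: evaluate every deterministic policy and keep
# the best one. "Best" here compares total value across states; for an MDP
# the optimal policy maximizes the value at every state simultaneously.

from itertools import product

def evaluate_policy(P, R, pi, gamma, tol=1e-8):
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[s][pi[s]] + gamma * sum(P[pi[s]][s][s2] * V[s2]
                                           for s2 in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

def brute_force_search(P, R, gamma, actions):
    n = len(R)
    best_pi, best_V = None, None
    for choice in product(actions, repeat=n):  # all |A| ** |S| policies
        V = evaluate_policy(P, R, list(choice), gamma)
        if best_V is None or sum(V) > sum(best_V):
            best_pi, best_V = list(choice), V
    return best_pi, best_V

# Mars Rover-style setup (hypothetical encoding: a1 = left, a2 = right).
N = 7
P = {a: [[0.0] * N for _ in range(N)] for a in ("a1", "a2")}
for s in range(N):
    P["a1"][s][max(s - 1, 0)] = 1.0
    P["a2"][s][min(s + 1, N - 1)] = 1.0
R = [{"a1": 0.0, "a2": 0.0} for _ in range(N)]
R[0] = {"a1": 1.0, "a2": 1.0}
R[6] = {"a1": 10.0, "a2": 10.0}

pi_star, V_star = brute_force_search(P, R, gamma=0.5, actions=["a1", "a2"])
```

With gamma = 0.5 the best policy parks in s_1 from s_1 and s_2, and heads right toward the +10 state from everywhere else.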
So, in policy iteration what we do is we basically
keep track of a guess of what the optimal policy might be.
We evaluate its value and then we try to improve it.
If we can't improve it any more,
um then we can- then we can halt.
So, the idea is that we start by initializing randomly.
Here now you can think of the subscript is indexing which policy we're at.
So, initially we start off with some random policy and
then π_i is always going to index
sort of our current guess of what the optimal policy might be.
So, what we do is we initialize our policy randomly, and while it's not changing-
and we'll talk about whether or not it can change or go back to the same one in a second-
we do policy evaluation.
We evaluate the policy using the same sorts of techniques we just
discussed because it's a fixed policy
which means we are now basically in a Markov Reward Process.
And then we do policy improvement.
So, the really the new thing compared to what we were
doing before now is policy improvement.
So, in order to define how we could improve a policy,
we're going to define something new which is the state action value.
So, before we were just talking about state values,
state values are denoted by V. We're talking about
like V^pi(s) which says if you start in state s and you
follow policy pi what is the expected discounted sum of rewards.
A state action value says well,
I'm going to follow this policy pi but not right away.
I'm going to first take an action a,
which might be different than what my policy is telling me to
do and then later on the next time-step I'm going to follow policy pi.
So, it just says I'm going to get my immediate reward from taking this action a
that I'm choosing and then I'm going to transition to a new state.
Again, that depends on my current state and the action I just
took and from then on I'm going to take policy pi.
So, that defines the Q function
and what policy improvement does is it says okay you've got a policy,
you just did policy evaluation and you got a value of it.
So, policy evaluation just allowed you to compute what was the value of
that policy [NOISE] and now I want to see if I can improve it.
Now, remember right now we're in the case where we know
the dynamics model and we know the reward model.
So, what we can do then is we can do this with
Q computation where we say okay well I've got
that previous value function by policy and now
I compute Q^pi which says if I take a different action,
it could be the same and we do this for all A and for all S. So,
for all A and all S we compute this and then we're
going to compute a new policy and this is the improvement step which maximizes this Q.
So, we just do this computation and then we take the max.
Now, by definition this has to be greater than or equal to Q^pi_i(s, pi_i(s)),
right, because either the arg max is going to be the same as
your previous policy pi_i, or it's going to be different, and the only time you're going
to pick it differently is if the Q function of that alternative action is better.
So, by definition, the max over a of Q^pi_i(s, a)
has to be greater than or equal to Q^pi_i(s, pi_i(s)). Question at the back.
Is this going to be susceptible-
is this going to be like finding a local maximum where it
kind of gets stuck there and [inaudible] for actions?
Okay. So, this is going to allow us to do some local monotonic improvement,
um, but are we going to be susceptible to getting stuck?
Um, in fact, ah, for any of you that have played
around with reinforcement learning and policy gradient and
stuff, that is exactly one of the problems that can happen when we start doing
gradient-based approaches. Nicely, in this case this does not occur.
So, we're guaranteed to converge to the global optima and we'll see why for a second.
Okay. All right. So this is how it works.
You do this policy evaluation and then you compute the Q function and then
you compute the new policy that takes an arg max of the Q function.
So, that's how policy improvement works.
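Putting the two steps together, a compact sketch of policy iteration on the same hypothetical Mars Rover encoding (a1 = left, a2 = right; evaluation done iteratively, improvement by greedifying Q^pi; all names are mine):

```python
# Policy iteration: evaluate the current policy, form
#   Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V^pi(s'),
# greedify (arg max over a), and stop once the policy no longer changes.

def policy_iteration(P, R, gamma, actions, tol=1e-8):
    n = len(R)
    pi = [actions[0]] * n  # arbitrary initial policy
    while True:
        # Policy evaluation (iteratively, as for a Markov Reward Process).
        V = [0.0] * n
        while True:
            V_new = [R[s][pi[s]] + gamma * sum(P[pi[s]][s][s2] * V[s2]
                                               for s2 in range(n))
                     for s in range(n)]
            done = max(abs(a - b) for a, b in zip(V_new, V)) < tol
            V = V_new
            if done:
                break
        # Policy improvement: greedy with respect to Q^pi.
        def q(s, a):
            return R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n))
        pi_new = [max(actions, key=lambda a, s=s: q(s, a)) for s in range(n)]
        if pi_new == pi:
            return pi, V
        pi = pi_new

# Mars Rover-style setup (hypothetical encoding: a1 = left, a2 = right).
N = 7
P = {a: [[0.0] * N for _ in range(N)] for a in ("a1", "a2")}
for s in range(N):
    P["a1"][s][max(s - 1, 0)] = 1.0
    P["a2"][s][min(s + 1, N - 1)] = 1.0
R = [{"a1": 0.0, "a2": 0.0} for _ in range(N)]
R[0] = {"a1": 1.0, "a2": 1.0}
R[6] = {"a1": 10.0, "a2": 10.0}

pi_star, V_star = policy_iteration(P, R, gamma=0.5, actions=["a1", "a2"])
```

On this problem the loop converges in a handful of iterations, far fewer than evaluating all 128 policies.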
The next critical question, as Iris was bringing up,
is okay, why do we do this and is this a good idea?
So, when we look at this,
um let's look through this stuff a little bit more.
What we're going to get is we're going to get,
um this sort of interesting type of
policy improvements step and it's kind of involving a few different things.
So, I just want to highlight the subtlety of it.
So, what is happening here is that we compute this Q function, and then we've got:
max over a of Q^pi_i(s, a) has to be greater than or equal to R(s, pi_i(s))
plus the discounted sum of future rewards under pi_i,
the previous policy that we were using before.
[NOISE].
[NOISE].
So, what I've done there is I've said, okay,
the max action over the Q has to be
at least as good as following your old policy by definition,
because otherwise you could always pick the same policy as
before or else you're gonna pick a better action.
And that term there is
just exactly the definition of the value of your old policy.
So, that means that you're- the max over your Q function
has to be at least as good as the old value you had.
So, that's encouraging. But here's the weird part.
So, when we do this,
if we instead take arg max we're gonna get our new policy.
So, what is this doing? It's saying,
I'm computing this new Q function.
What does this Q function represent?
It represents, if I take an action and then I follow my old policy from then onwards.
And then I'm picking whatever action is maximizing that quantity for each state.
Okay. So, I'm gonna do this process for each state.
But then- so that's going to just define a new policy, right?
It might be the same, or it could be a,
a different policy than the one you had before.
Here's the weird thing.
So, this is saying that if you were to follow
that arg max A and then follow your old policy from then onwards,
you will be guaranteed to be doing better than you were before.
But the strange thing is that we're not gonna follow the old policy from then onwards.
We are going to follow this new policy for all time.
So, remember what we're doing is we're completely changing
our policy and then we're going to evaluate that new policy for all time steps,
not just for the first time step and then follow the old policy from then on.
So, it should be at least a little unclear that this is a good thing to do [LAUGHTER].
Should be like, okay, so you're,
you're saying that if I were to take
this one different action and then follow my old policy,
then I know that my value would be better than before.
But what you really want is that this new policy is just better overall.
And so the cool thing is that you can show that by doing
this policy improvement it is monotonically better than the old policy.
So, this is just saying it in words: we're saying,
you know, if we took the new policy for one action
and then followed pi_i forever, then we're guaranteed to be
at least as good as we were before in terms of our value function,
but our new proposed policy is just to always follow this new policy.
Okay. So, why do we get a monotonic improvement in
the policy value by doing this?
So, what- first of all what do I mean by a monotonic improvement?
Um, what I mean is that, um,
an improvement is monotonic if, um,
the value of the new policy is greater than or equal to the value of the old policy for all states.
So, it has to either have the same value or be better.
And my proposition is that the new policy is greater than or equal to
the old policy in all states with strict inequality if the old policy was suboptimal.
So, why does this work? So, it works for the following reasons.
Let's go ahead and just like walk through the proof briefly.
Okay. So, this is- what we've said here is that,
um, V^pi_i(s),
that's our old value of our policy.
So, this is like our old policy value.
Has to be less than or equal to max a of Q^pi_i(s, a).
And this is just by definition. Uh, let me write it like this.
This is equal to R(s, pi_{i+1}(s)) plus the discounted expected value of
V^pi_i at the next state, because remember, the way that we
define pi_{i+1}(s) is as the action that maximizes Q^pi_i(s, a).
Okay. So, this is gonna be by definition.
So, I've gotten rid of the max there.
Okay. So, this is going to be less than or equal to that same reward term at
the beginning, plus gamma times the expected max over a' of Q^pi_i(s', a').
Again by definition, because we know
that V^pi_i(s')
would also be less than or equal to max over a' of Q^pi_i(s', a').
Okay. So, we just made that substitution.
And then we can re-expand this part using the reward.
So, this is gonna be the max over a' of [R(s', a') plus dot-dot-dot],
basically making that substitution from that line into there.
So, I'm nesting it; I'm re-expanding the definition of Q^pi_i.
And if you keep doing this forever,
essentially we just keep pushing in as if we get to continue to
take pi_i+1 on all future time steps.
And the key thing to notice here is that every step is an inequality in the same direction.
So, if you nest this in completely, what you get is the value of pi_{i+1}.
So, there's kind of two key tricks in here.
The, the first thing is to say,
notice that V^pi_i is a lower bound on max over a of Q^pi_i(s, a).
And then to re-express that max using the definition of pi_{i+1}.
And then to re-upper-bound that V by the max over Q^pi_i and just keep re-expanding it.
And so you can do this out, and once you've
substituted it in for all actions using pi_{i+1},
then you've now defined what the value is of pi_{i+1}.
So, this is what allows us to know that the value of the new policy pi_{i+1} is by definition at least as good as the previous value function.
So, I'll just put that in there [inaudible]. All right.
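Written out as a single chain (using pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a), with each step justified as just described):

```latex
\begin{aligned}
V^{\pi_i}(s)
&\le \max_a Q^{\pi_i}(s,a) \\
&= R\bigl(s,\pi_{i+1}(s)\bigr)
   + \gamma \sum_{s'} P\bigl(s' \mid s,\pi_{i+1}(s)\bigr)\, V^{\pi_i}(s') \\
&\le R\bigl(s,\pi_{i+1}(s)\bigr)
   + \gamma \sum_{s'} P\bigl(s' \mid s,\pi_{i+1}(s)\bigr) \max_{a'} Q^{\pi_i}(s',a') \\
&= R\bigl(s,\pi_{i+1}(s)\bigr)
   + \gamma \sum_{s'} P\bigl(s' \mid s,\pi_{i+1}(s)\bigr)
     \Bigl[ R\bigl(s',\pi_{i+1}(s')\bigr)
   + \gamma \sum_{s''} P\bigl(s'' \mid s',\pi_{i+1}(s')\bigr)\, V^{\pi_i}(s'') \Bigr] \\
&\;\;\vdots \\
&= V^{\pi_{i+1}}(s).
\end{aligned}
```

Each equality uses the definition of pi_{i+1}; each inequality re-applies the bound V^{pi_i}(s') <= max_{a'} Q^{pi_i}(s', a'); and in the limit the nesting is exactly the value of following pi_{i+1} forever.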
So, the next questions that might come up is so we
know we're gonna get this monotonic improvement,
um, so the questions would be if the policy doesn't change, can it ever change again?
And is there a maximum number of iterations of policy iteration?
So, what do I mean by iterations?
Here iterations is i.
It's a kind of how many policies could we step through?
So, why don't we take like a minute and just think about this maybe talk to somebody
around you that you haven't met before
and just see what they think of these two questions.
So the policy is monotonically improving;
and is there a maximum number of iterations, as we discussed before?
[NOISE] Just in the interest of
time for today, because I want us to try to get through value iteration as well,
um, why doesn't- does somebody wanna give me, um,
a guess of whether or not the policy can ever- if the policy stops changing,
whether it can ever change again?
So, what I mean by that is-
the question here was to say,
if pi_{i+1} is equal to pi_i for all states,
could it ever change again?
Somebody wanna share a guess of whether or not that is true.
Once it has stopped changing it can never change again.
So, no. And the second question is, um,
is there a maximum number of policy iterations? Yeah.
There's no- you can't have more iterations than there are policies.
That's right. We know that there are at most |A| to the |S| policies.
You cannot ever repeat a policy,
um, because of this monotonic improvement.
And so there's a maximum number of iterations.
Okay? Great. And this just- um,
I'll skip through this now just so we can go through a bit of value iteration,
but this just steps through to show a little bit
more of how, once your policy stops changing,
essentially your Q^pi will be identical.
And so there are no policy improvements left to make.
After it's sort of converged, you're gonna stay there forever.
Okay, so policy iteration computes,
um, the optimal value and policy in one way.
The idea in policy iteration is you always have a policy,
um, whose value you know for the infinite horizon.
And then you incrementally try to improve it.
Value iteration is an alternative approach.
Value iteration instead says we're gonna think of computing
the optimal value if you get to act for a finite number of steps.
At the beginning just one step, and then two steps, and then three steps, et cetera.
Um, and you just keep iterating to longer and longer.
So that's different, right? Because policy iteration says you
always have a policy and you know what its value is.
It just might not be very good.
Value iteration says you always know what the optimal value and policy are,
but only if you're gonna get to act for, say, k time steps.
So they're just- they're computing different things,
um, and they both will converge to the same thing eventually.
So when we start to talk about value iteration,
it's useful to think about Bellman.
Um, so the Bellman equation and
Bellman backup operators are things that are often talked about in,
um, Markov Decision Processes and reinforcement learning.
So this constraint here that we've seen before,
which says that the value of a policy is
its immediate reward plus its discounted sum of future rewards,
um, is known as the Bellman equation.
The constraint for a Markov process, er,
Markov Decision Process is that it has to satisfy that.
And we can alternatively,
like what we were just seeing before,
think of this is as, um,
as a backup operator,
which means that we can apply it to
an old value function and transform it to a new value function.
So just like what we were doing in some of the, um, ah,
evaluation of a policy,
we can also just sort of do these operators.
In this case, the difference compared to what we've seen with
evaluation before is we're taking a max there.
We're taking this max over a of
the best immediate reward plus the discounted sum of future rewards.
So sometimes we'll use the notation of BV to mean a Bellman operator,
which means you take your old V and then you'd
plug it into here and you do this operation.
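As a concrete sketch of that BV operation (this is my own illustration, not code from the course; the function and array names are assumptions), the Bellman backup for a small tabular MDP might look like:

```python
import numpy as np

def bellman_backup(V, R, P, gamma):
    """One application of the Bellman optimality backup operator B.

    V: (S,) current value per state
    R: (S, A) expected immediate reward for action a in state s
    P: (A, S, S) transition model, P[a, s, s'] = p(s' | s, a)
    Returns BV, where (BV)(s) = max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) V(s') ].
    """
    # Expected next-state value for every (s, a) pair: shape (S, A)
    EV = np.tensordot(P, V, axes=([2], [0])).T
    Q = R + gamma * EV          # one-step lookahead value of each action
    return Q.max(axis=1)        # the max over actions in the Bellman equation
```

Applying `bellman_backup` once to any vector V gives the new value function BV the lecture refers to.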
So how does value iteration work?
The algorithm can be summarized as follows.
You start off, you can initialize your value function to zero for all states.
And then you loop until you converge, um,
or if you're doing a finite horizon,
which we might not have time to get to today, but,
um, I- then you'd go to that horizon.
And basically, for each state,
you do this Bellman backup operator.
So you'd say, my value at k plus one time steps for that state is: I get to pick
the best immediate action plus the discounted sum of
future rewards using that old value function I had from the previous time step.
And that V_k says what is my optimal value for
that state s prime given that I get to act for k more time steps.
So that's why initializing everything to zero is a reasonable thing to do,
if you want the result to be the optimal value
as if you had that many time steps to go:
if you have no more time steps to act, your value is zero.
The first backup you do will basically say what is
the optimal immediate action you should take if you only get to take one action.
And then after that you start backing up,
um, and continuing to say well,
what if I got to act for two time steps?
What if I got to act for three time steps?
What's the best sequence of decisions you could do in each of those cases?
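The loop just described can be sketched as follows (my own minimal illustration; the function name, tolerance, and stopping rule are assumptions, stopping when no state's value changes by more than `tol`):

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) V_k(s') ].

    R: (S, A) rewards, P: (A, S, S) transition probabilities.
    Returns the converged values and the greedy policy.
    """
    V = np.zeros(R.shape[0])            # V_0 = 0: the value with 0 steps to go
    while True:
        # One-step lookahead Q-values, then the Bellman backup (max over a)
        Q = R + gamma * np.tensordot(P, V, axes=([2], [0])).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and greedy policy
        V = V_new
```

Each pass through the loop extends the horizon by one step, exactly as in the "one step, then two steps, then three steps" description.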
Um, again just in terms of
Bellman operations if we think back to sort of what policy iteration is doing,
you can instantiate this Bellman operator by fixing what the policy is.
And so, if you see sort of a B with a pi superscript,
it's saying, well,
instead of taking that max over actions,
you're specifying what action you get to take under the policy.
So policy evaluation you can think of as basically just
computing a fixed point: repeatedly applying
this Bellman backup until V converges and stops changing.
So, um, in terms of policy iteration,
this is very similar to what we saw before: you can think of it
in terms of these Bellman operators and doing this argmax.
I wanna see if we can get to a little bit on the contraction property.
So this is what, um, value iteration does.
It's very similar to policy iteration and evaluation.
Um, let me talk a little bit about the contraction aspect.
So, let O be an operator, and let |x| denote a norm of x.
So x could be a vector, like a value function, and then we could look at,
like, an L2 norm or an L1 norm or an L-infinity norm.
So, if an operator is
a contraction, it means that if you apply it to two different things,
you can think of these as value functions, um,
then the distance between them shrinks,
or at least is no bigger, after you
apply the operator compared to their distance before.
So just to, um- actually,
I'll, I'll save examples for later.
Feel free to come up to me after class if you wanna see
an example of this, um, or I can do it on Piazza.
But this is the formal definition of what it means to be a contraction:
the distance between,
in this case we're gonna think about it as two vectors, um,
doesn't get bigger, and can shrink, after you apply this operator.
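As one quick example of the kind offered after class (this is my own numerical check on a randomly generated MDP, not from the lecture), you can verify the contraction property empirically in the infinity norm:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
R = rng.normal(size=(S, A))              # arbitrary rewards
P = rng.random(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)        # rows become valid distributions

def backup(V):
    """Bellman optimality backup for this random MDP."""
    return (R + gamma * np.tensordot(P, V, axes=([2], [0])).T).max(axis=1)

V1, V2 = rng.normal(size=S), rng.normal(size=S)   # two arbitrary value vectors
before = np.max(np.abs(V1 - V2))                  # ||V1 - V2||_inf
after = np.max(np.abs(backup(V1) - backup(V2)))   # ||B V1 - B V2||_inf
```

However the random MDP comes out, `after` is at most `gamma * before`, which is exactly the contraction statement.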
So, the key reason that value iteration
converges is that the Bellman backup is a contraction operator.
And it's a contraction operator as long as gamma is less than one.
Which means that if, let's say, you have
two different value functions and you do the Bellman backup on both of them,
then the distance between them shrinks.
So how do we prove this?
Um, in the interest of time I'll just show you the proof briefly.
Again, I'm happy to go through it in more detail,
um, or we can go through it in office hours, et cetera.
So the idea, to prove that the Bellman backup is a contraction operator,
is we consider two different value functions, V_k and V_j.
These don't have to have anything to do with value iteration.
They're just two different value functions.
One could be, you know, 1, 3, 7, 2 and the other one could be 5, 6, 9, 8.
Okay. So we just have two different vectors of value functions and then we
re-express what they are after we apply the Bellman backup operator.
So there's that max a,
the immediate reward plus the discounted sum of
future rewards where we've plugged in our two different value functions.
And then what we say there is, well,
if you get to pick that max over a separately for those two,
the distance between them is upper
bounded by what you get if you instead
maximize the difference itself, putting a single max over a outside.
And then you can cancel the rewards.
So that's what happens in the third line.
And then the next thing we can do is bound that and say
the difference between these two value functions is,
um, bounded by the maximum distance between them over states.
So you can pick the places at which those value functions most differ.
And then you can move it out of the sum.
And now you're summing over a probability distribution that has to sum to one.
And that gives you this. And so that means that
the Bellman backup, as long as gamma is less than one, has to be a contraction operator.
The distance between the two value functions can't be
larger after you apply the Bellman operator than it was before.
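Written out, the chain of steps just described (with $V_k$ and $V_j$ the two value functions, and the norm taken to be the infinity norm) is:

```latex
\begin{aligned}
\|BV_k - BV_j\|_\infty
&= \max_s \Big| \max_a \big[ R(s,a) + \gamma \textstyle\sum_{s'} p(s'|s,a)\, V_k(s') \big]
   - \max_{a'} \big[ R(s,a') + \gamma \textstyle\sum_{s'} p(s'|s,a')\, V_j(s') \big] \Big| \\
&\le \max_{s,a} \Big| R(s,a) + \gamma \textstyle\sum_{s'} p(s'|s,a)\, V_k(s')
   - R(s,a) - \gamma \textstyle\sum_{s'} p(s'|s,a)\, V_j(s') \Big| \\
&= \gamma \max_{s,a} \Big| \textstyle\sum_{s'} p(s'|s,a)\,\big( V_k(s') - V_j(s') \big) \Big|
   \qquad \text{(rewards cancel)} \\
&\le \gamma \max_{s,a} \textstyle\sum_{s'} p(s'|s,a)\, \|V_k - V_j\|_\infty
   \qquad \text{(bound by the largest per-state gap)} \\
&= \gamma\, \|V_k - V_j\|_\infty
   \qquad \text{(the probabilities sum to one)}
\end{aligned}
```

So for $\gamma < 1$ the backup is a $\gamma$-contraction in the infinity norm.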
So, I think a good exercise to do, um,
is to then say given that it's a contraction operator,
um, that means it has to converge to a fixed point.
There has to be a unique solution.
So if you apply the Bellman operator repeatedly,
there is a single fixed point that you will converge to,
which is a single, um, vector of values.
It's also good to think about whether the initialization and values impacts
anything if you only care about the result after it's converged.
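As a small check on that exercise (again my own illustration with a randomly generated MDP, not from the lecture), two very different initializations end up at the same fixed point:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 2, 0.8
R = rng.normal(size=(S, A))
P = rng.random(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)    # normalize into transition distributions

def backup(V):
    """Bellman optimality backup for this random MDP."""
    return (R + gamma * np.tensordot(P, V, axes=([2], [0])).T).max(axis=1)

# One initialization at zero, one far away from it
V_zero, V_far = np.zeros(S), 100.0 * rng.normal(size=S)
for _ in range(500):                 # the gap shrinks by a factor gamma per step
    V_zero, V_far = backup(V_zero), backup(V_far)
```

Because the backup is a contraction, the gap between the two iterates decays geometrically, so after enough iterations the initialization is irrelevant to the converged result.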
All right. So, um, I think we can halt there.
Class is basically over.
There's a little bit more in the slides to talk about, um,
the finite horizon case, um,
and feel free to reach out to us on Piazza with any questions. Thanks. [NOISE]
