Welcome back, everyone. I hope you had a good Thanksgiving.
Um, I actually didn't ask, I'm not sure why this chair is here.
All right. Let's get rid of this.
Um, by the way, not sure- um, thanks, Anand.
I'm not sure if you guys are following the news, but in,
in reinforcement learning, we chat a lot about robotics, right?
And one of the, you know, uh,
constant problems a lot of people use reinforcement learning to solve is robotics and,
um, I think, ah, uh, back in May,
um, the InSight Mars lander had launched from, um,
here in California and it's about to make an attempt at landing
on the planet Mars in the next 2.5 hours or so,
so excited about that, uh,
I think that is actually one of the grandest,
um, applications of robotics because, you know,
with a 20-minute light-speed delay from Earth to Mars,
you know, once it starts its landing,
there is nothing anyone on Earth can do and so I think that's
one of the most exciting applications of autonomous robotics.
When you launch this thing, it's now about 20,
20 light minutes away from planet Earth,
so you actually can't control it in real time,
uh, and you just have to hope like crazy that your software
works well enough for it to land on this planet, you know.
Uh, and so we will find out a little bit
after noon if the landing happened successfully or not.
I, I think, um, so I,
I just get excited about stuff like this,
I, I hope you guys do too.
And for those of you that are from California, I mean,
take some pride that it launched from the home state of California and,
and is now nearing its,
er, landing on Mars.
Okay, um, all right.
So, um, what I wanna do today is,
uh, continue our discussion on reinforcement learning.
Do a quick recap of the MDP or the Markov decision process framework.
Um, and then we'll start to talk about algorithms for solving MDPs.
In particular, we need to define, uh,
something called the value function which tells you how good it is to be
in different states of the MDP and then, um,
we'll define the value function and then talk about an algorithm
called value iteration for computing the value function
and this will help us figure out how to actually
find a good controller or find a good policy for an MDP,
and then we'll wrap up with learning state transition probabilities
and how to put all of these
together into an actual reinforcement learning algorithm that you can implement.
Um, to recap, um,
our motivating example- running example from the last time,
from before Thanksgiving was,
uh, this 11-state MDP.
And we said that an MDP comprises a five-tuple,
a list of five things, starting with, er, the states.
So that example had 11 states.
Um, actions, and in this example the actions were the compass directions;
North, South, East, and West,
I can try to go in each of the four compass directions.
Then the state transition probabilities, and in the example,
if the robot attempts to go North,
it has a 0.8 chance of heading North, a 0.1
chance of veering off to the left, and a 0.1 chance of veering off to the right.
Um, Gamma is a number
usually slightly less than 1,
this is the discount factor, think of this as 0.99,
um and R is the reward function that helps us specify where we want the robot to end up.
Um, and so what we said last time was that, um,
the way an MDP works is you start off in some state S_0,
um, this one's much better, you choose an action, uh,
a_0, and as a result of that,
it transitions to a new state, S_1,
which is drawn according to P_s_0 a_0.
Um, and then you choose a new action a_1 and as a result
the MDP transitions to some new state S_2, drawn according to P_s_1 a_1,
um, and the total payoff is the discounted sum of rewards, right?
Um, and the goal is to come up with a way, um,
and formally the goal is to come up with a policy, Pi,
which is a mapping from the states to the actions, uh,
that will tell you how to choose actions from whatever state you are in so that
the policy maximizes the expected value of the total payoff, okay?
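Written out as a sketch in the usual notation (the board is not captured in the transcript, so the symbols below are the standard ones, with the payoff discounted by Gamma as in the recap), the setup just described is:

```latex
s_0 \xrightarrow{\;a_0\;} s_1 \xrightarrow{\;a_1\;} s_2 \xrightarrow{\;a_2\;} \cdots,
\qquad s_{t+1} \sim P_{s_t a_t}

\text{total payoff} \;=\; R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots

\pi : S \to A, \qquad
\text{choose } \pi \text{ to maximize } \;
\mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \right]
```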
Um, and so I think last time I,
I kinda claimed that this is the optimal policy for this MDP, right?
Um, and what this means for example is,
if you look at this state, um,
this policy is telling you that Pi of 3, 1 equals,
uh, West, I guess,
or you can write West or left, well,
what do you call that left arrow, right,
where from this state, um,
from the state 3,1, you know,
the best action to take is to go left, it's to go West.
And so if you're executing this policy what that means is that, um,
on every step the action you choose would be, you know,
Pi, right, of the,
the state that you're in, okay?
So, um, what I'd like to do now is
define the value function.
So, how, how, how, how did I come up with this, right?
Well, what I'd like to do is, have you,
um, learn given an MDP,
given this five tuple,
how do you compute the optimal policy?
And one of the challenges with, um,
finding the optimal policy is that, you know,
there's a- there's an exponentially large number of possible policies, right?
If you have 11 states and four actions per state,
the number of possible policies is, er,
4 to the power of 11, which is not that big because this is a small MDP, right?
But the number of possible policies
for an MDP is combinatorially large, it's
the number of actions to the power of the number of states.
So how do you find the best policy?
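To get a feel for that count, here is a tiny sketch; the function name is only illustrative. A deterministic policy picks one action for each state, so the number of policies is the number of actions to the power of the number of states.

```python
def count_policies(num_states: int, num_actions: int) -> int:
    """Number of deterministic policies: one action choice per state."""
    return num_actions ** num_states


small = count_policies(num_states=11, num_actions=4)
print(small)  # 4194304 policies for the 11-state example

# A hypothetical 100-state MDP with the same four actions is already
# far beyond anything you could enumerate one policy at a time.
large = count_policies(num_states=100, num_actions=4)
print(len(str(large)))  # number of decimal digits in 4**100
```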
Okay. So what you learn today is,
um, how to compute the optimal policy.
Now, in order to develop an algorithm for computing an optimal policy,
um, we'll need to define three things.
So just as a roadmap.
Um, what I'm about to do is define V_Pi,
V_star, and Pi_star, okay?
Um, and based on these definitions
we will derive that Pi_star is the optimal policy, okay?
But so let's, let's go through these few definitions.
Um, first V_Pi.
So for a policy Pi,
V_Pi is a function mapping from states to the reals, uh,
[NOISE] such that V_Pi of S is the expected total payoff,
um, for starting in state S and executing Pi.
And so sometimes we write this as V_Pi of S is
the expected total payoff
given that you execute the policy Pi and the initial state,
S_0 is equal to S, okay?
So the definition of V_Pi,
this is called the, um,
value function for a policy.
Well, this is called the value function
[NOISE]
for the policy Pi, okay?
Um, and so what the value function for a policy Pi, denoted v_Pi, does
is it tells you, for any state you might start in,
and it's a function mapping from states to the reals, right,
for any state you might start in, what's
your expected total payoff if you start off your robot in that state
and if you execute the policy Pi.
And execute the policy Pi means take actions according to the policy Pi.
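In symbols, the definition just given reads as follows. This is a sketch in the usual notation, where conditioning on Pi is shorthand for taking the action a_t = Pi(s_t) at every step.

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \;\middle|\; \pi,\; s_0 = s \right]
```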
Right? So here's a, here's a specific example.
Um, this policy.
So let's consider the follo- following policy Pi, right.
Um,
[NOISE]
so this is not a great policy.
You know, from some of these states,
it looks like it's heading for the minus 1 reward. Oh, sorry,
so remember, the reward was plus 1 here and minus 1 here.
And secondly, each of these is called an absorbing state,
meaning that if you ever get to the plus 1 or the minus 1,
then the world ends and there are no more rewards or penalties after that.
Right? So this is actually not a very good policy,
but a policy is any function mapping from the states to the actions.
So this is one policy that says, uh,
in this state, you know,
this policy tells you, in the state 4, 1, go north,
which is actually a pretty bad thing to do, right, it takes you to the minus 1 reward.
So this is not a great policy,
um, but, but this is just a policy.
And v_Pi for this policy,
um, looks like this.
Okay. Um, don't worry too much about the specific numbers.
But you've- if you look at this policy,
you see that from this set of states it's
pretty efficient at getting you to the really bad reward,
and from this set of states it's pretty efficient at getting you to the good reward right,
with some mixing because of the noise in the robot veering off to the side.
And so, you know,
these numbers are all negative.
And those numbers are at least somewhat positive.
Right. So v_Pi is just,
um, if you start from, say, this state,
from the state 1, 1, then in expectation
your sum of discounted rewards will be negative 0.88.
Okay? Um, so that's what v_Pi is.
Right. Now, um,
the following equation
governs the value function.
It's called Bellman's equation.
Um, and this says
that your expected payoff
at a given state is the immediate reward that you receive plus the discount factor
times the expected future reward.
So let me, let me actually explain,
um, the intuition behind this, right?
Which is that, um,
let's say you start off at some state s_0, right?
So and again, let's,
let's say s is equal to s_0.
So v_Pi of s is equal to,
well, just for your robot waking up in that state, and I'm going to add to that in a second, okay?
But just for the fact that your robot woke up
in this state s,
you get a reward R of s_0 right away.
This is also called the immediate reward.
[NOISE] Right.
Uh, because, you know,
just for the good fortune or bad fortune of starting off in this state,
the robot gets a reward right away.
And then it will take some action and get to some new state s_1,
where it will receive, you know,
Gamma times the reward of s_1.
And then [NOISE]. Right.
And then it will get some future reward at the next step and so on.
Um, and just to flesh out the definition,
the value function v_Pi is really the expected value of this,
given that you execute the policy Pi and
that s_0 equals s, right, that you start off in this state s_0.
Now, what I'm going to do is rewrite this part of the equation little bit.
I'm going to factor out.
I'm just going to take the rest of this and factor out one factor of Gamma.
So let me put parentheses around this,
right, and just take out Gamma there.
Okay. So previously this was Gamma squared, right?
But by adding the parentheses here,
I'm just taking out one factor of Gamma
that multiplies the rest of that equation,
okay? Does that make sense?
No? So Gamma R of s_1 plus Gamma squared R of s_2,
plus dot, dot, dot equals Gamma times, in parentheses, R of s_1 plus Gamma R of s_2 plus dot, dot, dot.
Okay. So that's what I did down there, right,
just factor out one factor of Gamma.
And so, um, this is the,
the value of state s is the immediate reward,
plus Gamma times the expected future rewards.
Right? So this, the expected value of this
is really v_Pi of s_1.
Right. So this- and,
and so the second term here, this,
this is the expected future rewards, right?
So Bellman's equation says that,
um, the value of a state,
the value- the expected total payoff you get if
your robot wakes up in a state s is the immediate reward plus Gamma,
times the expected future rewards.
Okay. Right. And this thing
above the curly brace is really
asking, if your robot wakes up at the state s_1
and executes Pi, what is the expected total payoff, right?
When your robot wakes up in state s_1, it'll take an action, get to s_2,
take an action, get to s_3,
and this is the sum of discounted rewards if the robot starts off in the state s_1.
Okay. Makes sense?
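The factoring step on the board can be sketched like this. The overbrace label is my paraphrase of the spoken explanation, not a transcription of the board.

```latex
R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots
\;=\;
R(s_0) + \gamma\, \overbrace{\bigl( R(s_1) + \gamma R(s_2) + \cdots \bigr)}^{\text{expected value is } V^{\pi}(s_1)}
```

Taking expectations on both sides, given that you execute Pi and s_0 = s, is what turns the left-hand side into V_Pi of s.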
So, um,
based on this,
you can write out and justify Bellman's equation.
[NOISE].
All right. The mapping from the equation on top to the equation at the bottom is that
S maps to S_0 and S prime maps to S_1, right?
Um, and so we have that, makes sense? [BACKGROUND]. So the value of
state S is
R of S plus Gamma times the expected value of V_Pi of S prime,
where this is really S_0 and this is S_1.
Uh, and and in, in the notation of MDP,
if you want to write a long sequence of states,
we tend to use S_0,
S_1, S_2, S_3, and S_4,
and so on, but if you have, want to look at
just the current state and the state you'd get to after one time step,
we tend to use S and S prime for that.
So that's why there's this mapping between these two pieces of notation.
Uh, so S prime is the state you get to after one step.
Well, let's see, what is S prime drawn from, right?
The state S prime, or S_1, is the state you get to after one time step.
So what is the distribution that S prime is drawn from?
S prime is drawn from P of what?
S.
Okay, P of S, and then?
Pi of S.
Pi of S, pretty cool. Does that make sense?
Because, um, in state S,
you will take action a equals Pi of s, right.
So we're executing the policy Pi.
So that means that when you're in a state S,
you're gonna take the action a given by Pi of S,
because Pi of S tells you,
please take this action a when you're in state S. And so, um,
S prime is drawn from P of Sa,
where a is equal to Pi of S, right?
Because that's the action you took,
which is why S prime,
the state you get to after one time step,
is drawn from the distribution P of S, Pi of S, okay?
Wow, that pen really left a mark.
So putting all that together,
let me just write out Bellman's equation again, which is
V_Pi of S equals R of S plus
the discount factor times the expected value of V_Pi of S prime.
And so this term here is just the sum over
S prime of P sub S, Pi of S, of S prime, times V_Pi of S prime.
So that underlined term I guess is just this underlined term here, okay?
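For reference, here is Bellman's equation for a fixed policy in the usual notation, a sketch consistent with the spoken version, with the expectation over S prime written out as a sum:

```latex
V^{\pi}(s) \;=\; R(s) + \gamma\, \mathbb{E}_{s' \sim P_{s\pi(s)}}\!\left[ V^{\pi}(s') \right]
\;=\; R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\, V^{\pi}(s')
```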
Um, now, notice that this gives you
a linear system of equations for actually solving for the value function.
Um, so let's say I give you a policy, right?
It could be a good policy, could be a bad policy,
and you want to solve for V_Pi of S. What this does is,
if you think of V_Pi of S as the unknown you're trying to solve for, given Pi,
right, these equations [NOISE],
um,
Bellman's equations,
define a linear system of equations
in terms of V_Pi of S as the values to be solved for.
So here's a specific example.
Um, let's take the state 3, 1,
right, so this is the state 3, 1, okay.
What Bellman's equation tells us is,
V_Pi of the state 3, 1 is
equal to the immediate reward you get at the state 3,1,
plus the discount factor times,
well, the sum over S prime of P sub S, Pi of S, of S prime, times V_Pi of S prime, right?
So let's
say that Pi of 3, 1 is north, right?
So let's say you try to go north.
If you try to go north from this state,
then you have a 0.8 chance of getting to 3, 2, so that's 0.8 times V_Pi of 3, 2,
plus 0.1 times V_Pi of 2, 1, for the 0.1 chance of veering
left, plus 0.1 times V_Pi of 4, 1, for the 0.1 chance of veering right.
Um, let me just close out that parenthesis, okay.
So that's what Bellman's equation says about these values.
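Spelled out for this one state, the equation would look like the sketch below. It assumes the (column, row) coordinates used in the lecture and that veering left or right while heading north from (3, 1) lands in (2, 1) or (4, 1); the spoken version only names (3, 2) explicitly.

```latex
V^{\pi}\bigl((3,1)\bigr) \;=\; R\bigl((3,1)\bigr)
+ \gamma \Bigl[\, 0.8\, V^{\pi}\bigl((3,2)\bigr)
+ 0.1\, V^{\pi}\bigl((2,1)\bigr)
+ 0.1\, V^{\pi}\bigl((4,1)\bigr) \Bigr]
```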
All right, and if your goal is to solve for the value function,
then these things I'm just circling in purple are the unknown variables [NOISE] okay?
And, um, if you have 11 states,
uh, like in our MDP,
then this gives you a system of 11 linear equations with 11 unknowns.
Um, uh, and so using sort of a linear algebra solver,
you could solve explicitly for the value of these 11 unknowns. Does that make sense?
Okay. So the way you would- so let's say I give you a policy Pi,
you know, any policy Pi.
Um, the way you can solve for the value function is,
create an, an 11 dimensional vector, um,
with V_Pi of, you know, 1, 1,
V_Pi of 1, 2 and so on,
down to V_Pi of whatever the last state is.
You have 11 states, so V_Pi of 4, 3, right?
So if you want to,
er, solve for those, um,
11 numbers I wrote up just, uh,
in terms of defining V_Pi, what you can do is,
I'll give you a policy Pi,
you can then construct an 11 dimensional vector,
you know, 11 dimensional vector of unknown values that you want to solve for.
And Bellman's equations,
um, for each of the 11 states you could plug in on the left-hand side,
give you one equation for how one of the values is
determined as a linear function of a few of the other values in this vector, okay?
And so, um, what this does is it sets up
a linear system of equations with 11 equations and 11 unknowns, right?
And using a linear algebra solver, you,
you will be able to solve this linear system of equations. Does that make sense?
Okay. Um, all right.
And so this works so long as you have a discrete-
If you have 11 states, you know,
it takes almost no time,
right, on a computer to solve a linear system of 11 equations.
So that's how you would actually get those values,
if you're ever called on to solve for V_Pi, okay?
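Here is a minimal sketch of that solve. To stay self-contained it uses a small hand-written Gaussian elimination as a stand-in for a linear algebra library, and a hypothetical 3-state chain rather than the lecture's grid (the policy drawn on the board is not in the transcript). The names `P_pi`, `evaluate_policy`, and `solve_linear_system` are illustrative.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_policy(P_pi, R, gamma):
    """Solve V = R + gamma * P_pi V, rewritten as (I - gamma * P_pi) V = R."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear_system(A, R)


# Hypothetical 3-state chain under some fixed policy:
# P_pi[s][t] is the probability of moving from s to t when following it.
# State 2 is absorbing, encoded here as a row of zeros.
P_pi = [
    [0.2, 0.8, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.0, 0.0],
]
R = [-0.02, -0.02, 1.0]
gamma = 0.99

V_pi = evaluate_policy(P_pi, R, gamma)
print([round(v, 3) for v in V_pi])
```

With one equation per state, the same two functions would apply unchanged to an 11-by-11 system.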
[NOISE] Actually, the, the- did what I just say make sense?
Raise your hand if what I just explained made sense.
Okay, good, awesome, great.
All right, good.
So moving on our roadmap,
um, we've defined V_Pi,
let's now define V_star.
Um, so [NOISE].
So V star is the optimal value function.
And we'll define it as V star of S
equals max over all policies Pi of V Pi of S. Okay.
Um, one of the I don't know, slightly confusing things about
reinforcement learning terminology is that there are two types of value function.
There's value function for a given policy
Pi and there is the optimal value function V star.
So both of these are called value functions,
but one is a value function for a specific policy,
could be a great policy, could be a terrible policy, can be the optimal policy.
The other is V star which is the optimal- optimal value function.
So V star is defined as,
um, look at the value for, you know,
look across all of the possible policies you could have, all, um, 4 to the power of 11 of them,
over all the combinatorially large number of possible policies for this MDP.
And V star of S is,
well, let's just take the max
over all the policies
anyone could implement. Of all the possible policies,
let's take the value of the best possible policy for that state, so that's V star.
Okay. And that's the optimal- optimal,
um, optimal value function.
And it turns out that, um,
there is a different version of Bellman's equations for this.
And again, there's a Bellman's equation for V_Pi,
for value of a policy.
And then there's a different version of
Bellman's equations for the optimal value function, right?
So just as there are two versions of value functions,
there are two versions of Bellman's equations.
But let me just write this out and hopefully this will make sense.
Um, actually let's think this through.
So let's say you start off your robot in a state S,
what is the best possible expected sum of discounted rewards?
What's the best possible payoff you could get, right?
Well, ah, just for the privilege of waking up in state S,
the robot will receive an immediate reward R of S, all right?
And then it has to take some action and after taking some action,
it will get to some other state S prime.
Um, you know, and after getting to that state S prime
it will receive, right, future expected rewards V star of S prime,
and we have to discount that by Gamma, right?
So, sorry. So well,
the state S prime was arrived at by [NOISE] taking
some action a from the initial state.
Um, and so whatever the action is you know,
for- if, if you take action a, right?
Okay, um, so if you take an action a in the state S,
then your total payoff will be- expected total payoff will be the immediate reward
plus Gamma times the expected value of the future payoff.
But what is the action a that we should plug in here?
Right. Well, the optimal action to take in the MDP is
whatever action maximizes your expected total payoff,
maximizes the expected sum of rewards which is why
the action you want to plug in is just whatever action a maximizes that.
Okay. So this is Bellman's equation for the optimal value function,
which says that
the best possible expected total payoff you could receive
starting from state S is the immediate reward R of S,
plus the max over all possible actions a of Gamma times
the expected future payoff from taking that action, okay?
So this is the expected future payoff,
or expected future reward, okay.
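As a sketch in the usual notation, Bellman's equation for the optimal value function is:

```latex
V^{*}(s) \;=\; R(s) + \max_{a \in A}\; \gamma \sum_{s'} P_{sa}(s')\, V^{*}(s')
```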
Um, now based on the argument we just went through,
um, this allows us to figure out how to
compute Pi star of S as well, right?
Which is, um,
let's say we have a way of computing V star of S, which we don't yet.
But let's say I tell you what V star of S is,
and then I ask you, you know,
what is the action you should take in a given state?
So remember,
Pi star is going to be the optimal policy, right?
And so, um, what should Pi star of S be, right?
Let's say we've computed V star.
Um, and now I ask you,
"Hey, my robot's in state S,
what is the best action I should take from the state S?"
Then how do I decide what action to take in the state S?
What do you think is the best action to take from the state?
And the answer is almost given in the equation above, yeah.
[inaudible].
Yeah, cool. Awesome, right.
So the best action to take in state S,
and best means maximizing the expected total payoff,
well, the action that maximizes your expected total payoff is, you know,
whatever action a we were choosing up here.
And so it's just the argmax over a of that.
And because Gamma is just a constant
that doesn't affect the argmax,
usually we just eliminate it, since it's just a positive number, right?
So this gives us the strategy we will use for finding, um,
the optimal policy for an MDP, which is, um,
we're going to find a way to compute V star of S,
which we don't have a way of doing yet, right?
V star was defined as a max over a combinatorially or exponentially large number of policies.
So we don't have a way of computing V star yet.
But if we can find a way to compute V star,
then, you know,
sorry, let me just scratch this out,
using this equation gives you a way, for every state S,
to pretty efficiently compute this argmax, um,
and therefore figure out what is the optimal action for every state, okay?
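The argmax rule being described can be sketched as follows, with Gamma dropped since a positive constant does not change which action attains the maximum:

```latex
\pi^{*}(s) \;=\; \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s')
```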
[NOISE].
All right, um.
So all right. So just to practice with confusing notation.
All right, let's see if you understand this equation.
I'm, I'm just claiming this. I'm not proving this.
But for every state S, V star of S equals V_Pi star of S,
which is greater than or equal to V_Pi of S, all right?
For every policy Pi and every state S, okay?
So ho- hope this equation makes sense.
Ah, this is what I'm claiming. I didn't prove this.
What I'm claiming is that, um,
the optimal value for state S is- this is the optimal value function on the left.
This is the value function for Pi star.
So this is- this is the optimal value function.
This is the value function for a specific policy Pi,
where the policy Pi happens to be Pi star.
And so what I'm writing here is that, um,
the optimal value for state S is equal to
the value function for Pi star applied to the state S,
and is greater than or equal to V_Pi of S for any other policy Pi, okay?
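In symbols, the claim (stated here without proof, as in the lecture) is:

```latex
V^{*}(s) \;=\; V^{\pi^{*}}(s) \;\ge\; V^{\pi}(s)
\qquad \text{for every policy } \pi \text{ and every state } s
```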
Right. All right. So, um,
the strategy you can use for finding the optimal policy is: one, find V star.
Two, you know, use
the argmax equation to
find Pi star, okay?
And so what we're going to do is- well, step two, right?
We, we know how to do from the argmax equation.
So what we're gonna do is talk about an algorithm for actually
computing V star because if you can compute V star,
then this equation helps- allows you to pretty quickly find the optimal, um,
action for every state [NOISE]. So, um.
So value iteration is,
ah, is an algorithm you can use to,
um, to find V star.
So let me just write out the algorithm, um.
So this is um-
Okay? So in the value iteration algorithm,
you initialize the estimated value of every state to 0,
and then you update these estimated values using Bellman's equation.
And this is the, uh, optimal value function,
the V star version of Bellman's equations, right?
And, um,
[NOISE] so to be concrete about how you implement this,
you know, if you're implementing this, right?
If you are implementing this in Python, um,
what you would do is create
a 11 dimensional vector to store all the values of V of S. So you create a,
you know, 11 dimensional vector, right?
That, that represent V of 1, 1,
V of 1, 2, you know,
down to V of 4, 3, right?
So this is, um, 11 dimensional vector corresponding to the 11 states.
Um, [NOISE] oh, I'm sorry I shou - wait did I say 11?
We got 10 states in the MDP, don't we? Wait.
Yes, we have 10 states. We've been saying 11 all long?
Sorry. Okay, 10.
Um, uh, yeah, uh, wait.
[inaudible].
11?
[inaudible].
Oh, Yes. You're right. Sorry. Yes, 11.
Okay. Sorry. Yes, 11 states. Okay, It's all right.
Right. So 11 states MDP so you create an initial, ah,
create an 11 dimensional vector um,
and initialize all of these values to 0.
And then you will repeatedly update, um,
the estimated value of every state according to Bellman's equations, right?
Um, and so there
are actually two ways to interpret this,
and it's similar to
gradient descent, right?
We've written out, you know,
a gradient descent rule for updating
the vector of parameters Theta.
And what you do is you update all of the components of Theta simultaneously, right?
And so that's called a synchronous update in gradient descent.
So the way you would
carry out this update in what's called a synchronous update
would be to compute the right-hand side for
all 11 states and then overwrite all 11 values at the same time.
And then you compute all 11 values for
the right-hand side again and then you simultaneously update all 11 values, okay?
Um, the alternative would be an asynchronous update.
In an asynchronous update,
what you do is you compute v of 1, 1, right?
And the value of v of 1, 1 depends on some of
the other values on the right-hand side, right?
But in the asynchronous update,
you compute v of 1, 1 and then you overwrite this value first.
And then you use that equation to compute v of 1, 2.
And then you update this, and then you update these one at a time.
And the difference between synchronous and asynchronous is, um, you know,
if you're using asynchronous updates, by the time you're computing V
of 4, 3, which depends on some of the earlier values,
you'd be using new and refreshed values for some of the earlier states on your list, okay?
Um, it turns out that
value iteration works fine with either synchronous updates or asynchronous updates.
But
because it vectorizes better,
because you can use more efficient matrix operations,
most people use a synchronous update, but it turns out that the algorithm will
work whether you're using a synchronous or an asynchronous update.
So I guess unless
otherwise
stated, you should usually assume that
when I talk about value iteration,
I'm referring to a synchronous update where you compute all the values,
all 11 values, using the old estimates, and then update all 11 values at the same time, okay?
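Here is a minimal pure-Python sketch of value iteration on the 11-state grid, with both update styles. Several things are assumptions rather than facts from this part of the lecture: the layout (a wall at (2, 2), +1 at (4, 3), -1 at (4, 2)), a reward of -0.02 in every other state, absorbing states modeled by giving them no outgoing transitions, and all function and variable names.

```python
GAMMA = 0.99          # discount factor used in the lecture
STEP_REWARD = -0.02   # assumption: reward in every non-absorbing state
WALL = (2, 2)         # assumption: the blocked cell, in (column, row) coordinates
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Directions the robot can veer into: (left, right) of the attempted heading.
SIDEWAYS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def step(state, direction):
    """Deterministic move; hitting the wall or the border leaves you in place."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transitions(state, action):
    """Return {next_state: probability}; absorbing states have no transitions."""
    if state in TERMINALS:
        return {}
    left, right = SIDEWAYS[action]
    probs = {}
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def reward(state):
    return TERMINALS.get(state, STEP_REWARD)


def expected_next_value(V, state, action):
    return sum(p * V[s2] for s2, p in transitions(state, action).items())


def backup(V, state):
    """Right-hand side of Bellman's equation for V*, at a single state."""
    best = max(expected_next_value(V, state, a) for a in MOVES)
    return reward(state) + GAMMA * best


def value_iteration(synchronous=True, iterations=1000):
    V = {s: 0.0 for s in STATES}  # initialize every estimate to 0
    for _ in range(iterations):
        if synchronous:
            # Compute all 11 new values from the old vector, then swap.
            V = {s: backup(V, s) for s in STATES}
        else:
            # Overwrite one state at a time, reusing refreshed values.
            for s in STATES:
                V[s] = backup(V, s)
    return V


def greedy_action(V, state):
    """The argmax step: the action with the highest expected next value."""
    return max(MOVES, key=lambda a: expected_next_value(V, state, a))


V_sync = value_iteration(synchronous=True)
V_async = value_iteration(synchronous=False)
print(round(V_sync[(3, 1)], 2), greedy_action(V_sync, (3, 1)))
```

Under these assumptions both update orders settle on the same values, which is the point being made above; the synchronous branch is the one that maps naturally onto vectorized matrix operations.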
Was there a question just now, someone had, yeah.
[inaudible]
Yeah,
yes.
So I think,
uh, yes.
So how do you represent the absorbing state,
the sink state? You get to the plus 1 or minus 1 and then the world ends.
Um, in this framework one way to code that up would be to say that
the state transition probabilities from that state to any other state are 0.
That is one way that will work.
Uh, another way, which is
done less often,
maybe mathematically a bit cleaner but not how people tend to do this,
would be to take your
11-state MDP and then create a 12th state,
and the 12th state always goes back to itself with no further rewards.
So both of these will give you the same result.
Mathematically, it's a bit more convenient to just set, you know,
P of Sa S prime equals 0 for all other states.
It's not [inaudible] probably but that,
that will give you the right answer as well.
Yeah. All right.
Cool. Um, so just as a point of notation,
if you're using synchronous updates,
you can think of this as
taking the old value function,
the old estimate, right,
and using it to compute the new estimate, right?
So, you know,
assuming the synchronous update,
you have some
previous 11-dimensional vector with
your estimates of the value from the previous iteration.
And after doing one iteration of this,
you have a new set of estimates.
So one step of this algorithm is sometimes called the Bellman backup operator,
where you update V equals B of V, right?
Where now V is
an 11-dimensional vector.
So you have an old 11-dimensional vector,
apply the Bellman backup operator, which is
just that equation there, and update V according to B of V. Um,
and so one thing that you'll see in the
problem set
is showing that
this will make V of S converge to V star, okay?
So it turns out that
you can prove, and you'll see more details of this in the problem set,
that by repeatedly enforcing Bellman's
equations, this
algorithm will cause your vector of 11 values, will cause
V, to converge to the optimal value function V star, okay?
Um, and you'll see more details
in the homework and a little bit in the lecture notes.
And it turns out this algorithm actually converges quite quickly, right?
Um, to give you a flavor,
I think that, uh, if
the discount factor is 0.99,
it turns out that you can show that the error
reduces, you know, by a factor of 0.99 on every iteration,
and so V actually converges quite quickly,
geometrically quickly or exponentially quickly,
to the optimal value function V star.
And so, you know, if the discount factor is 0.99, then with
100 iterations or a few hundred iterations,
V would be very close to V star, okay?
And if the discount factor is 0.9, then with just,
you know, 10 or a few dozen iterations it'll be very close to V star.
So this algorithm actually converges quite quickly to V star, okay?
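To put rough numbers on "a few hundred" versus "a few dozen", here is a small sketch. It assumes the error shrinks by exactly a factor of Gamma per iteration, which is the bound quoted above rather than a measured rate, and the helper name is illustrative.

```python
import math


def iterations_needed(gamma: float, shrink: float) -> int:
    """Smallest k with gamma**k <= shrink, if the error contracts by gamma each sweep."""
    return math.ceil(math.log(shrink) / math.log(gamma))


# Sweeps needed to cut the initial error to 1% of its starting size.
print(iterations_needed(0.99, 0.01))  # 459
print(iterations_needed(0.9, 0.01))   # 44
```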
Um, so let's see.
[NOISE].
All right. So just to put everything together,
um, if you- if
you run value iteration on that MDP,
you end up with this. Um, er,
so this is V star, okay?
So it's a list of 11 numbers telling you what is the optimal, um,
expected pay off for starting off in each of the 11 possible states.
And so, um, I had previously said,
I think I said last week,
uh, or the week before Thanksgiving,
that this is the optimal policy, right?
So, you know, let's just use as a case study how
you compute the optimal action for that state,
um, given this V star, all right?
Well, what you do is you,
you actually just use this equation.
And so, um, if you were to go west,
then if you were to compute,
I guess, this term, um,
the sum over S prime, with a equal to west, or left I guess, right,
of P of S A, S prime, times V star of S prime, it is equal to,
um, if you were to go west, you have a, um-
Right.
Um, right.
So if you're in this state,
and if you attempt to go left,
then there's a 0.8 chance you end up there with,
ah, V star of 0.75.
There's a 0.1 chance,
you know, if you try to go left,
there's a 0.1 chance you veer off to the north and have a 0.69.
And then there's a 0.1 chance that you actually go
south and bounce off the wall and end up with a 0.71.
And so the expected future reward,
the expected future payoff given this equation is that if you attempt to go west,
you end up with 0.740 as the expected future reward.
Whereas if you were to go north,
and we do a similar computation.
[NOISE] You know, so 0.8 times 0.69,
plus 0.1 times 0.75,
plus 0.1 times 0.49,
is the appropriate weighted average.
You find that this is equal to 0.676.
Um, so the expected future reward if you go west, ah,
left, is 0.740, which is quite a bit higher than if you go north,
which is why we can conclude based on this little calculation,
um, that the optimal policy is to go left at that state, okay?
And really, technically, you check north,
south, east, and west and make sure that going west gives the highest reward.
And that's how you can conclude that going west is actually the better action,
at this state, okay?
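The little calculation above can be written out as a minimal sketch. The dictionary layout and the names outcomes, expected_value and best_action are illustrative assumptions. Only west and north are listed because those are the two actions worked through here; a full check would include south and east as well.

```python
# For the one state discussed above: for each candidate action, the
# (probability, V*(next state)) pairs read off the grid of V* values.
outcomes = {
    "west":  [(0.8, 0.75), (0.1, 0.69), (0.1, 0.71)],
    "north": [(0.8, 0.69), (0.1, 0.75), (0.1, 0.49)],
}

def expected_value(pairs):
    """Sum over next states of P(s, a, s') * V*(s')."""
    return sum(p * v for p, v in pairs)

scores = {action: expected_value(pairs) for action, pairs in outcomes.items()}
best_action = max(scores, key=scores.get)  # argmax over the listed actions

print(scores, best_action)
```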
So that's the value iteration.
And based on this, if you,
um, ah, are given an MDP you can implement this,
ah, solve for V star and, ah,
ah, be able to, ah,
compute Pi star, okay?
All right. Few more things to go over.
But before I move on, ah,
let me check if there are any questions, yeah.
[inaudible]
Oh, sure yep. Is the number of states always finite?
So in what we're discussing so far, yes.
But what we'll see on Wednesday is how to generalize this framework.
I'll, I'll do this a little bit later but it
turns out if you have a continuous state MDP, ah,
one of the things that's often
done I guess is to discretize it into a finite number of states.
Ah, but then there are also some other versions of, um,
ah, you know, value iteration that apply directly to continuous states as well.
Okay, cool. All right.
So [NOISE].
Um, what I described is an algorithm called value iteration.
The other, um, you know, common, ah,
sort of textbook algorithm for solving MDPs is,
is called policy iteration.
And let me just- I'll just write out what the algorithm is.
So here's the algorithm which is, um,
you know initialize Pi randomly, right?
[NOISE].
Okay, so let's see what this algorithm does.
So we'll talk about pros and cons of value iteration versus policy iteration in a little bit.
Um, in policy iteration, ah,
instead of solving for the optimal value function V star,
so in- in value iteration our focus of attention was V star, right?
Where, um, you know,
you do a lot of work to try to find the value function.
And then once you solve for V star,
you then figure out the best policy.
In policy iteration, the focus of attention is
on the policy Pi rather than the value function.
And so initialize Pi randomly.
So that means for- for each of the 11 states pick a random action, right?
So a random initial Pi.
And then we're going to repeatedly carry out these two steps.
Um, the first step is, um,
solve for the value function for the policy Pi, right?
And remember, um, for V Pi,
this was a linear system of equations, right?
With 11 variables, with 11 unknowns.
There is a linear system of 11 equations with 11 unknowns.
And so using a sort of linear algebra solver or linear equation solver,
given a fixed policy Pi,
you could just, you know,
at the cost of inverting a matrix roughly, right?
You can solve for- you can solve for all of these 11 values.
And so in policy iteration,
um, you would, you know,
use a linear solver to solve for
the value function for this policy Pi that we just randomly initialized.
And then set V to be the value function for that policy.
Okay, um, and so this is done quite efficiently with the linear solver.
And then the second step of policy iteration
is pretend that V is the optimal value function,
and update Pi of S,
you know, using Bellman's equations for the optimal value function,
right, or update it, um,
the way you saw earlier for computing Pi of S. And then you iterate,
and then given a new policy,
you then solve that linear system of equations for your new policy Pi.
So you get a new V_Pi and you keep on iterating these two steps,
um, until convergence, okay? Yeah.
[inaudible]
Yeah, yep. Yes, that's right.
So in, in, in value,
ah, yeah, yeah, yeah, yeah.
So in, in value iteration, um, ah,
in value iteration, think about
value iteration as waiting until the end to compute Pi of S, right?
Solve for V star first, and then compute Pi of S. Whereas in policy iteration,
we're coming up with a new policy on every single iteration, right?
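The two steps of policy iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's own code. Rewards depend only on the state, the initial policy is all-zeros rather than random so the run is repeatable, the small Gauss-Jordan solver stands in for whatever linear solver you would really use, and the 3-state chain MDP is made up for the example.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(P, R, gamma):
    """P[s][a][s2] is a transition probability, R[s] the reward in state s."""
    n_states, n_actions = len(P), len(P[0])
    pi = [0] * n_states  # assumption: fixed start instead of a random policy
    while True:
        # Step 1: V = R + gamma * P_pi V, i.e. solve (I - gamma * P_pi) V = R.
        A = [[(1.0 if i == j else 0.0) - gamma * P[i][pi[i]][j]
              for j in range(n_states)] for i in range(n_states)]
        V = solve_linear(A, R)
        # Step 2: pretend V is optimal and act greedily with respect to it.
        new_pi = [max(range(n_actions),
                      key=lambda a: sum(P[s][a][s2] * V[s2]
                                        for s2 in range(n_states)))
                  for s in range(n_states)]
        if new_pi == pi:  # the policy stopped changing
            return pi, V
        pi = new_pi


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right,
# moves are deterministic, and only the right-most state pays a reward.
n = 3
P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
for s in range(n):
    P[s][0][max(s - 1, 0)] = 1.0
    P[s][1][min(s + 1, n - 1)] = 1.0
R = [0.0, 0.0, 1.0]
pi_star, V_star = policy_iteration(P, R, gamma=0.9)
print(pi_star, V_star)
```

On this toy chain the loop stops after a handful of iterations with the policy "always move right".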
Okay? So, um, pros and cons of poly- and,
and it turns out that this algorithm will also converge to the optimal policy.
Um, pros and cons of policy iteration versus value iteration.
Policy iteration requires solving this linear system of
equations in order to, um, get V_Pi.
And so it turns out that if you have a relatively small state space,
um, like if you have 11 states,
it's really easy to solve a linear system of equations,
ah, you know, of 11 equations in order to get V_Pi.
And so in a relatively small set of states like 11 states or really anything, you know,
like a few hundred states, um,
policy iteration would work quite quickly.
Ah, but if you have a [NOISE] relatively large set of states,
you know, like 10,000 states or,
or, or a million states.
Um, then this step would be much slower.
At least if you do it by solving the linear system of equations, and then
I would favor value iteration over policy iteration.
So for larger problems,
usually, um, ah, ah,
usually I would use value iteration because solving this linear system of equations,
you know, is pretty expensive if it's, like, a million by a million.
A million equations and a million unknowns, that's quite expensive.
But even 11 states 11 unknowns is a very small system of equations.
Um, and then one,
one other pro and con,
one of the, ah,
ah, differences that's maybe,
maybe more academic than practical.
But it turns out that if you use value iteration, um,
V will converge towards V star,
but it won't ever get to exactly V star, right?
So just as, if you apply gradient descent for linear regression,
gradient descent gets closer and closer and closer to the global optimum,
but it never, you know,
gets exactly the global optimum.
It just gets really, really close, really, really fast.
Actually, it turns out gradient descent asymptotically
converges geometrically quickly, or exponentially quickly, right?
But even then it never quite gets, you know,
definitively to the optimum,
to the one optimal value.
Whereas, you, you saw using normal equations it just
jumped straight to the optimal value and there's no,
you know, converging slowly.
And so value iteration converges to a V star,
but it doesn't ever end up at exactly the value of V star.
Ah, this difference may be a bit academic because in practice it,
it doesn't matter, ah, right?
Ah, ah, but in policy iteration, um,
if you iterate this algorithm then after a finite number of iterations, ah,
this algorithm will stop changing meaning that after a
certain number of iterations Pi of S will just not change anymore, right?
So you find Pi of S, update the value function,
and then after another iteration,
when you take these argmaxes,
you end up with exactly the same policy.
And so, ah, it just solves for
the optimal value and the optimal policy, and then,
you know, ah, ah,
it doesn't just converge toward the optimal value.
It actually gets the optimal value when it converges, okay?
Um, so I think in practice I actually see value iteration used much more,
ah, ah, ah, because, um,
solving these linear systems of equations gets expensive, you know,
if you have a larger state space but, um,
value iteration, excuse me,
val- I see value iteration used much more.
But if you have a small problem, you know,
I think you could also use policy iteration which may converge a little bit faster.
If, if you have a small problem, okay?
[NOISE] All right, good.
So the last thing is,
um, kinda putting it together, right?
And what if you don't know
[NOISE].
So it turns out that when you apply this to a practical problem,
you know, in- in- in robotics right.
Um, one common scenario you run into is that you do not know what P of S, A is.
You don't know the state transition probabilities, right.
So when we built the MDP we said, well,
let's say the robot, if you tell it to go north, you know,
has a 0.8 chance of going north and a 0.1 chance each of veering off to the left or right.
Again, that's a very simplified robot.
But if you build an actual robot, or build a helicopter, or whatever,
play chess against an opponent,
Uh, the state transition probabilities are often not known in advance.
And so in many MDP implementations you need to estimate this from data.
And so the workflow of many reinforcement learning projects will be that,
um, you will have some policy and have the robot run around,
you know, just have a robot run around a maze and count
up, of all the times you had it take the action north,
how often did it actually go north and how
often did it veer off to the left or right, right?
And so you use those statistics to estimate the state transition probabilities.
So let me just write this out.
So you estimate.
So after executing maybe a random policy, or some policy,
in the MDP for a while,
you would estimate this from data.
And so, the obvious formula would be,
estimate P of S, a, S prime to be the number of times you took action a
in state S and got to S prime,
divided by the number of times you took
action a in state S, right.
So this estimate of P of S, a, S prime is actually a maximum likelihood estimate.
You look at the number of times
you took action a in state S,
and ask, of those, what fraction of the time did you get to state S prime, right.
Or 1 over the number of states, if the above is 0 over 0, right.
[NOISE] And a common heuristic is,
if you've never taken this action in this state before,
if the number of times you try action A in state S is 0.
So you've never tried this action in this state.
So you have no idea what it's going to do.
Then just assume that the state transition probability is 1 over 11, right?
That it randomly takes you to another state.
So this would be a common heuristic that people
use when implementing reinforcement learning algorithms, okay?
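The counting estimate and the fallback for never-tried state-action pairs can be sketched like this. The function name estimate_transitions and the log format, a list of (s, a, s_next) tuples, are assumptions made for the illustration.

```python
from collections import Counter

def estimate_transitions(transitions, n_states, n_actions):
    """transitions: list of (s, a, s_next) tuples observed while acting."""
    visits = Counter((s, a) for s, a, _ in transitions)  # times a was taken in s
    arrivals = Counter(transitions)                      # ... and led to s_next
    P = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for s2 in range(n_states):
                if visits[(s, a)] == 0:
                    # Never tried a in s: fall back to the uniform guess.
                    P[s][a][s2] = 1.0 / n_states
                else:
                    P[s][a][s2] = arrivals[(s, a, s2)] / visits[(s, a)]
    return P

# Made-up log: from state 0, action 1 reached state 1 eight times out of ten.
log = [(0, 1, 1)] * 8 + [(0, 1, 0)] * 2
P_hat = estimate_transitions(log, n_states=2, n_actions=2)
print(P_hat[0][1], P_hat[1][0])
```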
And it turns out that you can use Laplace smoothing for this if you wish,
but you don't have to.
So, you remember Laplace smoothing, right.
It would be, you know, add 1 to the numerator and add 11 to the denominator,
if you were to use Laplace smoothing,
which avoids the problem of 0 over 0 as well.
But it turns out that unlike the Naive Bayes algorithm,
these solvers of MDPs are not that sensitive to 0 values.
So if one of your estimates of a probability were 0, you know,
unlike Naive Bayes, where having a 0 probability was very
problematic for the classifications made by Naive Bayes,
it turns out that MDP solvers,
including value iteration and policy iteration,
they do not give sort of
nonsensical or horrible results just because of a few probabilities that are exactly 0.
And so in practice,
you can use Laplace smoothing if you wish.
But because the reinforcement learning algorithms don't perform
that badly even if some of these estimates are zero,
in practice Laplace smoothing is not commonly used.
What I just wrote is- is more common.
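For comparison, the Laplace-smoothed version of the estimate can be sketched in a couple of lines. The function name and argument names are illustrative assumptions. The 11 in the example is the number of states in the grid used throughout.

```python
def laplace_estimate(count_sas, count_sa, n_states):
    """Add 1 to the numerator and the number of states to the denominator."""
    return (count_sas + 1) / (count_sa + n_states)

# With no data at all for a state-action pair, this gives 1 over 11
# rather than 0 over 0.
print(laplace_estimate(0, 0, 11))
# With data, a transition that was never observed still gets a small
# nonzero probability.
print(laplace_estimate(0, 10, 11))
```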
Okay.
So to put it together.
All right, if I give you
a robot and asked you to implement an MDP solver to find a good policy for this robot,
what you will do is the following.
Take actions with respect to some policy pi.
To get the experience in the MDP.
Right. So go ahead and let your robot loose and have it execute some policy for a while.
And then update estimates of P of Sa.
Based on the observations of where the robot goes when it takes different actions in different states,
update the estimates of P of Sa.
Solve, um, Bellman's equation using value iteration
to get V, and then update Pi.
So this is the value iteration version, putting it all together.
If you want to plug in policy iteration instead in this step, that's also okay.
But so if you actually get the robot, um,
you know, yeah right- right.
If you actually get a robot, uh,
where you do not know in advance the state transition probabilities,
then this is what you would do in order to,
um, iterate a few times I guess.
Repeatedly find a- find a-
find a policy given your current estimate of the state transition probabilities.
Get some experience, update your estimates,
find a new policy and kind of repeat this process
until hopefully it converges to a good policy.
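The loop just described, act, re-estimate the transition probabilities, solve, update the policy, can be sketched end to end. This is a minimal illustration under several assumptions that are not from the lecture: a made-up deterministic 3-state chain stands in for the robot, rewards are known and depend only on the state, each round restarts in state 0, value iteration runs for a fixed number of sweeps, and all function names are hypothetical.

```python
def value_iteration(P, R, gamma, sweeps=200):
    """Return (V, greedy policy) for P[s][a][s2] and state rewards R[s]."""
    n_states, n_actions = len(P), len(P[0])

    def backup(s, a, V):
        return sum(P[s][a][s2] * V[s2] for s2 in range(n_states))

    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [R[s] + gamma * max(backup(s, a, V) for a in range(n_actions))
             for s in range(n_states)]
    pi = [max(range(n_actions), key=lambda a: backup(s, a, V))
          for s in range(n_states)]
    return V, pi


def estimate_model(counts):
    """Turn counts[s][a][s2] into probabilities, uniform where (s, a) is unseen."""
    n_states = len(counts)
    return [[[c / sum(row) for c in row] if sum(row) else [1.0 / n_states] * n_states
             for row in per_state] for per_state in counts]


def learn_policy(step, R, n_states, n_actions, gamma=0.9, rounds=5, steps=50):
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    pi = [0] * n_states                           # some initial policy
    for _round in range(rounds):
        s = 0                                     # assumption: restart in state 0
        for _ in range(steps):                    # 1. gather experience under pi
            s2 = step(s, pi[s])
            counts[s][pi[s]][s2] += 1
            s = s2
        P_hat = estimate_model(counts)            # 2. update the estimates of P
        _, pi = value_iteration(P_hat, R, gamma)  # 3. solve, then update pi
    return pi


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right,
# and only the right-most state pays a reward.
def chain_step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, 2)

learned = learn_policy(chain_step, R=[0.0, 0.0, 1.0], n_states=3, n_actions=2)
print(learned)
```

On this toy chain the loop ends up with the policy "always move right". That outcome depends on the example; nothing in this loop guarantees a good policy in general.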
Okay.
Now just to
add more color and more richness to this,
we usually think of-
we usually think of the reward function as being given,
right, as part of the problem specification.
But sometimes you see that the reward function may be unknown.
And so for example,
if you're building a stock trading application
and the reward is the returns on a certain day,
it may not be a function of the state and it may be a little bit random.
Um, or if your robot is running around but depending on where it goes,
it may hit different bumps in the road and you
want to give it a penalty every time it hits the bump.
Or if you're building a self-driving car, right,
every time it hits a bump, hits a pothole,
you give it a negative reward,
then sometimes the rewards are a random function of the environment.
And so sometimes you can also estimate the expected value of a reward.
But- but in- in some applications,
if the reward is a random function of the state,
then this process allows you to also estimate the expected value of the reward from
every state and then running this will help you to converge. Okay yeah.
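Estimating the expected reward from experience can be sketched in the same counting style. The function name, the (state, reward) sample format, and the choice of 0.0 for states that were never visited are all assumptions made for the illustration.

```python
def estimate_rewards(samples, n_states, default=0.0):
    """samples: list of (state, observed_reward) pairs.
    Returns the average observed reward per state."""
    totals = [0.0] * n_states
    visits = [0] * n_states
    for s, r in samples:
        totals[s] += r
        visits[s] += 1
    # Assumption: unvisited states get `default`, since there is no data.
    return [totals[s] / visits[s] if visits[s] else default
            for s in range(n_states)]

# Made-up data: state 0 sometimes hits a pothole (reward -1), sometimes not.
R_hat = estimate_rewards([(0, -1.0), (0, 0.0), (1, 2.0)], n_states=3)
print(R_hat)
```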
[inaudible]
Yeah, cool. [NOISE].
[inaudible]
Yeah, cool. Great question. So let me,
let me talk about exploration, right.
So it turns out that, um,
this one [NOISE] so it turns out
this algorithm will work okay for some problems but the- the- there's one other,
ah, again to add richness to this,
there's one other, um,
issue that this is not solving which is the exploration problem.
And [NOISE] in, in reinforcement learning sometimes you hear
the term exploration versus exploitation, [NOISE] right?
Which is, um, let me use a different MDP example, right.
Which is, um, if your robot, you know,
starts off here and if there is a, um,
plus 1 reward here,
right and maybe a plus 10 reward here.
If just by chance during the first time you run the robot it
happens to find its way to the plus 1 then if you run this algorithm,
it may figure out that going to the plus 1 is a good way, right?
We're giving it a discount factor and there is
a fuel surcharge of minus 0.02 on every step.
So if just by chance your robot happens to find its way to
the plus 1 the first few times you run this algorithm then this algorithm is,
um, is uh, locally greedy, right.
Ah, it may figure out that this is a great way to get the plus
1 reward, and then the world ends and it stops incurring these minus 0.02 surcharges for fuel.
And so this particular algorithm may converge to a bad,
you know, kind of local optimum where it's always heading to the plus 1.
And as it heads to the plus 1,
it sometimes will veer off randomly, right, and get a little bit more experience
in the right half of the state space and end up with pretty good estimates of,
ah, what happens in the right half of this state space.
And, um, it may never find this hard-to-find
plus 10 pot of gold over on the lower left, okay?
So this problem is sometimes called actually, well,
it is called the exploration versus exploitation problem which is, um,
when you're acting in an MDP, you know,
how aggressively or how greedy should you be
at just taking actions to maximize your rewards?
And so the algorithm we described is relatively greedy, right?
Meaning that, um, it is taking your best estimate of the state transition probabilities
and rewards and is just taking whatever actions, and this is really saying, you know,
pick the policy that maximizes
your current estimate of the expected rewards and it's just acting greedily,
meaning on every step it's just executing the policy that
it thinks allows it to maximize the expected payoff, right?
And what this algorithm does not do at all is explore which is
the process of taking actions that may appear less optimal at the outset,
um, such as if the robot hasn't seen this plus 10 reward, it doesn't know how to get there,
maybe it should, you know,
just try going left a couple of times just for the heck of it,
right, to see what happens.
Because even if it seems less,
even if going left seems less optimal from
the perspective of the robot's current state of knowledge,
um, maybe if it tries some new things it's never
tried before, maybe it will find a new pot of gold, okay.
So this is called the exploration versus exploitation trade-off,
um, and this is actually not just an academic problem.
It turns out that some of the large online web advertising platforms,
ah, have the same problem as well.
And again, I, I, I, I have,
have mixed feelings about the advertising business.
It's very lucrative but it causes other problems, um, as well but,
but it turns out that for some of the large online ad platforms,
um, ah, you know, when a,
when an advertiser, um,
starts selling a new ad, or you post a
new ad on one of the large online ad platforms,
the ad platform does not know who is most likely to click on this ad, right?
And so pure explo- pure exploitation,
boy exploitation has such horrible connotations
especially [LAUGHTER] for online ad platforms.
Ah, it's the technical term, not a,
not a social term when used in this context.
But the pure, you know,
reinforcement-learning-sense exploitation policy, not,
not the other even more horrible sense of exploitation,
um, would be to always just show you,
show, show users the ads that, you know,
they are most likely to click on, to drive short-term revenue.
Whereas an exploration policy for large,
you know, some of these large online ad platforms,
is to show people some ads that may not be
what we think you are most likely to click on in this moment
in time but by showing you that ad or by showing
the pool of users an ad that you might be less likely to click on,
maybe we'll learn more about your interests.
And that, um, increases the effectiveness of
these large ad platforms at finding more relevant ads, right?
And for example, I don't know,
um, probably not- I, I, I guess,
ah, there are probably no advertisements for, ah,
Mars landers as far as I know.
But if the large online ad platforms
don't know that I'm actually pretty interested in Mars landers,
and one shows me an ad for a Mars lander, which I don't think such a thing exists, right?
If it did, I'd click on it, and they may learn that
showing me ads for Mars landers is a great thing,
right, ah, or, or some other thing that you may not know you're interested in.
So this is actually a real problem.
Um, some of the large online ad platforms, ah, um,
actually do explicitly consider exploration versus exploitation and make
sure that sometimes they show ads
that may not be the ones you're most likely to click on but that, you know,
allow them to gather information and so be better
positioned to figure out where the future rewards are, and to,
ah, learn how to match ads not just to you but to other users like you, right?
Um, sorry.
Okay, but so in order to make sure the reinforcement learning algorithm,
um, ah, explores as well as exploits, um, ah,
a common modification to
this would be, instead of always taking actions with respect to Pi,
you may, um, with 0.9 chance
[NOISE] take actions with respect to Pi, and with 0.1 chance
[NOISE] take an action randomly, okay.
And so, um, this particular,
[NOISE] exploration policy is called
Epsilon-greedy, where on every time step you toss a biased coin.
So on every time step,
with, let's say, 90% chance you execute whatever you think is
the current best policy and with 10% chance you just take a random action.
And this type of exploration policy, um,
increases the odds that you know,
every now and then maybe just by chance, right,
it'll find its way to the plus 10 pot of
gold, and learn state transition probabilities and,
and, and then eventually, um,
end up exploring the state-space more thoroughly, okay.
Um, this is called Epsilon-greedy exploration and,
um, it's a little bit of a misnomer I think.
So in, in the way we think of Epsilon-greedy, Epsilon, um,
say 0.1, is the chance of taking a random action instead of the greedy action.
Um, this algorithm
has always been a little bit strangely named because, ah,
0.1 is actually the chance of you acting randomly, right.
So Epsilon-greedy sounds like you're being greedy 0.1 of the time but,
but you're actually taking actions randomly 0.1 of the time,
so Epsilon-greedy is actually maybe 1 minus Epsilon-greedy.
So this name has always been a little bit,
um, off but that's what,
that's, that's how people use this term.
Epsilon-greedy exploration means that Epsilon of the time, where Epsilon is a hyperparameter,
a parameter of the algorithm, you act randomly instead
of following what you think is the best policy, okay.
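Epsilon-greedy action selection is only a few lines. This is a minimal sketch; the function name, the list representation of the policy, and the rng parameter are illustrative assumptions.

```python
import random

def epsilon_greedy_action(pi, s, n_actions, epsilon=0.1, rng=random):
    """With probability epsilon take a uniformly random action,
    otherwise take the action the current policy pi picks in state s."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # explore
    return pi[s]                         # exploit the current best guess

# Made-up policy over 3 states and 4 actions.
pi = [2, 0, 1]
rng = random.Random(0)
draws = [epsilon_greedy_action(pi, 0, 4, epsilon=0.1, rng=rng) for _ in range(10000)]
print(draws.count(2) / len(draws))
```

Note that the random branch can also land on the greedy action, so with 4 actions the greedy action is taken a bit more than 90% of the time.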
And it turns out that, um,
if you implement this algorithm with, um,
Epsilon-greedy exploration then this,
ah, ah, this algorithm,
ah, will converge to the optimal policy for any discrete state MDP, right.
Ah, sometimes it can take a long time because, you know,
if it takes a long time to randomly find the plus 10, it, it,
it could take a long time before it randomly stumbles upon the plus 10 pot of gold.
But, um, this algorithm with an,
with an exploration policy will converge to the optimal,
um, will, will converge to the optimal policy for any MDP. What is your question?
[inaudible]
Yeah, yeah, so, right, should you always keep epsilon constant or should you use a dynamic epsilon.
So yes, ah, there are,
there are many heuristics for how to explore, ah.
One reasonable thing to do would be to start with
a large value of epsilon and slowly shrink it.
Um, another common heuristic would be,
um, there is a different,
ah, type of exploration called Boltzmann exploration,
which you can look up if you want which is, ah,
if you think that the value of going north is,
um, you know, 10 and the value of going south is 1,
then there is such a huge difference that you
should bias your action toward going to the bigger value,
the, the bigger reward, and,
ah, you could have the probability be proportional to e to the value,
ah, times a scaling factor, right?
So that's called Boltzmann exploration, where instead
of having a 10% chance of taking an action completely at random,
ah, you could just, you know,
have a very strong bias toward
heading toward the higher values but also have some probability of going to
lower values, but where
the exact probability depends on the difference in the values.
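Boltzmann exploration, as described here, picks each action with probability proportional to e raised to a scaled value. A minimal sketch follows; the function name and the scale parameter are illustrative assumptions, and subtracting the maximum is only there to keep the exponentials from overflowing.

```python
import math

def boltzmann_probabilities(values, scale=1.0):
    """Probability of each action, proportional to exp(scale * value)."""
    top = max(values)
    # Subtracting the max leaves the probabilities unchanged.
    weights = [math.exp(scale * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

# Values from the example above: going north is worth 10, going south 1.
print(boltzmann_probabilities([10.0, 1.0]))
# A smaller scale flattens the distribution, so it explores more.
print(boltzmann_probabilities([10.0, 1.0], scale=0.1))
```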
So, I think Epsilon-greedy,
I feel like I see this used the most often
for these types of MDPs, and then Boltzmann exploration,
which is what I just described, also. Two more questions before we wrap up, go ahead.
[inaudible]
Yes, can you get a reward for reaching states you've never seen before?
Yes, there is a fascinating line of research called intrinsic reinforcement learning.
Ah, and it was really started by Satinder Singh.
If you Google for intrinsic,
intrinsic motivation, you'll find some research papers on it.
Um, and then there was some recent followup work I think by
DeepMind or some other groups but intrinsic motivation
is the term to Google where you reward
a reinforcement learning algorithm for finding new things about the world.
Just one last question.
How many actions you should take with respect to Pi?
Sorry, say that again?
How many actions you should take with respect to Pi before updating the Pi?
I see, right. How often,
how many actions should you take before updating Pi?
Um, there's no harm in doing it as frequently as possible.
Ah, if you're doing this with a real robot, what,
you know, I've seen is, um,
sometimes this means going to the physical robot, and so, you know, I don't know,
when we're flying helicopters you go out to the field for the day,
collect a lot of data, and then go back to
the lab in the evening and rerun the algorithms.
Ah, but if there's no barrier to running this all the time,
then it doesn't hurt the performance
to just run it as frequently as you can.
All right, that's it for the basics of MDPs.
Um, on Wednesday, we'll continue with generalizing all of this to continuous state MDPs.
Okay, let's break, I'll see you on Wednesday.
