Okay. Let's start guys.
Okay. So our plan for today is to catch up.
So we're a little behind.
So, uh, it's okay.
So today, I want to talk about MDPs,
Markov decision processes, and my plan is to talk about that for the first hour.
And then after that, I want to talk, uh,
for 10 minutes about the previous lecture.
So remember, like we went over relaxations kind of quick,
so maybe we can go over that again.
And then, the last 10 minutes I want to talk about the project and,
kind of the plan for the project,
how we should think about it,
it's coming up, so we should start talking about that.
So this is an optimistic plan,
[LAUGHTER] so let's see how it goes; this is the current plan.
Okay. All right.
So. Okay, let's get into it.
So Markov decision processes.
So let's start with a question.
Um, let's actually do this just by hand,
so you don't need to go to the website.
So the question is,
it's Friday night and you wanna go to Mountain View and you have a bunch of options but,
what, what you wanna do is you want to get to Mountain View
with the least amount of time, okay?
Which one of these modes of transportation would you use?
Like, how many of you would bike?
No one would bike. A couple of you would bike.
How many of you would drive?
This is, this is popular in Mountain View, would be good.
Caltrainers? Some people would take Caltrain,
sounds good. Uber and Lyft?
We have like a good like distribution.
Fly? [LAUGHTER] Yes, yeah, a good number of you want to fly,
uh, as flying cars are becoming a thing,
like this could be an option in the future.
There are a lot of actually startups working on flying cars.
Um, but, but as you think about this problem like the way you think about it is,
is there a bunch of uncertainties in the world,
like it's not necessarily a search problem, right.
You could bike and get a flat tire,
and you don't really know that in advance, right,
you have to kind of take that into account.
If you're driving, there could be traffic.
Uh, if you are taking the Caltrain,
there are all sorts of delays with the Caltrain,
uh, and all sorts of other uncertainties that exist in the world and,
and you need to think about those.
So it's not just a pure search problem
where you pick your route and then you just go with it,
right, there are, there are things that can happen,
uh, that can affect your decision.
So, and that kind of takes us to Markov decision processes.
We talked about search problems,
where everything was deterministic,
and now we're talking about this next class of state-based models,
which are Markov decision processes.
And the idea of it is, you take actions but you might not actually end
up where you expected to because there is this nature around
you and there's this world around you that's going to be
uncertain and do stuff that you didn't expect, okay.
So, so, so far we've talked about search problems.
The idea of it is you start with a state and then you take
an action and you deterministically end up in a new state.
If you remember the successor function,
successor of S and A would always give us S prime,
and we would deterministically end up in S prime.
So if you have like that graph up there,
if you start in S and you decide to take this action one,
you're going to end up in A,
like, there's no other option.
That's the only place you can end up, okay.
Uh, and the solution to these search problems are these paths.
So we have the sequence of actions because I know if I,
if I take action one, and action three, and action two,
I know like what is the path that I'm going to end up at and that would be ideal, okay.
So when we think about Markov decision processes,
that is the setting where we have
uncertainty in the world and we need to take that into account.
So, so the idea of it is, you start in a state,
you decide to take an action but then you can randomly end up in different states.
You can randomly end up in S_1 prime or S_2 prime.
And again, because there's just so many other things
that are happening in the world and you need to,
you need to worry about that randomness and make decisions based on that, okay.
And this actually comes up in pretty much every application.
So, uh, this comes up in robotics.
So for example, if you have a robot that wants to go and pick up an object,
you decide on your strategy, everything is great,
but like when it comes to actually moving the robot and getting
the robot to do the task like the actuators can fail,
or you might have all sorts of obstacles around you that you didn't think about.
So there is uncertainty about the environment, or
uncertainty about your model, like your actuators,
that you didn't necessarily think about, and in reality,
they are affecting your decisions and where you end up.
This comes up in other settings like resource allocation.
So in resource allocation,
maybe you're deciding what to produce,
what is the product you would want to produce and,
and that kind of depends on what is the customer demand and,
and you might not have a good model of that and,
and that's uncertain, right?
It really depends on what,
what products customers want and what they don't.
And you might have a model but it's not gonna be like accurate and, and you need,
you need to do resource allocation under
those assumptions of uncertainty about the world.
Um, a similar thing comes up in agriculture.
So for example, you want to decide what to plant,
but again, you might not be sure about the weather,
if it's gonna rain,
or if the crops are going to yield or not.
So there's a lot of uncertainty in these decisions that we make,
and that takes these problems beyond search problems;
they become problems where we have uncertainty
and we need to make decisions under uncertainty.
Okay? All right.
So let's take another example.
So this is a volcano crossing example.
So we have an island, and we're on one side of it,
in that black square over there.
And what we wanna do is go from this black square
to the other side of the island, where we have
the scenic view, and that's gonna give us a lot of reward and happiness.
So, so my goal is to go from one side of the island to the other side of the island.
But the caveat here is that there's
this volcano in the middle of the island that I need to actually pass, okay.
And if I fall into the volcano,
I'm going to get a minus 50 reward,
really more like minus infinity.
But for this example, imagine you are getting a minus 50 reward
if you fall into the volcano, okay? All right.
So, if my slip probability is 0,
which means I'm sure I'm not gonna fall into the volcano,
should I cross the island?
No or yes?
Well, I should cross the island uh,
because I'm not gonna fall, right,
like I'm, I'm not gonna fall into that minus 50.
Uh, slip probability is 0,
I'll get to my 20 reward, everything
will be great, okay.
But the thing is like we've been talking about how the world is,
is stochastic and slip probability is not gonna be 0.
Maybe, maybe it's 10%.
So if there's 10% chance of falling to, into the volcano,
how many of you would,
would still cross the island?
Good number, yeah.
So, um, the optimal solution is actually shown by these arrows here.
And yes, the optimal solution is still to cross the island.
We're going to talk about all these terms,
but the value here is basically the value you're gonna get, uh,
at the beginning state;
it's the expected utility that you're gonna get.
We'll talk about it.
It's gonna go down because there is
some probability that you're going to fall into a volcano,
but still like the best thing to do is to cross the island. How about 20%?
How many of you would do it with 20%?
Some number of people, [LAUGHTER] it's less.
Um, still turns out that the optimal strategy is to cross.
30%?
One person. [LAUGHTER]
So with 30%,
that's actually the point where you'd rather not cross,
because there's this volcano, and with a large probability
you could fall into it and the value is going to go down.
Okay. So these are the types of problems we're gonna,
we're gonna work with. Yes.
The question is about the value versus the 2.
The 2 is the reward that you're going to get at that state,
and the value is what you compute by propagating rewards back.
We'll talk in detail about how to compute the value, okay?
All right. Okay. So that was just an example.
So, so that was an example of a Markov Decision Process.
What we wanna do in this lecture,
is we are going to model these
types of systems as Markov decision processes,
and then we are going to talk about inference-type algorithms.
So how do we do inference?
How do we come up with this best strategy path?
Um, and in the middle,
I'm going to talk about policy evaluation,
which is not an inference algorithm but it's kind of a step towards it.
And it's basically this idea,
if someone tells me this is a policy,
can I evaluate how good it is?
And then we'll talk about value iteration which tries to figure out
what is the best policy that I can take, okay?
So that's the plan for today.
Then next lecture we're going to talk about
reinforcement learning where we don't actually know what the reward is,
and we don't know what the transitions are.
Uh, so, so that's kind of the learning part of- part of these, er, MDP lectures.
Uh, so Rita is actually going to do that lecture,
on Wednesday, right?
Okay. So let's get into- let's get into Markov decision processes.
So we have a bunch of examples throughout this lecture,
so this is kind of another example.
So all right so actually I do need volunteers for this.
So in this example, uh, we have a bunch of rounds,
and the idea is you can at any point in time,
you can choose two actions.
You can either stay or you can quit, okay?
If you decide to quit,
I'm going to give you $10 (well,
actually I'm not going to give you $10,
but imagine I'm gonna give you $10),
and then we'll end the game, okay?
And then if you decide to stay,
then you're gonna get $4 and then I'll roll the dice.
If I get one or two, we'll end the game.
Otherwise, you're going to continue to the next round,
and you can decide again, okay?
So who wants to play with this?
Okay. All right. Volunteer. Do you want to stay or quit?
Quit. [LAUGHTER]
[LAUGHTER] so that was easy.
You got your $10. [LAUGHTER]
Does anyone else want to play?
Stay, and stay again.
Oh, you got $8.
Sorry. [LAUGHTER]
Um, so you kind of get the idea here, right?
So you have these actions, and with one of them,
if you decide to quit,
you deterministically get your $10 and you're done.
With the other one, it's probabilistic,
and you kind of wanna see which one is better and what
would be the best policy to take in this setting.
So we'll come back to this question.
We will formalize this, and, and we'll go over this.
I have a question. I think I've seen a similar example.
Is it better to just stay once and then quit?
Like, isn't it better to switch?
So, when to switch and when not.
Okay, so then you need to actually compute what is the-
Yeah.
-expected utility, right?
And that's what we wanna do, right?
So you might say, "Oh, I wanna stay and then I get my $4,
and then I want to quit, and then I get $14 total,
and maybe that is the way to go."
That could be a strategy, but to evaluate it,
we are going to define what the optimal policy would be;
we're going to actually talk about that.
One other thing to keep in mind for this particular problem:
I'll talk about it when I define a policy,
but the way we define a policy, it's a function of state.
So if you decide to stay, that is your policy.
If you decide to quit, that is your policy.
You're not allowed to switch right now,
as I'll discuss later in the lecture.
But I'll come back to this problem, okay?
So if you decide that your policy,
the thing you want to do, is to just stay,
and keep staying, this is the probability distribution over
the total rewards that you are gonna get.
So you're gonna get 4 with some probability.
And then if you're lucky, you're gonna get 8.
If you're even luckier, you're gonna get 12,
and luckier still, 16.
But the probabilities are going to come down really quickly.
So the thing we care about in this setting,
is, is the expected utility, right?
In expectation, like if I- if I- if I run this,
and if I average all of these possible paths that I can do,
what would be the value that I get?
And for this particular problem,
it turns out that in expectation if you decide to stay, you should get 12.
So, so you got really unlucky that you got 8.
But [LAUGHTER], but in general,
in expectation, you should decide to stay, okay?
And, and we actually want to spend a little bit of time in
this lecture thinking about how we get that 12,
and and how to go about computing this expected utility.
And, and based on that, how to decide what policy to use, right?
Okay. And then if you decide to quit,
the expected utility there is kind of obvious, right?
Because you're quitting, and
with probability 1 you're getting $10,
so the expected utility of quitting is just $10. Yes?
[inaudible]
Uh, so when you roll the die,
I said if you get one or two-
You stay.
You stay, yeah.
And then if you get the others,
so the other two-thirds, you continue.
So that's where the one-third and
two-thirds come from, okay?
All right. I'll come back to this example.
This is actually the running example throughout
this lecture, okay?
[inaudible] So how are
you able to do this calculation? We're going to talk about that next.
That is what the lecture is about.
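Before we get there, the 12 from earlier can already be checked numerically: under the "stay" policy, each round pays $4, and the number of rounds is geometric (the game ends with probability one-third after each round), so the expected utility is $4 times the expected number of rounds, which is 3. A quick sketch in Python (the function name is my own):

```python
# Sketch: expected utility of the "stay" policy in the dice game.
# Each round pays $4; after each round the game ends with probability 1/3
# (die shows 1 or 2) and continues with probability 2/3.
# If the game lasts N rounds, the utility is 4*N, and N is geometric:
# P(N = k) = (2/3)**(k-1) * (1/3).

def expected_stay_utility(reward=4.0, p_end=1/3, max_rounds=10_000):
    """Sum reward*k * P(N = k) over k; truncated, but converges fast."""
    p_continue = 1.0 - p_end
    return sum(reward * k * p_continue ** (k - 1) * p_end
               for k in range(1, max_rounds + 1))

print(round(expected_stay_utility(), 4))  # 12.0, matching 4 * E[N] = 4 * 3
```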
Okay. So let's actually-
I do wanna finish in an hour,
that's why maybe I'm rushing things a little bit.
But we are going to talk about this problem throughout the class,
so don't worry about it.
If it's not clear at the end,
we can clarify things, okay?
All right. So I do want to formalize this problem.
The way I want to formalize this problem is using an MDP.
So I wanna formalize this as a Markov decision process.
Maybe I can just use this.
So in Markov decision processes,
similar to search problems,
you're going to have states.
So in this particular game,
I'm going to have two states.
I'm either in the game, or I'm out of the game.
So there's an end state where everything has
ended: you're out of the game, you're done, okay?
So those are my states.
Then, in each of these states,
I can take an action.
And if I'm in the in state,
I can take two actions, right?
I can either decide to stay,
or I can quit, okay?
And if I decide to stay
from the in state, that takes me to something that I'm going to call a chance node.
So a chance node is a node that represents a state and an action.
The blue things are my states,
but I'm creating these chance nodes as a way of going through this example,
to see where things are going.
So these blue states are going to be my states.
These chance nodes are over a state and an action.
So basically, this node tells me that I started with in,
and I decided to stay, okay?
And the chance node here
tells me that I started with in,
and I decided to quit, okay? Yes.
Why do we still call it a chance node even though it's deterministic?
So I deterministically go through it,
but then from the chance node, that's where I'm introducing the probabilities.
So from the chance node I can
probabilistically end up in these different states.
In the case of quit, it's also deterministic.
In the case of quit, yes, it's deterministic.
So in the case of quit,
we say with probability 1,
I'm going to end up in this end state.
So I am going to draw that with the edge
that comes from my chance node, and I'm gonna say,
with probability 1,
I'm going to get $10 and just be done, okay?
But if you are in this state,
this is actually the state where interesting things can
happen: with probability two-thirds,
I'm going to go back to in,
and get $4,
or with probability one-third,
I'm going to end up in end,
and I still get $4, okay?
So that is my Markov decision process.
Maybe we can keep track
of a list of the things we are defining in this lecture.
So we just defined states, and then we said, well,
we're gonna have these chance nodes,
because from these chance nodes, probabilistically,
we're going to come out of them depending on what happens in nature, right?
This is the decision I've made;
now nature decides which state you're going to end up at,
and based on that, we move forward, okay?
All right. So, so more formally,
we had a bunch of things when we define an MDP.
Similar to search problems, we- like we,
we now need to define the same set of things.
So, so we have a set of states.
In this case my states are in and end, okay?
We have a start state.
I'm starting with in.
So that's my start state.
I have actions as a function of states.
So when I ask what are the actions of the state,
my actions are going to be stay or quit.
What are actions of end?
I don't have anything, great,
end state doesn't have any actions that come out of it.
And then we have these transition probabilities.
So transition probabilities more formally,
take a state, an action, and, and a new state.
So S, A, S prime,
and tell me what is the transition probability of that,
it's one-third in this case.
And then I have a reward which tells me how much was that rewarding,
that was $4, okay?
So, so I'm defining- so when I'm defining my MDP,
kind of the new things I'm defining is this transition probability,
which tells me if you're in state S,
and take action A, and you end up in S prime.
What is the probability of that?
I'm in in, I decide to stay,
and then end up in end.
What's the probability of that? That's one-third.
Maybe I'm in in, I decide to quit,
I end up in end.
What's the probability of that?
It's equal to 1, okay?
And then over these same state, action, next-state triples,
we are going to define a reward, which tells me how much money did I get,
or how good was that.
So it was $4 in this case.
Or, or if I decide to quit,
I got $10, okay?
Um, and if you remember in the case of search problems,
we're talking about cost.
I'm just flipping the sign here,
we wanted to minimize cost.
Here we want to maximize the reward just a more optimistic view of the world I guess.
Um, so, so that is what the rewards are going to be defined, okay?
We also have this IsEnd function,
which, similar to search problems, just checks whether you're in an end state or not.
And in addition to that,
we have something that's called a discount factor.
It's this value Gamma, which is between 0 and 1.
And I'll talk about this later; don't worry about it right now.
But it's a thing we define for our MDPs, okay?
All right. So how do I compare this with search?
Again, these were the things that we had in a search problem.
We had the successor function that would deterministically take me to S prime and we had
this cost function that would tell me what was the cost
of being in state S and taking action A.
So, so the major things that are changed is that instead of the successor function,
I have transition probabilities these T's, that,
that basically tell me what's the probability of starting in S,
taking action A, and ending up in S prime.
And then the cost just became reward, okay?
So, so those are kind of the major differences between search and MDP.
Because things are- things are not deterministic here [NOISE], okay?
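To make the formalism concrete, here is a minimal Python sketch of the dice game as an MDP. The class and method names (`startState`, `actions`, `succProbReward`) are my own choices modeled on the lecture's description, not the official course code.

```python
# Minimal sketch of the dice game as an MDP, following the formalism above.
class DiceGameMDP:
    def startState(self):
        return 'in'

    def isEnd(self, state):
        return state == 'end'

    def actions(self, state):
        # Only the 'in' state has actions; 'end' is terminal.
        return ['stay', 'quit'] if state == 'in' else []

    def succProbReward(self, state, action):
        # Returns a list of (newState, T(s,a,s'), Reward(s,a,s')) triples.
        if state == 'in' and action == 'quit':
            return [('end', 1.0, 10)]
        if state == 'in' and action == 'stay':
            return [('in', 2/3, 4), ('end', 1/3, 4)]
        return []

mdp = DiceGameMDP()
print(mdp.actions('in'))                 # ['stay', 'quit']
print(mdp.succProbReward('in', 'quit'))  # [('end', 1.0, 10)]
```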
All right, so, so that was the formalism.
Now, now I can define any,
any MDP model- any Markov Decision Process.
And then just one thing to point out: this transition probability, this T,
basically specifies the probability of ending up in
state S prime if you take action A in state S.
So these are probabilities, right?
So, so for example again,
like we have done this example but let's just do it on the slides again,
if I'm in state in, I take action quit, I end up in end,
what's the probability of that?
1.
And then if I'm in state in,
I take action stay,
I end up in state in again,
what's the probability of that?
Two-thirds.
And then if I'm state in,
I take action stay,
I end up in end, what is the probability of that?
One-third, okay?
And then these are probabilities.
So what that means is they need to add up to 1. But one thing to notice is, well,
what exactly is going to add up to 1?
It's not all of the entries in the column.
The thing that's going to add up to 1 is: if you consider
all the possible different S primes that you can end up at,
those probabilities are going to add up to 1.
So if you look at this table again,
if you're in state in and take action stay,
then the probabilities that we have for the different S primes
are two-thirds and one-third,
and those two are the things that are going to add up to 1.
And in the first case, if you're in state in and you decide to quit,
then whatever S primes you can end up at,
in this case, it's just the end state,
those probabilities are going to add up to 1.
So, so more formally what that means is,
if I'm summing over s primes,
these new states that I'm going to end up at,
the transition probabilities need to add up to 1.
Okay, because they're basically probabilities that tell me
what are the- what are the things that can happen if I take an action, okay?
And then these transition probabilities are going to
be non-negative because they are probabilities.
So that's also another property, okay?
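Both properties are easy to check mechanically. Here is a small sketch (the dictionary layout is my own choice) that verifies, for the dice-game numbers, that each (state, action) distribution sums to 1 and is non-negative:

```python
# Sanity check: for each (state, action), the transition probabilities over
# all next states s' must sum to 1, and each must be non-negative.
# Numbers are from the dice game in the lecture.
T = {
    ('in', 'quit'): {'end': 1.0},
    ('in', 'stay'): {'in': 2/3, 'end': 1/3},
}

for (s, a), dist in T.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-9, f"T({s},{a},.) sums to {total}"
    assert all(p >= 0 for p in dist.values())  # probabilities are non-negative

print("all transition distributions are valid")
```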
Okay. All right.
So that was one MDP.
Let's actually formalize another problem,
and this time let's actually try to code it up.
So what is a search problem?
This is the tram problem.
So remember the tram problem.
I have blocks 1 through n. What I wanna do is I have two possible actions,
I can either walk from state S to a state S plus 1.
Or I can take the magic tram that takes me from state S to state 2S.
If I walk, that costs one minute, okay?
That means the reward is minus 1.
If I, if I take the tram that costs two minutes,
that means that the reward of that is minus 2, okay?
And then the question was how- like how do we want to travel from,
from 1 to n in the least amount of time?
So, so nothing here is, is probabilistic yet, right?
So I'm going to add an extra thing here which says
the tram is going to fail with probability 0.5.
So maybe you take a tram at some point, and that tram can
fail with probability 0.5.
If it fails, I end up in the same state; I don't go anywhere.
And in this case, we're assuming you still lose the two minutes.
So if I decide to take a tram,
I'm gonna lose two minutes,
whether it fails or not, okay? All right.
So let's try to formalize this.
So we're gonna take our tram problem from two lectures ago.
So this is from search one.
We're gonna just copy that.
So all right.
So this was what we had from last time.
You had this transportation problem and we had
all of these algorithms to solve the search problem.
You don't really need them because we have a new problem so let's just get rid of them.
And now I just want to formalize an MDP.
So, so it's a transportation MDP, okay? The initialization looks okay.
Start state looks okay.
I'm starting from 1, this end looks okay.
So the thing I'm going to change is the- first off I need to add this actions function.
Okay? So what would actions do?
It's going to return a list of actions that are our potential actions in a given state.
So I just copy pasted stuff from down there to just edit.
So it's going to return a list of valid actions.
Okay? So what are the valid actions I can take?
I can either walk or I can tram.
So I'm going to remove all these extra things that I had from before and just
keep it to be I'm either walking or I'm taking the tram, okay?
As long as it's a valid state.
So, so that looks right for actions.
The other thing we had was a successor and cost function.
So, so now we want to just change that and
return these transition probabilities and end reward.
So, so it's basically the successor probabilities and reward.
Okay? So I'm putting those
two together, similar to before we had successor and cost.
Now I'm returning probabilities and reward.
Okay? So what this function is going to
return is the new state S prime
I might end up at, the probability of that, and the reward of that.
Okay? So, so given that I'm starting in state S and I'm taking action A,
then what are the potential S primes that I can end
up at and what are the probabilities of that?
Then, what is T of S, A, S prime, and what is the reward of S, A, S prime?
I want to have a function that just returns these so I can call it later.
Okay? All right.
So I need to basically handle
each one of these actions.
First, action walk.
What happens for action walk?
What's the new state I'm going to end up at?
Well, I'm going to end up at S plus 1.
It's a deterministic action.
So I'm going to end up there with probability 1 and what's the reward of that?
Minus 1 because it's one minute cost,
so it's minus 1 reward.
Then for action tram,
we kind of do the same thing but we have two options here.
I can- I can end up in 2S. Tram doesn't fail,
I end up in 2S. The probability 0.5 that cause- that reward of that is minus
2 or the other option is I'm going to end up in
state S because I didn't go anywhere because we had probability of 0.5, the tram did fail.
And that, that- the reward of that is minus 2.
And that's pretty much it. That, that is my, my MDP.
So I can just define this for a city with let's say 10 blocks.
Oh, and we need to have the discount factor but we'll talk about that later.
Let's say it's just 1 for now, okay?
And I'm also writing this states function for later, but okay.
Does that look right? We just formalized this MDP.
So let's check if it does the right thing.
So maybe we want to know what are the actions from state three?
What are the actions from state three?
Oh, we need to remove this utility function
from before because we don't have it in the folder.
So remove that.
What, what are the actions from state three?
I have 10 blocks.
If I'm in state three,
I can either walk or tram.
Either one of them is fine, right?
So, so that did the right thing.
Maybe we want to just check if
this successor probability and the reward function does the right thing.
So maybe, maybe we can try that out for state three and walk.
So, so for state three and action walk,
then what do we get?
Well we end up in four and that is,
that is with probability 1 with the reward of minus 1.
Okay? Let's try it out for tram.
Again, remember tram can fail,
so I'm gonna get two things here.
So these are the things I'm going to get for tram,
I'm going to either end up in six with probability
0.5 with the reward of minus 2 or I will not go anywhere.
I'm still at three with probability 0.5 and that is with a reward of minus 2.
Okay? All right.
So that was just the tram problem and we formalized it as an MDP.
Again, the reason it's an MDP is,
is that the tram can fail with probability 0.5.
So we added that in,
then we defined our transition function and our problem- and our reward function.
Okay? All right, everyone happy with how we are defining MDPs?
Yeah? Okay. Pretty similar to search problems except for now
we have these probabilities, okay? All right.
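For reference, the tram MDP we just walked through can be sketched as a runnable class. The in-class code isn't reproduced verbatim here, so the method names are assumptions mirroring the lecture's description; the outputs match the ones we checked above for state three.

```python
# A runnable sketch of the tram MDP described in the lecture.
class TransportationMDP:
    def __init__(self, N):
        self.N = N  # number of blocks

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        # Valid actions: walk (s -> s+1) and tram (s -> 2s), when in range.
        result = []
        if state + 1 <= self.N:
            result.append('walk')
        if 2 * state <= self.N:
            result.append('tram')
        return result

    def succProbReward(self, state, action):
        # Returns (newState, T(s,a,s'), Reward(s,a,s')) triples.
        if action == 'walk':
            return [(state + 1, 1.0, -1)]  # deterministic, costs 1 minute
        if action == 'tram':
            return [(2 * state, 0.5, -2),  # tram works
                    (state, 0.5, -2)]      # tram fails: go nowhere, lose 2 min
        return []

mdp = TransportationMDP(N=10)
print(mdp.actions(3))                 # ['walk', 'tram']
print(mdp.succProbReward(3, 'walk'))  # [(4, 1.0, -1)]
print(mdp.succProbReward(3, 'tram'))  # [(6, 0.5, -2), (3, 0.5, -2)]
```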
So, so now I have defined an MDP, that's great.
The next question that in general we would like to answer is to give a solution, right?
So there's a question here.
So what is the Markov part of an MDP?
So the Markov part means that you just depend on the current state.
The way we define our state, remember,
our state is sufficient for us to make optimal decisions for the future.
So the Markov part means it's Markovian: the probability of
ending up in the next state depends only on the current state
and action. So yeah.
So the interesting thing we would like to do is, well,
we want to find a solution, right?
I want to figure out the optimal way to actually solve this problem.
And again if you remember search problems,
the solution to a search problem was just a sequence of actions,
a path; that's all I had as a solution.
And the reason that was a good solution was like everything was deterministic,
so I could just give you the path and then that was what you would follow.
But in the case of MDPs,
the way we are defining a solution is by using this notion of a policy.
So a policy- let me actually write that here.
So we have defined an MDP but now I want to say well,
what is a solution of an MDP?
A solution of a Markov decision process is a policy, pi of S.
So this policy goes from states:
it takes any state and it tells me
what action to take in that state.
Okay? So a policy is a function:
it's a mapping from each state S in the set of all possible states
to an action in the set of all possible actions.
of the volcano crossing, like I can have something like this.
I can be in state 1,
1 and then a policy of that state could be going south, okay?
Or I can be in state 2,
1 and a policy for that state is east.
If this were a search problem, I would just give a path.
I would just say go south, then go east, then go north, right?
So that would be my solution.
But again, if I decide that the policy at 1,
1 is to go south, there is no guarantee you'll actually end up south, right?
Because this thing is probabilistic.
So the best thing I can do is, for every state, just
tell you the best thing to do in that particular state,
and that's why we are defining a policy as opposed to giving a full path, okay?
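Concretely, a policy can be represented as a plain mapping from states to actions; for the volcano grid, a dictionary like this works (the two entries are just the examples from the slide):

```python
# A policy pi: state -> action, as a dictionary (illustrative entries only).
policy = {
    (1, 1): 'south',  # in state (1, 1), the policy says go south
    (2, 1): 'east',   # in state (2, 1), the policy says go east
}

def pi(state):
    """Look up the action the policy prescribes for a state."""
    return policy[state]

print(pi((1, 1)))  # south
```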
All right, so policy is the thing you're looking for.
And ideally, I would like to find
the best policy that would just give me the right solution.
But in order to get there,
I want to spend a little bit of time talking about how good a policy would be.
So and that's kind of this idea of evaluating a policy.
So in this middle section, I don't want to try to find a policy, I,
I just assume you give me a policy and I can evaluate it and tell you how good that is.
So, so that's the plan for the middle section, okay?
All right. Everyone happy with- so,
so far all I've done is I've defined an MDP,
which is very similar to a search problem, it's just probabilistic.
Okay? So so how would we evaluate a policy?
Okay? So if you give me
a policy which basically tells me at every state S, take some action,
then that policy is going to generate a random path, right?
I can get multiple random paths because nature
behaves differently and the world is uncertain.
So I might get a bunch of random paths and then those are all
random variables, uh, random paths, sorry.
And, and, and then for each one of those random paths,
I can, I can define a utility.
So, so what is the utility?
Utility is just going to be the sum of rewards that I'm going to get over that path.
I'm calling it the discounted sum of the rewards.
Remember that discount, we'll talk about that but,
but you can- you can discount the future.
But, but for now just assume it's just a sum of the rewards on that path, okay?
So a util- the utility that we are going to
get is also going to be a random variable, right?
Because if you think about a policy,
a policy is going to generate a bunch of random paths and
and utility is just going to be the sum of rewards [NOISE] of each one of those.
So it's a random variable.
So, so if you remember this example, right?
So I can basically have a path that tells me,
start in the state in, then stay, and then the game ends.
Right? So so this is one random path,
and for this particular random path, well,
what is the utility I'm gonna get? I'm just gonna get $4.
That's one possible thing that can happen.
If my, um, policy is to, let's say, stay,
like, there is no reason for the game to end right here.
Right? Like I can have a lot of different types of random paths.
I can have a situation where I'm staying three times and then
after that the game ends, and the utility of that is 12.
We can have this situation where we have stay, stay, and end.
In that situation,
you had a utility of eight, and so on.
So, so you're getting all these utilities for all these random paths.
So, so these utilities are also going to be just random variables.
Okay? So I can't really play around with the utility.
It's telling me something, but it's a random variable.
I can't optimize that.
So instead we need to define something that you can actually play around with it and,
and that is this idea of a value which is just an expected utility.
So, so the value of a policy,
is the expected utility of that policy.
And then that's not a random variable anymore,
that's actually like a number and I can I can compute that number.
I can compute that number for every state and and then just play around with value. Okay, next question?
For the value of the policy, does
that policy need to be defined for all possible states or a particular state?
For all possible states. So the question is, yeah,
when you say value of a policy, uh,
is the policy basically telling me,
uh, what the strategy is for all possible states?
Well, um, you're defining policy as a function of state, right?
So, and value is the same thing as a function of state.
I might ask, what is the value of being in the state in?
So the value of being in in, following the policy stay,
is going to be the value of following
policy stay from that particular state, which is the expected utility of that,
which is basically that value 12 there.
I could ask it for about any other state too.
So I can be in any other state and then say well,
what's the value of that?
And, and when we do value iteration and you actually need to
compute this value for all states to kind
of have an idea of how to get from one state to another state but [OVERLAPPING].
[inaudible] you'll be in state in and
the policy, given your state in, is taking the action stay.
Yes.
Okay.
Yeah. And that is what 12 is.
Okay? And we've kind of
seen 12 empirically, but we haven't shown how to get 12 yet.
Okay? All right.
So, um, actually let me write these in my list of things.
So we talked about the policy.
What else did we talk about?
We talked about utility.
So what is utility?
Utility, we said, is a sum of rewards.
[NOISE] So if I get, like, reward 1, then I get reward 2.
It's a discounted sum of rewards.
So I'm gonna use this gamma, which is that discount that I'll talk about in a little bit:
reward 1, plus gamma times reward 2, plus gamma squared times reward 3, and so on.
So utility is, you give me a random path and I just sum up the rewards of that.
Imagine if gamma is 1,
I'm just summing up the rewards.
If gamma is not 1, I'm looking at this discounted sum.
Okay, so, so that is utility.
But value- so this is utility,
value is just the expected utility, okay?
So you give me a bunch of random paths,
I can compute their utilities,
I can just sum them up and average them and that gives me value. Yes.
If the discount factor is 1, would that be bounded?
That's a very good question and we'll get back to that.
So, in general- okay.
If the graph is acyclic, it is fine,
but if you have a cyclic graph, you want your gamma to be less than 1.
And we'll talk about that when we get to the convergence of these algorithms.
All right, how am I doing on time? Okay. All right.
So so let's go to the, uh,
this particular volcano crossing example.
Um, so in this case, um, like I can run this game,
and every time I run it,
I'm gonna get a different utility because like I'm gonna end up in some random path,
some of them end up in the volcano,
that's pretty bad, right?
So I get different utility values,
utilities [LAUGHTER] but the value which is the expected utility is not changing really.
It's just around 3.7 which is just the average of these utilities.
So I can keep running this and getting these different utilities, but
the value is one number that I can talk about,
and that's the value of
this particular state, and that tells me what would be the best policy that I
can take and what's the best amount of utility
that I can get, in expectation, from that state.
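This "average over many random runs" idea can be sketched as a quick Monte Carlo simulation. Using the stay game from earlier as the example (stay pays $4 and, per the transition probabilities, the game ends with probability 1/3 after each round; these numbers are from the in-class example), the average utility should land near the value of 12:

```python
import random

def simulate_stay(p_end=1/3, reward=4):
    """One random path under the policy 'stay': earn $4 each round,
    and the game ends with probability 1/3 after each round."""
    total = 0
    while True:
        total += reward
        if random.random() < p_end:
            return total

random.seed(0)  # for reproducibility of this sketch
n = 100_000
value_estimate = sum(simulate_stay() for _ in range(n)) / n
print(value_estimate)  # should be close to 12
```

Each individual run gives a different utility (4, 8, 12, ...), but the average settles down to one number, the value.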
Okay? All right,
so we've been talking about this utility
I've actually written that already on the board.
So utility is going to be a discounted sum of rewards.
And then we've been talking about this discount factor.
And the ideal of the discount factor is I might
like care about the future differently from how much I care about now.
So, for example, if you give me $4 today,
and you give me $4 tomorrow,
and that $4 tomorrow has
the same value to me as it does today,
then that's kind of the same idea as having
a discount, a gamma, of 1.
So you're saving for the future; the value of things in the future is the same amount.
If you give me $4 now,
or you give me $4 10 years from now,
it's going to be $4.
I care about it like a $4 amount and I can just add things up.
But it could also be the case like you might be in a situation,
in a particular MDP,
where you don't care about the future as much.
Maybe you give me $4 10 years from now and
I don't have any value for that.
So, if that is the case and you just want to live in
the moment and you don't care about the values you're gonna get in the future,
then that's kind of the other extreme, when this
discount, gamma, is equal to 0.
So that is a situation where, if I get $4 in the future,
it doesn't have any value to me.
It's just like a 0 to me.
So I only care about right now,
living in the moment, and what is the amount I'm going to get.
And then in reality, you're somewhere in between, right?
Like, we're not in this case where we are living in the moment,
and we're also not in this case where
everything has just the same value right now as in
the future. Like, a balanced life is a setting where we have some discount factor,
it's not 0, it's not 1;
it actually discounts values in the future, because the future
maybe doesn't have the same value as now, but,
um, we still value things in
the future; $4 is still something in the future.
And that's where we pick a gamma that's between 0 and 1.
So so that is kind of a design choice like depending on what problem you're in,
you might want to choose a different gamma. Question, yeah.
So for discounting utility, is it an assessment of risk, or is there,
like, a different way we can assess how much risk you want to take?
Um, you could think of it that way, but
it's not really an assessment of risk in that sense.
It depends on the problem, right?
It depends on like in a particular problem,
I do want to get values in the future or have like some sort of
long term like goal that I want to get to and I care about the future.
Like, it depends. If you're solving
a game versus you're solving, like, I don't know,
a mobile robot manipulation problem,
you might use a very different discount factor.
For a lot of examples we'd use in this class,
we just choose a gamma that's close to 1.
Like, usually, for a lot of the
problems that we end up dealing with, gamma is something like 0.9.
That's, like, the usual,
okay, for typical problems.
Like you might have a very different problem where we don't care about the future.
So, so then we just drop it. Yes.
[inaudible] is gamma a hyperparameter that needs to be
tuned, and is a gamma of 0 the same as a greedy algorithm?
Gamma. Okay. So so that's a good question.
So is- is gamma a hyperparameter that you need to tune?
I would say gamma is a design choice.
It's not a hyperparameter necessarily in the sense that,
oh, if I pick the right gamma, that will do the right thing.
You want to pick a gamma that kind of works well with your problem statement.
Um, and, and gamma of 0 is kind of greedy,
like you are picking like what is the best thing right now and I just don't
care about the future ever. Question right there.
Does gamma violate the Markov property, because
it's kind of a memory of what you've saved?
It doesn't violate the Markov property.
It's just a discount on the reward.
It's not about how this state affects the next state.
It basically affects how much reward you're
going to get, or how much you value reward in the future.
It's still a Markov decision process.
[inaudible] and make your possible actions [inaudible]?
Yeah, it's affecting the reward,
but it's Markov because if I'm in state s and I take action a,
I'm gonna end up in some s prime, and that doesn't depend on gamma.
Okay. All right.
So. Okay. So, so in this section we've been talking
about this idea of someone comes in and gives me the policy.
So the policy is pi and what I want to do is,
I want to figure out what's the value of that policy,
and again value is just the expected utility.
Okay? So V pi of s is the expected utility
received by following this policy pi from state s. Okay?
So, so I'm not doing anything fancy.
I'm not even trying to figure out what pi is.
All I want to do is, I want to just evaluate.
If you tell me this is pi,
how good is that? What's the value of that?
Okay? So, so that's what a value function is.
So value of a policy is, is V pi of s. Okay?
That's expected utility of starting in some state, um,
let me put this here and then I'm going to move these up.
[NOISE] Um, yeah, yeah so V pi is,
is the value- the expected utility of me starting in some state S. Okay.
And state S has value V pi of S. And if someone tells me that,
well you're following policy pi,
then I already know from state S,
the action I'm going to take is pi of S. So that's very clear.
So I'll take pi of S. And if I take pi of S, I'm going to end up in some chance node.
Okay. And that chance node is, is a state action node.
It's going to be S and the action- I've decided the action is pi of S. Okay.
And I'll define this new function,
this Q function, Q pi of S,
a, which is just the expected utility from the chance node.
Okay. So, so we've talked about value,
value is expected utility from my actual states.
I'm going to talk about Q values as expected utilities from the chance nodes.
So after you've committed that you,
you have taken action a, and,
and you're following policy pi.
Then, what is the expected utility from that point on, okay.
And well what is the expected utility from this point on?
We are in a chance node,
so many things can happen, because
nature is going to play and roll its die,
and anything can happen.
And there's going to be a transition probability, T of S, a,
S-prime, and with that transition probability,
I'm going to end up in a new state.
And I'm going to call it S-prime,
and the value of that state- again,
expected utility of that state is V pi of S-prime, okay. All right.
So, okay.
So what are these actually equal to?
So I've just defined value as expected utility,
Q value as expected utility from a chance node,
what, what are they actually equal to?
Okay. So I'm going to write
a recurrence that we are going to use for the rest of the class.
So pay attention for five seconds. There is a question there.
I understand semantically how Q pi and V pi are different, but in, like, actual numbers,
like expected value, how are they different?
So they're both expected utilities.
Yeah. So one is just a function of the
state; for the other one, you've committed to one action.
And the reason I'm defining both of them
is that writing my recurrence is going to be a little bit easier,
because I have these state-action nodes, and I can talk about them.
And I can talk about how I get branching from these state-action nodes, okay?
All right. So I'm going to write a recurrence.
It's not hard, but it's kind of the basis of the next like N lectures,
so pay attention. So alright.
So V pi of S, what is that equal to?
Well, that is going to be equal to 0,
if I'm in an end state.
So if IsEnd of S is true,
then the expected utility is just equal to 0. That's an easy case.
Otherwise- well, I take action pi of S. Someone told me,
follow policy pi. So the value is just equal to Q, right?
So, so in this case, V pi of S,
if someone comes and gives me policy pi,
is just equal to Q pi of S, pi of S.
Okay. These two are just equal to each other.
So the next question one might ask is-
actually let me write this a little closer so I'll have some space.
Yeah. So this is equal to Q pi of S, pi of S, okay.
So, so what is that equal to?
What is Q pi of S, a equal to?
So this is V pi S. So now,
I just want to know what is Q value,
Q pi of S, a. What is that equal to?
Okay. So if I'm right here then there are
a bunch of different things that can happen, right?
And I can end up in these different S-prime.
So if I'm looking for the expected utility then I'm looking for
the probability of me ending up in this state times the utility of this state,
plus the probability of me ending up in a new state times the utility of that.
So that is just equal to the sum, over all possible S-primes that I can end up
at, of the transition probability of S, a, S-prime,
times the immediate reward that I'm going to get,
reward of S, a, S-prime, plus the value here.
But I care about the discounted value.
So I'm going to add gamma V pi of S-prime,
because I've been talking about this, this next state.
Okay. There's this, does everyone see this?
Okay. So this is the recurrence that we are doing in policy evaluation.
Again, remember someone came and gave me policy pi.
So I'm writing this policy pi here.
Someone gave me policy pi,
I just want to know how good policy pi is.
I can do that by computing V pi.
What is V pi equal to?
Someone told me you're following policy pi,
so it's gotta be equal to just Q pi.
What is Q pi equal to?
It's just the expectation over
all the places that I can end up at: the sum over S-primes of the
transition probabilities of ending up in S-prime,
times the total reward you're getting, which is the immediate reward,
plus the discounted value of my future, okay. Yes.
So for the Q value, you're then following policy pi starting from S-prime?
Yes. Yeah, yeah, starting from S-prime.
All right. So okay. So far so good.
So so that is how I can evaluate this policy, right?
So, so I have these two recurrences- if I have these two recurrences,
I can just replace this guy here,
and let's imagine we're in the case- maybe I can use a different color up here.
Um, I'm just replacing this guy right here.
I don't know if it's worth writing it.
Imagine we're not in an end state.
If you're not in an end state then V pi of S,
well, what is that equal to?
That is just equal to the sum, over S-primes, of the transition probabilities of S,
pi of S, S-prime,
times the immediate reward that I'm going to get,
plus gamma times V pi of S-prime.
Okay. So this is kind of a recurrence that I have.
I, I literally just combined these two,
and wrote it in green, okay,
if you're not in an end state.
So if you're not in an end state,
this is the recurrence I have.
I have V pi here,
I have V pi on this side too.
So that is nice.
And that is kind of the key relation here.
I can compute V pi.
Maybe I can solve it iteratively, or maybe I can
actually find a closed-form solution for some problems,
but that is basically what I'm going to do.
I have V pi as a function that depends on V pi of S-prime.
And I can just solve for this V pi.
Okay. It allows me to evaluate policy pi.
I haven't figured out a new policy.
All I have done is evaluate what's the value of pi, okay. All right.
Okay, so let's go back to this example.
So let's say that someone comes in and tells me
well the policy you gotta follow is, is to stay.
So my policy is, is to stay.
Okay. I want to know- I want to just evaluate that,
I want to do policy evaluation.
When you're doing policy evaluation,
you gotta compute that V pi for all states.
So let's start with V pi of end,
oh that is equal to 0,
because we know V pi at end state is just equal to 0.
Now, I want to know what's V pi of in, under the policy stay.
What is that equal to? That's just equal to Q pi of in and stay, right?
V pi of in is just equal to Q pi of in and stay.
So I'm going to replace that,
that's just equal to one-third,
times the immediate reward, which is 4,
plus the value of the next state I'm going to end up at,
which is end in this case,
plus two-thirds, times the immediate reward I'm going to get,
which is $4, plus the value of the state I'm going to end up at, which is in.
Okay. So, so that is just that sum that we have there, right?
V pi of end is 0,
so let me just put that 0 there.
I'm going to put 0 there.
I only have one interesting state here, right?
So I just have this one unknown, V pi of in.
So having an equation,
I can find the closed form solution of V pi of in.
I'm just going to move things around a little bit.
And then I will find out that V pi of in is just equal to 12.
So, so that's how you get that 12 that I've been talking about.
So, so you just found out that if you tell me the policy to follow is stay,
if that is the policy,
then the value of that policy from state in is equal to 12.
Do you always choose the same, or- so you're always choosing to stay?
Yeah. So, so the policy is a function of state.
I only have this one state that's interesting here, right?
That, that one state is in.
So I need to- when,
when I defined my policy,
I need to kind of choose the same action for that state every time, right?
My policy says, in in, you've got to either stay or you've got to quit.
Okay. All right.
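The "moving things around" in the closed-form solution is worth spelling out: the recurrence V = (1/3)(4 + 0) + (2/3)(4 + V) rearranges to (1/3)V = 4, so V = 12. A tiny check with exact fractions (a sketch; the variable names are just for illustration):

```python
from fractions import Fraction

p_continue = Fraction(2, 3)  # probability "stay" keeps the game going
r = 4                        # reward per "stay"
# V = (1/3)(r + 0) + (2/3)(r + V)
#   => V - (2/3)V = r  =>  (1/3)V = r  =>  V = r / (1/3)
V = r / (1 - p_continue)
print(V)  # -> 12
```

Using `Fraction` keeps the arithmetic exact, so we get 12 on the nose rather than a floating-point approximation.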
So you can basically do the same thing using an iterative algorithm too.
So here, like in the previous example,
it was kind of simple.
I just solved for the closed-form solution.
But in reality, you might have more states,
and then the computation might be a little bit more complicated.
So we can actually have an iterative algorithm that allows us to find these V pis.
So the way we do that is,
we start with the values for all states to be equal to zero.
And, and this zero that I- I've put here,
is the first iteration.
So, so I'm going to count my iterations here.
So, so I'm going to just initialize
all the values for all states to just be equal to zero.
Okay. Then I'm just going to iterate for some number of times,
whatever number I'd like.
Then, what I'm going to do is, for every state- again,
remember the value needs to be computed for every state.
So for every state,
I'm going to update my value by the same equation that I have on the board, okay?
And the same equation depends on the value at the previous time step.
So this is just an iterative algorithm that allows me to
compute new values based on previous values that I've had.
And I started like everything zero and then I keep
updating values of all states and I keep going, okay?
So it's basically that equation, but think of it as an iterative update:
you run this for multiple rounds,
and every round you just update your value.
Okay. So like here, is just pictorially you're looking at it,
imagine you have like,
five states here, you initialize all of them to be equal to 0.
The first round, you're going to get some value you're going to update it.
And then you're going to keep running this and then eventually,
you can kind of see that the last two columns are kind of
close to each other and you have converged to the true value.
So, so again, someone comes and gives you the policy,
you start with values equal to 0 for all the states,
and then you just update it based on your previous value.
Okay. So how long should we run this?
Well, we have a heuristic to- to kind of
figure out how long we should run this particular algorithm.
Uh, one thing you can do is you can kind of keep track of
the difference between your value at the previous time step versus this time step.
So if the difference is below some threshold, you can kind of
call it done and say,
well, I've found the right values.
And then in this case,
we are basically looking at the difference between value at
iteration T versus value at iteration T minus 1.
And then we are taking the max of that over all possible states,
because I want the values to be close for all states. Okay. Yes.
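Putting the update rule and the stopping heuristic together, here is a minimal sketch of iterative policy evaluation. The function signature and the dice-game numbers (stay pays $4 and ends with probability 1/3) are illustrative, not the course's actual code:

```python
def policy_evaluation(states, pi, transitions, gamma=1.0, eps=1e-6, max_iters=1000):
    """Iterative policy evaluation (a sketch).
    transitions(s, a) -> list of (s_prime, prob, reward); empty for end states.
    Stops when max_s |V_t(s) - V_{t-1}(s)| < eps."""
    V = {s: 0.0 for s in states}  # initialize all values to 0
    for _ in range(max_iters):
        newV = {}
        for s in states:
            # End states have no successors, so the sum is 0; otherwise
            # apply: V(s) = sum_s' T(s, pi(s), s') [R(s, pi(s), s') + gamma V(s')]
            newV[s] = sum(p * (r + gamma * V[sp])
                          for sp, p, r in transitions(s, pi[s]))
        if max(abs(newV[s] - V[s]) for s in states) < eps:
            return newV  # converged: only the last two columns were ever stored
        V = newV
    return V

# The stay game: from "in", action "stay" pays $4 and ends with prob 1/3.
def T(s, a):
    if s == "end":
        return []
    return [("end", 1/3, 4), ("in", 2/3, 4)]

V = policy_evaluation(["in", "end"], {"in": "stay", "end": None}, T)
print(round(V["in"]))  # -> 12
```

Note that, as in the lecture, only the previous and current value dictionaries are kept, and the policy is fixed, so no loop over actions is needed.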
[inaudible]
Is this- so I'm going to talk about convergence when we talk about
the gamma factor, the discount factor, and acyclicity.
Um, also, how long you should run this to get these values
is a difficult problem, and it depends on the properties of your MDP.
So if you have an ergodic MDP, this should work.
Okay, but in general,
it's a hard problem to answer for general Markov decision processes.
Okay. And another thing to notice here is,
I'm not storing that whole table.
Like the only thing I'm storing,
is- is the last two columns of this table because- because
that's V pi at iteration T and V pi at iteration T minus 1.
Those are like, the only things I'm storing,
because that allows me to compute and if I've converged then that kind of
allows me to keep going because I only need
my previous values to update my new values, right.
In terms of complexity,
well this is going to take order of T times S times S prime. Well, why is that?
Because I'm iterating over T time steps,
and I'm iterating over all my states and I'm summing over all S primes, right.
So that's the complexity you get.
And one thing to notice here
is that it doesn't depend on actions, right? It doesn't depend on the size of the action set.
And the reason it doesn't depend on the size of actions as you have given me the policy,
you are telling me follow this policy.
So if you've given me the policy then I don't really need to worry about, like,
the number of actions I have.
Okay. All right.
Um, here is just another like the same example that we have seen.
So at iteration T equal to 1, in
is going to get 4, end is going to get 0,
and at iteration 2, in gets a slightly better value.
And then finally, like at iteration, like,
100 let's say, we get the value 12.
And then remember, for this particular example,
we were able to solve it in closed form,
the value of the policy stay from
state in, but,
uh, you could also run the iterative algorithm
and get the same value of 12. Okay. Yes.
Number of actions is just the size of S prime, right?
The number of, uh, actions is the size of S prime?
Uh, no, because you might end up in very different states.
That depends on your probabilities.
Oh, okay.
The size of S prime is actually the same as the size of the set of states, right?
Like, worst-case scenario,
you're going from every state to every state.
So just imagine the size of S. [NOISE]
Okay. All right.
So summary so far where are we?
So we have talked about MDPs.
These are graphs with states and chance nodes
and transition probabilities and- and rewards.
And we have talked about a policy as the solution to an MDP,
which is this function that takes a state and gives us an action.
Okay. We talked about value of a policy.
So value of a policy is the expected utility of- of that policy.
So if you have, like, utility: we have
these random utilities for all these random paths that you're going to get for every policy.
The value is just an expectation over all those random variables.
And so far we have talked about this idea of policy evaluation,
which is just an iterative algorithm to compute the value of each state:
if you give me some policy, like,
how good is that policy, and what's the value I'm going to get at every state.
Okay. All right.
So- okay, that has been all assuming you'd give me the policy.
Now, the thing I want to spend a little bit of time on
is- is figuring out how to find that policy.
Uh, is it possible that the available actions for a problem change, and that is
going to change the value of the policies?
We learn new actions.
So for example here, we only have stay or quit.
Uh-huh.
If you have a different problem where you can learn another action, like,
stay, quit, or something else, uh, um, a third one.
Is it going to change the value of the policies? Because then we
have a new action and then we need to update our policies.
So in this case, so far I'm assuming that the set of actions is fixed.
I am not, like, adding new actions.
Like, the way- even with search problems, like,
the way we defined search problems or the way we are defining MDPs is, I'm saying,
I'm starting with a set of states that is fixed,
actions are fixed, I have stay and quit.
Those are, like, the only actions I can take.
Uh, the reward is fixed, uh,
transition probabilities are fixed. Under that scenario,
what is the best policy I can
take? And the best policy is just from that set of, like,
already defined actions.
Okay. Um, next lecture we will talk about unknown settings,
like when we have transition probabilities that are not known or
reward functions that are not known, and how we go about learning them.
And, and that- that will be the reinforcement learning lecture.
So next lecture I might address some of those.
Okay. All right, so let's talk about value iteration.
So, so that was policy evaluation.
So like, that whole thing was policy evaluation.
So now, what I would like to do is I want to try to get
the maximum expected utility and find
the set of policies that gets me the maximum expected utility, okay?
So to do that I'm going to define this thing that's called an optimal value.
So instead of value of a particular policy,
I just want V opt of S,
which is the maximum value attained by any policy.
So, so you might have a bunch of different policies,
I just want that policy that maximizes the value.
Okay. So and that is V opt of S. Okay.
So, um, so let me go back to this- this example.
So I'm going to have this in parallel to this example of policy evaluation,
I want to do value iteration.
Okay. So I'm going to start from state S again,
state S has value V opt of S. Okay.
That is what I'd like to find; before, I had V pi of S. If I'm looking for V opt of S,
then I can have multiple actions
that can come out of here and I don't know which one to take,
but like, any of them- if I take any of them,
if I take this guy,
that takes me to a chance node of SA.
Okay. And then I'm looking for Q opt of SA.
And from here, it's actually pretty similar to what we had right here.
So I'm in a chance node, anything can happen, right?
Nature plays and with some transition probability of SA,
S prime I'm going to end up in
some new state S prime and I care about V opt of that S prime.
Okay. So if I'm looking for this optimal policy which comes from this optimal value,
then I need to find V opt.
And if I want to find V opt well,
that depends on what action I'm taking here.
But let's say, I take one of these.
And if I take one of these I end up in a chance node,
I have Q opt SA in that chance node.
And then from that point on with whatever probabilities I can end up in some S prime.
Okay. So I want to write the recurrence for
this guy similar to the recurrence that we wrote here.
It's going to be actually very similar.
So- okay, so I'm going to start with Q because that is easier.
So what is Q opt of SA that- that just seems very similar to this previous case.
What is that equal to? What was Q pi?
Q pi was just sum of transition probabilities times rewards,
right. So, so what is Q opt?
[inaudible].
Yeah. So, so it would just be basically this equation
except for I'm going to replace V pi with V opt.
So, so from Q opt,
I can end up anywhere like based on the transition probabilities.
So I'm going to sum up over S primes and all possible places that I can end up at.
I'm going to get an immediate reward which is RSA S-prime.
And I'm going to discount the future but the value of the future is V opt of S-prime.
Okay. So, so far so good that's Q opt.
How about V opt. What is that equal to?
Well, it's going to be equal to 0 if you are in an end state, similar to before.
So if IsEnd of S is true, then it is 0.
Otherwise, I have- I have a bunch of options here, right.
I can take any of these actions and I can get any Q opt.
So which one should I pick?
Which Q opt should I pick?
The one that maximizes, right?
Like, um, actually I should pick the action, from the set
of actions of that state, that maximizes Q opt.
So, so the only thing that has changed here is before someone told me what the policy is,
I just took the Q of that.
Here I'm just picking the maximum value of
Q and that actually tells me what action to pick.
So what is the optimal policy?
What should be the optimal policy?
Hmm? I'm going to call it pi opt of S. What is that equal to?
It's gotta be the- the thing that maximizes V, right.
Which is the thing that maximizes this- this- this Q.
So because that gives me the action.
So it's going to be the argmax of Q opt of S and A.
Where A is an action of S. Okay?
All right, so this was policy evaluation.
Someone gave me the policy.
With that policy I was able to compute V,
I was able to compute Q,
I was able to write this recurrence,
then I had an iterative algorithm to compute things.
This one is called value iteration.
This is to find a policy. How do I do that?
Well, I have a recurrence for the
optimal value that I can get: it's going to be the
maximum, over all possible actions I can take, of
the Q values, and the Q values are similar to before.
So I have this recurrence now, and the optimal policy is just an argmax of Q. Yeah.
It looks like there could be two argmaxes, right?
Sorry? What?
For the argmax, like, what if there are two A's?
Oh, yes. You could get two A's. So the question is,
yeah, like, what if I have two A's that give me the same thing?
I can return any of them. It depends on your implementation of max.
So you can return any of them.
How am I doing on time?
[NOISE] We are five minutes over, if you want.
[LAUGHTER] So good news is the slides are the same things that I have on the board.
So Q_opt is just equal to the sum that we've talked about. For V_opt, I just add the max on top of Q_opt, same story, okay? And then if I want the policy,
then I just do the argmax of Q_opt, and that gives me the policy.
Right. I can have and again an iterative algorithm that does the same thing.
It's actually quite similar to the iterative algorithm for policy evaluation.
I just start setting everything to equal to 0.
I iterate for some number of times.
I go over all possible states.
And then, I just update my value based on this new recurrence that has a max, okay?
So very similar to before, I just do this update.
One thing is the time complexity is going to be order of T times S times A times S prime, because now I have this max over all possible actions.
So I'm actually iterating over all possible actions, whereas in policy evaluation I didn't have A, because someone would give me the policy.
I didn't need to worry about this.
All right. So let's look at coding this up real quick.
Okay, so we have this MDP problem.
We define it, it was a Tram problem,
it was probabilistic, everything about it was great.
So now I just wanna do an algorithm section and inference section where I code up value
iteration and I can call a value iteration
on this MDP problem to get the best optimal policy.
Okay. So I'm going to call value iteration later.
All right. So we initialize,
so all the values are going to become- I might skip things to make this faster.
So we're gonna initialize all the values to
just 0, right, because all these values are gonna be 0.
So I defined a states function.
So for all of those states, the value is just going to be equal to 0.
So it's initialized with that.
Then we're just gonna iterate for some number of times.
And what we wanna do is compute new values given old values.
So it's an iterative algorithm.
We have old values, we just update new values based on them.
So what should that be equal to?
So we iterate over our states.
If you are in an end state then what is value equal to? 0, right?
If you're not in an end state,
then you're just gonna do that recurrence there.
Okay, so the new value of a state is going to be equal to the max of the Q values, okay.
So new V is just the max of Q of state and actions. Okay.
So now I need to define Q or what does Q do?
Q of state and action is just equal to that sum over S primes.
I define this successor probability and
reward function that gives me newState probability and reward.
So I'm gonna iterate over that and- and call that up here.
So given that I have a state and action I can get newState probability and reward.
What are we summing? You're summing the transition probability times the immediate reward, which is reward here, plus my discount times the old value of V at S prime, my newState.
So that is my Q, that is my V, and that's pretty much done.
We just need to check for convergence. To check for convergence,
we kind of do the same thing as before.
We check if the values of V and new V are close enough to each other that we can call it done.
I'm gonna skip these parts.
So you can basically check if V minus new V is within some threshold for all states.
And if they are, then we set V equal to new V. Then we need to read off the policy.
So policy is just argmax of Q.
So I'm gonna make this a little faster.
So the policy is just going to be, well, None if we're in an end state, and otherwise it's just going to be the argmax of our Q values.
So I'm just writing argmax here pretty much.
I'm just returning the action that maximizes the Q.
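The value iteration we just walked through can be put together as one runnable sketch. The interface names (`states`, `is_end`, `actions`, `discount`, `succ_prob_reward`) and the tram costs (walk 1, tram 2, written as negative rewards) are assumptions for illustration, not the exact code from class:

```python
class TramMDP:
    """Hypothetical tram MDP: walk goes s -> s+1 (reward -1);
    tram goes s -> 2s (reward -2) but fails with probability
    fail_prob, leaving you at s. State n is the end state."""
    def __init__(self, n=10, fail_prob=0.5):
        self.n, self.fail_prob = n, fail_prob
    def states(self):
        return list(range(1, self.n + 1))
    def is_end(self, s):
        return s == self.n
    def actions(self, s):
        acts = []
        if s + 1 <= self.n:
            acts.append('walk')
        if 2 * s <= self.n:
            acts.append('tram')
        return acts
    def discount(self):
        return 1.0
    def succ_prob_reward(self, s, a):
        # Returns a list of (newState, probability, reward) triples.
        if a == 'walk':
            return [(s + 1, 1.0, -1.0)]
        return [(2 * s, 1.0 - self.fail_prob, -2.0),
                (s, self.fail_prob, -2.0)]

def value_iteration(mdp, epsilon=1e-10):
    # Initialize all the values to 0.
    V = {s: 0.0 for s in mdp.states()}
    def Q(s, a):
        # Q(s, a) = sum over s' of T(s, a, s') * (reward + gamma * V(s'))
        return sum(prob * (reward + mdp.discount() * V[sp])
                   for sp, prob, reward in mdp.succ_prob_reward(s, a))
    while True:
        # End states get value 0; otherwise take the max over actions.
        newV = {s: 0.0 if mdp.is_end(s)
                else max(Q(s, a) for a in mdp.actions(s))
                for s in mdp.states()}
        # Converged when every state's value moved less than epsilon.
        if max(abs(newV[s] - V[s]) for s in mdp.states()) < epsilon:
            V = newV
            break
        V = newV
    # Read off the policy: the argmax of Q (ties broken arbitrarily by max).
    pi = {s: None if mdp.is_end(s)
          else max(mdp.actions(s), key=lambda a: Q(s, a))
          for s in mdp.states()}
    return V, pi
```

Under these assumed costs, a tram that always fails (`fail_prob=1.0`) makes the optimal policy walk everywhere, and `fail_prob=0.0` recovers the search-problem solution, matching the behavior described in lecture.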
And then we spent a bunch of time getting the printing working.
So let me actually get- yeah, okay.
All right, actually right here. So I'm running this function.
I'm writing out, actually these are a little shifted: the grid states, [LAUGHTER] the values, and then Pi, which is the policy. Okay. So it starts off walk, walk, walk.
Remember this is the case where we have 50% probability
of tram failing and with 50% probability of tram failing,
these are the values we are gonna get.
And the policy is still to walk until state five, and then take the tram from state five.
Okay, it's just kind of interesting because the policy of the search problem was the same thing too.
Okay, so the thing we can do is,
we can actually, let me move this a little bit forward.
We can actually define this fail probability which becomes just a variable.
So you can play around with this.
If you pick different fail probabilities you're gonna get different policies.
So for example, if you pick a fail probability that is large, then probably the policy is going to be to just walk and never take the tram, because the tram is failing all the time.
But if you set the fail probability close to 0, then this is your optimal policy, which is close to the search problem.
It's basically the solution to a search problem.
So play around with this; the code is online.
This was just value iteration, um, on this tram problem.
Okay. So I'm gonna skip this one too.
All right, so yeah.
And this is also showing how, over multiple iterations, you can kind of get to the optimal value and optimal policy using value iteration.
So in one iteration it hasn't seen it yet.
So it thinks that the optimal value is 1.85; it hasn't updated the values.
And with, I don't know, three iterations, it gets better, but it still hasn't converged. It still thinks it can't get to the other side.
And remember this is with stick probability of 10%.
But if I go to, I think, 10 iterations, then it eventually learns the best policy is to get to 20, and the value is 13.68.
And if you go to even higher iterations, after that point it's just fine-tuning.
So the values are around 13 still. So you can play around with the volcano problem.
Okay. So when does this converge? If the discount factor is less than 1, or your MDP graph is acyclic, then this is going to converge.
If the MDP graph is acyclic, that's kind of obvious: you are just doing dynamic programming over the whole thing.
So- so that's going to- that's going to converge.
If you have cycles,
you- you want your- your discounts to be less than 1.
Because if you have cycles and your discount is, let's say, 1, and let's say you are getting 0 rewards, then your values are never going to change.
You're never going to move from your state; you're always going to be stuck there.
And if you have non-zero rewards, you're going to get this unbounded reward: you keep going around because you have cycles, and it's just going to end up becoming numerically difficult.
So just a good rule of thumb is pick a Gamma that's less than one.
Then you kind of get this convergence property.
Okay, all right, so summary so far is we have MDPs.
Now we've talked about finding policies rather than paths. Policy evaluation is just a way of computing how good a policy is.
And the reason I talked about
policy evaluation is there's this other algorithm called policy
iteration which uses policy evaluation and we didn't discuss that in the class.
It's not equivalent, but you could use it in a similar manner to value iteration.
It has its pros and cons.
So policy evaluation is used in those settings.
Do not leave please.
We have more stuff to cover. [LAUGHTER]
And then we have value iteration, uh, which, uh, computes the optimal value, which is the maximum expected utility, okay?
And next time, we're going to talk about reinforcement learning,
and that's going to be awesome.
That's where we'll talk about unknown rewards. All right.
So that was MDPs, [LAUGHTER] doing inference on them and kind of defining them.
I'm going back to the last lecture just to kind
of talk about some of the stuff that we didn't cover last time, okay?
All right. So if you remember last time,
we were talking about search problems.
So big switch now.
Search problems, where we don't have probabilities,
and we talked about A-star as a way of just making things faster,
and we talked about this idea of relaxations which was,
uh, a way of finding good heuristics.
So A-star had this heuristic.
Heuristic was an estimate of future costs.
We wanted to figure out how to find these heuristics,
like, how do you go about finding these heuristics?
And one idea was just to relax everything; that allows you to come up with an easier search problem, or just an easier problem, and that helps you find what the heuristic is, okay?
So, um, [NOISE] so we talked about this idea of removing constraints,
and when you remove constraints,
then you can end up in nice situations.
Like in some settings, you have a closed-form solution.
In some other settings, you have just an easier search problem,
and you can solve that, and in some other settings,
you have like independent sub-problems.
So when you remove constraints, you have this easier problem.
You can solve that easier problem,
and that gives you a heuristic.
You're not done yet, right?
You have a heuristic.
You take that heuristic, change your costs with it, and then just run uniform cost search on your original problem.
So, solving an easier problem- you're not done when you have solved the easier problem.
It just helps you find a thing that helps with the original problem, so it's kind of a multi-step thing.
So one example of that is: if you have walls, remove all the walls, and you have an easier problem.
If you solve that easier problem,
that gives you a heuristic, and in this case,
like when you knock down these walls, that easier problem has a closed-form solution.
You don't need to do anything fancy.
You don't need to do uniform cost search, any of that.
You just compute the Manhattan distance and,
and then that gives you the heuristic.
With that heuristic, you go and solve the original problem. That was one example.
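For the knock-down-the-walls relaxation, that closed-form solution is just the Manhattan distance. A minimal sketch (the grid-coordinate representation is an assumption for illustration):

```python
def manhattan(cell, goal):
    # Closed-form heuristic for the walls-removed relaxation:
    # with no walls, the cheapest path just moves straight toward
    # the goal, one unit of cost per step.
    (r1, c1), (r2, c2) = cell, goal
    return abs(r1 - r2) + abs(c1 - c2)
```

For example, `manhattan((0, 0), (3, 4))` gives 7, and that value is used as h(s) when running uniform cost search on the original problem with walls.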
Another example is, when you remove constraints,
you have an easier search problem.
So you don't have closed-form solutions,
but you have an easier search problem.
So you might have a really difficult search problem
with a bunch of constraints that are hard to do.
Remove the constraints.
So when you remove the constraints,
you have a relaxed problem,
which is just the original problem without the constraint.
That's a search problem.
You can solve that search problem using uniform cost search or dynamic programming, and solving that allows you to find the heuristic.
Again, you're not done yet, right?
You take the heuristic, and then you go to the original problem, change the costs, and run your uniform cost search there.
And just one quick example here was, uh, when you're computing these relaxed problems, the thing you want to find is the future cost of this relaxed problem, and to do that, you have this easier search problem.
You still need to run uniform cost search or dynamic programming.
In this case, if you decide to run uniform cost search,
remember, uniform cost search computes past costs.
In this case, I really wanna compute future costs.
So you need to do a bunch of engineering to get that working.
In this particular case, you need to reverse the relaxed problem, because when you reverse it, the past costs of the reversed relaxed problem become the future costs of the relaxed problem, if that makes sense.
So the way I'm reversing this is I'm basically saying the start state is n, the end state is 1, my walk action takes me to s minus 1 instead of s plus 1, and my tram action takes me to s over 2 instead of s times 2. And the whole reason I'm doing that is that the past cost of this new problem is the future cost of the non-reversed version, because I need to use uniform cost search here, okay?
So I run my uniform cost search,
that gives me a heuristic,
and that heuristic gives me this future cost of the relaxed problem,
and everything will be great.
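One way to sketch this: run uniform cost search on the reversed relaxed tram problem starting from n; the past costs it computes are exactly the future costs h(s) of the relaxed problem. The edge costs here (walk 1, tram 2) are assumptions for illustration, not necessarily the lecture's numbers:

```python
import heapq

def relaxed_heuristic(n, walk_cost=1, tram_cost=2):
    # Uniform cost search (Dijkstra) on the REVERSED relaxed problem:
    # start state is n, walk goes s -> s - 1, tram goes s -> s // 2
    # (only when s is even, so it reverses s -> 2s). Past cost from n
    # in this reversed problem equals future cost to n in the relaxed one.
    h = {}
    frontier = [(0, n)]
    while frontier:
        cost, s = heapq.heappop(frontier)
        if s in h:
            continue  # already popped with a cheaper cost
        h[s] = cost
        if s - 1 >= 1:
            heapq.heappush(frontier, (cost + walk_cost, s - 1))
        if s % 2 == 0 and s // 2 >= 1:
            heapq.heappush(frontier, (cost + tram_cost, s // 2))
    return h  # h[s] = future cost of the relaxed problem at state s
```

For example, with n = 8 this gives h(8) = 0, h(4) = 2 (one tram), and h(1) = 5 (walk to 2, then tram twice), and those values can be plugged in as the heuristic for A-star on the original problem.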
Another example is, I can get independent subproblems to use for my heuristic.
So in this case, we have these tiles, and they technically cannot overlap.
Instead, what we are doing is allowing them to overlap.
So if we allow them to overlap, I have eight independent subproblems that I can solve.
These subproblems give me heuristics, and I can just go with them, okay?
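Once tiles may overlap, each tile's subproblem is independent and has a closed-form answer: that tile's own Manhattan distance to its goal square. Summing them gives the heuristic. A sketch, assuming (hypothetically) that a state is a dict mapping each tile to its (row, col) position:

```python
def overlap_heuristic(state, goal):
    # With the no-overlap constraint removed, each tile moves
    # independently, so its subproblem's future cost is just its own
    # Manhattan distance to its goal position. Sum over all tiles.
    return sum(abs(r - gr) + abs(c - gc)
               for tile, (r, c) in state.items()
               for (gr, gc) in [goal[tile]])
```

For instance, if only tile 1 is one square away from its goal and every other tile is in place, the heuristic is 1.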
So these were just a bunch of examples, and kind of the key idea was reducing edge costs: when we are coming up with these relaxed problems, we're reducing edge costs from infinity to some finite cost.
So when I'm getting rid of walls: before, I couldn't cross, the cost was infinity, but if I get rid of the wall, I'm making it a finite cost.
So this type of method, um, is a general framework.
So the point I wanna make is, generally,
you can talk about the relaxation of a search problem.
So if you have a search problem P, a relaxation of that search problem, which I'm going to call Prel, is going to be a problem where the cost of the relaxation for any state and action is less than or equal to the cost of that state and action.
I'll take questions afterwards. All right.
So, uh, so that is a relaxed problem, okay?
So the cool thing about that is,
if you're given a relaxed problem,
then you can pick your heuristic to be the future cost of the relaxed problem,
and that is called the relaxed heuristic, okay?
So, so this is kind of a recipe. A general framework.
Like, if someone asks you find a good heuristic,
find a relaxed problem,
future cost of the relaxed problem is a heuristic.
And the cool thing about that is it turns out that the future cost of the relaxed problem, which you are choosing as the heuristic, is also consistent. We talked about all these consistency properties, and how you want your heuristic to be consistent for the solution to be correct, and how in the world am I gonna find a consistent heuristic?
Well, here is one. Here is one way of finding consistent heuristics.
Pick your problem, make it relaxed.
Making it relaxed means picking a relaxed problem where the cost is less than or equal to the cost of the original problem, and then the future cost of that relaxed problem is going to be your heuristic, and it's going to be consistent.
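In symbols, the recipe and the claim being skipped here read:

```latex
\text{If } \mathrm{Cost}_{\mathrm{rel}}(s,a) \le \mathrm{Cost}(s,a) \text{ for all } s, a,
\text{ then } h(s) := \mathrm{FutureCost}_{\mathrm{rel}}(s) \text{ is consistent:}
\qquad
h(s) \le \mathrm{Cost}(s,a) + h(\mathrm{Succ}(s,a)), \qquad h(s_{\mathrm{end}}) = 0.
```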
So proof of that is two lines, so I'm going to skip that.
And the cool thing about this, a note about this, is that there is a trade-off here: a trade-off between efficiency and tightness.
So, sure, making things relaxed and removing constraints is kind of fun, right? We have this easier problem, you just solve it, and everything is great about it.
But there is kind of a trade-off with how tight you want your heuristic to be.
Like, you shouldn't remove too many constraints,
because if you remove too many constraints,
then your heuristic is not a good estimate of future costs.
Remember, your heuristic is supposed to be an estimate of future costs.
So if it is not a good estimate of future costs and it's not tight, then it's not that great.
So there is a balance between how much you are removing your constraints, and how that makes finding the heuristic easier, versus the fact that you want your heuristics to be tight and close to your future costs.
So don't remove everything; leave some constraints [LAUGHTER] and then solve it.
Um, and you can also do things like: if you have two heuristics that are both consistent, you can take the max of them, and the max is a little bit more restrictive.
Maybe that is closer to your future costs, and you can actually show that the max is also consistent, okay?
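Combining two consistent heuristics by taking the pointwise max keeps consistency and can only tighten the estimate. A tiny sketch (h1 and h2 here are hypothetical stand-ins for any two consistent heuristics):

```python
def max_heuristic(h1, h2):
    # If h1 and h2 are both consistent, so is their pointwise max,
    # and it is at least as close to the true future cost as either one.
    return lambda s: max(h1(s), h2(s))

# Hypothetical example: two consistent estimates of distance to state 10.
h1 = lambda s: abs(10 - s)       # e.g., from one relaxation
h2 = lambda s: (10 - s) // 2     # e.g., from a looser relaxation
h = max_heuristic(h1, h2)
```

Here `h(4)` is 6: h1 dominates h2 at that state, so the combined heuristic keeps the tighter of the two values.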
Uh, so we talked about relaxations and A-star.
One other quick thing I want to mention, because that wasn't very clear last time, is the structured perceptron.
We talked about that a little bit too,
and we talked about convergence of that.
So quick things on that.
Structured perceptron actually converges.
There was this question that, uh, if we have a path that is, let's say, walk, tram, and we end up recovering another path that is tram, walk, is that bad or is that good?
Well, turns out that the cost of both of these paths are the same thing.
So if I end up getting this path,
well that's perfectly fine too.
Right? Like that path is also optimal with the same weights.
In the example that we showed, the tram example, I don't think we were able to get a path that looked like this, because of the nature of the example.
So in general, the things to remember about the structured perceptron are: it does converge.
It converges in a way that it can recover the true Ys, but it doesn't necessarily get the exact Ws, as we saw last time, right?
Like, you might get two and four, or you might get four and eight; as long as the weights have the same relationships, that is enough. But you are going to be able to get the actual Ys, and it does converge.
So with that, um, the project conversation is going to be next time.
Do take a look at the website.
So all the information on the project is on the website.
So if you have started thinking about it,
look at the project page,
and that has something for you.
