- All right, let's get started for today.
Welcome to the second lecture of CS287.
Advanced Robotics.
One announcement today, which is that
some of you are still hoping to get into the class, but are not in the class yet.
We'll process all of that later tonight.
So by tonight, you should hear from us
whether you're in or out
and hopefully you're in.
But if you're out at least you can move
on and find another class to take.
Any logistical questions?
Great, actually can people
in the back close the door?
It'll keep it a little quieter in here.
Somebody there could close the door.
Thank you.
All right, today's topic is
Markov Decision Processes.
We'll see the framework.
Then we will look at
exact solution methods
and then we'll look at a variation
of the regular MDP formulation,
the maximum entropy formulation,
and that will be what we do today.
So, let's start with MDPs.
What is an MDP?
In an MDP, there is an agent
and the agent interacts with the world.
And through interaction with the world, the agent is supposed to achieve some goals, often encoded in what is called the reward function.
So, for example, you might have an agent
that is a cooking robot.
That robot gets to interact
with the environment
and it might get a high reward
if it makes a good meal,
a negative reward if it
messes up your kitchen badly,
and maybe really negative if
it sets your kitchen on fire.
And so the reward is the way you specify
what you want in an MDP.
And then the agent is supposed to act to maximize that reward in its environment.
The agent here in an MDP
gets to observe the state.
So, in this scenario the robot will get to see where every object is in the room. That would be the state. Its own pose, its own velocity, and so forth.
So, formally, what this looks like: an MDP consists of a set of states S. Let's take a very simple example, a gridworld with maybe nine squares. Then the set of states would be just those nine states. A set of actions: in a gridworld that would be, let's say, move north, east, south, west, so four actions. And then there's a transition model that tells you, given the current state and action and the current time, what's the distribution over next states.
Now often it will not depend on time, because often the world is stationary, but not necessarily, and we'll actually often see methods that take advantage of non-stationary aspects of the world to find a solution that's somewhat simpler to find that way.
The reward function can depend on state, action, next state, and also on time. In practice it will often only depend on the state and maybe the current action, but there's no reason it couldn't depend on all four.
Discount factor gamma.
This is a number between zero and one
and essentially it
encodes how much you care
about now versus the further future.
So, as an agent we're optimizing this sum of discounted rewards shown here. The further in the future, the higher the power on gamma, and with gamma between zero and one, the further in the future, the less you care. But how much less you care depends on gamma.
If gamma is close to one,
you still care a lot about
things far in the future.
If gamma is close to zero,
you pretty much only care
about what's happening immediately.
The goal is for the agent to find a policy, Pi star, that maximizes this quantity over here.
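Written out, the quantity being maximized, in its standard form (the slide's exact notation may differ), is the expected discounted sum of rewards:

$$\pi^* = \arg\max_\pi \; \mathbb{E}\Big[\sum_{t=0}^{H} \gamma^t\, R(s_t, a_t, s_{t+1}) \;\Big|\; \pi\Big]$$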
And so the remainder of
lecture will essentially
be about how do we find
this kind of policy.
But before we dive into that, let's look at a few examples to make this more concrete.
For example a cleaning robot.
How could that be an MDP?
Well imagine you have a room.
There's a robot in the room
that's supposed to clean.
The state would be describing
the position of the robot.
Maybe how full the vacuum
bag is that the robot has.
Maybe where the dirt is in the room
and then the transition model would be defined as, well, something along the lines of: if the robot tries to move in a certain direction, it will have a high chance of moving in that direction, but maybe some chance of not exactly moving there.
When it tries to suck
up dirt that it's above,
maybe it'll usually succeed
but not always succeed
and the reward could be something like
how clean the room is.
So, negative reward for any
dirt that's still in the room.
You've got to be a little careful, by the way, when you choose your rewards, because a naive choice of reward could be something like: whenever you suck up dirt, you get reward. But the bad thing about that is that you could just dump dirt first, suck it up, dump it again, and go in a cycle where you just keep dumping and sucking up dirt, and you don't actually achieve the task that people wanted the robot to do. You're just achieving higher reward, but on the wrong reward function, really.
So, that's a cleaning robot example.
Walking robot.
Maybe you want a robot that walks.
What could be the reward?
Maybe distance covered.
Maybe distance covered in a certain direction, if you care about that. Let's say, how far did the robot get north?
And maybe you want to penalize
for energy consumption.
So, maybe there's some
negative reward related
to how much energy you're consuming.
Then pole balancing.
It's an example you'll work
on in your homework one.
You have a pole and it's supposed to stay up, but if it's a little bit off, it starts falling unless you apply a force, a torque, in the motor that is at the bottom of the pole to get it back up.
And so the reward there could be how far you deviate from being perfectly upright: the more you deviate, the more negative. And maybe there's also a negative reward for applying torque, because the less torque you apply, the better, since you consume less energy.
A lot of games can be formulated this way.
The game of Tetris: you have to put blocks into a rectangle, and if you have a full row at the bottom, it clears.
It's a decision problem.
At any given time, there's
some probability of
which new block is going to show up.
Then it shows up.
Then you get to choose from
actions where to place it
and how much you want to
rotate it before you place it,
and then the process repeats.
The reward is how many
blocks you can place
before you fill up to the top,
at which point the game is over.
Backgammon is a game where you roll dice
and you play against an opponent,
to try to get your pieces
to move across the board
and again similar thing here.
The rules of the game
are the transition model
and then the reward is
one for winning the game,
negative one for losing the game,
zero for a tie if that exists.
Doesn't have to be in games or robotics.
MDPs are really everywhere
and that's a kind of common thread
I want to get across in this class
that the methods we are looking at
we're gonna heavily
study them for robotics,
but they're actually much
more widely applicable.
For example server management.
You run a server that's supposed to
maybe do some processing.
Maybe it gets I don't know,
it runs a few processes.
It has to query some
classification on images,
maybe some speech recognition,
and maybe has to do some
video classification.
You get requests coming in.
You have a bunch of servers; onto which server do you schedule each of the incoming requests? The reward might be associated with, well, maybe some requests are more urgent.
If you don't reply
quickly with a response,
that's a problem.
Others are less urgent.
So, the reward will still be high
even if there's a little bit of delay
before you reply and so forth.
Shortest path problems are MDPs.
It's a graph let's say,
and the transition model is deterministic.
You can go to any neighboring
state in the graph
that's your action space.
Then to make it a shortest path problem you'd say, well, maybe I get a reward of one once I'm at the destination and zero otherwise.
And then if you have discounting,
then you have an extra encouragement
to get to the destination
as quickly as possible.
Models for animals and people
are often done with MDPs.
Say you ask, okay, how do I model, I don't know, a honeybee that's living its life?
Well maybe you say well
it has some dynamics
it is constrained by and then
it likes to go to flowers,
get the nectar and make honey
and the reward is probably
related to something there
and if you have a model like that,
you can more easily explain the behaviors
that you're observing.
So, the canonical example for us will be this gridworld over here, but I hope the previous examples show that it's not really about gridworlds. Gridworlds fit really nicely on slides and allow us to get concepts across very cleanly, but again, we're gonna apply this in many, many other places.
So, in this gridworld we have an agent,
and it's a four by three grid.
So, in principle there are 12 locations
but the gray one is a place you cannot go.
So, there's 11 locations the agent can be in.
From any location the agent can choose to go north, east, south, or west, and there's an 80% chance the action succeeds. Then there is a 10% chance you veer off to the left compared to what you wanted to do, and a 10% chance you veer off to the right.
If your action leads you into the boundary of this space (it's supposed to be a wall surrounding this thing), then you stay in place.
The reward is plus one if
you're in the top square
and take the exit action.
And the reward is negative one
if you're in this square here
and you take the exit action.
Everywhere else the reward is zero.
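As a concrete sketch, here is one way this gridworld's dynamics could be written in Python; the coordinate convention, names, and layout are illustrative assumptions, not code from the course:

```python
# Illustrative sketch of the 4x3 gridworld dynamics (all names and the
# coordinate convention are assumptions for illustration, not course code).
WALLS = {(1, 1)}                               # the gray square you cannot enter
EXIT_REWARDS = {(3, 2): +1.0, (3, 1): -1.0}    # reward for the exit action
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}   # veer left of intended
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}        # veer right of intended

def move(state, direction):
    """Attempt one move; bumping into a wall or the boundary keeps you in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3):
        return state
    return nxt

def transition(state, action):
    """T(s, a, .) as a dict {next_state: prob}: 80% intended, 10% veer each way."""
    probs = {}
    for d, p in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        nxt = move(state, d)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```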
So, what you'd hope for is that this agent, if it finds the optimal policy, would be brought by that policy to the top right square and then exit from there. And we're assuming the discount factor gamma is not one, it's below one, so you want to be there sooner rather than later. You'd hope it follows a shortest path while also avoiding falling into this trap over here, because once you're here the only action available is the exit action, and the same over there.
So, once you're there you're stuck
and so going along this path
is actually slightly risky
because let's say you move up,
you have 10% chance of
moving off to the right.
You land here, you'll have
a negative one reward.
All right, so what's also
quite typical for MDPs
and is the case here is that
rewards often come at the end.
Usually most of the reward is
related to achieving goals,
not always but very often it is,
and that's the case here.
What does a policy look like? A policy is something that says, for every time step, whether it's your first time step or your second time step, what is the action you're supposed to take. And the optimal policy is the one that optimizes expected reward.
Why expected?
Because the dynamics of the environment are stochastic, and so you can't count on a very specific reward to come out. It'll be, on average, how much you get.
In contrast, let's say this environment
was actually deterministic.
Then you could just choose
a sequence of actions.
You don't need a policy if
the world is deterministic.
You just say I'm here let
me choose the sequence
of actions that gets
me where I want to be.
And I'm done.
But typically the world will be stochastic
and we need a policy and that's what
we're gonna focus on.
Any questions so far
about the formulation?
Okay, so we've covered the formulation.
Let's now start looking
at some solution methods.
First one we're gonna look
at is value iteration.
And this is one of those
things that I'm gonna do
on the board and the kind
of thing that we expect
you to be able to do on your own.
So, I encourage you to follow
along in the derivation.
Let's see.
There we go.
All right so our first
algorithm is value iteration.
It's also the one we're
gonna see the most of
throughout this lecture.
So, let's first set up some concepts for the problem we want to solve.
we want to solve this problem.
First concept we want is V i star of s,
which is the expected sum
of discounted rewards.
Is this big enough for the back?
- [Group] No.
- No, not big enough for the back?
All right, now let me clear this.
First concept is going
to be V Star i of s,
which is the expected, is this big enough?
Thanks.
Expected discounted
sum of rewards
if acting optimally
from state s for i steps.
So, we only have i steps left, and how much reward can we get if we act optimally for i steps when we get to start in state s?
Okay now, once we have this concept
let's start thinking about
how we can compute this.
The easy one is V zero star of s.
We have zero steps left.
There's no time left for us to act.
There's nothing left to happen.
So this is just gonna be zero
because that's all there is.
There's nothing.
And this is gonna be
true for all states s.
So, now the question is
can we find a recursion,
where we find V one from V zero V two,
from V one and so forth.
So let's see.
What would V i plus one star of s be? Well, by definition, it's the expected discounted sum of rewards if acting optimally from state s for i plus one steps.
So, well what happens?
We get to take our first step,
and after that there will be i steps left.
And so we're gonna try
to break it down into
what happens in our first step,
what happens afterwards?
What happens in our first step is, well, we choose an action. We want to choose the best action, but for now just say we choose an action, and after we choose that action, a transition happens.
We have a transition from state s, taking action a, into state s prime with some probability T(s, a, s').
Then what happens at that moment?
We get a reward for what just happened
from that one time step.
What happens after that
is we have i steps left,
and we're in state s prime.
So, when there's i steps
left in state s prime,
we're gonna to get another V i star,
assuming we do optimal
behavior then onwards,
V i star s prime.
But remember we're doing discounting.
Something that happens
later is worth gamma less
than something happened earlier.
So, this is just a reward,
but then here what happens later
gets discounted by a factor gamma.
And so what we have here is now
actually an update equation,
assuming we pick the optimal action a.
So, let's parse this again.
We're trying to compute the
optimal expected sum of rewards
from state s with i plus one steps left.
We're breaking down
into what happens first,
which is we get a reward for
the immediate transition.
Now it could be many transitions,
many possible next states s prime,
so we're summing over all
of the possible next states
s prime averaging the reward that we get,
and then after that we have i steps left,
and so that's the
recursion kicking in here,
and we have a discount
factor in front of it,
'cos one step later is worth gamma less.
We have gamma times V i star s prime.
S prime, the same s prime we have here,
averaged over all the possible
states s prime we might visit.
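For reference, the update we just built on the board, written out as an equation:

$$V^*_0(s) = 0, \qquad V^*_{i+1}(s) = \max_a \sum_{s'} T(s, a, s')\,\big[R(s, a, s') + \gamma\, V^*_i(s')\big]$$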
So, how do we run value iteration? We just put a loop around this. We just say: initialize, for all s, V zero star of s equals zero. Then for i equals zero, one, two, up to H minus one, for all s, apply the update.
And that will give us
all the optimal values
V star from zero steps left
all the way to capital H
time steps left to act.
Now the question you might have,
this gives us values and
we've been talking about
we want to find the optimal policy,
optimal strategy to
behave in an environment.
How do we get that?
So, if you look at this equation, it's already computing the best action. It computes this thing for all actions, right? So, you'll do this multiple times, once for each action, and compute this quantity.
You'll have a table of values
and then you check which
one is the highest.
That's the one you pick.
Well the action that achieves
that is the optimal action.
And so to find the policy, we just need to
change the max here into an arg max
and we'll have Pi star i plus one of s
equals arg max over a of the
exact same quantity here.
And then this update here
is called a value update sometimes
or a Bellman update
or a Bellman backup,
after Richard Bellman, who
originated this set of equations.
This is probably the
most important equation
for today's lecture and next lecture.
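Here is a minimal sketch of value iteration with the arg max policy extraction, assuming the MDP is stored in plain dictionaries; the container conventions are my own illustration, not the course's:

```python
def value_iteration(states, actions, T, R, gamma, H):
    """T[s][a] is a dict {s2: prob}; R(s, a, s2) is the reward; actions(s)
    lists the actions available in state s. Runs H Bellman backups."""
    V = {s: 0.0 for s in states}              # V_0*(s) = 0 for all s
    pi = {}
    for _ in range(H):
        V_new = {}
        for s in states:
            # Expected reward-to-go for each action, then max / arg max.
            q = {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[s][a].items())
                 for a in actions(s)}
            best = max(q, key=q.get)
            V_new[s], pi[s] = q[best], best   # max for the value, arg max for the policy
        V = V_new
    return V, pi
```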
So, let me pause here and make sure
all questions are resolved around this.
Yes.
- [Student] What is s prime?
- Good question, what is s prime?
So, we're currently in state s.
So, s is the current state.
When we take an action
in the current state,
the dynamics of the
world will transition us
into a next state.
And s prime is the variable name we use for the state at the next time step.
So, another way I could've written this is as s at time t, let's say, and I could write this one as s t plus one: s t, the action at time t, s t plus one, and so forth. But often just s prime is used, because we don't want to pin it to particular times.
Any other questions?
Yes.
- [Student] Is T the probability of, taking that action in that state, going to the next state?
- So, good question.
So, T here is equal to the probability
of landing in state s prime,
given you are in state s
and took action A.
And so I'll use two notations interchangeably at times. Sometimes we'll use T(s, a, s'), where T refers to the transition model, and sometimes we'll use the explicit conditional notation P(s' | s, a).
(classroom ambient noises)
So, the things we'll do on the board
will also be in the slides that we share,
but I think it's good to work
through them more explicitly
step-by-step than just
flashing them in front of you,
and I think it's good to
take notes as we go along.
Let's look at the example gridworld.
What happens if we run value iteration?
Well, here we have run one
update of value iteration.
Remember this world.
When you exit, when you're
in one of these states,
the only action available is exit.
So, one step available
left to you in the life
of your agent, that agent
will exit from there,
get the reward of
negative one or plus one,
depending on which state it was in
and that's the value.
That's all it'll get the rest of its life.
For the other states, the
agent actually has options.
It can choose from four actions,
but no matter what
action the agent chooses,
with only one time step left,
there will never be
any reward encountered.
So, the value is zero,
the optimal value is zero
for all other states.
When there is one step to go.
In this case, the discount
factor gamma is 0.9,
which will allow you to
follow along with the math.
Now let's think about what happens
when we have two time steps left.
Well let's think about it.
What will happen down here?
With two steps you still
can't get any nonzero reward.
It will stay zero,
but up here close to
where the rewards are,
where the nonzero values
are for V one star,
there will be opportunity for V two star to be nonzero, because value propagates from V one to V two, V three, and so forth.
So, here's what happens.
Let's parse this.
Why is there a 0.72 up there?
Where does that expected value of 0.72 come from?
Well, when it moves to the right, it has an 80% chance of success. So, that's 0.8, times: at the next time it'll be able to exit and get a plus one reward, but the next time step is discounted by 0.9. So, it's 0.8 probability of success of getting there, times 0.9, which is 0.72.
How come this one is zero
and nothing negative,
even though it's next to this one?
Because it's finding the optimal values
and the optimal values will not put you
into the negative state.
What's actually happening here is that it takes the optimal action, which is going this way, which 80% of the time keeps it in place, 10% of the time it will go down, and 10% of the time it'll go up. And that prevents it from ever landing here, so it has a zero probability of landing here,
and so, at the next time
if it had landed here,
it would get a negative one
which would be negative 0.9
with discounting, but that never happens.
It actually ends up over there, but with only one step left, there's not enough time to get that reward.
Now this is V two star.
How about V three star?
We expect again it's
going to propagate out
because V three star's
computed from V two star.
You just have one extra step, and we see now that you actually have positive value here. Why? Because you can now actually make it up there to the plus one.
And so what happens here
is you want to go up,
yes there's some chance
that you get a negative here
by landing here but that's
a small probability.
There's a much higher chance you end up here
and from there you can actually
very likely make it there
and get a good reward
and so the math works out
that going up is the right thing to do
to maximize your expected value.
After four iterations,
these are the values.
One thing you might wonder is well
I have a value here now of 0.66.
Is this the final value?
This is V star with four time steps left. Is V star with four time steps left the same as V star with, let's say, 100 time steps left?
You're nodding no.
Why not?
- [Student] Because you would,
you can go farther with 100 iterations.
- Yeah so with 100
iterations you can go farther
and how can this help you in this world?
Let's think about this carefully.
If it's a deterministic world
and you run value iteration,
this will just spread out cleanly,
and you'll immediately
have the final value
for something like this,
but because this is a stochastic world, when you're here, you only have some probability of getting there for now. You still have probabilities of getting stuck along the way, and the more time you have, the more time you have to make up for that and still make it to the exit.
And so as we keep iterating,
you'll see that those values will actually
keep going up over time.
So, here it's now 0.72
and also it's spreading out further away
and after 100 iterations, we're at this.
Now, the discount factor gamma is 0.9. What that means is that effectively it's only looking about 10 time steps into the future, maybe just a little more than 10 time steps; the effective horizon is roughly one over one minus gamma. So, at 100 iterations you've pretty much saturated what's still gonna happen, and it's converged at this level of accuracy. If you ran it out to thousands, it'd be exactly the same.
Any questions about this example?
Yes.
(muffled speaking)
Okay so yeah, good question. So, look at the cell next to the plus one reward. Initially it's zero, of course, because we don't have enough time steps. Then here we have an 80% chance, by taking the right action, to land in the plus one cell and then exit, which gives us 0.72.
But we have a 10% chance to
move up against the wall,
which means stay in place,
and a 10% chance of moving down, which moves us down.
That probability mass, if we have more time steps left, we can still guide back into the target. So when we have three time steps and we have moved up and stayed in place, we actually still have enough time to try again to go off to the right, and that's what happens here. And that's why the value goes up.
Yes.
(muffled speaking)
Absolutely, that's a very good question.
Why is this zero? It's zero because, well, let's first think about why it might not be zero. You might think: this thing tries to move up, but sometimes it's unlucky and lands here with 10% chance, and then with discounting that would be 0.9 times 0.1 times negative one, so a negative 0.09 there. And that would be the case if your policy is to move up here, but actually the optimal policy here moves this way, bumps into the wall, and never ends up over there with the negative 0.09.
(muffled speaking)
So yeah that's a good question.
The way this visualization is done is that it's actually visualizing a one-step look-ahead policy. So, it's essentially showing the policy as if you had one more step than the value function that's shown.
We haven't really covered that yet
but we'll cover it in the future.
Okay, so it turns out there's a theorem
that says the value iteration
is guaranteed to converge.
And at convergence, we will
have found the optimal value
function V star for the discounted
infinite horizon problem
which satisfies the Bellman
equations shown here.
Let's look at some intuition
behind why this is converging.
I mean in some sense,
it makes sense for us that
we are getting optimal values
for i steps left, i plus one steps left,
but why will this reach
some steady-state value
and not kind of keep changing forever?
Well, first let's see what we can conclude from this if it holds.
That means if we have infinite horizon,
we can just run this till convergence,
and we'll have V star for infinite horizon
and we can use the optimal
policy against that Pi star,
and we don't need to store
a table for every time step.
We can just have a single
policy for all time steps.
Of course this assumes the dynamics
and the reward don't change over time.
So, this what we can do and
this is the one step look ahead
I was talking about.
Once you have an optimal value function
or any value function you
prefer to act against,
this is the one-step
look-ahead operation you can do
to find the optimal action.
So, we're all set in that regard.
And we only need to store either V Star
and do this look ahead,
and every time step or we can just do this
as we compute V star we're
computing already the max.
We can store the arg max
and just store our optimal
policy in doing so.
Let's look at some intuition
behind the convergence.
Actually, first, are there any questions about the theorem statement?
Essentially it's just saying that if you run value iteration long enough, at some point the values stop changing. They've converged, and they satisfy that equation at that point.
Okay, so what's the intuition? V star of s is the expected sum of rewards accumulated starting from state s, acting optimally for infinitely many steps. V star H of s is the expected sum of rewards accumulated starting from state s, acting optimally for H steps.
So, when we say it converges, what it really means is that if we make this H big enough, at some point H and H plus one are the same: V star H and V star H plus one are essentially the same, and that's V star of s.
So, we can think about it. Well, we computed V star H of s. What additional reward could we be collecting if we got to act longer?
Because that's really
gonna make the difference
between the optimal value function
and the one that we
have for only horizon H.
So, well, we're gonna get discount factor gamma to the power H plus one, times the reward at time H plus one, then similarly for time H plus two, and so forth. This quantity is smaller than what we'd get if we replaced the rewards we actually collect with the max reward available in the MDP. So let's assume there's some max: you can never get more than a certain max reward, for example plus one in the MDP we saw. Now, this summation in the back here is a geometric series, and can be bounded by gamma to the H plus one, times R max, over one minus gamma.
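Written out, the chain of reasoning is:

$$V^*(s) - V^*_H(s) \;\le\; \sum_{t=H+1}^{\infty} \gamma^t\, R_{\max} \;=\; \frac{\gamma^{H+1}}{1-\gamma}\, R_{\max} \;\longrightarrow\; 0 \quad\text{as } H \to \infty$$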
So, what this shows is that
if we have more time,
the best extra reward we
could get is this much.
Now if you look at this quantity,
and how it depends on H
as H goes to infinity,
this quantity goes to zero.
So, if my H is already very large,
the extra I could have gotten from getting
to act longer is near zero.
If what I could have gotten extra from acting longer goes to zero, then that means the difference between V star H and V star goes to zero, because V star H is the best I can get with H time steps, and V star is what I can get with infinitely many time steps. If there's essentially nothing left to gain from having more time steps, then they're going to be the same.
And so this is a kind of fairly intuitive proof that value iteration will converge to the infinite horizon value function. What I put up here assumes the rewards are always positive. We can go through the same reasoning when there are negative rewards, or both negative and positive; the R max then would really mean the max absolute value of reward you can get anywhere, and the same reasoning goes through.
All right, this is kind of a
very intuitive proof sketch.
Let's look at another angle from which people often approach the proof that value iteration converges. So, we'll define something called a max norm. The reason we see two versions, I mean two proofs of the same thing, is that in different contexts, a different intuition might be more useful.
So, the max norm is the max absolute value of any entry in your vector. So, if you had a value function, or an attempted value function u that you hope is the real value function, then it would be a vector with entries for every state, and the max norm of that vector would be the maximum absolute value of all entries in that vector.
Then there's a notion called contractions
and specifically a gamma
contraction in max norm.
So, what are we thinking about here? We have an operation, the value update operation. We do one iteration of value iteration: we had a vector of values, and now we have a new vector of values. That's the update operation. For any update operation you can ask: is it a contraction or not? And so we're gonna define what a contraction is. A contraction is this notion that
if no matter what two vectors you have,
let's say you run value iteration,
but you take two
different initializations.
You don't initialize with zero.
You just take two completely
different initializations.
Ui and Vi. Then you do one update, and you might wonder: do these things move closer together? A contraction is exactly defined that way. If you have two things and you apply an update to them, do they move closer together? Are they a factor gamma closer together after the update?
This is just the definition
of a gamma contraction.
This is not any property
of value iteration yet,
but it's just the definition.
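Written out, the two definitions just given are:

$$\|U\|_\infty = \max_s |U(s)|, \qquad \|F(U_i) - F(V_i)\|_\infty \le \gamma\, \|U_i - V_i\|_\infty \;\text{ for all } U_i, V_i,$$

where F denotes the update operation.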
Now the property that's important, though, this theorem, is that a contraction converges to a unique fixed point, no matter the initialization.
We're not gonna prove that here. The proof is probably one page of work that you can try on your own, but it's generally true: once you show something is a contraction, you know that if you keep applying that operation, it'll get to a fixed point, a unique fixed point. Another fact, and again we're not gonna prove this, but it's not too hard to prove:
Value iteration updates
form a gamma contraction
in the max norm,
and the intuition here is fairly simple
in that imagine you have
two vectors, Ui and Vi
and they undergo a value iteration update.
Where are Ui and Vi used?
They're only used with a gamma in front of them. So, the effect of the original Ui and Vi gets downscaled by a factor gamma, and the reward term in front is shared anyway across both. That's essentially how you prove that a value iteration update is a gamma contraction in max norm.
Corollary, value iteration converges
to a unique fixed point.
An additional fact is that once your updates are changing the value function very little, you can do a bit of math to show that you're then also close to the optimal value function V star.
And this is actually very related.
Showing that is very
similar to showing that
contractions converge
to a unique fixed point.
You essentially look at two vectors that come after each other in the update, move terms from one side to the other, and it's very similar to the way you show convergence of a contraction. And it allows us to have a stopping criterion: when we run value iteration, you say, I want my values up to a certain level of accuracy.
Well, you just check: what's the biggest change in any of my values among my states? Then I call that epsilon, plug it into this, and I know I'm within that much of the optimal value function.
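As a sketch, the stopping criterion looks like this in code, reusing the illustrative dictionary conventions from the value iteration sketch earlier:

```python
def value_iteration_to_accuracy(states, actions, T, R, gamma, eps=1e-6):
    """Iterate until the max-norm change is below eps; the bound above then
    guarantees ||V - V*||_inf <= 2 * eps * gamma / (1 - gamma)."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                            for s2, p in T[s][a].items())
                        for a in actions(s))
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:  # biggest change
            return V_new
        V = V_new
```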
All right, yes.
(muffled speaking)
So, in this case the top row U is just any vector, because we're just defining the norm of a vector.
Then here Ui and Vi are again
vectors of the same size,
and then an update operation
gets applied to them,
and that gives us Ui
plus one and Vi plus one.
So, Ui gets updated to Ui plus one,
Vi gets updated to Vi plus one.
And we're talking about
that update operation
being a contraction if and
only if this thing holds true.
Yes.
(muffled speaking)
For this to be true, yes: to be a contraction, gamma has to be smaller than one, correct, otherwise things will not shrink together. If gamma equals one, it's called a non-expansion, and then, I guess, bigger than one it can expand, so there's not much we can do.
Yes.
- [Student] Does the additional fact imply anything about the rate of convergence of value iteration, like whether it's exponential, et cetera?
- Essentially, the rate of convergence is the contraction rate gamma. So, you will be able to define the speed at which you go to V star: essentially, as you power up gamma, that's what gives you the rate at which you converge.
(muffled speaking)
The additional fact, this
thing here we didn't prove.
That's left for you to try
out if you want to try it out.
You're absolutely right.
You can't just stare at it and
just be like this is obvious.
It turns out it's pretty complicated, but the starting point is going to be something along the lines of adding and subtracting V star here, then using the triangle inequality and playing around with a few things, and at some point you can get this out. But yeah, you're not necessarily expected to be able to do this; otherwise we would cover it in class.
(muffled speaking)
How is this useful?
This is useful in the
sense that you now know
if your update changes your value function
only a little bit, let's say epsilon,
then you know that you are two epsilon gamma over one minus gamma close to V star.
So, it's giving you a
guarantee that once your value
function doesn't change much anymore,
you are close to the optimal
value function V star.
Okay, let's look at a couple
of examples of problems
and how the parameters
affect the optimal solutions.
So, what we're gonna look at here is a gridworld of this structure, where you get negative 10 when you land over here (once you're there, the only option you have is to exit), plus 10 over here, and then plus one over there.
So, there's a nearby small reward, a faraway higher reward, and then there's also this danger zone at the bottom. And so the question is: how can we get different policies out?
get different policies out?
So, how can we define this problem
by choosing different discount factors
and different noise levels on the actions,
so noise equals 0.5 means that
50% chance you succeed with your action,
and then 25% chance you
veer off to the left,
25% chance you veer off to the right.
Gamma is still the discount factor.
Noise equals zero means you
always succeed with your action.
Exactly the action you
choose is what you get.
You choose north it goes north,
and so a question is here eight, sorry,
eight bullet points.
We have four scenarios: prefer the close exit, risking the cliff, so this path; prefer the close exit but avoiding the cliff, along the top to the close exit; prefer the distant exit, risking the cliff, this way; or prefer the distant exit avoiding the cliff, along the top to the distant exit.
And the question for you is: which of these choices of MDP parameters maps to which of those optimal solutions? So, I'll give you a couple of minutes to talk with your neighbor and make up your mind on how the left column matches up with the right column.
(crowd talking)
All right let's see what you came up with.
So, for a, prefer the close exit,
meaning the plus-one, risking the cliff.
So, the bottom path up there.
Who thinks one?
Nobody thinks one.
Who thinks two?
Nobody.
Who thinks three?
Nobody.
Who thinks four?
Oh wow overwhelming four.
Everybody raised their hand on four.
So, why four?
(muffled speaking)
So the answer was you have
a very harsh discount,
so you want to get your reward
sooner rather than later,
and you have no noise,
so it's not that risky
to go all along the cliff.
So, you just follow the shortest path.
Great.
How about b, prefer the close
exit but avoiding the cliff?
Who thinks one?
Who thinks two?
Three? Four?
Shouldn't pick four anymore at this point.
(all laughing)
Most people said one.
You wanna say why?
(muffled speaking)
So Luca said harsh discount factor here,
which makes us prefer the close one, but we can't risk the cliff, because there's a lot of noise, a much higher chance of falling into it if we go that way.
Great, how about the third one? Prefer the distant exit, risking the cliff?
One, two, a lot of people two.
Three, four?
Okay so risking the cliff.
Same story here.
We pick two, why?
Because noise is zero
so risking is not really that risky.
In fact you know you're
not gonna fall into it
and then you might as well
take the shortest path.
Gamma is 0.99.
So, you're willing to spend the extra time
to get to the later reward
because it's 10 times bigger.
You get discounted only
by 0.99 to the power
two or three or something.
Then that leaves us, for d, with the only option left, which is three, and it's the same story here. We're willing to go farther because the discount is close to one, there are only a couple of extra steps needed, and it's a factor 10 more reward; and there is noise, so we want to take the longer path.
Great, so we've covered value iteration,
which is the main method
we wanted to cover.
Let's take a two or
three minute break here
and then in the second half of lecture,
we'll go through the remainder
of what's on that slide.
(crowd talking)
- [Crew Member] So, I'm trying to follow you, but I kind of have to stop where the screen comes down. The issue is, when I point the camera at the screen, it's like it's on auto-exposure; it kind of gets blown out. So, for people following the video, I mean, it's up to you, but rather than pointing at the actual screen, maybe use your trackpad to point at something on the screen; the video will pick that up.
- [Instructor] I see that makes sense.
- [Crew Member] Yeah.
- You're saying essentially
up to here is fine.
Okay got it.
- Because you're getting blown out by the additional light coming in.
- Got it, thanks.
(muffled speaking)
All right, let's restart.
Any questions about the
first half of lecture?
All right,
we've seen the foundation,
most of the foundation
for today's lecture.
So, we can actually go a little faster
on some of the next pieces.
The first thing we'll look at is policy evaluation. So, we're still trying to solve for the optimal way to act in an MDP, but we're gonna find a new algorithm to do it. And why do we need more than one algorithm? Well, sometimes in the future, when we're solving problems that we cannot solve exactly and we need approximate solutions, in some cases it'll be easier to build on one method, and in other cases on the other.
So, policy evaluation is the notion that, given a policy, you want to know how much reward you expect to get when using that policy.
And so remember value iteration
has this update equation here.
It's against the optimal policy.
So, what if we fix the policy?
Well that's actually very easy.
All we need to do is take away the max over there. We get rid of the max, and we can evaluate the current policy. So, we just replace the max with, well, the policy Pi prescribes an action in state s, Pi of s; we fill that in there, and there's no max anymore.
Other than that this is
the exact same equation,
and so actually the
exact same ideas apply.
Will policy evaluation converge? Yes, because it's like value iteration, just in an MDP where in each state only one action is available. And so right away we have a proof of convergence right there. At convergence, policy evaluation will satisfy this equation, which is the Bellman equation just without the max, because, well, the policy prescribes the action in each state. We have no choice.
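As a sketch, in code the only change from value iteration is that the max over actions is replaced by the policy's prescribed action (same illustrative conventions as the earlier sketches):

```python
def policy_evaluation(states, pi, T, R, gamma, eps=1e-6):
    """Evaluate a fixed deterministic policy pi[s] by repeated backups."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                        for s2, p in T[s][pi[s]].items())  # no max: action is pi[s]
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```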
Now this is just evaluation
and with just evaluation
we can't necessarily
find the optimal policy yet,
but let's pause on this for a moment
and see what happens if we generalize
this to a stochastic policy. Okay, so this was all deterministic.
How about stochastic?
We now have a policy mu, where mu of a given s is the probability of taking action a in state s, and we have three proposed policy evaluation update equations here.
So, which policy
evaluation update equation
do you think is the correct one?
One, two, or three?
I'll give you a moment to stare at this.
When doing policy evaluation, the policy is mu: the probability of action a in state s is mu of a given s.
Okay let's see.
Who thinks one?
Who thinks two?
Most people think two.
Who thinks three?
Bunch of people think three.
Most people thought two.
Let's think through this.
Let's look at two which was the majority.
In two, what do we see?
We have a sum over next states,
which we expect to see,
because that's what's in
these update equations.
Then we have a sum over actions. Why is there a sum over actions here? Because there's no fixed action; we don't use a specific action, we use a distribution over actions. And so mu of a given s is the probability of that action in state s, and then after that, essentially, the transition probability of landing in state s prime, given state s and action a.
So, indeed equation
two is a generalization
from a fixed action to a
distribution over actions
for the update equation.
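Written out, the correct update (option two) is:

$$V^\mu_{k+1}(s) = \sum_a \mu(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma\, V^\mu_k(s')\big]$$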
All right that was a little aside
to make sure you're
familiar with the notation.
By the way why is number three incorrect?
It also has a summation over actions, but it has a max over next states, and you don't get to choose the next state s prime. The next state s prime is a consequence of the transition model, and so you don't get that max. If you were in an MDP where you got to choose the next state s prime, then equation three could potentially be your update equation, but you don't get to choose the next state s prime, so you don't get the max over it.
What is policy iteration?
In policy iteration, we
alternate between policy
evaluation and policy
updates, policy improvement.
Policy evaluation is what we
saw on the previous slide.
So, this part over here.
Then policy improvement is the equation we see at the bottom.
What is that doing?
That's something with a max again.
It's saying: if I get to act K plus one steps from state s, I'm going to do a one-step look-ahead. I'm in state s; initially I get some reward for the transition that happens, and after that I get discount factor gamma times the value I will get from then onwards for using the policy that I had, Pi K.
So, I have a current policy Pi K, for which I computed the values. And I'm gonna say: I'm assuming I'll use that policy Pi K for the last K steps that I get to act, but for the very first step, I get to override the action and choose the best one, and then continue with Pi K.
That's the policy update.
It's kind of interesting, because it's somewhat non-intuitive that this is guaranteed to be the right thing to do, but here's the intuition for why this is gonna be a better policy.
The reason this is gonna be a better policy is, well, let's first think about the following notion. Let's say we use Pi K for the last K steps and just use Pi K plus one for the very first step. Then definitely Pi K plus one followed by Pi K is better, because, well, we choose the first action based on the max over expected value, rather than just sticking to Pi K. So, it's definitely better that way.
Now we can actually repeat this reasoning.
You can say, sorry, I phrased this slightly wrong: we'll use Pi K for infinitely many steps afterwards. Not just K steps; afterwards we'll use Pi K for infinitely many steps. Okay, so it's better to first use Pi K plus one, followed by Pi K forever, than just Pi K at all times.
After I do that, I'm at the next time step, where I had committed to using Pi K. But what if I again say: from the next step onwards I'm committed to Pi K, but right now I still give myself the choice to take the best action possible, assuming I'll use Pi K from then onwards? That again will be at least as good. And I can repeat this thought process at every time step, and ultimately what that means is that I'm using this Pi K plus one, this new policy, at every time step, and I've shown that's better than using Pi K at every time step.
So, this is indeed a
policy improvement step.
Keep in mind what the K index here means; just to be clear, I had a small lapse there, so I want to make sure to clarify it: K here indexes the iteration of the algorithm. The algorithm iterates: for the current policy Pi K, find the value of that policy, which will take infinitely many policy evaluation updates, or run until convergence. After we've done that, we do a policy improvement update, and the index K goes up. And then we repeat.
Okay, so here's what happens: we repeat this until the policy converges, and it converges to the optimal policy. It actually converges faster, under some conditions, than value iteration.
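Putting the two steps together, here is a minimal sketch of policy iteration, building on the policy_evaluation sketch above (conventions are illustrative):

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy (arg max) policy improvement."""
    pi = {s: actions(s)[0] for s in states}     # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, T, R, gamma)
        pi_new = {}
        for s in states:
            # One-step look-ahead against V^{pi_k}, then take the arg max.
            q = {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[s][a].items())
                 for a in actions(s)}
            pi_new[s] = max(q, key=q.get)
        if pi_new == pi:     # policy unchanged: Bellman equation is satisfied
            return pi, V
        pi = pi_new
```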
Now one quick thing that I
want to note here is that,
for now we just looked at
modified Bellman updates
to evaluate a policy.
But actually, if you look at this at the end, when it's converged, this is an equation that's satisfied, and that equation is linear in V Pi.
So, we can actually just solve
a linear system of equations instead.
So, when you run policy iteration,
when you do policy evaluation,
you can either solve
linear system of equations,
to find your V Pi for
your current policy Pi,
or you can just run policy evaluation
iteratively like we did
for value iteration,
and find your V Pi that way.
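A sketch of the linear-system route, using numpy and integer state indices (an illustrative assumption): at convergence V^pi = R^pi + gamma T^pi V^pi, so we can solve (I - gamma T^pi) V^pi = R^pi directly.

```python
import numpy as np

def policy_evaluation_exact(n_states, pi, T, R, gamma):
    """Solve (I - gamma * T_pi) V = R_pi directly instead of iterating.
    States are 0..n_states-1; T[s][a] maps next state -> probability."""
    T_pi = np.zeros((n_states, n_states))
    R_pi = np.zeros(n_states)
    for s in range(n_states):
        for s2, p in T[s][pi[s]].items():
            T_pi[s, s2] = p
            R_pi[s] += p * R(s, pi[s], s2)
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
```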
Now theorem says that policy iteration
is guaranteed to converge
and at convergence
the current policy and its value function
are the optimal policy and
optimal value function.
Why is that?
Let's start with the second bullet point
in the proof sketch here.
It's optimal at convergence, why?
By definition of convergence,
Pi K plus one equals Pi K.
What that means is that when you look at the Bellman update here, the value iteration update for Pi K, and we take the max there, the max action is the action already prescribed by Pi K.
So, what we have is that the policy Pi K is already prescribing the best action. Nothing changes; that's what happens at convergence. So, just by looking at the equation, we know that when the action the policy prescribes doesn't change anymore, it's actually satisfying the Bellman equation, so we're optimal and we're done.
Now the question you might ask is well
are we guaranteed to ever have
this equation be satisfied
and that nothing changes anymore?
And the intuition behind
that is actually very simple.
We discussed on the previous slide
that Pi K plus one will
be better than Pi K
or at least as good.
Well, if every time we have a new policy it's better than the previous one, and there's only a finite number of policies to choose from, then at some point we're out of options, because we can't cycle around when things always have to improve. That's guaranteed.
So, we must converge at some point,
and then we know when we converge,
that we're actually at the optimal point
from the second bullet point.
All right, this might be a lot to wrap your head around if you haven't seen any of this before, but I encourage you to slowly read through this again at your own pace and see what you think.
Next, we are gonna actually
skip over linear programming.
We'll do that at the end.
And we're gonna look at the
maximum entropy formulation.
So let's look at a bigger gridworld.
And we're gonna watch value
iteration happening here.
We'll see contours, well, values on the left; red is high value. The exit state is at the bottom right. The black squares are obstacles, and then on the right, what you see is a bunch of robots starting from various initial states, so you can kind of see what the policy is.
And so as we run this, actually, hold on, let me pause this for a moment. As we run value iteration, what will happen is that at some point we will find an optimal strategy: for every starting point, there'll be an optimal strategy, a very specific strategy we follow to get to the goal. If now somebody changes the world and puts a new obstacle in, the strategy we found might not work anymore, and we only found one strategy, the optimal one. And that might be okay: you only need one strategy if you know the world's never gonna change, and it's always exactly what you expect it to be.
What if the world's not exactly what you expect it to be? That can be the case if your MDP is not perfectly matched up with the real world. Well, now what can happen is that our one policy will fail, and you have nothing at that point. And so the question could be: well, is there any way to solve for a distribution over solutions?
rather than a single solution.
Not sure if there's only
a single optimal one,
these other ones will be near optimal.
They will not be the optimal one
but can we find that?
Because if we find that,
then if something small
changes in the world,
we're already ready to deal with that.
Okay, so let's think about this for a moment: how are we gonna encourage, somehow, that we don't just find the optimum? We're gonna use the notion of entropy.
So, what is entropy?
Entropy is this quantity shown over here, a mathematical quantity. What it measures is the uncertainty of a random variable x, a random variable x with distribution P of x. How much uncertainty do you have over x when I tell you you're gonna get a sample from x, but I don't tell you yet what I have sampled? That's what's measured by the entropy.
That's what's measured in the entropy.
So, we can have a high entropy
for our optimal policy,
which then would be a
positive non-deterministic
at this point will be a distribution
over actions and outputs.
If we have high entropy in our policy, then we'll have a distribution over actions, a distribution over paths we might follow. And of course we will not just want high entropy; we will also want high reward. We'll trade off between the two.
Now, entropy is a very information-theoretically justified way of measuring the uncertainty of a random variable x. For example, if you need to encode samples from a distribution, then, information-theoretically speaking, for a specific sample xi you would use a number of bits equal to log base two of one over p of xi, and that's the optimal way to encode. The information theory aspects don't really matter too much to us, but we want to have an intuition for what entropy is.
So, for a distribution where
x could be one or zero,
and there's a probability
associated with being one,
we can plot the entropy as a
function of that probability.
The horizontal axis is the probability that x equals one. When the probability of x equals one is zero, you're always zero, so there's zero entropy; there is no uncertainty. When you're always one, there is also no uncertainty. But when your probability is 0.5, you have the most uncertainty; the entropy is the highest.
Here's another example: two distributions where x can take on five values. Which one has higher entropy? Well, which one gives you more uncertainty about what x is going to be? For the one on the right, you know it's usually gonna be that first value, so that's lower entropy, whereas the one on the left is spread out fairly evenly, so it's gonna be higher entropy.
And you can actually compute this. You can work through the math for those specific values, take the sum of P log one over P, and get the values; indeed, the one on the left has higher entropy than the one on the right.
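As a quick check, here is a small sketch computing entropy for two made-up five-value distributions; the numbers are my own illustration, not the slide's:

```python
import numpy as np

def entropy(p):
    """H(X) = sum_x p(x) * log2(1 / p(x)), skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log2(1.0 / nz)))

print(entropy([0.2, 0.2, 0.2, 0.2, 0.2]))       # spread out: log2(5) ~ 2.32 bits
print(entropy([0.8, 0.05, 0.05, 0.05, 0.05]))   # peaked: ~ 1.12 bits
```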
So, now that we have this notion of entropy, which quantifies uncertainty, we can see a new formulation of how to solve MDPs, namely the maximum entropy formulation. We used to solve for a policy that maximizes the expected sum of rewards.
Now we're gonna try to solve for a policy
that not just maximizes
expected sum of rewards
but also maximizes entropy of the policy.
And beta here is a coefficient: if you make beta very high, then you care a lot about entropy. If you make beta equal to zero, you get back to the original case, and if you make beta equal to infinity, then all you care about is entropy, and you'll just ignore the reward.
And in between is where
the interesting regime is
where you'll have a trade-off
between entropy and reward,
and you'll get a range
of strategies emerging
of how to solve the problem
rather than a single strategy.
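One common way to write the objective being described (the slide's exact form may differ) is:

$$\max_\pi \; \mathbb{E}\Big[\sum_t \gamma^t \big(R(s_t, a_t) + \beta\, \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]$$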
So, can we do something
like value iteration
for the max entropy version?
Well let's see.
Actually, first we need a little intermezzo on constrained optimization. So, let me write something on the board here for constrained optimization; we'll need that to be able to solve the problem we're gonna solve. Let's see. We'll see a lot more constrained optimization in future lectures.
Today we'll just do something very simple
but a concept that we already need today.
So, constraint optimization
is where you have a problem.
We try to find a variable x
or maybe a vector-valued variable x,
that maximizes some function f of x,
subject to some constraint
g of x equals zero.
It's a very specific case we're looking at here, but this is the one we care about. How are we gonna solve this? Well, the concept people use a lot is called a Lagrangian, and we'll solve this problem by solving this other problem instead: max over x still, then min over a new variable lambda, of the Lagrangian of x and lambda,
of the Lagrangian of x and lambda,
which is equal to, let me write the max over x, and the min over lambda, of f of x plus lambda times g of x.
So, why is this solving
the original problem?
Let's think about it.
Let's say for this max-min problem, we choose an x where g of x is not equal to zero. Well, after I choose my x where g of x is not equal to zero, the minimizer gets to choose lambda, and it will exploit that: if g of x is positive, it'll choose lambda very negative, essentially negative infinity; if g of x is negative, it'll choose lambda positive infinity, to drive this to as negative a value as possible.
So, in this problem we're forced to pick an x that has g of x equal to zero, which we're also forced to do here. You don't know yet why this is useful, but at this point I'm just trying to clarify why this is equivalent.
This is equivalent because we still can't pick an x where g of x is not equal to zero. And once we pick an x where g of x is equal to zero, well, then this term disappears, and we're just maximizing f of x. So, we have a problem equivalent to the original one.
You might say this looks more complicated. Maybe it does, but actually it doesn't, because the way we can now find a solution is by saying: at the solution, the derivative of the Lagrangian with respect to x will be equal to zero, because if it's not, I should move x in the direction that makes this thing higher, since I'm maximizing with respect to x.
And the derivative of the
Lagrangian with respect to lambda
should also be equal to zero,
because if it's not equal to zero
then I should move it in the direction
that makes this thing lower
because we're minimizing
with respect to lambda.
So, we know that these
are properties satisfied
by the solution.
And so often it'll be easier
to write out those two equations
when we try to solve something
and see if we can read off
the solution to the problem
from those two equations,
and we'll see that in a moment.
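Summarizing the board in one line:

$$\max_x f(x) \;\text{ s.t. }\; g(x) = 0 \;\Longleftrightarrow\; \max_x \min_\lambda \; \mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x), \qquad \frac{\partial \mathcal{L}}{\partial x} = 0, \quad \frac{\partial \mathcal{L}}{\partial \lambda} = g(x) = 0$$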
We can do that for maximum
entropy formulations
and then from there, we have a solution.
Okay so that was a very brief thing
on constrained optimization.
We will now use that to
solve the MaxEnt problem.
And this will be the second
most important derivation
for this class.
(scrubber scraping)
All right, so instead of
deriving the actual MaxEnt
value iteration, I'm gonna
derive a simpler version,
where we can see all
the insights necessary
to find a solution and we
can just reuse that template
to do the actual MaxEnt value iteration.
So, we're gonna look at: max over Pi of a, a distribution over actions, of the expected value of the reward we get from taking action a, plus beta times the entropy of our distribution over actions Pi of a. So, it's a one-step problem rather than a many-step problem, but just like value iteration is a one-step thing that you repeat many times, the same will be true here.
Okay, so this is the same as: max over Pi. The expected value is sum over, sorry, sum over a of Pi of a times r of a. Then plus, but entropy has this negative sign in it, so it's gonna be minus beta times sum over a of Pi of a log Pi of a. And I know entropy uses a log base two instead of a regular log, but I'm just gonna ignore that, because the base of the log is just a multiplicative factor, and so we'll just assume that lives inside beta and not worry about it.
Well, and there is a constraint: sum over a of Pi of a equals one. There's also another constraint, that the Pi of a should be positive, but we'll see that when we solve it: even without that constraint, the solution will still come out with them positive, so we don't have to worry about it.
So, let's write out the Lagrangian. What's our Lagrangian? We're gonna do max over Pi of a, min over lambda, of the Lagrangian, which is sum over a of Pi of a times r of a, minus beta times sum over a of Pi of a log Pi of a, plus a lambda term for the constraint. What is our g of x here? Well, this thing equals one is the same as this thing minus one equals zero. So, plus lambda times (sum over a of Pi of a, minus one) shows up here.
Now we take our derivatives. Let's do the derivative of the Lagrangian with respect to Pi of a; I'm gonna want to set this equal to zero. What do we end up with? Well, the first term gives us r of a. The entropy term gives us minus beta times, from this factor, log Pi of a, and, from the derivative of the log, Pi of a times one over Pi of a, so that's a plus one. And then here, plus lambda.
I want this equal to zero. If we rewrite it, we want beta log Pi of a equal to r of a minus beta plus lambda, or Pi of a equal to e to the power of one over beta times (r of a minus beta plus lambda).
So, that's what we find for our Pi of a. We see that essentially the reward gets exponentiated: the more reward, the more likely we take a certain action. And what's really happening with the minus beta plus lambda is that it's just a scale factor and doesn't depend on a. So, we can just as well rewrite this as: Pi of a equals one over Z, which is the scaling, times e to the one over beta times r of a.
Here Z is something that makes sure things sum to one. And actually, if we take the derivative of the Lagrangian with respect to lambda, which is sum over a of Pi of a minus one equals zero, we see that we exactly get back out the notion that things should sum to one.
What we see here is that the solution has a really nice form. If beta were equal to one, let's simplify it: we just have e to the reward as our probability, up to some normalization.
So, that's really nice.
Then what we can do is
we can actually take
that solution for pi of a
and put it into here,
and see what is the actual
optimal value that we achieve
in the MaxEnt environment.
So, this is a derivation
that was on the board.
We're gonna plug the
expression for pi of a
into the maximization
up there, the objective.
It's a bit of math you have to work through, but it's really straightforward simplification-type math; there are no new concepts there. If you work through all the simplifications, what you get is that the objective value at the optimum equals beta times the log of the sum over a of exp of one over beta times r of a: a log-sum-exp.
Let's again imagine beta equals one. Then what we have here is essentially a soft way of computing a maximum. Instead of putting all the probability mass on the one action that achieves the highest reward, we take a softmax in the policy, and we get this log-sum-exp of the values as the actual value. It's a very simple update to do. We did a bunch of math, but the optimum is very simple: if you want to know the optimal value for this thing, it's just the log-sum-exp of the rewards.
If beta is closer to zero, when you don't care much about entropy, then the exponents will grow and become more extreme, and whichever action had the highest reward will have grown the most, will be even bigger relative to the others, will have more and more probability associated with it, and will dominate this value.
When beta goes to zero, you don't care about entropy; that's the beta-to-zero limit. When beta goes to infinity, the opposite way, you don't care about reward anymore: you'll get the uniform distribution, and all you get is the entropy of the uniform distribution as your value for the problem.
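Here is a minimal numerical sketch of the one-step solution just derived, with the standard max-subtraction for numerical stability; the example rewards are my own illustration:

```python
import numpy as np

def maxent_one_step(r, beta):
    """Softmax policy pi(a) = exp(r(a)/beta) / Z and value beta * logsumexp(r/beta)."""
    z = r / beta
    shift = z.max()                            # subtract the max for stability
    w = np.exp(z - shift)
    pi = w / w.sum()                           # pi(a) proportional to e^{r(a)/beta}
    value = beta * (shift + np.log(w.sum()))   # beta * log sum_a exp(r(a)/beta)
    return pi, value

r = np.array([1.0, 0.5, 0.0])                  # illustrative rewards, one per action
for beta in [0.1, 1.0, 10.0]:                  # small beta -> greedy, large -> uniform
    pi, v = maxent_one_step(r, beta)
    print(beta, pi.round(3), round(v, 3))
```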
Let's see.
Actually this is a
pretty important concept.
So, I'm gonna stop here,
and next time we'll see
how to generalize this
to full value iteration.
See you on Thursday.
