Okay, hey everyone. So welcome to the final week of the class.
What I want to do today is share with you a few generalizations of reinforcement learning and of MDPs.
So you've learned about the basic MDP formalism of states, actions,
state transition probabilities, the discount factor, and rewards.
The first thing you'll see today is two slight generalizations of
this framework: state-action rewards and finite-horizon MDPs.
They make it a little bit easier for you to model certain types of problems;
certain types of robots,
and certain types of factory automation problems, will be easier to
model with these two small generalizations.
So we'll talk about those first, and then second,
we'll talk about linear dynamical systems.
Last Wednesday, you saw fitted value iteration, which was a way
to solve an MDP even when the state space may be infinite,
even when the state space is a set of
real numbers such as R^n, so an infinite set of states.
For a continuous set of states,
we used fitted value iteration, in which we had to use a function approximator,
like linear regression,
to approximate the value function.
There's one very important special case of an MDP where,
even if the state space is infinite or continuous,
you can still compute the value function exactly,
without needing a linear function approximator or something like
linear regression in the inner loop of fitted value iteration.
You'll also see that today. When you can take a robot, or
some factory automation task, or whatever problem, and model it in this framework,
it turns out to be incredibly efficient, because you can compute
the exact value function as a function of the state without needing to approximate anything,
even though the state space is continuous.
So this is a framework that doesn't apply to all problems,
but when it does apply,
it's incredibly convenient and incredibly efficient.
You'll see that in the second half of today.
Two quick logistical things first,
based on the questions that we're getting from students.
Some students are asking us:
how is grading done in CS229?
"I did well on this,
but didn't do so well on that."
For people taking the class pass-fail,
a C minus or better is a passing grade.
I think this is standard at Stanford.
And CS229 has historically been one of the heavier-workload classes.
We know that people taking CS229- yeah, I see a few heads nodding.
[LAUGHTER]
People taking CS229 end up putting a lot of work into this class,
frankly maybe more than average even for Stanford classes.
And so we've usually been quite nice
with respect to grading, partly to acknowledge that.
So don't sweat it too much.
Do work hard on the final project,
but don't sweat it too much.
And on Wednesday after class,
I got a funny question.
After I talked about fitted value iteration,
someone came up to me and said, "Hey Andrew,
this algorithm you just taught us-
does it actually work?
Does it actually work on an autonomous helicopter?"
And the answer is yes.
The algorithms I'm teaching- fitted value iteration,
as you learned last week,
will work on an actual autonomous helicopter at low speeds.
If you fly at very high speeds, with very dynamic maneuvers-
crazy stuff, flipping upside down-
you need a bit more than that.
But for flying a helicopter at low speeds,
the exact algorithm that you learned last Wednesday,
as well as the algorithms you'll learn today, including LQR,
will actually work.
If you ever need to fly an autonomous helicopter for real,
these algorithms will work quite well at low speeds-
maybe not at very high speeds and crazy dynamic maneuvers,
but at those low speeds these algorithms,
pretty much as I'm presenting them,
will work. So, okay.
So the first generalization to the MDP framework that I want to describe is
state-action rewards.
So far, we've had the reward be a function
mapping from the states to the real numbers.
With state-action rewards-
this is a slight modification to the MDP formalism-
the reward function R is now
a function mapping from states and actions to the real numbers.
So in an MDP, you start from a state s_0,
take an action a_0,
then based on that, get to s_1, take an action a_1,
get to a state s_2, take an action a_2, and so on.
And with state-action rewards,
the total payoff is R(s_0, a_0) + gamma R(s_1, a_1) + gamma^2 R(s_2, a_2) + ...
This allows you to model that different actions may have different costs.
For example, in the little robot-wandering-around-a-maze example,
maybe it's more costly for the robot to move than to stay still.
And so, if you have an action for the robot to stay
still, the reward can be 0
for staying still and a slight negative reward for moving,
because moving uses electricity.
And so in that case, Bellman's equation becomes
V*(s) = max over a of [ R(s, a) + gamma * sum over s' of P_sa(s') V*(s') ].
You still break down the value of a state as the sum of the immediate reward plus the
expected future rewards,
but now, the immediate reward you get depends on
the action that you take in the current state, right?
So this is Bellman's equation.
And notice that previously,
the max was only over the future-reward term,
but now you need to choose the action
a that maximizes your immediate reward plus your discounted future reward,
which is why the max moved outside.
If you look at this equation,
the max had to move outside because the immediate reward you get
now depends on the action you choose at this time step as well.
So this models that different actions
may have different costs. Yeah?
[inaudible]
Uh, yes. Yes, this max applies to the entire expression, right.
[inaudible]
Let's see. So in this formulation,
the reward is determined by the state and the action-
yes, that is correct.
So in this formulation,
the reward depends on the current state and the current action,
but not on the next state you get to.
And by the way, there are multiple variations of MDP formulations,
but this is a convenient one.
This models that different actions have different costs.
And actually, for a helicopter, a common
formulation would be to say that yanking aggressively on the control stick
should be assigned a higher cost, because yanking
the stick aggressively causes your helicopter to jerk around more,
and so maybe you want to penalize that by setting
a reward function that penalizes very aggressive maneuvers.
This gives you, as the problem designer,
more flexibility.
And then finally- let me just write this on top.
In this formulation,
in order to compute the value function,
you can still use value iteration:
V(s) gets updated to the right-hand side of Bellman's equation.
So value iteration works just fine for the state-action reward formulation as well.
And if you apply value iteration until V converges to V*,
then the optimal policy is
pi*(s) = argmax over a of [ R(s, a) + gamma * sum over s' of P_sa(s') V*(s') ].
So, given a state,
you choose the action that maximizes
your immediate reward plus your expected future rewards.
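To make this concrete, here's a minimal sketch of value iteration with state-action rewards, on a tiny made-up MDP (the two states, two actions, and all the numbers here are illustrative, not from the lecture):

```python
import numpy as np

# Value iteration with state-action rewards on a made-up 2-state, 2-action MDP.
n_states, n_actions, gamma = 2, 2, 0.9

# R[s, a]: the reward now depends on the action too
# (e.g. action 1 = "move" is slightly more costly than action 0 = "stay").
R = np.array([[0.0, -0.1],
              [1.0,  0.9]])

# P[s, a, s']: state transition probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])

V = np.zeros(n_states)
for _ in range(500):                 # iterate the Bellman update to convergence
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P_sa(s') V(s')
    Q = R + gamma * (P @ V)
    V = Q.max(axis=1)                # the max over actions is outside the sum

pi_star = Q.argmax(axis=1)           # optimal action in each state
print(V, pi_star)
```

Note that the max over actions sits outside the whole bracket, exactly as in the Bellman equation above.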
Okay. Maybe another example:
if you want to use an MDP to
plan the shortest route for a robot to drive from here at Stanford
up to San Francisco,
and it costs different amounts to drive
on different road segments- because of traffic, or because of the
speed limits on different roads- then this lets you say that driving
this distance on this road costs this much, in terms of fuel consumption or in terms of
time, and so on.
Those are state-action rewards. Or in factory maintenance:
if you send in a team to maintain
a machine, that has a certain cost; if you do nothing, that has a different cost;
and if the machine then breaks down, that has yet another cost, depending on your actions.
Okay, so that's the first generalization.
The second generalization is the finite-horizon MDP.
In a finite-horizon MDP,
we're going to replace the discount factor gamma
with a horizon time T,
and we'll just forget about the discount factor.
In a finite-horizon MDP,
the MDP will run for
a finite number of T steps.
You start with state s_0,
take an action a_0, get to s_1, take action a_1,
and so on, until you get to state s_T, take an action a_T
at time step T, and then the world ends,
and we're done, right?
So the payoff is
the finite sum R(s_0, a_0) + R(s_1, a_1) + ... + R(s_T, a_T),
and there's just a full stop at the end of that.
You can also apply discounting, but usually when you have a finite-horizon MDP,
there's no need to.
And so this models a problem where there are
T time steps and then the world ends after that, right?
"The world ends" sounds a bit dire.
But if you're flying an airplane or a helicopter,
and you know you only have fuel
for 30 minutes- say an RC helicopter
with 20 or 30 minutes of fuel-
then you know that you're going to run this thing
for 30 minutes and then you're done, and so
the goal is to accumulate as much reward as possible up until you
run out of fuel and have to land, right?
So that would be an example of a finite-horizon MDP.
And the goal is to maximize this payoff,
or the expected payoff, over these T time steps, okay?
Now, one interesting property of a finite-horizon MDP
is that the action you take
may depend on what time it is on the clock, right?
So there's a clock marching from
time step 0 to time step T, whereupon
the world ends; that's all the reward the MDP can collect.
And one interesting effect of this is that
the optimal action may depend on
what the time is on the clock.
So let's say your robot is running around this maze, and there's
a small +1 reward here and a much larger +10 reward there,
and let's say your robot is here.
Then whether the optimal action is to go left or go
right depends on how much time you have left on the clock.
If you have only
two or three time steps left on the clock,
it's better to just rush and get the +1.
But if you still have
10 or 20 ticks left on the clock,
then you should go get the +10 reward, right?
And so in this example,
pi*(s) is not well-defined, because
the optimal action to take when your robot is in this state-
whether you should go left or go right-
actually depends on what time it is on the clock.
And so pi* in this example
should be written as pi*_t(s), subscripted by t,
because the optimal action
depends on what time t it is.
The technical term for this is a non-stationary policy.
Non-stationary means
it depends on the time-
it changes over time.
Whereas in contrast, up until now,
we've just written pi*(s) for the optimal policy,
and that was a stationary policy; stationary means
there's no change over time, okay?
So one thing that
I didn't quite prove, but that was implicit, was
that in the original formulation, the optimal action you take
is the same action,
no matter what time it is in the MDP.
So in the original formulation that you saw last week,
the optimal policy was stationary,
meaning that the optimal policy is the same policy
no matter what time it is;
it doesn't change over time.
Whereas in the finite-horizon MDP setting,
the optimal action changes over time, and so this is a non-stationary policy.
So stationary versus non-stationary just means:
does it change over time, or does it not?
Okay? So, right.
If you're using a non-stationary policy anyway,
you can also build an MDP with
non-stationary transition probabilities or non-stationary rewards.
So maybe here's an example.
Let's say you're driving from campus, from
Palo Alto to San Francisco, and we know that rush
hour is at- what, 5:00 PM or 6:00 PM or something, right?
And maybe the weather forecast
even says it's going to rain at 6:00 PM.
So you know that the dynamics of how you
drive your car from here to San Francisco will change over time;
the time it takes
to drive a certain segment of road
is a function of time. And if you want to build an MDP to solve for
the best way to drive from here to San Francisco,
then the state transitions are
s_{t+1} drawn from state transition probabilities P^(t)_{s_t, a_t}, indexed by the state at
time t and the action at time t. And
if these state transition probabilities change over time-
if you index them by the time t-
this would be an example of non-stationary
state transition probabilities, okay?
Alternatively,
if you want non-stationary rewards,
then you can have R^(t)(s, a)
be the reward you get for taking a certain action
in a certain state at a certain time, okay?
So all of these are different variations of MDPs,
and maybe just a few examples of when you'd want a
finite-horizon MDP,
or non-stationary state transitions.
If you're flying an airplane-
for some airplanes,
like very large commercial airplanes,
sometimes over a third of the weight of the airplane comes from the fuel, right?
So if you take off in a large commercial airplane
from SFO and fly
to- I don't know where you guys fly to-
fly to London, say.
Right, a direct flight from here to London.
By the time the plane lands,
you have a much lighter airplane than when you took off,
because maybe a third of the weight has disappeared
from burning fuel.
And so the dynamics-
how the airplane feels between takeoff and landing- is
actually different, because the weight is dramatically different.
So this would be one example of
where the state transition probabilities change in a pretty predictable way, right?
Or, as I already mentioned,
weather forecasts, or traffic forecasts if you're driving-
or if you're driving over different types of terrain over time.
You might know it's going to rain tonight, the ground will turn muddy,
and all the traffic will turn bad.
And then, industrial automation.
Some of my friends
work on industrial automation, and I think one example is:
if you run a factory 24 hours a day,
then the cost of labor-
getting people to come into the factory to do some work at noon- is actually easier
and less costly than getting someone to show up
in the factory to do some work at 3:00 AM, right?
And so, depending on
labor availability over time,
the cost of taking different actions,
and the likelihood of transitioning into
different states, can vary over the 24-hour clock as well, right?
And so these are other examples of when
you can have a non-stationary policy and non-stationary state transitions, okay?
Now, let's talk about how you would actually solve a finite-horizon MDP.
And for the sake of simplicity,
for the most part, I'm not going to bother with non-stationary transitions and rewards-
for the most part,
I'm just going to forget about
the fact that these could be varying.
I'll mention it briefly, but I want to focus on the finite-horizon aspect.
So let me define
the optimal value function:
V*_t(s) is the optimal value function for
time t, starting at state s. This is the
expected total payoff,
starting in state s at time t,
if you execute
the best possible policy, okay?
So now the optimal value function
depends on what time it is,
because- if you look at
that example with the +1 reward on the left and the +10 reward on the right-
depending on how much time you have left on the clock,
the amount of reward you can accumulate can be quite different, right?
If you have more time, you have
more time to get to the +10 reward,
in the +1-and-+10 example that I drew just now.
And so, in this setting,
value iteration becomes the following.
It actually becomes a dynamic programming algorithm,
which you'll see in a second, okay?
Which is:
V*_t(s) = max over a of
[ R(s, a) + sum over s' of P_sa(s') times V*_[?](s') ].
And actually,
this is a question for you.
There's one missing thing here, right?
We're saying that the optimal value
you can get when you start in state s at time t is the max over all actions of
the immediate reward, plus the sum over
s' of the state transition probability P_sa(s') times V* of s'.
So what should go in that box?
T plus 1. Okay, cool, awesome, great.
Right?
And then pi*_t(s) is just
the argmax over a
of the same thing-
of this whole expression up on top.
And so this formula defines V_t as a function of V_{t+1}.
So this is the iterative step, right?
Given V_10, you can compute V_9;
given V_9, you can compute V_8;
given V_8, you can compute V_7.
And so to start this off,
there's just one last thing we need to define,
which is V*_T at the final step,
when the clock is about to run out.
There, all you get to do is choose the action a
that maximizes the immediate reward,
and there's no sum after that: V*_T(s) = max over a of R(s, a).
So if you start in state s at the final time step T,
you take an action, get an immediate reward,
and then there is no next state, because the world ends right after
that step- which is why the optimal value at
time T is just the max over a of
the immediate reward; what happens after that doesn't matter, okay?
So this is a dynamic programming algorithm.
The base case lets you
compute V*_T for every state s,
and then the inductive step is:
having computed V*_t for every state s,
you can then compute V*_{t-1} using this
inductive step, then V*_{t-2}, and so on down to V*_0.
So you compute this at every state,
and then based on this,
you can also compute pi*_t, right?
Compute the optimal policy,
the non-stationary policy, for every state, as
a function of both the state and the time, okay?
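The dynamic program just described can be sketched like this, on a small randomly generated MDP (the sizes and all quantities here are illustrative):

```python
import numpy as np

# Finite-horizon dynamic programming on a small randomly generated MDP.
rng = np.random.default_rng(0)
n_states, n_actions, T = 4, 2, 10

R = rng.normal(size=(n_states, n_actions))                        # R(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P_sa(s')

V = np.zeros((T + 1, n_states))
pi = np.zeros((T + 1, n_states), dtype=int)

# Base case: at the final step, just take the best immediate reward.
V[T] = R.max(axis=1)
pi[T] = R.argmax(axis=1)

# Inductive step: compute V*_t from V*_{t+1}, working backwards to t = 0.
for t in range(T - 1, -1, -1):
    Q = R + P @ V[t + 1]
    V[t] = Q.max(axis=1)
    pi[t] = Q.argmax(axis=1)        # non-stationary: the policy depends on t

print(pi)                            # one row of optimal actions per time step
```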
All right, cool.
And again, I don't want to dwell on this,
but if you want to work with
non-stationary state transition probabilities or non-stationary rewards,
then this algorithm hardly changes:
if your rewards and state transitions are indexed by time as well,
then this is just a very small modification to the algorithm.
And it turns out that once you're using a finite-horizon
MDP, making the rewards and state transition probabilities non-stationary is
just a small tweak, right?
So- yeah, yeah?
[inaudible] Uh, can you say that again?
In which form will it disappear? The attributes [inaudible].
This one? Oh, non-stationary.
So in the end you get a policy pi* subscript t of s.
[inaudible]
I'm sorry- this one?
This one. Oh, I see, sure, yes.
Pi star- this is a non-stationary policy.
Sorry, yeah- the optimal policy will be a non-stationary policy.
Yes. I was using pi star
not to denote that it has to be
a fixed function, but yes. [inaudible] Thank you.
Yeah. Right. "If you
take big T to infinity, does it just become the usual value iteration?"
Let me think.
So there are two things there.
The two frameworks are closely related, right-
you can kind of see the relationship to value iteration.
One problem with taking big T to infinity in this framework
is that the values become unbounded, right?
And that's actually one of the reasons why we
use a discount factor when you have an infinite-horizon MDP,
when the MDP just goes on forever.
One of the things the discount factor does is make sure
that the value function doesn't grow without bound, right?
And in fact,
if the rewards are bounded
by some R_max, then when you use discounting, V
is bounded by R_max over (1 minus gamma), right?
By the sum of a geometric series.
Whereas in a finite-horizon MDP, because you only add up T rewards,
it can't get bigger than T times R_max. Yeah?
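As a quick numerical sanity check of the two bounds just mentioned (the particular values of R_max, gamma, and T here are illustrative):

```python
# Numerical check of the value bounds (R_max, gamma, and T are made up).
R_max, gamma, T = 1.0, 0.9, 20

# Infinite-horizon, discounted: sum_t gamma^t * R_max <= R_max / (1 - gamma).
discounted = sum(gamma**t * R_max for t in range(5000))   # long truncation
bound = R_max / (1 - gamma)
print(discounted, bound)     # the truncated sum approaches but never exceeds it

# Finite-horizon, undiscounted: T rewards can total at most T * R_max.
finite = sum(R_max for _ in range(T))
print(finite, T * R_max)
```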
[inaudible].
Let me think. So- let's see.
Actually, let me just draw a 1-D grid just to make life simpler, right?
So let's say there's a +1 reward there and a +10 reward there,
and look at the optimal value function
depending on what time it is.
And let's say that the dynamics are deterministic- no noise.
Then if you have two time steps left,
I guess V* would be
10, 10, 10, 1, 1, 1, 0, 0, 0, right?
Depending on where you are.
Actually, in fact, if you're here, there's nothing you can do-
you can't get either reward in time.
But depending on whether you're here, or here, or here,
the optimal action will change when we compute pi* with this. Does that make sense?
Yeah, that's fine.
Okay- and maybe I do encourage you:
if you actually build
a little grid simulator and use these equations to compute pi* and V*,
you will see that when you have lots of time, the optimal policy is:
wherever you are, go for the +10 reward.
But when the clock runs down, the optimal policy
ends up being a mixture of go-left and go-right.
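Here's one way that little grid simulator might look, as a sketch: a 1-D line of states with a +1 reward at the left end and a +10 reward at the far right. The grid size, start state, and the small per-step cost are all made up; the per-step cost just breaks ties so the shortest path to a reward is strictly preferred.

```python
import numpy as np

# A little 1-D grid simulator: +1 reward at the left end, +10 at the far right.
n = 11                        # states 0..10; +1 at state 0, +10 at state 10
DONE = n                      # absorbing state entered after collecting a reward
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic dynamics; reward states lead to the absorbing state."""
    if s in (0, n - 1, DONE):
        return DONE
    return s - 1 if a == LEFT else s + 1

R = np.full((n + 1, 2), -0.01)   # small cost per time step while wandering
R[0, :], R[n - 1, :], R[DONE, :] = 1.0, 10.0, 0.0

def optimal_policy(T):
    """Finite-horizon DP: returns pi*_t(s) for t = 0..T."""
    V = np.zeros((T + 1, n + 1))
    pi = np.zeros((T + 1, n + 1), dtype=int)
    V[T] = R.max(axis=1)
    for t in range(T - 1, -1, -1):
        for s in range(n + 1):
            q = [R[s, a] + V[t + 1, step(s, a)] for a in (LEFT, RIGHT)]
            pi[t, s] = int(np.argmax(q))
            V[t, s] = max(q)
    return pi

# From state 2: with lots of time left, head right for the +10;
# with the clock almost out, rush left for the +1.
print(optimal_policy(20)[0, 2])   # -> 1 (go right)
print(optimal_policy(3)[0, 2])    # -> 0 (go left)
```

The same start state gets a different optimal action depending on the horizon, which is exactly the non-stationary behavior described above.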
All right, cool. Hope that was okay.
All right. So-
So the last thing I want to share with you today is
Linear Quadratic Regulation.
And as I was saying at the start,
LQR applies only in a relatively small set of problems.
But whenever it applies,
it's a great algorithm,
and I use it whenever it seems reasonable to apply,
because it's very efficient and sometimes gives very good control policies.
And so, LQR applies in the following setting.
Let's see.
In order to specify an MDP,
we need to specify the states, the actions,
the state transition probabilities,
the horizon- I'm going to use a finite-horizon formulation, so capital T- and the rewards.
This also works with
the discounted MDP formalism, but it's a little bit easier,
a little bit more convenient, to develop in the finite-horizon setting,
so let me just use that today.
And LQR applies under a specific set of assumptions,
which are these: the set of states is R^n,
the set of actions is R^d, and to specify the state transition probabilities,
we need the distribution of the next state given the previous state and action.
To specify the state transition probabilities,
I'm going to say that s_{t+1} evolves as a linear function:
some matrix A times s_t, plus some matrix B times a_t, plus some noise,
so s_{t+1} = A s_t + B a_t + w_t.
And sorry, there's a little bit of notation overloading again:
A is both the set of actions as well as this matrix A-
two separate things, same symbol.
I think a lot of
the ideas in LQR came from traditional controls-
from EE and mechanical engineering-
whereas a lot of the ideas in reinforcement learning came from computer science.
These two literatures kind of evolved separately,
and then when the literatures merged,
you ended up with clashing notations.
So CS people use A to denote the set of actions, and the
mechanical engineering and EE people use A to denote
this matrix, and when we merge
these two literatures, the notation ends up being overloaded, right?
Okay. And it turns out-
one thing we'll see later- that
this noise term is actually not super important.
But for now, let's just assume that the noise w_t is distributed Gaussian
with mean 0 and some covariance Sigma_w, okay?
We'll see later that the noise will
be less important than you think.
Right. And so this matrix A is in R^{n x n}, and this matrix B is in R^{n x d},
where n and d are respectively
the dimension of the state space and the dimension of the action space.
So for driving a car, for example,
we saw last time that maybe the state space is six-dimensional.
If you're driving a car, the state space is x, y, theta, x-dot, y-dot, theta-dot,
and the action space is the steering controls, so maybe the action is two-dimensional, right:
acceleration and steering.
Okay. So let's see.
So to specify an MDP, we need to specify this five-tuple, right?
We've specified three of the elements.
The fourth one, T, is just some number,
so that's easy.
And then the final assumption we need to apply LQR
is that the reward function has the following form:
R(s, a) = -(s^T U s + a^T V a),
where U is n by n,
V is d by d,
and U and V are positive semi-definite-
matrices that are, in that sense, greater than or equal to zero.
Okay. So the fact that U and V are positive semi-definite
implies that s^T U s is greater than or equal to 0,
and a^T V a is also greater than or equal to 0.
If you want to fly an autonomous helicopter and if you want,
you know, the state, the state vector to be close to 0.
So the state vector captures position,
orientation, velocity, angular velocity.
If you want a helicopter to just hover in place,
then maybe you want the state to be regulated or to, to,
to be controlled near some zero position and so if
you choose U equals the identity matrix,
and V also equal to the identity matrix,
this, this would be different dimensions, right?
This would be an n by n identity matrix, this would be a d by d ide- identity matrix.
Then R of s a ends up equal to negative norm of s squared plus norm of a squared.
Okay. And so this allows you to
specify a quadratic cost function that penalizes
the state deviating from 0 and, if you want,
the actions deviating from 0-
thus penalizing very large, jerky motions on the control sticks. Or, if we set V equal to 0,
then this second term goes away.
Okay? So these are some of the cost functions you can
specify in terms of a quadratic cost function.
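As a small sketch of that quadratic cost, with U and V set to identity matrices (the dimensions here follow the car example, and the particular state and action vectors are made up):

```python
import numpy as np

# The quadratic LQR reward R(s, a) = -(s' U s + a' V a), with U and V set to
# identity matrices; dimensions follow the car example (6-D state, 2-D action).
n, d = 6, 2
U = np.eye(n)     # n x n, positive semi-definite
Vm = np.eye(d)    # d x d; named "Vm" to avoid clashing with a value function V

def reward(s, a):
    return -(s @ U @ s + a @ Vm @ a)

s = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
a = np.array([0.1, -0.2])
print(reward(s, a))           # about -0.3, i.e. -(||s||^2 + ||a||^2)
```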
Okay. Now again, just so you can see the generalization:
if you want non-stationary dynamics,
this model is quite simple to change- you can
say the matrices A and B depend on the time t, and you can also say
the matrices U and V depend on the time t. So if you have
non-stationary state transition probabilities or a
non-stationary cost function, that's how you would modify this.
But I won't use this generalization today, okay?
Now, the two key assumptions
of the LQR framework are that first,
the state transition dynamics-
the way your state changes-
is a linear function of the previous state and action, plus some noise,
and second, that the reward function is a
quadratic cost function, right?
So these are the two key assumptions.
And so first:
where do you get the matrices A and B?
One thing that we talked about on Wednesday already-
and again, this will actually work if you are trying to
apply LQR to fly an autonomous helicopter;
this would work for a helicopter flying at low speeds-
is to fly the helicopter around:
start with some state s_0, take an action a_0,
get to state s_1,
and do this until you get to s_T, right?
That's the first trial,
and then you do this m times.
So you fly the helicopter through m trajectories of T time steps each, and we know that
we want s_{t+1} to be approximately A s_t + B a_t-
we want
the left-hand side and the right-hand side to be close to each other.
So you can
minimize the squared difference between the left-hand side and the right-hand side,
in a procedure a lot like linear regression, in order to fit the matrices A and B.
So if you actually fly a helicopter
around, collect this type of data, and fit this model to it,
this will work-
this is actually a pretty reasonable model
for the dynamics of a helicopter at low speeds.
Okay? So this is one way to do it.
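Here's a sketch of that fitting procedure: stack the (s_t, a_t) pairs as inputs and s_{t+1} as targets, then solve a least-squares problem. The "true" dynamics below exist only to generate fake trajectory data for the sketch; real data would come from actual flights.

```python
import numpy as np

# Fit A and B by least squares: choose A, B to minimize
# sum_t || s_{t+1} - (A s_t + B a_t) ||^2, much like linear regression.
rng = np.random.default_rng(0)
n, d, T = 4, 2, 200

# Made-up "true" dynamics, used only to simulate a trajectory.
A_true = 0.9 * np.eye(n) + 0.01 * rng.normal(size=(n, n))
B_true = rng.normal(size=(n, d))

states, actions = [rng.normal(size=n)], []
for _ in range(T):
    a = rng.normal(size=d)
    actions.append(a)
    states.append(A_true @ states[-1] + B_true @ a + 0.01 * rng.normal(size=n))

# Stack [s_t; a_t] as inputs, s_{t+1} as targets, and solve least squares.
X = np.hstack([np.array(states[:-1]), np.array(actions)])   # (T, n + d)
Y = np.array(states[1:])                                     # (T, n)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_hat, B_hat = W[:n].T, W[n:].T

print(np.max(np.abs(A_hat - A_true)))    # small: the fit recovers the dynamics
```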
So let's see.
Method 1 is to learn it, right?
A second method is to linearize a non-linear model.
Let me just describe the idea at a
high level,
and I think for this it might be
useful to think of the inverted pendulum, right?
So imagine you have an
inverted pendulum:
you have a cart with a long vertical pole on it, and you're trying
to keep the pole balanced.
For an inverted pendulum like this,
you might download an open-source physics simulator, or have a friend
with a physics degree help
you derive the Newtonian mechanics equations for it.
Let's see- I actually tried to work through the
physics equations for the inverted pendulum once.
They're pretty complicated.
[LAUGHTER]
Um, but you might have
a [NOISE] function that tells
you that if the state is a certain position orientation with the pole velocity,
angular velocity and you as-,
um, ah, what is it?
Um, apply a certain acceleration,
the actions accelerate left or accelerate right,
then, you know, one-tenth of a second later,
the state will get to this, right?
So, so, your physics friend can help you derive this equation.
And then maybe plus noise, right? Let me just ignore the noise for now. So what you have is a function f that maps from the current state s_t = (x, x-dot, theta, theta-dot), that's the position of the cart, the angle of the pole, and their velocity and angular velocity, together with the current action a_t, to the next state vector s_{t+1}. Okay?
So here's what linearization means, and I'm going to use a 1D example, because I can only draw on a flat board, right? Because of the two-dimensional nature of the whiteboard, let's suppose that you have s_{t+1} = f(s_t), and let me just ignore the action for now, so I have one input and one output and can draw this more easily on the whiteboard. So we have some function like this, where the x-axis is s_t, the y-axis is s_{t+1}, and this is the function f, right? We'll plug the action back in later.
What the linearization process does is this: you pick a point, which I'm going to call s-bar_t, and we take the derivative of f and draw the tangent straight line at that point s-bar_t. (I'm not drawing a straight line very well.) Let's draw the line in green, and we're going to use this green straight line to approximate the function f. Okay.
And if you look at the equation for the green straight line: the green straight line is a function mapping from s_t to s_{t+1}, and s-bar_t is the point around which you're linearizing the function, so s-bar_t is a constant. The green line is defined by s_{t+1} ≈ f(s-bar_t) + f'(s-bar_t) (s_t − s-bar_t). So think of s-bar_t as a fixed number, right? It doesn't vary. Given some fixed s-bar_t, this equation, the equation of the green straight line, expresses s_{t+1} as a linear, or really affine, function of s_t: if you use the green straight line to approximate the function f, this tells you what s_{t+1} is as a function of s_t. So that's how you would linearize a function.
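As a minimal sketch of this tangent-line construction (the function name is mine, and the derivative is estimated numerically, as you might do when f is only available as a simulator):

```python
def linearize_1d(f, s_bar, eps=1e-5):
    """First-order Taylor (tangent-line) approximation of f around s_bar.

    Returns g with g(s) = f(s_bar) + f'(s_bar) * (s - s_bar), where
    f'(s_bar) is estimated by a central finite difference.
    """
    deriv = (f(s_bar + eps) - f(s_bar - eps)) / (2.0 * eps)
    f_bar = f(s_bar)
    return lambda s: f_bar + deriv * (s - s_bar)
```

Near s_bar the approximation is good; far away it can be arbitrarily bad, which is exactly the point made next about choosing s-bar_t to be where the system spends most of its time.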
In the more general case, where s_{t+1} is actually a function of both s_t and a_t, putting the action back in, the formula changes a bit; I'll write it out in a second. But in this example, s-bar_t is usually chosen to be a typical value for s. In particular, if you expect your helicopter to do a pretty good job hovering near the state 0, then it's pretty reasonable to choose s-bar_t to be the vector of all 0s. Because if you look at how good the green line is as an approximation of the blue line in a small region like this, the green line is actually pretty close to the blue line. And so if you choose s-bar_t to be the place where you expect your helicopter to spend most of its time, then the green line is not too bad an approximation to the true function, to the physics. Or for the inverted pendulum: if you expect your inverted pendulum to spend most of its time with the pole upright and the velocity not too large, then you'd choose s-bar_t to be maybe the 0 vector.
And so long as your inverted pendulum spends most of its time close to the 0 state, the green line is not too bad an approximation to the blue line, right? So this is an approximation, but in this little region it's actually not that bad an approximation; it's only when you go really far away that there's a huge gap between the linear approximation and the true function f, okay?
Um, all right.
And so in the more general case, where f is a function of both the state and the action, the input now becomes (s_t, a_t), because f maps from (s_t, a_t) to s_{t+1}. And then instead of choosing just s-bar_t, you choose (s-bar_t, a-bar_t), a typical state and action around which you linearize the function. Let me just write down the formula for that. If you linearize f around a point (s-bar_t, a-bar_t), kind of the typical values, then the formula you get is:

s_{t+1} ≈ f(s-bar_t, a-bar_t) + ∇_s f(s-bar_t, a-bar_t)^T (s_t − s-bar_t) + ∇_a f(s-bar_t, a-bar_t)^T (a_t − a-bar_t)
Okay. So this is the generalization of the 1D formula we wrote down just now. It says that the next state is approximately f at the point around which you linearize, plus the gradient with respect to s times how much the state differs from the linearization point, plus the gradient with respect to a times how much the action differs from a-bar_t, okay? So this generalizes the equation we wrote before.
So this equation expresses s_{t+1} as a linear function, or technically an affine function, of the previous state and the previous action, with some matrices in between. And from this, after some algebraic munging, you can re-express it as s_{t+1} = A s_t + B a_t. There's just one other little detail: you might need to redefine s_t to add an intercept term, because this is an affine function with an intercept rather than a purely linear function. But from this formula, with a little bit of algebraic munging, you should be able to figure out what the matrices A and B are; you might need to add an intercept term to s_t, but this affine function can be rewritten in terms of matrices A and B, okay?
All right. So I hope that makes sense: this linearization expresses s_{t+1} as a linear function of s_t and a_t. The way s_{t+1} varies is just some matrix times s_t plus some matrix times a_t, and that's why, with some munging, you can get it into this form for some matrices A and B, okay? But because there are some constants floating around as well, you might need an extra intercept term in the state for A to multiply, to give you that extra constant.
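Here's a sketch of that multivariable linearization, with finite differences standing in for the analytic Jacobians (the function name is mine; numpy assumed):

```python
import numpy as np

def linearize_dynamics(f, s_bar, a_bar, eps=1e-5):
    """Linearize s_{t+1} = f(s, a) around (s_bar, a_bar).

    Returns (A, B, c) with f(s, a) ~= c + A (s - s_bar) + B (a - a_bar),
    where c = f(s_bar, a_bar); A and B are finite-difference Jacobians.
    Absorbing the constant terms into an augmented state [s; 1] then gives
    the purely linear form s_{t+1} = A' s'_t + B' a_t mentioned above.
    """
    n, d = len(s_bar), len(a_bar)
    c = f(s_bar, a_bar)
    A = np.zeros((n, n))
    B = np.zeros((n, d))
    for i in range(n):                     # d f / d s, one column at a time
        e = np.zeros(n); e[i] = eps
        A[:, i] = (f(s_bar + e, a_bar) - f(s_bar - e, a_bar)) / (2 * eps)
    for j in range(d):                     # d f / d a, one column at a time
        e = np.zeros(d); e[j] = eps
        B[:, j] = (f(s_bar, a_bar + e) - f(s_bar, a_bar - e)) / (2 * eps)
    return A, B, c
```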
That's where we are. We now have that for these MDPs, either by learning a linear model with matrices A and B, or by taking a nonlinear model and linearizing it like you just saw, you can hopefully model the MDP as a linear dynamical system, meaning s_{t+1} is a linear function of the previous state and action, hopefully with a quadratic reward function of the form we saw just now. So let me just summarize the problem we want to solve.
s_{t+1} = A s_t + B a_t + w_t, where w_t is a noise term, and the reward is R(s_t, a_t) = −(s_t^T U s_t + a_t^T V a_t). All right. And this is a finite-horizon MDP, so the total payoff is R(s_0, a_0) + ... + R(s_T, a_T). Okay. So let's figure out a dynamic programming algorithm for this.
The remarkable property of LQR, and what makes it so useful, is that if you're willing to model your MDP using this set of equations, then the value function is a quadratic function. So if your MDP can be modeled as this type of linear dynamical system with a quadratic cost function, then it turns out that V* is a quadratic function, and so you can compute V* exactly, right? Let me show you what I mean.
We're going to develop a dynamic programming algorithm to compute the optimal value function V*. Similar to what we did a bit earlier today with the finite-horizon MDP with a finite set of states, let's start with the final time step and work backwards. So V*_T(s_T) = max over a_T of R(s_T, a_T).
That is, it's the max over a_T of −(s_T^T U s_T + a_T^T V a_T). Because V is a positive semi-definite matrix, the term a_T^T V a_T is always greater than or equal to 0, so the optimal action is just to choose a_T = 0, and the max equals −s_T^T U s_T. And this also tells us that pi*_T, the optimal final action, is the argmax: the optimal action is to choose the vector of 0 actions at the last time step, okay? So this is the base case for the dynamic programming step of value iteration: the optimal value at the last time step comes from choosing the action that maximizes the immediate reward, and that's maximized by choosing the action 0 at the last time step, okay?
Now, these blue pens keep running out; let's see if this one is any better. Okay.
Now, the key step in the dynamic programming implementation is the following: suppose that V*_{t+1}(s_{t+1}) is equal to a quadratic function. Right. Okay. So in the- uh-huh, yes?
[inaudible].
Yes, it's true that this term is also greater than 0 without the minus sign. Without the minus sign, that term is positive, but you only get to maximize with respect to a_T, right? So the best you can do for this term is set it to 0. Thank you. All right, cool, thank you. All right.
Now, for the inductive case, we want to go from V*_{t+1} to computing V*_t, right? And the key observation that makes LQR work is this: let's suppose V*_{t+1}, the optimal value function at the next time step, is a quadratic function. In particular, let's suppose V*_{t+1} is a quadratic function parameterized by some matrix capital Phi_{t+1}, which is an n-by-n matrix, and some constant offset Psi_{t+1}, which is a real number.
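Written out, the inductive hypothesis just described is:

```latex
V^*_{t+1}(s_{t+1}) \;=\; s_{t+1}^{T}\,\Phi_{t+1}\,s_{t+1} \;+\; \Psi_{t+1},
\qquad \Phi_{t+1}\in\mathbb{R}^{n\times n},\;\; \Psi_{t+1}\in\mathbb{R}.
```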
What we'll be able to show is that if this is true for V*_{t+1}, then after one step of dynamic programming, as you go from V*_{t+1} to V*_t, the optimal value function V*_t is also a quadratic function of the same form, with t+1 replaced by t, right?
And so in the dynamic programming step, we're going to update V*_t(s_t) = max over a_t of R(s_t, a_t) plus, and I think you remember this (I'm going to write it in green), previously we had the sum over s_{t+1} of P_{s_t a_t}(s_{t+1}) times V*_{t+1}(s_{t+1}). That's what we had previously, when we had a discrete state space and were summing over it. But now that we have a continuous state space, this term becomes an expected value: the expectation of V*_{t+1}(s_{t+1}), with s_{t+1} drawn from the state transition probabilities P_{s_t a_t}.
Okay. So the optimal value when the clock is at time t is: choose the action a_t that maximizes the immediate reward plus the expected value of your future rewards once the clock has ticked from time t to time t+1 and you're in state s_{t+1} at time t+1, right?
So let's see. This is a pretty beefy piece of algebra to do. I feel like showing the full result is at the level of complexity of a typical CS229 homework problem, which is quite hard [LAUGHTER]. But let me just show the outline of how you do this derivation and why the inductive step works; if you want, you can work through the algebra details yourself at home.
Let me do this on the next board. V*_t(s_t) is equal to the max over a_t of the immediate reward, plus the expected value, with s_{t+1} drawn from a Gaussian with mean A s_t + B a_t and covariance Sigma_w, of V*_{t+1}(s_{t+1}). Remember, s_{t+1} = A s_t + B a_t + w_t, where w_t is Gaussian with mean 0 and covariance Sigma_w, right? So if you choose an action a_t, this is the distribution of the next state at time t+1.
And the expectation is of this quadratic term, because the quadratic form here is the inductive hypothesis, what we assumed V* is at the next time step, right? So this is the expected value of a quadratic function with respect to s drawn from a Gaussian with a certain mean and a certain covariance. And it turns out that this whole thing that I just circled simplifies into a big quadratic function of the action a_t, right?
And so in order to derive the argmax, or to derive V*_t(s_t), you take this big quadratic function, take derivatives with respect to a_t, set them to 0, and solve for a_t. Okay? And if you go through all that algebra, then you end up with a formula for a_t.
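The board formula for a_t wasn't captured in the transcript. Carrying out the steps just described, the key identity is that for a Gaussian the expectation of a quadratic form is the quadratic form of the mean plus a trace term; setting the gradient with respect to a_t to zero then gives (a reconstruction under the sign conventions above, worth checking against the course notes):

```latex
\mathbb{E}\!\left[s_{t+1}^{T}\Phi_{t+1}s_{t+1}\right]
 = (A s_t + B a_t)^{T}\Phi_{t+1}(A s_t + B a_t) + \operatorname{tr}(\Sigma_w \Phi_{t+1}),
\quad s_{t+1}\sim\mathcal{N}(A s_t + B a_t,\,\Sigma_w)

a_t \;=\; \left(V - B^{T}\Phi_{t+1}B\right)^{-1} B^{T}\Phi_{t+1}A\; s_t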
I'm going to take that big matrix and denote it L_t. Okay? And so this shows also that pi*_t(s_t) = L_t s_t.
Okay? So one thing to take away from this is that under the assumptions we have, a linear dynamical system with a quadratic cost function, the optimal action is a linear function of the state s_t. And this is not a claim made through function approximation. I'm not saying that you could fit a straight line to the optimal action and that, if you fit a straight line, you'd get this linear function; that's not what we're saying. We're saying that of all the functions anyone could possibly come up with in the world, linear or non-linear, the best action is linear. So there is no approximation here. It's just a fact that if you have a linear dynamical system, the best possible action at any state is going to be a linear function of that state. We haven't approximated anything, right?
Let me see. Yeah, all right. Let me write this here.
And then the other step is that if you take the optimal action and plug it into the definition of V*_t, then by simplifying, which again is quite a lot of algebra, you end up with the update equations for Phi_t and Psi_t; I'll just write the formulas out as is. Okay.
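The two board equations weren't captured in the transcript. Under the conventions above, plugging the optimal action a_t = L_t s_t back in and simplifying gives a recursion of this form (a reconstruction; note that since Phi_{t+1} is negative semi-definite, the trace term only makes the value worse, consistent with the discussion of noise below):

```latex
\Phi_t = A^{T}\!\left(\Phi_{t+1} - \Phi_{t+1}B\left(B^{T}\Phi_{t+1}B - V\right)^{-1}B^{T}\Phi_{t+1}\right)A \;-\; U

\Psi_t = \operatorname{tr}\!\left(\Sigma_w\,\Phi_{t+1}\right) \;+\; \Psi_{t+1}
```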
All right. So to summarize the whole algorithm, let's put everything together. What these two equations do is let you go from V*_{t+1}, which is defined in terms of Phi_{t+1} and Psi_{t+1}, and recursively work backwards to figure out V*_t. So Phi_t depends on Phi_{t+1}, and Psi_t depends on Phi_{t+1} and Psi_{t+1}. And this Sigma_w is the covariance of w_t: it's a Sigma matrix subscripted by w, not a summation over w. It's the covariance matrix of the noise terms you're adding at every step in our linear dynamical system. Okay. And these are trace operators, sums of the diagonals.
Okay? So just to summarize, here's the algorithm. You initialize Phi_T = −U and Psi_T = 0 (note that's capital T, the final time step); that's just taking the base-case equation from before and mapping it here. Those two settings define V*_T. Then you recursively calculate Phi_t and Psi_t from Phi_{t+1} and Psi_{t+1}, for t = T−1, T−2, and so on, counting down to 0.
You calculate L_t as above; that was the formula we had over there, saying how the optimal action is a function of the current state, depending on A, and B, and Phi. And then finally, pi*_t(s_t) = L_t s_t.
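Putting the whole backward recursion together, here's a minimal sketch (the function name is mine; numpy assumed; the signs follow the reward convention R(s, a) = −(s^T U s + a^T V a) used above and are worth checking against the course notes):

```python
import numpy as np

def lqr_backward(A, B, U, V, T):
    """Finite-horizon LQR by backward dynamic programming (a sketch).

    Dynamics: s_{t+1} = A s_t + B a_t + w_t,  w_t ~ N(0, Sigma_w)
    Reward:   R(s, a) = -(s^T U s + a^T V a),  U psd, V positive definite
    Maintains V*_t(s) = s^T Phi_t s + Psi_t and returns the gain matrices
    [L_0, ..., L_{T-1}] with pi*_t(s) = L_t s.  Note that Psi_t (and hence
    Sigma_w) is never needed to compute the policy.
    """
    Phi = -U                                   # base case: Phi_T = -U, Psi_T = 0
    gains = [None] * T
    for t in reversed(range(T)):
        M = V - B.T @ Phi @ B                  # positive definite
        gains[t] = np.linalg.solve(M, B.T @ Phi @ A)           # L_t
        # Riccati-style update for Phi_t (equivalent to the board equation)
        Phi = A.T @ (Phi + Phi @ B @ np.linalg.solve(M, B.T @ Phi)) @ A - U
    return gains
```

As a sanity check, with scalar A = B = U = V = 1 and horizon 2, maximizing by hand gives a last-step gain of -1/2 and a first-step gain of -3/5, which this recursion reproduces.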
Okay? And the remarkable thing, one really cool thing about LQR, is that there is no approximation anywhere in this algorithm. You might need to make some approximation steps to model a helicopter as a linear dynamical system, by fitting matrices A and B to data or by taking a nonlinear model and linearizing it, and you might need to restrict your choice of possible reward functions so that the reward function is quadratic. But once you've made those assumptions, none of this is approximate; everything is exact. Right. Question?
[inaudible]
Yes, that's right. Yep. So the approximation steps needed are in getting your MDP into the form of a linear dynamical system with a quadratic reward; that part is approximate. But once you've specified the MDP like that, all of these calculations are exact, right? So we're not approximating the value function by a quadratic function; the value function is a quadratic function, and you're computing it exactly. And the optimal policy is a linear function, and you're computing that exactly.
Okay. Before we wrap up, I want to mention one unusual fun fact about LQR, and this is very specific to LQR. It's convenient, but let me say what the fact is and just be careful that this doesn't give you the wrong intuition, because it doesn't apply to anything other than LQR. First, look at the formula for L_t; let me move this around. All right.
If you look at the formula for L_t: the goal of doing all this work is to find the optimal policy, right? So you want to find L_t so that you can compute the optimal policy. You'll notice that L_t depends only on Phi, not on Psi. And maybe that makes sense: when you take an action, you get to some new state, and your future payoff is a quadratic function plus a constant, and it doesn't matter what that constant is. So in order to compute the optimal action, in order to compute L_t, you need to know Phi_{t+1}, but you don't need to know Psi_{t+1}.
Right. Now, look at the way we do the dynamic programming, the backwards recursion: what if you implement a piece of code that doesn't bother to compute Psi? These are the two equations you use to update Phi and Psi; say you delete the Psi line of code and just don't bother to compute it. You'll notice that Phi_t depends on Phi_{t+1}, but it doesn't depend on Psi. And so you can implement the whole thing and compute the optimal policy, compute the optimal actions, without ever computing Psi.
Right. Now, the funny thing about this is that the only place Sigma_w appears is in the update for Psi_t. So if we do what I just crossed out in orange and don't bother to compute Psi_t, then the whole algorithm doesn't even use Sigma_w. So one very interesting property of the LQR formalism is that the optimal policy does not depend on Sigma_w.
Now, V* does depend on Sigma_w, because if the noise is very large, if there's a huge gust of wind blowing the helicopter all over the place, then the value will be worse. But pi* and L_t do not depend on Sigma_w. Okay.
So this is a property that is very specific to LQR; don't overgeneralize it to other reinforcement learning algorithms. But I think the intuition to take from this is, first: if you're actually applying this type of system, don't try too hard to estimate Sigma_w, because you don't actually need to use it. That's why, when we were fitting a linear model, I didn't talk much about how you'd actually estimate Sigma_w: in an LQR system, it literally doesn't matter, in a mathematical sense, for the optimal policy you compute.
And second, the maybe slightly more useful intuition to take away from this is that for a lot of MDPs, if you're building a robot, remember to add some noise to your system, but the exact noise you add doesn't matter as much as one might think. What I've seen in working on a lot of robots and a lot of MDPs is: do add some noise to the system and make sure your learning algorithm is robust to noise. And the form of the noise you add does matter; I'm not saying it doesn't matter at all. In LQR, it doesn't matter at all; for other MDPs, it does matter. But the fact that you remembered to add some noise is often, in practice, more important than the exact details of whether the noise is 10% higher or 10% lower.
If the noise is 100% higher or lower, that will often make a big difference. But when I'm training a model of our helicopter or something, the noise is something I pay a little bit of attention to, but I pay much more attention to making sure that the matrices A and B are accurate. A little bit of sloppiness in estimating your noise model is something your MDP, your policy, can probably survive.
Okay. Let's take one last question. Yes.
[inaudible].
Oh, V? Ah, I see. Sorry, yes, let me check my notes. Oh, V. That was, ah, yes, this is a V. Thanks, yeah.
Okay, cool. Thanks, everyone. Let's break, and I'll see you for the final lecture on Wednesday. Thanks, everyone.
