All right. Hey, everyone,
actually started a little bit late.
So welcome to the, uh,
final lecture of, uh,
CS229 of this quarter or I guess,
uh, to the home viewers,
welcome to the season finale.
So what I'd like to do today is, um,
wrap up our discussion on reinforcement learning and then,
um, and then we'll conclude the class.
Um, so I think, you know,
over the last, uh,
few lectures you saw a lot of,
uh, uh, we- we saw a lot of math.
So maybe as a brief interlude here are some videos.
Um, so the Stanford autonomous helicopter,
um, you know, this is a project that, uh, I know Pieter Abbeel,
Adam Coates, uh, some- some former students here,
now some of the machine learning greats worked on when they were,
um, PhD students here.
Uh, and- and- and I think, uh,
using algorithms similar to the ones you learned in this class,
how do you make a helicopter fly?
So just for fun, this is a video shot on top of
one of the Stanford, uh, soccer fields.
I was actually the camera man that day [LAUGHTER] ,
um, and zooming out the camera.
See the trees touching the sky.
[BACKGROUND].
Say, uh, um,
it- it- it turns out- that's a small radio-controlled helicopter.
It turns out that, uh,
when you're very far away you can't tell if this is
a small radio-controlled helicopter or if there's
like a helicopter with people sitting in it [LAUGHTER].
So, um, uh, there was- actually there's,
uh, you know, this footage is on, uh,
a kind of a soccer field, the big,
uh, grass field off Sand Hill Road, and it turns out across Sand Hill Road,
um, one of the high rises there was a- there
was an elderly lady that lives in one of those apartments.
And when she saw that, she would call 9-1-1 and say,
"Hey, there's a helicopter about to crash."
[LAUGHTER] And then the- the firemen would come out, so,
[LAUGHTER] I had to tell them that and I- I think they were partly relieved,
partly disappointed that there was no one for- for them to save.
And, uh, um, and so- and- and I think, uh, let's see.
Uh, uh, one of the things I promised to do, um,
in the debugging learning algorithms lecture was just go over the um,
reinforcement learning example again.
So let me just do that now but, uh,
with notation that I think you now understand compared to- oh, yes.
Why is the helicopter flying upside down?
Oh, uh, it was an aerobatic stunt.
Uh, yeah, I- I don't think there's
any good reason for flying a helicopter upside down [LAUGHTER] ,
uh, other than that you can.
Uh, there- there are a lot of videos of Stanford autonomous helicopters flying
all sorts of stunts, go to heli.stanford.edu,
heli.stanford.edu and the Stanford autonomous helicopter did- did-
did a lot more than flying upside down.
Uh, it could, I mean,
make some maneuvers that looked aerodynamically
impossible such as the helicopter that looks like it's tumbling,
just spinning randomly but staying in the same place in the air, right?
Um, it's called a chaos maneuver and if you look you go, wow.
This helicopter was turning upside down,
spinning around the air in every single direction but it was
just staying right there in the air not crashing,
and so there are maneuvers like that- that, um,
the very best human pilots in the world can fly with helicopters and I think,
uh, this was just, uh, um, uh,
a demonstration I guess, uh,
and I think a lot of this work wound up influencing some of
the later work on the quadcopter drones in a few research labs and.
Yeah, I think, uh,
it was a difficult control problem and it was, uh,
it was one of those things you do when you're,
you- you know, when you're at a university and you want to solve the hardest problems around.
But I wanted to step through some of the debugging process
that we went through as we were building a helicopter like this.
So, uh, when you're trying to get the helicopter to fly upside down,
fly stunts, you don't want to crash too often.
So step one is build a model or build a simulator of a helicopter, right?
Much- much as you saw, um,
when we started to talk about fitted value iteration. And then,
um, choose a reward function, uh,
like that, and it turns out that
specifying a reward function for staying in place is not that hard,
you know, like a quadratic function like that works okay.
But if you want the helicopter to fly aggressive maneuvers it's actually quite
tricky to specify what is a good turn for a helicopter, right?
Um, and then what you do is you run a reinforcement learning algorithm, um,
to try to maximize, say, in
the finite horizon MDP formulation, the sum of rewards over T timesteps,
so you get a policy Pi.
And then whenever you do this,
the first time you do this,
you find that the resulting controller does much worse than the human pilot,
and the question is what do you do next, right?
This is- by the way- this is almost- I think this is
almost exactly the slide I showed you last time except I cleaned up the slide
using reinforcement learning notation rather than
the slightly simplified notation you saw
before [NOISE] you learned about reinforcement learning.
And so the question is, um,
and- and again if you're working on
the reinforcement learning problem yourself, you know, uh,
there's a good chance you have to answer this question yourself for
whatever robot or other reinforcement learning or
factory automation or stock trading system or whatever it is,
um, you are trying to get to work in reinforcement learning.
So do you want to improve the model, the simulator,
or do you want to modify the reward function or do you want to,
uh, modify the reinforcement learning algorithm. All right.
And modifying the reinforcement learning algorithm includes things like,
uh, playing with the discretization that you're using.
Um, if you're taking a continuous state MDP and discretizing it to solve it with a
finite state MDP formulation. Or modifying the reinforcement learning algorithm
also includes maybe choosing new features to use in fitted value iteration, right?
There are a lot of things you could try.
Or maybe instead of using a linear function approximator,
instead of fitting a linear function for fitted value iteration.
Maybe you want to use a bigger,
you know, deep neural network, right?
Um, but so which of these steps is the most useful thing to do?
So this is the analysis of those three things, uh, you know,
if, I'll give you a second to read this, right?
But if these three statements are true,
then the learned controller should have flown well on the helicopter.
Right? Um, and so
those three sentences correspond to the three things in yellow that you could work on.
Um, there's a problem if,
you know, um, statement 1 is false,
that the simulator isn't good enough.
There's a problem if statement 2 is false.
That, um, ah, oh,
sorry, I think actually two and three are reversed.
But, uh, the three statements correspond to the three things in yellow.
I think two and three are in, uh,
are in, uh, opposite order, right?
Ah, is the RL algorithm maximizing the sum of rewards? Is the reward function
actually the right thing to maximize?
And so here are the diagnostics you could use,
um, to see if this helicopter simulator is accurate,
uh, well, first check if,
um, the policy flies well in simulation.
If your policy flies well in simulation but not in real life,
then this shows that the problem is with
your simulator and you should try to learn a better model for your helicopter, right?
And, and if you're using a linear model, this is with the matrices A and B, um,
if, you know, s_{t+1} equals A s_t plus B a_t,
if you're [inaudible] try,
try getting more data or maybe try a non-linear model,
but if you find that the problem's not your simulator,
if you find that, uh,
your policy is flying poorly in simulation and flying poorly in real life,
right, then this is the diagnostic I would use.
Um, so I shall show these two lines.
So let pi_human be the human control policy,
so hire a human pilot, right? Which, which we did.
We're fortunate to have one of the best- one, one of,
um, America's top, you know,
aerobatic helicopter pilots working with us, and he,
using his control sticks and radio control,
can make a helicopter fly upside-down,
tumble, do flips, loops, rolls.
So we had a very good human pilot, um,
help us, uh, fly the helicopter manually.
So what you can do is, um,
test whether or not the,
uh- so this, this thing here, right?
That's just the payoff of the,
um, learned policy as measured on your reward function.
So check if, um,
the learned policy achieves a better or worse payoff than a human pilot can, right?
And so that means, you know,
go ahead and let the learned policy fly the helicopter and then get
the human to fly the helicopter and compute the sum of rewards
on the sequence of states that these two systems take the helicopter through and
just see whether the human or the learned policy achieves a higher payoff,
achieves a higher sum of rewards.
And if, um, the payoff achieved
by the learning algorithm is less than the payoff achieved by the human,
then this shows that, um,
the learned policy's not actually maximizing the sum of rewards, right?
Because the human is doing, you know,
he or she is doing a better job
maximizing the sum of rewards than the learned policy.
So this means that you should, you know,
consider working on the reinforcement learning algorithm to try to
make it do a better job maximizing the sum of rewards, right?
Um, and then on the flip side,
if this inequality goes the other way, right?
Uh, so if pa- if, if the payoff of
the RL algorithm is greater than the payoff of the human,
then what that means is that, you know,
the RL algorithm is actually doing a better job
maximizing the sum of rewards,
but it's still flying worse.
So what this tells you is that,
doing a really good job maximizing the
sum of rewards does not correspond to how you actually want
the helicopter to fly and so that means that maybe you should work on,
um, improving the reward function,
that the reward function is not capturing what's actually most
important to fly a helicopter well and then,
then you modify the reward function, right?
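The three diagnostics just described can be summarized as a small decision procedure. This is only a sketch: the function name `next_bottleneck`, its arguments, and the reduction of "flies well in simulation" to a single boolean are assumptions made for illustration, and real evidence is rarely this clear-cut.

```python
def next_bottleneck(flies_well_in_sim: bool,
                    payoff_learned: float,
                    payoff_human: float) -> str:
    """Suggest what to work on when the learned controller flies worse
    than the human pilot on the real helicopter.

    Both payoffs are sums of rewards under the current reward function.
    """
    if flies_well_in_sim:
        # Good in simulation but bad in real life: the model is inaccurate.
        return "simulator"
    if payoff_learned < payoff_human:
        # The human collects more reward, so the RL algorithm is not
        # doing a good enough job maximizing the sum of rewards.
        return "rl_algorithm"
    # The learned policy collects at least as much reward yet flies worse:
    # the reward function does not capture what good flying means.
    return "reward_function"


print(next_bottleneck(True, -10.0, -5.0))
print(next_bottleneck(False, -10.0, -5.0))
print(next_bottleneck(False, -3.0, -5.0))
```

The order of the checks mirrors the lecture: the simulator is checked first, and the payoff comparison is only used once the policy is known to fly poorly in simulation as well.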
So in a typical workflow,
uh, hoping to describe to you what,
what it feels like to work on a machine learning project like this,
and this was a big multi-year machine learning project,
but when you're working on a big complicated machine learning project like this,
um, the bottleneck moves around meaning that you build a helicopter,
you get a human pilot to fly it,
you know, get it to work,
then run these diagnostics and maybe the first time you do this you'll find, wow,
the simulation's really inaccurate,
then you are going to work on improving the simulator for a couple months.
And then, you know, and every now and then you come back and rerun
this diagnostic and maybe for the first two months of the project,
you keep on saying, "Yup,
simulator is not good enough, simulator
is not good enough, simulator is not good enough."
After working on the simulator for a couple months you,
you may find that, um,
item 1 is no longer the problem,
you might then find that,
um, item 3 is the problem,
the simulator's now good enough,
but when you run this diagnostic,
two months into the project, you might say,
"Wow, looks like your RL algorithm, uh,
is maximizing the reward function but this is not good flying."
So now I think the biggest problem for the project or the biggest bottleneck
for the project is that the ref- the reward function is not good enough,
and then you might spend, you know,
another one or two,
or three, or, or,
or sometimes longer months working to try to improve the reward function,
then you might do that for a while,
and then when the reward function is good enough then that exposes
the next problem in your system which might be that the RL algorithm isn't good enough.
And so the problem you should be working on
actually moves around and it's different in different phases of the project.
And, um, when you're working on this it feels like every time you
solve the current problem that exposes the next most important problem to work
on and then you work on that and you solve that then this helps you identify and
expose the next most important problem to work on
and you kind of keep doing that or you keep iterating,
and keep solving problems until hopefully,
you get a helicopter that does what you want it to, make sense?
Okay. Um, but I think [NOISE] teams that have the discipline to, um,
prioritize according to diagnostics like this,
uh, tend to be much more efficient
than teams that kind of go by gut feeling in terms of selecting,
you know, what to, what to spend the time on.
All right, um, any,
any questions about this?
[inaudible].
Oh, sorry, say that again.
[inaudible] the simulator's
accurate [inaudible].
Yeah, uh, I, I kind of wanna say yes,
um, let me think.
Yeah, I would usually check step 1 first and then
if I think simulator is okay then look at steps 2 and 3.
Um, maybe one, one other thing, uh, er, about,
when you work on these projects there is some judgment involved so I
think I'm presenting these things as though- as a rigid mathematical formula,
that's cut and dry, this formula says,
now work on step 1, then this one says,
now work on step 3.
Um, there is, there is, um,
more judgment involved because when you run these diagnostics you might say, well,
it looks like the simulator's not that good but it's kinda good,
it's a little bit ambiguous, and oh it looks like,
you know, uh, and so that's what it often feels like.
And so a team would get together,
look at the evidence from all three steps and then say, you know, "Well,
maybe the simulator is not that good but it's maybe good
enough, but the reinforcement- the,
the reward function is really bad, let's focus on that."
So there is some,
um- so rather than a hard and fast rule there,
there is some judgment needed to,
to make these decisions,
uh, but having a,
um- so when leading machine learning teams often my teams will, you know,
run these diagnostics, get together and look at
the evidence and then discuss and debate what's the best way to move forward,
but I think the process of having that discussion and
that debate is much better than the alternative,
which is, you know,
someone just picks something kind of at random and,
and the team does that, right?
Yeah, okay. Cool. Um-
All right, cool.
So, um, just, uh,
yeah maybe you, while I have the laptop up, you know,
a little bit for fun but a little bit because I'm,
uh, to illustrate fitted value iteration.
Um, let me just show another,
um, reinforcement learning video.
Um, oh, by the way,
one of the- I- I think if I look at the future of AI,
the future of machine learning, you know,
there's a lot of hype about reinforcement learning for game playing which is fine.
You know, we all like- we all love, uh,
computers playing computer games,
like that's a great thing I think or something, er.
But- but I think that some of the most exciting applications of
reinforcement learning coming down the pipe I think will be robotics.
So I think over the next few years,
even though there are
only a few success stories of reinforcement learning applied to robotics.
There are more and more right now.
One of the trends I see, you know, when you look at, uh,
the academic publications and some of the things making
their way into industrial environments is I think in the next several years,
just based on the stuff I see,
my friends in many different companies,
in many different institutes working on,
I think there will be a rise of, uh,
reinforcement learning algorithms applied to robotics.
I think this would be one important area to- to- to watch out for. All right.
Uh, but, uh, uh,
so, you know, uh, uh,
this is another Stanford video,
this is again just using reinforcement learning to get a robot dog,
um, to climb over obstacles like these.
Uh, my friends that were less generous, um [NOISE] ,
uh, uh, did not want to think of this as a robot dog.
Uh, they thought it was more like a robot cockroach, uh, [LAUGHTER].
But I think cockroaches don't have four legs, right,
cockroaches have six legs [LAUGHTER].
Um, yeah but so,
uh, how do you program a robot dog like this,
right, to, uh, climb over terrain?
So one of the key components, this is work by,
um, Zico Kolter, uh,
now a Carnegie Mellon professor, uh,
another one of the machine learning greats, uh, is,
ah, ah, a key part of this was,
ah, value function approximation, uh,
where the- the dog starts on the left and its goal is to get to the right, and then, uh,
the approximate value function kind of,
um, ah, I- I'm- I'm sort of simplifying a little bit, right?
But- but the approximate value function tells it,
uh, given the 3D shape of the terrain, uh,
the middle plot  is a height map where
the different shades tell you how- how- how tall is the terrain,
uh, but given the shape of the terrain, the dog, uh,
learns a value function that tells it what is the cost of putting
its feet on different locations on the terrain and it learns among other things,
you know, not to put its feet at the edge of a cliff because then it's
likely to slip off the edge of a cliff and fall over, right?
And so, um, but- but hopefully this gives a visualization of what a,
uh, learned value function looks like for a very complicated problem, let's say.
And- and the state is very high-dimensional,
this is all kind of projected onto a 2D space so you can visualize it.
But- but this is what, uh,
a simplified value function looks like for a robot like this.
Okay. All right.
So with that,
um, let me return to the white board [BACKGROUND] um.
So, um, there's just one class of algorithms I want to describe to you
today which are called policy search algorithms.
And uh, sometimes, uh, policy search is also called,
uh, direct policy search.
And, um, to explain what this means,
so far our approach to reinforcement learning has
been to first learn or approximate the value function,
you know, approximate V star and then use that
to learn or at least hopefully approximate Pi star, right?
So we had- you saw value iteration,
we talked about policy iteration.
But our philosophy for reinforcement learning was
to estimate the value function and then use that,
you know, that equation with the arg max to figure out what is Pi star.
So this is an indirect way of getting a policy
because we- we first try to figure out what's the value function.
In direct policy search, um,
we try to find a good policy directly,
hence the term direct policy search because you
don't- you go straight for trying to find a good policy
without the intermediate step of finding an approximation to the value function.
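To make the contrast concrete, here is a sketch of the two routes in symbols. The transition notation $P_{sa}(s')$ and the exact form of the arg max are assumptions carried over from the usual value iteration setup; the "equation with the arg max" itself is not shown in this transcript.

```latex
\begin{align*}
\text{Indirect (via the value function):}\quad
  & \pi^*(s) = \arg\max_{a} \sum_{s'} P_{sa}(s')\, V^*(s') \\
\text{Direct policy search:}\quad
  & \text{choose } \theta \text{ so that } \pi_\theta \text{ itself achieves a high expected payoff}
\end{align*}
```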
So, um, let's see.
I'm going to use, uh,
as the motivating example the inverted pendulum.
Right. So that is that thing with a free hinge here,
and let's say your actions are to accelerate left or to accelerate right, right?
And then you can have- and you can have actions to accelerate strong,
accelerate less strong, accelerate right.
You could have more than two actions but let's just say you have an
inverted pendulum with, um, two actions.
So, um, if you
want to- I- I'll- I'll talk about pros and cons of direct policy search later.
But if you want to apply polic- direct policy search,
you want to apply policy search,
the first step is to,
um, come up with the class of policies you'll entertain or
come up with the set of functions you use to approximate the policy.
So, um, again to make an analogy,
when, uh, you saw logistic regression for the first time, you know,
we kind of said that we would approximate y as the hypothesis,
um, right, whose form was governed by this sigmoid function.
And you remember in week 2 when,
uh, I first described logistic regression,
I kind of pulled this out of a hat,
right, and said, "Oh yeah, trust me,
let's use the logistic function," and- and then later,
we saw this was a special case of the generalized linear model.
Um, but, you know,
we just had to write down some form for how we will predict y as a function of x.
So in direct policy search,
we will have to come up with a form for Pi, right?
So just as we had to come up with a form for the hypothesis h, um,
in direct policy search,
we'll have to come up with a way for how we approximate the policy Pi.
Right? And so, you know,
one thing we could do is say, well,
maybe the action we'll approximate with some policy Pi, um,
maybe parameterized by Theta and it's now a function of the state,
and maybe it'll be 1 over 1 plus e to the negative Theta transpose,
you know, times the state vector.
Right? Where the state vector may be something like,
um, x, x dot, uh,
and- and the angle- and the angle dot, right,
if- if this angle is Phi, and maybe add an intercept there.
Okay. And- and I- I switch this from Theta to Phi to avoid,
uh, conflict in the notation.
Okay. Um, this isn't really the form of the policy we'll use.
So let me- let me make one more definition and then I'll, um,
show you a specific form of policy you can use,
but it's actually not quite this.
We'll- we'll need to tweak this a little bit.
So, uh, the direct policy search algorithm we'll use,
will use a stochastic policy.
So this is a new definition.
Um, so stochastic policy is a function.
Right.
Um, so we're going to use,
um, for the direct policy search algorithm that you see today,
we are going to use stochastic policies meaning that,
um, on every time step, uh,
the policy will tell you what's the chance you want to
accelerate left versus what's the chance you want to accelerate right,
and then you use a random gen- number generator to select either left or
right to accelerate on the inverted pendulum depending on the policies- no,
depending on the probabilities output by this policy.
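In the usual formulation, a stochastic policy is a function pi(s, a) giving the probability of taking action a in state s, with the probabilities over actions summing to 1. The sketch below shows only the "use a random number generator" step. The helper name `sample_action` and the probabilities 0.3 and 0.7 are made up for illustration.

```python
import random


def sample_action(action_probs: dict[str, float], rng: random.Random) -> str:
    """Draw one action according to the probabilities a stochastic policy outputs."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
probs = {"left": 0.3, "right": 0.7}  # hypothetical output of pi(s, .) in some state s
draws = [sample_action(probs, rng) for _ in range(10_000)]
frac_right = draws.count("right") / len(draws)
print(f"fraction of 'right' draws: {frac_right:.3f}")
```

Over many timesteps in the same state, the fraction of "right" actions should come out close to the probability the policy assigns to "right".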
Okay. Um, and so here's one example.
Um, let's see which is
you can have [BACKGROUND].
So, you know, continuing with the inverted pendulum,
here's one policy that, um,
[BACKGROUND] might be reasonable,
uh, where you say that, um, let's see.
Right. So, you know,
in a state s, the chance that you take
the accelerate right action is given by this sigmoid function.
And the chance that in a state s,
you take the accelerate left action is given by that.
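The two expressions being pointed to are not captured in the transcript. Based on the sigmoid form introduced earlier and the example that follows, they are presumably:

```latex
\begin{align*}
\pi_\theta(s, \text{right}) &= \frac{1}{1 + e^{-\theta^\top s}} \\
\pi_\theta(s, \text{left})  &= 1 - \frac{1}{1 + e^{-\theta^\top s}}
\end{align*}
```

The two probabilities sum to 1 for every state, which is what makes this a valid stochastic policy over two actions.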
Okay. Um, and here's one example for why this might be a reasonable policy.
So let's say the state vector s is 1, x,
x dot phi, phi dot, um,
where, you know, this angle of the inverted pendulum,
um, is the angle phi.
And let's say for the sake of argument that
we set the parameter of this policy, theta, to be,
um, 0, 0, 0, 1, 0.
So in this case,
this is saying that, um,
let's see, so theta transpose s is just equal to phi, right?
And so in this case, uh, right, because,
you know, theta transpose s is just 1 times phi,
everything else gets multiplied by 0.
And so in this case this is saying that the chance you accelerate to
the right is equal to 1 over 1 plus e to the negative,
how far is the pole tilted over to the right.
Um, and so this policy gives you
the effect that the further the pole is tilted to the right,
the more aggressively you want to accelerate to the right, okay?
So this is a very simple policy,
it's not a great policy,
but it's not a totally unreasonable policy, which is well,
look at how far the pole is tilted to the left or the right, apply sigmoid function,
and then accelerate to the left or right, you know,
depending on how far it's tilted to the right.
Um, now, uh, and,
and, and because this is the, right,
so this is really the chance of taking the accelerate right action as a function of the,
um, pole angle phi, right?
Now, this is not the best policy because it ignores all the features other than phi.
Um, but if you were to set theta equals,
you know, 0, negative 0.5,
0, 1, 0, then this policy,
um, the negative 0.5 now multiplies into the x position.
Right. Uh, now this new policy if you have this value of theta,
it takes into account how far your cart is already to the right,
um, where I guess this is the x distance, right?
And the further your cart is already, I guess if,
if your cart is on a set of rails,
right, is on a set of railway tracks.
And you don't want to fall off the rail- and you want to keep the cart kind of centered,
you don't want it to fall off the end of your table.
But this now says the further this is to the right already, well,
the less likely you should be to accelerate to the right.
Okay? And so maybe this is
a slightly better policy than with this set of parameters.
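The two parameter settings above can be compared numerically. This is a small sketch assuming the sigmoid policy form and the state layout [1, x, x dot, phi, phi dot] from the lecture; the three example states are made up for illustration.

```python
import math


def prob_right(theta, s):
    """pi_theta(s, right) = 1 / (1 + exp(-theta^T s))."""
    z = sum(t * x for t, x in zip(theta, s))
    return 1.0 / (1.0 + math.exp(-z))


# State layout: s = [1, x, x_dot, phi, phi_dot]
theta_angle_only = [0.0, 0.0, 0.0, 1.0, 0.0]
theta_with_position = [0.0, -0.5, 0.0, 1.0, 0.0]

# Hypothetical example states.
upright_centered = [1.0, 0.0, 0.0, 0.0, 0.0]
tilted_right = [1.0, 0.0, 0.0, 0.5, 0.0]
tilted_right_cart_far_right = [1.0, 2.0, 0.0, 0.5, 0.0]

for name, s in [("upright, centered", upright_centered),
                ("tilted right, centered", tilted_right),
                ("tilted right, cart far right", tilted_right_cart_far_right)]:
    p1 = prob_right(theta_angle_only, s)
    p2 = prob_right(theta_with_position, s)
    print(f"{name}: angle-only {p1:.3f}, with position {p2:.3f}")
```

With the second setting, the same tilt leads to a lower probability of accelerating right once the cart is already far to the right, which is the trade-off described above.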
And more generally, what you would like is to
come up with five numbers that tells you how to trade off,
how much you should accelerate to the right based on the position, velocity, angle,
and angular velocity, um,
of the current state of the cart- of the,
of the inverted pendulum.
And what a direct policy search algorithm will do is, um,
help you come up with a set of numbers that results
in hopefully a reasonable policy for controlling the inverted pendulum.
Hope- and in a policy that hopefully results in an appropriate set of
probabilities that cause it to
accelerate to the right whenever it's good to do so and accelerate to the left,
you know, more often when it's good to do so.
Okay. So, um, all right.
So our goal is to find the parame- find parameters
theta so that when
we execute pi_theta of s,
a, um, we maximize, well,
max over theta the expected value of R of s_0, a_0 plus dot,
dot, dot, plus R of s_T, a_T, okay?
Um, and so the reward function could be negative
1 whenever the inverted pendulum falls over,
uh, and 9 whenever it stays up that of, of, of, whatever,
or something that measures how well your inverted pendulum is doing.
But the goal of a direct policy search algorithm is to
choose a set of parameters theta so that when you execute the policy,
you maximize your expected payoff.
And I'm gonna use the finite horizon setting,
um, for the algorithm that we'll talk about today.
Okay? Uh, and then one,
one other difference between policy search compared to, um,
estimating the value function is that in direct policy search here s_0 is,
um, a fixed initial state, okay?
Um, it turns out that when we were estimating the value function v-star,
um, you found the best possible policy for starting from any state.
Right. And that's kind of, no matter what state you start from, it is
simultaneously the best possible policy for all states.
In direct policy search,
we assume that either there's a fixed start state- fixed
initial state s_0 or there's a fixed distribution over initial states.
So I'm gonna try to maximize the expected reward with respect to your initial state or
with respect to your initial probability distribution over what is the initial state.
Okay. So that's, that's one other, um, difference.
So, um, let me think how I'm going to do this. All right.
So let's write this out.
The goal is to maximize over all theta,
the expected value of R of s_0,
a_0, plus R of s_1,
a_1, plus dot, dot,
dot up to R of s_T, a_T,
um, you know, given pi theta.
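Written out in symbols, the objective just stated is the following, where the expectation is over the initial state, the actions sampled from the policy, and the state transitions:

```latex
\max_{\theta}\; \mathbb{E}\big[\, R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T) \;\big|\; \pi_\theta \,\big]
```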
And, um, in order to simplify the math we'll write on this board today,
um, I'm just going to set T equals 1 to simplify the math,
uh, in order to not carry such a long summation.
But it turns out that, um,
uh, so I'm just gonna do like a two-timestep MDP, uh,
just to simplify the derivation,
but everything works, you know,
just with a longer sum if you,
uh, have a more general value of T. Okay.
Um, and so this term here,
the expectation is equal to sum over all possible state action sequences, right?
And again, this will go up to sT and aT.
But as we set T equals 1, of,
um, what's the chance your MDP starts out in some state s_0?
So this is your initial state distribution times the chance that in
that state you take the first action a_0- oh, actually sorry.
Let me just- let me write this out.
Right. So the chance of your MDP going through this state action sequence, right,
times, times that, right.
So that's what it means to sort of compute the expected value of, uh, the payoff.
Um, and so instead of writing out this sum,
I'm just gonna call this the payoff, right?
And so this is equal to the sum over s_0,
a_0, s_1, a_1 of the chance your MDP starts in state s_0,
times the chance that in state s_0,
you end up choosing the action a_0, times, um, uh,
the chance governed by
the state transition probabilities that you end up in state 1, uh, state s_1,
times the chance that in state s_1 you end up choosing, let's see,
a_1, and then times the payoff, okay.
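For a small enough MDP, this sum over state-action sequences can be evaluated exactly by brute force. The sketch below does that for T = 1. The two-state MDP, its transition table `P_TRANS`, the reward, and the per-state logistic policy are all hypothetical choices made only for illustration.

```python
import itertools
import math

STATES = [0, 1]
ACTIONS = [0, 1]           # 0 = left, 1 = right
P0 = [0.5, 0.5]            # initial state distribution P(s_0)
# P_TRANS[s][a][s_next] = probability of landing in s_next
P_TRANS = [[[0.9, 0.1], [0.2, 0.8]],
           [[0.7, 0.3], [0.1, 0.9]]]


def reward(s, a):
    """Hypothetical reward: 1 for being in state 1, regardless of action."""
    return 1.0 if s == 1 else 0.0


def pi(theta, s, a):
    """Stochastic policy with one logistic parameter per state."""
    p_right = 1.0 / (1.0 + math.exp(-theta[s]))
    return p_right if a == 1 else 1.0 - p_right


def expected_payoff(theta):
    """Sum over all (s_0, a_0, s_1, a_1) of P(sequence) times payoff."""
    total = 0.0
    for s0, a0, s1, a1 in itertools.product(STATES, ACTIONS, STATES, ACTIONS):
        prob = P0[s0] * pi(theta, s0, a0) * P_TRANS[s0][a0][s1] * pi(theta, s1, a1)
        payoff = reward(s0, a0) + reward(s1, a1)
        total += prob * payoff
    return total


print(expected_payoff([0.0, 0.0]))
print(expected_payoff([5.0, 5.0]))
```

Only the two policy factors depend on theta, which is what the derivative computation later in the lecture exploits.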
And so what we're going to really do is, um,
derive a gradient ascent algorithm- actually a stochastic gradient ascent algorithm as
a function of theta to maximize this thing- to maximize the expected value of this thing.
And that- and this is a, um,
this is how we'll do direct policy search.
Okay. So let me just write out the algorithm,
and then we'll go through why, um,
the algorithm that I write down is maximizing this expected payoff.
[NOISE].
So this algorithm is called the, um, REINFORCE algorithm.
Ah, the original REINFORCE algorithm, um, uh,
had a few other bells and whistles, but,
but I'm gonna explain the core of the idea.
But the reinforcing- the REINFORCE algorithm, um,
does the following which is you're going to run your MDP, right?
And just you know run it for a trajectory of T timesteps.
So, um, again, you know,
I'm just gonna, [NOISE] well.
Right. And and actually you would, uh, right.
Technically, you would, um,
run it for T timesteps but, you know,
let, let's just say for now,
we'll - we'll do only the thing in blue.
We run it for one timestep,
because we set capital T equal to 1.
Um, and then you would compute the payoff, right,
equals R of s_0, a_0 plus R of s_1, a_1 and again,
in the more general case, you know,
plus dot dot dot plus R of s_T, a_T, right?
[NOISE] And then you perform the following update which
is Theta gets updated as Theta plus the learning rate alpha, times.
Right? Um, and then times the payoff.
Right? And again, I'm just setting capital T equals 1.
If capital T was bigger,
you would just sum this all the way up to time T. Okay?
So that's the algorithm.
Um, so on every iteration through the REINFORCE algorithm,
you will take your robot,
take your inverted pendulum, um,
run it through T timesteps,
uh, executing your current policy,
so choose actions randomly according to
the current stochastic policy using current values of the parameters Theta,
compute the total sum of rewards you receive, that's called a payoff
and then update Theta using this funny formula.
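The update formula itself is not captured in the transcript. In the standard form of REINFORCE, consistent with the derivation that follows, theta is updated as theta plus alpha times [grad pi_theta(s_0, a_0) / pi_theta(s_0, a_0) + grad pi_theta(s_1, a_1) / pi_theta(s_1, a_1)] times the payoff. Below is a minimal sketch of that loop. The two-state MDP, the reward, the per-state logistic policy, and the hyperparameters are hypothetical, chosen only so the example runs on its own.

```python
import math
import random

P0 = [0.5, 0.5]            # initial state distribution
# P_TRANS[s][a][s_next]; action 0 = left, 1 = right (made-up numbers)
P_TRANS = [[[0.9, 0.1], [0.2, 0.8]],
           [[0.7, 0.3], [0.1, 0.9]]]


def reward(s, a):
    """Hypothetical reward: 1 for being in state 1."""
    return 1.0 if s == 1 else 0.0


def prob_right(theta, s):
    return 1.0 / (1.0 + math.exp(-theta[s]))


def grad_over_pi(theta, s, a):
    """(gradient of pi_theta(s, a) with respect to theta) / pi_theta(s, a)."""
    p = prob_right(theta, s)
    g = [0.0] * len(theta)
    g[s] = (1.0 - p) if a == 1 else -p
    return g


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce(num_iters=20000, alpha=0.05, horizon=1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(num_iters):
        s = sample(P0, rng)
        payoff = 0.0
        score = [0.0, 0.0]   # running sum of grad pi / pi over the trajectory
        for t in range(horizon + 1):
            a = 1 if rng.random() < prob_right(theta, s) else 0
            payoff += reward(s, a)
            g = grad_over_pi(theta, s, a)
            score = [sc + gi for sc, gi in zip(score, g)]
            if t < horizon:
                s = sample(P_TRANS[s][a], rng)
        # Stochastic gradient ascent step on the expected payoff.
        theta = [th + alpha * sc * payoff for th, sc in zip(theta, score)]
    return theta


theta = reinforce()
print(theta, [prob_right(theta, s) for s in (0, 1)])
```

In this toy MDP, accelerating right makes state 1 (the rewarded state) more likely, so the learned probabilities of choosing right should drift above one half in both states.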
Right? Now, on every iteration of this algorithm,
um, you're going to update Theta.
And it turns out that REINFORCE is a stochastic gradient ascent algorithm.
Um, and you remember when we talked about,
uh, linear regression, right?
You saw me draw pictures like this.
There is a global minimum.
Then, uh, gradient descent would just,
you know, take a straight path to the minimum,
but stochastic gradient descent would take
a more random path, right, towards the minimum and it kind of
oscillates around, and then maybe it doesn't
quite converge unless you slowly decrease the learning rate alpha.
So this is what we have for stochastic gradient descent,
um, for linear regression.
What we'll see in a minute
is that REINFORCE is a
stochastic gradient ascent algorithm meaning that each of these updates is random,
because it depends on what was
the state action sequence that you just saw and what was the payoff that you just saw.
But what we'll show is that in expectation, the average update,
You know, this- this update to Theta.
This thing you are adding to theta,
that on average let's see,
that- that on average this update here is exactly in the direction of the, um, gradient.
So that on average,
um, you know, because, uh, every-every loop,
every time through this loop you're making
a random update to Theta and it's random and
noisy because it depends on this random state sequence.
Right? And this state sequence is random because of
the state transition probabilities and also
because of the fact that you're choosing actions randomly.
But on- but the expected value of this update, uh,
you'll see in a little bit, turns out to be exactly in the direction of the gradient.
Um, which is why this, uh,
REINFORCE algorithm is a stochastic gradient ascent algorithm.
Okay? So let's, uh,
let's show that now.
Okay.
So [NOISE] all right.
So what we want to do is maximize the expected payoff,
which is the formula we derived up there, and so,
um, we're going to
want to take derivatives with respect to Theta of the expected payoff.
Right? Of, uh, I'm just gonna copy
that formula up there
[NOISE].
Okay? So that's the chance
of going through that state-action sequence, times the payoff.
And so we want to take derivatives of this and, you know,
so we can like go uphill using gradient ascent.
Um, so we're going to do this in, uh, four steps.
Um, now, first, um,
let me remind you how you take the derivative
of- of a product of three things.
Right? So let's say that you have, uh, three functions,
f of Theta times g of Theta times h of Theta.
So by the product rule,
um, you know, the product rule for derivatives from calculus,
the derivative of the product of three things is obtained by,
um, you know, taking the derivatives of each of them one at a time.
Right? So this is f prime times g times h, plus,
um, f times g prime times h,
plus f times g times h prime.
Okay? So the product rule from calculus is
that if you want to take derivatives of a product of three things,
then you kind of take the derivatives one at a time and you end up with a sum of three terms.
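Written out, the three-factor product rule described here is the standard calculus identity:

```latex
\frac{d}{d\theta}\Big[f(\theta)\,g(\theta)\,h(\theta)\Big]
  = f'(\theta)\,g(\theta)\,h(\theta)
  + f(\theta)\,g'(\theta)\,h(\theta)
  + f(\theta)\,g(\theta)\,h'(\theta)
```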
Right? And so we're going to apply the product rule to this where, um,
we have- here we have two different terms that depend on, um,
Theta, and so when we take the derivative of this thing with respect to theta,
we're gonna have two terms.
Uh, that correspond to taking the derivative of
this one and taking the derivative of that one.
Right? And so, um,
this derivative is equal to,
so the first term is the sum over all the state action sequences,
um, P of s0,
um, and then let's see.
So now we have, excuse me,
the derivative with respect to Theta of Pi Theta of
s0, a0.
Right? And then plus, um,
[NOISE].
Right? And then times the payoff.
Right? So the whole thing here is then multiplied by the payoff.
Okay? So we just applied the product rule from calculus where,
uh, for the first term in the sum,
we kinda took the derivative of this first thing and then for
the second term of the sum we took the derivative of this second thing.
Okay? And now, um,
I'm gonna do one more algebraic trick which is,
I'm going to multiply and divide by that same term,
and then multiply and divide by the same thing here.
Right? So, lots of
multiplying and dividing by the same thing.
Right? And then finally,
um, if you factor out.
So now, the final step is, um,
I'm- I'm gonna factor out these terms I'm underlining.
Ah, right?
Because these terms I underlined,
this is just, you know,
the probability of the whole state sequence.
Right? And again, for the orange thing,
this orange thing.
Right? These two orange things multiplied
together are equal to that orange thing in that box as well.
And so the final step is to factor out the orange box which is just P of s0,
a0, s1, a1, right?
So that's the thing I boxed-up in orange times,
then those two terms
involving the derivatives
[NOISE].
Times the payoff [NOISE].
Okay? And I think, ah, right,
where- because I guess this term goes there,
[NOISE] and this term goes there, okay?
And so this is just equal to,
um, well- and if you look at the REINFORCE algorithm,
right, that we wrote down,
ah, this is just equal to the sum over, you know,
all the state-action sequences of the
probability of that sequence times the gradient update,
right.
[NOISE] Because, ah, I guess I'm running out of colors.
But, you know, this is a gradient update and that's just like equal to this thing, okay?
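The board work is not captured in the transcript, so here is a hedged reconstruction of the steps as spoken. It assumes the two-step trajectory s0, a0, s1, a1 mentioned above, and uses the symbol P_{s_0 a_0}(s_1) for the state transition probability, which is a notational assumption:

```latex
\begin{align*}
\nabla_\theta\, \mathbb{E}[\text{payoff}]
 &= \nabla_\theta \sum_{s_0,a_0,s_1,a_1}
    P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)\cdot \text{payoff} \\
 &= \sum_{s_0,a_0,s_1,a_1} \Big[
    P(s_0)\,\big(\nabla_\theta \pi_\theta(s_0,a_0)\big)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1) \\
 &\qquad\qquad
  + P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\big(\nabla_\theta \pi_\theta(s_1,a_1)\big)
    \Big]\cdot \text{payoff} \\
 &= \sum_{s_0,a_0,s_1,a_1}
    \underbrace{P(s_0)\,\pi_\theta(s_0,a_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(s_1,a_1)}_{P(s_0,a_0,s_1,a_1)}
    \left[
      \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
    + \frac{\nabla_\theta \pi_\theta(s_1,a_1)}{\pi_\theta(s_1,a_1)}
    \right]\cdot \text{payoff} \\
 &= \mathbb{E}\!\left[
    \left(
      \frac{\nabla_\theta \pi_\theta(s_0,a_0)}{\pi_\theta(s_0,a_0)}
    + \frac{\nabla_\theta \pi_\theta(s_1,a_1)}{\pi_\theta(s_1,a_1)}
    \right)\cdot \text{payoff}
    \right]
\end{align*}
```

The second line is the product rule applied to the two factors that depend on Theta, the third line is the multiply-and-divide step, and the bracketed quantity times the payoff in the last line is the update direction, so its expectation is the gradient of the expected payoff.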
So what this shows is that, um,
even though on each iteration the direction of the gradient update is random, um,
the, ah, the expected value of how you
update the parameters is exactly equal to the derivative of your objective,
of your expected total payoff, right.
So we started by saying that this formula is your expected total payoff,
um, so let's figure out what's the derivative of your expected total payoff,
and we found that the derivative
of your expected total payoff,
the derivative of the thing you want to maximize, is equal
to the expected value of your gradient update.
And so this proves that, um,
on average, you know,
if you have a very small learning rate,
you end up averaging over many steps, right?
But on average, the update that REINFORCE takes on
every iteration is exactly in the direction of the derivative of the,
um, expected total payoff that you're trying to maximize. Okay, makes sense?
Yes, any questions about this? Yeah.
[inaudible].
Oh, yes, it is independent of the choice of this function.
Um, this is true for any form of a stochastic policy,
ah, where the definition is that, you know,
Pi Theta of s0, [NOISE] ah,
a0 has to be the chance of taking that action in that state,
but this could be any function you want.
Ah, it could be a softmax,
it could be a logistic function of many,
many different complicated features,
it could be- but it has to be a continuous, or rather, it has to be a differentiable function.
And actually one of the reasons we shifted to stochastic policies was because,
um, previously we just had two actions,
it's either left or right, right?
And so you can't define a derivative over a discontinuous function like either left or
right, but now we have a probability that
shifts slowly between what's the probability to go left versus
go right, and by making this a continuous function of Theta,
you can then take derivatives and apply gradient ascent,
but it doesn't need to be a logistic function.
Yeah, go ahead.
Ah, [inaudible]?
Sure. So, um, ah,
another way to train a, um,
helicopter controller is you use supervised learning,
where you have a human expert train,
um, you know, so you can also actually have a human pilot demonstrate in this state,
take this action, right,
and then you use supervised learning to just learn
directly a mapping from a state to the action.
Um, I think this, I don't know,
this might be okay for low speed helicopter flight,
I don't think it works super well, ah,
I bet you could do this and not crash a helicopter, but, ah, um, ah,
ah, but to get the best results,
I wouldn't use this approach, um, yeah.
It turns out for some of the maneuvers it'll actually
fly better than human pilots as well,
um, yeah, no.
Cool. All right.
Um, and so, um,
for other types of policies,
um, let's see, right.
[NOISE]
So, ah, direct policy search also works,
um, if you have continuous-valued actions and you don't want to discretize the actions.
So here's a simple example.
Let's say a is a real number, ah,
such as the magnitude of the force you apply to accelerate left or right. All right.
So rather than discretizing, for your inverted pendulum,
you wanna output a continuous number for how hard you swerve to the left or right.
Um, or for a self-driving car maybe a is
the steering angle, which is a real-valued number.
So a simple policy would be a equals,
you know, Theta transpose s, um,
and then plus [NOISE] Gaussian noise.
And if just for the purpose of training,
you're willing to pretend that your policy is to
apply the action Theta transpose S and add a little bit of Gaussian noise to it,
then, um, the whole framework for
REINFORCE, this type of gradient ascent, also, ah, will,
will also work, right, um,
and then I guess if you're actually implementing this,
you can probably turn off the Gaussian noise,
there, there are little tricks like that as well.
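A minimal sketch of the continuous-action policy described here, assuming a fixed noise standard deviation `SIGMA` (a hypothetical value). The lecture does not spell out the gradient term. The formula in `grad_over_pi` follows from differentiating the log of a Gaussian density with mean Theta transpose s, and all function names are illustrative only.

```python
import math
import random

SIGMA = 0.5  # standard deviation of the Gaussian noise (hypothetical value)


def mean_action(theta, s):
    # The deterministic part of the policy: Theta transpose s.
    return sum(t * f for t, f in zip(theta, s))


def sample_action(theta, s, rng, explore=True):
    # a = Theta^T s + Gaussian noise; with explore=False the noise is turned off.
    mu = mean_action(theta, s)
    return rng.gauss(mu, SIGMA) if explore else mu


def log_density(theta, s, a):
    # Log of the Gaussian density pi_theta(s, a) with mean Theta^T s.
    mu = mean_action(theta, s)
    return -0.5 * ((a - mu) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))


def grad_over_pi(theta, s, a):
    # (gradient of pi_theta(s, a)) / pi_theta(s, a), i.e. the gradient of the
    # log density: ((a - Theta^T s) / SIGMA^2) * s.
    mu = mean_action(theta, s)
    return [(a - mu) / (SIGMA ** 2) * f for f in s]
```

During training one would call `sample_action(theta, s, rng)` and use `grad_over_pi` in place of the logistic-policy term in the update. Passing `explore=False` gives the noise-free action.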
Um, so let's see.
Some pros and cons of, um, so,
whe - whe- when should you use direct policy search and when should you
use value iteration or a value function based type of approach?
Um, so it, ah,
turns out there's one setting, ah,
actually there are two settings where direct policy search works much better.
One is if you have a, um,
POMDP, ah, PO in this case stands for partially observable.
[NOISE] And that's if, for example, um, you know,
for the inverted pendulum,
um, there's a pole angle Phi, you have,
you have a cart and this is your position x. Um,
and what we were saying is that the state is, ah,
x, x dot, Phi, Phi dot.
All right? [NOISE] But let's say that, um,
you have sensors on this inverted pendulum that allow you to
measure only the position and only the angle of the inverted pendulum.
[NOISE] Uh, so you might have an angle sensor, you know,
down here and you may have a position sensor for your inverted pendulum,
but maybe you don't know the velocity or you don't know the angular velocity, right.
So this is an example of a partially observable Markov decision process because,
ah, and what this means is that on every step,
you do not get to see the whole state because you,
you don't have enough sensors to tell you
exactly what is the state of the entire system, okay?
So in a partially observable MDP,
um, at each step,
you get a partial and potentially
noisy measurement of the state,
right, and then have to take actions, or,
have to choose an action a,
[NOISE]
using these partial and potentially noisy measurements, right?
Which is, uh, maybe you only observe the position and the angle,
but your sensors aren't even totally accurate.
So you get a slightly noisy,
you know, estimate of the position.
You get a slightly noisy estimate of the angle but you just have to choose
an action based on your noisy estimates of just two of the four state variables, right?
Um, it turns out
that there's been a lot of
academic literature trying to generalize value-function-based approaches
to POMDPs, ah, and there are very complicated algorithms in
the literature on trying to apply value-function-based approaches to POMDPs.
But those algorithms, despite their very high level of complexity,
you know, are not- are not widely used in production, right?
Um, but if you use the direct policy search algorithm,
then there's actually very little problem.
Oh, let me just write this out.
So let's say the observation is, on every timestep you
observe y equals x, Phi, plus noise, right?
So you just don't know what the state is.
And in a POMDP it's very difficult to even apply value function approximation.
And even if you knew what V star was, right,
even if, uh, maybe you know what Pi star is,
say you can compute V star and Pi star,
if you don't know what the state is,
you can't apply Pi star to the state. So- so how do you choose an action?
Um, if you're using direct policy search,
then here's one thing you could do.
Which is you can say that, uh,
Pi of, um, given an observation,
the chance of going to the right given your current observation is equal
to 1 over 1 plus e to the negative Theta transpose y,
where I guess y can be the vector 1,
x plus noise, Phi plus noise, right?
And so you could run REINFORCE using just the observations you have to, um,
stochastically, to randomly choose an action,
and nothing in the framework we talked about prevents this algorithm from working.
And so direct policy search just works very
naturally even if you have only partial observations of the state.
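Here is a small sketch of the observation-based policy just described: the policy sees only y = [1, x + noise, Phi + noise] and goes right with probability 1 / (1 + e^(-Theta^T y)). The noise level `NOISE_STD` and the function names are assumptions for illustration.

```python
import math
import random

NOISE_STD = 0.05  # sensor noise standard deviation (hypothetical value)


def observe(state, rng):
    # The full state is (x, x_dot, phi, phi_dot), but the sensors only give
    # noisy readings of the position x and the pole angle phi.
    x, _x_dot, phi, _phi_dot = state
    return [1.0, x + rng.gauss(0.0, NOISE_STD), phi + rng.gauss(0.0, NOISE_STD)]


def prob_right(theta, y):
    # Chance of going right given the current observation y.
    z = sum(t * f for t, f in zip(theta, y))
    return 1.0 / (1.0 + math.exp(-z))


def choose_action(theta, y, rng):
    # Sample the action from the stochastic policy (1 = right, 0 = left).
    return 1 if rng.random() < prob_right(theta, y) else 0
```

Nothing in the policy ever touches `x_dot` or `phi_dot`, which is the point: the update only needs the probability the policy assigned to the action it actually took.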
Um, and more generally instead of plugging the direct observations this can
be any set of features, right?
I'll just make a side comment for those of you who know
what Kalman filters are. Don't worry if you don't.
But one common- one common way of, uh, uh,
using direct policy search would be to use some estimator such as a Kalman filter,
or a probabilistic graphical model or something, that uses your historical measurements.
Look, don't, don't just look at your one, uh,
set of measurements now, but look at all the historical meas- measurements.
And then there are algorithms such as something called
the Kalman filter that let you estimate whatever is the current state,
the full state vector.
You can plug that full state vector estimate into
the features you use to choose- to choose an action.
That's a common design paradigm.
If you don't know what the Kalman filter is, don't worry about it.
Ah, take- take one of Stephen Boyd's classes or something, I don't know. Yeah, right.
But, but that's one common paradigm where you can use your partial observations to
estimate the full state and plug that in as a feature into the policy search, okay?
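To make the state-estimation idea concrete, here is a deliberately simplified sketch of a Kalman filter that recovers position and velocity from noisy position readings alone, and then hands the estimate to the policy as features. It assumes constant-velocity dynamics, which is not the real cart-pole model, and the constants `DT`, `Q`, `R` and all function names are hypothetical.

```python
DT = 0.1   # time between measurements (hypothetical)
Q = 1e-4   # process noise added to each diagonal entry of P (hypothetical)
R = 1e-2   # variance of the position measurement noise (hypothetical)


def kalman_step(x, v, P, z):
    """One predict/update step for the state [x, v] given a position reading z."""
    # Predict with constant-velocity dynamics: x <- x + DT * v, v <- v.
    x_p = x + DT * v
    v_p = v
    # Predicted covariance F P F^T + Q*I, with F = [[1, DT], [0, 1]].
    p00 = P[0][0] + DT * (P[1][0] + P[0][1]) + DT * DT * P[1][1] + Q
    p01 = P[0][1] + DT * P[1][1]
    p10 = P[1][0] + DT * P[1][1]
    p11 = P[1][1] + Q
    # Update using the position measurement only (H = [1, 0]).
    innovation = z - x_p
    s = p00 + R
    k0, k1 = p00 / s, p10 / s
    x_new = x_p + k0 * innovation
    v_new = v_p + k1 * innovation
    P_new = [[(1.0 - k0) * p00, (1.0 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, v_new, P_new


def policy_features(x_est, v_est):
    # Features handed to the policy: intercept plus the estimated full state.
    return [1.0, x_est, v_est]
```

The filter is run once per timestep on the newest measurement, and `policy_features(x, v)` replaces the raw observation vector in the policy. The velocity was never measured directly. It is inferred from the history of position readings.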
So that's one setting where direct policy search works.
Um, just, just applies in a way that
value function approximation is very difficult to even get to apply.
Um, now one last thing is,
uh, one last consideration
for whether you should apply a direct policy search algorithm
or a value function approximation algorithm.
Oh, it turns out, um,
the REINFORCE algorithm is,
is actually very inefficient.
Ah, as in, ah, you end up, you know, whe- when,
when you look at research papers on the REINFORCE algorithm,
it's not unusual for people to run the REINFORCE algorithm for like a million iterations,
or 10 million iterations.
So you just have to train for a long time. It turns out that the gradient estimates
for the REINFORCE algorithm, even though the expected value is right,
are actually very noisy.
And so if you train with the REINFORCE algorithm,
you end up just running it for a very,
very, very long time, right?
It does work but it is a pretty inefficient algorithm.
So that's one disadvantage of the REINFORCE algorithm, that
the gradient estimates in expectation are exactly what you want them to be,
but there's a lot of variance in the gradient.
So you have to run it for a long time with a very small learning rate.
Um, but one other reason to use, um,
direct policy search is,
is kind of ask yourself,
do you think Pi star is simpler?
Or is V star simpler, right?
And so, um, here's what I mean,
there are, ah, ah, ah,
there- in, in robotics,
there's sometimes what we call low-level control tasks.
And, uh, one example of a low-level control task is flying a helicopter.
Hovering a helicopter is an example of a low-level control task.
And one way to think of a low-level control task is kind of a really skilled human,
um, you know, holding a joystick,
controlling this thing, making seat-of-the-pants decisions, right?
So those are kind of almost instinctual,
in a tiny fraction of a second,
almost by feel you could control the thing.
Those, those are- tend to be low-level control tasks.
Those are seat of the pants, holding a joystick,
a skilled person could balance the inverted pendulum or,
you know, steer a helicopter.
Those are low-level control tasks.
In contrast, um, playing chess is not a low-level control task.
You know, because for the most part,
a very good chess player is not really a seat of the pants, you know,
take that- make a decision in like- in,
in 0.1 seconds, right.
You kind of have to think multiple steps ahead.
Um, and in low-level control tasks,
there's usually some control policy that is quite simple.
A very simple function mapping from states to actions that's pretty good.
And so that allows you to specify a relatively simple class of functions for
Pi star, and direct policy search would be relatively promising for tasks like those.
Whereas in contrast, if you want to play chess or play Go,
or do these things where we have multiple steps of reasoning,
um, I think that,
if you're driving a car on a straight road,
that's a low-level control task.
Where you just look at the road and you just, you know,
you know turn the steering a little bit to stay on the road.
So that's a low-level control task.
But if you are planning how to, um, you know,
overtake this car and avoid that other car,
or there's a pedestrian and a bicyclist along the way,
then that's less of a low level control task.
Um, and that requires more multi-step reasoning, right?
I guess depends on how aggressive of a driver you are, right?
Driving on the highway, you know,
may require more or less multistep reasoning.
Where you want to, ah,
overtake this car before the truck comes in this lane.
So that- that type of thing is,
um, more multi-step reasoning.
Um, and for problems like that it tends to be
difficult for a very simple function, like a linear function, to be a good policy.
And for those things in playing chess, playing Go,
playing checkers, um, a value function approximation approach may be more promising.
Okay, um, cool.
So any, um,
questions about the- oh, and so, um,
okay for, for, ah, autonomous helicopter flight, ah,
actually, my first attempts at flying
helicopters were actually direct policy search, because flying helicopters
is actually a seat-of-the-pants thing.
Ah, but then when you try to fly more complex maneuvers,
then you end up using something maybe closer to
value function approximation methods if you want to fly a very complicated maneuver, right?
Um, oh, so the video you saw just now,
of the helicopter flying upside down,
the algorithm implemented on, you know,
for that particular video that was a direct policy search algorithm, right?
Not, not exactly this one, a little bit different.
But that was a direct policy search algorithm.
But if you want the helicopter to fly a very complicated maneuver,
then you need something maybe closer to a value function approximator.
And so the- and there is exciting research on how to blend
direct policy search approaches together with value function approximation approaches.
So actually AlphaGo.
Ah, ah, ah, ah, one of the reasons AlphaGo works was,
um, sorry, ah, you know Go playing program, right, by DeepMind,
ah, was, was a blend of ideas from both of these types of
literature which enabled it to scale to a much bigger system to play Go and,
you know clearly at a very, very impressive level.
All right. Any questions about this, anyone?
[NOISE]
All right. Um, so just final application examples, um, you know,
reinforcement learning today, um,
is, uh, making strong- let's see.
So there's a lot of work on reinforcement learning for game playing,
Checkers, Chess, um, uh, Go.
That is exciting, um,
reinforcement learning today is used in, uh,
is used in a growing number of robotics applications,
um, I think for controlling a lot of robots.
Um, there is a, uh- if you go to the robotics conferences,
if you look at some of the projects being done by some of
the very large companies that make very large machines, right.
Uh, I have many friends in multiple, you know,
large companies making large machines that
are increasingly using reinforcement learning to control them.
Um, there is fascinating work, uh,
using reinforcement learning for optimizing,
um, entire factory deployments.
Um, there is, uh,
academic research work, uh,
still in the research phase, I think,
actually may- maybe starting to be deployed, on
using reinforcement learning to build chatbots.
Um, uh, uh, and actually, on, on,
on using reinforcement learning to, uh,
build a, uh, AI-based guidance counselor,
for example, right, where,
uh, the actions you take are
what you say to students, and then,
and then the reward is, you know,
do you manage to help a student navigate their coursework and navigate their career.
Uh, there is, uh, uh,
and that's also starting to be applied to healthcare,
where- one of the keys of reinforcement learning is,
this is a sequential decision making process, right?
Where you have to take a sequence of decisions that may affect your reward over time.
And I think um,
uh, and, uh, in,
in healthcare, there is work on medical planning,
where, um, the goal is not to, you know,
send you to get, uh,
a blood test and then we're done, right?
In, in, in complicated, um,
medical procedures, we might send you to get a blood test,
then based on the outcome of the blood test,
we might send you to get a biopsy or not,
or we might ask you to take a drug and then come back in two weeks.
But this is a very complicated sequential decision-making process
for a treatment of complicated healthcare conditions,
and so there's fascinating work on trying to apply
reinforcement learning to this set of multi-step reasoning,
where it's not about, well,
we'll send you for a treatment and then we'll never see you again for the rest of your life.
It's about here's the first thing you do then come back,
let's see what state you get to after taking this blood test,
or let's see what- state you get to after trying a drug,
and then coming back in a week to see what has happened to symptoms.
But I think that, um,
these are all sectors where reinforcement learning,
uh, is making inroads, um,
or, or even actually, stock trading.
Okay, maybe not the most inspiring one,
but one of my friends, um,
on the East Coast was, uh, uh, in,
in- was, uh- and just actually,
if, if you or your parents,
uh, invest in mutual funds,
this may be being used to, um,
buy and sell shares with them today,
depending on what bank they are investing with.
I know what bank is doing this, but I won't say it out loud.
Uh, but, um, uh, uh, but,
uh, if you want to buy or sell, you know, say,
a million shares of stock,
a, a very large volume of stock,
you may not want to do it in
a very public way because that will affect the price of the shares, right?
So if everyone knows that a very large investor
is about to buy a million shares or buy 10 million shares or whatever,
that will, um, cause the price to increase, uh,
and this, this is,
this disadvantages the person wanting to buy shares.
So there's been very interesting work on using reinforcement learning to, um,
decide how to sequence out your, your buy,
how to buy the stock in small lots,
and this trading market is called dark pools.
You could Google it if you're curious.
Actually, don't bother. Uh, uh, uh, but the idea is to,
to try to, um,
buy a very large lot of shares,
or sell a very large lot of shares, without affecting
the market price too much, because the way you affect
the market price always breaks against,
you know, is always against you,
it's always bad, right.
Um, so there's work like that as well.
So anyway, I think, um, uh,
many applications- I personally think that one of
the most exciting areas for reinforcement learning will be robotics,
but, uh, we'll, we'll see what,
what happens over the next few years, right?
Okay. All right.
So let's see,
well, just five more minutes.
Um, and, and just to wrap up, I think,
you know, um, uh,
we've gone through quite a lot of stuff.
I guess, uh, from supervised learning to, uh, learning theory,
and advice for applying learning algorithms, to unsupervised learning,
with, what was it, K-means, PCA, uh,
EM, mixture of Gaussians, uh,
factor analysis, and independent components analysis, to most recently,
reinforcement learning with the value function approaches,
fitted value iteration, policy search.
So, um, feels like we did, feels like,
feels like- I- feels like you've seen a lot of learning algorithms.
Um, go ahead.
[inaudible]
How is reinforcement learning compared to adversarial learning?
I think of those as pretty distinct literatures.
Uh, uh, yeah, yeah,
so I think, uh, and again, actually, I, I,
I know a lot of
non-publicly known facts about the machine learning world, but, uh,
one of the things that I actually happen to know is that, er,
uh, some of the ideas in adversarial learning, uh, uh, you know,
so can you tweak a picture by,
you know, very little bit,
by tweaking a bunch of pixel values that are not visible to human eye,
that fools the learning algorithm into thinking that this picture is actually a cat,
when it's clearly not a cat or whatever.
So I actually know that there are attackers out in the world
today using techniques like that to attack,
you know, websites, to try to fool, um, uh, ah,
you know- some of the websites I'm pretty sure you guys use,
and fool their anti-spam,
anti-fraud, anti-undermining democracy types of algorithms into,
um, [LAUGHTER] into making poor decisions.
Uh, so, so it's an exciting time to do machine learning, right?
[LAUGHTER] That, that, we get to fight battles like these [LAUGHTER]. Yeah, I'm sorry.
Um, uh, okay, um,
and, and I think, you know, I think with,
with- really, I think that with
the things that you guys have learned in machine learning,
I think all of you, um, uh,
are now very knowledgeable, right?
I think all of you are experts in all the ideas of core machine learning,
and I hope that, um- I,
I think you- when you look around the world,
there are so many worthwhile projects you
could do with machine learning and the number of
you that know these techniques is so small that I hope that,
um, you take these skills.
Um, and some of you will go, you know,
build businesses and make a lot of money, that's great.
Some of you will take these ideas and, uh,
help drive basic research at Stanford or at other institutions.
I think that's fantastic.
But I think whatever you are doing,
the number of worthwhile projects on the planet is so large and
the number of you that actually know how to use
these techniques is so small that I hope that,
um, you take these skills you're learning from
this course and go and do something meaningful,
go and do something that helps other people.
Um, I think we are seeing in Silicon Valley that there are a lot of ways, you know,
to build very valuable businesses, uh,
and some of you will do that and that's great,
but I hope that you do it in a way that helps other people.
Um, uh, I think, er,
over the past few years we've seen,
uh, um- I think that,
er, in Silicon Valley,
maybe 10 years ago,
the contract we had with society was that people would
trust us with their data and then we'll use their data to help them.
But I think in the past year,
that contract feels like it has been broken and
the world's faith in Silicon Valley has been shaken up,
but I think that places even more pressure on all of us, on all of you,
to make sure that, um,
the work you go out in the world to do is
work that actually is respectful of individuals,
respectful of individuals' privacy,
is transparent and open, and ultimately is,
uh, helping drive forward,
um, uh, humanity, or helping people,
or helping drive forward basic research,
or building products that actually help people,
rather than, um, exploit their foibles for profit but to their own harm.
So I hope that all of you will take your superpowers that you now have,
and, um, go out to,
to, to do meaningful work.
Um, and let's see,
um, and I think, uh- oh, and,
and lastly, just, just on a personal note,
I want, to, you know, thank all of you.
On behalf of the TAs and the whole teaching team and myself,
I want to thank all of you for your hard work.
Uh, sometimes, going over the homework problems,
I look at some of these problems and go, "Wow,
they got that problem, I thought that was really hard," or at the project milestones and go,
"Hey, that's really cool, look forward to seeing
your final project results at the final poster session."
So, um, I know that all of you have worked really hard.
Uh, uh, and if you didn't,
don't tell me, but I think almost all of you have [LAUGHTER] but,
but, but, I want to make sure you know,
I think it wasn't that long ago that I was a student,
you know, working late at night on homework problems and,
and I know that many of you have been doing that, uh, for the homework,
for studying for the midterm,
um, for working on your final term project.
So, um I want to make sure, um,
you know I'm very grateful for the hard work you put into this course,
and I hope that, um- I hope that, uh, your,
your hard earned skills will also reward you very well in the future,
and also help you do work that,
that you find meaningful,
so thank you very much [APPLAUSE].
