
So let's do the first stage.
We're going to select.
Well, how do we do selection in an MDP?
Well, we follow a policy, right?
So there's going to be some policy that we have.
Let's not worry about where it comes from right now.
But it's some policy, based on experience, that tells us what we ought to be doing in some state.
So let's say, for the sake of argument, I'm in this particular state, and my policy tells me that I should take a particular action.
And I take that action.
Let's say it's this one.
And it lands me in this state.
Well, once we end up in
this particular state,
let's say the policy still tells us
what to do, and it says take an action.
And I take that action.
I end up in yet another state.
And the policy says, well, from here you should take a particular action.
So I take that action, and
then I end up in this state.
Okay.
So, I have a policy.
The policy tells me how to select what to do, what actions I should take.
And it gets me through the tree,
until I get to a place where
I don't know what to do.
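As a concrete sketch of this selection step (all the names here, Node, tree_policy, and so on, are hypothetical illustrations, not anything specific from the lecture), in Python it might look like this:

```python
# A minimal, hypothetical sketch of the selection step.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # maps action -> child Node
        self.value = 0.0     # running average of sampled returns
        self.visits = 0

def select(root, tree_policy):
    """Follow the tree policy down from the root until we reach
    a node with no expanded children, i.e. a place where the
    policy no longer tells us what to do."""
    node = root
    while node.children:
        action = tree_policy(node)   # the policy picks an action here
        node = node.children[action]
    return node
```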
So at this point in the tree, I'm stuck. And in fact, just so you know,
the way I've drawn this tree, all the leaves of the tree are places where I don't know where to go next.
I don't have a policy for them.
I've got to figure out what to do.
So here's what I'm going to do.
I'm at this state, I don't know what
to do next, I have to do something, so
I'm going to do expansion, and
then I'm going to do simulation.
So, here's how the expansion step works.
From this state, I can take a bunch of actions, right? And from those actions I can get to a bunch of states.
In fact, I can just look at my transition model, and I can see all the possible next states I might end up in, based upon the actions that I take.
And often that's what you would do.
That's what you would do in
a normal kind of game tree search.
The problem here, of course, is that we might have many, many, many, many states.
And so, we don't want to
expand out the tree that much.
So instead, what we're going to do is a sort of sampling step.
We're going to say, well, for each of the actions that I might take, why don't I take that action and then simulate, for
one step, what state I might end up in.
And then do that for another action, and another, and so on and so forth, until I have a few possible next state-action pairs that I might see.
And maybe that number's really small.
Maybe it's six, maybe it's 100.
It sort of depends
upon your state space.
So I've done the expansion step.
I've figured out where I might end
up next given all the actions that I
might take.
So I'm not drawing this because there's not a lot of room. But each one of these edges represents some particular action I could have taken, or did take in my imagination, and the states that I ended up in are what the nodes represent.
So each one of these edges has some action that's associated with it. I'm just not writing it down because of space.
Okay?
>> Mm-hm.
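A rough sketch of this sampled expansion step, reusing the hypothetical Node class from above and assuming some sample_next_state function that draws one successor from the transition model:

```python
def expand(leaf, actions, sample_next_state):
    """For each available action, sample ONE next state from the
    transition model instead of enumerating every possible successor."""
    for action in actions:
        next_state = sample_next_state(leaf.state, action)  # s' ~ T(.|s, a)
        leaf.children[action] = Node(next_state, parent=leaf)
    return list(leaf.children.values())
```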
>> All right, so now I've got this.
So I've now sort of expanded the fringe.
Here's all the things that I might do,
and where I might end up next.
And now I have to use that to kind of
decide what I actually ought to do.
Now, if we were doing the normal kind of tree search, like we talked about with game search, I would just use my evaluation function.
But I don't have an evaluation function.

So instead, I'm going to do simulation.
And what that means is I'm
going to follow some other policy.
We typically have a name for that.
We call it the rollout policy. And just for the sake of discussion here, let's just say it's a random policy.
So I'm going to say I took
this particular action.
I ended up in this particular state.
And then I'm just going to behave randomly for a while.
And then I'm going to
do the same thing here.
>> [LAUGH]
>> Yeah, that looks random.
And I'm going to do that here.
And then I'm going to do that here.
And then I'm going to do that here.
And I'm going to do that here.
And I'm going to do that here.
And I'm going to get
a whole bunch of spaghetti.
Now, the spaghetti actually has some nice little bits in it. I know, spaghetti is delicious. But the spaghetti actually has all kinds of nice information associated with it.
As I move through this path, through this trajectory, by behaving randomly, say from this particular state, I actually see a bunch of rewards along the way.
And I can just take that out as far as I need to, given my horizon, say, or my discount factor.
So I do this for a long time.
And I collect rewards along the way.
That gives me an estimate of the value of being in this state and taking this particular action.
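Here's a minimal sketch of that rollout, assuming a hypothetical step(state, action) simulator that returns a (next_state, reward) pair; gamma and horizon stand in for the discount factor and how far out we bother to look:

```python
import random

def rollout(state, step, actions, gamma=0.9, horizon=50):
    """Behave randomly from `state`, accumulating discounted reward.
    Returns one sampled estimate of the value of starting here."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = random.choice(actions)   # the random rollout policy
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma                 # later rewards count for less
    return total
```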

And since I might take the same action multiple times and get to these states, I can now do a sort of average over all of these possibilities, and get lots and lots of estimates of this.
And then that gives me
an estimate of a Q value for
each of the actions that I might take.
So all I'm really doing is building an evaluation function by doing, well, Monte Carlo simulation from there.
So I'm in this state.
I took a bunch of actions.
That gives me a concrete
set of next states.
From there,
I just behave randomly for a while.
I use that to gather a bunch
of estimates of rewards.
I add them all up.
I average them appropriately.
And now that gives me
an estimate of the Q function.
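As a sketch of that averaging, building on the hypothetical rollout() and step() functions above: repeat the one-step sample plus random rollout many times, and average the discounted returns to estimate Q(s, a):

```python
def estimate_q(state, action, step, actions, gamma=0.9, n_rollouts=100):
    """Monte Carlo estimate of Q(state, action): sample the action's
    outcome repeatedly, roll out randomly from each outcome, and
    average the discounted returns."""
    returns = []
    for _ in range(n_rollouts):
        next_state, reward = step(state, action)
        returns.append(reward + gamma * rollout(next_state, step, actions, gamma))
    return sum(returns) / len(returns)
```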
But by the way, it does more than that.
Now that I have an estimate of the Q function for this state, backed up from here, I can actually back up information all the way to the top.
And that updates the estimate of this particular node, this particular node, and this particular node.
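And a sketch of that backup, again using the hypothetical Node class from earlier: walk the parent pointers from the leaf toward the root, folding each new sampled return into every ancestor's running average:

```python
def backup(node, sampled_return):
    """Propagate a sampled return from a leaf up to the root,
    updating each ancestor's visit count and value estimate."""
    while node is not None:
        node.visits += 1
        # Incremental running average of all returns seen through this node.
        node.value += (sampled_return - node.value) / node.visits
        node = node.parent
```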
