So this lecture is going to be on reinforcement learning.
Um, I will, in the interest of time,
skip the, the quiz.
So, so the way to think about how
reinforcement learning fits into what we've done so far is,
you remember this class has this picture, right?
So we talk about different models and we talk about different algorithms,
inference algorithms to be able to predict using these models and answer queries,
and then we have learning which is,
how do you actually learn these models, right?
So every type of model we go through,
we have to kind of check the boxes for each of these [NOISE] pieces.
So last lecture, we talked about Markov decision processes.
This is a kind of a modeling framework,
allows you to define models.
For example, for crossing volcanoes or playing dice games or tram, taking trams.
Um, what about inference?
So what do we have here last time?
We had value iteration, which allows you to compute the optimal policy, and policy
evaluation, which allows you to estimate the value of
a particular policy.
So these are algorithms that
operate on an MDP, right?
And we sort of looked at these algorithms last time.
So this lecture is gonna be about learning.
Uh, I'll just put RL for now.
RL is not an algorithm; it
refers to a family of algorithms that fits in, uh, this box.
Um, but that's the way you should think about it.
RL allows you to,
um, either explicitly or implicitly estimate MDPs.
And then once you have that,
you can do all these, um, uh,
inference algorithms to, uh,
figure, uh, what the optimal policy is.
Okay? [NOISE] So just to review.
Um, so what is the MDP?
Um, the clearest way to think about it, remember,
is in terms of a graph.
So you have a set of states.
So in this dice game,
we have in and end.
So we have a set of states.
From every state, you have a set of actions coming out.
So in this case,
uh, stay and quit.
Um, the actions take you to chance nodes, uh,
where the- uh, you don't get to control what happens,
but nature does and there's randomness.
So out of these chance nodes are transitions.
Each transition takes you into a state,
it has some probability associated with it.
So two-thirds in this case.
It also has some reward associated with it which you pick up along the way.
So naturally, this has to be one-third,
with reward 4, and remember from last time,
quit had probability 1 and reward 10.
Okay. So, um, and then there is, you know, uh,
the discount factor Gamma,
which is a number between 0 and 1 that tells you how much you value the future.
By default, you can think about it as 1, for simplicity.
Okay. So this is a Markov decision process.
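A minimal Python sketch of the dice game as a transition table may help make the graph concrete. The dictionary layout and the names `transitions`, `GAMMA`, and `actions` are illustrative assumptions, not notation from the lecture; the probabilities and rewards are the ones on the graph.

```python
# Dice game MDP written as a table over chance nodes (state, action).
# Each chance node maps to a list of (next_state, probability, reward).
GAMMA = 1.0  # discount factor; taken to be 1 for simplicity

transitions = {
    ("in", "stay"): [("in", 2 / 3, 4.0), ("end", 1 / 3, 4.0)],
    ("in", "quit"): [("end", 1.0, 10.0)],
}

def actions(state):
    """Actions available in a state; the end state has none."""
    return sorted(a for (s, a) in transitions if s == state)

# Sanity check: probabilities out of every chance node sum to 1.
for (s, a), outcomes in transitions.items():
    total = sum(p for _, p, _ in outcomes)
    assert abs(total - 1.0) < 1e-9, (s, a, total)
```

States here are just strings, and the end state is terminal because no chance node starts from it.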
Um, and what do you do with one of these things?
[NOISE] We, um, have a notion of a policy and a policy,
um, [NOISE] see, I'll write it over here.
So a policy denoted Pi. Uh, let me use green.
Um, so a policy, Pi,
uh, is a mapping from states to actions.
When you apply a policy,
it says, "When I land here, where should I go?
Should I do stay or quit?"
I mean, this is kind of a simple MDP.
Otherwise, there'd usually be more states, and for every state,
every blue circle, the policy will tell you where to go.
Um, and when you run a policy, uh, what happens?
Uh, you get a path,
um, which I'm going to call an episode. So what do you do?
You start in state S_0,
that's- that will be in.
In this particular example, um,
you take an action a_1, let's say stay.
Uh, you get some reward,
in this case it will be 4.
You end up in a new state, um, oops, S_1.
And suppose you go back to in and,
uh, then you take another action,
maybe it's stay, reward is 4 again and,
and so on, right?
So this sequence is a path or in RL speak,
it's, uh, an episode.
Um, let's see.
So let me- let me erase this comment.
Uh, so this is an episode.
Um, and it goes on until you hit the end state.
Um, and, uh, out of the episode,
you can compute a utility,
which we're gonna denote U, which is the discounted sum of rewards along the way, right?
So if you, um, you know,
stayed four times, coming back to in three times and then going to end, you would have, uh,
a utility of 4 plus 4 plus 4 plus 4, so that'll be 16.
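As a small sketch of the utility computation just described, the function below sums the rewards of an episode with discounting. The function name `utility` and the list-of-rewards representation are assumptions made for illustration.

```python
def utility(rewards, gamma=1.0):
    """Discounted sum r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Four rewards of 4 with no discounting, as in the example above.
u = utility([4, 4, 4, 4])
```

With Gamma equal to 1 this is just the plain sum of the rewards; a smaller Gamma shrinks the contribution of later rewards.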
Okay? So in the last lecture,
we didn't really work with
the episodes and utilities directly,
because we were able to define a set of recurrences that
computed the expected utility.
So, uh, remember that we want to- you know,
we don't know what's going to happen.
So, uh, there's a distribution,
and in order to optimize something,
we have to turn it to a number,
that's what expectation does.
Um, so there's two,
uh, concepts that we had from last time.
One is the value function of a particular policy.
So V_Pi of S is the expected utility if you follow Pi from S. What does that mean?
That means, if you take a particular S, let's take, uh, in,
and I put you there,
and you run the policy,
so stay and you traverse this graph,
um, you will have different utilities coming out and the average of
those is going to be V_Pi of S. Similarly,
there's a Q value, Q_Pi of (s, a),
which is the expected utility
if you first take an action a from a state s and then follow Pi.
So what does that mean? That means if I put you on one of these, uh,
red chance nodes and you basically play out the game,
um, and average the resulting utilities that you get, what number do you get?
Okay? [NOISE] Um, and we saw recurrences that related these two.
So for V_Pi of s,
the name of the game with a recurrence
is to delegate to some kind of simpler problem.
So you first
look up what you're supposed to do in s, that's
Pi of s, [NOISE] and that takes you to a chance node, which is
(s, Pi(s)), and then you say, "Hey,
how much utility am I going to get from that node?" So V_Pi(s) = Q_Pi(s, Pi(s)).
And similarly from the chance nodes,
you have to look at all the possible successors s-prime, and take
the probability of going into that successor
times the immediate reward that you get along the edge plus the discounted
value of the future when you end up in s-prime.
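The two recurrences can be sketched as iterative policy evaluation on the dice game. This is a minimal sketch under assumptions: the table `T`, the function name `policy_evaluation`, and the fixed iteration count are illustrative choices, not the lecture's code.

```python
# Dice game: chance node (state, action) -> list of (next_state, probability, reward).
T = {
    ("in", "stay"): [("in", 2 / 3, 4.0), ("end", 1 / 3, 4.0)],
    ("in", "quit"): [("end", 1.0, 10.0)],
}

def policy_evaluation(pi, gamma=1.0, iterations=200):
    """Repeatedly apply V(s) = Q(s, pi(s)), where
    Q(s, a) = sum over s' of T(s, a, s') * (Reward(s, a, s') + gamma * V(s'))."""
    V = {"in": 0.0, "end": 0.0}
    for _ in range(iterations):
        new_V = {"end": 0.0}  # the end state is terminal, so its value stays 0
        for s, a in pi.items():
            new_V[s] = sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
        V = new_V
    return V

V_stay = policy_evaluation({"in": "stay"})
V_quit = policy_evaluation({"in": "quit"})
```

For the stay policy the recurrence reads V = 4 + (2/3) V, whose solution is 12, and for the quit policy the value is 10.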
Okay. So any questions about this?
This is kind of review of, uh,
Markov decision processes from, um, last time.
Okay. So now we're about to do something different.
Okay. So, um, if you say goodbye to the transitions and rewards,
that's called reinforcement learning.
So remember, with Markov decision processes,
I give you everything here and you just have to find the optimal policy.
And now, I'm gonna make life difficult by not even telling you
what the rewards and the transitions are.
Okay. So just to get a, kind of flavor of what that's like.
Um, let's play a game.
So, um, I'm going to need a volunteer.
I'll, I'll give you the game,
but this volunteer, you have to have a lot of, uh,
grit and, uh, persistence,
because this is not gonna be [NOISE] an easy game.
You have to be one of those people that even though you're losing a lot,
uh, you're still gonna not give up.
Okay. So here's how the game works.
Um, so for each round, r equals,
uh, 1, 2, 3,
4, 5, 6, and so on.
You're just going to choose A or B,
um, red pill or blue pill, I guess.
Um, and you, you move to a new state.
So the state is here and you get some rewards which I'm gonna show here.
Okay. And the state is 5,
0, that's the initial state.
Okay. So everything clear about the rules of the game?
[LAUGHTER] That's reinforcement learning, right?
[LAUGHTER] We don't know anything about how.
Okay. So any volunteers.
Um, how about you in the front? Okay.
Okay.
Okay. Let me, let me fix that. A.
A, A, [LAUGHTER] [NOISE] [LAUGHTER] B, B, A, [LAUGHTER] A.
It's a MDP, so,
uh, in that case that helps.
B, B, B, B,
B, just infinitely click B with an A, I guess.
[LAUGHTER] It's like I'm losing a point every time.
I warned you. [LAUGHTER] Okay.
A, A, A, A, B, A,
A, A, A, A, A. [LAUGHTER]
Okay. [APPLAUSE] I'm glad this worked because last time it took a lot longer [LAUGHTER].
Um, but, you know,
so what did you have to do?
I mean you don't know what to try so you try A and B.
And then hopefully you're building an MDP in your head, right?
Yeah, right? [LAUGHTER] Okay. Just smile and nod.
Um, and you have to figure out how the game works, right?
So maybe you noticed that hey, A is, you know,
decrementing and B isn't going up but then there's this other bit that gets flipped.
So, um, okay you figure this out,
and in the process you're also trying to maximize reward which, uh,
apparently I guess wasn't - doesn't come until
the very end because, um, it's a cruel game.
[LAUGHTER].
Okay. So how do we get an algorithm to kind of do this and
how do we think about, uh, us doing this?
So just to kind of make the contrast between MDPs and reinforcement learning sharper,
so a Markov decision process is an offline thing, right?
So you already have a mental model of how the world works.
That's the MDP, that's all the rewards and the transitions and the states and actions.
And you have to find a policy to collect maximum rewards.
You have it all in your head,
so you just kind of think really hard about,
you know, what is the best thing.
It's like "Oh, if I do this action then I'll go here" and,
you know, look at the probabilities, take the max of whatever.
So reinforcement learning is very different.
You don't know how the world works.
So you can't just sit there and think because
thinking isn't going to help you figure out how the world works.
Um, so you have to just go out and perform actions in the world, right?
And in doing so you - hopefully you'll learn something but also you'll,
um, you'll get some rewards.
Okay so-so to maybe formalize the,
um, the paradigm of RL.
So you can think about it as an agent.
That's, uh, that's you.
Uh, and you have the environment,
which is everything else that's not the agent.
The agent takes actions.
So that sends an action to the environment, and the
environment just sends you back a reward and a new state.
And you keep on doing this.
Um, so what you have to do is figure out, first of all, how am I going to act?
If I'm in a particular state S_t minus 1,
what action a_t should I choose, okay?
So that's one question.
And then you're gonna get this reward r_t and observe a new state s_t.
What should I do to update my mental model of the world, okay?
So these are the main two questions.
I'm going to talk first about how to update the parameters and then
later in the lecture I'm going to come back to
how do you actually go and, you know, explore it.
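The agent and environment loop can be sketched as follows. Everything named here (`DiceGameEnv`, `step`, `run_episode`, the random action chooser) is a hypothetical illustration; the dice game dynamics are hidden inside the environment, and the learning update is deliberately left out because it is what the algorithms below fill in.

```python
import random

class DiceGameEnv:
    """Hypothetical environment for the dice game; the agent cannot see inside."""

    def __init__(self, rng):
        self.rng = rng
        self.state = "in"

    def step(self, action):
        """Apply an action and return (reward, new_state)."""
        if action == "quit":
            reward, self.state = 10.0, "end"
        else:  # "stay": reward 4, then the game ends with probability 1/3
            reward = 4.0
            self.state = "end" if self.rng.random() < 1 / 3 else "in"
        return reward, self.state

def run_episode(env, choose_action):
    """Agent loop: choose an action, receive a reward and a new state, repeat."""
    episode = []  # list of (state, action, reward, new_state)
    state = env.state
    while state != "end":
        action = choose_action(state)
        reward, new_state = env.step(action)
        episode.append((state, action, reward, new_state))
        state = new_state  # a learning algorithm would update its estimates here
    return episode

rng = random.Random(0)
episode = run_episode(DiceGameEnv(rng), lambda s: rng.choice(["stay", "quit"]))
```

The two questions in the text map onto `choose_action` (how to act) and the commented spot in the loop (how to update).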
Okay. So I'm not going to say much here but, you know,
in the context of volcano crossing, um,
just to kind of think through things,
every time you play the game, right?
You're gonna get some utility.
So this is the episode over here,
the sequence of a, r, s.
Sometimes you fall into a pit.
Sometimes you go to a hut.
Um, and based on these experiences,
if I hadn't told you what any of
the actions do or what the slip probability is or anything,
how would you kind of go about,
um, kinda solving this problem?
That's a -that's a question.
Okay so there's a bunch of algorithms.
I think there's gonna be 1, 2, 3, 4.
At least four algorithms that we're going to talk about with different characteristics.
But they're all going to kind of build onto each other in some way.
So first class of algorithms is Monte Carlo methods, right?
So, um, okay.
So whenever you're doing RL or any sort of learning,
uh, the first thing you get is you just have data.
Let's, let's suppose that you run even a random policy,
you're just gonna -because in the beginning you don't know any better,
so you're just going to try random actions and, uh,
but in the process you're gonna see "Hey,
I tried this action and it led to this reward and so on".
So in a concrete example just to make,
uh, things a little bit more crisp,
it's gonna look something like in, uh,
and then you take,
uh, you know you did, um, let's see.
Let me try to color coordinate this a little bit.
Um, so you're in the state in,
you do, um, stay.
And then you get a reward of 4 and then you're back in the state in,
you do a stay,
and then you get 4 and then maybe you're done, you're out.
Okay. So this is an example episode just to make things concrete.
So this is s_0,
a_1, r_1, s_2, s_1.
I keep on incrementing too quickly.
Um, a_2, r_2, s_3, okay?
Okay so what should you do here?
Alright so, um, any ideas? Model-based Monte Carlo.
So if you had the MDP, you would be done.
But we don't have the MDP, we have data.
So what can we do?
[NOISE] Yeah.
[inaudible].
Yeah. Let's try to build an MDP from that data.
Okay. So, um, the key idea is estimate the MDP.
Um, so intuitively,
we just need to figure out what
the transitions and rewards are and then we're done, right?
Um, so how do you do the transitions?
Um, so the transition says if I'm in state S and I take action A, what will happen?
I don't know what will happen, but let's see in the data what will happen.
So I can look at the number of times I went into a
particular S prime and divide it by
the number of times I attempted this action from
that state at all, and just take the ratio, okay?
And for the rewards,
this is actually fairly easy,
because when I observe a reward
from S, A and S prime,
I just write it down and say that's the reward, okay?
Okay. So on the concrete example what does this look like?
So remember now, here's the MDP graph.
I don't know what the -the, uh,
transition distribution or the rewards are.
Um, so let's suppose I get this trajectory.
What should I do? So I get stay, stay,
stay, stay, and I'm out, okay?
So first I, I can write down the rewards of 4 here,
and then I can, um,
estimate the probability of, you know, transitioning.
So three out of four times I went back to in.
One out of four times I went to end.
So I'm gonna estimate these as three-fourths and one-fourth.
Okay. But then suppose I get a new data point.
So I have stay, stay,
end. So what do I do?
I can add to these counts, um.
So everything is kind of cumulative.
So two more times,
I'm sorry, one more time I went into in and another time I went to end,
so this becomes four out of six for in and two out of six for end.
And suppose I see another time when I just go into end,
so I'm just going to increment, uh,
this counter and now it's three out of seven for end and four out of seven for in, okay?
So pretty, um, pretty simple.
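The counting procedure above can be sketched in Python as follows. The names `counts`, `totals`, `reward_hat`, `observe`, `T_hat`, and the helper `stays` are assumptions for illustration; the three episodes are the ones from the worked example.

```python
from collections import defaultdict

counts = defaultdict(int)   # counts[(s, a, s2)]: times this transition was seen
totals = defaultdict(int)   # totals[(s, a)]: times action a was taken in state s
reward_hat = {}             # reward_hat[(s, a, s2)]: reward observed on that edge

def observe(episode):
    """Add one episode, given as a list of (s, a, r, s2), to the running counts."""
    for s, a, r, s2 in episode:
        counts[(s, a, s2)] += 1
        totals[(s, a)] += 1
        reward_hat[(s, a, s2)] = r

def T_hat(s, a, s2):
    """Estimated transition probability; undefined if (s, a) was never tried."""
    return counts[(s, a, s2)] / totals[(s, a)]

def stays(n):
    """Episode with n stays: the first n - 1 return to in, the last goes to end."""
    return [("in", "stay", 4.0, "in")] * (n - 1) + [("in", "stay", 4.0, "end")]

observe(stays(4))
after_first = (T_hat("in", "stay", "in"), T_hat("in", "stay", "end"))   # 3/4, 1/4
observe(stays(2))
after_second = (T_hat("in", "stay", "in"), T_hat("in", "stay", "end"))  # 4/6, 2/6
observe(stays(1))
after_third = (T_hat("in", "stay", "in"), T_hat("in", "stay", "end"))   # 4/7, 3/7
```

Note that `T_hat` divides by zero for a chance node that never appears in the data, which is exactly the situation for quit in this example.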
Okay so for reasons I'm not going to get into,
this process actually, you know,
converges: if you do this kind of, uh, you know,
a million times, your estimates will get pretty,
um, accurate. Yeah, question?
Yes, the question is, you don't know the rewards or the transitions, uh,
but yes you do know the set of,
ah, states and the actions.
Set of states, I guess, you don't have to know them all in advance,
but you just observe them as they come.
The actions, you need to know because you-
you are an agent and you need to play the game.
Yeah, good question. Okay. So, yeah.
Does this work with variable costs?
Like, if there is a variable reward,
where there's a probability you get some reward and a probability [inaudible].
Yeah. So the question is,
does this work with variable, uh, rewards.
Um, and if the reward is not a deterministic function of
s, a, s prime, you would just take the average of the rewards that you see.
Yeah. Okay. So- so what do you do with this?
So after you estimate the MDP,
so all you need is the transitions and rewards.
Um, then now we have an MDP.
It may not be the exact right MDP,
because this is estimated from data so it's not gonna match it exactly,
um, but nonetheless, we already have these tools from last time.
You can do value iteration to compute, um,
the optimal policy on it and then you just,
you know, you're done, you run it.
In practice, you would probably kind of interleave
the learning and the optimization but, uh,
for simplicity we can think about it as a two-stage process where you gather a bunch of data,
you estimate the MDP and then you are off.
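The second stage can be sketched as value iteration run on an estimated MDP. This is a sketch under assumptions: `mdp_hat` is a hypothetical estimate in which stay has the 4/7 and 3/7 probabilities from the counting example and quit is assumed to have been observed with reward 10; `q_value` and `value_iteration` are illustrative names.

```python
# Hypothetical estimated MDP: chance node -> list of (next_state, probability, reward).
mdp_hat = {
    ("in", "stay"): [("in", 4 / 7, 4.0), ("end", 3 / 7, 4.0)],
    ("in", "quit"): [("end", 1.0, 10.0)],
}

def q_value(mdp, V, s, a, gamma):
    """Expected reward plus discounted value of the successor; terminal states are worth 0."""
    return sum(p * (r + gamma * V.get(s2, 0.0)) for s2, p, r in mdp[(s, a)])

def value_iteration(mdp, gamma=1.0, iterations=200):
    """Return (V, policy) for the given MDP table."""
    states = {s for (s, _) in mdp}
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: max(q_value(mdp, V, s, a, gamma) for (s1, a) in mdp if s1 == s)
             for s in states}
    policy = {s: max((a for (s1, a) in mdp if s1 == s),
                     key=lambda a: q_value(mdp, V, s, a, gamma))
              for s in states}
    return V, policy

V_hat, policy_hat = value_iteration(mdp_hat)
```

With only seven observed stay transitions, this estimate makes quit look better (value 10 versus roughly 9.7 for stay), whereas in the true dice game staying is worth 12. That is one concrete way the estimated MDP can differ from the right one until more data comes in.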
Okay. There's one problem here.
Does anyone know what the problem might be?
You can actually see it by looking on the slide. Yeah.
Well, with your stay-based policy,
you'll never explore the quit branch of the world.
Yeah, yeah. You didn't explore this at all,
so you actually don't know how much reward is here.
Maybe it's like, uh, you know, 100, right?
So this is the problem,
and it's actually a pretty big problem: unless you have a policy that
actually goes and covers all the states,
you just won't know, right?
And this is kind of natural because there can always be, you know,
a lot of reward hiding under
one state, but unless you see it, you just don't know.
Um, okay.
So this is a kind of key idea,
key challenge I would say,
in reinforcement learning is exploration.
So you need to be able to explore,
um, the state space.
This is different from normal machine learning where data just
comes in passively and you learn your nice function and then you're done.
Here, you actually have to figure out how to get the data,
and that's- that's kind of one of the,
the key challenges of RL.
So we're gonna go back to this- this problem,
and I'm not really gonna, uh,
try to solve it now.
Um, for now you can just think about Pi as
a random policy because a random policy eventually will just,
you know, hit everything for, you know,
finite, uh, small, uh, state spaces.
Okay. So, um, okay.
So that's basically end of the first algorithm.
Let me just write this over here.
So algorithms, we have model-based, um, Monte Carlo.
And the model-based part is referring to the fact that we're
estimating a model, in particular the MDP.
The Monte Carlo part is just referring to the fact that we're using samples
to estimate the model: you're
basically applying a policy multiple times and then estimating
the model based on averages.
Okay. So- so now,
I'm going to present a- a different algorithm and it's called,
uh, model-free Monte Carlo.
And you might from the name
guess what we might want to do is maybe we don't have to estimate this model, okay?
And why- why is that?
Well, what do we do with this model?
Um, what we did was we, you know, uh,
presumably use value iteration to,
um, you know, compute the optimal policy.
And the- remember this, uh, recurrence, um,
for computing Q_opt, um,
it's in terms of T and reward,
but at the end of the day all you need is Q_opt.
If I told you, um,
Q_opt (s, a) which is,
um, what is Q_opt (s, a)?
It's the, um, the maximum expected utility I could get if I'm in
chance node (s, a) and I follow the optimal policy.
So clearly if I knew that,
then I would just produce the optimal policy by picking the best action in each state, and I'd be done.
I don't even need to know the rewards and transitions.
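To illustrate why Q_opt alone is enough, here is a minimal sketch that reads off a policy by taking the best action in each state. The function name `greedy_policy` is an assumption, and the two Q values are the dice game's values with Gamma equal to 1, filled in by hand for illustration; in reinforcement learning they would have to be estimated.

```python
# Illustrative Q_opt values for the dice game (gamma = 1).
Q_opt = {("in", "stay"): 12.0, ("in", "quit"): 10.0}

def greedy_policy(Q):
    """pi(s) = the action a that maximizes Q(s, a)."""
    best = {}
    for (s, a), q in Q.items():
        if s not in best or q > Q[(s, best[s])]:
            best[s] = a
    return best

policy = greedy_policy(Q_opt)
```

No transition probabilities or rewards appear anywhere in this computation, which is the point of going model-free.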
Okay. So that
insight is what leads to model-free learning,
which is that we're just going to try to estimate Q_opt,
um, you know, directly.
Um, sometimes it can be a little bit confusing what is meant by model-free.
So Q_opt itself you can think about as a- as a model,
but in the context of MDPs and reinforcement learning,
when people say model-free, they generally mean that there's no MDP model,
not that there is no, um, model in general.
Okay. So, um, so we're not gonna get to Q_opt, uh, yet.
Um, that will come later in the lecture.
So let's warm up a little bit.
Um, so here's our data staring at us.
Um, remember- let's, let's look at a related quantity, so Q Pi.
Remember what Q Pi is.
Q Pi (s, a) is the expected utility if we start at
s and you first take action a and then follow policy Pi, right?
So I guess another way to write this is,
if you are at a particular
time step t, you can define u_t as
the discounted sum of the rewards from that point on, which is, you know,
the reward that you get immediately, plus
the discounted reward at the next time step, plus
the reward two time steps in the future
discounted by Gamma squared, and so on.
And, um, what you can do is you can try to estimate Q Pi from this utility.
Right? So this is the utility
that you get at a particular time step.
So suppose you do the following.
So suppose you average the utilities that you get only on
the time steps where I was in a particular state s and I took an action a.
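Written out as formulas, under the indexing used for episodes earlier (s_0; a_1, r_1, s_1; a_2, r_2, s_2; and so on), the utility from time step t and the averaging estimate look roughly like this. The hat marks a quantity estimated from data.

```latex
u_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots

\hat{Q}_\pi(s, a) = \text{average of } u_t \text{ over the time steps } t
\text{ where } s_{t-1} = s \text{ and } a_t = a
```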
Okay. So you have a- let's suppose you have a bunch of episodes, right?
So, um, here pictorially, um,
uh, let's see.
[NOISE]
Here's another way to think about it.
So I get a bunch of episodes.
I'm gonna do- do some abstract, um, drawing here.
Um, so every time you have you know,
s, a shows up here,
maybe it shows up here,
maybe it shows up here,
maybe it shows up here,
you're going to look at how much reward do I get from that point on?
How much reward do I get from here on?
How much reward do I get from here on?
And, um, average them, right?
So there's a kind of technicality,
which is that if (s, a) appears here and it also appears
again later in the same episode, then I'm not going to count the later one, because
if I count both I'm kind of double counting.
Um, in fact it works both ways, but
conceptually it's easier to think about episodes where
you don't go back to the same (s, a).
Okay, so let's do that on a concrete example.
So Q-pi, let's just write it.
Q-pi s, a is a thing where we're trying to estimate and this is,
uh, a value associated with every chance node s, a.
So in particular, I've drawn it here.
I need a value here and,
uh, a value here.
Okay? So suppose I get some data,
I stay and then I go to end.
Uh, so what's my utility here? It's not a trick question.
4.
4, yes. Um, sum of 4 is 4.
Okay, so now I can say, "Okay it's 4."
And that's my best guess so far.
I mean, I haven't seen anything else, maybe it's 4.
Um, so what happens if I play the game again and I get 4, 4?
So what's the utility here?
8.
8? So then I update this to the average of 4 and 8,
do it again, I get 16 then I average,
uh, in the 16.
Okay? And, um, and again, you know,
I'm using stay so I don't learn anything about this quit node; in practice you would
actually go explore this and figure out how much utility you're seeing there.
So in particular, notice I'm not updating
the rewards nor the transitions because I'm model-free,
I just care about the Q values that I get which
are the values that sit at the nodes not on the edges.
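Model-free Monte Carlo as described can be sketched like this. The names `first_visit_utilities`, `update`, and `Q_hat` are illustrative assumptions; following the technicality above, only the first occurrence of each (s, a) in an episode is counted. The three episodes reproduce the utilities 4, 8, and 16 from the example.

```python
from collections import defaultdict

def first_visit_utilities(episode, gamma=1.0):
    """Yield ((s, a), u_t) for the first time each (s, a) appears in the episode.

    The episode is a list of (s, a, r); u_t is computed by scanning backward."""
    u = 0.0
    returns = []
    for s, a, r in reversed(episode):
        u = r + gamma * u
        returns.append(((s, a), u))
    returns.reverse()
    seen = set()
    for key, u_t in returns:
        if key not in seen:
            seen.add(key)
            yield key, u_t

total = defaultdict(float)  # sum of utilities seen from each chance node
count = defaultdict(int)    # number of utilities averaged for each chance node

def update(episode):
    for key, u_t in first_visit_utilities(episode):
        total[key] += u_t
        count[key] += 1

def Q_hat(s, a):
    return total[(s, a)] / count[(s, a)]

estimates = []
for n_stays in (1, 2, 4):   # utilities 4, 8, 16
    update([("in", "stay", 4.0)] * n_stays)
    estimates.append(Q_hat("in", "stay"))
```

The estimates go 4, then 6 (the average of 4 and 8), then 28/3 (the average of 4, 8, and 16). Nothing is stored on the edges: no transition counts and no rewards, only a running average at each chance node.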
Okay, so one caveat is that we are estimating Q-pi not Q-opt.
We'll revisit this, um, later.
Um, and another, uh,
thing to kind of note is the difference between what is called On-policy and Off-policy.
Okay? So in reinforcement learning,
you're always following some policy to get around the world right?
Um, and that's generally called the exploration-policy or the control policy um,
and then there's usually some other thing that you're trying to estimate,
usually the- the value of
a particular policy and that policy could be the same or it could be different.
So On-policy means that, uh,
we're estimating the value of
the policy that we're following, the data-generating policy.
Off-policy means that we're not.
Okay? So in particular, is
model-free Monte Carlo On-policy or Off-policy?
It's On-policy because I'm estimating Q-pi not Q-opt.
Okay? That's On-policy.
Um, and
what about model-based Monte Carlo?
[NOISE] I mean it's a little bit of a slightly weird question,
but in model-based Monte Carlo,
we're following some policy,
maybe even a random policy,
but we're estimating the transitions and rewards,
and from that we can compute the optimal policy.
So you can think about it as, um,
Off-policy but, you know,
that's maybe not, uh, completely standard.
Okay. So any questions about what model-free Monte Carlo is doing?
So let me just actually write.
So what model-based Monte Carlo is doing is
trying to estimate the
transitions and rewards, and model-free Monte Carlo is trying to estimate
Q-pi.
Um, okay? And just as a note,
I put hats on
any letter that is supposed to be a quantity
that is estimated from data, and that's what, you know,
I guess statisticians do, to differentiate them. Whenever I write Q-pi without a hat,
that's the true
value of that
policy which, you know, I don't have.
Okay, any questions about model-free Monte Carlo?
Both of these algorithms are pretty simple, right?
You just, you know,
you look at the data and you take averages. Yeah.
So model free is not trying to optimize [inaudible] policy.
So the question is is model-free,
uh, making changes to a policy or is it a fixed policy?
So- so this version I've given you is only for a fixed policy.
The general idea of model-free as we'll see later,
uh, you can also optimize the policy.
Okay. So- so now what we're gonna do is we're gonna,
uh, do theme and variations on, uh, model-free Monte Carlo.
Actually it's going to be the same algorithm, but I just want to interpret
it in slightly different ways that'll help us
generalize it in the future. Yeah.
Are there certain problems where model-free does better than model-based?
Are there certain problems where model-free is better than model-based?
So this is actually a really interesting question, right?
So, um, you can show that if your model is correct,
if your model of the world is correct,
model-based is kind of the way to go because it'll be more sample efficient,
meaning that you need fewer, uh, data points.
But it's really hard to get the model correct in the real world.
So recently, especially with,
you know, deep reinforcement learning,
people have gotten a lot of mileage by just going model-free because then, um,
jumping ahead a little bit, you can model this as
a kind of a deep neural network and that gives
you extraordinary flexibility and power without having to solve the hard problem of,
you know, constructing the MDP.
Okay. So- so there's kind of three ways you can think about this.
So the first, we already talked about it,
is, you know, this average idea.
So we're just looking at the utilities that you
see whenever you encounter an s and a, and you just average them.
Okay. So here is an equivalent formulation.
Um, and the way it works is that for every
(s, a, u) that you see,
so every time you see a particular s,
a, u, s, a, u, s, a,
u and so on,
I am going to perform the following update.
So I'm gonna take my existing value and I'm
going to do what we call a convex combination:
Q-pi of (s, a) becomes (1 minus eta) times Q-pi of (s, a), plus eta times u.
So, you know, 1 minus eta and eta sum to 1.
So it's, you know, a kind of balancing between two things.
Balancing between the old value that I had and the new utility that I saw.
Okay? And the eta is set to be 1 over (1 plus the number of updates to (s, a) so far).
Okay? So let me do a concrete example.
I think it'll make very clear what's going on.
So suppose my data looks like this.
So I get, uh, 4,
um, and then a 1 and a 1.
Um, so these are the utilities, right?
That's- that's a U here.
I'm ignoring the s and a,
I'm just going to assume that they're something.
Okay, so first, uh,
let's assume that Q-pi is 0, okay?
So the first time I do, um, uh,
let's see, the number of updates is 0,
I haven't done anything, so eta is 1 and 1 minus eta is 0.
So 0 times 0 plus 1 times 4, which is the first u that comes in.
Um, okay, so this is 4, okay?
So then what about the next data point that comes in?
So I'm gonna take, um, one-half now times 4 plus one-half times 1,
which is the new value that comes in.
And that is, I'm gonna write it as (4 plus 1) over 2, okay?
So now- okay just to keep track of things,
this results in this,
this results in this,
and then now, um,
I'm running out of space but hopefully we can, so now on the third one,
eta is one-third, so I do
two-thirds times (4 plus 1) over 2, which is
the previous value that's sitting in Q-pi, plus one-third times 1,
which is the new value,
and that gives me
(4 plus 1 plus 1) over 3, right?
So you can see what's going on here is that, you know,
each time I have this, you know,
sum over all the u's I've seen over the number of times it
occurs, and this eta is set so that next time I kind
of cancel out the old count and I add
the new count to the denominator, and it all works out so that at
every time step what actually is in
Q-pi is just a plain average over all of the numbers I've seen before.
All right, this is just kind of an algebraic trick to, um,
get this original formulation,
which is a notion of average,
into this formulation which is a notion of, um,
kind of you're trying to, um,
take a little bit of the old thing and add a little bit of a new thing.
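The equivalence between the convex-combination update and the plain average can be checked with a short sketch. The function name `incremental_average` is an assumption; the data 4, 1, 1 is the example from the board.

```python
def incremental_average(utilities):
    """Apply Q <- (1 - eta) * Q + eta * u with eta = 1 / (1 + number of updates so far)."""
    q, num_updates = 0.0, 0
    history = []
    for u in utilities:
        eta = 1.0 / (1 + num_updates)
        q = (1 - eta) * q + eta * u
        num_updates += 1
        history.append(q)
    return history

history = incremental_average([4, 1, 1])
running_means = [4 / 1, (4 + 1) / 2, (4 + 1 + 1) / 3]
```

Up to floating-point rounding, `history` matches `running_means`: 4, then 2.5, then 2. The first update uses eta equal to 1, so the initial value of 0 is thrown away entirely.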
Okay. So [NOISE], um,
I guess I'm going to call this, uh,
the convex combination view.
So that's the second interpretation.
There's a third interpretation here which,
uh, you can think about is,
uh, in terms of stochastic gradient descent.
So this is actually a kind of a,
uh, simple algebraic manipulation.
So if you look at this expression, what is this?
So you have 1 times Q Pi,
so I'm gonna pull it out and put it down here
and then I'm gonna have minus eta times Q Pi,
that's this thing, and then I also have eta times u,
so I'm going to put minus u here, and this is
inside this parenthesis: Q Pi minus eta times (Q Pi minus u).
So if you just, you know,
do the algebra you can see that these two,
you know, are equivalent.
Uh, so what's the point of this?
Right, so, um, where have you kind of seen this, uh,
before, something like, maybe not,
not this exact expression but something like that [NOISE].
Any ideas? Yeah, when we looked at stochastic gradient descent in the context of,
uh, the squared loss for linear regression.
Right, so remember, uh,
we had these updates that
all looked like kind of prediction minus target which was,
you know, the residual and that was used to kind of update.
So one way to interpret this is
that it's implicitly doing
stochastic gradient descent on an objective which is the squared
loss between
the Q Pi value that
you're trying to set and
u, which is the new piece of data that you got.
So think about it as regression: u is the y,
you know,
what the output is, and Q Pi is
the model that's trying to predict it, and you want those to be close to each other.
Okay? So, so those are kind of three views on basically,
uh, this idea of averaging or incremental updates.
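The algebra behind the second and third views can be written out as below. The factor of one-half in the objective is an assumption made here so that the gradient is exactly the residual; without it, the factor of 2 can be absorbed into the step size eta.

```latex
\begin{align*}
\hat{Q}_\pi(s,a) &\leftarrow (1-\eta)\,\hat{Q}_\pi(s,a) + \eta\,u \\
                 &= \hat{Q}_\pi(s,a) - \eta\,\bigl(\hat{Q}_\pi(s,a) - u\bigr), \\[4pt]
\frac{\partial}{\partial \hat{Q}_\pi(s,a)}\;\tfrac{1}{2}\bigl(\hat{Q}_\pi(s,a) - u\bigr)^2
                 &= \hat{Q}_\pi(s,a) - u .
\end{align*}
```

So each update is one gradient step with step size eta on the squared difference between the prediction and the target u.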
Okay. So it'll become clear why
I did this; it isn't just to, you know,
have fun. Uh, okay.
So now let's, uh, see an example of model-free Monte Carlo in action on
the volcano crossing game.
So remember here we have this, uh, you know,
volcanic example and, uh,
I'm going to, uh,
set the number of episodes to let's say 1,000, let's see what happens.
Uh, so here, okay.
So what does this kind of, uh, uh,
grid-like structure, a grid of triangles denote?
So remember, this is a state,
this is (2, 1).
So what I am doing here is dividing it
into four pieces which correspond to the four different actions,
so this triangle is (2, 1) north,
this triangle is (2, 1) east and so on.
Okay. And the number here is the Q Pi value that I'm estimating along the way.
Okay, so the policy I'm using
is completely random,
uh, just move randomly,
and I run this 1,000 times and we see that the average utility is,
uh you know, minus 18 which is, uh, obviously not great.
Okay. Uh, but this is an estimate of how well the random policy is doing.
So, you know, as advertised, you know,
random policy you would expect to fall into a volcano quite often.
Uh, okay.
Uh, and you can run this and sometimes you get slightly different results but,
you know, it's pretty much stable around minus 19, minus 18.
Okay. Any questions about this before we move on to, uh, different algorithms?
Okay. So model-based Monte Carlo we're estimating the MDP,
model-free Monte Carlo we're just estimating the Q values of a particular policy for now.
Okay. So, so let's revisit what model-free Monte Carlo is doing.
So if you use the policy Pi equals stay for the dice game,
um, you know, you might get a bunch of different,
uh, trajectories that come out.
These are possible episodes and in each episode you have a utility,
you know, associated with it.
Uh, and what model-free Monte Carlo is doing is it's using these utilities,
uh, to kind of update,
uh, the estimate of Q Pi towards, uh, u.
Right, so in particular, for example, you're saying, okay, I'm in,
I'm in, uh, the in state and I,
you know, take the action stay,
what will happen?
Well, in this case I got, you know,
16 and, uh, this case I've got 12.
And notice that there's quite a bit of variance.
So on average, this actually does the right thing.
Right? So, um, just by definition,
this is an unbiased, you know,
estimate, if you do this a million times and
average you're just going to get the right value which is,
uh, 12 in this case.
But the variance is here, so if you,
for example if you only do this a few times,
you're not going to get 12, you might get something, you know, sort of related.
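To make the bias and variance point concrete, here is a rough simulation. It assumes the dice game rules from the earlier lecture, which are not restated in this passage: each stay pays 4 and the game then ends with probability 1/3, with discount 1. The function names and the seed are illustrative. With only a few episodes the estimate can land far from 12, while with many episodes it should settle close to 12.

```python
import random


def play_stay_episode(rng):
    """Play one episode under the 'stay' policy and return its utility.

    Assumed rules: each 'stay' pays 4, then the game ends with probability 1/3.
    """
    utility = 0
    while True:
        utility += 4
        if rng.random() < 1 / 3:
            return utility


def monte_carlo_q(num_episodes, seed=0):
    """Estimate Q_pi(in, stay) as the running average of observed utilities."""
    rng = random.Random(seed)
    q_hat = 0.0
    for n in range(1, num_episodes + 1):
        u = play_stay_episode(rng)
        q_hat += (u - q_hat) / n  # incremental average with eta = 1/n
    return q_hat


few = monte_carlo_q(5)
many = monte_carlo_q(100_000)
print(few, many)
```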
Uh, so how can we kind of counteract,
uh, this, this variance?
So the key idea, uh,
behind what we're going to call bootstrapping is,
is that, you know,
we actually have, you know,
some more information here.
So we have this Q Pi that we're estimating along the way.
Right? So, so this view is saying, okay,
we're trying to estimate Q Pi, um,
and then we're going to try to basically regress it against, you know,
this data that we're seeing but, you know,
can we actually use Q Pi itself to, uh,
help, you know, reduce the variance?
So, so the idea here is, uh, um,
I'm going to look at all the cases where,
you know, I started in and I take stay,
I get a 4.
Okay? So I'm going to say,
I get a 4 but then after that point I'm actually just going to substitute this 11 in.
Okay? This is kind of weird, right,
because normally I would just see,
okay, what would happen?
But what happens is kind of random.
On average it's going to be right but, you know,
on any given case,
I'm gonna get, like, you know, 24 or something.
And the, the hope here is that by using
my current estimate which isn't going to be right because if I were,
if it were right I would be done but hopefully it's kind of somewhat
right and that will, you know, be, you know,
better than using the, the kind of the raw,
rollout value. Yeah, question.
You, you would update your current estimate at the end of each episode, correct?
Uh, yeah. So the question is,
would you update the current estimate,
um, after each episode?
Yeah. So all of these algorithms,
I haven't been explicit about it, is that you've seen an episode, you update,
uh, after you see it and then you get a new episode and so on.
Yeah. Sometimes you would even update before you're done with the episode, uh.
[NOISE] Okay.
So, uh, let me show this, uh,
what, um, this algorithm.
So this is a new algorithm, it's called SARSA.
Does anyone know why it's called SARSA?
[inaudible].
Oh, yeah, right. So if you look at this,
it's spelled SARSA and that's literally the reason why it's called SARSA.
Uh, so what does this algorithm say?
So you're in a state s, you took action a,
you got a reward,
and then you ended up in state s prime and then you took another action a prime.
So for every kind of quintuple that you see,
you're going to perform this update.
Okay, so what is this update doing?
So this is the convex combination, uh,
remember that we saw from before, um,
where you take a part of the old value and then you,
uh, try to merge them with the new value.
So what is the new value here?
This is looking at just the immediate reward,
not the full utility,
just the immediate reward which is this 4 here
and you're adding the discount which is 1 for now,
um, of your estimate.
And remember, what is the estimate trying to do?
Estimate is trying to be the expectation of rewards that you will get in the future.
So if this were actually a q pi and not a q pi hat,
then this will actually just be strictly better because that would be,
uh, just reducing the variance.
Uh, but, you know,
of course this is not exactly right,
there's bias so it's 11, not 12 but the hope is that,
you know, this is not biased by, you know, too much.
Okay? So these would be the kind of the,
the values that you will be updating rather than these kind of raw values here.
Okay. So just to kind of compare them, well, okay.
Okay, any questions about what SARSA is doing before we move on?
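As a sketch of the update just described, here is SARSA on a lookup table. The step size eta = 0.5 and the helper name are assumptions for illustration. The numbers mirror the example: the current estimate is 11, the immediate reward is 4, and the discount is 1, so the target is 4 + 11 = 15 rather than a raw rollout value.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, eta, gamma=1.0):
    """Apply one SARSA update to table q for the tuple (s, a, r, s', a')."""
    # Target: immediate reward plus the discounted current estimate at (s', a').
    target = r + gamma * q[(s_next, a_next)]
    # Convex combination of the old value and the target.
    q[(s, a)] = (1 - eta) * q[(s, a)] + eta * target
    return q[(s, a)]


q = defaultdict(float)  # unseen (state, action) pairs default to 0
q[("in", "stay")] = 11.0  # current estimate, as in the example
new_value = sarsa_update(q, "in", "stay", 4, "in", "stay", eta=0.5)
print(new_value)  # 13.0
```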
So maybe I'll write something to try to be helpful here.
So Q pi model-free Monte Carlo estimates Q pi based on u,
and SARSA is still Q pi hat,
but it's based on reward plus,
uh, essentially Q pi hat.
I mean this is not like a valid expression,
but hopefully, it's some symbols that will evoke, uh,
the right memories, um, okay?
So let's discuss, um, the differences.
So this is- this- whenever people say, kind of, bootstrapping,
um, in the context of reinforcement learning,
this is kinda what they mean,
is that instead of using u as its prediction target,
you're using r plus Q pi,
and this is kind of you're pulling yourself up by
your bootstraps because you're trying to estimate Q pi,
but you don't know Q pi, but you're using your estimate of Q pi to estimate it.
Okay. So u is based on one path,
um, er, in SARSA,
you're based on the estimate which is based on all your previous kind of experiences, um,
which means that, uh,
model-free Monte Carlo is unbiased,
but SARSA is biased.
Monte Carlo has large variance.
SARSA has, you know, smaller variance.
Um, and one, I guess, uh,
consequence of the way the algorithm is set up is that model-free Monte Carlo,
you have to kind of roll out the entire game.
Basically, play the game or the MDP until you reach the terminal state,
and then you can- now you have your u to update, whereas, uh,
SARSA when- or any sort of bootstrapping algorithm,
you can just immediately update because all you need to do is you need to see,
this is like a very local window of S-A-R-S-A,
and then you can just update, and that can happen,
kind of, you know, anywhere.
You don't have to wait until the very end to get the value.
Okay. So just as a quick sanity check.
Um, which of the following algorithms allows you to estimate Q opt,
so model-based Monte Carlo,
model-free Monte Carlo, or SARSA?
Okay. So I'll give you maybe ten seconds to ponder this.
[NOISE] Okay?
How many of you more- need more time?
Okay. Let's, uh, get a report.
I think I didn't reset it from last year,
so this includes last year's, uh, participants.
Um, so model-based Monte Carlo,
uh, allows you to get Q opt, right?
Because once you have the MDP,
you can get whatever you want. You can get Q opt.
Model-free Monte Carlo, um, estimates Q Pi;
it doesn't estimate Q opt and, um,
SARSA also estimates Q Pi,
but it doesn't estimate Q opt, okay?
All right. So, so that's,
uh, kind of a problem.
I mean, these algorithms are fine for, uh,
estimating the value of a policy,
um, but you really want the optimal policy, right?
In fact, these can be used to improve the policy as well because you can,
um, do something called policy improvement,
which I didn't talk about.
Once you have the Q values,
you can define a new policy based on the Q values.
Um, but there's actually a kind of a more direct way to do this, okay?
So, so here's the kind of the way mental framework you should have in your head.
So there's two values: Q Pi and Q opt.
So in MDPs, we saw that policy evaluation allows you to get Q Pi;
value iteration get- allows you to get Q opt.
And now, we're doing reinforcement learning,
and we saw model-free Monte Carlo and SARSA allow you to get Q Pi.
And now we need,
I'm going to show you a new algorithm called Q-learning,
that allows you to get Q opt.
So this gives you Q opt,
and it's based on reward, uh,
plus, uh, Q opt, kind of.
Okay. So this is going to be very similar to SARSA,
and it's only going to differ by,
essentially, as you might guess,
the same difference between policy evaluation and value iteration.
Okay. So it's helpful to go back to kind of the MDP recurrences.
So even though MDP recurrences can only apply when you know the MDP.
For deriving reinforcement learning algorithms, um,
it's- they can kind of give you inspiration for the actual algorithm.
Okay. So remember Q opt,
what is a Q opt?
Q opt is considering all possible successors, and for each one, weighted by its probability, the immediate reward plus,
uh, future, um, returns.
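Written out, the recurrence being recalled is roughly the following. The notation T(s, a, s') for transition probabilities and R(s, a, s') for rewards is assumed from the MDP lecture and is not restated in this passage.

```latex
Q_{\text{opt}}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_{\text{opt}}(s') \right]
```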
Okay. So the Q-learning is,
it's actually a really kind of clever idea, um,
and it's- it could also be called SARS, SARS, I guess, um,
but maybe you don't want to call it that,
and what it does is as follows.
So this has the same form,
the convex combination of the old,
uh, value, uh, and the new value, right?
So what is the new value?
Um, so if you look at Q opt,
Q opt is looking at different successors reward plus V opt.
What we're gonna do is, well,
we don't have all- we're not gonna be able to sum over
all our successors because we're in our reinforcement learning setting,
and we only saw one particular successor.
So let's just use that as a successor.
So on that successor,
we're going to get the reward.
So R is a stand-in for the actual reward of, I mean,
is the stand-in for the reward, the reward function,
and then you have Gamma times.
And then V opt,
I am going to replace it with, uh, the,
our estimate of what V opt is,
and what should the estimate of V opt be?
So what relates V opt to Q opt? Yeah?
I think the a that maximizes Q opt but [inaudible] V opt.
Yeah. Exactly. So if you,
define V opt to be the max over
all possible actions of Q opt of s in that particular action,
then this is V opt, right?
So Q is saying,
I'm at a chance node, um,
how much, what is the optimal utility I can get provided I took an action?
Clearly, the best thing to do if you're at
a state is just choose the action that gives you
the maximum Q value, okay?
So that's just Q-learning,
so let's put it side-by-side with SARSA.
Okay. So SARSA, these two are very similar, right?
So SARSA, remember updates against r plus Q Pi?
And now we're updating against r plus this max over Q opt, okay?
And you can see that SARSA requires knowing what action I'm gonna take next,
um, kind of a one-step look ahead,
a prime and that plugs i- into here, whereas Q-learning,
it doesn't matter what a you took because I'm
just gonna take the one that maximizes, right?
So you can see why SARSA is estimating the value of policy because, you know,
what a prime, uh,
shows up here is a function of a policy.
And here, um, I'm kind of
insulated from that because I'm just taking the maximum over all actions.
This is the same intuition as for value iteration versus policy evaluation, okay?
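A sketch of the Q-learning update on a lookup table, to set beside SARSA. The helper names, the step size, and the starting Q values (8 for stay, 10 for quit) are made up for illustration. Note that no next action a' is passed in: the target uses the max over actions at s'. With these numbers the Q-learning target is 4 + max(8, 10) = 14, whereas SARSA with a' = stay would have used 4 + 8 = 12.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, eta, gamma=1.0):
    """Apply one Q-learning update for (s, a, r, s'); no next action needed."""
    # Estimate of V_opt(s'): max over actions of the current Q estimates
    # (0 if s' is terminal and has no actions).
    v_next = max((q[(s_next, b)] for b in actions(s_next)), default=0.0)
    target = r + gamma * v_next
    q[(s, a)] = (1 - eta) * q[(s, a)] + eta * target
    return q[(s, a)]


def actions(state):
    """Hypothetical action set for the dice game."""
    return [] if state == "end" else ["stay", "quit"]


q = defaultdict(float)
q[("in", "stay")] = 8.0   # illustrative starting estimates
q[("in", "quit")] = 10.0
new_stay = q_learning_update(q, "in", "stay", 4, "in", actions, eta=0.5)
new_quit = q_learning_update(q, "in", "quit", 10, "end", actions, eta=0.5)
print(new_stay, new_quit)  # 11.0 10.0
```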
I'll pause here. Any questions?
Q-learning versus SARSA.
So is Q-learning on-policy or off-policy?
It's off-policy because I'm following whatever policy I'm following,
and I get to estimate the value of
the optimal policy which is probably not
the one I'm following, at least, in the beginning.
Okay. So let's look at the example here.
So here's SARSA and run it for 1,000 iterations.
And like model-free Monte Carlo, um,
I'm estimating that the average utility I'm getting is minus 20,
and in particular, the values I'm getting are all very negative because this is Q Pi.
This is a policy I'm following,
which is the random policy.
Um, if I replace this with Q-learning, what happens?
So first, notice that the average utility
is still minus 19 because I actually haven't changed my exploration policy.
I'm still doing random exploration.
Um, well, yeah.
I'm still doing random exploration.
But notice that the value,
the Q opt values are all around,
you know, 20, right?
And this is because the optimum policy, remember,
is just to- and this is,
uh, slip probability is 0.
So optimal policy is just to go down here and get your 20, okay?
And Q- and I- I guess it's kind of interesting that Q-learning,
I'm just blindly following the policy running, you know, off,
off the cliff into the volcano all the time but,
you know, I'm learning something,
and I'm learning how to behave optimally,
even though I'm not behaving optimally,
and that's, uh, the kind of hallmark of off-policy learning.
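The off-policy behavior can be sketched on the dice game instead of the volcano grid. The rules are assumed from the earlier lecture: quit pays 10 and ends the game, stay pays 4 and the game then ends with probability 1/3, with discount 1. The constant step size, seed, and step count are arbitrary choices. The behavior policy is uniformly random, yet the learned values should drift toward the optimal ones, roughly 12 for stay and 10 for quit.

```python
import random
from collections import defaultdict


def step(action, rng):
    """Assumed dice game dynamics from the 'in' state: returns (reward, next_state)."""
    if action == "quit":
        return 10, "end"
    return 4, ("end" if rng.random() < 1 / 3 else "in")


def q_learning_random_policy(num_steps, eta=0.001, seed=0):
    """Run Q-learning while acting uniformly at random."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(num_steps):
        action = rng.choice(["stay", "quit"])  # random exploration policy
        reward, next_state = step(action, rng)
        if next_state == "end":
            v_next = 0.0  # terminal state has no future reward
        else:
            v_next = max(q[("in", "stay")], q[("in", "quit")])
        # Residual form of the update, with discount 1.
        q[("in", action)] += eta * (reward + v_next - q[("in", action)])
        # If the episode ended, the next iteration starts a new one from 'in'.
    return q


q = q_learning_random_policy(200_000)
print(q[("in", "stay")], q[("in", "quit")])
```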
Okay. So, any questions about these four algorithms?
So model-based Monte Carlo, estimate MDP,
model-free Monte Carlo, um, estimate, ah,
the Q value of this policy based on, um,
the actual returns that you get,
the actual sum of the, ah, rewards.
SARSA is bootstrapping estimating the same thing but with kind of a one-step look ahead.
And Q learning is like SARSA except for I'm
estimating the optimal instead of,
um, fixed policy Pi. Yeah.
Is SARSA on-policy or off policy?
SARSA is on-policy because I'm estimating Q Pi. All right.
Okay so now let's talk about covering the unknown.
So these are the algorithms.
So at this point if I just hand you some data, um,
if I told you here's a fixed policy,
here's some data, you can actually estimate all these quantities.
Um, but now there's a question of exploration which we saw was really important,
because if you don't even,
even see all the states,
how can you possibly act optimally?
So, um, so which exploration policy should you use?
So here are kind of two extremes.
So the first extreme is, um,
let's just set the exploration policy.
So, so imagine we're doing Q learning now.
So you have this Q_opt estimate.
So it's not a true Q_opt but you have an estimate of Q_opt.
Um, the naive thing to do is just take a- use that Q_opt,
figure out which action is best and just always do that action.
Okay. So what happens when you do this is,
um, you, ah, don't do very well.
So why don't you do very well?
Because initially, well, you explore randomly and soon you find the 2.
And once you've found that 2, you say, "Ah,
well, 2 is better than 0, 0, 0.
So I'm just gonna keep on going down to the 2," which is, you know,
all exploitation, no exploration.
Right? You don't realize that there's all this other stuff over here.
Um, so in the other direction,
we have no exploitation, all exploration.
Um, here, ah, you kind of have the opposite setup where I'm,
I'm running Q learning, right?
So as we saw before,
I'm actually able to estimate the,
uh, the, the Q_opt values.
So I learn a lot.
But the average utility which is the actual utility
I'm getting by playing this game is pretty bad.
In particular, it's the,
the utility you get from just,
you know, moving randomly.
So kinda what you really want to do is, uh,
balance you know, exploration and exploitation.
So just kind of a, kind of an aside or
a commentary is that I really feel reinforcement learning kind of captures,
ah, life pretty well.
Um, uh, because in life there's, you know,
you don't know what's going on.
Um, you want to get rewards,
you know, you want to do well.
Um, and, ah, but at the same,
time you have to, um,
kind of learn about how the world works so that you can kind of improve your policy.
So if you think about going to restaurants or
finding a shorter path, a better way to get to, um,
to school or to work, or in research even when you are trying to figure out, um,
a problem you can work on the thing that you know how to do
and will definitely work or, you know,
do you try to do something new in hopes of you
learning something but maybe it won't get you as high reward.
So, um, hopefully reinforcement learning is,
um, I know, it's kind of a metaphor for life in a sense.
Um, okay so, ah,
back to concrete stuff.
Um, so here's one way you can balance,
um, exploration and exploitation, right?
So it's called the Epsilon-greedy policy.
And this assumes that you're doing something like Q learning.
So you have these Q_opt values and the idea is that, you know,
with probability of 1 minus Epsilon where Epsilon is,
you know, let's say like 0.1, you're just gonna exploit.
We're just gonna do,
give you- give it all you have.
Um, and then, um,
once in a while, you're also gonna do something random.
Okay. So this is actually not a bad policy to act in life.
So once in a while, maybe you should just do
something random and kind of see what happens.
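A minimal sketch of Epsilon-greedy action selection, assuming the Q estimates live in a dictionary keyed by (state, action). The function name and the example values are illustrative. With Epsilon at 0 it always exploits, and with Epsilon at 1 it always picks a random action.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon explore randomly; otherwise exploit q."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniformly random action
    # Exploit: the action with the highest current Q estimate (0 if unseen).
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q = {("in", "stay"): 12.0, ("in", "quit"): 10.0}  # illustrative estimates
greedy_action = epsilon_greedy(q, "in", ["stay", "quit"], epsilon=0.0)
random_action = epsilon_greedy(q, "in", ["stay", "quit"], epsilon=1.0)
print(greedy_action, random_action)
```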
Um, so if you do this,
um, what, what do you get?
Okay, so what I've done here is, uh,
I've set Epsilon to be starting with one.
So one is, ah, all exploration.
And then I'm going to change the value,
ah, a third of the way in, to 0.5.
And then two-thirds of the way in, I'm gonna change it to 0.
Okay. So if I do this then I actually estimate the values,
ah, really really well.
Um, and also I get utility which is, you know,
pretty good, you know 32.
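The schedule used in this demo can be sketched as a piecewise-constant function of the step index. The function name and the way the thirds are computed are assumptions; the demo's actual code is not shown here.

```python
def epsilon_schedule(t, num_steps):
    """Return epsilon at step t: 1 for the first third, 0.5 for the second, 0 after."""
    if 3 * t < num_steps:
        return 1.0   # all exploration
    if 3 * t < 2 * num_steps:
        return 0.5   # half and half
    return 0.0       # all exploitation


print([epsilon_schedule(t, 900) for t in (0, 299, 300, 599, 600, 899)])
```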
Um, okay.
And this is also kind of something that happens, uh,
as you get older,
you tend to, um,
[NOISE] explore less and exploit more.
Um, it just happens.
Um, okay. All right.
So that was exploration.
So let's put some stuff on the board here.
Um, do I need this anymore?
Maybe [NOISE].
Okay. Um, okay.
So covering the unknown,
so we talked about, you know,
exploration, um, you know, Epsilon-greedy.
Um, and there's other ways to do this.
Um, Epsilon-greedy is just kind of the simplest thing that actually, you know,
works remarkably, you know, well,
um, even in state-of-the-art systems.
So the other problem now I'm gonna talk about is, you know, generalization.
Uh, so remember when we say exploration.
Well, if you don't see a particular state,
then you don't know what to do with this.
I mean you think about it for a moment, that's kind of unreasonable because,
you know, in life you're never gonna be in the exact same, you know, situation.
And yet we [NOISE] need to be able to act properly, right?
So general problem is that a state-space that you,
you might deal with in a kind of a real,
ah, world situation is enormous.
And there's no way you're going to go and track down every possible state.
Okay. So this state space is actually not that enormous, um,
but this is the biggest state space I could draw on the- on the screen.
Um, and you can see that this, you know,
the average utility is, you know, pretty bad here.
Okay. So what can we do about this?
So, um, I guess let's talk about a large state space.
So this is the problem.
So now this is where
the second- the third interpretation of model-free Monte Carlo will come in handy.
So let's take a look at Q learning.
Okay. So in the context of,
ah, SGD, looks like this.
Right. So it's a kind of a gradient step where you take the old value
and you subtract eta times something that kind of looks like,
ah, it could be a gradient,
which is the residual here.
Um, so one thing to note is that under the,
the kind of formulations of Q learning that I've talked about so far,
this is what we call a kind of rote learning.
Right. Um, which if we were,
you know, two weeks ago, we already said this is,
you know, kind of ridiculous because it's, uh,
not really learning or generalizing at all.
Um, right now it's basically for every single state and action I have a value.
If I have a different state and action, completely different value.
I don't- I don't- there's no kind of,
ah, sharing of information.
And naturally, if I do that,
I can't generalize between states and actions. Um, okay.
So here's the key idea that will allow us to,
um, actually overcome this.
So it's called function approximation in the context of reinforcement learning.
Uh, in normal machine learning,
it's just called normal machine learning.
Um, so the way it works is this,
uh, so we're going to define this Q_opt s, a.
It's not going to be a lookup table,
it's going to depend on some parameters here w. And I'm gonna define
this function to be w dot Phi s, a.
Okay. So I'm gonna define this feature vector very similar to how we did it
in kind of machine- in the machine learning section except for instead of s, a we had x.
And now the weights are going to be kind of, you know, the same.
Okay. So what kind of features might you have?
Ah, you might have for example,
um, features on, you know, actions.
So these are indicator features that say, "Hey,
maybe it's better to go east than to go west, or maybe it's better to be in the fifth,
ah, row, or it's good to be in the sixth column and, you know, things like that."
So, um, you have a smaller set of features and you try to
use that to kind of generalize across all the different states that you might see.
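One way such indicator features might look, as a sketch. The feature names, the sparse dictionary representation, and the weights are all hypothetical. The point is that a state never seen before still gets a Q value, because it shares action, row, or column features with states that were seen.

```python
def phi(state, action):
    """Hypothetical indicator features for a grid state (row, col) and an action."""
    row, col = state
    return {
        ("action", action): 1.0,  # e.g. "it's better to go east"
        ("row", row): 1.0,        # e.g. "it's good to be in the fifth row"
        ("col", col): 1.0,        # e.g. "it's good to be in the sixth column"
    }


def q_value(w, state, action):
    """Compute Q_opt(s, a; w) = w . phi(s, a) with sparse dict vectors."""
    return sum(w.get(name, 0.0) * value for name, value in phi(state, action).items())


w = {("action", "east"): 1.5, ("row", 5): -0.5}  # made-up weights
q_seen = q_value(w, (5, 6), "east")
q_new = q_value(w, (2, 1), "east")  # state shares only the action feature
print(q_seen, q_new)  # 1.0 1.5
```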
So what this looks like is now with the features is
actually the same as before except for,
um, now we have something that really looks like,
uh, you know, the machine learning lectures,
is that you take your weight vector and you do, um,
an update of the residual times the feature vector.
Okay. So for how many of you does this look familiar from linear regression?
Okay. All right.
So, so just to contrast,
so before we were just updating the Q_opt values,
um, but the residual is exactly the same and there's nothing over here.
And now what we're doing is we're updating not the Q values,
we're updating the weights.
The residual is the same and the thing that connects the, the,
the Q values and
the residual with the,
the weights is, ah, the kind of the feature vector.
Okay. As a sanity check,
this has the same dimension.
This is a vector. This is a scalar.
This is a vector which has the same dimensionality as w. Okay.
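A sketch of the weight update with linear function approximation, using plain Python lists for the vectors. The function names, the feature vectors, and the step size are illustrative assumptions. The residual is a scalar (prediction minus target), and it scales the feature vector, which has the same length as w.

```python
def dot(w, phi):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def q_learning_fa_update(w, phi_sa, r, phi_next_all, eta, gamma=1.0):
    """Return w - eta * (prediction - target) * phi(s, a).

    phi_next_all lists phi(s', a') for every action a' available in s'
    (an empty list if s' is terminal).
    """
    prediction = dot(w, phi_sa)
    v_next = max((dot(w, p) for p in phi_next_all), default=0.0)
    target = r + gamma * v_next
    residual = prediction - target  # scalar
    return [wi - eta * residual * xi for wi, xi in zip(w, phi_sa)]


w = [0.0, 0.0, 0.0]
phi_sa = [1.0, 0.0, 1.0]  # made-up feature vector for (s, a)
w = q_learning_fa_update(w, phi_sa, r=4, phi_next_all=[], eta=0.5)
print(w)  # [2.0, 0.0, 2.0]
```

Only the weights of features that are active in phi(s, a) move, which is how one observation can change the Q values of many other state-action pairs at once.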
And if you want to derive this, um,
you can actually think about the implied objective function as,
ah, simply, you know, linear regression.
You have a model that's trying to predict a value,
um, from an input, um, s, a.
So s, a is like x and Q_opt is like kind of y.
And then your regre- sorry.
This target is like, uh,
the y that you're trying to predict and you're just trying to make this prediction close
to the target. Yeah, question.
Is the eta, you said that [inaudible] [NOISE]
Yeah. So a good question.
So what is this eta now?
Uh, is it the same as before?
So when we first started talking about these algorithms,
eta was supposed to be one over the number of updates and so on.
But once you get into the SGD form like this
then now this just behaves as a step size and you can tune it to your heart's content.
All right. So that's all I will say about these two challenges.
One is how do you do exploration?
You can use Epsilon-greedy which allows you to kind of balance exploration with
exploitation and then the second thing is that for large state spaces,
Epsilon-greedy isn't going to cut it because you're not
going to see all the states even if you try really
hard and you need something like function approximation to tell
you about new states that you fundamentally haven't seen before.
Okay. So summary so far, online learning.
We're in an online setting.
This is the game of reinforcement learning.
You have to learn and take actions in the real world.
One of the key challenges is the exploration-exploitation trade-off.
We saw, um, four algorithms,
there's kind of two key ideas here.
One is Monte Carlo which is that from data alone,
you can basically use averages to estimate
quantities that you care about, for example, transitions, rewards, and Q values.
And the second key idea is this bootstrapping which shows
up in SARSA and Q-learning which is that you're
updating towards a target that depends on your estimate of what you're trying to predict.
Um, not just the kind of raw data that you see.
Okay. So now I'm gonna maybe step back a little bit and talk
about reinforcement learning in the context of some kinda other things.
So there's kind of two things that happen when we went from
binary classification which was two weeks ago to
reinforcement learning now and it's worth kind of decoupling these two things.
One is state and one is feedback.
So the idea about partial feedback is that you can only learn about actions you take.
Right. I mean this is kinda obvious in reinforcement learning.
If you don't quit in this game,
you never know how much money you'll get.
And the other idea is the notion of state which is
that new rewards depend on your previous actions.
So if you're going through a volcano, you have to, ah,
there's a kind of a different situation depending on where you are in, in the map.
Um, and there's actually kind of- so,
so this is kind of you can draw a two-by-two grid where you go
from supervised learning which is stateless and full feedback.
So there is no state, every iteration you just get a new example, ah,
and that doesn't have, you know,
there's no dependency and in terms of prediction on the previous examples.
Um, and full feedback because in supervised learning,
you're told which is the correct label.
Even if there might be 1,000 labels for example in image classification,
you're just told which one is the correct label.
Ah, and now in reinforcement learning,
both of those are made harder.
There are two other interesting points.
So what is called multi-armed bandits is kind of a,
you can think about as a warm up to
reinforcement learning where there's partial feedback,
but there's no state which makes it easier.
And there's also the case where you get full feedback but there are states.
So this is structured prediction.
For example in machine translation,
you're told what the translation output should be,
but clearly the actions depend on previous actions because,
you know, you can't just translate words in isolation essentially.
Um, okay. So one of the things I'll just mention very briefly is, you know,
deep reinforcement learning, which has been very popular in recent years.
So reinforcement learning, there was kind of a lot of interest in
the kind of '90s where a lot of the algorithms were kind of,
ah, in theory were kind of developed.
And then there was a period where kind of not that much, not as much
happened and since I guess 2013,
there has been a revival of reinforcement learning research.
A lot of it's due to, I guess, DeepMind, where they
published a paper showing how they can use reinforcement learning to play Atari.
So this will be talked about more in a section this Friday.
But the basic idea of deep reinforcement learning just to
kind of demystify things is that you are using a neural network for Q_opt.
Essentially that's what it is.
And there's also a lot of tricks to make
this kind of work which are necessary when you're dealing with enormous state spaces.
So one of the things that's different about
deep reinforcement learning is that people are much more
ambitious about handling problems where the state spaces are kinda enormous.
So for this, the state is just the,
you know, the pixels,
right, so there's, you know,
a huge number of pixels, whereas before people were kind
of in what is known as the tabular case,
in which you can kind of enumerate the states.
So, um, there's a lot of details here to care about.
One general comment is that reinforcement learning is, it's really hard,
right, because of the statefulness and also the delayed feedback.
So just when you're maybe thinking about final projects, I mean,
it's a really cool area, but don't underestimate how much work and compute you need to do.
Some other things I won't have time to talk about is so far we've
talked about methods that are trying to estimate the Q function.
There's also a way to even do without the Q function and just
try to estimate the policy directly that's called,
um, methods like policy gradient.
There's also methods like actor-critic that try to
combine these value-based methods and policy-based methods.
These are used in DeepMind's
AlphaGo, and AlphaZero programs for crushing humans at Go.
This will actually be deferred to
next week's section because this is in the context of games.
There's a bunch of other applications.
You can fly helicopters, play backgammon,
TD-Gammon was one of the early examples in
the early '90s of kind of one of
the success stories of using reinforcement learning, in particular,
you know, self play.
For non-games, reinforcement learning can be used to kind of do
elevator scheduling and managing data centers and so on.
Okay. So that concludes this section on
Markov decision processes, where the idea is we are playing against nature.
So nature is kinda random but kind of neutral.
Next time, we're going to play against an opponent where they're out to get us.
So we'll see about that.
