Portion, and then a group portion.
You'll be assigned groups in advance, um,
and there'll be numbers on
the chairs for the room that you're in so you'll know where to go sit.
And the way that it will work is you'll first
do the individual part, you'll turn that in,
and then you'll receive another exam for the group part,
and then go in and discuss your answers,
agree on them, and then scratch that off
to see if you got it right, and then when you're done,
you'll hand that in as your group one.
[NOISE] And, yeah.
Uh, just a logistical question.
So we're splitting into the two groups again without being asked later?
Yes.
Okay.
Great question. What [inaudible] asked is whether or not we're splitting into two rooms,
yes, we're again gonna do that,
um, and that will be your assignment.
Um, I think it's likely to be the same as last time,
but I will confirm that.
We might make it slightly different because
some SCPD students will be joining us for this one that didn't before, and vice versa.
So it's possible, particularly
if you are right on the borderline,
that, um, you'll be in a different room this time.
So we'll announce that. Um, [NOISE] just as a reminder, right now,
we're basically done with all of the assignments so this is
a great chance to be focusing on the projects.
Um, you should have been getting feedback about all of those, along with, um,
a little bit of information back about your project,
and the person who graded it will have signed that.
So that's a great TA to go ask questions of and their office hours are on Piazza.
Um, but you're welcome to go to any of the office hours to ask about project questions.
[NOISE] And that poster session will be [NOISE] at the time,
the original time announced.
It is sub-optimal but what can you do?
We're gonna meet in the morning on, uh, the last day of finals.
[NOISE] So that was the one that we were assigned.
Anybody else have any other questions about this?
Who here went to Chelsea's talk on Monday?
[NOISE] Okay.
Uh, for those of you who didn't get to see it,
Chelsea's talk will be online,
it's a really nice talk about meta-reinforcement learning.
It will be covered on the quiz next week,
but pretty light, um,
because you haven't had much exposure to that idea.
So again, just as a recap for the quiz.
The quiz will cover everything, um, in the course.
But things that you didn't have a chance to actually think about, um,
uh, because you didn't get practice on it with an assignment,
uh, we will test more lightly.
It also will be multiple choice.
I highly recommend that you take the quiz from
last year ignoring any topics that we haven't covered,
[NOISE] um, and do that without looking at the answers.
Um, one of the robust findings from
educational research is that
forced recall is incredibly effective at helping people learn.
Forced recall is a nice way for testing [LAUGHTER].
So, um, this is, of course,
in the case where you're not getting assessed by it, so you can just use that
to check whether or not you understand these things, and
re-look up anything [NOISE] that you might have questions about.
Any other questions? All right.
So let's go ahead and get started.
Um, we're gonna talk today about batch reinforcement learning,
and in particular, safe batch reinforcement learning,
I'll define what that is.
Um, this is a topic that I think is extremely important,
we do a lot of work on it in my own lab.
And most of the topics I'm gonna focus on today will be work that came out of,
uh, work that was done by my postdoc,
Phil Thomas, who's now a professor at UMass Amherst,
but the work he did before he worked with me, um,
some of our joint work, and I'll also highlight some other related work.
So let's think about a simple case.
Um, let's think about doing a scientific experiment, um,
where you have a group of people who we're- we're gonna call the A group,
and they get to first do this particular type of fractions problem.
This is a fractions problem for one of our tutoring systems, um, uh,
where people are having to add fractions
together and then reduce the sum to the lowest terms.
And then they have to do something where they do cross multiplication,
um, and then after that, they get an exam.
[NOISE] And they get an average score of 95.
And then we have the B group that does the same activities but in the opposite order,
and then they get an average score of 92.
And the question is, for a new student, what should we do?
[NOISE] And what- what additional information might you want to
know in order to be able to answer this question?
So feel free to shout that out.
So what would you need to do when, like, a new student comes along?
[NOISE] Now,
let's imagine our objective is to get a high score on this exam,
for that student to get a high score on this exam.
So what- which sequence of activities would you give to that new student in order
to maximize their probability that they get a good score on this exam,
indicating hopefully that they've learned the material?
So, yeah.
Do you know how big A and B are?
So one great question, which [inaudible] asked, is,
um, how big is this group?
Um, so how large are A and B?
And you might want to know this for a number of reasons.
For example, if the number of people in group A is 1,
and the number of people in group B is 2,
maybe these are just sort of statistical noise
between these distinctions. What else might you want to know?
[NOISE] Um, the variance.
Yes. So I suggest that you might want to know the variance.
That's another thing you might want to know. This is just the mean.
So what is the variance?
What other pieces of information might you want to know? Yeah.
Probably the difference between the median and the mean.
Yeah. So other sort of forms of statistics about,
sort of, you know, the distribution.
I'm- I'm thinking about something also more,
um, in a different direction. Yeah.
Um, the people in the group.
So is it that, like, group A is all high school students,
and group B is all, like, lower grades or something like that?
Exactly. So maybe the- maybe group A is kindergartners and maybe group B is,
um, you know, high schoolers and [LAUGHTER],
we're all regressing [LAUGHTER]. So yeah.
So- so who's in these groups?
[NOISE] And in addition to that,
often, you'd want to know who was the new student.
So is the new student then a kindergartner or a high schooler?
Okay. So there's really a lot of additional information that you'd
want to know in order to be able to answer this question precisely,
uh, and it involves a lot of different challenges.
And one of the challenges here is that, um,
if group B is different than group A,
then we have this fundamental issue of, um, censored data.
You don't get to know what would have happened in group B if they'd had
the same sequence of interventions as in group A.
So this is sort of the fundamental challenge that you'll never know
what it would be like if you were at Harvard right now, um,
but, uh, but there's this, uh,
you only get to observe the outcome for the action that's taken.
We've seen that a lot in reinforcement learning,
and that's true in this case too where we have old data, um,
about sequences of decisions,
and so it requires this kind of counterfactual reasoning.
Another thing that it involves is, um, generalization.
So here's a simple example where you can think of it basically as being just two actions.
Each of the two different problems, and you can think of this as the reward.
The delayed reward may be a reward of 0,
a reward of 0, and then a reward of the test score.
And so here there's only two actions,
and who knows how big the state space is,
it depends how we'd want to model students.
Um, [NOISE] but we don't want to think about
combinatorially all the different orders of actions.
Um, and even if we're writing down a decision policy,
that might start to be very large very quickly.
And so we're gonna need to be able to do some form of generalization,
either across states, or actions,
or both so that we don't have to run
a combinatorial number of experiments to
figure out what's most effective for student learning.
[NOISE] So we're gonna talk about
this problem today in the context of batch
reinforcement learning which we can also think of as being offline.
So this is offline batch RL,
and this is generally gonna be off-policy.
Now, we've seen a lot of off-policy learning before;
Q-learning is an off-policy algorithm.
Um, but I want to distinguish here that what we're gonna be mostly
focusing on today is the case where someone has already collected the data.
So we already have a prior set of data, um,
and then we're gonna want to use it to make better decisions going forward.
Now, this problem comes up not just, um,
in the case that I just mentioned, but in a huge number of different domains.
You could argue that areas like economics, um,
and statistics, and epidemiology are constantly asking these sort of questions.
Um, it comes up in things like maintenance, you know, um,
what sort of order of, um,
actions do you want to take to make sure that your machines,
your cars, run for the longest.
Um, it comes up in health care,
like what sort of sequence of activity should we give to
patients in order to maximize their outcomes,
their quality-adjusted life years.
And in many of these cases, it's gonna be state-dependent because what's gonna work
best for patient A is gonna be different than what works best [NOISE] for patient B.
[NOISE] Now, one of the big challenges here too is
that when we think about a lot of the cases where we have this old data,
it's gonna be high-stake scenarios,
which means that whether it's because we have really expensive, you know,
nuclear power equipment which we don't want to go wrong, um,
or we're treating people,
um, for, you know,
really significant diseases, then we wanna
make sure that we make good decisions in the future.
So we may or may not have a lot of data, um,
but the data that we have is precious and we
wanna make as good decisions as possible from that.
So that means we need to have some form of
confidence in how well it's gonna work going forward.
So we would really like to have some sort of upper and
lower bounds on its performance before we deploy it.
Um, and in general,
we just want good methods to try to estimate, um,
if we do this counterfactual reasoning,
if we think about how well people, or, you know,
how much more healthy people might be if we were to treat them in a different way,
um, how confident can we be before we convince a doctor to go actually deploy this.
So what I'm gonna talk about today is thinking about sort of
this general question of how can we do batch safe reinforcement learning.
Safety can mean a lot of different things.
Um, when I'm talking about safety today,
I'm mostly gonna be thinking about this in terms of this confidence.
This ability to say, um,
before we deploy something,
how good do we know it is.
Um, now, there's different forms of safety.
There's things like safe exploration,
making sure you don't make mistakes online, there's risk sensitivity,
thinking about the fact that,
um, each of us is only gonna experience one outcome,
not the expectation, um,
so we may want to think about the full distribution instead of averaging.
But what we're gonna talk about today is mostly still thinking about
expected outcomes but thinking about being confident in the expected outcomes.
And so, in general,
we would like to really be able to say with high confidence
this new decision policy we're gonna deploy for patients,
or for nuclear power plants,
or for other sorts of high-stakes scenarios, we think it's better.
We think it is better than what we're currently doing. And why might you want this?
You might want to, sort of, to guarantee kind of
monotonic improvement particularly in these high-stakes domains,
which is something that we've seen earlier this quarter.
[NOISE] Okay.
[NOISE]
So let's talk just briefly about some of the notation, some of this will be familiar.
I just wanna make sure that, we're all on board with that,
and then I'll talk about sort of some of the different steps
we might think about to try to create batch,
um, safe reinforcement learning algorithms.
So we're usually gonna use pi to denote a policy,
um, and we're gonna use tau or h to denote a trajectory.
I'll often use big D to denote the data that we have access to.
This is like [NOISE] electronic medical records systems,
um, or you know data about power-plants, etc.
And for most of this,
we're gonna assume that we know what the behavior policy is.
So we know what was the mapping of states to actions,
or it could have been histories to actions that was used to gather the data.
[NOISE] Can anybody give me an example where that might not be reasonable,
where we might not know the behavior policy?
[NOISE] [inaudible] actions, and we don't know the questions.
Exactly.
In many, many cases where the data is generated from humans,
um, we will not know what pi_b is.
So when we look at medical health data,
we typically don't know what pi_b is.
So, you know, if this is generated by doctors,
[NOISE] we typically don't know what pi_b will be.
There's obviously guidance, but we don't typically have access to, um,
the exact policy people used and, and if we did,
they probably wouldn't have phrased it as like,
you know, a stochastic process.
[NOISE] Like, when someone comes into their office, with probability 0.5
they're gonna treat them with this, and with probability 0.5 with that.
They probably think of it in deterministic terms, and they
probably wouldn't think of it in terms of these probabilistic rules.
So there are many cases where [NOISE] we don't have that.
Um, we've done some work on that recently,
other people have as well.
I might talk about that briefly at the end.
But for most of today, we're gonna assume that we have access to this.
So can someone give me an example,
where it is reasonable to assume that we have pi_b?
[NOISE] Sure.
[NOISE] [inaudible] is based on a [inaudible] set of guidelines or something?
Yeah. So in some cases, you know,
[NOISE] [inaudible] like sometimes you have like an algorithm to make a decision or,
you know, a clear set of guidelines.
Were you gonna say something similar or different?
A little different. Um, so if you have a power plant with maintenance records,
ah, [NOISE] like an established maintenance
plan that has records [NOISE] match the plan then you basically have pi_b.
That's right. Yeah. So in those real cases where you have these fixed protocols.
Another example I often think about is,
you know, when your decisions were made by reinforcement learning [LAUGHTER] agents,
or they're made by supervised learning agents.
Um, think about a lot of different settings,
ah, like how, you know, Google serves ads.
We know exactly how it serves ads, it has to all be logged.
And there's a, there's an algorithm that is making that decision.
So in many cases,
[NOISE] we have access to the code that is being used to
generate decisions [NOISE] automatically.
In that case, we can just look it up,
as long as we've saved it.
So our objective, as usual, is to think about how we get
good policies out, policies with good values.
Um, when we think about trying to do safe batch reinforcement learning in a setting,
we're gonna be thinking about,
how do we take that old data?
So we're gonna take our data's input and push it through some black box,
and get out a policy that we think is good.
So we're gonna have sort of some algorithm or transformation that,
instead of interacting with the real world,
and getting to make [NOISE] decisions and choose its data,
is just taking this fixed data,
and it'll output a policy,
and one thing that we would like is that,
if we feed data into our algorithm,
and our algorithm could be stochastic,
then the value of the policy it outputs.
So we can think of A(D). So that's,
you know, this being our algorithm A; it's gonna output some policy.
That might be a deterministic function, that might be a stochastic function.
Whatever policy it outputs, we want it to be good.
Ideally, at least as good as the behavior policy [NOISE].
So that's what this first equation says:
it says the value of whatever policy is output
by our algorithm when we feed in some dataset,
we would [NOISE] like it to be as good
as what [NOISE] the behavior policy that generated that dataset achieved,
um, whatever that value was.
So this is sort of [NOISE] the value of the policy used to generate the data.
Now, we're not normally given V pi_b, um, directly.
But can anybody give me an example of how we might learn that?
[NOISE] Given a dataset,
which was generated on policy,
from that policy generated, um,
using that policy. Yeah.
Just do like dynamic programming or whatever on that small [inaudible] approximately.
Yeah. So I think, like one thing is that,
you know, you could use that data.
Um, I don't know if you could do dynamic programming,
because you don't necessarily have access to the transition and reward models.
But you could do something like Monte Carlo estimation,
could average the reward, um,
let's assume in the dataset that you can see states, actions and rewards.
Um, so you could certainly just average it,
you know, average over all the trajectories.
So we can get an estimate,
we can estimate V pi_b
by looking at, let's imagine
it's an episodic problem.
So you can look at 1 over n times the sum over i,
i equals 1 up to n, the number of trajectories,
of the return for trajectory i. This is a return.
Which is essentially just doing Monte Carlo estimation,
because you know that that data was generated on policy,
[NOISE] and so you can just average.
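As a rough sketch in code (the data format and function name here are illustrative, not from the lecture), that Monte Carlo estimate is just the average return over the logged episodes:

```python
# Monte Carlo estimate of the behavior policy's value from on-policy data.
# Each trajectory is a list of (state, action, reward) tuples; all names
# here are illustrative assumptions, not a fixed interface.

def mc_value_estimate(trajectories, gamma=1.0):
    """Average discounted return over logged episodes."""
    returns = []
    for traj in trajectories:
        g = 0.0
        for t, (_, _, reward) in enumerate(traj):
            g += (gamma ** t) * reward
        returns.append(g)
    return sum(returns) / len(returns)

# Toy example: two episodes with reward only at the end, like the exam score.
data = [
    [("s0", "a1", 0.0), ("s1", "a2", 95.0)],
    [("s0", "a2", 0.0), ("s1", "a1", 89.0)],
]
print(mc_value_estimate(data))  # (95 + 89) / 2 = 92.0
```

Because the data was gathered on-policy, this plain average is an unbiased estimate of V pi_b.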
[NOISE] So that'll give us a way to estimate V pi_b, but then we
need some way to estimate V of A(D),
the policy our algorithm outputs from D. Um,
and that means we're gonna have to do something off policy,
because in general we're gonna be wanting to find policies that are better than pi_b,
which means that they would have had to be making some different decisions.
And I'll just highlight here that you know,
sometimes you might not just want to be better than the existing behavior policy,
but you might need to be substantially better.
Um, often, if we're thinking about real production systems,
it costs time, and money,
and effort, whenever we want to change protocols.
If you want to get doctors to change the way they're making decisions,
we want to change things in a power plant, there's often overhead.
So often, you might not just need to be
better than the existing sort of state of the art,
you need to be significantly better.
So for the same types of ideas we're talking about relative to, sort of, the current, um,
performance, you can always add a delta to
that, because you have to be at least this much better.
And so again, just to sort of summarize,
what does this equation say?
It says, I want to have situations,
where the policy that's output,
when I plug in my dataset,
I want that to be better than my existing policy with high probability.
So delta here, you know,
is gonna be something between 0 and 1.
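Written out as a formula (a sketch, using delta for the allowed failure probability as described here):

```latex
\Pr\left( V^{A(D)} \;\ge\; V^{\pi_b} \right) \;\ge\; 1 - \delta
```

That is, with probability at least 1 - delta over the randomness in the data and the algorithm, the policy the algorithm outputs is at least as good as the behavior policy.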
So now, let's talk about how we might do this.
Um, we're going to start with off-policy policy evaluation.
[NOISE] So the idea in this case,
well, okay, first I'll go through all three of these really briefly.
Um, and then we'll, we'll go step through them more slowly.
So the three things we need in order to
do safe batch reinforcement learning,
and there's tons of variants for each of these depending on the setting we're looking at:
first, we need to be able to do this off-policy batch policy evaluation.
Which is we need to be able to take our old data,
and then use it to estimate how good an alternative policy would do.
We might want to get confidence bounds over how good that is.
So this could just allow us to get some estimate of V
A(D), or V pi_e.
Pi_e is often used to denote an evaluation policy,
a policy we wanna evaluate.
So the first thing is just doing off-policy policy evaluation.
The second thing is saying how would we know how good that estimate is,
so this is an estimate could be really good, could be bad,
so you might want to have some uncertainty, over this estimate.
So that we can quantify how good or bad it is.
And then finally, we might want to be able to actually,
take like, you know, an argmax over possible policies.
So you might want to be able to do something like argmax
[NOISE] over pi of V pi, with some confidence bounds.
So in general, you're not gonna just want to be able
to evaluate how good alternative policies will be,
you're gonna wanna figure out a good policy to deploy in the future,
which is gonna require us to do
optimization because we don't normally know what that good policy is yet.
So typically, we're gonna end up evaluating a number of different policies.
So we can think of it as sort of the first part is we're gonna take our historical data,
take a proposed policy,
plug it into some algorithm that we haven't talked about yet,
and get out an estimate of the value of that policy,
and we're gonna talk about how to do important sampling [NOISE] with that.
And then after that, we're gonna go into the high-confidence,
um, policy evaluation and safe policy improvement.
To get confidence bounds,
we're gonna look at Hoeffding.
We've seen Hoeffding before, um,
as something that we looked at when we were starting to talk about exploration.
So when we look at high-confidence bounds, think back to exploration.
So in exploration, we're often trying to quantify how
uncertain we were about the value of a policy or
its models in order to be optimistic
with respect to how good it could be and use that to drive exploration.
But we also could have computed
confidence bounds that are lower bounds on how good things could be,
and that's gonna be useful here when we try to
figure out how good policies are before we deploy them.
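Since we've seen Hoeffding's inequality before, here's a minimal sketch of the kind of lower confidence bound being described (the function name and numbers are illustrative; this assumes i.i.d. returns bounded in a known range):

```python
import math

def hoeffding_lower_bound(samples, value_range, delta):
    """One-sided Hoeffding lower confidence bound on the mean.

    With probability at least 1 - delta, the true mean is at least
    the empirical mean minus value_range * sqrt(ln(1/delta) / (2n)),
    assuming the samples are i.i.d. and bounded in [0, value_range].
    """
    n = len(samples)
    mean = sum(samples) / n
    return mean - value_range * math.sqrt(math.log(1.0 / delta) / (2 * n))

# Toy example: returns assumed to lie in [0, 1].
returns = [0.8, 0.6, 0.9, 0.7, 0.75]
lb = hoeffding_lower_bound(returns, value_range=1.0, delta=0.05)
print(lb)  # a lower bound strictly below the empirical mean of 0.75
```

The bound shrinks toward the empirical mean as n grows, which is exactly what lets more data justify more confident policy improvement.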
And we'll use the Hoeffding inequality for that.
And then finally, we wanna be able to do things like safe policy improvement,
which is can you answer the question of saying,
if someone gives you some data and they say,
"Hey, can you give me a better policy?"
Can one have an algorithm that either gives a policy that is
actually better, with high probability, when you go and deploy it?
Or can the algorithm also know its limitations and say,
"Nope, there's no- there's no way for me to give you a policy that's better."
So I think it's also really nice to have
algorithms that are aware of their own limitations.
We're doing quite a bit of work on that in my lab right now,
um, so that when people who are using these,
particularly for human in the loop systems, um,
that they can understand if the algorithm is giving out garbage or not.
And so in this case, the idea is that sometimes if you have very little data,
you can't do improvement in a, uh, confident way.
So for that example I showed you before,
we had like two different ways of teaching students,
and someone, you know,
made the good point of saying, how many people are in each of these.
If only one person has tried either of these and someone says,
"Can you definitely tell me what's going to be better
for students in the future?" You should say, "No."
[LAUGHTER] Because there is only one data point,
like there's no way we would have enough data from,
you know, one data point in each group to be able to
confidently say in the future how we should teach students.
So I think that the safe policy improvement needs us to be able to both say
when we can be confident about deploying better policies in the future or when we can't.
So we're gonna look at sort of a- a pipeline for how to answer that sort of question.
All right. So let's first go back
and think about off-policy policy evaluation.
So the aim of this,
um, is to get an off-policy estimate that is unbiased.
So we want to get sort of a really,
you know, a good estimate of, um,
how good an alternative decision policy would be.
So we have data right now,
which is sampled, um,
under some policy; let's call it the behavior policy pi_b.
So we have like this dataset.
This is D, which is giving us these samples of trajectories,
and then we want to use them to evaluate how good an alternative policy would be.
And while we could do this with something like Q-learning,
we want to do it with a different method that's gonna allow
us to get better confidence intervals,
um, and where it's going to be an unbiased estimator.
So in Q-learning, if we think back to sort of what Q-learning was doing,
you know, Q-learning is off-policy.
We could do this with Q-learning.
Q-learning is off-policy,
and it- it samples and it bootstraps.
And because it samples and bootstraps,
it can be biased, okay?
And so we wanna do a different thing right now,
which still allows us to be off-policy,
but in a way that we're not biased,
so that our estimator won't systematically be, um,
above or below the true value, particularly because right now we're always gonna have finite data.
We're never gonna be in the asymptotic regime
where we have tons and tons and tons of data,
um, and we can sort of assume this went away.
So again let's think about the return.
G_t is the return.
It's just how much reward we got under a policy, you know,
either over a finite number of steps like for one episode or across all time.
And the value of
a policy, again, is just the expected discounted return of that.
So the nice thing is that if pi_b,
the behavior policy that you're using to gather your data, is stochastic,
then you can use that data to do off-policy evaluation.
This would have been essential for doing Q-learning too.
And one of the nice things is that because we're
kind of following this Monte Carlo type framework,
you don't need a model and you don't need to be Markov.
That's really nice because we're gonna end up getting an estimator that is,
um, unbiased and it does not rely on the Markov assumption holding.
And in many cases,
the Markov assumption is not gonna hold,
particularly when we start to think about patient data or other cases
where we just have a set of features that happen to have been recorded in our dataset,
and who knows whether or not that system is Markov or not.
We've certainly seen in some of our projects that it is not.
And that in some of those cases,
if you make the assumption that the world is Markov,
you have really bad estimates of how good,
you know, an alternative way of teaching students might do.
Okay. So why is this a hard problem?
Well, um, it's because we have a distribution mismatch, okay?
So if we look at, um,
imagine we just had a two-state process,
where, you know,
we can say this is the probability of your next state s prime,
and I've sort of made it smooth.
We can think of a Gaussian here,
and this is under the behavior policy.
Under the evaluation policy, it might look different, okay, versus this.
In general, the distribution of trajectories you're gonna get,
the sequences of state, action, reward,
next state, next action,
next reward, this sort of trajectory,
is gonna be different under different policies.
So the distribution of
tau here is not gonna look like the distribution of tau here.
If it looks identical,
what does that say about the value of the two policies?
[inaudible].
Sorry what?
[inaudible].
Exactly. So what you said is correct.
If, um, the distribution of
states and actions that you get under both of these policies is identical,
then the value is identical.
And we saw this idea also in imitation learning, when we
were doing sort of state-action,
uh, or state-feature matching.
Now in this case, we're talking about not just states and actions,
we're talking about full distributions or
full trajectories because we're not making the Markov assumption,
but the idea is exactly the same.
The way we define the value is basically
the probability of a trajectory and the value of that, or,
you know, the sum of rewards in that trajectory.
So if a distribution is identical,
then the value is identical.
We don't care if the policies are different
because we already know how to estimate the value.
[NOISE] So the key problem here is that they're gonna look different,
which means that you would have done different things under a different policy.
So it's like, you know, right now,
maybe you go and visit this part of the state space a lot,
[NOISE] excuse me, [NOISE] and this part infrequently.
And now you're gonna have an alternative policy,
which only goes here infrequently and goes over there a lot.
[NOISE] Excuse me. But thinking about it in this way gives us an idea about how
we can look at our existing data over there and make it look more like that.
Does anybody have any idea of how we could do that?
So if someone gives you a bunch of trajectories, um,
how might you maybe change them so they look like the distribution you care about? Yeah?
Importance sampling.
Right. So we can do importance sampling here, okay?
So let's just review and refresh importance sampling.
So the idea is that for any distribution,
um, we can reweight samples to get an unbiased estimate, okay?
So let's imagine that we have data generated from, um,
or we want data generated from some distribution q,
we wanna estimate f(x), okay?
So we'd have- wanna get f(x) under the probability distribution q(x).
So we can multiply and divide by the same thing,
let's incorporate another distribution.
It's just a different distribution over x times q(x), f(x), dx, okay?
So we can just rewrite this as being equal to integral
over x probability of x times this quantity, which is q(x),
divided by p(x) times f(x),
and let's imagine that we actually have data from q(x).
So we want data from q(x) but we have data from p(x).
So we can approximate this by 1 over n, sum over i,
q(xi) divided by p(xi) f(xi),
where xi is sampled from p(x).
I remember when I first learned about this a number of years ago and I thought it was
a really lovely insight just to say we're just gonna reweight our data.
And so we're gonna focus on, um,
the data points that are ones we would
sample under the distribution we care about.
We're just gonna reweight them so they look like they have the same probability,
um, that they would under q(x).
In our case, under our desired policy.
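To make the reweighting concrete, here's a tiny numerical sketch (the distributions and f are made up for illustration): samples drawn from p, reweighted by q(x)/p(x), recover the expectation of f under q:

```python
import random
random.seed(0)

# Estimate E_q[f(X)] using samples from p, via importance weights q(x)/p(x).
# p and q are illustrative discrete distributions over {0, 1, 2}.
p = {0: 0.5, 1: 0.3, 2: 0.2}   # sampling distribution (what we have)
q = {0: 0.2, 1: 0.3, 2: 0.5}   # target distribution (what we want)
f = {0: 1.0, 1: 2.0, 2: 3.0}

true_value = sum(q[x] * f[x] for x in q)  # 0.2*1 + 0.3*2 + 0.5*3 = 2.3

xs = random.choices(list(p), weights=list(p.values()), k=100_000)
is_estimate = sum(q[x] / p[x] * f[x] for x in xs) / len(xs)

print(true_value, is_estimate)  # the estimate should be close to 2.3
```

Each sample is up-weighted where q puts more mass than p and down-weighted where it puts less, so the weighted average is an unbiased estimate of the expectation under q.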
Okay. So importance sampling works for any distribution mismatch.
If you have data from one distribution and you wish you had it from another,
um, that can come up in things like physics.
Often you have really rare events, like Higgs bosons,
and in those cases, um,
there are different scenarios where you could reweight things,
um, so you can get an estimate of the true expectation.
Okay. This is for just generic distributions.
Let's remind ourselves how this works for the reinforcement learning setting.
So again, we're gonna have our episodes.
We can call them h, we can call them tau,
which is a sequence of states, actions, and rewards.
And then if we wanna do importance sampling or let me just write this out.
So, um, in this case,
we're gonna wanna get something like p of hj under our desired policy, okay?
So what's that gonna be?
That's gonna be the probability of our initial state.
Let's assume that's identical no matter what policy you're in,
and then we're gonna have the probability of taking a particular action,
given we're in that state,
times the probability that we go to
a next state, and the probability of the reward we saw.
And we take the product of that over t = 1 up to L_j - 1, the length of the trajectory.
So we just repeatedly look at what was the probability we pick the action,
the transition model, and the reward model.
So that's how what we have for,
um, the probability of a history.
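That probability of a history can be sketched as a product of components; the policy, transition, and reward models below are toy stand-ins, purely for illustration:

```python
# p(h) = p(s_1) * product over steps of pi(a_t|s_t) * p(s_{t+1}|s_t,a_t)
#        * p(r_t|s_t,a_t)
def history_prob(history, p_s1, pi, p_trans, p_reward):
    """history: list of (s, a, r, s_next) tuples for one episode."""
    prob = p_s1(history[0][0])          # initial state distribution
    for s, a, r, s_next in history:
        prob *= pi(a, s) * p_trans(s_next, s, a) * p_reward(r, s, a)
    return prob

# Toy chain: deterministic dynamics (always move right), reward always 1,
# and a policy that picks between two actions uniformly.
p_s1 = lambda s: 1.0 if s == 0 else 0.0
pi = lambda a, s: 0.5
p_trans = lambda s2, s, a: 1.0 if s2 == s + 1 else 0.0
p_reward = lambda r, s, a: 1.0

history = [(0, 0, 1.0, 1), (1, 0, 1.0, 2)]
print(history_prob(history, p_s1, pi, p_trans, p_reward))  # 0.5 * 0.5 = 0.25
```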
And then if we wanna do this for importance sampling,
so what we want is we wanna have, um,
the probability of this history, um,
we need to be able to compute this ratio of q(x) divided by p(x),
which for us is gonna be the probability of a history j under
or the evaluation policy divided by
the probability of a history j under the behavior policy.
So- and we wanna do this and we're hoping that everything is gonna
cancel because we don't have access to the dynamics model or the reward model.
So fortunately, just like what we saw in some of the policy gradient work, it will.
So if we have probability of hj divided by pi b,
we're gonna have again the initial state distribution,
which will be the same in both cases,
and then we have this ratio of probabilities,
the probability of aj given sj under pi e,
divided by the probability of aj given sj under pi b,
and then the transition model and reward model.
Okay. And so this is nice because this cancels and this cancels.
Because the dynamics of the world is what determines
the next state and the dynamics of the world is what determines the reward.
And notice here just to make this not incredibly long,
I- here, I've made a Markov assumption.
So this is the Markov version.
[NOISE] But you could do,
um, you could condition on the full history.
So this trick does not require us- does not require the system to be Markov.
Because no matter whether your- your dynamics
depend on the full history or the immediate last step,
they're gonna be the same under the behavior policy and under the evaluation policy,
so you can cancel these, and same for the reward model.
So this- this, um,
insight does not require a Markovian assumption.
And what that means is that we just end up getting this ratio of, um,
the way we would pick actions under the evaluation policy divided by the ratio
of the way we would pick actions under the behavior policy. Yeah?
Assuming that, uh, same trajectory is generated by two different policies.
Great question. Um, yes,
we're assuming the same trajectory was generated by two different policies.
Yes. So we're saying for this trajectory,
what's the probability you would have seen this under
the behavior policy versus the evaluation policy?
And so if it was more likely under the evaluation policy,
we wanna upweight whatever reward we got for that trajectory.
And if it's less likely under that evaluation policy, we wanna downweight it.
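The surviving ratio, the product of pi_e(a|s) / pi_b(a|s) over the steps of a trajectory, can be sketched like this (the two policies here are toy examples, not from the lecture):

```python
# Trajectory importance weight: the dynamics and reward terms cancel,
# leaving only a product of action-probability ratios.
def trajectory_weight(states, actions, pi_e, pi_b):
    """Product of pi_e(a|s) / pi_b(a|s) over every step of the episode."""
    w = 1.0
    for s, a in zip(states, actions):
        w *= pi_e(a, s) / pi_b(a, s)
    return w

# Toy two-action example: behavior policy is uniform,
# evaluation policy always takes action 0.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 1.0 if a == 0 else 0.0

states = [0, 1, 2]
actions = [0, 0, 0]
print(trajectory_weight(states, actions, pi_e, pi_b))  # (1/0.5)^3 = 8.0
```

A trajectory the evaluation policy likes gets upweighted (here, 8x); one it would never generate gets weight 0.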
So the intuition is that we have a bunch of, um,
uh, trajectories and their sum of rewards.
So we kind of have these h,
you know, h_1, G_1 pairs.
So we have these sort of trajectory, sum of rewards.
And if we had the same behavior policy as the
evaluation policy in order to know how good that evaluation policy is,
we just average all of those G's but they are different.
And so what we wanna do is we wanna say for
h's that are more likely under an evaluation policy, upweight those.
For h's that are less likely under the evaluation policy,
downweight those so that you get, um,
a true expectation when you do those weighted averages of the G's.
Does that make sense? Does anybody have any questions about that part?
So this is just so far telling us how we re-weight our data.
It's allowing us to get a distribution that looks more like the distribution of
trajectories we would get under our evaluation policy. Yeah?
In the final, ah, the final row,
we have denominator, ah, the behavior policy.
Do we get that empirically?
Good question. You know, where does the behavior policy come from,
basically like- like what is these probabilities? Is that the question?
So I think it depends. So in the case,
um- if you're- if it's a machine learning algorithm that generated your data,
you just know it.
Like you would just look up in your algorithm and
see what the probability distribution is.
And for today, we're going to assume that this is known and it's perfect.
Um, that is obviously not true when we get into people data.
In those cases, um,
there's a couple of different things you can do.
One is that you can build estimators that are robust to this being wrong.
So you can use other ways to try to kind of be doubly robust if that estimator is wrong.
In other cases you can take the empirical distribution.
And actually there's some cool work recently,
I think was from Peter Stone's lab at UT Austin,
showing that sometimes it's better to
use the empirical estimate even if you know the true probabilities.
Like in terms of the resulting estimator, which is kinda cool.
Okay, so this is just writing that out in LaTeX instead of me hand-writing it,
um, and so this just writes out,
um, [NOISE] the- the probability of a history.
And now we can see that equation that I put here.
So, um, you know,
the value of the policy that you wanna evaluate, um,
is gonna be this ratio of history probabilities times the return of that history.
And what we said here is,
and this is, you know, one over n,
is that this is simply the probability of taking each of
those actions or I'll write it just in terms of the Pi notation.
So this is the product of pi e of aji given sji,
from j equals one to the length of your trajectory, divided by pi b of aji given sji.
All times the return of that particular history.
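A minimal sketch of that ordinary importance sampling estimator, with made-up per-episode weights and returns standing in for real data:

```python
# Ordinary importance sampling OPE estimate:
# average over episodes of (trajectory weight) * (return).
def ois_estimate(weights, returns):
    n = len(weights)
    return sum(w * g for w, g in zip(weights, returns)) / n

# Toy data: three episodes, each with its importance weight and return.
weights = [2.0, 0.5, 1.0]
returns = [10.0, 4.0, 6.0]
print(ois_estimate(weights, returns))  # (20 + 2 + 6) / 3 = 28/3
```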
[NOISE] So the beautiful thing is you can- this is an unbiased estimator.
This really does give you a good estimate of the value under a few assumptions,
um, which I'll ask you guys about in a second.
And, um, you don't need to know the dynamics, so no dynamics.
No reward.
No need to be Markov.
Can anybody, um, [NOISE] tell me a case where maybe this doesn't work?
So, I was just thinking that if you use this for learning,
then starting from your initial policy to your final policy,
they couldn't be that different, right?
Because otherwise then you- the
samples [BACKGROUND] from the previous policies won't be useful.
That's a good question. That's exactly what I was asking about is, you know,
how different can Pi e and Pi b [BACKGROUND] and allow us to do this.
So can- that was exactly what I was about to ask you guys about.
So, can anyone give me, uh,
what they think might be a condition for this estimator to be valid.
Like, where might be some cases where you would expect this might do
badly in terms of differences between the evaluation policy and the behavior policy?
And it has to do with the probability of taking actions in a certain state.
If either of those probabilities are too small,
you are gonna have things blow up in bad ways.
Yeah, so if either of these probabilities is too small,
this might be bad. Which of these ways is worse?
If they- uh, it's not [inaudible].
Right. So pi b is really small or at, you know, 0 [LAUGHTER].
Um, this could be really bad now.
Um, pi b can't ever be 0 and us observe something.
So that's good, because we're getting data from pi b.
And so we have never observed a trajectory under which
pi b is 0 but it could be really, really small.
So it could be, you know, you see something and it's incredibly unlikely there,
but your behavior policy would have generated that.
Um, and what if it is 0?
So it might be 0 for some actions.
What would that do if it's 0 in places where pi e is not 0 ?
So what happens if pi b of some particular a is equal to 0, uh,
but pi e of a is,
you know, greater than 0, might be 1.
[NOISE] What might be bad there?
Like do you think this is a- let's raise hands for a second.
If that happened, let's raise your hands, either
yes if you think this is a good estimator or no if you think, oh, that's rough.
So if the behavior policy is 0 probability of taking an action,
but the evaluation policy has a positive probability of taking an action.
Raise your hand if you think, um,
this estimator could be really bad. That's right.
Yes. So, you know, if there are cases where you're just never trying actions,
like you never saw actions in your data that
your new evaluation policy would take, you can't use this.
So we often call this coverage.
So coverage or support.
So we often make a few basic assumptions in order for this to be valid.
So our coverage or support normally means that pi b of
a given s has to be greater than 0 for all a and s,
such that pi e of a given s is greater than 0.
So you kind of have to have support over those state-action pairs.
It doesn't have to be non-zero everywhere.
But for anything that you might want to evaluate,
for anywhere if you're really going to generate data from
your evaluation policy and it might take an action,
you need to be able to get to that state and you need to be able
to take that action with some non-zero probability. Yeah.
So terminology questions. So, we're calling pi b
is the same as pi 2 in the other part, right?
That's right.
Okay. Originally I thought you said that the evaluation policy was
the one that you observed the data from. That's incorrect?
Thanks for making me clarify that.
Um, the- the- the behavior policy is always one you observe data from,
evaluation is the way you wanna look at.
And, I apologize, notation often here is a little bit [NOISE] snaggly because I think, um,
people sometimes call the evaluation policy the target policy or evaluation policy or,
you know, one or two, um,
and most of the time,
the policy used to gather the data is called the behavior policy. Yeah.
[inaudible] you just like not include it in the product.
Question is whether or not what if they're both 0,
um, is that a problem?
Would you ever get data from that?
So, yes. So if they're both 0, it's okay.
So you don't actually have to- it does not require you to
take- to have a non-zero probability of taking every action in every state.
Um, so it can be okay.
So pi b of a given s can't equal 0,
if pi e of a given s is equal to 0.
It's fine for that to be the case.
You just can't have any case where you would never
have either reached the state or generated
that action for things that you could have potentially done
under your, um, evaluation policy.
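That coverage/support condition can be sketched as a simple check; the dict-based policy representation below is just an assumption for illustration:

```python
# Coverage: pi_b(a|s) must be > 0 everywhere that pi_e(a|s) > 0.
# It's fine for both to be 0 on the same (s, a).
def has_coverage(pi_e, pi_b):
    """Policies are dicts mapping state -> {action: probability}."""
    for s, action_probs in pi_e.items():
        for a, p in action_probs.items():
            if p > 0 and pi_b.get(s, {}).get(a, 0.0) == 0.0:
                return False
    return True

pi_b = {"s0": {"left": 0.5, "right": 0.5}}
ok_pi_e = {"s0": {"left": 1.0, "right": 0.0}}                # supported
bad_pi_e = {"s0": {"left": 0.0, "right": 0.0, "jump": 1.0}}  # "jump" never tried

print(has_coverage(ok_pi_e, pi_b))   # True
print(has_coverage(bad_pi_e, pi_b))  # False
```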
And that doesn't sound too strict.
Um, but in practice, that can be a big deal.
So if you think about like Montezuma's Revenge,
excuse me, or different forms of Atari,
like under a random behavior policy,
you're never gonna get to see a lot of states,
and you're never gonna take actions in those states.
Like you're just- uh, it's incredibly unlikely,
unless you have an enormous amount of data.
So in practice, you can think of sort of
the behavior policy you have is kind of defining like,
um, like a ball [NOISE] under which you can evaluate other potential policies.
So if you have like- it's not actually a sphere,
but like, you know, if it's, um,
if you have a behavior policy here,
you can think of kind of having some distance metric under which you can still get
good estimates of pi e. So you kind of have a radius,
and it depends on these are sort of essentially the- the policies for
which you have support over and anything else you can't evaluate.
Okay, all right.
So just to summarize there, um, you know,
importance sampling is this beautiful idea that works for lots of statistics,
including for reinforcement learning.
I think- the first introduction of this,
um- I think it was first used for RL in,
um, Doina Precup's paper, Precup 2000?
Around then. The idea has been around for a lot longer, but, um,
but I think for reinforcement learning,
that was the first introduction of using this.
And of course these ideas also come up in policy gradients type- type methods.
Um, and the great thing is thi- is,
is this unbiased and consistent estimator,
just as a refresher consistent means that
asymptotically it really will give you the right estimate.
So, you know, as n goes to infinity,
our estimated V pi e goes to the true V pi e. Just kind of a nice sanity check.
As you get more and more data, you will get the right estimator.
Um, and just to check here, this is, you know,
under- under a few assumptions.
You have to have support.
[BACKGROUND]
All right. Now, um,
in our particular case,
I- we can leverage a few aspects of the fact that this is a temporal process.
So again, this is like what we saw for policy gradient methods.
We can leverage the fact that,
um, the future can't affect past rewards.
So when we think about generating these importance ratios,
we only have to for a particular time step t,
um, instead of multiplying it by- so I guess just to back up for there.
Remember that Gt is defined to be equal to, [NOISE] you know,
the rewards, [NOISE] like the sum of rewards.
So when we think about this equation for importance sampling,
um, let me just go back to here.
So this could be expanded into, you know,
r_1 + gamma times r_2 + gamma squared times r_3, dot, dot, dot.
And right now in that equation,
that's like multiplying your full product of importance ratios
times each of the rewards.
Um, so we're using the same ratio of probability of action divided by probability of action,
multiplied by each of these different reward terms.
But since, you know,
r_3 can't be affected by any actions taken after time step 3.
In some ways you're just- this isn't wrong.
It's just that you're introducing additional variance.
So similar to what we saw, um,
in policy gradient stuff,
we don't actually- we only have to multiply by that product of ratios,
up to the time point at which we got that reward.
So this allows us to get to per-decision importance sampling.
So this is only up to- only [NOISE] up to
the point you got the reward [NOISE] because the- the future can't influence past rewards.
And again, this is independent of it being Markov or not.
So this is just the fact that it's a temporal process and we can't go back in time.
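Per-decision importance sampling for a single episode can be sketched as follows (toy numbers; the cumulative weight up to each reward replaces the full-trajectory product):

```python
# Per-decision IS: reward r_t is weighted only by the product of
# action-probability ratios up to time t, which lowers variance
# without changing the mean.
def per_decision_is(ratios, rewards, gamma=1.0):
    """ratios[t] = pi_e(a_t|s_t) / pi_b(a_t|s_t); rewards[t] = r_t."""
    total, w = 0.0, 1.0
    for t, (rho, r) in enumerate(zip(ratios, rewards)):
        w *= rho                        # cumulative ratio up to time t
        total += (gamma ** t) * w * r
    return total

ratios = [2.0, 0.5, 2.0]
rewards = [1.0, 1.0, 1.0]
# Per-decision: 2*1 + (2*0.5)*1 + (2*0.5*2)*1 = 2 + 1 + 2 = 5
# (Ordinary IS would weight every reward by the full product 2, giving 6.)
print(per_decision_is(ratios, rewards))  # 5.0
```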
All right. So another thing just in general is that, um,
in importance sampling, um,
we have these weights,
which are these products,
you know, products of ratios of, like,
um, picking an action under the different policies.
So we call these weight terms,
which collides confusingly with
all the weights we talked about with function approximation.
Um, and weighted importance sampling compared to
importance sampling just renormalizes by the sum of weights.
The reason you might wanna do this is that as we were talking about before,
if your pi_b is really tiny.
So let's say this, this might be super tiny,
super small [NOISE] for some trajectories,
then that can mean that your importance weight for those trajectories is enormous.
In fact, we have,
um, a proof that, you know,
that generally the size of your importance weights
can grow exponentially with the horizon.
Um, uh, and so these importance weights can be incredibly large in some cases.
And so what weighted importance sampling does is it just renormalizes.
So effectively, you're making it
so all your importance weights are somewhere between 0 and 1.
Then you're using that when you're reweighting your distribution.
So when you do this,
this is something that's very common, um,
to do again this pre- predates the reinforcement learning side,
but has also been used in reinforcement learning.
Um, this is, uh,
this is biased, um, still consistent.
[NOISE] So that means asymptotically it's
going to get to the right thing and lower variance.
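A sketch of weighted importance sampling on the same kind of per-episode weights and returns (toy numbers; with ordinary IS you'd divide by n instead):

```python
# Weighted IS: the same weighted returns, but normalized by the sum
# of the weights rather than by the number of episodes.
def wis_estimate(weights, returns):
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)

weights = [2.0, 0.5, 1.0]
returns = [10.0, 4.0, 6.0]
# Ordinary IS gives (20 + 2 + 6)/3; weighted IS gives 28/3.5 = 8.0.
print(wis_estimate(weights, returns))  # 8.0
```

The normalization keeps a single huge weight from dominating the estimate, which is where the variance reduction comes from.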
[NOISE] So we're essentially going to play,
um, the bias variance trade-off that we've often seen.
Like with Q-learning versus Monte Carlo estimates.
Monte Carlo was unbiased but very high-variance.
Q-learning bootstraps, so it's biased, but
often much better because it's so much lower variance.
Weighted importance sampling is much lower variance empirically much,
much better, most of the time,
particularly for small amounts of data. Yeah.
Yeah. I was wondering if you could comment on, um,
you know, before you're saying that we were intentionally designing something to be unbiased.
So we're going to ignore certain techniques and now we're reintroducing bias at the end?
Right.
I guess, what is the intuition behind when it's okay,
I guess, to introduce bias and like when and why?
It's great one. Okay. So you just made a big deal before about saying let's go for
unbiased estimators and now you're telling us that we're going to go back to bias in,
[LAUGHTER] that- that case, you know, how do we make decisions about when this is okay?
Um, I think it totally depends on the domain.
Uh, and it als- I think one challenge and issue that comes up is,
if things are unbiased,
it's often easier to have confidence intervals around them.
We know better how to do that, um,
when it's biased, it's often hard to quantify.
Um, I think, uh, I'll talk briefly a little bit later about times
where we really just want to directly optimize for bias plus variance.
Like we want to look at accuracy,
mean squared error, just the sum of bias and variance.
And so then I think you can- it provides you a way to
directly trade-off between those two because you're like,
I know I want to minimize the mean squared error.
And that's the sum of these two, that gives me a way like,
a principled way, how to trade those off.
I think another thing I often like just for sort of a sanity  check is that,
if it's biased but still consistent,
um, that's sort of a nice sanity check.
It's like okay maybe there's a small amount of bias early on,
but eventually if I get a lot of data,
it's really given me the right answer.
And some of the function approximation stuff we solve for Q learning.
That's not true, like we just, all bets are off,
who knows what's happening asymptotically.
But again, it depends on the- depends on the day.
It's a great question. It's a big challenge in that area.
Okay. So as I was just saying in this case,
um, you know, weighted importance sampling, it's strongly consistent.
You're going to converge, um,
if you have a finite horizon or, um,
one behavior policy or bounded rewards, and, uh,
Phil and I, so if you look at Thomas and Brunskill,
we think about this for
the RL case in our ICML 2016 paper. That's one reference for this.
Okay. So we're going to have these estimates.
Um, they may be,
if we're using weighted importance sampling,
there might be a little bit bias, but lower variance.
Otherwise, there might be high variance but low bias, or zero bias.
What's something else we could do?
So, uh, let's briefly talk about control variates,
and be thinking in your head again back to policy gradient,
and thinking about baselines.
So just from a statistics perspective,
um, you know, if we have an X,
we have the mean of that variable, um,
if it's unbiased, it means that our estimate of
the mean is matching the true expectation,
and then we have our variance.
So let's imagine that we just kinda shift these estimates.
Okay. So we're just going to subtract a random variable Y,
and then in your head you can think about this as like a Q function, something like that.
So, you know, Y might be a Q function,
and the expected value of Y might be a V. You can think of Q as being,
um, for a particular action a, and the expected value of
Y as being an average over all the actions you could take in that state.
So then, i- if you redefine your mu so,
um, X is gonna be like, let's,
we're gonna be going back towards trying to get our estimate of V_pi_e.
[NOISE] Then if you subtract off something else and add on this expectation,
you still get an unbiased estimator.
[NOISE] So we can just write that out here.
So X - Y + expected value of Y,
just equal to expected value of X,
minus expected value of Y,
plus expected value of Y.
So you can do this. I mean, you can do this in statistics.
You can subtract a variable and add on its expectation and on average,
that does not change the mean of your original X.
But you might ask, "Why would I do that?"
[LAUGHTER] So you can do this for anything like
any- these are where X and Y are random variables.
So these are general, just any random variables,
where Y is called a control variate.
And this may be useful if it
allows you to reduce variance. And that means that. You know.
Y has to have something to do with X.
Okay. If you just subtract off something random,
this is probably not going to be helpful.
But if you subtract off something that allows you to have some insight on X,
in our case we're gonna be interested in things that allows us to help predict the value,
um, then we might get a lower variance.
So we can see this by looking at what is the variance of this weird quantity mu hat,
where we had this X - Y + the expected value of Y.
[NOISE] We can write it down as follows,
X - Y + the expected value of Y.
Now, the variance of the expected value of Y,
there's no more variability in that so we can just say that's the variance of X - Y.
[NOISE] So what that means is that we're gonna get something which is variance of X,
variance of Y, and the covariance of the two.
So if it is the case that the covariance of X and Y, meaning that the, you know,
there is relationship between these two variables,
one of them is giving us information about the other.
If twice that covariance is bigger than the variance of Y,
then you're going to have a win.
Then the resulting estimator is gonna have lower variance.
So if this is true,
if true- if true,
variance of mu hat is gonna be [NOISE] less than variance of X.
So this is nice because it means that we
didn't change the mean and we reduced the variance,
which in some ways, kind of seems like a free lunch but it's not really
a free lunch because we're using
information that is actually telling us something about X.
And this is very similar to us using a baseline term and policy gradient.
Where instead of just relying on the Monte Carlo returns,
we could also subtract off a baseline like the value function.
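The variance reduction can be demonstrated on a toy statistics example, nothing RL-specific: estimate E[X] for X = U squared with U uniform on [0, 1], using the control Y = U whose mean 1/2 is known:

```python
import random

random.seed(1)

# Control variate demo: both estimators are unbiased for E[U^2] = 1/3,
# but X - Y + E[Y] has lower variance because X and Y are strongly
# correlated (2*Cov(X, Y) > Var(Y) here).
n = 100_000
us = [random.random() for _ in range(n)]

plain = sum(u * u for u in us) / n                  # just X
controlled = sum(u * u - u + 0.5 for u in us) / n   # X - Y + E[Y]

print(plain, controlled)  # both near 1/3
```

For this pair, Var(U^2) is about 0.089 while Var(U^2 - U) is about 0.006, so the controlled estimate fluctuates far less around 1/3.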
[NOISE] So you can do this in importance sampling as well.
Um, so where X is the importance sampling estimator and Y is some control variate.
Typically, you know, this can either be a Q function which you
build from an approximate model of the Markov decision process.
It can be a Q function from Q learning.
Um, this gives you some way to get- we can think of this as Q, you know,
some- some estimate [NOISE] of state-action [NOISE] value, okay.
And doing this is called a doubly robust estimator.
Doubly robust estimator is again, um,
where in statistics for a long time, um,
in around 2011, were brought over to the,
I remember were brought to the multi-armed bandit community,
like the machine-learning multi-armed bandit community with Dudik et al.
And why do they call it doubly robust?
Well, the idea is that, um,
if you use information both from like your normal importance sampling, um,
plus some control variate like a Q value estimate,
um, you can be robust to either of those being wrong.
So either if you have a poor model or you have
a bad estimate if your control variate is bad.
[NOISE] So you know, why would this be important?
Well, if we go back to sort of questions,
some other people's questions, um,
an alternative is just to like do Q learning on your data, right?
But Q learning might be biased or it might be
a horrible estimate and who knows if it's good?
Um, but it might be good.
So if it's great- good you'd like to be able to say,
"We've got a good estimate."
Or if it's bad,
you'd like to say that your importance sampling can compensate for that,
and this says you're robust either if you have a bad model,
um, or if you have error in your estimates of pi b.
So this is like those cases where we've got data from physicians,
so we don't really know what the behavior policy is.
So if you have inaccuracies there, then you, um,
would also like it that if it turns out your control variate is accurate,
then you could still do well.
Now, if- in some cases both of these are bad and in that case,
kind of all bets are still off.
But it gives you more robustness about different parts of what you're
you're trying to estimate how good your evaluation policy is.
Okay, and Phil and I discussed sort of different ways
to do this as well as doing it in
a weighted way, so incorporating weighted importance sampling.
So what does this allow you to do?
Okay, I'll- I'll briefly show the equation.
Then essentially, the idea here is these are like the importance sampling weights.
This is the raw returns and then we can add and subtract.
This was Y and this is the expected value of Y and these can be computed
by computing like Q learning or
doing like an empirical model and doing dynamic programming on it.
You could get these sort of estimates of Q of pi E in lots of different ways,
um, and they might be good or they might be bad.
But you can plug them in and often,
they're gonna end up helping you in terms of variance.
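A hedged sketch of the doubly robust idea in the simplest one-step (bandit) case, in the spirit of Dudik et al.; the policies and the Q estimate below are toy stand-ins, and the full sequential version is more involved:

```python
# One-step doubly robust estimate: importance-sampled residual plus
# model-based prediction. q_hat is a (possibly wrong) estimate of
# Q^{pi_e}; v_hat(s) = sum_a pi_e(a|s) * q_hat(s, a).
def doubly_robust(data, q_hat, v_hat, pi_e, pi_b):
    """data: list of (state, action, reward) observed under pi_b."""
    total = 0.0
    for s, a, r in data:
        rho = pi_e(a, s) / pi_b(a, s)            # importance ratio
        total += rho * (r - q_hat(s, a)) + v_hat(s)
    return total / len(data)

# Toy example: one state, two actions; pi_b uniform, pi_e picks action 0.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 1.0 if a == 0 else 0.0
q_hat = lambda s, a: 1.0 if a == 0 else 0.0      # model: action 0 pays 1
v_hat = lambda s: q_hat(s, 0)                    # pi_e always takes action 0

data = [(0, 0, 1.0), (0, 1, 0.0)]
# Episode 1: rho=2, 2*(1-1) + 1 = 1; episode 2: rho=0, 0 + 1 = 1.
print(doubly_robust(data, q_hat, v_hat, pi_e, pi_b))  # 1.0
```

When the model is accurate the residuals shrink toward zero and the big importance weights stop mattering; when the model is bad, the reweighted residuals correct it.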
So let's see empirically what this does.
[NOISE] So this is a really small Gridworld.
Think something on the order of like,
you know, maybe five-by-five, four-by-four.
This is a really small world and we're using it to try to illustrate and understand,
um, the benefits of these different types of techniques.
So what's on the x-axis?
This is the amount of data.
So this is the size of the dataset.
What's on the y-axis? This is mean squared error.
This is the difference between our estimate of
the evaluation policy and the true evaluation policy,
and you want this to be small.
So smaller, better and this is a log scale.
[NOISE] So what do we see here?
So one thing you could do is you could build an approximate model with your data.
That model might be wrong like maybe,
you're making a Markov assumption and it's wrong.
Or maybe, um, you know,
there's other parts where you just can't estimate well.
So this is the model-based.
So this is just we use a model,
and we compute V pi e for that model.
So we take our data,
we build an MDP model, then we, um,
use that to- then that's like a simulator and then we can
just compute V_pi e. So you can see here it's flat.
Um, in this case, um, I'd have to remember here.
I think we are using a different dataset.
Either I- I would have to double check whether or not
just after we have that number of episodes,
the model just doesn't change with further data.
Like the model just isn't great.
Maybe we're not using all the features that we should be, um,
or the world isn't really Markov,
and so you kind of have this fixed bias.
Your model can be asymptotically wrong.
The estimates you get from it can be asymptotically
wrong if your model is not a good fit to the environment.
[NOISE] The- the second thing we can do is we can do importance sampling.
So importance sampling is unbiased.
It's going down as we get more data as we would hope.
Eventually, it should collapse to 0.
Um, but we'd like to do better than that.
So now this is per-decision importance sampling.
You can see you get a benefit from leveraging the fact
that rewards can't be influenced by future decisions.
That reduces the variance,
kind of gives you this nice kind of automatic shift down.
Um, if you do doubly robust,
you get a significant bump.
So what's doubly robust doing again?
It's combining our approximate model plus IS.
So you can see again, here we're getting,
ah, a significant bump.
Now, I talk about this mostly in terms of mean squared error
but I think it's really important to think about what mean squared error means.
Um, so mean squared error here is, you know,
how accurately are we estimating this V_pi E minus the true one?
But we can alternatively think about this in
terms of how much data we need to get a good estimate.
So look at this. This is- you want a mean squared error of 1.
Maybe that's sufficient. Maybe that's not.
That means that under a per-decision importance sampling estimator,
you would need 2,000 episodes and with doubly robust,
you'd need less than 200.
So that's awesome because it means that we need like an order of magnitude less data,
in order to get good estimates,
and a lot of real world applications we just don't have enough data.
You know, there might not be a lot of patients with
a particular condition and
you'd really like to still be able to make good decisions for them,
and so you need estimators that need much less data to give you good answers.
And so that's why this is important.
Okay? Right.
And then in these cases,
we can see also that if you do weighted
importance sampling [NOISE] and weighted per-decision,
that also ends up helping a lot.
[NOISE] Here, is if you do weighted doubly robust,
again, sort of answering that question of when should you do,
you know, how do you trade off between this bias and variance?
Um, here, we can see that if we do some form of weighted,
uh, doubly robust which is one of the things we introduced in our ICML paper.
You again get a really big gain.
So we had this to there,
and this to there, right?
So now you are needing like five episodes to get to the same accuracy.
So again, some of these improvements can be- this is of course Gridworld, right?
Like this- you- you need- one needs to look at
this also for the particular domain one's interested in.
But it indicates there are just some of these cases by being, um,
a little bit better in terms of these estimators,
you can get substantial gains in terms of how accurate your estimators are.
[NOISE] All right,
again to continue on this line of, like, how do you balance between variance, um, and bias.
One thing that Phil and I thought about is, okay,
well, you know, you might want to have, um,
low bias and variance, ah,
and- and how do you do this trade-off,
let's just think about optimizing for it directly.
So our MAGIC estimator just tries to directly- directly minimize mean squared error.
Okay. So- so again let's say mean squared error is gonna be,
you know, it's a function of bias and variance.
So if you knew what bias and variance was,
you could hopefully just optimize this directly.
Do we know what bias is?
So bias again just to remember is bias is the difference
between this minus this. That's bias.
Do we know that? No, unfortunately, if we knew that,
if we knew the bias, then we wouldn't have to do anything else because we would
know exactly what the real value was.
So a big challenge when we're trying to do this work is
how do you get a good estimate of bias?
Or like an under- or overestimate of the bias?
And just briefly, the idea that we had there is to say,
if you can get confidence intervals,
which you can using importance sampling.
So let's say, I have these, you know,
this is my estimate from importance sampling.
And I have some uncertainty over it.
So this gives me some estimate,
[NOISE] with some upper and lower bounds.
So you know, I'm like, you can do, ah,
let's say the value is 5 plus or minus 2.
Okay. And then let's say I have a model, um,
I have a model that I built from this data and then I
used it to evaluate and I got another estimate up here.
So I have a V_pi,
and this is using a model.
Okay, so let's say this is 8.
So what is the minimum bias my model has to have?
How could you use these confidence intervals to get like a lower
bound on how- how bad my- assume these confidence intervals are correct.
Now, these are real confidence intervals, so we know
the real value has to be between 3 and 7.
What's the- what's the minimum bias that my model would have?
One right, because it's this difference, okay.
So this gives you a lower bound on the bias.
So how far off you are from
these confidence intervals gives you a lower bound on the bias.
It's optimistic, your bias- your model might be way more biased.
Um, uh, but it gives you a way to quantify what that bias is,
and that's what we use in this approach.
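The worked example here (IS interval 3 to 7, model estimate 8, minimum bias 1) can be written as a tiny check:

```python
# Lower bound on model bias from an importance sampling confidence
# interval: if the model's estimate falls outside [ci_low, ci_high],
# its bias is at least the distance to the nearest endpoint.
def min_bias(model_estimate, ci_low, ci_high):
    if model_estimate > ci_high:
        return model_estimate - ci_high
    if model_estimate < ci_low:
        return ci_low - model_estimate
    return 0.0  # inside the interval: no bias is detectable

print(min_bias(8.0, 3.0, 7.0))  # 1.0, as in the lecture's example
```

As noted above, this is optimistic: the model could be far more biased, but the bound is enough to trade off against the IS estimator's variance.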
So we combine our importance sampling estimators and think about how variable they are.
We have to get an estimate of their variance, um,
as well as the bias on the model,
and that allows us to figure out how to trade off between these.
And again, you get, um,
you get a really substantial gain often.
This is still Gridworld but, um,
you're gonna get again [NOISE] roughly an order of magnitude difference in some domains.
You're gonna need an order of magnitude less data.
And in this case, I've just zoomed in so you don't even
see some of these other methods because they're so much higher up.
[NOISE] Okay, so, you know,
that's one thing that we could do to try to get sort of
good off-policy policy evaluation estimates.
Um, I haven't talked to you too much so far about like
how are we gonna get these confidence bounds over those.
But I've mentioned sort of a number of different ways that we
could try to get just an estimate of, you know,
V_pi_e. So we want to get some estimate of
this new alternative policy that we might wanna unleash in the wild.
Um, I'll- I'm gonna skip this part, I'll just say briefly,
you know, there are some subtleties here with whether or not,
um, ah, you know,
what's the support of your behavior policy, um,
and how we do some different weighting and can we
improve over this sort of weighted importance sampling?
Um, [NOISE] the answer is yes.
You can do some slightly different weightings,
um, and I'll- I'll defer that.
And then also, another really important question,
really important in practice is that,
um, your distributions are often non-stationary.
You know, imagine that like you're looking at
patient data and during that time period like
some new food pyramid came out from
the Food and Drug Administration and so everyone changed how they're eating.
Um, so now that, you know,
the dynamics of your patients are gonna be really different than before.
So you'd like to be able to identify whether or not you
have sort of non-stationarity in your data-set.
Like if the dynamics model of the world is changing.
So we have some other ideas about how to handle that.
[NOISE] Okay, but now let's go to and say,
let's assume we've done this off-policy policy evaluation.
We've gotten out some estimate, um,
of these, how good our alternative policy would be,
and we want to go beyond that and we want to get some confidence over it.
So again remember we're trying to move to a world where we can say, you know,
the probability that the policy output by
our algorithm being better than our previous policy, [NOISE] is high.
So with high probability we're gonna give you a policy that's better,
which means not only do we need to have an estimate on how good that policy is but
how much better it is or not than your behavior policy.
Okay. How would we do that?
So let's first consider using importance sampling plus Hoeffding's inequality.
Again let's think back to what we are doing with exploration,
to do high confidence off-policy policy evaluation.
So just as a refresher,
mountain car is the simple control task,
[NOISE] where, you know,
you have your agent trying to reach here, we get high reward.
And we're gonna have data gathered from
some behavior policy and we want to try to learn a better policy from it,
and be able to get confidence bounds over its performance.
Okay, so remember that Hoeffding's inequality
is a way to bound how different your empirical mean
can be [NOISE] from your true mean.
[NOISE] And it gives you a bound on that in terms of the amount of data you have.
Okay, so it's a function of the amount of data you have.
And it depends on the range of your variables, okay.
So we thought about this for bandit arms,
which might have rewards between 0 and 1,
and then b would be 1.
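As a quick refresher, the two-sided Hoeffding radius for i.i.d. samples in [0, b] can be sketched like this; a minimal illustration, where `delta` is the allowed failure probability:

```python
import math

def hoeffding_radius(b, n, delta=0.05):
    """With probability at least 1 - delta, the empirical mean of n
    i.i.d. samples in [0, b] is within this radius of the true mean."""
    return b * math.sqrt(math.log(2 / delta) / (2 * n))

# Bandit-style rewards in [0, 1], so b = 1, with 1000 samples:
print(hoeffding_radius(1.0, 1000))  # about 0.043
```

Note how the radius scales linearly with the range b: that linear dependence is exactly what goes wrong with importance sampling below.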
Okay, so we can use Hoeffding's inequality also,
we talked about it briefly for,
um, uh, [NOISE] for model-based reinforcement learning.
Let's think about using that in the context of this off-policy evaluation.
So we can also use it using our old data to try to estimate, um,
what the- what our upper or lower bounds might be on the value of the evaluation policy.
[NOISE] So let's imagine that we use 100,000 trajectories.
And the evaluation policy's true performance is 0.19.
And if we use Hoeffding's inequality,
we're very confident that the new policy has a value of at least -5 million.
Okay. And we know that the real reward is somewhere between 0 and 1.
But Hoeffding's inequality gives us this bound of -5 million,
and that's true, right?
Like [LAUGHTER] you know,
is like 0.19 is greater than -5 million.
But it's not particularly informative.
Um, like we know that the real returns
for this domain are somewhere between 0 and 1.
And if we use Hoeffding's inequality there,
we're getting something that we'd call vacuous.
Okay, so you're getting a bound that is true but entirely uninformative,
because it is incredibly negative, right?
Like we know that the true value
for all policies is between 0 and 1.
[NOISE] Okay.
So why did this happen?
Um, [NOISE] let's look at importance sampling.
Importance sampling says we're gonna take this product of weights.
Okay, and as we've talked about before,
this might be pretty small.
So let's imagine this is like, you know, 0.1.
So then you have 10 to the L. Like if you take a series of actions,
let's say for every single action of that trajectory it was pretty unlikely
[NOISE] which you often need in domains like mountain car,
because in mountain car you have to take
a pretty specific sequence of actions in order to finally see some reward at the end,
and under a policy [NOISE] that is not optimal,
it might be pretty unlikely to see those series of actions.
So let's say, you know, most of your data you never get up the hill,
in like one or two of your data points you actually get up to the top of the hill,
and those were very rare trajectories.
Which means your, um,
importance sampling weights are gonna be extremely high.
It's gonna be this, you know,
1 divided by 0.1, raised to the L. [NOISE] That's just enormous, you know.
Um, so these can start to be incredibly large.
And Hoeffding's inequality depends on this:
the range of the potential returns you have.
What is the range of our potential returns?
The range of our potential returns is G times the
product over i equals 1 to T of our importance weights.
[NOISE]
So b is equal to the max over this.
It says, "In the worst case,
how large could your weighted return be?"
[NOISE].
And so it's gonna depend most on what your real range is.
Our real [NOISE] return is going to be between 0 and 1,
times this product of importance sampling weights,
and that's where the problem is.
The product of importance sampling weights can be enormous.
Okay.
Because you have really unlikely sequence of actions,
and then you get this blow up. All right.
So if we look at that here,
we can get this distribution, um, and some of the,
some of these are incredibly large,
and that means that our Hoeffding inequality ended up being vacuous,
because Hoeffding again is you subtract
basically b times the square root of 1 over n. So if,
let's say,
your trajectory lengths are something like 200,
which is somewhat reasonable for mountain car,
then you'd have something like 1 over 0.1 to the 200, times 1,
times the square root of 1 over n, roughly, right?
And so [NOISE] you'd have just this crazy,
crazy large term and you're subtracting this.
So that would mean that your bounds are vacuous,
basically I have no idea how good this evaluation policy would be.
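To make the blow-up concrete, here's a back-of-the-envelope sketch with the lecture's numbers (worst-case per-step probability 0.1 under the behavior policy, horizon 200, 100,000 trajectories); all values are illustrative:

```python
import math

# Per-step ratio pi_e(a|s) / pi_b(a|s) in the worst case: 1 / 0.1 = 10.
per_step_ratio = 1 / 0.1
L = 200        # trajectory length, mountain-car-like
n = 100_000    # number of trajectories
delta = 0.05

b = per_step_ratio ** L  # range of a weighted return: 10^200
radius = b * math.sqrt(math.log(2 / delta) / (2 * n))
print(radius)  # astronomically large, so the Hoeffding lower bound is vacuous
```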
So does anybody have any questions about that, about why that issue occurs?
Okay. So the insight that Phil had in some of his previous work is,
just get rid of those, just cut it down.
Um, so if you remove this tail,
what does that do to your expected value?
It just decreases it.
So if you ignore those like really,
really crazy high returns [NOISE] you're
not gonna get an estimate anymore that's as good,
but it's just gonna get smaller.
You're not gonna overestimate it.
So again, if we're thinking about say policy improvement,
we're concerned about deploying policies
that are worse than we think they are in practice.
What this is gonna do is say, we're gonna
underestimate how good our
evaluation policy is.
And so if we underestimate it, that's okay because that's safe,
like if we don't deploy things that actually would have been good,
maybe there's a lost opportunity cost,
but it's not gonna be bad for the world,
um, so that's what the insight was,
for here is that you can like remove this upper tail.
And so you don't need to read the proof, ah, um,
but the idea is that you can basically define a new confidence interval,
that is conservative [NOISE].
And you can think about how you choose that conservatism,
depending also on the amount of data you have.
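A minimal sketch of the tail-cutting idea (not the exact estimator from Thomas et al., which tunes the threshold on held-out data): clip each importance-weighted return at a threshold c before averaging, so rare huge-weight trajectories can only push the estimate down, never up:

```python
def clipped_is_estimate(weighted_returns, c):
    """Average of importance-weighted returns, each clipped at c.
    This can only underestimate the true value: the safe direction."""
    clipped = [min(g, c) for g in weighted_returns]
    return sum(clipped) / len(clipped)

# One trajectory with a huge importance weight dominates the plain mean:
returns = [0.2, 0.1, 1e6, 0.3]
print(clipped_is_estimate(returns, c=10.0))  # tame, conservative estimate
print(sum(returns) / len(returns))           # plain mean is blown up
```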
So here's the beautiful take-home, um,
so let's say that you kind of use 20% of your data
to figure out exactly how to tune this confidence interval,
so this sort of sets your confidence interval,
and then you use the rest of your data to compute your lower bound.
So for mountain car with the same amount of data,
you've got 100k trajectories.
So this is the new estimator,
and you get the, um,
the mu, the estimator of the V pi,
so this is a lower bound on V pi e. [NOISE] It says it's gonna be at least 0.145,
and the true value is 0.19.
And this is compared to all these other forms of concentration inequalities,
which were all, except for Anderson,
really, really bad [NOISE].
So things like Chernoff-Hoeffding and these other
ones; you don't have to be familiar with all of these.
But basically, it just says if you'd used other approaches,
to try to get this lower bound [NOISE] they would have been entirely vacuous,
whereas this one says, "Okay.
We're not sure exactly how good it is.
It's gonna be at least 0.145,
and the real value is 0.19."
It's not perfect, but you know,
if- if your behavior policy was 0.05,
that would be good enough to say you should use a new thing [NOISE].
So, um, they use this idea for digital marketing,
this is some work that Phil had done in
collaboration with some colleagues over at Adobe.
Um, and the nice thing about this is you can say,
you know, if I [NOISE] am gonna deploy something to get, um,
more effective digital marketing,
and I have access to our previous data,
[NOISE] can I say with what confidence
I can deploy something that's gonna, you know,
generate higher clicks, get more revenue?
[NOISE] And these confidence bounds turn out to be tight
enough that you can actually know that
the new policy is gonna be better, which is pretty cool.
[NOISE] [inaudible] Yeah. You can also,
so this is one form of trying to get those confidence bounds,
turns out you can also use t-tests and empirically that's often very good.
Um, and some of you guys might have seen some of these in,
ah, some of your statistics class.
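For intuition, a one-sided lower confidence bound on the mean looks roughly like this; for simplicity this sketch substitutes a normal quantile for the Student-t quantile (a reasonable stand-in for large n), and the data are made up:

```python
import statistics
from statistics import NormalDist

def lower_confidence_bound(samples, delta=0.05):
    """One-sided 1 - delta lower bound on the mean, using a normal
    quantile as a large-n stand-in for the t quantile."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(1 - delta)
    return mean - z * sem

# Made-up importance-weighted returns:
data = [0.18, 0.21, 0.16, 0.22, 0.20, 0.17, 0.19, 0.23]
print(lower_confidence_bound(data))  # somewhat below the sample mean of 0.195
```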
I'll just really briefly take, because we're almost out of time,
that you can combine these ideas,
and then think about trying to get these lower bounds,
um, here, and combine it with optimization.
So you can think about doing this for a number of different policies,
trying to compute lower bounds on all of them.
And then using that information to try to
decide which one to deploy in the future, in a way that is safe.
Okay. So you can sort of say,
"I'm gonna use some of my data to optimize,
some of my data to, um, ah,
to try to evaluate the resulting one,
and make sure that it's got a good confidence bound before I deploy it.
[NOISE] And again, you can do this in digital marketing,
some of the other work that Phil Thomas and I have looked at is,
using a diabetes simulator,
and looking at whether or not we can infer
higher-performing policies for things like [NOISE] insulin regulation,
um, ah, using similar ideas, in a way that you could know,
um, with high confidence, is better before you deploy it.
I'm gonna skip briefly through this.
This- this is an increasingly, ah,
big area of work in the community.
Um, I think a lot of people are thinking about
this counterfactual reasoning issue because we have
more and more data from electronic medical records systems
that we'd like to use to improve patient health.
We have data, um, you know,
on- on online platforms etc.
There's a lot of additional challenges,
things like how do you deal with long horizons?
Um, [NOISE] the fact that importance sampling can be unfair,
ah, [NOISE] what do I mean by that?
I mean that, essentially,
different policies when you evaluate them,
might have different amounts of variance depending on
how well they match to your behavior policy.
Because of that, it may be hard to decide which of those you should deploy.
Um, we have various work thinking about when the behavior policy is unknown.
Where we combine these ideas with deep neural networks,
um, and we're also thinking about transfer learning.
So I know Chelsea talked about meta-learning on Monday.
Um, one of the interesting ideas here is,
you're building these forms of models.
Can you kinda use the same ideas of fine tuning in the reinforcement learning case?
So can you think about building models for off-policy evaluation
that leverage data that does not match the policy you care about,
in order to get generally better models?
In things like health care,
I think that can be pretty helpful.
[NOISE] But there's a lot of additional work on this,
and there's a number of other groups that are thinking about these types of ideas.
Um, and also on campus if you're interested in these ideas in general.
There's also a number of great colleagues that are
thinking about this from other perspectives.
People coming at it from the perspective of economics,
or statistics, or epidemiology.
And it's been really fun to try to get to collaborate with these people as well.
So just to pop-up a level briefly, um,
the goal in these cases is to think about if you have some set of data,
we're gonna go from data to a good new policy [NOISE].
Okay. And you want it to be good in a way that you
know something about the quality of it before you deploy it.
And so that's really what's sort of safe off-policy policy evaluation and selection,
or optimization is about.
And so in terms of the things that you should understand from here,
you should be able to define and apply importance sampling [NOISE],
know some limitations of importance sampling.
[NOISE] List a couple of- of alternatives.
Know why you might want to be able to do this sort of safe reinforcement learning,
like what sort of applications this might be important in.
[NOISE] And sort of define what type of guarantees we're getting in these scenarios.
So that's it, and then next week,
we'll have the quiz on Monday,
and then on Wednesday we'll talk
about Monte Carlo tree search. Thanks.
