Today we're going to start talking about fast reinforcement learning.
In terms of where we are in the class, we've just finished policy search, and you're working on policy gradient right now for your homework. That'll be the last homework, and the rest of the term will be spent on projects.
Right now we're going to start talking about fast reinforcement learning, which is something we haven't discussed much yet.
The things we've discussed a lot so far this term are optimization, generalization, and delayed consequences: how we do planning in Markov decision processes, how we scale up to really large state spaces using things like deep neural networks, and how we do this optimization. That works really well in a lot of cases where we have good simulators or where data is pretty cheap.
But a lot of the work in my own lab is about how we teach computers to help us, which naturally involves reinforcement learning, because we're teaching computers how to make decisions that would help us. And there are a lot of other applications where we'd really like computers to help us: things like education, healthcare, or consumer marketing.
In each of these cases, we can think of the problem as reinforcement learning: there's some agent — our computer — making decisions as it interacts with a person and trying to optimize some reward. It's trying to help someone learn something, or treat a patient, or increase revenue for a company by having consumers click on ads. And in all of those cases, the data comes from people.
There are at least two big challenges with that. First, people are finite: there isn't an infinite number of them. Second, it's expensive and costly to gather data by interacting with people. That raises the concern of sample efficiency.
In general, of course, we'd love reinforcement learning algorithms that are both computationally efficient and sample efficient. But most of the techniques we've looked at so far, particularly the Q-learning-style techniques, were really inspired by the need for computational efficiency.
Think back to the beginning of the course, when we compared dynamic programming to Q-learning: in dynamic programming we had to sum over all next states, while in TD learning we sampled that sum. So TD learning had a roughly constant cost per update, versus dynamic programming's cost on the order of |S|² |A|. On a per-step basis, dynamic programming was much more expensive than TD learning.
So a lot of the techniques developed in reinforcement learning have really been shaped by this computational efficiency issue.
And there are a lot of settings where computational efficiency matters. If you're driving a car at 60 miles per hour and it takes your computer one second to decide how to turn the wheel, then during that one second you've traveled many feet — 88 feet, in fact. So in a lot of cases you really do have real-time constraints on the computation you can do.
In many situations — some robotics applications, and particularly when we have simulators — we really want computational efficiency because we need to do these things very quickly. We can use our simulators, but they need to be fast so our agent can learn.
In contrast are settings where sample efficiency is super important and computation matters less: whenever experience is costly or hard to gather. That particularly covers anything involving people. With students, patients, or customers, the way our agent learns about the world is by making decisions, and that data affects real people.
It might be perfectly reasonable to spend several days of computation if we could figure out a better way to treat cancer, because we don't want to randomly experiment on people: we want to use the data as well as we can and be really, really sample efficient. Versus in a case like Atari, we want to be computationally efficient because we can run many, many simulations — it's fine, no one's getting hurt — but we do eventually need to learn to play the game well.
So one natural question: okay, maybe now we care about sample efficiency, whereas before we cared more about computational efficiency — but maybe the algorithms we've already discussed are already sample efficient. Does anybody remember, at an order of magnitude, roughly how many steps it took DQN to learn a good policy for Pong?
Student: I think it varies — somewhere between 2 and 10 million is my guess.
So that's a lot of data [LAUGHTER] to learn to play Pong.
So I'd argue the techniques we've seen so far aren't going to address this issue. It's not reasonable to need somewhere between 2 and 10 million customers before we figure out a good way to target ads, or 2 to 10 million patients before we figure out the right treatment decisions. The techniques we've seen so far are formally not sample efficient, and they're also empirically not sample efficient. So we're going to need new types of techniques.
When we start to think about this, we run into the general question: what does it mean for an algorithm to be good? We've talked about computational efficiency, and I've mentioned this thing called sample efficiency. One thing we care a lot about is how good our reinforcement learning algorithm is, and we're going to start to quantify that in terms of sample efficiency.
But of course, you could have an algorithm that's "sample efficient" in the sense that it only uses the first 10 data points and then never updates its policy. It doesn't need much data to find a policy, but the policy is bad. So when we talk about sample efficiency, we want both: we don't need very much data, and the decisions we make with it are good. We still want good performance; we just don't want to need much experience to get there.
When we talk about what it means for an algorithm to be good, one possibility is whether it converges at all — whether the value function or the policy becomes stable at some point, asymptotically, as the number of time steps goes to infinity. We've seen that with value function approximation we don't even always have that: things can oscillate. A stronger property is to ask whether, asymptotically as t goes to infinity, we converge to the optimal policy; we've talked about some algorithms that do, under various assumptions.
But what we haven't talked about much is how quickly we get there — "asymptotically" is a very long time. We'd like to be able to say: if algorithm one reaches the optimal policy at some point, and algorithm two gets there much faster, then intuitively algorithm two is better than algorithm one, even though both reach the optimal policy eventually. So we'd like to account for that — for example, by counting how many mistakes our algorithm makes, or by measuring its performance over time relative to the optimal. Today we'll start to talk about some other measures of how good an algorithm is.
In this lecture and the next couple, we're going to ask how good these reinforcement learning algorithms are, and look at algorithms that can be much better in terms of their performance guarantees. We'll start with tabular settings — today we'll only cover simple bandits. Today and next lecture will be on tabular settings, and then hopefully we'll also get to some function approximation plus sample efficiency. We'll talk about settings, frameworks, and approaches.
The settings we'll cover today and next time are bandits and MDPs. (Who here is doing the default project? Okay — so a number of you are already starting to think about bandits for the project.) We'll introduce bandits today, and then we'll also talk about MDPs.
We'll also introduce frameworks: evaluation criteria for formally assessing the quality of a reinforcement learning algorithm. They're a tool you can use to evaluate many different algorithms; an algorithm will either satisfy a given criterion or not, or have different properties under different frameworks. And then we'll talk about approaches: classes of algorithms for achieving these evaluation criteria, under these frameworks, in either the MDP setting or the bandit setting.
What we'll shortly see is that there are a couple of main styles of algorithm that turn out to apply to bandit settings, to MDP settings, and actually to function approximation as well, and that also have some really nice formal properties. There are a couple of big conceptual ideas for how we might do fast reinforcement learning.
Okay. The plan for today: we'll start with an introduction to multi-armed bandits, then define regret in a mathematically formal sense, then talk about optimism under uncertainty, and then, as time allows, Bayesian regret, probability matching, and Thompson sampling.
I'm curious — who here has seen this material before? Okay, a couple of people; most of you haven't. Is it covered in the AI course? I don't think so — good. So for some of you this will be review; for most of you, it will be new.
Multi-armed bandits can be thought of as a subset of reinforcement learning. There's a set of m arms, which are the equivalent of what we used to call actions: where in reinforcement learning we had a set of m different actions, we're now going to call those arms.
Each of those arms has a distribution over the rewards you could get. We haven't talked much about uncertainty over rewards — we've mostly just talked about the expected reward — but for multi-armed bandits, today we're going to explicitly model the fact that rewards are sampled from a stochastic distribution: some distribution we don't know, conditioned on the arm. Conditioned on the arm, you'll get different rewards.
For example, arm 1's distribution might look like this — probability of a reward on one axis, reward on the other — and arm 2's might look like this. In this particular example, the average reward for arm 1 is higher than the average reward for arm 2, and they have different variances. But it doesn't have to be Gaussian; you could have lots of different distributions.
Essentially, what we're trying to capture is that whenever you take a particular action — which we also refer to as pulling an arm — the reward you get may vary, even if you pull the same arm twice. If you pull arm 1 once, you might get a reward here, and the second time you might get a reward there.
The idea is that this is similar to MDPs, except there's no transition function: there's no state, or equivalently, there's only a single state. When you pull an arm, you stay in the same state. The same m actions are always available; on each step you pick an action and then observe a reward sampled from the unknown probability distribution associated with that arm.
And just like in reinforcement learning, we don't know those reward distributions in advance, and our goal is to maximize the cumulative reward. If someone told you the distributions in advance, you'd know exactly which arm to pull: whichever has the highest expected reward.
We're going to use notation pretty similar to the reinforcement learning case, but if anything is confusing, just let me know.
We're going to define the action value Q(a) as the mean reward for a particular action. This is unknown; the agent doesn't know it in advance. The optimal value V* is equal to Q of the best action, V* = Q(a*). The regret is then the opportunity loss for one step: if you could have gotten Q(a*) but instead got Q(a_t), where a_t is the arm you actually selected, how much in expectation did you lose by taking the sub-optimal arm? This is how we'll mathematically define regret.
If you selected the optimal arm, your expected regret is 0 for that time step; for any other arm there's some loss. The total regret is the total opportunity loss: summing over all the time steps the agent acts, we compare the expected reward of each action it took to the expected reward of the optimal action.
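As a concrete sketch of these definitions — the arm means here are made up for illustration, and note the agent could never compute this itself, since Q is unknown to it:

```python
import numpy as np

# Hypothetical true mean rewards Q(a) for three arms (unknown to the agent).
q = np.array([0.95, 0.90, 0.10])
v_star = q.max()                      # V* = Q(a*)

# Suppose the agent happened to pull this sequence of arms.
actions = [0, 1, 2, 0, 1, 0]

# Per-step regret is V* - Q(a_t); total regret is the sum over steps.
per_step = [v_star - q[a] for a in actions]
total_regret = sum(per_step)
print(total_regret)                   # 0 + 0.05 + 0.85 + 0 + 0.05 + 0 = 0.95
```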
Just to be clear: neither of these quantities — the optimal value nor the Q value of the selected arm — is known to the agent.
Just to check understanding for a second: why is this second quantity unknown to the agent?
Student: Because you don't know the reward distribution.
Right, correct. You don't know what the distribution is, so you don't know what Q is. What you get to observe is a sample from it: you observe a reward R sampled from the reward distribution given the action that was taken. But you don't get to observe the true expected value of the optimal arm, nor the true expected value of the arm you selected.
So regret isn't something we can normally evaluate, unless we're in a simulated domain. But we're going to talk about ways to bound it, and about algorithms that try to minimize the regret.
One way to quantify it is to think about the number of times you take a particular action; call that N_t(a) — the number of times we select action 1, action 2, et cetera. Then we can define a gap, Δ_a, as the difference between the optimal arm's value and the value of the arm we selected: how much we lose by picking that arm instead of the optimal one.
The gap for a* is 0 — you don't lose anything by taking the optimal arm — and for all other arms it's positive. An equivalent way to write the regret is as the expected number of times you select each arm times its gap: how much you lose by selecting that arm compared to the optimal one, summed over the arms.
What we'd like is for an algorithm to adjust how many times it pulls arms with large gaps: if there's a really bad action with very low reward, we'd like to take it much less often than the arms that are close to optimal.
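A small numeric check of this equivalence, again with hypothetical values: the count-times-gap decomposition gives the same total regret as summing the per-step losses.

```python
import numpy as np

# Hypothetical true means and pull counts N_t(a) after six steps.
q = np.array([0.95, 0.90, 0.10])
counts = np.array([3, 2, 1])          # arm 0 pulled 3 times, arm 1 twice, arm 2 once

gaps = q.max() - q                    # Delta_a = V* - Q(a); zero for the best arm
total_regret = counts @ gaps          # L_t = sum over arms of N_t(a) * Delta_a

# The same quantity computed step by step from an action sequence
# consistent with those counts:
actions = [0, 1, 2, 0, 1, 0]
stepwise = sum(q.max() - q[a] for a in actions)
assert abs(total_regret - stepwise) < 1e-12
```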
One approach we've seen before is greedy. In the bandit case, the greedy algorithm is very simple: we average the rewards we've seen for an arm — for every time step on which we took that arm, we look at the reward we got and average them — and that gives us an estimate Q-hat. Greedy then selects the action with the highest estimated value and takes that arm forever. It's probably clear that because rewards are sampled from a stochastic distribution, if you're unlucky and get unrepresentative samples, you can lock into the wrong action and stay there forever.
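Here's a minimal sketch of the greedy bandit algorithm just described, under the assumptions discussed below (round-robin initialization, random tie-breaking, incremental averaging); the arm means are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit: true success probabilities, unknown to the agent.
true_means = np.array([0.95, 0.90, 0.10])
n_arms = len(true_means)

def pull(arm):
    """Sample a 0/1 reward from the chosen arm's (unknown) distribution."""
    return float(rng.random() < true_means[arm])

counts = np.zeros(n_arms)             # N(a): times each arm has been pulled
q_hat = np.zeros(n_arms)              # running average reward per arm

# Initialization phase: pull every arm once (round-robin).
for a in range(n_arms):
    counts[a] = 1
    q_hat[a] = pull(a)

# Greedy: forever take an arm with the highest estimate (ties split randomly).
# Note q_hat can lock onto the wrong arm if the first samples were unlucky.
for t in range(1000):
    a = rng.choice(np.flatnonzero(q_hat == q_hat.max()))
    r = pull(a)
    counts[a] += 1
    q_hat[a] += (r - q_hat[a]) / counts[a]   # incremental mean update
```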
Think back to the little example from before (I'll work out a bigger example shortly), with reward on one axis and probability on the other, and rewards between 0 and 1. If you sample from a_1, there's some non-zero probability that the first sample you get is, say, 0.2, and with non-zero probability the first sample you get from a_2 is 0.5 — even though the true mean of a_2 is lower than the mean of a_1. So if you sampled each arm once and then acted greedily with respect to those samples, you'd take the wrong action forever.
Student: On "wrong action forever" — is the idea that our policy influences what samples we'll get in the future, or that there's some set of samples gathered independently of the greedy policy to begin with? Because otherwise, it seems like if an arm has non-zero reward, you'd just take that one forever.
Great question: is there an additional step we do before this? Normally, for a lot of these algorithms, we're going to assume they operate by first selecting each arm once, at least when there's a finite set of arms. Equivalently, you can say that until you have data for all the arms, you treat them all equivalently: you do a round-robin and sample everything once. After that, you can be greedy or do something else. So there is a pre-initialization phase. Good question.
We're also going to assume for now that we split ties with equal probability: if two arms both have the maximum estimated value, you split your pulls between them until the values change.
So this is an example where, if we first sampled a_1 once and then a_2 once, there's a non-zero probability that those samples make it look like a_1 has a lower mean than a_2, and then you lock into the wrong action forever.
An ε-greedy algorithm, which we've seen before in class, does something very similar, except that with probability 1 − ε we select the greedy action, and with probability ε we spread our choice uniformly across all the arms. That gives us more robustness: we'll continue to sample the other actions. But we'll always make a sub-optimal decision roughly an ε fraction of the time — actually slightly less than that, because when we explore uniformly across all the arms, with probability 1/(number of arms) we happen to select the optimal action. But it's order ε.
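A minimal ε-greedy sketch under the same assumptions (fixed ε, round-robin initialization, uniform exploration over all arms); the arm probabilities are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli arms; epsilon stays fixed.
true_means = np.array([0.95, 0.90, 0.10])
n_arms, epsilon = len(true_means), 0.1

counts = np.zeros(n_arms)
q_hat = np.zeros(n_arms)

for t in range(5000):
    if t < n_arms:                            # round-robin initialization
        a = t
    elif rng.random() < epsilon:              # explore: uniform over ALL arms,
        a = rng.integers(n_arms)              # so the best arm can also be drawn
    else:                                     # exploit: greedy, ties split randomly
        a = rng.choice(np.flatnonzero(q_hat == q_hat.max()))
    r = float(rng.random() < true_means[a])   # sample a 0/1 reward
    counts[a] += 1
    q_hat[a] += (r - q_hat[a]) / counts[a]

# With fixed epsilon, even a clearly bad arm keeps being pulled at
# rate ~ epsilon / n_arms forever, so total regret grows linearly in t.
print(counts)
```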
Okay, let's see these in practice before we talk about better algorithms. Imagine we're trying to figure out how to treat broken toes (this is not a real medical example). There are three options: one is surgery; one is buddy taping the broken toe to another toe, which is what the Internet might tell you to do; and the third is to do nothing. The outcome measure is a binary variable: whether or not the toe has healed after six weeks, as assessed by an x-ray.
Let's imagine we model this as a multi-armed bandit with three arms, where each arm has an unknown parameter governing the reward outcome. Take a minute or two: what does a pull of an arm correspond to in this case, and why is it reasonable to model this as a bandit instead of an MDP?
Student: In terms of why we model it as a bandit — MDPs usually picture an agent walking through a world with many different states. Here there's just one state: a toe is broken, and various actions are considered.
Right, great. So here we just have one state. And what does it mean to pull an arm, or take an action? What does that correspond to in the real world?
Student: A new patient coming in, and then making a decision about the care for that patient.
Great. So a patient comes in and we decide to do surgery or one of the other treatment options. Each pull is a different patient, and how we treat patient one generally doesn't affect how we treat patient two in terms of whether they heal — whether surgery worked for one patient doesn't affect the next patient coming in. So all the patients are IID, and what we want to figure out is which of these treatments is most effective on average.
Let's consider a particular set of values. Imagine they're all Bernoulli reward variables, since the toe either is or isn't healed after six weeks. It turns out that in this particular fake example, surgery is best: with surgery, the toe is healed after six weeks with probability 0.95; with buddy taping, 0.9; with doing nothing, 0.1.
So what would happen if we ran something like a greedy algorithm?
Student: Sorry — is it possible to incorporate other factors into pulling the arms? For example, how cost-effective surgery is versus buddy taping — are there ways to incorporate that?
Great question — surgery is a lot more invasive, there might be side effects, and so on. There are a couple of ways you could imagine putting that information in. One is to change the reward outcome: maybe surgery is more effective but also really costly, and you need some way to combine outcomes with cost.
Another thing you might care about in these settings is the distribution of outcomes. Here all the arms have the same form of distribution — they're all Bernoulli — but in some cases your reward outcomes will be complicated functions. Maybe for most people surgery is really good, but for some people it's really awful — they have bad side effects, like reacting badly to anesthesia — so its mean is still better, but there's a really bad risk tail. In those cases you might not want to focus on expected outcomes; you might want risk sensitivity. In fact, one of the things we're doing in my group is looking at safe reinforcement learning, including safe bandits, and thinking about how to optimize for risk-sensitive criteria.
Another thing we won't talk about today, which you might also want here, is that patients are not all identical. You might want to incorporate contextual features about the patient to decide between surgery, buddy taping, or doing nothing. Those of you doing the default project will definitely think about this, and we'll probably get to it in a couple of lectures. In general, we often have a rich contextual state that also affects the outcomes.
Okay, so let's imagine we have these three potential interventions and we're running the greedy algorithm. As was brought up before, we sample each arm once and form an empirical average. Say we sample action a_1 and get a 1, so our empirical average of the expected reward for a_1 is 1. Then we take a_2 and also get a 1, so that's its average. Then we take a_3 and get a 0. At this point, what is the probability of greedy selecting each arm, assuming ties are split randomly?
Student: One half plus epsilon terms for the first two, plus or minus a little.
What you said is exactly correct for the ε-greedy case — you're jumping ahead a little, but that's totally right. For greedy, it'll just be 50-50: the probability of a_1 equals the probability of a_2 equals one half.
So let's imagine we run this for a few time steps and track the regret we incur along the way. At the beginning there's an initialization period where we select each action once, always comparing to the reward we could have gotten under the optimal action. Remember, the per-step regret is Q(a*) minus Q of the action you took. In the first step it's 0, because we took the optimal action. In the second step it's 0.95 − 0.9 = 0.05. In the third step it's 0.95 − 0.1 = 0.85. Then in the fourth step it's 0 again, and then 0.05 again.
Now, in this situation, will we ever select a_3 again, given the values we've seen so far? No — I see people saying no. So why not? What's our current estimate of the reward for a_3?
Student: Zero.
Yes. I didn't put the rewards up here, but these were the actual rewards we got: 1, 1, 0. So our current estimate for a_3 is 0. We know rewards are bounded between 0 and 1, so none of our estimates can ever drop below 0, and we already have a 1 for each of the other two actions, which means their averages can never fall to 0. So we're never going to take a_3 again.
Now, in this case that's not actually a problem [LAUGHTER], because a_3 is a bad arm with a much lower expected reward. In other cases, though, we might just have gotten unlucky with a_3, and then we'd never take the optimal action.
Student: I thought we used V* for the reward of the best action.
Yes, and that's the same as this; good question. I'll go back and forth between notation, but definitely just ask me.
So in this case, we're never going to select a_3 again. Notice that with greedy, if the setup were slightly different — say the rewards were Gaussian rather than Bernoulli — the running averages of the other arms could later drop below a_3's, and then you'd switch. So greedy doesn't necessarily stick forever with whichever arm looked best at the beginning. But in this particular case, with these outcomes, you'll never select a_3 again.
All right, now let's do ε-greedy. Assume we got exactly the same outcomes for the first few pulls. As the earlier answer suggested, we split ties randomly again: with probability ε we explore, taking a_1, a_2, or a_3 each with probability ε/3, and with probability 1 − ε we act greedily, taking a_1 or a_2 each with probability (1 − ε)/2.
Okay. So in this case it looks almost identical, except we still have some probability of taking a_3, and we can do a similar regret computation; we've assumed all the outcomes are exactly the same. So ε-greedy will select a_3 again.
Student: Assuming epsilon is fixed, not decaying?
Yes, if ε is fixed.
So if ε is fixed, how many times will we select a_3? The main question is whether it's finite or infinite. Talk to your neighbor for a second and decide: if ε is fixed, will a_3 be selected a finite or infinite number of times, and what does that mean in terms of the regret?
Okay, everybody vote: if you think a_3 is going to be selected an infinite number of times, raise your hand. Great. So what does that mean for the regret — is it going to be good or bad? Bad. In general, the regret is unfortunately going to be unbounded in these cases — we'll always have infinite total regret — but the rate at which it grows can be much smaller depending on the algorithm. [LAUGHTER]
In particular, we can see it in this case: a_3 has a large gap, and if we select that arm an infinite number of times, then ε-greedy is also going to have large regret.
So I like this plot; it comes from David Silver's slides.
So if you explore forever,
like if you just act randomly,
which we didn't discuss, but you could also do,
then you're gonna have linear total regret,
which means that your regret scales linearly with the number of time steps t.
Essentially, your regret is growing unboundedly, and it's growing
linearly, which is essentially proportional to the worst you could do.
There's generally a constant in front of it, but
it's a constant times the worst you could do at every time step.
Because if you always select the worst arm at every time step,
your regret will also grow linearly.
So it's pretty bad.
If you explore never,
if you do greedy,
then it also can be linear,
and if you do epsilon-greedy, it's also linear.
So essentially, it means that all of these algorithms
that we've been using so far can have really,
really bad performance,
certainly in bad cases,
and so the critical question is whether
it's possible to do better than that.
So can we have what's often called sublinear regret?
If an algorithm's gonna be considered good in
terms of its performance and its sample efficiency,
we're gonna want its regret to grow sublinearly.
When we think about this,
we're generally gonna think about
whether the performance bounds that we create
are gonna be problem independent or problem dependent.
It depends a little bit.
For MDPs, most of the bounds that we can get are gonna be problem independent.
For bandits, there's a lot of problem dependent bounds.
Problem dependent bounds in the case of bandits mean that
the amount of regret that we get is gonna be a function of those gaps,
and that should be sort of intuitive.
So let's imagine that we just have two arms,
a_1 and a_2, and look at their expected rewards.
If one has expected reward 1 and the other has expected reward 0.001,
intuitively it should be easier to figure out that arm one is better than arm
two than if one is like 0.53 and the other is 0.525.
Because in one case it's
really hard to tell the difference between the means of the two arms, and in the other case
the means are really, really far apart.
So somewhat intuitively, if the gap is really large,
it should be easier for us to learn,
and if it's really small, it should be harder. Yeah.
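As a quick sanity check of that intuition, here's a sketch using the made-up means from the example: estimate how often the empirical means rank the two arms correctly after a fixed number of pulls of each. The large gap is identified far more reliably than the tiny one.

```python
import random

def prob_correct_ranking(mu1, mu2, pulls, trials=2000, seed=0):
    """Fraction of trials where arm 1's empirical mean strictly beats arm 2's,
    pulling each Bernoulli arm `pulls` times (mu1 is the truly better arm)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        m1 = sum(rng.random() < mu1 for _ in range(pulls)) / pulls
        m2 = sum(rng.random() < mu2 for _ in range(pulls)) / pulls
        wins += m1 > m2
    return wins / trials

big = prob_correct_ranking(1.0, 0.001, pulls=10)     # large gap: easy to rank
small = prob_correct_ranking(0.53, 0.525, pulls=10)  # tiny gap: close to a coin flip
```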
So if the optimal reward is deterministic for some actions,
don't we have zero regret?
Good question. The question is,
if the optimal reward is deterministic,
do you have zero regret?
Yes, if you know it.
So if you know that all the rewards of the arms are deterministic,
then you just need to pull each of them once.
Then you can make a good decision, and then you're done.
In general, these algorithms aren't going to know
that information; even if it
was deterministic, you're still gonna have these other forms of bounds.
So what about the greedy case then? What if it's deterministic?
In the greedy case, it's a good question.
If your real rewards are deterministic,
then you would pull all the arms once,
and you will make no mistakes.
You'll have zero regret basically from that point onwards.
So we'd consider that as just having some initial constant regret,
and then afterward it would be independent of t. Did you have a question?
Yeah.
Remind me your name.
Is it also a function of the variance of each arm?
Oh, good question.
Because in that case, like, you could take that into account. [inaudible]
Yeah, great question. So the
question is whether it also depends on the variance in addition to the mean.
Yes; we're not gonna talk about that,
but in addition to problem dependent bounds,
you can certainly think about parametric bounds:
if you have some parametric knowledge
of the distribution of the rewards, then you can exploit it.
So if you know it's a Gaussian, or other things like that.
I think in general, if you know the distributions,
or if you have information about the moments,
then you should be able to exploit it.
Most of the work that I've seen
looks at the mean and the variance.
Very frequently throughout a lot of
this material, we are gonna assume that our rewards are bounded.
That's gonna be important for most of the proofs that we do,
even without making any other parametric assumptions.
Okay. But then the other version of this is problem independent,
which just says: regardless of
the domain you're in, regardless of the gaps,
regardless of any structure of the problem,
can we still ensure that regret grows sublinearly?
And that's what we're gonna mostly focus on today.
So I think lower theoretical bounds are helpful to try to understand how hard a problem is.
I said, is it possible for something to be sublinear,
and there's been previous work looking at,
well, how much does the regret have to grow?
In this case,
the regret is mostly written in terms of the gaps between the mean rewards.
They prove that in terms of those gaps,
and the similarity of the distributions in terms of the KL divergence,
you can show a lower bound on how much regret grows.
So this is where a sort of
unfortunate aspect of regret growing unboundedly comes up.
If you don't make any other assumptions on the distributions of your rewards,
in general,
your regret will grow unboundedly,
and it will do so in terms of these gaps
and the KL divergence.
But it's still sublinear. So that's nice.
It's growing logarithmically with t here; t is the number of time steps.
It's encouraging that our lower bound
suggests that there's room for traction, right?
At least there's no formal result that says regret has to be linear;
it says no, we should be able to grow much slower.
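The lower bound being referenced is, up to constants, the Lai and Robbins result. A sketch of its form, writing Delta_a for the gap of arm a and R(a) for the reward distribution of arm a:

```latex
\lim_{t \to \infty} \frac{\mathbb{E}\left[\mathrm{Regret}(t)\right]}{\log t}
\;\ge\; \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left(\mathcal{R}(a) \,\|\, \mathcal{R}(a^*)\right)}
```

So regret must grow at least logarithmically in t, and the constant is larger when a suboptimal arm's reward distribution is similar to the optimal arm's (small KL divergence) relative to its gap.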
So how would we do this?
We've talked about a particular framework, which is regret;
we talked about a setting, which is bandits;
and now we're gonna talk about an approach, which is
optimism in the face of uncertainty.
And the idea is simply to choose actions which might have high value.
Why should we do this?
I have a question on the previous slide.
Is that true at every t, or only in the limit?
Because isn't that just saying what happens as t goes to infinity?
All right, that's a good question.
The question is about whether it holds at every t.
I think this holds on every time step;
I'd have to check the exact
way that they wrote this,
and there's also constants,
but I think this should hold on a per time step basis.
We're really saying that as time is going along,
it should be true?
Yeah, in the limit as t goes large.
This is where I'd have to look back at the original paper;
there's probably additional constant terms which are transitory,
and so this is probably the dominant term as t goes large.
But in a lot of the regret bounds, particularly in our MDP cases,
we often have transitory terms that are independent of the time step
but are still large early on.
That's my guess. But great question.
Okay. So optimism in the face of uncertainty
says that we should choose actions that might have a high value.
Why? Well, there's two possible outcomes
if we pick something that we think might be good.
One is that it's good.
Or, I'll be more precise in this case:
let's say we select a_1, and outcome 1 is that a_1 has high reward.
So that's good. If we
took an action because we thought it
might have high reward, and it actually does have high reward,
then we're gonna have small regret. So that's good.
What's the other outcome?
a_1 does not have high reward:
when we sample it,
we observe a low reward for a_1.
Well, if we get something with low reward, we learn something.
So we're like, hey, you know, we tried that restaurant again;
we thought it was great the first time,
the second time it was horrible.
So now we've learned that that restaurant is not as good, and we
update our estimate of how good that restaurant is.
And that means we don't think that action has as high a value as it did before.
So, essentially, either the world really is great,
in which case that's great, we're going to have low regret,
or the world is not that great, and then we learn something.
So this gave us information.
And so acting optimistically either gives us
information about the reward or allows us to achieve high reward.
And so it turns out to have been a really nice principle.
It's been around since at least Leslie Kaelbling in 1993,
who introduced this idea of interval estimation, and then there started
to be a lot of analysis of these types of optimism under uncertainty techniques.
So, how can we do this more formally?
Where would we get to be more precise
about what it means for an action to have a high value?
Let's imagine that we estimate an upper confidence bound for each action value,
such that the real value of
that action is less than or equal to the upper confidence bound with high probability.
And those upper confidence bounds in general are going to depend
on how many times we've selected that particular action.
Because we would like it to be such that if we've selected that action a lot,
that upper confidence bound should be pretty close to Q of a.
And if we haven't selected it very much,
maybe we're really optimistic.
And then we can derive an upper confidence bound bandit algorithm
by just selecting whichever action has the highest upper confidence bound.
So, for every single action we maintain an upper confidence bound,
and then we just select whichever one has the max,
and then we update the upper confidence bounds after we take the action.
So, a UCB algorithm works like this.
We first have an initialization phase where we pull
each arm once, and then we compute U_t(a) for all arms a.
Then, on each subsequent time step t,
we select a_t equal to the arg max over arms of the upper confidence bound,
we get a reward that is sampled from
the true reward distribution for that arm,
and then we update U_t for
a_t and all other arms.
It turns out that often we have to update not just the bound of the arm that we took,
but the bounds of all the other arms too.
You don't strictly have to do that,
but in terms of the theory we often have to, in order to
account for high probability bounds.
And we'll see more about that in just a second.
So, every time you get a reward,
you update the upper confidence bounds of all your arms,
and then you select the next action, and you repeat this over and over again.
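A minimal sketch of that loop, assuming Bernoulli arms with made-up means, and using the Hoeffding-based bonus sqrt(log(t^2/delta) / (2n)) that gets derived next in the lecture:

```python
import math
import random

def ucb(means, T, delta=0.05, seed=0):
    """Upper-confidence-bound bandit on Bernoulli arms.
    Returns how many times each arm was pulled over T total pulls."""
    rng = random.Random(seed)
    m = len(means)
    counts = [0] * m
    sums = [0.0] * m

    def pull(a):
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r

    # Initialization phase: pull each arm once.
    for a in range(m):
        pull(a)

    for t in range(m + 1, T + 1):
        # Recompute the bound for every arm each step: the log t term
        # grows even for arms we didn't just pull.
        def bound(a):
            bonus = math.sqrt(math.log(t * t / delta) / (2 * counts[a]))
            return sums[a] / counts[a] + bonus
        pull(max(range(m), key=bound))
    return counts

counts = ucb([0.9, 0.5, 0.1], T=3000)
# The optimal arm ends up pulled far more often than the others.
```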
Okay. So, how are we going to define these U_t?
We're going to use Hoeffding's inequality, so a refresher.
Hoeffding's inequality applies to a set of iid random variables.
We're going to assume right now that all of them are bounded between 0 and 1,
and we're going to define our sample mean just to be
the average over all of those variables.
And then what Hoeffding says is:
the probability that the true mean
is greater than the empirical mean plus some constant u
is less than or equal to the exponential of minus 2 n u squared,
where n is the number of samples we have.
Okay. So, we can also invert this to say:
if you want this to be true with a certain probability,
you can pick a u so that X bar n plus u is gonna be at least as large as the real mean.
So, let's say what we wanna do is,
we want the probability that the empirical mean plus u
is less than the real mean
to equal delta over t squared.
And we'll see why we might want to choose that particular probability shortly,
but let's imagine for a second that's what we want our probability to be,
since then we can solve for what u has to be.
Okay, so the exponential of -2 n u squared has to equal delta over t squared,
and then what we do is we just solve for what u is. Thanks for letting me know.
Okay. So, u in this case is going to be equal to
the square root of 1 over 2n, log of t squared over delta.
So, I just solved that equation. What does that tell us?
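Written out, the inversion done on the board is just this algebra (a sketch, using u for the bonus term as above):

```latex
e^{-2 n u^2} = \frac{\delta}{t^2}
\;\Longrightarrow\;
-2 n u^2 = \log\frac{\delta}{t^2}
\;\Longrightarrow\;
u^2 = \frac{1}{2n}\log\frac{t^2}{\delta}
\;\Longrightarrow\;
u = \sqrt{\frac{1}{2n}\log\frac{t^2}{\delta}}
```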
That says that if we define,
keeping the same notation as for Hoeffding,
X bar n plus u,
with that particular choice of u,
that's generally going to be greater than or equal to the true expected value of X,
with probability greater than or equal to 1 - delta over t squared.
So Hoeffding's inequality gives us a way to define an upper bound.
Because instead of these being generic X's,
you can imagine they're pulled from our arm and they're all rewards.
And so this says: if you take your empirical average of your rewards so far,
and you add in this upper bound,
which depends on the number of times we pulled that arm, and on t.
So, t here is the total number of time steps over which we pulled any arms,
and n is the number of times we pulled that particular arm.
So, they're not the same thing.
So, inside of this confidence bound,
we have a term that is decreasing with
the number of times we pull this particular arm,
and then we have a log term which is increasing with
the total number of times we pulled any arm.
And that is the reason why, after each time step,
we have to update the upper confidence bounds of all the arms.
So, you kind of have these two competing rates that are going on.
As you pull an arm more,
you're gonna get a better estimate of its reward, so the bound is shrinking.
But then you also have this slower growing term, this log,
which is increasing with the number of time steps. All right.
So, this is one way we could define our upper confidence bounds,
and we can use it in the case of our rewards.
So, we might wanna say that U_t(a) is equal to the empirical average reward of that arm,
plus the square root of 1 over 2 times the number of times we pulled that arm,
times log of t squared divided by delta.
So this is one way for us to define our upper confidence bounds.
All right.
So now the next question is:
okay, so we've done that;
how is that gonna help us show that
the regret of something which is optimistic is actually sublinear?
So what we're gonna do now is,
I'll do a quick poll.
Do you guys want me to write it on here, or do you guys want me to do it on the board?
So raise your hand if you want it on the board. All right.
Raise your hand if you want it on here. Okay. We'll do
this next part on the board. [LAUGHTER] That was easy. Was there a question in the back?
Yeah. I have a question: isn't this derivation about t? Because it seems like...
About what?
When you first introduced it, about t?
Yes.
It seems like when you first introduced it, it was basically a constant the way you moved
it around, but then later you're saying that
it's actually the time, and we're updating it every time step.
So how are we able to do that?
Yeah. It's a good question.
So, you're right,
and I'm being slightly imprecise about this.
If you know the time horizon over which you're gonna
be acting on a bandit, you could set t to be the maximum.
So if you know that you're gonna act for t time steps,
you can plug that in, and then
your confidence bounds, that log term, is fixed basically.
In online settings, if you don't know that,
you can also constantly be updating it with the time step.
It's a good question. Yeah.
How is delta decided?
How is what? Pardon?
How is delta, like, what is delta?
Okay. Good question. So, the question is, what is delta.
In this case, it's
specifying the probability
with which this inequality holds.
Later we're gonna provide a regret bound that holds with high probability.
So we're gonna say we're gonna have a regret bound which is
something like: with probability 1 minus a function of delta,
your regret will be sublinear.
So that's how it comes in.
You can get expected regret bounds too, and
one of the original UCB papers provides an expected bound, but I thought
this bound was a little bit easier to do in class.
So I thought I would do the high probability bound. Yeah.
So before, we were talking about regret.
I didn't exactly understand how you use
regret to update your estimate of the action value.
Oh, good question. So the question is,
do we use, or how would we use,
the regret to update our estimate of the action value. We don't.
Regret is just a tool to analyze our algorithm.
Great clarification. So regret is a way for
us to analyze whether or not an algorithm is gonna be good or bad
in terms of how fast the regret grows, but it's not used in the algorithm itself.
The algorithm doesn't compute regret,
and it's not used in terms of the updating.
Excuse me. Okay. So actually I'll leave this one up here,
so you guys can continue to see it.
All right. So let's do our proof.
What we're gonna wanna do now is
prove something about
the regret, and how quickly it grows, for the upper confidence bound algorithm.
But before I prove that,
I'm gonna try to argue something about
the probability of failure of these confidence bounds.
So what I said here is that we're gonna define
these upper confidence bounds in
terms of the empirical mean for that arm so far, plus
this term that depends on the number of times we pulled that arm.
And what I wanna work out now is:
what is the probability
that on some step,
the confidence bounds will fail to hold?
Why is this bad?
Okay. So we wanna bound the probability that on
any step, excuse me, as we're running our algorithm, our confidence bounds fail to hold.
Why? Because if they all hold,
we can guarantee we're gonna have some nice properties.
Okay. So note: if all the confidence bounds hold
on every step, then we can ensure the following.
If all confidence bounds hold, then U_t(a_t),
where a_t is the actual arm we selected,
is gonna be greater than Q(a*),
the real value of the optimal arm.
Okay. So why is this true?
There's two cases here: either
a_t is equal to a*, or a_t is not equal to a*.
So let's just take a second and maybe talk to your neighbor.
If it's the case that our confidence bounds hold, which means
that the upper confidence bound
really is greater than or equal to the mean for that arm,
so if these are true confidence bounds,
this equation is holding,
we're not getting a failure,
then we know that the bound is gonna be greater than
the real expected value for that arm.
Okay. So if that's true,
then this is gonna hold at every time step.
So maybe talk to a neighbor, or if it's not clear what I'm
asking or how to think about that, feel free to raise your hand too.
So there's two cases: either the arm that we selected is a*,
or the arm we selected is not a*, and in both cases this is gonna hold,
if the confidence bounds are correct.
So maybe let's take a second to think about this, or feel free
to raise your hand if it's not clear how to
get started on that.
I wanna be clear here, so
I just wanted to note at the top there that if the confidence bounds hold,
then that upper confidence bound, which is equal to that, is going to be greater than
the real expected value for that arm. Yeah.
On the confidence bound, on your other equation you have written over there?
Yeah.
So you're saying that the optimal,
like your optimal Q value,
should be less than all the confidence bounds for any action?
No, good question.
So I just need to clarify what it is.
This is saying: for the arm that you selected,
whichever arm that is,
the upper confidence bound that you used to choose the arm,
the upper confidence bound of that arm, is higher than the true value of the optimal arm.
That's what this equation is saying.
It's saying that if the confidence bounds hold on all time steps,
which they might not,
because these are only high probability bounds,
but if they hold on all time steps,
then whatever arm you selected,
its upper confidence bound is higher than
the real value of the optimal arm.
Okay, and I just wanted
to be clear what I mean for a confidence bound to hold, so I put this up there.
The upper confidence bound of an arm holding means that
that upper confidence bound, which is defined in
that way, is greater than the true value of that arm.
So let's work through this a little bit.
So let's say there's two cases.
If a_t is equal to a*,
then what this is saying is:
is U_t(a*) greater than Q(a*)?
Does that hold if the confidence bounds hold?
Yes, by definition. So if you look up there,
the upper confidence bound for an action, if it holds,
has to be bigger than the mean for that action.
Okay. So this works.
So if we really selected the optimal action,
we've defined our upper confidence bounds
so they really are at least the mean of that arm, and so this holds.
The other case is that a_t is not equal to a*.
So what does that mean?
That means that U_t(a_t) is greater than U_t(a*),
because otherwise, we would have selected a*.
It means that some other arm had a higher upper confidence bound than the optimal action,
and we know that this is greater than Q(a*).
Okay? So if
the confidence bounds hold,
we know at every single time step the upper confidence bound of the arm we selected is
better than the true mean of the optimal arm. Yes.
Is that true in the epsilon-greedy case as well?
Is it true in the epsilon-greedy case as well?
I don't follow your question yet.
Like, you're selecting this arm using some strategy, right?
Yeah.
And it gets some maximizing action, right?
No. This only holds,
this first part only holds, because we're picking the arg max,
you're not going to be able to see this,
but the arg max over a of U_t(a). So that first inequality,
well, there might be other algorithms that it holds for too,
but it holds in particular for the upper confidence bound algorithm.
Great question. Okay. So this says, if we could get this,
and we will see shortly why that matters,
but kind of intuitively:
this says that if the confidence bounds hold,
then we know that the upper confidence bound of the arm we
select is going to be better than the value of the optimal arm.
And the reason we're going to want that is, later when we're doing the regret bounds,
we do not want to deal with quantities we don't
observe, namely the value of the optimal arm.
Because we don't know what Q(a*) is.
We don't know which arm it is.
So when we look at the regret bound right now,
regret bounds are in terms of Q(a*),
and we don't know what that quantity is.
So we're going to need to figure out some way to get rid of
that quantity, and we're going to end up using these upper bounds,
but we're going to need the fact that the upper bound of the arm we
select is better than Q(a*).
Okay. So this is saying that that's true if our upper confidence bounds hold;
what is the probability that that occurs?
Okay. So what that means is,
if we want to say this holds on all time steps,
we take a union bound,
a union over events:
the union over all the events, for t = 1 to T, of the probability that
Q(a*) is greater than the upper confidence bound of the action that we took.
We want the probability of this,
which is essentially the probability of failure.
So if all confidence bounds hold, things are good;
this says, what is the probability that that failed?
That the arm that you took actually is not better?
That the upper confidence bound is not better than the real mean of the optimal arm?
So this is the failure case.
We don't want this thing to happen,
and we can upper bound its probability
by making sure our upper confidence bounds hold at all time steps.
Okay. So these are the arms.
So what I said up there is that if all of
our confidence bounds hold on all time steps, we can ensure that.
So we're now going to write that down in terms of what?
The probability that the upper confidence bounds do hold on all time steps.
And so we're gonna look at the probability that
Q hat of a, plus u, is at least the true mean of that arm,
for each arm on each time step.
This is the upper confidence bound we defined over there;
this is just saying that the upper confidence bound holds for each arm on each time step.
Well, by definition, over there we said we picked an upper confidence bound
to make sure each of these held with failure probability at most delta over t squared,
because that's how we defined our upper confidence bound.
We picked a big enough thing to add onto our empirical means so that we
could ensure that the upper confidence bound really was larger than our mean.
So now we have a union over all time steps,
a union over all arms,
of delta divided by t squared.
And note that if you sum over t = 1 to
infinity of t to the -2, that's equal to pi squared over 6, which is less than 2.
So when you do the sum, you get 2 m delta, where m is the number of arms.
So what this says is that the probability
that your upper confidence bounds hold over all time steps,
that is, the probability
that they don't fail anywhere, is at least 1 - 2 m delta.
So what we're gonna end up
doing is we're gonna have a high probability regret bound that says:
with probability at least 1 - 2 m delta,
we're gonna get a small regret. Yeah.
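The side fact used here can be checked numerically: the partial sums of 1/t^2 increase toward pi^2 / 6, which is about 1.645 and in particular below 2, so the union over all (infinitely many) time steps costs at most a factor of 2 per arm.

```python
import math

# Partial sum of the series 1/t^2, which converges to pi^2 / 6 from below.
partial = sum(1.0 / (t * t) for t in range(1, 200001))
limit = math.pi ** 2 / 6  # about 1.6449, safely below 2
```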
So what about the infinite horizon case?
Great question. Yes, this is all for an infinite horizon case.
We're gonna write
our regret in terms of T, the number of time steps.
Okay, all right.
Do you guys want me to leave this up, or can we move that now?
Everyone's written it down? Okay.
So let's see, can this go up?
Let me see, or not.
Okay. So why is this useful?
We're now gonna define our regret.
This part of the board just says that we've
made it so these upper confidence bounds hold with high probability.
Now we're gonna try to show what our regret is gonna be. Oh good,
okay, thank you. All right.
So, what's our regret?
The regret of our UCB algorithm after T time steps is just equal to the sum over
all those time steps, t = 1 to T, of Q(a*) - Q(a_t). Okay?
Remember, we don't know either of these things.
We don't know the real mean of any arm
we pick, and we don't know the real mean of the optimal arm.
So we need to turn this into things that we know.
These are unknown.
Okay? So what we're gonna do is one of our favorite tricks in reinforcement learning,
which is we add and subtract the same thing.
So we write the sum over t = 1 to T of U_t(a_t) - Q(a_t)
+ Q(a*) - U_t(a_t). I just added and subtracted the same thing:
the upper confidence bound of the arm that we selected at each time point.
Okay? So then the important thing is that what
we showed over here is that if all of our confidence bounds hold,
then the upper confidence bound of the arm we selected is larger than Q(a*).
That's what we showed over there.
So that means that Q(a*) - U_t(a_t) has to be less than or equal to 0.
Because the upper confidence bound of whatever arm we selected,
we proved over there,
is gonna be higher than the real mean of the optimal arm.
So this second part of the equation is less than or equal to 0,
which means we can upper bound our regret as follows:
we can drop that second term.
So that's nice, right?
Because now we don't have any a*'s anymore.
We're only looking at the actions that we actually took at each time
step, and we're comparing
the upper confidence bound at that time step versus the real mean.
But remember the way we've defined our upper confidence bound over here:
U_t(a_t) is exactly equal to
the empirical mean of a_t, plus the square root of 1 over
2 n_t(a_t), times log t squared over delta. Okay?
And we said here that this was going to bound, from
Hoeffding, the difference Q(a_t) - Q hat(a_t).
Remember, I called this term u;
the probability that the difference was greater than u was small.
So now we're assuming that all of our confidence bounds hold,
which means that we know that the difference between
the empirical mean and the true mean of this arm is bounded by u. Yeah.
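Putting the last few board steps together in one chain (a sketch; every step assumes the confidence bounds hold on all steps, and the last step uses the Hoeffding term with the lecture's loose treatment of constants):

```latex
\mathrm{Regret}(T)
= \sum_{t=1}^{T} \big( Q(a^*) - Q(a_t) \big)
= \sum_{t=1}^{T} \big( U_t(a_t) - Q(a_t) \big)
  + \sum_{t=1}^{T} \underbrace{\big( Q(a^*) - U_t(a_t) \big)}_{\le\, 0}
\;\le\; \sum_{t=1}^{T} \big( U_t(a_t) - Q(a_t) \big)
\;\le\; \sum_{t=1}^{T} \sqrt{\frac{1}{2\,n_t(a_t)} \log\frac{t^2}{\delta}}
```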
Going back, for the bottom panel, sorry, it's a little hard to see. Two questions. First of all,
you have a union over i = 1 to the number of arms;
I don't see where that index actually factors in.
And then also, if you could just go over the third line, with the delta over t squared and the summation, how we derived that.
Sure, yeah.
Are you asking about the second line to the third line?
Yes.
So what we did in this case is,
we said we wanna make sure that, on each of the time steps,
all of the upper confidence bounds hold.
And so that's where we get an additional sum
over here, over all the arms.
So this is conservative.
You could imagine
just doing this over the arm that's selected,
but we don't know which arm is selected.
So this is going to be a looser
upper bound; we're saying this is sufficient.
So we're saying that if you want to make sure that
the upper bound of the arm that is selected is greater than Q(a*),
it is sufficient to ensure that
your upper confidence bounds are valid at all time points, from
this reasoning up here.
And so this is the probability that
your upper confidence bounds are correct at all time points:
for every single time point,
for every single arm, your upper confidence bounds have to hold.
And then, what we get in this case is,
we said that the probability that on a particular time step
an upper confidence bound fails is delta over t squared.
That's how we defined that u term, so that according to Hoeffding's inequality,
it would fail with probability at most delta over t squared.
And then I just made a side note;
some of you might have seen this, but I certainly wouldn't have expected people to.
It just turns out that, in the limit,
if you sum over t = 1 to infinity of t to the -2,
that's pi squared over 6, which is less than 2.
Fun fact. [LAUGHTER] So
you can plug that in, right, because then you just get a 2 here,
and then you just get a sum over all arms, which is m,
and you have a delta.
So this just allows us to take that infinite sum.
Notice also that,
and this goes to the question before, this holds for the infinite horizon,
because when we did this summing, we're basically
making sure that our confidence bounds hold forever.
So we're, okay.
Great. So we said here now that we're
doing all of this part under the assumption that our confidence bounds hold.
Our confidence bounds holding means that the difference between our empirical mean
and the true mean for that same arm is
bounded by U with high probability, where this is the definition of U.
So that's what our Hoeffding inequality allowed us to state.
So take that quantity now,
that's our U and plug it in here.
So this is looking at exactly the difference between our upper confidence bound and Q.
So this is exactly equal to the sum over t = 1 to big T of U. Um,
it's a little confusing in terms of notation.
Um, so I'll just plug in the exact expression right there:
the square root of 1 over 2 N_t(a),
times log of t squared over delta.
Okay, so we just plugged in that the difference between
the empirical mean and the true mean is bounded by this quantity U.
Okay. All right.
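To make that width concrete, here's a small sketch of the Hoeffding-based confidence width we just plugged in (the function name and variable names are mine, not the slides'):

```python
import math

def ucb_width(t, n_pulls, delta):
    """Hoeffding-based confidence width: sqrt(log(t^2/delta) / (2 * n_pulls)).

    Chosen so that each individual bound fails with probability at most
    delta / t^2, which is what makes the union bound over time work."""
    return math.sqrt(math.log(t**2 / delta) / (2 * n_pulls))

# The width shrinks as an arm is pulled more, and grows only slowly with t.
print(ucb_width(t=100, n_pulls=10, delta=0.05))
print(ucb_width(t=100, n_pulls=50, delta=0.05))
```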
So then in this case,
what we can do is we can split this up into the different arms that we pulled, okay?
So this is a sum over all time steps.
Note that if we upper bound the log term using big T,
this is less than or equal to the square root of log big T squared over delta,
and then we're gonna get a sum over
all time steps, and we're gonna split this up according to which arms we selected, okay?
So this is the same as if we look at, for each of the arms,
how many times did we pull it?
So a sum over n equals 1 to N_T(i) of
the square root of 1 over n. So we just divided this up:
for each of the arms,
we selected them some number of times.
That's here; i is indexing our arms,
so N_T(i) is the total number of times we selected arm i.
And then we sum that over the arms, okay.
And then, note the fact
that if you sum from n = 1 to T the square root of 1 over n,
that is less than or equal to 2 times the square root of T. You use an integral argument for that;
I'm happy to talk about it offline. Yes, in the back.
What happened to the 1 over 2 [inaudible]?
Thank you, we have a 2,
we can put a 2 here.
Thanks. I'll be a little loose with constants, but definitely catch me on them,
because most of these bounds will
end up being about whether the regret is sublinear or linear.
It's good to be precise.
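The integral argument mentioned above says the sum of 1 over the square root of n is bounded by 1 plus the integral of x to the -1/2, which gives 2 times the square root of T. A quick numeric spot-check (just an illustration, not the proof):

```python
import math

# Sum of 1/sqrt(n) is bounded above by the integral-comparison
# bound 2*sqrt(T), and the gap stays bounded as T grows.
for T in (10, 1000, 100000):
    s = sum(1.0 / math.sqrt(n) for n in range(1, T + 1))
    print(T, s, 2 * math.sqrt(T))
```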
Okay. So we have this quantity here.
Um, when is this quantity maximized?
This quantity is going to be maximized if we pulled all arms an equal number of times.
Why? Because, um, 1 over n is decreasing with n, and so the
largest this can be is if you split, um,
your pulls across all arms equally.
So if we go back up to here,
and I call this a.
So a is gonna be less than or equal to, excuse me,
the square root of one-half log t squared over delta, times the sum over i equals 1 to m of
the sum over n equals 1 to T over m
of 1 over the square root of n. So this is as if we split all of our pulls equally across all the arms,
okay, and then we can use this expression.
Okay? So this is less than or equal to, we're almost there:
the square root of one-half log T squared over delta, and then we're gonna get the sum
over i equals 1 to m of 2 times the square root of T over m. Okay.
And then, when you sum this over the m arms,
you get less than or equal to 2 times the square root of
one-half log T squared over delta, times the square root of T times m,
because the sum over i brings in an m, and m times the square root of T over m
is the square root of T m. And now we're done.
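And the claim that the equal split maximizes the sum can be spot-checked numerically. This is my own quick sketch with a made-up helper, not part of the proof:

```python
import math

def pull_sum(counts):
    # Sum over arms of sum_{n=1}^{N_i} 1/sqrt(n), for a given
    # allocation of pulls across the arms.
    return sum(sum(1.0 / math.sqrt(n) for n in range(1, c + 1)) for c in counts)

# Same total of 10 pulls over 2 arms: the even split gives the largest sum,
# because 1/sqrt(n) is decreasing, so later pulls of the same arm add less.
equal = pull_sum([5, 5])
skewed = pull_sum([9, 1])
print(equal, skewed)
```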
So what has this shown? This has shown that if we use upper confidence bounds,
the rate at which our regret grows is sublinear: square root of T times a log factor.
So T here is the number of time steps.
So if we use upper confidence bounds, um,
in order to make our decisions,
then the regret grows much more slowly.
This is a problem independent bound.
It doesn't depend on the gaps.
There are much nicer, tighter bounds that depend on the gaps.
But this indicates why optimism is
fundamentally a sound thing to do in the case of bandits:
it allows us to have much better performance in terms of
regret than the epsilon-greedy case. Yeah.
Can you just explain one more time, the last one on the top board, um,
how you went from the summation over t = 1 to big T and you just pulled out the log big T squared over delta term?
What happens to t = 1 to big T - 1?
Great question. Yeah, so this log term here,
um, is ranging over t = 1 to big T,
and the log term is maximized when t is big T.
So we're upper bounding that log term, and then it becomes a constant and we can pull it out of the sum.
Okay, so the cool thing here is that this is sublinear.
That's really the main point.
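To see that sublinear growth empirically, here's a minimal bandit simulation in the spirit of the lecture. It's my own sketch: the Bernoulli arm means are made up, the function name is mine, and it uses the delta over t squared confidence widths from the derivation rather than the classic UCB1 constant:

```python
import math
import random

def run_ucb(means, T, delta=0.05, seed=0):
    """Run UCB on Bernoulli arms; return cumulative regret after T steps."""
    rng = random.Random(seed)
    m = len(means)
    counts = [0] * m       # N_t(i): pulls of each arm so far
    sums = [0.0] * m       # total reward observed per arm
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= m:
            a = t - 1      # pull each arm once to initialize
        else:
            # Pick the arm maximizing empirical mean + Hoeffding width,
            # with width sqrt(log(t^2/delta) / (2 * N_t(i))) as in the lecture.
            def ucb(i):
                return sums[i] / counts[i] + math.sqrt(
                    math.log(t**2 / delta) / (2 * counts[i]))
            a = max(range(m), key=ucb)
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += best - means[a]
    return regret

means = [0.3, 0.5, 0.7]    # made-up arm means, for illustration only
r1, r2 = run_ucb(means, 2000), run_ucb(means, 8000)
print(r1, r2)  # regret grows sublinearly: quadrupling T doesn't quadruple regret
```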
Um, we'll go through an example and, um, more of this next time.
We next go through an example for the toy domain of treating a broken toe:
what do these upper confidence bounds look like in that case,
and what will the algorithm do in these scenarios?
Um, so that's what we'll look at next.
And then, after that we'll look at,
so this is one class of techniques which is this optimism under uncertainty approach,
which is saying that we're going to look at what the value is based on a combination of
the empirical rewards we've seen so far plus an upper confidence bound over them,
um, and use that to make decisions.
And then the next thing we'll see
is the case where we are Bayesian about the world: we
instead maintain a prior, and we update our prior and use that to figure out how to act.
So we'll go through this next week,
um, and I'll see you then.
