Okay, welcome everyone. Today we'll be going over learning theory. This material used to be taught in the main lectures in previous offerings; this year we're covering it as a Friday section. However, some of the concepts we'll cover today are important in the sense that they deepen your understanding of how machine learning works under the covers: what assumptions we're making, why things generalize, and so forth.
So here's the rough agenda for today. We'll quickly start off by framing the learning problem, then go deep into the bias-variance trade-off; we'll spend some time there. We'll also look at another way to decompose the error, into approximation error and estimation error. We'll see what empirical risk minimization is, and then we'll spend some time on uniform convergence and VC dimensions. So let's jump right in.
Right. So there are two main assumptions under which we're going to operate, for this lecture and in fact for most of the algorithms we cover in this course. The first is that there exists a data distribution D from which (x, y) pairs are sampled. This makes immediate sense in the supervised learning setting, where you're expected to learn a mapping from x to y, but the assumption actually holds more generally, even in the unsupervised setting. The core assumption is that there is a data-generating distribution, and the examples in our training set, and the ones we will encounter when we test, all come from that same distribution. Without this, coming up with any theory is much harder. So the picture is: there is some data-generating process; we have a finite number of samples from that process, which becomes our training set; in principle you could draw an infinite number of samples from it; and the examples we encounter at test time are also samples from the same process. That's the assumption. And there is a second assumption: all the samples are drawn independently.
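As a toy illustration of these two assumptions, here is a minimal sketch in Python; the particular distribution D (a noisy linear relationship) and all the numbers are invented purely for illustration:

```python
import random

def sample_from_D(rng):
    """One (x, y) draw from a hypothetical data-generating distribution D:
    y is a noisy linear function of x (invented for illustration)."""
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.1)
    return x, y

rng = random.Random(0)
# Assumption 1: training and test examples come from the SAME distribution D.
# Assumption 2: each example is an independent draw from that distribution.
train = [sample_from_D(rng) for _ in range(100)]   # finite training set, m = 100
test = [sample_from_D(rng) for _ in range(20)]     # test-time examples, same D
print(len(train), len(test))                       # → 100 20
```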
With these two assumptions, we can picture the process of learning as follows. We have a set of (x, y) pairs which we call S: these are (x^(1), y^(1)), ..., (x^(m), y^(m)), so m samples from the data-generating process. We feed this into a learning algorithm, and the output of the learning algorithm is what we call a hypothesis. A hypothesis is a function which accepts a new input x and makes a prediction about y for that x. This hypothesis sometimes also comes in the form of a theta hat: if we restrict ourselves to a class of hypotheses, for example all possible logistic regression models of dimension n, then obtaining those parameters is equivalent to obtaining the hypothesis function itself.
A key thing to note here is that this S is a random variable, while the learning algorithm is a deterministic function. And what happens when you feed a random variable through a deterministic function? You get a random variable. Exactly. So the hypothesis that we get is also a random variable. All random variables have a distribution associated with them: the distribution associated with the data is D, the learning algorithm is just a fixed deterministic function, and so there is also a distribution associated with the parameters we obtain. In a more statistical setting, we'd call the learning algorithm an estimator; if you take some advanced statistics courses, that's the term you'll come across. Here we call it a learning algorithm. And the distribution of theta hat is also called the sampling distribution.
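This picture can be simulated directly. The sketch below uses a made-up one-parameter estimator and a made-up distribution D with true theta* = 2; feeding fresh training sets through the same deterministic algorithm produces draws from the sampling distribution:

```python
import random, statistics

def sample_dataset(rng, m):
    """m i.i.d. (x, y) pairs from a hypothetical D whose true theta* = 2."""
    data = []
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))
    return data

def learn(data):
    """A deterministic learning algorithm: least-squares slope through
    the origin, theta_hat = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / sxx

rng = random.Random(0)
# Feed the random variable S through the deterministic algorithm many times;
# the resulting theta hats are samples from the sampling distribution.
theta_hats = [learn(sample_dataset(rng, m=50)) for _ in range(1000)]
print(statistics.mean(theta_hats))   # close to theta* = 2 for this estimator
```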
What's implied in this process is that there exists some theta star (or an h star, however you want to view it) which is, in a sense, the true parameter: the parameter we wish the learning algorithm would output. Of course, we never know what theta star is, and what we actually get out of the learning algorithm is just a sample of a random variable. A thing to note is that theta star (or h star) is not random; it's just an unknown constant. When we say it's not random, we mean there is no probability distribution associated with it. It's a constant we don't know; that's the assumption under which we operate.
Right. Now let's look at some properties of this theta hat. All the entities that we estimate are generally decorated with a hat on top, which indicates that they are something we estimated, and anything with a star is the true or right answer, which we generally don't have access to.
Any questions so far? Yeah. [BACKGROUND] Yeah, so in the case of logistic regression or linear regression this generally happens to be a vector, but it could be a scalar, a matrix, anything; it's just an entity that we estimate. And sometimes h star can be so generic that it need not even be parameterized; it's just some function that you estimate. So yes, it could be a vector, a scalar, a matrix, anything.
Right. So, in the main lecture we saw this diagram when we were talking about bias-variance. In the case of regression, [NOISE] one fit was just a straight line (let me use a different color), and we saw the concepts of underfitting, overfitting, and a fit that's just right. The concepts of underfitting and overfitting are closely related to bias and variance. This is how you would view it from the data view: this axis is x, this is y, this is your data, and these are the kinds of fits different algorithms might give you. However, to get a more formal view of what bias and variance are, it's more useful to see them from the parameter view.
[NOISE].
So let's imagine we have four different learning algorithms; I'm going to plot four different views of the parameter space, theta 1 and theta 2. Let's imagine we have just two parameters, since that's easier to visualize. These correspond to algorithms A, B, C, and D, and there is a true theta star, which is unknown. Now let's run through this process: sample m examples, run them through the algorithm, obtain a theta hat. Then we start over with a new sample from D, run it through the algorithm, and get a different theta hat. And theta hat is going to be different for different learning algorithms. So first we sample some data as our training set and run it through algorithm A; say this is the parameter we got. Then through algorithm B; say this is the parameter we got, and through C here, and through D over here. And we repeat this: the second draw lands maybe here, maybe here, and so on, over and over. The key is that the number of samples per run is m, and that is fixed. But every time we repeat the process, we get a different point over here.
[NOISE]
Right? So each dot corresponds to a sample of size m, and the number of dots is the number of times we repeated the experiment. What we see is that these dots are samples from the sampling distribution. Now, the concepts of bias and variance are visible over here. If we were to classify these plots: two of these algorithms have low bias and two have high bias, and likewise two have low variance and two have high variance, so the four plots cover the four combinations of low or high bias with low or high variance.
So what does this mean? Bias is asking: is the sampling distribution centered around the true, unknown parameter? And variance is measuring how dispersed the sampling distribution is. So, formally speaking, this is bias and variance, and it becomes pretty clear when we see it in the parameter view instead of the data view. Essentially, bias and variance are just properties of the first and second moments of your sampling distribution. You're asking about the first moment, the mean: is it centered around the true parameter? And about the second moment: that's literally the variance in the bias-variance trade-off. Yeah.
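This moments view can be checked numerically. In the sketch below (the distribution, the true parameter, and both estimators are invented for illustration), the plain sample mean is nearly unbiased, while a deliberately shrunken estimator trades bias for lower variance:

```python
import random, statistics

def run_experiment(estimator, m, trials, rng):
    """Repeatedly draw m samples, apply the estimator, and return the first
    two moments (bias, variance) of the resulting sampling distribution."""
    theta_hats = [estimator([rng.gauss(3.0, 1.0) for _ in range(m)])
                  for _ in range(trials)]
    bias = statistics.mean(theta_hats) - 3.0   # theta* = 3, known only because we simulate
    variance = statistics.pvariance(theta_hats)
    return bias, variance

rng = random.Random(0)
mean_est = lambda xs: sum(xs) / len(xs)            # low bias, higher variance
shrunk_est = lambda xs: 0.5 * sum(xs) / len(xs)    # biased toward 0, lower variance
b1, v1 = run_experiment(mean_est, m=20, trials=2000, rng=rng)
b2, v2 = run_experiment(shrunk_est, m=20, trials=2000, rng=rng)
print(abs(b1) < abs(b2), v2 < v1)   # first moment vs second moment trade-off
```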
[inaudible].
Yeah.
[inaudible].
So this is a diagram where I'm using only two thetas, just so it fits on a whiteboard. You would imagine something with high variance, for example this one, to really be of a much higher dimension, not just two, but it would still be spread out: the points would live in a higher-dimensional space, just as dispersed. So, to the question about the number of thetas in these plots: yes, you can imagine this picture to be higher-dimensional. And also, different algorithms can have different bias and variance even if they have the same number of parameters. For example, if you add regularization, the variance comes down.
A few observations we want to make. As we increase the size of the data we feed in each time, that is, if we take a bigger sample for every run, the variance of theta hat becomes smaller. If we repeat the same experiment with a larger number of examples, all of these clouds would be more tightly concentrated. So the spread is a function of how many examples we have in each iteration.
Right? So as m tends to infinity, the variance tends to zero. If you were to collect an infinite number of samples and run them through the algorithm, you would get some particular theta hat, and if you repeated that, again with an infinite number of examples, you would always keep getting the same theta hat. The rate at which the variance goes to 0 as you increase m is what's called the statistical efficiency: it's a measure of how efficient your algorithm is at squeezing information out of a given amount of data. If theta hat tends to theta star as m tends to infinity, we call such an algorithm consistent. And if the expected value of theta hat is equal to theta star for all m, so that no matter how big your sample size is you always end up with a sampling distribution centered around the true parameter, then your estimator is called an unbiased estimator. Yes.
[inaudible].
So efficiency is basically the rate at which the variance drops to 0 as m tends to infinity. For example, you may have one algorithm where the variance scales like 1/m^2 and another where it scales like e^(-m); the variance can drop at different rates relative to m, and that's what efficiency captures here.
[NOISE] Right? Yeah.
[inaudible]
Yeah. So, how does theta hat approach theta star? Here's one thing to be clear about: theta star is a number, a constant, but theta hat is a random variable. What we're saying is that as m tends to infinity, the distribution of theta hat converges to a constant, and that constant is theta star. Which means at smaller values of m your algorithm might be centered elsewhere, but as you get more and more data, the variance of your sampling distribution shrinks, and it also eventually gets centered around the true theta star.
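A quick simulation of this convergence, using the sample mean as a consistent, unbiased estimator of a hypothetical theta* = 5 (all numbers chosen arbitrarily):

```python
import random, statistics

def learn_mean(rng, m):
    """Estimator: sample mean of m i.i.d. draws; true theta* = 5 here."""
    return statistics.mean(rng.gauss(5.0, 2.0) for _ in range(m))

rng = random.Random(0)
variances = {}
for m in (10, 100, 1000):
    # Repeat the whole "sample m examples, learn theta_hat" experiment 500 times.
    theta_hats = [learn_mean(rng, m) for _ in range(500)]
    variances[m] = statistics.pvariance(theta_hats)
# The sampling distribution concentrates: variance shrinks roughly like sigma^2 / m.
print(variances[10] > variances[100] > variances[1000])   # → True
```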
Okay. So, informally speaking, if your algorithm has high bias, it essentially means that no matter how much data or evidence you provide, it always stays away from theta star. You cannot change its mind no matter how much data you feed it; it's never going to center itself around theta star. That's a high-bias algorithm: it's biased away from the true parameter. And high variance you can think of as your algorithm being easily distracted by the noise in the data, getting swayed far away depending on the noise in your particular sample. Algorithms like that we call high-variance, because they can easily be swayed by noise in the data. And as we're seeing here, bias and variance are independent of each other: an algorithm can have any amount of bias combined with any amount of variance; there's no inherent correlation between the two.
So how do we fight variance? First let's look at how we can address variance. Yes. [BACKGROUND] So bias and variance are properties of the algorithm at a given sample size m. These plots were for a fixed size m, and for that fixed data size, this algorithm has high bias and low variance, this one has high variance and high bias, and so on. Yeah, you can think of it as assuming a fixed data size. Right, so: fighting variance. One way to address a high-variance situation is to just increase the amount of data you have, and that naturally reduces the variance of your algorithm. Yes.
[BACKGROUND].
That is true. You don't know upfront whether you're in a high-bias or a high-variance scenario. One way to test that is by looking at your training performance versus your test performance; we'll go over that, and in fact we'll go into much more detail in the main lectures on how to identify bias and variance. Here we're just going over the concepts of what bias and variance are. So one way to address variance is to just get more data: as you get more data, your sampling distributions tend to get more concentrated.
The other way is what's called regularization. When you add regularization, like L2 or L1 regularization, what you're effectively doing is this: say you have an algorithm with low bias but high variance, and you add regularization. What you end up with is an algorithm that has maybe a small bias — you increase the bias by adding regularization — but low variance. So if what you care about is predictive accuracy, you're probably better off trading your high variance for some bias and reducing your variance to a large extent. Yeah.
[BACKGROUND].
Yeah, we're going to look into that next.
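This variance-for-bias trade can be simulated. The sketch below uses a hypothetical one-parameter least-squares problem with true theta* = 2, where an L2 penalty lam shrinks the estimate; the problem and all constants are invented for illustration:

```python
import random, statistics

def fit(data, lam):
    """One-parameter least squares with an L2 penalty:
    theta_hat = sum(x*y) / (sum(x*x) + lam).  lam = 0 is the unregularized fit."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / (sxx + lam)

def sampling_stats(lam, rng, m=20, trials=2000):
    """Bias and variance of the sampling distribution of theta_hat."""
    hats = []
    for _ in range(trials):
        data = [(x, 2.0 * x + rng.gauss(0.0, 1.0))
                for x in (rng.uniform(-1.0, 1.0) for _ in range(m))]
        hats.append(fit(data, lam))
    return statistics.mean(hats) - 2.0, statistics.pvariance(hats)

rng = random.Random(0)
bias0, var0 = sampling_stats(lam=0.0, rng=rng)
biasL, varL = sampling_stats(lam=5.0, rng=rng)
print(varL < var0, abs(biasL) > abs(bias0))   # regularization: less variance, more bias
```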
Right. So in order to get a better understanding of this, let's imagine this is the space of all hypotheses. Let's assume there exists a best possible hypothesis; call it g. By best possible hypothesis I mean: if you were to take this hypothesis and take the expected value of the loss with respect to the data-generating distribution, across in effect an infinite amount of data, you would have the lowest error with it. And then there is a class of hypotheses; call this class H. This, for example, could be the set of all logistic regression hypotheses, or the set of all SVMs. What we end up with when we take a finite amount of data is some member of this class; call it h hat. There is also some hypothesis in this class, call it h star, which is the best in-class hypothesis: within the set of all logistic regression functions, there exists some model which would give you the lowest error if you tested it on the full data distribution. The best possible hypothesis g may not be inside your hypothesis class; it's conceptually something that can lie outside the class. So: g is the best possible hypothesis, h star is the best in class H, and h hat is the one you learned from finite data.
So we also introduce some new notation. Epsilon of h we will call the risk, or generalization error. [NOISE] It is defined as epsilon(h) = E over (x, y) sampled from D of the indicator 1{h(x) != y}. You sample examples from the data-generating process, run them through the hypothesis, and check whether the prediction matches the label: if it doesn't match you get a 1, and if it matches you get a 0. So on average this is, roughly speaking, the fraction of all examples on which you make a mistake. Here we're thinking about this from a classification point of view, checking whether the predicted class matches the true class. You can extend this to the regression setting as well; that's a little harder to analyze, but the generalization holds there too. We'll stick to classification for now. And we have epsilon hat sub S of h, called the empirical risk or empirical error: epsilon-hat_S(h) = (1/m) * sum from i = 1 to m of 1{h(x^(i)) != y^(i)}. The difference is that the risk is an infinite process — you're sampling from D forever and computing the long-run average — whereas the empirical risk is computed on the finite sample given to you: the fraction of those examples on which you make an error.
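The empirical risk is straightforward to compute; here is a small sketch with an invented threshold classifier and an invented four-example training set:

```python
def empirical_risk(h, S):
    """epsilon_hat_S(h): the fraction of the m training examples in S
    on which the hypothesis h makes a classification mistake."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Toy check: a threshold classifier and four labeled examples.
h = lambda x: 1 if x >= 0 else 0
S = [(-2, 0), (-1, 0), (1, 1), (3, 0)]   # only the last example is misclassified
print(empirical_risk(h, S))              # → 0.25
```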
All right. Before we go further, there was a question about how adding regularization reduces your variance; let me get back to that in a bit.
So, epsilon of g is called the Bayes error. [NOISE] This is the rate at which the best possible hypothesis makes errors, and it can be non-zero: even the best possible hypothesis ever can still make some mistakes. It's also called the irreducible error. [NOISE] For example, if your data-generating process spits out examples where the same x appears with different y's in two different examples, then no learning algorithm can do well on those cases. That's one kind of irreducible error; there can be other kinds as well.
And epsilon of h star minus epsilon of g is called the approximation error. [NOISE] This is the price we are paying for limiting ourselves to some class: the difference between the best possible error you can get overall and the best possible error you can get within H. This is an attribute of the class — the cost of restricting yourself to it.
Then you have epsilon of h hat minus epsilon of h star, which we call the estimation error: given the m examples we got and the h hat we estimated from them, this is the error due to estimation. So: the error of g is the Bayes error, the gap between that and the best in class is the approximation error, and the gap between the best in class and the hypothesis you end up with is the estimation error.
And it's easy to see that epsilon of h hat is equal to the estimation error plus the approximation error plus the irreducible error: if you add them up, all the intermediate terms cancel out and you're left with just epsilon of h hat.
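Written out, the telescoping sum from the board is:

```latex
\epsilon(\hat{h})
  = \underbrace{\epsilon(\hat{h}) - \epsilon(h^\star)}_{\text{estimation error}}
  + \underbrace{\epsilon(h^\star) - \epsilon(g)}_{\text{approximation error}}
  + \underbrace{\epsilon(g)}_{\text{irreducible (Bayes) error}}
```

Adding the three terms, the intermediate epsilon(h star) and epsilon(g) cancel, leaving epsilon(h hat).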
So it's useful to think of your generalization error as having different components. There is some error you simply cannot reduce, no matter what hypothesis you pick or how much training data you have: there's no way to get rid of the irreducible error. Then you make a decision to limit yourself to, say, neural networks or logistic regression or whatever, thereby defining a class of possible models, and that choice itself has a cost: your approximation error. And then you're working with limited data, and with that limited data, and possibly due to some nuances of your algorithm, you also have an estimation error, which is generally due to the data. We can further break the estimation error down into estimation variance and estimation bias, and therefore write epsilon of h hat as estimation variance plus estimation bias plus approximation error plus irreducible error. What we commonly call variance is the estimation variance; what we commonly call bias is the estimation bias together with the approximation error; and the rest is just irreducible. So sometimes you see the bias-variance decomposition and sometimes the estimation-approximation error decomposition. They are somewhat related, but not exactly the same.
So bias is basically trying to capture why h hat is far from g: why did our hypothesis stay away from the true hypothesis? That could be because your class is too small, or it could be due to other reasons such as, as we'll see, regularization that keeps you away from certain hypotheses. And the variance is almost always due to having small data, though it can have other causes as well. So these are two different ways of decomposing your error.
So now, if you have high bias, how do you fight it? Any guesses? [inaudible] Yeah, exactly. One way is to just make your H bigger — make your hypothesis class bigger — and you can also try different algorithms after making H bigger. This is the mirror image of what we saw before, where regularization reduces your variance by paying a small cost in bias. [NOISE] So let's say your algorithm has high bias and some variance, and you make H, your class, bigger: this generally results in something that reduces your bias but also increases your variance.
With this picture you can also see how variance comes into play. Just by having a bigger class, there is a higher probability that the hypothesis you estimate can vary a lot. Conversely, if you reduce the space of hypotheses, you may be increasing your bias, because you may be moving away from g, but you're also effectively reducing your variance. That's the trade-off you observe: a step you take to reduce bias, by making the class bigger, also makes it possible for your h hat to land in a much wider space, increasing your variance; and a step you take to reduce variance, by making your class smaller, may move the class away from g and thereby increase your bias.
So, when you add regularization — this is the question someone asked before, of how adding regularization decreases the variance — you are effectively shrinking the class of hypotheses you have. You start penalizing hypotheses whose theta is very large, and in that way you shrink the class. If you shrink the class of hypotheses, your variance is reduced, because there's much less wiggle room for your estimator to place h hat. And if in shrinking it you move away from g, you also introduce bias. That's the bias-variance trade-off. Any questions on this so far?
Yeah. [BACKGROUND] Yeah, you probably want to think of this as a generalized version of that diagram. There we had fixed theta 1 and theta 2; because the hypotheses could be parameterized by a few parameters, we could plot them in a metric space. This is more general, like a bag of hypotheses. But in both diagrams a point is one hypothesis; here it's parameterized, here it's not.
Yes. [BACKGROUND] So the question is: what if we shrink the class toward h star? The thing is, we don't know where h star is. If we knew it, we wouldn't need to learn anything; we could just go straight there.
[BACKGROUND] With regularization? So the question is: when we add regularization, are we sure that the bias goes up? No, we don't know for sure, but this is the common scenario: when you add regularization, you reduce the variance for sure, and you're very likely to introduce some bias in the process. [BACKGROUND] If you add regularization, you're shrinking your hypothesis space in some way, so you're likely moving away from g; you're very likely to add a little bit of bias in that process.
Yes. So I would encourage you, after this lecture, to think about this a little more slowly. It takes a while to internalize the concepts of bias and variance; they're not very intuitive, but thinking about them more definitely helps. All right, any other questions before we move on?
[BACKGROUND] An example of a hypothesis class? An example would be the set of all logistic regression models. When you do gradient descent on your logistic regression model, you're implicitly restricting yourself to the set of possible logistic regression models; that restriction is implicit.
[BACKGROUND] So, h hat is the output of the learning algorithm. The box in the diagram is not the model; it's the learning algorithm — gradient descent, for example — and its output is the parameters you converge to. So you probably don't want to think of the box as the model you learned, but as the training process, whose output is the model you learn. And that model is a point in the class of hypotheses.
[BACKGROUND] Yes, so you fix the class of models — you say, I'm only going to learn logistic regression models — and for different samples of data that you feed in as your training set, you learn a different theta hat. [BACKGROUND] Yes, they all have to be within the class of hypotheses.
All right, so let's move on. Next we come across a concept called empirical risk minimization. [NOISE] ERM. The empirical risk minimizer is a learning algorithm: it's one of those boxes that we drew earlier. The diagram we drew earlier, on which all our reasoning so far was based, didn't actually say what happens inside the box. It could be doing gradient descent; it could be doing something else entirely; it could be some smart programmer who's written a whole bunch of if-else statements that just return a theta — it could be anything. And no matter what kind of algorithm is used, the bias-variance theory still holds. Now we're going to look at a very specific type of learning algorithm called the empirical risk minimizer. You feed S into your algorithm and you get h hat ERM, defined as the hypothesis in H that minimizes the empirical risk: h-hat_ERM = argmin over h in H of epsilon-hat_S(h). What is empirical risk minimization? It's what we've been doing so far in the course: we try to find the hypothesis in a class of hypotheses that minimizes the average training error. For example, this minimizes the training error from a classification perspective — equivalently, it maximizes the training accuracy — which is different from what logistic regression literally did, where we were doing maximum likelihood, that is, minimizing the negative log-likelihood. It can be shown that minimizing losses like the logistic loss is well approximated by ERM, and this theory should hold nonetheless. All right.
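A minimal sketch of ERM over a small finite class; the threshold classifiers and the four-example training set here are invented for illustration:

```python
def empirical_risk(h, S):
    """Fraction of training examples in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Empirical risk minimizer: the hypothesis in the class H with the
    lowest average training error (argmin of epsilon_hat_S over H)."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Hypothetical finite class: threshold classifiers h_t(x) = 1{x >= t}.
def make_h(t):
    return lambda x: 1 if x >= t else 0

H = [make_h(t) for t in (-1.0, 0.0, 1.0, 2.0)]
S = [(-0.5, 0), (0.2, 0), (1.4, 1), (2.5, 1)]
h_hat = erm(H, S)
print(empirical_risk(h_hat, S))   # → 0.0  (threshold t = 1.0 fits S perfectly)
```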
So if we limit ourselves to the class of algorithms that work by minimizing the training loss — as opposed to something that, say, returns a constant all the time, or does something else — if we limit ourselves to empirical risk minimizers, then we can come up with more theoretical results: for example, uniform convergence, which we're going to look at right now.
Right. So we're limiting ourselves to empirical risk minimizers, and we're starting off with uniform convergence. There are two central questions that we're interested in. One question is: if we do empirical risk minimization, that is, if we just reduce the training loss, what does that say about generalization? That is basically epsilon hat of h versus epsilon of h. Consider some hypothesis; it gives you some amount of training error. What does that say about its generalization error? That's one central question we want to consider. And the second one is: how does the generalization error of our learned hypothesis compare to the best possible generalization error in that class? Note that we're only talking about h star there, not g. So h star is the best in class. These are the two central questions we want to explore.
And for this, we're going to use two tools. The first is called the union bound. What's the union bound? If we have k different events A_1, ..., A_k, and these need not be independent, then the probability of A_1 union A_2 union ... union A_k is less than or equal to the sum of the individual probabilities. If this looks trivial, it is trivial; it's essentially one of the axioms from your undergrad probability class. The probability of any one of these events happening is less than or equal to the sum of the probabilities of each of them happening.
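Written out, the union bound is:

```latex
P(A_1 \cup A_2 \cup \cdots \cup A_k) \;\le\; P(A_1) + P(A_2) + \cdots + P(A_k).
```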
Right. And then we have a second tool, called Hoeffding's inequality. We're only going to state the inequality here; there are supplemental notes on the website that actually prove the Hoeffding inequality, and you can go through those. In fact, throughout this session we are only going to state results; we're not going to prove anything.
So let Z_1, Z_2, ..., Z_m be sampled from a Bernoulli distribution with parameter phi. Let phi hat be the average of them, (1/m) times the sum of the Z_i, and let there be a gamma greater than zero, which we call the margin. Hoeffding's inequality says that the probability that the absolute difference between the estimated parameter phi hat and the true parameter phi is greater than the margin can be bounded by 2 times the exponential of minus 2 gamma squared m. Right?
Not very obvious, but you can show this. What it's basically saying is: there is some parameter between 0 and 1 of a Bernoulli distribution. The fact that it is between 0 and 1 means it's bounded, and that boundedness is a key requirement for Hoeffding's inequality. Now we take samples from this Bernoulli distribution; each Z_i is either a 0 or a 1, equal to 1 with probability phi, and the estimator is just the average of your samples. The probability that the absolute difference between the estimated value and the true value is greater than some margin gamma is bounded by this expression.
Right. So there are a lot of things happening here, so you probably want to slowly think through it. Gamma is the margin, and the quantity inside the absolute value is the deviation of the estimate: how far your estimated value is from the true one. You'd like it to be small. So say you want phi hat and phi to differ by no more than, I don't know, 0.001; that's the margin you're interested in. Then Hoeffding's inequality says that if you were to repeat this sampling process over and over and over, the fraction of times phi hat lands farther than 0.001 from the true parameter is going to be less than this expression, which is a function of m. And you can believe it, because as m increases the bound becomes smaller, which means the probability of your estimate deviating by more than the margin only shrinks as you increase m. So this is Hoeffding's inequality, and we're going to use it.
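As a quick sanity check you can run yourself (not something done in lecture; the values phi = 0.3, m = 500, gamma = 0.05 are arbitrary choices), here is a small simulation sketch:

```python
import math
import random

def hoeffding_demo(phi=0.3, m=500, gamma=0.05, trials=2000, seed=0):
    """Empirically compare the deviation probability of a Bernoulli mean
    estimate against the Hoeffding bound 2 * exp(-2 * gamma^2 * m)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        # phi_hat is the average of m Bernoulli(phi) samples (0s and 1s).
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            exceed += 1
    bound = 2 * math.exp(-2 * gamma ** 2 * m)
    return exceed / trials, bound
```

The observed deviation frequency should come out well below the bound (Hoeffding is not tight), and increasing m shrinks both.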
[inaudible].
Oh, yeah. Questions?
[inaudible].
So the question is: is h star the limit of h hat as m goes to infinity? It is h star in the limit as m goes to infinity if the estimator is consistent, right? We went over the concept of consistency: given infinite data, will you eventually get to the right answer? If your estimator is not consistent, then it need not be. So in general, h hat need not converge to h star as you get an infinite amount of data. All right?
So now we want to use these tools, tool 1 and tool 2, to answer the central questions. Any other questions?
Yeah.
[BACKGROUND]
This is a more limited version of Hoeffding's inequality, and yes: if we limit ourselves to a Bernoulli variable with some parameter phi, take samples from it, and construct an estimator that is the average of those 0s and 1s, then this inequality holds. This form of the inequality is what we're calling Hoeffding's inequality.
Yes.
[BACKGROUND] So in general, there is this class of algorithms called maximum likelihood estimators, and a pure maximum likelihood estimator is generally consistent. If you include regularization, then it need not be consistent, though I'm not very sure about that. Yeah, sure.
Yeah. So basically, if you think about a neural net where you have something that's completely [inaudible], a neural net is not always consistent. So, repeating for the mic, what he responded was: if you have an algorithm like a neural net, which is non-convex, you may not end up with the same result even as you increase the amount of data. Though I would probably think of the non-convexity as part of an estimation bias, because you could in theory always find the global minimum of a neural network; it's just that there's some bias in our estimator, because we are using gradient descent and cannot solve it exactly.
Okay. So now let's use these two tools, and for that, let's look at this diagram. So over here on the horizontal axis we have hypotheses, and on the vertical axis we have error. There's actually just one curve here; I'm trying to make it thick, which may make it look like multiple curves, but it's one curve, and we'll call it epsilon of h. This is the generalization risk, or the generalization error, of every possible hypothesis in our class. So pick one hypothesis, which sits somewhere on this axis, and calculate its generalization error; not the empirical one, the generalization error; that's the height of the curve there. And we also have something like this dotted curve, which corresponds to epsilon hat sub S of h. Let's sample a set of m examples, calculate the empirical error of every hypothesis in our class on that sample, and plot that as a curve.
Any questions on what these two curves are? Yeah. [BACKGROUND] They need not meet. In fact, this is very likely not even a straight line; you're just scanning over all possible hypotheses, and it may not be convex. This is just to build better intuitions about these ideas.
Yes. [BACKGROUND] So the thick black line is the generalization error of all your hypotheses, right? And let's say you sample some data; call it S. On that sample, you have a training error for every possible hypothesis. We haven't learned anything yet: this is the generalization error, and this is the empirical error for the given S.
Now, in order to apply Hoeffding's inequality here, let's consider some h_i, some hypothesis; think of it as starting with some arbitrary parameter. The height from that point up to the thick black curve is the generalization error of h_i; let me call it epsilon of h_i. And the height up to the dotted curve is epsilon hat of h_i; I'm going to drop the subscript S for now, but it corresponds to the sample that we obtained.
Now, one thing you can check is that the expected value of epsilon hat of h_i equals epsilon of h_i, where the expectation is with respect to the data sampled. What this means is: for one particular sample, this is the empirical error you got. Take another set of samples, and that dotted curve might look some other way, and the height of the dotted line would be somewhere else. On average, if you average across all possible training samples you could get, the expected height to the dotted line is the height to the thick line. Right? That's justified.
Now, if you apply Hoeffding's inequality here, you get that the probability of the absolute difference between the empirical error and the generalization error being greater than gamma is less than or equal to 2 exp(-2 gamma^2 m). This is just Hoeffding's inequality as we wrote it, except in place of phi and phi hat, we have the true generalization error and the empirical error.
Any questions on this so far? So what we're saying is essentially that the gap between the generalization error and the empirical error being greater than some margin gamma is bounded by this expression. So loosely speaking, this means that as we increase the sample size m, if we plot the set of all dotted lines for a larger m, they are going to be more concentrated around the black line. Does that make sense? Take a moment and think about it.
This dotted line corresponds to one sample S of some particular size m. We could take another sample of m examples, and that curve might look something like this; take another sample of size m, and that might look something like this. Now consider the set of all deviations from the black line to every possible dotted line, along the vertical line at h_i. This gap is greater than the margin gamma with probability less than the term over here. So it essentially means that if you start plotting dotted lines with a bigger m, they are going to be much more tightly concentrated around the true generalization error. Does that make sense? You're basically applying Hoeffding's inequality to this gap instead of to some phi; that's all you're doing.
Right? Now, that's good, but there's a problem here. The problem is that we started with a fixed hypothesis and then averaged across all possible data you could sample. In practice, this is useless, because in practice we start with some data and run the empirical risk minimizer to find the lowest-error h for that particular data. Which means that h and the data are not really independent: you chose h to minimize the empirical risk for the particular data you were given in the first place. So to fix this, what we want to do is extend the result we got so that it accounts for all h.
Now, if we want a probabilistic bound on the gap between the generalization error and the empirical error that holds for all h, what is that bound going to look like? This is called uniform convergence, because we're asking how the empirical risk curve converges uniformly to the generalization risk curve. Uniform convergence is something you can apply to functions in general, but here we're applying it to the risk curves across our hypotheses.
So we showed this pointwise bound using Hoeffding's inequality, and we can apply the union bound to union across all h. Let me restate where we are: we got this bound for a fixed h, but we're interested in a bound that holds for any possible h. That's our next step. And the way we're going to extend this pointwise result across all hypotheses looks different for two cases: one is the case of a finite hypothesis class, and the other is the case of infinite hypothesis classes.
So what does that look like? Let's first consider finite hypothesis classes. We're going to assume that the class H has a finite number of hypotheses. The result by itself is not very useful, but it's going to be a building block for the other case. So let's assume the number of hypotheses in the class is some number K. We can show, and I'm not going to go over the derivation, just write out the result, which is pretty intuitive: we apply the union bound over all K hypotheses, and we end up multiplying the pointwise bound by a factor of K.
So what we get is: the probability that there exists some hypothesis in H such that the absolute difference between its empirical error and its generalization error is greater than gamma is less than or equal to K times the probability for any one hypothesis, which is K times 2 exp(-2 gamma^2 m). And if we flip it over, negating both sides, we get: the probability that for all hypotheses in our class the gap between the empirical risk and the generalization risk is at most gamma is greater than or equal to 1 minus 2K exp(-2 gamma^2 m).
Okay. So with probability at least 1 minus this expression, which we can call delta, the gap is going to be less than gamma for all hypotheses simultaneously. This is just Hoeffding's inequality plus the union bound; negate the two sides and you get this. You can go through it slowly later from the notes, which cover this in more detail.
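Collected in one place, the finite-class uniform convergence statement is:

```latex
P\big(\exists\, h \in \mathcal{H} : |\hat{\varepsilon}(h) - \varepsilon(h)| > \gamma\big)
\;\le\; 2K \exp(-2\gamma^2 m),
\quad\text{equivalently}\quad
P\big(\forall\, h \in \mathcal{H} : |\hat{\varepsilon}(h) - \varepsilon(h)| \le \gamma\big)
\;\ge\; 1 - 2K \exp(-2\gamma^2 m).
```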
Right? Now, let's set delta equal to 2K exp(-2 gamma^2 m). We basically now have a relation between delta, which is the probability of error (by error I mean the event that the empirical risk and the generalization risk are farther apart than the margin), gamma, which is called the margin of error, and m, your sample size.
So what this basically tells us is: your algorithm could have been any kind of algorithm, but if it is the kind that minimizes the training error, the empirical risk minimizer, then you get a relation between the margin of error, the probability of error, and the sample size. What we can do with this relation is fix any two of the three and solve for the third, and that gives us actionable results. I'm only going to go over one of those.
So let's fix gamma and delta to be greater than zero, and solve for m. We get m greater than or equal to 1 over 2 gamma squared, times log 2K over delta. What this means is: with probability at least 1 minus delta, which could be, say, at least 99% or 99.9%, the margin of error between the empirical risk and the true generalization risk is going to be less than gamma, as long as your training set size is bigger than this expression.
That's something actionable for us; now the theory is useful. This is also called the sample complexity result. Basically, what it means is that as you increase m and sample different data sets, your dotted lines are going to get closer and closer to the thick line, which means minimizing on the dotted line will also get you closer to minimizing the generalization error. So this is telling you how minimizing the empirical risk gets you closer to good generalization.
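As an illustration of using this sample-complexity expression (the numbers here are made up, not from the lecture), a tiny calculator sketch:

```python
import math

def sample_complexity(gamma, delta, k):
    """Smallest integer m such that, with probability at least 1 - delta,
    |empirical error - generalization error| <= gamma holds simultaneously
    for all k hypotheses in a finite class (Hoeffding + union bound)."""
    return math.ceil(1.0 / (2.0 * gamma ** 2) * math.log(2.0 * k / delta))

# e.g. a class of k = 1000 hypotheses, margin 0.1, failure probability 0.05
m_needed = sample_complexity(gamma=0.1, delta=0.05, k=1000)
```

Note the dependence on K is only logarithmic, which is why even fairly large finite classes need a modest amount of data.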
Okay. So we started off with two questions relating the empirical risk to the generalization risk. Now let's explore the second question: how does the generalization error of our minimizer compare with the best possible in the class?
So let's look at this diagram again. Let's say we started with this dotted curve, and its minimizer is this point. Sorry, the diagram is a little cluttered; let me erase the previous one. So this is h hat, and it has a particular generalization error; that is this point here. So let's assume we got this data set, ran the empirical risk minimizer, and obtained this hypothesis. When we deploy it in the real world, its error is going to be that much. Now, how does this compare to the performance of the minimizer of the thick curve, which is h star, the best in class?
Now, we want to get a relation between this error level and that error level. We got one bound that relates the empirical error to the generalization error, and now we want something that relates the generalization error of h hat to that of h star. How do we do that? It's pretty straightforward.
So the generalization error of h hat, that's this dot over here, is less than or equal to the empirical risk of h hat plus gamma. That's the result we got using Hoeffding plus the union bound: the gap between the dotted line and the thick black line is always less than gamma, and since it's an absolute value, we can write it this way as well. So we start from the thick black line and drop down to the dotted line. And this, in turn, is less than or equal to the empirical error of h star plus gamma. Why is that?
Because the empirical error of h hat, by definition, is less than or equal to the empirical error of any other hypothesis, including the best in class. This is the training error, not the generalization error. So we dropped from the generalization error to the training error, and we said this training error is always going to be less than or equal to the empirical error of the best in class; you can see in the diagram that the best in class sits higher on the dotted empirical error curve.
And this gap, again, is bounded because we proved uniform convergence: the gap between the dotted line and the thick line is bounded by gamma for any h. So this is at most the generalization error of h star plus 2 gamma, because we added the extra margin. So we wanted the relation between the generalization error of our hypothesis and the generalization error of the best-in-class hypothesis. We dropped from the generalization error to the empirical error of our hypothesis, related that to the empirical error of the best in class, and again bounded the gap between those two. So we've bounded the gap between the generalization error of our hypothesis and the best-in-class generalization error. Any questions on this?
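The chain of inequalities on the board, written out in one line:

```latex
\varepsilon(\hat{h})
\;\le\; \hat{\varepsilon}(\hat{h}) + \gamma
\;\le\; \hat{\varepsilon}(h^*) + \gamma
\;\le\; \big(\varepsilon(h^*) + \gamma\big) + \gamma
\;=\; \varepsilon(h^*) + 2\gamma,
```

where the first and third steps use uniform convergence, and the middle step uses the fact that h hat minimizes the empirical error.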
So the result basically says: with probability at least 1 minus delta, and for training size m, the generalization error of the hypothesis from the empirical risk minimizer is going to be within the best-in-class generalization error plus 2 times the square root of 1 over 2m times log 2K over delta. You can get this by setting that expression equal to delta and solving for gamma. Any questions? I think we're already over time.
So the case for infinite classes is an extension of this; maybe I'll just write the results. There is a concept called the VC dimension, which is a pretty simple concept, but we won't be going over it in detail today. You can think of the VC dimension as trying to assign a size to an infinite hypothesis class. For a finite hypothesis class, we had K as the size of the class; VC of some hypothesis class is going to be some number which plays the role of the size of the class. It's basically telling you how expressive the class is, and there are very nice geometric interpretations of the VC dimension. Using the VC dimension, you can get a similar bound, but now it's no longer restricted to finite classes.
The margin gets replaced by some big-O expression: in place of the margin we had before, we end up with a different margin that is a function of the VC dimension. And the key takeaway from this is that the number of data examples, the sample complexity that you need, is generally on the order of the VC dimension to get good results. That's the main result from that, right?
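For reference, the usual statement of this bound (with d denoting the VC dimension of H; this is the form given in the standard course notes, not written out in the lecture) is: with probability at least 1 minus delta,

```latex
\varepsilon(\hat{h})
\;\le\; \min_{h \in \mathcal{H}} \varepsilon(h)
\;+\; O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),
```

so achieving a fixed margin and confidence requires m = O(d) examples, up to logarithmic factors.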
With that, I guess we'll break for the day, and we'll take more questions afterwards.
