All right, so let's get started.
So today,
we're gonna talk about what are probably
the two most famous theorems in
the entire history of probability.
They're called the law of large
numbers and the central limit theorem.
They're closely related, so it makes sense
to do them together, kind of compare and
contrast them.
I can't think of a more famous
probability theorem than these two.
So the setup for today is that
we have i.i.d. random variables.
Let's just call them X1, X2, and so on, i.i.d.
Since they're i.i.d., they have
the same mean and variance,
if the mean and variance exist, and
we'll assume they do.
So the mean, we'll just call it Mu.
And the variance, sigma squared.
So we're assuming that
these are finite for now.
The mean and variance exist.
And both of these theorems tell us
what happens to the sample
mean as n gets large.
So, the sample mean is
just defined as Xn bar.
Standard notation in statistics
is to put a bar to denote an average, and
that's just the average of the first n.
So to take the first n random variables,
and average them, so
that's just called the sample mean.
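Written out, the setup so far looks like this. The index j and the summation are just the usual way to write "average of the first n"; nothing beyond what was said above is being assumed.

```latex
X_1, X_2, \dots \text{ i.i.d.}, \qquad E(X_j) = \mu, \qquad \operatorname{Var}(X_j) = \sigma^2 < \infty,
\qquad
\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j .
```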
So the question is, what can we
say about Xn bar as n gets large?
So the way we would interpret this or
use this is we get to observe
these Xs. They're random variables, but
after we observe them they become data.
We're never going to have
an infinite amount of data so
at some point we stop it at n.
We can think of that as the sample size
and hopefully we get a large sample size.
Of course, it depends on the problem.
Some problems,
you may not be able to get large n.
Well, we assume n is large, and
just take the average.
The question is just, what can we say?
All right, so first,
here's what the law of large numbers says.
It's a very simple statement.
And hopefully pretty intuitive, too.
Law of Large Numbers says that Xn bar
converges to mu, as n goes to infinity.
With probability 1.
That's the fine print, probability 1.
With probability 0,
something really crazy could happen.
But we don't worry too much about it,
because it has probability 0.
With probability 1,
this is the sample mean, and
it says that the sample mean
converges to the true mean.
So, that is a pretty nice,
intuitive, easy to remember result.
That is,
by true I mean the theoretical mean,
the expected value of Xj, which
is the same for any j.
Whereas this, is a random variable.
Right?
We're taking an average of
random variables.
That's a random variable.
So this is just a constant but
this is a random variable.
But it's gonna converge and
I should say a little bit more,
what does this convergence
statement actually mean?
You've all seen limit of sequences, but
when we are talking about limits of random
variables we have to be
a little more careful.
How do we actually define this?
The definition of this statement
is just pointwise, which means,
remember Xn bar is a random variable.
A random variable, mathematically
speaking, is a function.
So if you evaluate this at some
specific outcome of the experiment,
then you'll get a sequence of numbers.
That is, once you actually observe the
values, this kind of crystallizes into
numbers when you evaluate it at
the outcome of the experiment.
And so those numbers converge to mu.
In other words, this is an event.
Either these random variables converge or
they don't.
And we say that event has probability 1.
That's what the statement
of the theorem is.
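In symbols, the statement can be sketched as follows. The letters s (a single outcome of the experiment) and S (the set of all outcomes) are notation introduced here for illustration.

```latex
P\bigl(\{\, s \in S : \lim_{n \to \infty} \bar{X}_n(s) = \mu \,\}\bigr) = 1 .
```

For each fixed s, the values of Xn bar at s form an ordinary sequence of numbers, so the limit inside the braces is the usual calculus limit.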
So to just give a simple example.
Let's think about what happens
if we have Bernoulli p.
So if Xj is Bernoulli p,
then intuitively we're
just imagining an infinite
sequence of coin tosses,
where the probability of heads is p. And
then this says that if we add up
all of these Bernoullis up to n,
that's just, in the first n coin flips,
how many times did the coin land heads.
That, divided by the number of flips, should
converge to p with probability 1.
So for example, so
this is a very intuitive statement.
If it's a fair coin and
you flip the coin a million times, well,
you're not really expecting that it
will be exactly 500,000 heads and 500,000 tails.
But you do think that,
in the long run, it should be the case
that it's going to be essentially
half heads, half tails.
Not exactly, but essentially.
And the proportion should get closer and
closer to the true value.
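Here is a small simulation sketch of that coin-flipping picture. The function name, the seed, and the number of flips are all choices made here for illustration; it only shows one simulated sequence, so it illustrates the statement rather than proving anything.

```python
import random

def running_proportion(p, n, seed=0):
    """Flip a coin with P(heads) = p a total of n times and
    return the proportion of heads after each flip."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n + 1):
        heads += rng.random() < p      # True counts as 1, False as 0
        proportions.append(heads / i)
    return proportions

props = running_proportion(p=0.5, n=100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # proportion of heads among the first n flips
    print(n, props[n - 1])
```

The early proportions can wander quite a bit; the later ones should settle near p.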
This qualification, with probability 1, is
needed because mathematically speaking
even if you have a fair coin,
there's nothing in the math that says
it's impossible that the coin would land
heads, heads, heads, heads, heads forever.
You know that that's never
actually gonna happen in reality.
It's just not gonna happen.
It's a fair coin.
It might land heads, heads, heads for
a time if you're very lucky or
unlucky or whatever.
But it's not gonna be heads,
heads, heads forever.
But there's nothing in the math that
says that's an invalid sequence.
So there's some weird
pathological cases like that.
But with probability one,
we get what we expect.
If we didn't have this result,
how would we ever even estimate p?
You might imagine if you
didn't know what p was,
kind of the obvious thing to do is
flip the coin a lot of times and
take the proportion of heads and
use that as your approximation for p.
But what justification could you have for
doing that approximation
if you didn't have this.
So this is a very, very necessary result.
But I guess to comment a little bit
more about what does it actually say for
the coin, because this is kind of
related to gambler's fallacy, and
things like that.
The gambler's fallacy is the idea
that, let's say you're gambling and
you lose like ten times in a row, and then
there's the feeling that you're due to win.
You lost all these times, and
you might try to justify that using the law
of large numbers and say, well, you
know, the coin, let's say,
heads you win money, tails you lose money,
you just lost money ten times in a row.
But the law of large numbers says,
in the long run,
it's gonna go back to
one-half if it's fair.
So somehow you need to start
winning a lot to compensate.
That's not the way it works.
The coin is memoryless.
The coin does not care how many failures
or how many losses you had before.
So the way it works is not that if
you're unlucky at the beginning,
somehow it gets offset later
by an increase in heads.
The way it works is through
what we might call swamping.
And let's say the coin landed
tails 100 times in a row.
It doesn't mean that the probability
has changed for the 101st flip.
What it means though, is that we're
letting n go to infinity here, okay?
So no matter how unlucky you
were in the first 100 or
the first million trials, that's
nothing compared to infinity, right?
So those first million just get swamped
out by the entire infinite future,
so that's what's going on here.
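A quick back-of-the-envelope version of swamping, as a sketch. Suppose the first 100 flips of a fair coin were all tails, and let H be the number of heads in the remaining n - 100 flips (H is notation introduced here). Since the later flips are still fair, the expected proportion of heads after n flips is:

```latex
E\left(\frac{0 + H}{n}\right) = \frac{(n - 100)/2}{n} = \frac{1}{2} - \frac{50}{n} \longrightarrow \frac{1}{2} \quad \text{as } n \to \infty .
```

The deficit of 50/n is never paid back by extra heads; it just gets diluted as n grows.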
Yeah, so
to tell you one little story about the law
of large numbers,
a colleague of mine told me this story.
He had a student once who
said he hated statistics.
And of course,
my colleague was very shocked,
like how can anyone hate statistics?
And so he asked, why?
How is it possible that
you hate statistics?
The student was an athlete,
and he was training every day, and
he had just learned
the law of large numbers.
And he was very, very depressed by this
because he said, the law of large numbers
says in the long run, I'm gonna only
be average and I can't improve.
So well, of course the fallacy there is that
we assumed iid right now.
Now there are generalizations
of this theorem beyond iid, but
we can't just get rid of iid.
So the iid is saying that the distribution
is not changing with time.
That doesn't mean that you can't actually
improve. If you improve your own
distribution, then it would not be iid.
So don't be depressed by this,
and in fact this theorem
I think is crucial in order for
science to actually be possible.
Because if you
imagine a hypothetical
counterfactual world where this
theorem was actually false,
it would be really depressing to try
to ever learn about the world, right?
Cuz this is saying,
you're collecting more and more data.
You're letting your sample
size go to infinity.
And this says,
you converge to the truth, right?
And it would be some weird setting, where
you get more and more data, and more and
more data, and yet you're not able
to converge to the truth, right?
So that would be really bad.
So this is very intuitive, very important.
Okay, so let's prove this,
or at least a similar version.
So this is actually sometimes called
the strong law of large numbers.
And we're actually gonna
prove what's sometimes called
the weak law of large numbers.
I don't really like the terminology
strong and weak here, but
that's kind of a standard.
Strong law of large numbers
is what I just said,
where it's converging
point-wise with probability 1.
That is, these random variables
converge to this constant,
except on some bad event
that has probability 0.
The weak law of large
numbers says that for
any c greater than 0,
the probability that the absolute value of
Xn bar minus the mean is greater than c
goes to 0 as n goes to infinity.
So it's a very similar looking statement.
It's not exactly equivalent.
It's possible to show, though you have to
go through some real analysis for
this that is not necessary for
our purposes,
that once you've proven the strong law,
it implies
this weaker form of convergence.
This is called convergence in probability,
but
the intuition is very similar.
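For reference, the weak law written as a formula. The absolute value matters: the statement is about the distance between the sample mean and mu, in either direction.

```latex
\lim_{n \to \infty} P\bigl( \lvert \bar{X}_n - \mu \rvert > c \bigr) = 0 \quad \text{for every } c > 0 .
```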
So just to interpret this statement
in words, we
should interpret c as
being some small number.
So let's say we choose c to be 0.001, okay?
And then it says that this probability
goes to 0
as n goes to infinity.
So this says that if n is large enough,
then
it's extremely unlikely that
these are more than 0.001 apart.
In other words, if n is large,
it's extremely likely that this is
extremely close to this, right?
So it's a very similar statement,
n is large,
it's extremely likely that the sample
mean is very close to the true mean.
Okay, so that's what it says.
So we'll prove the weak law,
because proving the strong law takes
a lot of work and a lot of time.
The weak law
looks like a nice theorem,
and it is a nice theorem, but
we can prove it very easily
using Chebyshev's inequality.
Okay, so
let's prove the weak law of large numbers.
So all we need to do is show
that this goes to 0, right?
That's what the statement is.
So let's just bound it using, this looks
pretty similar to what we were doing last
time, where we did Markov's inequality,
Chebyshev's inequality.
This looks similar to that
kind of stuff from last time,
which is why I did that, well,
one reason for doing that last time.
We need the inequalities anyway,
but it's especially useful here.
So we just need to show
this thing goes to 0.
Take the probability that the absolute value of
Xn bar minus mu is greater than c.
By Chebyshev's inequality,
this is less than or
equal to the variance of Xn
bar divided by c squared.
That's just exactly
Chebyshev from last time.
Now we just need the variance of Xn bar.
Well, just stare at the definition
of Xn bar for a second.
There's a 1 over n in front,
that comes out as 1 over n squared.
And then since I'm assuming
they're iid, and hence independent,
the variance of the sum is just n
times the variance of one term.
So the variance of Xn bar is n sigma squared
over n squared, which is sigma squared over n.
Divide by c squared, and the bound
is sigma squared over n c squared.
Sigma is a constant, c is a constant,
n goes to infinity, so this goes to 0.
So that proves the weak law of large
numbers, just a one-line thing.
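The one-line proof, written out as a chain. Each step uses only what was said above: Chebyshev's inequality, the 1 over n coming out of the variance squared, and independence.

```latex
P\bigl( \lvert \bar{X}_n - \mu \rvert > c \bigr)
\le \frac{\operatorname{Var}(\bar{X}_n)}{c^2}
= \frac{\tfrac{1}{n^2} \cdot n \sigma^2}{c^2}
= \frac{\sigma^2}{n c^2}
\longrightarrow 0 \quad \text{as } n \to \infty .
```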
Okay, so that tells us what happens
point-wise when we average a bunch
of iid random variables, and
it converges to the mean.
So let me just rewrite that statement.
Then we'll write the central limit
theorem and kind of compare them.
So another way to write
what we just showed
is that Xn bar minus mu
goes to 0 as n goes to
infinity, which is a good thing to know.
However, it doesn't tell us what
the distribution of Xn bar looks like.
So this is true with probability one,
but what is the distribution?
What does the distribution
of Xn bar look like?
So this says it's getting closer,
Xn bar is getting closer and
closer to this constant mu.
Okay, but that's not really
telling us the shape, and
it's not really telling us the rate.
This goes to 0, but at what rate?
So one way to think about problems like
that, when you have something going to 0,
and you wanna study
how fast it goes to 0,
and this is not just here, but
a general approach
to that kind of problem:
we know this goes to 0, but
we don't know how fast.
One way to study that would be to multiply it
by something that goes to infinity, right?
Now, if we multiply it by
something that goes to infinity,
such that this times
this goes to infinity.
Then we know that this part that blows
up is dominating over this part.
And if we multiply by something
that goes to infinity, but
this whole thing still goes to 0,
then that's more informative, right?
So what's gonna happen is that we
can imagine multiplying here by
n to some power,
fill in the blank.
What we're gonna show is that
if the power here is above some
threshold, n to that big power
is gonna go to infinity so fast that
the whole thing will just blow up.
And if we put a smaller power than the
threshold here, then this is still going
to infinity, as long as it's a positive
power of n, but
this other part's going to 0,
and that part's dominating, right?
So this term is competing with this term.
This one goes to infinity,
this one goes to 0, okay?
So then the question is what's
that magic threshold value?
And the answer is one-half.
So that's what we're
gonna study right now.
So we're gonna take the square
root of n times the quantity Xn bar minus mu.
This is kind of the happy medium,
where we're gonna get a non-degenerate
distribution, that this is gonna converge
in distribution to an actual distribution,
it's not gonna just get killed to 0 or
blow up to infinity, it's actually
gonna give us a nice distribution.
Okay, and I'm also gonna divide by the
sigma here, makes it a little bit cleaner.
So this is the central limit theorem now.
I'm stating it, then we'll prove it.
Central limit theorem says,
if you take this and
look at what happens
as n goes to infinity,
it converges to the standard
normal in distribution.
By convergence in
distribution, what we mean is that
the distribution of this converges
to the standard normal distribution.
In other words, you could take the CDF.
I mean these may be discrete or continuous
or a mixture of discrete and continuous.
So it doesn't necessarily have a PDF,
but every random variable has a CDF.
So it says if you take the CDF of this,
it's gonna converge to capital Phi,
the standard normal CDF.
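The central limit theorem as a formula, with the CDF version underneath. The letter z for the point where the CDF is evaluated is notation introduced here.

```latex
\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1),
\qquad \text{that is,} \qquad
\lim_{n \to \infty} P\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le z \right) = \Phi(z) \quad \text{for every real } z .
```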
So I think this is kind of an amazing
result that this holds in such generality,
right, because I mean the normal is just
this one, standard normal is just this
one particular, it's a nice looking bell
curve, but that's just one distribution.
And those x's they could be discrete,
they could be continuous,
they could be extremely nasty
looking distributions, right?
It could look like anything,
the only thing we assumed was
that there was a finite variance.
Other than that,
they could have an incredibly complicated,
messy distribution.
But it's always gonna
go to standard normal.
So this is one of the reasons why
the standard normal distribution is so
important and so
widely used. This is a theorem about
what happens as n goes to infinity,
but the way it's used in practice is that
people use normal approximations all the
time, and a lot of the justification for
normal approximations is coming from this,
because this says that if n is large,
then the sample mean will approximately
have a normal distribution.
Even if the original data did not look
like they came from a normal distribution,
when you average lots and
lots of them, it looks normal, okay.
So this in a sense is a better
theorem than the law of large numbers,
because it's more
informative to know the distribution and
know something about the rate. And
it's interesting that
square root of n is the power
of n that's just right, right?
A larger power it's gonna blow up,
a smaller power it's gonna go to 0.
N to the one-half is the compromise,
then you always get a normal distribution.
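A rough simulation sketch of why one-half is the power that's just right. It measures the spread of n to a power times (Xn bar minus mu) for a few powers and a few values of n. The Uniform(0, 1) distribution, the function name, the seed, and the number of replications are all arbitrary choices made here for illustration.

```python
import random
import statistics

def scaled_spread(power, n, reps=1000, seed=1):
    """Sample standard deviation of n**power * (Xbar_n - mu),
    where Xbar_n averages n Uniform(0, 1) draws (so mu = 0.5)."""
    rng = random.Random(seed)
    mu = 0.5
    values = []
    for _ in range(reps):
        xbar = sum(rng.random() for _ in range(n)) / n
        values.append(n ** power * (xbar - mu))
    return statistics.stdev(values)

results = {}
for power in (0.25, 0.5, 0.75):
    # one spread per sample size n = 10, 100, 1000
    results[power] = [scaled_spread(power, n) for n in (10, 100, 1000)]
    print(power, [round(s, 3) for s in results[power]])
```

With a power below one-half the spread should shrink as n grows, with a power above one-half it should grow, and at one-half it should stay roughly steady near sigma, which is the square root of 1/12 for this uniform.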
It's more informative in some sense, but
you should also keep in mind,
it is a different sense of convergence.
Up here, we're talking about the random
variables actually converging.
Literally, the random variable, the
sample mean, converges
point-wise, with
probability 1, to the true mean.
Here, we're talking about
convergence in distribution.
So we're not talking about
convergence of random variables.
We're just saying the distribution of this
converges to the normal 0, 1 distribution.
So that's a different sense
of convergence, but anyway,
both of them are telling us what's gonna
happen to Xn bar when n is large, okay?
So well, let's prove this theorem.
Here's another way to write this,
by the way,
it's good to be familiar with both ways.
It's just algebra to go
from one to the other, but
they're both useful enough
to be worth mentioning.
Let's just write the central limit
theorem in terms of the sum of X's
rather than in terms of the sample mean.
So I'm just gonna take the sum of Xj,
j equals 1 to n.
And so, we can
either think of the central limit theorem
as telling us what
happens to the sample mean, or we
can think of it as telling us what happens
to the sum, or the convolution, okay?
It's equivalent because
they just differ by a factor of n.
We just have to be careful not
to mess up the factor of n,
but we can go from one to the other.
So the claim is that this is
approximately normal when n is large,
but if we just have this thing,
this could easily just blow up.
You're just adding more and more terms.
But somehow we wanna
standardize this first.
So if we take this thing,
because this thing has mean
n mu, right, so let's subtract n mu.
Because then it has zero mean.
I wanna make the mean 0 and
the variance 1, so
that it matches up with that,
rather than just letting it blow up.
So this is called centering.
By linearity,
the mean is n mu, so
we just subtract n mu.
And then let's divide by
the standard deviation,
this is just how we did
standardization before.
So over there we showed that the variance
of Xn bar is sigma squared over n.
And the variance of this sum
is just n sigma squared.
So let's just divide by
the standard deviation, right,
which is square root of n Times sigma,
okay?
Cuz the variance is n sigma squared.
So that's just the standardized version.
And the statement is again that this
converges to the standard normal
in distribution.
So if we take this sum and standardize it,
then it's gonna go to standard normal.
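The sum version next to the sample-mean version. The middle equality is just the algebra of pulling out the factor of n.

```latex
\frac{\sum_{j=1}^{n} X_j - n\mu}{\sqrt{n}\,\sigma}
= \frac{n\,(\bar{X}_n - \mu)}{\sqrt{n}\,\sigma}
= \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}
\;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1) .
```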
Okay, so, all right, so
now we're ready to prove this theorem.
And it's sort of just a calculation,
but it's kind of a nice
calculation in some ways.
This theorem is always true as
long as the variance exists.
We don't need to assume that the third
moment or the fourth moment exists.
But the proof is much more complicated
to do in that generality.
So we're gonna assume that the MGF exists,
then we can actually work with the MGFs.
Because when you see this thing,
sum of independent random variables,
then we know the MGF is gonna be
something useful if it exists.
And there's ways to extend this proof
to cases where the MGF doesn't exist.
But for our purposes,
we may as well just assume MGF exists.
So assume the MGF of Xj exists; let's call it M(t).
They're iid, so if one of them
has an MGF, they all have the same MGF.
We'll just assume that that exists.
Once we have MGFs, then our strategy
is to show that the MGFs converge.
So that's a theorem about MGFs, that
if the MGFs converge to some other MGF,
then the random variables
converge in distribution, right?
We had a homework problem related to that,
where you found that the MGFs converged
to some MGF, and that implies
convergence of the distributions, right?
Okay, so that's the whole strategy.
So that means all we need to
do is find the MGF of this and
then take the limit, okay?
So basically at this point,
it's just like, write down the MGF,
take the limit, and
use a few facts about MGFs, okay?
So first of all, we can
just assume mu = 0 and
sigma = 1, just to simplify the notation.
This is without loss of generality,
and here's why.
I wrote the standardized thing this way,
but
I could've just written it as
standardizing each X separately.
I could've written Xj- mu over sigma.
So this would be standardizing each
of them separately, j = 1 to n, and
then we have a 1 over root n.
That will be the same thing
that we're looking at.
This just says standardize
them separately first.
But then you could just, I mean if
you want, just call this thing Yj.
And once you have the central limit theorem
for Yj, then you know that it's true for Xj.
So you might as well just assume that
they've already been standardized.
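The without-loss-of-generality step as a formula, using the Yj named above. It shows that standardizing each term first gives exactly the same quantity as standardizing the sum.

```latex
Y_j = \frac{X_j - \mu}{\sigma}, \qquad E(Y_j) = 0, \qquad \operatorname{Var}(Y_j) = 1,
\qquad
\frac{1}{\sqrt{n}} \sum_{j=1}^{n} Y_j = \frac{\sum_{j=1}^{n} X_j - n\mu}{\sqrt{n}\,\sigma} .
```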
And so just to have some notation,
let's just let Sn equal the sum,
S for sum, of the first n terms.
And what we wanna show is that the MGF
of Sn over root n,
that's what we're looking at, right?
We let mu equal zero, sigma equal one,
so we're looking at Sn over root n.
And we wanna show that that goes
to the standard normal MGF.
Right, so we just need to find this MGF,
take a limit.
Okay, so let's just find the MGF.
So by definition, that's the expected
value of e to the t times Sn over root n.
And Sn is just the sum.
And we're assuming independence,
which means that you can
write this as e to the t X1 over root n, e
to the t X2 over root n, blah, blah, blah.
All of those factors are independent,
therefore, they're uncorrelated.
So we can just split it up as a product:
E of e to the t X1 over root n,
blah, blah, blah, same thing,
E of e to the t Xj over root n
is the general term, right?
I'm just using the fact that
those are uncorrelated, so
we can write E of the product as
the product of the expectations.
But since these X's are iid,
these are really just the same
thing written, n times.
So really,
this is just this thing to the nth power.
And this thing,
that should remind you of an MGF, right?
That's just the MGF of X1,
except that instead of evaluated at t,
it's evaluated at t over root n.
So really, that's just the MGF,
evaluated at t over root n
raised to the nth power.
So that's what we have.
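That computation in symbols, with mu = 0 and sigma = 1 as assumed above. The second step uses independence and the third uses identical distributions.

```latex
E\!\left( e^{t S_n / \sqrt{n}} \right)
= E\!\left( \prod_{j=1}^{n} e^{t X_j / \sqrt{n}} \right)
= \prod_{j=1}^{n} E\!\left( e^{t X_j / \sqrt{n}} \right)
= \left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^{n} .
```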
Now we need to take the limit
as n goes to infinity.
So let's just look at what's gonna
happen here, n is going to infinity.
This thing on the inside becomes M of 0.
M of 0 is 1 for any MGF, right?
Cuz e to the 0 is 1.
So this is of the form 1 to the infinity,
which is an indeterminate form, right?
It could evaluate to anything.
So going back to calculus,
how do you deal with 1 to the infinity,
or 0 over 0, or whatever.
Usually we try to reduce it to something
where we can use L'Hopital's Rule for
those problems, right?
Or we can use a Taylor
series type of thing.
So, how do we get into that form?
Take the log,
because this looks like 1 to infinity.
If we take the log,
it'll look like infinity times log of 1.
So it'll look like infinity times 0,
take logs.
Then we just have to remember to
exponentiate at the end to undo the log.
Okay, so
let's write down then what we have.
After taking the log, and
we're trying to do a limit, so
we're doing the limit as n goes
to infinity, and we take the log.
It's n log M(t over root n).
So that's of the form infinity times 0.
If we want 0 over 0 or
infinity over infinity,
we can just write it as 1
over n in the denominator.
Okay, and now it's of the form 0 over 0.
So we can almost use L'Hopital's Rule,
but not quite.
We have to be a little bit careful.
Because first of all,
I'm assuming n is an integer,
and you can't do calculus on integers.
Secondly, even if we
pretended that n is a real number,
then the derivative of 1 over n would
be -1 over n squared, and
that's kind of annoying to deal with.
And it's kind of annoying to
deal with this square root here.
So let's first make a change of variables.
Let's just let y = 1 over root n, and
also let y be real,
not necessarily of the form 1 over
square root of an integer, okay?
So it's the same limit, just written
in terms of y instead of in terms of n.
So as n goes to infinity y goes to 0, and
1 over n is y squared,
so the denominator is just y squared.
The reason I do it this way is
that 1 over root n is just y
by definition, so
then the numerator is just log M of yt.
because we got rid of the square roots.
So it's still of the form 0 over 0.
So we're gonna use L'Hospital's Rule.
So limit, y goes to 0.
Take the derivative of the numerator and
the denominator separately.
The derivative of the denominator is 2y.
The derivative of the numerator,
well we're just going to
have to use the chain rule.
Derivative of log of something
is 1 over that thing,
so that's 1 over M of yt, times the derivative
of that thing, which again
by the chain rule is M prime of
yt times the derivative of yt.
We're treating t as constant,
we're differentiating with respect to y.
So t comes out.
And now let's see what we have.
Let's just summarize
a couple facts about MGFs.
So M of t is the expected
value of e to the tX1.
So M of 0 = 1, okay?
And when we first started doing MGFs we
said that if we take derivatives of the MGF
and evaluate them at 0,
we get the moments; that is why it's
called the moment generating function.
So the first derivative at 0 is the mean,
but we assumed that mu is 0.
So this is 0 here.
And the second derivative,
while we're doing this:
the second derivative at 0 is the second moment,
but since we assumed that the variance is
1 and the mean is 0,
the second moment is 1, okay?
So over here, as we let y go to 0,
denominator's still going to 0.
Numerator's also going to 0,
because M prime of 0 is 0,
so it's still of the form 0 over 0, so
let's just do what we were told again.
So first I can simplify it a little bit,
this t can come out,
because that's acting as a constant,
and the 2 can come out.
And limit y goes to 0 and
this M of yt part,
that's just going to 1.
So we can write that as part
of a separate limit, but
that other limit is just going to 1.
You can think of it as just
the limit of this part times
the limit of the rest of it.
But that part's just going to 1,
so we can get rid of that.
So really, what's left is just t over 2 times
the limit of M prime of yt divided by y.
Everything else is gone, so
it's actually pretty nicely simplified.
Now, using L'Hospital's Rule
a second time,
now the derivative of
the denominator is just 1, okay?
And for the numerator,
chain rule, M double prime of yt.
That was a t not a t squared,
but now it's a t squared,
because by the chain rule, derivative of
yt is t, so we have a t squared over 2.
Now when we let y go to 0,
M double prime of 0 is 1, so
now this limit is just 1.
So we get t squared over 2,
and that's what we wanted,
because t squared over 2 is the log
of e to the t squared over 2, and
e to the t squared over 2 is
exactly the normal 0,1 MGF.
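The whole limit calculation collected in one place, as a summary of the steps above. It uses y = 1 over root n, L'Hospital's Rule twice, and the facts M(0) = 1, M'(0) = 0, M''(0) = 1.

```latex
\lim_{n \to \infty} n \log M\!\left( \frac{t}{\sqrt{n}} \right)
= \lim_{y \to 0} \frac{\log M(yt)}{y^2}
= \lim_{y \to 0} \frac{t\, M'(yt)}{2y\, M(yt)}
= \frac{t}{2} \lim_{y \to 0} \frac{M'(yt)}{y}
= \frac{t^2}{2} \lim_{y \to 0} M''(yt)
= \frac{t^2}{2},
\qquad \text{so} \qquad
\left[ M\!\left( \frac{t}{\sqrt{n}} \right) \right]^{n} \longrightarrow e^{t^2/2} .
```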
Okay, so
that's the end of
the proof of the central limit theorem.
All we had to do was use basic facts
about MGFs and L'Hospital's Rule twice.
And there we have one of the most famous and
important theorems in statistics.
Now so
there are more general versions of this,
like you can extend this in various
ways where it's not iid,
but it still has to satisfy
some assumptions, right.
But anyway,
this is the basic central limit theorem.
Okay, so that's pretty good.
Let's do an example,
like how do we actually use this,
for the sake of approximations,
things like that.
Last time I was talking about
the difference between inequalities and
approximations, right?
And we talked about Poisson
approximation before.
We haven't really talked
about normal approximation.
This result is giving us the ability
to use normal approximations
when we're studying the sample mean and
n is large, okay?
So historically, though,
the first version of
the central limit theorem
that was ever proven,
I think was for binomials, okay?
So what we're saying is that
Binomial(n, p), under some conditions,
will be approximately normal.
And in the old days that was an
incredibly important fact, because
they didn't have computers to
deal with binomials, with things
like n choose k where n is large, and
you have all these factorials.
You can't do these things by hand.
Now we have fast computers,
so it's a little bit better.
But it's still a lot easier working
with normal distributions than
binomial distributions most of the time,
right?
And even now factorials still grow so
fast that even with
a fast computer with large memory and
everything, you may quickly
exceed its ability when you're doing
some big complicated binomial problem.
And normals have a lot of nice properties,
as we've seen, okay?
The question is,
when can we approximate a binomial
using a normal, and
how do we do that, okay?
So this is the normal approximation
to the binomial,
that is, a binomial approximated by a normal.
When is that valid?
To contrast it with
the Poisson approximation,
that we've seen before, okay?
So let X be Binomial(n, p).
And as we've done many times before,
we can represent X as
a sum of iid Bernoullis.
Right?
Each Xj is just 1 if the jth trial
is a success, 0 otherwise, so
the Xj are iid Bernoulli p.
So this does fit into
the framework of the central limit
theorem that is we are adding
up iid random variables.
So the central limit theorem says that,
if n is large, this will be
approximately normal, at least after
we have standardized it, okay?
So suppose
we're
interested in the probability
that X is between a and b.
And I want to approximate that.
First we'll write an equality, and then
we'll approximate it.
So, I mean if you had to do this on
a computer what you would do or by hand,
which you wouldn't want to,
would be to take the PMF and
sum up all the values of
the PMF from A to B, right.
So okay, you would not want to do
that by hand most of the time.
But suppose we just want an approximation
for this, not the exact thing.
So first, the strategy is just gonna
be to take X and standardize it.
So we're gonna subtract the mean,
and we know that the mean is np,
and we're gonna divide by
the standard deviation,
which we know is the square root of npq, where
q is 1 minus p.
So, I'm just standardizing it right now.
So this is still equal,
we haven't done any approximations yet.
And then, now that we've standardized it,
we can apply the central limit theorem,
if N is large enough, right?
The central limit
theorem says n goes to infinity, but
that doesn't answer the question
of how large n has to be.
And for that, there are various theorems and
various rules of thumb.
A lot of books will ask,
how large does n have to be?
And some books at least will say 30,
and that's just a rule of thumb.
That's not always gonna work.
There are separate rules of thumb for
the binomial, like you want n
times p to be reasonably large and
n times 1 minus p to be large.
There are different rules of thumb.
But anyway, if N is large enough,
then what we've just proven is that
this is gonna look like it has
a normal distribution because
that's a sum of IID things.
And we standardized it correctly, because
we already knew the mean and the variance,
so we just standardized it.
Okay, so this is approximately.
Now we're going to use
the normal approximation,
we're going to say this
is approximately normal.
And if I want the probability that
the normal is between something and
something, that's just the CDF
here minus the CDF here, right?
I mean, this is discrete, but we're
approximating using something continuous,
so we just say, integrate the normal PDF
from here to here.
By the fundamental theorem of calculus,
that's just the difference of the CDF
at the two endpoints, okay.
So we're just gonna do
Phi((B - NP) / sqrt(NPQ)) minus
Phi((A - NP) / sqrt(NPQ)).
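
As a rough numerical check of the formula just stated, here is a minimal
Python sketch that sums the binomial PMF from A to B and compares it with
the difference of normal CDF values. The helper names and the values
n = 100, p = 1/2, a = 45, b = 55 are illustrative assumptions, not from
the lecture.

```python
import math


def binom_pmf(k, n, p):
    """Exact Binomial(n, p) PMF at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exact_prob(a, b, n, p):
    """P(a <= X <= b): sum the PMF over every integer from a to b."""
    return sum(binom_pmf(k, n, p) for k in range(a, b + 1))


def normal_approx(a, b, n, p):
    """Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq)), no correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return std_normal_cdf((b - mean) / sd) - std_normal_cdf((a - mean) / sd)


# Illustrative values only (assumed, not taken from the lecture).
n, p, a, b = 100, 0.5, 45, 55
exact = exact_prob(a, b, n, p)
approx = normal_approx(a, b, n, p)
print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```

With these values the exact answer is about 0.73 and the basic approximation
is about 0.68, so the uncorrected version lands in the right neighborhood
but leaves room for improvement.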
So that would be the basic
normal approximation,
I'll talk a little bit about how
to improve this approximation.
But to contrast it with
the Poisson approximation.
We talked before about the fact,
and we proved it, that
if N goes to infinity and P goes to 0
with N times P held fixed,
then the binomial distribution
converges to the Poisson distribution.
We proved that before.
So for the Poisson approximation,
what we had was
N is large but P is very small, right?
And we let lambda equal NP,
and lambda is moderate.
And most important thing is that
P is small here, P is close to 0.
We proved it in the case where this goes
to infinity and this goes to 0, okay?
So Poisson is relevant when we're
dealing with a large number of very
rare unlikely things.
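
To make the Poisson regime concrete, here is a small Python sketch that
compares the Binomial(n, p) PMF with the Poisson(np) PMF when n is large
and p is small. The values n = 1000 and p = 0.003, and the cutoff of
k <= 20, are assumptions picked for illustration.

```python
import math


def binom_pmf(k, n, p):
    """Exact Binomial(n, p) PMF at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson(lam) PMF at k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Illustrative values only: many trials, each very unlikely to succeed.
n, p = 1000, 0.003
lam = n * p  # lambda = np, about 3

# Largest pointwise gap between the two PMFs over k = 0, ..., 20.
max_diff = max(
    abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21)
)

for k in range(7):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))
print("max pointwise difference:", max_diff)
```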
That's really in contrast to
the normal approximation.
There, we still want N to be large,
but
if you kind of think intuitively
about when is this gonna work well,
we actually want P to
be close to one half.
Because think about the symmetry, if you
have a binomial of P equals one half,
that's a symmetric distribution.
The normal is symmetric, no matter,
every normal distribution is symmetric.
If P is far from one half, then the
binomial is very, very skewed, and in that
case it kind of doesn't make that much
sense to approximate using a normal.
So the normal approximation is gonna
work well when P is close to one half.
If P is very small, the Poisson
approximation makes a lot more sense
than the normal approximation.
However, think about the statement
of the central limit theorem.
In that theorem I never said
P was close to one half,
in fact that was just a general theorem,
we didn't even have P in the statement
of the central limit theorem, but
somehow this still has to eventually work.
But as a practical matter
as an approximation,
if P is close to one half this
is going to work quite well,
if N is like 30 or 50 or
100, it will work fine.
But if P is .001,
the central limit theorem is still true,
that as N goes to infinity
it's gonna work, okay.
But if N is kind of not
that enormous of a number,
then it's gonna be a pretty
bad approximation.
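
One way to see this contrast numerically is to measure the largest gap
between the binomial CDF and the approximating normal CDF at the integers.
This is only a sketch: the error measure, the helper names, and the choice
of n = 100 are assumptions for illustration, and no continuity correction
is applied.

```python
import math


def binom_pmf(k, n, p):
    """Exact Binomial(n, p) PMF at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_cdf_error(n, p):
    """Largest |P(X <= k) - Phi((k - np)/sqrt(npq))| over k = 0, ..., n."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)  # running exact CDF
        worst = max(worst, abs(cdf - std_normal_cdf((k - mean) / sd)))
    return worst


n = 100  # a moderate sample size, chosen for illustration
err_half = max_cdf_error(n, 0.5)     # symmetric case
err_small = max_cdf_error(n, 0.001)  # very skewed case
print("p = 0.5:   max CDF error =", round(err_half, 4))
print("p = 0.001: max CDF error =", round(err_small, 4))
```

With the same N, the error for P equal to .001 comes out more than ten times
the error for P equal to one half.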
And let's just try to reconcile these
statements though, is there a case?
If we let N go to infinity and
P be very small,
I still said, if N is going to infinity,
it's still gonna converge to
normal just much slower, right?
So, how could the binomial
look both normal and Poisson?
Well, the answer is that
the Poisson also looks normal.
So if you have a Poisson lambda
where lambda's very large,
that's also gonna look normal, so
there is a case where those come together.
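
The claim that a Poisson with large lambda looks normal can be checked
numerically. The sketch below compares the Poisson(lambda) CDF with a normal
CDF having mean lambda and variance lambda, for a small and a large lambda.
The values lambda = 1 and lambda = 100 and the truncation point are
assumptions for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson(lam) PMF at k, computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def max_cdf_error(lam):
    """Largest gap between the Poisson CDF and the matching normal CDF."""
    sd = math.sqrt(lam)  # a Poisson has mean lam and variance lam
    k_max = int(lam + 10 * sd) + 10  # far enough into the right tail
    cdf = 0.0
    worst = 0.0
    for k in range(k_max + 1):
        cdf += poisson_pmf(k, lam)
        worst = max(worst, abs(cdf - std_normal_cdf((k - lam) / sd)))
    return worst


err_small_lam = max_cdf_error(1.0)
err_large_lam = max_cdf_error(100.0)
print("lambda = 1:   max CDF error =", round(err_small_lam, 4))
print("lambda = 100: max CDF error =", round(err_large_lam, 4))
```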
Okay, one last thing about this
is that there is something kind
of weird about this in the sense
that we're approximating
a discrete distribution
using something continuous.
So take the same problem;
I just wanna add something to this.
Let's look at it to see more of
what could go wrong
with this.
What if we look at the case A equals B?
So then we're just asking for
the probability that x equals A,
that is, we're approximating
the binomial PMF.
And one kind of weird thing about this is,
the exact probability would change if
we changed these to strict inequalities,
but the normal approximation would not.
As soon as we say that this is
approximately normal, then we don't care
about that anymore.
So there's something called the continuity
correction which I just wanted to
briefly mention.
Which is an improvement to deal with
the fact that you're using something
continuous to approximate
something discrete.
And it's often not explained very well but
if you understand what
it does in this simple case,
then it's not hard to see the idea.
The idea is that if you just said this is
approximately normal then you would just
say zero, right?
Because it would be zero for continuous,
that's not very useful, right?
We want something more useful than zero.
So the idea is just
simply to write this as,
here let's assume A is
an integer; x is discrete. Well,
x equals A is the same thing
as saying that x is between
A minus one-half and A plus one-half.
Right?
So just use this first.
So for each value in this range,
replace it by an interval
of length 1 centered there,
that's exactly the same thing because x
is an integer anyway, so that's true.
But here at least we're giving it
an interval to work with instead of
just saying zero, so
that improves this approximation.
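
Here is a minimal Python sketch of the continuity correction as just
described: each integer in the range is replaced by an interval of length 1
centered there, so the endpoints move out by one-half before the normal CDF
is applied. The function names and the values n = 100, p = 1/2, a = 50 are
illustrative assumptions.

```python
import math


def binom_pmf(k, n, p):
    """Exact Binomial(n, p) PMF at k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def std_normal_cdf(z):
    """Phi(z), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def corrected_approx(a, b, n, p):
    """Normal approximation to P(a <= X <= b) with continuity correction.

    Uses P(a <= X <= b) = P(a - 1/2 <= X <= b + 1/2) for integer-valued X,
    then applies the normal CDF to the widened interval.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    upper = std_normal_cdf((b + 0.5 - mean) / sd)
    lower = std_normal_cdf((a - 0.5 - mean) / sd)
    return upper - lower


# The case A = B from the discussion: approximate a single PMF value.
n, p, a = 100, 0.5, 50
exact_point = binom_pmf(a, n, p)
approx_point = corrected_approx(a, a, n, p)
print(f"exact P(X = {a})     = {exact_point:.5f}")
print(f"corrected approx    = {approx_point:.5f}")
```

Without the correction the approximation to P(X = A) would be exactly zero;
with it, the two numbers here agree to about three decimal places.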
Anyway, it's just central limit theorem.
All right, so see you next time.
