The following content is
provided under a Creative
Commons license.
Your support will help MIT
OpenCourseWare continue to
offer high-quality educational
resources for free.
To make a donation or view
additional materials from
hundreds of MIT courses, visit
MIT OpenCourseWare at
ocw.mit.edu.
PROFESSOR: OK, let's
get started.
We'll start a minute early and
then maybe we can finish a
minute early if we're lucky.
Today we want to talk about
the laws of large numbers.
We want to talk about
convergence a little bit.
We will not really get into
the strong law of large
numbers, which we're going to
do later, because that's a
kind of a mysterious,
difficult topic.
And I wanted to put off really
talking about that until we
got to the point where we could
make some use of it,
which is not quite yet.
So first, I want to review
what we've done
just a little bit.
We've said that probability
models are very natural things
for real-world situations,
particularly those that are
repeatable.
And by repeatable, I mean they
use trials, which have
essentially the same
initial conditions.
They're essentially isolated
from each other.
When I say they're isolated from
each other, I mean there
isn't any apparent
contact that they
have with each other.
So for example, when you're
flipping coins, there's not
one very unusual coin,
and that's the coin
you use all the time.
And then you try to use
those results to be
typical of all coins.
They have a fixed set of possible
outcomes for
these multiple trials.
And they have essentially
random individual outcomes.
Now, you'll see there's a real
problem here when I use the
word "random" there and
"probability models" there
because there is something
inherently
circular in this argument.
It's something that always
happens when you get into
modeling where you're trying to
take the messy real-world
and turn it into a nice, clean,
mathematical model.
So that really, what we
all do, and we do this
instinctively, is after we
start getting used to a
particular model, we assume that
the real-world is like
that model.
If you don't think you
do that, think again.
Because I think everyone does.
So you really have that problem
of trying to figure
out what's wrong with models,
how to go to better models,
and we do this all the time.
OK, for any model, an extended
model-- in other words, an
extended mathematical model, for
a sequence or an n-tuple
of independent identically
distributed repetitions is
always well-defined
mathematically.
We haven't proven that,
it's not trivial.
But in fact, it's true.
Relative frequencies and
sample averages.
Relative frequencies apply to
events, and you can represent
events in terms of indicator
functions and then use
everything you know about
random variables
to deal with them.
Therefore, you can use
sample averages.
Sample averages, in this
extended model, essentially
become deterministic.
And that's what the laws of
large numbers say in various
different ways.
And beyond knowing that they
become deterministic, our
problem today is to decide
exactly what that means.
The laws of large numbers
specify what "become
deterministic" means.
They only operate within
the extended model.
In other words, laws of large
numbers don't apply to the
real world.
Well, we hope they apply to the
real world, but they only
apply to the real world when the
model is good because you
can only prove the laws of
large numbers within this
model domain.
Probability theory provides an
awful lot of consistency
checks and ways to avoid
experimentation.
In other words, I'm not claiming
here that you have to
do experimentation with a large
number of so-called
independent trials very often
because you have so many ways
of checking on things.
But every once in a while, you
have to do experimentation.
And when you do, somehow or
other you rely on this idea of a
large number of trials, either IID
trials or trials which are
somehow isolated from
each other.
And we will soon get to talk
about Markov models.
We will see that with Markov
models, you don't have the IID
property, but you have enough
independence over time that
you can still get these
sorts of results.
So anyway, the determinism in
this large number of trials
really underlies much of the
value of probability.
OK, in other words, you
don't need to use this
experimentation very often.
But when you do, you really need
it because that's what
you use to resolve conflicts,
and to settle on things, and
to have different people who are
all trying to understand
what's going on have some
idea of something
they can agree on.
OK, so that's enough for
probability models.
That's enough for philosophy.
We will come back to this
with little bits and
pieces now and then.
But at this point, we're really
going into talking
about the mathematical
models themselves.
OK, so let's talk about the
Markov bound, the Chebyshev
bound, and the Chernoff bound.
You should be reading the notes,
so I hope you know what
all these things are
so I can go through
them relatively quickly.
If you think that using these
lectures slides that I'm
passing out plus doing the
problems is sufficient for
understanding this
course, you're
really kidding yourself.
I mean, the course is based on
this text, which explains
things much more fully.
It still has errors in it.
It still has typos in it.
But a whole lot fewer than these
lecture slides, so you
should be reading them and then
using them to try to get a
better idea of what these
lectures mean, and using that
to get a better idea
of what the exercises
you're doing mean.
Doing the exercises does not
do you any good whatsoever.
The only thing that does you
some good is to do an exercise
and then think about what it
has to do with anything.
And if you don't do that second
part, then all you're
doing is you're building
a very,
very second rate computer.
Your abilities as a computer
are about the same for the
most part as the computer
in a coffee maker.
You are really not up to what
a TV set does anymore.
I mean, TV sets can do so much
computation that they're way
beyond your abilities
at this point.
So the only edge you have, the
only thing you can do to try
to make yourself worthwhile is
to understand these things
because computers cannot
do any of that
understanding at all.
So you're way ahead
of them there.
OK, so what is the
Markov bound?
What it says is, if Y is a
non-negative random variable--
in other words, it's a random
variable which only takes on
non-negative sample values--
and if it has an expectation,
expectation of Y,
then for any real y greater than
0, the probability that Y
is greater than or equal to
little y is at most the expected
value of Y divided by little y.
The proof of it is by picture.
If you don't like proofs by
pictures, you should get used
to it because we will prove a
great number of things by
pictures here.
And I claim that a proof by
picture is better than a proof
by algebra because if there's
anything wrong with it, you
can see from looking
at it what it is.
So we know that the expected
value of Y is the integral
under the complementary
distribution function.
This rectangle in here has area
y times the probability that
Y is greater than or equal to y.
The probability that the random
variable Y is greater than
or equal to the number,
little y, is just that
point right up there.
This is the point y.
This is the point probability
of capital Y
greater than little y.
It doesn't make any difference
when you're integrating
whether you use a greater than
or equal to sign or a
greater than sign.
If you have a discontinuity,
the integral is the same no
matter which way
you look at it.
So this area here is y times
the probability that random
variable Y is greater than
or equal to number y.
And all that the Markov bound
says is that this little
rectangle here is less than or
equal to the integral under
that curve.
That's a perfectly
rigorous proof.
We don't really care about
rigorous proofs here, anyway
since we're trying to get at
the issue of how you use
probability, but we don't want
to have proofs which mislead
you about things.
In other words, proofs
which aren't right.
So we try to be right, and I
want you to learn to be right.
But I don't want you to start
to worry too much about
looking like a mathematician
when you prove things.
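A quick numerical sketch may help make the Markov bound concrete. The exponential distribution below is just an illustrative choice, not something from the lecture:

```python
import random

random.seed(1)

# Draw samples of a non-negative random variable Y (exponential, mean 2).
samples = [random.expovariate(0.5) for _ in range(100_000)]
mean_y = sum(samples) / len(samples)

y = 5.0  # the threshold "little y"
# Empirical probability that Y >= y.
p_tail = sum(1 for s in samples if s >= y) / len(samples)
# Markov bound: P(Y >= y) <= E[Y] / y.
markov_bound = mean_y / y

print(p_tail, markov_bound)
```

The empirical tail probability should come out well under the bound, which illustrates how weak (but how general) the Markov inequality is.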
OK, the Chebyshev inequality.
If Z has a mean--
and when you say Z has a mean,
what you really mean is the
expected value of the absolute
value of Z is finite.
And it has a variance sigma
squared of Z. That's saying a
little more than just
having a mean.
Then, for any epsilon
greater than 0.
In other words, this bound
works for any epsilon.
The probability that the
absolute value of Z minus
the mean is greater than or
equal to epsilon--
in other words, that it's
further away from the mean by
more than epsilon-- is less than
or equal to the variance
of Z divided by epsilon
squared.
Again, this is a very weak
bound, but it's very general.
And therefore, it's
very useful.
And the proof is simplicity
itself.
You define a new random variable
y, which is Z minus
the expected value of
Z quantity squared.
The expected value of Y then is
the expected value of this,
which is just the
variance of Z.
So for any y greater than 0,
we'll use the Markov bound,
which says the probability
that random variable Y is
greater than or equal to number
little y is less than
or equal to sigma
Z squared over y.
Namely, the expected value of
random variable Y divided by
number y. That's just
the Markov bound.
And then, random variable Y is
greater than or equal to
number y if and only if the
positive square root of
capital Y-- we're dealing only
with positive non-negative
things here--
is greater than or equal to the
square root of number y.
And that's less than or equal
to sigma Z squared over y.
And the square root of Y is
just the magnitude
of Z minus Z bar.
So setting epsilon equal to the
square root of y yields the
Chebyshev bound.
Now, that's something which
I don't believe
in memorizing proofs.
I think that's a
terrible idea.
But that's something so simple
and so often used that you
just ought to think
in those terms.
You ought to be able to see
that diagram of the Markov
inequality.
You ought to be able to
see why it is true.
And you ought to understand
it well enough
that you can use it.
In other words, there's a big
difference in mathematics
between knowing what a theorem
says and knowing that the
theorem is true, and
really having a gut
feeling for that theorem.
I mean, you know this when
you deal with numbers.
You know it when you deal
with integration or
differentiation, or any of
those things you've known
about for a long time.
There's a big difference between
the things that you
can really work with because
you understand them and you
see them and those things that
you just know as something you
don't really understand.
This is something that you
really ought to understand and
be able to see it.
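Again as a numerical sketch, here is the Chebyshev inequality checked against a uniform random variable (an arbitrary choice, just for illustration):

```python
import random

random.seed(2)

# Z uniform on [0, 1]: mean 0.5, variance 1/12.
samples = [random.random() for _ in range(100_000)]
mean_z = 0.5
var_z = 1.0 / 12.0

eps = 0.4
# Empirical probability that Z deviates from its mean by eps or more.
p_dev = sum(1 for z in samples if abs(z - mean_z) >= eps) / len(samples)
# Chebyshev: P(|Z - mean| >= eps) <= var(Z) / eps^2.
cheby_bound = var_z / eps**2

print(p_dev, cheby_bound)
```

Here the true deviation probability is 0.2, while the bound is about 0.52: weak, but it required nothing beyond a mean and a variance.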
OK, the Chernoff bound
is the last of these.
We will use this a great deal.
It's a generating
function bound.
And it says, for any real
number z and any
positive number r greater than
0 such that the moment
generating function exists.
The moment generating
function of
a random variable Z is a
function associated with the
random variable Z.
It's a function of
a real number r.
And that function is the
expected value of e to the rZ.
It's called the generating
function because if you start
taking derivatives of this and
evaluate them, that r equals
0, what you get is the
various moments of z.
You've probably seen
that at some point.
If you haven't seen it, it's not
important here because we
don't use that at all.
What we really use is the fact
that this is a function.
It's a function, which is
increasing as r increases
because you put--
well, it just does.
And what it says is the
probability that this random
variable is greater than or
equal to the number z--
I should really use different
letters for these things.
It's hard to talk about them--
is less than or equal to the
moment generating function
times e to the minus rz.
And the proof is exactly the
same as the proof before.
You might get the picture that
you can prove many, many
different things from the
Markov inequality.
And in fact, you can.
You just put in whatever you
want to and you get a new
inequality.
And you can call it after
yourself if you want.
I mean, Chernoff.
Chernoff is still alive.
Chernoff is a faculty
member at Harvard.
And this is kind of curious
because he sort of slipped
this in in a paper that he wrote
where he was trying to
prove something difficult.
And this is a relationship that
many mathematicians have
used over many, many years.
And it's so simple that they
didn't make any fuss about it.
And he didn't make any
fuss about it.
And he was sort of embarrassed
that many engineers, starting
with Claude Shannon, found this
to be extraordinarily
useful, and started calling
it the Chernoff bound.
He was slightly embarrassed of
having this totally trivial
thing suddenly be
named after him.
But anyway, that's the
way it happened.
And now it's a widely used tool
that we use all the time.
So it's the same proof that we
had before. For any y greater
than 0, Markov says this.
And therefore, you get that.
This decreases exponentially
with z, and
that's why it's useful.
I mean, the Markov inequality
only decays as
1 over little y.
The Chebyshev inequality decays
as 1 over y squared.
This decays exponentially
with y.
And therefore, when you start
dealing with large deviations,
trying to talk about things that
are very, very unlikely
when you get very, very far from
the mean, this is a very
useful way to do it.
And it's sort of the standard
way of doing it at this point.
We won't use it right now, but
this is the right time to talk
about it a little bit.
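As a sketch of how the Chernoff bound is used in practice, here is the bound computed for a sum of the lecture's binary random variables (P(X=1) = 1/4), optimized over r by a brute-force grid search; the threshold and trial counts are illustrative choices:

```python
import math
import random

random.seed(3)

n, p = 50, 0.25       # sum of n Bernoulli(p) variables, as in the lecture's example
z = 20                # threshold above the mean n*p = 12.5

# Moment generating function of one Bernoulli(p): g(r) = 1 - p + p*e^r.
# For the IID sum Sn, the MGF is g(r)^n, so Chernoff gives
# P(Sn >= z) <= g(r)^n * e^{-r z} for every r > 0.
def chernoff_bound(z, r):
    g = 1 - p + p * math.exp(r)
    return g**n * math.exp(-r * z)

# Optimize over r > 0 by a crude grid search.
best = min(chernoff_bound(z, r) for r in [i / 100 for i in range(1, 500)])

# Empirical tail probability for comparison.
trials = 100_000
hits = sum(1 for _ in range(trials)
           if sum(random.random() < p for _ in range(n)) >= z) / trials

print(hits, best)
```

The optimized bound comes out around 0.07 here, while the empirical tail probability is roughly an order of magnitude smaller; the point is that the bound decays exponentially in z, which is what makes it useful for large deviations.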
OK, next topic we want to take
up is really these laws of
large numbers, and something
about convergence.
We want to understand a
little bit about that.
And this picture that we've seen
before, we take X1, up to
Xn as n independent identically
distributed random variables.
They each have mean expected
value of X. They each have
variance sigma squared.
You let Sn be the sum
of all of them.
What we want to understand is,
how does S sub n behave?
And more particularly,
how does Sn over n--
namely, the relative--
not the relative frequency, but
the sample average of x
behave when you take
n samples?
So this curve shows the
distribution function of S4,
of S20, and of S50 when you have
a binary random variable
with probability of 1 equal to
a quarter, probability of 0
equal to 3/4.
And what you see graphically
is what you can see
mathematically very
easily, too.
The mean value of S sub
n is n times X bar.
So the center point in these
curves is moving out with n.
You see the center point here.
Center point somewhere
around there.
Actually, the center point is
at-- yeah, just about what it
looks like.
And the center point here is
out somewhere around there.
And you see the variance, you
see the variance going up
linearly with n.
Not with n squared, but with
n, which means the standard
deviation is going up with
the square root of n.
That's sort of why the law
of large numbers works.
It's because the standard
deviation of these random--
of these sums only goes up with
the square root of n.
So these curves, along with
moving out, become relatively
more compressed relative to
how far out they are.
This curve here is relatively
more compressed for its mean
than this one is here.
And that's more compressed
relative to this one.
We get this a lot more
easily if we look
at the sample average.
Namely, S sub n over n.
This is a random variable
of mean X bar.
That's a random variable of
variance sigma squared over n.
That's something that you ought
to just recognize and
have very close to the top of
your consciousness because
that again, is sort
of why the sample
average starts to converge.
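The claim that the sample average Sn/n has variance sigma squared over n is easy to check numerically. Here is a small sketch using the lecture's binary random variable with P(X=1) = 1/4 (the sample sizes and trial counts are arbitrary choices):

```python
import random

random.seed(4)

# Binary X with P(X=1) = 1/4: mean 0.25, variance p(1-p) = 3/16.
p = 0.25
sigma2 = p * (1 - p)

def sample_average(n):
    return sum(random.random() < p for _ in range(n)) / n

# Estimate the variance of Sn/n empirically for a given n.
def est_var(n, trials=20_000):
    vals = [sample_average(n) for _ in range(trials)]
    m = sum(vals) / trials
    return sum((v - m)**2 for v in vals) / trials

v4, v50 = est_var(4), est_var(50)
print(v4, sigma2 / 4)    # should be close
print(v50, sigma2 / 50)  # smaller by a factor of 12.5
```

The estimated variances track sigma squared over n, which is the "scrunching in" visible in the distribution-function curves.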
So what happens then is for n
equals 4, you get this very
"blech" curve.
For n equals 20, it starts
looking a little more
reasonable.
For n equals 50, it's starting
to scrunch in and start to
look like a unit step.
And what we'll find is that the
intuitive way of looking
at the law of large numbers, or
one of the more intuitive
ways of looking at it,
is that the sample
average starts to look--
starts to have a distribution
function, which
looks like a unit step.
And that step occurs
at the mean.
So this curve here keeps
scrunching in.
This part down here is
moving over that way.
This part over here is
moving over that way.
And it all gets close
to a unit step.
OK, the variance of Sn over n,
as we've said, is equal to
sigma squared over n.
The limit of the variance as n
goes to infinity takes the
limit of that.
You don't even have to know the
definition of a limit to see
that when n gets large,
this gets small.
And this goes to 0.
So the limit of this
goes to 0.
Now, here's the important
thing.
This equation says a whole lot
more than this equation says.
Because this equation says how
quickly that approaches 0.
All this says is it
approaches 0.
So we've thrown away a lot that
we know, and now all we
know is this.
Equation 3 says that the
convergence is as 1/n.
This doesn't say that.
This just says that
it converges.
Why would anyone in their right
mind want to replace an
informative statement like this
with a not informative
statement like this?
Any ideas of why you might
want to do that?
Any suggestions?
AUDIENCE: Convenience.
PROFESSOR: What?
AUDIENCE: Convenience.
Convenience.
Sometimes you don't need to--
PROFESSOR: Well, yes,
convenience.
But there's a much
stronger reason.
This is a statement for
IID random variables.
This law of large numbers, we
want it to apply to as many
different situations
as possible.
To things that aren't
quite IID.
To things that don't
have a variance.
And this statement here
is going to apply more
generally than this.
You can have situations where
the variance goes to 0 more
slowly than 1/n if these random
variables are not
independent.
But you still have
this statement.
And this statement is what we
really need, so this really
says something, which is called
convergence in mean square.
Why mean square?
Because this is the
mean squared.
So obvious terminology.
Mathematicians aren't always
very good at choosing
terminology that makes sense
when you look at it,
but this one does.
Definition is a sequence of
random variables Y1, Y2, Y3,
and so forth.
Converges in mean square to a
random variable Y if this
limit here is equal to 0.
So in this case, Y, this random
variable Y, is really a
deterministic random variable,
which is just the
deterministic value, expected
value of X. This random
variable here is this sample
average here.
And this is saying that the
mean squared difference between
the sample average and the
expected value of X--
this is going to 0.
This isn't saying
anything extra.
This is just saying, if you're
not interested in the law of
large numbers, you might be
interested in how a bunch of
random variables approach some
other random variable.
Now, if you look at a set of
real numbers and you say, does
that set of real numbers
approach something?
I mean, you have sort of a
complicated looking definition
for that, which really
says that the numbers
approach this constant.
But a set of numbers is so much
more simple-minded than a
set of random variables.
I mean, a set of random
variables is--
I mean, not even their
distribution functions really
explain what they are.
There's also the relationship
between the
distribution functions.
So you're not going to find
anything very easy that says
random variables converge.
And you can expect to find a
number of different kinds of
statements about convergence.
And this is just going
to be one of them.
This is something called
convergence in mean square--
yes?
AUDIENCE: Going from 3 to 4,
we don't need IID anymore?
So they can be just--
PROFESSOR: You can certainly
find examples where it's not
IID, and this doesn't hold
and this does hold.
The most interesting case where
this doesn't hold and
this does hold is where you--
no, you still need a variance
for this to hold.
Yeah, so I guess I can't really
construct any nice
examples now where this holds
and this doesn't hold.
But there are some if you talk
about random variables that
are not IID.
I ought to have a problem
that does that.
But so far, I don't.
Now, the fact that this sample
average converges in mean
square doesn't tell us directly
what might be more
interesting.
I mean, you look at that
statement and it doesn't
really tell you what this
complementary distribution
function looks like.
I mean, to me the thing that
is closest to what I would
think of as convergence is that
this sequence of random
variables minus the random
variable, the convergence of
that difference, approaches a
distribution function, which
is the unit step.
Which means that the probability
that you're
anywhere off of that center
point is going to 0.
I mean, that's a very easy
to interpret statement.
The fact that the variance is
going to 0, I don't quite know
how to interpret it, except
through Chebyshev's inequality, which
gets me to the other
statement.
So what I'm saying here is if
we apply Chebyshev to that
statement before number 3--
this one--
which says what the
variance is.
If we apply Chebyshev, then what
we get is the probability
that the relative frequency
minus the mean, the absolute
value of that is greater than
or equal to epsilon.
That probability is less than or
equal to sigma squared over
n times epsilon squared.
You'll notice this is
a very peculiar
statement in terms of epsilon.
Because if you want to make
epsilon very small, so you get
something strong here,
this term blows up.
So the way you have to look at
this is pick some epsilon
you're happy with.
I mean, you might want these two
things to be within 1% of
each other.
Then, epsilon squared
here is 1/10,000, so 1 over
epsilon squared is 10,000.
But by making n big enough,
that gets submerged.
So you need to make
n very large.
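Putting illustrative numbers on this (the choice of sigma squared = 1 is an assumption, just to make the arithmetic concrete):

```python
# How large must n be so that sigma^2 / (n * eps^2) <= delta?
# Illustrative numbers: sigma^2 = 1, eps = 0.01 (within 1%), delta = 0.05.
sigma2 = 1.0
eps = 0.01
delta = 0.05

# Chebyshev: P(|Sn/n - mean| >= eps) <= sigma2 / (n * eps^2) <= delta
# requires n >= sigma2 / (delta * eps^2).
n_needed = sigma2 / (delta * eps**2)
print(n_needed)  # 1/eps^2 = 10,000 gets multiplied by 1/delta = 20
```

So the 10,000 from epsilon squared gets submerged only when n reaches a couple hundred thousand, which is exactly the "pick epsilon first, then make n big enough" order of choices.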
Yes?
AUDIENCE: So that's why at times
when n is too small and
epsilon is too small as well,
you can get obvious things,
like it's less than or equal
to a number greater than 1?
PROFESSOR: Yes.
And this inequality is not much
good because there's a
very obvious inequality
that works.
Yes.
But the other thing is this
is a very weak inequality.
So all this is doing is
giving you a bound.
All it's doing is saying that
when n gets big enough, this
number gets as small as
you want it to be.
So you can get an arbitrary
accuracy of epsilon between
sample average and mean.
You can get that with
a probability
1 minus this quantity.
You can make that as close
to 1 as you wish if
you increase n.
So that gives us the law of
large numbers, and I haven't
stated it formally, all the
formal jazz as in the notes.
But it says, if you have IID
random variables with a finite
variance, the limit of the
probability that Sn over n
minus x bar, the absolute value
of that is greater than
or equal to epsilon is equal to
0 in the limit, no matter
how you choose epsilon.
Namely, this is one of those
peculiar things in
mathematics.
It depends on who gets
the first choice.
If I get to choose epsilon
and you get to
choose n, then you win.
You can make this go to 0.
If you choose n and then I
choose epsilon, you lose.
So it's only when you choose
first that you win.
But still, this statement
works.
For every epsilon greater
than 0, this limit
here is equal to 0.
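Here is a small simulation sketch of this statement for the lecture's binary random variable; the particular epsilon and sample sizes are my choices, just for illustration:

```python
import random

random.seed(6)

p, eps = 0.25, 0.05   # binary X with P(X=1) = 1/4, as in the lecture

def deviation_prob(n, trials=5_000):
    """Empirical P(|Sn/n - p| >= eps) for the sample average of n trials."""
    count = 0
    for _ in range(trials):
        avg = sum(random.random() < p for _ in range(n)) / n
        if abs(avg - p) >= eps:
            count += 1
    return count / trials

p_small, p_large = deviation_prob(20), deviation_prob(500)
print(p_small, p_large)   # the large-n probability should be much smaller
```

With epsilon held fixed, the deviation probability collapses as n grows, which is the weak law in action.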
Now, let's go immediately a
couple of pages beyond and
look at this figure a little
bit because I think this
figure tells what's going
on, I think better
than anything else.
You have the mean of x, which is
right in the center of this
distribution function.
As n gets larger and larger,
this distribution function
here is going to be scrunching
in, which we sort of know
because the variance
is going to 0.
And we also sort of know it
because of what this weak law
of large numbers tells us.
And we have these.
If we pick some given epsilon,
then we have--
if we look at a range of two
epsilon, epsilon on one side
of the mean, epsilon on the
other side of the mean, then
we can ask the question, how
well does this distribution
function conform
to a unit step?
Well, one easy way of looking at
that is saying, if we draw
a rectangle here of width 2
epsilon around X bar, when
does this distribution function
get inside that
rectangle and when does it
leave the rectangle?
And what the weak law of large
numbers says is that if you
pick epsilon and hold it fixed,
then delta 1 is going
to 0 and delta 2
is going to 0.
And eventually--
I think this is dying out.
Well, no problem.
What this says is that as n gets
larger and larger, this
quantity shrinks down to 0.
That quantity up there
shrinks down to 0.
And suddenly, you have something
which is, for all
practical purposes,
a unit step.
Namely, if you think about it a
little bit, how can you take
an increasing curve, which
increases from 0 to 1, and say
that's close to a unit step?
Isn't this a nice
way of doing it?
I mean, the function is
increasing so it can't do
anything after it crosses
this point here.
All it can do is increase, and
eventually it leaves again.
Now, another thing.
When you think about the weak
law of large numbers and you
don't state it formally, one of
the important things is you
can't make epsilon 0 here and
you can't make delta 0.
You need both an epsilon and
a delta in this argument.
And you can see that just by
looking at a reasonable
distribution function.
If you make epsilon equal to 0,
then you're asking, what's
the probability that this sample
average is exactly
equal to the mean?
And in most cases, that's
equal to 0.
Namely, you can't win
on that argument.
And if you try to make
delta equal to 0.
In other words, you ask--
then suddenly you're stuck over
here, and you're stuck
way over there, and you can't
make epsilon small.
So trying to say that a curve
looks like a step function,
you really need two fudge
factors to do that.
So the weak law of
large numbers.
In terms of dealing with how
close you are to a step
function, the weak law of large
numbers says about the
only thing you can
reasonably say.
OK, now let's go back
to the slide before.
The weak law of large numbers
says that the limit as n goes
to infinity of the probability
that the sample average is
greater than or equal
to epsilon equals 0.
And it says that for every
epsilon greater than 0.
An equivalent statement is
this statement here.
The probability that Sn over n
minus x bar is greater than or
equal to epsilon is a
complicated looking animal,
but it's just a number.
It's just a number
between 0 and 1.
It's a probability.
So for every n, you get
a number up there.
And what this is saying is that
that sequence of numbers
is approaching 0.
Another way to say that a
sequence of numbers approaches
0 is this way down here.
It says that for every epsilon
greater than 0 and every delta
greater than 0, the probability
that this quantity
is less than or equal
to delta is--
this probability is less than
or equal to delta for all
large enough n.
In other words, these funny
little things on the edge here
for this next slide, the delta
1 and delta 2 are going to 0.
So it's important to understand
this both ways.
And now again, this quantity
here looks very much like--
these two equations look
very much alike.
Except this one tells you
something more about
convergence than
this one does.
This says how this goes to 0.
This only says that
it goes to 0.
So again, we have
the same thing.
The weak law of large numbers
says this weaker thing, and it
says this weaker thing
because sometimes you
need the weaker thing.
And in this case, there
is a good example.
The weak law of large numbers
is true, even if you don't
have a variance.
It's true under the
single condition
that you have a mean.
There's a nice proof in
the text about that.
It's a proof that does
something, which we're going
to do many, many times.
You look at a random variable.
You can't say what you want
to say about it, so
you truncate it.
You say, let me--
I mean, if you think a problem
is too hard, you look at a
simpler problem.
If you're drunk and you drop
a coin, you look for it
underneath a light.
You don't look for it where
it's dark, even though you
dropped it where it's dark.
So all of us do that.
If we can't solve a problem,
we try to pose a simpler,
similar problem that
we can solve.
So you truncate this
random variable.
When you truncate a random
variable, I mean you just take
its distribution function
and you chop it off
to a certain point.
And what happens then?
Well, at that point you
have a variance.
You have a moment generating
function.
You have all the things
you want.
Nothing peculiar can happen
because the thing is bounded.
So then the trick in proving the
weak law of large numbers
under these more general
circumstances is to first
truncate the random variable.
You then have the weak
law of large numbers.
And then the thing that you do
is in a very ticklish way, you
start increasing n,
and you increase
the truncation parameter.
And if you do this in just the
right way, you wind up proving
the theorem you want to prove.
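Here is a rough sketch of the truncation idea, using a Pareto random variable as an illustrative stand-in for a variable with a mean but no variance (the distribution and the truncation level are my choices, not the text's):

```python
import random

random.seed(7)

def truncate(x, b):
    """Truncate a sample at level b: chop the distribution off at b."""
    return min(x, b)

# A heavy-tailed example: Pareto with alpha = 1.5 has a finite mean
# (alpha/(alpha-1) = 3) but an infinite variance, so mean-square
# arguments fail on it directly.
def pareto():
    return 1.0 / (1.0 - random.random()) ** (1 / 1.5)

n = 200_000
b = 1_000.0   # truncation level; the real proof lets b grow along with n
avg_truncated = sum(truncate(pareto(), b) for _ in range(n)) / n
true_mean = 1.5 / (1.5 - 1.0)

print(avg_truncated, true_mean)
```

The truncated variable is bounded, so it has a variance and the ordinary weak law applies to it; the delicate part of the actual proof is growing b with n so that the truncation bias disappears in the limit.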
Now, I'm not saying you ought
to read that proof now.
If you're sailing along with no
problems at all, you ought
to read that proof now.
If you don't quite have the
kind of mathematical
background that I seem to be
often assuming in this course,
you ought to skip that.
You will have many opportunities
to understand
the technique later.
It's the technique which is
important, it's not the--
I mean, it's not so much
the actual proof.
Now, the thing we didn't talk
about, about this picture
here, is we say that a sequence
of random variables--
Y1, Y2, et cetera--
converges in probability to a
random variable y if for every
epsilon greater than 0 and every
delta greater than 0,
the probability that the n-th
random variable minus this
funny random variable is greater
than or equal to
epsilon, is less than
or equal to delta.
That's saying the same thing
as this picture says.
In this picture here, you can
draw each one of the Y sub n's
minus Y. You think of
Y sub n minus Y as a
single random variable.
And then you get this kind of
curve here and the same
interpretation works.
So again, what you're saying
with convergence in
probability is that the
distribution function of Yn
minus Y is approaching a unit
step as n gets big.
So that's really the meaning
of convergence in
probability.
I mean, you get this unit step
as n gets bigger and bigger.
OK, so let's review all
of what we've done in
the last half hour.
If a random variable, generic
random variable x, has a
standard deviation--
in other words, if it has
a finite variance.
And if X1, X2 are IID with that
standard deviation, then
the standard deviation of the
sample average is equal to
the standard deviation
of X divided by the
square root of n.
So the standard deviation
of the sample average is going
to 0 as n gets big.
In the same way, if you have a
sequence of arbitrary random
variables, which are converging
to Y in mean
square, then Chebyshev shows
that it converges in
probability.
OK so mean square convergence
implies convergence in
probability.
Mean square convergence is a
funny statement which says
that this sequence of random
variables has a standard
deviation, which
is going to 0.
And it's hard to see exactly
what that means because that
standard deviation is a
complicated integral.
And I don't know
what it means.
But if you use the Chebyshev
inequality, then it means this
very simple statement, which
says that this sequence has to
converge in probability to Y.
Mean square convergence then
implies convergence in
probability.
Reverse isn't true because--
and I can't give you an example
of it now, but I've
already told you something
about it.
Because I've said that the
weak law of large numbers
continues to hold if the generic
random variable has a
mean but doesn't have a
variance, because of this
truncation argument.
Well, I mean, what it says
then is that a variance is not
required for the weak law of
large numbers to hold.
And if the variance doesn't
exist, then you certainly don't
have convergence
in mean square.
So we have an example, even
though we haven't proven that
that example works.
You have an example where the
weak law of large numbers
holds, but convergence in mean
square does not hold.
OK, and the final thing is
convergence in probability
really means that the
distribution of Yn minus Y
approaches the unit step.
Yes?
AUDIENCE: So in general,
convergence in probability
doesn't imply convergence
in distribution.
But it holds in this special
case because--
PROFESSOR: It does imply
convergence in distribution.
We haven't talked about
convergence in
distribution yet.
Except it does not imply
convergence in mean square,
which is a thing that
requires a variance.
So you can have convergence
in probability without
convergence in mean square,
but not the other way.
I mean, convergence in mean
square, you just apply
Chebyshev to it,
and suddenly--
presto, you have convergence
in probability.
And incidentally, I
wish all of you
would ask more questions.
Because we're taking this video,
which is going to be
shown to many people in many
different countries.
And they ask themselves, would
it be better if I came to MIT
and then I could sit in class
and ask questions?
And then they see these videos
and they say, ah, it doesn't
make any difference, nobody
asks questions anyway.
And because of that, MIT
will simply wither
away at some point.
So it's very important for you
to ask questions now and then.
Now, let's go on to the
central limit theorem.
This sum of n IID
random variables
minus n times the mean--
in other words, we just
normalized it to 0 mean.
S sub n minus n x bar is a
0 mean random variable.
And it has variance n
times sigma squared.
It also has second moment
n times sigma squared.
And what that means is that you
take Sn minus n times the
mean of x and divide it by the
square root of n times sigma.
What you get is something
which is 0
mean and unit variance.
So as you keep increasing n,
this random variable here, Sn
minus n x bar over the square
root of n sigma, just sits
there rock solid with the same
mean and the same variance,
nothing ever happens to it.
Except it has a distribution
function, and the distribution
function changes.
I mean, you see the distribution
function changing
here as you let n get
larger and larger.
In some sense, these steps are
getting smaller and smaller.
So it looks like you're
approaching some particular
curve and when we looked
at the Bernoulli case--
I guess it was just last time.
When we looked at the Bernoulli
case, what we saw is
that these steps here were
going as e to the minus
difference from the mean squared
divided by 2 times
sigma squared.
Bunch of stuff, but what we saw
was that these steps were
proportional to the density
of a Gaussian.
In other words, this curve that
we're converging to is
proportional to the distribution
function of the
Gaussian random variable.
We didn't completely prove that
because all we did was to
show what happened to the PMF.
We didn't really integrate
these things.
We didn't really deal with all
of the small quantities.
We said they weren't
important.
But you sort of got the picture
of exactly why this
convergence to a normal
distribution
function takes place.
And the theorem says this more
general thing that this
convergence does, in
fact, take place.
And that is what the central
limit theorem says.
It says that that happens not
only for the Bernoulli case,
but it happens for all
random variables,
which have a variance.
And the convergence is
relatively good if the random
variables have a certain
higher moment; it
can be awful otherwise.
So this expression on top then
is really the expression of
the central limit theorem.
It says not only does the
normalized sample average--
I'll call this whole thing the
normalized sample average
because Sn minus n X bar has
standard deviation square root
of n times sigma sub x.
So this normalized sample
average has mean 0 and
standard deviation 1.
Not only does it have mean 0
and variance 1, but it also
becomes closer and closer to
this Gaussian distribution.
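Here is a numerical sketch of that convergence (my own illustration; the exponential distribution and the sample sizes are arbitrary choices, and any distribution with a variance would do): compare the empirical CDF of the normalized sum against the standard normal CDF at a few points.

```python
import math
import random

# Illustrative sketch of the central limit theorem: S_n is a sum of n IID
# exponential(1) variables, which have mean 1 and standard deviation 1.
# We normalize and compare the empirical CDF of
# (S_n - n*xbar) / (sqrt(n)*sigma) to the standard normal CDF Phi.
random.seed(1)
n, trials = 200, 5000
xbar = sigma = 1.0                       # exponential(1): mean 1, std dev 1

def normalized_sum():
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n * xbar) / (math.sqrt(n) * sigma)

samples = sorted(normalized_sum() for _ in range(trials))

def empirical_cdf(x):
    return sum(1 for v in samples if v <= x) / trials

def phi(x):                              # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for x in (-1.0, 0.0, 1.0):
    print(x, round(empirical_cdf(x), 3), round(phi(x), 3))
```

Even though the exponential is quite skewed, the two columns are already close at n equal to 200.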
Why is that important?
Well, if you start studying
noise and things like that,
it's very important.
Because it says that if you have
the sum of lots and lots
of very, very small, unimportant
things, then what
those things add up to if
they're relatively independent
is something which is
almost Gaussian.
You pick up a book on noise
theory or you pick up a book
which is on communication, or
which is on control, or
something like that, and after
you read a few chapters, you
get the idea that all random
variables are Gaussian.
This is particularly true
if you look at books on
statistics.
Many, many books on statistics,
particularly for
undergraduates, the only random
variable they ever talk
about is the normal
random variable.
For some reason or other, you're
led to believe that all
random variables are Gaussian.
Well, of course, they aren't.
But this says that a lot of
random variables, which are
sums of large numbers of little
things, in fact are
close to Gaussian.
But we're interested in it
here for another reason.
And we'll come to that
in a little bit.
But let me make the comment that
the proofs that I gave
you about the central
limit theorem for
the Bernoulli case--
and if you fill in those
epsilons and deltas there,
that really was a valid proof.
That technique does not work
at all when you have a
non-Bernoulli situation.
Because the situation is
very, very complicated.
You wind up--
I mean, if you have a Bernoulli
case, you wind up
with this nice, nice
distribution, which says that
every step in the Bernoulli
distribution, you have terms
that are increasing and then
terms that are decreasing.
If you look at what happens
for a discrete random
variable, which is not binary,
you have the most god awful
distribution if you
try to look at the
probability mass function.
It is just awful.
And the only thing which
looks nice is the
distribution function.
The distribution function
looks relatively nice.
And why that is, is hard to tell.
And if you look at proofs of
it, it goes through Fourier
transforms.
In probability theory, Fourier
transforms are called
characteristic functions, but
it's really the same thing.
And you go through this very
complicated argument.
I've been through it
a number of times.
And to me, it's all algebra.
And I'm not a person that just
accepts the fact that
something is all
algebra easily.
I keep trying to find ways of
making sense out of it.
And I've never been able to make
sense out of it, but I'm
convinced that it's true.
So you just have to sort
of live with that.
So anyway, the central limit
theorem does apply to the
distribution function.
Namely, exactly what
that says.
The distribution function of
this normalized sample average
does go into the distribution
function of the Gaussian.
The PMFs do not converge
at all.
And nothing else converges,
but just that one thing.
OK, a sequence of random
variables converges in
distribution--
this is what someone was asking
about just a second
ago, but I don't think you were
really asking about that.
But this is what convergence
in distribution means.
It means it's the limit of the
distribution functions of a
sequence of random variables.
Turns into the distribution
function of some
other random variable.
Then you say that these random
variables converge in
distribution to Z. And that's
a nice, useful thing.
And the CLT, the Central Limit
Theorem, then says that this
normalized sample average
converges in distribution to
the distribution of a normal
random variable.
Many people call that density
little phi and call the normal
distribution function capital Phi.
I don't know why.
I mean, you've got to call it
something, so many people call
it the same thing.
This convergence in
distribution is really almost
a misnomer.
Because when random variables
converge in distribution to
another random variable, I mean,
if you say something
converges, usually you have the
idea that the thing which
is converging to something else
is getting close to it in
some sense.
And the random variables aren't
getting close at all,
it's only the distribution
functions
that are getting close.
If I take a sequence of IID
random variables, all of them
have the same distribution
function.
And therefore, a
sequence of IID
random variables converges.
And in fact, it's converged
right from the beginning to
the same random variable,
to the same
generic random variable.
But they're not at all
close to each other.
But you still call this
convergence in distribution.
Why do we make such a big fuss
about convergence in
distribution?
Well, primarily because of the
central limit theorem because
you would like to see that a
sequence of random variables,
in fact, starts to look like
something that is interesting,
which is the Gaussian random
variable after a while.
It says we can do these
crazy things that
statisticians do, and that--
well, fortunately, most
communication theorists are a
little more careful than
statisticians.
Somebody's going to hit me
for saying that, but
I think it's true.
But the central limit theorem,
in fact, does say that many of
these sums of random variables
you look at can be reasonably
approximated as being
Gaussian.
So what we have now is
convergence in probability
implies convergence
in distribution.
And the proof, I will--
on the slides, I always
abbreviate proof by Pf.
And sometimes Pf is just what
it sounds like: it's
"poof." It is not quite a proof,
and you have to look at
those to get the actual proof.
But this says that convergence
of a sequence of Yn's in
probability means that the
distribution function
converges to a unit step.
That's exactly what convergence
in probability means.
It converges to a unit step,
and it converges everywhere
but at the step itself.
If you look at the definition
of convergence in
distribution, and I might not
have said it carefully enough
when I defined it back here.
Oh, yes, I did.
Remarkable.
Often I make up these slides
when I'm half asleep, and they
don't always say what I
intended them to say.
And my evil twin brother comes
in and changes them later.
But here I said it right.
A sequence of random variables
converges in distribution to
another random variable Z if the
limit of the distribution
function is equal
to the limit--
if the limit of the distribution
function of the Z
sub n is equal to the
distribution function of Z.
But it only says this for all z
where the distribution
function is continuous.
You can't really expect much
more than that because if
you're looking at
a distribution--
if you're looking at a
limiting distribution
function, which looks like this,
especially for the law
of large numbers, all we've
been able to show is that
these distributions come in down
here very close, go up
and get out very
close up there.
We haven't said anything
about where they
cross this actual line.
And there's nothing in the
argument about the weak law of
large numbers, which says
anything about what happens
right exactly at the mean.
But that's something
that the central
limit theorem says--
Yes?
AUDIENCE: What's the Zn?
That's not the sample mean?
PROFESSOR: When we use the Zn's
for the central limit
theorem, then what I mean by
the Z sub n's here is those
normalized random variables Sn
minus n x bar over square root
of n times sigma.
And in all of these definitions
of convergence,
the random variables, which are
converging to something,
are always rather peculiar.
Sometimes they're the
sample averages.
Sometimes they're the normalized
sample averages.
God knows what they are.
But what mathematicians
like to do--
and there's a good reason for
what they like to do--
is they like to define different
kinds of convergence
in general terms, and then apply
them to the specific
thing that you're
interested in.
OK, so the central limit
theorem says that this
normalized sum converges in
distribution to phi, but it
only has to converge where the
distribution function is
continuous.
Yes?
AUDIENCE: So the theorem applies
to the distribution.
Why doesn't it apply to PMF?
PROFESSOR: Well, if you look at
the example we have here,
if you look at the PMF for
this normalized random
variable, you find something
which is
jumping up, jumping up.
If we look at it for n equals
50, it's still jumping up.
The jumps are smaller.
But if you look at the
distribution function for the
normal, it has a density.
The things which you want to
approach a limit never
have a density.
All the time they have a PMF.
The steps are getting
smaller and smaller.
And you can see that here as
you're up to n equals 50.
You can see these little
tiny steps here.
But you still have a PMF.
If you want to look at it
in terms of density,
you have to look at it in
terms of impulses.
And there's no way you can say
a sequence of impulses is
starting to
approach a smooth curve.
OK, so we have this proof that
converges in probability,
implies convergence
in distribution.
And since convergence in mean
square implies convergence in
probability, and convergence
in probability implies
convergence in distribution,
we suddenly have the
convergence in mean square
implies convergence in
distribution also.
And you have this nice picture
in the book of all the things
that converge in distribution.
Inside of that--
this is distribution.
Inside of that is all the
things that converge in
probability.
And inside of that is
all the things that
converge in mean square.
Now, there's a paradox here.
And what the paradox is, is that
the central limit theorem
says something very, very strong
about how Sn over n--
namely, the sample average--
converges to the mean.
The convergence in distribution
is a very weak
form of convergence.
So how is this weak form of
convergence telling you
something so specific
about how a sample average
converges to the mean?
It tells you much more
than the weak law of
large numbers does.
Because it tells you, if you
look at this thing, it's
starting to approach a normal
distribution function.
And the resolution
of that paradox--
and this is important
I think--
is that the random variables
that converge in distribution
in the central limit theorem
are these
normalized random variables.
The ones that converge in
probability are the things
which are normalized in terms
of the mean, but not
normalized in terms
of the variance.
So when you look at one curve
relative to the other curve,
one curve is a squashed down
version of the other curve.
I mean, look at those pictures
we have for that example.
If you look at a sequence of
distribution functions for S
sub n over n, what you find is
things which are squashing
down into a unit step.
If you look at what you have
for the normalized random
variables, normalized to unit
variance, what you have is
something which is not squashing
down at all.
It gives the whole shape
of the thing.
You can get from one curve to
the other just by squashing or
expanding on the x-axis.
That's the only difference
between them.
So the central limit theorem
says when you don't squash,
you get this nice Gaussian
distribution function.
The weak law of large numbers
says when you do squash, you
get a unit step.
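The squashed and unsquashed views can be seen side by side in a small simulation (an illustrative sketch, using fair coin flips; not from the lecture):

```python
import math
import random
import statistics

# Illustrative sketch: the same sums S_n viewed two ways.  The spread of
# S_n / n squashes toward 0 as n grows (weak law view), while the
# normalized version (S_n - n*p) / (sqrt(n)*sigma) keeps a fixed, unit
# spread (central limit theorem view).
random.seed(2)
p, sigma, trials = 0.5, 0.5, 4000

for n in (25, 400):
    sums = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    spread_avg = statistics.pstdev(s / n for s in sums)
    spread_norm = statistics.pstdev(
        (s - n * p) / (math.sqrt(n) * sigma) for s in sums
    )
    print(n, round(spread_avg, 3), round(spread_norm, 3))
```

As n goes from 25 to 400, the spread of the sample average shrinks toward 0, while the spread of the normalized sum stays pinned near 1.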
Now, which tells you more?
Well, if you have the central
limit theorem, it tells you a
lot more because it says, if
you look at this unit step
here, and you expand it out by
a factor of square root of n,
what you're going to get is
something that goes like this.
The central limit theorem tells
you exactly what the
distribution function
is at x bar.
It tells you that that's
converging to what?
What's the probability that the
sum of a large number of
random variables is greater
than n times the mean?
What is it approximately?
AUDIENCE: 1/2.
PROFESSOR: 1/2.
That's what this says.
This is a distribution
function.
It's converging to
the distribution
function of the normal.
It hits that point, the
normal is centered
on this x bar here.
And it hits that point
exactly at 1/2.
This says the probability of
being on that side is 1/2.
The probability of being
on this side is 1/2.
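That probability-1/2 statement is easy to test numerically. A sketch (my own illustration, using skewed exponential variables, so the answer is not 1/2 by symmetry alone):

```python
import random

# Illustrative sketch: for large n, the probability that a sum of n IID
# exponential(1) variables exceeds its mean n is close to 1/2, as the
# central limit theorem predicts, even though the exponential is skewed.
random.seed(3)
n, trials = 500, 4000
count = sum(
    sum(random.expovariate(1.0) for _ in range(n)) > n
    for _ in range(trials)
)
print(count / trials)   # close to 0.5
```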
So you see, the central limit
theorem is telling you a whole
lot more about how this is
converging than the weak law
of large numbers is.
Now, I come back to the
question I asked
you a long time ago.
Why is the weak law
of large numbers--
why do you see it used more
often than the central limit
theorem since it's so
much less powerful?
Well, it's the same
answer as before.
It's less powerful, but it
applies to a much larger
number of cases.
And in many situations, all
you want is that weaker
statement that tells you
everything you want to know,
but it tells you that weaker
statement for this enormous
variety of different
situations.
Mean square convergence applies
to fewer things.
Well, of course, convergence
in distribution applies for
even more things.
But we saw that when you're
dealing with the central limit
theorem, all bets are off on
that because it's talking
about a different sequence of
random variables, which might
or might not converge.
OK, so finally, convergence
with probability 1.
Many people call convergence
with probability 1 convergence
almost surely, or convergence
almost everywhere.
You will see this almost
everywhere.
Now, why do I want to use
convergence with probability,
and why is that a dangerous
thing to do?
When you say things are
converging with probability 1,
it sounds very much like you're
saying they converge in
probability because you're
using the word
"probability" in each.
The two are very, very
different concepts.
And therefore, it would seem
like you should avoid the word
"probability" in this second one
and say convergence almost
surely or convergence
almost everywhere.
And the reason I don't like
those is they don't make any
sense, unless you understand
measure theory.
And we're not assuming
that you understand
measure theory here.
If you wanted to do that first
problem in the last problem
set, you had to understand
measure theory.
And I apologize for that, I
didn't mean to do that to you.
But this notion of convergence
with probability 1, I think
you can understand that.
I think you can get a good sense
of what it means without
knowing any measure theory.
And at least that's what
we're trying to do.
OK, so let's go on.
We've already said that a random
variable is a lot more
complicated thing than
a number is.
I think those of you who
thought you understood
probability theory pretty well,
I probably managed to
confuse you enough to get you
to the point where you think
you're not on totally
safe ground talking
about random variables.
And you're certainly not on very
safe ground talking about
how random variables converge
to each other.
And that's good because to reach
a greater understanding
of something, you have to get
to the point where you're a
little bit confused first.
So I've intentionally
tried to--
well, I haven't tried
to make this more
confusing than necessary.
But in fact, it's not as simple
as what elementary
courses would make
you believe.
OK, this notion of convergence
with probability 1, which we
abbreviate WP1, is something
that we're not going to talk
about a great deal until we
come to renewal processes.
And the reason is we won't need
it a great deal until we
come to renewal processes.
But you ought to know that
there's something like that
hanging out there.
And you ought to have some
idea of what it is.
So here's the definition
of it.
A sequence of random variables
converges with probability 1
to some other random
variable Z, all in
the same sample space,
if the probability of the set of
sample points omega in capital Omega--
now remember, that a sample
point implies a value for each
one of these random variables.
So in a sense, you can think of
a sample point as, more or
less, equivalent to a sample
path of this sequence of
random variables here.
OK, so for omega in capital
Omega, it says the limit as n
goes to infinity of these random
variables at the point
omega is equal to what this
extra random variable is at
the point omega.
And it says that the probability
of that whole
set is equal to 1.
Now, how many of you can
look at that statement
and see what it means?
Well, I'm sure some of you can
because you've seen it before.
But understanding what that
statement means, even though
it's a very simple statement,
is not very easy.
So there's the statement
up there.
Let's try to parse it.
In other words, break it down
into what it's talking about.
For each sample point omega,
that sample point is going to
map into a sequence of--
so each sample point maps into
this sample path of values for
this sequence of random
variables.
Some of those sequences--
OK, this now is a sequence
of numbers.
So each omega goes into some
sequence of numbers.
And the question is whether
that sequence eventually gets
close to this
final generic random variable,
capital Z evaluated at omega.
Now, some of these sequences
here, sequences of real
numbers, we all know what a
limit of real numbers is.
I hope you do.
I know that many of you don't.
And we'll talk about it later
when we start talking about
the strong law of
large numbers.
But this does, perhaps,
have a limit.
It perhaps doesn't
have a limit.
If you look at a sequence 1, 2,
1, 2, 1, 2, 1, 2, forever,
that doesn't have a limit
because it doesn't start to
get close to anything.
It keeps wandering
around forever.
If you look at a sequence which
is 1 for 10 terms, then
it's 2 the 11th term, and then
it's 1 for 100 terms, 2 for 1
more term, 1 for 1,000 terms,
then 2 for the next term, and
so forth, that's a much
more tricky case.
Because in that sequence,
pretty soon
all you see is 1's.
You look for an awful long way
and you don't see any 2's.
That does not converge, just
because of the definition of
convergence.
And when you work with
convergence for a long time,
after a while you're very
happy that that doesn't
converge because it would
play all sorts of
havoc with all of analysis.
So anyway, there is this idea
that these numbers either
converge or they
don't converge.
When these numbers converge,
they might or might not
converge to this.
So for every omega in this
sample space, you have this
sequence here.
That sequence might converge.
If it does converge, it might
converge to this or it might
converge to something else.
And what this is saying here is
you take this entire set of
sequences here.
Namely, you take the entire
set of omega.
And for each one of those
omegas, this might or might
not converge.
You look at the set of
omega for which this
sequence does converge.
And for which it does converge
to Z of omega.
And now you look at that set
and what convergence with
probability 1 means is that that
set turns out to be an
event, and that event turns
out to have probability 1.
Which says that for almost
everything that happens, you
look at this sample sequence
and it has a
limit, which is this.
And that's true with
probability 1.
It's not true for
most sequences.
Let me give you a very quick
and simple example.
Look at this Bernoulli case, and
suppose the probability of
a 1 is one quarter and the
probability of a 0 is 3/4.
Look at what happens when you
take an extraordinarily large
number of trials and you ask
for those sample sequences
that you take.
What's going to happen
to them?
Well, this says that if you
look at the sample averages--
if they converge in this
sense, then it says with
probability 1, that
sample average is going
to converge to one quarter.
Now, that doesn't mean that most
sequences are going to
converge that way, because most
sequences, if you just count
them, have sample averages
going to 1/2.
There are many more sequences
with half 1's and half 0's
than there are with
three quarter 0's
and one quarter 1's.
So there are many more of
one than the other.
But those particular sequences
which have relative frequency
one quarter are much more likely
than those which have
relative frequency one half,
because each individual sequence
with relative frequency
one half is so unlikely.
That's a complicated
set of ideas.
You sort of know that it has
to be true because of these
other laws of large numbers.
And this is simply extending
those laws of large numbers
one more step to say not only
does the distribution function
of that sample average converge
to what it should
converge to, but also for
these sequences with
probability 1, they converge.
If you look at the sequence for
long enough, the sample
average is going to converge to
what it should for that one
particular sample sequence.
OK, the strong law of large
numbers then says that if X1,
X2, and so forth are IID random
variables, and they
have an expected value, which
is less than infinity, then
the sample average converges
to the actual average with
probability 1.
In other words, it says with
probability 1, you look at
this sequence forever.
I don't know how you look
at a sequence forever.
I've never figured that out.
But if you could look at it
forever, then with probability
1, it would come out with the
right relative frequency.
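The distinction the strong law draws, convergence path by path, can be illustrated with a sketch like this (my own illustration, using the Bernoulli(1/4) example from the lecture; the path lengths are arbitrary):

```python
import random

# Illustrative sketch: the strong law is about individual sample paths.
# Generate a few long Bernoulli(1/4) paths and watch each path's own
# sample average settle near 1/4 -- convergence path by path, not just
# of the distribution functions.
random.seed(4)
p, n = 0.25, 20000

averages = []
for path in range(3):
    flips = [random.random() < p for _ in range(n)]
    averages.append(sum(flips) / n)
    print(path, round(averages[-1], 3))
```

Every path printed comes out near one quarter, which is what convergence with probability 1 leads you to expect.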
It'll take a lot of investment
of time when we get to chapter
4 to try to sort that out.
I wanted to tell you a little
bit about it; read about it a
little bit in the notes in
chapter 1, and we will
come back to it.
And I think you will
then understand it.
OK, with that, we are
done with chapter 1.
Next time we will go into
Poisson processes.
If you're upset by all of the
abstraction in chapter 1, you
will be very happy when we get
into Poisson processes because
there's nothing abstract
there at all.
Everything you could reasonably
say about Poisson
processes is either obviously
true or obviously false.
I mean, there's nothing that's
strange there at all.
After you understand it,
everything works.
So we'll do that next time.
