Picking up right where
we left off last time,
we were deriving the variance
of a hypergeometric, right?
So I was just wanting to
quickly recap that and
make a few more comments about it.
We basically did the calculation last
time, just didn't simplify the algebra.
But I wanna say a few more things
about that and remind you.
So we were doing the variance
of the hypergeometric,
And we have parameters w, b, n.
You think of it as w white balls, b black balls,
and we're taking a sample of
size n without replacement.
And then we wanna study the variance of
the number of white balls in the sample.
And I'll just remind you what we
did at the very end, last time.
And let's actually make up a little
bit more notation to make this
a little bit nicer.
So let's let p equal w over w + b, that's
a natural quantity to look at, right?
And that is between 0 and 1, it's just a
fraction of white balls in the population.
And it's also kind of
convenient to let w + b = N.
That's not a random variable,
but that's sort of a traditional
statistics notation sometimes, for
the population size, is capital N.
Sample size is lowercase n.
Okay, and then what we did last
time was derive the variance of X.
We decomposed X as a sum of
indicator random variables, where
Xj is just the indicator of the jth
ball that you draw being white.
And then using the stuff we
did last time for variance.
The variance of the sum is the sum
of the variances, then plus,
we have all these covariances.
If they're independent, you don't
have to worry about the covariances.
But in this case they're not independent,
so we need the covariance terms.
So really this is just gonna
be Var(X1) plus blah,
blah, blah, plus Var(Xn), and
then all the covariances.
And sorry, I'll put 2,
because I'm grouping covariance of X1 and
X2 with covariance of X2 and
X1, group them together.
So 2 times the sum over all i < j Cov(Xi,
Xj).
Okay, now at first this
looks like a complicated sum.
But we take advantage of symmetry,
like I was doing quickly the last time.
But you should make sure that you
see the symmetry in this problem.
Any of these Xj, the jth ball,
before you draw any balls,
that's just equally likely
to be any of the balls.
So we're not,
these are not conditional variances,
these are the unconditional variances.
By symmetry, they're all the same, so
it's just n times the variance
of the first one.
So n times Var(X1), well,
X1 is just Bernoulli p, right?
So this is just n p (1-p).
And then we need all these
covariance terms, and
by symmetry they're all the same again.
So it's 2,
there's n choose 2 terms in this sum.
So I'm just gonna write 2 (n choose 2),
then I don't have to do a sum anymore.
So I'll go 2 (n choose 2), and
we want the covariance between X1 and X2.
And we did this quickly last time, but
it's important enough
to write it down again.
Cov(X1, X2) =,
just to remind you of the definition,
or the equivalent of the definition,
E(X1 X2)- (EX1)(EX2).
That's just how you get covariance
in general, I mean,
that's always true.
But once we have this, that will tell
us immediately what to put in here.
At least immediately,
once we think about what indicator
random variables actually mean.
Okay, so this part is just easy,
it's just (EX1)(EX2).
And we already know that marginally,
these are just Bernoulli p's.
They're not independent, but this is
just saying look at them separately,
so those are just Bernoulli p's.
So, that's very easy,
that's just gonna be p squared, that term.
Now E(X1 X2), as I pointed out last time.
If you multiply two
indicator random variables,
that's just an indicator random
variable of the intersection.
So this is just the event that
the first ball is white and
the second ball is white.
So the first ball being white has
probability w/(w + b),
that's just p again,
times (w - 1)/(w + b - 1),
minus p squared, okay.
So that looks messy.
If you multiply everything and
then simplify and do the algebra,
it actually comes out to
something surprisingly nice.
So this is just algebra at this point,
if you simplify it,
what you get is this factor.
(N - n)/(N - 1), times something
that looks familiar, np(1 - p).
This part looks very familiar, right?
That's just the variance of a Binomial(n, p).
This factor in front, in statistics, is
called the finite population correction.
And this answer, it's really neat
that it works out to something so
simple and similar to the binomial.
It looks like the binomial variance,
it's we just need this extra
correction factor in front.
And let's just check this in a couple
simple and extreme cases, right,
I always recommend look at simple and
extreme cases.
So one extreme case would
be little n = 1.
Then this goes away, right, and
we just get the variance of a Bernoulli p.
Well it had to be that way, right?
Because if you're only picking one ball,
what difference does it make if it's with
replacement or without replacement,
there's only one ball.
So that makes sense, when n is 1.
And now let's consider
another extreme case, so
I'll just write that down, extreme cases.
So one extreme case is n = 1.
And the other extreme case is
if N is much, much larger than little n.
Little n is, say, 20; big N is 100,000.
If that's the case,
this is extremely close to 1.
Which says we're getting something
extremely close to the binomial variance,
and that should make perfect sense.
Because if the sample is so minuscule
compared to the population, it's very,
very unlikely that you would sample
the same individual more than once, right?
You're not doing replacement,
but what difference does it make,
cuz it's unlikely to get the same
person twice anyway, in your sample.
Okay, so it's gonna be close to
a binomial if this thing is close to 1,
so that should make intuitive sense.
All right, so
that's the variance of the hypergeometric.
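To see this formula in action, here's a quick simulation check; the specific numbers (w = 6, b = 4, n = 3) are my own made-up example, not from the lecture:

```python
# Numerical check of Var(X) = ((N - n)/(N - 1)) * n * p * (1 - p)
# for the hypergeometric, with made-up example numbers.
import random

w, b, n = 6, 4, 3            # white balls, black balls, sample size
N = w + b                    # population size
p = w / N                    # fraction of white balls

exact = (N - n) / (N - 1) * n * p * (1 - p)

random.seed(0)
trials = 200_000
balls = [1] * w + [0] * b    # 1 = white, 0 = black
# random.sample draws without replacement, matching the story
samples = [sum(random.sample(balls, n)) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials

print(round(exact, 4), round(var, 4))  # the two should be close
```

With these numbers the finite population correction is (10 - 3)/(10 - 1) = 7/9, noticeably below 1, so the variance is smaller than the binomial's np(1 - p).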
Okay, so I think we're ready for
a change of variables now,
change of topic to change of variables.
So change of variables is
synonymous with transformations.
This is something we've done
before via other methods,
but not as a topic in its own right.
But hopefully, the method that we're gonna
write down, everything should look kind of
natural, because we've already
done some similar stuff.
And we've already been talking a lot
about what happens when you have
a function of a random variable.
A function of a random
variable is a random variable.
And we use LOTUS a lot to get
its expected value, okay?
LOTUS is great, but
it only gives you the expected value
of that transformed random variable.
It doesn't give you
the whole distribution.
So a lot of times you don't want just
the mean, or just the mean and
the variance; you want the entire
distribution. Well, how do you do that?
So, let's state it as a theorem and
then do examples.
So, it's more interesting
in the continuous case.
So, I'm going to state this for
continuous random variables.
So let X be a continuous random variable.
With PDF, let's say,
f sub x and let Y equals g of X.
So we're transforming from X to Y by
applying some function g.
We need to make some assumptions on g,
if g is a really nasty function then
this may not work out very well.
LOTUS will still be true, but
that doesn't give us the distribution.
So let's assume to start with,
let's assume that first of
all g is differentiable.
So in particular it's continuous,
but it's stronger than that, we want
the derivative of g to exist everywhere,
or at least everywhere of interest to us
And let's assume that g
is strictly increasing.
Okay, and then the question is,
how do we get the PDF of Y?
And the answer is given by
fY(y) = fX(x) dx/dy.
That is, we start with the PDF of X,
and then we multiply by dx/dy.
And I just have to explain
the notation a little bit.
Here we transformed
capital X to capital Y.
So a natural thing to do is to mirror
that notation with the lowercase letters.
So we're defining it to be true
that little y equals g of little x.
So we're doing the same transformation.
Now this looks a little
bit strange because,
This is a function of y and
this is a function of x.
And if I ask you for
the PDF of y, I'm hoping you'll
give me a function of little y.
And if you just write this down,
you have a function of little x.
So the interpretation of this is
that everything is then
written in terms of little y.
It looks uglier if I write
it that way right now.
Well, because I made these
assumptions that g is nice enough,
g will have an inverse so we could
also write x equals g inverse of y.
So all I'm saying to do is
plug in g inverse of y here,
then it's a function of y.
Dx dy is the derivative of x with
respect to y viewed as a function of y.
And there are several variations on this.
In particular, there's another
way to get dx dy.
This is just intro calculus again,
but it's useful to point out:
dx dy is the reciprocal of dy dx.
That's just the Chain Rule.
Remember from calculus
these look like fractions.
They're not actually fractions but the
Chain Rule says they act like fractions.
That's just the Chain Rule.
So that says we have a choice in doing
this we can decide which is easier.
Either we could do dx dy directly,
or we could take dy dx and flip it.
And then we just have to remember to
write it as a function of y, either way.
So there are a couple of choices for
how to use this, and
you should think first about which
one's gonna be easier rather
than just blindly jumping in
without actually thinking about it.
Also, make sure to check the assumptions.
Strictly increasing.
So a common mistake on this kind of
problem would be to just try to blindly
plug into this formula for
a function like g of X equals X squared.
Now, g of X equals X squared is a very,
very nice function.
I'm not saying that's not a nice function,
it's a parabola.
But it's a u shape,
it goes down and then it goes up.
So it's not strictly increasing so
you couldn't apply this.
It doesn't mean we can't
solve the problem,
it just means you have to go back
to first principles in that case.
This would work, though,
with g of X equals X squared if we're
dealing with positive random variables.
Because then the negative
side doesn't come into play.
But if you're dealing with both
negative and positive values and
you're squaring it, that's not increasing.
Okay, so you have to be careful
about things like that.
Don't just plug into the formula
without checking the assumptions.
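As a sketch of the one-dimensional formula under these assumptions, here's an example of my own choosing: X ~ Expo(1) and g(x) = x², which is strictly increasing on the positive reals where X lives, so the theorem applies:

```python
# Sketch of f_Y(y) = f_X(x) * dx/dy for a strictly increasing g.
# Example (mine, not from the lecture): X ~ Expo(1), g(x) = x^2 on (0, inf).
import math

def f_X(x):
    # Expo(1) density
    return math.exp(-x) if x > 0 else 0.0

def f_Y(y):
    # x = g^{-1}(y) = sqrt(y), so dx/dy = 1/(2*sqrt(y))
    x = math.sqrt(y)
    return f_X(x) * (1.0 / (2.0 * math.sqrt(y)))

def cdf_Y_numeric(y, steps=100_000):
    # midpoint-rule integral of f_Y from 0 to y
    h = y / steps
    return sum(f_Y((k + 0.5) * h) for k in range(steps)) * h

# Check against the CDF computed directly:
# P(Y <= y) = P(X <= sqrt(y)) = 1 - exp(-sqrt(y))
y = 4.0
exact = 1 - math.exp(-math.sqrt(y))
print(round(cdf_Y_numeric(y), 3), round(exact, 3))
```

Integrating the density produced by the formula recovers the CDF we get by first principles, which is exactly the consistency the theorem promises.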
All right, so let's prove this.
And the proof should be pretty easy,
just based on kind of similar
calculations we did before.
Like, in a sense,
this is easier than Universality, or
things like that, that we've done before.
Well, our proof is just
gonna be let's find the CDF,
take the derivative to get the PDF.
So it doesn't require any
great leaps of thought.
We're just gonna find the CDF.
We're gonna take the derivative,
that's it.
So let's do that pretty quickly.
The CDF of Y is the probability that
Y is less than or equal to little y,
which equals, I'm just gonna plug in
the definition, the probability that
g(X) is less than or equal to little y, right?
And remember, the derivative of the CDF is the PDF.
So if we want we can write
this as P of X less than or
equal to g inverse of y, because I
am assuming this function has an inverse,
this event is equivalent to this event.
That is you can get from here to here and
from here back, back to here.
It's just the same event
written in a different way.
But notice that this is just
the CDF of X evaluated here right.
It's just the definition of CDF so
hopefully this is very,
very familiar by now.
That's just the CDF of X; let's call
that F sub X of g inverse of y.
And just to make the notation a little
bit nicer, really that's just F sub X of x,
cuz I defined it,
I wrote over there that x is g inverse of y.
So that's an easier way to write it.
So basically what that says is that for
the CDF you don't really
have to do much of anything.
But for the PDF you can't
just say this equals this;
you have this
derivative that comes up.
That's from the Chain Rule.
So now let's just take the derivative
of both sides: fY(y) equals,
I'm differentiating with respect to y.
So the chain rule says I can differentiate
first with respect to x and
get the PDF of X.
And then we have the correction
dx/dy, to correct for
the fact that we differentiated
both sides with respect to y,
whereas to get from big F sub X to little
f sub X we differentiated with respect to x.
So it's just the chain rule,
nothing else to it.
All right, so
that's what we wanted to show.
So let's do an example.
So, here's a famous example
one of the most widely used
distributions in practice
is called the log normal.
And let's let Y = e to the Z.
Log normal does not mean the log of a normal.
You can't take the log of
a normal random variable,
because you can't take
the log of a negative number.
Log normal means, so
this is log normal example.
Log normal means that the log is normal,
not log of the normal.
So Z is standard normal here.
More generally you could let
z be normal mu sigma squared.
But let's just do the standard
normal case first.
So if I take the log of this,
I'll get z which is normal.
So that's why it's called
log normal log is normal.
So we actually had a homework
problem about this before, right.
Where if you did the problem,
what you did was to use the MGF
of the normal to find moments
of the log normal, okay?
But that's just moments,
right now we want the entire PDF, okay?
We want the distribution.
So this transformation
is an increasing function.
It's infinitely differentiable, right?
It's a very, very nice function,
so there's no problem with
applying that result.
And we can just immediately therefore,
write down the PDF,
fy(y) = let's do it here.
fY(y) =, so I'm just gonna write down
the standard normal density;
X is Z in this example.
1 over root 2 pi,
e to the minus z squared over 2,
except that I said that we have to
express it as a function of Y, right?
So z is log y, so
instead of writing z squared over 2,
I'm gonna write log y squared over 2.
So this would be the normal density,
except plugging in log y in for z, and
then it says according to this we
also have to multiply it by dz/dy.
So over here,
let's just compute the derivative:
dy/dz equals the derivative
of e to the z,
which is e to the z, right?
But I wanna write that
in terms of y instead,
e to the z in terms of y is y so
that wasn't too difficult.
And then we just need to be careful
about do we multiply it by dy/dz here or
dz/dy, right?
But that says dx/dy.
So it's the reciprocal of this,
so we're gonna just put a 1/y.
And this is for y > 0.
So that's gonna be the PDF.
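Here's a quick numerical sanity check of the density we just wrote down; the simulation and the bin [1, 1.5] are just illustrative choices of mine:

```python
# Sanity checks for the log-normal density derived above:
# f_Y(y) = (1/sqrt(2*pi)) * exp(-(log y)^2 / 2) * (1/y), for y > 0.
import math
import random

def f_Y(y):
    return math.exp(-math.log(y) ** 2 / 2) / (y * math.sqrt(2 * math.pi))

# Check 1: it should integrate to (essentially) 1; integrate out to 50,
# which covers almost all the mass.
h, total = 0.001, 0.0
y = h / 2
while y < 50:
    total += f_Y(y) * h
    y += h
print(round(total, 3))   # close to 1

# Check 2: it should match the empirical fraction of e^Z in a small interval.
random.seed(1)
draws = [math.exp(random.gauss(0, 1)) for _ in range(100_000)]
frac = sum(1 for d in draws if 1.0 <= d <= 1.5) / len(draws)
approx = sum(f_Y(1.0 + (k + 0.5) * 0.01) * 0.01 for k in range(50))
print(round(frac, 3), round(approx, 3))
```

Both the total mass and the bin probability line up, which is what we'd hope after carrying the 1/y Jacobian factor along correctly.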
By the way if you ever forget
whether this is dx/dy, or
dy/dx well, I mean it shouldn't
take long to rederive it.
But a kind of mnemonic is to pretend
that the dy is over there; then it looks
really nice and symmetrical, fY(y) dy =
fX(x) dx, which is a handy way to remember it.
And you may have been taught that it's
not ok to separate the dx from the dy,
but when you go further in math then
people start separating them again.
And as long as you're careful about what
things mean, it is possible to do that if
you interpret it correctly, but
I just think of that as a mnemonic.
So, if I ever write fY(y) dy = fX(x) dx,
just think of that as notation
that means exactly this.
All right, so
the proof was pretty short for this, and
in the example we just used the formula.
Pretty straightforward as
long as the conditions apply.
But while we're on this topic,
let's do the multidimensional version,
which looks uglier, but
it's conceptually the same thing.
So now we're gonna have
transformations in n dimensions.
So transformations again, but
now we have the multidimensional version.
Okay, so now we think of Y and X as
random vectors, so Y = g(X),
where g is a mapping from Rn to Rn.
A random vector, just think of
that as a list of random variables, so
this Y is really just Y1 through Yn.
So it's not really a new concept.
It just means we took our
n random variables and
listed them together as one vector, okay.
And so we have a mapping from Rn to
itself, and we're doing a transformation.
And then the problem is, and
let's assume that X = (X1, ..., Xn),
that vector, is continuous.
That is, it's a continuous random vector.
In other words, we just have
some joint PDF, right, cuz we've
been talking about joint PDFs, so
that's a familiar concept at this point.
So we have this joint PDF, and we do
this thing, and then the question is,
what's the joint PDF of Y, right?
So it's completely analogous,
just higher dimensional.
So I'm not gonna prove
that the analog holds,
because that's just basically
an exercise in multivariable calculus,
which is not really that relevant for
our purposes.
It's completely analogous,
so depending on how much
multivariable calculus you've
done you could either prove it or
just accept it as analogous,
cuz it is analogous, okay?
So we want the joint PDF of Y,
In terms of the joint PDF of X, right?
So I'm just gonna write down the analogous
equation to that: f sub Y of y,
I'm just using that as notation for
the joint PDF, equals the joint PDF of X
times dx/dy.
The only problem with this is,
how do we interpret the derivative of
this vector with respect to that vector?
What does that actually mean?
Well this thing, and actually we wanna
put absolute value symbols around it.
By the way if this function were strictly
decreasing we could do the same thing
just by sticking absolute values in here.
If we forgot the absolute values,
we're gonna get a negative
PDF which makes no sense.
All right, so I just have to
tell you what this thing means.
Well this thing is called the Jacobian.
And I'm sure some of you have seen it but
I'm not necessarily assuming
that you've seen it before.
I mean it's standard
multivariable calculus thing.
And the Jacobian,
if you haven't seen it before,
all the Jacobian is, is the matrix
of all possible partial derivatives.
And as I said on the first day of class,
if you know how to do an ordinary derivative,
you know how to do a partial derivative,
you just hold everything else constant.
So dx dy equals just to write it out.
It's a matrix of all possible
partial derivatives.
So, what we do is we take, X is a vector.
We take the first coordinate of X,
we differentiate it with respect
to all the coordinates of Y.
So we go ∂x1/∂y1,
∂x1/∂y2, blah, blah, blah,
∂x1/∂yn. And then we would take x2 for
the second row,
do the same thing, and we keep going till
we've done all the partial derivatives:
∂xn/∂y1, blah, blah, blah, ∂xn/∂yn.
So it's just a matrix of all
possible partial derivatives.
Now it doesn't make much sense to stick
in a matrix here, so actually, these
absolute value symbols actually mean take
the absolute value of the determinant.
So we're taking this Jacobian matrix and
we take the determinant of it.
Take the absolute value.
That's the analog of the
formula we just wrote down:
we have this matrix, and somehow we need to
compress the matrix down to a number, and
it turns out the right way to
do that is using the determinant.
And like in the other case,
we could choose to do it this way, or
we could have done dy dx.
That says to do it
the other way around,
right, take all the partials
of the y's with respect to the x's.
Okay, we could have done that and
then take the reciprocal, and
it would be the same thing.
So sometimes one of these two methods
is much easier than the other, so
you wanna think first about which
direction to do the transformation in.
A lot of books just write
the Jacobian as J, and
I like the letter J a lot, but
I don't like that notation here because
it doesn't tell you which way are you
going, from x to y or y to x.
So all right.
This way then it's very obvious that just
says take derivatives of the x's with
respect to the y's, that has to be this.
And this one would be
the other way around.
You can do either way as long
as you're careful about whether
to do the reciprocal here or not.
All right, and so, that's the Jacobian.
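A minimal sketch of the multivariate formula, using a linear map I chose because its Jacobian is easy to compute by hand:

```python
# Sketch of f_Y(y) = f_X(x) * |det(dx/dy)| for the linear map
# (Y1, Y2) = (X1 + X2, X1 - X2), with X1, X2 iid standard normal.
# (Example of my choosing, not one from the lecture.)
import math

def f_X(x1, x2):
    # joint density of two iid standard normals
    return math.exp(-(x1 ** 2 + x2 ** 2) / 2) / (2 * math.pi)

def f_Y(y1, y2):
    # invert: x1 = (y1 + y2)/2, x2 = (y1 - y2)/2
    x1, x2 = (y1 + y2) / 2, (y1 - y2) / 2
    # dx/dy = [[1/2, 1/2], [1/2, -1/2]], determinant -1/2, absolute value 1/2
    return f_X(x1, x2) * 0.5

def n02(t):
    # N(0, 2) density, for an independent check
    return math.exp(-t ** 2 / 4) / math.sqrt(4 * math.pi)

# Y1 and Y2 are uncorrelated jointly normal, hence independent N(0, 2),
# so the Jacobian answer should factor into two N(0, 2) marginals.
y1, y2 = 0.7, -1.3
print(f_Y(y1, y2), n02(y1) * n02(y2))   # these should agree
```

Note the absolute value on the determinant: without it, a map like this one (whose determinant is negative) would hand back a negative "density".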
One other calculus thing
that we should discuss
briefly is convolution.
Convolution is something
we've done already,
just like we've already
done some transformations.
But I just wanted to mention
it as its own topic briefly.
Convolution is just the fancy word for
sums.
That is we want the distribution
of a sum of random variables.
Remember, for the binomial we
did a convolution of binomials
using a story proof, as long
as they have the same p.
And for Poissons and
Normals, we used the MGF and
all of those are pretty easy calculations.
But sometimes you can't find
a story that will help you and
the MGF may not exist or
you may not know how to work with the MGF.
Sometimes we need a more direct method.
So in the discrete case,
we've already done calculations like this.
So we want, so let's let T equal X plus Y.
And we want to know the distribution of T
assuming we know the distribution of X and
the distribution of Y that's
called the convolution.
So, in the discrete case, we can just
immediately write down a formula.
It maybe a messy formula it may or may
not be something we can actually do but
at least we have an expression.
So in the discrete case we can immediately
just ask, well, what's the probability
that T equals t?
Well, that's just the sum over,
how can I get a total equal to t?
Well, X has to be something, and
Y has to be whatever makes that add up to t.
You can think of this as
just conditioning on X.
But I'll write it just as using
the axioms of probability,
breaking this up into disjoint cases.
So I'm just gonna sum over all the ways
I can make the total, that is, over all x.
So I can immediately just write it down.
This is P of X equals little x.
We're assuming that X and
Y are independent here.
It's much nastier if they're dependent.
Probability Y equals t minus x.
We're summing over all x
such that this is positive.
So we don't need a separate proof for
this.
This just says to get the total equal to t,
X has to be something, and
Y has to be whatever makes the total t.
It has to be that way.
And because I assumed independence I
split it up into two probabilities.
So that's true for the discrete case.
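The discrete convolution formula can be sketched directly in code; the two fair dice here are an illustrative choice of mine, not an example from the lecture:

```python
# Discrete convolution: P(T = t) = sum_x P(X = x) * P(Y = t - x),
# for independent X and Y. Illustrated with two fair dice (my example).
from fractions import Fraction

pmf_die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(pmf_x, pmf_y):
    # sum over all ways x + y can make each total t
    pmf_t = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_t[x + y] = pmf_t.get(x + y, Fraction(0)) + px * py
    return pmf_t

pmf_sum = convolve(pmf_die, pmf_die)
print(pmf_sum[7])              # 1/6, the most likely total
print(sum(pmf_sum.values()))   # 1, so it's a valid PMF
```

The double loop is literally the formula: for each x, the term P(X = x) P(Y = t - x) gets added into the bucket for t = x + y.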
Now let's write down something
analogous in the continuous case.
So now we want the PDF instead.
And I'm gonna write down something
that looks completely analogous.
That is, instead of the PMF,
cuz it's continuous,
I'm gonna replace the PMF by the PDF.
Let's go from minus infinity to infinity.
This was the PMF of Y,
evaluated at t minus x;
I'll replace that by the PDF
evaluated at t minus x, dx.
This is true.
And the easiest way to remember this
result is by thinking by analogy,
with this.
However, that's not a proof.
That's just an analogy.
And on the new homework, you'll see an
example where if you try to reason kind of
an analogous way for a product, instead
of a sum, well, you'll see what happens.
This requires more justification.
There are several ways to justify this.
Probably the simplest way
would be to take the CDF.
Let's do the CDF take the derivative and
get the PDF.
So what's the CDF?
Well, for the CDF, let's use
the continuous law of total probability.
So we're integrating the probability
that X plus Y is less than or
equal to little t, given X = x,
times the PDF of X.
This is one way to do it there
are other ways to do this calculation.
But I like this one.
That's just the continuous version
of the law of total probability.
Now we plug in X equals x.
And once we've plugged in X equals x,
we can drop the condition because X and
Y are independent.
And so
then all we have is the integral of,
notice what's left here, just in your
mind replace big X by little x and
move it over to that
side of the inequality.
So it says Y less than equal t minus x,
that's just the CDF of Y
evaluated at t minus x.
Now take the derivative of
both sides of this equation.
Derivative with respect to t,
and then there's a theorem that says you
can swap the derivative and the integral.
The derivative of this CDF is the PDF,
so we get the PDF that way;
the swap is what requires some justification.
Usually I would like to avoid doing
convolution integrals like this but
sometimes you can't avoid it.
But if possible try to use a story or
an MGF or
one of the other things we've done but
sometimes you need that, yeah?
>> [INAUDIBLE]
>> This is capital F, this-
>> [INAUDIBLE]
>> Yeah, sorry,
this is F sub t(t), thank you.
That's the CDF of capital T, thanks.
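Here's a numerical sketch of the convolution integral, with X and Y iid Uniform(0, 1) as my choice of example; the exact answer is the triangular density on (0, 2):

```python
# Continuous convolution: f_T(t) = integral of f_X(x) * f_Y(t - x) dx,
# for X, Y iid Uniform(0, 1). (Example of my choosing, not from the lecture.)
def f_unif(u):
    # Uniform(0, 1) density
    return 1.0 if 0.0 <= u <= 1.0 else 0.0

def f_T(t, steps=10_000):
    # midpoint-rule approximation of the convolution integral;
    # integrating over [0, 1] suffices since that's the support of X
    h = 1.0 / steps
    return sum(f_unif((k + 0.5) * h) * f_unif(t - (k + 0.5) * h)
               for k in range(steps)) * h

# The exact density is triangular: t on (0, 1), and 2 - t on (1, 2).
for t in (0.3, 1.0, 1.7):
    print(round(f_T(t), 3))
```

The integrand is just the continuous version of "X is something, Y makes up the rest," summed with an integral instead of a sum.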
Okay, so
my favorite thing about statistics
is that you can do things that
are beautiful and useful.
And Jacobians are, well, the Jacobian
is an extremely useful technical tool,
but I've never heard anyone
describe Jacobians as beautiful.
So to kind of rebalance our
beauty quotient for today,
let's do something completely different
that involves no calculus at all.
Something that,
you can see whether you agree with me or
not, but this is something for which
beauty is hardly even an adequate word, okay?
So, here's an idea.
The idea is you can use probability
to prove existence, okay?
So we're gonna prove existence,
what does that mean?
We're gonna prove the existence of
objects with desired properties,
using probability.
So that's a very general idea.
So let me just tell you
mathematically what's the idea.
The idea is, I wanna show
that an object with
a certain property exists.
One way to show that would be this:
let's say the desired property is A.
That is, A is some property, okay?
I'm saying this very generally but
we'll do an example.
So we want to show there's
an object with a certain property,
that sounds like it has nothing whatsoever
to do with probability and statistics.
That's just like if you searched
everywhere in the universe,
either mathematically or whatever,
could you ever find this thing?
I didn't say anything about randomness or
uncertainty, okay?
Here's the strategy,
so this is a strategy.
We're gonna show that P(A) is
greater than 0, for a random object.
We get to choose how to define random,
that is we just have this universe
of objects and we decide on some method
for randomly selecting an object.
So if it is a finite set,
the most obvious thing to do is just
pick one at random where they're
all equally likely, right?
If I have a million objects there's
no probability anywhere but
I say well just pick one at
random equally likely, okay?
And then let A be the event that the
randomly chosen object has the property.
Well, is it clear that if the probability
is nonzero, then there must exist one?
Well, of course, if it didn't exist,
the probability would be 0, so
if the probability is positive,
it must exist.
So if we can show this,
we've shown that it exists.
And that sounds like, so this is true,
I mean I don't need to write a proof for
that.
But that sounds like a very,
very wishful thinking strategy,
that if we can't even exhibit
existence of even one such object,
how are we ever gonna
compute the probability?
We can't even find one, but
we're gonna compute its probability,
that's pretty weird.
Notice that we don't actually
have to compute P(A) exactly,
we only need a bound that shows
that it's greater than 0, okay?
We don't need to know P(A) exactly,
just that it's positive.
That's method 1,
let's extend this a little bit.
Suppose each object has
a number associated with it,
let's think of a score.
So we have this universe of objects,
no probability yet,
each object has a number attached to it,
so some kind of a score.
We wanna show there exists
an object with a good score;
I'll talk a little bit
about what good means.
But suppose it's really hard to actually
find one that has a good score.
Well here's the strategy.
You may guess this has something
to do with probability.
Pick a random object again,
look at its score.
So, in other words, what's
the average score of a random object?
Now, here's the theorem: there is an object
whose score is at least the average, right?
Let's just call it E(X), where this
is the score of a random object.
So, we're defining a random
variable by taking a random object,
find its score, take the average.
Well obviously,
there must be at least one object
whose score is at least the average, right?
They can't all be below average,
that would make no sense, right?
Now of course,
E of X may be pretty lousy.
But if E of X is actually pretty good then
we've shown that there exists a good
one without actually exhibiting it.
So again, it sounds like saying that
in a group of people, at least one person
has to have at least the average salary of
the people in the room, things like that.
That's an extremely crude statement.
Is that ever gonna be useful for anything?
I think this is a neat idea,
but is it actually useful?
Well, what I consider one
of the most beautiful and
useful results of the 20th
century was Shannon's theorem.
Claude Shannon is the father
of information theory.
Also the father of
modern communication theory.
So anytime you use a cell phone, that's
all based on communication and
coding theory that goes
back to Shannon's work.
So you can thank Shannon for this.
Let me just tell you,
this is not an information theory course.
It's a really amazing idea that you
can quantify information, though.
But let me tell you very briefly
what one of Shannon's theorems was.
Shannon's theorem: he showed that
if you're trying to communicate over
a noisy channel, so you're trying to send
messages from one place to another, but
bits get corrupted, there's a lot
of noise and interference, or whatever.
He showed that there's something
called a capacity of the channel, and
you can communicate at rates
arbitrarily close to the capacity,
with arbitrarily small chance of error
That is even if you have a very,
very noisy channel, you can make
the air probability very, very low.
That sounds like a very difficult theorem.
And he proved this in 1948.
No one else was even close to
thinking of that as far as I know.
The way he proved it, that there
exists what he called a good code,
right, a good code is gonna
be one that works well for
sending messages across
this noisy channel.
The way he showed that a good code
exists was to pick a random code.
And that's like kinda the most
daring thing you can imagine;
he probably spent months trying to
actually find one, couldn't find one, so
he picked a random one.
And to think that a random one is actually
gonna do well is kinda unbelievable, and
it turns out to be true.
It was only 30 or
40 years later that people actually
explicitly could write down a good code.
Until then Shannon showed that
they exist because a random
one has the right properties.
Even though you can't actually write down
a specific one, without a lot of work.
That's one of the most amazing results,
just mathematically extremely beautiful,
but it underlies all of modern
communication and information theory.
All right, so I'm not going to try to
prove Shannon's theorem in ten, or
five minutes, but
I am going to do one quick example.
Along these lines so
I just made up a simple example
just to illustrate this idea.
So the idea is, and here is the problem.
So suppose we have 100 people.
I just made up some numbers,
just so that we can actually do something
reasonably concrete and simple,
just to show you how this idea
would work in a small example.
Okay so there are 100 people and
those people form committees.
Now, one person can be on
more than one committee, so
let's assume that there are,
how many committees do I want?
I made up some numbers last night
I think I wanted 15 committees.
I just made up some numbers where it works
out nicely, but you could try something
more general, like m and n and
whatever, but I made up some numbers.
15 committees of 20,
that is each committee has 20 people.
So I chose these numbers such that
15 times 20 is 300, which means that
if everyone is on the same
number of committees,
then each person
is on three committees.
You can generalize this to cases where
different people can be on different
numbers of committees.
But well, for simplicity, let's assume
each person is on three committees.
No probability so far, right?
There are just
different ways to do it.
You can think of it as a counting problem:
how many ways are there to do it?
There's some vast number of possibilities.
Okay, now here's the problem.
The problem is to show
that there exist two
committees whose overlap is at least three.
So I can find two committees, or
there exist two committees, where a group
of three people is on both committees.
So: show there exist two committees
with overlap greater than or equal to 3.
All right, so clearly the way to solve
this is not gonna be to write down
every possible [INAUDIBLE] committees and
then search through on the computer
all the overlaps,
all the intersections, and
go through everything, right?
That would be a nightmare.
So we're going to use this idea and
we're gonna prove existence.
This is an existence problem.
We're gonna prove existence
just by computing the average.
So the idea is to find the average
intersection.
I said average; that involves probability,
and we didn't have any probability yet.
We introduce our own probability
structure by just saying
let's just choose two random committees.
So: find the average overlap
of two random committees.
All right, so
hopefully we can do that quickly.
So, I'll just write E.
You can make up some fancy notation and
stuff, but we're just picking two
committees, and we're assuming that
we have this fixed assignment of
who's on what.
We have specific people with names:
so-and-so is on this committee, and
so-and-so is on that committee, and so
on. That's not random.
Our randomness is because we're
choosing two random committees.
Okay, and we want the expected
overlap of those two committees.
So how do we do that?
Indicator random variables.
We create an indicator random variable for
each person.
There's 100 people, so I'm not gonna
write all the indicator random variables,
because this should be familiar by now.
We have 100 people, so we create an
indicator for each person, use linearity.
So it's gonna be 100 times, and
over here, by the fundamental bridge,
all we need to do is write down
the probability that,
let's say, person number one is on both
of those random committees, right?
So now we're looking, okay,
person number one what's the probability
that that person is on both of
those randomly chosen committees well,
you can think of that as a hypergeometric.
You don't have to, though; let's just
think about it directly.
I'm assuming I chose
two random committees.
So there are 15 choose 2 possibilities.
The naive definition applies,
because I'm assuming we chose
any two committees with equal probability.
So the denominator, sorry,
is the number of pairs of committees.
How many committees are there?
15 committees, so it's choose two out
of the 15 committees. And then the numerator:
person number one is on three committees,
so choose two out of those three.
So this is three choose two, which is 3,
and with the factor of 100 that's 300
over 15 choose 2, which is 15 times 14 divided by 2.
300 divided by 15 is 20,
the 2 comes up, so it's 40 over 14,
which we can simplify as 20 over 7.
If I did the arithmetic correctly,
it looks like we came a little bit short.
It's almost good enough,
cuz we wanted at least 3,
and we only have 20 over 7.
If only it were 21 over 7,
then we'd be so happy.
But here's the idea.
The average is 20/7,
and that implies that there
exists a pair of committees
with an overlap
of at least 20/7.
Now there's no way that two
committees can have an overlap equal
to 20/7, and if every overlap were only
2, the average couldn't be 20/7,
so that would not be good enough.
The overlap is an integer, so we get to
round up to the next integer, which means
some pair has overlap of at least 3.
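To see the averaging argument in action, here's a hypothetical concrete assignment, my own construction rather than anything from the lecture: split people 0 through 99 into five blocks of 20, and put block c on committees c, c+5, and c+10. Every person is on three committees, every committee has 20 people, and the average pairwise overlap comes out to exactly 20/7.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical assignment: block c = people 20c..20c+19 joins committees c, c+5, c+10.
committees = [set() for _ in range(15)]
for person in range(100):
    block = person // 20
    for c in (block, block + 5, block + 10):
        committees[c].add(person)

assert all(len(members) == 20 for members in committees)  # 15 committees of 20 each

# Overlap of every pair of committees, then the average over all C(15, 2) pairs.
overlaps = [len(a & b) for a, b in combinations(committees, 2)]
avg = Fraction(sum(overlaps), len(overlaps))
print(avg)            # 20/7
print(max(overlaps))  # some pair overlaps in at least 3 (here, far more)
```

The average is 20/7 no matter how the committees are assigned, since the expectation calculation only used the fact that each person is on three committees; it's the maximum overlap that varies with the assignment.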
That means we have proven it.
So we've proven existence.
We ran out of time, so have a good weekend.
