So today is covariance day.
Covariance is a long-awaited
moment that will
let us finally deal with
the variance of a sum.
For one thing, we said variance is
not linear, unlike expectation.
That doesn't mean we don't need ways
to deal with the variance of a sum,
it just means we need to think harder,
rather than falsely applying linearity.
So on the one hand covariance is what we
need to deal with variance of the sum,
on the other hand it's what we need when
we wanna study two random
variables together instead of one.
So it's like variance
except two of them and
that's why it's called covariance,
so let's define it.
Do some properties, do some examples.
So first, let's start with the definition.
It's analogous to how we defined variance,
except now we have an X and a Y.
Cuz we're looking at joint distributions.
So we have X, we have Y,
we want their covariance.
And we define it like this,
covariance of X and Y.
X and Y are any two random
variables on the same space.
Covariance of X and Y equals the expected
value of X minus its mean,
times Y minus its mean.
That's just the definitions, so you can't
really argue with it too much but let's
stare at it intuitively for a bit and just
see where might this thing have come from?
Why define it this way
instead of any other way?
Well, first of all, it's a product,
something times something.
So we've brought the X stuff and
the Y stuff together into one thing,
cuz we're trying to see
how they vary together.
And just, obviously,
we all know that a positive number
times a positive number is positive,
negative times negative is positive,
positive times negative is negative.
Now, this factor is X relative to its mean,
and this one is Y relative to its mean.
So now imagine drawing a random sample:
suppose we had a lot of i.i.d. pairs (Xi, Yi).
The pairs are i.i.d., but within each pair,
Xi and Yi have some joint distribution.
They may not be independent.
By the way, we did show before
that if they're independent,
then you can write this is just
E of this times E of this.
So this is you know, we're interested in
what happens if they are not independent.
Well, if in that random sample we drew,
most of the time when X is above its mean,
Y is also above its mean,
then you're getting
positive times positive.
And if X being below its mean
tends to imply that Y is below its mean,
you get negative times negative,
which is positive.
So if X being above its mean
tends to imply that Y is above its mean,
and likewise for being below,
then we would say that they're
positively correlated.
And vice versa: they'd be negatively
correlated if X being above its mean
doesn't imply that Y is below its mean,
but gives Y more of a tendency
to be below its mean.
Then we would say that they're
negatively correlated.
So this is just a measure of that.
We'll actually define
correlation in a little while.
But correlation is a very familiar term to
everyone cuz people talk about correlation
all the time.
But mathematically, what is correlation?
It's defined in terms of covariance.
So we'll get to that soon.
That's just the definition.
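As a quick numerical sketch of that sampling intuition (assuming NumPy is available; the way Y is built from X here is just a made-up example where Y tends to be above its mean when X is):

```python
import numpy as np

# Draw many i.i.d. pairs (X, Y) and average (X - Xbar)(Y - Ybar),
# the empirical version of E[(X - EX)(Y - EY)].
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)  # Y moves with X, so the products are mostly positive

cov_est = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_est)  # close to the true covariance, which is 1 for this construction
```

Flipping the construction to `y = -x + noise` makes the products mostly negative, matching the negative-correlation picture.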
But just like, you know how for variance,
we had two different ways to write it.
Notice the way we defined variance
was the expected value of X minus
its mean, squared.
So, if we let X equal Y,
this is just the variance.
So we've just proved the theorem already,
so I'll just call this properties.
The first property to keep in mind
is that covariance of X with
itself is the variance.
Proof is just let X equal Y,
well that's the definition of variance.
But that's a very useful
fact to keep in mind.
And secondly, it's symmetric.
Covariance X, Y equals covariance Y, X.
And that's, again, something you can just
see immediately just swap the X and Y,
but it's the same thing, so
it's immediately true that it's symmetric.
That's also a useful fact.
I don't even want to
group it into this list.
Right here, what's the alternative
way to write covariance?
This is completely analogous to how,
where we defined variance as this thing,
this part squared without that part.
But then we quickly showed that we could
also write it as E of X squared minus,
parenthesized the other way,
the square of E of X.
The analog of that formula,
which is a generalization of it,
is E of XY minus E of X times E of Y.
So in general these two
things are not equal.
We proved that they are equal if X and
Y are independent, but
in general they're not equal.
Notice that if we let X equal Y,
like in property one here,
that's just E of X squared
minus the square of E of X.
So that is just a version of that formula.
And the proof of this is just to
multiply this out and use linearity.
We'll just quickly do that
over here just for practice.
And we'll just have four
terms use linearity,
so it should be very straightforward.
We're doing this times this,
this times this, and so on.
So we have E of X.
I'm just gonna use linearity.
The first term, X times Y, E of XY.
And then minus, and then we do
this times this, but notice we're
doing E of X times this, this thing is
a constant, you can take out the constant.
So, that term would just be E of X, E of
Y, and then we have another cross term.
This one times this one.
E of X is just a constant, that comes out.
So, that's minus another one that
looks the same, E of X E of Y, and
then the last term is this times this.
Again, that's just a constant,
E of a constant is a constant, so
it's plus that thing again, and so
that's all it is, minus 2 of them plus
1 of them, so it's the same thing.
All right, so that's just an easy
application of linearity of expectation.
So most of the time this way is
a little bit easier than this for
computing covariance.
But, like with variance,
this one has a little bit more intuitive
appeal because it's just saying
X relative to its mean Y relative to
its mean, but it's the same thing.
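To see that the two formulas really agree, here's a small check on a made-up finite joint distribution (hypothetical numbers, assuming NumPy):

```python
import numpy as np

# A hypothetical joint pmf on a 2x3 grid: p[i, j] = P(X = xs[i], Y = ys[j]).
xs = np.array([0.0, 1.0])
ys = np.array([-1.0, 0.0, 2.0])
p = np.array([[0.1, 0.2, 0.1],
              [0.3, 0.2, 0.1]])

EX = (p.sum(axis=1) * xs).sum()      # E(X) from the marginal of X
EY = (p.sum(axis=0) * ys).sum()      # E(Y) from the marginal of Y
EXY = (p * np.outer(xs, ys)).sum()   # E(XY) by 2-dimensional LOTUS

cov_def = (p * np.outer(xs - EX, ys - EY)).sum()  # E[(X - EX)(Y - EY)]
cov_alt = EXY - EX * EY                           # E(XY) - E(X)E(Y)
print(cov_def, cov_alt)  # identical up to floating-point roundoff
```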
So well we already have two properties,
well,
let's get some more
properties of covariance.
What if we have a covariance
of X with a constant?
So I'm letting Y equal a constant C.
So here Y is C.
The expected value of the constant C is C,
so C minus its mean is 0,
and the covariance is immediately just 0,
from the definition, if C is a constant.
Similarly by symmetry we could
have covariance of C with X.
I just happened to write it on
this side but it's symmetric.
So if C is a constant.
Okay, now what if we multiplied by
a constant instead of just
having a constant there?
So if we have,
let's say the covariance of CX with Y.
And let's just use this one.
To compute this, all we have to
do is replace X by C times X.
C comes out, C comes out.
So C just comes out of the whole thing.
So constants come out.
Okay, so we just prove that just
by plugging in cX in for X,
and then it's just immediate.
Okay, again, c is any constant here.
Similarly they could have constant here,
a constant here,
and just take them both out.
Very, very straightforward.
All right, and now we want something
that looks kind of like linearity.
What happens if we have
the covariance of x with y plus z?
So if we take the covariance
of x with y plus z,
then what that says to do is to
replace y by y plus z here, okay?
And just as a quick
little scratch work for
seeing what's going on,
I'm taking xy, replace y by y plus z.
Well, of course, that's just xy plus xz,
and now we expect a value of that.
So we use linearity, so
it's E of this plus E of that.
Similarly, we replace this Y by Y+Z,
so again use linearity, E(Y) + E(Z).
And so those terms you get are simply
the sum of the two covariances.
So Cov(X,Y) + Cov(X,Z).
Just write down the four terms you get and
you've just added the two covariances.
So again, all of these things
are basically immediate.
I'm not writing out long proofs for these
because all of these things are immediate
from plugging into the definition.
Either this definition or
this equivalent, plug into either one and
use linearity of expectation and
all of these follow immediately.
So these two together are especially
useful, and they're called,
it's not linearity, but
it's called bilinearity.
Bilinearity is just
a fancy term that means,
If you imagine treating one
coordinate as just kind of fixed and
you're working with the other coordinate,
it looks like linearity, right?
So like here, notice the Y just stayed
as Y, and what happened to the cX?
Well, I took out the constant
just like linearity.
And what happened here?
X just stayed x throughout.
But if you just look at the y + z part,
we split it out into the y and a z.
So it looks like linearity if you're
going one coordinate at a time.
I just happened to write it this way, but
obviously, I could have done x + y, z and
it would be analogous.
I could have put the constant over there,
or
a constant here, a constant there, okay.
So those are really useful
properties that kind of
if you use these properties, you can avoid
a lot of ugly calculations; that is, you can
just apply this rather than always
having to go back to the definition.
Just like linearity is incredibly useful,
bilinearity is incredibly useful
when working with covariances.
And kind of an easy way to remember
this is that it looks like
the distributive property:
x times y plus z is
xy plus xz.
It's not literally multiplication,
it's covariance, but
I'm taking the covariance of this
with this, and this with this.
Right, so if I wanted to extend that to
what happens if we have more of them.
Let's say we had the covariance of X plus Y with Z plus W.
I mean this doesn't really need to be
listed separately but for practice,
let's just do it.
Just apply that property five repeatedly.
And we're gonna get the covariance of
this and this, this and this, this and
this, that.
It's just like multiplying
two polynomials,
or however you usually do that thing.
So we can immediately just
write this down as four terms.
Cov(X,Z) + Cov(X,W) + Cov(Y,Z) + Cov(Y,W).
And that follows immediately just by
using that property 5 repeatedly.
And more generally than that,
let's just write what happens if we have
a covariance of one sum with another sum.
I don't wanna write out nine terms,
let's just write the general
thing once and for all.
So we have a covariance
of one sum of terms.
Let's say we have a sum over i of
aiXi, where the ai are constants,
so this is linear combination
of random variables.
And then, let's say i goes from 1 to m.
And then we have another one,
let's say j=1 to n of bjYj.
So, we want the covariance, so
it looks like this complicated thing.
Okay, but as soon as you think about
what's the structure of the problem,
it's just the covariance of
one sum with another sum.
So if you apply that property
five over and over and
over again,
we don't literally have to do that.
But conceptually we're just using that
property over and over and over again, and
just think about what you're gonna get.
And also use property four
to take out the constants.
Well, it just means you're
gonna get a sum over all ij
of the covariance of individual terms,
right?
Cuz it's just saying,
you know, take one term here and
co-vary it with one term here.
For all possible pairs.
So the sum over all ij of
aibj covariance Xi Yj.
So that's just a very, I know this
looks complicated but it's no different
from property five that just means we
used it a lot of times instead of once.
So a lot of times it'd be easier to use
this kind of thing rather than going back
to the definition and multiplying
everything out in terms of expectation.
It's often easier to be able to
work directly with covariance.
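The double-sum formula has a compact matrix form: if C[i, j] = Cov(Xi, Yj), then Cov(sum_i ai Xi, sum_j bj Yj) = a^T C b. Here's a sketch with sample covariances, where the identity again holds exactly (assuming NumPy; the data and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500))           # m = 3 variables, 500 observations each
Y = rng.normal(size=(4, 500)) + X[0]    # n = 4 variables, correlated with X[0]
a = np.array([1.0, -2.0, 0.5])          # constants a_i
b = np.array([3.0, 0.0, 1.0, -1.0])     # constants b_j

def cov(u, v):
    return np.mean((u - u.mean()) * (v - v.mean()))

C = np.array([[cov(X[i], Y[j]) for j in range(4)] for i in range(3)])
lhs = cov(a @ X, b @ Y)   # covariance of the two linear combinations
rhs = a @ C @ b           # sum over all i, j of a_i b_j Cov(X_i, Y_j)
print(lhs, rhs)           # equal up to floating-point roundoff
```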
All right, so that shows us how,
property one says how covariance
is related to variance, but
it doesn't show us how it would be
useful in actually computing a variance,
the variance of a sum that is.
Okay, so one of the main reasons
we want covariance is so
that we can deal with sums.
So let's just work out
the variance of a sum.
Let's say we have the variance of x1 + x2
to start with, but then we can generalize
that to a sum of any number of terms
just by using this one repeatedly, okay?
Well, we already know how to do this,
because by property 1,
that's the covariance of (x1+x2),
with itself.
But by property five, or
six, whichever,
what's the covariance
of x1 + x2 with itself?
Well, we just have those four terms.
We have the covariance of x1 with
itself but that's just the variance.
And we have covariance of x2 with itself,
that's just the variance of x2.
And then we have two cross terms,
we have the covariance of x1 and x2.
And we have the covariance of x2 and x1.
But by the symmetry property,
those are the same thing.
So it's simpler to just write it as
2 times the covariance of x1 and x2.
In particular this says that
if the covariance is 0,
then the variance of the sum
is the sum of the variances.
And that's an if and only if statement.
So one case where that's true
is if they're independent,
we showed that before that if they're
independent then the covariance is 0.
So if they're independent, this is gone.
And we'll also see examples where
they're not independent, but
this term is still zero and
so, so then it's true.
Okay, but in general, you can't say
the variance of the sum is the sum
of the variances,
because you have these covariance terms.
Yeah, question?
>> [INAUDIBLE]
>> That's if and only if the covariance
is 0, that the variance of the sum
will be the sum of the variance.
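A quick check of Var(X1 + X2) = Var(X1) + Var(X2) + 2 Cov(X1, X2) on sample moments, where the expansion is exact (assuming NumPy; X2 is deliberately built to depend on X1):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=2000)
x2 = 0.5 * x1 + rng.normal(size=2000)   # not independent, so Cov != 0

var = lambda a: np.mean((a - a.mean()) ** 2)
cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))

lhs = var(x1 + x2)
rhs = var(x1) + var(x2) + 2 * cov(x1, x2)
print(lhs, rhs)  # equal up to roundoff; note rhs exceeds var(x1) + var(x2)
```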
So let's write what would happen
if there's more than two of them,
variance x1 + blah, blah, blah + xn.
Just applying, so that's the covariance
of this sum with itself, so
we can just apply this result.
So again, it's gonna be the sum
of all the variances, and
then we're gonna have
all these covariances.
So add up all the variances, and
then add up all the covariances.
And so, you're gonna have a covariance
of x1 and x2, x2 and x1, x1 and x3,
x3 and x1, all those things.
I think it's easiest if
we write it as 2 times
the sum over i less than j covariance x i,
x j,
it's easy to forget the 2 here.
I could have also written it as a sum over
i not equal to j, in which case I
would not have put the 2.
It's simply the question of
are you going to list cov(x1,x2)
separately from cov(x2,
x1) or group them together.
Seems a little simpler to
group them together, but
then we need to remember to put the 2.
Since I specified less than j,
then I have cov(x1, x2) listed here, but
not cov(x2, x1), cuz I included that here.
All right, so that's the general
way to get the variance of a sum,
and we'll do some examples
of that in a few minutes.
First I just wanna make sure
that the connection with
independence is clear and
we also need to define correlation.
So the theorem says that if X and
Y are independent,
then they are uncorrelated.
The definition of uncorrelated is
just that the covariance is 0.
That's just definition.
I.e, cov(x,y) = 0.
And we actually proved this last time when
we just didn't have the terminology yet.
At least we proved it in the continuous case,
but the discrete case is analogous.
So we proved this using
the 2-dimensional LOTUS:
we showed that E of XY
equals E of X times E of Y
in the independent case.
And that's a common mistake is
to show the covariance is 0, and
then just leap to the conclusion
that they're independent.
If the covariance is 0, and that's all we
know, they may or may not be independent.
So just to give a simple counter example
showing why this doesn't imply this.
Let's just consider an example
with normal random variables.
So let's let Z be standard normal,
and we'll let X = Z.
Slightly redundant notation, but
I'm just in the habit of using Z for
standard normals. And let Y = Z squared.
So we're looking at a normal and
it's square, okay?
So now let's compute the covariance for
this example.
Cov(X, Y) = E(XY) - E(X)E(Y).
In terms of Z,
that's E(Z cubed) -E(Z)E(Z squared),
but both terms are just 0,
because we saw before
that the odd moments of
a standard normal are 0.
That's an odd moment and
that's an odd moment so it's just 0- 0.
So they're uncorrelated, but
they're clearly not independent.
In fact, they are very non-independent,
I should say, very dependent.
Avoid too many double negatives.
So they're very dependent, in fact, y is
a function of x, so
that they're extremely dependent.
If you know x, you know y,
complete information.
So y is actually a function of x.
Dependent just means there's
some information, right?
It doesn't have to be
complete information.
In this case, if we know x,
we have complete information about y,
y is a function of x.
And if we go the other way around,
if we know y, well, we don't know x, but
we do know its magnitude, all right?
If we know z squared,
then we can take the square root and
we'll get the absolute value, so
we know it up to a plus or minus.
So that also shows it's dependent going
the other, which we didn't need to do but
it's just nice to think.
If we know this, then what do we know?
Well, we know it up to a sign.
So I would just say y also determines x,
at least it determines
it up to a sign, so
it determines the magnitude of x.
So, okay, so that's just an example
that shows the converse is false.
But it's kind of a handy counter example
to keep in mind for a lot of things.
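The counterexample is easy to see by simulation (assuming NumPy): the sample covariance of Z and Z squared hovers near 0, while the dependence shows up immediately in a nonlinear summary like the mean of Y on the event |X| > 2.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=200_000)
x, y = z, z ** 2                 # X = Z, Y = Z^2

# Cov(X, Y) = E(Z^3) - E(Z)E(Z^2) = 0, since odd moments of Z vanish.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy)                    # near 0: uncorrelated

# But far from independent: knowing X is extreme makes Y large.
print(y.mean(), y[np.abs(x) > 2].mean())   # about 1 versus much larger
```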
So kind of intuitively,
what's going wrong here?
I mean, there's nothing wrong with this.
But the reason the definition
doesn't capture this dependence
is that correlation is
kind of a measure of linear association.
And those of you who
have taken Stat 100 or
104 have seen a lot of things like that,
where you actually have a data set,
this cloud of points,
and you look at whether it generally
slopes upwards or downwards,
that kind of thing.
It's measuring linear
trends in some sense.
There's a theorem that we're not gonna
prove that says, if every function of X is
uncorrelated with every function of Y,
then they're independent.
But just having the linear things be
uncorrelated is not enough
as this example shows.
Okay, here they have this
quadratic relationship, but
there is no linear relationship;
that's the kind of intuition on that.
All right, so
let's also define correlation and
then I will do some examples of how to
use this to compute the variance of
things whose variance we did
not already know.
Okay, so once we have covariance, which
we do, correlation is easy to define.
And I'll tell you some of the intuition
as well as what's the math.
So here's the definition of correlation.
You can think of it as just
a standardized version of covariance.
So correlation, which you either write
as Cor, or usually I write it as Corr.
Just because R's tend to look like V's
sometimes if you're writing too fast.
Corr(X, Y), usually it's defined this way,
as the covariance, and then we divide by
the product of the standard deviations.
Remember, standard deviation's
just the square root of variance.
So take the covariance,
divide by the square root of
the product of the variances.
But that's the usual definition.
I actually would prefer to define it
a different way, and I'll show you why,
that these are equivalent.
I would prefer to define
it as the covariance of X,
remember standardization?
If we have any normal, we subtract the
mean, divide by the standard deviation,
that gives us standard normal.
So that's called standardization.
Now here,
I'm not assuming anything is normal, but
that the same standardization makes sense.
That we take X, we subtract its mean,
we divide by its
standard deviation, and
then we do the same thing with Y.
So we've standardized both X and
Y, And we take their covariance.
So correlation means standardize them
first, then take the covariance.
The reason that this is a useful thing to
do is that covariance kinda
has an annoying property,
as far as interpretation in terms
of units and things like that.
If you imagine X and Y are distances,
right, they're random variables but
they're representing a distance quantity,
okay?
And if you measured X and Y in nanometers,
and then someone else
working on the same problem
measures them in light years
instead of nanometers,
you're gonna get extremely
different answers.
So if I just tell you,
the covariance between my X and
Y is 42, what does that tell you?
You have to think really hard about
what are the units, what's going on,
is 42 a big number or
a small number, right?
I mean, it's the answer to life,
the universe and everything, but
is it a big number or a small number?
I don't know, because of the units issue.
This is a dimensionless quantity,
dimensionless just
basically means unitless.
So if X is measured in nanometers,
and you're subtracting off nanometers,
that's still nanometers.
Remember, that's why we define
standard deviation, also.
Standard deviation has
a square root in it, so
mathematically, it's pretty annoying
to deal with these square roots.
Mathematically, it's nicer
to work with variance,
but intuitively, the variance
would be in nanometers squared.
Now we're back to nanometers,
divide nanometers by nanometers,
we'll get a dimensionless quantity.
So that's a major advantage of this.
And I guess I should tell you briefly,
why is this thing the same as this?
Well, you should kind of just
think about those properties,
I'll just say this kind of quickly.
First of all, subtracting the mean,
that's just adding a constant,
that's not gonna affect
the covariance at all.
So I could have left this out, but it's
just useful to think of standardizing.
Cuz this standardization, what it does is
takes X, which could have any mean and
any variance, and
makes it have mean 0 and variance 1.
That's why it's called standardization.
The part that's affecting what's
going on is the standard deviation.
But from one of those
properties that we wrote,
we can just pull out the standard
deviations, and we get exactly that.
So they're exactly the same thing,
I just think this one's a little
more intuitive to think about.
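Both definitions, and the unit-invariance they buy, are easy to check numerically (assuming NumPy; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=5000)
y = 2 * x + rng.normal(size=5000)

cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))
sd = lambda a: np.sqrt(cov(a, a))

corr1 = cov(x, y) / (sd(x) * sd(y))                          # usual definition
corr2 = cov((x - x.mean()) / sd(x), (y - y.mean()) / sd(y))  # standardize first
corr3 = cov(1e9 * x, y) / (sd(1e9 * x) * sd(y))              # change of units
print(corr1, corr2, corr3)  # all the same, and dimensionless
```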
Okay, so
one quick theorem about correlation.
Correlation can never equal 42.
More generally,
correlation is always between -1 and 1.
So not only is it something more
interpretable in the sense that it doesn't
depend on what system of units you used.
It's also more interpretable in that,
if I say a correlation is 0.9?
That's a pretty high correlation,
cuz I know the largest it can be is 1,
okay, so that's very useful.
And kind of an interesting fact about this
inequality is that it's essentially just
Cauchy-Schwarz.
For those of you who have seen
the Cauchy-Schwarz inequality in linear
algebra or elsewhere.
The Cauchy-Schwarz is one of the most
important inequalities in all of
mathematics.
And if you put this, if you rewrite this
statement in a linear algebra setting,
you can show that it's
essentially Cauchy-Schwarz.
If you haven't seen Cauchy-Schwarz yet,
we'll come back
to it later in the semester, and you
don't need to worry about it right now.
But for those of you who have,
I wanted to make the connection right now.
So let's prove this fact.
So one proof would just be to put it
into the Cauchy-Schwarz framework, and
apply Cauchy-Schwarz.
But that doesn't really show
what's going on, first of all.
And secondly, that assumes you're
familiar with Cauchy-Schwarz.
So let's just prove it directly.
So first of all, in math classes
you'll often see the acronym WLOG,
Without Loss of Generality.
We're going to assume X and
Y are already standardized.
We're trying to prove this inequality.
We may as well just assume from the start
that they've been standardized,
standardized meaning that
they have mean 0, variance 1.
Because if they weren't standardized,
well,
I could just make up some new notation, x
tilde, y tilde for the standardized ones.
But this says that the correlation
will be the same anyway, so
we may as well assume that
they're already standardized.
All right, so
now let's just compute the variance.
This is actually good practice
with property seven there.
Let's compute Var(X + Y).
Well, that's Var(X) + Var(Y) + 2 Cov(X,
Y).
And for some reason,
statisticians often like to call
the correlation rho, so
I'll follow that trend.
Corr(X, Y), that's just notation,
let's just name that rho.
All right, so that's the variance, but
I assumed they were standardized, so
this is 1 + 1.
And if they're standardized already,
then the covariance is the correlation,
because they're standardized.
So that's just 1 + 1 + 2 rho, so
that's really just 2 + 2 rho, right?
On the other hand, we could look
at the variance of the difference.
Again, that's good practice with
variances of sums and differences.
A common mistake is to say,
that's Var(X)- Var(Y).
Which we talked about that fact before
when we were talking about sums and
differences of normals,
variances can't be negative.
So think of this not as X minus Y,
think of this as X plus negative Y,
so the variances still add:
Var(X) + Var(Y).
And just check against the definition:
it's the covariance of X - Y with itself,
so we get a minus sign on the covariance
part, but not on the variance terms.
So that's just 2 - 2 rho.
Okay, well we're running
out of space on this board,
and that's actually the end of the proof
because variance is non-negative.
So these two inequalities say
that rho is between -1 and 1.
All right, so that shows a correlation
is always between -1 and 1.
And so in general, it is easier to work
with covariances than correlations, but
correlations are more intuitive,
standardized so that everything
is between -1 and 1.
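Here's a sketch of the proof in code (assuming NumPy): standardize, and observe that the two variances 2 + 2 rho and 2 - 2 rho are both nonnegative, pinning rho to [-1, 1] even when the data are nearly perfectly negatively related.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=10_000)
y = -x + 0.1 * rng.normal(size=10_000)   # nearly perfect negative relationship

cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))
sd = lambda a: np.sqrt(cov(a, a))
xs = (x - x.mean()) / sd(x)              # WLOG: standardized versions
ys = (y - y.mean()) / sd(y)

rho = cov(xs, ys)
print(rho)                                  # close to -1, never below it
print(cov(xs + ys, xs + ys), 2 + 2 * rho)   # Var(X + Y) = 2 + 2 rho >= 0
print(cov(xs - ys, xs - ys), 2 - 2 * rho)   # Var(X - Y) = 2 - 2 rho >= 0
```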
Okay, so for the rest of the time,
I wanted to do some examples
with this thing, and
also with computing covariances for
certain problems we
might be interested in.
So let's talk about the multinomial cuz
we were talking about that last time.
And now we actually have the tools
to deal with the covariances,
within a multinomial, okay?
So, this is just an example.
But it's an important example,
cuz multinomials come up a lot.
So we wanna compute covariances,
if we have a multinomial, okay?
So covariances in a multinomial.
That is,
this multinomial is this vector, right?
It's about how many people
are in category one,
how many people are in category two,
and so on.
So you can take any two of those counts
of how many people are in category one,
how many people are in category five, and
and compute the covariance
of those things, right?
That's a very natural thing to look at.
And I actually know four or
five ways to derive this,
and I really like this example, so I will
probably come to this later with some of
the other methods, but for
now let's just do one method, okay?
So, we have this multinomial,
so we have this vector,
using the notation from last time,
we have k different categories.
And Xj is the number of people or
objects in the jth category.
And this is Multinomial(n, p):
there are n objects or people,
and the probabilities are given
by some vector p,
which just gives the probability for
each category, okay?
And we wanna find
the covariance of Xi with Xj,
for all i and j, right?
So first of all let's
consider the case i = j.
Then we just have a covariance
of Xi with itself.
And we know that's just a variance of Xi.
And last time we talked about
the fact that if we define
success to be being in category i,
we just have a binomial, so
for this, we just use the variance
of the binomial, n pi(1 - pi).
So that's easy, the more interesting part
is what happens if i is not equal to j.
Okay, now if we,
I think it's easier to just think
concretely in terms of the first two.
So let's just find covariance of X1 and
X2.
If we know how to do this,
we could always just relabel things and
get X5 and X12 or whatever we want.
But it's just easier to think
concretely in terms of X1 and
X2, rather than having so
much notation going around.
Okay, so I'll find this one first.
And there's a lot of ways to do this.
Let's just think about
it intuitively first.
Intuitively, you think this is positive,
negative or zero?
>> [INAUDIBLE]
>> Negative, why?
>> [INAUDIBLE]
>> Exactly, so
if you somehow computed this
and you got a positive number,
you shouldn't just be happy you're
done with the problem and move on.
You should stop and think,
does a positive number make sense here?
As you just said, if you knew that there
were more people in the first category,
like there's tons of people
in the first category,
there's fewer people left over who
could be in the second category.
So it's like these categories are kind
of competing for membership, right?
You have a fixed number of people,
unlike the chicken-and-egg problem,
where we had a Poisson
number of eggs, okay.
So a fixed number of people competing for
different categories:
more in one, then you'd expect
less in the other, right?
So they should be negatively correlated,
all right?
So now how do we do this,
well there's a bunch of ways as I said,
but one way that I especially like is
to relate this back to stuff we
did last time: we talked about
the lumping property of the multinomial,
so let's try to relate it to this.
Normally, you'd think of this as
a way to find the variance of a sum.
But if we know this, this and this,
then obviously we know this also.
So let's actually do it this way and
I'll probably do some of the other
methods later, not today.
So let's take the variance of the sum,
X1 + X2, and let's call the covariance
we want C, just to have some notation;
we're trying to solve for C, okay.
So the variance of the sum equals
the sum of the variances plus the covariance term.
Now the variance of X1 is nP1 (1- P1) and
the variance of X2 is nP2 (1- P2) and
then plus,
twice the covariance but
I just named the covariance C,
just to have a simple name for
it, so it's + 2C.
So the only thing,
we wanna solve for this.
The only thing left that
we haven't gotten is this.
But then variance of X1 + X2 follows
immediately from what I was talking about
last time with the lumping property.
This just says merge the first two
categories together into one bigger
category, okay?
If we do that, It's still binomial, right?
Now we're defining success to mean being
a member of category one or category two.
Still binomial, so we can
immediately write down the variance:
now the probability of success is
P1 + P2, so the variance is
n(P1 + P2)(1 - (P1 + P2)).
So now we know everything
in this equation except C.
To solve for C,
multiply things out, factor it,
however you want just do the algebra,
easy algebra at this point.
So I'm not gonna write that out;
just multiply this out, simplify it, and
what you'll get is that the covariance
of X1 and X2 = -nP1P2.
That was just for
X1 and X2, for concreteness.
The general result is that
the covariance of Xi and Xj = -nPiPj,
for i not equal to j.
Notice it is a negative number.
Okay, so that's the covariance in a multinomial.
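The formula is easy to spot-check by simulation (assuming NumPy; the n and p here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 10, np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, p, size=200_000)   # each row is a vector (X1, X2, X3)

x1, x2 = draws[:, 0], draws[:, 1]
cov_est = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
print(cov_est, -n * p[0] * p[1])   # simulation vs. the exact value, -0.6 here
```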
And let's do
a few variance examples now.
For example, the variance of the binomial:
we did derive the variance of
the binomial before using indicator
random variables, directly,
because we didn't have these
tools available yet, okay?
So let's redo the variance
of the binomial and
then do one more example after that.
So, the variance of
the Binomial(n, p) is npq.
And let's just derive that really quickly,
So let X be Binomial(n, p), and
we write it as we've done many times:
X = X1 + blah, blah, blah + Xn,
where the Xi are i.i.d. Bernoulli(p).
Now each Xi, let's do a quick little
indicator random variable review.
We can think of these Xj's,
they're Bernoulli's, but
they're also indicator random variables.
It's the indicator of
success on the jth trial,
so let's just state this in general.
Let IA be the indicator random variable for
event A, just in general:
A is any event, IA is its indicator
random variable, the indicator rv of event A.
Okay, so just a couple quick simple
facts about indicator random variables.
What's IA squared?
It's just IA, cuz you're squaring 0 or 1.
Similarly IA cubed = IA, and
you can generalize this to
other powers if you want;
it's just 0 or 1.
There's a very, very,
very simple fact, but
I've seen it get overlooked many times,
so I'm emphasizing it now.
Pick any positive power, nothing happens
because it's 0 or 1, very easy, okay?
Now let's look at something else:
IA times IB, where A and
B are both events.
How would you write that as
one indicator random variable?
Intersection, extremely useful simple
fact, but often gets overlooked.
Product of these indicators is 0 or
1 times 0 or 1, that's gonna be 0 or 1.
It's gonna be 1,
if and only if both of those are 1,
that's the definition of intersection.
So that's immediately true,
very useful fact.
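Both indicator facts are easy to check numerically; the two events below (for a fair die) are just made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

# Indicators for two hypothetical events: A = {even}, B = {at least 3}.
I_A = (rolls % 2 == 0).astype(int)
I_B = (rolls >= 3).astype(int)

# Any positive power of an indicator is the indicator itself (values are 0/1).
assert np.array_equal(I_A**2, I_A) and np.array_equal(I_A**3, I_A)

# The product of two indicators is the indicator of the intersection.
I_AB = ((rolls % 2 == 0) & (rolls >= 3)).astype(int)
assert np.array_equal(I_A * I_B, I_AB)
```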
Okay, now coming back to this binomial,
if we want the variance of Xj,
that's just E(Xj squared) - (E(Xj)) squared.
But Xj squared is Xj,
so the first term is just E(Xj),
and we know E(Xj) is p, for Bernoulli(p).
The second term is p squared, so
that's just p(1 - p), okay?
So it's extremely easy to get
the variance of a Bernoulli.
If we define this, let's define 1 - p as q,
then we're just saying p times q, okay?
So for Bernoulli(p),
you get p times q, very easy.
So now we want the variance of
the binomial, Well it's just npq, done.
Because you're adding up, and
they're independent for the binomial.
We have independent Bernoulli trials,
so just to write out a little bit more.
Covariance of Xi, Xj = 0 for i not
equal j because they are independent.
They are not only uncorrelated
they are independent, so
we don't have any covariance term.
So we just add up the variances and
n times this, npq, all right?
So now you can do the variance
of a binomial in your head.
You don't need to memorize this,
it's just n times the variance of
one of these Bernoulli's, okay?
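To see the npq result concretely, here's a small simulation along the same lines, building the binomial as a sum of i.i.d. Bernoulli indicators (n and p below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 0.3  # arbitrary Binomial(n, p) parameters
q = 1 - p

# Each row is n i.i.d. Bernoulli(p) trials; the row sums are Binomial(n, p).
bern = (rng.random((500_000, n)) < p).astype(int)
X = bern.sum(axis=1)

print(X.var(), n * p * q)  # sample variance comes out close to npq = 4.2
```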
So that's easy, Well let's talk
about a more complicated one though.
Hypergeometric. So
let X be hypergeometric
with parameters w,
b, n, which we interpret as saying
we have a jar that has w white balls,
b black balls.
We take a sample of size n, and
we want the distribution of the number
of white balls in the sample.
Well again, we can decompose it in
terms of indicator random variables,
X = X1 + blah, blah, blah plus Xn.
We can interpret this as drawing balls
from that jar one at a time
without replacement.
We'd get a binomial if we
did with replacement, but
the hypergeometric would
be without replacement.
Take the balls one at a time, and
we just say one if the jth ball is white,
0 otherwise.
The problem, the reason it's more
difficult than the binomial, is that these
are dependent indicator random variables,
because it's without replacement.
So if we write this thing out,
variance of x = so
we're gonna write out all
these variance terms and
all of these covariance terms.
Sounds like it's gonna be a nightmare,
okay?
But there are some symmetries
that we can take advantage of.
First of all,
we have the sum of all the variances,
we're gonna use some symmetries
here to make life easier.
This goes back to our homework problem,
which I'll talk a little bit more about.
Variance of x = n times
the variance of x1,
because let's say we're looking at x7.
Let's say the seventh ball,
like the homework problem
where you picked two balls.
And a lot of students were struggling
somewhat with the fact that to consider
the second ball, don't you have
to consider the first ball, okay?
But when we're just looking at like x7,
that depends on the seventh ball,
we're imagining before
we've done anything, okay?
Now the seventh ball is equally
likely to be any of the balls, right?
There isn't, like, some balls that
like to be chosen seventh and
other ones that don't, right?
It's completely symmetrical, so
this is just n times the variance of x1.
Similarly, for
all the covariance terms, there's a 2 times
n choose 2, since that's how many
of these covariance terms there are.
But by symmetry we may as well just
consider the covariance of X1 and
X2, so
you should think through to make sure
you see why this symmetry holds here.
So for the first ball,
I mean the variance of X1,
that we just get using the Bernoulli,
right?
So that's easy to get,
but let's think a little bit about
the covariance of x1 and x2.
So this part we already know, and this we
now know once we see the symmetry.
But you should make sure you see
the symmetry in this problem,
cuz if there's symmetry you
wanna take advantage of it.
And if there isn't symmetry you
don't wanna falsely assume it, and so
you have to be very careful about that.
Symmetry is powerful, but
that's also the danger, all right?
Let's quickly get covariance of x1 and x2.
Well that's E(x1 x2)-E(x1) E(x2).
E(x1 x2), let's do the second term first.
That's easy,
that's just the fundamental bridge,
the probability that the first ball is white
times the probability that the second one is white.
But both of those are w/(w + b), okay?
Now for this term, E(x1 x2), let's use
the fact here that the product of two
indicator random variables is
the indicator of the intersection.
So this here is the expectation
of an indicator, and by the
fundamental bridge that's the probability
that the first two balls are both white.
Well, the first ball is white with
probability w/(w + b), and then the
probability that the second ball is white
given that the first ball is white
is (w - 1)/(w + b - 1).
So then we have the covariance, so
we know this thing, we know this thing.
So at this point, we can just do some
algebra and simplify everything together.
I'll clean this up next time,
and give the final answer, but
at this point we know the answer,
it's just algebra.
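If you want to check the two pieces above numerically before the algebra gets cleaned up, a simulation works; the jar sizes here are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
w, b, n = 6, 4, 5  # hypothetical jar: 6 white, 4 black, sample size 5
N = w + b

# Covariance from the two pieces derived above:
# E(X1 X2) - E(X1)E(X2) = (w/N)((w-1)/(N-1)) - (w/N)^2.
theory_cov = (w / N) * ((w - 1) / (N - 1)) - (w / N) ** 2

# Simulate drawing without replacement: shuffle the jar, look at draws 1 and 2.
jar = np.array([1] * w + [0] * b)  # 1 = white, 0 = black
perms = rng.permuted(np.tile(jar, (300_000, 1)), axis=1)
x1, x2 = perms[:, 0], perms[:, 1]

print(np.cov(x1, x2)[0, 1], theory_cov)  # both close to -0.0267
```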
Okay, so see you on Friday.
