SPEAKER: So at the end of last time,
we were doing this matching problem,
De Montmort.
And then I wanted to say a little
bit more about that problem,
because it was a little rushed
at the end of last time.
So let's finish talking about that.
That was an example of
inclusion-exclusion.
And then, for the rest of today,
I want to introduce independence.
We've already been talking
informally about independence.
When we say things are independent,
what does that really mean.
And then the big main
topic for this entire week
is conditional probability
and conditioning.
So that will be the rest of today.
First, let me just quickly remind
you of this De Montmort problem,
a matching problem
continued from last time.
Because I had to explain it
a little too fast last time.
And there were a couple
more comments about this.
This is the most famous example of inclusion-exclusion.
If you read the strategic
practice, and if you go to section,
you'll be seeing lots more examples of inclusion-exclusion.
But this is probably the most
famous one, and it's a nice one.
So I'm not going to write
the whole thing again.
But I'll just remind you of our notation
from last time in case you forgot.
It's hard to remember
notation over the weekend,
so I'll just remind you briefly.
We have a deck of n cards.
The cards are labeled 1 through n.
And you're just playing this game, where
you're just flipping cards over one
by one.
And you say, 1, 2, 3, whatever.
And you win, if in fact at some
point, the card that you name
is the card that appears.
That's the game.
You want to know what's the
probability that you win that game.
So I define Aj to be the event
that the jth card matches.
In other words, the jth card in the
deck is labeled with the number j on it.
And we were trying to find
the probability of the union.
And I want to explain this a slightly
different way and then make a couple
more comments about it.
So what we're interested in is the union.
And you can think about how to do
this directly, but it's pretty hard.
And inclusion-exclusion is
the easiest way to do this.
So if you remember your
inclusion-exclusion,
it says that this
probability of the union
is going to be something that
looks like-- first, you're going
to add up the individual probabilities.
Then you're going to
subtract the probabilities
of different intersections.
And then you're going to
add triple intersections.
And you're going to subtract, and
add, and keep going like that.
So in order to compute what we need
for inclusion-exclusion, basically what
we need is the probability of A1
intersect A2, intersect blah, blah,
blah, intersect Ak.
This is a handy trick.
This is like really, really simple,
and yet students, for some reason,
don't like doing this.
I highly recommend-- what did I do here?
I picked the first k of
them for concreteness.
Well, what makes this problem
doable, without spending weeks on it,
is the symmetry of the problem.
So I could have written A7, and A9, and
whatever subset of k of them I wanted.
But for concreteness, I'm
just picking the first k.
By symmetry, it wouldn't change.
If we wanted to be really formal, I could start writing A sub j sub 1, A sub j sub 2, and so on.
And you'd get all these
double subscripts,
and then people start to get confused.
But by symmetry, it's the same
as just writing this one down.
So it really helps to make it concrete.
Now, this is just n minus k
factorial over n factorial.
That's immediate from the naive
definition of probability.
Because we're assuming that
all orders of the deck of cards
are equally likely.
That's the denominator.
And the numerator, what this says,
is that card one is labeled 1,
card two is labeled 2, blah, blah, blah.
Card k is labeled k.
So that means the rest of the n
minus k cards could be in any order,
but the first k are constrained.
So that's immediate from the naive definition.
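As a quick sketch of why this is immediate, here's a small brute-force check in Python; the choice n = 5, k = 2 is made up just for illustration:

```python
from itertools import permutations
from math import factorial

# For a small deck, count permutations where the first k positions all
# match, and compare with the formula (n-k)!/n! from the naive definition.
n, k = 5, 2
perms = list(permutations(range(1, n + 1)))
# A_1 ∩ ... ∩ A_k: card j is in position j for j = 1, ..., k
matches = sum(1 for p in perms if all(p[j] == j + 1 for j in range(k)))
empirical = matches / len(perms)
formula = factorial(n - k) / factorial(n)
print(empirical, formula)  # both 0.05
```

The count and the formula agree, and by symmetry the same value comes out for any subset of k of the events, not just the first k.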
So that's what we did last time.
And then I'm going to explain it in a slightly different way now.
I just noticed that there are--
how many of these terms are there?
Well, if you look back at
your inclusion-exclusion,
we don't just do this for
1 through k, obviously.
We do this for all subsets of size k.
So we're going to have n choose
k terms that look like this.
And n choose k, remember, is
n factorial over n minus k
factorial k factorial, such terms.
And they're all the same
by symmetry, so symmetry
is the key to this problem,
plus inclusion-exclusion.
And this makes it easier to see than what I wrote last time why it's going to cancel into something very nice.
Because we have n minus k
factorial over n factorial.
We have n factorial over
n minus k factorial.
It cancels.
The only thing that's left
is the 1 over k factorial.
So now by inclusion-exclusion,
it's immediate
that the probability of the union
equals 1 minus 1 over 2 factorial plus 1 over 3 factorial minus dot dot dot plus negative 1 to the n plus 1 over n factorial.
And then if you look at the last--
the last term of inclusion-exclusion
had a negative 1 to the n plus 1.
And I think I forgot
the plus one last time,
because I was writing in a
hurry at the end of class.
So you may want to fix your notes.
There's a plus 1 there.
That's the last term.
And let's just do a quick sanity check.
Does this 1 over n factorial make sense?
Well, this 1 over n
factorial term is saying,
the cards are perfectly
ordered 1 through n.
This is the case where
all of them match.
Well, there's only one way that could
happen, that the cards are 1 through n.
So this term should make sense. So that's what we did.
And then suppose we
wanted the probability--
often the problem is
phrased the other way,
as what's the probability
that there are no matches.
So, of course, the probability of no match is just going to be 1 minus this.
And just for practice, let's
write that in set theory notation.
Remember from your math review, if
you take the complement of a union,
it's the intersection
of the complements.
So the complement of that is the
intersection of Aj complement,
and that's just 1 minus this.
So I'm just going to write 1 minus
1 plus 1 over 2 factorial minus 1
over 3 factorial plus
dot, dot, dot plus--
now it's negative 1 to the
n, because the signs flip.
All right, so this is the exact answer
for the probability of no match.
Now, this thing looks
pretty complicated.
So what if we wanted
an approximate answer?
Then you have to recognize
the pattern here.
Whenever you see something that
looks like factorials in the bottom,
that should be making you
think of various Taylor series.
And the two Taylor series that we
need over and over again in this class
are the geometric series and the
Taylor series for e to the x.
So when you see
something like this, this
should immediately remind you of
the Taylor series for e to the x.
Of course, we could put a 1 in
factorial here and a 0 factorial
there without changing anything.
That's exactly the Taylor series for e
to the x evaluated at x equals minus 1.
So this is approximately 1
over e, which is about 0.37.
And most people find this
result pretty surprising.
Because first of all,
where did the e come from?
It seems like a pretty discrete setup, but e makes you think of logarithms and maybe some continuous calculus stuff.
But it just seems like
a discrete problem,
just like arranging
these cards randomly.
What does a deck of cards
have to do with the number e?
The other thing that's
surprising about it--
well, actually, I haven't done
a survey of this, not formally.
It would be interesting to see what people's intuition is if you let n go to infinity.
I'm pretty sure most people
who haven't seen this
would guess that the probability
of no match either goes to zero,
or it goes to one.
And it depends whether you're
a pessimist or an optimist.
Because if n is getting very
large, you have a billion cards.
Now each one of those cards is extremely
unlikely to be in its position,
but you have so many chances.
So who's winning that
competition between the fact
that it's so unlikely for each
one but you have so many of them?
Somehow those two competing
forces reduce to one over e.
And we're going to see the number
one over e a lot in this class.
Sometimes I like to say that if you
have no idea whatsoever on a problem
and you had to guess, you
should guess 1 over e.
You don't have a good
chance of getting it right,
and you're supposed to
justify your answer.
But probably 1 over e is going to give
you a better chance than anything else.
So this is exact.
As n goes to infinity, it
will converge to 1 over e.
But even if n is 10, 10
factorial is over 3 million.
It's a huge number.
So even if n is 10, this number is correct to within something like 2 out of 100 million, something on the order of 10 to the minus eighth. The difference between the exact answer and 1 over e is already that tiny when n is 10, so it converges to 1 over e extremely fast.
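To see how fast that convergence is, here's a short sketch of the exact sum compared with 1 over e:

```python
from math import factorial, exp

# Exact probability of no match: sum_{k=0}^{n} (-1)^k / k!
def p_no_match(n):
    return sum((-1) ** k / factorial(k) for k in range(n + 1))

# Compare with the limit 1/e for a few deck sizes
for n in [3, 5, 10]:
    print(n, p_no_match(n), abs(p_no_match(n) - exp(-1)))
```

Already at n = 10 the gap to 1/e is on the order of 10 to the minus eighth, as claimed.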
So that's just an illustration.
I like this problem because, well,
it's cute, it illustrates 1 over e,
it illustrates inclusion-exclusion,
it illustrates the symmetry,
and those are all things
we're going to see a lot of.
So that's that problem.
So now let's define independence.
And it's very easy to define.
It's not so easy to understand
fully the definition.
So the definition is very, very simple.
So we already have some intuition about what it means to have independent events.
That is one of them gives you no
information about the other one,
basically.
But that definition is too vague to actually verify whether these events are independent or not.
So if we have two events, A and B,
we say that they're independent--
we're assuming that
we have a sample space
and a probability function
P that we're working with.
So it's within the
context of one experiment
and one probability, P, that satisfies
the axioms we've been talking about.
They're independent
if-- and the definition
is just that the probability of A
intersect B equals p of A p of B.
So that's pretty intuitive.
It just says, what's the
probability that both A and B occur?
Well, the probability that A occurs
times the probability that B occurs.
So they have nothing
to do with each other.
It's independent, so we just multiply.
That's the definition.
Well, we'll see some
equivalent forms of it later.
[SIDE CONVERSATION]
I want to contrast this with disjoint.
Because that's a common mistake, and
that's a terrible, disastrous blunder.
This is completely different from disjointness.
If I say that A and B are disjoint, that
means that if A occurred then B cannot
possibly occur.
That's opposite to independence.
Independence says, if
we know that A occurs,
it tells us nothing whatsoever
about whether B occurs.
Disjointness says, if A occurs, then B can't occur.
So they're completely
different concepts.
But for some reason,
they get confused a lot.
So that's why I was mentioning that.
That's for two events.
Let's just define what
happens with more than two.
If we have events A,B,C--
this is the definition of what
does it mean for three events
to be independent.
Well, first of all, we want any
two of them to be independent.
It would be weird to say that
all three are independent
but A and B are not independent.
So we have P of A, B. And
by the way, sometimes it's
fun to write commas
instead of intersections,
and that doesn't cause any
problems unless we start bringing
in unions and things like that.
So sometimes I'll write the probability of A comma B.
Because I'm interpreting
intersection as and, so sometimes
I'll just write it as a comma.
Mathematically, this
means the intersection,
but intuitively it means what's the
probability that both A and B occur.
So this says A and B are independent.
And, of course, we want A
and C to be independent.
And, of course, we want B
and C to be independent.
And then the question
is, is that enough?
If we just had these equations, we'd
say that that's pairwise independent.
That is, we have three events,
and if we take any two of them,
they're independent.
And the question is whether that's enough.
And the answer is no.
We also need one more equation.
P of A,B,C equals P of A P of B P of C.
So a slogan that some people sometimes
say is, independence means multiply.
That's a little too vague though,
because it depends what you're doing.
But if you have independent events, and you want the probability of all of them happening, then you just multiply.
So clearly, we want this to be true.
Now there's a question of do these
pairwise ones imply this one?
And the answer is that they don't in general.
So there are some examples of things
like this in the strategic practice.
And I would suggest that, first, it's just good practice with the definitions to try to construct your own counterexample.
And if you get stuck, you can look at
the solution in the strategic practice
too.
And I might come back to this later.
But it would be better
to think about it first,
and I want to get to conditioning.
But in general, we can't get
rid of any of these conditions.
And you also can't say that
this one will imply these ones.
So we just need pairwise independence,
and we need all three of them at once.
So that's just the definition.
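Here's a small sketch of the three-event definition, using a made-up example of three fair coin flips, where A, B, C are the events that flips 1, 2, 3 land heads; all four equations can be checked by enumerating the eight equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

# Sample space: three fair coin flips, 8 equally likely outcomes
outcomes = list(product('HT', repeat=3))

def prob(event):  # event is a predicate on an outcome
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 'H'   # flip 1 is heads
B = lambda o: o[1] == 'H'   # flip 2 is heads
C = lambda o: o[2] == 'H'   # flip 3 is heads
both = lambda e, f: (lambda o: e(o) and f(o))

# The three pairwise equations...
assert prob(both(A, B)) == prob(A) * prob(B)
assert prob(both(A, C)) == prob(A) * prob(C)
assert prob(both(B, C)) == prob(B) * prob(C)
# ...and the one more equation for all three at once
assert prob(lambda o: A(o) and B(o) and C(o)) == prob(A) * prob(B) * prob(C)
```

For these events all four equations hold; the point of the strategic practice exercise is that for other choices of events the pairwise three can hold while the fourth fails.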
Similarly, for events A1 up to An--
so if we have n events--
I'm not going to write
down all the equations,
because it's completely analogous.
Once you understand the case with
three of them, you could do n of them.
But we would say, well,
first of all, any two
of these have to be independent, and any
three of them have to be independent,
and any four of them.
So no matter how many of them you
take, when you do something like this,
it's just the product.
So that looks pretty unwieldy.
We have lots of different equations.
So later we'll see other ways
to think about independence.
But just the basic rule
is, independent means
multiply when we're trying to find
the probability of an intersection.
So that should be true no matter
how many of the events we take.
We could take any subset of these.
Yes?
AUDIENCE: Can you show
them the other way?
SPEAKER: What's the other way?
AUDIENCE: So if you
multiply [INAUDIBLE]..
SPEAKER: So the question is, whether
this holds the other way too.
Yeah, we're using--
And this is just a slogan.
This is not a math statement.
But this slogan is just
paraphrasing the definition.
It's saying, if all these
multiplication rules hold,
that's the definition of independence.
So it's true both ways.
So we'll see other ways to
think about independence later.
But for now, that's just the definition.
And hopefully it makes some sense.
You can also think about
the naive definition.
Think about the naive
definition of probability
and your naive intuition
of independence.
Put those together and see
whether it's consistent with this.
And I think it will be.
But if not, let me know.
But just for now, that's the definition.
But I think you should think about it.
So I want to do one example,
one more famous example,
and then we'll do
conditional probability.
So this is another famous problem.
I think it's from
1693, very old problem.
This was called the Newton-Pepys problem.
Samuel Pepys was this famous
diarist, who you might
have heard about in a history course.
Well, anyway, he wanted to know the
answer to a problem involving gambling,
involving dice.
I wrote down some numbers here.
He wanted to know the answer,
and he couldn't solve it.
So he wrote to Isaac Newton, and
Newton solved the problem for him.
So I wanted to talk a little
bit about that problem.
So here's my problem.
We have fair dice, just
usual six-sided dice.
So the sides are numbered 1
through 6, all equally likely.
And the question is, which
of these is most likely
of these three possibilities?
The first possibility--
let's call this event A--
A is the event of getting at
least one six with six dice.
So you roll six dice.
And the question is, will at least
one of the six dice have a six on it?
So at least one six with six dice,
that's the first possibility.
The second possibility
is at least two sixes--
I won't write the at least again--
at least two sixes with 12 dice.
And the third possibility,
you might guess,
is at least three sixes with 18 dice.
So you roll 18 dice.
And you want to know, are
there at least three sixes?
So in 1693, this was a hard problem.
And this guy wanted to know the
answer, so he wrote to Newton.
And Pepys had a strong belief about
which one of these was the most likely,
but he didn't know any of the
math to be able to back that up.
And I'm curious what your intuition is.
If you had to guess A, B, or
C, which is the most likely?
How many of you would guess A?
OK.
How many of you would guess B?
OK.
How many of you would guess C?
So I think C-- definitely, B did not win.
So I think C beats out A
according to that informal poll.
Now, C, this is what
Pepys strongly believed.
And he was asking
Newton, is that correct?
Can we back that up?
And the answer is A.
[LAUGHTER]
Now, it's not at all obvious though.
That's why he had to write to Newton.
And what actually happened
is, Newton wrote back a letter
deriving it and showing
that A was the largest.
And apparently, Pepys
had some other bet,
based on this, that he reneged on
when he found out that he was wrong.
So now let's solve the problem.
So let's just find--
at this point, this should be a
pretty easy calculation for us,
so it shouldn't take too long.
Let's find the probability of A. Well,
this is similar to other problems
we've done.
You can use the naive definition, because all 6 to the 6 outcomes are equally likely.
But let's do it using
independence instead.
Implicitly, I'm assuming that the
rolls of the dice are independent.
In this case, either you roll one die six times independently, or you have six independent dice.
So we're assuming we don't
have some weird situation where
we have six dice that are magnetically
joined together so that they're all
going to land the same way.
So the six dice are independent
of each other intuitively.
But that intuition corresponds
to that definition there.
So I'm going to use independence.
You could also use the naive definition,
and you'll get the same thing.
Now as far as our strategy, when we
see a probability of at least one,
whenever you see at least one, that
should make you think of a union.
And when you see a union, one approach is inclusion-exclusion.
But like I said last time, I try
to find simpler ways when possible.
So the first thing I think when I
see at least one is, it's a union.
But if I do the complement,
it will be the intersection.
And in this case, we're taking an
intersection of independent stuff.
So that means we just multiply.
So that's going to be the most
straightforward way to do it.
So I'm going to do 1 minus,
because I'm doing the complement.
It's 1 minus the probability that
all of the dice are non-sixes.
Now for each individual die, there's
a 5/6 chance that it's not a six.
Independence says we
can just multiply those.
So it's 1 minus 5/6 to the
6, which agrees with what
we get from the naive definition.
Because in the naive
definition, we would say
there are 6 to the 6 possible outcomes.
And each die has five possibilities
if we're excluding sixes.
So that should be basically immediate.
And by the way, I computed this number, and it's 0.665.
So it's about a 2/3 chance of
having at least one six if you
roll six dice, which might not be
what most people would guess if they
had to guess what this probability is.
But that's an easy calculation.
Now let's do P of B.
Again, at least two sixes,
that means that the number of sixes
could be anything from 2 to 12.
That seems a little unwieldy.
So let's do the complement.
So we're going to do 1 minus--
So we're going to do 1 minus
the probability of no sixes
and then minus the
probability of exactly one 6.
That's going to be easier.
So to do no sixes is very easy.
That's just 5/6 to the 12th.
That's just saying all the dice are non-sixes. Then we subtract the probability of exactly one six.
And here, we have to think a little bit
more, but hopefully not too much more.
We want exactly one 6
and exactly 11 non-sixes.
So let's think of a
specific way to do that.
A specific way would be, suppose that
we labeled the dice 1 through 12.
So one way to have one six would
be if die number one has a 6,
and the other 11 dice are non-sixes.
But that was just one possibility.
The six could have been
any of the 12 dice,
so we just have to multiply this by 12.
And this works out, if you compute
this with a calculator, this is 0.619.
So this is a bit less
likely, 5% less likely.
Lastly, let's do P of C.
Again I'm going to do 1 minus.
And this time, just
to save some writing,
I'm going to write this as a sum.
So I'm going to subtract
off the possibilities where
there are zero, exactly zero,
exactly one, or exactly two sixes.
So we're just going to
subtract off those cases.
So now here I want to know, if I
have 18 dice, what's the probability
that exactly k of them are sixes?
So if we draw a little picture, we're
going to number the dice 1 through 18.
So I won't draw all 18 of them.
Here's the first two and the eighteenth.
And it's pretty easy to draw dice.
Just draw squares.
Now, one possible case,
for example, of k is 2.
And I want the probability
of exactly two sixes.
One possibility would be to
have sixes here and here.
And everywhere else--
I don't know how to write non-six, but
everywhere else, other than these two,
is not a six.
That would be one way.
But it didn't have to be die number one and die number two. It could have been any of the 18 choose 2 pairs of positions.
So choose any two
positions to put the sixes.
And then put the
non-sixes everywhere else.
That's what we're doing.
Therefore, we can then
immediately write down, this
is just going to be 18 choose k.
We're very familiar
with 18 choose k by now.
It's the number of ways to
choose where the sixes are.
And then for the sixes,
it's 1/6 to the k.
And then for the non-sixes
it's 5/6 to the 18 minus k.
This thing that I just wrote down
is called a binomial probability,
and we're going to be
seeing a lot of these later.
And I know some of you
have seen binomials before.
But it doesn't matter
at all whether you've
seen the binomial before in this class.
What matters is that you
understand where this comes from,
not that you memorized it.
And so whether you've
seen the binomial or not,
this should be something that--
just from understanding the concept,
this should be something that
you can just write down fairly
easily without memorizing it.
But we are going to see expressions
that look like this a lot in this class.
So that's just introducing it.
And if you compute this one
numerically, it's 0.597.
So that means A is the most likely, and C is actually the least likely.
So that's the answer.
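The whole Newton-Pepys calculation can be sketched in a few lines using the binomial probability above; the function name here is just illustrative:

```python
from math import comb

def p_at_least(m, n):
    """P(at least m sixes when rolling n fair dice), via the complement:
    1 minus the sum of the binomial probabilities for 0, ..., m-1 sixes."""
    return 1 - sum(comb(n, k) * (1/6)**k * (5/6)**(n - k) for k in range(m))

pA = p_at_least(1, 6)    # at least one six with 6 dice
pB = p_at_least(2, 12)   # at least two sixes with 12 dice
pC = p_at_least(3, 18)   # at least three sixes with 18 dice
print(round(pA, 3), round(pB, 3), round(pC, 3))  # 0.665 0.619 0.597
```

This reproduces the three numbers from the lecture and confirms the ordering A > B > C.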
And so what happened was that Newton
basically did the same calculation
and sent it back.
This was in the very, very
early days of probability.
So there are a couple of things I
find interesting about this, aside
from the fact that it's just
good practice with writing
these things down.
First of all, at the time, 1693, if you
wanted to solve a problem like this,
you had to go to a
genius, like Isaac Newton.
Whereas now, all you have
to do is take STAT 110.
[LAUGHTER]
So we've come a long way since 1693.
That's nice.
The other interesting
aspect to this problem
is that Newton did the
calculation correctly,
but he actually gave an intuitive
argument which is wrong.
And I'm not going to take the time to
go through his wrong intuitive argument
because it's confusing and wrong.
[LAUGHTER]
But Newton got the intuition wrong.
He got the calculation right.
And there's an interesting way--
if you try to read his
intuition, it's very confusing.
I mean, I actually read the
original a couple of years ago.
And I was trying to
understand what he was saying.
It's just this very confusing intuition.
I mean, I like the fact
that he did the calculation
and was trying to provide intuition,
but the intuition was very confusing.
So at first, I couldn't tell is
this intuition correct or incorrect.
And then another
statistician, Steve Stigler,
had a very beautiful argument
showing that Newton's intuition had
to be wrong, even if you
couldn't understand it.
[LAUGHTER]
The reason that Stigler gave was that,
if you look at Newton's argument,
whether you understand it or
not, nothing in that argument
depends on the fact
that the dice are fair.
So if you changed the probabilities
for the six sides of the dice,
Newton's argument would still claim
that this one is the most likely.
But if you change this--
you can easily do a similar calculation
where the probabilities are not equally
likely for the six sides of the dice.
You can play around with it.
And if you want, you can make C the
most likely when it's not a fair dice.
But Newton's argument
would not reflect that.
So therefore it couldn't be correct.
That is, Newton's argument is invariant
if you change the probabilities,
but this result is not.
So anyway, even Newton didn't
find probability that intuitive.
But he was able to do the
calculation correctly.
And now all of you can
do that easily too.
So that's just another
famous gambling problem.
So now the topic for the rest of
today, and the rest of this week,
is conditional probability.
And I have a lot to say about
conditional probability.
We're going to use this
concept over, and over,
and over again for the whole semester.
And hopefully you'll continue to
use it even after the class is over.
The question that we're
trying to address is--
we have all these probabilities.
But in practice, you live each day of your life. And, hopefully, you learn new things every day.
Now, I said on the
very first day of class
that statistics is the
study of uncertainty.
So unless you're completely,
extremely dogmatic,
and you're 100% sure of
everything, unless that's the case,
you have uncertainty
about a lot of things.
But you keep learning every day.
At least, that's what
I consider the ideal.
You have an open mind.
You have some beliefs, but
you have a lot of uncertainty.
You learn new things every day.
So that's a very, very
generic situation.
The question is, how should
you update your uncertainty,
and update your probabilities,
based on new evidence?
That's the question.
So you can see why that's a central
question in science, and in philosophy,
and just in thinking.
Just how do you update your beliefs?
So how should you
update your probability,
slash beliefs, slash uncertainty
when you receive new evidence?
If you're doing science
and you do an experiment,
then you'll learn something.
If you're investigating a crime you
find clues and things like that.
You have to update your
uncertainties based on new evidence.
And it's generally a sequential process.
Because you may have
some probabilities today,
and you learn things later today,
and you update your probabilities.
And then you'll learn more stuff
tomorrow, and you update again.
So your old probabilities
get updated to new ones.
But then the new one
becomes the old one.
You update that.
And you keep updating
and updating, hopefully,
in a coherent, consistent,
logical manner.
So we're going to develop the
mathematical and statistical ideas
of how do you do that.
So you can see, this is a very
deep, very broad question.
And, in fact, I like to say that
conditioning is the soul of statistics.
So possibly that says more about
me than it says about conditioning.
I don't really know if I should quote myself. But anyway, I'm going to quote myself.
That is a good motivating
principle for this course.
Everything relates to conditioning
in one way or another.
So now I've set this up as
something very important.
And let's say, OK,
what's the definition?
How do we do math?
This sounds philosophical.
It's like the philosophy of science.
Philosophy of science is
all based on conditioning.
But how do we do math, not philosophy?
Well, we want to write down, what's
the probability of A given B.
So this is the notation we use.
We have one event given another event.
So we pronounce-- we just
draw a vertical stroke
and pronounce that as given.
So this is the probability that
A occurs given that B occurs.
So our interpretation is that we--
initially, we had some probability,
P of A, for A occurring.
Initially, we don't know whether
B is going to occur or not.
Then suppose that we observe that,
oh, B actually did indeed occur.
Now if A and B are independent,
then that's going to be irrelevant.
But if A and B are not independent,
then that's valuable information.
And the question is, how do we
update our probability for A?
And the definition is that this
is just the probability of A and B
divided by the probability
of B, as long as P of B
is greater than 0, since we
don't want to divide by 0.
So that's a very simple
looking definition.
It's just this over this.
And it sounds like this very complicated, general, deep subject, but all it reduces to is just a fraction, this over this.
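A minimal sketch of that fraction, with events on two fair dice chosen just for illustration:

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair dice, 36 equally likely outcomes
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# B = "first die is a 4", A = "the sum is 7"
p_B = prob(lambda o: o[0] == 4)                        # 1/6
p_A_and_B = prob(lambda o: o[0] == 4 and sum(o) == 7)  # only (4, 3): 1/36
# The definition: P(A|B) = P(A and B) / P(B), with P(B) > 0
print(p_A_and_B / p_B)  # 1/6
```

Given that the first die is a 4, the sum is 7 exactly when the second die is a 3, so the answer 1/6 matches the intuition.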
And we're going to prove
some theorems today.
We're going to prove,
actually, many theorems today.
But each theorem is so easy that you may wonder how it can be useful.
But then we're going
to be seeing examples
over and over again for the rest of the
semester why this is a useful concept.
So just to quickly give you some
intuition for this definition,
I'll explain this in two ways.
I'll call it intuition
1 and intuition 2.
Here's intuition 1.
Just like, why define it this way?
Because we already have some intuition--
I already told you the
intuition of what this means.
It's the probability of A given B.
So you could say it's a definition,
but I want this definition
to actually correspond
to what we might intuitively think is
a reasonable way to do conditioning.
So intuition 1 I call pebble world.
And for that, we just need to draw a
sample space, S. S is our sample space.
Now, we've moved beyond the
naive definition of probability.
So we no longer assume that all
outcomes are equally likely.
But let's assume that there are finitely many possible outcomes.
Each one is represented by a pebble.
So I'm going to explain
what this picture means.
So just for the sake of
example, I drew nine pebbles.
Nine pebbles, and the
constraint on the pebbles
is that the total mass equals 1.
Each pebble has a
certain weight or mass.
The assumption is that
the total mass is 1.
But that's not much of an assumption.
Because we can always define the units
of mass so that the total is one.
But anyway, we're assuming that we are
using units where the total mass is 1.
That's all we have.
This is a sample space.
So if you were paying attention
when we talked about sample spaces,
that's all possible
outcomes of our experiment.
So in this example, there
are nine possible outcomes.
And each one, it might
be that they're all 1/9,
or they might be
different probabilities.
That is, some of the pebbles
are more massive than others.
That's the setup.
Now, we've talked about the
correspondence between probabilities,
between events and subsets.
So an event is just a subset.
So, for example, let's
suppose that this is event B.
So I'll just circle these four pebbles.
Because an event is a subset.
Therefore an event is a set of pebbles.
So suppose that's B, and we want to
do P of A given B. And think about,
what does that mean?
Well, what that means is, first
of all, the interpretation
is that we learned that B occurred.
Therefore, all the
pebbles that are outside
of B, that's the other five
of them, are now irrelevant.
Because they didn't occur.
We know that one of these ones occurred.
So all we have to do
is get rid of pebbles
that are not in B, in other words, pebbles in B complement.
And anything in B complement
is now completely irrelevant.
We are restricting to the part
of the space that did occur,
and we don't care about the part
of the space that didn't occur.
So we just erased these five pebbles.
And I could erase them, but then
you wouldn't see them anymore.
I'll put X's.
You know that they were
there, but now they're gone.
Now we just have these four pebbles.
That's our world.
We're thinking of the sample
space as the universe,
but our universe got restricted
to B. That's what we're doing.
Now this is our new universe right here.
We're just going to do the usual laws
of probability, everything the same
as before.
And if we want the probability
of A, maybe A looked like this.
If this was A, just for example,
three of the pebbles in A
were outside of B. Those
are just irrelevant now.
The only part of A that matters is the
part that's inside B. That's what this is:
A intersect B. That's the pebbles--
in this example there was only one,
but I could have circled this one too.
In this example, there's only
one pebble that's in A and B.
And we're working with that.
The only question is, why
do we divide by P of B?
And the reason is an obvious
technical difficulty.
I said, get rid of these
pebbles that are not in B,
and then just do what
we were doing before.
The only problem is,
when we got rid of those,
now these ones don't have total mass 1.
So all we do is renormalize.
Renormalizing is just a
fancy word for saying, multiply
by a constant so that the
new total mass equals 1.
And you can check for yourself,
just with easy algebra,
that this is the correct normalization.
That if you divide by P of B, you are
going to make the total mass 1 again.
The reason is just that, if we let
A equal B, that's P of B over P of B
equals 1.
So renormalize to make
total mass 1 again.
And that's it.
That's the end.
So this seems like an
intuitive thing to do.
We get rid of the part of the
space that was irrelevant,
and then restrict to that part,
and we need it to add up to 1.
That's all we're doing.
That's what that definition is.
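Here's a rough sketch of this pebble picture in code. The nine masses and the particular events A and B are made-up numbers, just to illustrate the definition of conditional probability as restricting and renormalizing:

```python
# Pebble world: nine pebbles, each with a mass; the total mass is 1.
# The masses and the events A and B below are made-up for illustration.
masses = {1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10,
          6: 0.15, 7: 0.15, 8: 0.10, 9: 0.10}
assert abs(sum(masses.values()) - 1.0) < 1e-9  # total mass is 1

def prob(event):
    """Probability of an event = total mass of its pebbles."""
    return sum(masses[p] for p in event)

B = {6, 7, 8, 9}   # the event we learn occurred
A = {1, 2, 3, 6}   # the event whose probability we want

# Conditioning on B: throw away the pebbles outside B, then
# renormalize (divide by P(B)) so the remaining masses sum to 1.
p_A_given_B = prob(A & B) / prob(B)

# Check the renormalization: the conditional masses of the
# pebbles in B now add up to 1 again.
conditional = {p: masses[p] / prob(B) for p in B}
```

With these made-up masses, prob(A & B) is 0.15 and prob(B) is 0.50, so the conditional probability is 0.3.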
I promised you two intuitions.
Some people don't like pebble world.
It may have seemed abstract at first.
But I promise you that the more you
think about it, the more helpful
it will be, until you finally
feel that it makes sense to you.
But it's also helpful to have another
intuition, so here's intuition 2.
There are different perspectives on
what does probability really mean.
This is one perspective.
Here's another perspective
that I call intuition 2.
Rather than pebble world, let's
do what I call frequentist world.
In pebble world, we just
do the experiment once,
and we play around with the
pebbles, and we see what
probabilities it's rational to use.
In frequentist world, we don't
just do the experiment once.
We repeat the experiment
over and over again.
I'll just say, many times.
One philosophical objection
to this is, how many is many?
Is it infinitely many?
Is it a million times?
And is it really possible?
It gets into all that, is it possible
to step into the same river twice?
Remember, an ancient Greek philosopher asked
that, because you've changed the river
after the first time you step in.
So can you do the same
experiment over and over again?
That's a deep philosophical question,
but let's just assume you can.
You can do the same experiment
over and over again.
And here's what we do.
Assuming that we can, indeed, repeat
exactly the same experiment over
and over again, then one interpretation
of probability is long-run frequency.
And that's a pretty intuitive--
if you flipped a coin 1,000 times,
and 612 of them were heads, then
you might say, OK, that probability
is about 612 over 1,000.
So you're interpreting probability
as, in the long-run, what
fraction of the time does the event
occur if you repeat the experiment over
and over again.
So now all we need to do is make
a list of all the repetitions.
So you do the experiment.
I don't actually know--
I mean, it depends on what
the actual experiment is.
What do the results of
the experiment look like?
Well, let's suppose we're just
generating some binary data.
I'm just making up some numbers here.
I'm just pretending it's
binary for some reason.
It doesn't actually matter.
0 0 1 1, it's just whatever, 1 1 1.
We're repeating over and over again.
We're generating lots of data.
Now what we do is circle the
repetitions where B occurred.
And I haven't defined what
B is, but I am just drawing
a picture for illustrative purposes.
So I suppose that if this was the
outcome, then, in fact, B occurred.
So this one, B did not occur,
this one B did occur, and so on.
So you're just repeating
many, many times.
Only select the ones where B occurred.
Now among those, just the circled ones,
what fraction of time did A also occur?
So that should be a pretty
intuitive way to think of it.
Because we're just thinking,
well, given that B occurred,
then we should restrict our attention
to just the circled experiments
where B occurred, because
we're given that B occurred.
Among those, what fraction
of time did A occur?
If you think about this
intuition a little more,
you can also see why this frequency
interpretation corresponds to the definition,
P of A intersect B over P of B.
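The frequentist picture can be sketched as a simulation. The experiment and the events A and B here are made up for illustration (a die roll, not from the lecture); the point is just the recipe: repeat many times, keep the repetitions where B occurred, and ask what fraction of those also had A occur.

```python
import random

# Frequentist world sketch: repeat the experiment many times,
# restrict to the repetitions where B occurred, and compute the
# fraction of those where A also occurred.
random.seed(0)

trials = 100_000
b_count = 0    # repetitions where B occurred
ab_count = 0   # repetitions where both A and B occurred
for _ in range(trials):
    roll = random.randint(1, 6)   # one repetition of the experiment
    B = (roll % 2 == 0)           # B: the roll is even
    A = (roll >= 5)               # A: the roll is 5 or 6
    if B:
        b_count += 1
        if A:
            ab_count += 1

estimate = ab_count / b_count     # long-run frequency of A among B-trials
# This should be close to P(A and B) / P(B) = (1/6) / (1/2) = 1/3.
```

With 100,000 repetitions the estimate lands very close to the exact answer of 1/3.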
So now, let's get some theorems.
We only have five minutes left, but
the theorems are extremely easy.
So we're OK.
So theorem 1, suppose that we wanted
to find the probability of A and B,
how would we go about that?
Well, basically I would take
this equation, and I want this,
so I would multiply both sides
by P of B. So that's theorem 1.
The probability of A
intersect B equals--
let's see if I get this right--
P of A given B times
P of B. I'll write it
the other way, P of B times P of A
given B, using the commutative property
of multiplication.
So that's true.
We just proved it.
Let's actually prove another theorem.
Just to save some writing, I'm
going to prove it in the same line.
We could have instead
used P of B given A here.
And to do that, you'll
just swap the A and the B.
But A intersect B is the
same thing as B intersect A.
So I've just swapped the A's and B's.
So if we did it that way, we
would get P of A times P of B
given A. It's the same thing.
I just swapped A and
B. So therefore, we've
just derived two theorems
in less than a minute
about how to get the
probability of an intersection.
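Written out, the two forms of theorem 1 just derived are:

```latex
P(A \cap B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)
```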
Question?
Stretching.
Don't worry, we only have five
minutes left, then you can take a nap.
So that's good.
Notice, by the way, that if
A and B are independent--
remember the definition of
independent from earlier today?
That would have just said here,
P of B times P of A. So independence
is the case where P of A given B is
P of A. Intuitively, that
says conditioning on B does nothing.
Well, we're on a roll here.
So let's do some more theorems.
Theorem 2-- let me just generalize
this to n of them, A1 through An.
So I want to know, what's the
probability that all of these events
occur?
Well, just applying
this multiple times--
formally, you can use induction,
but I'm just using this repeatedly--
we're just going to get P of A1.
And then we're going
to do P of A2 given A1.
And then we're going to do
P of A3 given A1 and A2.
And we're going to keep going like
that, An given all the previous ones.
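In symbols, this chain of conditional probabilities is:

```latex
P(A_1 \cap A_2 \cap \cdots \cap A_n)
  = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2)
    \cdots P(A_n \mid A_1, \ldots, A_{n-1})
```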
And actually, this is not just one
theorem, but n factorial theorems.
So we are extremely efficient today.
[LAUGHTER]
It's n factorial, because I
could have started with A7,
and then done A4 given A7,
and kept going like that.
As long as you're adding
a new one each time,
conditional on all the previous ones.
So, hopefully, you all
see the pattern here.
There are n factorial ways to do it.
I mean, I'm kidding about the
n factorial-- but I'm not.
I mean, first of all, it's true.
And secondly, for some problems,
if you try to do it in one order,
it's going to be really hard.
And if you try a different
permutation, it'll be really easy.
So it is useful to think about
these different orderings.
Now, last theorem for today, theorem 3.
I would like to relate P of
A given B to P of B given A.
So I'm going to go back up here,
and I'm going to say, let's see,
I want P of A given B in terms of P
of B given A. So I'm going to divide
both sides by P of B. P of A given
B equals P of B given A times P of A,
over P of B.
I just proved it.
I just divided both sides
by P of B, end of proof.
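In symbols, what was just derived is:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```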
This theorem is called Bayes' rule.
And it was discovered in
1760-something by Bayes,
who was a Presbyterian minister, who
was doing probability on the side.
So the proof is completely obvious.
I just divided by P of B. The
implications are extremely deep.
There's a whole field of
statistics, Bayesian Statistics.
Controversies have raged for centuries
just about this kind of thing.
As I wrote it here,
it's uncontroversial.
But the question of how to use it and
how to think about it is very deep.
So this is one of the most useful
theorems you will ever see.
And yet, it's just easy algebra.
All right, so that's all for today.
I'll see you on Wednesday.
