Anyway, so I'll give you this
top ten list, and if you
look at the final review handout, it has
a very comprehensive list of topics.
But that list may be too long, so
here's just ten things,
not in any particular order.
These are not in order of importance.
Okay, I'll just use the big board.
All right, so here's my top ten list.
First is conditioning, right.
Conditioning is the soul of statistics,
so that did have to come first.
And conditioning includes a lot of things,
right.
It includes conditional probability,
it includes conditional expectation,
it includes Bayes' rule, and it even
includes Markov chains in a certain sense,
because of the conditional
independence structure there.
Conditional independence versus
independence, that's been everywhere.
Second, symmetry.
Symmetry is powerful but
dangerous, I like to say.
Because it's extremely powerful:
there are a lot of problems where you'd
have to do 100 pages of algebra
if you don't see the symmetry, but
only one line if you do.
On the other hand you don't want to start
hallucinating symmetries that are not
there, so you have to be careful.
And third, random variables and
their distributions,
which is one possible five word title for
the whole course.
Fourth is stories.
So we spent a lot of time
on not only story proofs,
but more importantly, the stories for
each distribution, like the normal and
gamma and Poisson and so on.
They're extremely important
because otherwise, right,
there are infinitely many possible
distributions we could have looked at.
Why do we spend so much time on
the Poissons and exponentials and stuff?
It's because they had important stories,
otherwise just write down anything that
integrates to one and then work with that.
Why are some distributions more useful and
important than others?
It's because they have stories.
Okay, fifth is linearity.
Sixth is indicator random variables,
which is one of my favorite tricks,
as I think you know.
It's extremely useful on interview
problems and all kinds of stuff.
Seventh is LOTUS,
just an extremely useful tool for
computing expectations.
And eighth is the law of large numbers,
and ninth is the central limit theorem;
I just sort of had to put those there.
These are the two most famous theorems
in probability, possibly the two most
important theorems in probability, but
definitely the two most famous ones.
And then tenth, Markov chains,
which we spent the last few lectures on.
Okay, so that's just the top ten list.
But it partitions up kind of neatly.
The first four all have to do with
the big-picture question of
what randomness is, right.
That's why probability is everywhere:
because randomness and
uncertainty are everywhere.
How do we think about uncertainty?
That's these four. These three are all
about computing expected values.
Indicator random variables you can use for
other purposes as well,
but we were especially using it for
computing expected values.
And remember also that expected values
don't just mean finding the average of
the random variable: if we
want to find the standard deviation,
then we need the variance,
and to get the variance,
we need to compute expected values.
So a lot of things beyond just
the average reduce back down to this,
computing expected values.
And then these last three, I would
describe as talking about long-run
behavior; by long run, I mean
we have a lot of random variables.
So here and here,
we take the average of a large number
of IID random variables and
see what happens, how it behaves.
And then for Markov chains,
that's how they were originally designed.
Markov's motivation was
going one step beyond IID.
You run this Markov chain just wandering
around for a long time, right.
It's not IID anymore generally, but
it has this nice conditional structure
that makes it very nice to work with.
Okay, well, anyway, that's the top ten list.
Let me know if any other questions
come to mind in the meantime.
Meanwhile, I'll draw a little picture.
So we've been doing Markov chains.
So it's kind of natural to represent
our current state as Stat 110, right.
So this is the whole world
of statistics at Harvard
as a Markov chain, and we're here now, and
it doesn't matter how you got here, right.
All that matters is that
you're here in Stat 110.
And then the question is saying, well
where are you gonna be next semester or
next year or things like that?
What can you do from here?
So I want to briefly mention,
if you look at the prerequisites for
stat courses above 110,
most of them require 110.
And there's reasons,
there are very good reasons for that.
I'll just give a few examples,
mentioning a little bit about the courses,
since for some of you that may be
useful to know, but also just
to show you a few examples.
Okay, so where can you go from 110?
Well, the most obvious follow-up is 111.
110 and 111 together, in a sense, form a full
year course on probability and inference.
This goes back to the very first day of
class; I won't draw the diagram again, but
on the very first day of class I drew
a picture of the difference
between probability and
statistical inference.
They're kind of
two sides of the same coin.
In probability we're saying:
here's our model, so what's
the probability that this will happen?
What's the probability that that will happen?
What's gonna happen on average, right,
given that we're using this particular
model, like Poisson(lambda) where
lambda is known, or
lambda has some distribution itself but
that distribution is known, and so on.
Inference is about going the other way.
You have data, and then you want to
estimate unknown parameters,
theta or lambda or
whatever the unknown parameter is, okay.
You want to estimate it, or
you want to make predictions about the future:
you have lots of observations so
far, but in the future you're
gonna get more data, and you want to
predict what that's gonna look like.
Okay, so those are questions
of statistical inference.
You can't do statistical inference unless
you have probability, because it gives you
both the language you need as
well as a lot of the theorems.
Universality of the uniform, I didn't
introduce that to torture you or anything.
I mean, it's a beautiful general result,
but it's also very,
very useful in statistical inference.
So things like that.
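Here's a minimal sketch of mine (not from the lecture, which would use R) of what universality of the uniform buys you in practice: if U is Uniform(0, 1) and F is a continuous, strictly increasing CDF, then F inverse of U has CDF F. For the Exponential(rate), the inverse CDF is -log(1 - u) / rate.

```python
import math
import random

# Universality of the uniform, sketched: sample Exponential(rate)
# by plugging a Uniform(0, 1) draw into the inverse CDF,
# F^{-1}(u) = -log(1 - u) / rate.

def sample_exponential(rate=1.0):
    u = random.random()              # U ~ Uniform(0, 1)
    return -math.log(1.0 - u) / rate

random.seed(110)
draws = [sample_exponential() for _ in range(100_000)]
mean = sum(draws) / len(draws)       # should be close to 1/rate = 1
```

The same recipe works for any distribution whose inverse CDF you can write down, which is one reason the result matters so much for simulation and inference.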
Okay, so that's 111; that's every spring.
Now, 110 and 111 are
both more on the mathematical side,
rather than courses where you're actually
learning how to work with data.
So another course I would
especially recommend is 139,
which is linear models, and which has some
overlap with econometrics, Ec 1123.
In fact, we in Stat have been having
discussions with the Ec department for
years about the issue of 139 versus 1123.
Because the ec department's policy is
that you can't take both for credit.
And the stat department's policy is
that you can take both for credit.
The Stat perspective on this is
that the points of view are sufficiently
different that you should
be able to do both.
But that's a difference in point
of view about the points of view.
So anyway, you could include 1123 or
1126 here, I'm not going to
list all the ec course numbers.
You should have some course in regression
and linear models; it's useful for
all kinds of things.
And 139 does not require 110, but
if you've had 110 it gives you a better
understanding of what's going on.
And similarly, if you look at like 1126,
take a look at the course notes for
that sometime.
Everything is conditional expectation,
all right?
So if you understand
conditional expectation,
then that's going to be
extremely advantageous.
In 139, you're going to understand what's
going on much better than other people in
the class who haven't had 110,
although it's not required.
So 139 is just a great course.
And it actually teaches ways to work with
data, where you're actually using
data on the computer.
And one piece of advice if you want,
in addition to everything else I'm saying,
is to learn R.
That's one of the best six-letter
pieces of advice I can think of.
R is a statistical programming language.
I also recommend learning C,
so take CS50 or
learn C somewhere else as well.
But we're talking about
statistics right now, learn R.
R is very, very different from C.
R was designed by statisticians with
statistical purposes in mind.
Completely different from C in flavor,
it's easier in some ways.
But that ease can be a misconception:
sometimes people who have been doing C for
years think they can just
learn R in a couple of hours and
be an R expert, and
they approach it all wrong.
Cuz it's just a completely
different mindset.
Very, very different language.
It's extremely powerful.
Actually, I just saw a really
interesting article yesterday
arguing that learning R is not only
incredibly useful, but that it actually
helps you become a better statistical
thinker just by studying the language.
And I hadn't thought about
it that way before but
actually that article convinced
me that that was true.
So this is incredibly useful.
Okay, a lot of the stat courses
in general, actually use R.
And whether you do more stat or not,
it's just an incredibly useful skill.
Getting more useful all the time.
And R is free.
You can find all kinds of very,
very well written tutorials and
stuff like that for free,
you can download it for free.
And because it has this kind
of free software culture,
a lot of people are writing R packages
and putting them online for free.
So it keeps growing like that, because
it has this open source culture to it.
Okay, that's 111, 139,
these are in no particular order.
Next is 123, finance.
Not offered this spring but
it will be offered next spring.
Any interest at all in finance,
you should definitely take 123.
I don't need to talk more about it now
because Stephen Blyth already gave the 123
preview.
Although I would mention I really
liked what he said at the end.
He said the two-dimensional LOTUS was
a key to winning the Nobel Prize.
So then I was happy I have LOTUS here;
that includes the 2D one,
right here, all-dimensional LOTUSes.
Okay, that's a finance course.
115 is computational biology.
That's every spring.
It's a really nice mixture of biology,
computer science and statistics.
To understand what's going on there,
a lot of it relies on Markov chains and
Bayesian thinking,
the things we've been doing.
Like that email that I just sent.
Markov chains are everywhere
in biology these days.
So if you're interested in biology and
stat, that would be a good one.
Well, what else do I wanna mention?
171 is another natural followup.
171 is stochastic processes.
So if you liked Markov chains:
we only had three lectures on them,
but in 171, maybe they spend
a month on Markov chains, and
then do a lot of other
stochastic processes as well.
So in a sense, 111 is swapping
to the other side of the coin,
and 171 is continuing the probability
way of thinking, where you have random
variables evolving over time, right?
That's a stochastic process.
Okay, so that's a really good
stochastic process course.
And any others that I wanted
to mention right now?
I think that's good.
Okay, so we have that little thing there.
I want to do a few quick examples of
how probability shows
up in some of these areas.
Okay, so for example,
let's take 139.
And by 139, I mean 139 but
I also mean more generally regression,
which you can also see in econometrics or
elsewhere.
It's extremely widely used for
analyzing data.
And the first time I saw regression,
I really, really hated it.
I thought it was the ugliest thing ever
and I had no idea what was going on.
Let's start with a really simple
linear regression model, just
the simplest case, just to do an example,
because this isn't a regression course.
We're looking at models like this:
Y = beta0 + beta1 X + epsilon.
Beta0 and beta1 are constants, which
we would generally think of as
unknown parameters
that we're trying to estimate.
The question is,
can you use x to predict y?
So you have a pair X and Y.
Maybe you have 100 people in your study or
something like that; it doesn't
have to be people, obviously, but
you have two variables, and you wanna
know how to use X to predict Y.
That's a regression.
And this would be a simple
example of linear regression
because this looks linear, right?
And epsilon is there because you're
assuming that X is not going to
perfectly tell you Y;
there's some error term, epsilon.
In practice,
people often assume that these epsilons
are normal with mean zero, but
you don't have to assume that necessarily.
So sometimes normal
might not be realistic.
But a common assumption would be that
at least they're centered at zero,
so the expected value is zero.
That the errors should be zero on average
is a natural, common assumption.
But in particular,
I want to assume that the expected
value of epsilon given X is 0:
E(epsilon | X) = 0.
So there aren't certain values of X
where the errors tend to be positive
rather than negative, that kind of thing.
So let's assume that,
I mean this is just regression.
Okay, and
then I'm not going to do much with this,
I'm just going to show you very quickly
how to get some simple facts about this.
So one thing we could do
when we see this equation
is to take the covariance
of both sides with X.
So, I'm saying that these
are the same random variable, okay?
So I can do whatever I want to both sides,
so
if I take the covariance
of both sides with X, well,
covariance of a constant with anything
is 0, so that part goes away.
So we have Cov(Y, X) =
Cov(beta1 X, X) + Cov(epsilon, X).
This is just a quick review
of properties of covariance.
Beta1 comes out, and the covariance of
X with itself is the variance,
so that's beta1 Var(X); then
let's think about Cov(epsilon, X).
I'll just do that over here and
figure out what this is.
Well, I'm assuming that
the conditional expectation is 0.
By Adam's law,
the unconditional expectation is also 0:
E(epsilon) = E(E(epsilon | X)) = 0.
And so Cov(epsilon, X) is just
the expected value E(epsilon X),
because the other term, E(epsilon)E(X),
is 0, since epsilon has mean 0.
And let's just quickly
compute this expectation.
Well, the strategy, again,
should be to condition on X.
By Adam's law,
E(epsilon X) = E(E(epsilon X | X)).
Just a quick review of Adam's law,
take out what's known.
Once we condition on X, then X is known,
so this X pops out right there.
But what's left is E(epsilon | X), which
we assumed is 0, so now it's X times 0.
X times 0 is 0, so this is just 0.
So actually, it's plus 0 here.
So that tells us that
beta1 = Cov(Y, X) / Var(X), and
by symmetry of covariance, that's
Cov(X, Y) divided by Var(X).
Okay?
So
that's a very quick
derivation of this fact.
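The derivation can be checked by simulation. This is a sketch of mine (the course itself would use R), with made-up parameter values beta0 = 2, beta1 = 3 and standard normal errors:

```python
import random

# Simulation check of beta1 = Cov(X, Y) / Var(X) for the model
# Y = beta0 + beta1 X + epsilon, with E(epsilon | X) = 0.
random.seed(0)
beta0, beta1, n = 2.0, 3.0, 200_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]  # eps ~ N(0, 1)

mean_x = sum(x) / n
mean_y = sum(y) / n
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n
slope = cov_xy / var_x               # should be close to beta1 = 3
```

The sample version of this ratio is exactly the "ugly" least-squares slope formula with the summation signs; it's just Cov(X, Y)/Var(X) with sample moments plugged in.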
And if you look in a lot of
books that do regression,
you're gonna have a very ugly looking
formula with summation signs.
This is the population version rather
than the sample version, okay?
And a lot of books,
if they don't wanna assume the
Stat 110 material, are gonna give
the sample version, just like:
here's the formula, the summation of
(xi minus something)(yi minus something),
that kind of thing.
And they'll either prove that thing in
some long, tedious, algebraic way, or
they'll just say, we don't wanna prove
this, here's the formula for you, and
it just looks really ugly.
Right?
It's hard to memorize and
it's hard to understand where it
came from.
With this derivation, you actually
understand where these things come from.
Similarly, a lot of these formulas that
looked really ugly at first, once you
really understand what they're doing,
a lot of times, it's just a conditional
expectation, and in particular,
a lot of times it's just a projection.
Right?
We drew a geometric picture.
What does this conditional
expectation mean, geometrically?
It's a projection.
Projections are nice, right?
So, just understanding things, you
actually can understand what's going on.
And in particular, a few of you actually
asked me for book recommendations.
And if any of you want more
specific book recommendations for
anything, I like recommending books.
I'm just gonna mention one right now.
I love this book.
Mostly Harmless Econometrics.
Have any of you read this book?
It's really cheap, too.
It's like $18 on Amazon,
last time I checked.
This book is beautifully written.
And I can't say that that's true
about many econometrics books.
This one is just beautifully written.
And if you flip through this
book sometime, what you'll
see is the conditional expectation, and
Adam's law and Eve's law are everywhere.
So it's all building on things we did
here, and if you haven't had that
background, it will be much harder
to understand what's going on there.
But with that,
then it's just Adam's law, Eve's law.
It should all come together.
Okay.
So I wanted to mention one other
course here, which is a more obscure
one, but I think it's an important one.
It's only offered every other year.
That's Stat 160.
Stat 160 is survey sampling.
It's very relevant for any of you
who are interested in government and
policy and
things that would involve survey data,
political polls, or
other types of survey sampling.
So just to give you a quick example
of how you can use 110 ideas,
and I think this is actually a good
review example too, I'm gonna do one.
And you can imagine, by the way,
that stuff like hypergeometrics
would come up a lot there,
because you have this population and
you're drawing a sample, right?
Usually, you're just trying
to learn something that's
of some policy interest, right?
Usually, you cannot ask every person in
the population what their opinion is, so
you get a sample, and
then you try to infer things
about the population.
So, if you sample without replacement,
we know that hypergeometrics come up very,
very, very naturally, right?
So here's just a quick example.
It's kind of a different mindset from
other parts of statistics where we assume
IID, because here we're assuming that
we're sampling from a finite population.
This is one approach to it.
There are some kind of
foundational controversies
about how do you approach this problem.
We have a finite population.
Well, of course, it's a finite population.
It's always a finite population right?
A lot of times, that gets ignored.
You have a finite population,
let's say of people, and
suppose that for each person there's
a true value of some variable
that you're interested in.
It could be their height, their income,
their opinion on some question,
whatever you're studying.
Let's say that the true
values are Y1 through YN,
where capital N is the size
of the population.
And let's assume that each person has an
ID number, like social security number, so
there's some well defined
way to list them out.
And these are treated as non-random,
at least in the approach
we're talking about now.
These are treated as fixed.
These are just constants.
Initially, they are unknown constants.
Right?
We just assume each person
has some fixed opinion, but
we don't know what those opinions are yet.
So what do we do?
We draw a sample, we get a sample.
Okay?
And there are many,
many different ways to do that sampling,
which get discussed in Stat 160.
You get this sample, and then you
try to infer things about the population.
And in practice, it may be extremely
difficult to get a simple random sample.
That is where all samples of a certain
size are equally likely, and
it's not always necessarily even
desirable to have a simple random sample.
But anyway, let's just assume we have some
sampling scheme where we get a sample
of size lowercase n.
Okay?
So suppose our goal is to
estimate the average of these.
Or, equivalently,
assuming we know capital N,
it's equivalent to
try to estimate the sum.
That is, we only get to collect the
sample, but we wanna estimate the total or
average value of the entire population.
So we wanna estimate the sum of
all the yjs, j = 1 to N, okay?
That's the goal.
Now we have a sample of size n.
And let's assume (we have some
ID number ordering for each person)
that the probability that person j
is in the sample is pj,
which we assume right now is known.
And then, of course,
in practice pj might not be known.
And you might have to estimate pj.
And then you can ask what happens then.
How do you deal with the fact
that you don't know the true pj?
Okay, that's more complicated.
But let's assume that
the true pj are known.
So the simple random sampling would be
the case when all the pjs are equal.
That is, everyone is equally likely
to be collected into your sample.
But that may not be true, some people may
be much easier to sample than others.
Or some are just obscure,
hard to reach, don't answer the phone,
in some obscure part of the country or
something like that.
Other people are easier to
actually collect into your sample.
Okay, so pjs may not all be equal.
Okay, now suppose our data are
(X1, Z1) through (Xn, Zn),
the data that you collect from your
sample, where X is the Y value.
I'm using a different letter,
X rather than Y,
just to indicate that the Xs are random
variables while the Ys are fixed.
X1 is random because it's the value for
the first person you collect into your survey.
But that person was randomly
chosen with some probabilities.
So the value has become random
because of the sampling, okay?
And Z is just their ID number.
So Xj is the Y value that
you're interested in, and
Zj is the ID number of that person,
that is, who we actually got.
That's the setup, okay?
So a question that is of tremendous
interest is how do we get an unbiased
estimator for the total?
And the standard method that's used
that's based on a very, very clever trick
is to divide by the probability.
That is, you sum up all your observations,
the sum over j = 1 to n of Xj / p_Zj:
you divide each observation by the
probability of getting that person.
So in other words,
if the values you observed were 5, 10, and
15, and those had probabilities a, b, c,
what you would do is 5/a + 10/b + 15/c.
These are the probabilities p_Zj.
That's unbiased.
So this just says take each measurement
and divide by the probability that
you actually got that person in
your survey, then that is unbiased.
Okay, so let's prove that it's unbiased.
We'll take the expectation,
but this looks ugly.
This is kind of an interesting
thing here because
this denominator seems kind of tricky.
Because this is p_Zj,
where Zj is the ID number of that person,
a random ID number.
So in the denominator here we
have a random probability.
How do we deal with that?
Well, a simple way around that
is to rewrite this using
indicator random variables.
So let's just rewrite this as a sum,
not just over the values
that you actually observed,
but over the entire population,
j = 1 to N: the sum of (yj / pj) Ij,
where yj is just a constant and
Ij is the indicator of person j being
included in the sample.
That's exactly the same thing,
because anyone who's not included
just gets zeroed out, and
anyone who is included gets a 1,
so it's just the same thing here.
But now it's just pj in the denominator,
which is easy to deal with.
p_Zj was a random probability;
pj is very simple, okay?
So now the expected value:
I'll just take the expected value
of this using linearity, right?
I'm gonna swap the E and the sum
by linearity.
The expected value of Ij,
by the fundamental bridge,
is the probability that person j is
included in the sample,
which by definition is pj.
So each term is just (yj / pj) pj = yj,
done.
So that's a very, very simple trick.
Just divide by the probability.
This is used all over the place now.
Variations on this trick
are a very important idea:
inverse probability.
This is either called the Horvitz-Thompson
estimator (we just derived the
Horvitz-Thompson estimator,
and it wasn't too difficult),
or it's called inverse probability
weighting, and you'll find tons and
tons of people using
inverse probability weighting.
Okay, well, I mean so you can do that.
That's unbiased.
Now we come to a deeper question,
is unbiased good?
Well, I mean,
I'd rather be unbiased than biased but
is that good in and
of itself as a criterion?
Is that good enough to tell us that
this is actually a good estimator?
Not so much; that's a harder question.
And I'll tell you a quick story showing
that this can be horribly, horribly bad.
I'm not saying it's always bad, either.
It can be good, it can be bad.
I'm saying this goes much, much deeper.
This is one of my favorite examples.
This is called Basu's Elephant, and
Basu was one of my favorite statisticians.
So I'll just tell you Basu's
elephant story quickly.
This is an indication that there's more,
there's a lot going on here.
Statistics cannot be described as:
you learn some formulas and
you plug things in.
You need to think really hard about
whether something makes sense,
whether it's useful; you can't just say
it's unbiased so we're happy.
So the story is there was a circus
owner who had 50 elephants, and
he wanted to know the average
weight of his elephants;
equivalently, he wanted to know
the total weight of the 50 elephants.
And I've never tried weighing an elephant,
but apparently it's pretty hard to
actually physically weigh one.
It's too much work to try to
weigh all 50 elephants.
So the circus owner said, okay,
I'll just take one that looks average.
Let's call that elephant Stampy.
Take Stampy, weigh Stampy,
multiply by 50.
Sounds like a reasonable thing to do.
Right?
But a statistician overhears that and
gets very, very agitated.
And then says, well but
that would be biased, right?
And so the circus owner says okay,
you're the statistical expert.
What should we do?
And the statistician, this is actually
not a very good statistician.
It's someone who just kind
of learned some formulas.
And says,
well you should use the Horvitz–Thompson.
You know, we do this
inverse-probability-weighting thing.
And otherwise you're gonna be biased,
okay?
And the circus owner says, well,
Stampy's right here, and
I really wanna weigh Stampy.
The other elephants are off wandering
somewhere else, and I really like Stampy.
He's not gonna kick me and stuff.
So anyway, then the statistician says,
that's fine.
With this Horvitz-Thompson estimator and
these probabilities pj
(I should've written pj > 0,
so that we're not dividing by 0),
as long as every pj is greater than 0,
it's perfectly fine.
It's still unbiased, so okay.
So they decide they're only gonna
weigh one elephant, and they give
probability 0.99 to Stampy.
Then they take the other 1% and
divide it up equally
among the other 49 elephants.
Sure enough,
99% chance that they get to weigh Stampy.
So they weigh Stampy and
everyone seems happy.
Now, what the circus owner was
proposing to do is take Stampy's weight
and multiply it by 50, right?
It seems like a natural thing to do.
Well, Horvitz-Thompson says:
take Stampy's weight and
divide by the probability,
which was 99/100.
So you just take Stampy's weight and
multiply by 100/99,
which is just slightly greater than 1.
Now that doesn't sound like a very
good estimate for the total weight
of 50 elephants, just to take one of them
and multiply it by 100/99.
And so the circus owner says,
well something seems a little
suspicious about that.
Are you sure that that's good?
And the statistician says, well, if
you'd gotten one of the other ones,
then you'd multiply its weight by 4,900.
So it's unbiased.
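The story can be put in numbers. This is a sketch with an invented weight of 5000 kg per elephant: the estimator is exactly unbiased, yet 99% of the time it reports about 5,051 kg as an estimate of a 250,000 kg total.

```python
# Basu's elephant in numbers: unbiased, but almost always absurd.
weights = [5000.0] * 50               # invented: every elephant weighs 5000 kg
probs = [0.99] + [0.01 / 49] * 49     # Stampy is elephant 0
true_total = sum(weights)             # 250000.0

# The two possible Horvitz-Thompson estimates: weight / inclusion prob.
stampy_estimate = weights[0] / probs[0]   # about 5050.5, with probability 0.99
other_estimate = weights[1] / probs[1]    # 24,500,000, with probability 0.01/49 each

# Yet the expectation is exactly the true total:
expected = sum(pj * (wj / pj) for wj, pj in zip(weights, probs))
```

So the estimator hits the true total on average only by occasionally producing a wildly huge value, which is exactly why unbiasedness alone is not a good enough criterion.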
So there are very, very tricky questions
about what good criteria or
natural criteria are for
when an estimator is good.
That goes back into 111 a lot.
111 is gonna be a lot about
Bayesian versus frequentist.
So we've been talking about
some Bayesian stuff, right?
Like we talked about conjugate priors and
the beta binomial, things like that.
In 111 they'll do a lot more with
conjugate priors, things like that.
Unbiasedness is
inherently a frequentist concept.
But there's other frequentist
concepts as well.
You can ask about these two methods or
sets of ideas: first of all, whether
they even are well-defined methods,
and then when they would agree,
when they would disagree,
things like that.
There are very, very subtle questions
that come up, very, very surprising ones.
There are a lot of paradoxes, where
you write down something that seems
like a perfectly natural estimator,
and then you can prove that
you can actually always do better
than that.
Things like that.
Okay well, so let's see.
Any other courses to mention here?
All right, so this is our little
Markov chain, well it's not really.
Well the question is whether
it's irreducible, I guess.
It's irreducible right now.
Now, of course,
you could go from 111 to 139,
you can go from 139 to 111;
you can go all over the place.
There are a couple more quick
edges we should draw here.
One is: 110 goes to jobs.
>> [LAUGH]
>> So if any of you get any good
interview questions that relate to
probability, please send them my way.
I won't get into the whole
vocational debate right now,
which has been coming up more.
And one more:
110 is actually a recurrent
state in this chain.
And William, who took the class
last year, is here, so we have an example.
Someone else just emailed me yesterday
saying she took it last year.
And really wished, more than anything,
she could just repeat the class over and
over again.
Now I'm hoping you're not
gonna repeat it over and
over again in the sense that you actually
have to literally retake the course.
>> [LAUGH]
>> But revisiting the material,
over and over again,
in that sense, is a good thing.
Okay, well that's all.
Thank you, it was all fun.
[APPLAUSE]
