The following content is
provided under a Creative
Commons license.
Your support will help
MIT OpenCourseWare
continue to offer high quality
educational resources for free.
To make a donation or to
view additional materials
from hundreds of MIT courses,
visit MIT OpenCourseWare
at ocw.mit.edu.
PHILIPPE RIGOLLET: This
chapter is a natural capstone
chapter for this entire course.
We'll see some of
the things we've
seen during maximum likelihood
and some of the things
we've seen during linear
regression, some of the things
we've seen in terms of the basic
modeling that we've had before.
We're not going to go back
to much inference questions.
It's really going to
be about modeling.
And in a way, generalized
linear models, as the word says,
are just a generalization
of linear models.
And they're actually
extremely useful.
They're often forgotten
about and people just
jump onto machine learning
and sophisticated techniques.
But those things do
the job quite well.
So let's see in what sense
they are a generalization
of the linear models.
So remember, the linear
model looked like this.
We said that y was equal to x
transpose beta plus epsilon,
right?
That was our linear
regression model.
Another way
to say this is that--
let's assume
that those epsilons were, say,
Gaussian with mean 0 and
identity covariance matrix--
the conditional distribution
of y given x is
a Gaussian with mean
x transpose beta and variance--
well, we had a sigma squared,
which I will forget as usual--
x transpose beta and
then sigma squared.
OK, so here, we just assumed
that-- so what regression
is doing is just saying I'm trying
to explain y as a function of x.
Given x, I'm assuming a
distribution for the y.
And this x is just
going to be here
to help me model what the mean
of this Gaussian is, right?
I mean, I could have
something crazy.
I could have something
that looks like: y given
x is Gaussian with mean x transpose
beta, and then the covariance could
be some other thing
which looks like, I don't
know, some x transpose
gamma squared
times, I don't know,
x x transpose plus identity--
some crazy thing that
depends on x here, right?
And we deliberately assumed that
all the thing that depends on x
shows up in the mean, OK?
And so what I have
here is that y
given x is a Gaussian
with a mean that
depends on x and covariance
matrix sigma square identity.
Now the linear model
assumed a very specific form
for the mean.
It said I want the
mean to be equal to x
transpose beta
which, remember, was
the sum from, say, j equals
1 to p of beta j xj, right?
It's where the xj's are
the coordinates of x.
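A minimal sketch of this setup, with made-up dimensions and coefficients: simulate y equal to x transpose beta plus Gaussian noise, and recover beta by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 3
X = rng.normal(size=(n, p))        # design matrix, rows are the x_i's
beta = np.array([1.0, -2.0, 0.5])  # true coefficients (made up)
sigma = 0.1

# y | x ~ N(x' beta, sigma^2): all dependence on x is in the mean
y = X @ beta + sigma * rng.normal(size=n)

# least squares: solve (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, -2.0, 0.5]
```
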
But I could do something
also more complicated, right?
I could have something
that looks like, instead--
replace this by, I don't know,
the sum of beta j log of x to the j
divided by x to the j squared
or something like this, right?
I could do this as well.
So there's two things
that we have assumed.
The first one is
that when I look
at the conditional
distribution of y given x,
x affects only the mean.
I also assume that
it was Gaussian
and that it affects
only the mean.
And the mean is affected
in a very specific way,
which is linear in x, right?
So this is
essentially the things
we're going to try to relax.
So the first thing
that we assumed--
the fact that y was Gaussian and
had only its mean
dependent on x-- is what's
called the random component.
It just says that the
response variables, you know,
it sort of makes sense to
assume that they're Gaussian.
And everything was
essentially captured, right?
So there's this
property of Gaussians
that if you tell me-- if
the variance is known,
all you need to tell
me to understand
exactly what the distribution
of a Gaussian is,
all you need to tell me
is its expected value.
All right, so
that's this mu of x.
And the second thing is that
we have this link that says,
well, I need to find a way
to use my x's to explain
this mu, and the
link was exactly:
mu of x was equal
to x transpose beta.
Now we are talking about
generalized linear models.
So this part here where mu
of x is of the form-- the way
I want my beta, my x,
to show up is linear,
this will never be a question.
In principle, I could
add a third point,
which is just question this
part, the fact that mu of x
is x transpose beta.
I could have some more
complicated, nonlinear function
of x.
And then we'll never do
that because we're talking
about generalized linear model.
The only thing with generalize
are the random component,
the conditional
distribution of y given x,
and the link that just says,
well, once you actually tell me
that the only thing I need
to figure out is the mean,
I'm just going to slap it
exactly these x transpose beta
thing without any transformation
of x transpose beta.
So those are the two things.
It will become
clear what I mean.
This sounds like a
tautology, but let's just
see how we could extend that.
So what we're going to do in
generalized linear models--
right, so when I
talk about GLMs,
the first thing I'm
going to do with my x
is turn it into some
x transpose beta.
And that's just
the linear part, right?
I'm not going to
be able to change.
That's the way it works.
I'm not going to do
anything non-linear.
But the two things
I'm going to change
is this random
component, which is
that y, which used to be some
Gaussian with mean mu of x
here in sigma squared--
so y given x, sorry--
this is going to become y given
x follows some distribution.
And I'm not going to
allow any distribution.
I want something that comes
from the exponential family.
Who knows what the exponential
family of distribution is?
This is not the same thing as
the exponential distribution.
It's a family of distributions.
All right, so we'll see that.
It's-- wow.
What can that be?
Oh yeah, that's
actually [INAUDIBLE]..
So-- I'm sorry?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: I'm
in presentation mode.
That should not happen.
OK, so hopefully, this is muted.
So essentially, this is going
to be a family of distributions.
And what makes them
exponential typically
is that there's an
exponential that
shows up in the definition
of the density, all right?
We'll see that the
Gaussian belongs
to the exponential family.
But they're slightly
less expected ones
because there's this crazy
thing that a to the x
is exponential of x log a, which
makes the exponential show up
without really being there.
So if there's an
exponential of some power,
it's going to show up.
But it's more than that.
So we'll actually come
to this particular family
of distribution.
Why this particular family?
Because in a way,
everything we've
done for the linear
model with Gaussian
is going to extend fairly
naturally to this family.
All right, and it actually
also, because it encompasses
pretty much everything,
all the distributions
we've discussed before.
All right, so the second thing
that I want to question--
right, so before,
we just said, well,
mu of x was directly
equal to this thing.
Mu of x was directly
x transpose beta.
So I knew I was going to
have an x transpose beta
and I said, well, I could do
something with this x transpose
beta before I used it to
explain the expected value.
But I'm actually
taking it like that.
Here, we're going to say, let's
extend this to some function
is equal to this thing.
Now admittedly, this is
not the most natural way
to think about it.
What you would probably
feel more comfortable doing
is write something like
mu of x is a function.
Let's call it f of
x transpose beta.
But here, I decide
to call f g inverse.
OK, that's just my g inverse.
Yes.
AUDIENCE: Is this different
than just [INAUDIBLE]
PHILIPPE RIGOLLET: Yeah.
I mean, what transformation
you want to put on your x's?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Oh
no, certainly not, right?
I mean, if I give you-- if I
force you to work with x1 plus
x2, you cannot work with
any function of x1 plus any
function of x2, right?
So this is different.
All right, so-- yeah.
The transformation would
be just the simple part
of your linear
regression problem,
where you would take your
x's, transform them,
and then just apply
another linear regression.
This is genuinely new.
Any other question?
All right, so this
function g and the reason
why I sort of have to, like,
stick to this slightly less
natural way of defining
it is because that's
g that gets a name, not g
inverse that gets a name.
And the name of g is
the link function.
So if I want to give you a
generalized linear model,
I need to give you
two ingredients.
The first one is the
random component,
which is the distribution
of y given x.
And it can be anything in what's
called the exponential family
of distributions.
So for example, I
could say: y given
x is Gaussian with mean
mu of x and sigma squared identity.
But I can also
tell you: y given x
is gamma with shape parameter
equal to alpha of x, OK?
I could do some weird
things like this.
And the second thing is I need
to give you a link function.
And the link function is
going to become very clear
how you pick a link function.
And the only reason that you
actually pick a link function
is because of compatibility.
This mu of x, I call
it mu because mu of x
is always the conditional
expectation of y given x,
always, which means
that let's think
of y as being a Bernoulli
random variable.
Where does mu of x live?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: 0, 1, right?
That's the expectation
of a Bernoulli.
It's just the probability
that my coin flip gives me 1.
So it's a number
between 0 and 1.
But this guy right here, if
my x's are anything, right--
think of any body
measurements plus [INAUDIBLE]
linear combinations with
arbitrarily large coefficients.
This thing can be
any real number.
So the link function, what
it's effectively going to do
is make those two
things compatible.
It's going to take
my number which,
for example, is constrained
to be between 0 and 1
and map it into the
entire real line.
If I have mu which is forced
to be positive, for example,
in an exponential distribution,
the mean is positive, right?
That's, say, I don't
know, the inter-arrival time
for a Poisson process.
This thing is known to be
positive for an exponential.
I need to map something
that's positive
to the entire real line.
I need a function that
takes something positive
and [INAUDIBLE] everywhere.
So we'll see.
By the end of this
chapter, you will
have 100 ways of doing this, but
there are some more traditional
ones [INAUDIBLE].
So before we go any further,
I gave you the example
of a Bernoulli random variable.
Let's see a few examples
that actually fit there.
Yes.
AUDIENCE: Will it come up
later [INAUDIBLE] already know
why do we need the
transformer [INAUDIBLE] why
don't [INAUDIBLE]
PHILIPPE RIGOLLET:
Well actually, this
will not come up later.
It should be very
clear from here
because if I actually
have a model,
I just want it to
be plausible, right?
I mean, what happens if I
suddenly decide that my--
so this is what's
going to happen.
You're going to have only
data to fit this model.
Let's say you actually
forget about this thing here.
You can always do this, right?
You can always say I'm
going to pretend my y's just
happen to be the realizations
of some Gaussians that
happen to be 0 or 1 only.
You can always, like, stuff that
in some linear model, right?
You will have some least
squares estimated for beta.
And it's going to be fine.
For all the points
that you see, it
will definitely put
some number that's
actually between 0 and 1.
So this is what your picture
is going to look like.
You're going to have a
bunch of values for x.
This is your y.
And for different-- so
these are the values
of x that you will get.
And for a y, you will see
either a 0 or a 1, right?
Right, that's what your
Bernoulli dataset would look
like with a one dimensional x.
Now if you do least squares
on this, you will find this.
And for this guy,
this line certainly
takes values between 0 and 1.
But let's say now
you get an x here.
You're going to actually
start pretending
that the probability it spits
out 1 conditionally on x
is like 1.2, and that's
going to be weird.
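To see this failure concretely, here is a hypothetical sketch: fit least squares to 0/1 responses (taking the probability of 1 given x to be x itself, a toy choice), then evaluate the fitted line at an x outside the observed range.

```python
import numpy as np

rng = np.random.default_rng(1)

# one-dimensional x in [0, 1] with 0/1 responses
x = rng.uniform(0.0, 1.0, size=100)
y = (rng.uniform(size=100) < x).astype(float)  # P(y=1|x) = x, a toy model

# least squares fit with an intercept
A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# at a new x outside the observed range, the fitted "probability"
# can leave [0, 1], which is exactly the problem described above
print(b0 + b1 * 2.0)
```
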
Any other questions?
All right, so let's
start with some examples.
Right, I mean, you get so used
to them through this course.
So the first one is--
so all these examples are taken
from books-- there are a few
books on generalized
linear models.
And there's tons of
applications that you can see.
Those are extremely
versatile, and as soon
as you want to do modeling
to explain some y given x,
you sort of need to
do that if you want
to go beyond linear models.
So this was in the
disease occurring rate.
So you have a disease
epidemic and you
want to basically model
the expected number
of new cases given--
at a certain time, OK?
So you have time that progresses
for each of your observations.
Each of your observations
is a time stamp--
say, I don't know, the 20th day.
And your response is
the number of new cases.
And you're going to actually
put your model directly
on mu, right?
When I looked at
this, everything here
was on mu itself, on
the expected, right?
Mu of x is always the expected--
the conditional
expectation of y given x,
right?
So all I need to model
is this expected value.
So this mu I'm going
to actually say--
so I look at some parameters,
and it says, well,
it increases exponentially.
So I want to say I have some
sort of exponential trend.
I can parametrize
that in several ways.
And the two parameters
I want to slap in
is, like, some sort of gamma,
which is just the coefficient.
And then there's some rate
delta that's in the exponential.
So if I tell you
it's exponential,
that's a nice family
of functions you
might want to think about, OK?
So here, mu of x, if I want
to keep the notation, x
is gamma exponential
delta x, right?
Except that here, my x
are t1, t2, t3, et cetera.
And I want to find what the
parameters gamma and delta are
because I want to be
able to maybe compare
different epidemics and see if
they have the same parameter
or maybe just do some
prediction based on the data
that I have without-- to
extrapolate in the future.
So here, clearly mu of
x is not of the form
x transpose beta, right?
That's not x
transpose beta at all.
And it's actually not even a
function of x transpose beta,
right?
There's two parameters,
gamma and delta,
and it's not of the form.
So here we have x,
which is 1 and x, right?
I have two parameters.
So what I do here
is that I say, well,
first, let me transform
mu in such a way
that I can hope to see
something that's linear.
So if I transform mu, I'm
going to have log of mu, which
is log of this thing, right?
So log of mu of
x is equal, well,
to log of gamma plus
log of exponential delta
x, which is delta x.
And now this thing is
actually linear in x.
So I have that this
guy is my beta 0
times 1, and this guy
is beta 1 times x.
OK, so that looks
like a linear model.
I just have to change
my parameters--
my parameter beta 0 becomes
the log of gamma and beta 1
becomes delta itself.
And the reason why we do this
is because, well, the way
we put those gamma and those
delta was just so that we
have some parametrization.
It just so happens that if
we want this to be linear,
we need to just change the
parametrization itself.
This is going to
have some effects.
We know that it's going
to have some effect
on the Fisher information.
It's going to have a bunch of
effects to change those things.
But that's what needs
to be done to have
a generalized linear model.
Now here, the
function that I took
to turn it into something
that's linear is simple.
It came directly from some
natural thing I would do here,
which is taking the log.
And so the function g,
the link that I take,
is called the log
link very creatively.
And it's just the
function that I
apply to mu so that I see
something that's linear
and that looks like this.
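Numerically, the reparametrization can be checked in a few lines; the gamma and delta below are hypothetical values.

```python
import numpy as np

gamma, delta = 2.0, 0.3        # hypothetical epidemic parameters
t = np.arange(1.0, 11.0)       # the time stamps t_1, ..., t_10

mu = gamma * np.exp(delta * t)  # exponential trend for the mean

# the log link makes the mean linear: log mu = log(gamma) + delta * t
beta0, beta1 = np.log(gamma), delta
assert np.allclose(np.log(mu), beta0 + beta1 * t)
```
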
So now this only tells me how
to deal with the link function.
But I still have to deal with
point one, the random component.
And this, again, is
just some modeling.
Given some data,
some random data,
what distribution do you choose
to explain the randomness?
And this-- I mean,
unless there's no choice,
you know, it's just a
matter of practice, right?
I mean, why would it be
Gaussian and not, you know,
doubly exponential?
This is-- there's matters
of convenience that
come into this, and there's
just matter of experience
that come into this.
You know, I remember, when
you chat with engineers,
they have a very
good notion of what
the distribution should be.
They have Weibull distributions.
You know, they do optics
and things like this.
So there are some distributions
that just come up, and sometimes
they just have to work.
Now here what do we have?
The thing we're
trying to measure, y--
as we said, so mu
is the expectation,
the conditional
expectation, of y given x.
But y is the number
of new cases, right?
Well, it's a "number of."
And the first thing
you should think
of when you think
about a "number of,"
if it were bounded above, you
would think binomial, maybe.
But here, it's just a number.
So you think Poisson.
That's how insurers think.
I have a number of, you
know, claims per year.
This is a Poisson distribution.
And hopefully they can model
the conditional distribution
of the number of claims given
everything that they actually
ask you in their surveys.
All right, so now you have
this Poisson distribution.
And that's just the
modeling assumption.
There's no particular
reason why you
should do this except
that, you know,
that might be a good idea.
And the expected
value of your Poisson
has to be this mu i, OK?
At time i.
Any question about this slide?
OK, so let's switch
to another example.
Another example is the
so-called prey capture rate.
So here, what
you're interested in
is the rate of capture of
preys, yi, for a given predator.
And you have xi, which
is your explanatory variable.
And this is just
the density of prey.
So you're trying to explain the
rate of capture of preys given
the density of the prey, OK?
And so you need to find
some sort of relationship
between the two.
And here again,
you talk to experts
and what they tell you
is that, well, it's
going to be increasing, right?
I mean, animals like predators
are going to just eat more
if there's more preys.
But at some point,
they're just going
to level off because they're
going to be [INAUDIBLE] full
and they're going to stop
capturing those preys.
And you're just going to
have some phenomenon that
looks like this.
So here is a curve that
sort of makes sense, right?
As your capture rate goes
up from 0, you're increasing,
and then you have
this kind of [INAUDIBLE] function
that says, you know, at
some point it levels off.
OK, so here, one way I could--
I mean, there's again
many ways I could just
model a function
that looks like this.
But a simple one that
has only two parameters
is this one, where mu i is
this a function of xi where
I have some parameter alpha
here and some parameter h here.
OK, so there's clearly--
so this function, there's one
that essentially tells you--
so this thing starts
at 0 for sure.
And essentially,
alpha tells you how
sharp this thing
is, and h tells you
at which points you end here.
Well, it's not exactly what
those values are equal to,
but that tells you this.
OK, so, you know-- simple, and--
well, no, OK.
Sorry-- alpha is actually
the maximum capture rate,
and h represents
the prey density at which
the capture rate is alpha over 2.
So that's the half-saturation point.
OK, so there's actual
value [INAUDIBLE]..
All right, so now I
have this function.
It's certainly not linear--
I don't see it as a
linear function of x.
So I need to find something that
looks like a linear function of x, OK?
So then here, there's no log.
There's no-- well, I could
actually take a log here.
But I would have log of
x and log of x plus h.
So that would be weird.
So what we propose to
do here is to look,
rather than looking at mu
i, we look 1 over mu i.
Right, and so
since your function
was mu i, when you
take 1 over mu i,
you get h plus xi divided
by alpha xi, which
is h over alpha times one
over xi plus 1 over alpha.
And now if I'm willing to
make this transformation
of variables and say,
actually, I don't--
my x, whether it's
the density of prey
or the inverse density of
prey, it really doesn't matter.
I can always make
this transformation
when the data comes.
Then I'm actually just
going to think of this
as being some linear
function beta 0 plus beta 1,
which is this guy,
times 1 over xi.
And now my new variable
becomes 1 over xi.
And now it's linear.
And the transformation
I had to take
was this 1 over x, which is
called the reciprocal link, OK?
You can probably guess what the
exponential link is going to be
and things like this, all right?
So we'll talk about other
links that have slightly less
obvious names.
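The same check works for the reciprocal link; the alpha and h below are hypothetical values.

```python
import numpy as np

alpha, h = 5.0, 2.0             # hypothetical max rate and half-saturation
x = np.linspace(0.5, 20.0, 50)  # prey densities

mu = alpha * x / (h + x)        # saturating capture-rate curve

# reciprocal link: 1/mu = (1/alpha) + (h/alpha) * (1/x), linear in 1/x
beta0, beta1 = 1.0 / alpha, h / alpha
assert np.allclose(1.0 / mu, beta0 + beta1 / x)
```
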
Now again, modeling, right?
So this was the
random component.
This was the easy part.
Now I need to just pour
in some domain knowledge
about how I think this
y, which is the rate
of capture of preys,
is actually changing,
and what the randomness
of the thing around its mean is.
And you know, something
that-- so that
comes from this textbook.
The standard deviation
of the capture rate
might be approximately
proportional to the mean rate.
You need to find a
distribution that
actually has this property.
And it turns out
that this happens
for gamma distributions, right?
In gamma distributions--
just like, say,
for the Poisson distribution, the--
well, for a Poisson, the variance
and the mean are of the same order.
Here it's the standard
deviation that's
of the same order as
the mean for gammas.
And it's a positive
distribution as well.
So here is a candidate.
Now since we're
sort of constrained
to work under the exponential
family of distributions,
then you can just
go through your list
and just decide which
one works best for you.
All right, third example--
so here we have binary response.
Here, essentially, the
binary response variable
indicates the
presence or absence
of the postoperative deformity
kyphosis in children.
And here, rather than having
one covariate-- which, in
the first example, was
time, and in the second example
was the density-- here
there are three covariates
that you measure on children.
The first one is
age of the child
and the second one is
the number of vertebrae
involved in the operation.
And the third one is
the start of the range,
right-- so where
it is on the spine.
OK, so the response
variable here is, you know,
did it work or not, right?
I mean, that's very simple.
And so here, it's nice
because the random component
is the easiest one.
As I said, any random variable
that takes only two outcomes
must be a Bernoulli, right?
So that's nice there's no
modeling going on here.
So you know that y given x
is going to be Bernoulli,
but of course, all
your efforts are
going to try to understand
what the conditional mean
of your Bernoulli, what
the conditional probability
of being 1 is going to be, OK?
And so in particular--
so I'm just-- here,
I'm spelling it out before
we close those examples.
I cannot say that mu of x is x
transpose beta, for exactly the
picture that I drew
for you here, right?
There's just no
way here-- the goal
of doing this is certainly
to be able to extrapolate
for yet unseen children
whether this is something
that we should be doing.
And maybe the range
of x is actually
going to be slightly outside.
And so, OK, I don't
want to see it have
a negative probability of
the outcome-- or, sorry,
one that's
larger than one.
So I need to make
this transformation.
So what I need to do is
to transform mu, which,
we know, is only a
number between 0 and 1.
And we need to transform
it in such a way
that it maps the
entire real line
or reciprocally to say that--
or inversely, I should say--
that f of x
transpose beta should
be a number between 0 and 1.
I need to find a function
that takes any real number
and maps it into 0 and 1.
And we'll see that
again, but you
have an army of functions
that do that for you.
What are those functions?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: I'm sorry?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Trait?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Oh.
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Yeah, I want
them to be invertible, right?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: I
have an army of function.
I'm not asking for one
soldier in this army.
I want the name of this army.
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Well, they're
not really invertible either,
right?
So they're actually in
[INAUDIBLE] textbook.
Because remember,
statisticians don't
know how to integrate
functions, but they
know how to turn a function
into a Gaussian integral.
So we know it integrates
to 1 and things like this.
Same thing here--
we don't know how
to build functions that
are invertible and map
the entire real line
to 0, 1, but there's
all the cumulative distribution
functions that do that for us.
So I can use any of
those guys, and that's
what I'm going to
be doing, actually.
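For instance, two classical choices are the logistic CDF (whose inverse is the logit) and the standard Gaussian CDF; a minimal sketch:

```python
import math

def logit(mu):
    # log-odds: maps (0, 1) onto the whole real line
    return math.log(mu / (1.0 - mu))

def expit(t):
    # logistic CDF, the inverse of the logit: maps R into (0, 1)
    return 1.0 / (1.0 + math.exp(-t))

def gaussian_cdf(t):
    # standard Gaussian CDF: another map from R into (0, 1)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# both are increasing bijections between R and (0, 1)
assert abs(expit(logit(0.3)) - 0.3) < 1e-12
assert 0.0 < gaussian_cdf(-5.0) < gaussian_cdf(5.0) < 1.0
```
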
All right, so just
to recap what I just
said as we were speaking, so
normal linear model is not
appropriate for these examples
if only because the response
variable is not
necessarily Gaussian
and also because the
linear model has to be--
the mean has to be transformed
before I can actually
apply a linear model for all
these plausible nonlinear
models that I
actually came up with.
OK, so the family
we're going to go for
is the exponential
family of distributions.
And we're going to
be able to show--
so one of the nice
parts of this is
to actually compute
maximum likelihood
estimators for those, right?
In the linear model--
like, in the Gaussian
linear model-- maximum likelihood
was as nice as it gets, right?
This actually was the
least squares estimator.
We had a closed form:
x transpose x inverse
x transpose y,
and that was it, OK?
We had to just take
one derivative.
Here, we're going to have a
generally concave likelihood.
We're not going to
be able to actually
solve this thing
directly in closed form
unless it's Gaussian,
but we will have--
we'll see actually
how this is not just
a black box optimization
of a concave function.
We have a lot of properties
of this concave function,
and we will be able to show
some iterative algorithms.
We'll basically see how, when
you open the box of convex
optimization, you will actually
be able to see how things work
and actually implement
it using least squares.
So each iteration of
this iterative algorithm
will essentially
be a least squares,
and that's actually
quite [INAUDIBLE]..
So, it's very demonstrative
of statisticians
being pretty
ingenious, so that they
don't have to call in
some optimization software
but can just repeatedly
call their least squares
oracle within their
statistical software.
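As a preview of that idea, here is a hedged sketch for the Bernoulli case (logistic regression): each pass of the classical iteratively reweighted least squares scheme is one weighted least-squares solve. The function name and details are my own illustration, not the lecture's derivation.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Logistic regression via iteratively reweighted least squares.

    Each pass is a single weighted least-squares solve -- the
    "least squares oracle" called repeatedly, as described above.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor x' beta
        mu = 1.0 / (1.0 + np.exp(-eta))     # inverse logit link
        w = mu * (1.0 - mu)                 # Bernoulli variance weights
        z = eta + (y - mu) / w              # working response
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
    return beta
```

On well-behaved, non-separable data this converges in a handful of iterations; on perfectly separated data the weights degenerate and the iteration diverges, which is a known caveat.
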
OK, so what is the
exponential family, right?
I promised to do the
exponential family.
Before we go into
this, let me just
tell you something about
exponential families,
and what's the only
thing to differentiate
an exponential family from
all possible distributions?
An exponential family has
two parameters, right?
And those are not
really parameters,
but there's this theta parameter
of my distribution, OK?
So it's going to be
indexed by some parameter.
Here, I'm only talking
about the distribution
of, say, some random variable
or some random vector, OK?
So here in this slide, you see
that the parameter theta that
indexes those distributions
is k-dimensional,
and the space of the x's
that I'm looking at-- so
that should really be y, right?
What I'm going to
plug in here is
the conditional distribution
of y given x and theta is
going to depend on x.
But this really is the y.
That's their distribution
of the response variable.
And so this is in R q, right?
So I'm going to
assume that y
is q-dimensional.
Clearly soon, q is
going to be equal to 1,
but I can define those
things generally.
OK, so I have this.
I have to tell you
what this looks like.
And let's assume that this is
a probability density function.
So this, right this notation,
the fact that I just
put my theta in
subscript, is just
for me to remember that
this is the variable that
indicates the random variable,
and this is just the parameter.
But I could just write it as a
function of theta and x, right?
This is just going to be--
right, if you were in
multivariable
calc, you would have
two variables, theta
and x, and you would
need to give me a function.
Now think of all--
think of x and theta as being
one dimensional at this point.
Think of all the
functions that can
be depending on theta and x.
There's many of them.
And in particular, there's many
ways theta and x can interact.
What the exponential
family does for you
is that it restricts
the way these things
can actually interact
with each other.
It's essentially
saying the following.
It's saying this is going to
be of the form exponential--
so this exponential is
really not much because I
could put a log next to it.
But what I want is that
the way theta and x
interact has to be of
the form theta times x
in an exponential, OK?
So that's the
simplest-- that's one
of the ways you can think of
them interacting is you just
the product of the two.
Now clearly, this is
not a very rich family.
So what I'm allowing
myself is to just slap
on some terms that depend only
on theta and depend only on x.
So let's just call this thing, I
don't know, f of x, g of theta.
OK, so here, I've restricted the
way theta and x can interact.
So I have something
that depends only
on x, something that
depends only on theta.
And here, I have this
very specific interaction.
And that's all that exponential
families are doing for you, OK?
So if we go back to this slide,
this is much more general,
right? If I want to go from
theta and x in R
to theta in R k and x in R q,
I cannot take the product
of theta and x.
I cannot even take the inner
product between theta and x
because they're not even
of compatible dimensions.
But what I can do is to first
map my theta into something
and map my x into something
so that I actually end up
having the same dimensions.
And then I can take
the inner product.
That's the natural
generalization
of this simple product.
OK, so what I have is--
right, so if I want
to go from theta
to x, when I'm going to first
do is I'm going to take theta,
eta of theta--
so let's say eta1 of
theta to eta k of theta.
And then I'm going
to actually take
x becomes t1 of x all
the way to tk of x.
And what I'm going to do
is take the inner product--
so let's call this eta
and let's call this t.
And I'm going to take the inner
product of eta and t, which
is just the sum from j equal
1 to k of eta j of theta times
tj of x.
OK, so that's just a way to say
I want this simple interaction
but in higher dimension.
The simplest way I can actually
make those things happen
is just by taking inner product.
OK, and so now what
it's telling me
is that the distribution-- so
I want the exponential times
something that depends only
on theta and something that
depends only on x.
And so what it tells
me is that when
I'm going to take
p theta of x, it's
just going to be something
which is exponential
of the sum from j equal 1
to k of eta j of theta times tj of x.
And then I'm going to have a
function that depends only--
so let me write it for now
as c of theta-- and then
a function that
depends only on x.
Let me call it h of x.
And for convenience,
there's no particular reason
why I do that.
I'm taking this
function c of theta
and I'm just actually
pushing it in there.
So I can write c of theta as
exponential minus log of 1
over c of theta, right?
And now I have exponential
times exponential.
So I push it in, and
this thing actually
looks like exponential sum
from j equal 1 to k of eta
j theta tj of x minus log 1
over c of theta times h of x.
And this thing here, log 1 over
c of theta, I call actually
b of theta Because
c, I called it c.
But I can actually
directly call this guy b,
and I don't actually
care about c itself.
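So on the board, the density now reads, in one line (with b of theta standing in for log of 1 over c of theta):

```latex
p_\theta(x) = \exp\Big(\sum_{j=1}^{k} \eta_j(\theta)\, T_j(x) \;-\; b(\theta)\Big)\, h(x),
\qquad b(\theta) = \log\frac{1}{c(\theta)}.
```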
Now why don't I put back
also h of x in there?
Because h of x is
really here to just--
how to put it--
OK, h of x and b of theta
don't play the same role.
B of theta in many ways is a
normalizing constant, right?
I want this density
to integrate to 1.
If I did not have
this guy, I'm not
guaranteed that this
thing integrates to 1.
But by tweaking this function
b of theta or c of theta--
they're equivalent--
I can actually ensure that
this thing integrates to 1.
So b of theta is just
a normalizing constant.
H of x is something that's
going to be funny for us.
It's going to be
something that allows
us to be able to treat both
discrete and continuous
variables within the framework
of exponential families.
So for those that are
familiar with this,
this is essentially
saying that that h of x
is really just a
change of measure.
When I actually look at
the density of p of theta--
this is with respect
to some measure--
the fact that I just multiplied
by a function of x just
means that I'm not looking--
that this guy here
without h of x
is not the density with respect
to the original measure,
but it's the density with
respect to the distribution
that has h as a density.
That's all I'm saying, right?
So I can first transform my
x's and then take the density
with respect to that.
If you don't want to think
about densities or measures,
you don't have to.
This is just the way--
this is just the definition.
Is there any question
about this definition?
All right, so it
looks complicated,
but it's actually
essentially the simplest
way you could think about it.
You want to be able to
have x and theta interact
and you just say, I
want the interaction
to be of the form
exponential x times theta.
And if they're
higher dimensions,
I'm going to take
the exponential
of the function
of x inner product
with a function of theta.
All right, so I claimed
since the beginning
that the Gaussian
was such an example.
So let's just do it.
So is the Gaussian of the-- is
the interaction between theta
and x in a Gaussian of
the form of an inner product?
And the answer is yes.
Actually, whether or not
I know what the variance is, OK?
So let's start for the case
where I actually do not
know what the variance is.
So here, I have x is
n mu sigma squared.
This is all one dimensional.
And here, I'm going to assume
that my parameter is both mu
and sigma square.
OK, so what I need to do is
to have some function of mu,
some function of sigma squared,
and take an inner product
with some function of x and
some other function of x.
So I want to show that--
so p theta of x is what?
Well, it's one over
square root sigma 2 pi
exponential minus x minus mu
squared over 2 sigma squared,
right?
So that's just my
Gaussian density.
And I want to say that
this thing here-- so
clearly, the exponential
shows up already.
I want to show that this
is something that looks
like, you know, eta 1 of--
sorry, so that was-- yeah, eta
1 of, say, mu and sigma squared.
So I have only
two of those guys,
so I'm going to need
only two etas, right?
So I want it to be eta 1
of mu and sigma squared times t1
of x plus eta 2 of mu and sigma
squared times t2 of x, right?
So I want to have something
like that that shows up,
and the only things
that are left,
I want them to depend either
only on theta or only on x.
So to find that out,
we just need to expand.
OK, so I'm going to first put
everything into my exponential
and expand this guy.
So the first term here
is going to be minus x
squared over 2 sigma square.
The second term is
going to be minus mu
squared over two sigma squared.
And then the cross term is
going to be plus x mu divided
by sigma squared.
And then I'm going
to put this guy here.
So I have a minus log of
sigma square root 2 pi, OK?
OK, is this-- so this term
here contains an interaction
between X and the parameters.
This term here
contains an interaction
between X and the parameters.
So let me try to write
them in a way that I want.
This guy only depends
on the parameters,
this guy only depends
on the parameter.
So I'm going to
rearrange things.
And so I claim that this
is of the form x squared.
Well, let's say-- do--
who's getting the minus?
Eta, OK.
So it's x squared times
minus 1 over 2 sigma
squared plus x times mu
over sigma squared, right?
So that's this term here.
That's this term here.
Now I need to get this guy
here, and that's minus.
So I'm going to write
it like this-- minus,
and now I have mu
squared over 2 sigma
squared plus log sigma
square root 2 pi.
And now this thing is definitely
of the form t of x times--
did I call them the
right way or not?
Of course not.
OK, so that's going to
be t2 of x times eta
2 of theta.
This guy is going to be t1
of x times eta 1 of theta.
All right, so just a function
of theta times a function of x--
just a function of theta
times a function of x.
And the way they're combined is
just by summing them.
And this is going
to be my b of theta.
What is h of x?
AUDIENCE: 1.
PHILIPPE RIGOLLET: 1.
There's one thing I
can actually play with,
and this is something you're
going to have some three
choices, right?
This is not actually completely
determined here is that--
for example, so when I write
the log sigma square root 2 pi,
this is just log of sigma
plus log square root 2 pi.
So I have two choices here.
Either my b becomes
this guy, or--
so either I have
b of theta, which
is mu squared over 2 sigma
squared plus log sigma
square root 2 pi and h of
x is equal to 1, or I have
that b of theta is mu
square over 2 sigma squared
plus log sigma.
And h of x is equal to what?
Well, I can just push
this guy out, right?
I can push it out
of the exponential.
And so it's just square
root of 2 pi, which is
a function of x, technically.
I mean, it's a constant function
of x, but it's a function.
So you can see that it's
not completely clear
how you're going to do
the trade off, right?
So the constant terms can
go either in b or in h.
But you know, why bother with
tracking down b and h when
you can actually stuff
everything into one
and just call h one
and call it a day?
Right, so you can
just forget about h.
You know it's 1 and
think about the rest.
H won't matter actually for
estimation purposes or anything
like this.
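If you want to sanity-check this identification numerically, here is a small sketch. The names eta1, eta2, and b below are just my own labels for the quantities on the board, not code from the course:

```python
import numpy as np

# N(mu, sigma^2) written in exponential-family form, with h(x) = 1:
#   eta1 = mu / sigma^2        paired with  T1(x) = x
#   eta2 = -1 / (2 sigma^2)    paired with  T2(x) = x^2
#   b    = mu^2 / (2 sigma^2) + log(sigma * sqrt(2 pi))
mu, sigma = 1.3, 0.7
theta = (mu, sigma**2)

def eta1(mu, s2): return mu / s2
def eta2(mu, s2): return -1.0 / (2 * s2)
def b(mu, s2):    return mu**2 / (2 * s2) + np.log(np.sqrt(2 * np.pi * s2))

x = np.linspace(-3, 3, 7)
p_expfam = np.exp(eta1(*theta) * x + eta2(*theta) * x**2 - b(*theta))
p_gauss = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
assert np.allclose(p_expfam, p_gauss)  # same density, two ways of writing it
```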
All right, so that's basically
everything that's written.
When sigma square
is known, what's
happening is that this
guy here is no longer
a function of theta, right?
Agreed?
This is no longer a parameter.
When sigma square is known,
then theta is equal to mu only.
There's no sigma
square going on.
So this-- everything
depends on sigma square
can be thought of as a constant.
Think one.
So in particular, this
term here does not
belong in the interaction
between x and theta.
It belongs to h, right?
So if sigma is known, then this
guy is only a function of x.
So h of x becomes exponential
of minus x squared
over 2 sigma squared, right?
That's just a function of x.
Is that clear?
So if you complete this
computation, what you're
going to get is that your new
one-parameter thing is that p
theta of x is now equal to
exponential of x times mu
over sigma squared minus--
well, it's still the same thing.
And then you have your
h of x that comes out--
x squared over 2 sigma squared.
OK, so that's my h of x.
That's still my b of theta.
And this is my t1 of x.
And this is my eta one of theta.
And remember, theta is just
equal to mu in this case.
So if I ask you prove that
this distribution belongs
to an exponential family,
you just have to work it out.
Typically, it's expanding what's
in the exponential and see
what's--
and just write it in
this form and identify
all the components, right?
So here, notice those guys
don't even get an index anymore
because there's
just one of them.
So I wrote eta 1 and t1, but
it's really just eta and t.
Oh sorry, this guy also goes.
This is also a constant, right?
So I can actually
just put the 1 divided
by sigma square root 2 pi in there.
So h of x is what, actually?
Is it the density of--
AUDIENCE: Standard [INAUDIBLE].
PHILIPPE RIGOLLET:
It's not standard.
It's centered.
It has mean 0.
But it has variance
sigma squared, right?
But it's the density
of a Gaussian.
And this is what I
meant when I said
h of x is really just telling
you with respect to which
distribution, which measure
you're taking the density.
And so this thing here
is really telling you
that the density of my
Gaussian with mean mu
with respect
to a centered Gaussian
is this guy, right?
That's what it means.
If this thing ends
up being a density,
it just means that now you
just have a new measure, which
is this density.
So it's just saying
that the density
of the Gaussian with
mean mu with respect
to the Gaussian with mean 0
is just this [INAUDIBLE] here.
All right, so let's move on.
So here, as I said,
you could actually
do all these computations
and forget about the fact
that x is continuous.
You can actually do it with PMFs
and do it for x is discrete.
This actually also tells
you that if you can
get the same form for
your density--
exponential of this product,
where the interaction
between theta
and x is just
this product,
times a function only of theta
and a function only of x--
then for the PMF, it also works.
OK, so I claim
that the Bernoulli
belongs to this family.
So the PMF of a Bernoulli--
we say parameter p is p to the
x 1 minus p to the 1 minus x,
right?
Because we know so that's
only for x equals 0 or 1.
And the reason is because
when x is equal to 0,
this is 1 minus p.
When x is equal to
1, this is p.
OK, we've seen that
when we're looking
at likelihoods for Bernoullis.
OK, this is not clear this is
going to look like this at all.
But let's do it.
OK, so what does
this thing look like?
Well, the first
thing I want to do
is to make an
exponential show up.
So what I'm going
to write is I'm
going to write p to the x as
exponential x log p, right?
And so I'm going to do
that for the other one.
So this thing here--
so I'm going to get
exponential x log
p plus 1 minus x log 1 minus p.
So what I need to do is
to collect my terms in x
and my terms in whatever
parameters I have,
see here if theta is equal to p.
So if I do this,
what I end up having
is equal to exponential--
so the term in x is log
p minus log 1 minus p.
So that's x times
log p over 1 minus p.
And then the term
that stays is just 1
times log 1 minus p.
But I want to see this as
a minus something, right?
It was minus b of theta.
So I'm going to
write it as minus--
well, I can just keep the
plus, and I'm going to do--
and that's all [INAUDIBLE].
A-ha!
Well, this is of the
form exponential--
something that depends only on
x times something that depends
only on theta--
minus a function that
depends only on theta.
And then h of x is
equal to 1 again.
OK, so let's see.
So I have t1 of x is equal to x.
That's this guy.
Eta 1 of theta is equal
to log p over 1 minus p.
And b of theta is equal to
log 1 over 1 minus p, OK?
And h of x is equal
to 1, all right?
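Here is the same kind of numerical sanity check for the Bernoulli (again, the variable names are my own labels, not course code):

```python
import numpy as np

# Bernoulli(p) in exponential-family form:
#   T(x) = x,  eta = log(p / (1-p)),  b = log(1 / (1-p)),  h(x) = 1
p = 0.3
eta = np.log(p / (1 - p))
b = np.log(1 / (1 - p))
for x in (0, 1):
    pmf_expfam = np.exp(x * eta - b)
    pmf_direct = p**x * (1 - p)**(1 - x)  # 1-p when x=0, p when x=1
    assert abs(pmf_expfam - pmf_direct) < 1e-12
```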
You guys want to do
Poisson, or do you
want to have any homework?
It's a dilemma because that's
an easy homework versus
no homework at all but maybe
something more difficult. OK,
who wants to do it now?
Who does not want to
raise their hand now?
Who wants to raise
their hand now?
All right, so let's move on.
I'll just do-- do you want
to do the gammas instead
in the homework?
That's going to be fun.
I'm not even going to
propose to do the gammas.
And so this is the
gamma distribution.
It's brilliantly
called gamma because it
has the gamma function just
like the beta distribution had
the beta function in there.
They look very similar.
One is defined over r plus,
the positive real line.
And remember, the beta was
defined over the interval 0, 1.
And it's of the form x to
some power times exponential
of minus x to some--
times something, right?
So there's a function of
polynomial [INAUDIBLE]
x where the exponent
depends on the parameter.
And then there's the exponential
minus x times something that depends
on the parameters.
So this is going to also look
like some function of x--
sorry, like some
exponential distribution.
Can somebody guess what
is going to be t2 of x?
Oh, those are the functions of
x that show up in this product,
right?
Remember when we have this--
we just need to take
some transformations
of x so it looks linear in those
things and not in x itself.
Remember, we had x squared
and x, for example,
in the Gaussian case.
I don't know if
it's still there.
Yeah, it's still there, right?
t2 was x squared.
What do you think x is
going-- t2 of x here.
So here's a hint.
t1 is going to be x.
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET:
Yeah, [INAUDIBLE],
what is going to be t1?
Yeah, you can--
this one is taken.
This one is taken.
What?
Log x, right?
Because this x to
the a minus 1, I'm
going to write that as
exponential a minus 1 log x.
So basically, eta 1 is
going to be a minus 1.
Eta 2 is going to
be minus 1 over b--
well, actually the opposite.
And then you're going to have--
but this is actually
not too complicated.
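You can check this identification for the gamma numerically as well. This sketch uses the shape/scale parametrization with density x to the a minus 1 times e to the minus x over b, divided by gamma of a times b to the a; the names and normalization here are mine:

```python
import numpy as np
from math import lgamma

# Gamma(shape=a, scale=b) as exp( eta1*T1(x) + eta2*T2(x) - B ) with h(x) = 1:
#   eta1 = a - 1   paired with  T1(x) = log x
#   eta2 = -1 / b  paired with  T2(x) = x
#   B = log Gamma(a) + a log b  absorbs the normalization
a, bscale = 2.5, 1.7
x = np.linspace(0.1, 5, 9)
B = lgamma(a) + a * np.log(bscale)
f_expfam = np.exp((a - 1) * np.log(x) - x / bscale - B)
f_direct = x**(a - 1) * np.exp(-x / bscale) / (np.exp(lgamma(a)) * bscale**a)
assert np.allclose(f_expfam, f_direct)
```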
All right, then those
parameters get names.
a is the shape parameter,
b is the scale parameter.
It doesn't really matter.
You have other things that
are called the inverse gamma
distribution, which
has this form.
The difference is that
the parameter alpha
shows up negatively there, and
then there's the inverse Gaussian
distribution.
You know, just densities
you can come up with
and they just happened
to fall in this family.
And there's other ones that
you can actually put in there
that we've seen before.
The chi-square is actually
part of this family.
The beta distribution
is part of this family.
The binomial distribution
is part of this family.
Well, that's easy because
the Bernoulli was.
The negative binomial, which
is some stopping time--
the first time you hit a
certain number of successes
when you flip some
Bernoulli coins.
So you can check
for all of those,
and you will see that you can
actually write them as part
of the exponential family.
So the main goal
of this slide is
to convince you that
this is actually
a pretty broad range
of distributions
because it basically includes
everything we've seen
plus more, OK?
Yeah.
AUDIENCE: Is there any
example of a distribution
that comes up
pretty often that's
not in the exponential family?
PHILIPPE RIGOLLET:
Yeah, like uniform.
AUDIENCE: Oh, OK, so maybe
a bit more complicated than
[INAUDIBLE].
PHILIPPE RIGOLLET:
Anything that
has a support that
depends on the parameter
is not going to fall--
is not going to fit in there.
Right, and you can
actually convince yourself
why anything that
has the support that
does not-- that depends
on the parameter
is not going to be
part of this guy.
It's kind of a hard thing to--
in fact, to prove that it's
not is kind of a
little difficult,
but the way you can convince
yourself is that remember,
the only interaction between
x and theta that I allowed
was taking the
product of those guys
and then the exponential, right?
If you have something that
depends on some parameter--
let's say you're going to see
something that looks like this.
Right, for uniform,
it looks like this.
Well, this is not of the form
exponential x times theta.
There's an interaction
between x and theta here,
but it's actually
certainly not of the form
x exponential x times theta.
So this is definitely
not going to be
part of the exponential family.
And every time you start
doing things like that,
it's just not going to happen.
Actually, to be fair,
I'm not even sure
that all these
guys, when you allow
them to have all
their parameters free,
are actually going
to be part of this.
For example-- the
beta probably is,
but I'm not actually
entirely convinced.
There's books on
exponential families.
All right, so let's go back.
So here, we've put a lot
of effort understanding
how big, how much wider than
the Gaussian distribution
can we think of for the
conditional distribution
of our response y given x.
So let's go back to the
generalized linear models,
right?
So [INAUDIBLE] said, OK,
the random component?
y has to be part of
some exponential family
distribution-- check.
We know what this means.
So now I have to
understand two things.
I have to understand what
is the expectation, right?
Because that's actually
what I model, right?
I take the expectation, the
conditional expectation,
of y given x.
So I need to understand
given this guy,
it would be nice if you had some
simple rules that would tell me
exactly what the expectation
is rather than having to do it
over and over again, right?
If I told you,
here's a Gaussian,
compute the
expectation, every time
you had to work that out would
be slightly painful.
So hopefully, this thing
being simple enough--
we've actually
selected a class that's
simple enough so that
we can have rules.
Whereas as soon as they give you
those parameters t1, t2, eta 1,
eta 2, b and h, you can
actually have some simple rules
to compute the mean and
variance and all those things.
And so in particular, I'm
interested in the mean,
and I'm going to have to
actually say, well, you know,
this mean has to be mapped
into the whole real line.
So I can actually talk
about modeling this function
of the mean as x transpose beta.
And we saw that for
the [INAUDIBLE] dataset
or whatever other data sets.
You actually can-- you can
actually do this using the log
of the reciprocal or for the--
oh, actually, we didn't
do it for the Bernoulli.
We'll come to this.
This is the most important
one, and that's called
a logit or a logistic link.
But before we go there,
this was actually
a very broad family, right?
When I wrote this thing on the
bottom board-- it's gone now,
but when I wrote it
in the first place,
the only thing that I wrote
is I wanted x times theta.
Wouldn't it be nice if you
have some distribution that
was just x times theta,
not some function of x
times some function of theta?
The functions seem to be
here so that they actually
make things a little--
so the functions were here
so that I can actually
put a lot of functions there.
But first of all,
if I actually decide
to re-parametrize my
problem, I can always
assume-- if I'm
one dimensional, I
can always assume
that eta 1 of theta
becomes my new theta, right?
So this thing--
here for example,
I could say, well,
this is actually
the parameter of my Bernoulli.
Let me call this
guy theta, right?
I could do that.
Then I could say, well, here
I have x that shows up here.
And here since I'm talking
about the response,
I cannot really make
any transformations.
So here, I'm going to actually
talk about a specific family
for which this guy is not x
square or square root of x
or log of x or anything I want.
I'm just going to actually
look at distributions
for which this is x.
These exponential
families are called
canonical exponential families.
So in the canonical
exponential family, what I have
is that I have my x times theta.
I'm going to allow myself
some normalization factor phi,
and we'll see, for
example, that it's
very convenient when I talk
about the Gaussian, right?
Because even if I know--
yeah, even if I know this guy,
which I actually pull into my--
oh, that's over here, right?
Right, I know sigma squared.
But I don't want to
change my parameter
to be mu over sigma squared.
It's kind of painful.
So I just take mu, and
I'm going to keep this guy
as being this phi over there.
And it's called the
dispersion parameter
from a clear analogy
with the Gaussian, right?
That's the variance and
that's measuring dispersion.
OK, so here, what
I want is I'm going
to think throughout this class--
so phi may be known or not.
And depending--
when it's not known,
this actually might turn
into some exponential family
or it might not.
And the main reason is because
this b of theta over phi
is not necessarily a function
of theta over phi, right?
If I actually have phi
unknown, then y theta over phi
has to be--
this guy has to be
my new parameter.
And b might not be a function
of this new parameter.
OK, so in a way,
it may or may not,
but this is not really a
concern that we're going to have
because throughout
this class, we're
going to assume that
phi is known, OK?
Phi is going to be known all the
time, which means that this is
always an exponential family.
And it's just the
simplest one you
could think of-- one
dimensional parameter, one
dimensional response, and I just
have-- the product is just y--
or, we used to call it x,
but now I've switched to y--
times theta divided by phi, OK?
Should I write this or this is
clear to everyone what this is?
Let me write it somewhere so
we actually keep track of it
toward the [INAUDIBLE].
OK, so this is--
remember, we had all
the distributions.
And then here we had
the exponential family.
And now we have the
canonical exponential family.
It's actually
much, much smaller.
Well, actually, it's probably
sort of a good picture.
And what I have is that
my density or my PMF
is just exponential
y times theta minus b
of theta divided by phi.
And I have plus c of--
oh, yeah, plus c
of y, phi, which
means that this is really--
if phi is known, h of y
is just exponential
of c of y, phi, agreed?
Actually, this is the reason
why it's not necessarily
an exponential family.
It might not be that
this depends only on y.
It could depend on y and
phi in some annoying way
and I may not be
able to break it.
OK, but if phi is known,
this is just a function
that depends on y, agreed?
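In symbols, the canonical exponential family density or PMF being described here is:

```latex
f_\theta(y) = \exp\left(\frac{y\,\theta - b(\theta)}{\phi} + c(y,\phi)\right).
```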
In particular, I
think you need--
I hope you can convince
yourself that this is just
a subcase of everything
we've seen before.
So for example, the Gaussian
when the variance is known
is indeed of this form, right?
So we still have
it on the board.
So here is my y, right?
So then let me write
this as f theta of y.
So every x is replaceable
with y, blah, blah, blah.
This is this guy.
And now what I have is that
this is going to be my phi.
This is my parameter of theta.
So I'm definitely of the form
y times theta divided by phi.
And then here I
have a function b
that depends only on
theta over phi again.
So b of theta is mu
squared divided by 2.
OK, and then it's divided
by sigma squared.
And then I have
this extra stuff.
But I really don't care
what it is for now.
It's just something that depends
only on y and known stuff.
So it was just a function
of y just like my h.
I stuff everything in there.
The b, though, this
thing here, this
is actually what's
important because
in the canonical
family, if you think
about it, when you know phi--
right, the first term
is just y times theta
scaled by a known
constant.
The second term is b of theta
scaled by some known constant.
But b of theta is
what's going to make
the difference between the
Gaussian and Bernoullis
and gammas and betas--
this is all in this b
of theta. b of theta
contains everything
that's idiosyncratic to
this particular distribution.
And so this is going
to be important.
And we will see that b of theta
is going to capture information
about the mean,
about the variance,
about likelihood,
about everything.
Should I go through
this computation?
I mean, it's the same.
We've just done it, right?
So maybe it's probably better
if you can redo it on your own.
All right, so the canonical
exponential family also
has other distributions, right?
So there's the Gaussian
and there's the Poisson
and there's the Bernoulli.
But the other ones may not
be part of this, right?
In particular, think about
the gamma distribution.
We had this-- log x was one
of the things that showed up.
I mean, I cannot get
rid of this log x.
I mean, that's part of it
except if a is equal to 1
and I know it for sure, right?
So if a is equal to 1, then
I'm going to have a minus 1,
which is equal to 0.
So I'm going to have
a minus 1 times log
x, which is going to be just 0.
So log x is going
to vanish from here.
But if a is equal to 1,
then this distribution
is actually much nicer, and
it actually does not even
deserve the name gamma.
What is it if a is equal to 1?
It's an exponential, right?
Gamma 1 is equal to 1. x to
the a minus 1 is equal to 1.
b-- so I have exponential of
minus x over b divided by b.
So 1 over b-- call it lambda.
And this is just an
exponential distribution.
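A quick numerical check of this special case (my own sketch, with lambda equal to 1 over b):

```python
import numpy as np
from math import lgamma

# Gamma(shape=1, scale=b) collapses to the exponential density
# lambda * exp(-lambda x) with lambda = 1/b.
b = 2.0
lam = 1.0 / b
a = 1.0
x = np.linspace(0.1, 5, 9)
f_gamma = x**(a - 1) * np.exp(-x / b) / (np.exp(lgamma(a)) * b**a)
f_exp = lam * np.exp(-lam * x)
assert np.allclose(f_gamma, f_exp)  # Gamma(1) = 1 and x^0 = 1
```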
And so every time you're
going to see something--
so all these guys that
don't make it to this table,
they could be part of those
guys, but they just have another
name in this thing.
All right, so you could
compute the value of theta
for different values, right?
So again, you still have some
continuous or discrete ones.
This is my b of theta.
And I said this is actually
really what captures my theta.
This b is actually called
cumulant generating function,
OK?
I don't have time.
I could write five
slides to explain to you,
but it would just only
tell you why it's called
cumulant generating function.
It's also known as the log of
the moment generating function.
And the way it's called
cumulant generating function
is because if I start taking
successive derivatives
and evaluating them at 0, I
get the successive cumulants
of this distribution, which
are some transformation
of the moments.
AUDIENCE: What are you
talking about again?
PHILIPPE RIGOLLET:
The function b.
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: So this
is just normalization.
So this is just to tell
you I can compute this,
but I really don't care.
And obviously I don't care
about stuff that's complicated.
This is actually cute, and this
is what completes everything.
And the rest is just like
some general description.
It just tells
you that the range of y
is 0 to infinity, right?
And that is
essentially telling me
this is going to give me some
hints as to which link function
I should be using, right?
Because the range
of y tells me what
the range of expectation
of y is going to be.
All right, so here, it
tells me that the range of y
is between 0 and 1.
OK, so what I want
to show you is
that this captures a
variety of different ranges
that you can have.
OK, so I'm going to want
to go into the likelihood.
And the likelihood
I'm actually going
to use to compute
the expectations.
But since I actually
don't have time
to do this now, let's just
go quickly through this
and give you spoiler alert to
make sure that you all wake up
on Thursday and
really, really want
to think about coming
here immediately.
All right, so the thing
I'm going to want to do,
as I said, is it would
be nice if, at least
for this canonical
family, when I give you b,
you would be able
to say, oh, here
is a simple computation of b
that would actually give me
the mean and the variance.
The mean and the variance
are also known as moments.
b is called cumulant
generating function.
So it sounds like
moments being related
to cumulants, I might have a
path to finding those, right?
And it might involve taking
derivatives of b, as we'll see.
The way we're
going to prove this
by using this thing that
we've used several times.
So this property we use
when we're computing,
remember, the fisher
information, right?
We had two formulas for
the fisher information.
One was the expectation of the
second derivative of the log
likelihood, and one was negative
expectation of the square--
sorry, expectation of the
square, and the other one
was negative the expectation of
the second derivative, right?
The log likelihood is concave,
so this number is negative,
this number is positive.
And the way we did this is by
just permuting some derivative
and integral here.
And there was just-- we
used the fact that something
that looked like this, right?
The log likelihood
is log of f theta.
And when I take the derivative
of this guy with respect
to theta, then I
have something that
looks like the derivative
divided by f theta.
And if I start taking the
integral against f theta
of this thing, so the
expectation of this thing,
those things would cancel.
And then I had just the
integral of a derivative, which
I would make a leap of faith
and say that it's actually
the derivative of the integral.
But this was equal to 1.
So this derivative was
actually equal to 0.
And so that's how you
got that the expectation
of the derivative of the log
likelihood is equal to 0.
And you do it once again
and you get this guy.
It's just some nice
things that happen
with the [INAUDIBLE] taking
derivative of the log.
We've done that,
we'll do that again.
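Written out, the two identities being recalled are:

```latex
\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\,\ell(\theta)\right] = 0,
\qquad
\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\,\ell(\theta)\right]
= -\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\,\ell(\theta)\right)^2\right],
```

where ell denotes the log likelihood.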
But once you do this, you
can actually apply it.
And-- missing a
parenthesis over there.
So when you write
the log likelihood,
it's just log of an exponential.
Huh, that's actually
pretty nice.
Just like the least squares
came naturally, the least
squares [INAUDIBLE]
came naturally
when we took the log
likelihood of the Gaussians,
we're going to have the
same thing that happens when
I take the log of the density.
The exponential is
going to go away,
and then I'm going
to use this formula.
But this formula is
going to actually give me
an equation directly--
oh, that's where it was.
So that's the one
that's missing up there.
And so the expectation
minus this thing
is going to be equal
to 0, which tells me
that the expectation
is just the derivative.
Right, so it's still
a function of theta,
but it's just a derivative of b.
And the variance
is just going to be
the second derivative of b.
But remember, this was some
sort of a scaling, right?
It's called the
dispersion parameter.
So if I had a Gaussian and
the variance of the Gaussian
did not depend on
the sigma squared
which I stuffed in this phi,
that would be certainly weird.
And it cannot depend only
on mu, and so this will--
for the Gaussian, this is
definitely going to be equal
to 1.
And this is just going to
be equal to my variance.
So this is just by taking
the second derivative.
So basically, the take-home
message is that this function b
captures--
by taking one derivative
of the expectation
and by taking two derivatives
captures the variance.
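Here is a numerical illustration of this take-home message on the Bernoulli, where theta is the log odds, b of theta is log of 1 plus e to the theta, and phi is 1. The finite-difference scheme below is my own check, not part of the lecture:

```python
import numpy as np

# Canonical family: E[Y] = b'(theta), Var(Y) = phi * b''(theta).
# For Bernoulli(p): theta = log(p/(1-p)), b(theta) = log(1 + e^theta), phi = 1.
p = 0.3
theta = np.log(p / (1 - p))
b = lambda t: np.log(1 + np.exp(t))

eps = 1e-4
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)              # numerical b'
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2  # numerical b''

assert abs(b1 - p) < 1e-6            # mean of Bernoulli(p) is p
assert abs(b2 - p * (1 - p)) < 1e-6  # variance is p(1-p)
```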
Another thing
that's actually cool
and we'll come
back to this and I
want to think about is if
this second derivative is
the variance, what can
I say about this thing?
What do I know about a variance?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET:
Yeah, that's positive.
So I know that this is positive.
So what does that tell me?
Positive?
That's convex, right?
A function that has positive
second derivative is convex.
So we're going to use
that as well, all right?
So yeah, I'll see
you on Thursday.
I have your homework.
