[MUSIC]
Okay, so
I'll start today's last lecture.
So I'm going to do two
applications of SVD and
hopefully, they'll be
somewhat atypical.
So you'll see the second
one is certainly atypical.
It's to do with max-cut and
graphs.
The first one is a mixture of Gaussians, right, which may be a little more familiar; max-cut hopefully won't be, right?
So clustering in general is
a problem of partitioning
data points into k subsets.
So often k is given, so this is the k-clustering problem, and sometimes you worry about k not being given, but usually k is given, okay?
k subsets of points, and of course,
we have to formulate
how to measure the quality
of clustering and
then ask for
the best possible, right?
So there are many formulations,
I mean,
this is a big subject, right?
So one formulation is you
minimize the sum of distances.
This is the most
popular formulation
in theoretical computer science.
But actually a more sensible
formulation is this,
perhaps, right?
In statistics, where you take
the sum of squared distances.
And squared distances have all the advantages that we've already seen for sums of squares.
So these problems are all pretty hard, generally NP-hard, so since you can't hope to solve them exactly, one option is to approximate the solution. This is what we often do in operations research and theoretical computer science. But the difficulty with that is, even with a small relative error epsilon, the solution may become completely useless, and I will show you an example of that.
Or what we will do here or
perhaps what is more
familiar in machine learning or
statistics is the following.
You assume a stochastic
model of data, so
stochastic model means
a hidden hand, or
hidden model which you don't
know generates the data.
And you have to cluster, and now essentially no errors are allowed. It's often not interesting to do it only approximately, so you want to recover exactly what the hidden model generated, okay? So we'll see the most common examples.
So mixture models are an
important class of stochastic
models, what are mixture models?
A mixture of k densities, probability densities, let's say; that's just what the name suggests: a convex combination, non-negative weights summing to one.
The idea is, I give you,
I generate according to some f,
I don't tell you f.
I don't tell you the w's and
p's.
I give you just the samples and
you have to find the best w's
and p's or in general,
I guarantee there is one and
you have to find the exact
one assuming some condition.
So learning problem is
just given n i.i.d samples
from a mixture,
learn the mixture.
So find all those,
find all those w's and p's.
So I have to give you something.
If it can be a mixture of
anything, nothing can be done.
So what I'll give you is the basic form of the densities.
I won't give you the exact
p's and w's, of course,
that's the problem.
But I'll tell you, for
instance, they're Gaussians.
Here that's what's
going to happen here.
And in fact, they're spherical
Gaussians in this case.
A spherical Gaussian in d dimensions has probability density
p(x) = (1 / (2 pi sigma^2)^{d/2}) e^{-|x|^2 / 2 sigma^2},
where sigma is the standard deviation and |x| is the length of the vector x. And it's of course equal to the product over i = 1 to d of
(1 / (2 pi sigma^2)^{1/2}) e^{-x_i^2 / 2 sigma^2}.
So sigma is the standard deviation, but you notice that it's the same in each component, right? It's just like this.
So what's the clustering problem? I haven't told you what the clustering problem is, so we might as well look at that.
There are two ways you can think
of samples being generated
that give you the same
properties as this distribution.
So either might be possible.
One is just pick each sample
according to the mixture f
which we had in the last slide.
Just one remark I should
have made is that this
is also a probability density
and it integrates to one, right?
Because these w's sum to one,
if you integrate this
you get one also.
So one way of generating is: f is a density, generate i.i.d. samples from it. The other way of generating is: pick one of the components i between 1 and k, there are k components, with probability w_i, and then generate from the i-th component, so I pick an x according to p_i.
So an example which is probably good to have: you have a mixture of these two probability densities. So I take half of this, sorry, plus half of this, and I get a density which sort of has two modes.
This is the sum.
This is the mixture.
So I can generate from this
mixture, just having the whole
mixture in my hand,
this is a hidden hand, right?
Or I can say: each time I'm going to generate from this one or that one, with probability half each; I pick one, and then I pick x according to it. Clearly they give you the same distribution, right?
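To make the two generation methods concrete, here is a minimal sketch in Python; the weights, means, and standard deviations are made-up illustration values, not anything from the lecture.

```python
# Sketch of "method 2": pick a component with probability w_i, then sample from it.
# The resulting samples are distributed according to the mixture density
# f = sum_i w_i * p_i, which is "method 1".
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])       # mixing weights, non-negative, summing to one
mu = np.array([-2.0, 2.0])     # component means (illustrative values)
sigma = np.array([1.0, 1.0])   # component standard deviations
n = 100_000

comp = rng.choice(len(w), size=n, p=w)          # hidden component labels
samples = rng.normal(mu[comp], sigma[comp])     # draw from the chosen component

# A histogram of `samples` approximates the bimodal mixture density f.
print(samples.mean(), samples.std())
```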
So the learning problem
has two subproblems then.
For the learning problem, you
used the method 2 of generating.
It has two subproblems.
One is cluster data into k
clusters according to which
pi was used to generate each
of those samples, right?
So in this picture, half the
samples will come from here, and
the other half of the samples
will come from there, okay?
So the clustering problem
is separate them like that.
Just to figure out what the
hidden hand used to do, right?
So that's a harder problem, then
you also have to, once you've
done the clustering, once I've
said this is one cluster,
this is another cluster, I have
to fit this density to that and
that density to that,
that's very easy, right?
Fitting a single Gaussian to
a set of points is very easy.
You just take the mean and the standard deviation of the sample, and pretend that was the generating Gaussian. That's quite accurate.
So we'll only worry about
the clustering problem.
That's the interesting problem.
So here's a mnemonic that we
want to remember for the day.
You've all heard of means separated by six standard deviations; I'm going to say a constant number of standard deviations.
If the Gaussian components have
centers that are very close
together then clustering
is impossible.
In the limit, they could just
be on top of each other,
they could be the same Gaussian.
And you cannot tell, for each sample, which Gaussian generated it, for obvious reasons; it could be either one, right?
In one dimension, a separation
of a constant number of standard
deviation, that's what this
mnemonic about 6 standard
deviations, right,
is necessary and sufficient.
First, if you have a separation
which is much smaller than
a constant,
this o(1) means close to zero
as n goes to infinity, right?
Then, it's hopeless.
So why is it hopeless?
Because then you have one Gaussian, and another one almost on top of it, like that.
Then I get points but
points here could have come from
either this or that, right?
There's no way you
can necessarily tell.
In fact, you could get samples
from this that are on this side.
That lie on this side, so
there's no way to
tell which Gaussian.
So it's clear that you need that
much, but if the separation
is large enough,
then you can find everything.
Okay, to be a little more careful: if you want to get everything right, all samples classified correctly, you need a separation of root log n standard deviations.
I won't prove this; it comes from the e to the minus x squared tails, right? If you have this much separation, then with high probability every sample will be classified correctly, right?
But if you want to get most
right, then you don't need
the root log n, sorry, you had-
>> [INAUDIBLE]
>> N is the number of data
points, dimension is one for
this so far.
Yeah, so even in one dimension
you need root log n separation
if you want to get
everything right.
It's only one dimension and so
when I say constant, it's with
a pinch of salt that only gets
most things right, right?
Or I have to put a star on
it to say a log factor.
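A rough side note (my own sketch, not from the slides) of where the root log n comes from in one dimension: by the Gaussian tail bound,

$$\Pr\big[\,|x-\mu| \ge t\sigma\,\big] \;\le\; e^{-t^2/2},$$

so among n samples the farthest one is typically about sigma root (2 ln n) away from its mean. With only a constant separation, a few of the n samples from one Gaussian will land closer to the other center and get misclassified; to get every single sample right, the means need to be separated by roughly Omega(sigma root log n).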
The main purpose of today is
to prove that this also works,
the same mnemonic works
even if the dimension of
d is going to infinity.
As long as k is constant, okay?
So we need only number of
standard deviation separation
which depends on k, but
not on d in any dimension,
so d is to be thought of
as very large, right?
We're always dealing with
high dimensional data, but
that does not affect how much
separation we need, okay?
That's what we'd like to prove,
you'll prove that, right?
But one quick calculation we'll do at the beginning is that this is no longer going to be true. So in one dimension, I should have said, distance-based clustering works with an Omega(sigma) separation; I'm hand-waving a bit rather than stating precise theorems,
becomes large, as d grows, okay?
We'll see that first and
then we can see what we
can do instead, right?
And of course, the key will be SVD; that's the whole point, two applications of SVD, and this is one of them.
So, the limits of distance-based clustering: why doesn't it work? So let's look at d dimensions now. x and y are i.i.d. from the same spherical Gaussian. This is just a spherical Gaussian, that's the correct normalizing constant, a spherical Gaussian with variance sigma squared, okay?
Then we already saw this, and
I'll recap it with pictures,
but not necessarily go over
the entire calculation.
The distance squared between x
and y is roughly this quantity.
This was the Gaussian Annulus theorem, which is worth recalling.
So the picture is that one Gaussian is like that, and the other Gaussian is here.
The separation between these
two is distance delta, and
the standard deviations
are sigma, right?
Sorry, again,
this distance recall
is roughly root d times sigma,
right?
There are d dimensions, and the variances add up.
So the variance together is
d times sigma squared, and
I take the square root
I get this, right?
And furthermore, recall that most of the probability mass is in an annulus of constant width, order sigma; everything is scaled by the standard deviation. So most of the mass lies there, and similarly for this one.
This calculation, we drew this picture, but the calculation is for x and y from the same Gaussian. So here is x and here's y, right? From the same Gaussian, right? And I want to know the distance between x and y. And the picture doesn't quite reflect what I'm going to say.
But x to the center of that
Gaussian, this is mu 1,
this is center of Gaussian 1,
this is center of Gaussian 2.
I take that and I take that,
this angle is 90 degrees.
Well not exactly,
approximately 90 degrees, right.
We saw that.
That result had a name; it's the one where everything concentrates near the equator, right? So if I pick, these are independent samples, so I could have pretended I picked x first, made it the north pole, and then y lies near the equator. So that angle is approximately 90 degrees.
So that's why you get
the distance between them as
sum of squares of these two.
So that's just Pythagoras' theorem, except that it's not exactly root d, right? It's root d plus or minus a constant. And we have to square it and multiply by sigma squared; everything is in standard deviation units, okay?
So what happens if you have two samples, and now z is from this Gaussian and x is from the first Gaussian? x and z are samples from different spherical Gaussians, standard deviation sigma, means separated by delta.
Then this, this, and
this are all perpendicular.
This is 90 degrees,
this is 90 degrees, roughly.
And why was that, right?
So x and this axis are independent. I pick x; think of the fixed line from mu 1 to mu 2 as the axis. Now the angle between x minus mu 1 and that axis will be roughly 90 degrees with high probability, because this is a fixed axis, right? This is not random; I've fixed mu 1 and mu 2. This is fixed, so x will be roughly perpendicular to it.
Okay, so that defines an equator. And then once x is fixed, the same applies to z with respect to both of those; that's like saying two equators, and it's still true that most of the mass is near them. And therefore the distance squared between x and z will again be roughly 2 d sigma squared, plus or minus order root d sigma squared, plus delta squared, again by Pythagoras.
So now, we want to ensure that
pairs of samples from the same
Gaussian are always closer than
pairs of samples from
different Gaussians.
If you want that, then we'd better take the minus sign here, because it could be that x and z are only this distance apart, and it could be that x and y are this distance apart; that's the worst that can happen to you, right? x and y are within root d plus or minus a constant, but plus is the worst case, and similarly here minus is the worst case.
If you expand this, you will see that the leading term cancels out, right; the 2d cancels out, and then you get a root d term here that has to be eaten up by delta squared. So this only works if delta is at least about d to the one-fourth power times sigma, right?
So that is the limit of distance-based clustering: if the centers are less than about d to the one-fourth standard deviations apart, then it will not work.
What will not work means,
that if I just take a point and
take its closest point,
I will not necessarily get all
samples from the same Gaussian.
I will get some corruption,
right?
These arguments go both ways,
so.
I will get some corruption.
If we have less than
[INAUDIBLE] separation.
So this doesn't match our mnemonic of means separated by a constant number of standard deviations, because d to the one-fourth is growing to infinity.
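A quick numerical check of this limit (my own sketch, with illustrative values of d, sigma, and delta, not from the lecture): with a constant separation, well below d to the one-fourth times sigma, within-Gaussian and between-Gaussian distances concentrate around the same value and overlap.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n = 10_000, 1.0, 200
delta = 3 * sigma                         # constant separation, << d**0.25 * sigma
mu2 = np.zeros(d); mu2[0] = delta         # second center, delta away from the origin

X = rng.normal(0.0, sigma, size=(n, d))          # samples from Gaussian 1 (center 0)
Z = mu2 + rng.normal(0.0, sigma, size=(n, d))    # samples from Gaussian 2

within = np.linalg.norm(X[:n//2] - X[n//2:], axis=1)   # same-Gaussian pairs
between = np.linalg.norm(X[:n//2] - Z[:n//2], axis=1)  # different-Gaussian pairs
print("within :", within.mean(), "+/-", within.std())
print("between:", between.mean(), "+/-", between.std())
# Both concentrate near sqrt(2*d)*sigma and their ranges overlap, so
# nearest-neighbor distances alone cannot recover the true clusters.
```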
Okay, so
I cannot do this, right?
I cannot just do this.
I will do this actually,
but I will have to do
something before that, okay?
So it's not means separated by a constant number of standard deviations. So of course, this is where SVD comes in; we'll show that using SVD this can still be made to work. At the bottom it will still be distance-based clustering, but not in the full space.
So the idea is going to be that I project to a low-dimensional subspace and cluster there. This is an important idea, actually. It's the only way I know of doing many of these things correctly: project to the right subspace and cluster there. So what do you do? We project all data points to the space spanned by the top k, sorry, excuse me, let me backtrack.
Suppose we can find the subspace spanned by the k centers of the component Gaussians, in this case mu 1 and mu 2, while the data is d-dimensional. This is the subspace, right? Suppose I can find this line; the origin is somewhere here, let's say for a moment, then I find this line. Suppose I found this line; then I have to draw another picture. I project to that line, and then the picture looks like this.
Okay, so then the picture looks like that, and that's because, it's easy to show and we will get to the calculation, a projection of a spherical Gaussian onto a subspace is again spherical with the same standard deviation. We have to do a little argument; I think maybe I have a slide. Arguing this is just a simple argument,
That the standard
deviation remains the same.
So, projection to the space
of centers, why does it work?
Because it preserves
inter-center separation. Right?
Okay, so one good thing,
I didn't put up on the slides,
maybe it's worth doing.
We know two dimension reduction
techniques, one is SVD,
which we go through,
the other is random projection.
So random projection does not work, and it doesn't work because, if I do a random projection, the standard deviation is preserved; that's always true for any projection. Now, what happens to the inter-center separation? So I'm going to do a random projection to k dimensions, to a random k-dimensional subspace.
So, we are sitting in D space.
Things project down to
a randomly chosen k
dimensional space.
We already saw a theorem that tells us what happens to the inter-center separations. What happens to them? Do they remain the same?
This was Johnson-Lindenstrauss
theorem, right?
They don't remain the same.
We're sitting in d space, and all dimensions are treated roughly the same, because we're doing a random projection. So the inter-center separation is multiplied by root k over d; by Pythagoras the squares get multiplied by k over d, since you're projecting to k out of d dimensions, right? So every squared length shrinks by k over d, and the distances actually shrink.
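A small sketch of exactly this effect (illustrative values, not from the lecture): project to a random k-dimensional subspace and the inter-center distance shrinks by about root (k/d), while the per-coordinate standard deviation stays sigma.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, sigma, delta = 10_000, 10, 1.0, 5.0
gap = np.zeros(d); gap[0] = delta                 # mu2 - mu1

# Orthonormal basis of a random k-dimensional subspace (columns of Q).
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))

print("original inter-center distance :", delta)
print("projected inter-center distance:", np.linalg.norm(Q.T @ gap),
      " ~ delta*sqrt(k/d) =", delta * np.sqrt(k / d))

# The projected Gaussian is still spherical with the same standard deviation.
X = rng.normal(0.0, sigma, size=(500, d))
print("per-coordinate std after projection:", (X @ Q).std(axis=0).mean())
```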
So this is not really nice for us, because what we want, from the last slide, if we want to do distance-based clustering, is that the inter-center separation, that's my delta, is at least the dimension we project to, raised to the one-fourth, times sigma, times a constant. Okay?
So we did not have this in the big space; that's why we are projecting. But if the projection has this property, this is not very good. I'm sorry, this is times the standard deviation; I'll put it on this side. So the dimension has come down, which is nice, but so has this, right?
So what we have done is: sigma remains the same, unfortunately, while delta gets multiplied by root k over d. And what's happened to the required separation? It used to be d to the one fourth; it's become k to the one fourth. So at first it looks like we gained. No, excuse me, it doesn't work out, right? The separation itself shrank by root k over d, which is a bigger loss, so the ratio of the actual separation to the required separation gets multiplied by roughly k over d to the one fourth. And k is much smaller than d; k is constant, d is going to infinity, d is growing, that's what we have. So this is no good, because random projection shrinks the inter-center separation.
Okay, it preserves standard
deviation, not so bad, but
it shrinks that.
So, that's no good, however,
SVD will do the job for
us, I mean, sorry,
not SVD yet for the moment.
Projecting to the space of centers preserves the inter-center separation, preserves the numerator, and also the denominator sigma, but the dimension has come down, so the ratio improves, right? The dimension used to be d; it became k. So the ratio improves. Okay. So if the separation was at least about k to the one-fourth times sigma, and k is now a constant, then this should work, provided we do it in the projection.
So, k is constant.
So, this is a constant number
of standard deviations so far.
Then distance based clustering.
Now, we do distance
based in the projection.
We do this, what I said
in the last slide, but
in the projection,
the smaller space.
>> [INAUDIBLE]
>> Yeah.
>> [INAUDIBLE]
>> How
would we know which samples are
from which one to be able to,
so that's the difficulty, right?
You have to somehow know which
samples belong to which Gaussian
before you pull them apart.
But that's the main problem.
So, I don't know a way where
random projection can be used to
do this actually.
So we'll prove that; and the main thing is, once I tell you the statement, the proof is not difficult at all, but it's quite beautiful, actually. The references are in the book; you can see them there.
So, we'll prove that the space
spanned by the top k singular
vectors of the data is
in fact the space of
centers in the limit as you
get infinitely many samples.
But finite samples,
it's approximately true.
Again, we know how to do SVD.
So, we find the top
k singular vectors.
That is, the span of those singular vectors is essentially the space of centers. Now, ordinarily I don't know the space of centers, because the clustering problem would have to be solved to determine the center of each Gaussian. So this is saying I can do it without clustering.
[INAUDIBLE]
>> No,
this actually mostly comes from
symmetry of spherical Gaussians.
We'll see the argument,
it's a very nice argument,
and very simple I mean.
Okay so Lemma Projection
preserves the S.D.
of spherical Gaussian.
Now, this I have
stated without proof.
I did write down
a proof of this.
Okay, so maybe we'll go
over a proof of this.
I'll go over it quickly because
this is something you can
verify at home.
It is just standard, right?
So, we want to project over
[INAUDIBLE] subspace V and
we want to say its standard
deviation is preserved.
For spherical Gaussians,
we rotate the coordinate system,
the first k coordinate.
So this, I take and x and
d space, project, this is the
projection, this is what's left.
And so the density in
the projection is just,
I integrate out mu
double prime right?
I project it on to the first k,
I integrate out these things.
So if I integrate that out,
that's just a constant.
The whole point is just that for a Gaussian these things split, right? The sum of these two squared lengths is the squared length of x; that's just Pythagoras again, right? Always. So that follows.
Again, please, you can
check this yourself, right?
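Written out (my notation, the same splitting the slide uses, taking the center at the origin for simplicity): rotate so that V is spanned by the first k coordinates, and write x = (x', x'') with x' the first k coordinates. Then

$$p(x) \;=\; c\, e^{-\left(|x'|^2 + |x''|^2\right)/2\sigma^2} \;=\; c\, e^{-|x'|^2/2\sigma^2}\, e^{-|x''|^2/2\sigma^2},$$

and integrating out x'' only contributes a constant factor, leaving a spherical Gaussian in x' with the same sigma.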
Now comes the nice part,
proving that the SVD finds
the space of centers.
Here's the idea, and then we'll
go with a proof in some detail.
Show that the top k singular
vectors produced by the SVD span
the space of the k centers.
And that's only going
to be exactly true
as the number of i.i.d.
samples from the mixture
goes to infinity.
It's not going to be true for
a finite sample case.
But as the number of
samples goes to infinity,
it's going to be true.
So to do that, first I'm going to extend the definition of best fit to infinitely many samples, that is, to probability distributions instead of a finite number of points.
So best fit was defined for
a finite number of data points.
I want to define it for an infinite number of data points, or for distributions, and we do that in quite a straightforward way. Then we show that for a single spherical Gaussian whose center is not the origin, the best-fit one-dimensional subspace is unique, and it's the line through the center of the Gaussian and the origin. So there is a picture in the book; I'm not sure it's on the slides, but let's draw it.
So we have a Gaussian
with center mu and
the origin is somewhere
maybe inside here okay?
The claim is the best fit,
one dimensional subspace
it certainly passes
through the origin.
And it also passes through mu.
There's only one line like that.
That is the best fit, right?
It's better than everybody else.
I mean the rough
reason is symmetry.
If you had a line
going like this.
You get a line like this.
Why is it worse?
Let's take distances
to the line.
There are these points with very
high perpendicular distance.
There are, of course,
some points with very low
perpendicular distance, right?
But the high ones dominate
because we are squaring and
adding square plays
a role in that.
Whereas for this line, most points are roughly at the same distance; it's pretty much constant. Most points are at about that distance, so it balances out.
But we'll prove it. To prove this algebraically, once it's written down, it's not that difficult. We'll prove that, for a single Gaussian whose center is not the origin, the best-fit 1-dimensional subspace is the line through the origin and the center of the Gaussian.
Now once I tell you this, the
proof is almost there, right?
Why is that?
Because next we show what the best-fit k-dimensional subspace is for a single Gaussian whose center is not the origin.
Now, this is by symmetry. If I want the plane that's the best fit, by the same kind of argument it follows that it should contain that point, okay? But there's another way we'll see this.
And that's with a greedy
algorithm, right?
Remember, SVD is
a greedy algorithm.
But, so if I wanted to find the
best fit k-dimensional subspace,
I first find the best-fit
one-dimensional subspace, right?
Which I saw, which I told you
here passes through the center.
So that's the one-dimensional step, right? Then I project perpendicular to this line; that's how I find the second singular vector, right? That's SVD now. The projection is this one.
But in the projection, I get a
sphere, centered at the origin.
In the projection, the center of the Gaussian is now at the origin, right? Because I'm projecting perpendicular [INAUDIBLE]. And once I have a Gaussian centered at the origin, all lines are the same.
Completely symmetric, okay?
Therefore, this
thing will follow.
The best-fit k-dimensional subspace for one Gaussian is any subspace that contains the origin and the center.
And now, we're done, because
for, if we had k spherical
Gaussians, the best fit
k-dimensional subspace for each
one individually is a subspace
containing their centers.
Now if I contain all the
centers, then I must be the best
for each individual one
therefore for the sum.
Right, does everybody see that?
So for each individual one,
if I produce any k-dimensional
subspace that contains
its center I'm fine.
Now if it contains all k centers, then this says it is individually the best-fit k-dimensional subspace for each one, so it's clearly the best for the sum.
So, that's the argument so
we'll go over that.
Are there any questions?
I mean, basically I've
given you the proof,
but we'll do it carefully.
I mean, it's worth absorbing this proof, right? Again, it's very simple but quite nice.
>> [INAUDIBLE]
>> Independent of the mixing weights, yeah, that's the point. That's the nice thing: it's independent of the mixing proportions. And in fact, it's independent of the radii. So we've been doing it for the same radius, as I've been indicating, but it may be that the components have different radii; it's all fine, everything is symmetric. But it's only for spherical Gaussians, right? I mean, we use this spherical symmetry heavily. There are some versions of this for non-spherical Gaussians which are not as good, but still, for spherical it's very simple.
So we'll take a few minutes to try to pin down this proof, but it's basically, essentially, the argument I just gave.
So, recall the best fit line
is the line passing through
the origin maximizing the sum
of square projections.
So first we want to define what best fit means for probability densities instead of a set of points, replacing the sum by an integral.
So the best-fit line for a probability density p(x) is a line v_1 which maximizes an expectation; it's an integral, right? If there were individual data points x, I would have summed over all the data points, right? Now x is a random variable, so instead of the sum I take the expected value of the squared projection.
I took max because I took
the projection, right?
If I took the distance
to the line,
I would have taken the min,
okay?
For a spherical Gaussian centered at 0, any line passing through 0 is a best-fit line.
So if the Gaussian center is 0,
everything is symmetric.
Doesn't matter what line
through the origin you take,
clearly by symmetry
this is all the same.
Okay so for spherical Gaussian
density p centered at mu which
is not 0, there's a unique
best 1-d subspace, best fit
subspace and it's the one
that passes through 0 and mu.
Okay, so as I said earlier, if the center equals 0, then any line through 0 is a best fit. Okay, so this is the argument, this is the proof, which is by symmetry, right? So this notation stands for the expectation over a random vector x picked according to the density p; that's what this stands for, right?
Then I take the expected value of (v dot x) squared, and I write v dot x as v dot (x minus mu), where mu is the center of the Gaussian, plus v dot mu, and I expand the square. So these are just real numbers now, the dot products. So this is (a plus b) squared, which has a 2ab cross term, which the slide has dropped. So there's a third term in the whole square expansion, which is the expectation, over x picked according to p, of v dot (x minus mu), times v dot mu, right? I dropped this; why is that? x is random, v is not random, so I can pull this factor out; it's just equal to v dot mu. And I can pull the 2 out, and I'm left with the expectation, over x picked according to p, of v dot (x minus mu), right? And again, v is fixed, so I could pull that v out and say v dot the expectation, over x according to p, of x minus mu.
So I guess I should have
done that step over here.
So this is a vector: it's the expected value of a random vector; x is a random variable, but vector-valued, d-dimensional.
But I can move v inside the expectation; the dot product is a linear operation, so I can pull v out, and this is 0, right? Because mu is the center of the Gaussian, so the expected value of x is mu, so the cross term goes away. It always goes away; we did a few calculations like this earlier.
So this term is now the variance, and why is that? Because, see, I took x minus mu; that's just a Gaussian centered at the origin, and we take v dot that, squared. That's the variance in the direction v. Okay, this whole thing is the variance in the direction v, but the Gaussian is spherical, so the variance is the same in all directions, and it equals sigma squared. So that term is independent of v. That's very important, right? I mean, because it's just the variance.
Okay, so we get this, and now we are done, right? The best fit maximizes that quantity. So it maximizes v dot mu, because the other term is constant, and v dot mu is maximized when v aligns with mu, the unit vector in the direction of mu. That's what maximizes the dot product. And to see the uniqueness: if mu is not 0 and v is not aligned with mu, then this is strictly less, so it's a unique maximum. Again, that's very simple, just this.
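The whole computation in one line (my write-up of what the slide computes), with x drawn from the spherical Gaussian with center mu and v a unit vector:

$$\mathbb{E}\big[(v\cdot x)^2\big] \;=\; \mathbb{E}\big[(v\cdot(x-\mu))^2\big] \;+\; 2\,(v\cdot\mu)\,\mathbb{E}\big[v\cdot(x-\mu)\big] \;+\; (v\cdot\mu)^2 \;=\; \sigma^2 + (v\cdot\mu)^2,$$

since E[x] = mu kills the cross term, and the first term is the variance sigma squared in direction v, the same for every unit v by sphericity. Maximizing over unit vectors v therefore means maximizing (v dot mu) squared, which forces v to be plus or minus mu over its length.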
Proof, okay it's continued,
extend the definition
now to k-d subspaces.
Again, we had already defined
what's the best-fit k-d subspace
of finite number of points for
a density we just do that.
We replace the sum by an integral, an expected value over x, and we maximize among all k-dimensional subspaces.
I mean, this is the definition.
For a spherical Gaussian
with center mu,
we want to prove a k-d subspace
is a best fit subspace, if and
only if it contains mu, okay?
For the line,
it had to contain mu also, but
now it's saying it for every K.
For the second part: if mu is zero, by symmetry any k-dimensional subspace is a best fit; everything is symmetric, so I could take any one, okay?
If mu is not zero, so if mu
is zero, we're done, right?
This Lemma is for
both zero or not zero.
Mu zero is a simple case,
there's nothing to prove.
Mu is not zero, then
the best-fit line we already saw
must pass through mu
by earlier Lemma.
That was the Lemma in the last
slide, one-dimensional line.
Now, as in the greedy algorithm,
so
this requires some proof
that the greedy works for
probability densities In place
of just finite number of points.
But that's straight forward, so
we'll assume the greedy works.
We project perpendicular to the first singular vector, and after the projection the center is 0; that's this picture, or, okay, well, maybe I won't draw a picture. So you've got a line going through the origin, and we project perpendicular to that. Now the origin is the center; it's a zero-centered Gaussian, and therefore this earlier result applies.
That thing applies so
we get the right thing, and
I think, theorem just says that,
so it says, if P is a mixture
of spherical Gaussians,
then the best fit k-dimensional
subspace contains the centers.
Now, this is proved for the whole density; we'll come back to that. In particular, if the means of the Gaussians are linearly independent, the subspace is unique.
They may depend linearly,
they may be dependent and
it's not unique, but it still
contains all the centers.
Okay, I'm sorry,
I do have to prove that.
So, so far, excuse me, so far
in the last slide we dealt with
a single Gaussian, I'm now
just putting it together and
saying it's true for
the mixture.
So, and that's because it's true
for each individual one, okay,
that's next to do.
So the expected value over x according to P, where P is the mixture, okay, is the sum over i of w_i times the expectation under p_i; this should be p_i, excuse me, that should be p_i, okay? But if V contains the centers of all the components, then each term here is individually maximized, so the entire sum is maximized. There are at most k centers, and your subspace is k-dimensional; maybe you need fewer dimensions because they're dependent, but with k dimensions you can fit all the centers, and then you're the best.
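In symbols (my write-up of the slide's point), for the mixture $F = \sum_i w_i p_i$ and a k-dimensional subspace V,

$$\mathbb{E}_{x\sim F}\big[\,|\mathrm{proj}_V(x)|^2\,\big] \;=\; \sum_{i} w_i\, \mathbb{E}_{x\sim p_i}\big[\,|\mathrm{proj}_V(x)|^2\,\big],$$

and each term on the right is maximized, over all k-dimensional subspaces, by any V containing mu_i. So a V containing all the centers maximizes every term simultaneously, and hence the sum.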
Now of course,
we don't have the whole density,
we have a finite number of
samples, there are errors.
But this is just a small,
well not small,
it requires some calculation to
make sure that it's correct.
But believe me,
it's correct, right?
The sampling error
can be managed.
Okay, so the algorithm again, it's worth seeing one more time: just project and cluster, right? You project to the SVD subspace and then cluster, okay?
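A minimal sketch of the project-and-cluster algorithm in Python (my own, assuming spherical Gaussians and a known k; the simple thresholded clustering at the end is just a stand-in for any reasonable distance-based method in the projection):

```python
import numpy as np

def svd_project_and_cluster(X, k, threshold):
    """X: n-by-d data matrix (rows are samples), k: number of components.
    threshold: distance cut-off for the toy greedy clustering below."""
    # Top-k right singular vectors approximately span the space of the centers.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = X @ Vt[:k].T                       # project every point to k dimensions

    # Toy distance-based clustering in the projection.
    labels = -np.ones(len(P), dtype=int)
    for c in range(k):
        if not (labels < 0).any():
            break
        seed = np.flatnonzero(labels < 0)[0]          # first unassigned point
        close = np.linalg.norm(P - P[seed], axis=1) < threshold
        labels[np.logical_and(close, labels < 0)] = c
    return labels
```

In practice one would run k-means or another clustering method in the projected k-dimensional space; the point of the lecture is only that distances work there when the means are a few standard deviations apart.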
By the way, if you have done k-means, there are many heuristics; there is k-means++, all kinds of heuristics like that. So maybe this is worth saying: none of them guarantees what's guaranteed here, which is that you find the correct Gaussians. In fact,
So here's the problem, so
if you have two Gaussians again,
same picture,
separated by C times sigma.
And each of these has standard deviation sigma, so samples are at distance about root d sigma from their center. If I take one of these points from the second Gaussian, its distance to the wrong center is given by this squared plus this squared. So this is z; the distance of z to, I'll write, the wrong center, squared, is, by Pythagoras, the correct distance squared, which is about d, well, really root d plus or minus order 1, squared, times sigma squared, plus c squared sigma squared. So that's roughly d sigma squared plus c squared sigma squared, right?
And d is going to infinity, c is constant. So this extra term is much, much smaller than this, okay? Maybe I should have written it this way: therefore, k-means clustering, which has relative error epsilon, can miscluster every point, right?
So suppose I only tell you
that I give you an algorithm,
in fact, that's the best you
can prove with most algorithms,
the exact problem's NP-hard.
I'll give you an algorithm that achieves relative error epsilon in the k-means objective function. But that epsilon can mean that everybody's wrong. In fact, you can put all of these points in this cluster and all of those in that cluster, and the relative error would be only epsilon, okay?
Can miscluster every point.
So this, project to the SVD subspace and cluster, is the only algorithm I know which can get Gaussians whose means are separated by a few standard deviations correct.
[INAUDIBLE]
Well, that is SVD; I'll give the algorithm in a minute, but the proof is worth more. It's probably folklore; people knew that the SVD subspace would do the job, right, and imagined that was possible, but you need a careful analysis of it and all that. There are also extensions to non-spherical Gaussians; I mean, that presumably is not wrong, but yes.
So I think it's not impossible that people use SVD project-and-cluster; it's not the most popular method. People usually try to cluster in the big space, the space in which the data is given, which generally cannot give guarantees. But I think there are some people, some algorithms, some papers that do this.
>> [INAUDIBLE]
>> What is true for [INAUDIBLE]?
>> [INAUDIBLE]
>> That the mean is
the best thing.
Yes, that is true for
any [INAUDIBLE].
But now we have this mixture,
right?
So there are k different means,
that's the rub.
Because how do you know
which is the partition?
For one, yes.
Okay, that's all I'm gonna
say about Gaussian mixtures.
We are now going to
application two, which is
maximum cut in a graph.
This I'm going to show you because it's unusual. It's a discrete problem, and we'll see how SVD can be used to solve it; I mean, you can't solve the whole problem, it can be NP-hard, but SVD can be used to solve some interesting special cases.
So we want to partition the nodes into two subsets, just a restatement of the problem, S and its complement, which we call S bar, so as to maximize the number of edges from S to S bar, right? You want the cut, the maximum cut.
If A is the adjacency matrix,
that means 1 if there is
an edge,
0 if there is no edge, okay?
I can write the problem this way, right? The reason is that each x_i is 0 if i belongs to S, and 1 if i belongs to S bar; that's the interpretation.
So we want to count all the edges a_ij for which i and j belong to different sides. That's the same as saying x_i times (1 minus x_j), right?
If both are 1 or
both are 0, this is 0.
If one is 1, one is 0,
then it's 1.
And it's just a quadratic
form over the matrix A.
So that looks more like
a linear algebra thing.
It's just x transpose A(1- x).
You can verify that
these two are the same.
1- x, 1 is the vector
of all 1's, right?
So we want to maximize that,
it's NP-hard.
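A tiny sanity check of the quadratic-form formulation (a made-up 4-node directed graph, not from the lecture): with x_i = 0 for i in S and x_i = 1 for i in S bar, the form x^T A (1 - x) counts exactly the edges crossing from the x = 1 side to the x = 0 side, and maximizing over all 0-1 vectors x gives the max cut.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])      # adjacency matrix: a_ij = 1 iff there is an edge i -> j

x = np.array([0, 1, 0, 1])        # one particular partition
ones = np.ones(4)

quadratic = x @ A @ (ones - x)    # x^T A (1 - x)
direct = sum(A[i, j] for i in range(4) for j in range(4)
             if x[i] == 1 and x[j] == 0)
print(quadratic, direct)          # both equal 2 for this graph and this x
```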
We'll see that for dense graphs. Dense graphs are the one class I'm going to show you using SVD; there are other classes, but this is the one I'll show you. A dense graph means it has a constant fraction of all possible edges.
We can use SVD to find
a near optimal solution in
polynomial time.
Near optimal means relative error epsilon, okay?
So for every dense graph,
you can do this.
There are other ways of doing it for dense graphs alone, but SVD is the only way I know of doing it for some other, more general classes of graphs, which is why you'd want to do it this way.
So what we will show, here's
a summary of what we'll show and
then we have to go
into the details.
So remember this quantity: I do the SVD, and then I truncate after the k-th term. So the u_i and v_i are the left and right singular vectors; this is the SVD, but I chopped it off at k, which gives A_k. And we will maximize x transpose A_k (1 minus x), not with A, subject to the same condition.
For any 0-1 vector x,
these two are close,
Ak and A are close.
This is the easy part, right?
So what I'm saying is, if I
did the optimization with AK,
I shouldn't suffer too much
error because they are close.
Okay, I can choose k,
k is my choice.
I can choose it high, so
that I make a small error.
As k grows, I'm solving more and
more complicated problems.
I should gain some and
that's what I'm gaining,
the error is going down to
0 as k goes to infinity.
And the harder part is that you can in fact solve the A_k problem in polynomial time, if k is fixed, by exploiting the low rank of A_k.
So later I'll put up a moral: problems in k dimensions can in general be solved in time exponential only in k. So if you have a two-dimensional problem, you can essentially always solve it quickly, right? Not all of them, but that's the moral of the story; we'll explain that.
So again, all I am doing is just
write it as a quadratic form
approximated by SVD and
solve, okay?
And we'll see that,
that's a good solution.
>> [INAUDIBLE]
>> Sorry?
>> K cannot be more than n.
>> K cannot be more than n, if k
we would like it to be constant.
>> [INAUDIBLE]
>> The error goes to 0 as k grows, okay, sorry, so I should say the following. In a dense graph, there's always a cut of size at least a constant fraction of n squared. Okay, is that what you're asking?
>> [INAUDIBLE]
>> Yeah, exactly, random. The max cut is at least half the number of edges in size, which is a constant fraction of n squared, because for a random subset S, the expectation of the cut is half the number of edges, and we assumed the number of edges is a constant fraction of n squared, right? So this is only the expectation, but you can argue that it holds with high probability.
So if I just randomly place each vertex and take an individual edge, it has probability one half of being in the cut, right? And by linearity of expectation I can add up over all edges; each edge contributes to the cut with probability one half, so the total is at least this much. Therefore this error term is not so bad: if I make k a million or something, the error n squared over root (k + 1) is about n squared over a thousand, so I lose only about a thousandth of the cut [INAUDIBLE]
Okay, so first I want to prove the somewhat easier part, that the approximation is not bad. So x and 1 minus x are 0-1 vectors, so each has length at most root n. And so we can write: the difference between these two real numbers is that quantity, right? Here all I've done is, there's an x on one side and a 1 minus x on the other; I should have said 1 minus x here, excuse me. So the difference is at most the length of x, times the spectral norm of A minus A_k, times the length of 1 minus x, and so that's at most n times the spectral norm.
Why is the spectral
norm of A- Ak small?
We prove that SVD was
the best rank k approximation
in spectral norm.
Therefore, the error is
exactly sigma k plus 1 of A.
And now why is that small?
Well, we see that here's
a little inequality.
I take k + 1 times this, since
the sigmas are in decreasing
order that's less than or
equal to that.
But that's the sum of
some singular values.
Sum of all the singular
vallies squared is this.
So it's less than that.
Now this is just Ai j squared.
Aij is 0 1 so
that's at most n squared.
And therefore you get n here,
and you get this is n squared
over k plus 1, and
that all works out right?
So we get that and
you get another n out of that.
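Putting the two steps together (my write-up of the chain on the slide):

$$\big|x^{T}(A-A_k)(\mathbf{1}-x)\big| \;\le\; |x|\;\|A-A_k\|_2\;|\mathbf{1}-x| \;\le\; n\,\sigma_{k+1}(A),$$

and

$$(k+1)\,\sigma_{k+1}(A)^2 \;\le\; \sum_{i=1}^{k+1}\sigma_i(A)^2 \;\le\; \sum_{i}\sigma_i(A)^2 \;=\; \sum_{i,j} a_{ij}^2 \;\le\; n^2,$$

so sigma_{k+1}(A) is at most n over root (k + 1), and the total error is at most n squared over root (k + 1).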
So the approximation is correct. Now I want to exploit the low rank of A_k to solve the problem quickly, and again, the moral of the story is that a problem in k dimensions can usually be solved in time roughly exponential in k. You have to do some work to get it in different contexts, but we can in this context, and it's true fairly generally [INAUDIBLE]
Okay, so
we'll see how to do this.
So this requires
a little bit of crunch,
technical stuff unfortunately.
We'll have to do that.
First, I want to point out that it's not completely trivial. Look at the case when k = 1, and even the special case when u = v. That's already NP-hard, so you can't do even the k = 1 case exactly. Why is that NP-hard? So I want to do the following now: maximize x transpose A_k (1 minus x), where I'm now assuming A_k = sigma_1 u_1 u_1 transpose, over 0-1 vectors x. So that's just equal to sigma_1 times the dot product of x with u_1, times the dot product of (1 minus x) with u_1. But x is a 0-1 vector, so the first factor is a sum of a subset of the coordinates, right? So let me just call them u_i: the first factor is the sum of u_i over i belonging to S, and the second is the sum of u_i over i not belonging to S.
So yeah, it's a partition problem. This is just maximizing this product: I give you a vector, and I want to split its coordinates into two sets so that the product of the two sums is maximized. And the best possible is when the sum of u_i over i belonging to S equals the sum of u_i over i not belonging to S; you should convince yourself that that's the best you can do, since the sum of all the u_i is fixed, so the best you can do is partition it equally. And the question of whether I can partition it equally, I give you a bunch of real numbers and I want to know if I can partition them so that the two sums are equal, that's the partition problem, which is NP-hard.
So the exact problem is pretty hard even for k equals 1, but there's a simple dynamic programming algorithm which finds a near-optimal solution for one vector, if you will. But we will use that for k vectors: we write u_i(S) as the sum of u_ij over j belonging to S, and similarly for the v's, and we want to find the S that maximizes this. Now there are k things, not one thing, and we use dynamic programming.
So this is just a recap of
dynamic programming, as I said
this will be a little technical
but let's go with that.
So what I'm going to find, let's say for now, is, for every subset S: u_1 of S, which is just that sum, v_1 of S bar, and all of these quantities, for every S.
If you had a list of all of them
once I tell you what these
are I can find this, right?
I can then find this and
take the maximum.
So we can find that quantity for
each S and
then take the maximum.
But there are 2 to the n subsets, right? That's not good.
And you could hope for ties, but there may not be ties, because these are arbitrary real numbers, the components of the singular vectors of A. A was a nice 0-1 matrix, but the components of its singular vectors could be arbitrary real numbers, and so there may not be any ties. So what we do is round the coordinates, so that they become not arbitrary real numbers but integer multiples of 1 over an integer; you want to make them all integer multiples of 1 over (n k squared).
We call that the rounded vector, and then we work with the rounded vectors. And now we'll prove that there are not too many of them. For a lot of S's these vectors will be tied, because we have rounded, basically, right? There are very few values there. So for a lot of S's this would be tied; the number of distinct possibilities for this vector is not very big, and we'll just go over all of them, okay? And we construct, by dynamic programming, a list of all possible values of w prime of S.
It turns out there will be only this many values of this; you have to do a little calculation, which I won't do, and it implies there are only this many values. So basically there are only this many distinct sets S which lead to distinct vectors. So the number of distinct rounded vectors is very small; the exponent is only k, all right? The exponent doesn't involve n, it involves only k. That's the point.
Now you have to show that
the error is not too much.
And again, perhaps you don't
follow all the details because
it may be crunchy but let me
just quickly run over that.
So these are length-one vectors, and the sum of any subset of their coordinates is at most root n: take a unit vector, and the sum of some of its components is at most root n. And so the rounding error in each such sum is at most 1 over k squared, and the same for the v_i; this is a bit of a calculation.
Then there's a quadratic term. This is just saying: a and b are two real numbers, I'm estimating a by a prime and b by b prime, and then I get a good estimate of the product ab. I mean, that's believable, right? It just has to be pinned down, that's all. So here we're doing that, right? We have a product of two terms; we estimate this one well and that one well, therefore the product well. That's that calculation, okay?
And the theorem is: given a directed graph, a cut of size at least the maximum cut minus this much can be computed in time polynomial in n, for fixed k. That's what comes out of this. Again, bear with me on all the calculations; the idea was simple. The idea was: I approximate A by a rank-k matrix, and that's enough to solve the problem, because a rank-k, essentially k-dimensional, problem can be enumerated and solved, okay?
So one other way to think about the enumeration, somewhat geometrically: I'm sitting in k-dimensional space, I want to find the best solution in k-space, and I just put down a fine enough grid. So if the grid spacing is delta, the number of grid points is at most 1 over delta to the k, and so, roughly, is the time. But the exponent is only k. That's all, I mean, that's the point.
Now, what I don't have time to present, but want to state in one line, is that this kind of method, take a matrix, approximate it by a low-rank matrix, and solve the problem on the low-rank approximation, turns out to be useful for quite a class of discrete optimization problems, MAX-CSPs. Actually, I won't define this, but MAX-CSP includes the following kind of problem, for instance: you might have a Boolean expression in 3-CNF, three literals per clause, and I want to know the maximum number of clauses that I can satisfy.
That can be solved by these kinds of methods, but if you have three literals per clause, you have to approximate a 3-dimensional tensor; if you had L literals per clause, you have to approximate an L-dimensional tensor. But the game plan would be the same. Again, I won't go through this; the game plan is the same: you take the tensor which defines this problem, I haven't told you how, but it defines this problem, and then you approximate the tensor by a lower-rank tensor, and then solve the problem. You show that the same approximation method works, okay? So it turns out you can do that.
One other thing that's worth noting can be done is this: if you have any matrix A with non-negative entries satisfying the triangle inequality, meaning a_ij plus a_jk is greater than or equal to a_ik, that implies A has a good enough low-rank approximation, and you can do the problem on that. So for instance, if you had a max-cut-type problem to solve on such a matrix, you can get relative-error guarantees like these, okay? The treatment there is the same, except you have to scale the rows of the matrix, and then there's a good low-rank approximation.
So I think that's
all I have to say,
I didn't elaborate
on any of these.
That's the end; well, maybe there's discussion, but otherwise I'm done.
This was directed, but
you can do undirected also.
Same thing.
Right, so this does not give you multiplicative guarantees in general. It gives you multiplicative guarantees for max cut when something like the triangle inequality is true, or for dense graphs, and there is a more general class. But it doesn't, in general, give you multiplicative guarantees like the SDP-based methods do, but it is faster-
>> Much faster.
>> Yeah, it's much faster; you just sample these things and solve. So for error epsilon, k has to be at least about 1 over epsilon squared. Yeah, I mean, for the theorem, right? If you want that epsilon in the bound, yes, k will be high, I mean, for the theorem. In practice, maybe it can work with lower k, we don't know.
All right, yeah, if the singular values decline faster, then you should be able to do better, right? The bound is as if the top k + 1 singular values were all equal, in fact. So this is the worst case, true for every matrix, but it should be better for matrices with a good spectrum.
[INAUDIBLE]
So if it's not dense, if it's sparse, then finding the exact thing is hard; even if it's only slightly sparse, that's what's known. Actually, I think an improvement beyond 1 over epsilon squared is not possible by this method; that is known. k has to be at least that, otherwise you'd be solving the exact problem. So I mean, in theory there are many papers proving the algorithm does not work, right?
>> [LAUGH]
>> Right I mean.
>> Can you tell me
>> In fact, it.
>> [INAUDIBLE].
>> Yeah.
>> [INAUDIBLE].
>> Any pathological.
>> [INAUDIBLE].
>> Don't have the [INAUDIBLE].
Whether it works for
dense things, for instance.
I doubt actually that it
will work so easily for
any of these things.
So it's actually.
>> Here's the thing: suppose you have a mixture of k Gaussians. No algorithm I know, except SVD, ever finds starting points within a few standard deviations of the centers. And I want a good initial start, right? k-means completely fails if the start is bad. But if the start has the property that my starting centers are within a constant number of standard deviations of the real centers, then you're in luck. But no other algorithm ensures that except SVD. So k-means does not ensure that, because it picks data points that can be very far from the centers. So there is a proof of sorts; there are some papers that say it does not work. So I have to qualify that statement: I think it's if you have a bad start. So here, I think, if you have two spherical Gaussians and you start with two centers both from one Gaussian, then you're sunk. I think that's true for any of these methods.
>> [INAUDIBLE]
>> Because they can't escape from the local optimum, and I think that's all these proofs say, in fact. Probably it is true that if I start from centers, one for each Gaussian, each within a few standard deviations of the real center, then convergence probably holds for many algorithms, I would guess, but.
>> [INAUDIBLE]
>> But I don't know, maybe I haven't seen the literature, so I shouldn't venture. But it's a conjecture, right?
Yeah, so it does do something for non-spherical Gaussians; it's been worked out what it is. So it does do something. In general, for non-spherical Gaussians, if the centers are separated by a constant number times the maximum standard deviation of any of the Gaussians in any direction, then this whole thing works. But you can have two very flat Gaussians, pancakes, separated by a few standard deviations in the thin direction, and for that sort of thing we don't have a proof. Pancakes, yeah, we don't have a proof.
No, we have to assume the number of samples from each Gaussian is at least, well, that the sample size is at least 1 over the smallest weight, right, otherwise you can't detect that component. But I escaped all that by going to an infinite number of samples; I looked at the whole density. It's true that with finite samples you have to worry about how many samples came from each component, and that's something people actually worry about. So I think it is not known, of course, how to get down to 1 over the weight; you cannot beat 1 over the weight, but I think the algorithms so far can't even get to 1 over the weight squared [INAUDIBLE].
Okay, thanks.
Yeah.
