So, dimension reduction. We already talked about how data are high dimensional points. And in the examples we saw, the dimension is in the billions or, many times, at least in the thousands. So it's very high.
Of course, we saw that geometry looks different in high dimensions.
But in any case,
efficiency-wise, it'll be nice
not to have to deal with
such high dimensionality.
So, it'll be nice if you can
reduce the number of dimensions
but, of course,
without changing the problem.
So, in just reducing
the number of dimensions,
we shouldn't lose properties,
right?
So we'd like to do that.
There are two main methods
of dimension reduction.
So one is random projections.
So it's difficult to represent
very high dimensions on
the board.
For random projection, what I can show you is two-dimensional points, right? There are points in two space that are your data points, and you project onto a line, an orthogonal projection. And so on.
Now if you project from two to one dimension, there will be a lot of problems.
For example, you will have
potentially two points which
are on top of each other.
So the distance in the projection is zero, much, much less than the distance upstairs.
Now because you're
reducing dimension,
you expect the distance
to go down, we'll see how
much by a certain factor, but
I don't want it to become zero.
Or I don't want it to become
really bad compared
to that factor.
So that's one issue.
So, we want random projections, which are oblivious to the data; in this case they won't care about the data, it's just random.
But we don't want to
lose all properties,
so preserve some properties.
Okay, and the properties
depend on the problem but for
a first cut distances.
So you want at least not to lose
track of what the difference
between every pair of points is.
So you don't want
two points like that
to fall on top of each other.
Okay, and so this lecture, we'll
see in detail how to do this.
I mean, it's just going to say
actually that you pick
a random projection.
Pick this line at random and
then things will behave nicely.
We have to prove that,
we'll prove that, okay.
So I want to go over the sort
of general idea first,
before we start proving.
In general, what will happen
with random projections is
that you'll preserve lengths
to factors which are 1 plus or
minus epsilon.
So the length in
the projection will be within
relative that epsilon
of the length on top.
Now, unfortunately,
as epsilon falls,
the dimension that you need to
project to is going to grow.
We'll see how much.
But, I'll just write this down, the projection dimension is order 1 over epsilon squared.
So if you want low error,
you need to project to a very
large number of dimensions.
So if you want really low error, the dimension you are projecting to is gonna be so large that it's not that useful.
Remember, we wanted it to be
computationally useful, so
that won't happen with
random projections.
So, random projections are good
because you don't have to
process the data, you can
quickly get it, just as random.
Just a random generator. But they are not good enough if you want error which is less than little-o of 1. And you do want error less than little-o of 1 for a lot of problems.
We'll see that not today, but
we'll see that in general.
So random projections are used extensively in theory because, now, I forgot to stress it, but with random projections we'll prove that every pairwise length is preserved. So with high probability, with no exception, every pairwise length is preserved.
There is a probability
of failure
that the lengths
are not preserved.
There is a small chance that a
random line will put two points,
one on top of each other, right.
As you can imagine,
the chance is very low, okay.
But there is a chance.
But what this is saying by 'every' is that with 99.9% probability we can make sure that, with no exception, every pairwise length is close.
Now, nice, in theory,
we use it a lot.
But there will be another
set of projections.
Globally, there are essentially two general methods of projecting to reduce dimension. One is random projection; the other one is principal component analysis, which we'll also see in coming lectures, and which has different properties. With PCA, projected lengths behave very well, much better than the epsilon error, okay, but only when averaged over the data.
So it's not true that
every pair of points,
the distances are preserved.
It is true that on
average the distances
are preserved very well.
So some pairs may behave badly.
Theoretical use is
a little restricted
because of this fact that it
is not true for every pair.
But it is used in stochastic mixture models; in fact, at the end of this lecture you'll already see a mixture model.
And we'll come back
to them later.
But PCA is very widely used, in
fact, for certain problems like
nearest neighbor search
which you will see.
PCA is more commonly used
in practice than random
projections.
And random projections are more
commonly used in theory
than PCA.
Okay, so PCA is not today.
It's random projections today.
So most of the lecture will
be on random projections and
then, at the end of the lecture,
I'll start on Gaussian mixture
models, which we'll continue
also when we do PCA, okay, good.
So here's some motivation for the kind of problems we can solve with dimension reduction, in particular with random projections.
As somebody said, in fact,
a huge bulk, I don't know
what the percentage is,
a huge bulk of machine learning
and practical computation is
precisely this problem of
Nearest Neighbor Search.
What is the problem?
The problem is the following.
You are first given
a database of n points in Rd.
N is very large,
D is also large, right.
You are given these points.
You're allowed to preprocess
taking a lot of time.
Taking time polynomial in n and d, okay. n cubed, n squared, any amount of time.
You do the preprocessing.
And then once you've done it,
then queries come along.
A query is also a point in Rd.
And ideally, you want to find
the nearest database point to
the query point but very quickly
in sub-linear time, okay.
The data is n points in d space, so its size is nd. So you want to do little-o of nd, often polylog time.
Let's sit back and think about
this, how about if d equals 1?
Suppose I gave you points
on the real line, right.
And then whenever a new
point comes I must
very quickly find
its nearest point.
Do you know how to do that?
I'm allowed to preprocess.
What is the preprocessing
one might do for
points on the line or
in one dimension?
>> [INAUDIBLE]
>> Sorry?
>> Call the sort?
>> Sort, exactly, right.
Once you sort the points,
then by binary search,
a new point in logarithmic time,
you can tell the nearest point,
right.
And you should think a little bit: you can do this in fixed dimension; in two dimensions, for instance, you can also do this, right?
I won't go over that, but that's the kind of thing we are looking for: do some preprocessing, then answer in log n time, fine. And at the end of it, in log n time you're able to tell what the nearest point is.
We can settle for
approximate nearest in practice.
So, we'll see that
projecting the database
points into a random
polylogarithmic dimension
space is a good solution.
We'll see,
I have to show you that.
At query time, what we do is
we also project the query
point down to that space and
measure distances.
So, the data points were here,
we've already done
this projection.
Getting these is pre-processing.
And query point comes along.
That's a query point.
Then we project that and
see its nearest point
there in the projection,
we don't bother otherwise.
It's not quite that simple but
we can think of it that
way conceptually, okay?
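Here is a minimal numpy sketch of that idea, just to make the picture concrete; the sizes, names, and the brute-force search over the projected database are illustrative assumptions, not the lecture's exact scheme:

```python
# Minimal sketch of nearest-neighbor search through a random projection.
# All names and parameters here are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10_000, 1_000, 40          # n database points in R^d, project to R^k

database = rng.standard_normal((n, d))

# Preprocessing: pick a Gaussian projection and store the projected database.
P = rng.standard_normal((k, d))
proj_db = database @ P.T             # shape (n, k)

def approx_nearest(query):
    """Project the query and return the index of the nearest projected point."""
    pq = P @ query                   # shape (k,)
    dists = np.linalg.norm(proj_db - pq, axis=1)
    return int(np.argmin(dists))

query = rng.standard_normal(d)
print(approx_nearest(query))
```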
Okay, another example, a completely different example where dimension reduction is useful. And there are lots of examples; I'm only gonna give you two. Linear regression, what's called the overdetermined case.
That's the following: you have a matrix A. Ideally you want to solve the equation Ax = b. There might not be a solution, so you want to minimize; you want to find an x so that the length of Ax - b is minimized.
Overdetermined: A looks like this, tall and thin, and x is small dimensional. You want Ax to be approximately equal to b. b is also tall then, right?
So, this is interesting in the case when A is much taller than it is wide, n is much larger than d. We call that overdetermined; underdetermined if it's the other way, right.
So here's a thought. You take a random projection of both sides. You must hit both sides. Let me draw the picture better. There's another matrix P which looks like this, right? This dimension is n, that's A which is n by d, right? This dimension of P is n, and I'll call this other dimension k. Okay, so then you solve the projected problem; now it is a much nicer problem because PA is only a k by d matrix.
Now I won't prove this to you. To prove that a solution to this is a good solution to the top, we have to do two things, right? You have to prove that for every x, the length of Ax - b is roughly equal to the length of PAx - Pb. That's nontrivial.
Why is it nontrivial? So, you could try to do that by union bound. I say, for every x, this length and this length are close, for one x, right? With a certain probability. So there's a probability of failure for one x, okay.
But you have how many x-es here?
Infinitely many x-es,
you don't know what x is going
to be the solution, right?
We've got infinitely many x's, so it's difficult to just apply the union bound, okay. It can be done; again, I am not going to do this in detail, but random projections can be used to solve this, okay?
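As a rough illustration of this sketched-regression idea (not the proof; the matrix sizes, scaling, and noise model below are made-up assumptions), one could compare the full and projected solutions numerically:

```python
# Rough sketch: solve an overdetermined least-squares problem after a random
# projection of both sides, and compare against the full solution.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5_000, 20, 200             # tall A: n >> d; sketch dimension k

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

P = rng.standard_normal((k, n)) / np.sqrt(k)   # k-by-n random projection

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)            # min ||Ax - b||
x_sketch, *_ = np.linalg.lstsq(P @ A, P @ b, rcond=None)  # min ||PAx - Pb||

# The sketched solution should have residual close to the full one.
print(np.linalg.norm(A @ x_full - b), np.linalg.norm(A @ x_sketch - b))
```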
There are many other examples, and I'm going to show you at the end a clustering example, k-means clustering: how you can cluster in the projection, and that's enough to get an approximate solution up on top.
And we'll discuss this later.
Okay, so the order of business now is: I'm going to present the random projection theorem and proof.
And then we'll go back and
discuss these applications.
So basically I want what's called a length preserving projection, or you can call it distance preserving; we'll see that they are really the same. So we want a random, length preserving, linear projection.
From d dimensional space
to k dimensional space,
k is much smaller okay.
We want linear, so I want a matrix P, a linear transformation. I want a matrix P so that for every vector v in this space, the probability that the length of Pv is within a relative error epsilon of the length of v, right, 1 plus or minus epsilon, is 1 - delta or higher. So the probability of failure should be less than or equal to delta. Okay?
Okay, a few observations before we prove the theorem. I put the quantifier 'for all' outside the probability, right? So that says: for every v, the probability is high, right? So let's understand that; that's important. What if I place the quantifier inside? So what if instead I say the following; let's see if that statement makes sense: the probability that for all v, the length of Pv is within 1 plus or minus epsilon of the length of v, is greater than or equal to 1 minus delta.
So would that be possible?
That's a stronger statement,
right?
This statement is for one v at a time; the probability of failure is small, it says, for each v at a time. It holds for every one of them, but one at a time. The other statement is for every one simultaneously. So is it possible that we can make the stronger statement? I didn't make it, but can we? Why?
Okay.
But any other answers?
>> [INAUDIBLE]
>> Yes.
So we cannot do that. Because, see, basically we're sitting in d dimensions; these vectors are in d dimensions, they have d degrees of freedom, and there is no way in general I can reduce the degrees of freedom and keep everything sensible, right? In fact, one way to say it: if k is less than d, there's a nonzero v in the null space of P; P is k by d, right? The null space always contains something that P maps to zero. And if you take that vector, its projection is zero while its length upstairs is not, right? So, one nice thing to remember is that zero is never within relative error of anything non-zero, right? Never, right? So you can't do it, okay.
So I can't put the quantifier inside. I can only do it for one vector at a time. So, the original projection that was used to prove this statement, it's called the Johnson-Lindenstrauss theorem.
They used the following. They projected to a random k-dimensional subspace. In terms of matrices, their P was, well, they wanted to project to a random k-dimensional subspace, so P had rows which are a basis, an orthonormal basis, of that subspace. Orthonormal vectors, as you recall, have length one and are mutually orthogonal.
So that was their P; the paper was very long, 50 pages. The difficulty was, first, how do you pick such a matrix? There's dependence between the rows, right? I pick the first row; the second row cannot be independent necessarily. I must make sure the dot products are right, I must make sure it's strictly [INAUDIBLE], and that turns out to involve a very complicated argument. Well, for generation it's complicated, but proving it is length preserving is very complicated because of the dependence.
Okay, so we won't do that,
we are going to do something
slightly different.
The vectors are not orthogonal, but still the whole thing works. We pick k vectors; remember, k is the number of rows of P, right?
I'm gonna pick k
vectors u1 through uk.
I'm gonna make each of them
spherical Gaussian mean 0
variance covariance
matrix identity.
Just one more time let's recap
what this notation means.
We introduced that last time,
I want to recap that.
So the variance covariance
matrix of random vector u.
So each of these uses
a random vector.
So if you a random vector, in d
space, it's variance covariance
matrix is just (i,j)th entry is
the moment of inertia, right?
Is the expectation of the ith
component times the date.
Except we want to make
sure the means are 0.
May already be 0, so
you just subtract out the means.
Okay, now it equals identity
means that different coordinates
are not correlated.
So independent coordinates,
just means that, okay?
Everybody is familiar with this,
right?
So just a recap.
So the projection is going to be given by, well, I want to project the vector x, right? Its projection is just the dot products with these vectors. Okay, that's what I'm going to do. If I were projecting to a subspace with an orthonormal basis, I would just take the dot products with that basis; that would be exactly the orthogonal projection, right? This is not, right?
These u's are not
necessarily orthogonal so
it's not an orthogonal
projection.
But it's going to be fine for
us, so
we still look at length
preserving property.
So it's very simple, I pick k
Gaussian independent vectors and
just take dot
products with them.
So I get here a vector
with k components,
right, instead of d components.
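A minimal sketch of this map f in code, assuming numpy and illustrative sizes:

```python
# A minimal version of the projection f described above: pick k independent
# spherical Gaussian vectors u_1, ..., u_k and take dot products with x.
import numpy as np

rng = np.random.default_rng(2)
d, k = 1_000, 100

U = rng.standard_normal((k, d))      # rows are u_1, ..., u_k ~ N(0, I_d)

def f(x):
    """f(x) = (u_1 . x, ..., u_k . x), a k-dimensional vector."""
    return U @ x

v = rng.standard_normal(d)
# The theorem will say |f(v)| should be close to sqrt(k) * |v|.
print(np.linalg.norm(f(v)), np.sqrt(k) * np.linalg.norm(v))
```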
Okay, we'll prove that for each vector v, the length of f(v) is, not exactly, but to within a 1 plus or minus epsilon factor, root k times the length of v. There's a root k factor; we'll think about why the root k factor.
Before we do that
let me just ask you,
let's figure out what
the expected value of v.u,
v is a fixed vector,
u is random right?
The expectation is
with respect to u.
What the expectation
of v.u1 squared is or
at least an upper bound okay?
So any sort of thoughts about
what the expectation of
v.u1 squared might be?
What the dot product
squared might be?
Or how we might compute it?
So, what do you think the random
variable v.u1 looks like?
U is random, v is fixed,
remember, right.
What does the random variable,
just a dot product look like?
>> [INAUDIBLE]
>> Sorry?
Gaussian.
Yeah, it is Gaussian, right?
Because it's actually
the sum of some Gaussians.
It's a Gaussian.
Sum of independent Gaussians,
so it's a Gaussian.
So that's what makes this
very simple argument compared
to taking out the normal vector.
So v.u1 is just a sum of,
this is a dot product,
right, so
sum over all j of this.
The u1j, the components
of u are independent
because it's normal
zero identity, right?
The components of
u are independent.
So this thing is a sum of,
v is not random,
v is a fixed quantity.
So this is a sum of d
independent Gaussians.
And if you add up Gaussians, the
means and variance just add up.
Provided they're independent, right? With the variance, be careful. The variance of this component alone is vj squared, right? The variance of u1j is 1, so the variance of vj times u1j is vj squared. The vj squareds add up to the length of v squared. So v.u1, which is a real number, has the normal(0, length of v squared) distribution, right?
So with high probability this is less than c times the length of v; c is just a constant, a number of standard deviations. A normal random variable doesn't exceed more than c standard deviations, with high probability. So this quantity is much smaller than if v and u were colinear; if they were colinear you would get the product of the lengths, but here you gain a factor of root d, okay?
This, I claim, should be
reminiscent of something we
proved already, and
it would be nice to remember.
So we proved such a thing for a uniform random point from a sphere in d dimensions. So this is Rd, the big space; if I have a fixed vector v and I pick a random u, we proved this when u has unit length. My u here is not necessarily unit length, it's of length about root d, so there's some scaling. But for unit length u we saw that v.u will be less than or equal to c over root d times the length of v, because u is a unit vector in that case.
Okay, so I want to write this
down, which I should have
written on the slides but
it's worth remembering.
I want to write this
down carefully.
So if we have two vectors u and v in Rd which are independent, then the dot product is less than or equal to c over root d times the product of the lengths. The product of the lengths is what you'd get if they were colinear, right; the dot product is about 1 over root d times that. This was our equator theorem for spheres. Here we argued it for Gaussians, which is easier, but it's true for many distributions. So this is, if you will, the equator theorem: the dot product is roughly 1 over root d times what it would be for colinear vectors, okay, so that's true.
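A quick numerical sanity check of the equator fact, with made-up dimensions; this is not part of the proof:

```python
# For independent u, v in R^d the dot product is typically about
# |u||v| / sqrt(d), far below the colinear value |u||v|.
import numpy as np

rng = np.random.default_rng(7)
d = 10_000
u = rng.standard_normal(d)
v = rng.standard_normal(d)

print(abs(u @ v))                                          # typically O(sqrt(d))
print(np.linalg.norm(u) * np.linalg.norm(v) / np.sqrt(d))  # comparable scale
print(np.linalg.norm(u) * np.linalg.norm(v))               # colinear value, ~ d
```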
Okay, now I'm ready to state
the random projection theorem.
I'm gonna state two theorems and
then we'll prove that.
The proof will not be
very difficult, actually.
Not much longer than
the statement [INAUDIBLE].
So f is the same as we had.
I take k Gaussian vectors and
take dot products.
There's an absolute constant c
such that for epsilon not too
big, epsilon between 0 and
1 we have the following thing.
Again, the 'for all' is outside: for every individual vector in Rd, every one, no exceptions, okay? You take the length of f(v), the k-vector, minus root k times the length of v.
I forgot to answer
the question why root k.
But sorry let me just do
that before we go on.
So, sorry, excuse me. This expectation is constant, right, we saw that. The expectation of the square of this dot product is the length of v squared, so if v has length 1, this is constant. But I'm adding up k components. So I'll get k for the length squared of f, and root k after taking the square root.
So that's why
the root k comes up.
Cuz I have k components.
Okay, the probability that there is a relative error. Root k times the length of v is supposed to be roughly the expectation; well, it's not quite the expectation. The expectation of the square is k times the length of v squared, and root k times the length of v is its square root; that's all it is. And the probability that the relative error is more than epsilon falls off as e to the minus c k epsilon squared; very importantly, epsilon appears squared in the exponent.
I'm gonna abbreviate this statement for the future as: the length of f(v) is within relative error epsilon of this. Actually, excuse me, it means the opposite of that, okay? Being within would be the good event, but this is saying it's not, so it's the failure event, right? Okay, where the probability is taken over the random draws of the vectors u used to construct f. In other words, v is not involved in the probability, all right.
And then a second theorem, which is in fact a little more useful, along the lines of this picture, that the lengths are preserved. It says: for every epsilon and any integer n, and k greater than 3 over c epsilon squared times log n, so very important that k only grows as log n, okay, now I take these n points. Then for every set of n points, now the 'for all' is inside the probability, right, I'm asserting that with high probability all of them hold, all the lengths are preserved.
Okay, now that actually all the
distances are also preserved,
I think this is stated
on the next slide.
So this is the distance between vi and vj on the right-hand side. On the left-hand side is actually the distance between f(vi) and f(vj), right, which is the same as the length of f(vi - vj); f is linear, so we'll see that that's useful.
So everybody see,
we will come and prove that.
Why is this true based on that,
that's actually simple, right?
Sorry, any thoughts on why,
theorem two follows rather
easily from theorem one I claim
and we will write down a proof.
But Let's take a minute to
see if we can guess that.
So what is theorem one saying?
Theorem one is saying: for one vector, the projection is more or less length preserving. What is theorem two saying?
[INAUDIBLE]
>> Sorry.
>> [INAUDIBLE]
>> Cong lex knows.
There's one word I want to hear.
>> [INAUDIBLE]
>> [INAUDIBLE]
>> So we want how many lengths to be preserved here?
We want N squared lengths
to be preserved, right?
So what might we do to get all
N squared to be preserved from
knowing that one is preserved?
Union.
Did somebody say union?
Union bound. So there's a lot of dependency; these events are not independent, of course, right, because once you know the projection of some vi, the projection of some other vj is dependent: they all use the same projection. But a union bound doesn't need independence. So it's just a union bound.
So I have n squared things to be preserved. But luckily, if you work it out, once I put k equal to some constant times log n over epsilon squared, the failure probability for one pair becomes one over n cubed or so. You just have to multiply that by n squared.
Okay, so we'll actually
write down this proof.
Okay, let's prove the first theorem, right; again, the first one is gonna be quite simple.
Okay before I prove it I think
I have some remarks including
the ones that I just made.
The claim is for every v. What if I just wanted the statement for most v? So let's think about that, statement 1. Suppose I wanted this to be true not for every v, but for most v.
Okay.
It's a simpler statement,
but is it useful?
So maybe you can say: if the data is random, I could just have a weaker theorem that says it is true for most v's. Suppose I assume that the data points I am given are random, not a worst case set; can I then use a theorem that says for most v the length is preserved? Okay, actually you cannot, even in the case the data is random. Let's think about why I always need that for every v the length is preserved.
So if it is a random dataset,
right?
What does that mean?
That means that most of
that data set behaves well.
But there might be exceptions
even in the dataset.
And our nearest neighbor search says: I give you a query point, you must find absolutely the nearest neighbor, or close to the nearest. I'm not allowed to make errors. So if the nearest neighbor is one of the exceptional points, I can get it completely wrong, right?
I'm not allowed to do that.
So nearest neighbor search, and in fact this is true of any optimization problem: if I give you an optimization problem, even a random instance, I want the optimal solution or an approximately optimal solution. I don't want to say that you can fail with some probability on part of the input and still give me an answer, right? You can't do it.
So even if the data were random, we would still need this for all v. So I cannot use a weaker statement. And the weaker statement is very simple to prove.
Suppose I have a random V, and
I want to preserve lengths,
what would I do?
Something much
simpler than this.
So I have a random
D-dimensional vector V.
It's random.
And I want a projection
where the length sort of
represents that length.
Not preserve exactly but
maybe within a factor.
What's a good simple projection?
Take the first component, right?
If the vector was random, all components are roughly the same size. So the first component should be about 1 over root d times the length.
So that's not interesting,
right?
But that's also useful.
Okay.
Good.
The advantage of linearity of f is that the distance between two points is, up to the fixed factor 1 over root k, the distance between their projections, right? And that's because the projection of the difference is just the difference of the projections; it's linear. So I could have just projected the points and looked at the distance, right? You need linearity for that; otherwise it wouldn't hold.
K being in the exponent
in theorem 1 is crucial
to get logs,
because K is in the exponent.
There are many other random
projections known now.
Gaussian is the simplest and
easiest to prove.
But, for example, the u_i can be taken as plus-or-minus-1 random vectors. Intuitively, when d is very large, things behave like they're normal, right? Central limit theorem. They behave like they're normal, but for smaller d something else is true, and that's more difficult to prove. Here we don't do that, we only do it for Gaussians, but there are many distributions now for which the same kind of result can be proved.
Okay, now the proof. We want to prove that that's true with high probability. Scaling both sides, we might as well assume that the length of v is one. If I scale v, both sides scale by the same amount. The nice thing about relative error is that I'm allowed to scale both sides by any factor; that's not true for absolute error, but we want relative error, so we can scale. So we assume without loss of generality that v has length 1. Okay, now, this is a repeat of something I did earlier.
So the dot product is the sum of independent random variables, and therefore the means and variances add up. So we get that this has distribution N(0,1), okay. Now, these dot products with u1 through uk are independent Gaussians, each distributed identically, right? The length of v squared is 1, so each dot product is normal(0,1), and they're independent. So the covariance is a k-dimensional identity matrix. f(v) is a k-vector, right? So the k-vector is distributed exactly like a spherical Gaussian in k dimensions, and we apply the Gaussian Annulus Theorem we proved last time.
And the Gaussian Annulus Theorem told us that if you pick a point with independent Gaussian coordinates, k of them in this case, the length will be about root k, because I have k coordinates, each normal(0,1). The probability that the length is off from root k by more than beta is at most e to the minus c beta squared. Only valid for beta small enough, only valid for beta smaller than root k.
So this you have
to refer back to
the Gaussian Annulus Theorem
we proved.
So this, it's not valid,
we didn't prove it for
epsilon greater than one, but
luckily we only need it for
epsilon less than one.
In all these concentration results, the large deviations are always difficult. If you want a deviation greater than a certain amount, quite often it doesn't behave like a Gaussian. But these kinds of deviations behave like a Gaussian, which means they have a beta squared in the exponent, right, while large deviations only give e to the minus beta. We saw that last time, right? So we get this: the probability is e to the minus c k epsilon squared, taking beta equal to root k times epsilon.
So that's the first theorem,
okay?
We do have to use the Gaussian
Annulus Theorem to prove that,
which we proved that already.
Okay.
Theorem 2, as I said, is a union bound. We have n squared pairs of points. The probability of failure for each is e to the minus c k epsilon squared. So with this k, if I put a constant times log n over epsilon squared there, e to the minus c k epsilon squared can be made 1 over n cubed, and times n squared pairs that's 1 over n, okay. So it's important, as we said, that the bound is exponential in k, so that k only needs to grow logarithmically.
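A small experiment along the lines of Theorem 2, with an arbitrarily chosen constant in place of 3/c; this is only an empirical sanity check, not the proof:

```python
# Check that with k on the order of (log n) / eps^2, all pairwise distances
# are preserved to relative error about eps.  The constant 4 is just a demo choice.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, d, eps = 50, 2_000, 0.3
k = int(np.ceil(4 * np.log(n) / eps**2))

points = rng.standard_normal((n, d))
U = rng.standard_normal((k, d))
projected = points @ U.T / np.sqrt(k)     # divide by sqrt(k) to compare directly

ratios = [np.linalg.norm(projected[i] - projected[j]) /
          np.linalg.norm(points[i] - points[j])
          for i, j in combinations(range(n), 2)]
print(min(ratios), max(ratios))           # both should be within roughly 1 +/- eps
```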
Okay, so
now I want to show you a couple
of slides of applications.
So here, be prepared to
answer some questions.
Let's see how that goes.
So, k means clustering.
Here's the problem, I'll
draw a picture on the board.
Given n points in Rd, you have
to partition into K parts so
as to minimize the sum
of squared distances
to the cluster center
of each part.
So, if I had, I should've shown
you some pictures I guess, yeah.
So, now I want to partition them
into five parts, let's say.
So, I drew it so there's
a natural set of five parts.
These are the five clusters.
I have partitioned the set
of points into five sets.
After I do that, I put down,
maybe one more point
would be nice to see.
So I put down some cluster
center for each cluster.
I'm allowed to choose
any cluster center and
then I take the distances,
square them, and add.
And I want to minimize that.
It's not difficult to see that
the best choice of center,
once I give you the clustering,
is just the mean of the cluster,
right?
That's only true for
distance squared.
So even on the line,
you can convince yourself,
I give you a bunch of points. I want a point mu so that the sum over x of (x - mu) squared is minimum. And that happens at exactly one point, when mu is the average of all the x's. It's only true because the criterion is distance squared.
Actually, there's gonna be other
properties that are only true
for distance squared, which
are not true for other metrics.
We'll see that.
So that's true.
So he has a question.
So this, we are going to spend
a few minutes thinking about.
This is an important question.
So is the following obvious?
So, I want to divide this
up into k parts, right.
I know that if I project to some number of dimensions randomly, as we just saw in the theorem, okay, all distances are preserved to 1 plus or minus epsilon, okay.
Therefore, why don't
I just project and
cluster in the projection, after
all the distances are preserved.
So, that should give me
near optimal answer within
1 plus or minus epsilon.
So, is this true?
Is this a proof?
Let's think about this.
So, can I use the theorem to
argue that since the distances
are preserved, I might as
well project down and cluster?
So the thing to think about is,
how I prove theorem
two from theorem one?
Sorry?
>> [INAUDIBLE]
>> The epsilons made
that up, so.
>> [INAUDIBLE]
>> Something like that but
add up,
what if epsilons would add up.
It's a fair thing that
something adds up.
Yes.
So actually, the reason I put
the question is it's false.
Okay, I can't disprove it,
but this is not a proof.
And you cannot carry out
a proof on these lines.
It is true that distances are preserved; we proved that. Okay.
It's a tricky question.
So let's see that.
So how many distances
need to be preserved?
This is the important here.
For theorem two,
we only had n squared
distances to be preserved.
Crudely, how many possible
cluster centers could there be?
Well, every subset could
be a potential cluster.
And there are 2
to the n subsets.
Each of them gives you
a point which is a mean.
And each such point is a
possible cluster center, right.
We cannot rule it out.
We cannot rule out which
clusters will be formed.
Therefore, if I want this
argument to be made rigorous,
I've got to take distances from
each possible cluster center to
each data point.
There are two to the n cluster centers, so there are two to the n events, and I have to add up the failure probabilities.
That's very bad, okay.
And you'll see that k
has to be very large.
Everybody see at least
the argument why the proof is
fallacious, right?
So this proof is not true, okay.
This is false.
But for k-means there's an easy fix, which doesn't apply to other criteria, okay.
Luckily, for any n points, one can prove that the sum of squared distances to the mean of a set of points S is, sorry, 1 over the size of S times the sum of pairwise squared distances over all of S.
So let me just draw that.
I should've proved
that on the slide.
I'm gonna write down proof on
the board, I didn't have time.
So we have a set S of points, n points, and this is asserting that the sum over x belonging to S of (x - x bar) squared equals 1 over the size of S times the sum over all pairs x, y belonging to S of (x - y) squared. I may be off by one, it might be the size of S minus 1. This is not a very difficult proof; maybe if I run out of things to say at the end of the lecture we'll prove it. You can try to prove this, it's very simple, okay?
Okay, so, this is only
true cuz it's squared.
If I didn't have a square there,
it's not true, okay?
In general.
So, only with the square.
So there are a lot of nice
things that happen with
the square and
this is one of the nice things.
Actually, there
are two nice things
that happen with
the square on this slide.
This is only true for square,
and this is only true for
square.
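A quick numerical check of the identity, assuming the 1 over |S| normalization with unordered pairs; this is just a sanity check, not the proof:

```python
# Verify: sum_{x in S} |x - mean(S)|^2 = (1/|S|) * sum over unordered pairs |x - y|^2
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
S = rng.standard_normal((30, 5))             # 30 points in R^5

lhs = np.sum(np.linalg.norm(S - S.mean(axis=0), axis=1) ** 2)
rhs = sum(np.linalg.norm(S[i] - S[j]) ** 2
          for i, j in combinations(range(len(S)), 2)) / len(S)
print(lhs, rhs)                              # these agree
```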
So now can I make
this assertion?
Now can I make a projection?
And what has changed
after I told you this?
How many distances does the
projection have to preserve now?
I have already told you this.
So before I told you this,
I was worried that [INAUDIBLE]
distance [INAUDIBLE] preserved,
right?
>> [INAUDIBLE]
>> Yeah,
it's only in [INAUDIBLE] squared
because it's enough to preserve
distance [INAUDIBLE].
Once I tell you that [INAUDIBLE]
points then the value of every
clustering is fixed, that's
what this is saying, right, and
therefore if I preserve those
distances I'm in good shape.
Okay, so for k-means, projections do work. If you think about it, okay, I said that already. You should think about other criteria; in fact, many criteria in use look like k-median, where you want to minimize distances.
With no square, sum of
distances to cluster centers,
then this argument
is not correct.
It does not work, right?
So there is a way to project to
get within a factor of 2 for
k-median, but
1 + epsilon, no, okay?
K-means, we lucked out.
Okay.
So this is now going to a different topic, Gaussian mixtures, which is in the chapter; we will prove this perhaps later. It's also a clustering problem. We won't actually have special lectures just on clustering, but these Gaussian mixtures I wanna deal with have to do with clustering.
We'll see that twice.
Today, we'll see
Gaussian mixtures.
And then when we do SVD, we'll see a better algorithm.
Okay, now any questions on random projections before we go on to Gaussian mixtures?
>> [INAUDIBLE]
>> No, because it's got to be exact, right? Because we want the optimal solution here to be a near-optimal solution up on top. If we get only an upper bound, that's going to say that the cost in the projection is at most the cost up there, right? But it might be spuriously much smaller for something that's very costly up there, and that might then be optimal here, and I get the wrong answer.
That's actually a good question, so I think we should go over this argument. So if every distance is preserved, then the optimal answer is the same, right? But the argument for that has to go both ways. First I have to say, if I get an optimal thing here, that's less than or equal to everybody else; therefore, up on top it's also less than or equal to. So an optimum here is very good there, but I also have to argue that an optimum there is very good here. Okay, so it's a good exercise: if I tell you all distances are exactly preserved, there are two problems with n points each and the distances are exactly the same, then I get the same optimal solution; I mean, that seems clear. But when they are only approximately preserved, you have both directions to do.
[INAUDIBLE]
Preserving angles.
>> Okay.
Good point,
I should've done that.
Okay, so random projections do not preserve angles; they preserve lengths. They do not preserve angles, that's a good point to make. Okay, good. So, if I have two vectors, u and v, and I have a random projection P, then it's not true that the angle between Pu and Pv is equal, in any scaled way, to the angle between u and v, right? So, angles are not preserved by random projections; neither are dot products. Preserved means to relative error.
Okay.
So let's see why [INAUDIBLE]; that probably is not that difficult, right? So what can happen to dot products? Dot products need not be preserved under projection, so you can get the wrong answer when you project.
Distances are preserved. Okay, so let's say that I have two vectors u and v, and I want to worry about the dot product. I can write the following; this is the usual trick to express dot products in terms of lengths. Okay, so if I project, |Pu| squared = (1 plus or minus epsilon) |u| squared, let's say. Then if I plug that in here, I get that |P(u - v)| squared is 1 plus or minus epsilon times |u - v| squared. Okay, on the right-hand side the errors add up, so the error there becomes epsilon times (|u| squared plus |v| squared). If now my dot product is very small, say u and v are orthogonal, the dot product is 0, then in the projection they won't be exactly orthogonal, only nearly orthogonal. And so you don't preserve the dot product to relative error, right? That's the same as saying that the error here is so big it swamps the dot product.
So dot products are not preserved, and therefore angles [INAUDIBLE], okay? Because the angle is just the dot product divided by the product of the lengths; the denominator is preserved, but the numerator is not, so angles won't be preserved. So dot products are difficult to handle with random projections; if a problem involves only distances you can do it, but not otherwise.
Okay, that's probably a good point to note, right?
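A tiny numerical illustration of this point, with assumed sizes: two exactly orthogonal vectors stop being exactly orthogonal after a Gaussian projection, so a zero dot product is not preserved to relative error:

```python
# Dot products are not preserved to relative error: take two exactly
# orthogonal vectors; after a Gaussian projection the dot product is no
# longer zero, and zero is never within relative error of anything nonzero.
import numpy as np

rng = np.random.default_rng(8)
d, k = 1_000, 100
u = np.zeros(d); u[0] = 1.0
v = np.zeros(d); v[1] = 1.0            # u . v = 0 exactly

P = rng.standard_normal((k, d))
print(u @ v, (P @ u) @ (P @ v) / k)    # 0 versus something of size ~ 1/sqrt(k)
```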
Okay, onto Gaussian
mixtures then right?
So a Gaussian mixture is a probability density. Sorry, yeah?
>> [INAUDIBLE] So
the [INAUDIBLE] to
something squared [INAUDIBLE]
>> Yeah.
>> But when we took the median
I don't have the square.
>> That's right.
>> But
I still have the distance.
>> Yeah.
>> Okay, so
still adding distances?
>> Still adding distances.
That's still true but the number
of distances that need to be
preserved seem to
be exponential.
>> So it can't be just 1 by n?
>> Actually, there are some
other constrictions called core
set, which are more complicated,
by which you can do something.
But I wasn't going
to get into that.
But you can't do,
it's not true that pair wise, so
the following is not
necessarily true.
If I told you there
are two problems.
Two sets of endpoints, all pair
wise instances are the same.
Is it true that the k median
clustering will be the same?
I think the answer is no.
I don't know an example.
You can try to
construct an example.
So it's not purely a function
just of pairwise distances.
>> I used a classical
problem from book.
>> The median problem, yeah.
>> Yeah, yeah,
anything except squares is.
You can get within
a factor of two.
That's not difficult, but.
Okay.
Okay, Gaussian mixtures. Just a convex combination of probability densities which are Gaussians; again, a picture. So this is one Gaussian density. Then there might be another one, right. So I can take half of this density plus half of this. What this just means is that I add a weighted combination of the densities, right? The weights are w_t; there are k densities; the weights are nonnegative, adding up to 1. It's a convex combination.
Each density is the usual exponential form. I'm talking now about standard spherical Gaussians, variance one. The mu_t are the means, right? So just like this picture: the mu_t's are this and that, and you add the densities up.
The problem to do with Gaussian mixtures is: samples from a mixture are given to you, so the real problem is the reverse problem. It's not the problem of taking the mixture and then generating points, but the problem of fitting a mixture to data that's given. So, we are given data, x1 through xn, which are i.i.d. draws from this probability density. Ideally, from the data you have to find the density. And here's another way to say it, an equivalent way I can generate the samples. So I want to pick n samples.
For each sample I pick a t, a component of the mixture, one particular Gaussian. I pick those components with probability proportional to their weights, right, w1 to wk, and then pick xi from that density, right. You should convince yourself that these two are the same: the distribution of x1 through xn will be exactly the same either way.
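A small sketch of that second, equivalent sampling procedure, with made-up weights and means:

```python
# Sample from a mixture of spherical Gaussians by first picking a component
# t with probability w_t, then drawing x ~ N(mu_t, I).  Weights and means
# below are illustrative.
import numpy as np

rng = np.random.default_rng(5)
d = 10
w = np.array([0.5, 0.3, 0.2])                # mixture weights, sum to 1
mu = rng.standard_normal((3, d)) * 5         # three means in R^d

def sample(n):
    """Pick components with probability w_t, then add spherical Gaussian noise."""
    t = rng.choice(len(w), size=n, p=w)
    return mu[t] + rng.standard_normal((n, d)), t

X, labels = sample(1000)    # in the real learning problem, labels are hidden
```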
And the learning problem
we want to solve, or
the model fitting problem,
if you will.
I give you samples x i, I don't
tell you which Gaussian it
was drawn from, I of course
don't tell you the mu_t's,
I don't tell you the means,
right?
We want to find a t for each i.
That's a clustering problem,
once I find that I can fit a
single Gaussian to each cluster.
So I erase that; in this picture you'll get samples like these. This is the real line, right? So I'll see only the points on the real line, or in general in d-dimensional space. I tell you there are two clusters, so the right clustering would be this and this.
Okay, I'm not drawing them very
well, but this and this, right?
Then I take all these points,
and
I have to fit a single
Gaussian to them.
In the book, actually,
we worked that out,
you should know how to do that.
So if you have a collection of points drawn from one Gaussian.
Fitting the Gaussian is easy,
you just take the sample mean
to be the mean of your Gaussian
and the sample variance.
Or the sample variance
covariance matrix in general
to be that of the Gaussian,
that's the best fit.
So you can do that, but the problem of identifying the clusters is not trivial; that is the real problem to solve. And the solution I want is the following. Let's pretend the number of clusters is constant; I'd like to solve the learning problem when the means of each pair of Gaussians are separated by ten standard deviations. This is a mnemonic that you hear all the time, right: means separated by six standard deviations, let's say ten, or any constant. I want to say that's enough for me to learn the Gaussians, okay.
This is not trivial; in one dimension it is, but in d dimensions it is not trivial. So we'd like to, of course, come up with a learning algorithm and prove that it works under this condition.
Another way to say that,
is k may not be constant, the
number of standard deviations,
the means,
can be something depending on k.
So in this case, right, the [INAUDIBLE] deviation [INAUDIBLE] is about this much, and you better be separated by a constant times that.
[INAUDIBLE] less separated
[INAUDIBLE] each other,
there's no way to figure out
the individual ones right.
So [INAUDIBLE] easy to see
that you need that many and
that [INAUDIBLE] So we will
see that, in this lecture
we will see that distance-based
clustering can do this only
if the separation is d to the
one-fourth standard deviation.
So, the number of standard deviations that the means have to be apart has to grow with d for us to do it based purely on distances.
Purely on distances means
I take two points and
if they are close I put
them in one cluster.
If they are far, I put them
in different clusters.
Let's say there are two
clusters, right?
That's all I'm allowed to do.
Then you need a much bigger separation than the best possible. In the next chapter, SVD based algorithms, we'll be able to do it with the best possible separation, namely where the number of standard deviations is a constant. We'll see that when we do SVD.
Sorry.
You have a question?
>> [INAUDIBLE]
>> Okay, so I will actually say that if you want to get everybody right you need an extra root log n factor. This is getting most things right. At the end, I'll say that: if you want to get every point right, you have to pay an extra root log n factor.
Okay?
This is only going to
get most things right.
So everybody see the problem?
So mean separated by a few
standard deviations.
Now why is that difficult?
So let's see what that actually looks like in d dimensions. This is one picture I stole from the book, so I didn't actually sit and draw it. I guess it doesn't quite illustrate things that well; I'll draw a better picture on the board, maybe that's better.
So, we're sitting in d dimensions, d large, okay, and a standard Gaussian. So we have e to the minus |x| squared over 2. The standard deviation is 1 in each direction. What does the Gaussian look like? We already saw that, right? We saw that the expected distance squared of a point from the center is about d; not about, it's exactly d, okay, we saw this already, right. Therefore, points will be at distance about root d from the center.
So all the, most of the
probability mass is here, okay?
Let's have two Gaussians
with means separated
by a few standard deviations.
Standard deviation is 1, right? And this is root d. Think of d as very large, so root d is much, much bigger than 1. The next Gaussian is centered here and it looks like this. And this picture should tell you what the problem is, right? I've given you two Gaussians whose mass is concentrated in the two annuli, and I have to tell them apart; that's pretty difficult.
They're not geometrically
completely separated, okay.
So you can call this the sun and
planets picture right?
So the distance is enormous
compared to the separations.
Okay, that's a difficulty
in d dimensions.
So I guess I'm trying
to draw that there, but
I'm not sure I did
a very good job there.
So, this is the picture
that I want to focus on.
I want to separate them, because
if I draw a hyper plane here.
It is true, although the picture
doesn't look nice like that,
even if this was 100,
it is true that most points of
this Gaussian are on this side.
Most points of that Gaussian are on that side. You have to prove that, but it's true. The picture doesn't look like that at all, but it's true. If I give you a separation of 100, essentially every point is like that.
What's the difficulty with this?
The difficulty is I don't
know this hyper plane, okay.
Right?
Now if only I could project to this subspace. If only I knew the centers and I was able to project orthogonal to that hyperplane, in this case onto a one-dimensional subspace. If I do that projection, then the Gaussians are well separated; they look like this in the projection, again something you have to prove, okay? So I have to find this projection to be able to do this; then I will be home.
It turns out SVD,
this is an SVD sub space.
SVD does that,
this is a preview,
we won't see that for a few lectures. We will actually see a proof of that when we do SVD.
Right now,
we are focusing on figuring out
if based just on distances,
I can do something.
And the answer's no in
this case, I cannot do it.
So, two spherical Gaussians
with a unit variance in every
direction.
If x and y are independent
samples from the first Gaussian,
then, okay, maybe I should draw the picture again. So, x is one sample from the first Gaussian, it's on the annulus, and y is another sample.
That was a picture actually.
That's the first picture, right?
So the picture looks like that.
And they are almost,
nearly orthogonal, we saw that.
We've seen a proof of that, and
in this picture that indicates that this is the distance between them: because they are almost orthogonal, each at distance root d from the origin, the distance between them is the square root of 2 times the square root of d.
If I now have, so first maybe
I'll explain the picture then
we'll do the calculation.
So now there are two
Gaussians separated by delta.
I've drawn delta
to be very large
just to do the calculation.
The first point x is picked
from the first Gaussian.
Point y is picked from the second Gaussian.
Let's see the distance
between them.
To get that distance, I go from x to its center p, then from the center p to q, and then from q to y. Now, let's pretend that this angle is 90 degrees. Also, this one is 90 degrees. It doesn't look like 90 degrees, but it is 90 degrees, right? Well, why is it almost 90 degrees? I want to claim that angle xpq and angle pqy are both about 90 degrees.
What theorem that we
proved may tell us that?
We proved a theorem to
that effect, right?
And the theorem we proved is sort of this picture.
And this picture says this
vector and this vector
are almost orthogonal,
and that's because, what?
This vector.
So what is the name of that theorem? That was the equator theorem, right? So, this vector. They're independent, so I might as well pick this vector first. Then when I'm picking this vector, that's a fixed direction; we make it the north pole. We know that this vector lies close to the equator, which means more or less orthogonal. We saw that the cosine of the angle is about one over root d, so the angle is the inverse cosine of one over root d.
So, this line p to q,
the centers are fixed.
Before we picked any points,
they're not random.
So, this is a fixed thing,
therefore this is
orthogonal to it, nearly.
And once I've fixed this and
that, this is orthogonal,
nearly, to both of them.
This is two equators: x defines an equator of the sphere, and pq defines another equator, and qy is nearly orthogonal to both of them. So it's a sum of squares: you get d for the square here, d for the square there, so 2d, plus delta squared.
Okay?
So let me work this out on the next slide. This is the same thing; that was a picture proof, this is gonna be a proof.
If x and y are two stochastically independent samples from the first Gaussian, then I wanna figure out what |x - y| squared is. I explained it, right? I get |x| squared + |y| squared - 2 x.y. |x| squared is d plus or minus order root d; this is the Gaussian Annulus Theorem, okay? |y| squared also, by the Gaussian Annulus Theorem, is d plus or minus order root d, meaning the deviation is at most order root d in absolute value.
Yes, add and subtract. 2 x.y is order root d; now, why is that? x and y are independent samples from the same Gaussian N(0, I). I want to know what x.y looks like, what the distribution of x.y is, okay? I fix x, right, and then y is random. So this is the sum over j of xj yj. x is fixed, y is random, each yj is normal(0,1), and again independent and so on. You add them up and you will see you get normal(0, |x| squared); |x| squared is about d, so normal(0, d), roughly. Again, it requires a little bit of proof to show all this. So x.y is about normal(0, d); d is the variance, so it's about order root d.
Again you have to pin down more
rigorously these calculations
but that is true.
If the two centers are delta apart and x and y come from the two different Gaussians respectively, then I get, this was the second picture in the last slide. I expand this; there are three terms here, okay? Again, the cross terms are very small, and they are not on the slide anyway, because they are very small. The main terms are just a sum of squares; they are orthogonal. So this square plus this square, d each, so I get 2d.
I think that also ran out of the slide. Okay, yeah, it ran out of the slide. So: same Gaussian is 2d plus or minus O(root d); different Gaussians is 2d + delta squared plus or minus O(root d), okay?
So this is how the test works: two points from the same Gaussian should be closer to each other than two points from different Gaussians. And that requires this to always be greater than that. The worst case of this is 2d plus delta squared minus [INAUDIBLE], and that must be greater than the best case of that, sorry, which is 2d plus root d. So then you need delta squared greater than root d, or delta greater than d to the one-fourth.
So, once the separation goes below d to the one-fourth, you cannot tell distance-wise; in fact, these calculations can be reversed, [INAUDIBLE].
Okay, now this is for one pair; we didn't do the calculations quite carefully. The question earlier was right: what do we do for all points? If you put in an extra root log n factor, because it's Gaussian, e to the minus x squared with x of order root log n, the failure probability for each distance goes down as one over n cubed or something, so it works for all of them. Okay?
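A small experiment matching this calculation, with arbitrary sizes: within-Gaussian squared distances concentrate near 2d and across-Gaussian ones near 2d plus delta squared, so distance-based separation needs delta somewhat larger than d to the one-fourth:

```python
# Compare squared distances within one Gaussian versus across two Gaussians
# whose means are delta apart, with delta a few times d^(1/4).
import numpy as np

rng = np.random.default_rng(6)
d = 10_000
delta = 3 * d ** 0.25                      # separation between the two means

mu1 = np.zeros(d)
mu2 = np.zeros(d); mu2[0] = delta

A = mu1 + rng.standard_normal((200, d))    # samples from the first Gaussian
B = mu2 + rng.standard_normal((200, d))    # samples from the second Gaussian

same = np.linalg.norm(A[:100] - A[100:], axis=1) ** 2
diff = np.linalg.norm(A[:100] - B[:100], axis=1) ** 2
print(same.mean(), diff.mean(), 2 * d, 2 * d + delta ** 2)
```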
So what is this? The moral of the story is that you need a very large separation between the means, more than a constant number of standard deviations, before you can tell the Gaussians apart with distances, even when there are only two of them. This can be generalized to any number.
A lot of work has been done,
I mean this is sort of
active research area.
A lot of work is being done on
learning Gaussian mixtures.
Looks like that was the last slide, so even though we were supposed to go to five, I could finish unless there are questions. So we could have some discussion questions if you like.
Otherwise, we meet on Thursday
next, right, number one.
Number two, this is going faster than I expected, right? So on Thursday, I'll already start the SVD chapter. And because it's going faster, we will hopefully do at the end massive data matrix computations using sampling. That's in a later chapter, I think chapter seven or eight. So, all right, massive data matrix computations; in that we will do SVD and all that, but we'll do it with sampling. So our order of business is: next will be the start of SVD; it will probably run for three lectures and then we'll branch off into other things.
Sorry.
>> [INAUDIBLE]
>> Yeah. Sorry, yes, it's one of the cross terms there. I think that's right; is it delta root d? Delta, I'm sorry, no, it's not delta root d, it's only delta. This has length delta, and this has length root d, sorry, right? The dot product is one over root d times the product of the lengths, for two random vectors, by the equator fact. So if it had been delta times root d, divided by root d that's delta.
>> So delta squared should be greater than delta as well as root d?
>> And delta squared greater than delta is a trivial inequality, so the real condition is that it's greater than root d. And so I guess I will write it out more carefully and put it in the slides, but yeah.
I think you need only delta greater than d to the one-fourth, provided d is large enough. Yeah, once you have the right projection you can do it.
So maybe you're saying information-theoretically; yeah, if you are not worried about an efficient algorithm, then means separated by a fixed constant number of standard deviations is enough. If you're worried about a polynomial algorithm, then it's not; it's still enough with SVD, but not with this distance-based approach. Yeah, so I guess that's a good distinction: information-theoretically it's enough that the separation is a few standard deviations, a large constant number of standard deviations.
The first thing is that if the standard deviations are different in different directions, I take the maximum of all standard deviations, over all Gaussians. If I have that much separation, I can do it, okay? That's 20 years old, maybe from the nineties actually. So, I think there was a paper where we proved things like that.
Then people worried more about what is called the pancake problem, which is very nice. So you have two Gaussians which are very long and skinny, like that. Their standard deviation in this direction is small, but in every other direction it is huge. And they are separated by 100 standard deviations in that direction, but by no means by 100 times the maximum standard deviation. Can you tell them apart? We know how to do it now for some cases; I am not sure we know how to do it for the pancake problem in general.
So what is standard is a separation between means which is omega of the maximum [INAUDIBLE]. That's okay. But ideally, you would like to use the standard deviation in the direction joining the centers; if I gave you 100 times that as the separation, can you do it? That's not easy.
So it's probably [INAUDIBLE] not true that if I do SVD I get this direction. I'm looking ahead here, because we haven't done SVD yet, but if we did SVD, we won't get this direction, right; it's actually the least singular direction rather than the most. The techniques are different, yeah. In fact, what you do here is throw away the least singular values and work your way up. So something different. It's somewhat PCA related, somewhat related to singular values and singular vectors, but not quite directly.
Okay, so if you want to know more about any of these things, you can ask me for references. Both random projections and Gaussian mixtures, and so on, are topics of current interest.
