Okay, so I sort of reversed
the order a little bit.
And I want to do something
that's probably more new to
most people, and then do
something more familiar later,
in the next lectures.
Later is only the next lecture,
right?
So, we want to show, I want to show, that you can find low-rank approximations to a matrix from just a length-squared sample, right? Just like we did matrix multiplication. We'll use matrix multiplication again; we keep using that. And we did compression, right? So just like that, we can do SVD, or low-rank approximation, as well.
So we saw that SVD is the way to
do the best approximation right?
Let's say what you get out
of SVD, you truncate it,
you get the best possible
approximation of every rank.
We won't shoot for the best
possible approximation, we can't
do that cuz we are only going to
sample; we'll shoot for a good approximation, good in a sense that we'll relate to the best.
In comparison to the best,
it's not too bad; that's
what we want to show.
The advantage will be that the running time will be much smaller.
You'll have to do, essentially
SVD on a small matrix and
from that you'll get everything.
So, that's the advantage.
By the way, one thing I did not say: likewise, sampling can be applied to infinite matrices as well. I didn't go into that much. So if you have infinitely many columns — of course you have to have a convergent sum of squares, otherwise you can't do it — then you can sample a finite set: an infinite number of columns, but finite-dimensional, and you can sample a finite set of them, and all of this goes through. We didn't see that, but it's another advantage of doing this as opposed to full SVD, because if the matrix is infinite, full SVD is impossible, right.
Okay, so — again, I'm just repeating this — SVD gives the best approximation of each rank.
What we will show is that, if I pick a length-squared sample of rows of A, call it R, and I do SVD just on that matrix that looks like that, it gives you a decent approximation to the SVD of A. Now, you have to be a little careful: the dimensions of A and R are different, right, so you have to relate the SVD of R to the SVD of A in a nice way; we'll see that, okay.
So, in fact, the central result here will not have anything to do with length-squared sampling at all. It's a purely matrix perturbation result, and we'll prove the following: I get a good low-rank approximation from R, provided R transpose R and A transpose A are close.
Now it's nice to recall the dimensions of those.
So A is, if A is n by d,
A transpose
A is d by d and so
is R transpose R.
R, only some number of
rows have been picked, so
it's little r by d.
But R transpose R Is
also d by d, right?
And maybe again, just a picture.
So if I have this as my A, the
rows, I sample only a few rows.
A transpose A takes the dot product of every pair of columns, right? And R transpose R just takes the dot products of those same columns restricted to the sampled rows. All right, I'm just drawing a picture of it.
So they have the same
dimensions, so
it makes sense to say that
they'd be roughly equal.
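As a minimal sketch of this picture — assuming the 1 over square root of (r times p_i) scaling we used earlier for matrix multiplication; the matrix and the sizes below are made up for illustration — something like this numpy code shows R transpose R tracking A transpose A:

import numpy as np

def length_squared_sample_rows(A, r, rng=np.random.default_rng(0)):
    # pick r rows with probability proportional to squared row length,
    # scaled so that E[R^T R] = A^T A
    q = (A ** 2).sum(axis=1)
    p = q / q.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    return A[idx] / np.sqrt(r * p[idx])[:, None]

A = np.random.default_rng(1).standard_normal((2000, 50))
R = length_squared_sample_rows(A, r=200)
rel_err = np.linalg.norm(A.T @ A - R.T @ R, 'fro') / np.linalg.norm(A.T @ A, 'fro')
print(rel_err)  # small relative error: R^T R is roughly A^T A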
What we will actually
show is the following.
Again the dimensions
are different so
the relation will be this.
If I restrict A as
a linear transformation
to the space spanned by the top
few right singular vectors of R,
that's a good enough
approximation to A.
Note that the right-hand dimension of R is d, the same as A's, right? Okay, this is what we'll show.
So the main result really
doesn't have to be length
squared sampling.
It just has to do
again with this.
In a way, I've perturbed A. You can think of it as perturbing A transpose A to get a slightly different matrix, R transpose R, except it's a lot simpler, because it's a product of two matrices which are smaller.
If the perturbation is not bad,
then this is true.
We'll prove that.
And, then this again is purely
a matrix perturbation result,
there's no probability
involved in that part.
But, as we saw, and as we'll see again — we'll recap — if I have
a length squared sampled R,
then A transpose A = R transpose
R, roughly, approximately.
And that's the familiar matrix
multiplication theorem.
So if I were multiplying the matrix A transpose with the matrix A, I might do that by picking some columns of A transpose, which are rows of A — that's my R transpose — and then picking the same rows of this A, which is R, okay? And the matrix multiplication theorem, in spirit — I mean, we'll quantify this — in spirit it told us these are roughly equal.
And this is going to say
if these are roughly equal,
then SVD of R will
give us SVD of A.
Now, let's recall: from R itself, I want the right singular vectors of R. The right-hand dimension of R is very high, right? Because R is like this: this dimension is still d, so this seems to be a d-dimensional problem. But observe that R is only little r by d, so I could take R R transpose. Okay, that's a small matrix, it's a little r by little r matrix; it's got only this many rows and this many columns. So I can find — I'm sorry, not the right singular vectors — the left singular vectors of R from this.
These are just eigenvectors
of R, R transpose.
We saw that the spectral
decomposition of R, R transpose
is related to Singular Value
Decomposition of R, so
the left singular vectors,
they can be found.
Then, you find the right
singular vectors by folding
them in.
We saw that, if I take a matrix A and multiply it on this side by a right singular vector, I get the left singular vector (scaled), and vice versa. So the essential computation I have to do, the one that answers the question, is to find the spectral decomposition of a very small matrix.
Okay, but then I have to do some
matrix vector multiplications,
those are simpler, right?
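In symbols — my notation, not necessarily the slide's — the computation just described is:

R R^\top = \sum_i \sigma_i^2 \, u_i u_i^\top \quad\Longrightarrow\quad u_i \ \text{are left singular vectors of } R, \qquad v_i = \tfrac{1}{\sigma_i} R^\top u_i \ (\sigma_i > 0) \ \text{are the right ones.}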
Okay, so this is the story.
So first I'm going to prove
the matrix perturbation result.
I'll do a self-contained proof that will involve some classical results which we'll go over. So, singular values and singular vectors; but here's another point that will be nice to make, an important point from linear algebra, right?
So this is also
something standard, but
I want to make this point, the
perturbation inequality we'll
use is called
the Hoffman-Wielandt inequality.
It says that if R transpose R is close to A transpose A in Frobenius norm, then the singular values of A and R are close.
We'll prove that, okay?
So I have made a small
perturbation you can think of,
of A transpose A to
get to R transpose R.
If the perturbation is small
enough the singular values don't
deviate too much.
Okay, that sounds very plausible, and indeed it's true.
So now you can ask, what about
singular vectors, right?
So we really want the low-rank approximation of A. And that would be possible, intuitively, if we get good approximations of the top singular vectors of A, right? If I get the top few singular vectors — maybe not exactly, but approximately — that should be good enough to get a good low-rank approximation of A. If I get them exactly, I get the best possible; if I get them approximately, I should get it approximately.
Now, if my inequality here, or any other inequality, actually guaranteed that the top singular vectors of R and A are close — so that if this is small, it would imply that R and A have very close singular vectors — then it would be very simple, right?
Then, that would say find
the singular vectors of R.
They are close to singular
vectors of A, and
they give me a good
low-rank approximation.
Unfortunately that's false,
this is false, right?
Even though perturbations don't
change singular values too much,
they can completely
change singular vectors.
Okay, singular vectors are singular — that's one reason to call them singular; singular values are not. So, multiple singular values are points of discontinuity for singular vectors. A multiple singular value — that's like a multiple eigenvalue: if there are many eigenvectors corresponding to the same eigenvalue, then it's called a multiple eigenvalue; it corresponds to a multiple root of the characteristic polynomial, right?
Similarly, if you have a singular value that has many singular vectors associated with it, then there's a discontinuity.
I'll elaborate but
I'll give you an example first.
But, before all that, please do recall from linear algebra that if you have ties among the eigenvalues, the corresponding bunch of eigenvectors forms an invariant subspace, and any of them could have been chosen. In linear algebra, right, I could choose any basis of that subspace as my eigenvectors, and that's the reason for the discontinuity, right?
So if you flip
things a little bit,
you might flip from one
basis to another, and so
that's a complete flip
of the eigenvector.
So, the simplest example is this, right: if I have the matrix with diagonal entries 1 plus epsilon and 1 minus epsilon, and zeros off the diagonal, then at epsilon equal to 0 there's a discontinuity of the singular vectors, not of the singular values — of the eigenvectors, but not of the eigenvalues; the eigenvalues are continuous through epsilon equals 0, right? It's the eigenvector that's discontinuous, because if epsilon is positive, the top eigenvector is the first unit vector, and if epsilon is negative, it's the second unit vector. So it flips completely.
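A tiny numpy check of this flip — the matrix and the epsilon values are just for illustration:

import numpy as np

def top_eigvec(eps):
    # diag(1 + eps, 1 - eps): eigenvalues vary continuously in eps,
    # but the top eigenvector jumps at eps = 0
    M = np.diag([1.0 + eps, 1.0 - eps])
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argmax(vals)]

print(top_eigvec(+1e-3))  # ~ [1, 0] (up to sign): the first unit vector
print(top_eigvec(-1e-3))  # ~ [0, 1] (up to sign): the second unit vector, a complete flip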
So the point to remember is that low-rank approximation, intuitively, is not discontinuous, right? Suppose I take the singular vector of a slightly worse singular value, like in this picture, where epsilon was positive, right? So the top eigenvector, or singular vector, is the first one. But the other one is a nearby eigenvector, with eigenvalue 1 minus epsilon instead of 1 plus epsilon. So if I took this vector instead of that, intuitively I shouldn't be very far off, right, in the approximation I get.
That is true.
I am badly off if I want exact eigenvectors — if I want A x equals lambda x, I'm not going to get it if I took this one instead of that, right?
But I'm okay if I want
just an approximation.
So that's what we'll use.
So presumably this
point is clear.
I'm just recapping, jogging your memory on linear algebra, about continuity of eigenvalues. Eigenvalues are continuous; eigenvectors are not.
Now, just for your information — I won't prove this — it is in fact true that at points where this discontinuity does not occur, singular vectors also behave nicely. That is, when there's a good gap in the singular values: the discontinuity occurred when two singular values were the same, okay? Preventing that means successive singular values are quite different. So if there's a gap between successive singular values, then it's also true that the singular vectors behave nicely, continuously, in a quantified way. So not only continuous, but quantitatively so, right? That's what this is saying: singular vectors of a perturbed matrix are close to the singular vectors of the unperturbed one.
These are the classical perturbation theorems for singular vectors — the so-called sin-theta theorems. If you want to look them up, you can do that. Yeah, again, I won't prove them; we don't need that, right?
So, as we saw, I don't need these theorems, because for low-rank approximation I'm okay even if I'm off by a little bit in the singular vectors.
Okay, so, just a technical fact,
this is just a definition of
the restriction of
a linear operator, or
this linear transformation,
to a space.
So if you have a subspace V spanned by this basis, and I want to take a linear transformation A and restrict it to V, I just say the restriction is A-bar, which is that. It's a restriction in the sense that A-bar x is the same as A x when x belongs to V, and 0 if x is orthogonal to V, right?
And both properties are quite clear. I mean, the zero case is completely clear: if x is orthogonal to the space, its dot products with all these basis vectors are zero. The first one is just a little manipulation, right? So we have A-bar x; now, x belongs to the space, so x can itself be written as a sum of alpha_t w_t. Then you expand out A-bar — you've done things like this before — times this quantity, with a t prime index: alpha_{t'} w_{t'}. And you can see that these terms are all zero unless t equals t prime. So those terms go away, only the terms with w_t remain, so you just get the sum over t of alpha_t A w_t, and that's equal to A x. Okay, so restriction just means: it does what A does on the subspace, and is zero on the orthogonal space.
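In matrix form — my phrasing of the same definition — if W is the matrix whose columns are the orthonormal basis w_1, ..., w_k of V, then

\bar{A} = A \, W W^\top,

since W W^\top is the orthogonal projection onto V; so \bar{A}x = Ax for x in V, and \bar{A}x = 0 for x orthogonal to V.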
Okay now the statement
of the main theorem.
So we'll state the main theorem
which is as I said purely
a matrix perturbation result and
then we'll prove it so.
Now, R is not necessarily a length-squared sample. It's just any matrix which is r by d — it need not even be a sub-matrix, right? R need have no relation to A. It applies to every pair of matrices R and A, as long as the number of columns is the same, right?
So if V1 through Vk are the top
k right singular vectors of r.
So I want to say if I've found
the right singular vectors of r
that gives me a good
approximation to A.
Okay, then the assertion, the conclusion, is: the restriction of A to the span of these is approximately A, in both Frobenius and spectral norm. And of course that won't be true unconditionally for all R; R can be arbitrary, but the error involves how different R is from A, in this sense.
And here is a precise statement,
two precise statements.
So, spectral and Frobenius norms. So let's go over these; the slide says that. So, I take A, and this, from the last slide, is the restriction — that thing is the restriction of A to the span of that. That was written in words, right? This is in symbols; that's what the last slide said, right?
Okay, in Frobenius norm, that's less than or equal to A minus A(k), Frobenius norm squared, plus an error term. Now, A(k) — I'm sorry, I used a slightly different notation last time, I said A sub k — this is also just the best rank-k approximation to A, which we saw last lecture is obtained by SVD, right? So that's the best you can do; you cannot do better than that, right? You'll be worse by a little bit, and the little bit is root k times the Frobenius norm of that, okay? Just to make sure the scales balance: I take the square of the Frobenius norm of A, and this term is also like a square, right, since I am taking A transpose A. So the dimensions balance, right.
Okay, so that's statement number 1. And also, in spectral norm, the first term is again the best possible, because SVD was the best you could do in spectral norm as well. And there, it does not have a root k; it just has the spectral norm there also, not the Frobenius norm. So both norms are spectral norms in that case.
Okay.
So again,
this is repeating what I said.
A- A(k) squared is the best
possible error with
the exact SVD.
We cannot beat that,
that's the best we can do.
Right, and this second term is the error; it's a little bit more. Okay, and we get that.
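As best I can reconstruct the slide from what was said — the constant factors here are my reading, so treat them as approximate — the two statements have the form

\|A - A \textstyle\sum_{t=1}^{k} v_t v_t^\top\|_F^2 \;\le\; \|A - A_k\|_F^2 + 2\sqrt{k}\,\|A^\top A - R^\top R\|_F,

\|A - A \textstyle\sum_{t=1}^{k} v_t v_t^\top\|_2^2 \;\le\; \|A - A_k\|_2^2 + 2\,\|A^\top A - R^\top R\|_2,

where v_1, ..., v_k are the top k right singular vectors of R and A_k is the best rank-k approximation to A.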
Now, one note I should perhaps have made — I should have put it on the slide, maybe, but I can tell you. There are more recent theorems, also based on sampling, which can actually give you a better right-hand side. These are called relative error approximations — relative to A minus A(k), right?
If you were going to use
principal component analysis,
for instance, this would
be hopefully very small.
So you would use this for
matrices which are very
close to being rank k.
Okay so
if this is very small, this
may be actually much bigger.
So this error term is pretty
big if you did length
squared sampling.
There are more complicated
sampling schemes, for
which you can actually
show these things which
are called relative error bounds
because the error is relative.
So I have a survey paper in which I also cover these things, but we'll only prove this, right? We'll only prove the version with this additive error. I can give you references for the rest. So again, this is how far you're worse off than the best you can do.
Okay, now one thing that's instructive, before we leave the statement of the theorem and go to the proof: the theorem is also true for any r.
It's true in particular for
r equals 0 and n.
It'll be nice to see
what they say to us.
What does r equal 0 say? That means I pick no rows in R — either probabilistically or some other way, I pick no rows. So R transpose R is 0, right. So this term is 0. And there are no v's, so this is also 0. So you get A Frobenius squared, right?
And I should have said — excuse me, I'm sorry — this is true for all k. So let me back off for a minute.
So if R is 0,
then there's no singular vector.
All of these are 0, so
you get A F squared,
plus a little bit more.
This is positive, right?
So that's okay.
It's not saying very much.
You don't expect it to say that.
What about r = n? That's also an interesting case — I mean r equal to the total number of rows of A, so all the rows.
What's a possible R then?
A possible R is all of A,
so if R is all of A,
this is 0, and this is 0, right?
So still, we are just saying 0 is less than or equal to that. I'm just saying these things as a sanity check: for r equal to 0 and r equal to n the statement is still true, but it isn't saying anything there; it's for the other values of r that you expect it to say something.
The proof will use the Hoffman-Wielandt inequality, which we'll prove first. As I was saying, I'm going to spend a little time proving that. There are many proofs of it, and this, I think, is one of the simplest.
We'll have to do a little bit of preliminary work here to prove that, and then I'll go back to proving this. Okay, so the moral of the story: for any R you get a decent approximation, in Frobenius and spectral norm.
Okay so for this I need
some technical results for
the Hoffman-Wielandt inequality.
So you have two sets of real
numbers, a1 to an, b1 to bn.
These are going to be singular
values over two matrices,
but later.
Right now,
they're just real numbers.
And the as are in
decreasing order.
Then the assertion is about the permutation that minimizes the sum of squared differences between a_i and b at the permuted spot: I'm allowed to permute the b's in any way, and I want to minimize this. So intuitively, it's obvious that the best I can do to minimize it is to also bring the b's into decreasing order.
The as are in decreasing order,
if bs are also in decreasing
order, you'd expect
that to minimize.
All right, so the proofs of these are usually really simple, and here is the proof. For such a statement: if the b's were not in decreasing order, say the i-th and the (i+1)-st are out of order — if they're not in decreasing order, there must be an adjacent pair in the wrong order, right; if b was not sorted, there's an adjacent pair that's not sorted — then if you just swap those two, you don't increase the sum, and that gives the proof. So you can check that; it's just a simple check.
Now we have to use another theorem, called Birkhoff's theorem. It says every doubly stochastic matrix can be written as a convex combination of permutation matrices. Did I define that? No, I didn't define those terms, so maybe I should just recall them. So, doubly stochastic means non-negative entries, and all row and column sums are one — all of them. A matrix with this property is called doubly stochastic. And square also, I should say; that doesn't follow from the rest.
And then the term permutation matrix just means a 0-1 matrix coming from a permutation, right? So a_{i, f(i)} = 1 for some permutation f — no, I should say it this way: there exists a permutation f such that this entry is 1 for every i, and all the other entries are equal to 0. It's the matrix of some permutation; there is one for each permutation, so there are a lot of them.
So Birkhoff's theorem says every
doubly stochastic matrix can be
written as a convex combination
of permutation matrices.
I won't prove that, but
I just assume that.
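A tiny illustrative example of Birkhoff's theorem — mine, not from the slide:

\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix} \;=\; 0.7 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} + 0.3 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.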
Okay, so when the a_i and b_i are now both in decreasing order, consider, according to this, the minimum value of this quantity — sum of (a_i minus b_j) squared times x_{ij} — which I'm going to minimize over all x's that form a doubly stochastic matrix, right? It's attained at a permutation matrix; first, why is that? Why is the minimum attained at a permutation matrix? Because of the convex combination, right? If the minimum is attained at some doubly stochastic matrix, you can always write that matrix as a convex combination of permutation matrices, and this is a linear function of the x_{ij}, so the minimum must also be attained at one of those permutation matrices.
And secondly, since the identity permutation is the minimizer over all permutations, the minimum is attained at the identity, right? So we saw in the first part that the permutation that minimizes is the identity, and this says that over all doubly stochastic matrices, the minimum is attained at a permutation matrix, which must therefore be the identity.
Okay, we'll come back to this; we'll use it in the next slide, but I may have to come back and recall it for you. So now — I see, okay, I have it all displayed on one slide. This is the proof of Hoffman-Wielandt, which we'll go through now.
So you have two real symmetric matrices, with lambdas as eigenvalues. I've arranged the lambdas of both matrices in decreasing, non-increasing order. Then the assertion is that the lambdas are close, right, provided P and Q are close. If P and Q are close, the lambdas are close; the eigenvalues are close. Again, this is continuity of eigenvalues; the eigenvectors are not continuous. So you take the sum of squares of all these differences, and that's less than or equal to the Frobenius norm squared of the difference between P and Q.
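Written out, the statement is: for symmetric P and Q with eigenvalues listed in non-increasing order,

\sum_{i=1}^{n} \bigl(\lambda_i(P) - \lambda_i(Q)\bigr)^2 \;\le\; \|P - Q\|_F^2.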
Okay, so now let's prove that. While going over this we'll recap some linear algebra as we go along.
So I'm gonna write P as U D U transpose, and Q as V D prime V transpose; these are spectral decompositions. And just to recall: the columns of U are eigenvectors of P. P and Q are symmetric, right? So they have spectral decompositions like this, and the columns of V are eigenvectors of Q. We've been using a slightly different notation, but D is diagonal, with the eigenvalues on the diagonal, and D prime likewise. So U D U transpose, if you think about it, is the same as saying: sum over i of lambda_i of P, times column i of U — I wrote it this way — times row i of U transpose. Okay, I'm just expanding the matrix product. This is just the matrix notation which we have used before, though this form is more standard than the one we've been using, right? So this is just the outer product of columns with rows, summed up. It's worth checking that these are the same, and that the diagonal matrix slots in like that.
Okay, now what we know is that the columns of U and of V are orthogonal — orthonormal, in fact. And in matrix notation that just says U transpose U is the identity, and V transpose V is also the identity. Actually it's true the other way around as well, because if you have a full square matrix whose n columns in n-space are orthonormal, then the rows are also orthonormal. That's not true if there are fewer than n columns, but if it's n, it's true. So in fact U U transpose is also the identity, equal to V V transpose, we could write, okay?
A bunch of linear algebra we'll recap, and we'll use it. Okay, so these are spectral decompositions with eigenvalues arranged in non-increasing order, in both cases, and now I'm going to do something that looks like it's pulled out of a hat. I'm gonna take P minus Q, multiplied by U transpose and V on the two sides. I'm just expanding this out — U transpose U D U transpose V, and so on and so forth — and I get that. Now, U transpose U is the identity, right? It's one of these facts. So that goes away; you get D times U transpose V. Again, V transpose V is the identity, that goes away, and you get U transpose V times D prime. Now, U and V are orthonormal, so they are length preserving; they don't change the norm. The Frobenius norm is invariant under these transformations. So if I take the Frobenius norm squared of the whole thing here, that's the same as the Frobenius norm squared of P minus Q, and that's the same as that Frobenius norm squared. And below is the expanded version of this.
Now, just a quick check that that is correct. That's just algebra, but let's check it. So, I want to know what the i, j entry of this is, right? Well, the i, j entry of the first term is a sum over i1 of D at i, i1, times U transpose at i1, i2, times V at i2, j — I'm just writing the indices down — summed over i1 and i2. Now, i1 had better be equal to i, otherwise the D entry is 0, so that sum goes away; it's just i, okay? And so you get d_{ii}, which I just call d_i. And then this is just the i, j entry of U transpose V, because the sum over i2 folds in there, and I get that. So that's the algebra done for the first term, right? And the second term is similar — the same U transpose V, with d prime on the other side. Anyway, so you get that. Now, U transpose V is also orthonormal, because U and V are; you have to check that, another little linear algebra fact, maybe I won't check it here. And so its rows and columns have length one, which means the sums of squared entries along each row and each column are one. So if I take the matrix of squared entries of U transpose V, that's doubly stochastic, and then I use the theorem from the last slide, right? So the d_i minus d_j prime squared terms — these eigenvalue differences squared — that's what I want there, right? Those are weighted by a doubly stochastic matrix, and the minimum is attained when I put the identity there; that's what I showed on the last slide. So I put the identity: the cross terms with i not equal to j go away, you just take i equal to j for both d and d prime, and you get that. You do have to check all the intermediate steps. So this is at least that quantity, right? And I started from that side.
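The whole chain just described, in one line — my transcription of the slide:

\|P - Q\|_F^2 = \|U^\top (P - Q) V\|_F^2 = \|D\,U^\top V - U^\top V\,D'\|_F^2 = \sum_{i,j} (d_i - d'_j)^2 \bigl((U^\top V)_{ij}\bigr)^2 \;\ge\; \sum_i (d_i - d'_i)^2,

where the last step uses that the matrix with entries ((U^\top V)_{ij})^2 is doubly stochastic, and that the minimum over such weightings is attained at the identity permutation.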
Okay, so that's done now; let's go on to our theorem.
So, v1 to vk. Now we have to do a little bit of algebraic manipulation, unfortunately, to get to the theorem, but it's not too bad, I hope. So we take v1 through vk. The theorem said I'm going to restrict A to the space spanned by the top k right singular vectors of R, so these are the top k singular vectors, and I'm going to make a full basis out of them, right? You can always complete an orthonormal set to a full orthonormal basis. So, I would like to prove this is small.
So, this is A minus the restriction of A to this subspace. I'd like to prove this is small, and I'm going to pull something out of a hat: I'm going to pull this term out of the hat and subtract it. So again, unfortunately, this will have to go through some manipulation. In fact, I wanted to prove this is small with respect to that, right? So maybe I should go back to the theorem for a minute. Yeah, so I wanted to prove this is small in comparison to that; I've just brought this term over to this side. So I have taken this difference, and I want to show it's small. So I have that, and I'm gonna expand it.
Okay, so there are a few more lines of proof. This is A, and this is A restricted to a particular subspace, and I take the Frobenius norm of the difference, A minus that; you have to prove this identity, right? But there's a simple proof of it, just by completing the basis. Okay, I should write it out: A Frobenius squared is just equal to the sum over t of A v_t squared. The Frobenius norm squared is the sum of the squared row lengths, and you can write the matrix in any orthonormal basis — these norms are invariant, whatever the basis — so I choose the v_t as the basis, and you get this. And I've subtracted out the first k terms, so you get only t equals k plus 1 onward for what's left, right? So, that minus that. You also have to prove that this is equal to that — again, part of the proof I won't do. These two cancel out and you get that minus that.
Okay.
So that's what I have to prove is small; a few more tricks. So A_k is the best rank-k approximation — that came from SVD, so that's just the SVD theorem we proved, that SVD gives the best approximation — so you get that. What about this? You can see that it's just equal to the sum of A v_t squared, right? None of this is true if the basis is not orthonormal; everything uses these facts. So, that's true. So now I want to rewrite it; this is the same as what I had before.
Here's another trick. I'm going to take A v_t squared. First, this step is not a trick, it's just simple: that's just equal to v_t transpose A transpose A v_t, right? Because the length squared of a vector is equal to its dot product with itself — again, that's not a trick. But then I'm going to add R transpose R and subtract it, okay? That's fine, I'm adding and subtracting; I've done this several times, I did it last time too. And now, what do we get? Now this piece is R v_t squared; the v_t are the singular vectors of R, so those are just the squared singular values of R. And then there's the difference term. So I'm going to use this difference and bound things, together with Hoffman-Wielandt — I'm coming to that — but that's this term and this term. I still have to deal with this other term; I haven't dealt with it yet.
So now this is less than or equal to k times that, because for each vector v_t, this quantity is at most the spectral norm of that matrix, okay? Now, we defined the spectral norm slightly differently: we defined the spectral norm of a matrix as the max over all unit vectors v of the length of the matrix times v. But it's also true that for a symmetric matrix — and A transpose A minus R transpose R is symmetric — the spectral norm is the max over all unit v of the absolute value of v transpose times the matrix times v. Okay, these identities give me a k, but I want to get a root k. If you're happy with k, you're done; I'm going to do a little more work, one more slide of work, to get the root k.
So for this, we take that. The reason I'm doing this proof is partly that the algebra is interesting, but hopefully it'll also jog your memory on the linear algebra that you might have forgotten. So for this we use Cauchy-Schwarz to assert that that is less than or equal to that: it's just a sum of k real numbers — these are real numbers — so it's at most the square root of the sum of their squares, with a root k up front.
Now, what is this? I claim that that's at most the Frobenius norm. Why is that? The Frobenius norm is invariant under change of basis, and in the v_t basis these entries are the diagonal entries of that matrix, A transpose A minus R transpose R. So I'm taking the sum of squares of the diagonal entries in that basis, which is at most the Frobenius norm squared; I'm throwing something away, because the off-diagonal entries are not being counted. So you get that.
Okay, there is still something to bound, and for that we use Hoffman-Wielandt. So, by the way, Hoffman-Wielandt as stated is only for symmetric matrices, so I have to do some of these machinations to get to the proper symmetric matrix. So we'll use another Cauchy-Schwarz, and then, plugging all this in, you get the first assertion, the Frobenius norm assertion. It turns out the spectral norm proof is not much harder in this case. Last time it was harder, but today it's not so bad.
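To summarize the Frobenius-norm argument just sketched — this is my condensation of what was said, so check the constants against the slides:

\|A - A\textstyle\sum_{t \le k} v_t v_t^\top\|_F^2 = \|A\|_F^2 - \sum_{t \le k} \|A v_t\|^2, \qquad \|A v_t\|^2 = v_t^\top (A^\top A - R^\top R)\, v_t + \sigma_t(R)^2,

and \|A - A_k\|_F^2 = \|A\|_F^2 - \sum_{t \le k} \sigma_t(A)^2; so the extra error comes from \sum_{t \le k} (\sigma_t(A)^2 - \sigma_t(R)^2) and from the perturbation terms, and Hoffman-Wielandt plus Cauchy-Schwarz bound each of those two pieces by \sqrt{k}\,\|A^\top A - R^\top R\|_F.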
Okay, bounding the spectral
norm. So we have only a couple
more things, bounding a spectral
norm, for the second assertion,
that's the spectral norm
assertion, that's all.
The top singular vector u of this must be orthogonal to the span of v1 through vk. Because I subtracted out the action of v1 through vk, if u had a component along this span, that component would just map to 0, right? So that's not interesting; you might as well assume u is orthogonal to this. So this product goes away and you're just left with A u.
So again we do have
to do some work.
So this is what I
wanted to bound.
So again, this is A,
this is a restriction
of A to that subspace.
I want to say
the difference is small and
since the top singular
vector is u, so I get that.
I get that, now this trick
again, I have A transpose A,
I'm going to add and
subtract R transpose R.
It's the same trick, because I know how to bound this term; that's this. And for this quantity: u was orthogonal to the top k singular vectors of R — these are the exact singular vectors of R, the top k — so this is at most sigma k+1 of R squared. We saw that before: u lies in the span of the (k+1)-st and later singular vectors, and therefore you get at most that, okay?
So we're not done yet. That's the same thing written like that. Now I have that, and I'm adding and subtracting sigma k+1 squared of A. We know what this piece is; how about this one? I'm going to say that's also at most this, okay, and that is using Weyl's inequality. So let's recall that.
There's a short proof of this; it's one of the many inequalities in linear algebra. It says: for a symmetric matrix, like our R transpose R is, take the (k+1)-st — or any — eigenvalue, with the eigenvalues in decreasing order. That is at most — I've written it in a particular way, so let's write it as an inequality here — lambda k+1 of R transpose R, for instance, is less than or equal to lambda k+1 of a different matrix, A transpose A, plus what I've written there. I'd like to say that last term is the top eigenvalue of the difference, but really it's the top eigenvalue in absolute value, right — the spectral norm. So that is one side, and there's a symmetric inequality the other way.
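Written as an inequality, the form being used is

\lambda_{k+1}(R^\top R) \;\le\; \lambda_{k+1}(A^\top A) + \|R^\top R - A^\top A\|_2,

and symmetrically with the roles of A^\top A and R^\top R exchanged; here \lambda_{k+1}(\cdot) denotes the (k+1)-st largest eigenvalue of a symmetric matrix.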
How do you prove something like this? Well, we have to use the min-max theorems, which I just realized I didn't cover, so we won't prove it. You can look up a proof in any standard book, and you'll find this inequality. I do have to have symmetric matrices, right? Otherwise I'd have to be a little more careful. But here we do.
And that is telling us that this difference — that's this minus that, right? — the difference between these two is bounded by this, at most. So that's where you get a factor of 2 times that, plus sigma k+1 of A squared.
Okay, so that's the end
of that story of proving
the main theorem, okay?
And by length-squared sampling we can get such an R — or we can get it by any other means, right? — as long as this is small. For which it's enough to do SVD on that small matrix, R R transpose.
And now I want to
take a step back and
figure out what we have
done in terms of time.
How much we have saved, right?
Okay, so this will also
recap the overall proof and
sort of the algorithm behind it.
So, what we needed was
the right singular
vectors of a smaller matrix R.
And for that we can do SVD — I remarked on this first in the beginning, but I want to recap it, because now we have an algorithm and we'll analyze the running time. We can do SVD of R R transpose, which is a small matrix, an r by r matrix. And that gives us the left singular vectors of R: the eigenvectors of this are the left singular vectors of R, because what goes on the left of R R transpose — or on the right of R transpose, which is the same thing — goes with the left of R; and from these we get the right singular vectors by just one more multiplication by R. And I'm going to figure out how much time all this takes, right; that's the story. So this last step is just the cost of multiplying the left singular vectors by R to get the right singular vectors.
Algorithm. So we pick r rows — I'm going to give you a length-squared sampling version, but you could do it by other means again. Pick r rows of A by length-squared sampling, and scale them so that, for the resulting matrix, the expected value of R transpose R equals A transpose A. Remember, this goes back to the beginning, when we did length-squared sampling: we said that to make it unbiased we have to scale. So you pick these rows, right? And to make the estimator unbiased you had to scale them — scale according to the probability with which you picked each one, okay?
So unbiased just means the
expected value is right, okay?
We already did this
in the beginning.
Now find R R transpose. So I picked R with expected value of R transpose R equal to A transpose A, but I compute the other thing, R R transpose. Find the left singular vectors by spectral decomposition. Pre-multiply R by the left singular vectors to get the right singular vectors of R. Okay, all of that is routine. And then you return A times this as the implicit LRA — meaning I don't really write out explicitly what the low-rank approximation is, and that is useful sometimes; I'll try to show you one instance where it's useful. You keep A times the sum of v_t v_t transpose implicitly, or, if you want an explicit one, multiply it out and return it.
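Here is a compact numpy sketch of the whole pipeline as I understand it — the function name, the test matrix, and all the sizes are made up for illustration, and the scaling is the usual length-squared one:

import numpy as np

def sampled_low_rank(A, r, k, rng=np.random.default_rng(0)):
    # 1. length-squared sampling of r rows, scaled so that E[R^T R] = A^T A
    q = (A ** 2).sum(axis=1)
    p = q / q.sum()
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    R = A[idx] / np.sqrt(r * p[idx])[:, None]
    # 2. spectral decomposition of the small r x r matrix R R^T
    vals, U = np.linalg.eigh(R @ R.T)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    sigmas = np.sqrt(np.maximum(vals[order], 0.0))
    U = U[:, order]                              # top-k left singular vectors of R
    # 3. right singular vectors of R: v_i = R^T u_i / sigma_i
    V = (R.T @ U) / sigmas[None, :]
    return V                                     # A @ V @ V.T is the (implicit) LRA

rng = np.random.default_rng(1)
A = rng.standard_normal((3000, 120)) @ rng.standard_normal((120, 80)) @ rng.standard_normal((80, 120))
V = sampled_low_rank(A, r=300, k=10)
print(np.linalg.norm(A - A @ V @ V.T, 'fro') / np.linalg.norm(A, 'fro'))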
And the time required — okay, I should mention one thing first. We didn't do many algorithms here, but lots of big matrices are sparse; they have very few non-zeros. And then you don't write down all the entries of the matrix; you only write down a linked list of the non-zeros. That's a general trick — maybe I should write that down. So, a sparse matrix: it has few non-zeros. One example is the hypertext link matrix of the web. It's a matrix with billions of rows and columns, right? But each row has only about ten non-zeros.
So you write it down by having
basically a linked list of
non zeros.
A linked list of nonzero entries just tells you where the next nonzero entry is found. So you don't write down the zeros; you basically use space proportional to the number of non-zeros, which is order — here's a piece of notation that's now become standard — nnz(A), the number of non-zeros of A. So in fact, I didn't mention many algorithms, but there's been some effort, and people have now made many of these algorithms depend fairly benignly — linearly, quite often — on the number of non-zeros only.
So, in the case of the web
matrix, right, if you had an n squared algorithm, that would be pretty bad.
But there aren't n squared
entries to be read.
So the argument is that this can
all be done in time proportional
to the number of non-zeros,
provided the R and
all are constant.
By the way, this is an advantage of sampling over these other methods: standard numerical methods, unless you're careful, will fill in non-zeros — the matrices become dense after a few iterations — and then you're in trouble.
So these don't happen
with these kinds of
sampling based algorithms,
just because you work only
on a sample of the matrix.
So if p is the maximum number of non-zero entries in a row — in this case, p is 10; now, it's not really true that the maximum is 10, but pretend that the average, which would be about 10, is what matters. So the first step can be done in two passes over the matrix from external memory: we saw how to sample by going through the matrix twice from external memory. You compute the probabilities in one pass, and in the next pass you pick the rows out. Then you find R R transpose. And R R transpose is r rows dotted with r rows, so r squared multiplications, but each takes only p time — right, there are only p non-zeros, and the dot product of two vectors with p non-zeros each only takes about p time. The spectral decomposition can be done in r cubed time, or better. And r is to be thought of as a constant, right. Then, multiplying the k left singular vectors of R back through R: the right singular vectors can be found in time k times r times p.
Okay, so you do take
time dependent on n for
reading through
the entire matrix.
But after that, the times don't depend on n, right?
Unless you want to do
this multiplication,
then you have to go through
the whole matrix once more.
Okay, so
roughly the time is linear
in the number of non zeros.
And that's typical, we didn't
do this kind of analysis before,
but it's typical of
these sampling methods.
Okay, now I want to do
one more slide, and
then I'll tell you something
which I didn't put on the slide.
So, I mentioned in passing earlier on, right, that SVD only finds the best-fit line through the origin, right? Subspaces always pass through the origin; it only finds subspaces through the origin, okay?
Now you could have data which
is nowhere near the origin.
And then,
this is a big restriction.
So here is the origin, and
all your data could be here, and
the best fit line might
be like this, but
you might end up finding this,
right?
So you don't want to do that.
How do you avoid doing that?
So, first, the definition: an affine space of dimension k is just a translation of a k-dimensional vector space. Translation just means: this is the vector space V, plus x; the translation is just the set of u plus x where u belongs to V, okay. I mean, it's just moving it over like that. If this is the subspace, it just moves over to that space. It's also true that other x's would do just as well, right? You can translate by this vector or by that vector, and you get the same thing; all that matters is what the perpendicular component is. Okay, that's just the definition; all right, let's call it an affine space.
So what I want to prove is fairly simple: the best-fit affine space, of any dimension, passes through the centroid of the data. The centroid of the data.
We have data points
A1 through AN,
centroid C is just
what you think it is.
These are vectors.
You just average, right?
Average is going to mean
component-wise average,
just add them up as vectors and
divide them by the total number.
So the proof is that if it's not the case — if you had a best-fit affine space that didn't pass through the centroid — you get a contradiction: you would gain by translating it to contain the centroid. So first let's prove this. Suppose you have real numbers a1 through an, and I want to choose an x to minimize the sum of squared distances to the a_i. It's not difficult to see that the centroid is the right choice; I mean, this also comes up in defining the variance about the mean, and so on, right?
So, proof: okay, let's quickly go through the proof, because this is a typical proof that's used in many places. So I'm gonna expand a_i minus x, the whole thing squared, writing it as a_i minus c, plus c minus x — I'm sorry, that's c, not a; c minus x, okay, that should be right. There's a cross term that I'll mention because it's zero, right? So expanding the whole square gives three terms. I'll also get a cross term: sum over i of twice, a_i minus c, times c minus x. This is 0, because c minus x comes out — it's a constant, independent of the a_i — and the 2 also comes out, and the sum over i from 1 to n of a_i minus c is 0. So that goes away. So you get that, and that proves it for the case of real numbers, because the remaining term involving x is a non-negative quantity and the best value you can have for it is zero, and zero means that x equals c. Okay, that's the lowest value, x equals c, and that's what we wanted to prove.
The same thing is true for vectors, and that's the case I actually want. So take a_i minus x; I want to minimize the sum of squared distances — I'll come back to how you apply it to SVD. So I write it as a_i minus c, plus c minus x, and I also write out the cross term here. The cross term is a_i minus c, dot product with c minus x — right, that's a dot product. I'm just using the fact that the length squared of a vector is its dot product with itself, and here it's a sum of two vectors; if you expand it, that's what you get. And that cross term is 0: again, c minus x comes out, even though it's a vector, and the sum of the a_i minus c is 0, because the average of the a_i equals c, the centroid, okay? Just like it was for the real numbers.
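The identity being used, in both the scalar and the vector case, is

\sum_{i=1}^{n} \|a_i - x\|^2 \;=\; \sum_{i=1}^{n} \|a_i - c\|^2 + n\,\|c - x\|^2, \qquad c = \frac{1}{n}\sum_{i=1}^{n} a_i,

since the cross term 2\sum_i (a_i - c)\cdot(c - x) vanishes; so the minimum over x is at x = c.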
Okay, so how is this used for
the SVD?
So, if you want to find the best line that fits the data, rather than the best one through the origin, you first find the centroid.
Okay, that's simple
enough to do.
Add all the data points and
divide.
Take the average, and
then subtract that
from each data point.
Subtract the centroid
from each data point.
Now, the centroid
becomes the origin, so
there's no restriction in
assuming that your sub-space has
to pass through the origin,
okay?
Since the best affine space passes through the centroid anyway.
So for a bunch of applications,
you do that.
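As a small sketch of that recipe — the data points here are invented, and best_fit_affine_line is just an illustrative name:

import numpy as np

def best_fit_affine_line(points):
    # shift the centroid to the origin, then take the top right singular vector
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c, full_matrices=False)
    return c, Vt[0]          # the line is { c + t * Vt[0] }

pts = np.array([[10.0, 20.1], [11.0, 21.9], [12.0, 24.1], [13.0, 25.8]])
c, d = best_fit_affine_line(pts)
print(c, d)  # a point on the best-fit line (far from the origin) and its direction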
Okay, so I think that's
all I have on the slides.
I take a few more minutes
to say two more things.
So, so far what we saw was that we had a matrix A, we do row sampling to get R, and we reduced, by this process, finding the SVD of A to basically finding the SVD of R. We could further reduce it to R R transpose, but this is in spirit what we did.
So you can ask now: I can take R, do column sampling on R, and reduce it to a really small matrix. That is, I can repeat: I do column sampling on R to get an r by r matrix W.
Right?
I mean, it's just a symmetric
version of that problem,
except columns and rows.
So this is true, but the proof is much more complicated, and the trouble happens because things go wrong when singular values are close to zero. Somewhere in this process, we have to worry about singular values that are close to zero, so this is true but complicated — actually, I would say much more complicated; it's not simple at all.
But, the moral of this story is,
by doing the sampling twice,
you finally are left with
a constant size matrix,
both rows and
columns constant in number, and
that's all that you
need to worry about.
I mean, that was also true when we did R R transpose — that's a small matrix — but we had to compute R R transpose for that. Here we don't have to do any computation on big matrices, except to sample; that's what this buys you.
Now, in practice, however, I'm not sure this has been used as much; row sampling has been used quite a bit, but sampling on both sides is, I think, still too expensive — although I'm not current on the state of the art. Okay, that's one thing I wanted to say.
The other thing also I
should've put up on the slides,
I'll just say that.
So, our error bounds for length-squared sampling all had, or generally had, epsilon times the Frobenius norm squared on the right-hand side — not every one of them; the one today has a spectral norm in it — but this is what we generally have on the right-hand side, right? And I told you that I cannot replace this by the spectral norm. That's correct, but what is true is that it can be replaced by epsilon times the Frobenius norm times the spectral norm. So one of the Frobenius norms can be replaced by a spectral norm. Now, this looks like just a small improvement; it's not.
This is a theorem due to Rudelson, which came about maybe five or six years after the original paper — after our paper — and it uses a nice bit of functional analysis; it's not simple.
There's a slightly simpler proof of this, but it's still not simple. Now, this turns out to lead to a bunch of things.
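As I understand the claim — the exact dependence on the number of samples isn't stated here, so take this as the shape of the bound rather than its precise form — the improvement is

\|A A^\top - \widetilde{A A^\top}\|_2 \;\le\; \varepsilon\, \|A\|_F\, \|A\|_2,

where \widetilde{A A^\top} is the sampled estimate of the product, in place of the earlier right-hand side \varepsilon\, \|A\|_F^2; one Frobenius factor becomes a spectral norm, but one Frobenius factor must remain.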
Probably the most important thing is graph sparsification. And I'll tell you the result, but not prove anything. This is due to Spielman and co-authors. So, you need the fact that one of the Frobenius norms has become a spectral norm to get graph sparsification.
Graph sparsification says the following. I give you any graph G = (V, E), on n vertices. From that I can produce a graph H, which is a subgraph of G — I'll write it like this — with only about n log n edges. And it has the property that every cut in H is roughly the same as the corresponding cut in G. So for any cut (S, V minus S), the weight of the edges across the cut in H — across S, V minus S, in H — is within a factor of 1 plus or minus epsilon of the same thing in G.
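In symbols, with w_G and w_H denoting the total edge weight across a cut — the epsilon-dependence of the edge count is my addition from memory:

\forall\, S \subseteq V: \quad (1 - \varepsilon)\, w_G(S, V \setminus S) \;\le\; w_H(S, V \setminus S) \;\le\; (1 + \varepsilon)\, w_G(S, V \setminus S),

where H is a reweighted subgraph of G on the same n vertices with only O(n \log n / \varepsilon^2) edges.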
So it's quite a striking result. What it's saying is that I can produce a much smaller graph such that — and there are exponentially many possible ways of cutting the graph into two parts, right, any subset of vertices — H, without fail, for every cut, has about the same weight across it as G does. This is completely non-trivial and it requires quite a lot of work, but central to it is Rudelson's theorem.
Rudelson's theorem is also about length-squared sampling — so this replacement can be had with just length-squared sampling. This I won't be able to prove, but hopefully I've stated the result well enough for you to see. What is interesting, of course, is that this purely combinatorial result follows from this matrix kind of stuff, right? Not terribly surprising — a lot of graph problems are related to matrices — but it does follow from that.
The bound I showed today also has the spectral norm on the right. But the matrix multiplication bound, for instance, had only the Frobenius norm; that's the one that can be replaced. Okay, so maybe I should say it more specifically: we wanted to multiply two matrices, A and B — A and A transpose, let's say. So, A A transpose minus our estimate: even if I wanted to bound the two-norm of that, the right-hand side we had was only the Frobenius norm. What Rudelson proves — the earlier bound was not precise enough for this — is that the spectral norm of the error of the product is bounded by epsilon times one Frobenius norm times one spectral norm; you still can't get rid of that one Frobenius factor. It cannot be purely spectral.
The centroid result is for Frobenius — yeah, you're right, the centroid result is not for the spectral norm. Actually there are some interesting open questions there, but no, it's only for Frobenius so far.
If the rows of A were picked IID from a distribution, right, then you would probably want to center them by subtracting the mean. The average of the rows is a good estimate of the mean, because you've got enough samples. Then, once you've subtracted it out, as you said, A transpose A would be the variance-covariance matrix.
So I think I mentioned this once: the original motivation for all these sampling results, basically, came from trying to estimate the variance-covariance matrix of a distribution. So you would subtract out the center. There are cases where you don't want to center the data — there are some PCA examples where you don't want to center — but certainly for IID picks you likely want to center the data.
Yeah, sometimes you want to keep the data all non-negative; that's another example. Non-negative matrices come up, and then you don't want to center, since that would introduce negative entries.
The theory only gives you bounds on r that are very large, right? And the question is, can we get lucky with a smaller r? I think there are empirical studies, but I don't know. Theoretically, if you want an epsilon of 0.01, there is an r which is one over epsilon to the fourth power or something, so that would be 10 to the 8; you don't necessarily want that — you might as well keep the whole matrix. But you hope that empirically a smaller r will work.
