Okay, we'll get started.
So today we'll continue with Singular Value Decomposition, and we'll do that also in the next two lectures, Monday and Wednesday of next week.
We already saw what it is, but I'm going to recap a little bit and then go on.
Right?
So, we just showed that every matrix can be decomposed into a sum like that, A = sum over t of sigma_t u_t v_t transpose, where the u_t and v_t are vectors and the v_t are orthonormal.
So, remember that each term is an outer product: u_t and v_t are both column vectors, and u_t times v_t transpose is a matrix.
So it's a decomposition into rank-one matrices.
The sigma_t are always non-negative; actually we assume they're positive, since we don't write the zeros.
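As a quick illustration of this decomposition (my own sketch, not part of the lecture, using numpy and a random test matrix), you can rebuild A one rank-one term at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# numpy returns the left singular vectors as columns of U and the right ones as rows of Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = sum_t sigma_t * u_t v_t^T, a sum of rank-one matrices.
A_rebuilt = sum(s[t] * np.outer(U[:, t], Vt[t, :]) for t in range(len(s)))
print(np.allclose(A, A_rebuilt))  # True
```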
Now, today I want to illustrate lots of uses of just the fact that the v_t are orthonormal.
So far we only have that the v_t are orthonormal; that's what I will assume, but later we'll show that the u_t are orthonormal too.
The v_t are orthonormal by definition, because we said v_2 is the maximizing vector perpendicular to v_1, and so on.
So they are orthonormal, right?
The u_t are not orthonormal by definition, because they were defined as u_t = A v_t divided by its length.
So there's no immediate reason why the u_t should be orthonormal; we will have to prove that.
But even the fact that the v_t are orthonormal we will be able to use; in some contexts I'll show you how, and some contexts I'll just mention, right?
So, we just sort of worked this out, but this is a bit of a recap: A A transpose.
I want to tell you what it is; it comes out to be that.
On the face of it, there are r terms here and r terms there, so there will be a total of r squared terms when I multiply out.
But of course the cross terms cancel, so let me just write one more step here in the middle.
So, I multiply out: the sum over t1 and t2, each going from one to r, of sigma_t1 sigma_t2 u_t1 (v_t1 transpose v_t2) u_t2 transpose.
Okay, so these don't necessarily commute, they are matrices, you have to write them in order, except you notice that v_t1 transpose v_t2 is an inner product, a scalar.
So that of course can be brought out or whatever you want to do with it, and it is 0 if t1 is not equal to t2.
A word of caution: if instead of the inner product you had an outer product there, that's not true, right?
We saw that last time.
So all the cross terms with t1 not equal to t2 go away, only t1 equals t2 counts, and then you get that: A A transpose is the sum over t of sigma_t squared u_t u_t transpose.
Right?
So we've seen this; I'll show you something new in a minute.
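Here is a quick numerical check of that cancellation (again my own illustration, assuming numpy and a random matrix): because the v_t are orthonormal, the cross terms vanish and only the diagonal terms survive.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A A^T = sum_t sigma_t^2 u_t u_t^T, since v_t1^T v_t2 = 0 for t1 != t2.
AAt = sum(s[t] ** 2 * np.outer(U[:, t], U[:, t]) for t in range(len(s)))
print(np.allclose(A @ A.T, AAt))  # True
```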
We also saw that if A is invertible, then r equals the full dimension d, so all the sigma_t are non-zero, and the inverse is the sum over t of (1 / sigma_t) v_t u_t transpose.
I have to show you that that is the inverse.
So we take A times that matrix and show it's the identity; that's enough to show it's the inverse.
So you just expand: A is the sum over t1 of sigma_t1 u_t1 v_t1 transpose, and the other factor is the sum over t2 of (1 / sigma_t2) v_t2 u_t2 transpose.
It's really the same thing as before: v_t1 transpose v_t2 is an inner product, so it's zero once you expand it out, unless t1 equals t2.
So we just expand it out like we did before, and this is the identity.
Sorry, excuse me, I'm not done yet; I need to complete the picture.
So that's equal to the sum over all t of u_t u_t transpose, and that's just the identity written in the basis of the u_t, right?
The way that you sort of see that this is the identity is: if I hit it with a vector x, it will expand x in this basis and you will get x back, okay?
So you do have to do a little calculation.
In fact, I could take any full set of orthonormal vectors u_t, and u_t u_t transpose summed over all of them is the identity.
Okay, and the usual identity is when the u_t are the standard unit vectors.
Okay, A may not have an inverse, and often the pseudo-inverse is quite useful; even if A is singular, it always has a pseudo-inverse.
Again, I only wrote the sum for the sigma_t that are not 0, so I can take 1 over sigma_t.
And just like for the inverse, I put v_t u_t transpose, right?
Again I'm going to check by multiplication: I multiply A with that, the cross terms go away, sigma_t times 1 over sigma_t is 1, and you just get the sum of u_t u_t transpose.
This is the identity, but not on all of space, right?
So again, maybe it's worth pointing that out.
So now we have the sum of u_t u_t transpose, where the u_t are orthonormal but not a full set.
The vector u_t, by the way, is n by 1, and u_t transpose is 1 by n.
Okay, r is less than n.
So this, acting on an x, gives 0 if x is orthogonal to the span of the u_t's: if I have something orthogonal to all the u_t's, each dot product is zero, so that part gives 0, right?
And it's the identity on the span of the u_t's; that we just argued.
So it is the identity, but only on a subspace, okay.
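A small sketch of this (my own, assuming numpy and a random rank-deficient test matrix): build the pseudo-inverse from the SVD and check that A times it is the projection onto the span of the u_t's, not the full identity.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))  # rank 3, so singular

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))  # numerical rank: keep only sigma_t > 0

# Pseudo-inverse: sum of (1/sigma_t) v_t u_t^T over the nonzero sigma_t.
A_pinv = sum((1.0 / s[t]) * np.outer(Vt[t, :], U[:, t]) for t in range(r))
print(np.allclose(A_pinv, np.linalg.pinv(A)))  # matches numpy's pseudo-inverse

# A @ A_pinv = sum_t u_t u_t^T: the identity on span(u_1..u_r) only.
P = A @ A_pinv
print(np.allclose(P, sum(np.outer(U[:, t], U[:, t]) for t in range(r))))  # True
```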
Now, in linear algebra you have probably all seen the spectral decomposition.
Any symmetric matrix, any Hermitian matrix, can be written as a sum over t of lambda_t u_t u_t transpose, where the lambda_t are eigenvalues, they don't have to be positive, and the u_t's are orthonormal now, right?
So eigenvectors which are orthogonal, okay.
So this is the spectral decomposition, and we can prove it from the SVD, with one assumption; here's the simple proof.
So, suppose you look at a symmetric matrix A, and I do the SVD, not the spectral decomposition; I want to get the spectral decomposition from the SVD.
So I do the SVD, and the SVD is the sum over t from 1 to r of sigma_t u_t v_t transpose, let's say.
I assume the SVD had no ties.
Okay.
That's the case I can prove very easily, spectral decomposition from SVD, in two steps.
Now, I cannot literally assume no ties, there are always the plus-minus ties, so let me say what the no-ties assumption is. Okay.
There exists a unique pair of vectors, v1 and minus v1, such that the length of Av is maximized by them, the max over all v of length equal to one.
This is how we defined the first singular vector, the best-fit line.
So the maximizer is equal to v1 or -v1; I cannot avoid having two of them, I can always put a negative sign, that's all right.
But I will say, informally, that by no ties I mean there are only those plus-minus pairs.
Okay.
So if that's the case, we have a proof, because A is equal to its transpose, symmetric, so that implies A is also equal to the sum of sigma_t v_t u_t transpose.
I've taken that sum and transposed it, so v_t came to the front and u_t transpose went to the back, right?
Now, no ties implies the following: we said v1 was unique, and here you've got u1, and the lengths are the same.
Okay, maybe I should have made that remark: we know that for any vector x, when I hit it with A or with A transpose, which is the same matrix, the lengths are the same.
Therefore this u1 also should be a maximizer, so it must be plus or minus v1, okay?
So, we have to pause for a moment.
We could make a mistake here: we could say no ties means u1 must be equal to v1.
That is not necessarily true.
If u1 were always equal to v1, all the eigenvalues would be positive, because all the singular values are positive, right?
So what this actually says is that the eigenvalues are plus or minus the singular values, for symmetric matrices.
They're already real; the eigenvalues of symmetric matrices are always real, but this says they're plus or minus a singular value.
Okay, so this proves that you can get the spectral decomposition of a symmetric matrix.
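Just to see that claim concretely (my own check, not from the lecture, assuming numpy): for a symmetric matrix the eigenvalues match the singular values up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
S = (B + B.T) / 2  # a symmetric test matrix

eigvals = np.linalg.eigvalsh(S)                 # real eigenvalues
singvals = np.linalg.svd(S, compute_uv=False)   # nonnegative, in descending order

# |eigenvalues| sorted in decreasing order equal the singular values.
print(np.allclose(np.sort(np.abs(eigvals))[::-1], singvals))  # True
```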
Now, the spectral decomposition of a symmetric matrix is used for a lot of things; I'm going to mention two important examples in a minute, but it's useful in a lot of contexts, actually.
So let's take a high power of the symmetric matrix A with its spectral decomposition.
I'm going to take a high power; I mean, it's got to be square before I can take powers, rectangular matrices can't be raised to powers.
But if I take a high power, then only the eigenvalues get raised to that power; the eigenvectors remain the same, right?
We can see that by just multiplying out.
So, maybe since I have not so many slides, I can do all these calculations.
A to the s is the sum over t1 from 1 to r of lambda_t1 u_t1 u_t1 transpose, times the same sum over t2, and so on; I could expand this out, or you could also prove it by induction.
There are s brackets like that, so over t1 through ts I get lambda_t1 lambda_t2 through lambda_ts.
Now, each term becomes u_t1 times (u_t1 transpose u_t2) times (u_t2 transpose u_t3), and so on, and these middle factors are scalars, they're all inner products, right?
So they commute, they come out, and so forth.
But if even one of them is 0, everything is 0, so you'd better have t1 = t2 = t3 = ... = ts.
Otherwise, if even one pair differs, that dot product is 0 and the whole thing is 0.
So you only get the terms where the indices are all equal, and that becomes just the sum over t of lambda_t to the s, u_t u_t transpose.
Of course you knew that the eigenvalues of powers of a matrix are the powers of the eigenvalues, and you also knew that the eigenvectors stay the same, right?
If Av equals lambda v, then A to the s times v equals lambda to the s times v, so you knew that; this is just saying the same thing by expanding.
But now I want to say something non-trivial, and again it's extremely useful, so let's see why it's true.
Okay, so I want to claim that if lambda_1 is the unique eigenvalue with the greatest absolute value, in words, lambda_1 is the unique eigenvalue with greatest magnitude, so no ties, no multiple eigenvalues with the highest absolute value, okay?
You could have such ties for a matrix, right? We are assuming that is not the case.
For instance, if you had the identity, all the eigenvalues have the same absolute value, one, right?
So I'm ruling out such matrices.
So you take only matrices with a unique top eigenvalue; I'm calling it lambda_1, but it could be any lambda, I can just rename it lambda_1, right?
So maybe I should put that in writing: rename it lambda_1.
Then we claim, basically, that the limit as you raise A to a power, with some scaling, becomes rank one, and that rank-one matrix comes from the top eigenvector, right?
So that is important.
It's just an obvious calculation: A to the s, divided by lambda_1 to the s in absolute value, is the sum over t from 1 to r of (lambda_t to the s over lambda_1 to the s) u_t u_t transpose.
Then in the limit, as s goes to infinity, the lambda_1 term becomes u1 u1 transpose, and the other terms go to 0, because lambda_t to the s over lambda_1 to the s goes to 0, the absolute value being less.
Okay, so the other terms go away; only one remains, okay?
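A small numerical sketch of that limit (my own example with a hand-picked spectrum, so the gap is known): dividing A to the s by lambda_1 to the s leaves essentially the rank-one matrix u1 u1 transpose.

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthonormal basis
lam = np.array([3.0, -2.0, 1.5, 1.0, 0.5])         # unique top |eigenvalue| = 3
S = Q @ np.diag(lam) @ Q.T                          # symmetric by construction
u1 = Q[:, 0]

s = 60
P = np.linalg.matrix_power(S, s) / lam[0] ** s
# The other terms decay like (2/3)^s, so P is essentially u1 u1^T.
print(np.max(np.abs(P - np.outer(u1, u1))))
```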
We will see, a little later today or in the next lecture, the power method, which is actually just a method for computing.
So you can ask: how do you compute singular values and singular vectors?
We saw the greedy algorithm, which said: if I can compute the best-fit line, I can find the best-fit two-dimensional space, three-dimensional space, and so on.
But how do I find the best-fit line, right?
The best-fit line is the top singular vector; how do I find it?
This is how we'll find it: provided the matrix is symmetric, I can just take powers of it, and sooner or later I'll get pretty close to the top singular vector.
Not quite the vector itself, right, but I get u1 u1 transpose, and it's easy to see that if I give you u1 u1 transpose, you can figure out u1.
Okay, so the power method is a way, in principle, of finding the top singular vector, and in principle of doing the whole SVD algorithmically, right?
I say in principle because, as some of you know already, there's also numerical analysis, which has developed beautiful, complicated algorithms.
But the power method is, in principle, the starting point.
You wouldn't actually implement it exactly like that, but it's the starting point.
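For concreteness, here is a bare-bones power-method sketch (my own illustration, not the algorithm as it will be stated in the lecture): repeatedly apply the symmetric matrix A A transpose and renormalize; when there is a gap, the iterate lines up with the top left singular vector.

```python
import numpy as np

def power_method(S, iters=200, seed=0):
    # Returns an (approximate) top eigenvector of the symmetric matrix S.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(S.shape[0])
    for _ in range(iters):
        x = S @ x
        x /= np.linalg.norm(x)   # renormalize so the iterate doesn't blow up
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 8))
u1 = power_method(A @ A.T)   # top eigenvector of A A^T is the top left singular vector of A

U, s, Vt = np.linalg.svd(A)
print(min(np.linalg.norm(u1 - U[:, 0]), np.linalg.norm(u1 + U[:, 0])))  # ~0, up to sign
```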
>> [INAUDIBLE]
>> Okay, very good.
So this was under this assumption, which may not hold, and so we will state something without it at the end of this lecture or the next lecture.
Okay, so there's one thing I should back off on a little bit.
In numerical analysis, usually, you see the power method and everything stated in terms of the gap between lambda_1 and lambda_2.
We don't want to do that, because we want absolute theorems.
So I will tell you an absolute theorem, whether or not there's a gap, okay?
It's not new or anything, but that's not the way it's formulated in numerical analysis.
So we would like a theorem that holds for all matrices, and we will state that.
But at the moment, assume some strict gap exists between lambda_1 and lambda_2; then it's true that I get to a rank-one matrix.
If lambda_1 and lambda_2 are equal, you won't necessarily get rank one; you can get two terms there, right?
Okay, good.
This fact is also used for something that is not our subject, but since we have too few slides, maybe I'll spend two minutes on it.
This little fact is also used for Markov chains.
Now, again, I won't actually prove things there, but the point is this.
In Markov chains, you have, let's say, n states.
You start with an initial probability vector p(0), so p(0)_i is the initial probability of being in state i.
In Markov chains, what happens is that the probabilities are just transformed linearly.
So the next time around, at time 1, p(1), and excuse me, these are row vectors, p(1) is just p(0) times the transition probability matrix P.
This P is called the transition probability matrix; I won't tell you exactly what it is, since it's not my intention to go over Markov chains.
All I'm going to tell you is that at time s, and excuse me, let me use s for the power, the probabilities are just p(0) times P raised to the power s.
Okay, and the big theorem in Markov chains is that if you run it forever, under some conditions, you get stationary probabilities, steady-state probabilities, which are limiting probabilities.
So we want to say this converges to a limiting probability vector.
Again, it's not my intention to prove it in detail, but I just want to tell you that what we have here works to prove it.
So what happens for Markov chains is that lambda_1(P) is 1 and everything else is smaller in magnitude.
That's not always true for every chain, but we assume it now.
Then we see from what's up there that P to the s divided by lambda_1 to the s, and lambda_1 is exactly 1, not just in absolute value, so this is just P to the s, converges to the rank-one matrix from the top eigenvector, u1 u1 transpose.
And now, if you hit it with p(0), then p(0) times P to the s converges to (p(0) dot u1) times u1 transpose; that first factor is a dot product of a row vector with a column vector.
Right, and what that says is that whatever p(0) is, it doesn't matter: I always converge to probabilities proportional to u1, since p(0) dot u1 is just a scalar, right?
One fixed u1, whatever p(0) is.
So maybe that's worth writing down: whatever p(0) is, we converge to probabilities proportional to u1.
And these are called the stationary probabilities, so yeah.
>> [INAUDIBLE]
>> Yeah, so good point.
I'm doing this only for P symmetric, thank you.
Otherwise you would have to symmetrize it first.
So P is symmetric here; yeah, this is not valid if it's not symmetric, thanks.
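A tiny example of that convergence (my own numbers, with a symmetric transition matrix as just discussed): every starting distribution ends up at the same stationary probabilities.

```python
import numpy as np

# Symmetric transition matrix: rows sum to 1, P = P^T.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

p0_a = np.array([1.0, 0.0, 0.0])   # two different initial distributions
p0_b = np.array([0.1, 0.2, 0.7])

Ps = np.linalg.matrix_power(P, 50)  # P^s for a large s
print(p0_a @ Ps)   # both print the same stationary distribution
print(p0_b @ Ps)
```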
So Markov chains are not the subject here; all I'm saying is that it's just a matrix computation.
Later, when you see the power method, you won't see Markov chains.
Okay, that slide was just on using the orthogonality of the singular vectors.
It's essentially like you would use the orthogonality of eigenvectors in the spectral decomposition.
There are a lot of uses, and here are some of them, right?
Okay, so now I want to prove that the SVD is best in another sense; we did see that it's best in some sense, but I'm going to prove another one.
So we define a rank-k matrix A_k: A_k is defined as the sum of the first k terms alone of the SVD.
This is often called the truncated SVD.
So if you go to MATLAB or something like that, there's always code for finding the truncated SVD.
So that takes only the top k terms, sigma_1 being the highest, and so on, the top k singular values and singular vectors.
We will show that for each k, A_k is the best approximation to A of rank k.
Okay, we want to show that.
It sounds very much like what we said already, that V_k is the best-fit subspace; in fact it uses that, it is more or less that, but we'll prove that this is true.
So for this reason, in a lot of computations, if you want to reduce the rank of A and still get something decent, a decent approximation, you use the SVD, right?
It is the best possible approximation.
So this is, in a sense, the reason SVD is used a lot.
We'll prove this, and an analog of this for the spectral norm later.
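In code (a minimal sketch of my own, assuming numpy), the truncated SVD and its Frobenius error look like this:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD: top k terms only

# Frobenius error of the truncation is sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```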
So this proof first, right?
First, the rows of A_k are the projections of the corresponding rows of A onto V_k, the space spanned by v1 through vk, I want to say.
We saw this capital V_k yesterday, and these are the singular vectors, right, the right singular vectors.
Throughout this talk I'll use u for the left and v for the right singular vectors.
So first we want to prove this, okay, and it's just a simple calculation.
So if I take any vector a, here is the projection of a onto V_k: capital V_k is the subspace spanned by these vectors, and they form a basis for it, because they are orthonormal, right?
So you take the dot product of a with each basis vector, that's the projection onto each, and add them up; that is the projection onto the whole space.
That's for one particular vector a; now projecting each row of A onto V_k, we get this.
Okay, and this is also equal to A_k, that quantity, right?
So I want to say that these two are equal; let me just see why that is the case, I skipped one step there, probably just between the first and the second equation.
There's one step: I'm going to write out A, and then the sum over i from 1 to k.
A is the sum over t from 1 to r of sigma_t u_t v_t transpose, that's A, and then I have another v_i v_i transpose.
And again, v_t transpose v_i is a dot product; it's zero unless t equals i, and if t equals i it's one, because it's the squared length of v_t.
And then you get exactly the first k terms, and that's A_k, okay?
Okay, theorem: for any matrix B of rank at most k, this is true.
That says that A_k is the best rank-k approximation, because whatever matrix B I take, the error is less for A_k than for B, less than or equal to.
So for the proof, take the B that minimizes that error among all matrices of rank k or less, okay.
I don't have to keep saying it, but this minimum exists, right?
The set of rank-k matrices looks unbounded, but you have to convince yourself that the minimum exists, and it does.
To see the claim, first we argue that each row of B is the projection of the corresponding row of A onto the subspace V spanned by the rows of B.
So I want to claim this, and the proof is: if that's not the case, if a row of B is not equal to the projection of the corresponding row of A onto V, then replace that row of B by the row of A projected onto V, okay?
Now the replacement doesn't increase the dimension, because the new row is still in the space V, and therefore the rank of B doesn't go up.
That's important: we are still dealing with a rank-k matrix.
We are still dealing
with a rank k matrix.
But I reduced a minus b squared
because I now made that row,
projection of this
row onto the space.
Originally it wasn't, so
again this was a row of a.
This is a sub phase b right?
And my b was here.
Instead I made it this.
And this is the best
you can do right,
this is your least super
perpendicular distance to v.
Okay there used to be
some other point in v but
I made it the perpendicular one,
so that's the best.
So that strictly reduces
because I added a component,
strictly reduces a minus p
which is a contradiction to
the assumption that A minus,
this means contradiction, right?
So that means A minus
B is the best.
We started with A- B best,
so we cannot reduce it.
So that proves this theorem,
okay?
I'm sorry, it's not proved yet,
there is one more step.
So A minus B squared is the sum of squared distances of the rows of A to V, right?
The rows of B were all projections of the rows of A, so it's the sum of squared perpendicular distances of the rows of A, which are the data points, to V.
And among all subspaces of dimension k, V_k is the one that minimizes that; projecting onto V_k gives A_k, so we get the theorem.
This is from our old theorem, the main theorem we had, that the greedy algorithm works.
It's also true for the spectral norm; we'll prove that, it's a little more work, a little less obvious.
It's actually true for some more general class of norms, but certainly for these two important ones.
So that's a nice thing about SVD: not only does greedy work, but it gives the best approximation in the two norms you're most interested in, when the error is measured in either of the two most interesting norms.
Okay, so that will come a little later in this lecture.
Okay, so now, for that I also need that the left singular vectors are orthogonal.
So I must prove the u_i's are orthonormal.
Again, the right singular vectors were orthonormal by definition, because I said v2 is perpendicular to v1, and so on, but for the u's that's not necessarily true, so we have to prove this.
This is a variational argument.
In a sense, you can use this kind of argument also to prove the main theorem, but here we'll use it to prove this.
So it's a variational argument that says: if something is violated, if they were not orthogonal, I can improve matters a little bit.
So let's go back one step.
In the case of the v's, right, if v1 and v2 were not orthogonal, that is, they make a nonzero dot product, then I could improve v1 by adding a little bit of v2 into it.
That's the kind of argument we are going to see now, but for the u's.
Because the v's are already all right, suppose the u's are not orthogonal.
Take the smallest index i such that u_i is not orthogonal to some other u_j.
Okay, so now we want to assume that they actually make a positive dot product.
I just assumed they were not orthogonal, but it may be that they make a negative dot product; then what do I do?
I want to really assume they make a positive dot product for the argument.
So why is it that I can assume that?
What if they made a negative dot product?
Yeah, just replace one by its negative: if I replace u_i, or u_j, by its negative, that's still a singular vector.
So we may assume it's positive.
We also have that j must be greater than i.
That's because i was the smallest index to violate orthogonality with anybody, so it cannot be violating it with the earlier ones; the earlier ones are all all right, it must be a later one that's violating.
So j must be greater than i.
So here's the variational argument, right?
I take the right singular vector v_i, okay, and I'm going to perturb it slightly; epsilon here is infinitesimal.
I'm going to assume epsilon is infinitesimal, so second-order terms can be thrown out.
I'm going to take v_i and add a little bit of v_j to it.
I might have said it wrong earlier, but this is correct: it's u_i and u_j that are not orthogonal, and I'm adding a little bit of v_j to v_i.
I also have to normalize to length one, so I divide, so that this thing has length one now.
And I am going to show you that v_i prime is better than v_i, which is a contradiction: v_i was the best fit among vectors orthogonal to the earlier ones, and I am going to show you that v_i prime is a better fit, which can't be, right?
So, A v_i prime, let's see what it is.
The v_i are all singular vectors, so A v_i is sigma_i u_i, right?
If it were an eigenvector, A v_i would be lambda_i v_i; for a singular vector it's sigma_i u_i.
And then similarly for the other piece, you get epsilon sigma_j u_j.
If you're looking at the book, some of these indices are wrong there, but I've corrected them here.
Now, the length of the new vector: v_i and v_j each have length 1, but they are orthogonal, already orthogonal, so the length of v_i plus epsilon v_j is just the square root of 1 plus epsilon squared.
It's important that epsilon enters only through a second-order term here; if I had a first-order term this wouldn't work.
So the v's being already orthogonal gives me only the second-order term here.
Okay?
So, A v_i prime: I'm going to prove it's a better fit than v_i, so I want to say this vector's length is large.
Well, it's at least as large as the component along u_i, right?
u_i is a unit vector, so if I take the dot product of u_i with this vector, the length of the vector is certainly at least that much; it has other components perhaps, but it's at least the component along u_i.
So you take the dot product with u_i here.
Sorry, that should be u_i there, not u1; I'll correct that.
So u_i transpose u_i is 1, so you get sigma_i, and then this term gives epsilon sigma_j times u_i transpose u_j, and u_i transpose u_j is delta, the positive dot product.
So you get sigma_i plus epsilon sigma_j delta in the numerator.
And the denominator I've written as the square root of 1 plus epsilon squared; 1 over that is 1 minus epsilon squared over 2 plus higher-order terms, less than 1, but only by a second-order term.
You can check: if you expand 1 over the square root of 1 plus epsilon squared, it's 1 minus something of order epsilon squared; it doesn't matter what the constant is.
Okay, so I multiply this out; most importantly, I get a first-order term here, epsilon sigma_j delta, and the rest are second-order terms.
The first-order term is positive; that's all I care about.
And therefore I get something strictly greater than sigma_i.
So all I care about is that these two v's are orthogonal, so the denominator contributes only a second-order correction, while the first-order term has a positive coefficient.
>> [INAUDIBLE]
>> Well, no, only epsilon is infinitesimal; delta is not infinitesimal, because the u's are fixed.
So I have u_i transpose u_j; I assumed it's positive, it's a fixed positive quantity.
And epsilon is much smaller than delta, actually.
So, it's a variational argument, right; variational just means this, that you make an infinitesimal change and show that something is violated, right?
>> [INAUDIBLE]
>> The question is whether the u's are unit length; but they were normalized to length one.
u_i was defined as A v_i divided by the length of A v_i, so they are already unit length.
Okay, so this says that we've got something greater than sigma_i for the length of this vector.
Now, this is a contradiction, because v_i plus epsilon v_j, normalized, is orthogonal to the previous v's.
Remember, v_i was the best fit among vectors orthogonal to the previous ones; I produced something which is better, but I must also say that it's orthogonal to the previous ones, to get a contradiction, right?
So, it is orthogonal to the previous ones, since j is greater than i: v_i and v_j were both orthogonal to all of them, therefore so is their combination, okay?
So I needed that v_i and v_j are both orthogonal to the previous ones.
And sigma_i is the maximum over all such vectors, so that's a contradiction.
So the u_i's are all orthogonal.
So, I'm going to use that, but I need one more lemma.
I'm still trying to prove that A_k is the best fit in spectral norm; I'm not there yet, I need one more lemma.
So I need to prove that A minus A_k in spectral norm is just the next singular value.
And the proof of this, well, let me step back for a moment.
This lecture is probably the most technical, because I have a lot of details to work out.
Maybe I should tell you the agenda again, just to take a break from this, and then we'll do the calculation.
So, what will happen after this, in the next few lectures, including possibly today: we'll see the power method with an absolute theorem, absolute meaning no gap assumptions.
Possibly today, possibly later.
And then we will see how SVD helps us learn a mixture of two Gaussians.
So, I introduced Gaussian mixtures before, and I'll redefine them.
But it helps us learn a mixture of, sorry, k Gaussians, any number of Gaussians, but they have to be spherical Gaussians for this.
This is in a way a clustering problem.
Before that, we'll also see that SVD leads to spectral clustering.
Now, this is in chapter seven of the book, not this chapter.
Spectral clustering is a very widely used technique for clustering, not just for mixtures.
So first we do it for Gaussian mixtures, then we'll do it a little more generally.
That's a rough agenda for what follows.
I may preview the power method at least today; I don't have slides, but I'll preview the power method.
And then we'll come back, continue with that, and do these things.
There's one topic I did not mention, and we'll do it if there's time, which is length-squared sampling to compute the SVD.
This is in a different chapter as well; I think it's in chapter 6 or 8.
The idea is that, if you want to do the SVD of a big matrix, you can do length-squared sampling to pick some rows and just do the SVD of the rectangular matrix consisting of those rows.
That gives you a good approximation, a provable approximation, all right, for any matrix.
The gist of it is: SVD on length-squared sampled rows gives you a guarantee on the error.
I may permute these topics so that I do, perhaps, this before that.
But first we'll turn to the power method and Gaussian mixtures, spherical Gaussian mixtures.
Okay, so that was a relief from the technical detail, but now we'll go back to the technical detail.
So, let's see, where was I.
So, I want to prove that A minus A_k in spectral norm is sigma_{k+1}.
From the SVD, A minus A_k, with the first k terms subtracted, is the sum from t = k + 1 to r.
So I want to prove that the spectral norm of this, which is not too bad, is equal to that value.
So let v be the top singular vector of that matrix; the spectral norm is defined by that v, of course, right?
So we can express v as a linear combination of v1 through vr.
Why? v is the top singular vector of A minus A_k, which is these terms, right?
In fact, it's even true, I should have said, that you don't need all of v1 through vr in the combination: which v_t's suffice?
It's the top singular vector of this matrix, right, so it only involves v_{k+1} through v_r; anything else is useless, a waste.
Right, if I put in something other than these v's, I'm wasting length and getting nothing, getting 0, so I'll only put in those.
But here I put in all r of them, I think; anyway, you get this calculation, and very soon it comes down to the same thing.
Again, the same argument: v_i transpose v_j is 0 unless i equals j, so the cross terms go away, and that inner product is one when i equals j.
And so, with the coefficients c_i from there, you get the sum of sigma_i c_i u_i.
All right, the u's are also orthonormal, so the length squared is just the sum of squares, by Pythagoras, right?
And we are maximizing this, because v is the top singular vector of A minus A_k, so this needs to be maximized, and the maximum is obtained when all the weight is put on the highest sigma in the sum, which is sigma_{k+1}: so c_{k+1} is 1 and the rest are 0, and therefore that's true.
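A one-line numerical check of that lemma (my own, assuming numpy): the spectral norm of A minus A_k comes out exactly sigma_{k+1}.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((7, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# Spectral norm (largest singular value) of the residual equals sigma_{k+1}.
print(np.linalg.norm(A - A_k, 2), s[k])
```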
And then, finally, that A_k is the best rank-k approximation in spectral norm is what we want to prove.
Okay, so for B of rank at most k, I want to prove that A minus A_k is no larger than A minus B, that is, B is no better than A_k in spectral norm.
It's like before, but now we put a 2 there, right: instead of the Frobenius norm we put the spectral norm, okay?
So, first case: if the rank of A is k or less, it's trivial, because then A minus A_k is 0, right?
A_k takes away the top k terms of A, so if the rank is k or less, then we take away everything.
Okay, and 0 is of course less than or equal to anything nonnegative like that, so there's nothing to prove.
So we assume that the rank of A is greater than k.
Then for A minus A_k in spectral norm, we already know the value; we knew that from the last slide.
Now, I take the null space of B; it has dimension at least d minus k, because the rank of B is at most k, so the null space has dimension at least d minus k.
A dimension count now implies that there's a vector in the null space of B which is also in the span of v1 through v_{k+1}.
So why is that?
This null space has dimension at least d minus k, and the span has dimension k plus 1.
So if I take two subspaces whose dimensions add up to more than d, there's always a non-zero vector in the intersection.
This is a linear algebra fact that you should prove, right?
If the dimensions together are only d or less, it need not hold: for instance, the first subspace could be the span of the first unit vector, and the next one the span of all the other d minus 1; the dimensions add up only to d, and there's nothing non-zero in the intersection, 0 is the only thing, right?
But if they add up to d plus 1 or more, you have to prove that there is a non-zero vector in both.
Maybe I won't prove that, but it's just a basic linear algebra fact; you can check that it's true.
Okay, now scale that vector z to have length 1; I can always do that, and it's still in both subspaces.
Now what can we say with that?
So, A minus B: I'm trying to prove its norm is not too small, so I lower-bound it using at least this one particular vector.
The spectral norm of A minus B, right, is a max over all unit vectors, so it's certainly greater than or equal to the length of (A minus B) z.
Now, B z is zero, so this is just equal to the length of A z, okay?
Now, z belongs to the span of the first k + 1 right singular vectors of A, so for A z I can write, again, this expansion, and I just take the dot products and square.
Now why is this the length squared?
These coefficients are scalars, right, and the u's are orthonormal, so you just take the sum of squares of the components, the components along the u_i.
So you get the sum of sigma_i squared times (v_i transpose z) squared.
z is only in the span of these v's, so I go only up to k + 1 there; the rest don't count, and that's greater than or equal to.
Now, for i from 1 to k + 1, the sigmas are all decreasing, so the least of them is sigma_{k+1} squared; it comes out and you get that.
And the remaining sum is one, because it is the length of z: z is in the span of these things, so I took all its components, squared and added.
So you get sigma_{k+1} squared, okay.
So, I've proved that in the spectral norm, too, that's the best; that took us the most work.
So A_k is the best rank-k approximation to A in both Frobenius and spectral norm.
So this is the main reason that SVD is so useful, right?
Most kinds of optimization problems are not solvable in polynomial time, but this is an optimization problem we can solve in polynomial time, okay?
We can find the best-fit subspace, and we can find the best rank-k approximation to A, in both these senses.
Okay, so I think I ran out of slides, but what I'm going to do is probably a preview of the power method on the board.
Unless there are questions or discussion; we can take a two-minute break to see if anybody has any questions.
So this works, I think, for all unitarily invariant norms, or maybe not; for Schatten norms it's not clear to me, because you get powers higher than 2, so Pythagoras and everything is not available.
So for general unitarily invariant norms I won't make the claim here.
>> So how hard is it to-
>> [INAUDIBLE], right?
>> Now suppose I say that there are two matrices, A and B.
>> Right,
>> [INAUDIBLE]
Yeah.
So you want perturbation results.
Okay.
That's a good point.
Yeah, so I should have given some perturbation theorems.
So here's the kind of question; okay, let's erase this.
Suppose I want sigma_k of A plus B in terms of the sigmas of A and B: what can I say in general?
I mean, I can say good things for specific matrices, but what can I say in general?
Okay, unfortunately about the only thing we can say in general is that sigma_k of A plus B is less than or equal to sigma_k of A plus, and here we have to put sigma_1 of B.
So that is true, right.
Anything like this you prove by taking the k-th singular vector of one of these things, of either A or A plus B; since I want to prove this side is greater than or equal to that, I take the singular vectors of A plus B, the sum of sigma_t u_t v_t transpose, and then I argue about sigma_k of A.
I have a suspicion I may not be able to prove it on the spot, but it is true, this inequality is true.
Yeah, sigma_1 of B could be very high; it would have been nice if I could say sigma_k of A plus sigma_k of B, but that's not true in general.
So this is about all I can say, right?
Maybe I won't actually prove this, but you could ask: suppose B is very small, B is nice, right?
How much do the singular values get moved? That's one thing you could ask, right?
So there's something called the Hoffman-Wielandt inequality, which is a very nice inequality, and it says roughly the following.
The sum over k, let me risk writing this, of sigma_k squared of A plus B minus sigma_k squared of A; I'm going to try to say that this is less than or equal to that quantity in terms of B.
I could be wrong, and I think I may have to take a square of this and get a fourth power.
The Hoffman-Wielandt inequality is often stated for eigenvalues of symmetric matrices, and you get the singular-value version just by applying it to the symmetric matrix (A plus B) times (A plus B) transpose.
Don't worry for the moment about the exact inequality; you can look it up.
The idea is that if B is not very big, then the singular values don't change very much, right, which is what you'd expect; that's what this is saying, okay?
So the right-hand side depends on B.
For instance, if B is zero, you should get zero on the other side too: the right side is zero, and the inequality says the left side is less than or equal to that.
That's one kind of theorem that's nice to know.
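Here is a quick numerical spot-check of both bounds (my own, using the standard forms I believe are right, since the exact constants were left open on the board): sigma_k(A+B) is at most sigma_k(A) + sigma_1(B), and the Hoffman-Wielandt / Mirsky form, the sum over k of (sigma_k(A+B) - sigma_k(A)) squared, is at most the Frobenius norm of B squared.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
B = 0.1 * rng.standard_normal((6, 6))   # a small perturbation

sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)
sAB = np.linalg.svd(A + B, compute_uv=False)

print(np.all(sAB <= sA + sB[0] + 1e-12))                          # Weyl-type bound
print(np.sum((sAB - sA) ** 2) <= np.linalg.norm(B, 'fro') ** 2)   # Hoffman-Wielandt / Mirsky form
```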
So I want to perhaps point out another kind of theorem, since we asked for perturbation results, which is: what happens if you have a random matrix?
So first, if you have a random symmetric matrix, what does that mean?
That means A_ij for j greater than i is normal with mean zero and variance one, say, and A_ij for j less than i is set equal to A_ji by symmetry.
So the above-diagonal entries are independent random variables, and I want the matrix to be symmetric, so the below-diagonal entries had better equal the corresponding above-diagonal entries, right?
And the expected value of each entry is 0, and the variance is 1, right?
Now, the squared length of A, the sum over all i, j of A_ij squared, that's the Frobenius norm of A squared, and I want the expected value of that; it's a random variable.
The Frobenius norm of A squared is just the sum of the squares, right, and the expectation can go inside, it's just a sum, the expectation of a sum is the sum of expectations, right; even independence is not required here.
These expectations were for the above-diagonal entries, and I didn't say anything about the diagonal; so let's say the A_ii are also normal 0, 1, they don't matter too much.
Okay, so each entry has variance 1, so you get n squared.
And we saw that this Frobenius norm squared is the sum of the squares of the singular values.
So that implies sigma_1 squared of A is less than or equal to the Frobenius norm of A squared, which should be less than or equal to c n squared with high probability; so why is that, right?
The Frobenius norm squared has expectation n squared, but it is the sum of many independent variables, so it's not going to be greater than c n squared, right?
We saw that for sums of independent variables the tail bounds are good, so this won't be very large, with high probability; that abbreviation stands for with high probability, right.
And so sigma_1 squared is at most about n squared.
In general, sigma_1 squared can be that large: if A is rank 1, sigma_1 squared equals the Frobenius norm of A squared, right?
That's true in general, but in fact, for random matrices, in this setup, the theorem is that with high probability, sigma_1 of A is less than or equal to, let's see, I think it's (root 2 plus epsilon) times root n.
>> [INAUDIBLE]
>> Or maybe I should just say, it's asymptotically about root 2 times root n.
Okay, now what's nice about this: it is hard to prove; I mean, I'm not going to prove it here, right?
The reason it's hard to prove is that sigma_1 is the max over all unit x of the length of Ax, right?
So to prove this, you must prove it for every one of these infinitely many x: for every x, the length of Ax is less than root 2 times root n.
So that's not the subject of this lecture, so I won't be able to prove that, but I want to point out one thing before we go on.
And the one thing is: the length of one row of A is itself already about root n.
So let me do it: the expected length squared of one row of A is equal to the sum over j of the expectation of A_ij squared; these are all ones, so that's n.
So one row is already that long, and the top singular value is at most the length of one row times a constant, okay?
Now that's quite striking, right; that won't be true for arbitrary matrices.
Why is that quite striking? Because of the following.
Sigma_1 of any A is greater than or equal to the length of A times the first row written as a column and normalized: I take the first row, dot it with itself, that's the length squared, and then I divide by the length of a1 to make it a unit-length vector, so I get at least the length of a1.
So certainly it's true that sigma_1 is trivially greater than or equal to the length of a row.
And here I'm even ignoring the contribution of the other rows; the other rows may make a good dot product with a1, too.
Okay, so this is telling you that the rows are uncorrelated, right?
In a random matrix, I take the first row and I already get about root n; I really get almost nothing extra from the other rows, only a constant factor more.
So it's as if the rows are, in a sense, uncorrelated, and they are, they're independent random variables.
So for one thing, this bound reflects that the rows are uncorrelated.
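A quick experiment along these lines (my own, assuming numpy; the exact constant in front of root n depends on the model, so I only look at orders of magnitude): for a random symmetric matrix, sigma_1 is only a constant factor larger than the length of a single row.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
B = rng.standard_normal((n, n))
A = np.triu(B) + np.triu(B, 1).T   # symmetric, independent N(0,1) entries above the diagonal

sigma1 = np.linalg.norm(A, 2)      # spectral norm = top singular value
row_len = np.linalg.norm(A[0])     # length of one row, about sqrt(n)
print(sigma1 / np.sqrt(n), row_len / np.sqrt(n))  # both O(1): sigma1 is only a constant times one row
```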
Okay, so a random matrix is one extreme, but SVD is used in practice for totally non-random matrices, so let's say what that is.
So a typical use of SVD, which I was going to go over next time, but I'll try it out now: the typical use of SVD is principal component analysis.
And principal component analysis, or PCA, says: replace A by A_k, and just compute with A_k.
This is only good if the truncation after the first k terms leaves very little error.
And we just saw that the error it leaves, in spectral norm, is the (k+1)-st singular value.
So if this is much smaller than, let's say, sigma_1, this is an interesting thing to do.
So in practice you use this a lot for matrices which have this property; it's useful if sigma_{k+1} is much, much smaller than sigma_1.
But for random matrices that's not true, because sigma_1 itself, for a random matrix, is of order root n, and the sum of the squares of all of them is n squared, right, that's the Frobenius norm squared, which is of order n squared.
So sigma_1 squared is basically just about the average, more or less within a constant factor of the average sigma_t squared: the total of these is n squared, there are n of them, so the average is n, and sigma_1 squared is only about that.
So it's far from having this inequality, sigma_{k+1} much smaller than sigma_1.
So if it's a random matrix, you should not use PCA, all right.
The assertion is that in practice, a lot of matrices have this property, that sigma_{k+1} is much smaller than sigma_1.
Okay, that's why you use PCA.
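A minimal PCA-style sketch of that situation (my own toy data: a low-rank signal plus small noise, so sigma_{k+1} really is much smaller than sigma_1):

```python
import numpy as np

rng = np.random.default_rng(10)
signal = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))  # rank-3 signal
A = signal + 0.01 * rng.standard_normal((200, 50))                     # plus small noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s[k] / s[0])                                                  # tiny: truncation is safe
print(np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro'))    # small relative error
```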
>> [INAUDIBLE]
>> Sorry?
[INAUDIBLE]
>> Is it easier for random matrices?
See, I don't know that.
I mean, I don't know an algorithm for which it's easier or faster, right.
In fact, for a lot of methods it's harder, because the gaps are small.
So random matrices are often thought to be the hardest sort of instance.
>> [INAUDIBLE]
>> Yeah, so it may be; for instance, one conjecture would be that for a big enough random subset of rows, there would be a good singular vector just in the span of those.
Yeah, I think some things like that might be true; I don't know proofs of those things.
The numerical methods which don't sample, certainly for them random matrices are thought to be hard, because the gaps are bad.
But the suggestion, maybe that's what you're making, is right: if I pick a few random rows, in their span I should have a good singular vector.
That's possible, yes; I don't know a proof, but sometimes it works like that.
So I will stop here, because I don't have anything more to say, but on Monday we'll come back and do the power method.
I didn't do the power method today, but yeah.
>> [INAUDIBLE]
>> Yes, that is the computation.
So it's a little more complicated.
Sampling for SVD, that was one of the topics I put up for later, but maybe I'll move it up, come to think of it.
But it's complicated; it's more complicated than for multiplication or a compressed representation.
>> [INAUDIBLE]
>> Okay.
>> So you'll do the proof, right?
The paper?
>> Yeah.
Okay, so maybe we'll move that up in the agenda.
But first we'll do the power method, because it's nice.
Okay, so first we'll do the power method, and then we'll do mixtures, spherical Gaussian mixtures.
Those are both nice to know.
And then maybe we'll do SVD by length-squared sampling.
