Okay, so we'll start.
I did do length
squared sampling.
That is already done.
So that's the matrix
multiplication theorem.
I'll put this up and
then we'll use it for
the compressed
representation of a matrix,
which we went through fast; we'll
go through this carefully now.
So I take AB,
I wanna find this product.
This is the error of the estimate,
in the spectral norm.
I'm sorry, the Frobenius norm.
So the error for the multiplication
is less than or equal to
this over the square root of s, okay,
because I took the square root.
Of course, this is random
variables and expectation, but
just to remember,
if I pick S samples to multiply,
the error goes down as root s,
1 over root s.
That we will use when
we do the [INAUDIBLE].
So this was
just recap.
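To make the recap concrete, here is a minimal numpy sketch of the length-squared sampling estimator for a matrix product; the matrices, sizes, and function name are mine, not from the lecture. Sampling s column/row pairs with probability proportional to the squared column lengths of A, and rescaling each term, gives an unbiased estimate of AB whose Frobenius error shrinks like 1 over root s.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))   # m x n
B = rng.standard_normal((200, 40))   # n x p

def lengthsq_product(A, B, s, rng):
    """Estimate A @ B from s column/row pairs, sampled with probability
    proportional to the squared length of the columns of A."""
    p = np.sum(A**2, axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)
    est = np.zeros((A.shape[0], B.shape[1]))
    for j in idx:
        # rescaling by 1/(s * p_j) makes the estimator unbiased
        est += np.outer(A[:, j], B[j, :]) / (s * p[j])
    return est

exact = A @ B
err_small = np.linalg.norm(lengthsq_product(A, B, 10, rng) - exact)
err_large = np.linalg.norm(lengthsq_product(A, B, 1000, rng) - exact)
```

With 100x more samples the error should drop by roughly a factor of 10, matching the 1 over root s behavior.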
This is what we are doing now.
A is an m by n matrix,
potentially big in both
dimensions m and n, okay?
We'll show without
assumptions on A.
We'll show that A can
be approximated.
Provided you give me just
a sample of rows of A and
a sample of columns of A, okay?
But the sampling has
to be length squared,
to get reasonable error bounds,
right?
If the sampling is done with
arbitrary probabilities, then
we don't get good bounds, okay.
You can always get a bound with
the variance but it won't be so
good, that's all.
Okay, then we went
through the intuition.
Let me go through the intuition
first, then we get to the formal
proof, which we didn't do,
so I'll do that now.
First can we just pick
a sample of rows?
We said no,
because the rows won't tell us
anything about the unsampled rows.
But if the rank(A) =
k is very small and
the rows are in general position,
If I pick a large multiple of k,
I should pin down the row space.
Basically I should get everybody
in the row space right.
Like that point.
But we still don't know for
unsampled row,
what linear combination
of the sample rows it is.
And here was a picture. Right?
So I sample these rows.
And here is another row.
Unsampled.
I know nothing about this from
the information I have on rows.
However we said
intuitively this.
If I picked the sample of
hundred k columns as well,
that should be sufficient.
So I picked a lot of columns.
Then I know all these entries.
And I want these entries.
Okay, if things are in
general position now,
these entries that I want,
these give me so many equations.
Sorry, there are more
equations than unknowns,
because this is 100k, okay,
and the rank is only k.
So with k unknowns and 100k
equations, if they are in general
position, you should be
able to pin it down, right?
So that was the idea and we
want to rigorously prove it, so
here is a picture.
Big A sample of columns,
sample of rows, and
the assertion is from
the sample rows and
columns I can find a matrix,
U, to put in here, right?
So that the product is
approximately right.
And if there are s columns and
s rows, it's s by s.
This won't be diagonal.
If I did SVD, we'll see later,
or spectral decomposition,
this would be diagonal, right?
Spectral decomposition
you're all familiar with.
Diagonal entries
would be eigenvalues.
In the singular value decomposition,
which you'll see, the diagonal entries
will be singular values, and
it will be zero off the diagonal.
But this will not be
zero off the diagonal.
Okay, so we gave that example.
I won't run through
the example again.
But the example was to
illustrate that the norm we want
to bound is a spectral norm.
So we want this approximation
to have the property,
that the spectral
norm is bounded.
That's the correct norm just
based on this motivation.
And we'll see a bound that goes
down as s to the one third
in the square, so s to the
one sixth if I take the square root.
As opposed to root s,
it's worse convergence, but
we are doing more, because we
are approximating any matrix.
Okay, so now the idea this
we went through again but,
I'll recap then we'll
prove things today.
The idea is this.
I'm going to do a funny thing.
I'm going to pretend I'm
multiplying A with the identity
matrix, right.
Which of course I won't do but
let's pretend that.
And I'm going to
apply this result,
matrix multiplication result to
that pretend multiplication.
That's what we're going to do,
right.
So we are multiplying A with
I by sampling s columns of A.
Okay, then the error from
here is the Frobenius norm of I,
divided by root s.
Right, that's this one.
But the Frobenius norm of I is root n,
right?
The sum of squares is n, so
this doesn't do much good,
because I need s to be at least
n to get any sensible
result, right?
Getting an error of the Frobenius
norm of A is not sensible.
I could just start with 0,
right, for A,
and get an error [INAUDIBLE] so
that's not interesting.
So to get anything interesting,
I seem to have to make s equal to n or
greater, so that's no good.
Now we picked,
remember R was a set of rows.
We picked a bunch of rows;
R R transpose is a small matrix.
We picked r rows, so
it's an r by r matrix, and
we assume it's invertible.
Now that's again
intuitively true, since R has
only r rows, right, if
A has high enough rank, okay?
Now forget the rank here.
I'm going to do it
in general now.
If A has high enough rank,
these should be independent,
if in general position, right?
And then R R transpose
would be invertible,
as long as the rows
are independent, right?
So we assume that.
Just as an aside, the
full proof does not assume it,
but we'll only do the proof where
this is assumed; if you don't
assume it, you have to do
something called a pseudo-inverse.
And I won't do that,
I'll assume it's invertible.
Okay, then look
at this matrix P.
This is just
a projection matrix,
so it's R transpose, times
(R R transpose) inverse, times R.
This acts as the identity
on the row space of R.
So what I'm going to
do is replace the I
back there with a P.
And P is an identity lookalike,
right?
Because it is the identity
on a subspace.
So why is that?
That is the proof here.
So suppose V is the subspace
spanned by the rows of R, and
I want to prove that P
acts as the identity on it, and
the proof is simple.
Since x is in the row space
of R, it can be written like that:
x can be written as
a combination of rows of R.
So Px, you just write it down,
and you get that, right?
And then (R R transpose) inverse and
R R transpose cancel
out, and you get R transpose y, which is
just x itself, so it does
act as the identity as claimed.
And for vectors in the orthogonal
complement, that's the
null space of P, so
Px is 0 because Rx is 0, right?
So x is orthogonal to the row
space of R, so Rx is 0, so
Px is 0, right?
That's all I'm saying, so
instead of pretend AI,
do pretend AP.
We'll prove two things which
together imply we want to prove
that A minus UR is small.
We'll prove two steps.
First A-AP is small, so this is
going to come from this theorem,
right?
So I pretend multiplying
A times P to get
an approximation to A itself,
because P is like the identity.
So this theorem will
tell us it's small.
Okay, that's step one, and
the second step is going to
prove that AP – CUR is small.
So if you add these together,
by trying an inequality,
you bound A – CUR,
which is what we want to do.
So why is this?
So this is another
application of this theorem.
So suppose C is a sample
of columns of A,
that's what we started with.
Okay, then we'll see that
the corresponding rows of A,
rows of P excuse me,
can be written as U times R.
And the hint for
this is that P ends in R,
so P is actually
something times R.
So if I pick some rows of P,
that's like U times R.
The R is there for us at the end.
Right, so okay.
So this is small because
I'm multiplying
A times P by sampling.
That is, I'm sampling
columns of A, that's C, and
sampling the corresponding rows
of P, that's U times R.
Okay.
So there will be a second
application of this, right.
So together it will
give us a result.
So this, hastily I told
you this last time, but
now I'm going to go
over the proof of this.
So again, the proofs are simple,
once you have this.
So I'm gonna go over that.
Okay, so proposition, first,
P acts like the identity but
only in a small subspace, but
we'll fill claims that A and
AP are roughly equal.
If it's completely the identity
it would have been exactly
equal, but that's not true.
So in other words, P
is a random variable, so I'm gonna
show that the expected
spectral norm is small:
One over root r times that,
and why is that?
A-AP squared is, this is just a
definition of the spectral norm,
max over all unit length vectors
of (A- AP)x squared okay?
Now two cases: x could be
in the row space of R.
Then we saw that P acts as
the identity, so Px is x.
So APx is just Ax, and so
(A minus AP)x is 0, that's no problem.
I'm trying to prove the norm is
small, and 0 is very good, okay,
the zero vector,
the norm is 0.
So we need to consider only
vectors in the orthogonal
complement space,
because every vector is a sum
of something in V and V perp.
So the max is attained
by somebody in V perp,
right, the orthogonal
complement of V.
And in that space, Px is 0,
we saw that,
that's the null space of P.
So (A minus AP)x is just Ax.
Right, Px becomes zero.
So now I have to bound Ax,
okay, for x in the subspace
orthogonal to R.
So x is orthogonal to R.
The rough idea,
what is this saying?
The rough idea of
what I'm gonna say.
I'm gonna say Ax is small.
I picked a sample of rows of A.
And I'm saying everybody
orthogonal to just the sample
is actually fairly
orthogonal to A itself.
That's what this
is saying, right?
If x were exactly orthogonal to A,
Ax would be 0.
So it's saying if it's
orthogonal to a sample,
it's actually nearly orthogonal
to A itself.
Makes sense?
But it's going to be important,
it's length squared,
otherwise it's not
going to be true.
It's not true if you
pick a uniform random sample,
the variance is high, right?
So if you have some rows
that are very heavy and
other rows that are all zero,
the sample doesn't tell
you much, but length squared
is enough to prove this.
So here's the proof of that.
So I'm going to write Ax squared
as x transpose A transpose Ax.
I'm going to rewrite that as
x transpose, (A transpose
A minus R transpose R), x.
So why is this true?
Anybody want to take a guess?
Rx is zero, so I've only
subtracted zero,
that's fine.
So that by definition is at
most the spectral norm of,
sorry, the spectral norm of
A transpose A minus R transpose R.
And I want to say this is small,
okay?
I wanna say A transpose A minus
R transpose R is small.
R is just a bunch of rows
of A I've picked at random.
Any idea what might
make this small?
How would I prove that?
So why is it that if I pick
a bunch of rows at random but
length squared, then this is
a good approximation to this
[INAUDIBLE]?
Yeah?
So it's this one, right?
I am multiplying A transpose by
A by sampling some columns of
A transpose, which is just
rows of A, transpose, right?
I'm taking the corresponding
rows of A which is R.
So this is small because of the
matrix multiplication theorem.
Because of this, I'm not
actually doing it, right,
this is only for the proof.
I'm not actually multiplying
A transpose by A.
So it suffices to prove
that this is small by the
matrix multiplication theorem.
Why, again I didn't write down
all of the why, but I am again,
so to do the multiplication
of A transpose times A,
I have to pick random columns of
A transpose, which are rows of A,
length squared,
rows of A, and that's R.
And I must take the
corresponding rows of A,
and that's R again.
Okay, sorry, the
other one is A transpose.
So the formula gives the
Frobenius norm of A squared,
divided by root r.
In the square it's the 4th power,
everything divided by r, right?
That's all I'm saying.
So this is an expectation.
I was swapping here,
this is expectation,
these are random variables.
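Here is a small numpy sketch of exactly this application; the sizes are illustrative and mine. Length-squared sample r rows of A, rescale them into R, and the spectral norm of A transpose A minus R transpose R shrinks as r grows.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 30))
G = A.T @ A                                # the Gram matrix we "pretend" to compute

def sampled_R(A, r, rng):
    """Length-squared sample r rows; rescaling by 1/sqrt(r * p_i)
    makes R^T R an unbiased estimate of A^T A."""
    p = np.sum(A**2, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=r, p=p)
    return A[idx] / np.sqrt(r * p[idx])[:, None]

def gram_error(r):
    R = sampled_R(A, r, rng)
    return np.linalg.norm(G - R.T @ R, 2)  # spectral norm of the deviation

err_small, err_large = gram_error(20), gram_error(2000)
```

The theorem's bound says the error goes down like the Frobenius norm of A squared over root r, so the larger sample should give a clearly smaller deviation.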
>> [INAUDIBLE]
>> Sorry, good point.
In fact, what I am asserting
is even weaker: the spectral
norm is at most the Frobenius
norm, right?
It turns out, okay, I'm not
able to use more, because
all I have here is a single x;
this way of doing it, with
x in that subspace,
I'm only proving it for a single x.
So excuse me, I should
have said that, but
the 2-norm is always smaller
than the Frobenius norm.
I mean, maybe I should
write down a little proof.
So the 2-norm squared is the max of
x transpose A transpose Ax.
Well, okay.
Ax squared, right, is equal to
x transpose A transpose Ax,
and then one thing you can do
is expand and you get that.
So maybe I won't write down the
proof, but please go home and
do prove this.
[INAUDIBLE]
So
I get that, and then that's
just what I said already and
it's written down
there more carefully.
Okay, excuse me.
Let's go back one there.
So I have proved here so
far only this, right?
I've only proved this.
I have to prove the other
part of the bargain.
There were two
inequalities to prove.
AP and CUR are roughly equal.
Now this is proved
again by applying this.
So C is the length squared
sample of columns of A.
Want to pick
corresponding rows of P.
So you're multiplying A and P.
I picked columns of
A, which formed C.
I must pick the corresponding
rows of P, but
P is of this form.
Okay, so
if I pick some rows of P,
I can always write that
as U times R, right?
I leave the R alone, and I pick
some rows of all of this, and
that's my U.
All I have to notice is that
there's an R at the end,
which I have.
So that tells you how
to find U as well:
you find U by taking
the rows of this quantity.
And the error is now
in Frobenius norm,
because I'm multiplying matrices
A and P, I get that, okay?
And I have to divide by root s,
because I picked s samples.
I still have to bound this
quantity involving the norm of P.
I don't know what it is, right?
I have to get rid of that,
so let's bound this:
P has rank r.
So we assumed R is full rank,
that's why I could write the
inverse, and R has r rows, so
the rank of P is also r,
small r, right?
And we saw it acts as the identity
on an r dimensional subspace.
And the rest is null space;
any such matrix has
Frobenius norm squared r.
Again, you should
prove that at home,
but it's very simple, right?
It's basically just an identity,
an r dimensional identity, right?
The rest is zeros, so it's an r
dimensional identity matrix, so
it has Frobenius norm squared r,
just r ones on the diagonal.
In the correct basis, you should
write it in the right basis,
all right?
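This claim, that a rank r orthogonal projection has Frobenius norm squared exactly r, is easy to check numerically; a small sketch with dimensions of my choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
r, n = 5, 12
R = rng.standard_normal((r, n))            # r rows, almost surely independent
P = R.T @ np.linalg.inv(R @ R.T) @ R       # orthogonal projection onto row space of R

fro_sq = np.linalg.norm(P, 'fro')**2       # trace(P^T P) = trace(P) = rank = r
eigs = np.linalg.eigvalsh(P)               # eigenvalues: r ones and n - r zeros
```

In the right basis P really is an r dimensional identity padded with zeros, which is what the eigenvalues show.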
Putting this together you get,
okay, so this error is r over s,
because the Frobenius norm squared
of P is r and there's an s below,
so this term goes as r over s.
The other is 1 over root r.
So the optimal choice of r
is s to the two thirds.
That's just a little calculus.
Basically, if you have
two terms like that,
you make them equal, right?
That's the best choice, and
you can do a differentiation:
differentiate with
respect to r keeping s fixed.
And you'll see that r
equal to s to the two thirds
is the best choice.
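The little calculus, as a sketch: dropping the common factor of the Frobenius norm of A squared, the two squared-error terms from the two steps behave like r/s and 1 over root r, and balancing them recovers the choice r = s to the two thirds:

```latex
f(r) = \frac{r}{s} + \frac{1}{\sqrt{r}}, \qquad
f'(r) = \frac{1}{s} - \frac{1}{2}\, r^{-3/2} = 0
\;\Rightarrow\; r^{3/2} = \frac{s}{2}
\;\Rightarrow\; r = (s/2)^{2/3} \propto s^{2/3}.
```

At this r both terms are proportional to s to the minus one third, so after taking the square root the error goes down as s to the minus one sixth.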
And so then we get the theorem;
I'm gonna just state it.
So A is an m by n matrix,
r and s positive integers.
C is an m by s matrix of columns,
by length squared sampling,
R is a length squared
sample of rows; r and
s are not necessarily equal.
And the conclusion is,
we can find from C and
R an s by r matrix U,
so that this is valid.
This is true for
any R and S, right?
So, and of course it makes no
sense if r is greater than s;
you get a huge error.
You can make it greater than s,
it's still true, but
it doesn't tell you much.
But the best choice of all
is r equal to s to the two-thirds.
Okay, so
that's only the spectral norm.
They are also bound on
the Frobenius norm,
which I didn't write down.
If there is time later,
I'll go with that, but
maybe that's all I
want to say for now.
I'm going on to
the next lecture.
I mean, next topic which is.
So, this is not the best
you can do now.
I believe there
are better results,
I can give you references for
this, right?
So there are better things
than this to be said.
>> Can you comment on
how you would improve this
if the only thing you can do
is sample the columns?
>> Yeah.
>> Is this the best
in some sense?
Can you prove that the choice of
samples is optimal, right?
Or could you improve upon this?
>> Right.
>> [INAUDIBLE]
>> Good point, so if I
wanted to do one multiplication
of A and A transpose,
then it's true that length
squared was optimal, okay?
So length squared is optimal
once you've fixed
the number of samples you take.
That's still going
to be the case,
except you'll fix the number of
samples slightly differently and
the argument changes.
But I don't remember,
actually, all the changes.
Part of the answer, I don't know
the precise answer actually.
I should look up the papers.
>> [INAUDIBLE]
>> Yeah.
>> That [INAUDIBLE].
>> Okay, so that's right.
So there's a whole new
subject here in some sense.
Length squared is not
the only thing you can do.
It turns out there's, I mean,
later methods now do adaptive
sampling, roughly adaptive
sampling you can think of.
So length squared picked
one vector with probability
proportional to length squared.
I don't know whether
I showed this to you,
but you can instead pick pairs
of vectors, with probability
proportional to the area of
the triangle they span, squared.
So length squared is a one
dimensional measure, right?
Instead I could look at all
the pairs of rows, and
pick pairs with
probability proportional
to the area squared.
Or more, you can pick triples
of rows, with probability
proportional to the square of the
volume of the simplex they enclose.
And so this is called
determinantal point process or
volume sampling; that turns out
to give much better results.
It takes longer and
also the proofs are harder.
I won't do this in this course,
but I mean,
I can give a reference.
How do we construct U?
So going back
here, okay.
I picked some columns
of A to make my C up.
I want to pick the same rows of P,
which is
the same as picking exactly
those rows of R transpose.
So R transpose is a long skinny
matrix, because R was flat.
So I want to pick
the rows of R transpose
which correspond to
the columns of C.
So my U then is, I do that, and
then hit it with
(R R transpose) inverse.
To compute it, I have to
find RR transpose inverse.
One thing I should have said,
I mean maybe that's a good
question to answer,
is the computational effort,
so you've got to sample
and do all these things.
But the computational
effort is mainly
finding the inverse
of this matrix.
But this is a small matrix,
this is r by r.
So I do have to compute
the inverse of that.
The rest of the time,
I'm only doing samples, and
taking the samples, and putting
them together, and so on.
But this one computation
I have to do.
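Putting the whole construction together, here is a numpy sketch of the CUR pipeline; the sizes and the test matrix are mine. One caveat: I use the pseudo-inverse rather than the inverse, since for a low-rank test matrix R R transpose is singular, which is exactly the case the full proof handles without the invertibility assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 400))  # rank 10

def ls_probs(M, axis):
    """Length-squared probabilities along the given axis of M."""
    p = np.sum(M**2, axis=1 - axis)
    return p / p.sum()

s, r = 1000, 40                            # s column samples, r row samples
pc, pr = ls_probs(A, 1), ls_probs(A, 0)
cols = rng.choice(A.shape[1], size=s, p=pc)
rows = rng.choice(A.shape[0], size=r, p=pr)

C = A[:, cols] / np.sqrt(s * pc[cols])             # rescaled sampled columns
R = A[rows, :] / np.sqrt(r * pr[rows])[:, None]    # rescaled sampled rows

# U: the sampled (and rescaled) rows of R^T, hit with the (pseudo)inverse
# of R R^T, so that C @ U @ R estimates A @ P
U = (R.T[cols] / np.sqrt(s * pc[cols])[:, None]) @ np.linalg.pinv(R @ R.T)

rel_err = np.linalg.norm(A - C @ U @ R, 2) / np.linalg.norm(A, 2)
```

For this low-rank matrix the row sample spans the whole row space, so AP equals A and all the remaining error comes from the sampled multiplication step.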
>> [INAUDIBLE]
>> Alternate
choices of R may not give this,
yes, so how is it possible?
I didn't prove that.
>> [INAUDIBLE]
>> What do we do
if A is low rank?
>> [INAUDIBLE]
>> No,
two things I don't know,
meaning,
if A is exactly rank k,
then, so you're asking if I
can do it with less than k rows or
something like that, all right?
I don't know that.
>> [INAUDIBLE]
>> So right now
it is independent of the rank
k, right, so that's right, yeah.
So okay, for
the norm, there
is a best possible approximation
by a rank k matrix if I go to rank k.
And the bound I did not
write down, the answer to this,
will say, I can come within
this much of the best.
There the number of
samples has to depend on k.
It'll be larger than k,
but it has to depend on k.
So the result would be,
if I draw samples, with the number
depending on k, I'll get a bound
on the right-hand side, which is
the best rank k approximation you
can ever get, plus some more.
Maybe I should have
written it down, but
I can, yeah, there are results
like that for a continuous
[INAUDIBLE]
Square root of that.
>> [INAUDIBLE]
>> The Frobenius norm of A?
>> Yes.
>> Square root
of the trace of A transpose A, and
the trace of A transpose A is
the sum of the squared singular
values, so it relates to that.
>> Right, so if you want to put,
I think you're saying that if
I want to put on the right hand
side only the spectral norm, then
I have to use
>> That this is less than or
equal to the rank of A
times the spectral norm, right?
That's what you were saying,
right?
Yeah, so we'll actually
prove that when we do SVD.
But yeah, so actually maybe
everybody should try to prove
that, or recall it from linear
algebra: the Frobenius norm squared
is at most the rank of A times
the spectral norm squared.
But we will prove
that when we do SVD.
So if you're not already
up on it, we'll prove that.
So I could have written
the bound in terms of that, but
it would look pretty bad;
the rank could be m or n,
whichever, as high as that, so
then the bound doesn't look
very good, which is true, I mean.
So I should mention that you
cannot put just an operator on
here without the m or
n that's not possible.
>> [INAUDIBLE]
>> So, yes you can.
You do get concentration. Okay.
So, with some catches, right?
You get concentration
from the fact that the s choices
of columns are independent.
That's all.
There are only s degrees of
independence, or r degrees.
And s and r are thought
of as constants, right?
So you only get that
much concentration.
So then we switch
to the next topic.
>> So [INAUDIBLE] so
CUR stands for a dog?
[LAUGH]
>> No, C for columns and R for rows.
But I guess, yeah, it's not
the best name, it's cur or
something right there.
Maybe if you could
put an e at the end
then it would be cure,
but I just called it CUR.
>> So the total input,
somebody input [INAUDIBLE] and
>> Yeah.
>> [LAUGH]
>> Or curse or
cut it or something, yeah.
Okay, singular value
decomposition.
So, I'm going to do this
in a self-contained manner.
Now, clearly everybody
knows linear algebra,
but linear algebra
unfortunately often
does not cover SVD, right?
You do spectral
decomposition and so on, but
not necessarily SVD.
Somehow, SVD is more useful
in terms of handling data;
in a sense,
more useful than
eigenvalues and eigenvectors.
Of course, they serve different
purposes, but it's less covered.
So let's see one definition of
SVD; there are many equivalent
definitions, I'm gonna
start with one angle.
You have n data
points in d space,
each represented as
a row of a data matrix.
I guess I had called it columns
at some point, sorry, so
it is an n by d matrix, each
row has one data point.
The singular value decomposition is
going to consist of the best-fit
k dimensional subspace for the
data, [INAUDIBLE] for each k up to
the rank of A, where best fit means I
minimize the sum of squared
[INAUDIBLE] distances of the data
points to the subspace.
Okay, now we will see,
which is not obvious, that
[INAUDIBLE] simultaneously.
Perhaps we draw a picture.
So your data points are here.
So you have a subspace
of dimension two.
So [INAUDIBLE]
since it's a subspace,
the origin must be on it.
This is the subspace.
So we take the sum of squared
perpendicular distances.
So you project.
We take for each point its
perpendicular distance
to the plane, right, perpendicular
to the plane, this plane, right.
And square them each.
Square and add.
Again the square is important,
right, and add them.
That's the same as, and
we'll have another picture in
the next slide, that's the same
as if we maximized the sum
of squares of the length of
the projection of the data
points into the subspace.
That's the same as, taking this
as the length of the projection.
And for instance for this,
from the origin to that, right?
In that space.
I think I have a better picture
on the next slide.
So I take those projections,
square and sum,
they are the same, we'll see the
picture will make it obvious.
So this is the picture I took.
This is a data point a_i.
This is my one
dimensional subspace, so
I either take these distances,
square them and
add, and minimize that, okay.
Or take these projections,
square them and add, and
maximize them, right?
And they will give you
the same result,
courtesy of Pythagoras, because
the sum of squares of these two
is a fixed constant,
independent of the subspace
that I'm worried about, right?
It's just the length
of a_i squared.
Therefore, I can think
of it as maximizing.
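The Pythagoras identity behind this is easy to verify numerically; here is a small sketch with an arbitrary 2-dimensional subspace and arbitrary data of my choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 6))                 # 100 data points in 6-space
V, _ = np.linalg.qr(rng.standard_normal((6, 2)))  # orthonormal basis of a 2-dim subspace

coords = A @ V                                    # projection coordinates
proj_sq = np.sum(coords**2)                       # sum of squared projection lengths
dist_sq = np.sum((A - coords @ V.T)**2)           # sum of squared distances to subspace
total = np.sum(A**2)                              # sum of squared point lengths
# Pythagoras: proj_sq + dist_sq == total, whatever the subspace is
```

Since the total is fixed, minimizing the squared distances over subspaces is exactly maximizing the squared projections.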
So, for example, this is
the first advantage of using
sums of squared projections or
distances.
If I used plain distances
or projection lengths, you
wouldn't have Pythagoras.
we have to contrast this just to
make sure we see the difference.
In calculus,
we look at least-squares fit and
that's slightly different,
right?
So least-squares fit is what,
you have a bunch of
points in a plane.
This is (xi, yi).
You want to fit a best
linear function to this and
now the distance is
a vertical distance, okay?
I mean, that's what this is:
not distances perpendicular
to this line, right?
So it's different.
I mean, these squares again,
you have squares for
a reason, right?
You remember when
you differentiate,
you get a linear
equation, so it's easy to solve.
Something like that will happen
here; that's why you have
squares.
There's perhaps
a bigger difference
in the usual least squares fit.
The line doesn't have to go
through the origin, right.
Least squares fit.
Be careful: we call
that least squares, and
what we have, best-fit.
And least squares will
go away after a minute;
I'm not worried about that,
right?
So in least squares you
don't require the line to
go to the origin, but in SVD
you do, it's a subspace so
it has to go to the origin.
So you may wonder,
well, is that a good idea?
Suppose my points
actually look like this.
The best one dimensional fit,
the best-fit, is some
line like this which doesn't
pass through the origin, right?
If I require it
to pass through
the origin, I get a bad fit,
okay?
That may bother you, but
it'll turn out it's not very
difficult to prove this.
We won't do it now,
I'll do it later.
Is that in fact,
if I want the best-fit line.
Not necessarily passing
through the origin.
You can show that the line
must actually pass through
the centroid of the data.
Centroid is just the average
of all these things.
So these data points,
the average is somewhere here.
You can show that
the best-fit line
actually passes
through the centroid.
So you can try that
as an exercise;
if there's time later we'll come
back and do it, it's just a simple exercise.
So in fact, if you took any line
and translated it parallel to
itself until it passes
through the centroid,
that's the best translation.
Just best parallel translation
that you can do, okay?
You just prove it by
writing this down.
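A quick numerical check of this exercise, with arbitrary data and direction of my choosing: among all parallel translates of a line with a fixed direction, the one through the centroid minimizes the sum of squared perpendicular distances.

```python
import numpy as np

rng = np.random.default_rng(6)
pts = rng.standard_normal((200, 2)) + np.array([5.0, 3.0])  # data far from the origin
v = np.array([1.0, 0.0])                                    # fixed line direction

def ssd(points, point_on_line, v):
    """Sum of squared perpendicular distances to the line {point_on_line + t v}."""
    d = points - point_on_line
    proj = np.outer(d @ v, v) / (v @ v)
    return np.sum((d - proj)**2)

centroid = pts.mean(axis=0)
err_centroid = ssd(pts, centroid, v)
err_origin = ssd(pts, np.zeros(2), v)                       # forced through the origin
err_shifted = ssd(pts, centroid + np.array([0.0, 1.0]), v)  # some other translate
```

Both the line through the origin and any other parallel translate should do strictly worse than the one through the centroid for off-center data like this.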
That's true for any dimensional
subspace, actually.
If now I'm sitting
with points in 3D and
I want to find the plane fit,
then you can prove that the best
one passes through the centroid.
If it didn't,
you could translate and gain.
Okay.
Therefore, if you wanted
the best affine space.
Affine means any
translate of a subspace.
Affine space,
you should take the centroid.
You can find it by
taking averages.
And do the translation before
you apply the SVD.
Okay, so if somebody put
all the data way out,
translated, doing SVD directly
may not be a good idea;
you'll get a line like this,
when what you really want is that.
So that's called centering data,
we'll go through
the proof of this.
You can easily do it for
lines, or for i dimensional subspaces;
it's still just a
calculation, we'll do it later.
So the most important
thing is for SVD,
we will show that
the greedy algorithm works.
The greedy algorithm is find the
best-fit 1-dimensional subspace
to the data: line
(through the origin).
Then, find the best-fit line
through the origin again,
perpendicular to the first line.
So if the
data lie on a plane like this,
coplanar, then you'd like to
find both axes of this plane.
So if you do this process
you'll find the two axes.
By the way, this is probably
one of the earliest or
the earliest greedy algorithm.
So it's an important
greedy algorithm.
We'll see that it works
correctly, and it dates back
to before all of those things.
It's a very early algorithm,
right?
So, at the second
step, you look for a line,
a best-fit line through
the origin again,
perpendicular to the first.
At the ith step, you look for
the best-fit line perpendicular
to the ones you've got already.
You keep doing this until
i reaches rank of A,
and we'll show that
when we're done.
So I'm going to write down,
just because some of you may
be familiar from linear algebra
with the matrix version.
Don't pay any attention to it
because that's not the viewpoint
we want to take.
But in any case,
at the end of the day,
you could write A as
a product of three matrices.
Where the columns of V are these
lines, these lines we found.
D is a diagonal matrix
with positive entries.
In spectral decomposition, D
would have been the eigenvalues.
Now, it's the singular values,
which we'll define, and
the columns of U and
V are orthonormal.
This is called SVD.
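For those who do know the matrix version, a quick numpy check of the factorization, using the library SVD on an arbitrary example matrix: A = U D V transpose, with orthonormal columns and positive diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((8, 5))
U, d, Vt = np.linalg.svd(A, full_matrices=False)

recon = U @ np.diag(d) @ Vt      # should reproduce A exactly
gramU = U.T @ U                  # identity: columns of U are orthonormal
gramV = Vt @ Vt.T                # identity: columns of V are orthonormal
```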
Again, now focus just
on the best-fit lines,
forget the matrix factorization,
we'll come back to it, okay.
Okay, first singular vector,
right?
Notation, this is just a recall:
A is an n by d matrix,
each row is a data point.
So if v is the unit vector
along the best-fit line,
we know that, again,
I'm just recapping this,
v minimizes, among all
unit length vectors x,
the distance from the data points
of A to the line through x, right?
To the line with x as the unit
vector, the direction x.
So, this is just
the projection on x.
I subtract out the projection
from the length and
that's the distance.
So the distance is this, and
again, this is just redoing it.
And again,
this sum is a constant;
the sum of the squared lengths
of the rows is just the Frobenius
norm of A squared, it doesn't depend
on x, so I may discard that.
Okay, I have a minus sign,
and I'm minimizing, so
that's like maximizing
Ax squared.
So we define that to be
our first singular vector.
So we take the maximum over
all unit length vectors v of Av,
the length of it, okay?
It's the same as best-fit,
minimizing the sum of
squared distances, right?
So I should have
had a square there,
because I'm maximizing the sum
of squared projections,
but maximizing something
squared is the same as
maximizing the absolute value.
The answer is squared,
but the result is the same.
Okay, there can be ties.
There could be multiple v1s.
Let me ask you a simple question.
Are there going to be ties, or
maybe sometimes there won't be?
Actually,
there is always a tie.
What's an obvious tie?
>> [INAUDIBLE]
>> Minus v1, right?
So we could put a minus.
For eigenvalues, also, that's true:
an eigenvector and the negative of
it are both eigenvectors with
the same eigenvalue, right?
You choose one.
So there can be ties;
we break ties arbitrarily.
So this argmax is not unique,
okay.
And the singular value
is just that length.
So you can view
the singular value
as how much A magnifies
a vector:
the maximum amount A magnifies
the length of any vector.
So if you draw the isotherms,
the curves along
which you get the same value
of Av, it'll be an ellipsoid.
I didn't define ellipsoid and
all that, but
the longest axis will be
the first singular value.
So ignore this picture if
you're not familiar, but
that's called the first
singular value of a.
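A small sketch checking this characterization against the library SVD, on an arbitrary matrix of my choosing: the first right singular vector attains the maximum of the length of Av over unit vectors, and that maximum is the first singular value.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((40, 5))
sigma = np.linalg.svd(A, compute_uv=False)     # singular values, descending
v1 = np.linalg.svd(A)[2][0]                    # first right singular vector

best = np.linalg.norm(A @ v1)                  # should equal sigma[0]

# random unit vectors should never beat v1
trials = rng.standard_normal((1000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
others = np.linalg.norm(trials @ A.T, axis=1)  # |A x| for each unit x
```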
Okay, one thing I should say,
we'll show that the singular
values are unique, okay?
There's no, even if there are
ties for the singular vectors,
singular values are unique,
okay?
We'll see that later.
If the data points all lie on
a line through the origin,
the line which maximizes the
projections is precisely that line.
If the data points are collinear
on a line through the origin,
you will get that line, right,
that's where the projection
is maximized.
So that's the first
singular vector.
So now, what happens if the data
points are coplanar rather than
collinear?
We'd like to get the two
perpendicular directions
spanning the plane.
So that motivates
this definition.
We want to define further
singular vectors, so
think of coplanar data.
We really would like the two
dimensional subspace maximizing
the sum of squared projections.
So we want to take points,
project them down,
this is the origin.
So we want to maximize
the sum of squares,
the sum of these lengths squared,
right?
Now, let's say for the moment
I don't know how to find two
dimensional subspaces
which do this maximization.
I can try to do a greedy.
We define the second singular
vector v2 as the one, and
here break ties
arbitrarily again,
which maximizes the sum of
the squared projections,
but has to be perpendicular
to the first vector.
A little algebra shows you
that's the same as this, right.
So again, there would
have been a square there,
but I don't care to put
the square, so now, v2 is
the unit vector v perpendicular
to v1 that maximizes that.
Okay, and v3 is the v
perpendicular to both v1 and
v2 that maximizes this.
And then you define the second
singular value as how much
magnification you get
in the v2 direction,
the third singular value as
how much magnification you get
in the v3 direction, and so
on, okay?
And you go on until it's 0.
That is to say,
I've found v1 to vr.
Every v perpendicular to those
is in the null space of A;
it makes a zero product with A,
right, then you stop.
R will be the rank of A,
we'll prove that later.
And more importantly, if this is
not obvious, even if there were
ties for the singular vectors,
the singular values are unique.
Right?
I mean, as advertised,
I could have gone down
different paths for
choosing singular vectors.
There's no simple reason right
away that the singular values
are unique, but
we'll see that they are.
Okay.
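As a quick numerical sanity check of this greedy definition (a sketch assuming Python with NumPy; the matrix here is toy random data, not from the lecture): no unit vector should magnify more than sigma 1, and no unit vector perpendicular to v1 should beat sigma 2.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))  # rows are 20 toy data points in R^5

# NumPy's SVD gives the singular vectors and values directly.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]

# Brute force: no random unit vector magnifies more than sigma_1 ...
best = 0.0
for _ in range(2000):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    best = max(best, np.linalg.norm(A @ v))
assert best <= s[0] + 1e-9

# ... and no unit vector perpendicular to v1 beats sigma_2.
best_perp = 0.0
for _ in range(2000):
    v = rng.standard_normal(5)
    v -= (v @ v1) * v1            # remove the component along v1
    v /= np.linalg.norm(v)
    best_perp = max(best_perp, np.linalg.norm(A @ v))
assert best_perp <= s[1] + 1e-9
```

Ties and sign flips don't affect these inequalities, which is consistent with the singular values being unique even when the vectors are not.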
>> [INAUDIBLE]
>> No,
so they are invariants
of the matrix.
So they do not depend on
the singular vectors I choose,
as advertised.
So I may have multiple choices
of each singular vector, but
whatever I choose, the same
singular values come out;
the singular values are unique.
It'll follow from
the main theorem.
So the main gist of this most
important theorem is that the
greedy algorithm works.
So let's define best-fit k
dimension subspace again.
Just done it already, as the one
maximizing the sum of squared
projection lengths of data
points into the subspace,
maximized over all k
dimensional subspaces.
The theorem says the following,
it says the Greedy algorithm
works, that you have this and
I've defined singular vectors.
So again, greedy because I
pick the best single vector v1,
the best single vector v2
perpendicular to it, and so on.
I didn't pay attention to
the projections onto the k
dimensional space all at once;
I paid attention to one
direction at a time.
So it's greedy. So
for k = 1 to r, let
Vk be the subspace spanned
by the singular vectors
v1 through vk.
Again, that could have been ties
and we picked one choice, but
any choice.
For each k, the assertion is
that V(k) is a best-fit
k-dimensional subspace for A.
Now, I shouldn't say the
best-fit, it should be a
best-fit, right?
There could be many
best-fit subspaces.
There could be ties, right?
So when you want examples for
things like that, the identity
is a good example, lots of
ties for the identity, right?
Every singular value is 1,
and every unit vector is
a singular vector.
So this is what we'll prove now.
Again, this is one of the,
I would say this is the earliest
Greedy algorithm, so
it's very important,
historically, as
well as a useful one.
Very important algorithm, right.
Proof by induction on k.
Statement is obvious for
k = 1, right?
So for k = 1, v1 gives the best
one-dimensional subspace
by definition.
So we proved that.
Let's do k = 2.
Suppose W is the best-fit
two dimensional subspace.
I'd like to prove somehow that
V2, sorry, let me go back.
I found v1 and v2;
their span is capital V2.
I'd like to say V2 is a best
fit, not the best, there
can be ties.
Here, I assume W is the best;
I must prove V2 is
at least as good as W.
First, a little technical
claim: there is a w2
belonging to W, of length 1,
that is perpendicular to v1.
Perhaps a little picture here.
So here's the origin.
Here is V1 and I have a two
dimensional subspace W.
The assertion is there's a w2
in this subspace, of length 1,
which is perpendicular to this.
The reason is I project v1 onto
W, and I can take w2 at a 90
degree angle to that projection,
although it doesn't look like
that here.
Because the projection of
v1 onto W spans at most a
one-dimensional subspace of W,
you can take w2 orthogonal to it.
The picture need not
be quite like this;
the picture could be that
v1 is perpendicular to W,
and then I can take
anybody here as w2.
Choose any w1 which is
perpendicular to w2.
So w2's here and I choose w1,
could be this one,
right, scaled to 1.
And so the w1, w2 form a basis.
They are perpendicular
in length 1.
So by convention, by basis
here I'll mean a set of
orthonormal vectors, which
are all of length 1, right.
So I won't have to
repeat length 1.
So basis is going to mean that.
Now, well, so I'm trying to
prove that V2 is as good as W.
The sum of the squared
projections onto W
equals this.
And that's just because
[INAUDIBLE], right?
>> [INAUDIBLE]
>> So I have the data point ai.
Its projection onto W, squared,
is just (ai dot w1) squared
plus (ai dot w2) squared.
w1, w2 is a basis so I take the
components along the two basis
vectors, and square and add
them, that's where I get that.
And then I sum over all i,
and I just get that, okay?
The squares are important.
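This decomposition of a squared projection into squared components along an orthonormal basis can be checked numerically (a NumPy sketch; the point and subspace are arbitrary toy choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.standard_normal(4)        # a toy data point in R^4

# An orthonormal basis w1, w2 of a random 2-d subspace W of R^4.
W, _ = np.linalg.qr(rng.standard_normal((4, 2)))
w1, w2 = W[:, 0], W[:, 1]

# Squared length of the projection of a onto W = (a.w1)^2 + (a.w2)^2.
proj = (a @ w1) * w1 + (a @ w2) * w2
assert np.isclose(np.linalg.norm(proj) ** 2, (a @ w1) ** 2 + (a @ w2) ** 2)
```

The identity holds precisely because w1 and w2 are orthonormal; it would fail for a skewed basis.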
Now, I want to
assert this is true,
that v1 is at least as good
as w1, so why is that?
>> [INAUDIBLE]
>> v1 is the best, right,
yeah. How about this:
v2 is at least as good as w2.
>> [INAUDIBLE]
>> So, v2 was the best
among everything
perpendicular to v1.
And we made sure w2 is
perpendicular to v1, so
v2 is the best among that lot,
therefore it must be at least
as good as this, right?
We add the two: this is what
W's sum is, and this is what
capital V2's projection is, so
V2 is at least as good.
So, that's proved for k=2, now
I want to prove for k higher,
right?
So, the inductive hypothesis
we're gonna assume is good for
k- 1 and I want to prove for k.
So, again it's more, or less the
same proof, I assume that W is
the best fit k dimensional
subspace let's say.
And I want to claim, just let
me erase here, I want to claim
that there is a unit vector wk
in W perpendicular to V sub k-1.
So, W is a k-dimensional space,
and V sub k-1,
which I can't draw fully,
is one dimension lower, so
there's somebody here
perpendicular to all of that.
It's just a dimension count,
right?
So, wk is perpendicular
to all of that.
So, now we choose a basis
w1 through wk of w.
You can always take a vector and
complete it to a basis, right?
And so, you can choose these
orthogonal to this and so on.
So they span W, and
now take the first k minus 1:
I want to claim that this, the
sum of squared projections
onto them, leaving out
the last one, right,
is less than or equal to that,
okay, why is that?
That's the inductive
assumption, right. Capital
V sub k minus 1 was the best k
minus 1 dimensional subspace;
we assumed that inductively.
And w1 through w k minus 1 span
a k minus 1 dimensional space,
so V sub k minus 1 must be at
least as good as
that, and that's my induction.
We also have this.
And that is again
because vk was the best
among everything perpendicular
to v1 through v k minus 1,
and wk was perpendicular
to all of those too.
Therefore this is at least
as good as that. Okay.
As you see the proof is very
simple, right?
I mean, and then you just add,
you're done, right?
Okay.
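The theorem can be illustrated numerically (a NumPy sketch with toy random data): the span of the top k right singular vectors gathers at least as much squared projection mass as randomly chosen k-dimensional subspaces, and its mass equals sigma 1 squared plus ... plus sigma k squared.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 6))  # 30 toy data points in R^6
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Vk = Vt[:k].T                     # 6 x k, orthonormal columns spanning V_k

def proj_mass(A, W):
    """Sum of squared projection lengths of the rows of A onto span(W)."""
    return np.linalg.norm(A @ W, ord='fro') ** 2

best_random = 0.0
for _ in range(2000):
    # a random k-dimensional subspace via QR of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((6, k)))
    best_random = max(best_random, proj_mass(A, Q))

# Greedy wins: no random subspace beats V_k, and V_k's projection
# mass is exactly sigma_1^2 + ... + sigma_k^2.
assert best_random <= proj_mass(A, Vk) + 1e-9
assert np.isclose(proj_mass(A, Vk), np.sum(s[:k] ** 2))
```

The random search only lower-bounds the true maximum, but the theorem guarantees nothing can exceed the greedy span's mass.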
So, the consequences, I guess,
if you are looking at the book,
usually everything I
have done is there.
But this one we didn't do, so
I will spend a few minutes
trying to prove that.
The theorem also proves that
the singular values are unique,
even if there are ties for the
singular vectors. Why is that?
Sigma one is unique.
Okay.
But before uniqueness you have
to prove that it exists;
it was a maximum over
infinitely many things, right?
Sigma 1 was the max over all
unit length vectors over here.
And the set of unit length
vectors is closed and bounded,
right?
So we know that.
The maximum is attained,
otherwise there would
be a bit of trouble.
The maximum is attained and
that's sigma 1 of A.
And it's defined and
exists and is unique, okay?
Now, there's a unique value,
the maximum over 2d subspaces
of the sum of squared
projections onto the subspace.
Now, this is a little trickier:
is
the set of 2d subspaces
closed and bounded?
It doesn't look like that,
right?
Subspaces are unbounded.
I'll show you on the next
slide that this is
a reasonable thing to do, right?
So, that mu2 exists.
So, call that mu2.
Right.
So, we have to prove
that mu2 exists.
For that we have to prove
that the maximum is attained.
It is in fact attained,
even though it looks like
the center subspaces aren't
bounded, but it's still true.
Okay, we'll see that.
So, let's assume mu2 exists,
then theorem says that's mu2.
And that's equal to that, right?
So, sigma 2 squared is
just mu2 minus that.
This is unique, this is unique,
so this is unique.
And in general, by induction,
these are all unique.
Mu k would be the maximum over
all k dimensional subspaces
of the sum of squared
projections onto the subspace.
And the theorem implies
that this is equal to mu k,
provided mu k exists.
Again, we have to prove that;
using the inductive hypothesis
we then see that
sigma k is unique.
So, I have to prove to
you mu k exists, so
let's do that exercise, I mean,
this is not very relevant,
normally it's just assumed that
after having done, whatever
advanced calculus, but let's
just go over this proof, right?
So, I want to prove that
a sequence of subspaces has
a convergent subsequence,
right?
So, we have v1, or,
these are not my v1, excuse me;
these are just any infinite
sequence of k dimensional
subspaces of R to the d.
So, I should have called
them something else.
I want to prove that there is
a convergent sub-sequence.
And the caution is
that subspaces are
unbounded objects.
So, what you do is you
choose a basis for each Vi.
And then first take
a subsequence of the sequence of
subspaces in which the first
basis vectors converge.
The basis vectors are unit
length objects, right?
So they lie in a closed,
bounded set.
So, there is a subsequence
that converges, and
then take the subsequence
of the subsequence where
the second basis
vector converges.
Again, the second basis
vector is unit length, so
everything is closed and bounded.
So, you can repeat that process.
And finally you get a
subsequence with each basis
vector converging.
And you have to do a little
proof about the limit:
the limit of a bunch of vectors
of length one is still of
length one.
Right?
And also the limit of a bunch
of orthonormal basis vectors
is still orthogonal.
But these are obvious;
limits don't change much,
okay?
So, this was an aside just to
jog your memory of calculus,
right?
But it does need a proof,
because it's not obvious that
a sequence of subspaces has
a convergent subsequence.
So, let's see if I've
forgotten something.
I don't think I forgot anything.
I don't know whether
I am accelerating,
but I think I have enough
slides, so I can go on. Okay.
So, singular values and norm.
So, Av1 is a list of lengths,
with signs, of the projections
of the rows of A, right?
So, I take each row and
project it onto v1;
that's each component of Av1.
So, sigma 1 of A, which is
the length of that vector,
you can think of as
the component of A along v1.
Each entry of this vector is
the component of one data point
along v1.
Together you can think of this
as sort of the projection of
the component of
the data along V1.
Sigma 2, similarly, you can
think of as the component of
the data along v2.
Now this is sounding like,
decomposing a vector into
its components along
basis vectors, right?
In a sense that's
what we're doing.
And if that analogy
is to carry over.
So for a vector you have that:
you take the components
along basis vectors,
square and add, and
you get the length squared, right?
So you better have
the same thing here.
Sum of squared components
equal to the whole.
Then the analogy would be good,
so let's see if that's true.
And that is in fact true, right?
I'm going to take each ai,
project it along each of
the singular vectors,
and add after squaring.
But first observe
that if I have a vector v
orthogonal to all the r singular
vectors, then ai dot v is 0.
Because remember,
we stopped when we couldn't
find anything orthogonal.
So these are all 0, therefore
the length of ai squared is
just the sum of its components
along all the singular vectors,
squared.
Because you'd really have to
also include vectors orthogonal
to v1 through vr, but those make
zero products with ai, so
this is true.
And you just now add up, so
I think I did a little
bit of calculation here.
I add up over all the rows,
all the data points, each
one of the data points.
I take the products with each
singular vector and exchange
the sums here, and now this
inner sum over the data points
is the length squared of A vt,
okay?
So that proves
the lemma.
This thing is the Frobenius
norm of A, right?
Which we called, squared.
I took the sum of squares in
row j, but I summed over all j.
So that's just equal to the sum
of squares of all the entries.
And we proved that
that's the sum of squares of
the singular values. Okay?
So that actually proves this.
That proves this, because r
is equal to the rank of A.
And one other thing
I didn't write down,
sigma 1 squared of A.
So I guess I should have
written this quantity down, sorry.
The top singular value is just
also the spectral norm, right?
We defined spectral
norm earlier.
It's the same definition.
The top singular value is
equal to the spectral norm.
I should have written that down.
And then, we have sigma 1
squared is greater than or
equal to sigma 2 squared.
That's by definition right?
This is the best overall,
this is the best perpendicular
to the first, and so on.
So they're in
descending order,
therefore you get that
from that, right?
Because this left
hand side, the sum over t
equals 1 to r of sigma t
squared of A, is less than or
equal to r times sigma 1
squared of A, since the
sigmas decrease.
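These norm facts can be verified numerically (a NumPy sketch on a toy matrix): the Frobenius norm squared equals the sum of squared singular values, the top singular value is the spectral norm, and the Frobenius norm squared is at most r times the spectral norm squared.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((15, 7))            # toy data matrix
s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
r = np.linalg.matrix_rank(A)

# Frobenius norm squared = sum of squared singular values (the lemma above).
assert np.isclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2))

# Top singular value = spectral norm.
assert np.isclose(s[0], np.linalg.norm(A, 2))

# Since the sigma_t are non-increasing, ||A||_F^2 <= r * sigma_1^2.
assert np.sum(s ** 2) <= r * s[0] ** 2 + 1e-9
```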
Okay, so I'm going to complete
this SVD; that's what's going
to happen in
the next few slides.
I'm gonna tell you how to
put this together to get
the singular value
decomposition.
So, right and
left singular vectors, right?
So, these v's went on
the right hand side of A, so
they're called right
singular vectors.
The vectors Avi will give
us the left singular vectors.
In the case of eigenvalues,
if you had an eigenvector
v, you've got lambda v.
So when I hit an eigenvector
with A, you get the same vector,
collinear, but
in the case of singular vectors
that's not necessarily true.
In fact there's one simple
reason why it's not true:
A might be rectangular.
So A might be like that, right?
So v would be a short vector;
Av would be a long vector.
So even the dimensions of v and
Av don't match.
Eigenvalues and eigenvectors
are only defined
for square matrices, right?
But singular values are defined,
I should've said
that earlier on,
singular values are defined for
rectangular matrices of course,
because remember the rows were
data, so the number of data
points and the dimension will
not be the same [INAUDIBLE], so
we call the v's the right
singular vectors, and
the vectors A times vi form
a fundamental set of vectors.
I call them the ui's
after normalizing.
So Avi has length sigma i,
remember, that's
how we defined it.
I normalize to length one,
and that gives ui.
We'll show later that the ui
are very analogous to the vi,
so the ui behave as if they were
more or less the same thing,
except on the other side, right?
So u1 maximizes u transpose A
over all unit vectors u, u2 over
all u perpendicular to that,
and so on.
And u transpose A is the same
as A transpose u, transposed.
So, I should have said, the ui,
the left singular vectors of A,
are the right singular
vectors of A transpose.
We have to prove this,
but that's what the left
singular vectors
of A are, right?
By definition, the right
singular vectors v1 to vr
are orthogonal, because I defined
v2 to be orthogonal to v1,
v3 to be orthogonal to both
of those, and so on, right?
We'll show later that the left
singular vectors are also
orthogonal.
We won't show that yet.
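Numerically (a NumPy sketch on toy data), the definition ui = A vi / sigma i can be checked, along with the orthogonality of the ui that the lecture defers:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((12, 5))  # rectangular: v in R^5, Av in R^12
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# u_i is defined as A v_i normalized; its length before normalizing is sigma_i.
for i in range(5):
    Avi = A @ Vt[i]
    assert np.isclose(np.linalg.norm(Avi), s[i])
    # normalized, it matches NumPy's left singular vector
    assert np.allclose(Avi / s[i], U[:, i])

# The u_i are orthonormal (to be shown later in the lecture).
assert np.allclose(U.T @ U, np.eye(5), atol=1e-10)
```

Note the shapes: v lives in R^5 but Av in R^12, which is exactly why an eigenvector equation Av = lambda v cannot even be posed here.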
Now I come to what's probably
it's an important thing,
we'll find it has a lot of uses,
right?
It's called a singular
value decomposition.
In the case of eigenvalues,
you've heard of spectral
decomposition; it's sort of similar.
They are related, but I won't
tell you the relation yet,
because I want to focus
just on singular values and
singular vectors;
I don't want to say it's
somehow copying eigenvectors.
In a way it's not.
But, by the way, even for
square matrices the eigenvalues
and eigenvectors can be quite
different from singular
values and singular vectors.
So for instance,
if you know what's called
PageRank, where Google decides
how to rank pages,
that's based on an eigenvector.
You wouldn't have anything good
if you based that on
singular vectors of A.
There are other cases where
singular vectors are useful.
They can be quite different,
when they're both defined also.
So let A be any matrix,
and vt, ut and sigma t,
in the notation we've already
introduced, are the right
singular vectors, left singular
vectors, and singular values.
I'm going to state the
singular value decomposition.
Then the Singular Value
Decomposition, the SVD theorem,
says A is exactly equal to this
sum, and it ends at r, right?
There's no more after that.
Exactly equal to Sigma t, which
are all non negative, right?
Times ut times vt transpose.
ut is a column vector.
vt is also a column vector,
but transpose is a row vector.
So this is an outer product.
Namely, each of those terms
looks like ut vt transpose.
So the product is
a full-fledged matrix, right?
Full dimensional matrix.
It's a sum of those, okay, is
exactly equal to sum of those,
right?
That's what we want to prove.
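Before the proof, the identity can be sanity-checked numerically (a NumPy sketch; np.linalg.svd returns exactly these factors):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))   # any rectangular matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as the sum of rank-one outer products sigma_t * u_t v_t^T.
B = sum(s[t] * np.outer(U[:, t], Vt[t]) for t in range(len(s)))
assert np.allclose(A, B)
```

Each `np.outer(U[:, t], Vt[t])` is a full 8 by 5 rank-one matrix, matching the point that ut vt transpose is an outer product, not an inner product.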
The proof won't be hard, we've
already set up many things.
First off, for the proof, so
to prove that two matrices
are identical, this and that,
I wanna prove they're the same.
It's enough to prove,
and in fact it's if and only if,
that operating on any vector
they do the same thing.
So A applied to v and B applied
to v give you the same vector.
And this is one line
proof of that, right?
So if Av = Bv for
all v,
in particular it holds for
the unit coordinate vectors, so
the columns are the same.
The converse is very simple.
Okay, so now we'll use this,
we want to prove A and
B are the same.
Where B is this, we want to show
that Av equals Bv for all v.
And so
it's enough to show, okay,
this is another standard thing
in linear algebra, right?
When you want to show a bunch
of linear equations hold,
it's enough to show it for
basis like this.
Cuz every vector is a linear
combination of the basis, and so
that's fine.
So you'll have to show for a set
of v forming a basis of space,
you're gonna take
a convenient basis,
which consists of the singular
vectors v1 through vr.
But then we run out of them,
right, there are only r of them.
They may not span
the whole space, right?
A may not be full rank, so
they may not span all of
the d dimensional space.
We complete the basis
with other vectors.
We can always do that when
you have a partial basis.
When you have a set of
orthogonal unit length vectors,
you can always complete
it to a basis.
We choose some completion, so
it contains the first r singular
vectors, such a basis exists,
why?
You know that.
So then now the proof
is very simple: for
t = 1, 2, ..., r: Avt = sigma t ut.
So let's write this down.
So Avt, that we know by
definition is sigma t ut,
that's the definition of ut.
How about Bvt?
So Bvt = sum over all, t prime
let's call it, 1 through r.
So B, I'm gonna write
down just B itself.
Sigma t prime ut prime
vt prime transpose.
This is B, and
I'm multiplying it by vt,
Okay, so
this is an inner product.
By the way, that's not true for
outer products,
which we'll see later.
This is an inner product.
So it's 0 unless
t prime equals t,
since the vt are orthogonal.
Okay?
So I should recall, and we'll
say that next time: vt vt prime
transpose, the outer product,
is not 0 even if t
is not equal to t prime, right?
So if these are both
coordinate unit vectors, say
one is the first unit vector and
this is the second unit vector,
their outer product
is not zero, all right?
It's a matrix with a single 1
in it, so you have to be
a little careful sometimes.
That's going to come up,
but this is inner product,
this is 0, they are orthogonal.
Right, so you get 0 for
everybody except t,
so you get sigma t ut.
Okay, and therefore that's
true for the first r, and
then for everybody else, Avt
is 0 cuz we had to stop at r.
And Bvt is also 0, because
each term of B ends in a
vt prime transpose with t prime
at most r, and those are all
orthogonal to vt, all right?
So that proves that the singular
value decomposition is true.
And, yeah, that's the end;
the two sides are equal, yeah.
>> So
this construction that you had,
how do you pick the signs
to get the orthogonal form?
Because for any vector
vi that you're picking,
you can pick the negative,
right?
>> Yeah.
>> But then normally when
you state the singular
value decomposition,
it can only be in the form where
the sigmas are non-negative.
>> Okay, good point.
So what happens if-
>> And I suppose this,
the ordinal bases are also
still waiting, right, and
then the relevant terms is when-
>> Yeah, so right,
so we should check that, right?
Suppose I replace
vt by minus vt,
then ut will also get
replaced by minus itself.
Avt =sigma t ut, right?
So if I multiply this by
minus 1, so will ut be.
So then you're fine, right,
because this thing
will stay the same.
The product will stay the same;
the two minus ones cancel.
>> Yeah that's fine, but in
general is there a handy way to
mess up the basis of that part?
>> Yeah whether-
>> Is there a good step without
sigma being non-negative,
you could flip-
>> You could flip, or
if there are ties you could
have chosen any one, right?
So is that what you're asking,
so in case of ties
I can choose any basis.
Okay, so it is sort of exactly
like the eigenvalue case.
If the top 100 singular
values are all the same,
the 100 dimensional space,
consisting of the first 100
singular vectors is an invariant
subspace in some sense.
And you can choose any
orthonormal basis.
>> My question is if you
write this into a block form,
[CROSSTALK] with you ve
transposed, then what are you
being transposed into
being successful there,
I guess it's a [INAUDIBLE].
>> Yeah.
>> This really doesn't
ensure that does it?
>> No, so you're saying the sum
over all t of vt vt
transpose is the identity, right?
It is, for
the r dimensional space.
If the v's span everything,
it will be the identity.
But it's true that you can
hit it with the minus 1.
So it's also true that
if there are ties for
the singular values,
which can be,
then you can chose any basis of
this, even the rotation, yeah.
And it's not clear whether one
is better than the other one,
but you can do rotations.
One other point that
I want to make.
So what is nice about
this as well as spectral
decomposition is orthogonality
is very nice, right?
So for instance,
if I want A transposed,
I can write it as sigma
t ut vt transposed.
Now the transpose of
that is also sigma t, but
vt comes in front, right,
vt ut transposed, okay?
Now let's multiply A by
A transpose.
Let's use the index t1 in one
sum and t2 in the other
to avoid confusion.
So that's equal to,
I can just multiply them out,
I get sigma t1 sigma t2 times
ut1, then vt1 transpose vt2,
then ut2 transpose, okay?
This middle factor is an
inner product, all right?
And it is zero unless
t1 equals t2, so
otherwise the term vanishes:
0 if t1 not equal to t2.
Therefore,
this becomes a single sum over
t1, which we can call t.
Sigma t squared,
ut ut transpose.
Same thing as when you had
a spectral decomposition.
So for a symmetric matrix
you have the sum of lambda t
times ut ut transpose.
This is spectral decomposition,
right?
And you can power the matrix,
and the eigenvalues will power,
but the eigenvectors will
remain the same, right?
And that's just by repeating
this kind of process, right?
This is extremely useful
in a lot of context.
Actually one context where it is
really useful is Markov chains,
because if this is the
transition probability matrix,
p to the s is the transition
probability matrix of s steps.
And then the eigenvalues
just get powered.
The eigenvectors just
stay as they are.
Which is very nice.
That's also true of singular
value decomposition, right?
Of course it's not symmetric,
so you don't get uv.
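The computation above, A A transpose equals the sum of sigma t squared ut ut transpose, and the fact that powering just powers the sigmas while the ut stay fixed, can be checked numerically (a NumPy sketch on a toy matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A A^T = sum_t sigma_t^2 u_t u_t^T: the cross terms v_t1^T v_t2 vanish.
AAT = sum(s[t] ** 2 * np.outer(U[:, t], U[:, t]) for t in range(len(s)))
assert np.allclose(A @ A.T, AAT)

# Powers: (A A^T)^3 just raises the sigmas to the 6th power;
# the u_t do not change.
P3 = sum(s[t] ** 6 * np.outer(U[:, t], U[:, t]) for t in range(len(s)))
assert np.allclose(np.linalg.matrix_power(A @ A.T, 3), P3)
```

This is the same mechanism that makes powering a transition matrix cheap once a spectral decomposition is in hand.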
Okay, so maybe one last thing
I'll say as an application where
it's very trivial in some sense,
is if a is invertible,
So then it's d by d, let's say.
Then A equals the sum over t
equals 1 to d of sigma t
ut vt transpose,
and all the sigma t are non-zero.
It's invertible,
so they're all positive;
none of them is zero.
And A inverse is the sum over
t equals 1 to d of 1 over
sigma t, times vt ut transpose.
Okay, and the reason that's true
is just check by multiplication.
Check by multiplying.
Okay, and if it's not invertible,
then r is less than d;
you can do the same thing.
If not, if
r is less than d,
then A is the sum over t equals
1 to r of sigma t ut vt transpose.
And you can write what's
called a pseudo inverse.
This is often written this way.
This is pseudo inverse.
It's the sum over t equals 1 to r
of 1 over sigma t, times
vt ut transpose; note that
vt comes first, then ut transpose.
And this acts like the inverse:
A times A pseudo-inverse is
equal to the identity on
the r dimensional column space.
And its useful for
a lot of purposes.
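The pseudo-inverse formula can be checked against NumPy's np.linalg.pinv (a sketch; the rank-3 test matrix and the rcond cutoff are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(6)
# A rank-deficient 5x5 matrix: rank r = 3.
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 5))
U, s, Vt = np.linalg.svd(A)
r = 3

# Pseudo-inverse: A^+ = sum_{t=1}^{r} (1/sigma_t) v_t u_t^T
# (vt first, then ut transpose).
A_pinv = sum((1.0 / s[t]) * np.outer(Vt[t], U[:, t]) for t in range(r))
assert np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-10))

# A A^+ acts as the identity on the column space of A:
x = A @ rng.standard_normal(5)     # any vector in the column space
assert np.allclose(A @ (A_pinv @ x), x)
```

The rcond cutoff tells pinv to discard the numerically-zero singular values, matching the truncation of the sum at r.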
