[MUSIC]
So I changed the agenda
a little bit.
I'll write that down so
that you know what to expect.
So I'm going to first
do the topic that
was going to come later.
I realized it's actually
reasonably elementary, but new.
It's something that many
of you wouldn't have seen,
so I'm going to do that first,
then SVD.
Then the SVD for
very large matrices.
We'll define what
massive matrices are.
It's like big data,
but we'll define that.
So I'm going to talk about one
particular sampling procedure,
and I'll tell you what it is.
But it turns out to be
useful fairly widely for
matrix problems.
So this deals with,
quote unquote, big data.
It's a tacky term that's
bandied around a lot.
But here's our technical
definition of big data.
It's just something that
doesn't fit into RAM.
So usually you think of
algorithms like SVD and so
on traditionally, if you will,
most of the time you think
of the matrix being in RAM.
And then if I want to access
one particular entry,
I get it in constant time.
Any entry, right?
That's how you reckon the cost.
But that doesn't work for big
data, so the data is outside,
and we have to worry about what
fits into RAM or not, and so on.
We won't get our
hands too dirty.
We won't worry about exactly
counting input/output transfers.
You'll see we'll find
a clean way of doing it.
So, the obvious thing to do
when you encounter a very
large problem is to sample.
So let's say I have
a very big matrix,
too big to fit in RAM and
apply my algorithms.
I'd like to sample some rows and
columns and
do whatever computation
I'm going to do.
It could be singular values,
which we haven't defined;
I'll define them when we come to it.
It could be singular values, or,
as we'll see today, even something
as simple as matrix multiplication
turns out to be interesting.
So we would like to deal
only with the sampled
rows after that,
and get an answer.
Of course,
it is not always going to work.
Indeed, one thing that
certainly won't work
is uniform random sampling.
So if I pick rows all
with equal probability,
there's no chance that I would
get a result, an error bound,
which works for all matrices.
In a minute I'll make that precise:
we want things that work for
every matrix possible.
And the reason is simple.
So it may be that in a matrix
a small fraction of the rows
are very heavy and the rest
are all zero or close to zero.
We have to be able to
handle such matrices.
Again, I have to tell you in
what way we want to handle them,
what the error bounds are.
But intuitively,
uniform sampling won't do.
We have to pay attention to how
big or small the entries are.
As I said, this lecture
covers two problems.
I may not get to both
in this lecture,
so it may involve part
of the next lecture too.
So one is a very simple thing.
On the other hand, it'll be
quite useful for the next thing,
which is not so simple.
Matrix multiplication, so
I have two matrices, and
I want to multiply them.
That's all I want to do.
I want to do it very
fast with sampling.
We'll see a way of doing that.
Again, I want to get
an approximate answer where
the error is bounded for
any pair of matrices you
give me, not random.
But the next thing is something
that perhaps you wouldn't have
seen generally, so it's going to
use the matrix multiplication result
in a funny way, not the way
one might be thinking of, but
we'll see.
But we want a compressed
representation of a matrix.
And I should have said that there,
but the compressed representation
of a matrix is the following.
What we will prove is that any
matrix can be approximated
given a small sample of rows and
a small sample of columns.
So I want to get an approximation
to every entry of the matrix.
This is non-trivial; we'll prove it,
perhaps only by the next lecture.
But the claim is that whatever
matrix you give me, if you give
me a sample of rows and a sample
of columns, that suffices to
approximate the matrix, but the
magic bullet is the probability
distribution with which you
pick the rows and columns:
the probabilities for sampling.
And as the title of
the lectures indicate,
probabilities are going to be
proportionate to the squared
length of the rows and columns.
We are going to deal
only with real matrices,
no complex entries, so
I square the entries and
sum them up, the probabilities
will be proportional to that.
This theorem will not be valid,
it's not known to be valid
if I cube the entries or
take the absolute value or
any of those things.
It's got to be squared.
And then we'll prove that for
every matrix you get
an approximation to
the whole matrix doing this.
We'll come back to that.
I first want to do
matrix multiplication.
For both, we will sample rows,
as well as columns,
with probability proportional
to length squared.
The length squared of a vector is
just the sum of the squares
of its entries.
That is the magic bullet.
We'll see why length squared
is the right thing to do, but
it turns out it does
a lot of things.
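To pin the definition down, here is a minimal sketch in NumPy (mine, not from the lecture) of the length-squared probabilities over the columns of a matrix:

```python
import numpy as np

def length_squared_probs(A):
    """Probability for each column of A, proportional to its squared length."""
    col_sq = (A ** 2).sum(axis=0)   # squared Euclidean length of every column
    return col_sq / col_sq.sum()    # normalize so the probabilities sum to 1

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 2.0]])
print(length_squared_probs(A))      # [1/9, 0, 8/9]: heavy columns dominate
```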
We'll also use length
squared later for SVD.
Length squared sampling
has been used for
a whole number of things.
We'll perhaps see some,
but not too many.
For instance, there's a way to
deal with tensors using that,
we'll see perhaps later.
Now throughout this talk, so
there are two sources of
randomness in algorithms.
One, you can think of them
as either the algorithm
tossing coins or
the data tossing coins.
So what do I mean by that?
The algorithm tossing
coins just means there's
a random number generator
available to you.
So you can generate random bits
and base your decisions on that.
These are often called
randomized algorithms.
The data tossing coins means
average case analysis.
That is, I'm not able to
prove for every input,
something happens.
I'm only able to prove for
the average input, so take the
expected time over all inputs.
We do not want the data to be
allowed to toss coins here.
We want to be able to do
something that's good for
every matrix, so
the data does not toss coins for
the two problems I'll do,
but the algorithm does;
the algorithm is randomized.
In general, you could have only
the data tossing coins, only
the algorithm tossing coins,
as here, or both.
All of that is possible.
So matrix multiplication, it's
just multiplying two matrices.
We have m by n and n by q matrix
A and B, I want to find AB.
If you want to do it exactly,
you can do it in m times n times q
time, of course.
And you've all seen a lot of
divide-and-conquer algorithms
that improve this complexity.
So I would like to do matrix
multiplication approximately,
in a sense we'll define in
the error, in linear time.
Linear means m times
n plus n times q.
Now, later I'll talk about
sparse matrices, where many
entries are 0; you can get
linear time for those as well.
So we'll do that, and
one reason you might want to do
it approximately is that
A and B cannot be stored in RAM;
they are big data, so
we want to sample, and
this will do some sampling.
Can we just take samples of A and
B and multiply those to get
an approximate product?
That's what we like to do.
Well, we can't sample arbitrary
entries of A and B, because then
you don't know what to multiply;
you don't have
compatible dimensions.
So sample some rows and columns
of A and B; which rows and
which columns is the question.
Do I sample some rows or columns
of A, and, disjointly from that,
pick some rows and columns of B?
You'll see; well,
you'll have to think about it.
Suppose I pick arbitrarily some
rows, or let's say columns, of A,
but I don't pay attention
to which ones I picked when
I go to sample B. You should be
able to convince yourself that not
much good can happen;
you cannot do that.
So that's also not going to
work, but observe the following.
So that's also not going to work,
but observe the following.
The product AB is just a sum of
products: the first column of A
times the first row of B,
plus the second column times the
second row, and so on down the
middle index. So let me fix some
notation for the first column of A
and the first row of B; their
product is this, right?
So if I multiply them I
get a full matrix, right?
This is m by 1, this is 1 by n,
so I get a full m by n matrix.
That's a rank-one matrix, okay.
I don't know whether you've
all seen this identity;
it's very simple, but it's worth
checking the first time
that it's correct.
Okay, so the matrix product is
exactly the sum of these:
n is the middle dimension, so
n outer products, n rank-one
things. You should check that
at home, just by multiplying
matrices; there's no magic to that.
So the right-hand side is
the sum of n quantities.
It happens that each quantity is
not a real number but a matrix,
a full matrix; nevertheless
it's a sum of n things.
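Since he asks you to check the identity at home, here is a quick NumPy check (my sketch) that AB really is the sum of the n rank-one outer products:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # m x n
B = rng.standard_normal((3, 5))   # n x q

# Column k of A (m x 1) times row k of B (1 x q), summed over the middle index.
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
assert np.allclose(A @ B, outer_sum)
```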
Now, the first thing you would
think of to evaluate a sum,
at least if the terms were reals:
pick a sample of these n quantities.
And if the sampling is done right,
that is, if the variance is not
too bad, then, after scaling to
make it unbiased (you picked r out
of n, so you scale by n over r),
you would hope that's
a good estimator.
That depends on the variance,
right.
It's not going to be
a good estimate for
all probability choices.
In fact, it won't be a good
estimate for the uniform choice.
That's not difficult to see.
So if I pick columns uniformly,
it won't be a good choice.
But this tells you
something very crucial.
If I'm going to sample
some of the columns of A, I
must pick the same rows of B,
right?
That's what the sum says:
first column times first row.
Okay, so I cannot take disjoint
sets of columns of A and
rows of B, right?
Whatever columns of A I pick, I
must pick the same rows of B if
I'm going to apply this identity,
which is what we'll do.
Okay, try uniformly at random,
u.a.r. for short:
I pick, uniformly at random,
a subset of columns of A,
and the same rows of B.
Okay, does this work for
every A and B?
So let's think about that for
a minute.
Can you give me examples where
it doesn't work, some A and B?
What's an example where,
if I pick uniformly at random
some sample of r columns of A,
and the same rows of B
(I have to do that),
I'll get nothing?
I'll get garbage, maybe.
One possibility is: A has
only one non-zero column and
B has the corresponding
non-zero row.
The rest are zeros, and you
would essentially never pick
that, right?
So A looks like: non-zero here,
and all zeros there.
And B looks like: non-zero here,
all zeros here.
This is A, this is B.
Then AB is not the zero matrix,
right? But that tells you
something: if r is not as high
as n, you would almost never
see this column.
The probability is only one in n,
and unless you pick about n
samples you won't see that column;
uniformly at random, you're sunk.
Now there is a simple fix for
this example.
For this A and B,
non zero zero, non zero zero.
And here, if I pick a column of A
with a probability proportional
to its length, or length squared,
any of those things,
then you have a chance.
There is a chance.
So if I did that, I would pick
the first column of A, right?
That's the only one
that is nonzero so
the probability is one
of picking this and
zero of picking everybody else.
And I'll get some result.
You hope it's right.
Now, this does not say you
have to do length squared.
You'll see why later.
But something dependent
on the length, right?
So you have to do
importance sampling.
Importance sampling just
means that the probabilities
are not uniform.
[INAUDIBLE]
>> Yeah?
>> [INAUDIBLE]
>> Yeah,
so I have not put up
the arithmetic yet, and sorry.
I will put it up,
but before that,
I have to define the norm and
so on.
Okay, I should tell you.
So I want the norm of AB
minus my approximation,
and the norm I use is the one
whose square is the sum of
squares of all the entries,
the Frobenius norm.
So I want the mean square error,
if you will, to be small, okay?
Now, small (note there is
a square here),
small in what sense?
In this sense.
I'm jumping ahead a little bit.
This is the mean squared error;
this is the sum of squares
of all the entries of A,
and this is the same for B.
I want my error to be less than
or equal to that times something,
and so on.
Okay, again, I will state
this more carefully.
This would not be true for
this matrix if you do uniform
sampling, right? You'd get 0,
and it should be non-zero, okay?
So that does not work.
Matrix multiplication, continued:
so we write the sum.
I'm just going to write it in
this sort of MATLAB-style
notation, right: column i of A
times row i of B. Uniform
sampling does not work;
I'm just repeating that.
General non-uniform sampling:
I'm going to pick
column i of A, and also the same
row of B, with probability pi.
So p1 through pn are some
non-negative reals summing to 1.
P1 is the probability of
picking column 1, and
P2 is the probability of
picking column 2, and so on.
I don't know them yet.
I'm trying to find them, okay?
So the experiment I'm
going to do is pick a j,
one of the columns.
The probability of picking j is
pj; that's one trial I'm going
to do. I am going to define
a random variable X.
It's a matrix-valued
random variable, right?
It takes on matrix values,
not real values, and the random
variable is going to be
the picked column of
A times the same row of B.
So it's one outer product, right?
I'm picking one column and
I'm going to estimate
the entire sum by just one.
Which is not going to give
you a very good estimate, but
we will take an average
of many samples later and
then it will give
you a better one.
So X is a matrix valued
random variable.
And really, ultimately,
we want to take r i.i.d.
samples of j,
and take the average.
Otherwise the variance
will be too high, right?
But let's deal with one and
find the variance and
we can compute the variance for
average of many things simply.
Okay, so let's compute
the mean and variance of X.
And again, once I know that,
I know the mean and
variance of the average
of r trials.
Right, so what's the mean of X?
It's a matrix of random
variables, so the expected value
should be taken entry-wise.
Or, the other way of saying it:
the expected value is
also a matrix, right?
So let's see what
the expected value is.
These are simple things so I'll
go over perhaps a little faster.
So the expected value is just:
with probability pj I pick
j and I get this, right?
But this is not unbiased
as it stands; I'd like the
expectation to be A times B,
the matrix A times the matrix B.
It's a matrix, and that's
what unbiased means, right?
So any idea of how I
can make it unbiased?
But first maybe a hint.
Step back: suppose I want
to estimate the sum of
n real numbers a1 through an,
just real numbers now,
not matrices. For the moment
I pick one index j,
with some probabilities pj.
How do we scale the aj so
that it's an unbiased
estimator of the sum?
I want to find a random
variable like X, right?
I pick a j,
and X is based on the aj
that I picked. But I want it
to be unbiased: I want
the expectation to be the sum.
So how will I do that?
A little thought about statistics
says I have to scale it, right?
I think I've put it down:
yes, it's just aj over pj.
So I pick a j,
and then aj over pj has
the correct expectation;
it's unbiased.
Okay, so we do that.
So that's okay; since I'm going
to write 1 over pj,
we'd better have pj positive,
right? Some pj's might be 0,
in which case you never pick
them, so this is still
all right to write.
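Here is a tiny simulation (my sketch, with an arbitrary positive probability vector) showing that aj over pj is unbiased for the sum, whatever the p's are:

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([5.0, 0.1, 0.1, 0.1])           # the numbers whose sum we estimate
p = np.array([0.25, 0.25, 0.25, 0.25])       # any positive probabilities keep it unbiased
js = rng.choice(len(a), size=200_000, p=p)   # repeated one-sample trials
estimates = a[js] / p[js]                    # E[a_j / p_j] = sum_j p_j * (a_j / p_j) = sum(a)
print(estimates.mean(), a.sum())             # the empirical mean is close to 5.3
```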
Okay, what about the error?
So now the variance, right?
We have to worry about
the variance; that's our main
problem. Try writing down
the variance of one entry, ij.
For the variance, try
the second moment first.
There are n choices: with
probability pl, I pick the
l-th column of A and
the same row of B.
So with probability pl,
the ij-th entry of the sampled
outer product is Ail Blj,
and my random variable, which
perhaps I should have written
down, has Xij equal to
Ail Blj over pl.
So the expected value of Xij
squared is a sum over l:
pl is the probability of picking
column l, and that's written
down there; pl over pl squared,
so I get the sum over l of
Ail squared Blj squared over pl.
Okay, now the variance is
slightly smaller than that,
right? You all know this:
the variance is the second moment
minus the expected value squared.
But the expected value
does not depend on the pl:
the estimator is unbiased, so
that term is always the same,
and I might as well ignore it.
What I'm trying to do is
figure out pl's that are good,
that minimize the variance.
So how do you bound the, sorry.
So there was that question,
okay.
I want to say that I can
disregard this term and
only look at this term
because this is constant.
My variables are the Pl's and
I'm trying to minimize
the variance by choosing
the best possible Pl's.
And this doesn't depend on
the Pl's it's unbiased.
So only the second
moment matters.
So the difficulty is we have
one variance for every entry
of X; which one do we minimize?
Well, I'm going to take the sum.
It's a simple idea, but
it simplifies the math a lot,
you'll see.
Without that idea, life would
be complicated because each
variance is different.
So let's define, for a
matrix-valued random variable X,
the variance of X to mean
the sum of the variances
of all the entries;
each entry is a random variable,
and we sum their variances, okay?
And we calculate that:
it's the sum over i and j
of the variance of the ij entry.
This is what we had on
the last slide, and it
simplifies a lot just because
you can switch the sums, okay?
So, sum over l,
with that 1 over pl pulled out.
Summing over i and j you get
these two factors, right?
The sum over j of Blj squared
is just a row sum: for fixed l,
summing the squared entries over
all j gives the length squared
of row l of B.
And the other factor is the
length squared of column l of A,
okay, right?
Nothing very difficult here.
So now I ask: what's
the best choice of pl?
I want to minimize this,
and the pl are my control, right?
I can choose any pl I want,
let's say for the moment.
What choice of pl
would minimize this?
It's the one that minimizes
the variance of X;
I want to minimize that.
It suffices to, okay, I'm sorry,
this is something I said here:
in minimizing the variance, we
can minimize the second moment,
since the expected value,
this term, is fixed.
Again, I'm summing over
all i and j here, too, right.
But unbiasedness means that every
entry has an expectation
which is fixed,
independent of the p's.
Okay, so I want to choose
the p's to minimize that.
So again, I rewrote that
with pk; I guess I didn't
rewrite it, I just changed
the l to k.
So I want the sum over k of
1 over pk times that
to be minimized.
p has to be a probability
vector, right?
So the pk have to sum to one,
and then it's a calculus exercise,
which you should do at home.
So if you have a1 through an,
which are positive real numbers,
I want p1 through pn
so that the sum over l of
al over pl is minimized,
subject to non-negativity and
summing to 1.
It's a calculus problem;
maybe we could just do it.
The point is that when I
differentiate with respect
to pl, you get
a pl squared in the denominator.
So you have to work this out,
but I'll just tell you that
the pl come out proportional
to the square root of al,
just because you get
the square there;
that's all that's happening.
You get the square root.
Check it at home, okay.
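For completeness, the calculus he is alluding to, in LaTeX (my reconstruction, using a Lagrange multiplier for the sum-to-one constraint):

```latex
\min_{p_l \ge 0,\ \sum_l p_l = 1} \; \sum_l \frac{a_l}{p_l}, \qquad
\frac{\partial}{\partial p_l}\Big(\sum_m \frac{a_m}{p_m}
  + \lambda\big(\textstyle\sum_m p_m - 1\big)\Big)
  = -\frac{a_l}{p_l^{2}} + \lambda = 0
\;\Longrightarrow\; p_l = \sqrt{a_l/\lambda} \,\propto\, \sqrt{a_l}.
```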
So the best choice of pk is
proportional to the square root
of that quantity,
which is a product of lengths,
okay? So you can take pk to be
the product of these two lengths
divided by the sum
of all such products.
The pk have to sum to 1, so
I have to normalize;
that's what I've done there,
right?
Now, the important special
case is B equals A transpose;
believe it or not,
it's a fairly important
problem to multiply
A with A transpose.
Now I'm not actually going to
go into details of this but
let me back off one step and
say why this is important.
Actually, this whole
sampling issue came from
a more general problem of
estimating the variance,
covariance matrix,
of a probability density.
So suppose you have a
probability density in
d-dimensional space; I get some
samples, and I want to figure out
the variance-covariance matrix
of it.
Unfortunately, that's denoted
Sigma, the matrix Sigma.
So x is a random point, and
Sigma ij is the expectation
of xi xj, provided we've
centered, so that the expectation
of x is 0; the expectation of the
vector is 0, meaning every
component has mean zero.
Then the matrix whose ij entry
is this is called the
variance-covariance matrix.
So, a question that actually
originally led to all of this
work in a way that probably I
won't describe is the following
question.
How many i.i.d. samples
(this is an aside, so
if you don't know some of
these terms, don't worry,
it won't be necessary for later),
how many i.i.d. samples from a,
I'm going to leave a blank here,
probability distribution in Rd
do we need to approximate
Sigma to relative error epsilon?
What that means is: I must get
a Sigma-approximation with the
property that its quadratic form
is within 1 plus or minus
epsilon of the correct answer,
for every v. Okay,
so the question is: how
many samples do we need?
Because this is a very important
thing to estimate; quite often
you ask how many samples you need.
And the form of the definition
is very important.
Without exception: for every v,
you must have that the quadratic
form is within 1 plus or
minus epsilon.
So, this question
arises naturally, for
instance in regression.
But, it also arises
in a lot of context.
Now, so it's not for
a general distribution.
So, usually we worry about
what are called log-concave
distributions.
And Gaussians are among them,
okay?
And the answer we know
now is order d log d
divided by epsilon squared.
We didn't know that
until perhaps ten years ago, but
now we know that for every
log-concave density it suffices
to pick this many samples, and
then the theorem is that
the sample variance-covariance
matrix has this property.
It's a decent approximation
to the variance-covariance
matrix of the true density.
I won't prove that.
I'm only setting
this as an aside,
although this was
the original motivation.
But, if some of these
terms are not familiar,
don't worry about it.
It won't be needed
in the sequel.
Yeah, sorry.
>> [INAUDIBLE]
>> So,
we do need the multiplicative
guarantee.
The least eigenvalue
has to be captured, right.
>> [INAUDIBLE]
>> Yeah,
that's why you need this, yeah.
For linear regression,
you need the least eigenvalue;
yeah, that's why we use
this definition.
So this originally arose,
actually, in the context of
computing volumes of
convex bodies.
Lovász and I formulated
this thing, and
then we proved some bound,
and went through a bunch of
iterations before we got here.
These results
are due to Rudelson,
the functional analyst.
The proof was very complicated,
but
now we actually know a much
simpler proof of this.
Okay, a side over,
let's go back there, alright.
So, in this important
special case,
we get pk equal to the length of
column k of A, squared, divided
by the Frobenius norm squared.
The Frobenius norm squared of a
matrix is the sum of squares of
all its entries.
That's the right scaling factor,
right?
Because the numerator is the
squared length, and we have to
normalize, so we divide by
the sum over all the columns.
So, that's that.
So, hence the name
length-squared sampling.
It came up naturally here,
in this context, right?
It's what calculus told you
the minimizing probabilities were:
length-squared sampling, right?
So that's why you don't use
length, or length cubed, or
any of those things.
Length squared is exactly
the thing that minimizes that;
that's because of
this calculation, right?
So this is called
length-squared sampling,
and that was the title
of this talk.
So, let's go over this in detail.
I'm going to use these
probabilities, the pk's;
I'm going to plug them in.
The variance becomes, when I put
them into that, the following;
that's all I'm doing, right?
pk is this thing divided by
a sum, so that sum comes up here.
So altogether this thing cancels,
and you get the sum over l of
the length of the column times
the length of the row,
the whole sum squared, and that is
less than or equal to that.
Now, why is that?
What inequality am I
using to get that?
So, what am I doing?
Instead of writing all those
complicated things, let's write:
the sum over l of al times bl,
all squared, is less than or
equal to the sum of the al
squared times the sum of the
bl squared, where al is the first
factor and bl the second.
And now we know the name for
it, right: Cauchy-Schwarz.
Or another way to think about it:
two vectors, a and b, and
it's just saying that their dot
product is at most the product
of their lengths.
You get the same bound if you
used just the length of column k
of A, squared.
So maybe that's worth doing;
let's just do it on the board.
If pk equals the length of column
k of A squared divided by the
Frobenius norm of A squared,
then I plug it into
the top there and
I get the same answer as I
was getting there, right?
One over pk times the column
length squared times the row
length squared: the column
length squared just cancels out,
and in fact in this case the sum
is exactly equal to the Frobenius
norm of A squared times the
Frobenius norm of B squared.
So those product probabilities
are the best; this one is not
necessarily always as good,
but the final upper bound
is the same.
Okay, so the product
probabilities are the best,
right? If the length of the kth
column of A is very high while
at the same time the length of
the kth row of B is low, then
the product is the better thing.
But the upper bound is the same,
and length squared is exactly
optimal for B equal to
A transpose.
This is our basic problem;
I think I said that.
It turns out that approximate
length-squared probabilities
are also fine and will get you
a good result, a good enough
result, if I know them only up
to a constant factor.
I need to know that the
probability of each column is
at least a certain amount;
an upper bound alone is not enough.
I need, for every k, pk at least
a constant times the
length-squared probability;
then, for some constant, you
get essentially as good a variance.
Now, this is useful because
there are matrices for which you
don't want to take the time to
estimate each column's length.
And then if you know an estimate
which has this property,
that's good enough.
Okay, so so
that's what it says here, right?
It may be possible to find such
pk more easily than finding all
lengths, okay?
In that case you can use that.
Then you get that.
So the unbiased estimate, that
still divides by pk of course.
The probabilities, right?
Whatever it is,
it divides by that.
And when you get the variance
you get an extra C there,
that's just very simple,
Pk has an extra C
compared to what we have.
I don't know why I wrote 1
over C squared, it's 1 over C.
So, reducing the variance
is standard, but let's run over
why we need it.
This X had variance less than or
equal to that, and that's
not good enough. Why?
Because the Frobenius norm of
A times B is at most the product
of the Frobenius norms.
You have to prove this, right?
But it's not very difficult,
it's just a lot of
algebra perhaps:
the Frobenius norm is
submultiplicative, and that's
not difficult to prove.
You can do this by writing down
entries, which is laborious,
but, whatever.
There are lots of
ways of proving this.
I won't do it.
You can check that
equality can hold.
So even in the best case, when
the Frobenius norm of AB equals
the highest possible value,
the product of the two
Frobenius norms, the variance
bound is as big as the square of
the quantity we're estimating;
that's not so good.
In general, this bound is
greater than the square of the
number you're trying to find,
and it's not a good
idea to have error
greater than what
you're trying to find.
Pardon me, I should have
had a square here:
this is the variance of X,
X estimates AB, and variance
is a squared quantity,
so the comparison is
between squares.
Okay, so that's not really good.
What do we do?
What's the general method
of reducing the variance?
We take s i.i.d. copies; the
variance gets cut down by
a factor of s.
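In symbols, the standard fact he is invoking, applied entrywise to the matrix-valued X (my note):

```latex
\operatorname{Var}\Big(\frac{1}{s}\sum_{t=1}^{s} X_t\Big)
  = \frac{\operatorname{Var}(X)}{s}
\quad \text{for i.i.d. } X_1, \dots, X_s .
```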
So with that, I'm going to
state this as a theorem.
It'll be nice if you
can remember this.
This is a theorem we'll use for
doing lots of other things,
like compressing a matrix.
So A is an m by n matrix and
B is n by q, and we get AB
approximately as follows.
You pick j1 through js,
s samples, in i.i.d. trials,
with probabilities pj which
satisfy this property:
basically length squared, right?
pj is proportional to the squared
length of column j, approximately;
you're allowed to be off by
a factor of little c.
And then you pull out those
columns, and you have to scale
them; this is what makes the
estimate unbiased, right?
That's your matrix C, okay?
And R-tilde is the matrix
of the corresponding rows of B.
What did I do here?
Before, I picked one sample and
took an outer product;
that's sample 1, j1.
Then I'd take j2,
take its outer product,
and so on.
Instead, I just put those
columns into one matrix:
that's the same thing,
the matrix of columns j1 through
js times the matrix of the
corresponding rows, because
the matrix product expands into
exactly this sum of
outer products.
That's all I have done: I was
taking the average of s of these
sampled X's, and I'm just writing
it as a matrix product, that's all.
A sketch of the whole estimator
follows.
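Putting the theorem together, here is a minimal NumPy sketch (mine; the names and the choice of the optimal product probabilities are my own) of the estimator: sample s column indices i.i.d., scale each picked column/row pair by 1 over sqrt(s pj), and multiply the two thin matrices:

```python
import numpy as np

def approx_matmul(A, B, s, rng):
    """Length-squared sampling estimate of A @ B from s sampled column/row pairs."""
    # Optimal probabilities: length of column j of A times length of row j of B.
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    js = rng.choice(A.shape[1], size=s, p=p)   # s i.i.d. trials, with replacement
    scale = 1.0 / np.sqrt(s * p[js])           # makes C @ R unbiased for A @ B
    C = A[:, js] * scale                       # m x s: scaled sampled columns of A
    R = B[js, :] * scale[:, None]              # s x q: the same rows of B, scaled
    return C @ R                               # the average of s scaled outer products

rng = np.random.default_rng(2)
A, B = rng.standard_normal((50, 400)), rng.standard_normal((400, 30))
err = np.linalg.norm(A @ B - approx_matmul(A, B, 200, rng))        # Frobenius norm
bound = np.linalg.norm(A) * np.linalg.norm(B) / np.sqrt(200)       # theorem's scale
print(err, bound)   # the observed error is typically below the bound
```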
Okay, then the expected value.
What we have already proved is
that the expectation of this is
less than or equal to that;
this was the variance, right?
It's the expectation of a square.
And the mean of C R-tilde is AB;
we've proved that, it's unbiased.
So I'm taking the matrix-valued
random variable,
subtracting its expectation,
and taking the square,
and that's the variance, right?
The variance is this much;
we proved that bound on the
variance of the matrix, and
that's that.
And then, if you want just
the expected value of the norm
itself, that's at most that;
everybody sees how this
implies that, right?
The same absolutely holds for
picking rows of B and
the corresponding columns of A.
I could have done that.
That's actually a good point,
because sometimes
it might be easier.
>> [INAUDIBLE]
>> Yeah,
so it may be, it may happen.
Like what you were saying.
It may happen that b is
something I know better.
I know, for some reason,
the distribution of the lengths
squared of the rows of B; or,
if they are physical
matrices, you would
potentially have an idea of what
the lengths squared might be.
Then you would sample through B
instead; either is fine.
>> I'm just wondering
[INAUDIBLE] this is a bounded
[INAUDIBLE] [INAUDIBLE]
>> Absolutely any A and
any B, this is true.
Yeah, so that's another point
worth making that it's true for
every possible A and B.
And in this context it's not
interesting to prove results
that hold only for most A and B,
because those are very simple.
The optimal probabilities use
both A and B, that's what
you're saying, right?
>> Yeah.
The optimal choice was the
length of the column of A times
the length of the row of B,
instead of just one of them;
that is the optimal choice.
So if all the rows of B have
the same length, that factor
can be canceled, and you get not
length-squared sampling
but length sampling, right?
Length squared comes from
B equals A transpose:
in the A-transpose case,
it's genuinely length squared.
And that was exactly the case in
the variance-covariance matrix
problem, which was the central
problem you wanted to solve.
So it's best and
easiest to remember the
matrix multiplication
theorem by these words.
We'll use it towards the end,
so it's nice to remember
these words, right?
So, all that's saying is that
the Frobenius norm error
goes down as one over root s,
where s is the number of
sampled columns you pick,
for length-squared sampling.
So clearly the error goes to
zero as s goes to infinity,
and it goes there as
one over root s.
If you look at the
Frobenius norm squared,
it goes down as 1 over s.
Okay, for this problem,
that's the best you can do;
you can't beat this 1 over s.
So, in general, all sorts of
simple improvements of this
are ruled out.
Let's just go over that;
let's see whether the one
you're suggesting works.
What you would like is:
the expected value of AB minus
the estimate should be at most
epsilon times the norm of AB
itself. Right? And length squared
will not get you that, but
it's a fair question.
So, if you're interested
in this subject I can give
you references.
Since this is early work.
This is 20 years back,
let's say.
So, there have been a lot of
attempts at getting relative
error, and there are some
possibilities, but much more
complicated probability
distributions are involved.
You have to sample columns
according to what are called
leverage scores; it makes life
more complicated.
So, I won't do it here, but
I can give you references.
So, something like
that is possible.
I'm not quite sure in this
problem, but it's possible.
So, the Frobenius norm of AB can
be much smaller than the Frobenius
norm of A times the Frobenius
norm of B; our bound may be much,
much bigger than the thing
being estimated.
So that's our problem:
we may be getting a very
large error.
So, let me state it for
3-tensors.
Suppose Aijk, with i, j, k
going from 1 to n, is a
collection of numbers, and
I want to maximize,
over all unit vectors x,
the sum over i, j, k of
Aijk xi xj xk.
If I had just two indices and
a symmetric A, this is an
eigenvalue problem:
the maximum is the largest
eigenvalue.
For three indices these
forms are very hard.
We don't know how to
solve them exactly.
But the theorem is: we can find
this to within plus or minus,
well, it's really only minus,
epsilon times
the Frobenius norm of A,
which is this quantity,
using length-squared sampling;
the argument is fairly complex.
Again, it's just stated, right?
I mean, I won't be able to
prove that, but that's true.
Again, for any Aijk whatsoever.
I can give references to all of
these if you have more interest.
It runs in polynomial time and
so on, but exponential in
one over epsilon.
So epsilon is
an important parameter:
you can't drive this
error down too small.
It's difficult.
Okay, so, Big Data.
So, we are going to implement
this algorithm with Big Data
[INAUDIBLE].
How do we do that?
Big Data means,
it cannot be held in RAM.
That's, again, our definition.
We do one pass through A and
B, and compute all
the probabilities, pk.
So remember, pk could depend
either only on the lengths of
the columns of A, or on
the lengths of columns of
A times the lengths of rows of B;
we have several
candidate choices for pk.
We can find the probabilities
in one pass
by just keeping running sums,
running sums over the columns.
Then we store each of the pk's
in RAM. Everybody with me
on this, how we do this?
I mean, we've got to compute the
length squared of each row of A,
sorry, each column of A,
or each row of B, or
the product of the column length
and the row length;
you have to compute that for
each one.
And I have to normalize:
at the end of the day
we find the sum and divide,
so the probabilities sum to 1.
So we can do this with just
order n space, not n squared;
the RAM holds only order
n numbers.
Once you have the
probabilities pk,
we can toss coins offline to
figure out which set of
s columns of A,
and the corresponding rows of
B, we are going to sample, okay?
We do this, and
then we make a second pass.
So I'm thinking of A and B now
as too big to fit into RAM;
they're in external memory, and
the only way you can access them
is by making an entire pass
through the whole thing, right?
I cannot get one entry.
I can't randomly
access one entry, so
I have to make a pass for
the whole thing.
You can make a second
pass through A and B, and
pull out the sample.
So, two passes and then multiply
the sample matrices in RAM,
because they fit into RAM.
And for the error, okay:
if I want error less
than epsilon times this,
s has to exceed c over
epsilon squared.
Now, what I told you here
seems to be a way of picking
one sample column of A and
row of B.
How do I do s of them?
Only two passes.
I'm not allowed s passes.
So, you know all
the probabilities.
You toss coins s times
instead of one time;
the s coin tosses tell you which
s columns to pick,
and then on the second pass
you pick them all out.
So it's simple: in two passes
you can do this algorithm,
as in the sketch below.
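Here is one way to organize the two passes (a sketch under my own conventions; `stream_rows` stands in for whatever external-memory access is available), keeping only order-n running sums in RAM:

```python
import numpy as np

def two_pass_column_sample(stream_rows, n, s, rng):
    """Pass 1: running column length-squared sums; pass 2: pull the sampled columns."""
    col_sq = np.zeros(n)                  # O(n) state in RAM, never the whole matrix
    for row in stream_rows():             # first full pass over A
        col_sq += row ** 2
    p = col_sq / col_sq.sum()
    js = rng.choice(n, size=s, p=p)       # toss all s coins offline, between passes
    C = [row[js] / np.sqrt(s * p[js])     # second pass: keep only the sampled entries
         for row in stream_rows()]
    return np.array(C)                    # the m x s scaled column sample of A

rng = np.random.default_rng(3)
M = rng.standard_normal((100, 40))        # stands in for a matrix too big for RAM
C = two_pass_column_sample(lambda: iter(M), n=40, s=10, rng=rng)
```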
>> [INAUDIBLE]
>> With, well, okay, so yes,
with replacement, but
it's highly, okay.
So, some of the jt's
might be equal, right,
if that's what you mean.
With high probability that
won't happen, but
yes, with replacement,
this whole thing.
Actually, there's an old result
that says sampling without
replacement is always better,
or, I forget which way around;
the variance of the one is
bounded by the other,
because of a convexity argument.
But what I'm talking about is
with replacement, in any case,
for simplicity.
Okay, now here is a list of
problems that we now know how to
solve; I won't go over any of
them except the one I told you about.
Sketch is compressed
representation of a matrix,
which I will go over,
that's our next topic.
PCA, principal component
analysis: the SVD is coming.
You can actually compute
the SVD by sampling.
That is, you sample the rows of
the matrix with length-squared
probabilities and
do the SVD on just
the rows we sampled.
And we can prove an error bound
on that compared to what you
would get by doing the SVD
of the whole thing.
I'll do this after I introduce
the SVD and talk about it for
a couple of lectures.
That's our next topic;
after that we'll see how to do
this. Okay, tensor optimization,
you ask: I won't do it,
but I stated the result at least.
Okay, graph sparsification:
again, I can give you references,
but I won't do that. Okay,
now, the sketch of a large matrix;
that's our next topic.
I may or may not quite
finish all of this today,
but we'll go into this and then
come back if we run out of time.
So what do I want to do?
I have a large matrix.
Will show, our term is A can be
approximated given just a random
sample of rows of A, and
a random sample of columns of A.
Provided sampling
is length squared.
It's not known for
any other probabilities, right?
The errors are too much.
So for length squared,
we'll be able to prove that.
So let's say first, intuitively,
let's ask ourselves,
can we approximately sketch a
matrix by a sample of just rows?
The answer's obviously no,
because the sample says
nothing about the rows
that we didn't pick.
They could be anything, right?
But suppose for the moment
the rank of A is k, which is
very small compared to m and n.
(We want to do it for
high-rank matrices as well,
not only for this,
but suppose the rank is k.)
I sample, say, 100k rows.
Intuitively, without proofs
(we'll prove things properly
later), that pins
down the row space of A, right?
I mean, it's unlikely that the
sample would miss something in
the row space, unless the matrix
is anomalous, okay?
So if there's one
row of a which is
completely orthogonal to all
the other rows, I'm sunk.
My sample will not
get that right, so
I'll miss some of the row space.
But as long as
that's not the case,
as long as no small set of rows
is completely orthogonal to
all the other ones, I'll get it,
essentially; this is intuition,
not a precise claim, right?
So at least we can
get the row space;
that maybe doesn't give
an approximation to A itself, but
we get a good idea
of the row space.
Now suppose I also pick a fair
number of columns of A, at random.
That should tell me, for
the unsampled rows, what linear
combinations of the sampled rows
to take, provided things
don't degenerate.
So I have a large matrix I
picked lots of full rows. Okay?
I pinned down the row
space of A.
Now I pick a fair
number of columns,
that are adjacent columns.
Now let's say that I have
an unsampled row here, and
I want to figure out
what combination of
these sample rows
makes this row, right?
So, I know the row space,
this slice in the row space.
Row space I knew
from the sample.
And I want to figure out what
linear combination of the picked
rows gives me this row.
Okay, how do I figure that out?
I look at what happens in the
columns I sampled, only in those:
I figure out what
linear combination of
these circled sampled entries
gives me those entries, right?
If I picked 100k columns,
that should pin it down,
intuitively.
Everybody sort of see?
To figure out what combination
of the sampled rows this is,
I have k unknowns, right;
everything lives in a
k-dimensional space.
So I have k unknowns
to figure out,
and if I have 10k circles here,
I have 10k equations.
So if the system is reasonable,
the 10k equations should tell
me what the combination is.
So I get a good idea of
all the rows,
including the unsampled ones.
Then I'm done;
I'm not stuck with that problem,
provided I have both rows and
columns.
This is intuition;
see the sketch below.
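Here is a toy version of that intuition (my sketch, for an exactly rank-k matrix): solve for the combination using only the sampled-column entries, and check that it reproduces the whole unsampled row:

```python
import numpy as np

rng = np.random.default_rng(5)
k, m, n = 3, 60, 80
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # exact rank k

rows = rng.choice(m, size=10 * k, replace=False)     # sampled rows: pin the row space
cols = rng.choice(n, size=10 * k, replace=False)     # sampled columns: the "circles"
i = next(t for t in range(m) if t not in set(rows))  # some unsampled row

# 10k equations (one per sampled column) for the combination coefficients ...
coef, *_ = np.linalg.lstsq(A[np.ix_(rows, cols)].T, A[i, cols], rcond=None)
# ... and the same coefficients reconstruct every entry of the unsampled row.
assert np.allclose(coef @ A[rows, :], A[i, :])
```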
So let's say the matrix
is not degenerate.
Let's make the following
strong assumption:
every k by k submatrix has
non-zero determinant;
it is non-singular.
That is usually the definition
of strict non-degeneracy.
Then you are done, right?
A less strong assumption,
one that's sometimes made for
data, is: the rank of the whole
matrix is k, but every 10k by
10k submatrix has full rank,
rank k; not non-singular,
but full rank.
That's not unreasonable.
But this is all intuition.
Okay, so the gist of the story
is that if I pick a random
sample of rows and
a random sample of columns,
I should be home.
I should be able to approximate
every entry of the matrix.
We will prove that;
interestingly, we'll prove it
using the matrix multiplication
theorem twice, at least twice,
right?
We'll rigorously prove the error
bound without assuming that A is
low rank; that is not
an assumption we make, right?
Okay, so the picture of
what we're going to do.
So we have a matrix A, which is
big and fat in both dimensions.
I'm gonna take some columns
of A, which form a matrix, C.
I picked a few columns.
This is as tall as that,
yes, but not as broad.
I picked some rows,
that's a row sample.
So this is as broad as A but
not as tall, right?
And the assertion is,
I can compute from C and R,
just given C and R I can compute
a U to put in between there, so
that A is approximately
equal to C times U times R.
You just have to
convince yourself
the dimensions are right.
C has s columns; let's say R has
s rows for the moment;
U is then an s by s matrix, okay,
so the dimensions are correct when
I multiply, just a sanity check.
Okay, so I'll state
the theorem at the end, but
the theorem is: for every matrix
A, if I pick a random sample of
columns and a random sample of
rows according to length squared
(that's important),
then I can compute a U to put
in between, just from
the sample, so that this is true.
I'll spell out the error bound
and everything after a little
motivation so I think I'll
give you one motivation.
See this is used in
a bunch of context but
I'll give you one motivation.
Suppose you have a large
corpus of documents.
Each document as we did in
the first lecture is a word
frequency vector, right.
You just list how many times
each word occurs in the document,
right; so that's your vector, and
it forms a column of a large matrix
consisting of many documents.
Each column is a document; that's
the matrix A, as in the first
lecture.
A new document which is also
a vector comes in, we want its
similarity to each document in
the corpus or maybe you want
the similarity only to the ones
with which it's fairly similar.
A very standard
notion in information
retrieval of similarity
is dot product.
You could use other notions but
dot product is a very
standard notion.
So we want the dot product
of v with every column of A.
That's the same as computing
v transpose A, right?
One benefit of going through this
is that it will revise a bunch of
linear algebra as we go along,
for sure.
So here's a problem
I want to solve.
I want to pre-process A.
I'm allowed a bunch of time for
that; then at query time, a new
document is presented to me, and
I must find the similarities
vector quite fast.
I don't want to spend the time
of a full matrix-vector product
on that.
But we'd also better be accurate
about this vector,
for every document v.
So when I pre-process I don't
know which documents I will be
faced with later, right?
So I have to pre-process so that
I am correct, approximately,
on every vector v.
So we want to find
an estimate of v transpose A
such that the estimate does not
differ from it by more than
delta times the length of v.
The length of v is there because
I can always scale v by a huge
factor, and the error would
blow up with it;
this makes the guarantee
scale-invariant, right?
So I don't want to inflate
the error by scaling,
and I want this to hold for
every v, no exceptions,
because I don't know what
the future holds, right?
So this has got a name, right.
We'll do this by
doing the following:
we just take the estimate to be
v transpose times CUR, okay?
Where CUR was on the last slide.
And that is a fast computation,
because I multiply by v
transpose on this side,
and C is tall and skinny,
so the multiplication is not
going to take much time.
Just think of the number
of columns of C as constant:
then v transpose C is just
a constant number
of dot products, right,
instead of n dot products.
Then multiply by U, and
then by R, so
everything takes a constant
number of dot products only,
instead of n of them,
as in the sketch below.
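The saving is pure associativity: evaluate left to right. A two-line sketch (assuming C, U, R were already computed in preprocessing):

```python
import numpy as np

def similarities(v, C, U, R):
    """Approximate v^T A as ((v^T C) U) R: short dot products only, never n of them."""
    return ((v @ C) @ U) @ R    # cost O(ms + sr + rn) instead of O(mn)
```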
That's the saving we got,
what about the error?
I want the error
maximized over all v, and
I have to scale by the length
of v to be at most delta, okay.
This quantity has a name;
the left-hand side has a name.
It's the spectral norm
of A minus CUR.
So this is a chance for
me to say:
please read up on the spectral
norm if you've forgotten what
it is. It's this: the maximum,
over all unit-length vectors v,
of the length of v transpose
times the matrix (which is the
same as dividing by the length).
For a symmetric matrix it's the
maximum absolute eigenvalue;
these matrices are not
necessarily symmetric, so
it's the maximum singular
value of A minus CUR.
You may not have seen
what singular values are,
you will soon, but eigenvalues
you certainly know.
So I want that to be small.
But this is a randomized
algorithm, so
the random process can give you
terrible results with
very low probability;
that's always possible.
So I just bound the expected
value of this squared
spectral norm.
What we will prove is that if
I take a sample of s columns,
it's bounded by something going
down as 1 over
s to the one-third.
One-third is not the best we
know; we know better things now,
but what we'll prove is
the one-third.
So it's weaker than for
matrix multiplication:
there it was square root of s;
now it's a cube root, where s is
the number of sampled columns,
okay?
So it turns out I'm going to
sample a different number of
rows than columns, curiously,
so I'll put this down.
So everybody understand?
Again, I'll state the theorem
precisely later, but
everybody see what I'm saying?
For any A, I pick a random
subset of columns and
a random subset of rows:
s columns here, a smaller number
of rows, and then I can
guarantee that the spectral norm
difference is at most this much.
The slight problem with proving
spectral norm bounds, and
this happens in a lot of
other contexts, for
instance in
random matrix theory,
is that a spectral norm bound
requires you to prove that for
every vector v this is small.
Okay, so that's somehow
not trivial, right.
You have to prove it for
every one of infinitely
many possible vectors.
The Frobenius norm is a bit easier
because it's a sum of squares,
one fixed quantity.
The spectral norm is not a fixed
quantity; it's a maximum over
many quantities, and you need
each one to be small.
Okay so
any questions on this setup?
I should've said this
on the slide maybe but
here's one question
that could arise.
So I prove, in expectation
at least, that this is smaller
than some epsilon times
the Frobenius norm of A, right?
Okay, several questions
can arise. First,
can I put 2 instead of F here?
Second, can I put F
instead of 2 there?
Both would make things
stronger, right?
So let's consider F and F:
can I have the Frobenius norm of
A minus CUR less than or equal
to epsilon times the Frobenius
norm of A? The answer is no.
That's not possible.
So, that brings me back to
this whole thing, right.
So, that's not possible if
A is essentially high rank, and
a simple example of
that is the identity.
So suppose the A is
the identity matrix, right?
I pick a sample of columns
of A which is just
unit vectors right?
A little thought shows
that I cannot get such
a fantastic approximation of
A just from those columns, right?
I'm only given some unit
vectors as columns;
the rest of the matrix could be
zero, or the rest could be
the identity.
Those two cases
are completely different, and
I can't distinguish between them.
So there's no good low-rank
approximation, and I should
mention that CUR is obviously
low rank: its rank is at most
the number of columns, right?
Another way to say this,
maybe clearer:
there's no matrix of low
rank that approximates
the identity very
well in this sense.
So the identity is essentially
high rank; you cannot beat it.
So an essentially-high-rank
matrix beats you: if all the
dimensions are high,
you're in trouble.
You cannot approximate;
there's no free lunch, right?
And actually a similar thing
applies to the 2-norm as well:
if I put 2s in both places,
that's also not valid, again
because of the identity,
whose spectral norm is just 1,
so you cannot do that.
Okay, so these are the best
two norms we can have
on the two sides.
Okay, so this time I came
prepared with a lot of slides,
so I won't run out of them.
And the hope is I run out of
time before I run out of slides,
because otherwise it's getting
perhaps on the heavier side.
So for the last 15 minutes
I'll go over this, but
then I might repeat it
next time.
So first the intuition about
this, so the intuition is
interesting cuz it's very
curious what I'm going to do.
I'm going to take A and multiply
it with the identity, right?
Using my lemma for sampling,
I'm going to multiply A by the
identity, because I know that
A is equal to A times identity.
So: pretend to multiply A with
the identity by sampling, okay?
Sampling s columns of A.
So we've proved the matrix
multiplication theorem which
said the error is at most going
down as one over root s and
this was Frobenius norms.
Unfortunately, the Frobenius
norm of I is large:
the sum of squares of
all the entries is n,
so the Frobenius norm is the
square root of that, root n.
This is not interesting, right?
So this is useless, because
to get anything better than
Frobenius norm of A, I need s
greater than or equal to n.
So it's useless.
But anyway, why is it useless to
get the Frobenius norm of A
on that side?
Suppose I want to approximate A
with error, in Frobenius norm,
equal to the Frobenius norm of A.
That's trivial, right?
It's not so interesting.
But how would I accomplish
such an approximation?
With 0, right?
I could just take the
approximation to be the zero
matrix, and the bound holds.
So it's no good.
So this pretend multiplication
didn't help us very much because
I seem to need s of at least n.
But it will help us, you see.
Now, see how it helps us.
Let's pretend R R-transpose
is invertible.
R is our sample of rows of A,
so R R-transpose is a small
matrix; R looks like this,
short and wide.
Again, I'll go over all this
a bit more slowly next time.
This R R-transpose
is likely to be invertible;
let's assume it is for
the moment.
The proof works
even if it's not.
Why is it likely
to be invertible?
I picked a very small matrix.
And if you're not degenerate,
it should be invertible.
If that's not obvious to you,
don't worry,
that doesn't matter,
this is just an intuition.
So, with R R-transpose
assumed invertible,
I can define a matrix P which
looks like this: R-transpose,
times the inverse of
R R-transpose, which we said
exists, times R.
And it's not difficult to see
that P acts like the identity
on a particular space,
the row space of R.
Why am I saying this?
I'm going to replace the
identity in the pretend
multiplication by P, and
that will buy me something:
it will get rid of the root n.
And so, okay, maybe I wrote
down a proof of this.
So it acts as the identity on
the row space of R, because if I
have an x belonging to the
row space, I can write it as
a combination of the rows:
x equals R-transpose y.
That's what I'm doing with x.
Now P times x: P is this,
and x is R-transpose y, so
you get R-transpose, then
R R-transpose inverse times
R R-transpose, which
cancel each other out, leaving
R-transpose y, which is the same
as x. So Px equals x:
P acts as if it were
the identity on that space.
Identity is what we want, and
it's not difficult to see this.
And the rest of the space,
everything orthogonal to the
row space of R, is in the
null space of P.
So instead of the pretend
product A times I, I'm going to
do the pretend product A times P.
Now P is an identity on
a very small-dimensional space,
an r-dimensional space,
whereas I was the identity on
the whole n-dimensional space.
So the root n will get
replaced by root r,
because P is an identity on
a very small space, right?
And that buys everything we want:
we'll get a root r over
root s here,
where r is under our control;
we will make it smaller than s.
So we'll have to
prove two things.
There, A was equal to
A times the identity, no problem.
Here it is not true that A
equals A times P; it is on
a particular subspace, but
not on the whole space.
So we have to prove
that the norm of A minus AP
is small, that the spectral
norm is small.
It will follow from one and two.
We also have to prove, next,
that AP is close to CUR.
So: I want to prove A is close to
CUR; I prove A is close to AP,
and then I prove AP
is close to CUR.
And I do that by
doing the following.
I now want to do a pretend
multiplication of A and P.
I take C, a length-squared
sample of columns of A;
I know how to do pretend
multiplication, so
I can take a length-squared
sample of columns.
I then want to pick
the corresponding rows of P.
I will observe that, in fact,
I can do that,
and it will look like U times R.
Here's a hint why it
might look like that:
P ends in R, right?
P is something times R,
so the sampled rows of P
look like something times R,
that is, U times R.
We'll see that; this is handled
by another application of the
matrix multiplication theorem.
So it is as if we are doing
a multiplication of A and
P by picking sample columns
of A, and sample rows of P.
We don't actually carry out
these multiplications;
this is conceptual,
for the proof.
Okay, I think maybe I
shouldn't start the proof now,
because we'll want to come back
and revise all this.
And this is not the original
proof; this version was worked
out later, with these tricks of
employing matrix multiplication
as a conceptual tool,
for the proof.
This is the conclusion.
So the difference is bounded
by some function of r,
as well as s.
Unfortunately, r and
s are not symmetric.
So again, we have an m by n
matrix; r and s are any integers
satisfying some conditions,
actually, they don't have to
satisfy any conditions;
this is true for any r and s.
We do length-squared sampling
of s columns and
length-squared sampling of
r rows; then we can find the
matrix U, which is s by r now,
not s by s, to put in the middle,
so that this holds in
spectral norm.
There are results for
the Frobenius norm, which
I won't state, but