SPEAKER: At CWI before that.
And before that, he was a
student of Arkadi Nemirovski
at Georgia Tech.
And he will tell us about
fast, deterministic, and sparse
dimensionality
reduction, all for
the price of a
single algorithm.
CRISTOBAL GUZMAN: Well, yeah.
So thanks [? Vitale ?]
for the introduction.
I'm very happy to be here
and talk about this work.
So basically it's some work
we did in collaboration
with Daniel Dadush and Neil
Olver, both from Amsterdam.
And this is going to
appear soon in SODA.
So let me talk
first about what do
we mean by dimensionality
reduction, more
precisely, in the
setting of our work.
So the basic idea is
quite simple to explain.
We have a set of high-dimensional
vectors v1 through vn in Rd.
So the way I like
to think about this
is these are rather large
dimensions we're working with,
but the number of vectors
you have is even larger.
That's the kind of regime
we're interested in.
So we want to map these to
a lower dimensional space.
And of course, we don't want
to do this in an arbitrary way.
We wanted to do it in such a way
that we can maintain distances
approximately.
So if these are the original
distances, the norms of vi minus vj
squared, then we want the
projected distances to be
within a 1 plus or minus
epsilon approximation.
So the idea, you can see it
in this very simple picture:
for instance, you have two
vectors on the unit sphere.
So there's one way to find a
projection which maintains
these distances exactly,
which is basically
taking a two-dimensional
plane that
contains both the origin
and these two points.
But the point is that many
hyperplanes can maintain
these distances approximately.
So you can see here
a few examples.
And indeed, this
kind of slack gives us
the freedom to do this
projection with many more
vectors at the same time.
AUDIENCE: You allow
the projection, the method,
to be data-dependent?
CRISTOBAL GUZMAN: Yes,
in our case, we will.
So I guess you're thinking
of all these oblivious kind
of constructions.
I mean, our setting is
not about that in particular,
but it's also very interesting.
AUDIENCE: Another question.
So is m here fixed?
Or are you going to pick m
based on the data as well?
CRISTOBAL GUZMAN: m is going to
be dependent on n and epsilon,
but not on d.
So a few things are
highly desirable. Of course,
if we can make this
mapping linear,
this is very helpful.
For instance, it allows
us to simplify slightly
the kind of low
distortion property by just,
instead of looking at
differences or distances
between points, looking at
maintaining norms of
vectors, namely these vectors
of pairwise differences.
And on the other
hand, we would like
to find some kind of projection
pi which is very easy to apply.
So for instance, if this
matrix is very sparse
or it's highly structured, based
on some kind of [INAUDIBLE],
then that's highly convenient.
So just to mention
a few applications
on how this technique
can be useful,
there are a number
of applications.
I'm not going
through all of them,
but I'm just going to show you a
couple so you get an impression
on how these things are used.
So for instance, for the case
of binary classification,
suppose we have a data set
with labelled examples, positive
and negative here.
And there's this typical
assumption in machine
learning that you can separate
these positive and negative
examples with a certain
hyperplane, which
not only separates them, but
it does so with a margin.
So there's some
margin gamma where
you're guaranteed that you
don't have any vectors.
And somehow, this guarantee
of having this margin
is, in a certain
way, a certificate
that our solution is in fact
quite structured, quite simple.
What you can do is
take any of these sort
of dimensionality
reduction techniques,
map these vectors to a
lower dimensional space.
And since you're approximately
maintaining distances,
basically you can greatly reduce
the dimension of the problem,
losing a little bit of
slack on this margin.
Just a constant factor.
And extending this separator
to the high dimensional space
is trivial.
You just extend it by zeros.
So a second more
involved example
is the case of
nearest neighbors.
So here we have a large
space, v1 through vn in Rd.
And for a certain vector, vi,
and for a certain parameter k,
we want to know what are the
k closest elements to vi.
And we can also have
the same question
about approximate
nearest neighbors.
So the issue here is
that there is, of course,
an algorithm that will go
through all of these input
vectors, compute
distances, and keep
the lowest ones, the closest.
But that's expensive.
Particularly if you're thinking
that d and n are rather large,
that's an issue.
So there's this sort
of classical idea
of somehow pre-processing
the given instance,
coming up with a
nice sort of data
structure that would
allow us to make
queries in much faster time.
The issue is that this
pre-processing typically
will run into a complexity
exponential in the dimension.
That's what's known as the
curse of dimensionality.
And the way that
this has been fixed--
and this is the
classical work of Indyk
and Motwani-- is
that, basically, you
can first project these vectors
using some low distortion
embedding.
You maintain distances
approximately,
and then you can do the
pre-processing efficiently.
Because basically
what you end up
paying in terms
of the projection
dimension is lower than
the number of points.
And we will see that in a bit.
Now let's go into the more
technical aspects of this talk.
I'm going to first talk
about the classical
Johnson-Lindenstrauss Lemma.
And then we'll go and improve
these things bit by bit,
adding these fast,
deterministic, and sparse
properties one by one.
So the classical result, known
as the Johnson-Lindenstrauss
Lemma, basically
says the following.
So the high level
idea is that we
want to come up with a
construction of a matrix pi,
which is random,
and our guarantee is
going to be a bound
on the probability
of having high
distortion with respect
to a fixed but
arbitrary vector v.
So this is phrased in a slightly
different way from what
I was saying in the beginning,
but you can imagine--
I mean, it's easy to turn
this into a guarantee
for multiple vectors.
So the point I want
to emphasize here,
and something that I said
before: the projection
dimension m scales with epsilon
as 1 over epsilon squared.
And if we have a certain
confidence parameter
delta we want to control,
we'll pay a log of 1
over delta in the
projection dimension.
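In symbols, the statement on the slide reads
roughly as follows, for a random matrix pi
in R^(m x d) (transcribed from the slide, so
treat the constants as indicative):

```latex
\[
\Pr\Bigl[\,\bigl|\,\|\Pi v\|_2^2-\|v\|_2^2\,\bigr| > \varepsilon\,\|v\|_2^2\Bigr]\le\delta
\quad\text{for any fixed } v\in\mathbb{R}^d,
\qquad
m = O\!\left(\frac{\log(1/\delta)}{\varepsilon^{2}}\right).
\]
```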
Is this statement clear?
So how do we handle
multiple vectors?
Well, it's just a matter
of taking-- sorry.
AUDIENCE: So I guess random
matrix here means what?
Random over what?
CRISTOBAL GUZMAN: Well, for
now I'll just say random.
In a bit I will explain which
specific constructions we
care about.
So if you want to
handle multiple vectors
in the setting I described
in the beginning,
you can just take a
union bound on this.
So if you care about maintaining
norms of vectors v1 through vn
and you want to have
a constant confidence,
you can make this projection
dimension scale as log n
over epsilon squared.
What kind of distributions
do we care about?
So first of all, the classical
result of Johnson-Lindenstrauss
is just that if you
pick this projection
as a projection on
a random subspace,
this will work with
high probability.
Of course, this is not
very efficient,
because you then need to compute
this projection according
to the subspace.
So the work of Indyk and
Motwani, what they did
is show that you can
make this much simpler.
You can just pick i.i.d.
Gaussian coordinates.
And this will work with the same
kind of probability guarantee.
You can make this
even more efficient.
And that's the proposal of
Achlioptas, which is basically
taking i.i.d.
Rademacher coordinates.
So plus 1 or minus 1,
equiprobably.
And then, in both of these
cases, for the Gaussian
and the Rademacher,
you need to somehow
rescale properly.
So basically, you want
to make the variance
of each entry scale as 1
over m, because the height
of your matrix is m.
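As a quick illustration of these two
constructions, here is a minimal numpy sketch
(my own illustration, not the speaker's code;
the 1/sqrt(m) scaling follows what was just
said):

```python
import numpy as np

def jl_gaussian(m, d, rng):
    # i.i.d. Gaussian entries with variance 1/m, as discussed.
    return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))

def jl_rademacher(m, d, rng):
    # i.i.d. +/-1 entries, equiprobable, rescaled so each entry has variance 1/m.
    return rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, eps = 1000, 500, 0.2
m = int(np.ceil(np.log(n) / eps**2))   # m = O(log n / eps^2) for n vectors

v = rng.normal(size=d)
Pi = jl_rademacher(m, d, rng)
distortion = abs(np.linalg.norm(Pi @ v) ** 2 / np.linalg.norm(v) ** 2 - 1)
print(distortion)                      # below eps with high probability
```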
And all the analyses of
these high probability bounds
are based on this idea of
concentration of measure,
which I won't go into now.
So the natural notion of
concentrated random variables
we'll look at here
is the notion of
sub-Gaussian random variables.
So the definition of a random
variable being sub-Gaussian,
with a certain
sub-Gaussian parameter
nu, is that basically the
moment-generating function
of x is upper bounded by
the moment-generating function
of a Gaussian with variance
equal to nu squared.
So some examples: of
course, a Gaussian
is sub-Gaussian, with the
sub-Gaussian parameter
given by its variance.
Also, the Rademachers
are 1-sub-Gaussian.
So the picture looks a
little bit like this
for things which are
highly concentrated.
So if you look in black
here, this is the distribution
of a Gaussian random
variable-- a sub-Gaussian
is better concentrated
around the origin.
And something which
is not sub-Gaussian
could have this kind of
kink here at the origin.
So let me ask at
this point, are you
all familiar with the proof
of the Chernoff inequality
for sub-Gaussian
random variables?
No?
Should I take a
moment and do it?
It's going to take
just one minute,
and I can do it on the board.
So basically, under
that assumption,
I can prove that the random
variable will concentrate well.
So if I want to control the
probability of x being larger
than a certain
threshold t, I can just
use the monotonicity of
the exponential here.
And even further,
I can write this
as the probability of a
quotient being at least one.
AUDIENCE: A different marker?
Got another one?
CRISTOBAL GUZMAN: Let's see.
Oh, this is much better.
Yeah.
So e to the lambda x divided
by e to the lambda t,
at least one.
So what we see here is we're
basically integrating something
which takes value 1 when
this inequality is satisfied
and in 0 otherwise.
So just using
Markov's inequality,
I can upper bound this value.
And now I'm positioned to
use the sub-Gaussian property
dominating this thing by the
moment-generating function
of the Gaussian.
Do we have a 2 over there?
Yes.
And then, I mean,
it's just a matter--
all I mean here is that
the lambda is positive.
So then I can optimize
in terms of lambda.
On top here, we
have a quadratic--
sorry, this should be
lambda squared, nu squared.
So this is quadratic in lambda.
We can optimize it and we'll
get a concentration of order
e to the minus--
I think it should be t squared
over 2 nu squared
if you optimize this thing.
So that's why we care about
sub-Gaussian random variables
and why they give us
those tail bounds.
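For the record, the board computation is the
standard Chernoff argument; written out (my
reconstruction of the board):

```latex
\[
\Pr[X \ge t]
= \Pr\!\left[\frac{e^{\lambda X}}{e^{\lambda t}} \ge 1\right]
\le \frac{\mathbb{E}\,e^{\lambda X}}{e^{\lambda t}}
\le e^{\lambda^2 \nu^2/2 \,-\, \lambda t}
\qquad (\lambda > 0),
\]
and optimizing over $\lambda$ (take $\lambda = t/\nu^2$) gives
\[
\Pr[X \ge t] \le e^{-t^2/(2\nu^2)}.
\]
```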
So any questions at this point?
Good.
So unfortunately, what
I want to talk about now
is something which
is more involved,
which is basically concentration
for quadratic forms.
So this is the so-called
Hanson-Wright inequality.
So the setting here is
quite similar in principle.
We have a vector of
independent nu-sub-Gaussian
random variables, and we have
a symmetric matrix A which has
0 coefficients on the diagonal.
The assumption of 0
coefficients on the diagonal
is not important for the
result to be true,
but it's going to help us a lot.
And it's always
satisfied in a setting
of Johnson-Lindenstrauss.
So basically, what we have
under these assumptions
is that the quadratic
form, x transpose
Ax, is going to
concentrate at this rate.
And this rate is
an expression that
looks slightly complicated.
It depends on the
operator norm of A,
on the Frobenius
norm of A, and it has
these two kind of behaviors.
So let me make a little plot
so you can get a handle
on what this looks like.
This term in blue here is a
sub-Gaussian kind of tail.
So it looks like this over here.
And the other one is just
an exponential decay in t.
So you have this
kind of behavior.
And basically, this minimum
is picking the worst
of both in each region.
It's still an exponential
concentration.
It's just not as
strong as sub-Gaussian.
And from now on, I'm
not going to care
much about this nu squared
term, because we mostly
care about the case
of the Rademachers,
which are 1 sub-Gaussian.
So you can completely
forget about the nu.
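Written out, the inequality on the slide is
roughly the following (transcribed; c is the
absolute constant discussed later):

```latex
\[
\Pr\Bigl[\,\bigl|x^{\top}Ax\bigr|\ge t\,\Bigr]
\le 2\exp\!\left(-c\,\min\!\left\{\frac{t^{2}}{\nu^{4}\|A\|_F^{2}},\;
\frac{t}{\nu^{2}\|A\|}\right\}\right),
\]
for $x$ with independent, mean-zero, $\nu$-sub-Gaussian coordinates and $A$
symmetric with zero diagonal (so $\mathbb{E}\,x^{\top}Ax = 0$ and no centering
term is needed).
```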
AUDIENCE: Sorry, question.
Can you just go back.
What was A again?
CRISTOBAL GUZMAN: A
is a matrix that--
it's a symmetric
matrix which has
0 coefficients in the diagonal.
So think about the case
of the Rademachers.
Well, these diagonal
terms are just
multiplied by this sigma
squared, which are always 1.
So that's a deterministic term.
OK.
So now, I'm going
to show you briefly
how you can prove
Johnson-Lindenstrauss
based on this concentration
of quadratic forms.
This is a proof due
to Kane and Nelson.
So the setting of our projection
matrix is what I said before.
We'll pick i.i.d.
Rademacher coordinates.
And we'll scale them
by 1 over square root
of the height of this
matrix, or number of rows.
Now for notational purposes--
I mean, this is the random
object I want to study.
It's a matrix.
And I want to write it
as a quadratic form,
so I need to write
it in vector form.
And to do that, I'm
basically just stacking
the rows of this matrix.
And that's what I
define as sigma.
It's just a vectorized
version of pi.
Well now, let us note that the
kind of low distortion property
that the JL matrix satisfies
is invariant under scaling.
So I have norms everywhere.
Norm squared of v. Norm
squared of the projection.
I can just divide by the norm.
I only care about vectors
in the unit sphere.
So the property of JL
basically only requires this.
And I'm going to show that this
kind of expression, which--
I mean, if you isolate.
Maybe I can do it on the board.
But if you write down the kind
of low distortion property
for a unit vector,
it's basically
requiring that this norm
of pi v squared minus 1,
it's between minus
epsilon and epsilon.
Is that clear for everybody?
So I'm just going
to expand this.
I'm going to show that this is
equivalent to a quadratic form
in sigma.
So if you look at what this
pi matrix is doing-- well,
basically, it's taking these
row-vector inner products.
And then you have the sum of
squares of the inner products
you have for each row.
So this is what we get.
Since it's a unit vector,
I can replace this one
by a sum of the squares
of the coefficients
of v. That's what I did here.
Oh.
Not seeing-- OK.
Yeah, here.
Here.
And then, this square, I'm
just going to expand it
as a product of two terms.
And I'm going to use
different indices
because I want to make
all these cross products.
So if you expand that,
what you will see,
it's basically that this is
a product of a rank 1 matrix.
It's a quadratic form of
sigma, but with respect
to a rank 1 matrix, which
is v v transpose.
And then basically here, we're
removing the diagonal terms.
So it's exactly in
the framework of
this Hanson-Wright
inequality, where we
can say very precisely
what the matrix is.
It's very structured.
Block diagonal.
Each block, it's a
rank 1 perturbation
of a diagonal matrix.
So just using
Hanson-Wright inequality,
I can show that the
deviation I care about,
it's upper bounded by this
exponential decay [INAUDIBLE]..
And in this case, I
can compute explicitly
what these two matrix norms are.
I'm not going to
do it here, but I
can tell you that this
thing scales as 1 over m.
And the operator norm
has the same behavior.
So this is what you get here.
Then, it's just a matter
of choosing the right m,
so you can guarantee that
this deviation probability, is
less than delta.
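To summarize the computation on the slides in
one display (my transcription; v is a unit
vector and the sigmas are the Rademachers):

```latex
\[
\|\Pi v\|_2^2 - 1
= \frac{1}{m}\sum_{r=1}^{m}\Bigl(\sum_{j=1}^{d}\sigma_{rj}v_j\Bigr)^{2} - \|v\|_2^2
= \frac{1}{m}\sum_{r=1}^{m}\sum_{j\neq k} v_j v_k\,\sigma_{rj}\sigma_{rk}
= \sigma^{\top}A_v\,\sigma,
\]
using $\sigma_{rj}^2 = 1$ and $\|v\|_2 = 1$. Here $A_v$ is block diagonal with
$m$ blocks $\tfrac{1}{m}\bigl(vv^{\top}-\operatorname{diag}(v_1^2,\dots,v_d^2)\bigr)$;
both $\|A_v\|_F^2$ and $\|A_v\|$ are $O(1/m)$, so Hanson-Wright gives a deviation
probability of at most $2\exp(-c'\varepsilon^2 m)$ for $\varepsilon \le 1$, and
$m = O(\varepsilon^{-2}\log(1/\delta))$ suffices.
```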
OK.
So now we know we can use
the Hanson-Wright inequality
for proving distributional JL.
What I'm going to
do now is show you
a new proof of the
Hanson-Wright inequality.
So I haven't even done the
proof of Hanson-Wright.
And why do I want to get a
new proof of Hanson-Wright?
Well, there's many reasons.
The most important one,
if you have ever seen it,
all these proofs at
some point use some kind
of decoupling step, which
allows us to somehow partition
this quadratic form in a nice
way, getting linear terms.
And you can condition on those.
So somehow, it's
trying to reduce
this quadratic concentration
in something that looks more
like a linear concentration.
But that introduces some
artificial random variables.
So basically, you're taking
some Bernoulli random variables
on top of this
quadratic form, and this
introduces some extra variance.
So there's no way that this
proof technique is going
to give you something sharp.
In fact, most of the
proofs of Hanson-Wright
do not have any explicit
constants on the c
that I mentioned before
in the concentration.
The best we can find
in the literature
is something that was like 1/64.
And this c, the lower it is,
the worse the concentration.
So it really is a
bad kind of constant.
On the other hand, we want to
get a deterministic counterpart
of this algorithm, and
derandomizing this decoupling
is extremely slow.
You basically need
to enumerate all
the possible realizations
of these Bernoulli random
variables.
So what I'm going to do
now is show a simpler proof
of the Hanson-Wright
inequality with a constant
which is much better.
So it's something like 3/20.
Much better than that.
AUDIENCE: Is the-- the
constant is optimal?
CRISTOBAL GUZMAN:
It's sharp in a sense.
You will see the proof
and I can tell you
which sense it's sharp.
I mean, it's hard to say
what's sharp here because--
I mean, let's go back here.
I mean, this kind of
inequality, it's something that
only depends on the
operator and Frobenius norm.
If you really want to
get something sharp,
you should use A completely.
But in the sense of the
dependence on these two things,
it's sharp.
OK.
And the proof is quite simple.
It's based on this
very simple Lemma that
basically says the following.
So suppose you have two random
vectors, each with independent
coordinates, and each
coordinate of the first is
dominated, in terms of the
moment-generating function,
by the corresponding
coordinate of the second.
So you should think of these
y's as being the Rademachers
and the z's as
being the Gaussians.
That's what we have.
And then what we
can show is that
for any symmetric
matrix with 0 diagonal,
well, the moment-generating
function of the quadratic form
is dominated by the other.
And this, in principle,
is highly nontrivial,
but I'm going to
show you the proof.
It's two lines.
So how do we do this?
Well, first of all, you write
down what this expectation is.
So there's a bunch of
independent random variables.
And what we can do is somehow
condition on y2 through yk.
And look what's inside here
with respect to y1 already.
The expression might
look a bit nasty.
But basically, the only
thing you need to observe
is that the A11
term is 0, so everything here
is just affine in y1.
So basically, this is
a linear term in y1
and this is a constant.
So you can use the hypotheses
and replace this y1 by a z1.
Now you can take this
expectation with respect
to z1 outside, bring the y2
inside, and do this again.
And that's all.
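In symbols, the two lines are roughly the
following (my reconstruction from the verbal
walkthrough):

```latex
\[
\mathbb{E}_{y}\,e^{\lambda\,y^{\top}Ay}
= \mathbb{E}_{y_2,\dots,y_k}\Bigl[e^{\lambda q}\;\mathbb{E}_{y_1}e^{(\lambda c)\,y_1}\Bigr]
\;\le\; \mathbb{E}_{y_2,\dots,y_k}\Bigl[e^{\lambda q}\;\mathbb{E}_{z_1}e^{(\lambda c)\,z_1}\Bigr],
\]
where $c=\sum_{j\ge 2}(A_{1j}+A_{j1})y_j$ and $q=\sum_{j,k\ge 2}A_{jk}y_jy_k$
do not involve $y_1$; the point is that $A_{11}=0$ makes the exponent affine
in $y_1$, so the hypothesis applies with parameter $\lambda c$. Repeating
this coordinate by coordinate replaces every $y_i$ by $z_i$.
```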
Now from here onwards, I
mean what we're going to do
is basically follow
the same strategy
for regular Hanson-Wright.
So really, this
is the heart of--
AUDIENCE: Are you
using symmetry of A?
CRISTOBAL GUZMAN: No.
I don't think I need that.
Yeah.
AUDIENCE: It was
just a 0 diagonal?
CRISTOBAL GUZMAN: It's
just a 0 diagonal.
It doesn't even matter
what these signs
are because you know that
you have it for every lambda.
So--
AUDIENCE: We just need to
make the quadratic to vanish,
that's the only thing--
CRISTOBAL GUZMAN: Exactly.
That's all.
So I'm going to show you
a little bit of how this
is done once you have this
Lemma because it makes sense,
but I'm not going to go
through full details here.
So what's important from
this is that we can replace--
I mean, if we're
trying to control
this probability of
deviation with respect
to the Rademachers, we can use,
again, Chernoff's strategy.
So we get this
exponential thing.
The moment-generating
function with respect to this.
And then, by the Lemma,
we can replace it
by the one with the Gaussian.
Now the key thing is that the
Gaussian is unitarily invariant,
so I can look at it in
whatever basis I want.
So why not look at
it in the eigenbasis?
Then, by this
unitary invariance,
I can just look at
this and change basis.
I get something which
is quadratic, but then
with a diagonal matrix.
So basically, I have a
bunch of Gaussian squareds
with the respective
coefficients.
These are the eigenvalues of
A. That's also important to
[INAUDIBLE].
So doing this, I
get something which
is a sum of independent
random variables.
I can turn it into a
product and I can explicitly
compute each of these
moment-generating functions,
which gives us this.
And now you see why the operator
norm becomes important, right?
I mean, these are the
eigenvalues of the matrix.
Whenever we reach anything
beyond 1 over the operator
norm of A, we're going to run
into this thing diverging.
And we can also write
this alternatively,
since these are the eigenvalues,
as a determinant.
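Concretely, the computation just described is
(my transcription, with g a standard Gaussian
vector):

```latex
\[
\mathbb{E}\,e^{\lambda\,g^{\top}Ag}
= \prod_{i=1}^{k}\mathbb{E}\,e^{\lambda\mu_i g_i^2}
= \prod_{i=1}^{k}\bigl(1-2\lambda\mu_i\bigr)^{-1/2}
= \det\bigl(I-2\lambda A\bigr)^{-1/2},
\qquad 0 < \lambda < \frac{1}{2\|A\|},
\]
using the rotation invariance of the standard Gaussian $g$ to diagonalize $A$
(eigenvalues $\mu_1,\dots,\mu_k$), and the chi-squared moment-generating
function $\mathbb{E}\,e^{s g_i^2} = (1-2s)^{-1/2}$ for $s<1/2$.
```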
So I should say
here at this point,
once you have this
inequality, once you replace
this thing over here
by this determinant,
you get something
which is completely
independent of the distribution.
So it's just a matter of
optimizing this thing.
In fact, if you just use
this kind of quotients
that we have in
this product here
and you upper bound them
by appropriate exponential
function, which is
what is typically
done for the concentration of
chi square random variables,
you can just follow
the standard proof,
and you'll get a Hanson-Wright
inequality with constant c
equals 1/8.
So that's already much
better than this 1/64.
So what we do here is
do a little bit of work
in terms of how we can
optimize this thing.
With a determinant, you can get
something which is much finer.
And we carefully lower bound it.
But it's not very important.
I mean, you just do this and
then optimize over lambda.
In that case, you get this
kind of concentration.
It looks a bit more
complicated than the standard
Hanson-Wright, but
let me just tell you,
you can dominate this
thing by the standard form
of Hanson-Wright and
the constant you get
is this one I promised.
So 1 minus log 2 over 2.
AUDIENCE: Are the
terms, exponents, tight?
CRISTOBAL GUZMAN: Say again.
AUDIENCE: The terms,
is it tight, basically?
CRISTOBAL GUZMAN: This
one is tight, yes.
But then if you want to turn
it into a regular Hanson-Wright
expression, you can do so
with this kind of constant.
So any other questions?
OK, great.
So now, let's turn to
the question of sparsity.
Of course, replacing
Gaussians by Rademachers
is a significant improvement,
but we still end up
with a matrix which is dense.
And we would like
to sparsify it so we
can get a faster embedding.
Now, the bad news
here is that it
was proved that for
independent entries,
sparsification is only going
to buy you a constant factor.
I don't know the
exact constant, but I
think maybe one third of the
coefficients can go away,
but not much more than that.
So that's what
Achlioptas proved,
and Matousek got a lower bound.
And fortunately,
there's a way out, which
is trying dependent entries.
And that was proposed
by Kane and Nelson.
So the idea is that
here, we can think
of this as basically the same
distribution we had before
with these i.i.d. entries.
But now, we select a
random subset of size s.
And we can do this in two ways.
Either we can pick a
uniformly random subset or we
can partition in blocks, in s
blocks of size m divided by s.
And in each of these
blocks, we just
select one coordinate
uniformly at random.
So for a certain column
sparsity parameter,
we choose an s that I'm going to
specify in a couple of slides.
All we're doing is coming
up with a way--
so you can think of
this sparse construction
as a two-step process.
First, you select a
subset of coordinates
you want to maintain.
So basically, this is
the masking matrix.
And then only for
those coordinates,
we are picking this plus 1,
minus 1 random realizations.
So in fancy linear
algebra notation,
this looks as a component-wise
or Hadamard product
of these two matrices.
And now, the scaling
factor is different.
It used to be 1 over
square root of m.
But now, since we're only
selecting s-coordinates,
it's just 1 over
square root of s.
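A minimal sketch of the block construction
just described (my own illustration, not the
paper's code):

```python
import numpy as np

def sparse_jl(m, d, s, rng):
    """Kane-Nelson-style sparse JL: s nonzeros per column, one per block."""
    assert m % s == 0
    block = m // s                    # each column is split into s blocks of size m/s
    Pi = np.zeros((m, d))
    for j in range(d):
        for b in range(s):
            row = b * block + rng.integers(block)  # one uniform random row per block
            Pi[row, j] = rng.choice([-1.0, 1.0])   # Rademacher sign on that entry
    return Pi / np.sqrt(s)            # scaling is 1/sqrt(s), not 1/sqrt(m)

rng = np.random.default_rng(0)
Pi = sparse_jl(m=200, d=500, s=40, rng=rng)  # e.g. s = eps * m with eps = 0.2
print((Pi != 0).sum(axis=0))                 # exactly s nonzeros per column
```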
So the analysis
is quite similar.
And I'm not going to
go through the details,
but the point is, even
in this sparse case,
you can pretty much do
the same kind of argument
in terms of writing
this deviation
as a random quadratic
form in the same vector sigma.
And this kind of block-diagonal
matrix we used to have,
it's almost the same.
But now, each of
these vectors B are
sparsified by a different
row of the masking.
So is all the setup
clear until this point?
OK.
So the point is
that basically, we
can follow a similar argument
as was done in classical JL.
So I'm going to set the
sparsity to be epsilon times m.
So basically, we have
m-coordinates per column
and we're selecting an
epsilon fraction of them.
And then if you assume
that this event star holds,
you have these two upper bounds.
You can condition on
that, and then just use
the regular Hanson-Wright.
And since we picked
this s to be epsilon m,
these two terms over
here are balanced.
So the kind of
concentration we get
is something like 1 over
n to a certain power.
So this is vanishing with
the number of input vectors.
And all we're left
with is proving
that this event star happens
with high enough probability.
Now, this involves two kind
of bounds for this matrix
we defined here.
However, the part regarding the
operator norm, it's very easy.
So not only with
high probability,
it's for sure that the
operator norm
is bounded by 1/s.
The part regarding the masking
is much more difficult.
Or I mean, significantly
more difficult in comparison.
And if you just
write it down, you
get this kind of expression.
So what we're looking
at here is somehow
a sum on the different pairs
of columns of these expressions
that are scaled by
the coordinates of v.
And this qjk, it's
a random variable.
But it turns out that it's a
sum of non-independent random
variables.
I'm going to explain right
next why this is the case
and what's basically the
meaning of this thing.
So what's happening is that
this qjk random variable, what
it's doing is looking at two
different columns, j and k.
And here, I'm denoting by x
the selection of the masking.
So suppose that the
sparsity here is indeed 1.
So we're just selecting
one element per column.
And here, I drew a random
realization for these two columns.
So what this qjk
random variable denotes
is basically counting the number
of collisions within these two
different columns.
So we're just basically
going through here
and seeing what are
the coincidences.
And in this realization, in
fact we have no collisions
so the q is 0.
But in general, it
could be something else.
Of course, it's always going
to be upper bounded by s.
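In code, the collision count qjk between two
columns of the mask is just this (an
illustrative toy snippet):

```python
import numpy as np

# Two columns of the 0/1 masking matrix (same number of rows, s ones each).
col_j = np.array([1, 0, 0, 1, 0, 1, 0, 0])
col_k = np.array([0, 1, 0, 1, 0, 0, 1, 0])

q_jk = int(np.sum(col_j * col_k))   # rows selected in BOTH columns
print(q_jk)                          # here 1 collision (row 3); always at most s
```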
Now, you can see why this
thing is not independent
because we're choosing a
random subset in each of these.
However, it's clear
that this thing--
kind of the dependence goes
in the right direction.
We have somehow
negative dependence
in the sum of random variables.
Because, for instance,
if the sparsity is 1
and I tell you that I selected
here the first coordinate
in column j, well,
you immediately
know that everything
else has to be 0.
So that's what I mean
by negative dependence.
Whenever I know that
I picked a coordinate,
that immediately makes
everything else less probable.
And when you have these
kind of negative dependence,
you can basically go through
with the same argument.
You just replace these
dependent random variables
by their independent marginals.
And then you can sort
of use the same kind
of concentration argument.
And I'm not going to do it
here because even from here,
it gets a bit technical.
But I think this is the
heart of what's going on.
OK.
So now, let's go into the
derandomization business.
So first of all,
in certain settings
it's, of course, preferable to
get something deterministic as
opposed to random.
And there's many
reasons for that.
For instance, I
mean, the guarantees
we get with distributional JL
hold with high probability,
but we might want to prefer
something that holds surely.
And on the other hand, of course,
a deterministic algorithm
will not be sensitive to
these random fluctuations
that we're picking
for our projections.
There is this hope--
I'm not sure, and I don't
have any theorem for this,
but I would like to see
whether, for instance,
these deterministic
constructions do a better
job exploiting the structure
of the input as opposed
to this distributional
JL, which is completely
oblivious to the input.
And of course also, there
is conceptual reasons
to think what can be
achieved with randomized
versus deterministic algorithms.
So let me, just in
a nutshell, explain
what's our basic
algorithmic principle, which
is based on the method
of derandomization
by conditional expectations.
So suppose we have a
randomized algorithm which
depends on these independent
random outcomes, x1 through xk.
You may think of this as the
Rademachers, for instance.
And our probabilistic
guarantee looks like this.
So some sort of expectation
of a certain function, upper
bounded by a confidence delta.
So you should think
of this basically
being the event of
having low distortion.
I mean, we can write
this big expectation
by integrating with respect
to the first random variable
and just using a probabilistic
method kind of argument to say,
if this integral is less
than or equal to delta,
then there exists
a realization of x1
which gives me the
expectation with
respect to the rest
upper bounded by delta.
And I could just
repeat the argument.
In the end, what I find is
a deterministic assignment
of these x's.
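Schematically, the method looks like this (a
generic sketch with a toy potential, assuming
we can evaluate the conditional expectations,
which is exactly the concern addressed next):

```python
def derandomize(k, cond_expectation):
    """Method of conditional expectations for +/-1 variables.

    cond_expectation(prefix) should return E[f(x) | x_1..x_i = prefix]
    over uniform +/-1 choices of the remaining coordinates.
    """
    prefix = []
    for _ in range(k):
        # By averaging, at least one sign does not increase the
        # conditional expectation, so the bound E[f] <= delta is preserved.
        if cond_expectation(prefix + [+1]) <= cond_expectation(prefix + [-1]):
            prefix.append(+1)
        else:
            prefix.append(-1)
    return prefix

# Toy example: f(x) = (sum_i x_i)^2, whose conditional expectation has
# the closed form (partial sum)^2 + (number of remaining coordinates).
k = 8
cond = lambda p: sum(p) ** 2 + (k - len(p))
x = derandomize(k, cond)
print(x, sum(x) ** 2)   # balanced signs, achieving (sum x)^2 <= E[f] = k
```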
Of course, this is only
going to run efficiently
if we can compute these
conditional expectations
efficiently, which is
typically not the case.
If you think about these
Bernoulli random variables
or Rademachers, you'll
have an exponential number
of realizations.
However, there is no need to
do this if we can properly
upper bound these expectations.
And this leads us to the idea
of pessimistic estimators.
So the idea here is that
there are some parameters, p.
And for each of
these parameters,
there's an associated
distribution, D of p.
And we'll say psi is a
pessimistic estimator if,
for any possible distribution
Dp, the probability
of the event we care about
is upper bounded by this psi
function.
So the very
high-level idea would
be that somehow we
use this as some sort
of potential function, right?
So we start with the
distribution that guarantees
this probabilistic statement.
And then we somehow
traverse this space
of parameters, [INAUDIBLE]
using different kinds of p's
[INAUDIBLE].
So basically, what
we're doing is
partially fixing the
coordinates through this
family of distributions, and we
end up with something
which is fully deterministic.
But whenever we do this,
we do it in such a way
that we can control this
failure probability.
And you can imagine
that somehow,
if this psi, for
instance, is concave,
this will allow me to pick
appropriate derandomization
at every step.
So let me explain
very simply how
we can, for
instance, derandomize
the Hanson-Wright inequality
using pessimistic estimators.
So here, our space of
parameters will be--
so remember that we have
these vectors x1 through xk.
There is this matrix
A. So our parameters
are basically indexing
what things we have fixed
and which we haven't yet.
So there's a vector of length k,
which can be either minus 1, 0,
or plus 1.
What's happening with this--
the distribution
we'll pick according
to this parameter, D of sigma,
will be such
that its marginal,
when sigma i is 0,
is going to be a Gaussian.
And when you fix one of these
to be either plus 1 or minus 1,
we're fixing that
zi to be that value.
The pessimistic
estimator we will use
is based on the
Chernoff inequality.
So basically, since we know
that the failure probability is
upper bounded by this, we
could use this as a proxy.
And what's important
to show, of course,
is that you have this
concavity property.
So basically, whenever you have
a partial assignment of sigma
and there's a coordinate
which has value 0 currently.
So we have a Gaussian there.
If you choose an
assignment of sigma i
to be plus 1 or minus 1,
these two, on average,
are not worse than
the original psi.
So basically, one
of these choices
is as good as the psi of sigma.
And then, we can move on.
So in very simple words,
this derandomization
is based on that idea.
So we start with having no
idea how to fix the signs,
so we have all Gaussians.
And we go, one by one,
minimizing this psi potential.
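The property being used is just the averaging
step (my transcription of the argument):

```latex
\[
\min\Bigl\{\psi\bigl(\sigma^{(i\to+1)}\bigr),\,\psi\bigl(\sigma^{(i\to-1)}\bigr)\Bigr\}
\;\le\;
\tfrac{1}{2}\psi\bigl(\sigma^{(i\to+1)}\bigr)+\tfrac{1}{2}\psi\bigl(\sigma^{(i\to-1)}\bigr)
\;\le\;\psi(\sigma),
\]
so greedily picking the better sign never increases the pessimistic estimator.
```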
Now, a few technical
comments because you
need to compute this
pessimistic estimator.
And this looks like this
moment-generating function.
What I showed earlier in the talk
is that this kind of thing, when
you have only
Gaussians, looks
like a determinant of something.
And even when you
condition on things,
you still get this
kind of expression.
So basically,
computing the psi will
require some algorithm for
computing determinants--
say, for instance, SVD.
So even if you tried
to make this faster,
you're not going
to get linear time.
However, for the case of
JL, things are much better.
Because our matrix, it's
a block diagonal matrix
where each block is a rank 1
perturbation of a diagonal one.
So computing determinant of
a diagonal matrix is trivial.
It takes linear time.
And for doing this
rank 1 update,
we can use the Sherman-Morrison
formula from linear algebra,
which also allows us to
do this in linear time.
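For instance, the rank-1 determinant update
(the determinant-lemma companion of the
Sherman-Morrison formula) is linear time for
a diagonal base matrix; a quick sketch:

```python
import numpy as np

def det_diag_plus_rank_one(d, u, v):
    """det(diag(d) + u v^T) = det(diag(d)) * (1 + v^T diag(d)^{-1} u).

    Linear time, versus cubic for a generic determinant (e.g. via SVD/LU).
    Assumes diag(d) is invertible.
    """
    return np.prod(d) * (1.0 + np.sum(v * u / d))

rng = np.random.default_rng(0)
d, u, v = rng.normal(size=5) + 3.0, rng.normal(size=5), rng.normal(size=5)
print(det_diag_plus_rank_one(d, u, v))
print(np.linalg.det(np.diag(d) + np.outer(u, v)))   # agrees, but O(k^3)
```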
So the good news is that we
can run this efficiently for--
not only efficient, but linear
time, in the case of JL.
And we end up with an algorithm
that runs in this time.
So NNZ of V, in this case--
if you take your whole input,
all these vectors V-- is counting
the total number of non-zeros
of those vectors.
So it's a sum of the sparsities
across all input vectors.
And under my assumption that
this number of non-zeros
is at least as large
as the dimension,
we get something which has
complexity s times NNZ of V.
AUDIENCE: Here you're referring
to the sparse version of JL,
like S sparse?
CRISTOBAL GUZMAN: Yeah.
Right.
Because otherwise,
you will get M here.
You'll get the full
number of rows.
I'm thinking about when you
already made this masking,
so you get this P number of--
and then you have S
times [INAUDIBLE]..
Now, the masking is
a bit more technical.
And I don't know, how
much time do I have?
AUDIENCE: 6 minutes.
CRISTOBAL GUZMAN:
Like 10 minutes?
OK.
AUDIENCE: It depends how much--
CRISTOBAL GUZMAN: No,
it's not that long,
but I want to show some
numerical results as well.
So yeah.
So the [INAUDIBLE]
distribution for the masking
is, of course, a bit trickier
because, as I said, things
are not independent.
But in fact, we can
pretty much ignore that,
and it kind of works as well.
So our space of parameters
is basically a matrix.
So where we have this
masking matrix, which
is 0, 1, what we're
looking at here,
it's the probability of
selecting a certain coordinate.
So this prj represents
the marginal probability
of making that choice,
of taking this coordinate.
As I said, in the
Kane-Nelson analysis,
these choices are
not independent.
We have these
constraints, right?
So for every column, we're
selecting s-coordinates.
But what we can do
here is basically,
we can start with
this phase of--
I mean, we're going to
start with this p being
uniformly equal to epsilon,
the uniform choice.
And we're going to somehow
traverse this [INAUDIBLE],
making deterministic
assignments one at a time.
So if you're familiar with
the technique of pipage
rounding by any means,
this is very similar.
So in fact, we use
independent marginals.
And somehow, we traverse this
[INAUDIBLE] in such a way
that we don't violate
these constraints.
OK.
I don't want to go too much in
the details on what this is.
But basically, this
is what you get
using Chernoff's
inequality in the event
that you want to control
this Frobenius norm.
So that's what-- remember
that event we wanted
to control with high probability,
that this Frobenius norm of A
was upper bounded by something.
So this is basically
what we're writing here.
We just use Holder's
inequality so that we can
make things nicely decouple.
So instead of taking
products inside
of this expectation with these
different choices of columns,
we just use Holder's inequality
and we take these as exponents.
And basically, what we end
up doing in the algorithm,
it's a very simple idea.
Let me go to the board and just
make a quick drawing of this.
So we have this masking matrix.
This has m rows and d columns.
And what we're going to do is
partition the space of rows
into s blocks.
And each block, of course,
will have size m divided by s.
So what we'll do is
derandomize by blocks.
We're going to pick
the first block.
We're going to compute the
pessimistic estimator here
for every possible choice.
And we're going to
stick to the one that
minimizes the potential.
That gives me some
choice, and then I
move on to the next block.
I do the same.
So the whole point is that I
can do these updates fairly
quickly.
And also, of course,
there is a point
that this pessimistic
estimator has
to be concave on the blocks.
So every time I have
a partial assignment
and I want to make an update,
I make these final [INAUDIBLE]
choices.
Since the pessimistic
estimator is concave,
there's one choice, which is
as good as what I had before.
And yet again, there is
some technical details
that happened here.
Computing this thing, in
general, is expensive.
So what we do is there is a
choice of a particular input
vector set, which
is basically all
the differences between
different canonical vectors.
And this we call the
universal masking,
but it's also present
in the Kane-Nelson work.
They call it the
code construction.
And if you make the assumption
that your input is sufficiently
large, then your running
time is guaranteed
to be of the same order that
we have for the signing.
And in this region,
what we get is basically
the same running time
that you would get just
with the randomized
assignment, and then
making the matrix
vector product.
So what we get, or what we know,
is this deterministic guarantee
for a given input vector set.
So we maintain low distortion.
We have a guarantee on
the projection dimension
proportional to what you get
with standard JL.
And the running
time, it's the same
as what you would get from
the Kane-Nelson analysis.
So how much time do I have now?
AUDIENCE: About 5.
CRISTOBAL GUZMAN: Like 5?
OK.
AUDIENCE: I have a question.
This stuff with
pessimistic estimators,
is that something standard
for derandomization, or is it
that something that
you developed for--
CRISTOBAL GUZMAN: It's
one of the-- no, no.
I would say it's
fairly standard.
I mean, not many
people know about it.
But people that work in
derandomization, they--
[INTERPOSING VOICES]
AUDIENCE: That seems
really powerful.
CRISTOBAL GUZMAN: It's powerful.
And I like it because
it's very natural.
I mean, you can turn
these probabilistic proofs
into algorithms.
And that's very beautiful.
AUDIENCE: It's by
[INAUDIBLE] and Spencer.
AUDIENCE: OK.
Thank you.
CRISTOBAL GUZMAN: Yeah.
That's from the '80s, right?
And there have been
many works on--
AUDIENCE: Right.
The original, I guess, you--
[INAUDIBLE] and Joe Spencer.
CRISTOBAL GUZMAN: In fact,
there is a version of JL
which is based on this method
of pessimistic estimators.
We realized this after we
were done with this work,
but it's kind of the same
idea in the dense setting.
The analysis is much
trickier, though.
AUDIENCE: [INAUDIBLE]
CRISTOBAL GUZMAN: I think so.
I can't tell for sure.
AUDIENCE: You imagined it's
something much simpler.
AUDIENCE: OK.
AUDIENCE: [INAUDIBLE].
CRISTOBAL GUZMAN:
So let me quickly go
through some numerical results.
And this is very preliminary
work, I should tell you that.
This semester, I'm teaching
a machine learning course,
and I gave this as an
end-of-the-semester project
for my students.
So they carried out
these experiments.
And I got them last week,
so I haven't checked
all the details, but just want
to show you a little bit how
this looks like.
So what we do is we get
this randomly-generated data
in dimension 500.
And we have 1,000 input vectors.
So we want to maintain
the distortion
of those with an epsilon
factor which is 0.2.
If you use the JL
statement we had before,
this approximately tells you
that the projection dimension
is like this, 175.
So one of the questions
we were wondering
is whether derandomization
is much slower.
So here, there
are a few results.
With this instance size, you can
see that the random algorithm
is pretty quick.
It takes like 20 seconds.
The deterministic one
takes twice as much.
But they haven't
completely optimized--
so all these tricks I told
you of doing things very fast,
they didn't test them.
So for instance, they used SVD
for computing the determinants.
I'm still surprised that this
works in this many dimensions.
But you can see, for instance--
OK, the success probability
you get with the random one,
of course, is going to be lower.
But I'm a bit surprised
that it's this low.
So they also carried out some--
AUDIENCE: Is that
probability, is
that computed
empirically, or is that--
CRISTOBAL GUZMAN: It's
computed empirically.
So basically-- right.
So your construction is
completely oblivious,
and then you can
evaluate the frequency
of things that were distorted.
AUDIENCE: You have
m equal to 175.
How do we expect
that [INAUDIBLE]?
CRISTOBAL GUZMAN: Yeah.
AUDIENCE: And what
is your delta?
CRISTOBAL GUZMAN: There is no
delta here because you're--
AUDIENCE: No, for random.
CRISTOBAL GUZMAN: Say again.
AUDIENCE: For random.
AUDIENCE: Measuring it
empirically, I think.
AUDIENCE: Yeah.
That's what I'm saying.
AUDIENCE: So you
didn't take three runs
and take the median or whatever?
CRISTOBAL GUZMAN: No.
These are for
particular realizations,
and then you compute
the frequency of things
that were distorted or not.
AUDIENCE: Ah, I see.
CRISTOBAL GUZMAN: It's
super-simple, yeah.
And then you can make
some box plots with more.
So here, there's 20
realizations of this thing.
So most of the time, it
goes pretty well, over 95%.
That's what I would expect.
But every once in a while,
you see this much lower
performance.
AUDIENCE: So the randomization,
is it easy or hard for like JL?
AUDIENCE: What?
AUDIENCE: The instance you
gave was random, right?
CRISTOBAL GUZMAN: Right.
AUDIENCE: Is it
easier or harder?
AUDIENCE: I don't
expect easier--
[INTERPOSING VOICES]
AUDIENCE: Let us know.
Maybe it's the same.
Maybe it is the hardest.
I don't know.
Because it spans everything.
It's kind of--
because it's full--
maybe you expect things
which are in lower dimensions
to be actually easier.
I don't know.
CRISTOBAL GUZMAN: For instance,
in this example, what we did
is--
so there are these 500
dimensions, but the vectors
are actually supported on
the first 175 coordinates.
So one of the things
that I wanted to test
is whether this deterministic
algorithm will figure that out.
I mean, I'm not sure.
So the results weren't very
conclusive in that sense.
I don't know if it
matters for the comparison
between the techniques,
but it's probably
the hardest set for comparing
deterministic versus randomized
maybe.
AUDIENCE: If it's randomized,
you get a 99% rate.
Also maybe.
CRISTOBAL GUZMAN: So what else?
So here are some
pictures if you're just
wondering how this looks like.
So first, you see the
random projection.
And second one is
the one we get from
the deterministic algorithm.
So maybe you can't see from
this far, but the first block
of the deterministic
one looks kind of like
a bunch of diagonal things.
And that's because the
pessimistic estimator
is kind of greatly optimized.
So in the first block, it
looks pretty structured.
And then from then onwards--
actually, also in the
first few columns,
it has this similar
kind of behavior.
AUDIENCE: The first block--
CRISTOBAL GUZMAN: No.
You can get closer.
It looks kind of like a
bunch of diagonal things.
It looks more
structured in a sense,
but I don't know how to
interpret this, really.
Just to show you
how it looks like.
OK.
So all in all, let
me conclude by saying
that dimensionality reduction
is a key primitive in data
analysis.
And the JL Lemma
allows you to do
this dimensionality
reduction quite fast.
So what we do in this
work is basically
show that you can do this
deterministically based
on this idea from
randomized algorithms,
and we can basically reach
the same performance.
And there's this new kind
of technical contribution,
which is based on this new proof
of the Hanson-Wright inequality.
So a few open problems.
Well, first of all, we're not
very happy with this assumption
that the number of non-zeros
is sufficiently large.
We believe this
is not necessary.
But this would require
doing something smarter
than using this
universal masking.
Something which is more
instance dependent.
Also, there is a question--
I mean, you asked at
the very beginning.
So JL transforms are usually
used in online or streaming
settings where you
want to be completely
oblivious to the input.
So we hope that
maybe there is a way
to argue about some
sort of online settings
where you can maybe regret
some of your decisions
along the way, fixing
things so you still
can do good projections.
Finally, there is this
super-interesting question
about deterministically
constructing RIP matrices.
So if you've ever
heard about this.
Well, these RIP
matrices are basically
things that keep low
distortion, but on the set
of sparse vectors.
And that's an infinite set.
So I don't expect that
these ideas will directly
resolve this.
This is a very hard question.
But I think it's
very interesting.
I mean, it's something that
is good to keep in mind.
So with this, I'm done.
Thank you very much.
AUDIENCE: Thank you.
[APPLAUSE]
AUDIENCE: Very good.
You mentioned some applications
of dimensionality reduction.
And also, in this algorithm,
are there any candidates
for particular
algorithms or settings
where this might improve
the state of the art?
CRISTOBAL GUZMAN: Let's think.
So I mean, for me, the
most important thing
is whether you want to
be oblivious or not.
So in settings where you don't
care about being oblivious,
then I think it can certainly
help because it gives you
guaranteed low distortion.
So it's more a matter of
what's really your setup.
So I'm thinking maybe
in some kind of machine
learning application where
you already have your data,
maybe you want to go
with this because it
takes roughly the same
time as the randomized one.
But typically, for streaming
settings and online, then--
I mean, as this is,
it's not helpful.
AUDIENCE: The only part which
uses the non-oblivious is
the derandomization, right?
Or is--
CRISTOBAL GUZMAN: So let me
say the masking, for instance.
Since we are doing
this universal masking,
it's completely oblivious.
So the point is that
this universal masking
works for any kind of input
in the right dimensions.
AUDIENCE: OK.
I see.
CRISTOBAL GUZMAN: But
then, the signing--
AUDIENCE: I see.
So that part of the--
CRISTOBAL GUZMAN:
That's independent.
AUDIENCE: --of making
things deterministic is--
makes it non-oblivious.
CRISTOBAL GUZMAN: Right.
So I mean, there is
some hope that we
can resolve this online
question in a meaningful way
because at least
part of the algorithm
is oblivious in a sense.
AUDIENCE: Is this a
[INAUDIBLE] you have?
So for online, you can
get fast and sparse,
but not deterministic?
I mean, fast comes from sparse.
CRISTOBAL GUZMAN: Sure.
AUDIENCE: So you
can get that part,
but the deterministic part
is the part that isn't there?
CRISTOBAL GUZMAN: Right.
But that's Kane-Nelson.
That's not us.
AUDIENCE: OK.
AUDIENCE: Also, is it
reasonable to compare it
with these techniques
where you don't
care about the sparsity of
the matrix, but at the end,
you care about the fast
computation, the fact
that you can multiply by
the matrix faster than--
[INTERPOSING VOICES]
CRISTOBAL GUZMAN: Fast
Fourier transform?
AUDIENCE: Yeah.
Like you would
usually use Fourier
transform as part of
the multiplication,
but you can do other things.
CRISTOBAL GUZMAN: We tried this.
And I think that somehow,
these results are incomparable.
So there are regions where fast
Fourier transform is better,
but there's others which
this sparse thing is better.
So for instance, the Fourier
transform constructions
do not exploit the sparsity.
AUDIENCE: But then,
there are these things
that are multiply in NNZ?
AUDIENCE: You mean, they don't--
[INTERPOSING VOICES]
CRISTOBAL GUZMAN: They
don't exploit the sparsity
of the B vectors, the input?
AUDIENCE: Yeah.
What you call the
NNZ parameter, right?
Isn't this what this--
or maybe what Woodruff
and Clarkson did?
CRISTOBAL GUZMAN: OK.
Maybe I'm wrong about that,
but I thought that-- at least I
was looking at other works.
So this Ailon and Liberty,
these other works.
They don't exploit the sparsity.
But I have to say, I'm not--
AUDIENCE: Let's thank
Cristobal again.
[APPLAUSE]
[MUSIC PLAYING]
