[Philippe Rigollet]: Let me move on to this week's and first speaker for this virtual version
of the Stochastics and Statistics seminar, and it's my great pleasure to
welcome Jonathan Niles-Weed from NYU. So, as many of you know, Jonathan completed
his PhD at MIT under my supervision, so
we know each other quite well, and it was a real pleasure to have him as a student, and I kind of miss him, so I'm happy that he's back. So, Jonathan
completed his PhD as part of the
interdisciplinary doctoral program
and received a PhD in mathematics and
statistics. So it's not surprising that he chose a job which is halfway between mathematics and statistics: he is currently split between the math department at the Courant Institute at NYU and the Center for Data Science. In terms of awards, Jon received a couple of awards during his PhD, including an NSF award - an NSF fellowship - and a de Kármán fellowship.
His work spans learning under algebraic
structure, optimal transport, and he's
currently working on problems related to
discrepancy, and throughout all of this work, he's been working on high-dimensional phenomena in both statistics and probability, and so
today, he's going to talk about something which seems to be more in this last part about
matrix concentration for products. So Jon, please take it away.
[Jonathan Niles-Weed]: Wow, thank you so much, Philippe, for the extremely kind introduction and thank
you all for being here virtually. I would
prefer to return to MIT to get to hang
out with all of you, but in lieu of that
we can socially distance and I can be at
MIT from afar. So today, as Philippe said, I'm going to be
talking about some new work on matrix
concentration for products of random
matrices. This is joint work with
some wonderful co-authors - De Huang
who is a PhD student at Caltech, as well
as Joel Tropp and Rachel Ward.
So the starting point for our investigations is a question that I think I don't need to
convince you is an important and
interesting one, a question to which a great deal of effort has been applied over the course of the history of probability theory: when I have some function of independent random variables, reasonably nice in some sense that I would like to make precise, when is it
the case that this function concentrates
well around its expectation? And in
particular, how can we make quantitative
this concentration effect which
ensures that we're actually close to the
expectation of this function? And so this type of question has driven the growth of a tremendous number of new ideas in
probability theory. It's the central
question in the concentration of measure
phenomenon, and by now we have very
sophisticated ways of understanding the
answer to this question for a number of
functions of interest. So, you know, you pull open Ledoux's book on the concentration of measure phenomenon, and he says that we now have a sort of high-level view that when the function is nice or smooth enough, these sorts of concentration properties hold essentially automatically, and, you know, there's a great theory surrounding them that allows us to prove very sophisticated things for seemingly complicated functions. So a superficially similar
question is the following variant. I can
ask exactly the same thing, but I can ask
what happens when instead of scalar
random variables I have matrix random
variables instead, and these two
questions look extremely similar but I
would argue that we lack a great deal of
sophisticated technology in the second
case that we do have in the first case,
and it turns out, as we'll see a little bit later, that the key problem here is non-commutativity, right? Like, we're working with random variables that don't necessarily have commutative structure, and as a result, many of the powerful and simple ideas that exist for scalar-valued random variables are much more complicated to instantiate
when it comes to matrices. So, to give an
example that I'm sure many of you are
familiar with, let's go back to the first
concentration inequality you might see
for scalar random variables, which is
Bernstein's inequality, which reads like
this: I have a bunch of independent
random variables, say they're all mean
zero and bounded almost surely. If I
apply this trick of using Laplace's method, which bounds the moment-generating
function, then I can get a nice tail
bound for the average of these random
variables, right? I see that the average deviates from the expected value, which is zero, by some amount t with probability that decreases exponentially in t, and we see in the tail bound that you get the quadratic, sub-Gaussian behavior in the exponent, and you also get something that adapts to the fact that the variance of these x_i's might be small. So this is Bernstein's inequality, which I think is an early one that you would learn when you first encounter these questions on concentration of measure.
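For reference, one standard way to write the statement (the constants may differ slightly from the ones on the slide) is:

```latex
\Pr\Bigg[\bigg|\frac{1}{n}\sum_{i=1}^{n} X_i\bigg| \ge t\Bigg]
\;\le\; 2\exp\!\left(-\frac{n t^{2}}{2\left(\sigma^{2} + t/3\right)}\right),
\qquad
\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\,X_i^{2},
```

for independent, mean-zero X_i with |X_i| bounded by 1 almost surely, which is the normalization used here.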
So, as I said, the proof of this is very
simple, right? The one trick is to
apply Laplace's method, so put everything
on the exponential scale then use the
fact that for a sum of independent random
variables, the moment-generating function
just turns into the product of the
individual moment-generating functions
and then the bound that I have on the individual x_i's allows me to bound this quantity easily for each of the random variables. Then I can apply Markov's inequality and I'm done.
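Schematically, the one step that uses independence is the factorization of the moment-generating function (a sketch of the standard Chernoff argument, not the exact constants on the slide):

```latex
\Pr\Bigg[\frac{1}{n}\sum_{i=1}^{n} X_i \ge t\Bigg]
\;\le\; e^{-\lambda n t}\,\mathbb{E}\,e^{\lambda \sum_{i} X_i}
\;=\; e^{-\lambda n t}\prod_{i=1}^{n}\mathbb{E}\,e^{\lambda X_i},
```

where the last equality is exactly the trivial fact about independent scalar random variables; the almost-sure bound controls each factor, and optimizing over lambda gives the tail bound.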
So this is an example that's known to
everyone and I think a very reasonable
question to ask is whether the exact
same statement is true when it comes to
random matrices. So if I have a bunch of
independent random matrices now, D by D,
still with expectation zero whose
spectral norm is bounded by one - I'll always use these double bars to represent the spectral norm of the matrix. Is it again the case that the
average of these random matrices
deviates from its expectation with
exponentially small probability? And
before we - I think this is a
mathematically interesting question on
its own, but of course, one reason that
people have asked these questions is it
turns out to be super useful to have
bounds on this sort of thing in a number
of modern statistics and machine
learning style problems, right? The
original example of this, maybe, is
covariance estimation. If I have i.i.d. samples from a random variable and I want to estimate
its covariance matrix, then naturally
I'll get something that looks like this
showing up, but it also turns out to be
useful throughout statistics.
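To make the covariance example concrete, here is a minimal numerical sketch; the dimensions, sample size, and Gaussian data are illustrative choices, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5000

# i.i.d. samples whose true covariance is the identity
x = rng.standard_normal((n, d))

# the empirical covariance is an average of independent random matrices:
#   Sigma_hat = (1/n) * sum_i x_i x_i^T
sigma_hat = (x[:, :, None] * x[:, None, :]).mean(axis=0)

# the quantity that matrix Bernstein-type bounds control: ||Sigma_hat - Sigma||
deviation = np.linalg.norm(sigma_hat - np.eye(d), ord=2)
print(f"spectral-norm deviation of the empirical covariance: {deviation:.3f}")
```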
Like I said, many ideas in random graphs, sparsification, algorithmic ideas involving randomized linear algebra, dimensionality reduction - all of them need bounds on quantities of
this form. And so, one reason that we might be worried about this is because if we try to just ape the proof
that we had of Bernstein's inequality,
then we fail at the very first step
somehow, right? I can still form the
matrix exponential on all of this
sum of random matrices, but unless the
matrices commute almost surely, I have no
way of turning this into a product of
the expected matrix exponentials of the
individual matrices, which was what I
needed to do to be able to apply my
assumptions on the x i's. So it's a very
similar looking inequality, but I can't
apply the proof that I know, and so the
the insight here or the idea here and
this is a series of ideas developed
over the last 20 years or so since the
work of Ahlswede and Winter in 2002, is that I can get exactly the same inequality, but in place of the trivial equality involving moment-generating functions that I used for scalar random
variables, I need something quite a bit
more sophisticated. So, you know, we've
replaced the equality we had for
regular scalar random variables by this
significantly deeper inequality, which
then we can use to complete the proof as
we would originally want to, and the
thing I want to point out here is that I
get a tail bound that looks exactly the
same. Apart from this factor of D out front,
which is the price to pay for the fact
that we're working with D by D matrices,
I see a tail bound of exactly the same
form, sub Gaussian decay plus involving
something that looks like the variance
of these random matrices - and here, what
do I mean by variance? Well, in the scalar
case I was just taking the sum of the
expected squares of the random variable, here because they're matrices not
necessarily symmetric, I have two
different ways of taking their squares
and then I just take whichever one of
these is bigger. So this line of work, as I said, was inaugurated by Ahlswede and Winter in 2002, and then this particular formulation, discovered independently by Oliveira and Tropp a few years after the Ahlswede-Winter bound, shows that you can get stuff that looks exactly like
the scalar case, but what you need is one
idea, right? You need some
non-trivial fact about matrices that
replaces whatever you were using in the
scalar case, and I think this is the
pattern of many ideas involving random
matrices, that you start with what you
understand from the scalar case but then
you need to reach deeper into your
toolbox to get something non-trivial
that you can use to deal with the non
commutative nature of these operators.
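For orientation, the Oliveira/Tropp-style statement being described is roughly of the following form, written here for self-adjoint matrices; in the rectangular case the variance is the larger of the two "squares" just mentioned, and the constants may differ. The deeper input that replaces the scalar factorization is, depending on the proof, the Golden-Thompson inequality (Ahlswede-Winter) or Lieb's concavity theorem (Tropp).

```latex
\Pr\Bigg[\bigg\|\frac{1}{n}\sum_{i=1}^{n} X_i\bigg\| \ge t\Bigg]
\;\le\; 2d\,\exp\!\left(-\frac{n t^{2}}{2\left(\sigma^{2} + t/3\right)}\right),
\qquad
\sigma^{2} = \bigg\|\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\,X_i^{2}\bigg\|,
```

for independent, mean-zero, self-adjoint d-by-d matrices with spectral norm bounded by one almost surely.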
So, okay, that is the simplest possible case.
That's the so-called matrix Bernstein
inequality, and our question today is
just - let's go for the next most
complicated function. We said now a
sum of random matrices that are bounded
is something that we can handle via
these matrix Bernstein ideas from about
ten years ago, and what we want to
understand is what happens when you have
a product of random matrices instead? So
again, just as was the case with the sums of random matrices, I think this is
a natural mathematical question, but
actually, I would argue also that it has
practical importance in some recent
questions that are being asked in
statistics and machine learning, and the
key insight is that the reason you might
care about products of random matrices
is that these are really compositions of
random operators. If I apply a product of random matrices, I'm just applying random operators one after another, and this starts to sound like a
lot of things that happen in machine
learning these days. I mean, certainly, I
can go back to control theory and think
about linear dynamical systems which
have this flavor, right.
Some random operator is being applied to
my system in each step but when I'm
doing stochastic optimization on a
quadratic, say, or I'm interested in the
Jacobian of a deep neural network with
randomly initialized layers, I get
something that looks like a composition
of these random operators. So if I want
to understand the extent to which this
object concentrates or behaves well, I
need to understand the question about
how well products of random matrices
behave, and just as was the case when we
were looking for the Bernstein
inequality in the case of the sum, the
quantitative question will be - can we
take what we understand about the scalar
concentration inequalities for products
and give a bound that's quantitatively
similar to that bound? Can we show that we don't really lose anything when we pass to matrices, or can we quantify precisely how much we do lose in going
to the matrix rather than the scalar
case? So that's going to be the question
today. Can we understand concentration
for products of random matrices in a
form that's just as general and just as
widely applicable as we can for sums of
random matrices. So this is not a new
question, but the perspectives that have
been applied to it in the past, I would
say, are a little bit different from the
perspective that we apply in this work.
So there is some really classical work
on what happens when you take products
of random matrices, and the earliest work
is due to Bellman who considered
a sort of interesting setting, which is - let's pretend that the matrices
that you're multiplying have entries
which are positive, almost surely, and
bounded. So bounded away from zero and
bounded away from infinity.
I have i.i.d. copies of such matrices, and then what he asked is - if I take the
product of these random matrices and
then, entrywise, look at the logarithm of
each entry and we normalize - what is the
behavior of that random object?
So, it turns out that work of Bellman, and then follow-up work by Furstenberg and Kesten, shows that there's a law of large numbers
and also when properly renormalized
a CLT for this entrywise object for
positive random matrices, positive in the
entrywise sense. So that's the earliest
work on this on this thing. Again, they're
not interested in non-asymptotic
concentration bounds there, they're just
interested in - what's the asymptotic
behavior of these objects? Slightly
closer to our setting but still 
in the asymptotic regime where you just
care about limiting distributions, Berger
in the 80's worked on the question not where you have matrices which are positive almost surely in the entrywise sense, but matrices instead which are
perturbations of the identity matrix.
So I have matrices of the form identity
plus something small, I have a product of
a bunch of those and again I want to ask,
when I multiply these all together and suitably renormalize, what does the resulting object look like? And Berger again proved laws of large numbers and central limit theorems for random matrix products of this type. So that's
the classical picture, some work in the
50's and 60's, some later work in the 80's,
but all done more or less from
this perspective. Another sort of take on
this problem comes from the world of
free probability which is, you know, the
brainchild of Voiculescu - I'm sorry to any Romanian
speakers in the audience - but over the
past, you know, thirty-five years, has
developed this very sophisticated set of
techniques to understand what happens in
this non-commutative probability case.
And there I would say that the
approach is, again, somewhat different from
the perspective we're taking because
really, what kicks in in free probability
is the asymptotics as the size of the
matrices gets very large. So there's this
notion of asymptotic freeness in free probability that is really asking what
happens when I have a product of a small
number of matrices but where the
dimensions are going to infinity, and in
that case we can say very precisely what
happens to the spectrum of these
products, because you get this
freeness that turns into a type of
convolution of the matrices in question.
So there, we're taking asymptotics, but in
a different sense as the size of the
matrices goes to infinity.
The other case that's well studied, and for which very exact results exist, is of course when I impose a great deal of structure, right? So if I require that my random matrices come from some invariant ensemble, say that they're all Wigner matrices or Wishart matrices or something like that, then I
can actually, you know, roll up my sleeves
and write down what the, say,
spectral density of the resulting matrix
is when I take products of these guys.
But this is really relying very heavily on the Gaussianity or the general invariance of the matrices in question.
So, if we're looking for, I would say
user-friendly and widely applicable
bounds,
these give some insight into what
happens in a very particular case, but we
want stuff that works far outside the
Gaussian regime, because in any of the
examples that I talked about where
you're composing random operators, you
can't just a priori assert that anything Gaussian or invariant is going on.
So these several lines of work are all
thinking about this problem, but sort of
from different perspectives, and I would
say that the starting point for our work
is much more recent. It's a paper by
Amelia Henriksen and Rachel Ward -
Rachel's one of my co-authors from 2019 -
where they, I would say, wrote down the
first non-asymptotic concentration
bounds for this problem that really have
the flavor of what we're looking for. 
So, let me just show you what this
theorem of Henriksen and Ward is. They say - okay, you have a bunch of i.i.d. matrices,
and they're bounded almost surely. So, so
far this looks exactly like the
Bernstein case, and you form this random
product. So the random product is going
to be identity plus rescaled copies of X, so identity plus 1 over
N times a copy of the matrix X, then I
take a product of n of these guys.
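Here is a minimal numerical sketch of this construction; the Gaussian choice of the X_i, their scaling, and the specific sizes are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 10, 2000, 50

def random_product(rng, d, n):
    """Form Z_n = (I + X_n/n) ... (I + X_1/n) for i.i.d. Gaussian X_i."""
    z = np.eye(d)
    for _ in range(n):
        x = rng.standard_normal((d, d)) / np.sqrt(d)  # illustrative scaling
        z = (np.eye(d) + x / n) @ z
    return z

samples = np.stack([random_product(rng, d, n) for _ in range(trials)])
mean_z = samples.mean(axis=0)

# typical deviation of Z_n from its (empirical) mean, in spectral norm
devs = [np.linalg.norm(z - mean_z, ord=2) for z in samples]
print(f"average spectral-norm deviation: {np.mean(devs):.4f}")
```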
So just to get some intuition for a
second, the reason that I'm doing this
rescaling is Zn in the limit has
constant size, right? I'm sort of
multiplying things that are of size - n
copies of things that are of size 1 plus
1 over n, and so Zn is an object of sort
of constant size and the Amelia
Henriksen and Rachel Ward proof shows
that this thing concentrates in the
following way. So, they prove some more stuff, but just thinking about an expectation bound, the expected spectral norm of the deviation of Zn from its expectation scales like - something involving B, which is the almost sure bound that we have on the size of the matrices, and then some decay that's like polylog of d and n, over n, under a square root. So the good news here is the one over root n behavior, which is exactly what you would hope for. Some logs, but who cares about logs among friends, and then some stuff that depends on B. So if you have B equal to 1, for instance, this just looks like, up to logarithmic factors, 1 over root n, which is
exactly the type of behavior that you
would hope for and expect in this
situation. So this was a really nice
result, and I actually think that the
result is so nice that I'm going to say
some words about the proof, because I
think that the proof gives some insight
into the techniques that we have
available when working with random
matrices, and so the proof here is-
[Philippe Rigollet]: Jon, can I ask a quick question?
[Jonathan Niles-Weed]: - that it
basically relies - yes?
[Philippe Rigollet]: Is it clear that I should expect - that I should not expect a d-factor, like I see
for the Matrix Bernstein case, in front
of the bound?
[Jonathan Niles-Weed]: So, I've actually, I've
hidden some things here. So, the
difference was, in the Matrix Bernstein
case, I wrote down a tail bound, and then you would expect a
multiplicative factor of D in front of
the tail bound, but if I changed
Bernstein to ask about what happens in
expectation instead then that D turns
into a log D. So the expected
spectral norm is like root log D factor
bigger, whereas the probability bound is
a factor of D bigger. So here, since we're
dealing with expectation bounds, the
natural thing to expect is a root log D
rather than a polynomial factor in D.
So the proof idea here is just
in the grand mathematical tradition of -
if you have a hammer then look for nails.
And so here we have one hammer which is
the matrix Bernstein inequality and so
Henriksen and Ward squeeze as much as
they possibly can out of this hammer of
the matrix Bernstein inequality, and they
use this to analyze the product. So let
me show how this works. So the
starting place is - well, okay I have a
product of random matrices that are all
perturbations of the identity. So, first
thing I might as well do is expand this
thing, right, to understand how it looks asymptotically. So, I can do
an expansion of this. I start with the
identity matrix, that's good, and then I
have a sum of the first order terms,
second order terms, and so on all the way
down the list. So let's try to apply
matrix Bernstein to the thing we see
here. So the first thing we notice is
that this second term in the expansion,
this is just a sum of bounded i.i.d.
matrices, which is great, because the
matrix Bernstein inequality tells us
that this thing is approximately the
expectation of X.  I have a sum of i.i.d things, I can make precise the fact that this
object concentrates around expectation
of X. So, so far so good. The first term
works fine and this is exactly what
we hope it should be.
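For reference, the expansion being stepped through is, schematically (with one convention for the ordering of the non-commuting factors):

```latex
Z_n \;=\; \prod_{i=1}^{n}\Big(I + \tfrac{1}{n}X_i\Big)
\;=\; I \;+\; \frac{1}{n}\sum_{i} X_i
\;+\; \frac{1}{n^{2}}\sum_{i<j} X_j X_i
\;+\; \frac{1}{n^{3}}\sum_{i<j<k} X_k X_j X_i \;+\;\cdots,
```

and the first-order sum is the term that matrix Bernstein handles directly.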
Okay, what happens for the second term
here?
Well, for the second term I'm
already having a problem, right? Because
this second term is no longer a sum of
i.i.d. terms because there are some things
in this sum that depend on each other,
right? So, for instance, the matrix x1 shows up in several different summands here. So this is no longer a sum of i.i.d.
things. So Henriksen and Ward have the
following very clever idea, which is - okay
this is not a sum of i.i.d. matrices for
sure, but I can break it up cleverly in
such a way that it is a sum of i.i.d.
things. So what do I mean by this? So, what
we're going to do is, in place of this
sum over all possible pairs I and J,
we're going to partition this into sums
so that each index appears exactly once.
So, to look at the first thing, right, I
can take all products of the form x1,
x2, x3, x4, and so on and sum all of those,
that's some subset of this total number
of things that I need to sum over, and
then I can do the same with other
partitions, right? I can partition this
into a bunch of things where each of
these individual sums comprises only
independent matrices, and then I apply
matrix Bernstein to each of these. So 
matrix Bernstein for each of these says,
okay you've got a sum of independent
matrices - they're i.i.d. in fact - and so I
know that they concentrate around their
expectation and their expected value,
it's easy to see, is just expectation of
x quantity squared over 2n, because each
of these things is renormalized by 1
over n-squared, and there are about n
over two things per sum.  Great, okay, and when I add these things
up, what do I get? Well, I get precisely that this term is
of order expectation of x quantity
squared over two. So we had to do a
little bit of work in order to get this
second-order term, but luckily there was
this very clever idea that if I broke this up, I could do it as a sum of sums, where each of these sums comprises independent random matrices. Okay, you might be
optimistic about the fact that we did
the second one, but things already seem
much more complicated now that we're at
the third one, right? So I was able to do
this by hand, somehow, in the second one,
but it's not clear that I'll be able to
do it from the third one. So, let's just
get some intuition for what we did in
the second one, actually. So, I would claim
that one way of thinking about what we
did in the second one, is we just noticed
that the set of things we were summing
over is just the edges in the complete graph,
right? So if I have a complete graph on four vertices and I label them
1, 2, 3, 4, then really what I did in
breaking up the second sum is I
decomposed this complete graph into some perfect matchings. So each of these
things that I decompose it into is
vertex disjoint, which means that every
one of these is a sum of independent
things. So really what we did in bounding
this second sum is we just took the
complete graph on n-vertices and we
broke it up into a bunch of perfect
matchings. Well, in the third term we have
now a hypergraph, right? In the sense
that I now have triples of indices from
1 through n, and it turns out, if you open your big book of combinatorics facts that Hungarians have proved, that there is actually a nice theorem about hypergraphs. So this is a theorem due to Baranyai in the 70's that says that the complete hypergraph decomposes into 1-factors. What's a 1-factor? It's just a perfect matching in
this sense. So I can take a complete hypergraph on triples - a three-uniform hypergraph, or more generally a k-uniform hypergraph - and decompose it in exactly this way into subsets that are all independent sums, and then if I apply matrix Bernstein independently to each of those things, then I get again
that this object concentrates around its
expectation, and I can keep doing this. So
the idea is we're going to - yes?
[Philippe Rigollet]: So, this decomposition, I mean, this is a decomposition for U-statistics that shows up in, say, Hoeffding's original 1963 paper.
[Jonathan Niles-Weed]: Exactly.
[Philippe Rigollet]: You have - the fact that you're using these complicated combinatorial arguments is so that you get the factor of six or, like, what do you lose by using...
[Jonathan Niles-Weed]:  Yeah, so we really need to be careful.
Yeah, so, good point. We need to be quite careful here
about making sure that we don't lose
anything in the constants, because as I'm
about to say, we need these constants to be the right
ones so that the sum of all of these
things that we're concentrating around
actually looks like the expectation that
we care about, right? So I better see exactly, you know, k factorial show up in the denominator for the k-th term in this expansion, so that I actually get
something near the expectation at the
end of the day. So I don't know whether a sloppier argument will also get this,
but I think the reason that Henriksen
and Ward use this rather precise
decomposition result is they really need
to nail down the constant and what
it concentrates around.
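To illustrate the decomposition being used in the pairwise, graph case, here is a small sketch of the standard round-robin 1-factorization of the complete graph; the hypergraph case handled by Baranyai's theorem is the analogous statement for k-element subsets. This is only meant to show the combinatorial structure, not the bookkeeping of constants discussed above.

```python
def round_robin_matchings(n):
    """Partition the edges of the complete graph K_n (n even) into
    n - 1 perfect matchings, so that every pair {i, j} appears exactly once."""
    assert n % 2 == 0, "need an even number of vertices"
    verts = list(range(n))
    matchings = []
    for _ in range(n - 1):
        pairs = [(verts[0], verts[-1])]
        pairs += [(verts[k], verts[-1 - k]) for k in range(1, n // 2)]
        matchings.append(pairs)
        verts = [verts[0]] + [verts[-1]] + verts[1:-1]  # rotate all but vertex 0
    return matchings

# each matching indexes a sum of roughly n/2 independent products X_i X_j,
# to which matrix Bernstein can be applied separately
print(round_robin_matchings(6))
```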
So what we're going to do is, in order to
apply matrix Bernstein - as I said, I see
concentration for each of these things
individually and I'm just going to pay
an enormous union bound to promise that
all of these things are close. So I'm
going to independently apply a high-
probability bound to each of these
independent sums, and I'm going to
guarantee that the resulting sum is
close to the expectation that I want.
And what have I lost here? So, what is implicit in these approximate signs in all of these places?
Well, matrix Bernstein tells me what
errors that I should expect here, and
what I should expect to pay is errors of
order 1 over root n, which is good, and
then I need to pay for the size of the
matrices under consideration. So, in the first line, I'm summing i.i.d. matrices whose spectral norm is bounded by B, so I incur an error of B over root n. In the
second line, I'm bounding products of the
xi's - a priori their spectral norm
could be of order B squared. So I'm
bounding sums of independent random
matrices with a spectral norm bound of B
squared and so on down the line. So because I have - because everything's
happening on the exponential scale here,
ultimately at the end of the day I pay
exponentially for this B, because the
error gets bigger and bigger as I bound
these things going down the line. So we
get concentration around the expectation
that we want, and our error scales like e to the B times 1 over root n - exponential of B, times 1 over root n - which
is exactly, as I said, what you see
showing up in the Henriksen-Ward bound.
So again, the whole idea here is to just apply matrix Bernstein over and over again and bound everything together, and that's how you get this bound. I should note
that Nikhil Srivastava and his group at
Berkeley improved this proof a little
bit, recently. They removed the log factor
in n and simplified the proof, but the
rest of the bound looks exactly
unchanged. So this is the
state-of-the-art before our work, and
this was the starting question for us: whether this is, in fact, the tight result that we want. And I guess the thing that would draw one's eye in this theorem is the following: do we really need to pay exponentially in b,
here? Now you might say, first of all, that
like - yeah, we should pay exponentially in
B, because maybe this object Zn is about
of size e to the b, right? I mean, if each of the xi's is indeed of size b in spectral norm, then Zn in spectral norm, yeah, maybe looks something like e to the b. So this sort of seems like a natural
scaling when you look at it the first
time. Maybe this is unavoidable just
because that's the size of the object on
the left hand side. But a moment's
thought will convince you that this is
not as sharp as it could be, because if
you go into the scalar case and just
think about what would happen for scalar
random variables, the optimal bound looks
almost exactly the same. Except instead
of paying exponentially for B, which is
the size of the almost-sure bound on
these random variables, you pay
exponentially for the typical size of
these random variables. Namely, the
expectation of x. So in the argument that I gave a second ago, indeed, if x is always of size b basically, then maybe, indeed, the typical size of Zn is e to the b, but nothing in the Henriksen-Ward
argument is able to adapt to the
fact that the typical size of these
random variables might be much,
much smaller. So the question here is - in
the scalar case, we have a bound that
looks very similar, but is it possible for us to recapture in the matrix case the fact that I should pay for the typical size of the matrices, not the worst-case size?
And so -
[Philippe Rigollet]: Could you say a word about how you get the scalar bound?
[Jonathan Niles-Weed]: Sure. The scalar bound is just
coming from the fact that - let me see if
I can write here - let me just take the
logarithm of everything we see.  So the
reason that the scalar argument is easy
is because I can take logs everywhere,
right? So it turns out that if it's - if
these conditions actually hold, then - let
me just look at log of Zn instead - which
is a sum of iid random variables, and if
I believe that the xi's are bounded and
n is large, then this looks a whole lot
just like the sum of xi over n, right?
So everything I know about bounding sums of i.i.d. random variables applies after I take a logarithm; in this regime, it's actually no more complicated than Hoeffding's or Bernstein's inequality, just applied on the logarithm,
and then at the end of the day, you get
something that looks like - so, so let me
see if I can erase this - so what one gets at the end of the day is that log of Zn is approximately expectation of x plus a term that's of order like b over root n, and then when I take the exponential of this thing, this term turns into this term, right, and because b over root n is small, that exp of b over root n just looks like one plus b over root n. So I get something that looks exactly like the B over root n that comes from here.
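Written out, the scalar heuristic is, schematically, with B the almost-sure bound and mu the expectation:

```latex
\log Z_n \;=\; \sum_{i=1}^{n}\log\!\Big(1 + \frac{x_i}{n}\Big)
\;\approx\; \frac{1}{n}\sum_{i=1}^{n} x_i
\;\approx\; \mu + O\!\Big(\frac{B}{\sqrt{n}}\Big),
\qquad\text{so}\qquad
Z_n \;\approx\; e^{\mu}\Big(1 + O\!\Big(\frac{B}{\sqrt{n}}\Big)\Big).
```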
So it's just taking the logarithm and applying what we know about sums, and of course I should point out that this
proof that I just gave is, like, extremely
far from true for the matrix case, right?
Matrix logarithm doesn't even make sense
unless I have PSD matrices and then when
things don't commute, I don't have
this ability to pull apart the sum. So
none of what I just said makes sense in
the matrix case, but that's why it's so easy to prove for scalar random variables.
So, we are able to show - so what we show is a bound that looks exactly like the
bound on the previous page, except that
we've made the crucial change that
instead of paying exponentially for b we
pay exponentially for the size of the
average of the xi's - the expectation of
the xi's. So the bound looks
exactly the same except that I have e to
the mu upstairs rather than e to the b
in my tail bound, then I'll say that, you
know, we actually developed, sort of, more
general tools for many different types
of products of this kind, but in this
particular case I'm just trying to
compare with the Henriksen-Ward results
to see where the improvement comes in. So we get exactly this thing and
actually, sort of like the matrix
Bernstein case that I discussed earlier,
what happens is that we end up getting a
tail bound that looks exactly the same
as the scalar tail bound up to constants
except for this d out front, which was
exactly what happened in the matrix
Bernstein case. I had a tail bound with a particular exponent, and that tail bound showed up exactly in the matrix case, just with the d out front, and the same thing happens here. So really, this
captures, at least at that level, the
exact behavior we would want for the
scalar case just in the matrix setting. And
to convince you that this
improvement is in fact meaningful, just
to parse these for a moment, notice
that this yields an exponential
improvement in several cases of interest.
For instance, if my random matrices are
centered, then this improvement is huge.
I go from something that depends
exponentially on b to linearly on b, so
in the centered case this destroys
an exponential factor in their bounds,
and in a slightly less trivial case,
if I have xi's that are say,
isotropic coming from the sphere, we
often should expect to see a factor of d
difference between the worst case bound
and the expected bound. So, for instance,
if I'm uniform on a sphere of radius
root d, then my worst case bound, the size
of this matrix in spectral norm is actually d because the length
of xi is root d, but typically when I
take the average, I just get the identity,
which in spectral norm has size one. So
even if I'm dealing with PSD matrices, it
can be the case that the worst case
bound is like a factor of d bigger than
the expected bound, and there we see that
the exponential dependence on d that's
in the Henriksen-Ward bound disappears
when you apply ours. So this is all to
say that, you know, I think that one nice
thing about these results is that there
are cases where the Henriksen-Ward bound
is - the bound is too large to be
meaningful. You need n to be
exponentially large indeed to get any
meaningful concentration. Whereas here, we can afford n much, much smaller than that and still have a non-trivial bound.
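To spell out the sphere example from a moment ago, reading the random matrices as X = x x^T for x uniform on the sphere of radius root d:

```latex
\|x x^{\top}\| \;=\; \|x\|^{2} \;=\; d \quad\text{almost surely},
\qquad
\mathbb{E}\,[x x^{\top}] \;=\; I_d, \quad \big\|\mathbb{E}\,[x x^{\top}]\big\| \;=\; 1,
```

so the worst-case bound really is a factor of d larger than the typical size.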
So let me now turn to saying some
words about the proof. It turns out that
the proof of this improved bound is much, much simpler conceptually than the argument that Henriksen and Ward give. We don't use
matrix Bernstein, we just try to apply
first principles ideas to this product
and see whether we can get anything just
by examining it in a very general
framework. And this general framework
we're going to explore is just, say,
rather than thinking about these as
matrices, let's just think about them as
random elements in a Banach space.
So the space of d by d matrices equipped with the Schatten p-norm, which is defined in this way: the p-norm of x is just equal to the trace of the absolute value of x to the p, all raised to the one over p, and by absolute value of x I mean the matrix absolute value, so it's x transpose x to the one-half. So when x is PSD, this is just x itself; more generally, I have to work with x transpose x. So this is the Schatten p-norm, and we're just going
to analyze directly random elements in
this space. And the intuition will be
that as long as this norm is nice enough,
sufficiently Euclidean, we will get
good concentration results for free,
essentially. So this is - this uses
ideas that are well-known in
geometry of Banach spaces. So, Assaf Naor has a nice paper where he sketches ideas
like this in more detail for Banach
value random variables, but the
high-level intuition just is - if this norm
looks smooth, looks familiarly like
the Euclidean norm, then I get something
that automatically gives me good
concentration. So, to see how this
works in practice, let's just see if we
can figure out how to prove the
following simpler claim. So what I'm
going to try to prove for you together
today is, like, if I have a bunch of
centered random matrices with an almost
sure bound on their size, what we're
going to try to prove together is that
the expected norm of this random product
is small. It's one plus something that
looks like b over root n. So again, I want
to contrast this with the bound that you
would get if you were just thinking that
these x's were as large as they can
possibly be - the worst case bound, the
worst case size of this object is
exponential in b. Whereas the bound that
we would hope to prove, which is a
simpler version of the theorem I
presented several slides ago, just says
that actually, typically, the size of this
product is like one plus a factor that
decays like b over root n. So let's
step through an idea of how to prove
this. So what should we do first?
So here's a first attempt for how to prove
this. *aside* oh, there's a chat - is this, oh
that's linked to everyone - Okay, so the first attempt for how to prove
this might be, like - okay, let's do something
that's familiar to people who have
worked on on matrix Bernstein or other
random matrix concentration inequalities.
Let's just, instead of bounding the spectral norm,
let's apply Jensen and the monotonicity of the Schatten norms to bound the squared Frobenius norm instead. Right, the operator norm is a bit
of a pain, but let me work with the
Frobenius norm and this is something
that, you know, I feel I can handle and
because the Frobenius norm is always
bigger than the spectral norm, I can just
bound this thing, get a good upper
bound for the quantity I want. Okay, so I
have the spectral norm, I have the
Frobenius norm of this thing, let me
develop this a little bit and - *aside* chat is.. -
So there's a question about - so, great. So there's a - let me back up here -
So the question was - this requirement that n is large enough, so that n is larger than b squared log d, is this actually necessary, and is this part of the general bound, too? So the answer is,
sort of, yes. I would recommend that you
look at the paper for this. The answer is yes, and you can actually see
that this is true, even in the scalar
case. Basically, when n is not big enough - when n is not bigger than b squared yet - then this thing is actually exponentially large. Typically exponentially large. So there's an elbow effect - and this is tight, it's true even for the scalar case - in what you expect the size of this thing to be.
This thing is large or can be large
early on. There's no good reason for it
not to be large, but once n is
competitive with b squared,
you concentrate near one. So there's
something going on and this is a
consequence of the fact that the tails
of this random variable are  - they
can be quite bad, actually. Because I'm taking the exponential of a random variable - this thing has very poor decay, it looks like a log-normal random variable
away from the origin. So you do need some requirement that n is
large enough to write down such a thing.
Ah, okay, so let's go back to this, let's
go back to this expansion. I said, let's
pass to the Frobenius norm, then I can just use the definition of Zn, and let me write here. So Zn is just exactly what you think it is - er, Zn minus one. Zn minus one is just the same product but with the last factor lopped off.
Right, so I can peel off the first factor here, and I'm just using the
explicit expression for what these
products are. Now what? Well, ok, so now we
actually have something non-trivial we
can do, which uses the fact that because
the Frobenius norm arises from an inner
product and because x is conditionally
mean 0 given zn minus 1, I can do a bias
variance decomposition here, right?
So I can get something that's the expectation of this object, which is just the first term, and then I get something that looks - so this is the expectation of the square, and this is the square of the expectation. This is just a bias-variance decomposition of this object up here.
And this was, of course, only possible because I was able to use the fact that the
Frobenius norm arises from an inner
product. This is a Euclidean norm, so I can
just pull this thing out by passing to the inner product and
recombining the squares. Okay, so what does this give us? What can we do now? Well, the
obvious thing to do now is just to say,
ok, the first term is whatever it is and
I know the second term is small because
xn is of size at most b. So by using the fact that I can pull the xn out of the product of xn and Zn minus one, this thing is of size at most one plus b-squared over n-squared times the expectation for the product at time n minus one. So now we have a recursive bound.
The key thing is I can keep peeling off
factors of xn, and each time I do I get a
factor of one plus b squared over n
squared and when I do this n times?
Okay, what happens? Well, at the end I'm left with the norm of the identity matrix times a bunch of these factors of 1 plus b-squared over n-squared, and importantly, this one plus b-squared over n-squared to the nth power still gives me something small, in the sense that if n is of order b-squared or bigger, then this is something that goes to one as n goes to infinity - b-squared over n goes to zero, so this correction gets smaller as n goes to infinity.
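Putting the recursion together in symbols, in a schematic form that suppresses constants:

```latex
\mathbb{E}\,\|Z_n\|_F^{2}
\;\le\; \Big(1 + \frac{B^{2}}{n^{2}}\Big)\,\mathbb{E}\,\|Z_{n-1}\|_F^{2}
\;\le\;\cdots\;\le\; \Big(1 + \frac{B^{2}}{n^{2}}\Big)^{\!n}\,\|I\|_F^{2}
\;\le\; e^{B^{2}/n}\, d,
```

so the Frobenius-norm argument gives roughly e to the B-squared over 2n, times root d, as a bound on the expected spectral norm: the right behavior in n, but with an unwanted polynomial factor in d.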
So compare this again with the trivial
bound which said that zn could be of
size e to the b, could be of size e to
the B, which in no sense went to zero as n
went to infinity. So this is already a
good thing. Of course, this factor of d is
bad, right? We don't want to pay
polynomially for d and this thing does
not even - is not even
near one, right? It's something of
order d instead. So just to say this
again, because this is the key point for
the proof that we're going to use,
because the Frobenius norm arises from
an inner product, I can apply this
bias-variance decomposition and then
bound brutally the xn by its almost
sure bound b in order to get a
recursive bound of this form. And if I peel off n factors of these, I get something that scales like e to the b-squared over n times the 2-norm of the identity matrix. Of course, as we saw in the previous slide, the problem is that I cannot do this using the 2-norm, because the bound that I get is simply too loose. So I'm caught in a double bind. The bound that I get from the 2-norm is too loose, but of course the 2-norm is the only norm - the only Schatten p-norm - that arises from an inner product. So I'm sort of stuck here,
because I would like to do it in a
different norm, but I can't, because it
doesn't come from an inner product in
general. And the insight is that there's this concept called uniform smoothness of Banach space norms, and of the Schatten norms in particular, that allows us to mimic this argument even in other norms.
So, to motivate the definition of uniform
smoothness, let me just remind you what
it means when I say that a convex
function is smooth. I say that there's an upper bound of this quadratic type around any point x, if the convex function f is smooth, and one way to say this, in what seems like a slightly obscure way, is that this implies that Jensen's inequality is tight up to a factor that's like L over two times the size of the perturbation squared, right?
If I take this inequality on the first
line and apply it to x plus tau and x
minus tau, it just shows me that this is
like f of x plus something that depends
quadratically on the size of the
perturbation. So, motivated by this definition, we can say that a general norm on a general Banach space is 2-uniformly smooth if the exact same inequality holds with some constant K.
So I ask for x plus tau and x minus tau - the squares of these things, averaged - to be like norm of x squared plus something that depends quadratically on tau, the size of the perturbation. So notice, for instance, that
for the Euclidean norm this just holds
automatically with K equals one here,
right? So if I have the Euclidean norm, then it's always the case that when I average x plus tau squared and x minus tau squared, the cross terms disappear and I'm just left with norm of x squared plus tau squared. So I get K exactly equal to one, and the point is that if I
have a 2-uniformly smooth norm, then I can still do a bias-variance decomposition at the price of a factor of K. So this would be, again, an equality with K equal to one for the Euclidean norm, but if I have uniform smoothness, then I get an inequality with a little bit of extra in front of the B term.
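In symbols, the definition and the consequence being used are roughly the following, up to the exact placement of constants, which the paper pins down:

```latex
\frac{\|x+\tau\|^{2} + \|x-\tau\|^{2}}{2} \;\le\; \|x\|^{2} + K\,\|\tau\|^{2}
\quad\text{for all } x, \tau,
\qquad\Longrightarrow\qquad
\mathbb{E}\,\|A+B\|^{2} \;\le\; \mathbb{E}\,\|A\|^{2} + K\,\mathbb{E}\,\|B\|^{2}
\quad\text{when } \mathbb{E}[B \mid A] = 0.
```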
And just to tell you why this should be
true, let's pretend for a second that
instead of being mean zero conditioned
on a, let's pretend actually that b and
minus b have the same law conditioned
on a, so that actually this is a
symmetric - this is a symmetric random
variable. Well, then, the expectation of a
plus b-squared, I can think of that as
the average of a plus b-squared and a
minus b-squared, because b and minus-b
have the same law, but then if I just
apply the smoothness inequality, I get
exactly this inequality that I've
claimed on the first slide. I get
expectation of a-squared plus k
expectation of b-squared. If b and minus-b do not have exactly the same law - they're just mean zero - I can use symmetrization.
If you do symmetrization naively, you get not exactly the factor K here, but a different constant that's like a constant factor bigger than K.
It takes a little bit of cleverness to
actually get the sharp constant k, but it
turns out that whatever constant was in
uniform smoothness, you get exactly this
same constant in this bias variance
upper bound. And how will we use this?
Well, just like Oliveira and Tropp use a single, deep fact about matrices and then everything follows from there, we will use a single deep fact about matrices, which is this famous bound of Ball, Carlen and Lieb, which tells us that the Schatten p-norm is 2-uniformly smooth with a particular constant K that depends on p.
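Concretely, the Ball-Carlen-Lieb inequality says that for p at least 2 the Schatten p-norm satisfies the two-point smoothness inequality with, if I recall the sharp constant correctly, K = p - 1:

```latex
\frac{\|A+B\|_p^{2} + \|A-B\|_p^{2}}{2} \;\le\; \|A\|_p^{2} + (p-1)\,\|B\|_p^{2}.
```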
So the whole story here is that we know
that this particular property of Banach
spaces is possessed by the Schatten p-norms,
and so what we can do is mimic
precisely the proof that we had in the
case of the Frobenius norm, which happens with p equals two, by an analogous proof
in the case of the Schatten p-norm, at the
price of an inequality sign, which we
don't care about, and an extra factor of
p in the bound. Otherwise the proof
goes through entirely unchanged, and what
happens if i recurse this bound? Well, the
first term, again, looks like what I
expect from the previous example. I have
n iterates of this one plus p times 
b-squared over n-squared and then I have
something that depends on the p-norm of
the identity matrix, and so in a
balancing act that will be familiar to
anyone who has worked with these matrix
inequalities, the first term in this bound gets better for small p and worse for large p, but the second term gets better as p becomes large. So if you
just balance these two things, then you
get exactly the claimed bound. So to
answer, again, the question from earlier,
the actual bound you get is this thing.
It's still of exponential size, but when
n is large enough, this exponential of b
just looks like one plus something that
scales with b over root n.
[Philippe Rigollet]: So, what's the order of p that you end up getting?
[Jonathan Niles-Weed]: So this is.. what's the order of what?
[Philippe Rigollet]: Of p?
[Jonathan Niles-Weed]: Oh, you have to choose whatever balances
these things. Let me see if I can do it
online. So p is equal to, like, root n log d over b or something like that.
p is large, yeah, p is very large, but it's, you know, literally just balancing these things, and it's, like I said, something like the square root of n log d over b squared.
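Schematically, the recursion in the Schatten p-norm and the balancing of p look like this, again suppressing constants:

```latex
\mathbb{E}\,\|Z_n\|_p^{2}
\;\le\; \Big(1 + \frac{(p-1)B^{2}}{n^{2}}\Big)^{\!n}\,\|I\|_p^{2}
\;\le\; e^{pB^{2}/n}\, d^{2/p},
\qquad
\mathbb{E}\,\|Z_n\| \;\le\; e^{pB^{2}/2n}\, d^{1/p},
```

and choosing p of order root of n log d, over B, makes the exponent of order B root of log d over n, which behaves like 1 plus B root of log d over n once n is at least of order B squared log d.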
Okay, all of the proofs in our paper for
everything - so, including inverses
of products of random matrices, adapted
sequences, tail bounds, everything comes
down to this basic idea of using uniform
smoothness, and as I said the results are
not only an improvement over the
Henriksen-Ward results, but I think are
significantly conceptually cleaner. So I
just want to end, since I'm running out of time, with some open questions.
This is far from the last word on
products of random matrices, this is
really just the first step to understand
properly how these products behave. 
I think that the clearest open question
here is the following: when xi's are
independent but not i.i.d - so they may have
different expectations - if you go to the
theorem in our paper, the main bound scales with e to the average of the norms of the expectations. When really, one thing you might hope is that it would scale with the norm of e to the average
expectation. So the first thing is always
bigger than the second thing, but you
might wonder whether it's possible, in
general, to replace all of the bounds we
have here with something that looks like
this, and one interesting fact is
actually we know that's not true. We have
some examples where this is the tight
behavior and the second bound does not
hold, but some examples where this is
loose and actually something of the
second form is true. So I think the meta
question here that we don't yet have a
good handle on is - what is the right
quantity that should show up in these
bounds? It's not this, but it's not this
either. So there's still a question about,
you know, what the truly right scaling
for this tail bound ought to be. One
instantiation of this question, you know,
something that we don't understand at
all, is that all of the bounds we've had
are insensitive to the orderings of
these random matrices. So in the sense
that if these random matrices are
independent but not i.i.d, so different
ones have different laws, we get exactly
the same bound no matter what order
these random matrices come in. But we can show by example, essentially, that the
ordering matters a lot. So there are some
permutations of random matrices which
concentrate much much better
than other permutations. We don't have
any bounds that are able to pick this up.
So again, I think that this is a situation where we don't know much yet about what the true situation should be.
The question - I have a question in the
Q&A - 'Could we easily extend the proof to i.i.d. Gaussian xi's?' So if the xi's are entrywise Gaussian, say, or something like that? Yeah, so I guess the thrust of the question is that - of course, if they're Gaussian, we don't have an
almost sure bound on the size of the
matrices. Instead we have some sort of
moment bounds, and the answer is totally
yes. If you look in the paper, we have
examples where we have much weaker
conditions. So if I have, in general,
unbounded random matrices but whose
moments can be well controlled, then the
same story holds. And that's all I wanted
to say. So, we have this this new set of
tools that I think give us good control
over products of random matrices and
maybe are the first step in what I
hope is a much more sophisticated
understanding of compositions of random
operators. If you have any questions, the paper's on arXiv, or I'm happy to answer anything now. Thanks a lot.
[Philippe Rigollet]: Thank you very much, Jon. So, I'm guessing
you're getting virtual applause. So, how
about we try the following - if you have a
question, just let me know in the Q&A and I will  try to click this 'allow to talk.' Just be
mindful that the talk is being recorded
and we plan to include questions, if you
have some. So just, I need to let you
know about this. So are there any
questions? So while people are thinking
about it, I wanted to actually ask maybe
a more general question. So, you describe this Henriksen-Ward approach, which is really very much a U-statistics approach, and I was wondering - here in your results you use a very specific form of the product of random matrices, where you peel them off one at a time. I was wondering, if I actually want to understand U-statistics of random matrices generally, are there any general results for this, or is Henriksen-Ward the only result that you know about for this?
[Jonathan Niles-Weed]: Not to my
knowledge. I mean, again, there's some - if
you squint, some of the results in the
literature end up looking like asymptotic questions about U-statistics of matrices. I mean, you have, again, these law of large numbers type things. You get consistency of various estimators based on U-statistics of matrices, but apart from Henriksen and Ward, I'm not aware of non-asymptotic concentration results for general U-statistics of random matrices.
[Philippe Rigollet]: I see, thanks.  Any other questions?
So again, if you have a question please enter it into the Q&A and I'll give you the ability
to ask it yourself. So, I actually have a
second question.
So uniformly smooth norms, is that
related to type?
Yeah, so it's related to martingale type, and one way of seeing this - let me go back to... so this is a perceptive remark. So, if I go back to this proposition, like, what does this look like, right? So to say that the expectation of b conditional on a is equal to zero, I mean, this is just saying that a followed by b is a martingale difference sequence, right? So this is allowing me to peel off factors from the martingale difference sequence, and if you iterate this, you get exactly what looks like a martingale-type inequality.
So this is, yeah, this is totally related to
that.
[Philippe Rigollet]: All right, so thank you very much, Jon. It doesn't seem like we have any other questions, so it's probably time to end this meeting. So I would like - ah, there is a question from Kwangjun. So, Kwangjun, I'm going to give you the ability to speak if I find you in the list of - *aside* all right, allow to talk - all right, you can unmute yourself, so you're unmuted - sorry.
[Kwangjun Ahn]: Thank you. So, actually, I typed my question in the Q&A section. So my question is - throughout the talk, I was under the impression that, like, free probability covers a regime where n is small but d is large, and your work kind of covers the case when n is large and d is small, and my question, just out of curiosity, is - would it be hard to analyze the regime where n and d are both small?
[Jonathan Niles-Weed]: So, I think that you have
to ask what type of answers you hope to
get in that regime, right? Like, if we go
back to the very beginning, questions
about concentration of measure...
You know, we're looking here for something
concentrating around its expectation, and
so if I even think of sums of random variables, you can say, like - what can we say about the sum of two random variables? So, certainly you can say some things about that sum, especially if you know things about the individual summands, but the thing that you won't get is a statement that a sum of two random variables concentrates well around its expectation. I mean, no better than any individual one will. So you can try to analyze - this procedure gives you something with which to analyze products of a small number of random matrices, but I guess the question
is what one hopes to - what type of
answers one hopes to obtain for those
things? And again there's, there are answers
in our - I presented theorems today that
require n to be large enough, but there
are statements in our paper that hold
for all choices of n and all choices of d.
It's just the case that normally, when n is super small, any sort of concentration result is uninteresting.
[Kwangjun Ahn]: Thank you very much.
[Philippe Rigollet]: Thank you Kwangjun for your question. Any other questions?
Yeah, Paxton has a question, so I'm going to give you the - so if you have a question, just say 'I have a question,' and I'll allow you to ask it online. So let me allow you to speak, Paxton.
[Paxton Turner]: Okay. Thanks, Jon, for the talk, these are some nice results.
I was just wondering what examples
you're aware of where these matrix
products appear? I saw an exposition of
this Henriksen paper before and they
talked about streaming PCA but just
wondering if there were others that you
were aware of?
[Jonathan Niles-Weed]: Yeah, so the easiest example is streaming PCA, which also goes by the name of Oja's algorithm; basically, stochastic gradient descent on a quadratic is essentially what's happening in streaming PCA, and so more generally, products of random matrices arise whenever you are doing stochastic gradient descent on some object that is locally well-approximated by quadratics. So, you know, this is not in our paper yet, but you
can answer similar questions about SGD
on smooth and strongly convex functions
because those just look a lot like
quadratics locally. So composing these
random operators looks a lot like
multiplying random matrices. As I said,
the other example that people have used
this for - so there are a couple of recent
papers, one of the names associated with
this is Boris Hanin, and I think
there are some others who are - so I said,
looking at deep neural nets, and what
they ask is - okay, I have a deep neural
net randomly initialized with many
layers, and I want to understand something about the Hessian - you know, if I take the Hessian of this function with respect to the parameters, what does it look like - and so
when I do this layer by layer, I get a
product of matrices because I'm
composing, and each of these matrices
looks like a random matrix with some
structure, because I have assumed that
I'm randomly initializing the weights.
So it turns out that you can
prove some things about the dynamics of
neural networks that seem to match
what's in experiments by just
pretending that your Hessian is always
of this form, always looks like a product
of random matrices, and so there again, the way that they've been able to analyze it in the past is just in the asymptotic regime - the asymptotic regime with, say, infinitely deep networks, or in particular with width and depth scaling asymptotically together. Our results offer a non-asymptotic take on this same picture, you know, if I actually
want to understand with high probability
near initialization what the landscape
of a neural net optimization problem
looks like, then I can look at this
random composition and say that it looks
like it's expectation with some
deviation that's small as long as the
net is deep enough.
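As a concrete instance of the streaming PCA example, here is a minimal sketch of Oja's iteration, in which the un-normalized iterate is exactly a product of random matrices applied to the starting vector; the step size and the data model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, eta = 30, 3000, 0.05

# data with one dominant direction: the top eigenvector of the covariance is e_1
top = np.zeros(d)
top[0] = 1.0

def sample(rng):
    return 2.0 * rng.standard_normal() * top + 0.5 * rng.standard_normal(d)

# Oja's iteration: w <- (I + eta * x x^T) w, renormalized at every step;
# without the normalization, w_n is a product of random matrices times w_0
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(n):
    x = sample(rng)
    w = w + eta * x * (x @ w)   # apply (I + eta * x x^T) without forming the matrix
    w /= np.linalg.norm(w)

print(f"alignment with the top eigenvector: {abs(w @ top):.3f}")
```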
[Paxton Turner]: Okay, great, yeah, that's really nice. Thanks, Jon.
[Philippe Rigollet]: All right, so I'm not sure there are any further questions, but we're a bit over time, so I'm going to close this first virtual meeting. Thank you, everyone, for
participating. Let me remind you that
next week, Ery Arias-Castro is speaking in
the same seminar. Let's thank Jon
virtually and I hope to see you all next
week.
Thank you.
