Okay, so let's start with
a quick recap of what we were
doing last time and
then carry on.
I'll tell you what
we'll do today.
We proved certain things. We proved that the volume of the d-dimensional ball, the set of points of length at most 1, goes to 0 as d goes to infinity. In fact, asymptotically it is some constant to the d divided by d to the d over 2, and so it goes to 0.
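For reference, the exact expression behind that recap (a standard formula, stated here without re-deriving it) is

$$V(d) \;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)},$$

which is of order $c^{\,d} / d^{\,d/2}$ for a constant $c$, and so tends to 0 as $d \to \infty$.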
And to do this,
we use the Gaussian integral.
And I won't go through
the proof again, but
we needed two properties of
the Gaussian or the integral.
So one was radial symmetry.
So it depended only on r,
the distance from the origin.
And two was that it splits into a product over the coordinates. Only the Gaussian has both of those properties. So even though it looked like I pulled this function out of a hat, it was for a reason, right?
And we also proved that for a uniform random point x from the unit sphere, the probability that the absolute value of x1, the first coordinate (the distance from the equator, if you will), is greater than c over root d goes down as e to the minus c squared.
Okay, so we proved that.
So here's a little sort of general thing I didn't state last time. Many quantities, like the absolute value of x1, the first coordinate, are random variables when x is picked uniformly at random from the unit sphere. Such random variables have very well-behaved tails, exponentially dropping tails.
That is to say they drop off
exponentially as we move
to multiples of
the expected value.
So as soon as you take twice,
thrice or c times the expected
value, it goes down as e to the
-c or e to the -c squared, okay?
Often e to the minus c squared.
So the e to the minus x squared
is an important function,
because it's a Gaussian, right?
And we'll see that
several times.
That'll come up.
Okay, so I sometimes call this
the law of large dimensions.
It's like the law
of large numbers.
But basically, having many
dimensions is sort of
like having many independent
variables sometimes.
So it behaves like that.
Now, 2 gives another proof of 1.
And also leads to some puzzles
that we discussed a little bit,
but it's worth understanding.
So by union bound,
we saw that at least half
the volume of the sphere
lies in a cube of this side.
And when we draw a picture, the picture looks sort of confusing in a way, because all I am saying is: here is a cube, and half the volume is in here. But I already proved that half the volume is also over there, right? So apparently they don't seem to intersect, but they do.
Of course, they intersect.
Both facts are true.
And you should just write down,
algebraically,
what happens and you'll see that
that's not a problem, okay?
Okay, so 2, the equator fact, also implied that samples x and y from the sphere are nearly orthogonal if they're independent. And we proved that for many samples, okay?
And we also saw how
to generate uniform
random points from
the hypersphere using Gaussians.
The short of it was that rejection sampling will not be good, right, because the cube is much larger than the sphere. In fact, that's a general phenomenon in high dimensions. If you want to draw uniform random samples from a lot of other objects, you cannot do rejection sampling. In general, in d dimensions, you cannot enclose even a sphere, let alone other objects, in some enclosing region which is nice and has not much greater volume, okay? In two or three dimensions, you can put it in a square or cube, from which you know how to draw samples because the coordinates are independent, but you cannot do that in high dimensions.
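Here is a minimal sketch of that recipe (not from the slides), assuming numpy; the dimension and sample count are arbitrary illustrative choices:

```python
import numpy as np

def uniform_on_sphere(d, n, rng=np.random.default_rng(0)):
    """Draw n points uniformly from the unit sphere in R^d.

    A spherical Gaussian is radially symmetric, so normalizing
    g ~ Normal(0, I_d) gives a uniformly random direction.
    """
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

# Rejection sampling from the enclosing cube [-1, 1]^d is hopeless here:
# the acceptance probability vol(ball) / 2^d vanishes rapidly with d.
x = uniform_on_sphere(d=1000, n=5)
print(np.abs(x[:, 0]))   # first coordinates are of order 1/sqrt(d)
print(x[0] @ x[1])       # independent samples are nearly orthogonal
```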
Okay, so that's the recap.
Now, today, I'm going to prove a concentration result for sums of random variables. It's partly pedagogical: I thought a lot about what's the right result to prove, from which I can derive all the results that we need, like Chernoff bounds. I'll tell you what they are, if you're not familiar with Chernoff bounds and other concentration results. So I'm going to tell you that.
But first, let's just start with
the central limit theorem,
right?
I'm not gonna prove this.
I mean, you've probably
all seen a proof.
But at least,
you should know the statement.
So if you have independent, identically distributed random variables (actually they don't have to be identically distributed, that's not that important; independence is what matters here), each with expected value 0 and variance sigma squared, consider the sum divided by root n. 1 over root n is the right normalization: if I sum these n things, the variances add up, so the variance of the sum is n times sigma squared, and if I divide by root n, the variance becomes sigma squared, right? Then it converges in distribution to the normal with mean 0 and variance sigma squared, okay?
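In symbols, the statement being recapped is

$$\frac{x_1 + x_2 + \cdots + x_n}{\sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}\!\left(0, \sigma^2\right) \quad \text{as } n \to \infty, \qquad \text{where } \mathbb{E}[x_i] = 0,\ \operatorname{Var}(x_i) = \sigma^2.$$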
This is only a limit statement.
What we're going to try now is
to get statements that hold for
every n, because for
us to use, we need that.
Further, this limit
statement has some problems,
it's not a uniform limit.
So it's generally not
immediately useful.
But that's going to be the guide
of what we want to get, right?
Central limit theorem.
So one remark, you cannot get
this if only the second moment
is assumed to be finite.
Central limit theorem needs only
to assume the second moment is
finite, right?
So the third moment
need not be finite.
So for instance, suppose you have the density f(x) = c over x cubed, sorry, cubed plus a little bit, for x greater than or equal to 1, say. That has a finite second moment, that's fine, but it doesn't have a finite third moment. Okay, that's good enough for the central limit theorem.
It's not good enough for us. To get what we want, it is not sufficient to assume only a finite second moment. That particular density is a power law; you can see what happens. I won't do it here, but you can do the calculation to see it. In fact, you don't get very nice tail bounds; the tail really is there, right? So you wouldn't get very good tail bounds.
Okay, so most of today's lecture I'll spend on tail bounds. And we'll come back to geometry either at the end of today or the next lecture and start looking at the Gaussian annulus. But this tail bound is going to be quite useful, right?
First, Markov's inequality. Just a recap: you have a non-negative real-valued random variable X. Then the probability that X is greater than t is bounded by something that goes down to 0 as 1 over t as t gets larger. And the proof is just that there are no cancellations: the expected value of X is at least what you get from the values that are greater than t, okay? That would not be true if the random variable could take negative values that cancel things out, so this holds only for non-negative random variables.
And then Chebyshev is just saying that the deviation from the mean is bounded: take the square of the deviation, which is non-negative, and apply Markov. Here the tail bound goes to 0 as 1 over t squared as t goes to infinity, okay? The bound is just proportional to 1 over t squared.
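Written out, the two inequalities just recapped are

$$\Pr(X \ge t) \;\le\; \frac{\mathbb{E}[X]}{t} \quad (X \ge 0), \qquad\qquad \Pr\big(|X - \mathbb{E}[X]| \ge t\big) \;\le\; \frac{\operatorname{Var}(X)}{t^2}.$$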
Now, can you do better? Can we get a tail probability that falls off faster than quadratically in t? In fact, it's simple to see that higher moments would help. So moments are going to play a key role in this. These tail bounds are not always proved by taking moments, though traditionally they often were; in any case, that's what I'm going to do here.
So, higher central moments. If you take the rth moment, think of r as higher than two, right, you get a 1 over t to the r. That is just applying Markov's inequality to the random variable |X minus E(X)| raised to the r. Usually we want to apply it with r even, so we don't have to worry about the absolute value signs there, right? But think of it as being applied when r is an even integer; of course, r doesn't have to be an integer, it's true for any real-valued r. So the higher r is, the greater the rate of fall: you get 1 over t to the r. The moment itself doesn't depend on t; the 1 over t to the r does. And if r is very high then you get a very steep fall. But for that you need to bound the higher moments, okay?
So we'll see how to do that.
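In symbols, the higher-moment version of Markov being described is

$$\Pr\big(|X - \mathbb{E}[X]| \ge t\big) \;=\; \Pr\big(|X - \mathbb{E}[X]|^r \ge t^r\big) \;\le\; \frac{\mathbb{E}\big[|X - \mathbb{E}[X]|^r\big]}{t^r},$$

so the larger r is, the faster the bound falls in t, provided the rth moment can be controlled.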
So I'm going to state now what we call in the book the Master Tail Bounds Theorem, only because it's going to imply tail bounds like Chernoff bounds, exponential tail bounds, and all the other things. Traditionally, there is a particular proof of Chernoff bounds; I'll go over it in a minute, and some of you are familiar with it. But that proof doesn't work for Gaussians; it only works for bounded random variables. So traditionally, every one of these was proved separately, but this theorem is going to give us all of them. That's the reason for calling it Master. So, you have independent random variables, mean 0, variance at most sigma squared. Not identical, right?
For those who are interested, this actually holds even without assuming full independence; it holds in a more general context. But right now we'll look only at independent random variables. So I'm going to take some integer s, which is at most n sigma squared over 2, for reasons that will become clear.
And here is the crucial assumption. I'm going to assume that the rth moment (if r is odd this could be negative, so I put absolute value signs) does not grow faster than r factorial times sigma squared. This factorial is crucial, and I'll tell you in a minute why. This part of the lecture is going to be fairly technical, because I'm going to have to prove this, right? But the point is: number one, the moments grow at most like r factorial times sigma squared; number two, I don't assume all moments exist, only moments up to a certain limit. It's possible that beyond that the moments don't exist, and they're certainly not necessarily bounded. So the rth moment grows at most like r factorial, and then this is the theorem, which I will prove; it will take us a little bit of time.
The tail bound says that the probability that the sum exceeds any number a, provided a is not too big (for very large deviations this is not valid), is at most e to the minus a squared over something, okay? Let's look at that something before we prove this. So each of these random variables has variance at most sigma squared.
If I add up n of them, the variance is at most n sigma squared; that's an upper bound on the variance of x1 + ... + xn, because they're independent. So if I did have central-limit-theorem type behavior, then this is the right form: I would have a squared divided by the variance. I'm not worried about constants. The 12 is larger than what the central limit theorem would give you, which is 2, but we're getting the bound for all n, so within constants it is like the behavior of the central limit theorem.
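Putting the pieces together, the statement being proved can be written as follows (paraphrasing the slide; the constant 12 is as quoted in the lecture, and there may be a small constant factor in front):

$$\Pr\big(|x_1 + \cdots + x_n| \ge a\big) \;\le\; c\, e^{-a^2 / (12\, n \sigma^2)},$$

under the hypotheses that the $x_i$ are independent with mean 0 and variance at most $\sigma^2$, that $|\mathbb{E}[x_i^r]| \le \sigma^2\, r!$ for the finitely many moments $r \le s$ assumed to exist, and that $a$ is not too large.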
Okay, so in some sense,
the tails behave like this.
Now, they don't
behave like this for
very large a, but
that's always going to be true.
So you should always
revisit this little thing.
That if I have coin tosses,
if I toss a coin,
I toss n coins,
the probability of heads being,
p so this the [INAUDIBLE]
p each of them,
then the tails of sum of xi,
1 to n,
are not Gaussian
beyond a certain point
So c x squared beyond sem x.
So very large deviations you get
only simple exponential bounds.
That's always going to
be a problem, right?
Again that's required for
calculation.
Actually doesn't require
too much of a calculation.
The probability since,
I can write this down,
the probability that
X Y is greater than or
equal to N means all the lines
have to come out to one, right?
Is, is p to the n.
Whereas, had it been sub
Gaussian, you should be getting
an N squared value,
which you don't get, right?
You only get an N.
Let's see.
So, this is a Gaussian bound
with a correct variance,
except for constants.
You get that.
Now, why r factorial, okay? I have to write this on the board because it's not on the slides. So, the intuition for the r factorial. Many proofs of tail bounds start by saying the probability that the random variable X is greater than or equal to t...
Right, that equals the probability that e to the lambda X is greater than or equal to e to the lambda t, for any positive lambda, okay. That's just because the exponential function is monotone, right? And that is at most the expected value of e to the lambda X, times e to the minus lambda t. This is Markov's inequality, right? Who has seen a proof that begins like this? Yeah, this is called the Bernstein method, the exponential moment method. It's used for [INAUDIBLE], right? So many proofs start like this. Now, X is the sum of the xi, i from 1 to n. And then this factors: e to the lambda X, with X the sum of the xi, factors into a product, right? It's just a product over the xi.
I won't do that proof, but it can only work if the exponential moments exist, okay? This is valid only if the expected value of e to the lambda xi exists; otherwise it's not valid. And now I expand e to the lambda X (I'm sorry, I should have put this in the slides, but anyway), so I do a Taylor series of it, right? I used t already, so I get the sum over i = 0 to infinity of lambda to the i over i factorial, times the expected value of X to the i.
I just want to explain the connection to the exponential's Taylor series. If the expected value of X to the i grew faster than i factorial, there's no hope: the series does not converge, right? So this method, I mean, it may work for other reasons, but it's not likely to work, right? There may be cancellation, but it's not absolutely convergent. So if the expected value of X to the i grows faster than i factorial, the series does not absolutely converge. No absolute convergence.
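As a worked line, the chain written on the board is

$$\Pr(X \ge t) \;=\; \Pr\big(e^{\lambda X} \ge e^{\lambda t}\big) \;\le\; e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big] \;=\; e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\big[e^{\lambda x_i}\big], \qquad \mathbb{E}\big[e^{\lambda x_i}\big] \;=\; \sum_{j=0}^{\infty} \frac{\lambda^j}{j!}\, \mathbb{E}\big[x_i^{\,j}\big],$$

where $X = x_1 + \cdots + x_n$; the last series is why moments growing faster than $j!$ ruin the method.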
So that's the Bernstein method. Again, I won't follow this method; I'm going to tell you that it's better to abandon it and go with moments directly, and that's the spirit of this theorem. But if you want to use the Bernstein method, it had better be true that the ith moment is bounded by i factorial. Actually, it should be bounded by something slightly less than i factorial.
>> [INAUDIBLE]
>> Yeah, so yes,
for lambda less than one.
So there is some play
with the lambda, yes.
But, okay, so there is
some play with the lambda.
Provided, I mean,
if it does actually strictly
factorial that goes-
>> [INAUDIBLE]
>> So if it grows like i factorial to a power greater than one, then you will be in some trouble. So, yeah, if it is more than i factorial then you will be in trouble.
Okay, so the r factorial assumption agrees with that. But most importantly, this method only works if for every i, however large, the ith moment is bounded by something like i factorial. Whereas here, we assume only finitely many moments, right? So this is very crucial. Typically we'll encounter what are called long-tailed densities. Long-tailed just means the tail does not fall off exponentially.
So for instance, everybody knows what a power law is; a power law is the typical long-tailed density. If you have a power law, this is still valid: you have some finite moments, but you will not have arbitrarily high moments. So this is useful even when only finitely many moments exist.
So it's going to take some time to prove this, four or five slides. We're going to go through the proof. It's elementary; there's nothing phenomenally sophisticated going on, it's mostly combinatorics.
So the idea is to prove an upper bound on the rth moment of the sum, and then use Markov's inequality. To expand the rth power of the sum, you do the multinomial expansion. So we start with the multinomial expansion, and we get a sum over all partitions of r into r1 through rn, right? The ri are the exponents of the xi; each one is how many times xi occurs in a term when you expand, just as in the multinomial expansion, right? And now I take the expected value of both sides.
I want the expected value
of this whole thing.
And I wanna take the expected
value of this, okay?
And something nice happens.
So let's try to guess what
nice thing happens and
then we'll see it.
Lot of these terms
have zero expectation.
So which terms have
zero expectation?
>> [INAUDIBLE]
>> I'm sorry.
Yeah, whenever an ri is one, right? Anytime an ri is one; independence says that the expected value of the product splits up (I'm writing all of this down in a minute), but if ri is one, the expected value of xi is zero, because we assumed mean zero, right? So here the ri range over all nonnegative integers summing to r.
By independence, the expected value of each term factors like that, okay? And then, if in any term some ri = 1, the term is zero, since E(xi) is zero. So every ri is either zero or at least two, okay? By the way, in all these proofs that's crucial: if the ri = 1 terms were still there, you would not get this kind of result. You would not get Gaussian behavior.
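Written out, the expansion being described is

$$\mathbb{E}\big[(x_1 + \cdots + x_n)^r\big] \;=\; \sum_{\substack{r_1 + \cdots + r_n = r \\ r_i \ge 0}} \binom{r}{r_1, \dots, r_n} \prod_{i=1}^{n} \mathbb{E}\big[x_i^{\,r_i}\big],$$

and every term in which some $r_i = 1$ vanishes, since $\mathbb{E}[x_i] = 0$; only tuples with each nonzero $r_i$ at least 2 survive.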
Okay, so now we're going to do some combinatorics to figure out how many such terms there are. Let's consider all the terms where each ri is either zero, or two or more, because we're only interested in those; we're not interested in the terms where some ri is one. Every ri has to be two or more if it's nonzero.
So the expected value of r,
this is from
the last slide,
there was, sorry.
So, if I raise to the r i power,
I get only r i factor, this was
the hypothesis, therefore, that
cancels out and you get this.
So, this actually j
is in the subscript.
And you get sigma to the So,
this is not a minus sign,
okay, so maybe I should write.
It's sum over something,
sigma to the twice number
of nonzero, all right.
So that's because everytime
r i was nonzero we assumed
the boundaries that most said,
see, this is what I'm applying.
You get only a sigma squared
every time r i is nonzero.
It's important,
by the way that sigma squared
doesn't go up as
sigma into the r i.
Some thought is required
to make sure that all
of those important, but later
you can think about that, right?
So collect the terms which have t non-zeros, because they have the same power of sigma, and call that collection the set of n-tuples (r1, ..., rn) that have exactly t non-zeros; this is just bookkeeping. Then you get that the expected value of X to the r is bounded by this, and this is what I'm trying to bound. Everything with t non-zeros contributes sigma to the 2t, right? And there's an r factorial out in front, so we have to handle all of that. Again, as I said, this lecture will be technical. There are maybe two more slides.
So we crucially have to bound the number of n-tuples r1 through rn adding up to r that have exactly t non-zeros. First you choose which t coordinates are nonzero, an n-choose-t. Once you fix that subset of t coordinates, each of those values is nonzero, so it must be at least 2. So we allocate 2 from our budget of r to each of those t ri's, and then the rest is allocated arbitrarily.
So what is this count? I already had to give each of the t parts at least 2, so I give 2 each. That leaves r minus 2t, and I must partition r minus 2t among the t parts, where I can allot any nonnegative number to each part, and that's what I get; it's the usual stars-and-bars count of tuples. You have to think about it at home and make sure the counting is correct, but check that. So you get that J of t is at most this much. It's not a precise estimate, but it's good enough.
Now, we prove, so
t only goes up to r over 2,
important know it
doesn't go up to r.
It goes only up to
r over 2 because.
Everybody which is present is
presented at least as two.
The level two right.
So they're all level two things
will prove that r over
two is the max of this.
So this more or
less is going to look like all
times if I put T equals
R over two there.
So we get n sigma
squared R over two.
Divided by r over 2 factorial,
2 to the r over 2.
Okay, all I'm doing is
putting t equals r over 2.
Now, if you use sterling and
other things, this is
r to the r over 2, so we get
r n sigma to the r over 2.
Another 2.
We'll do this manipulation
the next slide.
But the moral of the story
is very important.
The moral is, the moment of
the sum grows only
as R over 2 power.
Okay, now it requires some
thought and I'll do it in
the next slide, I'll do
the manipulation the next slide.
But that is essential for
sub Gaussian behavior, so
this is essential For
e to the minus x squared.
That square there comes
from the r over 2.
If I didn't have r over 2,
if I just had r,
which is what I
would get trivially.
So remember,
I'm bounding expected
value of X i to r power.
So there are n to the r terms.
N to all terms So,
I might be expecting an R there,
if I did,
I would not get Gaussian okay,
again, it requires a little
bit of manipulation at home to
make sure that you see that.
But, I won't do
the manipulation, but
I tell you that That
r over 2 is crucial.
So now I want to prove that t equal to r over 2 gives the maximum, okay? For t less than r over 2, it's easy to see that the summand is an increasing function of t, increasing fast enough, by a factor of at least two each time, and therefore the whole sum is within a constant of the term at t equal to r over 2; as you go down from r over 2 it falls off drastically, so the sum is only a constant times the top term. At the end you get this, which is what I wrote down there. So what have I got up to this point? The most crucial thing I want from all of this calculation is that the exponent is r over 2, not r, okay? And that's what is going to give us the e to the minus a squared of Gaussian behavior. Then you just apply Markov's inequality to get this.
Again, you have an r over 2 in the exponent at the top. Now I want an e to the minus a squared, so I have to optimize, and it turns out that's another standard calculation. This moment bound applies to every moment r less than or equal to s. If you remember the proof of Chernoff bounds, those bounds apply for every finite r, and you choose the best one. Okay, that's all we are going to do, except we have a restriction: r cannot be arbitrarily high. And then we do some calculus. This function is minimized at a certain r, which you can find by calculus: you take the log of it and minimize. On the next slide we'll see that. And then r is basically taken to be that value, and that value turns out to be within the allowed range, so it works.
So this is a statement that says all of that: we have a function that looks like x to the x over 2, you take the log and minimize, and you can find the minimizing point by just differentiating. Then at that point you plug the minimum value in, and you get this, which is at most e to the minus a squared over 12 n sigma squared; that proves the theorem. Yeah, do you have a question? No. So you do have to check all the details, I mean there are several calculation details that you have to check, but at the end of the day we end up with that.
Okay, so that proves the Master Tail Bounds Theorem. I'm not going to give you all the implications; I'll give you two implications of it which we'll use. One is Chernoff bounds. The other one, which comes next, is the Gaussian Annulus Theorem, and I think I'll be able to prove both this and the Gaussian Annulus Theorem today. We'll use the Gaussian Annulus Theorem for the Random Projection Theorem, which we'll see next week. So we will use that.
Chernoff bounds. How many of you know Chernoff bounds, have seen Chernoff bounds? Many of you, okay. So write it like this. You have independent 0-1 random variables. There is one small restriction for applying our theorem directly: p, the probability of heads, is not very close to 1. Then the bound says that the probability that the sum deviates from its mean by more than c times np is this expression. Again, you cannot do very large deviations; this only holds for c at most 1, right? The sub-Gaussian behavior stops beyond a certain c, okay?
And you get it simply by checking that the moments are good and applying the Master Tail Bounds Theorem. You first center the variables so that the expected value is 0; the variance, we all know, is p times (1 minus p). But the moments are also bounded: the higher moments are also at most p times (1 minus p). I wrote down a proof, and it's actually simple to see, right? Beyond the second moment, the centered variable is at most 1 in absolute value anyway, so that's going to hold. You need bounds on the moments for our theorem, right? So this is all going to be true; apply the theorem with whatever deviation a you have. Okay, I think that's all I've said here. By the way, there are more details given in the book than on the slide.
The usual proof of Chernoff bounds, again, is by going through the Bernstein method, which we erased, that is, going through the exponential moments, right? That does apply in this case: the exponential moments do exist, because these are bounded random variables, they never exceed 1. So the exponential moment certainly exists.
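As a small numerical sanity check (a sketch, not from the lecture, assuming numpy; the reference curve uses the lecture's a squared over 12 n sigma squared style exponent with untuned constants):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 1000, 0.1, 20000
s = rng.binomial(n, p, size=trials)          # each sample is x_1 + ... + x_n

for c in (0.1, 0.2, 0.3, 0.4):
    emp = np.mean(np.abs(s - n * p) >= c * n * p)   # deviation of c * np
    ref = 3 * np.exp(-c**2 * n * p / 12)            # Chernoff-style reference curve
    print(f"c={c:.1f}  empirical={emp:.5f}  reference={ref:.5f}")
```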
Okay, now the next thing I'm going to do is the Gaussian Annulus Theorem. I'm going to state it first. The exponential moments also exist here, but I'm going to give you a proof just based on the moment bounds. Note the random variables are not bounded, right? They're Gaussians; they can go to infinity.
So, some notation: you have a d-dimensional spherical Gaussian. The mean is 0, it's centered at the origin in d dimensions, and I, the identity matrix, is the variance-covariance matrix, right? For a Gaussian in high dimensions, you describe it by the mean and the variance-covariance matrix. So I is the variance-covariance matrix. In statistics it's called Sigma, right? Statisticians are notorious for their notation; it's a terrible thing to do, but they use Sigma for the variance-covariance matrix.
More generally it may not be the identity, but in this case it is the identity, which means the level curves of the density look like spheres. I should say the density corresponding to Normal(0, I) is e to the minus length of x squared over 2, with a normalizing constant of (2 pi) to the d over 2. So the density is maximum at 0 and falls off, okay?
And the assertion is: for any beta which is at most root d, all but a small amount of the probability mass, an amount exponentially small in beta squared, lies at distance between root d minus beta and root d plus beta from the origin. So maybe I should put up, excuse me, I should tell you one more line and then draw a picture.
So, intuition: the expected value of the length squared is just the sum of d things, each with expectation 1, since each coordinate is Normal(0,1), so that's d. So the mean squared distance is d, and the radius, the square root, is root d. And the theorem says the length should be concentrated about its expectation; that is roughly what it is saying.
Let's see what this is saying. It's saying that with high probability the length is between root d minus beta and root d plus beta, so the length squared is between d minus order root d and d plus order root d: if I square this out, the more significant term is the beta times root d cross term, and the beta squared is a lower order term. So, if I draw a picture: the distance is root d and the annulus has width of order one. If I instead normalize so that the distance is one, then for the Gaussian the annulus has width of order one over root d.
So this is different from the sphere, right? The sphere was concentrated in a smaller annulus, of width one over d; this is concentrated only in an annulus of width one over root d. Of course, the squared length is a sum of independent random variables, so you expect it to be concentrated about its expectation, and that is what we will prove with our theorem.
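A quick numerical illustration of the annulus (a minimal sketch, not from the lecture, assuming numpy; d and the sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 1000, 100_000
x = rng.standard_normal((n, d))      # n samples from the spherical Gaussian Normal(0, I_d)
r = np.linalg.norm(x, axis=1)        # lengths; E[r^2] = d, so r concentrates near sqrt(d)

print("sqrt(d)        =", np.sqrt(d))
print("mean length    =", r.mean())
print("std of length  =", r.std())   # O(1) regardless of d: an annulus of constant width
for beta in (1, 2, 3):
    frac = np.mean(np.abs(r - np.sqrt(d)) <= beta)
    print(f"fraction within sqrt(d) +/- {beta}: {frac:.4f}")
```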
This is all going faster than I expected. We are already at the 12th slide out of 13.
So any questions?
Or anybody want something?
Let's take a break and
drink water right now.
Everything okay so far?
Okay, so I'll go on.
So what are we doing? Here r is the radius, the length of the Gaussian point. We want to show that r minus root d is at most beta in absolute value; and if it is not, then r squared minus d is at least beta times root d in absolute value, right? But r squared minus d is a sum of independent random variables: it's the sum over the coordinates of yi squared minus 1. Each of these has mean 0, and they are independent. So put xi equal to yi squared minus 1; the probability that the sum of the xi is at least beta root d in absolute value is what we want to bound, okay?
But we need the sth moments of the xi, so we have to bound moments of a Gaussian; this is not a big deal, but let's do it. Okay, either the absolute value of yi is less than 1, in which case the absolute value of xi is at most 1 to any power, or it is greater than 1, in which case the absolute value of xi is at most yi squared. So in either case we have a bound; I've basically added both cases. So then I get this, and I'm just writing down the Gaussian moment, right? And I make a substitution and compute the moment by converting it to a gamma function, because we get e to the minus something, times that something to a power. That's a gamma function.
So we get 2 to the s times s factorial, okay. So the variance is at most 8, a fixed number. But we don't quite have the hypothesis: for the theorem we wanted the sth moment to grow only as s factorial times the variance. That's not true here, because it grows as 2 to the s times s factorial. So the expected value of xi to the s is at most 2 to the s times s factorial, but we'd like no 2 to the s. Right, we want just the factorial; we don't want the 2 to the s growth.
So how will we avoid it? I mean, we cannot avoid it; there is something to the s here because that's how the gamma function comes out, right? Up to the gamma function the calculation is exact, all except for the plus or minus one, so I cannot get better than that. But 2 to the s is lower order than s factorial, right? s factorial is roughly s to the s, and 2 to the s is much less than that, so it should be manageable.
So what might be a trick? What shall I do? Maybe I can ask you the question and then try to do it. I would like the random variable's sth moment to grow as s factorial, basically, but this is growing as 2 to the s times s factorial. So what's a good way of handling it? I just divide the variable by 2; then the 2 to the s goes away. And the factor of 2 is not a problem, because we are very generous with constant factors, we don't care, right? So that's what will happen: we put wi = xi / 2 and everything goes through.
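In symbols, the rescaling step is

$$w_i \;=\; \frac{x_i}{2} \;=\; \frac{y_i^2 - 1}{2}, \qquad \mathbb{E}\big[|w_i|^s\big] \;=\; \frac{\mathbb{E}\big[|x_i|^s\big]}{2^s} \;\le\; \frac{2^s\, s!}{2^s} \;=\; s!,$$

so the $w_i$ satisfy the moment hypothesis of the Master Tail Bounds Theorem, and the factor of 2 only changes the constants in the final exponent.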
Okay, so now, since there are no questions and there are 15 minutes left, we'll go over some new stuff, which I can do on the board. But if you want, you can ask questions instead.
Okay, so now I have to write
on the board because, again,
I'm going faster
than I expected.
So what the question is asking is this: if x1 through xn are Normal(0,1), independent, i.i.d., then I would like, let's say, tail bounds for these sums. Yeah, these are independent also, and the moments exist, so we should be able to get a bound for these. That's a good point to bring up, because for a sum of squares it's the standard thing to have a bound for; this is probably less standard. But the moments are still bounded, so this also follows-
>> [INAUDIBLE]
>> Yeah, so
I have to work that out
to get what we get.
>> [INAUDIBLE]
>> Okay, we'll work it out. So next time, maybe I should work this out and then bring it back, right? It's a good question, because the theorem should apply and get you a bound for these normals: they're just independent, and it's a sum of independent things. And it will be more exact than the 2 to the s; 2 to the s times s factorial is just a bound. But qualitatively, the 2s-th moment of a standard Gaussian is like s factorial. Well, it has terms only up to 2s.
>> [INAUDIBLE]
>> Yeah, but the leading term is s to the s, a constant to the s times s to the s, because you have s things you're multiplying. Right, you're multiplying 1, 3, 5, and so on, or 2, 4, and so on, but there are s things you're multiplying. So by Stirling's formula, or just directly, it's s to the s times a constant to the s. Qualitatively, that's correct.
So let me introduce
Johnson-Lindenstrauss,
which we'll prove next time.
So the idea is: you have points. Okay, how do we denote them? x1, x2, up to xn in Rd, with n large.
There are many problems
in which this is used.
But I want to start with
one concrete problem,
which is estimate all
pairwise distances.
There are n squared of them.
And if I take a pair,
it takes d time to find
the distances, right?
There are d coordinates,
I have to subtract them and add.
So it would seem to
take this much time.
What the Johnson-Lindenstrauss theorem says is that if I have two points, just two points, in d-space, I can project them down to a much smaller dimensional space, so that the distance between them in the projection gives you a good estimate of the distance between the two original points. There's a known scale factor, but up to that, it's almost exact.
So let's see what we expect
here before we go on.
So I have this point.
Let's call this point x,
this point y.
I want to estimate the sum of the (xi minus yi) squared, the squared distance.
So first, intuitively,
and then we'll do
the actual proof next time.
Suppose we choose a random coordinate system. A random coordinate system just means I take the regular coordinate system and rotate it by a random rotation. Another way of saying that is: I pick the x1 axis uniformly at random, then the x2 axis uniformly at random perpendicular to x1, and so on. So if I took a random coordinate system, intuitively the coordinate differences xi minus yi should all be roughly equal in magnitude. So roughly, the xi minus yi are all equal, because it is a random coordinate system, right? This is not a proof; I mean, again, we'll prove it carefully. So roughly they should all be equal, and therefore, what is each xi minus yi?
Now if they are all equal, how would each be related to the length of x minus y? There are d coordinates, all equal; would I then have 1 / d times the length? No, not quite, because it's Pythagoras, so only the squares add up. So each will be about 1 / root d times the length. Okay, so this is correct; it turns out to be provable as a theorem. This is what was proven by Johnson and Lindenstrauss, in a 50-page paper perhaps, in the beginning. And the difficulty was the random coordinate system.
And the difficulty was
the random coordinate system.
So this was difficult to do and
prove, but
later proves the following
very simple thing.
Pick a Gaussian vector, v.
What that means
is v is picked from
the distribution Normal(0,I).
This involved phase, it's a
vector, right, picked from this.
Then (x- y) dot
product of length 1.
Okay, so what should it be?
What should the dot product
of this be in relation
to x- y length?
Based on that kind of intuition, what would you expect the relation between this and that to be? The same 1 / root d, right, because this is again just like a random x1 axis.
Now, sometimes you need a more
solid estimate because we have
many points and
we want all pairwise distances.
So what Johnson and Lindenstrauss originally used was a random coordinate system, right? And the biggest difficulty there was that x2 had to be perpendicular to x1, and so on.
But instead,
it's enough to pick k of these,
which are all
independent Normal(0,I).
They won't be orthogonal, but
we saw that they'll be close
to orthogonal, that's enough.
It turns out then that the sum over i from 1 to k of ((x minus y) dot vi) squared would be roughly k / d times the length of x minus y squared. So this is saying: I take k such directions and take the sum of squares. They are not quite perpendicular, but it doesn't matter, right? This can be carried out very easily, and we'll see that the proof is very simple.
The proof is a few lines from the Gaussian Annulus Theorem. So a lot of the effort originally went into ensuring the orthogonality of these directions, but that's unnecessary: we can just take Gaussian vectors; they need not be exactly orthogonal. This is a quantity we can easily compute, so we might as well deal with this. So we'll prove that next time.
It'll be very simple again, as I told you.
And I'll probably give you a
couple of applications of that.
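Here is a minimal sketch of that random-projection idea (not from the lecture, assuming numpy). Note the scaling: the lecture's k / d factor corresponds to unit-length directions; with unnormalized standard Gaussian vectors, as below, the scale factor is k instead. The point is only that there is a known scale factor.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 1000, 50, 20
pts = rng.standard_normal((n, d))     # n arbitrary points in R^d

# k independent Gaussian directions v_1, ..., v_k ~ Normal(0, I_d);
# no effort is spent making them orthogonal.
V = rng.standard_normal((k, d))

def sketch(x):
    # the k dot products (x . v_i), i = 1, ..., k
    return V @ x

# With standard Gaussian v_i, E[ sum_i ((x - y) . v_i)^2 ] = k * |x - y|^2,
# so |sketch(x) - sketch(y)| / sqrt(k) estimates the true distance.
for i, j in [(0, 1), (2, 7), (4, 19)]:
    true = np.linalg.norm(pts[i] - pts[j])
    est = np.linalg.norm(sketch(pts[i]) - sketch(pts[j])) / np.sqrt(k)
    print(f"pair ({i},{j}): true={true:.2f}  estimated={est:.2f}")
```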
