So
welcome to your first lecture.
So this is the course Mathematical Foundations of Data Science.
So the word mathematical is
important because data science
and sort of buzzwords like that
have very many meanings, right?
Some of them, of course,
have to do with data.
But this course,
on the face of it,
will have something
to do with data.
We'll see some motivations
today, but after that,
we'll forget data.
It'll all be just
mathematical vectors and
vector spaces and so on.
But it has connections to data,
I will motivate that.
So the first part of the first
lecture will be atypical,
because we are going to
give motivations and
sort of connections
to real world things.
Most of the time,
we'll move on and
just prove things
mathematically, right?
It's a foundations course, so I'm not going to tell you, necessarily, particular models for particular situations.
Occasionally I will give you
examples like I'll do now, but
generally I won't.
Generally it'll be abstract,
right?
There are two aspects.
One is algorithms,
the other one is models.
Stochastic models or
deterministic models,
algorithms and tools,
you'll see both of these.
This follows a book, and
I meant to give the reference.
So it's a book I'm writing
with Avrim Blum and
John Hopcroft called
Foundations of Data Science.
And you can download it from
Hopcroft's website, right?
And what we are going
to do in these lectures
are chapters two and three.
And I'll stress this
as we go along.
The prerequisites, of course,
you are all familiar with linear
algebra, multivariate calculus,
basic probability, right?
I will assume that and
I'll stress that as I go along,
right?
Okay, so with that,
I'll move on.
As I said, the first part of
the first lecture is atypical.
I'm gonna give you motivations.
Okay, so for this course each data point is in real space, d-dimensional real space. I'll always use d for the dimension, where d is large.
There are quibbles about this.
Now the reason for this is, in
fact, I'll give you an example,
in many modern applications,
d is the number of features.
Each coordinate is a feature,
and
each data point has one
component for each feature and
tells you the importance, or the weight, of that feature, right?
So if you have images, the coordinates represent image features, and then we'll see how documents are represented in this way.
Now you can say this
is not universal.
Sometimes coordinates are Boolean or something like that, yes,
that's true, but
most of the time they are real.
In this course,
we'll deal with real vectors.
Also, there's much nicer mathematics to do with real vectors than with other things.
So we'll just do
that in this course.
For us then, data points
are just points in Rd.
There are two broad
things you can look at.
One is the modeling part.
I mentioned this, sort of.
Modeling part,
what feature should we use?
What should be the value of each feature?
So we'll see in a minute an example of representing documents by vectors.
And when you represent documents
by vectors, each component
may be the frequency of
one particular word.
Each component corresponds to a word, and its value is the frequency. Maybe frequency is not the right thing; maybe you should take the square root of the frequency, or the log.
These are the kinds of
details we won't worry about
here, right?
But somebody who does the
modeling has to worry about it
to get a good model.
So we assume that is their job.
They have done the modeling, and
given us points in Rd, right?
That's what we'll do here.
And the second part
is understanding and
processing data,
mathematical structure,
properties of data, and so on,
as well as algorithms, right?
So the course mainly deals
with the second part
with occasional examples as
I'm going to now describe.
Right at this point,
I'm going to describe some
examples of modeling, right?
Okay, so example one, which
is a fairly standard example.
So if you have
a collection of documents,
I want to represent them.
Represent each document
as a point in d space,
where d is large, right?
So let's say the documents
use the vocabulary of d words
or terms.
Every word may not make
it to your list of terms.
I mean, articles may be omitted and things like that.
This kind of pruning the modeler
does, you don't worry about it.
It's given to us after
all the pruning is done.
We represent each document as a
d vector, d-dimensional vector,
which lists the frequency
of each term in a document.
Or it may be the square root of
the frequency or whatever, but
it's some function of
the frequency usually.
Okay, I guess this was
supposed to be an example.
So here's a random document.
These words occur with very
high frequency, right?
And so
you know what the document is.
I mean, one objection is that we completely forgot the context.
We didn't worry about
which words occur
next to which words, right?
So we don't worry about that,
just call it a bag of
words representation.
Even though it throws away a lot
of information, it's turned out
to be quite useful and
used in a lot of applications.
And if you want to worry
about neighboring words you
could worry about pairs of
words and their frequencies.
Pairs of words occurring next
to each other, so bigrams,
they're called, and
their frequency.
So you could do
things like that.
Those are tweaks to the model,
but it's just an example to say
documents are represented
as vectors, okay?
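To make the bag-of-words representation concrete, here is a minimal Python sketch; the vocabulary and the document are made up for illustration, and the frequency function is the plain count (the square-root or log variants would be one-line changes).

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    # One component per term in the modeler's chosen vocabulary;
    # the value is the term's frequency in the document.
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["data", "vector", "space", "dimension"]  # after the modeler's pruning
doc = "data points are points in a vector space of high dimension"
print(bag_of_words(doc, vocabulary))  # -> [1, 1, 1, 1]
```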
One more example, which is also crucial. Suppose you have a collection of d URLs, say you want to describe the whole web. Let's say all the URLs are there, d of them, where d is very large, right, billions? Each URL, I'm going to say, becomes a d-vector with 0-1 coordinates, right?
And the coordinate is 1 in position i if there's a hypertext link. So one particular URL I'm looking at has become a vector. The vector's i-th component is 1 if there's a link from that URL to the i-th URL in the collection, and 0 otherwise. 0-1 vectors, right?
This potentially could be
d-squared elements, right?
But of course most
of them are 0.
Everybody points to,
I don't know,
at most seven URLs on average,
so this is very sparse.
There are only 7d non-zeroes, so you would only list those.
But this is
the kind of detail we don't
worry about at the moment.
But there are 7d ones rather than d-squared, so most of the storage goes away.
So you may ask,
so far I've said,
take documents or take URLs,
plunk them down as vectors.
That seems actually
a trivial exercise.
All I've done is made
a list of things.
I could have made an array or a linked list or any other data structure.
So you can say, what's the point
of calling them vectors?
Okay, there turns
out to be a point,
it's not obvious immediately.
There turns out to be a point.
For instance, correlations between URLs, say URLs 4 and 5. I want to know: how many URLs do they both point to? How many common pointers are there from these URLs?
As you know, for 0-1 vectors, that's just the dot product. And the dot product of vectors turns out to be a more general, more useful measure of correlation between data points. So one reason for writing them as vectors is that you can do things like dot products, right, instead of having a linked list or something.
So the vector thing is useful.
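As a small illustration of that point (with made-up link positions), here is the dot product computed sparsely: for 0-1 vectors it is just the size of the intersection of the out-link sets.

```python
# Each URL's out-links, stored sparsely as the set of positions where
# its 0-1 vector is 1; the indices are invented for this example.
links_4 = {2, 7, 19, 42}   # URL 4 points to URLs 2, 7, 19, 42
links_5 = {7, 11, 42, 99}  # URL 5 points to URLs 7, 11, 42, 99

# The dense dot product sum(u[i] * v[i] for all i) reduces, for sparse
# 0-1 vectors, to counting the common out-links.
common = len(links_4 & links_5)
print(common)  # -> 2 (both point to URLs 7 and 42)
```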
Now, not only dot products: it turns out angles, lengths, orthogonality, all of these linear quantities are quite useful.
Also principal components which
I will define in a minute and
will be covered in the second
part of the course will also
turn out to be very useful
in a lot of applications.
So this is not just a bookkeeping exercise to put them down as vectors instead of in a linked list or stack or something like that, right?
It's not just
a pointless exercise.
So I'll soon say something
controversial so
you can have questions, right?
Think about that.
Okay, so
high dimensional geometry.
So far we have seen that data can be represented, usefully, if you believe me, and we still have to demonstrate that, in very high dimensional space.
As an example,
D is in the billions, right?
So it's still feasible,
because it's sparse, but
it's in the billions,
so we've seen that.
So it would be nice to see what
vectors in billion dimensional
space look like, right?
That will be the first two weeks
of this mini-course, right?
To see some properties of
high dimensional geometry.
So since data consists of points
in high dimensional space,
it's important to understand
properties of high
dimensional space.
And these are quite different
from two and three dimensions,
and that's what I want to
communicate in these lectures.
Before I go on, I want to say there will be two things that will hopefully be the highlights of these two weeks. One is called the Johnson-Lindenstrauss random projection lemma, the random projection theorem, which I won't state now. But if you like, you can look it up.
And the second
highlight will be,
how many of you know
Chernoff bounds?
Good, we will see them.
So, for the second one: you have things like Chernoff bounds for 0-1 random variables, so Bernoulli random variables.
You can prove similar results; what they are, I'll tell you in detail when we come to it, but right now I just want to pique your curiosity a little bit and have you look things up if you want to, right? Chernoff bounds, 0-1 random variables.
Now, I could instead take sums of Gaussian random variables, and I could take sums of squares of Gaussians; these all come up. So what's meant here is I have X1 through Xn, which are iid, and each Xi is normal N(0,1). If this notation is not familiar, this is not the right course for you, right? I mean, all of this should be very familiar. iid means independent and identically distributed, right?
So in this case you take
the sum of squares,
and in that case you
just take their sum.
Or you could take sum of
exponentially distributed random
variables.
So the second highlight
of these lectures will be
one result from which you
can derive concentration
properties of all of these.
So these are all useful, and in practice you have different proofs that the sums of these things are very close to their expected value, right? So when you take the sum of a bunch of random variables, the sum will be close to its expected value.
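As a quick numerical illustration of this concentration (a simulation, not a proof), here is a sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)              # iid N(0, 1) samples
e = rng.exponential(scale=1.0, size=n)  # iid Exponential(1) samples

# Each sum lands close to its expectation (0, n, and n respectively),
# with fluctuations only on the order of sqrt(n).
print(x.sum())         # near 0
print((x ** 2).sum())  # near n = 10000
print(e.sum())         # near n = 10000
```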
Again, we'll put up things you should know: the central limit theorem and the law of large numbers you should know, and if not, revise them. They tell you this, but usually the proofs are all different.
So one thing you might want to think about at home is: what common result can derive all of these? And we'll see one, right? We'll see a theorem that tells us how to derive all these concentrations.
So I'm giving you the highlights partly to tell you what the course is about, but partly also so you can go and brush up your old linear algebra and probability, right?
Okay, so volumes and surface
areas are just integrals.
And so
we can work them out in 3-d,
we can also work them out in
any number of dimensions.
So 3-d is multiple integration, three integrals. Instead, you'll have d integrals now, right?
Now here are some simple facts,
right?
The volume of a cube of side 1 in d dimensions is still 1, okay?
That's good.
This is the cube, right?
So the volume is 1.
What's the volume of a cube of side 2? If you scale each side by 2, in two dimensions you scale the volume by 4, in three dimensions by 8, and so on. In d dimensions you scale it up by 2 to the d.
Cuz every dimension
scales up by 2.
Good thing to remember.
What about the d
dimensional sphere?
Also the same, if I up
the radius by a factor of 2,
I up the volume by
a factor of 2 to the d.
Okay, these all follow by integration; the fact to remember is that you are going to be integrating to do this. So you have an infinitesimal cube. In the picture it's three-dimensional, right?
And if I expand each side by a
factor of 2, the volume goes up.
I mean, I can't draw it.
But volume goes up by
a factor of 8, right?
That's all I'm saying.
So once you have this property
for a cube, it's true for
everything else,
like a sphere, right?
This follows from that.
And why is that?
Because in getting the volume
of this, you'll be integrating.
Maybe I should just tell you why that is in a picture. So here's a sphere of radius two, here's the sphere of radius one. I integrate the volume of this by taking cubes here. And corresponding to each cube, there's a cube of double the side, right? So I get the bigger sphere by going along every infinitesimal cube and doubling each point. So the volume just goes up by a factor of 2 to the d, because it does so for each cube, therefore for everything.
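In symbols, the argument is just a change of variables in the volume integral:

```latex
\mathrm{vol}(cA) \;=\; \int_{cA} dx
  \;=\; \int_{A} c^{d}\, dy
  \;=\; c^{d}\, \mathrm{vol}(A),
\qquad \text{substituting } x = cy,\ dx = c^{d}\, dy .
```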
Okay, so here's a nice picture
that I like to put up.
So things don't really
look like this but
this is a picture to remember.
So here is a cube of side one, centered at the origin, right?
And in two dimensions, and here
is a sphere in two dimensions.
This is circle of
radius also one, right?
And clearly that encloses
the square in two dimensions.
In three dimensions, I mean,
the cube doesn't quite look like
this but we have drawn it like
this schematically, right?
In three dimensions, is it exactly touching? Is that right? Let's see, this cube has side one, so you have a half on each side, so you get a squared distance of three-fourths. Well, we should be able to work out the distance to the corner: if you have the origin here, half to each side, the corner is the vector of all halves. This distance is, by Pythagoras' theorem, the root of d over 4, that is, root d over 2. Root of the dimension over 2. I think this picture is four dimensions, in fact, so it works out there. So read the illustration as radius 1 in four dimensions, right? In four dimensions, the corners would be exactly on the sphere.
Everybody with me?
I'm just doing
Pythagoras' theorem.
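Written out, the Pythagoras computation for the corner of the unit cube centered at the origin is:

```latex
\Bigl\| \bigl( \tfrac12, \dots, \tfrac12 \bigr) \Bigr\|
  \;=\; \sqrt{ d \cdot \tfrac14 }
  \;=\; \frac{\sqrt{d}}{2},
\qquad \text{which equals the radius } 1 \text{ exactly when } d = 4 .
```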
When we go to d dimensions, you get root d over 2. d is big, so the corners are enormously far outside the ball, right? The corners of the cube are way outside the unit ball. The unit sphere has radius 1, while this distance is root d over 2; that's all I'm trying to say here.
The cube is, of course, convex, so this picture's not literally correct. The boundary cannot curve inward like that; it's a convex set. But it's just drawn like that because the centers of the faces are inside, right?
So the center of one of the faces is at distance one-half, inside, but the corners are way out.
So as you go to lower
dimensional faces,
they are very far out.
Again, nothing profound so far.
So, we will, okay.
I'm sorry, I should stress this.
The course will prove these
kind of statements, properly.
We will compute the volume of
the sphere and things like that,
properly.
But you might want
to do this at home.
You certainly want to review multivariate calculus, right? If it's not fresh at your fingertips, please review it.
Also, try to prove that
the volume of the d dimensional
sphere goes to 0 as
d goes to infinity.
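If you want to check that numerically before proving it, here is a sketch using the standard closed form for the volume of the unit ball (the course will derive this properly later):

```python
import math

def unit_ball_volume(d):
    # Standard closed form: vol(B_d) = pi^(d/2) / Gamma(d/2 + 1).
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20, 100):
    print(d, unit_ball_volume(d))
# The volume peaks around d = 5 and then rushes toward 0 as d grows.
```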
Okay, so instead of multiple integration with three integrals, if d is 10, you will get 10 integrals, right?
And if you work out the volume directly, I won't do this, I'm going to do the volume in a different way. If you work it out directly, you will get powers of trigonometric functions, right?
You have to integrate the powers
of sine and cosine, and
you have to use multiple angle
formulas from trigonometry.
So to do the integral,
that's a good exercise, perhaps.
Just to review your calculus,
okay?
And then we'll see that. So, this picture is trying to indicate that this big thing is a cube of side one, and it's much bigger in a lot of ways than the sphere. The cube of side one has volume one. The sphere's volume is going to be much smaller, going to 0.
Okay, we'll prove that, and we'll prove it differently. We'll prove it in a nicer way than multiple integration, by relating it to Gaussians. So, in fact, I want to use that to bring in Gaussians and the connection to spheres.
But you should try to prove it
sort of in an elementary way, so
see if you can do that.
Okay.
Next, a very brief intro to, basically, singular value decomposition, principal component analysis; it goes by both these names.
I'll tell you what
it is right now.
But in the last two weeks of
the course, we'll see that.
So many of you know principal
component analysis.
But perhaps,
you've seen it slightly
differently from what I'm
going to tell you now.
It's the same thing,
but I'm gonna tell you in
a particular way which, perhaps,
is simpler, more elementary,
also more to do with data.
So, the principal components are just best-fit directions for a set of vectors.
So I give you the set of
vectors, that's your data.
Bunch of vectors,
bunch of data points.
Take the best-fit line,
namely, the red line.
I define best-fit by saying I take the sum of squared perpendicular distances of these points, like this, to that line, to that infinite line.
So here, I've drawn
the perpendicular things.
I don't take just
the perpendicular,
I take the sum of squares.
Of course, you've all seen
least-squares, right?
This is very similar
to least-squares,
except it's not exactly
least-squares because
least-squares does
vertical distances, right?
So what does your calculus
least-squares do?
So you have a bunch of points
and you fit a line to it, right?
Now typically,
you take the vertical distances,
square them and sum, right?
Now I'm telling you to take
the distance perpendicular to
the line, so
it's slightly different, okay.
And the reason for
taking sum of squares
is it has many nice
mathematical properties.
Just as in your calculus least-squares, if instead I took the sum of distances without squaring them, that becomes much more difficult.
We've learned in calculus how
to do least-squares, right.
Because it's the sum of squares,
the derivative is linear, and
it's very nice.
So, we do that, right.
For the very same reason, or a very similar reason, here also taking the sum of squares is very important. So in linear algebra, you always pick squares, right? Sums of squares of vector components, sums of squared distances, these are very, very important.
Just like least-squares, there
will be an argument to do this,
we'll see that but
this is the best-fit line,
it's called the first
principal component.
Okay.
There are other principal components. The best-fit direction, the first one, v1, is the first principal component. Then you take the best-fit direction v2 perpendicular to v1; that's called the second principal component, right, and so on.
So v3 perpendicular
to both of these.
The best-fit direction.
Best-fit, meaning sum of
squared distances, right?
And finding k such directions gives you the k principal components.
And a lot of algorithms
do the following.
You find the first k principal directions, and project all the data to the space spanned by these.
So somehow,
these are considered the very
important directions, right?
And it turns out, in a lot of
applications, that's true.
Again, I'm not telling
you all the details,
this is just a peek into what
we'll do in those lectures.
Okay, and then algorithms do
something in the projection, but
the idea is to find these, okay?
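To connect this to something you can run, here is a minimal numpy sketch with made-up data; best-fit here means directions through the origin, minimizing the sum of squared perpendicular distances, and the principal directions come out as the top right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))  # 200 made-up data points in R^10

# The first k principal components are the top k right singular
# vectors of the data matrix (the first k rows of Vt, orthonormal).
k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k]

# Project every data point onto the span of v1, ..., vk.
projected = A @ V_k.T @ V_k
```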
So we'll spend this
second part doing that.
Any questions so far?
I've been fairly general.
So presumably,
this is just, as I've said,
an overview of what's
going to happen.
I am going to do one or two detailed things today, probably the simplest things, and then we'll finish.
Okay, probability in high
dimensional geometry, so
please review and be current
on your basic probability.
You should know what random variables are: mean, variance, independence, and so on. And you should know the statement of the central limit theorem.
You don't necessarily have to
know the proof right now, but
it'll be nice if
you know the proof.
But I don't insist on that; know at least the statement.
You should know what the
Variance-Covariance matrix is.
You should know what a general multivariate Gaussian density is. A probability density, right; not just the spherical Gaussian, but the general multivariate Gaussian probability density in d dimensions.
All of these,
if you're not familiar, look up.
Okay, so from probability, as I said, if a data quantity is the average of many independent quantities, it behaves like a Gaussian random variable; that's the central limit theorem, right.
So there are many analogies between vectors of d independent random variables and points in d-dimensional space.
Okay, so
I said geometry so far and
I suddenly brought
in probability.
I did so because this is true, that there are many analogies between vectors in very high dimensional space and sums of many random variables, okay?
And we'll see one or
two of those analogies.
Okay, this is just saying that Gaussian random variables are a particular case of that. So this is true for independent Gaussian random variables, okay? We'll prove also that if you have a random point from a d-dimensional hypersphere, centered at the origin, then the sum of the coordinates is nearly 0.
So I'm going to draw a picture,
which we will see.
So I'm gonna draw two
pictures for you.
So one is a sphere
in d dimensions.
If I take this sphere, this is in d dimensions.
Most of the mass of it
is near the equator.
So this is the equator.
So this has 1- epsilon
of the volume.
So I haven't told you the width of this annulus; it's very small.
It's going to 0 as
d goes to infinity.
So I guess anybody who hasn't
seen it, if you want to guess
how wide this annulus is,
that would be interesting.
How wide will it be as
a function of d, right?
If you've seen it,
you shouldn't guess it.
Anybody wants to guess?
Another way of saying that is
if I pick a random point from
the sphere, it's highly likely
to be close to the equator.
By the way, what is this region?
This is the region of x, let's say, such that |x1| is small. Okay, if the north pole is where x1 is 1, then this is the equator. I could have taken any axis as the north-south axis, but I'm taking the first one, so the condition is |x1| less than or equal to delta.
So how large or
how small is delta so
that I capture quite
a lot of the volume?
We'll see that.
So I don't get any guesses? Okay, that's fine; your guesses might not be that great anyway.
So we'll show this.
We'll also show a second thing, which will be easier. That was picture one; the second thing is another picture. It should be easier.
So this is also in d dimensions.
And in fact, most of the volume
will be in this part.
So this 1- epsilon also
applies to something very
near the boundary, right?
So most of the volume
would be there.
We'll prove this today and
this the next time, or
maybe both at this time.
But I want you to think about
why these are not contradictory.
So I'm asserting two things,
right?
I'm saying 99% of
the volume is here.
I'm also saying 99% of
the volume is here.
So does that make 198%, or what happens, right?
It'll be good for
you to think about that.
We will think about that.
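Here is a small Monte Carlo check of both claims at once (a simulation, not a proof; the slab half-width 3 over root d and the annulus width 10 over d are choices made for illustration, anticipating the widths the lectures will derive):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10_000

# Uniform random points in the unit ball: Gaussian direction, U^(1/d) radius.
g = rng.standard_normal((n, d))
x = g / np.linalg.norm(g, axis=1, keepdims=True)
x *= rng.random((n, 1)) ** (1 / d)

r = np.linalg.norm(x, axis=1)
near_equator = np.abs(x[:, 0]) <= 3 / np.sqrt(d)  # slab around the equator
near_surface = r >= 1 - 10 / d                    # thin annulus at the boundary

print(near_equator.mean())                   # close to 1
print(near_surface.mean())                   # close to 1
print((near_equator & near_surface).mean())  # still close to 1: the two overlap
```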
So, for everybody to at least be able to guess, I have to tell you the width of this annulus.
Yeah?
>> [INAUDIBLE]
>> The second picture is,
again, a sphere of radius 1,
right?
And let's say,
I make this 1- delta.
I used the delta already,
let me put delta prime.
So I'm then left with all
points within delta prime
of the boundary.
>> [INAUDIBLE]
>> No,
both of them are d dimensional.
>> Which one is?
>> This, if I draw a flat like that, that'll be d-1 dimensional.
>> [INAUDIBLE]
>> Yeah,
it's uniform over
the entire ball.
Okay, it's uniform over the interior, everything. So the pictures I'm drawing are all pictures of the set where the magnitude of x is less than or equal to 1. And this is generally true for many objects, almost all, in fact; we'll prove it in general.
And again, we'll state this
carefully and prove that.
Okay, so again, review and
be current on probability.
Okay, so
this is the equator thing,
which is the first
picture on top.
So that was lecture one; now I'm going on to lecture two. Actually, given that this is a very large audience, you have to figure out from the first lecture what fraction of people you want to come back, right? It could be greater than one or less than one, but in this case, maybe less than one.
So we'll see if we go
fast enough for that.
Okay, so the second fact, this one is simpler. I'll prove it right now, and possibly also, at this rate, the first fact, right?
So I will state
the theorem properly,
but the gist of it is most
of the volume of any high
dimensional object
is near the surface.
So this picture is more general. I'm taking a cross-section and giving it a thickness, so the slab is full dimensional. The equator itself is d-1 dimensional, right? The Earth's equatorial cut is a plane. So the equator is one dimension less, but I'm taking the slab, right? I'm taking the set of x such that |x1| is less than or equal to delta, not x1 equal to 0 alone, yeah?
To be clear, this is the real equator; I'm taking basically everybody within distance delta of it.
Now, we will define delta and
delta prime in a minute.
But still, it seems contradictory that I should be putting 99% of the mass here, as well as here, okay?
They're both true,
we'll prove both, okay?
So the question will be, why
is it that they're both true?
Now, it does imply that most of the volume is in the intersection, correct?
So 98% at least, right?
If 99% is in this and
99% is in this,
then the intersection
must have at least 98%.
Yes, so that seems
also not intuitive, and
this is where higher
dimensions will differ, right?
Intuitively, in three dimensions, if I take a little slab here and the periphery, the intersection looks very small. That intuition is wrong in d dimensions. It must be 98% because of these two statements.
So this is the second thing
that's counterintuitive.
The cube and sphere picture is perhaps also counterintuitive. And it depends, absolutely, on the values of d and delta; the 99% depends on d and delta. So those three things are related, right: what the percentage is, and d, and delta.
Okay, now in three dimensions, the great circle would have been one dimensional, right? So by equator I mean the two-dimensional equatorial cut, as if you cut through the sphere, which you won't ever actually do. So maybe I could have taken that as the equator if my north pole was there, right?
I could have put
the north pole anywhere.
>> So you're specifying that-
>> I'm specifying one thing.
So okay,
that is a good question because
that raises another
contradiction.
This is for
every equator, right?
I can put the north
pole anywhere.
They all give equators. And every one of those is symmetric, and I'm saying they're all 99%.
That's even more worrisome,
right?
Things are getting worse, right?
So at least for the periphery, there's only one periphery, right, only one set of points that are within delta prime of the boundary.
But for the equator,
there are infinitely many,
because any point could
have been the North Pole.
Okay, so now we are saying
something even worse,
that every one of those is 99%.
It's also true,
all of these are true.
This is a volume of
a d-dimensional object.
So let's see, let's go
through all that, right?
So if you have the Earth, right? The usual equator is just a one-dimensional thing, a line around, right? But by the equator I mean the two-dimensional planar cut, as if we cut the whole Earth into two, right?
So in d dimensions,
that cut would have been
d minus 1 dimensional.
But now I'm taking a slab,
so it's d-dimensional.
What you are saying is that just the line, the way we usually think of the equator, is d minus 2 dimensional, yeah?
So it is nice to keep track of the number of dimensions, because, for instance, volumes scale by r to the d in d dimensions. But if I wanna do the same thing for surface areas, it's r to the d minus 1, so you have to really keep track of the dimensions.
So most volume near the surface,
let's prove this.
This is simple, much simpler
than the equator thing, right?
So take any object A which has finite nonzero volume in d-dimensional space, okay?
I want to argue that if I
shrink A by a small epsilon,
it shrinks the volume by a lot
in high dimensions, right?
Small shrinkage shrinks
the volume a lot.
That is somehow intuitively
going to say this, right?
If I take this unit sphere and
shrink it just by a little bit,
by delta prime, I'm saying
I've lost most of the volume.
Right, my volume has shrunk
an enormous amount, right?
That's what I want to say.
Okay, so we'll denote the object
obtained by shrinking
A by epsilon, we will call
that (1-epsilon) times A.
So this is a scalar,
this is a set of vectors, right?
It's an object in d-dimensional
space which is a set of vectors.
What this means is,
I take every vector and
multiply it by a scalar,
(1-epsilon).
So every coordinate is
multiplied by (1-epsilon).
So that's my shrunk object,
(1-epsilon) A.
So what's the volume of
(1-epsilon) times A in terms of
the volume of A?
I want to know that, right?
So if we shrink an object by a factor of 2, that is, multiply it by a factor of a half, then in two dimensions the area would shrink to one-fourth, right? In three dimensions, to one-eighth, and so on.
And for general d,
let's see, did I finish?
For general d, we again divide space into infinitesimal cubes of side dx to integrate, and count how many cubes there are inside.
And you can imagine each
cube being shrunk now.
So maybe another picture.
So for the sphere, it's clear
if I take every point multiplied
by (1-delta prime),
the set of points I get is this.
So you have some object A,
and I want to take
everybody and
multiply it by 1-epsilon.
An interesting picture is when the origin is somewhere outside the object; this is the origin, right?
So what am I doing?
I'm taking each point x and
replacing it with a point
(1-epsilon) times x which
is on the same line.
And if I take everybody like that, then the picture becomes something like the same shape. The origin, maybe, is still outside.
In fact,
it will still be outside, right?
If the origin didn't belong to A, it still won't belong to this, cuz I've shrunk every point by 1-epsilon.
So the volume of this
I would have computed
by taking small
infinitesimal cubes.
I'm doing these pictures,
I mean, clearly you know from
multivariate calculus
all of this.
But I want to go over
them once at least, so
that when we go to d dimensions,
I mean, d dimensions, you can't
draw the picture, but at least
you can imagine what happens.
So you take this and you shrink
it by a factor 1-epsilon,
then it'll become
a cube somewhere else.
It may not be inside the original object, right? If I shrink it, for some of these cubes the shrunk cube will be outside, but we don't care; it's still that cube.
Now this object is made
up of cubes like that.
It's one-to-one, you have
to convince yourself, for
the integral to be correct,
you have to convince yourself
that it's one-to-one.
Not completely obvious, right?
Because the origin is
outside and you're shrinking,
may go outside.
But it's still true that every
cube goes to a shrunken cube,
and every shrunken cube
came from one cube.
Again, in 3D, you can
imagine these pictures, but
it's also true in
higher dimensions.
And if you take any cube, and
I do this 1-epsilon thing,
it'll still be a cube.
Even if the origin
is not the center.
So okay, that's a good point,
you know you should convince
yourself that that's true,
cubes shrink to cubes.
Okay, that's important,
otherwise,
all I'm saying is not
going to be true, right?
That's right,
distance will be maintained, or
another way to say it
is the cube says that
x1 minus the center is at
most 1 in absolute value.
If I shrink it, x1 minus
the new center is also shrunk.
It will be at most 1-epsilon,
right?
So you have to prove that
cubes go to cubes for
these statements to be true,
but they do.
Okay, good.
So we divide space into
infinitesimal cubes and
see how many cubes are inside.
There's a one-to-one, okay, so
this is what you're asking.
There's a one-to-one correspondence with cubes of side (1-epsilon)dx contained in this; I drew the picture on the board. Okay, the TeX figure would have cost me a lot of time to draw, so the slide is not too exciting.
Okay, so for the volume of the d-dimensional cube: we have shrunk each side by a (1-epsilon) factor, so you multiply the volume by a factor of (1-epsilon) to the d. That's how we get the lemma, which we have proved now, right? Because of these integrals, the volume of (1-epsilon) times A is at most (1-epsilon) to the d times the volume of A.
And here's an important inequality you should remember throughout the course: (1-x) is always less than or equal to e^(-x), for all real x.
This is probably the most used
inequality in this course or
any other course, right?
But it's true for x negative,
positive, everything,
always true.
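Chaining the exact scaling with this inequality gives the bound:

```latex
\mathrm{vol}\bigl( (1-\epsilon) A \bigr)
  \;=\; (1-\epsilon)^{d}\, \mathrm{vol}(A)
  \;\le\; e^{-\epsilon d}\, \mathrm{vol}(A),
\qquad \text{using } 1 - x \le e^{-x} \text{ with } x = \epsilon .
```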
Therefore it gives you this
factor, what does this tell you?
The most important thing here
is d is an exponent, right?
So if epsilon is 1 over d,
already it starts falling.
If epsilon is 10 over d, you
get e to the minus 10, right?
So it's a very small fraction,
so this delta prime,
Delta prime is order 1 over d.
So in 1,000 dimensions,
most of the volume is within
one-thousandth of the surface.
1 over d, okay?
So that's the width
of the annulus.
We will later also want to find
the width of such annuli for
Gaussians.
And for Gaussians,
the width is 1 over root d.
It won't be the same.
So if I pick n Gaussians, right, and plot them as points, sorry, d Gaussians, plotted as a point in d-dimensional space, there will be an annulus, but with width 1 over root d, so that most of the probability mass lies there.
We'll prove that, but it's still
an annulus, so what I'm now
saying is this, delta prime
depends on the object, okay?
For the sphere, it's 1 over d,
for Gaussians, it'd be 1 over
root d, for various other
objects, it'll be different.
For cubes, also it'll be 1 over
d, I didn't prove that, but
it's something you should prove.
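Here is a quick numerical check of the Gaussian case (again a simulation, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10_000
# n points, each with d iid N(0, 1) coordinates.
norms = np.linalg.norm(rng.standard_normal((n, d)), axis=1)

print(norms.mean())  # close to sqrt(d), about 31.6
print(norms.std())   # close to 1/sqrt(2), independent of d -- so relative
                     # to the radius sqrt(d), the annulus width is ~ 1/sqrt(d)
```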
>> [INAUDIBLE]
>> The other inequality, yes, so this was exact, right, this was exact, good point. So the volume of (1-epsilon) times A is exactly equal to (1-epsilon) to the d times the volume of A. Right, no loss there; only when I went to e to the minus epsilon d did I lose something.
Okay, specializing to
the sphere, we get a slightly
better statement, I mean,
the statement that we wanted.
All but an e to the minus epsilon d fraction of the volume of the unit hypersphere lies within distance epsilon of the surface. Okay, why is that?
Because every point at
distance at most delta
prime from the surface is in
the annulus, and vice versa.
The annulus here consists
precisely of the set of points
within distance delta prime
of the boundary, right?
Well, for the sphere it's true, but it's not necessarily true for everything, right?
For the cube, if I shrink it by a factor of 1 minus epsilon, what I drew earlier, you do get a square, right, but this is not the set of points at distance at most epsilon from the boundary. That set is different; that set is sort of a little bit rounded near the boundary.
Okay, so that proves what we wanted about the volume near the boundary, yeah?
>> [INAUDIBLE]
>> Yes, I'm sorry,
yes, I called it epsilon,
yeah, they're the same.
So if somebody told you that
data points were uniform random
from a sphere, if you have
a lot of data points,
most of them will have sum
of squares nearly 1, right?
There'll be almost nothing near the origin; all of this is empty space.
Okay, there'll be
nothing near the origin.
It's also true of Gaussians. We'll see that if I pick x1 through xd iid Gaussians, there'll be almost no probability mass near the origin. The sum will not be nearly 0, I'm sorry, the sum of squares will not be nearly 0 most of the time. The sum will be nearly 0; the sum of squares won't be.
And what about a sphere of radius r? It all just scales with r, right? Nothing new or interesting happens; the picture is the same, an annulus whose width is 1 over d relative to the radius. Essentially, almost everything lies there, not everything.
So in fact, this is the difference between (1-epsilon) to the d and e to the minus epsilon d; this is exact, right? I didn't lose anything, so it's exactly equal to (1-epsilon) to the d times the volume of that. So if d is 1000, there's a power of 1000.
>> Yeah, but why [INAUDIBLE]?
>> If you consider an arbitrary object, what happens?
Okay, that's a good question,
so for
very bad-shaped objects,
things are not so easy to say.
So if I take a cube,
it'll still be a cube,
it'll be reasonably nice.
Right, so this is half, let's say; this will be half times (1-epsilon). This corner will still be fairly far from the original corner, right? The middle of a face will be at distance about epsilon, but the corner is at about epsilon times root d; there's always the root d factor coming in, right?
But if you take a long,
skinny object like that, and
take 1 minus epsilon of that,
then that'll cut off
somewhere very far from that.
And we have to work it out for every object, but for bad-shaped objects, the volume factor is still (1-epsilon) to the d, that's exact, but the distance from the corner can be a lot.
Yeah, cuz only for cubes and spheres do volumes behave more or less like a distance to the d-th power; for odd-shaped objects, no, right?
So here the volume is this
dimension which is very small
times the other dimensions
which are much bigger.
So linear transformations can preserve volumes without preserving lengths.
There are many examples,
like the URL example,
where the vectors came
from a real world
application which had
only zero-one vectors.
But once we make them points in this space, we forget combinatorial properties like being zero-one, and only look upon them as real points.
And then we can shrink them,
expand them,
anything, so all of that
is interesting, okay.
A much more striking example of this, which is not going to be in this course, where you make data into vectors even when it doesn't seem to mean anything, is optimization.
So you all know the traveling salesman problem, right? You have a bunch of cities, and you want to find the minimum length tour. If you don't know it, don't worry, because there's no time to explain it.
But it turns out the most common way to solve this problem is to make each tour into a zero-one vector, which tells you which edges you take. And you take those zero-one vectors, put them down in high dimensional space, consider their polytope, and optimize over this polytope; that turns out to be the best way.
One way of optimizing over polytopes is going through the interior to the optimal point; these are called interior point methods.
In the interior, the vectors are definitely not zero-one, right?
So even though the combinatorial structure of the original problem was only zero-one, going to real geometric space, making everything real vectors, turns out to be very useful. And that's so also for the URLs, but optimization you won't see in this course, right?
That's a completely different context in which you turn data into vectors and then forget about the data; you just work with the vectors.
Okay, we have come to
the end of the hour.
