The next thing we're going to talk about is what's called the curse of dimensionality. This is partly a data quality issue, but it's also something we have to be careful about when we're doing data processing.
The curse of dimensionality is that as your number of dimensions increases (that is, as the number of columns or attributes in your data set increases), the data inherently becomes increasingly sparse in that space. That matters because, in a lot of contexts and for a lot of different algorithms, definitions of density and of distance between points, of similarity and dissimilarity, are really important to things like clustering methods and outlier detection, also called anomaly detection. In a sparse, high-dimensional space, those definitions all become less meaningful. If you add enough dimensions, every point looks like an outlier.
A great illustration of this: suppose we randomly generate 500 points in an n-dimensional space and compute the difference between the maximum distance between any pair of points and the minimum distance between any pair of points, normalized and with a log base 10 taken to make it easier to plot. In two dimensions with 500 randomly generated points, the plotted value is about three and a quarter, which, because of the log base 10, means the maximum distance is about 10 to the 3.25 times larger than the minimum distance.
As we increase the number of dimensions, though, that spread falls off really sharply. By the time we get out to 30, 40, or 50 dimensions, our points are so sparse that the minimum distance between points and the maximum distance are almost the same.
The value at 50 dimensions represents a factor of something like 10 to the 0.25: just the fourth root of 10 separates the maximum distance from the minimum distance. That is a very small number.
It's really hard to define outliers when you have such high-dimensional data, because the space is so sparse that every point is an outlier in some way.
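
To make that experiment concrete, here is a small sketch of it in Python. It assumes the plotted quantity is log10((max distance - min distance) / min distance); the exact normalization behind the lecture's figure is an assumption here.

```python
# Sketch of the curse-of-dimensionality experiment: 500 random points,
# compare max vs. min pairwise distance as dimensions grow.
# Assumes the normalization log10((max - min) / min); the lecture's exact
# formula isn't stated, so this is one plausible reading.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for n_dims in (2, 5, 10, 20, 30, 40, 50):
    points = rng.random((500, n_dims))   # 500 random points in the unit hypercube
    dists = pdist(points)                # all pairwise Euclidean distances
    spread = np.log10((dists.max() - dists.min()) / dists.min())
    print(f"{n_dims:3d} dimensions: log10 spread = {spread:.2f}")
```

Running this, the spread shrinks steadily as the dimension count climbs, matching the sharp fall-off described above.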
The solution to this data quality problem is something called dimensionality reduction. We can do dimensionality reduction via aggregation or other sorts of column combination, but there are also a number of mathematical techniques. Two of the big popular ones are Principal Component Analysis, or PCA, and Singular Value Decomposition, also called SVD. Those are mathematical techniques that run automatically and reduce the dimensionality of your data.
PCA usually goes from n dimensions, however many dimensions you have, all the way down to two dimensions.
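
As a minimal sketch of what that looks like in practice, here is PCA via scikit-learn; the library choice and the random input data are assumptions, just to keep the example self-contained.

```python
# Minimal PCA sketch: reduce a 50-dimensional data set down to 2 dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 50))             # 500 rows, 50 attributes (made-up data)

pca = PCA(n_components=2)             # keep the two directions of greatest variance
X_2d = pca.fit_transform(X)           # shape (500, 2)

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```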
[Answering a question from Natalie] They are kind of the same thing, but they aren't exactly the same thing. I'm not going to go into great detail, because we don't spend a lot of time on dimensionality reduction over the course of the boot camp. But my understanding is that they are distinct techniques: they have the same goal, but it is achieved via different mathematical methods.
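
For the curious, here is a short sketch, going slightly beyond what the lecture covers, of why the two are so closely related: PCA is commonly computed by taking the SVD of the mean-centered data matrix, so they share machinery even though they are defined differently.

```python
# Sketch of the PCA/SVD connection: PCA is an SVD of the centered data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 50))
Xc = X - X.mean(axis=0)       # PCA centers the columns; plain SVD does not

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]           # top two principal directions
X_2d = Xc @ components.T      # same projection PCA produces (up to sign)

print(X_2d.shape)             # (500, 2)
```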
