In this second optional video, we're going to continue our look at unsupervised learning, clustering, and some issues in clustering including dimensionality reduction, and we're going to look at two specific applications of dimensionality reduction in clustering for natural language processing.
In our last video we talked about issues in unsupervised learning, in particular in clustering. We looked at the k-means algorithm, which drops cluster centers at random and then lets those centers drift around the feature space so they can anchor themselves onto the centers of potential clusters. This works very well when the clusters are clearly separated. We also looked at a second algorithm called expectation maximization, where if your clusters overlap, the EM algorithm calculates a probability distribution for a point belonging to a certain cluster. This means we don't have hard edges between clusters; there can be areas where the grouping is a little more ambiguous and others where it is more certain.
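As a concrete illustration of the two approaches just recapped, here is a minimal sketch assuming scikit-learn is available; the synthetic data and the choice of three clusters are stand-ins for the example, not anything from the video.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three loose groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

# k-means: hard assignments; centers drift toward the cluster centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_                 # one cluster id per point

# EM via a Gaussian mixture: soft assignments, one probability per cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)            # each row sums to 1

print(hard_labels[:5])
print(soft_probs[:2].round(2))
```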
If we have many dimensions, it might be difficult for the computer to find the clusters. We have algorithms that reduce dimensionality. Some of them are linear, like principal component analysis: PCA will take, for example, a three-dimensional surface, figure out the directions along which you have the most variation, and make those the new axes of, for example, a two-dimensional representation. So it reduces many dimensions into fewer dimensions by finding the directions that explain the most variance; those become your new principal components. But this is done with straight lines.
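A minimal sketch of PCA along those lines, assuming scikit-learn; the random three-dimensional data here is just a placeholder for real features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # e.g. 200 points described by 3 features

pca = PCA(n_components=2)              # keep the 2 directions of most variance
X_2d = pca.fit_transform(X)            # shape (200, 2)

print(pca.explained_variance_ratio_)   # how much variance each component keeps
```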
How about nonlinear distance measurements? How can we find similarity between documents without necessarily using a straight line? There are many algorithms designed to do this. For example, look at the S-like object on the left. It might be true that, measured linearly, the red dots and the green dots are relatively close, but that linear assumption does not correspond to the actual shape of the object. If we followed a nonlinear trajectory around the outside of the S, the distances might be longer, but we would have a better reflection of the topology of the object. So maybe things that are farther apart in linear space are actually closer together in some kind of curved space, as you can see.
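To make the idea of distance along the shape concrete, here is a small illustrative sketch, assuming scikit-learn and SciPy; the S-shaped points and the neighborhood size are invented for the example, not taken from the lecture's figure.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Points along an S-shaped curve (illustrative only).
t = np.linspace(0, 3 * np.pi, 200)
X = np.column_stack([np.sin(t), t])

# Straight-line (Euclidean) distance between the two ends of the S.
euclid = np.linalg.norm(X[0] - X[-1])

# "Geodesic" style distance: hop between nearest neighbors, following the curve.
graph = kneighbors_graph(X, n_neighbors=5, mode='distance')
geodesic = shortest_path(graph, method='D', directed=False)[0, -1]

print(f"Euclidean: {euclid:.2f}, along the curve: {geodesic:.2f}")
```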
There are many algorithms that reduce many dimensions in curved space into fewer dimensions, for example a two-dimensional plane. Let's look at two of them: Isomap and t-SNE.
Let's say we have the same object: a rolled-up, Swiss-roll-like shape that looks like a six. We could describe this as, for example, the pixels in a 28 by 28 pixel image of that 6, saying that out of these 784 pixels the first one is off, the second one is off, the third one is off, the fourth one is off, and so on; you'd have 784 features to describe it. That is a lot of features for something that could be described more easily with a nonlinear path, for example a curve that tells you how the six curls up. Essentially we need to unfurl this curve into something that looks like a plane, trying to find a so-called geodesic path through a curved structure.
Isomap will do exactly this: it finds ways to unfurl multi-dimensional space so that it looks closer to, for example, two-dimensional space, as you can see there with the Swiss roll that looks like a six. In figure 8 we have what we were mentioning before: the dashed distance is shorter linearly, but it's actually a bad description of what's happening. The better description would be the solid curve going along the curve of the Swiss roll, as shown in B. And if we unfurl that six onto something that can be seen as a two-dimensional plane, we find the geodesic path we're looking for, and a way to get fewer dimensions, for example two dimensions out of something that was originally three-dimensional.
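Here is a minimal sketch of Isomap unrolling a Swiss-roll shape, assuming scikit-learn; the sample size and neighbor count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D points

# Isomap approximates geodesic distances along the surface,
# then "unfurls" the roll into two dimensions.
iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)            # shape (1000, 2): the unrolled surface

print(X.shape, "->", X_2d.shape)
```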
Isomap is very interesting in that it is very transparent: it's very easy to know what it's doing. For example, this algorithm is popular for learning handwriting. When you give it pictures of handwritten numbers from 0 to 9, it can learn the characteristics of the numbers, and then it's very easy to understand what it learned. There's a database called MNIST of handwritten numbers, and each of them is a 28 by 28 pixel drawing of a number, so each image has 784 features. Isomap can reduce that to two features, and when we chart it, as we see here, it's easy to understand what those reduced features are telling you. For example, the horizontal axis tells us that those numbers vary in bottom loop articulation: if you look at the numbers towards the left, you will see that the bottom of the two is more like a straight line, and as we go rightwards, the loops at the bottom of the two become more curved. So the Isomap algorithm is learning the curves as a potential dimension of interest. The vertical dimension captures top arch articulation: at the top of that dimension we have twos with a very straight top, where the top is more like a line, and as we go downwards the upper part of the two becomes more curled. As you can see, the dimensions that this algorithm learned were essentially curves, not lines, and that's how it measures distances between objects. This is very useful for objects that have curves, like handwriting.
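A small sketch of that kind of application, assuming scikit-learn; its bundled digits set is 8 by 8 pixels (64 features) rather than MNIST's 28 by 28 (784 features), but the idea of reducing each image to two features is the same.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()                      # 1797 small digit images, 64 features each
X, y = digits.data, digits.target

iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)                 # one (x, y) point per image

# Plotting X_2d colored by y would show the digits arranged by shape,
# for example how looped or straight their strokes are.
print(X.shape, "->", X_2d.shape)
```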
There's another type of dimensionality reduction that we will use, which is t-SNE, or t-distributed stochastic neighbor embedding. We're going to use it in our next video, where we look at word2vec embeddings. Let's say you have something with 300 dimensions: you have the description of a word in 300 features, and you want to display that in two dimensions. Maybe your clusters are going to look like they do here, where you have a small cluster in the center and then a larger cluster around the outside. It is very difficult to find a single straight line that tells us how to separate these clusters; you need something more like a circle.
What t-SNE does is calculate the probability of two points being close together in the higher-dimensional space, and then try to recreate that probability of closeness in a lower-dimensional space. So if two points are very close in the high-dimensional space according to that probability distribution, it's going to try to make them very close in the two-dimensional layout as well. How does it do it? With the same trick as expectation maximization: it draws a Gaussian, or normal, distribution around every point and calculates the probability that one point and another are close to each other under that normal distribution. Very close points will have a high probability; points that are very far away will have a lower probability. It creates the probabilities using a Gaussian; however, when it lays the points out in the low-dimensional plot, it uses a different distribution called the Student's t-distribution.
It's basically a flatter distribution with fatter, longer tails. This makes it so that things are not as crowded in the two-dimensional chart but are more spread apart. So t-SNE takes the probabilities of closeness from the high-dimensional space and recreates those probabilities in two-dimensional space, and in doing that, it approximately preserves the original distances from the higher-dimensional space. For example, the relationship between the words drinking and drank is probably similar to the relationship between swimming and swam, and that relationship stays roughly the same in the two-dimensional t-SNE chart as it was in the original 300-dimensional space. So you're going to get a chart that looks like this, where you can see that drinking and drank are roughly the same distance apart as swimming and swam. We're going to take advantage of this in our study of word2vec.
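Here is a minimal sketch of running t-SNE on 300-dimensional vectors, assuming scikit-learn; the random vectors are stand-ins for real word embeddings such as those from word2vec.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 300))    # 500 "words", 300 features each

# Perplexity roughly controls how many neighbors each point pays attention to.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords_2d = tsne.fit_transform(embeddings)  # shape (500, 2), ready to plot

print(coords_2d.shape)
```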
So, in summary for these two optional videos, we looked at some issues in unsupervised learning, particularly in clustering, and we looked at how to reduce dimensionality. Some algorithms reduce dimensionality in a linear fashion, like principal component analysis, turning a three-dimensional object into two dimensions with straight lines. There are nonlinear algorithms like Isomap, which takes curves and unfolds them into two-dimensional space. And there are algorithms like t-SNE that take the probability of two points being very close together in, say, 300-dimensional space and recreate that probability in two-dimensional space. Notice that all of these depend on how you measure distances between documents, how many features you give the system to learn from, and how you determine the edges between potential clusters.
Next up we're going to look at a very
useful system to describe the meaning of
words called Word2vec.
