This is an optional video about
unsupervised learning, and in particular
about different types of clustering.
In the last couple of videos, we've been looking at documents, at features of documents, and at how we can use those features to cluster documents together. We've looked at documents that can be very short, like descriptions of sushi and Hanover, and some longer ones, like Shakespeare plays. But in all cases, all we've given the computer is the input and a way to measure distance between documents, which comes from the features. We give it a way to measure distance and then tell it to try to find which documents are more similar and which are more dissimilar, and in doing that, the computer can find clusters of similar data points.

We could do this with web pages, for example; we could do this with entire books; we could even do this with entire languages. If we have features like the similarity of words across languages, we can find that some languages are closely related, like French and Spanish, and some languages are related but more distantly, like Spanish and Hindi or English and Hindi. That's within Indo-European, which is our own language family. There are many other language families; the Austronesian one, for example, includes languages like Hawaiian, Samoan, and Tagalog from the Philippines. Those languages are all related, but Hawaiian and Samoan are more closely related, because their features, the words, are more similar than the words shared between Hawaiian and Tagalog.
So in all these cases, we need to figure out how to provide features, how to calculate similarity, and what kind of clusters we are expecting to find. Are we going to find many? Are they going to be well distinguished and separated? These are questions we need to ask ourselves when we try to perform unsupervised learning in general and clustering in particular. Are our clusters well separated, so that the computer can clearly find them, or do they overlap, with regions where it's difficult to determine whether something belongs to cluster A or to cluster B? How many features are we going to feed into the unsupervised learning system? If we give it a thousand features, the data might be very rich, but the computer might have a very hard time finding distinct clusters in the data, so we need to be smart about what kind of features we feed it.
Also, we're going to need to figure out how to measure distances between two objects: how does the computer decide whether they're similar or not? We've been using a kind of linear distance, where we measure the similarity between two objects as a straight line through three-dimensional space, for example.
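As a minimal sketch of that idea, here is a straight-line (Euclidean) distance between two documents in Python; the three feature values are made-up placeholders, not real tf-idf numbers.

```python
import numpy as np

# Each document is a point in feature space; here, three
# illustrative features (made-up values), e.g. sushi, Hanover, origami.
doc_a = np.array([0.8, 0.1, 0.0])
doc_b = np.array([0.7, 0.0, 0.2])

# Straight-line (Euclidean) distance: the length of the line
# segment connecting the two points in three-dimensional space.
distance = np.sqrt(np.sum((doc_a - doc_b) ** 2))
print(distance)  # a smaller distance means more similar documents
```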
In the next video we're going to look at trajectories that are not straight lines, for example curves across spheres. In this video we're going to focus on the first two questions: what happens if our clusters overlap, and what happens if we have many dimensions? How do we reduce those dimensions to the ones we really need?

Let's start with the overlap. We looked at an algorithm called k-means, which performs really well when the clusters are separate, when we have a situation like the one shown here.
There are three separate clusters, and all of the data points sit in their own region of space. There's no case where, for example, a dot from one cluster is intermixed with the dots from another region. Also, if we placed a center in each of the clusters, it's very obvious that the center would be very close to all the data points in that cluster and very far away from the data points in the other clusters.
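As a rough sketch of how k-means behaves in that easy case (not the exact implementation from the earlier video), here is a minimal version on made-up two-dimensional points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: three well-separated blobs of 2-D points.
blob_centers = np.repeat([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]], 30, axis=0)
points = blob_centers + rng.normal(scale=0.5, size=blob_centers.shape)

k = 3
centers = points[rng.choice(len(points), size=k, replace=False)]  # random start

for _ in range(10):  # repeat until the centers settle
    # Assignment: each point goes to its nearest center (a hard border).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each center moves to the mean of its assigned points
    # (for simplicity, this sketch assumes no cluster ever goes empty).
    centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centers)  # should land near (0, 0), (5, 5), and (0, 5)
```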
But what would happen if we have an overlap?
For example, consider the distributions we have here. We have one dimension, just the position along this x-axis, and we have potentially two clusters: the red dots and the blue dots. The red dots have their center probably around 3, because that's where most of the red dots are, and the blue dots have their center somewhere around 5 or 7, because that's where we find most of the blue dots. So clearly there are two things here, and there could be two clusters, but look at the region around 4. There's a region there where blue and red dots intermingle, so it's not easy to set up a single border between them that clearly separates the blue dots from the red dots. In this case, k-means is going to have problems separating these two clusters. But that's okay; maybe we don't need some sort of edge between them.
What we need to figure out is the probability of a dot being blue or a dot being red. There is an algorithm called expectation maximization that does exactly this. Here we have the same dots, but all of them are colored purple to reflect the fact that when the computer gets the dots, it doesn't know which one is which. We start the same way we did with k-means: we drop two centers at random, wherever. And as you can see, the centers here are not very good: the center for the putative red cluster landed at around 1, and the center for the possible blue cluster landed around 9, somewhere over there. They don't really match any of the data points.
The next thing we do is calculate the probability of each dot belonging to a certain cluster. Let's focus on the blue cluster. We have the center of the blue cluster at around 9, where we placed it at random, and we project a probability distribution around it, something like a normal curve: it rises toward the center and then drops off, like a little bell curve. When we draw that distribution, the maximum of the probability is at the center, and we get probabilities for the other parts of the axis. For example, the dots at 8 have a very high probability of belonging to the blue cluster, because the curve is very high in that region, and the dots around 7 also have a fairly high probability of belonging to the blue cluster. Notice that the curve has its maximum there and then descends, so by the time it reaches 2, for example, it's essentially zero. A dot at around 2 or 1 therefore has a very low probability of belonging to the blue cluster. This is the expectation phase.
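A minimal sketch of that expectation step, assuming Gaussian bell curves with a fixed, hand-picked spread (the video doesn't pin down the exact parameters):

```python
import numpy as np

points = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0])
centers = np.array([1.0, 9.0])  # the two randomly dropped centers
sigma = 1.5                     # assumed spread of each bell curve

# Height of a bell curve centered on each cluster, at each dot.
density = np.exp(-((points[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

# Expectation step: normalize so each dot's probabilities sum to 1.
responsibility = density / density.sum(axis=1, keepdims=True)
print(responsibility)  # column 0: P(red cluster), column 1: P(blue cluster)
```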
The next thing we do is try to maximize the probability of finding the actual center of the cluster. We do this as follows: we take the old center, which for the blue cluster is around 9, and we calculate the distance between that old center and every other data point: the points at 8, the point at 6, the points at 2. But we multiply each of those distances by that point's probability. So, for example, a dot at 8 is going to have a very high probability, so the multiplication is going to give a larger result, and that dot will therefore make a high contribution towards finding the average position of the new center. On the other hand, we also calculate the contribution of a dot around 2, but because the distance for that dot is going to be multiplied by a number that is very close to zero, that dot is going to make a very small contribution towards the new center of the blue distribution. The dots around 8 make a very high contribution; the dots around 2 make a very low one. In doing so, we move the old center to this new position, and then we try again and again, like we did with k-means.
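Continuing the sketch above (reusing points and responsibility), the update described here works out to a probability-weighted average of the positions:

```python
# Maximization step: each dot pulls a center toward itself in proportion
# to the probability that it belongs to that cluster, which amounts to
# a probability-weighted average position.
new_centers = (responsibility * points[:, None]).sum(axis=0) / responsibility.sum(axis=0)
print(new_centers)  # both centers move toward the dense regions of dots
```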
Notice that what we did was take the points and calculate the probability that they belong to a certain cluster; we never draw a hard edge between them. The figure shown here shows five iterations of this process, and as you can see, the centers do seem to approach the apparent centers of the red dots and the blue dots. There are also some areas where it is difficult to tell whether a dot belongs to the reds or to the blues. For example, if a dot is around 5, both distributions have about the same probability, so we're not really sure what would happen to a dot there. But that's okay; as we know, learning algorithms do not have to be perfect for them to work well. So this is a solution in case our clusters have high degrees of overlap: we don't use the distance on its own, we calculate the distance and then compensate with the probability. This algorithm is called expectation maximization.
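Putting the two phases together, here is a self-contained sketch of the loop; for simplicity it keeps the spread fixed and the clusters equally weighted, whereas full expectation maximization also updates those:

```python
import numpy as np

points = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0])
centers = np.array([1.0, 9.0])  # random starting centers
sigma = 1.5                     # assumed, fixed spread of the bell curves

for _ in range(5):  # the figure shows five iterations of this process
    # Expectation: probability of each dot under each bell curve.
    density = np.exp(-((points[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))
    responsibility = density / density.sum(axis=1, keepdims=True)
    # Maximization: probability-weighted average position of the dots.
    centers = (responsibility * points[:, None]).sum(axis=0) / responsibility.sum(axis=0)

print(centers)  # approaches the apparent centers of the two groups
```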
Let's take a brief look at the second problem: having many dimensions. We looked at descriptions of documents that had three dimensions, for example sushi, Hanover, origami. But what if our description had 300 dimensions, the presence or absence of 300 words? We would then need to calculate distances and clusters across a 300-dimensional space, and it might be very difficult to find cohesive clusters at that kind of dimensionality. We probably need to reduce the dimensions, to figure out which dimensions are not contributing a lot, so that we can simplify our problem.
One way to do this would be to find dimensions, or features, that have no variance: features whose values are the same or nearly the same for every document. We've already seen one example of this. In our description of the Shakespeare documents, the word "good" has the same tf-idf value for every document: it's always zero, because the word appears in every single one of them. So one thing we could do is simply eliminate that feature, because it has no variance (as a matter of fact, it's always the same zero) and is therefore not contributing much to our analysis of the documents.
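As a small sketch, dropping a zero-variance column from a made-up tf-idf matrix might look like this (scikit-learn's sklearn.feature_selection.VarianceThreshold implements the same idea):

```python
import numpy as np

# Rows are documents, columns are word features (made-up tf-idf values).
# The middle column plays the role of "good": identical in every document.
tfidf = np.array([[0.5, 0.0, 0.9],
                  [0.1, 0.0, 0.4],
                  [0.7, 0.0, 0.2]])

variance = tfidf.var(axis=0)        # per-feature variance across documents
reduced = tfidf[:, variance > 0.0]  # keep only features that actually vary
print(reduced.shape)  # (3, 2): the constant feature has been dropped
```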
There are many other ways to reduce dimensionality. One popular approach is called principal component analysis. Let's imagine the following: say I have a plane like this in three-dimensional space, with our axis system starting somewhere around here. In order to describe the points along this plane, I need three dimensions: x, y, z for one edge of my plane, and x, y, z for another edge. So the description of this object is three-dimensional. However, maybe I could squash the three dimensions into two by describing the dots by their position along imaginary axes that I can draw here, and here. If I rotated this plane so that it looked like this, it would now have two dimensions: one dimension would be this one, the other would be that one, and I could describe the dots according to their positions along the new dimensions that I drew. So I had a three-dimensional object, but with the right rotation I can make it look like a two-dimensional object. These new axes that explain the variation of the dots are called principal components, and this technique is called principal component analysis.
So if we have an object like this, the greatest variation is that the dots go upwards and downwards: here, here, here, and so on. So we're going to draw our first principal component to describe that source of variation. There's another variation, where the dots spread laterally to that line, and that's how we're going to draw our second principal component. When we rotate them, this one describes the variation going up and down, and this one describes the lateral variation.

Here is another visualization of the exact same thing. Imagine you have something that looks like a CD, or like a tortilla, sitting like this. Again, the dots lie along a three-dimensional object, but if we rotate it the right way, it looks two-dimensional, and now you can describe the positions along that CD or tortilla according to the new two-dimensional system. Each of those dimensions would be a principal component, and this is a good way to reduce dimensionality: you could reduce hundreds of dimensions, potentially, to two or three.
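As a minimal sketch of that rotation, here is scikit-learn's PCA applied to made-up points that lie close to a tilted plane in three dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 200 points on a tilted plane in 3-D plus a little noise,
# so almost all of the variance lives along just two directions.
u, v = rng.normal(size=(2, 200))
points_3d = np.column_stack([u, v, 0.5 * u + 0.2 * v])
points_3d += rng.normal(scale=0.01, size=points_3d.shape)

# "Rotate" the data onto the two directions of greatest variation.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(points_3d)

print(points_2d.shape)                # (200, 2): three dimensions squashed to two
print(pca.explained_variance_ratio_)  # nearly all the variance is kept
```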
This is a concrete example with gene expression. We have three dimensions, the expression of certain genes, and again the data forms an object that looks like a plane, so when we rotate it we get the positions of the dots along the rotated principal components 1 and 2. We've simplified a three-dimensional problem to a two-dimensional problem, and that makes it easier for clusters to emerge in our data, because we've eliminated the dimensions where there wasn't a lot of variance and kept the dimensions with the most variance.
This is a partial summary of what we have so far. There are many algorithms for unsupervised learning and clustering, and we're only going to see a very few examples. We've looked at k-means, which is very good when clusters are easy to separate: it just drifts around the space looking for places where it can anchor a cluster. Sometimes the clusters are going to overlap, in which case we're going to need probabilistic algorithms like expectation maximization, which drifts around the space but doesn't draw a hard border; instead, it draws a probability distribution of a certain dot belonging to a cluster. We also have the problem of dimensionality, where it might be difficult for a program to find clusters if there are too many dimensions. There are algorithms to reduce dimensionality, some of them linear, like PCA: if you remember what we did with principal component analysis, we made new axes that looked like lines, and even after the rotation they are straight lines. In the next video we're going to look at nonlinear dimensionality reduction and some popular algorithms to turn, for example, 300-dimensional objects into two-dimensional objects.
