Let's continue our look at support vector
machines, and see how they would deal
with overlaps in their groups, and with
nonlinear boundaries between the groups.
In the last video we looked at support vector machines. They are a type of supervised learning, but they're very similar to the unsupervised clustering techniques that we saw a couple of weeks ago: they take a vector of features and try to find similarities in the features to see which things belong together. What the support vector machine actually does is that we tell it what the clusters are. We give it a group of features and tell it: these features are a good review, these other features are a bad review, and these are the clusters you should be looking for. Tell me what makes them similar, and make sure that you can separate this cluster from that cluster over there. The way it does this is by projecting a line of maximum separation between this group here and that group there, the two clusters. So it tries to find the line, or the plane, that separates the two clusters as widely as possible in n-dimensional space. As you can see here, we can separate them with a line, but we can also separate them with nonlinear shapes, and this is what makes support vector machines a very powerful technique.
It's also easy to deal with overlaps in the groups. So let's look at both of these quickly, starting with the overlaps. Let's go back to our one-dimensional example. As you remember,
this is one dimension: just the number of times we see the word greatest in a document. It's a one-feature, one-dimensional description of a document. The red dots are the bad reviews; for those, the word greatest appears zero times, one time, two times, and so forth. The green dots are the good reviews, and they have the word greatest eight times, nine times, ten times, for example.
In the last video, we had two clusters that were very clearly separated, and so we drew a line right in the middle. But how about this one, where one of the bad reviews is on the other side? This is a bad review that has many occurrences of the word greatest. Maybe
because it's the, you know, greatest
fiasco in the history of the greatest
theater in the greatest financing of the
greatest place ever. So it's a bad review
that has the word greatest many times, so
many in fact that it resembles the good
reviews. So we were saying that the support vectors are like the edges of each of the clusters. The edge of the bad examples would be this really strange example that has greatest too many times, because the edge of the red group is now here, on the side of the green documents, while the extreme of the green documents is here. We would then have the yellow line as the center line separating the edges of the red group and the green group. And of course this line is kind of silly; it's not the maximum separation between the groups. It gives very little separation between them, and
if you get any green blobs that have
different occurrences of the word
greatest, they're gonna be misclassified
as belonging to the red group, for
example. So obviously, we want to try to
find the maximum separation between the
two categories. We're gonna have
something called soft
margins. A soft margin is allowing for
misclassifications,
allowing for some of the dots to be on
the wrong side every now and then. So if
instead of using the yellow line as the
separation between the two clusters, we
chose the orange line in the middle, we
would now be misclassifying one of them.
The red dot at the edge, on the right side, would be misclassified as a good review. This is a mistake, but this
is okay. We make this sacrifice in order
to get maximal separation between
the clusters. So because we allow
mistakes every now and then, we call this
a soft margin. If you've taken a machine learning course, you'll remember that this is the lambda factor: with a larger lambda you allow for softer margins, and with a lower lambda you enforce harder
margins. If you have not taken machine learning, the idea is basically that you try to find the maximal separation between the clusters even if it means having a few dots on the opposite side. So by softening the margins, we can still get a maximal amount of separation between these two blobs.
What's really interesting is when we cannot use a line to separate our blobs. The situation we have on the left is the one that we've been discussing: you can draw a line, and every now and then a triangle is going to be on the circle side, on the blue circle side, and so it's going to get misclassified as a blue circle. We make that sacrifice, but the two groups are still clearly separate, and they can be separated by a line. That's not true for the case on the right: as you can see, there's no line that could go across this chart and clearly separate the red triangles from the blue circles. However, what you can do is draw
something that's not a line, maybe a
circle around the blue dots. A circle
would clearly separate the dots from the
red triangles around it. This is of course
nonlinear because it's not a line and
that's fine. We're gonna use different
functions, for example the RBF or radial
basis function, which doesn't try to draw a
line to separate the blobs. It tries to
draw a circle emerging from the center
of the blobs, and tries to get these
circles to be as big as possible without
touching. So they start at the center and
then they project and become as large as
they can, covering for example the purple
dots, but also not touching the circles
that would circumscribe the light blue
dots, the other blobs. You can have many
types of functions to describe the
border between the clusters: a linear
function, for example, separates them with
a line; you can have a polynomial
function like the ones you studied in
high school that have curves, and so you
can have a curve-like separation between
two clusters; or you can use the radial basis function to project blobs from the center of the categories and have these as the edges, as maximal edges, in a circle-like shape.
And one neat thing is that we can decide
how aggressive we are going to be.
There's a parameter called gamma: if we set gamma to lower values, one for example, the blobs are going to project as widely as possible, to the point where the blobs for each of the categories, for each of the labels, practically touch each other. If we set gamma to higher values, the edges of the clusters are going to start receding towards the data points. And if we set a gamma value
that's too high, we might even have edges
that are too tight around our data
points so they would not allow for
generalization. If you look at the
example on the left, with gamma equals 1,
you can still get a new green dot that
is, for example, here, in a position not previously occupied by green dots, but we would expect it to be clustered with the green dots. With gamma equal to 20, you do not have that region covered by the edges of the blob, of the cluster, so there would be no way to classify it as green.
Summary so far: support vector machines allow you to classify inputs into clusters. We tell the computer what clusters we're looking for, and then the computer tries to find the maximal separation between the clusters in n-dimensional space. You can
separate them with lines, you can separate
them with blobby circles, you can
separate them with polynomial equations,
but it's gonna try to find the maximum
separation between them, so that it can establish a decision boundary: if something falls here, then it's a green dot; if something falls there, then it's a red dot. And that way it manages to perform classification.
In the next few videos, we're going to look at a concrete example: classifying parts of speech.
