So when we've got real values--
and this is sort of
a primer for the boot
camp, a reminder
for those of you
who've been out of math
classes for a while--
when we've got continuous
data, purely continuous data,
we will often use Euclidean
distance as our distance --
really as a way of measuring
dissimilarity rather than similarity,
because it gets higher the
more unlike the objects are.
So this formula
might be a little
intimidating to some people.
But I promise you that you
are familiar with Euclidean
distance.
You just maybe
don't know the term.
Euclidean distance
is what you'd hear
called a distance formula,
just the distance formula,
in your high school
algebra classes.
And most people have seen it in
two dimensions, and sometimes
three.
But one of the very nice things
about the Euclidean distance
is that it generalizes very
naturally to as many dimensions
as you want.
So to calculate the
Euclidean distance between two
data objects, we take the
difference in each attribute
value, square each difference, sum
those squares, and take the square root.
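As a quick sketch in Python (the function name here is just illustrative), that calculation looks like this:

```python
import math

def euclidean_distance(p, q):
    """Difference in each attribute, squared, summed, then square-rooted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Works in two dimensions, like the high-school distance formula...
d_2d = euclidean_distance((0, 2), (2, 0))   # sqrt(8), about 2.828
# ...and generalizes naturally to as many dimensions as you want.
d_5d = euclidean_distance((1, 0, 3, 2, 4), (0, 0, 1, 2, 6))  # sqrt(9) = 3.0
```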
So for instance, we
have four points here, plotted
at (0, 2), (2, 0), (3, 1),
and (5, 1).
And we can construct
a distance matrix
describing how dissimilar
all of our points are.
So p1 and p4 are the
most dissimilar --
they're the farthest
apart -- whereas p2 and p3
are the most similar --
they're the closest together.
p3 is also fairly
similar to p4,
whereas p2 is somewhat
less similar to p4.
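Building that distance matrix for the four points is straightforward to sketch in Python (the helper name is illustrative):

```python
import math

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1 through p4

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Pairwise dissimilarities, rounded for display
matrix = [[round(euclidean(p, q), 3) for q in points] for p in points]
# p1 and p4 are farthest apart (5.099); p2 and p3 are closest (1.414)
```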
So another measure
that we see, particularly
in the context of documents,
is called cosine similarity.
So we have documents.
We have turned them
into term vectors.
We can find how similar
the two documents are --
and cosine similarity is
a measure of similarity,
not of dissimilarity --
by thinking of each of them
as vectors, taking their dot
product--
which, for those
of you who never
had it or don't remember
your college vector
calculus classes--
you take each attribute,
attribute by attribute,
and you multiply them
together across your two
different objects.
So 3 times 1, 2
times 0, 0 times 0.
Maybe this is play and this is
coach and this is tournament.
And so we'll do our
count, and then we'll
multiply them all together
document to document,
and sum that all up.
And then we end up dividing by
the product of the magnitudes.
The magnitude of a vector is
just: square each attribute,
add them all up, and take the
square root; then we multiply
the two magnitudes together.
So in this case we have
a dot product of 5.
We have magnitudes for d1 and d2
of 6.481 and 2.449.
So we multiply those two
together and divide 5 by the result,
and we end up with a
cosine similarity of 0.315.
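Putting that together in Python: the term counts below are an assumption, chosen so they reproduce the numbers above (a dot product of 5 and a cosine similarity of about 0.315); the actual vectors on the slide may differ.

```python
import math

def cosine_similarity(d1, d2):
    # Dot product: multiply attribute by attribute, then sum
    dot = sum(a * b for a, b in zip(d1, d2))
    # Magnitudes: square each attribute, sum, take the square root
    mag1 = math.sqrt(sum(a * a for a in d1))
    mag2 = math.sqrt(sum(b * b for b in d2))
    return dot / (mag1 * mag2)

# Hypothetical term-count vectors for two documents
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
sim = cosine_similarity(d1, d2)  # about 0.315
```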
Cosine similarity is a really
nice metric for documents
because it gives us this very
clean 0 to 1 measurement that
suffers less from the
curse of dimensionality
than something like
Euclidean distance does.
Document vectors tend to get
very, very long, because a language
has a lot of different words
and a given document may
contain many of them,
so cosine similarity
is a way to avoid some of
the curse of dimensionality.
And we'll talk
about this more when
we talk about encoding documents
more directly in the boot camp.
