So, the next section is
similarity and dissimilarity.
I'm going to blast through
similarity and dissimilarity
to get to data exploration
and visualization.
If we have to cut data
exploration and visualization
a little short, we will--
because we're going to talk
about it a lot on--
I think it's Tuesday when
we do the-- on Monday, when
we do the introduction
to our lab webinar.
So, similarity and
dissimilarity are
exactly what they sound like.
Similarity is a
numerical measure
of how alike two
data objects are.
It is higher when
objects are more alike,
and you usually set
it up so that it falls
in the range between 0 and 1.
Dissimilarity is a numerical
measure of how different
two data objects are.
It's lower when
objects are more alike.
The minimum dissimilarity
is 0 almost always.
The upper limit varies,
based on the exact metric
you're using, but is often 1--
just sometimes higher.
And we'll use the term proximity
to refer to either how similar
or how dissimilar objects are.
So objects that
are close will have
a high similarity or
a low dissimilarity
and objects that are far
apart will have a low similarity
and a high dissimilarity.
Now, in the context
of data matrices
of very nice numeric
data, we can use
something very straightforward.
We can use simple
distance formulas.
But, in the context of
categorical data,
nominal or ordinal
data, we often
need to use different things.
So, nominal data--
we usually use
this sort of binary similarity,
dissimilarity measurement,
where dissimilarity is 0 if the
two values are the same and 1
if the two values are
different and exactly
the reverse for similarity.
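
As a minimal sketch of the nominal case just described, here is how
the 0-or-1 rule might look in Python. The function names are
hypothetical, chosen only for illustration.

```python
def nominal_dissimilarity(a, b):
    """Dissimilarity for nominal values: 0 if they match, 1 otherwise."""
    return 0 if a == b else 1


def nominal_similarity(a, b):
    """Similarity is exactly the reverse: 1 if they match, 0 otherwise."""
    return 1 - nominal_dissimilarity(a, b)


# Example with made-up color labels.
print(nominal_dissimilarity("red", "red"))   # same value
print(nominal_dissimilarity("red", "blue"))  # different values
print(nominal_similarity("red", "blue"))
```

There is no notion of "how far apart" here, only whether the two
values match.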
For ordinal values, we'll
often-- for dissimilarity,
we'll often map our
values to integers
from 0 to n minus
1 and then take
the difference in
those integer values
and divide by the maximum difference, n minus 1,
giving us a nice measurement
between 0 and 1.
For similarity, we take 1 minus
the dissimilarity measurement.
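
The ordinal recipe above can be sketched roughly as follows. The
function names and the example rating scale are assumptions made up
for illustration; the caller supplies the ordered list of levels.

```python
def ordinal_dissimilarity(a, b, levels):
    """Map levels to integers 0..n-1, then scale the gap into [0, 1].

    `levels` is the full list of ordinal values, ordered lowest to highest.
    """
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: i for i, level in enumerate(levels)}  # 0 .. n-1
    return abs(rank[a] - rank[b]) / (n - 1)


def ordinal_similarity(a, b, levels):
    """Similarity is 1 minus the dissimilarity."""
    return 1 - ordinal_dissimilarity(a, b, levels)


# Hypothetical four-level rating scale.
levels = ["poor", "fair", "good", "excellent"]
print(ordinal_dissimilarity("poor", "excellent", levels))  # opposite ends
print(ordinal_dissimilarity("fair", "good", levels))       # adjacent levels
print(ordinal_similarity("fair", "good", levels))
```

Dividing by n minus 1 is what keeps the result between 0 and 1,
no matter how many levels the scale has.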
So that's really the reason why
I'm talking about similarity
and dissimilarity separately
rather than just sort
of talking about one of them--
is that there are
some data types--
like, say, nominal data types--
where similarity is a
very natural measurement.
It's 1 if the two are the same
and 0 if the values are not
the same--
whereas for something
like ordinal variables,
the measure of dissimilarity is
an easier and more natural
way to think about it.
We think about how far
apart the variables are
and not how close
together they are.
And similarly for interval
or ratio variables--
we'll often just take
the absolute difference
between the variables as a
measure of dissimilarity,
and then we might do some
slightly more complicated
things to calculate
a similarity.
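
For interval or ratio values, a rough sketch might look like this.
The dissimilarity is the absolute difference described above. The
similarity formula, 1 / (1 + d), is only one common choice and is
an assumption here, since the lecture does not say which
transformation it has in mind. The function names are hypothetical.

```python
def interval_dissimilarity(x, y):
    """Dissimilarity for interval/ratio values: the absolute difference."""
    return abs(x - y)


def interval_similarity(x, y):
    """One possible similarity (an assumption): 1 / (1 + d).

    It equals 1 when the values are identical and shrinks toward 0
    as they move farther apart.
    """
    return 1.0 / (1.0 + interval_dissimilarity(x, y))


print(interval_dissimilarity(3.0, 7.5))
print(interval_similarity(2.0, 2.0))  # identical values
print(interval_similarity(3.0, 4.0))  # one unit apart
```

Other transformations, such as rescaling the difference by the
observed range and subtracting from 1, would work too; the right
one depends on the data.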
