Hi. Mitch Wenger back with another video on 
data analytics and machine learning.
In this video, we'll discuss similarity and 
distance functions.
Hope you enjoy.
okay let's get started
similarity and distance
similarity is at the core of
many data mining methods that we
will discuss with the
notion that if two things are
similar in some ways
well maybe they share
other attributes as well that we're
interested in and
distance is a common
way of looking at
similarity
all right so there are lots of different
business cases where we use
similarity and distance
classification and regression
these are two that we'll hit on time and time
again but other
approaches use it as well grouping
how do we group things into clusters or into
other naturally occurring groups that we can
recognize
providing recommendations so we've
all been on Amazon
Netflix etcetera where
we see recommendations for
things we should buy things we should
watch and people we
should reach out to who
might become connections
also we
use similarity to
reason from similar
cases and we use this all the
time in medicine and law we do
it naturally as humans and we can
do it from a data mining
perspective as well
so we can represent many
things as a set of data
values
once we have done that represented them as a
set of data values then we can
display it as a point
in some data space and this
data space can have n
dimensions one dimension
two or many
dimensions depending on the number of
features that we're evaluating
once we have that data point
defined we can compare it to other
data points mathematically
and this is the basis for nearly all
data analytics
many techniques basically take the view
that data points in the same region
should be similar
and from there we can use all kinds of
different techniques to draw boundaries
between these groups
after that we use those boundaries
to make predictions for new
values where we don't yet know
the classification
now the technique could be a decision tree
it could be a linear regression line
or any number of other linear
classifiers
we use the distances
between those data objects to
determine that similarity
one of the most commonly used distances
is what we call Euclidean distance
and you can see it represented in this
diagram by the green line it's
simply the straight path from one
point to another
now note that we represent it in two 
dimensions
here but Euclidean distance
can be calculated no matter how many
dimensions are in the dataspace
to calculate Euclidean distance we take the
square root of the sum of the
squared differences along all
dimensions however many dimensions we 
happen to have
so that's the difference in values
for every feature we want to
include in our model
so in essence we're generalizing the
Pythagorean theorem across all
dimensions
now Euclidean distance is probably the
most common distance function in
use although it's not necessarily
considered the most robust
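To make that concrete, here's a minimal Python sketch of the calculation just described; the function name and the sample points are my own illustration, not from the video:

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared differences along every dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points with three features each (illustrative values)
p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)
print(euclidean_distance(p, q))  # 5.0
```

The same function works for any number of dimensions, since `zip` pairs up however many features the points happen to have.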
so that takes us to the next similarity
function which is Manhattan distance this
is considered more robust than Euclidean
it's calculated as if you have to traverse the
actual data points to get from one
to another much like you would navigate when 
walking
in Manhattan thus the name you
can't just walk through all the buildings to
get from the Empire State Building to the 
Intrepid
you have to go so many short blocks
and so many long blocks
so that's the concept we're using here as you 
can
see we've got a red blue
and yellow line each taking
different paths to point B
but in each of those cases the distance
ends up being the same we're going so many
blocks one way and so many blocks
the other way so you can see that
calculation is simply the sum
of the absolute differences between the 
measures
and again it's for however many
dimensions or features that we
happen to have in our model
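That block-by-block idea translates directly into code; here's a small Python sketch, where the function name and sample points are illustrative rather than from the video:

```python
def manhattan_distance(a, b):
    # Sum of the absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(a, b))

# Walking three blocks one way and four blocks the other
p = (1, 2)
q = (4, 6)
print(manhattan_distance(p, q))  # 7
```

Whichever path you trace between the two points, the total block count, and so the distance, comes out the same.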
so Minkowski was able to
generalize this so the Minkowski
distance calculation generalizes
both Euclidian and Manhattan by
adding this q value into it
we could set q to one or two
depending on which approach we're going to use and
as you can see the calculation works out
either way if we set q to
one well then we're calculating the Manhattan
distance if we set q to
two we're calculating
the Euclidean distance again it's
the sum of the absolute differences for each of the
variables each raised to the power q
so with q equal to one we keep the absolute value
and with q equal to two we square it
and then we take that sum to the
power of one over q that is to the
power of one for Manhattan or the square
root for Euclidean
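Putting the pieces together, here is a sketch of the generalized Minkowski calculation in Python; the function name and sample points are my own assumptions for illustration:

```python
def minkowski_distance(a, b, q=2):
    # Sum of absolute differences, each raised to the power q,
    # then take the qth root of that sum
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

p1 = (1.0, 2.0)
p2 = (4.0, 6.0)
print(minkowski_distance(p1, p2, q=1))  # 7.0 -> Manhattan distance
print(minkowski_distance(p1, p2, q=2))  # 5.0 -> Euclidean distance
```

Setting q to 1 reproduces the Manhattan calculation and q to 2 reproduces the Euclidean one, which is exactly what makes Minkowski the generalization of both.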
all right so that's it for our
discussion on distance it's a relatively
straightforward concept but it's
important to understand as it's the
basis for many of the techniques we will 
explore
going forward
I hope you found this video useful be sure to 
check out
the other videos in this series
