Hi. Mitch Wenger back with another video on data analytics and machine learning.
In this video we'll discuss the use of nearest neighbors in predictive modeling. Hope you enjoy.
Okay, let's get started. How can we use distance and similarity in predictive modeling? Remember, predictive modeling is supervised learning. Say we're given an example whose target value we want to predict. We can scan through all the examples we have and choose several that are the most similar to our new example. Of course, the key is that for those several we select, we already know what the target variable is.
At that point we can predict the new example's target value based on those nearest neighbors' already-known target values. We can do that using classification, looking at those nearest neighbors and deriving a target class for our new example. We can also do it using a regression approach, in which case we derive the target value from the means or medians of those neighbors. Remember, in data mining we place examples, feature vectors that is, somewhere in our dataspace.
Now, if we have a new case we want to consider, as we just mentioned, we can calculate the distance between this new case and other cases whose target variable we already know. That's the idea behind what we call k-nearest neighbors.
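To make the distance idea concrete, here's a minimal sketch in Python; the feature names and values are hypothetical, just for illustration:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical beverage profiles: [hoppy, fruity, dry], each on a 0-1 scale
new_case = [0.8, 0.3, 0.5]
known_case = [0.7, 0.4, 0.6]
print(euclidean_distance(new_case, known_case))  # ~0.173
```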
Coffee, wine, beer, and liquor aficionados can have a field day with an approach like this. Think about all the characteristics people use to describe these beverages: aroma, nose, finish, chocolatey, earthy, hoppy, fruity, peaty, dry, smooth, bright. If you have a scale for each of those attributes, you can find new variations of your favorite beverage that are similar to the ones you like, out of the dozens, hundreds, or thousands of options you may have available.
In music, Pandora originated the Music Genome Project and used a similar approach to classify music based on a variety of attributes. Of course, when you listen to something and give it a thumbs up, they look for other music with similar attributes that they predict you might like as well. Newer services such as Spotify, Amazon Music, and the like have also incorporated this approach.
At its most basic, k-nearest neighbors uses a majority-rules approach: for a new feature vector, or case, we look at the k nearest cases in the dataspace to determine the value of the new case. Whatever the majority is, that's what we predict the new case to be.
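Here's a minimal sketch of that majority-rules idea, reusing the euclidean_distance function from the sketch above (again, purely illustrative):

```python
from collections import Counter

def knn_predict(new_case, training_data, k):
    """Majority class among the k training examples nearest to new_case.

    training_data is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training_data,
                         key=lambda pair: euclidean_distance(new_case, pair[0]))
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]
```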
Beyond that, there's a lot to consider. How many neighbors should we evaluate: one, three, five hundred? This is an issue to consider, and it should become part of the testing you go through when you create a new model; we'll talk more about that later.

Getting back to our example here: say we want to decide whether our new prospect will accept our offer, and we choose to evaluate the nearest five examples. We can see from this table that three are "N" while two are "Y", so we would predict that our new prospect will not accept our offer, and we may not send it to them.
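In code, that table lookup boils down to a couple of lines (the labels below mirror the table's three "N" and two "Y"):

```python
from collections import Counter

nearest_five = ["N", "N", "Y", "N", "Y"]  # labels of the five nearest neighbors
prediction = Counter(nearest_five).most_common(1)[0][0]
print(prediction)  # "N": we predict the prospect will not accept the offer
```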
Now, you might ask the question: why don't we just look at THE nearest neighbor, that is, the one nearest neighbor? Why not set k to one?
Well, we build these models with training data: examples we've collected where we already know the classification category. You can think of the examples you've read or heard about earlier, like Capital One, building all that training data.
Now let's take a look at these two charts. In the model on the left, we're using one nearest neighbor to predict, and as you can see we end up with many disjointed decision boundaries and a very specific model; as a matter of fact, we call this overfitting, and we'll talk about it more in future lectures. In the model on the right, we've set k to thirty, using thirty nearest neighbors, and here we end up with pretty clean decision boundaries and fairly well-defined decision spaces.
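If you want to experiment with this effect yourself, here's a minimal sketch using scikit-learn on a synthetic two-feature dataset (the dataset and the k values are assumptions for illustration, not the data behind these charts):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)  # synthetic 2-D data

for k in (1, 30):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Predicting over a grid traces out the decision boundaries
    xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
    regions = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # (you would plot `regions` with matplotlib to see the boundaries)
    print(k, "training accuracy:", model.score(X, y))
```

With k=1 the training accuracy comes out perfect, which is exactly the overfitting warning sign just described; with k=30 the boundary smooths out.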
All right, so you might ask: why not just evaluate all the neighbors? Make it five hundred, or N, whatever your N happens to be. In this case, that would result in a single classification space that's simply the category with the most cases, basically a majority. We would always predict the majority class: any new, unknown value we'd just predict to be that majority class. That's what we call a baseline predictor, and it's a starting point for determining how good a model might be. The problem is, we would predict all those values to be that majority value no matter where they ended up appearing on the chart. The same thing happens if we do a regression using all the neighbors: we would predict the average of all the target values.
So determining how many neighbors to use is part of the evaluation. One general rule of thumb: in practice we set k to an odd number, so that we minimize the possibility of a tie (at least for two-class problems). For k-nearest neighbors, k is what we call a complexity parameter. The value of k determines how simple (over on the right, a simpler model) or complex (over on the left, a very complex model) our model will be. More complex models have a greater tendency to overfit, specifically to the training data, which reduces their generalizability to the wider population of unknown data points we will encounter in the future.
The next question we want to consider is distance: does distance matter? In other words, should closer neighbors be more influential? This is definitely another consideration to build into your model building. How much more important should closer neighbors be? What type of weighting can you give them? Again, your decisions will affect the performance of the model.
In our previous example we evaluated the nearest five examples just looking for a majority, and we ended up predicting NO, our new prospect is not going to accept our offer. On the other hand, if we do let distance matter, we've already calculated those distances, so we can assign weightings based on them, that is, based on how near our comparison points are to the new example, and from there determine how much each will contribute to the prediction calculation. Here we can see that the first three examples each end up contributing more than thirty percent to the prediction, while the last two contribute less than one percent each: we basically took the distance weighting and turned it into a percentage, such that all the percentages add up to 1. So in this case we end up predicting "Y", yes, they will accept our offer. We'll talk a little more about that later.
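Here's a minimal sketch of that inverse-distance weighting; the distances are hypothetical, chosen so that, as in the table, the three closest neighbors each carry more than thirty percent of the vote and the two farthest carry less than one percent each:

```python
from collections import defaultdict

# Hypothetical (distance, label) pairs for the five nearest neighbors
neighbors = [(0.10, "Y"), (0.11, "N"), (0.12, "Y"), (9.0, "N"), (12.0, "N")]

weights = [1.0 / d for d, _ in neighbors]  # closer neighbors get larger weights
total = sum(weights)
shares = [w / total for w in weights]      # normalize so the shares sum to 1

votes = defaultdict(float)
for (_, label), share in zip(neighbors, shares):
    votes[label] += share
print(dict(votes))                # Y: ~0.66, N: ~0.34
print(max(votes, key=votes.get))  # "Y": two close Y's outweigh three N's
```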
This weighting concept also works for lots of other prediction approaches. You've probably used it in regression already, and it works for probability estimation too. It brings a scoring approach to the process, which reduces the importance of determining exactly how many neighbors to select.
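For instance, the same weights can drive a nearest-neighbors regression, predicting a weighted average of the neighbors' numeric target values; a minimal, purely illustrative sketch:

```python
def knn_regress(distances, values):
    """Average of neighbor target values, weighted by inverse distance."""
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical neighbor distances and target values; close neighbors dominate
print(knn_regress([0.10, 0.11, 9.0], [100.0, 120.0, 40.0]))  # ~109
```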
Now, there are some considerations we need to discuss regarding k-nearest neighbors as well. The first one is intelligibility. In fields such as medicine and law, case-based reasoning, which is similar to k-nearest neighbors, is used all the time; we certainly have some parallels with the legal profession, so k-nearest neighbors might make sense in certain situations. Another aspect of intelligibility is that decisions are relatively easy to explain using k-nearest neighbors, based on the contributions of those neighbors. Most people understand this approach intuitively, although how you phrase and present the representative neighbors should be part of your strategy. For example: "The movie Captain Marvel was recommended to you based on your interest in The Avengers, Spider-Man, and Wonder Woman."
All right. Now, in some situations k-nearest neighbors may not be such a good idea. For example: "You're a lot like these deadbeats, the Smiths and the Joneses, who defaulted on their loans." In that case you might want to think of a different way of phrasing it. As a matter of fact, there may even be some legal restrictions on which types of models can be used for various things, calculating credit scores for example.
The next consideration is dimensionality, and we're going to talk about that more in just a minute, so I'm going to skip it for now and move right to computational efficiency. Computational efficiency is another thing you want to consider in your models, depending on how quickly you need to be able to build or refresh a model and/or how quickly you need to make the decision the model will be used for.
Now, k-nearest neighbors is great for model building. It's called a lazy learner because there's really little, if any, computational effort when building the model: you just throw in any new data points whose target values you know, and there you have it. The only real learning done during model building is determining what your value of k is going to be. On the other hand, k-nearest neighbors is not very efficient at decision time, because any time you get a new data point you have to go out and query the dataspace to find all the nearest values in order to make your prediction, and that becomes expensive. When we say expensive from a computing standpoint, we mean using more computing resources on the server; it's generally going to take longer, and that means something else can't be done.
When we add distance weighting to the k-nearest-neighbors calculation, that increases the computational expense even more. So k-nearest neighbors isn't necessarily a preferred technique for applications that require extremely fast decision making. Tools have been making quite a bit of headway in speeding up this decision-making process, but it's still something you want to keep an eye on when you consider k-nearest neighbors.
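One common speedup is to index the training data in a spatial structure such as a k-d tree, so a neighbor query doesn't have to scan every point; a minimal sketch with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))      # 10,000 synthetic training points, 3 features

tree = KDTree(X)                 # pay the indexing cost once, up front
dist, idx = tree.query(rng.random((1, 3)), k=5)  # fast 5-nearest lookup
print(idx[0])                    # indices of the 5 nearest training points
```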
Now let's get back to dimensionality. When we're using nearest neighbors we generally consider all the numeric features, and that means we can have a scaling problem, because different features often use widely different value ranges. That causes a computational problem: some features are naturally weighted more heavily just because of the range of their scale. We can address that problem by standardizing or normalizing those values.
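A minimal sketch of that rescaling step, here z-score standardization with scikit-learn (the feature matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on wildly different scales: income vs. age
X = np.array([[55_000.0, 34.0],
              [72_000.0, 51.0],
              [38_000.0, 27.0]])

X_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
print(X_scaled)  # both features now contribute comparably to distances
```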
But more importantly, having too many attributes, or irrelevant attributes, may confuse those distance calculations, and we end up with what we call the curse of dimensionality, that is, high-dimensional dataspaces. Remember, k-nearest neighbors is not that efficient at decision time, so if we have to consider more dimensions, that again makes things more expensive. Also, dimensions that aren't really important to the decision of interest still end up in the k-nearest-neighbors calculation, possibly confusing the prediction that results. So we've got a couple of ways to approach this dimensionality problem.

Solution one is what we call feature selection, and it can be done in a number of ways. We can do it manually, using the knowledge of attribute relevance we've gained from previous decision processes we've been involved in; this is one of the main ways the data mining team can inject what we call domain knowledge into the project and into the model, so it plays a big role here. But we can also use the automated feature selection tools built into a lot of data mining software products these days, and we'll discuss some of those in a future video.
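As one example of such a tool, scikit-learn ships univariate feature selection; a minimal sketch on synthetic data where only a few features actually matter:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 3 selected features
```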
Solution two is to tune the distance calculation we use in the model. We can manually adjust the weighting of different attributes, or even eliminate those that aren't important to the decision process. I might not care how hoppy my beer is, so I take that out of my calculation.
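A minimal sketch of that kind of tuning, scaling each feature by a hand-picked weight before measuring distance (the weights and feature names are hypothetical; a weight of 0 drops an attribute entirely, like ignoring hoppiness):

```python
def weighted_distance(a, b, weights):
    """Euclidean distance with per-feature weights; weight 0 drops a feature."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

# Features: [hoppy, fruity, dry]; zero out hoppiness if it doesn't matter to you
print(weighted_distance([0.8, 0.3, 0.5], [0.1, 0.4, 0.6], [0.0, 1.0, 1.0]))
```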
As you might imagine, domain knowledge and knowledge of the specific business problem play an important role here as well. So that's a brief introduction to k-nearest neighbors. I hope you found this video useful; be sure to check out the other videos in this series.
