Hi.
I'm Ari Zitin, and
I'll be talking to you
about some machine learning
fundamentals today.
We'll be going into a
little bit of detail
on some decision tree models and
on some neural network models,
and as part of that, we'll
be using the HMEQ Home Equity
data set to go into
detail on these models.
Before we go into the data,
let me talk to you a little bit
about machine learning.
The idea of machine learning is
that we want to automate a computer
task, like classifying something.
The easy example
that I always think of,
which almost everyone
has had experience with,
is when you automatically
deposit your check at the ATM:
it figures out how much
money is on the check
without you having to type it
in, because they don't necessarily
trust you to type it in.
So what it does is take a
picture of the check amount,
basically, and it learns
from annotated historical data
what that number actually is.
And so that's an example of
machine learning using images.
What we'll be doing is machine
learning using historical data.
So we're going to have
bank data, the HMEQ data set,
and there's a link
to the data set
below if you want to access it.
It's publicly available,
so you can follow along
and build your own models
with the same data.
This HMEQ data set
is historical data
that our bank has collected
about its customers,
and what we want to
predict is whether or not
they default on a loan.
So that's going
to be our target:
whether or not they
default on the loan,
and we use the
historical information
to try and determine that.
And the example I gave you
earlier with automatic check
scanning, the target is the
numeric amount of the check,
and the information that we
give is actually the picture.
So there are two different
examples of machine learning.
We'll be doing ours using
historical banking data.
Our inputs for the
historical banking data
are things like the amount
of the loan that you're
requesting, the amount of
the mortgage that you have
with the bank, your
debt-to-income ratio,
the number of
delinquent credit lines,
the number of derogatory
credit reports,
the number of credit lines,
your job information.
The target is BAD,
which is Binary Default,
but I also think of it
as BAD because it's bad
if you default on your loan.
So our target is going to be
1 for customers who defaulted
on their loan and 0 for
customers who didn't.
We want to try and
predict if people
are going to default on the
loan so we can avoid giving them
loans that we know they're
going to default on.
So now we'll go into examples
of our machine learning
algorithms.
We'll start with
decision trees, look
at how the algorithm works, and
then build it in the software.
We'll also do the same
thing for neural networks.
So we'll start by
looking at a picture
to see what sort of
data we might have.
So in this example,
we've restricted our data
to two dimensions.
In our two-dimensional
data we have
blue points and red points.
For the sake of
argument, we'll say
our blue points are
customers whose target is 0,
so they didn't
default on their loan;
and our red points are
customers whose target
is 1, which means they
did default on the loan.
These two axes are just two
of our input dimensions.
So I mentioned a few of them.
For these, let's imagine
that on the x-axis
we have delinquent
credit lines, which
is the number of
delinquent credit lines
the customers have
had, and on the y-axis
we have the value of
the home that they're
trying to pull credit out of.
So we look at some information
about our customers
and we can see,
there's clustering
of blue points, which means
most of these customers
didn't default on
their loans, and groups
of red points, which means
most of these customers
did default on their loans.
What we want to do
is try and come up
with a way to draw
lines on this plot
to separate the blue
points from the red points.
So for decision
trees, we're going
to be drawing straight
lines that are
perpendicular to one another.
For example, I'll
draw one straight line
to represent a split point that
we might look for in the data
to split the blue points from
the red points or the people
who didn't default from
the people who did default.
So looking at this data, I think
I can draw a line right here.
And if we look, this
creates a split.
On the right I've
got sort of a table,
and we see we have 11 blue
points and 11 red points
overall.
So when I do this
split, I see we'll
create two different groups.
The group on the left has
eight blue points
and one red point,
and the group on the right has
three blue points and
the rest of the red points.
So what we can see is we've
done a good job splitting
the blue points
from the red points on the
left, but not on the right.
So we can imagine now, we've
done this first partition
and we could say, this
looks like a decent split,
but we can do better
by adding more splits.
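The splitting the software does later can be sketched in plain Python: a tree tries candidate thresholds on an input and scores each one by how pure the two resulting groups are. This sketch uses Gini impurity as the purity score; the coordinates and counts in the usage below are illustrative, loosely based on the plot, not actual HMEQ values.

```python
# Sketch of how a tree scores a candidate vertical split line.
# Points are (x, y) pairs; labels are 0 (blue, no default) or 1 (red, default).

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of red (defaulted) points
    return 2 * p * (1 - p)

def split_quality(points, labels, threshold):
    """Weighted impurity after splitting on x <= threshold (lower is better)."""
    left = [lab for (x, _), lab in zip(points, labels) if x <= threshold]
    right = [lab for (x, _), lab in zip(points, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

The tree-growing algorithm simply tries many thresholds on every input variable and keeps the split with the lowest weighted impurity, then repeats inside each group.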
So we'll draw another
perpendicular straight line
to separate the red points from
the blue points on the right.
Put a perpendicular line here.
And we'll do one on the left.
And so now we can see,
we've done a really good job
of separating out blue points
and red points in the bottom.
The bottom-left is mostly
blue and the bottom-right
is mostly red.
So think about seeing new
data, because remember,
although we've
learned on this data,
we already knew whether
these customers defaulted.
We want a model
that works well on data
that we haven't seen before,
on brand new data where we
don't know what the color is.
So in the future,
for most people who
end up in the bottom-left here,
we would predict
that they do not default
on the loan, whereas for most
people in the bottom-right,
we would predict that they
will default on the loan.
In the top, we can see we might
want to do more splits.
So we might want to go into
more depth-- for example,
we could draw a line here and
here to separate even more.
And what we'll see is that
the software does this
automatically.
So I've drawn these lines based
on my visual interpretation
of this plot, but we
really want an algorithm
that's going to do this
for us, because one thing
I didn't mention is that
these are only two inputs.
And if we had a third input,
if we had three variables--
so I mentioned these as
delinquent credit lines and home
value; we might have another
one, like years on the job,
and it would be a third
dimension, so it'd
be coming out of the page.
You can see why I only chose
two, so we can visualize it.
But we actually have 11
input variables that we've
collected about our customers,
and when we're doing
machine learning,
we can have hundreds
of input variables.
Which means that
in this data set,
we're in an
11-dimensional space,
and you could try and
visualize 11 dimensions
but it won't work very well.
And in reality, we'll
often be working
with hundreds of dimensions.
So you could imagine
turning this picture that
you're looking at into a
100-dimensional space with all
these dots, and we're
still drawing planes--
in this case they're
going to be hyperplanes
in that 100-dimensional space
to separate the points.
So obviously we
can't do it visually,
and we're going to have to
let the computer do it for us
in an algorithmic way.
So we'll go to see the software
and look at how the decision
tree built itself on this data.
I've built a little bit of a
pipeline here in Model Studio,
and if you'd like to see
some examples of how I built
this pipeline and how to use
Model Studio and get started,
we've got a video link
below on getting started
with pipelines in Model Studio.
I'll right-click on the
Decision Tree default
node, select Results.
So I'm looking at the
results of a decision tree
that I trained on
this HMEQ data set.
The first thing that I'll
look at is the tree diagram.
In this tree diagram, we have
a picture of the decision tree,
and you can see, it's a
much deeper decision tree
than the example that
I drew on the right.
We zoom in on the
top, and we see
that we start with
3,000 observations, and about 20%
of them are 1's and 80% of them
are 0's, which means about
20% of our customers defaulted
on their loans.
We split based on the number
of delinquent credit lines.
If they have a lot of
delinquent credit lines,
they're pretty much all going
to be defaulting on their loans.
So we predict that they'll
default on their loans
if they have 12, 15, 7, 6,
or 8 delinquent credit lines.
If they have fewer than that--
5, 4, 0, 1, 2, 3
credit lines, it
looks like they're sort of a
mix of people who defaulted
and people who didn't
default, so we're going
to keep splitting from there.
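Reading that top split off the tree diagram, the learned rule is literally a membership test on those delinquency counts. A hedged sketch (the specific set of counts comes from this particular trained tree, which is exactly why it looks memorized):

```python
# First split of the trained tree, written out as a rule. The counts
# {6, 7, 8, 12, 15} are specific values this tree learned from training data.

def first_split(delinquent_lines):
    if delinquent_lines in {6, 7, 8, 12, 15}:
        return 1      # predict default
    return None       # mixed group: the tree keeps splitting below this node
```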
When you look at this decision
tree on the broad scale,
you'll see that the
thick line indicates
where most of the data is
going, and the thinner lines
indicate small amounts of data.
The boxes at the bottom are
the final decision boxes.
So you can see, everyone who
ends up in this box
follows a certain path:
they have a small number
of delinquent credit lines,
but the value of their
home is above $474,000,
and their years on the
job are less than 26.
We predict most of these
people defaulted on their loans
in the historical data.
One of the things that you might
notice in this decision tree
as I've been reading
through is that we've
done a little bit of what
I like to call overfitting,
which is we've memorized
the training data.
I mentioned that if you have a
lot of delinquent credit lines,
I say you're going to
default on the loan,
but it's actually
a specific number--
12, 15, 7, 6, or 8.
Those numbers are
part of the data
that we learned the
decision tree on,
so we might want to
come up with a better
model that generalizes to new
customers a little bit better.
So I'm going to close this tree
diagram, close the decision
tree results, and I'll go
into another decision tree
that I made where
I specifically made
an effort to cut some of the
leaf nodes that I didn't think
were important.
I actually let the computer
do this for me automatically.
So I'll open the
decision tree diagram,
and we'll see it's a different
tree altogether because I
trained it a little
bit differently.
What I did when I
trained it was I
built a full decision tree that
memorizes the training data.
If we go back to the drawings,
it's like drawing boundaries
that uniquely select blue
and red points, so that each
box contains only one color
of point-- a tree that gets
it 100% right on the training
data.
Once we built this tree,
we start cutting it back
and we look at data that
it's never seen before.
So we look at data that
this model has never
seen that wasn't used to train
it and we see how it performs,
and every time we cut back,
if it improves performance,
we keep cutting the
tree back and making
it simpler and simpler.
So in the end here, we get
better performance on new data
with a simpler model,
which is desirable.
If we look here, this
pruning error plot
shows exactly the procedure
that I was discussing.
We see that on the training
data, which is the data
that we used to build the
model, performance
continues to improve as we have
more and more leaves, which
is a more complex tree.
But on the validation
data, which is data
that the tree's never seen
before, performance actually gets worse.
So we build the most
complicated tree
that we want on
the training data,
and then we just start
cutting the leaves
back until we have one
that actually works well
on the validation data.
And at some point, if we
have too simple of a model,
we have bad performance
on the validation data.
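That prune-while-validation-improves procedure can be sketched abstractly: given the validation error for candidate trees of increasing size, keep the simplest tree that reaches the minimum validation error. The error numbers in the usage are illustrative, not from the HMEQ run.

```python
# Sketch of choosing a pruned tree by validation error: among candidate
# tree sizes (number of leaves), pick the smallest one whose validation
# misclassification rate matches the best observed rate.

def best_tree_size(val_errors):
    """val_errors maps number-of-leaves -> validation error."""
    best = min(val_errors.values())
    return min(size for size, err in val_errors.items() if err == best)
```

With made-up errors like `{2: 0.30, 4: 0.22, 8: 0.18, 16: 0.18, 32: 0.21}`, this picks the 8-leaf tree: it ties the 16-leaf tree on validation error, and the simpler model is preferred.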
So that's been our
discussion of decision trees.
We can look at some
assessment statistics,
but we'll do it in the end when
we compare all of our models.
We find out that
the decision tree
we built with pruning, the second
one I showed you, actually does
work a little bit better on
this data than the first one.
Going back to our diagram, we'll
be doing neural networks now
and thinking about how they're
different from decision trees.
So another model, and
a really popular one
these days, is neural networks.
They're particularly
popular because they're
used in a more complicated
form of machine
learning called
deep learning, which
relates to processing images.
We're going to be trying to
use a neural network to do
the same thing as what we did
with our decision tree, which
is to separate the red
points from the blue points.
The big difference
is the decision tree
creates this list of rules,
whereas the neural network is
really trying to learn
an arbitrary nonlinear
function to map the
inputs to the outputs.
This arbitrary
non-linear function
can take any shape
it wants on the plot,
so that means that our
boundaries aren't necessarily
going to be straight lines.
So I'll draw in
what I might imagine
a neural network would think.
And you can see, I'm just sort
of drawing arbitrary curves.
And the only requirement
is that these curves
must be functions--
they must be able to be
defined by functions.
And so what you can
see is I've basically just
selected the blue points
and drawn the ideal decision
boundary.
This might suggest to you that
neural networks are always
the best model, but
the disadvantage is
that we don't want to
memorize the training data;
we want to apply our
model to new data.
And you can see, the size
of the circle or of those
curves that I drew
could be very different
while still capturing
all the blue points
and none of the red points.
So there's a lot of ambiguity
in my personal drawing of this,
which means that when
neural networks are learning
these functions, they find
one of many of these functions
that are going to do this,
and we don't necessarily
know which one is going to work
better on the validation data.
So it's a little bit easier
to overfit your training data
with neural networks,
and we might see
that in our following examples.
Neural networks don't
produce a list of rules,
so I don't have a
diagram on the right
here to show you how
we drew these lines,
but they do create
a function, which is
defined by a collection of numbers.
So we actually have
a way to model it,
but we don't show
it in detail here
because it isn't a
visual representation,
it's just an
equation, basically.
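To make "it's just an equation" concrete, here is a minimal sketch of the function a tiny network computes: a hidden layer of tanh units feeding a logistic output that returns a probability of default. The weights in the usage are made up for illustration, not the trained ones.

```python
import math

# Forward pass of a toy neural network: inputs -> tanh hidden units ->
# logistic output probability. Each hidden unit's weight list ends in a bias.

def neural_net(inputs, hidden_units, out_weights, out_bias):
    hidden = []
    for *weights, bias in hidden_units:
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        hidden.append(math.tanh(z))
    z = sum(w * h for w, h in zip(out_weights, hidden)) + out_bias
    return 1 / (1 + math.exp(-z))  # squashed to a probability in (0, 1)
```

Training is just finding values for all these weights; once found, the model is nothing but this arithmetic, which is why the individual numbers are so hard to interpret.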
So that's a big difference as
well between neural networks
and decision trees:
decision trees
create this list of rules.
If you have more than
seven delinquent credit
lines, then we predict
you're probably
going to default on your loan.
That's very useful if you need
to explain to someone why you
did what you did in your model.
Whereas with the
neural network, it just
generates a bunch of
numbers behind the scenes,
and you multiply all
the numbers together,
and it gives you a probability
for the prediction.
So you can't really
interpret the results at all.
We'll go back to the software.
We built a neural network model.
You might notice the pipeline
for the neural network
is a little bit more
complicated than the pipelines
for the decision trees.
We just went straight from
data down to the decision tree,
whereas for the
Neural Network node,
we used a Manage Variables
node, an Imputation node,
and a Variable Selection node.
We have to manage variables
to set the metadata
for the imputation
and the variable selection.
In the imputation,
we have to replace
missing values in our data.
The decision tree can figure out
which branch missing values
should go down.
So when we draw the
line, missing values
can go on either
side of the line.
Neural networks
create an equation,
and an equation
requires numbers.
Missing values are not numbers,
so we must replace them.
We'll replace them
with the mean.
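A minimal sketch of what the Imputation node is doing with the mean strategy, assuming missing values arrive as None:

```python
# Mean imputation: replace each missing value (None) with the mean of
# the observed values in that column, so every cell becomes a number.

def impute_mean(column):
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```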
When we built the decision
tree, at each split
we chose a variable to split on.
So the first split, for example,
was on the number of
delinquent credit lines,
and later splits used variables
like the debt-to-income ratio.
The neural network
will not automatically
choose what variables to use.
Instead it just puts them
all in the equation--
again, because it's an
equation, all the variables
get multiplied by
numbers in the equation.
So we use a Variable
Selection node
to select some variables
going into the neural network.
We want to restrict
the number of variables
going into the neural
network, so we only
use the useful variables.
We noticed in the decision
tree that not all the variables
were useful.
I'll right-click on the Neural
Network, select Results.
The first thing I'll highlight
is this neural network diagram.
I like this picture because
it's very aesthetic.
I don't like this
picture because it
doesn't have a lot of
useful information on it.
So it's a nice picture
to show someone
that you built a neural network,
here's what it looks like.
The first thing I see is that
we use three input variables.
So we started out
with 11 variables,
but we only use
three of them because
of the variable selection.
The thing that I
don't find useful
is that the size of
these dots indicates
the magnitude of the
numeric weights that
are used to create the
equation. But I already
told you, you can't
really interpret
what that equation means
or what those weights mean,
so it's not really
very interpretable.
This diagram does
show you the picture
of the neural network
which would indicate
how to create the equation.
If you're familiar
with neural networks,
you could look at this
diagram and write down
a model of what the
equation would look like.
You'd obviously be missing all
the numbers that go in there,
so you have to pull
the numbers out.
But you can see,
we're using derog,
which is derogatory
credit reports;
delinq, which is number of
delinquent credit lines;
and debt-to-income ratio to
try and predict the target BAD.
I'll close this diagram, I'll
close the results of the Neural
Network node, and I'll
go to Model Comparison
to see how we did on
the different models.
So you'll notice there were
a lot fewer visual results
for the neural network,
which connects to the fact
that the decision tree is a
fundamentally interpretable
model, whereas the
neural network produces
a bunch of numbers
on the backend.
So a lot of the
results are numeric
and you can take those numeric
results and apply them.
I open the Model
Comparison node and I
see that the decision tree with
the reduced error-- that was
the one I pruned,
where I built the big decision
tree on the training data
and then cut it back--
actually did the best based
on misclassification rate.
The misclassification rate
for the neural network
was about 20%, which means that
the neural network model really
did not capture the information
we were interested in.
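The statistic the Model Comparison node is ranking on is simple to compute; a sketch:

```python
# Misclassification rate: the fraction of cases where the predicted class
# (0 = no default, 1 = default) disagrees with the actual target.

def misclassification_rate(actual, predicted):
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)
```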
I think the easiest
way to explain
this is that the decision
tree is a simpler model,
and this is a simpler data set.
And one issue with
this data set that
might hurt the neural
network is that there
are a bunch of categorical
input variables like job
where we have a list of
different people's jobs,
and the neural network does not
do as well with these variables
as the decision tree.
So the neural
network model really
didn't capture what
we were interested in
and really didn't
work on this data set,
but we wanted to highlight the
difference between decision
trees and neural networks.
And what you'll
generally find when
working with new data--
whatever data you
have to work with--
is that some models
work better than others.
So if your data
is really simple,
you might find decision trees
and linear regression models
work really well.
And if your data is
really, really complicated
and they aren't
working well, it might
suggest that neural
networks are a better model.
So one thing I'll
say is that just because
we found the decision
tree worked really well,
it doesn't necessarily mean
the neural network will
work well too;
it's just a nice comparison
to look at the two models.
Thanks for joining us to learn
about some machine learning
fundamentals.
We talked a little bit
about decision trees
and some neural network models.
Subscribe, check out
some more of our videos,
and then check out
the links below,
and if you have
any questions, feel
free to put them
in the comments.
Thanks.
