Hi.
I'm Christa, and today
I'm going to be talking
about gradient boosting.
I'm going to be
starting off talking
about a general overview
of ensemble methods
and then go into a brief
explanation of gradient
boosting.
And then I'm going
to be showing you how
to do gradient boosting in SAS.
Let's get started.
So, first, I'm going to start
off with a simple example using
something that most
people in machine learning
are familiar with, and that's
just using a single model.
So you start off, you have your
data, you take your samples,
and then you pass
it into your model.
And then you get
your prediction.
So if you're looking to
improve your performance
and you're using a single model,
you can use ensemble methods.
So the two main types of ensemble methods are bagging and boosting.
So for bagging, you're
using parallel training.
So you have multiple
independent models
that you're training to try and
capture variance in your data,
whereas in boosting, you're
using sequential training.
So in this case,
you have one model,
and you're trying to
improve upon its errors
to improve your performance.
So I'm going to give a really
simple overview using people
as an example.
So on the left,
you have bagging,
and you start off
with your one model.
So this is represented
by the person.
And so if you're going
to use the bagging method
and you're trying to improve
it that way, all you're doing
is adding more people
to the scenario.
So you're trying to add multiple
models to try and capture
that variance.
So on the right,
you have boosting.
And so this is where
you are starting off
with a weak learner,
and you're trying
to improve upon the
errors over time.
So it's kind of like your
model is maturing over time
and growing up.
So this is how boosting achieves its performance improvements.
So now I'm going to use the
same people example except
give something more specific.
So here you have an example, and
this is just like a little kid.
[CHILDREN LAUGHING]
So you have this kid, and
you're going to ask them,
how many continents are there?
So this is a kid.
Maybe they're five years old.
They might say
something like 27.
They have no idea.
And if you were trying to
use the bagging method,
you might just be adding
more kids to the scenario.
So is this really going to improve the results? Are they really going to provide a better answer? And the answer is probably not. More kids are just going to give more random answers. So they might say stuff like 42, or 9. Adding more children is probably just not going to help in this scenario.
So if you have data that doesn't have high variance -- it has low variance and a high bias -- you might be able to learn how those errors are being made and improve upon the model that way.
So we have this weak
learner that we're
going to use on this.
And you're going to say,
how many continents?
And at first, they're going
to give you a random answer.
They might say
something like 200.
And then you have the kid
growing up and maturing,
and they're learning
more things,
and they're learning
from their mistakes.
And now they're going to
say something like 21.
And then we keep training and
keep learning from the errors,
and now finally we
have an improved model
who gives us a better answer.
So now I'm going to move on
to the actual representation
of bagging and
boosting real quick.
So you start off with
your original data,
and then you take your
random samples from it.
And then you do
your first model.
So all of these methods are going to start off in this same kind of way.
But the important
difference is here.
So you have your additional
models that you're using.
You can see that the
arrows show they're
all independently learning.
So they're all taking
their own random sample,
and they're learning
using their own model.
And then you take
all of the prediction
results from these models
and do some sort of voting.
So there's multiple methods
you could use to kind of get
what the overall answer is.
So for classification, you might
just choose a majority vote,
whereas if you're
doing regression,
you might just take the
average of the responses
for the models for your result.
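Since Model Studio is point-and-click, there's no code in this video, but the bagging-and-voting idea can be sketched in a few lines of plain Python (an illustration only, not SAS's implementation; the toy data and one-threshold "stump" models are made up for brevity):

```python
import random
random.seed(0)

# Toy 1-D data: the true rule is "class 1 when x > 4", with one noisy label.
X = [float(i) for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]   # x = 7 is mislabeled noise

def fit_stump(xs, ys):
    """Pick the threshold whose rule 'predict 1 when x > t' is most accurate."""
    best_t, best_acc = xs[0], -1
    for t in xs:
        acc = sum((x > t) == bool(lab) for x, lab in zip(xs, ys))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: every stump independently trains on its own bootstrap
# (sampled-with-replacement) copy of the data.
thresholds = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    thresholds.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

def predict(x):
    votes = sum(x > t for t in thresholds)   # each stump casts one vote
    return int(votes * 2 > len(thresholds))  # majority wins
```

For classification the ensemble takes a majority vote like this; for regression you'd average the stumps' predictions instead, exactly as described above.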
So for boosting,
you see we start off
in a very similar way.
So we have the original data,
we have our random sampling,
and we're sending
it into our model.
However, this is where
the distinction is.
So here this is just an
example of a boosting method.
You might have your
model, and now you're
going to use that
to determine where
it was making its mistakes.
And you use those observations that it made mistakes on to weight them, so that in the next model they have a higher likelihood of showing up in its sample.
So it's no longer a
random sampling anymore.
You're trying to find
those observations
that you're doing worse on and
try and learn more about them.
So you do this over
and over until you
have a good representation
of what your data actually
is like.
And you take all of these
models and do a weighted voting.
So as you're doing this over time, you're going to keep track of which models are performing the best and the worst. If a model is doing well, it'll get more of a say; if it's doing worse, it gets less of a say.
So this is just an
example of boosting.
And in AdaBoost, for example, the algorithm will actually look at how well each model is performing. If a model isn't doing well at all, it might just get dropped. So it'll train a model, test it, and drop it if needed. If it's doing well enough, then it gets included in the actual weighted voting portion.
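As a rough sketch of that reweighting idea, here is a plain-Python, AdaBoost-style illustration (the toy data and one-threshold "stump" models are made up for brevity; this is not what SAS runs internally). Each round picks the stump with the lowest *weighted* error, gives it a "say" based on how well it did, and upweights the observations it got wrong:

```python
import math

# Toy 1-D data with labels in {-1, +1}; x = 3 and x = 4 are "hard" points.
X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [-1, -1, -1, +1, -1, +1, +1, +1]
w = [1.0 / len(X)] * len(X)        # start with every observation weighted equally

learners = []                      # (threshold, direction, alpha)
for _ in range(10):
    # Pick the stump with the lowest *weighted* error, so observations
    # that earlier models got wrong count for more.
    best = None
    for t in X:
        for d in (+1, -1):         # d = +1 means "predict +1 when x > t"
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (d if xi > t else -d) != yi)
            if best is None or err < best[0]:
                best = (err, t, d)
    err, t, d = best
    err = min(max(err, 1e-10), 1 - 1e-10)        # guard the log below
    alpha = 0.5 * math.log((1 - err) / err)      # this learner's "say" in the vote
    learners.append((t, d, alpha))
    # Upweight the points this stump got wrong, downweight the rest.
    w = [wi * math.exp(-alpha * yi * (d if xi > t else -d))
         for xi, yi, wi in zip(X, y, w)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted vote: learners with more "say" count for more.
    score = sum(a * (d if x > t else -d) for t, d, a in learners)
    return 1 if score > 0 else -1
```

Notice the sampling is no longer uniform in spirit: the weights concentrate on the observations the ensemble keeps getting wrong, which is exactly the behavior described above.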
So for the topic
of today's video,
I'm going to show now
gradient boosting.
So gradient boosting is just
another form of boosting.
You start off the same way
as the previous method.
You have your original
data, your random sample,
and you send it to
the first model.
Except now, after you have the predictions from that first model, you're going to take the loss between the target value and the predicted value and try to learn whether there are any patterns in those errors.
So if you can find patterns in
the residuals and model that,
then you can use that to
improve upon your model.
So you're focusing solely on
the areas in which you're not
performing very well.
So as you can see, now we're going to train a model on these residuals.
And you take the
residuals from that model,
and you train that again.
And so you're trying to reach a threshold where your residuals are very close to zero, so you have very low error between your predicted and target values.
And so this just
happens over time.
And you want to ensure
when you're doing this
that you're not going to a
point of overfitting your model,
because that can be an issue
with gradient boosting.
And so once you have all
of your training done
and you either reach
the cutoff or decided
to stop due to
overfitting, you take a sum
of these models' predictions.
So this is different from the previous methods, where we used some sort of voting. Here, because the first model gives us a prediction and the rest of them are predictions of the errors, what this is trying to do is take that initial prediction and modify it based on what we learned from those errors. So you're going to take those and add them together. If the model thinks it's off by a little bit, that addition fixes it.
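That residual-fitting loop can be sketched in plain Python. This is just an illustration under simplifying assumptions (a made-up toy data set and one-split regression stumps instead of the full trees a real gradient boosting implementation grows):

```python
# Toy 1-D regression: y is roughly 1.0 below x = 5 and roughly 4.0 above.
X = [float(i) for i in range(10)]
y = [1.0, 1.2, 0.9, 1.1, 1.0, 4.0, 4.2, 3.9, 4.1, 4.0]

def fit_stump(xs, residuals):
    """Regression stump: split where squared error drops the most,
    predicting the mean residual on each side of the threshold."""
    best = None
    for t in xs[:-1]:
        left = [r for x_, r in zip(xs, residuals) if x_ <= t]
        right = [r for x_, r in zip(xs, residuals) if x_ > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

base = sum(y) / len(y)        # model 0: just predict the overall mean
lr = 0.5                      # learning rate: take only a partial step each round
stumps = []

def predict(x):
    # Final prediction = initial prediction + sum of the error corrections.
    return base + sum(lr * s(x) for s in stumps)

for _ in range(20):
    # Each new model is trained on the *residuals* of the current ensemble.
    residuals = [yi - predict(xi) for xi, yi in zip(X, y)]
    stumps.append(fit_stump(X, residuals))
```

Each round trains on the current residuals, and the final prediction is the initial guess plus the sum of all the learned corrections, scaled down by the learning rate.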
All right.
So now I'm going to be
moving over to Model Studio
to show you how gradient
boosting is done in there.
All right.
So now I'm in Model Studio.
I have my gradient boosting
example already ready.
And I'm using the HMEQ data set.
So now that the
project is loaded up--
so it takes us to the
Data tab initially.
You can see the variables
that we're using,
and you can see that BAD
is set to the target.
So in the links below, there'll
be a link to this data set.
And, also, you can look up what
the individual variables mean.
But I'm just going to
move right into how to do
gradient boosting using this.
So this is a binary target. So we're just looking to predict whether the answer is one or zero.
So to start off, I'm going to go
to the Supervised Learning tab
and drag over a
Gradient Boosting node.
All right.
So here it gives us
a brief description
of the gradient boosting model.
We can see the number
of trees that we have.
It starts off as
a default of 100.
You have your learning rate,
your L1 regularization rate,
and you also have
the option to change
the tree-splitting options.
So you can change the number
of branches, the maximum depth,
the minimum leaf size.
So you have a
variety of options.
And then, of course, instead
of just manually changing
these variables, you
also have the option
to perform autotuning
if you want to.
I'm also going to show
you real quick if you're
interested in the
type of gradient
boosting that SAS is doing.
So there's a website you can go to. If you look up SAS Viya Gradient Boosting, you'll be brought here, which will point you to the GRADBOOST procedure.
So this is the basis for
the Gradient Boosting node.
And if you want to learn
more details about that
or if you're interested in
using it in a SAS code node,
then you can do so and
modify variables that way.
Okay.
So now I'm going to go back to
the example that I was showing.
And I'm going to run it.
Okay.
So the pipeline has
successfully completed.
I'm going to go ahead and open
the results for the Gradient
Boosting node.
You could open the
Model Comparison tab,
but I don't have any other nodes
to compare it to right now.
So I'm just opening
these results.
So initially, when you open up
the Gradient Boosting Results,
you'll see that it
shows an error plot
with average squared error.
And it also shows the
variable importance,
so you can look at this
and see what variables were
most important to your model.
But you can also look
here and change this
to the misclassification rate.
So you can see that as trees were added, especially in the beginning, the misclassification rate got better and better over time.
I'm going to expand this so we
can see the legend real quick.
So here we can see that the training error continues to get better, but we're not really noticing much improvement over time on the validation and test sets.
So this might be a case in which we want to use fewer trees, in case we're overfitting around this area where we're no longer seeing much improvement.
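One simple way to act on that observation is to scan the per-iteration validation errors and stop once they flatten out. This helper is hypothetical (Model Studio handles early stopping through its own options; the error numbers below are made up to mimic the plot described), but it shows the idea:

```python
def best_tree_count(valid_errors, patience=5, tol=1e-4):
    """Scan per-iteration validation errors and stop once `patience`
    consecutive iterations fail to beat the best error by `tol`."""
    best_err, best_i, since = float("inf"), 0, 0
    for i, e in enumerate(valid_errors, start=1):
        if e < best_err - tol:
            best_err, best_i, since = e, i, 0
        else:
            since += 1
            if since >= patience:
                break
    return best_i

# Validation error falls early, then flattens around iteration 4.
errs = [0.30, 0.24, 0.21, 0.20, 0.201, 0.202, 0.200, 0.203, 0.205]
print(best_tree_count(errs))  # → 4
```

Picking the tree count where the validation curve stops improving is the same judgment call being made by eye in the results plot.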
But, next, I'm going
to add two other nodes.
So let's see here.
So I'm going to drag
over a Forest node.
So forest is a bagging
method, so we can
look at that to compare it to.
And I'm also going to compare it
to just a regular single model
decision tree.
All right.
So I have these nodes loaded.
One thing is that
the default options
for these individual models
are a little different.
And just to make it a little more similar before I run it, I'm going to change some of the tree-splitting options so the trees are roughly similar shapes, which gives a little better comparison, although some other things might still be different. So keep that in mind.
So here for the Gradient Boosting node, we have a maximum depth of four, and I know another one of these has 10, so I'm just going to go ahead and change this to 10 also.
So I'm going to go
over to the Forest node
and look at its options.
So here it's also
using 100 trees.
And then here we have a maximum
depth for its trees being 20.
So I'm going to go ahead and
change that to be 10 also.
And then I'm going to check
Decision Tree real quick.
And so this maximum depth is 10.
So now all of them are around
the same shape of tree.
There might be a
little difference
in some of the methods that it's
using, but for the most part,
these are kind of similar.
So we can compare
the boosting method
compared to the bagging method
compared to the single model.
So going to run this real quick.
Okay.
So now the pipeline has
successfully finished.
I'm going to go ahead and click
on the Model Comparison node
and open the results.
So for this example, the
Gradient Boosting node
did turn out to be
the best method.
You can see that its
misclassification rate is
the lowest, followed
by the Decision Tree,
and then lastly the Forest.
So for this case, adding
more trees wasn't enough.
We actually needed to look
at the individual errors
that we had.
And even though we have the decision tree, which is a single model, and the forest, which uses multiple decision trees, the difference between those two might just be due to individual options inside these nodes that make them a little different.
But you can go and you can play
around with all these options.
You can perform autotuning
to improve your model,
and you can try boosting and
bagging to see if any of those
have a good effect.
All right.
So I hope you learned a thing
or two about gradient boosting.
If you want any of
the links that I
talked about or any
additional resources,
you can check out the
description below.
You can also feel free to
leave any sort of comments
or questions about the video.
And if you did like
this video and you
want more tips like this, you
can subscribe to the channel.
And thanks for watching.
