Hi.
I'm Christa.
And I'm going to be
showing you examples
of machine learning in SAS.
First I'm going to be
starting off in SAS studio.
This may be more
for people who are
familiar with SAS programming.
Second, I'm going to be
showing you the same model
but in Model Studio.
This might be more
interesting for people
who aren't as familiar
with SAS programming
and are looking for
an easier way to do it
but still want to
have the same control
over the hyperparameters.
This video is more
about showing how
to do machine learning in SAS.
So if you are interested
in the fundamentals
or why we're doing
certain tasks,
we do have a machine learning
fundamentals video that
you can refer to.
All right.
So now I have SAS Studio open.
And we're going to
go to the snippets.
So underneath the
SAS Snippets, you
can find a variety of different
snippets that you can use.
I'm going to be using the SAS
Viya Machine Learning snippets.
So I'm going to
scroll down a bit.
As you can see here, there
are a few different options
that you can use.
You can compare two machine
learning algorithms.
You can even compare several
machine learning algorithms.
So if you're looking to use anything specific,
all you have to do is go to these little examples,
take that model out, and use it with your data.
And if you want to see how a technique was used,
you can use these examples as a base
for what you're trying to do.
So right now I'm going to be
using the Supervised Learning
snippet.
So this snippet is
showcasing a sample machine
learning workflow using
the HMEQ data set.
And the steps of this are:
we're going to be preparing
and exploring our data.
So we're going to be loading it
in, exploring it, seeing what
missing values it has,
partitioning it out, imputing
those missing values,
and identifying variables
that explain variance.
Next, we're going to be
performing supervised learning
using a random forest.
And then we're going to
be evaluating and scoring
our model.
So I'm scrolling down.
The first block of code here
is defining the macro variables
for later use in the program.
So what we're going to be
doing is setting the output
directory, so where we want our
temporary files to be written
to, starting our CAS
session, and then
specifying the data set name.
So this is pointing
to where the data set is
and defining names for the SAS data,
the CAS data, and the partitioned data.
Next, we're specifying the
data set inputs and the target
variable.
So we have our different
variables here.
So these are all class inputs.
The ones below are
the interval inputs.
And they're all listed here.
And then last, we have
our target variable,
which is the BAD.
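If you'd like a feel for what that setup block looks like without opening the snippet, here is a rough sketch. The macro variable names, and the split of the HMEQ columns into class and interval inputs, are my assumptions rather than the snippet's literal code:

```sas
/* Where temporary output files will be written */
%let outdir = %sysfunc(getoption(WORK));

/* Start a CAS session and assign a CAS engine libref */
cas mysess;
libname mycas cas sessref=mysess;

/* Point at the sample data set */
%let data = sampsio.hmeq;

/* Class (nominal) inputs, interval inputs, and the target */
%let class_inputs    = reason job;
%let interval_inputs = loan mortdue value yoj derog delinq clage ninq clno debtinc;
%let target          = bad;
```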
All right.
So if we want to get a
quick look at our data,
we can go to Libraries
and scroll down.
So it's in the SAMPSIO tab.
I'm going to scroll
all the way down
till we get to the HMEQ data.
So here I've
already expanded it.
You can see our
different variables.
There are 13
different variables.
If we click on it, we can
look at a sample of this data.
So you can see a few
different observations here.
One thing you might notice
is we have quite a few
of these little dots.
What that means is
the data is missing
for that variable
for this observation.
This is going to be
important and something
that we look at later.
So when I run this code--
all right.
So it successfully ran and
we started our CAS session.
Now I'm going to be
loading the data set.
So we're loading the data
set using the variable names
that we set earlier.
I'll submit and run.
And that was successful.
So we can see here
that there are
5,960 observations
and the 13 variables
that we've mentioned earlier.
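Loading the sample table into CAS can be as simple as a data step that copies it from the SAMPSIO library into the CAS libref (table names here are the ones assumed above):

```sas
/* Copy the sample table from the SAMPSIO library into CAS */
data mycas.hmeq;
  set sampsio.hmeq;
run;
```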
All right.
Scrolling down a bit
more, this section
is all about exploring the data
and looking for missing values.
So this is going
to be telling us
the percentage of
missing values and which
variables have the most.
So I'm going to
highlight and run this.
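One common way to get those missing-value counts in SAS Viya is PROC CARDINALITY; a sketch along these lines (the OUTCARD= column names are my best recollection, so treat them as assumptions):

```sas
/* Summarize each variable, including how many values are missing */
proc cardinality data=mycas.hmeq outcard=mycas.card;
run;

/* Compute the percentage of missing values per variable */
data card_missing;
  set mycas.card (keep=_varname_ _nmiss_ _nobs_);
  pct_missing = _nmiss_ / _nobs_;
  format pct_missing percent8.1;
run;

proc sort data=card_missing;
  by descending pct_missing;
run;

proc print data=card_missing;
run;
```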
OK.
So here we have a few
different sets of information.
We have the data summary, which
shows us each of our variables
and what level they are, so whether
they're interval or class.
And if we scroll to the right,
we can see the number of observations
that are missing for each variable.
So out of the total data
set, how many observations
did that specific variable
have that didn't have
a value associated with it?
We can also see the mean,
the max, standard deviation,
and a few other traits of
our different variables.
If we scroll down, we can
see the percentage of missing
values for each variable.
So here you can see that
DEBTINC is missing for
21% of the observations.
So there's a large amount
of missing values
for this specific variable.
And we have a few others
that also have quite a few
missing values.
But we're going to
handle them in a minute.
But first, we're going
to partition the data
into training and validation.
So we want to partition
before we impute, because
the imputation statistics should
come from the training set only;
we don't want to leak information
from our validation set.
So here we load in
the data that we have
and we set our partition
to be 70% of the data.
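A stratified 70% partition like the one described can be sketched with PROC PARTITION; the table names follow the assumptions above, and the exact statement options may differ slightly from the snippet:

```sas
/* Add a partition indicator: 70% training, stratified by the target */
proc partition data=mycas.hmeq partind samppct=70;
  by bad;
  output out=mycas.hmeq_part copyvars=(_ALL_);
run;
```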
So I'm going to run this.
And that is complete.
So here we can see the
number of observations
for each category of BAD,
and also the number of
samples that we collected
from each of those categories.
So this represents 70%
for this partition.
So now we're going to get into
imputing the missing values
that we looked at earlier.
So we have our training set.
And we're going to take
the variables seen here.
So here we have three.
And we're going to impute
those to the median.
For these two
variables listed here,
we're going to impute
them for the mean.
And then we're going
to save all of this
in a data set tagged 'prepped'.
So this is going to be used
when we create our model.
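The imputation step might look roughly like this with PROC VARIMPUTE. The third median-imputed variable isn't named aloud in this part of the video, so it's left as a comment, and note that the procedure writes imputed copies under new names (an IM_ prefix), which is why the variables show up renamed later:

```sas
/* Replace missing values: three variables to the median, two to the mean */
proc varimpute data=mycas.hmeq_part;
  input yoj clno / ctech=median;    /* plus the third variable shown on screen */
  input clage debtinc / ctech=mean;
  /* imputed columns come out with an IM_ prefix, e.g. IM_YOJ */
  output out=mycas.hmeq_prepped copyvars=(_ALL_);
run;
```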
Right.
So here what this
generated is showing us
how many variables we imputed
and what method that we used.
It also shows us the variables,
their mean and median, so
the value that was replaced
when we did the imputation.
So each of those missing values
now has this value instead.
So it's important when you're
looking at data to make sure
that what you're
replacing is something
that you should be replacing.
There could be
instances where you
have a missing value where
you don't necessarily
want to replace it.
It might just be another
level in your data.
So before you do this
to your data set,
make sure that this is
what you really want to do.
And this is changing it to be
a more accurate representation
of those observations.
All right.
So now that we've
done the imputation,
we're going to look and
identify variables that
explain variance in the target.
So what this is doing
is it's identifying
variables that are predictive
of our target variable.
So what explains
why this variability
is occurring in our target?
So we want to try and
find those variables
and limit the ones that
don't explain variance
so that our model can
run on just the most
predictive variables and
not have any noise that
might be associated with ones
that are less predictive.
So I'm going to select this bit
of code, which also includes
a plot that we can look at.
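That variable-selection step can be sketched with PROC VARREDUCE plus a quick plot. The output table and column names here are approximations, not necessarily what the snippet uses:

```sas
/* Identify inputs that explain variance in the target */
proc varreduce data=mycas.hmeq_prepped;
  class bad reason job;
  reduce supervised bad = reason job loan mortdue value yoj derog delinq
                          clage ninq clno debtinc;
  ods output selectionsummary=summary;
run;

/* Plot the variance explained at each selection iteration */
proc sgplot data=summary;
  series x=iteration y=varexp;
run;
```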
So I'm going to run this.
And so what we have here
is a few different pieces
of summary information.
You can see the
proportion of variance
explained by each variable.
You can see the ones
that were selected.
And you can see a plot
of the variance explained
by each iteration.
All right.
So now that we've
done this, we're
finally going to get to building
our predictive model using
the random forest.
So here we're using
the procedure FOREST.
We're using our data set
that we prepped earlier.
And we've specified a
few different inputs.
So we have the number of trees
being 50, the number of bins
being 20, and the minimum
leaf size being five.
So we have to specify
which of our variables
are the interval variables,
which we have here
and that we named
earlier in the program.
We have our class inputs.
And we have our target.
Here we're specifying the
partition that we did earlier.
And then we're outputting
fit statistics for the model
that we create.
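Putting those options together, the forest step looks something like this. The partition indicator name and the ODS table name are assumptions based on how PROC FOREST is typically called, not a copy of the snippet:

```sas
/* Train a random forest on the prepared table */
proc forest data=mycas.hmeq_prepped ntrees=50 numbin=20 minleafsize=5;
  input loan mortdue value yoj derog delinq clage ninq clno debtinc / level=interval;
  input reason job / level=nominal;
  target bad / level=nominal;
  /* use the 70/30 split created earlier */
  partition rolevar=_partind_ (train='1' validate='0');
  ods output FitStatistics=fitstats;
run;
```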
So let's run this.
OK.
So we can see a little
bit of model information
of the model we just created.
So it's the same information
that we put in earlier, including
some of the default values.
So we have the number
of trees being 50.
We have the number of
bins, the maximum depth.
And we can also see the
misclassification rate.
So this is currently 12%.
We also see the split of
training and validation.
And we can look at the variable
importance that was generated
by running the forest model.
So you can see which
variables were determined
to be most important.
This table is the fit
statistics that we generated.
So you can look at the
training average squared error,
the validation
average squared error,
and you can also look
at those same things
for the misclassification rate.
So here you can see as we added
more trees-- so to the left,
you can see how
many trees we have--
that our misclassification
rate for both the training
and the validation went down.
Now the validation
never quite reaches
where the training was at.
That's expected.
But we still have a relatively
low misclassification rate
at approximately 13%.
OK.
So now what we're going
to do is score the data
using the generated model.
So this is going to tell us
how our model was performing.
So I'm going to
select this code.
And this is going to give us a
plot of the misclassification
rate per trees.
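The plot can be drawn from the fit statistics table saved from the forest run; the column names (Trees, MiscTrain, MiscValid) are my guesses at what that table contains:

```sas
/* Plot training vs. validation misclassification rate by number of trees */
proc sgplot data=fitstats;
  series x=trees y=misctrain / legendlabel='Training';
  series x=trees y=miscvalid / legendlabel='Validation';
  xaxis label='Number of trees';
  yaxis label='Misclassification rate';
run;
```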
So here you can see that
as we're adding trees,
we are getting a lower and
lower misclassification
rate until around this
area, maybe like 18 trees.
And we're kind of
plateauing at this point.
So there might be some
benefit to continuing.
We might hit a lower dip
like the one we see here,
where we get a slightly better rate.
But overall this has plateaued out,
and adding more trees isn't really
helping us much in creating
a better predictive model.
Next, we're going to assess
the model performance.
And we're also going to
analyze the model using ROC
and lift charts.
So I'm going to select both
of these blocks of code.
And what this is doing is
generating information related
to the lift and the ROC.
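An assessment step along these lines is common in SAS Viya examples; the statement layout and the predicted-probability column names (P_BAD1, P_BAD0) are assumptions, and it presumes the scoring step wrote its predictions to a table called mycas.scored:

```sas
/* Compute lift and ROC information for each partition */
proc assess data=mycas.scored;
  input p_bad1;
  target bad / level=nominal event='1';
  fitstat pvar=p_bad0 / pevent='0';
  by _partind_;
  ods output rocinfo=rocinfo liftinfo=liftinfo;
run;
```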
So here you can see various
information for this partition,
for the lift.
And if we keep scrolling, we'll
also see the fit statistics.
So what I'm scrolling down to
is the plot that's generated.
So here we can
see the ROC curve.
So this is a relatively
decent curve.
We have the true positive rate
plotted against the false positive rate.
So this looks pretty decent.
The validation, of course,
isn't as good as the training.
If I scroll down
a little further,
you can see the lift chart.
So this also plots
the validation
against the training.
So that is it for the
SAS Studio example.
And now I'm going to be
moving over to Model Studio
to show you a similar
model using machine
learning pipelines.
All right.
So I'm in SAS Drive,
in Build Models.
I'm going to go to
Create A New Project.
So I'm going to load our data
set that we used in SAS Studio.
So when you're trying
to do this example,
you'll have access
to this data set
and can just download
it and import it
the same way that I just have.
I'm going to name
the project example
and start with a blank template.
There are other
templates that you
can use such as the basic
template for class target
or the intermediate template.
This will start you off with a
few different machine learning
models.
So if you want to start
off with that, feel free.
But I'm going to start off
with a blank template that
just gives me the data node.
All right.
Now that we've opened our
project and our data is loaded,
it notifies us
that we must assign
a variable with
the role of target
in order to run the pipeline.
So our target variable
is here named BAD.
I'm going to switch its
role over to target.
OK.
So now that we've assigned
BAD to the target role,
we need to change a few of the
levels to their proper level.
So based on the way
that SAS reads in data,
it scans the first
few observations
and then assigns what
level it thinks it is.
So in this case, it has
assigned a few variables nominal
when they should be interval.
So I'm going to select
everything but job that
is nominal and change
it to interval.
So we have one more
that I need to change.
I'll deselect the
previous ones first.
All right.
OK.
So now our data
is how we want it.
However, like we did in
the previous example,
we're going to want to
impute those variables.
So the way that we
do it in SAS Studio
is different from Model Studio.
Here I'm going to
select the variables
that I want to impute.
So I'm going to
select three variables
that we're going to change
the imputation to be median.
So that is this
one, YOJ, and CLNO.
So I'm going to
change this where it
says Impute to be the median.
So this is similar to
what we did in SAS Studio,
except all we're doing here
is selecting from the dropdown.
All right.
For our other two variables
that we set to the mean,
that would be the
CLAGE and the DEBTINC.
All right.
So I'm going to change
those to be the mean.
All right.
So now that we have
our data like we want,
I'm going to go over
to the Pipelines tab.
So the pipeline
initially starts off
with just the data node,
since we selected the blank template.
So I'm going to expand
the nodes on the left.
Right.
So we have the Data
Mining Preprocessing.
So if you expand that,
we have a few options
for transforming or
exploring your data.
You can do variable
clustering, selection.
But we're just going to be
dragging over the Imputation
node.
So this is the node that
actually imputes the data.
When we set the features
earlier, what we were doing
is specifying what we
wanted to happen when we had
the imputation node in play.
If you don't have the imputation
node, this won't happen.
So we have this loaded up.
If you scroll down, you can see
the class and interval inputs.
These both have default
methods that we're
going to change
because we've already
set which variables that we
wanted to be imputed and how.
So I'm going to go ahead
and change both of those
to be none.
I'm also going to select
the Summary Statistics.
So this is going to give us
an idea of what it changed.
It'll also tell you the
number of missing observations
and what they were
replaced with.
So I'm going to run the
pipeline real quick.
All right.
So our pipeline has
finished running.
I'm going to open up
the Imputation node
by right-clicking and
selecting Results.
So here we have a
few different tables.
The first table is the
Input Variable Statistics.
So you can see the number of
variables that are missing.
I'm going to expand
this real quick so we
can see it a little better.
You have the number
that are missing.
You have the percent
that that represents
and the observations
that we have.
And you also have the mean,
the standard deviation,
and a few other statistics
related to that variable.
So I'm going to
close out of this.
Here you can see the imputed
variables that we'd selected.
They now have a new name to
indicate that we imputed them.
They also show you
the method that we
used for the imputation,
and the value
that it replaced, and
for how many observations
that it did replace.
There's some other potentially
useful information here.
But for now, I'm
going to close out
so we can get to
creating our model.
So now that we've
imputed our variables,
I'm going to open up the
Supervised Learning tab.
So under this you can see a
few different models available.
But I'm going to create a forest
like we did in the SAS Studio
example.
So I'm going to drag
it over and drop.
You do also have an
option where you can right
click on the node.
And if you want to add
something under it,
select Add Child Node.
And then go to the tab.
And you'll be given a list of
things that would be under it.
So if that's an option
that you want to use,
you have two ways of connecting
new nodes to your pipeline.
OK.
So here's our forest node.
So we see by default
it has 100 trees.
We're going to
change that to be 50
to match what we used
in the SAS Studio example.
So here it just shows
you different options.
You have the tree
splitting options
so you can specify what the
class target criterion is
so you can change that.
You can also specify the maximum
depth and the minimum leaf count,
which I'll set to five like we had
in our SAS Studio example.
And you can also specify what
to do with missing values.
So this is just saying go ahead
and use them in our model.
All right.
So now I'm going
to run our model.
Now I could have
selected run pipeline,
but I didn't in this
case, because we only
have one model here.
So the Model Comparison
node wouldn't provide us
additional information
compared to just the results
of the forest.
All right.
So now our model has run.
I do want to show one
additional thing before I
look at the results.
So if you were
interested in trying
a few different
parameters, you can go
to the Perform Autotuning tab.
And you can just turn this on.
And you'll have
the same variables
except now most of
them are in ranges.
So if you were to
run this, it would
test out multiple
hyperparameters
and then present you with the
top 10 models, which you then
could use to determine how
you want your model to be.
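The programmatic counterpart to that switch is PROC FOREST's AUTOTUNE statement, shown here with its defaults as a sketch; the input lists are the same assumed ones used earlier:

```sas
/* Let the procedure search hyperparameter ranges instead of fixing them */
proc forest data=mycas.hmeq_prepped;
  input loan mortdue value yoj derog delinq clage ninq clno debtinc / level=interval;
  input reason job / level=nominal;
  target bad / level=nominal;
  autotune;  /* searches ranges for trees, depth, etc. and reports the best models */
run;
```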
So in this case,
we just ran it once
with the default parameters,
except that we changed the
number of trees to 50.
So I right-clicked.
I'm going to select Results.
And here we can see a few
different plots and tables.
So this first plot here is
the average squared error.
I'm going to expand this.
To show you something similar
to what we saw on SAS Studio,
all I'm going to do
is click the dropdown
and change it to the
misclassification rate.
So there's a similar plot
that we showed there.
But it also includes the
out of bag and test sets.
So this is split into
three different sets:
training, validation, and test.
And each of them is represented
here with its misclassification rate.
So if you look, these are
approximately the same
as what we saw in the
SAS Studio example.
The algorithms may be
slightly different.
But overall, you can
see that it didn't
make that much of a difference.
So we have around
8% misclassification
for the train.
And we have around 11%,
12% for the validation set.
This table here is the
variable importance.
So this was created
when we ran the forest.
And it shows you the
importance of each
of these variables to
predicting the target variable.
So this was also something
that we saw in SAS Studio.
But this is something that's
autogenerated in Model
Studio when you run your model.
So you have access
to all of this
without having to add any code
or run any additional code.
It's run automatically
when you run your model.
So now I'm going to switch
over to the Assessment tab.
And this gives us our lift
reports and our ROC reports.
So you can see the plots here.
You can scroll down and
also see the fit statistics.
So this is all information
that we created in SAS Studio.
But this is automatically
given to you
when you run your
nodes in Model Studio.
So if you like this video and
you want more tips like this,
subscribe to our channel.
If you want any
related information,
you can check the links
in the description below.
And if you have any comments
or questions, feel free
to leave them.
And thanks for watching.
