Hello, I’m Wendy Czika from SAS
here to show you how you can
use the SAS Code node in SAS
Visual Data Mining and Machine
Learning pipelines, as part of
the Model Studio application.
The SAS Visual Data Mining
and Machine Learning pipeline
provides an easy-to-use,
interactive environment
to not only build complex
machine learning models
but also to
visualize the process
and create reusable pipelines.
This environment is
similar in many ways
to SAS Enterprise
Miner for those
of you familiar
with that, but this
lives in SAS Viya, which
modernizes the SAS platform.
In this pipeline, you can see
that sequential steps were
performed such as imputing
missing values, selecting
important variables to use,
followed by several candidate
predictive models including
an ensemble model.
The last step is to
compare all the models
and determine the champion for
this data from this pipeline.
Each node within the pipeline
has a corresponding set
of properties that can be tuned
for your individual needs.
Here you see the properties
associated with the Forest node
in the right-hand panel.
In addition, if you expand
the panel on the left,
you see a very comprehensive
suite of nodes that are
available within the SAS Visual
Data Mining & Machine Learning
product – for preprocessing
your data, supervised learning,
and more.
While there is a
tremendous amount
of functionality available
at your fingertips
through this suite of
nodes and the numerous
properties within
each node, you may
be looking to add functionality
to your model building
process that is not covered
within a defined node
in the product.
How can you extend the pipelines
within SAS Visual Data Mining
& Machine Learning to
include this custom code?
The SAS Code node of course!
As you can see from
the screenshot,
the SAS Code node is available
in the Miscellaneous grouping
of nodes on the left.
This means it can be
dragged and dropped anywhere
within a pipeline.
Just like any of the
other Model Studio nodes,
the SAS Code node has
properties associated with it
as you can see on the right.
Let me quickly point
out a few details
of the coding
environment itself.
This is the editor you will
be taken into when you click
“Open” from the property
sheet of the code node.
Notice on the left-hand
side that macros and macro
variables are defined for
you, grouped into categories.
These are included to help
with both productivity
and generalizability.
And note there are two panes shown
here - one for training code
and one for scoring code;
I will talk about that more
in a minute when
I show an example.
The editor includes syntax
coloring and auto-complete –
so here is what you would see as
you are typing “Proc P-r-i-n”.
You can also open
an existing piece of code
you have saved to a file,
copy and paste, and so on,
just as you would in
any SAS coding environment.
Now let’s talk about three
of the ways that you would
typically use a SAS Code node:
as a preprocessing node that
creates new variables and/or
modifies the metadata about
the variables, as a supervised
learning node to create
a predictive model, or for
tasks like data summarization
and visualization.
The first use case
highlights using the SAS Code
node for pre-processing your
data prior to model building.
This can be anything from
tasks for feature engineering
to filtering or
subsetting your data.
You can also change metadata
for any variable other than
your target – examples of this
include changing the variable’s
role, measurement level,
imputation method, etc.
You can do this from the
Data tab of Model Studio
or in a Manage Variables node,
but you might need the ability
to change it programmatically.
If you’re familiar with using
the SAS Code node in SAS
Enterprise Miner, you need
to know about an important
difference.
Unlike Enterprise Miner,
Model Studio does not store
a copy of the data for each node,
in order to reduce
the space needed.
This means you can NOT modify
the training data directly
by changing values of a variable
or deleting observations.
The data you are
using in your pipeline
within SAS Visual Data
Mining and Machine Learning
can only be modified by creating
new variables via scoring code.
Later I will show an
example of how you do that.
Now, let's look at how
to use the SAS Code
node for supervised learning.
This is most helpful in
adding the flexibility
to pull in additional supervised
learning algorithms or options
into your process flow
that may not be represented
in an existing node.
There are a few tips I want
to mention for this use case.
First, you must properly
flag the SAS Code node
as a Supervised Learning node.
To do that, after adding
the node to your pipeline,
click on it and
select Move, which
will give you the choice
of Supervised Learning.
Once you move it there,
it will be treated
as a Supervised Learning node.
This means that if
you set up your code
to generate the scoring
code for calculating
your predictions – as
either DATA step (or DS1)
scoring code or
an analytic store,
using the macro
variables provided –
then assessment
of your model is
performed automatically,
and your model is
included in model comparison
along with any other
models in your pipeline.
Second, you have
the same options
for deploying your SAS Code node
model as for other Supervised
Learning models.
Finally, model
interpretability properties
will become enabled
for you to turn on
to perform various methods
to help explain your model
- either in terms of what
variables are important
and how they affect your
predictions at a global level,
or locally by looking at
individual observations
or clusters of observations.
And finally, the
third use case is
using the SAS Code node
for data summarization
and visualization.
You can use your favorite SAS
procedures to create ODS output
or use the dmcas_report macro
to create your own plots based
on data in the pipeline.
This macro enables you to
create tailored reports
in your results such as bar
charts, series plots, pie
charts, scatter
plots, and tables.
In this case, the SAS Code node
will typically be a terminal
node in your pipeline since it’s
not creating any scoring code
or changing metadata.
Now, let's look at
examples of all these use
cases in pipelines
in the Model Studio
application of SAS Visual Data
Mining and Machine Learning.
We’ll start with one that
includes a preprocessing
example.
Note you can open the code
editor either by clicking here
on the node and selecting Open,
or from the properties panel.
As I mentioned, one
common pre-processing task
that you can perform
in the SAS Code node
is excluding a subset
of observations
from the training
data, or filtering.
This can be done with a
few simple lines of code.
Here is an example of filtering
where I want to exclude all
values of the variable JOB other
than “Sales” or “Self” from
my training data.
As I mentioned, any modification
of the data that we want
to pass to subsequent nodes must
be done within scoring code -
we can’t modify
the data directly.
So here is what
that looks like when
you want to subset your data.
The two panes you see here
are new in our latest release,
SAS Visual Data Mining
and Machine Learning 8.4.
This makes it very
straightforward to enter any
scoring code directly
in the Scoring editor,
so I’m actually going to
start there where you can see
the scoring code for creating
this filter flag that is 0
for these two values of the
variable JOB, and 1 otherwise.
In the Training code
pane, I’ve included a call
to the dmcas_metachange macro.
This macro sets the role of
the new filter_flag variable
we have just
created to be FILTER
and a corresponding
level of BINARY.
This tells all subsequent
nodes to use this variable
as a filtering definition.
Any node that follows this code
node will automatically pick up
this filter and observations
that have a value of JOB other
than SALES or SELF will be
excluded from the training
data.
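Based on the narration, the two panes might look something like this (the exact %dmcas_metachange argument names are an assumption based on the examples repository, so treat this as a sketch):

```sas
/* Scoring code pane (sketch): Model Studio wraps these statements
   in a DATA step, so only the assignment logic is needed. */
if JOB in ('Sales', 'Self') then filter_flag = 0;
else filter_flag = 1;
```

```sas
/* Training code pane (sketch; argument names assumed): register the
   new variable as a binary filter so downstream nodes apply it. */
%dmcas_metachange(name=filter_flag, role=FILTER, level=BINARY);
```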
This is just one example
of preprocessing.
As I mentioned, you could
engineer new features here that
are functions of other
inputs, or programmatically change
metadata – for example, in our
GitHub repository we have code
that calculates the
skewness of the inputs,
then sets the transformation
to Log for those with
skewness greater than 3,
which is applied in a
subsequent Transformations node.
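A rough sketch of that skewness-based approach follows. The transform= argument name, the ODS table column names, and the expansion of %dm_interval_input are assumptions; the working version lives in the GitHub repository.

```sas
/* Compute the skewness of the interval inputs (sketch; assumes
   %dm_interval_input expands to a list of interval input names). */
proc means data=&dm_data skewness stackodsoutput;
   var %dm_interval_input;
   ods output summary=work._skew_;
run;

/* For each input with skewness above 3, set its transformation to Log
   so a later Transformations node applies it (argument name assumed;
   ODS column names may differ by release). */
data _null_;
   set work._skew_;
   if skewness > 3 then
      call execute(cats('%dmcas_metachange(name=', variable, ', transform=LOG);'));
run;
```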
Next, we are going to look at
a Supervised Learning example.
As I mentioned earlier, to
have a SAS Code node used
as a Supervised Learning
node, you can click on it
and select Move.
You can see that for this
SAS Code node, which has
been moved, the Post-training
properties for Model
Interpretability are now
enabled for you to select.
In this example, I
am simply running
a gradient boosting model
using the gradboost procedure.
This is for illustration
only since we have a Gradient
Boosting node already.
Here you can see that we
are using the &dm_data macro
variable to reference the
project data, as well as many
of the provided macros
embedded within the PROC
GRADBOOST call.
For example, %dm_class_input,
%dm_dec_target,
%dm_interval_input.
Not only does this make
your code much simpler,
it also creates
generalizable code.
You do not see data,
variable names or file names
hardcoded here.
If I wanted to save this SAS
Code node out to my Exchange
and use it in other projects
within Model Studio with
entirely different data,
I could easily do so.
On line 24 you see
a macro variable
for the partition statement,
which automatically supplies
the values of
the partition variable
that define the validation
and testing observations.
Let’s continue to line
29 in this example.
Here you see the inclusion
of the SAVESTATE statement
on proc gradboost requesting
that an analytic store scoring
file be generated.
We are passing in the
dm_data_rstore macro variable
so this scoring file is saved
in the appropriate location.
For procedures that create
data step scoring code,
you would instead use
the CODE statement
as shown in the comment here
with the dm_file_scorecode
macro variable.
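Pulled together, the training code described above might look roughly like this. The exact text each %dm_ macro expands to varies by release, so this is a sketch rather than the verbatim on-screen code:

```sas
/* Sketch of a supervised learning SAS Code node (training pane). */
proc gradboost data=&dm_data;            /* &dm_data references the project data */
   %dm_class_input                       /* class (nominal) input variables      */
   %dm_interval_input                    /* interval input variables             */
   %dm_dec_target                        /* the target variable                  */
   &dm_partition_statement;              /* validation/testing partition values  */
   savestate rstore=&dm_data_rstore;     /* analytic store scoring file          */
   /* code file="&dm_file_scorecode"; */ /* alternative: DATA step score code    */
run;
```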
For this particular
procedure, you
can actually use either method
to generate score code.
However, since the scoring
code for gradient
boosting machines
can get quite long, in
this case you are probably
better off using
the analytic store,
which is a binary representation
of the scoring code.
After the proc
gradboost call, you
can see two dmcas_report
macros specified here
for adding some
reports to the results.
You can see they both use the
VarImp ODS table generated
from proc gradboost.
The first dmcas_report
call requests
that the table be displayed, and
the second requests a bar chart
of the relative importance.
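Those two report requests might be sketched as follows. The %dmcas_report argument names here are assumptions modeled on the examples repository, not a verified signature:

```sas
/* Capture the variable importance ODS table from PROC GRADBOOST
   (place this before the PROC GRADBOOST step). */
ods output VariableImportance=varimp;

/* First report: display the table itself (argument names assumed). */
%dmcas_report(data=varimp, reportType=Table,
              description=%nrbquote(Variable Importance));

/* Second report: a bar chart of relative importance (argument names assumed). */
%dmcas_report(data=varimp, reportType=BarChart,
              category=Variable, response=RelativeImportance,
              description=%nrbquote(Relative Variable Importance));
```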
Because we have flagged this
as a supervised learning node,
you will automatically also
see assessment tables and plots
in the results.
Before I open up the results
to see everything that’s been
created, I want to show you
that I had checked the PD Plots
checkbox in the Model
Interpretability group
of properties.
This requests that
partial dependency plots
be generated to help explain
the gradient boosting model.
So, let’s take a
look at the results.
Here in the results, you can
view the SAS code that was
submitted, and then the two
reports that were requested
with the dmcas_report macro
– the table of variable
importance, and the bar
chart, plotting the relative
importance in descending order.
We can click the Assessment
tab and look at fit statistics
and the Predicted Reports –
here we have an interval target,
but if we had a binary target we
would also see things like lift
charts and ROC-type plots.
Finally, you’ll notice
that there is a Model
Interpretability tab since one
of those properties was turned
on.
Here you can see, for the five
most important variables, how
their values affect the
prediction for our target,
along with the natural
language text generated
to explain each plot.
Again, you can get both
assessment and model
interpretability for
any model as long
as either data step or
analytic store scoring code
is generated to
create the column
or columns of predictions.
Our final example uses
the SAS Code
node after a preprocessing
node or supervised
learning node to visualize or
explore our data by creating
custom reports.
Let’s look at the code.
Here is just a simple data
step to do two things.
First, it takes a subset of the
data being passed into the node
– the project data could have
millions of observations,
which would be too hard
to read in a scatter
plot, so we are taking
a subset of 200.
And second, it computes
the residual variable
as the target minus its
prediction from the
gradient boosting model.
Then we use the dmcas_report
macro to create a scatter plot,
along with a
y-reference line at 0
to help visualize the
residuals by the predictions.
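That data step and report call might be sketched like this. The prediction column name P_&lt;target&gt;, the assumption that &dm_dec_target resolves to the target variable name, and the %dmcas_report arguments are all illustrative:

```sas
/* Subset the node's data and compute residuals (sketch). */
data work.resid;
   set &dm_data(obs=200);   /* keep 200 rows so the scatter plot stays readable */
   /* Assumes &dm_dec_target resolves to the target name and the gradient
      boosting prediction follows the P_<target> naming convention. */
   residual = &dm_dec_target - P_&dm_dec_target;
run;

/* Scatter plot of residuals vs. predictions with a y-axis reference
   line at 0 (argument names assumed). */
%dmcas_report(data=work.resid, reportType=ScatterPlot,
              x=P_&dm_dec_target, y=residual, yrefline=0,
              description=%nrbquote(Residuals vs. Predicted));
```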
And here are the results.
We can see that the residuals
seem to be pretty randomly
distributed around the reference
line and don’t exhibit any kind
of pattern – so that’s
just a nice quick check
of the residuals as an example
of one kind of visualization
you could do.
I just want to end here
with some helpful links,
particularly to our
GitHub repo called
“sas-viya-dmml-pipelines”,
where we have several SAS Code
node examples and continue
to add more.
Thank you!
