Hi, my name is Neil and I work for the data science
team at TIBCO.
TIBCO Data Science has launched AutoML, which
is an extension to the TIBCO Team Studio product.
AutoML reduces the workload of data scientists
by automating both feature engineering
and model building.
It’s delivered as a set of building blocks
that can be used as is, tweaked or optimized,
or even broken apart entirely into reusable modules.
Now, this is an excellent example of how extensible
the TIBCO Data Science platform is.
It’s built entirely using standard APIs
and extension points that the platform provides.
Let’s take a look at how this works.
This initial workflow calls on the AutoML
orchestrator.
To the left, you see our input data consisting
of insurance customers joined with fraudulent
claims and income estimated by zip code.
We’ll use AutoML here to see if we can predict
the fraudulent claims from the customer data.
In the orchestrator options, we select the dependent
column from our data.
We can choose if we want to use a shallow
or deep model complexity.
This choice controls the extent of the hyperparameter
search performed during the predictive modeling phase.
A shallow search is quicker and gives a good
initial idea of the most suitable model while
a deep search will normally provide more accurate
models at the cost of being slower.
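To make the shallow-versus-deep trade-off concrete, here is an illustrative sketch in Python with scikit-learn (my own example, not TIBCO's implementation): a "shallow" search sweeps a small hyperparameter grid, while a "deep" search expands it at the cost of many more model fits.

```python
# Sketch only: shallow vs. deep hyperparameter search with a grid search.
# The grids and dataset here are illustrative, not the product's settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Shallow: a handful of candidate settings for a quick first pass.
shallow_grid = {"n_estimators": [50], "max_depth": [2, 3]}

# Deep: many more combinations -- slower, but usually finds better models.
deep_grid = {"n_estimators": [50, 100, 200],
             "max_depth": [2, 3, 5],
             "learning_rate": [0.05, 0.1, 0.2]}

shallow = GridSearchCV(GradientBoostingClassifier(random_state=0),
                       shallow_grid, cv=3).fit(X, y)
deep = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    deep_grid, cv=3).fit(X, y)

print(shallow.best_score_, deep.best_score_)
```

The shallow grid above tries only 2 candidate settings while the deep grid tries 27, which is the essential difference the narration describes.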
Running the workflow automates the entire feature
engineering and modeling procedures needed here.
At the bottom, we see the leaderboard of top-scored
models and diagnostics.
A gradient boosting model is our top performer
with an accuracy of about 78%.
Other models are shown below.
The best model info shows more details on
our top model and its resulting parameters
such as the number of trees, depth, and learning rate.
We can quickly view the
variable importance score as well.
Here, it’s showing that income by zip code
is the most important variable.
Within the AutoML workflow, I can see individual
subflows used to create this model.
Each of these flows lives in my workspace and can be
opened via the URL shown directly in the workflow results.
I can view the subflows directly in my workspace
as well.
The really cool part about AutoML is that
it automatically built each of these subflows
on its own with no manual work needed from my end.
Let’s take a look at the first subflow on
data prep.
This subflow takes input data from the AutoML
orchestrator and goes through cleaning, removing
of missing values, and label encoding for
the target variable before finally being split
into a training and test set.
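The data-prep steps just described can be sketched in a few lines of Python. This is a hypothetical illustration with made-up column names, assuming pandas and scikit-learn, not the subflow's actual code:

```python
# Illustrative data prep: drop missing values, label-encode the target,
# then split into training and test sets. Column names are invented.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "income_by_zip": [52000, 61000, None, 48000, 75000, 39000],
    "claim_amount": [1200, 850, 430, 2100, 990, 1750],
    "fraudulent": ["yes", "no", "no", "yes", "no", "yes"],
})

df = df.dropna()  # remove rows with missing values

# Label-encode the target variable (yes/no -> 1/0).
df["fraudulent"] = LabelEncoder().fit_transform(df["fraudulent"])

# Split the cleaned data into training and test sets.
train, test = train_test_split(df, test_size=0.25, random_state=0)
```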
In the feature engineering subflow, predictors
are automatically transformed using multiple
strategies, allowing the downstream algorithms
to train better performing models.
Here, weight of evidence and impact encoding
are applied to selected categorical variables,
and normalization is used for continuous variables.
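As a hedged illustration of two of those strategies (not the product's code, and with invented column names), here is weight-of-evidence encoding for a categorical predictor alongside z-score normalization for a continuous one:

```python
# Sketch: weight of evidence (WoE) for a categorical variable and
# standardization for a continuous one. Data and names are illustrative.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west", "north"],
    "income": [40_000, 52_000, 61_000, 47_000, 83_000, 55_000],
    "fraud":  [1, 0, 0, 1, 0, 0],
})

# WoE per category: ln(share of non-events / share of events), with a
# small smoothing constant so empty cells don't divide by zero.
eps = 0.5
stats = df.groupby("region")["fraud"].agg(["sum", "count"])
events = stats["sum"] + eps
non_events = stats["count"] - stats["sum"] + eps
woe = np.log((non_events / non_events.sum()) / (events / events.sum()))
df["region_woe"] = df["region"].map(woe)

# Normalize the continuous predictor to zero mean and unit variance.
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()
```

Encodings like this let tree- and regression-based learners downstream work with numeric, well-scaled inputs, which is the point the narration makes.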
Next, for feature selection, stability selection
chooses candidate predictors that are likely
to have the most predictive power.
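The core idea of stability selection can be sketched as follows (my own simplified illustration, not the subflow's implementation): fit an L1-penalized model on many random subsamples and keep the predictors that receive a non-zero coefficient in a high fraction of the fits.

```python
# Sketch of stability selection: features selected consistently across
# subsampled L1-penalized fits are treated as stable candidates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

rng = np.random.default_rng(0)
n_rounds = 50
counts = np.zeros(X.shape[1])

for _ in range(n_rounds):
    # Fit an L1-penalized model on a random half of the data.
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_[0] != 0)

# Selection frequency per feature; keep those chosen often enough.
selection_freq = counts / n_rounds
stable = np.where(selection_freq >= 0.6)[0]
```

The 0.6 threshold is an arbitrary choice for this sketch; in practice the cutoff is a tunable parameter.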
The modeling subflow takes each of the feature
engineering strategies and trains models with
random forest, gradient boosting, and regularized logistic regression algorithms
across a set of hyperparameters.
The results are all compared based on model
accuracy to produce the leaderboard results.
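The leaderboard logic amounts to training each candidate algorithm, scoring it on held-out data, and ranking by accuracy. A minimal sketch, assuming scikit-learn and synthetic data rather than the product's actual pipeline:

```python
# Sketch of a model leaderboard: fit each algorithm, score on a held-out
# test set, and sort by accuracy. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "regularized logistic regression": LogisticRegression(max_iter=1000),
}

# Train each model and rank by held-out accuracy, best first.
leaderboard = sorted(
    ((name, model.fit(X_tr, y_tr).score(X_te, y_te))
     for name, model in candidates.items()),
    key=lambda item: item[1], reverse=True)

for name, acc in leaderboard:
    print(f"{name}: {acc:.3f}")
```

In the real product each algorithm is also swept over a set of hyperparameters per feature-engineering strategy; this sketch shows only the compare-and-rank step.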
But what if a data scientist wants to tweak
or optimize any of these subflows?
They can do so at any time by using TIBCO
Team Studio as they would with any other workflow.
Re-sampling nodes can be added, for example,
if I want to upsample to account
for imbalance in the target variable.
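As an illustration of that upsampling step (assumed, with invented data, not the node's internals), the minority class can be resampled with replacement until both classes are balanced:

```python
# Sketch: upsample the minority (fraudulent) class to match the majority
# class before training. Data is illustrative.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"claim_id": range(10),
                   "fraudulent": [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]})

minority = df[df["fraudulent"] == 1]
majority = df[df["fraudulent"] == 0]

# Draw minority rows with replacement until the classes are balanced.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
```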
After I make the changes I want, I simply
return to my orchestrated workflow, where I
can re-run the flow and each of its subflows.
Briefly reviewing the results, I see that
adjusting for the imbalance through re-sampling
improves the performance of the random forest model
and raises my prediction accuracy to about 79%.
So, this was a quick view of
how AutoML works in Team Studio.
If you’d like to learn more, check out the
links in the video description.
Thanks and catch you next time.
