Hello again! You'll recall last time we ran
the example Knowledge Flow template that built
two different classifiers in the Spark environment,
a Naive Bayes classifier and a JRip classifier.
But we found that it did something different
with the JRip classifier than it did with
Naive Bayes.
It actually ended up building four separate
JRip classifiers and wrapping them up in a
voted ensemble.
In this lesson, we'll take a closer look at
exactly what happened there and the reason
for the difference between the Naive Bayes
example and the JRip example.
So, let's get started. Okay.
Here is that Knowledge Flow template we ran
last time, and, if we open up the TextViewer,
we can refresh our memories as to what the
results looked like.
At the top, we had the Naive Bayes classifier
on our hypothyroid dataset, and we ended up
with one model, as we expected.
The second classifier that ran in this example
was JRip, and, as we can see from the results,
we ended up with four sets of JRip rules,
which were combined in a voted meta classifier.
So, the question was why did this happen?
Let's take a closer look at it, and I'll attempt
to explain.
Okay, so here is a slide that attempts to
describe how the processing occurs in Spark
for our classifier job.
On the left-hand side, we have the ArffHeaderSparkJob,
which initially loads the data into main memory for us.
It loads the CSV data and creates one of Spark's
resilient distributed datasets with a number
of splits or partitions, as they're called
in Spark.
Each partition is processed by a worker out
on the cluster, or, in our case, by a CPU
core on our desktop machine.
The map tasks process these partitions and
create models or, to be more precise, they
create partial models.
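To give a feel for what that map phase looks like, here is a minimal sketch in Java, Weka's own language: one partial model is trained per Spark partition. The class name, the buildFromPartition helper, and the RDD of Instance objects are illustrative stand-ins, not distributed Weka's actual API.

    import java.util.Collections;
    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instance;
    import weka.core.Instances;

    public class PartialModelSketch {

      // Illustrative helper: train one model on the rows of a single RDD
      // partition. Distributed Weka's real job classes do more bookkeeping,
      // but the shape of the computation is the same.
      static NaiveBayes buildFromPartition(Iterator<Instance> rows,
                                           Instances header) throws Exception {
        Instances data = new Instances(header, 0);
        while (rows.hasNext()) {
          data.add(rows.next());
        }
        NaiveBayes partial = new NaiveBayes();
        partial.buildClassifier(data); // a partial model: it has only seen this partition
        return partial;
      }

      // Map phase: one task per partition, each emitting one partial model.
      static JavaRDD<NaiveBayes> mapPhase(JavaRDD<Instance> dataset,
                                          Instances header) {
        return dataset.mapPartitions(rows ->
            Collections.singletonList(buildFromPartition(rows, header)).iterator());
      }
    }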
In the case of Naive Bayes, the algorithm
is fairly simple, and the model is comprised
of a number of probability estimators, all
of which can be computed incrementally and
additively.
So, when it comes to combining these probability
estimators, we can simply add together their
statistics, and we end up with one final model,
which is identical to the one we would get
if we ran Naive Bayes sequentially on the
dataset on our desktop.
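In recent versions of Weka, this additive combination is captured by the Aggregateable interface, which NaiveBayes implements. Here is a minimal sketch of the corresponding reduce step, assuming the list of partial models comes from the map phase sketched above; the class and method names are illustrative.

    import java.util.List;

    import weka.classifiers.bayes.NaiveBayes;

    public class NaiveBayesReduceSketch {

      // Reduce step: fold all of the partial models into one. NaiveBayes
      // implements weka.core.Aggregateable, so the counts behind its
      // probability estimators can simply be added together.
      static NaiveBayes reduce(List<NaiveBayes> partials) throws Exception {
        NaiveBayes combined = partials.get(0);
        for (int i = 1; i < partials.size(); i++) {
          combined = combined.aggregate(partials.get(i));
        }
        combined.finalizeAggregation();
        return combined; // same model as a sequential run over the full dataset
      }
    }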
In the case of other types of classifiers,
such as tree learners and rule learners like
JRip, it's somewhat more difficult to aggregate
these partial models into one final model
that is the same as the result of a sequential run.
In that case, Weka takes the easy route: it
takes the partial models, which are learned
on the splits of the data, and combines them
by simply making a voted ensemble out of them.
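Here is a minimal sketch of that combination step using Weka's standard Vote meta classifier; the already-trained partial JRip models are assumed to come from the map phase, and the class and method names are illustrative.

    import java.util.List;

    import weka.classifiers.Classifier;
    import weka.classifiers.meta.Vote;
    import weka.classifiers.rules.JRip;

    public class VotedEnsembleSketch {

      // The partial JRip models are already trained, one per partition, so
      // they are simply installed as the members of a Vote meta classifier.
      static Vote combine(List<JRip> partials) {
        Vote ensemble = new Vote();
        ensemble.setClassifiers(partials.toArray(new Classifier[0]));
        return ensemble;
      }
    }

At prediction time, the ensemble combines its members' outputs (by default, Vote averages their probability estimates), which is why the TextViewer showed four rule sets wrapped up in a single voted model.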
Okay. We're nearly finished with this example,
but before we leave, there are a couple more
aspects to touch on.
One is output.
We've seen some output in the TextViewer here
in the Knowledge Flow, but I mentioned earlier
that the jobs in distributed Weka also store
output on the file system.
This can be our local file system or, if we're
using Hadoop and Hadoop's distributed file
system, it could be stored in HDFS.
Anyway, if we take a look at the source of
the data here, our ArffHeaderSparkJob, recall
that we saw some setup for our input files
and also our output directory down here.
This is where the jobs will store their output.
Let's take a look at that on the file system.
If I go to my home directory, I can see the
directory that was specified in the configuration,
sparkOutput.
If we go into that directory, we can see a
couple of subfolders.
One is where the ARFF header was stored by
the ARFF header job.
We can look in there, and we can see hypo.arff,
and, if we go back out and look in this model
directory, we can see the models that were
created by distributed Weka.
So, we have one for the Naive Bayes and one
for the voted JRip model.
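Because the stored models are just serialized Weka classifiers, they can be read back into your own code with Weka's SerializationHelper. This is a minimal sketch; the outputModel.model file name is a hypothetical placeholder for whatever you actually find in the model directory.

    import weka.classifiers.Classifier;
    import weka.core.SerializationHelper;

    public class LoadModelSketch {
      public static void main(String[] args) throws Exception {
        // The file name is illustrative; use whatever file distributed Weka
        // actually wrote into your model directory.
        Classifier model = (Classifier) SerializationHelper.read(
            System.getProperty("user.home") + "/sparkOutput/model/outputModel.model");
        System.out.println(model); // e.g. prints the voted JRip rule sets
      }
    }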
Okay.
The other thing we haven't mentioned is this job in the middle here called the RandomlyShuffleDataSparkJob.
So, what does this do? Well, as the name suggests,
it's a job that, in a distributed fashion,
randomly shuffles the order of the rows or
instances in the dataset that was being processed.
In some cases, it's advantageous to do this
random shuffle.
There are certain classifiers for which this
is beneficial.
Naive Bayes isn't one of them.
It's not affected by the order of the instances
in the dataset that it learns from.
However, other classifiers, like trees and
rules, can be.
In the worst case, we might end up with a partition
of our Spark RDD where certain class values
aren't represented at all, if the data has
perhaps been collected in some systematic
way.
For that reason, it can be beneficial to randomly
shuffle the order of the instances before
learning a classifier like a rule set or a
decision tree.
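As a rough illustration of the idea, and not what RandomlyShuffleDataSparkJob literally does internally, one simple way to shuffle an RDD of instances with Spark's Java API is to key each row with a random number and sort:

    import java.util.Random;

    import org.apache.spark.api.java.JavaRDD;

    import scala.Tuple2;

    import weka.core.Instance;

    public class ShuffleSketch {

      // Key every row with a random number, sort by the key, then drop it.
      // Note: a single seeded Random is fine for a sketch, but a careful
      // implementation would seed each partition differently.
      static JavaRDD<Instance> shuffle(JavaRDD<Instance> dataset, long seed) {
        Random rand = new Random(seed);
        return dataset
            .mapToPair(row -> new Tuple2<>(rand.nextDouble(), row))
            .sortByKey()
            .values();
      }
    }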
Before we finish today's lesson, let's take
a look at one more of the example templates
that come with distributed Weka.
We'll take a look at the one that cross-validates
two classifiers.
So, this runs an evaluation, a cross-validation,
inside of distributed Weka.
If we load this one--and make it a little
larger here--we can see that, apart from
the ArffHeaderSparkJob and the RandomlyShuffleDataSparkJob,
there are now two components called the
WekaClassifierEvaluationSparkJob.
These are job entries that will perform a
cross-validation out in Spark for us.
There are two entries here, two WekaClassifierEvaluationSparkJob
entries, because we're comparing two classifiers
under cross-validation.
We're comparing the Naive Bayes classifier
again, and, in this case, the Random Forest
classifier.
Both of these will be evaluated under cross-validation
inside of Spark.
Let's run the flow and see what happens.
If we switch to the log, we can watch some activity.
This will take a little bit longer than the
previous job, as we're running ten-fold cross-validation.
Let's take a look at the results in the TextViewer
here.
Okay, we've got two entries, one for Naive
Bayes, which is this first one here, and the
second one is for the Random Forest classifier.
As we can see in the textual output, the Naive
Bayes results look exactly the same as if you
were to run a cross-validation in desktop Weka,
and the same is true for the Random Forest classifier.
Let's consider how distributed Weka performs
a cross-validation.
It actually involves two separate phases,
or passes over the data.
Phase one involves model construction, and
phase two involves model evaluation.
Let's consider a simple three-fold cross-validation.
We know that the dataset gets split up into
three distinct chunks during this process,
and that models are created by training on
two out of the three folds.
So, we actually end up with three models created,
each of them trained on two-thirds of the
data, and then we test them on the fold that
was held out during the training process.
In this example, our dataset in Spark is made
up of two logical partitions.
We can think of each of these partitions as
containing part of each cross-validation fold.
In this case, they would hold exactly half
of each cross-validation fold, because we
have two partitions.
Each partition, as we know, is processed by
a worker, or a map task.
In the model building phase, the workers will
build partial models: inside each worker, a
model is created for each training split of
this cross-validation.
So, these will be partial models.
For example, we'll have the first model created
on folds two and three, or parts of folds
two and three.
Similarly, model two will be created on folds
one and three, and model three on folds one
and two.
In each of these workers, these models are
partial models, because they've only seen
part of the data from those particular folds.
In our example here, the map tasks will output
a total of six partial models, two for each
training split of the cross-validation.
This allows us to get parallelism involved
in the reduce phase.
We can run as many reduce tasks as there are
models to be aggregated.
Each reducer will have the goal of aggregating
one of the models.
So, in our example here, the six partial models
are aggregated to the three final models that
you would expect from a three-fold cross-validation.
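To pin down that bookkeeping, here is a hedged sketch of phase one in plain Java, with a list of Instances objects standing in for the Spark partitions and a simple modulo rule standing in for the real fold assignment; the class and method names are hypothetical, and distributed Weka's actual job classes handle all of this internally.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class CvPhaseOneSketch {

      // Map phase: each "partition" emits one partial model per training
      // split, keyed by the fold that was held out. Reduce phase: all
      // partial models sharing a key are aggregated into one final model.
      static Map<Integer, NaiveBayes> phaseOne(List<Instances> partitions,
                                               int numFolds) throws Exception {
        Map<Integer, List<NaiveBayes>> partialsByFold = new HashMap<>();
        for (Instances partition : partitions) {
          for (int fold = 0; fold < numFolds; fold++) {
            Instances train = new Instances(partition, 0);
            for (int i = 0; i < partition.numInstances(); i++) {
              if (i % numFolds != fold) { // simple modulo fold assignment
                train.add(partition.instance(i));
              }
            }
            NaiveBayes partial = new NaiveBayes();
            partial.buildClassifier(train); // trained on this partition's share only
            partialsByFold.computeIfAbsent(fold, k -> new ArrayList<>()).add(partial);
          }
        }
        // Reduce phase: with two partitions and three folds, this turns
        // six partial models into the three final models.
        Map<Integer, NaiveBayes> finalModels = new HashMap<>();
        for (Map.Entry<Integer, List<NaiveBayes>> entry : partialsByFold.entrySet()) {
          List<NaiveBayes> partials = entry.getValue();
          NaiveBayes combined = partials.get(0);
          for (int i = 1; i < partials.size(); i++) {
            combined = combined.aggregate(partials.get(i));
          }
          combined.finalizeAggregation();
          finalModels.put(entry.getKey(), combined);
        }
        return finalModels;
      }
    }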
The second phase of cross-validation is somewhat
simpler than the first.
It takes the models learned from the first
phase and applies them to the holdout folds
of our cross-validation in each of the logical
partitions of our dataset.
It uses them to evaluate each of those holdout
folds.
The reduce task then takes all of the partial
evaluation results coming out of the map tasks
and aggregates them to one final full evaluation
result, which is then written out to the file system.
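Weka ships an AggregateableEvaluation class for exactly this kind of merging. Here is a minimal sketch of that final reduce step, assuming the partial Evaluation objects coming out of the map tasks; the class and method names of the sketch itself are illustrative.

    import java.util.List;

    import weka.classifiers.evaluation.AggregateableEvaluation;
    import weka.classifiers.evaluation.Evaluation;
    import weka.core.Instances;

    public class EvaluationReduceSketch {

      // Reduce step of phase two: merge the partial Evaluation objects
      // produced by the map tasks into one full evaluation result.
      static Evaluation reduce(Instances header,
                               List<Evaluation> partials) throws Exception {
        AggregateableEvaluation full = new AggregateableEvaluation(header);
        for (Evaluation partial : partials) {
          full.aggregate(partial); // merges counts, confusion matrix, statistics
        }
        return full; // full.toSummaryString() matches a sequential cross-validation
      }
    }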
Over the last couple of lessons, we've looked
at some of the example Knowledge Flow templates
that come with distributed Weka.
We've looked at one that creates ARFF metadata
and summary statistics for a dataset.
We've looked at how distributed Weka builds
models and how it performs cross-validation.
The next lesson will wrap things up and leave
you with some directions for what to look
at next with respect to distributed Weka.
