Hello! In the last lesson, we installed distributed
Weka and ran our first distributed Weka for
Spark job that analyzed and computed a header
for the hypothyroid dataset.
In this lesson, we'll take a closer look at
how these jobs are configured, and we'll run
a few more jobs that use Weka classification
algorithms to learn classification models
on the hypothyroid data.
So, let's get started.
Okay, here is the job that we ran last time.
I've loaded it back up into the Knowledge
Flow here.
Let's take a look at how it's configured.
So, if I double click on the ArffHeaderSparkJob
component here on the canvas, it will bring
up the configuration dialog, which is made
up of two tabs here.
The first tab is entitled Spark configuration,
and, as the name suggests, there are a bunch
of options here related to how the cluster
is configured.
So, up at the top, we have a couple of options
that are related to how Spark handles or manages
memory out on the cluster.
We won't go into detail about exactly how
those work, but, suffice it to say that the
defaults that are set here work reasonably
well for most situations.
Under that, there is something called the
InputFile parameter, and that's the most important
one, because it specifies the dataset that
we're operating on.
You can see here that it's preconfigured to
point to the hypothyroid data, which is in
this sample data directory, which is, in turn,
in the package installation directory for
distributed Weka for Spark.
Then we have the masterHost parameter.
This is where you can specify the address
of the machine that the master Spark process
is running on.
In our case, we don't have a Spark cluster,
we're running locally on our desktop machine,
and Spark is treating each of the processing
cores in our CPU as a processing node.
So, that's why we have local specified
here, and in square brackets, we have an asterisk,
which tells Spark that we want to make use
of all of our available processing cores.
If we wanted to limit the number of cores
that Spark uses on our desktop machine, then
we could place a number inside those brackets
instead.
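Spark itself parses master URLs like this internally, but as a rough sketch of what the "local" setting with an asterisk or a number means, you could imagine something like the following (the function name is hypothetical, purely for illustration):

```python
import os
import re

def resolve_local_cores(master_url):
    """Illustrative sketch: how many worker threads a 'local[...]'-style
    Spark master URL implies. An asterisk means all available cores;
    a number caps the core count. (Spark does this parsing itself.)"""
    match = re.fullmatch(r"local\[(\*|\d+)\]", master_url)
    if match is None:
        raise ValueError("not a local[...] master URL: " + master_url)
    spec = match.group(1)
    return os.cpu_count() if spec == "*" else int(spec)
```

So "local[*]" resolves to every core on the desktop machine, while "local[2]", say, would limit Spark to two worker threads.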
Similarly, the masterPort would be used if
we were running against a cluster and we needed
to provide the port that the Spark master
process is listening on.
Further down in the list here we can see something
called the output directory.
This is where Weka will be saving any results
generated by the job.
Okay. The last parameter we'll take a look
at here is called minInputSlices.
With this parameter, we're telling Spark how
many logical chunks to split the dataset up into.
So, Spark will create partitions or slices
of the dataset and process those, and it uses
one worker task running on a core on the CPU
of a processing node in order to process a
given partition.
Here we can really have some control over
the level of parallelism applied to our dataset.
If we had a processing cluster of 25 machines
where each machine had a CPU with four cores,
then Spark would be working at the maximum
efficiency if we chose 100 input slices or
fewer for our dataset.
That way, we would have the entire dataset
processed in one wave of tasks.
Let's take a quick look at the second configuration
panel in the dialog here entitled ArffHeaderSparkJob.
So, if we click on that.
This relates to how Weka will parse the CSV
file of hypothyroid data.
There are a lot of options here related to
CSV parsing, so what the field separator is,
what the date format might be if there are
date attributes, and so forth.
We can also tell Weka, since this is a headerless
CSV file, what the names of the attributes
are in the data, and we can do that in one
of two ways.
Either by typing a comma-separated list of
attribute names in this first text box at
the top here or we can point Weka to a file
on the file system that contains the names
of the attributes.
We're using that option in this case by saying
there is a file called hypothyroid.names on
the file system.
The format of that file is a simple one.
It just contains one attribute name per line
in the file.
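Since the format is just one attribute name per line, parsing such a file is trivial. The helper below is a hypothetical stand-in (Weka's loader does this itself), using a few of the hypothyroid attribute names as sample data:

```python
import os
import tempfile

def read_attribute_names(path):
    """Read a '.names'-style file: one attribute name per line,
    blank lines ignored. (Hypothetical helper for illustration.)"""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demo: write a tiny .names file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".names", delete=False) as tmp:
    tmp.write("age\nsex\nTSH\n")
names = read_attribute_names(tmp.name)
os.unlink(tmp.name)
```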
We also have this option called pathToExistingHeader
here.
If we have already run this job and created
an ARFF header file and computed all the summary
statistics, then there is no need for the
job to run again, but we may have it as a
component in an overall larger job.
In that case, we can provide the path to the
header file that was created in a previous
execution, and Weka will then realize that
it does not need to regenerate that file.
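The pathToExistingHeader behavior is essentially a caching check: if the header file from a previous run is already on disk, skip the expensive summary pass. A minimal sketch of that logic, with a hypothetical build_header callback standing in for the actual job:

```python
import os

def ensure_arff_header(header_path, build_header):
    """Sketch of the pathToExistingHeader idea: reuse a header file from
    a previous run if it exists; otherwise call build_header (a
    hypothetical callback for the expensive summary-statistics pass)."""
    if os.path.exists(header_path):
        return header_path          # reuse the cached header
    build_header(header_path)       # otherwise regenerate it
    return header_path
```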
And the last option here is one where
we tell it what we want to call that ARFF
header file when it gets created.
In this case, we're calling it hypo.arff.
All right, now let's try running another one
of the example flows that are included with
the distributed Weka for Spark package.
Up here in the templates menu, let's choose
the "Spark train and save two classifiers" flow.
Let's load that one in.
All right, here it is. Make it a little bit larger.
Okay, so what do we have in this flow?
Well, we have--as we can see on the left-hand
side here--
the ArffHeaderSparkJob again.
This time, however, it is configured, if we
take a look, to make use of an existing header
file, if we happen to have already run this
particular job on the hypothyroid dataset.
As we can see here, this path is now filled
in, so we can take advantage of that existing
header file that we may have already generated.
If that header file does exist on disk,
then it will be loaded and used,
and then the only point of this job entry
in this particular example is to load the
CSV data and parse it into an internal Spark
format that can then be used by the downstream
job entries in the rest of this flow.
So, that's ready to go pretty much.
What we have next in the flow is something
called the "WekaClassifierSparkJob", and, in
fact, we have two of these entries in this flow.
These components are executed in sequence.
The ArffHeaderSparkJob will run first.
When it succeeds, it triggers execution of
the next component downstream in the flow,
so the WekaClassifierSparkJob will execute.
We can see from this Spark job that there
is a text connection, so it will produce a
textual description of the classifier that
it learns, which will then be displayed in
the TextViewer here.
That also gets saved out along with the model
itself, the actual Weka model, to the file
system in our output directory, and we'll
take a look in there once we've finished looking
at this flow and executing it.
There's also a second Weka Spark job here
on the right-hand side, and this learns a
different classifier.
The first one learns a Naive Bayes classifier,
and the second one learns a JRip rule set.
In between the two, we have another job that
gets executed.
This is called the RandomlyShuffleDataSparkJob.
We'll discuss exactly what that's doing a
little later on.
Let's take a quick look at the configuration
dialog for the first of these WekaClassifierSparkJobs,
the one that trains Naive Bayes.
So, we open that up--
make it a little bit larger here.
What we can see is a whole bunch of settings
that we can change.
We have some stuff at the top here related
to telling the system what the class attribute
is and what we want to call the serialized
model file that gets written out as the output
of this job to our output directory.
Down at the bottom here, we can see that we
have an option that allows us to choose a
classifier we want to run and also configure
its options, just like in regular Weka.
In this case, we've chosen the standard Naive
Bayes algorithm.
You can see that we also have an option here
that allows us to combine some filtering with
this classifier, as well.
So, we can opt to specify one of Weka's preprocessing
filters here and have that applied to the
data before it reaches the classifier.
We can combine multiple filters if we so desire,
by using Weka's MultiFilter, which
allows us to specify a list of filters to apply in sequence.
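The idea of chaining filters ahead of a classifier is just sequential function application. Here is a minimal pure-Python sketch of that pattern (the filter functions are hypothetical stand-ins for Weka filter classes, not Weka's actual API):

```python
def apply_filters(rows, filters):
    """Apply a list of preprocessing filters in order, the way a
    MultiFilter chains filters before the data reaches the classifier.
    Each filter maps a list of rows to a new list of rows."""
    for f in filters:
        rows = f(rows)
    return rows

# Example filters: drop rows with missing values, then rescale column 0.
drop_missing = lambda rows: [r for r in rows if None not in r]
scale_first = lambda rows: [[r[0] / 10.0] + r[1:] for r in rows]

data = [[10, "a"], [None, "b"], [20, "c"]]
filtered = apply_filters(data, [drop_missing, scale_first])
```

The order matters, of course: dropping the row with a missing value first means the scaling filter never sees a None.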
All right, there are a number of other options
here.
I won't describe them at this point in time.
Let's go ahead and run this flow now.
We have two classifiers that are going to
be trained, Naive Bayes and the JRip rule set.
So, we can start this running.
Before I do so, I'll open the log so
we can watch the activity, and then we
launch it now.
All right, so it's processing away.
A lot of output in the log, and now it's completed.
Let's take a look in the TextViewer, which
has picked up the textual output of these
two steps, and see what we have.
Let's look at the results.
I'll make that a little bit larger.
Okay, so the first entry here is, as expected,
a Naive Bayes classifier model learned on
our hypothyroid data.
This is very similar to what you would see
if you just ran standard desktop Weka.
The second entry in our result list here
is for the JRip classifier.
It doesn't actually say "JRip" here.
Instead it has something called "weka.classifiers.meta.BatchPredictorVote".
That's a little bit interesting.
We'll take a look at this output.
And, what we have is, instead of a single
rule set, we have four separate rule sets,
and they've been combined into an ensemble:
a Vote ensemble learner, which has four separate
sets of JRip rules as its base learners.
We can see the individual rules here in this
list.
So, what's happened in this case?
Why has it done this?
Why has it learned one Naive Bayes model when
we apply Naive Bayes, but it's learned four
sets of JRip rules when we applied the JRip
classifier?
To answer that question, we'll have to learn
a little bit about how distributed Weka learns
a classifier when it runs on Spark, but we'll
leave that to the next lesson.
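Without getting ahead of the next lesson, the way a Vote ensemble combines its base models is simple majority voting, which we can sketch in a few lines. The toy "models" below are hypothetical stand-ins for the per-rule-set predictions, just to show the mechanism:

```python
from collections import Counter

def vote_predict(models, instance):
    """Majority vote over base models, as a Vote ensemble does with its
    base classifiers. Each 'model' here is just a function from an
    instance to a class label (hypothetical stand-ins for JRip rule sets)."""
    predictions = [model(instance) for model in models]
    return Counter(predictions).most_common(1)[0][0]

# Four toy "rule sets" that mostly agree:
models = [
    lambda x: "negative" if x < 5 else "hypothyroid",
    lambda x: "negative" if x < 6 else "hypothyroid",
    lambda x: "negative" if x < 5 else "hypothyroid",
    lambda x: "negative",  # one dissenting, overly simple model
]
```

For an instance where three of the four rule sets agree, the majority label wins, which is exactly how disagreements between the base learners get resolved.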
In the meantime, you can try some different
algorithms, some different classifiers, in
the WekaClassifierSparkJob entries in the job
we were just looking at and see what the results
are on the hypothyroid dataset.
All right, until next time.
