Hello, again! Last time we learned a little
bit about what distributed Weka is and a little
bit about the MapReduce framework.
In this lesson, we're going to install distributed
Weka and start to use some of the components
that come with it.
So, let's get started.
Okay, here we are in Weka's package manager,
which I'm sure you're all familiar with by now.
What we're going to do here is scroll down
a little bit in the package list, and we're
going to install distributed Weka for Spark.
And here it is, just down here.
Okay. So, if I click install with this one
selected, it tells me that I'm going to install the following package: distributedWekaSpark version 1.0.2.
We click "Yes", and then "OK".
And then it tells me that, in order to install this, we also need to install distributedWekaBase 1.0.12.
At this stage, I'll click "No", because I
already have both packages installed, and we won't
show the installation here, because
the download for distributedWekaSpark is fairly large,
and it would take a little while.
Okay.
Once you've installed distributed Weka, you
need to make sure that you restart Weka, so
that the newly installed packages
are loaded correctly.
The main way to interact with distributed
Weka is through the Knowledge Flow environment.
This allows us to chain together processing
components in such a fashion that a given
component will not execute until the previous
one has completed executing.
It's also possible to use distributed Weka
from the command line, but the graphical user
interface provided by the Knowledge Flow is
a really convenient and easy way to edit the
many parameters that can be involved
in setting up a distributed Weka job.
Let's verify that our installation of distributed
Weka has proceeded correctly.
All right, so in the Weka Knowledge Flow environment,
you can see that on the left-hand side, in
the Design palette, there is a new folder called
"Spark".
If we open this up, we should find that there
are a bunch of new components available to us.
In particular, we have something called an ArffHeaderSparkJob,
we have a WekaClassifierSparkJob,
a WekaClassifierEvaluationSparkJob, and several
others as well, which we'll discuss shortly.
The distributedWekaSpark package also comes
with a bunch of example template flows.
If we look in the templates folder, which
is accessible from the templates button up
here in the tool bar, we can see a bunch of
entries that are prefixed with the word "Spark".
These are all example flows that we can execute
right out of the box.
They don't require a Spark cluster to be installed
and configured.
Spark has a very convenient local mode of
operation, which allows it to use all of
the cores in your CPU as processing nodes,
if you like.
So, we can execute these particular example
flows straight away without any further configuration.
They are ready to go.
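As an aside, if you're curious what local mode looks like at the Spark API level, here's a minimal sketch. This is just my own illustration, not part of distributed Weka, and the class and application names are made up. The key detail is the special master URL "local[*]", which tells Spark to run with one worker thread per available CPU core:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalModeDemo {
        public static void main(String[] args) {
            // "local[*]" is Spark's local-mode master URL: one worker
            // thread per available CPU core, no cluster needed.
            SparkConf conf = new SparkConf()
                    .setAppName("local-mode-demo")   // app name is made up
                    .setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println("Parallelism: " + sc.defaultParallelism());
            sc.stop();
        }
    }

That kind of local master is what lets the template flows run with no cluster setup at all.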
Before we start running distributed Weka examples,
I need to introduce the dataset that we're
going to be looking at.
We're going to take a look at the hypothyroid
data.
This is a benchmark dataset from the UCI Machine
Learning Repository.
The goal on this data is to predict the type
of thyroid disease a patient has from input
variables such as demographic information
about the patient, as well as various medical
information.
In this dataset, there are 3,772 instances
described by 30 attributes.
A version of this data, in CSV format without
a header row, can be found in the distributedWekaSpark
package that you installed just before.
If you browse to your home directory and look in the wekafiles/packages/distributedWekaSpark/sample_data
directory, you'll find it there.
The data in ARFF format is also included with
the Weka 3.7.13 distribution in the data folder.
So, you can also load it up into the Explorer
and take a look in there.
Why don't we do that now?
Here we are in the Weka Explorer.
Let's open the hypothyroid data.
If you browse to the Weka installation directory,
Program Files here, Weka 3.7, and in the data
directory, we can see the hypothyroid data.
Let's open that up.
As I mentioned before, there are 3,772 instances
in this dataset, and we can see the attributes
here.
We have the age and sex of the patient, and
we have a bunch of attributes related to various
medical information.
Down at the bottom is the class attribute.
You can see there are four different class
values here.
By far the largest class in the data is that
of negatives.
So, these are patients that don't have hypothyroid
disease.
Then we have 194 cases of compensated_hypothyroid,
95 cases of primary_hypothyroid, and only
2 cases of secondary_hypothyroid.
All right.
Those are the characteristics of the data.
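By the way, if you'd rather check these numbers programmatically, here's a small sketch using Weka's Java API. The file path is an assumption based on a default Weka 3.7 install on Windows, so adjust it for your machine:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class InspectHypothyroid {
        public static void main(String[] args) throws Exception {
            // Adjust the path for your own Weka installation.
            Instances data = DataSource.read(
                    "C:/Program Files/Weka-3-7/data/hypothyroid.arff");
            data.setClassIndex(data.numAttributes() - 1); // class is last

            System.out.println(data.numInstances() + " instances, "
                    + data.numAttributes() + " attributes");

            // Frequency count for each class value, as shown in the Explorer.
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            for (int i = 0; i < counts.length; i++) {
                System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
            }
        }
    }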
We can now return to the Knowledge Flow and
start executing some distributed Weka processes
on this dataset.
Before we do so, it's worth spending a minute
or two to explain why we're going to be operating
on comma-separated values (CSV) files without
a header, rather than ARFF.
Systems like Hadoop and Spark split data files
up into blocks.
This is to facilitate distributed storage
of large files out on the cluster and also
to allow data local processing.
This is where the processing is taken to where
the data resides.
So, rather than move the data around, we take
the processing to where the data is.
Within frameworks like Hadoop and Spark,
there are "readers", as they're called, for
various text files and for various structured
binary files.
These readers maintain the integrity of the
individual records within the files.
They know where the boundaries between records
are, and they never split a record in half.
If we were to use ARFF within such a framework,
we would need to write a special reader, due
to the fact that ARFF files, as you know,
have header information that occurs at the
start of the file.
That header information provides details on
what attributes are in the data, their types,
and legal values, and so forth.
Now, because the data file gets split up,
only one of the blocks, or chunks of data,
out on the cluster would have that header
information.
That is why we'd have to write a special reader
to handle it.
Distributed Weka for Spark, as it stands at
the moment, operates just on CSV data, simply
because there are readers already available
within Spark and Hadoop for dealing with such data.
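To make that concrete, here's a small sketch of Spark's built-in text reader at work. It's my own illustration, and I'm assuming the sample file is named hypothyroid.csv. The reader splits the file into partitions, but every element it hands back is one complete line, so records are never broken in half:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadHeaderlessCsv {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("csv-demo").setMaster("local[*]"));

            // Each element of the RDD is one complete line. With no header
            // row, every line is a record that can be parsed the same way,
            // whichever partition (or cluster node) it lands on.
            JavaRDD<String> lines = sc.textFile(System.getProperty("user.home")
                    + "/wekafiles/packages/distributedWekaSpark/sample_data/hypothyroid.csv");

            System.out.println(lines.count() + " records in "
                    + lines.partitions().size() + " partitions");
            sc.stop();
        }
    }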
All right. Here we are back in the Knowledge
Flow environment.
Let's execute the first distributed Weka job
in the list here: the "Create an ARFF header
job".
Let me make it a little bit larger here.
We'll use this one to verify that everything
is installed correctly and running properly.
Now, the goal of this job on the hypothyroid
data is to analyze that CSV file, produce
some summary statistics, and do this in a distributed
way.
At the same time, it collects all the information
that's necessary to create an ARFF header,
and it stores this.
And then any future jobs that we run can make
use of this ARFF header information straightaway
and not be required to analyze the CSV data
a second or third time before they can run.
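What makes this possible in a distributed setting is that all the statistics the job gathers are mergeable: each partition of the data produces a partial result, and the partials combine into a global one. Here's a simplified sketch of that idea for a single numeric column. It's my own illustration, not distributed Weka's actual code; the real job also deals with missing values, nominal attributes, and so on:

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;

    // Partial summary statistics for one numeric attribute. Two of these
    // can be merged into one, which is what makes the job distributable.
    public class NumericStats implements Serializable {
        long count;
        double sum, sumSq;
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;

        static NumericStats of(double v) {
            NumericStats s = new NumericStats();
            s.count = 1; s.sum = v; s.sumSq = v * v; s.min = v; s.max = v;
            return s;
        }

        NumericStats merge(NumericStats o) {
            NumericStats s = new NumericStats();
            s.count = count + o.count;
            s.sum = sum + o.sum;
            s.sumSq = sumSq + o.sumSq;
            s.min = Math.min(min, o.min);
            s.max = Math.max(max, o.max);
            return s;
        }

        // Given an RDD of CSV lines where column 0 is a numeric attribute:
        static NumericStats statsForColumnZero(JavaRDD<String> lines) {
            return lines
                    .map(line -> NumericStats.of(Double.parseDouble(line.split(",")[0])))
                    .reduce(NumericStats::merge);
        }
    }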
What we can do is go ahead and execute this
and see how it runs.
First of all, switch from the status area
to the log, down at the bottom here, and make
it a little bit larger, so that we can see
what's happening. Spark generates a lot of
log output; there is information about what
it is doing, and you'll see any problems that
occur in that log, as well.
We have just one job that's going to be executed
here--
the job to create the ARFF header--
and we'll just run this right now and make
sure everything is working correctly.
Later on, we'll take a look at the parameters
for the job, and I'll explain a little bit
about how it's configured.
Up here in the upper left-hand corner of the
Knowledge Flow, we can press this Play button
and start the flow running.
As I said, we can see a lot of information
being dumped into the log here.
Most of this is coming from Spark.
Our job has completed.
You can see here it says "Successfully stopped"
something called a "SparkContext".
All right, so what has this job produced?
We can see here in the flow that we have a
dataset connection coming out of the ArffHeaderSparkJob to a TextViewer.
So, if we open up the TextViewer and show
the results--
I'm going to make this just a little bit larger
here, so that it fills the screen--
we can see that, as the name suggests, it
has created an ARFF header for the hypothyroid data.
In fact, it's an ARFF header on steroids,
because there is some extra information in here.
What we can see at the top is standard ARFF
header information.
Here's all our attributes--
just like we saw in the Explorer before--
all the way down to Class here, where, in
this row here we can see all of the values
of the class attribute listed.
Now, below this is a bunch of additional information
that we've added into this header.
Other jobs that make use of this header are
programmed so that they can either access
this additional information or remove it and
use a standard ARFF header.
So, what we have in this additional information
is a bunch of summary statistics that have
been computed on the hypothyroid data running
in parallel in the Spark environment.
You can see that for the age attribute here
there are summary statistics that have been
computed for it.
Because this is a numeric attribute, we have
a count, a sum, a sum of squares, minimum
and maximum values, a mean, and a standard
deviation, as well.
And it's similar for the other attributes.
For nominal attributes, it computes a frequency
distribution.
So, down here in the summary attribute for
the class, we can see each class label followed
by an underscore and a number, and that number
is the count for that particular class label.
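In case you're wondering why the header stores a sum and a sum of squares rather than just the mean and standard deviation: the mean and standard deviation can be recovered from those mergeable quantities afterwards, with no second pass over the data. Here's a quick sketch, assuming the sample formula with n - 1 in the denominator, together with a toy check on the values 1 through 5:

    public class StatsFromSums {
        static double mean(long count, double sum) {
            return sum / count;
        }

        static double stdDev(long count, double sum, double sumSq) {
            // Sample variance from the aggregates, no second data pass:
            // var = (sumSq - sum^2 / n) / (n - 1)
            double variance = (sumSq - (sum * sum) / count) / (count - 1);
            return Math.sqrt(variance);
        }

        public static void main(String[] args) {
            // Toy check with the values 1..5: sum = 15, sumSq = 55
            System.out.println(mean(5, 15.0));         // 3.0
            System.out.println(stdDev(5, 15.0, 55.0)); // ~1.5811
        }
    }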
The ARFF header job has computed the header
for us and a bunch of summary statistics.
Next time, we'll take a look at how that's
configured, and we'll also look at running
some other distributed Weka jobs, as well.
In this lesson, we've covered getting distributed
Weka installed; our test dataset, the hypothyroid data;
the data format processed by distributed Weka;
and we've taken a look at a distributed Weka
job running on Spark to generate some summary
statistics and an ARFF header for the hypothyroid
data.
Next time, we'll dig in a little bit deeper
and see how these things are configured, and
we'll run some classifiers on the hypothyroid
data.
So, until next time!
