Hi there! In the last lesson, we looked at how distributed Weka for Spark performs cross-validation.
In this, the final lesson of this class, we'll
touch briefly on a couple of Knowledge Flow
templates that we haven't had time to cover
so far, and we'll leave you with some pointers
for taking distributed Weka further, if you
wish to do so.
Here we are back in the Knowledge Flow.
If we open up the Templates menu again and
scroll down a little bit here, we can see
a template called "compute a correlation matrix
and run PCA", where PCA stands for principal
components analysis.
Let's open this one.
I'll make it a little bit larger.
All right. What we have is our trusty ArffHeaderSparkJob,
which loads our hypothyroid data again, and
we have a little step here called the CorrelationMatrixSparkJob.
And we have an ImageViewer and a TextViewer
attached to that.
This suggests that this job will produce some
kind of an image that we can take a look at
and also some textual results.
In the dialog for the CorrelationMatrixSparkJob
here, we have a few options, mainly related
to exactly what sort of matrix is going to
be computed: either a correlation matrix or
a covariance matrix.
We have an option to run principal components
analysis.
The principal components analysis algorithm
can take either a correlation matrix or a
covariance matrix as input.
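For comparison, here's a minimal sketch of the same computation in ordinary sequential Weka, using its PrincipalComponents attribute evaluator. The file path is an assumption, and the RemoveType filter mirrors the distributed job's restriction to numeric attributes:

    import weka.attributeSelection.PrincipalComponents;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.RemoveType;

    public class PCASketch {
      public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("hypothyroid.arff"); // assumed path

        // Keep only the numeric attributes, as the Spark job does.
        RemoveType removeNominal = new RemoveType();
        removeNominal.setOptions(new String[] {"-T", "nominal"});
        removeNominal.setInputFormat(data);
        data = Filter.useFilter(data, removeNominal);

        PrincipalComponents pca = new PrincipalComponents();
        // false (the default): standardize the data, i.e. PCA on the
        // correlation matrix; true: center only, i.e. PCA on the
        // covariance matrix.
        pca.setCenterData(false);
        pca.buildEvaluator(data);

        // Prints the correlation matrix, eigenvalues, and components,
        // much like the TextViewer output of the CorrelationMatrixSparkJob.
        System.out.println(pca);
      }
    }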
All right.
Let's run this now and see what it produces.
It just takes a few seconds to run.
And it's finished.
Okay, let's open up the TextViewer.
In the TextViewer, we have the result of the
principal components analysis and the correlation
matrix that was computed.
We can see that the correlation matrix and
the principal components analysis only involve
the numeric attributes that are present in
the hypothyroid data.
Let's take a quick look in the ImageViewer
now.
If we open up the ImageViewer, we can see
that we have a graphical heat map representation
of our correlation matrix, where the colors
indicate the magnitude of the correlations
between the numeric attributes in the hypothyroid data.
Right, let's take a look at one more example
before we finish with distributed Weka.
In the Templates menu here, we have a job
called "run k-means||".
K-means|| (pronounced "k-means parallel") is,
as the name suggests, a parallel version of
the k-means algorithm.
For clustering in distributed Weka, unfortunately,
we can't use the trick of creating a voted
ensemble like we did in the classification
case.
It's not possible to make a voted ensemble
out of separate clustering models, so a clusterer
has to be implemented in a distributed fashion
from scratch.
That's why only k-means is available in distributed
Weka so far: it's the only clustering algorithm
that has been implemented specifically for
distributed Weka.
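Distributed Weka has its own implementation of k-means||, but if you want a feel for the algorithm itself, Spark's MLlib library exposes the same k-means|| initialization scheme. Here's a minimal sketch -- note this uses MLlib rather than distributed Weka, and it assumes the classic RDD-based MLlib API:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class KMeansParallelSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("kmeans-parallel").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A toy dataset; in practice this RDD would be built from your data.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(1.0, 1.0), Vectors.dense(1.2, 0.8),
            Vectors.dense(8.0, 8.0), Vectors.dense(8.1, 7.9)));

        KMeansModel model = new KMeans()
            .setK(2)
            .setMaxIterations(20)
            .setInitializationMode("k-means||") // the default init mode
            .run(points.rdd());

        for (Vector center : model.clusterCenters()) {
          System.out.println(center);
        }
        sc.stop();
      }
    }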
This job takes a little while to run, so through
the magic of video editing, I've executed
it in between cuts to save a little bit of
time.
It actually takes longer to run than sequential
Weka would if you ran k-means in the Explorer
on the hypothyroid dataset.
This is simply because there is a certain
amount of overhead involved in Spark's communication,
the creation of its RDD data structures, and
so forth, and in this local case, where we're
only using the cores available on our CPU,
that overhead outweighs the speed gained through
parallel processing.
If our dataset were much larger and we were
running on a real cluster, then we would see
a true benefit from using a distributed approach.
In the TextViewer, we can see the clustering
results for k-means, which are in exactly
the same format as if you had run k-means
in standalone Weka on your desktop.
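If you want to reproduce that standalone result programmatically rather than through the Explorer, a minimal sketch with sequential Weka's SimpleKMeans might look like this (the file path and parameter values are assumptions):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SequentialKMeansSketch {
      public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("hypothyroid.arff"); // assumed path

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2); // k; pick whatever suits your data
        kMeans.setSeed(1);        // fixed seed for repeatable results
        kMeans.buildClusterer(data);

        // Same report format you see in the Explorer and in the
        // distributed job's TextViewer: centroids, cluster sizes, etc.
        System.out.println(kMeans);
      }
    }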
So, where to from here? Experimenting with
distributed Weka in local mode on small datasets
is the best way to get familiar with its
capabilities and explore what it has to offer.
However, if you want to process larger datasets,
then you'll need to run on a cluster.
We'll take a little look at what's available
on the web to help you get started in that area.
The first place to go for information is the
main Apache Spark website, so let's take a
look at that first.
Okay, under the documentation section here,
we can find the documentation for the latest
release of Spark.
We go to that page, and there's information
on downloading, running some examples, and
then down here a little ways, we have information
on launching on a cluster.
The first thing to look at is the cluster
mode overview, which describes exactly how
a cluster is configured and set up to run.
Then there are various different types of
clusters that you can run Spark on.
The simplest is called standalone mode, and
there is a documentation section here on that
mode; it would be the one to start with.
Spark can also run on several other kinds
of clusters, including Mesos and YARN, which
are different ways of managing the machines
in a cluster.
There are a number of blogs on the web that
step you through the process of setting up
a standalone cluster on a single machine.
So, if we search for "Apache Spark standalone
cluster install", there are a number of hits
in Google with information on setting up a
cluster.
One that I thought was particularly concise,
and a good place to start, is this one here.
If we take a look at that, we can see a very
short introduction to getting started with
a Spark cluster running on a single machine.
This is different from what we've been looking
at so far: we've been running in local mode,
where the entirety of Spark runs in a single
JVM process.
A standalone cluster running on a single machine
involves multiple separate Java processes
that communicate as if they were running on
different machines.
This tutorial is a reasonably short introduction
to getting started with that.
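Once a standalone cluster is up, the main change on the application side is the master URL that Spark connects to; distributed Weka's Spark job steps let you set this in their configuration options. A minimal sketch contrasting the two modes -- the host name is a placeholder, and 7077 is the standalone master's default port:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MasterUrlSketch {
      public static void main(String[] args) {
        // Local mode: all of Spark runs inside this one JVM, using as
        // many worker threads as there are cores on the machine.
        SparkConf localConf = new SparkConf()
            .setAppName("weka-job").setMaster("local[*]");

        // Standalone cluster mode: connect to a separately started master
        // process (placeholder host; 7077 is the default master port).
        SparkConf clusterConf = new SparkConf()
            .setAppName("weka-job").setMaster("spark://my-master-host:7077");

        JavaSparkContext sc = new JavaSparkContext(localConf);
        // ... submit jobs here ...
        sc.stop();
      }
    }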
That's it for this lesson.
Today, we took a look at how you can use distributed
Weka to compute a correlation matrix in Spark
and then use that correlation matrix as input
to a principal components analysis.
We also took a look at the k-means algorithm
running in a distributed fashion inside of
Spark, and we took a little look at information
on setting up Spark clusters.
Well, I hope you've enjoyed learning about
how to use Weka in a distributed processing
environment, and now I'll leave you with some
links to further information on distributed
Weka and on Apache Spark.
