Hello, again! So far, we've been using Python
from within the Java Virtual Machine.
However, in this lesson, we're going to invoke
Weka from within Python.
But you might ask, "why go the other way round?" Isn't
Jython enough? Well, yes and no.
Jython limits you to pure Python code and
to Java libraries, and Weka provides only
modelling and some limited visualizations.
However, Python has so much more to offer.
For example, NumPy, a library for efficient
arrays and matrices; SciPy, for linear algebra,
optimization, and integration; and matplotlib,
a great plotting library.
You can check all this out on the Python wiki
under Numeric and Scientific libraries.
So, what do we need? Well, first of all, we
need to install Python 2.7, which you can
download from python.org.
But make sure that the Java and the Python you've
got installed on your machine have the same bitness:
they're either both 32-bit or both 64-bit.
You cannot mix the two.
You also have to set up an environment in which
you can actually compile some libraries.
On Linux, that's an absolute no-brainer.
A few lines on the command line and you're done
within 5 minutes.
However, on OS X and Windows quite a bit
of work is involved, so it's not for the
faint-hearted.
You can install the python-weka-wrapper library,
which we're going to use in today's lesson;
you'll find it, along with instructions on
how to install it on the various platforms,
on that page.
Good luck with that.
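As a rough sketch, a Linux install typically boils down to a few commands like the following; the exact package names here are assumptions for a Debian/Ubuntu-style system, so follow the instructions on the project page for your actual platform:

```shell
# Indicative Linux setup (package names are assumptions; consult the
# python-weka-wrapper installation instructions for your distribution):
sudo apt-get install python-pip python-dev default-jdk

# javabridge is compiled against your local JDK, hence the build tools above
pip install numpy
pip install javabridge
pip install python-weka-wrapper
```

On OS X and Windows the same pip packages are involved, but getting a matching compiler and JDK in place is where the extra work comes in.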
I've got it already installed, so I'm going
to talk a bit more about what the python-weka-wrapper
actually is.
This library fires up a Java Virtual Machine
in the background and communicates with it
via the Java Native Interface.
It uses the "javabridge" library for doing
that, and the python-weka-wrapper library
sits on top of that and provides a thin wrapper
around Weka's superclasses, like classifiers,
filters, clusterers, and so on.
And, in contrast to the Jython code that
we've seen so far, it provides a more
"pythonic" API.
Here are some examples.
Python properties are used instead of the
Java get/set-method pairs: for example,
"options" instead of "getOptions/setOptions".
It uses lowercase plus underscores instead
of Java's camel case: "crossvalidate_model"
instead of "crossValidateModel".
It also has some convenience methods that
Weka doesn't have, for example, "data.class_is_last()"
instead of "data.setClassIndex(data.numAttributes() - 1)".
And plotting is done via matplotlib.
Right.
So, I presume you were lucky installing everything,
and you've sorted everything out.
I've already done that on my machine here
because it takes way too long, and I'm going
to fire up the interactive Python interpreter.
For the first script, we want to revisit cross-validating
a J48 classifier.
As with all the other examples, we have to
import some libraries, of course.
In this case, we're communicating with the
JVM, so we need some way of starting and
stopping it; for that, we import the
"weka.core.jvm" module.
We want to load data, so we're going to import
the "converters", and we're importing "Evaluation"
and "Classifier".
First of all, we're going to start the JVM.
In this case, using the packages as well
is not strictly necessary, but we'll just
do it.
You can see a lot of output here.
It basically tells you which libraries
are in the classpath, which is all good.
Next thing is we're going to load some data.
In this case, our "anneal" dataset, once again
using the same approach that we've already
used with Jython, via the environment variable.
That's loaded.
Then we're going to set the class, which is
the last one, and we're going to configure
our J48 classifier.
Whereas in Jython we simply said "I want to
have the J48 class", here we instantiate
a "Classifier" object and tell it which
Java class to use, which is our J48 classifier,
and with what options.
So, the same confidence factor of 0.3.
Once again, same thing for the "Evaluation"
class.
We instantiate an "Evaluation" object with the
training data to determine the priors, and then
cross-validate the classifier on the data
with 10-fold cross-validation.
That's done.
And then we can also output our evaluation
summary.
Done.
This is done simply with "Evaluation.summary(...)",
giving a title and disabling the complexity
statistics in the output; and, since our
Jython example also had the confusion matrix,
we're going to output that, as well.
Here's our confusion matrix.
One thing you should never forget is, once
you're done, you also have to stop the JVM
and shut it down properly.
We can see once again, like with the other
one, that we have 14 misclassified examples
out of our almost 900 examples.
You can count those in the confusion matrix,
as well: 3 + 2 + 2 + 7 = 14.
For the next script, we'll be plotting the
classifier errors obtained from a "LinearRegression"
classifier on a numeric dataset.
Once again, we'll be using the errors between
predicted and actual values as the size of
the bubbles.
Once again, I'm going to fire up the interactive
Python interpreter.
I'm going to import, as usual, a bunch of
modules.
In this case, what's new among the modules
I'm importing here is the plotting module
for classifiers.
We'll start up our JVM.
We're loading our "bodyfat" dataset in, setting
the class attribute.
Then we're going to configure our "LinearRegression"
classifier, once again turning off some options,
which makes it faster.
We're going to evaluate it on our dataset
with 10-fold cross-validation.
Done, and now we can plot it with a single
line.
Of course, we're cheating here a little bit,
because the module does a lot of the heavy
lifting that we had to do manually with Jython.
Here we go.
Nice plot.
Of course, you can also zoom in if you wanted
to.
Great.
As a final step, stop the JVM again, and
we can exit.
For the last script in this lesson, we'll
be plotting multiple ROC curves, like
we've done with Jython.
Once again, the Python interpreter.
It's a nice thing; we can just open it up
and do stuff with it straight away.
Import stuff.
Once again, we're using a plotting module
for classifiers.
We are starting up the JVM. Starting to get good at that.
Loading the "balance-scale" dataset, like we did
with Jython.
We'll also use the "NaiveBayes" classifier.
As you can see, this time there are no options.
Cross-validate the whole thing with 10-fold cross-validation.
Then we use the "plot_roc" method to plot everything.
We want to plot 0, 1, and 2 class label indices.
Here we have those.
Once again, we can see the AUC values for
each of the labels, whether it's L, B, or R.
Final step, stopping the JVM again and exit.
Okay.
In this lesson, we--actually, you--installed
Python and additional modules via Python's
"pip" command, and we used Weka from within a native Python environment using the python-weka-wrapper library.
See you next time!
