Hello again, and welcome to the last lesson
on scripting.
This lesson is slightly different from the
other ones because we will be looking at a
real-world challenge, and then, as a second
part, we'll be looking at another scripting
language called Groovy. First to the challenge.
The challenge comes from an annual
shoot-out run by the Council for Near-Infrared
Spectroscopy, and the shoot-out process
works as follows.
You build your model on the training data, which is called the calibration set in infrared spectroscopy terms.
You evaluate your model on a separate dataset,
which is the test dataset, and then you
generate and submit your predictions to the
shoot-out.
However, we don't do the last step of submitting
our predictions, because that particular challenge
has already finished.
But, we're still going to use the data that's
publicly available at the link below that
you can download and then run.
What are you going to do? Well, first, you're
going to download the CSV files for Dataset
1 and 2.
I'm just going to go on their website here.
Here's Dataset 1 and 2.
For each one of them, you download the CSV
files; you only need the calibration and
the test set, not the validation one.
Then you generate data for Weka in ARFF
format: one file for building the model (the
calibration set) and one for evaluating it
(the test set).
The class attribute in the calibration dataset
is called "reference value", and you shouldn't
include the "sample #" in your model.
It is now up to you to come up not only
with a properly prepared dataset and compatible
training and test sets, but also with a good
regression scheme for predicting the reference value.
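As a hedged sketch of that preparation step, the conversion could look something like the following Groovy snippet, using Weka's CSVLoader, the Remove filter, and ArffSaver. The file names are placeholders for whatever you downloaded, and the position of the "sample #" column is an assumption you'd need to check:

```groovy
import weka.core.converters.CSVLoader
import weka.core.converters.ArffSaver
import weka.filters.Filter
import weka.filters.unsupervised.attribute.Remove

// hypothetical file names -- substitute the CSV files you actually downloaded
["DS1_calibration.csv", "DS1_test.csv"].each { name ->
    def loader = new CSVLoader()
    loader.setSource(new File(name))
    def data = loader.dataSet

    // drop the "sample #" column (assumed here to be the first attribute)
    def remove = new Remove()
    remove.setOptions(["-R", "1"] as String[])
    remove.setInputFormat(data)
    data = Filter.useFilter(data, remove)

    // save the cleaned data in ARFF format for Weka
    def saver = new ArffSaver()
    saver.setInstances(data)
    saver.setFile(new File(name.replace(".csv", ".arff")))
    saver.writeBatch()
}
```

You would then still set the "reference value" attribute as the class when loading the ARFF files for training and evaluation.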
But what do you have to beat?
Well, in our case, you have to beat, on Dataset
1, a correlation coefficient of 0.8644 and
a root mean squared error of 0.384.
And on Dataset 2 you have to beat a correlation
coefficient of 0.9986 and a root mean squared
error of 0.0026.
Up for the challenge? Good luck!
Now, to the second part, using Groovy, another
scripting language.
As I already mentioned in the introduction,
Groovy also runs in the JVM and can be installed
through the Package Manager, as well.
If you haven't done so already, please open
up the Package Manager and install the "kfGroovy" package.
It doesn't matter what version.
"kf" for KnowledgeFlow "Groovy".
I've already done that, and I'm going to show
you what the interface looks like.
Once again, just like with the Jython console,
you'll find a Groovy console menu item under
the Tools menu in the GUI Chooser.
Once you've opened that, you'll find the
appearance of the Groovy console very
similar to that of the Jython console.
On the top, you write your
script, and at the bottom, you'll see the output.
However, the Groovy console doesn't support
multiple tabs.
You'd have to open multiple instances instead,
but, for our purposes, that is sufficient.
Now, before we start just a few minor Groovy
basics.
The grammar of Groovy is derived from Java,
with the exception that you don't have to
write semicolons to terminate statements, which
makes it much nicer.
"def", for definition, declares a variable.
You don't have to specify any types.
"Lists" are very similar to the Python ones:
square brackets and comma-separated.
They can contain mixed types.
"Maps" are also very similar to their Python
counterpart, which is called a "dictionary" there.
However, you don't use curly brackets; you
still use square brackets.
Groovy also enhances the Java syntax.
For example, you have multi-line strings by
using triple single quotes.
You can use string interpolation.
You also get default imports of commonly
used packages, like java.lang, java.io, java.net,
and so on.
And, last but not least, closures.
They are not quite the same as Java 8 lambdas,
but they're a very powerful tool.
They're basically anonymous code blocks, which
can take parameters and return values, as
well, and can be assigned to variables.
If you want to look up some differences between
Java and Groovy, then follow the link.
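To make those basics concrete, here is a small, self-contained sketch; all the names and values in it are purely illustrative:

```groovy
def x = 42                             // "def" declares a variable, no type needed
def list = [1, "two", 3.0]             // lists: square brackets, mixed types are fine
def map = [name: "weka", kind: "tool"] // maps also use square brackets, not curly ones

// multi-line string using triple single quotes
def text = '''a multi-line
string'''

// string interpolation (needs double quotes)
println "the value of x is ${x}"

// a closure: an anonymous code block with a parameter, assigned to a variable
def square = { n -> n * n }
println square(5)   // prints 25
```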
One really funky thing about Groovy that I
very much like is looping.
Of course, you have the standard Java
for-loop and while-loop, but, since everything
is an object in Groovy, you can also use
some additional methods called
"upto", "times", and "step", as long as you have
number objects, like integers and so on.
So, if you look at "upto", if you have "0.upto(10)",
that basically outputs all the numbers from
0 to 10, both included.
If you do "times", for example "5.times", that
outputs the numbers from 0 to 5 with 5 excluded,
so it outputs the numbers 0, 1, 2, 3, 4.
Last, but not least, you can also "step" through.
If you have "0.step(10, 2)", that means you're
going from 0 to 10 at a step 2, so it outputs
the numbers 0, 2, 4, 6, and 8.
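The three looping methods in one small sketch (each takes a closure, with the current number available as the implicit parameter "it"):

```groovy
0.upto(10) { print "${it} " }     // 0 1 2 3 4 5 6 7 8 9 10 (both ends included)
println()

5.times { print "${it} " }        // 0 1 2 3 4 (5 itself excluded)
println()

0.step(10, 2) { print "${it} " }  // 0 2 4 6 8 (from 0 towards 10 in steps of 2)
println()
```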
Okay.
So, with the basics out of the way, we're
going to dive into writing one of the scripts
we've already seen previously in Jython and
python-weka-wrapper, and we're going to make
some predictions with a built classifier.
Once again, as always, we have
some imports and, just like in Jython,
we import the classes we need directly.
We once again do the trick with our environment
variable, however, here we use "System.getenv()".
Then we're loading our training data.
Once again using the MOOC_DATA environment
variable, using our shortcut variable and
loading the "anneal_train" dataset.
Setting the class attribute once again as
the last one, and we're also loading in the
unlabeled data and setting the class, as well.
Now, we're going to instantiate J48.
We're going to set some options.
There's a minor difference here compared to Jython:
you actually have to specify that the options
are a String array.
So, even though you have a list of strings,
you have to say what you want to cast
it to.
And, once again, build our classifier on the
training data and output the built model,
just for the fun of it.
Now, we want to once again look at making
predictions.
First of all, we're going to look at what
labels we have.
In this case, we use the previously
mentioned ".times" method on the number of
values the class attribute has, which iterates
from 0 to that number minus 1, and we add each
string label to a list, which we can then output
with a simple "println" statement.
We're using the list's join method, and we're
joining all those elements in the list with
a comma, generating a comma-separated string.
Once again, using our "times" method, but this
time, we'll be using the number of instances
in the dataset.
For all the rows in the data, we'll be calling
the classifier's "distributionForInstance"
in order to retrieve the class distribution.
And then simply output what the class distribution
is.
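Putting the steps just described together, the script could be sketched roughly as follows. The MOOC_DATA environment variable and the "anneal" data come from the lesson, but the exact file names and the J48 options shown here are assumptions for illustration:

```groovy
import weka.core.converters.ConverterUtils.DataSource
import weka.classifiers.trees.J48

// the environment-variable trick, here via System.getenv()
def dataDir = System.getenv("MOOC_DATA")

// load the training data and set the class attribute to the last one
def train = DataSource.read(dataDir + "/anneal_train.arff")
train.setClassIndex(train.numAttributes() - 1)

// load the unlabeled data and set its class as well
def unlabeled = DataSource.read(dataDir + "/anneal_unlbl.arff")
unlabeled.setClassIndex(unlabeled.numAttributes() - 1)

// instantiate J48; note the cast of the options list to a String array
def cls = new J48()
cls.setOptions(["-C", "0.25"] as String[])    // illustrative option values
cls.buildClassifier(train)
println cls   // output the built model, just for the fun of it

// collect the class labels using ".times" and print them comma-separated
def labels = []
train.classAttribute().numValues().times { i ->
    labels << train.classAttribute().value(i)
}
println labels.join(",")

// output the class distribution for every row in the unlabeled data
unlabeled.numInstances().times { i ->
    println cls.distributionForInstance(unlabeled.instance(i))
}
```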
Okay.
And when we run this thing, you'll
first see that it actually loads the whole
thing into the JVM here at the top.
It just outputs what we're loading.
Then after that, you can see our J48 tree
that we built on the training data.
Then we can have here our class labels and,
finally, the class distributions for all the
rows in the data.
Slightly different to Jython and python-weka-wrapper, but not too different.
As the second script, we'll be looking at
outputting multiple ROC curves on the "balance-scale" data.
We'll start from scratch.
Once again, we have a bunch of imports that
we need.
In this case, the "Evaluation" class again.
We're going to use "NaiveBayes" as the classifier
again.
"ThresholdCurve", which allows us to compute
the ROC curves and so on.
DataSource for loading the data, and, once
again, we're going to use JFreeChart for
the plotting.
Okay.
First thing, we're going to load the data
in again using our environment variable, setting
the class attribute as the last one.
We're going to instantiate "NaiveBayes".
There are no options necessary.
Then, we're going to cross-validate it, after
initializing an "Evaluation" object on the training data.
We're going to do 10-fold cross-validation,
and, once again, using a seed value of 1 for
a random number generator.
Having that done, we can now then create our
plot dataset once again.
It's just a simple XY dataset again, and,
as you can see, we use ".times" once more,
since we want multiple ROC curves, one per
class label.
For each of the labels in the class attribute,
we retrieve the curve data, and from it the
"false positive rate" column and the
"true positive rate" column.
Turn it into lists, and we're adding that
then as a data series to our plot dataset,
including the AUC area.
Having done that, we can then create the plot,
which is just an XY line chart, a simple
one, and, with axes of "False Positive Rate"
and "True Positive Rate".
As the last step, as usual, we're going to
create a frame, embed a chart panel with the
plot and make the whole thing visible.
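The ROC script just described could be sketched along these lines; treat the file name and chart details as assumptions rather than the lesson's exact code:

```groovy
import weka.classifiers.Evaluation
import weka.classifiers.bayes.NaiveBayes
import weka.classifiers.evaluation.ThresholdCurve
import weka.core.converters.ConverterUtils.DataSource
import org.jfree.chart.ChartFactory
import org.jfree.chart.ChartPanel
import org.jfree.chart.plot.PlotOrientation
import org.jfree.data.xy.DefaultXYDataset
import javax.swing.JFrame

// load the data and set the class attribute to the last one
def data = DataSource.read(System.getenv("MOOC_DATA") + "/balance-scale.arff")
data.setClassIndex(data.numAttributes() - 1)

// 10-fold cross-validation of NaiveBayes, seed 1 for the random number generator
def cls = new NaiveBayes()
def eval = new Evaluation(data)
eval.crossValidateModel(cls, data, 10, new Random(1))

// one ROC curve per class label, added as a series including the AUC
def dataset = new DefaultXYDataset()
data.classAttribute().numValues().times { i ->
    def curve = new ThresholdCurve().getCurve(eval.predictions(), i)
    def fpr = curve.attributeToDoubleArray(curve.attribute("False Positive Rate").index())
    def tpr = curve.attributeToDoubleArray(curve.attribute("True Positive Rate").index())
    def auc = ThresholdCurve.getROCArea(curve)
    dataset.addSeries(data.classAttribute().value(i) + " (AUC: " + auc + ")",
                      [fpr, tpr] as double[][])
}

// simple XY line chart with FPR/TPR axes, embedded in a frame via a chart panel
def chart = ChartFactory.createXYLineChart("ROC", "False Positive Rate",
    "True Positive Rate", dataset, PlotOrientation.VERTICAL, true, true, true)
def frame = new JFrame("ROC - balance-scale")
frame.setSize(800, 600)
frame.getContentPane().add(new ChartPanel(chart))
frame.setVisible(true)
```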
Okay.
And we run that.
It takes a little while, and then we basically
have our plot that we've already seen before.
You've now seen quite a range of scripting
languages that you can use with the Weka API,
whether inside the JVM or outside, using
Python itself.
And, last but not least, you also had some
fun with a real-world data challenge, and I
hope you were much, much better than I was.
Okay.
That's it for scripting.
I hope you enjoyed it, and I'll see you sometime.
Bye Bye.
