Hi, everyone! I'm Peter.
Welcome to my class on scripting.
Why would you do scripting? Well, there are
pros and cons to scripting.
On the positive side of things, when you write
a script, it captures all the steps that you
performed from preprocessing to modeling to
evaluation.
Also, when you write a script, you really
only write it once, and you can run it multiple
times with no extra cost.
It's also very easy to create a variant of
the script in order to test some theories.
For example, tweaking some parameters of a
classifier or swapping out a classifier completely.
The best thing about scripting is that you
don't need to compile anything like you would
have to with Java code.
On the not-so-good side of things, you will
have to do programming.
You need to familiarize yourself with the
APIs of the libraries that are involved, and
writing code is usually slower than clicking
in the GUI.
Now, what scripting languages will we cover
in this class? We will cover Jython, Python,
and Groovy.
Jython is basically a pure-Java implementation
of Python 2.7, which runs solely in the Java
Virtual Machine.
This means it gives you access to all the
Java libraries that are on the CLASSPATH.
If you're using Python code, then it has to
be pure Python: no native libraries like,
for example, numpy.
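As a rough illustration of what access to Java libraries means in practice, here is a small sketch. The `sys.platform` check is an assumption about how to tell Jython apart from regular Python (Jython reports a platform string starting with "java"), and the variable names are made up for this example. Under regular Python the Java branch is simply skipped.

```python
import sys

# Assumption: Jython reports a platform string starting with "java",
# while regular Python reports e.g. "linux", "darwin" or "win32".
on_jython = sys.platform.startswith("java")

if on_jython:
    # Under Jython, Java classes are imported like Python modules.
    from java.util import ArrayList
    names = ArrayList()
    names.add("Weka")
else:
    # Without a JVM, fall back to a plain Python list.
    names = ["Weka"]

print("%s (running on Jython: %s)" % (list(names), on_jython))
```

The same script therefore loads in both interpreters, but only the Jython run actually touches a Java class.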
As for Python, we'll be using Python 2.7,
and we'll be invoking Weka from it.
That then gives you access to the full
Python library ecosystem.
At the end, we'll be touching briefly on Groovy,
which has a Java-like syntax and also runs
in the Java Virtual Machine.
Once again, it gives you access to all the
Java libraries on the CLASSPATH.
To demonstrate why Python might be a good
choice of programming language for doing
the scripting, let's simply compare what
Java code and Python code look like
for doing the same thing.
What we're trying to do is simply output
"Hello WekaMOOC!" ten times.
Looking at the Java code here, you have
the outer class definition, then you have your
main method.
Inside your main method, you have your for
loop, where you finally output stuff.
In Python, this whole thing is collapsed to
a two-liner.
You simply iterate from 0 to 9 and then print
the whole thing out.
Done.
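The Python two-liner described here might look like the following sketch. The exact text format is an assumption; the counter is included because the output shown later in the lesson is numbered from 1 to 10.

```python
# Iterate over 0..9 and print the greeting with a 1-based counter.
for i in range(10):
    print("%d: Hello WekaMOOC!" % (i + 1))
```

Written this way, the loop runs unchanged under Jython 2.7 and under regular Python.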
Now, in order to have Jython support in Weka,
we need to install a package.
I'm going to start up Weka.
We go into the Package manager; I'm presuming
that you're already familiar with it.
We need to install tigerJython 1.0.0, not
the latest version, which gave me a bit of
grief lately.
Scroll down to tigerJython and, if you want
to install a specific version, you can simply
open up the Repository version dropdown box
and select 1.0.0 instead of 1.0.1.
I've already done that. For plotting in
Jython in Lesson 3, we also want to use jfreechart,
so you also want to install the
jfreechartoffscreenRenderer library; version
1.0.2 is fine.
After we've done that, we have to restart
Weka.
Then, under the Tools menu, we will have a
Jython console menu item, which brings up
a little user interface for writing and running
Jython scripts.
The first time round it takes a little bit
longer because it analyzes all the libraries
that are in your CLASSPATH.
Here's our little interface.
What you can see here is basically where you
write your script.
Down here you would see errors and so on and
output that your script generates.
You execute your script with the green triangle
up here.
You can also turn debug mode on and off, which
allows you to basically step through the program
that you've written.
You can also set breakpoints up here, which
allow you to stop at certain points in the
program and then analyze, for instance, what
the values for variables are, and so on.
I usually work with multiple scripts at
once, so under Preferences I set a smaller
font and use tabs, rather than just a
single editor.
Let's just revisit our really, really simple
example that we had previously.
We were just outputting our "Hello World",
more or less.
When we run this, not in debug mode for the time
being, we'll see that there's an output
from 1 to 10, "Hello WekaMOOC!".
Now, if we are in debug mode, once
again toggling it, then we can define how
fast it actually goes through, and we can simply
step through and run it.
You can see the instruction pointer sort of
toggling between those two lines, and you
can also see over here when you open up variables
and types that the variable "i" gets incremented.
This is sort of the first quick introduction
to the tigerJython interface.
When you're writing code, you have to find
information, of course, and the best source of
information on Java libraries, like Weka, is the
Javadoc.
You can use either the online documentation
on the SourceForge homepage, which is always
the latest one, or, if you're not working
with the latest version, then you can simply
go into the "doc" directory in your Weka installation
and use that.
Also, coming with your release or snapshot
that you've installed, you'll find a wekaexamples.zip
file, which contains quite a lot of example
code that should get you going in how to use
APIs and what not in Weka.
Last but not least, also check out the WekaManual.pdf
document: in the appendix, under the
"Using the API" section, you will find most
of the important APIs in Weka explained,
along with how to use them.
Of course, I promised that we're going to
write a little script.
What we're going to do is load data and filter
it and print it out.
However, since all the installations of Weka
will be different around the world, in order
to find datasets, I'll be using a little trick.
I'll be using an environment variable to point
to the directory where I've stored my datasets.
I'm going to close Weka for the time being.
You can see here on my desktop in the data
directory, I have various datasets you will
be downloading throughout the class, and we
want to point to that directory.
For this purpose, I'm going to copy that
path and add an environment variable.
I'm going to go into the Advanced settings,
Environment variables, and I'm going to create
one called MOOC_DATA and paste that in there.
Okay, okay, and okay.
That closes the dialog again, and we can close
that, too.
Then we can start up Weka again.
We start up our Jython console again and
wait until it's there.
First of all, we'll have to import some classes
to actually do stuff.
We want to load data, and we'll be using
the "DataSource" class for that, abbreviated
to "DS".
Then there's the "Filter" class for filtering
a dataset and the "Remove" filter to do the
actual work.
The "os" library is a Jython/Python library which
gives us access to the operating system, like,
for example, environment variables and so
on.
In order to utilize the MOOC_DATA environment
variable that I've just configured, I'm using
the "os.environ.get" method, then the "os.sep"
property, which is a forward or backward slash
depending on what operating system you're
running, plus the name of the dataset, in this
case "iris"; that's the file I'm loading.
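The path construction just described can be sketched in isolation. The helper name `dataset_path`, the `.arff` file extension, and the error handling are assumptions added for illustration; the transcript only names `os.environ.get`, `os.sep` and the "iris" dataset.

```python
import os


def dataset_path(name, env=os.environ):
    """Build <MOOC_DATA><separator><name> from an environment mapping."""
    base = env.get("MOOC_DATA")
    if base is None:
        # Concatenating None with a string would raise a TypeError,
        # so fail early with a clearer message instead.
        raise KeyError("MOOC_DATA is not set")
    return base + os.sep + name


# Usage with an explicit mapping, so the sketch runs even when the
# real environment variable has not been configured.
print(dataset_path("iris.arff", {"MOOC_DATA": "data"}))
```

If Weka was already running when the variable was created, it typically needs a restart before `os.environ` sees the new value, which is why Weka is closed and reopened in this lesson.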
Then, we're going to configure our filter. So we want to have a "Remove" filter.
We want to remove the last attribute, which
is done via the "-R" and "last" options.
Then, we are telling the filter about what
the data actually looks like, so it can configure
itself internally.
Then, we're using the Filter class to actually
push the data through our Remove filter and
get a new dataset.
Finally, we're going to output that new dataset
in the console.
We run this now, and we get a lot of output
here.
If we scroll to the top of it, we can see
that the relation name has changed to include
the filter setup, and there is no longer any
class attribute.
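Putting the steps together, the script might look roughly like the sketch below. The Weka calls (`DataSource.read`, `setOptions`, `setInputFormat`, `Filter.useFilter`) are the standard API for this task, but the function wrapper, the `sys.platform` guard, the `iris.arff` file name and the fallback directory are assumptions made so the file also loads outside of Jython. It is not necessarily the exact script shown in the video.

```python
import os
import sys

REMOVE_OPTIONS = ["-R", "last"]  # -R takes the attribute range to remove


def remove_last_attribute(arff_path):
    """Load a dataset, drop its last attribute, return the new dataset."""
    # The Java imports live inside the function so that this file still
    # loads under regular Python; they only resolve under Jython with
    # Weka on the CLASSPATH.
    import weka.core.converters.ConverterUtils.DataSource as DS
    import weka.filters.Filter as Filter
    import weka.filters.unsupervised.attribute.Remove as Remove

    data = DS.read(arff_path)
    rem = Remove()
    rem.setOptions(REMOVE_OPTIONS)
    rem.setInputFormat(data)  # tell the filter what the data looks like
    return Filter.useFilter(data, rem)


if sys.platform.startswith("java"):  # assumption: only true under Jython
    # Assumption: the dataset file is iris.arff inside MOOC_DATA.
    path = os.environ.get("MOOC_DATA", ".") + os.sep + "iris.arff"
    new_data = remove_last_attribute(path)
    print(new_data.relationName())   # relation name after filtering
    print(new_data.numAttributes())  # one attribute fewer than before
    print(new_data)
```

Printing the relation name and the attribute count separately makes the two changes mentioned above easy to spot without scrolling through the full dataset output.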
In this first lesson, we have installed tigerJython.
We've seen that Python is actually very easy
to read and write and is quite short, as well,
compared to Java, learned where we can
find API documentation, and written our first
Jython script.
Well done! See you next time.
