Hello again! Glad to see you're back for some
more scripting with Weka.
What we're going to cover in this lesson is
building models and evaluating them.
The classes that we're going to touch upon
in this lesson are "weka.classifiers.Evaluation"
for evaluating classifiers, some classifiers and
filters we've already seen in the last lesson,
and, for randomization, some plain Java classes.
The first thing that we want to do is build
a J48 classifier.
I'm going to start up our Jython console
again.
For this script, we'll load some data, configure
the J48 classifier, build it and output the
model.
First of all, the imports.
Once again, our DataSource for loading data
and our J48 classifier.
Once again, we're going to load our data using
our environment variable, this time we're
loading the "anneal" UCI dataset, and, since
it's classification, we also have to set the
class attribute.
In this case, it's the last one.
So, with "numAttributes", you can determine
the number of attributes in a dataset, and
with "setClassIndex", you can set which of these
attributes is going to be the class attribute.
However, as is usual for APIs, counting
starts at 0, not 1.
That's why we have "numAttributes() - 1".
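As a rough sketch, the loading code could look like this in Jython; the environment variable name (MOOC_DATA here) and the ".arff" file name are assumptions, since the exact setup isn't shown on screen:

    import os
    from weka.core.converters.ConverterUtils import DataSource

    # load the "anneal" UCI dataset; the MOOC_DATA variable name is an assumption
    data = DataSource.read(os.environ["MOOC_DATA"] + os.sep + "anneal.arff")
    # the class attribute is the last one; the API counts attributes from 0
    data.setClassIndex(data.numAttributes() - 1)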
Next, I'm going to instantiate the J48
classifier, and we're going to set some options.
In this case, we're changing the confidence
factor from the default value of 0.25 to 0.3.
With the data available now and the classifier
configured, we can build it, which simply
happens with a "buildClassifier" call supplying
the data.
Then, as a final step, we're outputting the
model with a simple print statement.
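Continuing the sketch, configuring, building, and printing the classifier might look like this ("-C" is J48's option for the confidence factor; the variable names are illustrative):

    from weka.classifiers.trees import J48

    # change the confidence factor from the default 0.25 to 0.3
    cls = J48()
    cls.setOptions(["-C", "0.3"])
    # build the model on the loaded data and print it
    cls.buildClassifier(data)
    print cls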
We run that, and we can see the model that
is output after the classifier has been built
on the data.
Now, that wasn't very hard.
As a next step, we want to evaluate a model
that we've built.
In this case, we're going to use cross-validation,
because there's no point in building a model
if you don't actually know if it's any good.
I'm going to open a new tab and import some
more stuff again.
In this case, we also need the "Evaluation"
class and, since we're cross-validating, we
also want to randomize the data.
For that, we're importing Java's "Random" class.
Okay.
Just like before, we're loading the "anneal"
UCI dataset and setting the class attribute.
Then we're configuring the same classifier
again.
Confidence factor once again 0.3.
And then, we're sort of setting up our evaluation.
First of all, we're initializing our Evaluation
object with the current data in order to obtain
the class priors.
Then, we're calling the crossValidateModel
method of the Evaluation object with the classifier
template not built, the data that we want
to evaluate on, the number of folds--in our
case, we're doing 10-fold cross-validation--and
a random number generator initialized with
a seed value of 1.
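In Jython, that setup could look roughly like this (reusing the "data" and "cls" variables from the sketch above):

    from weka.classifiers import Evaluation
    from java.util import Random

    # initialize the Evaluation object with the data to obtain the class priors
    evl = Evaluation(data)
    # 10-fold cross-validation, randomizing with a seed value of 1
    evl.crossValidateModel(cls, data, 10, Random(1))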
After that finishes, we'll have basically
all of the statistics inside the Evaluation
object, and we want to output some things.
First thing, we want to output some summary
statistics.
There's the so-called "toSummaryString" method.
If you look at the Javadoc, you'll realize
there are actually several methods: one with
no arguments, one with a Boolean argument,
and one with a String and a Boolean argument,
like we're using here.
Now, the difference between Python and Java
is that Python doesn't have method overloading;
instead, it has optional and named parameters.
So, for Jython to work, you basically use
the one method that has all the various
parameters available.
In this case, we have to provide a title
for our summary string and indicate that
we don't want to output any complexity
statistics, hence "False".
That is that.
Since this is classification, we also want
to output the confusion matrix, which you
can do with the "toMatrixString".
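The two output calls could then look like this; the title string is just an example:

    # summary statistics; False = no complexity statistics
    print evl.toSummaryString("=== Summary ===", False)
    # confusion matrix, since this is classification
    print evl.toMatrixString()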
When we run this script, you'll see in the
output our usual summary statistics: accuracy,
misclassifications, the kappa statistic,
all kinds of errors, coverage, and how many
instances there were altogether in the dataset;
almost 900 in the case of the anneal dataset.
The confusion matrix was also output.
You can see there are hardly any instances
that are not on the diagonal.
According to the number of misclassified
instances, there should be only 14.
So, we have 3 here, 2 there, 2 there, and
7 there, which adds up to 14, so all is good.
The final script in this lesson shows how
we can actually use a built model to make
predictions.
So, I'm going to open up a new tab again.
In this case, like in the first script, we
are importing our DataSource for loading data
and our J48 classifier.
We are once again loading a dataset.
In this case, we're not using the usual "anneal"
dataset, but one that's been stripped down
a bit, the "anneal_train" set.
But still, the class attribute is in the same
location; so it's the last one.
Setting that.
We are once again configuring our J48 classifier,
because we were happy with that configuration
based on our cross-validation results,
which were excellent.
Then, we are building our classifier on the
data once again using the "buildClassifier"
method, and, since we want to make predictions
on unlabeled data, we are now loading the
unlabeled data in.
In this case, it's the "anneal_unlbl" dataset,
which basically has the same structure,
but with missing values for the class.
We also set the class attribute for this one.
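Sketched in Jython, with the same assumptions as before about the environment variable and the ".arff" file names:

    import os
    from weka.core.converters.ConverterUtils import DataSource
    from weka.classifiers.trees import J48

    # train on the reduced "anneal_train" dataset
    train = DataSource.read(os.environ["MOOC_DATA"] + os.sep + "anneal_train.arff")
    train.setClassIndex(train.numAttributes() - 1)

    # same J48 configuration that worked well in cross-validation
    cls = J48()
    cls.setOptions(["-C", "0.3"])
    cls.buildClassifier(train)

    # the unlabeled data has the same structure, but missing class values
    unlbl = DataSource.read(os.environ["MOOC_DATA"] + os.sep + "anneal_unlbl.arff")
    unlbl.setClassIndex(unlbl.numAttributes() - 1)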
It's usually recommended that you check whether
your training and test/unlabeled data are
actually compatible.
You can use a method of the "Instances" class
called "equalHeadersMessage" to tell whether
the structure of two datasets is the same.
If you look at this code here, the unlabeled
data is checked against the training data,
and this will return a message, but only if
the datasets differ, for instance, in the
number of attributes, the types of attributes,
or the order of labels.
Otherwise, it will just return "None" or, in
the Java case, "null".
If there is a discrepancy between our
datasets, this message will be output,
simply saying that they are not the same.
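The compatibility check itself is only a couple of lines; "equalHeadersMessage" returns None (Java: null) when the structures match:

    # compare the structure of the unlabeled data against the training data
    msg = unlbl.equalHeadersMessage(train)
    if msg is not None:
        print "Datasets are not compatible:\n" + msg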
Finally, for making our predictions, since
we now have our unlabeled data and our built
model, we just iterate through the unlabeled
data row by row and obtain the class distribution
by calling the "distributionForInstance" method.
We want to know what the chosen class label
is, so we're using the "classifyInstance" method,
which, in the case of a nominal class attribute,
returns the label index, starting at 0.
In order to determine what the string label
actually is, we use the dataset, retrieve
the class attribute, and then determine the
string value that is associated with that
particular index.
Then, with a simple print statement, we output
the class distribution, the label index,
and the associated label.
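Put together, the prediction loop could look like this:

    # iterate through the unlabeled data row by row
    for i in xrange(unlbl.numInstances()):
        inst = unlbl.instance(i)
        # class probability distribution for this instance
        dist = cls.distributionForInstance(inst)
        # index of the chosen label (returned as a double, so cast to int)
        index = int(cls.classifyInstance(inst))
        # look up the string label via the class attribute of the dataset
        label = unlbl.classAttribute().value(index)
        print dist, index, label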
Running that, we get output like this.
First, you get an array, which is the class
distribution, then the index of the label
and the label itself, all separated by lines.
At the bottom, you can count 1, 2, 3, 4, 5, 6
labels altogether, so the index is 5, and the
label is U in this case.
So, what we've learned in this lesson is how
to build a classifier, how to output
cross-validation statistics obtained from a
classifier on a particular dataset, and how
to use a built model to actually make
predictions on new, unlabeled data.
Hope to see you next time!
