Today, we're going to look a bit more at how
to use R from Weka.
More specifically, we'll look at how to use
the MLR library from Weka.
MLR stands for Machine Learning in R.
This library includes many of the learning
algorithms that are available in the R environment
all nicely bundled up in one package.
As we'll see, it's quite easy to use MLR from
Weka.
There is a particular classifier that can
be used to do this.
Okay, let's have a look at how this classifier
works.
I have loaded the diabetes data into the Explorer
so that we can process it using MLR learning
algorithms.
One way to use MLR is to just use the R console.
We saw last time that we can, for example, plot the data in the R console by referring to it as rdata.
This will plot the data that we have loaded
into the Preprocess panel.
We can also use the MLR learning algorithms
from this console by typing in commands.
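For instance, assuming the mlr package is loaded in the R session, the data from the Preprocess panel is available as the data frame rdata, and the class attribute is named "class" (an assumption for the diabetes data), training rpart through mlr's standard interface looks roughly like this:

```r
# A minimal sketch of driving mlr from the R console.
# Assumes rdata is the data frame holding Weka's current dataset
# and that its class attribute is called "class".
library(mlr)

task  <- makeClassifTask(data = rdata, target = "class")
lrn   <- makeLearner("classif.rpart")
model <- train(lrn, task)
print(model)
```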
However, that is a little bit inconvenient.
Instead, we can use the MLR classifier by
selecting it under the Classify panel.
We select the Choose button to choose the
MLR classifier.
As you have seen, this has taken a while,
because Weka actually needs to download and
install the MLR package in R the first time
we want to use it.
However, once this has happened, we don't
need to install the package again, so this
will be much faster in the future.
Okay, now you can see here that we have an
MLR package in the classifiers package.
There's an MLR classifier there, so let's
select it.
The MLR classifier wraps the MLR R library
for building and making predictions using
various R classifiers and regression methods.
Right.
Just like with any other Weka classifier,
we have the text box up here which contains
the configuration information for the MLR
classifier.
Let's just run it with default settings.
You press the Start button, and, by default,
the MLR classifier runs the rpart learning
algorithm in R.
This builds a classification tree from the
data using the CART decision tree learning
method.
You can see that it gets 75% accuracy in the
cross-validation on the diabetes data.
We get all the other performance statistics
that we are used to, as well.
Really, we treat the learning algorithm in
R just like any other Weka learning algorithm.
For this to happen, behind the scenes, the
MLR classifier actually has to transfer the
data into the R environment, build the classifier
in the R environment, and then also feed the
test data to the classifier in the R environment
and get the predictions back.
But it all happens transparently.
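As a rough sketch of what the wrapper automates (not Weka's actual internal code), that round trip corresponds to something like the following mlr calls; rtrain and rtest are hypothetical data frames standing in for the transferred training and test folds:

```r
# Sketch of the round trip: data into R, model built in R,
# predictions handed back. rtrain/rtest are placeholders.
library(mlr)

task  <- makeClassifTask(data = rtrain, target = "class")  # transfer the training data
model <- train(makeLearner("classif.rpart"), task)         # build the classifier in R
preds <- predict(model, newdata = rtest)                   # feed in the test data
getPredictionResponse(preds)                               # predictions returned to Weka
```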
Further up, we can see the tree that has been
generated from this data in textual form.
We also get some information on the learning
algorithm that was used and the package it
originally comes from.
We used rpart, which is a classification algorithm,
so in MLR it's called "classif.rpart".
This learning algorithm comes from the rpart
package, which is a separate package for R.
The MLR package for R just bundles algorithms
from a lot of other packages that are available
in R in one convenient interface, which we
can easily make use of using the MLR classifier.
The name is given here, along with some properties of this algorithm. It can deal with two classes and with multiple classes. It can deal with missing values, with numeric variables (numeric attributes, in other words), and with factors, which are nominal attributes. It could also, potentially, deal with ordinal attributes. It can produce probability estimates, and it can deal with instance weights.
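If you have mlr loaded in an R session, these advertised properties can also be queried programmatically; getLearnerProperties and listLearners are part of mlr's documented API:

```r
library(mlr)

# Capabilities advertised for rpart, e.g. "twoclass", "multiclass",
# "missings", "numerics", "factors", "ordered", "prob", "weights".
getLearnerProperties("classif.rpart")

# Enumerate all the learners bundled by the mlr package.
listLearners()
```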
This is the rpart learning algorithm from
R, but there are many other learning algorithms
that are available in the MLR package, and
most of them are available through the MLR
classifier.
We can choose the algorithm we want to use
by using the RLearner property.
By default, we can see here that rpart is
chosen, but there are many other algorithms
that we can choose from.
There are many classification algorithms,
and there are also many regression algorithms.
Let's run one other classification algorithm
in MLR.
Let's run random ferns.
This is available as "classif.rFerns".
Living in New Zealand, I am quite fond of
ferns, and it's intriguing to see that there
is also a learning algorithm that generates
random ferns.
Now, you can see that when I click this, nothing happens for a while, because Weka actually has to download and install the rFerns package. That has happened now, and we can use this classifier.
A fern is a restricted variant of a decision tree in which all the tests at one level of the tree are exactly the same: they all test the same attribute and perform the same split of the data. Just like the random forest classifier does for regular decision trees, the random ferns algorithm generates an ensemble of ferns.
Okay, let's try this.
Right.
Okay, so this classifier is slightly less
accurate than the rpart classifier, but there
may be other datasets where it outperforms
rpart, because it is an ensemble classifier.
You've seen that it runs quite quickly.
It has actually generated an ensemble of 1000
ferns and the depth was restricted to 5.
So, maybe we should try to decrease the depth
to reduce the chance of overfitting.
We can also specify parameters for the learning
algorithm here in the learnerParams field
of the MLR classifier.
To find out some information about the parameters
that we can use, we actually need to go on
the web.
It's best to go to the list of learning algorithms
that are available in the MLR library first.
To do that, we just search for "MLR integrated learners" and pick the release version of MLR; there is also a development version. The first link here is the one we want.
You have the integrated learners here.
This has a list of all the learning algorithms
that are in the MLR package, and most of them
are available through the MLR classifier in
Weka.
We want to look for rFerns, so I search for
"rFerns" on this page.
There's a link here.
This will take us to the appropriate documentation
page.
It has a list of all the topics that are in
the manual for the rFerns package for R.
rFerns is the actual learning method, so let's
click on this.
Here we have some information on the usage
of the method.
We have the arguments that can be used in R. x and y are just the data; we can ignore those, because they are filled in by the MLR classifier in Weka. We can also ignore formula and data.
But here we can see some relevant parameters
that we might want to change.
We can change the depth for example of the
ferns, and we can change the number of ferns.
Let's change the depth.
Let's try to reduce it in our experiment.
What we do is type "depth = 2" into the learnerParams field if we want to reduce the depth to 2.
Let's rerun the experiment.
We start it again.
Right.
We can also specify multiple parameters.
We can change the number of ferns that we want to generate: by using the ferns argument, we can say how many ferns to include in our ensemble.
To specify multiple arguments, we just separate
them by a comma.
So "ferns = 100" will generate 100 ferns instead
of 1000.
This runs even more quickly now.
The accuracy has actually slightly gone up.
This is most likely due to chance.
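For reference, setting learnerParams to "depth = 2, ferns = 100" is roughly equivalent to the following direct call to the rFerns package in R; the x and y placeholders stand for the predictors and class labels that Weka supplies:

```r
# Direct equivalent of learnerParams "depth = 2, ferns = 100".
# x = data frame of predictors, y = factor of class labels (placeholders).
library(rFerns)

model <- rFerns(x = x, y = y, depth = 2, ferns = 100)
print(model)
```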
We've now seen how we can use the MLR classifier from Weka, and you can also run the MLR classifier from the other user interfaces in Weka.
You can run it from the Weka Experimenter. You can run it from the command line, and
you can also run it from the Knowledge Flow.
Next time, we'll look at how to use R tools
for pre-processing in the Knowledge Flow of Weka.
See you then!
