Hello! Nice to see you.
Nice to be back.
It's me again.
This is Class 3, interfacing to other data
mining packages.
We're going to concentrate on the "R" package
for most of this class, but to begin with,
we're going to look at the LibSVM
and LibLINEAR packages.
These are written by the same people.
They are widely used outside of Weka, and
they are also Weka's most popular packages.
You should install them.
I've got them installed; you should also
install the gridSearch package.
Both of these packages are to do with support
vector machines.
Weka already has the SMO implementation for
support vector machines that we've seen before
in the first course, but LibSVM is more flexible
and LibLINEAR can be much faster.
It's important to know that SVMs can be either
linear or nonlinear through a kernel function,
mentioned very briefly in lesson 4.5 of that
earlier course.
Also, they can do classification or regression,
which we haven't mentioned.
Weka contains SMOreg, the same algorithm
adapted for regression.
We're going to use the gridSearch method to optimize parameters for SVMs, which is quite important.
Let's just look at LibSVM and LibLINEAR, these
two packages, and also the standard SMO and SMOreg.
All three implement linear SVMs.
All but LibLINEAR are capable of accommodating
nonlinear kernels.
LibSVM does one-class classification, which
you will see in the activity associated with
this lesson.
LibLINEAR does logistic regression.
It's linear.
We saw logistic regression in Lesson 4.4 of
the first course.
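The idea behind logistic regression, as covered in that earlier lesson, is a linear score passed through the sigmoid function to give a class probability. Here's a minimal sketch in plain Python; the weights and input point are made up for illustration:

```python
import math

def logistic(w, b, x):
    """Linear score w.x + b squashed through the sigmoid -> P(class = 1 | x)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# A point exactly on the linear boundary gets probability 0.5
print(logistic((1.0, 1.0), 0.0, (0.0, 0.0)))  # 0.5
```

Because the decision boundary is where the score crosses zero, the classifier itself is linear, which is why it fits naturally in LibLINEAR.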
LibLINEAR is very fast, and it can
operate with the L1 norm, which I'm not going
to explain in this lesson.
Just a quick look at LibLINEAR.
I did a speed test.
I used the data generator on Weka's Preprocess
panel to generate 10,000 instances of this
data, LED24.
LibLINEAR took two seconds to build the model.
LibSVM took 18 seconds to build the model,
but that's a slightly unfair comparison, because
it's using a nonlinear kernel.
So when I changed it to use a linear kernel,
it took 10 seconds.
And SMO with default parameters, which is
a linear kernel, took 21 seconds.
So you can see LibLINEAR is quite a lot faster.
Now, let's just talk about linear boundaries
and support vector machines in general.
Support vector machines try to drive a channel
between the two classes.
Here we've got the blue class and the green
class, and they try to drive a channel
halfway between the classes to leave as
large a margin as possible.
In this case, we've got zero errors on the
training data, and a pretty small margin,
the distance between the dashed lines.
However, when we look at the test data--
now this is an artificial dataset--
but in this case you can see that some points
in the test data are being classified incorrectly.
Four points, in fact.
If, instead of using this line, we turned
it a bit and used a line with a much larger
margin, although it makes one error on the
training data, in this particular situation
it gets all of the test data correct, no errors
on the test data.
It's sometimes an advantage to have a large
margin, even at the expense of errors on the
training data.
SVMs try to give you large margin classifiers.
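The margin idea above can be sketched in a few lines of Python. The margin of a linear boundary is the perpendicular distance from the boundary to the nearest training point; the weights and points below are made up for illustration:

```python
import math

def distance_to_boundary(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def margin(w, b, points):
    """The margin is the distance from the boundary to the closest point."""
    return min(distance_to_boundary(w, b, x) for x in points)

# Two hypothetical training points either side of the line x1 + x2 - 1 = 0
pts = [(1.0, 1.0), (-0.5, 0.5)]
print(margin((1.0, 1.0), -1.0, pts))
```

An SVM chooses the weights that make this minimum distance as large as possible, which is exactly the "widest channel" picture in the slides.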
Here we are with a nonlinear dataset.
I've drawn a linear boundary here, the boundary
that's produced by LibLINEAR or LibSVM with
a linear kernel, or indeed the SMO package
and the SMO classifier in Weka.
This gives 21 errors on the dataset, or the training
set.
Here's a nonlinear boundary for the same dataset
implemented by LibSVM with an RBF kernel.
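The RBF (radial basis function) kernel that produces this nonlinear boundary is just a similarity measure that decays with squared distance, controlled by the gamma parameter we'll optimize later. A minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; similarity falls off with distance,
# and a larger gamma makes it fall off faster.
print(rbf_kernel((0, 0), (0, 0), gamma=0.5))  # 1.0
print(rbf_kernel((0, 0), (1, 1), gamma=0.5))
```

Larger gamma means each support vector influences only a small neighborhood, which is why extreme gamma values can carve out the overfitted islands we'll see in a moment.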
I've got this dataset open in Weka's BoundaryVisualizer
over here, and I'm going to just choose LibSVM --
luckily, I've installed the package already --
and I just start.
Ok, let's speed this up.
There we are.
That's the result, and you can see it's making
some errors down here and up here on the dataset,
on the training set.
Let's go to the Explorer.
I've got the same data file open, and I'm
going to go again to LibSVM and take a look.
We're plotting the training set here, so
if I look at that I get a total of nine errors,
four and five, respectively, on the different
training set parts.
That's with the default parameters.
If I change the LibSVM parameters, then I
can get this boundary.
Now, this is quite a good boundary, because
it gives zero errors on the training set,
but it gives poor generalization, because
it doesn't drive a channel right between those
two classes.
With different parameters, I can continue
to get zero errors on the training set with
a much more satisfactory boundary, which will
probably generalize better.
Whenever you use nonlinear support vector
machines, you need to optimize the parameters.
The parameters we're talking about are called
cost and gamma.
When we optimize parameters in Weka, we use
the gridSearch method, which is in the meta
category.
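What gridSearch does is conceptually simple: try every (cost, gamma) pair on a grid, score each pair, and keep the best. Here's a sketch in plain Python; the scoring function is a hypothetical stand-in for Weka's cross-validated evaluation, rigged to peak at the values found later in this lesson:

```python
def grid_search(costs, gammas, evaluate):
    """Try every (cost, gamma) pair, score each, and keep the best pair."""
    best_score, best_pair = float("-inf"), None
    for c in costs:
        for g in gammas:
            score = evaluate(c, g)
            if score > best_score:
                best_score, best_pair = score, (c, g)
    return best_pair, best_score

# 10^-3 ... 10^3 in multiplicative steps of 10, as on the slide
grid = [10.0 ** e for e in range(-3, 4)]

# Hypothetical score standing in for cross-validated accuracy;
# it peaks at cost=1000, gamma=10.
toy_score = lambda c, g: 1.0 / (1.0 + abs(c - 1000) + abs(g - 10))

print(grid_search(grid, grid, toy_score))  # ((1000.0, 10.0), 1.0)
```

Weka's gridSearch adds refinements on top of this (it can extend the grid if the optimum lands on an edge), but the exhaustive loop is the core idea.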
These are the parameters for gridSearch.
The default configuration for gridSearch,
well let's look at it.
Down at the bottom, it says use SMOreg, that's
the default, and evaluate using the correlation
coefficient.
We're going to need to change those.
Then the first six boxes describe the X axis
of the grid, and the next six boxes the Y axis.
The X property being optimized is called C,
and that's going from 10^3 down to 10^-3 in
multiplicative steps of 10.
That's what those first six parameters signify.
The second six parameters give the same range
with the Y property of kernel.gamma.
That's for SMOreg.
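Each axis of the grid is built from a min, max, step, base, and an expression; with the defaults just described, the expression is pow(BASE, I), so the axis values come out like this (a sketch of the computation, not Weka's actual code):

```python
# Default axis settings described above: min=-3, max=3, step=1, base=10,
# expression pow(BASE, I) -- i.e. 10^I for I = -3 .. 3
x_min, x_max, x_step, base = -3, 3, 1, 10
axis = [float(base) ** i for i in range(x_min, x_max + 1, x_step)]
print(axis)  # [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
```

So each axis has seven candidate values, and the full grid is 7 x 7 = 49 (cost, gamma) combinations to evaluate.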
If we want to use LibSVM, we need to change
some things.
We're going to optimize the properties cost
and gamma.
We're going to choose the classifier LibSVM
and we're going to evaluate using Accuracy.
Let me set that up in Weka.
I'm going to choose gridSearch from the meta
category.
In gridSearch, I'm going to first of all choose
the classifier.
I'm going to choose LibSVM.
I'm going to optimize--
let's move this up so you can see--
optimize the Accuracy.
Then the two properties involved are cost
and gamma.
If I run that.
It's finished here, and the result is--
the parameters are 1000 for the X coordinate,
that's cost, and 10 for the Y coordinate,
that's gamma.
We've got 100% accuracy with that dataset.
We could see we were going to get 100% accuracy
when we looked at the boundary visualization.
That's for LibSVM.
If we were to choose a different method, like
SMO, it's got different parameters.
Let me just look at SMO here.
I'm going to choose SMO.
I need to find the appropriate parameters.
Here are the SMO parameters.
I want C here for the cost, and if I look
at the kernel, I want an RBF kernel, and in
the RBF kernel the key parameter here is gamma.
So it's kernel.gamma: the kernel property,
then its gamma parameter.
I'm going to use C and kernel.gamma.
That will allow me to optimize SMO.
OK, so gridSearch is fairly complicated to
use, but it's necessary to optimize the parameters
when using nonlinear support vector machines.
Here's a summary.
We've looked at LibLINEAR, which does all
things linear: linear SVMs and logistic regression.
It can use the L1 norm, which minimizes
the sum of absolute values rather than the sum of
squares -- that has big advantages under certain
conditions -- and it's very fast.
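The difference between the two criteria is easy to see on a toy set of errors; the numbers below are made up for illustration:

```python
def l1_norm(errs):
    """Sum of absolute values -- what an L1-based criterion minimizes."""
    return sum(abs(e) for e in errs)

def l2_sq(errs):
    """Sum of squares -- the usual least-squares criterion."""
    return sum(e * e for e in errs)

# A single outlier (10) dominates the sum of squares far more than the
# L1 sum, which is one reason L1 can be more robust in some situations.
errors = [1, -1, 2, 10]
print(l1_norm(errors))  # 14
print(l2_sq(errors))    # 106
```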
LibSVM is all things SVM, linear and nonlinear
SVMs.
The practical advice when you want to use
SVMs is first use a linear SVM, do it quickly
with LibLINEAR, perhaps, and see how you get
on.
Then, for a nonlinear SVM, select the RBF
kernel.
But when you select a nonlinear kernel like
RBF, it's really important to optimize cost
and gamma, and you can do this using the gridSearch
method.
Here's a reference on support vector machines
and on these packages. The activity, as I
said before, will involve one-class classification,
an interesting thing that LibSVM can do.
Good luck with that, and we'll see you later.
Bye for now!
