Hello! Welcome to Class 2 of Advanced Data
Mining with Weka.
My name is Albert Bifet.
I'm a member of the Weka machine learning
group, and I work at Telecom ParisTech in Paris, France.
In this class, we're going to talk about data
stream mining in Weka and MOA.
Data stream mining is a way of doing real-time
analytics, so this is going to be very, very
important for big data and the Internet of
Things.
Let's start with the first lesson, which is about incremental classifiers in Weka.
As you know, in Weka we usually store the whole dataset in memory and then build our classifier from that stored dataset.
This is what is called the batch setting.
In the incremental setting, we update our classifier one instance at a time.
There's a huge difference between the two
settings.
Let's look at the incremental setting in more detail.
So, the idea is that we process one example at a time, not a whole dataset of examples.
We use a limited amount of memory, because we don't store the whole dataset.
We work in a limited amount of time, so we need to be very fast.
And we are ready to predict at any point.
That's the main difference.
In the batch setting, we process the whole dataset and are only ready to predict once the classifier has been built.
Weka has many different incremental methods.
To know which ones are incremental, we look for the ones that implement the UpdateableClassifier interface.
We can find many different methods, such as NaiveBayes, NaiveBayesMultinomial, k-nearest neighbours, stochastic gradient descent, and also some decision trees, such as the Hoeffding Tree.
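To make this concrete, here is a minimal sketch of incremental training in Weka's Java API, following the incremental-learning pattern from the Weka documentation. It assumes an ARFF file called stream.arff (a hypothetical file name) whose last attribute is the class.

    import java.io.File;
    import weka.classifiers.trees.HoeffdingTree;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    public class IncrementalDemo {
        public static void main(String[] args) throws Exception {
            // Read only the ARFF header; instances are pulled one at a time.
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("stream.arff")); // hypothetical file name
            Instances structure = loader.getStructure();
            structure.setClassIndex(structure.numAttributes() - 1);

            // HoeffdingTree implements UpdateableClassifier.
            HoeffdingTree tree = new HoeffdingTree();
            tree.buildClassifier(structure); // initialise from the empty header

            // Update the model one instance at a time;
            // the classifier is ready to predict at any point in this loop.
            Instance inst;
            while ((inst = loader.getNextInstance(structure)) != null) {
                tree.updateClassifier(inst);
            }
            System.out.println(tree);
        }
    }

Any classifier that implements UpdateableClassifier, for example NaiveBayesUpdateable, can be dropped into the same loop.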
As you know, the standard decision tree learner is not incremental, so we need to have the whole dataset in memory.
The Hoeffding Tree is a state-of-the-art incremental decision tree learner.
The Hoeffding Tree was proposed by Pedro Domingos and his group around 2000.
The difference from standard decision trees is in how splits are decided: in a standard decision tree, when we need to decide whether or not to split, we look at the data that we have in memory, compute the statistics, and then decide whether to split.
But in the incremental setting, we don't store the dataset in memory, so when we need to decide whether to split, we have to wait for new instances to arrive.
So, how many instances do we need before we can decide whether to split?
This is computed using the Hoeffding bound, and this is why the Hoeffding Tree has its name.
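As an aside, the Hoeffding bound states that after n independent observations of a random variable with range R, the observed mean is, with probability 1 - \delta, within \epsilon of the true mean, where

    \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}

The Hoeffding Tree uses this bound to decide, with confidence 1 - \delta, that the attribute with the best information gain on the instances seen so far is truly the best one: it splits once the gain difference between the best and the second-best attribute exceeds \epsilon.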
Another interesting property of the Hoeffding Tree is its theoretical guarantee: if the number of instances used to build the model is large enough, the resulting tree will be similar to a decision tree built using the standard batch method.
Let's see an example in Weka.
We're going to generate a dataset and then evaluate a Hoeffding Tree on it.
Let's start by generating the data.
We are going to use the RandomRBF (random radial basis function) generator.
This generator creates data by building a random set of centers for each class.
In this case, we're going to use 50 centroids, 10 attributes, and 2 classes, and we're going to create 1,000 instances.
Let's generate the data.
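Outside the Explorer, roughly the same data could be produced programmatically. This is only a sketch: the setter names setNumCentroids, setNumExamples, and setSeed are assumptions based on the generator's documented options, so check the Javadoc for your Weka version.

    import weka.core.Instances;
    import weka.datagenerators.classifiers.classification.RandomRBF;

    public class GenerateRBF {
        public static void main(String[] args) throws Exception {
            RandomRBF gen = new RandomRBF();
            gen.setNumCentroids(50);  // assumed setter; 50 is also the default
            gen.setNumExamples(1000); // assumed setter; number of instances
            gen.setSeed(1);           // assumed setter; for reproducibility
            // 10 attributes and 2 classes are the generator's defaults.
            gen.setDatasetFormat(gen.defineDataFormat());
            Instances data = gen.generateExamples();
            System.out.println(data.numInstances() + " instances generated");
        }
    }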
Now we have the data, 1000 instances.
Now, let's classify it using the Hoeffding Tree.
Let's choose the classifier.
Hoeffding Tree.
We're going to run a 10-fold cross-validation.
We see this is very fast, and we get an accuracy
of 71%.
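The same evaluation can also be run from code. Here is a minimal sketch, assuming the generated dataset was saved as RandomRBF.arff (a hypothetical file name) with the class as the last attribute:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.HoeffdingTree;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidateHT {
        public static void main(String[] args) throws Exception {
            // Load the generated dataset and mark the last attribute as the class.
            Instances data = DataSource.read("RandomRBF.arff"); // hypothetical file name
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation of a Hoeffding Tree with default settings.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new HoeffdingTree(), data, 10, new Random(1));
            System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }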
Now, let's do it again, but generating 100,000 instances.
We change this parameter and generate the data.
Then we run the 10-fold cross-validation again.
We see that it takes more time, but it's still really fast.
At the end, the accuracy goes up to 89%.
With 100,000 instances we get 89%, whereas with 1,000 instances we were only getting 71%.
Increasing the number of instances that the Hoeffding Tree processes allows it to achieve much better accuracy.
In this lesson, we have seen the two settings of Weka: the batch setting and the incremental setting.
In the batch setting, we stored the whole dataset in memory.
In the incremental setting, we built the classifier one instance at a time.
The nice thing about the incremental setting is that it can be much more efficient: it uses less memory and it is much faster.
See you in the next lesson!
