Hello again, and welcome back to Advanced
Data Mining with Weka.
This is Lesson 1.3.
We're going to look at the timeseriesForecasting
package now, and use it to do roughly what we did
in the last lesson without it.
We need to install it first of all.
Let me go to my Weka Tools menu, to the Package
Manager.
Here's the Package Manager, and here are the
packages.
If I scroll down here (it's pretty hard to find
things in this list of packages),
near the end is the timeseriesForecasting
package.
If I click install here, that will install
it.
Actually, it's already installed on my computer,
so I'm not going to do that.
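(As an aside, you can also install packages without the GUI, using Weka's command-line package manager. This is just a sketch, assuming weka.jar is on your classpath; check your Weka version's documentation for the exact form:

    java weka.core.WekaPackageManager -install-package timeseriesForecasting

That does the same thing as clicking Install in the Package Manager.)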
I've got the airline data loaded here.
The time series package has given me this
additional Forecast tab.
I'm going to go straight to that,
and without any more ado I'm just going to
click Start and see what happens.
Well, the timeseries package transforms the
data into a large number of attributes.
Unfortunately, you don't get to see the attributes
in the Preprocess panel.
We still just have those two attributes there.
You don't see the generated attributes there.
You have to go to the Forecast panel and look
here.
Here are the original attributes,
and here's the transformed training data: passenger_numbers;
we've got month, quarter, and date-remapped.
Date-remapped is like what we did for
the date in the last lesson, when we manually
changed it from milliseconds
since January 1, 1970 into something more
sensible.
This actually does a better job,
because it takes proper account of which years
are leap years
and which years aren't leap years.
Then we've got these lagged variables:
passenger_numbers lagged by 1, 2, 3,
right up to 12, for the 12 months, I guess;
before, we just had the lag of 12.
We've got the square of the date-remapped
and the cube of the date-remapped, in case
you need those,
and a bunch of other things: the date-remapped
times these lagged variables.
That's a lot of variables.
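(By the way, if you ever want to do this from Java code rather than the GUI, the package has an API that sets up the same kind of lag structure. Here's a rough sketch, assuming the airline data is in a file called airline.arff with attributes named passenger_numbers and Date; the file name, attribute names, and exact method calls are assumptions on my part, so do check them against your version of the package:

    import java.util.List;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.timeseries.WekaForecaster;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AirlineForecastSketch {
      public static void main(String[] args) throws Exception {
        // Load the airline data (the file name here is hypothetical).
        Instances airline = DataSource.read("airline.arff");

        WekaForecaster forecaster = new WekaForecaster();
        forecaster.setFieldsToForecast("passenger_numbers"); // the series to forecast
        forecaster.setBaseForecaster(new LinearRegression()); // base learner

        // Lag creation, roughly as the Forecast panel does it by default:
        // lags 1 to 12, plus month and quarter as periodic attributes.
        forecaster.getTSLagMaker().setTimeStampField("Date");
        forecaster.getTSLagMaker().setMinLag(1);
        forecaster.getTSLagMaker().setMaxLag(12);
        forecaster.getTSLagMaker().setAddMonthOfYear(true);
        forecaster.getTSLagMaker().setAddQuarterOfYear(true);

        // Build the forecaster on all the data and print the model.
        forecaster.buildForecaster(airline, System.out);

        // Prime it with the most recent data and forecast 12 steps ahead.
        forecaster.primeForecaster(airline);
        List<?> forecast = forecaster.forecast(12, System.out);
        System.out.println("Number of forecasted steps: " + forecast.size());
      }
    }

That's roughly what the Forecast panel is doing behind the scenes when it generates all those attributes.)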
Underneath here is the generated model, which
is very complicated.
Let's see how well it does.
Actually, it doesn't show here how well it
does.
To see that, we have to turn on Perform Evaluation.
Let me click that here.
Run it again, and we get a root-mean-squared
error of 10.6 on the training set, which looks good:
last time we got 16.0,
and that was the best figure we got.
But remember, this is the error on the training
set.
That's always very misleading.
Let's make a simpler model.
There's a lot of attributes here.
We can't edit the generated attributes, like
I said, but we can apply a filter.
So I'm going to go to Advanced Configuration,
and for my base learner, I'm going to choose
the FilteredClassifier.
And in the FilteredClassifier, I'm going
to specify linear regression just like we
had before, and for the filter, I'm going
to choose the Remove attribute filter.
Here it is.
I'm going to configure that to remove attributes
number 1, 4, and 16, which I happen to know
are the correct ones.
I'm sorry, I mean I'm going to keep attributes 1, 4, and
16: I'm going to set invertSelection to True,
so these are the three attributes that I keep.
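(In case it helps to see it written out, here's a sketch of that same FilteredClassifier configuration in Java; the indices 1, 4, and 16 are just the ones that happen to be right for this particular transformed dataset:

    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.Remove;

    public class KeepThreeAttributesSketch {
      public static void main(String[] args) throws Exception {
        // Remove filter set up to *keep* attributes 1, 4 and 16:
        // select them, then invert the selection.
        Remove remove = new Remove();
        remove.setAttributeIndices("1,4,16");
        remove.setInvertSelection(true);

        // FilteredClassifier: apply the filter, then train linear regression.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(remove);
        fc.setClassifier(new LinearRegression());

        // Just print the equivalent command-line options to check the setup.
        System.out.println(java.util.Arrays.toString(fc.getOptions()));
      }
    }

You'd then choose that FilteredClassifier as the base learner, exactly as I've just done in the panel.)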
Well, let's just see what happens.
Go back and look at my attributes, and here are
the generated attributes that we saw before.
Now here are the filtered attributes:
we've got passenger_numbers, we've got date-remapped,
and we've got this lag by 12.
This is what we did in the last lesson, remember?
Let's see how we get on here.
We got a root-mean-squared error of 27.8.
Actually, we got that in the last lesson,
but we got even better results by deleting
the first 12 instances.
Remember, the first 12 instances have lagged
variables with unknown values, and linear regression
does bad things with unknown values, at least
as far as time series are concerned.
So I want to delete the first 12 instances.
Now, I could do that by applying two filters,
one to remove attributes and one to remove instances,
and I could combine them with the MultiFilter.
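(Here's a sketch of what that two-filter version might look like, using the same kind of Java setup as before; the instance range "first-12" and the attribute indices are just the values from this example:

    import weka.filters.Filter;
    import weka.filters.MultiFilter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.filters.unsupervised.instance.RemoveRange;

    // Keep attributes 1, 4 and 16, as before (inverted selection).
    Remove keepThree = new Remove();
    keepThree.setAttributeIndices("1,4,16");
    keepThree.setInvertSelection(true);

    // Drop the first 12 instances, whose lagged values are unknown.
    RemoveRange dropFirstYear = new RemoveRange();
    dropFirstYear.setInstancesIndices("first-12");

    // Chain the two filters; the MultiFilter then goes in the FilteredClassifier.
    MultiFilter multi = new MultiFilter();
    multi.setFilters(new Filter[] { keepThree, dropFirstYear });

You'd set multi as the FilteredClassifier's filter instead of the plain Remove.)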
But actually in the Forecast panel,
there's an easy way of doing that,
which you really need to learn, because you're
going to be doing it a lot.
In Advanced Configuration, we're going to
look at Lag creation and the More options.
We're going to say remove leading instances
with unknown lag values.
Let me run that, and now I get a root-mean-squared error of 15.8,
and a model which is exactly the same as
the model we got in the last lesson:
1.07 times lag_passenger_numbers plus 12.7.
That's what we got before.
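(Just to make that model concrete: if, say, the same month a year earlier had 400 passengers, a figure I'm making up purely for illustration, the prediction would be 1.07 times 400 plus 12.7, which is about 441.)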
Now, let's just return to this full model
that we had.
We won't use the filtered classifier;
we'll just use linear regression.
Here it is.
Now, we get a root-mean-squared error of 8.7.
It looks fantastic.
But the model looks extremely complicated.
We looked at it before.
Here it is again.
Look at the complexity of this model.
So it's probably overfitted.
What we'd like to do is to evaluate this on
data held out from training.
We can do that with the Evaluation panel.
We can either give a fraction here or a number
of instances.
I'm going to evaluate on 24 instances, that
is, two years' worth of instances, and run that.
I get an error on the test data of 59.
That's huge.
The error on the training data is only 6.4.
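(If you wanted to reproduce that split outside the Forecast panel, a simple sketch in Java would be to slice off the last 24 instances, again assuming the data comes from a hypothetical airline.arff:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    Instances data = DataSource.read("airline.arff");           // hypothetical path
    int testSize = 24;                                           // two years of monthly data
    int trainSize = data.numInstances() - testSize;
    Instances train = new Instances(data, 0, trainSize);         // all but the last 24
    Instances test  = new Instances(data, trainSize, testSize);  // the last 24

Bear in mind that the Evaluation tab doesn't treat those 24 instances as independent test cases: it primes the forecaster and makes one-step-ahead predictions across them, which we'll come back to shortly.)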
So let's just have a look at this on the
slide.
With the full model, all the attributes, we've
got this enormous gap between the training
error and the test error.
And with this simple model, with just two
attributes there, there's a little gap, but not very big.
So we could try reducing the attributes in
other ways.
We could actually use the AttributeSelectedClassifier.
I won't do that for you, but to do it I'd choose
the metalearner AttributeSelectedClassifier,
specify linear regression as the base learner,
and then specify some attribute selection method.
If I left all of that at the defaults, I would
in fact get four attributes selected,
and I'd get training and test errors of 11
and 19.
There's still some indication of overfitting;
it's the gap between those two figures that
indicates it.
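(If you want to try that AttributeSelectedClassifier yourself, here's a sketch of the setup in Java, leaving everything at the defaults:

    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.meta.AttributeSelectedClassifier;

    // Meta-learner that runs attribute selection before training the base
    // learner; by default it uses CFS subset evaluation with best-first search.
    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setClassifier(new LinearRegression()); // linear regression as the base learner

You'd choose that as the base learner under Advanced Configuration, just as we did with the FilteredClassifier.)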
Now, we reduced the model to two attributes
using a filter, the Remove filter.
But actually there is a simpler way of doing
that,
which you need to learn, in the Forecast panel.
If you go to Lag creation, it's going
to create lags from 1 to 12 (we saw those),
but if we use custom lag lengths,
we can increase the minimum lag to 12, and now it's only
going to create a lag of 12.
I can remove the powers of time.
Remember we had the time squared and the time
cubed.
We can remove the product of time and lagged
variables.
And if I go to Periodic attributes here and
click Customize, then I can choose which
of these attributes I want it to generate.
Now, I'm not going to include any of those.
So that will get us the simplest
attribute set.
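(The API equivalent of those custom lag lengths, as a sketch, is to narrow the lag range on the TSLagMaker we saw earlier. I'm only showing the lag setters I'm reasonably sure of here; the powers-of-time, product, and periodic options have their own setters, whose exact names you should check in the package's Javadoc:

    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.timeseries.WekaForecaster;

    WekaForecaster forecaster = new WekaForecaster();
    forecaster.setFieldsToForecast("passenger_numbers");
    forecaster.setBaseForecaster(new LinearRegression());
    forecaster.getTSLagMaker().setTimeStampField("Date");
    // Only create the lag-12 variable: set both ends of the lag range to 12.
    forecaster.getTSLagMaker().setMinLag(12);
    forecaster.getTSLagMaker().setMaxLag(12);

That restricts the lag creation to the single lag-12 variable; the other generated attributes are controlled by the remaining options.)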
I'll just run that, and let's look now at
the attributes being used, just three of them:
passenger_numbers, date-remapped, and this
lag by 12.
Down here, of course, we've got the same result
as we got before.
We've got the same model and the same training
and test errors.
If we plot these things, this is the training
data.
Now remember we're ignoring the first 12 instances
at the beginning because we have unknown values
for the lagged variable,
and we're reserving 24 instances at the end
for testing.
So if we look now at the full model, we get
this red line,
and you can see that the predictions over
the test data are starting to diverge from those
data points.
If you look at the simple model, the one with
just two attributes, then we get a more accurate line.
Here they are, in fact, both together,
and you can see the blue one from the simple
model is more accurate than the red one for
the full model.
We're using one-step-ahead predictions to
evaluate the error here,
which means that errors can propagate.
If you look at the solid red line toward the
end, the first of those big dips is an error,
and then the second sort of 'double dip' is
an error that's propagated from the first error.
Once it starts making an error in this kind
of evaluation,
when we're evaluating one step ahead each
time, the errors are going to propagate.
So it's a pretty bad thing: once you start
making errors,
they get worse and worse.
Ok. That's it.
Weka's timeseriesForecasting package makes
it easy to experiment with lagged variables
and other kinds of things like that.
It automatically generates many attributes,
perhaps too many attributes, so it's a good
idea to always try simpler models.
You can use the Remove filter, which is what
we did at first,
or you can choose which attributes you want
using the Lag creation and Periodic attributes
tabs under Advanced Configuration.
As always in data mining, you need to be wary
of evaluation based on the training data,
and you can hold data out using the Evaluation
tab.
Finally, we're evaluating time series using
repeated one-step-ahead predictions,
which means that errors propagate.
There's a reference here for a paper
which talks about this approach to time series
analysis.
What you should do now is go to the activity
associated with this lesson,
which will take you through the kinds of things
we've done here,
but using a different base learner, not linear
regression.
So have fun with that,
and I look forward to seeing you in the next
lesson.
Bye for now!
