Hi! Welcome back to New Zealand for some more
Advanced Data Mining with Weka.
This is the last lesson on the time series
forecasting facilities.
We're going to look at some features that
we haven't looked at so far.
First of all, the time stamp.
Any attribute of type "date" is used by default as
the time stamp, but you can change
this under the basic configuration parameters.
I've loaded the airline data once again, and
if I go to the Forecast panel, it's going
to use Date as the time stamp, but I could
change that to another attribute if I wanted.
Also, the periodicity.
We've been detecting the periodicity automatically.
This data is monthly; I think there are 144
monthly instances, but we can specify something
else if we prefer.
We could actually specify, let's say, weekly.
This is not necessarily a very sensible thing
to do, but what would happen if we specified
weekly? First of all, it affects the lagged
variables, the variables that are generated.
Now we've got a large number of lagged
variables. Actually, with weekly data we've
got 52 lagged variables generated.
52 weeks in a year, a whole year's worth.
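
To make the idea of lagged variables concrete, here is a minimal sketch in Python. It is not Weka's code; the function name make_lagged_rows and the sample values are just illustrative assumptions. Each row gets the previous n_lags values of the series as extra inputs, and the first few rows have unknown lags.

```python
def make_lagged_rows(series, n_lags):
    """Return one row per time step: [lag_1, ..., lag_n, target].

    Lags that would reach before the start of the series are None,
    which stands in for an unknown (missing) value.
    """
    rows = []
    for t, target in enumerate(series):
        lags = [series[t - k] if t - k >= 0 else None
                for k in range(1, n_lags + 1)]
        rows.append(lags + [target])
    return rows


# Illustrative monthly values, not read from the real file.
passengers = [112, 118, 132, 129, 121]
rows = make_lagged_rows(passengers, n_lags=2)
for row in rows:
    print(row)
```

With monthly data you would use 12 lags, with weekly data 52, so the number of generated attributes grows with the periodicity you choose.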
As well as that, Weka inserts interpolated
instances for the missing values.
So, if we're trying to do this weekly and
the data was only monthly, then there's a
whole lot of weeks which need to be interpolated,
and these are them.
These weeks, and there's a long list of weeks
here, have been interpolated into the data.
Then of course, the inserted instances have no
values of their own, they're all missing values,
so Weka interpolates the values for all of
the attributes.
So these values have been interpolated.
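
Here is a rough sketch of what inserting weekly instances into monthly data could look like. The lesson doesn't say which interpolation scheme Weka uses, so linear interpolation is an assumption here, and interpolate_weekly is a hypothetical helper, not part of the package.

```python
from datetime import date, timedelta


def interpolate_weekly(points):
    """Resample (date, value) pairs, sorted by date, at 7-day steps.

    Values between two known points are linearly interpolated.
    Assumes at least two points.
    """
    start, end = points[0][0], points[-1][0]
    out = []
    seg = 0
    current = start
    while current <= end:
        # Move to the pair of known points that brackets `current`.
        while seg < len(points) - 2 and points[seg + 1][0] < current:
            seg += 1
        (d0, v0), (d1, v1) = points[seg], points[seg + 1]
        frac = (current - d0).days / (d1 - d0).days
        out.append((current, v0 + frac * (v1 - v0)))
        current += timedelta(days=7)
    return out


# Three illustrative monthly instances.
monthly = [(date(1949, 1, 1), 112.0),
           (date(1949, 2, 1), 118.0),
           (date(1949, 3, 1), 132.0)]
weekly = interpolate_weekly(monthly)
for d, v in weekly:
    print(d, round(v, 2))
```

Three monthly instances turn into nine weekly ones here, which is the same kind of blow-up we see in the instance counts.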
In this case, the airline data monthly is
144 instances.
Weekly, we've got 573 instances here, and
if I were to specify hourly we'd have 104,000 instances.
The periodicity, as I said, determines what
attributes are created,
different numbers of lagged variables depending
on whether it's monthly, weekly, daily or hourly.
If it's daily, then we include a Day of the
Week attribute and Weekend attributes.
If it's hourly, we include a Morning or Afternoon
attribute.
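
As a sketch of how the periodicity could drive these extra attributes, here is a small Python function. The attribute names and the function date_derived_attributes are assumptions made for illustration; they are not the names Weka generates.

```python
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def date_derived_attributes(timestamp, periodicity):
    """Return extra attributes for one instance, depending on periodicity."""
    attrs = {}
    if periodicity == "daily":
        # weekday() is 0 for Monday through 6 for Sunday.
        attrs["day_of_week"] = DAY_NAMES[timestamp.weekday()]
        attrs["weekend"] = timestamp.weekday() >= 5
    elif periodicity == "hourly":
        # True in the morning, False in the afternoon.
        attrs["am"] = timestamp.hour < 12
    return attrs


print(date_derived_attributes(datetime(2011, 1, 1), "daily"))
print(date_derived_attributes(datetime(2011, 1, 3, 15), "hourly"))
```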
Of course, you can override all of these attributes
using the Advanced Configuration panel.
I bet you're tired of the airline data now.
I'm going to open another dataset, the appleStocks data.
We need to find this data.
When you install a package in Weka, it installs
the package information in your home folder,
so I'm going to go to my home folder, wekaFiles /
Packages / timeseriesForecasting package, and
here I've got some sample data, timeseriesForecasting
data.
I'm going to open appleStocks.
Now, this data contains more than one thing
to predict.
It's actually got the daily high, low, opening,
and closing values for the Apple stocks in
the year 2011, plus the sales volume.
I'm going to go here--I need to tell it what
to forecast.
I'm going to forecast Close.
Let me just see what happens.
It's generated lags here.
It's generated 12 lags.
I think I want to tell it this data is weekly
actually.
I don't think it's figured that out.
The Periodicity is weekly.
No, I'm sorry, the periodicity is daily for
this data.
Let me do that, and now I've got seven lagged
variables, a whole week's worth of lagged
variables.
There were some missing values, and instances
were inserted, a few instances.
Those are mostly weekends, actually, those
instances.
That's what the skip list is for.
I don't really want to include weekends, because
the stock market is closed.
If I type "weekend" here, and do it again,
then I will have reduced the number of interpolated
instances.
There are still a few of them--five of them--and
those correspond to holidays when the stock
exchange was closed.
So, I can actually specify a list of dates
here, as well as the word "weekend".
Let's specify a list of dates in the format
that's on the slide.
Let me just try that.
Now I'm hoping for no interpolated instances.
Yep, there's none there.
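
The effect of the skip list can be sketched like this. It's a simplified illustration, not Weka's implementation: missing_days is a hypothetical helper, and the three weeks of trading days, with one holiday, are made up for the example.

```python
from datetime import date, timedelta


def missing_days(observed, skip_weekends=False, skip_dates=()):
    """List the days between the first and last observation that have
    no instance and are not covered by the skip list."""
    observed = set(observed)
    skip_dates = set(skip_dates)
    day, last = min(observed), max(observed)
    missing = []
    while day <= last:
        skipped = (skip_weekends and day.weekday() >= 5) or day in skip_dates
        if day not in observed and not skipped:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Illustrative data: weekdays from 3 to 21 January 2011, minus one holiday.
holiday = date(2011, 1, 17)
observed = [date(2011, 1, d) for d in range(3, 22)
            if date(2011, 1, d).weekday() < 5 and date(2011, 1, d) != holiday]

print(len(missing_days(observed)))                       # nothing skipped
print(missing_days(observed, skip_weekends=True))        # weekends skipped
print(missing_days(observed, True, [holiday]))           # weekends and holiday
```

Every day that is still listed as missing is a day that would need an interpolated instance, so the aim is to get that list down to nothing.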
I think what I'd like to do is to specify
under the lags, I want to use maybe 2 weeks'
worth, that would be 10 working days.
Let's up that number to 10.
Ok, that's the data prepared.
Now, let's do some evaluation on this data.
First of all, I'm going to remove the leading
instances, which are
the ones with unknown lag values, which is
a good idea.
And then we're going to hold out some of the
instances.
Let's go and remove leading instances, and
then go to Evaluation.
We're going to evaluate on training and test,
and I'm going to leave this at 30%.
We're going to use 30% of the dataset for
testing.
OK. I'm going to look here at the mean
absolute error.
We've got these numbers here, 7.7 on the slide.
You can see that since we've removed the leading
instances, we get slightly better results
than if we hadn't done that.
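
For reference, here is a minimal sketch of this kind of evaluation: hold out a fraction of the instances and compute the mean absolute error. It assumes the held-out instances are taken from the end of the series, which is the natural choice for time series; the function names are hypothetical and the numbers are made up.

```python
def chronological_split(rows, test_fraction=0.3):
    """Split rows in time order: the last test_fraction become the test set."""
    n_test = int(round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


train, test = chronological_split(list(range(10)), test_fraction=0.3)
print(train, test)

# Made-up actual and predicted values, just to show the calculation.
mae = mean_absolute_error([10, 12, 14], [11, 12, 12])
print(mae)
```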
We can predict more than one target with this
data, and if we do that, we're going to get
lagged versions of each of the targets, and
that might help.
Let's go and predict Close and High.
We're going to get lagged values of both of
these variables, and it's possible that we
might get better predictions.
Well, actually, we don't.
These are the values we get: 8 on the test
data and 3.4 on the training data, slightly
worse than before.
If we were to select all of the variables
as targets, we'd get even worse results.
We get quite bad overfitting here,
with a much smaller training error, 2.5, than
the test error, 9.6.
Now, another thing that you need to know about
is overlay data.
Overlay data is additional data that might
be relevant to the prediction.
It's not to be forecast.
It can't be predicted, and it's available
in the future.
Overlay data is available in the future.
We don't have overlay data for the appleStocks
problem, but I'm going to kind of cheat by
using one of the existing attributes as though
it were overlay data, as though we knew it
even in the future.
Let me just predict Close.
I'm going to go and specify some overlay data.
We're going to use Open as overlay data, and
I can then see what happens.
I got a complaint here from Weka.
It's unable to generate a future forecast
because there are no future values available
for the overlay data.
Well, let's just stop it trying to generate
future forecasts.
If I just take out these output future predictions
and do it again, then I won't get that error message.
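
To see why future overlay values are needed, here is a small sketch of how an input row might be put together. It is an illustration under assumptions, not the package's code: build_row is a hypothetical function and the Close and Open values are made up.

```python
def build_row(target, overlay, t, n_lags):
    """Inputs for predicting target[t]: the n_lags previous target values
    plus the overlay value at time t itself."""
    if t >= len(overlay):
        raise ValueError("no overlay value available for time step %d" % t)
    lags = [target[t - k] for k in range(1, n_lags + 1)]
    return lags + [overlay[t]]


# Made-up values: Close is the target, Open is treated as overlay data.
close = [10.0, 11.0, 12.0, 13.0]
open_ = [9.0, 10.5, 11.5, 12.5]

print(build_row(close, open_, t=3, n_lags=2))  # inside the known data

try:
    build_row(close, open_, t=4, n_lags=2)     # one step into the future
except ValueError as err:
    print("cannot forecast:", err)
```

Inside the known data every row can be built, but one step past the end there is no Open value to put in the row, which is essentially what the complaint is about.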
Back on the slide, we can see that the overlay
data has improved things quite a bit.
By including Open, the test error has got
down to 5.9, and if we include High as well,
it gets down even further.
And although I won't do this for you, if I
were to change the base learner to SMOreg, a
better learner,
I would get even better results, down to a
very small error on the test data, 2.4.
In fact, I would get these graphs if I looked
at the predictions. Again,
to save time I won't do that, but you can
see the prediction on the training data, the
prediction on the test data.
We're getting very good predictions using
this overlay data.
Well, we've covered quite a few options in
the timeseriesForecasting package.
When you're starting with a new dataset, you
should start by getting the time axis right.
Don't forget that missing instances are automatically
interpolated, and you can select the periodicity
yourself if you like, and there's a skip facility
to ensure that time increases linearly.
Then you need to select your target, what
you're going to predict (or targets).
Overlay data can help a lot, obviously.
If you can get hold of it, that's always
wonderful.
There are quite a few features of this package
that we haven't looked at.
We haven't looked at confidence intervals,
adjust for variance, and a bunch of other things.
You can read about that in the documentation
for the package.
Here's a reference to this whole regression
approach to time series analysis, which was
followed when building Weka's timeseriesForecasting
package.
So, off you go.
Do the activity, and we'll see you in the
next lesson,
where we're going to talk about an application
of Weka.
Bye for now!
