Hello! Welcome back.
Nice to see you again.
In this lesson, we're going to start talking
seriously about time series forecasting.
We're going to look at linear regression with
lags.
We're not going to use the timeseriesForecasting
package yet; we'll start that in the next lesson.
We're going to load a time series data set
here.
We're going to go to the Explorer.
I'm going to load airline.
This is where my Weka datasets are.
I don't know where yours are.
I'm going to load airline.arff.
Here it is.
I'm going to just have a look at this data
with the edit button.
You can see that there's a passenger_numbers
attribute and then a Date attribute that goes
from the first of January 1949 through to
the first of December 1960.
So this is ancient airline passenger data.
We're going to go to Classify here, and we're
going to predict with linear regression in
the functions category.
This is important.
We're going to predict passenger_numbers.
It's the first attribute, so we need to set
it here from the default, because Weka by
default predicts the last attribute.
I'm going to just click Start.
We're going to be looking at the root-mean-squared error here.
46.6 is what we get.
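As a reminder of what that figure measures, here is a minimal sketch of
root-mean-squared error in plain Python. The function name rmse and the
sample numbers are made up for illustration; this is not Weka's own code.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up numbers, purely for illustration (not the airline data).
print(rmse([100, 120, 140], [110, 120, 130]))
```

Because the errors are squared before averaging, a few large misses raise
the figure more than many small ones.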
We could look at the classifier errors.
Now, this is a linear regression, so we're
expecting a linear kind of line here.
That's what linear regression predicts.
On the y axis I'm going to put the predicted
passenger numbers; on the x axis I'm going
to put the date, and there we have it.
This is the predicted line.
The size of these crosses incidentally indicates
the size of the error at that point, but,
for our purposes here, it's a linear regression.
Not really very interesting.
One thing that's a little bit surprising is
the model is zero times date plus this constant
and that would be a horizontal line if it
was really true.
There's something a little bit funny about
this.
What is funny about it is the date.
If I go back and look here, the date attribute
has values that start at about -662 billion.
And that's because these dates are measured
in milliseconds since January 1, 1970.
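To see where a number like -662 billion comes from, here is a small sketch
that converts the first and last dates of the dataset to milliseconds since
January 1, 1970. It assumes the dates are at midnight UTC; Weka may read them
in your local time zone, in which case the values shift by a few hours. The
helper name to_epoch_millis is made up for this example.

```python
from datetime import datetime, timezone

def to_epoch_millis(year, month, day):
    """Milliseconds between 1970-01-01 (UTC) and the given date (UTC)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    delta = moment - epoch
    return delta.days * 86_400_000 + delta.seconds * 1000

first = to_epoch_millis(1949, 1, 1)   # first date in the dataset
last = to_epoch_millis(1960, 12, 1)   # last date in the dataset
print(first)  # -662688000000
print(last)   # -286675200000
```

The values are negative because every date in the dataset is before 1970.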
So I'm going to convert them into months since
the beginning of the dataset.
I'm going to do that with a filter.
There's different ways of doing this, but
I'm going to use the AddExpression filter,
and I'm going to make an expression that takes
the second attribute, the date attribute,
that's a2.
That's in milliseconds, so I'm going to divide it
by 1000 to make it seconds, by 60 to make it
minutes, by 60 again to make it hours, and by 24
to make it days.
Then I'm going to make it years by dividing by
365 and a quarter days in a year.
I'm going to add 21 to move the starting point from 1970 back to 1949.
Then I'm going to multiply by 12 to make this months.
It took me a little bit of a while to figure
this out.
I hope it's going to work.
I'm going to call that attribute NewDate.
Let's see what happens here.
I'm going to apply the filter, and now I've
got NewDate, which goes from round about 0
to about 143.
Now, there's a little issue here with leap
years, right? I'm using this figure of 365.25
days in a year, which is pretty accurate on
average, but I should really take into account
exactly which years are leap years and so
on, so there's a bit of inexactness going
on here.
But never mind.
It's just a bit approximate.
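The same arithmetic can be sketched in Python. This is my reading of the
expression described above, not a copy of what was typed into the filter, and
the function name months_since_1949 is made up. The two inputs are the
midnight-UTC millisecond values for 1 January 1949 and 1 December 1960.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # ms -> seconds -> minutes -> hours -> days

def months_since_1949(epoch_millis):
    """Approximate months since January 1949, using 365.25 days per year."""
    years_since_1970 = epoch_millis / MS_PER_DAY / 365.25
    return (years_since_1970 + 21) * 12  # shift origin to 1949, then to months

start = months_since_1949(-662_688_000_000)  # 1 January 1949
end = months_since_1949(-286_675_200_000)    # 1 December 1960
print(round(start, 3), round(end, 3))
```

The results come out close to 0 and close to 143 rather than exactly 0 and
143, which is the leap-year inexactness just mentioned.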
I'm going to remove the Date attribute.
I'm going to look at the model again.
Every time--this is a bit of a nuisance--I've got to
remember to predict passenger_numbers.
And if I run that, then we're getting this
model: 2.66 times the NewDate plus 90.
It's the same model as before, but we've kind
of scaled NewDate, so now this coefficient,
which used to be rounded down to 0, is something
more sensible.
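A quick calculation shows why the old coefficient looked like zero. This
sketch takes the 2.66 per month from the model above and rescales it to a
slope per millisecond. The four decimal places are an assumption about how
the output is displayed, not something taken from Weka itself.

```python
# Average month length in milliseconds, using 365.25 days per year.
MS_PER_MONTH = 365.25 / 12 * 24 * 60 * 60 * 1000

slope_per_month = 2.66  # coefficient reported for NewDate in the lesson
slope_per_ms = slope_per_month / MS_PER_MONTH

as_fixed = f"{slope_per_ms:.4f}"       # fixed-point display hides the value
as_scientific = f"{slope_per_ms:.3e}"  # scientific notation reveals it
print(as_fixed)  # 0.0000
print(as_scientific)
```

The slope is about one billionth of a passenger per millisecond, so any
display with a handful of decimal places shows it as zero.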
Ok.
So far so good, and so far not very interesting.
Here is the regression line, and you can see
the data.
The data's kind of cyclic when you look at
it.
Passenger numbers depend on the month,
and yet the regression line is just
a straight linear prediction.
Not so interesting.
Let's do something a little bit more interesting.
I'm going to copy the passenger_numbers attribute.
We're going to add a delayed version of passenger_numbers.
I'm going to use the Copy filter to create
a new attribute.
I'm going to copy the first attribute and
apply that.
And here we've got Copy of passenger_numbers.
I'm going to take this attribute and lag it
by 12.
I'm going to delay it by 12 months, so it's
going to contain last year's value
for that month.
I'm going to do that with the TimeSeriesTranslate filter.
I'm going to configure that.
I'm going to translate the third attribute.
I'm going to translate it by 12 months, subtract
12 months from that.
I think that's ok.
This particular filter doesn't work on the class
attribute, so I first need to set the class back
to passenger_numbers,
and then I'm going to run it and see what
happens here.
If I go to Edit, now I can see this is my
new attribute, and you can see that that 112
is this 112 here.
In fact, this is a delayed version of this
attribute.
For this month, month number 13,
this gives the figure for the year before,
and these first 12 are unknown values.
Terrific! That's what I wanted to do.
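The lagging step can be sketched in a few lines of Python. The function name
lag is made up, None stands in for Weka's missing value, and the sample
values are illustrative; only the leading 112 is quoted in the lesson.

```python
def lag(values, k):
    """Delay values by k positions; the first k entries become None (missing)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(values)
    k = min(k, n)
    return [None] * k + list(values[: n - k])

# Illustrative monthly values: 12 months, then the first 2 of the next year.
passengers = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
              115, 126]
lag_12 = lag(passengers, 12)

print(lag_12[:12])  # twelve missing values
print(lag_12[12])   # month 13 carries month 1's value
```

The series keeps its length; the values simply slide down by 12 places and
the gap at the top is filled with missing values.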
Then I'm going to go back and predict this
with linear regression.
I need to remember to predict passenger_numbers.
There we go, and now I get a different model
and a better root-mean-squared error, 31.7.
This is a model that uses NewDate
and then a little bit of the 12-month-before copy.
Now actually, this is not a very good model.
It's a little bit crazy, and the reason it's
a little bit crazy is because of those missing values.
We've got missing values at the beginning
of the dataset, and we're going to get much
better results if we delete those instances
with missing values.
I'm going to do that with an instance filter
called RemoveRange, and I'm going to remove
instances 1 to 12.
And if I apply that and look at
my data, I don't have missing values.
The lagged attribute starts out with the 112, which is
from 12 months before, and passenger_numbers starts out on the
13th month of the original data, which is
what I want.
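Removing a range of instances amounts to the following sketch. The function
name remove_range and the toy rows are made up for illustration; the range is
counted from 1 and includes both ends, as in the 1 to 12 used above.

```python
def remove_range(rows, first, last):
    """Drop rows first..last (counted from 1, inclusive) and keep the rest."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return rows[: first - 1] + rows[last:]

# Toy data: 15 made-up monthly values paired with their 12-month lag.
values = list(range(101, 116))
rows = [(v, values[i - 12] if i >= 12 else None) for i, v in enumerate(values)]

kept = remove_range(rows, 1, 12)
print(kept)  # only rows whose lag value is known remain
```

After the first 12 rows are gone, every remaining row has both a current
value and a value from 12 months earlier.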
So I'm then going to go now and classify that
with linear regression.
Don't forget to predict passenger_numbers.
There we go.
And now I get a much smaller root-mean-squared
error of 16, and I'm getting quite a sensible model.
This says passenger numbers increase a little bit.
Take the passenger_numbers value of the year before,
add 7%, and then add just a little offset.
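Written as code, a model of that shape looks like the sketch below. The 1.07
comes from the "add 7%" just described. The offset is not stated in the
lesson, so it is left as a parameter, and the 0.0 used in the example is only
a placeholder to isolate the 7% effect.

```python
def predict_from_last_year(last_year_value, offset):
    """Prediction of the form 1.07 * (value 12 months earlier) + offset."""
    return 1.07 * last_year_value + offset

# Placeholder offset of 0.0: an assumption, not the fitted value.
prediction = predict_from_last_year(112, offset=0.0)
print(round(prediction, 2))
```

Each prediction follows last year's value for the same month, which is why
the predictions rise and fall with the seasons.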
I could try and visualize this model.
I'll just show you.
If I do it this way, it's not really very
informative, because this is predicted passenger_numbers
on the y axis against the date on the x axis.
And you can't see any pattern here. There is
actually a cyclic pattern, but it's completely
obscured by the size of these x's, which are
not very interesting for our purposes at the moment.
In order to get a better look at that,
I'm going to use the AddClassification filter.
It's a supervised attribute filter.
I'm going to add the classification created
by linear regression.
Output the classification, and I need here
to say what we're going to be predicting,
which is passenger_numbers.
I'm going to apply this filter, and now I
get a new attribute classification, which
I can then visualize.
So I'm going to look at classification against
NewDate.
And this shows you this cyclic prediction that we're getting here.
So adding this delayed attribute gives us
a cyclic prediction.
Let's go back to the slide and have a look
at this.
We have a graph here, which shows the prediction
with lag_12.
There is no prediction for the first 12 instances,
because I deleted those.
So these are the predictions, this cyclic wave, and you can see this fits pretty
well the actual values of passenger_numbers,
which are the black dots here.
It's a much better fit, this cyclic
prediction, than the original rather boring
red linear prediction, and these are the two
equations of those lines.
So adding this simple lag variable allows
us to break away from the linear paradigm,
even though we're using linear regression,
and get nonlinear predictions.
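Here is a small experiment that illustrates the point on synthetic data. It
is a sketch, not a rerun of the airline example: the series is a made-up
trend plus a 12-month sine wave, and fit_line is a hand-written one-variable
least-squares fit rather than Weka's LinearRegression.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

# Synthetic monthly series: upward trend plus a yearly cycle.
series = [100 + 2 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(48)]

months = list(range(12, 48))  # skip the first year: it has no lag value
target = [series[t] for t in months]
lag_12 = [series[t - 12] for t in months]

# Model 1: a straight line in time.
a, b = fit_line(months, target)
rmse_line = rmse(target, [a * t + b for t in months])

# Model 2: linear regression on the 12-month lag.
c, d = fit_line(lag_12, target)
rmse_lag = rmse(target, [c * x + d for x in lag_12])

print("RMSE, time only:", round(rmse_line, 2))
print("RMSE, lag 12:", round(rmse_lag, 2))
```

Both models are linear in their inputs. The second one looks nonlinear when
plotted against time only because its input, last year's value, is itself
cyclic.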
I think that's pretty exciting, actually.
I've done a lot of things rather quickly
here, and you're going to be redoing them
yourself in the activity with a different
classifier.
I've got a list of some of the pitfalls
that I've run into, and you might want
to refer back to this list--I won't go through
them now--when you do the activity.
So have a look at that.
Just to summarize.
We've learned that linear regression can be
used for time series forecasting and that
lagged variables yield much more complex models
than straight line ones.
In this case, we chose the appropriate lag
by eyeballing the data and noticing that it
kind of varied in an annual cycle.
And we can include more than one lagged variable
with different lags, and we could think about
seasonal effects, you know.
We could think about yearly, quarterly, daily,
hourly data.
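Building several lagged variables at once might look like the sketch below.
The function name lagged_rows and the toy series are made up for
illustration, and rows that would contain a missing value are simply skipped.

```python
def lagged_rows(series, lags):
    """Return (target, [lagged values]) rows, skipping rows with a missing lag."""
    if not lags or min(lags) < 1:
        raise ValueError("lags must be positive integers")
    start = max(lags)  # earliest index where every lag is available
    return [
        (series[t], [series[t - k] for k in lags])
        for t in range(start, len(series))
    ]

# Toy series 1..16; use last month's and last year's value as inputs.
toy = list(range(1, 17))
rows = lagged_rows(toy, lags=[1, 12])
print(rows[0])  # (13, [12, 1])
```

The longest lag decides how many rows are lost at the start, so a 12-month
lag costs the first year of data whatever other lags are included.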
Of course, doing all of this manually is a
pain, adding these variables.
So the timeseriesForecasting package helps
you do this in a much easier, quicker, more
convenient way.
That's what we're looking at in the next lesson.
Meanwhile, go off and do the activity, and
I'll see you soon in Lesson 1.3.
Bye for now!
