STEVEN MILLS: Hello.
Thank you for
tuning into my talk.
I hope everyone is well.
My name is Steven Mills.
And I am a research
statistician developer
with the SAS Institute.
Today, I wanted to talk
to you a little bit
about neural
network-based strategies
in SAS Viya for forecasting
with time series.
So machine learning models
like neural networks
have often been dismissed
as potential models
in forecasting.
There are a couple of
reasons for this.
One is that they're not
very good at learning
autoregressive features
common in time series,
like seasonality or trend.
And the second is that they
require quite a bit more data
than classical models.
However, the M4 forecasting
competition results
were dominated by ensembles
of machine learning
and statistical methods.
In fact, the first and
second place contestants
included machine
learning techniques.
And nearly 3/4 of
the best results
were ensembles of
different models.
An ensemble of models
often generalizes
better to new data
than a single model
because different models
are better at identifying
different features in the data.
And additionally, noise or
error in any of the models
is less impactful when it's
averaged with other forecasts.
From this point of
view, it makes sense
to use diverse models when
we are creating ensembles.
Neural networks can make
valuable contributions
to these ensembles because
they're very good at modeling
non-linear behavior.
And they are also very
good at identifying
complex interactions
between variables.
With that in mind, I'd like to
talk about the neural network
modeling strategies
that are designed
for panels of time series
in SAS Visual Forecasting.
We designed three
modeling strategies
that include neural networks.
They are the panel series neural
network, the stacked model,
and the multi-stage model.
To show you how to use each
of the modeling strategies
effectively, I'm
first going to explain
how the data is preprocessed
differently from classical time
series applications.
This is to overcome
some of the shortcomings
of traditional neural networks.
Then I'll give you
a brief overview
of each of the three modeling
strategies available in Visual
Forecasting.
Next, I'll show you a case study
comparing the neural
network-based and the classical
forecasting methods.
And finally, I'll cover a
few tips and common problems
that you might encounter.
One big difference between
machine learning and time
series models is that
machine learning algorithms
typically don't care about
the order of observations
and will partition
the data randomly.
Since time series
are ordered data,
and more recent observations
have more predictive power,
ordered sampling makes
more sense.
So when we are
partitioning the data
for forecasting with
neural networks,
we put the oldest
data in the training
partition shown in the yellow.
And then we use newer
data for the holdout
and the absolutely
newest data for testing.
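The ordered split can be sketched in Python with pandas (a generic illustration, not SAS code; the 70/20/10 split fractions are hypothetical):

```python
import pandas as pd

def ordered_partition(df, time_col, train_frac=0.7, holdout_frac=0.2):
    """Split a time series into train/holdout/test by time order,
    oldest observations first (not randomly, as in typical ML)."""
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    n_train = int(n * train_frac)
    n_holdout = int(n * holdout_frac)
    train = df.iloc[:n_train]                       # oldest data
    holdout = df.iloc[n_train:n_train + n_holdout]  # newer data
    test = df.iloc[n_train + n_holdout:]            # newest data
    return train, holdout, test

# Example: 10 hourly observations -> 7 train, 2 holdout, 1 test
ts = pd.DataFrame({"t": pd.date_range("2018-10-01", periods=10, freq="h"),
                   "y": range(10)})
train, holdout, test = ordered_partition(ts, "t")
```

The key point is that every training time stamp precedes every holdout time stamp, which in turn precedes every test time stamp.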
Neural networks also require
significantly more data
than a typical
time series model.
With time series
processing, if we
are modeling a panel of
series, each BY group
is diagnosed and gets its
own model fit to its data.
However, a single
time series typically
doesn't contain enough data
to train a neural network.
To increase the amount of
data available for training,
the BY groups are
concatenated together
to form a single long
time series that's
sent to a single
neural network model.
So on the slide here, you can see
yellow, blue, kind of brownish,
green, and purple blocks, each
representing a different time
series.
So we'll have repeating
time stamps that start over
at the beginning of each color.
The individual
series are delineated
within the neural
network algorithm
by using BY variables
as categorical inputs.
This lets the neural network
do things like determine
if one particular series
has a lower average level,
or perhaps some
features are less
prevalent in a
particular series.
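Conceptually, the concatenation looks like this (a minimal pandas sketch with hypothetical series names; SAS Viya handles this internally):

```python
import pandas as pd

# Hypothetical panel: two series with repeating time stamps.
a = pd.DataFrame({"t": [1, 2, 3], "y": [10.0, 11.0, 12.0]})
b = pd.DataFrame({"t": [1, 2, 3], "y": [5.0, 6.0, 7.0]})
a["series_id"] = "A"   # the BY variable becomes a categorical input
b["series_id"] = "B"

# Concatenate into one long table; the time stamps restart with
# each series, and series_id tells the network which block is which.
panel = pd.concat([a, b], ignore_index=True)

# One-hot encode the BY variable so the network can learn
# series-specific effects such as a lower average level.
panel = pd.get_dummies(panel, columns=["series_id"])
```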
Feature extraction
is another technique
that we use to increase
the amount of data
and also to help overcome
some of the shortcomings
of neural networks.
So, as I mentioned,
neural networks are
very bad at detecting
autocorrelated features
like seasonality or trends.
But if we create lagged versions
of the variables in our input
data, then we can allow
the neural network
to use past data
when it's trying
to evaluate each observation.
In the table below,
you can see that I've
created lagged versions of
the dependent variable y
in the first two yellow
columns, and lagged
versions of the x variable
in the second group of yellow columns.
Additionally, a trend or an
exponential smoothing model
can be fit to the
dependent variable
and included as an input.
This helps the neural network
learn level shifts and trends
over the course of the
series and adjust for them.
Finally, seasonal
dummy variables
can also be generated.
These can be generated
at any standard SAS interval
that you like,
and they don't have to conform
to the interval of your data.
So if you had weekly
forecasting data,
you could generate quarterly
seasonal dummy variables
in order to avoid
redundant information.
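Here is what the lag and seasonal-dummy extraction might look like in pandas (a rough sketch with hypothetical column names; the exponential smoothing trend input is omitted for brevity):

```python
import pandas as pd

def extract_features(df, y_col, x_cols, n_lags=3, hour_col=None):
    """Create lagged copies of the target and inputs, plus seasonal
    dummy variables -- a sketch of the preprocessing described."""
    out = df.copy()
    for col in [y_col] + x_cols:
        for k in range(1, n_lags + 1):
            # Shift each variable back k steps so the network can
            # see past values when evaluating each observation.
            out[f"{col}_lag{k}"] = out[col].shift(k)
    if hour_col is not None:
        # Seasonal dummies at any interval you like, e.g. hour of day.
        out = pd.get_dummies(out, columns=[hour_col], prefix="hour")
    return out

df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0],
                   "x": [0.1, 0.2, 0.3, 0.4],
                   "hour": [0, 1, 2, 3]})
feats = extract_features(df, "y", ["x"], n_lags=2, hour_col="hour")
```

Note that the first rows of each lagged column are necessarily missing, which becomes important later when we talk about missing values.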
There are three strategies
in Visual Forecasting that
use neural networks and these
preprocessing techniques
as I mentioned earlier.
The panel series neural
network is the first.
The block diagram at
the bottom of the slide
shows the typical workflow
through the neural network
modeling strategy.
First, any transformations and
data standardization are applied,
and then the feature extraction
process that I described
previously is performed.
Next, the neural
network fits the model
to the data using a
training and validation
cycle before it
finally reverses any
transforms and standardization
to generate the final output.
The second strategy
is the stacked model.
This is a hybrid of a neural
network and time series models.
First, a panel series neural
network is fit to the data.
Then the error, or the
residuals left over,
is modeled with time series.
And finally, the two
forecasts are added together
to generate the final forecast.
This combines the advantages
of machine learning
and time series techniques.
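The arithmetic of the stacked strategy can be sketched with stand-in models (a toy illustration only; the real strategy fits a panel series neural network and classical time series models, not the mean and naive models used here):

```python
import numpy as np

# Stage 1 stands in for the neural network, stage 2 for the time
# series model fit to the stage-1 residuals.
y = np.array([10.0, 12.0, 11.0, 13.0])

stage1_fit = np.full_like(y, y.mean())   # stand-in for the NN fit
residuals = y - stage1_fit               # what stage 1 missed
stage1_forecast = y.mean()               # stand-in NN forecast
stage2_forecast = residuals[-1]          # naive residual forecast

# Final forecast = neural network forecast + residual forecast.
final_forecast = stage1_forecast + stage2_forecast
```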
And finally, the
multi-stage model
is a bit of a twist on
hierarchical forecasting.
The lower levels in
a hierarchy, where
your data may be much
more noisy and nonlinear,
can be modeled with neural
network or regression models,
while the upper levels can
be modeled with time series.
Finally, the forecast
would be reconciled
to the user specified level
to generate the final output.
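A minimal sketch of proportional reconciliation, assuming a two-level hierarchy with hypothetical forecasts (this illustrates the idea only, not SAS's reconciliation algorithm):

```python
import numpy as np

# Three leaf series (e.g. forecast by a neural network) roll up to
# one total (e.g. forecast by a time series model).
leaf_forecasts = np.array([4.0, 6.0, 10.0])   # lower level
top_forecast = 22.0                            # upper level

# Reconcile to the leaf level by distributing the top-level
# forecast proportionally to the leaf forecasts, so that the
# leaves sum exactly to the top.
proportions = leaf_forecasts / leaf_forecasts.sum()
reconciled_leaves = top_forecast * proportions
```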
To illustrate the forecasting
power of these modeling
strategies, I'm going to show
some results from a case study
now from the Array of
Things project in Chicago.
The Array of Things
project is a bunch
of sensor nodes distributed
around the city that
include environmental
sensors like gas sensors
and light sensors, temperature
and humidity sensors, and microphones.
And they continuously
collect data.
And I've downloaded
from their website
the data from October 2018.
And I've accumulated it
to an hourly time series.
There was a lot of
bad data in there.
Of course, raw
data is often dirty
and needs a lot of cleaning.
But after cleaning and
selecting variables,
I was left with 27 series
of 720 observations each.
The dependent variable that
I'm going to be forecasting
is the ozone sensor output.
And to predict that,
I'm going to use
five independent variables--
temperature, humidity,
and three light sensors,
one for ultraviolet
light, one for infrared,
and one for visible.
Ozone exposure is
an interesting thing
to look at here
because it's closely
linked with respiratory
problems like asthma.
In fact, the risk of
having an asthma attack
after you've been exposed
to high levels of ozone
remains high for hours.
So it's an important thing to
be able to monitor and know
what you've been exposed
to throughout the day.
In addition, with
COVID-19 in the world now,
this could further complicate
respiratory symptoms
for those patients.
The data only has
two hierarchy levels.
So I'm going to use
the panel series neural
network and the stacked models.
If it had more levels
in the hierarchy,
or if the lower
level was more noisy,
then the multi-stage strategy
would be a good candidate.
The extracted features
that I've included
are three lags of
all the variables,
seasonal dummy variables
for the hour of the day,
and a damped trend model
for the dependent variable.
So here are the results from
some of the forecasting.
The plot on the left
compares the output
from the panel
series neural network
with hierarchical forecasting.
The shaded area, the gray
shaded area on the right,
is the forecast horizon.
So that's the future
predicted data.
And to the left, in the white
area, is historical data.
So we can see the
historical fit and then
the fit in the forecast region.
The tables on the right show
the weighted MAPE values,
which measure the error
in the model fit.
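For reference, one common definition of weighted MAPE sums absolute errors over absolute actuals (the exact weighting used in Visual Forecasting may differ; this Python sketch is illustrative):

```python
import numpy as np

def weighted_mape(actual, forecast):
    """Total absolute error divided by total absolute actuals,
    expressed as a percent.  This weights each observation's error
    by the size of the actual values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.abs(actual - forecast).sum() / np.abs(actual).sum()

# Example: absolute errors of 1, 1, 2 against actuals summing to 40
print(weighted_mape([10, 10, 20], [11, 9, 18]))  # 10.0
```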
In the top table, you can see
the stacked model significantly
outperformed hierarchical
forecasting on in-sample data.
And I should mention,
hierarchical forecasting here
is really just the auto-forecasting
version of this
because there aren't very
many hierarchy levels.
I was not able to calculate
out of sample weighted MAPE
for the stacked model because
it doesn't yet support
incremental forecasting.
However, the panel series
neural network does.
So I was able to
easily calculate
the out of sample weighted
MAPE in that case.
And here, we can see
that both the in sample
and the out of sample MAPE for
the panel series neural network
are lower than that of the
hierarchical forecasting,
indicating that we have
achieved greater accuracy
with the neural network model.
Now, you can calculate
the out of sample MAPE
for a stacked model
type of problem
in the coding environment.
But it's not supported
in the UI yet.
So cool.
We have good results.
But these results are only
good if we can use them.
And I know training neural
networks can take a long time.
So how long does it take
as we scale up the data?
If you recall, the
original table that I had
was about 20,000 observations.
And here, I have scaled the data
up to two million observations.
And we can see that the
training time on the left
increases linearly as the
volume of data increases.
So that's a good sign that our
training time isn't blowing up.
So that's great news.
But what about if our series
are short or our data is small?
We know that neural
networks need a lot of data
to train a good model.
But how much is a lot?
So I did another
experiment where I shortened
the time series.
So starting on the right
at 720 observations
and decreasing the number of
observations in each series,
we start to see an
increase in the error
right around 350 or
400 observations.
So this gives us an
effective estimate
of the lower bound of
how long our series need
to be to use the
neural network models.
Now, reducing the number
of BY groups in a series
doesn't affect the
error the same way.
It's not really deterministic.
And it depends on the similarity
between all the different BY
groups in your panel data.
This should give you a
pretty good foundation
to start experimenting with
the neural network modeling
strategies.
So I'm going to
give you a couple
of tips and common problems to
look for while you're doing so.
And those should
help you out as well.
One of the things that
you should keep in mind
is the effect of missing values
in particular in combination
with generating lagged
versions of variables.
In the table here, you can
see that the first observation
for lag one of y is
missing because there
was no historical data to create
that time-shifted value from.
This propagates
further down in time
on the second lag
and the third lag.
Notice there's also a missing
value in the third observation
for the x variable.
As we create lags
of this variable,
the missing values also
propagate forward through time
and cause more of our total
number of observations
to have missing values.
The problem with this
is that any observation
with a missing
value is going to be
excluded from the
neural network training.
If you're not careful, and if
you have many missing values,
then you can quickly reduce the
amount of data that you have.
And you may not have enough
data to train a good model.
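You can see the effect in a small pandas sketch (hypothetical data; a single missing x value knocks out three more rows once three lags are created):

```python
import pandas as pd
import numpy as np

# A short series with one missing x value; create three lags of
# each column, as in the case study.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                   "x": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6, 0.7, 0.8]})
for col in ["y", "x"]:
    for k in range(1, 4):
        df[f"{col}_lag{k}"] = df[col].shift(k)

# Observations with ANY missing value are excluded from training.
# The first three rows are lost to lag startup, and the one missing
# x value eliminates its own row plus the next three rows through
# its lags -- leaving only two of the original eight observations.
complete = df.dropna()
```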
Another important
thing to understand
is how the training time is
affected by the selections
that you make in modeling.
For example, using the data from
the Array of Things case study,
I had one time ID variable,
one BY variable, one dependent
variable, and five
independent variables.
After creating three
dependent variable lags,
three lags each for all of
the independent variables,
and adding a trend and
seasonal dummy
variables for each
hour in the day,
I now have 51 effective
input variables,
which causes a large
increase in complexity
and the training time
required to fit a good model.
This is further compounded
by the structure
of the neural network.
So here on the left,
I have a diagram,
a simple example of a very
small neural network--
one input layer, one hidden
layer, and one output layer.
And on the right, I've
shown an enlarged view
of the center node
to illustrate the math that
is calculated for each node.
So the input layer
is going to have
one node for every
effective input variable.
So this has already increased to
51 based on my case study data.
And now, every
one of those nodes
is going to be connected
to every single node
in the hidden layer,
the very next layer.
Each connection has a connection
weight parameter W sub i.
In addition, the different
nodes aside from the input nodes
have a bias parameter b.
So you can see how quickly
these parameters multiply
and can get out of hand.
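The parameter count for a single-hidden-layer network like the one in the diagram works out as follows (a generic sketch; the 30-node hidden layer is a hypothetical size):

```python
def mlp_params(n_inputs, n_hidden, n_outputs=1):
    """Count parameters in a fully connected network with one hidden
    layer: each hidden node gets one weight per input plus a bias,
    and each output node gets one weight per hidden node plus a bias."""
    hidden = n_hidden * (n_inputs + 1)
    output = n_outputs * (n_hidden + 1)
    return hidden + output

# 51 effective inputs (as in the case study) with a hypothetical
# 30-node hidden layer already give well over 1,500 parameters:
print(mlp_params(51, 30))  # 1591
```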
The increase in
training time can be
exponential with the number
of nodes in your hidden layers
and the number of features
that you're generating.
So I think that that should
give you a pretty good amount
of information.
I encourage you to read my paper
for more details and more tips
for using these strategies.
So I've shown you how these
neural network strategies
were designed for
use with time series.
I've shown how they can
improve the forecast
accuracy for certain
data characteristics.
And finally, I'd like to
leave this point with you.
Parsimonious models
generally converge faster
and generalize better.
There is often a
tendency to build
a large neural network with lots
of inputs and lots of nodes.
But multicollinearity will
damage your forecast results.
And keeping your model simpler
is usually the best bet.
If you have any
further questions,
I encourage you to
reach out to me.
And I look forward
to hearing from you.
Thanks.
