So far we have discussed only feedforward neural networks in this series. Feedforward neural networks have no feedback loops: the data flows in one direction, from the input layer to the output layer.
Recurrent neural networks, or RNNs for short, use outputs from previous time steps as inputs, helping the model remember its past.
This is usually shown as a feedback loop on
hidden units.
This type of model is particularly useful for processing sequential data.
Sequential data is any data that can be represented as a series of data points: audio, video, text, biomedical data such as EEG signals or DNA sequences, financial data such as stock prices, and so on.
Recurrent models feed the outputs of units back in as inputs at the next time step.
That's where the name 'recurrent' comes from.
So, how do we implement this feedback loop?
Under the hood, recurrent neural networks are actually feedforward neural networks with repeated units.
We can unfold this recurrent graph into a
full network as a chain of repeated units.
Earlier we talked about how convolutional
neural networks share parameters spatially.
Similarly, RNNs can be thought of as neural
networks that share parameters in time.
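To make that concrete, here's a minimal sketch of a vanilla RNN step in NumPy; the sizes and weights are made up for illustration. Notice that the same parameters are reused at every time step, which is exactly the parameter sharing in time we just mentioned:

```python
import numpy as np

# A minimal vanilla RNN step: the same parameters (W_xh, W_hh, b)
# are reused at every time step -- parameter sharing in time.
input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input and the previous state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Unfolding the recurrence over a sequence is just a loop of repeated units.
xs = rng.normal(size=(5, input_size))   # a toy sequence of 5 time steps
h = np.zeros(hidden_size)
for x_t in xs:
    h = rnn_step(x_t, h)
```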
RNNs can handle inputs and outputs of different
types and lengths.
For example, to translate from one language to another, a model would take a piece of text as input and produce another piece of text as output, where the lengths of the input and the output are not necessarily the same.
A speech recognition model would consume audio
data to produce text.
A speech synthesis model would do the opposite.
The input and the output don't both have to be sequences, either.
A model can input a sequence like a blog post
and output a categorical variable, such as one that indicates whether the text carries positive, negative, or neutral sentiment.
Similarly, the output can be a sequence whereas
the input is not.
A random text generator can input a random
seed and output random sentences that are
similar to the sentences in a corpus of text.
It's possible to have many different types
of input and output configurations.
An RNN can be one-to-one, one-to-many, many-to-one,
and many-to-many.
The input and output don't have to be the
same length.
They can be time-delayed as well, like in
this figure.
It can even be none-to-many, where a model
generates a sequence without an input.
This type is essentially the same as one-to-many
since the output would depend on some initial
seed, even if it's not explicitly defined
as an input.
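As a sketch of the many-to-one case, here's what a hypothetical sentiment classifier could look like in PyTorch. All the names and sizes below are placeholders for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

# A hypothetical many-to-one model: a sequence goes in, one label comes out.
class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 3)  # positive / negative / neutral

    def forward(self, token_ids):              # (batch, seq_len) of token ids
        x = self.embed(token_ids)
        _, h_last = self.rnn(x)                # keep only the final hidden state
        return self.head(h_last[-1])           # (batch, 3) class scores

logits = SentimentClassifier()(torch.randint(0, 10_000, (2, 20)))
```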
As we discussed earlier, optimizing a model becomes more difficult as its chain of units gets longer.
In RNNs, we can easily end up with very long
chains of units when we unfold them in time.
One of the problems we might come across is
the exploding gradients problem.
Long sequences can result in long chains of
parameter multiplications.
When we multiply so many weights together, the loss becomes highly sensitive to them, and this sensitivity can produce steep slopes in the loss function.
The slope of the loss function at a point might be so large that when we use it to update the weights, they jump outside a reasonable range and end up with an unrepresentable value, such as NaN.
This doesn't have to happen in a single update
either.
It can happen over the course of several updates.
A long chain of large weights leads to large activations, large activations lead to large gradients, and large gradients lead to large weight updates and even larger activations.
A quick fix for this problem is to clip the
gradient magnitude to prevent it from being
larger than some maximum value.
This is called gradient clipping.
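In PyTorch, for example, this is a one-liner after the backward pass; the model and the threshold below are just placeholders:

```python
import torch
import torch.nn as nn

model = nn.RNN(8, 16)                   # stand-in for any recurrent model
x = torch.randn(5, 1, 8)                # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()                # dummy loss just to produce gradients
loss.backward()

# Rescale all gradients so their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```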
Another problem we might encounter is the
vanishing gradient problem, which we discussed
earlier.
To recap, when we backpropagate the error in a deep network, the gradient can shrink so much on its way back that it is nearly zero by the time it reaches the early layers.
In a feedforward network, this makes it harder
to optimize the early layers since they barely
get any updates.
In the context of recurrent neural networks,
this results in quickly forgetting things
that the model has seen earlier.
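A toy calculation makes the point: backpropagating through many time steps multiplies many similar factors together, so factors slightly below one vanish and factors slightly above one explode.

```python
# Backpropagating through T steps multiplies T Jacobian-like factors.
# Slightly below 1 and the product vanishes; slightly above 1 and it explodes.
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(100):      # 100 time steps
        grad *= factor
    print(factor, grad)       # 0.9 -> ~2.7e-5 (vanishes), 1.1 -> ~1.4e4 (explodes)
```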
In many cases, this type of behavior is not
acceptable since there might be long-term
dependencies.
For example, if our task is to predict a missing
word in a paragraph, the contextual cues we
need might not be very close to the word being
predicted.
We can tell the missing word in this example
is probably "1970s" by looking at the beginning
of the paragraph but we would have no clue
if we had access only to the words right next
to the missing word.
So we need a model architecture that handles long-term dependencies better than vanilla recurrent neural nets.
Two popular RNN architectures, Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs), both aim to remember long-term dependencies while alleviating the vanishing and exploding gradient problems.
These architectures use gated modules to keep
what's important in a sequence of data points.
The main idea in gated architectures is to
have a straight channel that flows through
time and have modules connected to it.
These modules are regulated by gates, which
determine how much the module should contribute
to the main channel.
The gates are simply sigmoid units that produce
a number between zero and one.
Zero means nothing passes through the gate,
and one means everything is let through as-is.
Let's build an extremely simplified version
of a gated unit.
We have a main channel that all the modules connect to.
We have modules that can add or remove information
from this channel, where what needs to be
kept or discarded is determined by sigmoid
gates.
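Here's one way to write that simplified gated unit in NumPy. Keep in mind this is a deliberately stripped-down illustration, not the real LSTM or GRU equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_gated_step(c_prev, x_t, W_gate, W_cand):
    # The gate outputs values in (0, 1): 0 blocks this module's contribution
    # to the main channel, 1 lets it through unchanged.
    gate = sigmoid(W_gate @ x_t)
    candidate = np.tanh(W_cand @ x_t)    # what this module proposes to add
    return c_prev + gate * candidate     # the main channel flowing through time

rng = np.random.default_rng(1)
W_gate = rng.normal(size=(4, 4))
W_cand = rng.normal(size=(4, 4))
c = np.zeros(4)
for x_t in rng.normal(size=(3, 4)):      # three toy time steps
    c = simple_gated_step(c, x_t, W_gate, W_cand)
```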
Actual LSTMs and Gated Recurrent Units are
more complicated than this simplified example.
This figure from Chris Olah's blog, for example,
summarizes how LSTMs work.
The first gate in this module determines how
much of the past we should remember.
The second gate decides how much this unit should add to the current state. Finally, the third gate decides what parts of the current cell state make it to the output.
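For reference, here's a sketch of a single LSTM step in NumPy following the standard formulation, with the three gates we just described; the weights are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One linear map produces the pre-activations for all four parts.
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)               # update the cell state
    h = o * np.tanh(c)                            # expose part of it as output
    return h, c

n_in, n_hid = 8, 16
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```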
It's possible to increase the representational
capacity of recurrent neural networks by stacking
recurrent units on top of each other.
Deeper RNNs can learn more complex patterns in sequential data, but this extra depth makes the model harder to optimize.
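In PyTorch, for instance, stacking recurrent layers is just an argument; the sizes here are arbitrary:

```python
import torch.nn as nn

# Three recurrent layers stacked: each layer's hidden states become
# the input sequence for the layer above it.
stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=3, batch_first=True)
```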
One last thing: recurrent models are not our
only option for processing sequential data.
Convolutional neural networks can also work quite well on time-series data.
For example, we can represent a series of
measurements as a grid of values, then build
a convolutional neural network on top of it
by using one-dimensional convolutional layers.
One thing to keep in mind is to make sure
that the convolutions use only past data and
don't leak information from the future.
This type of convolution is called a causal
convolution or a time-delayed convolution.
Another trick is to use dilated convolutions
to capture longer-term dependencies by exponentially
increasing the receptive field.
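Here's a sketch of both tricks using PyTorch's Conv1d: padding only on the left keeps a convolution causal, and doubling the dilation at each layer grows the receptive field exponentially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_conv(x, conv, dilation):
    # Pad only on the left so each output depends on past samples, never future ones.
    pad = (conv.kernel_size[0] - 1) * dilation
    return conv(F.pad(x, (pad, 0)))

x = torch.randn(1, 1, 32)              # (batch, channels, time)
for dilation in (1, 2, 4, 8):          # receptive field grows exponentially
    conv = nn.Conv1d(1, 1, kernel_size=2, dilation=dilation)
    x = causal_conv(x, conv, dilation)
```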
Google's WaveNet model makes use of these
techniques to train convolutional neural networks
on sequential data.
You can find more information about it in
the description below.
You can also watch my earlier videos on Convolutional
Neural Networks to learn more if you haven't
watched them already.
That's all for today.
The next video's topic will be unsupervised
learning.
We will talk about how we can train models
on unlabeled data.
As always, thanks for watching, stay tuned,
and see you next time.
