Let's talk about another kind of neural network, the Recurrent Neural Network. What is an RNN for?
Well, a couple of things.
Basically, they're for sequences of data, and that might be a sequence in time,
so you might use one for processing time series data,
where we're looking at a sequence of data points over time and trying to predict the future behavior
of something.
So RNNs are for sequential data of some sort. Some examples of time series data might be weblogs,
where you're receiving different hits to your website over time; or sensor logs, where you're getting
different inputs from sensors in the Internet of Things; or maybe you're trying to predict stock behavior
by looking at historical stock trading information.
These are all potential applications for recurrent neural networks because they can take a look at the
behavior over time and try to take that behavior into account when it makes future projections.
Another example might be if you're trying to develop a self-driving car: you might have a history of
where your car has been, its past trajectories, and maybe that can inform how your car might want to turn
in the future,
so you might take into account the fact that your car has been turning along a curve to predict that
perhaps it should continue to drive along a curve until the road straightens out.
And another example: it doesn't have to just be time, it can be any kind of sequence of arbitrary length.
Something else that comes to mind is language; sentences are just sequences of words,
right?
So you can also apply RNNs to language, or machine translation, or producing captions for videos
or images;
these are examples of where the order of words in a sentence might matter and the structure of the sentence
and how these words are put together could convey more meaning than you could get by just looking at
those words individually without context,
so again, an RNN can make use of that ordering of the words and try to use that as part of its model.
Another interesting application of RNN's is machine generated music,
you can also think of music sort of like text where instead of a sequence of words or letters you have
a sequence of musical notes,
so it's kind of interesting: you can actually build a neural network that can take an existing piece
of music and sort of extend upon it, using a recurrent neural network to learn the patterns
that made the music aesthetically pleasing in the past.
Conceptually, this is what a single recurrent neuron looks like in terms of a model,
so it looks a lot like an artificial neuron that we've looked at before;
the big difference is this little loop here,
OK? So now, as we run a training step on this neuron, some training data gets fed into it, or maybe this
is an input from a previous layer in our neural network, and it will apply some sort of activation function
after summing all the inputs into it.
In this case we're drawing something more like a hyperbolic tangent, because mathematically
we want to make sure we preserve some of the information coming in, in more of a smooth manner.
Now, usually we would just output the result of that summation and that activation function as the output
of this neuron,
but we're also going to feed that back into the same neuron,
so the next time we run some data through this neuron, the output from the previous run also gets summed
in with the new inputs.
OK? So as we keep running this thing over and over again, we'll have some new data coming in that gets
blended together with the output from the previous run through this neuron, and it just keeps happening
over and over and over again,
so you can see that over time the past behavior of this neuron influences its future behavior
and it influences how it learns. Another way of thinking about this is by unrolling it in time.
So what this diagram shows is the same single neuron at three different time steps,
and when you start to dig into the mathematics of how RNNs work, this is a more useful way of
thinking about it.
So if we consider this to be time step 0, you can see there's some sort of data input coming into this
recurrent neuron and that will produce some sort of output after going through its activation function,
and that output also gets fed into the next time step,
so if this is time step 1, with this same neuron, you can see that this neuron is receiving not only
a new input, but also the output from the previous time step. Those get summed together, the activation
function gets applied, and that gets output as well. The output of that combination then gets
fed on to the next time step, call it time step 2, where a new input for time step 2 gets fed into
this neuron, along with the output from the previous step; they get summed together, the activation
function is run,
and we have a new output.
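As a rough sketch, here's that unrolled recurrence in plain Python. The weights (w_x, w_y) and the three inputs are made-up illustrative values, not anything from a trained network:

```python
import math

def run_recurrent_neuron(inputs, w_x=0.5, w_y=0.5):
    """Run a single recurrent neuron over a sequence of inputs.

    At each time step the neuron sums the new input (scaled by w_x)
    with its own previous output (scaled by w_y), then applies tanh.
    """
    y = 0.0                # no memory before the first time step
    outputs = []
    for x in inputs:
        # blend the new input with the output fed back from the previous step
        y = math.tanh(w_x * x + w_y * y)
        outputs.append(y)
    return outputs

# Three time steps, as in the unrolled diagram:
outs = run_recurrent_neuron([1.0, 0.0, -1.0])
```

Notice that the second output depends on the first, even though the second input is zero; that feedback loop is the whole trick.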
This is called a memory cell, because it does maintain memory of its previous outputs over time, and you
can see that even though everything is getting summed together at each time step, over time those earlier
behaviors kind of get diluted,
right?
We're adding the first step into the second, and then the sum of those two things works into the
third.
So one property of memory cells is that more recent behavior tends to have more of an influence on the
current output of a recurrent neuron,
and this can be a problem in some applications,
so there are ways to work against that, which we can talk about later. Stepping this up, you can have a layer
of recurrent neurons; you don't have to have just one, obviously.
so in this diagram we are looking at four individual recurrent neurons that are working together as part
of a layer,
and you can have some input going into this layer as a whole that gets fed into these four different
recurrent neurons, and then the output of those neurons can get fed back to every neuron in that layer
at the next step.
So all we're doing is scaling this out horizontally, so instead of a single recurrent neuron we have
a layer of four recurrent neurons
in this example, where the output of all of those neurons feeds into the behavior of those same neurons
at the next learning step,
OK?
So you can scale this out to have more than one neuron and learn more complicated patterns as a result.
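Here's a minimal NumPy sketch of that layer, with made-up random weights just to show the shapes involved; a real framework would learn Wx and Wy for you:

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_neurons, n_steps = 3, 4, 5    # 4 recurrent neurons, as in the diagram

Wx = rng.normal(size=(n_inputs, n_neurons))   # input-to-layer weights
Wy = rng.normal(size=(n_neurons, n_neurons))  # recurrent (layer-to-itself) weights
b = np.zeros(n_neurons)

X = rng.normal(size=(n_steps, n_inputs))      # one input vector per time step

Y = np.zeros(n_neurons)      # the layer's previous output, initially zero
outputs = []
for t in range(n_steps):
    # every neuron sees the new input AND last step's output of ALL four neurons
    Y = np.tanh(X[t] @ Wx + Y @ Wy + b)
    outputs.append(Y)
outputs = np.array(outputs)
```

The key difference from a feed-forward layer is the second weight matrix, Wy, which routes the layer's own previous output back into itself.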
RNNs open up a wide range of possibilities, because now we have the ability to deal not just with
vectors of information, static snapshots of some sort of state, but also with sequences of data
as well.
So there are four different combinations here that you can deal with. We can have "sequence to sequence"
neural networks:
if the input is a time series, or some sort of sequence of data, the output can be a time
series, or some sequence of data, as well,
so if you're trying to predict stock prices in the future based on historical trades, that might be an
example of a sequence-to-sequence topology. We can also mix and match sequences with the static vector
states we dealt with back when we were just using multilayer perceptrons; we'd call that
"sequence to vector."
So if we were starting with a sequence of data, we could produce just a snapshot of some state as a result
of analyzing that sequence.
An example might be looking at the sequence of words in a sentence to produce some idea of the sentiment
that that sentence conveys, and we'll actually look at that in an example shortly.
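A sequence-to-vector sketch might look like this in NumPy: run the recurrence over every step, but keep only the final state as the "summary" of the whole sequence. The weights here are random placeholders, not a trained sentiment model:

```python
import numpy as np

def sequence_to_vector(X, Wx, Wy):
    """Collapse a whole sequence into one vector: run the recurrence
    over every time step, but return only the final state."""
    Y = np.zeros(Wy.shape[0])
    for x_t in X:
        Y = np.tanh(x_t @ Wx + Y @ Wy)
    return Y                 # a single summary vector of the sequence

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))       # a 10-step sequence of 3-feature inputs
Wx = rng.normal(size=(3, 4))       # input weights (placeholder values)
Wy = rng.normal(size=(4, 4))       # recurrent weights (placeholder values)
summary = sequence_to_vector(X, Wx, Wy)
```

For sentiment analysis, a final classification layer would then read that summary vector; that's the part a framework like Keras wires up for you.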
You can go the other way around too, you can go from a vector to a sequence.
So an example of that would be taking an image, which is a static vector of information, and then producing
a sequence from that vector:
for example, the words of a sentence, creating a caption from an image. And we can chain these things together
in interesting ways as well,
in interesting ways as well,
we can have encoders and decoders built up that feed into each other,
for example, we might start with a sequence of information, a sentence in some language, embody what
that sentence means as some sort of vector representation, and then turn that around into a new sequence
of words in some other language.
That might be how a machine translation system could work:
you might start with a sequence of words in French, build up a vector that sort of embodies
the meaning of that sentence, and then produce a new sequence of words in English or whatever language
you want.
That's an example of using a recurrent neural network for machine translation,
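A toy sketch of that encoder/decoder idea, with random placeholder weights: all it shows is how the shapes flow (a sequence in, one vector in the middle, a new sequence out). A real translation model would be trained end to end and be far more elaborate:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(X, Wx, Wy):
    """Sequence -> vector: the final recurrent state summarizes the input."""
    h = np.zeros(Wy.shape[0])
    for x_t in X:
        h = np.tanh(x_t @ Wx + h @ Wy)
    return h

def decode(h, Wy, Wout, n_steps):
    """Vector -> sequence: start from the encoder's summary state and
    unroll the recurrence to emit one output vector per time step."""
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(h @ Wy)
        outputs.append(h @ Wout)
    return np.array(outputs)

# Toy shapes: a 6-step "French" sequence in, a 5-step "English" sequence out.
X = rng.normal(size=(6, 3))
h = encode(X, rng.normal(size=(3, 8)), rng.normal(size=(8, 8)))
Y = decode(h, rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), n_steps=5)
```

Note that the input and output sequences don't even have to be the same length; the fixed-size vector in the middle is what decouples them.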
so lots of exciting possibilities here. Training RNNs, just like training CNNs,
is hard; in some ways it's even harder.
The main twist here is that we need to backpropagate not only through the neural network itself and
all of its layers, but also through time. From a practical standpoint, every one of those time steps
ends up looking like another layer in our neural network while we're trying to train it,
and those time steps can add up fast.
So over time we end up with like an even deeper and deeper neural network that we need to train,
and the cost of actually performing gradient descent on that increasingly deep neural network becomes
increasingly large.
So to put a cap on that training time, we often limit the backpropagation to a limited number
of time steps.
We call this "truncated backpropagation through time."
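A minimal illustration of the truncation idea: chop one long sequence into fixed-length windows, so backpropagation through time never spans more than `window` steps. The window size of 20 is an arbitrary choice for the example:

```python
def truncate_sequence(seq, window):
    """Split one long sequence into fixed-length windows so that
    backpropagation through time only ever spans `window` steps."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]

# A 100-step sequence becomes five independent 20-step training chunks.
windows = truncate_sequence(list(range(100)), window=20)
```

The trade-off is that the network can no longer learn dependencies longer than the window, which is exactly the kind of limitation LSTM-style cells try to address.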
So, just something to keep in mind when you're training an RNN: you not only need to backpropagate
through the neural network topology that you've created,
you also need to backpropagate through all the time steps you've built up to that point.
Now, we talked earlier about the fact that as you're running an RNN, the state from earlier time
steps ends up getting diluted over time, because we just keep feeding in behavior from the previous step
into the current step,
and this can be a problem if you have a system where older behavior is just as important as newer behavior.
For example, if you're looking at words in a sentence, the words at the beginning of a sentence might
even be more important than words toward the end. If you're trying to learn the meaning of a sentence,
there is, in many cases, no inherent relationship between where a word sits in the sentence and how
important it might be.
So that's an example of where you might want to do something to counteract that effect,
and one way to do that is something called the LSTM cell, which stands for "Long Short-Term Memory cell."
The idea here is that it maintains separate ideas of both short-term and long-term states, and it
does this in a fairly complex way.
Now, fortunately you don't really need to understand the nitty-gritty details of how it works;
there is an image of it here for you to look at if you're curious, but, you know, the libraries that you
use will implement this for you.
The important thing to understand is that if you're dealing with a sequence of data where you don't
want to give preferential treatment to more recent data, you probably want to use an LSTM cell instead
of just a straight-up RNN. There's also an optimization on top of the LSTM cell called
the GRU cell, which stands for "Gated Recurrent Unit."
It's just a simplification of the LSTM cell that performs almost as well,
so if you need to strike a balance between performance in terms of how well your
model works, and performance in terms of how long it takes to train, a GRU cell might be a good
choice.
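As a rough back-of-the-envelope comparison, assuming the textbook formulations (real libraries count biases slightly differently; for instance, Keras's GRU with `reset_after=True` carries an extra bias set): an LSTM cell has four weight blocks and a GRU has three, so a GRU carries about 25% fewer parameters:

```python
def lstm_params(n_inputs, n_units):
    """LSTM has 4 gate/candidate blocks, each with its own input weights,
    recurrent weights, and bias (textbook formulation)."""
    return 4 * (n_units * (n_inputs + n_units) + n_units)

def gru_params(n_inputs, n_units):
    """GRU simplifies this to 3 blocks, which is why it's cheaper to train
    while often performing almost as well."""
    return 3 * (n_units * (n_inputs + n_units) + n_units)

lstm_n = lstm_params(128, 256)
gru_n = gru_params(128, 256)
```

That 3-versus-4 ratio is the "simplification" in a nutshell; whether the accuracy difference matters depends on your problem.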
Training these is really hard: if you thought CNNs were hard, wait till you see RNNs. They're very
sensitive to the topologies that you choose and the choice of hyperparameters,
and since we have to simulate things over time, and not just through the static topology of your network,
they can become extremely resource intensive,
and if you make the wrong choices here, you might have a recurrent neural network that doesn't converge
at all,
you know, it might be completely useless even after you've run it for hours to see if it actually works,
so again, it's important to build upon previous research; try to find some sets of topologies and parameters
that work well for similar problems to what you're trying to do.
This all will make a lot more sense with an example, and you'll see that it's really nowhere near as hard
as it sounds when you're using Keras.
Now I used to work at IMDb, so I can't resist using a movie related example so let's dive into that next
and see RNN's, Recurrent Neural Networks, in action.
