- Okay.
Can everyone hear me?
Okay.
Sorry for the delay.
I had a bit of technical difficulty.
Today was the first time
I was trying to use my
new touch bar Mac book pro for presenting,
and none of the adapters are working.
So, I had to switch
laptops at the last minute.
So, thanks.
Sorry about that.
So, today is lecture 10.
We're talking about
recurrent neural networks.
So, as of, as usual, a
couple administrative notes.
So, We're working hard
on assignment one grading.
Those grades will probably be out
sometime later today.
Hopefully, they can get out
before the A2 deadline.
That's what I'm hoping for.
On a related note, Assignment
two is due today at 11:59 p.m.
so, who's done with that already?
About half you guys.
So, you remember, I did warn you
when the assignment went out
that it was quite long, to start early.
So, you were warned about that.
But, hopefully, you guys
have some late days left.
Also, as another reminder,
the midterm will be in class on Tuesday.
If you kind of look
around the lecture hall,
there are not enough seats in this room
to seat all the enrolled
students in the class.
So, we'll actually be having the midterm
in several other lecture
halls across campus.
And we'll be sending out some more details
on exactly where to go in
the next couple of days.
So a bit of a, another
bit of announcement.
We've been working on this sort of
fun bit of extra credit
thing for you to play with
that we're calling the training game.
This is this cool
browser based experience,
where you can go in
and interactively train neural networks
and tweak the hyper
parameters during training.
And this should be a
really cool interactive way
for you to practice
some of these hyper
parameter tuning skills
that we've been talking about
the last couple of lectures.
So this is not required,
but this, I think, will be
a really useful experience
to gain a little bit more intuition
into how some of these
hyper parameters work
for different types of
data sets in practice.
So we're still working on getting
all the bugs worked out of this setup,
and we'll probably send out
some more instructions
on exactly how this will work
in the next couple of days.
But again, not required.
But please do check it out.
I think it'll be really fun
and a really cool thing
for you to play with.
And will give you a bit of extra credit
if you do some,
if you end up working with this
and doing a couple of runs with it.
So, we'll again send out
some more details about this
soon once we get all the bugs worked out.
As a reminder,
last time we were talking
about CNN Architectures.
We kind of walked through the time line
of some of the various winners
of the image net classification challenge,
kind of the breakthrough result.
As we saw was the AlexNet
architecture in 2012,
which was a nine layer
convolutional network.
It did amazingly well,
and it sort of kick started
this whole deep learning
revolution in computer vision,
and kind of brought a lot of these models
into the mainstream.
Then we skipped ahead a couple years,
and saw that in 2014 image net challenge,
we had these two really
interesting models,
VGG and GoogLeNet,
which were much deeper.
So VGG was,
they had a 16 and a 19 layer model,
and GoogLeNet was, I believe,
a 22 layer model.
Although one thing that
is kind of interesting
about these models
is that the 2014 image net challenge
was right before batch
normalization was invented.
So at this time,
before the invention
of batch normalization,
training these relatively deep models
of roughly twenty layers
was very challenging.
So, in fact, both of these two models
had to resort to a little bit of hackery
in order to get their
deep models to converge.
So for VGG, they had the
16 and 19 layer models,
but actually they first
trained an 11 layer model,
because that was what they
could get to converge.
And then added some extra
random layers in the middle
and then continued training,
actually training the
16 and 19 layer models.
So, managing this training process
was very challenging in 2014
before the invention
of batch normalization.
Similarly, for GoogLeNet,
we saw that GoogLeNet has
these auxiliary classifiers
that were stuck into lower
layers of the network.
And these were not really
needed for the class to,
to get good classification performance.
This was just sort of a way to cause
extra gradient to be injected
directly into the lower
layers of the network.
And this sort of,
this again was before the
invention of batch normalization
and now once you have these networks
with batch normalization,
then you no longer need
these slightly ugly hacks
in order to get these
deeper models to converge.
Then we also saw in the
2015 image net challenge
was this really cool model called ResNet,
these residual networks
that now have these shortcut connections
that actually have these
little residual blocks
where we're going to take our input,
pass it through the residual blocks,
and then add that output of the,
then add our input to the block,
to the output from these
convolutional layers.
This is kind of a funny architecture,
but it actually has two
really nice properties.
One is that if we just set all the weights
in this residual block to zero,
then this block is competing the identity.
So in some way,
it's relatively easy for this model
to learn not to use the
layers that it doesn't need.
In addition, it kind of adds this
interpretation to L2 regularization
in the context of these neural networks,
cause once you put L2 regularization,
remember, on your,
on the weights of your network,
that's going to drive all
the parameters towards zero.
And maybe your standard
convolutional architecture
is driving towards zero.
Maybe it doesn't make sense.
But in the context of a residual network,
if you drive all the
parameters towards zero,
that's kind of encouraging the model
to not use layers that it doesn't need,
because it will just drive those,
the residual blocks towards the identity,
whether or not needed for classification.
The other really useful property
of these residual networks
has to do with the gradient
flow in the backward paths.
If you remember what happens
at these addition gates
in the backward pass,
when upstream gradient is coming in
through an addition gate,
then it will split
and fork along these two different paths.
So then, when upstream gradient comes in,
it'll take one path through
these convolutional blocks,
but it will also have a direct
connection of the gradient
through this residual connection.
So then when you look at,
when you imagine stacking many of these
residual blocks on top of each other,
and our network ends up with hundreds of,
potentially hundreds of layers.
Then, these residual connections
give a sort of gradient super highway
for gradients to flow backward
through the entire network.
And this allows it to train much easier
and much faster.
And actually allows
these things to converge
reasonably well,
even when the model is potentially
hundreds of layers deep.
And this idea of managing
gradient flow in your models
is actually super important
everywhere in machine learning.
And super prevalent in
recurrent networks as well.
So we'll definitely revisit
this idea of gradient flow
later in today's lecture.
So then, we kind of also saw
a couple other more exotic,
more recent CNN architectures last time,
including DenseNet and FractalNet,
and once you think about
these architectures
in terms of gradient flow,
they make a little bit more sense.
These things like DenseNet and FractalNet
are adding these additional shortcut
or identity connections inside the model.
And if you think about what happens
in the backwards pass for these models,
these additional funny topologies
are basically providing direct paths
for gradients to flow
from the loss at the end of the network
more easily into all the
different layers of the network.
So I think that,
again, this idea of managing
gradient flow properly
in your CNN Architectures
is something that we've really seen
a lot more in the last couple of years.
And will probably see more moving forward
as more exotic architectures are invented.
We also saw this kind of nice plot,
plotting performance of
the number of flops versus
the number of parameters
versus the run time of
these various models.
And there's some
interesting characteristics
that you can dive in
and see from this plot.
One idea is that VGG and AlexNet
have a huge number of parameters,
and these parameters actually come
almost entirely from the
fully connected layers
of the models.
So AlexNet has something like
roughly 62 million parameters,
and if you look at that
last fully connected layer,
the final fully connected layer in AlexNet
is going from an activation
volume of six by six by 256
into this fully connected vector of 496.
So if you imagine what the weight matrix
needs to look like at that layer,
the weight matrix is gigantic.
It's number of entries is six by six,
six times six times 256 times 496.
And if you multiply that out,
you see that that single layer
has 38 million parameters.
So more than half of the parameters
of the entire AlexNet model
are just sitting in that
last fully connected layer.
And if you add up all the parameters
in just the fully connected
layers of AlexNet,
including these other
fully connected layers,
you see something like
59 of the 62 million
parameters in AlexNet
are sitting in these
fully connected layers.
So then when we move other architectures,
like GoogLeNet and ResNet,
they do away with a lot of these large
fully connected layers
in favor of global average pooling
at the end of the network.
And this allows these
networks to really cut,
these nicer architectures,
to really cut down the parameter count
in these architectures.
So that was kind of our brief recap
of the CNN architectures
that we saw last lecture,
and then today,
we're going to move to
one of my favorite topics to talk about,
which is recurrent neural networks.
So, so far in this class,
we've seen, what I like to think of
as kind of a vanilla feed forward network,
all of our network
architectures have this flavor,
where we receive some input
and that input is a fixed size object,
like an image or vector.
That input is fed through
some set of hidden layers
and produces a single output,
like a classifications,
like a set of classifications scores
over a set of categories.
But in some context in machine learning,
we want to have more flexibility
in the types of data that
our models can process.
So once we move to this idea
of recurrent neural networks,
we have a lot more opportunities
to play around with the types
of input and output data
that our networks can handle.
So once we have recurrent neural networks,
we can do what we call
these one to many models.
Or where maybe our input is
some object of fixed size,
like an image,
but now our output is a
sequence of variable length,
such as a caption.
Where different captions
might have different numbers of words,
so our output needs to
be variable in length.
We also might have many to one models,
where our input could be variably sized.
This might be something
like a piece of text,
and we want to say what is
the sentiment of that text,
whether it's positive or
negative in sentiment.
Or in a computer vision context,
you might imagine taking as input a video,
and that video might have a
variable number of frames.
And now we want to read this entire video
of potentially variable length.
And then at the end,
make a classification decision
about maybe what kind
of activity or action
is going on in that video.
We also have a, we
might also have problems
where we want both the inputs
and the output to be variable in length.
We might see something like this
in machine translation,
where our input is some,
maybe, sentence in English,
which could have a variable length,
and our output is maybe
some sentence in French,
which also could have a variable length.
And crucially, the length
of the English sentence
might be different from the
length of the French sentence.
So we need some models
that have the capacity
to accept both variable length sequences
on the input and on the output.
Finally, we might also
consider problems where
our input is variably length,
like something like a video sequence
with a variable number of frames.
And now we want to make a decision
for each element of that input sequence.
So in the context of videos,
that might be making some
classification decision
along every frame of the video.
And recurrent neural networks
are this kind of general paradigm
for handling variable sized sequence data
that allow us to pretty naturally capture
all of these different types
of setups in our models.
So recurring neural networks
are actually important,
even for some problems that
have a fixed size input
and a fixed size output.
Recurrent neural networks
can still be pretty useful.
So in this example,
we might want to do, for example,
sequential processing of our input.
So here, we're receiving
a fixed size input
like an image,
and we want to make a
classification decision
about, like, what number is
being shown in this image?
But now, rather than just doing
a single feed forward pass
and making the decision all at once,
this network is actually
looking around the image
and taking various glimpses of
different parts of the image.
And then after making
some series of glimpses,
then it makes its final decision
as to what kind of number is present.
So here, we had one,
So here, even though
our input and outputs,
our input was an image,
and our output was a
classification decision,
even this context,
this idea of being able to handle
variably length processing
with recurrent neural networks
can lead to some really
interesting types of models.
There's a really cool paper that I like
that applied this same type of idea
to generating new images.
Where now, we want the model
to synthesize brand new images
that look kind of like the
images it saw in training,
and we can use a recurrent
neural network architecture
to actually paint these output images
sort of one piece at a time in the output.
You can see that,
even though our output
is this fixed size image,
we can have these models
that are working over time
to compute parts of the output
one at a time sequentially.
And we can use recurrent neural networds
for that type of setup as well.
So after this sort of cool pitch
about all these cool
things that RNNs can do,
you might wonder, like what
exactly are these things?
So in general, a recurrent neural network
is this little, has this
little recurrent core cell
and it will take some input x,
feed that input into the RNN,
and that RNN has some
internal hidden state,
and that internal hidden
state will be updated
every time that the RNN reads a new input.
And that internal hidden state
will be then fed back to the model
the next time it reads an input.
And frequently, we will want our RNN"s
to also produce some
output at every time step,
so we'll have this pattern
where it will read an input,
update its hidden state,
and then produce an output.
So then the question is
what is the functional form
of this recurrence relation
that we're computing?
So inside this little green RNN block,
we're computing some recurrence relation,
with a function f.
So this function f will
depend on some weights, w.
It will accept the previous
hidden state, h t - 1,
as well as the input at
the current state, x t,
and this will output
the next hidden state, or
the updated hidden state,
that we call h t.
And now,
then as we read the next input,
this hidden state,
this new hidden state, h t,
will then just be passed
into the same function
as we read the next input, x t plus one.
And now, if we wanted
to produce some output
at every time step of this network,
we might attach some additional
fully connected layers
that read in this h t at every time step.
And make that decision based
on the hidden state at every time step.
And one thing to note is that
we use the same function, f w,
and the same weights, w,
at every time step of the computation.
So then kind of the simplest function form
that you can imagine
is what we call this vanilla
recurrent neural network.
So here, we have this same functional form
from the previous slide,
where we're taking in
our previous hidden state
and our current input
and we need to produce
the next hidden state.
And the kind of simplest
thing you might imagine
is that we have some weight matrix, w x h,
that we multiply against the input, x t,
as well as another weight matrix, w h h,
that we multiply against
the previous hidden state.
So we make these two multiplications
against our two states,
add them together,
and squash them through a tanh,
so we get some kind of non
linearity in the system.
You might be wondering
why we use a tanh here
and not some other type of non-linearity?
After all that we've said negative
about tanh's in previous lectures,
and I think we'll return
a little bit to that
later on when we talk about
more advanced architectures, like lstm.
So then, this,
So then, in addition in this architecture,
if we wanted to produce
some y t at every time step,
you might have another weight matrix, w,
you might have another weight matrix
that accepts this hidden state
and then transforms it to some y
to produce maybe some
class score predictions
at every time step.
And when I think about
recurrent neural networks,
I kind of think about, you can also,
you can kind of think of
recurrent neural networks
in two ways.
One is this concept of
having a hidden state
that feeds back at itself, recurrently.
But I find that picture
a little bit confusing.
And sometimes, I find it clearer
to think about unrolling
this computational graph
for multiple time steps.
And this makes the data
flow of the hidden states
and the inputs and the
outputs and the weights
maybe a little bit more clear.
So then at the first time step,
we'll have some initial
hidden state h zero.
This is usually initialized
to zeros for most context,
in most contexts, an then
we'll have some input, x t.
This initial hidden state, h zero,
and our current input, x t,
will go into our f w function.
This will produce our
next hidden state, h one.
And then, we'll repeat this process
when we receive the next input.
So now our current h one and our x one,
will go into that same f w,
to produce our next output, h two.
And this process will
repeat over and over again,
as we consume all of the input, x ts,
in our sequence of inputs.
And now, one thing to note, is that
we can actually make
this even more explicit
and write the w matrix in
our computational graph.
And here you can see that
we're re-using the same w matrix
at every time step of the computation.
So now every time that we
have this little f w block,
it's receiving a unique h and a unique x,
but all of these blocks
are taking the same w.
And if you remember,
we talked about how gradient
flows in back propagation,
when you re-use the same,
when you re-use the
same node multiple times
in a computational graph,
then remember during the backward pass,
you end up summing the gradients
into the w matrix
when you're computing a d los d w.
So, if you kind of think about
the back propagation for this model,
then you'll have a separate gradient for w
flowing from each of those time steps,
and then the final gradient for w
will be the sum of all of those
individual per time step gradiants.
We can also write to this y t explicitly
in this computational graph.
So then, this output,
h t, at every time step
might feed into some other
little neural network
that can produce a y t,
which might be some class
scores, or something like that,
at every time step.
We can also make the loss more explicit.
So in many cases, you
might imagine producing,
you might imagine that you
have some ground truth label
at every time step of your sequence,
and then you'll compute some loss,
some individual loss,
at every time step of
these outputs, y t's.
And this loss might,
it will frequently be
something like soft max loss,
in the case where you have, maybe,
a ground truth label at every
time step of the sequence.
And now the final loss for the entire,
for this entire training stop,
will be the sum of
these individual losses.
So now, we had a scaler
loss at every time step?
And we just summed them
up to get our final
scaler loss at the top of the network.
And now, if you think about,
again, back propagation
through this thing,
we need, in order to train the model,
we need to compute the gradient
of the loss with respect to w.
So, we'll have loss flowing
from that final loss
into each of these time steps.
And then each of those time steps
will compute a local
gradient on the weights, w,
which will all then be
summed to give us our final
gradient for the weights, w.
Now if we have a, sort of,
this many to one situation,
where maybe we want to do
something like sentiment analysis,
then we would typically make that decision
based on the final hidden
state of this network.
Because this final hidden state
kind of summarizes all of the context
from the entire sequence.
Also, if we have a kind of
a one to many situation,
where we want to receive a fix sized input
and then produce a variably sized output.
Then you'll commonly use
that fixed size input
to initialize, somehow,
the initial hidden state of the model,
and now the recurrent network will tick
for each cell in the output.
And now, as you produce
your variably sized output,
you'll unroll the graph for
each element in the output.
So this, when we talk about
the sequence to sequence models
where you might do something
like machine translation,
where you take a variably sized input
and a variably sized output.
You can think of this as a combination
of the many to one,
plus a one to many.
So, we'll kind of proceed in two stages,
what we call an encoder and a decoder.
So if you're the encoder,
we'll receive the variably sized input,
which might be your sentence in English,
and then summarize that entire sentence
using the final hidden state
of the encoder network.
And now we're in this
many to one situation
where we've summarized this
entire variably sized input
in this single vector,
and now, we have a second decoder network,
which is a one to many situation,
which will input that single vector
summarizing the input sentence
and now produce this
variably sized output,
which might be your sentence
in another language.
And now in this variably sized output,
we might make some predictions
at every time step,
maybe about what word to use.
And you can imagine kind of
training this entire thing
by unrolling this computational graph
summing the losses at the output sequence
and just performing back
propagation, as usual.
So as a bit of a concrete example,
one thing that we frequently use
recurrent neural networks for,
is this problem called language modeling.
So in the language modeling problem,
we want to read some sequence of,
we want to have our
network, sort of, understand
how to produce natural language.
So in the, so this, this might
happen at the character level
where our model will produce
characters one at a time.
This might also happen at the word level
where our model will
produce words one at a time.
But in a very simple example,
you can imagine this
character level language model
where we want,
where the network will read
some sequence of characters
and then it needs to predict,
what will the next character
be in this stream of text?
So in this example,
we have this very small
vocabulary of four letters,
h, e, l, and o, and we have
this example training sequence
of the word hello, h, e, l, l, o.
So during training,
when we're training this language model,
we will feed the characters
of this training sequence
as inputs, as x ts, to out input of our,
we'll feed the characters
of our training sequence,
these will be the x ts that
we feed in as the inputs
to our recurrent neural network.
And then, each of these inputs,
it's a letter,
and we need to figure out a way
to represent letters in our network.
So what we'll typically do is figure out
what is our total vocabulary.
In this case, our vocabulary
has four elements.
And each letter will be
represented by a vector
that has zeros in every slot but one,
and a one for the slot in the vocabulary
corresponding to that letter.
In this little example,
since our vocab has the
four letters, h, e, l, o,
then our input sequence,
the h is represented by
a four element vector
with a one in the first slot
and zero's in the other three slots.
And we use the same sort of pattern
to represent all the different letters
in the input sequence.
Now, during this forward pass
of what this network is doing,
at the first time step,
it will receive the input letter h.
That will go into the first RNN,
to the RNN cell,
and then we'll produce this output, y t,
which is the network making predictions
about for each letter in the vocabulary,
which letter does it think is most likely
going to come next.
In this example,
the correct output letter was e
because our training sequence was hello,
but the model is actually predicting,
I think it's actually
predicting o as the most likely letter.
So in this case, this prediction was wrong
and we would use softmaxt loss
to quantify our unhappiness
with these predictions.
The next time step,
we would feed in the second letter
in the training sequence, e,
and this process will repeat.
We'll now represent e as a vector.
Use that input vector together
with the previous hidden state
to produce a new hidden state
and now use the second hidden state
to, again, make predictions
over every letter in the vocabulary.
In this case, because our
training sequence was hello,
after the letter e,
we want our model to predict l.
In this case,
our model may have very low predictions
for the letter l, so we
would incur high loss.
And you kind of repeat
this process over and over,
and if you train this model
with many different sequences,
then eventually it should learn
how to predict the next
character in a sequence
based on the context of
all the previous characters
that it's seen before.
And now, if you think about
what happens at test time,
after we train this model,
one thing that we might want to do with it
is a sample from the model,
and actually use this
trained neural network model
to synthesize new text
that kind of looks similar in spirit
to the text that it was trained on.
The way that this will work
is we'll typically see the model
with some input prefix of text.
In this case, the prefix is
just the single letter h,
and now we'll feed that letter h
through the first time step of
our recurrent neural network.
It will product this
distribution of scores
over all the characters in the vocabulary.
Now, at training time,
we'll use these scores
to actually sample from it.
So we'll use a softmaxt function
to convert those scores into
a probability distribution
and then we will sample from
that probability distribution
to actually synthesize the
second letter in the sequence.
And in this case, even though
the scores were pretty bad,
maybe we got lucky and
sampled the letter e
from this probability distribution.
And now, we'll take this letter e
that was sampled from this distribution
and feed it back as input into the network
at the next time step.
Now, we'll take this e,
pull it down from the top,
feed it back into the network
as one of these, sort of, one
hot vectorial representations,
and then repeat the process
in order to synthesize the
second letter in the output.
And we can repeat this
process over and over again
to synthesize a new sequence
using this trained model,
where we're synthesizing the sequence
one character at a time
using these predicted
probability distributions
at each time step.
Question?
Yeah, that's a great question.
So the question is why might we sample
instead of just taking the character
with the largest score?
In this case,
because of the probability
distribution that we had,
it was impossible to
get the right character,
so we had the sample so
the example could work out,
and it would make sense.
But in practice,
sometimes you'll see both.
So sometimes you'll just
take the argmax probability,
and that will sometimes be
a little bit more stable,
but one advantage of sampling, in general,
is that it lets you get
diversity from your models.
Sometimes you might have the same input,
maybe the same prefix,
or in the case of image captioning,
maybe the same image.
But then if you sample rather
than taking the argmax,
then you'll see that
sometimes these trained models
are actually able to produce
multiple different types
of reasonable output sequences,
depending on the kind,
depending on which samples they take
at the first time steps.
It's actually kind of a benefit
cause we can get now more
diversity in our outputs.
Another question?
Could we feed in the softmax vector
instead of the one element vector?
You mean at test time?
Yeah yeah, so the
question is, at test time,
could we feed in this whole softmax vector
rather than a one hot vector?
There's kind of two problems with that.
One is that that's very different
from the data that it
saw at training time.
In general, if you ask your model
to do something at test time,
which is different from training time,
then it'll usually blow up.
It'll usually give you garbage
and you'll usually be sad.
The other problem is that in practice,
our vocabularies might be very large.
So maybe, in this simple example,
our vocabulary is only four elements,
so it's not a big problem.
But if you're thinking about
generating words one at a time,
now your vocabulary is every
word in the English language,
which could be something like
tens of thousands of elements.
So in practice, this first element,
this first operation that's
taking in this one hot vector,
is often performed using
sparse vector operations
rather than dense factors.
It would be, sort of,
computationally really bad
if you wanted to have this load of
10,000 elements softmax vector.
So that's usually why we
use a one hot instead,
even at test time.
This idea that we have a sequence
and we produce an output at
every time step of the sequence
and then finally compute some loss,
this is sometimes called
backpropagation through time
because you're imagining
that in the forward pass,
you're kind of stepping
forward through time
and then during the backward pass,
you're sort of going
backwards through time
to compute all your gradients.
This can actually be kind of problematic
if you want to train the sequences
that are very, very long.
So if you imagine that we
were kind of trying to train
a neural network language model
on maybe the entire text of Wikipedia,
which is, by the way,
something that people
do pretty frequently,
this would be super slow,
and every time we made a gradient step,
we would have to make a forward pass
through the entire text
of all of wikipedia,
and then make a backward pass
through all of wikipedia,
and then make a single gradient update.
And that would be super slow.
Your model would never converge.
It would also take a
ridiculous amount of memory
so this would be just really bad.
In practice, what people
do is this, sort of,
approximation called truncated
backpropagation through time.
Here, the idea is that,
even though our input
sequence is very, very long,
and even potentially infinite,
what we'll do is that during,
when we're training the model,
we'll step forward for
some number of steps,
maybe like a hundred is
kind of a ballpark number
that people frequently use,
and we'll step forward
for maybe a hundred steps,
compute a loss only over this
sub sequence of the data,
and then back propagate
through this sub sequence,
and now make a gradient step.
And now, when we repeat, well,
we still have these hidden states
that we computed from the first batch,
and now, when we compute
this next batch of data,
we will carry those hidden
states forward in time,
so the forward pass will
be exactly the same.
But now when we compute a gradient step
for this next batch of data,
we will only backpropagate
again through this second batch.
Now, we'll make a gradient step
based on this truncated
backpropagation through time.
This process will continue,
where now when we make the next batch,
we'll again copy these
hidden states forward,
but then step forward
and then step backward,
but only for some small
number of time steps.
So this is,
you can kind of think of this
as being an alegist who's
the cast at gradient descent
in the case of sequences.
Remember, when we talked
about training our models
on large data sets,
then these data sets,
it would be super expensive
to compute the gradients
over every element in the data set.
So instead, we kind of take small samples,
small mini batches instead,
and use mini batches of data
to compute gradient stops
in any kind of image classification case.
Question?
Is this kind of, the question is,
is this kind of making
the Mark Hobb assumption?
No, not really.
Because we're carrying
this hidden state forward
in time forever.
It's making a Marcovian assumption
in the sense that, conditioned
on the hidden state,
but the hidden state is all that we need
to predict the entire future
of the sequence.
But that assumption is kind of built
into the recurrent neural network formula
from the start.
And that's not really particular
to back propagation through time.
Back propagation through time,
or sorry, truncated back prop though time
is just the way to
approximate these gradients
without going making a backwards pass
through your potentially
very large sequence of data.
This all sounds very
complicated and confusing
and it sounds like a lot of code to write,
but in fact, this can
acutally be pretty concise.
Andrea has this example of
what he calls min-char-rnn,
that does all of this stuff
in just like a 112 lines of Python.
It handles building the vocabulary.
It trains the model
with truncated back
propagation through time.
And then, it can actually
sample from that model
in actually not too much code.
So even though this sounds like
kind of a big, scary process,
it's actually not too difficult.
I'd encourage you, if you're confused,
to maybe go check this out
and step through the
code on your own time,
and see, kind of, all
of these concrete steps
happening in code.
So this is all in just a single file,
all using numpy with no dependencies.
This was relatively easy to read.
So then, once we have this idea
of training a recurrent
neural network language model,
we can actually have a
lot of fun with this.
And we can take in, sort
of, any text that we want.
Take in, like, whatever
random text you can think of
from the internet,
train our recurrent neural
network language model
on this text,
and then generate new text.
So in this example, we
took this entire text
of all of Shakespeare's works,
and then used that to train
a recurrent neural network language model
on all of Shakespeare.
And you can see that the
beginning of training,
it's kind of producing maybe
random gibberish garbage,
but throughout the course of training,
it ends up producing things
that seem relatively reasonable.
And after you've,
after this model has
been trained pretty well,
then it produces text that seems,
kind of, Shakespeare-esque to me.
"Why do what that day," replied,
whatever, right, you can read this.
Like, it kind of looks
kind of like Shakespeare.
And if you actually train
this model even more,
and let it converge even further,
and then sample these
even longer sequences,
you can see that it learns
all kinds of crazy cool stuff
that really looks like a Shakespeare play.
It knows that it uses,
maybe, these headings
to say who's speaking.
Then it produces these bits of text
that have crazy dialogue
that sounds kind of Shakespeare-esque.
It knows to put line breaks
in between these different things.
And this is all, like, really cool,
all just sort of learned from
the structure of the data.
We can actually get
even crazier than this.
This was one of my favorite examples.
I found online, there's this.
Is anyone a mathematician in this room?
Has anyone taken an algebraic
topology course by any chance?
Wow, a couple, that's impressive.
So you probably know more
algebraic topology than me,
but I found this open source
algebraic topology textbook online.
It's just a whole bunch of tech files
that are like this
super dense mathematics.
And LaTac, cause LaTac is sort of this,
let's you write equations and diagrams
and everything just using plain text.
We can actually train our
recurrent neural network language model
on the raw Latac source code
of this algebraic topology textbook.
And if we do that, then after
we sample from the model,
then we get something that seems like,
kind of like algebraic topology.
So it knows to like put equations.
It puts all kinds of crazy stuff.
It's like, to prove study,
we see that F sub U is
a covering of x prime,
blah, blah, blah, blah, blah.
It knows where to put unions.
It knows to put squares
at the end of proofs.
It makes lemmas.
It makes references to previous lemmas.
Right, like we hear, like.
It's namely a bi-lemma question.
We see that R is geometrically something.
So it's actually pretty crazy.
It also sometimes tries to make diagrams.
For those of you that have
taken algebraic topology,
you know that these commutative diagrams
are kind of a thing
that you work with a lot
So it kind of got the general gist
of how to make those diagrams,
but they actually don't make any sense.
And actually,
one of my favorite examples here
is that it sometimes omits proofs.
So it'll sometimes say,
it'll sometimes say something like
theorem, blah, blah, blah,
blah, blah, proof omitted.
This thing kind of has gotten the gist
of how some of these
math textbooks look like.
We can have a lot of fun with this.
So we also tried training
one of these models
on the entire source
code of the Linux kernel.
'Cause again, this character level stuff
that we can train on,
And then, when we sample this,
it acutally again looks
like C source code.
It knows how to write if statements.
It has, like, pretty good
code formatting skills.
It knows to indent after
these if statements.
It knows to put curly braces.
It actually even makes
comments about some things
that are usually nonsense.
One problem with this model is that
it knows how to declare variables.
But it doesn't always use the
variables that it declares.
And sometimes it tries
to use variables that
haven't been declared.
This wouldn't compile.
I would not recommend sending this
as a pull request to Linux.
This thing also figures
out how to recite the GNU,
this GNU license character by character.
It kind of knows that you
need to recite the GNU license
and after the license comes some includes,
then some other includes,
then source code.
This thing has actually
learned quite a lot
about the general structure of the data.
Where, again, during training,
all we asked this model to do
was try to predict the next
character in the sequence.
We didn't tell it any of this structure,
but somehow, just through the course
of this training process,
it learned a lot about
the latent structure
in the sequential data.
Yeah, so it knows how to write code.
It does a lot of cool stuff.
I had this paper with
Andre a couple years ago
where we trained a bunch of these models
and then we wanted to try
to poke into the brains
of these models
and figure out like what are they doing
and why are they working.
So we saw, in our,
these recurring neural networks
has this hidden vector which is,
maybe, some vector that's
updated over every time step.
And then what we wanted
to try to figure out is
could we find some elements of this vector
that have some Symantec
interpretable meaning.
So what we did is
we trained a neural
network language model,
one of these character level models
on one of these data sets,
and then we picked one of the
elements in that hidden vector
and now we look at what is the
value of that hidden vector
over the course of a sequence
to try to get some sense of maybe
what these different hidden
states are looking for.
When you do this, a lot
of them end up looking
kind of like random gibberish garbage.
So here again, what we've done,
is we've picked one
element of that vector,
and now we run the sequence forward
through the trained model,
and now the color of each character
corresponds to the
magnitude of that single
scaler element of the hidden
vector at every time step
when it's reading the sequence.
So you can see that a lot
of the vectors in these hidden states
are kind of not very interpretable.
It seems like they're
kind of doing some of this
low level language modeling
to figure out what
character should come next.
But some of them end up quite nice.
So here we found this vector
that is looking for quotes.
You can see that there's
this one hidden element,
this one element in the vector,
that is off, off, off, off, off blue
and then once it hits a quote,
it turns on and remains
on for the duration
of this quote.
And now when we hit the
second quotation mark,
then that cell turns off.
So somehow, even though
this model was only trained
to predict the next
character in a sequence,
it somehow learned that a useful thing,
in order to do this,
might be to have some cell
that's trying to detect quotes.
We also found this other cell
that is, looks like it's
counting the number of characters
since a line break.
So you can see that at the
beginning of each line,
this element starts off at zero.
Throughout the course of the line,
it's gradually more red,
so that value increases.
And then after the new line character,
it resets to zero.
So you can imagine that maybe this cell
is letting the network keep track
of when it needs to write
to produce these new line characters.
We also found some that,
when we trained on the linux source code,
we found some examples that are turning on
inside the conditions of if statements.
So this maybe allows the network
to differentiate whether
it's outside an if statement
or inside that condition,
which might help it model
these sequences better.
We also found some that
turn on in comments,
or some that seem like they're counting
the number of indentation levels.
This is all just really cool stuff
because it's saying that
even though we are only
trying to train this model
to predict next characters,
it somehow ends up learning a lot
of useful structure about the input data.
One kind of thing that we often use,
so this is not really been
computer vision so far,
and we need to pull this
back to computer vision
since this is a vision class.
We've alluded many times to this
image captioning model
where we want to build
models that can input
an image and then output a
caption in natural language.
There were a bunch of
papers a couple years ago
that all had relatively
similar approaches.
But I'm showing the figure
from the paper from our lab
in a totally un-biased way.
But, the idea here is that the caption
is this variably length
sequence that we might,
the sequence might have different numbers
of words for different captions.
So this is a totally natural fit
for a recurrent neural
network language model.
So then what this model looks like
is we have some convolutional network
which will input the,
which will take as input the image,
and we've seen a lot about how
convolution networks work at this point,
and that convolutional
network will produce
a summary vector of the image
which will then feed
into the first time step
of one of these recurrent
neural network language models
which will then produce words
of the caption one at a time.
So the way that this kind
of works at test time
after the model is trained
looks almost exactly the same
as these character level language models
that we saw a little bit ago.
We'll take our input image,
feed it through our convolutional network.
But now instead of
taking the softmax scores
from an image net model,
we'll instead take this
4,096 dimensional vector
from the end of the model,
and we'll take that vector and use it
to summarize the whole
content of the image.
Now, remember when we talked
about RNN language models,
we said that we need to
see the language model
with that first initial input
to tell it to start generating text.
So in this case, we'll give
it some special start token,
which is just saying, hey, this
is the start of a sentence.
Please start generating some text
conditioned on this image information.
So now previously, we saw that
in this RNN language model,
we had these matrices that
were taking the previous,
the input at the current time step
and the hidden state of
the previous time step
and combining those to
get the next hidden state.
Well now, we also need to add
in this image information.
So one way, people play
around with exactly
different ways to incorporate
this image information,
but one simple way is just to add
a third weight matrix that is
adding in this image
information at every time step
to compute the next hidden state.
So now, we'll compute this distribution
over all scores in our vocabulary
and here, our vocabulary is something
like all English words,
so it could be pretty large.
We'll sample from that distribution
and now pass that word back as
input at the next time step.
And that will then feed that word in,
again get a distribution
over all words in the vocab,
and again sample to produce the next word.
So then, after that thing is all done,
we'll maybe generate,
we'll generate this complete sentence.
We stop generation once we
sample the special ends token,
which kind of corresponds to the period
at the end of the sentence.
Then once the network
samples this ends token,
we stop generation and we're done
and we've gotten our
caption for this image.
And now, during training,
we trained this thing to generate,
like we put an end token at the end
of every caption during training
so that the network kind
of learned during training
that end tokens come at
the end of sequences.
So then, during test time,
it tends to sample these end tokens
once it's done generating.
So we trained this model
in kind of a completely supervised way.
You can find data sets
that have images together with
natural language captions.
Microsoft COCO is probably the biggest
and most widely used for this task.
But you can just train this model
in a purely supervised way.
And then backpropagate
through to jointly train
both this recurrent neural
network language model
and then also pass gradients
back into this final layer of this the CNN
and additionally update the weights
of the CNN to jointly tune
all parts of the model
to perform this task.
Once you train these models,
they actually do some
pretty reasonable things.
These are some real results from a model,
from one of these trained models,
and it says things like a cat sitting
on a suitcase on the floor,
which is pretty impressive.
It knows about cats
sitting on a tree branch,
which is also pretty cool.
It knows about two people walking
on the beach with surfboards.
So these models are
actually pretty powerful
and can produce relatively
complex captions
to describe the image.
But that being said,
these models are really not perfect.
They're not magical.
Just like any machine learning model,
if you try to run them on data
that was very different
from the training data,
they don't work very well.
So for example, this example,
it says a woman is
holding a cat in her hand.
There's clearly no cat in the image.
But she is wearing a fur coat,
and maybe the texture of that coat
kind of looked like a cat to the model.
Over here, we see a
woman standing on a beach
holding a surfboard.
Well, she's definitely
not holding a surfboard
and she's doing a handstand,
which is maybe the interesting
part of that image,
and the model totally missed that.
Also, over here, we see this example
where there's this picture of a spider web
in the tree branch,
and it totally,
and it says something like
a bird sitting on a tree branch.
So it totally missed the spider,
but during training,
it never really saw examples of spiders.
It just knows that birds sit
on tree branches during training.
So it kind of makes these
reasonable mistakes.
Or here at the bottom,
it can't really tell the difference
between this guy throwing
and catching the ball,
but it does know that
it's a baseball player
and there's balls and things involved.
So again, just want to
say that these models
are not perfect.
They work pretty well when
you ask them to caption images
that were similar to the training data,
but they definitely have a hard time
generalizing far beyond that.
So another thing you'll sometimes see
is this slightly more advanced
model called Attention,
where now when we're generating
the words of this caption,
we can allow the model
to steer it's attention
to different parts of the image.
And I don't want to spend
too much time on this.
But the general way
that this works is that
now our convolutional network,
rather than producing a single vector
summarizing the entire image,
now it produces some grid of vectors
that summarize the,
that give maybe one vector
for each spatial location
in the image.
And now, when we,
when this model runs forward,
in addition to sampling the
vocabulary at every time step,
it also produces a distribution
over the locations in the image
where it wants to look.
And now this distribution
over image locations
can be seen as a kind of a tension
of where the model should
look during training.
So now that first hidden state
computes this distribution
over image locations,
which then goes back to the set of vectors
to give a single summary vector
that maybe focuses the attention
on one part of that image.
And now that summary vector gets fed,
as an additional input,
at the next time step
of the neural network.
And now again, it will
produce two outputs.
One is our distribution
over vocabulary words.
And the other is a distribution
over image locations.
This whole process will continue,
and it will sort of do
these two different things
at every time step.
And after you train the model,
then you can see that it kind of
will shift it's attention around the image
for every word that it
generates in the caption.
Here you can see that it
produced the caption,
a bird is flying over,
I can't see that far.
But you can see that its attention
is shifting around
different parts of the image
for each word in the
caption that it generates.
There's this notion of hard attention
versus soft attention,
which I don't really want
to get into too much,
but with this idea of soft attention,
we're kind of taking
a weighted combination
of all features from all image locations,
whereas in the hard attention case,
we're forcing the model to
select exactly one location
to look at in the image at each time step.
So the hard attention case
where we're selecting
exactly one image location
is a little bit tricky
because that is not really
a differentiable function,
so you need to do
something slightly fancier
than vanilla backpropagation
in order to just train the
model in that scenario.
And I think we'll talk about
that a little bit later
in the lecture on reinforcement learning.
Now, when you look at after you train
one of these attention models
and then run it on to generate captions,
you can see that it tends
to focus it's attention
on maybe the salient or
semanticly meaningful part
of the image when generating captions.
You can see that the caption was a woman
is throwing a frisbee in a park
and you can see that this attention mask,
when it generated the word,
when the model generated the word frisbee,
at the same time,
it was focusing it's
attention on this image region
that actually contains the frisbee.
This is actually really cool.
We did not tell the model
where it should be looking
at every time step.
It sort of figured all that out for itself
during the training process.
Because somehow, it
figured out that looking
at that image region was
the right thing to do
for this image.
And because everything in
this model is differentiable,
because we can backpropagate
through all these soft attention steps,
all of this soft attention stuff
just comes out through
the training process.
So that's really, really cool.
By the way, this idea of
recurrent neural networks
and attention actually
gets used in other tasks
beyond image captioning.
One recent example is this idea
of visual question answering.
So here, our model is going
to take two things as input.
It's going to take an image
and it will also take a
natural language question
that's asking some
question about the image.
Here, we might see this image on the left
and we might ask the question,
what endangered animal
is featured on the truck?
And now the model needs to select from one
of these four natural language answers
about which of these answers
correctly answers that question
in the context of the image.
So you can imagine kind of
stitching this model together
using CNNs and RNNs in
kind of a natural way.
Now, we're in this
many to one scenario,
where now our model needs to take as input
this natural language sequence,
so we can imagine running
a recurrent neural network
over each element of that input question,
to now summarize the input
question in a single vector.
And then we can have a CNN
to again summarize the image,
and now combine both
the vector from the CNN
and the vector from the
question and coding RNN
to then predict a
distribution over answers.
We also sometimes,
you'll also sometimes see this idea
of soft spacial attention
being incorporated
into things like visual
question answering.
So you can see that here,
this model is also having
the spatial attention
over the image when it's trying
to determine answers to the questions.
Just to, yeah, question?
So the question is
How are the different inputs combined?
Do you mean like the
encoded question vector
and the encoded image vector?
Yeah, so the question is
how are the encoded image
and the encoded question vector combined?
Kind of the simplest thing to do
is just to concatenate them
and stick them into
fully connected layers.
That's probably the most common
and that's probably
the first thing to try.
Sometimes people do
slightly fancier things
where they might try to have
multiplicative interactions
between those two vectors
to allow a more powerful function.
But generally, concatenation
is kind of a good
first thing to try.
Okay, so now we've talked
about a bunch of scenarios
where RNNs are used for
different kinds of problems.
And I think it's super cool
because it allows you to
start tackling really complicated problems
combining images and computer vision
with natural language processing.
And you can see that we
can kind of stith together
these models like Lego blocks
and attack really complicated things,
Like image captioning or
visual question answering
just by stitching together
these relatively simple
types of neural network modules.
But I'd also like to mention that
so far, we've talked about this idea
of a single recurrent network layer,
where we have sort of one hidden state,
and another thing that
you'll see pretty commonly
is this idea of a multilayer
recurrent neural network.
Here, this is a three layer
recurrent neural network,
so now our input goes in,
goes into, goes in and produces
a sequence of hidden states
from the first recurrent
neural network layer.
And now, after we run kind of
one recurrent neural network layer,
then we have this whole
sequence of hidden states.
And now, we can use the
sequence of hidden states
as an input sequence to another
recurrent neural network layer.
And then you can just imagine,
which will then produce another
sequence of hidden states
from the second RNN layer.
And then you can just imagine
stacking these things
on top of each other,
cause we know that we've
seen in other contexts
that deeper models tend to perform better
for various problems.
And the same kind of
holds in RNNs as well.
For many problems, you'll see maybe a two
or three layer recurrent
neural network model
is pretty commonly used.
You typically don't see
super deep models in RNNs.
So generally, like two,
three, four layer RNNs
is maybe as deep as you'll typically go.
Then, I think it's also really
interesting and important
to think about,
now we've seen kind of
what kinds of problems
these RNNs can be used for,
but then you need to think
a little bit more carefully
about exactly what happens to these models
when we try to train them.
So here, I've drawn this
little vanilla RNN cell
that we've talked about so far.
So here, we're taking
our current input, x t,
and our previous hidden
state, h t minus one,
and then we stack, those are two vectors.
So we can just stack them together.
And then perform this
matrix multiplication
with our weight matrix,
to give our,
and then squash that
output through a tanh,
and that will give us
our next hidden state.
And that's kind of the
basic functional form
of this vanilla recurrent neural network.
But then, we need to think about
what happens in this architecture
during the backward pass when
we try to compute gradients?
So then if we think
about trying to compute,
so then during the backwards pass,
we'll receive the derivative of our h t,
we'll receive derivative of loss
with respect to h t.
And during the backward
pass through the cell,
we'll need to compute derivative of loss
to the respect of h t minus one.
Then, when we compute this backward pass,
we see that the gradient flows backward
through this red path.
So first, that gradient
will flow backwards
through this tanh gate,
and then it will flow backwards
through this matrix multiplication gate.
And then, as we've seen in the homework
and when implementing these
matrix multiplication layers,
when you backpropagate through
this matrix multiplication gate,
you end up mulitplying by the transpose
of that weight matrix.
So that means that every
time we backpropagate
through one of these vanilla RNN cells,
we end up multiplying by some
part of the weight matrix.
So now if you imagine
that we are sticking many
of these recurrent neural
network cells in sequence,
because again this is an RNN.
We want a model sequences.
Now if you imagine what
happens to the gradient flow
through a sequence of these layers,
then something kind of
fishy starts to happen.
Because now, when we want to compute
the gradient of the loss
with respect to h zero,
we need to backpropagate through every one
of these RNN cells.
And every time you
backpropagate through one cell,
you'll pick up one of
these w transpose factors.
So that means that the final expression
for the gradient on h zero
will involve many, many factors
of this weight matrix,
which could be kind of bad.
Maybe don't think about the weight,
the matrix case,
but imagine a scaler case.
If we end up, if we have some scaler
and we multiply by that
same number over and over
and over again,
maybe not for four examples,
but for something like a hundred
or several hundred time steps,
then multiplying by the same number
over and over again is really bad.
In the scaler case,
it's either going to explode
in the case that that
number is greater than one
or it's going to vanish towards zero
in the case that number is less than one
in absolute value.
And the only way in which
this will not happen
is if that number is exactly one,
which is actually very
rare to happen in practice.
That leaves us to,
that same intuition
extends to the matrix case,
but now, rather than the absolute
value of a scaler number,
you instead need to look at the largest,
the largest singular value
of this weight matrix.
Now if that largest singular
value is greater than one,
then during this backward pass,
when we multiply by the
weight matrix over and over,
that gradient on h w, on h zero, sorry,
will become very, very large,
when that matrix is too large.
And that's something we call
the exploding gradient problem.
Where now this gradient will
explode exponentially in depth
with the number of time steps
that we backpropagate through.
And if the largest singular
value is less than one,
then we get the opposite problem,
where now our gradients will shrink
and shrink and shrink exponentially,
as we backpropagate and pick
up more and more factors
of this weight matrix.
That's called the
vanishing gradient problem.
THere's a bit of a hack
that people sometimes do
to fix the exploding gradient problem
called gradient clipping,
which is just this simple heuristic
saying that after we compute our gradient,
if that gradient,
if it's L2 norm is above some threshold,
then just clamp it down and divide,
just clamp it down so it
has this maximum threshold.
This is kind of a nasty hack,
but it actually gets used
in practice quite a lot
when training recurrent neural networks.
And it's a relatively useful tool
for attacking this
exploding gradient problem.
But now for the vanishing
gradient problem,
what we typically do
is we might need to move to
a more complicated RNN architecture.
So that motivates this idea of an LSTM.
An LSTM, which stands for
Long Short Term Memory,
is this slightly fancier
recurrence relation
for these recurrent neural networks.
It's really designed to help alleviate
this problem of vanishing
and exploding gradients.
So that rather than kind
of hacking on top of it,
we just kind of design the architecture
to have better gradient flow properties.
Kind of an analogy to those
fancier CNN architectures
that we saw at the top of the lecture.
Another thing to point out
is that the LSTM cell
actually comes from 1997.
So this idea of an LSTM
has been around for quite a while,
and these folks were
working on these ideas
way back in the 90s,
were definitely ahead of the curve.
Because these models are
kind of used everywhere now
20 years later.
And LSTMs kind of have
this funny functional form.
So remember when we had this vanilla
recurrent neural network,
it had this hidden state.
And we used this recurrence relation
to update the hidden
state at every time step.
Well, now in an LSTM,
we actually have two,
we maintain two hidden
states at every time step.
One is this h t,
which is called the hidden state,
which is kind of an
analogy to the hidden state
that we had in the vanilla RNN.
But an LSTM also maintains
the second vector, c t,
called the cell state.
And the cell state is this
vector which is kind of internal,
kept inside the LSTM,
and it does not really get
exposed to the outside world.
And we'll see,
and you can kind of see that
through this update equation,
where you can see that when we,
first when we compute these,
we take our two inputs,
we use them to compute these four gates
called i, f, o, n, g.
We use those gates to
update our cell states, c t,
and then we expose part of our cell state
as the hidden state at the next time step.
This is kind of a funny functional form,
and I want to walk through
for a couple slides
exactly why do we use this architecture
and why does it make sense,
especially in the context
of vanishing or exploding gradients.
This first thing that we do in an LSTM
is that we're given this
previous hidden state, h t,
and we're given our
current input vector, x t,
and just like the vanilla RNN.
In the vanilla RNN, remember,
we took those two input vectors.
We concatenated them.
Then we did a matrix multiply
to directly compute the next
hidden state in the RNN.
Now, the LSTM does something
a little bit different.
We're going to take our
previous hidden state
and our current input,
stack them,
and now multiply by a
very big weight matrix, w,
to compute four different gates,
Which all have the same
size as the hidden state.
Sometimes, you'll see this
written in different ways.
Some authors will write
a different weight matrix
for each gate.
Some authors will combine them all
into one big weight matrix.
But it's all really the same thing.
The ideas is that we
take our hidden state,
our current input,
and then we use those to
compute these four gates.
These four gates are the,
you often see this written
as i, f, o, g, ifog,
which makes it pretty easy
to remember what they are.
I is the input gate.
It says how much do we want
to input into our cell.
F is the forget gate.
How much do we want to
forget the cell memory
at the previous, from
the previous time step.
O is the output gate,
which is how much do we
want to reveal ourself
to the outside world.
And G really doesn't have a nice name,
so I usually call it the gate gate.
G, it tells us how much
do we want to write
into our input cell.
And then you notice that
each of these four gates
are using a different non linearity.
The input, forget and output gate
are all using sigmoids,
which means that their values
will be between zero and one.
Whereas the gate gate uses a tanh,
which means it's output will
be between minus one and one.
So, these are kind of weird,
but it makes a little bit more sense
if you imagine them all as binary values.
Right, like what happens at the extremes
of these two values?
It's kind of what happens,
if you look after we compute these gates
if you look at this next equation,
you can see that our cell state
is being multiplied element
wise by the forget gate.
Sorry, our cell state from
the previous time step
is being multiplied element
wise by this forget gate.
And now if this forget gate,
you can think of it as being
a vector of zeros and ones,
that's telling us for each
element in the cell state,
do we want to forget
that element of the cell
in the case if the forget gate was zero?
Or do we want to remember
that element of the cell
in the case if the forget gate was one.
Now, once we've used the forget gate
to gate off the part of the cell state,
then we have the second term,
which is the element
wise product of i and g.
So now, i is this vector
of zeros and ones,
cause it's coming through a sigmoid,
telling us for each
element of the cell state,
do we want to write to that
element of the cell state
in the case that i is one,
or do we not want to write to
that element of the cell state
at this time step
in the case that i is zero.
And now the gate gate,
because it's coming through a tanh,
will be either one or minus one.
So that is the value that we want,
the candidate value that
we might consider writing
to each element of the cell
state at this time step.
Then if you look at the
cell state equation,
you can see that at every time step,
the cell state has these kind of
these different,
independent scaler values,
and they're all being incremented
or decremented by one.
So there's kind of like,
inside the cell state,
we can either remember
or forget our previous state,
and then we can either
increment or decrement
each element of that cell state
by up to one at each time step.
So you can kind of think of
these elements of the cell state
as being little scaler integer counters
that can be incremented and decremented
at each time step.
And now, after we've
computed our cell state,
then we use our now updated cell state
to compute a hidden state,
which we will reveal to the outside world.
So because this cell state
has this interpretation
of being counters,
and sort of counting up by one
or minus one at each time step,
we want to squash that counter value
into a nice zero to
one range using a tanh.
And now, we multiply element wise,
by this output gate.
And the output gate is again
coming through a sigmoid,
so you can think of it as
being mostly zeros and ones,
and the output gate tells us
for each element of our cell state,
do we want to reveal or not reveal
that element of our cell state
when we're computing the
external hidden state
for this time step.
And then, I think there's
kind of a tradition
in people trying to explain LSTMs,
that everyone needs to come up
with their own potentially
confusing LSTM diagram.
So here's my attempt.
Here, we can see what's going
on inside this LSTM cell,
is that we take our,
we're taking as input on the
left our previous cell state
and the previous hidden state,
as well as our current input, x t.
Now we're going to take our current,
our previous hidden state,
as well as our current input,
stack them,
and then multiply with
this weight matrix, w,
to produce our four gates.
And here, I've left
out the non linearities
because we saw those on a previous slide.
And now the forget gate
multiplies element wise
with the cell state.
The input and gate gate
are multiplied element wise
and added to the cell state.
And that gives us our next cell.
The next cell gets
squashed through a tanh,
and multiplied element
wise with this output gate
to produce our next hidden state.
Question?
No, So they're coming through this,
they're coming from different
parts of this weight matrix.
So if our hidden,
if our x and our h all
have this dimension h,
then after we stack them,
they'll be a vector size two h,
and now our weight matrix
will be this matrix
of size four h times two h.
So you can think of that as sort of having
four chunks of this weight matrix.
And each of these four
chunks of the weight matrix
is going to compute a
different one of these gates.
You'll often see this written for clarity,
kind of combining all
four of those different
weight matrices into a
single large matrix, w,
just for notational convenience.
But they're all computed
using different parts
of the weight matrix.
But you're correct in
that they're all computed
using the same functional form
of just stacking the two things
and taking the matrix multiplication.
Now that we have this picture,
we can think about what
happens to an LSTM cell
during the backwards pass?
We saw, in the context of vanilla
recurrent neural network,
that some bad things happened
during the backwards pass,
where we were continually multiplying
by that weight matrix, w.
But now, the situation looks much,
quite a bit different in the LSTM.
If you imagine this path backwards
of computing the gradients
of the cell state,
we get quite a nice picture.
Now, when we have our upstream gradient
from the cell coming in,
then once we backpropagate backwards
through this addition operation,
remember that this addition just copies
that upstream gradient
into the two branches,
so our upstream gradient
gets copied directly
and passed directly to backpropagating
through this element wise multiply.
So then our upstream
gradient ends up getting
multiplied element wise
by the forget gate.
As we backpropagate backwards
through this cell state,
the only thing that happens
to our upstream cell state gradient
is that it ends up getting
multiplied element wise
by the forget gate.
This is really a lot nicer
than the vanilla RNN for two reasons.
One is that this forget gate
is now an element wise multiplication
rather than a full matrix multiplication.
So element wise multiplication
is going to be a little bit nicer
than full matrix multiplication.
Second is that element wise multiplication
will potentially be
multiplying by a different
forget gate at every time step.
So remember, in the vanilla RNN,
we were continually multiplying
by that same weight matrix
over and over again,
which led very explicitly
to these exploding or vanishing gradients.
But now in the LSTM case,
this forget gate can
vary from each time step.
Now, it's much easier for the model
to avoid these problems
of exploding and vanishing gradients.
Finally, because this forget gate
is coming out from a sigmoid,
this element wise multiply
is guaranteed to be between zero and one,
which again, leads to sort
of nicer numerical properties
if you imagine multiplying by these things
over and over again.
Another thing to notice
is that in the context
of the vanilla recurrent neural network,
we saw that during the backward pass,
our gradients were flowing
through also a tanh
at every time step.
But now, in an LSTM,
our outputs are,
in an LSTM, our hidden state is used
to compute those outputs, y t,
so now, each hidden state,
if you imagine backpropagating
from the final hidden state
back to the first cell state,
then through that backward path,
we only backpropagate through
a single tanh non linearity
rather than through a separate
tanh at every time step.
So kind of when you put
all these things together,
you can see this backwards pass
backpropagating through the cell state
is kind of a gradient super highway
that lets gradients pass
relatively unimpeded
from the loss at the very end of the model
all the way back to the initial cell state
at the beginning of the model.
Was there a question?
Yeah, what about the
gradient in respect to w?
'Cause that's ultimately the
thing that we care about.
So, the gradient with respect to w
will come through,
at every time step,
will take our current cell state
as well as our current hidden state
and that will give us an element,
that will give us our local gradient on w
for that time step.
So because our cell state,
and just in the vanilla RNN case,
we'll end up adding those
first time step w gradients
to compute our final gradient on w.
But now, if you imagine the situation
where we have a very long sequence,
and we're only getting
gradients to the very end
of the sequence.
Now, as you backpropagate through,
we'll get a local gradient on w
for each time step,
and that local gradient on w
will be coming through
these gradients on c and h.
So because we're maintaining
the gradients on c
much more nicely in the LSTM case,
those local gradients
on w at each time step
will also be carried forward and backward
through time much more cleanly.
Another question?
Yeah, so the question is
due to the non linearities,
could this still be susceptible
to vanishing gradients?
And that could be the case.
Actually, so one problem you might imagine
is that maybe if these forget gates
are always less than zero,
or always less than one,
you might get vanishing gradients
as you continually go
through these forget gates.
Well, one sort of trick
that people do in practice
is that they will, sometimes,
initialize the biases of the forget gate
to be somewhat positive.
So that at the beginning of training,
those forget gates are
always very close to one.
So that at least at the
beginning of training,
then we have not so,
relatively clean gradient flow
through these forget gates,
since they're all
initialized to be near one.
And then throughout
the course of training,
then the model can learn those biases
and kind of learn to
forget where it needs to.
You're right that there
still could be some potential
for vanishing gradients here.
But it's much less extreme
than the vanilla RNN case,
both because those fs can
vary at each time step,
and also because we're doing
this element wise multiplication
rather than a full matrix multiplication.
So you can see that this LSTM
actually looks quite similar to ResNet.
In this residual network,
we had this path of identity connections
going backward through the network
and that gave, sort of
a gradient super highway
for gradients to flow backward in ResNet.
And now it's kind of the
same intuition in LSTM
where these additive and element wise
multiplicative interactions
of the cell state
can give a similar gradient super highway
for gradients to flow backwards
through the cell state
in an LSTM.
And by the way, there's this
other kind of nice paper
called highway networks,
which is kind of in between this idea
of this LSTM cell
and these residual networks.
So these highway networks
actually came before residual networks,
and they had this idea where
at every layer of the highway network,
we're going to compute
sort of a candidate activation,
as well as a gating function
that tells us that interprelates
between our previous input at that layer,
and that candidate activation
that came through our
convolutions or what not.
So there's actually a lot of
architectural similarities
between these things,
and people take a lot of inspiration
from training very deep CNNs
and very deep RNNs
and there's a lot of crossover here.
Very briefly, you'll see a
lot of other types of variance
of recurrent neural network
architectures out there
in the wild.
Probably the most common,
apart from the LSTM,
is this GRU, called the
gated recurrent unit.
And you can see those
update equations here,
and it kind of has this
similar flavor of the LSTM,
where it uses these
multiplicative element wise gates
together with these additive interactions
to avoid this vanishing gradient problem.
There's also this cool paper
called LSTM: a search based oddysey,
very inventive title,
where they tried to play
around with the LSTM equations
and swap out the non
linearities at one point,
like do we really need that tanh
for exposing the output gate,
and they tried to answer a lot
of these different questions
about each of those non linearities,
each of those pieces of
the LSTM update equations.
What happens if we change the model
and tweak those LSTM
equations a little bit.
And kind of the conclusion is
that they all work about the same
Some of them work a little
bit better than others
for one problem or another.
But generally, none of the things,
none of the tweaks of LSTM that they tried
were significantly better
that the original LSTM
for all problems.
So that gives you a little bit more faith
that the LSTM update
equations seem kind of magical
but they're useful anyway.
You should probably consider
them for your problem.
There's also this cool paper
from Google a couple years ago
where they tried to use,
where they did kind of
an evolutionary search
and did a search over many,
over a very large number of
random RNN architectures,
they kind of randomly premute
these update equations
and try putting the additions
and the multiplications
and the gates and the non linearities
in different kinds of combinations.
They blasted this out over
their huge Google cluster
and just tried a whole bunch
of these different weigh
updates in various flavors.
And again, it was the same story
that they didn't really find anything
that was significantly better
than these existing GRU or LSTM styles.
Although there were some
variations that worked
maybe slightly better or
worse for certain problems.
But kind of the take away is that
probably and using an LSTM or GRU
is not so much magic in those equations,
but this idea of managing
gradient flow properly
through these additive connections
and these multiplicative gates
is super useful.
So yeah, the summary is
that RNNs are super cool.
They can allow you to attack
tons of new types of problems.
They sometimes are
susceptible to vanishing
or exploding gradients.
But we can address that
with weight clipping
and with fancier architectures.
And there's a lot of cool overlap
between CNN architectures
and RNN architectures.
So next time, you'll
be taking the midterm.
But after that, we'll
have a, sorry, a question?
Midterm is after this lecture
so anything up to this point is fair game.
And so you guys, good luck
on the midterm on Tuesday.
