Hi, everyone. I'm Abby.
If you weren't here last week,
I'm the head TA of this course.
And this is the second of three lectures that I'm
going to be giving on RNNs and related topics.
Okay. So, welcome to week four.
Today, we're going to be learning about vanishing gradients,
and some more complex types of RNNs.
So, before we get started,
I've got a few announcements.
Uh, the first announcement is that assignment four is released today, uh,
it's due Thursday of next week, not Tuesday,
so that means you have two more days to do it than you did for all the other homeworks.
And the reason for that is assignment four is
probably more work than the other homeworks so far,
so don't be surprised by that.
Uh, assignment four is all about Neural Machine Translation.
Uh, we're gonna learn about NMT on Thursday's lecture this week.
And, uh, this is really exciting,
because actually CS224n has never had an NMT assignment before,
so this is all new this year,
and you're gonna be the first year of students who are going to be doing an NMT assignment.
Uh, something else that's different about
assignment four is that you're going to be using Azure, which is, uh,
a cloud computing service,
in order to train your NMT systems on a virtual machine with a GPU.
And, uh, this is necessary in order to be able to do it in a reasonable amount of time.
So, I have a warning which is,
if you're a person who perhaps doesn't have, ah,
a lot of experience working on remote machines,
so for example if you're not very familiar with SSH,
or tmux, or remote text editing,
then I advise you to budget some extra time for assignment four,
because that's probably gonna take you a little while to set up and get used to.
So, again, I'm going to emphasize,
do get started early on assignment four, because, uh,
the NMT system takes about four hours to train on your virtual machine,
so you really can't start it the night before and expect to get it in on time.
Uh, and assignment four is really quite a lot more complicated than assignment three.
So, uh, don't get into a false sense of security if you found assignment three easy.
Um, so Thursday's slides on NMT are ready on the website today,
so you can even start looking at them today
if you wanna get started on assignment four early.
Uh, so, I have a few more announcements, uh,
on the subject of projects, uh,
next week's lectures are going to be all about projects.
So, you're going to hear about, uh, question answering,
and the default final projects,
and then you're also gonna get some tips about how you might,
uh, choose and define your own custom projects.
So, it's fine if you're not thinking about a project this week, that's okay.
You can delay until next week to start thinking about it for the first time.
But if you are a person who is already thinking about your projects,
for example, if you're trying to choose your custom projects, uh,
then you should check out the website's project page,
because it has quite a lot of information about, uh,
how to choose your projects, and also some inspiration.
And that includes- we've collected some, uh,
project ideas from various members of the Stanford AI Lab.
So, these are faculty and PhD students and postdocs,
who have ideas for, uh,
NLP deep learning projects that they would like
CS224n students such as yourself to work on.
So, especially, if you're looking to maybe get into research later,
this is a really great opportunity, uh,
to work with someone in the Stanford AI Lab,
and maybe get some mentorship as well.
Okay. So here's an overview.
Uh, last week, we learned about Recurrent Neural Networks,
um, we learned about why they're really great for Language Modeling.
And today, we're gonna learn about some problems with RNNs,
and we're gonna learn about how to fix them.
And this is gonna motivate us to learn about some more complex RNN variants.
And then, uh, next lecture on Thursday,
we're going to, uh, have some more application-based, uh, content,
so we are going to be learning about Neural Machine Translation,
which is a really important task in, uh,
NLP and deep learning, and in particular,
we're gonna learn about this architecture called sequence-to-sequence with attention.
But in more detail,
today's lecture, uh, first,
we are going to learn about the vanishing gradient problem.
And this is gonna motivate us to learn about two new types of
RNN called Long Short-Term Memory,
and Gated Recurrent Unit.
We're also going to learn about some other kind of
miscellaneous fixes for the vanishing gradient problem,
or the exploding gradient problem.
Uh, so in particular,
we're going to learn about gradient clipping,
which is, uh, fairly simple, but quite important.
Uh, we're also going to learn about skip connections,
which is a fairly recent neural architecture technique,
which tries to, uh,
fix the vanishing gradient problem.
And then, at the end of the lecture,
we're gonna learn about some more fancy RNN variants such as, uh,
bidirectional RNNs, those are the ones which go not just left to right,
but also right to left,
and we're going to learn about multi-layer RNNs.
And that's when you stack multiple RNNs on top of each other.
So, there's a lot of important definitions today.
Um, so, you're gonna find that the information in
this lecture is pretty important for
assignment four and probably for your project as well.
Okay. So, let's get started thinking about the vanishing gradients.
Uh, so here we have an RNN,
with, let's say, four steps,
and suppose that we have some kind of loss that's, uh,
J4, and that's computed based on the fourth hidden state.
So, let's suppose we're interested in asking what is the derivative of this loss J4,
with respect to the hidden states,
uh, h1, the first hidden state?
So, I'm representing that with this, uh,
blue arrow notation to kind of represent how we have
to make the gradients flow backwards in order to compute this.
So, if we're interested in what this gradient is,
we can apply the chain rule and say, "Well,
it's the product of the, uh,
gradient of the loss with respect to h2,
and then gradient of h2, with respect to h1."
And then, similarly, we can decompose that
again using the chain rule, and we can do it again.
So, what we've done here is we've decomposed the gradient that we were interested in,
into the products of these various intermediate gradients.
And in particular, we're seeing all these dh_t by dh_{t-1} terms,
uh, the gradients between adjacent hidden states.
So, the thing I want to ask you is,
what happens if these gradients are small?
Given that there's a lot of them,
uh, what happens if they're small in magnitude?
So, the overall problem of the vanishing gradient problem,
is that when these gradients are small,
then our overall gradient is gonna get smaller and smaller,
as it back propagates further.
Because the accumulated gradient is the product of all of these intermediate gradients.
And when you multiply something by something small,
then the whole thing gets smaller.
So, that's what I'm representing here with these, uh,
smaller and smaller blue arrows going backwards.
So, that's the general idea of the vanishing gradient problem.
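To make that "product of small things" point concrete, here's a tiny numeric sketch (my own illustration, with made-up numbers, not from the lecture slides): suppose each adjacent gradient behaves like a scalar of magnitude 0.5, and watch how the backpropagated gradient shrinks with distance.

```python
# Illustrative sketch with made-up numbers: treat each adjacent
# gradient dh_t/dh_{t-1} as a scalar of magnitude 0.5.
PER_STEP_GRAD = 0.5

def backprop_grad(num_steps):
    """Chain rule: the overall gradient is the product of the
    adjacent gradients over num_steps steps."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= PER_STEP_GRAD
    return grad

for n in [1, 5, 10, 20]:
    print(n, backprop_grad(n))
# Over 20 steps the gradient is 0.5**20, roughly 1e-6:
# the signal from far away has effectively vanished.
```

The exact per-step value doesn't matter; anything with magnitude below one produces the same exponential decay.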
Here's a slightly more formal definition.
So, if you remember from last time,
uh, if we have a vanilla RNN,
then the hidden state h_t is
computed as h_t = sigma(W_h h_{t-1} + W_x x_t + b),
a function of the previous hidden state h_{t-1} and the current input x_t.
Uh, so you might remember in the previous lecture we
said that x_t were one-hot vectors representing words,
and then e_t was the embedding.
Uh, this lecture we're going to be,
uh, getting rid of that detail,
and we're just gonna be thinking very abstractly about
an RNN that has some kind of input xt,
and xt is just any kind of vector.
Probably a dense vector,
but you know, it could be words or not.
It could be one-hot or dense.
Uh, but that's just the input.
So, that's the, uh,
the definition that we learned last time for Vanilla RNNs.
So, this means that the derivative of h_t,
the hidden state on step t, with respect to the previous hidden state,
uh, is diag(sigma'(W_h h_{t-1} + W_x x_t + b)) W_h.
Uh, so this is just an application of the chain rule, and, uh,
if you look long enough or refer back to
the backprop lecture, you'll see, uh, that that makes sense.
So, in particular, we're, um,
multiplying by Wh at the end, uh,
because we have the multiplication of Wh and ht minus 1 on the inside.
Okay. So, if you remember, on the previous slide,
we were thinking about what's the gradient of the loss on some step,
let's say step i,
with respect to a hidden state hj,
on some previous step j.
And maybe j is quite a few steps before i.
So, we can now write this,
uh, in the following way.
So just by applying the chain rule,
now on the first line we're saying that this derivative that we're interested in
can be decomposed into the derivative of the loss with respect to h_i,
which is kind of the last step,
and then all of those intermediate gradients of the adjacent hidden states as well.
So, that first line is just exactly the same thing as we were looking at in the,
uh, the picture, uh, the diagram on the previous slide.
Okay. And then, given that we figured out what, uh,
dh_t by dh_{t-1} is,
ah, further up the slide,
then we can just substitute that in.
So, what we're finding is that this overall gradient that we're
interested in, in particular,
has this term, uh,
Wh, the weight matrix, and it's, uh,
multiplied by itself i minus j times,
because there are i minus j many steps between, uh,
step j and step i,
which is the distance that we're traveling with this gradient.
So, the big problem here is,
if this weight matrix Wh is small,
then this term is gonna get vanishingly small,
exponentially small, as i and j get further apart.
So, to give this a little more detail, uh,
we can think about the, uh,
L2 matrix norms of all of these matrices, right?
And, uh, it's a known fact about,
uh, L2 norms that you have this, um,
inequality: the, uh,
norm of the product of
some matrices is less than or equal to the product of the norms of those matrices.
So, in particular, we're seeing that the norm of this gradient that we're interested in,
is less than or equal to, uh,
the product i minus j many times of the norm of the weight matrix Wh.
So, this is what we mean when we say we're concerned about Wh being small,
because if it's small, then the thing on the left has to be exponentially small.
So in particular in this,
uh, paper that, uh,
you can take a look at the bottom if you're interested, um, uh,
Pascanu et al showed that if
the largest eigenvalue of the weight matrix Wh is less than one,
then this gradient on the left is going to shrink exponentially.
And you can probably see intuitively why this is true.
So if, you know, as a simplifying assumption,
we suppose that Wh was not a matrix,
but simply a scalar that was just a single number,
then you can see why if that number was greater than one,
then the whole thing is gonna explode.
And if that number is less than one,
then it is going to shrink
exponentially as you multiply by the same number again and again.
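Here's a small numpy sketch of the same point for matrices (the example matrices are my own, not from the paper or slides): repeatedly multiplying by a W_h whose eigenvalue magnitudes are below one shrinks the norm exponentially, while one with an eigenvalue above one explodes.

```python
import numpy as np

# Hypothetical 2x2 recurrence matrices, chosen so the largest
# eigenvalue magnitude is below 1 (shrink) or above 1 (explode).
W_shrink = np.array([[0.5, 0.1],
                     [0.0, 0.4]])   # eigenvalues 0.5, 0.4
W_explode = np.array([[1.5, 0.1],
                      [0.0, 1.2]])  # eigenvalues 1.5, 1.2

# Backpropagating over i - j = 20 steps multiplies by W_h 20 times.
for name, W in [("shrink", W_shrink), ("explode", W_explode)]:
    P = np.linalg.matrix_power(W, 20)
    print(name, np.linalg.norm(P, 2))
# The "shrink" norm comes out tiny (vanishing gradient);
# the "explode" norm comes out huge (exploding gradient).
```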
Uh, so you can check out the paper for more details,
but here, uh, the bound is one,
partially because we have the sigmoid nonlinearity.
And that's, uh, based on the known bound on the,
uh, norm of the sigmoid's derivative.
So, uh, this shows you why if the, uh,
Wh matrix is small,
or if its largest eigenvalue is less than one,
then we're going to have vanishing gradients.
And similarly, if you check out the paper,
you can see that there's a similar proof, uh,
relating if the largest eigenvalue is greater than one,
to having exploding gradients.
So that's when the gradients get bigger and bigger,
as you backprop further.
Okay. So hopefully I've convinced you that
vanishing gradients is a phenomenon that happens in RNNs.
But I haven't yet said why this is a problem.
So, why should we view this as a bad thing,
if the gradients are getting larger and larger,
or smaller and smaller as you backprop?
So here's, uh, here's a picture that might illustrate why it's a bad thing.
So, uh, as before,
suppose that we're thinking about,
what's the derivative of the loss on
the fourth step with respect to the first hidden state?
And we have this situation where
the gradient is getting smaller and smaller as it goes backwards.
But then, think about what is the gradient of let's say
the loss in the second step also with respect to the first hidden state.
So I'm representing that with the orange arrows.
And what my point is here,
is that the magnitude of the gradient signal from close by,
is a lot bigger than the magnitude of the gradient signal from far away.
And this means that when you update your model weights,
the signal that you're getting from close by is gonna
be so much bigger than the signal from far away,
that essentially you're only going to learn,
you're only going to optimize with respect to
these nearby effects and not the long-term effects.
So you're gonna, you're gonna lose the long-term effects amid the,
the nearby effects. Any questions about this, yeah?
So, uh, when you say there that you do actual updates,
you know, there are actually sums over multiple chains, not just one chain.
So the nearer term should cover it.
Sorry, what's the last part?
The nearer term should have a larger effect considering you're
updating the sum of the weights over different chains.
Okay. So I think, ah, the observation was that,
given that, for example,
in Language Modeling you might be summing over multiple losses.
There is a loss in every step and you sum all of them and that's your overall loss.
Then you do want to update more with respect to the nearby losses than the far losses.
So I think, uh, yeah,
so if the design of your objective function
is that it's the sum of the loss in every step,
then you do want to, uh,
weight all of them equally.
I think, uh, my point was more about,
what is the influence of, uh,
the action of the weight matrix at this early stage.
What is its influence on a loss that's nearby?
And what is its influence on a loss that's far away?
Um, and due to, uh,
the dynamics of how the vanishing gradient, uh,
problem works, then, uh,
the influence on the loss that's far away
is gonna be much less than the influence nearby.
And I'm gonna give some more linguistics examples later of why you might want to learn,
uh, the connections that are farther away.
So essentially the problem is,
in situations where you do want to learn the connection
between something that happens early and something that happens later,
then you're going to be unable to learn that connection.
Uh, so we'll see some motivating examples in a minute.
Any other questions on this? Yeah?
Um, I'm getting confused, like, why are you talking about, like, dJ by dh?
Uh, h isn't a parameter, like, are we going-
Yeah.
from-
Okay. That's a great question.
So you're asking why are we interested in some kind of dJ by
dh, given that we're not updating h. h is an activation, not a weight.
Um, so the reason why we're thinking about that,
is because when you think about what is dJ by dW,
which is a thing that we're going to update,
that's always gonna be in terms of dJ by dh at some point, right?
So if we're thinking about W, you know,
and how it acts on, uh,
the transmission from h_1 to h_2,
then dJ4 by dW in that position is going to have to go through dJ4 by dh_2.
So if we're getting vanishing gradients,
uh, as we back propagate further,
then it's kind of like a bottleneck.
Then you're certainly going to have vanishing gradients as they affect, uh,
the recurrence matrix there,
and indeed the matrix that's applied to the inputs.
Okay. I'm gonna move on now.
Uh, so another way to explain why vanishing gradients is a problem,
is that you can think of the gradient
as a measure of the effect of the past on the future.
So we've already talked about this a little bit.
Uh, gradient is like saying, if I change, uh,
this weight or this activation a little bit,
then how much and how does it affect this thing in the future.
So in particular, if our gradient is becoming vanishingly small over longer distances,
let's say from step t to step t plus n,
then we can't tell which of two situations we're in.
So the first situation is maybe there's no dependency between
step T and step T plus N in the data.
So perhaps we're learning on a task where,
in the task there truly is no, uh,
connection or relationship to be
learned between what happens on step t and what happens on
step t plus n. So there truly is nothing to be
learned, and it's actually correct that there should be,
you know, small gradients with respect to those two things.
But the second possibility is that, yes,
there is a true connection between those two things in the data and in the task.
And really ideally we should be learning that connection.
Um, but we have the wrong parameters in our model to capture this thing,
and therefore that is why the,
the gradients are small.
Because the model doesn't see them as connected.
So we are not learning the true dependency between these two things.
And the problem with the vanishing gradient problem is that
we're unable to tell
which of these two situations we're in.
Okay. So this is all pretty theoretical.
I think this example should make it a little more
clear why the vanishing gradient problem is bad.
So, uh, last week we learned about RNN-Language Models.
And if you remember Language Modeling is a task where you have some kind of
text and then you're trying to predict what word should come next.
So, uh, here's a piece of text.
It says, um, ''When she tried to print her tickets,
she found that the printer was out of toner.
She went to the stationery store to buy more toner.
It was very overpriced.
After installing the toner into the printer,
she finally printed her,'' and
can someone shout out what word you think should come next?
Tickets.
Tickets. Yes, exactly.
So that was easy for you to do because, uh,
it makes sense logically that if that was the thing she was trying to do,
that's the thing she's gonna do once she's gone the whole detour for the, for the toner.
Um, so the question is,
can RNN-Language Models easily answer this question?
Would they do well at this particular Language Modeling example?
So for an RNN-Language Model to do well at this kind of example,
then they need to learn from this kind of example in the Training Data.
So when it sees this example in the Training Data,
then the RNN-Language Model will need to model the dependency.
Learn the connection between the appearance of
the word tickets early on, on the 7th step,
and the target word tickets at the end.
But if we have the vanishing gradient problem,
then this gradient, uh, of the loss at
the last step with respect to the early step,
is gonna be very small because it's,
it's a fairly long distance, right?
And this means that the model is going to be unable to
learn this dependency, easily or at all.
So if the model can't learn this kind of dependency during training,
then the model is going to be unable to predict
similar kinds of long distance dependencies at test-time.
Okay, here's another example.
Um, here's a piece of text.
Uh, the text says, and this isn't a full sentence,
This is just a partial sentence.
It says, the writer of the books, blank.
And I'm gonna give you two options.
It's either, the writer of the books is or the writer of the books are.
So, uh, again shout out which one do you think it is, is or are?
Is.
Is, that's right. So, uh, the correct answer,
a correct possible continuation of the sentence would be,
uh, the writer of the books is planning a sequel.
I can't think of a continuation that goes the writer of the books are,
that would be, uh, grammatically correct.
So the reason why I'm bringing up this example,
is because this shows a kind of tension between, uh,
two things called, uh,
syntactic recency and sequential recency.
So syntactic recency is the idea that, in
order to correctly predict that the next word should be is rather than are,
you should look at the word writer, which is the syntactically close word here.
So we say the writer of the books is because it's the writer who is.
So you can see the words writer and is
as, uh, syntactically close.
Because if you looked at the dependency parse, for example,
then there would be a short path between them in that tree.
So by contrast, sequential recency is the,
uh, simpler concept of how close words are just in the sentence as a sequence of words.
So in this example,
books and are, are very sequentially recent because they're right next to each other.
So the reason I'm bringing this up is because,
the second one would be incorrect but it's kind of a tempting option.
Because if you're mostly only paying attention to things that happened recently,
um, then you might get distracted and think,
"Oh, the books are, that sounds right."
So the problem here is that RNN-Language Models
are better at learning from sequential recency than syntactic recency.
And this is partially due,
due to the vanishing gradient problem.
Because especially perhaps, if your syntactically,
uh, related word is actually kind of far away,
then it might get really hard to use the information from the syntactically recent word,
especially if there's a lot of strong signal from the sequentially recent word.
So, uh, there are some papers that show that RNN-Language Models make this kind of error,
of saying are, rather than is.
Uh, they make this kind of error more often than you would like, uh,
especially if you have multiple of these distracting words such as books, uh,
in between, uh, the word you're trying to predict
and the true word that you should be, uh, referring to.
Okay, any questions on this? All right, moving on.
So, we briefly mentioned that exploding gradients, uh, is a problem.
So, I'm briefly going to justify why is exploding gradients a problem,
and why does it, uh, what does it look like?
So, the reason why exploding gradients are a problem,
is if you remember this is how SGD works.
Uh, we say that the new parameters of the model,
which we represent by Theta,
are equal to the old parameters,
and then you take some step in the direction of
negative gradients because you're trying to minimize the loss of J.
So, the problem is if your gradient gets really big, uh,
then your SGD update step is going to become really big too.
So, you're going to be taking a very big step,
and you're going to be drastically changing your model parameters, Theta.
And this means that you can end up with some bad updates.
We end up taking too large a step.
And we're changing the parameters too much.
And this means that, uh,
we kind of take a big step,
and we end up in some, uh,
area where the parameters are actually very bad.
Uh, for example,
they might have a much larger loss than they had before.
So, in the worst case,
this can often manifest as seeing, uh,
infinities or NaNs, that is, not-a-number, in your network when you're training it in practice.
So, this can happen because if you take such a big step
that maybe you update your parameters so much that now they're infinity,
or minus infinity, something like that,
then you're gonna have all of these infinities within your activations as well,
and then all of your losses are going to be infinity,
and the whole thing just isn't going to work, at all.
So, it's very annoying when this happens,
and unfortunately it happens, uh, fairly often.
And if it does then you have to essentially
restart training from some earlier checkpoint before you
got the NaNs and the infinities because there's
no kind of salvaging it from its new state.
So, what's the solution to this exploding gradient problem?
Uh, the solution is actually pretty
simple and it's this technique called gradient clipping.
So, the main idea of gradient clipping,
is that if the norm of your gradient is
greater than some threshold, and the threshold is a hyperparameter that you choose,
uh, then you want to scale down that gradient,
um, before you apply the SGD update.
So, the intuition is you're still gonna take a step in the same direction.
But you're gonna make sure that it's a smaller step.
So, here, um,
I've got a screenshot of some pseudocode from, uh,
the related paper that, uh,
proposed gradient clipping, or at least some version of gradient clipping.
And, um, it's pretty simple as you can see.
Uh, g hat is the vector which is the, uh,
derivative of the error with respect to the parameters,
and it's saying that if the norm of
this gradient is greater than the threshold, then you just scale it down.
But the important thing to note is that it's still pointing in the same direction,
it's just a smaller step.
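In code, that idea amounts to something like this (a minimal numpy sketch of norm-based clipping; the gradient vector and threshold value are made up for illustration):

```python
import numpy as np

def clip_gradient(g, threshold):
    """If the gradient norm exceeds the threshold, rescale the
    gradient to have norm equal to the threshold. The direction
    is unchanged; only the step size shrinks."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])              # norm 5
clipped = clip_gradient(g, threshold=1.0)
print(clipped)                        # [0.6 0.8]: same direction, norm 1
```

Deep learning frameworks typically ship a ready-made version of this operation, so in practice you'd call that rather than write your own.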
So, here's a picture to show how that might work out in practice.
And, uh, this is a diagram from the, uh,
deep learning textbook, which is also linked on the website.
So, what's going on here, is that, uh,
the picture here is the loss surface of a simple RNN.
So, they made a very simple RNN that instead of having, uh,
a sequence of vectors as the hidden states,
it just supposes that each hidden state is simply a single scalar.
So, this means that instead of having a weight matrix, w,
and the bias vector, b,
you have a scalar w and a scalar b.
So, that's why in the picture, you just have this like two-dimensional parameter space.
And then the, the z-axis is your, is your loss.
So here, high loss is
bad, and low loss is good, and that's what you're trying to get.
So, uh, here in this picture,
you've got this kind of cliff, right, where you have this very steep cliff face,
uh, where the loss changes very quickly.
And this cliff is really dangerous because it has steep, steep gradients.
And you might be in danger of taking a really big,
uh, update step because you're in the area with a really steep gradient.
So, on the left,
you've got a possible scenario of what might happen if you don't have gradient clipping.
So, on the left, uh,
you can see that you start kind of at the bottom of the cliff,
and you have a few small updates.
And then, in particular, it makes a bad update because you
see there's a small kind of dip before it goes off the cliff.
So, the true local minimum,
the optimum you're trying to get to, is at the bottom of that small kind of ditch.
And, um, it starts off kind of near the edge of that ditch,
and then there's a negative gradient going into it.
But unfortunately, the, the update kind of overshoots,
and it ends up going a long way off the cliff.
So now, it's in this bad situation where it's taken a bad update,
and now it's got a much bigger loss than it had [NOISE] before.
So now it's on the cliff.
Again, it measures the gradient,
and the gradient is very steep, right?
The gradient is very large.
So, when it takes a, uh,
update with respect to that gradient,
then because the gradient is so big,
it takes a really huge step.
And that's, um, the, the one to the right.
You can see the step going to the right.
So, that's also a very bad update because it's just throwing
it really far to some probably fairly random,
uh, configuration of w and b.
So, on the left, you can see what can go wrong if you're taking
these really big steps because you were in areas with a very steep gradient.
So, by contrast on the right,
you can see what might happen if you do have a gradient clipping.
And, um, it's much less drastic, right?
You've got a similar kind of pattern where it takes a few steps into the ditch,
and then ends up going off the cliff a little bit,
but not too much because the gradient was clipped.
And then, it's on the cliff and there's again a really steep gradient,
but it doesn't take such a big step because again the gradient was clipped,
so that it kind of comes back down.
So, you can see that plausibly by using this gradient clipping method,
you've got a, a kind of safer update rule,
where you're not gonna take any,
any big crazy steps and you're more likely to kind of find the,
the true minimum which is at the bottom of the ditch.
I think there was a question earlier.
Was there a question over here?
I just want to see the value.
Okay. Anyone else?
Yeah?
[inaudible]
So, the question is, in assignment three,
y- you saw the Adam optimization algorithm which, uh,
has this thing called momentum,
which essentially says that kind of like physical momentum in,
in the real world, that if you've been traveling in the same direction for a while,
then you can take bigger steps,
I think, and if you've recently kind of changed direction,
then you should take smaller steps.
And I think there's another element as well, where you divide by some factor.
So, it is a similar kind of idea.
I suppose it's a different criterion, right?
So, what they both have in common is it's a kind of criterion for when to
scale up or scale down the size of your update step.
Um, and I think they're based on different notions
of when should you take bigger steps and when should you take smaller steps.
When should you be cautious or less cautious?
So, I guess here the criterion is different.
It's kind of a simple criterion saying, like if it's really steep,
then be careful. Yeah. Another question?
Uh, so the [inaudible].
Okay. So the question is,
is this similar to regularization of some kind, right?
So, I suppose, yeah, there is- there are some things in common.
Say for, example, L2 regularization says that you want, for example,
your weight matrices to have a small L2 norm, right?
And the idea is that you're trying to prevent
your model from over-fitting the data by, um,
having some kind of constraint that says you have to keep your weights fairly simple,
that is keep them, you know, small.
So, I suppose the relationship is that here we're
saying that we don't want the norm of the gradients to be too big.
Ah, I don't know if this is related to overfitting.
Um, I guess I have to think more carefully about that,
but I guess it's a similar kind of constraint that you're placing.
Okay. I'm gonna move on for now.
Uh, so we've talked
about how you might fix the exploding gradient problem with gradient clipping,
but we haven't talked about how we might fix the vanishing gradient problem.
So, um, to recap,
I think one way to characterize the problem with the- the vanishing gradients in RNNs is
that it's too difficult for the RNN to learn to preserve information over many timesteps.
So, in our example with printing
the tickets and re- remembering that it's the tickets that she wants to print,
you could think of it as it's hard for the RNN language model to correctly
predict tickets because in a way, it's too hard for the RNN language model to,
uh, learn to retain the tickets information and use it later.
So, um, if you look at the equation
for vanilla RNNs and how we compute the hidden state, uh,
based on the previous hidden state and- and the inputs,
you can see that the hidden state is in a way constantly being rewritten.
It's always computed based on these, uh,
linear transformations and the,
you know, the non-linearity.
So, it's not all that easy to
preserve the information from one hidden state to the other,
in particular, because we are putting it through this non-linearity function.
So, this motivates us to ask what about an RNN with some kind of separate memory?
If we have some kind of separate place to store information that we want to use later,
then would this make it easier for our RNN
to learn to preserve information over many timesteps?
So, this is the motivating idea behind LSTMs or Long Short-Term Memory RNNs.
So, the idea here is that an LSTM is a type of RNN and it was proposed back in, uh, 1997.
And the idea is that this is, uh,
this was proposed as an explicit solution to the vanishing gradients problem.
[NOISE] So, one of the main differences here is that on each step t, instead of just having a hidden state h_t, we have both the hidden state h_t and the cell state, which we denote c_t.
And both of these are vectors of the same length n, and the idea there is that the cell is meant to store long-term information; you can think of it as a memory unit.
Another super important thing is that the LSTM can
erase and write [NOISE] and read information from the cell.
So, you kind of think of this a bit like memory in a computer,
in that you can do these operations, reading and writing and erasing,
um, and that's how you're gonna keep your information.
Another super important thing is that the way the LSTM decides,
whether it wants to erase, write, read,
information and decide how much and which information,
uh, that's all controlled by these [NOISE] gates.
So, the idea is [NOISE] that the gates are themselves also vectors of length n, and the idea there is that on each timestep, each element of these gate vectors is somewhere between zero and one.
So here, uh, one represents an open gate and zero represents a closed gate,
and you can have values anywhere in between.
So, the overall idea, which we're gonna firm up on the next slide,
but the overall idea is that if the gate is open,
that represents some kind of information being passed through,
and if the gate is closed,
it [NOISE] means that information does not pass through.
Okay. So, the last really important thing is that the gates are dynamic.
They're not just set at some constant value for the whole sequence.
[NOISE] Um, they're dynamic, which means that they're different on each timestep t, and their value, that is, the decision of whether they're open or closed and in which ways, [NOISE] um, is computed based on the current context.
Okay. So here's, um,
here's the- the equations for the LSTM which might make it clearer.
So, uh, suppose we have some sequence of inputs x_t, and we want to compute a sequence of hidden states h_t and cell states c_t.
So, this is what happens on timestep t. Uh, this first set of equations shows you the three gates that I talked about before. So, the first one is called the forget gate.
And the idea is that this one is controlling what is kept versus what is forgotten,
um, from the previous cell state, the previous memory.
And you can see that this forget gate is computed based on, uh,
the previous hidden state h_t minus one and the current input x_t.
Um, so that's what I meant when I said that it's
dynamic and it's computed based on the- the current context.
[NOISE] Um, you can also see that it's computed using,
uh, the sigmoid function,
which means that it is somewhere between zero and one.
Okay. The next gate is called the input gate,
and this one controls what parts of the new cell contents are written to the cell.
So, the idea there is that you have this- this memory cell and this is kind of, um,
controlling like ho- how and what you get to write to the memory cell.
Okay. And the last one is called the output gate.
So, this one is controlling, uh,
what parts of the cell are output to the hidden state,
[NOISE] so you could view this as kind of like the read function, right?
We're going to read some information from
our memory cell and that's gonna get put into our hidden states,
and this gate is gonna control that.
[NOISE] Okay.
[NOISE] Uh, yeah, that's just the sigmoid function as we noted before.
All right. So, the next set of equation shows how we use these gates.
[NOISE] So, the first line, uh,
you could regard this, uh,
c_tilde as the new [NOISE] cell content.
So, uh, this is the new content that you want to write to the cell,
[NOISE] and this is also computed based on, uh,
your previous hidden state and your current inputs,
and this goes through your tanh non-linearity.
So, uh, this is kind of the- the main contents that
you are computing based on the context and you want to write this into memory.
So, on the next line what's happening is that we're going to use
the forget gate to selectively forget some of the information from the previous,
[NOISE] uh, memory cell.
And you can see that we're doing these element-wise products,
that's what the little circle is.
So, the idea is that if you remember that f_t is
a vector full of values between zero and one,
when you do an element-wise product between f_t and
the previous cell state c_t minus one,
then what you're essentially doing is you're kind of masking out some of the information from the previous cell state.
So, when f is one,
then you're copying over the information,
but when f is zero, then you're getting rid of that information,
you are erasing it or forgetting it.
Okay. And then the other half of this equation,
um, i_t times c tilde t, uh,
that's the input gate controlling
which parts of the new cell contents are gonna get written,
written to the, to the cell.
Okay. And then the last thing we do is we, uh,
pass the cell through a tanh,
that's just adding another non-linearity,
and then you pass that through
the output gates and that gives you [NOISE] the hidden state.
So, in LSTMs, we often think of the hidden states as being,
uh, like the outputs of the RNN.
And the reason for this is that you kind of view
the cell states as being this kind of
internal memory that's not generally accessible to the outside,
but the hidden states are the parts that you're
gonna pa- pass on to the next part of the model.
So, that's why we view it as kind of like the output of the model.
[NOISE] Uh, and, yeah, just to remind you, the circles are element-wise products, and that's how we apply the gates.
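Those six equations can be sketched as a single timestep in NumPy; this is a minimal illustration, not the lecture's reference code, and the stacked parameter layout is just one assumed convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U, b stack the parameters for the forget,
    input, and output gates and the new cell content (4n rows total)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four pre-activations at once
    f = sigmoid(z[0:n])                 # forget gate: keep vs. forget old memory
    i = sigmoid(z[n:2*n])               # input gate: what new content is written
    o = sigmoid(z[2*n:3*n])             # output gate: what is read into h_t
    c_tilde = np.tanh(z[3*n:4*n])       # new cell content
    c = f * c_prev + i * c_tilde        # erase some old memory, write some new
    h = o * np.tanh(c)                  # read from the cell into the hidden state
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3
W = rng.normal(size=(4*n, d))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
```

Note the element-wise products are exactly the little circles in the equations.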
Uh, did anyone have any questions about this?
Okay. [NOISE] Um, so as a reminder,
all of these are vectors of some same length n.
[NOISE] Okay.
So, some people learn better from diagrams than equations, and here's a diagram representation of the same idea.
So, this is a really nice diagram from a blog post,
uh, by Chris Olah about LSTMs,
and that's a good place to start if you want to get an intuitive understanding of what LSTMs are.
So, in this diagram, uh,
the green boxes represent timesteps,
um, and let's zoom in on the middle one and see what's happening here.
So, within one timestep,
you can see that this diagram is showing exactly the same thing as
those six equations showed on the previous slide.
So, uh, the first thing we do is we use the, uh, the current input x_t, which is at the bottom, and the previous hidden state h_t minus one, on the left, and we use those to compute the forget gate.
[NOISE] And you can see f_t is on that arrow there.
And then you apply the forget gate to the previous, uh, cell,
and that's the same thing as forgetting some of the- the cell content from last time.
[NOISE] Okay.
And then after that, you can compute the input gate, uh,
and that's computed in much the same way as the forget gate.
And then you use the input gate to decide which parts of this,
uh, new cell content get written to the cell,
and that gives you the cell c_t.
So, here you can see that you compute the input gate and the new cell content, and then you use the gate to decide which parts of that content get written to the cell.
So, now we've got our new cell c_t, and then the last thing we need to do is compute our new output gate, that's o_t.
And then lastly, use the output gate to select which parts of
the cell contents you're gonna read and put in the new hidden state h_t.
So, that's, that's, uh, that's
the same thing as the equations we saw on the previous slide.
Okay. So, that's LSTMs.
Um, is there a question?
What's the importance [NOISE]  [inaudible]
The question is, why are we applying a tanh on the very last equation on this, on this slide? Why are we applying a tanh to the cell before applying the output gate?
Let's see. Um.
Yeah. So, your question is, the- the cell, the new cell content, already went through a tanh. Um, I'm not sure.
So, I suppose a- a- a general answer is that it must
be giving some kind of more expressivity in some way,
and that it's not just applying tanhs sequentially, because you do have the gates in between. Um, so I suppose there must be a reason, kind of similar to how when you apply a linear layer, you want a non-linearity before the next linear layer. I suppose maybe we're viewing these gates as a kind of linear layer?
I'm not sure. I'll look it up.
[NOISE] Okay.
So, uh, that's LSTMs.
And, um, re- if you recall,
we were- oh, question?
Yeah. Why is it that in the forget gate, you don't look at the previous cell state, but just the previous hidden state?
Like it seems like if you're this- instead of
deciding what to forget from the cell state, you should look at it.
So the question is, why is the forget gate
computed only for the previous hidden state and the current input,
why is it not computed based on c_t minus one itself, right?
Because surely you want to look at the thing to figure
out whether you want to forget it or not?
Um, that's a pretty good question.
Uh, so, I suppose one reason why you might think that this- this works fine is that
the LSTM might be learning a general algorithm
for where it stores different types of information in the cell, right?
So, maybe it's learning that in this particular position in the cell,
I learn information about this particular semantic thing and then in this situation,
I want to use that or not use that, forget it or keep it.
But, yeah, I haven't entirely convinced myself why you don't want to
look at the contents of the cell itself in order to decide.
I suppose another thing to notice is that h_t minus one was read from c_t minus one.
So, I suppose there is some information there but not necessarily all of the information.
Ah, yeah.
I'm not sure, that's another thing I need to look up I guess.
[NOISE] Any other questions?
Okay. Ah, so, that's LSTMs and,
um, LSTMs were introduced to try to solve the vanishing gradient problem.
So, the question is, ah,
how exactly is this architecture making the vanishing gradient problem any better?
So, you could, ah, see that the LSTM architecture
actually makes it easier for RNNs to preserve information over many time steps.
So, while it was kind of difficult for the vanilla RNN to preserve the information over all of the hidden states,
there's actually a fairly easy strategy that makes
it simple for the LSTM to preserve the information.
So, namely, if the forget gate is set to remember everything on every step, um,
that's a fairly simple strategy that will ensure that
the information in the cell is going to be preserved indefinitely over many time steps.
So, I don't know if that's actually a good strategy for whatever task you're trying to do,
but my point is that there is at least, um,
a fairly straightforward way for the LSTM to keep the information over many steps.
And as we noted that's relatively harder for the vanilla RNN to do.
So, you can think of this as the key reason why LSTMs are more able,
ah, to preserve the information
and thus are more robust to the vanishing gradient problem.
Ah, however, I think you should still know that LSTMs don't
necessarily guarantee that we don't have a vanishing or exploding gradient problem.
You could still have that problem,
but the thing to remember is that it's easier to avoid it anyway.
Okay. So, um, LSTMs, ah,
have been shown to be more robust to the vanishing gradient problem,
ah but I'm going to tell you a little about how they've
actually been more successful in real life. You have a question?
Yeah,  [inaudible]
Okay. So it's a great question.
The question is, why is it that just because you
have these LSTM defined forward equations,
why do you not have the vanishing gradient problem?
Why does the- the logic about, ah,
the chain rule kind of getting smaller and smaller or bigger and bigger not apply?
So, I think the key here is that, um,
in the vanilla RNN,
the hidden states are kind of like a bottleneck, right?
Like all gradients must pass through them.
So, if that gradient is small then,
all downstream gradients will be small,
whereas here you could regard the cell as being kind of like
a shortcut connection at least in
the case where the forget gate is set to remember things,
um, then that's kind of like a shortcut connection where
the cell will stay the same if you have the forget gate set to remember things.
So, if the cell is staying mostly the same,
then you are not going to be,
ah, having the vanishing gradient via the cell.
So, that means that to get a connection from
the gradient of something in the future with respect to something in the past,
there is a potential route for the gradient to
go via the cell that doesn't necessarily vanish.
So, on that, I have one more question.
Uh-huh.
Since we have a shortcut [inaudible]
So I think the question was how do you check that your gradients are correct given that
there are now multiple routes for information to travel?
Right.
So, I suppose this somewhat relates to what we talked about last time with
the multivariable chain rule about what is
the derivative of the loss with respect to a repeated weight matrix and we saw that,
if there are multiple routes then
the multivariable chain rule says that you add up the gradients.
So, if your question is how do you do the calculus correctly and make sure it's correct, I guess you just kind of apply the multivariable chain rule; it's just more complicated with LSTMs. Uh, if you're using PyTorch, you do not have to do that yourself; if you're going to implement it yourself, then you might have a more difficult time.
Um, yeah. So, I guess, yeah.
Okay. All right, so, where did we get to? All right.
So, let's talk about LSTMs and how they work in the- in the real world.
So, in the pretty recent past, 2013 to 2015, um, LSTMs started achieving a lot of state-of-the-art results on a variety of different tasks, including, for example, handwriting recognition, speech recognition, machine translation, parsing, and image captioning.
So, over this period,
LSTMs became the dominant approach in a lot of
these application areas because they worked convincingly a lot better than vanilla RNNs.
However, it's 2019 today, and things change pretty fast in deep learning. So, other approaches, for example transformers, which you're going to learn about later in the class, ah, seem to have become the dominant approach in some of these application areas.
So, to look into this,
I had a look at WMT which is a machine translation conference and
also competition where people submit their MT systems to be evaluated.
And I looked at the report,
the summary report for WMT 2016 and in this report,
I did a quick Ctrl+F,
and I found the word RNN appeared 44 times.
So, it seems that most people entering this competition were building
their MT systems based on RNNs and in particular LSTMs.
And then I looked at the report from 2018, just two years later, and I found that the word RNN only appeared nine times, while the word transformer appeared 63 times, and in fact the organizers noted that most people seem to be using transformers now.
So um, this shows that things change pretty fast in deep learning.
The thing that was hot and new just a few years ago, um, is now being passed, perhaps, by other kinds of approaches.
So, you're going to learn more about transformers
later but I guess that gives you a kind of
idea of where LSTMs are currently in applications.
Okay. So, the second kind of RNN we're going to learn about is gated recurrent units.
So, these, fortunately, are simpler than LSTMs; in fact, that was the motivation for proposing them.
They were proposed in 2014 as a way to try to retain
the strengths of LSTMs by getting rid of any unnecessary complexities.
So, in a GRU,
we don't have a cell state.
We again just have a hidden state.
But the thing the GRU has in common with LSTMs is that we're going to be using gates to control the flow of information.
So, here are the equations for GRU.
We start off with two gates.
So the first gate is called the update gate and this
controls what parts of the hidden states are going to be updated versus preserved.
So, you can kind of view this as playing
the role of both the forget gate and the input gate in
the LSTM and it's computed in much the same way as the gates in the LSTM were.
The second gate is called the reset gate rt,
and this gate is controlling which parts of
the previous hidden state are going to be used to compute new contents.
So, you can think of the- the reset gate as kind of selecting
which parts of the previous hidden states are useful versus not useful.
So, it's going to discard some things and select some other things.
Okay. So, here's how those gates get used.
Um, h tilde here, you can think of it as the new hidden state content. What's going on in that equation is that we're applying the reset gate to the previous hidden state h_t minus one, um, and then putting all of that through some linear transformations and a tanh, and this gives us the new content which we want to write to the hidden state. And then lastly, our new hidden state is going to be a combination of, ah, this new content and the previous hidden state.
So, the important thing to notice here is that we have this one minus u_t and u_t term. So, um, it's kind of like a balance, right? u_t is, ah, setting the balance between preserving things from the previous hidden state versus writing new stuff.
So, whereas in the LSTM,
those were two completely separate gates that could be whatever value.
Here we have this constraint that u_t has to balance the two. So, if you have more of one, you have to have less of the other. So, this is one way in which the creators of the GRU sought to simplify the LSTM: by having a single gate play both of these roles.
Okay. So, that's GRUs, and I think it's a little less obvious, just looking at it, why GRUs help with the vanishing gradient problem, because there is no explicit, ah, memory cell like there is in LSTMs.
So, I think the way to look at this here is um GRUs,
you can view this as also being a solution to
the vanishing gradient problem because like LSTMs,
GRUs make it easier to retain information ah long-term.
So, for example here,
if the update gate u_t is set to zero, then we're going to be, ah, keeping the hidden state the same on every step.
And again that's maybe not a good idea but at least that is a strategy you can easily
do in order to retain information over long distances.
So that's kind of the same explanation as before of how GRUs make it potentially easier for RNNs to retain information long-term.
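A minimal NumPy sketch of one GRU timestep may help; the stacked parameter layout here is just an assumed convention, and the last lines deliberately force the update gate toward zero to illustrate the easy "preserve the hidden state" strategy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One GRU timestep; parameters stacked as [update; reset; content] (3n rows)."""
    n = h_prev.shape[0]
    u = sigmoid(W[0:n] @ x + U[0:n] @ h_prev + b[0:n])          # update gate
    r = sigmoid(W[n:2*n] @ x + U[n:2*n] @ h_prev + b[n:2*n])    # reset gate
    h_tilde = np.tanh(W[2*n:] @ x + U[2*n:] @ (r * h_prev) + b[2*n:])
    return (1 - u) * h_prev + u * h_tilde   # balance old state vs. new content

rng = np.random.default_rng(1)
n, d = 4, 3
W = rng.normal(size=(3*n, d))
U = rng.normal(size=(3*n, n))
b = np.zeros(3*n)
# Push the update gate toward zero: the hidden state is then copied through
# almost unchanged, which is the easy information-preserving strategy above.
W[0:n] = 0; U[0:n] = 0; b[0:n] = -50
h_prev = rng.normal(size=n)
h = gru_step(rng.normal(size=d), h_prev, W, U, b)
```

With the update gate near zero, h comes out essentially equal to h_prev, mirroring the argument that the GRU can trivially retain information over many steps.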
Okay. So, we've learned about these two different types of RNNs. Yes.
[inaudible]
I think the question was,
if we view the two gates in the GRU, as being, uh,
a precise, um, analogy to the gates in the LSTM or are they more of a fuzzy analogy.
I'd say probably more of a fuzzy analogy
because there are other changes going on in here, like,
for example, the fact that there's no separate, um,
memory cell, it means they're not performing exactly the same functions.
Yeah. Okay. So, we've learned about LSTMs and GRUs which are both,
um, more complicated forms of RNNs,
more complicated than Vanilla RNNs.
And they are both,
uh, more robust to the vanishing gradient problem.
So, um, it would be useful to know which of these should we be using in practice?
Which one is more successful,
the LSTM or GRU?
Uh, so, I- I did a little reading and it looks like researchers have
proposed a lot of different types of gated RNNs.
So, it's not just GRUs and LSTMs,
there's many other papers with lots of other different variants.
Uh, but these are definitely the two that are most widely used.
And, ah, you can probably say that the biggest difference between the two, um,
for sure is the fact that GRUs are simpler
and quicker to compute and they have fewer parameters.
So, this makes an actual practical difference to you as, uh,
a deep learning practitioner because if you build your net based on GRUs,
then it's gonna be faster to run forwards and,
you know, faster to train and so on.
So, other than that, there appears to be
no very conclusive evidence that one of these LSTM or GRUs,
uh, is consistently outperforming the other on lots of different tasks.
Uh, it seems that often, uh,
sometimes GRUs do perform as well as LSTMs,
but there are cases where one of them performs better than the other.
So, as a rule of thumb,
it seems like LSTM is often a good default choice to start with, uh,
especially if your data has
particularly long dependencies because there's evidence to think
that LSTMs might be slightly better at keeping information over very long distances.
And also, if you have a lot of training data,
you might think that LSTMs are a better choice because they
have more parameters, which means that, um, maybe you need more training data to learn them.
So, a rule of thumb is that maybe you want to start with LSTMs
and if you're happy with their performance and you're
happy with how long it takes to train, then you stick with that.
But if you feel like you need it to be more efficient,
then maybe you should switch to GRUs and see how that goes with the performance
and if it's faster. All right.
So, um, we've talked so far about how
the vanishing/exploding gradients are a problem that occur a lot in RNNs.
But, um, the question is,
is it only an RNN problem?
Does this occur in other kinds of neural networks as well?
And the answer is,
uh, no, it's not just an RNN problem.
In fact, vanishing and exploding gradients are a pretty significant problem for most neural architectures, such as feed-forward and convolutional networks, especially when they're deep.
And this is a really serious problem because there's no point having
a really cool neural architecture if you can't learn it efficiently because of the,
uh, vanishing gradient problem.
So, in particular, uh, in these feed-forward and convolutional networks, uh, you often have a gradient becoming vanishingly small during back-propagation, uh, because of the chain rule, because of multiplying by all these different intermediate gradients, or sometimes due to your choice of non-linearity function.
So, if this happens, this means that the lower layers of your, let's say, convolutional or feed-forward network have a much smaller, uh, gradient than the higher layers.
And this means that they get changed very slowly during SGD.
So, this means that, overall,
your network is very slow to train because when you take updates,
then your lower layers are changing very slowly.
So, one solution, uh,
the kind of like a family of solutions that we've seen in
recent years is that there's been lots of
proposals for new types of deep feed-forward or convolutional architectures.
And what they do is, they add more direct connections in the network.
And the- the idea,
kind of as we talked about before,
is that if you add all of these direct connections between layers,
like maybe not just adjacent layers but further apart layers,
then it makes it much easier for the gradients to flow,
and you're going to find it easier to train your network overall.
So, I'm going to show you some examples of these in
particular because it's fairly likely you're going to
run into these kinds of architectures when you're doing your projects and reading papers.
So, one example is something called residual connections or,
uh, the network itself is sometimes referred to as ResNet.
And here we've got a figure from the related paper.
So, what's going on in this diagram is that you have, uh,
the usual kind of you've got weight layer and
a non-linearity which is ReLU, and another weight layer.
So, if you regard that function as being f of x, ah,
what they're doing is instead of just, ah,
transforming x to f of x,
the- they're taking f of x plus x.
So they're adding this identity skip connection where
the input x is skipped over those two layers and then,
um, added to the output of the two layers.
So, the reason why this is a good idea,
uh, also known as skip connections,
is that the identity connection is going to preserve information by default, right?
So, if you imagine perhaps if you, um,
initialize your network and you
initialize your weight layers to have small random values,
then if they're small and kind of close to zero,
then you're going to have something like a noisy identity function, right?
So you're going to be preserving information by default through all of your layers.
And if you have a very deep network,
that means that even after many, um, many layers, you're still gonna have something like your original input.
So, uh, the- the people who wrote this paper, they show that, uh, if you don't have something like skip connections, then actually you can find that deep networks perform worse on some tasks than shallow networks.
Not because they're not expressive enough,
but because they're too difficult to learn.
So, when you attempt to learn deep networks, it just doesn't learn effectively and you end up getting worse performance than the shallow network.
So, the people who wrote this paper,
they show that when they add these skip connections,
then they made the deep networks, uh,
much more effective and they managed to get good performance.
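The residual idea just described can be sketched in a few lines of NumPy; the function names are illustrative, and the demo below shows the "near-identity at initialization" point from above, assuming small random weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """A residual block: output = F(x) + x, where F is two weight
    layers with a ReLU in between. The '+ x' is the skip connection."""
    return W2 @ relu(W1 @ x) + x

rng = np.random.default_rng(2)
n = 5
# With small, near-zero weights, the block is close to the identity
# function, so information is preserved by default through many layers.
W1 = 0.01 * rng.normal(size=(n, n))
W2 = 0.01 * rng.normal(size=(n, n))
x = rng.normal(size=n)
y = residual_block(x, W1, W2)
```

Here y stays very close to x, which is why stacking many such blocks still passes the original signal (and its gradient) through.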
Uh, so another example which kinda take this- this idea
further is something called dense connections or DenseNet.
And again, this was, uh, something proposed, I think, in a feed-forward or convolutional setting. And, ah, it's kind of the same as skip connections, except, um, you connect everything to everything.
So, add more of these skip connections kind of
from all layers to all layers and they showed that this,
uh, performs even better.
And, uh, the last one I want to talk about which I don't have a picture
for is something called highway connections.
So, this is similar to the residual or skip connections.
Ah, but the idea is that instead of just adding your x,
adding your identity, uh, connection,
the idea is that you're gonna have a gate that controls the balance between, um,
adding the identity and computing, ah, the transformation.
So, instead of f of x plus x, you're gonna have, you know,
gate times f of x plus, you know,
one minus gate times x, something like that.
Um, so, this work was actually inspired by LSTMs,
but instead of applying it to a recurrent setting,
they were seeking to apply it to a feed-forward setting.
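A minimal sketch of the highway idea, following the gate-times-F-of-x-plus-one-minus-gate-times-x form described above; names and the tanh transformation are illustrative assumptions, and the demo closes the gate to show the carry behaviour:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W, Wg, bg):
    """Highway layer: g * F(x) + (1 - g) * x, with a learned gate g."""
    g = sigmoid(Wg @ x + bg)    # transform gate, elementwise in (0, 1)
    f = np.tanh(W @ x)          # the transformation F(x)
    return g * f + (1 - g) * x

rng = np.random.default_rng(3)
n = 4
W = rng.normal(size=(n, n))
Wg = np.zeros((n, n))
x = rng.normal(size=n)
# A very negative gate bias closes the transform gate, so the layer
# passes its input through almost unchanged -- like an open LSTM gate.
y = highway_layer(x, W, Wg, np.full(n, -50.0))
```

With the gate closed, y is essentially x, which is the same preservation trick the LSTM forget gate enables.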
Okay. I'm gonna keep going for now.
Um. So, overall, the question was, you know, how much of a problem are vanishing and exploding gradients outside of the setting of RNNs? And I think, uh, the important takeaway is that it is a big problem, but you should note that it is particularly a problem for RNNs.
So, um, RNNs are particularly unstable and
this is essentially due to the repeated multiplication by the same weight matrix.
If you remember from last time, um,
the characteristic thing about RNNs that makes them recurrent is
the fact that you are applying the same weight matrix over and over again.
So, this is actually the core reason
why they are so prone to the vanishing and exploding gradients,
and ah, you can see some more information about that in the paper.
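The repeated-multiplication point can be demonstrated numerically in a couple of lines; this toy example uses scaled identity matrices just to make the effect obvious:

```python
import numpy as np

# Repeatedly multiplying by the same matrix shrinks or blows up a vector
# depending on whether its largest singular value is below or above 1 --
# the core instability behind vanishing/exploding gradients in RNNs.
rng = np.random.default_rng(4)
v = rng.normal(size=8)
W_small = 0.5 * np.eye(8)   # largest singular value 0.5: norm shrinks
W_big = 2.0 * np.eye(8)     # largest singular value 2.0: norm explodes
v_small, v_big = v.copy(), v.copy()
for _ in range(20):         # 20 "timesteps" of the same weight matrix
    v_small = W_small @ v_small
    v_big = W_big @ v_big
```

After only 20 steps, the two norms differ from the original by factors of roughly a millionth and a million, respectively.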
Okay. So, I know there's been a lot of dense information today,
a lot of um, lot of notation.
So, here's a recap, if I've lost you at any point.
Now's a good time to jump back in because it's gonna
get a little easier to understand perhaps.
So, okay, recap. What have we learned about today?
Um, the first thing we learned about was the vanishing gradient problem.
We learned uh, what it is.
We learned why it happens and we saw why it's bad for RNNs,
for example, RNN language models.
Ah, and we also learned about LSTMs and GRUs which are
more complicated RNNs and they use gates to control the flow of information.
And by doing that, they are more resilient to the vanishing gradient problem.
Okay. So, for the remainder of this lecture, I think we've got about 20 minutes left, ah, we're going to be learning about two more advanced types of RNNs.
So, the first one is bidirectional RNNs and that's all
about information flowing left to right and right to left.
And then we're also going to learn about
multi-layer RNNs which is when you apply multiple RNNs on top of each other.
So, I'd say that both of these are pretty simple conceptually.
Um, so it shouldn't be too hard to understand.
All right, so let's start with bidirectional RNNs.
Um, this is a picture which you saw at the end of last lecture.
So, if you remember,
sentiment classification is the task where you have some kind of input sentence, such as "the movie was terribly exciting!", and you want to classify it as positive or negative sentiment. So, in this example, it should be seen as positive sentiment.
So, um, this is an example of how you might try to
solve sentiment classification using a fairly simple RNN model.
Ah, here we're using the RNN as a kind of encoder of
the sentence and the hidden states represent the sentence.
And we'll do some kind of combination of the hidden states to compute uh,
what we think the sentiment is.
So, my question is, if we look at let's say,
the hidden state that corresponds to the word terribly and we're regarding
this hidden state as a representation of the word
terribly in the context of the sentence.
So, for this reason we- we sometimes call hidden states in this kind of situation
a contextual representation because the idea is that it's
a representation of the word terribly in the context of the sentence.
So, the thing to think about here is that this contextual representation,
it only contains information about the left context.
So, for terribly, the left context is the words um,
the movie was and this hidden state the one that's got
a blue box around it has only seen information to the left.
It hasn't seen the information of the words exciting or exclamation mark.
So, what we're asking is, what about the right context? The right context of "terribly" is "exciting" and the exclamation mark.
And do we think that the right context is useful here?
Do we think that this is something we want to know about?
And I would argue that in this example,
it is actually kind of important because we've got the phrase terribly exciting.
And if you look at the word "terribly" in isolation, "terrible" or "terribly" usually means something bad, right? But "terribly exciting" can mean something good, because it just means very exciting.
So, if you know about the right context,
the word exciting then this might quite significantly
modify your perception of the meaning of the word
terribly in the context of the sentence.
And especially given that we're trying to do sentiment classification,
this is- this is kind of important.
So this motivates why you might want to have information
from both the left and the right when you're making your representations.
Ah, if when you were a kid,
your parents told you to look both ways before you cross the street.
You might regard it as the same kind of idea that there's
useful information to the left and the right that
you'd like to know about ah, before you do anything.
Okay. So that's the motivation and um,
here is how a bidirectional RNN might work in practice.
I have a kind of accidentally festive color scheme here.
And so the idea is that you have two RNNs going on.
You have the forward RNN as before that encodes the sentence left to right.
And then separately, you also have a backwards RNN.
And this has completely separate weights from the forward RNN.
So, the backward RNN is just doing the same thing
except that it's encoding the sequence from right to left.
So, each of the hidden states is computed based on the one to the right.
And then finally, you just take the hidden states from
the two RNNs and then you concatenate them together and you've got your uh,
your final kind of representations.
So, in particular, if we now think about
this contextual representation of the word terribly in context,
um, this vector has information from both the left and the right, right?
Because you had the forwards and backwards RNNs that
respectively had information from both left and right.
So the idea is that these concatenated hidden states,
those can be regarded as kind of like the outputs of the bidirectional RNN.
Like if you're going to use these hidden states for
any kind of further computation, then ah,
it's these concatenated hidden states that you are going to be
passing on to the next part of the network.
Um, here- here are the equations that just say the same thing.
So, you have your forward RNN and here we've got ah,
a notation that you might not have seen before
this kind of notation where it says RNN and then in brackets,
the previous hidden state and the input that's simply saying that you know,
HT is computed from the previous hidden state and the input.
And RNN forward could be a vanilla RNN, a GRU, or an LSTM.
It doesn't really matter, we're looking at it abstractly.
So, you have these two separate RNNs,
RNN forwards and RNN backwards and generally, these have separate weights.
Although I have seen some papers where they have shared weights.
So, it seems that sometimes that does work better,
perhaps when you have enough training data.
And then finally, we regard these concatenated hidden states, which you might just
denote ht, as being the hidden state of the bidirectional RNN.
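As a rough sketch, the forward RNN, backward RNN, and concatenation can be written out in NumPy like this. Note that everything here is made up for illustration: the sizes, the random weights, and the simple vanilla cell h_t = tanh(Wh h_{t-1} + Wx x_t) standing in for whatever cell you'd actually use.

```python
import numpy as np

# Toy setup: 4 timesteps, input size 3, hidden size 5.
rng = np.random.default_rng(0)
T, d_in, d_h = 4, 3, 5
xs = [rng.standard_normal(d_in) for _ in range(T)]

def run_rnn(xs, Wh, Wx):
    """Run a vanilla RNN cell over the sequence, returning all hidden states."""
    h, states = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
        states.append(h)
    return states

# The forward and backward RNNs have completely separate weights.
Wh_f, Wx_f = rng.standard_normal((d_h, d_h)), rng.standard_normal((d_h, d_in))
Wh_b, Wx_b = rng.standard_normal((d_h, d_h)), rng.standard_normal((d_h, d_in))

fwd = run_rnn(xs, Wh_f, Wx_f)              # encodes left to right
bwd = run_rnn(xs[::-1], Wh_b, Wx_b)[::-1]  # encodes right to left, then re-aligned

# The output at each timestep is the concatenation [h_forward; h_backward],
# so it has size 2 * d_h = 10.
h = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(h[0].shape)  # (10,)
```

In PyTorch you'd get the same behavior by passing bidirectional=True to the RNN module rather than writing the two passes by hand.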
So, um, the previous diagram is pretty unwieldy.
So here's a simplified diagram.
And this is probably the only kind of diagram you're going to
see from now on to denote bidirectional RNNs.
Um, so, what we've done here is we've just
made all of the horizontal arrows go left and right, ah,
to represent that this is a bidirectional RNN.
So, the other thing you should assume is that the hidden states depicted here, you know,
these red rectangles with the dots,
you can assume that those are the concatenated forwards and
backwards hidden states from the bidirectional RNN.
[inaudible]
Okay. So the question is, um,
would you train your forwards and backwards RNNs kind of separately,
um, on some kind of task and then
maybe concatenate them together once they're separately trained networks,
or would you train them all together?
Um, it seems to me that it's much more common to train them together;
in fact, I don't think I've heard of anyone training them separately.
Uh, so yeah, it seems like the standard practice is usually
to train them together. Does that make sense?
[inaudible].
So, let's suppose that we were trying to build
a sentiment classification system using the bidirectional RNN.
Then what you do, which maybe I should have pictured but I didn't have space, is, uh,
you would do the same thing that you were doing with the unidirectional RNN, uh,
which was, let's say, an element-wise mean or max,
um, to get your sentence encoding.
You just do that but over the concatenated, um, hidden states.
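Concretely, that pooling step is just a max (or mean) over the timestep dimension. A minimal sketch, where random vectors stand in for the concatenated bidirectional hidden states you'd actually have:

```python
import numpy as np

# Pretend these are the concatenated bidirectional hidden states h_1..h_T:
# 5 timesteps, each of size 16 (forward 8 + backward 8, say).
rng = np.random.default_rng(1)
T, d = 5, 16
h = rng.standard_normal((T, d))    # one row per timestep

# Element-wise max over timesteps gives a single sentence encoding,
# which you'd then feed to your sentiment classifier.
sentence_encoding = h.max(axis=0)
print(sentence_encoding.shape)     # (16,)
```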
Okay. So, an important thing to note is that, uh,
when talking about applying bidirectional RNNs,
we've assumed that we actually have access to the entire input sequence.
So, we assume that we have the full sentence,
uh, the movie was very exciting, and,
uh, that, that was a necessary assumption in order to
be able to run the forwards and the backwards RNN, right?
Um, so there are some situations where you can't assume this.
Like, for example, in Language Modeling,
you only have access to the left context kind of by definition of the task.
You only know the words that have come so far.
You don't know what's coming next.
So, you can't use a bidirectional RNN, uh,
to do Language Modeling, uh,
in the way that we've depicted here because uh,
you don't have the full sequence.
However, if you do have access to the entire sequence,
uh, so, for example, if you're doing any kind of encoding,
similar to the sentiment example,
uh, then bidirectionality is pretty powerful.
And you should probably regard it as a good thing to do by default uh,
because it turns out that getting this information from
both the left and the right, uh,
makes it a lot easier to learn these more useful contextual representations.
So, in particular, as a preview of
something you're going to learn about later in the class, uh,
there's a model called BERT, B-E-R-T,
and that stands for Bidirectional Encoder Representations from Transformers.
And this is a pretty recently proposed system,
like, a few months ago, uh,
and it's this pre-trained contextual representation system.
Um, and it's heavily reliant on the idea of bidirectionality.
It turns out that the bidirectional, uh,
nature of BERT is pretty important to its success.
So, you're gonna learn more about that later,
but that's just an example of how bidirectionality can give you much
more uh, powerful contextual representations.
Okay. So the last thing we're going to talk about today is multi-layer RNNs.
Uh, so you could regard RNNs as already being deep
in some sense because you've already unrolled them over potentially very many timesteps,
and you could regard that as a kind of depth, right?
But there's another way that RNNs could be deep.
So, for example, if you applied multiple RNNs kind of one after another,
then this would be a different way to make your RNN deep,
and this is the idea, uh,
behind a multi-layer RNN.
So, the reason why you would want to do this is because uh,
this might allow the network to compute more complex representations.
So, this is the logic behind deep networks in general.
So, if you're familiar with the idea of why
deeper is better for let's say convolutional networks,
then this is kind of the same logic.
It's saying that, uh, your lower RNNs might be computing lower-level features like,
let's suppose, maybe keeping track of syntax,
and your higher-level RNNs are gonna compute higher-level features like maybe semantics.
And a note on terminology, these are sometimes called stacked RNNs.
So, this works much as you'd imagine.
So here's an example of how a multi-layer RNN might work.
Uh, if it's three layers.
So this is a unidirectional RNN,
but it could be bidirectional,
um, if you have access to the entire input sequence.
So, I guess the, the main thing is that the hidden states from one RNN layer are going to
be used as the inputs to the RNN layer that's coming next.
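That stacking can be sketched by reusing the same toy vanilla cell from before and feeding each layer's hidden states in as the next layer's inputs. Again, the sizes, weights, and three-layer choice here are all made up for illustration:

```python
import numpy as np

# Toy setup: 4 timesteps, and every layer uses the same dimension 6
# so hidden states can be fed straight into the next layer.
rng = np.random.default_rng(2)
T, d = 4, 6
xs = [rng.standard_normal(d) for _ in range(T)]

def run_layer(inputs, Wh, Wx):
    """One RNN layer: consume a sequence of inputs, emit a sequence of hidden states."""
    h, states = np.zeros(d), []
    for x in inputs:
        h = np.tanh(Wh @ h + Wx @ x)
        states.append(h)
    return states

layer_inputs = xs
for layer in range(3):                 # three stacked layers, each with its own weights
    Wh = rng.standard_normal((d, d))
    Wx = rng.standard_normal((d, d))
    layer_inputs = run_layer(layer_inputs, Wh, Wx)

top_states = layer_inputs              # hidden states of the top (third) layer
print(len(top_states), top_states[0].shape)  # 4 (6,)
```

In PyTorch you'd typically just pass num_layers=3 to the RNN module instead of looping over layers yourself.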
Um, any questions on this?
Yeah.
[inaudible].
That's a great question. So the question I think it's about the order of computation.
What order will you compute all of these hidden states in?
I suppose there's some flexibility, right?
But you could compute all of the first-timestep hidden states,
like all of the ones for 'the' and then all of the ones for 'movie',
or you could do all of RNN layer one and then all of RNN layer two.
So, it's- I think that, um, when you- you know,
call the PyTorch function to do a multi-layer RNN,
it will do all of RNN layer one, then two, then three.
That's what I think happens.
But it seems like logically,
there's no reason why you couldn't do it the other way.
Yep?  [inaudible].
Yes, yes. That's a great point as well.
Um, so uh, someone pointed out that if they were bidirectional,
then you no longer have that flexibility.
You would have to do all of layer one before layer two.
Yeah, good point. Anyone else?
Okay. Uh, so, multi-layer RNNs in practice,
um, these tend to perform pretty well,
uh, in that when I look at, um,
RNN-based systems that are doing very well on some kind of task,
they usually are some kind of multi-layer RNN, um,
but they certainly aren't as deep as
the deep convolutional or feed-forward networks you might have seen in,
for example, image tasks.
So whereas, you know, very deep convolutional networks
are, I think, hundreds of layers now, um,
you certainly aren't getting RNNs that are that deep.
So, for example, um,
in this paper from, uh, Google, uh,
they're doing this kind of large hyperparameter search for
neural machine translation to find which kinds of hyperparameters work well for NMT.
And in this paper, they found that um,
two to four layers was best for the encoder RNN,
and four layers was best for the decoder RNN.
Uh, you'll find out more about what encoder and decoder mean next time.
Um, but those are fairly small numbers.
Although they did find that if you add these skip
connections or these dense connections, um,
then it makes it much easier to learn some even deeper RNNs more effectively,
like, maybe up to eight layers,
but these certainly aren't hundreds of layers deep.
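The skip connections mentioned here are the same residual idea used in deep convolutional networks: the layer's output gets the input added back on, which makes gradients flow more easily through depth. A minimal sketch, with a made-up layer and sizes:

```python
import numpy as np

def layer(x, W):
    """A stand-in for one layer's transformation."""
    return np.tanh(W @ x)

rng = np.random.default_rng(3)
d = 4
x = rng.standard_normal(d)
W = rng.standard_normal((d, d))

# Skip (residual) connection: the output is layer(x) + x, so even if the
# layer's gradient is tiny, the identity path still carries gradient through.
out = layer(x, W) + x
print(out.shape)  # (4,)
```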
And one of the reasons why, uh,
RNNs don't tend to be nearly as deep as these other kinds of networks,
is that, as we commented before,
RNNs have to be computed, uh,
sequentially; they can't be computed in parallel.
This means that they're pretty expensive to compute.
If you have this depth in, like, two dimensions,
the depth over the timesteps and then the depth over the RNN layers,
then it becomes very,
very expensive to compute these, these RNNs.
So, that's another reason why they don't get very deep.
Uh, so again, we just mentioned transformers.
Uh, you're gonna learn about transformers later.
But these, it seems, um,
can be deeper, from what I can tell of,
of what people are using these days.
Transformer-based networks can be pretty deep.
So, uh, for example,
BERT has a 24-layer version and a 12-layer version, um,
and admittedly, that was trained by Google,
and they have a lot of computational power.
Um, but I think part of the reason why
these transformer-based networks can be quite deep,
is that they have a lot of these skip-like connections.
In fact, the whole, um,
innovation of transformers is that they're built on a lot of, kind of,
skip connections. Okay, any questions?
We're almost done. Okay. All right.
So, uh, here's a summary of what we've learned today.
I know it's been a lot of information.
Um, but I think here are four practical takeaways from today that, uh,
are probably useful to you in your projects,
even if you, um,
didn't find them very interesting in themselves.
So, the first one is that LSTMs are very powerful.
They're certainly a lot more powerful than,
uh, vanilla RNNs.
Um, GRUs are also more powerful than, uh, vanilla RNNs.
Uh, and the one difference that is consistent
is that GRUs are faster than LSTMs.
The next one is that you should probably clip your gradients,
because if you don't clip your gradients,
you're in danger of walking off cliffs and then ending up with NaNs in your model.
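Clipping by global norm is simple enough to write out by hand. A sketch, assuming a made-up gradient vector and threshold: if the gradient's norm exceeds the threshold, rescale it to that norm; otherwise leave it alone.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its norm is at most max_norm, preserving direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])            # norm 5, too big for a threshold of 1
clipped = clip_by_norm(g, 1.0)
print(np.linalg.norm(clipped))      # ~1.0
```

In PyTorch, torch.nn.utils.clip_grad_norm_ does this for you over all of a model's parameters, typically called between loss.backward() and optimizer.step().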
Uh, the next tip is that bidirectionality is useful if you can apply it.
And, basically, anytime when you have access to the entire input sequence,
you can apply bidirectionality,
so you should probably do that by default.
And then the last tip is that multi-layer RNNs are pretty powerful.
And again, you should probably do that if you,
uh, have enough computational power to do so.
But if you're going to make your multi-layer RNN pretty deep,
then you might need skip connections.
All right. Thanks [NOISE].
