[MUSIC]
Stanford University.
>> Okay hi everyone.
Let's get started again.
We're back; we're into
week six now, and Lecture 11.
This is basically the third and
now last of our lectures.
It's sort of essentially concentrating on
what we can do with recurrent models and
sequence to sequence architectures.
I thought what I'd do in the first
part of the lecture is have one
more attempt at explaining some
of the ideas about GRUs and
LSTMs: where they come from and
how they work.
I'd sort of decided to do
that anyway on the weekend,
just because I know that when I first
started seeing some of these gated models,
it took a long time for
them to make much sense to me, and
not just seem like a complete surprise and
mystery as to the way they work, so
I hope I can do a bit of good at
explaining that one more time.
That feeling was reconfirmed when we
started seeing some of the responses from
people who've filled in the midterm survey, so
thanks to all the people who filled it in.
For people who haven't,
I'm still happy to have you fill it
in over the last couple of days.
While there were a couple of people
who put LSTMs in the list of
concepts they felt they
understood really well,
dozens of people put LSTMs and
GRUs into the list of concepts
they felt kind of unsure about.
This first part is for you and if you're
one of the ones that already understand
it really well, I guess you'll just
have to skip ahead to the second part.
Then we'll have the research
highlight which should be fun today.
And then, moving on from that,
we'll complete the picture by
saying a bit more about
machine translation.
It's a bit that we sort of had skipped and
probably should have explained earlier
which is how do people evaluate
machine translation systems?
Because we've been showing you numbers and
graphs and so on and never discussed that.
And then I wanna sort of say a bit
more about a couple of things that
come up when trying to build new
machine translation systems.
And in some sense, this is sort of
down-in-the-weeds stuff; it's not
that this is sort of one central concept
that you can't possibly finish your
neural networks class
without having learned.
But on the other hand, I think that all of
these are the sort of things that come up
if you are actually trying to build
something where you've actually got a deep
learning system that you can use to
do useful stuff in the world, and
they're useful, good,
new concepts to know.
Okay.
Lastly just the reminders and
various things.
The midterm, we have got it all graded.
And our plan is that we are going to
return it to the people
who are here after class.
In particular, there's another
event that's on here after class,
so where we're going to return it
after class is outside the door.
You should be able to find
TAs with boxes of midterms and
be able to get yours back.
Assignment three, yeah so this has
been a little bit of a stretch for
everybody on assignment three I realized,
because sort of the midterm got
in the way and people got behind.
And we'd also actually been hoping
to be sort of ready to go with
giving people GPU resources on Azure, and
that's kind of fallen behind.
We're trying to work on that right
now, so with any luck maybe by the end
of today we might have the GPU
resources part in place.
I mean, at any rate, you should absolutely
be getting started on the assignment and
writing the code.
But we also do really hope that
before you finish this assignment,
you take a chance to try out Azure,
Docker and
getting stuff working on GPUs because
that's really good experience to have.
Then final projects,
the thing that we all noticed about our
office hours last week after the midterm
is that barely anybody came to them.
We'd really like to urge for this week,
please come along to office hours again.
And especially if you're doing
a final project, we'd really,
really like you to turn up and
talk to us about your final projects and
in particular tonight after class and
a bit of dinner, which is, again, when
we're going to be doing
unlimited office hours.
Feel free to come and see us, and
possibly even, depending on how you feel
about it, you might go off and
have dinner first and then come back and
see us, to spread things out a little bit.
Are there any questions
people are dying to know,
or do I head straight into
content at that point?
I'll head straight into content.
Basically I wanted to sort of spend
a bit of time going through, again,
the sort of ideas of where did these
kinds of fancy recurrent units come from?
What are they going to try and achieve and
how do they go about doing it?
Our starting point is, what we have
with a recurrent neural network is that
we've got something that's
evolving through time.
And at the end of that we're at some
point here, at time t plus n.
And then what we want to do
is have some sense of, well,
this stuff that we saw at time t, is that
affecting what happens at time t plus n?
That's the kind of thing of, is it
the fact that we saw at time t
this verb squash that is having
some effect n words later,
on someone saying
the word window, because there is some
kind of association between squashing and
windows, or is that completely irrelevant?
We wanna sort of measure
how what you're doing here
affects what's happening maybe six,
eight, ten words later.
And so the question is,
how can we achieve that?
And Richard discussed this, and
there was some sort of complex math,
which I'm not going to explain,
again, in great detail.
But what we found is if we have a basic
recurrent neural network, what we're
doing at each time step in the basic
recurrent neural network is
we've got some hidden state and
we're multiplying it by a matrix and
then we're adding some stuff to do with
the input, and then we go on to the next
time step, where we're multiplying that
hidden state by the same matrix again and
adding some input stuff, and then we
go on to the next time step,
multiplying that
hidden state by the same matrix again.
It keeps on doing these matrix
multiplies, and when you keep on doing
these matrix multiplies you can
potentially get into trouble.
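Just to make that concrete, here's a rough sketch of that repeated update in NumPy-style Python; the names here are purely illustrative, not code from the assignment:

    import numpy as np

    def vanilla_rnn(xs, W_h, W_x, b):
        # xs is a list of input vectors; W_h is reused at every time step
        h = np.zeros(W_h.shape[0])
        for x in xs:
            # the same matrix W_h hits the hidden state again and again,
            # so in backprop the gradient is multiplied by it over and over
            h = np.tanh(W_h @ h + W_x @ x + b)
        return h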
And the trouble you get into is,
if your gradient is going to zero,
you kind of can't tell whether that
means that actually what happened
n words ago is having no effect
on what you're seeing now,
or whether it's that you hadn't set
the norms of your matrices
exactly right, and so the gradient
is going to zero because it's vanishing.
This is where the stuff about eigenvalues
and stuff like that comes in.
But kind of the problem is that
a basic RNN is sort of a bit too much
like having to land your aircraft
on the aircraft carrier or
something like that.
If you can get things
just the right size,
you can land on
the aircraft carrier, but
if somehow your eigenvalues are a bit too
small then you have vanishing gradients.
And if they're a bit too large
you have exploding gradients, and
it's very hard to get it right, and so
this naive transition function seems
to be the cause of a lot of the problems.
With the naive transition
function in particular,
what it means is that sorta we're doing
this sequence of matrix multiplies.
So we're keeping on multiplying
by a matrix at each time step.
And so, that means that when
we're then trying to learn
how much effect things have
on our decisions up here,
we're doing that by backpropagating
through this whole sequence of
intermediate nodes.
And so, the whole idea of all of these
gated recurrent models is to say,
well, somehow, we'd like to be
able to get more direct evidence
of the effect of early time
steps on much later time steps,
without having to do this long sequence of
matrix multiplies, which almost certainly
gives us the danger of
killing off the evidence.
So essentially what we wanna have is,
we want to kinda consider the time
sequence that's our straight line.
We also want to allow these shortcut
connections, so ht can directly
affect ht+2, because if we can do
that, then when we're backpropagating
we'll be able to measure in the
backward phase the effect of ht on ht+2.
And therefore,
we would be much more likely to
learn these long term dependencies.
So that seems a good idea.
So I'm gonna do the kinda gated
recurrent units first, and
then kinda build onto LSTMs,
which are even more complex.
So essentially that's what we're
doing in the gated recurrent unit.
And we're only making it a little
bit more complex by saying, well,
rather than just uniformly
putting in stuff from time t-1 and
time t-2, maybe we can have adaptive
shortcut connections where we're
deciding how much attention to pay to
the past, as well as to the present.
And so, that's essentially what you
get with the gated recurrent unit.
So the key equation of the gated
recurrent unit is this first one.
So it's sort of saying, well, we're
going to do the normal neural network
recurrent units stuff,
that's the stuff in green.
So for the stuff in green, we take the
current input and multiply it by a matrix.
We take the previous hidden state and
multiply it by a matrix.
We add all of those things with a bias and
put it through a tanh, that's exactly the
standard recurrent neural network update.
So we're going to do that candidate
update just like a regular RNN.
But to actually work out what
function we're computing,
we're then going to adaptively learn
how much and on which dimensions
to use that candidate update, and
how much we're just gonna shortcut it
and just stick with what we had
from the previous time step.
And that stuff from the previous
time step will have been to some
extent computed by this regular RNN
update at the previous time step.
But of course, that was also a mixture, so
to some extent, it will have been directly
inherited from the time step before that.
And so,
we're kind of adaptively allowing things from
far past time steps just to
be passed straight through,
with no further multiplications,
into the current time step.
So a lot of the key to it is
that we have this plus here.
The stuff that is on this side
of the plus, we're just saying,
just move along the stuff you had
before onto the next time step,
which has the effect that we're
directly having stuff from the past
be present to affect further on decisions.
So that's most of what
we have in a GRU, and
a GRU is then just a little bit more
complex than that because if we do this,
it's sort of all additive;
you kinda keep stuff around forever.
You're deciding which things to pay attention to,
but
once you've paid attention to something,
it's around forever.
And that's because you're sort
of just adding stuff on here.
And so, the final step is to say well
actually, maybe we want to sort of prune
away some of the past stuff adaptively so
it doesn't hang around forever.
And so, to do that, we're adding
this second gate, the reset gate.
And so, the reset gate gives you a vector
of, again, numbers between zero and
one, which is calculated like a kind
of a standard recurrent unit.
But it's sort of saying,
well to some extent, what we want to do is
be able to delete some of the stuff that
was in ht- 1 when it's no longer relevant.
And so,
we're doing this sort of Hadamard product,
the element-wise product of the reset
gate and the previous hidden state.
And so,
we can forget parts of the hidden state.
And that forgetting is
embedded in this kind of candidate update.
The part that's being just
passed along from the past to
have direct updates is still
just exactly as it was before.
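If it helps to see those equations as code, here's a minimal sketch of one GRU step in the same NumPy-style Python; biases are left out and the names are mine, not from any particular library:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(h_prev, x, W, U, W_r, U_r, W_u, U_u):
        u = sigmoid(W_u @ x + U_u @ h_prev)            # update gate, in (0, 1)
        r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate, in (0, 1)
        h_tilde = np.tanh(W @ x + U @ (r * h_prev))    # candidate update
        # mix: copy some dimensions straight through, overwrite others
        return u * h_tilde + (1.0 - u) * h_prev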
So, let me have one more attempt
at being more visual about that.
So if we have a basic vanilla tanh-RNN,
one way that you could think about
that is we have a hidden state,
and what our execution of our
unit is doing as a program
is saying you read the whole
of that register h,
you do your RNN update, and
you write the whole thing back.
So you've got this one memory register.
You read it all, do a standard recurrent
update, and write it all back.
So that's sort of very inflexible.
And you're just sort of repeating that
over and over again at each time step.
So in contrast to that,
when you have a GRU unit, that is then,
allowing you to sort of learn
this adaptive flexibility.
So first of all,
with the reset gate, you can learn
a subset of the hidden state that
you want to read and make use of.
And the rest of it will
then get thrown away.
So you have an ability to forget stuff.
And then,
once you've sort of read your subset,
you're then going to do
on it your standard RNN
computation of how to update things.
But then secondly,
you're gonna select the writable subset.
So this is saying,
some of the hidden state we're
just gonna carry on from the past.
We're only now going to
edit part of the register.
And saying part of the register,
I guess, is lying and simplifying a bit,
because really,
you've got this vector of real numbers, and
instead of part of the register it's
70% updating this dimension and 20%
updating that dimension; the values could
be one or zero but normally they won't be.
So I choose the writable subset, and
then it's that part of it that I'm
then updating with my new candidate
update, which is then written back,
adding on to it.
And so
both of those concepts in the gating,
the one gate is selecting what to read for
your candidate update.
And the other gate is saying, which
parts of the hidden state to overwrite?
Does that sort of make
sense how that's a useful,
more powerful way of thinking
about having a recurrent model?
Yes, a question?
Yeah, so how you select the readable
subset is based on this reset gate?
So, the reset gate decides
which parts of the hidden
state to read to update the hidden state.
So, the reset gate calculates which parts
to read based on the current input and
the previous hidden state.
So it's gonna say, okay, I wanna pay a lot
of attention to dimensions 7 and 52,
and pay only a little
attention to the others.
And so those are the ones that
will be being read here and
used in the calculation of
the new candidate update,
which is then sort of mixed together
with carrying on what you had before.
Any other questions? Yes.
So, the question was explain this again.
I'll try.
[LAUGH] I will try.
I will try and do that.
Let me go back to this slide first,
cuz this has most of that,
except the last piece, right.
So here, what we want to do is we're
carrying along a hidden state over time.
And at each point in time,
we're going to say, well,
based on the new input and
the previous hidden state,
we want to try and
calculate a new hidden state, but
we don't fully want to
calculate a new hidden state.
Sometimes, it will be useful just to
carry over information from further back.
That's how we're going to get longer term
memory into our recurrent neural network.
Cuz if we kind of keep on doing
multiplications at each time step
along a basic RNN,
we lose any notion of long-term memory.
And essentially, we can't remember things
for more than seven to ten time steps.
So that is sort of the top level equation
to say, well, what we're gonna calculate.
We want to calculate a mixture
of a candidate update and
keeping what we had there before and
how do we do that?
Well, what we're going to learn is
this ut vector, the update gate and
the elements of that vector
are gonna be between zero and one.
And if they're close to one,
it's gonna say,
overwrite the current hidden state with
what we calculated this time step.
And if they're close to zero,
it's gonna say,
keep this element vector
just what it used to be.
And so how we calculate the update
gate is using our regular kind
of recurrent unit where it
looks at the current input and
it looks at the recent history and
it calculates a value, with the only
difference being that here we use a sigmoid,
so that's between 0 and
1, rather than a tanh that puts
it between minus 1 and 1.
And so the kind of hope
here intuitively is suppose
we have a unit that is
sort of sensitive to what
verb we're on,
then what we wanna say is well,
we're going through this sentence and
we've seen a verb.
We want that unit, well, sorry,
these dimensions of the vector.
Let's say there are five dimensions of the
vector that sort of record what kind of
verb it's just seen.
We want those dimensions of the vector
to just stay recording what verb was
seen until such time as, in the input,
a brand new verb appears.
And it's at precisely that point, we wanna
say, okay, now is the time to update.
Forget about what used to be
stored in those five dimensions.
Now, you should store
a representation of the new verb.
And so, that's exactly what
the update gate could do here.
It could be looking at the input and
say, okay, I found a new verb.
So dimensions 47 to 52 should
be being given a value of 1 and
that means that they'll be storing a value
calculated from this candidate update,
and ignoring what they
used to store in the past.
But if the update gate finds
it's looking at a preposition or
a determiner, it'll say,
no, not interested in those.
So it'll make the update
value close to 0 and
that means that dimensions
47 to 52 will continue to
store the verb that you last saw
even if it was ten words ago.
I haven't quite finished.
So that was that part of it, so yes.
So, the candidate update.
So, that's the update gate.
And when we do update, the candidate
update is just exactly the same as
it always was in our recurrent neural
network: you're calculating this
function of the input and
the previous hidden state and
putting it through a tanh
to get a value between minus 1 and 1.
Then the final idea here is that well,
if you just have this,
if you're doing a candidate update,
you're always using
the previous hidden state and
the new input word in
exactly the same way.
Whereas really for my example, what I was
saying was if you have detected a new
verb in the input, you should be storing
that new verb in dimensions 47 to 52 and
you should just be ignoring
what you used to have there.
And so it sort of seems like
at least in some circumstances
what you'd like to do is throw
away your current hidden state,
so you could replace it
with some new hidden state.
And so that's what this second gate,
the reset gate does.
So the reset gate can also look at
the current input and the previous hidden
state, and
it chooses a value between zero and one.
And if the reset gate chooses
a value close to zero,
you're essentially just throwing
away the previous hidden state and
calculating something
based on your new input.
And the suggestion there, for
the language analogy, is, well,
if it's something like you're recording
the last seen verb in dimensions 47 to 52,
when you see a new verb, well, the right
thing to do is to throw away what you
have in your history from 47 to 52 and
just calculate something new based
on the input, but that's not always
gonna be what you want to do.
For example, in English, English is
famous for having a lot of verb particle
combinations which cause enormous
difficulty to non-native speakers.
So that's all of these things
like make up, make out, take up.
All of these combinations of a verb and
a preposition have a special
meaning that you just have to know.
It isn't really, you can't tell
from the words most of the time.
So if you are wanting to work out
what the meaning of make out is,
you've seen make and
you put that into dimensions 47 to 52.
But if dimensions 47 to 52 are really
storing main predicate meaning,
when you see the word out coming next, you
don't wanna throw away make, because it's
a big difference in meaning whether
it's make out or take out or give out.
What you wanna do is you wanna combine
both of them together to try and
calculate the predicate's meaning.
So in that case, you want your reset
gate to have a value near one, so you're
still keeping it, and you're taking the
new input and calculating another value.
Okay, that was my attempt to explain GRUs,
and now the question.
So the question is, okay, but
why does this gated recurrent
unit not suffer from
the vanishing gradient problem?
And really the secret is
right here in this plus sign.
If you allow me to simplify slightly,
and this is actually a version
of a network that has been used.
It's essentially, not in all details,
but this aspect of it actually
corresponds to the very original
form of an LSTM that was proposed.
Suppose I just delete this minus ut here,
so this coefficient just was 1.
So what we have here is ht- 1,
so kind of like the reset gate,
the update gate is only
being used on this side.
It's saying should you pay any
attention to the new candidate,
but you're always plussing it with ht-1.
If you'll imagine that
slightly simplified form,
well, if you think about your gradients,
then what we've got here is, when
we're kind of working at ht,
ht-1 has been used to calculate ht, so
ht equals something plus ht-1, so
there's a completely linear relationship
with a coefficient of one between ht and
ht-1.
Okay, and so
therefore when you do your calculus and
you back prop that, right,
you have something with slope 1.
That ht is just directly reflecting ht-1.
And that's the perfect case for
gradients to flow beautifully.
Nothing is lost, it's just going
straight back down the line.
And so that's why it can carry
information for a very long time.
So once we put in this update gate,
what we're having is, providing
ut is close to zero,
this coefficient is gonna be approximately one,
and so the gradients are just gonna flow
straight back down the line an arbitrary
distance, and
you can have long distance dependencies.
Crucially, it's not like you're
multiplying by a matrix every time,
which causes all the problems with vanishing gradients.
It's just, almost, one straight
linear sequence there.
Now of course, if at some point ut is
close to 1, so this is close to zero,
well then almost nothing
is flowing in from ht-1.
But that's then saying there
is no long term dependency.
That's what the model learned.
So nothing flows a long way back.
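If you want to see it written down, in that simplified form where the coefficient on the previous state is just one, we have, in my notation rather than anything from the slides:

    h_t = h_{t-1} + u_t \odot \tilde{h}_t
    \frac{\partial h_t}{\partial h_{t-1}} = I + \text{(terms that go through } u_t \text{ and } \tilde{h}_t)

and it's that identity term that gives you a path along which the gradient flows back unattenuated, however many time steps you go.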
Is that a question?
Yeah.
So the question is,
aren't h tilde t and ut both dependent on ht-1?
And yeah, they are.
Just like the ut you're calculating
it here in terms of ht-1.
So in some sense the answer is yeah,
you are right, but
it sort of turns out not to matter, right?
So the thing, I think, if I put words
into your mouth, the thing that you're
thinking about is, well, this ut,
look right down at the bottom here,
you calculate it by a matrix-vector
multiply from ht-1.
And well then, where did the ht-1 come from?
It came from ht-2, and there were some
more matrix-vector multiplies here,
so there is a pathway going
through the gates where
you keep on doing matrix-vector
multiplies, and that is true.
But, it turns out that sort
of doesn't really matter,
because of the fact that there is this
direct pathway, where you're getting
this straight linear flow of gradient
information, going back in time.
Any other question?
Yes, I don't think I'll get any further
in this class if I'm not careful.
I'm sorry if that's true.
So the question was, why before it
was ut in one place and 1 minus ut in the other,
and now we've swapped them.
>> [INAUDIBLE]
>> Yeah, if that's true, sorry about that.
That was bad, boo boo mistake,
cuz obviously we should be
trying to be consistent.
But, it totally doesn't matter.
This is sort of, in some sense, whether
you're thinking of it as the forget
gate or a remember gate, and
you can kind of have it either way round.
And that doesn't affect how the math and
the learning works.
Any other questions?
I'm happy to talk about this because I do
actually think it's useful to understand
this stuff cuz in some sense these kind
of gated units have been the biggest and
most useful idea for making practical
systems in the last couple of years.
Yes.
I actually have a picture for
an LSTM later on.
It depends on a lot of particularities,
but
it sort of seems like
somewhere around 100.
Sorry, the question was how long does a GRU
actually end up remembering for, and I
kind of think, order of magnitude, the kind of
number you want in your head is 100 steps.
So they don't remember forever; I think
that's something people also get wrong.
If we go back to the other one,
that I hope to get to eventually,
the name is kind of a mouthful.
I think it was actually very
deliberately named, where it was called,
long short term memory.
Right, there was no idea in people's
heads that this was meant to be
the model of long term
memory in the human brain.
Long term memory is
fundamentally different and
needs to be modeled in other ways and
maybe later in the class,
we'll say a little bit about the kind
of ideas people are thinking about for this.
What this was about was saying okay,
well people have a short term memory and
it lasts for a while.
Whereas the problem was our recurrent
neural networks are losing all of their
memory in ten time steps.
So if we could get that pushed out
another order of magnitude, to around
100 time steps, that would
be really useful to give us
a more human-like sense
of short term memory.
Sorry, yeah?
So the question is,
do GRUs train faster than LSTMs?
I don't think that's true,
does Richard have an opinion?
>> [INAUDIBLE]
>> Yes,
so Richard says less computation
the computational cost is faster,
but I sort of feel that sometimes
LSTMs have a slight edge on speed.
No huge difference,
let's say that's the answer.
Any other, was there another
question that people want to ask?
Okay, I'll go on.
You can ask them again in a minute and
I go on.
Okay, so then finally I wanted to sort
of say a little bit about LSTMs.
So LSTMs are more complex because there
are more equations down the right side.
And there's more gates but they're barely
different when it comes down to it.
And to some extent, they look more
different than they are because of
certain arbitrary choices of notation
that was made when LSTMs were introduced.
So when LSTMs were introduced,
Hochreiter & Schmidhuber
sort of decided to say, well,
we have this privileged notion of
memory in the LSTM,
which we're going to call the cell.
And so people use C for
the cell of the LSTM.
But the crucial thing to notice
is that the cell of the LSTM
is behaving like the hidden
state of the GRU, so really,
the h of the GRU is equivalent
to the c of the LSTM.
Whereas the h of the LSTM is
something different that's related
to sort of what's exposed to the world.
So the center of the LSTM,
this equation for updating the cell,
is, to a first approximation, exactly
the same as this most crucial equation for
updating the hidden state of the GRU.
Now, if you stare a bit,
they're not quite the same,
the way they are different is very small.
So in the LSTM you have two gates
a forget gate and then an input gate so
both of those for each of the dimension
have a value between zero and one.
So you can simultaneously keep
everything from the past and
keep everything from your
new calculated value and
sum them together, which is
a little bit different
to the GRU, where you're sort of doing
this tradeoff as to how much to take
directly, copy across from the past, versus
how much to use your candidate update.
So the LSTM splits those into two gates,
so you get the sum of them both.
But other than that,
it's exactly the same, right?
Where's my mouse?
The candidate update is
exactly the same as what's
being listed in terms of c tilde and
h tilde, but,
well, sorry,
it's not quite; I guess
the difference is the reset gate, but
otherwise the candidate update is
virtually the same as in the standard LSTM-style unit.
And then for the gates,
the gates are sort of the same,
in that they're using these sort of RNN-style
calculations to get a value
between zero and one for each dimension.
So the differences
are that we added one more
gate, because we kinda have forget and
input gates here, and
the other difference is
that the GRU sort
of has this reset gate where it's saying,
I might ignore part of the past when
calculating my candidate update.
The LSTM is doing it
a little bit differently.
So the LSTM in the candidate update,
it's always using the current input.
But for this other half here, it's not
using ct minus 1, it's using ht minus 1.
So the LSTM has this extra
ht which is derived from ct.
And the way that it's derived from ct
is that there's an extra tanh here but
then you're scaling with this output gate.
So the output gate is sort of equivalent
of the reset gate of the GRU.
But effectively,
it's one time step earlier,
cuz on the LSTM side,
on the preceding time step,
you also calculate an ht by ignoring
some stuff with the output gate,
whereas in the GRU, for
the current time step,
you're multiplying with the reset gate
times your previous hidden state.
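And again, it may help to see one LSTM step as code; a minimal sketch in the same style as the GRU one, biases omitted, names illustrative:

    import numpy as np

    def lstm_step(h_prev, c_prev, x, W_f, U_f, W_i, U_i, W_o, U_o, W_c, U_c):
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        f = sigmoid(W_f @ x + U_f @ h_prev)        # "don't forget" gate
        i = sigmoid(W_i @ x + U_i @ h_prev)        # input gate
        o = sigmoid(W_o @ x + U_o @ h_prev)        # output gate
        c_tilde = np.tanh(W_c @ x + U_c @ h_prev)  # candidate update
        c = f * c_prev + i * c_tilde               # the crucial additive update
        h = o * np.tanh(c)                         # what gets exposed to the world
        return h, c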
That sorta makes sense?
A question.
Right, yes, the don't forget gate.
[LAUGH] You're right, so
it's the question about was the ft.
Is it really a forget gate?
No, as presented here,
it's a don't forget gate.
Again, you could do the 1 minus trick if
you wanted to and call this 1 minus f1,
but yeah, as presented here,
if the value is close to 1,
it means don't forget, yeah, absolutely.
So this one here is genuinely
an update gate, because if the value
of it is close to 1,
you're updating with the candidate update.
And if the value is close to zero,
you're keeping the previous
contents of the hidden state.
>> [INAUDIBLE]
reset.
>> Right, so the reset gate is
sort of a don't reset gate.
[LAUGH] Yeah, okay.
[LAUGH] I'm having a hard time
with the terminology here [LAUGH].
You are right.
Another question?
So okay, so the question was
sometimes you're using ct-1,
and sometimes you're using ht-1.
What's going on there?
And the question is in what sense
is ct less exposed in the LSTM?
Right, so there was something I glossed
over in my LSTM presentation, and
I'm being called on it.
Which is, look, actually for the LSTM, it's ht-1
that's being used everywhere for
all three gates.
So really, when I sort of said
that what we're doing here,
calculating ht, that's sort of
similar to the reset gate in the GRU.
I kind of glossed over that a little.
It's sort of true in terms of thinking of
the calculation of the candidate update,
cuz this ht-1 will then go
into the candidate update.
But it's a bit more than that, cuz actually,
stuff that you throw away with your
output gate at one time step is
then also gonna be thrown away
in the calculation of every
gate at the next time step.
Yeah, and so then the second question is
in what sense is the cell less exposed?
And that's sort of the answer to that.
The sense in which the cell
is less exposed is
the only place that
the cell is directly used,
is to sort of linearly add
on the cell at the previous
time step plus its candidate update.
For all the other computations,
you're sort of partially hiding
the cell using this output gate.
Another question, sure.
Hm, okay, so the question is, gee,
why do you need this tanh here,
couldn't you just drop that one?
Whoops.
Hm.
I'm not sure I have such a good
answer to that question.
>> [INAUDIBLE]
Okay, so Richard's suggestion is,
well, this ct is kind of
like a linear layer, and
therefore it kind of makes sense that you
should add a non-linearity after it.
And that gives you a bit more power.
Maybe that's right.
Well, we could try it both ways and
see if it makes a difference, or
maybe Shane already has,
I'm not sure [LAUGH].
Any other questions?
Make them a softball
one that I can answer.
>> [LAUGH]
>> Okay,
so I had a few more
pictures that went through
the parts of the LSTM
with one more picture.
I'm starting to think I should maybe
not dwell on this in much detail.
Cuz we've sort of talked about
the fact that there are the gates for
all the things.
We're working out the candidate update,
just like an RNN.
The only bit that I just wanna
say one more time is I think
it's fair to say that the whole
secret of these things,
is that you're doing this addition
where you're adding things together.
Well, the addition
is sort of a weighted addition.
But in the addition,
one choice is you're just copying
stuff from the previous time step.
And to the extent that you're copying
stuff from the previous time step,
you have a gradient of 1,
which you're just pushing.
So you can push error directly
back across that, and
you can keep on doing that for
any number of time steps.
So it's that plus, having that plus
with the previous time step rather
than having it all multiplied by a matrix.
That is the central idea that makes LSTMs
be able to have long short-term memory.
And I mean, that has proven to
be an incredibly powerful idea,
and so in general,
it doesn't sound that profound, but
that idea has been sort of driving
a lot of the developments of what's
been happening in deep learning
in the last couple of years.
So we don't really talk,
in this class, about vision systems.
You can do that next quarter in 231N.
But one of the leading ideas that has
been used recently in better systems for
doing vision with
deep learning has been the idea of
residual networks,
commonly shortened to ResNets.
And to a first approximation, so
ResNets is saying gee,
we want to be able to build 100 layer
deep neural networks and
be able to train those successfully.
And to a first approximation,
the way ResNets are doing that is exactly
the same idea here with the plus sign.
It's saying, as you go up each layer,
we're going to calculate some non-linear
function using a regular neural net layer.
But we'll offer the alternative,
which is that you can just shunt
stuff up from the layer before,
add those two together, and
repeat over again and go up 100 layers.
And so this plus sign,
you may have learned in third grade, but
turns out plus signs have been a really
useful part of modern deep learning.
Okay, Yeah, here is my little picture,
which I'll just show.
I think you'll have to sort
of then slow it down to
understand that this is sort
of going backwards from
Time 128 as to how long
information lasts in an LSTM,
and it sort of looks
like this if I play it.
And so if we then try and drag it back,
I think, then I can play it more slowly.
All right, so almost instantaneously,
the RNN has less
information because of
the matrix multiply.
But as you go back,
by the time you've gone back
ten time steps, the RNN has
essentially lost the information.
Whereas the LSTM, even as it goes back,
it starts to lose information, but
it's sort of more
like time step 30 or
something before it's kind of
lost all of its information, which is sort
of the intuition I suggested before.
But something like 100 time
steps you can get out of an LSTM.
We're almost up to the halftime break
and the research highlight,
but before that, a couple of other
things I wanted to say;
here's just a little bit
of practical advice.
So both for the assignment and
for many people's final projects,
you're gonna be wanting
to train recurrent neural
networks with LSTMs at a large scale.
So here are some of the tips
that you should know, yes.
So if you wanna build a big
recurrent neural network,
definitely use either a GRU or an LSTM.
So for any of these recurrent networks,
initialization is really,
really important.
Your recurrent network should work;
if your network isn't
working, oftentimes it's because
the initialization is bad.
So what are the kind of initialization
ideas that often tend to be important?
For the recurrent matrices, that's the one
where you're multiplying by the previous
hidden state or previous cell state,
it's turned out to be
really useful to
make that one orthogonal.
So there's a chance to use your good
old-fashioned linear algebra.
There aren't actually that many
parameters in a recurrent neural net.
And giving an orthogonal
initialization has proved to
be a better way to kinda get
them learning something useful.
Even with sort of these
ideas with GRUs and LSTMs,
you're gonna kinda keep multiplying
things in a recurrent neural network.
So normally, you wanna have
your initialization be small.
If you start off with too large
values, that can destroy things;
try making the numbers smaller.
Here's a little trick, so
a lot of the times we initialize
things near zero, randomly.
An exception to that is when you're
setting the bias of a forget gate;
it normally works out much better
if you set the bias for
the forget gate to a decent-sized
positive number like one or
two, or
a random number close to one or two.
That's sort of effectively saying
you should start off paying
a lot of attention to the distant past.
That's sort of biasing it
to keep long term memory.
And that sort of encourages
you to get a good model.
Which effectively uses long term memory.
And if the long term past stuff isn't
useful, it can shrink that down.
But if the forget gate starts
off mainly forgetting stuff,
it'll just forget stuff and
never change to any other behavior.
In general, these algorithms work much
better with modern adaptive
learning rate algorithms.
We've already been using
Adam in the assignments.
The ones like Adam, AdaDelta,
RMSprop work a lot better than basic SGD.
You do wanna clip
the norms of the gradients.
You can use a number like five,
that'll work fine.
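Just as a rough sketch of what a couple of these tips look like in code, here's orthogonal initialization via a QR decomposition, and gradient norm clipping, again in NumPy-style Python; this is illustrative rather than the exact code you'd write for any particular framework:

    import numpy as np

    def orthogonal_init(n):
        # the Q factor of a random matrix is orthogonal
        q, _ = np.linalg.qr(np.random.randn(n, n))
        return q

    def clip_gradient_norm(grads, max_norm=5.0):
        # rescale all gradients together if their overall norm exceeds max_norm
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads

    # and when you build your LSTM, set the forget-gate bias to something
    # like 1 or 2, so the model starts out keeping its long-term memory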
And so,
we've used dropout in the assignments, but
we haven't actually ever talked
about it much in lectures.
For RNNs of any sort,
it's trivial to do dropout vertically.
And that usually improves performance.
It doesn't work to naively
do dropout horizontally
along the recurrent connections.
Because if you have a reasonable
percentage of dropout and
you run it horizontally, then within
a few time steps, almost every
dimension will be dropped in one of them,
and so you have no information flow.
There has been more recent work
that's talked about ways that you
can successfully do horizontal
dropout in recurrent networks,
including a PhD student
in England who did work on so-called
Bayesian dropout
that works well for that.
But quite commonly, it's still the case
that people just drop out vertically and
don't drop out at all horizontally.
The final bit of advice is be
patient if you're running,
if you're learning recurrent
nets over large data sets,
it often takes quite a while and
you don't wanna give up.
Sometimes if you just train them
long enough they start to learn stuff.
This is one of the reasons why we
really want to get you guys started
using GPUs, because the fact of the matter is,
if you're actually trying to do
things on decent-sized data sets,
you just don't wanna be trying to train
an LSTM or GRU without using a GPU.
One other last tip that we should
mention some time is ensembling.
If you'd like your numbers to be 2%
higher, a very effective strategy,
which again makes it good to have a GPU,
is don't train just one model,
train ten models, and
you average their predictions, and
that normally gives you
quite significant gains.
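As a sketch, ensembling at prediction time can be as simple as averaging the output distributions of several independently trained models; predict_next_word_probs here is a made-up method name, just to show the shape of the idea:

    import numpy as np

    def ensemble_next_word_probs(models, history):
        # each model gives a probability distribution over the vocabulary;
        # their average is usually better than any single model
        probs = [m.predict_next_word_probs(history) for m in models]
        return np.mean(probs, axis=0)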
So here are some results
from MT systems trained
at Montreal again.
So it's different language pairs.
The red ones are a single model.
The purple ones are training 8 models,
and in this case,
it's actually just majority
voting them together.
But you can also sort of
average their predictions, and
you can see it's just giving very nice
gains in performance, using the measure for
MT performance which I'll
explain after the break.
But we're now gonna have Michael up
to talk about the research highlight.
And I'll quickly explain
it until the video is in
there-
>> Okay.
>> After the picture.
>> Okay.
Hi, everyone.
I'm gonna be presenting the paper
Lip Reading Sentences in the Wild.
So our task is basically taking a video,
which we preprocessed into
a sequence of lip-centered images,
with or without audio.
And we're trying to predict like the words
that are being said in the video.
>> Just slide after that one.
Maybe it doesn't
>> The government will pay for both sides.
>> We have to look at whether it
>> Not.
Said security had been
stepped up in Britain.
>> Cool, so anyway,
it's hard to do lip reading.
So anyway, for the rest of this I'll
talk about the architecture they use,
which is what they dub the watch,
listen, attend, and spell model.
I'm gonna talk about some of these training
strategies that might also be helpful for
your final projects.
There's also the dataset, and
the results, which actually surpass
a professional lip reader.
So, the architecture basically
breaks down into three components.
We have a watch component which takes
in the visual and the listening
component which takes in the audio and
these feed information to the attend, and
spell module which outputs
the prediction one character at a time.
And they also use this with like, just the
watch module or just the listen module.
To go into slightly more detail,
for the watch module,
we take a sliding window over
the face-centered images and
feed that into a CNN,
and then the output of
the CNN gets fed into an LSTM
which slides over the time steps.
We output a single state vector S of v,
as well as the set of
output vectors L of v, and
the listen module is very similar.
We take the pre-processed speech and
we again slide over it using an LSTM,
and we have another state vector,
and another set of output vectors,
and then there's the decoding step.
So we have an LSTM that unrolls
over the steps during the decoding, and
the initial hidden state is initialized
as the concatenation of the two hidden
states from the two previous
modules, as well as we have
a dual attention mechanism
which takes in the output
vectors from each of the respective
modules, and we take those together, and
we make our prediction using a softmax
over a multi-layer perceptron.
And so, one strategy that they use is
called curriculum learning.
So ordinarily, when you're training
these sequence to sequence models,
you might tend to just use
one full sentence at a time.
But what they do in curriculum learning
is you start with word-length
segments and then you slowly increase
the length of your training sequences, and
the idea is you're trying to
slowly build up the learning for
the model, and
what happens is it ends up converging
faster as well as decreasing overfitting.
Another thing that they use
is called scheduled sampling.
So ordinarily during training,
you'll be using
the ground truth input, like the gold
character sequence, but
at test time you
wouldn't be using that; you'd just
be using your previous prediction
after every time step.
So what you do in scheduled sampling,
to kind of bridge the difference in
scenarios between training and
testing, is that,
with a small random probability,
you sample from your previous prediction
instead of the ground truth input for
that time step during training.
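A rough sketch of what that looks like inside a decoder's training loop; epsilon is the sampling probability, and these names are placeholders rather than anything from the paper's code:

    import random

    def next_decoder_input(gold_token, previous_prediction, epsilon=0.1):
        # with a small probability, feed the model its own previous prediction
        # instead of the ground-truth token, so training looks more like testing
        if random.random() < epsilon:
            return previous_prediction
        return gold_token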
So the dataset, the authors
collected it from BBC News, and
they have a dataset that's much
larger than the previous ones
out there, with over 17,000
vocabulary words, and
they do quite a bit of processing,
like centering on the lips, and
doing the alignment of the audio
and the visuals.
So, just to talk about the results, I
guess the most eye-popping result is that
they gave the test set to
a company that does professional
lip reading, and they were only able to get
about one in four words correct,
whereas this model was able to get one in two,
roughly, based on word error rate.
And they also did some other
experiments as well, looking at,
if you combine the lip reading
with the audio,
you get a slightly better model, which
shows that using both modalities improves
the model, as well as looking at what
happens if you add noise to the model.
Great.
Thanks.
>> [APPLAUSE]
>> Thanks, Michael.
Yeah, so obviously,
a lot of details there.
But again, that's kind of an example of
what's been happening with deep learning
where you're taking this basic model
architecture, things like LSTM and saying,
here's another problem,
let's try it on that as well and
it turns out to work fantastically well.
Let's see, 20 minutes left.
I'll see how far I can get in teaching
everything else about machine
translation.
So something I did just want
to explain is, back here and
in general, when we've been showing
machine translation results,
we've been showing these
graphs where up is good, and
what's being measured with these
numbers are things called BLEU scores.
So, I wanted to give you some idea of how
and why we evaluate machine translation.
So the central thing to know about machine
translation is if you take a paragraph of
text and give it to ten
different human translators,
you'll get back ten
different translations.
There's no correct answer
as to how to translate
a sentence into another language.
And in practice, most of the time
all translations are imperfect, and
it's kind of a matter of deciding what you wanna pay
most attention to: do you want to
maximally preserve the metaphor that
the person used in the source language, or
do you wanna more directly
convey the meaning it conveys,
because that metaphor won't really be
familiar to people in the target language?
Do you want to choose sort
of short direct words,
because it's written in a short,
direct style?
Or do you more want to sort of,
you choose a longer word that's
a more exact translation?
There's all of these decisions and
things, and in some sense a translator is
optimizing over them, if we put it
in machine learning terms,
but the reality is the objective is
sort of not very clear.
You have lots of syntactic choices
as whether you make it a passive or
an active and word order, and so on.
No right answer.
So we just can't do it like a lot of
things, saying, here's the accuracy;
that was the answer you were meant to produce.
So, how do you do it?
So, one way to do MT evaluation
is to do it manually.
You get human beings to look
at translations and to say,
how good they are.
And to this day, basically,
that's regarded as the gold standard
of machine translation evaluation,
because we don't have a better
way to fully automate things.
So one way of doing that is things
like Likert scales, where you're
getting humans to judge
translations for adequacy,
which is how well they convey the meaning
of the source, and fluency, which is
how natural the output sentence
sounds in the target language.
Commonly, a way that's more easily
measurable, that people prefer
if you're comparing systems for goodness,
is that you directly ask human beings
to do pairwise judgments of which is
better, translation A or translation B.
I mean, it turns out that even
that is incredibly hard for
humans to do, as someone who has sat around
doing this task of human evaluation.
I mean, all the time, it's kind of, okay,
this one made a bad word choice here and
this one got the wrong verb form
there; which of these do I
regard as a worse error?
Okay, that's still the best
thing that we can do.
It has problems.
Basically, it's slow and expensive to get
human beings to judge translation quality.
So what else could we do?
Well, another obvious idea is to say,
well, if we can embed machine
translation into some task, that might be
more easily evaluable.
We could just see which MT system
lets us do the final task better.
So, say we'd like to do question answering
over foreign language documents.
We'll just get our
questions-answered-correctly score, and
that'll be much easier to measure.
And that's something that you can do, but
it turns out that that often
isn't very successful.
Cuz commonly your accuracy
on the downstream task is
very little affected by many of
the fine points of translation.
An extreme example of that is sort of
like cross-lingual information retrieval,
where you're just wanting
to retrieve documents relevant
to a query in another language.
Providing you can kind of produce
some of the main content words in
the translation, it really doesn't matter
how you screw up the details of syntax and
verb inflection.
It's not really gonna affect your score.
Okay, so what people have
wanted to have is a direct
metric that is fast and cheap to apply.
And for a long time, I think no one
thought there was such a thing.
And so
then starting in the very early 2000s,
people at IBM suggested
this first idea of, hey,
here's a cheap way in which we can
measure translation quality.
And so they called it the BLEU metric.
And so
here was the idea of how they do that.
What they said is let us
produce reference translations.
We know that there are many, many possible
ways that something can be translated.
But let's get a human being to
produce a reference translation.
So what we are going to do is then we're
going to have a reference translation by
a human, and
we're going to have a machine translation.
And to a first approximation we're
going to say that the machine
translation is good to the extent
that you can find word n-grams.
So sequences of words like three
words in a row, two words in a row,
which also appear in the reference
translation anywhere.
So what are the elements of this?
So by having multi-word sequences,
that's meant to be trying to
judge whether you have some understanding
of the sort of right syntax and arguments.
Because you're much more likely
to match a four word sequence
if it's not just you've
got a bag of keywords.
You actually understand something
of the syntax of the sentence.
The fact that you can match it anywhere
is meant to be dealing with the fact that
human languages normally have
quite flexible word order.
So it's not adequate to insist that
the phrases appear in the same word order.
Of course, in general in English, a lot of
the time you can say, last night I went
to my friend's place, or,
I went to my friend's place last night.
And it seems like you should
get credit for last night
regardless of whether you put it at
the beginning or the end of the sentence.
So, that was the general idea
in slightly more detail.
The BLEU measure is a precision score.
So it's looking at whether
n-grams that are in the machine
translation also appear in
the reference translation.
There are a couple of fine points then.
You're only allowed to count each
reference n-gram once.
So if in your translation,
the airport appears three times,
but there's only one
the airport in the reference,
you're only allowed to count one of
them as correct, not all three of them.
And then there's this other trick that we
have, this thing called a brevity penalty.
Because if it's purely
a precision-oriented measure,
saying is what appears in the machine
translation in the reference.
There are games you could play,
like you could just translate
every passage with the word the.
Because if it's English the word the is
pretty sure to appear somewhere in
the reference translation,
and get precision one.
And that seems like it's cheating.
So if what you're producing as your translation
is shorter than the human translations,
you'll lose.
Okay, so more formally, so you're doing
this with n-grams up to a certain size.
Commonly it's four, so you use single
words, pairs of words, triples,
and four-word sequences.
You work out this kind
of precision for each.
And then you're working out a kind
of a weighted geometric mean
of those precisions.
And you multiply that
by a brevity penalty.
And the brevity penalty penalizes
you if your translation
is shorter than the reference translation.
There are some details here, but
maybe I'll just skip them and go ahead.
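To make that concrete, here's a rough sketch of the core of BLEU with a single reference, clipped n-gram precision times a brevity penalty; it leaves out the details I'm skipping, like smoothing and multiple references:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=4):
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(candidate, n))
            ref_counts = Counter(ngrams(reference, n))
            # clip: an n-gram only counts as often as it appears in the reference
            overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            log_precisions.append(math.log(max(overlap, 1e-9) / total))
        geo_mean = math.exp(sum(log_precisions) / max_n)   # geometric mean
        # brevity penalty: punish candidates shorter than the reference
        bp = min(1.0, math.exp(1.0 - len(reference) / max(len(candidate), 1)))
        return bp * geo_mean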
So there's one other idea then which is,
well, what about this big problem that,
well, there are a lot of different
ways to translate things.
And there's no guarantee that your
translation could be great, and
it might just not match
the human's translation.
And so the answer to that that
the original IBM paper suggested
was what we should do is collect
a bunch of reference translations.
And the suggested number that's
been widely used was four.
And so then, most likely,
if you're giving a good translation,
it'll appear in one of
the reference translations.
And then, you'll get a matching n-gram.
Now, of course,
that's the sort of a statistical argument.
Cuz you might have a really
good translation and
none of the four translators chose it.
And the truth is then in
that case you just lose.
And indeed what's happened in more
recent work is quite a lot of the time,
actually, the BLEU measure is only
run with one reference translation.
And that seems a little bit cheap.
And it's certainly the case that if you're
running with one reference translation,
you're either just lucky or unlucky as to
whether you guessed to translate the way
the translator translates.
But you can make a sort of a statistical
argument which by and large is valid.
That if you're coming up
with good translations,
providing there's no correlation somehow
between one system and the translator.
That you'd still expect on balance that
you'll get a higher score if you're
consistently giving better translations.
And broadly speaking, that's right.
Though this problem of correlation does
actually start to rear its head, right?
That if the reference translator always
translated the things as US, and
one system translates with US, and the
other one translates with United States.
Kind of one person will get lucky, and
the other one will get unlucky
in a kind of a correlated way.
And that can create problems.
So even though it was very simple
when BLEU was initially introduced,
it seemed to be miraculously
good that it just corresponded
really well with human judgments
of translation quality.
Rarely do you see an empirical
data set that's as linear as that.
And so this seemed really awesome.
Like many things that
are surrogate metrics,
it's one of those metrics that
work really well if no one is trying
to optimize them, but don't work so well
once people are trying to optimize them.
So what happened then was,
everyone evaluated their systems
on BLEU scores and so therefore,
all researchers worked on how to make
their systems have better BLEU scores.
And then what happened is this
correlation graph went way down.
And so the truth now is that,
and this relates to
when I was saying the Google
results were exaggerated,
current MT systems
produce BLEU scores that are very similar
to human translations for many language
pairs, which reflects the fact that
different human beings are quite creative
and vary in how they translate sentences.
But in truth, the quality of machine
translation is still well below
the quality of human translation.
Okay, few minutes left to
say a bit more about MT.
I think I can't get through
all this material, but
let me just give you a little
bit of a sense of some of it.
Okay, so one of the big problems you
have if you've tried to build something,
any kind of generation system,
where you're generating words is you have
a problem that there are a lot of words.
Languages have very large vocabularies.
So from the hidden state,
what we're doing is multiplying by this
matrix of Softmax parameters,
which is the size of the vocabulary
times the size of the hidden
state doing this Softmax.
And that's giving us
the probability of different words.
And so the problem is if you wanna
have a very large vocabulary,
you spend a huge amount of time just doing
these Softmaxes over, and over again.
And so, for instance, you saw that in
the kind of pictures of the Google system,
that over half of their
computational power was just going
into calculating these softmaxes, so
that's been a real problem.
So something people have worked
on quite a lot is how can we
shrink the cost of that computation.
Well one thing we can do is say,
ha, let's use a smaller vocabulary.
Let's only use a 50,000 word
vocabulary for our MT system, and
some of the early MT
work did precisely that.
But the problem is that if you do that,
instead of starting with lively sentences,
what you get out is unk,
unk, unk, because all of
the interesting words in the sentence fall
outside of your 50,000 word vocabulary.
And those kinds of sentences are not very
good ones to show to human beings,
because they don't like them very much.
So, it seems like we need to
somehow do better than that.
So, there's been work on, well, how can
we more effectively do the softmaxes
without having to do as much computation.
And so,
there have been some ideas on that.
One idea is to sort of have a hierarchical
softmax, where we do the standard
computer scientist trick
of putting in a tree structure
to reduce our amount of computation.
So if you can sort of divide the
vocabulary into tree pieces and
go down branches of the tree,
we can do less computation.
Remember, we did noise
contrastive estimation for
word vectors; that was a way
of avoiding computation.
Those are possible ways to do things.
They are not very
GPU-friendly unfortunately.
Once you start taking branches down
the tree, you then can't do the kind
of nice, just bang bang bang, type
of computations on the GPU.
So there's been work on coming
up with alternatives to that, and
I wanted to mention one example of this.
And an idea of this is well,
maybe we can actually
sort of just work with small
vocabularies at any one time.
So when we're training our models,
we could train using subsets of
the vocabulary because there's a lot
of rare words but they're rare.
So if you pick any slice of the training
data most rare words won't be in it.
Commonly if you look at your
whole vocabulary about 40% of
your word types occur only once.
That means if you cut your
data set into 20 pieces,
19 of those 20 will not contain that word.
And then,
we also wanna be smart on testing.
So we wanna be able to, at test time
as well, generate sort of a smaller
set of words for our softmax, and so we
can be fast at both train and test time.
Well, how can you do that?
Well, so at training time,
we want to have a small vocabulary.
And so we can do that by partitioning the
training data;
each slice of the training data
will have a much smaller vocabulary.
And then we could partition randomly, or
we could be even smarter and
cut it into pieces
that have similar vocabulary.
If we put all the basketball
articles in one pile and
all the football articles in another
pile, that will shrink the vocabulary further.
And so they look at ways of doing that,
so in practice they can get down an
order of magnitude or more in the size
of the vocab that they need for
each slice of the data, and that's great.
Okay, so what do we do at test time?
Well, what we wanna do
at test time as well,
when we're actually translating,
is use a much smaller vocabulary.
Well, here's an idea of
how you could do that.
Firstly, we say, there are just common
function words that we're always
gonna want to have available.
So we pick the K most frequent words and
say we're always gonna
have them in our Softmax.
But then for the rest of it,
what we're actually gonna do is
sort of have a lexicon on the side
where we're gonna know about likely
translations for each source word.
So that we'll have stored ways that would
be reasonable to translate she loves
cats into French.
And so when we're translating a sentence,
we'll look out for
each word in the source sentence what
are likely translations of it and
throw those into our candidates for
the Softmax.
And so then we've got a sort
of a candidate list of words.
And when translating
a particular source sentence,
we'll only run our
Softmax over those words.
And then again, we can save well over
an order of magnitude in computation.
So, K prime is about 10 or 20, and
K is sort of a reasonable-size vocab.
We can again sort of cut at least
an order of magnitude off the size of
our softmaxes and
act as if we had a large vocabulary.
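A sketch of how you might assemble that candidate set for one source sentence, assuming you've got a lexicon mapping each source word to a ranked list of likely translations (all the names here are illustrative):

    def candidate_vocab(source_sentence, lexicon, frequent_words, k_prime=20):
        # always keep the K most frequent target words (function words and so on)
        candidates = set(frequent_words)
        # plus the top few likely translations of each source word
        for word in source_sentence:
            candidates.update(lexicon.get(word, [])[:k_prime])
        return candidates   # the softmax is run only over these words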
There are other ways to do that too,
which are on this slide.
And what I was then going to go on,
and we'll decide whether it does or
doesn't happen based on the syllabus.
I mean, you could sort of say,
well, that's still insufficient
because I sort of said that you have
to deal with a large vocabulary.
And you've sort of told us how to deal
with a large vocabulary more efficiently.
But you've still got problems, because
in any new piece of text you give it,
you're going to have things like new
names turn up, new numbers turn up, and
you're going to want to
deal with those as well.
And so
it seems like somehow we want to be able
to just deal with new stuff at test time,
at translation time.
Which effectively means that
kind of theoretically we have
an infinite vocabulary.
And so, there's also been a bunch of
work on newer machine translation and
dealing with that.
But unfortunately, this class time is not
long enough to tell you about it right
now, so I'll stop here for today.
And don't forget, outside you can
collect your midterm on the way out.
