[MUSIC]
Stanford University.
>> All right, hello, everybody.
And welcome to lecture number nine.
Today, we'll do a brief recap,
some organizational stuff, and
then we'll talk about, I love when we call
them fancy, recurrent neural networks.
But those are the most important
deep learning models of the day,
LSTMs and GRU type models.
They're very exciting and really form
the base model for pretty much every
deep learning paper or almost all the deep
learning papers you see out there.
So after today, you'll really have in your hands the kind of tool that is the default tool for a lot of different deep learning NLP applications, so super exciting.
And the best part is, you kind of know
most of the important math already of it,
so we can just define the model.
And everything else will kind of
follow through with these basic, and
sometimes painful building blocks
that we went through before.
All right, before we jump in,
some organizational stuff.
We have a new office hour schedule and locations. We're continuously trying to optimize the whole process based on your feedback.
Thanks for that.
So I'll have office hours every day,
multiple times.
I hope that will allow us to kind of
distribute the load a little bit,
cuz I know sometimes lots of people
come to one office hour, and
then there's a long wait there.
Also, it's important that you register for
your GPU teams by the end of today,
or ideally before.
So that we can make
sure you all get a GPU.
Ideally, we also encourage people to form pairs for Problem Sets 3 and 4 and the project, at least pairs, cuz we only have 300 or so GPUs and almost 700 students in the class. So try to form teams, but do make sure that you don't just have your partner implement all the GPU stuff while you do all the other parts of the problem set.
Cuz then you really miss out on
a very important valuable skill for
both research and applied deep learning,
if you don't know how to use a GPU.
And sadly,
I have to get back to some work event.
So I'll have a pretty
short office hour today.
But then I know we have the deadline for project proposals on Thursday.
So on Thursday,
I'm gonna have an unlimited office hour.
I'm gonna start after class, and it will end when the queue is empty.
So if you come half an hour late,
prepare to talk to me three hours later.
So [LAUGH] the project is the coolest
part, so I don't wanna discourage
people from doing the project
because we don't have enough.
So it's gonna be great.
I'll bring food.
You should bring food too,
and-
>> [LAUGH]
>> [LAUGH]
>> It's kind of good fun, all right.
If by any chance, even after midnight, the queue is still full, which I doubt at that point, I think, I hope, then we can push the proposal deadline out for those people, or you can submit the proposal and we'll figure out the mentor situation very soon thereafter.
So all right, any questions
around any class organization?
All right, then let's dive right in.
So basically today, we'll have a very advanced, cutting-edge blast from the past. Because while, pedagogically, it'll make sense for us to first talk about a model from 2014, from just three years ago, the main model we'll end up with, the long short-term memory, is actually a very old model, from '97, and has kind of been dormant for a while. It's a very powerful model; you need a lot of training data for it, and you need fast machines for it.
But now, that we have those two things,
this is a very powerful model for NLP.
And if you ask one of the inventors,
the second model is really just
a special case of the LSTM.
But I think, pedagogically, it makes sense to sort of first talk about the so-called Gated Recurrent Unit, which is a slightly simpler version.
And we'll use machine translation, which is, you might argue, one of the most useful tasks of NLP, sort of a real-life task.
Something that actual people
outside academia, outside research,
outside linguistics really care about.
And by the end, you'll actually have the
skills to build one of the best machine
translation models out there, modulo
a lot of time and some extra effort.
But the biggest parts of 90% of the top MT systems out there you'll be able to at least understand, and probably also build, if you have the GPU skills after this class.
All right, so I'm not gonna go through too many of the details, but just in preparation, to mentally make you also think about the midterm that's coming up.
Next lecture, we'll have midterm review.
But ideally, these kinds of
equations that I'm throwing up here,
you're pretty familiar with.
At this point, you're like yeah,
I just do some negative
sampling here from my Word2Vec.
And I have my inside and
my outside vectors in the window.
And similarly for GloVe, I have two sets of vectors, and you optimize this. You have a function here that limits how important very frequent pairs are in your co-occurrence matrix.
You understand the max-margin
objective function.
You have scores of good windows
from the large training corpus and
corrupted windows.
So all of these should be familiar.
And if not, then you really should also
start thinking about sort of studying
again for the midterm.
The most basic definition of a neural net, where we just have some score at the end or some softmax.
And really being comfortable
with these two final equations,
that if you understand those, all
the rest of the models will basically be,
in many cases, sort of fancy versions or
adapted versions of these two equations.
So that's important.
And then we'll have our standard recurrent
neural network that we already went
through, and we kind of assume you
should know for the midterm as well.
And our cross-entropy error, as one of the main loss or objective functions that we optimize.
And when we optimize,
we usually use the Mini-batched SGD.
We don't go through a single example at a time. We don't go through the entire batch of our training data, but we take 100 or so examples each time we train.
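To make that concrete, here's a rough Python sketch of a mini-batched SGD loop; the function and parameter names here are just illustrative, not the problem set starter code:

    import numpy as np

    def minibatch_sgd(params, grad_fn, data, batch_size=100, lr=0.01, epochs=5):
        # params: dict of parameter arrays; grad_fn returns a dict of gradients
        # averaged over one batch of roughly 100 examples.
        data = list(data)
        for _ in range(epochs):
            np.random.shuffle(data)
            for i in range(0, len(data), batch_size):
                batch = data[i:i + batch_size]
                grads = grad_fn(params, batch)
                for name in params:
                    params[name] -= lr * grads[name]
        return params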
So all those concepts, you should
feel reasonably comfortable now.
And if not, then definitely come
back to the office hours, and
start sort of studying for the midterm.
All right, and we'll go over more
midterm details in the next lecture.
All right, now, onto the main topic
of today, machine translation.
So you might think for some NLP tasks
that you can get away with thinking of
all the rules that, for
instance, sentiment analysis.
A sentence might come up positive or
negative, right?
You say, I have a list of all the positive
words, most of all the negative words.
And I can think of the ways you can negate
positive words and things like that.
And you could maybe conceive of
creating a sentiment analysis system
of just all your intuitions
about linguistics and sentiment.
That kind of approach is completely
ridiculous for machine translation.
There's no way; nobody will ever be able to think of all the different rules and exceptions for translating all possible sentences of one language into another.
So basically, the baseline that's
pretty well established is that
all machine translation systems
are somewhat statistical in nature.
We will always try to take a very large corpus. In fact, we'll have so-called parallel corpora, where we have a lot of the sentences or paragraphs in one language, and we know that this paragraph in this language translates to that paragraph in another language. One of the popular parallel corpora of all time, for the last couple thousand years, has been the Bible, for instance.
You'll have Bible translated.
It has nice paragraphs.
And each paragraph is translated
in different languages.
That would be one of
the first parallel corpora.
The very first is actually
the Rosetta Stone.
Which allowed people to have
at least some understanding
of ancient Egyptian hieroglyphs.
And it's pretty exciting if you're into historical linguistics. It basically allows us to relate those hieroglyphs to the Demotic script and to ancient Greek, which we still know.
And so we can gain some intuition about
what's going on in the other two.
Now, in the next couple of slides,
I will basically try to bring across
to you that traditional statistical
machine translation systems are very,
very complex beasts.
And it would have been impossible for me to say at the end of the lecture, all right, now you can implement this whole thing yourself after just one lecture going over MT, cuz there are a lot of different moving parts.
So let's walk through this.
You won't have to actually implement a traditional statistical MT system in this class.
But I want you to appreciate
a little bit the history.
And why deep learning is so impactful and
amazing for machine translation.
Cuz it's replacing a lot of different submodules in these very complex models. And sometimes it still uses ideas from them, but not very many; most of them we don't need anymore for neural machine translation systems.
All right, so let's set the stage.
We have generally a source language.
Let's call that f, such as French.
And we have a target language, e,
in our case, let's say it's English.
So we wanna translate from the source
French to the target language of English.
And we'll usually describe this here with a simple Bayes rule, where we basically try to find the target sentence, usually e here we assume is the whole sentence in the target language, that gives us the largest conditional probability conditioned on f.
So this is an abstract formulation.
We'll try to fill in how to actually
compute these probabilities
in traditional and then later in
neural machine translation systems.
So now we can use Bayes rule: the posterior equals the prior times the likelihood, divided by the marginal evidence. The marginal evidence here would just be for the source language, so that doesn't change. So we can drop it; the argmax would not change from that.
So basically, we'll try to compute these two factors here: the probability of the French sentence, or the source language, given the target, times the probability of just the target.
And now, we'll basically call
these two elements here.
One is our translation model.
And the other one is our language model.
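Written out in one line, the decomposition just described is:

    \hat{e} = \arg\max_e \, p(e \mid f)
            = \arg\max_e \, \frac{p(f \mid e)\, p(e)}{p(f)}
            = \arg\max_e \, \underbrace{p(f \mid e)}_{\text{translation model}} \; \underbrace{p(e)}_{\text{language model}}

where p(f) in the denominator doesn't depend on e, so dropping it doesn't change the argmax.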
Remember language modeling where we
tried to get the probability of a longer
sequence.
This is a great use case for it.
Basically, you can think of this
as you get some French sentence.
Your translation model will try to find.
Maybe this phrase,
I can translate into that.
And this phrase,
I can translate into this.
And then you have a bunch
of pieces of English.
And then your language model will essentially be combined with those in the decoder to try to get a single, smooth sentence in the target language.
So it'll help us to take all these pieces
that we have from the translation model.
And make it into one sentence that
actually sounds reasonable and flows and
is grammatical and all that.
So the language model helps us to
weight grammatical sentences better.
So, for instance, I go home will sound better than I go house, right? Because I go home will have a higher probability; it's a more likely English sentence to be uttered.
Now, how do we actually train
all these different pieces?
And how would you go about doing this?
Well, if you wanted to do this translation model here, then the first thing you'd have to do is find so-called alignments. The goal of the alignment step is to know which word or phrase in the source language would translate to which word or phrase in the target language. And that's a hard subproblem already.
And now, again,
we have these three different systems.
And now we're zooming in to
the step one of that system.
Now that one is already hard
because alignment is non-trivial.
These are actually some cool examples from a previous incarnation of Chris's class, 224, from previous years.
Here are some examples of why
alignment is already hard.
And this is for a language pair
that is actually quite similar.
English and French share a lot
of common history, and so on,
and they're more similar.
But even if we have these two sentences
here, like Japan shaken by two new quakes.
Or Le Japon secoue par
deux nouveaux seismes.
Then we'll basically have
here a spurious word.
So Le was actually not
translated to anything.
And we would skip it in our alignment.
So you see here this alignment matrix.
And you'll notice that Le
just wasn't translated.
We don't say the Japan, or
a Japan, or something like that.
So it gets trickier, though.
Cuz there are also so
called zero fertility words
that are not translated at all.
So they're in the source and we just drop them.
And, for some reason, the translators,
or for grammatical reasons and
so on, they don't actually have any
equivalent in the target language.
And to make it even more complex,
we can also have one-to-many alignments.
So 'implemented' in English is actually 'mis en application' in French. So 'made into application' of sorts is just the one verb 'implemented' here. So then we'll have to try to find that. And now, as you try to think through algorithms that might do this alignment for you, you'll have to think: this word could go to either this one word, or no word, or these three words together, or maybe these two words together.
And you can see how that would create, if
you tried to go through all the statistics
and collect all of these probabilities,
of which phrase would go to what phrase.
It'll get pretty hard to
actually combine them all.
And language is just incredible and
very complex.
And you also have many-to-one alignments.
So aboriginal people are just
autochtones in French.
So similar actually in German, [FOREIGN].
So you'd have two words in German.
And so, you have many-to-one
alignments making the combinatorial
explosion even harder if you
try to find good alignments.
And lastly,
you'll also have many-to-many alignments.
You have certain phrases
like don't have any money.
This just goes to sont demunis in French.
And so it's a very, very complex
problem that has combinatorial
explosion of all potential
combinations and it's tricky.
All right, so now, really,
if you were to take a traditional class,
you could have several lectures,
or at least an entire lecture,
just on the various ways you could
implement cleverly an alignment model.
And sometimes,
people use just single words.
And other times, they actually use parses like the ones you're now familiar with, syntactic parses, and try to find which, not just words, but phrases from a parse would map to the other language.
And then, of course, it's not just that. Not only are sentences in the two languages usually not nicely aligned, but you can also have complete reorderings. So German sometimes, for subclauses, actually has the verb at the end, so you flip a lot of the words, and you can't just have this locality assumption that words roughly in this area will translate to roughly a similar area, in terms of the sequence of words in the other language.
So yeah, 'ja nicht' here, 'ja' is technically just 'yes' in German, is also not translated at all. And then other words actually go over there and move around also.
All right, now let's say we have
all these potential alignments, and
now as we start from the source
language we say, all right.
Let's say the source here is this German
sentence, geht ja nicht nach hause.
Now each word could be translated into many different words. So the German word is technically just the 'he' of he, she, it, the 'es' in German. But sometimes in English, as you do your alignment, a not unreasonable option is just 'it', or 'he', or 'he will be', cuz those were dropped before in the alignment and so on.
So you now have lots of candidates for
each possible word and for
each possible phrase that you
might want to combine now in
some principled way to
the final target translation.
So you have again here a combinatorial
explosion of lots of potential ways you
could translate each of the words or
phrases of various lengths.
And so basically what that means is
you'll have a very hard search problem
that also includes having to
have a good language model.
So that as you put all these pieces
together, you essentially try to keep
saying or combining phrases that
are grammatically plausible or
sound reasonable to native speakers.
And this often ends up being
so-called beam search,
where you try to keep around a couple of
candidates as you go from left to right
and you try to put all of these
different pieces together.
Now again, this is totally not
doing traditional MT justice.
Right, we just went in five minutes over what could have been an entire lecture on statistical machine translation, or maybe even multiple lectures.
So there are lots of important
details we skipped over.
But the main gist here
is that there's a lot of
human feature engineering that's required
and involved in all of these different
pieces that used to require building
a machine translation system.
And it also meant that there were whole
companies that you could form just for
machine translation because nobody
could go through all that work and
really build out a good system.
Whereas now you have companies that have
worked for decades in this and they start
using an open-source machine translation
system that anybody can download.
And now a normal student, a PhD student, can spend a couple of months and then have one of the best MT systems. Which would have been completely impossible before, with the large groups in academia that all worked together on very large systems.
So one of the main problems
of this kind of approach,
is actually that not only is
it a very complex system, but
it's also a system of independently
trained machine learning models.
And if there's one thing that I like most, one property of deep learning models, not just for MT but in all of NLP and maybe in all of AI, it is that in deep learning we usually try to have end-to-end trainable models, where you have your final objective function that you care about and everything is learned jointly in one model.
And this MT system is kind
of the opposite of that.
You have an alignment model
you optimize for that, and
then you have a reordering model maybe,
and then you have the language model.
And they're all separate systems and you
couldn't jointly train all of it together.
So that's kind of the very quick summary for traditional machine translation.
Any high level questions
around traditional MT?
All right, so now deep learning to
the rescue, maybe, probably.
So let's go through a sequence of
models and see if they would suffice.
So the simplest one that we could
possibly do is kind of an encoder and
decoder model that looks like this.
Where we literally just have
a single recurrent neural network,
where we have our word vectors so
let's say here
we translate from German to English
Echt Kiste is awesome sauce in English.
And we now have our word vectors
here we learned them in German, and
we have our soft max classifier here.
And we just have a single recurrent neural
network and once it sees the end of German
sentence and there's no input left we'll
just try to output the translation.
Not totally unreasonable,
it's an end-to-end trainable model.
We'll have our standard cross-entropy error here that tries to just predict the next word.
But the next word actually has
to be in a different language.
Now, basically this last vector here,
if this was our main model,
this last vector would have to
capture the entirety of the phrase.
And sadly, I've already told you
that usually five or six words or so
can be captured and
after that, we don't really,
we can't memorize the entire
context of the sentence before.
So this might work for like,
very short sentenced but maybe not.
But let's define what this model
would be in its most basic form,
cuz we'll work on top of this afterwards.
So we have here our standard recurrent
neural network from the last lecture.
Where we have our next hidden state, it's just basically a linear layer here followed by an element-wise nonlinearity. And we sum here the matrix-vector product with the previous hidden state and the one with our current word vector xt.
And that's our encoder and
then in our decoder in the simplest form,
again not the final model,
in the simplest form we could just
drop this cuz the decoder doesn't
have an input at that time.
Right, it's just we wanna now
translate and just generate an output.
So during decoding we drop this matrix-vector product and we just go each time step, basically moving along based on the previous hidden state.
And we'll have our final softmax output
here at each time step of the decoder.
Now I also introduced this phi notation here; we'll see this only in the next couple of slides. But whenever I write phi of two vectors, that means we'll have a separate W matrix for each of these vectors. It's just a little shorter notation. And then the default here would be, well, just like I said, minimize the cross-entropy error for all the target words conditioned on all the source words that we hope would be captured in that hidden state.
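If it helps to see the whole thing at once, here's a minimal numpy sketch of this most basic encoder-decoder; the weight names W_hh, W_hx, W_s and the choice of tanh are just my stand-ins for the matrices and nonlinearity on the slide:

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def encode(source_vectors, W_hh, W_hx, h):
        for x in source_vectors:                 # x: word vector for each source word
            h = np.tanh(W_hh @ h + W_hx @ x)     # h_t = f(W_hh h_{t-1} + W_hx x_t)
        return h                                 # the last hidden state has to summarize the sentence

    def decode(h, W_hh, W_s, n_steps):
        outputs = []
        for _ in range(n_steps):
            h = np.tanh(W_hh @ h)                # simplest form: no input term during decoding
            outputs.append(softmax(W_s @ h))     # distribution over the target vocabulary
        return outputs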
All right, any questions, concerns,
thoughts about how this model would do?
So, the comment or question is that neither the traditional model nor this model accounts for grammar. And in some ways, that's not true. There are actually a lot of traditional models that work on top of syntactic, grammatical tree structures. And they do this alignment based on the syntactic structure, potentially for the alignment step, but also for the generation and the encoding step and all these different steps. So there are several ways you can infuse grammar and grammatical sorts of priors into machine translation systems, the so-called syntactic machine translation systems.
It turns out it's questionable
if that actually helps.
In many cases for machine translation,
you have such a broad range of sentences.
You actually might have un-grammatical
sentences sometimes, and
you still want them to be translated.
You have very short,
complex ambiguous kinds of
sentences like headlines and so on.
So it's tricky; the jury was sort of out, and syntactic models were battling it out with non-syntactic models until neural machine translation came.
And now, it's not as important
of a question anymore.
Now, for neural systems, we would assume
and hope that our hidden state actually
captures some grammatical structures and
some grammatical intuitions that we have.
But we don't explicitly give
that to the algorithm anymore.
Which some people who are very good at giving those kinds of features to their algorithms might think is sad. But at the same time, it's good if we don't have to, right? It's less work for us, putting more artificial back into artificial intelligence, less human intelligence on designing grammars.
Anyways, so any other questions?
Yeah.
Good question, so sometimes the number of input words is different from the number of output words, and that's very true. So one modification we would have to make to this kind of model, for sure, is to have the last output word be a special stop token that says, I'm done. And you add that to your softmax classifier as sort of the last row. And then you hope that when it predicts that token, it just stops. And that is good enough and actually not uncommon for all these neural machine translation systems.
The superscript S is just again,
to distinguish the different
W matrices that we have for
hidden connections, visible or
hidden inputs, and softmax W.
All right, now sadly,
while neural MT is pretty cool, and
it is simpler than traditional systems,
it's not quite that simple.
So we'll have to be a little more clever.
And so let's go through a series of
extensions to this model where in the end,
we'll have a very big
powerful LSTM type model.
So step one, is we'll actually have
different recurrent neural network weights
for encoding and decoding.
So instead of having the same W here,
we actually should have a different set
of parameters, a different W for
the decoding step.
That's still relatively similar.
All right, so again, remember this notation here of phi, where every input has its own matrix W associated with it.
The second modification is
that the previous hidden state
is kind of the standard that you
have as input for during decoding.
But instead of just having
the previous hidden state,
we'll actually also add the last
hidden vector of the encoding.
So we call this c here,
but it's essentially ht.
So at this input here, we don't just
have the previous hidden state,
but we always take the last hidden
state from the encoding step.
And we have, again,
a separate matrix for that.
And then on top of that, we will also add,
and that's actually, if you think about
it, it's a lot of parameters, we'll add
the previous predicted output word.
So as we translate,
we have three inputs for each hidden
state during the decoding step.
We'll have the previous hidden state
as a standard recurrent neural network.
We have the last hidden
state of the encoder.
And we have the actual output word
we predicted just before that.
And this will essentially help the model to know that it just output a word, and it'll prevent it from outputting that word again. Cuz it'll learn to transform the hidden state based on having just output a specific word before.
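As a small sketch of one decoder time step with those three inputs, where the matrix names and the greedy argmax choice are just my shorthand for what was described:

    import numpy as np

    def decoder_step(h_prev, c, y_prev_vec, W_h, W_c, W_y, W_s):
        # h_prev: previous decoder hidden state, c: last encoder hidden state,
        # y_prev_vec: vector for the word predicted at the previous step.
        h = np.tanh(W_h @ h_prev + W_c @ c + W_y @ y_prev_vec)
        scores = W_s @ h
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        y_hat = int(np.argmax(probs))            # hard choice: feed the single best word back in
        return h, y_hat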
Yeah?
That's right, that's right, yeah.
So whenever you have phi of x, y, z here, it's just f of W times x, plus U times y, plus V times z.
So you just,
I don't wanna define all the matrices.
That's a great question.
So why do we need to make y t-minus-1 an input, if we actually had computed y t-minus-1 from h t-minus-1, right?
So two answers, one, it will allow
us to have the softmax weights
also modify a little bit how that
hidden state behaves at test time.
And two,
we actually can choose usually yt, and
there are different ways you can do this.
You could take the actual probability, the
multinomial distribution from the softmax.
But here,
we'll actually make a hard choice, and
we'll actually tell the model
we chose exactly this one.
So instead of having the distribution,
we'll make a hard choice.
And we say, this is the one word that had the highest probability, we predicted that one, and that's the one we give as input.
So it turns out in practice,
that helps to prevent the model
from repeating words many times.
And again, it incorporates the softmax
weights in that computation indirectly.
Yeah.
That is a good catch.
That is not how we define the model.
Ignore those errors.
Yeah, well done.
In theory, again, so I didn't define
it but you can also, you can do
the same thing with the softmax, and
this is what the picture actually shows.
So instead of having a softmax of just W times ht for the probability of yt, you can also concatenate your c here, and that's what the picture shows. But I wanted to skip over the details, so you caught it, well done.
So the question is, do we have kind of a look-ahead type thing, or does the model output blanks? And the model basically has to output the words in the right order. It does not have the ability to do this whole reordering step or look-ahead kind of thing. There's no sort of post-processing of reordering at the end, so this model isn't able to output the verb at the right time step that way. Okay, here we go.
Now, of course,
once it works well, everybody will try to
see if they can kind of improve it, and
eventually you can do beam searches
too for these kinds of models.
But surprisingly, in many cases, you don't have to in order to get a reasonable MT system.
All right, now, I want you to become more and more familiar with these models, to be able to read the literature. So the same picture that we had here, and the same equations we defined, here's another way of looking at this.
So with the exception that this
one doesn't have the c connection
that you caught.
So, Yeah, it's similar.
It's the same exact model,
just a different way to look at it, and
it's kind of good to see.
Sometimes people explicitly write that you start out with a discrete one-of-k encoding of the words, just like the one-hot vectors that we defined, and then you embed them into a continuous word vector space. You give those as input, you compute your recurrent neural network ht steps.
And now, you give those as input to the decoder. At each time step of the decoder, you get as input the one word sample that you actually took, the previous hidden state, and the c vector we defined before. So all these three are the inputs for each node in this recurrent neural network. So it's just a different picture for the same model we just defined; you get used to different pictures with the same model semantics.
Now, it gets more powerful.
It needs to get more powerful cuz
even with those two assumptions here,
we have a very simple recurrent
neural network with just one layer,
that's not going to cut it.
So we'll use some of the extensions
we discussed in the last lecture,
we'll actually have stacked
deep recurrent neural networks
where we have multiple layers.
And then we'll also have, in some cases, this is not as common but sometimes it's used, a bidirectional encoder, where you go from left to right and from right to left, and then we give the last hidden states of both directions as input to every step of the decoder.
And then this is kind of almost an XOR here. If you don't do this, then another way to improve your system slightly is by training on the input sequence in reverse order, because then you have a simpler optimization problem. So especially for languages that align reasonably well, like English and French. Instead of saying A, B, C, the word A, the word B, or C, goes to the words X and Y in the other language, you'll say C, B, A goes to X, Y. Because as they align, A is more likely to translate to X, and B is more likely to translate to Y. And as you have longer sequences, you basically bring the words that are actually being translated closer together. And hence you have less of a vanishing gradient problem and so on, because the word you want to be predicted is closer to where it came into the encoder.
Yeah?
That's right, but yeah,
it's still an average force.
So how does reversing not mess it up?
Cuz this sentence doesn't
make grammatical sense.
So we never gave this model
an explicit grammar for
the source language, or
the target language, right?
It's essentially trying, in some really
deep, clever, continuous function,
general function approximation kind of
way, just correlation, basically, right?
And it doesn't have to know the grammar,
but as long as you're consistent and
you just reverse every sequence,
the same way.
It's still grammatical if you
read it from the other side.
And the model reads it from
potentially both sides, and so on.
So it doesn't really matter
to these learning models,
as long as your transformation of the
input is consistent across training and
testing times, and so on.
So the question is,
he understands the argument, but
it could still change the meaning.
And it doesn't change the meaning if
you assume the model will always go
from one direction to the other.
If you start to sometimes do it and
sometimes not,
then it will totally mess up the system.
But as long as it's
a consistent transformation,
it is still the same order and
so you're good.
So why is reversing the order
a simpler optimization problem?
Imagine, you had a very
long sequence here.
And again, this is only the case if the languages align well, as in, usually the first couple of words in the source language translate to the first couple of words in the target language.
Now, If you have a long sequence and
you try to translate
it to another long sequence, and
say there are a lot of them here.
Now, what that would mean is that this
word here is very far away from that word,
cuz it has to go through
this entire transformation.
And likewise,
these words are also very far away.
So everything is far away from
everything in terms of the number
of non-linear function applications
before you get to the actual output.
Now, if you just reverse this one,
then this word, so
let's call this a, b, c, d, e, f.
Now, this is now f,
e, d, c, b, a.
Now, this word, it's here now.
And now, this word translates
directly to that word, right?
So in your decoder.
So now, these two are very,
very close to one another.
And so as you do backpropagation, and we learned about the vanishing gradient problem in the last lecture, you have much less of a vanishing gradient problem.
So at least in the beginning,
it'll be much better at translating those.
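In code, the reversal trick is tiny; the only thing that matters is that you apply it the same way to training and test data. A sketch, with made-up helper names:

    def reverse_source(pairs):
        # pairs: list of (source_tokens, target_tokens); reverse only the source side.
        return [(list(reversed(src)), tgt) for src, tgt in pairs]

    # train_pairs = reverse_source(train_pairs)
    # test_pairs  = reverse_source(test_pairs)   # same transformation at test time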
So, how does this trick work for languages with different morphology? It doesn't actually matter, but the sad truth is also that very few MT researchers work on languages with super complex morphology. So, like, Finnish doesn't have very large parallel corpora with tons of other languages, and so you sadly don't see as many people work on that. German does work, and for German actually there are a lot of other tricks that we'll get to. And really these tricks are not as important as trick number six.
But before that,
we'll have a research highlight.
>> [LAUGH]
>> Give you a bit of a break, all right.
Allen, take it away.
>> This?
>> Yes.
>> Okay.
Hi, everyone.
My name is Allen.
So I'm gonna talk about building towards better language modeling. So as we've learned last week, language modeling is one of the most canonical tasks in NLP.
And there are three different ways
we can make it a little bit better.
We can have better input representation.
We can have better regularization or
preprocessing.
And eventually,
we can have a better model.
So for input, I know you guys have all played with GloVe, and that's a word-level representation. And I heard morphemes from you guys down there. So in fact, you can encode the word at a subword level: you can do morpheme encoding, you can do BPE, you can eventually do character-level embedding. What this does is drastically reduce the size of your vocabulary, making the model's prediction much easier.
So as you can see, Tomas Mikolov in 2012,
and Yoon Kim in 2015,
explored this route and got better results
compared to just plain word-based models.
So another way to improve your model
is that one of the bigger problems for
language modelling is over-fitting.
And we know that we need to apply
regularization techniques when the model
is over-fitting.
So there are a bunch of them, but today,
I'm gonna focus on preprocessing
because it's a little bit newer.
With preprocessing, we know that we're never gonna have unlimited training data. So in order to have our corpus look more like the true distribution of the English language, what we can do, quite similar to computer vision, is this type of data augmentation technique, where we try to replace some words in our corpus with some other words.
So for example, during the first pass your model can see the word New York, the next pass it can see New Zealand, and the next pass New England.
So by doing that, you're basically
generating this data by yourself and
eventually you achieve
a smoothed out distribution.
The reason this happens is that more frequent words, by replacing or dropping them, appear less often, and rarer words, by making them appear, appear more often.
So a smoother distribution allows us to learn a better language model, and the result is, I think, on the right-hand side for you guys. And the left-hand side is what happens when we apply better regularization techniques. So at last we can, wait, that's it, okay, awesome, thank you guys.
>> All right, now what you'll also see
in these tables is that the default for
all these models is an LSTM and that's
exactly what we'll end up very soon with.
Which is basically a better
type of recurrent unit.
And so, we'll start with gated recurrent units, which were introduced by Cho just three years ago. And the main idea is that we wanna basically keep around memories that capture long-distance dependencies, and you wanna have the model learn when and how to do that.
And with that,
you also allow your error messages to flow
differently at different strengths,
depending on the input.
So, how does this work? What is a GRU, as our step towards the LSTM? And sometimes you don't need to go all the way to the LSTM; the GRU is a really good model by itself in many cases already, and it's simpler.
So let's start with our standard recurrent neural network, which basically computes our hidden layer at the next time step directly. So we just have, again, the previous hidden state and our current word vector, that's it.
Now instead, what we'll do for gated recurrent units or GRUs is we'll compute two gates first. These gates are also, just like ht, continuous vectors of the same length as the hidden state, and they are computed exactly the same way. And here, it's important to note that the superscripts are just basically aligned with the kind of gate that you're computing. So we'll compute a so-called update gate and a reset gate.
Now the inside here is the exact same thing, but it's important to note that we have here a sigmoid function. So the elements of this vector are exactly between zero and one, and we could interpret them as probabilities if we want to. And it's also important to note that the superscripts here are different. So the update gate, of course, uses a different set of weights than the reset gate.
Now why are they called update and reset gates, and how do we use them? It's relatively straightforward. We just introduce one new function here, the element-wise product; we remember it from backpropagation, we also call it the Hadamard product sometimes. We just element-wise multiply this vector here from the reset gate with this, which would be our new memory content. We call it h tilde t, this is our intermediate memory content; it has the standard tanh that we also know as a [INAUDIBLE]. This part here is exactly the same, we just take as input our word vector and then transform it with a W.
But what's going on in here?
So intuitively right, this is just a long
vector of numbers between zero and one.
Now intuitively, if this reset gate at a certain unit is around zero, then we essentially ignore all the past. We ignore that entire computation of the past, and we're just going to define that element, where r is zero, with the current word vector.
Now why would we want to do that?
What's the intuition here?
Let's take the task of sentiment analysis
cuz it's very simple and intuitive.
If you were to say, you're talking
about a plot of a movie review.
And you talk about the plot and
you know some girl falls in love for
some guy who falls in love with her but
then they can't meet, blah, blah, blah.
That's a long plot and in the end you say,
but the movie was really boring.
Then really doesn't matter that
you keep around that whole plot.
You wanna say boring as a really
negative strong word for sentiments, and
you wanna basically be able to allow the
model to ignore the previous plot summary.
Cuz for the task of sentiments
analysis it's irrelevant.
Now this is essentially what
the reset gate will let you do, but
of course not in this global fashion,
where you update the entire hidden state,
but in a more subtle way, where you learn
which of the units you actually will reset
and which ones you will keep around.
So this will allow some
of the units to say,
well maybe I want to be a plot unit and
I will keep around the plot.
But other units learn, well, if I see one of the sentiment words, I will definitely set that reset gate to zero, and I will now make sure that I don't wash out the content of this word with previous stuff by summing these two, right? You're sort of, not quite averaging, but you're summing the two, so you'd wash out the content from this word; instead it will set that to zero and take only the content from the current word.
Now for the final memory it will compute, we'll combine this with the update gate. And the update gate does something similar, but basically allows us to keep around only the past and not the current time step. So intuitively here, when you look at z: if z is a vector of all ones, then what we would do is essentially ht = ht-1, because 1 minus 1 is 0, so this other term just falls away. Basically, if zt was all ones, we could just copy over our previous time step. Super powerful: if you copy over the previous time step, you have no vanishing gradient problem, right? Your gradient is just multiplied by a bunch of ones; nothing changes in your gradient computation.
So that's very powerful and intuitively
you can use that same sentiment example.
But you say in the beginning man,
I love this movie so much,
here's this beautiful love story.
And now you go through the love story,
and really what's important for
sentiment is not about the love story,
but it's about the person saying,
I love this movie a lot.
And you wanna make sure you
don't lose that information.
And with the standard recurrent neural network, we update our hidden state every time, for every word.
No matter how unimportant a word is,
we're gonna sum up those two vectors,
washing out the content as we
move further and further along.
Here we can decide, and what's even more amazing, you don't have to decide. You don't have to say, this word is positive, so I'm gonna set my reset gate manually. No, the model will learn when to reset and when to update.
So this is a very simple kind of
modification but extremely powerful.
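Here's one GRU time step as a rough numpy sketch of exactly those equations; the weight names W_z, U_z, and so on are just my labels for the per-gate matrices:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W, U):
        z = sigmoid(W_z @ x + U_z @ h_prev)            # update gate
        r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate
        h_tilde = np.tanh(W @ x + r * (U @ h_prev))    # intermediate (new) memory content
        h = z * h_prev + (1.0 - z) * h_tilde           # copy the past where z is near 1
        return h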
Now, we're gonna go through it and
explain it a couple more times.
And we'll have an attempt here at a clean illustration.
Honestly, personally, I feel the equations here are straightforward and intuitive enough that I don't know if these illustrations always help, but some people like them more than others.
So intuitively here, you basically see that only the final memory that you computed is the one that's actually used as input to the next step. So all of these only modify things through the final state. And now this one feeds as input into our reset gate, the update gate, the intermediate memory state, and the final state of the memory. And so does our x vector; the word vector here also feeds its input into the reset gate, the update gate, and our intermediate memory state. And then I tried to use the dotted lines here as basically the gates that modify how these two interact.
All right, so I've said, I think, most of these things already, but again, if the reset gate here is close to 0, we ignore our previous state. And that, again, allows the model, in general, to drop information that is irrelevant for the future predictions that it wants to make. And the update gate z controls how much of the past state should matter at the current time step.
And again, this is a huge improvement for
the vanishing gradient problem,
which allows us to actually train these
models on nontrivial, long sequences.
Any questions around the GRU?
Yep?
Does it matter if you reset first or update first? Well, you can't compute h until you have h tilde. So the order of these two doesn't matter, you can compute them in parallel, but you first have to compute h tilde with the reset gate before you can compute that one.
So the question is,
does it matter to switch and
use an equation like this first,
and then an equation like that?
I guess it's just a different model.
It's not one that I know
of people having tried.
It's not super unreasonable,
I don't see a sort of reason why
it would be illogical to ever to that,
but yeah, just not the GRU model.
You will actually see, [INAUDIBLE] has a paper, a Search Space Odyssey type paper, where there are a thousand modifications you can make to the next model, the LSTM.
And people have tried a lot of them,
and it's not trivial.
There are a lot of modifications.
And a lot of times they
seem kind of intuitive, but
don't actually change performance that
much across a bunch of different tasks.
But sometimes, one modification improves
things a tiny bit on one of the tasks.
It turns out the final model of GRU here
and the LSTM, are actually incredibly
stable, they give good performance
across a lot of different tasks.
But it can't ever hurt to, if you have
some intuition of why you want to have,
make something different,
it can't hurt to try.
So the question is, is it important how they're computed? I think there are some people who have tried to have a two-layer neural network to compute these gates, z and r. In general, it of course matters a lot how they're computed, but not in the sense that you have to modify them manually or something. It's just that the model learns when to update and when not to update.
That's a good question.
So what do I mean when I say unit.
So in general, what you'll observe in
a slide that's coming up very soon is
that we will kind of abstract away from
the details of what these equations are.
And we're going to write that just ht
equals GRU of xt and ht minus 1.
And then we'll just say that GRU
abbreviation means all these other things,
all these equations, and
we're going to abstract away from that.
And that's something that you'll see even
more in subsequent lectures where you
just say a whole recurrent
network with a five layer GRU and
combine lots of different
ways is just one block.
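So, continuing the sketch from before, the rest of your code can then treat that step as a black box, roughly:

    def run_gru(sentence_vectors, h0, weights):
        # weights: the same W_z, U_z, W_r, U_r, W, U tuple used by gru_step above
        h = h0
        for x in sentence_vectors:
            h = gru_step(x, h, *weights)   # read as: h_t = GRU(x_t, h_{t-1})
        return h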
We often see this in computer vision too
where CNNs are now just like the CNN
block, and you assume you've got
a feature vector out at the end.
And people will start abstracting
away more and more from that.
But yeah,
you'll always have to remember that, yes,
there's a lot of complexity
inside that unit.
Here's another attempt at an illustration, which I'm even less of a fan of than the one I tried to come up with.
Basically, how you have your z gate
that kind of can jump back and forth.
But of course,
it's usually a continuous type thing.
It's not a zero one type thing, so I'm not
a big fan of this kind of illustration.
And so in terms of derivatives, we could in theory ask you to derive all the details of the GRU. And the only change here is that we now have the derivative of these element-wise multiplications, both of which have parameters inside. And we all should know what the derivative of this is, and the rest is, again, the same kind of chain rule. But again, now you're sort of realizing why we wanna modularize this more and more, and abstract away from actually manually taking these derivatives, instead having error messages and deltas sent around.
Yeah?
Explain why we have both update and reset.
So basically, it helps the model
to have different mechanisms for
when to memorize something and
keep it around, versus when to update it.
You're right, in theory, you could try to
put both of those into one thing, right?
In theory, you'd say, well,
if this was just my previous ht here,
then this could say, well, I wanna keep
it around, or I wanna update it here.
But now, this update here, if you just had an equation like this, it would still be a sum of two things. So that means that xt here does not have complete control over modifying the current hidden state in its entirety. It would still be summed up with something else, and that happens at every single time step.
So it's only once you have these reset gates here that you would allow h to be completely dominated by the current word vector, if the model so chooses. Okay, so if these are all ones, then you have here basically a standard recurrent neural network type equation. And then if you just have the zs all 0s, then you take that exact equation and, you're right, you just have a standard RNN.
It's also beautiful; it's always nice to say my model is a more general form of your model, or-
>> [LAUGH]
>> Or the opposite, your model's a special case of my model. It was actually a couple of years ago that you could come by and say that.
>> [LAUGH]
>> It's good machine learning banter.
So yeah, it's always good.
And likewise, the inventor of this model made exactly that statement about the GRU, not knowing why anybody had to publish a new paper about this instead of just referring to it as a special case of the LSTM.
So if we have one more
question about the GRU, yeah?
>> Is there a reason.
>> Good question.
Why tanh and sigmoid? So in theory, you could say the tanh here could be a rectified linear unit or some other kind of unit. In practice, you do want sigmoids here, because you have this z plus 1 minus z structure. And if they were all over the place, then everything would kind of be modified, and it's less intuitive that you have a hard reset or a hard update. And if this wasn't a tanh and was a rectified linear unit, then these two might be all over the place too, and the sum might also not be very well behaved.
But at the same time, it's not unreasonable to try having a rectified linear unit here. And maybe, if you combine it with proper regularization and so on, you could get away with other kinds of nonlinearities. That's unlike probabilistic graphical models, where certain things just make no sense and you can't do them. In deep learning you can often try some things, and sometimes even nonsensical things surprisingly work. And then other people try to analyze why that was the case in the first place. But yeah, there's no mathematical reason why you couldn't at all have a rectified linear unit here.
All right, now on to an even more complex sort of overall recurrent unit, namely the long short-term memory, or LSTM.
So now this is the hippest
model of the day, and
it's pretty important to know it well.
Fortunately, it's again very similar
to the kinds of basic building blocks.
But now we allow each of
the different steps to have again,
we separate them out even more.
So how do we separate them out?
Basically, this is what's
going on at each time step.
We will have an input gate,
forget gate, output gate, memory cell,
final memory, and a final hidden state.
Now let's gain a little
bit of intuition and
there is good intuition of
why we want any of them.
So the input gate will
basically determine how much we
will care about the current vector at all.
So how much does the current cell or
the current input word vector matter?
The forget gate is a separate mechanism
that just says maybe I should forget,
maybe I don't.
This one here is just kind of counterintuitive sometimes, and there are actually different models in the literature: some have the one-minus there and others don't. But in general here, we'll define our forget gate such that if it's 0, then we're forgetting the past.
Then we have an output gate,
basically when you have this output gate,
you will separate out what
matters to a certain prediction
versus what matters to being kept around
over the current recurrent time steps.
So you might say at
this current time step,
this particular cell is not important,
but it will become important later.
And so I'm not going to output it,
to my final softmax for instance, but
I'm still gonna keep it around.
So it's yet another separate
mechanism to learn when to do that.
And then we have our new memory cell here,
which is similar to what we had before.
So in fact, all these four here have the same equation inside, just with three sigmoid nonlinearities and one tanh nonlinearity. So these are all just four single-layer neural nets.
Now we'll put all of these gates together
when we compute the memory cell and
the final hidden state.
So the final memory cell now basically separates out the input and the forget gate. Instead of just z and 1 minus z, we have two separate mechanisms that can be trained and learn slightly different things. And they actually become, in some ways, counterintuitive, like you say, I don't wanna forget, but you do wanna forget, but you also input something right now. But the model turns out to work very well. So basically here, the final memory cell takes the forget gate, Hadamard product with the previous time step's final memory cell ct-1.
So this again will determine how much you wanna keep this around, or how much we wanna forget from the past. And then the new memory cell here, this is like a standard recurrent neural net: if i is all 1s, then we really keep the input around. And if the input gate says no, this one doesn't matter, then you just basically ignore the current word vector there.
So in that sense,
this equation is quite intuitive, right?
Forget the past or not, take the input or
not, that's basically it, yeah?
So the question is, once you forget the past, does it mean you forget grammar or something else? And the truth is we can't think of these forget gates as sort of absolutes. They're all vectors, and they will each forget only certain elements of a long hidden unit vector.
And so really, I can eventually show
you what these hidden states look like.
And sometimes they're actually
more intuitive than others.
But it's rare that you would find that this particular unit, when it was turned off or on, actually had this perfect interpretation that we as humans find intuitive and think of as grammar.
And also of course grammar is
a very complex kind of beast.
And so it's hard to say any single unit
would capture any particular like entirety
of a grammar,
it might only capture certain things.
So it's not implausible to think that these three cells together suggest that the next noun should be a plural noun, or something like that. But that's the most we could hope for in many cases.
All right, and then here, the final hidden state. Again, we can keep these cs around, right? And the cs will be computed from other cs. But we might not want to expose all the content of this memory cell when we compute the final hidden state ht.
All right, now yeah,
this is it, this is the LSTM.
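And here is one LSTM time step as a numpy sketch of those equations; again the weight names are mine, and the four single-layer nets are the i, f, o, and c-tilde lines:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(x, h_prev, c_prev, W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c):
        i = sigmoid(W_i @ x + U_i @ h_prev)        # input gate: how much the current input matters
        f = sigmoid(W_f @ x + U_f @ h_prev)        # forget gate: how much of the past to keep
        o = sigmoid(W_o @ x + U_o @ h_prev)        # output gate: how much to expose right now
        c_tilde = np.tanh(W_c @ x + U_c @ h_prev)  # new memory candidate
        c = f * c_prev + i * c_tilde               # final memory cell
        h = o * np.tanh(c)                         # exposed hidden state
        return h, c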
It's a really powerful model, are there
any questions around the equations?
We're gonna attempt at some illustrations,
but
again I think the equations
are sometimes more intuitive.
Does the LSTM and GRU completely alleviate or just help with the vanishing gradient problem? And the truth is, they help with it a lot, but they don't completely obviate it.
You do multiply here a bunch of
numbers that are often smaller than 1.
And over time, even if it would have to be a perfect one, that would mean that that unit is really, really strongly active. And then it sort of dies; it's like the gradient when you have a unit that's really, really active and looks something like this. The input to that unit is really large, it's out here, and then the gradient around here is pretty much 0.
So that unit's kind of dead.
And then the model can't do
anything with it anymore.
And so it happens, when you want to train these, you'll observe some units just sort of die after training for a while, and they'll just keep around stuff, or delete stuff, at each time step. But in general, most of the units are somewhat smaller than 1, and so you still have a bit of a vanishing gradient problem, but much less so.
And intuitively, you can come up with, in NLP, a lot of good ways to think about this, right?
Maybe you want to predict different
things at different time steps.
But you wanna keep around knowledge
through the memory cells but
not expose it at a given prediction.
Yeah.
What is the point of the exposure gate
when it already had the forget gate?
So basically, the forget gate will tell you whether you keep something around or not. But the exposure gate, the output gate, decides whether it matters to the current time step or not.
So you might not wanna forget something.
But you also might not wanna
show it to the current output,
because it's irrelevant for that output.
And it would just confuse the Softmax
classifier at that output.
Yeah?
Does the exposure gate help you, or
do you mean the output gate here, right?
So does the output gate,
does it help you to what exactly?
To not have to forget everything forever.
So, in some ways, yes.
You can basically,
this model could decide that,
while it doesn't wanna give as
output something for a long time.
And hence it's basically a temporal forgetting, right? It will only be forgotten at that time step, but actually be kept around in, I don't wanna anthropomorphize the models, but, like, the subconsciousness of this model or whatever, right? It keeps it around but doesn't expose it.
All right, one last question, yeah?
The initialization of all these models
matters, it matters quite significantly.
So if you initialize all your weights
such that, whatever you do,
in the beginning
all of the weights are super large,
then your gradients are zero and
you're stuck in the optimization.
So you always have to
initialize them properly.
In most cases, as long as they're
relatively small, you can't go too wrong.
At worst, it might slow down
your eventual convergence, but
as long as all your parameters, the W's
here, and your word vectors and so
on, are initialized to very small numbers,
it will usually end up
doing pretty well.
Yes, you could use lots of different
strategies for initialization.
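As a minimal sketch of what initializing
to small numbers can look like; the
uniform range here is just one common
choice, not a prescription.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, embed_size, vocab_size = 100, 50, 10000

    # small random weights: huge initial weights saturate the nonlinearities
    # and give near-zero gradients, so keep the initial scale modest
    W = rng.uniform(-0.1, 0.1, size=(hidden_size, embed_size))
    U = rng.uniform(-0.1, 0.1, size=(hidden_size, hidden_size))
    word_vectors = rng.uniform(-0.1, 0.1, size=(vocab_size, embed_size))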
All right, now, some visualizations.
I like this one from Chris Olah on
his blog from not too long ago.
But again, I don't know.
I feel like the equations speak mostly for
themselves.
You can think of it like this:
you have four different neural network
layers, and then you combine them in
various ways with pointwise operations,
such as multiplication or addition,
and sometimes multiplication and
then addition,
and concatenation and copies and so on.
But in the end, you often observe
this kind of thing where we'll
just write LSTM in this block.
It has an x and an h, and
we don't really look into too many
details of what's going on there.
And here are some, I think, even less
helpful [LAUGH] illustrations that,
yeah, I think are mostly
confusing to a lot of people.
You have the forget gates here,
output gates, input gates, and so on,
and your memory cells as
they try to modify each other.
This one is a little cleaner.
You know you have some inputs, your gates,
you have your forget gates on top
of your memory cell and so on.
But in general I think the equations
are actually quite intuitive, right?
Just think of the extremes: if this gate
is zero or one, then this input matters
less or more to the output.
All right, now, as I said,
LSTMs are currently super hip,
the en vogue model for pretty
much all sequence labeling tasks and
sequence-to-sequence tasks
like machine translation.
Super powerful. In many cases, you will
actually observe that we'll stack them.
So just like the other RNN architectures,
we'll have a whole LSTM block and
we put another LSTM block with a different
set of parameters on top of it.
The parameters are shared over time, but
are different across layers, so you
have a very deep model.
And, of course, with all these
parameters here, we have essentially
many more parameters than the standard
recurrent neural network,
where we only have two such weight
matrices that we update every time.
You want more data, especially
if you stack: you now have maybe
10x the parameters of a standard RNN, so
we wanna train this on a lot of data.
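Roughly, stacking looks like the sketch
below, reusing the lstm_step from before:
each layer has its own parameters, shared
across time, and each layer takes the
hidden state of the layer below it
as its input.

    import numpy as np  # uses lstm_step from the earlier sketch

    def stacked_lstm(inputs, layer_params, hidden_size):
        # inputs: list of input vectors over time
        # layer_params: one parameter dict per layer (different per layer)
        num_layers = len(layer_params)
        h = [np.zeros(hidden_size) for _ in range(num_layers)]
        c = [np.zeros(hidden_size) for _ in range(num_layers)]
        outputs = []
        for x_t in inputs:                # the same parameters are reused at every time step
            layer_input = x_t
            for l, p in enumerate(layer_params):
                h[l], c[l] = lstm_step(layer_input, h[l], c[l], p)
                layer_input = h[l]        # this layer's hidden state feeds the layer above
            outputs.append(h[-1])         # top layer's hidden state, e.g. for the softmax
        return outputs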
And in terms of the amount of training
data available, machine translation is
actually one of the best tasks for that.
It is also the one where these
models sort of shine the most.
And so in 2015, I think the first time I
gave the deep learning for NLP lecture,
the jury was still a little bit out.
The neural network models
came up fairly quickly,
but some of the more traditional
machine translation systems
were still slightly better,
like by half a BLEU point.
We haven't defined BLEU scores yet.
You can essentially think
of them as n-gram overlap.
The more your translation overlaps
in terms of unigrams and bigrams and
trigrams, the better it likely is, period.
So you have this reference translation,
sometimes multiple reference translations.
You have your translation, and you look
at the n-gram overlap between the two.
So the higher, the better.
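Just to make that concrete, here is a toy
sketch; real BLEU also combines several
n-gram orders, handles multiple references,
and adds a brevity penalty, so treat this
as intuition only.

    from collections import Counter

    def ngram_precision(candidate, reference, n):
        # fraction of the candidate's n-grams that also appear in the reference (clipped)
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        return overlap / max(sum(cand.values()), 1)

    reference = "the cat sat on the mat".split()
    candidate = "the cat is on the mat".split()
    print(ngram_precision(candidate, reference, 1))  # unigram overlap
    print(ngram_precision(candidate, reference, 2))  # bigram overlap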
And basically, the neural network
models were often also just used for
rescoring a traditional MT model.
Now, just one year later, last year,
really a couple months ago,
the story was completely different.
So this is WMT, the worldwide
competition for machine translation.
And you have different universities,
and different companies and
so on, submit their systems.
And the top three systems were all
neural machine translation systems.
The jury is now basically not out anymore.
It's clear neural machine
translation is the most accurate
machine translation model in the world.
Yeah that number two was us, yeah.
>> [LAUGH]
>> James Bradbury and me worked on that.
James Bradbury was actually a linguistics
undergrad while he was doing that, but
now he's full-time.
So, yeah, basically we haven't talked
that much about ensembling and
ensembles of different models.
But you can also train
five of these monsters, and
then average all the probabilities, and
you'll usually get a little better.
Just as a general thing, you'll observe
this for every machine learning
competition out there.
If you go on Kaggle or other machine
learning competitions, people usually
train even the same kind of
model five times.
You end up in slightly different
local optima, average them,
and you still do pretty well.
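The averaging itself is as simple as it
sounds; here is a minimal sketch, assuming
each trained model exposes some
predict_proba-style method that returns a
distribution over classes.

    import numpy as np

    def ensemble_predict(models, x):
        # average the predicted distributions of several independently trained models
        probs = [m.predict_proba(x) for m in models]  # predict_proba is an assumed interface
        return np.mean(probs, axis=0)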
What's cool also, though,
is that while we might not be able
to exactly recover grammar, or
have specific units explicitly
capture very intuitive things,
as we project this down,
similar to the word vectors,
we actually do observe some
pretty interesting regularities.
So this is a paper from Sutskever in 2014;
they projected different sentences.
The models were trained basically on
a machine translation task, and they
observed quite interesting regularities.
So 'John admires Mary' is close
to 'John is in love with Mary' and
to 'John respects Mary'.
Now of course,
we have to be a little careful here
to not over-interpret the amazingness.
It's amazing, but
we also have a selection bias here, right?
Maybe if we just had
'John did admire Mary' or
something, it might also be close to it,
right?
And it might be closer, too. But
if you just project these six particular
sentences into a lower dimensional space,
then you do see very nicely that whenever
John has some positive feelings for
Mary, all those sentences are in here.
And all the ones that are in this area
of the first two eigenvectors,
'Mary admires John', 'Mary is
in love with John', and 'Mary respects
John', they're all closer together,
which is kind of amazing, cuz
some people are also worried:
well, it's a sequence model, so
how could it ever capture
that the word order changes?
So here we have,
she was given a card by me in the garden
versus in the garden I gave her a card.
And I gave her a card in the garden, and
despite the word order being
actually flipped, right?
In the garden is in the beginning here,
and in the end here.
These are still closer together
than the different ones where,
in the garden, she basically gave me
a card versus I gave her a card.
So that shows that the semantics here
turn out to be more important than
the word order.
Despite the model just
going from left to right, or,
this one still used the trick where we
reversed the order of the input sentence.
But it shows that it's
incredibly invariant, and
invariance is a pretty important concept,
right?
We want this model to be
invariant to simple syntactic
changes when the semantics
are actually kept the same.
It's pretty incredible that it does that.
So this is also, I think,
the power of some of these.
This is a very deep LSTM model where
you have five different LSTMs stacked in
the encoder and several in the decoder.
And they're all connected
in multiple places too.
All right, any questions around
those visualizations and LSTMs?
All right, you now have knowledge under
your belt that is super powerful and
very interesting.
I expected to maybe have
five minutes more of time.
So I'm going to talk to you about a recent
improvement to recurrent neural networks
that I think is also very
applicable to machine translation,
but nobody has actually yet
applied it to machine translation.
It addresses a general problem
with all the softmax classification
that we do in all the models I've so
far described to you,
and that really, up until two or
three months ago,
everybody in NLP
had as a major problem.
And that is: you can only ever predict
answers if you saw that exact word at
training time.
You have your cross entropy error
saying, I wanna predict this word.
And if you've never seen that word,
no matter how obvious it is,
the translation system will
not be able to do it, right?
So we have some kind of translation,
and let's say we have a new word,
like a new name or something that
we've never seen at training time.
And it is very obvious that this word
here should go at this location.
This is like 'Mrs.', and
then maybe the new word is like 'Yellen'
or something like that;
it could be any other word.
And now let's say at training time,
we've never seen the word 'Yellen'.
But now there's, like, 'Frau', the German
translation for Mrs., in the source.
And now it's very obvious to
everybody that after this word,
it should be the next one,
the name of the Mrs.
And so these models would never
be able to do that, right?
And so one way to fix that is to think
about character-level machine translation
models, where the model's actually
surprisingly similar to what we described
here. There are many more time steps it
has to go through, but instead
of having words we just have characters.
So that's one way, but
now we have very long sequences.
And at every character you have
a lot of matrix multiplications.
And these matrix multiplications
that we have in here are not
50-dimensional for really powerful MT
models, they're 1,000-dimensional.
And now you have several
thousand-by-thousand matrices here
multiplying with
thousand-dimensional vectors.
And you stack them, so
you have to do it five times.
Doing that for every single character
actually gets really, really expensive.
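Very rough back-of-the-envelope numbers,
assuming 1,000-dimensional states, roughly
eight big matrix-vector products per LSTM
layer, and five stacked layers; the exact
counts depend on the architecture.

    d = 1000                   # hidden dimension for a large MT model
    matvec_cost = d * d        # multiplies in one 1,000 x 1,000 matrix-vector product
    matvecs_per_layer = 8      # ~4 gates/candidates, each mixing input and hidden state
    layers = 5
    per_time_step = matvec_cost * matvecs_per_layer * layers
    print(per_time_step)       # ~40 million multiplies per time step, and characters
                               # mean roughly 5x more time steps than words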
So at the same time,
it's very intuitive that
after we see a new word at test time,
we wanna be able to predict it.
And also in general, with the softmax,
even for words that we do
see once or twice, it's hard for
the model to then still predict them.
It's this skewed data set
distribution problem:
when you have very rare, very infrequent
classes, those words are hard for
the models to predict.
So this is one attempt at fixing that,
which is essentially a mixture
model of using standard softmax and
what we call a pointer.
So what's a pointer?
It's essentially a mechanism to
say well maybe my next word is one of
the previous words in the context.
Say, 100 words in the past,
and at every time step you say,
maybe I just wanna copy a word
over from the last 100 words.
And if not, then I will use my
standard softmax for the rest.
So this is kind of this
sentinel idea here.
This is a paper by Stephen Merity and
some other folks.
And basically, we now have
a mixture model, where we combine
the probabilities from the standard
vocabulary and from this pointer.
And now how do we compute this pointer?
It's very straightforward:
we basically have a query.
This query is just a modification of
the last hidden layer that we have here.
And we pipe that through a standard
single-layer neural network
to compute another hidden layer,
which we'll call q.
And then we'll take an inner
product between this q and
all the previous hidden states
of the last 100 time steps.
And that will give us,
basically, a single number for
each of these inner products.
And then we'll apply
a softmax on top of that.
And this gives us, essentially,
a probability for
how likely we are to point
to each of these words.
Or the very last option is that we
don't point to anything,
and we just take the standard softmax.
So we keep one unit around,
the sentinel, where we do this.
And now of course in the context,
the same word might appear multiple times.
And so you just sum up all
the probabilities for specific words.
If they appear multiple times,
you just sum them up.
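Here is a small sketch of that pointer
computation in NumPy; the names (W_q, the
sentinel vector, the 100-word window) are
just my shorthand for what was described,
and the Merity et al. paper has the
precise formulation.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def pointer_mixture(h_t, past_hs, past_word_ids, vocab_probs, W_q, sentinel):
        # query computed from the current hidden state by a single-layer network
        q = np.tanh(W_q @ h_t)
        # inner product of q with each previous hidden state, plus one sentinel slot
        scores = np.array([q @ h for h in past_hs] + [q @ sentinel])
        attn = softmax(scores)
        g = attn[-1]                     # mass given to the standard vocabulary softmax
        p = g * vocab_probs
        for a, w in zip(attn[:-1], past_word_ids):
            p[w] += a                    # same word appears several times: just sum it up
        return p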
With this simple modification, we now
have the ability to predict unseen words.
And we can predict rare words much better,
based on the patterns in which they appear.
For instance, 'Fed Chair Janet
Yellen raised rates' and so on;
after 'Ms.' it is very obvious
that this is the same Ms.
that we're referring to here.
And you combine this in this mixture model.
And now, for language modeling,
over many, many years,
the perplexity that we defined before
was actually sort of stuck around 80.
And then in 2015,
we had a bunch of modifications
to LSTMs that were very powerful
and lowered this, and
now we're down to the low 70s.
And with some modifications
we'll cover in another class,
we're actually down in the 60s now.
So it had really stalled for
several years, and
now perplexity numbers
are really dropping.
And these models are getting better and
better at capturing more and
more of the semantics and
the syntax of language.
All right, so let's summarize.
Recurrent Neural Networks, super powerful.
You now know the best ones in
that family to use: LSTMs.
This is a pretty advanced lecture;
I hope you gained some of the intuition.
Again, most of the math falls out from the
same basic building blocks we had before.
And next week, or no, next Thursday,
we'll do midterm review.
All right, thank you.
