[SOUND] Stanford University.
>> Okay, so we're back again with CS224N.
So let's see.
So in terms of what we're gonna do today, I mean I think it's gonna be a little bit muddled up, going forwards and backwards. Officially in the syllabus, today's lecture is sequence-to-sequence models and attention.
And next Tuesday's lecture
is machine translation.
But really,
Richard already started saying some things
about machine translation last week.
And so I thought for
various reasons, it probably makes
sense to also be saying more stuff
about machine translation today.
So expect that.
But I am gonna cover the main content of
what was meant to be in today's lecture,
and talk about attention today.
And that's a really useful
thing to know about.
I mean almost certainly if you're gonna
be doing anything in the space of sort of
reading comprehension, question
answering models, such as, for instance,
assignment four.
But also kinds of things that a whole
bunch of people have proposed for
final projects, you definitely
wanna know about and use attention.
But then I actually thought
I'd do a little bit of
going backwards next Tuesday.
And I want to go back and actually say a
bit more about these kind of gated models,
like the GRUs and
LSTMs that have become popular lately, and try and have a bit more of a go at
saying just a little bit more about well,
why do people do this and
why does it work?
And see if I can help make it
a little bit more intelligible.
So we'll mix around between those topics.
But somehow over these two weeks of
classes, we're doing recurrent models,
attention, MT and
all those kinds of things.
Okay.
Other reminders and comments.
So the midterm is over, yay! And your dear TAs and I spent all last night grading that midterm.
So we're sort of 99%
over with the midterm.
There's a slight catch that a couple
of people haven't done it yet,
because of various complications.
So essentially next Tuesday is,
when we're gonna be able to be sort of
releasing solutions to the midterm and
handing them back.
Some people did exceedingly well.
The highest score was extremely high 90s.
Most people did pretty well.
It has a decent median. A few did not so well. [LAUGH] You know what these things are like.
But yeah, overall we're pretty
pleased with how people did in it.
I just thought I should mention one other issue, which I will also send a Piazza note about.
I mean I know that a few people
were quite unhappy with the fact
that some students kept on writing
after the official end of the exam.
And I mean I totally understand that.
Because the fact of the matter is,
these kind of short midterm exams
do end up quite time-limited,
and many people feel like they could
do more if they had more time.
I mean on the other hand,
I honestly feel like I don't know
quite what to do about this problem.
Both Richard and I came from educational traditions, where we had exam proctors.
And when it was time to put your
pens down, you put your pens down or
else dire consequences happen to you.
Whereas my experience at Stanford is that,
every exam I've ever been in at Stanford,
there are people who keep writing
until you forcibly remove
the exam out of their hands.
And so there seems to be
a different tradition here.
And in theory this is meant to be
student regulated by the honor code,
but we all know that there are some
complications there as well.
So it's not that I'm not
sensitive to the issue.
And really, exactly what I said to the TAs before the end of the exam was: it's a real problem at Stanford, people going on writing, so could everyone get into the room as quickly as possible and collect everyone's exams to minimize that problem.
But obviously, it's a little bit
difficult when there are 680 students.
But we did the best we could.
And I think basically we
have to proceed with that.
Okay.
Other topics.
Assignment three is looming.
Apologies that we were a bit
late getting that out.
Though with the midterm,
it wouldn't have made much difference.
We have given a little bit of an extension on assignment three. I guess we're really nervous about giving more of an extension on assignment three.
Not because we don't want you to have
time to do assignment three, but
just because we realized that anything we
do is effectively stealing days away from
the time you have to do the final
project or assignment four.
So we don't wanna do that too much.
We hope that assignment
three isn't too bad.
And the fact that you can do it in teams can help, so that it won't be such a problem.
Another thing that we want people to do, but that we're a bit behind on and hopefully can get in place tomorrow, is giving people access to Microsoft Azure, to be able to use GPUs to do the assignments.
We really do want people to do that for
assignment three.
Since it's just great experience to have
and will be useful to know about, for
then going on for assignment four and
the final project.
So we hope we can have
that in place imminently.
And it really will allow you to do things
much quicker for assignment three.
So the kind of models
that you're building for
assignment three, should run at least
an order of magnitude, sort of ten,
12 times or something faster, if you're
running them on a GPU rather than a CPU.
So look forward to
hearing more about that.
The final reminder I want to mention is,
I'm really really encouraging people
to come to final project office hours for
discussion.
Richard was really disappointed how few
people came to talk to him about final
projects on Tuesday after the exam.
Now maybe that's quite
understandable why no one turned up.
But at any rate moving forward from here,
I really really encourage you to do that.
So I have final project office
hours tomorrow from one to three.
Richard is gonna be doing
it again next Tuesday.
The various other PhD students are having their office hours as well.
So really do, for the rest of the quarter, try and get along to those, and check in on projects as often as possible.
And in particular, make really really sure
that either next week or the week after,
that you do talk to your project mentor,
to find out their advice on the project.
Okay, all good?
Any questions?
Okay, so
let's get back into machine translation.
And I just thought I'd sort of spend a couple of slides on how important machine translation is.
Now really a large percentage of
the audience of these Stanford classes
are not American citizens.
So probably a lot of you realize that,
machine translation is important.
But for the few of you that
are native-born American citizens.
I think a lot of native-born Americans
are sort of, very unaware of
the importance of translation, because
they live in an English-only world.
Where most of the resources for
information are available in English, and
America is, sort of, a big enough place that you're not often dealing with stuff from the rest of the world.
But really in general, for
humanity and commerce, translation,
in general, and machine translation in
particular, are just huge things, right?
Places like the European Union are completely dependent on having translation happen in order to run, so they can work across the many languages of the European Union.
So, the translation industry is
a $40 billion a year industry.
And that's basically the amount
that's spent on human translation,
because most of what's done as
machine translation at the moment
is in the form of free services,
and so it's a huge issue in Europe,
it's growing in Asia, lots of needs in
every domain, as well as commercial,
there's social, government,
and military needs.
And so the use of machine translation
has itself become a huge thing.
So Google now translates over 100 billion words per day, right?
There are a lot of people that
are giving Google stuff to translate.
It's then important for
things like having social connections.
So I mean in 2016,
last year Facebook rolled out their
own homegrown machine translation.
Prior to that they'd made use of other people's translation, but
essentially what they had found was that
the kind of commercial machine translation
offerings didn't do a very good job
at translating social chit chat.
And the fact of the matter is that doing
a better job at that is sufficiently
important to a company like Facebook that
they're developing their own in house
machine translation to do it.
One of the quotes that came along with that was that when they were testing it and turned off the machine translation for some users, those users really went nuts; lots of people really do actually depend on this.
Other areas as well.
So eBay makes extensive use of machine
translation to enable cross-border trade.
So that if you are going
to be able to successfully
sell products in different markets,
well, you have
to be able to translate the descriptions
into things that people can read.
Okay, and so that then leads us into
what we're gonna be focusing on here,
which is neural machine translation.
And so, neural machine translation or NMT
is sort of a commonly used slogan name.
And it's come to have a sort of
a particular meaning that's slightly more
than neural plus machine translation.
Neural machine translation is used to mean that what we want to do is build one big neural network in which we can train the entire end-to-end machine translation process and optimize it end to end. And so systems that do that are then what are referred to as NMT systems.
So that the kind of picture here
is that we're going to have
this big neural network.
It's gonna take input text that's
somehow going to encode into
neural network vectors.
It's then gonna have a decoder and
out would come text at the end.
And so we get these
encoder-decoder architectures.
Before getting into the modern stuff,
I thought I'd take two slides
to tell you about the archaeology
of neural networks.
Neural networks had sorta been very
marginal or dead as a field for
a couple of decades.
And so I think a lot of the time people these days think of deep learning as having turned up around 2012, with the ImageNet breakthroughs.
And boy has it been amazing since then.
But really there have been
earlier ages of neural networks.
And in particular there's a boom in the
use of neural networks in the second half
of the 80s into the early 90s which
corresponds to when Rumelhart and
McClelland, so that's the James McClelland
that's still in the Psych Department
at Stanford, pioneered or re-pioneered
the use of neural networks partly as
a cognitive science tool, but
also as a computing tool.
And many of the technologies that we've been talking about, really the math of them, were worked out during that period. So it was in the 80s that people really worked out how to do general backpropagation algorithms for multi-layer neural networks.
And it was also during
that period when people
worked out how to do the math
of recurrent neural networks.
So algorithms like backpropagation through
time were worked out in this period,
in the late 80s, often by people who were
psychologists, cognitive scientists,
rather than hard core CS
people in those days.
And it was also in that period that neural MT, in the sense of having these encoder-decoder architectures for doing translation, was first tried out.
The systems that were built were
incredibly primitive and limited,
which partly reflects the computational
resources of those days.
But they still were encoder-decoder architectures.
So as far as I've been able to work out, the first neural MT system was this system that was done by Bob Allen in 1987, at the very first International Conference on Neural Networks. And so he constructed 3,000 English/Spanish pairs over a tiny vocabulary.
Sort of a 30 to 40 word vocabulary, and the sentences were actually kind of constructed based on a grammar; it wasn't a matter of just collecting together human language use, but you sort of had sentences like this, with some variation of word order and things like that.
And he built this simple encoder-decoder network that you can see on the right, which was not a recurrent model.
You just had sort of a binary
representation of the sequence of words in
a sentence and the sentences were only
short and then were pumped through that.
A few years after that, Lonnie Chrisman.
Lonnie Chrisman is actually
a guy who lives in the Bay Area.
He works at a tech firm still to this day.
[LAUGH] Not doing neural networks anymore.
So Lonnie Chrisman in the early 90s then developed a more sophisticated neural network architecture for doing encoder-decoder MT. So he was using this model called RAAMs, Recursive Auto-Associative Memories, which were developed in the early 90s.
Not worth explaining the details of them.
But a RAAM is in some sense kind of like a recurrent network of the kind that we've already started to look at.
And he was building those ones.
And so that then leads into our modern
encoder decoder architectures
that Richard already mentioned.
Where we're having, perhaps a recurrent
network that's doing the encoding and
then another recurrent network there's
then decoding out in another language.
And where in reality they're not
normally as simple as this, and
we have more layers and more stuff,
and it all gets more complicated.
I just wanted to mention
quickly a couple more
things about the space of these things.
So you can think of what
these encoder decoder
architectures are as a conditional
recurrent language model.
So if we want to generate a translation,
we're encoding the source so
we're producing a Y from the source.
And then from that Y
we're going to decode,
we're going to run a recurrent neural
network to produce the translation.
And so you can think of that decoder there
as a conditional recurrent language model.
So it's essentially being a language
model that's generating forward
as a recurrent language model.
And the only difference from
any other kind of recurrent or
neural language model is that you're
conditioning on one other thing,
that you've calculated this Y
based on the source sentence.
And that's the only
architecture difference.
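To make the conditional recurrent language model idea a bit more concrete, here is a minimal NumPy sketch (toy dimensions, random untrained weights, and a plain RNN cell rather than a gated one, so not the actual systems discussed) of a decoder step that is conditioned on the source only through its initial hidden state:

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    H, V = 4, 10                          # toy hidden size and vocabulary size
    W_hh = 0.1 * np.random.randn(H, H)    # hidden-to-hidden weights
    W_xh = 0.1 * np.random.randn(H, V)    # word(one-hot)-to-hidden weights
    W_hy = 0.1 * np.random.randn(V, H)    # hidden-to-vocabulary logits

    def decoder_step(h_prev, x_onehot):
        # one step of a plain recurrent decoder: P(y_t | y_<t, source)
        h = np.tanh(W_hh @ h_prev + W_xh @ x_onehot)
        return h, softmax(W_hy @ h)

    y_source = np.random.randn(H)         # stand-in for the encoder's final state Y
    h = y_source                          # condition the language model by seeding the decoder state
    x = np.zeros(V); x[0] = 1.0           # a <start> token as a one-hot vector
    h, p_next = decoder_step(h, x)        # distribution over the first target word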
So if we then look down into
the details a little bit,
there are different ways
that you can do the encoder.
The most common way to do the encoder has
been with these gated recurrent units,
whether the GRUs or the LSTMs,
which are another kind of gated recurrent
unit that Richard talked about last time.
I mean, people have tried other things.
I mean the modern resurgence
of neural machine translation,
actually the very first paper that
tried to do it was this paper by
Nal Kalchbrenner and Phil Blunsom
who now both work at DeepMind.
And they actually for
their encoder they were using a recurrent
sequence of convolutional networks.
Not the kind of gated recurrent
networks that we talked about.
And sometime later in the course
we'll talk a bit more
about convolutional networks and
how they're used in language.
They're not nearly as much used
in language, they're much,
much more used in Vision.
And so if next quarter you do CS231N and
get even more neural
networks then you'll spend way more of
the time on convolutional networks.
But the one other idea I sort of
wanted to just sort of put out
there is sort of another
concept to be aware of.
So we have this Y that we've
encoded the source with.
And then there's this
question of how you use that.
So for the models that we've shown
up until now and that Richard had,
essentially what happened was
we calculated up to here.
This was our y.
And we just used the y as the starting
point of the hidden layer,
and then we started to decode.
So this was effectively the Google
tradition of the way of doing it,
the model that Sutskever
et al proposed in 2014.
And so effectively, if you're doing
it this way, you're putting most of
the pressure on the forget gates
not doing too much forgetting.
Because you have the entire knowledge
of the source sentence here.
And you have to make sure you're carrying
enough of it along through the network.
That you'll be able to continue to access
the source sentence's semantics all
the way through your generation
of the target sentence.
So it's especially true in that case that you will really lose badly if you've got something like a plain recurrent neural network, which isn't very good at having a medium-term memory.
And you can do much better
with something like an LSTM.
Which is much more able to
maintain a medium term memory with
the sort of ideas that Richard
started to talk about.
But that isn't actually
the only way of doing it.
And so the other pioneering work
in neural machine translation
was work that was done at the University
of Montreal by Kyunghyun Cho and
colleagues and
that wasn't actually the way they did it.
The way they did it was, once they'd calculated the Y as the representation of the source, they fed that Y into every time step during the period of generation.
So when you were generating at each state,
you were getting a hidden
representation which was kind
of just your language model.
And then you were getting two inputs.
You were getting one input which
was the previous word, the x_t.
And then you were getting a second input,
which was the y that you
were conditioning on.
So you were directly feeding that
conditioning in at every time step.
And so then you're less dependent on
having to sort of preserve it along
the whole sequence.
And in a way having the input
available at every time step,
that seems to be a useful idea.
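As a rough contrast with the previous sketch (again toy NumPy with invented sizes, not the actual Cho et al. model), the change is simply that the source representation is concatenated onto the decoder's input at every time step instead of only seeding the initial state:

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    H, V = 4, 10
    W_hh = 0.1 * np.random.randn(H, H)
    W_xh = 0.1 * np.random.randn(H, V + H)    # input is [one-hot word ; y], hence V + H columns
    W_hy = 0.1 * np.random.randn(V, H)

    def decoder_step(h_prev, x_onehot, y_source):
        # y_source is fed in alongside the previous word at every step
        inp = np.concatenate([x_onehot, y_source])
        h = np.tanh(W_hh @ h_prev + W_xh @ inp)
        return h, softmax(W_hy @ h)

    y_source = np.random.randn(H)   # encoder summary of the source sentence
    h = np.zeros(H)
    x = np.zeros(V); x[0] = 1.0     # <start> token
    h, p_next = decoder_step(h, x, y_source)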
And so that's actually the idea that will
come back when I talk about attention.
That attention is again going to give
us a different mechanism of getting at
the input when we need it.
And to being able to condition on it.
Let me just sort of give you
a couple more pictures and
a sense of how exciting neural
machine translation has been.
So for machine translation,
there are a couple of prominent
evaluations of machine
translation that are done.
But I mean I think the most prominent one has been done by what's called the Workshop on Machine Translation, which has a yearly evaluation.
And so this is showing results from that.
And most of the results shown
are the results from Edinburgh Systems.
And the University of Edinburgh's
traditionally been one of the strongest
universities at doing machine translation.
And they have several systems.
And so in these results, up means better machine translation quality.
So we have the phrase-based statistical machine translation systems, which are the kind of thing that you saw on Google Translate until November 2016. Although they work reasonably, there was sort of a feeling that, although they were a pioneering good use of large-data machine learning systems, they had kind of stalled.
So there really was very little
progress in phrase-based machine
translation systems in recent years.
Until neural machine translation came
along, the idea that people were most
actively exploring was building
syntax-based statistical machine
translation systems, which made more
use of the structure of language.
They were improving a little bit
more quickly but not very quickly.
How quickly kind of partly depends
on how you draw that line.
It sort of depends on whether
you believe 2015 was a fluke or
whether I should draw the line as I have,
in the middle between them.
But you got slightly more slope,
then not a lot.
But so compared to those two things,
I mean actually just this amazing thing
happened with neural machine translation.
So it was only in 2014, after the WMT evaluation, that people started playing with the question: could we build an end-to-end neural machine translation system?
But then extremely quickly,
people were able to build these systems.
And so by 2016 they were clearly winning in the Workshop on Machine Translation.
In terms of how much slope you have for improvement, the slope is extremely high.
And indeed the numbers are kind of
continuing to go up too in the last year.
So that's actually been super exciting.
As I say on the next slide, neural MT really went from this sort of fringe research activity in 2014, of let's try this and see if it could possibly work, to, two years later, being the way that you have to do machine translation.
Because it just works better
than everything else.
So I'll say more about
machine translation.
But I thought I'd just highlight
at the beginning, well,
why do we get these big wins
from neural machine translation?
And I think there are maybe
sort of four big wins.
At any rate,
this is my attempt at dividing it up.
So the first big win
is the fact that you're just
training these models end-to-end.
If you can train all parameters of the model simultaneously for one target-driven loss function, that's just proved a really powerful notion.
And indeed I think quite a lot of the success of deep learning systems is that, because we have these sort of big computational flow graphs, we can optimize everything in one big backpropagation process. So it's easy to do end-to-end training, and that's been a very productive thing. And it's the end-to-end training, more than neural nets being magical, that I think has sometimes given enormous amounts of power to these systems.
But there are other factors as well.
So as we stressed a lot,
these distributed representations
are actually just worth a ton.
So that they allow you to kind of
share statistical strength between
similar words, similar phrases.
And you can exploit that to
just get better predictions,
and that's given a lot of improvement.
A third big cause of improvement has been that these neural MT systems are just much better at exploiting context.
So Richard briefly mentioned
traditional language models.
So those were things like four gram and
five gram models which were just
done on counts of how often
sequences of words occurred.
And those were very useful parts
of machine translation systems.
But the reality was that the language models on the generation side only used a very short context. And when you were translating words and phrases, the standard systems did that completely context-free.
So the neural machine translation systems
are just able to use much more context and
that means that they can do a lot better.
And there's an interesting way
in which these things kind of
go together in a productive way.
So precisely the reason why neural machine translation systems can practically use much more context is because there are these distributed representations that allow you to share statistical strength.
Effectively you could never use more
context in traditional systems.
Because you were using these
one-hot representations of words.
And therefore you couldn't build more than
five gram models usefully because you were
just being killed by
the sparseness of the data.
And then the fourth thing that I want to call out is really related to all of one, two, and three, but I think it's just worth calling out separately. Something really powerful that's happened in the last couple of years with neural NLP methods is that they've proven to be extremely good at generating fluent text.
So, I think it's fair to say
that the field of sort of
natural language generation was sort
of fairly moribund in the 2000s decade.
Because although there were sort
of simple things that you can do,
writing a printf,
that's a text generation method.
[LAUGH] But people could do a bit
better than that with grammar driven
text generation and so on.
But there really were not a lot of good
ideas as how to produce really good,
high quality natural language generation.
Whereas it's just proven extremely easy and productive to do high-quality natural language generation using these neural language models.
Because it's very easy for
them to use big contexts,
condition on other goals at the same time,
and they work really well.
And so one of the big reasons why neural machine translation has been so successful, and the results look very good, is that the text these systems are generating is very fluent. In fact it's sometimes the case that the actual quality of the translation is worse, but the quality of the generation in terms of fluency is much better.
It's also worth knowing
what's not on that list.
So one thing that's not on that list,
that's a good thing,
is we don't have any separate
black box component models for
things like reordering and
transliteration and things like that.
And traditional statistical MT systems
have lots of these separate components.
You had lexicalized reordering components
and distortion models and this models and
that models.
And getting rid of all of that with
this end to end system is great.
There are some other things that are not so great: our current NMT models really make no use of any kind of explicit syntax or semantics.
You could sort of say, well, maybe some interesting stuff is happening inside the word vectors, and maybe it is. Sorry, inside the current hidden state vectors, and maybe it is, but it's sort of unclear.
But actually this is something
that has started to be worked on.
There have been a couple of papers
that have come out just this year.
Where people are starting to put more
syntax into neural machine translation
models, and
are getting gains from doing so.
So I think that's something
that will revive itself.
Also, another huge failing of machine translation has been that a lot of the errors involve higher-level textual notions, which are really badly handled by machine translation systems.
So those are things of sort of discourse
structure, clause linking, anaphora and
things like that.
And we haven't solved those ones.
Yeah, so that's been the general picture.
Before going on: one of the things we haven't done very much of in this class is actually looking at linguistic examples and having language on the slides.
So I thought I'd do at least one
sentence of machine translation.
And I kind of guessed that
the highest density of
knowledge of another language
in my audience is Chinese.
So we're doing Chinese.
And this is my one sentence test set for
Chinese to English machine translation.
So I guess back in the mid 2000s,
we were doing Chinese to
English machine translation.
And there was this evaluation
that we did kind of badly on.
And one of the sentences that we
translated terribly was this sentence.
And ever since then, I've been using
this as my one sentence evaluation set.
So I guess this sentence, it actually
comes from Jared Diamond's book,
Guns, Germs, and Steel.
So in a sense it's sort of a funny one
since it's starting with the Chinese
translation of Jared Diamond's text.
And then we're trying to translate it
back into English, but never mind!
That's our sentence for now.
So what have we got here?
So this is the 1519 year,
there were 600 Spanish people and
their landing in Mexico.
And then we've got "to conquer".
And the first bit I want to focus
on is then this next bit here.
The several million population
of the Aztec Empire.
And so, what you get in Chinese is so
here's our "Aztec Empire".
So in general in Chinese all modifiers
of a noun are appearing before the noun.
And Chinese has this really handy little
morpheme right here, the [FOREIGN].
This is saying the thing
that comes before it,
shown in that brownish color, is
a modifier of this noun that follows it.
And this one's saying the sort
of several million population.
So it's the Aztec Empire with
the population of a few million.
And there's this very specific linguistic
marker that tells you how you're meant to
translate it.
And then after that we've got the part here, where we've got: so, first time confronted them, losses, two-thirds.
And so that's just sort of tacked
on to the end of the sentence, so
they lost two-thirds of their
soldiers in the first clash.
This is just an interesting
thing in how translation works.
So you could in an English translation try
and tack that onto the end of the sentence
and sort of say "losing two thirds of
their soldiers in the first clash" or
"and they lost two thirds of their
soldiers in the first clash".
But neither of those sound
very good in English.
So, below here what we
have is the reference
translation which is where we got some
competent human to translate this.
And so, interestingly what they did and
I think correctly actually here is that
they decide it would actually be much
better to make this into two sentences.
And so, they put in a period and
then they made a second sentence.
They lost two thirds of their
soldiers in the first clash.
Okay, so I won't tell you a bad translation, but every year since, I've been running this sentence through Google, and so I'll show you the Google translations. So in 2009, this is what Google produced.
1519, 600 Spaniards landed in Mexico.
So that start's not very good.
But if we go in particular
to this focus part,
millions of people to
conquer the Aztec empire.
No, that's not correct.
And well it's getting some of the words
right but it's completely not making any
use of the structure of
the sentence in Chinese.
And it doesn't get much better.
The first two-thirds of
soldiers against their loss.
Okay, so we can go on to 2011.
I left some of them out so the font size stayed vaguely readable.
So it changes a bit but not really.
1519, 600 Spaniards landed in Mexico.
Millions of people to
conquer the Aztec empire.
The initial loss of soldiers
two-thirds of their encounters.
So that last bit may be a fraction
better but the rest of it is no better.
In 2013, it seemed like they might
have made a bit of progress.
1519 600 Spaniards landed in Mexico
to conquer the Aztec empire,
hundreds of million of people.
It's unclear if it's made progress. The fact that you can read "to conquer the Aztec empire" as meaning the Spaniards did it sort of means it might have made some progress, but then after that they just dump the "hundreds of millions of people" between two commas.
And so it's really not quite clear what that's doing, but it sort of seemed like whatever that change was, it was just kind of luck, because in 2014 it sort of switched back: 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.
And not only that interestingly
when I ran it again in 2015 and
2016, the translation
didn't change at all.
So I don't know what all the people were
doing on the Google MT translation team in
2015 and 2016, but they definitely weren't
making progress in Chinese translation.
And I think this sort of reflects the feeling that the system wasn't really progressing. They had sort of built the models and mined all the data they could for their Chinese-English MT system, and it wasn't getting any better.
So then in late 2016, Google rolled out
their neural machine translation system,
which you're gonna hear
more about in a moment.
And there are actual and distinct signs of progress. So: In 1519, 600 Spaniards landed in Mexico.
So the beginning of it is a lot better
'cause the whole time it'd just been
plunking down 1519 and 600,
which wasn't a very promising beginning.
In the Chinese there's no word for
"in", right?
So this character here is "year", right?
So it's sort of 1519 year,
600 people, Spanish people, right?
But clearly in English you wanna be putting an "in" in there and say "in 1519". But somehow Google never managed to get that right, where you might have thought it could. But now it does, right?
In 1519 comma, great beginning, and
it continues much better 600
Spaniards landed in Mexico
to conquer the millions of people of the
Aztec empire, this is getting really good.
Neural machine translation is much,
much better.
But there is still some work to do.
I guess this last part is kind of
difficult in a sense the way it's so
tacked on to the end of the sentence.
But right, it still isn't working very well for that, 'cause they've just tacked on "the first confrontation they killed two-thirds", which seems to be the wrong way around, because that's suggesting they killed two-thirds of the Aztecs, whereas it's meant to be that they lost two-thirds of the Spaniards.
So there's still work to be done in improving neural machine translation.
But, I do actually think that that's
showing very genuine progress and that's,
in general, what's been shown.
So neural machine translation
has just given big gains.
It's been aggressively
rolled out by industry.
So actually the first people who rolled out neural machine translation were Microsoft.
So in February 2016, Microsoft
launched neural machine translation
on Android phones no less.
And another of the huge selling points
of neural machine translation systems,
is that they're actually
massively more compact.
So that they were able to build a neural
machine translation system that actually
ran on the cellphone.
And actually that's a very useful
use case, 'cause the commonest
time when people want machine translation
is when they're not in their home country.
And at that point it depends, but a lot of people don't actually have cell plans that work in foreign countries at decent prices.
And so it's really useful to be able to
run your MT system just on the phone.
And while that was essentially never possible with the huge kind of lookup tables of phrase-based systems, it is now possible.
Systran is a veteran old MT company that also launched such a system.
And then Google launched their
neural machine translation system
with massively more hype than
either of the two predecessors,
including some huge overclaims of
equaling human translation quality.
Which we've just seen still isn't true,
based on my one sentence test set,
that they still have some work to do.
But on the other hand,
they did publish a really interesting
paper on the novel research that they've
done on neural machine translation.
And so for the research highlight
today Emma is gonna talk about that.
>> Hi, today I'm gonna talk about Google's multilingual NMT system, which enables zero-shot translation.
So as we have seen in the lecture,
this is the standard architecture for
an NMT system which you have
an encoder and a decoder.
However, this architecture supports only bilingual translation, meaning that we can have only one specific source language and one specific target language.
So what if you want to have a system
that's able to do multilingual
translation?
Meaning that we can have multiple source
languages and multiple target languages.
So previously people have proposed
several different approaches.
The first approach proposed having multiple different encoders and multiple different decoders, where each pair corresponds to one specific pair of source and target languages.
And the second proposed having a shared encoder that works for one specific source language, but different decoders to decode into different target languages. And the third proposed having multiple different encoders to work for different source languages, with a single shared decoder to work for one specific target language.
So what's so special about
Google's multilingual NMT system?
So first of all,
it's really simple because here we only
need one single model that is able to
translate from different source languages
to different target languages, and
because of the simplicity the system can
trivially scale up to more language pairs.
And second, the system improves the translation quality for low-resource languages, because the parameters of the model are shared implicitly, and so the model is forced to generalize across language boundaries.
So it's observed that if we train a language pair that has very little training data together with a language pair that has a lot of training data in one single model, the translation quality for the low-resource language is significantly improved.
And also the system is able to perform zero-shot translation, meaning that the model can implicitly translate between language pairs it has never seen during training time.
For example, if we train a model
on Portuguese to English and
English to Spanish data,
the model is able to
generate reasonable translation for
Portuguese to Spanish directly.
Without seeing any data for
the language pair during training time.
And this is the architecture for
the models.
As we can see, this is kind of
the standard architecture for
the state-of-the-art NMT system.
We have multiple stacked layers of LSTMs for both the encoder and the decoder, with an attention mechanism applied, which we will talk about later in the lecture.
So what is the magic here that enables the
system to do a multilingual translation?
So it turns out instead of trying to
modify the architecture, they instead
modified the input data, by adding the
special artificial token at the beginning
of every input sentence, to indicate what
target language you want to translate to.
So for example, if you wanna translate from English to Spanish, we simply add this token to indicate that Spanish is the target language.
And after adding this artificial token,
we simply just put together
all of the multi-lingual data and
just start training.
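Just to illustrate the data-side trick (the token format and example sentences below are made up for the illustration; the real system defines its own special tokens), the preprocessing amounts to something like:

    # prepend an artificial token naming the target language to each source sentence,
    # then pool all language pairs into one training set
    pairs = [
        ("How are you?", "Wie geht es dir?", "de"),
        ("How are you?", "¿Cómo estás?", "es"),
    ]
    training_data = [("<2{}> {}".format(tgt_lang, src), tgt)
                     for src, tgt, tgt_lang in pairs]
    # e.g. ("<2es> How are you?", "¿Cómo estás?"); the model itself is unchanged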
With this simple trick,
the system is able to surpass
the state-of-the-art performance for
English to German, French to English,
and German to English translation.
And they have comparable performance for English to French translation, all on the WMT benchmarks.
And here's a little more detail about zero-shot translation.
The setting is like this.
So during training time we train
a model on Portuguese to English and
English to Spanish data.
But during test time we ask the model
to perform Portuguese to Spanish
translation directly.
And it's shown here that the model is able to have comparable performance to the phrase-based machine translation system, and also to the NMT system with bridging.
And with a little bit of incremental training, meaning that we add a little bit of data for the Portuguese to Spanish language pair, the model is able to surpass all of the other models listed above.
And that's all, thank you.
[APPLAUSE]
>> So
I think that actually is
a really amazing result.
I mean, in some sense,
it's actually realizing a long-held
dream of machine translation.
So a traditional problem with machine translation has always been that if you'd like to be able to translate between a lot of languages, you're then dealing with a number of systems that is the product of the number of languages, right?
So if you'd like to support around
80 languages as Google does.
That if you wanna allow translation
between any pairs straightforwardly you
have to build 6,400 machine
translation systems.
And that's a lot of machine
translation systems.
And they never quite did that.
That was a reference to bridging.
So if something was being bridged,
what that effectively meant for
Google was you were translating
twice via an intermediate
language where the intermediate
language was normally English.
So the goal in MT has for a long time been to achieve this dream of an interlingua.
So that if you had an interlingua in the middle, you only have to translate each language to and from the interlingua, so you only need 80 encoders and 80 decoders; the number of systems is then linear in the number of languages.
And that has sort of never been very successful, which is why effectively people just sort of built all of these bilingual systems. But this system is now sort of illustrating how you can actually have the encodings of a neural MT system be an effective interlingua.
Okay, so now on to the main technical content to get through today, which is introducing this idea of attention.
So what's the problem
we want to deal with?
So in the sort of vanilla sequence-to-sequence, encoder-decoder model, we have this problem that our only representation of the input is this one fixed-dimensional representation Y, which was sort of the state that our encoder was last in.
And so,
we need to kind of carry that through
our entire generation of
our translation sentence.
And that seems like it might be a difficult thing to do, and indeed, what was shown was that it was a difficult thing to do. What people found is that these initial neural machine translation systems worked well on short sentences, but if you tried to use them to translate very long sentences, their performance started to tank, and I'll show you some numbers on that later. And so the idea that people came up with, and this idea was actually first proposed for vision but was then moved over and tried for neural machine translation by Kyunghyun Cho and colleagues at Montreal, was to say: well, instead of saying that the Y that we generate from is just the last hidden state, why don't we say all of the hidden states of the entire encoding process are available to us?
And so
we sort of have this pool of source states
that we can draw from
to do the translation.
And so then when we're
translating any particular word,
we then want to work out which
of those ones to draw from.
So effectively,
the pool of source states becomes kind
of like a random access memory which the
neural network is then going to be able
to retrieve from as needed when
it wants to do its translation.
And it'll find some stuff from it and
use it for translating each word.
And so attention for
neural machine translation is one specific
instantiation of this, but in general this
sort of builds into a bigger concept that
has actually been a very exciting concept
in recent neural networks research and
I know at least a couple of groups are interested in doing this for their final projects, which is this idea of: can we augment neural networks with a memory on the side? So that we can not only lengthen our short-term memory with an LSTM, but can actually have a much longer-term memory that we can access stuff from as we need it.
And attention is a simple
form of doing that.
And then some of the more recent work, like Neural Turing Machines, is trying to do more sophisticated forms of read-write memories augmenting neural networks.
Okay, so if we want to retrieve as needed,
you could think of that as saying,
okay, well,
out of all of this pool of source states,
we want to be looking at where in
the input we want to retrieve stuff from.
So effectively, after we've said "Je", and we're wanting to translate the next word, we should be working out: well, where in here do we want to be paying attention, to decide what to translate next? And if it's French, we wanna be translating the "am" next.
And so our attention model effectively
sort of becomes like an alignment model.
'cause it's saying, well,
which part of the source are you
next gonna be translating?
So you've got this implicit alignment
between the source and the translation.
And that just seems a good idea, 'cause
that's even what human translators do.
It's not that a human translator
reads the whole of a big,
long sentence and says, okay, got it.
And then starts furiously scribbling
down the translation, right?
They're looking back at
the source as they translate, and
are translating different phrases of it.
And so Richard mentioned last
week the idea that in training
statistical models that one of
the first steps was you worked
out these word alignments between
the source and the target.
And that was used to extract phrases
that gave you kind of phrases to use in
a statistical phrase based system.
Here, we're not doing that; rather, it's just at translation time, through the process of using this attention model.
We're implicitly making connections
between source and target,
which gives us a kind of alignment.
But nevertheless, it effectively means
that we're building this end-to-end neural
machine translation system that's doing
alignments and translation as it works.
So it achieves this NMT vision, and
you do get these good alignments.
So we're using this kind of
on the right structure where
we're sort of filling in where
the alignments have occurred.
And so you can look at where attention
was laid when you're producing
a translation,
translating here from French to English.
And you can see that this model,
which is a model from people at Montreal,
is doing a good job at deciding
where to place attention.
So it's starting off with "the agreement on the", and then the interesting part is that French typically has adjectival modifiers after the head noun. So this is the zone, economic, European, which you have to flip in English to get "the European Economic Area".
And so
it's kind of correctly modelling that flip
in deciding where to pay
attention in the source.
And then kind of goes back to
a more monotonic linear order.
Okay, so that looks good,
how do we go about doing that?
So what we're gonna be doing is
we've started to generate, and
we wanna generate the next word.
And we want to use our hidden
state to decide where to access
our random access memory,
which is all the blue stuff.
And so, well we haven't yet generated
the hidden state for the next word, so
it seems like our only good choice
is to use, I think I skipped one.
Okay, the only good choice is to use
the previous hidden state
as the basis of attention.
And that's what we do, and then what we're gonna do is come up with some score that combines it with elements of the encoder hidden states. And commonly, people are only using the highest level of the hidden state for attention, and that decides where to pay attention. And so this scoring function will score each position, saying where to pay attention.
And I'll get back to the scoring
functions in a minute.
And so the model that they proposed was,
we get a score for
each component of the memory.
And then what we're gonna do is
sort of build a representation
which combines all of the memories
weighted by the score.
So what we're gonna do is
we're going to say, okay,
we'll take those scores and
we'll do our standard trick.
We'll stick them through
a softmax function and
that will then give us a probability
distribution of how much
attention to pay to the different
places in the source.
And so then we're going to combine together all of the hidden states of the encoder, weighted by how much attention we're paying to them. So we're taking each hidden state of the encoder, and the amount of attention we're paying to that position, and then just calculating a weighted sum, and that then gives us a context vector.
So now rather than simply
using the last hidden state
as our representation of all of meaning,
we're using the entire of our
hidden states of the encoder as
our representation of meaning.
And at different points in
time we weight it differently
to pay attention in different places.
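Here is a minimal NumPy sketch of that computation (toy sizes, random stand-in vectors, and simple dot-product scoring; the scoring choices are discussed below):

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    H, S = 4, 6                              # hidden size, source length (toy numbers)
    encoder_states = np.random.randn(S, H)   # one hidden vector per source position (the "memory")
    h_dec = np.random.randn(H)               # previous decoder hidden state

    scores = encoder_states @ h_dec          # one attention score per source position
    alphas = softmax(scores)                 # probability distribution over source positions
    context = alphas @ encoder_states        # weighted sum of encoder states, the context vector
    # the next word is then predicted from the decoder state together with this context vector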
And so now what we're gonna do is based on what we were. This is going automatic on me. Now what we're gonna do is, based on what we were doing before, so the previous hidden state and the previous word of the decoder, but also conditioned on this context vector, we're then gonna generate the next word.
Okay, so then the question is, well,
how do we actually score that?
And at this point we need
some kind of attention
function that decides how
to work out the score.
And a very simple idea you could use for
that is just to say, well,
let's take the dot product between the decoder hidden state and an encoder hidden state.
And we wanna find the ones that are
similar, cuz that means we're in the right
ballpark of words that have the same
meaning, and generate from that.
And that's a possible
thing that you could do.
The one that was proposed by the people
in Montreal was this bottom one.
Where we're effectively using a single
layer of neural net, just like
the kind of functions that we've been
using everywhere else inside our LSTM.
So we're taking the concatenation of the two hidden states, we're multiplying by a matrix, putting it through a tanh function, and then multiplying that by another vector, where both the v and the W are learned.
And using that as an attention function.
And so that's what they did in their work,
and that worked pretty well.
In the work we did at Stanford, so
principally Thang Luong's work, that we
proposed using a different attention
function, which is the one in the middle.
Which is this bilinear attention function,
which has actually been quite
successful and widely adopted.
So here, it's kind of like the top
one where you're doing a dot product.
But you're sticking in between
the dot product a mediating matrix W.
And so that matrix can effectively
then learn how much weight to
put on different parts of the dot product.
To sort of have an idea of
where to pay attention.
And that's actually turned out to
be a model that works kind of well.
And I think there's a reason
why it works kind of well.
Cuz what you would like to do
is kind of have interaction
terms that look at h_t and h_s together.
And even the dot product kind of has
this interaction between h_t and h_s.
And this is a more sophisticated way
of getting an interaction between
h_t and h_s.
Whereas if you're using this model with
only a single layer of neural network,
you don't actually get
interactions between h_t and h_s.
Because you've got the sort of
two parts of this vector and
each of them is multiplied by
a separate part of this matrix.
And then you put it through a tanh, but
that just rescales it element-wise.
And then you multiply it by a vector,
but that just rescales it element-wise.
So there's no place that h_t and
h_s actually interact with each other.
And that's essentially the same
problem of the sort of classic result
that you can't get an xor function
out of a one layer perceptron is
because you can't get the two
things to interact with each other.
So, this is a very
simple low parameter way
in which you can actually
have interaction terms.
It seems to work really well for
attention functions.
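As a toy NumPy sketch of the three scoring functions just discussed (random stand-in weights; in a real system W, W_a and v_a are learned):

    import numpy as np

    H = 4
    h_t = np.random.randn(H)            # decoder hidden state
    h_s = np.random.randn(H)            # one encoder hidden state
    W   = np.random.randn(H, H)         # matrix for the bilinear score
    W_a = np.random.randn(H, 2 * H)     # weights for the one-layer-net score
    v_a = np.random.randn(H)            # vector for the one-layer-net score

    score_dot      = h_t @ h_s                                        # plain dot product
    score_bilinear = h_t @ W @ h_s                                    # bilinear form, h_t^T W h_s
    score_onelayer = v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))  # single-layer neural net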
It's not the only way
that you could do it.
Another way that you could do things, which a couple of papers have used, is to say, well, gee, a one-layer neural net's just not enough, let's make it a two-layer feedforward network. And then we could have arbitrary interactions again, like in the XOR example.
And a couple of people have
also played with that.
Another thing that has been explored for
attention that I'll just mention.
So the simple model of attention,
you've got this attention function.
That spreads attention over
the entire source encoding.
And you've got a weighting on it.
That's kind of simple, it's easy to learn.
It's a continuous,
nice differentiable model.
It's potentially unpleasant
computationally if you've got very long
sequences.
Because that means if you start thinking
about your back prop algorithm that you're
back propagating into
everywhere all the time.
So people have also looked some
at having local attention models.
Where you're only paying attention to
a subset of the states at one time.
And that's more of an exact notion of
retrieving certain things from memory.
And that can be good,
especially for long sequences.
It's not necessarily compellingly better
just for the performance numbers so far.
Okay, so here's a chart that shows you
how some of the performance works out.
So what we see is that this
red model has no attention.
And so this shows the result that,
a no attention model works reasonably
well up to sentences of about length 30.
But if you try and run a no-attention machine translation system on sentences beyond length 30, performance just starts to drop off quite badly.
And so in some sense this is
the glass half full story.
The glass half full is actually LSTMs
are just miraculous at remembering things.
I mean,
I think, quite to many people's surprise, they can remember out to about length 30, which is actually pretty stunning.
But nevertheless,
there's magic and there's magic.
And you don't get an infinite memory.
And if you're trying to translate sentences that are 70 words long, you start to suffer pretty badly with the basic LSTM model. Oops, okay.
So then the models that are higher up are models with attention, and I won't go through all the details.
The interesting thing is that even for these shorter sentences, there are actually a lot of gains from putting attention into the models. It actually does just let you do a much better job of working out where to focus at each generation step, and you translate much better.
But the most dramatic result is
essentially these curves turn into flat
lines, there's a little
bit of a peak here, maybe.
But essentially you can be translating
out to 70 word sentences without your
performance going downhill.
And that's interesting.
The one thing that you might think freaky
about all of these charts is that they all
go downhill for very short sentences.
That's sort of weird.
But I think it's sort of just a weird fact about the data. It turns out that the things in this kind of data, which is European Parliament data actually, that are five-word sentences just aren't sentences like "I love my mum", which is a four-word sentence with a really simple grammatical structure. When you're seeing five-word things, they're normally things like titles, or half sentences that were cut off in the middle, and things like that. So they're sort of weirdish stuff, and that's why they tend to prove hard to translate.
Okay, here are just a couple of examples,
of giving you again some
examples of translations.
So we've got a source,
a human reference translation.
Then down at the bottom,
we have the LSTM model.
And above it, the one with attention put in. So for this sentence, the base model does a decent job of translating it, except for one really funny fact.
It actually sticks in here a name that
has nothing whatsoever to do with
the source sentence.
And that's something that you
actually notice quite a bit in neural
machine translation systems.
Especially ones without attention.
That they are actually
very good language models.
So that they generate sentences that
are good sentences of the target language.
But they don't necessarily pay very much
attention to what the source sentence was.
And so they kind of go, okay,
I'm generating a sentence and
a name goes there, stick in some name.
And let's get on with generating, it's got
nothing to do with the source sentence.
That gets better in the other example,
where it actually
generates the right name.
That's an improvement.
Here's a much more complex example
where there's various stuff going on.
One thing to focus on though, is that
the source has this "not incompatible"
whereas the base model translates
that as "not compatible",
which is the opposite semantics.
Whereas with our one here, we're getting "the incompatible", so "not incompatible".
So that's definitely an improvement.
None of these translations are perfect.
I mean in particular one of the things
that they do wrong is "safety and
security".
Where in the translation,
we have exactly the same words,
so it's of the form A and A.
Now really safety and
security have a fairly similar meaning.
So it's not actually so
unreasonable to translate either
of those words with this word.
But clearly you don't want to translate
safety and security as safety and safety.
[LAUGH] That's just not
a very good translation.
So that could be better.
I'll go on.
Yeah.
So this idea of attention
has been a great idea.
Another idea that's been interesting
is the idea of coverage.
That when you're attending,
you want to make sure you've attended
to different parts of the input, and
that was actually an idea that, sort of,
again, first came up in Vision.
So, people have done Caption Generation,
where you're wanting to generate
a caption that summarizes a picture.
And so
one of the things you might wanna do
is when you're paying
attention to different places,
you wanna make sure you're paying
attention to the different main parts.
So you both wanna pay
attention to the bird.
And you wanna pay attention to
the background so you're producing
a caption that's something like "a
bird flying over a body of water".
And so you don't want to miss
important image patches.
And so that's an idea that people have
also worked on in the neural MT case.
So one idea is an idea of doing
sort of attention doubly, and
so you're sort of working out
an attention in both directions.
So there's a horizontal attention and
a vertical attention.
And you're wanting to make sure you've
covered things in both directions.
Okay, so that's one idea.
And in general, something interesting that's been happening in the last roughly a year, I guess, is that essentially people have been taking a number of the ideas that have been explored in other approaches to machine translation and building them into more linguistic attention functions.
So one idea is this idea of coverage.
But actually if you look in the older
literature for word alignments, well there
are some other ideas in those older
machine translation word alignment models.
Some of the other ideas
were an idea of position.
So normally attention or alignment isn't
completely sort of random in the sentence.
Normally although there's some reordering,
stuff near the beginning of the source
sentence goes somewhere near the beginning
of the translation, and stuff somewhere
near the end of the source sentence
goes towards the end of the translation.
And that's an idea you can put in
to your attention model as well.
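As a sketch of what building position in might
look like, here's a simple Gaussian prior centered
on the expected diagonal position, in the spirit
of the local attention in Thang Luong's paper;
the exact form and the sigma value here are just
illustrative assumptions.

    import numpy as np

    # Target word t "expects" to align near source position
    # t * src_len / tgt_len, with a Gaussian falloff around it.
    def positional_prior(t, src_len, tgt_len, sigma=2.0):
        expected = t * src_len / tgt_len
        s = np.arange(src_len)
        return np.exp(-((s - expected) ** 2) / (2 * sigma ** 2))

    # One way to use it: multiply it into the attention
    # weights and renormalize.
    # alpha = alpha * positional_prior(t, src_len, tgt_len)
    # alpha = alpha / alpha.sum()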
And a final idea here is fertility.
Fertility is sort of
the opposite of coverage.
It's sort of saying it's bad if you pay
attention to the same place too often.
Because sometimes one word is gonna
be translated with two words or
three words in the target
language; that happens.
But if you're translating one word with
six words in your generated translation,
that probably means that you've
ended up repeating yourself, and
that's another of the mistakes
that neural machine translation
systems can sometimes make,
that they repeat themselves.
And so people have started to build
in those ideas of fertility as well.
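A toy version of a fertility check, reusing the
coverage bookkeeping from the earlier sketch (the
budget of about three words per source position
is just an illustrative number):

    import numpy as np

    # It's suspicious if one source position has soaked up more
    # attention mass than a few target words' worth: that usually
    # means the decoder is repeating itself.
    def over_fertile(coverage, max_fertility=3.0):
        return np.flatnonzero(coverage > max_fertility)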
Okay.
Any questions or
people good with the attention?
Yeah?
So the question is that when we're
doing the attention function,
we were just doing
it based on the hidden state.
And another thing that we could do is
actually put in the previous word, the xt.
And also put that into
the attention function.
I mean one answer is to say yes,
of course you could.
And you could go off and try that.
And see if you could get value from it.
And it's not impossible you could.
I suspect it's less likely that
that's really going to work,
because I think a lot of the time,
what you get with these LSTMs
is that the hidden state,
to a fair degree,
is still representing the word
that you've just read in, but
it actually has the advantage that
it's kind of a context-disambiguated
representation of the word.
So one of the really useful things that
LSTMs do is that they're sort of very good
at word-sense disambiguation because
you start with a word representation,
which is often a kind of average of
the different senses and meanings of a word,
and the LSTM can use its
preceding context to decide:
in this context, I should be
representing this word in this way.
And you kind of get this
word sense disambiguation.
So I suspect, most of the time,
the hidden state
records enough about the meaning of the
word, and actually improves on it by this
use of context, that I'm a little
doubtful whether that would give gains.
On the other hand, I'm not actually
aware of someone that's tried that.
So it's totally in the space
of someone could try it and
see if you could get value from it.
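For concreteness, the variant the question
proposes might look like this as an additive,
Bahdanau-style score, where alongside the decoder
hidden state and an encoder state you also feed
in the embedding of the previous word. All the
weight matrices here are hypothetical; this is
a sketch of the idea, not a tested recipe.

    import numpy as np

    # h_dec: decoder hidden state, h_enc: one encoder hidden state,
    # x_t: embedding of the previously generated word.
    # W1, W2, W3 project everything into a shared space;
    # v reduces it to a scalar score for this source position.
    def score(h_dec, h_enc, x_t, W1, W2, W3, v):
        return v @ np.tanh(W1 @ h_dec + W2 @ h_enc + W3 @ x_t)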
Yes.
Yes, there's a very good reason to use
an LSTM as your generator even if you're
going to do attention,
which is that the most powerful part of
these neural machine translation systems
remains the fact that you've got this
neural language model as your generator,
which is extremely good
as a fluent text generator.
And that's still being powered
by the LSTM of the decoder.
And so.
>> [INAUDIBLE]
>> No.
The power you get from the LSTM at
better remembering the sort of longer
short-term memory is really useful
as a language model for generation.
So I'm sure that that's still
giving you huge value, and
you'd be much worse off without it.
Yeah.
I mean the thing that you could wonder
is, in this picture, I'm still feeding
the final state in to initialize
the LSTM for the decoder.
Do you need to do that, or
could you just cross that off and
start with a zero hidden state, and
do it all with the attention model?
That might actually work fine.
Yeah?
>> [INAUDIBLE]
>> That's a good question.
So where do I have that?
Here, okay, yeah.
So in this simple case,
if you're making a hard decision
to pay attention to only
a couple of places,
that then kills differentiability.
And so the easiest way to sort
of keep everything nice and
simply differentiable is just to say,
use global attention.
Put some attention weight
on each position that's
differentiable the whole way through.
So if you're making a hard decision here,
traditionally,
the most principled way to
train the model properly
is to say, okay, we have
to do this as reinforcement learning,
'cause reinforcement learning lets you get
around the non-differentiability.
And then, you're in this space of deep
reinforcement learning which has been
very popular lately.
And, there are a couple of papers
that have used local attention,
which have done it using
reinforcement learning training.
So in the paper that Thang did,
that's not what he did.
I think it's true to say that,
to some extent, he sort of fudged
the non-differentiability, but
it seemed to work okay for him.
But, I mean,
this is actually an area in which
there's been some recent work,
where people have explored methods
which in some sense continue this
tradition of fudging by putting it
on more of a theoretical footing and
finding that it works very well.
So, an idea that's been explored
quite a bit in recent work is to say,
in the forward model we're going
to be making some discrete
choices as to which positions
to pay attention to.
In the backwards model,
we’re going to be using
a soft approximation of those decisions,
and
we will then do the back
propagation using that.
So the kind of idea is: you are working
out, say, where to pay attention,
and you are choosing the states
with the highest need for
attention, which is a hard decision,
but in the backwards model you
then still have a sort of soft attention
and you are training with that.
And so, that leads into ideas like
the Straight-Through Estimator,
which has been explored by
Yoshua Bengio's group, and
other recent ideas like the Gumbel-Softmax,
and things like that.
And that's actually been
worked out as another way to train
these not really differentiable models,
which is in some ways easier than
using reinforcement learning.
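Here's a minimal PyTorch sketch of that
forward-hard, backward-soft idea, in the spirit
of the Straight-Through Estimator; this is an
illustration of the trick, not any particular
paper's code.

    import torch

    def straight_through_attention(logits):
        soft = torch.softmax(logits, dim=-1)
        index = soft.argmax(dim=-1, keepdim=True)
        hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)
        # The forward pass sees the hard one-hot choice, but the
        # backward pass differentiates through the soft distribution,
        # because the hard part carries no gradient of its own.
        return hard + (soft - soft.detach())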
I'll go on.
There was one other last thing
I did want to sort of squeeze in for
the end of today, which is to say
a little bit about decoding.
Okay, so at source time,
we've got our source sentence, and
we encode it in some way that
we're gonna make use of.
And decoders:
really, our decoder is just saying,
okay, here's the meaning we want to convey,
produce a sentence
that expresses that meaning.
And how can we do that decoding successfully?
I just sort of wanted to mention for
a couple of minutes what the options are and
how they work.
So, one thing in theory
we could do is say,
okay, well, let's just explore
every possible sequence of words we
can generate up to a certain length.
Let's score every one of them with
our model and pick the best one.
So, we'd literally have an exhaustive
search of possible translations.
Well, that's obviously
completely impossible to do.
Because not only is that exponential
in the length of what we generate,
we have this enormous vocabulary.
It's not even like we're doing
exponential with a base of two or three.
We're doing exponential with a base
of 100,000 or something like that.
So, that can't possibly work out.
So, the obvious idea and the first
thing that people do is -- Sorry.
I'll get to the obvious one next.
The second thing,
[LAUGH] the not quite so obvious but
probabilistically nice and proper thing
to do, is a sampling-based approach,
which is a sort of successive sampling.
So, it's sometimes referred
to as Ancestral Sampling.
So, what we're doing then is we've
generated up to word t-1, and
then we say, okay.
Based on our model, we have a probability
distribution over the t-th word.
And so, we sample from that probability
distribution one symbol at a time.
And we keep on generating
one word at a time,
until we generate our
end-of-sentence symbol.
So, we generate a word and
then based on what we have now we do
a probabilistic sample of the next
word and we continue along.
So, if you are a theoretician, that's
the right thing to do,
because if you are doing that,
you've got not only an efficient way
of generating, unlike the first approach, but
one that's unbiased and
asymptotically exact. A great model.
If you're a practical person, this
is not a very great thing to do,
because what comes out has
very high variance, and
it's different every time you
decode the same sentence.
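As a sketch, ancestral sampling looks like this,
assuming a hypothetical step(state, word) helper
that runs one decoder step and returns the
next-word distribution plus the updated hidden
state:

    import numpy as np

    def ancestral_sample(step, state, bos, eos, max_len=50):
        rng = np.random.default_rng()
        word, output = bos, []
        for _ in range(max_len):
            probs, state = step(state, word)
            # draw the next word from the model's own distribution
            word = int(rng.choice(len(probs), p=probs))
            if word == eos:
                break
            output.append(word)
        return output

Run it twice on the same source sentence and
you'll usually get two different translations,
which is exactly the high-variance behavior
described above.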
Okay, so the practical easy thing to do,
which is the first thing that everybody
really does, is a greedy search.
So, we've generated up
to the (t-1)-th word.
We wanna generate the t-th word.
We use our model, we work out what's
the most likely word to generate next,
and we choose it and
then we repeat that over and
generate successive next words,
so that's then a greedy search.
We're choosing best thing given
the preceding subsequence.
But that doesn't guarantee us
the best whole sentence, because we can
go wrong in any of a number of ways
because of our greedy decisions.
So it's super-efficient,
but it's heavily suboptimal.
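Greedy decoding, under the same assumed step
interface as before, just swaps the sample for
an argmax:

    import numpy as np

    def greedy_decode(step, state, bos, eos, max_len=50):
        word, output = bos, []
        for _ in range(max_len):
            probs, state = step(state, word)
            word = int(np.argmax(probs))  # single most likely next word
            if word == eos:
                break
            output.append(word)
        return output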
So, if you want to do a bit better
than that, which people commonly do,
the next thing that you think about
trying is then doing a beam search.
So, for a beam search we're
up to word t-1 and we say,
gee, what are the five most
likely words to generate next.
And, we generate all of them and
we have a beam of five.
And then, when we go on to generate
word t+1, we say, for
each of those sequences up to length t,
what are the five
most likely words to generate
as the (t+1)-th word, and
we generate all of them, and
then we've got 25 hypotheses.
And if we kept on doing that, we'd again
be exponential, but with a smaller base.
But we don't wanna do that.
So, what we do is say, well out of
those 25, which are the five best ones?
And we keep those five best ones.
And then, we generate five
possibilities from each of those for
time t+2.
And so we maintain a constant-size
set of k hypotheses and
we head along like that.
So as k goes to infinity,
that becomes unbiased.
But in practice our k is small,
so it is biased.
It doesn't necessarily monotonically
improve as you increase k, but
in practice it usually does,
up to some point at least.
It turns out that often there's
a limit to how big you can go
and still get improvements,
which might even be quite small.
Because sometimes you actually
tend to get worse results,
if your model is not very good,
as you explore further down the list.
It's not as efficient, right?
The cost is
going up as k squared.
So, as soon as you're at a beam of 10,
you're two orders of magnitude slower
than the greedy search, but
nevertheless it gives good gains.
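And here's a compact beam search sketch with the
same assumed step interface: keep the k best
partial translations ranked by summed log
probability, expand each with its k most likely
next words, and prune back down to k. Real systems
usually add length normalization, which is
omitted here.

    import numpy as np

    # Each beam entry is (log-prob so far, word sequence, decoder state).
    def beam_search(step, init_state, bos, eos, k=5, max_len=50):
        beams = [(0.0, [bos], init_state)]
        finished = []
        for _ in range(max_len):
            candidates = []
            for logp, words, state in beams:
                probs, new_state = step(state, words[-1])
                for w in np.argsort(probs)[-k:]:  # k best next words
                    candidates.append((logp + np.log(probs[w]),
                                       words + [int(w)], new_state))
            candidates.sort(key=lambda c: c[0])
            beams = candidates[-k:]  # prune back to the k best
            finished += [b for b in beams if b[1][-1] == eos]
            beams = [b for b in beams if b[1][-1] != eos]
            if not beams:
                break
        best = max(finished or beams, key=lambda c: c[0])
        return best[1]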
So, here are some results.
So this is from work
again of Kyunghyun Cho's.
So, in the middle here we
have the greedy decoding.
And, we're getting these
numbers like 15.5 and 16.66,
so something I haven't actually done yet
is explain the machine translation
evaluation, and that's something I'll
actually do in the next lecture.
But big is good for these scores.
So, what you see is that if you
sort of sample 50 translations and
go with the best one,
although that gives you some
improvement over the greedy one-best,
the amount of improvement it gives
you isn't actually very much, because
there's such a vast space
that you're sampling from, and
it's quite likely that most of your 50
samples are something bad.
On the other hand, if you're using
a fairly modest beam of size five or
ten, that's actually
giving you a very good and
noticeable gain, much bigger than you're
getting from the ancestral sampling.
And so,
that's basically the state of the art for
neural machine translation: people
do beam search with a small beam.
The interesting contrast with that,
actually, is that in statistical
phrase-based machine translation,
people always used a very large beam.
So people would typically use
a beam of size 100 or 150, and
really people would have
liked to go larger,
except where it's just
computationally too difficult.
But what people found with neural
machine translation systems
is small beams like 5 or
10 actually work extremely well and
conversely bigger beams often
don't work much better.
Okay, and so that gives us beam
search with a small beam as the de facto
standard in NMT.
Okay, that's it for today, and we'll have
more of these things on next Tuesday.
