Okay. Hi, everyone.
Let's get started again.
Okay, so first of all let me just say a bit about Assignment 5.
So Assignment 5 is coming out today.
Um, it's a brand new assignment,
so you guys are the guinea pigs for that.
Um, and so what it's going to be, it essentially builds on Assignment 4.
Um, so it's okay if you didn't do perfectly on Assignment 4,
but I think actually most people did.
Um, and what we're gonna be doing is adding, um,
convolutional neural networks and subword
modeling to the neural machine translation system,
seeking to make it better.
Um, so this assignment is coding heavy, written questions light.
Um, so I mean the coding that you have
to do sort of isn't actually really more difficult than Assignment 4,
it's kind of like Assignment 4.
But what we're hoping is that this time you will be able to do it on your own.
Now what I mean by that,
um, is for Assignment 4.
Well, there was tons of scaffolding telling you what everything should be,
and there were all of these auto-grader checks and you could keep on working on
your code until they passed all the autograder checks, and everybody did.
Um, and so it was very kind of coddled, shall we say.
Um, but I mean,
I guess what we're really wanting to achieve is to have a more- sorry, question.
[inaudible]?
Yes. So what we're hoping is that this can be, uh, useful.
Um, it'll be short-term pain but useful as being
a more effective ramp to doing
the final project and indeed for the rest of your life, right.
And the- the reality is that in the rest of your life,
you sort of if you're going to be doing things with deep learning,
you kind of have to work out what kind of
model to build and which pieces to stitch together,
and how to write some tests to see if it's doing something sensible.
And if it's not doing something sensible, um,
to figure out how you could change things and try different things,
and get it to work sensibly.
And so that's what we're hoping,
um, that people, um,
can do in Assignment 5,
so you've got to,
um, figure things out.
Um, you should write your own testing code.
Um, we don't have a public autograder,
so part of it is working out your own sanity checks,
trying to do things like what I talked about last week of sort of getting
simple bits working, confirming that they work on
minute amounts of test data and so on, and doing things more sensibly.
I mean in particular,
um, the one particular part of that,
that we were planning to do, um, for,
um, this assignment, I was looking for it,
um, but it's on the next slide.
Um, so, um, for this assignment and beyond, um,
we're going to enforce rules more like they are in CS107,
for those of you who are undergrads,
meaning that the TAs don't look at and debug your code for you.
Um, and so, you know,
of course we still want TAs to be helpful, come to them with your problems, um,
talk about how you're meant to use different things,
um, in the PyTorch library,
um, but you shouldn't be regarding it as the TA's job to take a big Python file,
um, tell you what's wrong with it,
and fix it up for you.
Okay. Um, the precise policy for that's,
um, written up on Piazza.
Okay. So after- any questions about that or do I go straight on in?
Okay. Um, yes so today's lecture,
um, in some sense today's lecture is an easy lecture.
Um, so in last time's lecture,
there was really sort of a ton of new stuff,
other stuff on neural networks that you hadn't seen before,
and we did Convnets and we did pooling layers,
and we did highway and residual connections,
and batch norms, and I don't know, whatever else we did.
Um, size one convolutions I guess.
So while there was tons of new stuff there,
in this lecture, in terms of sort of neural network machinery,
there isn't any new stuff at all.
So this is really easy.
Um, and this is also a really new lecture, but it was sort of put in for a reason.
And the reason for this relates to a kinda remark I made
last time about how lots of stuff keeps changing in neural network land.
So at the time we first designed this class,
and a lot of the structure of it is still the way it was then,
um, sort of around 2014-2015 when we designed this class,
it was basically axiomatic that
all deep learning models for natural language processing worked off words.
And therefore it completely made sense that we start with word vectors,
and then we start looking at things like recurrent models over words.
Whereas the fact of the matter is that in the last approximately three years,
there's been a ton of new work, including some of the most influential new work,
building language models that aren't, um,
being built over words; they're being built over pieces of words or characters.
And so this lecture is sort of meant to give you
some sense of these other ways of doing things and,
um, some orientation to some of the things that are going on.
But the actual kind of models that we're looking at are, uh,
sort of using all of the building blocks that we've already looked at, things like,
um, RNNs and ConvNets and things like that.
So let's get into this.
Um, so I'm going to start off with a teeny bit
of linguistics, learning about the structure of language, um,
first the sort of lower-level units of language, and then we'll see how that pans out,
um, for things like character-level models.
So in linguistics, um,
if you start at the bottom of the totem pole,
the first-level of linguistics is phonetics,
which is sort of understanding the sounds and the physiology of human speech.
So that's sort of like physics, or physiology,
or something, right, there are mouth parts that move,
there are ear parts that act as filters,
and there's, um, audio waves in between the two of them.
So that's kind of uncontroversial in some sense.
Um, but above that level,
the standard thing that people do for the analysis of human languages is to say,
well, human languages make use of a relatively small set of
distinctive units, which are commonly called phonemes, and which are actually categorical.
And the idea here is that well,
uh, our mouths are continuous spaces, right.
That they've got these various bits of their mouths like, you know,
tongues and pharynges and so on,
but it's a continuous space.
So actually, um, we can make an infinite variety of sounds, right.
So if I open my mouth and apply voicing and just wiggle my tongue around, I can go [NOISE].
And I can make an infinite variety of different sounds.
But the reality is that human languages aren't like that,
that out of that infinite variety of sounds,
we distinguish a small space of sounds.
Um, and something that happens when languages change is, um,
that the space of sounds that are seen as
important and distinguished in a language changes.
And that happens even inside one language such as English.
And I'm about to give an example of that.
Um, so people in cog psych talk about the phenomenon of categorical perception.
And what that means is that really there's something continuous,
but that humans perceive it as belonging to fairly sharp categories.
Um, and, you know,
you can use that for sort of, you know,
styles of clothing or whether someone counts as fat or not.
Um, but the most famous examples of categorical perception are in language,
where we can make an infinite variety of sounds but
people per- perceive them as categories.
And so effectively what that means is when you have categorical perception,
the differences within a category are sort of perceived to have shrunk.
You barely notice them at all where differences
across categories are expanded and very clear.
And so one of the cases that's been studied
a lot is what's referred to as, sort of,
um, voice onset time.
So lots of languages including English have pairs of sounds like p and b, uh,
pah and bah, and they differ based on when voicing starts.
So for buh, it has a voiced sound, like a vowel, right at the start of it.
And well that's a continuous parameter,
you can sort of make any point along a spectrum between a p and a b but, um,
human beings, um, who speak English,
um, perceive just two points on that spectrum.
And you don't sort of really notice the fine differences between them.
Um, some languages distinguish more points on the spectrum.
So Thai distinguishes three different consonant sounds,
um, depending on the voice onset time.
Um, something that might be, um,
more accessible to you is,
um, this is an example of language change.
So for a speaker like me, um,
there was caught and cot and those are different vowels,
and I hear them as different vowels.
But if you are someone who grew up in the southwest of the United States,
um, then these are exactly the same vowel and you don't distinguish them.
Then you thought I said the same thing twice, even though I'm saying two different vowels.
And so that's where, even at a dialect level, people develop
categorical perception as to
which distinctions in sounds they're sensitive to or not sensitive to.
Okay. And, I mean, why I'm mentioning this is that in some sense it's these
categorical sound distinctions that are what a lot of
our language writing systems, which we'll come to in a minute, record.
Okay. Um, so in traditional linguistics, um,
you have sounds, but sounds don't have any meanings in language.
So pah and bah don't have meanings,
and a and e don't have meanings.
And so the next level up that people normally distinguish is morphology, the parts of words.
And this is seen as the minimal level that has meaning.
And so the idea is lots of words are complex and can be
made up of pieces, and these pieces do have meanings.
So fortune has a meaning; um, fortunate,
you add this -ate ending,
um, which sort of gives fortune to somebody.
So that means you know, having fortune, um,
that has a meaning,
un has a meaning which means to reverse that.
So unfortunate means that you don't have fortune, and ly, um,
then has a meaning of turning this all into an adverb,
and you can say unfortunately not having,
um, gotten fortune, something happened.
And so these sort of pieces of, um,
words are then the minimal things that have meaning.
Um, almost no work in deep learning has tried to
make use of this sort of morpheme level of structure.
Actually, a couple of students and I, six years ago, did
try to build a system with these tree-structured neural networks
that put together meanings of words out of their pieces.
Um, but that really isn't an idea that's taken off widely.
There's sort of a reason why it hasn't taken off widely,
which is doing this and working out the semantically meaningful pieces of words,
is kind of hard and a lot of the time in NLP what people have found
is you can just about get the same kind of
results if you just work with character n-grams,
the kind of units that you put into the convolutional neural net.
Because if you just have a model that uses, um,
character trigrams, and you have sort of start-of-word plus un, and nfo, and so on,
going through to ly plus end-of-word,
then those different units,
those different character trigrams,
in a distributed way will pick up
all the important meaning components of the word pretty well,
and that's just good enough.
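As a minimal sketch of what those units look like (illustrative Python, not from the lecture; the boundary markers and the function name are my own assumptions):

```python
def char_ngrams(word, n=3):
    """Break a word into overlapping character n-grams,
    with '<' and '>' marking the word boundaries."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# e.g. char_ngrams("unfortunately") ->
# ['<un', 'unf', 'nfo', 'for', 'ort', 'rtu', 'tun', 'una',
#  'nat', 'ate', 'tel', 'ely', 'ly>']
```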
And that's actually a very classic idea that's sort of been revived.
Um, so back in the second coming of
neural networks, in the mid-80s into the early 90s, um,
there was quite a bit of sort of controversial work on
the structure of language, in particular from Dave Rumelhart and Jay McClelland.
So Jay McClelland's still in the psych department here,
if you want to look him up in your spare time.
Um, they proposed a model of generating past tense forms in English.
So this was sort of a cog-psych experiment of can we build
a system that can learn past tenses of English verbs?
And the difficult part there is that many verbs are regular,
you add the kind of -ed ending,
but some words are irregular and you have to sort of learn about the irregular patterning.
Um, but the way they did that,
I mean partly because this was sort of early days with respect to, um, sequence models,
is that they used a representation where they represented
words precisely with these sort of character trigrams.
And that was the representation of words that they used and fed forward in their model.
And that, um, idea was met with a lot of controversy by
linguists, philosophers, and other people with their ideas of
language and so there was a lot of debate in those days about that.
But from a- as a purely engineering solution that
sort of proved to be a pretty good way to do things.
And so this decade there's been other work, which includes the model, um,
developed at Microsoft of a sort of deep, um,
semantics model, where what they're using is these kinds of
character n-grams to put meaning over words.
Okay so, um, so now we might be interested in building models that aren't over words.
So we're going to have a word written as characters
and we're going to do something with it such as build character n-grams.
And so something that is just useful, um,
to know is there's actually
a fair amount of variation between languages when you do this.
So it's not all the same stuff, right?
So the first problem is, um,
there are some languages that don't put spaces between words.
The most famous example is Chinese.
Um, but an interesting fact for those people of European ancestry, um,
is that, you know, when the ancient Greeks wrote ancient Greek,
um, they didn't put spaces between words either.
It was actually a later invention of
medieval scholars who were recopying their manuscripts,
who decided [NOISE] maybe that'll be easier to read if we
put spaces in, and then they started doing it.
Um, [NOISE] most languages these days do put spaces in between words, but even
then there's sort of a lot of fine cases.
So in particular, a lot of languages have some sort of little bits
of stuff which might be pronouns, or prepositions,
or various kind of joining words like and, and so,
and which sometimes they write together and sometimes separately.
So in French, um,
you get these kind of prepositional, I'm sorry,
pronominal, um, markers for 'I', 'you', 'have brought'.
Um, and you know, these kind of little words in
pronunciation just sort of run together as je vous ai,
and arguably it's almost one word,
but it's written as separate words.
Um, where there are other languages which sort of stick things together,
where arguably they're separate words.
So in Arabic, you get pronominal clitics and some of these sort of joining words like 'so'
and 'and', and they are sort of written together as one word,
where arguably they should really be four words.
Another famous case of that is with compound nouns.
Um, so in English,
we write compound nouns with spaces between them,
so you can see each noun.
Um, even though in many respects a compound noun,
something like whiteboard or high school, behaves like it's one word.
Um, whereas other languages,
German is the most famous case,
but also other Germanic languages,
just write them all as one word and you get very long words like that.
So we can get different words if we just use spaces and don't do much else. Um, good.
Okay. Yes. So for dealing with words,
there are these practical problems.
Um, and we sort of already started to
touch on them that if you're trying to build word-based models,
there's this huge space of words,
and well, strictly there's an infinite space of words,
because once you allow in things like numbers,
let alone FedEx routing numbers, or, um,
if you just allow morphology,
where you can make ones like unfortunately,
yeah, sort of, you can just expand the space of words,
so you get this large open vocabulary.
Um, English, you know, is a bit problematic.
It gets way more problematic in a lot of other languages.
So here's a lovely Czech word, um,
meaning 'to the worst farmable one', um;
you can make sort of much more complex words in lots of other languages.
Um, many Native American languages,
other European languages like Finnish have these sort of very complex words,
Turkish has very complex words.
Um, so that's bad news.
Um, there are other more reasons we'd like to be able to
look at words below the word level,
to know things about them.
So when you're translating,
there's a wide space of things,
especially names, where translation is essentially transliteration,
that you're going to rewrite the sound of somebody's name as just roughly, you know,
perhaps not perfectly correctly but roughly correctly
according to the sound systems of the different language.
And well, if we want to do that,
we essentially want to work- operate at the letter level,
not the word level.
But another huge modern reason why we'd like to start modeling below the word level is,
we live in this age of social media and if you're in the social media land,
there's a lot of stuff that's written not
using the canonical words that you find in the dictionary,
and somehow we'd wanna start,
um, to model that.
So in some sense this is the, um,
easy case.
Um, good vibes.
Um, but nevertheless this is spelled with one,
two, three, four, five, six,
seven O's, and one, two,
three, four, five, oh and also seven S's, they match.
I don't know if that's deliberate or not [LAUGHTER]. Um, okay.
So this style of writing is very common, um, and well,
you know, we're kind of sunk if we're
treating things at the word level and we're trying to model this, right.
That's clearly not what human beings are doing; we're sort of looking at
the characters and recognizing what goes on.
Um, in some sense that's kind of the easy case that you could imagine preprocessing out.
Um, there's a lot of harder stuff that then turns up.
Um, I guess there's sort of the abbreviation speak, like I don't care.
Um, but then you sort of get a lot of creative spellings, um,
that come off of kind of reduced pronunciations like I'mma go, sumn.
Um, and it seems like somehow we need something other than
canonical words if we're going to start to deal better with a lot of this text.
Oops. Okay. So that suggests we sort of want to start doing that with our models.
And so, that's led to a lot of interest in using character level models.
Um, and I mean there are sort of two extents to which you can do this,
and we'll look at them both a bit.
Um, one level is to say,
look we're still gonna have words in our system.
Basically, we're going to build a system that works over words,
but we want to be able to create
word representations for any character sequence and we'd like to
do it in a way that takes advantage of being able to
recognize parts of the character sequence that look familiar,
so that we can probably guess what vibes means.
Um, and so that sort of then solves the problems
with unknown words, and we get
similar embeddings for words with similar spellings, et cetera.
But the other alternative is to say,
oh no, just forget about these words altogether, who needs them, um,
why don't we just do all of our language processing on sequences of characters,
it'll work out fine.
Um, both of these methods have been proven to work very successfully.
Um, and I just wanted to dwell on that for one moment,
and that sort of goes back to my,
um, morphology slide here.
When people first started proposing that they were going
to build deep learning models over characters,
I mean, my first feeling was, oh, that is never going to
work, because it sort of seemed like,
okay, words have a meaning, it makes sense,
um, that you can do something like build
a word2vec model, and that's going to really be able to
sort of see words and their distribution and learn
the meanings of the words, because words have a meaning.
The idea that you're going to be able to say, well,
I'm going to come up with a vector representation of h,
and a different vector representation of a,
and a different vector representation of t,
and somehow that'll be useful for representing what a hat
means once I put it through enough neural network layers,
um, frankly it sounded pretty unconvincing to me.
Um, but, um, I guess, you know-
But it totally works, so I'm convinced now; empirical proof.
And I think what we essentially need to realize
is that, yes,
at some level we just have these characters that don't mean much.
But we then have these very powerful combinatory models with a lot of parameters in them,
things like recurrent neural networks and
convolutional neural networks, and they are able to, sort of,
store and build representations of meaning from multi-letter groups,
in such a way that they can model the meanings of
morphemes and larger units and therefore put together word meanings.
Um, yeah. So, um,
one more detail on using characters,
um, from writing systems.
So if you're a linguist you tend to think of sounds as primary.
Those were the phonemes that I mentioned beforehand.
You know, um, essentially,
um, deep learning hasn't tried to use phonemes at all.
Traditional speech recognizers often did use phonemes,
but in the deep learning land,
you want to have a lot of data and the way you get a lot of data is you just use, um,
written stuff because, you know,
it's the easily found data where you can get millions and billions of words of stuff.
Um, so that sort of makes sense from a data point of view.
But the thing that ends up a little weird about that
is that when you're then building a character-level model,
what your character-level model is
actually varies depending on the writing system of the language.
And so you, kind of,
have these quite different writing systems.
So you have some writing systems which are just completely phonemic,
that there are letters that have a particular sound and you say that sound.
Something like Spanish is pretty much phonemic.
Sometimes it's a teeny bit complicated.
So you might have a digraph.
So this digraph, ngabulu, is, kind of, like,
the N-G of English that is used for "ng" sound like at the end of seeing,
but, you know, basically this is just 'jiyawu',
each letter is a sound,
you can read it, um,
and it's just, um, phonemic.
Um, that then contrasts with something like English where,
as all the non-native speakers know, the spelling is terrible.
It's got this, sort of, highly fossilized,
once-upon-a-time
phonemic system, from the tenth century or something.
Um, but now we have this system where words have
fairly arbitrary spellings that don't actually represent the sounds, um, very clearly.
But it's still sort of a phonemic system.
But then there are languages that use larger units.
Um, this is, um,
Canadian Inuktitut, which I just put in there because it's such a pretty writing system.
Um, but there are a lot of languages that represent syllables by their characters.
Um, so you'd have something like this in Korean for example,
with Korean Hangul, where each, um,
letter is then a syllable, a sort of consonant-vowel combination like 'ba'.
Um, you can then go up a level from that, and if we get back to Chinese again,
well, um, this is sort of also a syllabic system, you could say.
But really, the Chinese characters are much more than just the sound.
They also have a meaning.
That this is really then an ideographic system
where there are characters with particular meanings attached to them.
So they're, sort of, uh,
whole morphemes written as one character.
And, you know, another example of such language,
um, was Egyptian hieroglyphs, if you've seen those.
That they're, sort of, ideographic systems where you have letters with meanings.
Um, and then you have language systems that sort of mix several of those.
So Japanese is sort of a mixture of partly moraic,
partly ideographic systems mixed together.
So if you just,
sort of, start off and say,
"Okay, I'm gonna build a character-based system." That's fine.
But effectively, your character units like
letter trigrams are just very different in a language like Chinese,
where commonly a letter trigram will be, sort of,
a word and a half,
three morphemes with meaning.
Whereas if you're in something like English,
your character trigram will be something like T-H-O
which is still sort of much too small a unit to have any meaning.
So moving right ahead.
So there are these two kinds of approaches, um:
one was just do a completely character-level model, and then the other one was,
sort of, to make use of characters
to build bigger things that you're then gonna put,
like, into a more word-level model.
So I'll do this one first and then the other one.
So for purely character level models,
I actually showed an example of that last time. Do you remember?
So there was that very deep convolutional network from
the Conneau et al. work for text classification at the end, um,
and that just started with a big line of
characters and built these convolutional layers on top of that,
in a vision-like network, and classified the documents.
Um, but here's a bit more work on this.
So people for machine translation have built, um,
machine translation systems that just read characters and write characters.
And when people first tried to do that,
um, it sort of didn't work, right?
People thought it might help to build
character-level models, especially for languages like Chinese.
But people just weren't able to build models that worked as
well as word-based models in either the pre-neural,
the non-neural, or the neural world.
But gradually that started to change.
So people started to have successful character-level decoders and then,
sort of, around 2015 and '16,
um, people started to show,
look, you can actually do machine translation very
well at just a character level, with a few asterisks.
And so, um, here's a bit of work, um, that we did.
Um, the Luong and Manning one, from, um,
2015 on the last slide.
So this is looking at English to Czech translation and Czech's
a good language to use if you want to motivate doing things at the character level,
because it had those big horrible words with lots of
morphology like the example I showed you before and I'll show you some more later.
So people had built word-level models for Czech.
Um, and, you know, they didn't work great,
partly because of some of these vocab problems.
So, um, the, sort of,
word-level state of the art at this time was 15.7 BLEU,
which as you know is much less than we will accept for full grades in your homework.
[LAUGHTER] Um, but, you know,
what counts as a good BLEU score depends on how difficult the language pair is.
Uh, um, and so you're not doing Czech.
Um, but, um, so this was, sort of, the,
kind of, neural MT model that we've talked about.
So it was a Seq2Seq model,
with attention and then it had extra stuff for substituting UNKs with either,
uh, single word translation or by copying stuff from the source.
So it was, sort of, basically,
state of the art neural MT of 2015, got 15.7 BLEU.
And the difference isn't big,
um, but we were able to show,
look we can build this completely, um,
character-level model and then actually, it did fractionally better.
Um, so this, sort of,
showed that in terms of translation quality, um,
character, purely character-based models were completely viable at
capturing the meaning of text as well as word-based models.
Um, was this a great result?
Um, in many, in some ways,
yes, in another way, no.
I mean, this model was truly terrible to train, right?
So it took about three weeks for us to train this model and at run-time,
it also worked very slowly.
And so the problem with character-level models,
if you're putting them into something like an LSTM,
is your sequences get way longer, right.
So you've got about seven times as long sequences as you used to have.
And since there's not much information in the characters,
you have to do backpropagation through time much further back.
And so we were running backpropagation through time for
600 steps before we were truncating it.
Maybe that was excessive,
but it made the models, um, very slow.
But we were able to show that it was able to get some of these good effects, right.
So here's an example,
um, translating into Czech:
her 11-year-old daughter,
Shani Bart, said it felt a little bit weird.
And, um, I don't know, probably.
Does anyone speak Czech, any Czech speakers?
Um, no Czech speakers?
Okay, um, I don't speak Czech either, um,
but we can see [LAUGHTER] that this does interesting things, right.
So the second line is the human translation
into Czech which we can use for some guidance.
And so in particular, um,
in Czech there's a word for 11-year-old,
um, which you can see is that blue word on the second line.
And you can see that, um,
for 11-year-old, it's just able to perfectly, um,
produce, letter by letter, um,
the Czech word for 11 years old, and that works beautifully.
In contrast, for the word-level model, um,
11 year-old was an unknown word because that wasn't in the vocabulary.
And so then it had two mechanisms to try and deal with, um, unknown words.
It could either do, uh,
unigram translation of them or it could just copy them.
And for whatever reason, it decided here that the best strategy
was to copy and so that was a complete fail.
Um, and if we go along for the character-level model,
another thing that it gets right that's really cool,
um, is the name Shani Bart.
It's able to do this transliteration tasks that I mentioned just perfectly.
And it turns that to Shani Bartova
which is exactly what the human translator did as well.
And so, you know, it's actually doing some really kind of nice, um,
human-translator-like things.
I mean, in fact,
as best I can tell from spending a bit of time on Google Translate,
it actually does a pretty good job in this sentence, period.
All right, this part here starts to be different,
um, from the human translator.
But it's not actually bad.
It's sort of a more literal translation.
So this 'citi', um,
actually translates 'feel' like in the English text.
Whereas the human, sort of,
didn't actually use the word 'feel' in the Czech version; they just went,
um, 'was a little bit,
um, weird or strange'. So that's cool.
Okay. So here are a couple more results from this.
So here's another system that was built the next year,
by these people: Jason Lee,
Kyunghyun Cho, and Thomas Hofmann.
So they wanted to do something that was, I don't know,
much more complex and neural and
understanding the meaning of the text on the source side.
And so they were more using the kind of technologies we saw last time.
So on the encoder side you start off with a letter sequence of character embeddings.
And then you're sort of using convolutions of width four,
three, and five over the characters to get representations up here.
You're then doing a max pooling with a stride of five.
So you're getting a max pooled representation of pieces of the text for each of the three,
four and five convolutions.
You're then feeding that through multiple layers of
highway network and feeding that through
a bidirectional gated recurrent unit and that's giving you your source representation.
On the decoder side,
it was sort of the same as our decoder,
it was just running a character level sequence model.
So overall, they were doing the opposite task;
this is Czech to English.
But they are starting to get better scores.
But I mean, actually, if you sort of look at these different numbers,
where I'll explain this system more in a minute,
it sort of seems like the place where they get a lot of value is that using
the character-level decoder gives them a lot of value, while
this very complex model on the source side is giving them almost no value at all.
One even more recent paper,
so this is Colin Cherry and fellow researchers at Google.
So they last year did one more exploration of
doing LSTM sequence-to-sequence
style models, comparing word- and character-based models.
And this is English to French and this
is um, Czech to English which is just what we were doing.
And so in both cases, when you have a big model,
the character model wins for them;
the blue one comes out on top. But the sort of interesting thing is you
sort of see these different effects depending on
the morphological complexity of the language.
So for a language like Czech,
it's a really good idea,
if you want to build a good model,
to use the character level; they're getting about a BLEU point of difference there.
Whereas for a pair like French and English, there's
actually a tiny gain, but very little, from using a character-level model.
Okay so let me just explain these models,
so these models are models of different sizes.
So these models are using bidirectional LSTM encoders and one-directional LSTM decoders.
So the simplest model just has
a shallow bidirectional LSTM encoder and a two layer LSTM decoder.
The middle model has a three deep stack of
bidirectional LSTM encoders and a four deep stack of LSTM decoders.
And the most complex model has a six deep stack of
bidirectional LSTM encoders and an eight deep stack of LSTM decoders.
This is where it helps to work at Google.
Probably for your projects,
you don't want to go beyond three or four. Stay over here.
Okay yeah so, so these are the results.
So basically what you're finding is if you're making
sort of smaller models you're better off with words,
but as you go to big models especially if you're in a morphologically rich language,
you clearly start to win from the characters.
But there is still a loss which is essentially
exactly the same loss that we were suffering from in 2015, right?
This is the time graph and so these are the same three models as over here,
it's just the axis is changed to sort of sum the total number of LSTM layers.
And so essentially, if you're at the word level,
you can run any of these three models and they are fast that you can be translating in
sort of not much time but for the character level models your slope is much higher.
So it starts to get quite expensive to run the deep character level models.
Okay, so that's that section.
So then chugging along.
I then wanted to look at other ways of doing things.
And so these are models that in some sense still do have words but where
we're going to want to sort of build word representations out of pieces.
And there are essentially two families of ways that people have explored doing this.
One way of doing it is to say look we just want to use
exactly the same architecture as we use for a word model
except our words aren't really going to
be words at least sometimes they're going to be pieces of words.
And so those are often called word piece models.
And in particular, there's one commonest way of doing it.
It's called BPE, which I'll go through in some detail.
The other alternative is to say,
well, we're gonna kind of make a mixture or a hybrid.
So our main model is going to work in terms of words but we're
going to have some kind of facility where we can construct a representation,
for otherwise unknown words,
by doing things that at a character or a lower level.
And I'll show you a bit of that as well.
Okay, so this is BPE.
BPE is actually a pretty simple idea which has nothing to
do with deep learning, but the use of BPE has sort of become
pretty standard and successful for representing pieces of words, to allow you to
have an infinite effective vocabulary while actually working with a finite vocabulary.
So the origins of Byte Pair Encoding, and
the name byte pair, have nothing to do with natural language processing or neural nets;
it was just a compression algorithm.
So this is something like compressing your documents with gzip.
So what basic Byte Pair Encoding is,
is that you've got a collection of stuff with bytes, and you
look for the most frequent sequence of two bytes and you say,
okay, I'm going to add that sequence of two bytes as
a new element to my dictionary of possible values.
And that means I can have 257 different values for bytes, so to
speak, so that I can shrink the length of
my sequence, and I can repeat over and do that again.
And so essentially, this work suggested,
well, we can apply this kind of compression algorithm and use it as
a way of coming up with pieces of words that are
useful, doing it not strictly with bytes, despite
the name, but instead with characters and character n-grams.
And so the most common way to do this is with
characters and character n-grams, and if you're up with modern times,
you know that means there's Unicode, and you can represent
all of these lovely letters like Canadian Inuktitut
syllabics and stuff like that.
But there's actually a problem with Unicode,
which is that there are actually a lot of Unicode characters.
I forget the number;
theoretically, I think there's about 200,000 possible Unicode characters.
But at any rate, if you want to handle a bunch of languages, which include East Asian languages,
maybe you need something like 20,000 characters, and that's sort of a lot.
So there are actually some people who've literally gone back to bytes and said,
"You know, 200,000, that's a really big vocabulary.
I don't even want to deal with anything that large."
So why don't I actually just do these kind of algorithms over bytes?
And so that means that in UTF-8 encoding,
Chinese characters take three bytes each.
And so you only get whole characters if you
actually merge together several bytes that are common sequences.
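Just to make that byte-level point concrete, here's a tiny illustrative check (my own sketch, not from the lecture):

```python
# An ASCII letter is one byte in UTF-8, a Chinese character is three.
print(len("a".encode("utf-8")))   # 1
print(len("中".encode("utf-8")))  # 3
print(list("中".encode("utf-8"))) # [228, 184, 173] -- the three raw bytes
```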
Okay. So more concretely,
um, how does this work?
So we're sort of doing this bottom-up clustering of short sequences.
So we start with a unigram vocabulary,
which is all of the Unicode characters in some data.
We then sort of ask, what's the most frequent n-gram here?
Um, initially it will be a bigram pair, and we add that to our vocabulary.
So if we start off, you know,
we can take our text that's- um,
I'll come back to this in a minute.
Let's assume we have a text that has been divided into words so we do have word tokens.
And so we can represent as a dictionary and say here are some words with their frequency.
Um, and so now we look for a common letter sequence and we say, "Oh, es."
That occurs nine times, um,
in this data because we have the counts for the words on the left side.
So, um, we start with our vocabulary being all the individual letters.
We find a commonest letter sequence like es,
and so we say, "Let's clump that together and make that a new thing in our vocabulary."
So now we've got an extra thing in our vocabulary.
And now, what's the commonest n-gram sequence now that we've clumped something?
Well, actually all of these es's are followed by t,
so we also have es followed by
t with frequency nine,
and so we can add that to our vocabulary.
And then we ask again, well,
what's another common letter sequence?
Let's see, there are seven cases of- well,
I guess there are seven cases of either 'l o' or 'o w',
so we can lump those, and then we can lump
again and make a 'lo w'. So if we sort of run this,
we start to build these clumps of common letter sequences,
and so common bits like est,
but also just common words;
something like 'that' in English will very quickly be
clumped together and be a unit of our vocabulary.
Um, and so we do that for a while.
So normally what we do is we decide
a vocabulary size that we want to work with. We say, "Okay.
I want to work with a vocabulary size of 8,000 words."
That'll mean my model will be fast,
and we just sort of keep doing this until we have 8,000 things in our vocabulary.
And that means that our vocabulary will have in it
all single letters, because we started with them, and it'll
have common subsequences of words, like the es and the est that are now in our vocabulary,
but it'll also have whole words: whenever there are common words, like, you know,
the, and to, and with,
and so on, they will become parts of our vocabulary.
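Here's a minimal sketch of that merge-learning loop (my own illustrative Python, not the lecture's or the original BPE paper's code), run over a toy word-frequency dictionary in the spirit of the slide:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping a word (as a tuple of symbols) to its count."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair, e.g. ('e', 's')
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the new merged symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy data in the spirit of the slide: counts for a few words, split into characters.
data = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(learn_bpe(data, 4))  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```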
Um, and so then when we have a piece of text, we can do
a deterministic longest-piece segmentation of words,
and we will say that is our set of word pieces.
And so for an input piece of text,
we turn into word pieces,
and then we just run it through our MT system as if we were using words,
but really it's pieces of words,
and then on the output side,
we just concatenate them back together as needed.
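And here's a rough sketch of that segmentation step (again my own illustrative Python, implementing the greedy longest-match strategy described above, and assuming the vocabulary of pieces has already been learned):

```python
def segment(word, vocab):
    """Greedily split a word into the longest pieces found in vocab.
    Assumes vocab contains every single character, so this always succeeds."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest remaining piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

vocab = {"l", "o", "w", "e", "s", "t", "es", "est", "lo", "low"}
print(segment("lowest", vocab))  # ['low', 'est']
```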
Okay. So we get this sort of automatic word-based system.
And that's proved to be a very successful system.
So this idea of using byte pair encoding sort of really
emerged in 2015, and then in the 2016, uh,
Workshop on Machine Translation, which has been the main sort of annual competition for
MT systems, several of the top systems were built using byte pair encoding.
If you look at last year's competition,
there's a bit more variety,
but really a number of the top systems are still using byte pair encoding,
that's just been a good way to do things.
So for Google's Neural Machine Translation,
they effectively use a variant of byte pair encoding.
So they don't use exactly the same algorithm, um;
they use a slightly different algorithm where they're
using a language model, and rather than just using pure counts,
they're saying, "What clumping together would maximally reduce
the perplexity of my language model?", and they clump those things and repeat over.
And so they've done two versions of this model.
So the first version, the wordpiece model, kind of like, um,
byte pair encoding, assumed that you have an initial tokenization
into words, and then you're just sort of having pieces of words,
um, using this algorithm.
And then they did a second version, um,
the sentencepiece model which you can find at this GitHub site which said, "Well,
it's problematic if we need to tokenize into words first
because then we need to have a tokenizer for
every language and that's a lot of work."
Um, so maybe instead of that,
we could just sort of
go from a character sequence,
retain whitespace and regard it as something that's part of the clumping process,
so that, um,
you just build your word pieces, which
commonly will have spaces on one side or the other of them, um,
because often things inside a word are the
more common clumps, and you build those up,
and that's proven to be quite successful.
Um, in particular, one place where some of you might see this, um,
is, um, we've yet to really get to describing it in the class,
but there's been this recent work, which we'll actually talk about next week in class, on
building these transformer models; in particular,
Google has released this BERT model, which gives you
very good, um, word representations.
And if you download BERT and try and use it,
what you will find out is it doesn't operate over words,
it operates over word pieces.
Um, and so it has a large vocabulary.
It's not a vocabulary of like 8,000 words.
I forget the number,
but the models have a large vocabulary,
but they're still not a huge vocabulary and it's using word pieces.
So lots of words are in the vocabulary.
So if you look at the English model,
it not only has words like f in it,
but it even has words like Fairfax and 1910s,
which aren't that common.
Um, but nevertheless, to cover all words,
it's again using this wordpiece idea.
So if I want a representation for the word hypatia, um,
that's not in the vocabulary,
and so I'm making it up of pieces.
There's an h representation,
and then in the BERT version,
which is different to the Google NMT version,
the non-initial word pieces are represented with two hashes at the start,
so I can put that together: h, ##yp, etc.,
and this would be my representation of hypatia.
So effectively, I have word vectors, um,
for four word pieces,
and then I have to work out what to do with them.
The simplest and quite common way is I just average the four of them.
And there are obviously other things you could do.
You could run a ConvNet and max-pool, or you could run
a little LSTM or something to put together a representation.
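As a tiny sketch of that simplest averaging option (my own illustration with hypothetical pieces and random vectors, not actual BERT code):

```python
import numpy as np

# Hypothetical embedding table for word pieces (random vectors for illustration).
dim = 4
piece_vectors = {p: np.random.randn(dim) for p in ["h", "##yp", "##ati", "##a"]}

# An out-of-vocabulary word is split into pieces, and one simple way to get a
# single vector for it is just to average the piece vectors.
pieces = ["h", "##yp", "##ati", "##a"]
word_vector = np.mean([piece_vectors[p] for p in pieces], axis=0)
print(word_vector.shape)  # (4,)
```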
Okay. Yeah. So those were the models that, um,
sort of, worked with pieces of words to give you
an infinite vocabulary and ran them through a normal system.
The other possibility is to say, "Well,
we wanna work with characters so we can deal with an infinite vocabulary,
but we're gonna sort of incorporate those into a bigger system."
And a whole bunch of work has done this
and in some sense it's a fairly obvious thing to do.
Um, so this work in 2014 was one of the early ones.
So they said, "Well,
we could start with characters.
We can do a convolution over the characters to generate word embeddings,
and then we can use those word embeddings for something in a higher level model."
Um, this was actually sort of a fixed window model for doing part of speech tagging.
Um, that makes sense.
Instead of a convolution,
you could use LSTM.
So this was work from a year later,
and they said, "Well,
we're also gonna build up, um,
word representations from characters.
And the way we're gonna do it is we're gonna run character-level Bi-LSTMs,
concatenate the two final states,
and we're gonna call that our word representation,
and then we're gonna put that word representation into a
language model, which is then a higher-level LSTM that works along a sequence of words."
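Here's roughly what that looks like as a PyTorch sketch (my own illustrative code with made-up sizes, not the authors' code):

```python
import torch
import torch.nn as nn

class CharBiLSTMWordEncoder(nn.Module):
    def __init__(self, num_chars, char_dim=16, hidden_dim=32):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):           # char_ids: (batch, word_length)
        x = self.char_emb(char_ids)        # (batch, word_length, char_dim)
        _, (h_n, _) = self.bilstm(x)       # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward states -> (batch, 2*hidden_dim);
        # this vector is then fed into the word-level language model LSTM.
        return torch.cat([h_n[0], h_n[1]], dim=-1)

enc = CharBiLSTMWordEncoder(num_chars=100)
word_vec = enc(torch.randint(0, 100, (1, 7)))   # a 7-character word
print(word_vec.shape)                           # torch.Size([1, 64])
```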
And I thought I'd just- Oh, yeah.
Words, are they training uh, like character-
Yeah. Oh yeah, that's very important to realize.
Yes so yeah so if you're learning- you'll learn- I mean this is the hidden layer.
I guess I'm not actually showing the input layer but for the input layer
you're learning a vector for each character.
So effectively you're doing the same kind of thing we saw
before that you're starting with random representations for each character.
You've got this embedded inside a word sequence LSTM,
your goal is to minimize the perplexity of the higher level LSTM as,
um, as a language model and so it filters back its gradients.
So it's wanting to come up with character vectors such that it
produces good word vectors, which produce low, um, perplexities.
Good question. Um, so here's, um,
a slightly more complex version of trying to
do this that's a bit more recent, where again the idea is, can we build
a good language model by starting out from characters,
wanting to exploit sort of related subwords and rare words.
And so they built sort of this kind of
more stacked, complex model that we'll go through the stages of, where
we start with a word represented as characters.
We have character embeddings which we build
into a convolutional network and then we head upwards.
So if we take that one piece at a time,
um, so you have a character embedding for each character.
Um, you'll then have a convolutional layer which
then, sort of, has various filters that work over that, um,
character sequence: two-, three-, and four-grams of characters.
So you're getting representations of parts of words.
Um, then from those convolutional networks you're doing max pooling over time, which
is effectively sort of like choosing which of
these n-grams best represents the meaning of the word.
Um, then what they do after that is, at that point
they've got an output representation for character n-grams,
and so then they feed that into a highway network, like we talked about a bit last time.
And then the output of that, now at the word level,
um, goes into an LSTM network,
and this LSTM network is now a word-level LSTM network,
and you're trying to sort of minimize
perplexity, like for the neural language models we saw earlier.
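Here's a condensed PyTorch sketch of that stack (my own approximation with made-up sizes, not the paper's code): character embeddings, convolutions of a few widths, max pooling over time, one highway layer, and a word vector that would feed the word-level LSTM.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, num_chars, char_dim=16, filters=((2, 32), (3, 32), (4, 32))):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        # One 1-D convolution per n-gram width (2-, 3-, and 4-grams of characters).
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_out, width) for width, n_out in filters])
        out_dim = sum(n for _, n in filters)
        # A single highway layer: a gated mix of a transform and the identity.
        self.transform = nn.Linear(out_dim, out_dim)
        self.gate = nn.Linear(out_dim, out_dim)

    def forward(self, char_ids):                     # (batch, word_len)
        x = self.char_emb(char_ids).transpose(1, 2)  # (batch, char_dim, word_len)
        # Max-pool each convolution over time: pick the best-matching n-gram.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)                 # (batch, sum of filter counts)
        t = torch.sigmoid(self.gate(h))
        return t * torch.relu(self.transform(h)) + (1 - t) * h

# The resulting vector for each word would then be fed into a word-level LSTM
# language model trained to minimize perplexity, as in the model on the slide.
enc = CharCNNWordEncoder(num_chars=100)
print(enc(torch.randint(0, 100, (1, 9))).shape)      # torch.Size([1, 96])
```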
Um, so what could they show with this?
Well, the first thing they could show with it is that it
actually, again, just works well as a language model, despite that skepticism
that I had told you about: the fact of
the matter is, you can build these kinds of character-level models and
train them, and they work to a first approximation as well as word-level language models.
But one of the observations that they make is that you can be
getting as good results but with much smaller models.
So up at the top here are
the character-level LSTM models and the word ones, the models they built.
And here are a whole bunch of models over this dataset.
Um, and so as time went by, perplexities have been going down,
getting to 78.4, and their point was, well, we can
build pretty much as good a character model, with 78.9 perplexity, but
our model is actually much smaller: this model here has
52 million parameters, whereas our model that
works at the character level has only 19 million parameters.
So it's about 40% of the size.
And that seems, um, kind of interesting.
But perhaps what's more interesting is to sort of peek inside it and see what
happens with the representation of words when they're
built out of characters, and this part is actually sort of a bit cool.
Um, so what this is showing is, for the words that are up at the top, while,
his, you, Richard, trading,
it's asking what other words are most
similar to them according to the word representations that are computed.
And the top part is the output of a word-level LSTM model, and that's sort of okay.
Richard comes out as similar to Jonathan,
Robert, Neil, and Nancy, et cetera.
While: although, letting, though, minute. Mainly okay.
But it's sort of interesting what happens with their character-level models, um,
and so in particular, um,
what's kind of interesting is, first of all, you remember they sort of had
the character embeddings that went through the convolutional layer and the max pooling.
And if at that point you ask what things are most
similar, basically it's still remembering things about characters.
So the most similar words to while are
chile, whole, meanwhile, and white.
So at least for the sort of first ones, they all end in -le.
And you see that pattern elsewhere, right: close to Richard are
hard, rich, richer, richter; hard ends in -ard, and rich, richer, richter share 'rich'.
So you're just sort of getting this character sequence similarity;
it's not really doing meaning at all.
But interestingly, when they're then putting it through the highway layers,
the highway layers are successfully learning how to
transform those character sequence representations
into something that does capture meaning.
So if you then ask, at the output of
the highway layers, what words are most similar, then it seems to be working pretty well:
while is similar to meanwhile; Richard is similar to Edward, Gerard, Carl.
They're sort of now working much more like
a word-level model in capturing semantic similarity.
So that seems kind of cool.
Um, so then they say well what about if we ask about
words that aren't in the vocabulary of the model.
Well, if they're not in the vocabulary of the model,
the word level model can't do anything and so that's why you get those dashes there.
And what they're wanting to show is that
the character-level model still works pretty well.
So if you give it 'look' with seven O's in the middle of
it, it's correctly deciding that look
and looking are actually
the most similar words to that, which is actually working very nicely.
And some of the other examples are similar: computer-aided
is seen as most similar to computer-guided,
computer-driven, computerized, computer; you're getting pretty sensible results.
And then the little picture on the,
um, right is sort of, um,
showing, um, one of these 2D visualizations of the units that have been learned.
And so the red things are character prefixes,
the blue things are character suffixes,
the orange things are hyphenated things,
like in the middle of computer-guided, and gray is everything else.
And so there's some sort of sense
in which it's picking out different important parts of words.
Okay. Um, and that's also, I guess, just another good example of how you can sort of
compose together different kinds of building blocks to
make more powerful models, which you might
also want to think about for your final projects.
Okay.
Um, so here's
one more example, going back to a neural machine translation system,
of doing this hybrid architecture that has both word-level and character-level parts.
I showed you earlier a purely character-level model.
I mean we built that out of interest to see
how well it did but we were sort of really wanting to build
a hybrid model because that seemed like it would be much more practical
to build something that translated relatively quickly and well.
Um, so the idea was we'd mainly build
a word-level neural machine translation system but we'd
be able to work with character level stuff when we had rare or unseen words.
Um, and that turned out to work pretty,
um, successfully at improving performance.
So the idea of that model is this.
Um, that we're going to run a pretty standard, um,
sequence to sequence with attention LSTM neural machine translation system.
In my pic- I mean, it's actually a four-level deep system but in my picture I
showed less than four levels stacked to make it easier to see things.
And we're going to run this with a reasonable vocabulary of 16,000 words.
So for common words we just have word representations that we're feeding into
our neural machine translation model but for words that aren't in the vocabulary we're
going to work out a word representation for them by using a character level LSTM,
and conversely, when we start to generate words on
the other side, we have a softmax with a vocabulary of 16,000.
It can just generate words like [NOISE], but one of those words is the UNK symbol.
And if it generates the UNK symbol, we take
this hidden representation and feed it in as
the initial input into a character-level LSTM, and then we have the character-level
LSTM generate a character sequence until it generates
a stop symbol, and we use that to generate the word. Um-
Okay. So we end up sort of with this sort of
hybrid composed stack of eight LSTM layers. Uh, yeah.
[inaudible] and you always get some probability for the UNK symbol.
So if you wanted to get the proper gradient,
you'd always have to run it for every word, but what do you do?
Or would you say, during training
you only run the character-level LSTM when the UNK symbol receives the highest likelihood?
So we-
What is that?
So at training time,
there's a determinate piece of text, right?
You know the source and you know the target,
and at training time,
we've already decided our vocabulary, right?
We've just decided what are the 15,999 most common words;
those and UNK are our vocabulary.
So for both the input and the output side,
we know which words aren't in our vocabulary.
And so if it's not in our vocabulary,
we're running this one.
If what should be the output is not in our vocabulary,
we're running that one, and otherwise we're just not running it at all, yeah.
So, and the bit that I didn't explain, but is actually important and perhaps
related, is that when we're calculating a loss that we can
backpropagate, sort of up here,
there are sort of two losses.
There's a loss at the word level: you know, you'd like, in this position,
to give probability 1 to generating UNK, but really,
this model's softmax will say UNK is, you know, probability 0.2 or whatever.
So there's a loss there, and then secondarily,
there's a particular sequence of characters you wanna generate, and you've also got
a loss based on the probabilities you put over the characters.
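As a rough sketch of how those two losses might combine (illustrative Python with made-up tensors and sizes, not the actual system's code):

```python
import torch
import torch.nn.functional as F

# Made-up sizes: a 16,000-word softmax (UNK is id 0) and a small character vocabulary.
word_logits = torch.randn(1, 16000, requires_grad=True)  # word decoder output at this position
char_logits = torch.randn(4, 30, requires_grad=True)     # char decoder outputs: 3 chars + stop

unk_id = torch.tensor([0])                   # target at the word level is UNK
char_targets = torch.tensor([7, 3, 12, 29])  # target characters, ending in a stop symbol

word_loss = F.cross_entropy(word_logits, unk_id)
char_loss = F.cross_entropy(char_logits, char_targets)

# Both losses are backpropagated; the character-level one only exists for
# positions where the target word is outside the 16,000-word vocabulary.
loss = word_loss + char_loss
loss.backward()
```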
Um, so then, um,
I think Abby sort of briefly mentioned this:
commonly, the decoders do some kind of
beam search to consider different possibilities before deciding on,
um, the highest-probability one over a sequence of words.
And so this was doing a slightly more complex version of that.
So there's a word-level beam search when running it, and then
it's also doing a character-level beam search to consider different possibilities,
and so you want to integrate the two of those together.
Um, but essentially, um,
this worked pretty well.
Um, so, um, this was the winning system at WMT 2015, which
used 30 times as much data as was provided for the task, and ensembled together
three other systems.
This was the system I showed before, they got 18.3.
Um, and if you remember our character,
purely character level system got 18.5.
Um, then by building this hybrid system,
we were able to build a much better system that was about 2.5 BLEU points better,
um, than either this word-level or the character-level system.
So that was kind of nice, um,
and in particular that was the state of the art at the time.
Now of course, if you were paying very close attention,
that's now nowhere near the state of the art.
Because when I showed you that slide way earlier of the Google system,
you will have noticed that they have
much higher numbers in the 20s, but that's what happens as the years go by.
Um, okay.
But here's an example that shows
these different systems working and some of the mistakes they make.
Um, here's a cherry-picked example, um,
where our system, the hybrid system,
works perfectly, because that's what you expect to see.
Um, and so, you know,
you can see some of the defects of things that can go wrong.
Um, so in this case,
you know, the character-level system didn't work here, because, just sort
of starting with the 'Steph', it sort of seemed to free-associate,
um, a completely made-up name that doesn't really have anything to do with the source.
So that one isn't very good.
Um, the word-level system went bang here.
So you remember, when it generates an UNK,
the word-level system, when it generates, is using attention.
So when it wants to generate, um,
it has attention back to words in the source.
And when it generates UNK, it has two strategies:
it can either do a unigram translation of the word that it's maximally
putting attention on, or it can copy the word that it's maximally putting attention on.
Um, so in this case,
it chose to translate the word that it was maximally putting attention on, but the word it
was maximally putting attention on was 'after' rather than 'diagnosis'.
And so you just get this 'po' coming out, the translation of 'after',
and we've completely lost the word.
Um, and in this example,
our hybrid system, um,
just ends up working beautifully and gives you exactly the right translation.
Yeah. Um, of course,
it's not always that good in the real world.
Um, so here's a different example.
So this is the example I showed before with the 11-year-old daughter.
Um, and in this example,
the hybrid model has the same strength as the character model.
It correctly generates 11 years old at a character level in its translation,
but you know this time, for whatever reason,
it's the hybrid model that goes bang in
generating the names and it translates Shani Bart as Graham Bart.
Um, whereas the character level model gets it right.
Um, actually, I think this is one of the weaknesses of
this hybrid model compared to the character-level model,
because the character-level generator is
kind of this sort of second level.
For the purely character-level model,
it's able to use the character sequence as conditioning context very effectively.
Whereas for our hybrid model,
although we feed the hidden representation of
the word-level model in as
the starting hidden representation of the character-level model,
it doesn't have any representation further
back than that of what's in the word-level model.
And so it tends to not always do as good a job at
capturing the context that allows it to do translation of things like names.
Okay. Um, very- almost finished but there's
just sort of one thing I wanted to mention before
the end which is almost a practical thing.
Um, so we started off with word embeddings,
but now we've been talking a lot about character-level models.
So surely, just for word embeddings, you should be able to do useful things
with characters or pieces of words.
And that's something that people have started to play with.
So in this Cao and Rei paper they said well
let's train a Word2vec model using exactly the same, um,
loss as Word2vec uses but let's,
um, rather than having word representations,
let's start with character sequences and run
a bidirectional LSTM to work out word representations,
and we'll then sort of be effectively
training this more complex model where we're learning
character embeddings and LSTM parameters and that will give us our word representations.
And that's an idea that people have continued to play with,
and so in particular I just wanted to mention these FastText embeddings.
Um, so a couple of years ago,
um, people now at Facebook,
the same Tomas Mikolov who did the original Word2vec,
brought out a new set of embeddings,
the FastText embeddings and their goal was to sort of
have a next-generation Word2vec, um,
which is sort of an efficient fast, um,
word vector learning library, um,
but it was better for rare words and languages with lots of morphology.
And the way they did it was that they sort of essentially took the Word2vec skip
gram model but they augmented it to put in character n-grams.
So more precisely, this is what they did.
So, um, when you have a word,
my example word is where,
for some n-gram size you represent it as a set of n-grams.
So this is kind of just like those character trigrams I mentioned
right at the beginning, where you have a kind of boundary symbol,
so you know the beginning of the word.
So if the length is three, you have beginning-of-word plus wh, whe, her,
ere, re plus end-of-word,
as pieces of the representation.
And then you have an additional one for just the whole word.
So you do still have whole word representations in this model.
So where is represented by six things and so
then you're going to use all six of those things in your computation.
Um, so if you sort of remember the guts of
Word2vec, what you were doing was
these vector dot products between your context representation
and your center word representation.
So they're going to do exactly the same thing but for
the center word they're gonna use all six of these vectors.
All the vectors corresponding to all six of
these representations and they're going to sum them.
And so you're just doing a simple summing operation,
and that's sort of then giving you your representation of similarity.
Um, very precisely, they don't quite do that
because there's a hashing trick but I'll leave that out.
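Here's a small sketch of that scoring idea (my own illustration with random vectors, ignoring the hashing trick, as the lecture does):

```python
import numpy as np

dim = 8
rng = np.random.default_rng(0)

# n-grams for "where" with boundary symbols, plus the whole word itself.
pieces = ["<wh", "whe", "her", "ere", "re>", "<where>"]
piece_vecs = {p: rng.standard_normal(dim) for p in pieces}
context_vec = rng.standard_normal(dim)        # vector for a context word

# The center-word representation is just the sum of the six piece vectors...
center_vec = sum(piece_vecs[p] for p in pieces)
# ...and the skip-gram score is the dot product with the context vector.
score = center_vec @ context_vec
print(score)
```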
But what they're able to show is that that model actually works pretty successfully.
So these are word similarity scores:
skip-gram, there's also CBOW,
and then this is the sort of new model,
um, that, um, uses these kinds of n-grams.
And in this, um,
you know, at least for one of the English data sets,
it doesn't get any better.
Um, but what they especially notice is for languages that have
more morphology, you're sort of getting some fairly clear gains:
70 and 69 go up to 75,
59 and 60 go up to 66 in the right column,
so these wordpiece models do give them a better model of
words. And just practically, the FastText, um,
library now has sort of word embeddings for about 60 or 70 different languages,
so it's sort of a good source of word embeddings for multilingual applications.
Okay, I think I am done.
So thanks a lot and see you again next week.
