Let's get started. Welcome to the very final lecture of the class.
I hope you're all surviving the last week and wrapping up your projects.
Today we're going to be hearing about the future of NLP and deep learning.
Chris is still traveling, so today we have Kevin Clark, who's one of the PhD students in the NLP lab,
and he was also one of the head TAs for the class last year,
so he's very familiar with the class as a whole.
Take it away, Kevin.
Okay. Thanks, Abby. Yeah, it's great to be back after being a TA last year.
I'm really excited to be talking today about the future of deep learning and NLP.
Obviously, trying to forecast the future for deep learning, or anything in that space,
is really difficult because the field is changing so quickly.
As one reference point, let's look at what deep learning for NLP looked like about five years ago.
A lot of the ideas that are now considered core techniques in deep learning and NLP
didn't even exist back then.
Things you learned in this class, like seq2seq, attention mechanisms,
and large-scale reading comprehension, and even frameworks
such as TensorFlow and PyTorch, didn't exist.
The point I want to make is that, because of this, it's really difficult to look into the future and say
what things are going to be like.
What I think we can do, though, is look at areas that are really taking off right now,
areas where there's been a lot of recent success,
and project that those same areas will likely be important in the future.
In this talk I'm going to be mostly focusing on one key idea:
leveraging unlabeled examples when training our NLP systems.
I'll be talking about doing that for machine translation,
both to improve the quality of translation and even
to do translation in an unsupervised way.
That means you don't have paired sentences with their translations;
you try to learn a translation model only from monolingual corpora.
The second thing I'll talk about is OpenAI's GPT-2,
and in general the phenomenon of really scaling up deep learning models.
I know you saw a little of this in the lecture on contextual representations,
but this will be a bit more in depth.
I think these new developments in NLP have had some pretty big impacts
beyond even the technology we're using.
In particular, they're starting to raise more and more concerns about the social impact of NLP,
both in what our models can do and in where people are looking to apply these models,
and that carries real risks in terms of security and in areas like bias.
I'm also going to talk a bit about future areas of research.
These are mostly research areas that over the past year have developed into
promising directions, and I expect they will continue to be important in the future.
Okay, to start, I want to ask this question: why has deep learning been so successful recently?
I like this comic. Here there's a statistical learning person
with some really complicated, well-motivated method for the task they care about,
and the neural net person just says, "stack more layers."
The point I want to make here is that deep learning has not been successful recently
because it's more theoretically motivated or more sophisticated than previous techniques.
In fact, I'd say a lot of older statistical methods have more of
a theoretical underpinning than some of the tricks we do in deep learning.
Really, the thing that has made deep learning so successful in recent years is its ability to scale.
As we increase the size of the data and the size of the models,
neural nets get a really big boost in accuracy, in ways other approaches do not.
If you look at the '80s and '90s, there was actually plenty of research on neural nets going on,
but it didn't have the hype around it that it does now.
That seems likely to be because, in the past, there weren't the same resources
in terms of compute and data.
Only now, after we've reached an inflection point where we can really
take advantage of scale in our deep learning models,
have we started to see it become a really successful paradigm for machine learning.
If we look at the big deep learning success stories, you can see this idea play out.
Here are three of what are arguably the most famous successes of deep learning.
There's image recognition, where people previously used highly engineered
features to classify images, and neural nets are now much superior to those methods.
Machine translation has really closed the gap between
phrase-based systems and human-quality translation;
it's widely used in things like Google Translate,
and the quality has gotten a lot better over the past five years.
Another example with a lot of hype around it is game playing:
there's been work on Atari games, there's been AlphaGo,
and more recently AlphaStar and OpenAI Five.
In all three of these cases, underlying the successes are really large amounts of data.
For image recognition, there's the ImageNet dataset, which has 14 million images.
Machine translation datasets often have millions of examples.
And for game playing, you can generate essentially as much training data as you want,
just by running your agent within the game over and over again.
If we look at NLP, the story is quite different for a lot of tasks.
Even for pretty core, popular tasks, say reading comprehension in English,
datasets like SQuAD are on the order of 100,000 examples,
which is considerably less than the millions or tens of millions of examples
those previous successes benefited from.
And that's of course only for English.
There are thousands of other languages, and this is, I think,
a problem with NLP data as it exists today:
the vast majority of data is in English,
when in reality fewer than 10% of the world's population speak English as their first language.
So these problems with small datasets are only compounded if you look at
the full spectrum of languages that exist.
So what do we do when we're limited by this data,
but want to take advantage of deep learning's scale and train the biggest models we can?
The popular solution, which has had especially big recent success, is using unlabeled data,
because unlike labeled data, unlabeled data is very easy to acquire for language.
You can go to the Internet, you can go to books, and get lots of text,
whereas labeled data usually requires at least crowdsourcing examples,
and in some cases even requires an expert in something like linguistics to annotate the data.
Okay, so this first part of the talk is going to apply this idea of
leveraging unlabeled data to improve our NLP models to the task of machine translation.
Let's talk about machine translation data.
It's true that quite large datasets do exist for machine translation.
But those datasets don't exist because NLP researchers annotated
text for the purpose of training their models.
They exist because, in various settings, translation is done simply because it's useful.
For example, proceedings of the European Parliament and of the United Nations are translated,
and some news sites translate their articles into many languages.
So really, the machine translation data we use to train our models is often
a byproduct of existing cases where translation is wanted, rather than
a full sampling of the sort of text we see in the world.
That means, number one, it's quite limited in domain:
it's not easy to find translated tweets, unless you happen to work for Twitter.
In addition, there are limitations in terms of the languages that are covered.
For some languages, say European languages, there's a lot of translation data;
for other languages there's much less.
So in settings where we want to work on a different domain,
or with a low-resource language, we're limited by labeled data,
but what we can do is pretty easily find unlabeled data.
Language identification is essentially a solved problem, maybe not 100%,
but we can train a classifier that looks at some text and decides,
with good accuracy, what language it's in.
That means it's really easy to find data in any language you care about,
because you can go on the web, search for data in that language,
and acquire a large monolingual corpus.
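As a rough illustration of why language identification is so tractable, here is a toy sketch that scores text by character-trigram overlap against tiny per-language samples. The sample sentences and the whole setup are made up for illustration; real systems are trained on far more data with proper classifiers.

```python
# Toy language-ID sketch: score text by character-trigram overlap with tiny
# per-language sample corpora. Purely illustrative, not a production method.

def trigrams(text):
    t = " " + text.lower() + " "
    return {t[i:i + 3] for i in range(len(t) - 2)}

samples = {  # tiny made-up "corpora"
    "en": "the quick brown fox jumps over the lazy dog",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
}
profiles = {lang: trigrams(s) for lang, s in samples.items()}

def identify(text):
    # Pick the language whose trigram profile overlaps the input the most.
    g = trigrams(text)
    return max(profiles, key=lambda lang: len(g & profiles[lang]))

print(identify("the dog runs"))  # en
```

With real data you would crawl the web, run a classifier like this over each page, and keep the text in the language you want, which is exactly how large monolingual corpora get assembled.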
Okay, now I'm going to go into the first approach I'll talk about for using
unlabeled data to improve machine translation models.
This technique is called pre-training, and it's really reminiscent of ideas like ELMo.
The idea is to pre-train by doing language modeling.
If we have two languages we're interested in translating between,
we collect large datasets for both of those languages, train two language models,
one on each, and then use those pre-trained language models
as the initialization for a machine translation system.
The encoder gets initialized with the weights of the language model
trained on the source-side language,
and the decoder gets initialized with the weights trained on the target-side language.
This will improve the performance of your model because, during pre-training,
we hope our language models learn useful information, such as the meanings of words
or the structure of the language they're processing,
and that helps the machine translation model down the line when we fine-tune it.
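Structurally, the recipe is just weight transfer: train two language models, then copy their parameters into the encoder and decoder before fine-tuning on parallel data. Here is a minimal sketch of that wiring; the `pretrain_language_model` stand-in and all names are hypothetical (a real system would train full LSTM or Transformer language models, not a parameter table).

```python
# Hypothetical sketch of pre-training initialization for a seq2seq MT model.
# "Training" here just builds a vocabulary-sized parameter table; the point
# is the weight-transfer structure, not the learning itself.

def pretrain_language_model(monolingual_corpus):
    vocab = sorted({w for sent in monolingual_corpus for w in sent.split()})
    # A real LM would learn embeddings plus recurrent/attention weights.
    return {"vocab": vocab, "embeddings": {w: [0.0] * 4 for w in vocab}}

def init_translation_model(src_lm, tgt_lm):
    # Encoder starts from the source-language LM, decoder from the
    # target-language LM; fine-tuning on parallel data would follow.
    return {"encoder": dict(src_lm["embeddings"]),
            "decoder": dict(tgt_lm["embeddings"])}

en_lm = pretrain_language_model(["i traveled to belgium"])
de_lm = pretrain_language_model(["ich bin nach belgien gereist"])
mt = init_translation_model(en_lm, de_lm)
print("belgium" in mt["encoder"], "belgien" in mt["decoder"])  # True True
```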
Let me pause here and ask if there are any questions,
and in general, feel free to ask questions throughout this talk. Okay.
So here is a plot showing some results of this pre-training technique.
This is English-to-German translation.
The x-axis is how much supervised training data you provide these models;
of course, they also have large amounts of monolingual data for the pre-training step.
You can see that this works pretty well:
you get about a two BLEU point increase in performance,
that's the red line above the blue line, when doing this pre-training.
And, not too surprisingly, the gain is especially large when the amount of labeled data is small.
There is a problem with pre-training that I want to address, which is that
you have these two separate language models, and there's never really any
interaction between the two when you're running them on the unlabeled corpora.
So here's a simple technique that tries to solve this problem, called self-training.
The idea is: given a sentence from our monolingual corpus,
in this case the English sentence "I traveled to Belgium,"
we won't have a human-provided translation for this sentence,
but what we can do is run our machine translation model
and get a translation in the target language.
Since this comes from a machine learning model it won't be perfect,
but we can hope our model can still learn from this kind of noisy labeled example.
So we treat our original monolingual sentence and its machine-provided
translation as though it were a human-provided translation,
and train our model as normal on this example.
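The self-training loop can be sketched in a few lines. The toy "model" below is just a word-level phrase table, and all the names are illustrative; the point is the shape of the loop, including its circularity.

```python
# Minimal self-training sketch with a toy word-level "model".

def translate(model, sentence):
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(model.get(w, w) for w in sentence.split())

def self_train(model, monolingual_sentences):
    # Label each unlabeled source sentence with the model's OWN output, then
    # treat (source, model_output) as if it were a supervised pair. Note the
    # circularity: the training target is exactly what the model already emits.
    return [(s, translate(model, s)) for s in monolingual_sentences]

model = {"i": "je", "traveled": "ai voyagé", "to": "en", "belgium": "belgique"}
pairs = self_train(model, ["i traveled to belgium"])
print(pairs[0])  # ('i traveled to belgium', 'je ai voyagé en belgique')
```

In a real system you would then take gradient steps on these pseudo-labeled pairs, which is precisely where the circularity problem discussed next comes from.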
I think this seems pretty strange as a method when you first see it,
because it's really circular: the translation the model is being trained to
produce is exactly what it already produces to begin with,
because that translation came from our model in the first place.
So in practice, this is not a technique that's very widely used, due to this problem,
but it motivates another technique called back-translation.
Back-translation is a very popular solution to that problem, and it's the method
that has had a lot of success in using unlabeled data for translation.
Here's the approach. Rather than only having a translation system that goes
from source language to target language, we're also going to train a model
that goes from our target language to our source language.
So in this case, if at the end of the day we want a French-to-English model,
we're going to start by actually training an English-to-French model.
Then we do something a lot like self-training: we take an English sentence,
run our English-to-French model, and translate it.
The difference from what we did before is that we're going to switch the source and target sides.
Now the French sentence is the source sequence,
and the target sequence is our original English sentence, which came from the monolingual corpus.
And we train the machine translation system that goes the other direction, French to English.
So why do we think this will work better?
Number one, there's no longer that circularity in the training,
because the model is being trained on the output of a completely different model.
Another thing I think is pretty crucial is that the translations the model is
trained to produce, the things the decoder is actually learning to generate,
are never bad translations.
If you look at this example, the target sequence for our French-to-English model,
"I traveled to Belgium," originally came from a monolingual corpus.
I think this makes intuitive sense: if we want to train a good translation model,
it's probably okay to expose it to noisy inputs,
the output of an English-to-French system that might not be perfect.
But what we don't want to do is expose it to poor target sequences,
because then it won't learn to generate in that language effectively.
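The data-generation step of back-translation can be sketched as follows. Again the models are toy word-level tables and the names are illustrative; what matters is that the clean human-written English always ends up on the target side.

```python
# Back-translation sketch: to train French->English, use an English->French
# model to turn monolingual English into synthetic (French, English) pairs.

def translate(model, sentence):
    return " ".join(model.get(w, w) for w in sentence.split())

def make_backtranslation_pairs(en_to_fr, monolingual_english):
    pairs = []
    for english in monolingual_english:
        synthetic_french = translate(en_to_fr, english)  # possibly noisy SOURCE
        pairs.append((synthetic_french, english))        # clean human TARGET
    return pairs

en_to_fr = {"i": "je", "traveled": "ai voyagé", "to": "en", "belgium": "belgique"}
pairs = make_backtranslation_pairs(en_to_fr, ["i traveled to belgium"])
print(pairs[0])  # ('je ai voyagé en belgique', 'i traveled to belgium')
```

These synthetic pairs are then mixed with whatever real labeled data exists when training the French-to-English model, which is what keeps the two models from drifting into nonsense.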
Any questions on back-translation before I get to results? Sure.
[Student question, inaudible]
So this is assuming we have a large corpus of unlabeled data,
and we want to use it to help our translation model.
Does that make sense? Maybe you could clarify the question.
[Student question, inaudible]
Yeah, that's right. We have a big corpus of English which includes the sentence
"I traveled to Belgium," and we don't know the translations, but we'd still like to
use this data. Yeah, another question.
[Student question, inaudible]
Yeah, that's a good question: how do you avoid both models blowing up and
producing garbage, and then just feeding garbage to each other?
The answer is that there is some amount of labeled data here as well.
On unlabeled data you do this, but on labeled data you do standard training,
and that way you keep the models on track, because they still have to fit
the labeled data. Yeah, another question.
How do you schedule the training of the two models?
Yeah, that's a good question, and it's basically a hyperparameter you can tweak.
A pretty common approach is to first train the two models only on labeled data,
then do back-translation over a large corpus, and repeat that process over and over.
Each iteration, you train on the labeled data, label some unlabeled data,
and now you have more data to work with.
But I think many kinds of scheduling would be effective here. Okay, another question.
I'm curious about the evaluation: if you have a very good French-to-English model,
you could try to translate back and see if it matches the original source.
Yeah, I'm not quite sure. Are you suggesting going English to French to English
and seeing if it matches? I see, yeah, that's a really interesting idea.
We're actually going to talk a little about this later in the talk;
it's called cycle consistency.
Okay, I'm going to move on to the results.
So here's the method for using unlabeled data to improve translation. How well does it do?
The answer is that the improvements are, at least to me, surprisingly good.
This is English-to-German translation, from some work by Facebook.
They used 5 million labeled sentence pairs,
but they also used 230 million monolingual sentences, sentences without translations.
You can see that, compared to the previous state of the art,
they get a six BLEU point improvement, which, compared to most previous research
in machine translation, is a really big gain.
Even something like the invention of the Transformer, which most people would
consider a really significant research development in NLP,
improved over prior work by about 2.5 BLEU points.
And here, without any fancy model design, just by using way more data,
we get actually much larger improvements.
Okay. An interesting question to think about: suppose we only have our monolingual corpora.
We don't have any sentences that have been human-translated;
we just have sentences in two languages.
The scenario you can imagine is: suppose an alien comes down and starts
talking to you in a weird alien language, and it talks a lot.
Would you eventually be able to translate what it's saying into English,
just from having a really large amount of data?
I'm going to start with a simpler task than full-on translation
when you only have unlabeled sentences.
Instead of sentence-to-sentence translation,
let's start by only worrying about word-to-word translation.
The goal here is: given a word in one language, find its translation,
but without using any labeled data.
The method we're going to use to try to solve this task is called cross-lingual embeddings.
The goal is to learn word vectors for words in both languages,
and we'd like those word vectors to have all the nice properties
you've already learned about word vectors having,
but we also want the word vector for a word in one language
to be close to the word vector of its translation.
I'm not sure if it's visible in this figure, but it shows a large number of
English and, I think, German words, and you can see that each English word
has its corresponding German word nearby in the embedding space.
If we learn embeddings like this, then word-to-word translation is pretty easy:
we just pick an English word, find the nearest German word in this joint
embedding space, and that gives us a translation for the English word.
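Once the two vocabularies share an embedding space, the lookup itself is just a nearest-neighbor search by cosine similarity. The tiny two-dimensional vectors below are made up for illustration; real cross-lingual embeddings have hundreds of dimensions.

```python
import math

# Word-to-word translation by nearest neighbor in a shared embedding space.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translate_word(word, src_emb, tgt_emb):
    # Pick the target-language word whose vector is closest to the source word's.
    return max(tgt_emb, key=lambda t: cosine(src_emb[word], tgt_emb[t]))

english = {"cat": [0.9, 0.1], "car": [0.1, 0.9]}   # toy aligned vectors
german  = {"katze": [0.88, 0.12], "auto": [0.15, 0.85]}
print(translate_word("cat", english, german))  # katze
```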
The key assumption we're going to use to solve this is that,
even though running word2vec twice gives you really different embeddings,
the structure of the embedding space has a lot of regularity to it,
and we can take advantage of that regularity
to help find an alignment between the embedding spaces.
To be more concrete, here's a picture of two sets of word embeddings.
In red we have English words, in blue we have Italian words,
and although the vector spaces look very different from each other right now,
you can see that they have a really similar structure.
You'd imagine the distances are similar: the distance between "cat" and "feline"
in the English embedding space should be pretty similar to the distance
between "gatto" and "felino" in the Italian space.
This motivates an algorithm for learning these cross-lingual embeddings.
Here's the idea: we're going to try to learn what's essentially a rotation,
such that we can transform our set of English embeddings
so that they match up with our Italian embeddings.
Mathematically, this means we're going to learn a matrix W such that if we take,
say, the word vector for "cat" in English and multiply it by W,
we end up with the vector for "gatto" in Italian.
A detail here is that we're going to constrain W to be orthogonal,
which geometrically just means W only rotates the vectors in X;
it's not going to do some other, weirder transformation.
So our goal is to learn this W. Next, I'm going to talk about how we actually learn it.
There are actually a bunch of techniques for learning this W matrix,
but here's one that I think is quite clever, called adversarial training.
It works as follows. In addition to learning the W matrix,
we're also going to learn a model called a discriminator,
which takes a vector and tries to predict: is that vector originally
an English word embedding, or is it originally an Italian word embedding?
In other words, thinking about the diagram, the discriminator is given
one of these points and tries to predict whether it's a red point,
so originally an English word, or a blue point.
If we have no W matrix, this is a really easy task for the discriminator,
because the word embeddings for English and Italian are clearly separated.
However, if we learn a W matrix that succeeds in aligning all these
embeddings on top of each other, then our discriminator will never do a good job;
you can imagine it'll never really do better than 50%,
because given a vector for, say, "cat," it won't know whether that's the vector
for "cat" transformed by W or actually the vector for "gatto,"
since in that case those two vectors are aligned on top of each other.
So during training, you alternate: you train the discriminator a bit,
making it as good as possible at distinguishing the English from the Italian words,
and then you train W, where the goal is to confuse the discriminator as much as possible.
You want to end up in a situation where this machine learning model can't
figure out whether a word embedding was originally from English or is an Italian word vector.
At the end of the day, you have vectors that are aligned with each other.
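Here is a heavily simplified numpy sketch of that alternating game, under stated assumptions: the data is synthetic (two-dimensional "embeddings" where one language is a pure rotation of the other), the discriminator is plain logistic regression rather than the MLP used in practice, and the orthogonality constraint is enforced by re-orthogonalizing W with an SVD after each step. This is an illustration of the training dynamics, not the actual published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "English" embeddings (anisotropic, so rotations are distinguishable)
# and "Italian" embeddings that are an exact rotation of them.
X = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = X @ R

W = np.eye(2)                  # the mapping we learn (kept orthogonal)
w, b = np.zeros(2), 0.0        # logistic-regression "discriminator"

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30, 30)))

lr_disc, lr_map = 0.1, 0.05
for step in range(500):
    x = X[rng.integers(len(X))]
    y = Y[rng.integers(len(Y))]
    # Discriminator step: label mapped-English vectors 1, Italian vectors 0.
    for z, label in ((x @ W, 1.0), (y, 0.0)):
        p = sigmoid(z @ w + b)
        w -= lr_disc * (p - label) * z
        b -= lr_disc * (p - label)
    # Mapping step: update W so mapped English fools the discriminator
    # (gradient of -log(1 - p) with respect to W).
    z = x @ W
    p = sigmoid(z @ w + b)
    W -= lr_map * np.outer(x, p * w)
    # Re-orthogonalize so W stays a pure rotation/reflection.
    U, _, Vt = np.linalg.svd(W)
    W = U @ Vt
```

After training, W remains orthogonal by construction, and on data like this it should drift toward an alignment of the two point clouds; real systems add refinement tricks on top of the adversarial phase.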
Any questions about this approach?
Okay. Here's a link to a paper with more details.
There's actually a range of other tricks you can do, but this is the key idea.
Okay. So that was word-to-word unsupervised translation.
How do we do full sentence-to-sentence translation?
We're going to use a standard sort of seq2seq model,
without even an attention mechanism.
There's one change to the standard seq2seq model going on here,
which is that we're going to use the same encoder and decoder
regardless of the input and output languages.
You can see in this example that we could give the encoder an English sentence,
or we could give it a French sentence; it'll have these cross-lingual embeddings,
so it has vector representations for both English and French words,
which means it can handle any input.
For the decoder, we need to give it some information about which language
it's supposed to generate in: is it going to generate in French or English?
The way that's done is by feeding in a special token, here "Fr" in brackets
to represent French, that tells the model, okay, you should generate in French now.
In this figure it's only French, but you could imagine also feeding this model
"En" in brackets, and that tells it to generate English.
One thing you can see is that you could use this sort of model to go from English to French,
but you could also use it as an autoencoder:
at the bottom, it's taking a French sentence as input and just generating French as output,
which here means reproducing the original input sequence.
So, just a small change to standard seq2seq.
Here's how we're going to train this seq2seq model.
There are going to be two training objectives,
and I'll explain why they're present in this model in just a few slides;
for now, let's just say what they are.
The first one is called a denoising autoencoder.
What we're going to train our model to do is take a sentence,
here an English sentence, but it could also be a French one,
scramble up the words a little bit, and then ask the model to de-noise
that sentence, which means regenerating what the sentence actually was
before it was scrambled.
One idea of why this is a useful training objective:
since we have an encoder-decoder without attention,
the encoder converts the entirety of the source sentence into a single vector,
and what the autoencoder objective does is ensure that that vector contains
all the information about the sentence, such that we're able to recover
what the original sentence was from the vector produced by the encoder.
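The "scrambling" step can be sketched as a small noise function: drop a few words and locally shuffle the rest, then train the model to map the noisy version back to the original. The noise parameters below are illustrative guesses, not the exact settings from any particular paper.

```python
import random

# Sketch of the noising step for the denoising autoencoder objective.
# The training pair is (noisy input, original sentence as target).

def add_noise(sentence, drop_prob=0.1, max_shuffle_dist=3, seed=0):
    rng = random.Random(seed)
    # Randomly drop some words.
    words = [w for w in sentence.split() if rng.random() > drop_prob]
    # Locally shuffle: sort on position plus bounded random jitter, so words
    # only move a few positions from where they started.
    keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(words))]
    return " ".join(w for _, w in sorted(zip(keys, words), key=lambda t: t[0]))

original = "the cat sat on the mat"
noisy = add_noise(original)
# The model sees `noisy` as encoder input and must regenerate `original`.
print(noisy)
```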
So that was objective one.
For training objective two, we're now actually going to try to do a translation,
but, as before, using this back-translation idea.
Remember, we only have unlabeled sentences;
we don't have any human-provided translations.
But what we can still do is: given, say, a French sentence,
we can translate it to English using our model in its current state,
and then ask the model to translate that English back into French.
What you can imagine is that in this setting, the input sequence is going to be
somewhat messed up, because it's the output of our imperfect machine learning model.
Here the input sequence is just "I am student," a word has been dropped,
but we're going to train the model, even with this kind of bad input,
to reproduce the original French sentence from our corpus of monolingual French text.
Let me pause here and ask for questions. Sure.
[Student question] Why do you have this orthogonality constraint
on the word embedding mapping? Is it to avoid overfitting?
Have you tried taking it off to see what happens?
Yeah, that's a good question.
This is going back to the earlier word-to-word translation:
why constrain the W matrix to be orthogonal?
Essentially, that's right, it's to avoid overfitting.
In particular, it's making the assumption that our embedding spaces are so
similar that just a rotation separates our word vectors in English
from our word vectors in Italian.
I think there have been results that don't include the orthogonality constraint,
and I believe leaving it out slightly hurts performance.
Okay. So, continuing with unsupervised machine translation:
I gave a training method, but I didn't quite explain why it would work,
so here's some more intuition for this idea.
Remember, we're going to initialize our machine translation model with
these cross-lingual embeddings, which means the English and French words
should look close to identical.
We're also using the shared encoder.
So, if you think about it, at the top we have just an autoencoding objective,
and we can certainly believe our model can learn that; it's a pretty simple task.
Now imagine we give our model a French sentence as input instead.
Since the embeddings are going to look pretty similar, and since the encoder
is the same, it's pretty likely that the model's representation of this
French sentence will be very similar to its representation of the English sentence.
So when this representation is passed into the decoder,
we can hope that we'll get the same output as before.
So, as a starting point, we can hope that our model already has some translation capability.
Another way of thinking about this is that what we really want our model to do
is encode a sentence such that the representation is a sort of universal interlingua:
a universal representation of the sentence that isn't specific to the language.
Here's a picture trying to get at this.
In our autoencoder example and in our back-translation example,
the target sequence is the same.
What that essentially means is that the vectors for the English sentence
and the French sentence are going to be trained to be the same, right?
Because if they were different, our decoder would generate different outputs
on these two examples.
So here's another intuition: what our model is trying to learn is a way of
encoding the information of a sentence in a vector,
but in a way that's language-agnostic.
Any more questions about unsupervised machine translation?
Okay. Going on to the results of this approach:
here, the horizontal lines are the results of an unsupervised machine translation model,
and the lines that go up are for a supervised machine translation model
as we give it more and more data.
So, unsurprisingly, given a large amount of supervised data,
the supervised machine translation models work much better
than the unsupervised machine translation model.
But the unsupervised model actually still does quite well.
If you look around 10,000 to 100,000 training examples,
it actually does just as well as or better than supervised translation,
and I think that's a really promising result,
because if you think of low-resource settings, where there aren't many labeled examples,
it suddenly becomes really nice that you can perform this well
without even needing to use a training set.
Another fun thing you can do with an unsupervised machine translation model
is attribute transfer.
Basically, you can take collections of text split by any attribute you want.
For example, you could go on Twitter and look at hashtags to decide which
tweets are annoyed and which are relaxed,
then treat those two corpora as though they were two different languages,
and train an unsupervised machine translation model to convert from one to the other.
You can see from these examples that the model actually does a pretty good job
of minimally changing the sentence, preserving a lot of its original semantics,
while changing the target attribute.
I also want to throw a little bit of cold water on this idea. I do think it's really exciting, almost mind-blowing, that you can do this translation without labeled data. It's really hard to imagine someone giving me a bunch of books in Italian and saying, "Okay, learn Italian," without teaching you specifically how to do the translation.
But even though these methods show promise, mostly they have shown it on languages that are quite closely related. Those previous results were all some combination of English to French or English to German, and those languages are quite similar. If you look at a different language pair, say English to Turkish, where the linguistics of the two languages are quite different, these methods do still work to some extent, getting around five BLEU points, let's say, but they don't work nearly as well as they do in the other settings.
So there's still a huge gap to purely supervised learning. We're probably not quite at the stage where an alien could come down and, no problem, we just use our unsupervised machine translation system on it, but I still think that's pretty exciting progress. Yeah, question?
So what you're saying is that the genealogy of a language might make it map worse, right? Because my original thought was that if you took, for example, Latin, which doesn't have a word for the modern concept of a car, I thought that would do more poorly. But basically, what I'm asking is: do you think English maps better to Latin because they're both related, and worse to Turkish, or is it the other way around?
I would expect English to map quite a lot better to Latin. And I think part of the issue here is that the difficulty in translation is not really at the word level. It certainly is an issue that words exist in one language that don't exist in another, but I think the more substantial differences between languages are at the level of syntax, or semantics: how ideas are expressed. So I would expect Latin to have relatively similar syntax to English compared to, say, Turkish, and I imagine that is probably the bigger obstacle for unsupervised machine translation models.
I'm going to really quickly go into one last recent research paper, which basically takes BERT, which you've learned about, correct? Yes. Okay. And makes it cross-lingual.
So here's what regular BERT is: we have a sequence of sentences in English, we mask out some of the words, and we ask BERT, our transformer model, to essentially fill in the blanks and predict the words that were dropped out.
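The masked-LM setup can be sketched in a few lines (a simplification of BERT's actual recipe, which masks about 15% of tokens and sometimes substitutes random tokens instead of the mask symbol):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly replace tokens with [MASK]; return the masked input and
    per-position labels (None where no prediction is required)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)      # model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)     # this position is not scored
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
print(masked)  # with seed=1, only position 0 happens to be masked
```

The model is then trained to predict each label from the masked input, which forces it to use bidirectional context.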
What has actually already been done by Google is training a multilingual BERT. Essentially, they concatenated a whole bunch of corpora in different languages and trained one model with this masked LM objective on all of that text at once. And that's a publicly released model.
The new extension to this, recently proposed by Facebook, is to combine this masked LM training objective with translation. What they do is sometimes give the model, in this case, a sequence in English and a sequence in French, drop out some of the words, and, just as before, ask the model to fill them in. The motivation is that this causes the model to understand the relation between the two languages much better. Because if you're trying to fill in an English word that's been dropped, the best way to do it, if you have a translation, is to look at the French side and try to find that word. Hopefully, that one hasn't been dropped as well, and then you can much more easily fill in the blank.
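A sketch of what one such training example might look like (the "[SEP]" separator and the exact format here are my own illustration, not necessarily what the paper uses):

```python
# Translation-LM input sketch: an English sentence and its French
# translation are fed as one sequence, and a word is dropped from the
# English side. The model can recover it by attending to the untouched
# French side of the pair.
en = ["the", "cat", "sleeps"]
fr = ["le", "chat", "dort"]

drop = 1  # mask the word "cat"
inp = en[:drop] + ["[MASK]"] + en[drop + 1:] + ["[SEP]"] + fr
label = en[drop]

print(inp)    # ['the', '[MASK]', 'sleeps', '[SEP]', 'le', 'chat', 'dort']
print(label)  # cat
```

In the real objective, words are masked on both sides at random, so the model must learn the cross-lingual alignment rather than just copying.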
And this actually leads to very substantial improvements in unsupervised machine translation. Just like BERT is used for other tasks in NLP, they take this cross-lingual BERT, use it as the initialization for an unsupervised machine translation system, and they get really large gains, on the order of 10 BLEU points, such that the gap between unsupervised machine translation and the current supervised state of the art is much smaller. This is a pretty recent idea, but I think it also shows promise in really improving the quality of translation through unlabeled data. Although, I guess, in this case they are using labeled translation data as well. Any questions about this?
Okay. So that is all I'm going to say about using unlabeled data for translation. The next part of this talk is about what happens if we really scale up these unsupervised language models. In particular, I'm going to talk about GPT-2, which is a new model by OpenAI. It's essentially a really giant language model, and I think it has some interesting implications.
First of all, here are the sizes of a bunch of different NLP models. Maybe a couple of years ago, the standard medium-size LSTM model was on the order of 10 million parameters, where a parameter is just a single weight in the neural net. ELMo and GPT, the original OpenAI model before GPT-2, were about 10 times bigger than that, and GPT-2 is about another order of magnitude bigger still. One interesting comparison point here is that GPT-2, at 1.5 billion parameters, actually has more parameters than a honeybee brain has synapses.
So that sounds kind of impressive, right? Honeybees are not the smartest of animals, but they can still fly around and find nectar. Of course, this isn't really an apples-to-apples comparison: a synapse and a weight in a neural net are really quite different things. But I just think it's one interesting milestone, in terms of model size, that has been surpassed.
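As a back-of-the-envelope check on that 1.5 billion figure (my own arithmetic, assuming the commonly reported GPT-2 configuration of 48 transformer layers, hidden size 1600, and a byte-pair vocabulary of about 50,000 tokens):

```python
# Rough transformer parameter count. The 12 * d^2 per-layer rule of
# thumb counts the four attention projection matrices (4 * d^2) and the
# two feed-forward matrices (2 * 4d * d = 8 * d^2), ignoring biases and
# LayerNorm. Configuration values below are assumptions, not from the
# lecture.
d_model = 1600    # reported GPT-2 hidden size
n_layers = 48     # reported GPT-2 depth
vocab = 50257     # reported GPT-2 BPE vocabulary size

per_layer = 12 * d_model ** 2
embeddings = vocab * d_model
total = n_layers * per_layer + embeddings
print(f"{total / 1e9:.2f} billion parameters")  # ~1.55 billion
```

That lands right around the quoted 1.5 billion, which is a nice sanity check on the rule of thumb.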
One thing to point out here is that this increasing scale of deep learning is really a general trend in all of machine learning, beyond just NLP. This plot shows time on the x-axis, and the y-axis, log scaled, is the amount of petaFLOPs used to train each model. What this means is that the trend, at least currently, is exponential growth in how much compute power we're throwing at our machine learning models. It's unclear whether exponential growth will continue, but certainly there's rapid growth in the size of our models.
And it's leading to some really amazing results. Here are results not from language but from vision. This is a generative adversarial network trained on a lot of data at really large scale: a big model, somewhere in between the sizes of ELMo and BERT, let's say. These photos here are actually productions of the model. They aren't real photos; they're things the model has hallucinated out of thin air, and at least to me they look essentially photo-realistic. There's also a fun website to look at if you're interested, thispersondoesnotexist.com. If you go there, you'll see a very convincing photo of a person, but it's not a real photo. It's, again, a hallucinated image produced by a GAN.
We're also seeing really huge models being used for image recognition. This is recent work by Google, where they trained an ImageNet model with half a billion parameters. That's bigger than BERT but not as big as GPT-2. This plot shows the log-scaled number of parameters on the x-axis and ImageNet accuracy on the y-axis, and, unsurprisingly, bigger models perform better. There actually seems to be a pretty consistent trend here: accuracy increases with the log of the model size.
I want to go into a little more detail on how it's possible to scale up and train models to such a large extent. One answer is just better hardware. In particular, there's a growing number of companies developing hardware specifically for deep learning. These chips are even more constrained than a GPU in the kinds of operations they can do, but they do those operations even faster. Google's Tensor Processing Units are one example, and there are a bunch of other companies working on this idea.
The other way to scale up models is by taking advantage of parallelism, and there are two kinds of parallelism I want to talk about very briefly. One is data parallelism. In this case, each of your GPUs, let's say, has a copy of the model, and you essentially split the mini-batch you're training on across these copies. So if you have, say, 16 GPUs and each of them sees a batch of size 32, you can run backprop on all 16 GPUs, aggregate the gradients, and end up with an effective batch size of 512. This allows you to train models much faster.
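The arithmetic behind data parallelism can be sketched without any GPUs (a toy example with a squared-error loss; the point is just that averaging per-shard gradients reproduces the full-batch gradient):

```python
# Toy data parallelism: each "GPU" gets a shard of the mini-batch,
# computes the gradient of a squared-error loss on its shard, and the
# shard gradients are averaged. With 16 equal shards of 32 examples,
# this equals the gradient over one batch of 512.
def grad(w, batch):
    # d/dw of the mean of (w*x - y)^2 over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(float(i), 3.0 * i) for i in range(1, 513)]   # y = 3x
shards = [data[i::16] for i in range(16)]             # 16 "GPUs"

full_grad = grad(w, data)
parallel_grad = sum(grad(w, s) for s in shards) / 16  # aggregate step
print(abs(full_grad - parallel_grad) < 1e-4)  # prints True
```

Real frameworks do the same averaging with an all-reduce over the network, but the math is identical.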
The other kind of parallelism that's growing in importance is model parallelism. Eventually models get so big that they can't fit on a single GPU, even with a batch size of one. In this case, you actually need to split the model itself across multiple compute units, and that's what's done for models around the size of GPT-2. There are new frameworks, such as Mesh-TensorFlow, that are basically designed to make this sort of model parallelism easier.
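A toy sketch of the model-parallel idea (my own illustration; real systems like Mesh-TensorFlow split tensors across a mesh of devices, while here the "devices" are just Python lists): split a layer's weight matrix column-wise, let each device compute its slice of the output, and concatenate.

```python
# Toy model-parallel matrix-vector product: the weight matrix of one
# layer is split column-wise across two "devices"; each computes its
# slice of the output, and concatenating the slices gives the same
# result as computing the whole layer on one device.
def matvec(W_cols, x):
    # W_cols is a list of columns; output j is the dot product x . col_j
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in W_cols]

x = [1.0, 2.0, 3.0]
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4 output columns

device0, device1 = W[:2], W[2:]                        # split the layer
y_parallel = matvec(device0, x) + matvec(device1, x)   # concat slices
y_full = matvec(W, x)

print(y_parallel)            # [1.0, 2.0, 3.0, 6.0]
print(y_parallel == y_full)  # prints True
```

Unlike data parallelism, each device here holds only part of the model's weights, which is what lets a model too big for one GPU run at all.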
Okay, so on to GPT-2. I know you already saw this a little in the contextualized embeddings lecture, but I'm going to go into some more depth here. Essentially, it's a really large transformer language model. There's nothing really novel here in terms of new training algorithms or the loss function or anything like that. The thing that makes it different from prior work is that it's just really, really big, and it's trained on a correspondingly huge amount of text: 40 gigabytes, roughly 10 times more than previous language models have been trained on.
With a dataset of that size, the only way to get that much text is essentially to go to the web. One thing OpenAI put quite a bit of effort into when developing this model was ensuring that the text was pretty high quality, and they did that in an interesting way. They looked at Reddit, which is a website where people can vote on links, and they said: if a link has a lot of votes, then it's probably a decent link, and there's probably reasonable text there for a model to learn from.
Okay, so given a super-huge language model like GPT-2, the question is what you can actually do with it. Obviously, if you have a language model, you can do language modeling. One interesting thing is that you can run this language model on existing language modeling benchmarks, and it gets state-of-the-art perplexity even though it never sees their training data. Normally, if you want to evaluate your language model on, say, the Penn Treebank, you first train on the Penn Treebank and then evaluate on its held-out set. In this case, GPT-2, just by virtue of having seen so much text and being such a large model, outperforms all the prior work on a bunch of different language modeling benchmarks without ever seeing their training data.
But there are a bunch of other interesting experiments OpenAI ran with this language model, based on zero-shot learning. Zero-shot learning just means trying to do a task without ever training on it. The way you can do this with a language model is by designing a prompt to feed into the model and then having it generate from there, hoping it generates something relevant to the task you're trying to solve.
For example, for reading comprehension, you can take the context paragraph, concatenate the question to it, and then add a colon, which is a way, I guess, of telling the model, "Okay, you should be producing an answer to this question," and then just have it generate text. Perhaps it'll generate something that actually answers the question and pays attention to the context. Similarly, for summarization, you can give it the article followed by "TL;DR:", and perhaps the model will produce a summary. You can even do translation, where you give the model a list of known English-to-French translations, so you prime it to understand that it should be doing translation, and then you give it the source sequence followed by an equals sign and a blank, let it run, and perhaps it'll generate the sequence in the target language.
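These prompting schemes can be sketched as plain string templates (the exact delimiters here are illustrative and may differ from what OpenAI used):

```python
# Illustrative zero-shot prompt construction; the model itself is not
# included here, only the text you would feed it before sampling.

def reading_comprehension_prompt(context, question):
    # context, then the question, then "A:" to cue an answer
    return f"{context}\nQ: {question}\nA:"

def summarization_prompt(article):
    # "TL;DR:" cues the model to produce a summary
    return f"{article}\nTL;DR:"

def translation_prompt(examples, source):
    # prime with known English = French pairs, then leave the target blank
    lines = [f"{en} = {fr}" for en, fr in examples]
    lines.append(f"{source} =")
    return "\n".join(lines)

p = translation_prompt([("hello", "bonjour"), ("cat", "chat")], "dog")
print(p)
```

Whatever the model generates after the final cue is taken as its answer, summary, or translation; no task-specific training is involved.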
Okay, so here's what the results look like. For all of these, the x-axis is log-scaled model size and the y-axis is accuracy, and the dotted lines correspond to existing work on these tasks. For most of these tasks, GPT-2 is quite a bit below existing systems, but of course there's a big difference: existing systems are trained specifically to do whatever task they're being evaluated on, whereas GPT-2 is only trained to do language modeling and picks up these other tasks along the way. So, for example, it does English-to-French machine translation not as well as standard unsupervised machine translation, which is the dotted line, but it still does quite well.
And one interesting thing is the trend line: for almost all of these tasks, performance gets much better as the model increases in size. I think a particularly interesting one of these tasks is machine translation. The question is: how can it be doing machine translation at all, when all we're giving it is a bunch of web pages, almost all of them in English? And yet somehow it magically picks up a little bit of machine translation. It's not a great translation model, but it can still do a decent job in some cases.
The answer is that if you look at this giant corpus of English, occasionally within it you see examples of translations: a French idiom and its translation, or a quote from someone French followed by the English translation. And, kind of amazingly, I think this big model sees enough of these examples that it actually starts to learn how to generate French, even though that wasn't really an intended part of its training.
Another interesting thing to dig into is its ability to do question answering. A simple baseline for question answering gets about 1% accuracy; GPT-2 does barely better, at 4%. So this isn't amazingly solved question answering, but it's still pretty interesting: if you look at the answers the model is most confident about, you can see that it has learned some facts about the world. It's learned, for instance, that Charles Darwin wrote On the Origin of Species. Normally, in the history of NLP, if you wanted to get world knowledge into an NLP system, you needed something like a big database of facts. Even though this is still very early stages, and there's a huge gap between 4% accuracy and the roughly 70% that state-of-the-art open-domain question answering systems can achieve, the model can still pick up some world knowledge just by reading a lot of text, without that knowledge being explicitly put into it.
Any questions, by the way, on GPT-2 so far? Okay. One question that's interesting to think about is: what happens if our models get even bigger? Here I've done the very scientific thing of drawing some lines in PowerPoint and seeing where they meet. You can see that if the trend holds, at about 1 trillion parameters we'd reach human-level reading comprehension performance. If that were true, it would be really astonishing. I actually do expect that a 1-trillion-parameter model will be attainable in, I don't know, ten years or so, but of course the trend isn't clear. If you look at summarization, for example, it seems like performance has already topped out. So I think a really interesting question going forward, looking at the future of NLP, is how this scaling will change the way NLP is approached.
The other interesting thing about GPT-2 was the reaction to it from the media and from other researchers. The real cause of a lot of the controversy was this statement from OpenAI: "We're not going to release our full language model, because it's too dangerous; our language model is too good." The media really enjoyed this and said that machine learning is going to break the Internet. There were also some pretty interesting reactions from other researchers, including some tongue-in-cheek responses: "I trained a model on MNIST, is it too dangerous for me to release?" And similarly: "We've done really great work, but we can't release it, it's too dangerous, so you're just going to have to trust us on this."
Looking at more reasoned debate about this issue, you still see articles arguing both sides. These are two articles from The Gradient, which is a sort of machine learning newsletter, and they argue precisely opposite sides of the issue: should the model be released or not?
So I'll briefly go over a few arguments for and against. There's a lot of debate about this, and I don't want to go too deep into a controversial issue, but here's a list of things people have said.
First, why you should release. One complaint is: is this model really that special? There's nothing new going on here; it's just 10 times bigger than previous models. There's also the argument that even if this one isn't released, in five years everybody will be able to train a model this good. And if you look at images and speech, it's already possible to synthesize highly convincing fake images and fake speech, so what makes this thing different from those other systems? Speaking of other systems, Photoshop has existed for a long time, so we can already convincingly fake images; people have just learned to adjust, and learned that you shouldn't always trust what's in an image, because it may have been altered in some way.
On the other hand, you could say, "Okay, Photoshop exists, but you can't scale up Photoshop and start mass-producing fake content the way you can with this sort of model." People point to the danger of fake news, fake reviews, and astroturfing in general, which basically means creating fake user content that supports a view you want other people to hold. This is actually already done pretty widely by companies and governments; there's a lot of evidence for it. But they have to hire people to write all those comments on news articles, let's say, and we don't want to make their job any easier by producing a machine that could potentially do it for them.
I'm not really going to take a side here; there's still a lot of debate about this. I think the main takeaway is that, as a community, people in machine learning and NLP don't really have a handle on this yet. We were caught by surprise by OpenAI's decision, and that means there's really some figuring out to be done about what exactly is responsible to release publicly, what kinds of research problems we should be working on, and so on.
So, yeah. Any questions about this reaction, or this debate in general? Okay.
I think one question arising from this debate is whether the ML people should really be the ones making these sorts of decisions, or whether there's a need for more interdisciplinary work, where we bring in experts in, say, computer security, people from the social sciences, and people who are experts in ethics to look at these decisions.
So GPT-2 was definitely one example where it suddenly seemed like our NLP technology has a lot of pitfalls: it could be used maliciously, or it could cause damage. And I think this trend is only going to increase. If you look at the areas of NLP people are working on, increasingly they're really high-stakes applications, and those often have really big ramifications, especially if you think from the angle of bias and fairness. So let's go over a couple of examples.
Some areas where this is happening: people are looking at using NLP for judicial decisions, for example, should this person get bail or not? For hiring decisions: you take someone's resume, run NLP on it, and automatically decide whether to throw the resume out, doing some sort of screening. And for grading tests: if you take the GRE, your test will be graded by a machine. A person will also look at it, but nevertheless, that's a sometimes very impactful part of your life, when it's a test that affects your acceptance into a school, let's say.
I think there are some good sides to using machine learning in these kinds of contexts. One is that we can pretty quickly evaluate a machine learning system and find out whether it has some kind of bias, just by running it on a bunch of data and seeing what it does. And perhaps even more importantly, we can fix this kind of problem if it arises. It's probably easier to fix a machine learning system that screens resumes than it is to fix 5,000 executives who are slightly sexist, right? So in this way, there's a positive angle on using machine learning in these high-stakes decisions.
On the other hand, it's pretty well known, and I know you had a lecture on bias and fairness, that machine learning often reflects bias in a dataset, and can even amplify it. There's also the concern of a feedback loop, where a biased algorithm leads to the creation of more biased data, in which case these problems only compound and get worse.
And for all of the high-impact decisions I listed on that slide, there are examples where things have gone awry. Amazon had an AI working as a recruiting tool, and it turned out to be sexist. There have been early pilots of using AI in the justice system, and those have also had, in some cases, really bad results. And if you look at automatic essay grading, it's not really a great NLP system either. Here's an example excerpt of an essay that an automatic grading system used for the GRE gives a very high score, but really it's just a salad of big fancy words, and that's enough to convince the model that this is a great essay.
The last area I want to talk about where you can see there are real risks and pitfalls in using NLP technology is chatbots. I think chatbots do have a side where they can be very beneficial. Woebot is one example: a company with a chatbot you can talk to if you're not feeling too great, and it'll try to cheer you up. That could be a really nice piece of technology that helps people. But on the other hand, there are some big risks. One example is that Microsoft had a chatbot trained on tweets, and it quickly started saying racist things and had to be pulled.
So I think all of this highlights that, as NLP becomes more effective, people are seeing opportunities to use it in increasingly high-stakes decisions, and although there's some appeal to that, there's also a lot of risk. Any more questions on the social impact of NLP?
Okay. The last part of this lecture looks more at future research. In particular, I think a lot of current research trends are reactions to BERT: the question is what BERT solved and what we should work on next. Here are results on the GLUE benchmark, which is a compendium of natural language understanding tasks, with an average score reported across them.
The two rightmost models are just supervised machine learning systems, right? We have Bag-of-Vectors, and if we instead use a fancy neural net architecture, BiLSTM + Attention, we gain about five points. But the gains from BERT really dwarf that difference: BERT improves results by about 17 points, and we end up actually quite close to human performance on these tasks.
So one implication of this that people are wondering about: is this the death of architecture engineering? I'm sure all of you who worked on the default final project have seen a whole bunch of fancy pictures showing different architectures for solving SQuAD. There are a lot of papers, and they all propose some kind of attention mechanism or something like that. With BERT, you don't need any of that: you just train a transformer, give it enough data, and you're doing great on SQuAD. Maybe these architectural enhancements are not the key thing that will drive progress on these tasks.
If you look at this from the perspective of a researcher, you can think: "Okay, I can spend six months designing a fancy new architecture for SQuAD, and if I do a good job, maybe I'll improve results by one F1 point." But in the case of BERT, increasing the size of the model by 3x, which is the difference between the base-size model and the large model, improved results by five F1 points. So it does seem to suggest we need to re-prioritize which avenues of research we pursue, because architecture engineering isn't providing the gains for its time investment the way leveraging unlabeled data is.
If you look at the SQuAD leaderboard now, I think at least the top 20 entrants are all BERT plus something.
One other issue I think BERT has raised is that we need harder tasks. BERT has almost solved SQuAD, if you define that as getting close to human performance. So there's been a growth in new datasets that are more challenging, and there are a couple of ways they can be more challenging. One is doing reading comprehension on longer documents, or across more than one document. Another is coming up with harder questions that require multi-hop reasoning, which essentially means you have to string together multiple supporting facts from different places to produce the correct answer. And another is situating question answering within a dialogue.
There's also a small detail in the construction of reading comprehension datasets that has really affected the difficulty of the task: when you create these datasets, can the person who writes questions about a passage see that passage or not? Of course, it's much easier to come up with a question when you can see the passage; if you write a question without seeing it, you may not even end up with an answerable question.
But the problem with looking at the passage is, first of all, that it's not realistic: if I'm asking a question, I'm not usually going to have the paragraph that answers it sitting in front of me. On top of that, it really encourages easy questions. If you're a Mechanical Turker being paid to write as many questions as possible, and you see an article that says, I don't know, "Abraham Lincoln was the 16th president of the United States," what are you going to write as your question? You're going to write, "Who was the 16th president of the United States?" You're not going to write something more interesting that's harder to answer. So this is one way crowdsourced datasets have changed: people are now making sure questions are written independently of the context.
So I'm going to briefly go over a couple of new datasets in this line. One is called QuAC, which stands for Question Answering in Context. In this dataset, there's a teacher and a student: the teacher sees a Wikipedia article, the student wants to learn about that article, and the goal is to train a machine learning model that acts as the teacher. You can imagine that in the future this sort of technology might be useful in education, for adding some automation.
One thing that makes this task difficult is that questions depend on the entire history of the conversation. For example, in the example dialogue on the left, the third question is "Was he the star?" Clearly you can't answer that unless you look back earlier in the dialogue and realize that the subject of the conversation is Daffy Duck. And because this dataset is more challenging, there's a much bigger gap to human performance: if you train BERT with some extensions, the results are still about 15 F1 points worse than humans.
Um, here's one other dataset, called HotPotQA.
It is
designed instead for multi-hop reasoning.
Um, so essentially, in order to answer a question,
you have to look at multiple documents,
you have to look at different facts from those documents,
and perform some inference,
um, to get what the correct answer is.
Um, so I think, you know, this is a much harder task.
And again, there's a much bigger gap to human performance.
Um, any questions on
new datasets, harder tasks for NLP?
Okay. Um, I'm gonna
kind of rapid-fire through
a couple more areas in the last minutes of this talk.
Um, so multitask learning I think is really growing in importance.
Um, of course, um,
you've had a whole lecture on this, right?
So I'm not gonna spend too much time on it.
Um, but maybe one
point of interest is that if you look at performance on the GLUE benchmark,
the benchmark for natural language understanding,
all of the top results
that are now actually surpassing BERT in
performance come from taking BERT and training it in a multi-task way.
Um, I think another interesting
motivation for multi-task learning is that if you are training BERT, you have a really,
really large model, and one way to make
more efficient use of that model is training it to do many things at once.
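To make the "one model, many tasks" idea concrete, here's a minimal numpy sketch of a shared encoder with per-task softmax heads. This is just an illustration on synthetic data, not how BERT itself is multi-tasked: the shared matrix `E` stands in for the big pretrained model, the two tasks and their label spaces are made up, and the key point is that every task's loss sends gradients into the shared parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, hidden, n = 16, 8, 100

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Two hypothetical tasks with different label spaces; labels are generated
# from the inputs so they are actually learnable.
tasks = []
for n_classes in (2, 5):
    x = rng.standard_normal((n, d_in))
    y = (x @ rng.standard_normal((d_in, n_classes))).argmax(axis=1)
    head = rng.standard_normal((hidden, n_classes)) / np.sqrt(hidden)
    tasks.append({"x": x, "y": y, "head": head})

E = rng.standard_normal((d_in, hidden)) / np.sqrt(d_in)  # shared "encoder"
lr = 0.2

for step in range(500):
    grad_E = np.zeros_like(E)
    for t in tasks:
        z = t["x"] @ E                                   # shared representation
        probs = softmax(z @ t["head"])
        g = (probs - np.eye(t["head"].shape[1])[t["y"]]) / n  # dLoss/dLogits
        grad_E += t["x"].T @ (g @ t["head"].T)           # every task updates E
        t["head"] -= lr * z.T @ g                        # task-specific head
    E -= lr * grad_E

accs = [(softmax((t["x"] @ E) @ t["head"]).argmax(axis=1) == t["y"]).mean()
        for t in tasks]
print("per-task training accuracy:", accs)
```

Because the encoder is shared, capacity that would otherwise sit idle for one task gets reused by the other, which is exactly the efficiency argument for multi-tasking a big model like BERT.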
Another area that's definitely important,
and I think will be important going into the future, is dealing with low-resource settings.
Um, and here I'm using a really broad
definition of resources, right?
So that could mean compute power, um, you know,
BERT is great but it also takes huge amounts of compute to run it.
So it's not realistic to say,
if you're building, let's say,
an app for a mobile device, that you could run a model the size of BERT.
Um, as I went into earlier in this talk, um, you know,
low-resource languages is an area that I think is pretty, um,
under-represented in NLP research right now,
because most datasets are in English, um,
but I do think, right,
there's a really, you know,
large number of people who, in order to benefit from NLP technology,
will need technologies that work well in a lot of
different languages, especially those without much training data.
And speaking of low amounts of training data, I think in general this is
an interesting area of research,
um, within machine learning.
Actually, people are, um,
working a lot on this as well.
Um, so a term often used here is few-shot learning.
Um, and that essentially means being able to
train a machine learning model that only sees,
let's say five or ten examples.
Um, one motivation there is that,
I think, a clear distinction between how our existing machine learning systems learn
and how humans learn is that
humans can generalize very quickly from five or so examples.
Um, if you're training a neural net,
you normally need, you know,
thousands of examples or perhaps even tens of thousands,
hundreds of thousands of examples to get something that works.
Um, so I also see this being a pretty important area in the future.
Um, the last area where I want to go into
a little bit more depth is interpreting and understanding models.
Um, so really there are two aspects to this.
One is if I have a machine learning model and it makes a prediction,
I would like to be able to
know why it made that prediction.
So get some rationale, get some explanation,
um, that would especially be important in an area like health care, right?
So if you're a doctor and you're making a decision, um,
it's probably not good enough for your machine learning model to say,
"Patient has disease X."
You really want it to say,
"Patient has disease X for these reasons."
Um, because then you as a doctor can double-check
and try to validate the
machine's thinking, I guess,
that it used to come up with that diagnosis.
Um, the other aspect of interpreting and
understanding models is more of a scientific question, right?
We know things like BERT work really well;
um, we want to know, why do they work well?
What aspects of language do they model?
Um, what things don't they model?
Um, and that might lead to
ideas for improving those models.
Um, so here are a
couple of slides on the main approach for answering this sort of scientific question:
what does a machine learning model learn?
Um, what you do is you have a model, so let's say it's BERT.
It takes as input a sequence of words
and produces as output a sequence of vectors,
and we want to ask, does it know, for example,
the part of speech of words?
So do its vector representations
capture something about syntax?
Um, and a simple way of asking this question is to train another classifier on top of BERT,
uh, that's trained to do,
um, let's say part-of-speech tagging.
Um, but we only, um,
backprop into that diagnostic classifier itself.
So in other words we're treating the output of BERT, um,
that sequence of vectors as a fixed input,
and we're sort of probing those vectors to see,
do they contain
information about part of speech that
this second diagnostic classifier on top can decode
to get the correct labels?
Um, so there are quite a few concerns here.
Um, one concern is,
if you make your diagnostic classifier too complicated,
it can just solve the task all on its own,
and basically ignore
whatever representations were produced by BERT.
Um, so the standard thing right now is to use
a single softmax layer on top of BERT
to make these decisions.
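The probing setup above can be sketched in a few lines of numpy. This is an illustration, not the actual BERT probing code: the random vectors stand in for frozen BERT token vectors, and the synthetic tags stand in for part-of-speech labels, generated so the vectors really do encode them. The key property is that only the single softmax layer (the probe) is ever updated; no gradient flows back into the vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tags, hidden, n_tokens = 5, 32, 200

# Stand-ins for frozen "BERT" token vectors and decodable "POS" tags.
vectors = rng.standard_normal((n_tokens, hidden))
true_proj = rng.standard_normal((hidden, num_tags))
tags = (vectors @ true_proj).argmax(axis=1)  # info the vectors do contain

# The probe: a single softmax layer. W and b are the ONLY trained
# parameters; the vectors are treated as a fixed input.
W = np.zeros((hidden, num_tags))
b = np.zeros(num_tags)
lr = 0.5

for step in range(200):
    logits = vectors @ W + b
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n_tokens), tags]).mean()
    grad = (probs - np.eye(num_tags)[tags]) / n_tokens   # dLoss/dLogits
    W -= lr * vectors.T @ grad                           # update the probe only
    b -= lr * grad.sum(axis=0)

accuracy = (probs.argmax(axis=1) == tags).mean()
print(f"probe loss {loss:.3f}, accuracy {accuracy:.2f}")
```

If the probe's accuracy lands far above chance (here 1/5), the information must be present in the vectors; if even a softmax layer can't decode it, the information is absent, or at least not linearly accessible, which is exactly the concern about probe capacity above.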
Um, and there's been a whole bunch of tasks proposed for
evaluating essentially the linguistic knowledge of these models.
Um, so you could do part-of-speech tagging,
you could do more semantic tasks like,
relation extraction,
or something like co-reference.
Um, and this is a pretty active area of work.
Um, here is just one
plot showing some of the results of this approach.
So here what we're doing is we're adding
diagnostic classifiers to different layers of BERT,
and we are seeing which layers of BERT are more useful for particular tasks.
Um, and something kind of interesting comes out of this, which is that
the different layers of BERT seem to correspond
fairly well with
different levels of linguistic analysis.
Um, so dependency parsing, which is a syntactic task,
is considered sort of a, you know,
medium-level task in understanding a sentence.
And the medium layers of BERT, so layers 6 through 8 or so,
are the ones best at dependency parsing.
Um, if you have a very semantic task like sentiment analysis,
where you're trying to learn some kind of
semantic property of the whole sentence,
then the very last layers of BERT are the ones that seem
to encode the most information about this phenomenon.
Um, okay.
So this is almost it for the talk, um,
I just have one slide here on
NLP not in the academic research context,
which I have already been talking a lot about, but NLP in industry,
and really there's rapid progress there.
And I wanted to point you to two areas where I think there's
especially large interest in using NLP technology.
Um, one is dialogue,
um, so for things like chatbots, right?
There's the Alexa Prize where they're actually investing a lot of money in,
um, having groups figure out how to improve chitchat dialogue.
Um, there's also I think a lot of potential for customer service, right?
So improving basically automated systems that'll, um,
you know, book you a flight,
or help you cancel a subscription, or anything like that.
Um, and similarly, there's a lot of potential in health care.
Um, one is understanding the records of someone who
is sick, to help with diagnoses.
Um, I think another
equally important area is actually
parsing biomedical papers.
Um, so the number of biomedical papers that are being written is really insane;
it's way larger than the number
of computer science papers that are being written.
[NOISE] Um, often if you're a doctor,
or if you're a researcher, um,
in medicine, you might want to look up something very specific, right?
You might want to know what is
the effect of this particular drug on this particular gene,
or a cell with this particular gene.
Um, there's no good way right now of searching through
hundreds of thousands of papers to find whether someone has
done this experiment and has results for this
particular combination of things.
Um, so automated reading of all this biomedical literature,
um, could have a lot of value.
Okay, um, to conclude, um,
there's been rapid progress in the last five years due to deep learning, um, in NLP.
Um, in the last year, we've seen another really
dramatic increase in the capability of our systems,
thanks to using unlabeled data.
So that's methods like BERT.
Um, and the other thing that I think is important to think about is that
NLP systems are starting to be at a place where they can have big social impact.
Um, so that makes some issues like bias and security very important. Um, thank you.
Uh, good luck finishing all your projects.
[APPLAUSE].
