Okay hi everyone.
Let's get started again.
Okay. So, first of all, a couple of announcements. Thanks to everyone who filled in our mid-quarter survey; we've actually gotten great participation in that.
Here are my two little Pac-Man figures. The Pac-Man figures mean that almost everyone thinks the lectures are at the right pace, and those that don't are pretty much evenly divided.
If we go to how challenging Assignment 3 was, slightly more people thought it was too easy than too hard. So, I guess we're setting about rectifying that with Assignments 4 and 5.
There were a whole bunch of other questions, and we've been trying to absorb all the feedback. One of the questions was what people wanted most from the remaining lectures. The good news here is that we're very good at predicting what people wanted, or else everybody just looked ahead in the syllabus and wrote down what it said was coming. The four most popular answers for topics people wanted in the remaining lectures were Transformers and BERT, both of which are gonna be covered this week,
question answering, which we talked about last week, and then text generation and summarization, and you get Abby back next week to talk about those.
A lot of people also answered this question a different way, in terms of the style of material. Some people emphasized new research and the latest updates from the field, and I guess we'll get some of that today as well; some people are more interested in successful applications in industry, and we try to do a bit of that; some want cool new neural architectures. The bottom answer wasn't the most popular one, I'll admit, but at least a few people wish that we were teaching more linguistic stuff.
That is something I actually feel a bit awkward about in the way things were merged into CS224N with deep learning. The truth of the matter is that there's so much to cover in the early part of the course, with neural networks, backpropagation, different neural net architectures and so on, that the reality is we teach rather less linguistic material than we used to in this class. For the last four weeks of the class, though, we really do try to cover some more linguistic topics, so look forward to that.
Okay, announcements. We've made a couple of deadline changes. Firstly, a number of people have mentioned that they think assignment five is a bit tough, so we're giving people one extra day to do it.
I realize that one extra day is not a ton, but there's a complex balance here, because on the other hand we don't really want to undermine the time people have available for final projects. And if you're one of the people who hasn't yet started assignment five, we really do encourage you to get underway on it.
In the reverse direction, we decided that the project milestone was really too late if we're going to be able to give you feedback that you can usefully make use of, so we're moving the project milestone date two days earlier. We've also gotten everyone's project proposals, and our plan is to get them back to everybody on Friday.
Yes, so, a lot of things moving.
And finally, on other announcements: this Thursday is our first invited speaker, and so if you're an in-person student you're meant to be here. If you're not able to be here, you should know about our reaction-paragraph policy. In the pinned Piazza post about reaction pieces and attendance, I've actually put up an example of a reaction piece from a past class, to make it a little more concrete what's expected there.
The idea is that we're hoping for something that isn't a ton of work. You can just write 100 to 150 words, a few sentences, but we want you to pick out a specific thing that was interesting and write a couple of sentences about what it was and what your thoughts are about it. Not just some very generic statement like "this was a lecture about transformers; he talked about transformers and it was interesting." That is not what we want for the reaction piece. Okay.
So, here's the plan for today.
What I want to talk about today is the exciting recent work on contextual word representations. When I was thinking about what I was gonna say, I wanted to say, "oh, this is the most exciting thing in deep learning for NLP in the last five years," but something's completely wrong with that, because really this is the most exciting thing in deep learning for NLP that happened in 2018. Things move very quickly in deep learning at the moment, and I don't think it's really fair to say it's got five years of life. But it's a very exciting thing that happened last year, and we'll talk about that.
Okay. So, we'll talk about the early stuff, then ELMo and ULMfit, then transformer architectures briefly, and then go on to the BERT model that's been quite prominent lately.
So, let's just recap: let's go backwards a bit first to think about where we've been, where we are now, and why we might want something more.
Up until now, we've just had one representation for words, which is what we learned at the beginning of class: there was a word, you trained a word vector for it, and that's what you used in your model. And you could do that with algorithms like Word2vec, GloVe, or fastText, which I mentioned last week.
In this progression of ideas in deep learning, when deep learning for NLP, or really the general resurgence of neural networks for NLP, came about at the beginning of this decade, these pre-trained word vectors, trained unsupervised over a large amount of text, were completely seen as the secret sauce. They were the thing that transformed neural networks for NLP from something that didn't really work to something that worked great.
This is actually an old slide of mine, one I guess I first made for a 2012 ACL tutorial and then used in lectures around 2013 and 2014.
And this was the picture in those years. The slide looks at two tasks, part-of-speech tagging and named entity recognition, which I'll use quite a bit today. The top line shows the state of the art, which was a traditional categorical feature-based classifier of the kind that dominated NLP in the 2000s. The next line shows what happened if you took the same dataset and trained a supervised neural network on it: the performance wasn't great. Part-of-speech tagging always has very high numbers for various reasons, so perhaps the more indicative ones to look at are the named entity recognition numbers.
So the story was that neural nets sucked, right? The reason why last decade everybody used categorical feature-based CRF and SVM kinds of classifiers is that, if you look, they worked eight percent better than a neural network. Why wouldn't anybody use them?
But then people had come up with this idea that we could do unsupervised pre-training of word representations to produce word vectors for words. And in those days this was very hard to do, both because of the kind of algorithms and the kind of machines that were available. Collobert and Weston, 2011, spent seven weeks training their unsupervised word representations, and at the end of the day they were only 100-dimensional word representations.
But this was the miracle breakthrough: you put in these unsupervised word representations, and now the neural net is getting to 88.87, almost as good as the feature-based classifier. Then, like any good engineers, they did some hacking with some extra features, and they got a system that was slightly better than the feature-based system.
Okay. So that was our picture: having word representations pre-trained in an unsupervised manner was the big breakthrough, the secret sauce that gave all the oomph that made neural networks competitive.
But then a funny thing happened. People had had these initial breakthroughs, which were all about unsupervised methods for pre-training; it was the same in vision, where this was the era of building restricted Boltzmann machines and doing complicated unsupervised pre-training techniques on them as well. But after that, people started to discover that we actually have some new technologies for non-linearities, regularization, and things like that, and if we keep using those, we can just go back to good old supervised learning. And shockingly, it now works way better inside neural networks.
So if you go ahead to what I'll call the 2014-to-2018 picture, the picture is actually very different. The results I'm gonna show you are from the Chen and Manning neural dependency parser that we talked about weeks ago. Despite the fact that this dependency parser is trained on a pretty small corpus, a million words of supervised data, you can just initialize it with random word vectors and train a dependency parser, and to a first approximation it just works fine: you get roughly a 90 percent accuracy English dependency parser.
Now, it is the case that you could instead use pre-trained word embeddings, and you do a bit better, about one percent better. And so this was the new world order: these pre-trained unsupervised word embeddings are useful because you can train them from a lot more data, so they know about a much larger vocabulary. They help with rare words and things like that, and they give you a percent. But they're definitely no longer the night-and-day thing that makes neural networks work, as we used to believe.
I'm just gonna deviate from the main narrative here to give one more tip for dealing with unknown words with word vectors, in case it's useful for some people building question answering systems.
For word vectors on unknown words, the commonest thing historically is that you've got your supervised training data, you define a vocab, which might be the words that occur five times or more in your supervised training data, and you treat everything else as UNK, training one vector for UNK. But that has a problem: you have no way to distinguish different UNK words, either for identity or for meaning, and that tends to be problematic for question answering systems.
One way to fix that is what we talked about last week: words are made out of characters, so I can use character representations to learn word vectors for other words. You can certainly do that, and you might wanna try it, but it adds some complexity. Especially for things like question answering systems, there are a couple of other things you can do that work considerably better, and they've been explored in a paper by Dhingra et al. from 2017.
The first is to note that when you encounter new words at test time, your pre-trained word embeddings probably have a much bigger vocabulary than your actual system does. So anytime you come across a word that isn't in your vocab but is in the pre-trained word embeddings, just grab the word vector of that word and start using it. That will be a much more useful thing to use.
The second tip is that if you see something that's still an unknown word, rather than treating it as UNK, you assign it a random word vector on the spot. This has the effect that each word gets a unique identity, which means that if you see the same word in the question and in a potential answer, they will match together accurately, which you don't get with just UNK matching. Those can be useful ideas to try.
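To make the two tips concrete, here's a minimal sketch in plain Python. The vocabularies, dimensions, and words are all invented for illustration; a real system would load something like GloVe as the pre-trained table.

```python
import random

# Hypothetical toy vocabularies, invented purely to illustrate the
# Dhingra et al. (2017) tips; real tables would be large and learned.
train_vocab = {"where": [0.1, 0.2], "is": [0.3, 0.1]}      # supervised vocab
pretrained = {"where": [0.1, 0.2], "is": [0.3, 0.1],
              "located": [0.5, 0.4]}                        # bigger unsupervised vocab
DIM = 2

def lookup(word, _cache={}):
    """Return a vector for `word`, avoiding one shared UNK vector."""
    if word in train_vocab:            # known to the supervised model
        return train_vocab[word]
    if word in pretrained:             # tip 1: fall back to the pre-trained vocab
        return pretrained[word]
    if word not in _cache:             # tip 2: a fresh random vector, then fixed
        _cache[word] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return _cache[word]                # same unseen word -> same vector

# The same unseen word gets an identical vector, so it can match
# between a question and a candidate answer.
assert lookup("zyzzyva") == lookup("zyzzyva")
assert lookup("located") == pretrained["located"]
```

The cached random vector is what gives each out-of-vocabulary word a unique identity, which is exactly what plain UNK matching loses.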
Okay, end digression. So, up until now, we just had this one representation of words: we ran Word2vec and we got a word vector for each word.
That was useful; it's worked pretty well. But it had some big problems. So what were the big problems of doing that?
What's the problem of having one word vector per word? Yes?
A lot of words have like one spelling, but a whole bunch of meanings.
Right, so typically you have one string of letters which has a whole bunch of meanings. Words have a ton of senses. That's the biggest and most obvious problem: we're collapsing together all the meanings of a word.
We talked a bit about how one solution was to distinguish word senses and have different word vectors for them. And I then said something about how you could also think of this word vector as a mixture of the senses, and maybe your model could separate it. But it seems like we might want to take that more seriously.
One way we could take that more seriously is to say that, really, traditional lists of word senses are themselves a crude approximation. What we actually want to know is the sense of the word inside a particular context of use. What I mean by that is this: we distinguish different senses of a word. Say for the word star there's the astronomical sense and there's the Hollywood sense, and they're clearly different. But if we then go to what I'm calling the Hollywood sense, I could say, well, wait a minute: there are movie stars and there are rock stars, and there are R&B stars, and there are country stars.
All of those are different senses, and in certain contexts one or another of them would be evoked. It's very hard, if you're trying to actually enumerate the senses of a word, to say which ones count as different or the same. So really, you wanna know what a word means in a context.
There's a second limitation of these word vectors, one we haven't really talked about and which is less obvious, but it's also something we might want to fix, and at least one of the models we discuss today takes some aim at it: we just have one vector for a word.
But there are different dimensions of a word. Words can differ in meaning, in real semantics, but words can also have different syntactic behavior, like different parts of speech or grammatical behavior. In some sense, arrive and arrival have almost the same semantics, but they're different parts of speech: one is a verb and one is a noun, so they appear in quite different places, and you'd wanna do different things with them in a dependency parser.
And there are even other dimensions: words also have register and connotation differences. You can probably think of lots of different words for a bathroom, and a lot of those words mean semantically the same thing, but have rather different registers and connotations as to when they're appropriate to use. So we might want to distinguish words on that basis as well.
These are the kinds of things we want to solve with our new contextual word embeddings. I've said that up until now we just had these word vectors: words just had one vector. But if you actually think about it, maybe that's wrong. Maybe we never had a problem, or at any rate, we solved it six classes ago.
Because if you remember back to when we started talking about neural language models, what did a neural language model do? At the bottom, you fed in the word vectors, but then you ran one or more recurrent layers across them, something like an LSTM layer, which calculated representations that sit above each word. The role of those hidden states is a bit ambivalent: they're used for prediction, and they're used to compute the next hidden state, output states, and so on. But in many ways you can think, huh, these representations are actually representations of a word in context.
And if you think about what happened with the question answering systems, that's exactly how they were used, right? We ran LSTMs backwards and forwards over the question and the passage, and then we said, okay, those are a good representation of a word's meaning in context; let's start matching them with attention functions, et cetera. So it seems like we'd already invented a way to have context-specific representations of words.
And effectively, the rest of this lecture is basically no more complex than that. It took a while, but people woke up and started to notice: really, when you're running any language model, you generate a context-specific representation of each word. Maybe, if we just took those context-specific representations of words, they'd be useful for doing other things. There are a few more details, but that's really the summary of this entire lecture.
One of the first things to do that was a paper that Matt Peters wrote in 2017, the year before last. This was a predecessor of the modern versions of these context-sensitive word embeddings. Together with co-authors, he came up with a paper called TagLM, and it essentially already had all the main ideas.
What was wanted was this: we want to do better at tasks such as named entity recognition, and what we'd like is to know about the meaning of a word in context. But standardly, if we're doing named entity recognition, we just train on half a million words of supervised data, and that's not much of a source of information for learning about the meaning of words in context. So why don't we adopt a semi-supervised approach? That's what we do. We start off with a ton of unlabeled data, and from that unlabeled data we can train a conventional word embedding model like Word2vec, but we can also train a neural language model, something like a bi-LSTM language model.
Then, when we want to learn our supervised sequence tagger at the top, what we're gonna do is say: for the input words "New York is located," we can not only use the word embeddings, which are context independent, we can also take our trained recurrent language model and run it over this input. That generates hidden states in our bi-LSTM language model, and we can feed those in as features to our sequence tagging model, and those features will let it work better.
Here's a second picture that runs through this in much greater detail. We're assuming that we have trained a bi-LSTM language model on a lot of unsupervised data. Then we want to do named entity recognition for "New York is located." The first thing we do is run "New York is located" through our separately trained neural language model: we run it through a forward language model and through a backward language model, we get a hidden state representation for each word from each, and we concatenate the forward and backward ones. That gives a concatenated language model embedding which we'll use as features in our named entity recognizer.
Then, for the named entity recognizer itself, which we're gonna train supervised, we have the same sentence. We can look up a Word2vec-style token embedding for each word, and we can use what we learned about character-level CNNs and RNNs to build a character-level representation, which we also concatenate, so we have two representations. We feed these into a bi-LSTM layer. But when we get the outputs of this bi-LSTM layer, we concatenate each output with what we get from our neural language model. So each position becomes a pair of states: one spit out from the first bi-LSTM layer, concatenated with something from the neural language model. That concatenated representation is then fed into a second bi-LSTM layer, and from the output of that we do the usual softmax classification, giving tags like beginning-of-location and end-of-location to say New York is a location, and then "is" will get another tag to say it's not a location. Does that make sense?
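The wiring just described can be sketched at the level of vector shapes. This is a toy illustration with invented two-dimensional vectors, not the real TagLM model; the point is only to show where the concatenations happen.

```python
# Shape-level sketch of the TagLM wiring (all numbers invented):
def concat(*vecs):
    """Concatenate vectors represented as plain lists."""
    out = []
    for v in vecs:
        out.extend(v)
    return out

# Pretend per-token outputs of the separately trained, frozen LM:
fwd_lm_state = [0.1, 0.2]        # forward LM hidden state for "York"
bwd_lm_state = [0.3, 0.4]        # backward LM hidden state for "York"
lm_embedding = concat(fwd_lm_state, bwd_lm_state)   # frozen LM feature

# Inputs to the supervised tagger for the same token:
word_emb = [0.5, 0.6]            # Word2vec-style token embedding
char_emb = [0.7, 0.8]            # char-CNN/RNN representation
tagger_input = concat(word_emb, char_emb)

# Output of the tagger's first bi-LSTM layer (stand-in values),
# concatenated with the LM embedding before the second bi-LSTM layer:
layer1_out = [0.9, 1.0]
layer2_in = concat(layer1_out, lm_embedding)

assert len(lm_embedding) == 4
assert len(layer2_in) == len(layer1_out) + len(lm_embedding)
```

So the language model never sees gradients from the tagger; its states are simply glued on as extra features between the two bi-LSTM layers.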
So the central thing is that, having seen that the representations we get from bi-LSTMs are useful, we're just going to feed them into supervised models as we train them. The idea is that this gives us better features for words, some kind of representation of their meaning in context, which lets us learn better named entity recognizers, or whatever it is.
Maybe I should have put this slide earlier, but it was meant to remind you what a named entity recognizer is. I hope you remember: it's something where we're going to find and label entities for things like person, location, date, and organization.
So anyway, doing this worked. Here's a little bit of history. The most famous named entity recognition dataset is the CoNLL 2003 dataset, which actually exists in multiple languages; but whenever people say CoNLL 2003 and don't mention a language, they mean the English version. That's the way the world works. This dataset has been around for roughly 15 years now.
It was originally a competition: 2003 was the original bake-off, and my group actually took part in it. I think we got third or fourth place or something, and our F1 score was 86. The people who won were from IBM Research, and they got 88, almost 89.
But a difference between these two entries is that our system was a single clean categorical machine-learning model, whereas the IBM one was not only an ensemble of four different machine learning models plus gazetteers, it also fed in the output of two other old NER systems that IBM people had trained years ago on different data. I guess it worked for them, but it was a fairly complex system.
Here's another system from Stanford, our classic Stanford NER system that is widely used. It used a conditional random field model, which generally dominated the second half of the 2000s and the first half of the 2010s for doing NER, and it was a bit, though not hugely, better than the 2003 systems. The next system here was about the best categorical CRF system ever built; but rather than only using the training data to build the model, as our system did, it threw in Wikipedia and other stuff to make it work better, and that got you to about 90.8 F1.
Then, once bi-LSTM-style models started to be known and used in NLP, people were able to build systems, trained just on the training data, that worked a lot better. Essentially you're going from this system to that system on the same data, so you're getting about a 4 percent gain, even though it wasn't making use of Wikipedia and things like that. This Ma and Hovy system is pretty well known, getting about 91.21.
But if we then go to the TagLM system: Matt Peters and co. had a base system that was similar to the Ma and Hovy system, though a little bit worse. The point, however, is that using the neural language model is just a useful oomph-giver that takes the results up. Not night and day, but slightly over a percent, and it gave them the best NER system then available. So that proved that these contextual word representations really had some power and started to be useful. There's white space at the top of the slide because we'll get back to more of this later.
There are some details on their language model. It's useful to have a bidirectional language model, not a unidirectional one; you need a big language model to get much in the way of gains; and you need to train the language model over much more data. It doesn't work if you're just training it over your supervised training data. Another model that was around was CoVe, but I think I'll skip that.
Okay. So then the next year, Matt Peters and a different set of colleagues came up with an improved system called ELMo, and effectively this was the breakthrough system. This was the system that everybody noticed and said, "Wow, these contextual word vectors are great. Everyone should be using them, not traditional word vectors." Yes?
I have a simple question: when you say pre-training a system, what exactly [inaudible]
It's pre-trained because this piece over here, the big neural language model, is trained first. And there's an important thing I forgot to say, so thank you for the question. In some sense it's pre-trained because it was trained first, but the main reason people think of this as pre-training is that after you've trained it, it is frozen. It's just something you can run, with fixed parameters, that gives you a vector, your contextual word representation, at each position, and that vector is then just used in this system. When you're training this system, there's no gradient flowing back into the neural language model to change and update it; it's just fixed. That's the sense in which people talk about pre-training: it's normally a model that you trained somewhere else and that you're using to give features, but that isn't part of the model you are now training. Yeah?
[inaudible]
Well, I wouldn't quite call it reconstruction. It's unsupervised in the sense that this is a language model: you're training it to predict the next word. So here are words one to k; what is the (k+1)th word, scored with a cross-entropy loss, and repeat for each position.
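In code, the per-position loss is just the negative log probability the model assigns to the word that actually comes next. The vocabulary and probabilities here are made up for illustration, not from any real model.

```python
import math

# Toy next-word cross-entropy computation.
vocab = ["the", "cat", "sat"]
# Model's predicted distribution for word k+1 given words 1..k:
probs = {"the": 0.2, "cat": 0.7, "sat": 0.1}
target = "cat"                    # the actual next word in the corpus

loss = -math.log(probs[target])   # cross-entropy at this position
# Training a language model averages this loss over every position
# in the text; that is the whole "unsupervised" objective.
assert 0.35 < loss < 0.36         # -ln(0.7) is about 0.357
```

No labels are needed beyond the running text itself, which is why you can train on essentially unlimited data.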
Yes. So, having gone through TagLM in some detail: in some sense, the difference between TagLM and ELMo is kind of small; it's in the details. To a first approximation, they're doing exactly the same thing again, but a little bit better.
So, I hope TagLM made sense last time; what are the things that are different? They do the bidirectional language model a bit differently. One of their concerns was to come up with a compact language model that would be easy for people to use in other tasks, even if they don't have the beefiest computer hardware in the world. So they decided to dispense with word representations altogether and just use character CNNs to build word representations, because that lessens the number of parameters you have to store, the big matrices you have to use.
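As a sketch of that idea, here's a toy character-CNN word encoder in plain Python: one filter of width 3 slid over one-dimensional character embeddings, then max-pooled over time. Everything here, the embedding scheme, the filter, the single output feature, is invented for illustration; the real model uses many filters of several widths, producing a full vector per word.

```python
# Toy char-CNN word encoder: no per-word parameters at all, so the big
# vocabulary-sized embedding matrix disappears.
char_emb = {c: (ord(c) % 5) / 10.0 for c in "abcdefghijklmnopqrstuvwxyz"}

def char_cnn(word, filt=(0.5, 1.0, 0.5)):
    """One width-3 convolution filter plus max-over-time pooling."""
    xs = [char_emb[c] for c in word]
    convs = [sum(f * x for f, x in zip(filt, xs[i:i + 3]))
             for i in range(len(xs) - 2)]
    return max(convs)   # one feature; real models use many filters

# Any word over the known alphabet gets a representation, even one
# that never appeared in training data.
assert abs(char_cnn("cat") - 0.45) < 1e-9
assert isinstance(char_cnn("zyzzyva"), float)
```

The key property is that the parameter count depends only on the character inventory and the filters, not on the size of the word vocabulary.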
They expanded the hidden dimension to 4,096, but then project it down to 512 dimensions with a feed-forward projection layer. That's a fairly common technique to reduce the parameterization of the model: you have a lot of parameters in the recurrent direction, but you need much smaller matrices for feeding the input into the next level. Between the layers, they now use a residual connection, and they do a bit of parameter tying. So it's all in the little details.
But there's another interesting thing they did, which was an important innovation of ELMo, so we should get this bit. In TagLM, what was fed from the pre-trained LM into the main model was just the top level of the neural language model stack, and that was completely de rigueur in those days: you might have three layers of neural language model, and you regard the top level as the one that's really captured the meaning of the sentence, with the lower layers as processing that led up to it. They had the idea that maybe it would be useful to use all layers of the biLSTM neural language model; not just the top layer, but all layers might be useful.
There are these somewhat complex equations, but essentially the point is this: for a particular position, say word seven, we take the hidden state at each level of our neural language model stack, learn a weight for each level, and sum them. So this is a weighted average of the hidden layers at each position, and that is used as our basic representation.
They found that this gave quite a bit of extra usefulness, and different tasks could prefer different layers. There's one other bit here: they learn a global scaling factor gamma for each particular task. That allows for the fact that for some tasks the contextual word embeddings might be really useful and for other tasks not so useful; you're learning an overall usefulness for the entire task.
Okay. So that's the new version of the language model. And this allows for the idea that maybe there are more syntactic aspects of a word's meaning and more semantic aspects, that those could be represented at different layers of your neural language model, and that for different tasks you can weight the layers differently.
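This layer mixing can be sketched as follows, in pure Python with made-up hidden states, per-layer scalars s, and gamma; the real model learns s and gamma by backpropagation from the task, and the layers are much higher-dimensional.

```python
import math

# Per-layer hidden states for one token position (toy, 2-dimensional):
layers = [[1.0, 0.0],    # layer 0 (e.g. token representation)
          [0.0, 1.0],    # layer 1
          [0.5, 0.5]]    # layer 2

s = [0.1, 0.2, 0.3]      # learned per-layer scalars for this task
gamma = 0.8              # learned global scaling factor for this task

# softmax-normalise the scalars so they form a weighted average:
exps = [math.exp(x) for x in s]
w = [e / sum(exps) for e in exps]

# ELMo vector = gamma * sum_j w_j * h_j, computed per dimension:
elmo = [gamma * sum(w[j] * layers[j][d] for j in range(len(layers)))
        for d in range(2)]

assert abs(sum(w) - 1.0) < 1e-9   # weights form a proper average
assert len(elmo) == 2             # same dimensionality as one layer
```

Because the weights are task-specific, a parser can lean on lower, more syntactic layers while a sentiment model leans on higher ones, which is exactly the flexibility described above.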
So that's the basic model: you run your biLSTM over the text to get representations of each word. The generic ELMo recipe is then, with that frozen language model, to feed its output into some supervised model, depending on what the task is; and they say in the paper that how you do this may depend on the task.
You might wanna concatenate it to the intermediate layer, just as TagLM did; that might be fine. But it might also be useful to make use of these ELMo representations when producing outputs, so if you're doing something like a generation system, you might feed in the ELMo representation again before you do the softmax to find the output. They left it flexible as to how it's used, but the general picture is like what we saw before. Indeed, I'm reusing the same picture: you've calculated an ELMo representation for each position as a weighted average, and then you're concatenating that to the hidden state of your supervised system and generating your output.
One way or another, they were able to do this, and with the little improvements, that gave them about an extra 0.3 percent in named entity recognition. Now, that sounds like not very much, and you might conclude from this: why the excitement? [LAUGHTER] In some sense that's right, because to the extent that there was an interesting idea here, it had really come up in the TagLM paper, which gave a much better gain.
But, you know, why everyone got really excited was that in the ELMo paper,
they then showed this isn't something that you can
do one-off to improve a Named Entity Recognizer,
you can take these ELMo representations and use them for pretty much any NLP task,
and they can be very useful and give good gains.
And so, essentially why people got excited was because of the data that's in this table.
So here we're taking a whole bunch of very different tasks,
so there's SQuAD question-answering, uh,
there's natural language inference,
there's semantic role labeling,
there's co-reference, the Named Entity Recognition, doing sentiment analysis,
so a wide range of different NLP tasks,
and they have a previous state-of-the-art system.
They produced their own baseline, um, which is,
you know, commonly sort of similar to the previous state of the art,
but usually actually a bit worse than
the current state of the art, because it's
whatever simpler, cleaner system that they came up with.
but then they could say in each case,
oh, just take this system and add
ELMo vectors into the hidden representations in the middle,
and have those help you predict.
And in general, in all cases,
that's giving you about a three percent or so gain absolute
which was then producing this huge performance increase,
which in all cases was moving the performance well above the previous,
um, state of the art system.
So you know, this sort of then made it seem like magic pixie dust,
because, you know, in the stakes of NLP conference land, you know,
a lot of people used to try to come up
with a paper for the next year that's one percent better
on one task, and write it up, and that's
their big breakthrough for the year to get their new paper out.
And the idea that there's just, well, this
way of creating context-sensitive, um,
word representations, and you just use them in any task,
and they'll give you around three percent and take you past the state of the art,
this seemed like it was really great stuff.
And so people got very excited about this and that won
the Best Paper Award at the NAACL 2018 conference.
Ah, and then, a- as I sort of vaguely mentioned,
um, so the model that they actually used wasn't a deep stack,
there were actually only two layers of biLSTMs,
but they do show this interesting result that the lower layer better captures
low-level syntactic word properties
and is most useful for things like part-of-speech tagging, syntactic
dependencies, and NER, whereas the top layer of
their language model is better for
higher-level semantics and is more useful for things like sentiment,
semantic role labeling, and question answering.
Um, so that seemed interesting,
though it'll actually be interesting to see how that panned
out more if you had sort of more layers to play with.
Okay. ELMo, done.
Um, so I'm moving right ahead.
Um, here's something else that I just thought I should mention a little bit about,
another piece of work that came out around the same time, uh,
a few months later maybe, uh,
in, in 2018, was this work on
Universal Language Model Fine-tuning for text classification,
um, or ULMfit, by Howard and Ruder.
And essentially this had the same general idea of saying, well,
what we want to do is transfer learning, where we learn a big language model, um,
and then for our target task, which might be named entity recognition
but here is text classification,
we can transfer this language model information and help us to do better with the task.
And so, they proposed an architecture to do that.
And so, their architecture was,
you have a big unsupervised corpus from which you train a neural language model.
They used a deeper neural language model with three hidden layers.
Um, you then fine tune
your neural language model on the actual domain that you're interested in working in.
So, this was sort of an extra stage that they did.
And then finally, um,
you now introduce your classification objectives.
So, what they're going to be doing is making text classifiers.
So, we're now wanting to,
take this model and turn it from a language model into a text classifier.
Um, but there's something that they did differently, um,
which is in some sense,
foreshadows the later work in transformers.
So, rather than just feeding features from this into a completely different network,
they keep using the same network but they introduce a different objective at the top.
So, one thing you could do with this network is use
it to predict the next word as a language model.
And so at this point,
they freeze the parameters of that softmax at the top,
that's why it's shown in black.
Um, but instead, they could stick on
a different prediction unit where it's predicting stuff for a particular task.
So, it might be predicting
positive or negative sentiment in a text classification task or something like that.
So, in their model,
they're sort of reusing the same network but sticking on the top of that,
a different layer, to do the new classification task.
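That keep-the-network, swap-the-head idea can be sketched minimally like this. Everything here is hypothetical toy code for illustration, not the actual ULMfit implementation (the class names, the fake 2-d hidden states, and the head weights are all made up):

```python
class LanguageModelBackbone:
    """Stand-in for a pretrained language-model encoder (toy)."""
    def encode(self, tokens):
        # pretend each token maps to a 2-d hidden state
        return [[float(len(t)), 1.0] for t in tokens]

class TextClassifier:
    """ULMfit-style reuse: keep the backbone network, bolt a new
    task-specific prediction layer on top in place of the LM softmax."""
    def __init__(self, backbone):
        self.backbone = backbone      # reused pretrained network, later fine-tuned
        self.head_w = [0.5, -0.5]     # new classification head, trained fresh
    def predict(self, tokens):
        h = self.backbone.encode(tokens)[-1]  # final hidden state
        score = sum(w * x for w, x in zip(self.head_w, h))
        return "positive" if score > 0 else "negative"
```

The point of the sketch is just the structure: the same `backbone` that served the language-modeling objective keeps computing representations, and only the top prediction unit changes.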
Um, they were also interested in keeping things small,
the sort of one-GPU model of research, um,
and the paper has a lot of detail on the sort of tricks
and care and feeding of your neural models to maximize performance.
If you're interested in that, you could sort of look up some of the details about that.
Um, but what they were able to show again,
was making use of this language model pre-training was
a very effective way to improve performance,
this time for text classification.
So, these are text classification datasets,
IMDb is for sentiment,
um, TREC is for topical text classification, and again,
there are preceding systems that other people have developed and they
are showing that by making use of this language model pre-training,
they're able to significantly improve on the state of the art of these error rates,
so that low is good.
They also showed another interesting result which is kind of,
um, what you would expect or hope from doing this kind of transfer learning,
that what they were able to show is,
if you can train this neural language model on a big amount of data,
that that means you will then be able to do well on
your supervised task even when trained on pretty little data.
Um, so, here this is error rate,
so low is good.
And here's the number of
training examples, which is being shown on a log scale.
And so the blue line is if you're just training
a text classifier from scratch on supervised data.
So, you need a lot of data to start to do pretty well.
Um, but if you're making use of this transfer learning, um,
from a pre-trained language model,
you can get to where you're sort of doing pretty
well with way fewer, um, training examples.
Essentially, an order of magnitude
fewer training examples will give you the same amount of performance.
And the difference between these two lines corresponds to the extra,
um, phase that they had in the middle, um, which is
whether you're doing this sort of extra fine-tuning on your target domain,
um, as part of your process, and they found that to be pretty helpful.
Okay. So, that, um, is another precursor.
Um, and so, one big part of what has happened since then,
is effectively people said this is a good idea, uh,
maybe it'll become a really really good idea if we just make things way bigger.
Um, so, ULMfit, um,
was something that you could train in one GPU day,
sounds appealing for CS224N final projects,
remember that, um. But well,
then the people at OpenAI decided, well,
we could build a pre-trained language model and train it on
a much larger amount of data with a much larger amount of compute,
and use about 242 GPU days, and that will get a lot better, and it did.
Um, and then the people at Google said,
well, we could train a model, um,
for 256 TPU days,
which means maybe about double the amount of computation,
it's hard to figure out exactly,
and that might be able to do exciting things,
and that was the BERT model, and it did.
Um, and then if you're following along these things, um,
just last week, um,
the OpenAI people said,
well we can go much bigger again and we can train a model, um,
for approximately 2,000 TPU version three days.
Um, and it will be able to,
um, do much better again,
um, and so, this is the
GPT-2 language model, um,
which OpenAI released last week.
Um, and they're, they're actually very impressive results, um,
when they're showing that if you're sort of building a really,
really huge language model over a very large amount of data.
And then you say language model go off and generate some text,
on this particular topic,
that it can actually just do a great job of producing text.
So, the way this was being done
was a human is writing a couple of sentences:
"In a shocking finding,
scientists discovered a herd of unicorns
living in a remote, previously unexplored valley in the Andes Mountains."
Um, and so, we then,
using our neural language model and chugging through that,
so that gives us context,
then say generate more text,
and it starts to generate: "The scientist
named the population, after their distinctive horn,
Ovid's Unicorn. These four-horned,
silver-white unicorns were previously unknown to science."
Um, it produces remarkably,
um, good text or at least in the,
in the hand-picked examples [LAUGHTER] that they showed in the tech news,
um, it produces extremely good text.
Um, yeah so, I think one should be a little bit cautious about, um,
that, and some of its random outputs actually
aren't nearly as good, but nevertheless, you know,
I think it is actually dramatic
how good language models are becoming once you are training
them, with long contexts as we can do with modern models, on vast amounts of data, um-.
So then, um, the OpenAI people decided
this language model was so good that they weren't gonna release it to the world, um,
which then got transformed into headlines of,
Elon Musk's OpenAI builds artificial intelligence so powerful,
it must be kept locked up for the good of humanity.
[LAUGHTER] Um, with the suitable pictures that always turn up at
these moments down the bottom of the screen, um, and,
um, yeah I guess that was leading even Elon Musk to want to clarify and say
that it's not actually really the case that he's directing what's happening at OpenAI anymore.
Um, anyway, moving right along.
Um, so, part of the story here is
just a scaling thing that these things have been getting bigger and bigger,
um, but the other part of the story is that all three of
these are then systems that use the transformer architecture.
And transformer architectures have not only been very powerful,
but technically have allowed scaling to much bigger sizes.
So to understand some of the rest of these, um,
we should learn more about transformers.
And so, I'm sort of gonna do that, um,
but, um, somewhat out of order, um,
since our invited speaker coming Thursday, uh, is, um,
one of the authors of the transformer paper,
and he's gonna talk about transformers.
So I think what I'm gonna do is, um,
say a little bit about transformers quickly,
but not really dwell on all the details, um,
but hope that it's a bit of an introduction,
and you can find out more on Thursday about the details and
then talk some more about the BERT model before finishing.
So the motivation for transformers is essentially
we want things to go faster so we can build bigger models,
and the problem as we mentioned for these, um,
LSTMs, or in general any of the recurrent models, is the fact that they're recurrent.
You have to generate sort of one state at a time, chugging through,
and that means you just can't do the same kind of parallel computation, um,
that GPUs love, that you can do in things like convolutional neural networks.
But, you know, on the other hand,
we discovered that even though, um,
these gated recurrent units like LSTMs and GRUs are great,
to get really great performance out of these recurrent models,
we found that we had a problem with these long sequence lengths,
and we could improve things by adding attention mechanisms.
And so that led to the idea of- well,
since attention works so great,
maybe we can just use attention,
and we can actually get rid of the recurrent part of the model [NOISE] altogether.
And so that actually then leads to the idea of these transformer architectures,
and the original paper on this is actually called "Attention Is All You Need",
which reflects this idea of we're gonna keep the attention part,
and we're getting- going to get rid of the, um,
recurrent part, and we'll be able to build a great model.
So in the initial work,
what they're doing is machine translation kind of like
the Neural Machine Translation with attention we described,
but what they're wanting to do is build
a complex encoder and a complex decoder that works non-recurrently,
and, um, nevertheless is able to translate sentences
well by making use of lots of attention distributions.
And so, I wanted to say a little bit more quickly about that,
and hopefully we'll get more of this on Thursday.
Um, first, as a recommended resource,
if you wanna go, um,
home and learn more about, um,
the transformer architecture, there's this really great, um,
bit of work by Sasha Rush called The Annotated Transformer that goes through
the entire transformer paper accompanied by PyTorch code in a Jupyter Notebook,
and so that can actually be a really useful thing,
but I'll go through a little bit of the basics now of how we do things.
So the basic idea, um,
is that they're going to use attention everywhere to calculate things.
And, um, we talked before about the different kinds of
attention, the sort of multiplicative bilinear attention and the little,
um, feed-forward network additive attention.
They kind of go for the simplest kind of attention,
where the attention is just dot-products between two things.
Um, but for various purposes,
they do a more complicated version of dot-product attention, where the
things that they're looking up are
assumed to be key-value pairs, keys and values,
and so you're calculating the similarity as a dot-product between a query and the key,
and then based on that,
you're going to be using the vector for the corresponding value.
So our equation here for what we're calculating is: you're
looking, using the softmax over query-key
similarities, and using that to give
the weightings, as an attention-based weighting over the corresponding values.
Um, so that's the basic attention model.
Um, so saying it that way, um,
adds a little bit of complexity,
but for the simplest part, their encoder,
actually, all of the queries, keys, and values are exactly the same.
They are the words, um,
of their source language, um, sentence.
So, it sort of adds some complexity that isn't really there.
Um, okay. Um, I'll skip that.
Um, so, there are a couple of other things that they do.
One thing that they note is that, um,
the values you get from, um, the dot products q·k, um,
vary in variance as the dimension gets large,
so they do some normalization by the square root of the hidden state dimension,
but I'll leave the details of that out as well, right.
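Putting those pieces together, the scaled dot-product attention just described looks roughly like this. This is a toy pure-Python sketch for a single query; real implementations batch all queries together as matrix multiplies:

```python
import math

def attention(query, keys, values, scale=True):
    """softmax(q·k / sqrt(d)) weighted sum of values, for one query (toy sketch)."""
    d = len(query)
    # dot-product similarity between the query and each key
    scores = [sum(qi * ki for qi, ki in zip(query, k)) for k in keys]
    if scale:
        # normalize by sqrt(dimension) so variance doesn't blow up as d grows
        scores = [s / math.sqrt(d) for s in scores]
    # softmax over the scores gives the attention weights
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # weighted average of the corresponding values
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

In the encoder case described next, `query`, `keys`, and `values` would all come from the same word vectors.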
So in the encoder, um,
everything is just our word vectors,
there are the queries, the keys, and the values.
Um, and we're gonna use attention everywhere in the system.
Oops. Okay. So the second new idea is, well,
attention is great but maybe it's bad if you only have one attention distribution,
because you're gonna only attend to things one way.
Maybe for various uses it would be great
if you could attend from one position to various things.
So, if you're thinking about syntax and what we did with dependency parsers.
If you're a word, you might want to attend to your headword,
but you might also wanna attend- attend to your dependent words.
And if you happen to be a pronoun,
you might want to attend to what the pronoun refers to.
You might want to have lots of attention.
So they introduced this idea of multi-head attention.
And so what you're doing with multi-head attention is you have,
um, your hidden states,
um, in your system,
and you map them via projection layers, um,
which are just multiplications by different W matrices as
linear projections into sort of different lower dimensional spaces,
and then you use each of those to calculate dot-product attention,
and so you can attend to different things at the same time.
And this multi-head attention was one of
the very successful ideas of transformers that made them a more powerful architecture.
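A toy sketch of the multi-head idea: each head gets its own linear projection, attends independently, and the per-head results are concatenated. The 2x2 projection matrices and the single shared projection per head are made-up simplifications; in the real model each head has separate learned W^Q, W^K, W^V matrices:

```python
import math

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def one_head(q, keys, vals):
    """Plain dot-product attention for a single query."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    ws = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(ws, vals)) for i in range(len(vals[0]))]

def multi_head(query, context, head_projs):
    """Each head projects into its own subspace, attends there,
    and the head outputs are concatenated (toy: one shared W per head)."""
    out = []
    for w in head_projs:
        q = matvec(w, query)
        kv = [matvec(w, c) for c in context]
        out.extend(one_head(q, kv, kv))
    return out
```

Because each head's projection is different, each head can learn to attend to a different kind of thing, headwords for one, dependents or coreferents for another.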
Okay. Um, so, then for our complete transformer block,
it's sort of then starting to build complex architectures like we sort of started seeing,
um, the other week.
Um, so- okay.
Yeah. So, starting,
um, from our word vectors,
we're kind of going to do attention to multiple different things,
um, and we're simultaneously gonna have
a residual connection that short-circuits around them.
Um, we're then going to sort of sum the two of these,
and then they're going to do a normalization at that point.
Um, I talked previously about batch normalization,
they don't do batch normalization,
they do another variant which is layer normalization,
which is a different way of doing normalization,
but I'll skip that for now.
And then, to finish one transformer block,
after the multi-head attention,
you put things through a feed-forward layer which also has a residual connection,
you sum the outputs of those,
and you then again do another, um, layer normalization.
So this is the basic transformer block that they're gonna use everywhere.
And to make their complete architectures,
they're then gonna sort of start stacking
these transformer blocks to produce a very deep network.
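The block just described, attention plus residual then layer norm, then feed-forward plus residual then layer norm, can be sketched like this. This is toy code: `attend` and `feed_forward` are placeholders standing in for the real multi-head attention and feed-forward sublayers, and it processes a single vector rather than a whole sequence:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (layer normalization)."""
    m = sum(x) / len(x)
    var = sum((xi - m) ** 2 for xi in x) / len(x)
    return [(xi - m) / math.sqrt(var + eps) for xi in x]

def transformer_block(x, attend, feed_forward):
    """One block: x -> LayerNorm(x + Attend(x)) -> LayerNorm(h + FFN(h))."""
    a = attend(x)
    # residual connection short-circuits around the attention sublayer
    h = layer_norm([xi + ai for xi, ai in zip(x, a)])
    f = feed_forward(h)
    # second residual connection around the feed-forward sublayer
    return layer_norm([hi + fi for hi, fi in zip(h, f)])
```

Stacking calls to `transformer_block` n times gives the deep network described next.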
And in some sense,
what has been found is that transformers performed very well.
But, you know, there's no free lunch.
You're now no longer getting
recurrent information actually being carried along a sequence.
You've got a word at some position which can be casting attention,
uh, on other words.
So if you'd like to have information carried along in a chain,
you've sort of first of all gotta walk the first step of the chain,
and then you need to have another layer
vertically which can walk the next step of the chain,
and then you need to have another layer vertically that walks the next step of the chain.
So, you're getting rid of the recurrence along the sequence,
but you're substituting some depth to allow things to walk along multiple hops.
But nevertheless, that's highly advantageous in GPU architectures
because it allows you to use parallelization to calculate everything at each,
um, depth at the same time. Um.
Maybe I'll go light on explaining this as well.
Um, so they use byte-pair encodings.
But if you do nothing else,
you just have words fed in as word vectors, and you have
no idea whether you're at the beginning of the sentence or at the end of the sentence.
So, they have a method of doing positional encoding, which gives
you some idea of what position your word has in the sentence.
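For reference, the sinusoidal positional encoding from the original transformer paper can be sketched as follows (a toy version computing the encoding vector for one position; real implementations precompute the whole table):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe
```

This vector is added to each word's embedding, so the same word at different positions gets a slightly different input representation.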
Okay. Um, so that's sort of the, um, encoder system.
So from the words,
they have an initial word embedding,
you add in their positional encoding,
you go into one of these transformer blocks,
and you then repeat it n times.
So you'll have a stack of these transformer blocks.
So you're multiple times doing, um,
multi-head attention to other parts of the sentence, calculating values,
feeding forward a value,
putting it through a fully-connected layer,
and then you just sort of repeat, do attention to different places in the sentence.
Get all your information,
put it through a fully connected layer,
and go up, um, proceeding up deeply.
And that sounds a little mysterious,
but it turns out to work just great.
And the way to think about it,
I think, is that at each stage,
you can look with your multi-headed attention at various other places in the sentence,
accumulate information, push it up to the next layer.
And if you do that sort of half a dozen times,
you can be starting to progressively push information along
the sequence in either direction to calculate values that are of interest.
Um, and the interesting thing is that these models turn out to work
really well at sort of learning to attend the interesting things in linguistic structure.
Um, so these are just sort of suggestive diagrams,
but this is looking at layer five of the transformer stack and
seeing what words are being attended to by different attention heads.
So these different colors correspond to different attention heads.
And so the sentence is,
um, "It is in this spirit
that a majority of American governments have passed new laws since
2009 making the registration or voting process more difficult."
And so what we see is sort of most of the attention heads,
uh, looking from "making" to "more difficult", and that seems to be useful.
One of the attention heads seems to be looking at the word itself, which might be okay.
Um, then the other ones are sort of looking a bit at "laws" and at "2009".
So it's sort of picking out the arguments, um,
and modifiers of "making" in a syntax kind of way.
Um, interestingly, for pronouns,
attention heads appear to learn to be able to look back to the referent.
So, in "The law will never be perfect,
but its application should be just", one attention head, for "its",
is looking at what "its" is modifying, the "application".
But in another attention head,
the "its" is looking strongly at what "its" refers back to, "the law".
So that seems kind of cool.
Um, yeah.
Um, okay.
And so then, for the rest of the model, um,
there's then some more complexity for how to use
the transformers decoder to give you a full neural machine translation system.
But I think maybe I will skip that and go
on and say a bit about BERT in my remaining minutes.
Okay. So, um, the latest and greatest contextual
word representations to help you with your tasks have been these BERT vectors,
where BERT is Bidirectional Encoder Representations from Transformers.
And so essentially, it's using the encoder from a transformer network.
Uh, this deep multi-headed attention stack to calculate, um,
a representation of a sentence and saying,
"That's a great all-purpose representation of a sentence that you can use for tasks.
Be it named entity recognition or SQuAD question answering."
And so there's actually an interesting new idea that these people had.
And their idea was, well, standard language models are
unidirectional, and that's useful
because it gives you a proper probability distribution, a language model.
But it's bad because you'd like to be able to do
prediction from both sides to understand word meaning in context.
There's a second choice, um,
which is you can kind of do bidirectional models when you incorporate,
um, information in both ways.
But that sort of has problems as well,
because then you get crosstalk.
Um, and so if you run a BiLSTM,
and then you merge the representations by
concatenation and then feed them into the next layer.
When you're running the next layer,
the forward LSTM will have already gotten
information about the future from the first layer.
Um, so it sort of, um,
ends up with words that have already seen the future themselves.
So you have this sort of complex non-generative model.
Um, so somehow, they wanted to do things a bit differently,
so they can have bidirectional context without words being able to see themselves.
And the idea that they came up with is well,
we're gonna train things with a transformer encoder.
But what we're gonna do is mask out some of the words in the sentence,
like, maybe we'll mask out "store" and "gallon" here.
And then, so our language modeling-like
objective will no longer be
a true language model that's sort of generating a probability of a sentence,
um, which is standardly done by working from left to right,
but it will instead be a Mad Libs-style fill-in-the-blank objective.
So you'll see this context,
which will be literally,
"The man went to the [MASK] to buy a [MASK] of milk."
And your training objective is to
try and predict what the first masked word is,
which you can do with a cross-entropy loss to the extent that you don't guess "store".
And then, it will be trying to guess what the second word is, and you want it to guess "gallon".
So you're training a model,
um, to fill in these blanks.
Um, and the rate at which they blank words is essentially one word in seven,
and they discuss how this is a trade-off.
Because if you blank too few words,
it gets very expensive to train.
And if you blank many words,
well you've blanked out most of the context of a word,
and that means it's not very useful for training,
and they found about sort of one in seven seemed to work pretty well for them.
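The data preparation for this masked objective can be sketched as follows. This is a simplified toy: it just swaps chosen tokens for [MASK], whereas real BERT replaces 80% of the chosen tokens with [MASK], keeps 10% unchanged, and swaps 10% for random words:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Toy masked-LM data prep: replace roughly one token in seven with
    [MASK], recording the originals as the prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok   # the model is trained to recover this word
        else:
            masked.append(tok)
    return masked, targets
```

The cross-entropy loss is then computed only at the positions stored in `targets`, which is the expense trade-off mentioned above: fewer masks means fewer positions contribute to each training step.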
But what they want to argue is, um,
that OpenAI's GPT,
which is also a transformer model,
is a sort of classic language model working from
left to right, and so you only get left context.
Um, for the ELMo language model that's shown up at the top,
well, they're running a left-to-right language model and they're running,
um, a right-to-left language model.
So in some sense, um,
they have context from both sides.
But these two language models are trained completely independently
and then you're just sort of concatenating their representations, um, together.
So there's no sense in which we're actually kind of having a model that's jointly
using context from both sides at the time that the pre-trained,
um, contextual word representations are built.
So their hope is that using, inside a transformer model,
this trick of blanking out words
and predicting them using the entire context will allow them to use two-sided context,
and be much more effective.
And that's what they seem to show, um.
There's one other complication and,
I mean, as I'll show later,
um, this last complication is a bit useful,
but it's sort of not really essential to their main idea.
They thought,
one of the, one of the goals in their head was clearly to be able to
have this be useful for things like question answering,
um, tasks, or, um,
natural language inference tasks,
and the relationships between, um, two sentences.
So, their idea was, well,
one good objective is this fill in the blank word objective which is,
sort of, like language modeling objective.
But they thought it would be useful to have a second objective
where you're predicting relationships between sentences.
So, they secondly have a loss function which is, um,
let's have two sentences where
the sentences might be two successive sentences in the text,
or a sentence followed by a random sentence from somewhere else.
And we want to train the system to predict when you're
seeing a correct next sentence versus a random sentence.
And so you're also training a loss based on this next sentence prediction task.
And so it'll be something like: The man went to the store.
He bought a gallon of milk.
You're meant to predict true is the next sentence,
um: The man went to the store.
Penguins are flightless.
You're meant to say false.
This isn't the next sentence.
And so they're simultaneously also,
um, training with this representation.
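Generating training pairs for this next-sentence objective can be sketched like this. The 50/50 true-versus-random split follows the description above; everything else (names, the toy sentence list) is made up for illustration:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build next-sentence-prediction examples: (sent_a, sent_b, is_next).
    Half the time sent_b is the true next sentence from the text,
    half the time it is a random sentence from elsewhere."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            pairs.append((sentences[i], rng.choice(sentences), False))
    return pairs
```

The model is then trained with a binary loss to recover the `is_next` label from the pair, alongside the masked-word loss.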
So, what they end up with looks like this.
Um, so, they have,
um, for the input,
they'll have a pair of sentences.
My dog is cute.
Um, separator.
He likes playing.
Um, the words are represented as word pieces like we talked about last week.
Um, so there's a token embedding for each word piece.
Um, then there's a positional embedding for
each word piece which is gonna be summed with the token embedding.
And then finally, there's a segment embedding for each word piece which is simply
whether it comes from the first sentence or
the second sentence before or after the separator.
So, you're summing those three things together to get the token representations.
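That summing of the three embeddings is literally just elementwise addition, one vector per word piece (a toy sketch with made-up tiny vectors; the real embeddings are learned and much higher-dimensional):

```python
def bert_input_vector(token_emb, position_emb, segment_emb):
    """BERT's input for one word piece is the elementwise sum of its
    token, position, and segment embeddings."""
    return [t + p + s for t, p, s in zip(token_emb, position_emb, segment_emb)]
```

So the transformer only ever sees one combined vector per position, with the identity of the word, where it sits, and which sentence it belongs to all folded in.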
And then you're going to use those in a transformer model
where you will have losses to the extent that you can't predict the masked words.
And then your binary prediction function as to whether there's
a correct next sentence or not which is the training architecture.
Okay. So, it's a transformer as before,
it's trained on Wikipedia plus the BookCorpus.
And they built two models.
Um, the Base-BERT model was a twelve layer transformer.
And so this corresponded to what the previous transformer paper had used, right?
Those two-layer transformer blocks repeated six times gave you 12 layers, with, um,
768-dimensional hidden states and 12 heads for the multi-head attention.
And then they went bigger,
um, and trained BERT-Large which is,
sort of, double the number of layers,
bigger hidden states, even more attention heads.
Um, and training these on,
um, pods of TPUs.
Um, so, first of all, you're training, um,
on this basis to predict masked words and,
um, next sentence or not.
Um, so then what they wanted to say was: this pre-trained model,
um, trained on these losses, the masked language model and next sentence prediction,
um, we could take this model,
pre-trained, and it would be incredibly useful for various different tasks.
We could use it for named entity recognition,
question answering, natural language inference et cetera.
And the way we're going to do it, is kind of,
doing the same thing as the ULMFit model did.
We're not just going to say here's our,
here's a contextual word representation like ELMo did.
Instead, what we're gonna say is just keep on using this,
keep on using this um,
transformer network that we trained as a, sort of,
language model, but fine tune it for a particular task.
So, you're now going to run this transformer
calculating representations for a particular task.
And what we're going to change is, we're going to remove the very top-level prediction
bits, the ones that predict the masked language model and next sentence prediction,
and we're going to substitute,
on top, um, a final prediction layer that's appropriate for the task.
So, if our task is SQuAD question answering,
our final prediction layer will be predicting start of span and end of span,
kind of, like when we saw DrQA a couple of weeks ago.
If what we're doing is the NER task,
our final prediction layer will be predicting
the net- named entity recognition class of each token just like a standard NER system.
Okay, um, and so they built this system and tested it on a whole bunch of data sets.
Um, one of the main things they tested on was
this GLUE data set which has a whole bunch of tasks.
A lot of the tasks, they're,
uh, natural language inference tasks.
And I've kept saying that phrase all of this lecture but I haven't really defined it.
So, with natural language inference, you're given a premise
like: "Hills and mountains are especially sanctified in Jainism."
And then a hypothesis: "Jainism hates nature."
And what you're meant to say is,
whether the hypothesis, um,
follows from the premise,
contradicts the premise, or has no relation to the premise.
So, that's a three-way classification.
And so here it contradicts the premise.
Um, there are various other tasks such as this linguistic acceptability task.
Um, but if we look at these, um, GLUE tasks.
Um, these are showing the Pre-OpenAI State Of The Art.
How well, um, ELMo works.
How well OpenAI GPT works,
and then how well do small and large BERT models work.
And effectively, what you're finding is,
um, that the OpenAI GPT was,
you know, pretty good.
It showed actually good advances on most of these tasks.
For many, but not all of them that broke the previous state of the art,
showing the power of these contextual language models.
But the bidirectional form of BERT's prediction just seemed much better again.
So, going from this line to this line you're getting, depending on
the task, about two percent better performance.
And so the BERT people actually did their experiments carefully.
So, these models are pretty comparable in terms of size,
but the bidirectional context seems to really help.
And then what they found was,
well, by going to just a bigger model,
again, you could get another big lift in performance.
And so for many of the tasks you're getting
about another two percent lift in performance going to the bigger model.
So, this really produced super-strong results.
And in general, um, people have found,
um, that BERT continues to give super strong results.
So, if I return to my CoNLL NER task,
we had ELMo giving you 92.2,
and you, sort of,
continue to get gains.
So, BERT Base gets you to 92.4 and BERT Large takes you to 92.8.
Though, in truth,
there is now a system that beats BERT Large on NER, which is actually a character-level
contextual language model from Flair.
Um, but, you know,
this continued over to a lot of other things.
So, on SQuAD 1.1,
BERT immediately just outperformed
everything else that people had been working on for SQuAD for ages.
In particular, what was especially dramatic
was that a single BERT model
beat everything else that had been done previously on SQuAD version 1.1,
even though they could also show that an
ensemble of BERT models could give further performance gains.
And as I've mentioned before,
if you look at the SQuAD 2.0
leaderboard, essentially all of the top-ranked systems
are using BERT in one place or another.
And so that,
sort of, led into this
new world order:
it seems like the state of NLP now is that,
if you want the best performance,
you want to be using
these deep pre-trained transformer stacks.
And so this is, sort of, making,
um, NLP more like vision.
Because really vision for five years has had
these deep pre-trained neural network stacks, um, like ResNets.
Where for most vision tasks what you do is you take a pre-trained ResNet,
and then you fine-tune a layer at the top to
do some classification task you're interested in.
And this is, sort of, now, um,
starting to be what's happening in NLP as well.
That you can do the same thing by downloading
your pre-trained BERT and fine-tuning it to do some particular task.
Okay, that's it for today, and more on
transformers on Thursday.
