So today, we're very pleased to have as our second, um,
invited speaker, Richard Socher,
he is the chief scientist at Salesforce.
Um, Richard actually also has a lot more connection to this class,
um, because, um, for several years, um,
Richard was involved either as instructor or, um,
co-instructor in teaching this material at Stanford,
um, so he sort of knows the course, um, pretty well.
Um, and so today,
he's going to be talking about some of the challenges and recent work
in doing multitask learning in natural language processing. So welcome, Richard.
Thank you. Hello, everybody. I'm excited to be here.
Uh, yeah, I want to talk to you today about what we,
in short, called decaNLP.
I want to first give a big shout out to Bryan McCann.
He's the first author of this, uh, paper,
and I've pitched this idea to a lot of people in the last, like,
three to four years,
and most people were like,
"This is too much pre-processing because you're trying to
do 10 different tasks in one model."
That's sort of where the decathlon, uh,
wording comes in, uh, but he,
he really stuck to it, uh,
did all the pre-processing and all the things that you now know like tokenization,
and it turns out a lot of different data sets,
have a different conception of what a word is.
This wasn't two words,
uh, or one word,
and things like that, and that changes how you
write all your evaluation scripts and all of that.
So Bryan, uh, is,
is a really phenomenal researcher,
uh, with us in the group,
and Nitish has helped us a lot on the optimization side of this,
uh, and then Caiming Xiong,
the Director of Research, has done a lot of, uh,
really phenomenal work that's kind of helpful in pretty much all our projects.
So I'm going to tell you a couple of different, uh,
lines of reasoning that led us to,
uh, this idea of multitask learning.
And the first one was sort of trying to take a step back and looking at the field,
and I know this is not that much of a historical class, but basically pre-2010,
most natural language processing had kind of these very hand-designed features,
and we basically just had,
uh, machine learning kind of learned weights,
uh, in the optimization procedure for these human-designed features.
And so in 2010, Chris and I and others sort of started to work in deep learning for feature learning.
So everything was a word vector and now,
we can back-propagate into them and actually learn those representations.
And I think currently,
we're kind of in a state where we do a lot of
deep architecture engineering for specific tasks,
and you've seen this already.
You have like an NER model,
you have a question answering model,
you have a translation model,
and basically what
each of these communities has at least, uh,
converged on is probably some kind of neural network,
but there's still a lot of different kinds of architectures of
these neural networks that you're working on for each different task.
And so the question is like, okay,
we're gonna probably do that for
another couple of years because we're making good progress,
but what's sort of next,
uh, on the research side?
And what I actually love about this class so much is that
you go from like maybe not knowing much about NLP at
all to you can basically understand
the state-of-the-art research papers as they come out now,
uh, and this, this is one of those.
Uh, so [NOISE] why,
why not continue to work in this multitask regime?
In some ways, I feel like, uh,
the community is a little bit, uh,
like this cute dog, where we, kind of,
randomly restart, uh, after every project.
And it's kind of clear to me that if you have a lot of training data, uh,
and you define a specific data set and task on that data set,
you start to architecture engineer in your model to hill-climb on a particular metric,
or leaderboard, or publications,
or products, or whatever it is, uh,
then as long as your data set has
roughly a good representative set of
1,000 times the number of output classes that you have,
you'll probably get it into a regi- regime where you're in the 80 to 90 percent accuracy,
or F1, where you're basically doing pretty okay.
And of course, now when you look at trends on ImageNet,
you have 1,000 different classes in computer vision,
1,000 different classes, each has 1,000 images.
So if you have roughly a million images, you do pretty well.
And in machine translation, ideally,
you know, you have many more, you have like hundreds of thousands of words,
so you want many millions of examples of each of the,
uh, words in their context.
And of course, you know, the caveat is
machine translation doesn't work to the level of humans,
but it works well enough to have it at least in products,
and even the best human translators use it as sort of a pre-translation and then,
uh, sort of, clean it up.
And so it's also clear to me that in this regime,
and if we want to get to, sort of,
more general AI features, uh,
we need to have some kind of more continuous learning of a single model.
Because if we keep restarting at every project,
we're never going to get to a single model that, kind of,
encompasses more and more of the complexity of natural language.
And, uh, when I say we start from random,
you of course know that that's not quite true
because we do have some things that we pre-train,
namely word vectors, and in computer vision,
we have even more things.
And so in some ways that is, ah,
an aspiring ideal for NLP,
because in computer vision, you would be, kind of,
crazy to not use some kind of
convolutional neural network that has been pre-trained on some kind of
task like ImageNet when you start with your project and
try to classify objects or do object detection and a lot of other things.
And in some ways the whole community could get behind that very quickly,
because I mean, you know, once it worked, uh,
reasonably well, because there was a, sort of,
single blocking task in computer vision.
If you can't even tell apart a dog from a cat from a house,
it doesn't really make sense to think of even larger, uh, vision projects.
And in NLP, we've had a lot of success with word vectors,
you know a lot of those now,
and it started for, sort of, just a small, uh,
window-based approach or Word2Vec and GloVe, uh,
then we had, uh, context vectors that were trained, uh,
on machine translation, but basically,
instead of just having a single set of words,
we actually pre-trained some of the LSTMs that came on top of those word vectors,
and, uh, the way we train that, uh,
was also actually Bryan McCann's paper on
contextual vectors with machine translation and then ELMo,
kind of, replaced machine translation with, uh, language modeling,
which of course is even better because there's even more training data,
and it still tells you a lot, uh,
and kind of captures in some ways a more complex version of
distributional sort of hypotheses that we had in simpler word vectors,
and BERT, not quite a language model but also, kind of,
trying to predict words in their context, uh,
but pre-training a lot more layers and a lot deeper networks.
And so we see the success of pre-training a certain set of weights.
And so the question is,
why not try to pre-train the entire model?
As in including your output,
your softmax, your pointer mechanisms and everything,
and then just taking a completely pre-trained model and trying to do something,
and that is, kind of, the goal that we have.
And so, uh, we, sort of,
ask ourselves why hasn't this happened?
Why are we, you know,
the first to think about, like,
trying to pre-train the entirety of the model,
the encoders, and decoders,
and outputs, and everything.
Uh, and I think part of it is that NLP requires a lot of different kinds of reasoning.
You've seen many of them already.
You have some logical reasoning like 550 people in this room,
25 leave, are there still people in the room,
and you logically can answer that question,
and you have lots of different kinds of linguistic and emotional reasoning,
sentiment analysis, you know,
this is a typical Nicolas Cage movie and then you need to know that that's a
probably negative review unless you like Nicolas Cage movies.
Um, no judgment. And, uh,
you know, visual types of reasoning and so on.
And so I think partly because of that complexity, in the beginning the field
didn't really make much progress, and then it kind of separated,
and I think in some cases kind of artificially separated, into all these separate tasks,
like you have named entity recognition,
part of speech tagging, and semantic role labeling and, and so on.
And, and in some ways- and it sounds kind of snarky but,
you know, it made a lot of sense at the time,
and it allowed us to make a lot of progress in the community,
but basically we started chasing these benchmarks,
and all these different communities, kind of,
started going off in their own ways.
And we even have some communities that say,
"We do general question answering,"
and there are literally workshops on general question answering, and when I asked,
uh, the organizers, "Can I ask your model what the sentiment is of this tweet?"
They're like, "No, that's sentiment analysis. Go to that different workshop.
It's down, down the hall."
But I'm like, "That's a- that's a question.
Why can't you answer it in the general question answering workshop?"
Um, and so a lot of people then say,
"Well, if you want to work on more general stuff,
it has to be an unsupervised, kind of,
task and the, the future will not be supervised."
I don't think NLP will be completely unsupervised,
and we won't solve it, uh, completely unsupervised,
because in the end, language has a lot of supervision for people,
uh, and, uh, I think for, for systems also.
Uh, and you won't, you know,
if you have- there's a child and it's in a jungle,
it will probably develop a pretty good visual cortex by itself,
but it won't develop language by itself.
And then- and then also, like,
I think if you just allow AIs to talk to one another,
it makes very little sense for them to try to come up with as
inefficient of a communication protocol as humans have with, you know,
sequential processing of language because algorithms and computers could,
if there's no supervision of human language,
they could just communicate in much more efficient ways with one another.
So I think it's fairly clear,
we need a lot of supervision, uh, in NLP.
And so basically, all of this has led us, uh,
to trying to think about a unified multitask model for a lot of different NLP tasks.
By the way, if you have any questions, just raise your hand.
Okay, let's make this very interactive.
Um, basically, we want this unified model, uh,
to decide how to transfer knowledge,
uh, and not have it, sort of, be manually assigned.
Like in most cases,
when you assign your project you say, "Oh,
well I know that named entity recognition and part-of-speech tagging help each other.
Because once you know something is a noun,
then it's more likely that it's also a named entity."
And in this case, we want to basically allow for the single unified model
to know itself how to do domain adaptation and wha- how to share the weights,
and that will hopefully then lead to a lot of,
uh, transfer learning and zero shot learning capabilities.
I also think that if we get to this, sort of,
hard goal of having a single fa- single unified multitask model,
then we'll be able to more easily adapt it to
new tasks and we'll be also able to deploy it in production more quickly.
If nowadays you want to build
a little squirrel detector and connect it to your sprinkler system,
you can just download some off-the-shelf software,
and it will basically, kind of, work.
That is not the case if you try to do
a pretty complex language project where you
want to translate into some completely new language or,
you know, analyze some website and then do something else afterwards.
So, uh, you also,
when you actually try to deploy and use these kinds of tools in companies,
you'll realize that there are a lot of different kinds of groups.
There's the search group,
and the chatbot team,
and the translation team,
and, uh, and the social sentiment analysis team,
and they all use different models,
and they all deploy different models,
and they all have to build a lot of overhead into
the core of the- or around that core of an AI model.
So basically, um, lastly,
it was, sort of, what we had with, with this dog.
I think that once we have this unified model,
it will also be a first step to being able to
then continually learn this and just have a single model that just
gets better and better over time and starts
to capture more and more of the complexity of language.
All right, any questions around, sort of,
the motivation high level?
All right. So then, uh,
it's sort of the question, how do we actually make that happen?
And then we -- I first sort of sat down and looked at, like,
the general sort of formats of all the tasks that you may experience in
this class and that NLP sort of has as a field in
general and I think they can broadly
be classified into these three different categories.
Sequence tagging, you already know.
Things like NER or aspect-specific sentiment, where in
a specific context we want to classify if a word is positive or negative.
Uh, and then text classification,
just a single label for the entire piece of text
and then sequence-to-sequence. A lot of different, you know,
problems fall into that and I actually personally love, uh,
these three particular tasks: machine translation, summarization, question answering.
Because they are immediately useful that you don't have to explain to somebody,
"Oh, but why do you need the semantic role labeller or parser? "
If you're a layman and you, you know,
on the Internet you understand immediately why it's
useful to do summarization, question answering,
or translation and an improvement in
those tasks kind of immediately translates in- into better products,
uh, and people being able to communicate better and more efficiently with language.
So, that, uh, kind of analysis led us to think,
uh, about these what I call three equivalent supertasks of NLP.
Uh, and basically they are
language modeling, question answering and dialogue systems.
Uh, language modeling, basically trying to predict the next word,
you've already worked on that.
Uh, and usually it's only used to rescore or basically to pre-train these days.
But really if you ask me a question and then you try to predict the next couple of words,
then that is also language modeling
and if you're able to predict the next couple of words after a question, like,
what were the named entities in the sentence and then you just generate, you know,
Dresden was a location,
Richard was a person and whatnot.
Uh, then you can kind of cast almost all of these tasks into language modeling.
Uh, similarly question answering,
you can ask any kind of question,
what is the translation,
what's the summary, uh, and so on,
and then with dialogue right now it's kind of tricky because there are
no really good dialogue datasets out there and a lot of times you want some interaction,
you have to run user studies and most of the existing NLP tasks would
basically be pretty short one-step dialogues like what are the named entity tags,
and you give them and that's it.
So it's a little bit overkill and because of that we basically converged,
uh, on question answering as our main formalism.
And here is now an overview of the 10 different tasks that we have,
uh, and we cast all of them as question answering.
These are literally the tr- the training,
uh, the format of the training dataset, uh,
and eventually also the way we formulate
the test set and you'll see basically for every single task,
you have a context as some kind of document.
It could be a Wikipedia article,
it could be a tweet, it could be a longer document,
whatever, and you ask a question about it and you want to generate an answer.
And I'm actually -- I'm curious if you can think of any task in NLP
that couldn't be formulated in this kind of structure.
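As a rough illustration of the format just described, every example can be held as a (context, question, answer) triple. The sketch below is not the project's actual data loader: the class name `QAExample`, the helper `to_input_and_target`, and the example sentences are all invented here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QAExample:
    """One example in the shared question-answering format (hypothetical class)."""
    context: str   # document, sentence, tweet, or dialogue so far
    question: str  # natural-language description of what to do with the context
    answer: str    # text the model should generate


# Illustrative examples only; these sentences are made up for this sketch.
EXAMPLES: List[QAExample] = [
    QAExample(
        context="Good morning.",
        question="What is the translation from English into German?",
        answer="Guten Morgen.",
    ),
    QAExample(
        context=("Heavy rain flooded several streets in the city on Monday, "
                 "and officials closed two schools as a precaution."),
        question="What is the summary?",
        answer="Flooding closes two city schools.",
    ),
    QAExample(
        context="The movie was a delight from start to finish.",
        question="Is this sentence positive or negative?",
        answer="positive",
    ),
]


def to_input_and_target(example: QAExample) -> Tuple[Tuple[str, str], str]:
    """Split an example into the model input (context, question) and its target."""
    return (example.context, example.question), example.answer


for ex in EXAMPLES:
    (context, question), target = to_input_and_target(ex)
    print(f"Q: {question}\nC: {context}\nA: {target}\n")
```

Note that in the sentiment example the answer is literally one of the words in the question, which is the property the talk relies on later for zero-shot classification.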
Uh, so, let's go over some of these.
Uh, the first one is sort of the standard,
uh, task that all- you're all familiar with now.
The SQuAD, Stanford Question Answering Dataset.
Uh, where the answer is essentially a phrase somewhere in the context.
But then, uh, the second one is something that you would never see in most,
uh, generalized, uh, question answering workshops and that is, uh,
having a context of the single sentence asking what is the translation from
English into German and the output is again a sequence of words but in this case,
and we color them differently here.
Uh, this is blue because all these words are basically not in the context and not in
the question and we will just generate them
with a standard softmax to basically answer this question.
We can also ask what is the summary, and you can see that for those
two it is in some ways artificial to make them into a natural language question.
You could just say translate or summarize, and this is just like
one kind of task token in your network. But actually, for half of these tasks
it makes sense, because the question is also different for every example.
So this one here is natural language inference, NLI, uh,
which you covered also, where we want to ask whether two sentences entail each other,
contradict each other or there's some neutral relationship between them.
You've seen a lot of sentiment.
And this here is kind of important.
We actually asked is this sentence positive or negative versus just what is the sentiment
and what- why that is important is that you see here in green,
this answer here actually comes from
a word in the question and if we formulate it that way,
we can eventually do zero-shot learning where we ask a new question that was
never asked before for a new set of labels and magically, in some cases,
it still actually works and we'll, you know,
ask que- we can ask questions like is this story happy or sad and it will still
give us an answer even though we've never given
it a training dataset of a bunch of happy and sad stories.
So, it's kind of zero-shot classification that you get to in
some cases if you formulate your questions in a way
that the answer is a word in the question.
Then we have semantic role labeling here.
So what has something experienced, kind of a random weird question.
Then we have a zero-shot relation extraction who is
the illustrator of Cycle of the Werewolf,
we also have some dialogue state tracking.
What is the current state in- in a dialogue and the context just keeps on
growing with the dialogue and then we also have SQL:
WikiSQL, a translation task, but not translating into
another natural language, translating into a SQL database query.
It's actually a super-helpful task.
There's a, you know, a lot of data out there that is stored in databases.
If you can access it without having to ask
somebody who knows how to program SQL it will make
that data available to a lot more people so
they can analyze it and like business analytics and so on.
And then here, Winograd Schemas and anaphora resolution.
Uh, some people call this kind of common sense reasoning but it's kind of,
you know, mostly just anaphora resolution trying to understand in this context.
Uh, what -- who's, you know,
uh, the word like who had given help,
was it Susan or Joanne, and then based on this context,
you should be able to figure that out and again here,
the question is different for every single example. All right, yeah?
When you're testing it -- like when you ask,
is this sentence positive or negative,
does it sometimes, like, [inaudible]?
Great question. So, the question is when I ask,
is this sentence positive or negative will it sometimes eventually
accidentally switch to a different one of the tasks and, uh,
we actually have a slide on that and the answer is it's surprisingly good at
knowing how to go about doing the task and where to get the answer from.
Um, and yeah, that'll make more sense in a couple of slides once we go over the model.
Any other questions about,
uh, the question answering formalism?
Are you able to formulate text generation in the question answer format as well?
Like, tell me a story.
Good question. So can we do text generation, uh,
like tell me a story, uh,
from a random kind of -- or in this kind of formalism.
Uh, we don't have that as a task because largely it's really hard to evaluate.
It'll tell you some random stuff and then is that a good story or not,
is it grammatical, you have to come up with a lot of,
uh, sort of, uh,
evaluation metrics which we actually are doing for
some of the dialogue systems and in case of dialogue,
why does -- why are they equivalent because
the context can just keep on growing and every time, uh,
the user said something, uh,
you basically try to then predict the next answer in that dialogue.
And so I think you could very easily [NOISE] use this to generate texts.
Uh, you basically just ask -- tell it like what is, you know,
what's a good ending of the story and you maybe start the context with like
two or three words and then you ask the model to generate more and more words,
uh, in the form of this network I'll describe in a second. Yeah?
I was wondering like, uh, when you're training
it and you're trying to research like a new task.
Uh, does it like learn with less data?
That is an amazingly thoughtful question
and it's- it's so important we'll have a bunch of slides on it.
So maybe we'll- we'll go -- we'll continue and we'll get to that question, uh,
in a lot of detail because it's sort of why we're doing it and, the short answer is yes.
But we'll get to more details. All right.
So these are basically the 10 tasks.
Uh, and again this is the actual format for it.
So if you have a problem,
and you can cast it in this format, uh,
you can just take, uh, the open source code and run it and,
uh, it'll- it'll work.
And so when you kind of analyze and think about what we've done here.
In some ways, we've taken the task, which
usually is kind of in your head but is not given to the model.
The model is just given an input x and an output y in almost all of
the supervised systems and instead we're actually including the task in the inputs,
uh, in the set of inputs to the model. So you can kind of call this meta-supervised learning.
So again the question, uh,
is kind of our task definition for each of these different tasks.
The model has to figure out itself what to do when asked the question;
that way it can also figure out itself when to
transfer knowledge from these other tasks and y is again just the answer.
So, in some ways it's meta-supervised learning and I'm quite excited
because once you allow the task to be given to the model as input,
it can kind of decide itself how to go about
solving that particular task and now you can learn,
uh, a lot more powerful models.
So once we had the dataset,
we thought "Okay, how do we now solve this problem?"
The simplest way is you could just say, "Well,
I have a big if statement,
I have a classifier in the beginning and then I classify.
If this is a machine translation task,
then run my machine translation model."
And in general, in Python that would still be just like one big python,
uh, model with a bunch of if statements, right?
And that's not the goal because then we wouldn't get to any of
the transfer learning and zero-shot capabilities that we're hoping for.
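To make the rejected baseline concrete, here is a minimal sketch of the "big if statement" approach: a task classifier followed by a dispatch to separate per-task models. Everything in it is a placeholder invented for illustration (`classify_task`, `translate`, `summarize`, and the keyword rules); it is the design the talk argues against, not anything from the project.

```python
def translate(context: str) -> str:
    """Placeholder standing in for a separately trained translation model."""
    return "<translation of: " + context + ">"


def summarize(context: str) -> str:
    """Placeholder standing in for a separately trained summarization model."""
    return "<summary of: " + context + ">"


def classify_task(question: str) -> str:
    """Toy task classifier; keyword matching stands in for a learned classifier."""
    lowered = question.lower()
    if "translation" in lowered:
        return "translation"
    if "summary" in lowered:
        return "summarization"
    return "unknown"


def answer(context: str, question: str) -> str:
    """Dispatch to one isolated model per task: the 'big if statement'."""
    task = classify_task(question)
    if task == "translation":
        return translate(context)
    if task == "summarization":
        return summarize(context)
    # A question outside the hard-coded tasks has nowhere to go.
    raise ValueError("no model for question: " + question)


print(answer("Good morning.", "What is the translation from English into German?"))
try:
    answer("Once upon a time.", "Is this story happy or sad?")
except ValueError as err:
    print("unsupported:", err)
```

Because the per-task models share no parameters, nothing learned for one task can help another, and a new question such as "Is this story happy or sad?" simply fails. That is the gap a single shared model is meant to close.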
So [NOISE] we wanted the model
to have the capability to internally adjust
to these different tasks and make these decisions itself.
And basically, all of those considerations and all
of those thoughts led us, uh, to this model.
So before I go, uh,
into a little bit more detail.
I'll just like sort of give you the high-level overview.
Again, you start with the context.
Um, you start- you ask a question about, uh,
that context document, and then we're going to generate,
uh, the answer one word at a time by either pointing to the context,
and you've had pointers already, right?
Pointer networks, all that? Great. Um, pointing to a question word,
or choosing a word from an external vocabulary with your standard softmax classifier.
Uh, and we'll have a pointer switch mechanism that will kind
of choose how much to weight [NOISE] each of these three generation mechanisms.
So, uh, let's dig into a little bit into this model.
Fortunately, uh, in some ways it's kind of just taking the best, uh,
of the current, sort of, state-of-the-art techniques and putting them together in a way,
uh, that generalizes well enough.
Uh, you can look at all the code on decanlp.com,
[NOISE] it has like thousands of, uh,
stars and, uh, and forks and stuff combined, uh,
and you can, you know,
basically run everything, uh,
in this, uh, on these experiments with just one command.
It'll download all the datasets and everything and run everything,
you can really explore what it looks like but let's- let's
dive a little bit into the details of what this model told us.
In some ways again, it just kind of takes
all the best ingredients from deep learning [NOISE] NLP,
most of which you've already learned about and puts them together in a reasonable way.
So we start with fixed GloVe embeddings.
Eventually, we'll- we updated, uh,
the embeddings to CoVe embeddings, uh,
and probably it'll work even better if you update them to BERT embeddings.
Uh, but at some point we kind of have to move on and do other things.
Uh, but basically, you have a fixed set of word vectors,
and that is kind of important because in some of these,
uh, data sets, they're much smaller than others.
Uh, and as you know from SQuAD,
if you actually backpropagate into the word vectors,
you just do really, really well on your training dataset,
but then you won't generalize because most of the [NOISE]
uh, test documents will include words you've never seen before.
So if you change all the word vectors during training, uh,
it won't work very well at test time and won't generalize to unseen words.
So, uh, fixed GloVe embeddings,
if you don't have word vectors, uh,
for unseen words, we also have character n-gram embeddings.
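One way to picture the combination of fixed word vectors with character n-gram embeddings is sketched below. It is only an illustration under several assumptions that are not from the talk: the n-gram sizes, the `#` boundary marker, hashing n-grams into a small table with `zlib.crc32`, averaging the n-gram vectors, and using a zero vector for words missing from the fixed table.

```python
import zlib
from typing import Dict, List

Vector = List[float]


def char_ngrams(word: str, n_min: int = 2, n_max: int = 4) -> List[str]:
    """All character n-grams of the word, with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def ngram_vector(word: str, table: List[Vector]) -> Vector:
    """Average the table rows selected by hashing each n-gram (assumed scheme)."""
    grams = char_ngrams(word)
    dim = len(table[0])
    total = [0.0] * dim
    for gram in grams:
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        for d in range(dim):
            total[d] += row[d]
    return [value / len(grams) for value in total]


def embed(word: str, fixed_vectors: Dict[str, Vector], table: List[Vector]) -> Vector:
    """Concatenate the frozen word vector (zeros if unseen) with the n-gram vector."""
    word_dim = len(next(iter(fixed_vectors.values())))
    word_part = fixed_vectors.get(word, [0.0] * word_dim)
    return word_part + ngram_vector(word, table)


# Tiny made-up tables, just to exercise the functions.
fixed_vectors = {"cat": [1.0, 2.0, 3.0]}
ngram_table = [[float(i), float(i) + 0.5] for i in range(8)]

print(embed("cat", fixed_vectors, ngram_table))   # known word
print(embed("catz", fixed_vectors, ngram_table))  # unseen word still gets n-gram signal
```

The word-level table is never updated here, which mirrors the point about not backpropagating into the word vectors; an unseen word such as "catz" still receives a non-trivial representation through its character n-grams.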
Then we pipe them through a simple linear layer,
and then we have a shared, uh,
bidirectional LSTM with skip connections.
And so, uh, it's a deep- deep one so you skip to higher layers,
and it's shared between the context and the questions.
So they have basically the same [NOISE] set of weights.
[NOISE] Then, uh, we have a co-attention layer.
Uh, where we basically just have outer products, uh,
between all the hidden states of those two sequences,
and again, have skip connections, uh,
to circumvent, uh, those as well.
So now you have kind of context or question dependent, uh,
contextual representations [NOISE] or- or representations of that context.
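The co-attention step can be sketched in a few lines. This is a simplified, assumption-laden version rather than the model's actual layer: it scores every context state against every question state with a plain dot product, normalizes with a softmax in each direction, and returns attention-weighted summaries. Learned projections, the skip connections, and any further recombination are left out, and the function names are invented here.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Vector, vectors: List[Vector]) -> Vector:
    """Sum of the vectors, each scaled by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def coattention(context: List[Vector],
                question: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Return question-aware context summaries and context-aware question summaries."""
    # affinity[i][j] = dot product of context state i with question state j
    affinity = [[sum(c * q for c, q in zip(c_vec, q_vec)) for q_vec in question]
                for c_vec in context]
    # Each context position attends over the question states.
    context_side = [weighted_sum(softmax(row), question) for row in affinity]
    # Each question position attends over the context states.
    columns = [list(col) for col in zip(*affinity)]
    question_side = [weighted_sum(softmax(col), context) for col in columns]
    return context_side, question_side


# Toy hidden states (made-up numbers).
context_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_states = [[10.0, 0.0], [0.0, 10.0]]
context_side, question_side = coattention(context_states, question_states)
print(context_side)
print(question_side)
```

Each context position ends up with a summary of the question states most similar to it, and vice versa, which is what makes the resulting representations question-dependent.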
[NOISE] Uh, then we feed those into our transformer layers,
uh, and we actually tried to use transformers for all the things,
with having no LSTMs or any of that.
Uh, unfortunately, transformer layers were still, uh,
very, uh, finicky and very hard to optimize,
and there's a lot of trickery with- of the learning rates,
and we could just not get them to perform really well,
uh, on- on these 10 different tasks.
Uh, [NOISE] sometimes you had one transformer layer, one transformer network,
that worked really well in one task,
but the only other transformer network that worked well
on the second task had like half the layers.
And once you tried to have one network with the same number of layers,
it just wouldn't work on either of the two tasks anymore.
Uh, and so- so yeah, unfortunately as nice as they
are because they're nicely parallelizable on GPUs,
uh, they weren't yet robust enough,
uh, to- to be used for this.
[NOISE] So we have to have these LSTMs,
uh, before and after the transformer layers.
[NOISE] And then we essentially just have a standard sort of autoregressive, uh,
decoder where given the last state,
uh, we generate the next word.
And then we have these three pointer mechanisms.
Uh, they're very similar to the pointer ne- mechanisms you already know.
But now on top of these very contextualized representations, uh,
at the end of this encoder, uh,
and it basically learns to either point to question words,
context words based on the hidden states,
or have also a standard softmax,
and then we just basically have a weighted sum,
convex sum, of these three different distributions of output words.
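The convex sum of the three output distributions can be written out directly. In the sketch below the attention weights, the vocabulary distribution, and the switch values are plain numbers supplied by hand; in the real model they would come from learned layers. The function name `mix_distributions` and all the example values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence


def mix_distributions(vocab_probs: Dict[str, float],
                      context_tokens: Sequence[str],
                      context_attention: Sequence[float],
                      question_tokens: Sequence[str],
                      question_attention: Sequence[float],
                      switch: Sequence[float]) -> Dict[str, float]:
    """Combine softmax, context-pointer and question-pointer distributions.

    `switch` holds three non-negative weights that sum to one:
    (vocabulary, context pointer, question pointer).
    """
    if len(switch) != 3 or min(switch) < 0 or abs(sum(switch) - 1.0) > 1e-9:
        raise ValueError("switch must be three non-negative weights summing to 1")
    w_vocab, w_context, w_question = switch
    mixed: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_probs.items():
        mixed[word] += w_vocab * prob
    # Pointer mass lands on whatever token sits at the attended position.
    for token, attn in zip(context_tokens, context_attention):
        mixed[token] += w_context * attn
    for token, attn in zip(question_tokens, question_attention):
        mixed[token] += w_question * attn
    return dict(mixed)


mixed = mix_distributions(
    vocab_probs={"yes": 0.5, "no": 0.5},
    context_tokens=["a", "great", "film"],
    context_attention=[0.1, 0.2, 0.7],
    question_tokens=["positive", "or", "negative"],
    question_attention=[0.8, 0.1, 0.1],
    switch=(0.1, 0.1, 0.8),
)
best = max(mixed, key=mixed.get)
print(best, round(mixed[best], 2))
```

Because each component is itself a probability distribution and the switch weights sum to one, the mixture is again a valid distribution over words. With the switch favoring the question pointer, the most likely output is a word copied from the question.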
[NOISE] All right.
So I think these are mostly standard components that you've already seen,
uh, in all their details.
But if you have any questions,
um, about how we put it together? Yeah?
[NOISE] So the output- the output has to be a word.
That's right. The output has to be a word and it's always either a word from the context,
a word from the question or a word from the softmax.
[NOISE]
That's- the data preprocessing I guess it's different with each task.
So the data preprocessing is different for each task,
but we basically had to normalize everything to have
the same tokenization and- and all of that. [NOISE]
Uh, so do the double arrows in the encoding just represent there's a bidirectional?
Yeah.
Okay.
Yeah. But the double arrows,
uh, here are just bidirectional.
So left to right and right to left for the LSTMs. All right.
So what datasets, uh, are we using?
Uh, I mentioned that that was a big headache in the beginning.
Uh, we definitely wanted to include a lot of the sequence to
sequence tasks that we felt like are very,
um, sort of high level and I- immediately useful, uh,
and in some ways what this also shows you is that
nowadays you don't have to work as much on some of the intermediate representations,
uh, in NLP anymore.
Uh, you can just directly go for the end tasks that real users might care about,
and then have these end-to-end trainable systems,
uh, that really do quite well.
And, uh, I've myself worked a lot on parsing.
And so I don't wanna, you know,
say we- we don't need it.
There's certainly still tasks that you do need it for,
but it's kind of surprising that you can just go directly to translation or summarization
without having intermediate representations that
were sort of very specifically hand-designed.
Um, so we had those three really interesting, uh, and hard tasks.
Question answering, machine translation, summarization.
They actually also have the three biggest datasets,
uh, of all of these.
Uh, then we had NLI, and basically, um,
all of these, uh, 10 datasets [NOISE] were, uh,
publicly available, uh, and in several cases especially for translation,
you could actually find much larger, uh, translation datasets,
but we also tried to keep it, uh,
to a- to a size where normal people that don't work in gigantic companies with huge, uh,
GPU infrastructures could still run experiments, [NOISE] uh, themselves.
So universities and folks, uh, can still run it on.
Basically if you have just a single GPU,
it'll probably take about a week or so, uh,
to run an experiment.
If you have multiple GPUs on one large AWS machine,
you can kind of run an experiment in a day or two.
And so especially for translation, right,
you could get a lot more data, uh, than IWSLT.
And each of these, uh,
communities and datasets and- and tasks has their own metric.
We actually tried to, in the beginning,
we had a lot of discussion about how we should
define the measure of success for this project.
Uh, would it make sense, uh,
to have a normalized F1 score for basically all the different tasks?
But then we basically realized that
these different communities have different metrics for a reason.
Uh, fortunately at least all of these metrics are from 0 to 100 in theory.
Of course, in practice, you rarely ever see, uh,
a translation system with a BLEU score of 100, uh,
or even in the high 90s,
uh, or these really, really high ROUGE scores.
But, you know, in theory they go from 0-100, and so, uh,
we kept basically intact the different evaluation metrics for each of these communities,
and we just said we're going to sum them up.
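The summing rule just described can be sketched in a few lines. This is a minimal illustration only: the task names and all the numbers are made up, and the function name `deca_score` is a hypothetical helper, not code from the project.

```python
# Hypothetical per-task scores, each on its own community's 0-100 metric.
# These numbers are invented for illustration; they are not results from the talk.
task_scores = {
    "question_answering": 70.0,
    "translation_bleu": 20.0,
    "summarization_rouge": 25.0,
    "pronoun_resolution": 50.0,
}


def deca_score(scores):
    """Unweighted sum of per-task metrics, each assumed to lie in [0, 100]."""
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} is outside the 0-100 range: {value}")
    return sum(scores.values())


total = deca_score(task_scores)
print(total)  # 165.0
```

No task gets a weight here, which matches the decision to simply sum the metrics rather than argue about which task matters more.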
And, uh, when we first talked about this,
we have- had a lot of discussion,
uh, with- with others also, like, "Oh,
but translation is so much more important because it's much
bigger and it's a much more useful task than your,
you know, silly little pronoun resolution Winograd Schemas
which only have a couple hundred training samples.
And so you should have weighted translation more." And
then literally five questions later somebody's like,
"Why didn't you weight pronoun resolution more?
That is a really hard task that captures sort of common sense reasoning and, you know,
the complexity of language and semantics,
unlike all this, like, statistical pattern matching [NOISE] that you do in translation."
And I was like, "You should talk to that guy," [LAUGHTER] and like,
uh, hopefully in the end,
we'll just all agree that, like, it's reasonable to sum them up. Uh,
and of course, you also have to tackle, when you run experiments on this,
uh, a lot of the complexity that you have in machine learning and,
you know, stuff that very few people talk about, like having very skewed distributions.
So you have translation which has, uh,
millions or hundreds of thousands of examples,
and you have Winograd Schemas,
uh, that only have a couple hundred.
How do you train that such that you don't just completely ignore the smaller dataset?
Uh, so we'll get to some of the optimization trickery,
uh, that Nitish spent several months on in a bit.
But I first wanna sort of give you the first set of experiments.
So as you can see from all the numbers,
there's a lot of experiments, uh,
that we ran to even get to this,
and so we'll walk through this, uh, quite carefully.
I think hopefully you'll get some ideas also for- for ablations,
or experiments that you might wanna run in your, um,
in your experiments and in your,
uh, problem- final- final projects.
So what are we looking at here?
So basically, uh, on the left side,
we have single task performance.
So here, each number comes from a different model that was trained,
um, separately on just one task.
Uh, each row- each column here is the same architecture, uh,
and [NOISE] on the right side here,
we basically have, uh,
each column being basically the same architecture and the same exact model.
So here, we have four different models and here, uh,
we have 40 different models,
and each column again is the same architecture.
And so the simplest, uh,
first column here is just a standard sequence to sequence
model with very few bells and whistles and some pointers,
but nothing sort of major.
It's pretty deep, you know,
stacked bidirectional LSTMs, skip connections,
all the standard good well-tuned stuff for sequence to sequence models.
And, uh, then we added self-attention.
Um, this- this sort of, uh,
basically, uh, transformer layers.
[NOISE] Then we have this co-attention layer of
the outer products that we mentioned in the beginning,
and then we also added the question pointer.
So having the ability to point to a word in a question.
All right. Any questions about this table?
We'll dig into some of the details.
Uh, okay. Well, we'll dig into
the details first and then maybe you can think of some questions.
So let's analyze, uh,
what's going on in this table because there are a lot of numbers, uh,
and you really want to carefully analyze and sort of distinguish.
I think my first, uh,
observation was, wow, we can have a single architecture.
Like, even, even this is not quite what we want, right?
We want a single model.
But even this kind of showed us, wow,
you can have a single architecture that actually does really well and somewhat randomly,
in some cases, it actually had gotten state-of-the-art results.
So WikiSQL, for instance,
this architecture had the best model
to translate natural language English questions into SQL queries,
which was a surprise to us because it is the ninth dataset.
It was really not, like, a priority for us when we designed
the model and thought about how to generate words and pointer mechanisms and so on.
We just kind of had the standard context of SQL words
and we asked the question, "What's the translation to SQL?" And then, uh,
somewhat surprisingly to us, this particular architecture had the state-of-the-art, uh,
on SQL generation, and a bunch of folks in that community kind
of picked it up more quickly because it had state-of-the-art.
And that's- uh, unfortunately,
it doesn't have that many other state-of-the-art numbers, uh,
which is why it's harder, uh,
it's actually a much harder task.
And what you also observe is that,
uh, in several of the cases, uh,
using the multitask model,
so having a single model for all the 10 tasks,
uh, actually hurts performance at first.
And this is also something you rarely read in papers because papers
have a strong selection bias to only publish positive results.
Uh, and when you look at most transfer learning and multitask learning papers,
there's sort of an outside-of-the-actual-model consideration of, like,
well, let's only combine tasks that we know will work well with one another.
And if they don't work and hurt performance,
then we'd just exclude them from our experiments.
And so you don't see many negative task results, uh,
in the literature and there are a few papers here and there that, uh,
study basically the opposite side of transfer learning and that is,
uh, catastrophic interference and catastrophic forgetting.
So interference is when you train two different tasks in the same model,
and they interfere with one another and, uh, hurt each other's performance.
And catastrophic forgetting is if you train continually:
you first train on one task, then you train on a second task.
People used to think,
"Oh, well, you know,
basically the first task will be completely
forgotten," and you just work well on the second task,
if you train neural networks sort of in a sequential way, one task and then
another. And somewhat surprisingly, uh,
we- we found that things aren't actually
catastrophically being forgotten in these models.
Turns out that if you train them sequentially and
you add back a little bit of the original first task,
it comes back very, very quickly.
So while the performance is really bad,
you can get to the really good performance very,
very quickly in very few iterations.
So but it's one of the many interesting sort of tidbits that we found,
uh, in the course of this that we haven't even published yet. All right.
So, uh, focusing on, uh,
the transformer layers here we basically find transformers
do help the original sequence to sequence model a lot.
So if you tune them carefully and you combine them with, uh,
some bidirectional LSTMs and so on, uh,
they were very helpful and improved, uh,
across a bunch of different datasets, in some cases quite significantly.
Another observation is question-answering and semantic role labeling,
uh, actually can predict each other's performance quite well.
If one works well, the other works well,
uh, and- and vice-versa.
If they don't work well,
uh, both of them don't work very well.
Um, and it's also interesting because both of those tasks have different questions for,
uh, every training example.
Pointing. Uh, so the question pointing,
uh, is super important.
Uh, we actually have in some cases, uh,
twice the performance even for,
and this is kind of surprising to us,
a simple classification task where you could just have a standard Softmax.
But instead of saying you have a Softmax of entailment, contradiction, and so on,
you just basically, uh,
point to the word entailment in the question.
And that was also the case for Winograd Schemas that also benefited a lot,
uh, from this pointer mechanism.
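To make "pointing to the word entailment in the question" concrete, here is a minimal sketch. It assumes the model produces one attention score per question token; the question wording and the scores are invented, and `point_to_answer` is a hypothetical helper rather than the project's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def point_to_answer(tokens, scores):
    """Return the token with the highest pointer probability, plus that probability."""
    probs = softmax(scores)
    best = max(range(len(tokens)), key=probs.__getitem__)
    return tokens[best], probs[best]


# The label words are part of the question, so the "classifier" is just a pointer.
question = "is the hypothesis an entailment , neutral , or contradiction ?".split()
# Hypothetical attention scores, one per question token.
scores = [0.1, 0.0, 0.2, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0]

answer, prob = point_to_answer(question, scores)
print(answer)  # entailment
```

The output space here is the question itself, so no task-specific label softmax is needed.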
[NOISE]
Can you explain that?
Sure. Um, can we explain it? Why-
[inaudible]
Why does it help so much?
Um, in some ways,
I think partly it is that the whole architecture
has gotten better and better at pointing.
And part of the reason we actually do very,
very poorly in translation,
which is the only task that hurt in the- our first experiments a lot, uh,
in the multitask setting is that that is the only task that now has to generate,
uh, results from a completely separate Softmax,
whereas the rest of the architecture got really,
really good at pointing to things to answer questions, any kind of question.
Uh, and so but in some ways,
I think that is one explanation,
but I- I don't think it's- it's all of it.
I think we still need to figure out more why this happens. All right.
Now, multitask learning is the most
helpful when it comes to zero-shot and I'm actually very excited about that.
So this is a zero-shot relation extraction where you have different kinds of, uh,
relations that you might wanna extract and you might have never
seen like the student-teacher relationship that you're trying
to identify in a certain context or
a product company relationship or something like that.
And so, uh, that one actually, uh,
benefited a lot and almost got twice, uh,
as high in terms of the accuracy, uh,
when you learned it with everything else.
So these were questions it's never seen before,
relations that it's never seen before,
and it got twice as good, uh,
and benefited a lot especially from having seen other kinds of questions.
And in some ways, we have to give a lot of credit to SQuAD too,
uh, because SQuAD as a dataset,
uh, kind of pushed people into thinking about pointers as a mechanism to generate answers.
And pointers, we kind of see them like as a given and they don't get that much credit,
but they allow you to predict answers that you've never seen before at training time,
to generate words you've never seen before at training time,
which is actually quite- quite amazing. All right.
Now, the main observation though
here is that, still, if you had an oracle that would tell you
exactly which task you're currently in
and you would be perfectly kind of separating these into 10 different models,
maybe they're all the same architecture but there's still 10 different models, then, uh,
you would actually still do slightly better,
uh, than the first version of this multitask learning model.
And that is largely because we
chose to include a bunch of different tasks that have nothing to do
with one another and we wanted the community to start
thinking about tackling catastrophic interference, right?
If you learn like a new language or, you know,
you learn how to understand social media on Twitter,
you don't replace all your language,
uh, you know, in- in your brain.
You have one brain, it keeps getting smarter,
you keep learning new skills,
even when the skills that are new to you are very,
very different from old skills.
So in some ways we may have made our lives too hard,
and now we're actually thinking, okay,
maybe if you wanna publish a nicer paper on multitask learning,
we'll just look at all the tasks that do help each other,
and then we'll just, you know, have groups of tasks,
and then I can very quickly publish,
uh, some, some nice state-of-the-art papers.
But basically here, uh, we're still, uh,
quite significantly away in the decaScore between 10 different models and a single model.
Now, this of course is kind of an oracle score,
that's why we put it in parentheses because you don't actually have this oracle.
And in some cases,
it's quite easy to build an almost perfect classifier.
So, you know, separating what is the summary
based on that question and what is the translation from English to German,
you can do with almost 100 percent accuracy.
Uh, but, uh, SQuAD question-answering,
and zero-shot relation extraction,
and question-answering as semantic role labeling,
those are actually easily confused in terms of how
to generate the answers and you wouldn't quite know,
uh, into which model to route, uh, this.
So in some sense, this is kind of theoretical. All right.
Now, I mentioned that we have this prob- this
complexity in the optimization strategy and this is one of the many,
um, sort of problems that don't get that much, uh, coverage.
But when you have a very,
uh, imbalanced or skewed dataset,
it's easy to lose track and basically overpower the smaller dataset tasks.
And so, uh, the first, uh,
simplest training- we actually tried a ton of different training strategies,
but in the end, this fully joint one worked quite well.
But I actually promised to, uh, wait for questions, uh, on this table.
So any questions on all these results so far? Yeah?
So, uh, [NOISE] since you mentioned that if you had
an oracle that will tell you which task it is,
you do better by having 10 different ones,
did you try training a model on,
like, predicting which task a particular example belongs to?
We did. And so it- it confused, you know,
SQuAD and- and those two, the quest- the other- basically the other,
uh, two types of problems that were also cast as question answering.
So it confused those.
Um, but then a lot of the others, it was able to like, very perfectly do it.
But then you basically, as soon as you,
uh, were to try to then build a whole model and get a decaScore,
if your- if your classifier is even like 90 percent accurate,
you basically multiply this by 0.9 and
you get dinged so hard that it- it's not competitive anymore.
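The "multiply by 0.9" arithmetic can be written out as a small sketch. It assumes, as a simplification, that a misrouted example scores zero; the oracle total of 600 is a made-up number and `routed_score` is a hypothetical helper.

```python
def routed_score(oracle_score, router_accuracy, wrong_route_score=0.0):
    """Expected summed score when a task classifier picks which single-task model to use.

    Assumes examples routed to the wrong model earn `wrong_route_score`
    (zero by default), which is a simplification for illustration.
    """
    return router_accuracy * oracle_score + (1.0 - router_accuracy) * wrong_route_score


oracle_total = 600.0  # made-up sum of ten single-task scores
with_router = routed_score(oracle_total, router_accuracy=0.9)
print(with_router)  # roughly 540: a 60-point drop from a 90% accurate router
```

Under these assumptions a router that is wrong one time in ten costs far more than the gaps being discussed between the single model and the ten separate models.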
So it is actually hard if you try to just build
that whole system and keep adding sort of if-then-else statements,
uh, to make that, uh,
into sort of a single system. Yeah?
Have you tried telling the model what kind of task it's doing,
just giving that indicator of the kind of task quickly?
I mean, in some ways,
we did in this case,
because we only trained each model separately on it.
[inaudible]
Um, only through the question.
Yeah. Because I was thinking the
um, maybe it's not that important that the model figure out what we want it to
do in- in a practical [NOISE] application
if we could just tell it what we want it to do right now?
In some cases, you could tell.
Uh, so the question is sort of,
uh, and even in the multitask setting,
you could have like an extra kind of token to say,
"Now, you're doing summarization.
So, and that's another input."
Uh, in some ways,
whether you have a summarization token,
uh, or you ask what is the summary?
It actually I don't think makes that big of a difference.
It's just now you can query this model in
very natural language rather than having to know
kind of a special token to, to query the model.
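The two ways of telling the model its task can be put side by side in a short sketch. The field names, the angle-bracket token format, and the sentiment question wording are assumptions made for illustration; only the summary and translation questions echo phrasing from the talk.

```python
# Natural-language questions per task. The sentiment wording is hypothetical.
QUESTIONS = {
    "summarization": "What is the summary?",
    "translation_en_de": "What is the translation from English to German?",
    "sentiment": "Is this review positive or negative?",
}


def with_question(task, context, answer):
    """Cast an example as (question, context, answer) using a plain-English question."""
    return {"question": QUESTIONS[task], "context": context, "answer": answer}


def with_task_token(task, context, answer):
    """Alternative: a special task token in place of the question (assumed format)."""
    return {"question": f"<{task}>", "context": context, "answer": answer}


example_q = with_question("sentiment", "The food was great.", "positive")
example_t = with_task_token("sentiment", "The food was great.", "positive")
print(example_q["question"])  # Is this review positive or negative?
print(example_t["question"])  # <sentiment>
```

Both carry the same information to the model; the question form just lets a user query it without knowing a special token.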
Uh, and we'll see actually in a couple of slides that the model is not confused,
uh, when it comes to how to generate the answers.
So, for every one of the tasks,
it knows very clearly how to generate the words to get to the right,
to get to, you know, a reasonably accurate answer.
[NOISE] Um, in the- [inaudible] does the model
see all of the data and then [inaudible] that class or does it only include a [inaudible]?
Oh, great question. So, how do we train, uh, the single task models?
They're only trained on that dataset.
So, the SQuAD number here is just a single model that has only seen SQuAD training.
[NOISE] So, your point about the,
um, the pointer exception for the, uh,
[inaudible] generally more helpful than [inaudible]?
Somewhat surprisingly, even, ah,
in the case here, uh,
where we had, um, this is MultiNLI,
this particular model, I mean,
if you just have the standard sequence to sequence,
it just generates, you know,
also with a softmax, uh, that label.
So in that sense, it's quite similar.
Uh, but yeah, it was actually better able to just point, which actually led us, uh,
for a while into thinking about maybe we should have a project where we just say point to
all the things and just get rid of softmax classifiers forever.
Um, the problem is when you then try to do translation also,
it's like okay wow,
what do you point to,
and then you kind of pre-train it and do
some alignment and it gets kinda very large and you point to a lot of different like,
you may have like- like tens of thousands of potential candidates.
So we kinda discarded it as like a single unifying model for all the things,
but you could point to a lot of different,
like a lot of these tasks,
you could actually point to and
I think it's another interesting side project that could spawn from this, yeah.
Just a quick question to how,
how sensitive [inaudible] how sensitive, uh,
the individual components [inaudible] was when you
slightly perturb the relative weights of them in the loss function?
So, we -- the question is, uh, how, um,
sensitive were the tasks if we were to,
um, add weights to the different tasks?
We [NOISE] did in the optimization kind of did a lot of trickery on
how to train it but we never said this task only matters like 0.5 or something.
So, we didn't do that analysis. Yeah?
Co-attention seems to be a burden a little bit.
In some cases, yeah.
Is it the [inaudible] co-attention and order but no co-attention or is that kind of like,
"Oh, you already saw the test data so, like, you can't use these."
I mean, these are all dev sets.
Um, but it's, you could definitely do even more architecture engineering.
In fact, there's this whole field which I don't think
you've gotten to, right, neural architecture search?
Yeah. So like you can actually combine your reinforcement learning, um,
and you say the action space for the reinforcement learning agent
is, uh, to have a couple of
different modules of neural nets like maybe you want to have
like a CNN layer and then like
a memory layer and then an LSTM layer and maybe it's bidirectional and you
basically let a reinforcement learning agent figure out all of these decisions.
Uh, so I think it would be phenomenal to try to apply
neural architecture search not to what's
usually being done which is we already know how to do image classification,
we'll just do it slightly better with NAS, neural architecture search.
But to actually try to find
a single architecture for multi-task learning, which we don't know.
The problem of course is that already getting to these,
all these numbers, took a lot of compute time and a lot of
fiddling around with stuff and it is, I can,
I can only give you sort of an idea of like how often we'd say,
"Oh man, we got like this really amazing result
in this task but it needed this learning rate."
And it turns out the same model,
same set of hyperparameters everything,
but this other task to get to good performance needed a much higher learning rate.
And now, you try to combine those two tasks only together and you're like,
"Okay, how do you choose your learning rate now?"
You choose the, you know,
if you choose the learning rate from the task that is, you know,
bigger, then the smaller task just doesn't work
well at all because it needed this higher learning rate.
If you use the higher learning rate that the smaller task and the smaller dataset,
uh, did really well on, then the large one just overfits and doesn't work well either.
If you try to do the average, neither of the two works.
Like there's a lot of complexity in trying to do multitask learning.
That's why, that's why it's such an interesting I think, uh, research challenge.
All right, any more questions about this first set of results?
They get, they will get better.
We, we have, we have had some ideas already,
uh, on, on how to improve them.
All right. So, uh,
how did we actually train this whole thing?
Um, we had tried a lot of different things but in the end, uh,
this very simple fully joint training strategy actually worked the best.
Uh, and that is you basically take a mini batch from each of
the different tasks and you just train on that mini batch from that task.
So basically just going through all the 10 tasks and then round robin,
uh, go through them.
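The round-robin idea can be sketched as a batch scheduler. This is only an illustration of the sampling order: the task names and batch labels are placeholders, small datasets are assumed to be recycled when they run out, and the actual training step is left out.

```python
from itertools import cycle


def round_robin_batches(task_batches, num_steps):
    """Yield (task_name, batch) pairs, visiting the tasks in a fixed cyclic order.

    Each task's batches are recycled when exhausted, so a small dataset
    is simply seen again (an assumption made for this sketch).
    """
    iterators = {name: cycle(batches) for name, batches in task_batches.items()}
    order = cycle(task_batches.keys())
    for _ in range(num_steps):
        name = next(order)
        yield name, next(iterators[name])


# Placeholder batches: three for a large task, one for a tiny one.
toy = {"squad": ["s0", "s1", "s2"], "iwslt": ["t0", "t1"], "winograd": ["w0"]}
schedule = list(round_robin_batches(toy, 6))
print(schedule)
# [('squad', 's0'), ('iwslt', 't0'), ('winograd', 'w0'),
#  ('squad', 's1'), ('iwslt', 't1'), ('winograd', 'w0')]
```

Every task gets the same number of updates regardless of dataset size, which is one way the small datasets avoid being ignored.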
Um, now it turns out, ah,
that that does not work,
uh, quite as well, uh,
as another training strategy and if you look into optimization,
uh, strategies in neural nets, uh,
there are actually a couple of papers on
so-called curriculum learning, where the idea is,
you start with training your model with simple pro- simple instances of your problems.
So, in translation, for instance you start training with
very short sentences and then you go to larger and larger,
uh, sentences, uh, or longer and longer sentences.
Uh, now it turns out for multi-task learning,
you actually want to do the opposite.
You wanna do anti-curriculum learning.
Uh, and that is you start with the hardest tasks and you iterate on
those for a while and then you add the simple tasks later on.
And to some degree, I think this is intuitive because when
you train this very gigantic and powerful model,
uh, on a very simple task like
sentiment, where you just need to classify everything to be positive or negative,
you train all of these weights and you arrive at sort of, uh,
local optima that are quite deep and very
specific to just generating these two words. And if you then try to get out of that,
out of this local optimum for that very simple task,
and then try to generate all these other kinds of words and point to different,
you know, words it's never seen before, as in SQuAD,
it's very, very hard to come out of that local optimum.
And that is sort of my intuition of why it actually makes more sense to say,
"Let's start with SQuAD and machine translation and a couple of these harder tasks.
We'll make the model very general purpose.
It has to generate a lot of different things,
create, with a softmax, German words,
it has to point to all kinds of
different words and be able to parse all kinds of different Wikipedia paragraphs."
And you do that a couple of times and then once you've finished,
uh, this sort of pre-training, uh,
stage or anti-curriculum, then you move on and add sort of the simpler smaller tasks.
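The anti-curriculum schedule can be sketched as two phases of round robin. The split into "hard" and "easy" tasks and the step counts below are illustrative assumptions, and `anti_curriculum_schedule` is a hypothetical helper.

```python
from itertools import cycle, islice


def anti_curriculum_schedule(hard_tasks, easy_tasks, pretrain_steps, joint_steps):
    """Task order: round robin over the hard tasks first, then over all tasks."""
    phase_one = islice(cycle(hard_tasks), pretrain_steps)
    phase_two = islice(cycle(hard_tasks + easy_tasks), joint_steps)
    return list(phase_one) + list(phase_two)


schedule = anti_curriculum_schedule(
    hard_tasks=["squad", "translation"],     # assumed "hard" group
    easy_tasks=["sentiment", "winograd"],    # assumed "easy" group
    pretrain_steps=4,
    joint_steps=4,
)
print(schedule)
# ['squad', 'translation', 'squad', 'translation',
#  'squad', 'translation', 'sentiment', 'winograd']
```

The simple tasks only enter once the model has already been pushed toward general-purpose generation and pointing.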
So [NOISE] with that, uh,
relatively simple change that did take us,
uh, a lot of different experiments to get to.
Um, we actually, uh,
closed or, uh, um,
went closer to closing that gap and now, um,
we're only sort of, um, 14, uh, away.
Right, yeah, uh, 14 or so.
Uh, but there's still, uh,
a big gap and the biggest, uh,
nuisance and issue that we had was with translation.
Basically, if you look at all of these,
most things are kind of similar,
get slightly better, um and it's sort of a toss up but then and,
and roughly similar, but translation was really bad.
It's almost only half, uh,
the performance in the multitask learning setup,
and part of that is because translation was the only task that had
a very large Softmax vocabulary of words that were in no other task.
And most of the other tasks,
actually were doing really well with pointing.
And so, uh, my interpretation of this was that the intermediate layers,
all these representations that we learned with
bi-directional LSTMs and transformers, they got really,
really good at being pointed to,
like creating hidden representations that the answer module can point to very accurately.
And then you have this one task that is like,
I don't point to almost anything,
I basically just generate other words in a different vocabulary.
And so those hidden representations became less useful for that task.
And so, that was one of the insights and that led
to one of the ways of trying to improve this.
Now, one of the interesting issues that we had is,
when we improved the model,
the multi-single model for all 10 tasks,
a lot of times we said, well,
but now we also have to go back and run
10 more experiments on all the single tasks to have a proper comparison, right?
Because if you tune the thing you care about,
and you stop tuning the thing you wanna show you can do better than,
then that's not fair.
Uh, so you always wanna give as much, uh,
TLC and focus and experiment time to your baselines.
And so, uh, in some cases we actually,
uh, improved some- improved something.
But then, we improved both the 10 separate models and our model,
and in some cases, like, the 10 separate models improved even more.
So the gap got even larger.
It's kind of the opposite of what we wanted to show, but in general,
it's better for both tests,
uh, for the architecture overall.
So basically, we started, uh,
with this fully joint training and we have
this sort of set of single models that we could,
in theory with some oracle,
kind of just sum up, uh,
in their scores, to get a decaScore.
So the gap started at 23.
And then, uh, we basically did this anti-curriculum training,
uh, which, uh, lowered the gap to 15.
So we're kind of excited,
uh, making good progress.
Then we switched, uh,
from GloVe and used CoVe.
So contextual vectors, um,
which actually increased the gap a lot again.
So everything got better, but the 10 separate models got
even better than the one single model that does the 10 tasks.
Um, so the gap got bigger,
but everybody's performance increased.
So it was still overall a good thing.
Uh, and then, uh, we basically figured,
especially with this machine translation issue,
we shouldn't just pre-train on SQuAD,
but we also should include machine translation in
this pre-training in the beginning so the model doesn't just start learning to point.
Um, and that helped us, uh,
to reduce the gap between the 10 separate models,
the oracle, and the single model to about five points.
And then, uh, we basically said,
okay, translation is still not that good.
We just keep oversampling.
So, every time we go through one of these round robin mini-batch sets,
we just always include machine translation.
And that basically allowed us to then reduce the gap,
uh, to just a single point.
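One way to read "always include machine translation" is that each round-robin pass gets extra translation batches. That reading is an assumption, and so are the task names and the number of extra copies in this sketch.

```python
def oversampled_round(tasks, oversampled_task, extra_copies):
    """One round-robin pass plus extra visits to a single task (an assumed scheme)."""
    return list(tasks) + [oversampled_task] * extra_copies


order = oversampled_round(["squad", "translation", "sentiment"], "translation", 1)
share = order.count("translation") / len(order)
print(order)  # ['squad', 'translation', 'sentiment', 'translation']
print(share)  # 0.5
```

With three tasks and one extra copy, translation goes from a third of the updates to half of them.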
So now, uh, we started, uh,
couple of, several months ago, uh, at 586.
And now the single, uh,
oracle with 10 different models,
if you were to sum them up,
gets 618, uh, and with the, you know,
better contextual vectors and tuning and adding a lot more translation,
and translation is still not as good as we would like it to be, uh,
but now, several of the other tasks benefited a bunch.
And now we're basically one decaScore point away from
having a single model that does as well as 10 different ones.
And you can basically,
you could run even more experiments,
in some ways you could burn millions of dollars on AWS cost here,
because most of the time we kept the hyperparameters of these different models the same.
Like each of these, you could also say, well,
maybe this multitask model needs to have 50 more layers,
or maybe 19 more layers,
or maybe five more layers and maybe they should be 1000,
you know, wider in their hidden dimensions.
And you could basically run a lot more experiments.
Maybe hopefully, eventually, the community jointly does that,
and then we can kind of move, move towards that.
But we figured, okay, we're pretty close,
so we moved on to some other things which maybe I'll tell you about next year.
[LAUGHTER] But basically, um,
let's do some analysis of what happened in this project.
And this is kind of, I think something that I would encourage you all to do as well.
Like you, you can chase the numbers for a while and in some ways,
you should always be skeptical about your evaluations.
And in some cases,
you've seen- we've seen in the NLP community people
like basically just optimize BLEU scores for translation for years.
And then somebody came out with a paper and said, well,
it turns out BLEU metrics and human evaluations on how good of a translation is this,
aren't actually that correlated.
And you're like, ah, that sucks,
we just spent years of our lives tuning that metric and publishing a bunch of papers.
Um, and so in some ways all of these metrics have flaws, uh, you know,
ROUGE scores, summarization is a super,
uh, subjective kind of a task.
And summarization, for instance,
when you analyze the errors, uh,
you often realize that word vectors have problems too.
So, for instance, the word vectors for Jason, John,
and Jeremy are all kind of the same, right?
They all have similar, uh,
distributions, similar context windows, and so on.
And so word vectors of names are very similar.
And so in summarization errors, you realize, oh,
well, you know, this article, news article talked about Jeremy being kidnapped.
But the summary said that Jason was kidnapped.
And you're like, well, you know, in the evaluation metric
just one word is off and, like, all the rest is correct,
but it's a pretty important word.
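The "one word off" problem shows up clearly with a simple word-overlap score. The function below is a simplified unigram-overlap F1 written for illustration; it is not the ROUGE implementation, and the two sentences are invented.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Simplified unigram-overlap F1 between two whitespace-tokenized strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Jeremy was kidnapped on Monday"
candidate = "Jason was kidnapped on Monday"
score = unigram_f1(reference, candidate)
print(round(score, 2))  # 0.8
```

Four of five words match, so the score is high even though the summary names the wrong person.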
And so, word vectors have like issues
for summarization that are pretty fundamental and I don't think,
uh, anybody's tackling really well right now.
Uh, and so all of these metrics have issues.
I would argue though that combining the 10 actually
makes it less problematic and more meaningful,
than looking at each one separately.
Uh, because now you can't use the idiosyncrasies of
one particular evaluation metric to just get like your score a little bit higher.
Um, because then, if you just tune with that particular thing in mind,
it will hurt some of the other tasks and you won't get to the sort of general,
uh, NLP model that much more easily.
All right. So now, let's do some analysis uh,
of this model and, uh,
look at, and this is the kinda thing that comes to one of the questions that was asked.
Uh, is this model able to kind of generate the right words for the right tasks?
And here, we basically looked at the distributions of how often, uh,
the model generated words in these differen- with these three different mechanisms,
Softmax vocabulary, context pointers, or question pointers.
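The talk names the three mechanisms but, in this passage, not how they are combined. One common option is a weighted mixture of the three distributions, sketched below; the mixture weights, the toy distributions, and the function name are all assumptions for illustration.

```python
def mix_distributions(vocab_probs, context_probs, question_probs, weights):
    """Combine three word distributions with mixture weights that sum to 1."""
    mixed = {}
    for weight, dist in zip(weights, (vocab_probs, context_probs, question_probs)):
        for word, p in dist.items():
            mixed[word] = mixed.get(word, 0.0) + weight * p
    return mixed


# Toy distributions: a tiny external vocabulary, a context pointer, a question pointer.
vocab_probs = {"Ja": 0.5, "Nein": 0.5}
context_probs = {"Paris": 0.9, "France": 0.1}
question_probs = {"capital": 1.0}

# Hypothetical weights that favor the context pointer, as for a SQuAD-style question.
mixed = mix_distributions(vocab_probs, context_probs, question_probs, (0.1, 0.8, 0.1))
best = max(mixed, key=mixed.get)
print(best)  # Paris
```

Counting which of the three weights dominates for each task is one way to get the kind of usage statistics being described here.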
And, uh, as you can see,
in the majority of cases it knows exactly how to generate.
So, uh, for, uh,
question-answering semantic role labeling,
and SQuAD and WikiSQL and,
um, summarization, it basically uses the context pointer.
So it just points into the context document.
And we know for SQuAD,
that is basically [NOISE] how the data set was generated.
So that's the only thing that that really makes a lot of sense.
Uh, what's kind of cool is that in some cases like summarization,
it sometimes creates new words, you know,
that weren't in the context document or pointed to.
Uh, and for zero-shot relation extraction,
it also sometimes uses, uh,
this external vocabulary and in some cases the context pointer.
So for the most part, uh,
this model doesn't- is not confused how to execute on a task given, uh,
this question formalism rather than, uh, the,
uh, format of sort of this is the task,
just do this particular task.
Now, um, you might argue,
okay, I'm not that impressed by, you know,
having the performance be slightly the same with one model versus
10 separate models even though it's nice if you wanna deploy it right,
like, uses less RAM and all of that,
assuming they're the same size,
uh, well, you know, one-tenth the size.
But what I'm excited about is more like the next couple of results.
And namely, sort of this transfer learning,
domain adaptation, and zero-shot,
uh, these kinds of capabilities.
So here, uh, we chose two data sets that weren't included in the original 10.
And we basically trained a pre-trained model on this versus a random model.
And, uh, random here, again,
they're the same architecture,
and pre-trained means the entirety of the model was pre-trained,
all the, you know,
encoders, including the decoder and the Softmax and everything. Uh,
and the two other tasks were another IWSLT language pair, namely
translating from English to Czech, uh,
and the named entity recognition task that you all know very well.
So basically what we found is that,
uh, it converges much more quickly,
uh, in the beginning, uh, and then,
there's still a significant but not gigantic gap.
So this pre-training on these completely separate kinds of task had helped.
And, uh, I think that's,
that's pretty exciting, um,
especially sort of the quicker convergence, like,
learning more quickly, uh,
whatever new task you, you come up with,
which also means in some cases you can get away with
less training data on these new- on these new tasks.
Uh, now domain adaptation is kind of the simpler form of transfer learning,
where you basically just have a different,
uh, type of, uh,
you know, distribution for your words.
Uh, we mentioned we have the Stanford Sentiment Treebank for sentiment analysis.
Uh, and then we analyzed this on different,
uh, sentiment data sets,
namely Amazon product reviews and Yelp restaurant reviews,
and out of the box without any training,
the model just got 80% accuracy on both of those data sets.
Uh, and I think for practitioners,
that is pretty exciting because you basically didn't have to train anything,
it just kind of worked out of the box,
download it from GitHub, and run it.
Uh, SNLI, that was slightly different.
It didn't quite work as well.
It's another natural language inference data set,
but has very different- a very different distribution, different, uh,
kinds of domains, uh, that,
uh, these entailment questions are asked over.
Uh, and here, out of the box it achieved 62.
Uh, but then, uh, once you fine-tuned it and,
similar to these experiments here, continued to actually train on this data set,
it quickly, uh, converged to 87, which was
still a two percent gain over a randomly initialized MQAN model. Yeah.
In that experiment, did you evaluate how much less data you can get away with?
Did we evaluate how much less data we can get away with? We didn't.
And in some ways, whenever you would run this experiment,
you'd basically be like, you'd still not do as well.
Like, everything- all these models will still do better with more training data.
So you just kind of, it would be a fuzzy kind of say,
like, cut- fuzzy sort of result, right?
Where you say, well, with one-tenth we might get
to 50 and the other model might get only to 40,
doing something like that.
Um, we don't- I don't have those numbers.
It would be kind of actually also a neat, neat, uh,
analysis to do. Yeah.
So if you wanted to like train on a new task [inaudible].
Yeah.
[inaudible] .
So, do we have the code to train a new task? Yes, we do.
Um, you can just, uh, edit,
make it into this format: here's a context,
here's a question, a simple, like, CSV-type format,
and then you add it and you can both like train the pre-trained model yourself.
You can download a pre-trained model and just add it. So I'll look it up, yeah.
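A rough sketch of what such a "context, question, answer" CSV-type file could look like is below. The column names, the helper functions, and the two example reviews are assumptions made up for illustration; the actual decaNLP code may expect a different layout. The question string follows the sentiment formalism mentioned in the talk.

```python
import csv
import io

# Hypothetical column names; the real decaNLP loaders may use another layout.
FIELDS = ["context", "question", "answer"]


def to_rows(labeled_texts, question):
    """Cast (text, label) pairs into context/question/answer rows."""
    return [
        {"context": text, "question": question, "answer": label}
        for text, label in labeled_texts
    ]


def write_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_csv(text):
    """Parse CSV text back into a list of dicts, one per example."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up examples of a new sentiment-style task.
rows = to_rows(
    [("The food was great.", "positive"),
     ("Service was slow and rude.", "negative")],
    question="Is this sentence positive or negative?",
)
csv_text = write_csv(rows)
examples = read_csv(csv_text)
```

The point of the sketch is only that a new task needs no new output layer: it is expressed entirely as text in these three fields.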
Do you know how this compares to using other kinds of pre-trained representations like, say BERT?
So, um, it's a great question.
So how does this compare to other pre-trained representations like BERT?
So, in some ways,
people say BERT is kind of this model that does everything,
but when you actually read the paper, you realize, well,
it's a separate model for these different tasks, right?
If you wanted to have a classification task,
you have a little token in the beginning,
and you have a different top layer.
If you wanna do a sequence labeling task,
you have a different top layer.
If you wanted to do a sequence extraction task,
you have a different top layer.
So, BERT isn't actually a single model for all of these different tasks.
Ah, and then, on all the results,
there's a lot of extra tuning for each of the data sets,
and tasks, uh, that, you know,
different learning rate for this task, uh,
different size, or different sets of BERT, and so on.
So, we're also super excited, we're like maybe this is it,
we'll just run everything on BERT,
and then we looked into all the details,
and there's so much excitement in the beginning.
And then the more we dug through the details,
the less excited we became about this being, like, sort of the answer,
because it is not a single model.
Uh, in some ways, it's probably better to- for pre-training.
So instead of CoVe,
you can have kind of BERT at the very beginning,
and my hunch is everything will get slightly better,
but you still need to have, um,
a lot of the- a lot of the other sort of modeling architecture on top of it.
Uh, and then the sad thing is to really get the state of the art results,
there's a lot of very spec- task-specific tuning of those last top layers.
So, if you try to unify that task-specific tuning,
you lose a lot of the good performance of BERT.
Um, so, unfortunately, it's not quite the sort of,
"Oh, just use BERT for it,
and you'll just have state-of-the-art numbers and all the things."
Um, I could probably go like talk about it a lot more, but, uh,
I think it still makes sense to think about, um,
some of the ideas from BERT,
like basically, add as one of the tasks language modeling.
That would be very likely the task that helps the most for all the other tasks,
and we should include that, uh,
it also would be nice to have a faster model. Right now,
um, it's hard to do; language modeling is very, very large,
it benefits even more from,
you know, billions and billions of words.
It's hard to train the MQAN model,
this current question answering model, with the co-attention mechanism of the question
with, like, an increasingly large context.
So you'd have to kind of split it. Also, like, BERT
works reasonably well only for, like, at most, I think, 500 words or so,
and if you wanted to do summarization you'd basically have to cut
the original document to only 500 words, and then try to summarize it.
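As a small illustration of the two options mentioned here, cutting a long document to a fixed length or splitting it, a sketch might look like the following. The function names, the stride, and the whitespace-style token list are assumptions; the limit of 500 is just the rough figure from the talk, not an exact model constant.

```python
def truncate(tokens, max_len=500):
    """Keep only the first max_len tokens (the 'cut the document' option)."""
    return tokens[:max_len]


def split_into_windows(tokens, max_len=500, stride=400):
    """Split tokens into overlapping windows of at most max_len tokens.

    stride is how far each window starts after the previous one, so
    consecutive windows overlap by max_len - stride tokens.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows


# A made-up 1200-token document.
document = [f"w{i}" for i in range(1200)]
shortened = truncate(document)
windows = split_into_windows(document)
```

Truncation is simple but throws away everything past the limit, which is the problem raised for summarization; windowing keeps all tokens but leaves open how to combine the per-window outputs.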
So, there's a lot of, like, devil in the details that they didn't have to figure out,
because they said, "Well, just sort of like word vectors,
we can take them in, and then we do a lot of other stuff that is task-specific,
um, with those- those word vectors,
or with the BERT architecture."
I still- I don't want to- this BERT is obviously amazing,
and we are looking into trying to use ideas from it.
But unfortunately, it wasn't just sort of a silver bullet to
solve multi-task learning. Mm-hmm?
Pre-training process to be considered, uh,
prioritized sampling based off of how much fewer group, how much loss there is?
Sorry, did we- say again?
Would you consider prioritizing sampling [inaudible]?
So, did we consider prioritizing the sampling?
So in some ways with this pre-trained strategy here, um,
that's kind of what we did by basically focusing on these really hard tasks.
And, uh, a lot of like the gap in the end was improved by really waiting for,
like four of the tasks at the very end,
uh, bef- unti- you know, uh,
until after you've gone through, uh,
sort of oversampling all of these,
uh, really hard tasks.
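To make the questioner's idea concrete, here is a minimal sketch of sampling the next training task with probability proportional to its current loss. This illustrates the suggestion in the question, not the strategy the speaker describes; the task names, loss values, and the temperature parameter are all made up.

```python
import random


def sampling_weights(task_losses, temperature=1.0):
    """Turn per-task losses into sampling probabilities.

    Probability is proportional to loss ** (1 / temperature), so tasks
    with a higher loss are sampled more often.
    """
    scaled = {t: loss ** (1.0 / temperature) for t, loss in task_losses.items()}
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("at least one task needs a positive loss")
    return {t: s / total for t, s in scaled.items()}


def sample_task(task_losses, rng):
    """Pick the task to draw the next batch from."""
    probs = sampling_weights(task_losses)
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


# Made-up running losses for four tasks.
losses = {"translation": 3.0, "summarization": 2.0,
          "sentiment": 0.5, "question_answering": 0.5}
probs = sampling_weights(losses)

rng = random.Random(0)
counts = {t: 0 for t in losses}
for _ in range(1000):
    counts[sample_task(losses, rng)] += 1
```

In practice the losses would be updated as training proceeds, so the sampling distribution shifts toward whichever tasks are currently hardest.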
In the last 10 minutes, uh, basically, uh,
th- the most exciting thing, uh,
for- for last though I think you could also do a lot more work in this direction.
Uh, I mentioned the whole question pointer
and zero-shot learning in the beginning, and, uh,
we basically just tried to play around with that a little bit, um,
and found that in some cases,
it actually kind of magically works.
Uh, so here, we tried, uh,
a sentence, "John had a party,
but no one came, and he was all alone."
And then we asked, "Is this story sad, or happy?"
And while the model could've, you know,
generated some random German words,
or some random SQL words,
or just said whatever,
it actually, of all the words
it could've pointed to in the context or the question,
pointed to "sad", which is pretty cool.
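One way to picture that choice between generating from the external vocabulary and pointing into the context or the question is as a mixture of three distributions controlled by two switches. The sketch below assumes that form; the function name, the switch values, and every probability in it are invented for illustration and are not outputs of the actual model.

```python
def mix_distributions(p_vocab, p_context, p_question, gamma, lam):
    """Combine three word distributions with two switches.

    gamma is the weight on generating from the external vocabulary;
    the remaining (1 - gamma) is split between pointing to the context
    (lam) and pointing to the question (1 - lam).
    """
    words = set(p_vocab) | set(p_context) | set(p_question)
    return {
        w: gamma * p_vocab.get(w, 0.0)
        + (1.0 - gamma) * (lam * p_context.get(w, 0.0)
                           + (1.0 - lam) * p_question.get(w, 0.0))
        for w in words
    }


# Invented numbers: the vocabulary holds German and SQL words, the context
# pointer looks at the story, the question pointer looks at the question.
p_vocab = {"traurig": 0.5, "SELECT": 0.3, "the": 0.2}
p_context = {"party": 0.2, "alone": 0.5, "came": 0.3}
p_question = {"sad": 0.7, "happy": 0.2, "story": 0.1}

mixed = mix_distributions(p_vocab, p_context, p_question, gamma=0.1, lam=0.2)
best_word = max(mixed, key=mixed.get)
```

With these made-up values most of the mass goes to the question pointer, so the highest-scoring word is one that appears in the question rather than a vocabulary word.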
Like- and it's just one small sample,
and, you know, you could do a lot more,
you could try to come up with a very large zero-shot kind of classification data set,
which is actually kind of hard too.
You have to be quite creative, it's not like you can just say, "Oh,
we'll just take all these reviews,
and label them as, you know, positive, negative."
Ah, but so, I think we- we need to do more work in that direction.
Somebody will hopefully create a zero-shot kind of task data set,
that is not just zero-shot for, you know,
kind of new distributions or something with completely different, uh, outputs.
Uh, but we- we tried a couple,
and it doesn't always work, right.
You can be adversarial about it,
you can make this- basically, this looks most similar to,
is the sentiment positive or negative?
Uh, is this sen- is this sentence positive or negative?
That was the formalism we had for sentiment analysis.
And so you could,
if you make the question more and more different,
eventually, it'll kinda get tripped up.
Ah, and it's clear that it's benefited, uh,
from the word vectors,
of sad being closer to negative,
and then understanding sort of through all these,
uh, correlations, and- and, uh,
deep representations that there are other sort of sad words in this context,
or- or whatever it is.
Uh, and so, it was able to point to this.
But you can be adversarial, it doesn't always work.
But even the fact that, uh,
it was sort of zero-shot classification based on word vectors, uh,
for new kinds of questions,
uh, personally, it was very exciting to me.
And we tried a couple of other things like,
uh, "Bryan gave a talk and nobody clapped.
Was Bryan happy, or sad?"
And it also got it right.
So, um, there are a couple- a couple of
examples where, where at least this happy or sad thing worked.
And then, uh, a couple of other sort of adjective questions that we,
we tried but, um,
what I'm- what I would be most excited about is eventually actually
trying to have a zero-shot classification task,
uh, that combines the different tasks too.
So, uh, unfortunately, there's no data set for that,
so we didn't train it, so it doesn't happen with the model.
But in theory, if you ask what is the sum- you can summarize,
and you can translate from English into German,
why couldn't you ask the model for a German summary?
And if that worked, eventually,
that would be even more amazing,
but it, it doesn't work right now,
because we never ask it sort of for these
compositional task- these compositional task questions.
But it's yet another interesting line of research that I think could spawn from this.
Uh, all right.
So, I hope I could show you that this sort of
decaNLP framework is an interesting new benchmark for generalized NLP.
Uh, I do think it's a reasonably good framework
for tackling a bunch of the really hard questions in the field.
Uh, more general language understanding,
and question answering of course,
uh, multitask learning, domain adaptation, uh,
which we sort of analyzed a little bit with the sentiment,
and SNLI versus MultiNLI,
um, transfer learning, and then weight sharing.
I think it's clear, everybody loves weight sharing,
you wanna share as many weights as possible.
Uh, word vectors started it, then, uh, ELMo,
CoVe, and now BERT basically share more and more,
deeper and deeper layers.
It would be great if we can unify that last bit also, uh,
and then share basically the entirety of the networks,
and then eventually hopefully get to zero-shot learning.
Now, there's a bunch of related work.
The original paper has over 100,
um, citations in it, uh, of,
of, you know, papers to other,
other, um, lines of, uh, work.
But, uh, these are actually- at least some of
the models and papers that influenced us the most,
uh, in, in our thinking and modelling.
Uh, one of them actually comes from,
uh, the two instructors of the class.
And so, um, hopefully, uh, we can,
you know, sort of think about what- what's next after all this architecture engineering.
And, uh, I think one potential answer to that, uh,
is single multitask learning for more generalized NLP models.
[NOISE] All right. Thank you. [APPLAUSE]
