HANNA HAJISHIRZI:
Hello, everyone.
Welcome to our last talk of
this quarter in the NLP seminar
series.
I'm very honored to
introduce Sam Bowman--
I think most of you know him.
He is an assistant
professor at NYU.
He received his PhD in 2016.
He's working on a lot of
interesting work in modeling,
evaluation metrics, and creating
a lot of benchmark data sets
for text understanding, sentence
understanding, and even beyond.
Probably most of
you know his work
on the NLI benchmark as well
as GLUE and SuperGLUE,
and I think he's going to talk
about SuperGLUE. OK.
SAM BOWMAN: All right
thanks so much for coming.
Thanks for hosting me,
excited to be here.
So this talk is oriented
toward this goal
that I see as the goal for
a pretty big chunk of NLP
research these days.
Let me try to distill this down.
I think a lot of
work is oriented
toward trying to develop a
general-purpose neural network
encoder for text, which
makes it possible to solve
any new language understanding
task using only enough training
data to define the
possible outputs.
So in this view, the goal--
the purpose of
training data in NLP--
or at least labeled training data--
is just to tell you
what the task is.
In other words, we're trying
to develop neural network
models that already
understand English
or already understand
your target language
before you start
building a system
for your task of interest.
And, of course, I'm
interested in this question
because of the Muppets:
systems like ELMo, and BERT,
and RoBERTa-- these large-scale
pre-trained language models
or quasi-language models--
have started to
suggest that we might
be able to get very,
very close to this goal
in the not-so-distant future.
So this is what I'll
be talking about.
I'm going to start with
the GLUE benchmark, which
is our attempt to
run a shared task
competition around this goal.
I'll talk about what we
learned after launching GLUE
and what we did in our reboot of
it this year called SuperGLUE.
I'll talk a little bit about
some analysis work
trying to understand why these
Muppet models are succeeding,
and I'll actually do
some meta BERTology--
some analysis of analysis work.
And then I'll end--
and this is the part
I'm excited to
give because I have
no results in this last
section, just wild speculation--
on what's next for evaluation
because the conclusion of this
is that we're in a very
weird position in the field.
So I'll jump in with GLUE.
GLUE-- this is joint work with
Omer, with Alex and Amanpreet
at NYU, with Julian here,
and Felix Hill at DeepMind.
GLUE is an open-ended
competition-- basically,
a shared task
without a deadline--
for work on general-purpose
sentence encoders.
So it's meant to roughly capture
this sentence-level version
of that goal I laid out
at the very beginning.
So why did we think that
a shared task like this
is worthwhile or interesting?
We had our own reasons, but this
is the post-hoc justification
that I think is most convincing.
It's increasingly common for
researchers outside of NLP
to evaluate new techniques on
language understanding tasks.
I think this is much more true
in the last couple of years
than it had been in
the preceding years.
This can potentially be
really useful to us--
having lots and
lots of researchers
with potential orthogonal
areas of expertise,
trying to give us new methods
that work on our problems.
But we're only going to learn
much-- or at least we're only
going to have an
easy time benefiting
from this-- if this
work is being evaluated
on tasks that represent
real open problems in NLP.
And this often doesn't happen--
it's frustrating seeing papers
with very interesting methods
that show convincing evaluations
on game playing or computer
vision, and then the NLP
part of the evaluation
is on something
that feels like it's
lost the interest of
the field years ago,
often for good reasons.
So on this view,
the goal of GLUE
is this service to
expert researchers
who are not NLP experts.
So we want to provide a set
of tasks, metrics, baselines,
and code, that represent
open problems of interest
to researchers in NLU.
And the idea is,
basically, it's never
safe science to just
hill climb on a number
but we want to get
as close as we can.
We want to say, all right, this
is a number where if you can
make it go up for
a good reason--
you understand why
it's going up--
that is the best
evidence you can easily
get that you are making progress
toward something interesting.
And we're just trying to
offer this as a service.
We're not trying to enforce any
particular experimental design.
Making a convincing experiment
is up to the users.
So how did we flesh this out?
GLUE is essentially a
thin wrapper around nine
sentence-understanding tasks.
These are all previously
published data sets.
GLUE is not a data
set-- it's just
this wrapper in this tool kit.
We tried to pick data
sets that were unsolved,
that seemed like they
were still subjects
of active research
in NLP, that were
as varied as possible in
the amount of training data
that they used, and that were as
varied as possible in the style
or genre of language
that they used,
at least subject to the
constraint that the text be
readable by maybe
college-educated US, UK English
speakers.
And the hope was that by
picking a reasonably diverse set
of tasks that fit
these criteria,
we'd be able to make the
case that, OK, if you have
a single recipe-- a single
model-- that can do well
in all of these, that's evidence
that you're making progress
on a somewhat more
general problem.
We imposed a couple of
additional constraints.
We wanted to use very
simple task APIs.
For the original
GLUE benchmark, we
were focusing on
sentence or sentence-pair
inputs and classification
or regression outputs.
So we weren't
including any tasks
that required generation or
any other kind of structure
prediction.
And the idea here was just
it seemed conceptually
possible to isolate the
thing we care about--
isolate this core learning
of syntactic, semantic, maybe
pragmatic processing, and common
sense as it applies to NLU.
We could isolate these skills
using very, very simple task
APIs and that would make it
easier to focus on the thing
we care about--
that in an ideal world, dealing
with complex output spaces
would just be a distraction--
not necessarily how it played
out but that was the idea.
We also put together a fairly
simple leaderboard API--
this is sort of a detail but--
To get scores on this benchmark,
you download unlabeled test
data, you run your model on
it, you upload your predictions,
and you get a score
from our website,
very much in the style
of Kaggle or SemEval.
This means that we're not
imposing any restraints
on the kinds of software
that people can use,
or the kinds of models
that people can use.
We don't care if people are
doing multi-task learning
or pre-training, or even
using neural networks.
The only bit of
babysitting we're doing--
the only thing we're enforcing--
is placing a hard limit on how
often people can submit to
the test set-- basically to try to earn
the trust of the people who
are giving us access to
their private test data.
Here are the data sets we
wound up with for, again,
the first iteration
of GLUE. I'll go over
this from a couple
of different angles.
So first looking at
training set size
plotting this on a linear scale,
we get a reasonably wide range.
For the Quora question pairs
and multi-genre NLI data,
we've got close to a
half-million examples.
I think this is
toward the high end
of what you might realistically
see doing NLU work.
And at the low end,
we've got one task--
this NLI format Winograd
data set with under 1,000.
Four of these test
sets, in bold, use
private data-- the fully
labeled test sets have never
been published-- which helps
us gain some degree of trust
that these test numbers mean
what they're supposed to mean.
And then looking at domain,
we have single-genre tasks
using Quora data-- this
is community question
answering, social media question
answering, fiction books,
Wikipedia, and then several
of these miscellaneous or
multi-domain data sets
include transcribed speech
as well-- so we think that's a
reasonably broad range.
I'll give just a couple of examples
of what these tasks look like.
CoLA-- this is one
from our group at NYU--
is a binary
classification task where
you have to decide
if a string of words
forms a possible
English sentence.
This is our most syntax or
structure-oriented task.
It's an odd task
because we're not
allowing for any kind of
structured prediction,
so this is just this
Boolean classification task.
You're asked to
decide if a string is
a valid sentence of English.
So for example,
you'll get examples
like, "Who do you think that
will question Seamus first?"
This is understandable, but if
you're a native English-speaker
you'll recognize
that there's really
no reason that this word
"that" should go there.
This doesn't sound right.
This isn't anything
you'd generally say,
and so you'd mark
that as unacceptable.
And then we have a
roughly equal number
of contrasting
sentences that are
perfectly fine like the
second example here.
And we think this is
an interesting task
and interesting data source
because Boolean judgments
of these kinds are a
major source of evidence
in linguistic theory--
especially for a lot
of work in syntax,
this is the primary
gold-standard form of evidence
they work with.
And this allowed us
to actually collect
this corpus from published
linguistics literature
and build a data set that
is comprised entirely
of examples that
linguists thought
were interesting and relevant
to key theoretical points--
interesting
distribution of data.
One of our more
semantics-oriented tasks is
this combined data
set from the RTE--
Recognizing Textual
Entailment Challenge Series.
This is a binary classification
task over sentence pairs.
The task is to decide if one
sentence entails another.
If I tell you that
this text is true,
would you reasonably
believe on that basis
that this hypothesis is true?
And here we get a no
case-- a negative example.
And this is drawn from
these four data sets
from this competition
series that
was running in the late aughts,
and it's relatively small--
about 2,500
examples for training.
The last one I want to
highlight is our GLUE variant
of The Winograd
Schema Challenge.
In the version that
we're using here,
it's a binary
classification task
on expert constructed
pairs of sentences.
You're trying to decide what
an ambiguous pronoun refers to,
but these pairs were
constructed such
that there are few usable
statistical cues that
will give you this
answer and it's
meant to require you to do some
fairly high-level reasoning
about the situation described.
So if Joan is the
person receiving
candy in this
situation, that suggests
that Joan rather
than Jane is hungry,
and so we get this
entailment in this example.
And we got private test
data from the creators
of The Winograd
Schema Competition.
So that's my very
brief overview of GLUE.
Let me do the sort of boring
but kind of inevitable thing
and just run through
what happened
in the leaderboard-- run through
what we've learned just staring
at the numbers to start with.
So I'll be showing just results
on the single-number GLUE
benchmark score.
Our bag-of-words model--
we represent sentences
as the sum of word
embeddings, feed those
into a lightweight
neural network,
that gets a little
bit under 60 points.
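For illustration, here's a minimal sketch of a baseline in this style-- sum pretrained word embeddings and feed them into a small classifier. This is illustrative PyTorch, not the actual GLUE baseline code.

    import torch
    import torch.nn as nn

    class BagOfWordsClassifier(nn.Module):
        # Sum pretrained word embeddings, then classify with a small MLP.
        def __init__(self, embeddings, num_labels, hidden=512):
            super().__init__()
            # embeddings: (vocab_size, dim) tensor of pretrained vectors
            self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
            self.mlp = nn.Sequential(
                nn.Linear(embeddings.size(1), hidden),
                nn.Tanh(),
                nn.Linear(hidden, num_labels),
            )

        def forward(self, token_ids):
            # token_ids: (batch, seq_len); word order is discarded by the sum
            return self.mlp(self.embed(token_ids).sum(dim=1))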
These points are sort
of unitless nonsense.
This is the average of the
nine task-specific metrics
for each of our nine tasks.
We couldn't really
think of a better way
to design a metric than that--
it doesn't entirely make
sense, but bear with me.
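To make the scoring concrete, here's a sketch of that macro-average, assuming (as GLUE does) that a task reporting two metrics gets them averaged internally first. The task names and numbers below are purely illustrative, not real leaderboard entries.

    def benchmark_score(task_metrics):
        # Average within each task first, then across tasks.
        per_task = [sum(ms) / len(ms) for ms in task_metrics.values()]
        return 100 * sum(per_task) / len(per_task)

    # Illustrative numbers only.
    print(benchmark_score({
        "CoLA": [0.30],          # Matthews correlation
        "STS-B": [0.85, 0.84],   # Pearson and Spearman
        "MNLI": [0.72],          # accuracy
    }))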
So the trivial baseline
gets a little
under 60 nonsense points.
We've got two strong
baselines which
really represent where we were
starting for the competition.
This baseline
involves us building
a reasonably sophisticated
neural network
model for each of the
nine tasks separately,
tuning that model for
each of the nine tasks
separately, and training it
on the task-specific training
data.
So here we use word embeddings.
We don't otherwise use any kind
of pre-training or multi-task
learning.
And this got about
66, 67 points,
and this represented pretty
close to the state of the art
on these tasks circa late 2017.
Our other serious baseline was
the best available sentence
embedding model--
this is from Subramanian
et al. at Montreal.
This is a model where
you do some pre-training,
you do some multi-task learning,
but ultimately your goal
is to build a function that
maps sentences to vectors
where you're not intending
to fine-tune that function,
you're just intended to get
sentence embeddings that you
can use for your task.
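As a sketch of that paradigm: the encoder is frozen and only a classifier on top is trained. Here `encoder` and `classifier` are hypothetical stand-ins for any sentence-to-vector model and task head.

    import torch
    import torch.nn as nn

    def train_on_frozen_embeddings(encoder, classifier, batches, optimizer):
        # The sentence-to-vector paradigm: the encoder is never updated.
        encoder.eval()
        loss_fn = nn.CrossEntropyLoss()
        for sentences, labels in batches:
            with torch.no_grad():       # no gradients into the encoder
                vectors = encoder(sentences)
            loss = loss_fn(classifier(vectors), labels)
            optimizer.zero_grad()
            loss.backward()             # updates the classifier only
            optimizer.step()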
And this doesn't-- these
baselines are roughly
equivalent, which is a clue that
we really hadn't been making
progress toward our big
goal up until this point--
that our best available methods
for general-purpose modeling
or pre-trained modeling for NLU
were really no better than what
you were able to do without
any of these tricks.
The first method that really
got off the ground was ELMo--
that I'm sure many of you
have seen or even worked on--
one of the first clear successes
with training a large language
model and using
that as a component
of a task-specific neural network.
This was concurrent with
the development of GLUE
and so we just kind of dropped
it on the leaderboard ourselves
as the real starting point.
Then we got a rapid succession
of other models showing up.
Open AI followed relatively
soon after the launch
of the competition with
the first language model
to use a transformer style
in the neural network
architecture, and the
first to use fine-tuning--
where the idea here is that to adapt
this model to a specific task,
you back-propagate into the
pre-trained language model
itself without adding
any more parameters
or layers than necessary.
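For contrast with the frozen-encoder recipe above, here's a minimal fine-tuning sketch in the modern Hugging Face transformers idiom-- an anachronism relative to the original GPT code, but it shows the idea: one new linear head, and gradients flowing into every pretrained layer.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # adds only one linear head
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    batch = tokenizer(["A premise."], ["A hypothesis."],
                      padding=True, return_tensors="pt")
    labels = torch.tensor([0])
    loss = model(**batch, labels=labels).loss
    loss.backward()   # back-propagates into the pretrained encoder itself
    optimizer.step()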
This turned out to
help quite a bit.
That was followed fairly quickly
by Google's BERT,
which replaces the classic
left-to-right language-modeling
pre-training task with this
modified masked language-modeling
setup and gets an
even larger gain.
So at this point, we
were seeing what looked
like accelerating progress.
We were sort of surprised
to see how much mileage we
were able to get out of more
or less just language modeling
and fine-tuning.
So we went back to do
something we really
should have done to
begin with, which
is measure human performance.
So we wanted to know how
much headroom GLUE had left.
We wanted to know at what point
are the scores getting so good
that we should sort
of abandon ship--
we should say, all
right, we've learned
all we can learn
from this benchmark.
And there's no perfect
way to measure this.
You can't directly
observe Bayes error,
but we figured that
trying to do a reasonably
high-quality human evaluation
would give us hints.
So for each of the nine
tasks, we trained crowdworkers
on the task using
both instructions
and some actual
interactive-mode training data.
And we got multiple labels for
each test example we were using
from multiple annotators--
took a majority vote.
Not that this is absolutely the gold
standard in human evaluation,
but it should give us a reasonable
clue of how well one can easily
do on these tasks.
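For concreteness, the aggregation step can be as simple as this sketch:

    from collections import Counter

    def majority_label(annotations):
        # One test example, several crowdworker labels, majority vote.
        return Counter(annotations).most_common(1)[0][0]

    assert majority_label(
        ["entailment", "entailment", "not_entailment"]) == "entailment"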
And we found humans do significantly
better than BERT--
I think about a five or
six-point difference over BERT.
But then within a few weeks
of us publishing this result,
Microsoft came out
with a BERT variant--
this just changes how
BERT is fine-tuned--
that did slightly
better than humans.
So by our definition, GLUE
has about this much headroom.
So this method added
some information sharing
across tasks.
You do a phase of
training where
you train on all nine tasks
at once before you fork off
into nine separate models.
This turned out to help.
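A rough sketch of that shared phase, with uniform task sampling as an illustrative choice and a hypothetical per-task loss head:

    import random

    def multitask_phase(model, task_batches, optimizer, steps):
        # One shared model sees batches from all tasks at once;
        # afterwards you'd copy it and fine-tune one fork per task.
        tasks = list(task_batches)
        for _ in range(steps):
            task = random.choice(tasks)
            batch = next(task_batches[task])    # iterator per task
            loss = model.loss(task, batch)      # hypothetical per-task head
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()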
And we've seen a
little bit of somewhat
more incremental
progress since then.
RoBERTa takes the
base BERT model
and trains it longer on
more data plus a couple
of other small changes.
ALBERT switches to a more
efficient parametrization
of the transformer
that lets them
train a bigger, deeper model.
And T5 from Google
gets even larger
and adds some use of labeled
data, including translation
data, during pre-training.
And this is where we are
at this benchmark now.
We're still seeing
some progress,
but it seems like we're getting
some amount of leveling out.
We're above human
performance, and it's not clear
how much room there really
is to go on this benchmark.
And this situation-- or
at least the beginnings
of this situation
a few months ago--
is what motivated us to
put together a follow-up
project called SuperGLUE.
So SuperGLUE was
joint work with all
of the same people behind GLUE
plus Nikita Nangia and Yada
Pruksachatkun at NYU.
With SuperGLUE, we
basically wanted just
to rebuild GLUE from
scratch and do it better--
make something a little bit
harder, a little bit more
persuasive.
I was talking to Emily
a couple of hours ago
and had the realization
that anytime I
write a paper that
gets some attention,
I realize all the ways I wish
I'd written the paper better
and then I go back and
try to write that paper.
So this is sort of our
version of this for GLUE.
So one thing we did differently
is we started with an open call
for data set proposals.
GLUE, like a lot
of shared tasks,
was organized somewhat ad hoc.
We had picked tasks just
based on what the organizers
thought was fairly
representative
of this thread of NLU research.
With SuperGLUE, we wanted to
say, all right, we're getting--
some people are taking us as
speaking for the NLU community,
so let's actually try to make
some attempt at doing that.
We had said, all
right, we'll take
any tasks that get submitted
that pass some filters.
We got 30 or 40
candidates, depending
on how you count the examples we
had to rule out for copyright
reasons or technicalities,
and we filtered those down
by running a human evaluation
and then a BERT model
evaluation on each
of these tasks.
And our intent was just to keep
any task for which humans were
doing significantly
better than BERT,
and that left us
with eight tasks.
I think the more striking result
on SuperGLUE-- more striking
than any of the results that
actually showed up
on the benchmark-- is that
for the majority of tasks
that publishing NLP experts
submitted to us as examples
of hard open problems in
language understanding, BERT
Large trained with the
default hyperparameters
was already basically
at human performance,
so it was hard to find
a viable set of data
sets to put together here.
And to even get to
eight tasks, we actually
had to loosen some
of our restrictions
on what we would consider.
So we moved from
sentence-level inputs
to also allow a paragraph
or short-dialog inputs,
and we slightly relaxed the
set of output mechanisms
we would allow.
So we wound up with
these eight tasks.
We kept in two of
the tasks from GLUE
that had survived through
the arrival of RoBERTa
and some of these newer models,
including the RTE challenge
set, and the Winograd
schema data--
we're using a different
format but same basic setup.
And looking at training
data-set size--
I think this was a
really obvious trend--
that the tasks that survived
through this filtering
tended to be ones with
relatively small training sets.
For a quick visual
comparison, here's
the linear scale plot of
the data set sizes for GLUE.
Here's the same
plot for SuperGLUE.
We've got three tasks with under
1,000 examples and all but one
of them are under 10,000--
so smaller data regime.
Zooming in on just a couple of
examples of how these tasks are
distinctive-- one of the ones we
added was the CommitmentBank.
This is a textual
entailment task--
roughly the same idea as the
RTE task I presented earlier.
But the inputs are somewhat
longer and more complex-- here
the input is a short dialog.
And this is targeting a special
case of textual entailment
that's meant to be somewhat
distinctive and potentially
somewhat difficult, which are
cases where the hypothesis
sentence-- the thing whose truth
you're wondering about--
is more or less verbatim
extracted from the source text.
So here the source text contains
the phrase, "What do you think,
do you think we're
setting a trend?"
and the hypothesis is,
"they are setting a trend."
And the idea is this data set
is meant to be balanced so that
in about half of the examples
for one reason or another--
because of questions,
or hedging,
or attribution of
beliefs to others--
this text does not
actually imply the clause
that it literally contains.
And so this is both an
interesting, weird special case
of entailment and a fairly small
data set with only 250 training
examples.
One more example of a
task-- we included MultiRC.
This is one of the many
multiple-choice reading
comprehension data sets
to come out recently.
It's one that's proven
to be relatively hard.
The inputs are paragraphs
rather than sentences,
and we're asking
relatively straightforward
factoid questions about what
the paragraph as a whole
is conveying.
And this is also relatively
small with about 5,000
examples.
So that's a quick
overview of SuperGLUE.
Let me again, do the
boring leaderboard thing
and then we can start talking
about some more fun stuff.
So our bag-of-words baseline
was fairly low again,
in the high 40s.
BERT by design didn't
perform that well.
We're getting 68 or 69
unitless points for BERT.
What was striking,
though, is that RoBERTa--
again the same model,
just trained on more data--
did quite a bit better,
I believe about 15 points better.
That's a much larger gain
from BERT to RoBERTa than
we got on the original
GLUE benchmark.
So I think what
this is suggesting
is that we explicitly filtered
out data sets for which BERT
was doing well, but many
of these remaining data
sets for which BERT
was doing poorly,
it doesn't seem like
BERT was doing poorly
for a particularly deep reason--
just doing
a little bit more tuning, a
little bit more refinement
on the same basic idea,
the same basic model,
was able to recover a lot
of that lost performance,
if that makes sense.
Current state of the
art is Google's T5,
a few points higher,
which is one point
behind human
crowdworkers, so SuperGLUE
is pretty close
to being saturated
at least by our rough
estimate of an upper bound.
So I'm going to
be bouncing around
for the rest of the
talk trying to make
sense of where this leaves us.
But let me start by just
giving some caveats to my sales
pitch about SuperGLUE.
I still think that
GLUE and SuperGLUE
are interesting,
informative, useful metrics
but with a couple of caveats.
So GLUE and SuperGLUE are
built only on English.
We're deliberately targeting the
easiest, highest-resource cases
of language-understanding tasks
that seem realistic to us
by focusing on
English-- dealing
with lower-resource
languages, or even just
languages that are not quite as
international and weird, would
likely be harder and
somewhat different.
More importantly,
GLUE and SuperGLUE
use lots of naturally-occurring
or crowdsourced data.
Most of the text
is either extracted
from a web source like Wikipedia
or written by crowdworkers.
What this means is that as with
really most sources of text you
can find, they're going to
contain evidence of social
bias-- that the word "nurse"
will show up with female names
and pronouns much more often
than male names and pronouns--
that African-American
names will show up
in more negative sentiment
contexts than European-American
names.
These are just facts about
the text we have access to.
Because we're using this data
in an evaluation benchmark,
all else being equal, models
that learn and use these biases
will do better on
these benchmarks,
and I'm not sure we even need
that much quantitative evidence
that this is happening,
but we have some.
In SuperGLUE we included one
little targeted evaluation
as one of our auxiliary tasks.
So this is a task
that we deliberately
didn't count toward
the mean score
but that we still
required users to submit.
This is a special case of
textual entailment where you
have these gender-balanced
examples: any
time you have a gendered
pronoun in an example,
you'll have a mirrored example
with the opposite pronoun
and adjusted names.
And what we found is that T5--
the current state of the art--
is 10 times more likely to be
thrown off by irrelevant gender
information than humans.
This is sort of an odd metric,
but it's the best thing
we had access to.
It essentially means that if a
model gets an example right--
where, for example, nurse
co-refers with a female name--
it's pretty likely to
get that example wrong
when you make the
counter-stereotypical switch
and make the sentence
refer to a male nurse--
the model is likely to sort
of lose the thread of what's
being talked about.
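One way to make that paired check concrete-- a sketch of the general idea, not the exact benchmark metric, with `model.predict` as a hypothetical interface:

    def parity_rate(model, pairs):
        # Fraction of (original, gender-swapped) example pairs on which
        # the model's prediction is unchanged by the swap.
        same = sum(model.predict(a) == model.predict(b) for a, b in pairs)
        return same / len(pairs)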
And this is just a big
awkward open problem
in pretty much just large-scale
machine learning in general.
We don't have any sort of easy,
reliable, politically-neutral
recipe for mitigating
these biases,
and even measuring them
well in a way that's
relevant to a task
you're trying to build
can be somewhat
difficult. So this is just
a thing to watch out for
with this kind of work.
And just flagging one more
point-- not quite a limitation,
but one that bears mentioning--
we built both of these benchmarks
around these maximally simple
output-space tasks-- these
classification, or prediction,
or multiple-choice tasks.
I still think that this
is a reasonable way
to go about working on
language understanding,
but this also means
that we did rule out--
we did rule out a
lot of tasks that
represent difficult problems
in language understanding
in a number of ways.
So where does this leave us?
We're more or less
at human performance
on both of these benchmarks.
We clearly haven't solved
language understanding.
To give one of many,
many, many different kinds
of evidence I could
give for this claim,
we included another diagnostic
task attached to SuperGLUE
made up of relatively
straightforward examples--
nothing really tricky,
but just trying
to isolate a bunch of
different kinds of reasoning
or a bunch of
different phenomena
that might come up in
text understanding.
One example of this--
this is from Julian--
is this simple pair
of examples showing
that if you ate pizza with
olives, you ate the olives.
If you ate pizza with friends,
you didn't eat the friends.
And state-of-the-art models
do pretty well on these
diagnostics.
These definitely
aren't highlighting
massive gaping
failures in reasoning.
But on examples like this
that are still pretty
straightforward, we do
see a 10-point gap between
the state-of-the-art
models and humans.
So one of many kinds of evidence
that we're not there yet--
that our main benchmark
isn't testing everything
we might want it to test.
And this is a
trickier point to make
and one that I'm a little less
confident about asserting,
but it really doesn't
seem like we've
solved language understanding
even if you focus
on IID evaluations for these
simple classification-based
problems.
I'm not willing to accept that
we've solved textual entailment
even if you define
that task in a nice IID
in-domain kind of setup.
To come at this
from one angle: we
have one task left where there's
still a big gap-- this Winograd
schema task, where we're still
seeing a six-point difference
between the state of
the art and humans,
and maybe that will disappear
if you scale up T5 by one more
order of magnitude in data.
But I think that the more
hand-waving reason that I tend
not to believe that
we've gotten there yet
is that even for the main
evaluations for tasks--
like textual entailment,
or question answering,
or just the kinds of tasks we're
dealing with in SuperGLUE--
to get these tasks
right consistently
and for the right
reasons, we need
to be doing a very good job
at resolving coreference,
at managing entities,
at parsing sentences--
at just dealing with a lot of
kind of intermediate structure
that, from the best
evidence we have available,
these models are only OK at.
So it seems likely that
our data sets are either
too biased or too
noisy to really pick up
on the weaknesses
that these models tend
to show in some areas.
Handwaving point--
happy to get back to
this at the end of the talk,
but this is just the uneasiness
that I want to leave
you with before I move on.
So I'm going to detour
into a side project that's
only loosely related
to this, but it's just
something I wanted to share
and also deals very broadly
with trying to take stock of the
state of the field in this kind
of classification-based
language-understanding
universe.
And this is just getting into
this line of work that's really
exploded in the
last couple of years
of trying to analyze what
big pre-trained neural
networks know.
And for now, I'm going
to be just asking
why does BERT work so well?
What does BERT know?
We're talking
entirely about BERT
just because a lot
of this work exploded
around the emergence of the
BERT model in particular,
and analysis work lags modeling
work by one publication cycle
and so BERT is the
latest thing to have
been subjected to this
full gauntlet of studies.
So the kinds of--
I want to just give
a survey of kind
of what we know--
give a few, I think,
representative
examples of claims
that have been made
in the literature
credibly about what's inside
BERT, what BERT's doing,
why it works.
So one example that
I was an author on--
this work from Tenney et al.
We showed that ELMo and BERT
both learn essentially perfect
features for
part-of-speech tagging--
that they really seem to be
implicitly doing part-of-speech
tagging about as well as it's
possible to do--
that both models are
learning pretty good features
for both dependency and
constituency parsing
but not perfect, and BERT is
significantly better than ELMo.
So one of the ways in which this
seems to be a stronger model
is in its ability to
track sentence structure.
And looking at coreference--
looking at establishing links
between pronouns and other pronouns
or pronouns and names--
ELMo and BERT Base really
aren't getting off the ground.
They're not extracting
features that
are at all usable for doing
coreference resolution better
than chance.
But BERT Large is
doing somewhat better
than chance-- it's starting
to get off the ground.
This is an interesting
result because I
think it gives us a
clue as to a phase shift
that we're seeing as
these models get large.
And it seems pretty clear from
what we've been seeing that
as we're scaling up the model
sizes and the data-set sizes
for these pre-trained models,
these models are in some cases
starting to pick up on more
difficult or more abstract
abilities than they had
been at smaller scales--
so a little clue to that.
Just to survey some
of the other claims
that have been made in
the same sort of universe.
There's a follow-up paper to
our paper by some of the same
authors-- but not me--
that points out that
lower layers of BERT
express features for
lower-level tasks.
So part of speech knowledge
is concentrated in low layers
and correspondingly, knowledge
of things like coreference
and semantic roles tends
to be in the higher
layers of these big 12
or 24-layer networks.
There is a strange
and interesting paper
from Hewitt and Manning
showing that by looking
at the geometry of the
word-in-context representations
coming out of BERT, you can
show that these are spatially
arranged in a way that somewhat
reflects syntactic structure--
more evidence that
these models are doing
at least some implicit parsing.
And using evaluations on
hand-built test sets--
kind of like our pizza
with olives, pizza
with friends example--
another paper has
shown that BERT
relies quite often
on very brittle
syntax-unaware heuristics.
When doing a lot of these
tasks like textual entailment,
these models will say, oh, if
there's strong string
overlap between the first
sentence and the second,
that's a good clue
to some label.
But again, some evidence
of one of these phase
shifts, the larger BERT model--
again, same model,
same training data,
just more parameters
and more layers--
is relying on these
heuristics much less.
So these are claims that
have been tossed around.
I think they're
basically reasonable
but I want to throw
doubt on all of these.
I think there's not necessarily
anything wrong with how we're
doing analysis, but there's
something wrong about exactly
these four slides that I think
goes beyond just my talk.
So probing studies
loosely defined--
this is how we're referring
to the kinds of results
that I just presented--
are becoming a very
common tool for trying
to make claims about what
models like BERT know.
The experimental designs that
go into studies like these
tend to make a number of
pretty strong assumptions
about the models we're studying.
To give a fairly
straightforward example,
the edge-probing methods that
we used to make claims about how
well these models know
part of speech, or parsing,
or coreference, were based on
this assumption that if these
models are implicitly doing
something like part-of-speech
tagging or coreference
resolution,
then it should be possible to
read that information off with
a very small MLP--
essentially a linear model--
and there's really
no reason to believe
that should be the case.
It's not a crazy
assumption, but it's not
an obviously valid
assumption either.
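The assumption in miniature, as an illustrative sketch: a single trainable linear layer over frozen token representations, with everything else held fixed.

    import torch.nn as nn

    class LinearProbe(nn.Module):
        # If the frozen encoder implicitly computes, say, part of speech,
        # this one linear layer should be able to read it off.
        def __init__(self, encoder_dim, num_tags):
            super().__init__()
            self.linear = nn.Linear(encoder_dim, num_tags)

        def forward(self, frozen_token_vectors):
            # (batch, seq_len, encoder_dim), detached from the encoder,
            # so only the probe's weights are ever trained
            return self.linear(frozen_token_vectors)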
And so what we
wanted to do is just
go back and ask if we pick
one question that we're
interested in about
what these models know,
and subject it to the
full range of ways
that we might ask this question,
how consistent the answer
is going to be--
at least for the sort
of high-level, what-does-BERT-know
kinds of answers that I was just
giving on previous slides?
So this is actually
an odd paper.
This came out of a linguistics
PhD seminar in the spring,
hence the 14 first authors--
fun project to do.
But anyway, we wanted to ask
this one fairly narrow question
that we thought we
could operationalize
very nicely which is how well
BERT tracks NPI licensing.
So NPIs-- negative
polarity items--
are words like any or ever
in this kind of usage.
They don't contribute a
ton to sentence meaning,
but they're a clue to how
the sentence is structured,
because they're only
allowed to appear
in the scope of a particular set
of linguistic operators, most
classically negation.
So here, you might
have a sentence
like I see kids who are
not eating any cookies.
This means roughly the same
thing as I see kids who are not
eating cookies,
but this word "any"
is a clue that you're in
the scope of negation--
that this part of the
sentence is being negated--
and you can't rearrange
this to move "any"
outside of the
scope of negation.
So "I see any kids who
are not eating cookies,"
you have a negation
but the negation
doesn't scope over "any"--
this sentence just sounds weird.
It's not clear what that "any"
means or what it's doing there.
This is a fairly
complex phenomenon.
There's a fairly
complex set of rules
that govern where
these things can occur,
but they're well studied.
We have a pretty good command
over where these things can
show up in English.
These rules depend on
long-distance dependencies
and fairly high-level
syntactic structures
so this seems like
something where
it shouldn't be trivial
for a language model
to pick up on these phenomena.
And so we wanted to pick
the [INAUDIBLE] question.
The big-picture
question we're asking
is, does BERT know where
NPIs are licensed?
So we're going to ask this
a bunch of different ways,
and I'm going to speed
through a bunch of slides,
getting at this in as
many ways as I can.
So one thing we sort
of kept constant
across all of our studies
was one set of test data
we were using to
evaluate performance.
We created nine test
sets, each built
around hand-built grammars that
we set up to generate sentences
that were more or less
topically coherent,
that looked like English,
and that in a controlled way
did or did not have these
NPI licensing violations.
And these were written by
linguistics PhD students
and then we validated
each one of these grammars
on Mechanical Turk to
make sure that humans
agreed with our judgments.
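As a toy stand-in for what such a grammar does-- the real grammars and vocabulary are much richer, so treat this as purely illustrative:

    import random

    SUBJECTS = ["kids", "students"]
    VERBS = ["eaten", "bought"]

    def generate_pair():
        # One licensed NPI ("any" under negation) and one violation
        # ("any" outside the scope of its licensor).
        s, v = random.choice(SUBJECTS), random.choice(VERBS)
        good = f"I see {s} who have not {v} any cookies."
        bad = f"I see any {s} who have not {v} cookies."
        return (good, 1), (bad, 0)  # 1 = acceptable, 0 = unacceptable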
So study number one--
let's try teaching BERT to do
the acceptability judgment task
to decide if a sentence
is coherent English,
and then test it on the
acceptability judgment
task that's implicit
in our data here.
It seems like a reasonable
way to do this kind of study.
It's on par with
how a lot of people
use, say, textual
entailment data.
It's training on a
standard data set
and then testing on a custom
domain-specific test set.
So here's what we get.
We train on CoLA.
This is the data set I mentioned
earlier in the talk,
with acceptability judgments
from a variety of domains.
We test on our new test set.
We're essentially
measuring accuracy.
The metric I'm showing here
is Matthews Correlation
which measures the
same thing as accuracy
but just sets random
guessing at zero.
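For a quick illustration with scikit-learn:

    from sklearn.metrics import matthews_corrcoef

    # Like accuracy, but chance-level guessing scores about zero
    # and perfect prediction scores 1.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1]
    print(matthews_corrcoef(y_true, y_pred))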
So these results
look fairly positive.
Our simple bag-of-words baseline
is essentially at chance.
And BERT isn't doing perfectly
but it's doing fairly well--
0.76 correlation-- that's
good but not great.
So we'll make a--
if we were to give the elevator
pitch version of this result,
or were to picture how
someone else might describe
this result in their
related-work section citing
our work, someone
might say that BERT
knows a bit about NPI
licensing but is not perfect.
Let's try doing this
study a slightly different way.
Instead of using our
general-purpose training data,
let's use domain-specific
training data
and build a more specialized
train-test split.
So here what we'll do is we'll
do a hold-one-out evaluation
by environment.
So we'll train our model, for
example, on relative clauses,
and conditionals, on
lots of kinds of sentence
that don't include
negation, and then we'll
test it on sentences
including negation.
So we'll teach it what it
means to look for licensing
violations, but it will
have to already know
whether negation
is a valid licensor
and how negation scope works.
So we'll do this nice
hold-one-out evaluation
for training and
testing, but we'll
test with, again, the same
data, the same metrics.
Here the results look
somewhat different.
BERT does a bit
better in this setting,
but our trivial
bag-of-words model
is suddenly able to
pick up on what's
going on, which is worrying.
I think this small
gap would read
as much more of a
negative result.
I think, again, if you
were giving the elevator
pitch here, you'd say BERT is
really just about at baseline.
It's not doing anything
very interesting here.
BERT probably doesn't
track scope that well.
You could do things
a different way.
It's very easy to
organize this data
we created into
these minimal pairs
where you have
two sentences that
differ by a one-word edit,
usually a one-word swap.
We think this is
a reasonable way
of isolating this phenomenon.
Let's use a slightly
stricter metric
where in order to
get an example right
you have to make the
correct positive judgment
for the positive example and
the correct negative judgment
for the related
negative example.
So you have to recognize
that this difference makes
a difference, as it were.
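In sketch form, with `model.predict` as a hypothetical interface:

    def paired_accuracy(model, minimal_pairs):
        # Strict metric: an item counts only if the model accepts the
        # acceptable sentence AND rejects its one-word-edit twin.
        correct = sum(model.predict(good) == 1 and model.predict(bad) == 0
                      for good, bad in minimal_pairs)
        return correct / len(minimal_pairs)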
So: same two training
setups, same test data set,
just a different
formulation of the metric.
The results this way look
pretty much the same as before
and we could make either
of the same conclusions
we made on the previous slide.
This change doesn't
change that much.
But we could frame this
a little bit differently.
Again, same training
setup, same test data,
let's just define a
slightly different metric.
Let's look
at the probabilities
that the model assigns to
these sentences: how likely is
sentence one to be acceptable?
What's the logit coming
out of our classifier?
And how likely is sentence
two to be acceptable?
And let's say the model
gets an example right
if it assigns a
higher probability
to the correct sentence.
So this is the
forced-choice task.
This is a common way of using
these acceptability judgments
in the linguistics literature--
plenty of precedent for it,
and a slightly easier metric.
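A sketch of that forced-choice scoring, with a hypothetical `acceptability_logit` accessor:

    def forced_choice_accuracy(model, minimal_pairs):
        # The model is right if it assigns a higher acceptability score
        # to the good sentence than to its bad twin, regardless of where
        # its absolute decision threshold sits.
        correct = sum(
            model.acceptability_logit(good) > model.acceptability_logit(bad)
            for good, bad in minimal_pairs)
        return correct / len(minimal_pairs)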
Here the results look
a little bit different.
Our baselines are
high but especially
in this hold-one-out
training environment,
BERT is at ceiling.
BERT is doing completely
perfectly-- the correlation
is something like 0.995.
Here we have evidence that BERT
is generalizing to meaningfully
out-of-domain test data on a
fairly sophisticated phenomenon
with essentially perfect accuracy.
I think here if you were
to give the elevator pitch,
you would say something
very positive--
BERT knows what it's doing.
We can do this one more way.
We can use BERT's
language-modeling heads
to test it with
no training data.
I'm not going to go
into much detail here.
And here we get a
more lukewarm result.
We could also do a probing-style
study where we just
ask which word positions are
under the scope of a licensor
rather than asking
about acceptability.
Here we get, again, a
more lukewarm result.
What I want to leave you with
is just this point that it's
fairly easy to put
together studies
that give you this variety
of different conclusions.
And I'm not arguing that
any of these methods
are wrong-- these
are all precedented.
I would claim that if I sort
of put some time into it
I could publish a
study that included
any one of these methods--
that these are all
sane, and publishable,
and like recognized ways of
doing analysis and the NLP
literature, and I don't
think that's a bad thing.
I think the only
thing to worry about
is just that when you distill
the findings of these studies
down to these little
high-level snippets--
these kinds of
claims that are meant
to be most useful in
understanding why BERT
works, and why it
fails, and what we
should do next-- you tend to
lose a lot of information that
is quite important to actually
drawing reasonable conclusions
here.
So that's where I'll leave that.
That's the end of
our little detour
into meta-BERTology land.
Let me get back to the fun
and weird part of the talk.
So we're in this
weird situation.
It feels like there are plenty
of big open problems in NLU
but even in this sort
of sentence-level
or paragraph-level simple output
space IID setting in English--
in like the easiest possible
version of NLU evaluation--
it really doesn't feel
like we've solved it.
It really doesn't
feel like we've
built robust human-like models
for language understanding.
But we've also maxed out all
of the available test data sets
and it's not clear that we've
come up with that many new ones
that we can kind of
use to fill in the gap.
Again, we tried to
create SuperGLUE just
at the beginning of this year--
2019-- we had to throw
away most of the data sets
that were proposed, many of
which looked legitimately
hard when they were submitted.
So it seems like at
this particular moment,
our ability to build models,
and scale up models, and improve
models, is growing
faster than our ability
to build correspondingly
difficult evaluation sets.
So what do we do?
Do we give up and work
on something else?
I mean, sure, there are lots
of other problems in NLP
that are arguably more
urgently important,
but I still think this is a
hard, open problem that I'd
like to work on, so
how might we do that?
One approach that's been getting
some mileage in this direction
is to use something in the
family of adversarial filtering
to semi-automatically create
data sets that are hard
for state-of-the-art models.
And to give my rough
caricature of this idea,
you essentially take a
data-generation process--
either an automatic process
or a process of crowdworkers--
that gives you test examples,
and you discard the examples
or you increase your probability
of discarding examples
if the current state-of-the-art
model gets them right.
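In sketch form-- again a caricature of the idea, with a hypothetical `adversary.predict`:

    import random

    def adversarial_filter(examples, adversary, keep_if_solved=0.0):
        # Discard (or probabilistically discard) examples the current
        # state-of-the-art model already gets right.
        kept = []
        for text, label in examples:
            solved = adversary.predict(text) == label
            if solved and random.random() >= keep_if_solved:
                continue
            kept.append((text, label))
        return kept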
So for example, there
is an implementation
of this for collecting textual
entailment data at Facebook
recently where they
asked crowdworkers
to write these sentence
pairs for entailment,
but they wouldn't
let you click submit.
They wouldn't let
you finish an example
until you'd edited
it to the point
that RoBERTa would guess wrong.
I think this is a really
interesting source
of data for training.
This is a really good
way of finding examples
that are right at this decision
boundary of the models we're
studying.
It's probably a good source of
data for local hill-climbing
evaluation, for
high-parameter tuning,
for model development
in a certain scope,
but using this kind of
method to build benchmarks--
to build something like
GLUE-- has me very worried,
because what we'd really like to
say with something like GLUE is
that this sort of linear scale--
like the number that
you're getting out--
is more or less monotonically
associated with the ability
we care about-- that
if you get better
on this number
that means you're,
for some reason or
another, doing better
at the actual ability
we're studying.
And I think using adversarial
filtering to create data
for these kind of benchmarks
for broad use creates a metric
that's going to encourage
you to build systems that are
different from current
state-of-the-art systems
without necessarily
being better--
that you could wind up
building systems that
have different
weaknesses that fail
on a different set of corner
cases than current models,
and that would
cause them to score
much better-- because you've
created this evaluation
that highlights the error
cases of the current state
of the art.
And I do think there are
ways to work around this.
You can mitigate
this to some degree
by saying we're going
to keep generating data
and we're going to make data
that's adversarial for all
the models that
we're considering,
but I think getting this
down to this thing that's
as usable as a nice
public single-number
benchmark is going to be
quite logistically tricky.
There's another
approach that feels
like kind of a
lighter-weight version
of this same basic
idea, which is to build
some kind of growing benchmark.
So there was this thing that
was pioneered more or less
by the Build-it-Break-it
workshop a couple of years ago,
and I think is more recently
being revived by this ORB
benchmark for question answering
that was recently proposed,
I believe, from AI2--
where the idea is you
start with a test set
that's built in a normal way.
People submit models,
ideally in a form where
you can rerun those models
after they've been submitted
on a centralized server, and
you allow experts to just submit
more test examples-- that as
people notice phenomena that
seem like interesting phenomena
to evaluate on, they can just
upload, hey, here's
a set of 1,000
new question-answering
questions that
target some interesting
phenomena that I
think will be hard.
If human annotators agree
that these questions are
valid-- that labels
are right-- you just
add that onto the test-set
and rerun your tests.
And so you kind of
allow this benchmark
to evolve to target interesting
sort of corner cases
as they're discovered.
I think this feels like it
has similar risks to a lesser
degree and similar
logistical complications
to a lesser degree, and I
think this is doable.
I think this might be the most
viable way forward right now,
but there is still some
risk that we'll drift away
from the task we're trying
to solve-- that through this
process of having linguists
actually write data sets,
we'll wind up creating data
that artificially focuses
on things that the most
popular styles of model
get wrong but doesn't necessarily
represent the underlying
task that we're interested
in-- so it's a risky situation.
One more direction
we could go is
to restrict the
task-training sets
to focus on zero-shot
or few-shot adaptation
to the new task,
where you test on the
kinds of test sets
you already have, but
you say, all right,
you're only allowed
to use 100 examples
from each task at training time,
or 10 examples per task,
or you're only allowed to
use a description of what
the task is and zero examples.
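The restriction itself is trivial to state in code-- a sketch, with k as the per-task example budget:

    import random

    def few_shot_split(train_sets, k=100, seed=0):
        # Restrict each task's training set to at most k examples.
        rng = random.Random(seed)
        return {task: rng.sample(data, min(k, len(data)))
                for task, data in train_sets.items()}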
I think this is interesting.
This is likely to encourage
us to build better models
and better representations.
I think this is something
we should be doing,
but I also think it's
not quite the same thing.
This is just a different
problem we could be working on.
I think there are
still open problems
in the setting where
you are allowed
to create training data.
I think that's a setting
that users of NLP in industry
are often in, that
academic users of NLP
are often in, where you can
afford to create 1,000 data
points for your task.
I think that's still
not solved, and so I
think this feels like a detour.
This feels like a
separate way of looking
at the problem that isn't
quite what we started with,
so worthwhile but not
quite the same thing.
The last direction-- and the thing
that I think we as a field
should ultimately do;
this strikes me as sort
of the safest way to go,
and I credit George
Dahl at Google
for really beating me over
the head with this point--
is just to build really big,
really high-quality data sets.
The idea is to try to build
data-creation pipelines--
I have no idea how we'll build
these data-creation pipelines--
where we can get data sets
that are reasonably difficult--
that involve reasonably
complex and subtle reasoning--
but where human annotators
agree at a very high level--
ideally at greater than 99%.
So this would mean very
carefully filtering out errors,
but also very
carefully filtering out
any data points where
judgments are subjective--
or at least coming up with
a way to incorporate that
into your metric.
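One plausible sketch of that filtering step, assuming each candidate item carries its full list of annotator labels:

    from collections import Counter

    def keep_high_agreement(examples, min_agreement=0.99):
        # Keep only items where annotators nearly all agree; with a
        # handful of annotators this effectively demands unanimity.
        kept = []
        for text, labels in examples:
            top_label, top_count = Counter(labels).most_common(1)[0]
            if top_count / len(labels) >= min_agreement:
                kept.append((text, top_label))
        return kept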
The idea here would be, OK, you
have a million test examples.
We're sure that 999,000
of those are right,
probably RoBERTa
off the bat is going
to get 95%, 96% of those
right, but you say,
all right, we're not done
until [INAUDIBLE]--
we're not done until you're
actually at human agreement.
And I suspect that
this will allow
us to highlight a number
of weaknesses of our models,
but still do so in a way
that is sort of drawn
from some relatively neutral,
vaguely natural distribution
that isn't artificially
biased towards
or away from the
particular methods we
are studying right now.
So again, I don't
know how we do this.
It's likely any method
that would accomplish this
would be very slow,
very expensive,
and not guaranteed
to work once you've
put in the time and
the money, but this
does seem like the
only way to fully build
an evaluation that matches
the goal that we started with.
So that's more or less
where I want to leave us.
I just want to flag one
other related open question
that I have no good
ideas on that just seems
to fit in with the kinds
of points I'm raising,
which is that we included
this metric for
social bias in the
SuperGLUE benchmark,
but we deliberately
didn't include that
as part of our average.
We didn't include that
as a single number
we wanted to hill-climb on,
because [INAUDIBLE] there's
a serious risk that
someone could easily
tweak their model in a way
that would make it look very
good on exactly our metric of
[INAUDIBLE] without actually
improving on any other metric.
And I think that's
a fair concern
for any time you're trying to
isolate these biased behaviors
in models--
that it's both technically
and politically difficult
to decide what to measure--
or it's politically difficult
to decide what to measure
and it's technically
difficult to measure it--
to make sure you're measuring it
in a way that you can safely
hill-climb on--
to make sure you're measuring
it in a way that
if we build models that score
better on some benchmarks,
that they're actually
less likely to make
biased decisions on future
data on unknown tasks.
I think it would be valuable
to find a way to do this.
I think putting out
benchmarks like this
has been a good way of
spurring progress on problems,
but I think we have not yet
figured out how we might do so.
Anyway, what's next
for evaluation?
What do we do after
GLUE and SuperGLUE?
I've thrown out some
rough ideas of where
I can see the field going,
but ultimately, I don't know.
We're in a weird position.
I don't claim to have
any good answers.
Thanks for coming.
These people gave me money.
[APPLAUSE]
AUDIENCE: So I have
two related comments,
but maybe we can start
with some comments
about that sort
of recurring slide
because there may have been--
I realize that the main
adversarial filtering
[INAUDIBLE] how it was some of
the prior work including ours--
were presented in the past
might have given this impression
that we really only
keep the harder ones
and throw out all the easy ones
which might actually present
really good problem definition.
And in fact, that's not
the case in the sense
that the way we think about
it-- at least I personally
think about it-- is that
a lot of our data sets
are over-representing
some cases.
We don't know whether that's
necessarily easy cases or not,
but there's an over-population
of very similar problem
instances over and
over again, and then
some versions of the [INAUDIBLE]
algorithms keep them
at a lower level
so that it's a little
bit more evened out.
So basically, we're using
the head of the distribution,
and we're trying to make the
data set a little bit more
toward the [INAUDIBLE] distribution,
but there's a limit to it
in the sense that it cannot
generate new data points--
a lot more new [INAUDIBLE]
distribution data points--
it can only reduce
the over-representation
of the head.
And then the question about
whether-- so in the slide
you asked whether
these data sets
can lead to better
[INAUDIBLE]-- probably not.
They're different
but not better.
So I think that this
depends on where
you want to perform better.
If you actually want to perform
better over your original GLUE,
it's really not a good idea
to filter anything out.
It's best to use the
exact same bias--
the training data that resembles
the tested distribution.
But we have new
experiments where we
find that on many other
diagnostic data sets-- where
data sets are adversarially
constructed-- the models trained
on filtered data do better
compared to the original data.
So I think these are somewhat
surprising or unsurprising
results depending on what one's
expectations may have been,
so I would just say that
the verdict is still yet
to be determined.
I mean, probably we should
look at these things
much more carefully,
but I appreciate
how you basically
suspect anything
that's happening in the field.
So on that note, I had
a really fun workshop
to attend at ICCV in which
Alyosha Efros at Berkeley--
he's a hot-shot in computer
vision who's anti-data set.
And I did not approve of
everything he said in his talk,
but I loved his talk so much.
If you ever see his
name, go check it out.
It's really great.
But one thing he
said was something
like the following-- the
reason why he doesn't like
supervised training is
because it's almost like,
you know, those students who
didn't come to your classes
but could download the previous
year's exam questions.
And they remember, for
this kind of question,
this kind of answer
is the right answer.
So even if you try to change
the problem a little bit,
because you recycle a very similar
style of exam writing,
students can perform really
well without learning the true--
having true understanding
of the material.
And then this relates
to your earlier comment
about whether any
of these models
actually learn a meaningful
linguistic phenomenon--
like a coreference
resolution well enough
to solve any of this.
And then yet another really--
sorry, I-- [INAUDIBLE] so to cite
Matt Gardner's-- a workshop talk I
saw at [INAUDIBLE] in which he
also raised a similar question:
even if one were
to generate really a lot
of high-quality data for reading
comprehension that's really,
really hard, and everybody
agrees [INAUDIBLE] really hard,
can a model actually learn to
really comprehend the text when
the learning is set up
as just like [INAUDIBLE].
So these are just
questions like,
are we actually
doing this right?
Is there a hope to just create
larger-scale true data sets,
and then trying to learn
something off of it?
SAM BOWMAN: Yeah, so I'll try
to respond to at least some
of that, and I'm happy to
talk about it more in person.
Yeah, so I very much
like one point which
is just that the original data
situation that we're starting
with is often going to be
broken and heavily influenced
by exactly how we collected
the data, and also
not necessarily
reflect the thing
we care about at all--
that the argument
against an adversarial
filtering-style approach
is strongest if you believe
that the original data
distribution is actually somehow
valid which is usually wrong.
I guess I'm still
nervous though.
I think there's this burden of
proof on using these methods,
that even if you're
just adjusting
the distribution of
an existing data set,
that if you're using a set
of models that is
in any way biased--
that is not like a true uniform
sample from the set of all
possible models, whatever
that would mean--
then you are somehow
artificially disadvantaging
some classes of
models over others,
and that you need to
quantify that and show
the degree to which
that's the case in order
to really make a convincing
case that you have
a fair, neutral evaluation--
if you're still putting up
a sort of benchmark-style
leaderboard.
This does make me nervous.
Yes, but I'm curious also
about the burn it all down
and give up on training sets.
I think there is--
that is another page
we could put here
and there likely would be very
interesting things to say.
Yes.
AUDIENCE: But you said a few
times are we really evaluating
to capture the thing
we really care about?
And I'm wondering
how do you articulate
what it is that we really
care about in this context?
SAM BOWMAN: So I'll make a
slightly jet-lagged attempt
at this.
I'm not sure I can
really do a great job,
but through all of
this, I'm taking a view
that I think is fairly
grounded in applied NLP.
I'd like to say, all
right, we have some set
of tasks we'd like to
solve that seem intuitively
to require dealing with
syntax, [INAUDIBLE],, pragmatic,
some degree of common sense,
that seem solvable in principle
by machine.
We would like to
build models that
perform at a human
level on any data that
is consistent with our intended
definition of the task--
that if we're building a
sentiment classification model,
we can more or less write
down with sentiment means.
We can sample
a test set that we
think represents that
reasonably well but we
want to say any data you
throw at me that feels
like an example of
sentiment, our models should
be agreeing with--
should tend to agree
with humans to the extent
that humans agree
with one another.
And so I think even for
that relatively applied--
relatively current status
quo-friendly definition,
we're not there yet.
AUDIENCE: So I
liked how you used
human Turker numbers
to calibrate or interpret
the model performances,
and I'm wondering
if you could talk about how
meaningful you think the gap is
between human Turker
performance and expert
performance, and the truth.
Are those gaps--
is that a real limitation to
progress, or is it just noise?
SAM BOWMAN: Yeah,
so I guess we wanted
to have those human
performance numbers to have
just some idea of
when to give up,
when to abandon hope
when we're pretty sure
we've solved the data sets,
but they're a very blunt tool.
I don't think we should accord
them too much importance
for no reason.
Yes, these are crowdworkers
making these decisions.
They're not likely to be putting
really substantial effort
into the task.
We're not adjudicating
disagreements
so there's kind of--
even if we assume that
they are doing the job,
there is quite a
bit of noise there.
That crowdworker performance
is not going to be Bayes error.
It's not going to be the
best possible performance
on the data set.
I think working with experts who
are given very long time limits
and who know the
exact process that
generated the original
data, you can do better.
For the CoLA benchmark,
we nicely saw this.
We saw the performance of
crowdworkers on our metric
was maybe 65 points.
The performance of a single
linguistics PhD student judging
the data was 75, and
five linguistics students
in a room with pizza trying
to argue it out were maybe 80.
But again, I think this
is just a very blunt tool.
It is meant to give us a rough
clue of where we're maxing out
these data sets.
Yeah, I don't know if that quite
gets at the question, yeah.
AUDIENCE: Also
assuming there's bias in
the data-- the models
[INAUDIBLE] better [INAUDIBLE].
SAM BOWMAN: Yes, yes, yes,
that's a very good point.
Yeah, this is where I think
if the data-generation process
doesn't quite match the
just abstract specification
of the data, then models
will pick this up.
We saw this, especially
for the Quora data.
The Quora data
actually came out of
this long multi-step
semi-automatic moderation
process inside
Quora the company.
They distributed no
documentation about where
this data came from.
And so our MTurk workers were
trying to reverse engineer
this process using our
description of what we thought
they were doing and 20 examples
and they didn't do a good job,
and I think even
ELMo is basically
at human level on that.
Yeah.
AUDIENCE: I'm just curious
on this thread of English
only so far, are
there other data sets
like this benchmark-- data
sets in other languages that
are highlighting other phenomena
that perhaps haven't been
captured [INAUDIBLE]?
SAM BOWMAN: I don't
know that well.
So what there is, there
is a Chinese GLUE data
set that came out--
benchmark that came out
recently that I think
is a fairly direct
clone of this.
The documentation is in
Chinese which I can't read,
so I don't know exactly
what the scope of task is.
I know it's a different
distribution of tasks.
There's also something that's
a little bit less related
but really interesting
that's been emerging
in the last year or two:
the arrival of a bunch
of cross-lingual benchmarks.
These are things
like XNLI, which we've
been involved with,
or XQuAD--
I'm not sure how to pronounce
this-- that came out recently.
There are a few data
sets where the setup is
you're asked to train
on English-language data
and test on 20 or
30 other languages,
and just it's expected
that as part of your model,
you have some way of
mapping language meaning
into some language-agnostic
representation that
allows you to just
train in English
and test on something else.
And I think probably the
craziest result in NLP
this year is that this
actually kind of works
and these benchmarks
are getting better,
the data sets are
getting better,
and we're actually
building models
that do a decent job at
that, but it's not quite
getting [INAUDIBLE] where
we're not intentionally
doing a good job of
focusing on phenomena
that are specific to
non-English languages.
HANNA HAJISHIRZI: OK, let's thank
Sam again. [INAUDIBLE]
SAM BOWMAN: Thanks, again.
[APPLAUSE]
