All right. I propose we get started.
Plan for today is to finish up NLI.
I was just talking with the group that was here before,
and I think the consensus is that if we finish that early,
I'm gonna give you a glimpse of the topics for next week, um,
because they're basically my favorite topics,
and I wanna, like, get people to do projects on them.
And I know that we're introducing that content kind of late, um,
in terms of people settling on project topics,
but maybe this is my last chance to sneak some things in.
I've got lots of good links to datasets and papers and so forth.
That would get us pretty close to the true dream of NLU,
I think, which is like, you know,
language use in grounded situations in the real world.
Um, and it just comes late in the course because we have to build
up a lot of material to get to the relevant models.
Um, but now, certainly by the end of the day today,
you're well-prepared to get your hands dirty
with the kinds of things that fall under grounded language understanding.
So I might say a few things about that if there's time.
And just by the way, it's gonna be a little bit of a rush next week.
We have just one day for grounded language understanding,
and one day for semantic parsing.
The saving grace there is that we have a huge amount of material for semantic parsing.
It's a challenging topic, it's very interdisciplinary,
you basically have to know a bunch of
linguistic semantics and a bunch of machine learning.
Um, but if that sounds exciting to you,
then it's a wonderful area to work in,
uh, and we have this whole codebase, SippyCup,
and Bill has created lots of notebooks for it.
And then we have two supporting screencasts that kind
of get to the essence of the technical ideas,
and then a bunch of wonderful papers.
So if you're still looking around for a project idea,
and especially if you took semantics with me before,
then this is like the dream, the marriage of Linguist 130A and this course,
and so I encourage you to check it out.
And Bill will give us an introduction to it,
but there's only so much you can do in an hour.
So it will really pay to do that kind of self-study.
Um, just while I'm talking about the schedule,
so you're doing the last homework and the last bake-off.
Uh, and then after that,
we kind of settle into a rhythm of project-oriented assignments.
You do a lit review, and at that point,
you'll be paired with a mentor from
the project team who will be with you through the whole process,
so like somebody you can count on for
advice and who will be giving you a lot of feedback.
So we do the lit review first,
and then we have this thing called an experimental protocol.
And I'll tell you much more about what that's supposed to contain closer to the time,
but kind of hovering in the background here is that we're gonna push you to
make sure that your project has some kind of
quantitative evaluation associated with it.
Not because we think that that has to be the way all work in NLU works,
but rather just because we think it's healthy to kind of push you in that direction.
And that's why you'll find that when you interact
with people from the teaching team about your project,
they wanna know what data you're gonna use.
Because I think that we all on the teaching team
think if we can figure out what data you're gonna use,
then we can kind of figure out what your metrics will be,
and get you to the point of a quantitative evaluation.
So have that in mind as well as you do project planning and know
that the project assignments are gonna be pushing you in that direction.
And then after that, it's just a short YouTube video,
those are typically fun,
uh, and then the final paper itself.
And, uh, correspondingly, most of the material that we do in class in
the second half of the course is kind of oriented toward you doing your project.
So it's kind of, it gets a little bit meta.
We talk about methods and metrics,
and presenting your work, and so forth.
So it should be relevant to you no matter what
kind of project you're working on, that's the rationale.
There is one thing that I wanted to just clarify about Bake-off 4,
just so that you know what the points of freedom are.
So remember Homework 4 and Bake-off 4 are all about word-level entailment.
It's a kind of small version of the NLI problem,
and I've supplied you with datasets that you can work with.
And anything goes for this in terms of you bringing in outside vectors,
for example, you could do it with ELMo or BERT,
um, or vectors you build yourself,
GloVe, right, you name it.
You can go onto the web, and in fact the task is pushing
you in the direction of having really rich inputs to these models,
otherwise, you're just not gonna do very well.
So from the perspective of modeling and
outside data that will initialize your vectors,
we wanna be, you know, casting the net wide.
There is one restriction that I feel like I didn't emphasize enough last time,
though, which is that when you get down to it,
when you develop your original system,
and when you enter the bake-off,
you have to be working only with the word-disjoint problem, right?
It says that here: keep in mind that for the bake-off evaluation,
the edge-disjoint portions of the data are off limits.
You can train on any combination of word-disjoint train and dev,
you just can't look at that edge-disjoint part.
And the reason for that is that your performance on this problem
just goes way up if you're allowed to look at edge-disjoint data,
because that means that in training,
you get a lot of words that you see at test time,
which is kind of contrary to the direction that I'm trying to push us,
which is: these are all new words at test time.
And so what we're doing is kind of stress-testing your system on its ability to
generalize from what it knows about its training words into a new part of the vocabulary.
Um, so yeah, this is really important,
and you will definitely not be allowed to win the
bake-off if we see that you've brought in the edge-disjoint stuff.
[LAUGHTER]
[inaudible].
That'll be okay. You know, so I'm assuming that, whatever embedding space
you have, when it gets to test time,
those words are in that embedding space, right?
Because that's kind of the point,
Because it's like your embedding space has interesting structure,
and then your model helps you generalize into the entailment problem,
so your system is gonna do really poorly,
in fact, if none of the test words are in your embedding space.
Um, I've tried to disguise where I got the entailment data,
and the spirit of that is just that,
you know, you guys can search widely and stuff as long as you
don't make a deliberate effort to figure out where my data come from.
That would be contrary to the spirit of all of this.
Um, but definitely go out and see, you know,
whether there are new word embeddings that people have
released that are especially good for this task.
Okay. Any questions about that before we dive back into the content?
Okay. Just as a kind of recap,
so last time I introduced the problem of NLI,
and I think the summary of that discussion is kind of like,
NLI at this point is a common sense reasoning task of some kind.
Um, you know, it's kind of hard to define exactly what
the criteria are for entailment and contradiction and neutral.
Uh, but we have these naturalistic datasets that reflect
some aspects of how humans actually reason about language,
and that's what we hope our systems do as well,
because the whole idea of NLI is that
it's kind of keyed into fundamental information about how we use language
to build arguments and to reason and to find out new information,
and if you think in an extended sense,
even how to answer a question and so forth.
Um, I introduced SNLI and MultiNLI, which are exciting,
large datasets that will test different aspects of your system.
So SNLI is just image captions,
and MultiNLI is drawn from a bunch of different genres.
And a nice twist about MultiNLI is that,
in addition to the diversity of the training data,
there are two dev sets,
one which is matched,
which means that you evaluate on the same genres that you trained on, and one mismatched,
which is kind of seeing- it gives you a chance to see how well
your system can generalize out of its comfort zone, so to speak.
Um, and another exciting thing about MultiNLI is that the team released it
with a bunch of annotations that you can use for kind of error analysis,
and I'm gonna give you a bunch of illustrations of that later,
I think it's a nice opportunity.
You know, writing a good error analysis section for your paper
is often very difficult and requires a lot of creativity,
and I feel like they've given you this gift of the structure
for a very rich error analysis by providing these annotations,
so I'll show you a bit about how that works.
And then we got as far as talking about hand-built features,
um, which are kind of, like,
word overlap, word cross-product.
And what you're seeing here is a bunch of stuff that you can derive from WordNet.
And then here, a bunch of other ideas.
I added one slide here,
which is, um, we have this narrative of, like,
talking about hand-built features and linear models,
and then talking about deep learning models,
and you could fall into a trap of thinking these ideas are in direct competition.
Um, I think that's not so,
I think it could be very interesting,
and maybe some of you have done this in your work so far,
to think about combining,
like, a word embedding,
something short and dense, with a
feature representation that comes from hand-built features,
and that will have the characteristic that it will have
very high dimensionality and be very sparse.
And that really contrasts with what you would get from BERT or ELMo or GloVe or whatever,
which tend to be very short,
low-dimensional, and very dense.
I've kind of represented that here:
short and dense, long and sparse.
And the question arises then, like,
what's an effective way to combine these two representations in a unified model, right?
Because it's not like I have to choose just long and sparse or short and dense,
maybe I can have the advantages of both.
And the naive thing to do would be to just concatenate these two representations,
um, so that if, like,
the green has 20,000 dimensions,
which is pretty realistic,
and the short and dense one has 50,
then you have 20,050 dimensions,
but even that way of posing it suggests that there might be an issue here, right?
Which is that there's a kind of wild imbalance,
and it's two kinds of imbalance.
On the one hand, you have all this sparsity dominating the representation.
On the other hand, it's kind of in two phases,
like, with very different kinds of information.
So I think it might not be effective to simply concatenate the two representations.
One thing that I've found effective in various work that I've done recently is to do
a little bit of preprocessing on
the long and sparse side to just make it short and dense.
Uh, and there are kind of two classes of approach you could take there.
So you could just apply something that is
outside of the model you're developing, like, you know,
apply LSA or PCA or something like that,
to shrink it down into a dense representation.
Uh, that's one step.
That's what I've called this: model-external transformation.
Another approach would be to actually have
your model learn some weights that perform that transformation,
that embed this really long thing into a smaller space.
That will reintroduce some difficulties of optimization,
but I think if you give it separate parameters,
then it will become kind of manageable.
Uh, and then at that point,
you would concatenate the two short, dense representations here, and you'd have
two kind of comparable objects, and then move forth.
And I- I think that there's good evidence that that representation
would have a lot of the advantages of both sides of the incoming information.
Yeah.
[inaudible] combine the two?
I don't know about that.
Attention would normally be defined over entire representations,
not over the individual dimensions,
so you'd have to think about it somewhat creatively.
Does that ring any bells for you over on the teaching team?
[inaudible] of success.
Okay.
If you perform this transformation,
then this thing, once concatenated,
if you have subsequent weights up here,
then you are giving your model a chance to learn lots
of interactions between those two sides,
and I think that's where this becomes potentially so powerful.
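To make this concrete, here is a minimal PyTorch sketch of the learned, model-internal version of this idea. None of this is the course code: the dimensionalities (20,000 sparse, 50 dense) and the module name are just illustrative. A linear layer embeds the long, sparse feature vector into a small dense space, and the classifier weights above the concatenation can then learn interactions between the two sides.

```python
import torch
import torch.nn as nn

class SparseDenseCombiner(nn.Module):
    """Illustrative sketch: embed a long, sparse hand-built feature
    vector into a small dense space, then concatenate it with a
    short, dense (e.g., GloVe-style) representation and classify."""
    def __init__(self, sparse_dim=20000, dense_dim=50, proj_dim=50, n_classes=3):
        super().__init__()
        # Learned, model-internal transformation for the sparse side:
        self.proj = nn.Linear(sparse_dim, proj_dim)
        # Weights above the concatenation get a chance to learn
        # interactions between the two kinds of information:
        self.classifier = nn.Linear(proj_dim + dense_dim, n_classes)

    def forward(self, x_sparse, x_dense):
        x_proj = torch.relu(self.proj(x_sparse))   # now short and dense
        combined = torch.cat([x_proj, x_dense], dim=1)
        return self.classifier(combined)

model = SparseDenseCombiner()
x_sparse = torch.zeros(4, 20000)   # toy batch of hand-built features
x_dense = torch.randn(4, 50)       # toy batch of dense embeddings
logits = model(x_sparse, x_dense)
```

The projection could of course be pretrained instead (the model-external route); the point of this version is that its weights are learned jointly with the task.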
Okay. And then I introduced nli.experiment, um,
which is not required for the homework but if you want to do
a project in NLI, I think this could give you a real head-start,
and it kinda makes all these ideas concrete.
And I introduced these two ways of doing hyperparameter selection in the context of
a really large dataset where you can't afford to sit
around through 50 iterations because,
you know, that could consume all your time or resources.
And then toward the end here,
I introduced this idea of a hypothesis only baseline.
Um, and the relevance of that is just that,
this can be a surprisingly good baseline for NLI.
Uh, and I gave you a rationale, a kind of linguistic, scientific rationale, for
why this works, related to the fact that there are some intrinsic biases,
um, when it comes to entailment problems, about
what's likely to appear on the left or the right,
uh, and that was my kind of justification
for why hypothesis-only might always be pretty strong.
And you can see here at the bottom that it is indeed quite a strong baseline, 65%,
uh, and compared to random at 33%, you know,
it looks like you really want to think about
the strength of the hypothesis-only baseline.
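As a concrete illustration of what a hypothesis-only baseline looks like, here is a toy sketch using scikit-learn. The examples are invented and this is not the course code; the point is just that the model is fit without ever seeing a premise.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy NLI examples: (premise, hypothesis, label). Invented for illustration.
examples = [
    ("A dog runs in the park", "An animal is outside", "entailment"),
    ("A man plays guitar", "Nobody is making music", "contradiction"),
    ("A woman reads a book", "She is reading a mystery novel", "neutral"),
    ("Two kids play soccer", "Children are playing a game", "entailment"),
    ("A cat sleeps on a couch", "The cat is running", "contradiction"),
    ("A chef cooks pasta", "The chef is famous", "neutral"),
]

# Hypothesis-only baseline: the classifier never sees the premise.
hypotheses = [h for _, h, _ in examples]
labels = [y for _, _, y in examples]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(hypotheses, labels)
preds = model.predict(hypotheses)
```

On a real dataset like SNLI you'd fit on the training hypotheses and evaluate on dev; doing meaningfully better than 33% signals annotation artifacts in the hypotheses.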
And then someone asked, what about premise-only experiments?
And I had not thought enough about that.
And so in the moment I think I said something like, well,
based on what I said, they ought to be pretty good as well.
Because the same kinds of biases are telling you,
for example, that if I give you a general word, it's likely to be in a hypothesis.
Conversely, if I give you
a very specific word, it's likely to be in a premise in an entailment pair,
and also very specific things are likely to be participating in
contradictions in the sense of this problem,
because they're likely to exclude other event descriptions.
All of that pushed me in this direction;
all of that led me to think that a premise-only baseline ought to be pretty good.
So I ran this experiment on SNLI and I was aghast:
33.3%, which is the very definition of random for this problem.
And at first I thought there must be some kind of bug in the code
or something, because there's just no way it could be this random.
And I couldn't find
any bugs, so I started to do some real soul-searching,
like, did I give you guys a misleading assessment of this problem?
Um, and so I was in kind of a bad state for a while, I guess.
And then I figured it out.
Does anybody actually see already what I was missing in running this on SNLI?
It's a subtle fact.
It took me a while to realize it.
[inaudible].
[inaudible] That's true and I am going to bring in
the local task but I don't think that's the answer.
Why is it random for SNLI?
It's because of the way the dataset was constructed.
So every premise was given to crowd workers, and
exactly one of entailment, contradiction, and
neutral was constructed on the basis of that premise.
So those constructed sentences differ wildly, of
course, and they do contain biasing information about the label.
But if I give you only the premise,
that's like what the crowd workers got before they started their work.
And so what you see is exactly what you get from
the data collection which is perfect randomness,
because all the system has is this unbiased information.
So, phew. And then, just to reassure myself, I did check what it is for
the word-entailment model, because your homework asks you to do
hypothesis-only, and that's about 0.5.
Premise only is about 0.47.
So those are pretty strong baselines.
Again, it's the same lesson.
But here, I think my insights are at least consistent with those results.
So all is well with the world.
Like I said, it's the bottom line here [LAUGHTER] as far as I know.
Questions about that or additional concerns?
It was because, kind of by construction, any of the premises could have
led to any of the labels:
you could have constructed something that was entailed, or a contradiction, or neutral.
In fact, it's even more by design.
Literally,
the crowd worker was given the premise and constructed three sentences:
one entailment, one contradiction, one neutral.
[LAUGHTER] So it's not just by chance,
this was really by design, and I should have known this.
Um, I did eventually figure it out. Phew.
Okay. So that's a kind of recap with a little bit of new information.
And then the final thing I wanted to do was move into the neural realm,
because I think there are exciting models that are kind of specific to NLI,
um, that we can discuss.
Yeah. Just a quick question about something previous.
So what about the logistic regression model would
differentiate it from this model and let it have a higher baseline than randomness?
Oh, well, I just picked logistic regression for
the experiment down here, just to have
some model to implement premise-only and hypothesis-only baselines.
And the observation is just that both of them have a lot
of information that makes them much better than random.
It's a different dataset. It's not a model. It's the dataset.
Oh, it's a different dataset. Okay.
Yeah, because, so this one was constructed to have
no bias when it comes to the premise but the word entailment one was not.
It's just a bunch of words drawn from some vocabulary.
Okay. So there are kind of two classes of models here,
and this is a by-product of us having a premise and a hypothesis.
The first class is what I'm going to call sentence-encoding models,
and in these models you get
a separate summary representation of
the premise and the hypothesis and then you do something with it.
Um, I think this is the kind of model that
you're pushed toward if you have in mind that Ido Dagan
dream that you're learning representations from
NLI data that are going to transfer to lots of other tasks,
because for many other tasks,
you need an individual sentence representation.
Uh, and this offers exactly that.
And I've started here with a natural baseline for this kind of model,
which is just that kind of sum of
word vectors approach that you've explored in various places before.
So just to summarize this,
I've got every dog danced as my premise,
every poodle moved as the hypothesis.
I do an embedding lookup to get this layer, right?
That would come from some pretrained space presumably.
And then to get a fixed dimensional representation for each side here,
because the number of words could differ,
you just do something that's a kind of hard summary of all those vectors,
like the sum, or the average, or the difference,
or whatever, right, but some operation that's gonna take you from
this collection of vectors into a single fixed dimensional one,
and that would give you xp and xh.
And then after that,
again, this is familiar,
you might just concatenate them or do something else with
them to get to the input to your classifier.
And then from there,
this could be any classifier you want.
So a simple linear one or a whole deep learning model, whatever you choose.
And this is a baseline in the sense that,
this is a pretty primitive way to combine
all those words into a fixed dimensional representation,
um, but actually these models turn out often to be pretty good.
A lot of information from the words ends up encoded in those sums that
you get for the premise and the hypothesis. Make sense?
And here you can see that, like, you know,
one simple approach you could take to the bake-off would be:
I just have a one-word premise and
a one-word hypothesis, so there's nothing to sum or average.
I get xp and xh directly,
concatenate them, and then fit a model.
Make sense? Questions about it?
Here's some code for doing that, um,
and it's familiar code in the sense that, like, glove_leaves_phi, sum phi,
these are functions that you've actually worked with before.
And the only twist here is just that as you process examples,
you need to process two parts,
the premise and the hypothesis,
whereas before when you've done this,
you were processing just a single
Stanford Sentiment Treebank representation.
But everything else is the same.
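Here's a minimal sketch of that sum-of-word-vectors featurizer, with a tiny invented embedding table standing in for GloVe; the function names and vectors are mine, not the course repo's.

```python
import numpy as np

# Toy embedding space standing in for GloVe (invented vectors):
embeddings = {
    "every":  np.array([1.0, 0.0]),
    "dog":    np.array([0.0, 1.0]),
    "danced": np.array([1.0, 1.0]),
    "poodle": np.array([0.5, 0.5]),
    "moved":  np.array([0.0, 0.5]),
}

def sum_phi(tokens):
    """Fixed-dimensional summary of a sentence: the sum of its word vectors."""
    return np.sum([embeddings[t] for t in tokens], axis=0)

def nli_phi(premise, hypothesis):
    """Concatenate the premise and hypothesis summaries (xp and xh)
    to form the input to a downstream classifier."""
    return np.concatenate([sum_phi(premise), sum_phi(hypothesis)])

x = nli_phi(["every", "dog", "danced"], ["every", "poodle", "moved"])
```

You could swap the sum for a mean or a difference without changing anything downstream; the classifier just sees a fixed-dimensional vector either way.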
Um, and it does okay.
Uh, on SNLI here, train/dev, 51.6%,
so it's actually below the hypothesis-only
baseline, which might make this a kind of nonstarter,
but it could still give you intuitively some information about like, uh,
a model that you build on top of this,
how much you're adding if you do something more sophisticated
than the sum or the average the way this model does.
Here's what I take to be the rationale.
I kind of alluded to this before.
So first, encoding the premise and hypothesis separately
might give the model a chance to find rich abstract relationships between them.
You might think that that's a kind of very human way of doing this.
Like I have an understanding of the premise,
an understanding of the hypothesis,
and then there's a subsequent step of figuring out what their relationship is,
and that's more or less directly encoded in this class of model.
Um, and in addition,
this is the Ido Dagan dream.
Sentence-level encoding could facilitate transfer to other tasks. So yeah.
And so if you wanted to do like the- the GLUE benchmark, for example,
that it's really natural to choose a sentence-encoding model,
so that you could use the premise or hypothesis in one of
the sentiment tasks or in one of the grammaticality judgment tasks and stuff.
Here's a model that's received a lot of attention in the literature.
That is a kind of step up in complexity from
that simple averaging or summing of vectors,
and this is where I kind of have two RNNs.
So one for the premise,
and one for the hypothesis.
So I've used the same example,
every dog danced, every poodle moved.
I looked them up in what I presume is a single embedding space,
although you could have
different embedding spaces for premise and hypothesis if you wanted to,
you know, on the assumption that
word senses differ depending on where they appear.
And then what I am assuming here is that you would have
different parameters for this recurrent neural network.
So I put that in green and in purple here just to signal that at that level you
might want to learn something very different
depending on what the environment for the content is.
But that's not forced either.
You could have tied parameters.
It could be a single RNN.
I think the point here is that the way this model subsequently works is that I get
this final representation h_3 and this one over here h_P,
and those get combined somehow.
Maybe concatenation, that's a likely choice,
and those get fed into the classifier.
And so what you're really depending on is that like the final state in this RNN,
is a good summary representation of that example,
and similarly for the hypothesis.
And then they get concatenated,
and you make a classification decision.
So from the top here,
once you get to combo,
this is exactly like that first baseline that I showed you.
You should really think of this as: instead of having something
primitive like sum or average to combine my word vectors,
I have this really complicated function,
an RNN, and it has parameters that I learn as part of the task,
instead of imposing an ideal from on high
that sum is the right choice, or that mean is the right choice.
For all you know,
your network is going to discover that the sum was exactly the right function to learn.
It's unlikely but conceptually,
I think it's useful to think about that possibility.
And of course, this has many variants.
So like even our code base will let you explore a variant,
where this is a bidirectional RNN.
And in that case, you might be using a summary representation derived
like sort of at the h_1 point combined with the one from h_3.
Those might get concatenated,
and that's your representation of the premise and same for the hypothesis.
But you would feed all of that into this combo func here and fit
a classifier. Make sense?
Questions? It would be
wonderful to see you guys exploring different variants of these.
I've only provided one.
Um, and here's a kind of strategy for this. Actually,
I did implement the full one in the notebook.
You can see that linked up here.
But I thought for here, instead of just looking at the code,
which you can do on your own,
it would be more useful to talk about
the conceptual strategy that I took for implementing this.
So, the sentence-encoding RNN.
How would I implement this model here using PyTorch?
So here's the way I thought about it.
First, it's nice and clean, and
gives you good code, if you define a dataset class.
The code that's in the repository by and large for these models,
it deals with just a single example.
So a single sequence of words,
a single length, and a single label.
That's like what you would do if you just use the RNN class.
It would force you into that mold of having a single list. We need two.
So for the dataset,
you could subclass the existing one and just have it
yield batches of objects that are pairs,
like every dog danced as tokens,
every poodle moved as tokens,
their individual lengths and then finally the label.
So that's kind of like bookkeeping,
and that's just imposed on you by the structure of the model,
that your examples need to look like that.
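A sketch of what that dataset class might look like, assuming plain Python token lists; the class name and example fields are invented for illustration, not the course code.

```python
from torch.utils.data import Dataset

class NLIPairDataset(Dataset):
    """Sketch of a dataset whose examples are pairs of token
    sequences plus their lengths and a label, rather than the
    single sequence the base RNN code expects."""
    def __init__(self, pairs, labels):
        self.pairs = pairs      # list of (premise_tokens, hypothesis_tokens)
        self.labels = labels

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        prem, hyp = self.pairs[idx]
        # Each item bundles both sequences, their lengths, and the label:
        return prem, len(prem), hyp, len(hyp), self.labels[idx]

data = NLIPairDataset(
    [(["every", "dog", "danced"], ["every", "poodle", "moved"])],
    ["entails"])
prem, prem_len, hyp, hyp_len, label = data[0]
```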
Then, for the classifier model itself,
you define a subclass of nn.Module in PyTorch.
Um, you need it to conceptually contain a premise RNN and a hypothesis RNN.
So what you do is implement its forward method to
use those two RNNs, and they get, you know,
the premise and hypothesis get processed in
exactly the same way as the current RNN code is processing regular examples.
And it gives you the two final state representations.
And then you just need to by hand as part of
the forward method concatenate them and return them,
and those will get fed into the classifier.
And then finally, the way the code is written,
you don't really need to do anything to the, uh,
base class for RNNs when you define this special one.
It just needs a slightly different predict method,
and the reason it needs a different predict method is that it
needs to deal with examples like this, right?
That's- that's kind of just back to the fact that your data are a certain way.
Uh, but everything else about the logic of optimizing the model is the same.
Really, the changes are localized to data handling and
to the structure of the core, what you think of as the computation graph.
And I do love PyTorch because it makes all this so easy. Actually,
if you look at the notebook, you'll see that in relatively little code,
you get one of these sentence-encoding models.
And then because you're inheriting from a simpler class,
all the options like whether it's bidirectional,
and what the hidden activation function is,
and the dimensionalities and so forth,
that's kind of all just taken care of.
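To make the conceptual strategy concrete, here is a self-contained toy version of the sentence-encoding model. This is not the course implementation: the class name, dimensionalities, and the choice of plain RNN cells are all just illustrative.

```python
import torch
import torch.nn as nn

class SentenceEncodingNLI(nn.Module):
    """Sketch: separate RNNs encode the premise and the hypothesis;
    their final hidden states are concatenated and classified."""
    def __init__(self, vocab_size=100, embed_dim=8, hidden_dim=16, n_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Separate (untied) parameters for the two sides:
        self.premise_rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.hypothesis_rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, premise_ids, hypothesis_ids):
        _, h_p = self.premise_rnn(self.embedding(premise_ids))
        _, h_h = self.hypothesis_rnn(self.embedding(hypothesis_ids))
        # h_p and h_h have shape (1, batch, hidden); drop the layer
        # dimension and concatenate: the "combo" step.
        combo = torch.cat([h_p.squeeze(0), h_h.squeeze(0)], dim=1)
        return self.classifier(combo)

model = SentenceEncodingNLI()
premise = torch.randint(0, 100, (2, 3))     # batch of 2, length 3
hypothesis = torch.randint(0, 100, (2, 3))
logits = model(premise, hypothesis)
```

Swapping in LSTM or GRU cells, or a bidirectional encoder, changes only the encoder lines; the combo-then-classify logic stays the same.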
Make sense? Now I encourage you to look at the code, and you'll see.
You know, it's like 50 lines or something.
But most of it is just bookkeeping.
The final entrant into this class of sentence-encoding models would be,
again, just a small variant of the one that we just saw.
Instead of having two RNNs,
you have two TreeRNNs, right?
Recursive, tree-structured neural networks.
Um, you know, and so if you look at just the premise or just the hypothesis,
this is exactly the model that we used, or that I showed you, for the SST.
Uh, and all the same variants hold.
So if you wanted more complex combination functions
than the one that I've signaled here,
um, you could introduce those.
It would all be kind of nicely modularized,
because you would just be dealing with a TreeRNN,
a recursive neural network.
You would define it for the premise and the hypothesis,
get their final states,
concatenate them, and fit the classifier.
So from this point on,
it looks exactly like all those other models.
And again I think this would be pretty straightforward to implement given
the TreeRNN PyTorch implementation that comes with the course repo.
[NOISE]
All right. And that's it for the sentence-encoding models.
I hope that gives you a sense for the ways that you
could impose new innovations and so forth.
The second class that I wanted to introduce I've called chained models.
And the reason I've called them chained is that basically you just
run together the premise and hypothesis.
The simplest version of that,
is the one that I've given here where I have every dog dance to every poodle moved,
you could insert your own boundary symbol if you wanted,
if you thought it was important for the model to learn that transition point.
It can be like a unique token that you learn a unique embedding for.
Um, but in principle you can just run them together like this,
and then have, you know,
the same embedding space.
That's all these gray cells,
and then the same RNN parameters process
the entire thing premise and hypothesis together,
and then finally on the basis of the final state here make a classifier prediction.
And it could be bidirectional in which case it would be
using h_6 and h_1 as the basis for the classifier,
you know all the same variants apply here.
I mean, this is literally just using the RNN that you might have used
already where I just feed in the blurred together premise and hypothesis and process it.
All right. I guess the rationale for this would be that
the premise really and truly establishes the context for the hypothesis,
as though a human reader has gone through and just read them,
and then made a decision about the blurred-together example pair.
And maybe, as I say here,
that corresponds to some real processing metaphor or something.
And this one is almost a throwaway:
you don't really have to do anything except define a new feature function,
which just concatenates the leaves of the two trees,
and then everything else is just code that's already in the course repo,
and you can run your experiments in the usual way.
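Here is a toy sketch of the chained approach, with an invented boundary symbol and invented dimensionalities: a single embedding space and a single RNN process the run-together sequence, and the final state feeds the classifier.

```python
import torch
import torch.nn as nn

def chain_tokens(premise, hypothesis, boundary="<sep>"):
    """Run the premise and hypothesis together, with an optional
    boundary symbol marking the transition point."""
    return premise + [boundary] + hypothesis

tokens = chain_tokens(["every", "dog", "danced"],
                      ["every", "poodle", "moved"])

# One embedding space and one set of RNN parameters process the
# whole chained sequence, premise and hypothesis together:
vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
ids = torch.tensor([[vocab[t] for t in tokens]])   # batch of 1

embedding = nn.Embedding(len(vocab), 8)
rnn = nn.RNN(8, 16, batch_first=True)
classifier = nn.Linear(16, 3)

_, h_final = rnn(embedding(ids))
logits = classifier(h_final.squeeze(0))
```

The boundary token gets its own learned embedding here, which is one way to let the model learn the transition point.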
And here's a simple variant where I
decide that I'm going to run together the premise and hypothesis,
but now I'm gonna give them separate parameters.
So I have like an RNN for the premise here,
and an RNN for the hypothesis, green and purple.
And the linking point is just that the initial hidden state
of the hypothesis RNN is the final state of the premise RNN.
But, you know,
different parameters for the two.
And this is kind of intermediate between
the sentence encoding thing which was keeping them completely separate,
and the blurred together one that I just showed you.
Because they do get blurred together but with
different parameters reflecting different parts of the example.
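A toy sketch of this variant, with invented dimensionalities: two separately parameterized RNNs, linked by using the premise RNN's final hidden state as the hypothesis RNN's initial state.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 8, 16
embedding = nn.Embedding(100, embed_dim)
premise_rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
hypothesis_rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, 3)

premise_ids = torch.randint(0, 100, (1, 3))
hypothesis_ids = torch.randint(0, 100, (1, 3))

# Process the premise; its final hidden state becomes the initial
# hidden state of the hypothesis RNN, the linking point:
_, h_premise = premise_rnn(embedding(premise_ids))
_, h_final = hypothesis_rnn(embedding(hypothesis_ids), h_premise)

logits = classifier(h_final.squeeze(0))
```

So the two halves are still blurred together into one pass, but each half gets its own parameters.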
Yeah, and then obviously this has lots of variants as well, ah,
where I could fiddle around with different views on each of
these RNNs: bidirectional or not,
um, different hidden activation functions,
different cell structure, and so forth.
At this stage, I'm just curious: based on what I've presented,
do people have intuitions about which model is likely to be better?
I think there is a kind of clear winner if I've read the leaderboards
correctly, in terms of just raw performance on the NLI problem.
Do people have intuitions about which one to bet on? Yeah.
I've bet on this learner.
Oh, you would? Why is that?
Um, well it seems to capture
the entire premise and hypothesis, or pre- premise and conclusion,
and also have a delineation between both of them.
I would imagine that maybe,
it's more expressive than having just this concatenation.
So maybe I forgot,
but I think we presented three different options.
Oh, so your bet is this one, the first,
what about this one?
Yeah, this I think will be weaker because you don't have a separation.
Weaker than the sentence encoding ones like,
and this would be a minimal pair?
Um, I think it would- I don't know.
I'm not really sure how this combination function works,
like with the concatenation functions.
It would likely just be concatenation?
Yes, so-
Because then it's probably better than the other ones.
Because- but all those models have that in common, right?
Because the idea is, I get these two representations for premise and hypothesis,
and I want both of them to contribute to the classifier decision.
So concatenation is a good choice because then it will
give the- the weights that I learned above that for the classifier,
it will have access to all the different dimensions from both premise and hypothesis.
So I think that part, the combination, is shared, right?
Like this certainly has it.
I guess this doesn't have it,
but if it was bidirectional,
then you'd probably concatenate h_6 and h_1 having run in the reverse direction.
So, but what was your ranking?
Was it that one?
Well, you start from the first one; I'll help you guys. [LAUGHTER]
Ah, you know. Anybo- anybody else have bets? Yeah.
Ah, I'm gonna bet on this, the tree RNN.
Oh, this one here? Yes. So if we stuck to RNNs,
this would be the best one, wh- why was that?
Ah, RNNs tend to lose some information over time.
Yes.
So unless we're using an upgraded RNN, like an LSTM or a GRU,
I think that having two separate ones and
then combining their outputs is likely to perform better.
Oh, that's interesting.
Just by keeping things shorter here,
and getting this representation here and kind of saving it for the classifier,
I might be doing better. And I'd say [NOISE] that might
hold even if you've got a sophisticated cell like an LSTM,
just because these things inevitably get kind of attenuated.
That's an interesting insight.
Yeah, and actually, maybe I'll have to really look at these leaderboards here, because,
the way you framed that insight,
attention, which I'm gonna show you next, is highly relevant to this.
Because to the extent that I have mechanisms that help me remember distant information,
the calculus might change.
Um, but still I like that, that intuition. Yeah.
I can see an intuition where a simple RNN is best.
Because if you don't have a lot of data,
then maybe you're not gonna have enough data to
differentiate the premise from the hypothesis and everything,
so to just have a good language model, frankly,
it seems like you'd wanna take advantage of both;
versus if you had a ton of data,
I can see how having two different ones would give a little bit more richness.
Oh, that's another great insight.
Yeah. Yeah, just if you think about data versus the number of parameters that you need to learn,
[NOISE] then small data might do best with this one
which is the kind of most minimal model that I presented.
Yeah. Correct me if I'm wrong.
But I think that these models are the best-performing models.
Um, especially if you allow that you can have attention,
and that sentence encoding models are chosen by and large,
by people who are doing multitask learning,
so they want a representation of the individual sentences so that they can transfer to
other tasks or representations from other tasks can easily transfer into this one.
But it's like, if you just wanna solve NLI,
you pick a model that's like this,
and you add attention mechanisms which I'll show you next and that's
like a very powerful approach. Yeah.
When you say these comprise the best,
do you mean on the bake-off data or on the original dataset?
I meant like SNLI.
Actually let me, um-
Because I guess the way they created the hypothesis was by reading the premise,
and then saying, "Mmh, I'm going to now write an
entailment," that model might model exactly that process.
I had in mind in particular this leaderboard. So let's look.
So this is for SNLI, feature-based models,
that's like handcrafted stuff. Those are up here.
Sentence vector-based models that's like the sentence encoding ones.
The top score there is currently 87.4 on the test set,
and then down here I think these- before we get to the ensembles,
I think these are by and large, so that's a tree one.
But I think a lot of these are the chained,
but separate premise and hypothesis LSTMs,
and these numbers tend to be a bit higher, right?
Even setting aside the ensembles.
Well, it's less obvious just at a glance exactly what these models are doing.
But I'm pretty sure I'm right.
[NOISE] I'm pretty sure I'm right that for
these big, well-used tasks, chained but separate RNNs with attention are the best-performing.
So let me show you the attention.
That's the final stage here.
So here are the guiding ideas for attention mechanisms.
We need more connections between the premise and hypothesis.
Both classes of model that I showed you are kind of limited in what they allow.
So the sentence encoding ones just use
the final two representations and concatenate them,
and all that other information is kinda,
it has to be summarized by the two that I used for the classification decision.
Otherwise, it's lost.
Here's a more informal statement of this,
in processing the hypothesis,
your model needs some reminders of what's in the premise.
And that final hidden representation isn't enough to get those kind of reminders.
And then another guiding insight here might be that it's
useful to have a kind of soft alignment between the premise and hypothesis.
Alignment is an early insight from approaches to NLI,
where people would use algorithms to construct kind of
high-fidelity alignments between premise and
hypothesis and use that as the basis for making,
uh, decisions about entailment.
And you can imagine that attention mechanisms
especially the word-by-word ones that I show you are
kind of a soft way to explore that kind of alignment by
establishing some connections and learning weights on those connections.
But again, this is all in the spirit of kind of making
more dense interconnections between the two parts of your examples.
I'm going to start with what I think is the simplest version of attention.
You'll see that it's already kind of involved.
But I think I can convey to you the intuitions about how this works.
So again, we have our familiar example,
every dog danced, some poodle danced.
And this'll be the target state here, h_c.
So I use h_c to get a score vector that combines h_c with h_1,
h_2, and h_3, you can see them here.
And this is the simplest variant in the sense that all I'm doing is forming the dot-product.
It's kind of like un-normalized cosine similarity.
And that gives me a vector of scores, alpha here.
Those, uh, scores get normalized by
a softmax function into what are properly called the attention weights.
And so that's another vector of length 3 in our example here,
giving a weight, a kind of weight on the, in this case,
the association or the similarity between h_c and each one of these elements.
Then you get what's called a context vector.
And a simple version of that would be just that I take each one of these,
and multiply it by its attention weight down here,
and then take their average.
So this is a vector that has the same dimensionality as each one of these hidden states.
It's just been weighted by attention.
The attention weight is measuring similarity with this final state,
which is so important for these problems.
Then you do some kind of attention combination function.
So here I've concatenated k, the context vector,
with the final representation here, fed it
through a kind of dense layer, and applied an activation function.
And that gives you what I've called the attention combination representation.
Here's another variant where I have separate parameters for the context vector,
and for this vector here and just sum them,
very similar I think.
And then finally, you fit a classifier on top of
this attention combination in the usual way and make your classification decision.
So it looks like a lot,
um, but, you know, intuitively,
what's happening is in this basic form here,
I'm getting a measure of similarity with the final vector.
That gets averaged into this context vector,
and then that context vector,
and the usual thing that we use for prediction is
fed through into this classifier to make a final decision.
So it's like the usual thing and then K would be what I called the
reminders, an attention weighted view of what happened previously in the premise,
that's been like fast-tracked up as an input to the classifier.
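Here's a minimal numpy sketch of that whole pipeline; the dimensions and parameter names are mine, and in a real model the dense layer's weights W would of course be learned:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(premise_states, h_c):
    """Score each premise state against the final state h_c by
    dot-product, normalize into attention weights, and average the
    weighted premise states into a context vector."""
    scores = np.array([h @ h_c for h in premise_states])   # alpha, unnormalized
    weights = softmax(scores)                              # attention weights
    context = np.mean([w * h for w, h in zip(weights, premise_states)], axis=0)
    return weights, context

rng = np.random.default_rng(1)
d = 2
premise_states = [rng.normal(size=d) for _ in range(3)]    # h_1, h_2, h_3
h_c = rng.normal(size=d)                                   # final target state
weights, context = dot_product_attention(premise_states, h_c)

# Attention combination: concatenate the context vector with h_c and
# feed the result through a dense layer with a tanh activation.
W = rng.normal(size=(d, 2 * d))
h_tilde = np.tanh(W @ np.concatenate([context, h_c]))      # classifier input
```

The classifier then makes its decision from `h_tilde`, so the attention-weighted view of the premise is fast-tracked straight into the final prediction.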
Here's a specific example.
Just in case you're the sort of person who likes to see numbers here.
So imagine these are all just two-dimensional vectors.
I get those scores.
And here, the intuition, which you could probably read off of the vector values, is
that the first word, "every", is really very similar to "danced",
and it's got high magnitude.
So you get a really large un-normalized,
um, cosine similarity value.
And then they sort of drop off.
And then they get normalized by the softmax function,
which sort of flattens things out.
Then I do the mean to get that context vector,
and that context vector is dimension 2, right?
It's just an average of all these weighted vectors from down over here.
And that gets concatenated with this representation.
Hope you can see that in the purple.
So that gives me a 4-dimensional vector,
which I feed through this dense layer to get the attention combination.
And then finally, that gets fed into the classifier as before. Yeah?
Seems to me like the larger your output vector,
the more attention you get pretty much no matter what.
[inaudible] It's got the highest weights because it had larger numbers?
It did, yeah. On this scoring function.
And that's because, well,
it's not only larger numbers but also
larger numbers that are very similar to this final representation.
That's sort of the way I cooked things up so that you could see what the vectors are doing.
But you're right in the sense that the more similar
I am to this final representation and the larger my magnitude,
the greater my attention weight will be.
[inaudible] you're going to get pretty close to the original vector.
Yeah. I made up these numbers.
So I think you have exactly the right intuition. Yeah?
So to what extent does, do you
benefit from the attention versus just seeing all of the hidden states from previously?
Like how much does,
where does the attention like add to that?
It's an interesting question, so how would I see all the other states?
[inaudible] them before, right?
Right.
Um, so if you just had a classifier on all of those as an addition to the final one,
like would that give you the same like results status using this attention layer?
I see what you mean. So how would that model deal
with varying length for the premise or the hypothesis,
especially the premise in this case.
If it could be 3, 4, 5, 6.
If I didn't know the length,
how would I summarize them into something that was fixed
dimensional so that I could feed it in the way it happens up here?
I think that would be the blocker to actually just literally reusing them.
And so the attention mechanism is giving us a kind of blurry look back, in a way that for
this mechanism is just hardcoded to
weight things that are similar to this final representation more highly.
But we are going for
the same intuition which is that we want to kind of bring them back in.
The idea is that, like, all of these models,
they are kind of like as they move through data,
they have a kind of hazy view of their own past.
Uh, and then we want to give them these reminders,
and in this model the reminders come through a very thin pipe.
But this is a good transition point.
So, um, there are other scoring functions that people have explored.
I just showed you the dot-product.
But you could have learned parameters in here,
like this one, called "general" by Luong et al.
It's general because it's giving us a way of combining h_c,
the final representation, with each one of
these vectors, with learned parameters for that kind of bilinear combination.
And so that's giving us a chance to go way beyond
the hard-coded ideal of the dot-product into learning just in general what,
what relationships are valuable for the problem that I'm
trying to solve and concat is very similar here.
I think the crucial innovation from both of these is
that you're learning weights as part of the attention mechanism.
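Those three scoring functions can be sketched like this; the parameter shapes here are my own choices for illustration:

```python
import numpy as np

def score_dot(h_c, h_i):
    # Hard-coded similarity: un-normalized cosine-like dot-product.
    return h_c @ h_i

def score_general(h_c, h_i, W):
    # Luong et al.'s "general": a learned bilinear combination of h_c and h_i.
    return h_c @ W @ h_i

def score_concat(h_c, h_i, W, v):
    # Luong et al.'s "concat": a small feed-forward net over [h_c; h_i].
    return v @ np.tanh(W @ np.concatenate([h_c, h_i]))

h_c = np.array([1.0, 2.0])
h_i = np.array([0.5, -1.0])
# With W as the identity, "general" reduces to the plain dot-product.
same = score_general(h_c, h_i, np.eye(2)) == score_dot(h_c, h_i)
s = score_concat(h_c, h_i, np.ones((3, 4)), np.ones(3))
```

So the dot-product is a special case of the learned variants, which is one way to see why learning the weights can only add flexibility.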
Would there be any benefit, especially when the premise and hypothesis get long, to have
attention for the previous parts of the hypothesis?
That's a wonderful question.
I mean, yes and if you look at the literature,
especially the early literature on attention which was focused on machine translation,
that's, in fact, the way they're defined, right?
So I have attention for this state but I also have it for
this one and this one all the way through the hypothesis or,
you know, the translation.
Um, the reason that's meaningful is because,
in translation, each one of these has an output.
So you're like predicting a word for example.
Whereas in NLI you have an output only in this last state.
So for this presentation,
if you did compute them for all these other steps the information would just disappear.
Um, and that's a limitation of the,
the places where we're getting supervision.
So for this global attention,
the answer is kind of no,
it's not worth doing for every step in
the hypothesis but that's a nice transition to word-by-word attention.
So word-by-word attention is gonna try to make good on exactly your intuition.
I would like to get information at all of these steps
about what my premise was like but it gets more involved.
So here let me see how well I can do this.
So every dog danced,
some poodle moved, focus your attention on this B state.
And keep in mind that this has a kind of recursive property.
So you might feel a little bit at sea at the start.
The first representation we use to build this is kind of fed from B;
it's looking at B here.
So B gets copied over and it gets weighted by
the previous attention weights K which came from this A representation.
I didn't show you those but just imagine you had them.
So I re-weight B by those attention things and then I add
in all of the vectors from everything in the premise, right?
So this gets copied. These are all the same.
But the reason they're repeated is so that I can add in
all the previous states and so this would be
for whatever dimensionality you had for the premise.
You can see already that in this case I'm gonna
have to make the simplifying assumption that I
have the same length for the premise and hypothesis,
so that these parameters W are well-defined.
But I hope you can see that the fundamental thing
is we're bringing in all of the premise.
We're kind of infusing the current state
with everything from the premise and then feeding it
through this neural layer to get M. Now,
M doesn't have the right dimensionality for us going forward.
So the true weights for this are fed through a softmax combination of
M with some other learned weights W which are performing this transformation.
Then I get what is kind of like the context vector that we saw before.
And here, it's all those premise states again weighted by alpha B.
And then you bring in also the previous attention weights here.
That's KA and you get some parameters for that as well.
And then finally, it's the same as before:
this classifier layer just uses k_C and h_C.
So down here, I would use this one
and the final calculation that I get at the final state,
and feed that through a classifier. Makes sense?
And you do this, so I do it for A and for B and when I get to C,
I create this classifier.
And since it's dependent on KC,
which is dependent on KB and B,
which is dependent on KA and A,
I've also got lots of influences from all these attention weights that I
kept word by word through the entire processing of the hypothesis.
So if you squint, I think it makes good on the intuitions that both of you had. Yeah.
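Here's a simplified sketch in the spirit of the Rocktaschel et al. word-by-word mechanism; the variable names and shapes are mine, and their actual model has more parameter matrices than this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_by_word_attention(Y, hyp_states, params):
    """Y is a (d, L) matrix of premise hidden states. For each
    hypothesis state h_t, infuse it (plus the previous attention
    summary r) into every premise position, compute attention weights
    over the premise, and update the summary r."""
    Wy, Wh, Wr, w, Wt = params
    d, L = Y.shape
    r = np.zeros(d)
    for h_t in hyp_states:
        # Broadcast the current state across all L premise positions.
        M = np.tanh(Wy @ Y + np.outer(Wh @ h_t + Wr @ r, np.ones(L)))
        alpha = softmax(w @ M)            # attention weights over the premise
        r = Y @ alpha + np.tanh(Wt @ r)   # carry the summary forward
    return r

rng = np.random.default_rng(2)
d, L = 3, 4
Y = rng.normal(size=(d, L))
hyp_states = [rng.normal(size=d) for _ in range(3)]
params = tuple(rng.normal(size=(d, d)) for _ in range(3)) + \
         (rng.normal(size=d), rng.normal(size=(d, d)))
r_final = word_by_word_attention(Y, hyp_states, params)  # feeds the classifier
```

The recursion on `r` is what makes this "word by word": the attention weights computed at each hypothesis step influence every later step.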
The question about the intuition behind attention.
So it seems like sort of the underlying hypothesis is that
your representation states have similar values, right?
So their dot product or their matrix multiplication will be large.
But in this case, right?
Maybe it's really useful if you're trying to say that one thing
is not related to the other that your hypothesis is not related to your premise.
Maybe the states will have very different values and
your attention would almost make you not pay attention to that at all.
Like, I guess at what point does it make sense that you're always assuming
that similar values for your features is what you want to pay attention to?
Does that make sense as a question?
I think that this is a failing only of the dot-product one, right,
where I've kind of hard-coded the ideal.
I tried to highlight them in orange here: I have three chances to learn parameters,
and those parameters could reflect all of these different associations.
So not only strong things but also weak things, right?
And you see this in visualizations of
the learned weights that you kind of learn to amplify
important pairings and de-emphasize ones that are not informative for the final task.
So this is kind of like again the data are speaking very
loudly in this case in a way that they weren't for the dot-product view.
Yeah and let me just mention a few other variants.
I mean, what I've done there with those two views.
I mean, I just picked two that seemed kind of representative of strategies people use.
My personal opinion is that the literature on attention is
kind of hard to read because there are so many variants.
And when you open up a new paper and see that
people have tried a different view of attention it's hard to
know where the meaningful differences are
and where they're just kind of incidental implementation details.
But I hope you're now armed to think like there's kind of
these two classes global with like
pretty hard-coded values about what attention weight should be like on
up to this one which is like very freely
learning word by word what the attention weight should be.
Other variants: Luong et al. also explored local attention,
which, instead of looking at
the entire premise when forming whatever vector of attention weights I'm gonna form,
looks only at a controlled window.
Um, I think that's most applicable in situations like machine translation,
where you have lots of output states and so you're kind of moving through
the hypothesis attending to different local windows in the premise.
Um, or I should say the source and target languages for machine translation
or for like language modeling or dialogue which could have a lot of output states.
Um, for NLI it kind of seems like it's
that final state that really matters and then you might as well
attend to the entire premise. Yeah.
[inaudible] and that works?
Yeah, I, I've sort of mentioned that down here.
And we're gonna talk about those later.
This is what I mean by, um,
attention has become kind of pervasive and maybe even the primary source of connection.
Let me just go through these now before getting to that.
So word-by-word attention.
This thing that I showed you here can be set up in
many ways and you could have many more learned parameters.
And as an extreme case of that,
this Rocktaschel paper which was really
foundational in showing that attention is powerful for NLI,
they basically seem to have taken the view
that for every one of the matrices that you have here,
you have associated embedding weights or transformation weights.
So they have a lot more, um,
W's that would be in orange if you depicted their full model.
But I think that with this summary here,
it's kinda the same intuition which is as you troop through the hypothesis infuse
every state that you look at with an attention weighted version
of the entire premise and learn parameters along the way.
You could, you could, um,
as you troop through the hypothesis,
append attention weights from the premise at each time-step.
That's another variant that Luong et al explore,
and then relevant to things like the transformer.
So a recent development which I think we'll talk about
in a couple of weeks is that in the
usual framing like the one I have taken for
this lecture uh, the primary connections are the RNN ones.
And then you do this like additional layer of attention connections.
And in the paper,
I think this is the one that's called "Attention Is All You Need".
Uh, the idea there is to drop
the basic recurrent network connections and just have lots of attention mechanisms.
And in my view what that's doing is kind of just
saying for every point that I'm at in an example,
just freely explore connections to my neighborhood maybe through that,
throughout the whole example and just see if there are
any influences anywhere around me that are relevant for solving my problem.
So it's a very open-ended way of exploring the data.
It doesn't even encode kind of the linear structure that you get from an RNN.
But those are, those are proven to be really powerful.
And then as a final kind of more indirect connection,
but still interesting the memory networks that were developed a few years ago,
they're kind of similar to attention mechanisms
in the sense that they're trying with additional,
augmented representations to address the fact that
your network might be sort of forgetting things as it processes examples.
But the mechanism there is quite different.
So here you have this kind of
memory store representation as opposed to additional attention connections.
But of course, those ideas could be combined as well.
Okay, that's very technical stuff.
Um, but I hope you feel,
if you weren't already kind of prepared to dive into the literature,
it will be rewarding.
Ah, I think it's a very clear result at this point that for lots of problems in NLU,
attention mechanisms are valuable.
Excellent, so final phase here,
I wanted to show you some error analyses.
and suggest some new possibilities for doing this.
So I mentioned the MultiNLI annotations,
here I've given a few examples,
the annotations are on the left,
I think there are 16 categories for them,
and all they're doing is just giving you know
a human's judgment about certain important semantic properties of the examples.
So this one is MODAL and COREF because it has can in here, ah,
and it's sort of got an anaphoric connection between students of
human misery and those who study human misery.
NEGATION, TENSE DIFFERENCE, CONDITIONAL,
QUANTIFIER, ACTIVE/PASSIVE. This active/passive thing is nice.
They consolidated programs to increase efficiency and
programs to increase efficiency were consolidated.
You know, like it's not surprising that
that's a common strategy for people who were constructing examples.
And it's really interesting to ask of your system whether it has learned the kind of
human linguistic thing that active passive variants
preserve some aspects of meaning or many aspects of meaning.
So just as a kind of example of something you might do,
I used some of my own Amazon credits that I got along with you
all to just run a bunch of experiments on the cloud just using the course code.
So I did Logistic Regression with cross-product features,
um, and I explored like the regularization strength and the penalty type L1 L2.
And then for LSTMs,
I did a chained one.
So that's the most naive version where you have shared parameters across
premise and hypothesis as well as a sentence encoding one,
which I just pulled out of that notebook that I showed you.
And I explored the embedding dimension,
the hidden dimensionality, and the learning rate and the activation function.
Um, up here, I've given the performance on the 1,000 or so annotated examples.
So the logistic regression was best,
chained, and then sentence encoding.
Just as an aside here,
I noted I wanted to store
all the model parameters so that I could test them on new cases.
The model file for the logistic regression is
over 600 megabytes because it has 16 million features.
The cross-product feature spaces are really large,
contrast that with the LSTMs,
where the model files are each about a megabyte.
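To see where those 16 million features come from, here's a toy version of a cross-product feature function; the course's actual implementation may differ:

```python
from collections import Counter

def word_cross_product_phi(premise, hypothesis):
    """One feature per (premise word, hypothesis word) pair."""
    return Counter((p, h) for p in premise for h in hypothesis)

feats = word_cross_product_phi(
    ["every", "dog", "danced"], ["some", "poodle", "danced"])
# 3 x 3 = 9 features for this one example; across a corpus the space
# grows roughly with the vocabulary squared, hence millions of features
# and a very large model file.
```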
So, you know, despite this performance difference, if I wanted to run these models on a phone,
um, then I just can't use that logistic regression, right?
But these models down here are nice and compact.
An important thing to keep in mind.
[BACKGROUND]
Oh for the annotations?
I don't know. That's a good question.
I should have checked. Um, it
would be quick to check, though, because those models are quick to train.
I'm less concerned though with the absolute performance.
Although it would be nice to bring it into the mix for an error analysis,
but what I was really interested in is just that you have this view
of absolute performance where maybe logistic regression is best.
But how are they differing according to these different annotation categories?
So there are like 16 of them,
so I grouped them in a few ways.
These are the categories where all the models are more often correct than incorrect.
Um, that's a kind of high level summary and that's ACTIVE/PASSIVE,
antonyms, BELIEF reports, conditionals,
long sentences, MODALS and NEGATION.
That's pretty good. Okay, and it's good that there are a lot of them where
they're more often correct than incorrect.
Here's the one case where all the models were more incorrect than correct.
So this is presumably,
no matter what your model, the hardest, and this is QUANTITY/TIME_REASONING.
And this doesn't surprise me,
because I think a lot of the models and also the feature spaces are just
not sensitive to this low-level kind of stuff, it's very subtle.
This is a classic case of a very small set of
semantic morphemes having very large consequences for what the overall sentences mean.
And so I'm not surprised that this is a hard one.
Here's a case where only the logistic regression was more often correct than incorrect,
the other models were the reverse and that's COREF and tense differences.
And that's kind of interesting.
Um, the TENSE_DIFFERENCE one makes sense to me in the sense that
the word cross-product model, first of all, is keeping track of all of these pairs.
But since I didn't do any stemming or other normalization,
it also knows about differences in tense.
Um, and so it has this enormous space of features and it's just kind of
pretty well attuned to the fact that you might have like
was to is because it has that directly represented,
whereas the, um, LSTMs are probably struggling because they
probably end up with pretty similar representations for
the tense variants because they occur in a lot of the same environments.
And you might tell a similar story for the COREF case.
Here's one where only the chained LSTM was
more correct than incorrect and this is WORD_OVERLAP.
That's kind of surprising to me,
I don't have an explanation for that.
This would make perfect sense if I had added some attention mechanisms,
um, but I didn't.
And I would have thought that the logistic regression would be really good at this,
but it's just about you know 50-50.
And here the finally, here the cases where only
the sentence-encoding LSTM was more incorrect than correct.
So the other two models were better, right?
This is our worst model remember and this is PARAPHRASE and QUANTIFIER.
I don't have a deep explanation for this I guess.
Except that it's the worst of the three. Yeah?
Ah, so did you [inaudible] ask where the initial input embeddings weren't random,
and instead were like GloVe or something?
I considered it, but my,
I took the philosophical view that I would introduce
no outside information in this competition and since
the logistic regression model can't benefit from GloVe vectors,
at least not in the way that I implemented it,
I would also not give the other models that advantage.
My hunch is that if I brought in rich initialization,
then the LSTMs would just jump up in performance.
But they would have a lot more information at their disposal.
I guess my assumption is that the logistic regression
has a lot more information [inaudible] cross products is such a large feature space.
Yeah fair is fair,
it's like I gave these three players the data and nothing more.
But another dimension to this problem would be,
if I compare an LSTM with and without BERT representations for its words,
where do the differences arise?
And I think this could again illuminate like you probably even have some expectations
that it's going to do better at things like that involve general lexical information,
maybe word overlap,
antonyms, maybe belief, negation.
Um, but long sentence might be unaffected by this outside information. Maybe not though.
A couple more things.
So another thing that you can do in NLI,
ah, is resort to some linguistic expectations that you might have.
So you could ask, for example,
does your model know that negation is downward monotone?
That is, can it get right consistently that Fido ran entails Fido moved?
And Fido didn't move entails Fido didn't run?
So the up arrow means I have entailment in that direction,
down entailment in that direction.
And what this would mean for your model of course is that,
any example that you plugged in,
maybe where you would say where it got this right,
if you just add a negation it should get this one right.
In my experience, these models disappoint you, in that they have not learned this.
And you will certainly be disappointed if you
try something a little bit more sophisticated like,
does your model know that every is
downward monotone on its first argument and upward on its second?
This is a semantic fact that I think is very clear.
Um, but your model probably will not have the systematicity
that leads you to think that yes it has fully encoded this logical fact.
You will find some examples where it seems to be behaving in
accord with this generalization and others where it does something quite crazy,
even if it is performing well on a large-scale assessment like SNLI.
And that's something to think about.
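Here's a tiny sketch of how you might probe for the downward-monotonicity property, with a made-up `negate` helper and a stand-in `model_predict` function:

```python
def negate(sentence):
    # Toy negation for probing; real data would need actual morphology
    # ("Fido ran" -> "Fido didn't run").
    return "it is not the case that " + sentence

def monotonicity_probe(model_predict, pairs):
    """If the model predicts that the premise entails the hypothesis,
    downward monotonicity says it should also predict that the negated
    hypothesis entails the negated premise. Collect the failures."""
    failures = []
    for prem, hyp in pairs:
        if model_predict(prem, hyp) == "entailment":
            if model_predict(negate(hyp), negate(prem)) != "entailment":
                failures.append((prem, hyp))
    return failures
```

Run over, say, all the pairs your model labels as entailment, this gives a direct count of monotonicity violations.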
Here's a related thing just by way of wrapping up.
So I've mentioned before that,
that Bill as part of his thesis did lots of rich work on
natural logic, applying it to NLI tasks.
And natural logic has the characteristic that it has
very sophisticated little algebras of semantic relationships.
And I've summarized one here.
This is a kind of theory of negation, um,
where you know if I have p and q are disjoint like dog and couch,
then the negation of those two are neutral.
And I filled out this whole table here and you can construct, um,
artificial data sets where you just insist that the lexical items, so to speak,
have certain semantic relationships like entailment, consistency, overlap contradiction.
And then you can define the algebra on top of them and
construct as many examples as you want in a kind of algorithmic fashion.
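A sketch of that generation idea; apart from the disjoint case from the slide, the table entries here are illustrative placeholders, not the full natural-logic algebra:

```python
# Hypothetical fragment of a negation table: the relation between p and q
# determines the relation between not-p and not-q. Only "disjoint" ->
# "neutral" comes from the lecture's dog/couch example; the other entries
# are illustrative placeholders.
NEGATION_TABLE = {
    "equivalent": "equivalent",
    "entails": "reverse_entails",
    "reverse_entails": "entails",
    "disjoint": "neutral",
    "neutral": "neutral",
}

def negate_n(p, q, relation, n):
    """Wrap both sides in n layers of negation, updating the label
    algorithmically via the table."""
    for _ in range(n):
        p, q, relation = "not " + p, "not " + q, NEGATION_TABLE[relation]
    return p, q, relation

# Double negation flips entailment back to itself:
doubled = negate_n("dog ran", "dog moved", "entails", 2)
```

Since labels are assigned by the table rather than by annotators, you can generate as many singly-, doubly-, or triply-negated examples as you like and test whether the model's predictions track the algebra.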
In past years of this class,
we've done this as a homework problem just to have
people confront the semantic subtleties.
And so then you could ask,
if you train your model on a doubly-negated dataset,
an artificial one that you create like you know,
not not p not not q.
Does it generalize to the triply-negated cases?
You might think it should, right?
It should have all the information that you need at
that point because it's seen the base cases,
It's seen single negation and it should now
know that double negation just kind of flips you back,
or you know flips you to another systematic space in this little algebra.
Um, but again, I think it's unlikely that your model has fully learned
this algebra and you will see that as you try to
ask it to make predictions about longer and longer sequences.
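The algorithmic construction described above can be sketched in a few lines. The relation labels and this particular negation table are simplified assumptions for illustration (the "disjoint goes to neutral" entry follows the dog/couch example from the slides), not the full natural-logic algebra:

```python
# How the relation between p and q maps to the relation between
# not-p and not-q (negation applied to both sides once).
NEGATE_BOTH = {
    "equivalent": "equivalent",
    "entails": "reverse-entails",
    "reverse-entails": "entails",
    "disjoint": "neutral",        # e.g. dog/couch -> not-dog/not-couch
    "neutral": "neutral",
}

def label_for(p, q, base_relation, depth):
    """Relation between the depth-fold negations of p and q."""
    rel = base_relation
    for _ in range(depth):
        rel = NEGATE_BOTH[rel]
    return rel

def make_example(p, q, base_relation, depth):
    """Build one premise/hypothesis pair with its gold label."""
    prem = "not " * depth + p
    hyp = "not " * depth + q
    return prem, hyp, label_for(p, q, base_relation, depth)

# Train on depths 0-2, then stress test on deeper nesting.
print(make_example("dog", "animal", "entails", depth=2))
# -> ('not not dog', 'not not animal', 'entails')
print(make_example("dog", "couch", "disjoint", depth=3))
# -> ('not not not dog', 'not not not couch', 'neutral')
```

Because the gold labels are computed from the table, you can generate as many training and stress-test examples as you want, at any depth.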
Um, and this is a nice way to stress test your model and I've given
a few examples of papers that I think do this quite
well, including one by someone who's on the teaching team.
[BACKGROUND]
Trained on, like, not-not-p, not-not-q;
not-p, not-q; p, not-q; not-p, q?
Or just on not-not-p, not-not-q?
I would do it on- yeah,
that's a good point, like the unnegated cases,
the singly negated, and the doubly negated.
I would think that that's giving your model
a fair chance to have seen all the relevant kinds of patterns,
and it should have everything it needs now to generalize to triple, quadruple,
and so forth. Yeah.
When we're training, um, these types of models,
are they completely not transferable from
one language to another, even if they're part of the same language family?
Because, say, for example, double negation is a thing in
other languages that acts differently than in English,
so I would expect them not to work.
But say we trained on a huge dataset of English words,
English sentences, and that included, I don't know,
some sentences in a different language that's related to English,
and then we test on that different language,
would we expect it to pick up things that are- uh-
Yeah, it's a good question. For- for this probing artificial data case,
this is meant to be sep- separate from any intuitions you might have about negation.
This is more like testing the sense in which the model is doing something systematic.
In the sense of cognitive science like has it really learned to abstract from
your data a kind of systematic pattern as embodied in this little algebra up here?
And this is just nice because since you control all
of the- all aspects of the model and the data that come in,
you can have a pretty good sense for whether you've posed this in a fair way,
uh, and that can give you insight into what your model is actually capable of learning.
For the more general and interesting linguistic question of
whether or not parameters you've learned on one language would transfer to others,
there are experiments of people trying to do that and in fact,
you can try that for NLI because that Facebook group
released a version of MultiNLI that is multilingual.
Um, pursuing the hypothesis that there might be some abstract layers of
representation that are good for multiple languages. Yeah.
Yes, are you ultimately limited to
feeding your model certain things and seeing what it spits out?
You can't, sort of, look inside-
obviously it's not easy to say "have you learned the concept?"
and, like, look at the activations.
But is this still, sort of,
the main way of asking this question,
just throwing things in and seeing what comes out?
I'm just smiling because I,
sort of, feel like the answer is yes,
and this is just amusing because it's like a sense in
which deep learning has become like a psychology experiment,
where instead of having humans come into the lab
where you probe them behaviorally to see what they know,
you now have this black-box model that you probe behaviorally to see what it knows.
[LAUGHTER] Um, there are great papers that are trying to push us past this.
Um, where you, kind of,
have another model that does some local inspection
of the parameters, or maybe looks at how,
um, different examples, as they pass through the network, activate
different parts of it, to give you insight into what they actually have encoded.
But I think a lot of it still comes down to this kind of behavioral thing.
Which is not necessarily bad, it just means that you might also wanna get good at running
human subject experiments, so that your artificial agents are treated fairly
by the protocols. [NOISE] Yeah.
And sort of related to this- I have seen some, kind of,
optimism that attention is not only, kind of,
a performance enhancer, but maybe also some kind of
explanatory metric that you can look at, like, what is my model paying attention to,
and maybe get some insights there. But I was just talking to someone recently who was
saying that there's been a paper published recently saying that attention,
actually, is not any good for this.
It doesn't actually help us explain things.
Wondering if you have any insight on that.
Yeah, again, I think there are differing opinions.
My view would be- let me give you two views.
So first for NLI,
when you look at the heat maps of
the different attention weights that it has learned on word-by-word attention,
those seem often to be quite illuminating to me and do, kind of,
reflect what you would expect that certain pairs of words across premise
and hypothesis are really informative about the classification label.
So I think that all looks good to me.
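Word-by-word attention of the kind being described can be illustrated in a few lines: for each hypothesis token, a softmax over premise tokens based on a similarity score. The toy 2-d "embeddings" here are made up for illustration; a real model would learn them:

```python
import math

# Toy embeddings, invented purely for this example.
EMB = {
    "a": [0.9, 0.1], "dog": [0.2, 1.0], "is": [0.5, 0.5],
    "sleeping": [1.0, 0.3], "animal": [0.3, 0.9], "an": [0.8, 0.2],
}

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_map(premise, hypothesis):
    """Rows: hypothesis tokens; columns: weights over premise tokens."""
    rows = {}
    for h in hypothesis:
        scores = [sum(a * b for a, b in zip(EMB[h], EMB[p])) for p in premise]
        rows[h] = softmax(scores)
    return rows

heat = attention_map(["a", "dog", "is", "sleeping"], ["an", "animal"])
# 'animal' should attend most to 'dog', the aligned content word.
print(heat["animal"])
```

Plotting those rows as a heat map gives exactly the premise/hypothesis alignment pictures discussed here.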
Conversely like I was really excited when I saw people adding attention mechanisms into
these problems because it kind of looked like it was a soft parse of the example,
like a dependency parse,
and I thought that would be really cool because maybe
you'd be able to read off of the attention weights
what the true parse was or maybe what the top most likely parses for this utterance were,
and I think that that dream is not quite realized,
it's less clear that that's happening.
But for the NLI case,
I think this is like one way of making good on what your goal was.
Do you have different views?
[OVERLAPPING] You think a lot about these things.
[LAUGHTER] We'll talk more about attention when we talk about the transformer,
and BERT, and ELMo,
because attention is a big part of what makes those models work.
I think that was a lot of material,
and so I think I won't try in this- in these last few minutes to do
the ambitious thing of showing you too much about these grounding slides,
um, they are up.
Let me just show you just one thing.
So this is about grounding, and what I did is,
I gave you a bunch of linguistic insights about
why grounded language understanding is important,
and then I frame this as speakers, listeners,
chat bots, and reasoning about other minds.
And so I'll just say that, for speakers- what I've done for a lot of
these slides is focus on colors, people describing colors,
because it's a simple version of a very large class of interesting problems,
and colors are great because they have, like,
cognitive complexity and linguistic complexity.
So for speakers here,
you could think of models that are learning to describe colors.
And there's great data for that,
and this is a nice manageable problem that would allow you to confront,
um, like color to sequence, right?
So color comes in and then you need to produce a description.
Then the listener view is the converse, right?
So somebody gives you a color description and you
need to make a guess about what color they are describing.
That's interestingly different, the inputs are different
and the output space is a little bit, uh, harder to deal with.
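A minimal sketch of that listener direction: given a color description, guess which color in a small context is meant. The context and the lexicon of which descriptions are true of which colors are made-up assumptions for illustration:

```python
CONTEXT = ["light blue", "dark blue", "red"]

# Which descriptions literally apply to which colors (toy semantics).
TRUE_OF = {
    "blue": {"light blue", "dark blue"},
    "dark": {"dark blue"},
    "red": {"red"},
}

def literal_listener(description):
    """Uniform distribution over the colors the description is true of."""
    consistent = [c for c in CONTEXT if c in TRUE_OF[description]]
    return {c: 1 / len(consistent) for c in consistent}

print(literal_listener("blue"))
# -> {'light blue': 0.5, 'dark blue': 0.5}
print(literal_listener("dark"))
# -> {'dark blue': 1.0}
```

A learned listener replaces the hand-built lexicon with a model that maps descriptions to a distribution over colors, but the input/output shape is the same.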
And then grounded chat bots.
You've probably heard about chat bots?
I think that they are often kind of at sea, because they're just using language and
producing language, and they're not, like, actually
connected to any problem or any social situation.
This group- this Facebook paper, uh, on negotiation,
is a dataset where you have chat bots that are grounded in a goal-oriented task,
and I think that's very exciting.
So I've just tried to present their model in its basic form,
and you should know that it's a really cool dataset.
And then for other minds,
I've sort of made a connection between pragmatics and machine learning,
and an interesting thing about that is that, again,
you have a bunch of really interesting, quite manageable datasets for doing projects on.
And you guys are really well-prepared to think about all of these models,
you'll find, if you look through,
you'll see that because basically,
all of them are like NLI models where, instead of having just one label,
you have a label for every stage in your hypothesis, or in your output space,
in what's called the decoder part of the problem.
Really easy to implement based on what you guys have already done,
and then you have all these cool data sets to think about.
So I'm gonna give you much more detail about that on Monday,
but if you're still casting about for projects,
you might page through this and just, maybe,
follow the links to various data sets and so forth,
because it is such a rich space.
But let's stop for now,
that's been enough content. Thanks everyone.
