My plan for the day is to finish up that stuff on methods.
We don't have much left,
but I think they're really interesting topics.
And then I have a notebook on metrics.
And I think I won't go through the entire notebook,
there's a lot of material there.
What I thought I would do is just go through the section that's on accuracy, precision, and recall.
Because I think if I do that,
not only are those probably going to be
the most used metrics for projects in this course,
but also you can kinda get a feel
for the higher level thing that I'm trying to do which is
to get you to think in a kind of structured way about what your metrics are doing,
what they're not doing,
where they might have faults, and so forth.
So I'll do that. And then Atticus is gonna take over.
Atticus has a very exciting mini-lecture on generalization and adversarial testing,
as I've called it here.
And we'll let that take whatever time it takes and then just wrap up. And if there's a little bit of time left, that's great: you can just start to work on your projects and chat with the members of the teaching team who are here.
And then next week, contextual word representations: I'm going to try to make you responsible users of BERT, ELMo, and the Transformer.
Uh, and then next Wednesday we'll have a panel discussion on NLU in industry.
And then Memorial Day.
And then we kind of wrap up with some other loose ends again in the spirit
of trying to introduce topics and content that will help push your projects forward,
that's basically what we're all about from here on out.
Oh, we had a wonderful discussion.
I really appreciated all the comments that people made last time about the material.
There was, you know, a bunch of stuff that I had never thought about,
and that deserves further thought.
Um, in the meantime,
have any questions or concerns occurred to you about the things we covered? We did a lot there, and covered a lot of ground:
Data organization, cross-validation, baselines,
hyper-parameter optimization, and classifier comparison.
So Google has that system this year for hyper-parameter selection, which I guess is maybe what they use for AutoML too.
Is there some sense that those things have kind of helped with hyper-parameter tuning, or is that still privy only to these big companies that have so much money that they can just blast away on hyper-parameters, it was 400,000 hours or something?
I mean, there's definitely an element of blasting away.
And, kind of impressionistically, for SNLI, which is a leaderboard I follow closely, there is a split:
you see sometimes that simple models with
relatively few parameters have really high scores,
and I think that's strongly correlated with them originating in companies like Google that could do really wide hyper-parameter search to find settings of the model that were really, really good.
And correspondingly, these really articulated, intricate models that are guided a lot by people's intuition about the data do well with less hyper-parameter optimization, because they are guided by intuition.
So it's an interesting kind of juxtaposition.
I myself don't know much about Google Vizier.
Is that a public product or something they just use internally?
I think they published a paper about it. It was part of their learning-to-learn push, I think, around 2017, but I wasn't sure how widely applicable it was.
Yeah, I don't know, I mean,
I remember it from that quote I gave from the healthcare paper
where they spent 200,000 hours doing,
uh, optimization using Vizier,
but I guess I don't really know much about what it's all about.
It would be great if they alleviated this burden by making resources
available. Other things on your mind?
All right. Let's do this final bit here, which is,
assessing models without convergence,
which you could think of as another way of calling out a lot of deep learning models.
And the background here is that in the simpler era of 2011 for this course, when everybody was using logistic regression, convergence issues rarely arose. You might not even notice it, but those models do have thresholds, and they typically run for some number of iterations, but they tend to converge, right?
So you just kind of don't think about the fact that this
is one of your hyper-parameters because they just
finish up possibly before whatever threshold was set on the number of epochs.
So we didn't talk about this too much until recently but
now with neural networks kind of at center stage,
convergence too has taken center stage,
because these models rarely converge.
Um, they can converge at different rates between runs.
And then the other side of this is that maybe running to convergence is not really what you want, because what you might implicitly be doing with early stopping is some regularization, and that regularization is benefiting your solution.
Um, and then the final point here is just
that the performance of these models on test data is
often heavily dependent on when you decided to stop and how it related to convergence.
So the bottom line here is that sometimes a model with a low final error turns out to be great, and sometimes it turns out to be worse than one that finished with a higher error. Who knows, right? That's what I was gesturing at with this notion of stopping early as a way of regularizing, or as a way of avoiding overfitting to the training data.
But this poses a whole new set of problems, right?
Because it's at least one more hyper-parameter but it's a very
complicated one to deal with because now
you're thinking about having to run systems to
varying degrees interacting with all your other hyper-parameters.
And of course the more they need to run,
the more expensive this all gets in some generalized sense.
So one class of responses that you could have to
this I've grouped under the heading of incremental Dev set testing.
So basically the idea here is that as your model is training,
uh, incrementally say every 100th
epoch or every 10th or every 1000th, whatever you can afford,
um, run an evaluation on a Dev set and see how performance is doing.
And I just want to mention that all the PyTorch models included in your repo for this course can do this. If you give fit the optional X_dev and dev_itr arguments, then it will run evaluations on X_dev.
And you can kind of collect that information and make use of it.
You could show a learning curve um, or you could
use it to make a decision about where you want to stop.
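As a rough sketch of the idea (using scikit-learn's SGDClassifier as a stand-in for the course models, whose exact fit signature I'm not reproducing here), you might evaluate on the dev set every tenth epoch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a real train/dev split.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)

dev_scores = []
for epoch in range(1, 101):
    clf.partial_fit(X_train, y_train, classes=classes)
    if epoch % 10 == 0:  # evaluate on the dev set every 10th epoch
        dev_scores.append((epoch, accuracy_score(y_dev, clf.predict(X_dev))))
```

Each (epoch, score) pair can then feed a learning curve or an early-stopping decision.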
And I will say that, in my view at least, TensorFlow is kind of ahead of the curve in all of this. For the TensorFlow classes in this course, I was able to just piggyback on the new Estimator framework and its capabilities, and it has lots of sophisticated features, like automatically writing your model to disk, so that if you want to rewind back to an earlier state of really good performance, it's easy to do. The PyTorch models are not currently set up for that kind of thing.
But the spirit of all of this is just that you have your Dev set, uh,
you might as well see how you're doing on it and use that to inform test set evaluation.
It's worth emphasizing again that absolutely
under no circumstances should you do this with your test set.
That would just be like an incredibly blatant form of cheating
where you decided to stop opportunistically when you saw the best performance.
But that's what your Dev set is for,
uh, so I encourage a lot of this.
I mentioned before here that, you know,
the low error rate doesn't always correspond to good performance.
Here's an actual pair of learning curves.
I have macro F1 here and the number of iterations.
And for whatever reason on this run of the model,
performance got really good and then kind of leveled off.
Meanwhile, over here error versus iterations,
the error just kept getting lower and lower.
But obviously this was arguably having a negative effect or no effect at all,
even as this error went way down,
and that's just because at a certain point you start to over-fit to the training data.
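One way to act on that divergence is patience-based early stopping: keep training while the dev metric improves, and stop once it has failed to improve for a few evaluations in a row. A minimal sketch (the score values are made up for illustration):

```python
def early_stop(dev_scores, patience=3):
    """Return the index of the best dev score, giving up after
    `patience` evaluations without improvement."""
    best_i, best = 0, float("-inf")
    for i, s in enumerate(dev_scores):
        if s > best:
            best_i, best = i, s
        elif i - best_i >= patience:
            break
    return best_i

# Dev F1 rises then plateaus, even as training error keeps falling:
scores = [0.40, 0.55, 0.62, 0.61, 0.62, 0.60, 0.59]
best = early_stop(scores)  # index 2: the best dev score before the plateau
```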
[inaudible]. [NOISE]
I mean, I guess it's a wild world, and you might see that. [LAUGHTER] But versions of this might be something that you actually use for this kind of debugging.
So for example, if your error doesn't look approximately like this, then there might be something wrong in the optimization setting that's causing it to oscillate, or even have the error get larger, and it would surprise me if that corresponded to good behavior over here on the left panel.
Um, and I guess the lesson there is just you might observe both of these things.
Another advantage of TensorFlow is that with TensorBoard monitoring you can see all of this, and kind of get a feel for what your model is doing.
But certainly look at both and don't take one as a proxy for the other.
That's kind of my- my overarching point here.
Here's what I think the best response to all of this is, right? Because the way I was just describing things before, we were still kind of in the mode of saying: I know that my model varies by epoch, I know that that affects test set performance, but still, what I want to do for you is report one single number on a test set.
And if you think open mindedly about this,
that's kind of misleading because what I'm telling you is that there's lots of
variation along these learning curves as you- as you train on more data.
So I think the best response is to accept that incremental performance plots,
like the one here,
are really how we should be assessing these models.
Because that exposes much more information about the variation that you actually observe,
and can give you actually new ways of
arguing for your model as I'll show you in a second.
The other thing I would say is that in deep learning, you know,
in principle if you think about the theory,
these models should be capable of learning anything.
So what we're really thinking about is how efficiently they can learn,
or how stable their learning patterns are and so forth,
and that's probably, ultimately,
where your real argument lies.
And here's the kind of an example of this from my own work
from a paper that I did last year with Nick Dingwall.
So the background here just quickly is,
we were arguing for this model called Mittens.
Mittens is a warm start for GloVe.
The argument is that Mittens are warmer than GloVes.
And really the pitch here is that what you could do is
start from like vectors that you had downloaded from the web,
and then run GloVe on them with a kind of
retrofitting learning objective to update them for a specialized domain.
And in the paper, we do this kind of updating for a bunch of different domains.
And what we're trying to do is show that you can do well,
if you do that kind of updating.
But the way that we actually evaluate them is kind of extrinsic.
So the evaluation is that they are
inputs to various kinds of other machine learning models.
And in this case,
we're doing a sequence prediction task for healthcare where
you're kind of talking about people being diagnosed with different diseases.
And the model is an RNN that, for each token, predicts whether it's talking about a concern for a diagnosis, ruling out a diagnosis, something in some other tag, or a positive diagnosis for the disease, in clinical texts, the kind of thing that you'd see in an electronic health record.
So these are RNNs fundamentally.
The full story here is that we didn't do a lot of hyper-parameter tuning, because our rationale was that it would be expensive, first of all, for these RNNs, because they need to run for a long time to converge.
And our main comparison is Mittens versus clinical-text GloVe. That's GloVe just trained on the text we have available, and not given a warm start from the downloaded vectors.
And all we really care about is the Delta between them.
Really, we're trying to see whether Mittens is better.
And otherwise these RNNs are kind of the same, and the inputs, Mittens and the various flavors of GloVe, are scaled in similar ways.
So we used that as an argument to just set the hyper-parameters, like the dimensionality of the network and its activation function and so forth.
But we did observe a lot of variation in how these models would
learn because of randomization in their parameter initialization and so forth.
So we felt it was important to show these learning curves,
and that's what we're doing here by category,
along the x-axis is the number of training epochs and the y-axis is F1.
And I like this really full picture, because if I step back, I can say that, if you look at the macro average, we have a kind of modest argument overall that for this data, Mittens learns a bit faster.
So if you can only afford to do like 1,000 training epochs,
then the warm start for this domain is actually helping.
But over time, if you can run to 10,000, the difference between clinical-text GloVe and Mittens has disappeared.
But you can also see that,
no matter how long you run,
these things have kind of leveled out,
and both of those models are better than generic GloVe vectors,
or random initialization like that's the yellow and the- and the gray.
So that's the kind of full picture,
but it gets pretty nuanced if you look at these different categories.
So like, I'm not sure that argument really holds for concern, for example.
There, it's very hard to see a difference between the red and the blue.
Maybe it's a bit stronger for positive up here where the error bars are tighter.
[NOISE] I like this,
you know, kind of like in early phases of learning,
Mittens is good, but in the fullness of time a lot of the differences disappear,
at least between those two competitors.
It would be a very different picture, and I would be giving you a very different story, if we had decided to just pick the best point of performance for each one of these categories, or the best point of performance from the macro average, right? There it might look like a stronger, cut-and-dried argument in favor of Mittens.
But you can see that would be a bit opportunistic, right?
And so, you know, I think we should value things like open and honest reporting of how these models are doing, and this is our attempt to do that.
And so I don't know, I guess I'm saying that I would love to see
plots like this in papers you all produce.
And I recognize that there's often a lot of pressure in terms of
information overload and in terms of culture to give
a single Macro F1 number, or something like that.
But if you feel there's room for this kind of more nuanced thing, then go for it.
You have the teaching team's full support, I would say.
And in this plot, all that's differing is the random initialization and other random aspects of how these models learn.
That brings us to this other point here,
the role of random parameter initialization, right?
So, you know, when you set up the weights for your model in these settings, you do it with random initialization, and in these kinds of non-convex optimization problems, that initial setting might determine a lot about how the model ultimately fares.
Um, but I will say that even simpler models can be impacted by this.
If there's multiple optimal solutions,
then they could get steered in one direction or another.
A kind of startling result from a couple of years ago is this paper that did a bunch of analysis of recently proposed neural language models.
And the striking finding of this paper is that,
if you ran these systems a lot of times,
then the differences between them essentially disappeared because the way in
which they had been randomly initialized was having
a big impact on the final outcome for the models.
And this paper kind of put a damper on the feeling that recent proposals were vastly better than earlier ones, because when it all washed out, the differences were small or nonexistent.
And this was all tracing to different initializations.
And this is sort of disturbing also for statistical testing.
Different initializations can lead to statistically significant differences.
And there's a related thing here which is kinda the extreme version of this,
which goes under the heading of catastrophic failure.
So you have an unlucky initialization of your model,
and it leads to not only bad performance,
but like effectively zero performance.
My feeling is that the smart, good, and open thing to do is to report the number of times that happened, maybe in the prose of your paper,
so that you don't destroy the results table,
but it is important, I think,
to be forthcoming about how often this was happening,
and how much it was shaping the final means and
confidence intervals that you reported. Sorry.
So since the initializations can have these big differences, and people aren't going to be publishing results where they're not beating state-of-the-art.
Do you think that there is this kind of, like,
just random hill-climbing that happens because people all around the world are- are
basically doing these random initializations every time they attempt one of these papers,
and occasionally people will get lucky,
and beat some benchmark,
and they will get published and someone else will get even luckier.
Like is that what you think is happening,
or is that not really a big concern?
I think it is happening, and I think it's a concern, and, not to impose too much on these authors, that's kind of the story of this paper here.
And all of it brings me to the idea that we should be reporting things that look more like this, that show us the impact of all of these different choices, and not be so fixated on a single summary number, because that's probably overstating the degree to which we're making inexorable progress on these problems.
Again, to bring up the SNLI leaderboard: so many systems have been entered onto that leaderboard now that you can informally do a kind of meta-study and ask, broadly speaking, which classes of models are doing well? And there the clear answer is some kind of ensemble of deep learning modular pieces.
Um, so that even if the individual numbers are kind of being overstated,
the overall trend is clear.
And to substantiate this a little bit: I didn't reproduce it here, but in the notebook there's a famous problem in neural networks, the XOR problem, which is just the logical connective that gives a 1 to the cases where p and q have different truth values, and otherwise 0.
Um, and it's famously not a problem that a linear classifier can learn.
The other one is if and only if where
the values have to be the same to get a truth value of 1.
And like one of the original arguments by the deep learning founders,
um, just shows that you can learn XOR with a kind of a shallow neural network.
And that's great. And it's cool.
And you definitely can theoretically do this thing that you can't do
with a linear model without adding a bunch of interaction terms.
Okay. Cool. But in reality,
if you run my little feed-forward network it succeeds at this 8 of 10 times.
[LAUGHTER] And that's entirely owing to the fact that this is a small problem, and so the initialization really matters. You can just run that and get a feel for yourself for the extent to which this has actually been solved empirically.
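I don't have the lecture's little feed-forward network here, but a sketch with scikit-learn's MLPClassifier shows the same phenomenon: a tiny hidden layer can learn XOR, yet whether a given run succeeds depends on the random initialization (the exact success count will vary with the seeds and solver):

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR cases: label 1 iff the inputs differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

successes = 0
for seed in range(10):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # tiny problem, so ignore convergence warnings
        net = MLPClassifier(hidden_layer_sizes=(2,), activation="tanh",
                            solver="lbfgs", max_iter=5000, random_state=seed)
        net.fit(X, y)
    successes += int((net.predict(X) == y).all())
# `successes` counts how many of the 10 initializations solved XOR perfectly.
```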
[LAUGHTER] So the response here, yeah, kind of summarizes what I've been saying: report scores for multiple complete runs with different randomly chosen initializations, and then summarize that variation with confidence intervals or statistical tests.
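A minimal sketch of that summary step, using a t-based interval over scores from repeated runs (the scores below are invented for illustration):

```python
import numpy as np
from scipy import stats

def summarize_runs(scores, confidence=0.95):
    """Mean and t-based confidence interval over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, (mean - half, mean + half)

# e.g. macro-F1 from five complete runs with different random seeds:
mean, (lo, hi) = summarize_runs([0.71, 0.74, 0.69, 0.73, 0.72])
```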
And then I want to return to this pragmatic note that I keep sounding here. So arguably these observations are incompatible with things like McNemar's test, which depends on one run.
And I think that's true.
But I think we have to balance that against
the fact that if your system takes two weeks to
train then more or less starting now you have the opportunity to do just a few runs.
And then McNemar's test might be better than
nothing in terms of informing us about differences.
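For reference, McNemar's test needs only a single run: it looks at the discordant pairs, the examples where one system is right and the other wrong. A sketch using an exact binomial test via scipy (statsmodels also ships a ready-made version):

```python
from scipy import stats

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact McNemar's test on a single run: do the two systems err at
    different rates? Only the discordant pairs carry information."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    # Under the null, discordant pairs split 50/50 between the two systems.
    return stats.binomtest(b, b + c, 0.5).pvalue

# System A gets everything right; system B misses 8 of 10:
p = mcnemar_exact([1] * 10, [1] * 10, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
```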
So let's wrap this up.
Uh, I think I can summarize everything that we did last
time and this time with a few things here.
So your evaluation should be based around a few systems that are related in
ways that illuminate your hypotheses
and help to convey what the best models are learning.
That's the kind of baseline stuff that we talked about.
Every model you assess should be given its best chance to shine, right? That's what hyper-parameter tuning and all of that stuff is trying to do: make the strongest argument in this way.
And then the pragmatic note we need to be realistic
about how many experiments we can actually afford to run,
afford in terms of dollars or time or resources.
The test set should play no role whatsoever in optimization or model selection. This is a note that we've sounded a lot, because we did all those bake-offs. The best way to ensure this is to have the test set locked away until the final batch of experiments that will be reported in the paper.
And the other thing you can do, if you don't have a test set, is to simulate one with careful cross-validation set-ups.
This would be where okay,
you have a fixed body of examples,
you do cross-validation and report the mean.
And the way you try to make a case that you're kind of
hands-off about all of that is that you set up
your models with a lot of hyper-parameter tuning and let
them run hands-off and then just report the final number.
And try to be honest with yourself about the extent to which you're
going back and specially advantaging the model that you're favoring.
You kind of want this to be as hands-off as possible.
And then of course just report this regime very carefully.
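A sketch of that hands-off regime with scikit-learn: fix the set-up, run cross-validation once, and report the mean and its spread rather than cherry-picking a fold (the iris data here is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# One pass of 5-fold cross-validation; report the mean and its spread.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
mean, std = scores.mean(), scores.std()
```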
Strive to base your model comparisons in multiple runs
on the same splits especially if you're doing deep learning stuff.
Because a single model can perform in very different ways on the same data, depending on optimization and the moon, you know, sunspots and whatever else is happening in the universe.
All right. Uh, did that raise any new questions for you or concerns or comments or? Yeah.
I'm still a little bit caught up on the random initialization portion. I'm just wondering, ethically speaking, I guess. I understood one major point of analyzing the different performance of various models: to sort of meta-learn which architectures or high-level features of a model are good for performance on a certain task. But at the same time, is it necessarily bad that people get lucky, happen to win the random hill-climb, and get successful models that way?
Oh, I think it's not bad in itself. Getting lucky could be great, and it does show us something: there's a setting in which you win. I think the part that at least I was worried about, and maybe you were too, is if I do that special stuff only for the model that I want to advocate for, because that's where you're stacking the deck in your favor.
Uh, like I run these models hundreds of times.
And most of the time my model loses and I
report the situation in which my model happened to win.
And you probably know in your heart that that's not honest reporting, that kind of thing.
And I think he's asking,
how often is that happening out there in the world?
And I guess I was saying: more than I would like, is my guess. But what we can do is kind of push the community, with our own choices, in the direction of reporting the mean and confidence intervals over all of those runs.
[inaudible] on that note, is there any empirical value in those weights that get initialized when you get lucky? Could they be indicative of a good initialization scheme, and could that actual initialization later be considered like a hyper-hyper-parameter? Like there's, uh, Xavier initialization and other types like that. I get that seed tuning as an idea probably isn't that great, but is there insight that could be gleaned from finding which weights worked out and which ones didn't?
I'm sure there is. Yeah, that's a great point. And I feel like, as we get better at getting analytic insights, those good settings where you got lucky will provide more and more intelligence. What if you reported that, for your problem, the standard thing was this kind of initialization, and the best settings were the ones that deviated the most from it? What if they had different statistical properties? And what if you then took that lesson, applied it in a systematic way, and found that it reproduced a lot? Then you could get an initialization scheme named after you.
[LAUGHTER] Which is real fame
because then you get your name in all of these deep learning packages.
[LAUGHTER] Okay.
Let's do the metrics thing but keep asking questions on Piazza and office hours.
Uh, yeah, I think it's, it's great to be discussing all this.
[NOISE]
But let's do one thing about metrics.
So this is a massively long notebook.
[NOISE] Um, and that's why I say like let's talk about basically
just the first part of this classifier metrics thing
because that'll be enough to give you a feel for my thinking about this,
and kind of get that discussion going.
And then, if you're doing a problem with regression or sequence prediction,
then you can refer to these later sections here, um,
which introduce all of these metrics in just the way I'm going to today,
and talk about their trade-offs.
Yeah. Here's the overview of this. Different evaluation metrics encode different values, in the sense of what we value in a system, and they also have different biases and weaknesses. None of these metrics is perfect, because all of them are encoding different values, and values can vary. So you should choose your metrics carefully, and motivate those choices when writing up and presenting your work.
I would love to see passages where, even if you're doing something that's the default for your problem, you still articulate why it's the choice that you made.
[NOISE] The notebook reviews some of the most prominent metrics in NLP,
and I tried to define them,
but also articulate what values they encode and what weaknesses they have,
and also report relationships between them.
Because one way in which this can be kind of overwhelming is that basically the same metric gets described and given different names in different places, when really the differences don't mean much.
In your own work,
you should not feel confined to the metrics in this notebook, right?
Per my first item.
You have the freedom to motivate new metrics and use
cases for existing metrics depending on what your goals are.
Now, culturally, if you're working on an established problem,
then you'll feel a lot of pressure from readers and
referees to use metrics that have already been adopted for the problem,
and there are even these little communities within NLP that
have totally settled on a single metric,
uh, no matter how problematic.
And so you can get in this trap of feeling like you know there's a problem,
but you still have to report it.
Um, you should feel free to argue against those norms and motivate new ones.
You might in your paper want to report the old metrics as
a kind of foundation and then talk about why a new metric,
uh, is a better one for your problem.
I kind of feel like I don't really work in this area,
but for language modeling,
I feel like everybody is in this state with reporting perplexity.
So it's like everyone does it because everyone
expects that everyone else will expect that they do it,
but nobody believes in it.
Still report it.
[NOISE] And the other thing I'll say, just at a high level, is that the scikit-learn model evaluation user guide is great. It has lots of notes about how to make responsible use of the metrics they've provided, and they've provided lots of them.
In the notebook that I'm going through here, I've hidden a lot of this, but I redefine the metrics just so that you can see how they work. My advice to you, though, would be to use the scikit ones, because they have been more heavily tested and tend to have more options. But let's just do this classifier metrics thing and kind of linger over it.
And this is kind of nice, because this is the stuff that you've probably encountered the most and feel like you already fully understand, and my idea is that I want to kind of problematize it for you.
So let's start with the confusion matrix.
Uh, this gives a complete comparison of how the observed,
that is the gold labels here,
and the predicted labels relate to each other.
I'm sorry that that's a bit small, but I have a simple sentiment problem: pos, neg, neutral along both the rows and the columns.
And the idea here is that like the system that made this got 15 positive examples right,
15 negative, and 1,000 neutral correct.
Those are the diagonal elements.
And then the off-diagonal is saying like,
for 10 of the true positive cases,
the model predict negative,
and for 100 of the true positive cases,
the model predicted neutral, right?
You guys have seen these before.
I'm gonna call this example 1.
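A table like example 1 can be computed directly from label lists; here's a minimal sketch with scikit-learn (toy labels, not the lecture's actual counts):

```python
from sklearn.metrics import confusion_matrix

labels = ["pos", "neg", "neutral"]
gold = ["pos", "pos", "neg", "neutral", "neutral", "neutral", "pos", "neg"]
pred = ["pos", "neutral", "neg", "neutral", "neutral", "pos", "neg", "neg"]

# Rows are the observed (gold) labels, columns the predicted labels.
cm = confusion_matrix(gold, pred, labels=labels)
```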
You should remember when you think about these things that
your classifier probably did not predict into this space.
It predicted a probability distribution over those three labels pos, neg, and neutral.
And to construct this table,
you have to impose a threshold.
You have to say, for example, that I'm gonna pick the highest of the probability values and call that the predicted label, and that is what shaped this final picture here.
But that is, of course,
a meaningful thing to have done.
Suppose, to make it easy to think about, a binary problem where the threshold is 50%. If your classifier reliably predicts something like 0-20% for the zero-label cases, and 20-40% for the positive-label cases, your confusion matrix will look terrible, when in fact, with a different threshold, your model would be perfect.
So keep in mind that this can happen, especially with very imbalanced classes: the classifier just kind of never predicts high values for things, even though it is discriminating.
And I'm not gonna go into this today,
but metrics like average precision and precision-recall curves can expose exactly that kind of behavior in your classifier, and give you a really strong argument that you're doing well even though the default confusion matrix would look bad.
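Here's a tiny illustration of that point: a classifier whose scores never cross the default 0.5 threshold, so thresholded accuracy looks mediocre, while average precision reveals that the ranking is actually perfect (invented scores):

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# All scores fall below 0.5, but every positive outranks every negative.
y_score = np.array([0.05, 0.10, 0.15, 0.25, 0.30, 0.35])

hard_preds = (y_score >= 0.5).astype(int)      # everything predicted 0
acc = accuracy_score(y_true, hard_preds)       # 0.5: all positives missed
ap = average_precision_score(y_true, y_score)  # 1.0: the ranking is perfect
```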
And the other thing I would say here is that we might care about the full distribution,
and that's being completely hidden by this confusion matrix,
but this is nonetheless the basis for the metrics that I want to go over with you now.
So, accuracy. This is the metric that we all probably think of at some level when we think about assessing how good a system is; at least before you enter this field, you'd think accuracy would be the right metric. This is just the sum of the correct predictions divided by the sum of all predictions.
And I have the same confusion matrix here and I just highlighted
the diagonal because what accuracy is doing is summing those values,
and dividing by the total of all the values in the table.
And so here,
with this classifier, you get an accuracy of 81%.
[NOISE] Yeah.
It's useful to just know the bounds;
I did this for all the metrics,
um, and here obviously it's 0 to 1:
0 the worst, 1 the best.
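As a sketch, here's accuracy computed from a confusion matrix; the counts below are made up for illustration, not the slide's:

```python
import numpy as np

# Confusion matrix with rows = gold labels, cols = predicted labels.
# Accuracy = sum of the diagonal (correct predictions) over the total.
cm = np.array([
    [15,  10,  100],
    [10,  15,   10],
    [10, 100, 1000],
])

accuracy = np.trace(cm) / cm.sum()
```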
What value is encoded by accuracy?
You might argue that it encodes a kind of core value that we have for classifiers,
that is how often they're correct.
Um, and in addition, as I'll show you in a second,
the accuracy of a classifier on a test set will be negatively correlated with
its cross-entropy loss, which is a common loss for classifiers.
So in that important sense,
your classifier is probably optimizing accuracy, or
rather a value inversely related to accuracy, and that does make it a very natural fit.
And I wanna return to that when we think about how
this is all interacting with hyperparameter tuning.
But just keep in mind that even though we don't favor accuracy as the metric,
it is, in effect, what your classifier is probably optimizing.
So weaknesses of accuracy.
It does not give a per-class metric for multi-class problems,
uh, and we might want that kind of nuance.
[NOISE] But the more important thing is that accuracy just completely
fails to control for size imbalances in the classes.
So for example, consider this variant here of the above classifier,
ex1, where the classifier only ever predicts neutral.
Because neutral is a large category,
it looks like a really good classifier from the point of view of accuracy.
Its accuracy is 0.87 for this confusion matrix,
versus 0.81 for the one before.
But I really feel like the earlier one is a better model.
If you think about actually discriminating between the labels,
it looks better than this one,
which is just stupidly always guessing neutral and
benefiting from the fact that neutral is a massive category.
And this is why we have never used accuracy in this class:
you can only trust it, even a little bit, on a completely balanced problem.
[NOISE] Makes sense?
And then the other part of this little framework I've
constructed here is what's related to accuracy.
I just want to state that accuracy is
inversely related to the cross-entropy loss,
um, and the cross-entropy loss, and in turn accuracy, can both
be related to KL divergence.
And the value of that is just that if
your classifier is predicting a probability distribution,
you might also think about learning from a probability [NOISE] distribution,
really embracing
the probabilistic nature of your [NOISE] classifier and of the data,
in which case you'd probably use KL divergence as your metric, and it would
also be what your model was optimizing, in some rescaled sense.
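A small sketch of that relationship, using toy distributions: cross-entropy decomposes as entropy plus KL divergence, so with a fixed target distribution the two objectives differ only by a constant. The function names here are just for illustration.

```python
import numpy as np

# H(p, q) = H(p) + KL(p || q): minimizing cross-entropy against a fixed
# target p is the same as minimizing KL divergence, up to the constant H(p).
def entropy(p):
    p = np.asarray(p, dtype=float)
    logs = np.log(p, out=np.zeros_like(p), where=p > 0)  # treat 0*log(0) as 0
    return -np.sum(p * logs)

def cross_entropy(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    logs = np.log(q, out=np.zeros_like(q), where=q > 0)
    return -np.sum(p * logs)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)
```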
[NOISE] So this is a nice little network of
related things that connect very deeply with what your model is doing,
and then we have this kind of mismatch: it's what your model is doing,
but it's probably not what you want to report as your metric.
You know, we want to report something like macro F1.
But you can't, in a straightforward way, optimize for macro F1.
So that's worth keeping in mind,
and in my experience it's worth keeping in mind
because when you do hyper-parameter tuning,
you might do that tuning against your true metric which is macro F1,
uh, and pick the model that maximizes that value.
But you should remember that your model is
doing something different and might actually be
steering you outside of the space that you actually wanna be in for your problem.
That's a kind of limitation of this
mismatch that we're encountering.
Make sense? Any questions about that?
[NOISE] This, oh, go ahead.
Is there a differentiable loss function for macro-averaged F1?
[NOISE] I'm afraid not, because macro F1 is this kind of holistic thing about
how you're doing on an entire subset of your data, and so it would have to be,
correct me if I'm wrong here, [NOISE] some kind of loss
that was more like a reinforcement learning loss,
and not something that was directly
statable as a differentiable loss on your classifier.
Yeah. That kind of problem with accuracy, you know, like imbalanced classes,
is what leads to precision and recall, and the reason I like that we're doing
this one in particular is that again and again through the notebook,
you see that precision and recall is what
people have in the back of their minds.
So when you get to BLEU scores,
BLEU scores are kind of precision-recall, you know, um,
word-level accuracy, as it's sometimes called, is also a precision-recall balance,
and even some of the regression metrics are trying
to do something that's like balancing precision and recall.
So this is kind of the heart of it.
So precision is the sum of
the correct predictions divided by the sum of all the guesses.
It's a per-class notion, and in our confusion matrices,
it's the diagonal values divided by the column sums.
That's what I've indicated with this coloring here
that we're kind of operating column-wise.
So here are the precision values for this,
uh, example one that we've been dealing with.
So high neutral, low negative, and modest positive.
For our problematic all-neutral classifier,
precision is kind of undefined for positive and negative, because we made zero guesses
in those columns, so we end up dividing by zero.
The standard thing is to map those to zero, but
you should keep in mind that it was, strictly speaking, undefined,
and this affects the bounds.
If you think very carefully and mathematically,
you can't quite say that the bounds are 0 to 1 without this caveat
about what you do in the case that you have an all 0s column like this.
Then we have high precision for this neutral [NOISE] category.
There's that caveat there about the bounds.
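Here's a minimal sketch of the per-class precision computation, with made-up counts; NumPy's `where` argument implements the convention of mapping the undefined 0/0 cases to 0:

```python
import numpy as np

# Per-class precision: each diagonal value divided by its column sum
# (the column sum is the total number of guesses for that class).
cm = np.array([
    [15,  10,  100],
    [10,  15,   10],
    [10, 100, 1000],
])
guesses = cm.sum(axis=0)
precision = np.divide(np.diag(cm), guesses,
                      out=np.zeros(cm.shape[1]), where=guesses > 0)

# All-neutral classifier: zero guesses for pos and neg, so those
# precisions are undefined and get mapped to 0 by the `where` mask.
cm2 = np.array([
    [0, 0,  125],
    [0, 0,   35],
    [0, 0, 1110],
])
g2 = cm2.sum(axis=0)
precision2 = np.divide(np.diag(cm2), g2, out=np.zeros(3), where=g2 > 0)
```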
What value is encoded by precision?
The way I would articulate this is that it encodes
a conservative value, in that it penalizes incorrect guessing on a per-class basis.
The weaknesses of precision.
Precision's dangerous edge case is that you can achieve
very high precision for a category by rarely guessing it.
So consider this one here; example three,
what I did is just make a minor change where I have it
predict once for each of these and it happens to be correct.
I think this is not a good classifier in some intuitive sense.
But its precision is perfect for positive and negative because it just kind
of withheld judgment for all but these two cases where it could do well.
Compare that with example one where, you know,
the precision was pretty low here but it's intuitively
kind of got a better grip on the positive and negative categories.
So the counterpart to precision is recall.
Recall is the sum of correct predictions divided by the sum of all true instances,
again per-class; and in our confusion matrices it's the diagonal divided by the row sums.
[NOISE] So here are the recall values for example one.
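The recall computation is the row-wise mirror of precision; a sketch with the same made-up counts:

```python
import numpy as np

# Per-class recall: each diagonal value divided by its row sum
# (the row sum is the total number of true instances of that class).
cm = np.array([
    [15,  10,  100],
    [10,  15,   10],
    [10, 100, 1000],
])
true_counts = cm.sum(axis=1)
recall = np.diag(cm) / true_counts
```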
Important point about all of these metrics that are related to
F scoring is that recall trades off against precision,
and that's the way in which each of them is kind of
making up for the faults of the other.
So here's that example three again.
Remember, we make one guess for positive, one for negative.
So precision is perfect but recall is dismal for both of
those categories because in withholding its guesses for those classes,
it ends up making a bunch of
recall mistakes and as a result like the F1 for this is going to be pretty bad.
[NOISE] Recall has straightforward bounds,
and I would say there's a nice balance here.
Recall encodes a permissive value, in penalizing only missed true cases.
Whereas precision was conservative, recall is permissive.
Weaknesses of recall.
Its dangerous edge case is that you can achieve very high recall for a category
by always guessing it and that could mean lots of incorrect guesses.
But recall, intuitively,
if you think about the calculations,
only looks at the true instances of each class.
So it doesn't mind this over-guessing at all.
Um, you can see this here actually.
So for the neutral category, uh,
it made a lot of guesses about neutral, um,
but it missed none of them,
and so its recall is actually very high, down here it's 1.
[NOISE] Yeah, just always guess a category and you're sure to have perfect recall.
That's the sum of it. But in turn your precision will be hit somewhat.
F scores, which we've used pervasively in this course,
codify this balance between precision and
recall. The standard way of doing this is to
combine precision and recall via their harmonic mean, with
a value beta that can be used to emphasize precision or recall,
and we've done the default thing of always balancing them equally.
That's F1 down here,
where you pick the value beta = 1 and it simplifies down to this expression.
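A minimal sketch of the F-beta computation; the function name is just for illustration:

```python
# F_beta: the weighted harmonic mean of precision and recall.
# beta > 1 shifts weight toward recall, beta < 1 toward precision,
# and beta = 1 gives the familiar balanced F1.
def f_beta(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0   # undefined case, mapped to 0 by convention
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```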
But again, you should be thoughtful about
whether you default to this equal balance,
uh, and one way to think about it is,
so a bunch of people in this course are working on things like hate speech and toxicity.
So just as a quick use case.
Suppose you are developing a system that's going to
help with interventions for that problem.
It's an open question whether you want the system
to be biased in favor of precision or recall.
Um, if you have
very few human resources for doing manual review of your model predictions,
then you want to favor precision: you want to make really good use of
those humans and have them see only things that
your model has high confidence are toxic.
But that's going to have the tradeoff that a lot of
messages that really ought to have been filtered
out of your community are going to make it in.
Because what you are optimizing for there,
at a higher level it's like saving human time.
Conversely, if you have a whole fleet of humans who
can help you with this task,
then you could be biased in favor of recall, if they don't mind reviewing a lot
of texts that were actually fine to just let onto the platform.
It's really going to depend on what you're trying to do
but I think whatever your actual situation is,
it's unlikely that precision and recall are perfectly balanced.
So I guess the only justification for us,
perfectly balancing them all the time,
is that we don't know what else you're trying to accomplish in the world.
So here, for our basic example,
these are the F scores.
And I think they kind of align with what we
think about how the model is doing on these tasks.
And for example two, strictly speaking,
both for positive and negative, because you had undefined precision there,
the F score is also undefined,
but typically that will be mapped to 0;
scikit-learn, for instance, will map it to 0
and print out a bunch of warnings for you.
And here's example three where we had
such high precision for positive and negative but very low recall.
And that's nicely reflected in the fact that
the F1 scores for pos and neg are also very low.
Bounds are 0 and 1,
and you have a guarantee that it will be between precision and recall.
What's the value encoded by F scores?
Here's one way that I tried to articulate it.
It's an attempt to summarize how well your classifier's
predictions for a certain class K align with the true instances,
where what I'm trying to bring out with "align" is
not only giving you credit for the hits,
but also detracting for the misses,
in the case of F1 in a symmetric way.
And intuitively for all these things precision and recall are
keeping each other in check somehow for these calculations.
And if you go through the notebook on your own,
you'll see that theme recur again and again outside the classification context.
What are the weaknesses of F scores?
There's no normalization for the size of
your dataset, within the class that you're focused on or outside of it.
Um, and then the other thing that you might highlight here is that a given F score,
if you think about focusing on pos here,
pos F1 pays attention to all of that class's row values and column values,
but it ignores every value off of that row and column.
And to illustrate that, here I have example
one, where down here in the corner you have 1,000.
And then for this dataset I changed that to 100,000.
For pos and neg,
F1 is the same across these two data-sets.
And that's worth keeping in mind because obviously these are very different problems.
These are very different datasets and the classifier is doing
very different things because I made this so massive.
But, F1 was insensitive to that.
Only the neutral F1 changed because of
course that value is- is on its row and its column.
But keep that in mind, right?
When you do this F1 you're kind of ignoring one aspect of the data,
when you think about a per class basis.
Related to F scores:
the Dice similarity for binary vectors, which you might remember from the first unit,
and which is often used to assess how well
a model's predictions align in a set-theoretic way,
is equivalent to F1, especially
if you think on a per-token basis for a vector of predictions.
Um, and then, as I said, the intuition behind F scores is common.
Let's do just a few of these macro averaged things.
So macro averaged F scores.
That's the mean of all the F scores for each category.
That's the one that we've always used.
Here's a quick example.
You guys know about this one.
I mean basically what we're doing here is just taking
all the per class F values and averaging them.
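As a sketch, here's macro-averaged F1 computed from a confusion matrix, with made-up counts; the undefined 0/0 cases are mapped to 0 as discussed earlier:

```python
import numpy as np

# Macro-averaged F1: per-class F1 from the confusion matrix (rows = gold,
# cols = predicted), then the unweighted mean, so every class counts
# equally regardless of its size.
def macro_f1(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    guesses, trues = cm.sum(axis=0), cm.sum(axis=1)
    p = np.divide(tp, guesses, out=np.zeros_like(tp), where=guesses > 0)
    r = np.divide(tp, trues, out=np.zeros_like(tp), where=trues > 0)
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()

cm = [[15, 10, 100], [10, 15, 10], [10, 100, 1000]]
```

Note that the huge neutral class gets no extra weight: each class contributes exactly one third of the final number.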
And that kinda leads to the weaknesses.
So I typically default to macro F1 because I like
the fact that it gives each class, no matter
its size, equal weight in the final calculation,
and the justification for that is that in NLP,
very often it's the smallest classes that we care about the most,
and so it makes no sense to give more weight to
the ones that are large, because maybe they're easy and uninteresting.
But you should keep in mind, in thinking about macro averaging, that
your system's metrics might look out of step with how it actually performs in the world.
Because if I just turn it loose on
real data that have the same class balance as the dataset it was developed on,
then the fact that it makes a lot of
correct predictions is going to be
a meaningful thing in terms of how people experience the system.
And just because it's a large class doesn't mean that it's
any less meaningful if you think about performance in the world.
So in a funny way,
your macro-averaged score could end up overstating how well you
actually do, because a tiny class that you
never experience as a user of the system was contributing just as
much as the kind of example that it sees all the time in practice.
So that's worth keeping in mind in terms of practical applications.
And then I think these two kind of balance each other.
So you can both over and understate how well
your system is doing by doing macro averaging.
But nonetheless, I think when we think about system evaluation in this context,
the macro average choice is a really good one.
There are also weighted F scores.
scikit-learn reports these; they just give a weighted average based on class size.
And then there are micro-averaged scores, which,
with just a few qualifications, are exactly identical to accuracy.
And so they inherit all the problems of accuracy, and that
makes them a kind of disfavored option by and large:
why not just report accuracy, as
opposed to choosing the more convoluted micro-averaging approach?
But I did include it here for completeness.
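A quick sketch, with made-up counts, of why micro-averaging collapses to accuracy in the single-label multi-class case:

```python
import numpy as np

# Micro-averaging pools counts across classes: the pooled true positives
# are the diagonal, and both the pooled guesses and the pooled true
# instances are just all the examples. So micro precision = micro
# recall = micro F1 = accuracy.
cm = np.array([
    [15,  10,  100],
    [10,  15,   10],
    [10, 100, 1000],
])

tp = cm.trace()
micro_precision = tp / cm.sum()   # denominator: all predictions made
micro_recall = tp / cm.sum()      # denominator: all true instances
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = cm.trace() / cm.sum()
```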
And I think I'll stop there cause I wanna leave plenty of time for Atticus,
and I also feel like now you're armed with a kind of framework for thinking about this.
What value does it encode,
what bounds does it have,
and what are its strengths and weaknesses.
It's worth asking that of any metric that you encounter, and certainly if you decided to
propose your own metric, you'd wanna
fully command that space of different considerations.
Okay. Unless there are questions.
Yeah.
Why are macro-averaged F1 scores better than weighted F scores?
The weighted scores will have that fact that they'll favor large classes,
just like micro-averaging would.
That might be what you wanna do.
Uh, it really depends- I guess I'm encouraging you
to think about what you're trying to achieve,
what your system is trying to do,
and then choose a metric accordingly.
And my- my justification for macro averaging is
just that we often do care about those small classes.
We really wanna get traction on them.
And that can be kind of dispiriting if you're doing really well
improving on the small classes but you chose weighted or micro F1,
and you just don't see any impact on the number.
It's like, I doubled my F score on this tiny class, and weighted F1 says, "I don't care.
It's too small for me to care.
I go up by a little bit," whereas the macro one will reflect all of those gains. Yeah.
If you wanted to develop, say, a conversational agent that appears human,
that basically passes the Turing test,
what type of metric would be best suited for something like that?
That's so funny to think about, you know.
Turing's answer was the Turing test.
And he did have some kind of accuracy calculation, right?
Wasn't it like fooling the judge two out of three times or something?
I forgot. Was it two out of three?
And the game is specifically,
since it's a classification where there's a human and a machine,
the judges have to call the machine the human and the human the machine;
they have to get the labels swapped.
So that was his metric, but the lesson of our time is
that people confuse humans for robots, and the reverse, all of the time.
Did we tell that story?
I guess we didn't. From the first ever real Turing Test.
Um, the human who is most often rated as a machine was a Shakespeare expert.
She could answer any question about Shakespeare,
and people's rationale was,
''No human could know that much about Shakespeare.''
[LAUGHTER]
All right. Let me turn it over to Atticus [NOISE] to continue these themes.
[NOISE]
Hello, I'm Atticus.
I'm one of your course assistants.
And I'm gonna give you a little presentation on
evaluating NLU models with harder generalization tasks.
All right. So going through this,
I'm gonna start by just kind of overviewing
the framework that Chris has been talking about which is
just a standard generalization framework of arbitrarily setting
aside training examples and testing examples from a large dataset.
And then I'm going to introduce you to
some adversarial testing literature where people are trying to develop
more challenging generalization tasks to probe at the capabilities of NLU models,
and then I will conclude by sharing some of my research with you.
[NOISE] Cool.
So this should be a framework you're very familiar with by this point.
It's what we've been doing in our homeworks and in most of our bake offs,
where we find a dataset for an NLU task.
And then we just arbitrarily split this dataset into training and testing.
And we trained models on the training set and evaluate on unseen testing examples.
But actually in bake-off 3,
we did something a little bit different than this.
So in bake-off 3,
if you remember, we did NLI on single words.
We were just trying to predict an entailment relation for pairs of words.
And in our edge-disjoint case,
we followed this standard framework where we
arbitrarily set aside some examples for testing and some for training.
But in our word-disjoint case,
we did something a little bit different,
where we ensured that no words would be shared
between the training and testing sets, which creates
a harder generalization task, as it expects models to generalize to unseen vocabulary.
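A minimal sketch of a word-disjoint split; the function and the data format here are hypothetical, not the bake-off's actual code:

```python
import random

# Sketch of a word-disjoint split, assuming examples are (word1, word2,
# label) triples. The vocabulary is partitioned first, and any pair that
# straddles the partition is dropped, so train and test share no words.
def word_disjoint_split(pairs, test_frac=0.2, seed=0):
    vocab = sorted({w for w1, w2, _ in pairs for w in (w1, w2)})
    random.Random(seed).shuffle(vocab)
    test_vocab = set(vocab[:int(len(vocab) * test_frac)])
    train = [(w1, w2, y) for w1, w2, y in pairs
             if w1 not in test_vocab and w2 not in test_vocab]
    test = [(w1, w2, y) for w1, w2, y in pairs
            if w1 in test_vocab and w2 in test_vocab]
    return train, test
```

Dropping the straddling pairs is the price of disjointness: the split is smaller than an edge-disjoint one over the same data.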
So kind of like the high level theme of this presentation is that I'm
going to encourage you to consider breaking from the standard evaluations,
and try to create generalization tasks that are difficult,
well-motivated, and that answer specific questions about model capabilities.
[NOISE] So to get introduced into the adversarial testing literature,
we're gonna start with the example of question answering.
So consider the research question,
can a model learn to comprehend a passage of text?
So this is a very high level and ambitious research question.
And people here have actually tried to answer
it by creating the Stanford Question Answering Dataset,
which is an awesome resource that you might wanna use in your projects.
So examples in this dataset look like this,
where there is a passage,
a question, and an answer.
The input is going to be the passage of text and the question,
and the output will be the answer.
In this case, it's a passage about football,
a question about a quarterback,
and the answer is John Elway.
So you might think to answer our research question,
can a model learn to comprehend a passage of text,
that if a model achieves human level performance on this dataset,
then it is able to comprehend passages of text.
So if you do think that,
then you're in luck.
Plenty of models have beaten humans at this task.
So, you know, we're done.
Question answering solved like everything's great and no more work to do.
But unfortunately, it's not actually quite that simple.
So [NOISE] you might have a suspicion that
these models don't understand language quite as deeply as you might hope they do.
And Jia et al.
actually devised an experiment to see whether this is the case.
They take examples from the SQuAD dataset,
and they systematically perturb them to create
a new adversarial test set that they then evaluate models on,
using this as a new evaluation.
So our example before with the football.
We have a model prediction of John Elway which is the correct answer.
But what they do is they append a single misleading sentence to the passage
to trick the model into thinking that
Jeff Dean is the quarterback that the question is asking about,
and which is just incorrect.
So [NOISE] when they generate a new adversarial test set using this technique,
they find that 16 published models trained on SQuAD drop from
a 75% F1 score on the original test set
down to a 36% F1 score on this new adversarial test set.
So sad; question answering is not solved,
which you might have suspected.
So a natural idea is we've identified a hole in the
generalization capabilities of models trained on the SQuAD dataset.
So to patch this hole you'd say, "Okay,
let's just take these types of examples and include them in training."
And this does work,
when we include these types of examples in training,
those models learned to ignore the final sentence,
and they now- this patched model will now make the correct prediction of John Elway.
[NOISE] But this new patched
model trained on these types of examples is
now vulnerable to a different adversarial test set,
one where we just prepend the misleading sentence instead of appending it.
And so now, we see that
this problem seems to be deeper than just throwing in more training data
and having the model become more robust to [NOISE] different perturbations:
being trained on one perturbation of the data
doesn't generalize to even very similar perturbations like this.
And so- oh, yeah.
Are you- are they algorithmically inserting this? Like did-
Yes.
Or they didn't do it manually?
So it's a process of algorithmic generation,
and then I think it's verified by workers on Mechanical Turk
that it's consistent with the passage.
So they have
an algorithmic generation process for these types of misleading sentences,
but then I believe that they have Mechanical Turk workers go
in and verify that whatever sentence they put in is consistent with everything.
I see.
Yeah.
But they didn't do it based of the model itself,
like they didn't look at the way the model was working and then insert [OVERLAPPING]
I think they did do another experiment,
it's not this one.
It's- I think it's in the same paper where they just add tokens arbitrarily,
don't have to be grammatical or anything just to try to make the sen- the models fail.
And they get the F1 score down to 6% on that adversarial testing data.
But, I mean,
I think this one's a little more convincing, in that this is English,
and the other experiment that gets down to
6% is kind of like throwing in random stuff
that's not even really going to be grammatical English.
Yeah. And also, something you can look at is SQuAD 2.0, which tries to
address this and make it a little harder by adding the option to say,
"There is no answer to the question in this passage."
Cool. Any more questions?
Okay. So we're gonna move from question
answering to NLI where in the last couple of years.
Um, [BACKGROUND] oh, sorry.
Yeah, that's all right. Um-
This is, like, a big question, but, like,
if you could include, I think, all possible perturbations, um,
[NOISE] and train on those, like,
would you be able to say that,
like, you'd solved the task?
Uh, no, I think that's a good question.
That's actually an experiment that I was really
interested in when I read this paper: not prepending or appending,
but what if you had training and testing data where you inserted
this misleading sentence at every possible location in the paragraph?
Because then, you know,
I'm not sure if the models would even be able to,
like, generalize across those two things.
But yeah, no, I think that's really interesting.
Because, well, it's also important to remember that this is, like, one way to,
like, mess with the model, out of, like,
an infinite class [LAUGHTER] of ways to mess with this model.
So I definitely think that's, like, an interesting question.
Cool. All right.
So moving on to NLI.
[NOISE] So in the last couple of years,
there have been, like,
a growing number of more difficult generalization tasks that are trying
to expose the fragility of models trained on the SNLI
and MultiNLI datasets.
[NOISE] So one of these experiments tries to
isolate the capability of models to perform lexical semantic reasoning.
So what they do to create
adversarial testing examples is they take a premise from the SNLI dataset,
they take a single word in that premise,
they exchange it for another word to generate a hypothesis sentence,
and then that is the adversarial example.
So you can see, there are three examples here,
and each of these examples,
the label for the sentence is directly from a relationship between a pair of words.
And in this paper,
they expose that models don't have
quite the robust range of lexical semantic reasoning that you would hope they would,
when being trained on these massive naturalistic data sets. Yeah.
[inaudible] for the first one,
how do we know that someone holding a saxophone can also [inaudible] a guitar, like one chance?
So that actually gets less into
their paper and more into the nature of how SNLI was created.
So SNLI was created with image captions
describing a literal scene, and what Turkers were told to do
is generate sentences relative to a caption.
So they're given a caption of an image, without that image.
To generate an entailment relationship,
they are asked to write a sentence that
must be true of this scene; for a contradiction relationship,
they are asked to write a sentence that is not true of this scene.
And so that kind of,
like, results in this more, like,
fuzzy idea of contradiction,
rather than, kind of, like, a strict logical idea of contradiction.
And so, given an example, like,
what makes it different from an SNLI sample?
Like, what's adversarial about it?
So what's adversarial about it is that these
pairs don't occur in the SNLI dataset.
Only the first sentences do.
And then, they just take a single word,
and they exchange that word for something else.
So this could be something that might occur in SNLI,
but I guess that's kind of the idea of adversarial testing:
it should be similar enough to
what we're training on that it's a justified task,
but the interesting part is that models fail on these,
compared to the SNLI test set.
So I guess what makes it adversarial is that they're doing worse than they
do on the actual SNLI test set. So yeah.
Okay.
Yeah.
They accidentally made it do better.
Accidentally made it do better. So yeah.
So actually in this paper,
they break it down into
different classes of words they work with; they have, like,
planets, like fruits, animals.
And by class, some of the classes they do very well on,
and some of the classes the models seem not to have been exposed to those types of words as much,
and so you can actually see a full breakdown.
So for example, they have, like,
one thing they do is planets,
so like, replacing Venus with Pluto.
And it seems that models trained on SNLI are just really bad at that,
I guess because it's down to,
like, 10% or something,
but on other classes of words that I guess are much more familiar,
they get above what they get on the SNLI test set.
So yeah.
I, I think it's a good idea to really, like,
take a look at the details in these types of papers because they, like,
really can expose, like,
a range of nuanced capabilities.
[NOISE] So moving on from lexical semantics.
There's also experiments that try to determine
whether models are capable of compositional semantics.
So here are two examples from this paper by Nie and Wang.
So the first one,
they just take the subject and the object,
and they swap those two.
And so, woman is pulling a child on a sled in the snow,
child's pulling a woman on a sled in the snow.
And then in this example,
what they do is they take a premise sentence from SNLI,
and they take an adjective in that sentence,
and just move it to a different noun.
And so, we both- well,
we all understand that these sentences are not in
the entailment relation because moving words matters a lot.
[NOISE] That's kind of the whole thing around compositional semantics.
But what they found is that by doing this,
they trick the models that are trained on SNLI,
and the models often mistake these types of examples for entailment relationships.
Because they think: uh,
yellow, yellow, those are similar words; entailment.
Yeah. [NOISE] And so another experiment isolates a specific compositional frame,
it's, I think, a little more, [NOISE] like,
specific than the other ones I was talking about, where it's just, like,
"the X is more Y than the Z," and so
it found that models trained on SNLI just can't do these kinds of examples.
And this one makes,
like, a little more sense,
and isn't that surprising to me, because it's like
a very particular compositional frame that I don't think
would occur that much with literal image captions.
So yeah. [NOISE] So yeah.
So something you might be wondering is,
like, if they don't learn
lexical semantics,
and they don't learn compositional semantics,
what are we really doing here, right?
And I guess what I would say is
that models are learning a specific slice of lexical semantics,
and/or compositional semantics, that is kind of hand-tailored to
the SNLI test set, which the community has been hill-climbing on since its release.
And so I think this is actually an exciting thing for your guys' projects because
it's kind of intimidating to try to get a new state-of-the-art score on SNLI dataset,
but for these adversarial testing data sets,
no one's been hill climbing on them.
So I think there's a lot more room for innovation and
improvement that you could do in your projects.
All right.
So now, I'm gonna move on from
adversarial testing and share a little bit of my own research.
So in my own research,
I've been constructing artificial natural language inference datasets.
And they have examples that look like this.
So, like, "every tall human does not kick
any large rock" contradicts "no human angrily kicks some rock."
So they're like, pretty complicated and hard to parse as humans,
but this is kind of how they look.
There's two quantifiers, negation, adjectives, adverbs.
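A sketch of that kind of templated generation might look like the following; the tiny word lists here are invented stand-ins for the real 100-word vocabularies:

```python
import itertools

# Hypothetical sketch of the template described above:
# QUANT [ADJ] NOUN [does not] [ADV] VERB QUANT [ADJ] NOUN.
quantifiers = ["every", "some", "no"]
nouns = ["human", "rock"]
adjectives = ["", "tall ", "large "]  # "" means no adjective
adverbs = ["", "angrily "]            # "" means no adverb

def sentences():
    for q1, a1, n1, neg, adv, q2, a2, n2 in itertools.product(
            quantifiers, adjectives, nouns, ["", "does not "],
            adverbs, quantifiers, adjectives, nouns):
        verb = "kick" if neg else "kicks"  # base form after "does not"
        yield f"{q1} {a1}{n1} {neg}{adv}{verb} {q2} {a2}{n2}"
```

Even with these tiny lists, the cross product yields over a thousand sentences, which hints at how densely the function-word combinations get covered once every slot is enumerated.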
And my kind of original intent with this dataset was to stress test
models on learning these kinds of logical sentences.
Oh, also,
I designed a task-specific model, which I call the CompTreeNN,
that jointly composes a premise and hypothesis together
by aligning all their words and then composing them up a tree structure.
So it's a task-specific model that should be
particularly good at doing this kind of reasoning.
And so, the first thing I did was just do a standard evaluation on my data.
Where I arbitrarily set aside some examples for testing,
and some examples for training.
And as you can see from this results table, standard neural models,
and my task-specific model,
all achieved very high accuracy on this standard evaluation split.
And when I did some digging,
I found that the standard neural models fail to encode the identity of verbs,
nouns, adverbs, and adjectives,
which is a pretty serious flaw,
while my task-specific model performs perfectly.
What I realized was that
the standard evaluation on my dataset was far easier than I expected it to be.
When I thought about it more,
I realized that every combination of quantifiers, negation,
and modifiers occurs in my training dataset.
Because I have 100 nouns,
100 verbs, 100 adjectives.
So all these complex function words are going to
appear in every possible order during training.
And so a model that can do well is just one that
memorizes these quantifiers and negation,
and then just assigns them labels.
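To make that degenerate strategy concrete, here's a hypothetical sketch of a classifier that only looks at a pair's function-word "signature" and ignores all content words; the function-word set and helper names are invented for illustration:

```python
from collections import Counter

# Invented function-word inventory for this toy example.
FUNCTION_WORDS = {"every", "some", "no", "any", "not", "does"}

def signature(sentence):
    """Keep only quantifiers and negation, dropping all content words."""
    return tuple(w for w in sentence.split() if w in FUNCTION_WORDS)

def fit(train):
    """Map each (premise-signature, hypothesis-signature) pair to its
    majority training label."""
    counts = {}
    for premise, hypothesis, label in train:
        key = (signature(premise), signature(hypothesis))
        counts.setdefault(key, Counter())[label] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def predict(model, premise, hypothesis):
    return model.get((signature(premise), signature(hypothesis)))
```

If every function-word combination appears in training, this lookup table alone can score very highly on a random split, without encoding a single noun or verb.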
So what I decided to do was construct a
more challenging training and testing split that really
precisely asked a question about what a model is able to do.
We're gonna keep it very high-level on
what a baseline model that performs natural logic reasoning is,
but I constructed a simple baseline model that does perform natural logic reasoning,
and I used this model to kind of implicitly define an idea of fairness.
So I have a simple baseline model,
and I consider a training dataset to be fair,
if this simple baseline can learn a perfect solution from the training data.
With the idea being that neural models,
if they're able to perform the type of reasoning my baseline model is able to perform,
should succeed at the same generalization tasks.
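That fairness criterion can be sketched roughly as follows. The baseline below just memorizes exact pairs, standing in for the real natural-logic baseline; the class and function names are invented:

```python
# Hypothetical sketch: a split is "fair" if the simple baseline,
# trained only on the training portion, ends up perfect on both
# train and test. Examples are (premise, hypothesis, label) triples.
class MemorizeBaseline:
    """Toy stand-in for the natural-logic baseline: memorizes pairs."""
    def fit(self, data):
        self.table = {(p, h): y for p, h, y in data}

    def predict(self, p, h):
        return self.table.get((p, h))

def is_fair(baseline, train, test):
    baseline.fit(train)
    return all(baseline.predict(p, h) == y for p, h, y in train + test)
```

The point of the criterion is that `is_fair` comes back False exactly when the split demands a generalization the baseline cannot make, so any failure by a neural model on a fair split reflects the model, not the data.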
And what I found was that standard neural models
fail miserably at this new more challenging generalization task,
and that even my task-specific model fails to achieve perfect performance,
which was pretty interesting: even something with
a hard-coded tree structure wasn't able to solve this task.
And so I guess the kind of narrative here is that at first,
I used the standard evaluation, which was interesting,
but by breaking away from that standard framework, I was able to ask
a far deeper question about the capabilities of my models.
And so that's kind of the high-level moral of the story: you should
think deeply and carefully about what you're learning from your experiments.
Because often, they're going to be far easier than you expect them to be.
And you should consider breaking away from the
standard evaluation of arbitrarily creating a training and test
set, to create more challenging generalization tasks
that answer interesting, specific questions.
Cool. Thank you.
