Hi everyone. So we're gonna start the, um,
the Bake-off 4 Report.
So the task.
So, um, for this Bake-off,
we had to do word-level natural language inference with binary classification.
So, basically we wanna be predicting,
um, the word entailment given two words.
So, um, we had a disjoint train/test split, which reflects
our expectation that we wanna be able to generalize to unseen words and vocabularies.
So our evaluation dataset consisted of 1,767 negative labels and 446 positive labels,
which is a pretty, um, unbalanced dataset.
And thus our evaluation metric was Macro F1 Score.
So, um, some people actually reported their Micro F1 or the weighted F1,
but we only looked at the, um, the Macro F1.
The Micro F1 and the weighted F1 were a bit higher than the, um, Macro F1.
And Macro F1 is a desirable metric because,
um, because of the data imbalance.
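To make that concrete, here's a tiny pure-Python illustration (toy labels assumed) of why macro F1 is the safer metric under class imbalance:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for a single class, computed from its TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy imbalanced labels: 9 negatives, 1 positive, and a degenerate
# classifier that always predicts the majority class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

macro_f1 = (per_class_f1(y_true, y_pred, 0) + per_class_f1(y_true, y_pred, 1)) / 2
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # = accuracy here
```

Micro F1 averages over instances, so the big class dominates and the score looks healthy (0.9 here); macro F1 averages the per-class scores, so the total failure on the minority class drags it down to about 0.47.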
So this is a histogram of the submission scores.
As you can see, um, the baseline is right around 0.67.
And, um, as you can see
all people were able to surpass the baseline, which is really good.
As usual, we're gonna look at the, um,
the VSM results for the top-performing models and then the,
um, the less significant ones.
So, um, for the top submissions, so, um,
as you can see the GloVe embedding was being used very frequently.
And from names such as torch.tensor appearing in their code,
you can see that they probably played around with neural network frameworks.
[NOISE] And this probably doesn't have a lot to do with the design but, um,
the top submissions had the, um,
the variable name custom_experiment a lot,
which we thought was, um, pretty interesting.
And let's take a look at the,
um, the bottom performing models.
As you can see, there are a lot of numbers which probably
means that people handpicked hyperparameters.
And again, um, as you can see,
the variable name word_disjoint_experiment appeared a lot more in,
um, the bottom submissions than in the top submissions. All right.
So, um, the first place goes to group 26.
Congratulations, um.
They got the score of 0.7852.
Is group 26 here?
And, um, what they did was they used
a combination of BERT Sequence Classification Model with oversampling.
So, um, they used transfer learning.
They used the pre-trained BERT, um,
end-to-end sequence classification model,
and then they trained on our training dataset and,
um, tested on the validation set.
And in order to account for the data imbalance,
they used oversampling, [NOISE] specifically,
um, the Random Oversampler.
They randomly sampled with replacement, um,
from the available samples,
to balance out the,
um, the label distribution.
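The idea behind a random oversampler can be sketched in a few lines of stdlib Python (this is just the general technique, not the team's actual code; libraries like imbalanced-learn provide a ready-made RandomOverSampler):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Resample minority classes with replacement until every class
    matches the majority class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)  # sample with replacement
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

# Toy imbalanced data: four examples of class 0, one of class 1.
X = ["a", "b", "c", "d", "e"]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

One caveat worth remembering: oversampling should be applied only to the training split, never to dev or test, or your evaluation no longer reflects the real label distribution.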
And the second place goes to group 9.
Is group 9 here? Congratulations, um.
You got the score of 0.7541.
So what they did was they used Facebook's InferSent Model.
And they pre-trained the InferSent model on SNLI,
the Stanford NLI corpus.
And they transferred the weights and then they added
an extra layer at the end for binary classification.
And for the, um,
data imbalanced problem, they used weighted loss.
And they set the weights to 1 and 5.3,
meaning that they gave about 5.3 times more weight to class 1 than to class 0.
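Here is what class weighting does to the loss, sketched in plain Python; the weights 1 and 5.3 come from the talk, but everything else is illustrative (in PyTorch you would typically pass a weight tensor to nn.CrossEntropyLoss instead):

```python
import math

def weighted_nll(probs, labels, class_weights):
    """Class-weighted negative log-likelihood: mistakes on class 1
    (weight 5.3) cost about 5.3x more than mistakes on class 0."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= class_weights[y] * math.log(p[y])
    return total / len(labels)

# probs[i] is the model's predicted distribution over (class 0, class 1)
probs = [(0.9, 0.1), (0.4, 0.6)]
labels = [0, 1]
loss = weighted_nll(probs, labels, class_weights=(1.0, 5.3))
```

The effect is that the optimizer can no longer win by ignoring the rare class: the second example above, a class-1 instance the model is unsure about, contributes far more to the loss than the confident class-0 one.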
So, um, both of these top submissions,
the first place and the second place,
used transfer learning and some kind of weighting to balance out the data imbalance,
which was really impressive, um.
Okay, moving on.
We also looked at the models that,
um, that performed the worst.
And we figured out some things that didn't really quite work.
So first of all, um,
element-wise multiplication is not a good function for combining vectors.
And for this task,
shallow networks, linear regressions, or
SVMs did not work as well as the deep neural classifiers.
This is kind of understandable because, um,
for this task it's really hard to, um,
handcraft, um the feature representations.
Okay, um, that's all I have.
Any questions? Cool.
All right, thank you Min.
Ah, that was cool.
I guess a few comments about that.
So first, the same team that won this time won Bake-Off 2.
As far as I know that's unprecedented that the same group won two bake-offs.
Although I do remember that on
separate teams last year finished in the top two or three on two of the Bake-offs,
which is also very impressive.
But it's striking that,
uh, the same team won.
And I believe that they won with broadly the same approach,
which is BERT with some task specific fine-tuning,
and then a few other tricks that are specific to the dataset.
Um, and that's pretty cool.
Um, next week we're gonna be talking about BERT and ELMo and the Transformer,
and that might give us some insight into why those techniques are working so well.
So stay tuned on that.
The other thing I wanted to say is just,
I mentioned this last week,
it's really cool now that a bunch of you have done
really serious work on four separate NLU tasks.
You develop custom models for them.
Some of the things that you did are incredibly ambitious,
and I just think that's cool in terms of you having experiences to
report and now experiences to draw on when you go forth and do your own projects.
So congratulations to you all.
The plan for this week is to try to do things that will
be like directly relevant to your projects whatever they are.
Because it's- it's kind of awkward to be introducing new topical material now when we
know that you all are probably focused on
whatever special thing that you're doing for your own projects.
But this is a nice moment in which to start talking
openly about methods and metrics and things like that
because the next major assignment that you have to do is
this new document that we're calling an Experimental Protocol,
which is replacing the kind of milestone document that we have used in years past.
So in years past you did the lit review,
and then the milestone was a kind of generic check-in.
And it was kind of an attempt to make sure that
you were working steadily on the projects.
What we're trying to do with this new approach to this interim report,
is kind of save you from yourselves to make
sure that you're incrementally working on these projects.
But also to kind of codify our belief that it's
valuable to be doing a particular kind of quantitative evaluation,
and also encourage you to start talking in a detailed way with
your teaching team mentor about exactly how the project should be structured,
what kinds of experiments you could be running,
what kind of additional data you could bring in, and so forth.
So let me just walk through this to make sure it's clear kind of what we're expecting.
It's due on the 27th.
It's a highly structured report in the sense that you could
just copy out these prompts if you wanted to and fill them in.
Um, you don't need to invest in writing a nicely structured paper.
I think it would actually be valuable for the team if you kind of stuck to this framework
because in turn this is the framework that
we're gonna use as a kind of rubric to evaluate the work.
So right off the bat,
we want you to state a hypothesis or hypotheses.
Um, I think that this is often not done or not done enough in NLU and NLP.
And one reason that I can make that inference is that, in recent years,
if you review for ACL,
they have started codifying exactly this requirement in their referee form.
So as a reviewer,
you're meant to state back to the authors and to
the area editors exactly what the hypotheses are.
To kind of get everyone thinking about whether or not the paper is doing something beyond
just throwing a bunch of parts together and running a quantitative evaluation.
So we want you to state your hypothesis.
These hypotheses can take many forms,
and this will depend a lot on what kind of project you're doing.
It could be something as simple as,
"My hypothesis is that for my dataset,
the transformer is a better model than an RNN at making the relevant predictions," right.
That would be kind of functional, kind of engineering-oriented,
but it is nonetheless a hypothesis.
And if you reflect on why you believe in your hypothesis,
that might lead you to even deeper insights like,
for my dataset which is structured in a particular way,
the transformer has the right kind of structure to reflect the data, something like that.
But it could be quite functional and just describe what kind of experiments you wanna do.
But it could be something as interesting as,
"I believe that in processing Project Gutenberg files I can tell whether the author
is a man or a woman based on their portrayal of female characters."
Which is a really interesting hypothesis from a past year that a project team did.
I think it's one of the ones we picked as a top project, right.
And that would be more social sciences oriented and more
removed from the particular modeling tools that they're using.
Because they're just saying like, "This is a signal that I can detect using NLP tools."
And there's lots of stuff in between those two extremes.
The point is state it,
state it as clearly as you can and start talking about it with your mentor. Makes sense?
Then as you can imagine,
coming in at second here, data,
we want a description of the datasets.
It could be again, it's something as simple as,
"I'm using MultiNLI," or something as complicated as,
"Here is my procedure for crowd-sourcing a web
scraped custom dataset that I developed for this project."
For the first one it's probably okay to just mention it and give the relevant citation.
For the second one, I think it would be helpful
if you gave us some details on the nature of the dataset.
Because, for example, our suggestions
for you are going to vary depending on whether or not you have 5,000 instances,
500 or 5 million.
And also in turn like the internal structure of the dataset will matter.
So give us a clear picture of that.
Then the metrics, you'll state the metrics.
That's gonna be one of our topics for this week.
These are gonna be the things that form the basis for your quantitative evaluation.
If it's familiar, that's good, right.
If you're doing a classification problem with imbalanced data,
it's easy to justify that you're using Macro F1.
May be harder to justify Micro F1,
but maybe you wanna do it.
May be harder still to justify something like
an imbalanced version of the F1 score because you wanna favor precision or recall.
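That "imbalanced version" is usually the F-beta score, where beta controls the precision/recall trade-off; a quick sketch with assumed toy precision and recall values:

```python
def fbeta(prec, rec, beta):
    """F-beta score: beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily, beta = 1 is plain F1."""
    if prec == 0 and rec == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * prec * rec / (b2 * prec + rec)

prec, rec = 0.9, 0.5               # a high-precision, low-recall system
f1 = fbeta(prec, rec, beta=1.0)    # balanced harmonic mean
f2 = fbeta(prec, rec, beta=2.0)    # punishes the weak recall
f05 = fbeta(prec, rec, beta=0.5)   # rewards the strong precision
```

So the same system scores differently depending on beta, which is exactly why choosing beta is something you need to justify rather than bury in a footnote.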
Or if you're working on something like dialogue agents,
then as you'll see today or next time,
this is a tricky question exactly what metrics to use.
Because it seems to me that all of the available ones have their own faults.
And so you might want to say,
"I'm gonna use perplexity,
it has these known downsides,
but then again for my project it's the right thing
because it fits intellectually," or you could just be up front and say,
"It's expedient because I don't have
any other kind of extrinsic evaluation I can run," right.
Just be open and upfront about what choices you're making and why.
That's the spirit of this.
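Since perplexity comes up here: it's just the exponentiated average negative log-likelihood the model assigns to the observed tokens, e.g. (toy probabilities assumed):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood). Lower is better;
    a model that is uniform over V choices has perplexity V."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to each observed token behaves
# like a uniform guess over 4 options.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

This is also one of its known downsides: it measures how well the model predicts held-out text, not whether the generated dialogue is actually any good.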
And then we'll give you a response like, "Hey, you know,
I really feel you could go beyond your chosen metric.
Here's a slightly better perspective on the problem or something like that."
Then models, we want a description of the models that you'll be using.
And I think the thing to do here is remember that we
don't just want a description of your favorite model,
the one that you're kind of advocating for,
but also a statement of which models you'll be using as baselines.
Because as we're gonna talk about a little bit later today,
none of these evaluation metrics really have any meaning in isolation.
If you report a 0.76 for your model,
just as a lonely number there,
nobody has any idea what that means.
If it's a familiar dataset,
maybe we can fill in some background.
But then again, it should fall to you to fill that in.
Um, and if it's an unfamiliar dataset, we just won't know.
So one thing you can do to provide that context is set up some baselines.
And I'm gonna go over a bunch of ways that you can do this
where the baselines maybe get more and more ambitious.
Um, but all of them will help us in understanding what you achieved.
[inaudible] datasets and someone else already produced a model,
is that a case where
that number is our baseline number, or do we need to run on the actual
[NOISE] dataset ourselves to generate that number? [NOISE]
Great question. Yeah, I think- certainly at this stage,
the more published numbers you can provide the more context we'll have.
Um, we might encourage you also to fit some of your own baselines,
um, especially if the published numbers are astronomically high.
Because then you might want to say like,
okay somehow teams at Google have achieved 0.90 because of their vast resources,
but the competition I'm running is between
two more modest models and mine still wins within that space,
and in which case your own baseline is serving as a kind of statement of,
given your budget and time constraints and goals,
what the baseline is.
But I think those upper-bound numbers, like the top numbers in the literature, are great.
Human performance, if you know it, uh,
and information about inter-annotator agreement are also valuable context for,
for what we can expect for your data. Yeah, great question.
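Inter-annotator agreement is often reported as Cohen's kappa, which corrects raw agreement for what two annotators would agree on by chance; a minimal sketch with made-up annotations:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected from their label distributions."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators who agree on 6 of 8 binary labels.
ann1 = [1, 1, 0, 0, 1, 0, 1, 0]
ann2 = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(ann1, ann2)
```

Here raw agreement is 0.75, but chance agreement on a balanced binary task is already 0.5, so kappa comes out at 0.5, a more honest picture of how reliable the labels are.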
Others about the models?
This is kind of the heart of it here.
Um, and what we'll be looking for this kind of leads
into this general reasoning part is a statement or
brief statement of how the data and the models come together to inform your hypothesis.
And we have this as a separate item because we don't want you to take it for granted.
We would like to hear your version of why your data and
models and baselines are gonna come together to achieve this.
This is not idle work because if you do this well,
it's probably gonna be the basis for the abstract for your paper.
Because this is more or less what you want from an abstract minus maybe,
maybe the abstract has some additional context.
And then we would like a summary of your progress so far.
So what have you done?
If you've run experiments and they're all in and you have confidence, that's great.
Preliminary numbers are great.
What you still need to do, obstacles, that would be helpful to raise at this point.
Um, things that we should know that might be blockers for you in getting all
that you want to achieve done by the time it's all due.
And then don't forget this because we'll ding you for it,
references section, proper bibliography. Yeah.
So answering all these questions should be a good foundation for our abstract;
what's the expected size of the experimental protocol?
I don't know.
Um, because I don't really think of the length as having
any kind of inherent virtues or, um, vices here.
Um, how long would it take me to answer these questions?
It could be as short as a couple of pages, right?
For some projects where it's like,
my data is SNLI,
my hypothesis is that the Potts model is better than some other model.
Um, my metric is going to be macro F1.
My models I can sketch out.
The general reasoning is clear, like, the Potts model is keyed into NLI
the way no other model is, and here's why, colon, you know.
It wouldn't take me very long.
Um, but if you're doing something quite custom
or you still have open questions like I've talked with a few people who,
maybe even it's unclear what metrics they'll be able to use,
I think you might devote more space to it.
Any other questions? All right.
So I have- what we're gonna do for this week is,
I have two notebooks that I'm gonna show you.
They're both posted at the website,
they both have code, and I've kind of elided
the code for a lot of this presentation,
but it's there for you if you wanna get more hands-on with the material.
But the first one is going to be on methods,
and the second one on metrics.
Um, the metrics one I might not do too much lecturing about,
it's not the stuff of really compelling lectures,
but I would like to just broadcast to you my framework for thinking
about these metrics and how you might choose your own and think about your own,
um, and then you can kind of use that notebook as a reference.
And then more exciting,
on Wednesday, he's gonna give, uh,
a short presentation about what you might think of as like the
2.0 version of the traditional metrics and methods for NLP.
Where we go beyond the stuff that I'm gonna show you today and start thinking in
a more serious way about what it means to truly generalize and maybe to truly have,
like what we could call understanding for a phenomenon or,
or a domain or a task.
Um, and I think that's gonna be really fruitful because you all might want to push
beyond like reporting macro F1 scores
and think about what your systems have actually learned.
And this is an opportunity for you to do that.
That kind of brings me to this overview here.
So our primary goal is to talk about projects.
I can't do methods or metrics in a comprehensive way,
but I hope I can give you a kind of foundation for thinking about all these issues.
And partly, what I wanna do is just bring some of this out into the open,
because I have a feeling that a lot of it is
stuff that you're supposed to just kind of pick up on the streets.
Um, everyone kinda presupposes it and you get
a feel for what the issues are and how to navigate this.
And I was just hoping we'd have an open discussion about it basically.
And as I said, that brings me to the project.
So I wanna make a few things clear about how we think about projects.
Um, first, we will never evaluate a project based on how good the results are.
Let me just make this font a little bigger.
Okay. Uh, only those arrows are getting bigger.
[LAUGHTER] I hope you can- if not- if it's hard to read,
you can follow along at the website.
All of the stuff is drawn from that main notebook, and I'll read it aloud.
So just to emphasize again,
we will never evaluate a project based on how good the results are.
I grant you that publication venues do this.
Whether they have to is secondary,
but they do it and they do it because they have constraints on
space that lead them to be only able to publish a few things,
and then all of science has a bias for
positive evidence for things over negative results.
That's why you might get your paper rejected because you don't have
state of the art results even if you have wonderful insights and analyses.
It's a sad thing about the state of publishing.
And I think it's true all over science.
But in CS224u, we are not subject to that first constraint.
Um, which means that we can do the right and good thing of valuing positive results,
negative results, and everything in between, right?
So just because you got what looks like a state of
the art number on your dataset does not mean that you will get an A triple plus.
And correspondingly, if your hypothesis totally falls flat but you
give a really stellar evaluation of it and
like maybe some insights into why everything fell apart,
that could be an A plus paper, right?
So we're going to evaluate you based on the appropriateness of your metrics,
the strength of your methods,
and the extent to which you are open, and honest,
and clear-eyed about the limits of your findings and other things like that.
And I've been trying to push this on people in office hours like,
because of that first dynamic about publishing,
this is a rare opportunity for you to try something really wild and crazy.
Uh, and, you know,
something that would just be too risky as a conference submission because you never know.
And surely, you know,
the really earth-shaking ideas began as things that sounded too wild and crazy.
In fact, my read on the whole history of deep learning is that it
was too wild and crazy for 30 years before it wasn't.
Um, so you never know, right. Yeah, yeah.
So let's start walking through these issues.
I think some of them seem obvious but all of them are worth reflection.
I would love to have your questions and comments I
would love for this to be a bit of a dialog here.
Start with stuff that's basic.
So data organization, you guys have heard about this.
We have like the idea of a train dev test split.
Ah, and this is a common thing to see in large publicly available data sets.
Even before diving into the details of how this works as a community you should
reflect on the fact that even this is a non-obvious choice, right?
Because in the background here is an assumption that you will develop
a system using the training and dev data and then evaluate it on test data.
And the reason that we draw those from the same body of
examples is we feel that that's in some sense a fair evaluation that um,
if my system has learned its parameters on the training data, it ought to be able to
generalize to data that is in some sense from
the same empirical distribution as the training data.
And we think of that as a kind of fair evaluation.
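The standard setup can be sketched as follows (a minimal stdlib version; real datasets usually ship with their own canonical splits, which you should prefer):

```python
import random

def train_dev_test_split(examples, seed=42, dev_frac=0.1, test_frac=0.1):
    """Shuffle once with a fixed seed, then carve off dev and test.
    Fixing the seed makes the split reproducible across runs."""
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    n = len(data)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = data[:n_test]
    dev = data[n_test:n_test + n_dev]
    train = data[n_test + n_dev:]
    return train, dev, test

# 100 toy examples split 80/10/10
train, dev, test = train_dev_test_split(range(100))
```

The point of the fixed seed is exactly the fairness assumption being described: everyone develops on the same train and dev portions, and test stays untouched until the end.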
It's not the only conceivable approach that you could have to evaluation.
And in fact you might think it's kind of stacking the deck
in favor of all of the systems and overstating all of
our results because you're given even before you look at
the test data a kind of guarantee that you'll be looking at similar examples.
And you can imagine that the whole field took
an alternative route of saying our evaluations,
the ones that we'll allow into our prestigious publishing venues are all gonna be
human subjects evaluations where humans have to deal with
a system and give their ratings about whether the systems are good or not.
And, and we wouldn't value the numbers that we value in this course so much.
So even this is a non-obvious choice and
the other part that might be kind of non-obvious about
this is that the test data does come from the same body of examples as the training data.
And that's one where I can feel the community kind of starting to push back,
starting to acknowledge more and more that
this is overstating what our systems can actually do.
Um, and for example in this course the MultiNLI dataset is one that has
the mismatched test condition where you train on
some genres and test on genres that you've never seen before.
That's starting to push back on
the core assumption that you're seeing on this slide about train dev test.
And I think people aspire to do even more radical things, to give up on this;
the way people tend to state it is training and testing on the same distribution.
Um, it's hard to know what it would mean to test on a different one,
because there's an infinitude of different datasets that you could
pick, and surely taking a system trained on SNLI and
evaluating it on chairs and tables in the real world, right?
Completely different data; that's gonna fail.
What's in between there would be a more ambitious test.
MultiNLI is one gesture in that direction.
But back down to earth, many datasets that you study will have this kind of
split, and as we've emphasized before,
you're on the honor system to not use your test sets.
Like the paradigm for you if you're publishing is that you are
on the honor system to do all your work on the train and dev sets,
never looking at test even if you possess it.
And then just before publication with all the systems
done you run a single test evaluation,
enter the numbers into your document and submit
and you're just at the mercy of what happened on test.
I don't know how often people really adhere to that.
It's sort of scary to think about people compromising that position.
And I think that's one reason why you're seeing an increased use
of test sets that are truly held out in the sense that they are on
Kaggle or held by a system and we may even keep track of
how many times different teams are evaluating on that test set. Yeah.
Does that sort of mean that you,
you wouldn't be able to analyze,
like, how you did on the test set ultimately?
Like, if you're just running it and immediately
submitting, you can't actually run it while
you're still writing your paper. Is the idea that when you're
discussing your results, you're discussing your dev results?
That would be my preference.
Maybe people differ on this but my preference would be that when you discuss,
when you do error analysis you do it on
dev, because of this last point here about the test set:
people start hill climbing on it.
So if you publish a wonderful paper with detailed error analysis on your test set,
then I get to learn a lot about what works and what doesn't.
Even if technically I have never peeked at the data set.
And in that way we would just end up with a lot like community hill climbing on the data.
So my preference would be that these are just kind of stark numbers that you see.
Um, even that's going to allow some hill-climbing because
we can get a community-wide sense of what's working and what isn't.
But yeah, that's my preference. Go ahead.
In the case where we have like totally different sets, right?
Like we're, we're testing on a different distribution entirely.
Is it, there even not useful to kind of talk
about like say you know you just do abysmally worse on
this new data set like that seems kind of notable that you would
want to talk about like gosh there is something in my model that,
that sort of failed to capture this difference.
Yeah, that's a fair point.
If you think there's some value you might,
like if you- I mean if you're doing the thing of just saying like,
"Hey, this system that was trained on Twitter,
I'm going to apply it to the Wall Street Journal."
And it falls down because the styles are very
different for that Wall Street Journal data set.
I'm not even sure whether you're thinking of that as
your final test set and it might be okay because it's more qualitative.
But I think it will depend on the circumstances.
But if you're in this hypothetical situation,
if you're releasing that Wall Street Journal test set and you want others to use it,
it probably benefits you to be kind of tight-lipped about
it and maybe have a dev portion that is keyed into that.
[NOISE] Other questions about this? Yes.
It seems that if you're [inaudible] your dataset, you then want to ensure that your, sort of,
class distribution is the same across the train, test, and dev splits,
because sometimes if you're just arbitrarily chunking up your data,
this might go sideways.
You're raising a very important issue, yeah.
I agree with the core of it.
We want similar distributions here.
But this can get quite intricate as you can imagine.
If we think about problems that have different structures where I always think that
the really scary version of this is what if your data set depends on a knowledge graph,
then how are you gonna do this in a way that's fair.
Um, and these issues are arising in a more subtle form and
I'll show you some tools that help you manage some of
these problems but you're absolutely right about the kind of core default insight.
The other point I wanted to make here is that even doing this
presupposes a pretty large data set, like um,
if you do work in health care or something where every example is kind of hard won,
you might be reluctant to hold out a test set because it really means that you're
giving up on a whole bunch of data that could be used and studied in an interesting way.
So that brings me to this next thing here, which is, like, no fixed splits.
And some public datasets do have this property.
And I think this can be good um,
but it does pose a challenge for assessment because for really robust comparisons,
you have to run all the models using whatever assessment regime you've picked.
Right? So say someone does a bunch of
experiments on one of these datasets and splits it in his own way.
And maybe I don't know the details of how he did those splits.
And I do different splits.
Strictly speaking if I want to compare our numbers now,
I can't quite do it right um,
because for a really fair comparison we should be looking at
exactly the same splits because who knows what
happened when we did these random divisions.
Let me just say, because I always want to balance this: that's the ideal.
And so the ideal is that somehow I get exactly his splits or his model and I rerun
everything in a super careful way, and then
my results table can be reported with no caveats.
That's the ideal. But of course we don't live in the ideal world.
We all wanna have room to say, unfortunately,
I couldn't get exactly the splits that they used.
So my comparisons are informal in that sense but nonetheless reliable and I think
the important thing is that you report them and talk
about their value but with that caveat attached.
And that's very different like if you think about reviewers being your antagonists.
If you just report these numbers without giving context,
the reviewer might say thumbs down,
those comparisons weren't fair.
It wasn't the same splits.
If you give the caveat and acknowledge it,
it's kind of harder for them because you've already
contextualized the results that you're offering. Yeah, you had a question?
Is there a point where you almost have enough data that over-fitting would be the goal?
Like if you had all text ever written.
At that point would it be almost not be good to just
over fit to that in the sense that you have so much data
that if you fit it well then you just kind of fit text well.
Like it's so large that it might as well be as large
as all utterances that will ever be produced by humans.
[LAUGHTER]
I guess is like is there- is there
sort of an implicit assumption that no dataset is too large,
that like every dataset; it doesn't matter how big it is,
you still need some- something that has been held out in order to evaluate it properly?
I think I do want to stand by that.
I think language is so complicated that you never see even
but the tiniest portion of the full distribution of utterances.
We're incredibly creative with language.
We produce, um, entirely new utterances all the time,
and that means there's always room for testing generalization.
Yeah, that's one of the exciting things about the domain I would say. Another question?
[BACKGROUND] Something about that:
I feel like maybe if you're in a setting where you're just trying to
create, like, word vectors that encode, like, the hypernymy relations in WordNet,
you could think that, like, the entirety of
WordNet contains all the relations you might need to use.
So maybe in that setting,
it's appropriate to train and test on the same data.
Because really, you're just trying to encode the relational information and it doesn't
really matter like what you do in a general way as long as it can be accessed.
Oh, I see. So you don't mean literally
the same data but rather just the same kind of dataset that gets split.
Yeah. In which case, I think your,
your observation kicks in for sure.
If we're looking at the dictionary of English,
we might think that this is the entirety of the domain.
Yeah. For large datasets,
if they don't come with a predefined split as part of your project,
you could impose splits and just say like here's
what I'm going to use and I don't look at the test data.
Um, this can be nice.
If you have the luxury of doing this,
I would do it because it simplifies your experimental setup in many ways as you'll see.
Um, and in particular,
the second point is important.
You'll see that for very careful
hyperparameter optimization, you end up doing a lot of evaluations, and having
fixed splits, as opposed to doing something like cross-validation,
could dramatically reduce the number of things you have to do here,
and also just kind of simplify things, like
removing an inner loop from your experiment code, as you'll see.
So I'd encourage you to do that,
and then the qualifier there is just for small datasets.
You might feel too constrained to leave out some data for testing.
And I'll talk about strategies for that next.
Oh, yeah. That kind of brings me to this other mode; so cross-validation.
So instead of having fixed splits in cross validation,
we take a set of examples x and partition them into two or more
train/test splits and average over the results in some way.
You probably have heard about this before.
There are two kind of classes of ways you could do this.
The first one, I'll call random splits.
So in this case, shuffle your data,
hold out 10% of the data for testing,
use the rest for training,
and run an experiment;
and then make another split,
randomly, run an experiment, and so forth.
And you can do that as many times as you want,
and you might want to do it a lot to get some insight
into how much variance there is in your system's performance.
Uh, relevant to that question in the back.
I'm sorry, I don't know your name,
but this was a great insight that when I do this,
I probably want to make sure that I have
the same distribution of labels across the splits.
So Scikit makes this very easy.
The, uh, relevant code has a keyword called stratify,
and you can just make sure that you've told the code to
honor the distribution of the train and test,
have them equal across all these splits,
and I think that will lead to more consistent evaluations.
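As a concrete sketch of that (toy imbalanced labels, not the real dataset), the relevant keyword in scikit-learn's `train_test_split` is `stratify`:

```python
from sklearn.model_selection import train_test_split

# Made-up data with an 80/20 label imbalance.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y preserves the label distribution in both portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Both splits keep the 80/20 ratio: 20% positives in each.
print(sum(y_train) / len(y_train))  # 0.2
print(sum(y_test) / len(y_test))    # 0.2
```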
The good and the bad of random splits.
Okay. So the good is that you can create as many as you
want without having this impact the ratio of training to test examples.
That'll become clear when we look at true cross-validation.
But the idea is that you can just run as many of these experiments as you want.
Because probably, with a large dataset,
there might as well be an infinite number of ways that you could do these divisions.
The bad is kind of related to that.
There's actually no guarantee that
every example will be used the same number of times for training and testing.
So depending on the nature of the splits that you do,
you might be looking at kind of
distorted evaluations as you combine them and average them.
But this has really nice flexibility to it.
And in situations where your model is pretty fast to train and test,
the fact that you can do lots of splits this way without
impacting the ratio of train to test examples is very powerful.
And will open the door to doing some testing of
the statistical differences of your systems.
Let's contrast that with, um, oh,
and here, here are some, you know,
some code that you can use.
Scikit is as usual wonderful for helping you do this kind of thing.
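For instance, a minimal sketch of the random-splits recipe with scikit-learn's `StratifiedShuffleSplit`, again on made-up data:

```python
from sklearn.model_selection import StratifiedShuffleSplit

X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# 10 independent random splits, each holding out 20% for testing;
# stratification keeps the 80/20 label ratio in every split.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
n_splits_seen = 0
for train_idx, test_idx in splitter.split(X, y):
    # In a real experiment you'd train and evaluate a model here.
    n_splits_seen += 1
```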
Let's compare this with K-fold cross-validation.
This is different from the random train/test split thing.
So in K-fold, you divide your data into what are called folds.
So different sets here like fold 1, 2,
3, and you conduct K experiments.
Sorry, in each of the experiments,
fold i is used for assessment and the other folds are merged together for training.
So if I have three folds here,
these are subsets of the examples.
For experiment 1, I hold out fold 1,
train on 2 and 3 and test, and I get a number.
In the second experiment,
I hold out 2, train on 1 and 3 and I get a number.
So forth for the third one,
here I test on fold 3 and train on 1 and 2.
So now, I've seen every combination of these three folds, and I get three numbers,
and I can average them or something,
and maybe report if I do more folds,
um, kind of confidence intervals around those means.
Let's do the good and the bad here.
So the good and the bad kind of complementing the last few.
Every example appears in a train set exactly k
minus 1 times and in the test set exactly once.
So that's nice. You have some guarantees about how you've gone through the data.
I'd say the bad of this is that the size of K
determines the size of the train test splits.
So with three-fold cross-validation,
you train on 67% of the data and test on 33.
But with 10-fold, you train on 90 and test on 10.
Those are very likely to be different experiment scenarios,
and I feel like two things have gotten muddled together when you're doing this.
On the one hand, you wanted a lot of runs because you wanted
a real sense for a system performance across different settings.
On the other hand, you end up changing
the size of the training and test data when you do that,
uh, and that's just a consequence of the good here.
So it's not like we can blame the method.
But if- that might steer you back toward random splits,
where you don't have this relationship,
and all you're giving up on is this absolute guarantee up here.
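That coupling between K and the ratio is easy to see directly; a tiny illustration (toy data) of how the choice of K fixes the train/test sizes:

```python
from sklearn.model_selection import KFold

X = list(range(30))

# K-fold always trains on (K-1)/K of the data and tests on 1/K of it,
# so changing K changes the experiment's train/test ratio.
ratios = {}
for k in (3, 10):
    train_idx, test_idx = next(iter(KFold(n_splits=k).split(X)))
    ratios[k] = (len(train_idx), len(test_idx))

print(ratios)  # {3: (20, 10), 10: (27, 3)}
```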
Make sense? And as usual, as I said,
Scikit is great for this, K-fold and stratified K-fold,
and then it has these helper methods here;
cross_validate and cross_val_score,
which will allow you to,
um, setup models and datasets and kind of under the hood,
do all of these evaluations so that you don't write
nested for-loop code which is hard to maintain.
And the only two, um,
tips I want to give there are, first,
when you use these guys,
you should be explicit about what your cv argument is so that you have
absolute certainty that you are doing
stratified if you want to or just K-fold if you don't.
If you're kind of nervous like me.
And then maybe more important is that when you do this,
you set the scoring value.
If you don't set a scoring value for these functions,
it's going to default to whatever the model's inbuilt score is.
And for example for a classifier that will be accuracy,
and we've seen I think time and again that that's probably
not the metric that we want to be using for our evaluations.
So you have to take the step of explicitly changing that to
macro F1 or whatever your choice is.
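Putting those two tips together, a minimal sketch (synthetic data) that is explicit about both the `cv` argument and the `scoring` argument:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for a real task.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Explicit cv, so we know we're getting stratified folds, and explicit
# scoring, so we don't silently fall back to the model's default (accuracy).
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1_macro")

print(results["test_score"].mean())
```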
And just to mention a few variants K-fold has some special cases.
Again, Scikit is good about this.
So in LeaveOneOut cross-validation,
this is an edge case of K-fold
where I hold out just a single example for testing,
and train on all the rest, and I do that, you know,
as many times as there are
examples in my dataset and average the results.
You would do that if you have a very small dataset in the relevant sense.
And there might be two ways that could be small.
First, you might just have a tiny number of examples,
um, in which case,
you'll be compelled to do this.
The other situation would be that,
maybe you have a very large dataset,
but you would like to conduct an
evaluation that is like at the level of users.
And so, maybe every user is associated with 100,000 texts,
and you have 12 users,
so you actually have a quite large dataset.
But if I wanna do this user-level evaluation,
suddenly, the data seem quite sparse.
And in that case, you might do the LeaveOneOut thing.
And then finally, Scikit has a function called LeavePGroupsOut,
that's actually quite sophisticated about helping
you to do splits that are not just the standard ones.
So for example, it has keyword things that you can do to make sure that you
split along a temporal dimension which would be
important for anything with a time series aspect to it.
Um, I think you could also manage things that needed to be divided by,
like, demographic group, um,
or other things like structure in your data that
aren't immediately evident from your labels,
uh, in which case, you know,
if it's just the labels,
then the stratified option will probably take care of that.
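A sketch of that user-level idea using scikit-learn's LeaveOneGroupOut (the one-group cousin of LeavePGroupsOut), with made-up users as the groups:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy setup: 12 examples belonging to 3 hypothetical users.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.array([0] * 4 + [1] * 4 + [2] * 4)

# Each split holds out all of one user's data for testing, which is
# the user-level evaluation described above.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    held_out = set(groups[test_idx])
    assert len(held_out) == 1  # exactly one user held out per split
```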
Good. So that's it for splits.
[NOISE] Did I leave anything out in talking about all this stuff,
K-fold, cross-validation, stuff like that? Yeah.
So if you are putting forth like a trained model,
would you ultimately train it on all the data,
and then just use this cross validation as like an approximation of the [NOISE] accuracy?
Probably not.
If, if you have a fixed test set,
then you certainly won't do what you just described.
I think- I think you would do a lot of cross-validation as part of development,
uh, but then finally,
evaluate on the test set.
So for example, if your dataset has just train test and no dev,
then in place of dev,
you might do a lot of cross-validation runs inside the training data,
to get a sense for exactly how your system
generalizes in all sorts of ways that you can throw at it,
and then evaluate on test.
I think you would have these as
your final evaluation numbers only for datasets that don't have a test set.
Because they're smaller, they weren't released with one.
Uh, in which case, you know,
you get the benefits of not having just one number,
just this test set number because maybe the test set was chosen very strangely,
and it's distorting progress on the dataset.
It's kind of nice to know that we're reporting the average of a bunch of different runs,
gives us a sense of more robustness.
On the other hand, you don't get that pristine kind of held out nature of a test set.
Other? Oh, yeah.
So when you talk about the test set,
[NOISE] is there any intuition for how to think about the ratio between training and
test that should be [NOISE] [inaudible]
That's a good question. I don't know it.
I feel like you see 80-20,
75-25, 67-33, but I don't think there's much behind this.
It's probably people just stepping back and thinking like,
what's a large enough number that the testing isn't gonna be kind of wild.
Um, but also, isn't gonna deny me too much training data.
That's the best answer I can give.
Um, it certainly gets more
interesting if you think about holding out much more test data,
how good can these systems get if I showed them only half the data, for example?
But as a dataset creator,
you might be shooting yourself in the [NOISE] foot by doing that
because people want lots of data to train on.
And maybe it's that game theoretic back and forth that gets us to 80-20.
[NOISE] Yeah.
[NOISE] Systems in industry, do they ultimately use a similar methodology, like,
you know, almost the same thing as in an academic setting,
where you have your training data,
and your test set, and all of that jazz?
That's a great question.
Um, and I see nodding.
My first pass at this would be- the answer is yes.
But then one fascinating thing that might happen to you,
um, is that your incentives are switched.
So in academia, we would kind of like all our systems to look good,
and maybe that creates a bias for having them look better than they
are because that's the way you get published.
In industry though, when you think about deploying a system,
if your- if your neck is on the line for that deployed system,
[NOISE] you might take a much harder look at the kind of evaluations you're doing
to essentially get a conservative number about how it's gonna perform.
And you might do much more kind of like truly held-out testing on completely or,
you know, partially unrelated distributions.
So that when you do deploy, you're not fired.
[NOISE] [LAUGHTER] Do other people who worked in industry wanna comment on that?
It's an interesting dynamic. Yeah.
What you said sounds exactly right to me.
I mean, um, your, um, the,
the question of incentives is,
is [NOISE] complicated because like on the one hand,
um, do you wanna get your project out the door?
[NOISE] On the other hand,
do you also wanna keep your job?
[LAUGHTER] So you kind of have, uh, conflicting incentives.
But I think, um, uh, ultimately,
you want evaluations which reflect the true strength of the system.
Um, and doing a train/test split and,
and using the same kind of, um,
methodology that we're using in academia is pretty, pretty similar.
[NOISE] Let's change topics a little bit.
Oh, another question. Yeah.
Kind of [NOISE] piggybacking on what he was asking.
Um, is there any point in say doing like
a meta study of like all the major datasets in a given domain,
and seeing what the best ratio of train test splits are?
[NOISE]
How would you define best?
I mean, it would get interesting very fast because,
yeah, what do you mean by best?
Well, I guess like which one offers the best validation performance?
Um, yeah.
Let me- my answer is kind of yes like if you
could find a way to do this that offered, er,
intel to the community so that we make
better decisions going forward, we would all benefit.
It's kind of hard for me to see how we get the ideals right.
I think a part of it would be you stating what your ideals were.
But you do remind me that there's another interesting thing you could
do which is related to the split sizes which is to say,
okay, I have my fixed test set,
and my default is to use all the training data.
But maybe you're advocating for a system that really shines with very little data.
In which case, I think it would be smart for you to
do evaluations where you reduce the amount of training data,
and maybe what you're doing is showing that.
Okay. For all the training data, your competitor wins.
But you can get by on 10% of the data,
and your system is still doing much better at that point than the other competitor.
That could be incredibly valuable because [NOISE] we never have enough data.
And so we should in some sense favor systems that can thrive in that sparse area.
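One hedged sketch of that kind of study, using scikit-learn's `learning_curve` on synthetic data to score the same model at several training sizes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data; in the argument above you'd compare two systems this way.
X, y = make_classification(n_samples=500, random_state=0)

# Evaluate the same model at 10%, 50%, and 100% of the available
# training data, cross-validated, with an explicit scoring choice.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.1, 0.5, 1.0], cv=5, scoring="f1_macro")

for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(n, round(s, 3))
```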
[NOISE]
Yes.
Is that something that people actually do in papers?
Like what you have just described.
Yes.
Yeah. Okay.
Yeah. And I think you could tie into like
the general feeling that we should make efficient use of all the data that we have.
Yeah. Let's talk a bit about baselines.
Again, kind of maybe stuff we've been taking for granted
but it's nonetheless worth thinking about these things.
Um, first of all, as I said before,
we can never understand evaluation numbers in NLP in isolation.
And I feel like you get- you get it from both sides here because on the one hand,
if you just try to publish a number like my system gets
0.95 or 0.99 even F1 on a dataset,
you might think that you could just declare victory by default,
right, because you've solved the task.
But what will actually happen is that your reviewers will say,
Well, obviously, this is too easy.
Um, and this is also like related to
that debugging phenomenon that might be depressing about NLP,
that if you get perfect performance,
you probably don't cheer but rather think, something is broken in my code.
[LAUGHTER] I can't be that good.
Um, so even those really good numbers require context, maybe
human-level performance on the one side and then a baseline on the other side.
On the other hand,
if your system gets 0.60, um,
you might despair but maybe it turns out that humans aren't a whole lot better,
and maybe the baseline is way below that.
In which case, that 0.60 might mark real progress on the problem.
Uh, and that's a clear case where we need context,
where you do not want people defaulting to what they think
0.60 means in the context of classification or something.
So these baselines are really important for building good arguments.
Um, they shouldn't be an afterthought but rather
pretty central to how you set up your hypotheses,
and you can hear us encouraging that with this protocol document.
Um, that baselines are really important to building a good case.
And the other thing that you can do is use them to illuminate
specific aspects of the problem that you're
studying or some virtue of your proposed system.
And I think we've seen versions of this before, we didn't dwell on it.
But think like, if I have, uh,
a model that's just summing GloVe vectors,
it's completely insensitive to word order.
If I compare that to a model like, uh,
an RNN, which has a lot of word order dependence,
and I show that the RNN is better,
then I've got a kind of
intuitive indirect argument that word order matters for my dataset.
Um, and you can think of lots of other examples that will have that quality.
Maybe you added some attention mechanisms to your NLI model.
And then the difference, you know,
the gain that you get from that is kind of
illuminating the fact that there are important kind of
dependencies between premise and hypothesis
that a standard model was just missing out on.
Maybe forgotten but important to mention here random baselines,
you might want to include them.
And this is, uh, I mean, often,
the random baseline will be easy to construct, like you could just say I'm gonna guess
proportional to the class distribution or I'm gonna just pick the majority class,
those are two common baselines.
I mentioned the dummy classifier and dummy regressor here.
These are both Scikit models,
just because they make it easy for you to
build the dummy classifier into your standard experimental pipeline.
So maybe you're testing like five different models.
You don't want to have a separate code base for doing the dummy classifier,
because it might- you might introduce bugs there or change the evaluation
somehow or it might make it just very
hard to make sure all your models are seeing the same data.
With DummyClassifier and DummyRegressor,
they have the same interface as all the other Scikit models,
and in turn the same interface as all the models that we've
used for PyTorch and so forth in this class.
Um, it's just that they don't make
intelligent use of the features that you give them essentially.
Um, but that's kind of nice in terms of automating your workflows.
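A minimal sketch of that shared pipeline, with synthetic data standing in for a real task: the dummy baselines drop into the same loop as the real model, no separate code path:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced task.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Dummy baselines share fit/predict with the real models, so all three
# are trained and evaluated identically on identical data.
models = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "random": DummyClassifier(strategy="stratified", random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = f1_score(y_test, model.predict(X_test), average="macro")
```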
This is a subtler issue.
This is kind of already pushing us beyond standard thinking about evaluation,
I've called this task specific baselines.
It's worth considering whether your problem suggests a baseline that
will reveal something about the way the problem is posed or modeled.
We've seen one example of this.
So in NLI, we saw that hypothesis only baselines can be very strong,
much stronger than random guessing.
Uh, and I gave you an argument for why that is,
I think there's a scientific reason at least in part for why these can be so strong.
Uh, another story that had some prominence
a few years ago: there's a Story Cloze dataset,
and this is where you are given a passage,
and you have to choose between a coherent and incoherent ending for that passage.
Um, and people reported a lot of progress on this,
you know, numbers that looked pretty good compared to human performance.
Um, but then it all kind of got qualified by the fact that if you turned a
classifier loose just on the endings without showing it the story,
it could come pretty close to those top numbers.
And I don't have such a hopeful story
about that one as I do about the NLI baseline.
This might indicate that there's a real problem with the dataset.
But separately from that,
even if there is an issue,
you can kind of partly overcome that by just saying
my baseline is this classifier that doesn't look at the story.
It gets 70%,
and then let's see how much further we can get from that with
the smart model and more intelligent use of the data, right?
So all is not lost even in a situation like this one. Yeah.
[inaudible] task-specific baseline,
aren't you essentially kind of implicitly encoding
some prior knowledge about the problem in, in doing that, right?
Do you think there is any potential like risks when you take that approach?
What would be the risks, that you'd be making the problem too hard for yourself?
If, if you end up encoding kind of the wrong information about that.
I think I see what you mean.
So I don't- so do you think it holds for either of these two,
just to get a feel for what you're after?
Not, not particularly. No.
I recognize that it's a danger.
Um, so don't do it.
[LAUGHTER] I'm only partly joking in the sense that,
if you do take this step,
it's probably because you're giving a positive argument that you feel random guessing
is just not a fair baseline for us to have picked.
That's certainly what you would do in both of these situations.
Um, I guess I'd be kind of discouraging about just going off and seeing if you
can cleverly construct task specific baselines that are really strong,
um, because you might end up introducing biases of the sort you're describing.
But a lot of what I'm trying to do here is say,
I don't have the full answer because I don't know what the world is gonna throw at me,
what task you're going to take on,
what your data will be like.
The idea here is to just be thoughtful about it,
and that like if you think about the really exceptional papers in our field,
they often have- first of all,
they're thoughtful about all of their choices in my experience.
Um, but they often also introduce these subtleties in a way
that moves us forward conceptually and not just quantitatively.
Okay. Another big topic is hyperparameter optimization.
Uh, this is where we're gonna start using those Cloud credits I guess.
I gave this argument before,
let me rehearse it and amplify it a little bit.
So just for some terminology here.
In machine learning, the parameters of your model are
those whose values are learned as part of our optimizing the model itself,
we often call those the weights or the coefficients.
Let's take that as given as a definition and just say that
the hyperparameters of your model are any settings
that are set by a process that is outside of the one that I described under 1.
Of course, the boundary between these two could get blurred, um, and moreover,
if you think about the space that I just defined for hyperparameters,
it could end up including everything, including, like,
what kind of computer you're running on and what you had for breakfast.
Um, but some clear examples here,
the regularization term for a classifier,
that's a clear hyperparameter.
We all have an intuition that those values
could pretty dramatically affect how your system performs.
Other examples are like the dimensionality of your network if it's deep learning,
the activation functions that you use could have a huge impact,
the optimizer that you could use could have an impact,
especially for models that don't converge in a reliable way.
Um, you could also think as I've done here,
"What about, like, the way you normalize feature values?
Uh, did you apply TF-IDF or LSA or something like that?"
Um, we're kind of drifting farther and farther from the core parts
of the model but I think you could think of all of these things as hyperparameters,
and we'll construe them pretty broadly.
So here's the rationale for doing hyperparameter optimization.
There are lots of ways you could go at this,
but here I've picked one, kind of,
back and forth narrative.
So suppose you ran experiments with models A, B,
and C and for each,
you use the default hyperparameters as given by the implementations you're using, right?
So in scikit you just set up these models,
you initialize them with no arguments.
So I'm just at the mercy of what scikit picked.
And in doing that, you found that your model C performed best,
and you report that in your paper and you think "Ah, you know, victory.
I was rooting for C and it came out true."
Your reviewer is going to ask or certainly wonder about a few things.
So first, "Did you actually try
any other values for the hyperparameters without reporting that?"
They might just wonder about your practices.
"If not, would you have done that if your favorite model hadn't outperformed the others?"
Perfectly legitimate question, right?
You- maybe you didn't have to confront it,
but suppose B had won your competition,
would you have thought, "Well, wait a second.
I need to think about whether or not I'm doing the regularization correctly.
Let me fiddle around with some values a little bit"?
And then you'd be off and running on some
hill-climbing toward your favorite thing.
So the- what the reviewer might conclude here is that
all we've learned at this stage is that
there's some setting of
the hyperparameters that favors your model and disfavors the other one.
But the really cynical, hard-bitten reviewer is gonna say, "Well,
I already knew that because I know that I can just craftily pick
these settings in a way that will make some model look
terrible and other models look good."
Right. If I'm permitted this kind of behind-the-scenes fiddling,
then game over, we don't have a fair evaluation.
So in response, you could give every model its best chance to succeed, right?
Let your models explore a really wide range of hyperparameters.
For each model, choose the best settings
according to performance on your development data,
and then report how the models do on those settings on all your test sets.
Right? If you go through that whole discourse there,
your persistent reviewer might say, "Well,
you didn't pick the right space of hyperparameters.
You should have gone further,
you should have tried these new variations," and I grant that,
but I mean, come on,
you have to stop somewhere.
So that's one kind of rationale.
The other would be just that you are open-mindedly trying to figure out what's best.
Like, you've been placed in a situation where you have
to do a bake off for a class and you want to win, uh,
and then all of the same dynamics kick in,
it's just that you're not arguing with someone but rather just
trying to figure out how you can win the bake off.
Um, I think the same rationale applies because you want to pick A,
B, and C and really see each one in the best possible light.
And that's kind of the spirit of this,
that we get a fair evaluation if we see A, B,
and C all at their best, which means, like,
dressing them up to be the best they can be,
and that comes back to the hyperparameters.
So here's the ideal for this kind of thing.
For each hyperparameter to your model,
identify a really large set of values for it.
Create a list of all the combinations of all the hyperparameter values,
this will be the full cross product of all the sets of
values for all the features you identified as relevant.
For each of the settings,
cross-validate it on the available training data
and then choose the settings that did best,
train on all the data,
and then evaluate your model in the usual way.
That's what you really ought to be doing I suppose by the logic that I gave you before.
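As a toy-scale sketch of that ideal, scikit-learn's `GridSearchCV` enumerates the full cross product of the value sets, cross-validates each setting, and by default refits the best one on all the data (the grid values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Full cross product: 4 x 1 x 2 = 8 settings, each cross-validated.
grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l2"],
    "fit_intercept": [True, False],
}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), grid, cv=5, scoring="f1_macro")
search.fit(X, y)  # refit=True (the default) retrains the winner on all data
print(search.best_params_)
```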
However, this could be very demanding.
Suppose that hyperparameter 1 has five values and 2 has ten,
okay, now there are 50 settings.
If you add a third with two values,
now your number of settings has jumped to 100.
Okay. Now you wanna do five-fold cross-validation,
that means you're doing 500 runs,
and if you wanna do 10 random train test splits because you wanna, like,
test whether or not your model is
statistically different or look at confidence intervals,
now you're doing 5,000 runs of your model.
Boy, that got expensive fast, right?
Um, and if each one of these models takes eight hours of GPU compute time,
well you can do the numbers here,
um, because they're gonna get very expensive.
Yeah, we have to- we have to back off from this idea a little bit.
This is untenable as a set of laws for a scientific community.
If we actually said that the only people who could
publish were people who had made good on this entire thing here,
then only people with vast amounts of money would be
able to publish at all, and in addition,
we would have a distorting bias in favor of simple models because
those would be the models that you could really afford to do all these runs for.
And I picked this example, [NOISE] you see these kinds of things more and more,
but this is from a nice paper on doing NLP for
health care and they just note in their, uh,
methods section in an appendix,
the performance of all above neural networks were tuned automatically using
Google Vizier with a total of over 200,000 GPU hours.
So you figure like, "Okay.
At a dollar an hour,
plus whatever Google is gonna take from me,
this could cost a quarter of a million dollars very easily."
Um, we got you some Amazon credits,
but not close to that, I'm afraid.
So we can't have that be the- the law that we live by.
We can have it as an ideal,
but we shouldn't, um,
allow others to expect it from us nor should we require it of other people.
We need to introduce some space for being pragmatic about these choices.
So here is my view about how we could kind of alleviate this problem, and I've done this.
Again, this is very impressionistic,
but in descending order of attractiveness to
me as a reviewer or to my imaginary reviewers out there,
if I think about them being antagonists for me.
So first, you could do some random sampling and guided sampling, uh,
to allow you to explore a really large space based on a fixed budget of runs.
So the random sampling part would be just that,
I set up my grid as I did before,
and it has 500 settings,
but I decide that I'm going to look at 50 of them.
And what the random sampler will do is pick 50 random settings,
and what I'll do is kind of have an expectation
that that's a pretty good sample of the full space, and that I,
with only low probability,
missed a setting that was gonna be really transformatively different.
And then there's a kind of variation on that
that I've called guided sampling here,
which would be that you have maybe a separate model running that is trying to
intelligently travel around in the hyperparameter space and make choices on that basis,
and you might think that that's better than random.
The virtue of both of these strategies is you could just say, "Look,
my budget is 50 runs," and in doing that you won't have to delimit
your hyperparameter space artificially because it could be
wide and you're still just going to sample 50 times from it,
and I think this is pretty palatable to reviewers,
especially since there's published evidence that random sampling is a good strategy.
Uh, and I included those links in the notebook on the website,
and then guided sampling is another proven method and I'll show you- I'll link to,
uh, some software packages for doing that a little bit later.
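A sketch of the fixed-budget idea with scikit-learn's `RandomizedSearchCV`; here `C` is sampled from a wide log-uniform distribution rather than a fixed grid (all values are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# A fixed budget of 20 runs sampled from a wide space: the space
# doesn't have to be delimited to a small grid.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-4, 1e2), "fit_intercept": [True, False]},
    n_iter=20, cv=5, scoring="f1_macro", random_state=0)
search.fit(X, y)
print(search.best_params_)
```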
Is there evidence that the, like,
space of hyperparameters is really non-linear?
That's a great question. I think it really is going
to depend on your model and your settings.
Um, yeah.
Do- do you have anything more to say?
[inaudible] So, one thing to add.
Guided sampling can actually be very convenient.
[NOISE] Like, you really want to use something like random sampling, but with a memory.
It is this idea that you keep focusing on small areas as you are advancing in the,
uh, process of finding hyperparameters.
So you don't have to explore the whole space; you kind of,
uh, constrain yourself to smaller and smaller areas,
based on what you have seen so far.
We've talked about the second strategy before.
I'm sorry that these are all ones here.
[LAUGHTER] Um, you could do some search based on a few epochs of training.
Uh, so this is especially relevant I think in the context of deep learning models,
where your model may take 1,000 epochs to converge and that would be very
expensive to do all of those runs and then check
performance and select parameters on that basis.
But what you could do is have a background hypothesis
that the future is gonna be like the past,
that a setting that was really bad at the start is just gonna remain bad,
and in turn settings that are good in the start are gonna remain good.
Um, two things about that,
you might not be able to prove it but you might be able to
support it with some kind of learning curves that show what's
happening in those early things and maybe a few runs that
project out further and show that your assumption is basically correct.
So you could do that to kind of build your case.
Um, but you know,
if you don't have the capacity to do that,
even just acknowledging that this is the strategy that you
picked and giving a reason based on budget or other considerations,
I think will go a long way with reviewers because it kind of acknowledges
that your choices were done on somewhat limited evidence,
um, but at least you know that that qualification exists.
And again, this is kind of nice because you can control costs in this way
while still exploring a large space of parameters.
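One way to sketch the few-epochs strategy, using `partial_fit` to give each candidate setting only a few passes before committing to one (the candidate values here are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Train each candidate setting for just a few epochs via partial_fit,
# on the hypothesis that early winners stay winners.
candidates = [1e-5, 1e-3, 1e-1]
scores = {}
for alpha in candidates:
    clf = SGDClassifier(alpha=alpha, random_state=0)
    for _ in range(3):  # a few epochs only
        clf.partial_fit(X, y, classes=np.unique(y))
    scores[alpha] = clf.score(X, y)

# Only the best early performer is trained to convergence.
best = max(scores, key=scores.get)
final = SGDClassifier(alpha=best, max_iter=1000, random_state=0).fit(X, y)
```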
Search based on subsets of the data.
That's another strategy that we explored off and on throughout this course,
um, and that would be that, you know,
my actual dataset contains 500,000 examples but I'll do my tuning on 5,000
and then I have lots of flexibility to do lots of
experiments and then I'll just project that out to the larger set.
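A minimal sketch of the subset strategy (my own illustration, using scikit-learn and toy data): tune on a small random sample, then fit the winning setting on the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Full dataset (toy stand-in for a large corpus); tune on a small subset only.
X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
rng = np.random.RandomState(0)
idx = rng.choice(len(X), size=5000, replace=False)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X[idx], y[idx])

# Project the winning setting out to the full dataset. Caveat: regularization
# strength (C) can be quite sensitive to dataset size.
final = LogisticRegression(C=search.best_params_["C"], max_iter=1000).fit(X, y)
```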
Um, the real problem with that is that I
think it's obvious that some of these parameters,
these hyperparameters, could be very dependent on dataset size.
So a classic case would be that
regularization parameters are really gonna be affected by the number of features you
have and the amount of evidence that you can gather for each
one and that's interacting badly with this approach to selection.
But still, certainly better than nothing.
You could also do some heuristic search.
We're getting kind of low down in this list but this is still quite respectable, right,
so via some experiments that you run just in a free-form way,
you figure out that some hyperparameters never change performance and others really
change it a lot and in turn you just fix
the ones that seem not to matter and explore the rest.
And this would be a case where absolutely you would want to justify this in the paper.
You'd want to say that for our problem, uh,
the dimensionality of the hidden layer didn't matter in the range 50 to 300
so we set it at 300 and explored all the regularization parameters,
and then, you know, your reviewer could say,
"in my experience, the hidden dimensionality really matters."
But now it's kind of on them to have that evidence, right?
And you, you sort of said heuristically that you didn't see it.
So I don't know.
It seems like you built a pretty good argument in that case.
You could find the optimal hyperparameters via
a single split and then use them for all subsequent splits,
and this would be justified on the idea that the splits are similar.
So that would be a case of saying like I'm going to run one experiment really
carefully and assume that it holds for all the other settings that I need to explore.
That could be pretty reasonable with
a large dataset that you know is kind of homogenous.
This is probably perfectly fine.
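Here's what the single-split strategy might look like in code (a sketch of my own, with scikit-learn and toy data): tune carefully on one split, then carry the winning setting over to the remaining splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

# Tune carefully on a single split...
X_tune, X_rest, y_tune, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X_tune, y_tune)
best_C = search.best_params_["C"]

# ...then reuse the winning setting for all subsequent splits, on the
# assumption that the splits are similar (reasonable for large,
# fairly homogeneous datasets).
scores = cross_val_score(LogisticRegression(C=best_C, max_iter=1000),
                         X_rest, y_rest, cv=5)
```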
And then all the way at the bottom here,
but something that you still do see,
is just adopting choices that other people have made.
So you know like I'm going to train new word vectors on
a bajillion words from the web and I don't know how this model is going to behave.
I can afford to run one experiment,
so I'm just going to pick the parameters that were in
the originally published paper and use them,
uh, even though the data is different and my goals are different and so forth.
Um, you're probably going to get some complaints that the findings don't
translate, that what was optimal for that paper isn't necessarily optimal for yours.
But in that setting, where you're trying to do
web-scale training of word representations,
this might be all you can afford and again,
I think we can't just arbitrarily expect that people will run
multiple runs across the whole web in order to tune these parameters.
So as a last resort, you adopt other people's parameters,
and there's lots of papers
that do this and I think it works out okay. Yeah.
[BACKGROUND] Can you just use the best results on some dataset?
Is it necessary to justify all this?
I mean, you know, let's just say you pulled some numbers out of
your hat and they work the best,
do you still have to say you did all of this stuff, or is that almost
self-sufficient evidence that you're beating the benchmark?
Where the- like the paper is really one of these pure papers that just says my model is
the best model for this dataset and my evidence is that
I found parameter settings that beat all the published numbers.
It might be a persuasive argument.
Yeah, I mean I personally would hope that you're trying to do more with this paper and
if you were trying to show us something more about this model than just that you won,
then this would all kick in.
Because I'd want to understand like what's the space of performance numbers and so forth.
Some of this I'm gonna address in a little bit.
But yeah, I want to allow that.
That might be all that's necessary.
I'm sorry. Why does sort of robustness
to different hyperparameters, or at least trying a lot of them,
give you more insight into how the model's working?
I mean ultimately if you have a good hypothesis and the results are promising,
why is it necessary to,
to really get that last little bit of juice out of your model by hyperparameter selection?
Oh, I see. It might not be juice though.
Like I- first I want to just grant that your narrative might be perfectly fine,
which is just that, you know,
I found a setting where my model is the best
and that's evidence that it's a really good model for the task.
The hyperparameter exploration might be something that you explore more openly to say,
for example, how small could my network get,
uh, and still do well.
How- what are my performance drop-offs,
or how much does it matter how it was regularized,
or like how much does it matter even how it was
initialized with GloVe or with Word2Vec or with ELMo?
And what I- I was just injecting
my values there that I find all of that really illuminating.
Because after all, I don't really care that you won.
I mean congratulations in this case that you won.
But what I really want to know is can I use your model for other tasks?
How sensitive is it to various settings?
Um, can I form an expectation about how it will
work in an entirely new domain? That kind of thing.
Here are, um, some hyperparameter optimization tools.
So scikit-learn has grid search and randomized search.
Good standbys here,
and randomized search will be the one where you can fix your budget.
And then scikit-optimize is a package for doing more guided search through these spaces.
And it's kind of nice because it plays well with the existing scikit models.
And it might offer you
some additional gains when you do this kind of thing on a fixed budget.
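As a small sketch of the fixed-budget idea (my own example, using scikit-learn's `RandomizedSearchCV`): `n_iter` caps the number of settings sampled, no matter how large the underlying space is.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n_iter fixes the budget: 10 samples from a continuous distribution
# over C, regardless of how large the space is.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=10, cv=3, random_state=0)
search.fit(X, y)
```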
Let's do one more thing here.
Classifier comparison.
Because we've seen this before.
This is a nice way to round out this first lecture.
So suppose you've assessed two classifier models.
Well, we'll keep it to two classifiers because I think a lot of you are doing
classifiers; the same considerations extend to other kinds of models,
and we can talk about how that would work.
But suppose it's classifiers,
they probably differ in some way.
Um, what could we do to establish that they're different in some meaningful sense?
And again, if you think about publishing reviewers will ask this,
how much can I trust that differences I'm seeing are really and truly differences?
So first of all you might think practical differences.
This is the ultimate, right?
Um, say that your test set has a million examples,
then 1% will correspond to 10,000 cases,
and that seems like it's sure to matter, right?
If this is lives saved or something or
disastrous consequences averted then
10,000 you shouldn't need to say much more than that.
You're making a meaningful impact on the world, right?
Uh, so there's that one case.
In the other case, if the test set has 1000 examples,
then a 1% difference in accuracy or F1 will be like 10 examples.
And your readers might not know whether that matters at all.
And then hovering in the background here is this other question,
how much variation is there across runs?
That is in the case where it was 10 examples for your reported number,
if I make a small change does it go in the other direction, right?
It seems like it's just on the edge.
And in principle that could be true even of these models here because maybe
with some runs you save 10,000 lives but in others you lose 5,000.
We don't know, right?
Because we don't know the amount of variation there is between runs for your system.
And so all those things mean that we probably want to go
beyond just practical differences.
So just quickly a few ways you could do this.
First, you could report confidence intervals around the means that you report.
Uh, I've given here a kind of traditional way to do this that you could use.
My sense from NLP systems is that this traditional way here, um,
is pretty conservative in the sense that your system might actually
show less variation than is being assumed by this test.
And so you might see confidence intervals that show values that are
way outside anything you've ever seen in running experiments.
If that happens you could think about bootstrapping the confidence intervals,
and I link to a library for doing that.
And that would just make much more use of
your actual data when it comes to deciding what the interval of variation is.
But this is a really nice way to do this.
You report the mean and the confidence interval,
and what your reviewers are likely to do is say,
"Okay those two intervals don't overlap.
I'm pretty confident that these systems are different."
Or if there's a lot of overlap you might say, "Um, you know,
numerically they're different but in practical terms these might be the same system."
So this is a powerful step,
but it presupposes that you can do a lot of runs.
Because you want to do like 10, and, you know,
better would be like 20 different runs to get a bunch of values for the mean in
slightly different settings so that you can
get a good picture of what the confidence interval is like.
And you might just be limited in your ability to do that.
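The traditional version might look like this (a sketch of my own, with made-up scores standing in for 10 repeated runs): a t-based confidence interval around the mean score.

```python
import numpy as np
from scipy import stats

# Scores from repeated runs with different random seeds (made-up numbers;
# 10-20 runs is the kind of range discussed above).
scores = np.array([0.81, 0.83, 0.80, 0.84, 0.82, 0.79, 0.83, 0.82, 0.81, 0.84])

def t_confidence_interval(vals, confidence=0.95):
    """Traditional t-based confidence interval around the mean of repeated-run scores."""
    mean = np.mean(vals)
    sem = stats.sem(vals)  # standard error of the mean
    half = sem * stats.t.ppf((1 + confidence) / 2, df=len(vals) - 1)
    return mean - half, mean + half

lo, hi = t_confidence_interval(scores)
# Report as, e.g., "mean 0.819 (95% CI [lo, hi])"; non-overlapping intervals
# between two systems are what reviewers will look for.
```

If the resulting interval looks far wider than anything you see across runs, that's the signal to switch to a bootstrapped interval instead.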
Uh, related kind of thing that you could do then- and again I'm just kind of drawing on
best practices as I've learned them from
the literature is to run the Wilcoxon Signed-Rank Test,
which is a kind of variant of a T-test,
um, that doesn't make any assumptions about the underlying statistical distribution.
So it does not presuppose in particular that your scores are normally distributed.
Um, it requires a lot of different runs.
So you wanna do at least 10,
ideally 20 or something like that.
Um, I think that you want to pick
the Wilcoxon Signed-Rank Test over this very closely related one,
the rank sums test which is often called the Mann-Whitney U Test.
I think the Wilcoxon one will be just slightly more conservative in the sense that it's
more appropriate in situations in which
your systems are trained and evaluated on the same underlying data.
And given the best practices as I laid out before,
you want them to be trained and test on the same data for
these comparisons but that just means that you
violated a lot of independence assumptions.
So pick the Wilcoxon Signed-Rank Test, it's a rank based test.
So it- all it's caring about is
the relative ordering of the scores that your system reports.
I think this is a good step to take.
And I think it will go a long way for people who are evaluating your work.
But a few qualifications are in order.
First, like all tests of this form it says nothing
about the practical importance of any of the differences that you see.
So a small p-value does not mean a large effect size.
Uh, and a large p-value doesn't mean that you have a failed result.
That just means that you lack evidence for claiming that there's a difference.
So these qualifications are to be kept in mind.
This is kind of very weak evidence for some kind of detectable difference.
But I would ideally balance it against something
establishing that your systems are different in practical terms.
Um, but I grant you that this could help you in sort of advocating for a system,
especially in situations in which the numerical differences appear to be small.
Because here you are saying,
they're small but I have some evidence that you can count on them.
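Running the signed-rank test is a one-liner with scipy (my own sketch, with made-up paired scores standing in for 10 runs of each system on the same splits):

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired scores from repeated runs of two systems on the same splits
# (made-up numbers; in practice you'd want 10-20 runs).
system_a = np.array([0.81, 0.83, 0.80, 0.84, 0.82, 0.79, 0.83, 0.82, 0.81, 0.84])
system_b = np.array([0.79, 0.80, 0.78, 0.82, 0.80, 0.78, 0.81, 0.80, 0.79, 0.81])

# The signed-rank test only uses the relative ordering of the paired
# differences, so it makes no normality assumption about the scores.
stat, p = wilcoxon(system_a, system_b)
```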
And then just a final test so we can wrap this up here.
McNemar's test is a test that you can run to compare
two classifiers that depends on only one run of the data.
Um, because it operates on this kind of collapsed,
um, confusion matrix for your two systems.
Um, and it's a kind of variant of the chi-squared test.
I think this can be pretty volatile for
systems that report different values under different conditions.
So I would resort to this just in situations in which you absolutely cannot afford to
do the number of runs that it would take to run a responsible Wilcoxon test.
In this case, this is a kind of fall back.
I think again, it's better than nothing for your reviewers.
And especially if you can pair it with
practical differences, you've probably got a pretty good case for the models being different.
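A hand-rolled sketch of McNemar's test (my own illustration; the chi-squared version with continuity correction, operating only on the cases where the two classifiers disagree):

```python
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test from a single run: compare the two disagreement
    cells of the collapsed confusion matrix for the two systems."""
    b = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b)
            if pa == t and pb != t)   # A right, B wrong
    c = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b)
            if pa != t and pb == t)   # A wrong, B right
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-squared with continuity correction
    return stat, chi2.sf(stat, df=1)

# Toy labels and predictions (made up) for illustration:
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
pred_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
stat, p = mcnemar_test(y_true, pred_a, pred_b)
```

Note that only one pass over the test set is needed, which is exactly why it's the fallback when repeated runs are too expensive.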
There are a bunch of other big issues that I want us to
confront that are kind of particular to the deep learning era.
Um, but since we're out of time I'm going to save those for next time.
So next time it'll be these juicy issues that we're all facing today.
A discussion of metrics.
And then, this is going to take us beyond
all this traditional stuff that I've been talking about.
But thanks everyone for your questions.
