Okay. Happy Halloween.
Um, what I want to do today is share with you advice for applying machine learning.
And, and you've heard me allude to this before,
but, um, uh, yeah,
I think over the last several weeks,
you've learned a lot about the mechanics of how to build different learning algorithms.
Everything from linear regression,
logistic regression, SVMs, uh,
uh, random forests, uh, neural networks.
And what I want to do today is share with you some
principles for helping you become efficient at how
you apply all of these things to solve
whatever application problem you might want to work on.
Um, and so, uh,
a lot of today's material is actually not that mathematical.
It's also some of the hardest material
in this class to understand.
Um, it turns out that when you give advice
on how to apply a learning algorithm such as, you know,
"Don't waste lots of time collecting data unless you,
you, you have confidence it's useful to actually spend all that time."
It turns out when I say things like that,
people, you know, easily agree.
They say, "Of course, you shouldn't waste time collecting lots of data
unless you have some confidence it's actually a good use of your time."
That's a very easy thing to agree with.
Um, but the hard thing is when you go
home today and you're actually working on your class project,
right, uh, uh, to apply the principles we talked about today.
When you're actually on the ground talking to a teammate saying,
"All right, do we collect more data for our class project now or not? "
To make the right judgment call for that,
to map the concepts you learn today
to when you're actually in the hot seat,
you know, making a decision: are you going to go
spend another two days scraping data off the Internet,
or are you gonna tune this algorithm,
tune these parameters of the algorithm? Actually making those decisions, um,
uh, it often takes a lot of,
um, careful thinking to make the mapping.
From the principles we're talking about today,
and then probably all of you go,
"Yep, that makes sense."
But to actually do that when you're in the hot seat making the decisions.
That, that, that's something that, um,
will often take, take some careful thought, I guess.
Um, and I think, uh,
uh, you know, for a long time, um,
the concepts of machine learning have been an art, right, where,
you know, we'll, we'll go to these people that have been doing it for 30 years.
And you say, "Hey, my learning algorithm doesn't work," you know,
uh, uh, uh, what do we do now?
And then they will have some judgment.
And people ask me, and for some reason, because we've
done it for a long time, we will say, "Oh yeah,
get more data, or tune that parameter, or try a neural network with more hidden units," and for some reason that'll work.
And what I hope to do today is,
uh, turn that black magic, that
art, into something much more refined,
so that you can much more systematically make these decisions yourself, rather than,
uh, talk to someone, um,
who's done this for 30 years and, for
some reason, is able to give you good recommendations.
So, you know, to turn this from, um,
more of a black art into more of a systematic engineering discipline.
Um, and, and just, uh, uh, one note.
Uh, some of the work we are gonna do today is not the best approach for,
uh, developing novel machine learning research,
or if you're- if your main goal is to write research papers, uh,
some of what I'll say will apply,
some of what I'll say will not apply,
but I'll come back to that later.
So most of today's focused on how to help you build stuff that works,
right, to build, build applications that work.
Um, so the three key ideas, um,
you see today are first is, uh,
diagnostics for debugging learning algorithms.
Um, one thing you might not know,
or actually if you're working on the class project maybe you know this already is that,
uh, when you implement a learning algorithm for the first time,
it almost never works, right?
At least not the first time.
Uh, uh, uh, and so,
um, I still remember there was a weekend,
um, uh, about a year ago where I implemented Softmax regression on my laptop,
and it worked the first time.
And even to this day, I, I still remember that feeling of surprise,
like, no, there's got to be a bug and I
went in to try to find the bug, and there wasn't a bug.
But, but it's so rare.
[LAUGHTER] So the learning algorithm worked the first time.
I still remember it over a year later.
Uh, and so a lot of the workflow of developing learning algorithms,
it, it actually feels like a debugging workflow, right?
Um, and so I want to help you become systematic at that.
Um, and, uh, uh,
two key ideas here are about error analysis and ablative analysis.
So how to analyze the errors in your learning algorithm.
And also how to understand what's not working,
which is error analysis, and how to understand what's working,
which is ablative analysis.
And then finishing with some philosophies on how to
get started on a machine learning project,
such as your class project, okay?
So let's start with, uh,
discussing debugging learning algorithms.
Um, so what happens all the time is you have an idea for a machine learning application.
You implement something, uh,
and then it won't work as well as you hoped.
And the key question is,
what do you do next, right?
When I work on a machine learning algorithm, that's actually most of my workflow.
We usually have something implemented.
It's just not working that well.
And your ability to decide what to do next has a huge impact on,
on, on your efficiency.
Um, uh, I, I think, uh, when, when,
um, when I was, uh,
when I was an undergrad, uh,
at Carnegie Mellon University,
I had a friend, um, that would, uh,
debug their code by,
um, you know, they write a piece of code.
And then as always, when you write a piece of code,
initially there's always a bunch of syntax errors, right?
And so their debugging strategy was to delete
every single line of code that generated
a syntax error because this is a good way to get rid of errors.
So that wasn't a good strategy.
So in, in, in machine learning as well,
there are good and less good debugging strategies, right?
Um, so let's start with a motivating example.
Uh, let's say we're building an anti-spam classifier.
And, um, let's say you've carefully
chosen a small set of a hundred words to use as features.
So instead of using, you know,
10,000 or 50,000 words,
you've chosen a hundred words that you think could be most relevant to, um, anti-spam.
And let's say you start off implementing logistic regression with regularization.
Uh, I think when we talk about this,
this is also, you know,
there's a frequentist and a Bayesian school,
but you can think of this as Bayesian logistic regression where, uh,
you have the maximum likelihood term on the left,
and then that second term is the regularization term, right?
Um, so that's Bayesian logistic regression
if you're Bayesian, or, uh,
logistic regression with regularization if you are,
uh, you know, using frequentist statistics.
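Written out, the objective just described would look roughly like the following. The notation (m training examples, inputs x, labels y, parameters theta, regularization weight lambda) is an assumption about what is on the slide, following the usual conventions of the course:

```latex
J(\theta) \;=\; \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) \;-\; \lambda \lVert \theta \rVert^{2}
```

The first term is the log likelihood (the maximum likelihood term) and the second is the regularization term; the algorithm maximizes J over theta.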
And let's say that, um,
logistic regression with regularization or Bayesian logistic regression,
it gets 20% test error which is unacceptably high, right?
Making one in five mistakes on, on your spam filter.
Um, and so what do you do next?
Um, now, for this scenario, I, I wanna,
uh, and, and so, um, for, uh,
when you implement an algorithm like this, uh,
what many teams will do is,
um, try improving the algorithm in different ways.
So what many teams would do is say, "Oh yeah,
I remember, you know, well,
we like big data, more data always helps."
So let's get some more data and hope that solves the problem.
So one or some teams would say,
"Let's get more training examples."
And, and, and it's actually true, you know,
more data pretty much never hurts.
It almost always helps,
but the key question is how much.
Um, or you could try using a smaller set of features.
With a hundred features probably some weren't that relevant.
So let's get rid of some features.
Um, or you could try having a larger set of features;
a hundred features is too small, right?
So let's add more features.
Um, uh, or you might want other designs of the features, you know,
instead of, uh, uh, just using features in an e-mail body,
uh, you can use features from the e-mail header.
Uh, the e-mail header has,
um, uh, not just the From, To, and Subject fields,
but also routing information about the set of
servers on the Internet that the e-mail went through to get to you.
Um, uh, or you could try running gradient descent for more iterations.
That, you know, usually never hurts, right?
Uh, or from gradient descent, let's switch to Newton's method.
Uh, or let's try a different value for Lambda.
Um, or we say, you know,
forget about Bayesian logistic regression, or
logistic regression with regularization.
Let's, let's use a totally different algorithm,
like an SVM or neural networks or something, right?
So what happens in a lot of teams is, um,
uh, someone will pick one of these ideas, kind of at random.
Um, it depends on, you know,
what they happen to read the night before, right, about something.
Uh, or, or their experience on the last project.
And sometimes you or the project leader will
just pick one of these
and say, "Let's try that."
And then spend, spend a few days or few weeks trying that,
and it may or may not be the best thing to do.
So, um, uh, I think that in
my team's machine learning workflow,
so first, if you and a few others actually
sit down and brainstorm a list of the things you could try, you
are already ahead of a lot of teams, because a lot of teams
will just go by gut feeling, right?
Um, uh, or the most opinionated person will pick one of these things at random and do that,
but you brainstorm a list of things and then,
and then try to evaluate the different options.
You're already ahead of many teams.
Um, oh, sorry, and I think,
uh, uh, yeah and I think, right, you know,
unless you analyze these different options, um, uh,
uh, it's hard to know which of these is actually the best option.
So, um, the most common diagnostic I end up using in developing learning algorithms,
is a, um, bias versus variance diagnostic, right?
And I think I, um,
talked about bias and variance already: if a classifier is
highly biased, then it tends to underfit the data.
So high bias is, well, actually.
You guys remember this, right?
If, um, If you have a dataset just like this,
a highly biased classifier may be much too simple,
and high variance classifiers may be much too complex,
and something in between,
you know, will trade off bias and variance in an appropriate way, right?
So that's bias and variance.
Um, and so, uh,
it turns out that one of the most common diagnostics I end up using in
pretty much every single machine learning project is a bias versus variance diagnostic.
So understand how much of
your learning algorithm's problem comes from bias and how much of it comes from variance.
Um, and, uh, uh, and,
you know, I- I've had,
I don't know, like former PhD students, right,
that- that learned about bias and variance when they're doing their PhD and
then sometimes even a couple of years after they've graduated from Stanford and worked,
you know, on more practical problems.
They actually tell me that, that,
that their understanding of bias and variance continued to deepen,
right, for, for, for many years.
So this is one of those concepts where, um,
if you can systematically apply it, you'll be much more efficient, and this is really the,
maybe the single most useful tool I've found,
understanding bias and variance, for debugging learning algorithms.
Um, and so what I'm gonna describe,
is a workflow where you would run some diagnostics to figure out what is the problem,
uh, and then try to fix what the problem is.
And so, um, just to summarize this example.
Um, uh, this logistic regression error is unacceptably
high, and you suspect problems due to high variance or high bias.
And so, um, it turns out that there's a diagnostic that lets
you look at your algorithm's performance and try to figure out if,
um, how much of the problem is variance and how much of the problem is bias.
Oh, and I'm going to say test error,
but if you are developing,
you should really be doing this with a dev set or development set rather than a test set, right?
But so let me, let me explain this, um,
uh, diagnostic in greater detail.
Uh, so it turns out that, um,
if you have a classifier with very high variance,
then the performance on the test set,
or actually it would be
better practice to use the hold-out cross-validation set, so the development set,
you'll see that your classifier has, um,
uh, much lower error on the training set than on the development set.
But in contrast, if you have high bias,
then the training error and the test set error, or the dev set error, will both be high.
So let me sh- let me illustrate this with a picture.
Um, so this is a learning curve and what that means is, um,
on the horizontal axis,
you are going to vary the number of training examples, right?
Uh, and when I talk about bias and variance,
I had a plot where the horizontal axis was the degree of polynomial, right?
You fit a first order, second order,
third order, fourth order polynomial.
In this plot, the horizontal axis is different, it's the number of training examples.
And so it turns out that,
um, whenever you train a learning algorithm, you know,
the more data you have usually,
the better your development set error,
the better your test set error, right?
This error usually goes down,
when you increase the number of training examples.
The other thing, and,
and let's say that you're hoping to achieve a certain level of desired performance,
you know, for business reasons,
you'd like your spam classifier to achieve
a certain level of desired performance, and sometimes the
desired level of performance is, um,
to do about as well as a human can.
That's a common business objective depending on your application,
uh, but sometimes it can be different, right.
So your product manager, you know,
tells you, or you,
if you're leading the project,
think that you need to hit a certain level of
target performance in order for it to be a very useful spam filter.
So the other plot, uh, uh,
to add to this which will help you analyze bias
versus variance is to plot the training error.
Um, now one thing to know about training error is that it increases, um, uh,
as the training set size increases. Because
what if you have only one example, right?
Let's say you're building a spam classifier and you have only one training example,
then any algorithm, you know,
can fit one training example perfectly.
And so if your training set size is very small,
the training set error is usually 0, right?
If you have like 5 or 10 examples,
you can probably fit all of those examples perfectly. And it's only if you have a bigger training set
that it becomes harder for the learning algorithm to
fit your training data that well, right?
Or in the linear regression case,
if you have one example,
yeah, you can fit a straight line to the data;
if you have two examples, you can fit any model,
pretty much, to the data
and have zero training error.
It's only with a very, very large training set that a classifier like
logistic regression or linear regression may have
a harder time fitting all of your training examples.
So that's why training error or average training error,
average over your training set, uh,
generally increases, um, as you increase the training set size.
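As a rough illustration of the learning curve described here, the sketch below trains on growing prefixes of a training set and records training and dev error for each size. Everything in it is an assumption for illustration: the synthetic data stands in for a real spam dataset, and `fit_logreg`, `error_rate`, `learning_curve`, and the hyperparameter values are hypothetical names and settings, not something from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam=0.01, lr=0.1, iters=500):
    """Regularized logistic regression via batch gradient ascent on the
    average log likelihood minus lam * ||theta||^2 (illustrative settings)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ theta)) / len(y) - 2.0 * lam * theta
        theta += lr * grad
    return theta

def error_rate(theta, X, y):
    """Fraction of examples whose predicted label differs from y."""
    pred = (X @ theta >= 0).astype(int)
    return float(np.mean(pred != y))

def learning_curve(X_train, y_train, X_dev, y_dev, sizes):
    """Return (m, training error, dev error) for each training set size m."""
    rows = []
    for m in sizes:
        Xm, ym = X_train[:m], y_train[:m]
        theta = fit_logreg(Xm, ym)
        rows.append((m, error_rate(theta, Xm, ym), error_rate(theta, X_dev, y_dev)))
    return rows

# Synthetic stand-in for a spam dataset: 20 features, noisy 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 20))
true_theta = rng.normal(size=20)
y = (rng.random(1200) < sigmoid(X @ true_theta)).astype(int)

curve = learning_curve(X[:1000], y[:1000], X[1000:], y[1000:],
                       sizes=[10, 50, 200, 1000])
for m, train_err, dev_err in curve:
    print(f"m={m:5d}  train error={train_err:.3f}  dev error={dev_err:.3f}")
```

Plotting the two error columns against m gives the two curves discussed here; the error is measured on the same prefix the model was trained on, so it is the average training error for that training set size.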
So, um, now there are two characteristics of this plot
that suggest that, um,
if you plot the learning curves and you see this pattern,
this suggests that, um,
the algorithm has a large variance problem, right?
And of the two properties written at the bottom,
one, the weaker signal,
the one that's harder to rely on,
is that, um, the development set error,
or the test set error is still decreasing,
as you increase the training set size.
So the green curve is still,
you know, still looks like it's going down,
and so this suggests that if you increase
the training set size and extrapolate further to the right,
that the curve would keep on going down.
Um, this turns out to be a weaker signal because sometimes we look at a curve like that,
it's actually quite hard to tell,
you know, to extrapolate to the right.
Uh, uh, if you double the training set size,
how much further would the green curve go down?
It's actually kind of hard to tell.
So I find this a useful signal,
but sometimes it's a bit hard to judge, you know,
exactly where the curve will go if you extrapolate to the right.
Um, the stronger signal is actually the second one,
the fact that there's a huge gap between your training error and your test set error,
or between your training error and your dev set error, which would be the better thing to look at.
That's actually a stronger signal that,
um, this particular learning algorithm has,
um, high variance, right,
um, uh, because, uh,
as you increase the training set size,
you find that the gap between, um,
training and test error usually closes, usually reduces.
And so there's still a lot of room, for, um, uh,
making your test set error become closer to your training error.
And so if you see a learning curve like this, this is a strong sign that,
um, you have a variance problem, okay?
Now let's look at what the curve- what the learning curve will look like,
um, if you have a bias problem.
Um, so this is a typical learning curve for high bias which is, uh,
that's your dev set error or your development set cross-validation error, uh, test error,
and you're hoping to hit a level of performance like that,
and your training error looks like that.
And, um, so one sign that you have
a high bias problem is that
this algorithm is not even doing that well on the training set, right?
Even on the training set, you know,
you're not achieving your desired level of performance,
and it's like, imagine, you know,
you're looking at the learning algorithm and you say,
this algorithm has seen these examples, and
even for examples it's seen, it's not doing as well as you were hoping.
So clearly the algorithm's not fitting the data well enough.
So this is a sign that you have a high bias problem,
not enough features, your learning algorithm is too simple.
And the other signal is that, um,
uh, there is a very small gap between the training and, uh, the test error, right?
And you can imagine when you see a plot like this,
no matter how much more data you get, right,
go ahead and extrapolate to the right,
as far as you want, you know.
No matter how much more data you get,
um, no matter how far you extrapolate to the right of this plot,
the blue curve, the training error,
is never going to come back down,
to hit the desired level of performance.
Uh, and because the test set error is you
know generally higher than your training set error,
no matter how much more data you have,
no matter how far you extrapolate to the right,
the error is never going to come down to,
to your desired level of performance.
So if you get a, um,
training error and test error curve that looks like this,
you kind of know that, you know,
while getting more training data may help, right?
The green curve could come down, like a little bit.
If you get more training data, uh,
the act of getting more training data by
itself will never get you to where you want to go.
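The two signals for each case can be turned into a crude rule of thumb. The sketch below is only that: `diagnose` is a hypothetical helper, and the gap threshold `gap_tol` is an arbitrary value chosen for illustration, not a number from the lecture.

```python
def diagnose(train_err, dev_err, target_err, gap_tol=0.05):
    """Heuristic reading of one point on the learning curve.

    - Training error above the desired level: the model cannot even fit
      data it has seen, which points to high bias.
    - Dev error far above training error: points to high variance.
    gap_tol is an illustrative threshold, not a standard value.
    """
    findings = []
    if train_err > target_err:
        findings.append("high bias")
    if dev_err - train_err > gap_tol:
        findings.append("high variance")
    return findings or ["no obvious bias/variance problem"]

# Training error well above target, small gap: more data alone will not help.
print(diagnose(train_err=0.18, dev_err=0.20, target_err=0.05))
# Low training error, large gap: more data is worth trying.
print(diagnose(train_err=0.01, dev_err=0.20, target_err=0.05))
```

A real project would look at the whole curve rather than a single point, but the same two comparisons (training error against the target, dev error against training error) drive the decision.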
Okay? Um, so let's work through this example.
So for each of the four bullets here, um,
each of the four- first four ideas fixes
either a high variance or a high bias problem, right?
So let's, let's go through them and, and ask, uh,
for the first one,
do you think it, do you think it helps you fix high bias or high variance?
[BACKGROUND]
High variance, right?
Okay. Right. Cool. All right, high variance, right?
A- anyone want to say, say- well, great.
Anyone want to say why? Yeah, okay.
[inaudible]
All right, cool, yes, uh, right. Yeah, right.
I guess if you're fitting a very high order polynomial that wiggles like this,
if you have more data,
then it won't oscillate
so crazily, even if you have a higher order polynomial.
Right. And, um, if you look at a
high variance curve, um,
this was- wow, there's a lot of latency, you know.
That's all for some reason.
Huh. Right, so this is a high variance plot.
Um, and, uh, uh,
and if you have a learning algorithm of high variance, you can,
hopefully, you know, if you extrapolate to the right,
there is some hope that the green curve will keep on coming down.
So, so getting more training data if you have high variance,
which is if you're in this situation,
looks like it could help you- help- it's,
it's worth trying, right?
I can't guarantee it'll work, but it's worth trying.
[inaudible] when you think about these functions,
like for certain algorithms [inaudible] uniformly distributed.
Oh, I see. Yes. Sorry. That's a good one. So let's see.
Um, the curves will look like this assuming that your training data is IID, right?
Um, the training and dev and test sets are all drawn from the same distribution.
Uh, uh, uh, there is learning theory that suggests that in most cases,
the green curve should decay as 1 over
the square root of m. That's the rate at which it should decay,
uh, until, until it reaches some Bayes error.
That's what the learning theory says.
Does that make sense? Um, and sometime- and,
and learning algorithms errors don't always go to 0, right?
Because sometimes, uh, uh,
there- sometimes, um, the data is just ambiguous.
I don't know, like, uh, I guess, you know,
my PhD students, including Annan,
we do a lot of work in healthcare.
And sometimes when you look at an x-ray,
it's just blurry, and you could try to make a diagnosis, right?
Or actually,
Annan is working on predicting patient mortality.
What's the chance of someone dying in the next year or so?
And sometimes you look at a patient's medical record,
and you just can't tell,
you know, will they pass away in the next year or so.
Or you're looking at an x-ray,
you just can't tell is there,
is there a tumor or not?
Because it's just blurry, and so a learning algorithm's error doesn't always decay to zero,
but the theory says that as,
as M increases, it will decay at roughly a rate of 1 over square root of M,
um, toward that baseline error,
which is, which is called Bayes error,
which is the best that you could possibly hope
anything could do given how blurry the images are,
given how noisy the data is, right?
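One way to write down the rate just described, as a rough sketch rather than a precise theorem: the constant c below is a problem-dependent quantity introduced here for illustration, and the symbols for dev error and Bayes error are assumed notation.

```latex
\varepsilon_{\text{dev}}(m) \;\approx\; \varepsilon_{\text{Bayes}} + \frac{c}{\sqrt{m}}
```

Under this form, cutting the gap above Bayes error in half takes roughly four times as much training data, which is one reason the question "how much will more data help?" matters before spending the time to collect it.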
All right. Um, sorry,
I gave the answer away. [LAUGHTER] Okay.
So uh, try a smaller set of features, uh,
that fixes a high variance problem.
Right? Uh, and one concrete example would be, um,
if you have this dataset and you're fitting a, you know,
10th order polynomial and the curve oscillates all over the place, that's high variance.
You can say, well,
maybe I don't need a 10th order polynomial,
maybe I should use, you know,
only- Wow, I don't know where my- I'm sorry. I don't know what's going on?
[NOISE] Okay. All right.
So maybe you say maybe I don't need my features to be
all of these things, 10th order polynomial,
maybe if this is too high variance,
I'm going to get rid of a lot of features and just use,
you know, a much smaller number of features.
Right? So that fixes,
um, uh, high variance.
Um, and then if you use a larger set of features [NOISE] [inaudible] , right?
Cool. So that's if you're fitting
a straight line to the data and it's not doing that well,
you can go, "Gee, maybe I should add a quadratic term," just add more features, right?
So that fixes high bias.
And adding e-mail header features.
[BACKGROUND] Cool.
Yeah. Generally, I would try this if- um,
ah, to try to reduce bias.
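The answers worked out for the first four ideas can be kept as a small lookup, so that once a diagnostic says which problem you have, you only consider fixes that address it. This is just a bookkeeping sketch; `FIX_TARGETS` and `fixes_for` are hypothetical names.

```python
# Which problem each of the first four candidate fixes is meant to address.
FIX_TARGETS = {
    "get more training examples": "high variance",
    "try a smaller set of features": "high variance",
    "try a larger set of features": "high bias",
    "add e-mail header features": "high bias",
}

def fixes_for(problem):
    """Return the candidate fixes that target the diagnosed problem."""
    return [fix for fix, target in FIX_TARGETS.items() if target == problem]

print(fixes_for("high bias"))
print(fixes_for("high variance"))
```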
And so in the workflow of, um,
how you develop a learning algorithm, ah,
I would recommend that, um, you,
ah- so, so one of the things about,
um, building learning algorithms, is that,
for a new application problem, uh,
it's difficult to know in advance, uh,
if you're gonna run into a high bias or high variance problem, right?
It, it is actually very difficult to know in
advance what's gonna go wrong with your learning algorithm.
And so the advice I tend to give is, uh,
if you're working on a new application,
uh, implement a quick and dirty learning algorithm.
Just have, like, a quick and dirty implementation of something.
So you can run your learning algorithm,
uh, just say- start with logistic regression, right?
Let's start with something simple.
Um, and then run this bias-variance type of analysis, uh,
to see, sort of,
what went wrong and then use that to decide what to do next.
Do you go to a more complex algorithm?
Do you try adding more data?
Um, the, the one exception to this is if you're
working on a domain in which you have a lot of experience, right?
Uh, and, and so for example,
you know, I've done a lot of work on speech recognition.
So because I've done that work,
I kinda have a sense of how much data is needed for the application,
then, then I might just build something more complicated from the get go.
Or, or if you're doing- or if you're working on, say,
face recognition and because you've read a lot of research papers,
you have a sense of how much data is needed.
Then maybe it's worth trying something because you're building on a body of knowledge.
Uh, but, but if you're working on something,
on a brand new application that you and maybe,
you know, no one in the published academic literature has worked on or,
or you don't totally
trust the published results to be representative of your problem,
then I will usually recommend that, um,
you implement a- build a quick and dirty implementation,
look at the bias and variance of the algorithm, uh,
and then use that to better decide what to try next.
Right? Um, so I think,
uh, bias and variance is, uh,
I think, is actua- is really like the single most powerful tool I know,
you know, for analyzing the performance of learning algorithms.
And I do this pretty much in every single machine learning application.
Um, there's one other pattern that I see quite often,
which, um,
addresses the second set of ideas, which is, um,
uh: is the optimization algorithm, ah, working?
So, so let me, let me explain this with,
um, a motivating example, right?
So, um, it turns out that when you implement a learning algorithm,
uh, you often have a few guesses for what's wrong.
And if you can systematically test if
that hypothesis is right before you spend a lot of work to try to fix it,
then you could be much more efficient.
So, uh, let's explain that with a concrete example.
So, so you understand those words I just said,
maybe they're a little bit abstract, which is,
um, let's say that, you know,
you tuned your logistic regression algorithm for a while.
And let's say logistic regression gets 2% error on
spam e-mail and a 2% error on non-spam, right?
And it's okay to have 2% error on spam e-mail, maybe, right?
You know, so you, you have to read a little bit of spam e-mail.
It's like, that's okay.
Uh, but 2% error on non-spam is just not really
acceptable because you're losing 1 in 50 important e-mails.
Um, and let's say that,
uh, you know, your teammate, right,
also trains an SVM, and they find
an SVM using a linear kernel gets 10% error on spam,
uh, but 0.01% error on non-spam.
All right. And maybe not great,
but for this- for purposes of illustration,
let's say this is acceptable.
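Since the comparison here is made separately for spam and non-spam, it can help to compute the error rate per class instead of one overall number. A minimal sketch, assuming labels are encoded as 1 for spam and 0 for non-spam (that encoding and the name `per_class_error` are assumptions for illustration):

```python
import numpy as np

def per_class_error(y_true, y_pred):
    """Error rate on spam (label 1) and on non-spam (label 0) separately."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    spam = y_true == 1
    return {
        "spam": float(np.mean(y_pred[spam] != 1)),       # spam that got through
        "non_spam": float(np.mean(y_pred[~spam] != 0)),  # good mail that was lost
    }

# Toy labels: 4 spam e-mails followed by 5 non-spam e-mails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
print(per_class_error(y_true, y_pred))
```

On the toy labels, one of four spam e-mails is missed and one of five non-spam e-mails is wrongly flagged, so the two rates differ even though both classes have a single mistake.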
Um, but it turns out logistic regression is more computationally efficient, and
it may be easier to update, right?
If you get more examples,
you run a few more iterations of gradient descent.
Uh, and so let's say you want to ship
a logistic regression implementation rather than an SVM implementation.
Um, so what do you do next?
It turns out that, um,
one common question you have when training your learning algorithm is,
you often wonder, uh,
is your, um, optimization algorithm converging?
Right? So you know,
it's gradient ascent; is it converging?
And so one thing you might do is, uh,
draw a plot of the training optimization objective,
J of Theta,
whatever you are maximizing, the log likelihood or whatever,
versus the number of iterations.
And, um, often the plot will look like that, right?
And, you know, the curve is,
kind of, going up,
but not that fast.
And if you train it twice as long or even 10 times as long, will that help?
Right? And again, training,
training the algorithm for more iterations,
it, you know, pretty much never hurts.
If, if you regularize the algorithm properly,
training the algorithm longer, you know,
almo- almost always helps, right?
Pretty much never hurts, uh,
but is it the right thing to do to go and burn another 48 hours of,
you know, CPU or GPU cycles to just train this thing longer, hoping it works better?
Right? Maybe. Maybe not.
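To make the "objective versus iterations" plot concrete, the sketch below runs gradient ascent on a regularized logistic regression objective and records J of Theta after every step. The synthetic data, step size, regularization weight, and function names are all assumptions for illustration.

```python
import numpy as np

def objective(theta, X, y, lam):
    """J(theta): log likelihood of 0/1 labels minus lam * ||theta||^2."""
    z = X @ theta
    # log p(y | x; theta) = y*z - log(1 + exp(z)), computed stably.
    return float(np.sum(y * z - np.logaddexp(0.0, z)) - lam * theta @ theta)

def gradient_ascent_trace(X, y, lam=1.0, lr=0.001, iters=300):
    """Run batch gradient ascent and return theta plus J(theta) per iteration."""
    theta = np.zeros(X.shape[1])
    trace = []
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta = theta + lr * (X.T @ (y - p) - 2.0 * lam * theta)
        trace.append(objective(theta, X, y, lam))
    return theta, trace

# Synthetic data standing in for a real training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w = rng.normal(size=5)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ w)))).astype(int)

theta, trace = gradient_ascent_trace(X, y)
recent_gain = trace[-1] - trace[-51]  # improvement over the last 50 steps
print(f"final J = {trace[-1]:.3f}, gain over last 50 iterations = {recent_gain:.5f}")
```

Plotting `trace` against the iteration index gives the curve described here. As the lecture notes, a curve that is still creeping upward does not by itself say whether training much longer is worth the compute.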
Um, so is there a,
is there a systematic way to tell- is there a better way, uh,
to tell if you should invest a lot more time,
um, in running the optimization algorithm?
Sometimes it's just hard to tell, right?
So, um, now, the other question that you sometimes wonder about, so,
um, a lot of this iteration of
debugging learning algorithms is looking at what your learning algorithm is
doing and just asking yourself, what are my guesses for what could be wrong?
Uh, and maybe one of your guesses is, well,
maybe you're optimizing the wrong cost function.
Right? So, so here is what I mean.
Um, what you care about is this, um,
weighted accuracy criterion, uh,
you know, where, uh,
you sum over your dev set or test set, you know,
weights on different examples times whether it gets each one right,
uh, where the weights are higher for non-spam than for spam,
because you really want to make sure you label non-spam e-mail correctly, right?
So maybe that's the weighted accuracy criterion you care about.
Uh, but for logistic regression,
uh, you are maximizing this cost function, right?
Log likelihood minus this regularization term.
So you're optimizing J of Theta,
when what you actually care about is A of Theta.
So maybe you're optimizing the wrong cost function.
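A small sketch of the weighted accuracy criterion, A of Theta, for a linear classifier. The specific weights (10 for non-spam, 1 for spam), the normalization by the total weight, the 1 = spam label encoding, and the name `weighted_accuracy` are all assumptions made for illustration; the lecture only says that non-spam examples get higher weight.

```python
import numpy as np

def weighted_accuracy(theta, X, y, w_nonspam=10.0, w_spam=1.0):
    """A(theta): weighted fraction of examples a linear classifier gets right.

    Labels: 1 = spam, 0 = non-spam. Non-spam examples carry a larger weight
    because mislabeling a real e-mail is the costlier mistake.
    """
    pred = (X @ theta >= 0).astype(int)
    w = np.where(y == 0, w_nonspam, w_spam)
    return float(np.sum(w * (pred == y)) / np.sum(w))

# Tiny one-feature example: predictions are the sign of theta * x.
X = np.array([[1.0], [1.0], [-1.0], [-1.0]])
y = np.array([1, 0, 0, 0])
a1 = weighted_accuracy(np.array([1.0]), X, y)
a2 = weighted_accuracy(np.array([2.0]), X, y)
print(a1, a2)
```

Scaling theta from 1.0 to 2.0 leaves every prediction unchanged, so A of Theta does not move at all. The criterion is piecewise constant in theta, which is why it is hard to optimize directly and a smooth surrogate such as J of Theta is used instead.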
And then one way to change the cost function
would be to fiddle with the parameter Lambda, right?
That's one way to change the definition of J of Theta.
Um, another way to change J of Theta is to
just totally change the cost function you are maximizing,
like change it to the SVM objective, right?
Or, or- and then part of that also means choosing the appropriate value for C. Okay?
And so, um, there is a second diagnostic which, um,
I end up using, which I hope can help you tell:
is the problem your optimization algorithm?
Uh, in other words is gradient ascent not converging?
Or is the problem that you're just optimizing the wrong function?
Right? And, and we'll see two examples of this thing.
So this is the first example.
Okay? Um, and so here's the diagnostic that can help you figure that out.
So just to summarize this scenario- this, um, this, uh,
example - this running example we're using,
um, the SVM outperforms logistic regression,
but you want to deploy logistic regression.
Uh, let's say that theta SVM is the parameters learned by the SVM.
And instead of writing the SVM parameters as w and b,
I'm just gonna write the linear SVM,
the SVM with a linear kernel,
you know, using the logistic regression parameterization.
Right? So if you have a linear set of parameters.
Um, and let's say that theta BLR will be the parameters learned by logistic regression.
Right? So I'll, I'll just- yeah,
regularized logistic regression or Bayesian logistic regression.
So you care about weighted accuracy and, uh, uh, um,
uh, and the, the SVM outperforms Bayesian logistic regression.
Okay? So this is one- a one-slide summary of where we are in this example.
So how can you tell if the problem is your optimization algorithm,
uh, meaning that you need to run gradient ascent longer to actually maximize J of Theta.
Um, or this- oh, sorry. And then- right.
And this is the- what BLR tries to maximize.
Right? So, so how do you tell, we have,
we've two possible hypotheses you wanna distinguish between.
One is that, um,
the learning algorithm is not actually finding
the value of Theta that maximizes J of Theta. All right?
For some reason gradient ascent is not converging.
So that would be a problem with the optimization algorithm.
That is,
for the problem to be with the optimization algorithm, it means that
if only we could have an algorithm that maximizes J of Theta, we would do great.
But for some reason gradient ascent isn't doing well.
That's one hypothesis.
The second hypothesis is that J of Theta is just the wrong function to be optimizing.
It is just a bad choice of cost function,
that J of Theta is too different from A of Theta,
so that maximizing J of Theta doesn't give you,
you know, a classifier that does well on A of Theta, which is what you actually care about.
Okay? Any que- so this is a problem setup.
Is there any, any que- I wanna make sure people understand this.
This is- raise, raise your hand if this makes sense.
Most people? Okay. Cool. Almost everyone, okay.
Good. Any questions about this problem setup?
Why don't you, why don't you [inaudible].
Oh. Uh, thank you. Why not maximize A of Theta directly?
Because A of Theta is non-differentiable.
So we don't actually have,
um, you know there's this indicator function.
So, uh,
it turns out maximizing A of Theta explicitly is NP-hard.
Uh, so we just don't have great algorithms to try and do that.
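To make the non-differentiability concrete, here is a minimal sketch. The form of the weighted accuracy used below (a sum of per-example weights times an indicator of a correct prediction) is an assumption based on the description in the lecture, and the names `weighted_accuracy`, `X`, `y`, and `w`, as well as the toy numbers, are hypothetical.

```python
import numpy as np

def weighted_accuracy(theta, X, y, w):
    """a(theta): sum of weights w[i] over the examples that theta classifies correctly."""
    predictions = (X @ theta >= 0).astype(int)  # indicator function: hard 0/1 output
    return float(np.sum(w * (predictions == y)))

# Tiny toy dataset (made up for illustration); the first column is an intercept term.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]])
y = np.array([1, 0, 1, 0])
w = np.array([1.0, 5.0, 1.0, 5.0])  # heavier weight on the y = 0 (non-spam) examples

theta = np.array([0.1, 1.0])
nudged = theta + 1e-6  # a tiny step in parameter space

# The indicator makes a(theta) piecewise constant: a tiny nudge leaves it unchanged,
# so the gradient is zero almost everywhere and gradient ascent has nothing to follow.
print(weighted_accuracy(theta, X, y, w), weighted_accuracy(nudged, X, y, w))
```

Both calls return the same value, which is why a smooth surrogate such as J of Theta gets optimized in place of A of Theta.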
Okay. So it turns out there's a diagnostic you could
use to distinguish between these two different problems.
Um, and here's the diagnostic.
Which is, check the cost function that logistic regression is trying to maximize.
So J. And compute that cost function on the parameters found by
the SVM and compute that cost function
on the parameters found by Bayesian logistic regression.
And just see which, which value is higher.
Okay? Um, so there are two cases.
Either, this is greater,
or this is less than or equal to.
Right? They're just two possible cases.
So what I'm gonna do is go over case one and case two,
corresponding to this greater than or this less than or equal to.
Uh, and let's, let's see what that implies.
So on the next slide, I'm gonna copy over this equation.
Right? That's, that's just a fact that the SVM does
better than Bayesian logistic regression on a problem.
So on the next slide I'm gonna copy over this first equation.
Um, and then we're gonna consider,
you know, these two cases separately.
So greater than will be case one and less than or equal to will be case two.
Okay? So let me copy over these two equations in the next slide.
Right? So that's the first equation that I just copied over here.
And that's- this is the greater than, this is case one.
Okay? So let's see how to interpret this.
Um, in case one,
J of theta SVM is greater than J of Theta BLR.
Right? Meaning that whatever the SVM was doing, um,
it found a value for Theta which we have written as, Theta SVM.
And theta SVM has a higher value on the cost function J than theta BLR.
But Bayesian logistic regression was trying to maximize J of theta.
Right? I mean Bayesian logistic regression is just using
gradient ascent to try to maximize J of theta.
And so under case one,
this shows that whatever the SVM was doing,
whatever your buddy implementing SVM did.
They managed to find a value for Theta
that actually achieves a higher value of J of Theta,
than your implementation of Bayesian logistic regression.
So this means that Theta BLR fails to maximize the cost function J.
And, uh, and the problem is with the optimization algorithm.
Okay? So this is case one.
Case two, um- again I'm just copying over the first equation.
Right? Because this is just part of our analysis.
This is part of the problem set up.
Uh, then case two is now the second line.
It's now a less than or equal sign.
Okay? So let's see how to interpret this.
Um, so if you look at the second equation, right,
the less than or equal to sign,
it looks like Bayesian logistic regression did a better job than the SVM,
um, maximizing J of Theta.
Right? So, you know,
you tell Bayesian logistic regression to maximize J of Theta.
And by golly, it found a value of Theta
that achieves a higher value of J of Theta than,
than whatever your buddy did using an SVM implementation.
So it actually did a good job
trying to find a value of Theta that drives up J of Theta as much as possible.
But if you look at these two equations in combination what we have is that,
um, the SVM does worse on the cost function J.
But it does better on the thing you actually care about.
A of Theta.
So what these two equations in combination tell you is that having the best value-
the highest value for J of Theta does not
correspond to having the best possible value for A of Theta.
So it tells you that maximizing J of
Theta doesn't mean you're doing a good job on A of Theta.
And therefore, maybe J of Theta is not such a good thing to be maximizing.
Because maximizing it, doesn't actually give you the result you ultimately care about.
So under case two, um,
you can be convinced that J of Theta is
just not the best function to be maximizing.
Because getting a high value of J of
theta doesn't get you a high value for what you actually care about.
And so the problem is with the objective function of the maximization problem.
And maybe we should just find a different function to maximize.
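The two cases can be sketched in a few lines of code. This is a minimal illustration, not the lecture's own implementation: it assumes labels in {0, 1}, assumes the regularization term is lambda times the squared norm of Theta, and uses hypothetical names (`J`, `diagnose`) and made-up toy data.

```python
import numpy as np

def J(theta, X, y, lam):
    """Regularized log likelihood that logistic regression tries to maximize."""
    z = X @ theta
    # log p(y | x; theta) for y in {0, 1}, written in a numerically stable form
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    return float(log_lik - lam * np.sum(theta ** 2))  # assumed L2 penalty

def diagnose(theta_svm, theta_blr, X, y, lam):
    """Assumes the SVM already beats logistic regression on weighted accuracy."""
    if J(theta_svm, X, y, lam) > J(theta_blr, X, y, lam):
        # Case 1: the SVM's parameters score higher on J than what BLR found.
        return "optimization algorithm"
    # Case 2: BLR maximized J at least as well, yet does worse on what you care about.
    return "optimization objective"

# Toy data for illustration; the first column is an intercept term.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]])
y = np.array([1, 0, 1, 0])
theta_blr = np.zeros(2)            # as if gradient ascent had barely moved
theta_svm = np.array([0.0, 1.0])   # parameters handed over from the SVM
print(diagnose(theta_svm, theta_blr, X, y, lam=0.1))
```

The comparison is only informative once you have already established that the SVM does better on A of Theta; on its own, a higher or lower J says nothing about which component is at fault.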
Okay? So, um,
any questions about this? Right, go ahead.
If you want to change the cost function in case two,
you saw it was the right one. [inaudible]
Yeah. Uh, let me come back to that.
Yeah. It's a g- a complicated answer.
Yeah. All right. Actually, let,
let- let's do this first.
Um, so, uh, all right.
For these four bullets,
does it fix the optimization algorithm or does it fix the optimization objective?
First one. Does it fix
the optimization algorithm or does it fix the optimization objective?
Cool. Second one.
Ah, I don't know what's wrong with this thing.
This is so strange. Okay. All right.
Does it fix the optimization algorithm or fix
the optimization objective? Optimization algorithm, right?
So Newton's method still looks at the same cost function J
of Theta but in some cases it just optimizes it much more efficiently.
Um, this is a funny one.
Usually, you fiddle with lambda, um, to,
uh, uh, trade off bias and variance things.
Right. That, that this is one way to change the optimization objective.
Although, uh, usually you change
lambda to adjust bias and variance rather than this.
Right? Uh, and then trying to use an SVM, right?
Would be one way to totally change the optimization objective.
Okay? So, uh, to,
to address the question just now.
When you find you have the wrong optimization objective,
there isn't always an obvious thing to do.
Right? Sometimes you have to,
uh, brainstorm a few ideas.
There isn't,
uh, um, always one obvious thing to try.
But at least it tells you that
that category of things, trying out different optimization objectives, is what you want.
Right? Um, all right.
So, um, let's go through a more complex example.
They're, they're, you know,
incorporate some of these- wow, I don't know what's wrong.
I sprayed my laptop.
I wonder if my- this is so strange.
Let me see what I can do. Yeah. All right.
Well. Okay. Let's go for a more complex example, uh, that, that,
that will illustrate some of these concepts, uh, that,
that we've been going through and,
and just let you see another example of these things. Um, uh, oh,
and- and I find that,
um, one- one thing I've learned as a teacher,
you know, one of the ways for you to become good at this, right?
Is to go, you know,
work in a good AI group for five years, right?
Because when you work in a good AI group for several years,
then you have seen, you know, 10 projects,
and that lets you gain that experience.
But it turns out that it takes,
I don't know, depending on what AI group you work on,
it- it takes- if you work on a different project every year,
then in five years that I guess you work on five projects or something.
I- I actually don't know. Or maybe 10 projects or something.
But, er, one of the reasons that,
um, in, uh, the way I try to explain this,
I try to go through specific scenarios with you is that, um, you know,
my Ph.D students and I, we spent, actually,
we spent like many years working on the Stanford Autonomous Helicopter,
but I'm trying to distill the key lessons down for
you so that you don't need to work on a project for, you know,
a few years to gain this experience, but to give you
some approximation to this knowledge in maybe 20 minutes, right?
The 20 minutes won't give you the depth of
three years of experience but we try to summarize
the key lessons so that we can learn
from experience that others took years to develop.
Um, all right. So, uh,
this helicopter actually sits in my office.
Uh, uh, uh but if you go to my office,
uh, uh, and, you know, grab this helicopter,
uh, uh, and- and- and we ask you to write a piece of code to make this fly by itself,
use the learning algorithm to make this fly by itself.
How do you go about doing so?
So it turns out a good way to, um,
make a helicopter fly by itself is to use,
uh, is to do the following.
Uh, step one is build a,
uh, computer simulator for a helicopter.
So, you know, that's actually a simulator, right?
Like a video game simulator of a helicopter.
Um, the advantage of using, you know,
say a video game simulator of a helicopter,
is you could try a lot of things,
crash a lot in simulation,
you know, which is cheap, whereas crashing a helicopter in real life is- is- is-
is slightly dangerous and- and- and also, uh, more expensive.
Um, uh, but so step one build a simulator of a helicopter.
Step two, uh, choose a cost function.
And for today, I'm just using a relatively simple cost function which is squared error.
So you want the helicopter to fly at the position x desired,
and your helicopter has,
you know, wandered off to some other place x.
So let's use a squared error to penalize it, right?
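As a rough sketch of what such a cost could look like in code: the lecture only specifies a squared error between the position x and x desired, so summing that error over the time steps of a trajectory is an assumption here, and `squared_error_cost` and the toy trajectory are hypothetical.

```python
import numpy as np

def squared_error_cost(positions, desired_position):
    """Sum over time steps of the squared distance between actual and desired position."""
    positions = np.asarray(positions, dtype=float)
    errors = positions - np.asarray(desired_position, dtype=float)
    return float(np.sum(errors ** 2))

# Hypothetical trajectory: three (x, y, z) positions while trying to hover at the origin.
trajectory = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
print(squared_error_cost(trajectory, [0.0, 0.0, 0.0]))  # 0 + 1 + 4 = 5.0
```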
Um, when we talk about reinforcement learning towards the end of this quarter,
we'll- we'll actually go through the same example again by using, uh,
the reinforcement learning terminology,
and understand this at a slightly deeper level.
And we'll go over this exact same example,
after you learn about reinforcement learning.
But we'll just go over a slightly simplified- very slightly simplified version today.
Um, and so, uh,
run a reinforcement learning algorithm and what the reinforcement learning algorithm does,
is it tries to minimize that cost function J of Theta.
Um, and so, uh, you know,
you learn some set of parameters Theta sub
RL for controlling the helicopter, right?
And we'll talk about reinforcement learning, you know,
you'll see all this redone with
proper reinforcement learning notation where J is a reward function,
Theta RL is the control policy and so on.
But don't worry about that for now.
Um, so let's say you do this,
and the resulting controller, right?
The way you fly the helicopter,
it gets much worse performance than a human pilot, you know,
so the helicopter wobbles all over the place
and doesn't quite stay where you are hoping it will.
So what do you do next, right?
Well, here are some options, uh,
corresponding to the three steps above.
You could work on improving your simulator.
Um, it turns out even today,
you know, we've had helicopters for what?
I don't know, like, uh, I think, uh,
we started having a lot of commercial helicopters around the 1950s.
So we have had helicopters for many decades now.
But airflow around the helicopter is very complicated.
And even today, there are actually some, uh, uh,
details of how air flows around the helicopter
that the aerodynamics textbooks, you know,
that even, um, AeroAstro people, right,
the experts in AeroAstro, cannot fully explain.
So helicopters are incredibly complicated.
And there's almost unlimited headroom,
uh, for building better and more accurate simulations of helicopters.
So maybe you wanna do that or maybe you think that cost function is messed up,
you know, maybe a squared error isn't the best metric, right?
Uh, and it turns out, um,
a helicopter has a tail rotor that blows air to one side, right?
So I guess, uh,
because the main rotor spins in one direction,
if it only had a main rotor,
then the body would spin in the opposite direction.
Er, an equal and opposite reaction in torque, right?
So the main rotor spins in one direction.
If it only had a main rotor,
the rotor on top,
and it just spun that, then the body of the helicopter would spin the opposite direction.
So that's why you need a tail rotor to blow air off to one side,
to not make it, um, uh,
uh, spin in the opposite direction.
Uh, but because of that, it turns out that when
the helicopter's staying in place, it's actually tilted slightly to a side.
Because a tail rotor blows air in one direction.
So it's pushing you off to one side,
so you have to tilt your helicopter in the opposite direction.
So- so the main rotor blows air to one side,
the tail rotor blows air to the other side.
So you actually stay in place, right?
So a helicopter is actually asymmetric.
Left and right are not the same.
So because of this complication,
maybe squared error isn't the best, um, uh,
uh, metric because, you know,
your optimal orientation is actually not zero, right?
Um, so- so- so maybe you should modify the cost function.
Um, or maybe you wanna modify the, um,
reinforcement learning algorithm because you secretly
suspect that your algorithm is not doing
a great job of minimizing that cost function, right?
That it's not actually finding the value of Theta that absolutely minimizes J of Theta.
So it turns out that, um, uh,
each one of these topics can easily be a PhD thesis, right?
You can definitely work for six years on any one of these topics.
Um, and the problem is, uh, uh,
you know, I actually know someone who wrote their PhD thesis on, right,
uh, improving a helicopter simulator, right?
Um, uh, but the problem is maybe a helicopter simulator is good enough.
You can spend six years improving
your helicopter simulator but will that actually get you the result?
And you can write- and you can write a PhD thesis,
and you get a PhD doing that maybe.
But if your goal is not
just to write a PhD thesis, it's actually to make a helicopter fly better.
It's actually not- not totally clear, right?
If- if that's the key thing for you to spend time on.
Um, so what I'd like to do is, uh,
describe to you a set of diagnostics that allows you to use this sort of
logical step-by-step reasoning to debug
which of these three things is what you should actually be spending time on, right?
Um, so is it possible for us to come up with a debugging process to logically reason, uh,
so as to select one of these things to work on and- and have conviction,
and then be relatively confident that this is a useful thing to work on, right?
Um, so here's how we're gonna do it.
Um, so just to summarize a scenario, right?
Um, the controller given by Theta RL performs poorly, right?
So, uh, this is how I would reason through a learning algorithm, right?
So suppose, uh, suppose all of these things were true,
um, suppose that- okay,
corresponding to the three steps in the previous slide,
suppose the helicopter simulator was accurate and suppose,
um, uh, you know, the learning algorithm,
uh, correctly, you know,
minimizes the cost function and suppose J of Theta is a good cost function, right?
If- if all of these things were true,
then the learned parameters should fly well on the actual helicopter, right?
Um, but it doesn't fly well on a helicopter,
so one of these three things is false.
And our job is to figure out,
is- is to identify at least one of these three statements: one,
two or three that's false because that- that- that lets you
sink your teeth into something that to- to- to work on, right?
Um, and I think, uh, uh, um,
to make an analogy to more conventional software debugging,
if you have a big complicated program,
and for some reason
your program crashes, like it core dumps or whatever, um,
if you can isolate this big complicated program into one component that crashes,
then you can focus your attention on that component that you
know crashes for some reason and try to find the bug there, right?
And so instead of trying to look over a huge code base,
if you could do binary search or try to isolate
the problem in a smaller part of your code base,
then you can focus your debugging efforts on that part of your code base,
try to figure why it crashes,
and then fix that first.
And after you fix that, it might still crash,
then there may be a second problem to work on but at least you know that, um,
trying to fix the first bug seems like, uh,
seems like a worthwhile thing to do, okay?
So what we're gonna do is, um, uh,
come up with a, oh, sorry, that's gradient descent, come up with a set of diagnostics
to isolate the problem to one of these three components, okay?
So the first step is,
uh, let's look at,
um, how well the algorithm flies in simulation, right?
So what I said just now was, uh,
you ran the algorithm and it resulted in
a set of parameters that doesn't do well on your actual helicopter.
So the first thing I will do is just check how well
does this thing even do in simulation, right?
And, uh, uh, there are two possible cases.
Um, if it flies well in simulation but doesn't do well in real life,
then it means something's wrong with the simulator, right?
It means it's actually worth working on the simulator because, you know,
if it's already working well in the simulator,
I mean, what else could you expect
the reinforcement learning algorithm to do, right?
You know, you told the reinforcement learning algorithm to go and fly
well in the simulator because it is just trained in simulation.
It's already doing well in the simulator,
so there's not much to improve on there, right?
At least, it's hard to improve on that.
Uh, but if
your learning algorithm does well in the simulator but not in real life,
then this means that the simulator,
um, isn't matching real life well.
And so that's strong evidence.
That's strong grounds for you to spend some time to improve your simulator. Yeah?
[inaudible].
Oh, yeah. Uh, right.
So to just repeat for
the camera, is it ever the case that it flies badly in the simulator but well in real life?
I wish that happened.
[LAUGHTER] You know, I actually, um, very rarely,
I- I think, uh,
if that happens I will,
I will still work on improving the simulator.
Um, uh, so there,
there is one scenario where that happens,
it turns out that, uh, uh,
when we train this helicopter in the simulator, or really any robot in the simulator,
we often add a lot of noise to the simulator, because one lesson we've learned is
that simulators are always wrong, right?
Any digital simulation is only an approximation of the real world.
So it turns out we add a lot of noise in all of our simulators,
because we think that if the learning algorithm is
robust to all this noise you've thrown at it in simulation,
then whatever noise the real world throws at it,
it has a bigger chance of being robust to as well.
Um, uh, and so we tend to throw a lot of noise into, into simulators.
And so one case where that does happen is when we find we threw too much noise
in, in simulation, and that might be a sign we should dial back the noise a bit.
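One simple way to picture adding noise to a simulator is sketched below. The lecture does not say what kind of noise was used or where it was injected, so the additive Gaussian noise on the next state, the wrapper `noisy_step`, and the stand-in dynamics `toy_step` are all assumptions made for illustration.

```python
import numpy as np

def noisy_step(step_fn, state, action, noise_std, rng):
    """Advance a simulator one step, then perturb the next state with Gaussian noise."""
    next_state = np.asarray(step_fn(state, action), dtype=float)
    return next_state + rng.normal(0.0, noise_std, size=next_state.shape)

# Stand-in dynamics for illustration only: the position moves by the commanded action.
def toy_step(state, action):
    return state + action

rng = np.random.default_rng(0)
state = np.zeros(3)
state = noisy_step(toy_step, state, np.array([1.0, 0.0, 0.0]), noise_std=0.1, rng=rng)
print(state)
```

Here `noise_std` plays the role of the dial mentioned above: if the controller flies badly in simulation but well in real life, lowering it is one thing to try.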
Um, right, cool.
Uh, so, um, yeah, right.
So this first diagnostic tells you whether you should work on improving the simulation.
If there's a big mismatch between
simulation performance and real world performance,
that's a good sign that,
you know, that you should improve the simulation.
Second, um, this is actually very similar to the diagnostic we use on
the Spam, you know, Bayesian logistic regression and SVM example.
So what we're gonna do is, um,
we're going to measure this equation.
And this is, this again, this is
very similar to our previous equation which is,
take the cost function, similar to the previous example.
Take the cost function J that reinforcement learning was,
uh, told to minimize, right?
That's J, and J of theta was a squared error, right?
So take the cost function that, uh, uh,
reinforcement learning was told to minimize and see if
the human achieves better squared error than the reinforcement learning algorithm.
We just see, you know, who does better.
So let's measure the human performance on this squared error cost function, um,
and see which one does better.
So there are two cases: that equation will be either less than,
or it will be greater than or equal to, right, so less, or greater or equal to.
So case one is, um,
J of theta human is less than J of theta RL. That would be this case.
Then, that tells you that the problem
is with the reinforcement learning algorithm, right?
That somehow the human achieves a lower squared error uh, and so, uh,
the learning algorithm is not finding the best possible squared error; there is
some other controller, as evidenced by whatever the human is
doing that actually achieves a lower cost function, right?
So in this case,
um, we think the learning algorithm or,
or reinforcement learning algorithm is not doing a good job
minimizing that and we'll work on the reinforcement learning algorithm.
The other case would be if the sign of the inequality is the other way around.
Right? Now in this case,
um, you can infer that the problem is in the cost function.
Because what happens here is,
um, the human is flying better than the reinforcement learning algorithm.
But the human is achieving what looks like a worse cost than
the reinforcement learning algorithm.
So what this tells you is that minimizing J of theta does not correspond to flying well.
Right? Your learning algorithm achieves a better value for J of theta,
you know, J of theta RL is actually smaller than what the human is doing.
So the reinforcement learning algorithm as far as it knows is doing
a great job cause it's finding a value of theta where J of theta is really really small.
But in this last case, um,
you know that finding such a small value of J of theta doesn't correspond to flying
well because a human doesn't achieve such a good value in the cost function but
the helicopter actually just looks better, was flying in a more satisfactory way.
And that tells you that this squared error cost function
is not the right cost function for,
for, for what flying accurately means, right?
And so um, through this set of diagnostics,
um, uh, you could decide which one of these three things.
Uh, improving the simulator,
improving the RL algorithm,
reinforcement learning algorithm or improving
the cost function is the thing you should work on.
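The whole decision procedure for the helicopter example fits in a small function. This is a sketch under the assumptions stated in the lecture (the learned controller flies poorly on the real helicopter, and a human pilot flies better); the function name, argument names, and return strings are hypothetical.

```python
def next_component_to_fix(flies_well_in_sim, cost_human, cost_rl):
    """Return which of the three components the diagnostics point to.

    Assumes the learned controller flies poorly on the real helicopter and
    that a human pilot flies better than it does.
    """
    if flies_well_in_sim:
        # Good in simulation, bad in real life: the simulator doesn't match reality.
        return "simulator"
    if cost_human < cost_rl:
        # The human reaches a lower J, so RL is not finding the minimum of J.
        return "reinforcement learning algorithm"
    # RL reaches a J at least as low as the human's yet flies worse: J is a poor proxy.
    return "cost function"

print(next_component_to_fix(False, cost_human=3.0, cost_rl=8.0))
```

Because fixing one component can expose the next bottleneck, a function like this would be re-run after each round of improvements rather than once.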
And what happens in- in
this particular project and what often happens in machine learning applications is,
you run this set of diagnostics and this
actually happened when we were working on this helicopter.
We ran this set of diagnostics and then one week we were saying,
"Yep simulator's got a problem, let's work on that."
And then we'd improve the simulator,
improve the simulator and after
a couple of weeks of work we would run these diagnostics and say,
"Oh, looks like the simulator is now good enough."
And maybe there's a problem with the RL algorithm,
then we'd work on that, work on that and improve that.
And after that, after a while we'd say, "Oh, that's
also good enough and the problem is in the cost function."
And sometimes the, the location of
the most acute problems shifts right after you've cleared out one set of problems.
It might be the case that now the bottleneck is the simulator, right?
And so, um, I often use this, uh,
workflow to constantly drive prioritization for what to work on next, right?
And, and to answer your question just now about how do you find the new cost function?
It turns out finding a new cost function is actually not that easy.
Uh, so actually one, one of my former PhD students Adam Coates um,
through this type of process realized that
finding a good cost function is actually really difficult.
Uh, because if you want a helicopter to fly and maneuver,
you know, like fly at speed and then make a bank turn, right?
Like how do you mathematically define what is an accurate bank turn?
It's actually really difficult to write down an equation to
specify what is a good way to fly and do a turn.
How do you specify what is a good turn?
So, um, he wound up writing a research paper,
uh, which won one of the best application paper awards at ICML,
Uh-uh on, on how to define a good cost function,
it's actually pretty complicated,
but the reason he did it and it was a good use of his time was running
diagnostics like these which gave us
confidence that this was actually a worthwhile problem uh,
and that resulted in,
you know, making real progress in optimization, right.
Um, any questions about this? All right, cool.
Actually, I think I- all right, anyway,
all right, fun helicopter videos, I always want to show this, but it's fine.
And you guys saw this earlier. All right, so,
um, only one time,
all right, let's go through this.
So, um, uh, in addition to,
um, these specific diagnoses of bias
versus variance and optimization algorithms versus optimization objective.
Um, oh sorry- and when we do RL,
I wanted to just go through that example one more time,
so you see everything you just saw again,
after you learned about reinforcement learning, they tend to squeeze up.
Okay. Now, in addition to these types of diagnostics,
um, uh, for how to debug learning algorithms, um,
there's one other set of tools you'll find very useful,
which is, uh, error analysis tools, uh,
which lets you figure out,
which is another way for you to figure out what's working,
what's not working, or really what's not working in the learning algorithm.
[NOISE] So let's let's go through a motivating example.
Um, so let's say you're building a,
um, uh, you know, uh, like a security system,
so when someone walks in front of a door,
you unlock the door based on whether or not, you know,
that person is authorized to enter, right, that place.
Um, and so let's say that, uh, uh,
so there are a lot of machine learning applications where
it's not just one learning algorithm, right?
But instead you have a pipeline,
you string together many different steps.
So how do you actually build a face recognition algorithm?
To decide if someone approaching your front door
is authorized to unlock the door, all right.
Well, here's something you could do which is, uh,
you start with a camera image like this, and then,
um, you could do preprocessing to remove the background.
So all that complicated color in the background, let's get rid of that.
And it turns out that, um,
when you have a camera against a static background, right?
You could actually do this, you know,
with a little bit of noise relatively easily
because if you have a fixed camera that's just like mounted,
you know, on your door frame,
it always sees the same background,
and so you can just look at what pixels have changed and just
keep the pixels that have changed, because,
you know, this camera always sees that gray background and that, um,
brown bench in the back,
and so you just look at what pixels have changed a lot,
and this background doesn't really move, right.
So this is actually feasible by
just looking at what pixels have changed and
keeping pixels that have changed relative to that.
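A minimal sketch of this simple per-pixel version, assuming a fixed camera and a stored reference image of the empty scene. The function name `remove_background`, the threshold of 30, and the toy 4x4 images are hypothetical choices for illustration.

```python
import numpy as np

def remove_background(frame, background, threshold=30):
    """Keep pixels that differ from the reference background; zero out the rest."""
    # Cast to a signed type so subtracting uint8 images cannot wrap around.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    changed = diff.max(axis=-1) > threshold  # per pixel: did any channel change a lot?
    foreground = np.zeros_like(frame)
    foreground[changed] = frame[changed]
    return foreground, changed

# Toy 4x4 RGB images: a uniform gray background and a frame with a bright 2x2 patch.
background = np.full((4, 4, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
foreground, mask = remove_background(frame, background)
print(mask.sum())  # number of pixels kept as foreground
```

The threshold is what absorbs the "little bit of noise": pixels whose values drift by less than that amount are still treated as background.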
Um, and so, after getting rid of the background,
you could run the face detection algorithm, uh,
and then, uh, after detecting the face, it turns out that,
uh, actually, you know,
I've actually worked with a bunch of
face recognition systems.
It turns out that, um,
for some of the leading face recognition systems,
it depends on the details, but for some of them,
uh, it turns out that, um,
the appearance of the eyes is a very important cue for recognizing people,
that's why, if someone's eyes are covered, you actually have a much harder time recognizing them,
as eyes are very distinctive across people.
So you segment out the eyes,
um, segment out the nose,
and, the other thing, you segment out the mouth.
[LAUGHTER] It's Halloween.
[LAUGHTER] All right.
And then- and then feed these features into some other algorithms,
say logistic regression, that then, you know,
finally outputs a label that says,
is this the person, right?
that, you know,
you're authorized to open the door for.
Um, so it- so in many learning algorithms,
you have a complicated pipeline like this of different components that,
that have to be strung together,
and, uh, you know,
if you read the newspaper articles about-
or if you read research papers in machine learning, often,
uh, uh, the, the research papers will say, oh,
we built a machine translation system,
we trained on a gazillion, you know,
sentences found on the Internet, and that's great, and it's a pure end-to-end system,
so that's like one learning algorithm that sucks in an input,
say sucks in an English sentence and spits out the French sentence or something, right?
So that's, that's like one learning algorithm.
It turns out that for a lot of practical applications,
if you don't have a gazillion examples, uh,
you end up designing much more complex machine learning pipelines like this,
where it's not just one monolithic learning algorithm,
but instead there are many different smaller components.
Um, and I think that, uh,
having a lot of data's great, all right?
I love having more data,
but big data has also been a little bit over-hyped,
uh, and there are a lot of things you could do with small datasets as well.
And in the teams [NOISE] I've worked with,
we find that if, if,
if you have a relatively small dataset,
often you can still get great results.
You know, my teams often get great results with 100 images,
100 training examples or something.
But when you have small data,
it often takes more, uh,
insightful design of machine learning pipelines like this, right?
Um, now, [NOISE] when you have a machine learning pipeline like this, uh,
what you want to do is,
uh, so, so you build a pipeline like this and it doesn't work, right?
And there's this common workflow.
You build a pipeline, you build something,
it doesn't work, so you want to debug it.
So in order to decide which part of the pipeline to work on, um,
it's very useful if you can look at the error of your system and try to attribute
the error to the different components so that you
can decide which component to work on next, right?
And, and I'll tell you a true story, you know,
remember the preprocessing background removal step, right?
Since you're getting rid of the background,
um, it turns out that, uh,
there are a lot of details of how to do background removal,
uh, for example, um,
the simple way to do it is to look at every pixel and
just see which pixels have changed, uh,
but it turns out that if there's a tree in the background that, you know,
waves a little bit because the wind moves
the tree and blows the leaves and branches around a little bit,
then sometimes the background pixels do change a little bit.
And so there are actually really complicated background removal algorithms
that try to model basically
the trees and the bushes moving around a little bit in the background,
so that, you know, even though the pixels of the tree
move around, they're still part of the background and you just get rid of them.
So background removal, there's simple versions where you just look
at each pixel and see how much it's changed and there's incredibly complicated versions.
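
The simple per-pixel version described above can be sketched as follows. This is only a minimal illustration, not the method used in any real system: the function name `foreground_mask`, the threshold of 30 grey levels, and the tiny images are all assumptions made up for the example.

```python
def foreground_mask(background, frame, threshold=30):
    """Mark a pixel as foreground when it differs from the reference
    background image by more than `threshold` grey levels (assumed value)."""
    return [
        [abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]

# Toy 3x4 greyscale images (values 0-255), invented for illustration.
background = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
frame = [
    [10, 12, 10, 10],    # 12: a leaf moved slightly, so a small change
    [10, 200, 210, 10],  # a person appeared, so a large change
    [10, 190, 205, 10],
]

mask = foreground_mask(background, frame)
for row in mask:
    print("".join("#" if is_fg else "." for is_fg in row))
```

With a very low threshold, the slightly changed "leaf" pixel is also flagged as foreground, which is the failure mode that the more complicated algorithms try to handle.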
Um, so I actually know someone, uh,
who was trying to work on a problem like this and
they decided to improve their background removal algorithm.
Uh, and this real person actually literally wrote a PhD thesis on background removal.
Uh, and so I'm glad he got a PhD,
but, you know, when I look at the problem he was actually trying to solve,
I don't think it actually moved the needle, right?
Uh, this is one of the nice things about academia,
right, guys, so long as, you know,
you can still publish a paper.
[LAUGHTER]
And that was technically innovative.
It was actually very good technical work.
But if your goal is to publish a paper, great, do that, uh,
but then if your goal is to build a better face recognition system,
then I would carefully ask which components should
you actually spend your time to work on, all right?
Um, so here's what you can do with error analysis,
which is, say your overall system has 85% accuracy.
Here's what I would do. I would go in and in your,
uh, dev set, in your development set,
the hold-out cross validation set, right,
uh, go in and for every one of your examples in the dev set,
I would plug in the ground truth for the background.
Meaning that, uh, rather than using some, you know,
approximate heuristic algorithm for
roughly cleaning out the background which may or may not work out well,
I would just use Photoshop.
And for every example in the dev set,
I would give it the perfect background removal, right?
So imagine if instead of some noisy algorithm trying to remove the background,
this step of the algorithm just had perfect performance, right?
And then you can give it perfect performance on your dev set, on your test set,
just by using Photoshop to just tell it this is a background,
this is a foreground, right?
And let's say that when you plug in this perfect background removal,
the accuracy improves to 85.1%.
And then you can keep on going from left to right in this pipeline. So, um, now,
instead of using some learning algorithm to do face detection,
let's just go in and for the test set, you know,
modify, kind of have the face detection algorithm cheat, right?
Have it just memorize the right location for
the face in the test set and just give it a perfect result in the test set.
So when I shade these things, um,
that means I'm giving it the perfect result, right?
Uh, so let's just go in and on the test set
give it the perfect face detection for every single example,
and then look at the final output and see
how that changes the accuracy of the final output, right?
And then, same for these components, um,
eye segmentation, nose segmentation, mouth segmentation.
Uh, and you do these one at a time.
And then finally for logistic regression,
if you give it the perfect output,
your accuracy should be 100%, right?
Uh, so now, what you can do is look at the sequence
of steps and see which one gave you the biggest gain.
And it looks like, um,
in this example,
when you gave it perfect face detection,
the accuracy improved from 85.1 to 91%.
So, you know, roughly a 6% improvement.
And that tells you that,
if only you can improve your face detection algorithm maybe
your overall system could get better by as much as 6%.
So this gives you faith that, you know,
maybe it's worth improving on your face detection component.
And in contrast, this tells you that even if you had perfect background removal,
it's only 0.1% better, so maybe don't spend too much time on that.
Um, and it looks like,
when you gave it perfect eye segmentation,
it went up another 4%.
So maybe that's another good project to prioritize, right?
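
The arithmetic in this walkthrough can be tabulated with a few lines of code. This is a rough sketch using only the accuracies quoted above; the 95% figure is derived from "up another 4%" after 91%, and nose and mouth segmentation are left out because no numbers are quoted for them. The function name `stage_gains` is made up for illustration.

```python
# Cumulative end-to-end accuracies (percent) quoted in the lecture. Each entry
# is the accuracy once this stage and every earlier one is given ground truth.
stages = [
    ("overall system (no oracle)", 85.0),
    ("+ perfect background removal", 85.1),
    ("+ perfect face detection", 91.0),
    ("+ perfect eye segmentation", 95.0),  # derived: "another 4%" on top of 91
]

def stage_gains(stages):
    """Return (stage name, gain in percentage points) for each oracle step."""
    gains = []
    for (_, prev_acc), (name, acc) in zip(stages, stages[1:]):
        gains.append((name, round(acc - prev_acc, 1)))
    return gains

gains = stage_gains(stages)
# Print the stages from largest to smallest gain as a rough priority list.
for name, gain in sorted(gains, key=lambda g: g[1], reverse=True):
    print(f"{name:32s} {gain:+.1f} points")
```

Sorting by gain puts face detection first and background removal last, matching the prioritization argued for above.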
Um, and if you're in a team,
one common structure would be to do this sort of analysis,
and then have some people work on face detection,
some people work on eye segmentation.
You could usually do a few things in parallel if you have a large engineering team.
But at least this should give you a sense of
the relative prioritization of the different things. Question?
[inaudible]
Yeah, right. So do you do it cumulatively,
uh, such as give it perfect eye segmentation,
then add on top of it nose segmentation,
or do you give it perfect eye segmentation and then take that away,
and then give it perfect nose segmentation?
Um, the way I presented it here is done cumulatively.
Uh, um, and it turns out that, uh, let's see.
Once you give it, you know,
perfect things in the later stages,
maybe the earlier stages don't matter that much anymore.
So that's one pattern. It turns out that,
uh, uh, you could do it either way, right?
For the uh, eyes-nose-mouth,
you can do it cumulatively or one at a time and
you'll probably get relatively similar results.
Um, uh, no guarantee,
you might get different results in terms of conclusions.
But, uh, I think,
to the extent that you are wondering if doing it
cumulatively versus non-cumulatively might give you different results,
I would just do it both ways.
And I think this, um,
error analysis is not a
hard mathematical rule, if that makes sense.
It is not that you do this and then there's a formula that tells you,
okay, work on, uh,
uh, face detection, right?
I think that this should be,
um, married with judgments on,
you know, how hard you think it is to
improve face detection versus eye segmentation, right?
But this at least gives you a sense of prioritization.
Um, and it's worth doing this in
multiple ways if you're
concerned about a discrepancy between the cumulative and non-cumulative versions, all right?
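
To make the cumulative versus one-at-a-time distinction concrete, here is a small sketch. Everything in it is hypothetical: the toy dev set, the `evaluate` stand-in (which in practice would rerun the real pipeline with some stages replaced by ground truth), and the function name `oracle_analysis`. It is meant only to show how the two procedures can disagree.

```python
# Hypothetical toy dev set: for each example, the set of pipeline stages whose
# output is wrong on it. An example is classified correctly only when every
# failing stage has been replaced by ground truth. All values are invented.
DEV_FAILURES = [
    set(), set(), set(), set(), set(), set(),
    {"face detection"},
    {"face detection"},
    {"face detection", "eye segmentation"},
    {"eye segmentation"},
]

def evaluate(oracle_stages):
    """Stand-in for running the pipeline: count correctly classified examples."""
    return sum(1 for failing in DEV_FAILURES if failing <= oracle_stages)

def oracle_analysis(stages, evaluate, cumulative=True):
    """Return {stage: gain in correct examples} from giving stages ground truth."""
    baseline = evaluate(frozenset())
    gains = {}
    active = set()
    previous = baseline
    for stage in stages:
        if cumulative:
            # Keep every earlier oracle in place and add this one on top.
            active.add(stage)
            score = evaluate(frozenset(active))
            gains[stage] = score - previous
            previous = score
        else:
            # Make only this stage perfect, then take it away again.
            gains[stage] = evaluate(frozenset({stage})) - baseline
    return gains

stages = ["face detection", "eye segmentation"]
cumulative_gains = oracle_analysis(stages, evaluate, cumulative=True)
single_gains = oracle_analysis(stages, evaluate, cumulative=False)
print("cumulative:   ", cumulative_gains)
print("one at a time:", single_gains)
```

In this toy data one example fails at both stages, so eye segmentation is credited with more when it is added on top of perfect face detection than when it is made perfect on its own. That kind of overlap is one reason the two procedures can give different numbers.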
Um, so when we have a complex machine learning pipeline,
this type of error analysis helps you break down the error,
so attribute the error to different components,
which lets you focus your attention on what to work on.
So if you [inaudible]?
Oh, right. Yeah. If you give it perfect face detection
and then your accuracy drops, what does that entail?
Uh, it's not impossible for that to happen,
uh, it would be quite rare.
Uh, so at a high level,
what I would do is go in and try to figure out what's actually going on.
I wouldn't ignore that.
Uh, so this is something I see.
Sometimes a team discovers
a weird phenomenon like that and they usually ignore it and move on.
I wouldn't do that.
Whenever you find one of these weird things, uh,
I wouldn't gloss over and ignore it,
I would go in and figure out what's going on.
Does it make sense? It is like debugging software [NOISE].
You know, if you're trying to debug a piece of software,
and whenever you move your mouse over, you know,
some button, some random pixel color changes,
you go, huh, that's weird.
And then some people just ignore it and say,
"Oh well, the user won't see this."
[LAUGHTER] But I'll say no, let's go figure it out.
[LAUGHTER]
So what you're saying is quite rare but not impossible.
But, uh,
I don't have an easy solution for how to figure out what's
going on, but I would wanna figure out what's going on.
Um, all right.
So one last thing before we break.
So error analysis, um,
helps figure out the difference between where you are now,
85% overall system accuracy and 100%, right?
So it tries to explain the difference between where you are and,
you know, perfect performance.
There's a different type of analysis called ablative analysis
which figures out the difference between where you
are and something much worse. So here's what I mean.
Um, so let's say you built
a good anti-spam classifier by adding
lots of clever features to logistic regression, right?
So spelling correction, because spammers are trying
to misspell words to mess up the tokenizer,
uh, to make, you know,
spammy words not look like spammy words.
Uh, sender host features.
So, what machine did the e-mail come from?
You know, header features. Uh, you
could have a parser from NLP
parse the text, uh,
use a JavaScript parser to understand, right?
Or you can even, uh,
fetch the web pages that the e-mail refers to and parse those.
Um, and the question is, um,
how much do these components really help?
And it turns out, if you're writing a research paper,
you know, sometimes you can say, "Hey.
Look, I built a great spam classifier," and that's okay.
That's, like, a nice result to have.
But if you can explain to your reader,
either in a research paper or in a class project report like a term project,
what actually made the difference,
that conveys a lot of insights as well.
So, um, say simple logistic regression
without all these clever features got 94% performance,
uh, and with the addition of all these clever features,
you got 99% accuracy.
So in an ablative analysis, what we'll do is, um,
remove the components one at a time to see how it breaks, right?
So just now,
with error analysis, we were adding to the system by making
components perfect to see how it improves.
Here, we're gonna remove things one at a time.
I did not mean to remove that [LAUGHTER].
So let me figure out what's going to pop on. All right.
We remove things one at a time to see how it breaks.
So let's see, we remove spelling correction.
And, uh, with that set of features,
the accuracy goes to that.
Then let's remove the sender host features,
then remove email header features, and so on until,
uh, when you remove all of these features you end up there.
And again, you could do this cumulatively or
remove one and put it back, remove one and put back.
Uh, you know, or you could do it both
ways and see if they give you slightly different insights.
Uh, and so the conclusion from
this particular analysis is that the biggest gap is from the,
uh, text parser features,
because when you remove that, the accuracy went down by 4%.
And so, you know,
there is strong evidence.
If you wanna publish a paper,
you can say that text parser features significantly
improve spam filter accuracy, and give that level of insight.
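
The ablative procedure can be sketched as a small loop. This is an illustration under loud assumptions, not the lecture's actual experiment: `evaluate` is a hypothetical stand-in for "retrain with these features and measure accuracy", built as a purely additive toy model. Only the 94% baseline, the 99% full-system accuracy, and the 4% drop for text parser features come from the lecture; the small contributions assigned to the other features are invented so the totals add up.

```python
BASELINE_TENTHS = 940  # 94.0% quoted for plain logistic regression
CONTRIBUTION_TENTHS = {  # additive toy model, in tenths of a percent
    "spelling correction": 2,             # invented
    "sender host features": 2,            # invented
    "email header features": 2,           # invented
    "text parser features": 40,           # the 4% drop quoted in the lecture
    "JavaScript parser": 2,               # invented
    "features from linked web pages": 2,  # invented
}

def evaluate(features):
    """Hypothetical stand-in for 'retrain and measure accuracy' (percent)."""
    return (BASELINE_TENTHS + sum(CONTRIBUTION_TENTHS[f] for f in features)) / 10

def ablate(features, evaluate, cumulative=True):
    """Return {feature: accuracy drop in points} when features are removed."""
    full = list(features)
    full_accuracy = evaluate(full)
    drops = {}
    remaining = list(full)
    previous = full_accuracy
    for feature in full:
        if cumulative:
            # Remove this feature and leave earlier ones removed too.
            remaining.remove(feature)
            accuracy = evaluate(remaining)
            drops[feature] = round(previous - accuracy, 1)
            previous = accuracy
        else:
            # Remove only this feature, then put it back.
            accuracy = evaluate([f for f in full if f != feature])
            drops[feature] = round(full_accuracy - accuracy, 1)
    return drops

all_features = list(CONTRIBUTION_TENTHS)
cumulative_drops = ablate(all_features, evaluate, cumulative=True)
single_drops = ablate(all_features, evaluate, cumulative=False)
for feature, drop in cumulative_drops.items():
    print(f"remove {feature:32s} -{drop:.1f} points")
```

Because the toy model is additive, the cumulative and remove-one-and-put-it-back versions agree here. With a real classifier the features can interact, so the two can give somewhat different numbers.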
And then if you're working on a spam filter for many years, right, you know,
there are really important applications
that sometimes the same team will work on for many years.
So this type of analysis gives you
intuition about what's important and what's not, uh,
and helps you decide to maybe even double down
on text parser features, or maybe if, uh,
the sender host features are too computationally expensive to compute,
it tells you maybe you can just get rid of them without too much harm.
And also if you're publishing a paper or sending a report,
this gives much more insight to your report.
Okay? All right.
Um, so that's it for error analysis and ablative analysis.
I hope this was useful for your class projects as well.
I'll take one last question over there.
Uh, how did you choose the order in which to remove the features?
Oh. Yeah. Uh, how would you choose the order? There's no systematic way.
If you don't have a systematic way to do that, you can do it the other way,
the non-cumulative way, where you remove one [NOISE], put it back, remove one, put it back.
So either way it works. All right, let's break.
Um, and, uh,
problem set two is due tonight.
A friendly reminder, and problem set three will be posted, uh,
in the next, like, several tens of minutes.
Okay. Thanks everyone.
