- Okay so welcome to
CS 231N Lecture three.
Today we're going to talk about
loss functions and optimization
but as usual, before we
get to the main content
of the lecture, there's a
couple administrative things
to talk about.
So the first thing is that
assignment one has been released.
You can find the link up on the website.
And since we were a little bit late
in getting this assignment
out to you guys,
we've decided to change
the due date to Thursday,
April 20th at 11:59 p.m.,
this will give you a full
two weeks from the assignment
release date to go and
actually finish and work on it,
so we'll update the syllabus
for this new due date
a little bit later today.
And as a reminder, when you
complete the assignment,
you should go turn in the
final zip file on Canvas
so we can grade it and get
your grades back as quickly
as possible.
So the next thing is always
check out Piazza for interesting
administrative stuff.
So this week I wanted to
highlight that we have several
example project ideas as
a pinned post on Piazza.
So we went out and solicited
example project ideas
from various people in the
Stanford community or affiliated
to Stanford, and they came
up with some interesting
suggestions for projects
that they might want students
in the class to work on.
So check out this pinned post
on Piazza and if you want
to work on any of these projects,
then feel free to contact
the project mentors
directly about these things.
Additionally we posted office
hours on the course website,
this is a Google calendar, so
this is something that people
have been asking about
and now it's up there.
The final administrative
note is about Google Cloud,
as a reminder, because we're
supported by Google Cloud
in this class, we're able to
give each of you an additional
$100 credit for Google Cloud
to work on your assignments
and projects, and the exact
details of how to redeem
that credit will go out later
today, most likely on Piazza.
So if there's, I guess if
there's no questions about
administrative stuff then we'll
move on to course content.
Okay cool.
So recall from last time in lecture two,
we were really talking about
the challenges of recognition
and trying to hone in on this idea
of a data-driven approach.
We talked about this idea
of image classification,
talked about why it's hard,
there's this semantic gap
between the giant grid of
numbers that the computer sees
and the actual image that you see.
We talked about various
challenges regarding this
around illumination,
deformation, et cetera,
and why this is actually a
really, really hard problem
even though it's super
easy for people to do
with their human eyes
and human visual system.
Then also recall last time
we talked about the k-nearest
neighbor classifier as kind
of a simple introduction
to this whole data-driven mindset.
We talked about the CIFAR-10
data set where you can see
an example of these images
on the upper left here,
where CIFAR-10 gives you
these 10 different categories,
airplane, automobile, whatnot,
and we talked about how the
k-nearest neighbor classifier
can be used to learn decision boundaries
to separate these data points into classes
based on the training data.
This also led us to a
discussion of the idea of cross
validation and setting
hyper parameters by dividing
your data into train,
validation and test sets.
Then also recall last time
we talked about linear
classification as the first
sort of building block
as we move toward neural networks.
Recall that the linear
classifier is an example
of a parametric classifier
where all of our knowledge
about the training data gets summarized
into this parameter matrix W that is set
during the process of training.
And this linear classifier
recall is super simple,
where we're going to take
the image and stretch it out
into a long vector.
So here the image is x and
then we take that image
which might be 32 by 32 by
3 pixels, stretch it out
into a long column vector of 32 times 32
times 3 entries,
where the 32 and 32 are
the height and width,
and the 3 give you
the three color channels,
red, green, blue.
Then there exists some parameter matrix, W
which will take this long column vector
representing the image
pixels, and convert this
and give you 10 numbers giving scores
for each of the 10 classes
in the case of CIFAR-10.
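A minimal NumPy sketch of the score computation just described, assuming a random stand-in image and small random weights (neither comes from the lecture; the names `image`, `x`, `W` and `scores` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 10                      # CIFAR-10 categories
height, width, channels = 32, 32, 3   # image size and RGB channels

# Stand-in "image" of random pixel values; a real one would come from the dataset.
image = rng.integers(0, 256, size=(height, width, channels)).astype(np.float64)

# Stretch the image out into one long vector of 32 * 32 * 3 = 3072 entries.
x = image.reshape(-1)

# Parameter matrix W: one row per class, one column per pixel value.
# Small random values here (an assumption); training would set these.
W = 0.001 * rng.standard_normal((num_classes, x.size))

# Class scores: one number per class.
scores = W.dot(x)
print(scores.shape)
```

With untrained weights the scores are meaningless; the point is only the shapes: a 3072-entry vector goes in and 10 class scores come out.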
Where we kind of had this interpretation
where larger values of those scores,
so a larger value for the cat
class means the classifier
thinks that the cat is
more likely for that image,
and lower values for
maybe the dog or car class
indicate lower probabilities
of those classes being present
in the image.
Also, so I think this point
was a little bit unclear
last time that linear classification
has this interpretation
as learning templates per class,
where if you look at the
diagram on the lower left,
for every pixel in the image,
and for every one of our 10 classes,
there exists some entry in this matrix W,
telling us how much that
pixel influences that class.
So that means that each of
these rows in the matrix W
ends up corresponding to
a template for the class.
And each of those rows again corresponds
to a weighting between
the pixel values of
the image and that class,
so if we take that row and
unravel it back into an image,
then we can visualize the
learned template for each
of these classes.
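As a rough sketch of that unraveling step, assuming the image was flattened in NumPy's default row-major order and using placeholder (untrained) weights; `row_to_template` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 32 * 32 * 3))  # placeholder weights, not trained

def row_to_template(W, class_index, shape=(32, 32, 3)):
    """Reshape one row of W back into image shape and rescale it to [0, 255]."""
    template = W[class_index].reshape(shape)
    lo, hi = template.min(), template.max()
    return 255.0 * (template - lo) / (hi - lo)

template = row_to_template(W, class_index=3)
print(template.shape)
```

The rescaling is only there so the weights can be displayed as pixel intensities; with trained weights each template would look like a blurry average of its class.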
We also had this interpretation
of linear classification
as learning linear decision
boundaries between classes
in some high dimensional
space, where each image is a point
and the dimensions of the space
correspond to the pixel
intensity values of the image.
So this is kind of where
we left off last time.
Where we ended up last
time is we got this idea
of a linear classifier,
but we didn't talk about how
to actually choose the W,
how to actually use the training data
to determine which value
of W is best.
So where we stopped off
is that for some setting
of W, we can use this W
to come up with our 10
class scores for any image.
And some of these class
scores might be better or worse.
So here in this simple example,
we've shown maybe just a
training data set of three images
along with the 10 class scores
predicted for some value of W
for those images.
And you can see that some
of these scores are better
or worse than others.
So for example in the image
on the left, if you look up,
it's actually a cat because you're a human
and you can tell these things,
but if we look at the
assigned probabilities, cat,
well not probabilities but scores,
then the classifier maybe
for this setting of W
gave the cat class a score
of 2.9 for this image,
whereas the frog class got 3.78.
So maybe the classifier
is not doing so well
on this image, that's bad,
we wanted the true class
to actually have the highest class score,
whereas for some of these
other examples, like the car
for example, you see
that the automobile class
has a score of six which is much higher
than any of the others, so that's good.
And for the frog, the predicted
score is maybe negative four,
which is much lower
than all the other ones,
so that's actually bad.
So this is kind of a hand wavy approach,
just kind of looking at
the scores and eyeballing
which ones are good
and which ones are bad.
But to actually write
algorithms about these things
and to determine
automatically which W
will be best, we need some
way to quantify the badness
of any particular W.
And this function
that takes in a W,
looks at the scores and then
tells us quantitatively how bad
that W is, is something that
we'll call a loss function.
And in this lecture we'll
see a couple examples
of different loss functions
that you can use for this image
classification problem.
So then once we've got this
idea of a loss function,
this allows us to quantify
for any given value of W,
how good or bad is it?
But then we actually need to find
and come up with an efficient procedure
for searching through the
space of all possible Ws
and actually come up with
what is the correct value
of W that is the least bad,
and this process will be
an optimization procedure
and we'll talk more about
that in this lecture.
So I'm going to shrink
this example a little bit
because 10 classes is
a little bit unwieldy.
So we'll kind of work with
this tiny toy data set
of three examples and
three classes going forward
in this lecture.
So again, in this example, the
cat is maybe not so correctly
classified, the car is correctly
classified, and the frog,
this setting of W got this
frog image totally wrong,
because the frog score is
much lower than others.
So to formalize this a little
bit, usually when we talk
about a loss function, we imagine
that we have some training
data set of xs and ys,
usually N examples of these
where the xs are the inputs
to the algorithm in the
image classification case,
the xs would be the actual
pixel values of your images,
and the ys will be the things
you want your algorithm
to predict, we usually call
these the labels or the targets.
So in the case of image classification,
remember we're trying
to categorize each image
for CIFAR-10 to one of 10 categories,
so the label y here will be an integer
between one and 10 or
maybe between zero and nine
depending on what programming
language you're using,
but it'll be an integer telling you
what is the correct category
for each one of those images x.
And now we have this
prediction function f
which takes in our example
x and our weight matrix W
and makes some prediction for y,
in the case of image classification
these will be our 10 numbers.
Then we'll define some loss function L_i
which will take in the predicted scores
coming out of the function f
together with the true target or label Y
and give us some quantitative
value for how bad
those predictions are for
that training example.
And now the final loss L will
be the average of these losses
summed over the entire data
set over each of the N examples
in our data set.
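Written out as a formula, using the symbols from the passage above (N examples, prediction function f, per-example loss L_i), the averaged loss would look roughly like this:

```latex
L = \frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W),\, y_i\bigr)
```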
So this is actually a
very general formulation,
and actually extends even
beyond image classification.
Kind of as we move forward
and see other tasks,
other examples of tasks and deep learning,
the kind of generic setup
is that for any task
you have some xs and ys
and you want to write down
some loss function that
quantifies exactly how happy
you are with your particular
parameter settings W
and then you'll eventually
search over the space of W
to find the W that minimizes
the loss on your training data.
So as a first example of
a concrete loss function
that is a nice thing to work
with in image classification,
we'll talk about the multi-class SVM loss.
You may have seen the binary
SVM, or support vector
machine, in CS 229, and the multiclass SVM
is a generalization of that
to handle multiple classes.
In the binary SVM case as
you may have seen in 229,
you only had two classes, each example x
was going to be classified
as either positive
or negative example,
but now we have 10 categories,
so we need to generalize
this notion to handle multiple classes.
So this loss function has kind
of a funny functional form,
so we'll walk through it in a bit more,
in quite a bit of detail over
the next couple of slides.
But what this is saying
is that the loss L_i
for any individual example,
the way we'll compute it
is we're going to perform a sum
over all of the categories, j,
except for the true category, Y_i,
so we're going to sum over
all the incorrect categories,
and then we're going to compare the score
of the correct category and the score
of the incorrect category.
If the score
for the correct category
is greater than the
incorrect score by some safety margin
that we set to one,
that means the
score for the true category
is sufficiently larger than
that false category,
and we get a loss of zero for that pair.
Otherwise the loss for that pair is
S_j minus S_Yi plus one.
And we'll sum this up over all
of the incorrect categories
for our image and this
will give us our final loss
for this one example in the data set.
And again we'll take
the average of this loss
over the whole training data set.
So this kind of like if then
statement, like if the true
class score is much
larger than the others,
this kind of if then
formulation we often compactify
into this single max of zero and
S_j minus S_Yi plus one expression,
but I always find that notation
a little bit confusing,
and it always helps me
to write it out in this
sort of case based notation
to figure out exactly
what the two cases are
and what's going on.
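For reference, the two notations discussed here can be written side by side. This is a sketch in which s denotes the score vector f(x_i, W) and j indexes the categories; the margin of one is the one mentioned above:

```latex
L_i = \sum_{j \neq y_i}
\begin{cases}
0 & \text{if } s_{y_i} \geq s_j + 1 \\
s_j - s_{y_i} + 1 & \text{otherwise}
\end{cases}
\;=\; \sum_{j \neq y_i} \max\bigl(0,\; s_j - s_{y_i} + 1\bigr)
```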
And by the way, this
style of loss function
where we take max of zero
and some other quantity
is often referred to as
some type of a hinge loss,
and this name comes from
the shape of the graph
when you go and plot it,
so here the x axis corresponds to the S_Yi,
that is the score of the
true class for some training
example, and now the y axis is the loss,
and you can see that as the
score for the true category
for this example increases, then the loss
will go down linearly
until we get to above this safety margin,
after which the loss will be zero
because we've already correctly
classified this example.
So let's, oh, question?
- [Student] Sorry, in terms of notation
what is S underscore Yi?
Is that your right score?
- Yeah, so the question is
in terms of notation,
what is S and what is S_Yi
in particular, so the Ss
are the predicted scores
for the classes that are
coming out of the classifier.
So if one is the cat class and
two is the dog class then S1
and S2 would be the cat and
dog scores respectively.
And remember we said that Yi
was the category of the ground
truth label for the example
which is some integer.
So then S sub Y sub i, sorry
for the double subscript,
that corresponds to the
score of the true class
for the i-th example in the training set.
Question?
- [Student] So what
exactly is this computing?
- Yeah the question is what
exactly is this computing here?
It's a little bit funny, I
think it will become more clear
when we walk through an explicit
example, but in some sense
what this loss is saying is
that we are happy if the true
score is much higher than
all the other scores.
It needs to be higher
than all the other scores
by some safety margin,
and if the true score
is not greater than each
of the other scores by that margin,
then we will incur some
loss and that would be bad.
So this might make a little bit more sense
if we walk through an explicit example
for this tiny three example data set.
So here remember I've sort
of removed the case-based
notation and am just switching
back to the max of zero notation,
and now if we look at,
if we think about computing
this multi-class SVM loss
for just this first training example
on the left, then remember
we're going to loop over
all of the incorrect
classes, so for this example,
cat is the correct class, so
we're going to loop over the car
and frog classes. Now for
car, we're going to look at the car
score, 5.1, minus the cat score,
3.2, plus one. When we're
comparing cat and car we expect
to incur some loss here because
the car score is greater
than the cat score, which is bad.
So for this one class,
for this one example,
we'll incur a loss of 2.9,
and then when we go and
compare the cat score
and the frog score we see that cat is 3.2,
frog is minus 1.7,
so cat is more than one greater than frog,
which means that between these two classes
we incur zero loss.
So then the multiclass SVM
loss for this training example
will be the sum of the losses
across each of these pairs
of classes, which will be
2.9 plus zero which is 2.9.
Which is sort of saying that
2.9 is a quantitative measure
of how much our classifier screwed up
on this one training example.
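The arithmetic for this first example can be checked with a short loop. This is a sketch; the function name `svm_loss_single` and the class ordering [cat, car, frog] are assumptions made for illustration, while the three scores are the ones quoted above:

```python
def svm_loss_single(scores, correct_class, margin=1.0):
    """Multiclass SVM loss for one example: sum hinge terms over incorrect classes."""
    loss = 0.0
    for j, s_j in enumerate(scores):
        if j == correct_class:
            continue  # skip the true class
        loss += max(0.0, s_j - scores[correct_class] + margin)
    return loss

# Scores for the cat image, ordered [cat, car, frog] (ordering is an assumption).
cat_image_scores = [3.2, 5.1, -1.7]
cat_loss = svm_loss_single(cat_image_scores, correct_class=0)
print(round(cat_loss, 2))  # 2.9
```

The car term contributes 5.1 - 3.2 + 1 = 2.9 and the frog term is clipped to zero, which matches the total above.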
And then if we repeat this procedure
for this next car image, then
again the true class is car,
so we're going to iterate
over all the other categories
when we compare the car and the cat score,
we see that car is more
than one greater than cat
so we get no loss here.
When we compare car and frog, we again see
that the car score is more
than one greater than frog,
so we get again no loss
here, and our total loss
for this training example is zero.
And now I think you hopefully
get the picture by now, but,
if you go look at frog, now
frog, we again compare frog
and cat, incur quite a lot of
loss because the frog score
is very low, compare frog
and car, incur a lot of loss
because the score is very low,
and then our loss for this
example is 12.9.
And then our final loss
for the entire data set
is the average of these losses
across the different examples,
so when you average those
it comes to about 5.3.
So then it's sort of, this
is our quantitative measure
that our classifier is
5.3 bad on this data set.
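As a quick check of that average, using only the three per-example losses quoted above:

```python
# Per-example losses quoted in the lecture for the cat, car and frog images.
per_example_losses = [2.9, 0.0, 12.9]

dataset_loss = sum(per_example_losses) / len(per_example_losses)
print(round(dataset_loss, 2))  # 5.27
```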
Is there a question?
- [Student] How do you
choose the plus one?
- Yeah, the question is how
do you choose the plus one?
That's actually a really great question,
it seems like kind of an
arbitrary choice here,
it's the only constant that
appears in the loss function
and that seems to offend
your aesthetic sensibilities
a bit maybe.
But it turns out that this is somewhat
of an arbitrary choice,
because we don't actually care
about the absolute values of the scores
in this loss function, we only care
about the relative differences
between the scores.
We only care that the correct score
is much greater than the incorrect scores.
So in fact if you imagine
scaling up your whole W
up or down, then it kind
of rescales all the scores
correspondingly and if you kind
of work through the details
and there's a detailed derivation
of this in the course notes
online, you find this choice
of one actually doesn't matter.
That this free parameter
of one kind of washes out
and is canceled with this scale,
like the overall setting
of the scale in W.
And again, check the course
notes for a bit more detail
on that.
So then I think it's
kind of useful to think
about a couple different
questions to try to understand
intuitively what this loss is doing.
So the first question is what's
going to happen to the loss
if we change the scores of the
car image just a little bit?
Any ideas?
Everyone's too scared to ask a question?
Answer?
[student speaking faintly]
- Yeah, so the answer is
that if we jiggle the scores
for this car image a little
bit, the loss will not change.
So the SVM loss, remember,
the only thing it cares
about is getting the correct
score to be more than one
greater than the incorrect
scores, and in this case,
the car score is already quite
a bit larger than the others,
so if the scores
for this example
change just a little
bit, this margin of one
will still be retained and
the loss will not change,
we'll still get zero loss.
The next question, what's
the min and max possible loss
for SVM?
[student speaking faintly]
Oh I hear some murmurs.
So the minimum loss is zero,
because you can imagine that
if our correct score is much
larger than all the others,
then we'll incur zero
loss across all the classes
and the total will be zero,
and if you think back to this
hinge loss plot that we had,
then you can see that if the correct score
goes very, very negative,
then we could incur
potentially infinite loss.
So the min is zero and
the max is infinity.
Another question, sort of when
you initialize these things
and start training from scratch,
usually you kind of initialize W
with some small random values,
as a result your scores
tend to be sort of small
uniform random values
at the beginning of training.
And then the question is
that if all of your Ss,
if all of the scores
are approximately zero
and approximately equal,
then what kind of loss do you expect
when you're using multiclass SVM?
- [Student] Number of classes minus one.
- Yeah, so the answer is
number of classes minus one,
because remember that
if we're looping over
all of the incorrect classes,
so we're looping over
C minus one classes, within
each of those classes
the two Ss will be about the same,
so we'll get a loss of one
because of the margin and
we'll get C minus one.
So this is actually kind
of useful because when you,
this is a useful debugging strategy
when you're using these things,
that when you start off training,
you should think about what
you expect your loss to be,
and if the loss you actually
see at the start of training
at that first iteration is
not equal to C minus one
in this case,
that means you probably have
a bug and you should go check
your code, so this is actually
kind of a useful thing
to be checking in practice.
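A sketch of that sanity check, assuming random stand-in data and a small random initialization (the sizes, the scale 1e-5 and the helper name `multiclass_svm_loss` are illustrative choices, not from the lecture):

```python
import numpy as np

def multiclass_svm_loss(scores, labels, margin=1.0):
    """Mean multiclass SVM loss. scores: (N, C) array; labels: (N,) integer class ids."""
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]   # true-class score per row
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(n), labels] = 0.0               # do not count the true class
    return margins.sum(axis=1).mean()

rng = np.random.default_rng(0)
num_examples, num_features, num_classes = 50, 3072, 10

X = rng.standard_normal((num_examples, num_features))         # stand-in inputs
y = rng.integers(0, num_classes, size=num_examples)           # stand-in labels
W = 1e-5 * rng.standard_normal((num_classes, num_features))   # small random init

initial_loss = multiclass_svm_loss(X.dot(W.T), y)
expected = num_classes - 1
print(initial_loss, expected)
```

With all scores near zero, each of the C - 1 incorrect classes contributes about one, so the printed loss should sit very close to 9 here.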
Another question, so I said
that for the SVM we're summing
over the incorrect
classes, what happens if the sum
is also over the correct class,
if we just go over everything?
- [Student] The loss increases by one.
- Yeah, so the answer is that
the loss increases by one.
And I think the reason
that we do this in practice
is because normally loss of
zero is kind of, has this nice
interpretation that
you're not losing at all,
so that's nice, so I think your answers
wouldn't really change,
you would end up finding
the same classifier
if you actually looped
over all the categories,
but just by convention
we omit the correct class
so that our minimum loss is zero.
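To see where the extra one comes from, here is a small sketch that reuses the cat-image scores from earlier (the ordering [cat, car, frog] is an assumption): the true class compared with itself gives max(0, 0 + 1) = 1.

```python
scores = [3.2, 5.1, -1.7]   # [cat, car, frog]; cat is the true class
correct_class = 0

# Usual convention: skip the true class.
loss_skip = sum(max(0.0, s - scores[correct_class] + 1.0)
                for j, s in enumerate(scores) if j != correct_class)

# Alternative: sum over every class, including the true one.
loss_all = sum(max(0.0, s - scores[correct_class] + 1.0) for s in scores)

print(round(loss_skip, 2), round(loss_all, 2))
```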
So another question, what if we used mean
instead of sum here?
- [Student] Doesn't change.
- Yeah, the answer is
that it doesn't change.
So the number of classes is
going to be fixed ahead of time
when we select our data set,
so that's just rescaling
the whole loss function by a constant,
so it doesn't really matter,
it'll sort of wash out
with all the other scale things
because we don't actually
care about the true values
of the scores, or the
true value of the loss
for that matter.
So now here's another
example, what if we change
this loss formulation and we
actually added a square term
on top of this max?
Would this end up being the same problem
or would this be a different
classification algorithm?
- [Student] Different.
- Yes, this would be different.
So here the idea is that
we're kind of changing
the trade-offs between goodness and badness
in kind of a nonlinear way,
so this would end up actually computing
a different loss function.
This idea of a squared hinge
loss actually does get used
sometimes in practice, so
that's kind of another trick
to have in your bag when you're making up
your own loss functions
for your own problems.
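A small sketch of how the squared version changes the ranking of mistakes. The two score vectors are hypothetical, chosen so that the plain hinge loss cannot tell them apart:

```python
def hinge_terms(scores, correct_class, margin=1.0):
    """Per-class hinge terms max(0, s_j - s_correct + margin) over incorrect classes."""
    return [max(0.0, s - scores[correct_class] + margin)
            for j, s in enumerate(scores) if j != correct_class]

def svm_loss(scores, correct_class):
    return sum(hinge_terms(scores, correct_class))

def squared_svm_loss(scores, correct_class):
    return sum(t ** 2 for t in hinge_terms(scores, correct_class))

# Hypothetical scores for an example whose true class is index 0.
two_small_violations = [1.0, 1.5, 0.5]   # hinge terms 1.5 and 0.5
one_large_violation = [1.0, 2.0, -5.0]   # hinge terms 2.0 and 0.0

print(svm_loss(two_small_violations, 0), svm_loss(one_large_violation, 0))
print(squared_svm_loss(two_small_violations, 0), squared_svm_loss(one_large_violation, 0))
```

The linear hinge gives both cases a loss of 2.0, while the squared hinge gives 2.5 and 4.0, so it penalizes the single large violation more heavily.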
So now you'll end up,
oh, was there a question?
- [Student] Why would
you use a squared loss
instead of a non-squared loss?
- Yeah, so the question is
why would you ever consider
using a squared loss instead
of a non-squared loss?
And the whole point of a loss function
is to kind of quantify how
bad are different mistakes.
And if the classifier is making
different sorts of mistakes,
how do we weigh off the
different trade-offs
between different types
of mistakes the classifier
might make?
So if you're using a squared loss,
that sort of says that things
that are very, very bad
are now going to be squared bad
so that's like really, really bad,
like we don't want anything
that's totally
catastrophically misclassified,
whereas if you're using this hinge loss,
the penalty grows only linearly
with how wrong we are, so
if an example is a lot
wrong, and we make it
a little bit less wrong,
that's kind of the same
goodness as taking an example
which was only a little bit
wrong and making it
a little bit more right.
So that's a little bit hand wavy,
but this idea of using
a linear versus a square
is a way to quantify how much we care
about different categories of errors.
And this is definitely something
that you should think about
when you're actually applying
these things in practice,
because the loss function is the way
that you tell your algorithm
what types of errors
you care about and what types of errors
it should trade off against.
So that's actually super
important in practice
depending on your application.
So here's just a little snippet
of sort of vectorized code
in numpy, and you'll end up implementing
something like this for
the first assignment,
but this kind of gives you
the sense that this sum
is actually like pretty easy
to implement in numpy, it
only takes a couple lines
of vectorized code.
And you can see in practice,
like one nice trick
is that we can actually go in
here and zero out the margins
corresponding to the correct class,
and that's one nice
vectorized trick to
iterate over all but one class.
You just kind of zero out
the one you want to skip
and then compute the sum
anyway, so that's a nice trick
you might consider
using on the assignment.
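The slide's snippet is not reproduced in this transcript, so what follows is a sketch of the trick as described, not necessarily the exact code shown. The function name `svm_loss_vectorized` and the identity-matrix W (used only so the scores equal x) are assumptions:

```python
import numpy as np

def svm_loss_vectorized(x, y, W, margin=1.0):
    """Multiclass SVM loss for one example x with integer label y."""
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0          # zero out the true class so it is skipped in the sum
    return np.sum(margins)

# Toy setup: identity W so that the scores are just x, reusing the cat-image scores.
W = np.eye(3)
x = np.array([3.2, 5.1, -1.7])
loss = svm_loss_vectorized(x, y=0, W=W)
print(round(loss, 2))  # 2.9
```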
So now, another question
about this loss function.
Suppose that you were
lucky enough to find a W
that has loss of zero,
you're not losing at all,
you're totally winning,
this loss function is crushing it,
but then there's a
question, is this W unique
or were there other Ws
that could also have achieved zero loss?
- [Student] There are other Ws.
- Answer, yeah, so there
are definitely other Ws.
And in particular, because
we talked a little bit
about this thing of scaling
the whole problem up or down
depending on W, so you
could actually take W
multiplied by two and this
doubled W (Is it quad U now?
I don't know.)
[laughing]
This would also achieve zero loss.
So as a concrete example of this,
you can go back to your favorite example
and maybe work through the numbers
a little bit later,
but if you're taking W and we double W,
then the margins between the
correct and incorrect scores
will also double.
So that means that if all these margins
were already greater than
one, and we doubled them,
they're still going to
be greater than one,
so you'll still have zero loss.
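A quick numerical check of the doubling argument, with hypothetical toy weights and input chosen so that the true class (index 0) already wins by more than the margin:

```python
import numpy as np

def svm_loss(scores, correct_class, margin=1.0):
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0
    return margins.sum()

# Hypothetical weights (3 classes, 2 features) and one input.
W = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.5]])
x = np.array([3.0, 1.0])

loss_w = svm_loss(W.dot(x), correct_class=0)            # scores [6.0, 1.0, -2.5]
loss_2w = svm_loss((2.0 * W).dot(x), correct_class=0)   # scores [12.0, 2.0, -5.0]
print(loss_w, loss_2w)
```

Both losses are zero: doubling W doubles every score difference, so margins that already exceeded one still do.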
And this is kind of interesting,
because if our loss function
is the way that we tell our
classifier which W we want
and which W we care about,
this is a little bit weird,
now there's this inconsistency
and how is the classifier to choose
between these different versions of W
that all achieve zero loss?
And that's because what we've done here
is written down only a
loss in terms of the data,
and we've only told our classifier
that it should try to find the W
that fits the training data.
But really in practice,
we don't actually care
that much about fitting the training data,
the whole point of machine learning
is that we use the training
data to find some classifier
and then we'll apply
that thing on test data.
So we don't really care about
the training data performance,
we really care about the performance
of this classifier on test data.
So as a result, if the only thing
we're telling our classifier to do
is fit the training data,
then we can lead ourselves
into some of these weird
situations sometimes,
where the classifier might
have unintuitive behavior.
So a concrete, canonical example
of this sort of thing,
by the way, this is not
linear classification anymore,
this is a little bit of a more general
machine learning concept,
is that suppose we have this
data set of blue points,
and we're going to fit some
curve to the training data,
the blue points,
then if the only thing we've
told our classifier to do
is to try and fit the training data,
it might go in and have very wiggly curves
to try to perfectly classify
all of the training data points.
But this is bad, because
we don't actually care
about this performance,
we care about the
performance on the test data.
So now if we have some new data come in
that sort of follows the same trend,
then this very wiggly blue line
is going to be totally wrong.
And in fact, what we
probably would have preferred
the classifier to do was maybe predict
this straight green line,
rather than this very complex wiggly line
to perfectly fit all the training data.
And this is a core fundamental problem
in machine learning,
and the way we usually solve it,
is this concept of regularization.
So here we're going to
add an additional term
to the loss function.
In addition to the data loss,
which will tell our
classifier that it should fit
the training data,
we'll also typically add
another term to the loss function
called a regularization term,
which encourages the model
to somehow pick a simpler W,
where the concept of simple
kind of depends on the task and the model.
There's this whole idea of Occam's Razor,
which is this fundamental
idea in scientific discovery
more broadly, which is that
if you have many different
competing hypotheses, that could explain
your observations,
you should generally
prefer the simpler one,
because that's the explanation
that is more likely
to generalize to new
observations in the future.
And the way we
operationalize this intuition
in machine learning is
typically through some explicit
regularization penalty
that's often written down as R.
So then your standard loss function
usually has these two terms,
a data loss and a regularization loss,
and there's some
hyper-parameter here, lambda,
that trades off between the two.
And we talked about hyper-parameters
and cross-validation in the last lecture,
so this regularization
hyper-parameter lambda
will be one of the more important ones
that you'll need to tune when training
these models in practice.
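Putting the two terms together, with the data loss averaged over the N training examples as before, the full objective would be written roughly as:

```latex
L(W) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W),\, y_i\bigr)}_{\text{data loss}}
     \;+\; \underbrace{\lambda R(W)}_{\text{regularization}}
```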
Question?
- [Student] What does that lambda R W term
have to do with [speaking faintly].
- Yeah, so the question is,
what's the connection
between this lambda R W term
and actually forcing this wiggly line
to become a straight green line?
I didn't want to go through
the derivation on this
because I thought it would
lead us too far astray,
but you can imagine,
maybe you're doing a regression problem,
in terms of different
polynomial basis functions,
and if you're adding
this regularization penalty,
maybe the model has access to polynomials
of very high degree, but
through this regularization term
you could encourage the
model to prefer polynomials
of lower degree, if they
fit the data properly,
or if they fit the data relatively well.
So you could imagine
there's two ways to do this,
either you can constrain your model class
to just not contain the more powerful,
more complex models, or you
can add this soft penalty
where the model still has
access to more complex models,
maybe high degree
polynomials in this case,
but you add this soft constraint
saying that if you want to
use these more complex models,
you need to overcome this penalty
for using their complexity.
So that's the connection here,
and while that is not quite linear classification,
this is the picture that
many people have in mind
when they think about
regularization at least.
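One way to make the soft-penalty picture concrete is polynomial regression with an L2 penalty on the coefficients. This is only a sketch: the data, the degree, the two lambda values and the helper `fit_polynomial` are all assumptions. It shows the penalty shrinking the coefficient vector; it does not literally select a lower degree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples from a roughly linear trend.
x = np.linspace(-1.0, 1.0, 10)
y = 2.0 * x + 0.1 * rng.standard_normal(x.size)

degree = 7
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^7

def fit_polynomial(X, y, lam):
    """Least squares with an L2 penalty lam * ||w||^2 on the coefficients."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

w_weak = fit_polynomial(X, y, lam=1e-6)    # almost unregularized
w_strong = fit_polynomial(X, y, lam=1.0)   # strongly penalized

print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The model class is the same in both fits (degree 7); only the penalty changes, which is the "soft constraint" described above.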
So there's actually a
lot of different types
of regularization that
get used in practice.
The most common one is
probably L2 regularization,
or weight decay.
But there's a lot of other
ones that you might see.
This L2 regularization is
just the Euclidean norm
of this weight vector W,
or sometimes the squared norm.
Or sometimes half the squared norm
because it makes your derivatives work out
a little bit nicer.
But the idea of L2 regularization
is you're just penalizing
the Euclidean norm
of this weight vector.
You might also sometimes
see L1 regularization,
where we're penalizing the
L1 norm of the weight vector,
and the L1 regularization
has some nice properties
like encouraging sparsity
in this matrix W.
Some other things you might see
would be this elastic net regularization,
which is some combination of L1 and L2.
You sometimes see max norm regularization,
penalizing the max norm
rather than the L1 or L2 norm.
But these sorts of regularizations
are things that you see
not just in deep learning,
but across many areas of machine learning
and even optimization more broadly.
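A sketch of three of these penalties as NumPy functions. The squared form of the L2 penalty is used here, and the mixing weight `beta` in the elastic net is an assumed parameter name; max norm is left out because its exact form is not given in the lecture.

```python
import numpy as np

def l2_penalty(W):
    """Squared Euclidean norm of the weights."""
    return np.sum(W ** 2)

def l1_penalty(W):
    """Sum of absolute values of the weights."""
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta=0.5):
    """A combination of the L2 and L1 penalties; beta is a hypothetical mixing weight."""
    return beta * l2_penalty(W) + l1_penalty(W)

W = np.array([[1.0, -2.0],
              [0.0, 0.5]])
print(l2_penalty(W), l1_penalty(W), elastic_net_penalty(W))  # 5.25 3.5 6.125
```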
In some later lectures, we'll also see
some types of regularization
that are more specific
to deep learning.
For example dropout, we'll
see in a couple lectures,
or batch normalization, stochastic depth,
these things have gotten kind of
crazy in recent years.
But the whole idea of regularization
is just anything that
you do to your model,
that sort of penalizes somehow
the complexity of the model,
rather than explicitly trying
to fit the training data.
Question?
[student speaking faintly]
Yeah, so the question is,
how does the L2 regularization
measure the complexity
of the model?
Thankfully we have an
example of that right here,
maybe we can walk through.
So here we maybe have
some training example, x,
and there's two different
Ws that we're considering.
So x is just this vector of four ones,
and we're considering these
two different possibilities
for W.
One has a single one
in the first entry
and three zeros,
and the other has this 0.25 spread across
the four different entries.
And now, when we're doing
linear classification,
we're really taking dot products
between our x and our W.
So in terms of linear classification,
these two Ws are the same,
because they give the same result
when dot producted with x.
But now the question is,
if you look at these two examples,
which one would L2 regularization prefer?
Yeah, so L2 regularization would prefer W2,
because it has a smaller norm.
So the answer is that the L2 regularization
measures complexity of the classifier
in this relatively coarse way,
where the idea is that,
remember the Ws in linear classification
had this interpretation of how much
does this value of the vector x
correspond to this output class?
So L2 regularization is saying
that it prefers to spread that influence
across all the different values in x.
Maybe this might be more robust,
in case you come up with xs that vary,
then our decisions are spread out
and depend on the entire x vector,
rather than depending
only on certain elements
of the x vector.
And by the way, L1 regularization
has this opposite interpretation.
So actually if we were
using L1 regularization,
then we would actually prefer W1 over W2,
because L1 regularization
has this different notion
of complexity, saying that
maybe the model is less complex
when more entries are zero,
so maybe we measure model
complexity by the number of non-zero entries
in the weight vector,
so the question of how
do we measure complexity
and how does L2 measure complexity?
They're kind of problem dependent.
And you have to think about
for your particular setup,
for your particular model and data,
how do you think that
complexity should be measured
on this task?
Question?
- [Student] So why would L1 prefer W1?
Don't they sum to the same one?
- Oh yes, you're right.
So in this case, L1 is actually the same
between these two.
But you could construct
a similar example to this
where W1 would be preferred
by L1 regularization.
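To make that comparison concrete, here is a rough sketch
in plain Python. The helper names are made up for illustration.
The first case uses the x, W1 and W2 from the lecture example.
The second case (x_alt, w1_alt, w2_alt) is a hypothetical example
constructed here, not one from the lecture, in which both weight
vectors still give the same score but L1 prefers the sparse one
while L2 prefers the spread-out one.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def l1_norm(w):
    """Sum of absolute values (the L1 penalty)."""
    return sum(abs(wi) for wi in w)


def l2_squared(w):
    """Sum of squares (the squared Euclidean norm, the L2 penalty)."""
    return sum(wi * wi for wi in w)


# Example from the lecture: same score, same L1, different L2.
x = [1.0, 1.0, 1.0, 1.0]
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.25, 0.25, 0.25, 0.25]

# Hypothetical variant (an assumption, not from the lecture):
# same score again, but now L1 and L2 disagree about which W is simpler.
x_alt = [2.0, 1.0, 1.0, 1.0]
w1_alt = [0.5, 0.0, 0.0, 0.0]
w2_alt = [0.2, 0.2, 0.2, 0.2]


def report(label, inputs, wa, wb):
    print(label)
    for name, w in (("W1", wa), ("W2", wb)):
        print(f"  {name}: score={dot(w, inputs):.3f} "
              f"L1={l1_norm(w):.3f} L2^2={l2_squared(w):.3f}")


report("lecture example", x, w1, w2)
report("hypothetical variant", x_alt, w1_alt, w2_alt)
```

In the lecture example the L1 penalties tie, which is the point
the student raised. In the variant the sparse vector has the
smaller L1 penalty and the spread-out vector has the smaller
L2 penalty.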
I guess the general intuition behind L1
is that it generally
prefers sparse solutions,
that it drives all your
entries of W to zero
for most of the entries,
except for a couple
where it's allowed to deviate from zero.
The way of measuring complexity for L1
is maybe the number of non-zero entries,
and then for L2, it thinks
that things that spread the W
across all the values are less complex.
So it depends on your data,
depends on your problem.
Oh and by the way, if
you're a hardcore Bayesian,
then using L2 regularization
has this nice interpretation
of MAP inference under a Gaussian prior
on the parameter vector.
I think there was a
homework problem about that
in 229, but we won't talk about that
for the rest of the quarter.
That's sort of my long, deep dive
into the multi-class SVM loss.
Question?
- [Student] Yeah, so I'm still confused
about what the kind of stuff I need to do
when the linear versus polynomial thing,
because the use of this loss function
isn't going to change the
fact that you're just doing,
you're looking at a
linear classifier, right?
- Yeah, so the question is that,
adding a regularization
is not going to change
the hypothesis class.
This is not going to change us
away from a linear classifier.
The idea is that maybe this example
of this polynomial regression
is definitely not linear regression.
That could be seen as linear regression
on top of a polynomial
expansion of the input,
and in which case, this
regularization sort of says
that you're not allowed
to use as many polynomial
coefficients as maybe you should have.
Right, so you can imagine this is like,
when you're doing polynomial regression,
you can write out a polynomial as f of x
equals A zero plus A one
x plus A two x squared
plus A three x whatever,
in that case your parameters, your Ws,
would be these As, in which case,
penalizing the W could force it
towards lower degree polynomials.
Except in the case of
polynomial regression,
you don't actually want to parameterize
in terms of As, there's
some other parameterization
that you want to use,
but that's the general idea,
that you're sort of penalizing
the parameters of the model
to force it towards the simpler hypotheses
within your hypothesis class.
And maybe we can take this offline
if that's still a bit confusing.
So then we've sort of seen
this multi-class SVM loss,
and just by the way as a side note,
this is one extension or
generalization of the SVM loss
to multiple classes,
there's actually a couple
different formulations
that you can see around in literature,
but I mean, my intuition is
that they all tend to work
similarly in practice,
at least in the context of deep learning.
So we'll stick with this
one particular formulation
of the multi-class SVM loss in this class.
But of course there's many
different loss functions
you might imagine.
And another really popular choice,
in addition to the multi-class SVM loss,
another really popular
choice in deep learning
is this multinomial logistic regression,
or a softmax loss.
And this one is probably
actually a bit more common
in the context of deep learning,
but I decided to present
this second for some reason.
So remember in the context
of the multi-class SVM loss,
we didn't actually have an interpretation
for those scores.
Remember, when we're
doing some classification,
our model F spits out these 10 numbers,
which are our scores for the classes,
and for the multi-class SVM,
we didn't actually give much
interpretation to those scores.
We just said that we want the true score,
the score of the correct class
to be greater than the incorrect classes,
and beyond that we don't really
say what those scores mean.
But now, for the multinomial
logistic regression
loss function, we actually
will endow those scores
with some additional meaning.
And in particular we're
going to use those scores
to compute a probability distribution
over our classes.
So we use this so-called softmax function
where we take all of our scores,
we exponentiate them so that
now they become positive,
then we re-normalize them by
the sum of those exponents
so now after we send our scores
through this softmax function,
now we end up with this
probability distribution,
where now we have
probabilities over our classes,
where each probability
is between zero and one,
and the sum of probabilities
across all classes
sum to one.
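A minimal sketch of that softmax function in plain Python
might look like the following. The function name and the
example scores are assumptions for illustration. Subtracting
the largest score before exponentiating is a common
numerical-stability trick that was not mentioned in the lecture;
it does not change the resulting probabilities.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities."""
    # Stability trick (an addition, not from the lecture): shifting
    # every score by the same constant leaves the output unchanged
    # but keeps exp() from overflowing on large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]  # all positive
    total = sum(exps)
    return [e / total for e in exps]              # normalize to sum to one


probs = softmax([1.0, 2.0, 3.0])  # made-up scores
print(probs)
print(sum(probs))
```

Each output is strictly between zero and one, and larger
scores map to larger probabilities.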
And now the interpretation
is that we want,
there's this computed
probability distribution
that's implied by our scores,
and we want to compare
this with the target
or true probability distribution.
So if we know that the thing is a cat,
then the target probability distribution
would put all of the
probability mass on cat,
so we would have probability
of cat equals one,
and zero probability for
all the other classes.
So now what we want to do is encourage
our computed probability distribution
that's coming out of this softmax function
to match this target
probability distribution
that has all the mass
on the correct class.
And the way that we do this,
I mean, you can derive this
equation in many ways,
you can do this as a KL divergence
between the target
and the computed probability distribution,
you can do this as a
maximum likelihood estimate,
but at the end of the day,
what we really want is
that the probability
of the true class is
high and as close to one as possible.
So then our loss will
now be the negative log
of the probability of the true class.
This is confusing 'cause
we're putting this
through multiple different things,
but remember we wanted the probability
to be close to one,
so now log is a monotonic
function, it goes like this,
and it turns out mathematically,
it's easier to maximize log
than it is to maximize
the raw probability,
so we stick with log.
And now log is monotonic,
so if we maximize log P of correct class,
that means we want that to be high,
but loss functions measure
badness not goodness
so we need to put in the minus one
to make it go the right way.
So now our loss function for softmax
is going to be the minus
log of the probability
of the true class.
Yeah, so that's the summary here,
is that we take our scores,
we run through the softmax,
and now our loss is this
minus log of the probability
of the true class.
Okay, so then if you look
at what this looks like
on a concrete example,
then we go back to our
favorite beautiful cat
with our three examples and
we've got these three scores
that are coming out of
our linear classifier,
and these scores are exactly
the way that they were
in the context of the SVM loss.
But now, rather than taking these scores
and putting them directly
into our loss function,
we're going to take them
all and exponentiate them
so that they're all positive,
and then we'll normalize them to make sure
that they all sum to one.
And now our loss will be the minus log
of the true class probability.
So that's the softmax loss,
also called multinomial
logistic regression.
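As a sketch of that computation, assuming the natural log
and three illustrative scores for cat, car and frog (the
specific numbers below are assumptions, since the transcript
does not state them), the loss for a cat image could be
computed like this. The function name is hypothetical.

```python
import math


def softmax_loss(scores, correct_class):
    """Minus log of the softmax probability of the correct class."""
    # log-sum-exp with a max shift for numerical stability
    # (a standard trick, not something from the lecture).
    shift = max(scores)
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    # -log(e^{s_y} / sum_j e^{s_j}) = log_sum_exp - s_y
    return log_sum_exp - scores[correct_class]


scores = [3.2, 5.1, -1.7]  # illustrative scores: cat, car, frog
loss = softmax_loss(scores, correct_class=0)  # the true class is cat
print(round(loss, 2))
```

With these numbers the cat gets roughly 13% of the probability
mass, so the loss comes out to about 2.04. If the true class
were car instead, the loss would be much smaller.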
So now we asked several questions
to try to gain intuition about
the multi-class SVM loss,
and it's useful to think about
some of the same questions
to contrast with the softmax loss.
So then the question is,
what's the min and max
value of the softmax loss?
Okay, maybe not so sure,
there's too many logs and sums and stuff
going on in here.
So the answer is that the min loss is zero
and the max loss is infinity.
And the way that you can see this,
the probability distribution that we want
is one on the correct class,
zero on the incorrect classes,
the way that we do that is,
so if that were the case,
then this thing inside the
log would end up being one,
because it's the
probability of the true class,
then log of one is zero, minus
log of one is still zero.
So that means that if we
got the thing totally right,
then our loss would be zero.
But by the way, in order to
get the thing totally right,
what would our scores have to look like?
Murmuring, murmuring.
So the scores would actually
have to go quite extreme,
like towards infinity.
So because we actually
have this exponentiation,
this normalization, the only way
we can actually get a
probability distribution of one
and zero, is actually
putting an infinite score
for the correct class,
and minus infinity score
for all the incorrect classes.
And computers don't do
so well with infinities,
so in practice, you'll
never get to zero loss
on this thing with finite precision.
But you still have this interpretation
that zero is the theoretical
minimum loss here.
And the maximum loss is unbounded.
So suppose that if we
had zero probability mass
on the correct class, then
you would have minus log
of zero, log of zero is minus infinity,
so minus log of zero
would be plus infinity,
so that's really bad.
But again, you'll never really get here
because the only way you can
actually get this probability
to be zero, is if e to the
correct class score is zero,
and that can only happen
if that correct class score
is minus infinity.
So again, you'll never
actually get to these minimum,
maximum values with finite precision.
So then, remember we had this debugging,
sanity check question in the
context of the multi-class SVM,
and we can ask the same for the softmax.
If all the Ss are small and about zero,
then what is the loss here?
Yeah, answer?
- [Student] Minus log one over C.
- So minus log of one over C?
I think that's, yeah,
so then it'd be minus log of one over C,
because log can flip the thing
so then it's just log of C.
Yeah, so it's just log of C.
And again, this is a nice debugging thing,
if you're training a model
with this softmax loss,
you should check at the first iteration.
If it's not log C, then
something's gone wrong.
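That sanity check is easy to write down. The sketch below
assumes 10 classes and all-zero scores at initialization,
and uses a hypothetical softmax_loss helper. The loss should
come out to log of the number of classes.

```python
import math


def softmax_loss(scores, correct_class):
    """Minus log of the softmax probability of the correct class."""
    shift = max(scores)
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum_exp - scores[correct_class]


num_classes = 10                 # assumed, e.g. CIFAR-10
scores = [0.0] * num_classes     # all scores about zero at the start
loss = softmax_loss(scores, correct_class=3)
expected = math.log(num_classes)

print(loss, expected)            # the two values should agree
```

The choice of correct class does not matter here, because
every class gets the same probability of one over C.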
So then we can compare and
contrast these two loss functions
a bit.
In terms of linear classification,
this setup looks the same.
We've got this W matrix
that gets multiplied
against our input to produce
this vector of scores,
and now the difference
between the two loss functions
is how we choose to interpret those scores
to quantitatively measure
the badness afterwards.
So for SVM, we were going to
go in and look at the margins
between the scores of the correct class
and the scores of the incorrect class,
whereas for this softmax
or cross-entropy loss,
we're going to go and compute
a probability distribution
and then look at the minus log probability
of the correct class.
So sometimes if you look at,
in terms of, never mind,
I'll skip that point.
[laughing]
So another question that's interesting
when contrasting these two
loss functions is thinking,
suppose that I've got this example point,
and if you change its scores,
so assume that we've got
three scores for this,
ignore the part on the bottom.
But remember if we go back to this example
where in the multi-class SVM loss,
when we had the car, and the
car score was much better
than all the incorrect classes,
then jiggling the scores
for that car image
didn't change the
multi-class SVM loss at all,
because the only thing that the SVM loss
cared about was getting that correct score
to be greater than a margin
above the incorrect scores.
But now the softmax loss
is actually quite different
in this respect.
The softmax loss actually
always wants to drive
that probability mass all the way to one.
So even if you're giving very high score
to the correct class, and very low score
to all the incorrect classes,
softmax will want you to pile
more and more probability mass
on the correct class, and
continue to push the score
of that correct class up towards infinity,
and the score of the incorrect classes
down towards minus infinity.
So that's the interesting difference
between these two loss
functions in practice.
That SVM, it'll get this
data point over the bar
to be correctly classified
and then just give up,
it doesn't care about
that data point any more.
Whereas softmax will just always
try to continually improve
every single data point
to get better and better
and better and better.
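Here is a small sketch of that difference. It assumes a
margin of one for the SVM loss and uses made-up scores where
the correct class (index 0) is already far ahead. The function
names are hypothetical.

```python
import math


def svm_loss(scores, correct_class, margin=1.0):
    """Multi-class hinge loss for one example (margin of 1 assumed)."""
    s_y = scores[correct_class]
    return sum(max(0.0, s - s_y + margin)
               for j, s in enumerate(scores) if j != correct_class)


def softmax_loss(scores, correct_class):
    """Minus log of the softmax probability of the correct class."""
    shift = max(scores)
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum_exp - scores[correct_class]


base = [10.0, -2.0, 3.0]      # correct class already well above the rest
jiggled = [12.0, -2.0, 3.0]   # push the correct score up a bit more

for name, s in (("base", base), ("jiggled", jiggled)):
    print(name, "svm:", svm_loss(s, 0), "softmax:", softmax_loss(s, 0))
```

The SVM loss is zero in both cases, so it has nothing left to
improve. The softmax loss is small but positive, and it keeps
shrinking as the correct score grows.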
So that's an interesting difference
between these two functions.
In practice, I think it tends
not to make a huge difference
which one you choose, they tend to perform
pretty similarly across,
at least a lot of deep
learning applications.
But it is very useful to keep
some of these differences
in mind.
Yeah, so to recap where
we've come to from here,
is that we've got some
data set of xs and ys,
we use our linear classifier
to get some score function,
to compute our scores
S, from our inputs, x,
and then we'll use a loss function,
maybe softmax or SVM or
some other loss function
to compute how quantitatively
bad were our predictions
compared to these ground truth targets, y.
And then we'll often
augment this loss function
with a regularization term,
that tries to trade off between
fitting the training data
and preferring simpler models.
So this is a pretty generic overview
of a lot of what we call
supervised learning,
and what we'll see in deep
learning as we move forward,
is that generally you'll want
to specify some function, f,
that could be very complex in structure,
specify some loss function that determines
how well your algorithm is doing,
given any value of the parameters,
some regularization term
for how to penalize model complexity
and then you combine these things together
and try to find the W
that minimizes this final loss function.
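Putting those pieces together, a sketch of the full loss for
a linear classifier might look like this. It assumes the
softmax loss, an L2 penalty, and a hypothetical reg_strength
parameter for the trade-off. The tiny data set and weights
are made up.

```python
import math


def class_scores(W, x):
    """Linear classifier: one row of W per class, scores = W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def softmax_loss(scores, correct_class):
    """Minus log of the softmax probability of the correct class."""
    shift = max(scores)
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum_exp - scores[correct_class]


def l2_penalty(W):
    """Sum of squared weights."""
    return sum(w * w for row in W for w in row)


def full_loss(W, X, Y, reg_strength):
    """Average data loss over the examples plus the weighted penalty."""
    data_loss = sum(softmax_loss(class_scores(W, x), y)
                    for x, y in zip(X, Y)) / len(X)
    return data_loss + reg_strength * l2_penalty(W)


# Made-up data: three 2-d examples, three classes.
X = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
Y = [0, 1, 2]
W = [[0.2, 0.1], [-0.3, 0.0], [0.1, -0.2]]

print(full_loss(W, X, Y, reg_strength=0.0))  # data loss only
print(full_loss(W, X, Y, reg_strength=0.5))  # data loss plus penalty
```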
But then the question is,
how do we actually go about doing that?
How do we actually find this
W that minimizes the loss?
And that leads us to the
topic of optimization.
So when we're doing optimization,
I usually think of things
in terms of walking
around some large valley.
So the idea is that you're
walking around this large valley
with different mountains
and valleys and streams
and stuff, and every
point on this landscape
corresponds to some setting
of the parameters W.
And you're this little guy who's
walking around this valley,
and you're trying to find,
and the height of each of these points,
sorry, is equal to the loss
incurred by that setting of W.
And now your job as this little man
walking around this landscape,
you need to somehow find
the bottom of this valley.
And this is kind of a
hard problem in general.
You might think, maybe I'm really smart
and I can think really hard
about the analytic properties
of my loss function, my
regularization all that,
maybe I can just write down the minimizer,
and that would sort of correspond
to magically teleporting
all the way to the bottom of this valley.
But in practice, once your
prediction function, f,
and your loss function
and your regularizer,
once these things get big and complex
and using neural networks,
there's really not much
hope in trying to write down
an explicit analytic solution
that takes you directly to the minima.
So in practice
we tend to use various
types of iterative methods
where we start with some solution
and then gradually improve it over time.
So the very first, stupidest thing
that you might imagine is random search,
that will just take a bunch of Ws,
sampled randomly, and throw
them into our loss function
and see how well they do.
So spoiler alert, this is
a really bad algorithm,
you probably shouldn't use this,
but at least it's one thing
you might imagine trying.
And we can actually do this,
we can actually try to
train a linear classifier
via random search, for CIFAR-10
and for this there's 10 classes,
so random chance is 10%,
and if we did some
number of random trials,
we eventually found just
through sheer dumb luck,
some setting of W that
got maybe 15% accuracy.
So it's better than random,
but state of the art is maybe 95%
so we've got a little
bit of gap to close here.
So again, probably don't
use this in practice,
but you might imagine
that this is something
you could potentially do.
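For reference, random search is only a few lines. The sketch
below uses a made-up toy loss standing in for the real
classifier loss, and the function names, the Gaussian sampling
and the trial count are all assumptions for illustration.

```python
import random


def random_search(loss_fn, dim, num_trials, rng):
    """Sample random weight vectors and keep the best one seen."""
    best_w, best_loss = None, float("inf")
    for _ in range(num_trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        loss = loss_fn(w)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


def toy_loss(w):
    """Stand-in loss (not a real classifier): minimized at all ones."""
    return sum((wi - 1.0) ** 2 for wi in w)


rng = random.Random(0)  # fixed seed so the run is repeatable
best_w, best_loss = random_search(toy_loss, dim=5, num_trials=1000, rng=rng)
print(best_loss)
```

The best loss only improves by luck, and nothing learned from
one sample is used to pick the next one.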
So in practice, maybe a better strategy
is actually using some
of the local geometry
of this landscape.
So if you're this little guy who's walking
around this landscape,
maybe you can't see directly the path
down to the bottom of the valley,
but what you can do is feel with your foot
and figure out what is the local geometry,
if I'm standing right here,
which way will take me
a little bit downhill?
So you can feel with your feet
and feel where is the slope of the ground
taking me down a little
bit in this direction?
And you can take a step in that direction,
and then you'll go down a little bit,
feel again with your feet to
figure out which way is down,
and then repeat over and over again
and hope that you'll end up at the bottom
of the valley eventually.
So this also seems like a
relatively simple algorithm,
but actually this one
tends to work really well
in practice if you get
all the details right.
So this is generally the
strategy that we'll follow
when training these large neural networks
and linear classifiers and other things.
So then, that was a little hand wavy,
so what is slope?
If you remember back
to your calculus class,
then at least in one dimension,
the slope is the derivative
of this function.
So if we've got some
one-dimensional function, f,
that takes in a scalar x,
and then outputs the height
of some curve, then we
can compute the slope
or derivative at any point by imagining,
if we take a small step,
h, in any direction,
take a small step, h, and
compare the difference
in the function value over that step
and then drag the step size to zero,
that will give us the
slope of that function
at that point.
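As a quick illustration of that limit, the sketch below
estimates the slope of a made-up function, x squared, at
x equals 3 for smaller and smaller step sizes h. The true
derivative there is 6. The function names are hypothetical.

```python
def slope_estimate(f, x, h):
    """Difference in function value over a step of size h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# The estimate approaches the true slope, 6, as h shrinks.
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, slope_estimate(square, 3.0, h))
```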
And this generalizes quite naturally
to multi-variable functions as well.
So in practice, our x
is maybe not a scalar
but a whole vector,
'cause remember, x
might be a whole vector,
so we need to generalize this notion
to multi-variable things.
And the generalization that
we use of the derivative
in the multi-variable
setting is the gradient,
so the gradient is a vector
of partial derivatives.
So the gradient will
have the same shape as x,
and each element of the
gradient will tell us
what is the slope of the function f,
if we move in that coordinate direction.
And the gradient turns out
to have these very nice properties,
so the gradient is now a
vector of partial derivatives,
but it points in the
direction of greatest increase
of the function and correspondingly,
if you look at the negative
gradient direction,
that gives you the direction
of greatest decrease
of the function.
And more generally, if you want to know,
what is the slope of my
landscape in any direction?
Then that's equal to the
dot product of the gradient
with the unit vector
describing that direction.
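A small sketch of that dot-product rule, using a made-up
two-variable function whose gradient is easy to write by hand.
The function and helper names are assumptions for illustration.

```python
import math


def f(p):
    """Toy function f(x, y) = x^2 + 3y."""
    x, y = p
    return x * x + 3.0 * y


def grad_f(p):
    """Hand-derived gradient of f: (2x, 3)."""
    x, _ = p
    return [2.0 * x, 3.0]


def directional_slope(grad, direction):
    """Slope along a direction: gradient dotted with the unit vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return sum(g * u for g, u in zip(grad, unit))


point = [1.0, 2.0]
gradient = grad_f(point)

print(directional_slope(gradient, [1.0, 0.0]))             # along the x axis
print(directional_slope(gradient, gradient))               # steepest increase
print(directional_slope(gradient, [-g for g in gradient])) # steepest decrease
```

The slope is largest along the gradient itself and most
negative along the negative gradient.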
So this gradient is super important,
because it gives you this
linear, first-order approximation
to your function at your current point.
So in practice, a lot of deep learning
is about computing
gradients of your functions
and then using those gradients
to iteratively update
your parameter vector.
So one naive way that you might imagine
actually evaluating this
gradient on a computer,
is using the method of finite differences,
going back to the limit
definition of gradient.
So here on the left, we
imagine that our current W
is this parameter vector
that maybe gives us some
current loss of maybe 1.25
and our goal is to
compute the gradient, dW,
which will be a vector
of the same shape as W,
and each slot in that
gradient will tell us
how much will the loss
change if we move a tiny,
infinitesimal amount in
that coordinate direction.
So one thing you might imagine
is just computing these
finite differences,
that we have our W, we
might try to increment
the first element of
W by a small value, h,
and then re-compute the
loss using our loss function
and our classifier and all that.
And maybe in this setting,
if we move a little bit
in the first dimension,
then our loss will decrease
a little bit from 1.2534 to 1.25322.
And then we can use this limit definition
to come up with this finite
differences approximation
to the gradient in this first dimension.
And now you can imagine
repeating this procedure
in the second dimension,
where now we take the first dimension,
set it back to the original value,
and now increment the second
direction by a small step.
And again, we compute the loss
and use this finite
differences approximation
to compute an approximation
to the gradient
in the second slot.
And now repeat this again for the third,
and on and on and on.
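The procedure just described can be sketched as a short loop.
This is a one-sided finite difference, one coordinate at a
time. The function names, the step h and the toy loss are
assumptions for illustration.

```python
def numerical_gradient(f, w, h=1e-5):
    """Approximate the gradient of f at w by finite differences."""
    base = f(w)                  # loss at the current W
    grad = []
    for i in range(len(w)):
        bumped = list(w)         # copy, so other entries keep original values
        bumped[i] += h           # nudge only coordinate i
        grad.append((f(bumped) - base) / h)
    return grad


def sum_of_squares(w):
    """Toy loss whose true gradient is 2w."""
    return sum(wi * wi for wi in w)


print(numerical_gradient(sum_of_squares, [1.0, -2.0, 3.0]))
```

Note that this needs one extra evaluation of f for every
entry of W, which is what makes it so slow for big models.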
So this is actually a terrible idea
because it's super slow.
So you might imagine that
computing this function, f,
might actually be super
slow if it's a large,
convolutional neural network.
And this parameter vector, W,
probably will not have 10
entries like it does here,
it might have tens of millions
or even hundreds of millions
for some of these large,
complex deep learning models.
So in practice, you'll
never want to compute
your gradients with
finite differences,
'cause you'd have to wait
for hundreds of millions
potentially of function evaluations
to get a single gradient,
and that would be super slow
and super bad.
But thankfully we don't have to do that.
Hopefully you took a calculus course
at some point in your lives,
so you know that thanks to these guys,
we can just write down the
expression for our loss
and then use the magical
hammer of calculus
to just write down an expression
for what this gradient should be.
And this'll be way more efficient
than trying to compute it numerically
via finite differences.
One, it'll be exact,
and two, it'll be much faster
since we just need to compute
this single expression.
So what this would look like is now,
if we go back to this
picture of our current W,
rather than iterating over
all the dimensions of W,
we'll figure out ahead of time
what is the analytic
expression for the gradient,
and then just write it down
and go directly from the W
and compute the dW or
the gradient in one step.
And that will be much better in practice.
So in summary, this numerical gradient
is something that's
simple and makes sense,
but you won't really use it in practice.
In practice, you'll always
take an analytic gradient
and use that
when actually performing
these gradient computations.
However, one interesting note is that
these numeric gradients
are actually a very useful
debugging tool.
Say you've written some code,
and you wrote some code
that computes the loss
and the gradient of the loss,
then how do you debug this thing?
How do you make sure that
this analytic expression
that you derived and wrote down in code
is actually correct?
So a really common debugging
strategy for these things
is to use the numeric gradient as a way,
as sort of a unit test to make sure
that your analytic gradient was correct.
Again, because this is
super slow and inexact,
then when doing this
numeric gradient checking,
as it's called, you'll tend
to scale down the number of parameters
of the problem so that it actually runs
in a reasonable amount of time.
But this ends up being a super
useful debugging strategy
when you're writing your
own gradient computations.
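A gradient check along those lines might look like the sketch
below. The toy loss, its hand-derived gradient, and the
relative-error measure are assumptions for illustration. It
also uses a centered difference, which is a common refinement
of the one-sided version described earlier rather than
something from the lecture.

```python
def loss(w):
    """Toy loss: sum of squares plus a cross term between w[0] and w[1]."""
    return sum(wi * wi for wi in w) + w[0] * w[1]


def analytic_gradient(w):
    """Gradient of the toy loss, derived by hand."""
    grad = [2.0 * wi for wi in w]
    grad[0] += w[1]
    grad[1] += w[0]
    return grad


def numerical_gradient(f, w, h=1e-5):
    """Centered finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def max_relative_error(a, b):
    """Largest per-coordinate relative difference between two gradients."""
    return max(abs(ai - bi) / max(1e-8, abs(ai) + abs(bi))
               for ai, bi in zip(a, b))


w = [0.5, -1.5, 2.0]  # small problem so the check runs quickly
analytic = analytic_gradient(w)
numeric = numerical_gradient(loss, w)
print(max_relative_error(analytic, numeric))
```

A tiny relative error suggests the analytic gradient is
right. A large one usually points to a bug in the derivation
or in the code.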
So this is actually very
commonly used in practice,
and you'll do this on
your assignments as well.
So then once we know how
to compute the gradient,
then it leads us to this
super simple algorithm
that's like three lines, but
turns out to be at the heart
of how we train even these very biggest,
most complex deep learning algorithms,
and that's gradient descent.
So gradient descent is
first we initialize our W
as some random thing, then while true,
we'll compute our loss and our gradient
and then we'll update our weights
in the opposite of the gradient direction,
'cause remember that the gradient
was pointing in the direction
of greatest increase
of the function, so minus gradient
points in the direction
of greatest decrease,
so we'll take a small
step in the direction
of minus gradient, and
just repeat this forever
and eventually your network will converge
and you'll be very happy, hopefully.
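Written out, that loop is roughly the following sketch. It
uses a made-up bowl-shaped loss with a known minimum so you
can see it converge, and it runs for a fixed number of steps
instead of looping forever. All names and numbers here are
assumptions for illustration.

```python
import random


def loss_and_gradient(w):
    """Toy bowl-shaped loss with its minimum at (3, -1)."""
    target = [3.0, -1.0]
    loss = sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return loss, grad


def gradient_descent(w, step_size, num_steps):
    """Vanilla gradient descent: step against the gradient each time."""
    for _ in range(num_steps):
        loss, grad = loss_and_gradient(w)
        w = [wi - step_size * gi for wi, gi in zip(w, grad)]
    return w


rng = random.Random(0)
w_init = [rng.gauss(0.0, 1.0) for _ in range(2)]  # random starting point
w_final = gradient_descent(w_init, step_size=0.1, num_steps=100)
print(w_final)  # should be close to (3, -1)
```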
But this step size is
actually a hyper-parameter,
and this tells us that every
time we compute the gradient,
how far do we step in that direction.
And this step size, also
sometimes called a learning rate,
is probably one of the
single most important
hyper-parameters that you need to set
when you're actually training
these things in practice.
Actually for me when I'm
training these things,
trying to figure out this step size
or this learning rate, is
the first hyper-parameter
that I always check.
Things like model size or
regularization strength
I leave until a little bit later,
and getting the learning
rate or the step size correct
is the first thing that I
try to set at the beginning.
So pictorially what this looks like
here's a simple example in two dimensions.
So here we've got maybe this bowl
that's showing our loss function
where this red region in the center
is this region of low
loss we want to get to
and these blue and green
regions towards the edge
are higher loss that we want to avoid.
So now we're going to start off our W
at some random point in the space,
and then we'll compute the
negative gradient direction,
which will hopefully
point us in the direction
of the minima eventually.
And if we repeat this over and over again,
we'll hopefully eventually
get to the exact minima.
And what this looks like in practice is,
oh man, we've got this
mouse problem again.
So what this looks like in practice
is that if we repeat this
thing over and over again,
then we will start off at some point
and eventually, taking tiny
gradient steps each time,
you'll see that the parameter
will arc in toward the center,
this region of minima,
and that's really what you want,
because you want to get to low loss.
And by the way, as a bit of a teaser,
we saw in the previous slide,
this example of very
simple gradient descent,
where at every step, we're
just stepping in the direction
of the negative gradient.
But in practice, over the
next couple of lectures,
we'll see that there are
slightly fancier step,
what they call these update rules,
where you can take slightly fancier things
to incorporate gradients
across multiple time steps
and stuff like that, that tend
to work a little bit better
in practice and are
used much more commonly
than this vanilla gradient descent
when training these things in practice.
And then, as a bit of a preview,
we can look at some of these
slightly fancier methods
on optimizing the same problem.
So again, the black will be
this same gradient computation,
and these, I forgot which color they are,
but these two other curves
are using slightly fancier update rules
to decide exactly how to
use the gradient information
to make our next step.
So one of these is gradient
descent with momentum,
the other is this Adam optimizer,
and we'll see more details about those
later in the course.
But the idea is that we have
this very basic algorithm
called gradient descent,
where we use the gradient
at every time step
to determine where to step next,
and there exist different
update rules which tell us
how exactly do we use
that gradient information.
But it's all the same basic algorithm
of trying to go downhill
at every time step.
But there's actually
one more little wrinkle
that we should talk about.
So remember that we
defined our loss function,
we defined a loss that computes how bad
is our classifier doing at
any single training example,
and then we said that our
full loss over the data set
was going to be the average loss
across the entire training set.
But in practice, this N
could be very very large.
If we're using the ImageNet
data set for example,
that we talked about in the first lecture,
then N could be like 1.3 million,
so actually computing this loss
could be actually very expensive
and require computing perhaps
millions of evaluations
of this function.
So that could be really slow.
And actually, because the
gradient is a linear operator,
when you actually try
to compute the gradient
of this expression, you see
that the gradient of our loss
is now the sum of the
gradient of the losses
for each of the individual terms.
So now if we want to
compute the gradient again,
it sort of requires us to iterate
over the entire training data set
all N of these examples.
So if our N was like a million,
this would be super super slow,
and we would have to wait
a very very long time
before we make any individual update to W.
So in practice, we tend to use
what is called stochastic
gradient descent,
where rather than computing
the loss and gradient
over the entire training set,
instead at every iteration,
we sample some small set
of training examples, called a minibatch.
Usually this is a power
of two by convention,
like 32, 64, 128 are common numbers,
and then we'll use this small minibatch
to compute an estimate of the full sum,
and an estimate of the true gradient.
And now this is stochastic
because you can view this
as maybe a Monte Carlo
estimate of some expectation
of the true value.
So now this makes our
algorithm slightly fancier,
but it's still only four lines.
So now it's while true, sample
some random minibatch of data,
evaluate your loss and
gradient on the minibatch,
and now make an update on your parameters
based on this estimate of the loss,
and this estimate of the gradient.
And again, we'll see
slightly fancier update rules
of exactly how to integrate
multiple gradients
over time, but this is the
basic training algorithm
that we use for pretty much
all deep neural networks
in practice.
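A minimal sketch of that minibatch loop is below. It fits a
single weight on made-up, noise-free data where the right
answer is 2, using a squared-error loss rather than a
classifier loss to keep it short. The names, the batch size
of 32 and the step size are assumptions for illustration.

```python
import random


def minibatch_loss_and_gradient(w, batch):
    """Average squared error and its gradient over one minibatch."""
    n = len(batch)
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / n
    return loss, grad


def sgd(data, w, step_size, batch_size, num_steps, rng):
    """Stochastic gradient descent with randomly sampled minibatches."""
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)         # sample a minibatch
        loss, grad = minibatch_loss_and_gradient(w, batch)
        w -= step_size * grad                        # step against the estimate
    return w


rng = random.Random(0)
# Made-up data set: y = 2x exactly, with x drawn from [-1, 1].
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x) for x in xs]

w_fit = sgd(data, w=0.0, step_size=0.1, batch_size=32, num_steps=500, rng=rng)
print(w_fit)  # should be close to 2
```

Each update only looks at 32 examples instead of all 1000,
so the gradient is a noisy estimate, but the updates are
much cheaper.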
So we have another interactive web demo
actually playing around
with linear classifiers,
and training these things via
stochastic gradient descent,
but given how miserable
the web demo was last time,
I'm not actually going to open the link.
Instead, I'll just play this video.
[laughing]
But I encourage you to go check this out
and play with it online,
because it actually helps
to build some intuition
about linear classifiers and training them
via gradient descent.
So here you can see on the left,
we've got this problem
where we're categorizing
three different classes,
and we've got these
green, blue and red points
that are our training samples
from these three classes.
And now we've drawn
the decision boundaries
for these classes, which are
the colored background regions,
as well as these directions,
giving you the direction of
increase for the class scores
for each of these three classes.
And now if you see, if
you actually go and play
with this thing online,
you can see that we can
go in and adjust the Ws
and changing the values of the Ws
will cause these decision
boundaries to rotate.
If you change the biases,
then the decision boundaries
will not rotate, but will
instead move side to side
or up and down.
Then we can actually make steps
that are trying to minimize this loss,
or you can change the step
size with this slider.
You can hit this button
to actually run the thing.
So now with a big step size,
we're running gradient descent right now,
and these decision boundaries
are flipping around
and trying to fit the data.
So it's doing okay now,
but we can actually change
the loss function in real time
between these different SVM formulations
and the different softmax.
And you can see that as you flip
between these different
formulations of loss functions,
it's generally doing the same thing.
Our decision regions are
mostly in the same place,
but exactly how they end
up relative to each other
and exactly what the trade-offs are
between categorizing
these different things
changes a little bit.
So I really encourage you to go online
and play with this thing to
try to get some intuition
for what it actually looks like
to try to train these linear classifiers
via gradient descent.
Now as an aside, I'd like
to talk about another idea,
which is that of image features.
So so far we've talked
about linear classifiers,
which is just maybe taking
our raw image pixels
and then feeding the raw pixels themselves
into our linear classifier.
But as we talked about
in the last lecture,
this is maybe not such
a great thing to do,
because of things like
multi-modality and whatnot.
So in practice, actually
feeding raw pixel values
into linear classifiers
tends to not work so well.
So what was actually common
before the dominance
of deep neural networks,
was instead to have
this two-stage approach,
where first, you would take your image
and then compute various
feature representations
of that image, that are maybe computing
different kinds of quantities
relating to the appearance
of the image,
and then concatenate these
different feature vectors
to give you some feature
representation of the image,
and now this feature
representation of the image
would be fed into a linear classifier,
rather than feeding the
raw pixels themselves
into the classifier.
And the motivation here is that,
so imagine we have a
training data set on the left
with red points in the middle
and blue points around them.
And for this kind of data set,
there's no way that we can
draw a linear decision boundary
to separate the red points
from the blue points.
And we saw more examples of
this in the last lecture.
But if we use a clever feature transform,
in this case transforming
to polar coordinates,
then now after we do
the feature transform,
then this complex data
set actually might become
linearly separable,
and actually could be classified correctly
by a linear classifier.
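To make the polar coordinate idea concrete, here is a small NumPy
sketch. The toy data (one class inside the unit circle, the other in
a ring around it) and the hand-picked threshold on the radius are
assumptions for illustration only.

```python
import numpy as np

def to_polar(points):
    """Map (x, y) points to (r, theta) features."""
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    return np.stack([r, theta], axis=1)

# Toy data: class 0 inside radius 1, class 1 in a ring between radius 2 and 3.
rng = np.random.default_rng(0)
n = 200
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
points = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# In (x, y) no single line separates the classes. In (r, theta) a linear
# rule w . [r, theta] + b > 0 with hand-picked w and b is enough.
features = to_polar(points)
w = np.array([1.0, 0.0])
b = -1.5
predictions = (features @ w + b > 0).astype(int)
polar_accuracy = (predictions == labels).mean()
```

The classifier itself is still linear. All of the extra power comes
from the choice of feature transform.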
And the whole trick here
now is to figure out
what is the right feature transform
that is computing the right quantities
for the problem that you care about.
So for images, maybe
converting your pixels
to polar coordinates, doesn't make sense,
but you actually can try to write down
feature representations of images
that might make sense,
and actually might help you out
and might do better than
putting in raw pixels
into the classifier.
So one example of this kind
of feature representation
that's super simple, is this
idea of a color histogram.
So you'll take maybe each pixel,
you'll take this hue color spectrum
and divide it into buckets
and then for every pixel,
you'll map it into one
of those color buckets
and then count up how many pixels
fall into each of these different buckets.
So this tells you globally
what colors are in the image.
Maybe in this example of a frog,
this feature vector would tell us
there's a lot of green stuff,
and maybe not a lot of
purple or red stuff.
And this is kind of a simple feature
vector that you might see
in practice.
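A minimal sketch of such a hue histogram, assuming the image is a
float RGB array with values in [0, 1]. The function name and the
choice of eight buckets are illustrative, not something fixed by the
lecture.

```python
import colorsys
import numpy as np

def hue_histogram(image, num_bins=8):
    """Fraction of pixels per hue bucket for an (H, W, 3) RGB image in [0, 1]."""
    pixels = image.reshape(-1, 3)
    # colorsys works one pixel at a time: slow, but fine for a small example.
    hues = np.array([colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in pixels])
    counts, _ = np.histogram(hues, bins=num_bins, range=(0.0, 1.0))
    return counts / counts.sum()

# Toy "frog" image: all green except for a single red pixel.
image = np.zeros((4, 4, 3))
image[..., 1] = 1.0
image[0, 0] = [1.0, 0.0, 0.0]

hist = hue_histogram(image)
```

Here the green pixels land in bucket 2 and the red pixel in bucket 0.
Note that the histogram throws away where in the image each color
occurred.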
Another common feature vector that we saw
before the rise of neural networks,
or before the dominance of neural networks
was this histogram of oriented gradients.
So remember from the first lecture,
that Hubel and Wiesel
found these oriented edges
are really important in
the human visual system,
and this histogram of oriented gradients
feature representation tries to capture
the same intuition and
measure the local orientation
of edges on the image.
So what this thing is going to do,
is take our image and then divide it
into these little eight
by eight pixel regions.
And then within each of those
eight by eight pixel regions,
we'll compute what is the
dominant edge direction
of each pixel, quantize
those edge directions
into several buckets and then
within each of those regions,
compute a histogram over these
different edge orientations.
And now your full feature vector
will be these different
bucketed histograms
of edge orientations
across all the different
eight by eight regions
in the image.
So this is in some sense dual
to the color histogram
feature that we saw before.
So color histogram is
saying, globally, what colors
exist in the image,
and this is saying, overall,
what types of edge information
exist in the image.
And even localized to
different parts of the image,
what types of edges exist
in different regions.
So maybe for this frog on the left,
you can see he's sitting on a leaf,
and these leaves have these
dominant diagonal edges,
and if you visualize the
histogram of oriented gradient
features, then you can
see that in this region,
we've got a lot of diagonal edges,
that this histogram of oriented gradient
feature representation is capturing.
So this was a super common
feature representation
and was used a lot for object recognition
actually not too long ago.
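The steps above can be sketched in NumPy as follows. This is a
simplified version under several assumptions: it bins the gradient
direction (which is perpendicular to the edge) weighted by gradient
magnitude, uses nine unsigned orientation buckets, and skips the
block normalization that full HOG implementations add.

```python
import numpy as np

def hog_features(gray, cell=8, num_bins=9):
    """Per-cell orientation histograms for an (H, W) image.

    H and W are assumed to be multiples of `cell`.
    Returns an array of shape (H // cell, W // cell, num_bins).
    """
    gy, gx = np.gradient(gray)                   # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % np.pi     # unsigned direction in [0, pi)
    bin_index = (orientation / np.pi * num_bins).astype(int)
    bin_index = np.minimum(bin_index, num_bins - 1)

    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, num_bins))
    for i in range(rows):
        for j in range(cols):
            region = (slice(i * cell, (i + 1) * cell),
                      slice(j * cell, (j + 1) * cell))
            # Each pixel votes for its orientation bucket, weighted by edge strength.
            hist[i, j] = np.bincount(bin_index[region].ravel(),
                                     weights=magnitude[region].ravel(),
                                     minlength=num_bins)
    return hist

# Toy image: brightness increases along the diagonal, so every cell
# should be dominated by one diagonal orientation bucket.
ii, jj = np.meshgrid(np.arange(16.0), np.arange(16.0), indexing="ij")
gray = ii + jj
cells = hog_features(gray)
feature_vector = cells.ravel()   # concatenate the histograms of all cells
```

Unlike the color histogram, this keeps coarse spatial information,
because each eight by eight region gets its own histogram.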
Another feature representation
that you might see out there
is this idea of bag of words.
So this is taking inspiration
from natural language processing.
So if you've got a paragraph,
then a way that you might
represent a paragraph
by a feature vector is
counting up the occurrences
of different words in that paragraph.
So we want to take that
intuition and apply it
to images in some way.
But the problem is that
there's no really simple,
straightforward analogy
of words to images,
so we need to define our own vocabulary
of visual words.
So we take this two-stage approach,
where first we'll get a bunch of images,
sample a whole bunch of tiny random crops
from those images and then cluster them
using something like k-means
to come up with these
different cluster centers
that are maybe representing
different types
of visual words in the images.
So if you look at this
example on the right here,
this is a real example of clustering
actually different image
patches from images,
and you can see that after
this clustering step,
our visual words capture
these different colors,
like red and blue and yellow,
as well as these different
types of oriented edges
in different directions,
which is interesting that
now we're starting to see
these oriented edges
come out from the data
in a data-driven way.
And now, once we've got
this set of visual words,
also called a codebook,
then we can encode our
image by trying to say,
for each of these visual words,
how much does this visual
word occur in the image?
And now this gives us, again,
some slightly different information
about what is the visual
appearance of this image.
And actually this is a type
of feature representation
that Fei-Fei worked on when
she was a grad student,
so this is something
that you saw in practice
not too long ago.
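Here is a rough NumPy sketch of both stages: building a codebook by
clustering random patches, then encoding an image as a histogram over
that codebook. The helper names, the patch size, the codebook size
and the random toy images are all assumptions for illustration. The
k-means here is a bare-bones version, and real pipelines typically
used more careful patch descriptors.

```python
import numpy as np

def extract_patches(image, rng, patch=4, count=50):
    """Sample `count` random patch-by-patch crops and flatten each to a vector."""
    h, w = image.shape[:2]
    ys = rng.integers(0, h - patch + 1, size=count)
    xs = rng.integers(0, w - patch + 1, size=count)
    return np.stack([image[y:y + patch, x:x + patch].ravel()
                     for y, x in zip(ys, xs)])

def nearest(data, centers):
    """Index of the closest center (squared Euclidean distance) for each row."""
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(data, k, rng, iters=20):
    """Plain k-means; the resulting cluster centers are the visual words."""
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest(data, centers)
        for c in range(k):
            if np.any(assign == c):               # leave empty clusters unchanged
                centers[c] = data[assign == c].mean(axis=0)
    return centers

def encode(image, codebook, rng):
    """Histogram of how often each visual word occurs among sampled patches."""
    words = nearest(extract_patches(image, rng), codebook)
    counts = np.bincount(words, minlength=len(codebook))
    return counts / counts.sum()

rng = np.random.default_rng(0)
images = [rng.random((16, 16, 3)) for _ in range(5)]   # stand-ins for real images

# Stage 1: build the codebook from patches pooled over many images.
all_patches = np.concatenate([extract_patches(im, rng) for im in images])
codebook = kmeans(all_patches, k=8, rng=rng)

# Stage 2: encode one image as a histogram over the codebook.
feature = encode(images[0], codebook, rng)
```

The resulting vector has one entry per visual word, regardless of the
size of the image it came from.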
So then as a bit of teaser,
tying this all back together,
what this image
classification pipeline
might have looked like,
maybe about five to 10 years ago,
would be that you would take your image,
and then compute these different
feature representations
of your image, things like bag of words,
or histogram of oriented gradients,
concatenate a whole bunch
of features together,
and then feed these features
down into some linear classifier.
I'm simplifying a little bit,
the pipelines were a little
bit more complex than that,
but this is the general intuition.
And then the idea here was
that after you extracted
these features, this feature extractor
would be a fixed block
that would not be updated
during training.
And during training,
you would only update
the linear classifier
that's working on top of the features.
And actually, I would
argue that once we move
to convolutional neural networks,
and these deep neural networks,
it actually doesn't look that different.
The only difference is that
rather than writing down
the features ahead of time,
we're going to learn the
features directly from the data.
So we'll take our raw pixels and feed them
into this convolutional network,
which will end up computing
through many different layers
some type of feature representation
driven by the data, and
then we'll actually train
all of the weights for
this entire network,
rather than just the
weights of linear classifier
on top.
So, next time we'll really
start diving into this idea
in more detail, and we'll
introduce some neural networks,
and start talking about
backpropagation as well.
