Okay, so welcome to lecture two of CS231N.
On Tuesday, just to recall,
we, sort of, gave you
the big picture view of
what is computer vision,
what is the history,
and a little bit of the
overview of the class.
And today, we're really going
to dive in, for the first time,
into the details.
And we'll start to see,
in much more depth,
exactly how some of
these learning algorithms
actually work in practice.
So, the first lecture of the class
is probably, sort of, the
largest big picture vision.
And the majority of the
lectures in this class
will be much more detail oriented,
much more focused on
the specific mechanics,
of these different algorithms.
So, today we'll see our
first learning algorithm
and that'll be really exciting, I think.
But, before we get to that,
I wanted to talk about a couple
of administrative issues.
One, is Piazza.
So, I saw it when I checked yesterday,
it seemed like we had maybe 500 students
signed up on Piazza.
Which means that there
are several hundred of you
who are not yet there.
So, we really want Piazza
to be the main source
of communication between the
students and the course staff.
So, we've gotten a lot of
questions to the staff list
about project ideas or questions
about midterm attendance
or poster session attendance.
And, any, sort of, questions like that
should really go to Piazza.
You'll probably get answers
to your questions faster
on Piazza, because all the
TAs know to check that.
And it's, sort of, easy
for emails to get lost
in the shuffle if you just
send to the course list.
It's also come to my attention
that some SCPD students
are having a bit of a hard
time signing up for Piazza.
SCPD students are supposed to receive an
@stanford.edu email address.
So, once you get that email address,
then you can use the Stanford
email to sign into Piazza.
Probably that doesn't
affect those of you who are
sitting in the room right now,
but, for those students listening on SCPD.
The next administrative issue
is about assignment one.
Assignment one will be up later today,
probably sometime this afternoon,
but I promise, before
I go to sleep tonight,
it'll be up.
But, if you're getting a little bit antsy
and really want to start
working on it right now,
then you can look at last year's version
of assignment one.
It'll be pretty much the same content.
We're just reshuffling it
a little bit, for example,
upgrading it to work with Python 3,
rather than Python 2.7.
And some of these minor cosmetic changes,
but the content of the
assignment will still be the same
as last year.
So, in this assignment you'll
be implementing your own
k-nearest neighbor classifier,
which we're going to talk
about in this lecture.
You'll also implement several
different linear classifiers,
including the SVM and Softmax,
as well as a simple
two-layer neural network.
And we'll cover all this content
over the next couple of lectures.
So, all of our assignments
are using Python and NumPy.
If you aren't familiar
with Python or NumPy,
then we have written a
tutorial that you can find
on the course website to
try and get you up to speed.
But, this is, actually, pretty important.
NumPy lets you write these
very efficient vectorized
operations that let you do
quite a lot of computation
in just a couple lines of code.
So this is super important for pretty much
all aspects of numerical
computing and machine learning
and everything like that,
is efficiently implementing
these vectorized operations.
And you'll get a lot of practice with this
on the first assignment.
So, for those of you who
don't have a lot of experience
with Matlab or NumPy or
other types of vectorized
tensor computation, I recommend
that you start looking
at this assignment pretty early
and also, read carefully
through the tutorial.
The other thing I wanted to talk about
is that we're happy to announce that
we're officially supported
through Google Cloud
for this class.
So, Google Cloud is somewhat
similar to Amazon AWS.
You can go and start virtual
machines up in the cloud.
These virtual machines can have GPUs.
We're working on the tutorial
for exactly how to use
Google Cloud and get it to
work for the assignments.
But our intention is that
you'll be able to just download
some image, and it'll be very seamless
for you to work on the assignments
on one of these instances on the cloud.
And because Google has, very generously,
supported this course,
we'll be able to distribute to each of you
coupons that let you use
Google Cloud credits for free
for the class.
So you can feel free to use
these for the assignments
and also for the course projects
when you want to start using
GPUs and larger machines
and whatnot.
So, we'll post more details about that,
probably, on Piazza later today.
But, I just wanted to mention,
because I know there had
been a couple of questions
about, can I use my laptop?
Do I have to run on corn?
Do I have to, whatever?
And the answer is that,
you'll be able to run on
Google Cloud and we'll provide
you some coupons for that.
Yeah, so,
those are, kind of, the
major administrative issues
I wanted to talk about today.
And then, let's dive into the content.
So, the last lecture
we talked a little bit
about this task of image classification,
which is really a core
task in computer vision.
And this is something
that we'll really focus on
throughout the course of the class.
Is, exactly,
how do we work on this
image classification task?
So, a little bit more concretely,
when you're doing image classification,
your system receives some input image,
which is this cute cat in this example,
and the system is aware
of some predetermined set
of categories or labels.
So, these might be, like,
a dog or a cat or a truck
or a plane, and there's some
fixed set of category labels,
and the job of the computer
is to look at the picture
and assign it one of these
fixed category labels.
This seems like a really easy problem,
because so much of your own
visual system in your brain
is hardwired to doing these, sort of,
visual recognition tasks.
But this is actually a
really, really hard problem
for a machine.
So, if you dig in and
think about, actually,
what does a computer see
when it looks at this image,
it definitely doesn't get
this holistic idea of a cat
that you see when you look at it.
And the computer really
is representing the image
as this gigantic grid of numbers.
So, the image might be something
like 800 by 600 pixels.
And each pixel is
represented by three numbers,
giving the red, green, and
blue values for that pixel.
So, to the computer,
this is just a gigantic grid of numbers.
And it's very difficult
to distill the cat-ness
out of this, like, giant array
of thousands, or whatever,
very many different numbers.
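To make this "grid of numbers" idea concrete, here is a small NumPy sketch. The 800 by 600 size comes from the example above; the random pixel values and the height, width, channel layout are just assumptions for illustration.

```python
import numpy as np

# Assumed layout: (height, width, channels), 8-bit values in [0, 255].
# The pixel values are random stand-ins, not a real photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)

print(image.shape)  # (600, 800, 3)
print(image.size)   # 1440000 numbers in total
print(image[0, 0])  # red, green, blue values of the top-left pixel
```

So even a modest image is well over a million numbers, and none of them says "cat" on its own.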
So, we refer to this
problem as the semantic gap.
This idea of a cat, or
this label of a cat,
is a semantic label that
we're assigning to this image,
and there's this huge gap between
the semantic idea of a cat
and these pixel values that the
computer is actually seeing.
And this is a really hard problem because
you can change the picture
in very small, subtle ways
that will cause this pixel
grid to change entirely.
So, for example, if we took this same cat,
and if the cat happened to sit still
and not even twitch, not move a muscle,
which is never going to happen,
but we moved the camera to the other side,
then every single grid,
every single pixel,
in this giant grid of numbers
would be completely different.
But, somehow, it's still
representing the same cat.
And our algorithms need
to be robust to this.
But, not only viewpoint is one problem,
another is illumination.
There can be different
lighting conditions going on
in the scene.
Whether the cat is appearing
in this very dark, moody scene,
or in this very bright,
sunlit scene, it's still a cat,
and our algorithms need
to be robust to that.
Objects can also deform.
I think cats are, maybe,
among the more deformable
of animals that you might see out there.
And cats can really assume a
lot of different, varied poses
and positions.
And our algorithms should
be robust to these different
kinds of transforms.
There can also be problems of occlusion,
where you might only see part
of a cat, like, just the face,
or in this extreme example,
just a tail peeking out
from under the couch cushion.
But, in these cases, it's pretty
easy for you, as a person,
to realize that this is probably a cat,
and you still recognize
these images as cats.
And this is something that our algorithms
also must be robust to,
which is quite difficult, I think.
There can also be problems
of background clutter,
where maybe the foreground
object of the cat,
could actually look quite
similar in appearance
to the background.
And this is another thing
that we need to handle.
There's also this problem
of intraclass variation,
that this one notion of
cat-ness, actually spans a lot of
different visual appearances.
And cats can come in
different shapes and sizes
and colors and ages.
And our algorithm, again, needs to work
and handle all these different variations.
So, this is actually a really,
really challenging problem.
And it's sort of easy to
forget how hard this is
because so much of your
brain is specifically tuned
for dealing with these things.
But now if we want our computer programs
to deal with all of these
problems, all simultaneously,
and not just for cats, by the way,
but for just about any object
category you can imagine,
this is a fantastically
challenging problem.
And it's, actually, somewhat miraculous
that this works at all, in my opinion.
But, actually, not only does it work,
but these things work very
close to human accuracy
in some limited situations.
And take only hundreds
of milliseconds to do so.
So, this is some pretty
amazing, incredible technology,
in my opinion, and over the
course of the rest of the class
we will really see what
kinds of advancements
have made this possible.
So now, if you, kind of, think about
what is the API for writing
an image classifier,
you might sit down and try
to write a method in Python
like this.
Where you want to take in an image
and then do some crazy magic
and then, eventually,
spit out this class label
to say cat or dog or whatnot.
And there's really no obvious
way to do this, right?
If you're taking an algorithms class
and your task is to sort numbers
or compute a convex hull
or, even, do something
like RSA encryption,
you, sort of, can write down an algorithm
and enumerate all the
steps that need to happen
in order for these things to work.
But, when we're trying
to recognize objects,
or recognize cats or images,
there's no really clear,
explicit algorithm
that makes intuitive sense,
for how you might go about
recognizing these objects.
So, this is, again, quite challenging,
if you think about,
if it was your first day programming
and you had to sit down
and write this function,
I think most people would be in trouble.
That being said,
people have definitely
made explicit attempts
to try to write, sort
of, hand-coded rules
for recognizing different animals.
So, we touched on this a
little bit in the last lecture,
but maybe one idea for cats is that,
we know that cats have ears
and eyes and mouths and noses.
And we know that edges,
from Hubel and Wiesel,
we know that edges are pretty important
when it comes to visual recognition.
So one thing we might try to do is
compute the edges of this image
and then go in and try to
categorize all the different
corners and boundaries, and
say that, if we have maybe
three lines meeting this way,
then it might be a corner,
and an ear has one corner
here and one corner there
and one corner there,
and then, kind of, write down
this explicit set of rules
for recognizing cats.
But this turns out not to work very well.
One, it's super brittle.
And, two, say, if you want
to start over for another
object category, and maybe
not worry about cats,
but talk about trucks or dogs
or fishes or something else,
then you need to start all over again.
So, this is really not a
very scalable approach.
We want to come up with some
algorithm, or some method,
for these recognition tasks
which scales much more
naturally to all the variety
of objects in the world.
So, the insight that, sort
of, makes this all work
is this idea of the data-driven approach.
Rather than sitting down and
writing these hand-specified
rules to try to craft exactly
what is a cat or a fish
or what have you,
instead, we'll go out onto the internet
and collect a large
dataset of many, many cats
and many, many airplanes
and many, many deer
and different things like this.
And we can actually use tools
like Google Image Search,
or something like that,
to go out and collect a very
large number of examples
of these different categories.
By the way, this actually
takes quite a lot of effort
to go out and actually
collect these datasets
but, luckily, there's a lot
of really good, high quality
datasets out there already for you to use.
Then once we get this dataset,
we train this machine learning classifier
that is going to ingest all of the data,
summarize it in some way,
and then spit out a model
that summarizes the
knowledge of how to recognize
these different object categories.
Then finally, we'll
use this trained model
and apply it on new images
that will then be able to recognize
cats and dogs and whatnot.
So here our API has changed a little bit.
Rather than a single function
that just inputs an image
and recognizes a cat,
we have these two functions.
One, called, train, that's
going to input images and labels
and then output a model,
and then, separately, another
function called, predict,
which will input the model
and then make predictions
for images.
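As a rough sketch of what this two-function API could look like in Python: the function names follow the description above, but the "model" here is only a placeholder that remembers the most common label. That learner is an assumption made so the example runs; it is not a method from the lecture.

```python
import numpy as np

def train(images, labels):
    """Take in images and labels, return a model.

    Placeholder learner (an assumption): remember the most common label.
    """
    values, counts = np.unique(labels, return_counts=True)
    return {"most_common_label": values[np.argmax(counts)]}

def predict(model, test_images):
    """Take in a model and new images, return one label per image."""
    return np.full(len(test_images), model["most_common_label"])

# Tiny made-up dataset: four 2x2 "images", labels 0 = cat, 1 = dog.
images = np.zeros((4, 2, 2))
labels = np.array([0, 1, 1, 1])

model = train(images, labels)
print(predict(model, np.zeros((2, 2, 2))))  # [1 1]
```

The point is only the shape of the interface: everything learned from the data lives in the model, and predict never sees the training labels directly.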
And this is, kind of, the key insight
that allowed all these things
to start working really well
over the last 10, 20 years or so.
So, this class is primarily
about neural networks
and convolutional neural networks
and deep learning and all that,
but this idea of a data-driven
approach is much more general
than just deep learning.
And I think it's useful to, sort of,
step through this process
for a very simple classifier first,
before we get to these big, complex ones.
So, probably, the simplest
classifier you can imagine
is something we call nearest neighbor.
The algorithm is pretty dumb, honestly.
So, during the training
step we won't do anything,
we'll just memorize all
of the training data.
So this is very simple.
And now, during the prediction step,
we're going to take some new image
and go and try to find
the most similar image
in the training data to that new image,
and now predict the label
of that most similar image.
A very simple algorithm.
But it, sort of, has a lot
of these nice properties
with respect to
data-drivenness and whatnot.
So, to be a little bit more concrete,
you might imagine working on
this dataset called CIFAR-10,
which is very commonly
used in machine learning,
as kind of a small test case.
And you'll be working with
this dataset on your homework.
So, the CIFAR-10 dataset gives
you 10 different classes,
airplanes and automobiles and
birds and cats and different
things like that.
And across those 10 categories
it provides 50,000 training images,
roughly evenly distributed
across these 10 categories.
And then 10,000 additional testing images
that you're supposed to
test your algorithm on.
So here's an example
of applying this simple
nearest neighbor classifier
to some of these test images
on CIFAR-10.
So, in this grid on the right,
the leftmost column
gives a test image in
the CIFAR-10 dataset.
And now on the right, we've
sorted the training images
and show the most similar training images
to each of these test examples.
And you can see that they
look kind of visually similar
to the training images,
although they are not
always correct, right?
So, maybe on the second row,
we see that the testing,
this is kind of hard to see,
because these images are 32 by 32 pixels,
you need to really dive in there
and try to make your best guess.
But, this image is a dog and
its nearest neighbor is also
a dog, but this next one,
I think is actually a deer
or a horse or something else.
But, you can see that it
looks quite visually similar,
because there's kind of a
white blob in the middle
and whatnot.
So, if we're applying the
nearest neighbor algorithm
to this image,
we'll find the closest
example in the training set.
And now, the closest
example, we know its label,
because it comes from the training set.
And now, we'll simply say that
this testing image is also
a dog.
You can see from these
examples that this is probably not
going to work very well,
but it's still kind of a
nice example to work through.
But then, one detail
that we need to know is,
given a pair of images,
how can we actually compare them?
Because, if we're going to take
our test image and compare it
to all the training images,
we actually have many different choices
for exactly what that comparison
function should look like.
So, in the example in the previous slide,
we've used what's called the L1 distance,
also sometimes called
the Manhattan distance.
So, this is a really
sort of simple, easy idea
for comparing images.
And that's that we're going to
just compare individual pixels
in these images.
So, supposing that our test
image is maybe just a tiny
four by four image of pixel values,
then we'll take this upper-left hand pixel
of the test image,
subtract off the value
in the training image,
take the absolute value,
and get the difference in that
pixel between the two images.
And then, sum all these
up across all the pixels
in the image.
So, this is kind of a stupid
way to compare images,
but it does some reasonable
things sometimes.
But, this gives us a very concrete way
to measure the difference
between two images.
And in this case, we have
this difference of 456
between these two images.
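Here is that computation written out in NumPy. The 4 by 4 pixel values below are illustrative assumptions, chosen so that the total matches the 456 quoted above; treat them as stand-ins rather than as the exact numbers on the slide.

```python
import numpy as np

# Illustrative pixel values (assumed). Signed integers are used so the
# subtraction cannot wrap around, as it would with unsigned 8-bit pixels.
test_image = np.array([[56, 32, 10, 18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [2, 0, 255, 220]], dtype=np.int64)
train_image = np.array([[10, 20, 24, 17],
                        [8, 10, 89, 100],
                        [12, 16, 178, 170],
                        [4, 32, 233, 112]], dtype=np.int64)

pixelwise = np.abs(test_image - train_image)  # absolute difference per pixel
l1_distance = pixelwise.sum()                 # add them up over all pixels

print(pixelwise[0, 0])  # 46, the upper-left pixel: |56 - 10|
print(l1_distance)      # 456
```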
So, here's some full Python code
for implementing this
nearest neighbor classifier
and you can see it's pretty
short and pretty concise
because we've made use of
many of these vectorized
operations offered by NumPy.
So, here we can see that
this training function,
that we talked about earlier,
is, again, very simple, in
the case of nearest neighbor,
you just memorize the training data,
there's not really much to do here.
And now, at test time, we're
going to take in our image
and then go in and compare
using this L1 distance function,
our test image to each of
these training examples
and find the most similar
example in the training set.
And you can see that, we're
actually able to do this
in just one or two lines of Python code
by utilizing these vectorized
operations in NumPy.
So, this is something that
you'll get practice with
on the first assignment.
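For reference, here is a minimal sketch of a nearest neighbor classifier along the lines just described. It follows the description (memorize at training time, compare with L1 at test time), but the class name, method names, and the tiny dataset are assumptions, not necessarily the code on the slide.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        """X is N x D, one flattened example per row; y holds the N labels."""
        # "Training" is just remembering the data.
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        """Return the label of the closest training example for each row of X."""
        predictions = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L1 distance from test row i to every training row at once.
            distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
            predictions[i] = self.y_train[np.argmin(distances)]
        return predictions

# Made-up data; floats avoid wrap-around when subtracting pixel values.
X_train = np.array([[0.0, 0.0], [10.0, 10.0]])
y_train = np.array([0, 1])

classifier = NearestNeighbor()
classifier.train(X_train, y_train)
print(classifier.predict(np.array([[1.0, 2.0], [9.0, 7.0]])))  # [0 1]
```

The one line doing the real work is the vectorized distance computation, which compares a test example against the whole training set without an inner Python loop.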
So now, a couple questions
about this simple classifier.
First, if we have N examples
in our training set,
then how fast can we expect
training and testing to be?
Well, training is probably constant
because we don't really
need to do anything,
we just need to memorize the data.
And if you're just copying a pointer,
that's going to be constant time
no matter how big your dataset is.
But now, at test time we need
to do this comparison step
and compare our test image
to each of the N training
examples in the dataset.
And this is actually quite slow.
So, this is actually somewhat backwards,
if you think about it.
Because, in practice,
we're okay with our classifiers
being slow at training time
as long as they're fast at testing time.
Because, you might imagine,
that a classifier might go
and be trained in a data center somewhere
and you can afford to
spend a lot of computation
at training time to make
the classifier really good.
But then,
when you go and deploy the
classifier at test time,
you want it to run on your mobile phone
or in a browser or some
other low power device,
and you really want the
testing time performance
of your classifier to be quite fast.
So, from this perspective, this
nearest neighbor algorithm,
is, actually, a little bit backwards.
And we'll see that once we move to
convolutional neural networks,
and other types of parametric models,
they'll be the reverse of this.
Where you'll spend a lot of
compute at training time,
but then they'll be quite
fast at testing time.
So then, the question is,
what exactly does this
nearest neighbor algorithm
look like when you apply it in practice?
So, here we've drawn, what
we call the decision regions
of a nearest neighbor classifier.
So, here our training set
consists of these points
in the two dimensional plane,
where the color of the point
represents the category,
or the class label, of that point.
So, here we see we have five classes
and some blue ones up in the corner here,
some purple ones in the
upper-right hand corner.
And now for each pixel
in this entire plane,
we've gone and computed
what is the nearest example
in these training data,
and then colored the
point of the background
corresponding to what is the class label.
So, you can see that this
nearest neighbor classifier
is just sort of carving up the space
and coloring the space
according to the nearby points.
But this classifier is maybe not so great.
And by looking at this picture
we can start to see some of the
problems that might come out
with a nearest neighbor classifier.
For one, this central
region actually contains
mostly green points,
but one little yellow point in the middle.
But because we're just looking
at the nearest neighbor,
this causes a little
yellow island to appear
in the middle of this green cluster.
And that's, maybe, not so great.
Maybe those points actually
should have been green.
And then, similarly we also
see these, sort of, fingers,
like the green region
pushing into the blue region,
again, due to the presence of one point,
which may have been noisy or spurious.
So, this kind of motivates
a slight generalization
of this algorithm called
k-nearest neighbors.
So rather than just looking for
the single nearest neighbor,
instead we'll do something
a little bit fancier
and find K of our nearest neighbors,
according to our distance metric,
and then take a vote among
each of our neighbors.
And then predict the majority vote
among our neighbors.
You can imagine slightly more
complex ways of doing this.
Maybe you'd vote weighted on the distance,
or something like that,
but the simplest thing that
tends to work pretty well
is just taking a majority vote.
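A small sketch of that voting step, assuming L1 distance and non-negative integer labels. The function name and the toy points are made up for illustration, and ties here simply go to the smallest label, which is one arbitrary choice among several.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of one point x by majority vote among its k nearest neighbors."""
    distances = np.sum(np.abs(X_train - x), axis=1)  # L1 distance to every training point
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    votes = np.bincount(y_train[nearest])            # vote count per label
    return int(np.argmax(votes))                     # ties go to the smallest label

# Made-up data: a cluster of class 0 with one stray class 1 point inside it.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [0.4, 0.4], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.5, 0.5])

print(knn_predict(X_train, y_train, query, k=1))  # 1, follows the stray point
print(knn_predict(X_train, y_train, query, k=3))  # 0, the stray point is outvoted
```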
So here we've shown the
exact same set of points
using this K=1 nearest
neighbor classifier,
as well as K=3 and K=5 in
the middle and on the right.
And once we move to K=3, you
can see that that spurious
yellow point in the middle
of the green cluster
is no longer causing the
points near that region
to be classified as yellow.
Now this entire green
portion in the middle
is all being classified as green.
You can also see that these fingers
of the red and blue regions
are starting to get smoothed out
due to this majority voting.
And then, once we move to the K=5 case,
then these decision boundaries
between the blue and red regions
have become quite smooth and quite nice.
So, generally when you're
using nearest neighbors
classifiers,
you almost always want
to use some value of K,
which is larger than one
because this tends to
smooth out your decision
boundaries and lead to better results.
Question?
[student asking a question]
Yes, so the question is,
what is the deal with these white regions?
The white regions are
where there was no majority
among the k-nearest neighbors.
You could imagine maybe doing
something slightly fancier
and maybe taking a guess
or randomly selecting among
the majority winners,
but for this simple example
we're just coloring it white
to indicate there was no majority
at those points.
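One way to express this "no winner" case in code is sketched below. It reads "no majority" as a tie for the highest vote count, which is an assumption about how the figure was drawn; the function name is hypothetical.

```python
import numpy as np

def vote_or_none(neighbor_labels):
    """Return the winning label, or None when the top vote count is tied."""
    counts = np.bincount(neighbor_labels)
    winners = np.flatnonzero(counts == counts.max())
    return int(winners[0]) if len(winners) == 1 else None

print(vote_or_none(np.array([2, 2, 0])))  # 2
print(vote_or_none(np.array([0, 1, 2])))  # None, a three-way tie: a "white" point
```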
Whenever we're thinking
about computer vision
I think it's really useful to kind of flip
back and forth between
several different viewpoints.
One, is this idea of high
dimensional points in the plane,
and then the other is actually
looking at concrete images.
Because the pixels of the image actually
allow us to think of these
images as high dimensional
vectors.
And it's sort of useful to
ping pong back and forth
between these two different viewpoints.
So then, sort of taking
this k-nearest neighbor
and going back to the images
you can see that it's
actually not very good.
Here I've colored in red and green
which images would actually
be classified correctly
or incorrectly according
to their nearest neighbor.
And you can see that it's
really not very good.
But maybe if we used a larger value of K
then this would involve
actually voting among
maybe the top three or the top five
or maybe even the whole row.
And you could imagine that
that would end up being
a lot more robust to some
of this noise that we see
when retrieving neighbors in this way.
So another choice we
have when we're working
with the k-nearest neighbor algorithm
is determining exactly
how we should be comparing
our different points.
For the examples we've shown so far,
we've talked about this L1 distance
which takes the sum of the absolute differences
between the pixels.
But another common choice is
the L2 or Euclidean distance
where you take the square
root of the sum of the squared differences
and take this as your distance.
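Side by side, the two distances might be written like this. The function names and the two example points are made up for illustration.

```python
import numpy as np

def l1_distance(a, b):
    """Manhattan distance: sum of absolute differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(l1_distance(a, b))  # 7.0
print(l2_distance(a, b))  # 5.0
```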
Choosing different
distance metrics actually
is a pretty interesting topic
because different distance metrics
make different assumptions
about the underlying
geometry or topology that
you'd expect in the space.
So, with this L1 distance,
this is actually what
a circle looks like according to L1,
and it forms this square shape
around the origin.
Where each of the points
on this, on the square,
is equidistant from the
origin according to L1,
whereas with the L2 or Euclidean distance
then this circle is a familiar circle,
it looks like what you'd expect.
So one interesting thing to
point out between these two
metrics in particular,
is that the L1 distance
depends on your choice
of coordinate system.
So if you were to rotate
the coordinate frame
that would actually change the L1 distance
between the points.
Whereas changing the coordinate
frame in the L2 distance
doesn't matter, it's the
same thing no matter what
your coordinate frame is.
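You can check this numerically with a quick sketch: rotate two points by 45 degrees and compare the distances before and after. The points, the angle, and the helper name are arbitrary choices for illustration.

```python
import numpy as np

def rotate(points, degrees):
    """Rotate 2-D points (one per row) counterclockwise about the origin."""
    theta = np.deg2rad(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return points @ rotation.T

a = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])
a_rot, b_rot = rotate(np.stack([a, b]), 45)

l1_before = np.sum(np.abs(a - b))
l1_after = np.sum(np.abs(a_rot - b_rot))
l2_before = np.sqrt(np.sum((a - b) ** 2))
l2_after = np.sqrt(np.sum((a_rot - b_rot) ** 2))

print(round(float(l1_before), 3), round(float(l1_after), 3))  # 1.0 1.414
print(round(float(l2_before), 3), round(float(l2_after), 3))  # 1.0 1.0
```

The same pair of points gets a different L1 distance after the rotation, while the L2 distance stays put.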
Maybe if your input features,
if the individual entries
in your vector have some important meaning
for your task,
then maybe somehow L1 might
be a more natural fit.
But if it's just a generic
vector in some space
and you don't know which
of the different elements,
you don't know what they actually mean,
then maybe L2 is slightly more natural.
And another point here is that
by using different distance metrics
we can actually generalize
the k-nearest neighbor
classifier to many, many
different types of data,
not just vectors, not just images.
So, for example, imagine you
wanted to classify pieces
of text, then the only
thing you need to do
to use k-nearest neighbors
is to specify some distance function
that can measure distances
between maybe two paragraphs
or two sentences or something like that.
So, simply by specifying
different distance metrics
we can actually apply this
algorithm very generally
to basically any type of data.
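As a sketch of that idea for text: the word-overlap (Jaccard) distance below is just one possible choice and is an assumption for illustration, as are the function names and the toy sentences. Nothing else about the nearest neighbor procedure needs to change.

```python
def jaccard_distance(text_a, text_b):
    """1 minus (shared words / all words). 0 means identical word sets."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)

def nearest_label(train_texts, train_labels, query, distance):
    """Nearest neighbor with a pluggable distance function."""
    distances = [distance(text, query) for text in train_texts]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]

# Made-up training data.
train_texts = ["the cat sat on the mat", "stocks fell sharply today"]
train_labels = ["pets", "finance"]

print(nearest_label(train_texts, train_labels, "my cat sat down", jaccard_distance))  # pets
```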
Even though it's a kind
of simple algorithm,
in general, it's a very
good thing to try first
when you're looking at a new problem.
So then, it's also kind of
interesting to think about
what is actually happening geometrically
if we choose different distance metrics.
So here we see the same
set of points on the left
using the L1, or Manhattan distance,
and then, on the right,
using the familiar L2,
or Euclidean distance.
And you can see that the
shapes of these decision
boundaries actually change quite a bit
between the two metrics.
So when you're looking at
L1 these decision boundaries
tend to follow the coordinate axes.
And this is again because
the L1 depends on our choice
of coordinate system.
Whereas the L2 sort of doesn't
really care about the
coordinate axis, it
just puts the boundaries
where they should fall naturally.
My confession is that
each of these examples
that I've shown you is
actually from this interactive
web demo that I built,
where you can go and play
with this k-nearest neighbor
classifier on your own.
And this is really hard to
work on a projector screen.
So maybe you'll do that on your own time.
So, let's just go back to here.
Man, this is kind of embarrassing.
Okay, that was way more
trouble than it was worth.
So, let's skip this, but I encourage you
to go play with this in your browser.
It's actually pretty fun
and kind of nice to build intuition about
how the decision boundary changes
as you change the K
and change your distance metric
and all those sorts of things.
Okay, so then the question is
once you're actually trying
to use this algorithm
in practice, there's several choices
you need to make.
We talked about choosing
different values of K.
We talked about choosing
different distance metrics.
And the question becomes
how do you actually make
these choices for your problem
and for your data?
So, these choices, of things
like K and the distance metric,
we call hyperparameters,
because they are not necessarily
learned from the training
data,
instead these are choices about
your algorithm that you make
ahead of time
and there's no way to learn
them directly from the data.
So, the question is how
do you set these things
in practice?
And they turn out to be
very problem-dependent.
And the simple thing that
most people do is simply
try different values of
hyperparameters for your data
and for your problem, and
figure out which one works best.
There's a question?
[student asking a question]
So, the question is, where L1
distance might be preferable
to using L2 distance?
I think it's mainly problem-dependent,
it's sort of difficult to say
in which cases you think
one might be better
than the other.
But I think that because L1
has this sort of coordinate
dependency, it actually depends
on the coordinate system
of your data,
if you know that you have a vector,
and maybe the individual
elements of the vector
have meaning.
Like maybe you're classifying
employees for some reason
and then the different elements
of that vector correspond
to different features or
aspects of an employee.
Like their salary or the
number of years they've been
working at the company
or something like that.
So I think when your
individual elements actually
have some meaning,
is where I think maybe using
L1 might make a little bit
more sense.
But in general, again,
this is a hyperparameter
and it really depends on
your problem and your data
so the best answer is
just to try them both
and see what works better.
Even this idea of trying
out different values
of hyperparameters and
seeing what works best,
there are many different choices here.
What exactly does it mean
to try hyperparameters
and see what works best?
Well, the first idea you might think of
is simply choosing the
hyperparameters that give you
the best accuracy or best performance
on your training data.
This is actually a really terrible idea.
You should never do this.
In the concrete case
of the nearest neighbor
classifier, for example,
if we set K=1, we will always
classify the training data
perfectly.
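A quick sketch shows why: with K=1, every training point is its own nearest neighbor at distance zero, so training accuracy is perfect even when the labels are pure noise. This assumes no two training points are identical with different labels; the random data is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))      # random 2-D points
y_train = rng.integers(0, 3, size=50)   # random labels, deliberately meaningless

def predict_1nn(X_train, y_train, X):
    """Label each row of X with the label of its single nearest training point (L1)."""
    # Pairwise distances, shape (number of queries, number of training points).
    distances = np.abs(X[:, None, :] - X_train[None, :, :]).sum(axis=2)
    return y_train[np.argmin(distances, axis=1)]

train_accuracy = np.mean(predict_1nn(X_train, y_train, X_train) == y_train)
print(train_accuracy)  # 1.0
```

A perfect score on labels that carry no information is a good sign that this number tells you nothing about new data.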
So if we use this strategy
we'll always pick K=1,
but, as we saw from the examples earlier,
in practice it seems that
setting K equal to larger values
might cause us to misclassify
some of the training data,
but, in fact, lead to better performance
on points that were not
in the training data.
And ultimately in machine learning
we don't care about
fitting the training data,
we really care about how our classifier,
or how our method,
will perform on unseen
data after training.
So, this is a terrible
idea, don't do this.
So, another idea that you might think of,
is maybe we'll take our full dataset
and we'll split it into some training data
and some test data.
And now I'll try training
my algorithm with different
choices of hyperparameters
on the training data
and then I'll go and apply
that trained classifier
on the test data and now I will pick
the set of hyperparameters
that cause me to perform best
on the test data.
This seems like maybe a
more reasonable strategy,
but, in fact, this is also a terrible idea
and you should never do this.
Because, again, the point
of machine learning systems
is that we want to know how
our algorithm will perform.
So, the point of the test set
is to give us some estimate
of how our method will do
on unseen data that's
coming out from the wild.
And if we use this strategy
of training many different
algorithms with different hyperparameters,
and then, selecting the
one which does the best
on the test data,
then, it's possible, that
we may have just picked
the right set of hyperparameters
that caused our algorithm
to work quite well
on this testing set,
but now our performance on this test set
will no longer be representative
of our performance on new, unseen data.
So, again, you should not
do this, this is a bad idea,
you'll get in trouble if you do this.
What is much more common, is
to actually split your data
into three different sets.
You'll partition most of
your data into a training set
and then you'll create a validation set
and a test set.
And now what we typically do
is go and train our algorithm
with many different
choices of hyperparameters
on the training set,
evaluate on the validation set,
and now pick the set of hyperparameters
which performs best on the validation set.
And now, after you've
done all your development,
you've done all your debugging,
after you've done everything,
then you'd take that best
performing classifier
on the validation set
and run it once on the test set.
And now that's the number
that goes into your paper,
that's the number that
goes into your report,
that's the number that
actually is telling you how
your algorithm is doing on unseen data.
And this is actually
really, really important
that you keep a very
strict separation between
the validation data and the test data.
So, for example, when we're
working on research papers,
we typically only touch the test set
at the very last minute.
So, when I'm writing papers,
I tend to only touch the
test set for my problem
in maybe the week before
the deadline or so
to really ensure that we're not
being dishonest here and
we're not reporting a number
which is unfair.
So, this is actually super important
and you want to make sure
to keep your test data
quite under control.
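The train, validation, test recipe might look roughly like this.
This is only a sketch under stated assumptions:
the data is a toy pair of Gaussian blobs, the split sizes
(400 / 100 / 100) and the candidate values of K are arbitrary,
and knn_predict is a hypothetical helper, not course code.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Majority vote among the k nearest training points (squared L2)."""
    dists = np.sum((query_x[:, None, :] - train_x[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in train_y[nearest]])

rng = np.random.default_rng(0)
# Toy data (an assumption): two overlapping 2-D Gaussian blobs.
x = np.concatenate([rng.normal(0.0, 1.0, size=(300, 2)),
                    rng.normal(1.5, 1.0, size=(300, 2))])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(300, dtype=int)])

# Shuffle once, then carve out train / validation / test.
order = rng.permutation(len(x))
x, y = x[order], y[order]
train_x, train_y = x[:400], y[:400]
val_x, val_y = x[400:500], y[400:500]
test_x, test_y = x[500:], y[500:]

# Choose K using the validation set only.
candidate_ks = [1, 3, 5, 7, 11, 15]
val_acc = {k: float(np.mean(knn_predict(train_x, train_y, val_x, k) == val_y))
           for k in candidate_ks}
best_k = max(val_acc, key=val_acc.get)

# Touch the test set exactly once, at the very end.
test_acc = float(np.mean(knn_predict(train_x, train_y, test_x, best_k) == test_y))
print("validation accuracy per k:", val_acc)
print("chosen k:", best_k, " test accuracy:", round(test_acc, 3))
```

The test labels are never consulted while choosing K,
which is the whole point of the separation.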
So another strategy for
setting hyperparameters
is called cross validation.
And this is used a
little bit more commonly
for small data sets, not used
so much in deep learning.
So here the idea is we're
going to take our dataset,
as usual, hold out some test
set to use at the very end,
and now, for the rest of the data,
rather than splitting it
into a single training
and validation partition,
instead, we can split our training data
into many different folds.
And now, in this way, we
cycle through choosing which
fold is going to be the validation set.
So now, in this example,
we're using five fold cross validation,
so you would train your
algorithm with one set of
hyperparameters on the first four folds,
evaluate the performance on fold five,
and now go and retrain
your algorithm on folds
one, two, three, and five,
evaluate on fold four,
and cycle through all the different folds.
And, when you do it this way,
you get much higher confidence about
which hyperparameters are going to perform
more robustly.
So this is kind of the
gold standard to use,
but, in practice in deep learning
when we're training large models
and training is very
computationally expensive,
this doesn't get used
too much in practice.
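Five fold cross validation can be sketched as follows.
The toy data, the candidate values of K, and the helper names
knn_predict and cross_validate are assumptions for illustration;
the held-out test set is assumed to have been set aside already.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Majority vote among the k nearest training points (squared L2)."""
    dists = np.sum((query_x[:, None, :] - train_x[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in train_y[nearest]])

def cross_validate(x, y, k, num_folds=5):
    """Return one validation accuracy per fold for a given k."""
    x_folds = np.array_split(x, num_folds)
    y_folds = np.array_split(y, num_folds)
    accs = []
    for i in range(num_folds):
        # Fold i is the validation set; all other folds are training data.
        tr_x = np.concatenate(x_folds[:i] + x_folds[i + 1:])
        tr_y = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        preds = knn_predict(tr_x, tr_y, x_folds[i], k)
        accs.append(float(np.mean(preds == y_folds[i])))
    return accs

rng = np.random.default_rng(0)
# Toy non-test data (an assumption): two overlapping blobs, shuffled.
x = np.concatenate([rng.normal(0.0, 1.0, size=(250, 2)),
                    rng.normal(1.5, 1.0, size=(250, 2))])
y = np.concatenate([np.zeros(250, dtype=int), np.ones(250, dtype=int)])
order = rng.permutation(len(x))
x, y = x[order], y[order]

results = {k: cross_validate(x, y, k) for k in (1, 3, 5, 7, 11)}
for k, accs in results.items():
    print(f"k={k:2d}  mean={np.mean(accs):.3f}  std={np.std(accs):.3f}")
```

Each K gets five accuracies, one per fold, so you can look at
the spread across folds as well as the mean.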
Question?
[student asking a question]
Yeah, so the question is,
a little bit more concretely,
what's the difference
between the training and the
validation set?
So, if you think about the
k-nearest neighbor classifier
then the training set is this
set of images with labels
where we memorize the labels.
And now, to classify an image,
we're going to take the image
and compare it to each element
in the training data,
and then transfer the label
from the nearest training point.
So now our algorithm
will memorize everything
in the training set,
and now we'll take each
element of the validation set
and compare it to each
element in the training data
and then use this to
determine what is the accuracy
of our classifier when it's
applied on the validation set.
So this is the distinction
between training
and validation.
Where your algorithm is
able to see the labels
of the training set,
but for the validation set,
your algorithm doesn't have
direct access to the labels.
We only use the labels
of the validation set
to check how well our algorithm is doing.
A question?
[student asking a question]
The question is, whether the test set,
is it possible that the
test set might not be
representative of data
out there in the wild?
This definitely can be
a problem in practice,
the underlying statistical
assumption here is that
your data are all independently
and identically distributed,
so that all of your data points should be
drawn from the same underlying
probability distribution.
Of course, in practice, this
might not always be the case,
and you definitely can run into cases
where the test set might
not be super representative
of what you see in the wild.
So this is kind of a problem
that dataset creators and
dataset curators need to think about.
But when I'm creating
datasets, for example,
one thing I do,
is I'll go and collect a whole
bunch of data all at once,
using the exact same methodology
for collecting the data,
and then afterwards you go
and partition it randomly
between train and test.
One thing that can screw you up here is
maybe if you're collecting data over time
and you make the earlier
data, that you collect first,
be the training data,
and the later data that you
collect be the test data,
then you actually might
run into this shift
that could cause problems.
But as long as this partition is random
among your entire set of data points,
then that's how we try
to alleviate this problem
in practice.
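Here is a small sketch of why the random partition matters.
It assumes a hypothetical one-number feature whose mean
drifts slowly while the data is being collected;
the drift and the 800 / 200 split are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical drift: the feature mean moves from 0 toward 1 over time.
t = np.arange(n)
feature = rng.normal(loc=t / n, scale=0.1)

# Chronological split: earliest 800 samples train, latest 200 test.
chrono_gap = abs(feature[:800].mean() - feature[800:].mean())

# Random split: shuffle the indices first, then split 800 / 200.
perm = rng.permutation(n)
random_gap = abs(feature[perm[:800]].mean() - feature[perm[800:]].mean())

print(f"train/test mean gap, chronological split: {chrono_gap:.3f}")
print(f"train/test mean gap, random split:        {random_gap:.3f}")
```

With the chronological split the test set looks systematically
different from the training set; with the random split
the two look much more alike.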
So then, once you've gone through this
cross validation procedure,
then you end up with graphs
that look something like this.
So here, on the X axis, we
are showing the value of K
for a k-nearest neighbor
classifier on some problem,
and now on the Y axis, we are
showing what is the accuracy
of our classifier on some dataset
for different values of K.
And you can see that, in this case,
we've done five fold cross
validation over the data,
so, for each value of K we
have five different examples
of how well this algorithm is doing.
And, actually, going back
to the question about
having some test sets
that are better or worse
for your algorithm,
using K fold cross validation
is maybe one way to help
quantify that a little bit.
And, in that, we can see the
variance of how this algorithm
performs on the different
validation folds.
And that gives you some sense of,
not just what is the best,
but, also, what is the
distribution of that performance.
So, whenever you're training
machine learning models
you end up making plots like this,
where they show you what is your accuracy,
or your performance as a
function of your hyperparameters,
and then you want to
go and pick the model,
or the set of hyperparameters,
at the end of the day,
that performs the best
on the validation set.
So, here we see that maybe
about K=7 probably works
about best for this problem.
So, k-nearest neighbor
classifiers on images
are actually almost
never used in practice.
Because of all of these
problems that we've talked about.
So, one problem is that
it's very slow at test time,
which is the reverse of what we want,
which we talked about earlier.
Another problem is that
these things like Euclidean
distance, or L1 distance,
are really not a very good way
to measure distances between images.
These, sort of, vectorial
distance functions
do not correspond very well
to perceptual similarity
between images.
How you perceive
differences between images.
So, in this example, we've constructed,
there's this image on the left of a girl,
and then three different
distorted images on the right
where we've blocked out her mouth,
we've actually shifted
down by a couple pixels,
or tinted the entire image blue.
And, actually, if you compute
the Euclidean distance
between the original and the boxed,
the original and the shifted,
and the original and the tinted,
they all have the same L2 distance.
Which is, maybe, not so good
because it sort of
gives you the sense that
the L2 distance is really
not doing a very good job
at capturing these perceptual
distances between images.
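You can reproduce this kind of construction yourself.
The sketch below is not the actual slide; it assumes a smooth
synthetic 32 by 32 image, takes the distance caused by a
two pixel shift as the target, and then tunes the strength
of a blue tint and of a darkened box to hit the same L2 distance.

```python
import numpy as np

H = W = 32
yy, xx = np.mgrid[0:H, 0:W]
# Smooth synthetic RGB "image" (an assumption), values in [0.1, 0.9].
img = np.stack([0.5 + 0.4 * np.sin(2 * np.pi * (yy + xx) / 32 + c)
                for c in range(3)], axis=2)

def l2(a, b):
    """Euclidean (L2) distance between two images of the same shape."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# 1) Shift down by two pixels (wrapping around at the border).
shifted = np.roll(img, 2, axis=0)
target = l2(img, shifted)

# 2) Blue tint: add a constant to the blue channel, sized to hit the target.
tinted = img.copy()
tinted[:, :, 2] += target / np.sqrt(H * W)

# 3) Box: move a 12x12 patch toward black by whatever fraction alpha
#    makes the distance match the target.
patch = img[10:22, 10:22, :]
alpha = target / np.linalg.norm(patch)
boxed = img.copy()
boxed[10:22, 10:22, :] = (1 - alpha) * patch

for name, distorted in [("shifted", shifted), ("tinted", tinted), ("boxed", boxed)]:
    print(f"{name:8s} L2 distance to original = {l2(img, distorted):.4f}")
```

Three visually very different changes, one identical number,
which is all that the nearest neighbor classifier ever sees.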
Another, sort of, problem
with the k-nearest neighbor
classifier has to do with
something we call the curse
of dimensionality.
So, if you recall back this
viewpoint we had of the
k-nearest neighbor classifier,
it's sort of dropping paint
around each of the training
data points and using that to
sort of partition the space.
So that means that if we
expect the k-nearest neighbor
classifier to work well,
we kind of need our training
examples to cover the space
quite densely.
Otherwise our nearest neighbors
could actually be quite far
away and might not actually
be very similar to our testing
points.
And the problem is,
that actually densely covering the space,
means that we need a number
of training examples,
which is exponential in the
dimension of the problem.
So this is very bad, exponential
growth is always bad,
basically, you're never
going to get enough images
to densely cover this space of pixels
in this high dimensional space.
So that's maybe another
thing to keep in mind
when you're thinking about
using k-nearest neighbor.
So, kind of the summary
is that we're using
k-nearest neighbor to introduce this idea
of image classification.
We have a training set
of images and labels
and then we use that
to predict these labels on the test set.
Question?
[student asking a question]
Oh, sorry, the question is,
what was going on with this picture?
What are the green and the blue dots?
So here, we have some training samples
which are represented by points,
and the color of the dot
maybe represents the category
of the point, of this training sample.
So, if we're in one dimension,
then you maybe only need
four training samples
to densely cover the space,
but if we move to two dimensions,
then, we now need, four times
four is 16 training examples
to densely cover this space.
And if we move to three, four,
five, many more dimensions,
the number of training
examples that we need
to densely cover the space,
grows exponentially with the dimension.
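The arithmetic is easy to check.
A minimal sketch, assuming four samples per axis as in the
example above; samples_needed is a hypothetical helper name.

```python
import math

def samples_needed(dim, per_axis=4):
    """Points required to cover a dim-dimensional grid at per_axis points per axis."""
    return per_axis ** dim

for dim in (1, 2, 3, 10):
    print(f"dim={dim:2d}  samples needed = {samples_needed(dim):,}")

# A 32x32x3 image lives in a 3072-dimensional pixel space.
# The count is too large to be worth printing, so report its number of digits.
digits = int(3072 * math.log10(4)) + 1
print(f"dim=3072  samples needed has about {digits} decimal digits")
```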
So, this is kind of giving you the sense,
that maybe in two dimensions
we might have this kind
of funny curved shape,
or you might have sort of
arbitrary manifolds of labels
in different dimensional spaces.
Because the k-nearest neighbor algorithm
doesn't really make any
assumptions about these
underlying manifolds,
the only way it can perform properly
is if it has quite a dense
sample of training points
to work with.
So, this is kind of the
overview of k-nearest neighbors
and you'll get a chance
to actually implement this
and try it out on images
in the first assignment.
So, if there's any last minute
questions about KNN,
I'm going to move on to the next topic.
Question?
[student is asking a question]
Sorry, say that again.
[student is asking a question]
Yeah, so the question is,
why do these images have
the same L2 distance?
And the answer is that, I
carefully constructed them
to have the same L2 distance.
[laughing]
But it's just giving you the
sense that the L2 distance
is not a very good measure
of similarity between images.
And these images are
actually all different from
each other in quite disparate ways.
If you're using KNN,
then the only thing you
have to measure distance
between images,
is this single distance metric.
And this kind of gives
you an example where
that distance metric is
actually not capturing
the full description of
distance or difference
between images.
So, in this case, I just sort
of carefully constructed these
translations and these
offsets to match exactly.
Question?
[student asking a question]
So, the question is,
maybe this is actually good,
because all of these things
are actually having the
same distance to the image.
That's maybe true for this example,
but I think you could also
construct examples where
maybe we have two original images
and then by putting the
boxes in the right places
or tinting them,
we could cause it to be
nearer to pretty much
anything that you want, right?
Because in this example, we
can kind of like do arbitrary
shifting and tinting
to kind of change these
distances nearly arbitrarily
without changing the perceptual
nature of these images.
So, I think that this
can actually screw you up
if you have many
different original images.
Question?
[student is asking a question]
The question is,
whether or not it's
common in real-world cases
to go back and retrain on the entire dataset
once you've found those
best hyperparameters?
So, people do sometimes
do this in practice,
but it's somewhat a matter of taste.
If you're really rushing for that deadline
and you've really got to
get this model out the door
then, if it takes a long
time to retrain the model
on the whole dataset,
then maybe you won't do it.
But if you have a little
bit more time to spare
and a little bit more compute to spare,
and you want to squeeze out
maybe that extra 1%
of performance, then that
is a trick you can use.
So we kind of saw that
the k-nearest neighbor
has a lot of the nice properties
of machine learning algorithms,
but in practice it's not so great,
and really not used very much in images.
So the next thing I'd
like to talk about is
linear classification.
And linear classification is,
again, quite a simple learning
algorithm, but this will
become super important
and help us build up to
whole neural networks
and whole convolutional networks.
So, one analogy people often talk about
when working with neural networks
is we think of them as being
kind of like Lego blocks.
That you can have different
kinds of components
of neural networks and you
can stick these components
together to build these
large different towers of
convolutional networks.
One of the most basic
building blocks that we'll see
in different types of
deep learning applications
is this linear classifier.
So, I think it's actually
really important to
have a good understanding
of what's happening
with linear classification.
Because these will end up
generalizing quite nicely
to whole neural networks.
So another example of kind
of this modular nature
of neural networks
comes from some research in our
own lab on image captioning,
just as a little bit of a preview.
So here the setup is that
we want to input an image
and then output a descriptive sentence
describing the image.
And the way this kind of works is that
we have one convolutional
neural network that's looking
at the image,
and a recurrent neural network that knows
about language.
And we can kind of just stick
these two pieces together
like Lego blocks and train
the whole thing together
and end up with a pretty cool system
that can do some non-trivial things.
And we'll work through the
details of this model as we go
forward in the class,
but this just gives you the sense that,
these deep neural networks
are kind of like Legos
and this linear classifier
is kind of like the most
basic building block
of these giant networks.
But that's a little bit too
exciting for lecture two,
so we have to go back to
CIFAR-10 for the moment.
[laughing]
So, recall that CIFAR-10 has
these 50,000 training examples,
each image is 32 by 32 pixels
and three color channels.
In linear classification,
we're going to take a bit
of a different approach
from k-nearest neighbor.
So, the linear classifier is
one of the simplest examples
of what we call a parametric model.
So now, our parametric model
actually has two different
components.
It's going to take in this image,
maybe, of a cat on the left,
and this,
that we usually write
as X for our input data,
and also a set of parameters, or weights,
which is usually called
W, also sometimes theta,
depending on the literature.
And now we're going to
write down some function
which takes in both the data,
X, and the parameters, W,
and this'll spit out now
10 numbers describing
what are the scores
corresponding to each of those 10
categories in CIFAR-10.
With the interpretation that,
like the larger score for cat,
indicates a larger probability
of that input X being cat.
And now, a question?
[student asking a question]
Sorry, can you repeat that?
[student asking a question]
Oh, so the question is what is the three?
The three, in this example,
corresponds to the three color
channels, red, green, and blue.
Because we typically work on color images,
that's nice information that
you don't want to throw away.
So, in the k-nearest neighbor setup
there were no parameters, instead,
we just kind of keep around
the whole training data,
the whole training set,
and use that at test time.
But now, in a parametric approach,
we're going to summarize our
knowledge of the training data
and stick all that knowledge
into these parameters, W.
And now, at test time, we
no longer need the actual
training data, we can throw it away.
We only need these
parameters, W, at test time.
So this allows our models
to now be more efficient
and actually run on maybe
small devices like phones.
So, kind of, the whole
story in deep learning
is coming up with the
right structure for this
function, F.
You can imagine writing down
different functional forms
for how to combine weights
and data in different
complex ways, and these
could correspond to different
network architectures.
But the simplest possible example
of combining these two things
is just, maybe, to multiply them.
And this is a linear classifier.
So here our F of X, W is
just equal to the W times X.
Probably the simplest
equation you can imagine.
So here,
if you kind of unpack the
dimensions of these things,
we recall that our image was
maybe 32 by 32 by 3 values.
So then, we're going to take
those values and then stretch
them out into a long column vector
that has 3,072 by one entries.
And now we want to end
up with 10 class scores.
We want to end up with
10 numbers for this image
giving us the scores for
each of the 10 categories.
Which means that now our matrix, W,
needs to be ten by 3072.
So that once we multiply
these two things out
then we'll end up with
a single column vector
10 by one, giving us our 10 class scores.
Also sometimes, you'll typically see this,
we'll often add a bias term
which will be a constant
vector of 10 elements
that does not interact
with the training data,
and instead just gives us
some sort of data independent
preferences for some classes over another.
So you might imagine that
if your dataset was
unbalanced and had many
more cats than dogs,
for example, then the bias
elements corresponding
to cat would be higher
than the other ones.
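In code, the shapes work out like this.
This is a sketch with stand-in values: the image is random
pixels rather than a real CIFAR-10 image, and W is random
rather than trained, so the predicted class is meaningless.
Only the shapes and the parameter count are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 10
num_pixels = 32 * 32 * 3                      # 3072 values per image

# Stand-in image (an assumption): random pixel values in [0, 255].
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
x = image.reshape(num_pixels, 1)              # stretch into a 3072 x 1 column

W = 0.001 * rng.normal(size=(num_classes, num_pixels))  # 10 x 3072, untrained
b = np.zeros((num_classes, 1))                           # 10 x 1 bias

scores = W @ x + b                            # f(x, W) = Wx + b, shape 10 x 1
print("x:", x.shape, " W:", W.shape, " b:", b.shape, " scores:", scores.shape)
print("index of highest score:", int(np.argmax(scores)))

# What has to be kept around at test time.
num_params = W.size + b.size                  # parametric model: W and b only
stored_by_knn = 50_000 * num_pixels           # nearest neighbor: every training pixel
print("linear classifier parameters:", num_params)
print("values stored by nearest neighbor:", stored_by_knn)
```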
So if you kind of think about pictorially
what this function is doing,
in this figure we have
an example on the left
of a simple image with
just a two by two image,
so it has four pixels total.
So the way that the
linear classifier works
is that we take this two by two image,
we stretch it out into a column vector
with four elements,
and now, in this example,
we are just restricting to
three classes, cat, dog, and ship,
because you can't fit 10 on a slide,
and now our weight matrix is
going to be three by four,
so we have three classes and four pixels.
And now, again, we have a
three element bias vector
that gives us data independent bias terms
for each category.
Now we see that the cat score
is going to be the inner
product between the pixels of our image
and this row in the weight matrix
added together with this bias term.
So, when you look at it this way
you can kind of understand
linear classification
as almost a template matching approach.
Where each of the rows in this matrix
correspond to some template of the image.
And now the inner product or dot product
between the row of the
matrix and the column
giving the pixels of the image,
computing this dot
product kind of gives us
a similarity between this
template for the class
and the pixels of our image.
And then bias just,
again, gives you this data
independent offset
to each of the classes.
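Here is the two by two case worked through in code.
The pixel values, weights, and biases are illustrative numbers
picked for this sketch, not a trained model.
W is three by four, one row (one template) per class.

```python
import numpy as np

# Illustrative values only (assumptions for this sketch).
image = np.array([[56.0, 231.0],
                  [24.0, 2.0]])
x = image.reshape(4)                      # stretch the 2x2 image into 4 pixels

W = np.array([[0.2, -0.5, 0.1, 2.0],      # cat template
              [1.5, 1.3, 2.1, 0.0],       # dog template
              [0.0, 0.25, 0.2, -0.3]])    # ship template
b = np.array([1.1, 3.2, -1.2])            # one data independent bias per class
classes = ["cat", "dog", "ship"]

scores = W @ x + b                        # all three class scores at once

for name, template, bias, score in zip(classes, W, b, scores):
    # Each score is one row's dot product with the pixels, plus that class's bias.
    assert np.isclose(template @ x + bias, score)
    print(f"{name:4s} score = {score:.2f}")

print("highest score:", classes[int(np.argmax(scores))])
```

With these made-up weights the dog template matches
the pixels best, so dog gets the highest score.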
If we think about linear classification
from this viewpoint of template matching
we can actually take the
rows of that weight matrix
and unravel them back into images
and actually visualize
those templates as images.
And this gives us some
sense of what a linear
classifier might actually be doing
to try to understand our data.
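Unraveling the rows back into images might look like this.
A sketch under assumptions: W here is random, standing in for
a trained 10 by 3072 matrix, and row_to_image is a hypothetical
helper. The reshape has to undo whatever flattening order
was used to stretch the images out in the first place.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3072))           # stand-in for a trained weight matrix

def row_to_image(row, shape=(32, 32, 3)):
    """Rescale one row of W to [0, 255] and reshape it into an image."""
    lo, hi = row.min(), row.max()
    scaled = 255.0 * (row - lo) / (hi - lo)
    return scaled.reshape(shape).astype(np.uint8)

# One 32x32x3 template image per class.
templates = np.stack([row_to_image(W[c]) for c in range(W.shape[0])])
print("templates:", templates.shape, templates.dtype)
```

Each entry of templates can then be handed to any image viewer
or plotting library to look at the per-class template.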
So, in this example, we've
gone ahead and trained
a linear classifier on our images.
And now on the bottom we're visualizing
what are those rows in
that learned weight matrix
corresponding to each of the 10 categories
in CIFAR-10.
And in this way we kind
of get a sense for what's
going on in these images.
So, for example, in the
left, on the bottom left,
we see the template for the plane class,
kind of consists of this like blue blob,
this kind of blobby thing in the middle
and maybe blue in the background,
which gives you the sense
that this linear classifier
for plane is maybe looking for blue stuff
and blobby stuff, and those
features are going to cause
the classifier to like planes more.
Or if we look at this car example,
we kind of see that
there's a red blobby thing
through the middle and a
blue blobby thing at the top
that maybe is kind of a blurry windshield.
But this is a little bit weird,
this doesn't really look like a car.
No individual car
actually looks like this.
So the problem is that
the linear classifier
is only learning one
template for each class.
So if there's sort of
variations in how that class
might appear,
it's trying to average out all
those different variations,
all those different appearances,
and use just one single template
to recognize each of those categories.
We can also see this pretty
explicitly in the horse
classifier.
So in the horse classifier we
see green stuff on the bottom
because horses are usually on grass.
And then, if you look
carefully, the horse actually
seems to have maybe two
heads, one head on each side.
And I've never seen a
horse with two heads.
But the linear classifier
is just doing the best
that it can, because it's
only allowed to learn
one template per category.
And as we move forward
into neural networks
and more complex models,
we'll be able to achieve
much better accuracy
because they no longer
have this restriction
of just learning a single
template per category.
Another viewpoint of the linear classifier
is to go back to this idea of images
as points in high dimensional space.
And you can imagine
that each of our images
is something like a point in
this high dimensional space.
And now the linear classifier
is putting in these
linear decision boundaries
to try to draw linear
separation between one category
and the rest of the categories.
So maybe up on the upper-left hand side
we see these training
examples of airplanes
and throughout the process of training
the linear classifier will
go and try to draw this
blue line to separate
out with a single line
the airplane class from all
the rest of the classes.
And it's actually kind of
fun if you watch during
the training process these
lines will start out randomly
and then go and snap into
place to try to separate
the data properly.
But when you think about
linear classification
in this way, from this high
dimensional point of view,
you can start to see again
what are some of the problems
that might come up with
linear classification.
And it's not too hard
to construct examples
of datasets where a linear
classifier will totally fail.
So, one example, on the left here,
is that, suppose we have a
dataset of two categories,
and these are all maybe
somewhat artificial,
but maybe our dataset has two categories,
blue and red.
And the blue category is
where the number of pixels
in the image which are
greater than zero is odd.
And anything where the
number of pixels greater
than zero is even, we want to
classify as the red category.
So if you actually go and
draw what these different
decision regions look like in the plane,
you can see that our blue class
with an odd number of pixels
is going to be these two
quadrants in the plane,
and even will be the
opposite two quadrants.
So now, there's no way that we
can draw a single line
to separate the blue from the red.
So this would be an example
where a linear classifier
would really struggle.
And this is maybe not such an
artificial thing after all.
Instead of counting pixels,
maybe we're actually trying
to count whether the number
of animals or people in
an image is odd or even.
So this kind of a parity problem
of separating odds from evens
is something that linear classification
really struggles with traditionally.
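A quick way to convince yourself of this is a sketch like
the following. It assumes the simplest version of the picture:
one point per quadrant, with opposite quadrants sharing a label.
It then tries many random lines and records the best accuracy
any of them achieves. Random search is just a stand-in here,
not a training algorithm.

```python
import numpy as np

# One point per quadrant; opposite quadrants share a label.
points = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
labels = np.array([1, 1, -1, -1])         # +1 = blue, -1 = red

rng = np.random.default_rng(0)
num_trials = 20_000
w = rng.normal(size=(num_trials, 2))      # random line directions
b = rng.normal(size=(num_trials, 1))      # random offsets

# Score every point under every candidate line: shape (num_trials, 4).
scores = w @ points.T + b
preds = np.where(scores > 0, 1, -1)
accuracies = np.mean(preds == labels, axis=1)

best_acc = float(accuracies.max())
print("best accuracy over", num_trials, "random lines:", best_acc)
```

No line gets all four points right, because any line
that separates one blue point from both red points
puts the other blue point on the wrong side.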
Other situations where a linear
classifier really struggles
are multimodal situations.
So here on the right,
maybe our blue category has
these three different islands
of where the blue category lives,
and then everything else
is some other category.
So, for something like horses,
we saw on the previous example,
is something where this
actually might be happening
in practice.
Where there's maybe one
island in the pixel space of
horses looking to the left,
and another island of
horses looking to the right.
And now there's no good
way to draw a single linear
boundary between these two
isolated islands of data.
So anytime where you have multimodal data,
like one class
that can appear in
different regions of space,
is another place where linear
classifiers might struggle.
So there's kind of a lot of problems with
linear classifiers, but it
is a super simple algorithm,
super nice and easy to interpret
and easy to understand.
So you'll actually be
implementing these things
on your first homework assignment.
At this point,
we kind of talked about
what is the functional
form corresponding to a
linear classifier.
And we've seen that this functional form
of matrix vector multiply
corresponds to this idea of template matching
and learning a single
template for each category
in your data.
And then once we have this trained matrix
you can use it to actually
go and get your scores
for any new example.
But what we have not told you is
how do you actually go
about choosing the right W
for your dataset.
We've just talked about
what is the functional form
and what is going on with this thing.
So that's something we'll
really focus on next time.
And next lecture we'll talk about
what are the strategies and algorithms
for choosing the right W.
And this will lead us to questions
of loss functions and optimization
and eventually ConvNets.
So, that's a bit of the
preview for next week.
And that's all we have for today.
