- Hello, hi.
So I want to get started.
Welcome to CS 231N Lecture 11.
Today we're going to talk about
detection, segmentation,
and a whole bunch of other
really exciting topics
around core computer vision tasks.
But as usual, a couple
administrative notes.
So last time you obviously
took the midterm,
we didn't have lecture,
hopefully that went okay
for all of you but so we're
going to work on grading
the midterm this week, but as a reminder
please don't make any public discussions
about the midterm questions
or answers or whatever
until at least tomorrow
because there are still
some people taking makeup midterms today
and throughout the rest of the week
so we just ask you that
you refrain from talking
publicly about midterm questions.
Why don't you wait until Monday?
[laughing]
Okay, great.
So we're also starting to
work on midterm grading.
We'll get those back to
you as soon as we can.
We're also starting to work
on grading assignment two
so there's a lot of grading
being done this week.
The TAs are pretty busy.
Also a reminder for you guys,
hopefully you've been working
hard on your projects now that most of you
are done with the midterm
so your project milestones
will be due on Tuesday so
any sort of last minute
changes that you had in your projects,
I know some people
decided to switch projects
after the proposal, some
teams reshuffled a little bit,
that's fine but your
milestone should reflect
the project that you're actually doing
for the rest of the quarter.
So hopefully that's going well.
I know there's been a
lot of worry and stress
on Piazza, wondering
about assignment three.
So we're working on that as hard as we can
but that's actually a
bit of a new assignment,
it's changing a bit from last year
so it will be out as soon as possible,
hopefully today or tomorrow.
Although we promise that
whenever it comes out
you'll have two weeks to finish it
so try not to stress
out about that too much.
But I'm pretty excited,
I think assignment three
will be really cool, has a lot of cool,
it'll cover a lot of really cool material.
So another thing, last time in lecture
we mentioned this thing
called the Train Game
which is this really cool
thing we've been working on
sort of as a side project a little bit.
So this is an interactive
tool that you guys can go on
and use to explore a
little bit the process
of tuning hyperparameters
in practice so we hope that,
so this is again totally
not required for the course.
Totally optional, but
if you do we will offer
a small amount of extra
credit for those of you
who want to do well and
participate on this.
And we'll send out
some more details
later this afternoon on Piazza.
But just a bit of a demo for
what exactly is this thing.
So you'll get to go in
and we've changed the name
from Train Game to HyperQuest
because you're questing
to solve, to find the best
hyperparameters for your model
so this is really cool,
it'll be an interactive tool
that you can use to explore
the tuning of hyperparameters
interactively in your browser.
So you'll login with
your student ID and name.
You'll fill out a little survey with some
of your experience on deep learning
then you'll read some instructions.
So in this game you'll be
shown some random data set
on every trial.
This data set might be
images or it might be vectors
and your goal is to
train a model by picking
the right hyperparameters
interactively to perform
as well as you can on the validation set
of this random data set.
And it'll sort of keep
track of your performance
over time and there'll be a leaderboard,
it'll be really cool.
So every time you play the game,
you'll get some statistics
about your data set.
In this case we're doing
a classification problem
with 10 classes.
You can see down at the bottom
you have these statistics
about random data set, we have 10 classes.
The input data size is three by 32 by 32
so this is some image
data set and we can see
that in this case we have 8500 examples
in the training set and 1500
examples in the validation set.
These are all random, they'll change
a little bit every time.
Based on these data set statistics
you'll make some choices
on your initial learning rate,
your initial network size,
and your initial dropout rate.
Then you'll see a screen
like this where it'll run
one epoch with those
chosen hyperparameters,
show you on the right
here you'll see two plots.
One is your training and validation loss
for that first epoch.
Then you'll see your training
and validation accuracy
for that first epoch and
based on the gaps that you see
in these two graphs you can
make choices interactively
to change the learning
rates and hyperparameters
for the next epoch.
So then you can either
choose to continue training
with the current or
changed hyperparameters,
you can also stop training,
or you can revert to
go back to the previous checkpoint
in case things got really messed up.
So then you'll get to make some choice,
so here we'll decide to continue training
and in this case you could
go and set new learning rates
and new hyperparameters for
the next epoch of training.
You can also, kind of interesting here,
you can actually grow
the network interactively
during training in this demo.
There's this cool trick
from a couple recent papers
where you can either take existing layers
and make them wider or add
new layers to the network
in the middle of training
while still maintaining
the same function in the
network so you can do that
to increase the size of
your network in the middle
of training here which is kind of cool.
So then you'll make
choices over several epochs
and eventually your
final validation accuracy
will be recorded and we'll
have some leaderboard
that compares your score on that data set
to some simple baseline models.
And depending on how well
you do on this leaderboard
we'll again offer some small
amounts of extra credit
for those of you who
choose to participate.
So this is again, totally
optional, but I think
it can be a really cool
learning experience for you guys
to play around with and
explore how hyperparameters
affect the learning process.
Also, it's really useful for us.
You'll help science out by
participating in this experiment.
We're pretty interested in
seeing how people behave
when they train neural networks
so you'll be helping us out
as well if you decide to play this.
But again, totally optional, up to you.
Any questions on that?
Hopefully at some point but it's.
So the question was will this be a paper
or whatever eventually?
Hopefully but it's really
early stages of this project
so I can't make any
promises but I hope so.
But I think it'll be really cool.
[laughing]
Yeah, so the question is
how can you add layers
during training?
I don't really want to
get into that right now
but the paper to read is
Net2Net, Ian Goodfellow's
one of the authors, and
there's another paper
from Microsoft called Network Morphism.
So if you read those two papers
you can see how this works.
Okay, so last time, a bit of a reminder
before we had the midterm
last time we talked
about recurrent neural networks.
We saw that recurrent
neural networks can be used
for different types of problems.
In addition to one to one
we can do one to many,
many to one, many to many.
We saw how this can apply
to language modeling
and we saw some cool examples
of applying neural networks
to model different sorts of
languages at the character level
and we sampled these
artificial math and Shakespeare
and C source code.
We also saw how similar
things could be applied
to image captioning by connecting
a CNN feature extractor
together with an RNN language model.
And we saw some really
cool examples of that.
We also talked about the
different types of RNNs.
We talked about this Vanilla RNN.
I also want to mention that
this is sometimes called
a Simple RNN or an Elman RNN so you'll see
all of these different
terms in literature.
We also talked about the Long
Short Term Memory or LSTM.
And we talked about how
the LSTM has this crazy set of equations
but it makes sense because it
helps improve gradient flow
during back propagation
and helps this thing model
longer term dependencies
in our sequences.
So today we're going to
switch gears and talk about
a whole bunch of different exciting tasks.
So far we've been talking about
mostly the image classification problem.
Today we're going to talk
about various types of other
computer vision tasks where
you actually want to go in
and say things about the spatial
pixels inside your images
so we'll see segmentation,
localization, detection,
a couple other different
computer vision tasks
and how you can approach these
with convolutional neural networks.
So as a bit of refresher,
so far the main thing
we've been talking about in this class
is image classification so
here we're going to have
some input image come in.
That input image will go through
some deep convolutional network,
that network will give
us some feature vector
of maybe 4096 dimensions
in the case of AlexNet or VGG
and then from that final feature vector
we'll have some fully-connected,
some final fully-connected layer
that gives us 1000 numbers
for the different class scores
that we care about where
1000 is maybe the number
of classes in ImageNet in this example.
And then at the end of the day
what the network does is we input an image
and then we output a single category label
saying what is the content of
this entire image as a whole.
But this is maybe the
most basic possible task
in computer vision and
there's a whole bunch
of other interesting types of tasks
that we might want to
solve using deep learning.
So today we're going to talk about several
of these different tasks and
step through each of these
and see how they all
work with deep learning.
So we'll talk about these more in detail
about what each problem is as we get to it
but this is kind of a summary slide
that we'll talk first about
semantic segmentation.
We'll talk about classification
and localization,
then we'll talk about object detection,
and finally a couple brief words
about instance segmentation.
So first is the problem
of semantic segmentation.
In the problem of semantic segmentation,
we want to input an image
and then output a decision
of a category for every
pixel in that image.
So this input image for example
is this cat walking through
the field, he's very cute.
And in the output we want to say
for every pixel is that pixel
a cat or grass or sky or trees
or background or some
other set of categories.
So we're going to have
some set of categories
just like we did in the
image classification case
but now rather than
assigning a single category
label to the entire
image, we want to produce
a category label for each
pixel of the input image.
And this is called semantic segmentation.
So one interesting thing
about semantic segmentation
is that it does not
differentiate instances
so in this example on the
right we have this image
with two cows where
they're standing right next
to each other and when
we're talking about semantic
segmentation we're just
labeling all the pixels
independently for what is
the category of that pixel.
So in the case like this
where we have two cows
right next to each other
the output does not distinguish
between these two cows.
Instead we just get a whole mass of pixels
that are all labeled as cow.
So this is a bit of a shortcoming
of semantic segmentation
and we'll see how we can fix this later
when we move to instance segmentation.
But at least for now we'll just talk about
semantic segmentation first.
So one potential approach for attacking
semantic segmentation might
be through classification.
So you could use this idea
of a sliding window approach
to semantic segmentation.
So you might imagine that
we take our input image
and we break it up into many
many small, tiny local crops
of the image so in this
example we've taken
maybe three crops from
around the head of this cow
and then you could imagine
taking each of those crops
and now treating this as
a classification problem.
Saying for this crop, what is the category
of the central pixel of the crop?
And then we could use
all the same machinery
that we've developed for
classifying entire images
but now just apply it on crops rather than
on the entire image.
And this would probably
work to some extent
but it's probably not a very good idea.
So this would end up being super super
computationally expensive
because we want to label
every pixel in the image,
we would need a separate
crop for every pixel in
that image and this would be
super super expensive to
run forward and backward
passes through.
And moreover,
if you think about this
we can actually share
computation between different
patches so if you're trying
to classify two patches
that are right next to each
other and actually overlap
then the convolutional
features of those patches
will end up going through
the same convolutional layers
and we can actually share
a lot of the computation
when applying this type of approach
to separate patches in the image.
So this is actually a terrible
idea and nobody does this
and you should probably not do this
but it's at least the first
thing you might think of
if you were trying to think
about semantic segmentation.
Then the next idea that works a bit better
is this idea of a fully
convolutional network right.
So rather than extracting
individual patches from the image
and classifying these
patches independently,
we can imagine just having
our network be a whole giant
stack of convolutional layers
with no fully connected
layers or anything so in this
case we just have a bunch
of convolutional layers that
are all maybe three by three
with zero padding or something like that
so that each convolutional
layer preserves the spatial size
of the input and now if we pass our image
through a whole stack of
these convolutional layers,
then the final convolutional
layer could just output
a tensor of something by C by H by W
where C is the number of
categories that we care about
and you could see this
tensor as just giving
our classification scores for every pixel
in the input image at every
location in the input image.
And we could compute this all at once
with just some giant stack
of convolutional layers.
And then you could imagine
training this thing
by putting a classification
loss at every pixel
of this output, taking an
average over those pixels
in space, and just training
this kind of network
through normal, regular back propagation.
Question?
Oh, the question is how do you develop
training data for this?
It's very expensive right.
So the training data for this would be
we need to label every
pixel in those input images
so there's tools that
people sometimes have online
where you can go in and
sort of draw contours
around the objects and
then fill in regions
but in general getting
this kind of training data
is very expensive.
Yeah, the question is
what is the loss function?
So here since we're making
a classification decision
per pixel then we put a cross entropy loss
on every pixel of the output.
So we have the ground truth category label
for every pixel in the output,
then we compute a cross entropy loss
between every pixel in the output
and the ground truth pixels and then
take either a sum or an average over space
and then sum or average
over the mini-batch.
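
As a rough illustration of the per-pixel loss just described, here is a minimal NumPy sketch. The function name, the array shapes, and the random inputs are assumptions made for this example, not something shown in the lecture.

```python
import numpy as np

def pixelwise_cross_entropy(scores, labels):
    """Average cross entropy over every pixel and every image.

    scores: float array of shape (N, C, H, W), raw class scores per pixel.
    labels: int array of shape (N, H, W), ground truth category per pixel.
    """
    # Subtract the per-pixel max so the softmax is numerically stable.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the correct class at each pixel.
    n, h, w = np.indices(labels.shape)
    correct = log_probs[n, labels, h, w]
    # Average over space and over the mini-batch.
    return -correct.mean()

# Hypothetical mini-batch: 2 images, 5 categories, 4 by 4 pixels.
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 5, 4, 4))
labels = rng.integers(0, 5, size=(2, 4, 4))
loss = pixelwise_cross_entropy(scores, labels)
print(loss)
```

Using a sum instead of a mean only rescales the loss, which in practice mostly interacts with the choice of learning rate.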
Question?
Yeah, yeah.
Yeah, the question is do we assume
that we know the categories?
So yes, we do assume that we
know the categories up front
so this is just like the
image classification case.
So in image classification we
know at the start of training
based on our data set that
maybe there's 10 or 20
or 100 or 1000 classes that we care about
for this data set and
then here we are fixed
to that set of classes that
are fixed for the data set.
So this model is relatively simple
and you can imagine this
working reasonably well
assuming that you tuned all
the hyperparameters right
but it's kind of a problem right.
So in this setup, since
we're applying a bunch
of convolutions that
are all keeping the same
spatial size of the input image,
this would be super super expensive right.
If you wanted to do
convolutions that maybe have
64 or 128 or 256 channels for
those convolutional filters
which is pretty common in
a lot of these networks,
then running those convolutions
on this high resolution
input image over a
sequence of layers would be
extremely computationally expensive
and would take a ton of memory.
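
To put rough numbers on that cost, here is a back-of-the-envelope sketch. The 512 by 512 image size, the 64 channels, and the helper name are assumptions chosen for illustration only.

```python
def conv_cost(h, w, c_in, c_out, k=3):
    """Rough cost of one k-by-k convolution that preserves spatial size."""
    # One multiply-add per kernel element, per input channel,
    # per output channel, per output pixel.
    macs = h * w * c_in * c_out * k * k
    # Bytes needed to store the output activation as float32.
    activation_bytes = h * w * c_out * 4
    return macs, activation_bytes

# Hypothetical layer at full resolution vs. after downsampling by 4 per side.
full = conv_cost(512, 512, 64, 64)
quarter = conv_cost(128, 128, 64, 64)
print(full, quarter)
print(full[0] // quarter[0])  # ratio of multiply-adds between the two
```

Both the compute and the activation memory scale with the number of pixels, so working at a quarter of the resolution per side cuts each by a factor of 16 for the same layer.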
So in practice, you don't
usually see networks
with this architecture.
Instead you tend to see
networks that look something
like this where we have some downsampling
and then some upsampling
of the feature map
inside the network.
So rather than doing all the convolutions
at the full spatial
resolution of the image,
we'll maybe go through a small number
of convolutional layers
at the original resolution
then downsample that
feature map using something
like max pooling or strided convolutions
and sort of downsample, downsample,
so we have convolutions and downsampling
and convolutions and downsampling
that look much like a lot of
the classification networks
that you see but now
the difference is that
rather than transitioning
to a fully connected layer
like you might do in an
image classification setup,
instead we want to increase
the spatial resolution
of our predictions in the
second half of the network
so that our output image
can now be the same size
as our input image and this ends up being
much more computationally efficient
because you can make the network very deep
and work at a lower spatial resolution
for many of the layers at
the inside of the network.
So we've already seen
examples of downsampling
when it comes to convolutional networks.
We've seen that you can
do strided convolutions
or various types of pooling
to reduce the spatial size
of the image inside a
network but we haven't
really talked about
upsampling and the question
you might be wondering is
what do these upsampling
layers actually look
like inside the network?
And what are our strategies
for increasing the size
of a feature map inside the network?
Sorry, was there a question in the back?
Yeah, so the question
is how do we upsample?
And the answer is that's the topic
of the next couple slides.
[laughing]
So one strategy for
upsampling is something like
unpooling so we have
this notion of pooling
to downsample so we talked
about average pooling
or max pooling so when we
talked about average pooling
we're kind of taking a spatial average
within a receptive field
of each pooling region.
One kind of analog for
upsampling is this idea
of nearest neighbor unpooling.
So here on the left we see this example
of nearest neighbor
unpooling where our input
is maybe some two by
two grid and our output
is a four by four grid
and now in our output
we've done a two by two
stride two nearest neighbor
unpooling or upsampling
where we've just duplicated
that element for every
point in our two by two
receptive field of the unpooling region.
Another thing you might see
is this bed of nails unpooling
or bed of nails upsampling
where you'll just take,
again we have a two by two receptive field
for our unpooling regions
and then you'll take the,
in this case you make it all
zeros except for one element
of the unpooling region so
in this case we've taken
all of our inputs and
always put them in the upper
left hand corner of this unpooling region
and everything else is zeros.
And this is kind of like a bed of nails
because the zeros are very flat,
then you've got these things poking up
for the values at these
various non-zero regions.
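
The two fixed upsampling schemes described here can be sketched in a few lines of NumPy. The function names and the two by two input values are made up for this example.

```python
import numpy as np

def nearest_neighbor_unpool(x, k=2):
    # Copy each input element into every cell of its k-by-k output region.
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def bed_of_nails_unpool(x, k=2):
    # Zeros everywhere except the upper-left corner of each k-by-k region.
    h, w = x.shape
    out = np.zeros((h * k, w * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn = nearest_neighbor_unpool(x)   # rows: 1 1 2 2 / 1 1 2 2 / 3 3 4 4 / 3 3 4 4
bon = bed_of_nails_unpool(x)      # rows: 1 0 2 0 / 0 0 0 0 / 3 0 4 0 / 0 0 0 0
print(nn)
print(bon)
```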
Another thing that you see
sometimes which was alluded to
by the question a minute ago
is this idea of max unpooling
so in a lot of these networks
they tend to be symmetrical
where we have a downsampling
portion of the network
and then an upsampling
portion of the network
with a symmetry between those
two portions of the network.
So sometimes what you'll see
is this idea of max unpooling
where for each unpooling,
for each upsampling layer,
it is associated with
one of the pooling layers
in the first half of the network
and now in the first half,
in the downsampling when we do max pooling
we'll actually remember which
element of the receptive field
during max pooling was
used to do the max pooling
and now when we go through
the rest of the network
then we'll do something that
looks like this bed of nails
upsampling except rather than
always putting the elements
in the same position,
instead we'll stick it
into the position that was
used in the corresponding
max pooling step earlier in the network.
I'm not sure if that explanation was clear
but hopefully the picture makes sense.
Yeah, so then you just end up
filling the rest with zeros.
So then you fill the rest with zeros
and then you stick the elements
from the low resolution
patch up into the high resolution patch
at the points where the
max pooling took place
at the corresponding max pooling there.
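
In case the verbal explanation is hard to follow, here is a minimal NumPy sketch of max pooling that remembers positions, paired with the matching max unpooling. The helper names and the example numbers are assumptions for illustration; a real network would do this per channel and per image.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k-by-k, stride-k max pooling on a 2D array; also returns argmax positions."""
    h, w = x.shape
    # Rearrange so the last axis holds the k*k elements of each pooling region.
    regions = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    idx = regions.argmax(axis=2)   # flat position of the max inside each region
    pooled = regions.max(axis=2)
    return pooled, idx

def max_unpool(y, idx, k=2):
    """Put each value of y at its remembered position; fill the rest with zeros."""
    h, w = y.shape
    out = np.zeros((h * k, w * k), dtype=y.dtype)
    rows, cols = np.indices((h, w))
    # idx // k is the row offset and idx % k the column offset inside a region.
    out[rows * k + idx // k, cols * k + idx % k] = y
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, idx = max_pool_with_indices(x)

# Later in the network, some low resolution map is unpooled with the same indices.
y = np.array([[1, 2],
              [3, 4]])
unpooled = max_unpool(y, idx)
print(pooled)
print(unpooled)
```

The only extra state carried from the downsampling half to the upsampling half is `idx`, one small integer per pooling region.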
Okay, so that's kind
of an interesting idea.
Sorry, question?
Oh yeah, so the question
is why is this a good idea?
Why might this matter?
So the idea is that when we're
doing semantic segmentation
we want our predictions
to be pixel perfect right.
We kind of want to get
those sharp boundaries
and those tiny details in
our predicted segmentation
so now if you're doing this max pooling,
there's this sort of
heterogeneity that's happening
inside the feature map
due to the max pooling
where from the low resolution
image you don't know,
you're sort of losing spatial
information in some sense
by you don't know where that
feature vector came from
in the local receptive
field after max pooling.
So if you actually unpool
by putting the vector
in the same slot you might
think that that might help us
handle these fine details
a little bit better
and help us preserve some
of that spatial information
that was lost during max pooling.
Question?
The question is does this make
things easier for back prop?
Yeah, I guess, I don't think
it changes the back prop
dynamics too much because
storing these indices
is not a huge computational overhead.
They're pretty small in
comparison to everything else.
So another thing that you'll see sometimes
is this idea of transpose convolution.
So transpose convolution,
so for these various types
of unpooling that we just talked about,
these bed of nails, this nearest neighbor,
this max unpooling, all
of these are kind of
a fixed function, they're
not really learning exactly
how to do the upsampling so
if you think about something
like strided convolution,
strided convolution
is kind of like a learnable
layer that learns the way
that the network wants
to perform downsampling
at that layer.
And by analogy with that
there's this type of layer
called a transpose
convolution that lets us do
kind of learnable upsampling.
So it will both upsample the feature map
and learn some weights about how it wants
to do that upsampling.
And this is really just
another type of convolution
so to see how this works
remember how a normal
three by three stride one pad
one convolution would work.
That for this kind of normal convolution
that we've seen many
times now in this class,
our input might be four by four,
our output might be four by four,
and now we'll have this
three by three kernel
and we'll take an inner product between,
we'll plop down that kernel
at the corner of the image,
take an inner product,
and that inner product
will give us the value and the activation
in the upper left hand
corner of our output.
And we'll repeat this
for every receptive field
in the image.
Now if we talk about strided convolution
then strided convolution ends
up looking pretty similar.
However, our input is
maybe a four by four region
and our output is a two by two region.
But we still have this idea of taking,
of there being some three
by three filter or kernel
that we plop down in
the corner of the image,
take an inner product
and use that to compute
a value of the activation and the output.
But now with strided
convolution the idea is that
rather than plopping down that filter
at every possible point in the input,
instead we're going to move
the filter by two pixels
in the input every time we move by
one pixel in the output.
Right so this stride
of two gives us a ratio
between how much do we move in the input
versus how much do we move in the output.
So when you do a strided
convolution with stride two
this ends up downsampling
the image or the feature map
by a factor of two in
kind of a learnable way.
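
A naive single-channel sketch of the stride one and stride two cases just described. The function name and the all-ones kernel are assumptions for this example; in a network the kernel values would be learned.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Single-channel 2D convolution (no kernel flip, as in most deep learning code)."""
    k = w.shape[0]
    x = np.pad(x, pad)   # zero padding on every side
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Moving one step in the output moves `stride` steps in the input.
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)   # inner product of kernel and patch
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))   # placeholder kernel
same = conv2d(x, w, stride=1, pad=1)   # four by four in, four by four out
down = conv2d(x, w, stride=2, pad=1)   # four by four in, two by two out
print(same.shape, down.shape)
```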
And now a transpose convolution
is sort of the opposite
in a way so here our input
will be a two by two region
and our output will be
a four by four region.
But now the operation that we perform
with transpose convolution
is a little bit different.
Now so rather than taking an inner product
instead what we're going
to do is we're going to
take the value of our input feature map
at that upper left hand
corner and that'll be
some scalar value in the
upper left hand corner.
We're going to multiply the
filter by that scalar value
and then copy those values
over to this three by three
region in the output so rather
than taking an inner product
with our filter and the
input, instead our input
gives weights that we will
use to weight the filter
and then our output will be
weighted copies of the filter
that are weighted by
the values in the input.
And now we can do this
sort of same ratio trick
in order to upsample so
now when we move one pixel
in the input now we can
plop our filter down
two pixels away in the output
and it's the same trick
that now the blue pixel in
the input is some scalar value
and we'll take that scalar value,
multiply it by the values in the filter,
and copy those weighted filter values
into this new region in the output.
The tricky part is that
sometimes these receptive fields
in the output can overlap
now and now when these
receptive fields in the output overlap
we just sum the results in the output.
So then you can imagine
repeating this everywhere
and repeating this process everywhere
and this ends up doing sort
of a learnable upsampling
where we use these learned
convolutional filter weights
to upsample the image and
increase the spatial size.
By the way, you'll see this operation go
by a lot of different names in literature.
Sometimes this gets called
things like deconvolution
which I think is kind of a
bad name but you'll see it
out there in papers so from a
signal processing perspective
deconvolution means the inverse
operation to convolution
which this is not however
you'll frequently see this
type of layer called a deconvolution layer
in some deep learning
papers so be aware of that,
watch out for that terminology.
You'll also sometimes see
this called upconvolution
which is kind of a cute name.
Sometimes it gets called
fractionally strided convolution
because if we think of the
stride as the ratio in step
between the input and the output
then now this is something
like a stride one half
convolution because of this ratio
of one to two between steps in the input
and steps in the output.
This also sometimes gets
called a backwards strided
convolution because if you think about it
and work through the math
this ends up being the same,
the forward pass of a
transpose convolution
ends up being the same
mathematical operation
as the backwards pass
in a normal convolution
so you might have to take my word for it,
that might not be super obvious
when you first look at this
but that's kind of a neat
fact so you'll sometimes
see that name as well.
And as maybe a bit of
a more concrete example
of what this looks like I think
it's maybe a little easier
to see in one dimension.
So here we're doing a three by one
transpose convolution in one dimension.
So our filter here is just three numbers.
Our input is two numbers
and now you can see
that in our output we've
taken the values in the input,
used them to weight the
values of the filter
and plopped down those
weighted filters in the output
with a stride of two and now
where these receptive fields
overlap in the output then we sum.
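
The one-dimensional example can be written out directly. This is a sketch with made-up numbers for the filter and the input, and it keeps the full output rather than cropping it.

```python
import numpy as np

def conv_transpose1d(x, w, stride=2):
    """1D transpose convolution: weighted copies of the filter, overlaps summed."""
    k = len(w)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        # Scale the whole filter by one input value and add it into the output.
        # The += is what sums the places where receptive fields overlap.
        out[i * stride:i * stride + k] += value * w
    return out

w = np.array([1.0, 2.0, 3.0])   # hypothetical filter with three numbers
a = np.array([10.0, 20.0])      # hypothetical input with two numbers
out = conv_transpose1d(a, w)
print(out)   # values 10, 20, 50, 40, 60; the middle one is 30 + 20
```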
So you might be wondering,
this is kind of a funny name.
Where does the name transpose
convolution come from
and why is that actually my preferred name
for this operation?
So that comes from this kind of
neat interpretation of convolution.
So it turns out that any
time you do convolution
you can always write convolution
as a matrix multiplication.
So again, this is kind of easier to see
with a one-dimensional example
but here we've got some weight.
So we're doing a
one-dimensional convolution
of a weight vector x
which has three elements,
and an input vector a
which has four elements,
A, B, C, D.
So here we're doing a
three by one convolution
with stride one and you
can see that we can frame
this whole operation as
a matrix multiplication
where we take our convolutional kernel x
and turn it into some matrix capital X
which contains copies of
that convolutional kernel
that are offset by different regions.
And now we can take this
giant weight matrix X
and do a matrix vector
multiplication between X
and our input a and this
just produces the same result
as convolution.
And now transpose convolution means
that we're going to take
this same weight matrix
but now we're going to
multiply by the transpose
of that same weight matrix.
So here you can see the same
example for this stride one
convolution on the left and
the corresponding stride one
transpose convolution on the right.
And if you work through
the details you'll see
that when it comes to stride one,
a stride one transpose
convolution also ends up being
a stride one normal convolution
so there's a little bit
of details in the way that
the border and the padding
are handled but it's
fundamentally the same operation.
But now things look different
when you talk about a stride of two.
So again, here on the left
we can take a stride two
convolution and write out
this stride two convolution
as a matrix multiplication.
And now the corresponding
transpose convolution
is no longer a convolution so if you look
through this weight matrix and think about
how convolutions end up
getting represented in this way
then now this transposed
matrix for the stride two
convolution is something
fundamentally different
from the original normal
convolution operation
so that's kind of the
reasoning behind the name
and that's why I think that's
kind of the nicest name
to call this operation by.
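
Here is a small NumPy sketch of the matrix view, under the assumption of a three-element kernel and a four-element input with one zero of padding on each side. The helper name and the numbers are made up. Since the gradient of X a with respect to a is the transpose of X times the upstream gradient, the same construction also illustrates the earlier remark about the backward pass.

```python
import numpy as np

def conv_matrix(w, n_in, stride=1):
    """Matrix X such that X @ a is the 1D convolution of a (length n_in) with w."""
    k = len(w)
    n_out = (n_in - k) // stride + 1
    X = np.zeros((n_out, n_in))
    for i in range(n_out):
        X[i, i * stride:i * stride + k] = w   # shifted copy of the kernel
    return X

w = np.array([1.0, 2.0, 3.0])
a = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 0.0])   # four inputs plus zero padding

X1 = conv_matrix(w, len(a), stride=1)   # shape (4, 6)
X2 = conv_matrix(w, len(a), stride=2)   # shape (2, 6)

# Normal convolution: the matrix product matches sliding the kernel by hand.
by_hand = np.array([np.dot(w, a[i:i + 3]) for i in range(4)])
print(X1 @ a, by_hand)

# Stride one: multiplying by the transpose is again a convolution
# (np.convolve slides the reversed kernel over a padded input).
b4 = np.array([1.0, 0.0, 2.0, 1.0])
print(X1.T @ b4, np.convolve(b4, w))

# Stride two: the transpose places weighted copies of w two apart and sums overlaps.
b2 = np.array([10.0, 20.0])
up = X2.T @ b2
print(up)
```

In the stride two case the rows of the matrix are offset by two columns, so the transpose maps two numbers to six and spreads each one over a three-wide window, which is the upsampling behavior described above.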
Sorry, was there a question?
Sorry?
It's very possible there's
a typo in the slide
so please point out on
Piazza and I'll fix it
but I hope the idea was clear.
Is there another question?
Okay, thank you [laughing].
Yeah, so, oh no lots of questions.
Yeah, so the issue is why
do we sum and not average?
So the reason we sum is due
to this transpose convolution
formulation so that's
the reason why we sum
but you're right that
this is kind of a problem
that the magnitudes
will actually vary in the output depending
on how many receptive
fields overlap in the output.
So actually in practice this
is something that people
started to point out very
recently and somewhat
switched away from this
stride, so using three by three
stride two transpose
convolution upsampling
can sometimes produce some
checkerboard artifacts
in the output exactly due to that problem.
So what I've seen in a
couple more recent papers
is maybe to use four by four stride two
or two by two stride two
transpose convolution
for upsampling and that helps alleviate
that problem a little bit.
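
One way to see where the uneven magnitudes come from is to count how many filter copies land on each output position. This is a one-dimensional sketch; the helper name and the input length of five are assumptions for illustration.

```python
import numpy as np

def overlap_counts(kernel_size, stride, n_in):
    """Number of filter copies covering each output position of a 1D transpose convolution."""
    counts = np.zeros((n_in - 1) * stride + kernel_size, dtype=int)
    for i in range(n_in):
        counts[i * stride:i * stride + kernel_size] += 1
    return counts

c3 = overlap_counts(3, stride=2, n_in=5)   # 1 1 2 1 2 1 2 1 2 1 1
c4 = overlap_counts(4, stride=2, n_in=5)   # 1 1 2 2 2 2 2 2 2 2 1 1
c2 = overlap_counts(2, stride=2, n_in=5)   # 1 1 1 1 1 1 1 1 1 1
print(c3)
print(c4)
print(c2)
```

With a kernel size of three and stride two the count alternates between one and two across the output, while kernel sizes that are a multiple of the stride give a uniform count away from the borders.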
Yeah, so the question is what
is a stride half convolution
and where does that terminology come from?
I think that was from my paper.
So that was actually, yes
that was definitely this.
So at the time I was writing that paper
I was kind of into the name
fractionally strided convolution
but after thinking about
it a bit more I think
transpose convolution is
probably the right name.
So then this idea of semantic segmentation
actually ends up being pretty natural.
You just have this giant
convolutional network
with downsampling and
upsampling inside the network
and now our downsampling will
be by strided convolution
or pooling, our upsampling will
be by transpose convolution
or various types of
unpooling or upsampling
and we can train this
whole thing end to end
with back propagation using
this cross entropy loss
over every pixel.
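As a rough sketch of what a cross entropy loss over every pixel means, the following plain Python averages the usual softmax loss over all pixel positions. The data layout (nested lists of per-pixel class scores) and the function name are assumptions for illustration; a real implementation would use tensors.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """scores[y][x] is a list of C class scores; labels[y][x] is the true class index."""
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for s, t in zip(score_row, label_row):
            m = max(s)  # subtract the max for numerical stability
            log_norm = m + math.log(sum(math.exp(v - m) for v in s))
            total += log_norm - s[t]  # negative log probability of the true class
            count += 1
    return total / count


# A 1 x 2 "image" with two classes and uninformative scores.
example_scores = [[[0.0, 0.0], [0.0, 0.0]]]
example_labels = [[0, 1]]
example_loss = pixelwise_cross_entropy(example_scores, example_labels)
```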
So this is actually pretty
cool that we can take
a lot of the same machinery
that we already learned
for image classification
and now just apply it
very easily to extend
to new types of problems
so that's super cool.
So the next task that I want
to talk about is this idea
of classification plus localization.
So we've talked about
image classification a lot
where we want to just
assign a category label
to the input image but
sometimes you might want to know
a little bit more about the image.
In addition to predicting
what the category is,
in this case the cat, you
might also want to know
where is that object in the image?
So in addition to predicting
the category label cat,
you might also want to draw a bounding box
around the region of
the cat in that image.
And classification plus localization,
the distinction here between
this and object detection
is that in the localization
scenario you assume
ahead of time that you know
there's exactly one object
in the image that you're looking
for or maybe more than one
but you know ahead of time
that we're going to make
some classification
decision about this image
and we're going to produce
exactly one bounding box
that's going to tell us
where that object is located
in the image so we
sometimes call that task
classification plus localization.
And again, we can reuse a
lot of the same machinery
that we've already learned
from image classification
in order to tackle this problem.
So kind of a basic
architecture for this problem
looks something like this.
So again, we have our input image,
we feed our input image through some giant
convolutional network,
this is AlexNet for
example, which will give us
some final vector summarizing
the content of the image.
Then just like before we'll
have some fully connected layer
that goes from that final
vector to our class scores.
But now we'll also have
another fully connected layer
that goes from that
vector to four numbers.
Where the four numbers are something like
the height, the width,
and the x and y positions
of that bounding box.
And now our network will
produce these two different
outputs, one is this set of class scores,
and the other are these four
numbers giving the coordinates
of the bounding box in the input image.
And now during training time,
when we train this network
we'll actually have two
losses so in this scenario
we're sort of assuming a
fully supervised setting
so we assume that each
of our training images
is annotated with both a
category label and also
a ground truth bounding box
for that category in the image.
So now we have two loss functions.
We have our favorite
softmax loss that we compute
using the ground truth category label
and the predicted class scores,
and we also have some
kind of loss that gives us
some measure of dissimilarity
between our predicted
coordinates for the bounding box
and our actual coordinates
for the bounding box.
So one very simple thing
is to just take an L2 loss
between those two and that's
kind of the simplest thing
that you'll see in
practice although sometimes
people play around with
this and maybe use L1
or smooth L1 or they
parametrize the bounding box
a little bit differently but
the idea is always the same,
that you have some regression loss
between your predicted
bounding box coordinates
and the ground truth
bounding box coordinates.
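The regression losses mentioned here can be sketched in a few lines of plain Python. The box values are made up for illustration, and the beta parameter of the smooth L1 loss (the point where it switches from quadratic to linear) is an assumed default of 1.0.

```python
def l2_loss(pred, target):
    """Sum of squared differences between predicted and true coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def l1_loss(pred, target):
    """Sum of absolute differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


pred_box = [50.0, 60.0, 100.0, 80.0]  # hypothetical predicted box (4 numbers)
true_box = [52.0, 60.5, 100.0, 77.0]  # hypothetical ground truth box

box_l2 = l2_loss(pred_box, true_box)
box_l1 = l1_loss(pred_box, true_box)
box_smooth = smooth_l1_loss(pred_box, true_box)
```

The L2 loss is dominated by the largest coordinate error, while L1 and smooth L1 grow only linearly with it, which is one reason people sometimes prefer them.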
Question?
Sorry, go ahead.
So the question is, is this a good idea
to do all at the same time?
Like what happens if you misclassify,
should you even look
at the box coordinates?
So sometimes people get fancy with it,
so in general it works okay.
It's not a big problem, you
can actually train a network
to do both of these
things at the same time
and it'll figure it out but
sometimes things can get tricky
in terms of misclassification
so sometimes what you'll see
for example is that rather
than predicting a single box
you might make predictions
like a separate prediction
of the box for each category
and then only apply loss
to the predicted box corresponding
to the ground truth category.
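A minimal sketch of that per-category variant, assuming the network outputs one four-number box per class and using an L2 loss; the function name and the example numbers are hypothetical.

```python
def class_specific_box_loss(pred_boxes, true_class, true_box):
    """pred_boxes[c] is the 4-number box predicted for class c."""
    chosen = pred_boxes[true_class]  # boxes predicted for all other classes get no loss
    return sum((p - t) ** 2 for p, t in zip(chosen, true_box))


pred_boxes = [
    [0.0, 0.0, 0.0, 0.0],  # box predicted for class 0
    [1.0, 1.0, 1.0, 1.0],  # box predicted for class 1
]
per_class_loss = class_specific_box_loss(pred_boxes, true_class=1, true_box=[1.0, 1.0, 1.0, 2.0])
```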
So people do get a little
bit fancy with these things
that sometimes helps a bit in practice.
But at least this basic
setup, it might not be perfect
or it might not be
optimal but it will work
and it will do something.
Was there a question in the back?
Yeah, so that's the
question is do these losses
have different units, do
they dominate the gradient?
So this is what we call a multi-task loss
so whenever we're taking
derivatives we always
want to take derivative
of a scalar with respect
to our network parameters
and use that derivative
to take gradient steps.
But now we've got two scalars
that we want to both minimize
so what you tend to do in
practice is have some additional
hyperparameter that
gives you some weighting
between these two losses so
you'll take a weighted sum
of these two different loss functions
to give our final scalar loss.
And then you'll take your
gradients with respect
to this weighted sum of the two losses.
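Here is a small sketch of that weighted sum, assuming a softmax loss for the class scores and an L2 loss for the box. The name box_weight for the weighting hyperparameter and the example numbers are assumptions for illustration.

```python
import math


def softmax_loss(scores, label):
    """Cross entropy of the softmax over class scores for one example."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[label]


def multitask_loss(scores, label, pred_box, true_box, box_weight=1.0):
    """Return (total, classification loss, regression loss)."""
    cls = softmax_loss(scores, label)
    reg = sum((p - t) ** 2 for p, t in zip(pred_box, true_box))
    return cls + box_weight * reg, cls, reg


scores = [0.0, 0.0, 0.0]
pred_box, true_box = [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]

total_a, cls_a, reg_a = multitask_loss(scores, 0, pred_box, true_box, box_weight=0.5)
total_b, cls_b, reg_b = multitask_loss(scores, 0, pred_box, true_box, box_weight=2.0)
```

The two calls use identical predictions, so the individual losses match, but the totals differ purely because of the weight. That is why comparing total loss values across different settings of this hyperparameter is not very meaningful.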
And this ends up being
really really tricky
because this weighting
parameter is a hyperparameter
that you need to set but
it's kind of different
from some of the other hyperparameters
that we've seen so far in the past right
because this weighting hyperparameter
actually changes the
value of the loss function
so one thing that you might often look at
when you're trying to set hyperparameters
is you might make different
hyperparameter choices
and see what happens to the loss
under different choices
of hyperparameters.
But in this case because
the loss actually,
because the hyperparameter
affects the absolute value
of the loss making those
comparisons becomes kind of tricky.
So setting that hyperparameter
is somewhat difficult.
And in practice, you
kind of need to take it
on a case by case basis
for exactly the problem
you're solving but my
general strategy for this
is to have some other
metric of performance
that you care about other
than the actual loss value
which then you actually use
that final performance metric
to make your cross validation
choices rather than looking
at the value of the loss
to make those choices.
Question?
So the question is why do
we do this all at once?
Why not do this separately?
Yeah, so the question is why
don't we fix the big network
and then just only learn
separate fully connected layers
for these two tasks?
People do do that sometimes
and in fact that's probably
the first thing you
should try if you're faced
with a situation like this but in general
whenever you're doing transfer learning
you always get better
performance if you fine tune
the whole system jointly
because there's probably
some mismatch between the features,
if you train on ImageNet and
then you use that network
for your data set you're going
to get better performance
on your data set if you can
also change the network.
But one trick you might
see in practice sometimes
is that you might freeze that network
then train those two things
separately until convergence
and then after they
converge then you go back
and jointly fine tune the whole system.
So that's a trick that sometimes people do
in practice in that situation.
And as I've kind of
alluded to this big network
is often a pre-trained
network that is taken
from ImageNet for example.
So a bit of an aside,
this idea of predicting
some fixed number of
positions in the image
can be applied to a lot
of different problems
beyond just classification
plus localization.
One kind of cool example
is human pose estimation.
So here we take an input image
that is a picture of a person.
We want to output the
positions of the joints
for that person and this
actually allows the network
to predict what is the pose of the human.
Where are his arms, where are
his legs, stuff like that,
and generally most people have
the same number of joints.
That's a bit of a simplifying assumption,
it might not always be true
but it works for the network.
So for example one
parameterization that you might see
in some data sets is
define a person's pose
by 14 joint positions.
Their feet and their knees and their hips
and something like that and
now when we train the network
then we're going to input
this image of a person
and now we're going to output
14 pairs of numbers in this case
giving the x and y coordinates
for each of those 14 joints.
And then you apply some
kind of regression loss
on each of those 14
different predicted points
and just train this network
with back propagation again.
Yeah, so you might see an L2
loss but people play around
with other regression losses here as well.
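As a sketch, the pose loss is the same regression idea applied to 14 points, so the network outputs 28 numbers in total. The function name and the toy coordinates are assumptions for illustration.

```python
NUM_JOINTS = 14  # the joint count used in the example parameterization


def pose_l2_loss(pred_joints, true_joints):
    """Each argument is a list of (x, y) pairs, one pair per joint."""
    return sum(
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred_joints, true_joints)
    )


pred_pose = [(0.0, 0.0)] * NUM_JOINTS  # hypothetical predictions
true_pose = [(1.0, 1.0)] * NUM_JOINTS  # hypothetical ground truth
pose_loss = pose_l2_loss(pred_pose, true_pose)
num_outputs = 2 * NUM_JOINTS           # x and y for every joint
```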
Question?
So the question is what do I mean
when I say regression loss?
So I mean something
other than cross entropy
or softmax right.
When I say regression loss I usually mean
like an L2 Euclidean loss or an L1 loss
or sometimes a smooth L1 loss.
But in general classification
versus regression
is whether your output is
categorical or continuous
so if you're expecting
a categorical output
like you ultimately want to
make a classification decision
over some fixed number of categories
then you'll think about
a cross entropy loss,
softmax loss or these
SVM margin type losses
that we talked about already in the class.
But if your expected output is
to be some continuous value,
in this case the position of these points,
then your output is
continuous so you tend to use
different types of losses
in those situations.
Typically an L2, L1, different
kinds of things there.
So sorry for not clarifying that earlier.
But the bigger point
here is that for any time
you know that you want
to make some fixed number
of outputs from your network,
if you know for example.
Maybe you knew that you wanted to,
you knew that you always
are going to have pictures
of a cat and a dog and
you want to predict both
the bounding box of the cat
and the bounding box of the dog
in that case you'd know
that you have a fixed number
of outputs for each input
so you might imagine
hooking up this type of regression
classification plus localization framework
for that problem as well.
So this idea of some fixed
number of regression outputs
can be applied to a lot
of different problems
including pose estimation.
So the next task that I want to
talk about is object detection
and this is a really meaty topic.
This is kind of a core
problem in computer vision
and you could probably
teach a whole seminar class
on just the history of object detection
and various techniques applied there.
So I'll be relatively
brief and try to go over
the main big ideas of object
detection plus deep learning
that have been used in
the last couple of years.
But the idea in object detection is that
we again start with some
fixed set of categories
that we care about, maybe cats
and dogs and fish or whatever
but some fixed set of categories
that we're interested in.
And now our task is that
given our input image,
every time one of those
categories appears in the image,
we want to draw a box around
it and we want to predict
the category of that
box so this is different
from classification plus localization
because there might be a
varying number of outputs
for every input image.
You don't know ahead of time
how many objects you expect
to find in each image so that's,
this ends up being a
pretty challenging problem.
So we've seen graphs, so
this is kind of interesting.
We've seen this graph
many times of the ImageNet
classification performance
as a function of years
and we saw that it just got
better and better every year
and there's been a similar
trend with object detection
because object detection
has again been one
of these core problems in computer vision
that people have cared
about for a very long time.
So this slide is due to Ross Girshick
who's worked on this
problem a lot and it shows
the progression of object
detection performance
on this one particular
data set called PASCAL VOC
which has been used
for a relatively long time
in the object detection community.
And you can see that up until about 2012
performance on object
detection started to stagnate
and slow down a little
bit and then in 2013
was when some of the first
deep learning approaches
to object detection came
around and you could see
that performance just shot up very quickly
getting better and better year over year.
One thing you might notice is
that this plot ends in 2015
and it's actually continued
to go up since then
so the current state of
the art in this data set
is well over 80 and in
fact a lot of recent papers
don't even report results
on this data set anymore
because it's considered too easy.
So it's a little bit hard to know,
I'm not actually sure what is
the state of the art number
on this data set but it's
off the top of this plot.
Sorry, did you have a question?
Nevermind.
Okay, so as I already
said this is different
from localization because
there might be differing
numbers of objects for each image.
So for example in this
cat on the upper left
there's only one object
so we only need to predict
four numbers but now for
this image in the middle
there's three animals there
so we need our network
to predict 12 numbers, four coordinates
for each bounding box.
Or in this example of many
many ducks then you want
your network to predict
a whole bunch of numbers.
Again, four numbers for each duck.
So this is quite different
from object detection.
Sorry object detection is quite
different from localization
because in object detection
you might have varying numbers
of objects in the image and
you don't know ahead of time
how many you expect to find.
So as a result, it's kind of
tricky if you want to think
of object detection as
a regression problem.
So instead, people tend to
work, use kind of a different
paradigm when thinking
about object detection.
So one approach that's very
common and has been used
for a long time in computer
vision is this idea
of sliding window approaches
to object detection.
So this is kind of similar to this idea
of taking small patches and applying that
for semantic segmentation and we can apply
a similar idea for object detection.
So the idea is that
we'll take different crops
from the input image, in
this case we've got this crop
in the lower left hand corner of our image
and now we take that crop,
feed it through our convolutional network
and our convolutional network does
a classification decision
on that input crop.
It'll say that there's no dog
here, there's no cat here,
and then in addition to the
categories that we care about
we'll add an additional
category called background
and now our network can predict background
in case it doesn't see
any of the categories
that we care about, so
then when we take this crop
from the lower left hand corner here
then our network would
hopefully predict background
and say that no, there's no object here.
Now if we take a different
crop then our network
would predict dog yes,
cat no, background no.
We take a different crop we get dog yes,
cat no, background no.
Or a different crop, dog
no, cat yes, background no.
Does anyone see a problem here?
Yeah, the question is how
do you choose the crops?
So this is a huge problem right.
Because there could be any
number of objects in this image,
these objects could appear
at any location in the image,
these objects could appear
at any size in the image,
these objects could also
appear at any aspect ratio
in the image, so if you want
to do kind of a brute force
sliding window approach
you'd end up having to test
thousands, tens of thousands,
many many many many
different crops in order
to tackle this problem
with a brute force
sliding window approach.
And in the case where
every one of those crops
is going to be fed through a
giant convolutional network,
this would be completely
computationally intractable.
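To get a feel for the numbers, here is a rough count of sliding windows for one image. The image size, the window shapes, and the stride of eight pixels are all assumed values picked for illustration, and this grid is still quite coarse.

```python
def count_windows(img_w, img_h, window_shapes, stride):
    """Count crop positions over a list of (width, height) window shapes."""
    total = 0
    for w, h in window_shapes:
        if w > img_w or h > img_h:
            continue  # this window does not fit in the image at all
        nx = (img_w - w) // stride + 1
        ny = (img_h - h) // stride + 1
        total += nx * ny
    return total


# Three sizes, each at three aspect ratios (square, wide, tall).
shapes = [(s * rw, s * rh) for s in (32, 64, 128) for rw, rh in ((1, 1), (2, 1), (1, 2))]
num_crops = count_windows(640, 480, shapes, stride=8)
```

Even with only nine window shapes and a stride of eight pixels, the count comes out in the tens of thousands, and each of those crops would need its own forward pass through the network.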
So in practice people don't
ever do this sort of brute force
sliding window approach
for object detection
using convolutional networks.
Instead there's this cool line of work
called region proposals that comes from,
this is not using deep learning typically.
These are slightly more
traditional computer vision
techniques but the idea is
that a region proposal method
kind of uses more traditional
signal processing,
image processing type
things to make some list
of proposals, so given an input image,
a region proposal method
will then give you something
like a thousand boxes where
an object might be present.
So you can imagine that
maybe we do some local,
we look for edges in the
image and try to draw boxes
that contain closed edges
or something like that.
These various types of
image processing approaches,
but these region proposal
methods will basically look
for blobby regions in our
input image and then give us
some set of candidate proposal regions
where objects might be potentially found.
And these are relatively fast-ish to run
so one common example of
a region proposal method
that you might see is something
called Selective Search
which I think actually gives
you 2000 region proposals,
not the 1000 that it says on the slide.
So you kind of run this
thing and then after
about two seconds of turning on your CPU
it'll spit out 2000 region
proposals in the input image
where objects are likely to be found
so there'll be a lot of noise in those.
Most of them will not be true objects
but there's a pretty high recall.
If there is an object in
the image then it does tend
to get covered by these region proposals
from Selective Search.
So now rather than applying
our classification network
to every possible location
and scale in the image
instead what we can do is
first apply one of these
region proposal methods to get some set
of proposal regions where
objects are likely located
and now apply a convolutional
network for classification
to each of these proposal
regions and this will end up
being much more computationally tractable
than trying to do all
possible locations and scales.
And this idea all came
together in this paper
called R-CNN from a few years
ago that does exactly that.
So given our input image in this case
we'll run some region proposal method
to get our proposals, these
are also sometimes called
regions of interest or ROI's
so again Selective Search
gives you something like
2000 regions of interest.
Now one of the problems
here is that these input,
these regions in the input
image could have different sizes
but if we're going to run them all
through a convolutional
network our classification,
our convolutional networks
for classification
all want images of the
same input size typically
due to the fully connected
layers and whatnot
so we need to take each
of these region proposals
and warp them to that fixed square size
that is expected as input
to our downstream network.
So we'll crop out those region proposal,
those regions corresponding
to the region proposals,
we'll warp them to that fixed size,
and then we'll run each of them
through a convolutional network
which will then use in this case an SVM
to make a classification
decision for each of those,
to predict categories
for each of those crops.
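The crop and warp step can be sketched with nearest-neighbor sampling in plain Python. This is an assumption made for simplicity, not necessarily the interpolation used in the original work, and the function name and toy image are hypothetical.

```python
def crop_and_warp(image, box, out_size):
    """image is a list of rows; box is (x0, y0, x1, y1) with x1 and y1 exclusive."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    out = []
    for oy in range(out_size):
        sy = y0 + (oy * h) // out_size  # nearest source row for this output row
        row = []
        for ox in range(out_size):
            sx = x0 + (ox * w) // out_size  # nearest source column
            row.append(image[sy][sx])
        out.append(row)
    return out


# A 4 x 6 toy image whose pixel value encodes its position as 10 * y + x.
toy_image = [[10 * y + x for x in range(6)] for y in range(4)]
warped = crop_and_warp(toy_image, box=(1, 0, 5, 2), out_size=2)
```

The proposal here is four pixels wide and two pixels tall, but the output is square, so the aspect ratio is not preserved. That distortion is what the warp to a fixed size implies.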
And then I lost a slide.
But it'll also, not shown
in the slide right now
but in addition R-CNN also
predicts a regression,
like a correction to the bounding box
in addition for each of
these input region proposals
because the problem is that
your input region proposals
are kind of generally in the
right position for an object
but they might not be perfect
so in addition R-CNN will,
in addition to category labels
for each of these proposals,
it'll also predict four
numbers that are kind of an
offset or a correction to
the box that was predicted
at the region proposal stage.
So then again, this is a multi-task loss
and you would train this whole thing.
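One common way to parameterize that correction, sketched here as an assumption rather than as the exact form used in any particular implementation, is to shift the box center by a fraction of the box size and to scale the width and height through an exponential.

```python
import math


def apply_offsets(proposal, offsets):
    """proposal is (cx, cy, w, h); offsets is (dx, dy, dw, dh)."""
    cx, cy, w, h = proposal
    dx, dy, dw, dh = offsets
    return (
        cx + w * dx,       # shift the center by a fraction of the width
        cy + h * dy,       # shift the center by a fraction of the height
        w * math.exp(dw),  # scale the width, always staying positive
        h * math.exp(dh),  # scale the height, always staying positive
    )


proposal = (100.0, 100.0, 40.0, 80.0)  # hypothetical proposal box
unchanged = apply_offsets(proposal, (0.0, 0.0, 0.0, 0.0))
shifted = apply_offsets(proposal, (0.0, -0.25, 0.0, 0.0))
```

All-zero offsets leave the proposal as it is, so the network only has to learn small corrections. Nothing in this form forces the corrected box to stay inside the original proposal.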
Sorry was there a question?
The question is how much does the change
in aspect ratio impact accuracy?
It's a little bit hard to say.
I think there's some
controlled experiments
in some of these papers but I'm not sure
I can give a generic answer to that.
Question?
The question is is it necessary
for regions of interest to be rectangles?
So they typically are
because it's tough to warp
these non-rectangular things but once you move
to something like instance segmentation
then you sometimes get proposals
that are not rectangles.
If you actually do care
about predicting things
that are not rectangles.
Is there another question?
Yeah, so the question is are
the region proposals learned
so in R-CNN it's a traditional thing.
These are not learned, this is
kind of some fixed algorithm
that someone wrote down but
we'll see in a couple minutes
that we can actually, we've
changed that a little bit
in the last couple of years.
Is there another question?
The question is is the
offset always inside
the region of interest?
The answer is no, it doesn't have to be.
You might imagine that
suppose the region of interest
put a box around a person
but missed the head
then you could imagine
the network inferring
that oh this is a person but
people usually have heads
so the network decides the box
should be a little bit higher.
So sometimes the final predicted boxes
will be outside the region of interest.
Question?
Yeah.
Yeah the question is
you have a lot of ROI's
that don't correspond to true objects?
And like we said, in
addition to the classes
that you actually care
about you add an additional
background class so your
class scores can also
predict background to say
that there was no object here.
Question?
Yeah, so the question is
what kind of data do we need
and yeah, this is fully
supervised in the sense that
our training data
consists of images.
Each image has all the
object categories marked
with bounding boxes for each
instance of that category.
There are definitely papers
that try to approach this
like oh what if you don't have the data.
What if you only have
that data for some images?
Or what if that data is noisy but at least
in the generic case you
assume full supervision
of all objects in the
images at training time.
Okay, so I think we've
kind of alluded to this
but there's kind of a lot of problems
with this R-CNN framework.
And actually if you look at
the figure here on the right
you can see that additional
bounding box head
so I'll put it back.
But this is kind of still
computationally pretty expensive
because if we've got
2000 region proposals,
we're running each of those
proposals independently,
that can be pretty expensive.
There's also this question
of relying on these
fixed region proposals,
we're not learning them so
that's kind of a problem.
And just in practice it
ends up being pretty slow
so in the original implementation R-CNN
would actually dump all
the features to disk
so it'd take hundreds of
gigabytes of disk space
to store all these features.
Then training would be super
slow since you have to make
all these different
forward and backward passes
through the image and it
took something like 84 hours,
which is one number they've
reported for training time
so this is super super slow.
And now at test time it's also super slow,
something like roughly 30
seconds to a minute per image
because you need to run
thousands of forward passes
through the convolutional network
for each of these region proposals
so this ends up being pretty slow.
Thankfully we have fast
R-CNN that fixed a lot
of these problems so when we do fast R-CNN
then it's going to look kind of the same.
We're going to start with our input image
but now rather than processing
each region of interest
separately instead we're
going to run the entire image
through some convolutional
layers all at once
to give this high resolution
convolutional feature map
corresponding to the entire image.
And now we still are using
some region proposals
from some fixed thing
like Selective Search
but rather than cropping
out the pixels of the image
corresponding to the region proposals,
instead we imagine projecting
those region proposals
onto this convolutional feature map
and then taking crops from
the convolutional feature map
corresponding to each proposal rather
than taking crops directly from the image.
And this allows us to reuse
a lot of this expensive
convolutional computation
across the entire image
when we have many many crops per image.
But again, if we have some
fully connected layers
downstream those fully connected layers
are expecting some fixed-size input
so now we need to do some
reshaping of those crops
from the convolutional feature map
and they do that in a differentiable way
using something they call
an ROI pooling layer.
Once you have these warped crops
from the convolutional feature map
then you can run these things through some
fully connected layers and
predict your classification
scores and your linear regression offsets
to the bounding boxes.
And now when we train
this thing then we again
have a multi-task loss that trades off
between these two constraints
and during back propagation
we can back prop through this entire thing
and learn it all jointly.
This ROI pooling, it looks
kind of like max pooling.
I don't really want to get into
the details of that right now.
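For readers who do want a rough picture, here is a simplified sketch of the two steps: projecting a proposal from image coordinates onto the feature map, then max pooling the region into a fixed grid. The total stride of 16, the integer rounding, and the helper names are assumptions for illustration; real implementations differ in how they handle bin boundaries.

```python
def project_roi(box, stride):
    """Map an image-space box (x0, y0, x1, y1) onto feature-map cells."""
    x0, y0, x1, y1 = box
    # Round the start down and the end up so the region is never empty.
    return (x0 // stride, y0 // stride, -(-x1 // stride), -(-y1 // stride))


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max pool a feature-map region (x1, y1 exclusive) into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * h) // out_h
        ye = y0 + ((i + 1) * h + out_h - 1) // out_h  # ceiling division
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4 x 4 toy feature map whose value encodes position as 4 * y + x.
toy_features = [[4 * y + x for x in range(4)] for y in range(4)]
roi = project_roi((16, 16, 64, 64), stride=16)  # image-space box onto the feature map
pooled = roi_max_pool(toy_features, roi, out_h=2, out_w=2)
```

Whatever the size of the region, the output always has the same fixed shape, which is what the fully connected layers downstream need.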
And in terms of speed if we
look at R-CNN versus fast R-CNN
versus this other model called SPP net
which is kind of in between the two,
then you can see that at
training time fast R-CNN
is something like 10 times faster to train
because we're sharing all this computation
between different region proposals.
And now at test time
fast R-CNN is super fast
and in fact fast R-CNN
is so fast at test time
that its computation time
is actually dominated
by computing region proposals.
So we said that computing
these 2000 region proposals
using Selective Search takes
something like two seconds
and now once we've got
all these region proposals
then because we're processing
them all sort of in a shared
way by sharing these
expensive convolutions
across the entire image that
we can process all of these
region proposals in less
than a second altogether.
So fast R-CNN ends up being bottlenecked
by just the computing of
these region proposals.
Thankfully we've solved this
problem with faster R-CNN.
So the idea in faster
R-CNN is to just make,
so the problem was the
computing the region proposals
using this fixed function
was a bottleneck.
So instead we'll just
make the network itself
predict its own region proposals.
And so the way that this
sort of works is that again,
we take our input image,
run the entire input image
altogether through some
convolutional layers
to get some convolutional feature map
representing the entire
high resolution image
and now there's a separate
region proposal network
which works on top of those
convolutional features
and predicts its own region
proposals inside the network.
Now once we have those
predicted region proposals
then it looks just like fast R-CNN
where now we take crops
from those region proposals
from the convolutional features,
pass them up to the rest of the network.
And now we talked about multi-task losses
and multi-task training networks
to do multiple things at once.
Well now we're telling the
network to do four things
all at once so balancing out this four-way
multi-task loss is kind of tricky.
The region proposal network
needs to do two things: it needs to say
for each potential
proposal is it an object
or not an object, and it
needs to actually regress
the bounding box coordinates
for each of those proposals,
and now the final network at the end
needs to do these two things again.
Make final classification decisions
for what are the class scores
for each of these proposals,
and also have a second round
of bounding box regression
to again correct any errors that may have
come from the region proposal stage.
Question?
So the question is that
sometimes multi-task learning
might be seen as regularization
and are we getting that effect here?
I'm not sure if there's been
super controlled studies
on that but actually
in the original version
of the faster R-CNN paper
they did a little bit
of experimentation like what if we share
the region proposal network,
what if we don't share?
What if we learn separate
convolutional networks
for the region proposal network
versus the classification network?
And I think there were minor differences
but it wasn't a dramatic
difference either way.
So in practice it's kind
of nicer to only learn one
because it's computationally cheaper.
Sorry, question?
Yeah the question is how do you train
this region proposal network
because you don't know,
you don't have ground
truth region proposals
for the region proposal network.
So that's a little bit hairy.
I don't want to get too
much into those details
but the idea is that at any
time you have a region proposal
which has more than some
threshold of overlap
with any of the ground truth objects
then you say that that is
a positive region proposal
and you should predict
that as the region proposal
and any potential proposal
which has very low overlap
with any ground truth objects
should be predicted as a negative.
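A minimal sketch of that labeling rule, using intersection over union as the overlap measure. The two thresholds of 0.7 and 0.3, and the choice to ignore proposals that fall in between, are assumed values for illustration; they are exactly the kind of hyperparameters being referred to.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1   # enough overlap with some ground truth object
    if best < neg_thresh:
        return 0   # very low overlap with every ground truth object
    return -1      # in between: contributes no loss in this sketch


gt = [(0.0, 0.0, 10.0, 10.0)]  # one hypothetical ground truth box
positive = label_proposal((0.0, 0.0, 10.0, 10.0), gt)
negative = label_proposal((100.0, 100.0, 110.0, 110.0), gt)
ignored = label_proposal((0.0, 5.0, 10.0, 15.0), gt)
```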
But there's a lot of dark
magic hyperparameters
in that process and
that's a little bit hairy.
Question?
Yeah, so the question is what
is the classification loss
on the region proposal
network and the answer is
that it's making a binary,
so I didn't want to get
into too much of the
details of that architecture
'cause it's a little bit hairy
but it's making binary decisions.
So it has some set of potential regions
that it's considering and it's making
a binary decision for each one.
Is this an object or not an object?
So it's like a binary classification loss.
So once you train this
thing then faster R-CNN
ends up being pretty darn fast.
So now because we've
eliminated this overhead
from computing region
proposals outside the network,
now faster R-CNN ends
up being very very fast
compared to these other alternatives.
Also, one interesting thing
is that because we're learning
the region proposals
here you might imagine
maybe what if there was some mismatch
between this fixed region
proposal algorithm and my data?
So in this case once you're learning
your own region proposals
then you can overcome
that mismatch if your region proposals
are somewhat weird or
different than other data sets.
So this whole family of R-CNN methods,
R stands for region, so these
are all region-based methods
because there's some
kind of region proposal
and then we're doing some processing,
some independent processing for each
of those potential regions.
So this whole family of methods are called
these region-based methods
for object detection.
But there's another family of methods
that you sometimes see
for object detection
which is sort of all feed
forward in a single pass.
So one of these is YOLO
for You Only Look Once.
And another is SSD for
Single Shot MultiBox Detector
and these two came out
somewhat around the same time.
But the idea is that rather
than doing independent
processing for each of
these potential regions
instead we want to try to treat this
like a regression problem and just make
all these predictions all at once
with some big convolutional network.
So now given our input image you imagine
dividing that input image
into some coarse grid,
in this case it's a seven by seven grid
and now within each of those grid cells
you imagine some set
of base bounding boxes.
Here I've drawn three base bounding boxes
like a tall one, a wide
one, and a square one
but in practice you would
use more than three.
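A minimal sketch of laying out those base
bounding boxes, assuming a 448 pixel square
image and three box shapes given as multiples
of the grid cell size. The image size, the
shape multipliers, and the function name are
all illustrative choices, not values from
the lecture.

```python
def make_base_boxes(image_size=448, grid=7,
                    shapes=((0.5, 1.5), (1.5, 0.5), (1.0, 1.0))):
    """Return base boxes as (cx, cy, w, h) tuples in pixels.

    shapes holds (width, height) multiples of the cell size:
    a tall box, a wide box, and a square box (assumed values).
    """
    cell = image_size / grid
    boxes = []
    for row in range(grid):
        for col in range(grid):
            # Every base box in a cell is centered on that cell.
            cx = (col + 0.5) * cell
            cy = (row + 0.5) * cell
            for w_mul, h_mul in shapes:
                boxes.append((cx, cy, w_mul * cell, h_mul * cell))
    return boxes

base_boxes = make_base_boxes()   # 7 * 7 cells, 3 boxes per cell
```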
So now for each of these grid cells
and for each of these base bounding boxes
you want to predict several things.
One, you want to predict an
offset off the base bounding box
to predict what is the true location
of the object off this base bounding box.
And you also want to predict
classification scores
so maybe a classification score for each
of these base bounding boxes.
How likely is it that an
object of this category
appears in this bounding box.
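To make the offset idea concrete, here is
one way the predicted offsets might be
applied to a base box. The parameterization
(shift the center by a fraction of the box
size, scale width and height through an
exponential) is a common choice in the R-CNN
line of work, but it is an assumption here;
the lecture does not say which one is used.

```python
import math

def apply_offsets(base_box, offsets):
    """Turn a base box plus predicted offsets into a final box.

    base_box: (cx, cy, w, h) in pixels.
    offsets:  (dx, dy, dw, dh), hypothetical network outputs.
    """
    bx, by, bw, bh = base_box
    dx, dy, dw, dh = offsets
    cx = bx + dx * bw          # shift center by a fraction of the width
    cy = by + dy * bh          # shift center by a fraction of the height
    w = bw * math.exp(dw)      # exp keeps the width positive
    h = bh * math.exp(dh)      # exp keeps the height positive
    return (cx, cy, w, h)

# All-zero offsets leave the base box unchanged.
unchanged = apply_offsets((100.0, 100.0, 50.0, 50.0), (0.0, 0.0, 0.0, 0.0))
shifted = apply_offsets((100.0, 100.0, 50.0, 50.0), (0.1, -0.2, math.log(2), 0.0))
```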
So then at the end we end up predicting
from our input image, we end up predicting
this giant tensor of size seven
by seven by (5B + C).
So that's just where we
have B base bounding boxes,
we have five numbers for
each giving our offset
and our confidence for
that base bounding box
and C classification scores
for our C categories.
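A small sketch of that bookkeeping, assuming
each grid cell stores its B groups of five
numbers first and its C class scores last.
That ordering, the helper names, and the
values B = 3 and C = 20 are assumptions for
illustration.

```python
def output_shape(grid=7, num_boxes=3, num_classes=20):
    """Shape of the prediction tensor: grid x grid x (5B + C)."""
    return (grid, grid, 5 * num_boxes + num_classes)

def split_cell(cell_vector, num_boxes, num_classes):
    """Split one grid cell's vector into per-box numbers and class scores.

    Assumed layout: [dx, dy, dw, dh, confidence] repeated B times,
    followed by C classification scores.
    """
    assert len(cell_vector) == 5 * num_boxes + num_classes
    boxes = [cell_vector[5 * b: 5 * b + 5] for b in range(num_boxes)]
    class_scores = cell_vector[5 * num_boxes:]
    return boxes, class_scores

shape = output_shape()                            # (7, 7, 35)
boxes, class_scores = split_cell(list(range(35)), 3, 20)
```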
So then we kind of see object
detection as this input
of an image, output of this
three dimensional tensor
and you can imagine just
training this whole thing
with a giant convolutional network.
And that's kind of what
these single shot methods do
where they just, and again
matching the ground truth
objects into these potential base boxes
becomes a little bit hairy but
that's what these methods do.
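One common way to do that matching is by
intersection over union: a base box is
assigned to the ground truth box it overlaps
most, as long as the overlap clears some
threshold. The sketch below assumes that
rule and a threshold of 0.5; the actual
papers differ in the details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_ground_truth(base_boxes, gt_boxes, threshold=0.5):
    """For each base box, return the index of the best-overlapping
    ground truth box, or None if nothing reaches the threshold."""
    matches = []
    for base in base_boxes:
        best_idx, best_iou = None, threshold
        for i, gt in enumerate(gt_boxes):
            overlap = iou(base, gt)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        matches.append(best_idx)
    return matches

# The first base box overlaps the object; the second is background.
matches = match_ground_truth(
    [(0, 0, 10, 10), (100, 100, 110, 110)],
    [(1, 1, 11, 11)],
)
```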
And by the way, the
region proposal network
that gets used in faster
R-CNN ends up looking
quite similar to these
where they have some set
of base bounding boxes
over some gridded image,
and the region proposal
network does some regression
plus some classification.
So there's kind of some
overlapping ideas here.
So in faster R-CNN we're
kind of treating the object,
the region proposal step
as kind of this fixed
end-to-end regression problem
and then we do the separate
per region processing but now
with these single shot methods
we only do that first step and just do all
of our object detection
with a single forward pass.
So object detection has a
ton of different variables.
There could be different
base networks like VGG,
ResNet, we've seen
different metastrategies
for object detection
including this faster R-CNN
type region based family of methods,
this single shot detection
family of methods.
There's kind of a hybrid
that I didn't talk about
called R-FCN which is somewhat in between.
There's a lot of different hyperparameters
like what is the image size,
how many region proposals do you use.
And there's actually
this really cool paper
that will appear at CVPR this
summer that does a really
controlled experimentation
around a lot of these
different variables and tries to tell you
how do these methods all perform
under these different variables.
So if you're interested I'd
encourage you to check it out
but kind of one of the
key takeaways is that
the faster R-CNN style
of region based methods
tends to give higher
accuracies but ends up being
much slower than the single shot methods
because the single shot
methods don't require
this per region processing.
But I encourage you to
check out this paper
if you want more details.
Also as a bit of aside,
I had this fun paper
with Andrej a couple years ago that kind of
combined object detection
with image captioning
and did this problem
called dense captioning
so now the idea is that
rather than predicting
a fixed category label for each region,
instead we want to write
a caption for each region.
And again, we had some data
set that had this sort of data
where we had a data set of
regions together with captions
and then we sort of trained
this giant end-to-end model
that just predicted these
captions all jointly.
And this ends up looking
somewhat like faster R-CNN
where you have some region proposal stage
then a bounding box, then
some per region processing.
But rather than an SVM or a softmax loss
instead that per-region
processing has a whole
RNN language model that predicts
a caption for each region.
So that ends up looking quite
a bit like faster R-CNN.
There's a video here but I think
we're running out of time so I'll skip it.
But the idea here is
that once you have this,
you can kind of tie together
a lot of these ideas
and if you have some new
problem that you're interested
in tackling like dense captioning,
you can recycle a lot of the components
that you've learned from other problems
like object detection and image captioning
and kind of stitch together
one end-to-end network
that produces the outputs
that you care about
for your problem.
So the last task that I want to talk about
is this idea of instance segmentation.
So here instance segmentation is
in some ways like the full problem.
We're given an input image
and we want to predict one,
the locations and identities
of objects in that image
similar to object detection,
but rather than just
predicting a bounding box
for each of those objects,
instead we want to predict
a whole segmentation mask
for each of those objects
and predict which pixels
in the input image correspond
to each object instance.
So this is kind of like a hybrid
between semantic segmentation
and object detection
because like object
detection we can handle
multiple objects and we
differentiate the identities
of different instances so in this example
since there are two dogs in the image
an instance segmentation method
actually distinguishes
between the two dog instances
in the output, and kind of
like semantic segmentation
we have this pixel wise accuracy
where for each of these
objects we want to say
which pixels belong to that object.
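To see the difference in the outputs, here
is a toy sketch where each instance carries
its own category and mask, and collapsing
them into one label map gives the semantic
segmentation view. The dict layout and the
tiny 2 by 4 masks are made up for
illustration.

```python
def instances_to_semantic(instances, height, width):
    """Collapse per-instance masks into one per-pixel category map.

    instances: list of dicts with 'category' (int > 0) and 'mask'
    (height x width nested lists of 0/1). Label 0 means background.
    """
    semantic = [[0] * width for _ in range(height)]
    for inst in instances:
        for y in range(height):
            for x in range(width):
                if inst["mask"][y][x]:
                    semantic[y][x] = inst["category"]
    return semantic

DOG = 1
dog_a = {"category": DOG, "mask": [[1, 1, 0, 0], [1, 1, 0, 0]]}
dog_b = {"category": DOG, "mask": [[0, 0, 0, 1], [0, 0, 0, 1]]}

# Instance output: two separate dogs, each with its own mask.
instances = [dog_a, dog_b]
# Semantic output: both dogs share one label, so they merge.
semantic = instances_to_semantic(instances, height=2, width=4)
```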
So there's been a lot of different methods
that people have used to tackle
instance segmentation as well,
but the current state of
the art is this new paper
called Mask R-CNN that
actually just came out
on arXiv about a month ago
so this is not yet published,
this is like super fresh stuff.
And this ends up looking
a lot like faster R-CNN.
So it has this multi-stage
processing approach
where we take our whole input image,
that whole input image goes
into some convolutional
network and some learned
region proposal network
that's exactly the same as faster R-CNN
and now once we have our
learned region proposals
then we project those proposals
onto our convolutional feature map
just like we did in fast and faster R-CNN.
But now rather than just
making a classification
and a bounding box regression decision
for each of those boxes we in addition
want to predict a segmentation mask
for each of those bounding boxes,
for each of those region proposals.
So now it kind of looks like a mini,
like a semantic segmentation problem
inside each of the region proposals
that we're getting from our
region proposal network.
So now after we do this
ROI aligning to warp
our features corresponding
to the region proposal
into the right shape, then we
have two different branches.
One branch will come up that looks exact,
and this first branch at
the top looks just like
faster R-CNN and it will
predict classification scores
telling us what is the
category corresponding
to that region proposal
or alternatively
whether or not it's background.
And we'll also predict some
bounding box coordinates
that are regressed off the
region proposal coordinates.
And now in addition we'll
have this branch at the bottom
which looks basically like
a semantic segmentation
mini network which will
classify for each pixel
in that input region proposal
whether or not it's an object
so this mask R-CNN problem,
this mask R-CNN architecture
just kind of unifies all
of these different problems
that we've been talking
about today into one nice
jointly end-to-end trainable model.
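A very rough sketch of how the two branches
might be combined for one region at test
time: take the highest classification score,
and if it is not background, threshold the
per-pixel mask scores. The function name,
the single-mask simplification, and the 0.5
threshold are assumptions, not details from
the paper.

```python
import math

def predict_region(class_scores, mask_logits, threshold=0.5):
    """Combine the classification and mask branches for one region.

    class_scores: one score per category, index 0 is background.
    mask_logits:  small grid of raw per-pixel scores for this region.
    Returns (category, mask); mask is None for background regions.
    """
    category = max(range(len(class_scores)), key=lambda c: class_scores[c])
    if category == 0:
        return category, None
    mask = [
        [1 if 1.0 / (1.0 + math.exp(-v)) > threshold else 0 for v in row]
        for row in mask_logits
    ]
    return category, mask

category, mask = predict_region([0.1, 2.0, 0.3], [[3.0, -3.0], [0.5, -0.5]])
```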
And it's really cool and it actually works
really really well so when
you look at the examples
in the paper they're kind of amazing.
They look kind of indistinguishable
from ground truth.
So in this example on the left you can see
that there are these two people standing
in front of motorcycles,
it's drawn the boxes
around these people, it's
also gone in and labeled
all the pixels of those
people and it's really small
but actually in the
background on that image
on the left there's also
a whole crowd of people
standing very small in the background.
It's also drawn boxes around each of those
and grabbed the pixels
of each of those people.
And you can see that this is just,
it ends up working really really well
and it's a relatively simple addition
on top of the existing
faster R-CNN framework.
So I told you that mask
R-CNN unifies everything
we talked about today and it also does
pose estimation by the way.
So we talked about, you
can do pose estimation
by predicting these joint coordinates
for each of the joints of the person
so you can do mask R-CNN to
do joint object detection,
pose estimation, and
instance segmentation.
And the only addition we need to make
is that for each of these region proposals
we add an additional little branch
that predicts these
coordinates of the joints
for the instance of the
current region proposal.
So now this is just another loss,
like another layer that we add,
another head coming out of the network
and an additional term
in our multi-task loss.
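In code, that multi-task loss is just a
weighted sum with one more term added to it.
The head names, the loss values, and the
default weight of 1.0 below are hypothetical
and only meant to show the shape of the idea.

```python
def multi_task_loss(losses, weights=None):
    """Weighted sum of per-head losses.

    losses:  dict mapping head name to its scalar loss.
    weights: optional dict of per-head weights (defaults to 1.0 each).
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Detection plus masks: three terms.
detection_and_masks = {"classification": 0.40, "box_regression": 0.25, "mask": 0.30}
# Adding the pose branch just adds one more term.
with_pose = dict(detection_and_masks, keypoints=0.50)

total_without_pose = multi_task_loss(detection_and_masks)
total_with_pose = multi_task_loss(with_pose)
```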
But once we add this one little branch
then you can do all of these
different problems jointly
and you get results looking
something like this.
Where now this network, like
a single feed forward network
is deciding how many
people are in the image,
detecting where those people are,
figuring out the pixels
corresponding to each
of those people and also
drawing a skeleton estimating
the pose of those people
and this works really well
even in crowded scenes like this classroom
where there's a ton of people sitting
and they all overlap each other
and it just seems to work incredibly well.
And because it's built on
the faster R-CNN framework
it also runs relatively close to real time
so this is running something
like five frames per second
on a GPU because this is all sort of done
in the single forward pass of the network.
So this is again, a super new paper
but I think that this will probably get
a lot of attention in the coming months.
So just to recap, we've talked.
Sorry question?
The question is how much
training data do you need?
So all of these instance
segmentation results
were trained on the
Microsoft COCO data set
so Microsoft COCO is roughly
200,000 training images.
It has 80 categories that it cares about
so in each of those
200,000 training images
it has all the instances of
those 80 categories labeled.
So there's something like
200,000 images for training
and there's something
like I think an average
of five or six instances per image.
So it actually is quite a lot of data.
And for Microsoft COCO for all the people
in Microsoft COCO they
also have all the joints
annotated as well so this
actually does have quite a lot
of supervision at training
time you're right,
and actually is trained
with quite a lot of data.
So I think one really
interesting topic to study
moving forward is that we kind of know
that if you have a lot of
data to solve some problem,
at this point we're relatively
confident that you can
stitch up some convolutional network
that can probably do a
reasonable job at that problem
but figuring out ways to
get performance like this
with less training data
is a super interesting
and active area of research and I think
that's something people will be spending
a lot of their efforts working
on in the next few years.
So just to recap, today we
had kind of a whirlwind tour
of a whole bunch of different
computer vision topics
and we saw how a lot of the
machinery that we built up
from image classification can
be applied relatively easily
to tackle these different
computer vision topics.
And next time we'll talk about,
we'll have a really fun lecture
on visualizing CNN features.
We'll also talk about DeepDream
and neural style transfer.
