- Good morning.
So, it's 12:03 so, I want to get started.
Welcome to Lecture 12, of CS-231N.
Today we are going to talk about
Visualizing and Understanding
convolutional networks.
This is always a super fun lecture to give
because we get to look a
lot of pretty pictures.
So, it's, it's one of my favorites.
As usual a couple administrative things.
So, hopefully your projects
are all going well,
because as a reminder your milestones
are due on Canvas tonight.
It is Canvas, right?
Okay, so want to double check, yeah.
Due on Canvas tonight, we are working on
furiously grading your midterms.
So, we'll hope to have those
midterms grades to you back
by on grade scope this week.
So, I know that was little confusion,
you all got registration
email's for grade scope
probably in the last week.
Something like that, we start
couple of questions on piazo.
So, we've decided to use grade
scope to grade the midterms.
So, don't be confused, if you
get some emails about that.
Another reminder is that assignment three
was released last week on Friday.
It will be due, a week from
this Friday, on the 26th.
This is, an assignment three,
is almost entirely brand new this year.
So, it we apologize for taking
a little bit longer than
expected to get it out.
But I think it's super cool.
A lot of that stuff, we'll
talk about in today's lecture.
You'll actually be implementing
on your assignment.
And for the assignment, you'll
get the choice of either
Pi torch or tensure flow.
To work through these different examples.
So, we hope that's really
useful experience for you guys.
We also saw a lot of activity
on HyperQuest over the weekend.
So that's, that's really awesome.
The leader board went up yesterday.
It seems like you guys are
really trying to battle it out
to show off your deep learning
neural network training skills.
So that's super cool.
And we because due to the high interest
in HyperQuest and due to
the conflicts with the,
with the Milestones submission time.
We decided to extend the deadline
for extra credit through Sunday.
So, anyone who does at
least 12 runs on HyperQuest
by Sunday will get little bit
of extra credit in the class.
Also those of you who are,
at the top of leader board
doing really well, will
get may be little bit
extra, extra credit.
So, I thanks for
participating we got lot of
interest and that was really cool.
Final reminder is about
the poster session.
So, we have the poster
session will be on June 6th.
That date is finalized,
I think that, I don't
remember the exact time.
But it is June 6th.
So that, we have some questions
about when exactly that poster session is
for those of you who are traveling
at the end of quarter
or starting internships
or something like that.
So, it will be June 6th.
Any questions on the admin notes.
No, totally clear.
So, last time we talked.
So, last time we had a pretty
jam packed lecture, when we
talked about lot of different
computer vision tasks, as a reminder.
We talked about semantic segmentation
which is this problem, where
you want to sign labels
to every pixel in the input image.
But does not differentiate the
object instances in those images.
We talked about classification
plus localization.
Where in addition to a class label
you also want to draw a box
or perhaps several boxes in the image.
Where the distinction here is that,
in a classification
plus localization setup.
You have some fix number of
objects that you are looking for
So, we also saw that this type of paradigm
can be applied to the things
like pose recognition.
Where you want to regress to
different numbers of joints
in the human body.
We also talked about the object detection
where you start with some fixed
set of category labels
that you are interested in.
Like dogs and cats.
And then the task is
to draw a boxes around
every instance of those objects
that appear in the input image.
And object detection
is really distinct from
classification plus localization
because with object
detection, we don't know
ahead of time, how many object instances
we're looking for in the image.
And we saw that there's
this whole family of methods
based on RCNN, Fast RCNN and faster RCNN,
as well as the single
shot detection methods
for addressing this problem
of object detection.
Then finally we talked
pretty briefly about
instance segmentation,
which is kind of combining
aspects of a semantic
segmentation and object detection
where the goal is to
detect all the instances
of the categories we care about,
as well as label the pixels
belonging to each instance.
So, in this case, we
detected two dogs and one cat
and for each of those instances we wanted
to label all the pixels.
So, these are we kind of
covered a lot last lecture
but those are really interesting
and exciting problems
that you guys might consider to
using in parts of your projects.
But today we are going to
shift gears a little bit
and ask another question.
Which is, what's really going on
inside convolutional networks.
We've seen by this point in the class
how to train convolutional networks.
How to stitch up different
types of architectures
to attack different problems.
But one question that you
might have had in your mind,
is what exactly is going
on inside these networks?
How did they do the things that they do?
What kinds of features
are they looking for?
And all this source of related questions.
So, so far we've sort of seen
ConvNets as a little bit of a black box.
Where some input image of raw pixels
is coming in on one side.
It goes to the many layers of convulsion
and pooling in different
sorts of transformations.
And on the outside, we end up
with some set of class scores
or some types of understandable
interpretable output.
Such as class scores or
bounding box positions
or labeled pixels or something like that.
But the question is.
What are all these other
layers in the middle doing?
What kinds of things in the input image
are they looking for?
And can we try again intuition for.
How ConvNets are working?
What types of things in the
image they are looking for?
And what kinds of techniques do we have
for analyzing this
internals of the network?
So, one relatively simple
thing is the first layer.
So, we've seen, we've
talked about this before.
But recalled that, the
first convolutional layer
consists of a filters that,
so, for example in AlexNet.
The first convolutional layer consists
of a number of convolutional filters.
Each convolutional of filter
has shape 3 by 11 by 11.
And these convolutional filters gets slid
over the input image.
We take inner products between
some chunk of the image.
And the weights of the
convolutional filter.
And that gives us our output of the
at, at after that first
convolutional layer.
So, in AlexNet then we
have 64 of these filters.
But now in the first layer
because we are taking
in a direct inner product
between the weights
of the convolutional layer
and the pixels of the image.
We can get some since for what
these filters are looking for
by simply visualizing the
learned weights of these filters
as images themselves.
So, for each of those
11 by 11 by 3 filters
in AlexNet, we can just
visualize that filter
as a little 11 by 11 image
with a three channels
give you the red, green and blue values.
And then because there
are 64 of these filters
we just visualize 64
little 11 by 11 images.
And we can repeat... So
we have shown here at the.
So, these are filters taken
from the prechain models,
in the pi torch model zoo.
And we are looking at the
convolutional filters.
The weights of the convolutional filters.
at the first layer of AlexNet, ResNet-18,
ResNet-101 and DenseNet-121.
And you can see, kind
of what all these layers
what this filters looking for.
You see the lot of things
looking for oriented edges.
Likes bars of light and dark.
At various angles, in various
angles and various positions
in the input, we can see opposing colors.
Like this are green and pink.
opposing colors or this orange
and blue opposing colors.
So, this, this kind of
connects back to what we
talked about with Hugh and Wiesel.
All the way in the first lecture.
That remember the human visual system
is known to the detect
things like oriented edges.
At the very early layers
of the human visual system.
And it turns out of that
these convolutional networks
tend to do something, somewhat similar.
At their first convolutional
layers as well.
And what's kind of interesting is that
pretty much no matter what type
of architecture you hook up
or whatever type of training
data you are train it on.
You almost always get
the first layers of your.
The first convolutional
weights of any pretty much
any convolutional network
looking at images.
Ends up looking something like this
with oriented edges and opposing colors.
Looking at that input image.
But this really only, sorry
what was that question?
Yes, these are showing the learned weights
of the first convolutional layer.
Oh, so that the question is.
Why does visualizing the
weights of the filters?
Tell you what the filter is looking for.
So this intuition comes from
sort of template matching
and inner products.
That if you imagine you have
some, some template vector.
And then you imagine you
compute a scaler output
by taking inner product
between your template vector
and some arbitrary piece of data.
Then, the input which
maximizes that activation.
Under a norm constraint on the input
is exactly when those
two vectors match up.
So, in that since that,
when, whenever you're taking
inner products, the thing
causes an inner product
to excite maximally
is a copy of the thing you are
taking an inner product with.
So, that, that's why we can
actually visualize these weights
and that, why that shows us,
what this first layer is looking for.
So, for these networks
the first layers always
was a convolutional layer.
So, generally whenever
you are looking at image.
Whenever you are thinking about image data
and training convolutional networks,
you generally put a convolutional layer
at the first, at the first stop.
Yeah, so the question is,
can we do this same type of procedure
in the middle open network.
That's actually the next slide.
So, good anticipation.
So, if we do, if we draw this exact same
visualization for the
intermediate convolutional layers.
It's actually a lot less interpretable.
So, this is, this is performing
exact same visualization.
So, remember for this using
the tiny ConvNets demo network
that's running on the course website
whenever you go there.
So, for that network,
the first layer is 7 by
7 convulsion 16 filters.
So, after the top visualizing
the first layer weights
for this network just like
we saw in a previous slide.
But now at the second layer weights.
After we do a convulsion
then there's some relu
and some other non-linearity perhaps.
But the second convolutional layer,
now receives the 16 channel input.
And does 7 by 7 convulsion
with 20 convolutional filters.
And we've actually,
so the problem is that
you can't really visualize
these directly as images.
So, you can try, so, here if you
this 16 by, so the input is
this has 16 dimensions in depth.
And we have these convolutional filters,
each convolutional filter is 7 by 7,
and is extending along the full depth
so has 16 elements.
Then we've 20 such of these
convolutional filters,
that are producing the output
planes of the next layer.
But the problem here is that
we can't, looking at the,
looking directly at the weights
of these filters, doesn't
really tell us much.
So, we, that's really done here is that,
now for this single 16 by 7
by 7 convolutional filter.
We can spread out those 167
by 7 planes of the filter
into a 167 by 7 grayscale images.
So, that's what we've done.
Up here, which is these little
tiny gray scale images here
show us what is, what are the weights
in one of the convolutional
filters of the second layer.
And now, because there are
20 outputs from this layer.
Then this second convolutional
layer, has 2o such of these
16 by 16 or 16 by 7 by 7 filters.
So if we visualize the weights
of those convolutional filters
as images, you can see that there are some
kind of spacial structures here.
But it doesn't really
give you good intuition
for what they are looking at.
Because these filters are not
looking, are not connected
directly to the input image.
Instead recall that the second
layer convolutional filters
are connected to the
output of the first layer.
So, this is giving visualization of,
what type of activation
pattern after the first
convulsion, would cause
the second layer convulsion
to maximally activate.
But, that's not very interpretable
because we don't have a good sense
for what those first layer
convulsions look like
in terms of image pixels.
So we'll need to develop some
slightly more fancy technique
to get a sense for what is going on
in the intermediate layers.
Question in the back.
Yeah. So the question is that
for... all the visualization
on this on the previous slide.
We've had the scale the weights
to the zero to 255 range.
So in practice those
weights could be unbounded.
They could have any range.
But to get nice visualizations
we need to scale those.
These visualizations also do not take
in to account the bias is in these layers.
So you should keep that in mind
when and not take these
HEPS visualizations
to, to literally.
Now at the last layer
remember when we looking at the last layer
of convolutional network.
We have these maybe 1000 class scores
that are telling us what
are the predicted scores
for each of the classes
in our training data set
and immediately before the last layer
we often have some fully connected layer.
In the case of Alex net
we have some 4096- dimensional
features representation
of our image that then
gets fed into that final
our final layer to predict
our final class scores.
And one another, another kind of route
for tackling the problem
of visual, visualizing
and understanding ConvNets
is to try to understand what's
happening at the last layer
of a convolutional network.
So what we can do
is how to take some,
some data set of images
run a bunch of, run a bunch of images
through our trained convolutional network
and recorded that 4096 dimensional vector
for each of those images.
And now go through and try to figure out
and visualize that last
layer, that last hidden layer
rather than those rather than
the first convolutional layer.
So, one thing you might imagine is,
is trying a nearest neighbor approach.
So, remember, way back
in the second lecture
we saw this graphic on the left
where we, where we had a
nearest neighbor classifier.
Where we were looking at
nearest neighbors in pixels
space between CIFAR 10 images.
And then when you look
at nearest neighbors
in pixel space between CIFAR 10 images
you see that you pull up images
that looks quite similar
to the query image.
So again on the left column
here is some CIFAR 10 image
from the CIFAR 10 data set
and then these, these next five columns
are showing the nearest
neighbors in pixel space
to those test set images.
And so for example
this white dog that you see here,
it's nearest neighbors are in pixel space
are these kinds of white blobby things
that may, may or may not be dogs,
but at least the raw pixels
of the image are quite similar.
So now we can do the same
type of visualization
computing and visualizing
these nearest neighbor images.
But rather than computing
the nearest neighbors in pixel space,
instead we can compute nearest neighbors
in that 4096 dimensional feature space.
Which is computed by the
convolutional network.
So here on the right
we see some examples.
So this, this first column shows us
some examples of images from the test set
of image that... Of the image
net classification data set
and now the, these
subsequent columns show us
nearest neighbors to those test set images
in the 4096, in the 4096th
dimensional features space
computed by Alex net.
And you can see here that
this is quite different
from the pixel space nearest neighbors,
because the pixels are
often quite different.
between the image in
it's nearest neighbors
and feature space.
However, the semantic
content of those images
tends to be similar in this feature space.
So for example, if you
look at this second layer
the query image is this elephant
standing on the left side of the image
with a screen grass behind him.
and now one of these, one of these...
it's third nearest
neighbor in the tough set
is actually an elephant standing
on the right side of the image.
So this is really interesting.
Because between this
elephant standing on the left
and this element stand,
elephant standing on the right
the pixels between those two images
are almost entirely different.
However, in the feature space
which is learned by the network
those two images and that
being very close to each other.
Which means that somehow
this, this last their features
is capturing some of those
semantic content of these images.
That's really cool and really exciting
and, and in general looking
at these kind of nearest
neighbor visualizations
is really quick and easy way to visualize
something about what's going on here.
Yes. So the question is that
through the... the standard
supervised learning procedure
for classific training,
classification network
There's nothing in the loss
encouraging these features
to be close together.
So that, that's true.
It just kind of a happy accident
that they end up being
close to each other.
Because we didn't tell the
network during training
these features should be close.
However there are sometimes
people do train networks
using things called
either contrastive loss
or a triplet loss.
Which actually explicitly make...
assumptions and constraints on the network
such that those last their features
end up having some metric
space interpretation.
But Alex net at least was not
trained specifically for that.
The question is, what is the nearest...
What is this nearest neighbor thing
have to do at the last layer?
So we're taking this image
we're running it through the network
and then the, the second to last
like the last hidden layer of the network
is of 4096th dimensional vector.
Because there's this, this is...
This is there, there are
these fully connected layers
at the end of the network.
So we are doing is...
We're writing down that
4096th dimensional vector
for each of the images
and then we are computing
nearest neighbors
according to that 4096th
dimensional vector.
Which is computed by,
computed by the network.
Maybe, maybe we can chat offline.
So another, another, another
another angle that we might have
for visualizing what's
going on in this last layer
is by some concept of
dimensionality reduction.
So those of you who have
taken CS229 for example
you've seen something like PCA.
Which let's you take some high
dimensional representation
like these 4096th dimensional features
and then compress it
down to two-dimensions.
So then you can visualize that
feature space more directly.
So, Principle Component Analysis or PCA
is kind of one way to do that.
But there's real another
really powerful algorithm
called t-SNE.
Standing for t-distributed
stochastic neighbor embeddings.
Which is slightly more powerful method.
Which is a non-linear
dimensionality reduction method
that people in deep often
use for visualizing features.
So here as an, just an
example of what t-SNE can do.
This visualization here is, is showing
a t-SNE dimensionality reduction
on the emnest data set.
So, emnest remember is this date set
of hand written digits
between zero and nine.
Each image is a gray scale image
20... 28 by 28 gray scale image
and now we're... So that
Now we've, we've used t-SNE
to take that 28 times 28
dimensional features space
of the raw pixels for m-nest
and now compress it
down to two- dimensions
ans then visualize each
of those m-nest digits
in this compress
two-dimensional representation
and when you do, when you run t-SNE
on the raw pixels and m-nest
You can see these natural
clusters appearing.
Which corresponds to the,
the digits of these m-nest
of, of these m-nest data set.
So now we can do a similar
type of visualization.
Where we apply this t-SNE
dimensionality reduction technique
to the features from the last layer
of our trained image net classifier.
So...To be a little bit more concrete here
what we've done
is that we take, a large set of images
we run them off convolutional network.
We record that final 4096th
dimensional feature vector
for, from the last layer
of each of those images.
Which gives us large collection
of 4096th dimensional vectors.
Now we apply t-SNE
dimensionality reduction
to compute, sort of compress
that 4096the dimensional
features space down into a
two-dimensional feature space
and now we, layout a grid in that
compressed two-dimensional feature space
and visualize what types of images appear
at each location in the grid
in this two-dimensional feature space.
So by doing this you get
some very close rough sense
of what the geometry
of this learned feature space looks like.
So these images are
little bit hard to see.
So I'd encourage you to check out
the high resolution versions online.
But at least maybe on
the left you can see that
there's sort of one
cluster in the bottom here
of, of green things, is a
different kind of flowers
and there's other types of clusters
for different types of dog breeds
and another types of
animals and, and locations.
So there's sort of
discontinuous semantic notion
in this feature space.
Which we can explore by looking through
this t-SNE dimensionality reduction
version of the, of the features.
Is there question?
Yeah. So the basic idea is that we're
we, we have an image
so now we end up with
three different pieces
of information about each image.
We have the pixels of the image.
We have the 4096th dimensional vector.
Then we use t-SNE to convert
the 4096th dimensional vector
into a two-dimensional coordinate
and then we take the
original pixels of the image
and place that at the
two-dimensional coordinate
corresponding to the
dimensionality reduced version
of the 4096th dimensional feature.
Yeah, little bit involved here.
Question in the front.
The question is
Roughly how much variants do
these two-dimension explain?
Well, I'm not sure of the exact number
and I get little bit muddy
when you're talking about t-SNE,
because it's a non-linear
dimensionality reduction technique.
So, I'd have to look offline
and I'm not sure of exactly
how much it explains.
Question?
Question is, can you do the same analysis
of upper layers of the network?
And yes, you can. But no,
I don't have those
visualizations here. Sorry.
Question?
The question is,
Shouldn't we have overlaps
of images once we do this
dimensionality reduction?
And yes, of course, you would.
So this is just kind of taking a,
nearest neighbor in
our, in our regular grid
and then picking an image
close to that grid point.
So, so... they, yeah.
this is not showing
you the kind of density
in different parts of the feature space.
So that's, that's another thing to look at
and again at the link
you, there's a couple more
visualizations of this nature that,
that address that a little bit.
Okay. So another, another thing
that you can do for some of
these intermediate features
is, so we talked a couple of slides ago
that visualizing the weights
of these intermediate layers
is not so interpretable.
But actually visualizing
the activation maps of those
intermediate layers
is kind of interpretable in some cases.
So for, so I, again an
example of Alex Net.
Remember the, the conv5
layers of Alex Net.
Gives us this 128 by...
The for...The conv5 features for any image
is now 128 by 13 by 13 dimensional tensor.
But we can think of that
as 128 different
13 by 132-D grids.
So now we can actually go and visualize
each of those 13 by 13 elements slices
of the feature map as a grayscale image
and this gives us some sense
for what types of things
in the input are each of those features
in that convolutional layer looking for.
So this is a, a really
cool interactive tool
by Jason Yasenski you can just download.
So it's run, so I don't have the video,
it has a video on his website.
But it's running a convolutional network
on the inputs stream of webcam
and then visualizing in real time
each of those slices of that
intermediate feature map
give you a sense of what it's looking for
and you can see that,
so here the input image
is this, this picture up in, settings...
of this picture of a person
in front of the camera
and most of these intermediate features
are kind of noisy, not much going on.
But there's a, but there's
this one highlighted
intermediate feature
where that is also shown larger here
that seems that it's activating
on the portions of the feature map
corresponding to the person's face.
Which is really interesting
and that kind of,
suggests that maybe this,
this particular slice of the feature map
of this layer of this particular network
is maybe looking for human
faces or something like that.
Which is kind of a nice, kind of a nice
and cool finding.
Question?
The question is, Are the
black activations dead relu's?
So you got to be... a little
careful with terminology.
We usually say dead relu to mean
something that's dead over
the entire training data set.
Here I would say that it's a
relu, that, it's not active
for this particular input.
Question?
The question is, If there's
no humans in image net
how can it recognize a human face?
There definitely are humans in image net
I don't think it's, it's one of the cat...
I don't think it's one of
the thousand categories
for the classification challenge.
But people definitely appear
in a lot of these images
and that can be useful
signal for detecting
other types of things.
So that's actually kind of nice results
because that shows that, it's
sort of can learn features
that are useful for the
classification task at hand.
That are even maybe a little bit different
from the explicit classification task
that we told it to perform.
So it's actually really cool results.
Okay, question?
So at each layer in the
convolutional network
our input image is of three,
it's like 3 by 224 by 224
and then it goes through
many stages of convolution.
And then, it, after
each convolutional layer
is some three dimensional
chunk of numbers.
Which are the outputs from that layer
of the convolutional network.
And that into the entire three
dimensional chunk of numbers
which are the output of the
previous convolutional layer,
we call, we call, like
an activation volume
and then one of those, one of those slices
is a, it's an activation map.
So the question is, If the image is K by K
will the activation map be K by K?
Not always because there
can be sub sampling
due to pool, straight
convolution and pooling.
But in general, the, the
size of each activation map
will be linear in the
size of the input image.
So another, another kind
of useful thing we can do
for visualizing
intermediate features is...
Visualizing what types of
patches from input images
cause maximal activation in different,
different features, different neurons.
So what we've done here
is that, we pick...
Maybe again the con five
layer from Alex Net?
And remember each of
these activation volumes
at the con, at the con
five in Alex net gives us
a 128 by 13 by 13 chunk of numbers.
Then we'll pick one of those 128 channels.
Maybe channel 17
and now what we'll do is run many images
through this convolutional network.
And then, for each of those images
record the con five features
and then look at the...
Right, so, then, then look at the, the...
The parts of that 17th feature map
that are maximally activated
over our data set of images.
And now, because again this
is a convolutional layer
each of those neurons in
the convolutional layer
has some small receptive
field in the input.
Each of those neurons is not
looking at the whole image.
They're only looking at
the sub set of the image.
Then what we'll do is,
is visualize the patches
from the, from this
large data set of images
corresponding to the maximal activations
of that, of that feature,
of that particular feature
in that particular layer.
And then we can sorts these out,
sort these patches by their activation
at that, at that particular layer.
So here is a, some examples from this...
Network called a, fully...
The network doesn't matter.
But these are some visualizations
of these kind of maximally
activating patches.
So, each, each row gives...
We've chosen one layer from or one neuron
from one layer of a network
and then each, and then,
the, they're sorted
of these are the patches from
some large data set of images.
That maximally activated this one neuron.
And these can give you a sense
for what type of features
these, these neurons might be looking for.
So for example, this top row
we see a lot of circly kinds
of things in the image.
Some eyes, some, mostly eyes.
But also this, kind of blue circly region.
So then, maybe this,
this particular neuron
in this particular layer of
this network is looking for
kind of blue circly things in the input.
Or maybe in the middle here
we have neurons that are looking for
text in different colors
or, or maybe curving, curving edges
of different colors and orientations.
Yeah, so, I've been a little bit loose
with terminology here.
So, I'm saying that a
neuron is one scaler value
in that con five activation map.
But because it's convolutional,
all the neurons in one channel
are all using the same weights.
So we've chosen one
channel and then, right?
So, you get a lot of neurons
for each convolutional filter
at any one layer.
So, we, we could have been,
so this patches could've
been drawn from anywhere
in the image due to the
convolutional nature of the thing.
And now at the bottom we also see
some maximally activating patches
for neurons from a higher up
layer in the same network.
And now because they are coming
from higher in the network
they have a larger receptive field.
So, they're looking at larger
patches of the input image
and we can also see
that they're looking for
maybe larger structures
in the input image.
So this, this second row is maybe looking,
it seems to be looking for human,
humans or maybe human faces.
We have maybe something looking for...
Parts of cameras or
different types of larger,
larger, larger object like
type things, types of things.
Another, another cool experiment we can do
which comes from Zeiler
and Fergus ECCV 2014 paper.
is this idea of an exclusion experiment.
So, what we want to do is figure out
which parts of the
input, of the input image
cause the network to make
it's classification decision.
So, what we'll do is,
we'll take our input image
in this case an elephant
and then we'll block
out some part of that,
some region in that input image
and just replace it with
the mean pixel value
from the data set.
And now, run that
occluded image throughout,
through the network and then record
what is the predicted probability
of this occluded image?
And now slide this occluded
patch over every position
in the input image and then
repeat the same process.
And then draw this heat map showing,
what was the predicted probability
output from the network
as a function of where did,
which part of the input
image did we occlude?
And the idea is that
if when we block out
some part of the image
if that causes the network
score to change drastically.
Then probably that part of the input image
was really important for
the classification decision.
So here we've shown...
I've shown three different examples of...
Of this occlusion type experiment.
So, maybe this example of
a Go-kart at the bottom,
you can see over here that
when we, so here, red,
the, the red corresponds
to a low probability
and the white and yellow
corresponds to a high probability.
So when we block out
the region of the image
corresponding to this Go-kart in front.
Then the predicted probability
for the Go-kart class drops a lot.
So that gives us some sense
that the network is actually
caring a lot about these,
these pixels in the input image
in order to make it's
classification decision.
Question?
Yes, the question is that,
what's going on in the background?
So maybe if the image is a
little bit too small to tell
but, there's, this is
actually a Go-kart track
and there's a couple other
Go-karts in the background.
So I think that, when
you're blocking out these
other Go-karts in the background,
that's also influencing the score
or maybe like the horizon is there
and maybe the horizon is an useful feature
for detecting Go-karts,
it's a little bit hard to tell sometimes.
But this is a pretty cool visualization.
Yeah, was there another question?
So the question is, sorry,
sorry, what was the first question?
So, the, so the question...
So for, for this example
we're taking one image
and then masking all parts of one image.
The second question
was, how is this useful?
It's not, maybe, you don't
really take this information
and then loop it directly
into the training process.
Instead, this is a way, a tool for humans
to understand, what types of computations
these train networks are doing.
So it's more for your understanding
than for improving performance per se.
So another, another related idea
is this concept of a Saliency Map.
Which is something that you
will see in your homeworks.
So again, we have the same question
of given an input image
of a dog in this case
and the predicted class label of dog
we want to know which
pixels in the input image
are important for classification.
We saw masking, is one way
to get at this question.
But Saliency Maps are another, another,
angle for attacking this problem.
And the question is, and
one relatively simple idea
from Karen Simonenian's
paper, a couple years ago.
Is, this is just computing the gradient
of the predicted class score
with respect to the
pixels of the input image.
And this will directly tell us
in this sort of, first
order approximation sense.
For each input, for each
pixel in the input image
if we wiggle that pixel a little bit
then how much will the
classification score
for the class change?
And this is another way
to get at this question
of which pixels in the input
matter for the classification.
And when we, and when we run
for example Saliency,
where computer Saliency map
for this dog, we see kind of a nice
outline of a dog in the image.
Which tells us that these
are probably the pixels
of that, network is actually
looking at, for this image.
And when we repeat this type of process
for different images, we get some sense
that the network is sort of
looking at the right regions.
Which is somewhat comforting.
Question?
The question is, do
people use Saliency Maps
for semantic segmentation?
The answer is yes.
That actually was...
Yeah, you guys are like really
on top of it this lecture.
So that was another component,
again in Karen's paper.
Where there's this idea
that maybe you can use these
Saliency Maps to perform
semantic segmentation
without direct, without any labeled data
for the, for these, for these segments.
So here they're using this
Grabcut Segmentation Algorithm
which I don't really want
to get into the details of.
But it's kind of an interactive
segmentation algorithm
that you can use.
So then when you combine this Saliency Map
with this Grabcut Segmentation Algorithm
then you can in fact,
sometimes segment out
the object in the image.
Which is really cool.
However I'd like to point out
that this is a little bit brittle
and in general if you,
this will probably work
much, much, much, worse than a network
which did have access to
supervision and training time.
So, I don't, I'm not sure
how, how practical this is.
But it is pretty cool
that it works at all.
But it probably works much
less than something trained
explicitly to segment with supervision.
So kind of another, another related idea
is this idea of, of
guided back propagation.
So again, we still want
to answer the question of
for one particular, for
one particular image.
Then now instead of
looking at the class score
we want to know, we want to
pick some intermediate neuron
in the network and ask again,
which parts of the input image
influence the score of that neuron,
that internal neuron in the network.
And, and then you could
imagine, again you could imagine
computing a Saliency Map
for this again, right?
That rather than computing the
gradient of the class scores
with respect to the pixels of the image.
You could compute the gradient
of some intermediate value
in the network with respect
to the pixels of the image.
And that would tell us again
which parts, which
pixels in the input image
influence that value of
that particular neuron.
And that would be using
normal back propagation.
But it turns out that
there is a slight tweak
that we can do to this
back propagation procedure
that ends up giving some
slightly cleaner images.
So that's this idea of
guided back propagation
that again comes from Zeiler
and Fergus's 2014 paper.
And I don't really want to get
into the details too much here
but, it, you just, it's
kind of weird tweak
where you change the way
that you back propagate
through relu non-linearities.
And you sort of, only, only back propagate
positive gradients through relu's
and you do not back propagate
negative gradients through the relu's.
So you're no longer
computing the true gradient
instead you're kind of only keeping track
of positive influences
on throughout the entire network.
So maybe you should read
through these, these papers
reference to your, if you
want a little bit more details
about why that's a good idea.
But empirically, when you
do guided back propagation
as appose to regular back propagation.
You tend to get much
cleaner, nicer images.
that tells you, which part,
which pixel of the input image
influence that particular neuron.
So, again we were seeing
the same visualization
we saw a few slides ago of the
maximally activating patches.
But now, in addition to visualizing
these maximally activating patches.
We've also performed
guided back propagation,
to tell us exactly which parts
of these patches influence
the score of that neuron.
So, remember for this example at the top,
we saw that, we thought this neuron
is may be looking for circly tight things,
in the input patch
because there're allot
of circly tight patches.
Well, when we look at
guided back propagation
We can see with that intuition
is somewhat confirmed
because it is indeed the circly
parts of that input patch
which are influencing
that, that neuron value.
So, this is kind of a useful
to all for synthesizing.
For understanding what these
different intermediates
are looking for.
But, one kind of interesting thing
about guided back propagation
or computing saliency maps.
Is that there's always a
function of fixed input image,
right, they're telling us
for a fixed input image,
which pixel or which parts
of that input image influence
the value of the neuron.
Another question you might answer is
is remove this reliance, on
that, on some input image.
And then instead just ask
what type of input in general
would cause this neuron to activate
and we can answer this question
using a technical Gradient ascent
so, remember we always use Gradient decent
to train our convolutional
networks by minimizing the loss.
Instead now, we want to fix the, fix
the weight of our trained
convolutional network
and instead synthesizing image
by performing Gradient ascent
on the pixels of the
image to try and maximize
the score of some intermediate
neuron or of some class.
So, in a process of Gradient ascent,
we're no longer optimizing
over the weights of the network
those weights remained fixed
instead we're trying to change
pixels of some input image
to cause this neuron,
or this neuron value,
or this class score to
maximally, to be maximized
but, instead but, in addition
we need some regularization term
so, remember we always a,
we before seeing regularization terms
to try to prevent the network weights
from over fitting to the training data.
Now, we need something kind of similar
to prevent the pixels
of our generated image
from over fitting to the peculiarities
of that particular network.
So, here we'll often incorporate
some regularization term
that, we're kind of, we
want a generated image
of two properties
one, we wanted to maximally activate some,
some score or some neuron value.
But, we also wanted to
look like a natural image.
we wanted to kind of have,
the kind of statistics
that we typically see in natural images.
So, these regularization
term in the subjective
is something to enforce a generated image
to look relatively natural.
And we'll see a couple
of different examples
of regualizers as we go through.
But, the general strategy for this
is actually pretty simple
and again informant allot
of things of this nature
on your assignment three.
But, what we'll do is start
with some initial image
either initializing to zeros
or to uniform or noise.
But, initialize your image in some way
and I'll repeat where
you forward your image
through 3D network and compute the score
or, or neuron value
that you're interested.
Now, back propagate to
compute the Gradient
of that neuron score with respect
to the pixels of the image
and then make a small Gradient ascent
or Gradient ascent update
to the pixels of the images itself.
To try and maximize that score.
And I'll repeat this
process over and over again,
until you have a beautiful image.
And, then we talked, we talked
about the image regularizer,
well a very simple, a very
simple idea for image regularizer
is simply to penalize L2
norm of a generated image
This is not so semantically meaningful,
it's just does something,
and this was one of the,
one of the earliest
regularizer that we've seen
in the literature for these
type of generating images type
of papers.
And, when you run this
on a trained network
you can see that now we're
trying to generate images
that maximize the dumble score
in the upper left hand
corner here for example.
And, then you can see that
the synthesized image,
it been, it's little
bit hard to see may be
but there're allot of
different dumble like shapes,
all kind of super impose
that different portions of the image.
or if we try to generate an image for cups
then we can may be see a
bunch of different cups
all kind of super imposed
the Dalmatian is pretty cool,
because now we can see kind of this black
and white spotted pattern
that's kind of
characteristics of Dalmatians
or for lemons we can see
these different kinds
of yellow splotches in the image.
And there's a couple
of more examples here,
I think may be the goose is kind of cool
or the kitfox are actually
may be looks like kitfox.
Question?
The question is, why are
these all rainbow colored
and in general getting true colors out
of this visualization is pretty tricky.
Right, because any, any actual image will
be bounded in the range zero to 255.
So, it really should be some kind
of constrained optimization problem
But, if, for using this generic
methods for Gradient ascent
then we, that's going to
be unconstrained problem.
So, you may be use like projector
Gradient ascent algorithm
or your rescaled image at the end.
So, the colors that you
see in this visualizations,
sometimes are you cannot
take them too seriously.
Question?
The question is what happens, if you let
the thing loose and don't
put any regularizer on it.
Well, then you tend to get
an image which maximize
the score which is confidently classified
as the class you wanted
but, usually it doesn't
look like anything.
It kind of look likes random noise.
So, that's kind of an
interesting property in itself
that will go into much more
detail in a future lecture.
But, that's why, that
kind of doesn't help you
so much for understanding what things
the network is looking for.
So, if we want to understand,
why the network thing makes its decisions
then it's kind of useful
to put regularizer
on there to generate an
image to look more natural.
A question in the back.
Yeah, so the question
is that we see a lot of
multi modality here, and
other ways to combat that.
And actually yes, we'll see that,
this is kind of first
step in the whole line
of work in improving these visualizations.
So, another, another kind
of, so then the angle here
is a kind of to improve the regularizer
to improve our visualized images.
And there's a another
paper from Jason Yesenski
and some of his collaborators where
they added some additional
impressive regularizers.
So, in addition to this
L2 norm constraint,
in addition we also periodically
during optimization,
and do some gauche and
blurring on the image,
we're also clip some,.
some small value, some small pixel values
all the way to zero, we're
also clip some of the,
some of the pixel values
of low Gradients to zero
So, you can see this is kind of
a projector Gradient ascent algorithm
where it reach periodically
we're projecting
our generated image onto some nicer set
of images with some nicer properties.
For example, special smoothness
with respect to the gauchian blurring
So, when you do this, you
tend to get much nicer images
that are much clear to see.
So, now these flamingos
look like flamingos
the ground beetle is starting
to look more beetle like
or this black swan maybe
looks like a black swan.
These billiard tables actually
look kind of impressive now,
where you can definitely see
this billiard table structure.
So, you can see that once you
add in nicer regularizers,
then the generated images become a bit,
a little bit cleaner.
And, now we can perform this procedure
not only for the final class course,
but also for these
intermediate neurons as well.
So, instead of trying to
maximize our billiard table score
for example instead we
can get maximize one
of the neurons from
some intermediate layer
Question.
So, the question is what's
with the for example here,
so those who remember
initializing our image randomly
so, these four images would
be different random
initialization of the input image.
And again, we can use these
same type of procedure
to visualize, to synthesis images
which maximally activate
intermediate neurons
of the network.
And, then you can get a sense from some
of these intermediate
neurons are looking for,
so may be at layer four there's neuron
that's kind of looking for spirally things
or there's neuron that's may be looking
for like chunks of caterpillars
it's a little bit harder to tell.
But, in generally as you
go larger up in the image
then you can see that
the one, the obviously
receptive fields of
these neurons are larger.
So, you're looking at the
larger patches in the image.
And they tend to be looking
for may be larger structures
or more complex patterns
in the input image.
That's pretty cool.
And, then people have
really gone crazy with this
and trying to, they basically
improve these visualization
by keeping on extra features
So, this was a cool paper kind
of explicitly trying to address this
multi modality, there's
someone asked question
about a few minutes ago.
So, here they were
trying to explicitly take
a count, take this multi
modality into account
in the optimization procedure
where they did indeed,
I think see the initial,
so they for each of the classes,
you run a clustering algorithm
to try to separate the
classes into different modes
and then initialize with something
that is close to one of those modes.
And, then when you do that,
you kind of account for
this multi modality.
so for intuition, on the
right here these eight images
are all of grocery stores.
But, the top row, is
kind of close up pictures
of produce on the shelf
and those are labeled as grocery stores
And the bottom row kind of shows
people walking around grocery stores
or at the checkout line
or something like that.
And, those are also labeled
those as grocery store,
but their visual appearance
is quiet different.
So, a lot of these classes
and that being sort
multi modal
And, if you can take, and
if you explicitly take
this more time mortality into account
when generating images, then
you can get nicer results.
And now, then when you look at some
of their example, synthesis
images for classes,
you can see like the
bell pepper, the card on,
strawberries, jackolantern now they end up
with some very beautifully
generated images.
And now, I don't want to get to much
into detail of the next slide.
But, then you can even go crazier.
and add an even stronger image prior
and generate some very
beautiful images indeed
So, these are all synthesized
images that are trying
to maximize the class score
or some image in a class.
But, the general idea is that rather
than optimizing directly the pixels
of the input image, instead they're trying
to optimize the FC6 representation
of that image instead.
And now they need to use some
feature inversion network
and I don't want to get
into the details here.
You should read the paper,
it's actually really cool
But, the point is that,
when you start adding additional priors
towards modeling natural images
and you can end generating
some quiet realistic images
they gave you some sense of
what the network is looking for
So, that's, that's sort of one cool thing
that we can do with this
strategy, but this idea
of trying to synthesis
images by using Gradients
on image pixels, is
actually super powerful.
And, another really cool
thing we can do with this,
is this concept of fooling image
So, what we can do is
pick some arbitrary image,
and then try to maximize the,
so, say we take it picture
of an elephant and then
we tell the network
that we want to, change the image
to maximize the score
of Koala bear instead
So, then what we were
doing is trying to change
that image of an elephant
to try and instead cause
the network to classify as a Koala bear.
And, what you might hope for is that,
maybe that elephant was
sort of thought more thing
into a Koala bear and
maybe he would sprout
little cute ears or something like that.
But, that's not what happens in practice,
which is pretty surprising.
Instead if you take this
picture of a elephant
and tell them that, tell them that
and try to change the
elephant image to instead
cause it to be classified as a koala bear
What you'll find is that, you is that
this second image on the right actually
is classified as koala bear
but it looks the same to us.
So that's pretty fishy
and pretty surprising.
So also on the bottom we've
taken this picture of a boat.
Schooner is the image in that class
and then we told the network
to classified as an iPod.
So now the second example looks just,
still looks like a boat to us
but the network thinks it's an iPod
and the difference is in
pixels between these two images
are basically nothing.
And if you magnify those differences
you don't really see any
iPod or Koala like features
on these differences,
they're just kind of like
random patterns of noise.
So the question is what's going here?
And like how can this possibly the case?
Well, we'll have a guest
lecture from Ian Goodfellow
in a week an half two weeks.
And he's going to go in much more detail
about this type of phenomenon
and that will be really exciting.
But I did want to mention it here
because it is on your homework.
Question?
Yeah, so that's something,
so the question is can we use
fooled images as training data
and I think, Ian's going
to go in much more detail
on all of these types of strategies.
Because that's literally,
that's really a whole lecture onto itself.
Question?
The question is why do we
care about any of this stuff?
Basically...
Okay, maybe that was a
mischaracterization, I am sorry.
Yeah, the question is
what is have in the...
understanding this intermediate neurons
how does that help our understanding
of the final classification.
So this is actually, this
whole field of trying
to visualize intermediates
is kind of in response
to a common criticism of deep learning.
So a common criticism of
deep learning is like,
you've got this big black box network
you trained it on gradient
ascent, you get a good number
and that's great but we
don't trust the network
because we don't understand as people
why it's making the
decisions, that's it's making.
So a lot of these type of
visualization techniques
were developed to try and address that
and try to understand as people
why the network are making
their various classification,
classification decisions a bit more.
Because if you contrast,
if you contrast a deep
convolutional neural network
with other machine running techniques.
Like linear models are much
easier to interpret in general
because you can look at
the weights and kind of
understand the interpretation
between how much each input
feature effect the decision or
if you look at something like
a random forest or decision tree.
Some other machine learning models
end up being a bit more interpretable
just by their very nature
then this sort of black box
convolutional networks.
So a lot of this is sort of
in response to that criticism
to say that, yes they are
these large complex models
but they are still doing some
interesting and interpretable
things under the hood.
They are not just totally going out
in randomly classifying things.
They are doing something meaningful
So another cool thing we can do with this
gradient based optimization of images
is this idea of DeepDream.
So this was a really cool blog post
that came out from
Google a year or two ago.
And the idea is that,
this is, so we talked
about scientific value,
this is almost entirely for fun.
So the point of this exercise is mostly
to generate cool images.
And aside, you also get
some sense for what features
images are looking at.
Or these networks are looking at.
So we can do is, we take our input image
we run it through the convolutional
network up to some layer
and now we back propagate
and set the gradient
of that, at that layer
equal to the activation value.
And now back propagate, back to the image
and update the image and
repeat, repeat, repeat.
So this has the interpretation
of trying to amplify
existing features that were
detected by the network
in this image. Right?
Because whatever features
existed on that layer
now we set the gradient
equal to the feature
and we just tell the network to amplify
whatever features you
already saw in that image.
And by the way you can also
see this as trying to maximize
the L2 norm of the features
at that layer of the image.
And it turns... And when you do this
the code ends up looking really simple.
So your code for many of
your homework assignments
will probably be about this complex
or maybe even a little bit a less so.
So the idea is that...
But there's a couple of tricks here
that you'll also see in your assignments.
So one trick is to jitter the image
before you compute your gradients.
So rather than running the
exact image through the network
instead you'll shift the
image over by two pixels
and kind of wrap the other
two pixels over here.
And this is a kind of regularizer
to prevent each of these [mumbling]
it regularizers a little bit to encourage
a little bit of extra special
smoothness in the image.
You also see they use L1
normalization of the gradients
that's kind of a useful trick sometimes
when doing this image generation problems.
You also see them clipping the
pixel values once in a while.
So again we talked about
images actually should be
between zero to 2.55
so this is a kind of
projected gradients decent
where we project on to the
space of actual valid images.
But now when we do all this
then we start, we might start
with some image of a sky
and then we get really
cool results like this.
So you can see that now
we've taken these tiny features on the sky
and they get amplified through
this, through this process.
And we can see things like this different
mutant animals start to pop up
or these kind of spiral shapes pop up.
Different kinds of houses and cars pop up.
So that's all, that's
all pretty interesting.
There's a couple patterns in particular
that pop up all the time
that people have named.
Right, so there's this Admiral
dog, that shows up allot.
There's the pig snail, the camel bird
this the dog fish.
Right, so these are
kind of interesting, but
actually this fact that
dog show up so much
in these visualization,
actually does tell us
something about the data on
which this network was trained.
Right, because this is a
network that was trained
for image net classification,
image that have thousand categories.
But 200 of those categories are dogs.
So, so it's kind of not
surprising in a sense
that when you do these
kind of visualizations
then network ends up hallucinating
a lot of dog like stuff
in the image often morphed
with other types of animals.
When you do this other
layers of the network
you get other types of results.
So here we're taking one
of these lower layers
in the network, the previous
example was relatively
high up in the network
and now again we have this
interpretation that lower layers
maybe computing edges and
swirls and stuff like that
and that's kind of borne out
when we running DeepDream
at a lower layer.
Or if you run this thing for a long time
and maybe add in some
multiscale processing
you can get some really,
really crazy images.
Right, so here they're doing a
kind of multiscale processing
where they start with a small image
run DeepDream on the small
image then make it bigger
and continue DeepDream on the larger image
and kind of repeat with
this multiscale processing
and then you can get,
and then maybe after you
complete the final scale
then you restart from the beginning
and you just go wild on this thing.
And you can get some really crazy images.
So these examples were all from networks
trained on image net
There's another data set from
MIT called MIT Places Data set
but instead of 1,000 categories of objects
instead it's 200 different types of scenes
like bedrooms and kitchens
like stuff like that.
And now if we repeat
this DeepDream procedure
using an network trained at MIT places.
We get some really cool
visualization as well.
So now instead of dogs,
slugs and admiral dogs
and that's kind of stuff,
instead we often get these
kind of roof shapes of these
kind of Japanese style building
or these different types of
bridges or mountain ranges.
They're like really, really
cool beautiful visualizations.
So the code for DeepDream is
online, released by Google
you can go check it out and
make your own beautiful pictures
So there's another kind of...
Sorry question?
So the question is, what
are taking gradient of?
So like I say, if you, because
like one over x squared
on the gradient of that is x.
So, if you send back
the volume of activation
as the gradient, that's equivalent to max,
that's equivalent to taking the
gradient with respect to the
like one over x squared some...
Some of the values.
So it's equivalent to maximizing the norm
of that of the features of that layer.
But in practice many implementation
you'll see not explicitly compute that
instead of send gradient back.
So another kind of useful,
another kind of useful thing
we can do is this concept
of feature inversion.
So this again gives us a
sense for what types of,
what types of elements
of the image are captured
at different layers of the network.
So what we're going to
do now is we're going to
take an image, run that
image through network
record the feature value
for one of those images
and now we're going to try
to reconstruct that image
from its feature representation.
And the question, and now
based on the how much, how much like
what that reconstructed image looks like
that'll give us some sense
for what type of information
about the image was captured
in that feature vector.
So again, we can do this
with gradient ascent
with some regularizer.
Where now rather than
maximizing some score
instead we want to minimize the distance
between this catch feature vector.
And between the computed
features of our generated image.
To try and again synthesize
a new image that matches
the feature back to
that we computed before.
And another kind of regularizer
that you frequently see here
is the total variation regularizer
that you also see on your homework.
So here with the total
variation regularizer
is panelizing differences
between adjacent pixels
on both of the left and
adjacent in left and right
and adjacent top to bottom.
To again try to encourage
special smoothness
in the generated image.
So now if we do this
idea of feature inversion
so this visualization here on the left
we're showing some original image.
The elephants or the fruits at the left.
And then we run that,
we run the image through a VGG-16 network.
Record the features of
that network at some layer
and then try to synthesize
a new image that matches
the recorded features of that layer.
And this is, this kind of
give us a sense for what
how much information is
stored in this images.
In these features of different layers.
So for example if we try
to reconstruct the image
based on the relu2_2 features
from VGC's, from VGG-16.
We see that the image gets
almost perfectly reconstructed.
Which means that we're
not really throwing away
much information about the raw
pixel values at that layer.
But as we move up into the
deeper parts of the network
and try to reconstruct
from relu4_3, relu5_1.
We see that our reconstructed image now,
we've kind of kept the general space,
the general spatial
structure of the image.
You can still tell that, that
it's a elephant or a banana
or a, or an apple.
But a lot of the low level details aren't
exactly what the pixel values were
and exactly what the colors were,
exactly what the textures were.
These are kind of low level details are
kind of lost at these higher
layers of this network.
So that gives us some sense that
maybe as we move up through
the flairs of the network
it's kind of throwing away
this low level information
about the exact pixels of the image
and instead is maybe trying
to keep around a little bit
more semantic information,
it's a little bit invariant
for small changes in color and
texture and things like that.
So we're building towards
a style transfer here
which is really cool.
So in addition to
understand style transfer,
in addition to feature inversion.
We also need to talk
about a related problem
called texture synthesis.
So in texture synthesis, this
is kind of an old problem
in computer graphics.
Here the idea is that
we're given some input
patch of texture.
Something like these little scales here
and now we want to build some model
and then generate a larger
piece of that same texture.
So for example, we might here
want to generate a large image
containing many scales that
kind of look like input.
And this is again a pretty old
problem in computer graphics.
There are nearest neighbor
approaches to textual synthesis
that work pretty well.
So, there's no neural networks here.
Instead, this kind of a simple algorithm
where we march through the generated image
one pixel at a time in scan line order.
And then copy...
And then look at a neighborhood
around the current pixel
based on the pixels that
we've already generated
and now compute a nearest
neighbor of that neighborhood
in the patches of the input image
and then copy over one
pixel from the input image.
So, maybe you don't need to
understand the details here
just the idea is that there's
a lot classical algorithms
for texture synthesis,
it's a pretty old problem
but you can do this without
neural networks basically.
And when you run this kind of
this kind of classical
texture synthesis algorithm
it actually works reasonably
well for simple textures.
But as we move to more complex textures
these kinds of simple methods
of maybe copying pixels
from the input patch directly
tend not to work so well.
So, in 2015, there was a really cool paper
that tried to apply
neural network features
to this problem of texture synthesis.
And ended up framing it as kind of
a gradient ascent procedure,
kind of similar to the feature map,
the various feature matching objectives
that we've seen already.
So, in order to perform
neural texture synthesis
they use this concept of a gram matrix.
So, what we're going to do,
is we're going to take our input texture
and in this case some pictures of rocks
and then take that input texture
and pass it through some
convolutional neural network
and pull out convolutional features
at some layer of the network.
So, maybe then this
convolutional feature volume
that we've talked about,
might be H by W by C
or sorry, C by H by W at
that layer of the network.
So, you can think of this
as an H by W spacial grid.
And at each point of the grid,
we have this C dimensional feature vector
describing the rough
appearance of that image
at that point.
And now, we're going to
use this activation map
to compute a descriptor of the
texture of this input image.
So, what we're going to do is take,
pick out two of these
different feature columns
in the input volume.
Each of these feature columns
will be a C dimensional vector.
And now take the outer product
between those two vectors
to give us a C by C matrix.
This C by C matrix now
tells us something about the co-occurrence
of the different features at
those two points in the image.
Right, so, if an element,
if like element IJ
in the C by C matrix is large
that means both elements I and
J of those two input vectors
were large and something like that.
So, this somehow captures
some second order statistics
about which features, in that feature map
tend to activate to together
at different spacial volumes...
At different spacial positions.
And now we're going to
repeat this procedure
using all different
pairs of feature vectors
from all different points
in this H by W grid.
Average them all out, and that gives us
our C by C gram matrix.
And this is then used a
descriptor to describe
kind of the texture of that input image.
So, what's interesting
about this gram matrix
is that it has now thrown
away all spacial information
that was in this feature volume.
Because we've averaged over
all pairs of feature vectors
at every point in the image.
Instead, it's just
capturing the second order
co-occurrence statistics between features.
And this ends up being a
nice descriptor for texture.
And by the way, this is
really efficient to compute.
So, if you have a C by H by
W three dimensional tensure
you can just reshape
it to see times H by W
and take that times its own transpose
and compute this all in one shot
so it's super efficient.
But you might be wondering why
you don't use an actual covariance matrix
or something like that instead
of this funny gram matrix
and the answer is that using covariance...
Using true covariance matrices also works
but it's a little bit
more expensive to compute.
So, in practice a lot of people
just use this gram matrix descriptor.
So then... Then there's this...
Now once we have this sort of
neural descriptor of texture
then we use a similar type
of gradient ascent procedure
to synthesize a new image
that matches the texture
of the original image.
So, now this looks kind of
like the feature reconstruction
that we saw a few slides ago.
But instead, I'm trying to
reconstruct the whole feature map
from the input image.
Instead, we're just going
to try and reconstruct
this gram matrix texture descriptor
of the input image instead.
So, in practice what this
looks like is that well...
You'll download some pretrained model,
like in feature inversion.
Often, people will use
the VGG networks for this.
You'll feed your... You'll
take your texture image,
feed it through the VGG network,
compute the gram matrix
and many different layers
of this network.
Then you'll initialize your new image
from some random initialization
and then it looks like
gradient ascent again.
Just like for these other
methods that we've seen.
So, you take that image, pass it through
the same VGG network,
Compute the gram matrix at various layers
and now compute loss
as the L2 norm between the gram matrices
of your input texture
and your generated image.
And then you back prop,
and compute pixel...
A gradient of the pixels
on your generated image.
And then make a gradient ascent step
to update the pixels of
the image a little bit.
And now, repeat this process many times,
go forward, compute your gram matrices,
compute your losses, back prop..
Gradient on the image and repeat.
And once you do this, eventually
you'll end up generating
a texture that matches your
input texture quite nicely.
So, this was all from Nip's 2015 paper
by a group in Germany.
And they had some really cool
results for texture synthesis.
So, here on the top,
we're showing four
different input textures.
And now, on the bottom, we're showing
doing this texture synthesis approach
by gram matrix matching.
Using, by computing the gram
matrix at different layers
at this pretrained convolutional network.
So, you can see that, if we
use these very low layers
in the convolutional network
then we kind of match the general...
We generally get splotches
of the right colors
but the overall spacial structure
doesn't get preserved so much.
And now, as we move to large
down further in the image
and you compute these gram
matrices at higher layers
you see that they tend to
reconstruct larger patterns
from the input image.
For example, these whole rocks
or these whole cranberries.
And now, this works pretty well
that now we can synthesize
these new images
that kind of match the
general spacial statistics
of your inputs.
But they are quite different pixel wise
from the actual input itself.
Question?
So, the question is, where
do we compute the loss?
And in practice, we
want to get good results
typically people will
compute gram matrices
at many different layers
and then the final loss
will be a sum of all those
potentially a weighted sum.
But I think for this visualization,
to try to pin point the
effect of the different layers
I think these were doing reconstruction
from just one layer.
So, now something really...
Then, then they had a
really brilliant idea
kind of after this paper
which is, what if we do this
texture synthesis approach
but instead of using an image
like rocks or cranberries
what if we set it equal
to a piece of artwork.
So then, for example, if you...
If you do the same texture
synthesis algorithm
by maximizing gram
matrices, but instead of...
But now we take, for example,
Vincent Van Gogh's Starry night
or the Muse by Picasso as our texture...
As our input texture,
and then run this same
texture synthesis algorithm.
Then we can see our generated images tend
to reconstruct interesting pieces
from those pieces of artwork.
And now, something really
interesting happens
when you combine this
idea of texture synthesis
by gram matrix matching
with feature inversion
by feature matching.
And then this brings us to
this really cool algorithm
called style transfer.
So, in style transfer, we're
going to take two images as input.
One, we're going to take a content image
that will guide like what
type of thing we want.
What we generally want
our output to look like.
Also, a style image that will tell us
what is the general texture or style
that we want our generated image to have
and then we will jointly
do feature recon...
We will generate a new image
by minimizing the feature
reconstruction loss
of the content image
and the gram matrix
loss of the style image.
And when we do these two things
we a get a really cool image that kind of
renders the content image
kind of in the artistic style
of the style image.
And now this is really cool.
And you can get these
really beautiful figures.
So again, what this kind of looks like
is that you'll take your style
image and your content image
pass them into your network
to compute your gram matrices
and your features.
Now, you'll initialize your output image
with some random noise.
Go forward, compute your losses
go backward, compute your
gradients on the image
and repeat this process over and over
doing gradient ascent on the
pixels of your generated image.
And after a few hundred iterations,
generally you'll get a beautiful image.
So, I have implementation of this online
on my Gethub, that a
lot of people are using.
And it's really cool.
So, you can, this is kind of...
Gives you a lot more control
over the generated image
as compared to DeepDream.
Right, so in DeepDream, you
don't have a lot of control
about exactly what types of
things are going to happen
coming out at the end.
You just kind of pick different
layers of the networks
maybe set different numbers of iterations
and then dog slugs pop up everywhere.
But with style transfer,
you get a lot more fine grain control
over what you want the
result to look like.
Right, by now, picking
different style images
with the same content image
you can generate whole
different types of results
which is really cool.
Also, you can play around with
the hyper parameters here.
Right, because we're doing
a joint reconstruct...
We're minimizing this
feature reconstruction loss
of the content image.
And this gram matrix reconstruction
loss of the style image.
If you trade off the constant,
the waiting between those
two terms and the loss.
Then you can get control
about how much we want
to match the content versus
how much we want to match the style.
There's a lot of other hyper
parameters you can play with.
For example, if you resize the style image
before you compute the gram matrix
that can give you some control
over what the scale of features are
that you want to reconstruct
from the style image.
So, you can see that here,
we've done this same reconstruction
the only difference is how
big was the style image
before we computed the gram matrix.
And this gives you another axis over
which you can control these things.
You can also actually do style transfer
with multiple style images
if you just match sort
of multiple gram matrices
at the same time.
And that's kind of a cool result.
We also saw this multi-scale process...
So, another cool thing you can do.
We talked about this multi-scale
processing for DeepDream
and saw how multi scale
processing in DeepDream
can give you some really
cool resolution results.
And you can do a similar type
of multi-scale processing
in style transfer as well.
So, then we can compute images like this.
That a super high resolution,
this is I think a 4k image
of our favorite school,
like rendered in the
style of Starry night.
But this is actually super
expensive to compute.
I think this one took four GPU's.
So, a little expensive.
We can also other style,
other style images.
And get some really cool results
from the same content image.
Again, at high resolution.
Another fun thing you can do is
you know, you can actually
do joint style transfer
and DeepDream at the same time.
So, now we'll have three
losses, the content loss
the style loss and this...
And this DeepDream loss that
tries to maximize the norm.
And get something like this.
So, now it's Van Gogh
with the dog slug's coming out everywhere.
[laughing]
So, that's really cool.
But there's kind of a problem
with this style transfer for algorithms
which is that they are pretty slow.
Right, you need to produce...
You need to compute a lot of
forward and backward passes
through your pretrained network
in order to complete these images.
And especially for these high
resolution results that we saw
in the previous slide.
Each forward and backward
pass of a 4k image
is going to take a lot of
compute and a lot of memory.
And if you need to do several
hundred of those iterations
generating these images could take many,
like tons of minutes
even on a powerful GPU.
So, it's really not so
practical to apply these things
in practice.
The solution is to now,
train another neural network
to do the style transfer for us.
So, I had a paper about this last year
and the idea is that we're
going to fix some style
that we care about at the beginning.
In this case, Starry night.
And now rather than running
a separate optimization procedure
for each image that we want to synthesize
instead we're going to train
a single feed forward network
that can input the content image
and then directly output
the stylized result.
And now the way that we train this network
is that we compute the same content
and style losses during training
of our feed forward network
and use that same gradient
to update the weights
of the feed forward network.
And now this thing takes
maybe a few hours to train
but once it's trained,
then in order to produce stylized images
you just need to do a single forward pass
through the trained network.
So, I have a code for this online
and you can see that it
ends up looking about...
Relatively comparable
quality in some cases
to this very slow optimization base method
but now it runs in real time
it's about a thousand times faster.
So, here you can see, this is
like a demo of it running live
off my webcam.
So, this is not running
live right now obviously,
but if you have a big GPU
you can easily run four different styles
in real time all simultaneously
because it's so efficient.
There was... There was
another group from Russia
that had a very similar out...
That had a very similar paper concurrently
and their results are about as good.
They also had this kind
of tweek on the algorithm.
So, this feed forward
network that we're training
ends up looking a lot like these...
These segmentation models that we saw.
So, these segmentation networks,
for semantic segmentation
we're doing down sampling
and then many, and then many layers
then some up sampling [mumbling]
With transposed convulsion in order
to down sample an up sample
to be more efficient.
The only difference is
that this final layer
produces a three channel output
for the RGB of that final image.
And inside this network,
we have batch normalization
in the various layers.
But in this paper, they introduce...
They swap out the batch normalization
for something else called
instance normalization
tends to give you much better results.
So, one drawback of these
types of methods is that
we're now training one new
style transfer network...
For every... For style
that we want to apply.
So that could be expensive
if now you need to keep a lot
of different trained networks around.
So, there was a paper from
Google that just came...
Pretty recently that addressed this
by using one feed forward trained network
to apply many different
styles to the input image.
So now, they can train one network
to apply many different
styles at test time
using one trained network.
So, here's it's going to
take the content images input
as well as the identity of
the style you want to apply
and then this is using one network
to apply many different types of styles.
And again, runs in real time.
That same algorithm can also
do this kind of style blending
in real time with one trained network.
So now, once you trained this network
on these four different styles
you can actually specify
a blend of these styles
to be applied at test
time which is really cool.
So, these kinds of real
time style transfer methods
are on various apps and
you can see these out
in practice a lot now these days.
So, kind of the summary
of what we've seen today
is that we've talked about
many different methods
for understanding CNN representations.
We've talked about some of
these activation based methods
like nearest neighbor,
dimensionality reduction,
maximal patches, occlusion images
to try to understand based
on the activation values
of what the features are looking for.
We also talked about a bunch
of gradient based methods
where you can use gradients
to synthesize new images
to understand your features
such as saliency maps
class visualizations, fooling images,
feature inversion.
And we also had fun by seeing
how a lot of these similar ideas
can be applied to things like
Style Transfer and DeepDream
to generate really cool images.
So, next time, we'll talk
about unsupervised learning
Autoencoders, Variational Autoencoders
and generative adversarial networks
so that should be a fun lecture.
