Hi everyone. Welcome to lecture number seven.
Can you hear me in the back? Is it easy to hear?
Okay. So, in the last set of modules that you've seen,
you've learned about convolutional neural networks and how they
can be applied to imaging, notably.
You've played with different types of layers, including pooling
(max pooling, average pooling) and convolutional layers.
You've also seen some classification
with the most classic algorithms,
all the way up to Inception and ResNets.
And then you jumped into advanced applications like object detection with YOLO
and the Fast R-CNN,
Faster R-CNN series with an optional video.
And finally, face recognition and
neural style transfer, which we talked a little bit about in the past lectures.
So, today, we are going to build on top of everything you've seen in this set of modules
to try to delve into neural networks and interpret them.
Because you've noticed, after seeing
the set of modules up to now, that a lot of
improvements to neural networks are based on trial and error.
So, we try something,
we do a hyperparameter search,
sometimes the model improves, sometimes it doesn't.
We use a validation set to find the right set
of methods that would make our model improve.
It's not satisfactory from a scientific standpoint,
so people are also searching for
an effective way to improve our neural networks,
not only with trial and error,
but with theory that goes into the network, and with visualizations.
So, today, we will focus on that.
We'll first see three methods:
saliency maps, occlusion sensitivity,
and class activation maps,
which are used to understand what the decision process of the network was.
Given this output, how can we map
the output decision back onto the input space to
see which parts of the input were discriminative for this output?
And later on, we will delve in even more
detail into the network by looking at intermediate layers,
what happens at an activation level,
at a layer level,
and at a network level, with another set of methods:
gradient ascent class model visualization,
dataset search, and deconvolution.
We will spend some time on the deconvolution because
it's a cool type of
mathematical operation to know, and it will give you
more intuition on how the convolution works from a mathematical perspective.
If we have time, we'll go over a fun application called Deep Dream,
which gives super cool visuals, for those of you who know it.
Okay? Let's go.
The Menti code is on the board,
if you guys need to sign up.
So, as usual,
we'll go over some context,
contextual information and small case studies,
so don't hesitate to participate.
So, you've built an animal classifier for a pet shop,
and you gave it to them.
It's super good.
It's been trained on ImageNet plus some other data.
And what is a little worrying is that
the pet shop is a little reluctant to use your network,
because they don't understand the decision process of the model.
So, how can you quickly show that the model is actually looking at a specific animal,
let's say a cat, if I give it an input that is a cat?
We've seen that together
one time, everybody remembers?
So, I'll go quickly. You have a network,
and here is a dog given as an input to a CNN.
The CNN, assuming the constraint that there is one animal per image, was trained with
a softmax output layer, and we get a probability distribution over all animals:
iguana, dog, car, cat, and crab.
And what we want is to take the derivative of
the score of dog and backpropagate it to the input
to know which parts of the input were discriminative for this score of dog.
Does that make sense? Everybody remembers this?
And so, the interesting part is that this value is the same shape as x.
So, it's the size of the input.
It's a matrix of numbers.
If the numbers are large in absolute value,
it means the pixels corresponding to these locations had an impact on the score of dog.
Okay? What do you think the score of dog is?
Is it the output probability or not?
What do I mean by s of dog?
[NOISE]
Yeah?
Score of the dog?
It's the score of the dog, yeah.
But is it 0.85? Is that what I mean?
[NOISE] No, there are actually formulas used to compute the 0.85, going to the softmax [inaudible]
Yes. So,
it's the score that is pre-softmax.
It's the score that comes before the softmax.
So, as a reminder, here's a softmax layer and this is how it could be presented.
You get a vector that is a set of scores that are not necessarily probabilities,
they are just scores between minus infinity and plus infinity.
You give them to the softmax, and
what the softmax is going to do is output
a vector in which all the probabilities
sum up to one.
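As a rough sketch of the softmax layer just described, in Python with NumPy. The score values below are made up for illustration and are not taken from the lecture slides.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Hypothetical pre-softmax scores for iguana, dog, car, cat, crab.
scores = np.array([-1.0, 3.0, 0.5, 1.0, -2.0])
probs = softmax(scores)
print(probs, probs.sum())
```

The scores can be any real numbers; only the outputs are constrained to be positive and to sum to one.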
Okay? And so, the point is that if, instead of
using the derivative of what we called Y hat last time,
we use the score of dog,
we will get a better representation here.
The reason is that in order to maximize this number,
score of dog divided by the sum of the scores of all animals,
or rather,
I should write exponential of score of dog divided
by the sum of the exponentials of the scores of all animals,
one way is to minimize the
scores of all the other animals rather than maximizing the score of dog.
So, you see, maybe moving a certain pixel will minimize the score of fish.
And so, this pixel will have a high influence on Y hat,
the general output of the network.
But it actually doesn't have an influence on the score of dog
one layer before. Does it make sense?
So, that's why we would use
the scores pre-softmax instead of
using the scores post-softmax, which are the probabilities.
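One way to write down this argument, as a sketch: call the pre-softmax scores $s_k$ and the softmax outputs $\hat{y}_k$ (this notation is assumed here, it is not taken from the slides). The standard softmax derivative then shows that the gradient of $\hat{y}_{\text{dog}}$ with respect to the input $x$ mixes in the gradients of every other class score, which is why a pixel that only lowers the score of another class still shows up.

```latex
\hat{y}_{\text{dog}} = \frac{e^{s_{\text{dog}}}}{\sum_{j} e^{s_j}},
\qquad
\frac{\partial \hat{y}_{\text{dog}}}{\partial s_k} =
\begin{cases}
\hat{y}_{\text{dog}}\,(1 - \hat{y}_{\text{dog}}) & k = \text{dog} \\
-\,\hat{y}_{\text{dog}}\,\hat{y}_k & k \neq \text{dog}
\end{cases}

\frac{\partial \hat{y}_{\text{dog}}}{\partial x}
= \hat{y}_{\text{dog}}\,(1 - \hat{y}_{\text{dog}})\,\frac{\partial s_{\text{dog}}}{\partial x}
\;-\; \sum_{k \neq \text{dog}} \hat{y}_{\text{dog}}\,\hat{y}_k\,\frac{\partial s_k}{\partial x}
```

Taking the gradient of $s_{\text{dog}}$ directly keeps only the first kind of term.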
Okay. And what's fun is, you cannot see it well here,
but the slides are online if you want to look at it on your computers:
you have some pixels, at roughly
the same positions as the dog in the input image, that are stronger.
So, we see some white pixels here.
And this can probably be used to segment the dog.
So, you could use a simple thresholding to find where the dog was based on this
pixel derivative, the pixel score map.
It doesn't work too well in practice,
so we have better methods to do segmentation,
but this can be done as well.
So, this is what is called saliency maps,
and it's a common technique to quickly
visualize what the network is looking at.
In practice, we will use other methods.
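A minimal sketch of a saliency map, assuming a toy stand-in for the CNN: a single hidden ReLU layer with random weights on a flattened 8 by 8 image. The network, the sizes, and the class index `DOG` are all hypothetical; the point is only that the gradient of the pre-softmax score with respect to the input has the same shape as the input.

```python
import numpy as np

rng = np.random.default_rng(0)
HEIGHT, WIDTH, HIDDEN, CLASSES = 8, 8, 16, 5
DOG = 1  # hypothetical index of the "dog" class

# Toy stand-in for a trained CNN: one hidden ReLU layer on the flattened image.
W1 = rng.normal(size=(HIDDEN, HEIGHT * WIDTH))
b1 = rng.normal(size=HIDDEN)
W2 = rng.normal(size=(CLASSES, HIDDEN))
b2 = rng.normal(size=CLASSES)

def scores(x):
    """Pre-softmax class scores for an image x of shape (HEIGHT, WIDTH)."""
    hidden = np.maximum(0.0, W1 @ x.ravel() + b1)
    return W2 @ hidden + b2

def saliency_map(x, class_index):
    """Absolute value of d score[class_index] / d x, same shape as x."""
    pre_activation = W1 @ x.ravel() + b1
    relu_mask = (pre_activation > 0).astype(float)  # derivative of ReLU
    grad_hidden = W2[class_index] * relu_mask        # backprop through the output layer
    grad_x = W1.T @ grad_hidden                      # backprop through the hidden layer
    return np.abs(grad_x).reshape(x.shape)

x = rng.normal(size=(HEIGHT, WIDTH))
saliency = saliency_map(x, DOG)
print(saliency.shape)
```

With a real network, a framework's automatic differentiation would replace the hand-written backward pass; a threshold on `saliency` would then give the rough segmentation mentioned above.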
So, here's another contextual story.
Now you've built the animal classifier,
they're still a little scared,
and you want to prove that the model is actually looking
at the input image at the right position.
You don't need to be quick but you have to be very precise.
[NOISE] Yeah?
So, going back from the last slide,
is the saliency map that edge detection, one pixel border?
No, the saliency map is literally this thing here.
Okay.
It's the values of the derivative.
Oh, okay. So, it's like the gradient's at [inaudible]
So, you take the score of dog,
you backpropagate the gradient all the way to the input,
it gives you a matrix that's exactly the same size as x.
And you use a specific color scheme to see which pixels are the strongest.
Perfect, thank you.
Okay. So, here we have our CNN.
The dog is forward propagated and you get a
probability score for the dog.
Now, you want a method that is more
precise than the previous one but not necessarily fast.
And this one, we've talked about it a little bit: it's occlusion sensitivity.
So, the idea here is to put a gray square on the image of the dog here.
And we propagate this image with the gray square at this position through the CNN.
What we get is another probability distribution
that is probably similar to the one we had before,
because the gray square doesn't seem to impact the image too much.
At least from a human perspective,
we still see a dog, right?
So, the score of dog might be high, 83 percent probably.
What we can do is build
a probability map corresponding to the class dog, and we
will write down on this map how confident
the network is when the gray square is at a specific location.
So, for our first location,
it seems that the network is very confident,
so let's put a red square here.
Now, I'm going to move the gray square a little bit.
I'm shifting it just as we do for a convolution, and I'm going
to send this new image through the network again.
It's going to give me
a new probability distribution output, and the score of dog might change.
So, looking at the score of dog,
I'm going to say, okay,
the network is still very confident that there is a dog here, and I continue.
I shift it again, here same,
the network's still very confident that there is a dog.
Now, I shift the square
vertically down, and I see
that the face of the dog is partially occluded.
The probability of dog will probably go down,
because the network cannot see one eye of the dog.
It's not confident that there's a dog anymore.
So, probably, the confidence of the network went down.
I'm going to put a square that is tending to be blue, and I continue.
I shift it again and here we don't see the dog's face anymore.
So, probably the network
might classify this as a chair, right?
Because the chair is more obvious than the dog now.
And so, the probability score of dog might go down.
So, I'm going to put a blue square here and we're going to continue.
Here, we don't see the tail of the dog,
it's still fine, the network is pretty confident, and so on.
And what I will look at now is
this probability map, which tells me roughly where the dog is.
So, here we used a pretty big gray square compared to the size of the image.
The smaller the gray square,
the more precise this probability map is going to be.
Does that make sense? So,
if you have time, if you
can take your time with the pet shop to explain to them
what's happening, you would do that. Yeah?
Would you ever, in an occlusion type of situation, have an increase in the probability and not just a decrease? Say, you removed the noise from the picture?
We will see that in the next slide. That's correct.
So let's see more examples.
Here, we have three classes, and
these images have been generated by Matthew Zeiler and Rob Fergus.
This paper, Visualizing and Understanding Convolutional Networks,
is one of the seminal papers that has
led the research in visualizing and interpreting neural networks.
So, I'd advise you to take a look at it,
and we will refer to it a lot in this lecture.
So, now we have three examples.
One is a Pomeranian,
which is this type of cute dog, a car wheel,
which is the true class of the second image,
and an Afghan Hound,
which is this type of dog here on the last image.
So, if you do the same thing as we did before, that's what you would see.
So, just to clarify,
here we see a blue color.
It means when the gray square was positioned here or centered at this location,
the network was less confident that the true class was Pomeranian.
And in fact, if you look at the paper, they explain that when the gray square was here,
the confidence of Pomeranian went down
because the confidence of tennis ball went up.
And in fact, the Pomeranian dog has a tennis ball in its mouth.
And another interesting thing to notice is on the last picture here.
You see that there is
a red color on the top left of the image.
And this is exactly what you mentioned, Adam:
when the square was on the face of the human,
the network was much more confident that the true class was the dog.
Because you removed a lot of
meaningful information for the network, which was the face of the human.
And similarly, if you put the square on the dog,
the class that the network was outputting was probably human. Does that make sense?
Okay. So, this is called occlusion sensitivity,
and it's the second method that you have now seen for
interpreting where the network looks on an input.
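The sliding gray square can be sketched as follows. The classifier `toy_predict_proba` is a hypothetical stand-in, not a real CNN: it simply becomes more confident in "dog" when the centre of the image is bright, so that the resulting map is easy to check by eye.

```python
import numpy as np

def occlusion_map(image, predict_proba, class_index, square=4, stride=4, gray=0.5):
    """Slide a gray square over the image and record the class probability each time."""
    height, width = image.shape[:2]
    rows = (height - square) // stride + 1
    cols = (width - square) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + square, left:left + square] = gray
            heatmap[i, j] = predict_proba(occluded)[class_index]
    return heatmap

def toy_predict_proba(image):
    """Hypothetical two-class model: returns [p(other), p(dog)]."""
    dog_score = image[4:8, 4:8].mean() * 4.0
    logits = np.array([0.0, dog_score])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

image = np.zeros((12, 12))
image[4:8, 4:8] = 1.0  # the "dog" sits in the centre of the image
heatmap = occlusion_map(image, toy_predict_proba, class_index=1)
print(np.round(heatmap, 3))
```

The lowest value in `heatmap` is the position where the square covers the "dog". A smaller `square` and `stride` give a finer map at the cost of more forward passes, which is why the method is precise but not fast.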
So, let's move to class activation maps.
So, I don't know if you remember,
but two weeks ago, when Pranav
discussed the techniques that he has used in healthcare,
he explained that he takes a chest X-ray
and manages to
tell the doctor where the network is looking when predicting
a certain disease based on this chest X-ray, right? You remember that?
So, this was done through class activation maps,
and that's what we're going to see now.
So, one important thing to notice is that we discussed that
classification networks seem to have a very good localization ability,
and we can see it with the two methods that we previously discussed.
Same thing for those of you who have read the YOLO paper
that you've studied in this set of modules.
The YOLOv2 algorithm was first trained on classification,
because classification has a lot of data,
a lot more than object detection.
It was trained on classification,
built a very good localization ability, and then was fine-tuned
and retrained on object detection datasets.
Okay. And so the core idea of class activation maps is to show that
CNNs have a very good localization ability
even if they were trained only on image-level labels.
So, we have this network.
This is a very classic network used for classification.
We give it an image of a kid and a dog.
This class activation map work is coming from MIT,
the MIT lab with Bolei Zhou et al. in 2016.
We forward propagate this image of a kid with
a dog through the network, which has some CONV,
ReLU, MAX POOL, a classic series of layers, several of them.
And at the end, you usually flatten the last output volume of the CONV,
and run it through several fully connected layers,
which are going to play the role of the classifier,
and send it to a softmax,
and get the probability output.
Now, what we're going to do is prove that
this CNN generalizes to localization.
So, we're going to convert this network into another network.
And the part which is going to change is only the last part.
The downside of using flatten plus fully connected is
that you lose all spatial information, right?
You have a volume that has spatial information,
although it's been going through some max pooling,
so it's been downsampled and you lost part of the spatial localization.
Flattening kills it: you flatten it, you
run it through a fully connected layer, and then it's over.
It's super hard to find out what
an activation corresponds to in the input space.
So, instead of using flatten plus fully connected,
we're going to use global average pooling
(we're going to explain what it is),
then a fully connected softmax layer, and get the probability output.
And we're going to show that now this network can
be trained very quickly, because we just need to train one layer,
the fully connected one here,
and can show where the network looks,
the same as the previous network.
So, let's talk about it more in detail.
Assume this was the last CONV layer of our network,
and it outputs a volume
that, to simplify, is of size four by four by six.
So, six filters were used in the last CONV.
And so we have six feature maps now. Does that make sense?
I'm going to convert this, using
global average pooling, to just a vector of six values.
What is global average pooling?
It's just taking these feature maps
and averaging each of them into one number.
So, now instead of having a four by four by six volume,
I have a one by one by six volume,
but we can call it a vector. Does that make sense?
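Global average pooling on the four by four by six volume can be sketched in one line of NumPy. The volume is filled with random values here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((4, 4, 6))          # output volume of the last CONV layer
activations = feature_maps.mean(axis=(0, 1))  # one average per feature map

print(feature_maps.shape, "->", activations.shape)
```

Each of the six entries of `activations` is the sum of the 16 values of one feature map divided by 16.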
So, what's interesting is that this number
actually holds the information of the whole feature map that
came before, averaged into one number.
I'm going to put these in a vector,
and I'm going to call them activations,
as usual: a_1, a_2,
a_3, a_4, a_5, a_6.
As I said, I'm going to train a fully connected layer here with the softmax activation,
and the outputs are going to be the probabilities.
So, what is interesting about that?
It's that the feature maps here, as you know, will contain some visual patterns.
So, if I look at the first feature map,
I can plot it here,
so these are the values.
And of course, this one is much more granular than four by four.
It's not four by four, it's many more numbers.
But you can say that this is the feature map,
and it seems that the activations have found something here.
There was a visual pattern in the input that activated the feature map,
and the filter which generated this feature map, here in this location.
Same for the second one: there are probably two objects or
two patterns that activated the filter that generated this feature map, and so on.
So we have six of those.
And after I've trained my fully connected layer here,
I look at the score of dog.
Score of dog is 91 percent.
What I can do is ask, for this 91 percent,
how much of it came from each of these feature maps?
And how can I know it?
It's because now I have a direct mapping using the weights.
I know that weight number one here,
this edge, you see it,
is how much the score depended on the orange feature map.
Does that make sense? The second weight,
if you look at the green edge,
is the weight that has multiplied
this feature map to give birth to the output for dog.
So, this weight is telling me how much this feature map, the green one,
has influenced the output. Does that make sense?
So, now what I can do is take
a weighted sum of all these feature maps.
And if I just do this weighted sum,
I will get another feature map.
Something like that. And you notice that
this one seems to be highly influenced by the green one,
the green feature map, yeah.
It means probably the weight here was higher.
It probably means that the second feature map of
the last CONV was the one that was looking at the dog.
Does that make sense?
Okay. And then, once I get this feature map,
this feature map is not the size of the input image, right?
It has the height and width of the output of the last CONV.
So, the only thing I'm going to do is
simply upsample it
so that it fits the size of the input image,
and I'm going to overlay it on the input image to get my class activation map.
The reason it's called a class activation map is that
this feature map depends on the class you're talking about.
Let's say I was using car here.
If I was using car,
the weights would have been different, right?
Look at the edges that connect
the first activation to the activations of the previous layer.
These weights are different. So, if I sum
all of these feature maps, I'm going to get something else.
Does that make sense? So, this is class activation maps.
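The weighted sum and the upsampling can be sketched like this. The feature maps and the weights for "dog" are made-up values, and nearest-neighbour upsampling is used only to keep the example short; it is an assumption, not necessarily what the original work uses.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights, output_size):
    """Weighted sum of feature maps (H, W, K) with weights (K,), then nearest-neighbour upsampling."""
    cam = feature_maps @ class_weights  # shape (H, W)
    height, width = cam.shape
    row_idx = np.arange(output_size[0]) * height // output_size[0]
    col_idx = np.arange(output_size[1]) * width // output_size[1]
    return cam[np.ix_(row_idx, col_idx)]

feature_maps = np.zeros((4, 4, 6))
feature_maps[1, 2, 1] = 1.0  # the second ("green") feature map fires at row 1, column 2
feature_maps[3, 0, 0] = 1.0  # the first ("orange") feature map fires elsewhere
dog_weights = np.array([0.1, 2.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical weights for "dog"

cam = class_activation_map(feature_maps, dog_weights, output_size=(16, 16))
print(cam.shape, np.unravel_index(cam.argmax(), cam.shape))
```

Because the weight on the second feature map is the largest, the hot region of `cam` sits where that feature map fired. Passing the weights of another class would give a different map from the same feature maps.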
And in fact, there is a dog here and there is a human there.
And what you can notice is that,
if I look at the class human, weight number one might be very high,
because it seems that the visual pattern that
activated the first feature map was the face of the kid.
Okay. So, what is super cool is that you can take your network
and just change the last few layers into
global average pooling plus a softmax fully connected layer.
And you can do that and visualize very well.
It requires a small fine-tuning.
Yeah.
So are these like saliency maps, but for the activation?
It's a different vocabulary.
I would use saliency maps for the backpropagation up to
the pixels, and class activation maps for what is related to one class [NOISE].
It's not backpropagation at all,
it's just an upsampling to
the input space based on the feature maps of the last CONV layer.
So it's mostly just examining the weights and sort of doing like a max operation on a,
on them, not so much that different from backpropagation.
Yes.
Good [NOISE].
Any other questions on class activation maps? Yes.
Does taking the average not kill the spatial information?
Yeah. That's a good question. So, taking
the average, does it kill the spatial information?
So, let me write down the formula here.
This is the score that we're interested in,
let's say dog, class C. What you could say is that this score
is a sum over K equals one to six of W K,
which is the weight
that connects the output activation to the previous layer,
times A of the previous layer.
Let's say we
use a notation where K is the Kth feature map
and IJ is the location, and I sum that over the locations.
Can you see in the back? Roughly? So, what I'm saying is that here
I have my global average pooling that happened,
and I divide by a certain number,
so divided by 16, four by four.
Okay. I can switch the two sums,
so I can say that this thing is a sum over IJ,
the locations, of the sum over K equals one to six of
W K times A K, the activation of
the Kth feature map in position IJ, times the normalization, 1 over 16.
Does this make sense?
So, I still have the
location, I just moved
the sum around, and what I could do is to say that this thing is
the score at location IJ of the class activation map,
a class score for this location IJ, and I'm summing it over all locations.
So, just by flipping what the average pooling was doing over the locations,
I can say that by weighting, using my weights,
all the activations at a specific location across all the feature maps,
I can get the score of this position with regard to the final output.
Does that make sense? So, we're not losing the spatial information.
[NOISE] The reason we're not losing it is because
we know what the feature maps are.
Right. We know what they are and we know that they've been averaged exactly,
so we can map it back exactly.
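Written out, the swap of the two sums looks like this. The symbols $S_{\text{dog}}$ for the class score and $M_{ij}$ for the per-location score are names assumed here for readability, and any bias term of the fully connected layer is left out.

```latex
S_{\text{dog}}
= \sum_{k=1}^{6} w_k \left( \frac{1}{16} \sum_{i,j} A^{k}_{ij} \right)
= \sum_{i,j} \underbrace{\frac{1}{16} \sum_{k=1}^{6} w_k \, A^{k}_{ij}}_{M_{ij}}
```

The quantity $M_{ij}$ is the value of the class activation map at location $(i, j)$, and the class score is the sum of the map over all 16 locations.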
Were you giving only one weight to each [inaudible]?
Yeah. Because we assume that each filter that
generated these feature maps detects one specific thing.
So, if this is the feature map, it means, assuming the filter was detecting dog,
that we're going to see
just something here, meaning that there is a dog here, and if there was a dog in
the lower part of the image we would also
have strong activations in these parts. [NOISE]
If you want to see more of the math behind it,
check the paper, but this is the intuition behind it.
You can flip the summations in
the global average pooling and show that you keep the spatial information.
The thing is, you do the global average pooling,
but you don't lose the feature maps, because you know where they
were from the output of the CONV, right?
So, you're not deleting this information.
That makes sense? Yeah.
So, the summation of, uh,
the activation is K divided by 16 is instead of taking the average, right, for that [inaudible].
Yeah.
[NOISE] Okay, let's move on and watch a cool video on how class activation maps work.
This video is from Kyle McDonald.
And it's live, so it's very quick.
So, you can see that the network is looking at this speed boat.
Okay. So now, the three methods we've
seen are methods that roughly map the output back to
the input space and help us visualize which parts of the input were the
most discriminative in leading to this output and the decision of the network.
Now we're going to try to delve in more detail into
the intermediate layers of the network and try to interpret how
the network sees our world, not necessarily related to a specific input, but in general.
Okay. So, the pet shop now trusts your model
because you've used occlusion sensitivity, saliency maps, and
class activation maps to show that the model is looking at the right place,
but they got a little scared when you did that.
And they asked you to explain what the model thinks a dog is.
So, you have this trained convolutional neural network
and you have an output probability.
Yeah, let me take one in the back. Yeah.
Um, what are some good ways to visualize, like, non-image data?
Non-image data, that's a good question.
Actually, the reason we're looking at images is that
most of the research has been focusing on images.
If you look at, let's say, time series data,
so either speech or natural language,
the main way to visualize those is
with the attention method.
Are you familiar with that?
So, in the next set of modules, which you're going to
start this week and study
over the next two weeks, you will see a visualization method called attention models,
which will tell you which part of a sentence was important,
let's say to output a number, assuming you're doing machine translation.
You know, some languages
don't have a direct one-to-one mapping.
It means I might say,
I love cats,
but in another language maybe this same sentence
will be "cats I love" or something like that, it's flipped.
And you want an attention model to
show you that the cat was referring to the second.
I think it's, it's, it's okay.
Okay, sorry guys [NOISE].
[NOISE] So, going back to the presentation.
Now, we're going to delve inside the network.
And so the new thing is that the pet shop
is a little scared and asked you to explain what the network thinks a dog is.
What's the representation of dog for the network?
So, here, we're going to use a method that we've
already seen together called gradient ascent,
which consists of defining an objective
that is technically the score of the dog
minus a regularization term.
What the regularization term is doing
is saying that x should look natural.
It's not necessarily L2 regularization,
it can be something else,
and we will discuss it in the next slide,
so don't think about it right now.
What we will do is
backpropagate this objective function all the way back to the input
and perform gradient ascent to find the image that maximizes the score of the dog.
So, it's an iterative process,
and it takes longer than the class activation map.
We repeat the process: forward propagate x,
compute the objective, backpropagate,
update the pixels, and so on.
You guys are familiar with that?
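The loop (forward propagate, compute the objective, backpropagate, update the pixels) can be sketched on a toy objective. Here the "score of dog" is assumed to be a simple linear function of the pixels with made-up weights `w_dog`, which is far simpler than a real network but keeps the gradient easy to verify; `LAMBDA` and `LEARNING_RATE` are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
w_dog = rng.normal(size=(8, 8))  # hypothetical linear "dog score" weights, one per pixel
LAMBDA = 0.1                     # weight of the L2 regularization term
LEARNING_RATE = 0.5

def objective(x):
    """Score of dog minus an L2 penalty on the pixels."""
    return np.sum(w_dog * x) - LAMBDA * np.sum(x ** 2)

def objective_gradient(x):
    """Gradient of the objective with respect to the pixels."""
    return w_dog - 2.0 * LAMBDA * x

x = np.zeros((8, 8))  # start from a blank image
history = []
for step in range(100):
    history.append(objective(x))                   # forward pass and objective
    x = x + LEARNING_RATE * objective_gradient(x)  # gradient ascent step on the pixels

print(history[0], history[-1])
```

For this toy objective the iterates approach `w_dog / (2 * LAMBDA)`. With a real network the gradient would come from backpropagation through all the layers, and the weights of the network stay fixed while only the pixels are updated.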
So let's see what we can visualize doing that.
So, actually, if you take an ImageNet classification network
and you perform this on the classes goose, ostrich, kit fox, husky,
and dalmatian, you can see what the network is
looking at, or what the network thinks a dalmatian is.
So, for the dalmatian, you can see some black dots on a white background,
somehow, but these are still quite hard to interpret.
It's not super easy to see, and even worse here on the screen;
it's better on your computers.
But you can see a fox here,
you can see an orange color for the fox.
It means that pushing the pixels to an orange color would
actually lead to a higher score for kit fox in the output.
If you use a better regularization than L2,
you might get better pictures.
So, this is for flamingo,
this is for pelican, and this is for hartebeest.
So, one thing that is interesting to see
is that in order to maximize the score of flamingo,
what the network visualized is many flamingos.
It means that ten flamingos lead to
a higher score for the class flamingo than one flamingo, for the network.
Talking about regularization, what does L2 regularization say?
It says that for visualizing,
we don't want to have extreme pixel values.
It doesn't help much to have one pixel with an extreme value,
one pixel with a low value, and so on.
So, we're going to regularize
all the pixels so that all the values are close to each other,
and then we can rescale them between zero and 255 if you want.
One thing to notice is that the gradient ascent process doesn't
constrain the input to be between zero and 255.
You can go to plus infinity potentially,
while an image is stored with numbers between zero and 255,
so you might want to clip that as well.
This is another type of regularization.
One thing that led to beautiful pictures was what Jason Yosinski and his team did:
they forward propagated an image, computed the score,
computed the objective function, backpropagated,
updated the pixels, and blurred the picture.
Because what is not useful for visualizing
is high-frequency variation between pixels.
It doesn't help to visualize
if you have many pixels close to each other that have very different values.
Instead, you want to have a smooth transition among pixels,
and this is another type of regularization called Gaussian blurring.
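The clipping and blurring steps can be sketched as a function applied to the image after each pixel update. The 3 by 3 kernel below is a small binomial approximation of a Gaussian, chosen here for brevity; the kernel size and the order of the two steps are assumptions, not details taken from the lecture.

```python
import numpy as np

# 3x3 binomial kernel, a common small approximation of a Gaussian.
KERNEL = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0

def gaussian_blur(image):
    """Blur a 2D image with the 3x3 kernel, repeating edge pixels at the border."""
    padded = np.pad(image, 1, mode="edge")
    blurred = np.zeros(image.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            blurred += KERNEL[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return blurred

def regularize(image):
    """Smooth high-frequency variation, then clip to the valid pixel range."""
    return np.clip(gaussian_blur(image), 0.0, 255.0)

rng = np.random.default_rng(0)
noisy = rng.normal(loc=128.0, scale=200.0, size=(16, 16))  # pixels far outside [0, 255]
smooth = regularize(noisy)
print(noisy.min(), noisy.max(), smooth.min(), smooth.max())
```

After `regularize`, neighbouring pixels differ much less and every value lies between 0 and 255.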
Okay? So, this method actually makes a lot of sense in scientific terms.
You're maximizing an objective function that
gives you what the network sees as a flamingo,
the input which would maximize the score of flamingo.
So, we also call it class model visualization. Yes?
So, does a more realistic class model
visualization correspond to a more accurate model? [NOISE]
Does a more realistic class model visualization correspond to a more accurate model?
So, it's hard to map the accuracy of the model based on this visualization,
but it's a good way to validate that the network is looking at the right thing.
Yeah. We're going to see more of this later.
I think the most interesting part, actually, on this slide is that
we did it for the class score,
but we could have done it with any activation.
So, let's say I stop in the middle of the network,
and I define my objective function to be this activation.
I'm going to backpropagate and find the input that will maximize this activation.
It will tell me what this activation is,
what this activation fires for.
So, that's even more interesting, I think, than looking at
the input and then the output. Does that make sense?
That we could do it on any activation?
Yep.
[NOISE] Any questions on that? [NOISE]
Okay. So, now, we're going to do another trick, which is dataset search.
It's actually one of the most useful, I think.
Not fast, but very useful.
So, the pet shop loved the previous technique,
and asks if there are other alternatives to
show what an activation in the middle of a network is thinking.
You take an image, forward propagate it through the network, get your output.
Now, what you're going to do is select a feature map,
let's say this one, at this layer,
and the output volume is of size five by five by 256.
It means that the CONV layer here had 256 filters, right?
You are going to look at these feature maps and
select one of them, okay?
We select one out of 256 feature maps,
and we're going to run a lot of data,
forward propagate it through the network,
and look at which data points have had the maximum activation of this feature map.
So, let's say we do it with the first feature map.
We notice that these are the top five images that really fired this feature map,
with high activations on the feature map.
What it tells us is that probably this feature map is
detecting shirts. We could do the same thing:
let's say we take the second feature map,
and we look at which data points have maximized the activations of this feature map,
out of a lot of data.
And we see that this is what we got,
the top five images.
It probably means that this other feature map is activated when seeing edges.
So, the second one is much more likely to
appear earlier in the network, obviously, than later.
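Dataset search can be sketched as a ranking over stored activations. The array `activations` below is random data standing in for one selected feature map evaluated on 100 images; in practice it would come from forward passes through the network.

```python
import numpy as np

def top_k_images(activations, k=5):
    """activations: (num_images, H, W) values of ONE feature map for every image.
    Returns the indices of the k images with the highest peak activation, strongest first."""
    peak = activations.reshape(len(activations), -1).max(axis=1)
    return np.argsort(peak)[::-1][:k]

rng = np.random.default_rng(0)
activations = rng.random((100, 5, 5))  # hypothetical: 100 images, one 5x5 feature map each
activations[42, 2, 3] = 10.0           # image 42 fires this feature map strongly

top5 = top_k_images(activations)
print(top5)
```

Ranking by the peak value is one possible choice; ranking by the mean activation of the feature map would be another.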
So, one thing that you may notice is that
these images seem cropped.
Like, I don't think that this was an image in the dataset,
it's probably a subpart of an image.
What do you think this crop corresponds to?
[NOISE]
Any idea how we cropped
the image, and why these are cropped?
[NOISE] Like, why didn't I show you the full images?
How was I able to show you the crops?
[NOISE].
[inaudible] and so that anything outside is not [inaudible]
That's correct. So, let's say we pick an activation,
an activation in the network.
This activation for a convolutional neural network
oftentimes doesn't see the entire input image.
Right? Doesn't see it.
What it sees is a subregion of the input image.
Does that make sense? So, let's look at another slide.
Here, we have a picture of units,
64 by 64 by 3.
It's our input. We run it through a five-layer ConvNet.
And now, we get an encoding volume that is much smaller
in height and width, but bigger in depth.
If I tell you what this activation is seeing.
If you map it back, you look at the stride and the filter size you've used,
you could say that this is the part that this filter is seeing.
This- this-, uh, this activation is seeing.
It means the pixel that was up there had no influence on this activation,
and it makes sense when you think of it.
The easiest way to think about it is to look at
the top-left entry of the encoding volume.
You have the input image, you put a filter here.
This filter gives you one number, right?
This number, this activation only depends on this part of the image,
but then if you add a convolution after it,
it will take more filters.
And so, the deeper you go,
the more part of the image the activation will see.
So, if you look at an activation in layer 10,
it will see much- a much larger part of the input
than an activation in layer one. That makes sense?
So, that's why- that's why probably the pictures that I showed here,
these ones are very small crops,
small crops of the image,
which means the activation I was talking about here is probably earlier in the network.
It sees a much smaller part of the input.
[inaudible]  [NOISE]
Yeah, yeah. So, what you look at is which activation was the maximum.
You look at this one and then you map this one back to the crop. Does that make sense?
Okay, so here's units again,
up and same, this one would correspond more in the center of the image.
This intuition makes sense?
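The mapping from an activation back to the part of the input it sees can be sketched with a small receptive-field calculation. This is an illustration under stated assumptions: the layers are described only by (filter size, stride) along one axis, there is no padding, and the five 3-by-3, stride-2 layers in the example are hypothetical, not the network on the slide.

```python
def receptive_field(layers):
    """layers: list of (filter_size, stride) from the input up to the chosen layer.

    Returns (size, jump): how many input pixels one activation sees along one
    axis, and how far apart in the input two neighbouring activations are.
    """
    size, jump = 1, 1
    for f, s in layers:
        size += (f - 1) * jump   # each layer widens the window by (f - 1) jumps
        jump *= s                # strides multiply the spacing between activations
    return size, jump

def input_window(index, layers):
    """Input range [start, stop) seen by the activation at `index` (no padding assumed)."""
    size, jump = receptive_field(layers)
    start = index * jump
    return start, start + size

# Hypothetical stack: the deeper the layer, the larger the crop it sees.
for depth in range(1, 6):
    print(depth, receptive_field([(3, 2)] * depth))
```

With this hypothetical stack, an activation in the first layer sees 3 pixels along each axis while one in the fifth layer sees 63, which is the reason small crops suggest an early layer.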
Okay cool. So, let's talk about deconvolution now.
This is gonna be the hardest part of the lecture,
but it will probably help with more intuition on deconvolution. You remember that?
That was the generative adversarial networks scheme.
And we said that giving a code to the generator,
the generator is able to output an image.
So, there is something happening here that we didn't talk about.
Is how can we start with a 100 dimensional vector and
output a 64 by 64 by 3 image? That seems weird.
We could use, you might say,
a fully connected layer with a lot of neurons, right, to up-sample.
In practice, this is one method,
another one is to use a deconvolution network.
So, convolutions will encode the information
in a volume that is smaller in height and width and deeper in depth,
while the deconvolution will do the reverse.
It will up-sample the height and width of an image.
So, that would be useful in this case.
Another case where it would be useful is segmentation.
You remember our case studies, uh,
for live cell segmentation,
microscopic images of cells.
Give it to a convolution network.
It's going to encode it.
So, it's going to lower the height and width.
The interesting thing about this encoding in the middle
is that it holds a lot of meaningful information.
But what we want ultimately,
is to get a segmentation mask,
and the segmentation mask in height and width has to be the same size as the pixel image.
So we need a deconvolution network to up-sample it.
So, deconvolutions are used in these cases.
Today the case we're going to talk about is visualization.
Remember the gradient ascent method we talked about.
We define an objective function by choosing an activation in the middle of the network,
and we want the objective to be equal to this activation to find
the input image that maximizes its activation through an iterative process.
Now, we don't want to use an iterative process.
We want to use a reconstruction of this activation
directly in the input space by one backward pass.
So, let's say I select this feature map out of the max pool,
255, sorry, 5 by 5 by 256.
What I'm going to do is,
I'm going to identify the max activation of this feature map. Here it is.
It's this one, third column second row.
I'm going to set all the others to zero.
Just this one I keep it,
because it seems that this one has detected something.
Don't wanna talk about the others.
I'm going to try to reconstruct in the input space what this activation has fired for.
So, I'm going to compute
the reverse mathematical operation of pooling, ReLU, and convolution.
I will unpool, I will un-ReLU,
let's say, doesn't ex- this word doesn't exist, so don't use it.
But un-ReLU and deconv.
And I will do it several times because this activation went through several of them.
So I will do it again and again until I see, oh,
this specific activation that I selected in
the feature map fired because it saw the ears of the dog.
And as you see, this image is cropped again.
It's not the entire image,
it's just the part that the activation has seen.
And if you look at where the activation is located on the feature map,
it makes sense that this is the part that corresponds to it.
So now, the higher level intuition is this.
We are going to delve into it and see what do we mean by unpool,
what do we mean by Un-reLU,
and what do we mean by de-conv.
Okay. Yes.
So, if we had [inaudible].
Would we have just gotten a reconstruction of the whole image?
So, the difference is, you mean if we don't zero out all the activations?
It shows that this reconstruction would be messier.
It would be more messy. [NOISE] Yeah.
Doesn't, doesn't necessarily mean you will not get the full image,
because probably the other activations probably didn't even fire,
means they didn't detect anything else.
It's just that it's gonna- it's gonna add some noise to this reconstruction.
Okay, so let's talk about deconvolution a little bit on the board.
[NOISE] So, to start with deconvolution,
and you, you guys can take notes if you want.
We are going to spend about 20 minutes on the board now to discuss deconvolution, okay?
[NOISE] To understand the deconvolution,
we first need to understand convolution.
We've seen it, uh, from a
computer science perspective, but actually,
what we are going to do here is we are going to frame
convolution as a simple matrix vector mathematical operation.
You're going to see that it's actually possible.
So let's start with a 1D conv.
For the 1D convolution,
I will take an input x which is of size 12,
x1, x2, x3, x4,
x5, x6, x7, x8.
So, 8 values plus 2 of padding on each side,
which gives me the 12 that I mentioned.
So, the input is a one-dimensional vector which has padding of two on both sides.
I will give it to a layer that will be a 1D conv.
And this layer would have only one filter.
And the filter size will be four.
We will also use a stride equal to two.
[NOISE] So, my first question is,
what's the size of the output?
Can you guys compute it on your- on your notepads and,
and tell me what's the size of the output.
[NOISE]. Input size 12,
[NOISE] filter of size four,
stride of two, padding of two.
Five, yeah I heard you, yeah.
So, remember, use
ny equals nx minus f plus 2p, divided by the stride, plus one, with nx being 8 before padding, and you will get five.
So, what I'm going to get is Y1,
Y2, Y3, Y4, Y5.
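As a quick check of the output size, here is a small sketch of the standard formula, floor((n_x - f + 2p) / s) + 1, where n_x is the length before padding. The function name `conv_output_size` is an assumption made for illustration.

```python
def conv_output_size(n_x, f, p, s):
    """Output length of a convolution along one axis.

    n_x: input length before padding, f: filter size,
    p: padding on each side, s: stride.
    """
    return (n_x - f + 2 * p) // s + 1

# The 1D example from the board: 8 values, padding 2 on each side,
# filter of size 4, stride 2.
n_y = conv_output_size(n_x=8, f=4, p=2, s=2)
print(n_y)
```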
[NOISE] So, I'm going to focus on this specific convolution for now.
And I'm going to show now that we can define it as,
as a mathematical operation between a matrix and a vector.
So, the way to do it is,
I guess the easiest way is to write the system of equation
that is underlying here. What is Y1?
Y1 is the filter applied to the four first values here. This makes sense?
So, if I define my filter as being W1,
W2, W3, and W4,
what I'm gonna get is that Y1 equals W1 times zero plus W2 times
zero plus W3 times x1 plus W4 times x2.
This makes sense? Just the convolution,
elementwise operation, and then sum all of it.
Y2 is going to be same thing,
but with a stride of two, going two down.
So, it's going to give me W1 times x1 plus W2 times
x2 plus W3 times x3 plus W4 times x4.
Correct? Everybody is following?
No. Same thing.
We will do it for all the y's until Y5,
and we know that Y5 is elementwise operation between
the filter and the four last number here, summing them.
So, it will give me W1 times x7 plus W2 times
x8 plus W3 times zero plus W4 times zero.
[NOISE]
Okay. Now what we're going to do is to try to write down
y as the matrix vector operation between w and x.
We need to find what this w matrix is.
And looking at this system of equation,
it seems that it's not impossible. So let's try to do it.
I will write my Y vector here, Y_1,
Y_2, Y_3, Y_4, Y_5.
And I will write my matrix here and my vector x here.
So first question is,
what do you think will be the shape of
this w matrix? Um?
5 by 12.
5 by 12. Correct. We know that this is 5 by 1,
this is 12 by 1,
so of course w is going to be 5 by 12.
Right?
So, now, let's try to fill it in 0,
0, x_1, x_2, x_3, blah,
blah, blah, x8, 0, 0.
Can you guys see in the back or no?
Yeah? Okay. Cool. Ah, so,
I'm going to fill in this matrix regarding this system of equation.
I know that the Y1 would be w_1 times 0,
w_2 times 0, w_3 times x_1, w_4 times x_2.
So this vector is going to multiply the first row here.
So I just have to place my ws here.
w_1 will come here, multiply 0,
w_2 will come here, w_3 would come here,
and w_4 would come here.
And all the rest would be filled in with 0s, right?
I don't want any more multiplications.
How about the second row of this matrix?
I know that Y_2 has to be equal to this dot product with this row.
And I know that it's going to give me w_1x_1 plus w_2x_2 plus w_3x_3 plus w_4x_4.
x_1 is the third input on this vector, third- third entry.
So, I would need to shift what I had in
the previous row with a stride of two, it will give me that.
That makes sense? So if I use the dot product of this row with that,
I should get the second equation up there.
And so on and you understand what happens, right?
This pattern will just shift with the stride of two on the side.
So, I would get zeros here and I will get my w_1,
w_2, w_3, w_4 and then zeros.
And all the way down here,
in the last row,
what I will get is zeros and then w_1,
w_2, w_3, w_4 in the last four columns.
So the only thing I wanna mention here
is that the convolution operation as you see can be
framed as a simple matrix times a vector. Yes.
So why did you have zeros in- on the right side of the top row,
in the left side, that's when multiplying the- [NOISE]
For the top row, why the zeros are on the right side?
Yes.
Because I don't want Y hat- Y_1 to be dependent on x_3 to x_8.
So I want these to be multiplied by zero.
Okay. Oh, because of the stride and the window size.
Okay.
Thank you.
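The matrix built on the board can be sketched in NumPy and checked against the convolution computed window by window. The helper name `conv_matrix` and the numeric values chosen for the filter and the input are assumptions made for illustration.

```python
import numpy as np

def conv_matrix(w, n_x, p, s):
    """Matrix W such that W @ x_padded is the 1D convolution with filter w."""
    f = len(w)
    n_pad = n_x + 2 * p                  # length of the padded input
    n_y = (n_pad - f) // s + 1           # number of outputs
    W = np.zeros((n_y, n_pad))
    for i in range(n_y):
        W[i, i * s : i * s + f] = w      # same weights, shifted by the stride
    return W

w = np.array([1.0, 2.0, 3.0, 4.0])       # hypothetical values for w_1..w_4
x = np.arange(1.0, 9.0)                  # hypothetical values for x_1..x_8
x_pad = np.pad(x, 2)                     # 0, 0, x_1, ..., x_8, 0, 0

W = conv_matrix(w, n_x=8, p=2, s=2)      # shape (5, 12)
y = W @ x_pad

# Same thing computed the usual way: slide the filter with a stride of two.
y_direct = np.array([w @ x_pad[i : i + 4] for i in range(0, 9, 2)])
print(W.shape, np.allclose(y, y_direct))
```

The first row of `W` holds w_1..w_4 followed by zeros, and the last row holds zeros followed by w_1..w_4, matching the system of equations above.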
So why is this important for
the intuition behind the deconvolution and the existence of the deconvolution?
It's because if we manage to write down y equals Wx,
we can probably write down x equals W inverse times y,
if W is an invertible matrix, and this is going
to be our deconvolution.
And in fact, what's the,
what's the shape of this new matrix?
12 by 5. Um?
12 by 5.
Yes. 12 by 5.
We have 12 by 1 on one side,
5 by 1 on the other, it has to be 12 by 5. So it's flipped compared to
w. So one thing we're going to do here is we're going to make an assumption.
First assumption is that w is an invertible matrix.
And on top of that, we're going to make a stronger assumption which
is that w is an orthogonal matrix.
And without going into the details here,
same as when we proved Xavier initialization in sections,
we made some assumptions that are not always true.
This assumption is not going to be always true.
One, one intuition that you can have is,
if I'm using a filter that is,
ah, assume the filter is an edge detector.
So like, ah, plus one,
zero, zero, minus one.
In this case, the matrix would be orthogonal.
Why? A matrix that is orthogonal means that if I take two of the columns here,
I dot-product them together, it should give me zero.
Same with the rows, you can see it.
So, what's interesting is that, ah,
if the stride was four,
there will be no overlap between these two rows.
It would give me an orthogonal matrix.
Here the stride is two, but if I replace w_1 through w_4 by
plus one, zero, zero, minus one,
so one row is plus one, zero, zero,
minus one and the next row, shifted by two, is also
plus one, zero, zero, minus one,
you can see that the dot product would be zero.
The zeros will multiply the ones and the ones
will multiply the zeros, it gives me a zero dot product.
So, this is a case where it works.
In practice, it doesn't always work.
The reason we're making this assumption is because we wanna make a reconstruction, right?
So, we wanna be able to have this W inverse,
and the reconstruction is not going to be exact.
But at, at first-order approximation,
we can assume that the reconstruction will still be useful to us,
even if this assumption is not always true.
In the case where w is orthogonal,
I know that the inverse of w is w transpose.
Or another way to write it,
is that for orthogonal matrices,
w transpose times w is the identity matrix.
So, what it tells me is that x is going to be W transpose times y.
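The orthogonality argument can be checked numerically. The sketch below builds the 5 by 12 matrix for the edge filter (plus one, zero, zero, minus one) and for an arbitrary filter, and looks at the dot products between rows. The helper name and the second filter's values are assumptions made for illustration. Since W is not square, this only tests the rows against each other, and the rows of the edge-filter matrix have squared norm 2 rather than 1, so W transpose acts as an inverse only up to a scale factor.

```python
import numpy as np

def conv_matrix(w, n_pad=12, stride=2):
    """1D convolution written as a matrix acting on the padded input."""
    w = np.asarray(w, dtype=float)
    f = len(w)
    n_y = (n_pad - f) // stride + 1
    W = np.zeros((n_y, n_pad))
    for i in range(n_y):
        W[i, i * stride : i * stride + f] = w
    return W

W_edge = conv_matrix([1, 0, 0, -1])      # edge detector from the lecture
W_generic = conv_matrix([1, 2, 3, 4])    # hypothetical filter with no zeros

gram_edge = W_edge @ W_edge.T            # dot products between all pairs of rows
gram_generic = W_generic @ W_generic.T

print(gram_edge)        # diagonal: different rows do not overlap
print(gram_generic)     # off-diagonal terms: neighbouring rows overlap
```

For the generic filter the off-diagonal entries are nonzero, which is the case where the assumption does not hold and the reconstruction is only approximate.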
So, let's see what we get from that.
Let me write down the Menti code.
So, let's say now we have our x and we wanna regenerate our,
or we will have our y and we want to generate our x using this method.
So, I would, what I would write is to understand the 1D deconv.
We can use the following illustrations,
where we have x here,
which is zero, zero, x_1,
x_2, x_3, all the way down to x_8.
Okay? And I will have my w matrix here,
w transpose and my Y vector,
Y_1, Y_2, Y_3, Y_4, and Y_5 here.
And so, I know that this matrix will be the transpose of the one I have here, right?
So, I can just write down the transpose.
The transpose will be w_1, w_2, w_3, w_4.
Okay? I will shift it down with a stride
of two and so on.
[NOISE]
And this whole thing will be W Transpose.
So, the small issue here is that this, in
practice, is going to be very similar to a convolution,
but it's going to be a tiny bit different in terms of implementation.
Another question I might ask is,
how can we do the same thing with the same pattern as we have here?
It means the stride is going from left to right,
instead of going from up to down.
I'm going to introduce that with a technique called sub-pixel convolution.
And for those of you who read papers on segmentation and visualization,
oftentimes this is a type of convolution that is used for reconstruction.
So, let's see how it works.
I just wanna do the same operation,
but instead of doing it with a stride going from up to down,
I want to do it with a stride going from left to right.
O- one, one thing you wanna,
you wanna notice here,
is that, uh, the two lines that I wrote here are cropped.
And the reason is because we're using a padded input.
Here, we will just crop the two top lines.
And same for the two last lines.
They will be cropped. Look at that.
W1 will multiply Y1,
and this one will multiply Y2 and so on.
So, this dot product will give me W1 times Y1,
but I don't want that to happen because I wanna get the padded zero here.
So, I will just crop that.
And this matrix is actually going to be smaller than it seems,
and is going to generate my X1 through X8 and then I will
pad the top values and the bottom values.
Okay, just the height.
So, let's look at the sub-pixel convolution. I have my input.
And I will do something quite fun.
I will perform a sub-pixel operation on Y. What does it mean?
I will insert zeros almost everywhere.
I will insert them, and I will get 0,
0, Y1, 0, Y2,
0, Y3, 0, Y4,
0, Y5 and 0, 0.
Even more, one more 0 here, one more 0 here.
So, this vector is just the vector Y with
some zeros inserted around it and also in the middle between the elements of Y.
Now, why is that interesting?
It's interesting because I can now write down my convolution by flipping my weight.
[NOISE]
So, let me explain a little bit what happened here.
What we wanted is,
in order to be able to efficiently compute the deconvolution
the same way as we've learned to compute the convolution.
We wanted to have the weights
scattered from left to right with a stride moving from left to right.
What we did, is that we used a sub-pixel version of Y by inserting zeros in the middle,
and we divided the stride by two.
So, instead of having a stride of two as we had in our convolution,
we have a stride of one in our deconvolution.
So, notice that I shift my weights by one at every step,
when I move from one row to another.
Second thing is, I flipped my weights.
I flipped my weights. So, instead of having W1, W2,
W3, W4, now I have W4, W3, W2, W1.
And what you could see is looking at that,
first, look at this row,
the first row that is not cropped.
The result of the dot product of this row with this vector is going to be Y1 times W3,
plus Y2 times W1.
Yeah? Now, let's look what happened here.
I look at my first row here,
the dot product of this first row with my Y here is going to be- sorry,
sorry, we- these two are cropped as well.
And same here. So, looking at my first non-cropped row
here as a dot product with this vector what I get is W3 times Y1,
plus W2- sorry, plus W1 times Y2.
So, exactly the same thing as I got there.
So, these two operations are exactly the same operations. They're the same thing.
You get the same result with two different ways of doing it.
One, is using a weird operation with strides going from top to bottom.
And the second one is exactly a convolution. This is a convolution.
Convolution plus flipped weights,
insertion of zeros for the sub-pixel version of Y.
And on top of that,
padding here and there.
So, this was the hardest part.
Okay? Does it give you more intuition on the convolution here?
You know now how convolution can be framed as
a mathematical operation between a matrix and a vector.
And you know also that under these assumptions,
the way we will deconvolve is just by flipping our weights,
dividing the stride by two, and inserting zeros.
If we just do that, we're deconvolving.
If forward propagating is the convolution,
then when you wanna deconvolve,
just flip all the weights,
insert zeros for the sub-pixel version, and finally divide the stride.
And that's the de-convolution.
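The equivalence between the two ways of computing the 1D deconvolution can be verified numerically. This is a sketch under the lecture's setup (filter of size 4, stride 2, padding 2, five outputs); the numeric values of `w` and `y` are hypothetical.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0, 4.0])            # hypothetical w_1..w_4
y = np.array([1.0, -2.0, 3.0, 0.5, 2.0])      # hypothetical y_1..y_5
f, s, p, n_pad = 4, 2, 2, 12

# Way 1: multiply y by the transpose of the convolution matrix.
W = np.zeros((len(y), n_pad))
for i in range(len(y)):
    W[i, i * s : i * s + f] = w
x_transpose = W.T @ y                         # 12 values, padding rows included

# Way 2: sub-pixel version of y (zeros between the elements and around them),
# then an ordinary convolution with flipped weights and a stride of one.
z = np.zeros(s * (len(y) - 1) + 1)
z[::s] = y                                    # y_1, 0, y_2, 0, ..., y_5
z = np.pad(z, f - 1)                          # three zeros on each side
w_flipped = w[::-1]                           # w_4, w_3, w_2, w_1
x_subpixel = np.array([w_flipped @ z[n : n + f] for n in range(len(z) - f + 1)])

assert np.allclose(x_transpose, x_subpixel)   # the two operations agree

x_rec = x_transpose[p:-p]                     # crop the rows that correspond to padding
print(x_rec)
```

The first non-cropped entry equals w_3 times y_1 plus w_1 times y_2, the same expression obtained on the board.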
So, super complex thing to understand but this is the intuition behind it.
Now, let, let's try to have an intuition of how it would work in two-dimension.
Uh, let me write it down.
The sub-pixel convolution, we already have that [inaudible] [NOISE]
Why do we use that?
Yeah.
Because in terms of implementation this is the same as what we've been using here.
It's, it's very similar,
while this one is another implementation.
So, you could do both;
it's the same operation.
But in practice this one is easier to understand because it,
it's exactly the same operation of the convolution,
with flipped weights, insertion of zeros and divided stride.
That's why I wanted to show that. Yeah.
So, uh, what, what happens when,
uh, the assumption [OVERLAPPING].
When the ass- assumption doesn't hold?
Yeah.
So, oftentimes the assumption doesn't hold,
but what we want is to be able to see a reconstruction.
And if we use this method we will still see a reconstruction.
In practice, if we really had W inverse,
the reconstruction would be much better. But we don't.
So, uh, let me go over the 2D,
uh, the 2D example.
We are going to go a little over time because we have
two hours technically for- one hour and 50 minutes,
and uh, and let me go over the 2D example.
And then we will answer this question on why we need to make this assumption.
So, here is the interpretation of the 2D deconvolution.
Let me write it down here.
[NOISE]
The intuition behind the 2D deconv is, I get my inputs.
Which is five by five,
and this I call it x. I forward propagate it using a filter of size two-by-two,
in a conv layer,
and a stride of two.
This is my convolution. What I get.
So, if you do five minus two,
plus the padding which is zero,
divided by two, plus one, oh,
I forgot the plus one actually here,
plus one and you floor it.
So, five minus two divided by two gives you
three divided by two, which floors to one,
plus one. So it will give you two by two,
yeah, two by two.
A y of two by two. That's what you get.
And now, this you call it y.
What you're going to do here,
is you're going to deconvolve y.
In order to deconvolve y,
in order to deconvolve it,
you're going to use a stride of one.
And what we said is that we need to divide this stride by two, right?
So, we need a stride of one,
and the filter will be the same, two-by-two.
And you remember that what we've seen,
is that the filter is the same.
It's just that it's going to be flipped.
So, you will use a filter of two-by-two, but flipped.
And now, what do we get?
We hope to get a five-by-five input,
which is going to be our reconstructed x, five-by-five input.
And the way we're going to do it,
is this is the intuition behind it. Yeah.
Is it two by two? [NOISE]
Five minus two divided by two. Yeah, it's two by two.
Okay. Two by two.
Thanks. [OVERLAPPING] Two by two.
Five-by-five here.
That's what we hope to reconstruct.
The way we will do it, is we will take the filter,
which is two by two.
We will put it here.
And we will multiply all the weights of this filter by y11.
All the weights will be multiplied by y11.
So, I will get four values here,
which are going to be w4 y11,
w3 y11 and so on.
Now, I will shift this with a stride of one.
And I will put my filter again here.
And I will multiply all the entries by y12 and so on.
And you see that this entry has an overlap.
So, it will, it will be updated at every step of the convolution.
It's not like what happened in the forward pass.
So, this is the intuition behind the 2D deconvolution.
3D, same thing. You have,
uh, a volume here.
So, your filter is going to be a volume.
What you're going to do is you're going to put the volume here,
multiply by y11 and so on.
And then if you have a second filter,
you would put it again on top of it and multiply
by y11 all the weights of the filter and so on.
It's a little complicated,
but this is the intuition behind deconvolution.
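The picture of placing the filter on the output and multiplying all its weights by one entry of y can be sketched as a scatter-and-add loop. This sketch assumes the goal is to reproduce exactly the transpose of the forward convolution matrix for a five-by-five input, a two-by-two filter, and a forward stride of two. Under that assumption, the placements in the reconstruction move by the forward stride; the stride of one belongs to the equivalent form that convolves the zero-inserted y. Function names and numeric values are hypothetical.

```python
import numpy as np

def conv2d_matrix(w, n, s):
    """Matrix W such that W @ x.ravel() is the stride-s, no-padding conv of an n x n input."""
    f = w.shape[0]
    n_y = (n - f) // s + 1
    W = np.zeros((n_y * n_y, n * n))
    for i in range(n_y):
        for j in range(n_y):
            patch = np.zeros((n, n))
            patch[i * s : i * s + f, j * s : j * s + f] = w
            W[i * n_y + j] = patch.ravel()
    return W

def deconv2d_scatter(y, w, n, s):
    """Place the filter at each location, scaled by one entry of y, and accumulate."""
    f = w.shape[0]
    x_rec = np.zeros((n, n))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            x_rec[i * s : i * s + f, j * s : j * s + f] += y[i, j] * w
    return x_rec

w = np.array([[1.0, 2.0], [3.0, 4.0]])        # hypothetical 2 x 2 filter
y = np.array([[1.0, -1.0], [2.0, 0.5]])       # hypothetical 2 x 2 output of the conv

W = conv2d_matrix(w, n=5, s=2)                # shape (4, 25)
x_rec = deconv2d_scatter(y, w, n=5, s=2)      # 5 x 5 reconstruction
print(np.allclose(x_rec, (W.T @ y.ravel()).reshape(5, 5)))
```

With these sizes the placements do not overlap, and the last row and column of the reconstruction stay at zero because the forward convolution never read them. Overlapping, accumulated entries appear as soon as the filter is larger than the stride.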
Okay, let's get back to the lecture.
I'm going to take one question here if you guys need clarification.
[NOISE] Don't worry if you don't understand deconvolution truly.
The important part is that you get the intuition here and you understand how we do it.
So, let me make a comment.
[NOISE] Why do we need to make this assumption and do we need to make it?
[NOISE] When we want to reconstruct [NOISE] like we're doing here in the visualization,
we need to make this assumption because we don't want
to retrain weights for the deconvolutional network.
What we know is that the activation we selected here on
the feature map is- has gone through the entire pipeline of the ConvNet.
So, to reconstruct, we need to use the weights that we already have in the ConvNet.
We need to pass them to the deconvolution and reconstruct.
If we're doing the segmentation,
like we talked about for the live cell
we don't need to do this assumption.
We're just saying that this is a procedure that is a deconvolution,
and we will train the weights of the deconvolution.
So, there is no need to make this assumption,
it's just we have a technique that is dividing the stride by
two and inserting zeros and then, boom,
we retrain the weights and we get an output
that is an upsampled version of the input that was given to it.
So, there's two use cases.
One where you use the weights and one where you don't.
In this case, we don't want to retrain,
we wanna use the weights. So let's see.
Let's see a more visual version of the upsampling.
So, we do the sub-pixel image.
This is my image, four by four,
I insert zeros and I pad it,
I get a nine by nine image.
I have my filter like that.
And this filter will convolve.
I will- it will convolve over the input,
so I will place it on my input,
and at every step I will perform a convolution operation.
I will get a value here.
The value is blue because as you can see the weights that
affected the output were only the blue weights.
I would use a stride of one, boom.
Now, the weights that affect my input are the green ones and so on.
And I would just convolve as I do usually, and so on.
And now one step down.
I see that the weights that are impacting my input are the purple ones.
So, I would put a purple square here and so on.
So, I just do the convolution like that.
And so on, so one thing that is interesting here is
that the values that are blue in my six-by-six output
were generated only using the blue values of the filter,
the blue weights in the filter.
The ones that are green were only
used-were only generated using the green values of my filter.
So, actually this sub-pixel convolution
or deconvolution could have been done with four convolutions,
with the blue weights, green weights,
purple weights and yellow weights.
And then just place the results so that, interleaved, they form the output.
Just take the output of each of these convs and mix them to give a six-by-six output.
The only thing you need to know is that we have an input four by four
and we get an output six by six. That's what we wanted.
We wanted to upsample the image.
We can retrain the weights or use the transposed version of them.
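The claim that the sub-pixel convolution could have been done with four smaller convolutions can be checked numerically. The sketch below assumes a four-by-four filter, which is the size that turns the nine-by-nine zero-inserted image into a six-by-six output; the filter size, the random values, and the helper name are assumptions made for illustration. Each group of weights (every other row and column of the filter) plays the role of one colour on the slide.

```python
import numpy as np

def correlate2d_valid(a, k):
    """Slide kernel k over a with a stride of one, no padding."""
    kh, kw = k.shape
    oh, ow = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(a[i : i + kh, j : j + kw] * k)
    return out

rng = np.random.default_rng(0)
y = rng.standard_normal((4, 4))      # the 4 x 4 image to upsample
g = rng.standard_normal((4, 4))      # assumed 4 x 4 filter

# Sub-pixel image: zeros between the pixels (7 x 7), then one ring of padding (9 x 9).
z = np.zeros((7, 7))
z[::2, ::2] = y
z = np.pad(z, 1)
full = correlate2d_valid(z, g)       # 6 x 6 output

# Same output from four 2 x 2 convolutions on the original image, interleaved.
mixed = np.zeros((6, 6))
for pr in (0, 1):
    for pc in (0, 1):
        sub = g[(1 - pr)::2, (1 - pc)::2]             # one group of four weights
        mixed[pr::2, pc::2] = correlate2d_valid(y, sub)

print(np.allclose(full, mixed))
```

The four small convolutions never multiply by the inserted zeros, which is why this form is cheaper than convolving the nine-by-nine image directly.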
So, let's see what happens now.
We understood what, uh,
what deconv was doing.
So, we're able to deconv.
What we need to do is also to unpool and to unReLU.
Fortunately, it's easier than the deconv.
So, we're not gonna do board work anymore.
So, let's see how unpool works.
If I give you this, uh,
input to a max pooling layer,
The output is obviously going to be this one,
42 is the maximum of these four numbers.
Assuming we're using a two-by-two filter with stride of two,
vertically and horizontally, 12 is the maximum of the green numbers,
six is the maximum of the red numbers and seven the- the orange ones.
Now, question. I give you back the outputs and I tell you, give me the input.
Can you give me the input or no?
No.
No, why- why? [NOISE] You only keep the maximum.
So, you- you lost all the other numbers.
I don't know anymore the zero,
one and minus one that were the red numbers here
because they didn't pass through the maximum.
So, max pool is not invertible,
from a mathematical perspective.
What we can do is approximate its inverse.
How can we do that? [NOISE].
Spread it out.
Spread it out. That's a good point, we could spread out the six among the four values.
That would be an approximation.
A better way if we manage to cache some values,
is to cache something we call the switches.
We cache the positions of the maximums,
using a matrix that is very easy to store,
of zeros and ones.
And we pass it to the unpooling.
And now we can approximate the inverse,
because we know where 6 was,
we know where 12 was,
we know where 42 was and 7 was.
But it's still not invertible because we- we lost all the other numbers.
Think about maxpool back propagation.
It's exactly the same thing.
These numbers 0, 1, -1.
They had no impact on the loss function at the end,
because they didn't pass through the forward propagation.
So, actually with the switches you can have the exact
back propagation, we know that the other values are going to be zeros,
because they didn't affect the loss during the forward propagation.
That- that make sense?
Okay. So, this is maxpooling,
unpooling, unmaxpooling.
And we can use it with the switches. We can approximate it.
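Max pooling with switches, and the approximate unpooling that uses them, can be sketched as follows. The input matrix is hypothetical: it was chosen only so that the four maxima are 42, 12, 6 and 7 as on the slide, and its exact layout is an assumption. The function names are also assumptions.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Max pooling with a size x size window and stride equal to size.

    Returns the pooled output and the switches: a matrix of zeros and ones
    marking where each maximum was located in the input.
    """
    h, w = x.shape
    pooled = np.zeros((h // size, w // size))
    switches = np.zeros_like(x)
    for i in range(0, h, size):
        for j in range(0, w, size):
            window = x[i : i + size, j : j + size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // size, j // size] = window[r, c]
            switches[i + r, j + c] = 1
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Approximate inverse: put each maximum back at its position, zeros elsewhere."""
    upsampled = np.kron(pooled, np.ones((size, size)))   # copy each value over its window
    return upsampled * switches                          # keep only the switch positions

x = np.array([[1.0, 42.0, 3.0, 12.0],
              [0.0,  5.0, 2.0,  4.0],
              [6.0,  0.0, 7.0, -3.0],
              [1.0, -1.0, 2.0,  0.0]])

pooled, switches = max_pool_with_switches(x)
x_approx = unpool(pooled, switches)
print(pooled)
print(x_approx)
```

The non-maximum values (for example the 0, 1 and -1 that shared a window with 6) are gone after pooling, so the unpooled matrix can only place the four maxima and fill the rest with zeros.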
Why not just cache the whole original matrix?
Yeah, why don't we just cache the whole matrix there.
We could- could cache the entire thing.
But for back propagation, in terms
of efficiency, we will just use the switches because it's enough.
But not for unpooling though.
Yeah, yeah, for unpooling you're right, we could cache everything.
But then it's cheating, like you- you kept it, it's like, just give it back.
Okay. So now, we know how [NOISE] unpooling works. Let's look at the ReLU.
So, what we need to do, in fact,
is to pass the switches and the filters back
to the unpooling and the deconv in order to reconstruct.
Switches are the matrix of zeros and ones indicating where the maximums were,
and filters are the filters that I will transpose under this assumption on the board.
Okay. And so on and so on,
and I get my reconstruction.
I just need to explain the ReLU now.
I give you this input to ReLU and I forward propagate it. What do we get?
All the negative numbers are going to be set to zero,
and the others are going to be kept.
Now, let's say I'm doing a backpropagation [NOISE] through ReLU.
What do I get if I give you that?
This is the gradients that are coming back,
and I'm asking you what are the gradients after the ReLU during the backpropagation?
[NOISE] How does the ReLU behave in backprop?
[NOISE].
Zeros? [NOISE] Which ones are zero?
Um, the negative.
The negative are zeroes? Do you agree?
The negatives in this yellow matrix are going to be zeros during the backprop.
Are you guys sure? [NOISE] Think always
about what was the influence of the input on
the loss function and you will find out what was the backpropagation.
Look at this number. This number here, -2.
Did this number have,
the fact that it was -2,
did it have any influence on the loss function?
No, it could have been -10,
it could have been -20.
It's not gonna impact the loss function.
So, what do you think should be the number here?
Zero.
Zero. Even if the number that is coming back,
the gradient is 10.
So, what do you think should be the ReLU backward output?
[NOISE]
Same idea as max-pooling.
What we need to do is to remember the switches.
Remember which of these values had an impact on the loss.
We pass the switches,
all these values here that are kind of a y, you know this is a y.
All these ones had no impact on the loss function.
So, when you backpropagate,
their gradient should be set to zero,
doesn't matter to update them.
It's not gonna make the loss go down.
So, these are all zeros and the rest they just pass.
Why do they pass with the same value?
Because the derivative of ReLU for positive numbers is 1.
So, this number 1 here that passed the ReLU during the forward propagation,
it was not modified.
Its gradient is going to be 1.
That makes sense? So this is ReLU backward.
Now, in this reconstruction method,
we're not going to use ReLU backward.
We're going to use something we call ReLU DeconvNet let's say.
The reason we're not, the intuition behind why we're not
using ReLU backward is because what we're interested
in is to know which pixels of the input positively affected the,
the activation that we're talking of.
So, what we're going to do is that we're just going to do a ReLU.
We're just going to apply a ReLU to the signal coming backward.
Another reason is when we reconstruct,
we wanna have the minimum influence from the forward propagation
because we don't really want our reconstruction to depend on the forward propagation.
We would like our reconstruction to be unbiased and
just look at this activation, reconstruct what happened.
So, that's what you're going to use.
Again, this is a hack that has been found through trial and error
and it's not going to be scientifically viable all the time.
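The difference between ReLU backward and the ReLU used in the reconstruction can be sketched in a few lines. The function names and the small matrices are hypothetical. The third function follows the rule described above, a ReLU applied to the signal coming backward that ignores what happened in the forward pass.

```python
import numpy as np

def relu_forward(x):
    """Negative inputs are set to zero, the others are kept."""
    return np.maximum(x, 0)

def relu_backward(grad_out, x):
    """Standard backprop: the gradient passes only where the forward input was positive."""
    return grad_out * (x > 0)

def relu_deconvnet(signal_back):
    """Reconstruction rule: keep only the positive part of the backward signal."""
    return np.maximum(signal_back, 0)

x = np.array([[1.0, -2.0],
              [-1.0, 3.0]])       # hypothetical forward input to the ReLU
g = np.array([[-5.0, 10.0],
              [2.0, 4.0]])        # hypothetical signal coming back

print(relu_backward(g, x))        # the 10 is zeroed: its forward input was -2
print(relu_deconvnet(g))          # the 10 is kept, the -5 is zeroed
```

The backward version needs the forward input (the switches), while the reconstruction version depends only on the signal being passed back.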
Okay. So now, we can do everything and we can reconstruct
and find out what this activation corresponds to.
It took time to understand it,
but it's super fast to do now.
It's just one pass, not iterative.
We could do it with every layer.
So, let's say we do it with the first block of conv, ReLU, maxpool.
I go here. I choose an activation.
I, I, I, I find the maximum activation.
I set all the others to 0.
I unpool, ReLU, deconv and I find out the reconstruction.
This specific activation was looking at edges like that.
So, let's delve into the fun part and see how we can visualize
what's happening inside the network.
So, all the visualizations we're going to see now can be found in
Matthew Zeiler and Rob Fergus'
paper Visualizing and Understanding Convolutional Networks.
I'm going to explain what they correspond to, but
check out their paper if you want to understand them in more detail.
So, what happens here is that
on the top left, you have nine pictures.
These are the cropped pictures from the data set that
maximally activated the first filter of the first layer.
So, we take the first filter of the first layer, we run
the whole data set through it, and we record which pictures activate this filter the most.
These were the main ones. And we did the same thing for
all the filters of the first layer and there are nine times nine of them.
There are a lot of them, I think.
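The search for the top nine pictures of one filter can be sketched as follows. This assumes the feature maps have already been computed and stored per image; the dictionary layout, the image ids, and the function name are hypothetical, and the score used here is simply the largest value in the feature map.

```python
import heapq


def top_k_images(feature_maps, k=9):
    """Return the ids of the k images whose feature map for this filter
    reaches the highest activation.

    feature_maps: dict mapping image id -> 2-D list (one filter's feature map).
    """
    def peak(item):
        _, fmap = item
        return max(max(row) for row in fmap)

    best = heapq.nlargest(k, feature_maps.items(), key=peak)
    return [image_id for image_id, _ in best]


# Toy stand-in for "run the whole data set and record the activations".
toy_maps = {
    "img_a": [[0.1, 0.9], [0.0, 0.2]],
    "img_b": [[0.3, 0.3], [0.3, 0.3]],
    "img_c": [[2.0, 0.0], [0.0, 0.0]],
}
top_two = top_k_images(toy_maps, k=2)
print(top_two)
```

Repeating this for every filter of a layer gives one grid of top images per filter, which is what the figure shows.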
In the bottom here you have the filters,
which are the weights that were plotted.
Just take the filter, plot the weights.
This is important only for the first layer.
When you go deeper into your network,
the filter itself cannot be interpreted.
It's super hard to understand it.
Here, because the weights are directly multiplying the pixels,
the first layer weights can be interpreted.
And in fact,
let's look at the third one,
the third filter here on the first row.
The third filter has weights that are kind of diagonal,
like one of the diagonals.
And in fact, if you look at the data points that maximized this filter's activation,
that is, the feature map corresponding to this filter,
they're all cropped images that contain diagonals.
That's what happens. Now,
the deeper we go, the more fun we have.
So let's go. Results on a validation set of 50,000 images.
What happened here is they took 50,000 images and
forward propagated them through the network.
They recorded which image
maximized the activation of
the feature map corresponding to the first filter of layer two,
then the second filter, and so on for all the filters.
Let's look at one of them.
We can see that, okay,
we have a circle on this one.
It means that
the filter which generated the corresponding feature map
[NOISE] has probably been activated by a wheel or something like that.
So, the image of the wheel was the one that maximized
the activation of this one, and then we used the deconv method to reconstruct it.
Any questions on that? Yeah.
What if the activation function is not ReLU [inaudible]?
Good question, yeah. What if the activation function is not ReLU?
In practice, you would just use its backward pass to reconstruct if it's [inaudible].
You would use
the same type of method and you would try to approximate the reconstruction.
Okay, let's go a little deeper.
So now, same layer two:
forward propagate all the images of the data set,
find the nine images that
lead to the maximum activation of the first filter.
These are plotted on top here.
What you can see is that for this filter,
the first filter on the sixth row,
features are more invariant to small changes.
So, this filter was actually activated by many different types of circles,
spirals, whirls, and so
it's still activated although the circles were of different sizes.
We can go even deeper, up to the third layer.
What's interesting is that the deeper you go,
the more complexity you see.
So, at the beginning we were seeing only edges,
now we see much more complex figures.
You can see a face here,
in this entry.
It means that this filter activated when
it saw a data point that had this face,
then we reconstructed it,
cropped on the face.
The face is kind of red,
it means that the more red it was,
the more activation it led to.
And same top nine for layer three.
So, these are the nine images that actually led to the face.
These are the nine images that maximize
the activation of the feature map corresponding to that filter and so on.
So, here is a very fun video.
[inaudible]  [NOISE].
Can you stand up? [NOISE].
And realization layers,
we can switch back and forth between showing
the actual activations and showing images synthesized to produce high activation.
So, he's giving his own image to the network right now.
By the time we get to the fifth convolutional layer,
the features being computed represent abstract concepts.
So, these are the gradients I said. [OVERLAPPING]
For example, this neuron seems to respond to faces.
We can further investigate this neuron by showing a few different types of information.
First, we can artificially create optimized images
using new regularization techniques that are described in [OVERLAPPING].
Our paper, the one we talked about.
These synthetic images show that this neuron fires in response to a face.
[OVERLAPPING] It also shows the images from the training set that activate this neuron the most
as well as pixels from those images most responsible for
the high activations computed via the deconvolution.
And this is the deconvolutionary substance.
This feature responds to multiple faces in different locations.
And by looking at the deconv,
we can see that it would respond more strongly if we had even darker eyes and rosier lips.
We can also confirm that it cares about the head and shoulders,
but ignores the arms and torso.
We can even see that it fires to some extent for cat faces.
Using back-prop or deconv,
we can see that this unit depends most strongly
on a couple of units in the previous layer conv4,
and about a dozen or so in conv3.
So they're trying to track back which neurons led to [OVERLAPPING].
So, let's look at another neuron on this screen.
So, what is this unit doing?
From the top nine images,
we may conclude that it fires for different types of clothing,
but examining the synthetic images shows that it may be
detecting not clothing per se, but wrinkles.
In the live plot, we can see that it's activated by my shirt and
smoothing out half of my shirt causes that half of the activations to decrease.
Finally, here's another interesting neuron.
This one has learned to look for printed text in a variety of sizes, colors, and fonts.
This is pretty cool because we never asked
the network to look for wrinkles or text or faces.
The only labels we provided were at the very last layer.
So, the only reason the network learned features like texts and faces in
the middle was to support final decisions at that last layer.
For example, the text detector may provide good evidence that a rectangle is in fact
a book seen on edge and detecting many books
next to each other might be a good way of detecting a bookcase,
which was one of the categories we trained the net to recognize.
In this video, we've shown some of the features of
the DeepViz toolbox and a few of the things
we've learned by using it. You can download it.
Yeah, so they have a toolbox,
which is exactly what you see here,
and you could test the toolbox on your model.
It takes time to get it to run,
but if you want to visualize all the neurons, it's very helpful.
Okay. So, uh, let's go quickly.
We'll spend about three minutes on the optional Deep Dream one because it's fun.
And yeah, feel free to jump in and ask questions.
So, Deep Dream
was implemented by Google, and
the blog post is by Alexander Mordvintsev.
The idea here is to generate art using this knowledge of
visualization, and how they do that is quite interesting.
They would take an input,
forward propagate it through the network, and at
a specific layer that we call the dreaming layer,
they take the activations and set the gradient at this layer to be equal to these activations.
Then they back propagate the gradients to the input.
So, earlier what we did is that we defined a new objective function,
that was equal to an activation and we tried to maximize this objective function.
Here, they're doing something even stronger.
They take the activations and they set the gradients to be equal to the activations.
And so the stronger the activation,
the stronger it's going to become later on, and so on and so on.
So, they are trying to see what the network is
activating for and increase this activation even more.
So, forward propagate the image,
set the gradient of the dreaming layer to be equal to its activation,
then back propagate all the way back to the inputs and update the pixels of the image.
Do that several times, and every time the activations will change.
So, you have to set the new activations again to
be the gradients of the dreaming layer and back propagate,
and ultimately, you will see things happening.
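The loop described here can be sketched on a toy model where the dreaming layer is a single linear layer followed by a ReLU. Everything in this sketch is an assumption made for illustration: the weights, the two-pixel input, the step size, and the function names are made up, and a real Deep Dream run would backpropagate through a full convolutional network. Setting the gradient equal to the activation is the same as doing gradient ascent on half the squared norm of the activations, which is what the `layer_energy` helper measures.

```python
def matvec(W, x):
    """Compute W @ x for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matvec_transpose(W, g):
    """Compute W^T @ g, which backpropagates g through the linear layer."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]


def relu(x):
    return [max(v, 0.0) for v in x]


def layer_energy(W, x):
    """Half the squared norm of the dreaming layer's activations."""
    a = relu(matvec(W, x))
    return 0.5 * sum(v * v for v in a)


def dream_step(W, x, step_size=0.1):
    a = relu(matvec(W, x))              # forward propagate to the dreaming layer
    grad_at_layer = a                   # set the gradient equal to the activation
    # Units that are off have a == 0, so the ReLU mask is already applied.
    grad_at_input = matvec_transpose(W, grad_at_layer)
    return [xi + step_size * gi for xi, gi in zip(x, grad_at_input)]


W = [[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]   # hypothetical layer weights
image = [1.0, 0.2]                          # toy two-pixel "image"

energies = [layer_energy(W, image)]
for _ in range(5):
    image = dream_step(W, image)            # activations change, so recompute each time
    energies.append(layer_energy(W, image))
print(energies)
```

Each step makes the already active units more active, which is the "stronger and stronger" effect described above.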
So, it's hard to see here on the screen,
but you would have a pig appearing here.
You'd have like a tree somewhere there, and some animals,
and a lot of animals are going to start appearing in this cloud.
It's interesting because it means,
let's say, you see this cloud here?
If the network thought that this cloud looked a little bit like a dog,
then one of the feature maps, the one
generated by the filter that detects dogs, would activate a little bit.
Because we set the gradient to be equal to the activation,
it's going to increase the appearance of the dog in the image and so on.
And then you will see a dog appearing after a few iterations.
So, it's quite fun and if you zoom in you see that type of thing.
So, you see a pig-snail,
it's kind of a pig with a snail shell.
Camel-bird, dog-fish.
I'd advise you to look at this on
the slides rather than on the screen, but it's quite fun.
And same, if you give that type of image,
you would see that, because the network thought there was a bit of a tower,
you will increase the network's confidence in the fact that there is
a tower by changing the image, and the tower would come out.
And so on, it's quite cool.
Yeah, and if you're dreaming on lower layers,
obviously you will see edges or patterns appearing.
Because then the lower layers seem to detect an edge, and you will
increase their confidence in this edge, so it will create an edge on the image.
This is a fun video I have, Deep Dream on a video.
[MUSIC].
So, everything that the network thinks looks like something it knows gets amplified into that thing.
[MUSIC] And what's funny
is that there are so many animals in the video.
And the reason is [MUSIC].
Gets too trippy, I'm going to stop it.
[LAUGHTER] So, one insight that is fun about it is,
and this is not only for Deep Dream,
it's mostly for gradient ascent.
Let's say we have an output score for a dumbbell,
and we define our objective function to be the dumbbell score,
and we try to find the image that
maximizes the dumbbell score. We'll see something like that.
What's interesting is that the network thinks that
the dumbbell is the hand with the dumbbell.
Not only the dumbbell. And you can see it here, you see the hands.
And the reason is it has never seen a dumbbell alone.
So, probably in ImageNet there is no picture of a dumbbell
alone in a corner and labeled as dumbbell.
But instead, it's usually a human trying to push hard.
Okay. So, just to summarize what we've learned today,
we are now able to answer all the following questions.
What part of the input is responsible for the output? Occlusion sensitivity and
class activation maps seem to be the best way to go.
What is the role of a given neuron, feature, or layer?
Deconvolve and reconstruct, search in a data set for
the top images, and do gradient ascent.
Can we check what the network focuses on?
Occlusion sensitivity, saliency maps, class activation maps.
How does the network see our world?
I would say gradient ascent,
and maybe Deep Dream, which is cool stuff.
And then what are the implications and use cases of these visualizations?
You can use saliency maps to segment,
but it's not very useful given the new methods we have.
But the deconvolution that we've seen together is
widely used for segmentation and reconstruction.
Also in generative adversarial networks, to generate images and art sometimes.
Uh, these visualizations are also helpful
to detect if some of the neurons in your network are dead.
So, let's say you have a network and you use the toolbox and
you see that whatever input image you give,
some feature maps are always dark.
It means that the filter that generated
this feature map by convolving over the inputs probably never detected anything.
So, it's not even being trained.
That's a type of insight you can get.
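A check for feature maps that are always dark can be sketched like this. It assumes the activations are taken after a ReLU (so they are never negative) and have already been collected for a set of images; the nested-list layout and the function name are hypothetical.

```python
def dead_feature_maps(activations_per_image, tolerance=0.0):
    """Return the indices of filters whose feature map never rises above
    `tolerance` on any image.

    activations_per_image: one entry per image, each entry a list with one
    2-D feature map (list of rows) per filter.
    """
    n_filters = len(activations_per_image[0])
    peak = [0.0] * n_filters
    for feature_maps in activations_per_image:
        for f, fmap in enumerate(feature_maps):
            peak[f] = max(peak[f], max(max(row) for row in fmap))
    return [f for f, p in enumerate(peak) if p <= tolerance]


# Two toy images, three filters each; filter 1 never activates.
collected = [
    [[[0.0, 1.0]], [[0.0, 0.0]], [[0.5, 0.0]]],
    [[[2.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]],
]
dead = dead_feature_maps(collected)
print(dead)
```

A filter reported here on a large and varied set of images is a candidate dead neuron; on a small set it may simply not have seen the pattern it detects.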
Okay, thanks guys.
Sorry we went over time.
[NOISE]
