[TRAIN WHISTLE]
Hi.
So if you're here,
this video is really
dependent on the previous one.
So if you just watched
and you took a break,
and talked to your
plants, then welcome back.
And I'm here.
I'm going to continue
the discussion
of convolutional neural
networks building off
of what I did before with
the filtering function
and take the next
step into max pooling.
So I don't want to
keep talking about it
because that's exactly what the
video is going to talk about.
And I'll see you in the
future because there
will be even more of
these after this one, OK?
See you soon.
Now that I've wrapped up
talking about the convolutions,
there's many other
aspects of this diagram
but there's one other
really important operation
that happens in a convolutional
neural network that's
described in this
diagram as sub-sampling
that I want to add to my diagram
and my code demonstration.
And that is-- and I'm not
going to call it sub-sampling.
The common term for this
now is called pooling.
And in particular, the
operation that I want to add
is max pooling.
So there are different
kinds of pooling you can do.
But max pooling is the
standard pooling operation
for a convolutional
neural network.
So max pooling-- I mean,
it's another layer.
It can happen at
any given point.
So we could max pool before
we apply the convolution.
But typically speaking,
the convolutional filters
are applied.
And then after
those are applied,
we get new images out of those.
And those go through
a max pooling layer.
So I think to
describe it, I think
I need to race this
whole diagram so
that I can look at max pooling.
And then we could kind
of come back to this
when we look at the
full architecture.
So let's begin with
our 28 by 28 image.
Then let's assume I have one
filter just to simplify things.
I had one filter
that was 3 by 3.
One thing I didn't discuss--
and it's going to be more
relevant with the max
pooling layer because
I'm going to do something
specific with it-- is there's a
term you'll see called stride.
And stride refers to--
remember, this filter--
I'm not to actually
do this 28 by 28.
But this filter is applied
to each and every pixel.
We take this filter,
apply it to this pixel.
And this gives us a new image.
Pixel, apply filter, take
the result into a new pixel.
Pixel, apply filter,
take the result,
put it into the new pixel.
Here's the thing.
This is 28 by 28.
This is a 3 by 3 filter.
I had to start with
this pixel right here
because the edges don't
have neighbors on all sides.
So ultimately,
this-- sorry, stride
refers to how far I
pass the filter along
as I'm going through the image.
I don't really have
enough spots here.
But I could take the
filter and jump over pixels
as I'm applying it to reduce
the resolution of the image.
In this case, in my code that
I wrote, the stride was one.
I just slid over by one.
And we can actually see
where the stride would go.
This ultimately, right there--
the x++, y++--
that's the stride.
So I could say x+ equals
stride, y+ equal stride.
And set the stride equal to one.
So that's what was
happening here.
But even with the
stride of one, if I'm
skipping the edge pixels,
my new image is 27 by 27.
So one thing that's
really key to how
a convolutional
neural network works
is that the image over
time, as it goes from layer
to layer to layer--
so this is the convolutional
layer with the filters.
And now I'm going to talk
about the pooling layer.
The resolution is reduced.
And this has a
number of benefits.
One is images are high
resolution with millions
of pixels.
So this learning space
of a neural network,
to learn all of the parameters
of every pixel connected
to every filter throughout
multiple layers,
it would just be much
too big to realistically
be computationally
realistic to do.
So this process of reducing
the image down, and down,
and down as the layers
is effective in keeping
things manageable.
But it also has
another benefit, which
is we're trying to boil
the essence of the image
down into something
that will highlight
key features in that image.
And so this is really
what max pooling does.
One thing it does is it really
reduces the resolution, which
I'll show you in a second.
But it also picks and
chooses the pixels
that have the highest
values to emphasize those,
what is really being activated.
So pooling comes with
a matrix as well.
It's not really a filter
but it's a matrix.
And a standard matrix
might be 2 by 2.
And so let's take the case--
and actually, let me
erase all this just
to zero in on pooling.
To describe this, I'm going
to start with an 8 by 8 image.
And I'm going to do
max pooling with a 2
by 2 max pooling
with a stride of 2.
So there are no weights.
This is not a filter.
2 by 2 is just describing,
how many pixels am
I looking at at one given time?
If I'm looking at a
2 by 2 area of pixels
for each iteration of this
algorithm, and then my stride
is 2 the next set of pixels.
I'll look at is here.
The next one is here.
The next one is here.
So for the columns, I
end up looking at 4.
And for the rows, it's the same.
It's 8 by 8, 4.
So actually, the result
after max pooling is 4 by 4.
Now, how does the
algorithm work?
This sounds like
some fancy thing.
This is actually the
simplest thing ever.
Basically, for each one of
these areas of 2 by 2 pixels,
take the largest value,
the brightest color,
and put it in there.
So I'm going to fill in
some arbitrary values here.
So I'm not going to fill
this whole thing out.
But you see-- I don't know
how well you can see this
but I have the
numbers 4, 8, -1, 2.
The highest one is 8.
It goes here.
I have the numbers 3, 3, 1, 9.
The highest one is 9.
The highest one is 1.
The highest one is 10.
And so the max pooling
algorithm takes
these little neighborhoods,
2 by 2 max pooling, skips,
goes from one to the
other with a stride of 2.
I could have just moved
these neighborhoods just
by one or by even
a larger amount.
But this is pretty typical.
This has the benefit of
sub-sampling the image,
reducing it but not just--
we could do average pooling.
So you could do average
pooling, average all of these.
But it turns out that
convolutional neural networks
perform better with max pooling
over average pooling-- maybe
not in all cases, but in
sort of the standard image
misclassification case.
And this is because
what we're looking
for are features in the image
that we want to highlight.
And so by looking at an area of
pixels and seeing which pixels
activated the most and
keeping that one, that's
going to really
emphasize and help
boil the essence of the
image down into something
lower resolution.
I should add, just to be
really accurate here--
and the chat is offering some
different opinions about this--
that while max pooling is the
most common historical example
of pooling in a
convolutional neural network,
there are other researches
showing promising results
from things like
dilated pooling, which
is a new concept to me that I
just looked up and read about.
You can also do a combination
of max pooling and average
pooling.
So there is, I think, some
discussion and research
happening there.
And I'm not here
to tell you what
is the optimal way to architect
your convolutional neural
network.
I just want to talk about
it, and explain the process,
and look at an example
of it, which is
very common like max pooling.
So I'm going to write another
function much like convolution
but call it pooling.
The same thing happens here.
I want to receive an image.
I want to give an x, y.
I want to return
some RGB value that
is the highest RGB values
within that neighborhood.
Now, there's an
interesting question here.
Do I take the RGB values
from the brightest pixel,
whatever they might be?
Or do I just take them
the highest R, the highest
G, and the highest
B independently
and they could be
from different pixels?
I don't know the answer
to that right now.
Let me just go with actually
picking the brightest
R, the brightest G,
and the brightest B
separately, independently.
So I'm going to start with
the brightest R, G, and B.
And I could start with zero.
But just to be really,
really safe, absolutely
in the convolutional process--
the idea of pixels has gone.
I'm really just dealing
with numeric data.
So I really should, if I'm going
to try to find the brightest,
start with negative
infinity because that's
the lowest possible number--
in JavaScript, that is.
Then I want to look
at this 2 by 2 area.
And the same thing that I did
before in the convolution,
I want to look at the given
pixel and its neighbors.
And then I can get the R,
G, and B from that pixel.
And now I just want the maximum.
I want-- if this R
is greater than what
is being stored as
the brightest R,
then that R should be
the brightest R, which I
can do with the max operation.
Bright R is the biggest
between bright R and R.
And the same for G and B.
Oh, and it has to be 1 and 2.
This is actually all
that I need to do.
This is max pooling right here.
But now I just need to
return bright R, bright G,
and bright B.
Next, I'm going to
create yet another image.
I'm going to call it pooled.
And pooled is also
a blank image.
However, if you recall, I'm
going to use a stride of 2
so the resolution of that image
is reduced further by half.
So I'm actually going to take
out the stride from here.
And I'm going to create a
global variable for stride.
But this stride
is only referring
to the pooling
process because then I
can say, create image
dim divided by stride,
dimensions divided by stride.
Just to add some
comments for a moment,
this is convolutional layer.
I mean, I'm stimulating the
idea of a convolutional layer.
I'm not actually-- there's
no neural network here.
There's no machine
learning here.
I'm just going through
these particular algorithms
without matrix
operations, I should add.
Then let's add the
pooling operation.
So, same thing here--
I'm going to go through
all of the pixels.
In this case, I
can start at zero.
But I still need to only
go to dimensions minus 1
because I'm going to
skip every two pixels.
And I don't want to end up here.
So this is plus equal stride.
And this is plus equal stride.
I can do the same exact thing.
I can create a
variable called RGB,
which equals, now, pooling.
I want to pool-- what
were my arguments?
The image and the x, y.
And I should probably call
this max pooling, but whatever.
Oh, no, I'm not pooling the cat.
The cat was filtered
with convolution.
And then the filtered
image is pooled.
So I'm pooling filtered
at this given x, y.
Then I need to figure out, where
am I putting the resulting RGB
values?
And putting them in
the image called pooled
but that image has the
dimensions of half.
So the pooled x is x
divided by the stride.
The pooled y is y
divided by the stride.
And then the index is--
so this is why this
function really
needs the image passed with it.
I should not have used
the global variable.
It was a terrible idea
because I want to reuse it
but I have a different
resolution of image.
I'm going to go back
to making this image.
And then where did I call it?
Here it's image.width.
I need it here-- image.
Anywhere else?
Oh, here-- image.
So now I can say index of pixel
x, pixel y in the pooled image.
Because I want to
say pooled.pixels,
pix plus 0 equals RBG.R.
And I need to add the load
pixels and update pixels.
And now this should be
the max pooling operation.
Go over the filtered
image by the stride.
For every 2 by 2 area, find
the highest RGB values.
And then add those to
the corresponding pixel
in the lower-resolution
pooled image.
Let me make the height
of my canvas times 2
so I can put the pooled
image at the bottom right.
So the filtered image
went off to the right.
And now the pooled image should
go also off to the right.
And let's give this a try.
I don't see the pooled image!
This should be a G. I forgot
to add the alpha in again.
I always forget this.
So I need to give it the alpha.
There we go.
So let's go back
to a known filter
instead of having
random filters.
So that was my edge detection.
And you can see this is just--
I mean, visually what
I'm seeing right now
is kind of like a
lower-resolution version
of what you have above.
But if I were to rewrite this
with, say, average pooling,
I think you would
see it different.
It wouldn't come--
those features,
these edge features that in
a neural network would be
discovering-- here I'm
telling it to look for those--
are highlighted
even more than they
would be with just
average pooling itself.
So now that I've shown you
the code for both applying
a convolution filter to an image
and then a pooling algorithm
to that image with
a variable stride,
I think that I can now go
back and look at the larger
diagram of the full story of
a convolutional neural network
that has these components in it.
And again, our reference
point is this diagram
from the 1998 paper
Gradient-Based Learning Applied
to Document Recognition.
I also want to highlight
for you a blog post that
was really helpful
for me when I was
reading up, and
researching, and trying
to learn about convolutional
neural networks.
It's this blog post right here--
An Intuitive Explanation Of
Convolutional Neural Networks
from 2016.
This diagram is super helpful.
It is exactly what I want
to talk through, basically.
And there are a lot of nice
visual diagrams and animations
of the convolution process,
convolutional filters,
as well as the max
pooling algorithm itself.
Here's my best attempt,
now, at the full story
of the convolutional
neural network.
We start with an image.
The first layer is a
convolutional layer.
And I'm writing 2D because a lot
of times in a machine learning
library, you can
have convolutions
in different dimensions.
And we're working with a
two-dimensional convolution
here.
The convolutional layer
has a number of filters.
The image is sent to every
one of those filters.
And these filters are applied.
I should say that the values
that come out of the filters
aren't just the raw values
from the convolution process.
They're also then passed
through an activation function,
the same kind of
activation function
that's in a standard
layer or a dense layer.
So typically, this would be
rectified linear unit, RELU.
The next step is max pooling.
I'll represent that
with little squares.
So the image that comes
out of the convolution
and the activation function
is then max pooled.
And then the output
there is another image.
So we take this
first image, pass it
through a bunch of filters,
max pool them and a whole bunch
of other images that, if
I'm using a stride of 2,
now have half the resolution
as the original image.
So the question becomes
what to do next.
Well, we could be done and pass
this to what is the last layer.
And if we're doing
that, at some point
the data does have
to be flattened.
So everything I did
in my previous video
about ML5 neural network
with an image that just gets
flattened and passed in, that is
what happens in the last layer.
The last dense layer
takes these images
and has a hidden
layer of neurons.
And each image is flattened
and sent into all of those,
and then sent to
the output layer,
and passed through the
soft max activation
function that I've
described, which gives it
a probability for a
classification if this
were a classification problem.
But what's interesting
is in most cases,
if you look at a lot
of these diagrams--
for example, this diagram on
the blog post I referred to
or this particular
diagram here--
you'll see convolutions,
sub-sampling, convolutions,
sub-sampling.
Let me redraw this to give
myself a little bit of room.
I'm running out of room and I
want to diagram the full story.
I used so much space
here for this image.
So here's the same diagram
but squashed a little bit
to the left because I want to
add another convolutional layer
and another max pooling layer.
So I'm going to add
some more filters here.
But something interesting
is going to happen here.
So let me actually do fewer
filters in this next layer.
And I'm going to really
just only use two.
And there's only
two filters here.
Well, these images that result
from the first convolutional
max pooling process, they need
to be sent to both filters.
So this image goes here.
This image goes here.
So in essence, we have 1,
2, 3, 4 times 2 filters.
And I'm not really
drawing this well
because I have eight in total.
So we get eight new outputs out
of this convolutional layer.
And each one of those
needs to be max pooled.
So now I have eight images.
And remember, let's
say this was 28 by 28.
These are all 14 by 14.
Then after this convolution
process and this max pooling,
these are all now 7 by 7.
So we get these progressively
lower and lower resolution
feature maps of
the original image
with lots of different
filters applied
in lots of different ways.
And then the final result is
essentially everything that I
did in my non-convolutional
neural network with an image,
just that one hidden layer--
it's called a fully connected
or dense layer--
and one output layer.
All of that gets put right here.
But instead of some original
image being flattened and sent
to it, this whole
process has happened.
And we're sending the data from
these 7 by 7 images through
the one dense layer and one--
and I've totally run out
of room here, so I'm just
going to put O here--
output layer.
And this is where we would
finally see, is it a cat
or is it a dog?
We would see probability
values for the particular
classification task.
1, 2, 3, 4, 5, 6, 7--
whoops.
I'm missing one here.
Even though this is a bit
of a mess, let me go back
and refer to and
thank the author
of this blog post for this much
more thoughtful and precise
diagram showing these different
layers, how the images become
lower and lower resolution,
become these final feature
maps, and then get passed
through what here is actually
two fully connected layers.
So there are a
lot of reasons why
you might have different
numbers of convolutional layers,
different numbers of
fully connected layers,
different strides,
different filter sizes.
Oh, by the way, another
word for filter is kernal.
So really, all I wanted to do in
this video is talk through all
of the pieces as
well as show you
some code that actually
runs through and does
those processes to an
image itself, which I think
opens up a lot of interesting
possibilities for you
if you wanted to create a
project around visualizing
the process of a
convolutional neural network
as it's learning.
Now, this would be a
much bigger endeavor
than what I've done
here because you'd
need to create these visuals
out of all of the pieces
as the training
process is happening.
But ultimately, what
I want to do next
is to two slash three things.
And it might take a while
for me to get to them.
But they will be
eventually, hopefully,
in subsequent videos.
One is I want to just create
this exact architecture
with ML5.
So I want to show
you how with ML5 I
can make an ML5 neural network
with a convolutional layer,
maybe two convolutional
layers, and then a dense layer,
and an output layer.
Then I can take that and apply
it to the previous example
where I didn't use
convoluted layers
and just see how that looks.
I would also like to
look at something that we
could call a doodle classifier.
So using the quick-draw
data set that I've
referred to in a number
of different videos,
could I train a classifier to
recognize particular drawings?
And in fact, ML5
has built into it
a pre-trained doodle
classification
model that's pretty robust.
So I might try to train a sort
of smaller version of that,
write all the code
for that with ML5,
but ultimately then show you
how to use the pre-trained model
that's in ML5 as well.
But that uses
convolution layers.
OK.
So thank you so much if you
somehow made it all the way
to the end of this rather
long explanation and kind
of tinkering around with
code demonstration of what
the process of
convolution and pooling
is in a convolutional
neural network.
I hope to see you in a
future coding training video.
I mean, I don't
really see you but I
feel your presence somehow.
And sometimes you write a
nice comment that brings me
a little happiness to my day.
So I will see you in that
virtual way in a future video.
And thanks for watching
and have a great day that's
not convoluted at all!
[TRAIN WHISTLE]
Goodbye!
[MUSIC PLAYING]
