Hello and welcome to
another Beginner's Guide
to Machine Learning
with ml5.js video.
This is a video.
You're watching it.
And I am beginning this journey
to talk about, and think about,
and attempt to
explain and implement
convolutional neural networks.
So this is something that I
referred to in the previous video,
where I took the
pixels of an image
and made those the inputs
to a neural network
to perform classification.
And I did this in even earlier
videos with pretrained models.
And I mentioned that those
pretrained models included
something called a
convolutional layer,
but my example didn't include
a convolutional layer.
So ml5 has a mechanism for
adding convolutional layers
to your ml5 neural network.
But before I look at
that mechanism, what
I want to do in this
video and in the next one
is just explain what
are the elements
of a convolutional
neural network,
how do they work, and then
look at some code examples that
actually implement the features
of that convolutional layer.
I'm not going to
build from scratch
a full convolutional
neural network.
Maybe that's some other video
series that I'll do someday.
We're going to use the fact
that the ml5 library just
makes that possible for you.
In the first part
I will just talk
about from the zoomed out view,
what a convolutional layer is,
then I will look at with
code, this idea of a filter.
In the second part,
I'll come back
and look at this other aspect
of a convolutional layer called
pooling.
I hope you enjoy this
and you find it useful.
And I'll see you--
I'll be back in this outfit
at the end of the video.
Let me start by diagramming
what the neural networks looked
like with ml5 neural network
to date in the videos
that I've made.
So there's been two layers--
a hidden layer and
an output layer--
and then also there's some data
coming into the neural network.
And in this case, in
the previous example,
it was an image,
which was flattened.
So I used the example of 10 by
10 pixels, each with an R, a G,
and a B. So that made
an array of 300 inputs.
All these pixel values,
those are the inputs.
And those go into
the hidden layer.
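
A rough sketch of that flattening step, in plain JavaScript rather than the code from the earlier video. The `image` structure and the name `flattenImage` are made up for illustration, and dividing by 255 is an assumed normalization.

```javascript
// Flatten a grid of RGB pixels into one long array of inputs.
// `image` is a hypothetical 2D array of [r, g, b] triples, not a p5.Image.
function flattenImage(image) {
  const inputs = [];
  for (const row of image) {
    for (const [r, g, b] of row) {
      // Scale each 0-255 channel down to 0-1 (an assumed convention).
      inputs.push(r / 255, g / 255, b / 255);
    }
  }
  return inputs;
}

// A 10 by 10 image filled with mid-gray pixels.
const tenByTen = Array.from({ length: 10 }, () =>
  Array.from({ length: 10 }, () => [128, 128, 128])
);
const inputs = flattenImage(tenByTen);
console.log(inputs.length); // 300
```

Once the pixels are in this flat list, nothing in the data says which values were neighbors in the picture.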
But just for the
sake of argument,
let me simplify this
diagram and I'm just
going to consider an
example with four inputs.
I'm going to
consider that example
as having five hidden nodes--
hidden units.
And then let's say, it's
a classification problem
and there's three
possible categories.
So when I call the
function ml5.neuralNetwork,
it creates this architecture
behind the scenes
and connects every single
input to every hidden unit
and every hidden
unit to each output.
[MUSIC PLAYING]
So this is what the
neural network looks like.
Each one of these
connections has
a weight associated with it.
Each unit receives
the sum of all
of the inputs times the weights,
passes that sum through an activation
function, and the result becomes its
output. Those outputs, times
their own weights, are then
summed into the next layer,
and so on and so forth.
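
The weighted-sum-then-activation step can be sketched in plain JavaScript. This is not what ml5 runs internally. The names `denseLayer` and `randomWeights`, the bias terms, and the choice of a sigmoid activation are all assumptions for illustration.

```javascript
// One fully connected layer: every input feeds every unit.
// weights[u][i] is the weight on the connection from input i to unit u.
function denseLayer(inputs, weights, biases, activation) {
  return weights.map((unitWeights, u) => {
    let sum = biases[u];
    for (let i = 0; i < inputs.length; i++) {
      sum += inputs[i] * unitWeights[i];
    }
    return activation(sum);
  });
}

const sigmoid = (v) => 1 / (1 + Math.exp(-v));

// Random starting weights in the range -1 to 1.
function randomWeights(units, numInputs) {
  return Array.from({ length: units }, () =>
    Array.from({ length: numInputs }, () => Math.random() * 2 - 1)
  );
}

const input = [0.2, 0.9, 0.4, 0.7]; // 4 inputs
const hidden = denseLayer(input, randomWeights(5, 4), new Array(5).fill(0), sigmoid);
const output = denseLayer(hidden, randomWeights(3, 5), new Array(3).fill(0), sigmoid);
console.log(hidden.length, output.length); // 5 3
```

With 4 inputs, 5 hidden units, and 3 outputs, that is 20 weights into the hidden layer and 15 into the output layer.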
So this is what I have
worked with before.
While in the previous
example, I was
able to get this
kind of architecture
to work with image input and get
results that produced something
in the output, this
can be improved upon.
There is information
in this data that's
coming in that is lost
when it is flattened
to just a single flat array.
And the information
that's lost is
the relative spatial
orientation of the pixels.
It's meaningful that these
colors are near other colors.
Something in what we're
seeing in the image
has to do with the spatial
arrangement of the pixels
themselves in two dimensions.
In order to address
that, we want
to add into this architecture--
I really spent a lot of
time drawing this diagram,
which I'm now going
to mostly erase--
we want to add something
called a convolutional layer.
So in this video, I want to
explain what are the elements.
There are units,
nodes, neurons, so
to speak, in a convolutional
layer, but what are they?
And the word that's
typically used
is filter,
which makes a lot of sense.
Now, convolutional
neural networks
can be applied to lots of
scenarios besides images
and there's a lot of
research into different ways
that they can be
used effectively,
but I'm going to stick
with the context of working
with images because the word
"filter" really fits with that.
We're filtering an image.
How is this layer
filtering an image?
So the idea of a convolutional
layer is not a new concept,
and it predates
the era that we're
in now of so-called
deep learning.
And if you want to go back
and look at the origins
of convolutional
neural networks,
you can find them in this paper
called "Gradient-Based Learning
Applied to Document
Recognition" from 1998.
Section two, convolutional
neural networks
for isolated
character recognition.
And here, we can see
this diagram, which
I'm attempting to
kind of talk through
and create my own version of
over here on the whiteboard
itself.
This is also the
original paper associated
with the MNIST dataset--
a dataset of handwritten
digits that's
been used umpteen times
in research papers
over the years related
to machine learning.
I know I'm going back
and forth a lot here,
but let's go back to
thinking of the input
as a two-dimensional
image itself.
So this two-dimensional image--
and let's not say it's 10 by 10.
Let's use what the
MNIST dataset is, which
is a 28 by 28 pixel image.
And of course now, much higher
resolution images are used.
And this is what is coming in to
the first convolutional layer.
This image is being
sent to every single one
of these filters.
A filter is a matrix of numbers.
And let's just, for example,
let's have a 3 by 3 matrix.
Each one of these filters
represents nine numbers--
a matrix that's 3 by 3.
You could have a 5 by 5
filter and so on and so forth,
but a sort of standard size
or a nice example size for us
to start with is 3 by 3.
Each one of these filters
is then applied to the image
through a convolutional process.
This by the way,
is not a concept
exclusive to machine learning.
This idea of a convolutional
filter to an image
has been part of
image processing,
and computer science, and
computer vision algorithms
for a very long time.
To demonstrate this, let
me actually open up--
I can't believe I'm
going to do this,
but I'm going to
open up Photoshop.
So here I am in
Photoshop and I've
opened this image of a kitten.
And there's a menu
option called Filter.
This word is not
filter by accident.
There's a connection.
So all of these types of
operations that you might do--
for example, like
blur an image--
these are filters-- convolutions
applied to the image.
I'm going to go down here
under Other and select Custom.
All of a sudden, you're
going to see here,
I have this matrix of numbers.
This matrix of
numbers in Photoshop
is exactly the same thing
as this matrix of numbers
I'm drawing right here.
Each one of these filters
in the convolutional layer
represents a matrix
of numbers that
will be applied to the image.
So let me actually just
put some numbers in here.
[MUSIC PLAYING]
This particular set
of numbers happens
to be a filter for
finding edges in an image.
And you can think
of it as these are
all weights for a given pixel.
So for any given pixel,
I want to subtract colors
that are to the left
of it and emphasize
colors that are at that
pixel and above and below.
This draws out
areas of the image
where the neighboring pixels
are very, very different.
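
To make the "weights for a given pixel" idea concrete, here is a small sketch on a single 3 by 3 patch of brightness values. The exact numbers typed into Photoshop are not in this transcript, so the filter values below are an assumed example with -1's down the left column. `applyToPatch` is a hypothetical helper name.

```javascript
// Assumed edge filter: subtract the pixel to the left, keep the center column.
const edgeFilter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// Weighted sum of a 3 x 3 patch of brightness values.
// patch[row][col] and kernel[row][col] line up position by position.
function applyToPatch(patch, kernel) {
  let sum = 0;
  for (let row = 0; row < 3; row++) {
    for (let col = 0; col < 3; col++) {
      sum += patch[row][col] * kernel[row][col];
    }
  }
  return sum;
}

// A flat gray area: neighbors are all the same.
const flatPatch = [
  [100, 100, 100],
  [100, 100, 100],
  [100, 100, 100],
];
// A vertical edge: dark on the left, bright in the middle and right.
const edgePatch = [
  [0, 200, 200],
  [0, 200, 200],
  [0, 200, 200],
];

console.log(applyToPatch(flatPatch, edgeFilter)); // 0
console.log(applyToPatch(edgePatch, edgeFilter)); // 600
```

The flat area cancels out to zero, while the patch that straddles an edge produces a large value.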
Interestingly enough, I
could switch these to 0.
[MUSIC PLAYING]
Switching the filter to have
the negative numbers on the top,
you can see now I'm
still detecting edges,
but I'm detecting
horizontal edges.
If you go back and
look at the cat
that I had previously
versus this one,
you can see vertical edges
versus horizontal edges.
So there are known
filters, which draw out
certain features of an image.
And that's exactly what each
one of these filters does.
If all of the nodes
of a neural network
can draw out and highlight
different aspects of an image,
those can be weighted
to indicate and classify
the image in certain ways.
The big difference between
a convolutional layer
in a neural network
and what I'm doing here
by hardcoding in
sort of known filters
is that the neural
network is not
going to have filters
hardcoded into it.
It's going to learn filters that
do a good job of identifying
features in an image.
This relates to the idea
of weights, I think.
So if I go back to
my previous diagram,
where every single
input is connected
to each hidden
neuron with a weight,
now the input image is
connected to every single one
of these filters.
In a way, there are now nine
weights for every single one.
Instead of learning
a single weight,
it's going to learn a set of
weights for an area of pixels
to identify a
feature in the image.
All of these filters will start
with random values, and then
the same gradient
descent process--
the error backpropagating
through the network,
adjusting all the dials,
adjusting all the weights
in these matrices and
all of these filters--
works in the same way.
So in the ml5 series,
I haven't really
gone through and looked at
the gradient descent learning
algorithm to adjust all
the weights in detail.
I do have another
set of videos that
do that if you're interested,
but the same gradient descent
algorithm that is
applied to these weights
is applied to all of
the different values
in each one of these filters.
Incidentally, just to show
a very common convolution
operation to blur an
image, blurring an image
is taking the average of a given
pixel and all of its neighbors.
So here, you can see if I give
the same weight to a 5 by 5
matrix of pixels
around a center pixel,
and then divide that
scale-- let's divide by 25
because there's 25--
that's averaging
all of the colors.
If I click on Preview,
blurred, not blurred,
blurred, not blurred.
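
The averaging blur can be sketched in plain JavaScript on a grayscale grid. This is an illustration, not Photoshop's implementation. The names `boxBlurFilter` and `blurPixel` are made up, and the sketch assumes the whole neighborhood sits inside the image.

```javascript
// Build an n x n filter where every weight is 1 / (n * n),
// which is the same as weighting everything 1 and dividing by n * n.
function boxBlurFilter(n) {
  return Array.from({ length: n }, () => new Array(n).fill(1 / (n * n)));
}

// Weighted sum of the neighborhood centered on (x, y).
// The image is stored as image[y][x], one brightness value per pixel.
function blurPixel(image, x, y, kernel) {
  const half = Math.floor(kernel.length / 2);
  let sum = 0;
  for (let j = -half; j <= half; j++) {
    for (let i = -half; i <= half; i++) {
      sum += image[y + j][x + i] * kernel[j + half][i + half];
    }
  }
  return sum;
}

// A 5 x 5 image that is black except for one bright pixel in the center.
const image = Array.from({ length: 5 }, () => new Array(5).fill(0));
image[2][2] = 250;

// The bright pixel gets spread over 25 pixels: 250 / 25 is about 10.
console.log(blurPixel(image, 2, 2, boxBlurFilter(5)));
```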
Of course, there are other more
sophisticated convolutions,
like a Gaussian blur.
You can take a look
at a Gaussian blur.
There's different
ways to pronounce it.
You can take a look and
research what that is,
but again, I'm not
going down the road
to look at common image
processing convolutions.
Instead, I'm talking about the
concept of a convolution as
applied to an image
in the process
of a convolutional
neural network.
Just to take this a
little bit further,
I'm going to demonstrate
how to code the convolution
algorithm in p5.js.
In truth, ml5 and
TensorFlow.js are
going to handle all of the
convolution operations for us
and creating all the filters.
We're just going to configure
a convolutional layer
from a high level.
But I think it's
interesting to look
at how you might code an image
processing algorithm in p5.
I have some videos that do
things like this previously,
but let's look at
it in this context.
So I took a low resolution
28 by 28 image of a cat.
This comes from the Quick Draw
dataset, which I've made videos
about before and I
will also use to see
if we can create a
doodle classifier as part
of this series.
And all I want to do is apply
a convolution to that image.
So first, I'm going
to create a variable
and I'm going to call it filter.
So this is going
to be our filter.
And I'm going to make it
a two-dimensional array.
So let me just put all
zeros in it to start.
So this is the filter.
And let's go with that
one that looks for edges.
The cat image is actually
quite low resolution,
just 28 by 28 pixels, but I'm
drawing it at twice the size.
I want to write the code to
apply this filter to the image
and draw the filtered
image to the right.
I'm going to create a variable
called dim for dimensions
and just call this 28.
And then I want another variable
to store the filtered image.
And in setup, I can
create that image.
This creates a blank image
of the same dimensions
as the original cat drawing.
Then I can write a loop.
And this loop is going to
look at every single pixel
for all the columns x
and all of the rows y.
And I wrote int there
because I'm half the time
programming in Java.
But one thing that's
important here,
if we're going to take
this 3 by 3 matrix
and apply it to every single
pixel of the original image,
if we're applying it to
that first pixel 0,0,
there's no pixel to the
left and no pixel above it.
It doesn't have all
of its neighbors.
So there's various
ways around this.
I'm just going to ignore
all the edge pixels.
So the loop will go from
1 to dimensions minus 1.
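
A small sketch of that loop in plain JavaScript, to show which pixels it visits. The helper name `interiorPixels` is hypothetical. In the sketch being described, the body of the loop would process each pixel instead of recording it.

```javascript
const dim = 28;

// Collect the coordinates the loop visits, skipping the outer ring of
// pixels so every visited pixel has all eight neighbors.
function interiorPixels(size) {
  const visited = [];
  for (let x = 1; x < size - 1; x++) {
    for (let y = 1; y < size - 1; y++) {
      visited.push([x, y]);
    }
  }
  return visited;
}

const visited = interiorPixels(dim);
console.log(visited.length); // 676, which is 26 * 26
console.log(visited[0]); // [ 1, 1 ]
```

Other options, not used here, include padding the image with extra pixels or repeating the nearest edge pixel.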
Now, there's a lot more work
to be done here just to apply
this filter to any given pixel.
I think a way that
might make sense
to do this is to actually
have a new function.
I would call the
function filter--
let's just call it convolution.
I'm going to write a
function called convolution.
It receives an image, an
x and a y, and a filter,
and it returns a new color.
So the idea of this
function is that it receives
all the things it needs.
It receives the original image,
the filter to apply to it,
which particular pixel
we want to process,
and then it will return
back a new RGB value
after that pixel is processed.
And the reason why I'm doing
that in a separate function
is I need another nested
loop to go over the filter.
So I need to go from 0 to 3--
0, 1, 2 columns in the filter,
0, 1, 2 rows in the filter.
And it would be getting
to be quite a lot
if I had four nested
loops right in here.
Now, I probably
shouldn't have some
of this hardcoded in
here-- the number 3
and that sort of
thing-- but you can
imagine how you might
need to use variables
if the filter size is flexible.
Now, we have a really sort
of like sad fact, which
is true about most cases
where you're doing image
processing with some framework.
And in this case, our framework
is JavaScript, and canvas,
and p5.js.
And the sad fact is that even
though all of this discussion
is built upon the fact
that we are retaining
the spatial orientation
of the pixels,
thinking of it as
a two-dimensional matrix
of numbers,
the actual data is
stored in one array.
And so I've gone over this
in probably countless videos,
but there's a simple formula to
look at if I have a given x,y
position in a
two-dimensional matrix,
how do I find the
one-dimensional lookup
into that matrix, assuming
that the pixels were counted
by rows--
0, 1, 2, 3, 4, 5, 6, 7, blah,
blah, blah, next row, 28, 29,
30, blah, blah blah.
And that formula is let index--
oh, well, I need to do that
before this nested loop
because right now, I just want
the center pixel-- that x,y.
Let index equal x plus
y times img.width.
But there's more, oh!
So this is the formula.
And if you think about
it, it makes sense
because it's all the
x's, and then the
offset along the
y's is how many rows
times the width of the image.
But there's another
problem, which
is that in JavaScript in
canvas, for every single pixel
in this image,
there are actually
four numbers being stored--
an R, a G, a B, and an alpha--
the red, green,
and blue channels
and the alpha channels--
channel, singular.
So each pixel takes
up four spots.
So this index actually
needs to say times 4.
So guess what?
You know it's going to
make a lot of sense.
I'm going to need
this operation a lot.
Let's write a function for it.
I'll just call it index, and it
receives an x, y, and a width,
and it returns--
you know what?
The width is never going
to change in my sketch,
so I don't want to be
so crazy as to have
to pass it around everywhere.
So we're just going to pull
it from a global variable.
Return x plus y times img.width.
And that's not img,
it's cat.width.
OK, so once again, this is
terrible what I'm doing,
but I'm just saving myself
a little bit of heartache
here and there.
So this index-- ooh,
let's call this pixel.
Oh, and this should be times 4.
This pixel is that
function index x,y.
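
The lookup function described here might look something like this. The `cat` object below is a stand-in with just a width and height, since a real p5.Image would come from loading the image file.

```javascript
// Stand-in for the loaded cat image (assumed to be 28 by 28).
const cat = { width: 28, height: 28 };

// Position of the red value for pixel (x, y) in the flat pixel array.
// Each pixel takes four slots (R, G, B, alpha), hence the times 4.
// The width comes from the global `cat` rather than a parameter.
function index(x, y) {
  return (x + y * cat.width) * 4;
}

console.log(index(0, 0)); // 0
console.log(index(1, 0)); // 4
console.log(index(0, 1)); // 112
```

The green, blue, and alpha values for the same pixel sit at that position plus 1, plus 2, and plus 3.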
Now, I have something I
could do to simplify this,
but I might as well write
the code for if this
were a full RGB image.
This is a grayscale image, but
it has all the channels in it.
The thing that I need to do
to perform this convolution
operation is to take
all of the weights--
the numbers that are
in the filter matrix--
and I need to multiply each one
times the pixel value of all
of the neighbors and their
corresponding locations,
add them all up together,
and maybe divide by something
if I wanted to sort of,
like, average it out.
But in this case, I actually
don't want to divide
by anything.
I'm just going to leave the
weights are the weights are
the weights are the weights.
And actually, this right
here is irrelevant.
I need to do this
inside the loop.
You'll see in a second.
I think it's going
to make sense.
So I need sum.
I'm going to make a sum
of all the R values,
a sum of all the green
values, and a sum of all
the blue values.
All right, wait a sec,
wait a sec, wait a sec.
Actually, I think this is
going to make more sense.
Let's go from negative 1 to 2.
You'll see why.
I mean, I'll explain why.
And negative 1 to 2.
Let's do that instead.
And maybe it's more clear to
say less than or equal to 1.
Less than or equal
to 1 because--
and let me draw this
diagram once again--
if this is pixel 0,0, this is
pixel negative 1, negative 1.
This is 1,1.
This is 1,0.
This is 1, negative 1.
I guess I'll do them all.
So you can see that
the neighboring
pixels are offset by negative
1 and 1, and negative 1 and 1.
So the pixel x
value is x plus i.
The pixel y value is y plus j.
And then the pixel index
comes from calling the index function,
which returns the actual
index into that array
for pixel x and pixel y.
And actually, maybe it
makes more sense for me
to just say that I
don't necessarily
need separate variables.
It might actually be
just as clear just
to put this right in here.
So now, I just need to add the
red, green, and blue values
of this particular
pixel to the sum.
So sumR plus equal img.pixels
at that pixel index.
And then G and B.
G is the next one,
and B, blue, is the next one.
And let's add a plus 0
here just to be consistent.
So ultimately, what I'm actually
returning here is r is sumR,
g is sumB, and b is sum--
oh, sorry, g is
sumG and b is sumB.
So this is the process now
of adding up all the pixels.
I've gone through every
single pixel in a 3
by 3 neighboring
area and added up
all the reds, greens, and blues,
and I'm returning those back.
But I'm missing the
crucial component, which
is as I'm adding all the
pixels up in that area,
I need to multiply each one by
the value in the filter itself.
Incidentally, I
should also mention
that the operation that this
really is is the dot product,
and in an actual
machine learning system,
all this would be
done with matrix math,
but I'm doing it sort
of like longhand just
to sort of see the
process and look at it.
What should I call this in
the filter, like the factor?
Now, I need to look
up in the filter, i,j.
Only here's the thing--
because I decided to go
from negative 1 to 1,
negative 1 to 1,
the filter doesn't
have those index values.
It goes 0, 1, 2, 0, 1, 2.
So this has to be
i plus 1, j plus 1.
So it's all six of one,
half dozen of the other,
whether I go from
0 to 2 there and do
the offset in the pixels.
But the point is
the pixel array,
I'm looking actually to
the negative and positive
to the left and right,
but the filter is just a 3
by 3 array starting with
0,0 on the top left.
So now, I should be able
to multiply by factor.
And there we go.
I have the full
convolution operation.
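
Putting those pieces together, the function might look roughly like this. It is a sketch that runs in plain JavaScript: `img` is a stand-in object with a `width` and a flat RGBA `pixels` array in place of a p5.Image, and `index` takes the width as a parameter so the block stands alone.

```javascript
// Position of the red value for pixel (x, y) in a flat RGBA array.
function index(x, y, width) {
  return (x + y * width) * 4;
}

// Multiply each pixel in the 3 x 3 neighborhood around (x, y) by the
// matching filter value and add everything up, per color channel.
function convolution(img, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j, img.width);
      // Offsets run from -1 to 1, filter indices from 0 to 2.
      const factor = filter[i + 1][j + 1];
      sumR += img.pixels[pix + 0] * factor;
      sumG += img.pixels[pix + 1] * factor;
      sumB += img.pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// A 3 x 3 grayscale test image with values 0, 10, 20, ... 80.
const img = { width: 3, height: 3, pixels: [] };
for (let n = 0; n < 9; n++) {
  const gray = n * 10;
  img.pixels.push(gray, gray, gray, 255);
}

const allOnes = [[1, 1, 1], [1, 1, 1], [1, 1, 1]];
const centerOnly = [[0, 0, 0], [0, 1, 0], [0, 0, 0]];
console.log(convolution(img, 1, 1, allOnes).r); // 360
console.log(convolution(img, 1, 1, centerOnly).r); // 40
```

A filter with a single 1 in the center hands back the original pixel, which is a handy sanity check.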
Now, I might have
made a mistake here.
I think this is right.
When I run it, we'll find
out if I made a mistake.
I'm summing up a 3 by 3
neighborhood of pixels,
all multiplied by weights
that are in a 3 by 3 filter.
Oh, but I actually have to
call that function here.
Now, it should be relatively
easy because all of the work
was in there.
So if I say let-- I'm just
going to call this rgb--
equal convolution, the
cat at the given x and y
with the filter, then the new
image, which is called filter--
oh.
I have to look up.
It's OK.
No problem.
The pixel is index
x,y, and then filter--
so I have to look up the
one-dimensional location
in the new image, and then
at .pixels at that pixel is
the rgb--
the red value that
came back plus
0 plus 1 plus 2, green and blue.
And then if all goes
according to plan,
I should be able to
draw the filtered image
at offset to the right
with the same size.
I did miss something
kind of important,
which is that if I am working
with pixels of an image in p5,
I need to call loadPixels.
So cat.loadPixels
filtered.loadPixels.
And then I haven't changed
the pixels of the original cat
image, but since I changed the
pixels of the filtered image,
afterwards I need to
call updatePixels.
And now is the moment of truth.
[DRUM ROLL]
Never good when I press
the snare drum button.
I'm going to run the sketch.
Whoops.
All right, well, I've
already got an error.
[SAD TROMBONE]
Cannot read property loadPixels.
Oh, filter, filter, filtered.
That should be filtered.
Also this isn't
right-- createCanvas.
The size of the canvas is
times 10 times 2 times 10.
Remember, the image
is just 28 by 28.
Let's try this again.
[DRUM ROLL]
[SAD TROMBONE]
Well, a little bit better.
We didn't get any errors.
I don't see an image.
Do I need to give it a
hardcoded transparency of 255?
Yes.
[BELL] Oops.
So it was fully transparent.
So I'm not pulling
the transparency over.
I could pull it
over, but I just know
I don't want it
to be transparent.
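
The whole pass over the image, including the hardcoded alpha of 255, can be sketched like this. To keep it runnable outside p5.js, images are stand-in objects with a `width`, a `height`, and a `pixels` array, and `applyFilter` is a hypothetical name. In a p5.js sketch the same loop would sit between the loadPixels and updatePixels calls. A `Uint8ClampedArray` is used for the output because it clamps values to the 0 to 255 range.

```javascript
function index(x, y, width) {
  return (x + y * width) * 4;
}

function convolution(img, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j, img.width);
      const factor = filter[i + 1][j + 1];
      sumR += img.pixels[pix + 0] * factor;
      sumG += img.pixels[pix + 1] * factor;
      sumB += img.pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// Run the filter over every interior pixel of `src` and write the
// result into a new image of the same size.
function applyFilter(src, filter) {
  const out = {
    width: src.width,
    height: src.height,
    pixels: new Uint8ClampedArray(src.pixels.length), // starts as all zeros
  };
  for (let x = 1; x < src.width - 1; x++) {
    for (let y = 1; y < src.height - 1; y++) {
      const rgb = convolution(src, x, y, filter);
      const pix = index(x, y, src.width);
      out.pixels[pix + 0] = rgb.r;
      out.pixels[pix + 1] = rgb.g;
      out.pixels[pix + 2] = rgb.b;
      out.pixels[pix + 3] = 255; // fully opaque, otherwise nothing shows
    }
  }
  return out;
}

// A 4 x 4 image where every pixel is gray 100.
const src = { width: 4, height: 4, pixels: [] };
for (let n = 0; n < 16; n++) {
  src.pixels.push(100, 100, 100, 255);
}

const centerOnly = [[0, 0, 0], [0, 1, 0], [0, 0, 0]];
const result = applyFilter(src, centerOnly);
console.log(result.pixels[index(1, 1, 4)]); // 100
console.log(result.pixels[index(1, 1, 4) + 3]); // 255
console.log(result.pixels[3]); // 0, the skipped corner pixel stays transparent
```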
Look at that.
Look at how it found the--
oh, oh, oh, oh.
Look at this.
That doesn't look like it's
finding the vertical edges--
pixels that are
different to the left.
It looks like it's
finding horizontal edges.
Even though I've typed
this out in a way
that visually, these negative
1's appear in a column,
those actually
correspond not to the j index,
but to the i index.
So I think one way to fix that
would just be to swap it here.
And maybe there's like a more
elegant way of doing this,
but this now, if
I run it this way,
you'll see, ah, look at
those vertical edges.
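
One way to see the mix-up is to compare the two lookups side by side. This is a sketch of one possible fix, swapping the two indices; the transcript does not show exactly which change was made, and the filter values and function names here are assumed for illustration.

```javascript
// Typed so that the -1's line up visually in the left column.
const typed = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// i is the horizontal offset and j the vertical offset, each -1, 0, or 1.

// First version: i picks the typed row, so the layout is sideways.
function factorAsFirstWritten(filter, i, j) {
  return filter[i + 1][j + 1];
}

// Swapped version: j picks the typed row and i the typed column,
// so the array reads the way it looks on screen.
function factorSwapped(filter, i, j) {
  return filter[j + 1][i + 1];
}

// With the first version, the -1 lands on the pixel above.
console.log(factorAsFirstWritten(typed, 0, -1)); // -1
// With the swap, the -1 lands on the pixel to the left.
console.log(factorSwapped(typed, -1, 0)); // -1
console.log(factorSwapped(typed, 0, -1)); // 1
```

Transposing the typed array instead of swapping the indices would have the same effect.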
So now, we see how
this convolution
is applied to the image.
The difference in the
neural network here--
the convolutional
neural network--
is we're not hardcoding
in specific filters
that we know highlight
things in an image.
The neural network
is going to learn
what values for the
filters highlight
important aspects of the image
to help the machine learning
task at hand, such
as classification.
So it might draw
out, you know, cats
tend to have ears that appear
a certain way and this kind
of filter, like, brings
that out, and then leads
to the final layer of
the network activating
with a high value for that
particular classification.
So just to keep my example
simulating the neural network
process a bit more, let's
just every time I run it,
give it a random filter
because that's what
the layer would begin with.
Just like a neural network
begins with random weights
and learns the right
weights, the filters
begin with random values and
it learns optimal values.
So right here in setup,
I'll write a nested loop
and give it a random value
between negative 1 and 1.
In truth, there are other
mechanisms and strategies
for the initial weights of a
convolutional neural network,
but picking random
numbers will work for us
right now just to see.
So every time I
run it, you can see
we get a different resulting
image that is filtering
the image in a different way.
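
That nested loop might look something like this. The sketch uses `Math.random()` so it runs in plain JavaScript, and the name `randomFilter` is made up for illustration. In a p5.js sketch, `random(-1, 1)` would do the same job.

```javascript
// Fill a size x size filter with random values from -1 up to 1.
function randomFilter(size) {
  const filter = [];
  for (let i = 0; i < size; i++) {
    filter[i] = [];
    for (let j = 0; j < size; j++) {
      filter[i][j] = Math.random() * 2 - 1;
    }
  }
  return filter;
}

const filter = randomFilter(3);
console.log(filter); // nine different numbers on every run
```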
OK, that was a
lot and I think it
would be good to take a break.
So this was the first
part of my explanation,
a long-winded attempt to
answer the question, what is
a convolutional neural network?
So the first thing to look at
is the convolutional layer.
It's made up of filters.
And so this video
attempted to explain that.
And I think we could take
a break, have a cup of tea,
talk to your pet, or friend, or
plant, or something, meditate,
relax.
And then if you want--
if you want, you can come
back and in the next video,
I'm going to look
at the next piece--
the next component of
the convolutional layer,
an operation called pooling or
more specifically, max pooling.
And then I'll be able
to tie a little ribbon
and put a little bow
on this explanation
about convolutional
neural networks
and move towards
actually implementing one
with the ml5 built-in
functionality.
All right, so maybe I'll
see you in the future
and have a great
rest of your day.
Goodbye.
[MUSIC PLAYING]
