In this video, I will be using a convolutional
neural network, implemented with Keras and
written in Python, to recognize handwritten
digits and perform basic operations between
them. I simultaneously aim to give you an
intuition for, and understanding of, convolutional
neural networks and their awesome potential.
I will start off with a demonstration of the
program. I enter “13” one digit at a time,
clicking “save image” after each digit.
Then I click “multiply”, and enter “14”,
one digit at a time, clicking “save image”
after each digit. When I click “equals”,
I get the correct solution of 182. Pretty
cool, huh?
The core of this program is a convolutional
neural network. Convolutional neural networks
are loosely based on the manner in which
mammals visually perceive the world around
them. This manner involves a hierarchical
series of feature recognitions, starting off
with simpler features like diagonal lines,
curved edges, etc., and progressing towards
complex or abstract recognitions,
like combinations of shapes, and finally,
the classification of entire objects. That’s
pretty simple, but how would you mathematically
model this process? Let’s dive deeper into
how convolutional neural networks do what
they do.
Like the standard neural networks discussed
in previous videos, convolutional neural networks
involve input layer neurons, weights, biases,
hidden layer neurons and output layer neurons.
However, they are also a bit different. Let’s
begin at the start of the convolutional neural
network and work our way through it. Inputs
are a 2-d matrix for black and white images,
and a 3-d tensor for colored images. By the
way, tensors are just arrays with three or
more dimensions, and the 3rd dimension in
a colored image is for the different color
channels, usually red, green, and blue in
an RGB image. In our case, images are black
and white. Next, weights are arranged into
2d matrices. This differs from a regular neural
net, where inputs and weights are scalars,
or just single numbers. In a regular neural
net, you simply multiply inputs by weights.
How does this work with matrices or even tensors?
With matrices, the dot product is computed
instead. Essentially, to compute the dot product
between the image, which is a matrix, and
a smaller weight matrix, often called a filter,
you multiply each value in the filter by the
corresponding value in a section of the image,
sum these products, and place the result in
the corresponding position in the resulting
matrix. You then repeat this process, sliding
the section of the image that is multiplied,
known as the receptive field, until all the
values of the resulting matrix are found.
Note that with a 3 by 3 filter, the resulting
matrix loses one pixel at its edges in every
direction, so programmers sometimes perform
padding, in which the image border is extended
with zeros. With a 16 by 16 input, for example,
padding keeps the resulting matrix 16 by 16
instead of shrinking it to 14 by 14. Also note that initially,
like any other weight, filters contain random
values and are adjusted in training. Performing
these repeated dot-products in this manner
is known as convolution. Convolution plays
a central role in a convolutional neural network,
hence, the name convolutional neural networks.
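As a concrete sketch, the sliding dot product just described can be written in a few lines of NumPy. This is a simplified version, with stride 1 and no padding, and it sums the products in each receptive field the way Keras's Conv2D does:

```python
import numpy as np

def convolve2d(image, filt):
    """Slide the filter across the image with stride 1 and no padding,
    taking the dot product with each receptive field."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + fh, j:j + fw]
            out[i, j] = np.sum(receptive_field * filt)
    return out

# A 28 by 28 image and a 3 by 3 filter give a 26 by 26 result
image = np.ones((28, 28))
filt = np.ones((3, 3))
print(convolve2d(image, filt).shape)  # (26, 26)
```

Real implementations vectorize this heavily, but the nested loops make the sliding-window idea explicit.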
To understand why, let’s take a look at
two examples, one simple and one slightly
more complex. I’ll start with the simpler
example. Say that this is the input image,
and we apply this filter to it. This filter
can be thought of as a low-resolution horizontal
line, and look what happens in the resulting
image: horizontal lines are kept, and almost
everything else fades into a dark gray and
is ignored. In this way, filters keep what
is relevant and ignore everything else. ‘What
is relevant’ is adjusted based on the needs
of the neural network in training. Mathematically,
this occurred because the resulting value
in the matrix was large if and only if the
receptive field was very similar to the filter.
If they were not similar, then values in the
image matrix were multiplied by values close
to or equal to zero in the filter matrix or
vice versa, making the resulting value small.
So far in this video, filters have contained
only positive values. This, however, is a
simplification. Like in standard neural networks,
weights in a convolutional neural network
can contain negative values. For these two
examples, when visualizing any values in filters,
including negative values, I set the lowest
value in the matrix to black and the highest
value in the matrix to white. While visualization
is a powerful tool, it is also important to
consider the numbers, because the numbers
are all that the computer sees. In this next
example, I will use negative values in the
filter and a more complex image. This is the
new filter. While it looks exactly the same
as the filter that was just used, it contains
-1s where there were once zeros. The image
itself is much more complex, with many details.
In the resulting convolved image, the horizontal
edges are highlighted and most other information
is lost. The general principle of how certain
values are kept or even amplified and others
are ignored through convolution remains the
same--that is, if the filter and the receptive
field are similar, then the resulting value
is large, and if they are not, then the resulting
value is small. However, negative values differ
from zeros in an important way. With the filter
that was used in the first example, the values
in the receptive field corresponding to the
zeros in the filter don’t matter--they can
be zero, or any other value, and since any
value multiplied by zero is zero, the resulting
value will be zero. So, with this filter,
you will keep any horizontal line, regardless
of what values are above or below it. When
the zeros are replaced with -1s, however,
the filter will preserve a horizontal white
line if it has low, or ideally, negative values
above and below it. This is because if the
values above and below the line are large,
then large values will be made negative, making
the resulting sum much smaller. Similar logic
can be used to explain why a horizontal black
line surrounded by large values will be preserved.
Hence, this filter may look for horizontal
lines making up an edge, if the edge has a
height of one pixel and is surrounded by values
that contrast the line. This fact explains
the emphasis on edges rather than any horizontal
lines in the resulting convolved image. In
this way, negative values in filters allow
the model to emphasize a wider variety of
features.
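To make this concrete, here is a small NumPy comparison of the two filters discussed. The pixel values are illustrative assumptions: one patch is a bright horizontal line on a dark background, the other is a uniformly bright region:

```python
import numpy as np

line_patch = np.array([[0., 0., 0.],
                       [1., 1., 1.],
                       [0., 0., 0.]])   # a white line with dark surroundings
solid_patch = np.ones((3, 3))           # a uniformly bright region

line_filter = np.array([[0., 0., 0.],
                        [1., 1., 1.],
                        [0., 0., 0.]])  # the first example's filter
edge_filter = np.array([[-1., -1., -1.],
                        [ 1.,  1.,  1.],
                        [-1., -1., -1.]])  # zeros replaced with -1s

# The plain line filter responds equally to both patches...
print(np.sum(line_patch * line_filter))   # 3.0
print(np.sum(solid_patch * line_filter))  # 3.0
# ...but the -1 filter only responds when the line contrasts
print(np.sum(line_patch * edge_filter))   # 3.0
print(np.sum(solid_patch * edge_filter))  # -3.0
```

The negative values punish bright pixels above and below the line, which is exactly why this filter highlights edges rather than any horizontal line.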
In a convolutional neural network, it is almost
always advantageous to have multiple filters,
in order to highlight multiple different features
separately, but more on how this plays out
later. The next step involves making the output,
after applying the filter, have a lower resolution,
but still retain the most important features.
The point of this is to make the neural network
more efficient and less computationally expensive,
especially during training. This step is pretty
simple and is called pooling. First, you define
a pool size. This will be the dimensions of
a section of the matrix that you will compress
into one value. Perhaps the most popular
method of this ‘compression’ is simply
taking the maximum value, where the value
represents the pixel intensity. In a black
and white image, which is what we will deal
with, the value represents how white a pixel
is. So, the network simply slides across the
image, taking the maximum value in a certain
area. Note that for both convolution and pooling,
there is something known as a stride length.
Stride length represents how much you move
the window after performing convolution or
pooling, in both the horizontal and vertical
directions. For convolution, the stride length
is usually 1 by 1 or 1 horizontally and 1
vertically, and in pooling, the stride length
is usually the same as the pool size, which
is the area from which the model takes the
maximum value. Changing the stride length
from the standard values is usually unnecessary
and importantly, will change the dimensions
of the resulting matrix. Finally, we apply
an activation function to all the values in
the now compressed matrix. Remember, an activation
function serves the purpose of allowing a
neural network to approximate a non-linear
function and this remains the case in convolutional
neural networks. By the way, you could technically
apply an activation function before pooling
and directly after convolution, however this
would be slightly more computationally expensive,
as the matrices will be larger. These 3 parts,
convolution, pooling, and activation, can
be repeated to further add complexity to the
model. Note that you don’t necessarily
have to implement these parts precisely in
this order. As an example, it may prove beneficial
to perform convolution, activation, convolution,
and only then pooling. What happens when you
add more convolutional layers is really cool.
Say that I start with a 28 by 28 black and
white image and I have 32 filters in my first
convolutional layer. So, I now have 32 matrices
of size 26 by 26, assuming that I don’t perform
padding. Whether I apply pooling and activation
is irrelevant to this scenario. Let’s say
that I don’t, and I add another convolutional
layer with 32 filters. I actually don’t
get 32 times 32 or 1024 new matrices after
I apply this convolutional layer. That is, I don’t
apply 32 filters to each of the previously
filtered images. Let’s break down what actually
happens.
Each of the 32 filters will actually have
a depth of 32, corresponding to the depth
of 32 in the previous layer’s outputted
tensor. From this point, the dot product is
calculated between the receptive field in
the first matrix and the first matrix in the
first filter, between the receptive field
in the second matrix and the second matrix
in the first filter, between the receptive
field in the third matrix and the third matrix
in the first filter, and so on. Then all of
these values are summed, and the sum is placed
in the top left position of the first resulting
matrix. Then the rest of the values in the
first matrix are computed by sliding the receptive
field across the matrices, computing the dot
products, summing them, and placing the result
in the corresponding position in
the first resulting matrix. So, the first
resulting matrix was generated with the first
filter. This process is repeated with each
of the filters, until there are 32 resulting
matrices. Essentially, 2d convolution between
a matrix and a filter is repeated to account
for the depth of both the tensor and the filter,
these values are summed to
make the result a 2d matrix, and this process
is repeated with each of the filters.
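Under the stated assumptions of this example, a 32-deep stack of 26 by 26 feature maps and 32 filters each of depth 32, the process might be sketched like this. Note that the sketch sums the products across the full depth, which is what Keras's Conv2D does:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((32, 26, 26))  # depth, height, width
filters = rng.random((32, 32, 3, 3))     # n_filters, depth, fh, fw

n_filters, depth, fh, fw = filters.shape
out_h, out_w = 26 - fh + 1, 26 - fw + 1
output = np.zeros((n_filters, out_h, out_w))

for f in range(n_filters):        # one 2-d output map per filter
    for i in range(out_h):
        for j in range(out_w):
            # the receptive field spans the full depth of the input
            field = feature_maps[:, i:i + fh, j:j + fw]
            # every channel's dot product is summed into one value
            output[f, i, j] = np.sum(field * filters[f])

print(output.shape)  # (32, 24, 24)
```

So 32 filters in, 32 maps out, and the depth dimension is collapsed inside each filter rather than multiplying the number of matrices.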
By performing convolution and summing the
values across the depth in this way, I combine the highlighted
features in a computationally inexpensive
way. By doing this, we add a level of complexity
that will force the model to make progressively
more and more intricate feature recognitions
or highlights as it finds the optimum filter
values deeper in the convolutional neural
network. So, deep in a CNN, where convolution,
pooling, and activation have already been performed
quite a few times, a filter that looks like
a horizontal line, for example, will likely
result in the highlighting of a much more
complex feature in the original image. Generally,
by the last convolutional layer, each filter
will represent a relatively complex feature.
I say “relatively” because depending on
the scenario a curved edge could be a complex
feature, like in our case, or an entire human
face could be a complex feature. While all
this may be slightly complicated, it is important
to remember that at a high level, we are still
just computing a product between an input
and a weight, and because of this, eventually
allowing the neural net to find the relationship
between input and output.
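For completeness, the pooling and activation steps described earlier can each be sketched in a few lines of NumPy, here with a 2 by 2 pool and a stride equal to the pool size, as is the default:

```python
import numpy as np

def max_pool(x, size=2):
    """Keep the maximum value of each size-by-size block
    (stride equal to the pool size, the usual default)."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop any ragged edge
    h2, w2 = x.shape
    return x.reshape(h2 // size, size, w2 // size, size).max(axis=(1, 3))

def relu(x):
    """Zero out negative values, the activation used in this video."""
    return np.maximum(x, 0)

m = np.array([[1., 5., 2., 0.],
              [3., 4., 1., 1.],
              [0., 2., 6., 3.],
              [1., 0., 2., 2.]])
print(max_pool(m))  # [[5. 2.]
                    #  [2. 6.]]
```

Each 2 by 2 block collapses to its brightest pixel, which halves the resolution while keeping the strongest feature responses.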
Then, you flatten all the pixels into a single
vector. Remember, a vector in programming
is a one-dimensional array. You connect each
pixel to a neuron in either another hidden
layer or the output layer. Remember, that
by this point in the convolutional neural
network, a pixel should represent the presence
or absence of a relatively complex feature.
In this way, we add a fully connected section
to our neural network. This is done in order
to take advantage of all the different features
that have been learned. In other words, the
section of a Convolutional Neural Network
with convolution, pooling, and activation
performs feature extraction and in the fully
connected section, the network finds the relationship
between the presence or absence of these features
and the different classes. Like in the last
video, with breast cancer diagnosis, the output
will be a set of probabilities. These probabilities
represent the chances that a particular image
belongs to a certain class, according to the
model. You can then take the digit corresponding
to the highest probability and voila, you
have a predicted digit. Initially, as with
the neural networks dealt with in previous
videos, the predictions are inaccurate. However,
accuracy is gradually improved by minimizing
the loss or cost during training.
For training the model, I use the MNIST dataset.
This dataset is popular and widely used among
programmers. It contains 70,000 handwritten
digits that are already split into training
and testing data, with 60,000 training digits
and 10,000 testing digits. Each digit is a
28 by 28 pixel, black and white image. Each
of these images obviously contains a corresponding
correct, human-made label.
Let’s begin writing the CNN in Python with
the help of Keras. First, as always, I import
all the packages that I will need in order
to implement the CNN. Then, I load the MNIST
dataset. From where? Actually, Keras has a
copy of it because of how popular it is. Also,
as I previously said, the dataset is already
split into training and testing data, so all
I have to do is assign it to X_train, X_test,
y_train, and y_test. Remember, that X_train
and X_test are the images, and y_train and
y_test are the classes or labels for the images.
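In code, that load-and-split step looks like this, assuming the tensorflow.keras namespace; standalone Keras exposes the same function:

```python
from tensorflow.keras.datasets import mnist

# Keras downloads and caches a copy of MNIST on first use,
# already split into 60000 training and 10000 testing images
(X_train, y_train), (X_test, y_test) = mnist.load_data()

print(X_train.shape)  # (60000, 28, 28)
print(X_test.shape)   # (10000, 28, 28)
print(y_train[:5])    # the first few labels, small integers 0-9
```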
Next, I reshape both X_train and X_test to
-1 by 28 by 28 by 1. What in the world does
that mean? Well, 28 by 28 is simply the resolution
of each of the images, and 1 represents the
single color channel. The -1 is a special
number that essentially tells Keras to figure
out what the actual value is. The actual value
will be the number of images assigned to that
variable. So, the only thing that will actually
change in this line is the addition of the
one color channel, and this is added to match
the format expected by Keras. Next, I apply
to_categorical to y_train and y_test. For
a complete description and explanation of
what this does, see my breast cancer diagnosis
video. Basically, it converts the classes
in y_train and y_test into a form readable
by Keras. In the following lines, I normalize
the input data by taking each pixel intensity
value, which is currently between 0 and 255,
and dividing by 255. In order to do this,
I must first change the type of the values
in these numpy arrays to float32 so that they
can contain decimals. Then, as always, I define
the model in the line model = Sequential().
From here, I define the input shape as 28
by 28 by 1. Note that since we only input
one image at a time, the input is 3 dimensional,
whereas X_train and X_test are 4 dimensional,
with the 4th dimension being the image index.
In the same line, I define 32 filters of size 3 by 3,
and even though I don’t explicitly write
it out, Keras includes a bias since this is
the default. I also don’t choose to change
the default of no padding after convolution,
simply because it is not necessary. Next I
add a pooling layer, and more specifically,
a max-pooling layer with a pool_size of 2
by 2. There are actually other types of pooling,
like average pooling, for example, in which
instead of taking the maximum value in a defined
area, you take the average value. In truth,
there isn’t all that much of a difference
in the resulting performance between the two,
especially for a simpler example, like the
one that we are dealing with. As mentioned
earlier, you can change the stride length
in both pooling and convolution from the default,
however, this is unnecessary. Hence, I use
the default values. Then, I apply the relu
activation function. After this, I flatten
all the matrices into one very long list of
values, where each value represents a pixel.
Then, I add a dense layer with 128 neurons.
So, the value of each pixel will be connected
to each of the 128 neurons in this layer.
This connection is often computationally expensive,
especially with much higher-resolution images.
I also apply the relu activation function
to each of these neurons. Finally, I add an
output layer with 10 different neurons. Each
of these neurons will represent a class, or,
more specifically, a digit from 0 to 9. I
apply the softmax activation function, which
is used in multiclass classification problems
like this one, as discussed in the last video.
Finally, I compile the model. I use categorical
crossentropy as the loss, which is used in
tandem with the softmax activation function,
I use the Adam optimizer, which will dictate
how the weights are updated during training,
and I state that I want to keep track of the
accuracy metric. Now we have defined the structure
of our convolutional neural network. Next,
I perform training in the line model.fit.
I pass X_train and y_train_cat, and I define
the batch size, epochs, verbose, and validation
split. For a complete explanation of all of
these, see my neural network regression video.
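Putting the whole script together as described, it might look like the sketch below. The narration does not state the exact batch size or epoch count, so those values are assumptions, and the sketch trains on a small subset to stay quick; the video trains on the full set:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense
from tensorflow.keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Add the single color channel; -1 infers the image count
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1)))  # 32 filters of 3 by 3
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))  # one neuron per digit class

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# batch_size and epochs are assumed values, not the video's exact settings;
# the subset keeps this sketch fast to run
model.fit(X_train[:5000], y_train_cat[:5000],
          batch_size=32, epochs=1, verbose=0, validation_split=0.1)

loss, acc = model.evaluate(X_test, y_test_cat, verbose=0)
```

Even this cut-down run reaches high accuracy, which is why the narration notes there is little room left to improve on this dataset.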
Finally, I evaluate the model and find that
the accuracy is quite high. And that’s it. I have
now written a very simple Convolutional neural
network in Keras. With this simple dataset,
there is very little that I can do to improve
the model’s performance, let alone by a
substantial amount. For example, I could add
more layers to the model, or I could implement
dropout, however, none of this is necessary
and again, it doesn’t have a substantial
effect on the model’s accuracy, which is
already quite high.
The next section of the code is responsible
for the GUI, or graphical user interface,
that you saw at the start of this video, and
for the operations performed between the numbers.
Basically, here’s how it works. First, the
program opens a window, into which the user
can write with their mouse. Whenever the user
clicks, holds, and drags, the program will
plot circles in a location corresponding to
the position of their mouse. The program keeps
track of every location in which a circle
was plotted, so that an exact copy of the
displayed image can be made, that will be
inputted to the convolutional neural network,
when the user clicks ‘save image’. Before
being inputted, however, the image is resized
to 28 by 28, or the dimensions of images in
the MNIST dataset. The program also makes
the pixel intensity values float32 variables
and divides each of them by 255, just as we
did when we trained the network. Then this
slightly-modified image is inputted to the
network and the program takes the digit with
the highest probability in the network’s
output. This digit is then displayed. If the
user clicks on the ‘click here if the number
is incorrect’ button, then the program will
clear the current screen and will later replace
the digit with a new one, when entered by
the user. A user at this point can either
enter the second digit in a number, in which
case the process will be repeated and the
two digits will be joined together, or the
user can click an operation. When the user
clicks an operation, the program stores the
name of that operation in a variable. After
this, the initial process is repeated and
the user can enter another number. When they
click equals, the program simply performs
the requested operation between the two numbers
and prints the result. If the user happens
to click the ‘reset’ button at any point
in this process, then all the existing information
will be cleared or overwritten, and the process
will start over again. And that’s it. If
you want to try the program out for yourself
you can download the code from my GitHub,
which is linked in this video’s
description. Note that if you do so, you will
need to install pillow in your virtual environment.
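Assuming a standard pip setup, the install command would be:

```shell
pip install Pillow
```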
And
that’s it! We now have a basic, functional
calculator that can recognize handwritten
digits and perform simple operations between
them.
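For reference, the resize-and-normalize step described for ‘save image’ might be sketched like this with Pillow; the function name and the stand-in canvas are assumptions based on the narration:

```python
import numpy as np
from PIL import Image

def preprocess_drawing(img):
    """Convert a drawn Pillow image into the format the network expects."""
    img = img.convert('L').resize((28, 28))        # grayscale, MNIST size
    arr = np.asarray(img).astype('float32') / 255  # normalize like training
    return arr.reshape(1, 28, 28, 1)               # a batch of one image

canvas = Image.new('L', (200, 200), color=0)  # stand-in for the drawn digit
x = preprocess_drawing(canvas)
print(x.shape)  # (1, 28, 28, 1)
```

The resulting array can be passed straight to model.predict, and the program then takes the index of the highest probability as the recognized digit.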
In case you didn’t already know, this video
is part of a series in which I find really
cool datasets like this one, and for each
of them I show you how to implement a Neural
Network. In doing this, I hope to both entertain
and educate you. Subscribe and click on the
notification bell to be notified when I release
a new video, and also hit the like button
and leave a comment if you want this video
to reach more people. Thanks for watching.
