Hello and welcome! Let's talk about Convolutional Neural Networks, which are a specialized kind of neural network that has been very successful, particularly at computer vision tasks such as recognizing objects, scenes, and faces, among many other applications.
First, let's take a step back and talk about
what convolution is. Convolution is a mathematical
operation that combines two signals and is
usually denoted with an asterisk. Let's say
we have a time series signal a, and we want
to convolve it with an array of 3 elements.
What we do is simple: we multiply the arrays elementwise, sum the products, and shift the second array. Then, we do the same thing again
for the other elements by moving the second
array over the first one like a sliding window.
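To make this concrete, here's a minimal NumPy sketch of that sliding-window operation (the function name and example values are just for illustration):

```python
import numpy as np

def cross_correlate_1d(signal, kernel):
    # Slide the kernel over the signal: multiply elementwise, sum the products.
    n = len(signal) - len(kernel) + 1   # number of window positions
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(n)])

a = np.array([1.0, 2.0, 4.0, 3.0, 1.0, 0.0])   # a short time series
b = np.array([1.0, 0.0, -1.0])                  # a 3-element kernel
print(cross_correlate_1d(a, b))                 # one output per window position
```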
Technically, what we do here is cross-correlation
rather than convolution. Mathematically speaking,
the second signal needs to be flipped in order
for this operation to be considered a convolution.
But in the context of neural networks, the
terms convolution and cross-correlation are
used pretty much interchangeably. This is a little off topic, but you might ask why anyone would want to flip one of the inputs. One
reason is that doing so makes the convolution
operation commutative. When you flip the second
signal, a * b becomes equal to b * a. This
property isn't really useful in neural networks,
so there is no need to flip any of the inputs.
In digital signal processing, this operation
is also called filtering a signal a with a
kernel b, which is also called a filter. As
you may have noticed, this particular kernel computes a local average of the values within a window. If we plot this signal
and the result when it's convolved with this
averaging filter, we can see that the result
is basically a smoothed version of the input.
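As a quick demonstration, NumPy's built-in convolution (which does flip the kernel, though that makes no difference for a symmetric averaging kernel) produces exactly this smoothing effect:

```python
import numpy as np

noisy = np.array([1.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0])
avg = np.ones(3) / 3                          # 3-tap averaging kernel
print(np.convolve(noisy, avg, mode='valid'))  # each output is a 3-sample mean
```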
We can easily extend this operation to two
dimensions. Let's convolve this 8x8 image
with this 3x3 filter, for example. Just like in the previous example, we overlay the kernel
on the image, multiply the elements, sum the
products, and move to the next tile. This
specific kernel is actually an edge detector
that detects the edges in one direction. It
has a weak response over the smooth areas
in an image, and a strong response to the
edges. If we apply the same kernel to a larger
grayscale image like this one, the output
image would look like this where the vertical
edges are highlighted. If we transpose the
kernel, then it detects the horizontal edges.
The filter in the previous example smoothed its input, whereas in this example the filter does the opposite and makes local changes, such as edges, more pronounced. The idea is that kernels can be used to extract certain features from input signals.
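Here's a small sketch of the 2-D version. The exact kernel used in the video may differ; below I'm using the well-known Sobel kernel as the vertical-edge detector:

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    # Overlay the kernel on each tile, multiply elementwise, sum the products.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])    # responds to vertical edges

image = np.zeros((8, 8))
image[:, 4:] = 1.0                        # dark left half, bright right half
print(cross_correlate_2d(image, sobel_x))  # strong response only at the boundary
# sobel_x.T would respond to horizontal edges instead
```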
The input signal doesn't have to be a grayscale image. It can be an RGB color image, for example,
and we can learn 3-dimensional filters to
extract features from these inputs. The inputs
don't even have to be images. They can be
any type of data that has a grid-like structure,
such as audio signals, video, and even electroencephalogram
signals. Both the inputs and the filters can
be n-dimensional.
There's a lot that can be said about convolutions
and filter design. But since the focus of
this video is not digital signal processing,
I think this is enough background to understand
what happens inside a convolutional neural
network.
In the earlier examples, we convolved the input signals with kernels that had hardcoded parameters. What if we could learn these parameters
from data and let the model discover what
kind of feature extractors would be useful
to accomplish a task? Let's talk about that
now.
Let's say we have an 8x8 input image. In a
traditional neural network, each one of the
hidden units would be connected to all pixels
in the input. Now imagine if this was a 300x300
RGB image. Then we would have 300 × 300 × 3 = 270,000 weights for a single neuron. Now, that's a lot
of connections. If we built a model that had
many fully connected units at every layer
like this, the model would be big, slow, and
prone to overfitting.
One thing we can do here is to connect each
neuron to only a local region of the input
volume. Next, we can make an assumption that
if one feature is useful in one part of the
input it's likely that it would be useful
in the other parts too. Therefore, we can
share the same weights across the input.
Look familiar? Yes, what this unit does here is basically a convolution.
A layer that consists of convolutional units
like these is called a convolutional layer.
Convolutional networks, also called ConvNets
and CNNs, are simply neural networks that
use convolutional layers rather than only fully connected layers.
The parameters learned by each unit in a convolutional
layer can be thought of as a filter. The outputs
of these units are simply the filtered versions
of their inputs. Passing these outputs through
an activation function, such as a ReLU, gives
us the activations at these units, each one
of which responds to one kind of feature.
Compared to traditional fully-connected layers, convolutional layers have fewer parameters, and the same parameters are used in more than one place. This makes the model more efficient, both statistically and computationally.
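To put rough numbers on this, compare a single fully connected unit on a 300x300 RGB input with a small convolutional layer (a back-of-the-envelope sketch; the filter count is arbitrary):

```python
h, w, c = 300, 300, 3

# Fully connected: every unit connects to every input value.
fc_weights_per_unit = h * w * c           # 270,000 weights for one neuron

# Convolutional: each of, say, 64 filters has only 3x3x3 weights,
# and those weights are shared across all spatial positions.
conv_weights = 64 * (3 * 3 * 3)           # 1,728 weights for the whole layer

print(fc_weights_per_unit, conv_weights)
```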
Although convolutional layers are visualized
as running sliding windows over the inputs
and multiplying the elements, they aren't
usually implemented that way. Compared to explicit for loops, matrix multiplications are faster and scale better. So instead of sliding a
window using for loops, many libraries implement
convolution as a matrix multiplication.
Let's assume that we have an RGB image as
input and have four 3x3x3 kernels. We can
reshape these kernels into 1x27 arrays each.
Together, they would make a 4x27 matrix, where
each row represents a single kernel. Similarly,
we can divide the input into image blocks
that are the same size as the kernels and
rearrange these blocks into columns. This
would produce a 27xN matrix, where N is the
number of blocks. By multiplying the matrices,
we can compute all these convolutions at once.
Each row in the resulting matrix holds one filter's output, which can be reshaped back to the spatial dimensions of the output.
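A minimal sketch of this trick, often called im2col, could look like this (assuming a stride of one and no padding):

```python
import numpy as np

def conv_as_matmul(image, kernels):
    # image: (H, W, C); kernels: (num_filters, kh, kw, C)
    nf, kh, kw, c = kernels.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1

    # Rearrange every kh x kw x C block of the input into a column.
    cols = np.stack([image[i:i + kh, j:j + kw, :].ravel()
                     for i in range(oh) for j in range(ow)], axis=1)  # (27, N)

    weights = kernels.reshape(nf, -1)   # (4, 27): one kernel per row
    out = weights @ cols                # all convolutions in one matmul
    return out.reshape(nf, oh, ow)      # each row -> one filter's output map

image = np.random.rand(8, 8, 3)
kernels = np.random.rand(4, 3, 3, 3)    # four 3x3x3 kernels
print(conv_as_matmul(image, kernels).shape)  # (4, 6, 6)
```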
Another type of layer that is commonly used
in convolutional neural nets is the pooling
layer. A pooling layer downsamples its input by summarizing local neighborhoods. Max pooling,
for example, subsamples its input by picking
the maximum value within a neighborhood.
Alternatively, average pooling takes the average.
In many cases, we care about whether certain features exist in the input, regardless of their exact position. Pooling layers make this easier by making the outputs invariant to small translations in the input: even if the input is shifted by a few pixels, the local maxima still make it to the next layers. Another
obvious advantage of pooling is that it reduces
the size of the activations that are fed to
the next layer, which reduces the memory footprint
and improves the overall computational efficiency.
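A 2x2 max-pooling sketch in NumPy (assuming the input dimensions are divisible by the pool size):

```python
import numpy as np

def max_pool_2x2(x):
    # Take the maximum of each non-overlapping 2x2 block.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # halves each spatial dimension
# Average pooling: replace .max(...) with .mean(...)
```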
A typical convolutional neural network usually
stacks convolutional and pooling layers on
top of each other and sometimes uses traditional
fully connected layers at the end of the network.
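In PyTorch, for example, such a stack might look like the following sketch (the layer sizes are arbitrary and assume a 3x32x32 input):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                    # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                    # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),          # fully connected layer at the end
)
```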
An interesting property of convolutional neural
networks is that they learn to extract features.
Early convolutional layers, for example, learn
primitive features such as oriented edges.
After training a model, the filters in the first layer usually resemble Gabor filters, edge detectors, and color-contrast sensitive
filters.
As we move towards the output layer, the features
become more complex and neurons start to respond
to more abstract, more specific concepts.
We can observe neurons that respond to cat
faces, human faces, printed text, and so on.
The dots you see in the activations of this
convolutional layer can be a result of neurons
that respond to cats, pets, or animals in
general. One of them, for example, can be
a neuron that activates only if there is a
cat in the input picture. The following layers
make use of this information to produce an
output such as a class label with some probability.
Interestingly, the concepts that are learned by the intermediate layers don't have to be part of our target classes. For
example, a scene classifier can learn a neuron
that responds to printed text even if that's
not one of the target scene types. The model
can learn such units if they help detect books
and classify a scene as a library.
This is somewhat similar to how visual information
is processed in the primary visual cortex
in the brain, which consists of many simple
and complex cells. The simple cells respond
primarily to edges and bars of particular orientations, similar to early convolutional
layers.
The complex cells receive inputs from simple
cells and respond to similar features but
have a higher degree of spatial invariance,
somewhat like the convolutional layers after
the pooling layers. As the signal moves deeper
into the brain, it's postulated that it might
reach specialized neurons that fire selectively
to specific concepts such as faces and hands.
An advantage of using pooling layers in our network is that they increase the receptive field of the subsequent units, helping them see a bigger picture. The term receptive field
comes from neuroscience and refers to a particular
region that can affect the response of a neuron.
Similarly, the receptive field of an artificial
neuron refers to the spatial extent of its
connectivity. For example, the convolutional
unit in the earlier example had a receptive
field of 3x3. Units in the deeper layers have a greater receptive field since they indirectly have access to a larger portion of the input. Let's look at another example, and for simplicity, let's assume both the input and the filter are one-dimensional. This unit has
access to three pixels at a time. If we add
a pooling layer followed by another convolutional
layer on top of that, a single unit at the
end of the network gains access to all 8 pixels
in the input.
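We can track this growth with the standard receptive field recurrence; here's a small sketch whose layer list matches the one-dimensional example above:

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, from input to output.
    rf, jump = 1, 1                 # receptive field and cumulative stride
    for k, s in layers:
        rf += (k - 1) * jump        # each layer widens the field by (k-1)*jump
        jump *= s
    return rf

# conv(3, stride 1) -> pool(2, stride 2) -> conv(3, stride 1)
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: all 8 input pixels
```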
Of course, pooling is not the only factor
that increases the receptive field. The size
of the kernel obviously has an impact. A larger
kernel would mean that a neuron sees a larger
portion of its input.
A larger receptive field can also be achieved
by stacking convolutions. In fact, it is usually preferable to stack smaller kernels on top of one another rather than use a single larger kernel, since doing so usually reduces the
number of parameters and increases non-linearity
when a non-linear activation function is used
at the output of each unit. For example, a
stack of two 3x3 convolutions would have the
same receptive field as a single 5x5 convolution,
while having fewer parameters, fewer mathematical operations, and more non-linearities.
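For a quick comparison, assume C input channels and C output channels and ignore biases:

```python
C = 64
two_3x3 = 2 * (3 * 3 * C * C)   # two stacked 3x3 conv layers: 73,728 weights
one_5x5 = 5 * 5 * C * C         # one 5x5 conv layer, same receptive field: 102,400
print(two_3x3, one_5x5)
```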
One thing to pay attention to when stacking convolutional layers is how the size of the input volume changes from one layer to the next. Without
any padding, the spatial dimensions of the
input shrink by one pixel less than the kernel
dimensions. For example, if we have an 8x8 input and a 3x3 kernel, the output of the convolution would be 6x6. Many frameworks call this type
of convolution a 'valid' convolution or a
convolution with valid padding. Valid convolutions can cause problems: especially if we use larger kernels or stack many layers on top of each other, the amount of information that gets thrown away can be significant.
There is an easy hack that helps improve the
performance by keeping information at the
borders. What it does is pad the input with zeros so that the spatial dimensions of the input are preserved after the convolutions. This type of zero padding is called 'SAME' padding by many frameworks. Zero padding is commonly used and works fine in practice, although
it's not ideal from a digital signal processing
perspective since it creates artificial discontinuities
at the borders.
Another hyperparameter that has an impact
on the receptive field is the stride of the
sliding window. So far, we have used a stride of one in the examples. This is usually the default
behavior of a convolutional layer. If we set
it to two, for example, the sliding window
moves by two pixels instead of one, leading
to a larger receptive field. Using a stride
larger than one has a downsampling effect
that is similar to pooling layers and some
models use it as an alternative to pooling.
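All three of these hyperparameters determine the output size together; the usual formula is floor((W - K + 2P) / S) + 1:

```python
def conv_output_size(w, k, p=0, s=1):
    # Spatial output size for input width w, kernel k, padding p, stride s.
    return (w - k + 2 * p) // s + 1

print(conv_output_size(8, 3))            # 6 -> 'valid' convolution shrinks the input
print(conv_output_size(8, 3, p=1))       # 8 -> 'SAME' padding preserves the size
print(conv_output_size(8, 3, p=1, s=2))  # 4 -> stride 2 downsamples
```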
One thing that is sometimes confused with
stride is the dilation rate. A dilated convolution,
also known as atrous convolution or à trous
convolution, uses filters with holes. Just
like pooling and strided convolutions, dilated convolutions also help capture multi-scale features. But instead of downsampling the activations, dilated convolutions expand the filters without increasing the number of parameters. This type of convolution can be useful if a task requires the spatial resolution to be preserved.
For example, if we are doing pixel-wise image
segmentation, pooling layers might lead to a
loss in detail. Using dilated convolutions
preserves spatial resolution while increasing the
receptive field. However, this approach demands
more memory and comes at a computational cost
since the activations need to be kept in memory
at full resolution.
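The effect on the receptive field is easy to quantify: a kernel of size k with dilation rate d spans the same extent as a kernel of size k + (k - 1)(d - 1), while keeping only k weights per dimension:

```python
def effective_kernel_size(k, d):
    # Extent covered by a k-tap kernel with dilation rate d.
    return k + (k - 1) * (d - 1)

print(effective_kernel_size(3, 1))  # 3 -> ordinary convolution
print(effective_kernel_size(3, 2))  # 5 -> a 3x3 kernel with holes spans 5x5
print(effective_kernel_size(3, 4))  # 9 -> still only 3 weights per dimension
```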
In this video, we talked about the building
blocks of convolutional neural networks. We
also covered what some of the hyperparameters
in convolutional networks are and what they
do.
In the next video, we will talk about how
to choose these hyperparameters and how to
design our own convolutional neural network.
We will also cover some of the architectures
that have been widely successful at a variety of tasks and have gone mainstream.
Ok, that's all for today. It's already been a little longer than usual. As always, thanks
for watching, stay tuned, and see you next
time.
