One of the questions that I get frequently
is 'how do you design a neural network' or
more specifically 'how do you know how many
layers you need to have' or 'how do you know
what's the right value for a particular hyperparameter'.
First things first, if you are not familiar
with convolutional neural networks, you can
find the link to my introductory video in
the description below.
When people say the best thing about deep
learning is that it requires no hand-designed
feature extractors, that everything is learned
from data with almost no human intervention,
that's not entirely true.
Indeed, the features are learned from data,
and that's great.
A hierarchy of learned features can provide
great representational power.
But there's still a lot of human intervention
in the model design although there have been
some efforts to automate the model selection
process.
I think eventually we will get to a point
where no human intervention is required but
it seems like humans will still be in the
loop for a while.
You might ask why not just do a grid search
on all hyperparameters and automatically pick
the best configuration?
Wouldn't that be more systematic?
Well, a complete grid search is usually not
a feasible option since there are too many
hyperparameters.
Furthermore, model selection in deep models
is not just about choosing the number of layers
and hidden units and a few other hyperparameters.
Designing the architecture of a model also
involves choosing the types of layers and
the way they are arranged and connected to
each other.
So there are infinitely many ways one can
design a network.
Designing a good model usually involves a
lot of trial and error.
It is still more of an art than a science, and
people have their own ways of designing models.
So the tricks and design patterns that I will
be presenting in this video will be mostly
based on 'folk wisdom', my personal experience
with designing models, and ideas that come
from successful model architectures.
Back to our question "how do you design a
neural network?"
The short answer is: you don't.
The easiest thing you can do is pick something
that has been proven to work for a similar
problem and train it for your task.
You don't even have to train it from scratch.
You can take a model that has already been
trained on some large dataset and fine tune
the weights to adapt it to your problem.
This is called transfer learning and we will
come back to that later.
This approach works in many practical cases,
but it is not applicable everywhere, especially
if you are working on a novel problem or doing
bleeding-edge research.
Even if you are working on a novel problem
or the existing models don't meet your needs,
that doesn't mean that you need to reinvent
the wheel.
You can always borrow ideas from successful
models to design your own model.
We will discuss some of these ideas in this
video.
Let's go through frequently asked questions
about designing a convolutional neural network.
First question: how do you choose the number
of layers and number of units per layer?
My experience is that beginning with a very
small model and gradually increasing the model
size usually works well.
And by increasing the model size, I mean adding
layers and increasing the number of units
per layer.
You could also go the other way around:
start with a big model and keep shrinking
it.
The problem with that is that it's hard to
decide how big to start.
If you want to start small, you always have
a point zero: linear regression.
That doesn't mean you should always try
linear regression first, especially if it's
obvious that there is no linear mapping
between the inputs and the outputs.
But overall, it's usually more beneficial
to start small and increase the model capacity
until the validation error stops improving.
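As a rough sketch of this start-small strategy, here's a minimal loop in plain Python. The `train_and_evaluate` function is a hypothetical stand-in: in practice it would train a model of the given size and return its validation error; the numbers below are made up purely for illustration.

```python
def train_and_evaluate(num_units):
    # Stand-in (hypothetical): pretend we trained a model with
    # num_units hidden units and measured its validation error.
    simulated_errors = {16: 0.31, 32: 0.24, 64: 0.19, 128: 0.18, 256: 0.18}
    return simulated_errors[num_units]

def grow_capacity(sizes, tolerance=0.01):
    # Keep increasing capacity until the validation error
    # stops improving by a meaningful margin.
    best_size, best_error = None, float("inf")
    for size in sizes:
        error = train_and_evaluate(size)
        if best_error - error < tolerance:  # no meaningful improvement
            break
        best_size, best_error = size, error
    return best_size, best_error

size, error = grow_capacity([16, 32, 64, 128, 256])
```

With these made-up numbers, the search stops at 128 units, since 256 brings no further improvement.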
Earlier, I made a separate video about how
to choose model capacity.
You can find it in the Deep Learning Crash
Course playlist to learn more about it.
You might wonder given the same number of
trainable parameters whether it's better to
have more layers or more units per layer.
It's usually better to go deeper than wider,
so I would opt for a deeper model.
However, a very tall and skinny network can
be hard to optimize.
One way to make training deep models easier
is to add skip connections that connect non-consecutive
layers.
A well-known model architecture called ResNet
uses blocks with this type of shortcut connection.
Using such connections gives the following
layers a reference point so that adding more
layers won't worsen the performance.
The skip connections also create an additional
path for the gradient to flow back more easily.
This makes it easier to optimize the earlier
layers.
Using skip connections is a common pattern
in neural network design.
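To make the idea concrete, here's a toy sketch of a residual connection in plain Python. The transformation `f` is just a stand-in for the learned layers; in a real ResNet block it would be a stack of convolutions.

```python
def f(x):
    # Stand-in for a learned transformation
    # (in a real block: conv -> ReLU -> conv).
    return [0.1 * v for v in x]

def residual_block(x):
    # output = x + f(x): the identity path gives later layers a
    # reference point and a direct route for gradients to flow back.
    return [xi + fi for xi, fi in zip(x, f(x))]

y = residual_block([1.0, 2.0, 3.0])  # approximately [1.1, 2.2, 3.3]
```

If `f` learned to output all zeros, the block would simply pass its input through, which is why adding more such blocks won't worsen the performance.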
Different models may use skip connections
for different purposes.
For example, fully convolutional networks
use skip connections to combine the information
from deep and shallow layers to produce pixel-wise
segmentation maps.
A paper that I published last year proposed
using both types of skip connections to segment
remotely sensed multispectral imagery.
The skip connections on the left help recover
fine spatial information discarded by the
coarse layers while preserving coarse structures.
The skip connections on the right provide
access to previous layer activations at each
layer, making it possible to reuse features
from previous layers.
Let's move on to the second question: how
do you decide on the size of the kernels in
the convolutional layers?
Short answer: 3x3 and 1x1 kernels usually
work best.
They might sound too small, but you can stack
3x3 kernels on top of each other to achieve
a larger receptive field as I mentioned in
the previous video.
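As a quick sanity check, the receptive field of a stack of stride-1 convolutions grows by k-1 per kxk layer, so a few stacked 3x3 kernels can cover the same region as one larger kernel:

```python
def receptive_field(kernel_sizes):
    # Each kxk stride-1 layer widens the region one output
    # unit can see by (k - 1) pixels in each dimension.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

rf2 = receptive_field([3, 3])     # two 3x3 layers see a 5x5 region
rf3 = receptive_field([3, 3, 3])  # three 3x3 layers see 7x7, like one 7x7
```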
How about 1x1 kernels?
Isn't a 1x1 filter just a scalar?
First, a 1x1 filter isn't really a 1x1 filter.
The size of a kernel usually refers to its
spatial dimensions.
So a 1x1 filter is, in fact, a 1x1xN filter
where N is the number of input channels.
You can think of them as channel-wise dense
layers that learn cross-channel features.
Obviously, 1x1 filters don't learn spatial
features and stacking 1x1 filters alone wouldn't
increase the receptive field, but combined
with 3x3 filters they can help build very
efficient models.
This pattern is at the heart of many convolutional
neural network architectures, including Network
in Network, Inception family models, and MobileNets.
One advantage of 1x1 convolutions is that
they can be used for dimensionality reduction.
For example, if the input volume is 32x32x256
and we use 64 1x1 filters, then the output
volume will be 32x32x64.
Doing so reduces the number of channels before
it's fed into the next layer.
Let's say the output is fed into a 3x3 convolutional
layer with 128 filters, and let's count the number
of operations needed to compute these convolutions.
To compute the output of the 1x1 layer, we
need a value for each of the 32x32x64 output
positions, and computing each value takes 1x1x256
operations, which is the size of the filter.
We do the same to compute the activations
of the following 3x3 convolutional layer, which
sums up to roughly 92 million operations.
Now, if we remove the 1x1 layer and compute
the number of operations we end up with over
300 million operations.
It may sound a little counterintuitive at
first but adding 1x1 convolutions to a network
can greatly improve the computational efficiency.
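You can verify these numbers yourself; here's the arithmetic from the example, counting only the multiplications:

```python
H, W, C_in = 32, 32, 256

# With a 1x1 bottleneck down to 64 channels, then 3x3 with 128 filters:
ops_1x1 = H * W * 64 * (1 * 1 * C_in)       # 1x1 layer: ~16.8 million
ops_3x3 = H * W * 128 * (3 * 3 * 64)        # 3x3 layer on reduced volume
with_bottleneck = ops_1x1 + ops_3x3         # ~92 million total

# Without the 1x1 layer, the 3x3 filters act on all 256 channels:
without_bottleneck = H * W * 128 * (3 * 3 * C_in)  # ~302 million
```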
Another use of pointwise convolutions is to
implement a depthwise separable convolution,
which reduces the number of parameters.
The idea is simple: perform a spatial convolution
on each channel of the input volume separately,
then use a pointwise convolution to learn
cross-channel features.
Let's take the previous example with the traditional
convolutional layer.
We had 128 units, each with 3x3x256 parameters,
where 256 is the number of channels in the
input volume.
So, in total, this layer had roughly 300,000
parameters.
Alternatively, we could use 256 filters, each
applied to only one channel.
So the units in the first layer would have
3x3x1 parameters instead of 3x3x256, since
each unit acts on only a single channel.
Then, we can use a pointwise layer to learn
cross-channel features and get the same output
volume.
This would lead to about 35,000 trainable
parameters, spatial and pointwise layers combined.
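The parameter counts from this example can be checked with a few lines of arithmetic (weights only, ignoring biases):

```python
C_in, C_out, k = 256, 128, 3

# Standard convolution: 128 filters, each of size 3x3x256.
standard = C_out * (k * k * C_in)       # 294,912 parameters

# Depthwise separable: one 3x3 filter per input channel,
# then a 1x1 layer to mix the channels.
depthwise = C_in * (k * k * 1)          # 2,304 parameters
pointwise = C_out * (1 * 1 * C_in)      # 32,768 parameters
separable = depthwise + pointwise       # 35,072 parameters total
```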
This is the main idea behind the recently
popularized MobileNet architecture.
By stacking depthwise separable convolutional
blocks MobileNet manages to be very small
and efficient without sacrificing too much
accuracy.
Separable convolution is not a new concept.
For example, in image processing, it's a common
practice to separate a 2-dimensional filter
into 1-dimensional row and column filters
and apply them separately to reduce the computational
cost.
So we can take the depthwise separable convolution
idea one step further and stack 1x3, 3x1,
and 1x1 filters on top of each other to learn
row-wise, column-wise, and depthwise separable
filters.
Actually, I tried this several years ago,
but it turns out that the savings from the
spatially separable filters are not worth
the accuracy that is sacrificed since the
filters are already small spatially.
So it seems like depthwise-only filter separation
is a good compromise.
Next question: how to choose the sliding window
step size, also known as the stride?
Choose 1 if you want to preserve the spatial
resolution of the activations, choose 2 if
you want to downsample and don't want to use
pooling.
If you want to upsample the activations use
a fractional stride such as 1/2, which is
similar to a stride of 2 but has its input
and output reversed.
A convolution with a fractional stride is
sometimes called a transposed convolution
or a deconvolution, although using the term
'deconvolution' is a little misleading from
a mathematical perspective.
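For reference, here are the output-size formulas for strided and transposed convolutions as commonly implemented; the `output_padding` argument is how frameworks such as PyTorch resolve the ambiguity that arises when a strided convolution discards a remainder:

```python
def conv_out(n, k, stride, pad):
    # Output spatial size of a standard convolution.
    return (n + 2 * pad - k) // stride + 1

def tconv_out(n, k, stride, pad, output_padding=0):
    # Output spatial size of the transposed (fractionally strided)
    # convolution, which swaps the roles of input and output.
    return (n - 1) * stride - 2 * pad + k + output_padding

down = conv_out(8, k=3, stride=2, pad=1)                       # 8 -> 4
up = tconv_out(down, k=3, stride=2, pad=1, output_padding=1)   # 4 -> 8
```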
How about pooling parameters?
Max pooling with same padding and a pooling
size of 2x2 usually works fine.
If you want your model to handle variable-sized
inputs and your output is fixed-size you might
consider pooling to a fixed size or using
global average pooling.
For example, if your inputs are images having
different dimensions and your output is a
single class label, then you can take the
mean of the activations before the fully-connected
layers.
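Here's a minimal pure-Python sketch of global average pooling, showing that inputs with different spatial sizes yield the same fixed-size output:

```python
def global_average_pool(feature_map):
    # feature_map: nested lists of shape H x W x C.
    # Returns one mean per channel, regardless of H and W.
    H, W = len(feature_map), len(feature_map[0])
    C = len(feature_map[0][0])
    pooled = [0.0] * C
    for row in feature_map:
        for pixel in row:
            for c in range(C):
                pooled[c] += pixel[c]
    return [total / (H * W) for total in pooled]

small = [[[1.0, 2.0]] * 4] * 4  # 4x4 image with 2 channels
large = [[[1.0, 2.0]] * 7] * 9  # 9x7 image: different size, same output shape
```

Both inputs pool down to a 2-element vector, so the fully-connected layers that follow always see a fixed-size input.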
How to choose the type of activation functions?
Short answer: choose ReLU except for the output
layer.
Long answer: check out my earlier video on
artificial neural networks.
It's actually a short video.
So, I should have said not so long answer.
What type of regularization techniques should
I use?
Short answer: use L2 weight decay and dropout
between the fully connected layers if there
are any.
Not so long answer: check out my earlier video
on regularization.
What should be the batch size?
A batch size of 32 usually works fine for
image recognition tasks.
If the gradient is too noisy you might try
a bigger batch size.
If you feel like the optimization gets stuck
in local minima or if you run out of memory,
then a smaller batch size would work better.
These are the hyperparameters and design patterns
that I can think of right now.
The next video will be about transfer learning.
Feel free to ask questions in the comments
section and subscribe to my channel for more
videos if you like.
As always, thanks for watching, stay tuned,
and see you next time.
