Today I realized that it's been a long time
since I made the 'how to design a convolutional
neural network' video.
A lot has changed since then, but also a lot
has stayed the same.
So, in this video, we will talk about how
to design a neural network in 2020, covering
some of the useful techniques that came out
or popularized between 2018 and 2020.
At the end of the video, I will also go through
some of our recent papers and explain how
my colleagues and I designed neural networks
for constrained environments.
Alright, let's get started!
My first advice is still the same.
You don't really need to spend too much time
trying to design a neural network.
You can pick something that worked for a similar
problem and use it.
But what if your model needs to have some
special properties that mainstream models
don't provide?
What if you have some unique concerns that
others typically disregard or haven't researched
yet, such as minimizing the hardware footprint
of a model or dealing with very large multispectral
images.
In many of those cases, you still don't have
to start from scratch.
There are many good practices and useful design
patterns that you can use in your model architecture.
That's what we will be covering in this video.
Let's first talk about efficient model architectures.
Despite the recent advances in automated network
architecture search, hand-designed models
are still relevant, especially when it comes
to designing efficient models.
For example, the ShuffleNetV2 paper argues
that automatically found network architectures
are much slower in practice even when they
require a smaller number of operations to
run.
Their paper reports that MobileNetV2, a hand-designed
model, is much faster than the NasNet-A, which
was a result of an automated network architecture
search process.
This is because the speed of a model depends
not only on the number of floating-point operations
but also on memory access costs and platform
characteristics.
Keeping those in mind, the ShuffleNetV2 paper
proposes some guidelines to design efficient
model architectures, optimized for inference
speed.
One of their guidelines is based on the observation
that equal channel width at both ends of a
layer minimizes memory access costs.
So it's a good practice not to change the
number of input and output channels too frequently.
Models that make use of bottlenecks and expand-and-contract
modules such as SqueezeNet and MobileNet V2
violate this guideline.
It doesn't mean you should never use them,
though.
Speaking of bottlenecks, something I observed
is that extremely narrow bottlenecks also
hurt training stability.
A few dead neurons and the entire model collapses.
So, if you really need to use layers having
very few filters, such as 8 or smaller, using
linear activation instead of ReLU at the end
of those layers would at least prevent dead
neurons.
MobileNetV2 also does something similar by
using ReLU activations at the expansion layers
while keeping the bottlenecks linear.
Another guideline is that network fragmentation
reduces the degree of parallelism.
Using a lot of small operations instead of
a few large ones decreases efficiency since
they are not very GPU-friendly, and they introduce
extra overheads.
This is a well-known phenomenon and is one
of the reasons why some automatically designed
models run slower.
Network architecture search algorithms may
result in heavily fragmented architectures
when accuracy and the number of operations
are the only search criteria.
The paper also points out is that element-wise
operations have a non-negligible cost.
Point-wise operations such as ReLU and 1x1
convolutions have a small number of floating-point
operations, but their memory access cost is
non-negligible.
Therefore, one shouldn't consider them free
in model architecture design.
As you may know, my work mostly focuses on
models that operate on image data, and I haven't
been working on anything natural language
processing related for the past few years.
However, it's hard not to see how successful
the Transformer model has been in the field
of natural language processing.
This new type of architecture relies on what's
called `attention mechanisms.'
At a very high level, an attention mechanism
tells a model where to look; what parts of
the input signal are more relevant.
You can think of it as a module that generates
a weight vector given a query.
For example, to resolve what 'it' refers to
in "This video is very interesting.
I liked it.", an attention vector would put
a higher weight on the words "this" and "video."
This type of attention mechanism is called
self-attention.
It's straightforward to see how attention
mechanisms help in this example, but can we
apply this type of self-attention also to
models that deal with computer vision problems?
Yes, we can.
We don't have words and sentences in images,
but we certainly can design mechanisms to
shift the attention towards particular spatial
locations or feature maps.
For example, this paper, titled "Squeeze-and-Excitation
Networks," proposes a self-attention mechanism
that assigns weights to feature maps based
on their relevance for a given input.
The way they do it is quite simple.
For a given layer, they first squeeze the
feature maps into a global description vector
by averaging over the spatial axes.
This is basically a global average pooling
operator.
Then, they feed that information into a mini
fully-connected neural network that outputs
an attention vector.
Finally, they take those weights in the attention
vector and multiply them with their corresponding
feature maps at the input.
This process essentially puts the attention
on more relevant feature maps by recalibrating
their channels.
This method can easily be applied to many
types of convolutional neural network architectures,
such as ResNet.
MobileNetV3 also makes use of similar attention
modules, combining manual design practices
with an automated network architecture search
approach.
For many computer vision tasks, it seems that
it's a good strategy to use self-attention
mechanisms in convolutional neural network
architectures.
At the beginning of the video, I mentioned
using a well-known model architecture off-the-shelf
would be sufficient in many cases.
But if you are trying to solve a problem that
has some specific requirements or limitations,
then you may need to make some task-specific
design choices to adapt your model to a targeted
application.
Let's go through two such examples.
The first one is a pixel-wise segmentation
model that I designed to handle very large
input images.
The goal was to create surface water maps
given satellite imagery.
This is a typical semantic segmentation task
that any mainstream pixel-wise prediction
model would be expected to perform well.
However, satellite imagery can be much larger,
like orders of magnitude larger, than images
that the mainstream models are designed to
deal with.
So, I needed to make some design choices to
process large inputs in one shot, without
dividing them into tiles, given a certain
memory budget.
It's a common practice to double the number
of feature maps whenever you downscale the
input by two and vice versa.
In this setting, higher-resolution layers
get much more memory allocation than the coarser
scale ones.
Because downscaling an image by two in both
spatial axes while doubling the number of
channels still reduces the overall feature
map volume by two.
One design choice I made was to quadruple
the number of channels whenever the feature
maps are spatially downscaled so that the
model uses constant memory throughout the
network and layers at different levels of
abstraction get their fair share of memory
allocation.
This approach also has some downsides, but
for this particular task, it worked very well.
I published a paper on this in the IEEE Geoscience
and Remote Sensing Letters very recently.
There's a lot more in the paper.
Check it out to learn more about it.
Let's move on to the second custom model design
example, in which our goal was to minimize
the latency and hardware footprint of our
model.
We used several tricks to do that.
The main innovation in our model was the use
of 3-way separable FIR-IIR filters for the
purposes of line buffer minimization, and
I'll explain what that means.
The concept of separable convolutions is nothing
new.
Many efficient model architectures use depthwise
separable convolutions.
However, as I mentioned in one of my earlier
videos, 3-way separable convolutions having
vertical, horizontal, and depthwise components
are fairly uncommon.
There is a reason for that.
Convolutional neural networks typically use
very small kernel sizes, such as 3x3.
Therefore, breaking down a spatial convolution
into column and row convolutions doesn't really
save much.
I'm guessing that's also why popular models
don't use depthwise separable convolutions
in the first layer, although I haven't seen
it stated explicitly in the papers.
Since the input is usually a 3-channel RGB
image, the number of channels is too few for
depthwise separability to be worth it.
In our case, spatial separability was very
useful for factorizing the hardware cost.
The cost of vertical convolutions was disproportionally
high because of the number of lines that need
to be buffered.
The hardware acquired images line-by-line.
Therefore, a column convolution required more
elements to be buffered.
For example, a 5x1 convolution would need
4 lines of data to be buffered, whereas a
1x5 convolution would need only 4 elements
to be buffered.
We addressed this problem by replacing the
vertical convolutions in a 3-way separable
convolution layer with infinite impulse response
filters.
You can think of those IIR filters as recurrent
neural network modules that summarize pixels
in the vertical direction.
Unlike fixed-window convolutions, our separable
FIR-IIR filters start processing their input
as soon as the pixels arrive, without having
to buffer the lines that would be spanned
by a fixed-sized window.
This reduces latency and the size of the line
buffers, leading to significant savings in
silicon area.
You can find our paper in the description
below.
My final tip in this video is not to follow
any advice or guideline religiously, including
my own.
Things change fast, especially in this field.
Guidelines, rules of thumb, and design patterns
are practical, but they don't always work
well.
Things change, our understanding of things
change.
So it's better to keep an open mind.
Alright, that's all for today.
I hope you liked it.
Links are in the description, as always.
Subscribe for more videos.
Happy belated new year.
Thanks for watching, stay tuned, and see you
next time.
