So, convolutional neural
networks for image recognition.
So, this is the plan for this
lecture.
I'm going to give you a little bit
of background about convolutional
neural networks or, as I'll be
referring to them henceforth,
ConvNets because that's a lot
easier to say. So, I'll give you a
little bit of background on
ConvNets and sort of the ideas
behind them but, crucially, also
the history behind them because
they are really something that
has developed a lot over the
past decade.
And then, in the second part,
we'll talk a little bit about
the building blocks of
ConvNets. We'll go into quite
some detail about the
convolution operation and how
it's used in these neural
networks.
And then, in the third part, we'll put these
building blocks together into
convolutional neural networks,
and I'll sort of show you
how they fit together. In
the fourth part,
we'll look at some case studies,
some very successful
convolutional neural network
architectures that were
developed in recent years.
And that includes
some more advanced building
blocks as well.
And then, to wrap up the lecture,
I'll hint at a few more advanced
topics and also talk a little
bit about how ConvNets might
be used, not just for image
recognition, which is what we'll
be talking about today, but maybe
also other applications, other
data modalities, things like
that.
So, let's start with some
background.
Last week - I don't know who was here last week - my colleague Wojciech was here talking about neural networks.
And I'm going to recap very
briefly a diagram from his slide
deck.
So, this is how we can visualise
a neural network.
So, it's basically a sequence of
operations, a sequence of
computations, and data goes in at
the one end, and at the other
end
we want to make a prediction of
some variable that we can
extract from this data. And we
have what's called a training
data set where we have a bunch of
data points with associated
target values that we would like
the model to predict. And then we
have a loss function, indicated
in orange here, which is going to
measure how well our network is
predicting the target values.
And we're going to try and
change the parameters of our
neural network.
These are basically the weights in the linear layers, over here and over here. We're going to adapt those to try and minimise the loss - in this case, the cross entropy - and we do that using gradient descent, computing the gradients with an algorithm called backpropagation, which Wojciech talked about in detail last week.
So, I'm going to talk about image
recognition with neural networks,
and so the first question we
need to ask ourselves is: How can
we actually feed images into a
neural network?
Because the neural networks that
Wojciech described last week, they take
vectors as input.
Right.
You basically give it a series
of numbers, and then it produces
another series of numbers at the
output.
So, neural networks operate on
vectors.
So how, essentially, can we turn
images into vectors?
This is an image that I'll use
as sort of a basis for all the
examples that I'll be talking
about.
So, we have a fairly simple image
here with some background and a
tree in the foreground, this sort
of one meaningful object in this
image.
So, we want to feed that image
into the neural network.
How do we turn it into a vector
so that we can do that?
An image is actually not a vector. A digital image is a two-dimensional grid of pixels. So, it has a structure to it - a topological structure.
And, basically, so we have this
two-dimensional grid,
it has a height, it
has a width.
And then, for each discrete
position in the grid, we record
the intensity and the colour of
the light, essentially, to create
a digital image. And so, for the
purposes of this lecture, I'm
going to use a slightly stylised
version of this tree image, where
you can actually see the
individual pixels. Because the
operations that we'll be talking
about, the convolutional
operation- the convolutional
operator and the pooling layers
and several other layers
that will be used in
the convolutional neural
networks,
they will operate at this
pixel level, so it's very
important to understand that
this is what images look
like to a computer.
They're just a grid of numbers
corresponding to the colours and
intensities at discrete
positions.
But so, as I've already said, a
neural network actually expects
a vector of numbers as input,
so we need to turn this thing
into a vector.
And so the simplest thing we
could do is just take the rows
of pixels in this image, one by
one, and just kind of string them
together into one long row.
And this is a vector.
So, this image is nine by nine
pixels so this is 81
numbers representing our image.
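To make that flattening step concrete, here is a minimal sketch in NumPy; the 9-by-9 single-channel image and its values are just placeholders:

```python
import numpy as np

# A stand-in for our stylised 9x9 greyscale tree image (values are arbitrary).
image = np.random.rand(9, 9)

# Concatenate the rows of pixels into one long vector of 81 numbers.
vector = image.flatten()          # equivalently: image.reshape(-1)
print(vector.shape)               # (81,)
```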
And that's a valid thing to do.
So, now you can just take the network that you built with Wojciech last week and just feed in these vectors, and you can train the model to try and predict, for example, that there is a tree in this image - you could do classification of objects in images.
So, that works.
But it's not ideal. Because, for
example, let's see what happens
if we slightly change the image
by shifting the content.
So, if you look at this
new image - I'll just go back and forth
so you can see the difference -
it's essentially the same image,
but we've slightly moved the tree to
the top left corner of the
image.
So, you can imagine that we're
taking the photograph from a
slightly different angle, for
example. For the purposes of
image classification,
this is still the same image,
right?
It's the same.
It's the same type of object.
So, we would expect the output of
our neural network to be exactly
the same thing.
Now, if we do the same thing
again as before, and we turn this
into a vector by just taking the
rows of pixels and
concatenating them, stringing them
together, this will end up
looking very different than what
we had before. The image kind of
looks the same to us.
I mean, it's shifted a little
bit,
but we can still see it's a tree. But
if you look at these vectors,
they look very, very different,
and to a neural network,
they will look very, very
different. Like, the foliage of
the tree was mainly here before,
but now it's kind of shifted all
the way to the left, the trunk is
somewhere else.
So, this is kind of challenging for the network because it will have to learn to detect many different patterns to be able to say with confidence that, oh, this is an image of a tree.
And so, clearly, just this
flattening operation is not the
way to do it.
We actually want to change the
architecture of the network that
we're building to take into
account
this grid structure
that the image has. And so two
key properties of natural images,
of photographs, are locality and
translation invariance. So,
locality is the idea that pixels that are nearby in the image - that are close together in the grid - will tend to be correlated.
So, if you have a
photograph, usually there'll be a
few objects in there, and those
objects tend to take up actually
quite a small portion of the
total number of pixels in the
image.
And so those pixels
corresponding to the object,
they're going to be very highly
correlated, but a pixel in the
top left corner of the image is
not going to be very correlated
with a pixel in the bottom right
corner, for example.
So, that's locality.
And related to that,
another important property is
translation invariance,
which is that meaningful
patterns in the image can
actually occur anywhere.
So, I have this example here with
a photograph of a bird in the
sky.
And in these four photographs,
the bird is in a slightly
different position.
Again, you could imagine that
you're taking the photograph
from a slightly different angle.
But, clearly, it's the same bird,
and it should be classified as a
bird regardless of where it is.
And so, if you think of
this bird as a pattern, this
pattern can actually occur at a
lot of different positions in
the image.
So, I have a few examples of
photographs here that
exhibit these characteristics
sort of in an extreme way.
And so you can see that
there's lots of patterns going
on here, like, the individual
objects sort of give rise to
patterns that are repeated
across the entire image and that
can also occur at different
scales. So, in the image on the right, for example, there are interesting patterns in terms of the brickwork on the wall, but there's also this window pattern that occurs multiple times in the image.
And this also hints at the fact
that these images are
compositional.
So, there are sort of objects and
textures in the image at many
different scales, and smaller
patterns sort of join to form
larger patterns and join to form
objects.
And that's a very
important point that we'll exploit in
convolutional neural
networks.
I also want to point out that images are not the only data modality that has this property.
Think about audio, for example.
If you record a sound or,
more specifically, if you record
speech, someone speaking,
then the phonemes that the
person is pronouncing, the sounds
that the person is making with
their mouth,
they can occur anywhere
in the signal.
You don't know a priori when
they're going to pronounce which
part of the word.
So, again, that's this
translation invariance, but in
only one dimension this time, in
the time dimension. Textual data
exhibits this property. If you
take a page from a book, a
particular word could appear
anywhere on the page.
Another interesting one, I think,
is graph-structured data.
So, if you think about maybe
molecules, organic molecules have
a lot of patterns that can occur
at various positions in the
graph that represents
the connectivity structure
between the atoms.
So, how do we take advantage of
this topological structure,
of this grid structure of images,
and of this compositional nature
of images? So, the first thing we
can do is something called
weight sharing.
When we have a particular hidden unit in, say, the first layer of our neural network that detects a particular local pattern, for example this one, then we might also want units in the network that can detect the same pattern at different spatial positions in the image. And we can simply achieve that by taking the weights corresponding to the connections going into that unit and making copies of that unit all across the image, shifting the pattern across the image. So, that's weight sharing. And
a second thing we can do is
we can make our models
hierarchical because, as I said,
these images are sort of
compositional, and combinations
of patterns give rise to more
interesting, more complex
patterns.
So, we could incorporate that in
our model by stacking lots of
layers that extract
progressively more abstract, more
high-level features. And so, in
this image, I've demonstrated
that you have sort of edges and
textures that can combine to
form these object parts, and
object parts then combine into
entire objects that you might
want to detect if you're trying
to do image recognition.
So, before we go on with the
technical details about
convolutions, I want to talk a
little bit about the history of
these models and how they came
to be the way they are today.
And the key story behind that is
that data drives research.
So, the availability of
interesting data sets has a
massive impact on how much
innovation might happen in a
particular field.
And so, for the computer vision field
sort of the thing that
kickstarted this ConvNet
revolution was actually the
ImageNet challenge, which was a
competition that was run from
2010 to 2017,
so until a few years ago, that
turned into a really major
computer vision benchmark,
so a lot of research was
done on this data set.
So, every year they ran a
competition, and the idea was
that you got a data set of
1.4 million images,
so quite a lot of images, and in
fact, I would say orders of
magnitude larger than what
people had been using before
then.
And these 1.4 million images were divided into a thousand different classes, so different types of objects: household objects, animals, lots of different things.
There was actually an interesting
imbalance in the data set in
that they included lots and
lots of different dog breeds,
so about 100 of those 1000
classes are actually different
dog breeds.
And this is kind of interesting because it forced people to build models that could really pay attention to the details of objects to tell them apart. Because it's quite easy to tell apart a cat and a dog, but if you have to tell apart certain dog breeds, that's a lot more difficult. You need a lot more local detail for that.
So, the goal here was image
classification, and another
challenge of this data set was
that the objects that one had to
identify, that one had to
classify, weren't always front and
centre in the images.
So, there were images that might
have multiple objects in them.
And to sort of compensate for
that, the performance of the
models for the
purpose of the competition was
measured using top five
accuracy or top five error rate.
So, the idea is that your model gets five guesses. It can pick five classes out of 1,000, and if the correct class is among those five guesses, then you get a point - then that's good.
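As a concrete illustration, here is one way you might compute top-five accuracy from a matrix of class scores; the array shapes and names are just for this sketch:

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (num_images, num_classes) model outputs; labels: (num_images,) true class ids."""
    # Indices of the five highest-scoring classes for each image.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # An image counts as correct if its true label is among those five guesses.
    hits = np.any(top5 == labels[:, None], axis=1)
    return hits.mean()

scores = np.random.rand(10, 1000)            # e.g. 10 images, 1,000 ImageNet classes
labels = np.random.randint(0, 1000, size=10)
print(top5_accuracy(scores, labels))
```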
So, this is a diagram of the top
five classification error rate
of the competition winners each
year, from 2010 to 2017.
So, in 2010 and 2011, people used
traditional computer vision
techniques that were state of
the art at the time.
The idea there is that you try to extract some kind of feature representation from your image that you think will capture relevant properties of the objects in the image. But it's entirely handcrafted, so there's no learning involved, or very little learning involved, in that process.
And then, once you have those
features, you can do what
we did before actually, you can
turn them into a vector and then
you can feed them into a simple
classifier, which could be a
neural network or an SVM, and
that kind of used to be how
things were done.
And so, using that strategy, you can actually do reasonably well. You can get about two thirds, maybe three quarters of your answers right with this top five accuracy metric. But then, in 2012,
something interesting happened,
and this was actually the year
where Alex Krizhevsky,
Ilya Sutskever, and Geoffrey
Hinton submitted their AlexNet
model to the competition, and
this was a ConvNet.
So, this was actually one of the
first ConvNets that was
trained at this scale on this
larger data set.
ConvNets had been around
before, you know, since
the 90s, maybe even the 80s,
depending on who you ask.
ConvNets have been around, but
they hadn't really been applied
at this scale before.
And people didn't actually
expect them to work at this
scale.
So, that was kind of the most interesting aspect of this: suddenly these ConvNets were actually outperforming the existing state of the art by a very large margin.
So, this was actually one of the
first major successes of deep
learning
altogether I would say, in 2012.
And then, in 2013, people sort of
noticed this, and immediately
everyone switched over to
ConvNets.
So, in 2013, basically all the
entries of the competition were
ConvNets. And what they
had done was, they had taken AlexNet
and they had added some
extra tricks, added a few layers,
added a few modifications and
got the error rate down a little
bit further.
So you can see here from 16
percent down to 12 percent.
But then, in 2014,
another very interesting thing
happened and people sort of
started taking ConvNets a bit
further. They looked at AlexNet,
and they started questioning
fundamentally sort of the design
decisions in this model and
asked, like, how can we actually
do even better?
And so this gave rise to
models that I'll talk about in
more detail later called VGGNet
and GoogLeNet. GoogLeNet
is a reference to LeNet,
which is one of the first
incarnations of ConvNets from
the early 90s.
So, these models are much deeper.
They have more layers. But they
also have a lot more intricate
architecture.
So, people thought more about the
challenges of training these deep
models and tried to figure out
how to do that.
Then, in 2015, we had another
major breakthrough with ResNet,
or residual networks.
The idea there is the
introduction of residual
connections where you add new
connections in your network that
allow it to skip a few
layers.
And this enabled training of
proper deep networks with
hundreds of layers.
And these residual connections
are basically an essential
component of neural networks
today.
Almost every network has them.
So this was a very important
innovation that was again sort
of driven by this competition.
And so we'll take a closer look
at this one later as well
in the section about case
studies. After ResNet,
performance kind of saturated, so
we see that there's still some
improvements in 2016 and 2017,
but there's no sort of major
breakthroughs anymore after
this.
People started combining the
predictions from lots of models,
there are a few other building blocks
that were tried but nothing that
resulted in as dramatic an
improvement as we'd seen in the
years before.
So, after 2017, the organisers - this was organised by students at Stanford University - said, you know what, this problem is essentially solved. The best models were already a lot better than a human could do, even a trained human, on this data set.
So, we're essentially considering
this problem solved, and we
should move on to more
challenging data sets, other
problems.
So, now let's look at some of the
building blocks of convolutional
neural networks, so I'm going
to get my tree out again.
So, this is the tree
image from before, and I'm going
to again use a stylised version
with nine by nine
pixels. And we're going to look
at how we can go from fully
connected to locally connected.
So, what do I mean by that? Fully
connected
is like in a traditional neural
network where you connect every
input value, every
element of the input vector to
every hidden unit in the first hidden
layer and so on. You always
connect every unit to every
other unit.
And we're
going to move to locally connected
because, as we said before, we
know that objects tend to take
up a small portion of the image,
and so local correlations
between pixels are much more
relevant, much more interesting
than correlations between pixels
that are far away.
So.
So, this is a fully connected unit in a hidden layer. I'm representing the hidden layer here as a vector, essentially, of numbers. And we're highlighting this particular unit here and looking at its connectivity structure.
So, I didn't draw all the lines
because that would be tedious,
but imagine that this thing is
actually connected to all the
pixels in the input image, and
then how do we compute the
output of this unit?
We basically multiply the
weights, the parameters
associated with each of these
connections, with the input pixel
values, and then we optionally
add a bias term, and then we get
essentially a linear function of
the pixels.
And then, after that, we can apply
nonlinearity to that if we want
to,
and that's essentially what our
neural network layer does.
I should also mention that, in
practice -
so, I've used this image here,
I say this has 81 connections -
in practice this will actually
have 243
connections, three times as many.
And that's because a
pixel in an image is not
represented by a single value,
it's actually represented by
three values, right.
Red, green, and blue.
I'm not drawing that here to
keep things simple, but keep in
mind that there are actually
three colour channels in this
image that we all feed into the
network.
So, this is a fully connected
layer in a normal neural network.
How can we make this locally
connected? So, we can basically
connect each unit only to a
local region in the image.
So, that's the first thing we'll
do.
So, instead of having 81
or 243 connections
here, we'll only connect the three
by three region in the image to
this particular unit.
And then, instead of having our units in a vector representation here, I also made this two-dimensional because this will also kind of preserve the topology, the input topology. This grid structure will also be in the output of our neural network.
So, we're going to say this unit
connects to units up here, so
I'm going to put this here, and then
this unit connects to inputs
down here, so I'm going
to put this here.
So, now we have locally connected
units with a three by three
receptive field.
So, that's a word I'll use
more often later.
The receptive field is what
we call the part of the image
that the unit can see,
essentially.
So, this formula doesn't actually
change much. The only thing is
now that we're no longer
summing over the entire image,
we're only summing the
contributions over a local
region in the image. And so this
will reduce the number of
parameters in the network quite
drastically because each unit
just has many fewer connections
going into it.
Now, how can we go from locally
connected to convolutional?
That's just the introduction of
weight sharing, essentially.
So, all we're saying now is that
we have this locally connected
unit here and we have another
one here,
and we're just going to make
these weights the same.
We're going to say that the
parameters that they use will be
the same.
And so the result is a
convolution operation. That's
essentially all there is to it.
And so we write that with this
asterisk.
There are many notations that
are used in the literature for
this operation but the asterisk
is the most common.
So, we have some weight
vector that actually matches up
with a three-by-three region in
the image, and we sort of slide
it across the image to compute
the outputs of our hidden
units. And what this means is
that the resulting operation
becomes equivariant to
translation.
So, if we translate the
image of the tree like we did
before, then this resulting
output is also going to be
translated in the same way.
And that's interesting because
it means that our network kind
of preserves this original
structure.
So, as I already said, the region that a unit connects to is called the receptive field. And when we slide a particular unit across the entire image and group its outputs into a 2D grid, that's what we're going to call a feature map.
The weights associated with each
unit we're going to call the kernel
or the filter.
Both terms are used
interchangeably, and as I said,
this operation will
preserve the topology of the
input, so the feature map is also
grid-structured. So, how can we
implement this operation in
practice?
So, we take this kernel and we
essentially slide it over the
image.
And this is basically a
filtering operation, right.
We're applying a filter to the
image, but the weights of our
filter are actually going to be
learned in this instance.
So, the kernel will slide across
the image and then we produce an
output value at each position,
and I'm indicating these with
different grayscale values here.
And so, once that's done, we have
this new representation of our
image that's still two-
dimensional and that basically
contains detections of that
particular feature in the image.
So, if part of the image
matched the weights in the
kernel very well, then we're
going to get a very high value
at the output,
so that means that the feature
was detected.
And if there's no match, then
we're going to get like very low
value at the output and then the
feature wasn't detected.
So, we can sort of interpret this
as a feature detection map.
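Here is a minimal sketch of this sliding-window computation for a single-channel image and a single kernel; in a real network the kernel weights would be learned, but here they're just placeholders. (Strictly speaking, this is cross-correlation, which is what deep learning libraries implement under the name "convolution".)

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a k-by-k kernel over a 2D image and record one output value per position."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.zeros((h - k + 1, w - k + 1))      # output shrinks by k - 1 in each direction
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]     # the receptive field at this position
            out[i, j] = np.sum(patch * kernel)  # high value = the pattern was detected here
    return out

image = np.random.rand(9, 9)                    # stand-in for the 9x9 tree image
kernel = np.random.rand(3, 3)                   # stand-in for learned weights
print(conv2d_valid(image, kernel).shape)        # (7, 7)
```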
Now, in practice, we will have
multiple kernels, not just one.
So, I've given them different
colours.
And then each of these will be
convolved with the input image
and give rise to a different
feature map.
So, we get multiple feature maps,
and we will refer to those
multiple feature maps as the
channels of the output of the
convolution. As I already said before, the image is of course an RGB image, so it also has three channels.
So, what we're going to do then
is each filter is
going to consist of a sub-filter
for each colour channel, and we're
basically just going to sum the
contributions of the different
input channels together.
So, what that means is that each
output channel here is connected
to all the input channels in the
input of the convolution
operation. And so that means that
the inputs and the outputs of the
convolution operation are tensors.
They're three-dimensional
objects that have a width, a
height, and a number of channels.
So, that's true for images, as I
already said before - red, green
blue -
but that's also true for the
output of our convolution
operation, and all the output
channels of the convolution will
be connected to all the input
channels as well.
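To see those channel dimensions concretely, here is a sketch using PyTorch's Conv2d; the channel counts are arbitrary examples, not anything from a particular architecture:

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 16 output feature maps; each 3x3 kernel spans all input channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

x = torch.randn(1, 3, 9, 9)      # a batch of one 9x9 RGB image
y = conv(x)
print(y.shape)                   # torch.Size([1, 16, 7, 7])
print(conv.weight.shape)         # torch.Size([16, 3, 3, 3]): one 3-channel sub-filter per output channel
```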
So, let's take a look at a few
variants of the convolution
operation that have been used
over the years.
So, the simplest one is a valid
convolution, and in a valid
convolution, we're only going to
compute output values of the
feature map where we're able to
fully overlap the kernel and the
image.
So, we're only going to compute a
value where we can get this full
overlap.
And what this means is that the
output will be slightly smaller
than the input.
So, if our input image is nine by
nine and you convolve with a
three-by-three filter, what we're
going to get out is actually a
seven-by-seven feature map
because there are only seven
possible offsets for our filter
with respect to our image
where we can compute a
valid output. So, the output size
is going to be the input size
minus the kernel size plus one.
The opposite of that is the full
convolution where we're actually
going to try and compute outputs
wherever the kernel and the
image overlap by at least one
pixel.
And so, in practice, what you do is you actually just pad the image with some zeros - or whatever value you like, but typically people just pad with zeros - and then you do the same thing as before.
So, you can think of a
full convolution as a valid
convolution but with added
padding, and the result is
actually going to be a feature map
that's larger than your original
image because there are more
valid offsets than there are
actually pixels in the original
image.
So, the output size is going to
be the input size plus the
kernel size minus one.
And so if we stack a lot of
valid convolutions on top of
each other, the effect that
that's going to have is that the
size of our feature maps is
gradually going to shrink,
whereas if we stack lots of full
convolutions, the size of the
feature maps is gradually going
to grow.
And neither of those are really
desirable in convolutional
neural networks.
So, there's a third variant
that's actually the most popular
variant today, which is
called the same convolution,
where we try to pad just enough zeros so that the output feature maps will have the same size as the image.
And so, for our three-by-three
kernel, you just need to pad with one
row of zeros all around the image,
and then you get a nine-by-nine
feature map at the output.
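Putting the three variants side by side, here is a quick sketch of the output-size arithmetic for a 9-by-9 input and a 3-by-3 kernel at stride one:

```python
def output_size(n, k, padding):
    """Output size of a stride-1 convolution on an n-pixel-wide input with a k-wide kernel."""
    return n + 2 * padding - k + 1

n, k = 9, 3
print(output_size(n, k, padding=0))            # valid: n - k + 1 = 7
print(output_size(n, k, padding=k - 1))        # full:  n + k - 1 = 11
print(output_size(n, k, padding=(k - 1) // 2)) # same:  n = 9 (odd kernel sizes only)
```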
Note that this version actually
only makes sense if your kernel
has an odd size.
If your kernel is even sized,
then you would have to pad
asymmetrically, you would
have to pad slightly more on one
side than on the other.
In practice this problem doesn't
really come up because everyone
just uses odd kernel sizes.
I've seen very few ConvNets
where people actually use even
kernel sizes, and so the nice
thing about this is that if we
stack lots of same convolutions
on top of each other, you can do
that as much as you want, the
size of the feature map will not
change. Of course, what might
happen is that we get some edge
artifacts, because some of
our filters might actually end
up detecting the edges of the
image.
If the padding value is zero, which typically corresponds to a black pixel, a filter might actually detect this corner as something meaningful, whereas actually that's just the corner of the image, so that's something to look out for.
There's a few other variants
that are interesting, so there's
a so-called strided convolution
where, instead of computing the
output of the convolution at
every possible offset, we're
actually going to skip some
steps.
And the nice thing about this is that it is obviously a lot cheaper to compute, because if you use a stride of two, for example, you reduce the computation by a factor of four, since you increase the step size in both the height and the width direction.
And so this gives you a way to
reduce computation.
It also gives you a way to
reduce the resolution of the
feature maps that you're
operating on, and this will be
useful when we stack lots of
layers together and we want to
create a feature hierarchy,
right.
Because you would like the
higher-level features in the
model to be operating at a
larger scale.
You want them to see more of the
image. And so strided
convolutions can be very useful
to sort of create this
hierarchy.
So, if we move the filter
again, you can see that in this
case I'm doing a valid
convolution and doing a
valid convolution with a
stride of two, and I'm getting
a four-by-four feature map, which
is obviously a lot smaller than
the nine-by-nine image that we
started with.
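The output-size arithmetic with a stride is a small extension of the formula above; here is a sketch that reproduces the nine-to-four reduction from this example:

```python
def strided_output_size(n, k, stride, padding=0):
    """Output size of a strided convolution: floor((n + 2p - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

print(strided_output_size(9, 3, stride=1))  # 7: the valid convolution from before
print(strided_output_size(9, 3, stride=2))  # 4: same kernel, but skipping every other offset
```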
Another interesting variant is
the dilated convolution.
So, here we're not striding, so we're not skipping offsets, but we are skipping values of the filter.
So, if you want to increase the
receptive field of a convolution,
the naive thing to do would be
just to increase the size of the
kernel.
If you have a very large kernel, you have a large receptive field, but that can get very expensive because the cost - the number of parameters and the computational cost - will increase quadratically with the kernel size that you choose.
And so dilation is a very cheap
way to do that where you
basically say, typically, the
features in my image will
actually vary slowly over space,
so it's okay to subsample, it's
okay to skip a few steps and not
compute the feature value there
because it's not going to be
that interesting anyways.
It's probably just going to be
interpolation between the
two values beside it anyway, so
we can safely skip it.
And so, in a dilated convolution, you basically pretend that you have a larger filter but with a bunch of zeros in there. And this can be computed more efficiently than you would think naively,
because you don't actually have
to pad, you don't actually
have to put those zeros in there
and then do lots of multiplies
with zero.
You can actually do this
efficiently with some reshapes
and other tricks of the tensors.
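One way to see the effect of dilation is through the effective kernel size; here is a small sketch, using PyTorch's dilation argument as an example of how this is exposed in practice:

```python
import torch
import torch.nn as nn

def effective_kernel_size(k, dilation):
    """A k-wide kernel with gaps covers dilation*(k-1)+1 input positions."""
    return dilation * (k - 1) + 1

print(effective_kernel_size(3, dilation=1))  # 3: ordinary convolution
print(effective_kernel_size(3, dilation=2))  # 5: same 9 weights, wider receptive field

conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2)
print(conv(torch.randn(1, 1, 9, 9)).shape)   # torch.Size([1, 1, 5, 5]): 9 - 5 + 1 = 5
```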
And then a final variant that I
want to talk about is a depth-
wise convolution because that
one's really come to the
forefront more recently.
So, as I said before, in the convolution operations that we've talked about, every output channel will be connected to every input channel.
So, we kind of have dense
connectivity between the
channels. In a depth-wise
convolution that's not the case.
In a depth-wise convolution
we have one output channel per
input channel, and there's no
interaction between the channels,
and so that dramatically reduces
the number of parameters that
this convolution has,
but obviously it's also a lot
less expressive. But it's
being used more and more as
a building block together with
other types of convolutions.
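In PyTorch-style APIs, a depth-wise convolution is usually expressed with the groups argument; here is a quick sketch of the parameter savings, with arbitrary channel counts:

```python
import torch.nn as nn

dense = nn.Conv2d(32, 32, kernel_size=3, bias=False)                 # every output sees every input channel
depthwise = nn.Conv2d(32, 32, kernel_size=3, groups=32, bias=False)  # one output channel per input channel

print(sum(p.numel() for p in dense.parameters()))      # 32 * 32 * 3 * 3 = 9216
print(sum(p.numel() for p in depthwise.parameters()))  # 32 * 1 * 3 * 3  = 288
```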
And then, finally, pooling is
another operation that's very
common in convolutional neural
networks.
So, this is kind of an
alternative to strided
convolutions to reduce the
resolution of the feature maps.
Basically, what you do is you
look at local windows of your
input and you just compute some
aggregation function of those
inputs, so that will typically be
the mean of the values or the
maximum of the values, and then
you compute those for all
positions in the grid, and then
you get your output feature map,
which will typically be a lot
smaller. So, here I've done this
directly on the pixels and in
practice you might want to do
this inside the network on your
feature maps.
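Here is a minimal sketch of 2-by-2 max pooling on a single feature map, cropping the input if its height or width is odd:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Take the maximum over non-overlapping 2x2 windows, halving the resolution."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.random.rand(8, 8)
print(max_pool_2x2(x).shape)   # (4, 4)
```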
So, now let's talk about
convolutional neural networks and
how these building blocks
actually fit together in neural
networks.
So, I've already been referring
to them as ConvNets.
So, there are actually two abbreviations in common use today: CNN and ConvNet. You'll see both used interchangeably. I like ConvNet, it's easier to say. We'll stack up to
hundreds of these convolution
operations together in a model,
and we'll alternate convolutions
and pooling, or possibly
strided convolutions, to
create a feature hierarchy, where
higher layers in the model will
extract more abstract features
from the image.
So, a brief recap about neural
networks as computational
graphs.
So, this is a diagram that I took from Wojciech's deck last week.
So, this is kind of a
computational graph
representation of a neural
network where we have nodes
representing the input.
So, that's both the image in
our case, the input image, but
also the target, the
prediction that we're trying to
match.
And then we have lots of
computational nodes, and some of
these nodes are linear layers of
convolutions, which have
learnable parameters.
And so these are indicated in
pink here, and then at the output
side, we also have the loss
in orange.
So, I'm going to simplify this
diagram.
I'm not going to display the
parameters anymore.
They're implicit, so they're
considered part of the
computational nodes.
I'm also not going to show the loss
because that's always
there,
that's not what we're focusing
on right now.
So, I'm not going to draw that on
the diagram.
So, I just have an input node and
some computation nodes here,
straighten it out a little bit
as well, and then I'm actually
going to differentiate the
computational nodes because
that's what we're interested in
here,
sort of the architecture of
our ConvNets.
So, I'm going to distinguish
between fully connected layers,
which are the layers that
Wojciech also talked about, the
typical neural network layers
where every unit is densely
connected to all the units in
the previous layer,
those will be in pink, and then the
convolution operations will be
in teal, the pooling operations
will be in purple,
and
I've left the nonlinearities
in dark blue as before.
So, now let's talk about some
more interesting, more recent
convolutional neural network
architectures.
So, the one that I actually just
showed you is an existing one
called LeNet-5.
So, this is one of the earliest
published ConvNet
architectures.
So, this was a ConvNet for
handwritten digit recognition,
so it operated on fairly small
images, I think, 28 by 28 or 32 by
32 grayscale images of handwritten
digits and tried to produce
a classification for which
digit was in the image. And this
had kind of what was until
then sort of the canonical
structure of a ConvNet, which
was, you had the input image, and
then you had a few convolutions
interspersed with pooling, so
there was always this structure
of convolution, nonlinearity,
pooling, convolution, nonlinearity,
pooling. And then, at some point,
you would do the vectorisation
operation that we talked about
before, you would just take the
feature map at this point and
just flatten it into a vector,
and then, from there on, it
would just be a regular fully
connected neural network.
So, you would have a few fully
connected layers interspersed
with nonlinearities and then
maybe a softmax nonlinearity
at the end to do the actual
classification.
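To make that canonical structure concrete, here is a rough LeNet-5-style sketch in PyTorch; the layer sizes are simplified and illustrative rather than taken exactly from the original paper:

```python
import torch
import torch.nn as nn

# A LeNet-5-style ConvNet for 32x32 greyscale digit images (sizes are illustrative).
lenet_style = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),  nn.Tanh(), nn.AvgPool2d(2),  # conv -> nonlinearity -> pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # conv -> nonlinearity -> pooling
    nn.Flatten(),                                                 # the vectorisation step
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84),         nn.Tanh(),
    nn.Linear(84, 10),                                            # one output per digit class
)

x = torch.randn(1, 1, 32, 32)
print(lenet_style(x).shape)   # torch.Size([1, 10])
```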
And then, in 2012, obviously we had AlexNet, as I said. So, this is actually an architecture diagram from the paper. It's cut off at the top, and it's like this in the paper.
To this day, we
don't know if that's intentional
or not, but the reason for the
sort of unusual structure of
this diagram is that this was a
very big model that was trained
on two GPUs.
So, the model was actually
kind of split over two different
processors that each contained
half of the parameters of each
layer,
and so that's why you have this
kind of separation running
through in the middle and you
have very few connections going
across because that was
communication between two GPUs.
So, that was very costly,
especially at that time.
So, you had this kind of two-
stream network.
So, the architecture of this model is a bit more complex than LeNet.
So, now we have eight layers.
That's eight layers with
parameters, so that's five
convolutional layers and three
fully connected layers.
Something else that was new here was the ReLU nonlinearity. Before this, people tended to use saturating nonlinearities like the sigmoid function or the tanh function, which are sort of limited in their output range.
And it was actually really hard
to train deeper networks than
say four or five layers with this
set-up.
So, with LeNet, it was okay, but if
you added a few layers to
LeNet, you would be in
trouble.
And people actually found that you can just use the ReLU, which is defined as literally just the maximum of its input and zero.
So, basically, if the input is
negative, you set it to zero.
And this has a discontinuity at
zero.
And people thought, oh, if we have
discontinuities in our
nonlinearities, then gradient-based
optimisation is no longer going
to work, right.
Because that uses the gradient,
so clearly that's
only going to work if the gradient
is defined everywhere. And it
turns out that's just not true.
As soon as someone tried this, it turned out that this was actually a very nice nonlinearity to use, because it improved the propagation of the gradient information throughout the model.
And so it enabled deeper
networks.
And so this is actually kind of
a key innovation here: to use
these ReLU nonlinearities.
Other important innovations
include regularisation
strategies.
So, as I said, this model was
proposed for the ImageNet
data set, which is quite a large
data set,
so you would think that maybe
you don't need to regularise the
model too much because you have
so much data. But Alex Krizhevsky's response was just to make his model really, really massive, with many millions of parameters.
And so he still needed
regularisation to make sure that
the model wouldn't overfit to
this data set.
And so weight decay was used, which
is kind of a traditional
regulariser, where you just
make sure that the magnitude of
the parameters doesn't grow too
much. But also dropout,
and that was also kind of a new
thing,
the idea that you can regularise
neural networks by randomly
removing units during training.
And the idea is that this makes
the other units more robust to
potentially having inputs that
are absent or that are distorted,
and that turns out to be an
extremely good regulariser.
So that was another important
innovation of
AlexNet. So, as I said, this
was trained on two GPUs, and it
actually took six days to train
one of these models back in the
day.
Nowadays you can train it in
minutes.
So, if we use our colour
scheme, then the diagram of this
network looks like this. I kind
of had to wrap around here, and
you can see that it's
already about twice as deep as
LeNet was.
So, if we walk through this
from the input to the output: So,
at the input you have images
coming in,
three channels, and the images
were scaled to 224 by 224
pixels, which is a lot larger
than any ConvNet had used
before then.
So, typically, ConvNets would
use inputs of say 32 by 32, but
people had never really gone to
that scale. And so the
way they did this was actually
by having a very large stride in
the first convolutional layer.
So, only the first convolutional
layer was operating at this very
high resolution, and then
immediately the resolution would
be reduced by a factor of four,
which meant that the amount of computation was reduced by a factor of 16 from there on.
So, it had an 11 by 11
kernel, again, as I said, an
odd-sized kernel because that's
what people use, 96
channels, a stride of four,
and so its output size was
56 by 56 by 96,
so a lot smaller spatially
but obviously a lot more
channels. And then we had the ReLU
nonlinearities,
and then a max-pooling layer to
reduce it even further, down to
28 by 28 by
96, which means this is
essentially a pooling operation
where we just take the
maximum over two by two windows.
And so that means that
the rest of the network is
actually operating on things
that are 28 by 28 or smaller,
so not that different from the
networks that came before.
So, it's only really this first
layer that's doing a lot of hard
work to use this high-
resolution information in the
image.
And that was new.
That was an innovation of AlexNet.
I'm not going to go through all the
layers.
I'm going to skip ahead to the
last fully connected layer, which
is going to produce 1,000
outputs, one for each class in
the ImageNet data set, and then,
finally, we have a SoftMax
nonlinearity, which takes
the output of the fully
connected layer and turns it
into a categorical probability
distribution where we can
guarantee that the outputs of
the model will be probabilities
that sum to one.
So, they will form a valid
distribution over the classes. So,
here are all the layers again.
And you can sort of see that the
resolution actually is reduced
very rapidly at the start and
then more gradually throughout
the network.
So, another thing actually that
was kind of new here was the
realisation that we don't always
have to pair a convolutional
layer with a pooling layer.
So, this is done here at the
start, twice.
But then we have a few
convolutions with nonlinearities
in between where there's no
pooling happening, and people
just didn't do this before
AlexNet.
It wasn't considered to be a
valid thing to do.
So, it's kind of interesting that
these things that we maybe now
take for granted were
just not done.
So, by now, I think it's clear
that the story is that deeper
models tend to perform better,
and to get some insight into
that, you can consider each layer
as acting as kind of a linear
classifier that's detecting
particular patterns in its
input.
And so that means that, if you
stack more of these layers on
top of each other, you actually
get more nonlinearities in
your model.
You get a more powerful
parameterised function that you can
use to fit the targets.
And so the question arises: what is actually limiting the number of layers in ConvNets? Like, why was AlexNet eight layers?
Why wasn't it 80 layers? And,
obviously, an obvious one is
computational complexity.
Obviously, if you have more
layers, you have to do more
computation, which, you know, we
always have a limited
computational budget.
But there were actually other
issues as well, such as
optimisation.
So, if we have a deeper model, how
do we actually backpropagate
through that entire model?
How do we do credit
assignment?
If a model makes a mistake, how
do we assign responsibility for
that mistake to particular units
in the network? And that gets
harder and harder as you stack
more layers on top of each
other.
So, in 2014, we had VGGNet, and
there, again, we see a doubling in
depth essentially.
So, you see, I now need four lines
to fit this model.
And so there the idea was that
they kind of took this sequence
of three ConvLayers from AlexNet
to an extreme, and they said, we
can actually do this all the way
throughout the network.
We can stack many, many
convolutional layers on top of
each other before we actually do
any pooling, and we can use
same convolutions.
So, with padding so that the
output feature maps are the same
size as the input to avoid
resolution reduction where we
don't want it.
So, if we're stacking these convolutional layers, we don't want resolution reduction.
We want to be in control of
where the resolution is reduced,
and that's going to be in the
pooling layers. And so another
idea from VGGNet is actually to
fix the kernel size and to not
treat this as a hyperparameter
of the architecture.
So, unlike AlexNet, which had
different kernel sizes for
different convolutional layers,
VGGNet only uses three by
three kernels throughout.
So, that simplifies the search
for good architectures
considerably. Because what they
realised was, if we want a larger
receptive field, we don't
necessarily need to take
a single layer and make its
receptive field larger by
increasing its kernel size,
we can actually just stack three-
by-three filters to create a
larger receptive field that
spans multiple layers.
So, here, if we have a stack of
two three-by-three convolutions,
we can sort of see in blue, these
are receptive fields of the
first convolutional layer, and then
in red I've superimposed the
receptive field of the second
convolutional layer with respect
to its input, so with
respect to the outputs of the
first layer.
But, if we look at these two
layers as one block and sort of
look at the receptive field of
the second layer with respect to
the input of the first, we see
that it's actually five by five.
So, it grows. By stacking two three-by-three convolutions, we can actually create something that has an equivalent receptive field to a single layer with a five-by-five kernel, but it will have fewer parameters, and it will be more flexible because we can also insert an extra nonlinearity there.
So, it'll be able
to model more interesting
functions.
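A quick sketch of how the receptive field grows as we stack stride-1 three-by-three convolutions; each extra layer adds kernel_size - 1 pixels:

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of a stack of stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)

print(stacked_receptive_field(3, 1))  # 3: one 3x3 layer
print(stacked_receptive_field(3, 2))  # 5: two stacked 3x3 layers match a single 5x5 kernel
print(stacked_receptive_field(3, 3))  # 7: three stacked 3x3 layers match a single 7x7 kernel
```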
So, in terms of architecture,
VGGNet had up to 19 layers,
so, again, quite a bit more than
the eight layers of AlexNet.
It only used three-by-three
kernels with same convolutions.
In terms of infrastructure,
this was also a bit of an
upgrade,
so this was trained on four GPUs,
and it was trained for two to
three weeks.
So, very patient people at VGG
in Oxford.
And another thing here that was
interesting is they use data
parallelism, not model
parallelism.
So, for AlexNet, the model was
kind of split over these two GPUs,
and you saw that this
actually affects the
architecture, like it affects
which parts of the model
we can connect to each other.
So, what
was done for VGGNet
instead is data parallelism
where you just take your batch
of data that you're processing
and you just split it into four
parts, and then you have the
entire network on all four
processors, on all four GPUs,
and you just compute
predictions on smaller
sub-batches, predictions and
gradients obviously during
training.
So, this is the top five error rate on ImageNet for different versions of VGGNet with different numbers of layers.
So, they had variants with 11
layers, 13 layers, 16 layers, and
19 layers.
And what's interesting here is
that obviously, up to a point,
deeper is better.
So, 16 is better than 13, which is better than 11.
But it seems like their
performance saturates after 16
layers.
They tried one with 19 layers
and saw that it was actually
slightly worse.
So, what's actually going on
there?
At the time we didn't know, but later models actually use a lot of tricks to prevent this from happening, because what's happening here is an optimisation issue, right. Now you have these 19
layers of computation, and it's
starting to get harder to do
credit assignment.
So, I've actually already mentioned both of these.
The challenges of deep neural
networks are computational
complexity, more layers is more
computation - that takes time and
energy -
and there are also optimisation
difficulties that arise because
optimisation of the parameters
by gradient descent becomes a
lot harder.
And so we'll look at some ways
to address these challenges
next.
There will be a future lecture
in this series that will cover
optimisation of very deep models
in detail.
So, look out for that.
I'll just give a quick summary here; my colleague - I believe James Martens will be doing that lecture - will go over this in detail.
So, one thing we can do is be
very careful with how we
initialise the parameters of our
neural network.
If we just randomly
initialise these from, say, a
uniform distribution, from minus
one to one, then that's not going
to work because the
activations, the outputs of the layers in our network, are going to grow as we go through the network.
And then, if we try to optimise
the network, we need to take the
gradients, which means we need to
do a backward pass through the
network, and those gradients,
those intermediate gradients
that we're going to compute, are
also going to grow.
So, we're actually going to get
exploding gradients if we do
that.
You might say, okay, just
make them really small. You can't
make them zero by the way
because you have to do some
symmetry breaking. Like, if you
initialise a neural network to
zeros, it has no way to
differentiate the different
units.
So, you do have to do something
random, but you could say like, okay,
initialise all the weights to
very small values, then what
you're going to get is vanishing
gradients, like, the gradients are just
going to collapse to zero
because your computer will not
have enough precision to
represent these really small
values.
So, you need to be a little bit
careful about how to initialise
these models, and people have
figured out various heuristics
to sort of ensure that the
gradients have the right scale
at the start of training. And
then luckily, if you do that at
the start of training, that tends
to be preserved throughout
training.
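As an example of such a heuristic (not something from this lecture specifically), here is a sketch of a He-style initialisation, which scales the weights so that activations keep roughly the same variance from layer to layer when ReLUs are used:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Sample weights with variance 2/fan_in, a common heuristic for ReLU networks."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(fan_in=243, fan_out=128)   # e.g. a layer taking our flattened 9x9x3 image
print(W.std())                         # close to sqrt(2/243), roughly 0.09
```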
Another thing you can do is use
very sophisticated optimisers.
So, obviously, you can just use
stochastic gradient descent to
train a neural network, but there
are lots of interesting variants
of that algorithm that are
specifically tailored to neural
networks and tend to do a better
job optimising them more
quickly.
Again, I'm going to
leave this to my colleague to go
into in detail. An architectural
innovation that can help with
this is the introduction of
normalisation layers.
So, I haven't talked about those
yet, but they're actually today
just as essential as the
convolutional layers and
pooling layers and the
nonlinearities.
So, we insert normalisation
layers throughout the network to
sort of scale the activations so
that they're in the right range
for optimisation to be easy.
And then, finally, we can also
just change the network design
to make gradient propagation
easier.
So, we can change the connectivity pattern, and the ResNets that I already briefly mentioned before, with their residual connections, are a good example of that.
So, let's also look at GoogLeNet
because this was actually
the winner of the 2014
competition. So, VGGNet
came second but was, in
retrospect, just as influential.
GoogLeNet was interesting
because it was a lot more
intricate and a lot more
complicated than previous
network designs had been.
So, people hadn't really
considered that you could
actually kind of branch out and
then
concatenate the resulting
feature maps and sort of have
these multiple convolution
operations operating side by
side.
So, this is kind of the canonical diagram of GoogLeNet.
This is a zoomed-in
version of one of these what's
called inception blocks.
And so you can see they have
multiple convolutions with
different kernel sizes and even
pooling operating in parallel.
This is the first version of the
inception module.
There have been various
iterations on top of this
in the meantime. I'm not
going to go over all the
different variants in detail
though. So, I mentioned
normalisation layers, so a key
innovation actually in the
second version of the inception
module was the introduction of
batch normalisation, and the idea of batch normalisation is that you essentially standardise the activations: you compute the mean and the variance, and you estimate these across a batch of data. So, you do this in every layer - you estimate the mean and the variance of the activations, and then you just normalise. And then, at the output of the normalisation, what you might want to do is have a trainable scaling factor and a bias, so that the activations aren't actually forced to be constrained to be zero-mean, unit-variance.
You want to retain the
expressivity of the original neural
network, but you want to have
this normalisation step in there
because it makes optimisation
easier.
And so that's how you can
do that.
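A minimal sketch of the batch-norm computation at training time for fully connected activations (the feature-map version also averages over the spatial dimensions):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Standardise per feature, then apply a learnable scale and shift."""
    mu = x.mean(axis=0)                    # mean estimated across the batch
    var = x.var(axis=0)                    # variance estimated across the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero-mean, unit-variance activations
    return gamma * x_hat + beta            # trainable scaling factor and bias

x = np.random.rand(32, 100)                # a batch of 32 activation vectors
y = batch_norm_train(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                   # roughly 0 and 1
```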
And so, introducing
normalisation layers
throughout the network
dramatically reduces the
sensitivity of the model to
initialisation.
So, even if you kind of wing it
in terms of how you initialise
the model, with batch norm you'll
actually be able to still
train it. And it makes it
more robust to larger learning
rates as well.
Another aspect of it, which can
be a downside or an upside
depending on what you're trying
to do, is it introduces stochasticity
and it acts as a regulariser
because these statistics, these
mus and
sigmas, they are estimated on
the current batch of data and
obviously the batch of data will
be relatively small.
So, you're going to get a
rough estimate of
these statistics, but it's not
going to be exact.
And that can actually be a good
thing because it introduces
noise into the network, as we
said before with dropout,
introducing noise into the model
can actually make it more robust
in practice.
So, this acts as a regulariser.
Now, a downside of this is that,
at test time, when you actually
want to use your model and make
predictions,
this will introduce a dependency
between the different images in
the batch, so you will get
different predictions for a
particular image, depending on
which other images are also in
the batch.
And that's not nice.
You would want your
model to give deterministic
predictions.
So, in practice what you need to
do is actually estimate these
statistics on a data set and just
keep track of them
separately and then use those at
test time, which is doable, but at least I've found in practice that this can be a source of a lot of bugs. If
something is wrong with my
neural network, batch norm is
usually the first thing that I
suspect.
So, the original GoogLeNet
did not use batch norm, but
all later versions of it use
this.
And if you look at a comparison
between the original GoogLeNet,
which is the dashed black
line here, and then
another version of this model,
which uses batch norm, you can
see that it actually converges a
lot faster and to a higher
accuracy. And this is because batch norm
makes the model able to take
larger learning rates
essentially without
diverging.
So, let's look at ResNet, which
came in 2015.
This is actually one of my
favourite sort of ConvNet
innovations because it's
beautiful in its simplicity.
The idea here is that, oh, if depth is what's preventing us from training deeper models, because it makes optimisation harder, why don't we give the network a skip connection? Why don't we let it skip a few layers if it needs to?
So that the gradient can
backpropagate more easily. And the
way they achieve that is
essentially by adding this
residual connection, which just means that you take the input of this layer and you just kind of add it back in later, which means that,
when you
backpropagate through this and you
take the gradient, you can
actually go along this pathway
and skip these convolutional
layers altogether. And so
residual connections facilitate
training much deeper networks,
and so the ResNet that
won the ImageNet competition
in 2015 was actually
152 layers deep.
So, again, a dramatic increase, an
order-of-magnitude increase
compared to the previous year.
So, there's a few different
versions of residual blocks,
residual modules. So, the top one
is the one I just showed you,
which is the original one from
the ResNet
paper in 2015, which is the one
used in the ImageNet
competition.
So, this one here has a three-by-three convolution, followed by batch norm, followed by a nonlinearity, and then a one-by-one convolution followed by batch norm.
And then you have the
residual connection coming in,
and then you have another nonlinearity.
So, this block was kind of
stacked many, many times to come
to 152 layers.
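A minimal sketch of such a block in PyTorch, following the sequence as described here; the channel count is a placeholder, and the published ResNet blocks differ slightly in detail, so treat this as illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv 3x3 -> batch norm -> ReLU -> conv 1x1 -> batch norm, plus a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual connection: gradients can skip the convolutions

x = torch.randn(1, 64, 28, 28)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 28, 28])
```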
There is a variant that they
used to reduce the number of
parameters, which is called a
bottleneck block.
And so that's this one.
So, as you can see, instead of two sequences of convolution, batch norm, nonlinearity, this one actually has three.
And what's happening here is it
has convolutions with very small
filter sizes, so one by one
essentially means that you do a
convolution but there's no
actual spatial integration of
information happening, you're
just kind of computing the same
function at every spatial
position.
And so what they did was they
used one by ones to reduce the
number of channels, reduce the
number of feature maps, and then
they do a three by three on
fewer feature maps, which means
you'll have fewer parameters in
your three-by-three convolution,
which is doing the actual work,
the actual spatial integration
of information.
And then you just go back out
and you increase the number of
channels again with another
one-by-one convolution.
So, this is a nice way to sort of
have a large representation but
still have fairly cheap
convolution operations in the
middle there. And then the third
version at the bottom is
called ResNet V2.
And there they actually just
moved the operations around a
bit.
So, as you can see, the batch norm
is here, nonlinearity is here,
then a three by three, then batch norm,
nonlinearity, one by one,
and then the residual connection
comes at the end here.
The advantage of this one is
that if you stack a lot of these
on top of each other, there's
actually a direct connection
from the output to the input
without any nonlinearities
on the path.
And so this allows you to go
even deeper and even go to
thousands of layers if you want
to.
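As a rough sketch of the bottleneck variant described above (PyTorch here is an assumption, and the exact layer ordering differs between the papers):

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # Sketch of a bottleneck residual block: 1x1 to reduce channels,
    # a cheap 3x3 in the narrow space, and 1x1 to expand back out.
    def __init__(self, channels, reduced):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),            # reduce channels
            nn.BatchNorm2d(reduced), nn.ReLU(),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),  # spatial integration
            nn.BatchNorm2d(reduced), nn.ReLU(),
            nn.Conv2d(reduced, channels, kernel_size=1),            # expand channels back
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # ResNet V2 instead puts batch norm and the nonlinearity before each
        # convolution, which leaves a clean identity path through the stack.
        return self.relu(self.body(x) + x)
```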
So, this is a table describing
the ResNet architectures from
the original paper.
So, these are a few different
versions with progressively increasing depth.
And you can kind of see the same
pattern as before. They
start out with high resolutions,
but they kind of very quickly
reduce them, and most of the
computation is actually done at
these lower resolutions.
You can see I actually made a
mistake earlier: the 152-
layer model on the right here
is actually
using the bottleneck blocks, not
the basic block at the top of the previous
slide.
And what's interesting here is
that this 152-
layer model, because it uses
bottleneck blocks, actually
requires fewer computational
operations, fewer floating-point
operations than the 19-layer
VGGNet did.
So, even though it's an order of
magnitude deeper, it's actually
cheaper to compute, which I think
is really nice, and it actually
also obviously performs a lot
better because of this depth.
Another variant of this idea is
DenseNet.
So, here we don't have residual
connections. In DenseNet, the
authors decided to make
backpropagation easier just by
connecting every layer to every
other layer.
So, whenever you stack a new
layer in a DenseNet, you
connect it to
all the previous layers
and not just the preceding one.
And so you get this dense
connection between layers, but
obviously each layer is still a
convolution, with batch norm, with
ReLUs inside.
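A minimal sketch of that connectivity pattern (the layers here are hypothetical helpers; each one has to accept the growing number of input channels):

```python
import torch

def dense_block(x, layers):
    # Each new layer sees the concatenation of all earlier feature maps,
    # not just the output of the previous layer.
    features = [x]
    for layer in layers:                          # each layer: e.g. BN -> ReLU -> conv
        new = layer(torch.cat(features, dim=1))   # concatenate along the channel axis
        features.append(new)
    return torch.cat(features, dim=1)
```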
The next innovation was also
introduced as part
of the ImageNet competition;
it was one of the last
ones there, in 2017.
The idea here is to incorporate
global context.
So, convolutions are
obviously great at capturing
local patterns, but sometimes you
might want to modulate
these patterns based on
the global context of the image,
the stuff that's happening
elsewhere in the image.
And so, for that purpose, it's
nice to try and compress the
entirety of the image into just
a feature vector and then kind
of broadcast that feature vector
to all spatial positions so that,
at any spatial position in
the image, you can actually get
some extra information about
what's going on elsewhere in the
image. So, you can
incorporate global context into
your features.
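A minimal sketch of one way to do this (not necessarily the exact module used in the 2017 winning entry): pool the feature map into a single vector, transform it, and broadcast it back to every position.

```python
import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                            # x: (batch, channels, H, W)
        context = x.mean(dim=(2, 3))                 # global average pool -> (batch, channels)
        context = torch.sigmoid(self.fc(context))    # summarise the whole image
        return x * context[:, :, None, None]         # broadcast back to every spatial position
```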
And then another strand of architecture design
that's become popular more
recently is neural architecture
search.
So, up until now we've been
talking about these
architectures that got
progressively more intricate, and
these were all hand designed by
humans, so humans basically
searched for the optimal
hyperparameters, the optimal number of
layers, the optimal kernel sizes in
these models.
And so people started to think,
like, maybe we can actually
automate that process as well,
maybe we can use a search
algorithm or even a machine-
learning algorithm to find the
best architecture, which we then
train to do image recognition.
And so AmoebaNet is a model that
arose from such an algorithm.
So, it's an architecture that was
found by an evolutionary
algorithm that basically
performed a search over acyclic
graphs composed of a set of
predefined layers.
So, they kind of said the
convolution operation is a
predefined layer, and then we
have a pooling operation, we have
different types of pooling, and
then basically connect these up
any way you want and find the
optimal connectivity pattern
that gives rise to a ConvNet
that works really well.
And so that's architecture
search. And then another trend in
recent years has been to try and
reduce the
computational complexity in
these models by parameterising
them in more efficient
ways.
So, I've already talked about
depth-wise convolutions and the
way that they reduce the
number of parameters
dramatically just by not
connecting all the input
channels to each output channel
but rather connecting channel by
channel.
Obviously, you pay a cost in
terms of expressivity, but
sometimes this can be worth it.
But people have used depth-wise
convolutions to build what's
called a separable convolution,
and a separable convolution is
essentially just a combination
of a depth-wise convolution
followed by a regular convolution
with a one-by-one filter size.
So, you're kind of dividing the
work here in the sense that a
depth-wise convolution will do
spatial integration,
it will sort of capture
information from a local
neighbourhood, and then the one-by-
one convolution that follows
will redistribute the
information across the channels,
which the depth-wise convolution
doesn't because it operates
within each channel.
And so, if you combine those two,
you have kind of a separable
version of a regular convolution,
where one part of the operation
does spatial integration and the
other part does integration over
the channels.
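As a sketch in code (again with PyTorch as an assumption), the separable convolution is just these two pieces chained together:

```python
import torch.nn as nn

def separable_conv(in_channels, out_channels):
    return nn.Sequential(
        # depth-wise 3x3: one filter per channel, spatial integration only
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels),
        # point-wise 1x1: redistributes information across the channels
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
    )
```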
This idea is also used in
another fairly modern building
block.
So, we've talked about bottleneck
blocks before, and I talked about
ResNet. The new cool thing is
inverted bottlenecks where,
instead of reducing the number
of channels and then applying a
three-by-three convolution, you
actually increase the number of
channels inside the bottleneck
and then apply a three-by-three
depth-wise convolution.
The idea being that you sort of
do spatial integration in this
really high-dimensional space
and then collapse it back into a
more manageable feature space
for communication between the
different parts of the network.
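A rough sketch of an inverted bottleneck in that spirit (the expansion factor and layer ordering here are assumptions, not a specific published block):

```python
import torch.nn as nn

def inverted_bottleneck(channels, expansion=4):
    wide = channels * expansion
    return nn.Sequential(
        nn.Conv2d(channels, wide, kernel_size=1),                      # expand channels
        nn.BatchNorm2d(wide), nn.ReLU(),
        nn.Conv2d(wide, wide, kernel_size=3, padding=1, groups=wide),  # depth-wise 3x3 in the wide space
        nn.BatchNorm2d(wide), nn.ReLU(),
        nn.Conv2d(wide, channels, kernel_size=1),                      # project back down
        nn.BatchNorm2d(channels),
    )
```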
So, that's it as far as case
studies are concerned, so to wrap
up this lecture I'm going to
give a brief overview of some
more advanced topics.
And I'm also going to talk a
little bit about ConvNets
beyond image recognition. So, one
thing I haven't mentioned, which
is actually a crucial ingredient
in many modern ConvNets is
data augmentation.
So, by design, ConvNets are
robust against translation.
If you translate an image, then
the internal representations
inside the ConvNet will also
translate, and eventually, because
of all the pooling, it becomes
quite easy for the model to be
invariant to that, to sort of ignore the
translation altogether and to
just classify the object that's
in the image. But obviously
translation is not the only
thing that can happen to an
image.
You could also rotate your
camera, for example, and take a
photograph from a different
angle.
You could go closer and farther
away, which changes the scale of
the object. It could be a bright day, it could be a
dark day, and that will affect the
lighting. All these things are essentially
nuisance factors,
nuisance factors of variation
that will not affect the
classification output that you
want but will obviously dramatically
affect the pixel values that the
model will
see.
And so the way to make these
models robust against these
variations is to just
artificially apply them during
training.
So, every time we feed an image
to our network during training,
we randomly perturb it with some
of these transformations.
So, I have some examples down
here of different perturbations
of this image.
And then we basically say: For
all of these, you should always
produce the output tree.
Because this is a tree
regardless of how it is
perturbed.
And that allows the model to
learn robustness against these
other transformations.
So, the robustness is not innate,
in the sense that the
architecture isn't designed to
be robust against these
transformations, but we can still
let the model learn that it
needs to be robust.
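In practice this is usually just a random transformation pipeline applied to every training image; a minimal sketch with torchvision (an assumption about the toolkit) might look like this:

```python
import torchvision.transforms as T

# Each time an image is drawn for training it gets a fresh random perturbation,
# but the target label stays the same ("tree", no matter how it is perturbed).
augment = T.Compose([
    T.RandomResizedCrop(224),         # random scale and crop
    T.RandomHorizontalFlip(),         # random mirroring
    T.ColorJitter(brightness=0.4),    # random lighting changes
    T.ToTensor(),
])
```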
We can also visualise the
patterns and the filters that a
ConvNet is actually learning.
And so a neat way to do that is
to take a unit in any layer of
the ConvNet and then to try to
maximise its activation by
changing the input image.
And so we can just do that with
gradient descent. Just like we
use gradient descent to train
these models,
we can do gradient descent with
respect to the input pixels to
try and find images that
maximally or minimally activate
different units in the network.
And we can do this for different
layers.
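A minimal sketch of that procedure (the truncated model and the chosen channel are hypothetical; there is no natural-image prior here):

```python
import torch

def visualise_unit(model_up_to_layer, channel, steps=200, lr=0.1):
    # Start from noise and do gradient ascent on the pixels to increase
    # the average activation of one channel in the chosen layer.
    image = torch.randn(1, 3, 224, 224, requires_grad=True)
    optimiser = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        activation = model_up_to_layer(image)[0, channel].mean()
        loss = -activation            # minimising the negative = maximising the activation
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return image.detach()
```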
And these are some figures from a
paper by Matthew Zeiler and his
colleagues, which really nicely
demonstrate this idea of
compositionality and hierarchy,
in the sense that these different
layers are learning different
patterns, and in layer two you
can kind of see that these are
fairly local patterns, there are some
edge detectors, some textures
here. In layer three, these are
starting to get aggregated into
more interesting patterns, and
then, if you go up to
layer four or five, you can actually see
that there's a
dog-head detector that
arises.
So, it's kind of nicely
showing that these
patterns are getting combined
into progressively more
interesting structures
throughout the network. We can
also run this procedure with
respect to the output layer of
the network.
So, if we take an output unit
corresponding to a particular
class,
so one of a thousand outputs,
and then we just say, you know,
find me the image that maximises
the probability of this output,
so we get sort of the canonical
image corresponding to a class,
this is what you can get out.
So, this is from
work by Karen Simonyan and his
colleagues.
So, you can kind of see the
objects in here.
I mean, obviously these don't
look like natural images, but you
can kind of see that there are
certain patterns that arise here,
and these are the patterns that
the network will try to pick up
on to classify images.
If you do this with a strong
prior, so you kind of add an
additional term in your
loss where you say this is what
an image looks like, this is what
a natural image should
look like, then you can get
images like these out.
So, this is from a
different paper where they
essentially use the same
procedure but they have this
extra prior that tries to make
the images look natural.
And so now you can sort of see
images that would maximally
activate particular units in the
ConvNet.
There's a really nice
interactive blog post on the
distill.pub about this idea
of feature visualisation.
So, it's interactive so you can
play with this, and you can kind
of look at all
different units in all the
different layers in a neural
net.
I definitely recommend checking
this out,
it's really cool to play with.
So, some other topics to explore
that I don't have time to go
into today are pre-training and
fine-tuning.
So, a lot of image
classification problems that are
of interest don't have large
data sets available.
So, ImageNet is obviously
1.4 million images, which is
pretty good.
But for many problems we may
have orders of magnitude fewer.
And so people have looked for
ways to sort of reuse the models
trained on ImageNet for
different tasks. And the way you
can do that, for example, is to
take a trained model on ImageNet
and chop off the top layer that
does the actual classification
and then just fit another layer
on top of those features.
And that turns out to work
really quite well.
And you can even fine-tune the
rest of the network a little bit
to work on your new data set.
So that's a very effective
strategy
to do transfer learning.
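A minimal sketch of that recipe (assuming a ResNet-style pretrained model whose final classifier is an attribute called fc; the names are illustrative):

```python
import torch.nn as nn

def adapt_for_new_task(pretrained_model, num_new_classes, freeze=True):
    if freeze:
        for param in pretrained_model.parameters():
            param.requires_grad = False       # keep the pretrained features fixed
    # chop off the old 1000-way classifier and fit a fresh layer for the new task
    pretrained_model.fc = nn.Linear(pretrained_model.fc.in_features, num_new_classes)
    return pretrained_model
```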
Another topic that I think is
very interesting is group
equivariant ConvNets.
So, we've talked about how
ConvNets are invariant to
translation, and then all other
invariances kind of have to be
learned,
so you can learn
them with data
augmentation. But you can
actually build ConvNets that
are intrinsically equivariant to
rotation, and to scale and
other things like that.
And this is a line of research
that's kind of taken
flight over the past three years
or so,
which I think is worth
exploring.
I also want to talk briefly
about recurrence and attention.
So, these are two other ways to
incorporate topological
structure of the input into our
network architectures.
I'm not going to talk about them
now because they'll be the
subject of future lectures but
so just to say that convolutions
are not the only thing you can
do to exploit grid structure or
sequence structure in your
input.
And so, to wrap up, let's talk
about what else we can use these
models for.
So, we've talked about models for
image recognition so far, but
obviously there are lots of
other tasks involving images
that could benefit from these
architectural priors.
So, what else can we do with
ConvNets? So, the object
classification here is in the
top left,
so that's what
we've been talking about so far.
You can also do object detection
where, in addition to identifying
the class of each object in the
image, you want to figure out
where in the image it is.
So, you want to produce a
bounding box for each object in
the image.
Another variant of this is
semantic segmentation where, for
each pixel in the image, you want
to identify what type of object
it is part of.
And then there's also instance
segmentation where you actually
want to segment the individual
objects even if there's multiple
of the same class. So, that's all
in the image space.
We can also generate images with
ConvNets in many different
ways.
So, there are various different
types of generative models,
generative adversarial networks,
variational autoencoders,
autoregressive models like PixelCNN
that all use the convolution
operation as a basic
building block because they also
benefit from
these priors of locality and
translation invariance.
So, these are some images that
were generated by BigGAN, which is
a generative adversarial network
developed by my colleagues at
DeepMind. There's more you can do with
ConvNets:
you can do representation
learning.
One thing that's getting very
popular right now is self-
supervised learning.
So, what do you do if you have a
very large collection of images
but no labels? Then
you can kind of create labels
yourself and do self-supervised
learning and hopefully get
features that might still be
useful for transfer learning
later. As I also mentioned
earlier, you can use ConvNets
for other data types like video,
audio, text, graphs, there's lots
of options there. You can use
ConvNets inside agents, inside
intelligent agents trained with
reinforcement learning.
There's lots of options.
Many of these will be talked
about in future lectures as
well.
So, to wrap up, I want to leave
you with this sort of statement,
which is that "Convolutional
neural networks replaced
handcrafted features with
handcrafted architectures."
And the reason I want to
stress that is because people
often used to see ConvNets as
kind of a magical thing that led
us to no longer have to sort of
be clever about what we do with
images, like, how do we exploit
structure in images? Oh, we
actually don't need to do that
anymore,
we just put a ConvNet on top and
it will figure it out. But that's
not how it ended up,
because we've actually used a
lot of our prior knowledge about
structure in images to design
these architectures to work
better.
So, we still have to be
intelligent.
We still have to do research.
We still have to use the prior
knowledge that we have about the
data that we're working with.
We're just incorporating it at a
higher level of abstraction than
we were before.
And we're using learning now.
So, learning is in the mix,
which is the main
differentiator I think.
So, that is all I have.
I want to thank you very much
for your attention.
