Hey! I’m Aurélien Géron, and in this video
I’ll tell you all about Capsule Networks,
a hot new architecture for neural nets. Geoffrey
Hinton had the idea of Capsule Networks several
years ago, and he published a paper in 2011
that introduced many of the key ideas, but
he had a hard time making them work properly,
until now.
A few weeks ago, in October 2017, a paper
called “Dynamic Routing Between Capsules”
was published by Sara Sabour, Nicholas Frosst
and of course Geoffrey Hinton. They managed
to reach state of the art performance on the
MNIST dataset, and demonstrated considerably
better results than convolutional neural nets
on highly overlapping digits. So what are
capsule networks exactly?
Well, in computer graphics, you start with
an abstract representation of a scene, for
example a rectangle at position x=20 and y=30,
rotated by 16°, and so on. Each object type
has various instantiation parameters. Then
you call some rendering function, and boom,
you get an image.
Inverse graphics, is just the reverse process.
You start with an image, and you try to find
what objects it contains, and what their instantiation
parameters are. A capsule network is basically
a neural network that tries to perform inverse
graphics.
It is composed of many capsules. A capsule
is any function that tries to predict the
presence and the instantiation parameters
of a particular object at a given location.
For example, the network above contains 50
capsules. The arrows represent the output
vectors of these capsules. The capsules output
vectors. The black arrows correspond to capsules
that try to find rectangles, while the blue
arrows represent the output of capsules looking
for triangles. The length of an activation
vector represents the estimated probability
that the object the capsule is looking for
is indeed present. You can see that most arrows
are tiny, meaning the capsules didn’t detect
anything, but two arrows are quite long. This
means that the capsules at these locations
are pretty confident that they found what
they were looking for, in this case a rectangle,
and a triangle.
Next, the orientation of the activation vector
encodes the instantiation parameters of the
object, for example in this case the object’s
rotation, but it could be also its thickness,
how stretched or skewed it is, its exact position
(there might be slight translations), and
so on. For simplicity, I’ll just focus on
the rotation parameter, but in a real capsule
network, the activation vectors may have 5,
10 dimensions or more.
In practice, a good way to implement this
is to first apply a couple convolutional layers,
just like in a regular convolutional neural
net. This will output an array containing
a bunch of feature maps. You can then reshape
this array to get a set of vectors for each
location. For example, suppose the convolutional
layers output an array containing, say, 18
feature maps (2 times 9), you can easily reshape
this array to get 2 vectors of 9 dimensions
each, for every location. You could also get
3 vectors of 6 dimensions each, and so on.
Something that would look like the capsule
network represented here with two vectors
at each location. The last step is to ensure
that no vector is longer than 1, since the
vector’s length is meant to represent a
probability, it cannot be greater than 1.
To do this, we apply a squashing function.
It preserves the vector’s orientation, but
it squashes it to ensure that its length is
between 0 and 1.
One key feature of Capsule Networks is that
they preserve detailed information about the
object’s location and its pose, throughout
the network. For example, if I rotate the
image slightly, notice that the activation
vectors also change slightly. Right? This
is called equivariance. In a regular convolutional
neural net, there are generally several pooling
layers, and unfortunately these pooling layers
tend to lose information, such as the precise
location and pose of the objects. It’s really
not a big deal if you just want to classify
the whole image, but it makes it challenging
to perform accurate image segmentation or
object detection (which require precise location
and pose). The fact that capsules are equivariant
makes them very promising for these applications.
All right, so now let’s see how capsule
networks can handle objects that are composed
of a hierarchy of parts. For example, consider
a boat centered at position x=22 and y=28,
and rotated by 16°. This boat is composed
of parts. In this case one rectangle and one
triangle. So this is how it would be rendered.
Now we want to do the reverse, we want inverse
graphics, so we want to go from the image
to this whole hierarchy of parts with their
instantiation parameters.
Similarly, we could also draw a house, using
the same parts, a rectangle and a triangle,
but this time organized in a different way.
So the trick will be to try to go from this
image containing a rectangle and a triangle,
and figure out, not only that the rectangle
and triangle are at this location and this
orientation, but also that they are part of
a boat, not a house. So, yeah, let’s figure
out how it would do this.
The first step we have already seen: we run
a couple convolutional layers, we reshape
the output to get vectors, and we squash them.
This gives us the output of the primary capsules.
We’ve got the first layer already. The next
step is where most of the magic and complexity
of capsule networks takes place. Every capsule
in the first layer tries to predict the output
of every capsule in the next layer. You might
want to pause to think about what this means.
The capsules in the first layer try to predict
what the second layer capsules will output.
For example, let’s consider the capsule
that detected the rectangle. I’ll call it
the rectangle-capsule.
Let’s suppose that there are just two capsules
in the next layer, the house-capsule and the
boat-capsule. Since the rectangle-capsule
detected a rectangle rotated by 16°, it predicts
that the house-capsule will detect a house
rotated by 16°, that makes sense, and the
boat-capsule will detect a boat rotated by
16° as well. That’s what would be consistent
with the orientation of the rectangle.
So, to make this prediction, what the rectangle-capsule
does is it simply computes the dot product
of a transformation matrix W_i,j with its
own activation vector u_i. During training,
the network will gradually learn a transformation
matrix for each pair of capsules in the first
and second layer. In other words, it will
learn all the part-whole relationships, for
example the angle between the wall and the
roof of a house, and so on.
Now let’s see what the triangle-capsule
predicts.
This time, it’s a bit more interesting:
given the rotation angle of the triangle,
it predicts that the house-capsule will detect
an upside-down house, and that the boat-capsule
will detect a boat rotated by 16°. These
are the positions that would be consistent
with the rotation angle of the triangle.
Now we have a bunch of predicted outputs,
what do we do with them?
As you can see, the rectangle-capsule and
the triangle-capsule strongly agree on what
the boat-capsule will output. In other words,
they agree that a boat positioned in this
way would explain their own positions and
rotations. And they totally disagree on what
the house-capsule will output. Therefore,
it makes sense to assume that the rectangle
and triangle are part of a boat, not a house.
Now that we know that the rectangle and triangle
are part of a boat, the outputs of the rectangle
capsule and the triangle capsule really concern
only the boat capsule, there’s no need to
send these outputs to any other capsule, this
would just add noise. They should be sent
only to the boat capsule.
This is called routing by agreement. There
are several benefits: first, since capsule
outputs are only routed to the appropriate
capsule in the next layer, these capsules
will get a cleaner input signal and will more
accurately determine the pose of the object.
Second, by looking at the paths of the activations,
you can easily navigate the hierarchy of parts,
and know exactly which part belongs to which
object (like, the rectangle belongs to the
boat, or the triangle belongs to the boat,
and so on). Lastly, routing by agreement helps
parse crowded scenes with overlapping objects
(we will see this in a few slides). But first,
let’s look at how routing by agreement is
implemented in Capsule Networks.
Here, I have represented the various poses
of the boat, as predicted by the lower-level
capsules. For example, one of these circles
may represent what the rectangle-capsule thinks
about the most likely pose of the boat, and
another circle may represent what the triangle-capsule
thinks, and if we suppose that there are many
other low-level capsules, then we might get
a cloud of prediction vectors, for the boat
capsule, like this. In this example, there
are two pose parameters: one represents the
rotation angle, and the other represents the
size of the boat. As I mentioned earlier,
pose parameters may capture many different
kinds of visual features, like skew, thickness,
and so on. Or precise location. So the first
thing we do, is we compute the mean of all
these predictions. This gives us this vector.
The next step is to measure the distance between
each predicted vector and the mean vector.
I will use here the euclidian distance here,
but capsule networks actually use the scalar
product. Basically, we want to measure how
much each predicted vector agrees with the
mean predicted vector. Using this agreement
measure, we can update the weight of every
predicted vector accordingly.
Note that the predicted vectors that are far
from the mean now have a very small weight,
and the ones closest to the mean have a much
stronger weight. I’ve represented them in
black. Now we can just compute the mean once
again (or I should say, the weighted mean),
and you’ll notice that it moves slightly
towards the cluster, towards the center of
the cluster.
So next, we can once again update the weights.
And now most of the vectors within the cluster
have turned black.
And again, we can update the mean.
And we can repeat this process a few times.
In practice 3 to 5 iterations are generally
sufficient. This might remind you, I suppose,
of the k-means clustering algorithm if you
know it. Okay, so this is how we find clusters
of agreement. Now let’s see how the whole
algorithm works in a bit more details.
First, for every predicted output, we start
by setting a raw routing weight b_i,j equal
to 0.
Next, we apply the softmax function to these
raw weights, for each primary capsule. This
gives the actual routing weights for each
predicted output, in this example 0.5 each.
Next we compute a weighted sum of the predictions,
for each capsule in the next layer. This might
give vectors longer than 1, so as usual we
apply the squash function.
And voilà! We now have the actual outputs
of the house-capsule and boat-capsule. But
this is not the final output, it’s just
the end of the first round, the first iteration.
Now we can see which predictions were most
accurate. For example, the rectangle-capsule
made a great prediction for the boat-capsule’s
output. It really matches it pretty closely.
This is estimated by computing the scalar
product of the predicted output vector û_j|i
and the actual product vector v_j. This scalar
product is simply added to the predicted output’s
raw routing weight, b_i,j. So the weight of
this particular predicted output is increased.
When there is a strong agreement, this scalar
product is large, so good predictions will
have a higher weight.
On the other hand, the rectangle-capsule made
a pretty bad prediction for the house-capsule’s
output, so the scalar product in this case
will be quite small, and the raw routing weight
of this predicted vector will not grow much.
Next, we update the routing weights by computing
the softmax of the raw weights, once again.
And as you can see, the rectangle-capsule’s
predicted vector for the boat-capsule now
has a weight of 0.8, while it’s predicted
vector for the house-capsule dropped down
to 0.2. So most of its output is now going
to go to the boat capsule, not the house capsule.
Once again we compute the weighted sum of
all the predicted output vectors for each
capsule in the next layer, that is the house-capsule
and the boat-capsule. And this time, the house-capsule
gets so little input that its output is a
tiny vector. On the other hand the boat-capsule
gets so much input that it outputs a vector
much longer than 1. So again we squash it.
And that’s the end of round #2. And as you
can see, in just a couple iterations, we have
already ruled out the house and clearly chosen
the boat. After perhaps one or two more rounds,
we can stop and proceed to the next capsule
layer in exactly the same way.
So as I mentioned earlier, routing by agreement
is really great to handle crowded scenes,
such as the one represented in this image.
One way to interpret this image (as you can
see there is a bit of ambiguity), you can
see a house upside down in the middle. However,
if this was the case, then there would be
no explanation for the bottom rectangle or
the top triangle, no reason for them to be
where they are.
The best way to interpret the image is that
there is a house at the top and a boat at
the bottom. And routing by agreement will
tend to choose this solution, since it makes
all the capsules perfectly happy, each of
them making perfect predictions for the capsules
in the next layer. The ambiguity is explained
away.
Okay, so what can you do with a capsule network
now that you know how it works.
Well for one, you can create a nice image
classifier of course. Just have one capsule
per class in the top layer and that’s almost
all there is to it. All you need to add is
a layer that computes the length of the top-layer
activation vectors, and this gives you the
estimated class probabilities. You could then
just train the network by minimizing the cross-entropy
loss, as in a regular classification neural
network, and you would be done.
However, in the paper they use a margin loss
that makes it possible to detect multiple
classes in the image. So without going into
too much details, this margin loss is such
that if an object of class k is present in
the image, then the corresponding top-level
capsule should output a vector whose
length is at least 0.9. It should be long.
Conversely, if an object of class k is not
present in the image, then the capsule should
output a short vector, one whose length
is shorter than 0.1. So the total loss is
the sum of losses for all classes.
In the paper, they also add a decoder network
on top of the capsule network. It’s just
3 fully connected layers with a sigmoid activation
function in the output layer. It learns to
reconstruct the input image by minimizing
the squared difference between the reconstructed
image and the input image.
The full loss is the margin loss we discussed
earlier, plus the reconstruction loss (scaled
down considerably so as to ensure that the
margin loss dominates training). The benefit
of applying this reconstruction loss is that
it forces the network to preserve all the
information required to reconstruct the image,
up to the top layer of the capsule network,
its output layer. This constraint acts a bit
like a regularizer: it reduces the risk of
overfitting and helps generalize to new examples.
And that’s it, you know how a capsule network
works, and how to train it. Let’s look a
little bit at some of the figures in the paper,
which I find interesting.
This is figure 1 from the paper, showing a
full capsule network for MNIST. You can see
the first two regular convolutional layers,
whose output is reshaped and squashed to get
the activation vectors of the primary capsules.
And these primary capsules are organized in
a 6 by 6 grid, with 32 primary capsules in
each cell of this grid, and each primary capsule
outputs an 8-dimensional vector. So this first
layer of capsules is fully connected to the
10 output capsules, which output 16 dimensional
vectors. The length of these vectors is used
to compute the margin loss, as explained earlier.
Now this is figure 2 from the paper. It shows
the decoder sitting on top of the capsnet.
It is composed of 2 fully connected ReLU layers
plus a fully connected sigmoid layer which
outputs 784 numbers that correspond to the
pixel intensities of the reconstructed image
(which is a 28 by 28 pixel image). The squared
difference between this reconstructed image
and the input image gives the reconstruction
loss.
Right, and this is figure 4 from the paper.
One nice thing about capsule networks is that
the activation vectors are often interpretable.
For example, this image shows the reconstructions
that you get when you gradually modify one
of the 16 dimensions of the top layer capsules’
output. You can see that the first dimension
seems to represent scale and thickness. The
fourth dimension represents a localized skew.
The fifth represents the width of the digit
plus a slight translation to get the exact
position. So as you can see, it’s rather
clear what most of these parameters do.
Okay, to conclude, let’s summarize the pros
and cons. Capsule networks have reached state
of the art accuracy on MNIST. On CIFAR10,
they got a bit over 10% error, which is far
from state of the art, but it’s similar
to what was first obtained with other techniques
before years of efforts were put into them,
so it’s still a good start. Capsule networks
require less training data. They offer equivariance,
which means that position and pose information
are preserved. And this is very promising
for image segmentation and object detection.
The routing by agreement algorithm is great
for crowded scenes. The routing tree also
maps the hierarchy of objects parts, so every
part is assigned to a whole. And it’s rather
robust to rotations, translations and other
affine transformations. The activation vectors
somewhat are interpretable. And finally, obviously,
it’s Hinton’s idea, so don’t bet against
it.
However, there are a few cons: first, as I
mentioned the results are not yet state of
the art on CIFAR10, even though it’s a good
start. Plus, it’s still unclear whether
capsule networks can scale to larger images,
such as the ImageNet dataset. What will the
accuracy be? Capsule networks are also quite
slow to train, in large part because of the
routing by agreement algorithm which has an
inner loop, as you saw earlier. Finally, there
is only one capsule of any given type in a
given location, so it’s impossible for a
capsule network to detect two objects of the
same type if they are too close to one another.
This is called crowding, and it has been observed
in human vision as well, so it’s probably
not a show-stopper.
All right! I highly recommend you take a look
at the code of a CapsNet implementation, such
as the ones listed here (I’ll leave the
links in the video description below). If
you take your time, you should have no problem
understanding everything the code is doing.
The main difficulty in implementing CapsNets
is that it contains an inner loop for the
routing by agreement algorithm. Implementing
loops in Keras and TensorFlow can be a little
bit trickier than in PyTorch, but it can be
done. If you don’t have a particular preference,
then I would say that the PyTorch code is
the easiest to understand.
And that’s all I had, I hope you enjoyed
this video. If you did, please thumbs up,
share, comment, subscribe, blablabla. It’s
my first real YouTube video, and if people
find it useful, I might make some more. If
you want to learn more about Machine Learning,
Deep Learning and Deep Reinforcement Learning,
you may want to read my O’Reilly book Hands-on
Machine Learning with Scikit-Learn and TensorFlow.
It covers a ton of topics, with many code
examples that you will find on my github account,
so I’ll leave the links in the video description.
That’s all for today, have fun and see you
next time!
