Hi, I’m Aurélien Géron, and today I’m
going to explain how Synthetic Gradients can
dramatically speed up training of deep neural
networks, and often even improve their performance
significantly.
We will also see how they can help recurrent
neural networks learn long term patterns in
your data, and more.
Synthetic Gradients were introduced in a paper
called “Decoupled Neural Interfaces using
Synthetic Gradients” published on arXiv
in 2016 by Max Jaderberg and other DeepMind
researchers.
As always, I’ll put all the links in the
video description below.
To explain Synthetic Gradients, let’s start
with a super quick refresher on Backpropagation.
Here’s a simple feedforward neural network
that we want to train using backpropagation.
Each training iteration has two phases.
First, the Forward phase: we send the inputs
X to the first hidden layer, which computes
its outputs h1 using its parameters theta1,
and so on up to the output layer, and finally
we compute the loss by comparing the network’s
outputs and the labels.
Then the Backward phase.
The algorithm first computes delta3, which
are the gradients of the loss with respect
to h3, then these gradients are propagated
backwards through the network, until we reach
the first hidden layer.
The final step of Backpropagation uses the
gradients we have computed to tweak the parameters
in the direction that will reduce the loss.
This is the gradient descent step.
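To make the two phases concrete, here is a minimal NumPy sketch of one training iteration for a three-layer network. The toy data, layer sizes, tanh activations, squared-error loss and learning rate are all assumptions made for illustration, and `train_step` is a hypothetical helper name. Biases are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed): 16 instances, 4 features, 1 target each.
X = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 1))

# Parameters theta1, theta2, theta3 (weights only, biases omitted).
theta = [rng.normal(scale=0.5, size=s) for s in [(4, 5), (5, 5), (5, 1)]]

def train_step(X, y, lr=0.05):
    n = len(X)
    # Forward phase: h1, h2, then the linear output h3, then the loss.
    h1 = np.tanh(X @ theta[0])
    h2 = np.tanh(h1 @ theta[1])
    h3 = h2 @ theta[2]
    loss = 0.5 * np.mean(np.sum((h3 - y) ** 2, axis=1))

    # Backward phase: delta_k is the per-instance gradient of the loss
    # with respect to h_k.
    delta3 = h3 - y
    delta2 = delta3 @ theta[2].T
    pre2 = delta2 * (1 - h2 ** 2)   # gradient at layer 2's pre-activation
    delta1 = pre2 @ theta[1].T
    pre1 = delta1 * (1 - h1 ** 2)   # gradient at layer 1's pre-activation

    # Gradient descent step (averaging over the batch).
    theta[2] -= lr * h2.T @ delta3 / n
    theta[1] -= lr * h1.T @ pre2 / n
    theta[0] -= lr * X.T @ pre1 / n
    return loss

losses = [train_step(X, y) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Note that no parameter is touched until delta3, delta2 and delta1 have all been computed, which is exactly the dependency that causes the locks discussed next.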
Okay, that’s it for Backpropagation.
Now suppose you want to speed up training.
You buy 3 GPU cards, and you split the neural
network in three parts, with each part running
on a different GPU.
This is called model parallelism.
Unfortunately, because of how Backpropagation
works, model parallelism is inefficient.
Indeed, to compute the loss, you first need
to do a full forward pass sequentially.
Each GPU has to wait for the previous GPU
to finish working on a training batch before
it can start working on it.
This is called the Forward Lock.
Notice that the model parameters cannot be
updated before the loss is computed.
And this is called the Update Lock.
And finally, we cannot update a layer’s
parameters before the backward pass is complete,
at least down to the layer we want to update.
This is called the Backward lock.
The consequence of all these locks is that
GPUs will spend most of their time waiting
for the other GPUs.
As a result, training on 3 GPUs using model
parallelism is actually slower than training
on a single GPU.
So, the main idea behind Synthetic Gradients
is to break these locks, in order to make
model parallelism actually work.
Let’s see how.
First we send the inputs to the first hidden
layer.
Then this layer uses its parameters theta1
to compute its outputs.
So far, nothing has changed.
But now we also send the outputs h1 to a magical
little module M1, called a Synthetic Gradient
model.
We’ll see how it works in a few minutes,
but for now it’s just a black box.
This model tries to predict what the gradients
for the first hidden layer will be.
It outputs the synthetic gradients delta1
hat, which are an approximation of the true
gradients delta1.
Using these synthetic gradients, we can immediately
perform a gradient descent step to update
the parameters theta1, no need to wait.
This hidden layer equipped with its Synthetic
Gradient model is effectively decoupled from
the rest of the network.
This is called a Decoupled Neural Interface,
or DNI.
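As a rough illustration of such a decoupled layer, here is a small NumPy sketch. The class name `DecoupledLayer`, the tanh activation, the linear form of the Synthetic Gradient model and its random initial values are all assumptions chosen for this demo, and the model M is not trained yet here.

```python
import numpy as np

rng = np.random.default_rng(0)

class DecoupledLayer:
    """A tanh layer paired with a linear synthetic gradient model M."""

    def __init__(self, n_in, n_out):
        self.theta = rng.normal(scale=0.5, size=(n_in, n_out))
        # M maps the layer's outputs h to predicted gradients delta_hat.
        # Small random values are an arbitrary choice for this demo.
        self.M = rng.normal(scale=0.1, size=(n_out, n_out))

    def forward_and_update(self, x, lr=0.1):
        h = np.tanh(x @ self.theta)
        delta_hat = h @ self.M          # synthetic gradients, same shape as h
        grad_theta = x.T @ (delta_hat * (1 - h ** 2)) / len(x)
        self.theta -= lr * grad_theta   # update right away, no waiting
        return h, delta_hat

layer1 = DecoupledLayer(n_in=4, n_out=6)
X = rng.normal(size=(8, 4))
theta_before = layer1.theta.copy()
h1, delta1_hat = layer1.forward_and_update(X)
print(h1.shape, delta1_hat.shape)   # both are (8, 6)
```

The layer only needs its own inputs and its own model M to update theta, so nothing in this method depends on the layers above it.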
In parallel, the second layer can do the same
thing.
It uses a second Synthetic Gradient model
M2 to predict what the gradients will be for
the second hidden layer.
And it performs a gradient descent step.
And so on up to the output layer.
This time instead of using a Synthetic Gradient
model, we might as well compute the true gradients
directly and use these true gradients delta3
to update the parameters theta3.
And we are done!
Notice that we only did a forward pass, no
backward pass.
So just like that, training could potentially
be up to twice as fast.
Just to be clear, the Synthetic Gradient models
are only used during training.
After training, we can use the neural network
as usual, based on the trained parameters
theta1, 2 and 3.
Okay, now let’s see how
this technique enables model parallelism during
training.
Once again, let’s split the network into
three parts, each running on a different GPU
card.
And the CPU will take care of loading the
training instances and pushing them into a
training queue.
We start by loading the first training batch.
And while the first GPU is computing h1, and
updating its parameters using synthetic gradients,
we can already load batch number 2 and push
it into the queue.
Then while layer 2 takes care of batch number
1, layer 1 can already take care of batch
number 2.
No need to wait!
And so on, so you get the picture.
Now each layer is working in parallel on a
different batch, so all GPUs are active and
spend much less time blocked, waiting for other GPUs
to finish their jobs.
And we can continue like this until the end
of training.
As you can imagine, this can dramatically
reduce training time.
However, every time we go from one layer to
the next, we need to move a lot of data across
the GPU cards.
This can take a lot of time and in practice
it can far outweigh the benefits of this architecture.
But if you have a deep neural network composed
of, say, 30 layers then you can split it into
3 parts of 10 layers each.
You can use Synthetic Gradient models at every
hidden layer, or every few hidden layers,
or just at the interfaces between the GPU
cards.
With so many layers, the time required to
copy the data across GPU cards is now small
compared to the total computation time, so
the GPU cards spend much less time waiting
for data, and you can hope to train your network
close to 3 times faster than using regular
Backpropagation on a single GPU card.
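A back-of-the-envelope calculation shows why depth matters here. The sketch below is a deliberately crude model, not a measurement: it assumes perfectly balanced stages, copies that do not overlap with computation, and it ignores the cost of the Synthetic Gradient models. The function name `pipeline_speedup` and the time units are made up for the example.

```python
def pipeline_speedup(compute_time, n_gpus, transfer_time):
    """Steady-state throughput of a decoupled pipeline relative to one GPU.

    compute_time:  time one GPU needs to process a batch through the whole network
    transfer_time: time to copy activations from one GPU to the next
    """
    time_per_stage = compute_time / n_gpus + transfer_time
    return compute_time / time_per_stage

# Shallow network: the copy costs more than each stage's computation.
print(pipeline_speedup(compute_time=3.0, n_gpus=3, transfer_time=2.0))    # 1.0
# Deep network: the same copy is small next to 10 layers of work per GPU.
print(pipeline_speedup(compute_time=30.0, n_gpus=3, transfer_time=2.0))   # 2.5
```

Under these assumptions the speedup approaches the number of GPUs only when the per-stage computation dwarfs the transfer time.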
So model parallelism actually works!
Great!
Now it’s time to open the black boxes and
see how the Synthetic Gradient models work.
Let’s focus on a hidden layer i.
It has its own Synthetic Gradient model Mi
which produces synthetic gradients delta i
hat, and these synthetic gradients can be
used to update the hidden layer’s parameters
without waiting for the true gradients to
be computed, as we have just seen.
This model can simply be a small neural network.
For example, a single linear layer, with no
activation function.
Or it could have a hidden layer or two.
We will simply train the Synthetic Gradient
model Mi so that it gradually learns to correctly
predict the true gradients delta i.
For this, we can just train the Synthetic
Gradient model normally, by minimizing a loss
function.
We can just use regular Backpropagation here,
nothing fancy.
For example, we can minimize the distance
between the synthetic gradients and the true
gradients (in other words, the L2 norm of
their difference), or we can minimize the
square of that distance.
But this raises the question: how do we compute
the true gradients delta i?
If we need to wait for the loss function to
be computed and for the true gradients to
flow backward through the network, then we
have somewhat defeated the purpose of synthetic
gradients.
Fortunately, there’s a neat trick to avoid
this.
We can just wait for the next layer to compute
its synthetic gradients delta i+1 hat and
then we just Backpropagate these synthetic
gradients through layer i+1.
This does not really give us the true gradients
delta i, but hopefully something pretty close.
Of course if the next layer happens to be
the output layer, then we might as well compute
the true gradients and Backpropagate them.
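Here is a minimal NumPy sketch of this training scheme on a three-layer network, under several assumptions: a toy regression task, tanh hidden layers, linear Synthetic Gradient models initialised to zero, and a squared L2 loss for those models. Everything runs in a single synchronous loop for readability, so it shows the targets each model is trained on rather than the actual parallelism.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression task (assumed): targets are a linear function of the inputs.
X = rng.normal(size=(64, 4))
y = X @ rng.normal(size=(4, 1))
n = len(X)

theta1 = rng.normal(scale=0.5, size=(4, 8))   # hidden layer 1
theta2 = rng.normal(scale=0.5, size=(8, 8))   # hidden layer 2
theta3 = rng.normal(scale=0.5, size=(8, 1))   # output layer
M1 = np.zeros((8, 8))                         # synthetic gradient model of layer 1
M2 = np.zeros((8, 8))                         # synthetic gradient model of layer 2
lr = 0.05

losses = []
for step in range(300):
    # Forward pass, with each hidden layer predicting its own gradients.
    h1 = np.tanh(X @ theta1)
    delta1_hat = h1 @ M1
    h2 = np.tanh(h1 @ theta2)
    delta2_hat = h2 @ M2
    out = h2 @ theta3
    delta3 = out - y                          # true per-instance gradients
    losses.append(0.5 * np.mean(delta3 ** 2))

    # Targets for the synthetic gradient models.
    # M2 sits just below the output layer, so it gets true gradients.
    delta2_target = delta3 @ theta3.T
    # M1 gets layer 2's *synthetic* gradients backpropagated through layer 2.
    delta1_target = (delta2_hat * (1 - h2 ** 2)) @ theta2.T

    # Parameter updates: synthetic gradients for hidden layers, true for output.
    theta1 -= lr * X.T @ (delta1_hat * (1 - h1 ** 2)) / n
    theta2 -= lr * h1.T @ (delta2_hat * (1 - h2 ** 2)) / n
    theta3 -= lr * h2.T @ delta3 / n

    # Train M1 and M2 by gradient descent on the squared L2 distance.
    M1 -= lr * 2 * h1.T @ (delta1_hat - delta1_target) / n
    M2 -= lr * 2 * h2.T @ (delta2_hat - delta2_target) / n

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The only place a true gradient enters is `delta3`. Everything below the output layer learns from estimates built on other estimates.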
Over time, the Synthetic Gradient models will
get better and better at predicting the true
gradients, and this will be useful both for
updating the parameters correctly and also
for providing accurate gradients to train
the Synthetic Gradient models in lower layers.
And that’s it, you now know what synthetic gradients
are, how they work and how they can speed
up neural network training.
But there are a few more important things
to mention.
Firstly, Synthetic Gradients can be used pretty
much on any type of network, including convolutional
neural networks such as this one.
Just add Synthetic Gradient models after some
hidden layers, and that’s about it.
Each Synthetic Gradient model’s outputs
must have the same shape as its inputs, that
is, the same shape as the outputs of the layer
it is attached to.
For example, M1’s outputs must have the
same shape as the outputs of this convolutional
layer.
Suppose it’s a convolutional layer with
5 feature maps of size 400x200, then that’s
exactly the shape that M1 must output.
That’s a 5x400x200 array.
In practice, you can use a shallow convolutional
neural network that preserves the shape of
its inputs, so for example a couple of convolutional
layers with zero padding and stride 1 would
do just fine.
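To check the shape argument, here is a small NumPy sketch of a single stride-1, zero-padded convolutional layer applied to a 5x400x200 array. The helper `conv_same`, the 3x3 kernel size and the random kernel values are assumptions for illustration. A real Synthetic Gradient model would stack a couple of such layers and train their kernels.

```python
import numpy as np

def conv_same(x, kernels):
    """Stride-1, zero-padded convolution (cross-correlation, as in most
    deep learning libraries).

    x:       array of shape (C_in, H, W)
    kernels: array of shape (C_out, C_in, k, k), with k odd
    """
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    _, height, width = x.shape
    x_padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, height, width))
    # Accumulate the contribution of each kernel offset over all channels.
    for dy in range(k):
        for dx in range(k):
            window = x_padded[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", kernels[:, :, dy, dx], window)
    return out

rng = np.random.default_rng(0)
h1 = rng.normal(size=(5, 400, 200))                  # 5 feature maps of size 400x200
kernels = rng.normal(scale=0.1, size=(5, 5, 3, 3))   # 5 -> 5 feature maps, 3x3 kernels
delta1_hat = conv_same(h1, kernels)
print(delta1_hat.shape)   # (5, 400, 200)
```

Because the padding is `k // 2` on each side and the stride is 1, the height and width are preserved, and using as many output feature maps as input feature maps preserves the first dimension.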
Here’s another important point.
Until now, the input of each Synthetic Gradient
model Mi was only the output of the corresponding
layer, hi.
But it is perfectly legal to provide additional
information to the Synthetic Gradient model,
so that it can make better predictions.
For example, we can give it the labels of
the current batch.
This is called a conditional Decoupled Neural
Interface, or cDNI.
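One simple way to feed the labels to the Synthetic Gradient model is to concatenate them, one-hot encoded, to the layer's outputs. The NumPy sketch below assumes exactly that, with a linear model and made-up sizes. It is one possible way to condition the model, not necessarily what the authors implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_hidden, n_classes = 32, 256, 10

h = rng.normal(size=(batch_size, n_hidden))            # outputs of hidden layer i
labels = rng.integers(0, n_classes, size=batch_size)   # labels of the current batch
one_hot = np.eye(n_classes)[labels]

# Linear synthetic gradient model taking [h, one-hot labels] as inputs.
W = np.zeros((n_hidden + n_classes, n_hidden))
b = np.zeros(n_hidden)

def synthetic_gradients(h, one_hot):
    inputs = np.concatenate([h, one_hot], axis=1)      # shape (32, 266)
    return inputs @ W + b                              # shape (32, 256), same as h

delta_hat = synthetic_gradients(h, one_hot)
print(delta_hat.shape)   # (32, 256)
```

The model's inputs are now wider than the layer's outputs, but its outputs must still match the shape of h.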
In the paper, the authors show that cDNI consistently
performs better than regular DNI, so it should
probably be your default choice.
So in the paper, they experimented with the
MNIST dataset of handwritten digits, using
various architectures and training methods.
In particular, they used this fully connected
network with 3 to 6 hidden layers of 256 neurons
each.
They used Batch normalization and the ReLU
activation function at each hidden layer.
And here is a graph presented in Figure 2
in the paper.
It shows the learning curves for 3 to 6 hidden
layers and for various training methods.
For example, when trained using regular Backpropagation,
the network reaches below 2% error on the
test set, and it gets better when you add
more layers.
Using Synthetic Gradient models at each hidden
layer, the final performance of the 3 layer
network ends up being better than before,
but it takes time to train the synthetic models,
so overall training takes a little bit
longer than with Backpropagation.
When you add more layers, the network’s
performance actually decreases, and training
time increases.
That’s not great.
Note that each synthetic gradient model is
actually composed of two hidden layers of
1024 neurons each, and one output layer of
256 neurons.
They also used batch normalization and the
ReLU activation function in the hidden layers.
Finally, they tried training the network using
conditional DNI.
The network gets better when you add more
layers, and with 6 layers it actually reaches
the best performance overall.
Moreover, as you can see, this is the fastest
learning architecture.
It reaches less than 2% error in just a few
thousand iterations.
Surprisingly, they used very simple synthetic
gradient models, without any hidden layers
here.
I am curious to know why they did not use
the same synthetic models for DNI and cDNI,
because it feels like we are comparing apples
and oranges.
Anyway, it clearly demonstrates that cDNI
performs much better than Backpropagation
on this task, both in terms of final accuracy
and training speed.
There are many more results in the paper,
if you’re interested, in particular great
results with Convolutional Neural Nets.
Another great application of Synthetic Gradients
is in Recurrent Neural Networks.
At each time step t, a recurrent layer takes
the inputs Xt, as well as its own outputs
from the previous time step h_t-1, and it
produces the output h_t.
It is convenient to represent RNNs by unrolling
them through time, across the horizontal axis,
like this.
First the recurrent layer takes the inputs
at time t=0, and it has no previous outputs.
It then outputs h_t=0.
And at the next time step, it takes the inputs
X_t=1 and the previous outputs h_t=0.
To be clear, these two boxes represent the
same recurrent layer at two points in time.
Then it outputs h_t=1.
And we could go on and on and on…
However, during training, we have to stop
at some point, or else we will run out of memory.
We can then compute the loss based on the
outputs produced so far.
And we can perform Backpropagation.
And finally we can update the parameters of
the recurrent layer.
This technique is called Truncated Backpropagation
through time.
It works well, but it has its limits.
In particular, since we only computed the
loss on a few outputs, we know nothing about
the future losses.
So in practice, this means that the network
cannot learn long-term patterns.
So let’s see how Synthetic Gradients can
help solve this problem.
Instead of stopping at time step t=3, let’s
unroll the network for just one additional
time step.
But instead of using its outputs to compute
the loss, we send them to a Synthetic Gradient
model.
It estimates the gradients for that time step,
delta_t=4_hat.
And we backpropagate these gradients through
the layer to get an estimate of delta_t=3.
We can then perform regular Backpropagation
through time, by mixing the true gradients
and the estimated future gradients.
Finally, once we have all the gradients we
need, we can update the parameters of the
recurrent layer by performing a gradient descent
step.
We must not touch the last unrolled cell,
because this would change its output h_t=4,
and we are going to need it in a minute to
train the Synthetic Gradient model.
So by using
Synthetic Gradients in a recurrent neural
network like this, we can capture long term
patterns in the data even if we unroll the
network through just a few time steps.
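Here is a minimal NumPy sketch of this gradient computation for a basic tanh RNN. The sizes, the per-step squared-error loss with a readout matrix `Wo`, the linear Synthetic Gradient model `M` and its random values are all assumptions for the demo, and `bptt` is a hypothetical helper. The extra cell is only used to route the estimated gradients back; it contributes no parameter gradients of its own, which is one way to read the remark about not touching it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 3, 5, 4     # T time steps with a loss (t=0..3), plus one extra (t=4)

Wx = rng.normal(scale=0.3, size=(n_in, n_h))
Wh = rng.normal(scale=0.3, size=(n_h, n_h))
Wo = rng.normal(scale=0.3, size=(n_h, 1))     # readout used by the per-step loss
M = rng.normal(scale=0.1, size=(n_h, n_h))    # linear synthetic gradient model

xs = rng.normal(size=(T + 1, n_in))
ys = rng.normal(size=(T, 1))

# Forward pass through T + 1 time steps; hs[t + 1] is the output at time step t.
hs = [np.zeros(n_h)]
for t in range(T + 1):
    hs.append(np.tanh(xs[t] @ Wx + hs[-1] @ Wh))

def bptt(d_future):
    """Backpropagation through time over steps T-1..0, seeded with d_future,
    the gradient flowing into the last true step from later time steps."""
    gWx, gWh = np.zeros_like(Wx), np.zeros_like(Wh)
    dh_next = d_future
    for t in reversed(range(T)):
        h, h_prev = hs[t + 1], hs[t]
        err = h @ Wo - ys[t]                 # squared-error loss at step t
        dh = err @ Wo.T + dh_next            # local gradient + gradient from the future
        da = dh * (1 - h ** 2)
        gWx += np.outer(xs[t], da)
        gWh += np.outer(h_prev, da)
        dh_next = da @ Wh.T
    return gWx, gWh

# Synthetic gradients for the extra step, backpropagated through that cell.
delta_extra_hat = hs[T + 1] @ M
d_future = (delta_extra_hat * (1 - hs[T + 1] ** 2)) @ Wh.T

grads_truncated = bptt(np.zeros(n_h))   # plain truncated BPTT
grads_with_sg = bptt(d_future)          # true gradients mixed with the estimate
print(np.abs(grads_with_sg[1] - grads_truncated[1]).max())
```

The only difference between the two calls is the seed gradient: plain truncated Backpropagation through time assumes the future contributes nothing, while the second call injects an estimate of what the future losses would have sent back.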
Now, let’s see how we can train the Synthetic
Gradient model.
For this, we will need to run the network
on the next few time steps, so let’s move
forward in time.
Okay, clean up a bit and push this to the
left to have more space.
Okay, now we run the RNN on the next few time
steps.
Okay, we compute the loss.
We add an extra time step and we use the Synthetic
Gradient model to estimate the gradients for
that time step.
And just like earlier, we Backpropagate these
synthetic gradients and we mix them with true
gradients.
And now this process gives us something pretty
close to the true gradients for time step
4, and we can use these gradients to train
the Synthetic Gradient model.
Next, we can use the gradients we computed
to update the RNN’s parameters.
And boom!
Of course we could repeat this process many
times, and both the RNN and the Synthetic
Gradient model would get better and better.
It does add some complexity, but hopefully
the main Deep Learning libraries will
soon hide this complexity from us.
And if you need some motivation, here are
some amazing results.
This graph is a simplified version of Figure
4 in the paper, and it comes from DeepMind’s
great blog post about Synthetic Gradients,
which I highly encourage you to read (the
link is in the video description below).
It shows the performance of various RNNs on
the Penn Treebank task, which is a language
modelling task.
The horizontal axis shows training time, and
the vertical axis shows the model’s error,
measured in bits per character (BPC).
The three dashed lines are the learning curves
of a regular RNN using Backpropagation through
time, unrolled through 8, 20 or 40 time steps.
So the more you unroll the RNN, the longer
it takes to train, and the more data it requires,
but also the better the performance it eventually
reaches.
Now compare these three dashed lines to the
solid line on the left: it shows the learning
curve of an RNN trained using Backpropagation
through time unrolled through just 8 time
steps, but this time using synthetic gradients.
As you can see, the model reaches the lowest
error, even better than the model unrolled
through 40 time steps, and it takes roughly
half as much time and data to train.
That’s really impressive!
Okay next!
Yet another really interesting idea in the
paper aims to break the forward lock.
Recall that the Forward lock is the fact that
we need to wait for the lower layers to finish
before we can compute the top layers.
It may sound impossible to break this lock,
but it is in fact quite simple: you can just
equip any layer you want with a Synthetic
Input model.
For example, let’s add a Synthetic Input
model I3 to layer 3, which is the output layer.
It allows us to skip the hidden layers 1 and
2 by computing h2_hat, an approximation of
h2, the inputs of layer 3.
We can just feed h2_hat directly to the output
layer.
And ta-da!
We’ve just broken the forward lock.
As you might guess, once we eventually get
the output of the hidden layer 2 we can use
it to train the Synthetic Input model.
This is really the exact same idea as earlier,
but going forwards rather than backwards.
In fact, we can even use the same trick as
earlier to go even faster.
Instead of letting the signal propagate through
the whole network to compute h2, we can just
use the synthetic input model from the previous
layer and feed it to the hidden layer 2, and
this will give us something hopefully close
enough to h2, to train I3, the Synthetic Input
model of layer 3.
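The following NumPy sketch shows the simplest version of this idea, where I3 is trained on the real h2 once it arrives. It assumes a linear Synthetic Input model that takes the network inputs X, a squared-error loss, and hidden layers that are kept fixed so that only the training of I3 is visible. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))

# Hidden layers 1 and 2, kept fixed here to isolate the training of I3.
theta1 = rng.normal(scale=0.5, size=(4, 6))
theta2 = rng.normal(scale=0.5, size=(6, 6))

I3 = np.zeros((4, 6))    # linear Synthetic Input model: X -> h2_hat
lr = 0.1

errors = []
for step in range(200):
    h2_hat = X @ I3                                # available immediately to layer 3
    h2 = np.tanh(np.tanh(X @ theta1) @ theta2)     # the real h2, which arrives later
    err = h2_hat - h2
    errors.append(np.mean(np.sum(err ** 2, axis=1)))
    I3 -= lr * 2 * X.T @ err / len(X)              # gradient step on the squared error

print(f"mean squared error: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

A linear model can only capture part of what two tanh layers compute, so in practice I3 would probably need some capacity of its own, and it would have to keep tracking h2 as the hidden layers themselves are trained.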
To conclude, let’s look at the data flow
of a fully Decoupled Neural Interface that
uses both synthetic inputs and synthetic gradients.
First, the Synthetic Input model receives
the next training batch and computes an approximation
of the layer’s inputs, h_i-1_hat.
Then, the hidden layer computes its outputs
h_i and feeds them simultaneously to the next
layer and to its own Synthetic Gradient model.
The Synthetic Gradient model outputs the synthetic
gradients delta_i_hat.
These gradients are backpropagated through
the hidden layer, which gives a reasonably
good approximation of the true gradients for
the previous layer, delta_i-1.
The gradients delta_i-1 are just sent back
to the previous layer, which will use them
to update its own Synthetic Gradients model.
And immediately after that, we can update
the layer’s parameters using the Synthetic
Gradients delta_i_hat.
At some point we receive the outputs of the
previous layer, h_i-1, and we will use them
to train the Synthetic Input model.
And lastly, we receive the gradients from
the next layer, and we use them to train the
Synthetic Gradients model.
And that’s it!
The DNI is ready to handle the next training
batch.
If you want to learn more about Synthetic
Gradients, I encourage you to read the paper
itself, as it touches on a few more topics,
including many implementation details, and how
Synthetic Gradients can help two Recurrent
Nets communicate efficiently when they don’t
tick at the same rate, and so on.
Also check out the links in the video description,
there are several interesting blog posts and
implementations, and I might add my own implementation
at some point.
If you want to learn more about Deep Learning,
check out my book Hands-On Machine Learning
with Scikit-Learn and TensorFlow.
In particular, there’s a whole chapter on
running TensorFlow across multiple GPUs and
servers.
There’s also a German version and a French
version, and I believe a Chinese version should
be out in the next few weeks.
And that’s all I had for today!
I hope you enjoyed this video and that you
found it useful.
If you did, please like, share, comment,
subscribe, and you can also follow me on Twitter
if you’re into that.
See you next time and I wish you a very Happy
New Year!
