Hello and welcome to
the UCL x DeepMind lecture
series. My name is Alex Graves.
I'm a research scientist at
DeepMind. And I'm going to be
talking to you today about
attention and memory in deep
learning. So you may have heard
people talk about attention in
neural networks and it's really,
it's emerged over the last few
years as a really exciting new
component in the deep learning
toolkit. In my opinion, it's one of the most significant recent additions to our toolbox. So in this lecture
we're going to explain how
attention works in deep
learning. And we're also going
to talk about the linked concept
of memory. And so you can think
of memory in some sense as
attention through time. And so
we're going to talk about a
range of attention mechanisms,
those that are implicitly
present in any deep network as
well as more explicitly defined
attention. And then we'll talk
about external memory and what
happens when you have attention
to that and how that provides
you with selective recall. And
then we'll talk about
Transformers and Variable
Computation Time. So I think the
first thing to say about attention is that it's not only something that's useful for deep learning; it plays a vital part in human cognition. So the
ability to focus on one thing
and ignore others is really
vital. And so we can see this in
our everyday lives. We're
constantly bombarded with
sensory information coming from
all directions and we need to be
able to pick out certain
elements of that signal in order
to be able to concentrate on
them. So a classical example of
this is known as the cocktail
party problem: when we're attending a noisy party and listening to lots of other people talking at once, we're
still able to easily pick out
one particular speaker and kind
of let the others fade into the
background and this is what
allows us to hear what they're
saying. But there's also a form of introspective or internal attention that allows
us to attend to one thought at a
time, to remember one event
rather than all events. And I think the crucial idea that I want you to take away from this is that attention
is all about ignoring things,
it's not about putting more
information into a neural
network, it's actually about
removing some of the information
so that it's possible to focus
on specific parts. Now I know
you've all heard about neural
networks and how they work and
it might seem at first glance
that there's nothing about a
neural network that is
particularly related to this
notion of attention. So we
have this, you know, this big
non-linear function approximator
that takes vectors in and gives
vectors out and so in this kind
of paradigmatic example, you
have an image coming in, being
processed and then a
classification decision coming
out. Is it a leopard or a jaguar
or a cheetah in this image? And
this doesn't appear to have much
to do with attention at first
glance, the whole image is
presented to the network and
then a single decision is made.
But, what you can actually find
if you look inside neural
networks and analyse what
they're actually doing with the
data is that they already learn
a form of implicit attention,
meaning that they respond more
strongly to some parts of the
data than others. And this is
really crucial. So if you want
to distinguish, you know, a
leopard, for example, from a
tiger or something like that,
part of what you need to focus
on are the spots in the leopard's
fur and you need to do that, you
need to focus on these parts
while ignoring perhaps
irrelevant detail in the
background. And to a first
approximation we can study this
use of implicit attention by
looking at the network Jacobian.
So the Jacobian is basically the
sensitivity of the network
outputs with respect to the
inputs. So mathematically it's really
just a matrix of partial
derivatives where each element
Jᵢⱼ is the partial derivative of
some output unit, i, with
respect to some input unit, j,
and you can compute this thing
with ordinary backprop. So
basically the backprop
calculation that's used for
gradient descent can be
repurposed to analyse the
sensitivity of the network. All
you do is you, instead of
passing through the errors with
respect to some loss function,
you set the errors equal to the
output activations themselves
and then you perform backprop.
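To make that concrete, here's a minimal sketch (a toy two-layer network of my own, not anything from the lecture) of repurposing backprop in exactly this way: the 'errors' are set to the output activations, and backpropagating them gives the aggregate sensitivity of each input.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8)) * 0.5   # input -> hidden
W2 = rng.normal(size=(4, 16)) * 0.5   # hidden -> output

def input_sensitivity(x):
    h = np.tanh(W1 @ x)        # forward pass
    y = W2 @ h
    delta = y                  # errors := output activations, not a loss gradient
    dh = W2.T @ delta          # backprop through the readout layer
    dz = dh * (1.0 - h ** 2)   # backprop through the tanh
    dx = W1.T @ dz             # = J^T y: each input's aggregate influence
    return np.abs(dx)

print(input_sensitivity(rng.normal(size=8)))  # larger = more 'attended to'
```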
And by doing this, we get a feel for what pieces of information the network is really focusing on, what it's using in order to solve a particular task. So by
way of illustration, here's a neural network known as the Dueling Network, from an architecture presented in 2015 that was used for reinforcement learning. It's a network that was applied to playing Atari games, and so the input is a video sequence
and the output in this case, the
network has a two headed output.
One head attempts to predict the
value of the state, as is kind
of normal for reinforcement
learning, for deep reinforcement
learning. The other head attempts to predict the action advantage, which is basically the differential between the value given a particular action and the expected value overall. Or
to put it in simpler terms, it
tries to guess whether
performing a particular action
will make its value higher or
lower. And so if we look at the
video here, this image on the
left represents the Jacobian with respect to the value
prediction and what's being
shown here, so we're seeing the
input video itself, this is a
racing game where the goal is to
try and overtake as many cars as
possible without crashing and
overlaid on that, this red
heatmap that we see flaring up,
this is the Jacobian so the
places that are appearing in red
are the places that the network
is sensitive to. So if we
concentrate on the left side of
this video we can see some
things that the network is
really interested in. So one of
them is it tends to focus on the
horizon. The car is appearing,
you know, just appearing on the
screen. And of course these are
very important as a predictor of
how much score the network is
likely to obtain in the near
future because it's by
overtaking these cars that it
gets points. It's also
continually focused on the car
itself, and obviously that's important because it needs to know its own state in order to
predict its value. And
interestingly, it has another
area of, kind of, continual
focus, which is
the score at the bottom. So
because it's the score that it's attempting to predict, the score being the value for these games, it
kind of makes sense that knowing
what the current score is is
very important. That's what
gives it an indicator of how
fast the value is accumulating.
If we look at the image on the
right, which is also a Jacobian
plot, but this time it's the
Jacobian of this action
advantage, so the degree to
which any one particular
action's better or worse than
the expectation over other
actions, we see a very different
picture. First of all, we see
that there's less sensitivity
overall. These red areas of sensitivity are a lot less prevalent, and when they do show up, they tend to show up in different places. They're not
looking so much at the horizon,
they're not looking at the score
very much, they tend to flare up
just in front of the car that's
driving. And the reason for that
is that the information it needs to decide whether it's
better to go right or left is
really the information about the
cars that are very close to it.
So that's the point. It's only
really when it comes close to
another car that it has this
critical decision about whether
it should go right
or left. And so what I'm trying
to get across with this video,
is that even for the same data,
you get a very different
sensitivity pattern depending on
which task you're trying to
perform. And so this implicit
attention mechanism is allowing
it to process the same data in
two very different ways: even though it's being presented with the same data, it's effectively seeing different things, and
seeing these different things is
what allows it to perform
different tasks. So once again,
the whole point about attention
and the whole reason it's so
important is that it allows you
to ignore some parts of the data
and focus on others. And this
same concept also applies to
recurrent neural networks, I
think you've covered recurrent
neural networks in an earlier
lecture, and the idea here is that you've got a network that basically takes sequences as inputs and produces sequences as outputs. And what really
makes recurrent neural networks
interesting is that they have
these feedback connections that
give them some kind of memory of
previous inputs. And as I said at the start of the lecture, memory can be thought of as attention through time. So what
we really want to know about
recurrent neural networks is how
are they using the memory to
solve the task? And once again,
we can appeal to the
Jacobian to try to measure this
use of memory, this use of past
information or surrounding
context. And in this case, I
tend to refer to it as a
sequential Jacobian because what
you're really doing now, instead
of getting a two dimensional
matrix of partial derivatives,
you're really looking at a three
dimensional matrix where the
third dimension is through time.
And what you mostly care about is how sensitive the decisions made by the network at one particular time are to the inputs at other times. In other
words, what part of the sequence
does it have to remember, does
it have to recall, in order to
solve the task. Okay, so to make that a little bit more concrete: the sequential Jacobian is a set of derivatives of one network output, at one particular point in time, with respect to all the inputs over time. So there's a time series, a sequence of these 2D Jacobian matrices.
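As a sketch (a toy vanilla RNN of my own, not the lecture's network), here's one slice of the sequential Jacobian: the magnitude of the sensitivity of a final-step output to the input at every earlier timestep, computed by backpropagation through time.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, H = 12, 3, 8                        # sequence length, input size, hidden size
Wx = rng.normal(size=(H, D)) * 0.4
Wh = rng.normal(size=(H, H)) * 0.4
Wo = rng.normal(size=(1, H)) * 0.4        # single output unit, read at the last step

def sequential_jacobian_magnitudes(xs):
    """Norm of d(output at step T)/d(input at step t), via backprop through time."""
    hs, zs = [np.zeros(H)], []
    for x in xs:                          # forward pass
        z = Wx @ x + Wh @ hs[-1]
        zs.append(z)
        hs.append(np.tanh(z))
    dh = Wo.T @ np.ones(1)                # error := 1 on the output unit at step T
    mags = np.zeros(T)
    for t in reversed(range(T)):          # backward pass through time
        dz = dh * (1.0 - np.tanh(zs[t]) ** 2)
        mags[t] = np.linalg.norm(Wx.T @ dz)   # sensitivity to the input at step t
        dh = Wh.T @ dz
    return mags

print(sequential_jacobian_magnitudes(rng.normal(size=(T, D))))
```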
And what you can use this sequential Jacobian to analyse is how the network responds to inputs in the sequence that are related, in the sense that they are needed together in order to solve a particular aspect of the task, but are not necessarily contiguous or close to one another in the input sequence; they may be widely separated. And so in
the example I've got here, this
was from a network that I worked
on some years ago that was
trained to do online handwriting
recognition. So online
handwriting recognition means
that someone is, in this case,
writing on a white board with a
pen that has an infrared tracker
that keeps track of the
location of the pen and is
therefore able to record a
trajectory of pen positions. And
it also records special end of
stroke markers for when the pen
is lifted off the whiteboard.
And so, this text at the bottom
shows that the words that the
person wrote were 'once having',
and then the next graph up from the bottom shows how the information was actually presented to the network, so that
what the network actually saw
was a series of these
coordinates, X and Y
coordinates, with these end of
stroke spikes. And then above that what we have is the sequential Jacobian, and what I'm really interested in here is the magnitude of these matrices over time: the magnitude of the response of one particular network output with respect to the inputs at a particular time. And so the network output
that I've chosen is a particular point. I should say, the task here is for the network to transcribe these online pen positions, to recognise what it was that the person wrote, and you can see there's this output sequence here where it's emitting label decisions, 'o', 'n', 'c', 'e', then the space character. It misses out the 'v', so in this case it doesn't transcribe the sequence entirely correctly.
But the point that we are
looking at is the point where it
decides to output the letter 'i'
in 'having', and what's really
interesting, if we look at the
sequential Jacobian, we can see
that there's a peak of
sensitivity around here, which
roughly corresponds to the point
in the input sequence where the
stroke, the main body of the
letter 'i' was actually written.
So it makes sense that there's a
peak of sensitivity here.
However, we can see that the
sensitivity also extends further
on in the sequence. It doesn't extend far back in the sequence, only very slightly; the sensitivity is mostly towards the end. And I believe the reason
for this is that this suffix,
'ing', the 'ing' at the end of
'having' is a very common one.
And so being able to identify
that whole suffix helps you to
disambiguate the letter, 'i', it
helps to tell you, for example,
that it's not an 'l' in there.
And what's really interesting is
this peak, this very sharp peak
right at the end, and what that
corresponds to is the point when
the writer lifted the pen off
the page, off the white board,
and went back to dot the 'i'.
So they wrote this entire word
'having' as one continuous
stroke in their cursive
handwriting and then they lifted
the pen off the page and put a
little dot there. And of course
that dot is crucial to
recognising an 'i', right?
That's the thing that really
distinguishes an 'i' from an
'l'. So again, it makes sense
that the network is particularly
sensitive to that point, but
it's nice to see that by
analysing the sequential
Jacobian you can really get a
sort of quantifiable sense of
the degree to which it's using
particular pieces of
information. And once again, I
want to stress that what's really critical here is that this means it's ignoring other pieces of
information. It's focusing on
those parts of the sequence that
are relevant and ignoring those
that are irrelevant. And you
know, we can see that this is
really quite powerful. It's able
to bridge things that are
related in the input sequence
but may actually be quite far
apart. Another example here
comes from machine
translation. Now, a major
challenge in machine translation
is that words may appear in a
completely different order in a different
language and so we have a simple
example here where we have this
infinitive 'to reach' at the
start of an English sentence
that's being translated into
German. But in German, the
corresponding verb appears at
the end of the sentence. And so
in order to correctly translate
this, the network needs to be
able to reorder the information
and what this paper from 2016 showed was that with just a very deep network, without any kind of specific mechanism for rearrangement or for attention, the network was able to use its implicit attention to perform this rearrangement. And so what
we're seeing in the heat map on
the right here is again, this
idea of sensitivity, it's a
sensitivity map of the outputs
at particular points in the
target sequence, so in the
German sequence, with respect to
the inputs in the English
sequence. And you can see mostly
there's a kind of diagonal line
because in this particular case,
most of the sequence, most of
the words have a more or less
direct sort of one-to-one
translation. But at the end of the sequence, the final two words in German are particularly sensitive to the words at the start in English. So for this word 'reach', there's a peak of sensitivity from the end of the sequence. And of course this is
once again showing that the
network
is able to use this implicit attention, which it gets in some sense for free just by being a very deep, rich function approximator, to focus in on a particular part of the sequence and to
ignore the rest of the sequence.
Well, you know, implicit
attention is great, but there
are still reasons to believe
that having an explicit
attention mechanism might be a
good thing. So what I mean by an
explicit attention mechanism is
one where you actually decide to
only present some of the data to
the network and you know,
completely remove other parts of
the data. And one reason this
might be preferred, of course,
is computational efficiency.
So you no longer have to process
all of the data, you don't have
to feed it to the network at
all, so you can save some
compute. There's a notion of
scalability. So for example, if
you've got a fixed size, what
I'll call a 'glimpse' or like a
foveation, where you take in a
fixed size part of an image,
then you can scale to any sized
image so that the resolution of
the input doesn't have to sort
of alter the architecture of the
network. There's this notion of
sequential processing of static
data, which I think is an
interesting topic. So again, to take a visual example, if we
have a foveal gaze moving around
a static image, then what we get
is a sequence of sensory input.
And of course this is how images
are presented to the human eye.
We're always actually, even if
the data is static, we're always
actually receiving it as a
sequence. And there's reasons to
believe that doing this can
improve the robustness of
systems. So, for example, there
was a recent paper that showed
that networks with sequences of
glimpse or foveal attention
mechanisms for static data were
more robust to adversarial
examples than ordinary
convolutional networks that
looked at the entire image in
one go. Last but not least,
there's a big advantage here in
terms of interpretability. So
because explicit attention
requires, you know, making a
hard decision and choosing some
part of the data to look at, you
can analyse a little bit more
clearly what it is that the
network is actually using. So,
you know, with implicit
attention we've looked at the
Jacobian as a guide to what the
network is looking at, but it
really is only a guide; it's not necessarily an entirely reliable signal as to
what the network is using and
what it's ignoring whereas with
explicit attention mechanisms,
as we'll see, you get a much
clearer indication of the parts
of the data that
the network is actually focusing
on. Okay. So the basic framework
for what I'm going to call
neural attention models is that
you have a neural network as
usual that is producing an
output vector as always, but
it's also producing an extra
output vector that is used to
parameterise an attention model.
So it gives some set of
parameters that are fed into
this attention model, which
we'll describe in a minute, and
that model then operates on some
data, whether that's an image
that you're looking at or audio
or text or whatever it is and
gives you what I'm going to call
a glimpse vector. And this is
non-standard terminology, I'm
just using it because I think it
helps to kind of unify these
different models. That glimpse
vector's then passed to the
network as input at the next
timestep. And so there's this
kind of loop going on where the
network makes a decision about
what it wants to attend to, and
that then influences the data it
actually receives at the next
step. And what that means is that even if the network itself is feed forward, the complete system is recurrent: it contains a loop.
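Here's that loop written out as a minimal sketch; 'network' and 'attend' are assumed callables standing in for whatever network and attention model you have, not a real API:

```python
def attention_loop(network, attend, data, steps, state=None):
    """Generic neural attention loop: the network emits attention parameters,
    the attention model turns them into a glimpse of the data, and the glimpse
    is fed back as the next input. Even if 'network' is feedforward, the loop
    makes the complete system recurrent."""
    glimpse = None                # no glimpse yet on the first step
    outputs = []
    for _ in range(steps):
        output, attn_params, state = network(glimpse, state)
        glimpse = attend(attn_params, data)   # glimpse vector for the next step
        outputs.append(output)
    return outputs
```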
Okay. So the way this model usually works is that we define a probability distribution over glimpses, 'g', of the data, 'x', given some set of attention outputs. So I've called this attention vector 'a', and that's used to parameterise something like the probability of glimpse 'g' given 'a'.
So the simplest case here is we just split the image into tiles; in this image on the right you can see there are nine possible tiles, and 'a' just assigns a probability to each of these tiles, defining a set of discrete glimpses. So it's just a good old-fashioned softmax function, where the softmax outputs are the probabilities of picking each tile. And so, having done that, if we have a network that is using this distribution, what it's going to do is output some distribution over these nine tiles and then at each point in time it's going to receive one of the tiles as input. So rather than receiving the whole input at once, it's going to keep on looking at one tile at a time.
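A minimal sketch of that tiles-plus-softmax glimpse mechanism (shapes and names are my own assumptions):

```python
import numpy as np

def tile_glimpse(image, attn_logits, grid=3):
    """Hard attention over a grid of tiles: softmax the attention outputs into
    tile probabilities, then sample one tile to feed to the network next.
    The sampling step is a hard, non-differentiable decision."""
    p = np.exp(attn_logits - attn_logits.max())
    p /= p.sum()                              # probability of picking each tile
    k = np.random.choice(grid * grid, p=p)    # sample a tile index
    th, tw = image.shape[0] // grid, image.shape[1] // grid
    r, c = divmod(k, grid)
    return image[r*th:(r+1)*th, c*tw:(c+1)*tw], k, p
```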
Now one issue with this, of course, is that it's a hard decision. And what I mean by a hard decision is that we no longer have a complete gradient with respect to what the network has done. Basically what we've got is a stochastic policy, in reinforcement learning terms, that we're sampling from in order to get the glimpses. And we can train this with something like REINFORCE. So I've given the simple, standard mathematics here for how you get a gradient with respect to a discrete stochastic sample using REINFORCE. And this is a general trick: we can use these sorts of what I'm going to call RL methods, by which I really just mean methods that are designed for getting a training signal through a discrete policy, and we can fall back on these for supervised tasks like image classification anytime there's a non-differentiable module in there.
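And for reference, a sketch of that REINFORCE gradient for the softmax glimpse policy above (the baseline is the usual variance-reduction term; notation is mine):

```python
import numpy as np

def reinforce_logit_grad(p, k, reward, baseline=0.0):
    """Gradient of E[reward] w.r.t. the glimpse logits: for a softmax policy,
    d log p(k) / d logits = onehot(k) - p, scaled by (reward - baseline)."""
    onehot = np.zeros_like(p)
    onehot[k] = 1.0
    return (reward - baseline) * (onehot - p)   # ascend this to raise E[reward]
```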
And what we can't do is just ordinary end-to-end backprop. And this is a
significant difference between
using kind of hard attention as
I've described it so far versus
using this implicit attention
that's always present in neural
networks. So generally we want
to do something a little bit
more complex than just a softmax
over tiles. One example that
I've kind of already alluded to
is this notion of a foveal model
where you have a kind of
multiresolution input that looks
at the image, takes part of the
image at high resolution, so in
this case, the square in the
centre here is kind of recorded
at high resolution, it's
basically just mapped one-to-one.
This next square out is
also presented to the network
but at a lower resolution. So
you can see it's taking something that maybe has
twice as many pixels as the one
in the middle and subsampling it
down to something with the same
number of pixels. And then the
third square out looks at the entire image and gives a very squashed-down, low-resolution version of it to the network. And the idea is
that you're mimicking the effect
of the human eye, where you have high resolution in the centre of your gaze and much lower resolution in the
periphery, with the idea being that the information at the periphery is sufficient to alert you to something that you should attend to more closely, that you should look at directly in order to get a higher-resolution view of it.
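A rough sketch of such a multiresolution glimpse: concentric crops, each covering twice the extent of the previous one, all subsampled to the same number of pixels (the crude striding and edge clipping are simplifications of mine):

```python
import numpy as np

def foveal_glimpse(image, cy, cx, size=8, scales=3):
    """Concentric square crops around (cy, cx); each scale doubles the extent
    but is subsampled back to roughly size x size, mimicking a high-resolution
    fovea with a low-resolution periphery."""
    patches = []
    for s in range(scales):
        half = (size // 2) * (2 ** s)
        y0, y1 = max(cy - half, 0), min(cy + half, image.shape[0])
        x0, x1 = max(cx - half, 0), min(cx + half, image.shape[1])
        crop = image[y0:y1, x0:x1]
        step = max(crop.shape[0] // size, 1)
        patches.append(crop[::step, ::step][:size, :size])  # crude subsampling
    return patches
```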
And we can see an example of this applied to image classification. This is from a 2014 paper where the network was
given the cluttered MNIST data
where these familiar MNIST handwritten digits are basically dropped into an image that has some visual
clutter. And the idea here is
that in order to classify the
image, the network has to
discover the digit within the
clutter. Once again, it's
about being able to ignore
distractors, being able to
ignore the noise. And the green
path here shows the movement of
this foveal model through the
image over this kind
of six point trajectory that
it's given while it classifies
the image. And we can see, for
example, in this, on the example
on the top row, it starts down
here in the bottom corner where
there isn't much information but
then rapidly moves towards the
digit in the image and then kind
of scans around the digit. And
in the pictures to the right, we
can see the information that's
actually presented to the
network. Basically, you know, it
starts off with something where
there's very little information
about the image, but there's a
blur over here that suggests
there might be something useful.
And then it moves over to there.
And by moving around the image,
it can build up a picture of,
you know, everything that's in
the digit that it needs to
classify. And we have a similar example here for the digit eight, where it moves around the periphery of the digit in order to classify it. So you might ask, you know, why would you
bother doing that when you can
feed the whole image into the
network directly? And so one
issue I mentioned earlier is
this idea of scalability and one
way in which a sequential
glimpse distribution is more
scalable is that you can use it,
for example, to represent
multiple objects. This was explored in another paper in 2014: for example, in the Street View House Numbers dataset there are multiple numbers from people's street addresses present in each image, and you want to scan through all of those numbers in order to recognise them in sequence, rather than just looking at the image in a single go, although
it can also be applied to more
conventional image
classification as shown here.
And once again, in order to
classify the image, the network
will move its attention around
the really important parts of
the image. And this gives you an
indication, it allows you to see
what it is in the image that is
necessary in order to make the
classification. So, so far we've
looked at both implicit and
explicit attention, but the
explicit attention we've looked
at has involved making hard
decisions about what to look at
and what to ignore, and this
leads to the need to train the
network using RL-like
mechanisms. It makes it
impossible to train the whole
thing end-to-end with backprop.
So what we're going to look at
in this section is what's
sometimes known as soft or
differentiable attention, which gives you explicit attention but makes end-to-end training possible. So whereas in
the previous examples we had
these fixed size attention
windows that we were kind of
explicitly moving around the
image, now we're going to look
at something that
operates a little bit
differently. And, you know, it's
important to realise that, you
know, if we're thinking about a
robot or something where you
have to actually direct a camera
in order to direct your
attention, then in some sense
you have to use hard attention
because you have to make a
decision about whether to look
left or right. But for the kinds
of systems we're mostly focusing
on in this lecture, that isn't
really the case. We've got all
the data and we just need to
make a decision about what to
focus on and what not to focus
on. And so we don't actually
need to make a hard decision
about attention. We want to
focus more on some regions and
less on others in much the same
way that I showed that we
already implicitly do with a
neural network. But we can take
this one step further than
implicit attention by defining
one of these soft attention,
these differentiable attention
mechanisms that we can train
end-to-end. And they're actually
pretty simple. There's a very
basic template. So think back to the glimpse distribution I talked about before, where the parameters emitted by the network define some distribution over glimpses. What we did then was take a sample from that distribution, and it was because we were picking these samples that we needed to think in terms of training the network with reinforcement learning techniques. So what we
can do instead is something like
a mean field approach. We take
an expectation over all possible
glimpses instead of a sample. So
it's just this weighted sum
where we take all of the glimpse
vectors and multiply them by the
probability of that glimpse
vector given the attention
parameters and sum the whole
thing up. And because it's a
weighted sum and not a sample,
this whole thing is
straightforwardly differentiable
with respect to the attention
parameters, 'a', as long as the
glimpse distribution itself is
differentiable, which it usually
is. So now we no longer have,
you know, REINFORCE or some
reinforcement learning
algorithm. We really just have
ordinary backprop. And in actual fact, because we're doing this weighted sum, we don't technically need a probability distribution at all. All we need is a set of weights. So we have a set of weights, and we're multiplying them by some set of values, which are these glimpses, and the weighted sum of these two things gives us the attention readout.
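As a sketch, this whole soft attention template is only a few lines (here I normalise with a softmax, although, as the next point notes, a proper distribution isn't strictly required):

```python
import numpy as np

def soft_attention_readout(scores, glimpses):
    """Differentiable attention: turn scores into weights, then take the
    expectation (a weighted sum) over all glimpse vectors instead of sampling.
    scores: (N,); glimpses: (N, D)."""
    w = np.exp(scores - scores.max())
    w /= w.sum()               # weights in [0, 1] that sum to 1
    return w @ glimpses, w     # readout v = sum_i w_i * g_i
```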
Now, I've got a little asterisk here on the slide saying: yes, we don't actually need a proper probability distribution here, but it's usually a nice thing to have. So if we make sure
the weights are all between
zero and one and that they sum
to one then everything tends to
stay nicely normalised and
sometimes it seems to be a good
thing as far as training the
network goes. But anyway, if we
look at this weighted sum, this
attention readout 'v', which is
just now if we think, stop
thinking probabilistic terms and
just think of sum of our 'i'
times some weights 'i', times
some vectors 'i', this should
look familiar to you, it's
really just an ordinary
summation, a Sigma network, a
Sigma unit from an ordinary
neural network. And in fact,
where these weights, 'wᵢ', look
like network weights. So we've
gone from, you know, glimpse
probabilities defined by the
network to something that looks
more like network weights. And
actually we can think of
attention in general as defining
something like data
dependent dynamic weights or
fast weights as they're
sometimes known. And they're
fast because they change
dynamically in response to the
data so they can change in the
middle of processing a sequence,
whereas ordinary weights change
slowly. They change gradually
over time with gradient descent.
And so to look at these two sort
of diagrams I've got here on the
left, we have the situation with
an ordinary ConvNet, where this
would be sort of a one
dimensional convolutional
network where you have a set of
weights
that are given in different
colours here that are used to
define a kernel that is mapping
into this input that the arrows
are pointing into. But the point is those weights are going to stay fixed. The same kernel is going to be scanned over the same sequence, in this case one-dimensional, and those weights only change gradually over time. And in addition of course,
because it's a convolution,
there's a fixed size to the
kernel. So we've decided in advance how many inputs are fed into this kernel. With attention we
have something more like the
situation on the right, so we
have this set of weights that
first of all extends, can in
principle extend, over the whole
sequence. And secondly,
critically, those weights are data dependent. They're determined by the attention parameters emitted by the network, which is itself a function of the inputs received by the network. So these weights
are responding to the input
they've received. So they're
giving us this ability to kind
of define a network on the fly.
And this is what makes attention
so powerful. Okay. So my first experience of soft attention with neural networks was a system I developed, some years
ago now, I think seven years
ago, to do handwriting synthesis
with recurrent neural networks.
So handwriting synthesis, unlike
the handwriting recognition
networks I mentioned earlier,
here the task is to take some
piece of text like this, the
word 'handwriting' on the left,
and to transform that into
something that looks like
cursive handwriting. And basically the way this works is that the network takes in a text sequence and outputs a trajectory of pen positions, and these positions define the actual writing of the letters. So you can think of
this as a kind of sequence to
sequence problem but the
challenging thing about it is
that the alignment between the
text and the writing is unknown.
And so I was studying this
problem with recurrent neural
networks and I found that if I
just fed the entire text
sequence in as input and then
attempted to produce the output,
it didn't work at all. What I
needed was something that was
able to attend to a particular
part of the input sequence when
it was making particular
decisions about the output
sequence. So for example, I
wanted something that would look
at the letter 'h' in the input
sequence and use that as the
conditioning signal for when it
was drawing a letter 'h' and
move on to the letter 'a' and so
forth. So
once again, I needed something
that was able to pick out
certain elements of the input
sequence and ignore others. And
this was achieved with soft
attention. So basically the
solution was that before the
network made each, predicted
each point in the handwriting
trajectory, it decided where to
look in the text sequence using
a soft attention mechanism. And
so the mechanism here, which is a little bit different from the normal attention mechanisms you see in neural networks, which we'll talk about later, was that the network explicitly decided how far along to slide a Gaussian window over the text sequence. I thought of it as a soft reading mechanism. The parameters emitted by the network determined a set of Gaussian functions, shown here by these coloured curves. Those functions had a particular centre, which determined where they were focused on the input sequence, and the network was also able to parameterise the width of each Gaussian, which determined how many of the letters in the input sequence it was looking at. And I should say
the sequence of input vectors here is shown as a series of one-hot vectors, which is how they're presented to the network, but what these actually correspond to is letters. So you
can think of this as an 'h' here
and an 'a' here and so forth and
then what the network is deciding is where to put these Gaussians, which, once we perform this summation at the top here that gives us the attention weights, implicitly determines what part of the text sequence we should look at in order to generate the output distribution.
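The window itself is simple to write down. Here's a sketch of those mixture-of-Gaussians attention weights, following the form of the handwriting synthesis model (variable names are mine; in the real network the centres increase monotonically, so the window only slides forward):

```python
import numpy as np

def gaussian_window(alpha, beta, kappa, text_onehots):
    """Location-based soft attention: a mixture of K Gaussians over the U
    character positions. alpha, beta, kappa: (K,) mixture weights, widths
    and centres; text_onehots: (U, V) one-hot character vectors."""
    u = np.arange(text_onehots.shape[0])             # character positions
    phi = (alpha[:, None]
           * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)
    return phi @ text_onehots                        # soft 'reading' of the text
```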
And by doing this, the network was able to produce remarkably realistic-looking handwriting. These are all
generated samples and you can
see that it also generates, as
well as being able to legibly
write particular text sequences,
it writes in different styles.
And the reason it does this, of course, is that it's trained on a database of handwriting from people writing in different styles. And so it
kind of learns that in order to
generate realistic sequences, it
has to pick a particular style
and stick with it. So I'm
claiming on this slide that real
people write this badly. Maybe
that's not quite strictly true,
but you can see at least that here was a system where attention was allowing the network to pick out the salient information and use that to generate something
quite realistic. And so, as I
said, one advantage of this
use of attention is that it
gives you this interpretability,
it allows you to look into the
network and say, what were you
attending to when you made a
particular decision? And so this
heat map here, what it shows is
while the network was writing
the letters shown along the
bottom, so if the writing here
is that the handwriting here is
a horizontal axis, the vertical
axis is the text itself. And you
can see what this heat map shows
is what part of the text was the
network really focusing on when
it was producing a particular,
when it was predicting a
particular part of the pen
trajectory. And you can see that
there's this roughly diagonal
line because of course, you
know, there is here a one,
really a one-to-one
correspondence between the text
and the letters that it writes.
But this line isn't perfectly
straight. So the point is that
some, well some letters might
take, you know, have 25 or 30
points in them or even more
others letters might have much
fewer. And so this is, the whole
issue of the alignment being
unknown that attention was able
to solve in this instance. And
so this is an early example of what's now thought of as location-based attention: the attention is really about how far along the input sequence you should look. And what's kind of interesting here is to see what happens if you take that attention mechanism away and just allow the network to generate handwriting unconditionally. And this was
very similar to the result I
obtained when I first tried to
treat this task as a more
conventional sequence to
sequence learning problem where
the entire text sequence was fed
to the network at once. And what
happens is it generates things
that kind of look like words
that kind of look like letters
but don't make much sense. And
of course the obvious reason for this is that the
conditioning signal isn't
reaching the network because it
doesn't have this attention
mechanism that allows it to pick
out which letter it should write
at a particular time. Okay, so,
that was sort of an early
example of a neural network with
soft attention. But the form of
attention that's really kind of
taken over, the one that you'll
see everywhere in neural
networks now, is what I think of
as associative or content-based
attention. So instead of
choosing where to look according
to the position within a
sequence of some piece of
information, what you can do
instead is attend to the content
that you want to look at. And the way this works is that the attention parameter emitted by the network is a key vector. And
that key vector is then compared
to all the elements in the input
data using some similarity
function. So typically you have
something like cosine similarity
or something that involves
taking a dot product between the
key and all the elements in the
data. And then typically this is
then normalised as something
like a softmax function and that
gives you the attention weights.
So implicitly what you're doing is outputting some key, looking through everything in the data to see which parts most closely match that key, and getting back an attention vector that focuses more strongly on the places that correspond more closely to the key. And this is a really natural way to search; you can essentially do everything you need to do computationally just by using a content-based lookup. And what's
really interesting about it is
that especially with this sort
of cosine similarity measure, it
gives you this multidimensional
feature-based lookup. So you can
put a set of features
corresponding to particular
elements of this key vector and
find something that matches
along those features and ignore
other parts of the vectors. So
just by setting other parts of
the vector to zero, you'll get
something that matches on
particular features and
doesn't worry about others. So it gives this multidimensional, very natural way of searching. So for
example, you might want to say,
well, show me an earlier frame
of video in which something red
appeared. And you can do that by specifying the red element in the representation of your key vector. And then the associative
attention mechanism will pick
out the red things. So typically what's done now is, given these weights, you can then perform this expectation that I mentioned earlier, where you sum up over the data: you compute this weighted sum and you get an attention readout. What you can also do,
and this has become I think
increasingly popular with
attention-based networks, is you
can split the data into key
value pairs and use the keys to
define the attention weights and
the values to define the
readout. So there's now a separation between what you use to look up the data and what you're actually going to get back when you read it out.
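Here's the whole content-based template as a short sketch, with the key-value separation included; passing the keys in as the values too recovers plain content lookup (shapes and names are my own):

```python
import numpy as np

def associative_attention(query, keys, values):
    """Content-based attention: compare the emitted key against every stored
    key with cosine similarity, softmax into weights, and read out a weighted
    sum of the values. query: (Dk,); keys: (N, Dk); values: (N, Dv)."""
    sims = keys @ query / (np.linalg.norm(keys, axis=1)
                           * np.linalg.norm(query) + 1e-8)
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return w @ values, w
```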
And as I said, this is now really a fundamental building block of deep learning. It was first applied in this paper from 2014 for neural machine translation.
And once again, so similar to
the heat map I showed you in a
previous slide for implicit
attention,
we have something here that
shows what the network is
attending to when it translates
in this case from, I believe
it's translating from English to
French or it might be from
French to English, and what's
kind of interesting here you can
see, first of all, if we compare
this to the earlier heat map I
showed for implicit attention,
it's clear that the decisions
are much sharper, so you get a
much stronger sense here of
exactly what the network is
attending to and what it's
ignoring. Secondly, in this
case, there's a more or less,
one-to-one correspondence
between the English words and
the French words, apart from
this phrase, 'European Economic
Area', that is reversed in
French. And you can see this
reversal here in the image by
this kind of, line that goes
sort of against the diagonal of
the rest of the sequence. And so
this is a very, as we'll see,
this is a very powerful general
way of allowing the network in a
differentiable end-to-end
trainable way allowing the
network to pick out particular
elements of the input data.
Here's an example of a similar
network in use. Here, the task
is to determine what this
removed symbol is in the data.
So if we look at the example on
the left, we have,
I should say the proper names
have been replaced by numbered
entities here, which is quite a
standard thing to do in language
processing tasks because proper
names are very difficult to deal
with otherwise. And we have this
task where 'entity 119 identifies deceased sailor as X',
and what the network has to do
is to fill in X, and you can see
from this heat map here which
words it's attending to when it
attempts to fill in this X. And
you can see it's mostly
particularly focused on this
entity 23, which was presumably
the decision it made and which
is indeed correct. It says he
was identified Thursday as
special warfare operator entity
23. In general it's focusing on
the entities throughout because
it can kind of tell that those
are the ones that it needs to
look at in order to answer these
questions. Similarly, X
dedicated their fall fashion
show to moms. You can see it's
very focused on this particular
entity here that's helping it
make this decision. And what's
really crucial here is that
there's a lot of text in this
piece, there's a lot of text
that it's ignoring, but it's
using this content-based
attention mechanism to pick out
specific elements. And this combination, typically of a recurrent neural network with attention, can be used much more broadly. It's been applied
to speech recognition for
example. And here we see a plot,
not dissimilar to the one I
showed you for handwriting
synthesis where we have an
alignment being discovered
between the audio data here
presented as a spectrogram and
the text sequence that the
network is outputting, the
characters that it's using to
transcribe this data. And so for
example, there was this long pause at the start where nothing happens, and the network mostly ignores that. It knows that when
it has to start emitting, for
example, the 'st' at the start
of the sentence, it's very
focused on these sounds at the
beginning corresponding to those
noises in the speech signal. So
basically this attention
mechanism is a very general
purpose technique for focusing
in on particular parts of the
data. And this is all done with,
well mostly all done with
content-based attention. Okay, so there are a huge number of possible attention mechanisms, and we're only going to mention a few of them in this talk. And one idea
I want to leave you with is that
there's a very general framework
here. Having defined this
attention template that gives
you this weighted sum, there's
lots of different operators you
could use to get those attention
weights. One very interesting idea, from a network known as DRAW from 2015, was to define an explicitly visual kind of soft attention. So this is kind of
similar to the foveal models we
looked at earlier, only instead of an explicit hard decision about where to move the fovea around the image, there was a set of Gaussian filters applied to the image. And these have a similar
effect of being able to focus in
on particular parts of the image
and ignore other parts but it's
all differentiable end-to-end
because there's a filter that is
being applied everywhere that
gives you these attention
weights. And what does this
filter look like? Well if you
look at these three images on
the right, we show different settings of the parameters for the Gaussian filters: the filter variance (essentially the width of the filter), the centre, the stride with which the filter is applied throughout the image, and also this last parameter for intensity. By varying these we get different views of this same digit five. So this one here is
quite focused in on this central
part of the image. This one here
is looking more at the image as
a whole and it's doing so with
quite low variance. So it's
getting quite a sharp picture of
this image. This one on the bottom here is getting a more blurred, less distinct view of the entire image.
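A sketch of such a Gaussian filterbank, one axis at a time (a two-dimensional glimpse is then read as Fy @ image @ Fx.T, scaled by the intensity parameter; this is my own simplified parameterisation):

```python
import numpy as np

def draw_filterbank(grid_n, img_size, centre, stride, variance):
    """An N-row bank of 1-D Gaussian filters: filter centres are spaced by
    'stride' around 'centre', and 'variance' sets how blurred each filter is."""
    mu = centre + (np.arange(grid_n) - grid_n / 2 + 0.5) * stride
    a = np.arange(img_size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2.0 * variance))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)   # (grid_n, img_size)
```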
And we can see a video of a DRAW network in action. What we're seeing here is that the movement of these green boxes shows the attention of the network (I'm just going to play that again, it's rather quick) as
it looks at an MNIST digit and
you can see that it starts off
kind of attending to the whole
image and then very quickly
zooms in on the digit and moves
the box around the digit in
order to read it. And it does a similar thing when it starts to generate data: this red box shows where its attention is as it's generating the data. And once again, it
starts off kind of generating
this kind of blurred out view of
the whole image and then focuses
down on a specific area and kind
of, it does something that looks a lot like it's actually drawing the image: it's using the attention mechanism to trace out the strokes of the digit. And so again, what's nice about this is that we have something that's transforming a static task into a sequential one, where there's this sequence of glimpses or views of the data.
And what's nice about that is that we get generalisation: because it generates these images one part at a time, it can be extended to something that generates multiple digits, for example, within the same image. And this is an illustration of the general property of scalability that I referred to earlier for attention mechanisms. So far
what we've talked about is
attention applied to the input
data being fed to the network.
But as I mentioned at the start
of the lecture, there's another
kind of attention which I think
of as introspective or kind of
inward attention, where we as people use a kind of cognitive attention to pick out certain thoughts or
memories and in this section I'm
going to discuss how this kind
of attention can be introduced
to neural networks. So as I've
said, in the previous slides,
what we were looking at was
attention to external data. So
deciding where in a text
sequence to look, which part of
an image to look at and so
forth. But if we sort of apply
this attention mechanism to the
network's internal state or
memory then we have this notion
of introspective attention. And
as I've said, the way I like to
think about is that memory is
attention through
time. It's a way of picking out
a particular event that may have
happened at some point in time
and ignoring others. And once
again, just want to come back to
this idea that attention is all
about ignoring things, it's all
about what you don't look at.
And so there's an important
difference here between internal
information and external
information, which is that we
can actually modify the internal
information so we can do
selective writing as well as
reading, allowing the network to
use attention to iteratively
modify its internal state. And
an architecture that I and
colleagues at DeepMind developed
in 2014 did exactly this. We called it a Neural Turing Machine because what we wanted was something that resembled the action of a Turing machine, its ability to read from and write to a tape, implemented with a neural network and a set of attention mechanisms. And I'm going to
talk about this architecture in some detail because it gives you a nice insight into the variety of things that can be achieved with attention mechanisms, and it really shows this link between attention and memory. So the
controller in this case is a
neural network. It can be
recurrent or it can be feed
forward. Once again, even if
it's feed forward, the combined
system is recurrent because
there's this
loop through the attention
mechanisms. We referred to the attention modules that are parameterised by the network as 'heads'. And
so this was in keeping with the
Turing machine analogy, the tape
analogy. But this is something
that I think has been picked up
in general. People often talk
about attention heads. And these heads are soft attention mechanisms following the same kind of template that we've discussed before. And their purpose is to
select portions of the memory.
The memory is just this
real-valued matrix. It's just
the big grid of numbers that the
network has access to. And the
key thing that's different is
that as well as being able to
select portions of the memory to
read from these heads can also
selectively write to the memory.
So once again, this is all about selective attention: we don't want to modify the whole memory in one go. I should stress here that a key design decision underlying the Neural Turing Machine was to separate out computation from memory, in the same way as is done in a normal digital computer. For a normal recurrent neural network, for example, in order to give the system more memory, you have to make the
hidden state larger, which
increases the amount of
computation done by the network
as well as giving it more
memory. So computation and
memory are kind of inherently
bound up in an ordinary network
and we wanted to separate them
out. We wanted potentially quite
a small controller that can have
access to a very large memory
matrix. In the same way that a
small processor in a digital
computer can have access to, you
know, a large amount of RAM or
disc or other forms of memory.
And so it's key, you know, if
you look at it from that
perspective, it's key that it's
not processing the entire memory
at once. If this thing's going
to be large, it needs to
selectively focus on parts of it
to read and write. And so we do this using the same template as I mentioned before for soft attention: the controller, the neural network, outputs parameters that parameterise what we're calling a weighting over the rows in the memory matrix. But this
weighting is really just the
same attention weights that we
discussed before. And we have
two main attention mechanisms.
So I've mentioned in the
previous section that my first
experience of soft attention in
neural networks was around
location-based attention as was
applied for this
handwriting synthesis network,
which was in fact the
inspiration for the Neural
Turing Machine. So having
realised that the handwriting
synthesis network could
selectively read from an input
sequence, I started to think:
well, what would happen if it
could write to that sequence as
well? And wouldn't it then start to resemble a Turing machine? But as well as the location-based attention that was used in the handwriting synthesis network, this also incorporates content-based attention, which, as I've said, is the preeminent form of attention used in neural networks. So addressing
by content looks a lot like it
does with other content-based
networks. There's a key vector
emitted by the controller that
is compared to the content of
each memory location. So take
each row in memory and treat
that as a vector. And then we
compare the key to that vector
using the similarity measure,
which was indeed cosine
similarity, which we then
normalise with a softmax. We
also introduced an extra
parameter which isn't usually
there for content-based
attention, which we called
'sharpness', and this was used
to sort of selectively narrow
the focus of attention so that
it could really focus down on
individual rows in the memory.
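As a sketch, content addressing with that sharpness parameter looks like this (names are mine):

```python
import numpy as np

def content_addressing(key, memory, sharpness):
    """NTM-style content addressing: cosine similarity between the emitted key
    and every memory row, then a softmax sharpened by 'sharpness'."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)
    w = np.exp(sharpness * sims)    # higher sharpness -> closer to a hard max
    return w / w.sum()
```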
But we also included this notion
of addressing by location. And
the way this worked was that the
network looked at the previous
weighting and output a shift
kernel, which was just a softmax over shifts between plus and minus 'n'. And we then essentially convolve that with the weighting from the previous timestep to produce
a shifted weighting. The maths here is very simple, as shown below. What this did is essentially shift the attention through the memory matrix, shift it down: so if you started here and output a shift kernel focused around maybe five steps or so, then you'd end up with an attention distribution that would look like this.
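Here's a sketch of that shift as a circular convolution (a three-element kernel over shifts -1, 0, +1 in this example; names are mine):

```python
import numpy as np

def shift_weighting(prev_w, shift_kernel):
    """NTM-style location addressing: circularly convolve the previous
    weighting with a small softmax shift kernel, moving the focus along
    the rows of the memory matrix."""
    prev_w = np.asarray(prev_w, dtype=float)
    S = len(shift_kernel)
    w = np.zeros_like(prev_w)
    for j, s in enumerate(range(-(S // 2), S // 2 + 1)):
        w += shift_kernel[j] * np.roll(prev_w, s)   # weight mass moved by s rows
    return w
```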
And the idea behind this combination of addressing mechanisms was to allow the controller to have different modes of interacting with the memory. And we thought about
these modes as corresponding to
data structures and accessors as
are used in sort of conventional
programming languages. So as
long as the content is being
used on its own, then memory is
kind of being accessed the way
it would be in something like a
dictionary or an associative
map. Well not strictly like a
dictionary because we didn't
have key value attention for
this network although you could
define it. Rather it would be
more like an associative array.
Through a combination of content
and location, what we could do
is use the content-based key to
locate something like an array
of contiguous vectors in memory
and then use the location-based shift to index a certain distance into the array. And when the network
only used the location-based
attention, essentially it acted
like an iterator, it just moved
on from the last focus. So it
could essentially read a
sequence of inputs in order. So, as we've said, this network uses attention to both read from and write to the memory. Reading is very much the standard soft attention template: we get a set of weights over the rows in the memory matrix to which the network is attending, and we compute a weighted sum. We take each row in the matrix and multiply it by its weight, which gives the degree to which the network is attending to that particular row. So this is exactly like the soft attention template I described before, only it's being applied to this memory matrix rather than to some external piece of data.
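So the read operation itself is essentially one line; sketching it with the same conventions as the earlier snippets:

```python
def ntm_read(memory, w):
    """NTM read head: the standard soft-attention readout over memory rows.
    memory: (N, M) matrix; w: (N,) attention weighting over rows."""
    return w @ memory            # r = sum_i w[i] * memory[i]
```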
The part that was novel and unusual was the write head, the writing attention mechanism used by Neural Turing Machines. And so in this case we
were inspired by the way long short-term memory (LSTM) has forget and input gates that are able to modify the contents of its own internal state. We defined a combined operation of an erase vector, 'e', which behaves analogously to the forget gate in LSTM, and an add vector, which behaves like the input gate. And essentially
what happened is that once the
write head had determined which
rows in the matrix it was
attending to, the contents of
those rows were then selectively
erased according to 'e'; and I
should say here: so 'e' is
basically a set of numbers
between zero and one. So
basically if some part of the erase vector goes to one, that
means whatever was in the memory
matrix at that point is now
wiped, it's set to zero. And if
'e' is set to zero, then the
memory matrix is left as it is.
So once again, there's this kind
of smooth or differentiable
analogue of what is essentially
a discrete behaviour, the
decision of whether or not to
erase. And adding is more
straightforward, it just says:
well, take whatever's in memory
and add whatever's in this add
vector, 'a', multiplied by the
write weights. So basically if
the write
weight is high and you're
strongly attending to a
particular area in a particular
row in the matrix, then you are
going to essentially add
whatever is in this add vector
to that row. And the important thing here is that for all the rows in the matrix for which this 'w[i]' is very low, nothing happens, nothing changes: if you're not attending to that part of the memory, you're not modifying it either.
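Putting the erase and add steps together, a minimal sketch of the whole write operation for a single write head (again with my own variable names):

```python
import numpy as np

def write(memory, w, e, a):
    # memory: (N, M) matrix; w: write weighting over the N rows
    # e: erase vector in [0, 1]^M; a: add vector in R^M
    memory = memory * (1.0 - np.outer(w, e))  # row i is scaled by (1 - w[i] * e)
    memory = memory + np.outer(w, a)          # row i receives w[i] * a
    return memory
```

So how does this work in practice? So what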
we really looked at was well,
can this Neural Turing Machine
learn some kind of primitive
algorithms in the sense that we
think of algorithms as applied
on a normal computer and you
know, in particular does having
the separation between
processing and memory enable it
to learn something more
algorithmic than we could do,
for example, with a recurrent
neural network? And we found
that it was indeed able to learn some very simple algorithms. So the simplest thing
we looked at was a copy task. So
basically a series of random
binary vectors were fed to the
network, that's shown here, at
the start of the sequence, and
then the network just has to
copy all of those and output
them through the output
vectors. And all it has to do is
just exactly copy what
was in here to what's over here.
So it's an entirely trivial algorithm, not interesting in its own right. But what's
surprising about it is that it's
difficult for an ordinary neural
network to do this. So neural
networks generally are very good
at pattern recognition. They're
not very good at exactly sort of
remembering, storing and
recalling things. And that was
exactly what we hoped to add by
including this access to this
memory matrix. And so the
algorithm that it uses, well, a
kind of pseudocode version for
it here is given on the right
but we can also analyse it by
looking at the use of attention
and the way it attends to
particular places in the memory
during this task. And so these
two heat maps that are shown
here at the bottom are again
heat maps showing the degree to
which the network is attending
to that particular part of the
memory. So when it's black, it's
being ignored, when it's white,
it's focusing and you can see
that there's a very sharp focus
here, which is what we want
because it's basically
implementing something that is
really fundamentally a discrete
algorithm. And so what it does
in order to complete this copy
task is it picks a location in
memory, given here, and then
starts to write whatever input
vector comes in, essentially
just copies that to a row of
memory and then uses the
location-based iterator, this
location-based attention, to
just move on one step to the
next row of memory and then it
copies the next input and so
forth until it's finished
copying them all. And then when it has to output, it uses its content-based lookup to locate
the very start of the sequence
and then just iterates through
until it's copied out everything
remaining.
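The pseudocode from the slide isn't reproduced here, but the learned procedure amounts to something like the following sketch, a hypothetical reconstruction in ordinary code rather than anything extracted from the trained network:

```python
def copy_algorithm(inputs, memory, start_row=0):
    # write phase: store each input in the current row, then use the
    # location-based shift to iterate one row forward
    row = start_row
    for x in inputs:
        memory[row] = x
        row += 1
    # output phase: a content-based lookup jumps back to the first row
    # (simplified here to remembering start_row), then iterate again
    row = start_row
    outputs = []
    for _ in range(len(inputs)):
        outputs.append(memory[row])
        row += 1
    return outputs
```

And so, you know,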
just once again, what was really
interesting here was to be able
to get a kind of, an algorithmic
structure like this going
through a neural network that,
you know, it was completely
parameterised by neural network
and it was completely learned
end-to-end and there was nothing
sort of built into the network
to adapt it towards this sort of
algorithmic behaviour. Now, a normal recurrent neural network, an LSTM network for example, can also perform this task. You feed in a
sequence of inputs and ask it to
reproduce them as outputs just
as a sequence to sequence
learning problem. But what you
find is it will work reasonably
well up to a certain distance,
but it won't generalise beyond
that. So
if you train it to copy
sequences up to length 10, and
then ask it to generalise to sequences of length, you know, a hundred, you'll find it
doesn't work very well as we'll
see. Whereas with the Neural
Turing Machine, we found that it
did work quite well. So in these
heat maps here, we're showing
the targets and the outputs. So
basically this is the copy
sequence given to the network,
if it's doing everything right,
each block at the top exactly
matches each block at the
bottom. And you can see that
it's not perfect, like there's
some mistakes creeping in as the
sequences get longer. So this is
for sequences, you know, short sequences of length 10, 20, 40, and so on. But you can still see that
most of the sequence is still
kind of retained. Like most of
the targets are still being
matched by the outputs of the
network. And that's because it's
just basically performing this
algorithm and using that to generalise to longer sequences.
So this is an example of where
attention and being able to
selectively pick out certain
parts of information and ignore
others gives you a stronger form
of generalisation. And this kind of generalisation that we see with Neural Turing Machines does not happen with a normal LSTM model, for example: essentially it learns to
copy up to 10 and then after
10, it just goes completely
awry. It starts to output sort
of random mush. And this really
shows that it hasn't learned an
algorithm, it's rather kind of
hard-coded itself, it's learned
internally to store these 10
things in some particular place
in its memory and it doesn't
know what to do when it goes
beyond that. So in other words,
because it lacks this attention
mechanism between the network
and the memory, it's not able to
kind of separate out computation
from memory, which is what's
necessary to have this kind of
generalisation. And this can be extended: we looked at other tasks, and one very simple one was to learn something akin to a for-loop. So the network is given a random sequence, and it's then also given an indicator telling it how many times it should reproduce that sequence. And then it just has to output the whole sequence N times, to copy it N times. And so basically what it
does is just uses the same
algorithm as before except now
it has to keep track of how many
times it's output the whole
sequence. So it just keeps on
jumping to the start of the array, to a particular row in memory, using content-based lookup, then iterates one step
at a time, gets to the end and
jumps back. And meanwhile it has
this sort of internal variable
that keeps track of the number
of steps that it's done so far.
Another example of, you know,
what it could do with memory was
this N-Gram inference task. So
here the task is that a sequence
is generated using some random
set of N-Gram transition probabilities. So basically
saying, given sort of some
binary sequence, given the last
three or four inputs, there's a
set of probabilities telling
you, you know, whether the next
input will be a zero or a one.
And those probabilities are sort
of randomly determined and then
you generate a sequence from
them. And as the sequence goes on, you can infer what the probabilities are. And
there's a, you know, a sort of,
a Bayesian algorithm for doing
this optimally. But what was interesting to us was: how well does a neural network manage to do this? It's kind of like a
meta-learning problem where it
has to look at the first part of
the sequence, work out what the
probabilities are, and then
start making predictions in
accordance with those
probabilities. And what we found
was that yes, once again, LSTM
can kind of do this, but it makes quite a lot of mistakes. And I should say that the red arrows here, I think, actually indicate mistakes made by the Neural Turing Machine.
But in general, the Neural
Turing Machine was able to sort
of perform this task much
better. And the reason it was
able to do that is that it used
its memory, it used specific
memory locations to store
variables that kept count of the
occurrences of particular
N-Grams. So if it had seen zero
zero one for example, it would
use that to define a key to look
up a place in memory, it might
be this place here, and then the
next time it saw zero zero one
it would be able to increment
that, which is basically a way
of saying, okay, if I've learned
that zero zero one is a common
transition, that means that the
probability of two zeros
followed by one must be really
high. So basically it learns to
count these occurrences which is
exactly what the optimal
Bayesian algorithm does. And it is able to do this by picking out specific areas in its memory and using those as counters.
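To make that counting strategy concrete, here's a minimal sketch in ordinary code. The Krichevsky-Trofimov estimator below is one standard Bayesian predictor for this kind of setting; the lecture doesn't specify the exact prior, so treat this as an assumption:

```python
from collections import defaultdict

# one counter pair per N-Gram context: [count of 0s, count of 1s]
counts = defaultdict(lambda: [0, 0])

def observe(context, bit):
    counts[tuple(context)][bit] += 1

def predict_one(context):
    # Krichevsky-Trofimov estimate of P(next bit = 1 | context),
    # i.e. a Beta(1/2, 1/2) prior over the transition probability
    c0, c1 = counts[tuple(context)]
    return (c1 + 0.5) / (c0 + c1 + 1.0)
```

Okay. So here's a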
little video kind of showing the
system in action. So this is the repeat-N-times task being performed. At the start, where the video is moving quickly, we see what happens during training. Then everything slows down and we see what happens with a trained version of the network; the input data comes in. While the input data
was coming in, we saw this blue
arrow here, which showed the
input data being written to the
network memory one step at a
time. So it stored this input
sequence in memory and then the
writing task begins. Once the
writing task begins, we see this
red arrow which represents the
write weights, the attention
parameters used for writing. And
we can see that these are, you
know, very tightly focused on one particular row in the memory, the one that it's
emitting as output at any one
point in time and it's then
iterating through this array one
step at a time. But what you can also see as this video goes on: the size and colours of the circles here represent the magnitude of the variables within this memory matrix. I believe the hot colours are positive and the cold colours are negative, as I remember. What's happening is that the network is running through this loop several times. If we run the video again, we can see
during training that, at first, these read and write weights are not sharply focused; they're blurred out. So this sharp focus kind of
comes later on. Once the network has finished writing the whole sequence, you see these variables in the background become larger. That's because it's using those to keep count of how many times it's repeated this copy
operation. And then at the end
it changes this final, this row
at the bottom, which is an
indicator to the network that
the task is complete. So it's
using this memory to kind of
perform an algorithm here. And
so quickly, I'm just going to mention that, following the Neural Turing Machine, we introduced an extended version of it, a successor architecture, which we called the Differentiable Neural Computer, and we introduced a bunch of new attention mechanisms to provide memory access. I'm not going to go through that in detail, but just to say: rather than looking at algorithms with this updated version of the architecture, what we were really interested in was looking at graphs. Because, you know,
while recurrent neural networks
are designed for sequences in
particular, many types of data
are more naturally expressed as
a graph of, you know, nodes
and links between nodes. And
because of this ability to store
information in memory and to
store and recall with kind of
something akin to random access,
it's possible for the network to
store quite a large and complex
graph in memory and then perform
operations on it. And so what we
did during training of the
system is we looked at randomly
connected graphs and then when
we tested it, we looked at
specific examples of graphs. So
one of them was a graph
representing the, sort of, zone
one of the London underground.
And we were able to ask
questions like, well, can you
find the shortest path between
Moorgate and Piccadilly Circus?
Or can you perform a traversal
when you start at Oxford Circus
and follow the Central line and
the Circle line and so forth.
And it was able to do this because it could store the graph in memory and then
selectively kind of recall
elements of the graph. And
similarly, we asked it some
questions about a family tree
where it had to determine
complex relationships like
Maternal Great Uncle. In order
to do that, it had to keep track
of all the links in the graph.
So for the remainder of the
lecture, we're going to look at
some further topics in attention
and deep learning. So one type
of deep network that's got a lot
of attention recently is known as
Transformer networks. And what's
really interesting about
Transformers, as they're often
known, from the point of view of
this lecture is that they are,
they really take attention to
the logical extreme. They basically get rid of all the other components that could be present in similar deep networks, such as the recurrent state of recurrent neural networks, convolutions, and external memory like we discussed in the previous section, and they just use attention to repeatedly transform a data sequence. The paper that introduced Transformers was called 'Attention Is All You Need', and that's really the fundamental idea behind them:
that this attention mechanism is
so powerful it can essentially
replace everything else in a
deep network. The form of attention used by Transformers is mathematically the same as the attention we've looked at before, but the way it's implemented in the network is a little bit different. Instead of there being a controller network, as there was in the Neural Turing Machine, that emits a set of attention parameters treated as a query, every vector in the sequence emits its own query and compares itself with every other. And sometimes I think of
this as a sort of
emergent or anarchist attention
where the attention is not being
dictated by some central control
mechanism, but it's rather
arising directly from the data.
And so in practice, what this means is that you have quite a similar attention mechanism to the content-based attention we've discussed previously, where a cosine similarity is calculated between a set of vectors. But the point is that there's a separate key being emitted for every vector in the sequence, and it's compared with every
other. And as with NTM and DNC, multiple attention heads are used: in fact, every point in the input sequence gives not just one attention key to be compared with the rest of the sequence, but several.
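A single head of this self-attention can be sketched in a few lines. This uses the scaled dot-product similarity from the Transformer paper, and leaves out the multiple heads, residual connections, layer normalisation and feed-forward sublayers of the full architecture:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of embeddings; Wq, Wk, Wv: (d, d_k) projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # every position emits its own
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # query, key and value
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V        # each output is a weighted sum of all values
```

So I'm not going to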
go very much into the details
of, you know, how Transformers
work. Although the attention
mechanism is straightforward,
the actual architecture itself
is fairly complex. And actually
I recommend this blog post, 'The
Annotated Transformer', for
those of you who want to
understand it in more detail.
But if we look at the kinds of operations that emerge from the system, it's very intriguing. So, as I've said, the idea is that a series of attention mechanisms are defined. And Transformers, I should say, are particularly successful for natural language processing. And I think the
reason for that, and the reason that attention in recurrent neural networks was also first applied to language, is that in language this idea of being able to attend to things that are widely separated in the input is particularly important. So this
one word at the start of the
paragraph might be very
important when it comes to
understanding something much
later on in the paragraph. And
if you're trying to extract, for example, the sentiment of a particular piece of text, there may be several words, very spaced out, that are required in order to make sense of it. So it's
a natural fit for
attention-based models. So in this particular example from the paper, the network has created keys and vectors for each element in the input sequence, and this creates a sequence of embeddings equal in length to the original sequence. And then
this process is repeated at the
next level up. So the network
basically now defines another
set of key vector pairs at every
point along the sequence. And those
key vector pairs are then
compared with the original
embeddings to create these
attention masks. And so we see that while this word 'making' here was being processed, I forget what the exact task was, it was attending to a bunch of
different words: 'laws', '2009',
the word 'making' itself, but
also this phrase 'more
difficult' here in the end of
the sequence. And so all of
these things are tied up with
the semantics of how this word
'making' is used in the
sentence. And as I said, there are multiple attention vectors being defined for each point in the sequence, and what you find is that different patterns of attention emerge.
So, for example, this example here is showing the all-to-all attention: how the embedding corresponding to each word in the sentence at one level is attending to all of the embeddings at another level. And
we see that this one is doing
something quite complex. It
seems to be looking for phrases
so the word 'what' is attending to 'this is what'; the word 'law' is attending to 'the law' and 'it's' and so forth. So
there's some kind of, you know,
complicated integrating of
information going on. Whereas
here, another one of the sort of
attention masks for the same
network, we see that it's doing
something much simpler, which is
it's just attending to nearby
words. And so the overall effect
of having access to all of these
attention mechanisms is that the
network can learn a really rich
set of transforms for the data.
And what they realised with the Transformer network is that just by repeating this process many times, they could get a very, very powerful model, in particular of language data. And
so from the original paper, they
already showed that, you know,
the Transformer was achieving state-of-the-art results for machine translation. Since then
it's kind of gone from strength
to strength: it now provides state-of-the-art results for language modeling. It's also been used
for other data types besides
language: it's been used for
speech recognition, it's been
used for two dimensional data
such as images. But from this
blog post here, this was posted
by OpenAI in 2019, we can see
just how powerful a
Transformer-based language model
can be. So language modeling essentially just means iteratively predicting the next word, or the next subword symbol, in a piece of text. And in this case, once the language model is trained, it can be given a human prompt and then you can generate from it just by asking it to predict what word it thinks will come next, feeding that word back in, and repeating, with the whole Transformer-based network able to attend to all of the previous context in the data.
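Generation from such a model is just a loop. Here's a minimal sketch, assuming a hypothetical model(tokens) interface that returns a probability distribution over the next token; this is not any particular library's API:

```python
import numpy as np

def generate(model, prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = model(tokens)    # the Transformer attends to all the context
        nxt = int(np.random.choice(len(probs), p=probs))
        tokens.append(nxt)       # feed the prediction back in and repeat
    return tokens
```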
And what's really interesting
about this text relative to
texts that have been generated
by language models in the past
is that it manages to stay on topic, it manages to keep the context intact throughout a relatively long piece of text. So having started
off talking about a herd of
unicorns living in an unexplored
valley in the Andes, it
continues to talk about
unicorns, it continues to, it
keeps the setting constant in
the Andes Mountains, it, you
know, it invents a biologist
from the University of La Paz.
And once it's made these
inventions, for example, once
it's named the biologist, it
keeps that name intact. So
having called it Pérez once, it
knows to keep on calling them
Pérez throughout. And the reason
it can do that is that it has
this really powerful use of
context that comes from having
this ability to attend to
everything in the sequence so
far. So what attention is really
doing here is allowing it to
span very long divides, very
long separations in the data.
And this is something that, before attention was introduced, even the most powerful recurrent neural networks, such as LSTM, struggled to do, because they had
to store everything in the
internal state of the network
which was constantly being
overwritten by new data. So
between the first time Pérez is
introduced and the last time
there might've been, you know,
several hundred updates of the
network and this information
about Pérez would attenuate
during these updates. But
attention allows you to kind of
remove this attenuation, allows
you to bridge these very long
gaps and that's really the
secret to its power,
particularly for language
modeling. Now, there have been many, many extensions to Transformer networks, and they have gone from strength to strength. One extension that I'd like to look at in this lecture, which I find very interesting, is known as the Universal Transformer. So the
idea here is that basically the
weights of the Transformer are
tied across each transform. So
the transforms here, if we look
at this model here, if we have
the input sequence over time
along the x-axis, then this, the
effect of the Transformer is to
generate
a set of self-attention masks at
each point along the sequence.
And then all of the embeddings
associated with these points are
then transformed again the next
level up and so on. Now on an
ordinary Transformer, the parameters of the self-attention operations are different at each transform, at each level going up on the y-axis. And this means
that the functional form of the
transform that's being enacted
is different at each step. So
tying these weights going up
through the stack makes it act
like a recurrent neural network
in depth.
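The contrast can be sketched like this, where a single linear map plus nonlinearity stands in for a full self-attention block, which is of course a large simplification:

```python
import numpy as np

def transform(x, W):
    # stand-in for one self-attention block (toy simplification)
    return np.tanh(x @ W)

def stacked_transformer(x, Ws):
    for W in Ws:                  # ordinary Transformer: new weights per level
        x = transform(x, W)
    return x

def universal_transformer(x, W, n_steps):
    for _ in range(n_steps):      # Universal Transformer: the same weights,
        x = transform(x, W)       # applied recursively, like an RNN in depth
    return x
```

So what you have is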
something like a recursive
transform. And what's
interesting about that is that
you then start to have something
that can behave a little bit
more algorithmically, something
that is not only very good with
language, but is good at
learning the sorts of functions,
the sorts of algorithmic
behaviours that I talked about
in the Neural Turing Machine
section. And this can be seen in some of the results for Universal Transformers, where it was applied, for example, to the bAbI tasks, which are a set of kind of toy linguistic tasks generated using a grammar. The other thing I should say is that because the weights are tied, you can enact this transform a variable number of times, in just the same way that you can run an RNN over a variable length sequence. So now we have
something where the amount of
time it spends transforming each
part of the data can become
variable, it can become data
dependent. And so this relates
to work that I did in 2016 which
I called Adaptive Computation
Time. The idea of Adaptive Computation Time was to change this: with an ordinary recurrent neural network, there's a one-to-one correspondence between input steps and output steps. Every time an input comes in, the network emits an output. And the problem with this, in some sense, is that it ties computation time to what we could call Data Time. There's
one tick of computation for
every step in the data. And now
you can alleviate this by
stacking lots of layers on top
of each other so now you have
multiple ticks of computation
for each point in the input
sequence. But the idea of
Adaptive Computation Time was
that maybe we could allow the
network to learn how long it
needed to think in order to make
each decision. We called this the amount of time it spent pondering each decision. So the idea is: some input comes in at timestep x₁, the network receives its hidden state from the previous timestep, as usual for a recurrent neural network, and it then thinks for a variable number of steps before making a decision. And
this variable number of steps is
determined by a Halting
Probability. So you can see these numbers 0.1, 0.3, 0.8; the idea is that when the sum of these probabilities passes a threshold of one, the network is ready to emit an output and move on to the next timestep.
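Here's a minimal sketch of that pondering loop, assuming a hypothetical cell(x, h) that returns a new hidden state together with a halting probability; the real mechanism also weights the network's outputs in the same way and adds a ponder cost to the training loss:

```python
def ponder(cell, x, h, eps=0.01, max_steps=100):
    # keep thinking until the accumulated halting probability
    # passes the threshold of (1 - eps)
    total, weighted = 0.0, 0.0
    for _ in range(max_steps):
        h, p = cell(x, h)
        p = min(p, 1.0 - total)   # the last step uses the leftover mass
        weighted = weighted + p * h
        total += p
        if total >= 1.0 - eps:
            break
    # the emitted state is the halting-weighted mean of the ponder states
    return weighted
```

And so what's sort of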
the relevance of this mechanism
to the rest of this lecture,
which has been about more
explicit attention mechanisms,
is that, in some sense, the
amount of time a person or even
a neural network spends thinking
about a particular decision that
it makes is strongly related to
the degree to which it attends
to it. So, I mean, there have been cognitive experiments done with people where, by measuring the amount of time it takes them to answer a particular question, you can sort of measure the amount of attention that they need to give to that question. And so
if we look at this concretely, at what happens when we apply Adaptive Computation Time to a recurrent neural network: this is an ordinary LSTM network being applied to language modeling, in this case next-step character prediction, and the y-axis of this graph shows the number of steps that the network stopped to
think for. Now, this number of steps is actually not an integer, because the halting mechanism can slightly overrun a complete step, but that's not really important.
What's important here is that
there's a variable amount of
computation going on for each of
these predictions that it has to
make and you can immediately see
a pattern. So for example, the
amount of ponder time goes up
when there's a space between
words. And the reason for this
is that it's at the start of
words that we need to spend the
longest thinking, because once you've gone most of the way through a word, it's easy to predict the ending. Once you've
seen 'p', 'e', 'o', 'p', 'l',
it's pretty easy to predict that
'e' is going to come next. Once
you see a space after that 'e'
then it becomes harder. Now you
have to think, well, 'and the
many people', what word could
come next? So this takes
a little bit more thought and
then it tends to drop down again
and it spikes up even further
when it comes to a kind of
larger divider, like a full stop or a comma. So there's a very close connection here: if we think back to the plots I showed you at the very start of this lecture, which were to do with implicit attention, where we saw that a deep network or recurrent neural network will respond more strongly to certain parts of the sequence, we kind of see that same pattern emerge again when we give the network a variable amount of time to think about what's going on in the sequence. And there's some
interesting consequences here.
So for example, one is that because this is a question of how long the network needs to think in order to make a particular prediction, the network is only interested in predictable data. So, for
example, consider these ID tags: this is from Wikipedia data, which contains XML tags as well, and we can see that there's no spike in thinking time when it comes to these ID numbers. And this is
kind of interesting because
these ID numbers are hard to
predict. So it isn't simply that
the network thinks longer
whenever it finds something
that's harder to predict, it
thinks longer when it sort of
believes that there's a benefit
to thinking longer, when
thinking longer is likely to
make it better able to make a
prediction. And the reason it would be better able to make a prediction is that it allows
it to spend more time processing
the context on which that
prediction is based. So this
kind of goes back again to the
idea we talked about in
Transformers of having these
repeated steps of contextual
processing as being the thing
that builds up the information
the network needs to make a
prediction. And so there's a nice combination of this idea of Adaptive Computation Time with these Universal Transformer models. In this case, we have a task from the bAbI dataset where there's a series of sentences. These sentences, presented along the x-axis, are the context that the network needs to know about, and then it gets
asked the question. The question
here was: 'where was the apple
before the bathroom?' And if you
go through all of these
sentences, and I think I've
cropped this graph so it doesn't
have all of them, but you can
see that things are happening
with the apple: 'John dropped
the apple',
'John grabbed the apple'; 'John
went to the office', so we think
the apple at this point is
probably in the office. 'John
journeyed to the bathroom',
well, maybe now it's gone to the
bathroom. But in between those
two things were some pieces of
information that weren't
relevant: 'Sandra took the
milk', for example. 'John
traveled to the office', we're
back in the office again. So
there's a little puzzle here
that the network has to work out
as to where the apple has ended
up. And of course some parts of
the sequence are important for
that puzzle and some parts
aren't. 'John discarded the
apple there', well of course
that's very important. Basically
all of the ones that mentioned
where John is are important and
generally those are the ones
that the network spends longer
thinking about. So via this Adaptive Computation Time, and via this Transformer model where each point along the sequence is attending to all of the others, we build up a similar picture in some sense to the one we had at the start of the lecture, where we can see that the network has learned to focus more on some parts of the sequence than others. And so
once again, this is what
attention is all about, it's
about ignoring things and being
selective. And so to conclude, I
think the main point I would
like to get
across in this lecture is that
selective attention appears to
be as useful for deep learning
as it is for people. As we saw
at the start of the lecture,
implicit attention is always
present to some degree in neural
networks just because they've
learned to become more sensitive
to certain parts of the data
than to others. But we can also
add explicit attention
mechanisms on top of that and it
seems to be very beneficial to
do so. These mechanisms can be
stochastic, so-called hard
attention, that we can train
with reinforcement learning or
they can be differentiable,
so-called soft attention, which
can be trained with ordinary
backprop and end-to-end
learning. And we can use
attention to attend to memory or
to some internal state of the
network as well as to data. So
many types of attention mechanism have been defined, and the ones I've covered in this lecture are only a small fraction of what's been considered in the field; many more could be defined. And what's
become very clear over the last
few years is that you can really
get excellent results,
state-of-the-art results in
sequence learning by just using
attention, by using Transformers
that essentially get rid of all
of the other mechanisms that
deep networks have for attending
to
long range contexts.
And that is the end of this lecture on
attention and memory in deep
learning. Thank you very much
for your attention.
