Hello and welcome to the
UCL and DeepMind lecture series.
My name's Felix Hill and I'm
going to be talking to you about
deep learning and language
understanding. So here's an
overview of the structure of
today's talk. It's going to be
divided into four
sections. So in the first
section I'll talk a little bit
about neural computation in
general and language in general
and then give some idea of why
neural computation, deep
learning and language might be
an appropriate fit to come
together and produce the sort of
improvements and impressive
language processing performance
that we've seen over the last
few years. In the second section
I'll focus in on one particular
neural language model, which I
think is quite representative of
a lot of the principles that
govern all neural language
models. And that model is the transformer, which was published in 2017. And then in section
three I'll go a bit deeper into
a particular application of the
transformer, that's the well
known BERT model, and BERT in
particular is an impressive
demonstration of unsupervised
learning and the ability of
neural language models to
transfer knowledge from one
training environment to another.
And then in the final section,
we'll take a bit more of a look
towards the future of language
understanding and deep learning
and to do that we'll delve into
some work that's been done at
DeepMind on grounded language
learning, where we study the
acquisition of language in deep
neural networks that have the
ability to interact and move
around simulated environments.
So that's the overall structure.
It's important to add that of
course, natural language
processing is an enormous field
and there are many things that
I'm not going to have the time
to talk about during this
lecture. So some of the most
important ones are things like
sequence to sequence models and
specific applications of deep
learning to neural machine
translation. Also speech
recognition and speech synthesis
are really important
applications that I won't have
time to talk about. And then there are many NLP tasks which I also won't get the chance to delve into, from machine comprehension and question answering to dialogue. And even in grounded language learning in the last section, I won't get the chance to go into things like visual question answering or video captioning. So in short,
the important thing to take away
is that I'm not going to have
the chance to cover all aspects
of natural language processing.
I'm going to just talk about a
few focused areas, and that's because I think they're quite representative and hopefully convey the key concepts; it's not because I think they're more important or more valid than any other areas. Cool. So, let's
start off with a bit of
background about deep learning
and language and how they might
fit together. Of course, where
we are now is that there have been a load of impressive results applying deep learning to
natural language processing in
the last few years. So you may
have heard of models like GPT-2
or BERT, or WaveNet, which was developed at DeepMind. And all
of these models have done
really impressive things with
respect to the various aspects
of language processing that they
focus on. So GPT-2 as a language
model is now able to produce
long streams of text which look
like plausible stories and BERT
has led to very large
improvements on many language
classification tasks. And of
course WaveNet has led to
fantastic performance in speech
synthesis, where we're now
able to synthesize voices for
various speech applications with
much more fidelity than was
previously possible. So it's really an exciting era for natural language processing, and we're moving at a rate of progress which is possibly unprecedented, at least in recent years. And if you think about the whole panorama of different things you might apply language models or language processing technology to, then to a much greater extent than at any point in the past, neural computation and deep learning play a role in those systems. So on the left we have
systems which are almost now
entirely based on neural
networks from machine
translation systems to speech
synthesis systems and speech
recognition and then on the
right here, it's important to
note that there are still many
applications which do language
processing but don't use deep
learning or neural networks for
all of their computation or even
at all. So for things like home assistants, which you might ask to provide specific pieces of information from the internet, we're still a long way from building systems in an end-to-end fashion with deep neural networks. Having said
that, the balance of this
particular scale has moved a lot
over the last few years, and there's certainly a trend towards more
applications of neural
computation and neural networks
in language processing
applications. And it's not just
in practical applications. In
the slightly more focused world
of research, we see a similar
trend. So this is data from 2010
to 2016 and it covers
submissions to two of the main
language processing conferences,
ACL and EMNLP. And on the chart,
you just see the number of
papers published at those
conferences for which the word
'deep' or 'neural' is found in
the title. And you can see that back in 2010 there were effectively zero papers with those words in the title, but by the time we got to 2016 this number had scaled up rapidly. And of course there's a very good chance that, if we looked at the data up to 2020, we would see this trend continue. And it's
obviously not just the number of
publications but the effective
quality of systems and models,
which seems to be improving over
this time. So here's just a
snapshot in time, 2018-19, of
how well the best model was able
to perform on the GLUE
benchmarks. So the GLUE benchmarks are intended to be representative of language classification challenges: things like reading a couple of statements and saying whether one of them entails the other, or maybe classifying them as positive or negative sentiment. So our ability to do those sorts of things automatically has rapidly increased according to this benchmark, just between 2018 and 19. You can see the rate of improvement from under 60% performance to approximately 85% performance.
And again, on this task, on this
benchmark, the performance has
just increased and increased up
to the present date. So, taken together, this is a bunch of evidence that deep learning has really been able to improve performance on a range of language processing applications, and looking at that evidence raises the question of why deep learning, and models which have this neural computation at the heart of their processing, have been able to be so effective in language processing. What is it about deep learning, and what is it about language, which has allowed this sort of effect to take place? And of course, if we can answer that question, if we can understand that, then that can help us to think a little bit about ways to improve things further. But of course, in order
to understand that, we really
need to think a bit about
language. So in the other
lectures in this series, you will have had a very comprehensive introduction to deep learning, neural networks and the
principles of neural
computation. In this lecture,
I'd like to just spend a bit of
time to think about language in
itself so we can start to think
about why these two paradigms
fit well together. So the first thing about language: it's often said that language is a process of manipulating symbols, or that language processing involves symbolic data and operations on symbols. So if we had a sentence coming into a network and a sentence coming out of a network, then one characterisation of that problem is as a mapping from symbols to symbols, these very discrete units. But those who think a little bit more about language specifically have many reasons to believe that the individual words we might be passing to these models don't behave like discrete symbols exactly. So
let's just consider an example.
The word 'face'. If we think
about the word 'face' we can
find it in many different
contexts in language: in the sentence 'did you see the look on her face?', in 'we could see the clock face from below', in 'it could be time to face his demons', or in 'there are a few new faces in the office today'. As we think about those uses of the word 'face', we get some sense that they are different in meaning or different in usage, and we call these differences word senses. But the important thing to note about the different senses of the word 'face' is that they're not entirely different. So it's not the case that we should model these as entirely
independent symbols, which we
would like to pass through a
model. In actual fact, what we find when we delve into the meaning is that these meanings have certain aspects in common, but they're just not identical. So if we think about the first case of 'face', we might think of that as the most prototypical meaning. And of course, that's just the face that you can see in front of you: the most important side of somebody's head, the site of the eyes and the nose. As well as being the most important side of somebody, it's something that represents them, and it's something which is used to inform or communicate with other people. So if we think of all of
those as sort of features or
properties of this sense of
'face' and then when we think
about the sense 'the clock
face', we see that it actually
shares some of those properties
but not all of them. So the clock face is also the most important side of the clock, and it's also the side that's used to communicate or inform others by conveying information. And
then if we think
about this notion of face as a
verb, 'to face your demons',
again, there's this idea that
when you face somebody, you
point your literal face directly
at them. So it conveys this same
sense of the pointing aspect of
face that's also conveyed in the
core prototype. And then finally, if we think about the idea of 'new faces in the office today', that conveys this sense of identity, which is also potentially shared by the core meaning of 'face'. So this example shows, and you will see these effects if you look at many other words, that rather than discrete word senses which are orthogonal to each other, we might be better off modelling this variation in meaning within individual words as overlapping representations that can interact but are not necessarily the same. Okay.
Now, when we think about the
fact that each word could have
many of these different senses,
how could a processor possibly
make sense of this sentence? How
could it possibly disambiguate
the possibilities for the
different senses in order to
come to one coherent
understanding of a phrase? Well,
one of the ways in which we do
that is of course by using
additional information separate
from that particular word. So we
use the wider context. And to
give just a small example of how
our language processing really
depends on context, consider
this example. It actually goes
back to Rumelhart in the mid to
late seventies. So he
noticed that if we had some handwriting like this, 'Jack and Jill went up the hill', we can read it very quickly. And in the bottom case, 'The pole vault was the last event', we can read that just as easily. But if you look at the areas highlighted in red here, you'll see that there's actually a character which is identical in both cases, and it's arguably midway between an 'e', a 'v' and a 'w'. But when we read the sentence seamlessly, we just interpret this character, which could potentially be ambiguous, in the way which fits most naturally with the whole of the wider sentence around it. So this sort
of example intuitively gives us
some justification for thinking,
well, maybe it's the
interactions between the
individual tokens that we're
looking at and all of the things
around them, which actually
allow us to solve the mystery of
which possible sort of sense of
a word we might be looking at at
any one time. And in particular,
what we probably do according to
this example at least, is think
about the whole sentence and
think what's the most likely
interpretation of the whole
sentence and that in itself
informs the individual
interpretation of the particular
characters where ambiguity might
be. Another classic example of this phenomenon comes from reading the following image on your screen either across or down. So obviously as
we read it down, the character
at the very center of the image
looks very much like a 13, but
as we read it across it looks
clearly like a B, and this tells
us the extent to which, even in
our very early perceptual
processes, the context is
informing the ways in which we
map what we're seeing into
things further inside our
processor, which might be our
memories of existing symbols
like 13 or B in this case. So we've seen that words are not necessarily best modelled as discrete symbols. And we've also seen that in order to disambiguate the meanings of these word-like things, we are better off considering the wider context in order to modulate those computations.
Another very important fact
about language is that the
important interactions which we
may well need to model, can very
often be non-local. It can be things that are not very close together whose interactions we have to capture. A classic example is a sentence like this: 'The man who ate the pepper sneezed'.
So even though 'the pepper sneezed' is a contiguous part of the sentence, as we read the sentence we know that it's in fact the man who sneezed. We know that this sort of image characterises what happened when we read this sentence, and that there isn't necessarily anything to do with a pepper to be seen, apart from the fact that it's something the man just ate. So this tells
us that it can be things at one
end of the sentence and things
at the very other end of the
sentence, which must be
considered to interact in order
for us to form the most
satisfactory meaning when we
read sequences of words.
However, there are of course
other factors at play. So
consider the sentence, 'the cat
who bit the dog barked'. Now
it's actually the case that
people are much slower to make
sense of this sentence than 'the
man who ate the pepper sneezed'
even though they have exactly
the same overall structure and
length. Eventually, upon thinking about it, we do realise that in the sentence 'the cat who bit the dog barked' it's actually, unusually, the cat which does the barking. But our difficulty in processing this also tells us that many factors are at play. In particular, the three-word phrase 'the dog barked' seems to capture our attention, and we have an urge to consider that it's actually the dog barking, in a way that's much stronger than in the other case, where we don't have such a strong urge to consider that it's the pepper that sneezed. Now where might those urges come from? And can we capture those in our computational models? Well,
these sorts of examples seem to
tell us that those urges can
come from our underlying
understanding of the world, our
understanding of the meaning of
dog and barking and the fact
that those are very likely to
come together and form and
describe a particular situation.
Whereas our knowledge of peppers tells us that they don't typically sneeze, so we don't think that 'the pepper sneezed' is a very likely state of affairs, and we look for other ways to make sense of the sentence; the correct way of making sense of that sentence is in fact more salient to us as we process it. So that's just a
thought to bear in mind when
we're thinking about optimal
processes of language in deep
learning models. And then
there's another final point. So
lots of people who consider and
talk about language,
particularly in the wider
machine learning community,
consider language to be
compositional in the sense that
the meaning can be computed
simply by elegant operations on
the individual parts. But when we actually consider how meanings combine, the picture is a little less clear, and it seems very likely that whatever we do to combine meanings ought to take into account exactly what those meanings are. It shouldn't operate identically on any arbitrary set of inputs; it should be a function which really takes into account the individual meanings in a particular scenario before deciding the best way to combine those meanings. Just to justify
that, consider the following
example: here's a characteristic image of something that's red. But if we look at red wine, none of us would find that description unusual, even though the colour of that wine is much darker; it could even look black in that particular image. And here's a red pen. Our
experience of pens tells us that
even red pens needn't be at all
red when we look at them from
afar, it's only the ink that
comes out of them that needs to
be red. So even in something as
simple as combining a colour
adjective with a noun, there's
all sorts of factors at play
telling us exactly how those
meanings combine that don't seem
to be equivalent from one pair
of words to the next. Things get
even more wacky in certain
cases. So here's a classic
example about pets. If we think
about the prototypical pet, it's
probably black or white or brown
because obviously dogs and cats
have those sorts of colours. If you think about fish, then maybe the classic fish we'd think of would be silver or grey, and slippery. But when
we think about pet fish, this
sort of magic seems to happen
where our typical pet fish has
lots of bright colours. It could
be orange, green or purple or
yellow. So something seems to
have happened in our mind to
allow these strong features to
come into the representation of
pet fish which didn't play a
strong role
in our representations of either
pet or fish. This doesn't always
happen when we combine words,
but it does sometimes. Another example would be our representation of plant, which typically looks something like that, and our representation of carnivore, which might look a bit like that; but our representation of carnivorous plant has this additional feature about eating insects. So these are kind of
wacky effects of how meanings
interact when two words come
together, and it's not easy to explain them in a model which treats every pair of words fed into it with exactly the same function for combining their meanings. It very much seems to me that what's instead happening is that whatever function is combining the meanings is taking into account the individual meanings of the components going into that function; and additionally, that function may well need to take into account wider knowledge of typical things we might encounter in the world and how their properties might fit together under the constraints of the world as we know it. So
just to summarise, we've seen in
this section that words have
many related senses and that
they're not necessarily
characterised as sort of
perfectly idealised discrete
symbols. We've also seen that in order to find which of those senses is most relevant in a particular scenario, settling that problem might require us to look at the wider context around that word, and in many cases we may need to look a long way from a particular word to satisfactorily disambiguate the uncertainty that we have at any particular point. And finally,
when we're thinking about
building models of how word
meanings might combine, we've
seen that functions that combine
meanings will probably need to
take into account what the
inputs to those functions are in
order to come up with the best
bespoke way of combining for
those particular words. And
we've even suggested that they may well also need a wider sense of how the world works and
how things can naturally fit
together in order to eventually
arrive at the optimal
representation for the
combination of meanings in each
particular case.
So, in the first part we talked about particular aspects of language and particular aspects of neural computation that seem to fit together in a particularly appropriate way, in that they define certain ways in which a computational model might need to behave in order to capture the ways that meaning works in language. So in
this section we're going to talk
much more concretely about a
specific model, which was
published just a couple of years
ago and has had an incredible
impact on a large number of
natural language processing
problems from machine
translation to sentence
classification and essentially
any problem that requires a
model to process a sentence or a
passage of multiple sentences
and compute some sort of
behavioural prediction based on
that. So it's fair to say that for any problem of that form, the transformer, or some variant of a transformer, is probably the state-of-the-art method: the best way for a model to learn to extract the signal from those sentences in order to make optimal predictions. And in this
section I'll talk about the
details of the transformer and
just refer back to those aspects
of language processing that we
saw in the first section in
order to give some intuition
about why the transformer might
be so effective when it
processes language. So just
here's credit to the authors of
the transformer from Google
Brain and collaborators. And the
paper is obviously available for
you to find out the fine
details, but I'll give a broad
overview, starting in particular with the first layer.
So the transformer contains a
distributed representation of
words in its first layer which
is something it has in common with almost all neural language models now. And what do I mean
by a distributed representation
of words? Well, the first thing
that we do when we construct a
neural language model is we have
to determine what is the
vocabulary on which the model is
going to operate. So what I mean
by that is we do need to chop up
the process, the input which the
model sees into some sort of
units in order to pass them to
the model. Now, if you think
about a large page of text,
those units could be individual
characters. In an extreme case,
they could be individual pixels
if we consider that the, the
text an actual image. But in
general with
language processing
applications, because we have
texts stored in digital form, we
don't need to go through that
phase and subject our model to
having to learn to process
pixels. So we have to make a
decision about what it, what are
the units that we actually pass
to the model. And in most
applications of neural language
models these days, that can
either be character level, which
is where we pass each unit as an
individual letter or it can be
word level, which is where we
split the input according to
white space in the text and then
we pass each of the individual
words to the model as discrete
different symbols. But of
course, as we've talked about in
the last section, a model which
just takes symbols and treats
them as symbols might not be
optimal for capturing all of the
aspects of meaning that we see
in natural language. So instead of doing that, the developers of neural language models have come up with a procedure which allows the model to be more flexible in the way it represents words. And that process is something like the
following. So let's say we do
take the decision to chop up our
input text according to
individual words. What we first do is consider all of the words that we want our model to be aware of, and we let that define the total vocabulary of the model. To get such a list, we might scan hundreds of thousands of pages of text and count all the words that we find there, and then we can take some subset of the words which appear most frequently; or alternatively, if we have lots of memory and a really big model, we can take all of the words and allow all of them into the vocabulary of the model.
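To make that concrete, here's a minimal sketch of the vocabulary-building step in Python. The function name, the toy corpus and the cutoff size are illustrative assumptions, not details from the lecture.

```python
# A minimal sketch of the vocabulary-building step described above,
# assuming the corpus is already one long string.
from collections import Counter

def build_vocab(corpus_text, max_size=50000):
    # Split on whitespace, as described, and count every word we find.
    counts = Counter(corpus_text.split())
    # Keep the most frequent words (or all of them, memory permitting).
    most_common = [word for word, _ in counts.most_common(max_size)]
    # Map each word to the index of its unit in the model's input layer.
    return {word: i for i, word in enumerate(most_common)}

vocab = build_vocab("the cat sat on the mat . the dog barked .")
print(vocab["the"])  # 0: 'the' is the most frequent word in this toy corpus
```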
What we typically do in a neural language model, then, is pass each of the words
to an input layer, and that input layer contains a particular unit corresponding to each of the words in the vocabulary of the model. Importantly, those units are then connected to a set of weights: each unit is connected to the same number of weights, and those weights connect to a set of units of a particular dimension. That dimension we can think of as the word representation dimension, or the word embedding dimension. When the model sees a given word, we turn on the unit corresponding to that word and we leave all of the other units at zero. So we put an activation of one on the unit corresponding to the word and leave all of the others at zero. We've marked the corresponding weights in this diagram here with yellow, and light blue shows the space occupied by the whole layer of input weights for the model. So in this case, when
the model sees the word 'the', we turn on the unit corresponding to 'the'; and because we activate the 'the' unit with a strength of one, we activate each of the units at the output of the next phase, corresponding to this black box around the grey rectangle, according to exactly the weight that went from the unit for 'the' to this distributed layer. So effectively, in that layer with the black box around the rectangle, we get a representation corresponding to the word 'the', but that representation is actually a finite vector of floating-point weights.
And if we do this for all the
words, we get a different
representation for all of the
words. So we can unroll the input, do this repeatedly, and get a sequence of
vectors of floating point values
for each of the words in our
input. And those vectors live in
a space and importantly that
space has certain geometric
properties. So we might find
that representing words in a
space like that allows words to
move together in the space if
it's useful for the model to
represent them as somewhat
similar and to move further away
in that space if it's useful for
the model to represent them as
different. Because remember with
backpropagation, all of the weights
in this first layer of the model
are going to be trained to
optimise the model to achieve
its objective. So this gives the
model the flexibility to move
its representation of individual
words around as it sees fit and
the best way to achieve its
objective. So just to recap,
this is the first layer of many
neural language models,
including the transformer. And
it contains quite a lot of
weights. So if we have a total
of capital V words in our
vocabulary and if capital D is
the dimension of the vector that
we're going to represent each of
these words with in a floating
point vector, then the total
number of weights that we have
in the first layer is V
multiplied by D and we end up
with a D dimensional Euclidean
space with which to represent
these input units in the model.
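As a minimal illustration of this first layer, here's a NumPy sketch in which a V x D weight matrix is indexed by a one-hot input vector to produce a word's distributed representation. The sizes and the random initialisation are illustrative; in a real model these weights would be trained by backpropagation.

```python
# A V x D weight matrix: activating one input unit (a one-hot vector)
# picks out one row as the word's distributed representation.
import numpy as np

V, D = 10000, 100                       # vocabulary size, embedding dimension
W_embed = np.random.randn(V, D) * 0.01  # V * D weights in total

def embed(word_index):
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0           # activation of one on this word's unit
    return one_hot @ W_embed            # equivalent to W_embed[word_index]

e_the = embed(7)                        # D-dimensional vector for word 7
print(e_the.shape)                      # (100,)
```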
Now this idea of representing
words or letters or whatever we
take as the input units to a
model in some sort of high-dimensional, real-valued vector space is actually quite an old idea. If we go back to 1991, Miikkulainen and Dyer produced a neural language model with much less computational power than current models have, but it still tried to execute this principle of representing input words in a distributed geometric space. And it was able to exhibit certain types of interesting generalisation when trained on real text that a model which represented words as individual discrete symbols wouldn't be able to achieve.
And of course, perhaps the most
famous example of this
demonstration came from a very
famous paper in which Jeff Elman
introduced the recurrent neural
network to the wider community.
And in this paper Elman analysed
the distributed representations
corresponding to lots of
different words as he trained
the model on sequences of, sort
of, subject verb object style
sequences of natural language
style snippets. And the
objective of this model was just
to represent a sequence of words
such that the model was able to
optimally predict the next word
with as much accuracy as
possible. And what Elman found
when he analysed the way that
the model was distri-, was
representing these words
internally was that of all the
words in his vocabulary, they
started to cluster together in
this geometric space such as the
words with similar meanings came
together. And importantly also
words with similar syntactic
roles, so things like verbs or
nouns or subjects or objects
also started to cluster together
in the space. And this tells us
that neural language models, as they experience more and more text, start slowly to infer the underlying structures in language which we might perceive as language users, such as subject, object and verb and how things fit together, as well as an emergent categorical semantic structure in which certain classes of words naturally group together.
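To give a flavour of the kind of analysis Elman did, here's a toy sketch that inspects which words sit closest together in an embedding space, using cosine similarity. The word list and the random embeddings are stand-ins for what a trained model would actually learn.

```python
# Which words have ended up near each other in the embedding space?
import numpy as np

words = ["dog", "cat", "chase", "eat", "sandwich"]
W_embed = np.random.randn(len(words), 100)   # would be learned in practice

def neighbours(word):
    v = W_embed[words.index(word)]
    # Cosine similarity between this word's vector and every row.
    sims = W_embed @ v / (np.linalg.norm(W_embed, axis=1) * np.linalg.norm(v))
    return [words[i] for i in np.argsort(-sims) if words[i] != word]

print(neighbours("dog"))  # in a trained model, semantically and syntactically
                          # similar words (e.g. 'cat') should appear first
```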
So that's the solid foundation on which the
transformer builds. But that's
of course not novel to the
transformer. Distributed
representations of words have
been a part of neural language
models as I pointed out since
the early nineties. So what else does the transformer do that makes it so powerful and allows it to capture some of the aspects of language that I talked about in the first section? Well, after the first stage of processing, which I've just outlined in the previous slides, we end up with a particular real-valued continuous vector for each of the words in an input sentence. At the next stage, the transformer computes what's called a self-attention
operation. So how does that
work? Well for any
self-attention operation, there
are three matrices containing
the weights which parameterise
the operation. The first matrix we could call the query weight matrix WQ, the second the key weight matrix WK, and the third the value weight matrix WV. Each of these matrices has independent weights in the transformer, and their dimensions are such that they can naturally multiply, in this case as I've written it post-multiply, the distributed word vector that I talked about in the first section. And importantly, as the
self-attention operation is
carried
out, these weights are applied
equally and in exactly the same
way to each of the words in the
input. So for every individual word vector, I've written here 'e beetle' corresponding to the word 'beetle' in the input, we end up with three further vectors, obtained by multiplying that vector by the matrix WQ, the matrix WK and the matrix WV. Those three additional vectors we can call bold q, bold k and bold v; they are typically called the query vector, the key vector and the value vector for the self-attention operation corresponding to each of the words. And then with those three
vectors, we use them to
understand how the different
words in the input start to
interact. So in particular with
the query vector, we produce an
operation where for every word
we take the query vector
corresponding to that word and
we compute the inner product,
the dot product of that word,
with the value, with the, with
the key vector corresponding to
each of the other words. So
that's represented here by the
dotted line. And by taking a dot
product in that way, we get a
scalar, and then we can, we want
to understand how big is that
scalar relative to an average
scalar that we would get if we
just took that operation
arbitrarily. So essentially we
want to give the model the power
to represent
how strong the connection between these two words should be. And in order for that to be a nicely normalised distribution over all the possible strengths computed by the model, we first work out the inner product of the query vector with the key vector of a particular word; then we need to normalise by a quantity reflecting the dot products of that query vector with each of the key vectors of the other words. The way we do that is to pass all of those dot products through a softmax layer, which exponentiates and normalises, such that we get a nice smooth distribution corresponding to how well each of the queries corresponds to each of the keys of the words in the input. So this then gives us
a probability distribution, a set of weights between zero and one. So for a given word 'beetle', we get a set of weights, one for each of the words in the input, telling us to what extent there is a strong interaction between the word 'beetle' and that other word. So
in this case, the way I've
marked it in the slide is that
the strongest interaction when
we do this operation is with the
word 'drove' and
that might be because the word
'drove' tells us in particular
that this beetle is not the
animal type of beetle, but it
should in fact be thought of as
the car beetle. So that's the
sort of interaction that we want
to naturally capture here. Once
we've got these weights, we then
use them to tell us how much of
the value representation to take
through to the next layer of the
transformer. So in this case,
for example, when representing
the word 'beetle', we would
notice a strong connection with the word 'drove', and that would give us a strong weight in our attention distribution. And
then that would tell us to take
a lot of the value of the
embedding for 'drove' through to
the next layer of the
transformer. So the operation
which allows us to take an
amount of the value through to
the next layer, proportional to the weight computed by the transformer, is just this simple weighted sum. So what we end up with then
for each word like 'beetle' is
that we take a small amount of
the value of each of the other
words plus some of the value of
the word 'beetle' through to the
next layer of the transformer.
And that can then be aggregated
to form the next layer's
representation of the word
'beetle'. So notice that having
performed this transformer
layer, we haven't reduced the
number of embeddings in the
model in any
way, we still have a
representation corresponding to
the word 'beetle' that we
started with, but that
representation has been updated
or modulated, conditioned
exactly on information about how
well it corresponds or how well
it should interact with all of
the other words in the input.
And of course that was just for
the word 'beetle' but we do that
for each of the other words in
turn. And that computation can be carried out in parallel, which
makes the transformer quite fast
to feed forward in today's deep
learning libraries. And so for
one application of a
self-attention layer we end up
with the same number of
distributed representations
coming out as we had going in.
And within the mechanism, the only weights that we learn are a single matrix giving us the queries, a second matrix giving us the keys, and a third matrix giving us the value representations; those matrices are then applied to each of the individual words.
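Putting the single-head operation together, here's a minimal NumPy sketch: WQ, WK and WV are applied identically to every word vector, queries are dotted with keys, the scores go through a softmax, and the result weights a sum of the value vectors. One detail I'll note as an addition: the published transformer also divides the dot products by the square root of the key dimension before the softmax, which is included here. All sizes are illustrative.

```python
# A minimal single-head self-attention sketch.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, WQ, WK, WV):
    # E: (T, D) matrix of word embeddings, one row per input word.
    Q, K, V = E @ WQ, E @ WK, E @ WV          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) interaction strengths
    weights = softmax(scores, axis=-1)        # one distribution per word
    return weights @ V                        # weighted sum of value vectors

T, D = 4, 100                                 # e.g. 'the beetle drove off'
E = np.random.randn(T, D)                     # illustrative embeddings
WQ, WK, WV = (np.random.randn(D, D) for _ in range(3))
print(self_attention(E, WQ, WK, WV).shape)    # (4, 100): one vector per word
```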
But it's not just this single self-attention layer that gives the transformer its expressiveness and power; it's actually an operation known as multi-head self-attention, which takes the operation I just talked about and reapplies it multiple times in parallel. So if you imagine the operation that I just spoke about being parameterised by three matrices, WQ, WK and WV, well, we can repeat that process with three additional independent matrices.
And in fact, typically we might
do it say four times. So we'll
end up with four sets of three
independent matrices and each of
them can do exactly the same
self-attention operation as I
just talked about in the
previous slides. So we end up
with four independent and
parallelisable self-attention
operations, each computed on the
input words of a particular
layer in order to get us through
to the next layer. Now of course
that is a lot of computation and
it might require a lot of memory
if we end up with very large
representations in our model. In
practice, then, what the
developers of the transformer
recommend is that each of our
self-attention layers actually
effectively reduces the
dimensionality of the input
vectors. So if the input vector, in light blue at the bottom of the slide here, has dimension 100, then we can make the matrix WV a rectangular matrix rather than a square one. What that means is that the output of WV, which is the value vector that gets passed to the next layer of the transformer, can be arbitrarily small; in this case, we might make it just 25 units. And so each self-attention layer independently takes a 100-dimensional vector and returns a 25-dimensional vector for each word in our input. But if we do that four
times, then we end up with four
25 dimensional vectors
and those can be aggregated: in fact, in the transformer they're concatenated and then passed through an additional linear layer, parameterised by a matrix W0, to return overall a vector of the same dimension, 100 units, as the input. So
in that way we can apply
multi-head self-attention. We
can give the model four
independent ways to analyse the
interactions across the
different words in the input.
And we can do so without
expanding the dimensionality
with which the model needs to
represent each of its words. And
that makes it a relatively
practical tool, which doesn't
lead to an enormous explosion in
the memory requirements of the
models. But it does give the
model many independent ways with
which it can represent
interactions between the words in the input.
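Here's a sketch of that multi-head arrangement with the sizes just described: four heads, each mapping 100-dimensional inputs to 25-dimensional values, whose outputs are concatenated and mixed by the matrix W0. It reuses the self_attention function from the earlier sketch, and the random matrices stand in for learned weights.

```python
# Multi-head self-attention: four independent, parallel heads.
import numpy as np

D, H = 100, 4                   # model dimension, number of heads
d_head = D // H                 # 25 dimensions per head

heads = [{"WQ": np.random.randn(D, d_head),
          "WK": np.random.randn(D, d_head),
          "WV": np.random.randn(D, d_head)} for _ in range(H)]
W0 = np.random.randn(D, D)      # linear layer over the concatenation

def multi_head(E):
    # Each head independently computes a (T, 25) self-attention output...
    outs = [self_attention(E, h["WQ"], h["WK"], h["WV"]) for h in heads]
    # ...and the four outputs are concatenated to (T, 100) and mixed by W0.
    return np.concatenate(outs, axis=-1) @ W0

print(multi_head(np.random.randn(4, D)).shape)  # (4, 100): same as the input
```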
Now, after that multi-head self-attention layer, the model applies what's called a feed-forward layer. So
conceptually this is less interesting, but essentially the representations at the output of the multi-head self-attention layer are multiplied by a linear layer, with a rectified linear unit non-linearity; they're actually expanded out somewhat in dimension and then reduced again with another linear layer. So when
considering a transformer all
together, it's actually multiple
applications of those multi-head
self-attention layers and the
linear layers that I described
afterwards. But there's another
important detail in the
transformer, which is the notion
of skip connections. So whenever we apply a multi-head self-attention layer, or indeed a linear layer, the transformer also gives the model the option to ignore that computation: the activations at the input to the multi-head self-attention layer can bypass it and are carried directly to the point of the network where its output emerges. There they are added to the output of the self-attention layer and passed through a layer normalisation layer, and that represents the actual output of the whole module which performs the multi-head self-attention.
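As a sketch of how these pieces might fit together, here's one block combining multi-head self-attention, a position-wise feed-forward layer (expand, ReLU, reduce), and the skip connections with layer normalisation just described. It reuses multi_head from the previous sketch; the expansion size is an illustrative assumption, and the learned gain and bias of layer normalisation are omitted for brevity.

```python
# One transformer block: attention and feed-forward, each wrapped in a
# skip connection followed by layer normalisation.
import numpy as np

D, D_ff = 100, 400              # the feed-forward layer expands, then reduces
W1, b1 = np.random.randn(D, D_ff), np.zeros(D_ff)
W2, b2 = np.random.randn(D_ff, D), np.zeros(D)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def feed_forward(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU non-linearity

def transformer_block(E):
    # Skip connection: the input can bypass the attention computation.
    E = layer_norm(E + multi_head(E))
    # The same residual-then-normalise pattern around the feed-forward layer.
    return layer_norm(E + feed_forward(E))

print(transformer_block(np.random.randn(4, D)).shape)  # (4, 100)
```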
So why might that sort of skip connection be important? Well, in the examples
I gave in the previous section
about language, one thing that should maybe have come across is the role of our expectations in forming a consistent representation of what a particular input is. As an example, in the case of pet fish, we came to the understanding that pet fish have many bright colours, even though that was not necessarily part of the individual parts of the input. It's not necessarily
something we would associate
with pets and it's not something
we would necessarily associate
with fish, ordinarily. And where does that additional notion of colour come from? Well, it probably comes from our wider understanding of the world and our ability to think about pet fish as a combination and then reconsider how the input works. And these sorts of top-down influences, our expectations influencing how we actually combine the inputs in language, are really common in many different contexts. And if you
think about skip connections,
it's not a perfect model of
this, but it does give the
transformer a rudimentary
ability to allow its
representations of things at a
higher level of processing to
interact with its representations of things at a lower level of processing. So
let's say that the model didn't
have skip connections and fed
things through to a certain
level in the hierarchy. At that
point after computing many
different interactions, the
model might form a consistent
sense of the fact that a meaning
needs to be understood in a
particular way, but of course
those top down influences tell
us that that expectation of what
the meaning might be should
actually feed back and allow us
to remodulate how we understand
the input. Well, a skip connection which comes up from the input and interacts with the model at that point can actually compute such an interaction in subsequent layers, because at that point the model has access both to a high-level representation of what it expects the best interpretation of the whole situation to be, and directly to the lower-level input. So in
some ways in a very deep model,
the addition of skip connections
allows the model to execute
a form of top down influence on
processing. There's one more detail I'll finish off with in our characterisation of the transformer. Now, if you were paying attention during the explanation, you may well have
noticed that none of the
operations that I described on
the input words took into
account the actual relative
order of the words in the input.
It was a series of matrix
multiplications which were
applied identically to each of
the words. And then on top of
that, a series of inner
products, which are symmetric
operations, which don't favour
the ordering in which we apply
them with respect to the words.
So there was no way that a model
like this would have any ability
to express the fact that certain
words appear close together in
the input or certain words
appear further apart. And of
course we know in language that
the word order can tell us some
important things about what the
overall sentence means. So in
order to give the model
sensitivity to word order in a
way that the computational form,
the functional form of the model
doesn't allow, the developers of
the transformer came up with a
rather nice trick known as
positional encoding. So positional encoding is just a way of determining a set of scalar constants which are added to the word embedding vector; at the lowest level of the transformer, they can be added before the first self-attention layer but just after the word embedding layer. Those scalars combine with the word embedding so that if a word appears in a particular slot in the input, then, regardless of the fact that its embedding weights will necessarily be the same, the actual effective representation that the transformer sees will be slightly different depending on where it appears in the input. So to achieve this sort
of thing, the model just needs a set of small scalars which are different in each of the possible locations that a word could appear in the input. And the authors use a nice sinusoidal function, which has various properties that may well be more desirable than simply allowing the model to discriminate words according to their position. In fact, that sinusoidal function gives the model a slight prior to pay attention to relationships of a certain wavelength, a certain distance across the input, and each particular unit in the embedding representation can then specialise at recognising interactions or correspondences at a different distance from a given word.
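For concreteness, here's a sketch of those sinusoidal positional encodings, following the sine/cosine formula from the transformer paper; the sequence length and embedding dimension here are illustrative.

```python
# Sinusoidal positional encodings: each position gets a D-dimensional
# pattern of sines and cosines at different wavelengths, added to the
# word embeddings before the first self-attention layer.
import numpy as np

def positional_encoding(max_len, D):
    pos = np.arange(max_len)[:, None]          # positions 0..max_len-1
    i = np.arange(0, D, 2)[None, :]            # even embedding dimensions
    angles = pos / np.power(10000.0, i / D)    # a different wavelength per unit
    pe = np.zeros((max_len, D))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

E = np.random.randn(4, 100)                    # embeddings for a 4-word input
E = E + positional_encoding(4, 100)            # the same word in a different
                                               # slot now looks slightly different
```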
Contrast this with models like a recurrent neural network or an LSTM: those models have the
notion of order built in because
they process input sequentially,
one word after the other
according to a process transitioning the state from its position after reading one word, to after reading two words, to after reading three words, and so on. What that means is that the model naturally has a very strong awareness of the ordering of the words. But it's harder for that model to remember to pay attention to things a long time in the past, even if those things actually end up having a really important influence on what it's currently looking at. With a
transformer, things are totally
different: in its native functional form, the model has no awareness of word order, and we have to add on these additional positional encodings to give it a weak awareness of word order. But the transformer actually performs better than RNNs and LSTMs on a lot of language tasks, and this maybe tells us that it's easier to learn the notion of word order for the cases where it's actually important in language than it is to be given the notion of word order automatically but have to learn the very difficult process of paying attention to things a long time in the past. And when
I say difficult, I mean the gradients have to pass back through many, many weight matrices in order to allow the model to update and learn to encode dependencies between things in the past and things in the present. With a transformer, the path that the gradient has to go through is much shorter, because there's no prior favouring things which are close together; instead, the gradient path that the model needs to connect any two words in the input is equivalent, and indeed is shorter on average than it is in recurrent neural networks. So that gives a small
amount of intuition about
another reason why the
transformer might be so
effective at processing. So just
to summarise this section: we
saw in the previous section that
words shouldn't necessarily be
thought of as independent
discrete symbols and that
disambiguating their meaning can
depend a lot on the context but
not only on the immediate
context which is closest to
those words, but on potentially distant context, the information encoded in words a long way away. We've also seen
that functions which models use
in order to combine the meaning
of two words should take into
account the meaning of those
words and if possible, take into
account a wider general
knowledge of how things
typically combine in order to
allow that to modulate the
interactions between the words
coming in. And if we think about the architectural components I talked about in the transformer, the multi-head processing is one way of getting at this notion that words are not discrete symbols, because it naturally gives the transformer, even in a single feed-forward pass, the opportunity to represent each word at each layer with n, let's say four, different possible contextualised representations. And of course, going back longer term, just the general notion of representing words as distributed representations, and allowing words with similar meanings to occupy local areas in a large geometric vector space, also allows the model to express this non-discrete nature of word meaning in a very eloquent way. Now, the fact that word meaning depends on context is very nicely modelled by self-attention, which precisely allows the meaning of every word to be critically dependent on the meanings of all the other words in a given input stream. And the
fact that that context could be
non-local as I've just talked
about is very nicely modeled by
the self-attention mechanism
because the gradient flow from the particular point I'm at in a sentence to any other point in the sentence is the same. So interactions between words that are next to each other are not particularly favoured over more distant interactions. Another fact is
that the more layers we have, the more chance the model has to learn, as it modulates its representations of different things, how interactions might take place at different levels of abstraction as the model continues to reprocess the input. And finally, on
this point about how meanings combine: the ways in which the meanings of two words combine often seem to depend on the particular meanings of those words and also on top-down effects, and we've seen that skip connections are one way in which the transformer can learn to implement the interaction of higher-level information with lower-level information. And we've also
again seen that parameterised functions on distributed representations, i.e. the multiplication of a matrix by a vector, are precisely the kind of function which combines word meanings according to the meanings of the words themselves; those operations are common in many neural language models, but are a really important part of the transformer architecture. So
hopefully this section has given
you some intuition about how a
transformer works, but also some
intuition about maybe why it
works, why it is that the
various components in the
transformer improve on a model's
ability to process language
because of the way that we think
meaning works in a very sort of
intricate and interactive way
when we understand linguistic input.
In the last section, we introduced the transformer
and we talked about how various
components within the
transformer combine to make it a very powerful model for processing sentences and sequences of words. In this
section, I'm going to talk a
little bit about the very
specific use of the transformer.
It's a way of training
transformer models in order to
allow them to excel at a wide
range of different language
tasks. And those tasks might
involve reading a sentence and
making a prediction or
classifying how two sentences
relate to each other or even
classifying or making
predictions about longer texts
such as documents. But before I
do that, I also just want to go
back to our points about the
nature of language and discuss
one more issue which I think is
quite motivating when we think
about how transformers are
applied in the model that I'm
going to talk about in this
section. So let's consider this
sentence: 'time flies like an
arrow'. And then we can compare
it to what seems superficially
to be a very similar sentence:
'fruit flies like a banana'. But
of course when we start to
process and make sense of these
sentences, it feels very clear
to us as native English
speakers, that there's quite a
difference in the way that the
words in those sentences have to
relate to each other in order
for us to sort of construct the
meanings in our head. And it
feels, it certainly seems to me
like there's at least two
factors that are really
important here. One thing is that in the top sentence, 'time flies like an arrow', we know what an arrow is, we know that arrows regularly fly, and in fact we know how they fly. So
we've got our experience of
arrows. Another important piece of experience that we have is our experience of a bunch of phrases or sentences which are quite similar to the phrase 'time flies like an arrow', and in particular similar in the way that the meanings of the words combine for us to come up with a representation of the sentence. So those could be
things like 'John works like a
Trojan' or 'the trains run like
clockwork'. These are all
actually kind of metaphorical or
simile-style sentences, where we compare the way that something works with the way that something else works.
So it feels to me like those two
pieces of experience are very
important in our ability to read
a sentence, like 'time flies
like an arrow' and immediately
understand it. In the case of
'fruit flies like a banana' of
course we come to quite a
different understanding, right?
We know that we're not comparing the way that fruit flies with the way that bananas fly. And how is it that we can somehow know that that's not what we have to do to understand this sentence?
Instead, what we do is draw on some knowledge of fruit flies. I'm not an expert in fruit flies, but there's one thing I do know, perhaps the most salient thing about them, which is that they like fruit, and we know that bananas are fruit. And so this connection helps to tell us: well, maybe it's a different type of liking that I need to be thinking about in this sentence. And then of course
there's again, beyond that background knowledge of how the world works and how fruit flies are, this kind of more linguistic knowledge of sentences we may well have previously understood in which the meanings seem to combine in a similar way to 'fruit flies like a banana'. So
'Fido likes having his tummy
rubbed' or 'grandma likes a good
cuppa'. In those cases it seems
like the process of putting
together the meanings has
something quite similar or in
common with the scenario in
'fruit flies like a banana'. So
if we're going to come up with a
general language understanding
engine that's able to cope with
all these different types of
processes and constraints which
are involved in understanding a
sentence, then there are obviously a lot of places where such a model needs to get its experience, its understanding of the world and its understanding of language. And those considerations lead us to add a fifth point to these many characteristics of language, which is that when we actually form an understanding, it really does seem to be a process of balancing our existing knowledge, which could be knowledge of language and also knowledge of the world, with the input, the particular facts of the thing that's currently coming into the model. And that consideration is
a key motivating factor behind
the approach taken in the model I'm going to describe in this section, which is called BERT. BERT stands for Bidirectional Encoder Representations from Transformers. And BERT is
essentially an application of
the transformer architecture
that I described in the last
section. But the key insight with BERT is that rather than training a transformer just on the sentences which the model is currently considering, a process of pre-training takes place in which the weights within the model are endowed with knowledge of a much wider range of text, which can plausibly give the model the background knowledge that is really necessary for forming a coherent understanding of all the different types of sentences a language understanding processor needs to be able to understand. So the important
thing to remember when
considering how BERT works is
that a transformer as described
in the previous section really
is just a mapping from a set of
distributed word representations
to another set of distributed
word representations. So as I
talked about in the last
section, the first layer of a
transformer goes from the
particular input symbols passed
to the model to a space, a
geometric space of continuous
valued vectors. And of course what comes out after these many layers of self-attention is precisely another space of continuous valued vectors, one corresponding to each of the words in the input, the exact same set of words that were the input. So if I pass a sentence to a transformer model, it'll very quickly compute a set of embeddings, one for each of the words in that sentence. Then it will pass them through the self-attention layers and output a set of embeddings, again one for each of the words in that sentence. But of course each of those output embeddings will be highly contextualised, highly modulated by all the other words in the sentence, and hopefully will have gone through the sorts of processing needed for the model to gradually and incrementally form a reasonable representation of what the sentence means.
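To make that concrete, here's a minimal sketch in PyTorch (my own illustration with made-up sizes, not code from the lecture) showing that a transformer encoder's output has exactly the same shape as its input: one vector per word, in and out.

```python
import torch
import torch.nn as nn

# Toy transformer encoder: hypothetical sizes chosen just for illustration.
vocab_size, d_model, n_heads, n_layers, seq_len = 1000, 64, 4, 2, 7

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # one sentence of 7 word ids
x = embed(tokens)  # (1, 7, 64): one continuous vector per input word
y = encoder(x)     # (1, 7, 64): same shape, but each vector is now
                   # contextualised by all the other words via self-attention
print(x.shape, y.shape)
```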
So given that a transformer is just a mapping from a set of word representations to a modified set of word representations of the same length, there's quite an easy way in which we could train such a model in order to extract knowledge from an enormous amount of text that we might just have lying around. So in
particular, the insight from BERT is precisely this: how can we get knowledge into the weights of such a model without requiring tasks or data that have been labeled by human experts, or by some other mechanism, to tell the model what the right classification or the right answer is? In other words, how can we get knowledge into a transformer model in an unsupervised way? The approach that the authors of BERT take is, firstly, by means of a masked language model pretraining phase. So the way
this works is the following: the authors considered the problem of mapping a particular sequence of words to the exact same sequence of words. So the job of this transformer, in theory, is just to represent a sentence and then output a sentence at the very top of its network. But rather than having the model output the exact same sentence, one of the words in the input to the model is masked out, so the model is not aware of one of the words in the input sentence. And instead of having to predict all of the words in the input sentence, the model just has to make a prediction, conditioned on the output embedding at the position of the missing word, of what that missing word was. So it just has to answer the question: here's a sentence with a missing word in it, 'sucking up '...' from words', and the model has to predict that the missing word in that case is 'knowledge'. And when training the
model, the authors of BERT do this with 15% of words chosen at random. So they present sentences from any place where we might be able to get running text, mask out words with a probability of 15%, and then ask the model to make a prediction and backpropagate the cost, which is essentially the negative log likelihood of the model having predicted that word over all of the other words in its vocabulary. So that's masked language model pretraining.
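As a rough sketch of that cost (my own illustration; `model` and `MASK_ID` here are assumptions, not the authors' code), we might mask tokens with probability 0.15 and take the negative log likelihood of the true words at just the masked positions:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the special [MASK] token

def masked_lm_loss(model, token_ids, mask_prob=0.15):
    """Mask ~15% of tokens; return the NLL of the true words at those positions."""
    inputs = token_ids.clone()
    is_masked = torch.rand(token_ids.shape) < mask_prob
    inputs[is_masked] = MASK_ID                       # hide the chosen words

    logits = model(inputs)                            # (batch, seq, vocab)
    log_probs = F.log_softmax(logits, dim=-1)

    # Negative log likelihood of each true word, scored only where masked.
    nll = -log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return nll[is_masked].mean()
```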
But one thing the authors noticed is that if they trained the model only in that way, then at test time, when they came to actually use the model, there wouldn't be any tokens masked out in the input. So there's a risk that, trained purely in this way, the model would not behave well on inputs where nothing was masked out. So
for a small amount of the time, instead of masking out a word, they have the model make a prediction of which word is 'missing' even though they didn't actually mask a word out, so no word is missing. In this case, the model really just needs to retain which word is at a particular point in the sentence and, conditioned on the output representation corresponding to that point, make a prediction of what that word was. Of course, if this were the only objective, the model would never have to do any sort of inference. It would never have to make any sort of unexpected judgment about what word could be missing; it would instead just be able to copy knowledge straight through, and that wouldn't lead to the formation of any interesting representations. So this is only done occasionally, but it does make the model perform better on the test set, because the model does not find itself completely outside its training experience when it encounters sentences in which no words are masked out.
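For completeness, the recipe in the BERT paper handles each token selected for prediction in one of three ways, roughly as in this sketch (the ids and helper name are my own): most of the time the token is replaced with [MASK], sometimes with a random word, and sometimes left unchanged, which is the case just described.

```python
import random

MASK_ID = 0  # hypothetical special-token id

def corrupt_token(token_id, vocab_size):
    """Decide what the model actually sees at a position selected for prediction.

    Per the BERT paper: 80% [MASK], 10% a random token, 10% the original token.
    """
    r = random.random()
    if r < 0.8:
        return MASK_ID                        # usual case: hide the word
    elif r < 0.9:
        return random.randrange(vocab_size)   # sometimes a random word
    else:
        return token_id                       # sometimes leave it intact, so
                                              # unmasked test inputs aren't alien
```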
Okay, so that's the masked language modeling objective. But in order for BERT to be an effective language processor, the authors also wanted it to be aware of how meaning flows on a longer scale than just within a particular sentence. So in order to achieve this, they came up with an additional mechanism for training the weights in the BERT model, which is complementary to the masked language model objective and can be trained at the same time as it.
And as in that case, it doesn't require any data that's been labeled by experts or annotated with a right answer in some way; we can construct this objective just by taking running text from the internet. It's called the next sentence prediction pretraining objective. The way it works is that the authors add an additional input token at the start, and it's the output embedding corresponding to that input location that's going to be used to make the prediction on this objective. Then as input
to the model, the model is
presented with not one sentence,
but two sentences in this case
and so there's the additional
input token, then there's the
first sentence, then a
separation token, and then the
second sentence, and that's all
passed to the transformer and
it's processed through in
parallel. At the end, the model produces representations corresponding to each of the input tokens, but it's only the initial representation, the one corresponding to this additional token that was added to the inputs, that needs to be considered in this objective. And conditioned on that, the model just makes a binary choice of whether or not this was actually two consecutive sentences from the training corpus. So in this case it is
two consecutive sentences: 'Sid
went outside' and 'it began to
rain'. So in this case, the
model would predict, yes, those
are two sentences which are
likely to follow one
another in a corpus. But by shuffling data, the people who train the model can also present it with negative cases, cases where one sentence didn't follow the other. That might look something like 'Sid went outside', 'unfortunately it wasn't'. The objective of the model here is to identify these as two sentences which don't fit well together and to make the prediction no on the next sentence prediction task.
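A minimal sketch of how such training pairs might be constructed (my own illustration; only the [CLS] and [SEP] token names come from the paper):

```python
import random

def make_nsp_example(sentences, i):
    """Build one next-sentence-prediction example from a list of sentences.

    Returns (input_tokens, label): label 1 if the second sentence really
    follows the first in the corpus, 0 if it was sampled at random.
    """
    first = sentences[i]
    if random.random() < 0.5:
        second, label = sentences[i + 1], 1          # true next sentence
    else:
        # Random distractor; a real pipeline would draw it from elsewhere
        # in the corpus so it can't be the true continuation.
        second, label = random.choice(sentences), 0
    # The yes/no prediction is made from the output embedding above [CLS].
    return ["[CLS]"] + first + ["[SEP]"] + second, label

sents = [["Sid", "went", "outside"], ["it", "began", "to", "rain"],
         ["unfortunately", "it", "wasn't"]]
print(make_nsp_example(sents, 0))
```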
So by combining next sentence prediction and masked language modeling, the weights of this large transformer, the BERT transformer, gradually start to acquire knowledge of how words typically interact in sentences, maybe abstract knowledge of the typical ways in which meaning flows through sentences. And of course, in the spaces in which the model represents each of the individual words at various levels of the stack, things start to happen, like words that have similar meanings coming close together, or the model learning to separate words out across the different parallel attention heads if they have various different senses. And so, a lot of the general knowledge that we talked about as being so necessary for forming a consistent and coherent representation of lots of different sentences can start to be introduced into the weights of this model as it trains according to these unsupervised objectives. So
that's the theory behind BERT, or at least the intuition behind it. And of course, because neither of those training objectives requires any sort of particular labels, BERT is trainable on all of the text that exists in digital form in English around the world. So you could take any text from the internet and use it to train more and more knowledge, in theory, into the weights of BERT. Of course,
that's in principle how BERT
works but it wouldn't be a very
convincing demonstration unless
there was some evaluation. And in this case, the way that BERT is evaluated is by taking the knowledge it has acquired in all of those ways and using it as a starting point for training on many specific language understanding tasks. These tasks do use labeled data, but they typically have a lot less of it. So, in order to apply BERT to these tasks, the BERT weights, which were trained on the unsupervised objectives, are taken, the data specific to each task is passed through the BERT model, and the BERT weights are then updated according to the supervised learning signal from these specific language understanding tasks. Typically
this process of fine tuning the
BERT representations for a
specific task takes place
separately and independently for
each of those additional tasks.
And in many cases when fine tuning in this way it's also necessary to add a little bit of machinery onto the top of BERT, because in the standard BERT architecture the model just outputs a number of distributed representations at the top of the transformer. But given a specific task, depending on its output format, it may be necessary to take only some of those representations and condition on them with additional weights in order to make a prediction. Typically, though, that's only a small number of additional weights containing task-specific knowledge, and the vast majority of the model contains the general knowledge that was trained into it.
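As a sketch of that setup for a sentence classification task (my own illustration; `pretrained_bert` is a hypothetical module standing in for the pretrained network): a single linear layer over the output embedding at the [CLS] position, with everything fine-tuned end to end.

```python
import torch
import torch.nn as nn

class BertForClassification(nn.Module):
    """Pretrained BERT plus a small task-specific head on the [CLS] output."""

    def __init__(self, pretrained_bert, d_model, num_classes):
        super().__init__()
        self.bert = pretrained_bert                   # carries the general knowledge
        self.head = nn.Linear(d_model, num_classes)   # tiny task-specific machinery

    def forward(self, token_ids):
        hidden = self.bert(token_ids)   # (batch, seq, d_model)
        cls_vec = hidden[:, 0]          # output embedding above [CLS]
        return self.head(cls_vec)       # task logits

# Fine-tuning updates *all* the weights, typically with a small learning rate:
# model = BertForClassification(pretrained_bert, 768, num_classes=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```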
So just doing this massively improves the performance of any model which aims to exhibit some sort of general understanding of language. What I mean by that is, for any model intended to be trained on a wide range of different tasks, using the BERT-style approach of transferring knowledge from enormous running-text corpora, via fine tuning, to those specific tasks has led to really strong and significant performance on a large number of these tasks. And importantly, this doesn't just allow one model to solve lots of tasks better; in many cases, this is the way to achieve state-of-the-art performance on these tasks. So even a model which was specialised to one of those supervised learning tasks would not perform better than a model that was first pre-trained in the BERT style. In fact, for a lot of these tasks, performance is substantially worse unless you apply BERT-style pre-training on an enormous corpus before transferring. So this is a really compelling demonstration of transfer learning. And the key
insight with BERT is that
transfer needs to take place
throughout the weights of a
large network. Previous attempts to do this involved transfer just at the level of the word embedding weights, which can encode the information relevant to each individual word in a model's vocabulary, but they didn't have a mechanism to encode the ways in which those words combine. Then, a few years before BERT, a model called ELMo and a couple of other models started to show that there was promise in sharing more than just those word embedding weights: actually sharing a large amount of the machinery which learns to combine word representations, pretrained on some task-agnostic objective and transferred to specific tasks. And then the BERT model really took that to the next level, using the machinery of the transformer to exhibit really impressive transfer learning. So we've now
acquired five interesting principles of how language and meaning seem to interact when we understand a sentence. And we've added this fifth one: understanding is balancing input with knowledge that we have already, our general knowledge of the world. We've talked about BERT as a mechanism for endowing models with something like the general knowledge that may be necessary, and we've seen that it is indeed very important on a lot of language understanding tasks to have this sort of prior knowledge, acquired from a massive range of different experiences and different types of text. So in the next section
we'll look a bit forward to
other sources of information
which may plausibly be useful
for different language
understanding models because of
course, BERT only has the means
to acquire knowledge through
text, whereas if you think about the fruit fly example, or 'time flies like an arrow', those sorts of examples tell us that
there are many other sources of
information that we may have
used in order to gain the
general conceptual or world
knowledge required to actually
make sense of language.
So in the last section we saw how the BERT
model is a really exciting
example of transferring
knowledge from an enormous amount of text in order to apply it to very specific language tasks that may have only a small amount of data from which to learn. And this works in part
because of the critical
importance of general knowledge
in understanding language and we
need ways in which models can
acquire general principles of
how language works and how word
meanings fit together in order
to make high quality predictions
for a range of different
language tasks. Now in this
section we're going to talk
about further ways in which we
might be able to endow models
with general or conceptual
knowledge which they can then
apply to language related tasks.
And in particular, in a way that's not accessible to the BERT model: the ability to extract general and conceptual knowledge from our surroundings, which is something that we as humans are doing all the time.
Now this is an opportune time to start thinking about these challenges, because the tools available for these sorts of unsupervised knowledge extraction processes are improving all the time. So as
well as the objective of masked
language modeling and next
sentence prediction that we saw
with BERT, there's also exciting
techniques in the field of computer vision that often involve things like removing parts of an image and making predictions about whether a candidate part is the correct one, or about which pixels would most appropriately fit into that part of the image, or maybe contrasting incorrect parts of images with correct parts, and things like that. Those sorts of objectives are also leading to a really good ability to transfer from large banks of images to specific image classification tasks. And, of course, when it comes to jointly learning language and behaviour, which often involves reinforcement learning, the techniques for having agents develop a more robust understanding of their surroundings, and possibly acquire what's known as a model of their world, are also improving. So in DeepMind we
thought it was the right time
given all of these improvements
to start to study this question
of knowledge acquisition through
prediction in an actual agent
that can interact with its
surroundings. But the idea of
knowledge acquisition through
prediction is actually a very
old one in neuroscience and
psychology. So it goes back all
the way to the time of Helmholtz
and there are some very influential papers, which you can see on this slide, that really proposed and made clear the idea
that predicting what was about
to happen to an agent or an
organism was a very powerful way
of extracting knowledge and
structure about the
world that surrounds that agent.
Now, in our case, we
unfortunately can't set an
enormous neural network free in
the world in which we live and
just see if it learns. But the
next best thing is to create a
simulated world. And we did that
in the Unity game engine. And
the aim with this work was to
study precisely whether or not
an agent which moves around this
world can apply various
different algorithms in order to
acquire as much knowledge as
possible from its environment.
But in particular, in a slight difference from other work on this sort of topic, we were interested in whether or not this knowledge would be language-relevant, i.e. whether or not it could serve the agent's ability to understand or use language. And the way we did
that was that, as well as creating loads of random rooms with different objects positioned in different places in this simulation, we also created a bunch of questions such that, for any random room that was created, there were questions about the environment which could plausibly be answered. So examples of the sorts of questions we asked were things like 'what is the colour of the table?', 'what is the shape of the red object?', 'how many rubber ducks are there in the room?', 'is there a teddy bear somewhere?', and even comparison questions like 'is the number of rubber ducks bigger than the number of toy sheep?' So those are the sorts
of questions. And importantly, being able to answer these questions requires a particular type of knowledge: propositional knowledge, the ability to tell whether something is true or false in our environment. That's often contrasted, especially by philosophers, with procedural knowledge, which is the sort of instinctive knowledge that a reinforcement learning agent would naturally have when it learns to solve control problems in a very fast and precise way. So this is a different type of problem from those most typically faced by agents trained with reinforcement learning. So, in
order to think about how we could develop algorithms to aggregate these sorts of knowledge as an agent explores its surroundings, we first just gave the agent a policy which meant that it visited all of the things in the room a little. That essentially creates a video of experience, and we then set our learning model the challenge of taking in that experience and aggregating as much knowledge as possible in the memory state of the agent as it lives that experience. And the way that we measure the quality of that knowledge is by bolting a QA decoder onto the agent. That's the part of the model which is going to actually produce the answer to the questions when fed with the current memory state of the agent and the particular question. So as an example, the
agent might explore a room with
a yellow teddy bear and a red
sheep and a large table and then
a small toy dinosaur under the
table. And the environment might
present the question: 'what is
the toy that's under the table?'
Now the agent would explore, and the agent's learning algorithm can take in all of the things it sees as it moves around the room. But the agent itself can't see the question. So the learning algorithm just has to find a way, based on that experience, to aggregate general knowledge into the agent such that when the QA decoder is cued with the state of the agent at the end of the episode and with the question, it's possible to combine those two pieces of information and answer with 'dinosaur'.
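Schematically, the evaluation machinery might look like this (entirely my own sketch of the setup as described, not the paper's code); note the stop-gradient, which reflects a detail discussed in a moment: no signal from question answering flows back into the agent.

```python
import torch
import torch.nn as nn

class QADecoder(nn.Module):
    """Answers a question from the agent's final memory state alone."""

    def __init__(self, d_memory, d_question, num_answers):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_memory + d_question, 256), nn.ReLU(),
            nn.Linear(256, num_answers),
        )

    def forward(self, memory_state, question_embedding):
        # detach(): the QA loss must not shape the agent's weights, so the
        # knowledge in its memory has to be acquired in a task-general way.
        joint = torch.cat([memory_state.detach(), question_embedding], dim=-1)
        return self.net(joint)  # logits over possible answers, e.g. 'dinosaur'
```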
So to do this effectively, the agent
needs a large amount of general
knowledge about how things are
arranged in the environment
around it such that the QA
decoder can take that knowledge
and make predictions about the
answers to questions. It's a very important detail that we do not backpropagate from the answer to the question back into the agent. So the weights in the agent, and the objectives that the agent is applying, must be general; they can't be specifically tailored to getting the knowledge needed to answer the question. Instead, it must be a process of aggregating knowledge throughout the episode, such that at the end of the episode the agent's memory is as knowledgeable as possible. And
then to this we apply various different baselines. The obvious baseline is just an LSTM, or any sort of recurrent agent; it's much more complicated to apply transformers in this context because, of course, we can't see the whole episode at once. As we're moving through the world, we can only see timesteps up to the current timestep. Now, another approach
is to endow the agent with predictive learning objectives, a little bit like the sort of masked language model prediction that BERT makes, but where, given a certain time point in the episode, a predictive engine, an overshoot engine, takes the current memory state of the agent at that time point and rolls it forward in time. Once the agent has actually experienced the episode, we can then do some learning in which we compare the predictions of that predictive model to what the agent actually encountered. And importantly, the predictive model can also take into account the action that the agent chose to take at each of those timesteps. So these are action-conditional overshoot unrolls, where we see what the agent actually encountered in the future and then update the weights of the agent so that it's better able to make these sorts of predictions.
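In outline, such an action-conditional overshoot might look something like this loose sketch (under my own assumptions about the interfaces; the real agents are considerably more elaborate):

```python
def overshoot_loss(memory_state, actions, future_observations, step_fn, loss_fn):
    """Roll a predicted state forward through future actions and score it.

    step_fn: predicts the next latent state from (state, action) -- assumed.
    loss_fn: compares a predicted state with a real future observation, e.g.
             a generative pixel loss (SimCore) or a contrastive loss (CPC).
    """
    state, total = memory_state, 0.0
    for action, obs in zip(actions, future_observations):
        state = step_fn(state, action)   # action-conditional unroll, no peeking
        total = total + loss_fn(state, obs)
    return total
```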
And we tried two specific algorithms. In one case, the SimCore algorithm, which was proposed last year, the loss that's used in this predictive mechanism is a generative model loss, which models each of the individual pixels in the observations that the agent sees in the world at future timesteps. In the other predictive objective, we use contrastive predictive coding.
This basically presents the model with two images at a given timestep and asks it to say which of the two is the one that the agent actually encounters in the future, as opposed to one selected randomly from some other episode.
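A minimal sketch of that contrastive idea (again my own illustration, not the published implementation): score the predicted state against an embedding of the true future frame and one distractor, and train the model to pick out the true one.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred_state, true_obs_embedding, distractor_embedding):
    """Binary contrastive objective: which candidate is the real future frame?

    The embeddings are assumed to come from some encoder of observations.
    """
    pos = (pred_state * true_obs_embedding).sum(-1, keepdim=True)   # similarity
    neg = (pred_state * distractor_embedding).sum(-1, keepdim=True)
    logits = torch.cat([pos, neg], dim=-1)
    # The correct answer is always index 0, the true future observation.
    target = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, target)
```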
Now we can evaluate these sorts of
predictive mechanisms for
aggregating knowledge in the
agent precisely by their ability
to create knowledge in the
memory state of the agent such
that at the final timestep of
every episode, the question
answering decoder can take that
knowledge and answer the
question. What we found, surprisingly, is that only one of these predictive algorithms actually led to the agent being able to answer questions effectively. That was SimCore, the model which uses a generative model to estimate the probability density of the pixels in the agent's future observations, conditioned on its memory state further back in time. So the contrastive
predictive algorithm was much
less effective at giving the
agent the general knowledge
required to be able to answer
these questions. The green line
at the top of the plot here
shows the performance of the
agent if you backpropagate from
the question answers back into
the actual agent memory. So by
doing that, you allow the agent
to specialise in a particular
type of question for every
episode rather than requiring it
to build up knowledge in a
general way. But you can see
that that makes the agent much
more effective at answering
these questions. To give some flavour of exactly what the agent does, here's a video: you can see the question is 'what is the colour of the pencil?', and as the episode continues, the agent's prediction gets more and more confident that the answer is red, such that on the final timestep red is by far the most probable answer. If we consider
the other video, let me just
activate the other video. You
can see that a similar thing
happens with a different type of
question. So here the question
is 'what is the aquamarine
object?' And the answer is that it's a grinder, a salt and pepper grinder. Again, the agent's confidence is very strong at the end. But we observed these sorts of effects only in the agent which was endowed with the SimCore model, predicting the probabilities of the pixels of future observations conditioned on the actions that it took, at arbitrary overshoots into the future. So that's just a small
insight into work that's going on at DeepMind, where we're starting to consider how we can aggregate knowledge from the general environment, as well as knowledge from large amounts of text, into a single model: one which can combine this sort of conceptual and general knowledge with a really strong understanding of language, into a single agent with a coherent and strong ability to form the meaning of statements and sentences, and also to take that knowledge to answer questions, to produce language, and to enact policies enabling it to do things in complex environments.
So we've reached the end of the
lecture and I just thought we'd
go back, reflect a little bit
quickly on the various things
that we've covered. So we've
talked about various aspects of
language which make neural
networks and deep networks
particularly appropriate models
for capturing the way that
meaning works. So in particular, we raised the fact that words are not discrete symbols but almost always have a range of different related senses, so that disambiguation is a huge part of understanding language, and that this can very often depend critically on context. We've talked about the fact that that context can be non-local, extending well beyond the words we're currently considering, and that it can also be very non-linguistic: it can depend very much on what we're currently seeing or doing. And the notion
of composition: we've reflected on the fact that that in itself seems to vary depending on which words are being combined in any one instance. And we've talked about the importance of background knowledge and ways to acquire it. So one way
that we talked about was BERT
and unsupervised learning from
text and another way was through
predictive objectives in a
situated agent. And so if we look at these features, these aspects of language, the mechanisms that I've discussed today cover them reasonably well, and hopefully they shed some light on why neural networks, interactive processing architectures that obey the principles of neural computation and distributed representation, are particularly effective for language processing. But of course, it
should be said that there are
many aspects of language
processing that the work I've talked about doesn't even start to capture. And that's in
particular around a lot of the
social aspects of language
understanding. So our models are
not currently able to do things like understand the intentions of others or reflect on how language is used to communicate and do things. And we need
to make a lot more progress in
these areas if we're actually
going to arrive at agents which
are truly able to understand
language. So, just as a final note, I think it's interesting that before deep learning really exhibited its success on language processing problems, a typical view of language understanding was what I call the pipeline view: that each part of processing language, from the letters to the words, to the syntax, to the meaning, and then eventually to some prediction, could be thought of relatively independently, as a separate process. But now that
we've reflected on how language
works and in particular taking
in all of the evidence from the
effectiveness of different
neural language models on
language processing tasks, I
think maybe this is a more
effective or more realistic
schematic of how language
processing should be thought of.
So we may have some stimuli,
some letters or sounds, and
we've always got some sort of
context around those letters or
sounds. Those two things input
to our system but critically,
it's that input combined with
our general background knowledge
of the world of knowledge of
language, which together allow
us to arrive at some sort of
plausible meaning for everything
that we hear or everything that
we might say. So on that, we'll
finish up. Thanks very much for
your time. There's some selected
references here and many other
references, which I didn't, I
don't have time to list here,
but have been hugely
inspirational for the work that
I think that I've talked about
today. At the end, I've popped a
few there for recent work at
DeepMind but again, there's no,
not time
to list a huge amount of very
related work. So anyway, I hope
you've enjoyed this lecture and
it's given you some insight into
why language and language
understanding is such an
interesting problem for
computational models to try and
tackle. And I hope that you've enjoyed the talk and that you'll enjoy the following lectures in the DeepMind lecture series. Thank you very much.
