Welcome back. In this video, we're going to talk about the concept of attention. When we have an encoder that sends an intermediate representation to a decoder, we're normally sending just one element. In the algorithm of attention, we're going to send more input elements from the encoder, so that the decoder can look at specific states and decide which of these hidden inputs are the ones that matter for generating the output. So what have we seen so far during the week? We had feed-forward neural networks, which took an input, did some computation with a hidden layer, and produced an output.
We have neural networks that can transmit data across executions of themselves. For example, if you have a sequence of words and a recurrent neural network that predicts the next word, as in "I am": the network will get the input I, it will produce the prediction output am, but it will also produce a hidden vector that goes on to the next iteration of the recurrent neural network. This is unidirectional transmission. By the way, you can also have bidirectional transmission, where, after training, each position can get contributions from the word that comes before it and the word that comes after it. So your recurrent neural network is getting three inputs: the input itself, a hidden h element from the word that came before, and another h element from the one that came after, making it bidirectional.
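To make that concrete, here is a minimal sketch of a single recurrent step in Python with NumPy, showing the hidden vector being passed between executions. The sizes, the random weights, and the tanh activation are illustrative assumptions, not a specific model from this course:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Illustrative, randomly initialized weights (stand-ins for trained ones).
W_xh = rng.normal(size=(hidden_size, embed_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # previous hidden -> hidden

def rnn_step(x_t, h_prev):
    """One execution of the RNN: combine the current input with the
    hidden vector transmitted from the previous execution."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

# A toy two-word sequence like "I am": each word is an embedding vector.
sequence = [rng.normal(size=embed_size) for _ in range(2)]
h = np.zeros(hidden_size)   # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)    # h carries information to the next iteration
print(h.shape)              # (8,)
```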
So we have neural networks that can transmit information across sequences, and we also have architectures that can encode a sequence into an intermediate form and then decode it into some other form. We call these encoder-decoder architectures, and they generally deal with sequence-to-sequence tasks, like the words in a question to the words in an answer, or the words in an article to the words in a summary of the article. This video will
look at the concept of attention. When you're producing the decoded output, maybe you want to look at more elements than just your intermediate representation, so we're going to have attention across the encoder and decoder. We're also going to have self-attention, where a sequence looks at its own elements while being encoded. And we're going to look at a couple of architectures that use this. They're called transformers. We're going to look at the concepts here, and then in your exercises you're going to get to play with a few transformers.
So take a look at this matrix here. We have an English sentence on top, "The agreement on the European Economic Area was signed in August 1992", and then the French translation from top to bottom, "L'accord sur la zone économique européenne a été signé en août 1992."
What we have here is the translation, but also a matrix that tells you how much attention you're paying to each element of the English sentence when you're generating each French element. So, for example, when you are generating the French 1992, what are you looking for in the input? You're looking at the number 1992. Of course, this one is very obvious: if you just have the sequence 1 9 9 2, it clearly only takes the sequence 1 9 9 2 to generate something similar. However, most things in language are not so easy. When you're generating, for example, the word signé, signed, you do need to look at the English verb, signed. But because this word has agreement with the subject, if the subject were feminine, for example, it would take another e, you also need to pay attention to the subject. So when you're generating signé, most of your attention goes to the word signed, but a little bit of your attention also goes to the word agreement. And when you're generating the word a, part of a été signé, you need to pay attention to was and signed in English in order to get the correct tense in French. So sometimes you only need to look at one element to make an input in English appear as an output in French, but sometimes you need to look at more than one element. We call this process attention.
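Each row of a matrix like this is just a probability distribution over the input words, typically obtained by applying a softmax to a vector of alignment scores. Here is a minimal sketch of that step, with hand-picked scores for the signé row; in a real model the scores come from the network, not by hand:

```python
import numpy as np

english = ["the", "agreement", "on", "the", "European", "Economic",
           "Area", "was", "signed", "in", "August", "1992"]

# Made-up alignment scores for generating the French word "signé".
scores = np.array([0.1, 2.0, 0.1, 0.1, 0.1, 0.1,
                   0.1, 0.5, 4.0, 0.1, 0.1, 0.1])

weights = np.exp(scores) / np.exp(scores).sum()   # softmax
for word, w in sorted(zip(english, weights), key=lambda p: -p[1])[:3]:
    print(f"{word:>10}: {w:.2f}")
# "signed" gets most of the attention, "agreement" a little.
```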
How does attention happen? We can get attention by transmitting richer information between the encoding and the decoding. The classic example is translating Je suis étudiant into I am a student. When we are transmitting the encoding, we could transmit just one intermediate vector, but we could also transmit all of the little h hidden states that were generated while the recurrent neural network, your LSTM, was running forward. All of these intermediate hidden elements get transmitted to the decoder along with the last intermediate representation, and then, during training, the decoder can decide which of them is best for generating the correct output. That way, it can learn a mapping to the elements it should be paying attention to when it's generating an output. For example, if the decoder has the word je and wants to generate the English I, it takes all of the hidden states and pays most of its attention to just je, because je is obviously equivalent to I. However, when you're generating the word am, because you're translating from French to English here, you need to pay most of your attention to suis, but you also need to pay some attention to je, and you pay no attention to étudiant. When you're generating the word a, you pay attention to suis and you pay attention to étudiant. So by transmitting a richer set of information from the encoder to the decoder, the decoder can take advantage of it and decide which are the elements that matter.
If you made a huge vector with weights for how much the first element matters, then the second, the third, the fourth, you could implement attention, and you'd also really be freed from the constraint that you need a one-to-one relationship, because then you'd have vectors that map the many-to-many relationship between a word and its translation.
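Here is a minimal NumPy sketch of that idea: the decoder scores every encoder hidden state against its current state, turns the scores into weights, and builds a weighted combination. The dot-product scoring and the random vectors are assumptions for the sketch, not the only way to do it:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size = 8

# One encoder hidden state per source word: ["je", "suis", "étudiant"].
encoder_states = rng.normal(size=(3, hidden_size))
decoder_state = rng.normal(size=hidden_size)   # state while generating "am"

# Score each encoder state against the decoder state (dot product),
# then softmax the scores into attention weights that sum to 1.
scores = encoder_states @ decoder_state
weights = np.exp(scores) / np.exp(scores).sum()

# The context vector is the attention-weighted sum of encoder states;
# the decoder uses it alongside its own state to pick the next word.
context = weights @ encoder_states
print(weights.round(2), context.shape)
```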
You can also have self-attention, where, if you have a sequence of tokens or words, you try to decide which of the words is getting the most input from the other hidden states. For example, take the word it in the sentence "The animal didn't cross the street because it was too tired": the word it pays a lot of attention to the animal, pays almost no attention to cross, and pays a little bit of attention to street. This is because, during its training phase, the model must have seen that it had strong positional correlations with animal, some positional correlation with street, and almost no correlation with cross, for example. So this allows the encoding to pay attention to itself and to the other elements in the sentence. In your exercises, you'll read
about the programming implementation of
self-attention, but in summary, during training you generate a series of matrices that tell you how much attention elements should be giving to other elements. So, for example, when you have the word it and you're inputting it into the network so that it can predict some other element, it needs to pay a lot of attention to previous nouns, like robot, and it should pay almost no attention to previous verbs, like obey. So it's going to take these elements and generate an attention vector with them. If you have the first law of robotics, "A robot must obey the orders given it by a human", you have the word it, and you want to generate the next word, you're going to have two vectors: the embedding of the word it, and a vector of weights saying how much weight we should be giving each of the inputs in the sequence. And from training, we will be able to know that the word robot needs a lot of attention, and that the word the needs very little attention.
And by the way, attention here just means a weighted sum: the embedding for the word robot multiplied by an attenuating factor, 0.5 for example, plus the embedding for the word obey multiplied by 0.1, plus the embedding for the word orders multiplied by 0.05, and so on. So you're going to get a combination that emphasizes the elements you should be paying attention to; here it's going to look a lot like robot. Adding the embedding for the word it to that, the input plus the attention, gives you really valuable information and really big performance improvements.
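Your exercises walk through one particular implementation; as a sketch of the standard scaled dot-product form used in transformers, here it is in NumPy, with randomly initialized query, key, and value projections standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model = 5, 16   # e.g. "a robot must obey it": 5 token embeddings

X = rng.normal(size=(seq_len, d_model))             # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # queries, keys, values

# Each token scores every other token; scaling keeps the softmax stable.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each output row is a weighted sum of value vectors: the embeddings of
# the attended-to words, attenuated by their weights and added together.
output = weights @ V
print(weights.shape, output.shape)   # (5, 5) (5, 16)
```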
For example, if you are standing on the word chasing, you can analyze the attention vector and see that it's paying attention to words like is and FBI. You can also
do this with multimodal attention. If you're encoding a picture, like the color picture here in the upper left, and decoding a description of the picture, like "a woman is throwing a frisbee in a park", then you will get a vector of the hidden information in the encoding, plus the final intermediate state. And as you decode that into English words, you will know which parts of the encoded input, meaning which pixels, the model was paying attention to. So, for example, when it generates the word frisbee here, it's paying a lot of attention to the pixels that have to do with the frisbee. Again, in your exercises you will see more about how this is generated, but what attention is really good for is that it allows us to build more complex architectures.
What we have here is a transformer. The transformer is a kind of encoder-decoder where you have one half that's encoding and another half that's decoding. By the way, where it says Nx, it means there can be multiple levels of these, and by multiple I mean dozens. For example, if you want to input the word je and have it come out as the word I, the transformer will take the embedding of the word je in French and correct it with a positional encoding, which captures how important order is, for instance, that je comes in the first position. It will send this to an attention head, which tells you how important every other word in the input is to je, and then it will use a feed-forward neural network to generate the right kind of intermediate representation. It will take that, plus the start of the output, something like a new-phrase token. So the new-phrase token enters here, the intermediate representation of something like je enters here, and it all goes into even more attention heads, which get the attention information from the encoding phase. Ultimately the model produces a softmax vector over the thousands of words it knows, and the one with the biggest probability is going to be I. So: je, attention, intermediate form generation, into the decoder's attention heads, softmax, and the output I. That's the transformer. It's a complex model, but very powerful.
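As one concrete piece of that pipeline, here is a sketch of the sinusoidal positional encoding from the original transformer paper, "Attention Is All You Need"; it produces a vector that is added to each word's embedding to mark its position. The sizes here are arbitrary:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i]   = sin(pos / 10000**(2i/d_model)),
                             PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosines
    return pe

# One row per position; row p is added to the embedding at position p.
print(positional_encoding(max_len=50, d_model=16).shape)   # (50, 16)
```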
And what can it do? There are many types of transformers, many; I'm just going to focus on two that are particularly important: BERT and GPT-2. Interestingly, people have discovered that you often only really need half of the transformer: you can use only the encoder, as in BERT, or only the decoder, as in GPT-2. BERT, for Bidirectional Encoder Representations from Transformers, is an algorithm that takes the inputs you are going to encode and produces a vector, some representation of them. What do we want
this for? The cardinal use of a BERT is to predict words, to function as a language model. So, for example, if you give it an input like "I want to [MASK] the car because it is cheap", a BERT will produce the output buy. It receives the embeddings of all of the words, it pays attention to the fact that the input had car, want, and cheap, and it produces a vector for the mask that carries all of that attention information and tells you that the correct candidate given those attended words is buy. And by the way, this is going to be in your exercises: you'll look at a BERT generating exactly this output. Note that it does this with 340 million parameters, so none of these are light models to train or run.
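If you want to try this before the exercises, a quick way is the Hugging Face transformers library; this sketch assumes it is installed along with a backend such as PyTorch, and downloads a pretrained BERT the first time it runs:

```python
from transformers import pipeline

# A pretrained BERT with its masked-language-model head.
fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("I want to [MASK] the car because it is cheap."):
    # Each candidate comes with the predicted token and a score.
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
# "buy" should appear at or near the top of the list.
```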
What's really cool about BERT is that once it generates that vector, you can do a lot of things with it. You can, for example, fine-tune a BERT so that when you give it an input and it generates a vector, you can use that vector for classification: to decide whether an email is spam or not, or to see whether a movie review is positive or negative. Here, for
example, we would get the words of an email, and we would add a token, CLS, for the classifier. This token will take in all of the attention information from the other words of the email, and then we will train an additional neural network to decide whether this classification vector means spam or not spam. This is what makes BERT really powerful: it can be fine-tuned to many tasks.
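A classifier of this kind can also be tried in a couple of lines. This sketch again assumes the Hugging Face transformers library; it uses the library's default sentiment model, a BERT-family network fine-tuned for positive/negative reviews, rather than a spam one:

```python
from transformers import pipeline

classify = pipeline("sentiment-analysis")   # default fine-tuned classifier

for review in ["A wonderful, moving film.", "Two hours I will never get back."]:
    result = classify(review)[0]   # e.g. {'label': 'POSITIVE', 'score': 0.99}
    print(result["label"], "-", review)
```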
For example, QNLI is about language inference: if you give it two sentences, it can tell you whether one can be inferred from the other, or whether one is the answer to the other's question; if this is a question and this is an answer, it will tell you yes or no. You can use it for sentiment analysis, to tell you whether something is positive or negative. You can use it for question-answering tasks, which is what SQuAD does: you give it a bunch of questions, plus paragraphs where the answers are, and it will give you the positions where each answer starts and ends. BERT can also be used to find named entities, or proper names, and it will tell you where they start and where they end. So BERTs can be adjusted for many tasks, and they are very powerful.
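The SQuAD-style usage, returning where in a paragraph the answer starts and ends, can be sketched the same way, assuming the Hugging Face transformers library and its default question-answering model:

```python
from transformers import pipeline

qa = pipeline("question-answering")

result = qa(
    question="When was the agreement signed?",
    context="The agreement on the European Economic Area was signed "
            "in August 1992.",
)
# The result includes the answer text and its start/end character positions.
print(result["answer"], result["start"], result["end"])
```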
A different type of transformer is GPT-2. It only uses the decoder part. It is very good at generating predictions of what the next word will be, so if you give it a prompt, it will keep generating the next word. For example, if you give it "The spaceship entered orbit around the planet", it will give you something like "Once in orbit..." and so forth. You can again experiment with that transformer in the exercises. And it takes 40 gigabytes of training text and 1.5 billion parameters to generate all of this. It has multiple attention heads and numerous levels of them, so it is a very heavy model.
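Generation can be sketched with the same library, assuming it is installed; note this uses the smallest public GPT-2 checkpoint, not the full 1.5-billion-parameter model, so expect rougher continuations:

```python
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

out = generate(
    "The spaceship entered orbit around the planet",
    max_new_tokens=30,   # how much text to add after the prompt
    do_sample=True,      # sample words instead of always taking the top one
)
print(out[0]["generated_text"])
```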
As a summary, we have an idea called attention, which is that you can pass information about all of the elements into supplementary vectors or matrices, so that when you are decoding you can take advantage of all that information and essentially establish many-to-many relationships telling you: if I want to generate the verb signé, I have to look at the word signed, but also at the subject for agreement. You can do this across the encoding and decoding stages, but you can also use self-attention to look at the items in your own encoding. This has been used in architectures called transformers; BERT and GPT-2 are just two examples of them, and they use this property of paying attention to multiple parts of the input to generate really powerful and flexible output. But all of this comes at a price, which we will analyze in the next and final video of the week.
