Today, I'm going to be
talking to you about sequential
data and how machine learning
deals with this type of data
structure.
Before I begin, I'm someone who
likes people to understand what
I'm talking about.
So if at some point during the
lecture something's not clear or
you have a question, please raise
your hand.
This can also be a conversation, so I should be getting the message across to you; if this is not happening then that's my mistake and I need to know about it. Okay, so there are no stupid questions, literally just raise your hand and ask.
So, with that out of the way, to give you a brief overview of what I'm going to be covering today: I'm going to divide the presentation into three broad sections.
We're going to start by motivating
why we care about sequential
data in the first place.
Why is it an important thing we
should consider in machine
learning?
Then, from there, once we are convinced that this is something we're interested in, we'll think about how we train models that deal with this sequential data. What are the challenges, what are the losses, what are we optimising, and so on? And hopefully that gives you the fundamentals.
Once we have our trained model in
section two, we'll move over to
section three where we take a trained
model and actually use it to do
cool stuff like generating
sequences.
And I hope to cover not only different applications throughout this section but also maybe introduce some of the more recent concepts and methods that have been applied to sequential processing in machine learning. So, we'll start
straightaway by motivating: Why
sequential data?
Just a quick recap.
What you guys have looked at so far in this course is, first of all, feedforward neural networks, which are networks that take in vectors of a fixed size. That should have been covered.
Then you've also talked about convolutional neural networks, which are networks that usually deal with images; they have these things called convolutions whose weights they share across an image, for example, and apply.
In this lecture, we want to
move away from fixed-size
vectors and images,
as I mentioned, to sequences.
So, what even is a sequence? To
give you a bit of a formal
definition, a sequence is a
collection of elements that has
certain properties.
Let me give you the example of a
sentence: You can think of an
English sentence as a sequence
of words.
So, a sequence first of all has elements that can repeat. So, if you think about a sentence, you can have the same word appearing several times in a sentence. In a sequence, the order of the elements matters. In a sentence, this is also the case.
So, in a sentence, it's very
important in what order the
words come in. It can mean
something completely different
if the words are in a different
order. And, finally, sequences can
be of variable length, so there can
be any number of elements to
them.
And this also applies to our example of a sentence. So, English sentences can have an almost arbitrary number of words in them.
So, with these properties, if we look at the methods that you've seen so far, we can see that they're not very good at dealing with data that have these kinds of properties. So, this one here, the feedforward neural networks: they usually deal only with fixed-size inputs, which clashes with our need for variable length, and also they don't really take structure that much into account. Convolutional neural networks can, again, deal with non-fixed-size inputs, but the structure is not the right one.
So, how do we deal with this? And
the question is: How can we
develop models that actually
deal with sequential data?
Because this is a problem.
And, given that this is so hard,
you might wonder why do we even
care about sequences, right?
Why do we care about sequences?
And my answer to you is, well, the sentence "why do we care about sequences" itself is a sequence. And, as a matter of fact, each of those words is a sequence of letters and, going further, this whole presentation is a sequence of slides.
So I hope that, with this, I convince you that sequences are literally everywhere. I gave the example earlier of sentences in the English language as sequences of words. But you can also think of other things as sequences: speech waveforms are sequences, you can think of videos as sequences of images, and images themselves can be sequences of pixels.
We can have more complex
sequences.
So, if you think about programs
you have different executions
that you carry out sequentially.
And, finally, for those of you interested in RL, in decision making, that as well is a sequence of decisions that your agent has to make.
So, it's fair to say that
sequences are very universal, and
they span a very big variety of
tasks, and this is why in machine
learning we're interested in
being able to deal with
sequences.
Okay, so this is some of the motivation. I hope that you take from this what a sequence is: it's a collection of items where order matters and that is of variable length.
We know that sequences are widespread, not only across machine learning but across applications and everyday life, really.
And, finally, the methods that we
have been discussing in this
lecture series so far are not
enough to deal with sequences, so
we need to come up with
different methods that can take
all of this into account.
Are we clear up until here? Is
the motivation clear?
We know what we're doing.
Okay, so now that everyone's convinced that sequences are what we want to focus on, how do we do it?
In order to think about a
machine learning model, I'm sure
you've had this in the previous
courses,
one of the things you have to
develop is: What is my data?
What is my loss?
What do I optimise?
And so on.
So, let's go through it a bit,
one at a time. I'm going to first
revise what you guys should have
already learned.
This is supervised learning.
This is how I would train a
normal feed forward network.
So, if we look at this, the data
that we normally have in
supervised learning are pairs of
some inputs and some outputs.
For example, you could have
images and labels for the
classic classification task.
What you're trying to do then is
you are going to assume there is
a mapping from your inputs to
your outputs.
This means there's some mapping
between images and labels that
you want to learn.
And we're going to, in this
particular course because we
care about deep learning, we're
going to try to approximate this
mapping with some nonlinear
function that is parameterised
by theta.
So, in this case, this means that we're
going to have a neural network that
learns to approximate this
mapping, with some
parameters that we can then tune.
How do we tune these
parameters?
So, we're going to define some
loss function whereby we say, okay,
the output of a neural network is
going to be compared to the
ground truth label, for example,
of an image.
And then we're going to take
whatever our distance measure is
and that's going to be our loss.
This is how much we penalise our
prediction with that network.
So, once we have this loss, what
we're going to do then is just
try to minimise it using standard
backpropagation and update the
weights of the network as a
function of whatever loss we
have.
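As a rough sketch of what that looks like written down (my notation rather than the slide's), with d standing for whatever distance measure we picked:

$$\mathcal{L}(\theta) = \sum_{(x,\,y)} d\big(f_\theta(x),\, y\big), \qquad \theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{L}(\theta)$$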
Is this clear? This should have
been covered? Good.
When we think about sequences,
this is a little bit different.
So, for sequences, let me take the example of an English sentence.
This applies to any type of
sequence, but I find thinking in
terms of sentences is quite
intuitive because we deal with
sentences all day long.
So let's take that example.
We're not necessarily going to have pairs of inputs and targets. Instead, we're just going to have a sequence. So, x is just going to be, for example, our sentence, and what we're trying to model is not necessarily a mapping between inputs and outputs.
We're trying to model how probable a sequence is, because we want a machine learning model that generates likely sequences, if we are interested in generating, or that can estimate how likely an English sentence is. We don't want it to just be generating garbage sentences.
So, what we're actually using our
neural network for is to
estimate the probability of that
particular sequence.
The rest looks fairly similar. So, you can just optimise the log probability, so the probability of the sentence under your model, and you can just optimise. In this case we're doing the max because we're dealing with probabilities, but you can also take the negative and minimise.
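Written down, the objective is roughly (again my notation, not necessarily the slide's):

$$\theta^* = \arg\max_\theta \sum_{x \in \mathcal{D}} \log p_\theta(x), \quad \text{equivalently} \quad \theta^* = \arg\min_\theta \; -\sum_{x \in \mathcal{D}} \log p_\theta(x)$$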
So, that would be the difference.
Is this clear? Yeah.
*STUDENT* If you
want to tag certain words
[Inaudible]
*LECTURER* What do you mean tag?
*STUDENT* Like if you want to recognise, like, a city or an entity or a combination.
*LECTURER* Ah yeah yeah
and then this would be more
of a supervised learning
problem.
Yeah, absolutely.
And you can always add
information, you can change
your loss.
Yep, exactly.
Yeah.
So, in a sequence - and we'll talk
about it in a minute,
about how we actually calculate
this probability - but you can
always add additional losses, so
for sure, yeah.
I think if you want to- that
would be almost like a separate
task.
That wouldn't be as much as, like,
estimating the probability of a
sentence, more like, like you say,
tagging a particular word.
So, in that case yes, I would go
to the former loss. The moment
you have inputs and outputs, you
are going to be doing something
like this.
Cool.
Cool.
So, let's think now- okay, what
we're trying to do here
is we're trying to learn this
function over here, as I
mentioned, that measures the
probability of our sequence,
okay?
So, how do we go about that?
What is a good model that would
learn the probability of a
sequence?
So let's take the sentence "Modelling word probabilities is really difficult" because, if there's one thing that I want you to take away from this lecture, it is this sentence, so I'll try to subconsciously feed it in: modelling word probabilities is really difficult.
So, one way you could think, one
very naive approach that you can
take to this is say, okay, I'm just
going to apply the simplest
model,
assume that the words are all
independent, and what I'm going
to do is I'm going to look at
some corpus that tells me how
probable is each of the single
words in the English language,
right, and just say, okay, the
probability of the entire
sentence is just the probability
of the individual words
multiplied with each other.
And that's a perfectly valid
model, if you want.
So, you would take the probability of "modelling" times the probability of "word" times the probability of "probabilities" and so on, and it'll give you some probability.
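Written out, this naive independence model is just the product of the individual word probabilities:

$$p(x_1, \dots, x_T) \approx \prod_{t=1}^{T} p(x_t) = p(\text{modelling})\, p(\text{word})\, p(\text{probabilities}) \cdots$$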
Now, this model, while it's fairly
simple and easy, is not a very
good model of language.
And the reason for this is that
language has structure.
So, here is one example that shows how this is a really bad model. If I asked this model what the most likely sentence is, it's going to tell me it's "the the the the the the the", because "the" is the most likely word and it's just trying to optimise for that.
So, this shows us that the
English language is just not
independent words that are
thrown around, right.
There is clearly structure in
the language, and this model is
not capturing it.
So, the question is: How can we
incorporate this structure into
our model?
Okay, let's go to a bit of a
more complicated model,
marginally, but it's fair enough.
We're going to take all of the
previous words in our sentence,
we're going to call that our
context, and then we're going to
condition our probability on
everything we've seen before. And
this kind of makes sense. If you're thinking about a sentence, the next word kind of depends on what you've said so far.
So, it's a fair enough
assumption.
It's definitely much richer than
just assuming independence.
So, what does this mean in this
particular example?
I can give you this part of the sentence, "Modelling word probabilities is really..." as the context. And then I can say, conditional on this, which is x 1 through x T minus 1, we are going to calculate the probability of the x, which is the word, at time step T, which is the current word that we care about.
So, that would be our target.
And then there would be
different probabilities for
different words, so "difficult"
would be really likely because,
as you know, modelling over word
probabilities is really
difficult. "Hard"
is also a good
candidate, maybe "fun", and
definitely not "easy".
So, you would have this
conditional probability that
describes what the next word is.
This, of course, would only model a single word, but if you wanted to get the probability of the whole sentence, the way you would go about it is you would first calculate the probability of the first word, then you multiply that with the probability of the second word given the first word, which is everything we've seen so far, and then the probability of the third word given the first and second, and so on.
So, you keep doing this,
calculating this joint
probability from all of the
conditionals that you've seen so
far. And this is a method that works fairly well. If you use this, it does give you a good overview of the structure.
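Written as an equation, this is just the chain rule of probability:

$$p(x_1, \dots, x_T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$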
But there is a pretty big
problem with it.
Let me show you an example to
give you a bit of an intuition.
Imagine we only care about modelling p of the second word given the first word, so that is the shortest conditional we can have. That already,
for four words, which is not much,
gives us a four by four table
with all the probabilities of
the possible combinations of
first and second word. Four words,
the English language has a few
more than four, so if we look at
this and increase it maybe to
120,
this is starting to look pretty
big. And maybe the English
language has more like 10,000
words and that's very limited as
well. So, this table gets huge if we're
comparing the probability of all
words given all other words.
And keep in mind, we're only really looking at the probability of one word given another word.
If we're looking at contexts of, like, full English sentences, this is likely to be much, much longer. And if you think about it, this is going to scale really, really badly: the size of the vocabulary, so however many words your language has, which in this case here I'm showing as just 10,000, to the power of however long your context is. This is more than the number of atoms in the universe for reasonably long sentences,
so it's fair to say that this is not a table that we can store, work with, or even really approximate with the data of the English language that we have out there.
So, while this type of model
where you are conditioning on
the context, is useful and
captures some of the structure,
it's definitely not scalable,
not beyond very short time
horizons.
One way that people got around this, especially in early NLP research, is they said, okay, rather than taking into account a very long context (because we know it scales with the vocabulary size to the power of the size of the context), why don't we fix the window size of that context? So, these are called N-grams, with N standing for however long your horizon is.
So, imagine you only care about
the two previous words,
those would be two grams.
And what this essentially means
is that, once you have more than
two words in the past, you only
really care about the last two.
You only care about the
probability of whatever word
you're at given the previous two
time steps. The benefit of this type of model is that, again, as we said, the table scales with the vocabulary to the power of the number of context points, so fixing the window reduces it and you have a fixed size. But, as you can also imagine, you're losing a lot of information.
Like, if I just tell you "...is
really difficult." "Is really"? That
is not telling you anything
about what I really am asking,
like, maybe difficult would not
be your first choice of word
after just the words "is really."
So, having said this, N-grams alleviate the problem of scalability a little bit, but there are some downsides as well. The first of the two biggest ones is the one I just mentioned.
It doesn't really take into
account words that are more than
just N words away, so you very
quickly lose the context.
Or, if you take a large context,
you again have the scalability
problem. And the data table,
even if you reduce the context
size, is huge.
And just to give you an
impression of how big this is,
I'm showing you this blog post
from Google from a couple of
years ago where they essentially
released a data set of N-grams
of size five.
So, this is not very large.
Keep in mind five words of
context in a normal sentence
doesn't give you a lot of an
idea of what's going on. And what
they did as well,
they only took N-grams of
size five that appeared at least 40
times on the Internet. So 40,
like, that takes away a lot of
N-grams and, despite taking away
all of these, they still ended up
with one trillion five-word sequences
that they got off the Internet.
So, this is how many we're
dealing with when actually
thinking about N-grams.
And, again, this is only five.
So, this does not scale really
well.
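To make the counting idea concrete, here is a minimal sketch (my own toy illustration, not code from the lecture) of how a count-based 2-gram model estimates these conditional probabilities from a corpus:

```python
from collections import Counter

# Toy corpus; a real model would be estimated from billions of words.
corpus = ("modelling word probabilities is really difficult "
          "modelling word frequencies is really fun").split()

# Count pairs of consecutive words and how often each context word appears.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_next(word, context):
    """Estimate p(word | context) from the counts."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]

print(p_next("really", "is"))         # 1.0 in this toy corpus
print(p_next("difficult", "really"))  # 0.5: "really difficult" vs "really fun"
```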
So, to summarise a bit what we're thinking about: modelling these word probabilities is difficult, and it scales really badly with the number of elements that you're considering for your context.
So, the question that we then ask
as a machine learning
researcher is can we learn this
probability estimation? Rather
than having to get it from some
big, big data matrix,
can we actually learn it? Before I move on, are there any questions so far about N-grams, or about why sequence data is hard to model? Okay.
So, I come back to this question: Can we learn how to estimate this probability? And the answer is, of course, yes, otherwise the lecture would be over. And we're going to talk a little bit about how we're thinking about doing this.
If you tackle this problem, one
way of thinking about it is
you're going to need two
elements, you're going to need
the first element, which is going
to take in your context and
somehow summarise it, and then
you need the second element that,
from this summary, predicts you a
probability distribution over
what the next word should be.
Okay.
So, I'm going to treat those two
a bit separately.
The first component is some function, which I'm going to call f, that is going to, as we call it, vectorise the context. Essentially, what we want is for it to take in our words, process them in some way or other, and output this h, which is going to be just a tensor, a vector, anything. And we want this h to capture whatever context information we're getting from these words that we've observed. So, basically, we're trying to summarise all of the context with this one h, so this should replace this over here.
Does anyone - and I normally don't
ask questions into the audience
but let's try -
have any ideas of what would be
good properties that we would
want this f to have?
So, we want f to somehow
summarise the context,
and I've told you what sequences
are important for. Any
suggestions whatsoever? Guys?
There's one very obvious one,
if you think-
*STUDENT* They accept variable inputs?
*LECTURER* Good, that's one, very good. We need to have variable input, so it cannot be like a neural network that expects a fixed-size vector.
It's going to-
we don't know how long our
sentence is going to be or our
context.
So yeah. *STUDENT* [inaudible]
*LECTURER* Very good
We need to somehow still keep
the order
and, like, this notion of order.
Because if I gave it some other
order of these words, the context
is very different.
So, those two things are very,
very important.
I'm going to give you the other ones because they're maybe less obvious. So, order matters, very good.
Very well then, good. Learnable,
obviously we're in a deep
learning course so we care about
things being differentiable.
One thing that is maybe not as
obvious, but when you start
working a bit with sequences it
becomes more so.
You want individual changes to have
a very large effect.
So, if I just change one word in
a very long context, it can
actually mean the opposite
in the English language, right.
So, you want a model that can
actually capture this.
The deep learning way of thinking
about this is just to have
very large networks, lots of
nonlinearities that somehow
capture these very bizarre
boundaries in high-
dimensional space.
And, finally, another thing that
we'll actually touch on a bit
more later.
We wanted to preserve long-term
dependencies because language
depends not only on the previous
three words, not even on the
previous sentence,
all I've been saying so far has
depended on almost the first
sentence I've said in the
lecture, right?
So, you need to be able to
remember that as well.
Very good.
So, this is the first part.
Looking back now, I've mentioned one method, which was the N-grams. How do they do on these kinds of desiderata? This is just a little reminder of what an N-gram was, in case someone was asleep two minutes ago.
Essentially, the way we're thinking
about N-grams is that there is a
function that just concatenates
the N previous words.
That's all it's doing.
It's not doing super great. So, order matters: it does respect that in a way, because you're only ever caring about the last few words. It is not variable length, by definition. It is not differentiable; you're not learning anything. The pairwise encoding, which is what I call it when single words can have a big effect, it doesn't have that, obviously. And, finally, it most definitely does not preserve long-term dependencies; by definition we are cutting it down to a very small number of words.
So, that's not a super great
model. One that people also often
think about when aggregating
sequences together is to just
add everything together -
sorry - where basically you're
just thinking of f as a big sum,
so they just take all the words,
lump them together, and say this
is your context.
This one actually has quite complementary properties to the N-gram. Order does not matter; you lose it the moment you're just clumping everything together. However, it deals with variable length, I mean you can add on as many items as you want. Nothing stopping you. You can differentiate through it, it does not have the pairwise encoding, and it can preserve long-term information because you're literally preserving everything by adding it together. But you
can arguably say that it's not a
very smart model because it's
essentially recreating the first
example that I showed you where
all the words were independent
and there was no real structure.
So, we said that's not a good
model.
*STUDENT* Going back to calculating the probability from frequency of occurrences: language is always, like, a frequency matter, because you can look at the frequency of occurrences of the words in the sentence. Is this something reflected in these learned ones, or in general?
*LECTURER* So, this, I think, goes back, if I
understand correctly, more to the
first part where we were saying
we have some sort of corpus,
like a probability table, that
tells us not only how frequent
are words but how frequently are
word combinations,
so that was this p of x2 given x1. So, that would be the table.
And this is a very intuitive way to think about it but, like, combinatorially, it scales quite badly; that was what we were mentioning. So, what we're hoping is that these deep methods somehow learn that; implicitly they should be learning, how often do I see the word "cat" with the word "mat"? Like, how often do I see those together versus not, right? But again, it's deep networks. Does that answer your question to a certain extent?
Good.
So, this was part one. We wanted to
take our context and just
encode it in some representation
h.
Now that we have our context somehow
summarised- vectorised,
what do we do with it?
So, we now want a second
function that takes this context
and just produces a probability
distribution. Our desiderata
for this are much simpler.
I'm not going to delve too much
into them. But basically, again, we want the fact that individual changes should have a big effect, which translates from before, and also the fact that it returns a probability distribution. That's really our only kind of concern. And this is fairly simple to do. You can just throw a softmax in there and it should do the job.
Cool.
So, in summary, N-grams and the other simple methods that we've been talking about don't really meet these requirements for modelling sequences,
as we saw in that table with all
the crosses and the ticks. So, how
can we build deep networks that
actually meet our requirements?
Is it clear so far what our requirements are, and why the methods we've seen so far are not super great at dealing with this?
Cool. So, I'm going to move on to
probably one of the most
important models in sequence
modelling, which I'm sure many
of you have heard of, which are
called Recurrent Neural Networks.
Recurrent Neural Networks are a
type of neural network
architecture that have a specific
structure.
So, they have this hidden state h, which stores information. It's going to be a state that we keep modifying and that stores information about what we've seen so far.
So, the way this would work is we're going to initialise h, so our state; it can be initialised with zeros, whatever you want it to be. And then you're going to give it some input, so a word of your sentence or whatever the first element of your sequence is. And we're going to update h to the next h according to some particular function. Okay? So far, so good.
The way we're going to update h
is actually quite simple.
There's going to be a weight matrix
that are some of the parameters
we're learning and that we're
going to multiply the previous
state with. And there's going to be
a second weight matrix,
again learned parameters, that
we're going to multiply with the
input. So, we multiply one with
the previous state, one with the
current input and then pass it
to a tanh
so things don't go too crazy and
then we just get our next state.
So, so far all we've done is we
started somewhere
and, given that we've seen a
single word, we updated our
internal state.
Why is this useful?
Because this basically so far
we've summarised our context,
like I said that is step one.
Now we're doing step two.
We have a summary of the context,
which is this h1 at the moment,
and all we're going to do is
multiply it with another learned
weight matrix.
And that's going to give us our
output probability distribution.
So this over here is going to be a vector that is a probability distribution. So all the values are going to be between zero and one, all of them are going to add up to one, and, over all possible words that can be in the English language, it's going to tell us which one is going to be the most likely, or whatever the probability distribution is.
We're obviously going to use a
softmax here, as I mentioned,
just because then that's going
to ensure that we have a
probability distribution.
You are familiar with the
softmax? Okay, cool.
So, once we have our predictions, during training we're just going to take the next word in the sentence. So, forget about that prediction for now. We're just going to take the next word, feed that in, update our current state to get the next state. There we go. And predict the next probability distribution, and so on, and so on.
So what we're essentially doing is, one word at a time, feeding it into our state variable h, updating h, and doing this for however long we want. There is literally no restriction on the sequence length, because we can feed in arbitrarily many words over time. And there's also a sense of order, because the network should learn to process them sequentially and therefore keep some notion of what the order of the words was.
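To ground this in code, here is a minimal NumPy sketch of the forward pass just described (my own illustration with made-up sizes, not the lecture's code; W_h, W_x and W_y are the three learned weight matrices):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 128

# The three learned weight matrices.
W_h = np.random.randn(hidden_size, hidden_size) * 0.01  # previous state -> state
W_x = np.random.randn(hidden_size, vocab_size) * 0.01   # input word -> state
W_y = np.random.randn(vocab_size, hidden_size) * 0.01   # state -> output logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(h_prev, x_onehot):
    """One step: fold the new word into the state, then predict the next word."""
    h = np.tanh(W_h @ h_prev + W_x @ x_onehot)  # updated summary of the context
    y = softmax(W_y @ h)                        # distribution over the next word
    return h, y

h = np.zeros(hidden_size)       # initialise the state, e.g. with zeros
for word_id in [17, 42, 7]:     # feed the sentence one word at a time
    x = np.zeros(vocab_size)
    x[word_id] = 1.0
    h, y_pred = rnn_step(h, x)  # y_pred: probabilities for the next word
```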
Does this make sense?
Is everyone on board with
RNNs? Cool. The way RNNs are often also shown, they can be summarised like this, basically, where you're again showing the input, the output, and this hidden state, and you're looping over it with the different words.
This is just the normal diagram.
People also talk about unrolling
RNNs, so whenever you're
coding it and you want to
backprop through it, you're going to
have to unroll it and we'll get
to that in a minute.
It essentially just means
you're taking it and spreading
it over the different time steps.
Cool.
So, that's our model.
We're going to discuss quickly
the loss- what are the losses and how
we optimise.
I don't know how many here are
actually interested in the maths-y
details.
I know there's also people from
the public so I'll try to point
out what the high-level messages
are, but if you want to follow
more closely, the equations are
on here.
But please don't be intimidated.
They just look sometimes a bit
scary but they're not. Cool.
So, how do we train this?
We haven't really talked about
the loss yet.
We've only talked about what we
want from a model and how we
implement the model.
So, how do we actually update
these weight matrices that I
showed you?
So, what we're currently doing in
this task, if you think about it,
is almost like a classification
task where we say our input is
going to be the context and our
target is going to be whatever
the next word is.
Right.
And for classification, as I'm sure you've also already learned, one of the normal loss functions is to just use the cross entropy: so, essentially, I'm going to take the probability distribution my model predicted for the next word, multiply it with the real next word in that sentence (as a one-hot target), and effectively add that up over all of the words in that sentence.
And that's going to be my loss.
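Written out, with y_t the predicted distribution at step t and x_{t+1} the actual next word, that loss is roughly (my notation for the cross entropy being described, not necessarily the slide's):

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t) = -\sum_{t} \log y_t[x_{t+1}]$$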
So, the loss is actually fairly
straightforward. And the loss is
going to, as I said, depend on
some parameters theta, which
are going to be the three weight
matrices that I mentioned
earlier.
The one that creates the output, the one that takes in the input, and the one that just updates the hidden state. Now, differentiating.
differentiating.
Bear with me.
I'm going to very quickly touch
on why differentiating is a bit
different in RNNs than
it has been in the classes you
took so far. At a higher level,
it's just different because we
have this recursive loop in the
middle. That's basically all you
need to know if you don't care
more than that. If you want a few
more details, let's just really
quickly recap what the equations
are that we're going to be dealing
with.
First of all we have just the
state update.
So, given some h, how do we get to
the next one?
And, as I mentioned, we're going
to just multiply with some weight
matrix the previous state and
then the input with another
weight matrix, tanh. Easy.
How do we predict this y, which is the same as the probability of the next x? We just take the softmax of some weight matrix times the current state. Also easy. And there's the loss that, as I mentioned, is the cross entropy, which we're just taking between our prediction and the real next word. Now, we have three
parameters that we need to
optimise for, this, this, and this.
I'm going to start with the one
that's easiest.
This one is very easy.
These two are a bit more
complicated.
And I tell you in a minute why.
So, let's focus on just Wy.
Wy is fairly easy to- to
differentiate because, when we're
actually just expanding and
doing the chain rule, we can see
that we're just differentiating
y, which is this one here, with
respect to Wy, and that's the
end of it. If you want to do the
math,
this is what comes out of it.
Fine.
You're just differentiating
and using that to update. It gets a bit more tricky when we actually look at these other two variables. I'm going to focus on
Wh for a second.
And the problem here is, once we actually look at the final term of our chain rule over here, we're differentiating this with respect to this. So far, this is fine, but then we also have this term here. And this h has come from another equation that also has the same Wh in it, if this makes sense. And this is a bit clearer, actually, if you think about it like this: because your h depends on the same weight matrix again and again and again over time, you cannot just differentiate with respect to that one variable once and be done.
It's a bit more complicated. And
this is called backpropagating
through time, which sounds really
fancy, but all you're doing is
you're essentially unpacking
this recursive iteration that
you've been doing. The last intimidating equation... Essentially, all I'm trying to convey with this, and you can almost just visually see it, is that this is going to break down into your explicit derivative (and there should probably be a plus there) and something that depends on the previous weight, and then this itself is again going to split into the explicit one and some weights, and then this itself can then be broken up into two components, this and this and this, and you go like this until the first step.
Okay.
So, all I want you to take away is that you need to unroll over all the time steps, so you cannot just differentiate once. And this is a summarised version of that equation. You can plug this summarised version in, and this is actually going to be your objective.
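My reconstruction of that summarised backpropagation-through-time expression, which may differ slightly from the slide's notation, looks like:

$$\frac{\partial \mathcal{L}_t}{\partial W_h} = \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t}\, \frac{\partial y_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_h}$$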
So, all of this is to say that this is a bit different: since we're actually dealing with Recurrent Neural Networks, differentiating is not as simple. Obviously, with TensorFlow and other libraries this is much more straightforward, they do it for you and they unroll in time, but it's good to be aware that this is a bit different than the things you've been dealing with so far.
Cool. Was that more or less
clear intuition-wise we're all
on board?
Good.
Yay.
Okay.
One of the other issues, beyond just having to, like, differentiate in a somewhat bizarre way, is this thing called vanishing gradients, which are a big problem in Recurrent Neural Networks.
And I'd like to give you an
intuition of why they are a
problem.
In order to give you this intuition, I'm going to make a very, very simple RNN. So, this is a very gross simplification.
Instead of having an input, so
before we had the hidden states
and we had some inputs and
outputs, we're not going to care
about those for a minute.
We only care about the hidden
states. And the weight matrix is
just going to be a single scalar
for now.
So, we're just updating it with a
single scalar and that's about
it.
So, we're just taking a state and
updating it with a scalar many,
many times.
So, essentially, if we look at ht: at some point, ht is going to be the scalar times the previous time step, which itself is going to be the scalar times the time step before that, and so on, and so on. So you can also rewrite this as: ht is going to be this initial state times the scalar to the power of however many time steps t you've taken, if that makes sense. So, you're just multiplying that one scalar over, and over, and over again, and then times the initial state.
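As an equation, the simplified recursion is just:

$$h_t = w_h\, h_{t-1} = w_h^2\, h_{t-2} = \dots = w_h^{\,t}\, h_0$$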
It should be clear from this that, because you're multiplying this with itself so many times, if Wh, so this scalar, is bigger than one, your state will go to infinity, because it keeps multiplying itself over time: the more time steps you take, the more it will grow and grow and grow. So, h will go to infinity. If it's smaller than one, it'll go to zero.
What does this mean? If the weights of your neural network deviate too much from one, the values and the gradients are going to explode or vanish, which you can then not learn from.
Does this make sense intuitively?
Okay, good?
If I lose you, honestly, just
raise your hand and I'm happy to
go over this again.
I've actually made a small
example to clarify this also a
little bit because some of you
might claim, okay, Marta,
that's fine.
But this guy here is normally
bounded by a sigmoid, right?
So, it should be between zero and one.
The h should not go that nuts.
And that's true.
We do do it like that.
Usually they're bounded.
But what happens is that the
gradients are still affected by
this. Even if the values of the
state themselves are going to
stay between zero and one, the
gradients are going to go to
zero.
So, what I've done here is written a little bit of code where essentially I'm running exactly this: I am updating the state as I mentioned to you, h is going to be just the tanh of some scalar times the previous state, so it's bounded. So, very simple.
And then here I am just
calculating the gradient, which I
can calculate in closed form
because it's such a simple thing,
just calculating the gradient, and
I'm plotting here, if you can see,
the values, this light purple
line is the values of the state
as I change the value of the
weight.
So I can choose different
weights for my- values for my
weight.
And, whenever I'm around one, the values are reasonable. But, as soon as I go away from minus one or one, this is going to go to either zero or saturate, which is fair enough; at least it's not going to infinity like it was doing before.
But if I then look at the
gradients actually, that's a
different story.
The gradients are actually zero
everywhere except for quite near
the region of minus one and one.
And this is very bad, because in machine learning, in deep learning, we depend on the gradients in order to update our models, right. So if we have zero gradients, there's no signal coming through to our parameters, and we're not going to learn anything.
So, basically, we're just saying: if we take too many steps and our weights are not near one, we're not going to have gradients. And this is one of the biggest problems that RNNs are facing, because what this essentially means is that you cannot take many steps; you can only update a certain number of times before your gradients vanish to zero.
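The little experiment being described can be reproduced with a few lines like these (my own sketch of it, not the lecturer's actual code):

```python
import numpy as np

def final_state_and_grad(w, h0=0.5, steps=50):
    """Apply h <- tanh(w * h) many times and track dh_final/dw with the chain rule."""
    h, grad = h0, 0.0
    for _ in range(steps):
        pre = w * h
        d_tanh = 1.0 - np.tanh(pre) ** 2   # derivative of tanh at the pre-activation
        grad = d_tanh * (h + w * grad)     # product rule through the recursion
        h = np.tanh(pre)
    return h, grad

for w in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]:
    h, g = final_state_and_grad(w)
    print(f"w = {w:+.1f}   final state = {h:+.3f}   gradient = {g:+.6f}")

# The state itself stays bounded, but the gradient is essentially zero
# everywhere except close to w = -1 and w = +1.
```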
So, if we look at now on our- on
our list of desiderata how well
RNNs are doing.
Order matters.
That's really good
and very important, and variable
length.
So, those two we achieved by
design. It is differentiable, again
by design.
It doesn't do this pairwise encoding yet. And, crucially, it does not preserve long-term dependencies because, like I said, we have these vanishing gradients that are going to force us to only look at a certain number of steps before gradients go to zero.
Does this make sense so far? Why
RNNs are getting this
type of score? Nice.
Good.
So, basically, just to summarise, Recurrent Neural Networks can, like I just said, model these variable-length sequences and also be trained via backpropagation, but they do have this vanishing gradient problem, which is a problem for modelling long-term dependencies. And maybe I haven't convinced you guys just yet why long-term is so important?
I'd like to give you a little
example. Imagine I give you
your- your model, a
language model, and I give you
the sentence, I said, "Finally, Tim
was planning to visit France on
the final week of his journey.
He was quite excited to try the
local delicacies and had lots of
recommendations for good
restaurants and exhibitions.
His first stop was, of course, the
capital where he would meet his
long-time friend Jean-Pierre. In order
to arrive for breakfast,
he took the early 5 AM train
from London to..." and then, if
you're a good language model, you
should look, oh, France and also
capital and be able to tell me
Paris is probably the most
likely answer to this. And this
is a huge context.
Like, if you look at this, the clue is several sentences ago here, and these are very, very long sentences as well.
So it is important to keep even
single words that are quite far
away, to keep these long-term
dependencies in order to be able
to make accurate predictions.
And this is just a small example,
you can imagine if you're
writing a book.
Sometimes things depend on stuff
that you've said in chapter one.
So, it's- long-term dependencies
are very crucial for language
modelling and RNNs are definitely
not able to capture this.
So, how can we do that? Good.
Are we so far so good?
RNNs are good?
We like RNNs, losses okay?
Optimisation
we're not intimidated by?
Okay, good.
We're going to move to LSTMs, which are like RNNs++. And I'm sure many of you have also heard about Long Short-Term Memory networks, LSTMs for short, and in order to explain them,
I'm going to just very quickly show
you this model of an RNN
because you should be
familiar with it now,
so we have the previous state h,
we passed some input, combined
those, passed through a sigmoid,
next state.
Good. Because the next one is
going to be more in this same
structure.
So, an LSTM is this. This can
look quite daunting the first
time you look at it, but we're
going to go through it step by
step, and it actually is not as
bad as it looks.
The first thing to point out is that not only does it have the state h that we had earlier, it has this long-term cell state c over here, in parallel to this h state here. So we're keeping those two internal states, and we loop both of them through. And, in addition to this cell state, it has a set of gates that can be learned, which modulate how the input is or is not applied to those internal states.
Let's go through them
independently because they're
less daunting this way.
The first gate we take into
consideration is the forget gate.
The forget gate-
the job of the forget gate is,
given your current input and
your h,
what do you need to forget or
erase from your long-term
memory?
Right.
And all you're going to do is
you're going to combine those,
pass them through some network, and
then pass them through a sigmoid,
which means the values are going
to be between zero and one, and if
somewhere there's a zero, it's going
to be multiplied with this and
just erase whatever was there
because it's multiplying it with
zero, if that makes sense.
So, this is a gate that is just
going to basically regulate how
much of the previous information
is allowed to pass through and how
much we forget. Then we have the
input gates. Now that we've
erased stuff, we should add some
new information.
Again, based on our current input
and some of the previous state,
we are going to combine those
almost in the same way that we
combine this.
So, basically, we're just creating
some pseudo cell state, gating it
with the same mechanism of zero
and one to see which information
we actually want to pass,
and we're just adding that. We're
adding that to the cell state
and that basically updates the
long-term memory.
Okay. So with the first gate we erase, and with the second we add new information, and then, finally, we obviously need to update the state h here, because this one also needs to be updated. And this is going to be updated again with our input, but also with the long-term state. So that's also going to have a say in what gets to stay in our shorter-term memory. And this then looks a bit less intimidating if you think about it in terms of the forget gate, the input gate, and the output gate. By using this mechanism of gates that are all learned, the model can learn how to store information that is relevant for later on without having the vanishing gradient problem.
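Putting the gates together in equations, the standard LSTM update looks roughly like this (the slide's notation may differ slightly), with sigma the sigmoid and the circle denoting element-wise multiplication:

$$f_t = \sigma(W_f[h_{t-1}, x_t]), \quad i_t = \sigma(W_i[h_{t-1}, x_t]), \quad \tilde{c}_t = \tanh(W_c[h_{t-1}, x_t])$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad o_t = \sigma(W_o[h_{t-1}, x_t]), \qquad h_t = o_t \odot \tanh(c_t)$$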
Hold on. Sorry. And this is also why LSTMs are essentially at the core of sequence learning. So, up to this day, whenever you are dealing with sequential data in any capacity, usually LSTMs are there; they pop up everywhere, so I'm sure you guys have heard of them, and it's because they're actually very reliable at doing exactly what we just set ourselves up to do. This is why LSTMs are good. We more or less understand the gating mechanisms and so on.
*STUDENT* Why don't we have the vanishing gradient problem in this case? What saves us from that?
*LECTURER* In this case, it's because you have a lot of additions and you're not only just multiplying. As far as I understand, it's because of these gates and how the information flows: you're not just multiplying the same value in, on and on and on. That's my intuition about it.
*STUDENT* [Inaudible]
*LECTURER* Essentially, because you have this long-term module over here - I'm struggling to express what I think. Even when you are differentiating, before, you were multiplying, and essentially this was going to zero; but because you're influencing it via this plus, these gates are basically stopping the vanishing gradients from destroying everything, if that makes sense.
*STUDENT* It's influenced not only by multiplication, so you can recover it through the addition? Okay.
*LECTURER* If that makes sense, sorry.
Yeah sorry hi.
*STUDENT* Can you go back to the previous slide?
[Inaudible]
[Inaudible]
[Inaudible]
*LECTURER* I think so. This is one way of showing it in the diagram. I think, essentially, what this is trying to say here is that this here is going to have the same shape as this, because we're just adding it on. So you're essentially saying, the way this is normally done is first to calculate some sort of new cell state, but instead of adding that straight in there, you almost have your own little forget gate over here and you're multiplying these two. So you're regulating how much of this new cell state you actually want to pass. And then, I mean, you can also put these two in one box, but these are actually like the two intermediate steps that are carried out, if that makes sense.
There are also GRUs, which I should mention, which were actually developed much more recently, and which you can think of as, like, simplified versions of an LSTM. Their gating mechanism is a bit simpler. I'm not going to go into details, but if you guys are interested they're on these slides and you can just go back to them. They're a bit simpler, and I think it depends a bit on the particular application: they train faster, but sometimes LSTMs are a bit stronger. So it really depends what you should be using. But this is also there. Cool.
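For reference, the standard GRU update is roughly (again, the slide's notation may differ):

$$z_t = \sigma(W_z[h_{t-1}, x_t]), \quad r_t = \sigma(W_r[h_{t-1}, x_t]), \quad \tilde{h}_t = \tanh(W[r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$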
So if we consider LSTMs, and I guess GRUs, they have very similar behaviour to RNNs on our list. So, of course, order matters, variable length, differentiable. Great. Still none of this pairwise encoding, but it does preserve long-term dependencies. So this is something that we've now gained with LSTMs, so we're slowly moving up the tick list, which is really good.
So, in summary for this little section, LSTMs and GRUs overcome the vanishing gradient problem with this gating mechanism that I mentioned and, as a result of being able to overcome this, they're actually everywhere in machine learning research, and I'll come back to that in a minute.
Cool. Are there any questions about
any of this?
*STUDENT* Sorry, my bad, how are GRUs different from LSTMs? They still have gates, right?
*LECTURER* Yes. Yeah. Both of them actually have gates. It's just a different gating mechanism, which is usually a bit simpler. So they only have two gates, one that they call the reset gate and one that is the update gate, and it's just the way they deal with the gating. So, I mean, for example, they also don't have this long-term state; they just have a single state. But, again, they don't have the vanishing gradients, because the way the information interacts is not just multiply, multiply, multiply, so you don't get that same problem, but there's no need for this long-term c cell state
basically.
*STUDENT* So what makes them better in some cases than LSTMs?
*LECTURER* They have fewer parameters and they might be quicker to train, and so on. So there is that, and they're simpler, so they also might be simpler to implement and so on. You can write this down as a simpler update than the gating mechanism of an LSTM, which is a lot; LSTMs are more involved but tend to be maybe a bit more robust, at some expense, yeah.
Cool.
Any questions.
Other than that.
Sweet.
Cool.
We're doing good for time.
Yeah, sweet.
So we're going to focus on generating sequences. So far, what I've been talking about is, okay, we have a sequence model, how do we train it? And now that we have a trained model, let's assume that part is all done. What can we do?
Well, we have trained our model using this cross entropy to be able to properly predict the probability of a sequence. But that's not a super exciting application: I tell you that a model can tell you how likely a sequence is in English. Great. That's usually not what people train these models for. A more exciting application is to use it to generate likely sequences: now that it knows how to tell whether a sequence is probable or not, it can also generate them, and that's why these models are actually quite interesting.
So we're going to focus on
generation.
How would you go ahead and
generate?
So, just looking back at the RNN - and I'm doing this on an RNN, but this cell could also be an LSTM or any other recurrent mechanism. If you take the RNN from before and I wanted to generate a sentence, I could just feed in my first word, say "modelling"; this would then output the probability distribution, as I mentioned earlier, and then, instead of just picking the next word in the sentence because I have the ground-truth sentence, what we're going to do is just take the most likely word, or sample a word from the distribution, and feed that as the next input. So it's autoregressively creating a sentence: that's going to give us the next probability, and so on. So we can just feed whatever output we get into the next step, and so on, and that's going to generate our sentence or sequence.
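A minimal sketch of this autoregressive generation loop (my own illustration, reusing the hypothetical rnn_step from the earlier sketch):

```python
import numpy as np

def generate(rnn_step, start_word_id, vocab_size, hidden_size, length=20, greedy=False):
    """Generate a sequence by feeding each predicted word back in as the next input."""
    h = np.zeros(hidden_size)
    word_id, output = start_word_id, [start_word_id]
    for _ in range(length):
        x = np.zeros(vocab_size)
        x[word_id] = 1.0
        h, probs = rnn_step(h, x)  # distribution over the next word
        if greedy:
            word_id = int(np.argmax(probs))                       # most likely word
        else:
            word_id = int(np.random.choice(vocab_size, p=probs))  # or sample from it
        output.append(word_id)
    return output
```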
So I'm going to go through first
several applications and several
examples and usually in each of
those I'm trying to introduce
something new in this particular
case.
I've been talking a lot about
language as sequences.
I'd like to talk about images as
sequences of pixels.
And this is a model that appeared actually a while ago now, in 2016 I guess, which was a generative model for images called PixelRNN. And essentially it was treating images as just a sequence of pixels, which is a totally valid way of describing an image, right. If you think about it, very much like before, when we were modelling the probability of a word given all of the other words so far, we're now modelling the probability of a pixel given all of the other pixels that you've seen: so, what would the colour of this one be, given that I've seen all of the other ones?
And this is more or less how it would work in action. So we would start by just having some distribution over what the first pixel of an image is; then we can sample that over there and we just get a dark pixel. And then, to sample the second one, we're just going to sample from the probability of the second pixel given that first one, and we do the same for the next, and so on, like the chain rule we did in the first part, and we'll just advance along the image slowly and always sample the next pixel given the previous ones.
And what's interesting about it is that the model then learns, from the context of all the previous pixels, what the probability distribution of colours is for the next one, right. So, in this particular case, it's very likely to be a green pixel, and that's why the distribution is mostly centred and quite spiky around that. But if you look at this example over here, maybe, where it's not clear, assuming this is a bird: is this already where the bird starts? Is this still grass? What is happening there? It's more uniformly distributed across the possible pixel values. And what's interesting is that, depending on the context, it changes this distribution quite a lot between those different contexts.
And this generates images which, like, I think if you're far enough away look very realistic. Nowadays, obviously, more state-of-the-art models produce quite good images; this was at the beginning, and even though they're not perfect and some of them are quite blurry and bizarre, you can see that it's learned some sort of distribution over what natural images should look like, especially considering that you've only been conditioning on the upper half of the images, which was very good at the time. This was quite impressive.
And of course you don't need to always condition in the order that we just explained, which is pixel by pixel. This is one of the nice things: you could really condition on any order, right? This is not language, where we need a specific order of, like, first word, second word. You could imagine also conditioning on and defining your order to be by clusters, or any other order, up and down or whatever, as long as you define an order. After that, it should learn to kind of use that, which makes these models quite flexible. Does this make sense? Yeah. Hi.
*STUDENT* If the conditioning goes sequentially like that, pixel by pixel, will it get better as it sees more pixels? Would it start predicting pixels that were a better reflection of the actual context of the image?
*LECTURER* Yeah. So, I don't know about necessarily better, because it could learn to predict the second pixel from the first really well - imagine it learns a super good approximation for that - so I don't know that it would be better or worse. I think, if you were to sample different pixels, probably the variability is much higher at the beginning, because given, like, the first few pixels there's a lot of images that could come up; but once you've committed to that bird image, you're like, well, that's probably going to be green at this point in the grass, right. So I think that's what was reflected also in how, like, narrow or not those distributions were, right? So, rather than saying more accurate, I think I would say it's more certain of what the next pixel should be, if that makes sense. Cool. Sweet.
Ok cool.
So, that's that. Another one - well, we've already talked a lot about natural language, but it is the most obvious application that always comes to mind when you think about sequences. And, in particular, obviously we've already talked about RNNs and LSTMs, but I'd like to talk to you about sequence-to-sequence models, which some of you might also have heard of, because they have a lot of applications in industry as well.
So, think about the RNN model that I've been describing so far. Again, we have these hidden states that get updated with our input and produce some output, but initially I told you that this original state h0 can be initialised to be whatever; normally people just make it a vector of zeros. That's fine, because it's your first state, but this doesn't need to be the case: you can really pass in any information here that you want your model to already start with, right. So imagine I could give it some additional context that gets processed and passed through our states, and only then, once it's seen all of this context, do I want it to start generating,
and you might ask when would I ever want
to do this?
And, like, okay, imagine you want to translate a sentence. Right. So I can give it a sentence in the original language, and that's going to be my context input. So all it's doing is summarising this into the initial state. It's packing the whole source sentence into the initial state, and then, once it has that, you initialise it with a random word and you tell it, okay, I want the target to be the Japanese translation of whatever the sentence is. So it will produce the first word of the translation, and then the next, and so on, doing this autoregressively, the way you would generate a normal sentence; but it's not just generating any random sentence, it's generating a sentence conditioned on what you gave it as context, if this makes sense. And sequence-to-sequence has been very powerful.
So this way of incorporating
knowledge into a model has been
very useful.
And what also makes it very powerful, and kind of opens up the possibilities with RNNs, is that you can really condition on whatever you want and add information however you want.
So you could think, for example, the most simple RNN would just be this network: you would have an input, then a state, then an output. Fair enough. But you can also imagine passing in a single input and asking it to produce lots of different outputs sequentially, or you could do it the other way around: you could give it lots of inputs and actually only want one output at the end of the whole thing. This one is like the normal RNN, and this one is like the sequence to sequence, where for several inputs it's not producing anything, it just builds up the state, and then it actually starts producing outputs.
So this is a very flexible way of thinking about how I can use RNNs and how I can give them extra information. Sequence to sequence has actually been used in a ton of applications because of how flexible it is, and machine translation is at the top.
So it's very widely used in all of Google's systems for machine translation; image captioning, which I'll show some cool examples of in a minute; but also for speech, parsing, generating dialogue, generating videos and geometry. So there are a couple of references if you guys are interested, but clearly this is applicable to a very wide range of tasks. Yeah.
Let me give you an example: Google neural machine translation, which is one of the, I guess, more relevant applications of sequence to sequence, Wu et al., also a while ago now. And essentially they would do exactly this: you would have some context in some language, the context language, you would encode this to be your state, the one I just mentioned, and then you give this input in a specific order to the RNN over here, which is just going to produce whatever the output is. And this GNMT, Google neural machine translation model, actually improved performance on translation by a lot. So if these are the old models that were just phrase-based, so frequency-based, that's the blue line, and humans are yellow, then this green one improved the baseline by a lot, and across languages as well, which is very impressive.
So this was a very good way of closing the gap between machine and human translation, which was really cool. Also very cool is what you can do because, as I mentioned, it's quite flexible.
So far we've actually been talking about: OK, as a context I can give it some language and then ask it to predict that in a different language. But why do I have to give it some language in the first place? I could give it an image, right? So what we can do is just take an image, pass it through some neural network, like a convolutional neural network, to create my context state, and from there have it actually create a sentence that describes that image.
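In code this looks almost the same as the translation case, except the initial state comes from an image encoder instead of a sentence encoder (a rough sketch with made-up sizes, not the actual captioning model):

    import torch
    import torch.nn as nn

    hidden = 16
    cnn = nn.Sequential(nn.Conv2d(3, hidden, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
    decoder = nn.LSTM(input_size=8, hidden_size=hidden, batch_first=True)

    image = torch.randn(1, 3, 64, 64)
    word_embeddings = torch.randn(1, 7, 8)         # embeddings of the words fed in so far

    h0 = cnn(image).unsqueeze(0)                   # the image summary becomes the initial hidden state
    c0 = torch.zeros_like(h0)
    words, _ = decoder(word_embeddings, (h0, c0))  # the caption is generated conditioned on the image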
So this is for image captioning, and the examples of this type of image captioning are actually super impressive. And one of the points that I want to get across when I show you these examples is that there is still a lot of variety between the best model, which would be this one over here, and the initial model, which I'm also going to show you, so when you first try it without hyperparameter tuning and so on. So it's clearly very sensitive to that type of tuning.
So the initial model was saying "a large brown dog laying on top of a couch", which, to a certain extent, fair enough, but then the best model says "a small dog sitting on a chair". We can have "a man cutting a cake with a knife", which, not really, versus "a man holding a sandwich in his hand". Good. "A pizza sitting on top of a white plate", which is completely wrong, versus "a person is cooking some food on a grill"; and "a close up of a person eating a hotdog" versus "a woman holding a banana up to her face", which is really good. Like, if you look at that middle sentence, that's pretty impressive.
And that just goes to say that in all of these it's still quite fiddly, it still requires a lot of hyperparameter tuning, and it just doesn't come out of the box. But when it does well, it's pretty good.
So yeah.
So this was sequence to
sequence.
Yeah.
*STUDENT* So when you pass the image through the CNN and then use the encoding, when you train it, how does it work? Do you train the CNN as well when you're backpropagating?
*LECTURER* Yeah. Well, really, it could go either way. So here you could either train it from scratch and just train them together, and it should learn to actually have a CNN that extracts a meaningful representation. Often what people do is use pre-trained models, just because, we can argue, if you go online and get any of these GoogLeNet or whatever pre-trained nets, it's going to save you a lot of compute, especially if you have a huge CNN. Then you just plug that in there, stop gradient, and actually just update over here. But it could work either way; that's just saving you money sometimes.
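As a rough sketch of what that stop gradient looks like in practice (illustrative PyTorch; the tiny modules below are just stand-ins for whatever pre-trained CNN and caption decoder you actually use):

    import torch
    import torch.nn as nn

    cnn = nn.Conv2d(3, 8, 3)   # stand-in for whatever pre-trained encoder you downloaded
    decoder = nn.LSTM(8, 16)   # stand-in for the caption-generating RNN

    # "Stop gradient": freeze the CNN's weights so backprop only updates the decoder.
    for param in cnn.parameters():
        param.requires_grad = False

    # Give the optimiser only the parameters that are still trainable.
    optimiser = torch.optim.Adam(p for p in decoder.parameters() if p.requires_grad)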
*STUDENT* The RNN application you have for generating images, I was just thinking, doesn't it make more sense, or do people use that, for simpler things like sharpening an image or remastering an image, where there's a hole in an image or something and you can scan it in and fill it in, so it has more sense of a context and it kind of makes more sense? Or even, like in biological imaging, you have deconvolution, where it iterates through to what's probably the crispest, most likely image. Wouldn't that make sense? Is that what it's applied in? Is that what makes sense?
*LECTURER* I think that would be one of the use cases. I don't know that it's necessarily been applied there, but I could see it: when I was mentioning that you could cluster, you could then condition on clusters, and that could be almost like what you're saying, where you have very rough clusters and you want to make them finer and increase the resolution. It makes sense. So you could do that. I don't think anyone is going to use PixelRNN to generate images; there are far better generative models right now, you can use GANs or anything else. This was more to show that we have models that can sample from... model conditional probability distributions, and as a matter of fact a version of PixelRNN was actually kind of the precursor for WaveNet, something completely different that creates audio signals, which I'll talk about in a minute. But PixelRNN as such is not necessarily used; that's just maybe an example of showing how you can do conditional probabilities in a rough way, but, like you saw, the images weren't phenomenal.
*STUDENT* What kind of loss do you use to train this task here? If you have the output sentence, [inaudible] how do you train it? If we assume that the image feature extractor is fixed and you don't touch it, how is it trained?
*LECTURER* The same way you would train the RNN. So here, for the first one, you'd feed it nothing, you can feed it a vector of zeros, and then you expect the model to give a good first word, which I believe would be something like "The" or "A" or "An", right? And you take the loss between whatever it outputs and the actual ground-truth sentence, because you assume that you have a labelled dataset. *STUDENT* The same one as before, the log probability?
*LECTURER* Yeah, yeah, exactly. So it's still the cross entropy. And you would just use whatever the caption of an image is and try to recreate that sentence for that particular image. So, obviously, it is then fixed to that one sentence that goes with the image. So it's not super flexible: obviously a human could describe that one image in a million different ways, but you're training it for that one in particular, if that makes sense. So you're hoping that if it sees enough images of cute little dogs, eventually it'll learn different ways of describing dogs.
Does that make sense?
Yeah. Cool.
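For reference, that word-level cross entropy looks something like this (a toy sketch with made-up sizes, not the exact captioning setup): each position's predicted distribution is scored against the ground-truth word at that position.

    import torch
    import torch.nn.functional as F

    # Toy numbers: a batch of 2 captions, 5 words each, over a vocabulary of 100 words.
    logits = torch.randn(2, 5, 100, requires_grad=True)  # decoder outputs: one distribution per word
    caption = torch.randint(0, 100, (2, 5))              # ground-truth caption word indices

    # Cross entropy = negative log-probability of the correct word, averaged over the sentence.
    loss = F.cross_entropy(logits.reshape(-1, 100), caption.reshape(-1))
    loss.backward()   # gradients flow into the decoder (and the CNN, unless it was frozen)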
There was a question somewhere over there
as well.
There, Hi.
Oh yeah, sorry, I mentioned this earlier and said I would make it more clear: when I draw this, this is just showing the recurrence. This is definitely an LSTM, for sure. Yeah. This is a bit simpler than if I put the different gating mechanisms and the cell state and whatnot on it, but all of these are LSTMs. I mean, you could do it with plain RNNs, but again, it would very quickly not do super well. So yeah, LSTMs. Good, well paid attention. Cool. Sweet.
Where were we? Ah, audio waves. So this goes back to WaveNet and what this has actually been used for, more and more efficiently. And actually, I take that back, because it was then changed: it originally was used for that, but then it was changed to convolutions, which is exactly what we're going to talk about right now.
So you can think of audio waves, of course, also as a sequence, and in this case we're not going to use an RNN, we're also not going to use an LSTM; we're going to use convolutions, which is what you guys have already talked about in a previous lecture. And the way this works is that you're going to have your input at a very high resolution, and you're just going to pass it through several layers, hoping that it abstracts away, and then that's going to produce your next point at the low resolution.
If that makes sense. And you're doing this by using increasingly large convolutions, and increasingly fewer of them; these are called dilated convolutions, and this is actually a different way of thinking about how to deal with sequences. So it has this fixed structure of a convolution, so it will always only look at that horizon, but it does still take into account quite a lot of horizon, and a different level of hierarchical abstraction between those layers.
If this makes sense. And this particular model, which is called WaveNet, is the one that was developed for the Google Assistant voice. So whenever you chat to it, that's this model generating audio from text. So they actually managed to scale it up really well, and these convolutions work really, really fast.
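A minimal sketch of that structure (illustrative only, not the actual WaveNet code): stacked 1-D convolutions whose dilation doubles at every layer, so the receptive field grows exponentially while each layer stays cheap.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DilatedStack(nn.Module):
        # Dilations 1, 2, 4, 8, ... so with kernel size 2 the top layer
        # "sees" roughly 2**n_layers past samples of the waveform.
        def __init__(self, channels=32, n_layers=4):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
                for i in range(n_layers)
            ])

        def forward(self, x):                # x: (batch, channels, time)
            for conv in self.layers:
                # Left-pad so the convolution is causal: no peeking at future samples.
                pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
                x = torch.relu(conv(F.pad(x, (pad, 0))))
            return x

    out = DilatedStack()(torch.randn(1, 32, 100))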
If I compare these convolutions to all the other models that we've talked about so far, they're actually pretty similar to LSTMs: again, order matters; variable length, because you're shifting these convolutions along; differentiable, of course, because we can learn them. And originally I had put this as like a semi-tick, but it does preserve long-term dependencies, because it has these very dilated convolutions that look back and then summarise everything into one point; but again, it's actually a fixed size of the past that you can look at, really. So if there's something beyond that, you're not going to see it. So, give and take.
So that's audio. Another area where sequences are quite important is policies, and I don't know if anyone here is interested in RL, but wherever there is a policy, or any sequential decision making, of course sequence does play an important role. And this starts already in tasks such as generating images, so not even RL yet.
For example we can have models
that sequentially decide where
to draw on the canvas or where
to focus on when paying
attention to a certain canvas in
order to generate images.
And those are all sequential
decisions that have to be made.
A really cool application, actually, a neural model that came out recently, is SPIRAL over here, where it is learning to generate images. But normally most models generate images pixel by pixel and decide what colour they want to make that pixel; this one here is actually using a computer program that draws and is sequentially deciding what brushstrokes it wants to make. So it's drawing like a human would draw, essentially, and it creates actually pretty cool faces, given that it's obviously only given a few strokes.
And this is trained on CelebA, so you can see it learned how to draw human faces, and every brushstroke is going to be an output of our sequence of actions. Of course, then there are the big RL applications. We have OpenAI Five, which was state of the art in Dota and, again, obviously very heavily dependent on LSTMs and being able to deal with sequential input data. There was AlphaStar from DeepMind, which is also state of the art on StarCraft.
And to go a little bit more into detail, this is actually what the AlphaStar architecture looks like. And again, it boils down to the same thing: you take in some observations that are processed in some particular way, but at the core we have an LSTM that is taking those observations and making decisions, right? So this was actually quite a crucial part of the agent that was used for AlphaStar, and then it would output the actions and whatnot.
So that was RL, or sequential decision making. I'm going to move on to transformers for the end. Are there any questions on the previous topic?
Cool. So, finally, the last example I want to give is transformers. I believe you're going to have a lecture on transformers independently; this is just going to give you a bit of a quick intuition about them. And obviously transformers are closely related to sequences.
So, just to mention that a little bit, this is how I like to think of transformers, or how I would explain them. You guys have already seen convolutions, and the way convolutions work is that you're going to have a convolution made out of, in this case, three weights, and then those weights are moved along to form these convolutions, and it's always the same way. So if we look at this: it keeps sliding along the image, but it's always the same weights that move along.
Transformers are a bit different. Instead of just focusing on the little subset of the image that the convolution goes over, it's actually going to take the whole input, all of it. But what it's going to learn is which elements within those to attend to, and these are reflected by these weights. So depending on how much you want to attend to each of these elements in the input, your weight is going to be stronger or less strong. And crucially, as you change the point you're generating, these weights are going to change. So depending on what your current point in time is, you're actually going to learn to attend to different elements of the sequence. So that is the difference: here the weights don't change, here they change as a function of the input.
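A tiny sketch of that difference (plain numpy, illustrative only): a convolution applies the same fixed weights everywhere, whereas attention computes its weights from the input itself, with a softmax over similarity scores.

    import numpy as np

    def attend(queries, keys, values):
        # queries: (n, d), keys/values: (m, d)
        scores = queries @ keys.T / np.sqrt(keys.shape[-1])       # how similar each position is to each input
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: how much to attend to each element
        return weights @ values                                   # weighted sum; the weights depend on the input

    x = np.random.randn(5, 8)   # 5 input elements, 8-dimensional each
    out = attend(x, x, x)       # self-attention: every element attends to every other element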
OK, cool. And this can be used for generation. So usually the way it works in language is that you have your input words, and you're first going to try to create some context, some representation, whereby all of these words are interacting with each other depending on some weights, which depend on the attention that is learned. So depending on how relevant or not they are, you're going to have different pairwise interactions, and then, just as with the LSTM, you're going to use that as the context and start creating the next word from the context you get of the representation, the embedding over here, the past states and everything. You can see it a bit in the width of these lines: they are meant to represent how much weight the individual elements have. So that's what that is meant to show.
And this is how you can use this
for generating language as well.
And what's really cool is that these models are really good; transformers really meant a big improvement for language generation. I'm sure you might have heard of GPT-2, which is this model that actually came out last year from OpenAI. It was essentially this huge transformer-based language model that had one point five billion parameters to optimise, and it was trained on this huge, huge, huge dataset of like 40 gigabytes of text data from a lot of websites. And what was really cool is that this was a language model that was not only generating good language, good sentences, but also adapting to the style and the content of whatever context you gave it, which I find really impressive.
If we look at some examples, this is one of my favourites. So what they did is they gave it, actually I will read from here, they gave it this sentence as the context to start from, and then it predicted these sentences.
So the context they gave it was
"In a shocking finding,
scientists discovered a herd of
unicorns living in a remote
previously unexplored Valley in
the Andes Mountains.
Even more surprising to the
researchers was the fact that
the unicorn spoke perfect
English."
So this is a sentence that was written by some person who was probably trying to come up with the most random sentence, and then the model's predictions were, for example: "The scientists named the population, after the distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved."
Which is super impressive if you think about the fact that this sentence is very bizarre, and that it is keeping up with the style of it, sounding almost like a journal text that has been produced. So the sentences it creates are not only very long, but they also make sense contextually as you read on. They also keep referring back almost to their own text and everything. So this is very, very next-level text prediction.
And if we think about it, looking back at our table again: much like the LSTMs, it fulfils the same requirements, but it also has this added pairwise encoding that I mentioned at the beginning, where you're comparing every word with every other word and, depending on that, weighing how much they should influence each other. And this is what also gives them a very big advantage.
Just to look back, I like looking at these things and seeing how far we've made it. I want to compare two sentences. The first one was state of the art in 2011, which is not that long ago if you think about it, when this whole sentence generation with RNNs came out. They were not really using any context, and the sentence they produced was: "while he was giving attention to the second advantage of school building a 2 for 2 stool killed by the culture saddled with a halfsuit defending the Bharatiya Fernal's office", which sounds like it could be something, but it really isn't.
But it's at least a sentence. And yeah, this was in the paper, so this was originally a moment of pride; at the beginning, when RNNs came out, this was really impressive.
But now we look at what GPT-2 is doing. There's an example where it was given this sentence as context: "Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today". And then the model produced: "the singer was wearing a black hoodie with the label 'Blurred Lines' on the front and 'Fashion Police' on the back", which is insane.
This is not only a much better sentence than the one up there: it has popular references, it matches the style of the tabloid, and so on. So to me, eight years between those two is super impressive; the fact that this is where we've gotten to.
So this is quite exciting I
think.
There's obviously still a long
way to go and we're still only
dealing with sentences and we're
not writing books and whatnot
but slowly I think we're getting
closer.
Just a summary. So this has been quick, actually. I want to emphasise what we went through and hopefully what should stay in your mind. I started by motivating why sequences are important: they're everywhere, they really are quite important to machine learning research, and that's why we care about them.
But you should also be convinced
that modelling the probabilities
of these sequences is quite
hard.
We covered different approaches, from the very basic and original ones like N-grams to more complicated deep learning ones like RNNs and LSTMs, and moved on to some examples like dilated convolutions for audio, as well as transformers at the end. And hopefully I have convinced you that these models are not only good models, but they're also fairly flexible: because they deal with sequences, they can actually be applied to a whole range of tasks, not just natural language generation or audio, but a wide range of different tasks in machine learning. And with that, I'd like to thank you.
And ask if there are any questions?
Yes
*STUDENT* So, thank you for the talk. It seems that the evolution of the algorithm might be able to capture this idea of context, at least within a paragraph, where whatever word comes next seems to depend on a lot of things. But if, let's say, a machine wanted to write a book: a book is not just a paragraph or a sentence, there's usually an overarching sort of theme that is encapsulated, which could have many, many contexts, but they sort of string together into one piece spread across multiple chapters. Do you think, at the moment, any of these models have, or anyone is looking at, any capacity within these models to encapsulate a grander narrative, like a theme, [inaudible] or is this just good for local optima?
*LECTURER* Yeah, yeah. So the question, just to repeat it because it was quite quiet, was whether these types of models that we've talked about can only focus on local consistency, versus, if you write a novel, you want to have global consistency and an actual storyline and so on. I think you would probably need a combination of a few things. So this has obviously shown that it does give you local consistency, but you could imagine mixing this, for example, off the top of my head, with some memory module that is storing something more about it. So you could at least be consistent about more than just a few paragraphs: you could learn to save some crucial elements and then read from that, condition on that, every time you're producing, right? So maybe it would learn that there are some characters and so on. You would have to learn how to make it maybe more hierarchical.
You can see that with the first sentences you basically had a memory of two, whereas now it really is writing paragraphs, and I think that is because you're abstracting away more and more; you just need to learn how to do that. So there's nothing saying that you cannot learn how to scale this up into higher levels of abstraction. We've not done it yet, as in it's maybe not clear how, but it could even be the same principle, just beefing it up to a certain degree. But does that make sense?
*STUDENT* [Inaudible]
*LECTURER* Yeah. See, this is a problem that I think has more to do with the fact that we're using deep learning, and deep learning is very hard to reconcile with this whole symbolic learning, and this is actually something that, research-wise, I've been quite interested in. Because with deep learning, all you can hope is that it learns a pattern from the data, and if it doesn't, it's not going to abstract it. It's not going to reason at the higher level of abstraction that you'd like it to, like humans do, right? We cannot make any claims as to what level it is reasoning at, and clearly, like you said, from this example it doesn't seem to learn concepts the way we do and then apply them. And I think in that case you would have to go through other methods if you wanted it to actually understand concepts: to learn concepts first, and then maybe use these concepts to generate sequentially. But for sure, at the moment, it doesn't. *STUDENT* [Inaudible]
*LECTURER* Yeah, and also because, in my head, the representations you obtain in deep learning are very under-constrained, right: you're letting the model choose whatever it wants to do. So it's very unlikely that they're going to, by chance, end up being super symbolic and the best way of actually summarising whatever you're getting. So yeah, you could argue that symbols are a very optimal way of compressing information, but usually neural networks get away with compressing in a different way, which happens to be completely arbitrary, and then they generalise badly. Yeah.
*STUDENT* I'm not sure if this is a stupid question, but how do you decide the size of the hidden state vector? Just, for example, if you're trying to model a sequence like the Fibonacci sequence, then technically you'd only need a hidden state vector of size one, because theoretically that would just represent the previous input. So how do you actually decide?
*LECTURER* No, that's not a stupid question. That's the golden question of machine learning. It's one of the hyperparameters. So when I was showing you the different image captions and I was saying some are better than others, that is when we're choosing different hyperparameters, and that's not only the learning rate; it's also the size of your representations, how you wire your network up, how many iterations of this and that, whether you have some additional loss terms, and whatnot. And the size of your hidden state is one of those variables as well.
And it's not trivial, because it's also been shown, for certain problems, not this particular one, that sometimes you can show you only need a certain number of bits to encode something in a neural network. You would hope that if you make the bottleneck small enough it should still learn it, and it actually doesn't, whereas if you give it a larger one and then reduce it, that works. So sometimes it seems like during the training procedure it needs more capacity: even if the problem at the end really doesn't need that much information capacity, during the training process it often does. So it's very hard to tell: even if you know how many bits you need to encode something in the end, that doesn't necessarily give you any information. Usually you say bigger is better, and try to give it as much capacity as you can, but then obviously it also trains much slower.
*STUDENT* Is there a way to interpret these hidden states, to see what techniques the model is using to actually model it, in a way that's meaningful to us?
*LECTURER* Yeah. So, long story short, this goes a bit back to the question we had earlier about symbols. I think because there are not a lot of constraints on these hidden states, often they're very entangled representations that are not very human-interpretable in general. There are research groups that are focusing on what information we can get out of them, looking at unsupervised methods of clustering, seeing what the correlations are, and whatnot, to understand, like you say, how they are working. But it feels that very often these findings are quite specific to the one model you trained, and maybe if you trained a different model it would work completely differently, even if it was the same model with just a different initialisation. So I think in trying to get this intuitive understanding we're still quite early on; it would be great, but I don't think we have it yet with this set of networks. Yeah.
*STUDENT* So if we just use an LSTM or a transformer to make a prediction, can we say the scale of computation for the LSTM is linear in the input length, but for the transformer it's proportional to the input length squared?
*LECTURER* Yeah, that's fair. It has to be, because you're comparing all to all, basically, and that's another of the disadvantages, and it's also why it works so well. Because you are comparing the words, but you're only really comparing them at the input level; it's not like you're comparing the whole corpus of the English language, you're just taking your sentence, which has a quite finite number of words, versus whatever you're producing, if that makes sense. So it scales like you say; yep, definitely comparing all to all.
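Just to spell that out with a toy example (illustrative only): the attention scores form an n-by-n matrix, one entry for every pair of positions, so compute and memory grow with the square of the sequence length, whereas an LSTM does a fixed amount of work per step.

    import numpy as np

    n, d = 1000, 64
    x = np.random.randn(n, d)
    scores = x @ x.T        # (n, n): every position compared with every other position
    print(scores.shape)     # (1000, 1000) -> attention cost grows as n**2
    # An LSTM instead processes the n steps one at a time, so its cost grows as n.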
*STUDENT* As far as I'm aware there has been varied success in applying LSTMs to time series data, so I was wondering if there is similar success in applying transformers to time series data as well.
*LECTURER* There should be. I think transformers by now have been applied to pretty much everything. I don't know of particular examples, but I'm sure if you look it up you'll find them; it makes sense, because if there are any patterns in the time series, they very quickly pick up what to attend to. So if you have any regularities, then they would learn these weights over where to attend to in the previous sequence. So yeah, there have to be. We've tried a couple of things, but nothing that was super relevant.
Good.
Thanks guys.
