In the next four videos, we're going to
take a look at deep learning, and how it
can help us with natural language
processing. We will study some of the
general features of some deep learning
architectures, and in the weekly exercises,
you will look at the details of their
implementation, and you will get to run
them yourselves. The first thing we will do is talk about how we can process sequences of elements, sequences of inputs, for example, sequences of words. But before we do that, let's take a very quick look at the term deep learning.
It sounds very modern, very cool, and the truth is, it is cool, but the concept is not new. It is a rebranding of an idea that has been around since the 1940s. If you
remember from week two of our class (was
this week one, or two?) we looked at the
history of artificial intelligence and
natural language processing, and we saw
that some of it came from the 1940s and 50s, when people invented perceptrons and large databases. And
people were very hopeful about
artificial intelligence, and then the
early neural networks didn't behave as
well as people expected. The first perceptrons had a lot of trouble learning, and so we faced an artificial intelligence winter during the 1970s. People couldn't do research in neural networks if they wanted funding. In the 1980s, the widespread implementation of algorithms like backpropagation helped in the training of neural networks, and there were renewed hopes for the field. However, there wasn't enough
data to train those networks and the
hardware was just not at a point where
it could handle very large neural
networks. So we came to a second artificial intelligence winter. In the 21st century, we have seen a lot of growth in the use of neural networks, what we now call deep learning, because we have better computers that can process networks of literally millions of weights, and because we have more data. We have larger databases now, which help with the learning process, and in general people have implemented some amazing algorithms, for example attention, which we're going to study later. These can really
help in the learning process, and that's
why deep learning is such a popular set
of tools these days, because it's going
through a spring. In general, what we mean by deep learning is neural networks that are very richly connected, that have hidden layers, and that have rich connections between layers, between neurons, and between elements, so rich that they're beginning to resemble, at least in size, elements of the natural world. So for example, this is a chart of
how many connections each neuron has for things like mice and cats. You can see that for mice, for example, each neuron has about a thousand connections to other neurons, and this is the same level as an algorithm called the distributed autoencoder from 2012, where each neuron had a thousand connections to other elements. So the current
state of hardware, parallelism in processing, and our larger databases have allowed us to train larger and larger models. So large, in fact, that the state of the art is models that are very large. For example, this is from the
last two years. The state-of-the-art systems for deep learning and natural language processing are called Transformers. It's a family of algorithms; for example, RoBERTa and DistilBERT on the right are kinds of transformers, a special kind of transformer called a BERT. RoBERTa has 355 million parameters that need to be adjusted, so 355 million connections between elements. Imagine training 355 million of those; in an example from the first video this week, we had just two connections. The large BERT from Google AI at the end of 2018 had 340 million connections, and I produced the chart
because, for example, in January of this year, Microsoft released one called T-NLG, where the T is for Turing, Natural Language Generation, and it has 17 billion connections. You couldn't even run this on a regular computer; you would need a large number of GPUs and a lot of parallelism just to load these models.
Why has this complexity gone rampant? Why
do we need so much complexity? It's so we
can deal with that old nemesis of ours
that has been popping up since week two,
long-distance dependencies. If you have a paragraph like "I grew up in France and it was just an amazing childhood. We played on the street and I hung out with all the other kids, so I ended up speaking fluent _____," you know what's coming up. This word here, fluent French, depends on something that we saw 29 words ago. And throughout the class, we've been talking about how difficult this is to model. With finite state machines, you cannot look back 29 states to figure out that French is connected with France. To do it with n-grams, you would need a 29-gram. To do it with word2vec, you would need an immense window to capture something that was 29 words away. So
the algorithms we've studied so far have a very hard time modeling long-distance dependencies, and the complexity of neural networks will allow them to model very nonlinear functions, including these kinds of dependencies.
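To make that window limitation concrete, here is a tiny illustrative snippet (the tokenization and window size are my own choices, purely for illustration):

```python
# The paragraph from the lecture, flattened into tokens.
paragraph = ("I grew up in France and it was just an amazing childhood . "
             "We played on the street and I hung out with all the other "
             "kids , so I ended up speaking fluent").split()

n = 5  # even a generous n-gram model sees only the previous n-1 tokens
context = paragraph[-(n - 1):]
print(context)                  # ['ended', 'up', 'speaking', 'fluent']
print("France" in context)      # False: the crucial clue is outside the window
print(len(paragraph) - 1 - paragraph.index("France"))  # 29 tokens back
```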
And we're going to study a very important example of how we can do this. For
example, if you had a neural network
where the input was a word, and the
output was the next word, so a predictor
of what comes next. So let's say you
give it the word I, it gives you the word
grew, and then it takes the word grew
and it predicts the word up, and so on.
I grew up in France. If we have a neural
network that just takes one word as
input, and gives you one word as output,
for example, mapping grew to up, there's no way for it to know anything about the surrounding words. Grew has some characteristics, some adjacent words that surrounded it in the training input, but the network can only look at its local input when it gets the sentence, so it's going to generate some prediction for grew that is independent of the fact that it was preceded by I and followed by up. So why can't we have
some sort of richer connection between runs of the network? For example, we run the neural network and produce our output, but we also produce a second type of output, which is some information about the word grew that's going to go on to the next running of the neural network. So, for example, I predicts grew, and grew gives us up, but then we have an element that somehow remembers that the word it saw was I. So when it gets the second input, which is grew, it has grew and I as inputs: it can see the local word it's processing and some information from the previous words it has processed. If a neural network did this, it would constitute a kind of memory of what happened before. This concept is called a recurrent neural network.
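As a minimal sketch of this idea, assuming a simple Elman-style network (the function and weight names here are my own, not notation from the course exercises):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One execution of a simple recurrent network.

    x_t    : the current input vector (e.g. a word representation)
    h_prev : the memory vector carried over from the previous execution
    """
    # Mix the current input with the memory of the previous inputs.
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    # The prediction, e.g. scores for what the next word might be.
    y_t = W_hy @ h_t + b_y
    # Return both outputs: the prediction and the updated memory.
    return y_t, h_t
```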
This is what one of them looks like in its folded representation. A recurrent
neural network takes a sequence of
inputs. So, for example, it always has the input layer, the red element X that we have there; it has some hidden layers in the green box; and then it has an output layer, the blue element Y. So it gets the input, it produces some output, but it also produces a second output, which is some vector H. This is going to
contain information about the input X, so
you take X, and produce the output Y, and
the vector H, which contains some
information about X. And so in the next running of the network, when you execute it for the next element of the sequence, it will take the second input word and H, so it will have a memory of some information about what came before.
And so for every execution of this
neural network, you produce two outputs,
the prediction Y based on the input X,
and some output vector H, which has
information about the Xs that came
before. This is what a recurrent neural
network looks like, unfolded. By the way, I
want to be very clear about this: the
green box there is always the same
network. This is just one network, which you can see on the right; it's just that we're going to unfold it across time. So, for example, if the first word you get is the word grew as your input X, then the hidden layers in the green box are going to produce two outputs: the prediction that the word up is the following word, and an element Ht, which has information about grew, about the input X, at time step 0. Let's go to the next time step,
time step 1. This will have two inputs: the input up, which we just produced, and the element Ht, which has a memory of the word grew. So in practice it takes grew up, and from these two it produces two things: the prediction for Y at time step 1, and another Ht, which is a vector that contains some memory that the previous elements were up and grew. And
with this Ht, we go on to the next time step, T2. The input is the word in, together with the vector that contains a memory of grew up. It takes these two elements and produces two outputs: the prediction France for time step T2, and a new vector H, which has some memory of the fact that the preceding words were in, up, and grew. So this is what it looks like, again, in its unfolded state, where the time steps are represented separately, even though it's always the same network; the layers in the green elements are always the same.
So for every input X, we have two outputs,
the output Y which is our prediction
that the network generates, and then the
H element which is the information that
is going to be carried across to the
next execution of the neural network. So
it works something like this. This is a recurrent neural network: for every input X at time step t, it takes in the vector H, based on the information it saw before, and the input Xt, runs those through a neural network, through a feed-forward network, and produces an output Yt, which is the prediction of, for example, the word that follows this one. It also produces a vector that now incorporates information about everything it saw before.
That's what a
recurrent neural network is trying to do.
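A rough sketch of that unrolling, reusing the hypothetical rnn_step from the earlier snippet, with toy dimensions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 8, 16, 8   # toy sizes, chosen arbitrarily
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

# Stand-ins for the word vectors of "I grew up in France".
sequence = [rng.normal(size=d_in) for _ in range(5)]

h = np.zeros(d_h)             # empty memory at time step 0
for x_t in sequence:
    # The same weights are applied at every time step; only h changes.
    y_t, h = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    # y_t is the prediction at this step; h now summarizes what came before.
```

Note that the loop applies one set of weights at every time step; the unfolded diagram is just this loop drawn out with its iterations side by side.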
Another example would be to take spectrographic information and try to predict the consonant or vowel associated with it. So x1 is some chunk of a spectrogram that contains the information; you take that, run it through the network, and produce two outputs: y1, the prediction of what that sound is, and h1, which is some information about the spectrogram x1. This enters the second time step, which takes the spectrogram x2 together with that information about x1. These two go together, and you get the prediction y2 and a vector h2, which contains some information about the inputs x2 and x1. Then h2 enters the third time step with x3, which presumably contains the sound t. So the spectrogram for t and the vector h2 enter the third time step. You process them through a feed-forward neural network, and you get two outputs: the prediction y3 and h3, a vector containing information about x3, x2, and x1. And so you might
have noticed an interesting problem about this: the signal diminishes over time. The vector h3 will probably preserve a lot of information about the input x3, but only a little about x2, probably even less about x1, and as you move further back, the signal will diminish. We call this the vanishing gradient problem: the further back in time you go, the weaker the first signals become. So as you move through time, these first signals are going to become weaker and weaker, as they're overwritten by the H memory of the following elements of the sequence. So the vanishing gradient is the problem that you're forgetting the information you saw first as it's overwritten by successive information.
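Here is a deliberately oversimplified, scalar illustration of that decay (the numbers are arbitrary; a real analysis would look at products of Jacobians across time steps):

```python
# The influence of an early input is multiplied, step after step, by a
# recurrent weight and the derivative of the squashing function, both
# typically below 1 in magnitude.
w_hh = 0.9          # a recurrent weight
deriv = 0.8         # stand-in for a typical tanh derivative
signal = 1.0
for t in range(30): # 30 time steps, like our 29-word dependency
    signal *= w_hh * deriv
print(signal)       # ~5e-5: the early signal has all but vanished
```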
There have been many attempts to solve this problem. I'm going to show you very briefly one solution that was very popular for a long time, and a long time here means roughly from 2010 to 2015. As you can see, it's
called a long short-term memory. Let me
show this one to you very briefly, and
then in following videos we will look at
the state of the art which is called
attention. So in recurrent neural networks, we had some vector that kept a memory of the things that came before.
However, this memory decreased over time, so maybe we could explicitly tell the network what to remember and what to forget. Long short-term memory networks have two inputs from preceding states: one is the H, which is just the information from previous words, and the other is something called the cell state, which is also information from previous words, but specifically tempered by two functions here. As you can see, we have a
fairly complex cell; it's a set of four neural networks. So here we have the input at time step t; we run it through a neural network with a sigmoid activation function, and this is going to tell us how much of the current input we should remember, so it's going to help us record information into the cell state.
So, for example, if it has the word it as the input, this network should tell us that the word it shouldn't be something you remember that much, whereas if it has the word chlorophyll, it should probably tell you that chlorophyll is something you should remember very much. It has a second
neural network here, with a tangent activation function, that tells you how much of previous information you should forget. So, for example, if two paragraphs ago you had the word it, but here we have a new it, you should remember the new it, the one you just heard, and try to forget the one you heard long ago, because it might no longer be relevant. So it has a second neural network to tell you how much you should forget from previous states. So
you do get input from other iterations
of the network, but you also get a
channel that is dedicated to remembering
and forgetting. And the whole thing runs in a way similar to recurrent networks: you take an input at a time step, generate an output, which can be, for example, a prediction of the next word, and from there you transfer an H vector for memory, but also a cell state vector that gives you more details on what you should remember and what you should forget, and you keep going with your inputs and your outputs.
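Here is a minimal sketch of one LSTM step, following the standard textbook gate equations; the dict-based weight packaging is my own, and the lecture's diagram may group the four networks slightly differently:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    h_prev is the ordinary hidden output; c_prev is the cell state,
    the dedicated remember/forget channel. W and b hold one weight
    matrix and bias per gate (the names are my own).
    """
    z = np.concatenate([x_t, h_prev])  # the gates see input + memory
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: what to erase from the cell state
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: how much of the new input to record
    g = np.tanh(W["g"] @ z + b["g"])   # candidate values to write into the cell state
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: what to expose as h_t
    c_t = f * c_prev + i * g           # explicitly forget, then explicitly remember
    h_t = o * np.tanh(c_t)             # the new hidden output
    return h_t, c_t
```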
So what do we have so far? Deep learning studies neural networks with richer connections between neurons, not just many connections, but richer ones, as we have seen here, where you don't just make a prediction between layers but also feed information into a future execution of the network.
This is what recurrent neural networks
do. They have connections to elements
they have seen before, so they can use
this information when they make
predictions. Unfortunately, the information from previous iterations can vanish. This is the vanishing gradient: the signal fades as it recedes further and further back in time. There are many solutions to this. One is long short-term memory neural networks, which have a dedicated channel so you can select what to remember and what to forget as you go. The state of the art right
now uses an algorithm called attention
which we'll study next.
