All right. So, we have talked about
regular neural networks,
fully connected neural networks,
we have talked about
convolutional neural networks
that work with images,
we have talked about Reinforcement,
Deep Reinforcement Learning,
where we plug in a neural network
into a Reinforcement
Learning Algorithm,
when a system has to not only
perceive the world but also act in it,
and collect a reward.
And today we will talk about,
perhaps the least understood
but the most exciting neural network out there,
flavor of neural networks,
is Recurrent Neural Networks.
But first, for administrative stuff,
there’s a website.
I don’t know if you heard,
cars.mit.edu,
where you should create an account,
if you’re a registered student,
that’s one of the requirements.
You need to have an account
if you want to get credit for this,
you need to submit code
for DeepTrafficJS,
and DeepTeslaJS,
and for DeepTraffic,
you have to have a neural network
that drives faster than 65mph.
If you need help to achieve that speed
please e-mail us.
We can give you some hints.
For those of you who are
old school SNL fans,
there’s the Deep Thoughts
section now,
in the profile page,
where we encourage you to talk about
the kinds of things that
you tried in DeepTraffic
or any of the other
DeepTesla or
any of the work you've done
as part of this class
for DeepLearning.
Okay,
we have talked about
the Vanilla Neural Networks
on the left.
The Vanilla Neural Network
is the one where it's computing
is approximating a function that maps
from one input
to one output.
An example is mapping images
to the number that is shown
in the image.
For ImageNet
is mapping an image
to what's the object in the image.
It can be anything.
In fact,
Convolutional Neural Networks
can operate on audio,
you can give it a chunk of audio,
a five second audio clip,
that still counts as one input
because it’s fixed-size.
As long as the size of the input is fixed,
that's one chunk of input
and as long as you have ground truth
that maps that chunk of input
to some output
ground truth,
that’s the Vanilla Neural Network.
Whether there's a fully connected
neural network
or convolutional neural network.
Today we’ll talk about the amazing,
the mysterious Recurrent
Neural Networks.
They compute functions
from one to many,
from many to one,
from many to many.
Also bidirectional.
What does that mean?
They take as input sequences,
time series,
audio,
video,
whenever there's a sequence of data,
and the temporal dynamics
that connects the data
is more important than the spatial
content of each individual frame.
So, whenever there's
a lot of information
being conveyed in a sequence,
in a temporal change
of whatever that type of data is,
that's when you want to use
Recurrent Neural Networks
like speech,
natural language,
audio
and the power of this
is that for many of them,
for a Recurrent Neural Network,
where they really shine,
is when the size of the input
is variable,
so you don’t have a fixed chunk of data
that you're putting in
it's variable input.
And the same goes
for the output,
so you can give it
a sequence of speech,
several seconds of speech
and then the output is
a single label of whether
the speaker is male or female.
That’s many to one.
You can also do
many to many.
Translation.
You can have natural language
put into the network
in Spanish
and the output is in English.
Machine translation.
That's many to many.
And that many to many
doesn't have to be
mapped directly
into same sized sequences.
For video, the sequence size
might be the same
you're labeling every single frame,
you put in
a five second clip
of somebody playing basketball
and you can label
every single frame
counting the number of people
in every single frame.
That's many to many
when the size of the input and
the size of the output is the same
Yes, question?
The question was,
are there any models where there's
feedback from output to input?
That's exactly what
Recurrent Neural Networks are.
It produces output,
and it copies that output
and loops it back in.
That's almost the definition of
a Recurrent Neural Network.
There's a loop in there
that produces the output
and also takes that output
as input once again.
There's also many to many
where the sequences don't align.
Like machine translation,
the size of the output sequence
might be  totally different
than the input sequence.
We will look at a lot
of cool applications;
you can start a song,
learn the audio of
a particular song
have the Recurrent Neural Network
to continue that song after
a certain period of time.
So it can learn to generate sequences
of audio, of natural language, of video.
Okay.
I know I promised not many equations,
but this is
so beautifully simple
that we have to cover
backpropagation.
It's also the thing
that, if you're a little bit lazy
and you go to the internet
and start using
the basic tutorials of TensorFlow,
you ignore how backpropagation works,
at your peril.
You kind of assume it just works.
I give it some inputs, some outputs,
and it's like Lego pieces
I can assemble them
like you might have done
with DeepTraffic
A bunch of layers put in together
and then just press Train.
Backpropagation is the mechanism
that neural networks currently--
The best mechanism we know of
that is used for training.
So you need to understand
the simple power of backpropagation,
but also the dangers.
Summary,
I put on the top of the slide,
there's an input
for the network that's an image,
there's a bunch of neurons,
all with differentiable
smooth activation functions
on each neuron,
and then,
as you pass through those
activation functions,
take in an input, pass it through
this net of differentiable
compute nodes,
you produce an output.
In that output
you also have a ground truth,
the correct, the truth
that you hope
or you expect the network to produce.
And you can look at
the differences between
what the network actually produced
and what you hoped it would produce,
and that's an error.
And then you backward
propagate that error,
punishing or rewarding
the parameters of the network that resulted in that output
Let's start with a really
simple example.
There's a function
that takes as input
up on top,
three variables, X, Y and Z.
The function does two things:
it adds X and Y
and then it multiplies
that sum by Z.
And then we can formulate
that as a circuit,
circuit of gates,
where there's a Plus gate,
and a Multiplication gate.
Let's take some inputs,
shown in blue.
Let's say X is negative two,
Y is five and Z
is negative four.
And let's do a forward pass
through the circuit
to produce the output.
Negative two plus five
equals three
q is that intermediate value,
three.
This is so simple,
and so important to understand
that I just want to take my time for this
because everything else about neural
networks just builds on these concepts
The add gate produces q,
in this case, is three,
and three times negative
four is negative twelve.
That's the output.
The output of the circuit
of this network,
if you think of it as such,
is negative twelve.
The forward pass is shown in blue
the backward pass
will be shown in red
in a second here
What we want to do,
what would make us happy,
what would make f happy
is for the output to be
as high as possible.
Negative twelve,
so-so, it could be better.
How do we teach it?
How do we adjust X, Y and Z
to ensure it produces a higher f,
makes f happier?
Let's start backward,
The backward pass.
We'll make the gradient
on the output one,
meaning we want this to increase.
We want f to increase.
That's how we encode our happiness.
We want it to go up by one.
In order to then propagate
that fact that we want
the f to go up by one,
we have to look at
the gradient on each one of the gates.
And what's a gradient?
It's a
partial derivative
with respect to its inputs.
The partial derivative of
the output of the gate
with respect to its inputs,
if you don't know what that means,
is just
how much does the output change
when I change the inputs a little bit.
What is the slope of that change
if I increase X
for the first function of addition,
f of X, Y equals X plus Y.
If I increase X by a little bit,
what happens to f?
If I increase Y by a little bit,
what happens to f?
Taking a partial derivative of those
with respect to X and Y
you just get a slope of one
When you increase X,
f increases linearly.
Same with Y.
Multiplication is a little trickier.
When you increase X,
f increases by Y.
So the partial derivative of f
with respect to X is Y,
the partial derivative of f
with respect to Y is X.
If you think about it,
what happens is
the gradients, when you change X,
the gradient of change
doesn't care about X.
It cares about Y.
It's flipped.
So we can backpropagate that one,
the indication of what
makes f happy, backward.
And that's done by
computing the local gradient.
For q,
the partial derivative
of f with respect to q,
that intermediate value,
that gradient would be negative four.
It will take the value of Z
as I said it's the Multiplication gate,
It'll take the value of Z
and assign it to the gradient.
And the same for
the partial derivative of f
with respect to Z,
it will assign that to q.
The value of the forward pass on the q.
There's a three
and a negative four
on the forward pass in blue
and that's flipped.
Negative four and three
on the backward pass.
That's the gradient.
And then we continue in
the same exact process.
But wait.
What makes all of this work,
is the Chain Rule.
It's magical.
What it allows us to do
is to compute the gradient,
the gradient of f with respect to
the inputs X, Y, Z.
We don't need to construct
the giant function that is
the partial derivative of f
with respect to X, Y and Z
analytically.
We can do it step by step
backpropagating the gradients.
We can multiply
the gradients together
as opposed to doing
the partial derivative
of f with respect to X.
We have just the intermediate,
the local gradient
of f with respect to q,
and of q with respect to X,
and multiply them together.
So, instead of computing
the gradient of that
giant function, X plus Y times Z,
which in this case is not that giant,
but it gets pretty giant
with neural networks,
we just go step by step.
Look at the first function,
simple addition, q equals X plus Y,
and the second function,
multiplication,
f equals q times Z.
The gradient on X and Y,
the partial derivative
of f with respect to X and Y
is computed by multiplying
the gradient on the output,
negative four,
times the gradient on the inputs,
which as we talked about,
when the operation is addition,
that's just one.
It's negative four times one.
What does that mean?
Let's interpret those numbers.
You now have gradients on X, Y and Z
the partial derivatives of f
with respect to X, Y, Z.
That means,
for X and Y is negative four,
for Z is three.
That means, in order to
make f happy,
we have to decrease
the inputs that have
a negative gradient
and increase the inputs that
have a positive gradient.
The negative ones are X and Y,
the positive is Z.
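Here is a minimal Python sketch of that circuit, using the input values from the slide. The `step` size at the end is a hypothetical choice added only to show that nudging each input along its gradient makes f go up.

```python
# Forward and backward pass for f = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass through the two gates.
q = x + y          # add gate: q = 3.0
f = q * z          # multiply gate: f = -12.0

# Backward pass: start with a gradient of one on the output.
df = 1.0
dq = z * df        # multiply gate: gradient on q is the value of z
dz = q * df        # multiply gate: gradient on z is the value of q
dx = 1.0 * dq      # add gate: local gradient is one, chain rule multiplies
dy = 1.0 * dq

# Nudge every input a little in the direction of its gradient.
step = 0.01        # hypothetical step size, not from the lecture
f_new = ((x + step * dx) + (y + step * dy)) * (z + step * dz)
print(q, f, dx, dy, dz, f_new)
```

X and Y have negative gradients, so they get smaller; Z has a positive gradient, so it gets larger; and the new f is higher than negative twelve.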
Hopefully, I don't say
the word “Beautiful” too many
times in this presentation
this is very simple.
Beautifully simple.
Because this gradient
is a local worker,
it propagates for you;
it has no knowledge of the broader
happiness of f.
It computes the gradient between
the output and the input.
And it can propagate this gradient
based on,
in this case f,
a gradient of one but also the error.
Instead of one we can have on
the output the error
as the measure of happiness.
And then we can propagate
that error backwards.
These gates are important
because we can break down
almost every operation
we can think of
that we work with in neural networks
into one or several gates like these.
The most popular are three,
which is addition, multiplication
and the Max operation.
For addition,
the process is
you take a forward pass
through the network,
so we have a value on every single gate,
and then you take the backward pass.
And through the backward pass
you compute those gradients.
For an add gate,
you equally distribute the gradients
on the output to the input,
when the gradient on the output
is negative four,
you distribute negative four to each input.
And you ignore the forward pass value.
That three is ignored
when you backpropagate it.
On the Multiply gate,
it's trickier.
You switch the forward pass values,
if you look at f, that's a
Multiply gate,
the forward pass values are switched
and multiplied by the value of
the gradient in the output.
If it's confusing, go through
the slides slowly.
It'll make a lot more sense.
Hopefully.
One more gate. There's the Max gate,
which takes the inputs
and produces as output
the value that is larger.
When computing the gradient
of the Max gate,
it distributes the gradient
similarly to the Add gate, but to only one,
to only one of the inputs;
the largest one.
So, unlike the Add gate, it
pays attention to
the input values on
the forward pass.
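The three backward rules can be sketched as small Python functions. The function names are made up for illustration, and sending a tie in the Max gate to the first input is an arbitrary assumption.

```python
def add_backward(a, b, grad_out):
    # Add gate: pass the output gradient to both inputs,
    # ignoring the forward-pass values a and b.
    return grad_out, grad_out


def mul_backward(a, b, grad_out):
    # Multiply gate: swap the forward-pass values and
    # scale them by the gradient on the output.
    return b * grad_out, a * grad_out


def max_backward(a, b, grad_out):
    # Max gate: route the whole gradient to the larger input
    # (ties go to the first input, an arbitrary choice).
    if a >= b:
        return grad_out, 0.0
    return 0.0, grad_out


print(add_backward(-2.0, 5.0, -4.0))
print(mul_backward(3.0, -4.0, 1.0))
print(max_backward(2.0, 7.0, 5.0))
```

The first two calls reproduce the numbers from the circuit example: negative four on both inputs of the Add gate, and negative four and three on the inputs of the Multiply gate.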
All right.
Lots of numbers but
the whole point here is,
it's really simple;
a neural network is just
a simple collection of these gates.
You take a forward pass,
you calculate some kind of function
in the end, the gradient in the very end,
and you propagate that back.
Usually, for neural networks,
that's an Error function.
A Loss function, Objective function,
a Cost function.
All the same word.
That's the Sigmoid function there
When you have three weights
W zero, W one, W two
and X, two inputs, X0, X1,
that's going to be
the Sigmoid function.
That's how you compute the output
of the neuron.
But then you can decompose
that neuron
you can separate it all into
just a set of gates like this
Addition, multiplication,
there's an exponential in there
and division
but all very similar.
And you repeat the exact same process.
there's five inputs,
there's three weights
and two inputs.
X zero, X one.
You take a forward pass
through this circuit,
in this case again,
you want it to increase so that
the gradient of the output is one
and you backpropagate that gradient
of one, to the inputs.
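A rough Python sketch of that neuron follows. The weight and input values are hypothetical, since the transcript does not give the numbers on the slide, and the exponential and division gates are collapsed into the known Sigmoid derivative, a times one minus a, instead of being walked through one gate at a time.

```python
import math


def neuron_forward_backward(w, x):
    """Sigmoid neuron with weights (w0, w1, w2) and inputs (x0, x1)."""
    w0, w1, w2 = w
    x0, x1 = x

    # Forward pass: two multiplies, two adds, then the Sigmoid.
    s = w0 * x0 + w1 * x1 + w2
    a = 1.0 / (1.0 + math.exp(-s))

    # Backward pass, starting from a gradient of one on the output.
    ds = a * (1.0 - a) * 1.0            # Sigmoid local gradient
    dw = (x0 * ds, x1 * ds, 1.0 * ds)   # multiply gates swap values
    dx = (w0 * ds, w1 * ds)
    return a, dw, dx


# Hypothetical values for illustration only.
a, dw, dx = neuron_forward_backward((2.0, -3.0, -3.0), (-1.0, -2.0))
print(a, dw, dx)
```

All five numbers going into the circuit get a gradient, even though in training only the weights would be adjusted.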
Now in neural networks,
there's a bunch of parameters
that you're trying, through
this process, to modify.
And you don't get to modify the inputs
You get to modify the weights
along the way,
and the biases.
The inputs are fixed,
the outputs are fixed,
the outputs that you hope
the network will produce.
What you're modifying is the weights.
So I get to try to adjust those weights
in the direction of the gradient.
That's the task of backpropagation. 
The main way that
neural networks learn.
As we update the weights
and the biases
to decrease the loss function.
The lower the loss function the better.
In this case, you have
three inputs on the top left.
A simple network, three inputs.
Three weights on each of the inputs.
There's a bias on the node,
b and produces an output
a, and that little symbol is indicating
a Sigmoid function.
And the loss
is computed as Y minus
a, squared,
divided by two,
where Y is the ground truth,
the output that you want
the network to produce.
And that loss function
is backpropagating
in exactly the same way that
we described before.
The subtasks
involved in this update of
weights and biases
is that the forward pass computes
the network output at every neuron,
and finally, the output layer,
computes the error,
the difference between a and Y,
and then
backward propagates
the gradients.
Instead of one on the output,
it will be the error on the output
and you backpropagate it.
And then, once you know the gradient,
you adjust the weights
and the biases
in the direction of the gradient.
Actually, the opposite of the
direction of the gradient,
because you want the loss to decrease.
And the amount by which
you make that adjustment
is called the Learning Rate.
The learning rate can be
the same across the entire network
or can be individual
for every weight.
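Those steps can be sketched in Python for the single Sigmoid neuron with three inputs, a bias b, output a, and the loss of Y minus a, squared, divided by two. The starting weights, the input, the target and the learning rate below are all hypothetical values picked for illustration.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_step(w, b, x, y, learning_rate):
    # Forward pass: compute the neuron output and the loss.
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(s)
    loss = 0.5 * (y - a) ** 2

    # Backward pass: the error (a - y) replaces the gradient of one.
    ds = (a - y) * a * (1.0 - a)
    dw = [ds * xi for xi in x]
    db = ds

    # Update: move opposite to the gradient, scaled by the learning rate.
    new_w = [wi - learning_rate * dwi for wi, dwi in zip(w, dw)]
    new_b = b - learning_rate * db
    return new_w, new_b, loss


# Hypothetical starting point and training example.
w, b = [0.1, -0.2, 0.3], 0.0
x, y = [1.0, 0.5, -1.0], 1.0
losses = []
for _ in range(100):
    w, b, loss = train_step(w, b, x, y, learning_rate=0.5)
    losses.append(loss)
print(losses[0], losses[-1])
```

Here one learning rate is shared by every parameter; a per-weight learning rate would replace the single `learning_rate` with one value per entry of `w`.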
And the process
of adjusting the weights and biases
is just optimization.
Learning is an Optimization problem.
You have an objective function,
and you're trying to minimize it.
And your variables are the parameters,
the weights and biases.
Neural networks just happen to have
tens, hundreds of thousands, millions
of those parameters.
So the function that you're trying
to minimize is highly non-linear.
But it boils down to
something like this, you have
two weights, two plots--
or actually one weight
and as you adjust it, the cost
you adjust in such a way that
minimizes the output cost.
And there's a bunch of
optimization methods for doing this.
This is a convex function.
You can find the local minimum.
If you know about these
kinds of terminologies,
the local minimum is the same
as the global minimum,
it's not a weirdly hilly terrain
where you can get stuck in.
Your goal is to get to
the bottom of this thing
and if it's really complex terrain,
it will be hard to get
to the bottom of it.
This general approach
is gradient descent,
and there's a lot of different ways to
do a gradient descent.
Various ways of adding
randomness into the process,
so you don't get stuck into the weird
crevices of the terrain.
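For the simple one-weight convex picture, plain gradient descent can be sketched in a few lines of Python. The cost function here, w minus three, squared, is a made-up stand-in for the curve on the slide, and the learning rate and number of steps are likewise assumptions.

```python
def cost(w):
    # Hypothetical convex cost with its single minimum at w = 3.
    return (w - 3.0) ** 2


def cost_grad(w):
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    for _ in range(steps):
        # Step against the slope to walk downhill.
        w -= learning_rate * cost_grad(w)
    return w


w_final = gradient_descent(0.0)
print(w_final, cost(w_final))
```

Because the cost is convex, there is only one bottom to reach; on a hilly, non-convex cost the same loop can stop in whichever valley it happens to start near.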
But it's messy.
You have to be really careful.
This is the part you have
to be aware of,
when you design a network
for DeepTraffic
and nothing is happening
this might be what's happening:
vanishing gradients
or exploding gradients.
When the partial derivatives
are small, so you take
the Sigmoid function,
the most popular
for a while, activation function,
the derivative is zero at the tails.
When the input to
the Sigmoid functions is
really high or really low,
that derivative is going to be zero.
The gradient tells me how much
I want to adjust the weights.
The gradient might be zero,
and so you backpropagate that zero,
a very low number,
and it gets less and less
as you backpropagate
and so the result is that
you think you don't need to
adjust the weights at all.
And when a large fraction
of the network
weights don't need to be adjusted,
you don't adjust the weights.
And you are not doing any learning.
So the learning is slow.
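A small Python sketch makes the numbers concrete. It evaluates the Sigmoid derivative in the middle and at the tails, and then multiplies the best-case derivative through a stack of layers. The depth of ten layers is a hypothetical choice, and the weights are left out of the product to keep the sketch short.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_grad(s):
    a = sigmoid(s)
    return a * (1.0 - a)


# The derivative peaks at 0.25 and is close to zero at the tails.
for s in (-10.0, 0.0, 10.0):
    print(s, sigmoid_grad(s))


def chained_grad(n_layers, s=0.0):
    # Chain rule through n Sigmoid layers, ignoring the weights:
    # the local gradients multiply together.
    return sigmoid_grad(s) ** n_layers


print(chained_grad(10))
```

Even in the best case, with every neuron sitting at the steepest point of the Sigmoid, ten layers shrink the gradient to under one millionth of its starting size.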
There are some fixes to this,
there are different types
of functions.
There's
the ReLU function, which is the most
popular activation function.
But again,
if the neurons are initialized poorly,
this function might not fire.
it might be zero gradient
for the entire data set.
Nothing that you produce as input,
you run all your thousands
of images of cats,
and none of them fire at all.
That's the danger here.
So you have to pick
both the optimization engine,
the solver that you use
and the activation functions
carefully.
You can't just plug and play
like they're Legos.
You have to be aware of the function.
SGD, Stochastic Gradient Descent,
that's the Vanilla
optimization algorithm
for gradient descent.
For optimizing the loss function
over the gradients
And what's visualized here is,
again, if you have done
any numerical optimization,
and non-linear optimization,
there's the famous saddle point,
that's tricky for these
algorithms to deal with.
What happens is, it's easy
for them to oscillate,
get stuck in that saddle and
oscillate back and forth
as opposed to what they
want to do which is
go down into--
You get so happy that you found this
low point
that you forget there's
a much lower point.
So you get stuck with the gradient.
The momentum of the gradient
keeps rocking it back and forth
without you going
to a much greater global minimum.
And there's a lot of clever
ways to solving that,
the Adam optimizer is one of those.
But in this case, as long as
the gradients don't vanish
SGD, the Stochastic Gradient Descent,
one of these algorithms
will get you there
It might take a little while,
but it will get you there
Yes, question.
The question was,
you're dealing with a function
that is not convex,
how do we ensure anything about
converging to anything that's
reasonably good,
the local optimum converges to--
The answer is, you can't.
This isn't only a non-linear function
it's a highly non-linear function.
The power and the beauty
of neural networks
is that it can represent these arbitrarily complex functions.
It's incredible.
And it can learn these
functions from data
But the reason people are referring to
neural networks training as art
is you're trying to play
with parameters
that don't get stuck in
these local optima.
For stupid reasons
and for clever reasons.
Yes, question.
The Question continues
on the same thread.
The thing is, we're dealing
with functions
where we don't know what
the global optimum is.
That's the crux of it.
Everything we talked about,
interpreting text,
interpreting video,
even driving.
What's the optimal for driving?
Never crashing?
It sounds easy to say that,
you actually have to
formulate the world
under which it defines all of those
things and it becomes a really
non-linear objective function
for which you don't know what the optimum is.
That's why you keep trying
and get impressed
every time it gets better.
It is essentially the process.
And you can also compare,
you can compare with
human-level performance.
For ImageNet,
who can tell the difference
between cats and dogs,
and top five categories,
96% of the time accuracy,
and then you get impressed when
a machine can do better than that.
But you don't know
what the best is.
These videos can be watched for hours,
I won't play it until I
explain this slide.
Let's pause to reflect
on backpropagation
before I go on to Recurrent
Neural Networks. Yes, question.
In this practical manner,
how can you tell when
you're actually creating a net
whether you're
facing the vanishing
gradient problem
or you need to change your optimizer
or you've reached a local minimum?
The question was,
how do you practically know
when you hit the vanishing
gradient problem?
The vanishing gradient could be--
The derivative being zero
on the gradient,
happens when the activation
is exploding,
like really high values
and really low values.
The really high values are easy.
Your network has just gone crazy.
It produces very large values.
And you can fix a lot of those things
by just capping the activations.
The values being really low,
resulting in a vanishing gradient,
are really hard to detect
There's a lot of research in
trying to figure out
how to detect these things.
If you're not careful, often times
you can find that,
and this isn't hard to do,
where like 40 or 50 percent
of the network,
of the neurons,
are dead.
We will call it, for ReLU,
they're dead ReLU
They're not firing at all.
How do you detect that?
That's part of learning
If they never fire you can detect that
by running it through
the entire training set.
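That check can be sketched in Python for one layer of ReLU units. The function name, the tiny weights and the three-example data set are all hypothetical, standing in for a real layer and a real training set.

```python
def relu(v):
    return max(0.0, v)


def dead_unit_fraction(weights, biases, dataset):
    """Fraction of ReLU units that never fire on any example.

    weights: one list of input weights per unit
    biases:  one bias per unit
    dataset: list of input vectors
    """
    dead = 0
    for w, b in zip(weights, biases):
        fired = any(
            relu(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.0
            for x in dataset
        )
        if not fired:
            dead += 1
    return dead / len(weights)


# Hypothetical layer with two units and a three-example data set.
dataset = [[1.0, 2.0], [0.5, 0.1], [3.0, 1.0]]
weights = [[1.0, 1.0], [-1.0, -1.0]]
biases = [0.0, -5.0]
print(dead_unit_fraction(weights, biases, dataset))
```

The second unit has negative weights and a negative bias, so it outputs zero for every example and counts as dead.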
There are a lot of tricks. But that's the problem.
You try to learn
and then you look at the loss function
and it's not
converging to anything reasonable.
It's going all over the place,
or just converging very slowly.
And that's an indication that
something is wrong
That something could be
the loss function is bad,
that something could be
you already found the optimal,
or that something could be
the vanishing gradient.
And again, that's why it's an art.
Certainly,
at least some fraction of the neurons
needs to be firing.
Otherwise, initialization is
really poorly done.
Okay, to reflect on the
simplicity of backpropagation
and the power of it,
this kind of step of
backpropagating the loss function
to the gradients locally,
is the way neural networks learn.
It's really the only way
that we have effectively been able to
train a neural
network to learn a function.
Adjusting the weights and biases,
the huge number of weights and biases,
the parameters
It's just through this optimization.
It's backpropagating the error,
where you have
the supervised ground truth.
The question is
whether this process, of fitting,
adjusting the parameters
of a highly non-linear function
to minimize a single objective,
is the way you achieve intelligence.
Human-level intelligence.
That's something to think about.
You have to think about,
for driving purposes,
what is the limitation
of this approach?
What's not happening?
The neural network design,
the architecture,
is not being adjusted.
Any of the edges, the layers,
nothing is being evolved.
There are other
optimization approaches
that I think are more
interesting and inspiring
than effective.
For example, this is
using soft cubes to--
This is falling out of the field
of evolutionary robotics.
Where you evolve
the dynamics of a robot
using genetic algorithms
and that's
These robots have been taught to,
in simulation, obviously,
to walk and to swim.
That one is swimming.
The nice thing here is that the dynamics,
that highly non-linear space as well,
that controls the dynamics of
this weird shaped robot
with a lot of degrees of freedom,
it's the same kind of thing
as the neural network.
In fact, people have applied
genetic algorithms,
ant colony optimization, all kinds of
sort of nature-inspired algorithms
for optimizing the weights
and the biases
but they don't seem to
currently work that well.
It's a cool idea to be using
nature-type evolutionary
algorithms to evolve
something that's already nature
inspired which is neural networks.
But, something to think about
the backpropagation,
while really simple
it's kind of dumb and
the question is whether
general intelligence reasoning
can be achieved with this process.
All right, Recurrent Neural Networks,
on the left there's an input X
with weights on the input, U,
there's a hidden state,
hidden layer S,
with weights on
the edge connecting
the hidden states to each other
and then more weights,
V, on the output O.
It's a really simple network,
there's inputs,
there's hidden states,
the memory of this network
and there's outputs.
But the fact that there's
this loop
where the hidden states are
connected to each other
means that as opposed to
taking a single input,
the network takes arbitrary
numbers of inputs,
it just keeps taking X, one at a time
and produces a sequence of outputs
through time.
Depending on
the duration of the sequence
you're interested in,
you can think of this network
in its unrolled state.
You can unroll this neural network
where the inputs are in the bottom,
Xt-1, Xt, Xt+1,
and same with the outputs,
Ot-1, Ot, Ot+1,
and it becomes like
a regular neural network,
unrolled some
arbitrary number of times.
The parameters, again,
there's weights, there's biases,
similar to CNNs,
convolutional neural networks
and just like convolutional
neural networks
make certain spatial
consistency assumptions,
recurrent neural networks assume
temporal consistency
amongst the parameters;
they share the parameters.
That W, that U, that V,
is the same for every single time step.
You're learning
the same parameter,
no matter the duration
of the sequence
and that allows you to
look at arbitrarily
long sequences
without having an
explosion of parameters. 
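The unrolled network can be sketched in Python with a scalar hidden state, so that U, W and V are single numbers. The tanh activation on the hidden state and the plain linear output are assumptions made to keep the sketch short, and the parameter values are hypothetical.

```python
import math


def rnn_forward(xs, U, W, V, s0=0.0):
    """Run a scalar RNN over a sequence of any length.

    U: weight on the input, W: weight on the previous hidden state,
    V: weight on the output. The same three values are reused
    at every time step.
    """
    s = s0
    outputs = []
    for x in xs:
        s = math.tanh(U * x + W * s)   # new hidden state (the memory)
        outputs.append(V * s)          # output at this time step
    return outputs, s


# Hypothetical parameters; the same ones handle both sequence lengths.
outs_short, _ = rnn_forward([1.0], U=1.0, W=0.5, V=2.0)
outs_long, _ = rnn_forward([1.0, 0.0, -1.0, 0.5, 2.0], U=1.0, W=0.5, V=2.0)
print(outs_short, outs_long)
```

The loop is the unrolling: a longer sequence means more trips through the loop, not more parameters.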
This process is the same exact
process that's repeated
based on the different variants
that we talked about before,
in terms of inputs and outputs,
one to many, many to one,
many to many.
The backpropagation process
is exactly the same as
for regular neural networks.
It has the fancy name of
backpropagation through time,
BPTT,
but it's just backpropagation
through an unrolled
recurrent neural network,
where the errors are
computed on the outputs,
the gradients are computed,
backpropagated
and computed on the inputs,
again, suffering from
the same exact problem
of vanishing gradients.
The problem is
that the depth of these networks
can be arbitrarily long.
If at any point the gradient hits
a low number, zero,
that neuron becomes saturated.
That gradient, let's call it saturated,
that gradient gets--
drives all the earlier layers to zero,
so it's easy to run into a problem
where you're really ignoring
the majority of the sequence.
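A minimal Python sketch of that effect, assuming a scalar RNN with a tanh hidden state and hypothetical weight values: the gradient of the last hidden state with respect to the first is a product of one factor per time step, and each factor contains the same recurrent weight W.

```python
import math


def grad_last_state_wrt_first(xs, U, W, s0=0.0):
    """d s_T / d s_0 for the scalar RNN s_t = tanh(U * x_t + W * s_{t-1})."""
    s = s0
    grad = 1.0
    for x in xs:
        s = math.tanh(U * x + W * s)
        # Chain rule for one time step: tanh derivative times W.
        grad *= W * (1.0 - s * s)
    return grad


# Small recurrent weight over 50 steps: the gradient vanishes.
vanishing = grad_last_state_wrt_first([0.1] * 50, U=1.0, W=0.5)

# Large recurrent weight with the state sitting at zero: it explodes.
exploding = grad_last_state_wrt_first([0.0] * 20, U=1.0, W=3.0)
print(vanishing, exploding)
```

In the first case the early inputs have essentially no influence on what is learned from the end of the sequence; in the second the update steps become far too big.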
This is just another Python way,
pseudocode way, to look at it.
You have the same W,
remember you're sharing the weights
and all the parameters
from time to time,
so if the weights are such
WHH,
if the weights are such
that they produce
[unintelligible]
they have a negative value
that results
in the gradient that goes to zero,
that propagates through the rest.
That's the pseudocode for the backpropagation
pass through the RNN,
that WHH
propagates back.
You get these things
with exploding and
vanishing gradients
for example, error surfaces for
a single hidden unit RNN,
this is visualizing the gradient,
the value of the weight,
the value of the bias
and the error,
the error could be really flat
or could explode,
both are going to lead
to you not making--
either making steps that
are too gradual
or too big.
It's the geometric interpretation.
Okay. What other variants that
we look at, a little bit?
are they [unintelligible 00:41:13]?
It doesn't have to be only one way,
it can be bi-directional,
there could be edges going forward
and edges going back
What that's needed for
is things like
filling in missing,
whatever the data is,
filling in missing elements of that data,
whether that's images, or words,
or audio.
Generally, as is always the case
in neural networks,
the deeper it goes, the better.
That deep referring to
the number of layers
in a single temporal instance.
On the right of the slide
we're stacking
nodes in the temporal domain.
Each of those layers
has its own set of weights,
its own set of biases.
These things are awesome
but they need a lot of data
when you add extra layers in this way.
The problem is, while
recurrent neural networks,
in theory,
are supposed to be able to learn
any kind of sequence,
the reality is they're not really
good at remembering
what happened a while ago,
the long-term dependency.
Here's a silly example,
let's think of a story
about Bob,
Bob is eating an apple.
The apple part
is generated by
the recurrent neural network.
Your recurrent neural network
can learn to generate "apple"
because it has seen a lot of sentences
with "Bob" and "eating"
and it can generate the word apple.
For a longer sentence, like
"Bob likes apples,
he's hungry and decided to have a snack,
so now he's eating an apple",
you have to maintain the state
that we're talking about Bob
and we're talking about apples,
through several
discrete semantic
sentences.
That kind of long-term memory
is not--
because of different effects,
mainly vanishing gradients,
it's difficult to propagate
the important stuff
that happened a while ago
in order to maintain that context
in generating "apple",
or classifying some concept
that happened
way down the line. 
When people talk about
recurrent neural networks
these days, they're talking about LSTMs,
long short-term memory networks,
so all the impressive
results on time series
and audio and video
and all that, that requires LSTMs.
Again, vanilla RNNs
are on top of the slide,
each cell is simple,
there are some hidden units,
there's an input, and there's an output.
Here, we used TANH
as activation function,
it's just another popular
Sigmoid type activation function.
LSTMs are more complicated,
or they look more complicated but
in some ways, they're more intuitive
for us to understand.
There's a bunch of gates in each cell,
we'll go through those.
In yellow are different
neural network layers,
Sigmoid and TANH,
are just different types
of activation functions.
TANH is an activation function that
squishes the input into
the range of negative one to one.
Sigmoid function
squishes it between zero and one
and they serve different purposes.
There's some pointwise operations,
addition, multiplication,
and there's connections,
data being passed from layer to layer,
shown by the arrows.
There's concatenation and there's
a copy operation on the output
The output of each cell
is copied to the next cell
and to the output.
Let me try to
clarify a little bit.
There's this conveyer belt
going through inside of
each individual cell
and they all have, there's really
three steps in the conveyer belt.
The first is,
there is a Sigmoid function
that's responsible for deciding
what to forget and what to ignore,
it's responsible for
taking in the input, the new input, x(t),
taking in the state of the previous,
the output of the previous cell,
previous time step
and deciding "do I want to keep
that in my memory or not?"
and "do I want to integrate
the new input into my memory or not?"
This allows you to be
selective about the information
which you learn.
For example,
there's that sentence
"Bob and Alice are having lunch,
Bob likes apples,
Alice likes oranges,
she is eating an orange".
Bob and Alice are having lunch,
Bob likes apples,
right now, if you had, say,
a hidden state,
keeping track of the gender
of the person we're talking about
you might say that there's both genders
on the first sentence,
there's male in the second sentence,
female in the third sentence,
and that way
when you have to generate a sentence
about who's eating what,
you'll know-
you keep the gender information
in order to make an
accurate generation of text
corresponding to
the proper person.
You have to forget certain things,
like forget that Bob existed
at that moment,
you have to forget Bob likes apples
but you have to remember
that Alice likes oranges
so you have to selectively remember
and forget certain things
that's LSTM in a nutshell.
You decide what to forget,
decide what to remember
and decide what to output
in that cell.
Zoom in a little bit,
because this is pretty cool.
There's a state running
through the cell,
this conveyer belt,
previous state like the gender
that we're currently talking about,
that's the state that you're
keeping track of
and that's running through the cell.
Then there's three Sigmoid layers,
each outputting
a number between zero and one,
one when you want that
information to go through
and zero when you
don't want it to go through,
the conveyer belt
that maintains the state.
First, with a Sigmoid function,
we decide what to forget
and what to ignore,
that's the first one,
we take the input from
the previous time step,
the input to the network
on the current time step
and decide, do I want to forget
or do I want to ignore those?
Then we decide
which part of the state to update,
what part of our memory do we have
to update with this information
and what values to insert in that update.
Third step is, we perform
the actual update
and perform the actual forgetting,
that's why you have
the Sigmoid function,
you just multiply it,
when it's zero, it's forgetting,
when it's one, that information passes through.
Finally,
we produce an output from the cell,
if it's translation,
it's producing an output
in the English language
where the input was
in the Spanish language
and then that same output
is copied to the next cell.
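The steps just described can be sketched as a single LSTM time step. This follows the commonly used LSTM formulation with a forget gate, an input gate, candidate values and an output gate; the parameter names (`Wf`, `Wi`, `Wg`, `Wo`), the sizes and the random weights are assumptions for illustration, not the network on the slide.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new output h and cell state c."""
    z = h_prev + x  # concatenate previous output and current input
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    # Conveyor belt update: forget part of the old state, add gated new values.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output is a gated, squashed view of the state; it also feeds the next step.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def random_params(n_in, n_hidden, seed=0):
    """Hypothetical untrained weights, only so the sketch can run."""
    rng = random.Random(seed)
    params = {}
    for name in "figo":
        params["W" + name] = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + n_in)]
                              for _ in range(n_hidden)]
        params["b" + name] = [0.0] * n_hidden
    return params

params = random_params(n_in=3, n_hidden=4)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print("h:", h)
print("c:", c)
```

Note that the gates multiply the state elementwise: a forget-gate value near zero wipes the old state, a value near one lets it pass through unchanged.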
What can we get done with this
kind of approach?
We can look at machine translation.
I guess what I'm trying to--
question.
what is your representation
of this state?
Is it like a floating point
or is it like a vector
or what is it, exactly?
The state
is the activation
multiplied by the weight,
it's the output of the Sigmoid or
the TANH activations.
There's a bunch of neurons
and they're firing a number
between negative one and one,
or between zero and one,
that whole thing is the state.
It's just that calling it a state
is sort of simplifying,
but the point is that there's
a bunch of numbers being constantly
modified by the weights
and the biases,
those numbers hold the state
and the modification
of those numbers
is controlled by the weights
and then once all of that is done,
the resulting output
of the recurrent neural network
is compared to the desired output
and the errors are backpropagated
to the weights.
Hopefully, that makes sense. 
So, machine translation is one
popular application
all of it is the same,
all of these networks
that I've talked about,
they're really similar constructs.
You have some inputs,
whatever language that is again,
German maybe,
I think everything is German,
and the output.
The inputs are in one language,
a set of characters
composing a word in one language,
there's a state being propagated
and once that sentence is over,
you start, as opposed
to collecting inputs,
start producing outputs
and you can output in the English language.
There's a ton of great work on
machine translation.
It's what Google is supposedly using
for their translator,
same thing.
I've shown this previously
but now you all know how it works,
same exact thing, LSTMs
generating handwritten characters,
handwriting in arbitrary styles,
controlling the drawing,
where the input is text
and the output is handwriting.
It's again the same kind of
network with some depth here,
the input is the text,
the output is the control
of the writing.
Character-level text generation,
this is
the thing that taught us about life,
the meaning of life,
literary recognition and the tradition
of ancient human reproduction.
That's again, the same process,
input one character at a time,
what we see there is the encoding
of the characters on the input layer,
there's a hidden state,
hidden layer that is keeping track
of those activations,
the outputs
of the activation functions
and every single
time it's outputting
its best prediction
of the next character that follows.
Now, on a lot of these applications
you want to ignore the output
until the input sentence is over
and then you start listening
to the output,
but the point is that it just
keeps generating text,
whether it's given an input or not,
so providing input
is just steering
the recurrent neural network.
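A minimal sketch of that character-level loop follows. It uses a vanilla RNN cell rather than an LSTM to stay short, and the tiny vocabulary, the hidden size and the random, untrained weights are all assumptions for illustration, so the generated text is gibberish. What it shows is the mechanics: one-hot encode a character, update the hidden state, predict the next character, and feed that prediction back in.

```python
import math
import random

vocab = list("helo ")
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Encode a character as a vector with a single 1."""
    v = [0.0] * len(vocab)
    v[char_to_idx[ch]] = 1.0
    return v

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
n_hidden = 8

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

Wxh = rand_matrix(n_hidden, len(vocab))   # input -> hidden
Whh = rand_matrix(n_hidden, n_hidden)     # hidden -> hidden (the recurrence)
Why = rand_matrix(len(vocab), n_hidden)   # hidden -> next-character scores

def step(ch, h):
    """Feed one character, update the hidden state, return next-char probabilities."""
    pre = [a + b for a, b in zip(matvec(Wxh, one_hot(ch)), matvec(Whh, h))]
    h = [math.tanh(p) for p in pre]
    return softmax(matvec(Why, h)), h

def generate(seed_text, n_chars):
    h = [0.0] * n_hidden
    for ch in seed_text:          # steer the network with the (non-empty) input text
        probs, h = step(ch, h)    # predictions are ignored while reading the input
    out = []
    for _ in range(n_chars):      # then let it keep generating on its own
        ch = rng.choices(vocab, weights=probs)[0]
        out.append(ch)
        probs, h = step(ch, h)
    return "".join(out)

print(generate("hel", 10))
```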
You can answer questions
about an image,
the input you get there,
you could almost arbitrarily
stack things together,
you take an image as your input,
bottom left there,
put it in your convolutional neural network,
and take the question.
There's something called
word embeddings,
it's to broaden
the representative
meaning of the words.
"How many books?" is the question.
You want to take the word embeddings
and the image
and produce
your best estimate of the answer.
For question of
"what color is the cat?"
it could be gray or black,
it's the different LSTM flavors
producing that answer.
Same with counting chairs
you can give an image of a chair
and ask the question
"how many chairs are there?"
And it can produce an answer of "three".
I should say this is really hard,
an arbitrary question
asked about an arbitrary image,
you are both interpreting--
you are doing natural language processing
and you're doing computer vision,
all in one network.
Same thing with
the image caption generation,
you can detect
the different objects in the scene,
generate those words,
stitch them together
in syntactically correct sentences
and rearrange the sentences.
All of those are LSTMs,
the second and the third step,
the first is computer vision
detecting the objects,
segmenting the image and
detecting the objects,
that way you can generate
a caption that says
"a man is sitting in a chair
with a dog in his lap".
Again, LSTMs for video.
Caption generation for video,
the input at every frame is an image
that goes into the LSTM,
the input is an image
and the output is a set of characters.
First, you load in the video,
in this case the output is on top,
you encode
the video
into a representation
inside the network
and then you start generating words
about that video.
First comes the input, the encoding
stage,  then the decoding stage.
Take in the video,
say a man is taking,
talking, whatever
and because the input and
the output are arbitrary,
there also has to be indicators
of the beginnings and
the ends of a sentence,
in this case, end of sentences.
You want to know when you stop
in order to generate syntactically
correct sentences.
You also want to be able
to generate a period
that indicates the end of a sentence.
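The encode-then-decode pattern with an end marker can be sketched as follows. The `encode` and `decode_step` functions are hypothetical stand-ins (a counter and a hard-coded lookup) in place of trained LSTMs, and the `<eos>` token name is an assumption. The point is only the control flow: read all the input first, then emit words until the end marker appears.

```python
EOS = "<eos>"

def encode(frames):
    """Stand-in encoder: fold all input frames into one state (here, just a count)."""
    state = 0
    for _ in frames:
        state += 1
    return state

def decode_step(state, prev_word):
    """Stand-in decoder: a hard-coded lookup in place of a trained LSTM."""
    script = {None: "a", "a": "man", "man": "is", "is": "talking", "talking": EOS}
    return script[prev_word], state

def caption(frames, max_words=20):
    state = encode(frames)            # encoding stage: read the whole input first
    words, prev = [], None
    for _ in range(max_words):        # decoding stage: emit one word per step
        word, state = decode_step(state, prev)
        if word == EOS:               # the end marker tells us when to stop
            break
        words.append(word)
        prev = word
    return " ".join(words)

print(caption(frames=["f0", "f1", "f2"]))  # prints "a man is talking"
```

The `max_words` cap is a safety net so that decoding still stops if the end marker is never produced.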
You can also, again,
recurrent neural networks,
LSTMs here,
controlling
the steering
of a sliding window on an image
that is used to classify
what is contained in that image.
Here, a CNN being steered by
a recurrent neural network
in order to convert this image
into the number that's associated
with a house number,
it's called visual attention.
That visual attention
can be used to steer
for the perception side
and it can be used to steer
a network for the generation.
On the right,
we can generate an image as--
So the output of the network--
it's an LSTM
where the output on every time step
is visual,
and this way you can draw numbers.
Here,
I mention this before,
is taking in as input silent video,
sequence of images
and producing audio.
This is
an LSTM
that has convolutional layers
for every single frame,
takes images as input
and produces
a spectrogram, audio as output.
The training set is a person hitting
an object with a drumstick
and your task is to generate,
given a silent video,
generate the sound that
the drumstick will make
when in contact with that object.
Medical diagnosis,
that's actually--
I've listed some places
where it has been really successful
and pretty cool,
but it's also beginning to be applied
in places where
it can actually
really help
civilization, in medical applications.
For medical diagnosis
there's
a highly sparse and
variable-length
sequence of information in the form of,
for example, patient
electronic health records.
So, every time you visit a doctor,
there's a test being done,
that information is there
and you can look at it as a sequence
over a period of time
and then given that data,
that's the input,
the output is the diagnosis,
a medical diagnosis,
in this case, we can look at
predicting diabetes,
scoliosis, asthma and so on,
with pretty good accuracy.
There's something that
all of us wish we could do,
is stock market prediction.
You can input,
for example, well first of all,
you can input the raw stock data,
[unintelligible 01:00:30] books
and so on, financial data,
but you can also look at news articles
from all over the web
and take those as input as shown here,
on the X axis is time,
articles from different days,
LSTM, once again,
and produce an output
of your prediction,
binary prediction, whether
the stock would go up or down.
Nobody has been able to
really successfully do this
but there is a bunch of results
trying to perform above random,
which is how you make money,
significantly above random
on the prediction of whether
it's going up or down.
So you could buy or sell
and especially
when there is--
in the cases when there are crashes
it's easier to predict,
so you can predict
an encroaching crash.
These are shown in the table,
the error rates from different stocks,
automotive stocks.
You can also generate audio,
it's the exact same process
as generating language,
you generate audio.
Here it's trained on
a single speaker,
a few hours, epochs,
of them speaking
and you just learn,
that's raw audio of the speaker
and it's learning slowly to generate
[audio]
Obviously, they were reading numbers.
This is incredible, this is trained
on a compressed spectrogram
of the audio, raw audio,
and over just a few epochs
it is producing
something that sounds like words,
it could do this lecture for me, I wish.
This is amazing,
this is raw input,
raw output,
all again, LSTMs,
and there's a lot of work
in voice recognition,
audio recognition. You're mapping--
let me turn it up.
You are mapping any kind of audio
to a classification,
you can take the audio of the road
and that's the spectrogram
on the bottom there, being shown
you could detect whether
the road is wet
or the road is dry.
You could do the same thing for
recognizing the gender
of the speaker
or recognizing a many-to-many mapping
of the actual words
being spoken,
speech recognition.
This is about driving,
so let's see where recurrent neural
networks apply in driving.
We talked about the NVIDIA approach,
the thing that actually powers
DeepTeslaJS,
it is a simple convolutional neural network,
there's five convolutional layers
in their approach, three fully
connected layers,
you can add as many layers
as you want in DeepTesla,
that's a quarter of a million
parameters to optimize,
all you are taking is a single image,
no temporal information,
single image
and producing the steering angle,
that's the approach,
that's the DeepTesla way,
taking a single
image and learning a regression
of the steering angle.
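To get a feel for where a number like "a quarter of a million parameters" comes from, the arithmetic below counts weights and biases layer by layer. The layer sizes are an assumption: they are the ones I recall from NVIDIA's published end-to-end driving architecture (a 66x200 color input, five convolutional layers, then fully connected layers of 100, 50 and 10 units plus a single steering output, which is listed separately here). A DeepTesla network you build yourself will have different numbers.

```python
# (name, in_channels, out_channels, kernel_size) for each convolutional layer.
# Sizes are assumed from NVIDIA's published architecture, not taken from the lecture.
conv_layers = [
    ("conv1", 3, 24, 5),
    ("conv2", 24, 36, 5),
    ("conv3", 36, 48, 5),
    ("conv4", 48, 64, 3),
    ("conv5", 64, 64, 3),
]
# (name, inputs, outputs); 1152 = 64 channels * 1 * 18 after the last conv layer.
fc_layers = [("fc1", 1152, 100), ("fc2", 100, 50), ("fc3", 50, 10), ("out", 10, 1)]

def conv_params(c_in, c_out, k):
    """Weights (k * k * c_in per filter) plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

total = sum(conv_params(ci, co, k) for _, ci, co, k in conv_layers)
total += sum(fc_params(ni, no) for _, ni, no in fc_layers)
print(total)  # 252219, roughly a quarter of a million
```

Most of the parameters sit in the first fully connected layer, which is typical when a convolutional stack is flattened into a dense layer.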
One of the
prizes for the competition is
the Udacity Self-Driving
Car Engineer Nanodegree
for free,
this thing is awesome,
I encourage everyone to check it out,
but they did a competition
that's very similar to ours,
but a very large group
of obsessed people,
they were very clever,
they went beyond
convolutional neural networks
of predicting steering,
taking a sequence of images
and predicting steering,
what they did is, the winners,
at least the first and I'll talk about
the second place winner tomorrow,
on 3D convolutional neural networks,
the first and the third place winners
used RNNs,
used LSTMs, recurrent neural networks
and map a sequence of images
to a sequence of steering angles.
For anyone,
statistically speaking,
anybody here who is
not a computer vision person,
most likely what you'd want to use,
for whatever application
you're interested in,
is RNNs,
the world is full of time series data,
very few of us are working on
data that is not time series data,
in fact, whenever it's just snapshots,
you're really just reducing
the problem to
the size that you can handle
but most data
in the world is time series data.
This is the approach you end up using
if you want to apply it
in your own research,
RNNs is the way to go.
Again, what are they doing?
How do you put images
into a recurrent neural network?
it's the same thing,
you take,
you have to convert
an image into numbers
in some kind of way,
a powerful way of doing that
is convolutional neural networks,
so you can take
either 3D convolutional
neural networks
or 2D convolutional neural networks
one takes time into
consideration and one does not,
process that image
to extract a representation
of that image
and that becomes the input
to the LSTM
and the output
at every single cell,
at every single timestep,
is a predicted steering angle,
the speed of the vehicle
and the torque
that's what the first place winner did,
they didn't just do the steering angle,
also did the speed and torque
and the sequence length
that they were using
for training and for testing,
for the input and the output,
is a sequence length of 10.
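One way to picture the sequence length of 10 is as a sliding window over the per-frame data. The sketch below is an assumption about how such training sequences could be cut, not the winning team's actual code; `make_windows`, the overlapping windows, the 4-number feature vectors and the target values are all hypothetical.

```python
def make_windows(features, targets, seq_len=10):
    """Slice aligned per-frame features and targets into fixed-length sequences."""
    assert len(features) == len(targets)
    windows = []
    for start in range(0, len(features) - seq_len + 1):
        xs = features[start:start + seq_len]   # inputs to the LSTM, one per time step
        ys = targets[start:start + seq_len]    # desired outputs, one per time step
        windows.append((xs, ys))
    return windows

# Hypothetical data: 25 frames, each with a 4-number feature vector (for example
# from a CNN) and a (steering angle, speed, torque) target.
features = [[float(t)] * 4 for t in range(25)]
targets = [(0.01 * t, 20.0, 0.5) for t in range(25)]

windows = make_windows(features, targets)
print(len(windows), len(windows[0][0]), len(windows[0][1]))  # 16 10 10
```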
Did they use supervised learning
or did they use reinforcement
learning?
The question was, did they use
supervised learning?
Yes, they were given the same thing
as in DeepTesla,
a sequence of frames
where they have a sequence of
steering angles, speed and torque,
I think there's other information
too available,
there's no reinforcement
learning here.
Question. 
Do you have a sense of
how much information
is being passed, how many
LSTM gates are there
in this problem?
The question was,
how many LSTM gates
are in this problem?
This network,
it's true
that these diagrams kind of hide
the number of parameters here,
but it's arbitrary
just like convolutional
neural networks are arbitrary,
the size of the input is arbitrary,
the size of Sigmoid function,
TANH is arbitrary,
so you can make it as large as you want,
as deep as you want
and the deeper and larger, the better. 
What these folks actually used--
the way these competitions work
and I encourage you, if you're
interested in machine learning
to participate in Kaggle,
I don't know how to pronounce it,
competitions
where basically everyone
is doing the same thing,
you're using LSTMs
or, if it's a one-to-one mapping,
using convolutional neural networks
or fully connected networks
with some clever pre-processing
and the whole job is
that takes months
and you probably,
if you're a researcher,
that's what you'd be doing
in your own research,
playing with parameters,
playing with pre-processing
of the data,
playing with the different parameters
that control the size of the network,
the learning rate
I've mentioned, the type of optimizer,
all these kinds of things,
that's what you're playing with,
using your own human intuition
and you're using your--
whatever probing you can do
in monitoring the performance of the network
through time.
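Part of that "playing with parameters" can be automated. The sketch below uses plain random search, which is one simple technique among many and not something the lecture prescribes. The `validation_loss` function is a made-up stand-in for "train the network and measure how well it does", and the search ranges are arbitrary.

```python
import math
import random

def validation_loss(learning_rate, hidden_units):
    """Hypothetical stand-in for training a network and measuring validation error."""
    return (math.log10(learning_rate) + 3.0) ** 2 + ((hidden_units - 128) / 100.0) ** 2

rng = random.Random(0)
trials = []
for _ in range(20):
    config = {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # sample on a log scale
        "hidden_units": rng.choice([32, 64, 128, 256]),
    }
    trials.append((validation_loss(**config), config))

# Keep the configuration with the lowest validation loss.
best_loss, best_config = min(trials, key=lambda t: t[0])
print(best_loss, best_config)
```

In practice each trial means a full training run, which is why this process takes months and why intuition about which ranges to try still matters.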
Yes?
The question was,
you said that there's a
memory of ten in this LSTM,
and I thought RNNs are
supposed to be arbitrary.
It has to do
with the training,
how the network is trained.
It's trained with sequences of 10.
The structure is still the same,
you only have one
cell that's
looping onto each other.
But the question is,
in what chunks,
what is the size of the sequence
that we should do in the training
and then the testing.
It can be arbitrary length.
It's just usually better
to be consistent
and have a fixed length.
You're not stacking 10 cells together.
It's just a single cell still.
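The "single cell looping" idea can be shown in a few lines: one function with one set of weights, applied once per time step, for any sequence length. A one-unit vanilla RNN cell is used here instead of an LSTM to keep it short, and the weight values are arbitrary assumptions.

```python
import math

def rnn_cell(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One vanilla RNN step with a single hidden unit."""
    return math.tanh(w_x * x + w_h * h + b)

def run(sequence):
    h = 0.0
    outputs = []
    for x in sequence:       # same cell, same weights, applied once per time step
        h = rnn_cell(x, h)
        outputs.append(h)
    return outputs

# The same cell handles a sequence of 10 or of 50 without any change.
print(len(run([1.0] * 10)), len(run([1.0] * 50)))  # 10 50
```

The sequence length is a choice about how the training data is chunked, not a property of the network's structure.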
The third-place winner,
Team Chauffeur,
used something called
transfer learning
and it's something I don't think
I mentioned
but it's kind of implied,
the amazing power of neural networks.
First, you need a lot of data
to do anything.
That's the cost, that's
the limitation in neural networks.
But what you could do is,
there's
neural networks that have been
trained on very large data sets.
ImageNet,
VGGNet, AlexNet, ResNet,
all these networks are trained
on a huge amount of data.
Those networks are trained to tell
the differences between a cat and dog.
Specific object recognition
of single images.
How do I then take that network
and apply it to my problem,
say of driving or lane detection,
or medical diagnosis, or cancer or not?
The beauty of neural networks,
the promise of transfer learning,
is that you can just take that network,
chop off the final layer,
the fully connected layer
that maps from all those cool
high-dimensional features that you
have learned about visual space,
and as opposed to predicting cat vs. dog,
you teach it to predict
cancer or no cancer.
You teach it to predict lane or no lane,
truck or no truck.
As long as the visual space
under which that network operates
is similar or the data like
if it's audio or whatever
if it's similar, if the features are
useful then you learn,
in studying the problem of
cat vs dog deeply,
you have learned actually
how to see the world.
As you're going to apply
that visual knowledge,
you can transfer that learning
to another domain.
That's the beautiful power
of neural networks
it's that they're transferable.
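The "chop off the final layer" recipe can be sketched in miniature. Here `frozen_features` is a hypothetical stand-in for a pretrained network with its last layer removed (it is never updated), and only a small new output layer, `head`, is trained on a made-up two-class task. A real setup would use an actual pretrained network and real images.

```python
import math
import random

def frozen_features(x):
    """Stand-in for a pretrained network minus its final layer (never updated here)."""
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1]), 1.0]

def predict(head, feats):
    """New final layer: a single sigmoid unit on top of the frozen features."""
    z = sum(w * f for w, f in zip(head, feats))
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
# Hypothetical new task: the label is 1 when the two inputs sum to a positive number.
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
data = [(x, 1.0 if x[0] + x[1] > 0 else 0.0) for x in inputs]

head = [0.0, 0.0, 0.0]               # the only weights we train
for _ in range(200):                 # plain stochastic gradient descent on the new head
    for x, y in data:
        feats = frozen_features(x)
        err = predict(head, feats) - y
        head = [w - 0.1 * err * f for w, f in zip(head, feats)]

correct = sum((predict(head, frozen_features(x)) > 0.5) == (y == 1.0) for x, y in data)
accuracy = correct / len(data)
print(accuracy)
```

This only works because the frozen features happen to be useful for the new task, which is the same condition the lecture states: the feature space has to be similar.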
What they did here is--
I didn't spend enough time
looking through the code
I'm not sure which of the
giant networks they took
but they took a giant
convolutional neural network,
they chopped off
the end layer,
which produced 3000 features,
and they took those 3000 features
for every single image frame,
and that's the Xt.
They gave that as the input to LSTM.
And the sequence length,
in that case, was 50.
This process is pretty
similar across domains.
That's the beauty of it.
The art of neural networks is in the--
Well that's a good sign
[chuckles],
I guess I should wrap it up--
The art of the neural networks is
in the proper parameter tuning. 
That's the tricky part,
and that's the part you can't be taught.
That's experience,
sadly enough.
That's why they talk about
Stochastic Gradient Descent, SGD.
That's what Geoffrey Hinton
refers to as
Stochastic Graduate Student Descent,
meaning you just keep
hiring graduate students
to play with the hyperparameters
until the problem is solved
[laughter].
I have about
100+ slides on driver state,
which is the thing that
I'm most passionate about,
and I think we'll save the best for last.
I'll talk about that tomorrow.
We have a guest speaker
from the White House,
who will talk about the future
of Artificial Intelligence
from the perspective of policy,
and what I would like you to do, first
off, you registered students, is submit
the two tutorial assignments,
and pick up
can we just set the boxes right here
or something?
Just stop by and pick up a shirt.
And give us a card on the way.
Thanks guys.
[Applause]
