All right.  Hello everybody.  Welcome back.
Glad you came back.
Today,
we will unveil the first tutorial.
The first project is code named "DeepTraffic,"
where your task is to solve the
traffic problem using Deep Reinforcement Learning.
And I'll talk about what's involved in designing a network there.
How you submit your own network and how you participate in the competition.
As I said the winner gets a very special prize,
to be announced later.
What is machine learning?
There are several types.
There's supervised learning. As I mentioned yesterday, that's what is usually meant
when you talk about machine learning
and talk about its successes.
Supervised learning requires a data set
where you know the Ground Truth.
You know the inputs and the outputs.
And you provide that to the machine learning algorithm
in order to learn the mapping between the inputs and the outputs
in such a way that you can generalize to further examples in the future.
Unsupervised learning
is the other side,
when you know absolutely nothing about the outputs,
about the truth of the data that you're working with.
All you get is data and you have to find underlying structure,
underlying representation of the data that's meaningful
for you to accomplish certain tasks, whatever that is.
There is semi-supervised learning,
where only part of the data, usually a very small amount,
is labeled. Ground Truth is available for just a small fraction of it.
If you think of
images that are out there on the Internet
and then you think about ImageNet, a data set where every image is labeled,
the size of that ImageNet data set
is a tiny subset of all the images available online.
But that's the task we're dealing with as human beings,
as people interested in doing machine learning:
how to expand the size of
the part of our data that we know something confidently about.
And reinforcement learning sits somewhere in between.
It's semi-supervised learning where
there's an agent that has to exist in the world.
And that agent knows the inputs that the world provides
but knows very little about that world
except through occasional time delayed rewards.
This is what it's like to be human.
This is what life is about.
You don't know what's good and bad, you just have to live it
and, every once in a while,
you find out that all that stuff you did last week was a pretty bad idea.
That's reinforcement learning.
That's semi-supervised,
in the sense that only a small subset of the data
comes with some ground truth, some certainty,
that you then have to extract knowledge from.
So first, at the core of anything that currently works
in the practical sense,
there has to be some Ground Truth.
There has to be some truth
that we can hold on to as we try to generalize.
And that's supervised learning.
Even in Reinforcement Learning,
the only thing we can count on is that truth
that comes in the form of a reward.
So the standard supervised learning pipeline is
you have some raw data,
the inputs.
You have Ground Truth,
the labels,
the outputs that match the inputs.
You know the ground truth.
Then you run any kind of algorithm, whether it's a neural network
or another pre-processing algorithm
that extracts the features from that data set.
You can think of a picture of a face,
that algorithm could extract
the nose, the eyes, the corners of the eyes, the pupil
or even lower level features in that image.
After that we insert those features
into a model.
A machine learning model.  We train that model.
Then, whatever that algorithm is,
as it passes through that training process, we evaluate:
after we've seen this one particular example,
how much better are we at other tasks?
And as we repeat this loop,
the model learns to perform better and better
at generalizing from the raw data to the labels that we have.
And finally, you get to release that model into the wild
to actually do prediction
on data it has never seen before, that you don't know about.
And the task there
is to predict the labels.
 Okay.  So neural networks is what this class is about.
It's one of the machine learning algorithms
that has proven to be very successful.
And the computational building block of a neural network is a neuron.
A perceptron is a type of neuron.
It's the original old school neuron
 where the output is binary, a zero or one.
It's not real valued.
And the process that a perceptron goes through is
it has multiple inputs and a single output.
Each of the inputs has a weight on it.
Shown here on the left is 0.7, 0.6, 1.4.
Those weights are applied to the inputs.
And in a perceptron, the inputs are 1s or 0s -
binary.
Those weights are applied
and summed together, a bias on each neuron
is then added on top,
and then there's a threshold,
a test of whether that summed value
plus the bias is below or above a threshold.
If it's above the threshold, it produces a 1;
below the threshold, it produces a 0.
Simple.
So this is one of the only things we understand about neural networks confidently:
we can prove a lot of things about this neuron.
For example, what we know
is that a neuron can approximate a NAND gate.
A NAND gate is a logical operation,
a logical function, that
has two inputs, A and B,
here on the diagram on the left.
And the table shows what that function is:
when the inputs are both 0s, or a 0 and a 1
in any order, the output is a 1.
Otherwise, it's a 0.
The cool thing about a NAND gate is that it's a universal gate:
any computer you have,
including the phone in your pocket today,
can be built out of just NAND gates.
So it's functionally complete.
You could build any logical function out of them.
You stack them together in arbitrary ways.
The problem with NAND gates and computers
is they're built from the bottom up.
You have to design these circuits of NAND gates.
So the cool thing here is the perceptron,
we can learn.
This magical NAND gate, we can learn its function.
So let's go through how we can do that.
How a perceptron can perform the NAND operation.
There are the four examples.
If you put weights of -2 on each of the inputs
and a bias of 3 on the neuron,
and if we perform that same operation
of summing the weights times the inputs
plus the bias, in the top left,
when the inputs are both 0s and the sum is just the bias, we get a 3.
That's a positive number
which means the output of a perceptron will be a 1.
On the top right,
when the input is a 0 and a 1,
that sum is still a positive number, again produces a 1.
And so on.
When the inputs are both 1s, then the sum is a -1, less than zero, so the output is a 0.
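The four cases above can be checked with a minimal sketch. The weights of -2 and the bias of 3 come from the lecture; the function names and the choice to fire when the sum is strictly positive are assumptions made for illustration.

```python
def perceptron(inputs, weights, bias):
    """Binary perceptron: 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def nand(a, b):
    # Weights of -2 on each input and a bias of 3, as in the lecture.
    return perceptron([a, b], weights=[-2, -2], bias=3)


# Print the full truth table: inputs a, b and the perceptron output.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, nand(a, b))
```

Only the input pair (1, 1) drives the sum below zero, which is what makes this single neuron behave like a NAND gate.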
So while this is simple,
it's really important to think about.
It's sort of the one basic computational truth you can hold on to
as we talk about some of the magical things neural networks can do,
because if you compare a circuit of NAND gates
and a circuit of neurons,
the difference is that, while a circuit of neurons,
which is what we think of as a neural network,
can perform the same thing as a circuit of NAND gates,
what it can also do is learn.
It can learn the arbitrary logical functions
that an arbitrary circuit of NAND gates can represent
but it doesn't require the human designer.
It can evolve, if you will.
So one of the key aspects here, one of the key drawbacks of the perceptron,
is it's not very smooth in its output.
As we change the weights on the inputs
and we change the bias, and we tweak it a little bit,
it's very easy to make the neuron
output a 0 instead of a 1, or a 1 instead of a 0.
So when we start stacking many of these together,
it's hard to control the output of the thing as a whole.
Now the essential step that makes a neural network work,
where a circuit of perceptrons doesn't,
is that the output is made smooth;
it's made continuous with an activation function.
And so instead of using a step function
like a perceptron does, shown there on the left,
we use any kind of smooth function,
like a sigmoid, where the output can change gradually as you change the weights and the bias.
And this is a basic but critical step.
Learning is generally the process of adjusting those weights gradually
and seeing how it has an effect on the rest of the network.
You just keep tweaking weights here and there
and seeing how much closer you get to the Ground Truth.
And if you get farther away,
you just adjust the weights in the opposite direction.
That's neural networks in a nutshell.
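To make the smoothness point concrete, here is a small sketch comparing a step activation with a sigmoid near the threshold. The specific values -0.01 and 0.01 are arbitrary illustrations, not numbers from the lecture.

```python
import math


def step(z):
    """Perceptron-style activation: jumps from 0 to 1 at the threshold."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Smooth activation: changes gradually as z changes."""
    return 1.0 / (1.0 + math.exp(-z))


# A tiny tweak to the weights moves the weighted sum across the threshold.
z_before, z_after = -0.01, 0.01

step_jump = step(z_after) - step(z_before)            # output flips completely
sigmoid_change = sigmoid(z_after) - sigmoid(z_before)  # output barely moves

print("step change:   ", step_jump)
print("sigmoid change:", sigmoid_change)
```

With the step function the same tiny tweak flips the output, while the sigmoid output moves only slightly, which is what lets you adjust weights gradually and watch the effect.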
What we'll mostly talk about today is feed-forward neural networks,
on the left, going from inputs to outputs
with no loops. There are also
these amazing things called recurrent neural networks.
They're amazing because they have memory.
They have a memory of state;
they remember the temporal dynamics of the data they went through.
But the painful thing is that they're really hard to train.
Today we'll talk about feed-forward neural networks.
So let's look at this example,
an example of stacking a few of these neurons together.
Let's think of the task,
the basic task, now famous, of classifying handwritten numbers.
You have an image of a handwritten number
and your task is, given that image, to say what number is in that image.
Now, what is an image?
An image is a collection of pixels; in this case 28 X 28 pixels.
That's a total of 784 numbers; those numbers are from 0 to 255.
And on the left of the network,
the size of that input, despite the diagram, is 784 neurons.
That's the input.
Then comes the hidden layer.
It's called the hidden layer because
it has no interaction with the input or the output.
The hidden layer is simply a block
at the core of the computational power of neural networks.
It's tasked with forming a representation of the data
in such a way that it maps from the inputs to the outputs.
In this case, there are fifteen neurons in the hidden layer.
There are ten values on the output,
corresponding to each of the numbers.
There are several ways you can build this kind of network,
and this is part of the magic of neural networks: you can do it in a lot of ways.
You only really need 4 outputs to represent values 0 through 9.
But in practice, it seems that having 10 outputs works better.
And how do these work?
Whenever the input is a 5, the output neuron
in charge of the 5 gets really excited
and outputs a value that's close to 1, in the range from 0 to 1.
And then the other ones output a value, hopefully, that is close to 0.
And when they don't, we adjust the weights in such a way that they get closer to zero
or closer to one, depending on whether this is the correct neuron associated with the picture.
We'll talk about the details of this training process more tomorrow when it's more relevant
but what we've discussed just now is the forward pass through the network.
It's the pass when you take the inputs, apply the weights,
sum them together, add the bias, produce the output,
and check which of the outputs produces the highest confidence of the number.
Then, once those probabilities for each of the numbers are provided,
we determine the gradient that's used
to punish or reward the weights
that resulted in either the correct or the incorrect decision.
And that's called Back Propagation.
We step backwards through the network applying those punishments or rewards.
Because of the smoothness of the activation functions,
that is a mathematically efficient operation.
That's where GPUs step in.
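The forward pass described above can be sketched in plain Python for the 784-15-10 network. The layer sizes come from the lecture; the random weights, the random stand-in image, the sigmoid activation on both layers, and all function names are assumptions for illustration, so the predicted digit here is meaningless until the network is trained.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid, per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Untrained layer: small random weights, zero biases (an arbitrary choice)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
w_hidden, b_hidden = random_layer(784, 15, rng)   # 784 inputs -> 15 hidden neurons
w_out, b_out = random_layer(15, 10, rng)          # 15 hidden -> 10 outputs

# Stand-in for a 28 x 28 image: 784 pixel values scaled from 0..255 to 0..1.
image = [rng.randint(0, 255) / 255.0 for _ in range(784)]

hidden = layer(image, w_hidden, b_hidden)
outputs = layer(hidden, w_out, b_out)

# The output neuron with the highest value is the network's guess.
predicted_digit = max(range(10), key=lambda i: outputs[i])
print(predicted_digit, outputs[predicted_digit])
```

Back propagation, which is not shown here, would then adjust `w_hidden`, `b_hidden`, `w_out`, and `b_out` based on how far `outputs` is from the ground truth.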
So for the example of numbers, the Ground Truth for the number 6
looks like the following in the slides:
y(x) is a 10-dimensional vector
where only one of the values, the one for 6, is a 1; the rest are zero.
That's the Ground Truth that comes with the image.
The loss function here, the basic loss function, is the squared error.
y(x) is the Ground Truth and a is the output of the neural network
resulting from the forward pass.
So when you input that image of a 6, whatever the network outputs,
that's "a", a 10-dimensional vector.
And it's summed over the inputs to produce the squared error.
That's our loss function, the objective function.
That's what is used to determine
how much to reward or punish the Back Propagated weights throughout the network.
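As a rough sketch of that loss for a single image, the snippet below builds the ground-truth vector for a 6 and compares it with two made-up network outputs. The output values are invented for illustration, and the sketch omits the sum over all training inputs and any constant factor the slide may include.

```python
def one_hot(digit, size=10):
    """Ground-truth vector y(x): 1 at the position for the digit, 0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(size)]


def squared_error(target, output):
    """Sum of squared differences between ground truth y and network output a."""
    return sum((y - a) ** 2 for y, a in zip(target, output))


y = one_hot(6)

# Hypothetical network outputs: one confident in 6, one confident in 3.
a_right = [0.05] * 10
a_right[6] = 0.9
a_wrong = [0.05] * 10
a_wrong[3] = 0.9

loss_right = squared_error(y, a_right)
loss_wrong = squared_error(y, a_wrong)
print(loss_right, loss_wrong)
```

The output that is confident in the wrong digit gets a much larger loss, and that difference is what drives the size of the correction during back propagation.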
And the basic operation of optimizing that loss function, of minimizing that loss function,
is done with various variants of gradient descent.
It's hopefully a somewhat smooth function
but it's a highly non-linear function.
This is why we can't prove much about neural networks:
it's a high-dimensional, highly non-linear function that's hopefully smooth enough
that gradient descent can find its way to at least a good solution.
And there has to be some stochastic element there
that jumps around to ensure that it doesn't get stuck
in a local minimum of this very complex function.
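To tie this back to the NAND example, here is a minimal sketch of gradient descent on the squared error for a single sigmoid neuron. It uses plain full-batch gradient descent with no stochastic element, and the learning rate, iteration count, and zero initialization are assumptions chosen for this toy problem rather than anything from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# NAND truth table: ((input_a, input_b), target_output)
examples = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w = [0.0, 0.0]
b = 0.0
learning_rate = 0.5   # assumed value

for _ in range(10000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for (x1, x2), y in examples:
        a = sigmoid(w[0] * x1 + w[1] * x2 + b)
        # Chain rule: d/dz of (a - y)^2, where a = sigmoid(z)
        delta = 2.0 * (a - y) * a * (1.0 - a)
        grad_w[0] += delta * x1
        grad_w[1] += delta * x2
        grad_b += delta
    # Step against the gradient to reduce the loss.
    w[0] -= learning_rate * grad_w[0]
    w[1] -= learning_rate * grad_w[1]
    b -= learning_rate * grad_b

# Threshold the smooth output at 0.5 to read off a binary answer.
learned = [1 if sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5 else 0
           for (x1, x2), _ in examples]
print(w, b, learned)
```

The learned weights end up negative and the bias positive, the same shape as the hand-picked -2, -2, and 3, except that here nobody designed them.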
Okay, that's supervised learning:
there's inputs, there's outputs.  Ground Truth.
That's our comfort zone,
we're pretty confident we know what's going on.
All you have to do is, you have this data set,
you train a network on that data set and you can evaluate it.
You can write a paper and try to beat a previous paper.  It's great.
The problem is when you then use that neural network
to create an intelligent system that you put out there in the world,
and now that system is no longer working with your data set.
It has to exist in this world that's
maybe very different from the Ground Truth.
So the takeaway from supervised learning
is that neural networks are great at memorization
but, in a sort of philosophical way, they might not be great at generalizing,
at reasoning beyond the specific flavor of data set that they were trained on.
The hope for reinforcement learning is that
we can extend the knowledge we gain in a supervised way
to the huge world outside where we don't have
the Ground Truth of how to act, how good a certain state is,
or how bad a certain state is. This is a kind of brute force reasoning.
And I'll talk about, kind of what I mean there, but it feels like
it's closer to reasoning as opposed to memorization.
That's a good way to think of supervised learning - is memorization.
You're just studying for an exam.
And as many of you know,
that doesn't mean you're going to be successful in life just because you get an A.
And so, a reinforcement learning agent, or just any agent,
a human being or any machine existing in this world,
can operate in the following way. From the perspective of the agent,
it can execute an action,
it can receive an observation resulting from that action
in the form of a new state, and it can receive a reward or punishment.
You can break down our existence in this way. It's a simplistic view,
but it's a convenient one on the computational side.
And from the environment side,
the environment receives the action and emits the observation.
So your action changes the world, therefore, that world has to change
and then tell you about it and give you a reward or punishment for it.
So, again, one of the most fascinating things,
and I'll try to convey why this is fascinating a little bit later on,
is the work of DeepMind on Atari.
This is Atari Breakout, a game where a paddle has to move around.
That's the world it's existing in. The agent is the paddle
and there's a bouncing ball,
and your actions are: move right, move left.
You are trying to move in such a way that the ball doesn't get past you.
And so, here is a human level performance of that agent.
And so what does this paddle have to do?
It has to operate in this environment; it has to act:  move left, move right.
Each action changes the state of the world.
It may seem obvious but
moving right changes visually the state of the world.
In fact what we're watching now on the slides
is the world changing before your eyes for this little guy.
And it gets rewards or punishments.
Rewards it gets in the form of points;
it's racking up points in the top left of the video.
And then when the ball gets past the paddle, it gets punished by "dying".
And that's the number of lives that's left,
going from 5 to 4 to 3, down to 0.
And so the goal is to select at any one moment
the action that maximizes future reward.
Without any knowledge of what a reward is
in the greater sense of the word,
all you have is an instantaneous reward or punishment,
instantaneous response of the world to your actions
and this can be modeled as a Markov decision process.
A Markov decision process is a mathematically convenient construct.
It has no memory,
all you get is you have a state that you're currently in.
You perform an action, you get a reward.
And you find yourself in a new state.  And that repeats over and over.
You start from state 0, you go to state 1.
You once again repeat an action, get a reward for the next state.
OK that's the formulation that we're operating in.
When you're in a certain state,
you have no memory of what happened two states ago.
Everything is operating on the instantaneous.
Instantaneously.
And so what are the major components of a reinforcement learning agent?
There's a policy.
That's the function that, broadly defined, determines an agent's behavior.
That includes the knowledge of, for any given state,
what is the action that I will take, with some probability.
The value function is how good each state, and each action in any particular state, is.
And there's a model.
Now this is a subtle thing that is
actually the biggest problem with everything you'll see today:
the model is how we represent the environment.
And we'll see today some amazing things
that neural networks can achieve
on a relatively simplistic model of the world
and the question whether that model can extend to the real world
where human lives are at stake in the case of driving.
So let's look at the simplistic world.  A robot in a room.
You start at the bottom left,
Your goal is to get to the top right.
Your possible actions are going up, down, left and right.
Now this world can be deterministic
which means when you go up, you actually go up.
Or it could be non-deterministic, as human life is,
because when you go up, sometimes you go right.
So in this case if you choose to go up, you move up 80% of the time.
You move left 10% of the time
and you move right 10% of the time.
And when you get to the top right you get a reward of +1
and when you get to the second block from the top,
at 4,2, you get -1.  You get punished.
And every time you take a step you get a slight punishment, a -0.04.
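The rules of this little world can be written down as a single step function. The 80/10/10 split, the +1 and -1 blocks, and the -0.04 step cost come from the lecture; the 4 by 3 grid size, the blocked cell at (2, 2), the coordinate convention, and the choice to give only the terminal reward on the final move are assumptions based on the classic textbook version of this example.

```python
import random

WIDTH, HEIGHT = 4, 3                       # assumed grid size, cells are (column, row)
WALL = {(2, 2)}                            # assumed blocked cell
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}     # top right is +1, the block below it is -1
STEP_REWARD = -0.04

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# The two sideways directions the agent may slip into for each intended move.
SLIPS = {"up": ("left", "right"), "down": ("right", "left"),
         "left": ("down", "up"), "right": ("up", "down")}


def step(state, action, rng):
    """Return (next_state, reward, done) for one stochastic move."""
    roll = rng.random()
    if roll < 0.8:
        actual = action              # 80%: move as intended
    elif roll < 0.9:
        actual = SLIPS[action][0]    # 10%: slip to one side
    else:
        actual = SLIPS[action][1]    # 10%: slip to the other side
    dx, dy = MOVES[actual]
    nxt = (state[0] + dx, state[1] + dy)
    # Bumping into the boundary or the wall leaves the agent where it was.
    if not (1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT) or nxt in WALL:
        nxt = state
    if nxt in TERMINAL:
        return nxt, TERMINAL[nxt], True
    return nxt, STEP_REWARD, False


# Starting at the bottom left and choosing "up" many times.
rng = random.Random(0)
trials = 10000
results = [step((1, 1), "up", rng) for _ in range(trials)]
fraction_up = sum(r[0] == (1, 2) for r in results) / trials
print(fraction_up)
```

Choosing "up" from the bottom left actually moves the agent up only about 80% of the time, which is why a policy that is fine in the deterministic world can be a bad one here.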
Okay.  So the question is,
 if you start at the bottom left, is this a good solution?
Is this a good policy by which you exist in the world?
And it is if the world is deterministic.
If whenever you choose to go up, you go up.
Whenever you choose to go right, you go right.
But if the actions are stochastic, that's not the case.
With what I described previously, a probability of 0.8 of going up
and a probability of 0.1 each of going left and right,
this is the optimal policy.
Now if we punish every single step with a -2 as opposed to a -0.04,
every time you take a step, it hurts.
You're going to try to get to the positive block as quickly as possible
and that's what this policy says.
I'll walk through a negative one if I have to
as long as I stop getting a -2.
Now if the reward for each step is a -0.1,
you might choose to go around that -1 block,
slight detour to avoid the pain.
And then you might take an even longer detour
as the reward for each step goes up or the punishment goes down, I guess.
And then if there is an actual positive reward for every step you take
you'll avoid going to the finish line.
You'll just wander the world.
We saw that with Coast Runners yesterday,
the boat that chose not to finish the race
because it was having too much fun getting points in the middle.
So let's look at the world that this agent is operating in as a value function.
Now the value function depends on a reward,
the reward that comes from the future,
and that reward is discounted because the world is stochastic.
We can't expect the reward to come along to us in the way that
we hope it does based on the policy, based on the way we choose to act.
And so there's a gamma there that, over time,
as the reward is farther and farther into the future, discounts that reward.
Diminishes the impact of that future reward
in your evaluation of the current state.
And so your goal is to develop a strategy
that maximizes the discounted future reward.
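A short sketch of the discounted sum may help here. The reward sequence (four steps at -0.04 followed by the +1 block) and the gamma of 0.9 are illustrative choices, not values given in the lecture.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps in the future is scaled by gamma**t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Four ordinary steps, then reaching the +1 block.
rewards = [-0.04, -0.04, -0.04, -0.04, 1.0]

undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.9)
print(undiscounted, discounted)
```

With gamma below 1, the same +1 is worth less the further away it is, so the agent is pushed toward reaching it sooner.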
The sum, the discounted sum. And in reinforcement learning
there are a lot of approaches for coming up with a good policy,
a near optimal, an optimal policy.
There's a lot of fun math there.
You could try to construct a model
that optimizes some estimate of this world.
You can try, in the Monte Carlo way,
to just simulate that world and see how it unrolls.
And as it unrolls you try to compute the optimal policy.
Or what we'll talk about today is Q learning.
It's an off policy approach,
where the policy is estimated as we go along.
The policy is represented as a Q-Function.
The Q-Function shown there on the left is,
I apologize for the equations,
I lied.  There'll be some equations.
The input to the Q-Function is a state at time t, "st",
and an action you choose to take in that state, "at".
And your goal is, in that state,
to choose an action which maximizes the reward in the next step.
And what Q-Learning does, and I'll describe the process,
is it's able to approximate through experience the optimal Q-Function,
the optimal function that tells you how to act in any state of the world.
You just have to live it.
You have to simulate this world.
You have to move about it.
You have to explore in order to see every possible state,
try every different action, get rewarded, get punished,
and figure out what is the optimal thing to do.
That's done using this Bellman equation.
On the left, the output, is the new estimate,
the Q-Function estimate for that state and that action.
And this is the update rule at the core of Q-Learning.
You take the old estimate and,
based on the learning rate alpha from 0 to 1,
you update the evaluation of that state
based on the new reward that you received at that time.
So you've arrived in a certain state "st".
You tried to do an action
and then you got a certain reward
and you update your estimate of that state
and action pair based on this rule.
When the learning rate alpha is 0, you don't learn.
You never change your world view
based on the new incoming evidence.
When alpha is 1, you change your world evaluation every time based on the new evidence.
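The update rule can be sketched as one small function over a table. This assumes the standard form of the Q-Learning update, with the table stored as a dictionary keyed by (state, action); the state names, action names, and numbers are made up for illustration.

```python
def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One Q-Learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


ACTIONS = ("left", "right")
q_table = {("s0", "right"): 0.5, ("s1", "left"): 0.2, ("s1", "right"): 1.0}

# Same transition, three learning rates. New evidence: 0.1 + 0.9 * 1.0 = 1.0
no_learning = q_update(dict(q_table), "s0", "right", 0.1, "s1", ACTIONS, alpha=0.0, gamma=0.9)
halfway = q_update(dict(q_table), "s0", "right", 0.1, "s1", ACTIONS, alpha=0.5, gamma=0.9)
overwrite = q_update(dict(q_table), "s0", "right", 0.1, "s1", ACTIONS, alpha=1.0, gamma=0.9)
print(no_learning, halfway, overwrite)
```

With alpha at 0 the old estimate of 0.5 is kept, with alpha at 1 it is replaced by the new evidence, and anything in between blends the two.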
And that's the key ingredient to Reinforcement Learning.
First you explore, then you exploit.
First, you explore in a non-greedy way and then you get greedy.
You figure out what's good for you and you keep doing it.
So if you wanted to learn an Atari game,
first you try every single action in every state, you screw up,
get punished, get rewarded and, eventually, you figure out
what's actually the right thing to do and you just keep doing it.
And that's how you win against the greatest human players in the world
in a game of "Go" for example, as we'll talk about.
And the way you do that is you have an "Epsilon Greedy Policy"
where, with a probability of 1 - Epsilon,
you perform an optimal Greedy action.
With a probability of Epsilon, you perform a random action.
A random action being exploration.
And so, as epsilon goes down from 1 to 0 you explore less and less.
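Here is a minimal sketch of that coin flip, together with one possible way of lowering epsilon over time. The linear schedule and all names are assumptions; the lecture only says that epsilon goes down from 1 to 0.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """q_values maps each action to its estimated Q-value in the current state."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))        # explore: uniformly random action
    return max(q_values, key=q_values.get)       # exploit: best known action


def decayed_epsilon(step, total_steps, start=1.0, end=0.0):
    """Linear decay from start to end over total_steps (one assumed schedule)."""
    fraction = min(step / total_steps, 1.0)
    return start + fraction * (end - start)


rng = random.Random(0)
q_values = {"left": 0.2, "right": 0.7}

for step in (0, 50, 100):
    eps = decayed_epsilon(step, total_steps=100)
    print(step, eps, epsilon_greedy(q_values, eps, rng))
```

Note that exploration here ignores the Q-values entirely; it does not prefer the second-best action over the worst one.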
So the algorithm here is really simple.
On the bottom of the slide there is the algorithm version,
the pseudo code version of the equation.
The Bellman equation update.
You initialize your estimate of state action pairs arbitrarily,
with a random number.  This is an important point.
When you start playing or living or
doing whatever you're doing with Reinforcement Learning, or driving,
you have no preconceived notion of what's good and bad. It's random,
or however you choose to initialize it.
And the fact that it learns anything is amazing.
I want you to remember that.
That's one of the amazing things about Q-Learning at all,
and then the deep neural network version of Q-Learning.
The algorithm repeats the following steps.
You step into the world, observe an initial state, and select an action "a".
That action, if you're exploring, will be a random action;
if you're greedily pursuing the best action you can,
it will be the action that maximizes the Q-Function.
You observe a reward after you take the action,
and a new state that you find yourself in.
And then you update your estimate of the previous state you were in,
having taken that action using that Bellman Equation Update.
And repeat this over and over.
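Putting those steps together, here is a sketch of the whole loop on a deliberately tiny world. The five-cell corridor, its rewards, the hyperparameters, and the epsilon floor of 0.05 are all hypothetical choices made so the example runs quickly; they are not the environment from the slides.

```python
import random

N_STATES = 5                  # corridor cells 0..4; cell 4 is the goal (toy world)
ACTIONS = ("left", "right")


def env_step(state, action):
    """Deterministic corridor: -0.04 per step, +1 on reaching the goal."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    if nxt == N_STATES - 1:
        return nxt, 1.0, True
    return nxt, -0.04, False


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Initialize the estimates arbitrarily: small random numbers.
    q = {(s, a): rng.uniform(-0.1, 0.1)
         for s in range(N_STATES - 1) for a in ACTIONS}
    for episode in range(episodes):
        # Explore a lot early, then get greedy (floor of 0.05 is an assumption).
        epsilon = max(0.05, 1.0 - episode / (episodes / 2))
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Bellman equation update for the state and action just taken.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


q = train()
policy = greedy_policy(q)
print(policy)
```

Starting from random estimates, the table ends up preferring "right" in every cell, which is the shortest way to the goal in this corridor.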
And so there on the bottom of the slide is a summary of life.
Yes.
(CHUCKLING)
Q-Function?  Yes, yes.  Yeah, it's a single- The question was
is the Q-Function a single value?
And yes, it's just a single continuous value.
So the question was:  "how do you model the world?"
So let's start with this very simplistic world of the Atari paddle.
You could model it as a paddle that
can move left and right, and there are some blocks,
and you model the physics of the ball.
That requires a lot of expert knowledge in that particular game.
So you sit there hand crafting this model.
That's hard to do even for a simplistic game.
The other model you could take
is looking at this world in the way humans do, visually.
So take the world in as a set of pixels.
The model is just all the pixels of the world.
You know nothing about paddles or balls or physics
or colors and points, they're just pixels coming in.
That seems like a ridiculous model of the world
but it seems to work for Atari.
It seems to work for human beings.
When you're born, you see there's light coming into your eyes
and, as far as we know,
you don't come with instructions when you're born,
like: there are people in the world,
there are good guys and bad guys,
and this is how you walk.
No, all you get is light, sound and the other sensors.
And you get to learn about every single thing. What you think of as
the way you model the world is a learned representation,
and we will talk about how a neural network does that.
It learns to represent the world. But if we have to hand model the world,
it's an impossible task.
That's the question. If we have to hand model the world,
then that world better be a simplistic one.
Yeah.
That's a great question.
And so the question was:  "what is the robustness of this model
if the way you represent the world is at all, even slightly, different
from the way you thought that world is?"
That's not that well studied as far as I'm aware.
I mean, it's already amazing that,
if you have a certain input of the world,
if you have a certain model of the world, you can learn anything.
That's already amazing.
The question, and it's an important one,
and we'll talk a little bit about it,
is not about the world model but the reward function.
If the reward function is slightly different,
if the real reward function of life or driving or of Coast Runners
is different than what you expected it to be,
What's the negative there?
Yes, it could be huge.
(CHUCKLING)
There's another question or no?
Oh, no.  Yes.
Sorry, can you ask that again?
Yes, you can change it over.  So the question was:
"do you change alpha value over time?"
You certainly should change alpha value over time, yes.
So the question was:  "what is the complex interplay
 of the Epsilon Function with the Q-Learning Update?"
That's 100% fine-tuned to the particular learning problem.
So you certainly want to-
The more complex, the larger the number of states in the world
and the larger the number of actions,
the longer you have to wait
before you decrease the Epsilon to 0 but you have to play with it.
And it's one of the parameters you have to play with, unfortunately,
and there's quite a few of them
which is why you can't just drop a Reinforcement Learning agent into the world.
Oh, the effect in that sense?  No, no.  It's just a coin flip.
And if that Epsilon is 0.5,
half the time you're going to take a random action.
So there's no specific-
It's not like you'll take the best action
and then with some probability take the second best, and so on.
I mean you can certainly do that
but in the simple formulation that works, you just take a random action
because you don't want to have a preconceived notion of
what's a good action to try when you're exploring.
The whole point is you try crazy stuff, if it's a simulation.
So, good question.  So representation matters.
This is the question about how we represent the world.
So we can think of this world of Breakout, for example,
of this Atari game, as a paddle that moves left and right
and the exact position of the different things you can hit,
to construct this complex model,
this expert driven model that has to be fine-tuned to this particular problem.
But in practice the more complex this model gets,
the worse that Bellman Equation Update gets.
Trying to construct a Q-Function value
for every single combination of states and actions
becomes too difficult because that function is too sparse and huge.
So think of looking at this world in a general way,
in the way human beings would, visually, as a collection of pixels.
If you just take in
this game as a collection of 84 by 84 pixels, an image, an RGB image,
and then you look at not just the current image,
but look at the temporal trajectory of those images,
so like if there's a ball moving you want to know about that movement,
so you look at 4 images, the current image and 3 images back,
and say they're gray scale with 256 gray levels, then the size of the Q-Table
that the Q value function has to learn is
whatever that number is, but it's certainly larger than
the number of atoms in the universe.  That's a large number.
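For a sense of scale, the number of distinct states can be estimated with a few lines of arithmetic. The 84 by 84 frames, 4 frames, and 256 gray levels are from the lecture; the figure of about 10 to the power 80 atoms in the observable universe is a commonly quoted rough estimate and is an assumption here.

```python
import math

FRAME_SIDE = 84
FRAMES = 4
GRAY_LEVELS = 256

pixels_per_state = FRAME_SIDE * FRAME_SIDE * FRAMES

# Number of states is GRAY_LEVELS ** pixels_per_state, far too large to print
# comfortably, so work with its base-10 logarithm instead.
log10_states = pixels_per_state * math.log10(GRAY_LEVELS)

LOG10_ATOMS_IN_UNIVERSE = 80   # rough, commonly quoted order of magnitude

print("pixels per state:", pixels_per_state)
print("number of states is about 10 to the power", round(log10_states))
print("atoms in the universe is about 10 to the power", LOG10_ATOMS_IN_UNIVERSE)
```

The state count has tens of thousands of decimal digits before you even multiply by the number of actions, so a table with one entry per state is out of the question.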
So you have to run the simulation long enough
to touch, at least a few times, most of the states in that Q-Table.
So as Elon Musk says you may need to run,
you know, we live in a simulation,
and you may have to run a universe
just to compute the Q-Function in this case.
So that's where deep learning steps in:
instead of modeling the world as a Q-Table,
you estimate, you try to learn that function.
And so, the takeaway from supervised learning, if you remember,
is that it's good at memorizing data.
The hope for reinforcement learning
with Q-Learning is that we can extend
the occasional rewards we get to generalize over the operation,
the actions you take in that world leading up to the rewards.
And the hope for deep learning is that we can move this
reinforcement learning system into a world
that can be defined arbitrarily.
It can include all the pixels of an Atari game,
it can include all the pixels sensed by a drone, a robot or a car.
But it still needs a formalized definition of that world,
which is much easier to do when you're able to take in sensors like an image.
So Deep Q-Learning, the deep version.
So instead of learning a Q-Table, a Q-Function,
we try to estimate that Q-Prime.
We try to learn it using machine learning.
It tries to learn some parameters of this huge complex function.
We try to learn it, and the way we do that is we have a neural network,
the same kind that I showed that learned the numbers,
to map from an image
to a classification of that image into a number.
The same kind of network is used to take in a state
and an action and produce a Q-Value.
Now here's the amazing thing:
that without knowing anything in the beginning,
as I said, with a Q-Table it's initialized randomly,
the Q-Function, this deep network, knows nothing in the beginning.
All it knows is, in the simulated world, the rewards you get
for a particular game, so you have to play time and time again
and see the rewards you get for every single iteration of the game.
But in the beginning it knows nothing.
And it's able to learn to play better than human beings.
This is a DeepMind paper,
"Playing Atari with Deep Reinforcement Learning," from 2013.
It's one of the key things that got everybody excited
about the role of deep learning in artificial intelligence:
that using a convolutional neural network,
which I'll talk about tomorrow,
but it's a vanilla network, like any other,
like I talked about earlier today, just a regular network
that takes the raw pixels, as I said, and estimates that Q-Function
from the raw pixels, it is able to play many of those games
better than a human being.
And the loss function that I mentioned previously,
so, again, a very vanilla loss function,
a very simple objective function,
the first one you'll probably implement.
We have a tutorial on TensorFlow.
Squared error.  So we take this Bellman Equation,
where the estimate,
the Q-Function estimate of state and action,
is the maximum reward you get for taking any of the actions
that take you to any of the future states.
And you try to take that action, observe the result of that action,
and if the target is different from your learned target,
if what the function has learned, the expected reward in that case,
is different from what you actually got, you adjust it.
You adjust the weights of the network.
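
The squared error against the Bellman target can be sketched like this in Python. The discount factor gamma and its value of 0.9 are assumptions for the example, as are the function names; the terminal flag is added so that the last step of a game has no future term.

```python
def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Bellman target: reward now plus the discounted best Q-value of the next state."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_error(predicted_q, target):
    """The squared-error loss between the network's estimate and the target."""
    return (target - predicted_q) ** 2


# Hypothetical numbers: the network predicted 1.0 for the action we took,
# we observed a reward of 1.0, and these are its Q-values for the next state.
predicted = 1.0
target = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0], gamma=0.9)
loss = squared_error(predicted, target)
print(target, loss)
```

The larger this loss, the more the weights get adjusted for that state and action.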
And this is exactly the process by which we learn
how to exist in this pixel world.
So you're mapping states and actions to a Q-Value.
The algorithm is as follows.
This is how we train it.
We're given a transition: S, the current state; A, the action taken in that state;
R, the reward you get; and S-Prime,
the state you find yourself in.
And so we replace the basic update rule
in the previous pseudocode
by taking a forward pass through the network given that state S.
We look at what the predicted Q-value is of that action.
We then do another forward pass through that network
and see what we actually get.
And then if we're totally off, we punish,
we backpropagate the weights in a way that
next time we'll make less of that mistake.  And you repeat this process.
This is a simulation.
You're learning against yourself.
And again, the same rule applies here: exploration versus exploitation.
You start out with an epsilon of 1, so you are mostly exploring.
And then you move towards an epsilon of 0.
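
One simple way to get that behavior is to anneal epsilon linearly as training goes on. This Python sketch assumes that schedule; the step counts and the function names are made up for the example and are not taken from the paper.

```python
import random


def epsilon_at(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal epsilon from start to end over total_steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def choose_action(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_estimates = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions
for step in (0, 50, 100):
    eps = epsilon_at(step, total_steps=100)
    print(step, eps, choose_action(q_estimates, eps, rng))
```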
And with Atari Breakout, this is the DeepMind paper result:
training epochs on the x-axis,
and on the y-axis the average action value
and the average reward per episode.
I'll show why it's kind of an amazing result, but it's messy
because there's a lot of tricks involved.
So it's not just putting in a bunch of pixels of a game
and getting an agent that knows how to win at that game.
There's a lot of pre-processing and playing with the data required.
Which is unfortunate, because the truth is messier than the hope,
but one of the critical tricks needed is called experience replay.
So as opposed to letting an agent-
So you're learning this big network that tries
to build a model of what's good to do in the world and what's not.
And you're learning as you go.
With experience replay you're keeping track
of all the things you did.
And every once in a while, you look back into your memory
and pull out some of those old experiences,
the good old times, and train on those again.
As opposed to letting the agent run itself into some local optimum
where it tries to learn a very subtle aspect of the game
that actually, in the global sense,
doesn't get you any closer to winning the game.
Very much like life.
So here's the algorithm, the deep Q-learning algorithm pseudocode.
We initialize the replay memory,
again there's this little trick that's required,
keeping track of stuff that's happened in the past.
We initialize the action value function Q with random weights
and observe the initial state, again same thing.
Select an action: with probability epsilon
explore, otherwise choose the best one
based on the estimate provided by the neural network.
And then carry out the action, observe the reward
and store that experience in the replay memory
and then sample a random transition from replay memory.
So with a certain probability, you bring those old times back
to get yourself out of the local minima,
and then you train the Q-network
using the difference between what you actually got
and your estimate, and you repeat this process over and over.
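
The replay memory itself can be very small. Here is a sketch in Python, assuming a fixed-capacity buffer that drops the oldest experience when it is full and samples uniformly at random; the class name, the capacity and the batch size are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall off when full
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Pull out a random handful of old experiences to train on again."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=1000, seed=0)
for t in range(5):
    # Hypothetical transitions; in a real agent these come from playing the game.
    memory.store(state=t, action=0, reward=0.0, next_state=t + 1)

for state, action, reward, next_state in memory.sample(batch_size=2):
    print(state, action, reward, next_state)  # each of these would feed one training step
```

Sampling at random, rather than always training on the most recent step, is what keeps the network from fixating on whatever it happened to see last.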
So here's what you can do after ten minutes of training.
On the left, so that's very little training, what you get is
a paddle that learns hardly anything and it just keeps dying.
It goes from 5 to 4 to 2 to 2 to 1. Those are the number of lives left.
Then after two hours of training on a single GPU,
it learns to win, you know, not die, rack up points,
and learns to keep the ball from passing the paddle, which is great.
That's human level performance really, better than some humans,
you know, but it still dies sometimes so it's very human level.
And then after four hours it does something really amazing.
It figures out how to win the game in a very lazy way,
which is drill a hole through the blocks up to the top
and get the ball stuck up there.
And it does all the hard work for you.
That minimizes the probability of the ball getting past your paddle
because it's just stuck in the blocks up top.
So that might be something
that you wouldn't even figure out to do yourself.
And that's-  I need to sort of pause here
to clearly explain what's happening.
The input to this algorithm is just the pixels of the game.
It's the same thing that human beings take in
through visual perception, and it's able to learn
under this constrained definition of what is a reward and a punishment.
It's able to learn to get a high reward.
That's general artificial intelligence.
A very small example of it, but it's general.
It's general purpose, it knows nothing about games
and knows nothing about paddles or physics.
It just takes in the sensory input of the game,
and they did the same thing for a bunch of different games in Atari.
And what's shown here in this plot on the x-axis
is a bunch of different games from Atari
and on the y-axis is a percentile
where 100% is about the best that human beings can do.
Meaning it's the score that human beings would get.
So everything about there in the middle, everything to the left of that
is far exceeding human-level performance
and below that is on par or worse than human performance.
So it can learn so many-  Boxing, Pinball,
all of these games, and it doesn't know anything
about any of the individual games, it's just taking in pixels.
It's just as if you put a human being
behind any of these games and
asked them to learn to beat the game.
And there's been a lot of improvements in this algorithm recently.
Yes, question.
No.  So the question was:  "Do they customize the model
for a particular game?"
And no, the point-  You could, of course, but the point is
it doesn't need to be customized for the game, but
the important thing is that it's still only on Atari games.
Alright, so the question is whether this is transferable to driving.
Perhaps not.
Right, you play the game where you do.
No, you don't have the-  Well, yeah, you play one step of the game.
So you take an action in a state and then you observe that.
So you have that simulation.
I mean, really that's one of the biggest problems here
is you require the simulation in order to get the Ground Truth.
So that's a great question or comment.
The comment was that for a lot of these situations,
the reward function might not change at all depending on your actions.
The rewards are really, most of the time, delayed
10, 20, 30 steps down the line, which is why
it is amazing that this works at all, that it's learning locally.
And through that process of simulation,
of hundreds of thousands of runs through the game,
it's able to learn what to do now such that I get a reward later.
If you just pause and look at the math of it,
it's very simple math, and then look at the result, it's incredible.
So there's a lot of improvements.
This one is called the General Reinforcement Learning Architecture, Gorila.
The cool thing about this, in the simulated world at least,
is that you can run deep reinforcement learning in a distributed way.
You could do the simulation in a distributed way,
you can do the learning in a distributed way,
you can generate experiences, which is what this kind of diagram shows,
either from human beings or from simulation.
So for example, the way that AlphaGo, the DeepMind team,
beat the game of Go
is they learned from both expert games and by playing against itself.
So, you can do this in a distributed way and
you could do the learning in a distributed way so you can scale.
And in this particular case, Gorila has achieved
a better result than the DQN network,
and that's part of their Nature paper.
Okay, so let me now get to driving for a second here,
where reinforcement learning,
where reinforcement learning can step in and help.
So this is back to the open question I asked yesterday:
is driving closer to chess or to everyday conversation?
Chess, meaning it can be formalized in a simplistic way,
and you could think about it as an obstacle avoidance problem,
and once the obstacle avoidance is solved,
you just navigate that constrained space:
you choose to move left, you choose to move right in a lane,
you choose to speed up or slow down.
Well, it's a game like chess, which we'll assume for today.
As opposed to tomorrow, for today we're going to go with the one on the left
and we're going to look at DeepTraffic.
Here is this game, this simulation,
where the goal is to achieve the highest average speed you can
on this seven-lane highway full of cars.
And so, as a side note for students, the requirement is
they have to follow the tutorial that I'll present a link for
at the end of this presentation.
And what they have to do is achieve a speed,
build a network that achieves a speed of 65 miles an hour or higher.
There is a leaderboard and you get to submit
the model you come up with with a simple click of a button.
So all of this runs in the browser,
which is also another amazing thing.
And then you immediately, or relatively so,
make your way up the leaderboard.
So let's look, let's zoom in.
What is this two-dimensional world of traffic,
what does it look like for the intelligent system?
We discretize that world into a grid shown here on the left.
That's the representation of the state.
There are seven lanes
and every single lane is broken up into blocks spatially.
And if there is a car in that block, the length of a car is about 3 blocks,
3 of those grid blocks, then that grid block is seen as occupied.
And then the red car is you.
That's the thing that's running the intelligent agent.
There, on the left, is the current speed of the red car,
it actually says MIT on top.
And then you also have a count of how many cars you passed,
and if your network sucks then that number is going to get to be negative.
You can also change with a drop-down the simulation speed
from normal on the left to fast on the right.
So, you know, the fast speeds up the replay of the simulation.
The one on the left, normal, feels a little more like real driving.
There is a drop-down for different display options.
The default is none, in terms of stuff you show on the road.
Then there is the learning input, which is the-
while that whole space is discretized,
you can choose what your car sees,
and that's, you could choose how far ahead it sees, how far behind,
how far to the left and right it sees.
And so by choosing to visualize the learning input,
you get to see what you set that input to be.
Then there is the safety system.
This is a system that protects you from yourself.
The way we've made this game is
that it operates under something similar to,
if you have some intelligence in your car,
if you're driving and you have adaptive cruise control in your car.
It operates in the same way. When it gets close to the car in front,
it slows down for you
and it doesn't let you run into the car to the left of you,
to the right of you, or off the road.
So it constrains the movement capabilities of your car
in such a way that you don't hit anybody, because
then we would have to simulate collisions and that would just be a mess.
So, it protects you from that, and so you can
choose to visualize that "safety system" with a visualization box.
And then you can also choose to visualize the full map.
This is the full occupancy map that you get
if you would like to provide it as input to the network.
Now that input, for every single grid block, is a number.
It's not just a 0 or 1 for whether there's a car in there.
It's the maximum speed limit, which is 80 miles per hour.
Don't get crazy, eighty miles an hour is the speed limit.
That block, when it's empty, is set to eighty miles an hour.
And when it's occupied, it's set to the number that is the speed of the car.
And then, the blocks that the red car is occupying
are set to a very large number,
much higher than the speed limit.
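
That encoding can be sketched in a few lines of Python. The 80 miles an hour for an empty block comes from the description above; the value 1000 for your own car, the grid dimensions and the function name are assumptions for the example, since no exact "very large number" is given.

```python
SPEED_LIMIT = 80       # mph, the value of an empty block
OWN_CAR_VALUE = 1000   # hypothetical stand-in for the "very large number"


def encode_grid(num_lanes, num_rows, other_cars, own_cells):
    """Build the occupancy map as a list of lanes, each a list of block values.

    other_cars: dict mapping (lane, row) -> speed in mph of the car in that block
    own_cells:  set of (lane, row) blocks covered by the red car
    """
    grid = [[SPEED_LIMIT] * num_rows for _ in range(num_lanes)]
    for (lane, row), speed in other_cars.items():
        grid[lane][row] = speed
    for lane, row in own_cells:
        grid[lane][row] = OWN_CAR_VALUE
    return grid


# Hypothetical scene: a slow car ahead in our lane, and our own car over three blocks.
grid = encode_grid(
    num_lanes=7,
    num_rows=10,
    other_cars={(3, 1): 55, (3, 2): 55, (3, 3): 55},
    own_cells={(3, 6), (3, 7), (3, 8)},
)
print(grid[3])
```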
So the safety system, here shown in red,
marks the parts of the grid that your car can't move into.
Question.
What's that?
Yes.  Yes. The question was:
"What was the third option I just mentioned?"
It's you, the red car itself, you yourself:
the blocks underneath that car are set to a really high number.
It's a way for the algorithm to know,
for the learning algorithm to know that these blocks are special.
So the safety system shows red here if
the car can't move into those blocks.
So, in terms of when it lights up red, it means
the car can't speed up anymore into the space in front of it,
and when the blocks to the left or to the right light up as red
that means you can't change lanes to the left or right.
On the right of the slide, you're free to go,
free to do whatever you want.
That's what it indicates when all the blocks are yellow.
The safety system says you're free to choose any of the five actions.
And the five actions are move left, move right,
stay in place, accelerate or slow down.
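
As a rough sketch of that filtering in Python: the three blocked flags stand in for the red blocks in front, to the left and to the right, and the action names and both functions are hypothetical. How the real safety system overrides a blocked choice is not spelled out here, so this example simply picks the best-scoring action among the ones still allowed.

```python
ACTIONS = ["left", "right", "stay", "accelerate", "slow_down"]


def allowed_actions(blocked_front, blocked_left, blocked_right):
    """Return the actions the safety system would still permit."""
    blocked = set()
    if blocked_front:
        blocked.add("accelerate")
    if blocked_left:
        blocked.add("left")
    if blocked_right:
        blocked.add("right")
    return [a for a in ACTIONS if a not in blocked]


def best_allowed_action(q_values, blocked_front, blocked_left, blocked_right):
    """Pick the highest-valued action among those that are not blocked."""
    allowed = allowed_actions(blocked_front, blocked_left, blocked_right)
    return max(allowed, key=lambda a: q_values[ACTIONS.index(a)])


# Hypothetical Q-values, in the same order as ACTIONS.
q = [0.9, 0.1, 0.2, 0.8, 0.0]
print(allowed_actions(False, False, False))   # all blocks yellow: all five actions
print(best_allowed_action(q, True, True, False))
```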
And those actions are given as input.
That action is produced by what's called here the brain.
The brain takes in the current state as input and the last reward,
and learns, it uses that reward
to train the network through the backward function there,
backpropagation, and then you ask the brain, given the current state,
to give you the next action with the forward pass, the forward function.
You don't need to know the operation of this function in particular,
this is not something you need to worry about,
but if you want, you can customize this learning step.
By the way, what I'm describing now,
it's just a few lines of code right there in the browser
that you can change, and immediately, with the press of a button,
it changes the simulation or the design of the network.
You don't need to have any special hardware,
you don't need to do anything special.
And the tutorial cleanly outlines exactly all of these steps,
but it's kind of amazing that you can design a deep neural network
that's part of the reinforcement learning agent.
So it's a deep Q-learning agent right there in the browser.
So you can choose the lane side variable,
which controls how many lanes to the side you see.
So when that value is zero, you only look forward.
When that value is 1, you have one lane to the left, one lane to the right.
It's really the radius of your perception system.
Patches ahead is how far ahead you look;
patches behind is how far behind you look.
And so for example here, lane side equals 2, which means
it looks two lanes to the left and two to the right; obviously, if to the right is off road,
it provides a value of 0 in those blocks.
If we set patches behind to be 10, it looks 10 patches back,
with the first patch back starting from the front of the car.
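
Those three settings determine how many grid blocks the car sees. A small Python sketch, assuming the visible window is a rectangle that includes your own lane; the snake_case names mirror the spoken variable names and are not taken from the actual code.

```python
def visible_cells(lane_side, patches_ahead, patches_behind):
    """Number of grid blocks the car sees, assuming a rectangular window."""
    lanes_seen = 2 * lane_side + 1          # own lane plus lane_side lanes on each side
    return lanes_seen * (patches_ahead + patches_behind)


# Looking only forward, one patch ahead: a single block.
print(visible_cells(lane_side=0, patches_ahead=1, patches_behind=0))
# Two lanes either side, 10 patches ahead and 10 behind (hypothetical settings).
print(visible_cells(lane_side=2, patches_ahead=10, patches_behind=10))
```

A bigger window gives the network more to work with, but also more inputs to learn from.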
The scoring for the evaluation of the competition
is your average speed over a predefined period of time.
And so the method we use to collect that speed
is we run the agent for 10 runs, about 30 simulated minutes of game each,
and take the median speed of the 10 runs.  That's the score.
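
In Python terms the score is just a median over the per-run averages. The ten speeds below are invented for the example, and the function name is hypothetical.

```python
from statistics import median


def competition_score(run_average_speeds):
    """Score = median of the average speeds (mph) from the evaluation runs."""
    return median(run_average_speeds)


# Hypothetical average speeds from 10 runs of about 30 simulated minutes each.
speeds = [55.0, 58.2, 56.5, 57.1, 54.9, 59.3, 56.6, 57.8, 55.7, 56.0]
print(competition_score(speeds))
```

Using the median rather than the mean means one unusually lucky or unlucky run does not move the score much.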
This is done server side, and given that
this code has recently gotten some publicity online, unfortunately,
this might be a dangerous thing to say, but there's no cheating possible.
Because it's done server side, and this is javascript
and runs in the browser, it's hopefully sandboxed.
So you can't do anything tricky, but we dare you to try.
You can try it locally to get an estimate, you know,
there's a button that says evaluate and it gives you a score right back
of how well you're doing with the current network.
That button is "Start Evaluation Run"; you press the button,
it shows a progress bar and gives you the average speed.
There's a code box where you modify all the variables I mentioned,
and the tutorial describes this in detail.
And then once you're ready, you modify a few things,
you can press "Apply Code". It restarts, it kills all the training
that you've done up to this point, or resets it, and starts the training again.
So save often, and there's a save button.
So the training is done in a separate thread with Web Workers,
which are exciting things that allow javascript to run,
amazingly, on multiple CPU cores in a parallel way.
So the simulation that scores this, or, sorry, the training is done
a lot faster than real time, a thousand frames a second.
That's a thousand movement steps a second.  This is all in javascript.
And the nets get shipped to the main simulation
from time to time as the training goes on.
So all you have to do is press "Run Training".
And it trains and the car behaves better over time.
Maybe I should show it in the browser.
Let's see if it will work. Well, is this going to mess up?  We're good.
What can possibly go wrong?
So there's the game. When it starts, this is running live in the browser.
Artificial intelligence, ladies and gentlemen, in the browser. A neural network.
So currently it's not very good, it's driving at 2 miles an hour
and watching everybody pass.
So what's being shown live is the loss function, which is pretty poor.
So in order to train, like I said, at a thousand frames a second,
you just press the "Run Training" button and pretty quickly it learns,
based on the network you specify in the code box, how to-
and based on the input and all the things that I mentioned.
Training finished. It learns how to do a little better.
We, on purpose, put in a network that's not very good in there.
So right now it won't, on average, be doing that well,
but it does better than standing there in place.
And then you could do the "Start Evaluation Run"
to simulate the network much faster than real time,
to see how well it does.
This is a similar evaluation step to the one we take
when determining where you stand on the leaderboard.
The current average speed
in that 10-run simulation is 56.56 miles per hour.
Now, I may be logged in, maybe not. 
If you're logged in, you click "Submit your code."
If you're not logged in, it says:  "You're not logged in.  Please log in to submit your code."
And then all you have to do is log in.
This is the most flawless demo of my life.
And then you press "Submit Model" again and success.  Oh man.
"Thank you for your submission."  And so now my submission is entered as "Lex" in the leaderboard
and my 56.56, or whatever it was.
So I dare all of you to try to beat that.
As you play around with stuff, if you want to save the code
you could do so by pressing the "Save Code" button.
That saves the various javascript configurations
and that saves the network layout to a file.
And you can load from files as well. The danger is it overwrites the code for you.
And you press the "Submit" button to submit the model to the competition.
Make sure that you train the network, we don't train it for you.
You submit a model and you have to press "Train".
And it gets evaluated in time; it enters a queue to get evaluated.
This is public-facing, so the queue can grow pretty big,
and it goes through that queue, evaluates it, and then depending on where you stand
you get added to the leaderboard showing the top ten entries.
You can resubmit often and only the highest score counts.
Okay, we're using code-
an implementation of neural networks done in just javascript
by Andrej Karpathy, from Stanford, now OpenAI.
ConvNetJS is the library, and what's being visualized there,
and is also being visualized in the game, is the inputs to the network.
In this case it's 135 inputs.  You can also specify not just
how far ahead and behind you're seeing, and to the left and to the right,
you can specify how far back in time you look as well.
And so what's visualized there is the input to the network, 135 neurons,
and then the output, a regression, similar to the kind of
output we saw with numbers, where there's 10 outputs saying
if it's a 0, 1 through 9; here the output is one of the five actions:
left, right, stay in place, speed up or slow down.
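
Here is one way a number like 135 can come about, sketched in Python. It assumes the network input is the current grid plus, for each step of the temporal window, a past grid and a one-hot past action, which is how I understand the ConvNetJS deep Q-learning demo to size its input. The settings of 30 visible blocks and a temporal window of 3 are a guess that happens to give 135; the values actually used are not stated here.

```python
def network_input_size(num_inputs, num_actions, temporal_window):
    """Input size = current state + temporal_window past (state, one-hot action) pairs.

    Assumed formula, not confirmed by the talk.
    """
    past_states = num_inputs * temporal_window
    past_actions = num_actions * temporal_window
    return past_states + past_actions + num_inputs


# Hypothetical settings: 3 lanes x 10 patches ahead = 30 blocks, 5 actions, 3 steps back.
print(network_input_size(num_inputs=30, num_actions=5, temporal_window=3))
```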
The ConvNetJS settings: you can select the number of inputs,
if you want to mess with this stuff. This is all stuff you don't need to mess with
because we already gave you the variables of lane side and patches ahead and so on.
You can select the number of actions, the temporal window and the network size.
So the network definition here is the-
This is the input, the size of the input.
Again, all this is in the tutorial, this is just to give you a little outline.
Then the first fully connected layer has 10 neurons
with ReLU activation functions, the same kind of activation
function that we talked about before, and then the regression layer for the output.
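
To get a feel for how small that network is, here is a Python sketch that writes the three layers down as plain data and counts the weights. The dictionary keys are made up for the example and are not the ConvNetJS API; the sizes (135 inputs, 10 hidden neurons, 5 outputs) are the ones mentioned above, and the output layer is assumed to be fully connected to the hidden layer.

```python
# Hypothetical plain-data description of the network, not ConvNetJS syntax.
layer_defs = [
    {"type": "input", "size": 135},
    {"type": "fc", "num_neurons": 10, "activation": "relu"},
    {"type": "regression", "num_neurons": 5},  # one output per action
]


def count_parameters(layers):
    """Count weights and biases, assuming each layer is fully connected to the previous one."""
    total = 0
    fan_in = layers[0]["size"]
    for layer in layers[1:]:
        n = layer["num_neurons"]
        total += fan_in * n + n   # weights plus one bias per neuron
        fan_in = n
    return total


print(count_parameters(layer_defs))
```

Adding layers or widening them grows this count quickly, which is the trade-off you play with when you change the network size.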
And there's a bunch of other messy options you can play with if you dare.
But the ones I mentioned before are really the important ones.
Selecting the number of layers, the size of those layers,
you get to build your very own neural network that drives.
And the actual learning is done with backward propagation,
and then that returns the action by doing a forward pass through the network.
In case you're interested in this kind of stuff, there is an amazingly cool code editor.
That's the Monaco Editor.  It just works, it does some auto-completion,
so you get to play with it, and it makes everything very convenient in terms of code editing.
A lot of this visualization of the game, and the simulation we'll talk about tomorrow,
is done in the browser using HTML5 canvas.
So here is a simple specification of a blue box with canvas,
and this is very efficient and easy to work with.
And the thing that a lot of us are excited about, a very subtle one, is that you can not just run-
So with the V8 Engine, javascript has become super fast.
You could train neural networks in the browser, that's already amazing.
And then with Web Workers, as long as you have Chrome, a modern browser,
you can run multiple processes in separate threads,
so you could do a lot of stuff: you can do visualization separately
and you can train in separate threads, very cool.
Okay, so the tutorial is at cars.mit.edu/deeptraffic.
We won't put these links on the website for a little bit because
we got put on the front page of Hacker News,
and we don't want those to leak out,
especially with the claims that you can't cheat.
And while it's pretty efficient in terms of running everything on your machine, client side,
you still have to pull some images and pull some of the code.
So the tutorial's on cars.mit.edu/deeptraffic and the simulation is deeptrafficjs.
So cars.mit.edu/deeptrafficjs. I encourage you to go there, play with the network, submit your code,
and win the very special prize, and it is a pretty cool one, but we're still working on it.
There is a prize, I swear.  All right, so let's take a pause and think about what we talked about today.
So the very best of deep reinforcement learning, the most exciting accomplishment,
I think, is with the game of Go. When I first started as a freshman and took "Intro to Artificial Intelligence,"
it was said that it's a game that's impossible for machines to beat because of the combinatorial complexity,
the sheer number of options.
It's so much more complex than chess, and so the most amazing accomplishment of deep reinforcement learning
to me is the design of AlphaGo, when for the first time the world champion in Go was beaten
by DeepMind's AlphaGo. And the way they did it,
and this is, I think, very relevant to driving, is you start by first training a policy network in a supervised way.
So you take expert games to construct a network first, so you don't play against yourself.
The agent doesn't play against itself, but it learns from expert games, so there is some human Ground Truth.
This human Ground Truth represents reality, so for driving this is important.
We have a-  Well, we're starting to get a lot of data where video of drivers is being recorded.
So we can learn on that data before we then run the agents through a simulation where it learns from much larger
data sets through simulation.  And they did just that.  Now, as a reminder of what happens when you let an agent drive itself.
This is probably one of my favorite videos of all time. I just recently saw it and I could just watch this for hours.
But it's a reminder that you can't trust your first estimates of a reward function to be those that are safe
and productive for our society when you're talking about an intelligent system that gets to operate in the real world.
This is just as clear of a reminder of that as there is.  So again, all the references are available online.
For these slides, we'll put up the slides.
I imagine you might have questions; if you want, come down and talk to us about either Docker
or javascript.  Question.  The question was:  "What is the visualization you're seeing in DeepTraffic?"
You're seeing a car move about. Why is it moving?
It's moving based on the latest snapshot of the network you trained, so it's just visualizing for you, just for fun,
the network you trained most recently.
Okay, so if people have questions, stick around afterwards.  Just details on Docker and [CHUCKLING]-  Yes.  Do you want to do it offline?
