Welcome everyone to 2019.
It's really good to see everybody here,
making it in the cold.
This is 6.S094 Deep Learning for Self-Driving Cars.
It is part of a series of courses
on deep learning that we're running throughout this month.
The website where you can get all the content,
the videos, the lectures and the code, is
deeplearning.mit.edu.
The videos and slides will be made available there
along with a github repository
that's accompanying the course.
Assignments for registered students will be
emailed later on in the week.
And you can always contact us with
questions, concerns, comments at
hcai, human centered AI, at mit.edu.
So let's start with the basics,
the fundamentals.
To summarize in one slide,
what is deep learning?
It is a way to extract useful patterns from data
in an automated way
with as little human effort involved
as possible, hence the automation.
How? The fundamental aspect that we'll talk about
a lot is the optimization of neural networks.
The practical side, for which we'll provide the code
and so on, is that there are
libraries that make it accessible
and easy to do some of the most powerful things
in deep learning using Python, TensorFlow & friends.
The hard part always with
machine learning, and artificial intelligence in general,
is asking good questions and getting good data.
A lot of times the exciting aspect of what the news covers,
and the exciting aspect of what is published
at the prestigious conferences, on arXiv,
in a blog post, is the methodology.
The hard part is applying the
methodology to solve real world problems,
to solve fascinating interesting problems.
And that requires data,
that requires asking the right questions of that data,
organizing that data
and labeling, selecting aspects of that data that can reveal
the answers to the questions you ask.
So why has there been this breakthrough over the past decade
in the application of neural networks,
the ideas in neural networks?
What has happened? What has changed?
They've been around since the 1940s.
And ideas were percolating even before.
The digitization of information, data.
The ability to access data easily in a distributed fashion across the world.
All kinds of problems have now a digital form.
They can be accessed by learning algorithms.
Hardware; compute, both the Moore's Law of CPU and GPU
and ASICs, Google's TPU systems,
hardware that enables the efficient
effective large-scale execution of these algorithms.
Community; people here, people all over the world
are able to work together, to talk to each other,
to feed the fire of excitement behind machine learning.
github and beyond.
The tooling; we'll talk about TensorFlow,
PyTorch and everything in between
that enables a person with an idea
to reach a solution in less and less and less time.
Higher and higher levels of abstraction
empower people
to solve problems in less and less time
with less and less knowledge,
where the idea and the data become the central point,
not the effort that takes you from an idea to the solution.
And there's been a lot of exciting progress.
Some of which we'll talk about, from face recognition to
the general problem of scene understanding, image classification,
to speech, text, natural language processing, transcription,
translation, medical applications and medical diagnosis.
And cars,
being able to solve many aspects of perception in autonomous vehicles
with drivable area, lane detection,
object detection. Digital assistants,
the ones on your phone and beyond, the ones in your home.
Ads, recommender systems, from Netflix to search to social, Facebook.
And of course deep reinforcement learning successes in the playing of games,
from board games to StarCraft and Dota.
Let's take a step back.
Deep learning is more than a set of tools
to solve practical problems.
Pamela McCorduck said in '79
"AI began with the ancient wish to forge the gods."
Throughout our history, throughout our civilization, human civilization
we've dreamed about creating echoes of
whatever is in this mind of ours in the machine.
And creating living organisms. From the popular culture of the 1800s
with Frankenstein to Ex Machina, this vision, this dream
of understanding intelligence and creating intelligence has captivated all of us.
And deep learning is at the core of that.
Because there's aspects of it, the learning aspects,
that captivate our imagination about what is possible.
Given data and methodology, what learning,
learning to learn and beyond, how far that can take us.
And here visualized is just 3% of the neurons
and one millionth of the synapses in our own brain.
This incredible structure that's in our mind
and there's only echoes of it.
Small shadows of it in our artificial neural networks that we're able to create.
But nevertheless those echoes are inspiring to us.
The history of neural networks on this pale blue dot of ours
started quite a while ago
with summers and winters,
with excitements and periods of pessimism.
Starting in the 40s with neural networks and
the implementation of those neural networks as a perceptron in the 50s;
with ideas of backpropagation,
restricted Boltzmann machines, recurrent neural networks
in the 70s and 80s; with convolutional neural networks
and the MNIST data set, with data sets beginning to percolate,
LSTMs, bi-directional RNNs in the 90s;
and the rebranding and the rebirth of neural networks
under the flag of Deep Learning
and Deep Belief Nets in 2006;
the birth of ImageNet in 2009, the data set on which
the possibilities of what deep learning can bring to the world
were first illustrated in recent years.
And AlexNet, the network that on ImageNet performed exactly that,
with a few ideas like dropout that improved
neural networks over time, every year, year by year
improving the performance of neural networks.
In 2014 the idea of GANs, which Yann LeCun called
the most exciting idea of the last 20 years,
the Generative Adversarial Networks, the ability with very little supervision
to generate data, to generate ideas after forming representations of those.
From the understanding, from the high-level
abstractions of what is extracted
in the data, to be able to generate new samples.
Create, the idea of being able to create
as opposed to memorize
is really exciting.
And on the applied side in 2014 with DeepFace
the ability to do face recognition.
There's been a lot of breakthroughs on the computer vision front
that being one of them.
The world was inspired, captivated in 2016
with AlphaGo, and in '17 with AlphaZero,
beating with less and less and less effort
the best players in the world at Go.
The problem that for most of the history of
artificial intelligence was thought to be unsolvable.
And new ideas with capsule networks. And this year, the year 2018,
was the year of natural language processing.
A lot of interesting breakthroughs,
Google's BERT and others that we'll talk about,
breakthroughs in the ability to understand language, understand speech
and everything, including generation, that's built all around that.
And there's a parallel history of tooling
starting in the 60s with the perceptron
and the wiring diagrams.
and ending this year with PyTorch 1.0 and TensorFlow 2.0.
These are really solidified, exciting, powerful ecosystems of tools
that enable you to do a lot with very little effort.
The sky is the limit, thanks to the tooling.
So let's then, from the big picture, take it to the smallest.
Everything should be made as simple as possible.
So let's start simple with a little piece of code
before we jump into the details
and do a big run through of everything that is possible in deep learning.
At the very basic level with just a few lines of code
really six here,
six little pieces of code,
you can train a neural network that understands
what's going on in an image.
The classic that I will always love, the MNIST data set,
the handwritten digits where the input
to a neural network or machine learning system is
a picture of a handwritten digit
and the output is the number that's in that digit.
It's as simple as this. Step 1: import a library, TensorFlow.
Step 2: import the data set, MNIST.
Step 3: like Lego bricks, stack on top of each other
the neural network layer by layer, with an input layer,
a hidden layer and an output layer.
Step 4: train the model, as simple as a single line: model.fit.
Step 5: evaluate the model on the testing data set.
And that's it. In Step 6 you're ready to deploy.
You're ready to predict what's in the image.
It's as simple as that.
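As a rough sketch of what those six steps can look like, assuming the Keras API that ships with TensorFlow. The function name run_mnist_example, the hidden layer size of 128, the optimizer and the epoch count are illustrative choices, not taken from the lecture.

```python
def run_mnist_example(epochs=5):
    # Step 1: import the library. The import sits inside the function only so
    # that this sketch can be loaded on a machine without TensorFlow installed.
    import tensorflow as tf

    # Step 2: load MNIST (downloaded on first use) and scale pixels to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Step 3: stack the layers: input (flattened 28x28 image), hidden, output.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),    # hidden layer (size assumed)
        tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Step 4: train the model.
    model.fit(x_train, y_train, epochs=epochs)

    # Step 5: evaluate on the held-out test set.
    test_loss, test_accuracy = model.evaluate(x_test, y_test)

    # Step 6: predict the digit in one image.
    probabilities = model.predict(x_test[:1])
    predicted_digit = int(probabilities.argmax(axis=1)[0])
    return test_accuracy, predicted_digit
```

Calling run_mnist_example() downloads the data and trains the network, so it needs TensorFlow installed and a network connection the first time it runs.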
And much of this code, obviously much more complicated,
much more elaborate and rich and interesting
and complex, we'll be making available on
GitHub, in our repository that accompanies these courses.
Today we'll release the first tutorial on driver scene segmentation.
I encourage everybody to go through it.
And then on the tooling side in one slide,
before we dive into the neural networks and deep learning.
On the tooling side, amongst many other things,
TensorFlow is a deep learning library,
an open source library from Google.
The most popular one today.
The most active, with a large ecosystem.
It's not just something you import in Python
to solve some basic problems.
There's an entire ecosystem of tooling.
There's different levels of APIs.
Much of what we'll do in this course will be
the highest level API with Keras.
But there's also the ability to run in the browser with TensorFlow.js,
on the phone with TensorFlow Lite.
In the cloud, without any need to have computer hardware
or any of the libraries set up on your own machine, you can run
all the code that we're providing
with Google Colab, Colaboratory.
And there's the ASIC hardware that Google has
optimized for TensorFlow, their TPU, the Tensor Processing Unit,
the ability to visualize with TensorBoard, and the models provided by TensorFlow Hub.
And this is an entire ecosystem including,
most importantly I think, documentation and blogs
that make it extremely accessible to
understand the fundamentals of the tooling
that allow you to solve problems
from natural language processing to computer vision
to GANs, Generative Adversarial Networks, and
everything in between, with deep reinforcement learning and so on.
So that's why we're excited to work both in theory in this course,
in this series of lectures, and in the tooling,
on the applied side, with TensorFlow.
It really makes these ideas exceptionally accessible.
So deep learning at the core is the ability to form
higher and higher levels of abstraction,
of representations, from data and raw patterns.
Higher and higher levels of understanding of patterns.
And those representations are extremely important
and effective for being able to interpret data.
Under certain representations data is trivial
to understand, cat versus dog,
blue dot versus green triangle.
Under others it's much more difficult.
In this task, drawing a line under polar coordinates is trivial;
under Cartesian coordinates it is very difficult,
well, impossible to do accurately.
And that's a trivial example of a representation.
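A small sketch of that point, under the assumption that the two classes sit on two concentric circles (the data, the radii and the threshold of 2.0 are hypothetical, chosen only for illustration). In polar form a single cut on the radius separates them; in Cartesian form no vertical line does.

```python
import math

def ring(radius, label, n=8):
    # n points evenly spaced on a circle of the given radius, all with one label.
    return [((radius * math.cos(2 * math.pi * k / n),
              radius * math.sin(2 * math.pi * k / n)), label)
            for k in range(n)]

# Hypothetical toy data: class 0 on a small circle, class 1 on a larger one.
points = ring(1.0, 0) + ring(3.0, 1)

def accuracy(classifier):
    # Fraction of points whose predicted label matches the true label.
    return sum(classifier(x, y) == label for (x, y), label in points) / len(points)

# Polar representation: a single threshold on the radius separates the classes.
polar_accuracy = accuracy(lambda x, y: 1 if math.hypot(x, y) > 2.0 else 0)

# Cartesian representation: try many vertical lines x > t and keep the best one.
thresholds = [t / 2 for t in range(-8, 9)]
best_line_accuracy = max(
    accuracy(lambda x, y, t=t: 1 if x > t else 0) for t in thresholds
)

print(polar_accuracy, best_line_accuracy)
```

Only vertical lines are tried here to keep the sketch short; the same failure holds for any straight line, because the inner class is surrounded by the outer one.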
So our task with deep learning, with machine learning in general
is forming representations that map the topology,
whatever the topology, the rich space of the problem
that you're trying to deal with, of the raw inputs,
map it in such a way
that the final representation is trivial to work with,
trivial to classify, trivial to perform regression,
trivial to generate new samples of that data.
And forming higher and higher levels of representation
is really the dream of artificial intelligence.
That is what understanding is,
making the complex simple,
like Einstein said a few slides ago.
And Juergen Schmidhuber, and whoever else said it, I don't know,
said that's been the dream of all of science in general.
The history of science is the history of compression progress,
of forming simpler
and simpler representations of ideas.
The model of the universe, of our solar system,
with the earth at the center of it
is much more complex to do physics on
than a model where the Sun is at the center.
Those higher and higher levels of simple representations
enable us to do extremely powerful things.
That has been the dream of science
and the dream of artificial intelligence.
And why deep learning?
What is so special about deep learning in the grander
world of machine learning and artificial intelligence?
It's the ability to more and more remove the input of human experts,
remove the human from the picture,
the costly, inefficient effort of human beings in the picture.
Deep learning automates much of the extraction from the raw data,
gets us closer and closer to the raw data
without the need of human involvement,
human expert involvement.
The ability to form representations from the raw data
as opposed to having a human being need to extract features,
as was done in the 80s and 90s
and in the early aughts, to extract features
with which the machine learning algorithms can then work.
The automated extraction of features
enables us to work with larger and larger datasets,
removing the human completely
except from the supervision labeling step at the very end.
It doesn't require the human expert.
But at the same time
there are limits to our technologies.
There's always a balance between excitement and disillusionment.
The Gartner hype cycle,
as much as we don't like to think about it,
applies to almost every single technology.
Of course the magnitude of the peaks and the troughs is different.
But I would say we are at the peak
of inflated expectation with deep learning.
And that's something we have to think about as we talk about
some of the ideas and exciting possibilities of the future.
And with self driving cars, which we'll talk about in
future lectures in this course,
we're at the same point.
In fact we're a little bit beyond the peak.
And so it's up to us,
this is MIT, the engineers and the people working on this in the world,
to carry us through the trough,
to carry us through the future as the ups and downs
of the excitement progresses forward
into the plateau of productivity.
Why else not deep learning?
If we look at real world applications
especially with humanoid robotics, robotics manipulation
and even, yes, autonomous vehicles,
the majority of aspects of autonomous vehicles
do not involve to an extensive amount
machine learning today.
The problems are not formulated as data driven learning,
instead they're model-based optimization methods
that don't learn from data over time.
And from the speakers these couple of weeks
we'll get to see how much machine learning is starting to creep in.
But in the examples shown here,
with the amazing humanoid robotics at Boston Dynamics,
to date almost no machine learning has been used
except for trivial perception.
The same with autonomous vehicles.
Almost no machine learning and deep learning has been used
except with perception,
some aspects of enhanced perception from the visual texture information.
Plus what's becoming, what's starting to be used a little bit more
is the use of recurrent neural networks
to predict the future,
to predict the intent of the different players in the scene
in order to anticipate what the future is.
But these are very early steps.
Most of the success we see today, the 10 million miles Waymo has achieved,
has been attributed mostly to non machine learning methods.
Why else not deep learning?
Here's a really clean example of unintended consequences,
of ethical issues
we have to really think about.
When an algorithm learns from data
based on an objective function, a loss function,
the power, the consequences of an algorithm that
optimizes that function are not always obvious.
Here's an example of a human player playing the game
of CoastRunners.
It's a boat racing game where the task is to go
around the racetrack and try to win the race.
And the objective is to get as many points as possible.
There are three ways to get points.
The finishing time, how long it took you to finish.
The finishing position, where you were in ranking.
And picking up so-called turbos, those little green things along the way.
They give you points.
Okay simple enough.
So we design an agent, in this case an RL agent,
that optimizes for the rewards.
And what we find on the right here is that
the agent discovers that the optimal strategy
actually has nothing to do with finishing the race or the ranking.
It can get many more points
by just focusing on the turbos and collecting
those little green dots, because they regenerate.
So it goes in circles over and over and over, slamming into the wall,
collecting the green turbos.
And that's a very clear example of
a well-reasoned, formulated objective function
that has totally unexpected consequences,
at least without
considering those consequences ahead of time.
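The arithmetic behind that behavior can be sketched with made-up numbers. None of the point values, episode lengths or function names below come from the game; they are hypothetical and only show how a regenerating reward can outweigh the reward for finishing.

```python
# All values are hypothetical, chosen only to illustrate the effect.
FINISH_BONUS = 100    # points for finishing the race
TURBO_POINTS = 10     # points per turbo pickup
EPISODE_STEPS = 200   # length of one episode in time steps

def score_finish_race(turbos_on_track=5):
    # Drive the track once, pick up each turbo once, then collect the finish bonus.
    return turbos_on_track * TURBO_POINTS + FINISH_BONUS

def score_loop_forever(steps=EPISODE_STEPS, steps_per_loop=4, turbos_per_loop=1):
    # Circle a spot where turbos regenerate for the whole episode, never finishing.
    loops = steps // steps_per_loop
    return loops * turbos_per_loop * TURBO_POINTS

print(score_finish_race(), score_loop_forever())
```

Under these assumed numbers the looping strategy scores higher, so an agent that only maximizes points has no reason to finish the race.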
And so that shows the need for AI safety
for a human in the loop of machine learning.
That's why not deep learning exclusively.
The challenge of deep learning algorithms, of deep learning applied
is to ask the right question
and understand what the answers mean.
You have to take a step back and look at the difference,
the distinction, the levels, degrees of what the algorithm is accomplishing.
For example image classification
is not necessarily scene understanding.
In fact it's very far from scene understanding.
Classification may be very far from understanding.
And the datasets can vary drastically
across the different benchmarks and the datasets used.
The professionally done photographs versus
synthetically generated images versus real world data.
And the real world data is where the big impact is.
So oftentimes the one doesn't transfer to the other.
That's the challenge of deep learning.
Solving all of these problems of different lighting variations,
pose variation, inter-class variation,
all the things that we take for granted as human beings
with our incredible perception system,
all have to be solved in order to gain
greater and greater understanding of a scene.
And there are all the other things where we have to close the gap
that we're not even close to yet.
Here's an image from Andrej Karpathy's blog
from a few years ago
of former President Obama stepping on a scale.
We can classify, we can do semantic segmentation
of the scene, we can do object detection,
we can do a little bit of 3D reconstruction from a
video version of the scene.
But what we can't do well is all the things we take for granted.
We can't tell that the images in the mirrors are different from reality.
We can't deal with the sparsity of information.
From just a few pixels on President Obama's face
we can still identify Mr. President.
The 3D structure of the scene,
that there's a foot on top of a scale, that there are human beings behind,
from a single image.
These are things we can trivially do using all the common-sense semantic knowledge that we have.
The systems cannot do the physics of the scene, that there's gravity.
And the biggest thing,
the hardest thing, is what's on people's minds.
And what's on some people's minds about what's on other people's minds and so on.
Mental models of the world, being able to infer what people are thinking about.
Being able to infer, and there's been a lot of exciting work here at MIT about this,
what people are looking at.
But we're not even close to solving that problem either.
And what they're thinking about,
we haven't even begun to really think about that problem.
And we do it trivially as human beings.
And I think at the core of that,
I think I'm harping on the visual perception problem
because it's one we really take for granted as human beings,
especially when trying to solve real world problems,
especially when trying to solve autonomous driving,
is that we have 540 million years of data for visual perception
so we take it for granted.
We don't realize how difficult it is.
And we kind of focus all our attention on this recent development
of a hundred thousand years of abstract thought,
being able to play chess, being able to reason.
But the visual perception is nevertheless extremely difficult.
At every single layer of what's required to perceive, interpret
and understand the fundamentals of a scene.
And a trivial way to show that is just all the ways you can mess
with these image classification systems
by adding a little bit of noise.
In the last few years there's been a lot of papers, a lot of work
to show that you can mess with these systems
by adding noise. Here, with 99% confidence the system predicts dog;
add a little bit of distortion
and immediately the system predicts with 99% confidence that it's an ostrich.
And you can do that kind of manipulation with just a single pixel.
So that's just a clean way to show the gap between image classification
on an artificial dataset like ImageNet
and real world perception that has to be solved,
especially for life critical situations like autonomous driving.
I really like Max Tegmark's visualization of this rising sea
of the landscape of human competence from Hans Moravec.
And this is the difference as we progress forward
and discuss some of these machine learning methods.
There is human intelligence, general human intelligence.
Let's call it Einstein here.
That's able to generalize over all kinds of problems,
from the common sense to the incredibly complex.
And then there is the way we've been doing
especially data-driven machine learning,
which is savant, which is specialized intelligence.
Extremely smart at a particular task
but not able to transfer except in the very narrow
neighborhood on this landscape
of different tasks, with art, cinematography, book writing at the peaks
and chess, arithmetic, theorem proving and vision
at the bottom in the lake.
And there's this rising sea as we solve problem after problem.
The question is: can the methodology and the approach of
deep learning, of everything we're doing now,
keep the sea rising, or do fundamental breakthroughs
have to happen in order to generalize
and solve these problems?
And so the specialized systems, where the successes are,
essentially boil down to this: given the dataset
and given the ground truth for that dataset,
here the apartment costs in the Boston area,
be able to input several parameters
and based on those parameters predict the apartment cost.
That's the basic premise behind the successful
supervised deep learning systems today.
If you have good enough data, and there's good enough ground truth
that can be formalized, we can solve it.
Some of the recent promise, on which we will do an entire series of lectures
in the third week, on deep reinforcement learning,
shows that from raw sensory information with very little annotation,
through self-play, systems that learn without human supervision
are able to perform extremely well in these constrained contexts.
In the context of a video game.
Here, Pong from pixels: being able to perceive the raw pixels
of this Pong game as raw input
and learn the fundamental, quote unquote, physics of this game.
Understand how this game behaves
and how to be able to win this game.
That's kind of a step toward general purpose artificial intelligence.
But it is a very small step
because it's in a simulated, very trivial situation.
That's the challenge that's before us:
with less and less human supervision, be able to solve huge real-world problems.
From the top, supervised learning, where the majority of the teaching
is done by human beings
through the annotation process, through labeling all the data,
by showing different examples,
and further and further down to semi-supervised learning,
reinforcement learning and unsupervised learning,
removing the teacher from the picture.
And making that teacher extremely efficient when it is needed.
Of course data augmentation is one way we'll talk about.
So taking a small number of examples and
messing with that set of examples, augmenting that set of examples,
through trivial and through complex methods of cropping,
stretching, shifting and so on.
including through generative networks modifying those images,
to grow a small dataset into a large one,
to minimize, to decrease further and further
the input of the human teacher.
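A minimal sketch of the trivial end of that spectrum, with an image represented as a plain list of rows. The helper names and the 3x3 example image are hypothetical; real pipelines would typically use an image library rather than nested lists.

```python
def flip_horizontal(image):
    # Mirror each row left to right.
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    # Move the content right by `pixels`, padding the left edge with `fill`.
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def crop(image, top, left, height, width):
    # Keep only a height x width window starting at (top, left).
    return [row[left:left + width] for row in image[top:top + height]]

image = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]

# One labeled example becomes four training examples with the same label.
augmented = [
    image,
    flip_horizontal(image),
    shift_right(image, 1),
    crop(image, 0, 0, 2, 2),
]
print(len(augmented))
```

Whether a given transformation keeps the label valid depends on the task: a horizontal flip is harmless for many objects but not for handwritten digits or text.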
But still that's quite far away from the incredibly efficient
both teaching and learning that humans do.
This is a video, and there's many of them online, of a human baby walking for the first time.
We learn to do this, you know, it's one-shot learning.
One day you're on all fours, and the next day you put your two hands up
and then you figure out the rest.
One shot. Well, kind of-ish, you can kind of play around with it.
But the point is you're extremely efficient.
With only a few examples we are able to learn the fundamental aspect of
how to solve a particular problem.
Machines in most cases need thousands, millions
and sometimes more examples depending on the life critical nature of the application.
The data flow of supervised learning systems is there's input data,
there's a learning system and there is output.
Now in the training stage for the output we have the ground truth.
And so we use that ground truth to teach the system.
In the testing stage when it goes out into the wild there's new input data over
which we have to generalize with the learning system,
we have to make our best guess.
In the training stage the process with neural networks is, given
the input data for which we have the ground truth, pass it through the model,
get the prediction. And given that we have the ground truth
we can compare the prediction to the ground truth,
look at the error. And based on that error adjust the weights.
The types of predictions we can make are regression and classification.
Regression is continuous and classification is categorical.
Here, if we look at weather, the regression problem says
what is the temperature going to be tomorrow.
And the classification formulation of that problem
says is it going to be hot or cold,
for some threshold definition of what hot or cold is.
That's regression and classification.
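The two formulations of the weather example can be sketched as follows. The averaging "model" and the 20-degree threshold are placeholders chosen for illustration, not a real forecasting method.

```python
def predict_temperature(recent_temperatures):
    # Regression: the output is a continuous number.
    # Placeholder model: tomorrow looks like the average of recent days.
    return sum(recent_temperatures) / len(recent_temperatures)

def predict_hot_or_cold(temperature, threshold=20.0):
    # Classification: the output is one of a fixed set of categories,
    # defined here by an assumed threshold in degrees.
    return "hot" if temperature >= threshold else "cold"

forecast = predict_temperature([18.0, 22.0, 26.0])
category = predict_hot_or_cold(forecast)
print(forecast, category)
```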
On the classification front it can be multi-class,
which is the standard formulation, where we are tasked with saying
that a particular entity can only be one thing,
and then there's multi-label, where a particular entity can be multiple things.
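One common way to realize the two cases, sketched here as an assumption rather than something stated in the lecture, is a softmax over the raw scores for multi-class and an independent sigmoid per label for multi-label. The labels and scores are hypothetical.

```python
import math

def softmax(scores):
    # Turn raw scores into probabilities that sum to 1 (one winner).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    # Turn one raw score into an independent probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

labels = ["cat", "dog", "outdoor"]
scores = [2.0, -1.0, 1.5]  # hypothetical raw model outputs

# Multi-class: the entity is exactly one thing, so pick the most probable label.
probabilities = softmax(scores)
multi_class = labels[probabilities.index(max(probabilities))]

# Multi-label: each label is decided on its own, so several can be true at once.
multi_label = [label for label, s in zip(labels, scores) if sigmoid(s) > 0.5]

print(multi_class, multi_label)
```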
And overall the input to the system can be not just a single
sample of the particular dataset
and the output doesn't have to be a single
sample of the ground truth dataset.
It can be a sequence: sequence to sequence,
a single sample to a sequence, a sequence to a sample
and so on. From video captioning
to translation to
natural language generation to, of course, the one-to-one
case of general computer vision.
Okay that's the bigger picture. Let's step back from the big to the small
to a single neuron inspired by our own brain,
the biological neural networks in our brain,
the computational block that is behind a lot of the intelligence in our mind.
The artificial neuron has inputs with weights on them
plus a bias, an activation function
and an output.
It's inspired by this thing,
as I showed before. Here visualized is the thalamocortical system
with three million neurons
and 476 million synapses.
The full brain has a hundred billion neurons
and a thousand trillion synapses.
ResNet and some of the other state-of-the-art networks
have tens to hundreds of millions
of edges, of synapses.
The human brain has ten million times more synapses
than artificial neural networks
and there's other differences. The topology is asynchronous
and not constructed in layers.
The learning algorithm for artificial neural networks is backpropagation
for our biological networks we don't know.
That's one of the mysteries of the human brain.
There's ideas but we really don't know.
In power consumption, human brains are much more efficient
than neural networks. That's one of the problems that we're trying to solve,
and ASICs are starting to solve some of these problems.
And the stages of learning: in biological neural networks
you really never stop learning.
You're always learning, always changing,
both the hardware and the software.
In artificial neural networks often times there's a training stage,
there's a distinct training stage
and there's a distinct testing stage when you release the thing in the wild.
Online learning is an exceptionally difficult thing
that we're still in the very early stages of.
This neuron,
the fundamental computational block behind neural networks,
takes a few inputs, applies weights, which are the parameters that are learned,
sums them up, adds the bias, also a learned parameter,
puts the result into a nonlinear activation function and gives an output.
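That computation fits in a few lines. This is a sketch assuming a sigmoid as the nonlinear activation; the input, weight and bias values are arbitrary.

```python
import math

def sigmoid(z):
    # A common nonlinear activation that squashes any number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus the bias, through the activation.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Arbitrary example values: 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)
```

The weights and the bias are the parts that training adjusts; the inputs come from the data or from the previous layer.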
And the task of this neuron is to get excited
based on certain aspects of the layers, the features,
the inputs that came before.
And in that ability to discriminate, to get excited by certain things
and not excited by other things, it holds a little piece of information
at whatever level of abstraction it is.
So when you combine many of them together
you have knowledge.
Different levels of abstractions form a knowledge base
that's able to represent, understand or even act on a particular set of raw inputs.
And you stack these neurons together in layers,
both in width and depth, increasing further on.
And there's a lot of different architectural variants.
But they begin with this basic fact: with just a single hidden layer of a neural network
the possibilities are endless.
You can approximate any arbitrary function.
A neural network with a single hidden layer can approximate any continuous function, given enough neurons.
That means any other neural network with multiple layers and so on
is just an interesting optimization
of how we can discover those functions.
The possibilities are endless.
And the other aspect here is that the mathematical underpinnings
of neural networks, with the weights and the differentiable activation functions,
are such that the few steps from the inputs to the outputs
are deeply parallelizable.
And that's the other aspect, on the compute side:
the parallelizability of neural networks
is what enables some of the exciting
advancements on the graphics processing units, the GPUs,
and with ASICs, the TPUs.
The ability to run across machines,
across GPU units, at very large distributed scale
to be able to train and perform inference on neural networks.
Activation functions.
These activation functions put together
are tasked with optimizing a loss function.
For regression that loss function is usually mean squared error, though there are a lot of variants.
And for classification it's cross entropy loss.
In the cross entropy loss the ground truth is 0 or 1.
In the mean squared error it's a real number.
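The two losses can be written out directly. This is a sketch using the standard textbook definitions; the small eps term is an added safeguard against taking the logarithm of zero, and the example numbers are arbitrary.

```python
import math

def mean_squared_error(targets, predictions):
    # Regression: targets are real numbers.
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def cross_entropy(one_hot_target, predicted_probabilities, eps=1e-12):
    # Classification: the target is 0 or 1 per class (one-hot),
    # the prediction is a probability per class.
    return -sum(t * math.log(p + eps)
                for t, p in zip(one_hot_target, predicted_probabilities))

mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])   # ((1)^2 + (-2)^2) / 2
ce = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])     # -log(0.7)
print(mse, ce)
```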
And so with the loss function and the weights and the bias and the activation functions,
we propagate forward through the network from the input to the output.
Using the loss function we use the algorithm of backpropagation,
on which I did an entire lecture last time,
to adjust the weights.
To have the error flow backwards through the network
and adjust the weights such that,
once again, the weights that were responsible for
producing the correct output
are increased, and the weights that were responsible for
producing the incorrect output are decreased.
The forward pass gives you the error.
The backward pass computes the gradients, and based on the gradients
the optimization algorithm, combined with a learning rate, adjusts the weights.
The learning rate is how fast the network learns.
And all of this is possible on the numerical computation
side with automatic differentiation.
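For a single linear neuron the whole loop can be written by hand. This is a sketch under simplifying assumptions: one weight, one bias, squared error, and gradients derived manually instead of by automatic differentiation. The data and the learning rate are made up.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=100):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Forward pass: compute the prediction and its error.
            prediction = w * x + b
            error = prediction - target
            # Backward pass: gradients of error**2 with respect to w and b.
            grad_w = 2 * error * x
            grad_b = 2 * error
            # Update: step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear_neuron(samples)
print(w, b)
```

A larger learning rate takes bigger steps and can overshoot; a smaller one converges more slowly. In a deep network the same three phases apply, with the gradients computed layer by layer by backpropagation.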
The optimization problem, given those gradients
that are computed in the
backward flow through the network, is solved with Stochastic Gradient Descent.
There's a lot of variants of these optimization algorithms
that solve various problems,
from dying ReLUs to vanishing gradients.
There's a lot of different parameters, momentum and so on,
that really just boil down to the different problems that come up
in optimization.
What is the right size of a batch?
Or really it's called a mini-batch when it's not the entire dataset,
the one based on which you compute the gradients to adjust the learning.
Do you do it over a very large amount?
Or do you do it with stochastic gradient descent for every single sample of the data?
If you listen to Yann LeCun and a lot of recent literature,
small minibatch sizes are good.
He says "Training with large minibatches is bad for your health.
More importantly, it's bad for your test error.
Friends don't let friends use minibatches larger than 32."
Larger batch size means more computational speed
because you don't have to update the weights as often.
But smaller batch size empirically produces better generalization.
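The speed side of that trade-off can be seen by counting weight updates per pass over the data. This is a minimal sketch; the `minibatches` helper and the dataset of 100 integers are hypothetical stand-ins.

```python
import random

def minibatches(samples, batch_size, seed=0):
    """Yield shuffled minibatches that together cover the dataset once."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(100))
for size in (1, 32, 100):
    # One weight update per minibatch, so fewer batches means fewer updates.
    updates_per_epoch = sum(1 for _ in minibatches(data, size))
    print(size, updates_per_epoch)
```

A batch size of 1 is the per-sample case, and a batch size equal to the dataset size is a single update per pass.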
The problem we're often trying to solve, on the broader scale of learning,
is overfitting.
And the way we solve it is through regularization.
We want to train on a dataset without memorizing it to the extent
that you only do well on that training dataset.
So you want it to be generalizable
to the future things that you haven't seen yet.
So obviously this is a problem for small datasets
and also for sets of parameters that you choose.
Shown here is an example of a sine curve trying to fit
particular data, versus a 9-degree polynomial
trying to fit a particular set of data, the blue dots.
The 9-degree polynomial is overfitting.
It does very well for that particular set of samples
but does not generalize well in the general case.
And the trade-off here is, as you train further and further,
at a certain point there's a deviation between
the error being decreased to 0 on the training set
and going to 1 on the test set.
And that's the balance we have to strike.
That's done with the validation set.
So you take a piece of the training set for which you have the ground truth
and you call it the validation set and set it aside,
and you evaluate the performance of your system on that validation set.
And after you notice that your trained network is performing poorly
on the validation set for a prolonged period of time,
that's when you stop. That's early stoppage.
Basically it's getting better and better and better
and then there's some period of time,
there's always noise of course,
and after some period of time it's definitely getting worse.
That's where we need to stop.
So that provides an automated way of discovering when we need to stop.
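One common way to automate this, sketched here under the assumption of a "patience" counter (the name and the value 3 are illustrative, not from the lecture), is to stop once the validation loss has failed to improve for several epochs in a row, which tolerates the noise mentioned above.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the epoch with the best validation loss seen before stopping.

    Training stops once the loss has not improved for `patience`
    consecutive epochs; later epochs are never looked at.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch

# Made-up validation curve: improves, wobbles, then gets steadily worse.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.49, 0.53, 0.56, 0.60, 0.40]
print(early_stopping_epoch(losses, patience=3))  # prints 5
```

In a real training loop you would also keep a copy of the weights from the best epoch and restore them after stopping.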
And there's a lot of other regularization methodologies.
Of course, as I mentioned,
dropout is a very interesting approach,
and its variants: simply, with a certain probability,
randomly remove nodes in the network,
along with both their incoming and outgoing edges,
randomly throughout the training process.
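A minimal sketch of dropout on a list of activations follows. The rescaling of the surviving activations by 1 / keep probability ("inverted dropout") is a common convention assumed here, not something stated in the lecture, and the function name and arguments are hypothetical.

```python
import random

def dropout(activations, drop_probability, rng, training=True):
    """Zero each activation with probability `drop_probability` during training.

    Survivors are divided by the keep probability (an assumed convention)
    so the expected value of each activation is unchanged.
    """
    if not training or drop_probability == 0.0:
        return list(activations)  # dropout is switched off at inference time
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]

rng = random.Random(0)
print(dropout([0.5, 1.0, 1.5, 2.0], drop_probability=0.5, rng=rng))
print(dropout([0.5, 1.0, 1.5, 2.0], drop_probability=0.5, rng=rng, training=False))
```

Zeroing a node's activation is equivalent to removing the node together with its outgoing edges for that training step.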
And there's normalization.
Normalization is obviously always applied at the input.
So whenever you have a dataset
with different lighting conditions, different variations,
different sources and so on,
you have to kind of put it all on the same level ground.
So that we're learning the fundamental aspects of the input data
as opposed to some less relevant semantic information
like lighting variation and so on.
So we usually always normalize.
For example if it's computer vision with pixels from 0 to 255,
you always normalize to 0 to 1 or -1 to 1
or normalize based on the mean and the standard deviation.
That's something you should almost always do.
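The three input normalizations just listed can be sketched in a few lines of plain Python. The function names and the small `eps` guard against division by zero are illustrative assumptions.

```python
import math

def scale_to_unit(pixels):
    """Map pixel values from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]

def scale_to_signed_unit(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def standardize(values, eps=1e-8):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]

pixels = [0, 64, 128, 255]
print(scale_to_unit(pixels))
print(scale_to_signed_unit(pixels))
print(standardize(pixels))
```

In practice the mean and standard deviation are computed once on the training set and reused unchanged on validation and test data.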
The thing that enabled a lot of breakthrough performances
in the past few years is batch normalization.
It's performing this kind of same normalization later on in the network,
looking at the inputs to the hidden layers.
And normalizing based on the batch of data on
which you're training, normalized based on the mean and the standard deviation.
Batch renormalization
fixes a few of the challenges with batch normalization,
which is that, given you're normalizing during training
on the minibatches in the training dataset,
that doesn't directly map to the inference stage, the testing.
And so by keeping a running average
across both training and testing,
you're able to asymptotically approach a global normalization.
So this idea applies across all the weights,
not just the inputs: across all the weights you normalize
the world at all the levels of abstraction you're forming.
And batch renorm solves a lot of these problems during inference.
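To show where the training versus inference mismatch comes from, here is a sketch of plain batch normalization for a single scalar feature, with running statistics kept for inference. This is not batch renormalization itself, it leaves out the learned scale and shift that real layers have, and the class name, `momentum` and `eps` values are assumptions for illustration.

```python
import math

class BatchNormFeature:
    """Batch normalization of one scalar feature, without learned scale/shift."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the statistics of this minibatch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and fold them into running averages for later use.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: no reliable batch, so use the running averages.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = BatchNormFeature()
for _ in range(200):
    bn([1.0, 2.0, 3.0, 4.0, 5.0], training=True)
print(bn.running_mean, bn.running_var)
print(bn([3.0], training=False))
```

The two branches use different statistics, which is exactly the gap between training and testing described above.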
And there's a lot of other ideas from layer to weight to
instance normalization to group normalization.
And you can play with a lot of these ideas in the TensorFlow playground
on playground.tensorflow.org, which I highly recommend.
So now let's run through a bunch of different ideas
some of which we'll cover in future lectures.
And what is all of this in this world of deep learning
from computer vision to deep reinforcement learning
to the different small level techniques
to the large natural language processing?
So convolutional neural networks,
the thing that enables image classification.
So these convolution filters slide over the image and are
able to take advantage of the spatial invariance
of visual information: features associated with a cat in the top-left corner are
the same as features associated with cats in the top-right corner and so on.
Images are just a set of numbers and our task is to take that image
and produce a classification,
and use the spatial invariance of visual information
to slide a convolution filter across the image.
And learn that filter, as opposed to
assigning equal value to features that are present
at various regions of the image.
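Sliding a filter across an image can be sketched directly with nested loops. The kernel below is a hand-picked vertical edge detector used only for illustration; in a convolutional network the kernel values are learned. As in most deep learning libraries, the operation shown does not flip the kernel (strictly, cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # The same weights are applied at every position in the image.
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            ))
        output.append(row)
    return output

# A 4x4 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]
vertical_edge = [[-1, 1], [-1, 1]]
print(convolve2d(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The filter responds to the edge wherever it appears, top or bottom, because the same weights are shared across all positions.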
And stacked on top of each other, these convolution filters can form
high-level abstractions of visual information in images.
With AlexNet, as I've mentioned, and the ImageNet dataset and challenge
captivating the world with what is possible with neural networks,
these have been further and further improved,
surpassing human performance, with, of special note,
GoogLeNet with the inception module.
There's different ideas that came along: ResNet with the residual blocks,
and SENet most recently.
So the object detection problem is the next step
in visual recognition.
So image classification is just taking the entire image and
saying what's in the image.
Object detection, localization, is saying find all the objects of interest
in the scene and classify them.
The region-based methods, like Faster R-CNN shown here,
take the image,
use a convolutional neural network to
extract features in that image
and generate region proposals.
Here's a bunch of candidates that you should look at.
And within those candidates, it classifies what they are
and generates the four parameters of the bounding box
that captures that thing.
So object detection localization ultimately boils down to a bounding box,
a rectangle with a class.
That's the most likely class that's in that bounding box.
And you can really summarize region-based methods like this:
you generate the region proposals,
here's a little pseudocode, and do a for loop
over the region proposals
and perform detection inside that for loop.
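The slide's pseudocode is not reproduced in the transcript, so the following is only a sketch of the loop structure being described. The proposal generator and the region classifier are toy stand-ins invented for this example (fixed boxes and a brightness threshold), not how Faster R-CNN computes them.

```python
def detect_region_based(image, propose_regions, classify_region):
    """Region-based detection: loop over proposals, classify each one."""
    detections = []
    for box in propose_regions(image):            # candidates to look at
        label, score = classify_region(image, box)
        if label is not None:
            detections.append((box, label, score))
    return detections

# Hypothetical stand-ins so the sketch runs end to end.
def toy_proposals(image):
    # Each box is (x, y, width, height): the four bounding box parameters.
    return [(0, 0, 2, 2), (2, 0, 2, 2), (0, 2, 2, 2)]

def toy_classifier(image, box):
    x, y, w, h = box
    pixels = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    brightness = sum(pixels) / len(pixels)
    return ("bright", brightness) if brightness > 0.5 else (None, brightness)

toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(detect_region_based(toy_image, toy_proposals, toy_classifier))
```

The cost of this approach grows with the number of proposals, since the classifier runs once per region.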
The Single-Shot methods remove the for loop.
There's a single pass through.
Take for example SSD, shown here.
Take a pretrained neural network
that's been trained to do image classification,
stack a bunch of convolutional layers on top,
from each layer extract features
that are then able to generate, in a single pass,
the bounding box predictions
and the class associated with each bounding box.
The trade-off here, and this is where the popular YOLO v1, v2, v3 come from,
the trade-off here oftentimes is in performance and accuracy.
So single-shot methods are often less performant,
especially in terms of accuracy
on objects that are really far away, or
rather objects that are small in the image, or really large.
Then the next step up in visual perception, visual understanding
is semantic segmentation.
That's what the tutorial that we presented here on GitHub is covering.
Semantic segmentation is the task of, now as opposed to
classifying the entire image or detecting the object with a bounding box,
assigning at a pixel level
the boundaries of what the object is.
In full scene segmentation, that's classifying
every single pixel, which class that pixel belongs to.
And the fundamental aspect there, which
we'll cover a little bit, or a lot more, on Wednesday,
is taking an image classification network,
chopping it off at some point,
and then having that perform the encoding step
of compressing a representation of the scene.
And taking that representation, with a decoder
upsampling in a dense way.
So taking that representation and upsampling
to the pixel level classification.
So that upsampling has a lot of tricks that we'll talk through.
They are interesting, but ultimately it boils down to
the encoding step of forming a representation of
what's going on in the scene
and then the decoding step that upsamples to
the pixel level annotation, the classification of all the individual pixels.
And as I mentioned here the underlying idea applied
most extensively most successfully
in computer vision is transfer learning.
The most commonly applied way of transfer learning is taking a pre-trained neural network
like ResNet and chopping it off at some point.
It's chopping off the fully connected layers,
some parts of the layers, and then taking a dataset,
a new dataset, and retraining that network.
So what is this useful for?
For every single application of computer vision in industry.
When you have a specific application
like you want to build a pedestrian detector.
If you want to build a pedestrian detector and you have a pedestrian dataset,
it's useful to take ResNet trained on ImageNet or COCO,
trained in the general case of vision perception,
and taking that network, chopping off some of the layers,
and then retraining it on your specialized pedestrian dataset.
And depending on how large the dataset is,
some of the previous layers from the pre-trained network should be fixed,
frozen. And sometimes not, depending on how large the data is.
And this is extremely effective in computer vision
but also in audio, speech and NLP.
And so as I mentioned with the pre-trained networks,
they are ultimately forming representations of the data, based
on which classification, regression,
prediction is made.
But the cleanest example of this is the autoencoder,
forming representations in an unsupervised way.
The input is an image and the output is that exact same image.
So why do we do that?
If you add a bottleneck in the network,
where the network is narrower
in the middle than it is on the inputs and the outputs,
it's forced to compress the data down into a meaningful representation.
That's what the autoencoder does.
You're training it to reproduce the input at the output,
and reproduce it with a latent representation
that is smaller than the original raw data.
That's a really powerful way to compress the data.
It's used for removing noise and so on.
But it's also just an effective way to demonstrate a concept.
It can also be used for embeddings.
You have a huge amount of data and you want to
form a compressed efficient representation of that data.
Now in practice, this is completely unsupervised.
In practice, if you want to form an efficient useful representation of the data,
you want to train it in a supervised way.
You want to train it on a discriminative task
where you have labelled data.
And the network is trained to identify cat versus dog.
A network that's trained in the discriminative way, in an
annotated, supervised learning way,
is able to form a better representation.
But nevertheless the concept stands.
And one way to visualize these concepts is
the tool that I really love, projector.tensorflow.org.
It's a way to visualize these different representations,
these different embeddings.
You should definitely play with it, and you can insert your own data.
Okay, going further and further in this direction of unsupervised learning
and forming representations is
generative adversarial networks:
from these representations, being able to generate new data.
And the fundamental methodology of GANs is to have two networks.
One is the generator, one is the discriminator
and they compete against each other
in order for the generator
to get better and better and better at generating realistic images.
The generator's task is, from noise, to generate images
based on a certain representation that are realistic.
And the discriminator is the critic that has to discriminate
between real images and those generated by the generator.
And both get better together.
The generator gets better and better at generating real images
to trick the discriminator
and the discriminator gets better and better at
telling the difference between real and fake
until the generator is able to generate some incredible things.
So shown here is the work from NVIDIA. I mean, the ability to generate realistic faces
has skyrocketed in the past 3 years.
So these are samples of celebrity photos that it's been able to generate.
Those are all generated by a GAN.
There's the ability to generate temporally consistent video over time
with GANs. And then there's the ability, shown
at the bottom right, and NVIDIA, I'm sure,
will also talk about this, to go from pixel-level semantic segmentation,
so from the semantic pixel segmentation on the right,
to being able to generate completely the scene on the left.
All the raw rich high-definition pixels on the left.
The natural language processing world is the same:
forming representations, forming embeddings
with Word2Vec, the ability to form representations from words
that are then able to be used efficiently to reason about the words.
The whole idea of forming representations of the data
is taking a huge,
you know, vocabulary of over a million words.
You want to be able to map it into a space
where words that are far apart in a Euclidean sense,
in Euclidean distance,
are semantically far apart from each other as well.
So things that are similar are together in that space.
And one way of doing that, with skip-grams for example,
is looking at a source text
and turning a large body of text into a supervised learning problem
by learning to predict
from a particular word all its neighbors.
So training a network on the connections that are
commonly seen in natural language.
And based on those connections we're able to know
which words are related to each other.
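Turning raw text into that supervised learning problem amounts to generating (word, neighbor) training pairs. Here is a minimal sketch; the function name, the whitespace tokenization and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center word, neighboring word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
for center, neighbor in skipgram_pairs(tokens, window=1):
    print(center, "->", neighbor)
```

Each pair is one training example: the center word is the input and the neighbor is the label the network learns to predict.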
Now I won't get into too many details, but
the main thing here is the input vector representing the words
and the output vector representing the probability
that those words are connected to each other.
Both are thrown away in the end.
The main thing is the middle, the hidden layer.
That representation gives you the embedding
that represents these words in such a way that, in the Euclidean space,
the ones that are close together
are semantically together, and the ones
that are not are semantically far apart.
And natural language and other sequence data,
text, speech, audio, video, rely on recurrent neural networks.
Recurrent neural networks are able to learn
temporal data, temporal dynamics in the data,
sequence data, and are able to generate sequence data.
The challenge is that they're not able to learn long-term context.
Because when unrolling a neural network,
it's trained by unrolling and doing backpropagation,
without any tricks the backpropagation of the
gradient fades away very quickly.
So you're not able to memorize the context
in a longer form of the sentences.
Unless there's extensions here,
with LSTMs, where long-term dependency
is captured by allowing
the network to forget information,
and allowing it to freely pass information through in time.
So what to forget, what to remember,
and every time decide what to output.
And all of those aspects have gates that are all trainable
with sigmoid and tanh functions.
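As a rough sketch of how those gates fit together, here is one step of a standard LSTM cell with scalar input and state. Real cells use vectors and learned weight matrices; the `params` layout (one `(w_x, w_h, bias)` triple per gate) and the example values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step on scalars; returns the new (h, c)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget_gate = sigmoid(pre_activation("forget"))    # what to forget
    input_gate = sigmoid(pre_activation("input"))      # what to remember
    output_gate = sigmoid(pre_activation("output"))    # what to output
    candidate = math.tanh(pre_activation("candidate")) # proposed new content

    c = forget_gate * c_prev + input_gate * candidate  # updated cell state
    h = output_gate * math.tanh(c)                     # updated hidden state
    return h, c

# Hypothetical weights: forget gate wide open, input gate nearly shut,
# so the cell state passes through time almost unchanged.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.7
for x in [0.3, -1.2, 2.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is carried forward by a gated sum rather than repeated squashing, information (and gradient) can survive across many time steps.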
Bidirectional recurrent neural networks,
from the 90s, are an extension often used for providing
context in both directions.
So recurrent neural networks, simply defined, are
learning representations of what happened in the past.
Now in many cases,
it's not a real-time operation, in that
you're able to also look into the future.
You look into the data that follows in the sequence.
So it's beneficial to do a forward pass through the network
beyond the current point and then back.
The encoder-decoder architecture in recurrent neural networks is
used very much when the sequence on the input
and the sequence on the output are not required to be of the same length.
The task is to first, with the encoder network, encode everything
that came in, everything in the input sequence.
So this is useful for machine translation for example.
So encoding all the information in the input sequence in English,
and then, in the language you're translating to,
given that representation,
keep feeding it into the decoder
recurrent neural network to generate the translation.
The input might be much smaller or much larger than the output.
That's the encoder decoder architecture.
And then there's improvements.
Attention is the improvement on this encoder-decoder architecture
that allows you to, as opposed to taking the input sequence,
forming a representation of it, and that's it,
actually look back at different parts of the input.
So not just relying on the single vector representation
of the entire input.
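A minimal sketch of that "looking back" follows, assuming dot-product scoring between the decoder's current state (the query) and each encoder state. The scoring function is one common choice and an assumption here, and all vectors are toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Weight every encoder state by its relevance to the query."""
    scores = [
        sum(q * k for q, k in zip(query, state))    # dot-product relevance
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one per input token
weights, context = attend([0.0, 3.0], encoder_states)
print(weights)
print(context)
```

The context vector is recomputed at every decoding step, so each output word can focus on a different part of the input instead of a single fixed summary.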
And a lot of excitement
has been around the idea, as I mentioned,
that some of the dream of artificial intelligence
and machine learning in general
has been to remove the human more and more and more from the picture.
Being able to automate some of the difficult tasks.
So AutoML from Google, and just the general concept of
neural architecture search, NASNet.
The ability to automate the discovery of
parameters of a neural network.
And the ability to discover the actual architecture
that produces the best result.
So with neural architecture search you have
basic modules similar to the ResNet modules,
and with a recurrent neural network
you keep assembling a network together.
And assembling it in such a way that it minimizes
the loss of the overall classification performance.
And it's shown that you can then construct
a neural network that's much more efficient
and much more accurate than state of the art
on classification tasks like ImageNet, here shown with a plot,
or at the very least competitive with the state of the art in SENet.
It's super exciting that as opposed to
like I said stacking lego pieces yourself,
the final result is essentially you step back
and you say, here, I have a dataset
with the labels, with the ground truth,
which is what the dream of Google AutoML is.
I have the dataset,
you tell me what kind of neural network
will do best on this dataset.
And that's it. So all you bring is the data.
It constructs the network
through this neural architecture search
and it returns to you the model and that's it.
It makes it possible to solve,
you know, many of the real world problems
that essentially boil down to: I have a few classes
I need to be very accurate on,
here's my dataset.
And then it converts the problem of a deep learning researcher
to the problem of maybe
what's more commonly called a sort of data science
engineer, where the task,
as I said, focuses on what is the right question
and what is the right data to solve that question.
And deep reinforcement learning is taking further steps
along the path of decreasing human input.
Deep reinforcement learning is the task of an agent
to act in the world based on
the observations of the state and the rewards received in that state,
knowing very little about the world
and learning from the very sparse nature of the reward.
Sometimes, in the gaming context, only
when you win or lose.
Or in the robotics context, when you successfully accomplish a task or not.
With a very sparse reward, agents are able to learn how to behave in that world.
Here with cats learning how the bell maps to the food,
and a lot of the amazing work at OpenAI and DeepMind
on robotics manipulation and navigation
through self-play in simulated environments.
And of course the best is our own deep reinforcement learning
competition with DeepTraffic,
that all of you can participate in.
And I encourage you to try to win that. With no supervised knowledge,
no human supervision, through sparse rewards from the simulation
or through self-play constructs, agents are able to learn how to
operate successfully in this world.
And those are the steps we're taking
towards artificial general intelligence.
This is the exciting part, from the breakthrough ideas
that we'll talk about on Wednesday in natural language processing,
to generative adversarial networks
that are able to generate arbitrary data, high-resolution data,
create data, really, from this understanding of the world,
to deep reinforcement learning being able to learn
how to act in the world with very little input from human supervision.
It's taking further and further steps,
and there's been a lot of exciting ideas
going by different names, sometimes misused,
sometimes overused, sometimes misinterpreted: transfer learning,
meta learning and hyperparameter and architecture search,
basically removing a human as much as possible
from the menial tasks
and involving a human only on the fundamental side,
as I mentioned with the racing boat, on the ethical side.
And the things that us humans at least pretend to be quite good at
which is understanding the fundamental big questions,
understanding the data that empowers us to solve real world problems,
and understand the ethical balance
that needs to be struck in order to solve those problems.
Well, as I show on the bottom right, that's our job here in this room,
our job for all the engineers in the world, to solve these problems
and progress forward through the current summer
and through the winter, if it ever comes.
So with that I'd like to thank you and
you can get the videos, code and so on
online at deeplearning.mit.edu.
Thank you very much guys.
