 
 
Hi everyone, let's get started. Good
afternoon and welcome to MIT 6.S191!
This is really incredible to see the
turnout this year. This is the fourth
year now we're teaching this course and
every single year it just seems to be
getting bigger and bigger. 6.S191 is a
one-week intensive boot camp on
everything deep learning. In the past, at
this point I usually try to give you a
synopsis about the course and tell you
all of the amazing things that you're
going to be learning. You'll be gaining
fundamentals into deep learning and
learning some practical knowledge about
how you can implement some of the
algorithms of deep learning in your own
research and on some cool lab related
software projects. But this year I
figured we could do something a little
bit different and instead of me telling
you how great this class is I figured we
could invite someone else from outside
the class to do that instead. So let's
check this out first. Hi everybody and
welcome to MIT 6.S191,
the official introductory course on deep
learning taught here at MIT. Deep
learning is revolutionising so many
fields from robotics to medicine and
everything in between.
You'll learn the fundamentals of this field and how you can build some of these
incredible algorithms.
In fact, this entire speech and video are not real and
were created using deep learning and
artificial intelligence. And in this
class you'll learn how. It has been an
honor to speak with you today and I hope you enjoy the course!
Alright, so as you can tell, deep learning
is an incredibly powerful tool. This was
just an example of how we use deep
learning to perform voice synthesis and
actually emulate someone else's voice, in
this case Barack Obama, and also using
video dialogue replacement to
actually create that video with the help
of Canny AI. And of course, as
you're watching this video, you might
raise some ethical concerns, which we're
also very concerned about, and we'll
actually talk about some of those later
on in the class as well. But let's start
by taking a step back and actually
introducing some of these terms that
we've talked about so far. Let's start
with the word intelligence. I
like to define intelligence as the
ability to process information to inform
future decisions. Now the field of
artificial intelligence is simply
the field which focuses on building
algorithms, in this case artificial
algorithms that can do this as well:
process information to inform future
decisions. Now machine learning is just a
subset of artificial intelligence
that focuses specifically on actually
teaching an algorithm how to do this
without being explicitly programmed to
do the task at hand.
Now deep learning is just a subset of
machine learning which takes this idea
even a step further and asks: how can we
automatically extract the useful pieces
of information needed to inform those
future predictions or make a decision?
And that's what this class is all about:
teaching algorithms how to learn a task
directly from raw data. We want to
provide you with a solid foundation of
how to
understand these algorithms under the
hood, but also provide you with the
practical knowledge and practical skills
to implement state-of-the-art deep
learning algorithms in TensorFlow, which
is a very popular deep learning toolbox.
Now we have an amazing set of lectures
lined up for you this year, including
today's, which will cover neural networks
and deep sequential modeling. Tomorrow
we'll talk about computer vision and
also a little bit about generative
modeling, which is how we can generate
new data, and finally I will talk about
deep reinforcement learning and touch on
some of the limitations and new
frontiers of where this field might be
going and how research might be heading
in the next couple of years. We'll spend
the final two days hearing
guest lectures from top industry
researchers on some really cool and
exciting projects. Every year these
happen to be really, really exciting
talks, so we really encourage you to come
especially for those talks. The class
will conclude with some final project
presentations, which we'll talk about in
a little bit, and also some
awards and a quick award ceremony to
celebrate all of your hard work. Also, I
should mention that after each day of
lectures (today, for example, we have two
lectures)
we'll have a software lab which tries to
focus and build upon all of the things
that you've learned in that day. So
you'll get the foundations during the
lectures and you'll get the practical
knowledge during the software lab, so the
two are kind of jointly coupled in that
sense. For those of you taking this class
for credit, you have a couple of different
options to fulfill your credit
requirement. First, you can
propose a project, optionally in groups
of two, three, or four people, and in these
groups you'll work to develop a cool new
deep learning idea. We realize that
one week, which is the span of this
course, is an extremely short amount of
time to not only think of an idea
but move that idea past the planning
stage and try to implement something, so
we're not going to be judging you on
your results towards this idea but
rather just the novelty of the idea
itself. On Friday,
each of these teams will give a
three-minute presentation on that idea,
and awards will be announced for the
top winners, judged by a panel of judges.
The second option, in my opinion, is a bit
more boring, but we like to give this
option for people that don't like to
give presentations. In this option, if
you don't want to work in a group or you
don't want to give a presentation, you
can write a one-page review of a
recent deep learning
paper, or any paper of your choice, and
this will be due on the last day of
class as well. Also, I should mention that
for the project presentations we
give out all of these cool prizes,
especially these three NVIDIA GPUs, which
are really crucial for doing any sort of
deep learning on your own. So we
definitely encourage everyone to enter
this competition and have a chance to
win these GPUs and these other cool
prizes like Google Home and SSD cards as
well. Also, each of the three
labs will have corresponding prizes, so
instructions to actually enter those
respective competitions will be within
the labs themselves, and you can
enter to win these different prizes
depending on the lab. Please
post to Piazza if you have questions, and
check out the course website for slides.
Today's slides are already up; there was a
bug in the website, but we fixed that, so
today's slides are up now. Digital
recordings of each of these lectures
will be up a few days after each class.
This course has an incredible team of
TAs that you can reach out to if you
have any questions, especially during the
software labs; they can help you answer
any questions that you might have. And
finally, we really want to give a huge
thank you to all of our sponsors; without
their help and support this class would
not have been possible. OK, so now with
all of that administrative stuff out of
the way, let's start with the fun
stuff that we're all here for. Let's
start actually by asking ourselves a
question: why do we care about deep
learning? Why do you all care about
deep learning, since all of you came to
this classroom today, and why
specifically do you care about deep learning
now? Well, to answer that question we
actually have to go back and understand
traditional machine learning at its core
first. Traditional machine learning
algorithms typically try to define a
set of rules or features in the data, and
these are usually hand-engineered, and
because they're hand-engineered they often
tend to be brittle in practice. So let's
take a concrete example: if you want to
perform facial detection, how might you
go about doing that? Well, first you might
say: to classify a face, the first thing
I'm gonna do is try and
recognize whether I see a mouth
in the image, the eyes, ears, and nose. If I
see all of those things, then maybe I can
say that there's a face in that image.
But then the question is: OK, but how do
I recognize each of those sub-things?
How do I recognize an eye? How do I
recognize a mouth? And then you have to
decompose that into: OK, to recognize a
mouth I maybe have to recognize these
pairs of lines, oriented in a
certain direction,
and then it keeps getting more
complicated, and at each of these steps you
kind of have to define a set of features
that you're looking for in the image. Now
the key idea of deep learning is that
you learn these features
directly from raw data. So what you're going
to do is just take a
bunch of images of faces, and then the
deep learning algorithm is going to
develop some hierarchical representation:
first detecting lines and edges in
the image, using these lines and edges to
detect corners and mid-level
features like eyes, noses, mouths, and ears,
then composing these together to detect
higher-level features like maybe jaw
lines, the side of the face, etc., which
can then be used to detect the final face
structure. Actually, the fundamental
building blocks of deep learning have
existed for decades, and their
underlying algorithms for training these
models have also existed for many years.
So why are we studying this now? Well, for
one, data has become much more pervasive.
We're living in the age of big data,
and these algorithms are hungry
for huge amounts of data to succeed.
Secondly, these algorithms are massively
parallelizable, which means that they
can benefit tremendously from modern GPU
architectures and hardware acceleration
that simply did not exist when these
algorithms were developed. And finally,
due to open-source toolboxes like
TensorFlow, which you'll get
experience with in this class,
building and deploying these models has
become extremely streamlined, so much so
that we can condense all this material
down into one week. So let's start with
the fundamental building block of a
neural network, which is a single neuron,
or what's also called a perceptron. The
idea of a perceptron, or a single neuron,
is very basic, and I'll try and keep it
as simple as possible, and then we'll try
and work our way up from there. Let's
start by talking about the forward
propagation of information through a
neuron. We define a set of inputs to that
neuron as x1 through xm, and each of
these inputs has a corresponding weight,
w1
through wm. Now what we can do is, with
each of these inputs and each of these
weights, multiply them
correspondingly together and take a sum
of all of them. Then we take this single
number, that summation, and we pass it
through what's called a nonlinear
activation function, and that produces
our final output y. Now this is actually
not entirely correct: we also have what's
called a bias term in this neuron, which
you can see here in green. The
purpose of the bias term is
really to allow you to shift your
activation function to the left and to
the right regardless of your inputs.
You can notice that the bias
term is not affected by the x's;
it's just a bias associated with that neuron.
Now on the right side you can see this
diagram illustrated mathematically as a
single equation, and we can actually
rewrite this using linear
algebra in terms of vectors and dot
products. So instead of having a
summation over all of the x's, I'm going
to collapse my x's into a vector, capital X,
which is now just a list or a vector of
numbers, a vector of inputs I should say,
and you also have a vector of weights,
capital W. To compute the output of a
single perceptron, all you have to do is
take the dot product of X and W, which
represents that element-wise
multiplication and summation, add the
bias, and then
apply that non-linearity, which here is
denoted as g.
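
The forward pass described above (weighted sum of the inputs, plus a bias, passed through a non-linearity g) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code shown on the lecture slides: the function names `sigmoid` and `perceptron`, and the input, weight, and bias values, are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b, g=sigmoid):
    """Forward pass of a single neuron: g(b + sum_i x_i * w_i)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Dot product of inputs and weights, shifted by the bias term.
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    # Apply the nonlinear activation function.
    return g(z)

# Hypothetical inputs, weights, and bias, chosen only for illustration.
y = perceptron(x=[0.5, -1.0, 2.0], w=[0.1, 0.4, -0.3], b=0.2)
print(y)
```

Passing `g` as a parameter keeps the sketch independent of the particular activation function, so a different non-linearity could be swapped in without touching the rest of the code.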
So now you might be wondering: what is
this nonlinear activation function? I've
mentioned it a couple of times, but I
haven't really told you precisely what
it is. One common example of this
activation function is what's called a
sigmoid function, and you can see an
example of a sigmoid function here on
the bottom right. One thing to note is
that this function takes any real number
as input on the x-axis and transforms
that real number into a scalar output
between 0 and 1.
It's a bounded output between 0 and 1, so
one very common use case of the sigmoid
function is when you're dealing with
probabilities, because probabilities
also have to be bounded between 0 and 1. So
sigmoids are really useful when you want
to output a single number and represent
that number as a probability.
In fact, there are many
common types of nonlinear activation
functions, not just the sigmoid but many
others that you can use in neural
networks, and here are some common ones.
Throughout this presentation you'll
find these TensorFlow icons, like you can
see all
across the bottom here, and these are
just to illustrate how one could use
each of these topics in a practical
setting. You'll see these
scattered throughout the slides. No
need to take furious notes on
these code blocks; like I said, all of the
slides are published online, so
especially during your labs, if you want
to refer back to any of the slides, you
can always do that from the
online lecture notes. Now why do we care
about activation functions? The point of
an activation function is to introduce
nonlinearities into the network, and this is
actually really important in real life,
because in real life almost all of our
data is nonlinear. Here's a concrete
example: if I told you to separate the
green points from the red points using a
linear function, could you do that? I
don't think so, right? You'd get
something like this. You could do it, but
you wouldn't do a very good job at it, and
no matter how deep or how large your
network is, if you're using a linear
activation function you're just
composing lines on top of lines, and
you're going to get another line.
So this is the best you'll be able to do
with a linear activation function. On
the other hand, nonlinearities allow you
to
approximate arbitrarily complex
functions by introducing these
nonlinearities into your decision
boundary, and this is what makes neural
networks extremely powerful. Let's
understand this with a simple example,
and let's go back to this picture that
we had before. Imagine I give you a trained
network with weights W on the top right,
so W here is 3 and minus 2, and the
network only has 2 inputs, x1 and x2. If
we want to get the output, it's simply
the same story as we had before: we
multiply our inputs by those weights, we
take the sum, and pass it through a
non-linearity. But let's take a look at
what's inside of that non-linearity
before we apply it. What we get, when we
take this dot product, x1 times 3 plus x2
times minus 2, plus the bias of 1, is simply a
2D line. We can plot that: if we set
that equal to 0, for example, that's a 2D
line and it looks like this. On the x
axis is x1, on the y axis is x2, and
we're just
illustrating where this line equals 0, so
anywhere on this line is where x1 and
x2 correspond to a value of 0. Now if I
feed in a new input, either a test
example, a training example, or whatever,
and that input
has these coordinates, minus 1 and 2,
so it has an x1 value of minus 1 and
an x2 value of 2, I can see visually
where this lies with respect to that
line. In fact, this idea can be
generalized a little bit more. If we
compute that line, we get minus 6, right?
So before we apply the
non-linearity we get minus 6. When we
apply a sigmoid non-linearity, because
sigmoid collapses everything between 0
and 1, anything greater than 0 is going
to be above 0.5 and anything below zero is
going to be less than 0.5. So
because minus 6 is less than zero, we're
going to have a very low output, about
0.002.
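
The arithmetic of this worked example can be checked with a short Python sketch. It assumes a bias of 1, which is the value that reproduces the minus 6 quoted in the lecture from the weights 3 and minus 2 and the input (minus 1, 2); the variable names are illustrative only.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w = [3.0, -2.0]   # weights from the example
b = 1.0           # bias: assumed to be 1, which yields the quoted z = -6
x = [-1.0, 2.0]   # the new input point (x1, x2)

# State of the neuron before the non-linearity: 1 + 3*(-1) + (-2)*2 = -6
z = b + sum(xi * wi for xi, wi in zip(x, w))
y = sigmoid(z)

print(z)            # -6.0
print(round(y, 4))  # 0.0025
```

Because z is negative, the point lies on the side of the line where the sigmoid output is below 0.5, which matches the very low output described above.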
We can actually generalize this idea for
the entire feature space, let's call it.
For any point on this plot, I can tell
you if it lies on the left side of the
line; that means that before we apply the
non-linearity, the z, or the state of that
neuron, will be negative, less than zero.
After applying that non-linearity, the
sigmoid will give it a probability of
less than 0.5. On the right side, if
it falls on the right side of the line,
it's the opposite story. If it falls
right on the line, it means that z equals
zero exactly and the probability equals
0.5. Now actually, before I move on, this
is a great example of actually
visualizing and understanding what's
going on inside of a neural network. The
reason why it's hard to do this with
deep neural networks is because you
usually don't have only two inputs and
you usually don't have only two weights
either. This is a simple two-dimensional
problem, but as you scale up the size of your
network you could be dealing with
hundreds or thousands or millions of
parameters and million-dimensional
spaces, and then visualizing these types
of plots becomes extremely difficult and
is not feasible in practice.
So this is one of the challenges that we
face when we're training neural
networks and really understanding their
internals, but we'll talk about how we
can actually tackle some of those
challenges in later lectures as well.
OK, so now that we have that idea of a
perceptron, a single neuron, let's start
building up neural networks: how
we can use that perceptron to create
full neural networks, and how all
of this story comes together. Let's
revisit this previous diagram of the
perceptron. If there are only a few
things you remember from this class, try
to take away this: how a perceptron
works. Just keep remembering this; I'm
going to keep drilling it in. You take
your inputs, you apply a dot product with
your weights, and you apply a
non-linearity. It's that simple.
Oh sorry, I missed a step: you take the dot
product with your weights, add a bias, and
apply your non-linearity, so three steps.
Now let's simplify this type of diagram
a little bit. I'm gonna remove the bias
just for simplicity, and I'm gonna remove all
of the weight labels, so now you can
assume that every line
has a weight associated with it. Let's
say that z is the
output of that dot product, so that's the
element-wise multiplication of our
inputs with our weights, summed up, and that's what
gets fed into our activation function. So
our final output y is just our
activation function applied to z. If we
want to define a multi-output neural
network, we can simply add another
one of these perceptrons to this picture.
Now we have two outputs: y1 is a normal
perceptron and y2 is just
another normal perceptron, the same idea
as before. They all connect to the previous
layer with a different set of weights,
and because all inputs are densely
connected to all of the outputs, these
types of layers are often called dense
layers. Let's take an example of how
one might actually go from this nice
illustration, which is very conceptual
and nice and simple, to how you could
actually implement one of these dense
layers from scratch by yourselves using
TensorFlow. What we can do is start
off by first defining our two sets of
parameters: we have our actual weight vector, which
is W, and we also have our bias vector.
Both of these parameters
are governed by the output space, so
depending on how many neurons you have
in that output layer, that will govern
the size of each of those weight and
bias vectors. What we can do then is
simply define that forward propagation
of information. Here I'm showing you
this in the call function in TensorFlow.
Don't get too caught up in the details
of the code; again, you'll get a
walkthrough of this code inside of the
labs today, but I want to just show you
some high-level understanding of
how you could actually take what you're
learning and apply the TensorFlow
implementations to it. Inside the call
function it's the same idea again: you
can compute z, which is the state; it's
that multiplication of your inputs with
the weights, and you add the bias,
so that's right there.
Once you have z, you just pass it
through your sigmoid, and that's your
output for that layer.
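
As a rough, framework-free sketch of the dense layer described above (this is not the TensorFlow code shown on the slide), the same steps can be written in plain Python. The class name `DenseLayer`, the method name `call`, the random initialization scheme, and the example dimensions are all assumptions made for illustration.

```python
import math
import random

class DenseLayer:
    """A fully connected layer computing sigmoid(x W + b) without a framework."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # One weight per (input, output) pair; the output size governs the shapes.
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(output_dim)]
                  for _ in range(input_dim)]
        # One bias per output neuron.
        self.b = [0.0] * output_dim

    def call(self, x):
        """Forward pass for one input vector x (a list of input_dim floats)."""
        outputs = []
        for j in range(len(self.b)):
            # z_j: weighted sum of all inputs feeding output neuron j, plus its bias.
            z = self.b[j] + sum(x[i] * self.W[i][j] for i in range(len(x)))
            # Pass the state through the sigmoid non-linearity.
            outputs.append(1.0 / (1.0 + math.exp(-z)))
        return outputs

# Hypothetical layer with 3 inputs and 2 outputs.
layer = DenseLayer(input_dim=3, output_dim=2)
print(layer.call([1.0, 2.0, 3.0]))  # two values, each between 0 and 1
```

Every output neuron sees every input, which is what makes the layer "dense"; in a real framework the loops would be replaced by a single matrix multiplication.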
Now TensorFlow is great because it has
already implemented a lot of these
layers for us, so we don't have to do
what I just showed you from scratch. In
fact, to implement a layer like this,
a multi-output perceptron layer
with two outputs, we can simply call
tf.keras.layers.Dense with units equal
to two, to indicate that we have two
outputs on this layer. There is a
whole bunch of other parameters that you
could input here, such as the activation
function, as well as many other things to
customize how this layer behaves in
practice. Now let's take a look at a
single-layer neural network. This is
taking it one step beyond what we've
just seen: this is where we have now a
single hidden layer that feeds into a
single output layer. I'm calling this
a hidden layer because, unlike our inputs
and our outputs, the states of the
hidden layer are not directly enforced
and not directly observed. We
can probe inside the network and see
them, but we don't actually enforce what
they are; these are learned, as opposed to
the inputs, which are provided by us. Now
since we have a transformation between
the inputs and the hidden layer, and between the
hidden layer and the output layer, each
of those two transformations will have
its own weight matrix, which here I
call W1 and W2, corresponding to
the first layer and the second layer. If
we look at a single unit inside of that
hidden layer,
take for example z2, which I'm showing here,
that's just a single perceptron like we
talked about before: it's taking a
weighted sum of all of those inputs that
feed into it, and it applies the
non-linearity and feeds it on to the
next layer, same story as before. This
picture actually looks a little bit
messy, so what I want to do is actually
clean things up a little bit for you, and
I'm gonna replace all of those lines
with just this symbolic representation,
and we'll just use this from now on
to denote dense layers or
fully connected layers
between an input and an output or
between an input and a hidden layer.
And again, if we wanted to implement this
in TensorFlow, the idea is pretty simple:
we can just define two of these dense
layers, the first one our hidden layer
with n outputs and the second one our
output layer with two outputs. We can
join them together,
aggregate them together, into this
wrapper which is called a tf.keras.Sequential
model. Sequential models are just
this idea of composing neural networks
using a sequence of layers, so whenever
you have a sequential message-passing
system, or are sequentially processing
information throughout the network, you
can use sequential models and just
define your layers as a sequence, and
it's very nice to allow information to
propagate through that model. Now if we
want to create a deep neural network, the
idea is basically the same thing, except
you just keep stacking on more of these
layers to create more of
a hierarchical model, one where the
final output is computed by going deeper
and deeper into this representation. And
the code looks pretty similar again:
again we have this tf.keras.Sequential model,
and inside that model we just have a
list of all of the layers that we want
to use, and they're just stacked on top
of each other. OK, so this is awesome.
Hopefully now you have an understanding
of not only what a single neuron is but
how you can compose neurons together and
actually build complex hierarchical
models with neural networks.
Now let's take a look at how you can
apply these neural networks in a very
real and applied setting to solve some
problem, and actually train them to
accomplish some task. Here's a problem
that I believe any AI system should be
able to solve for all of you, and
probably one that you care a lot about:
will I pass this class? To do this, let's
start with a very simple two-input model.
One feature, or one input, we're gonna
define is how many
lectures you attend during this class,
and the second one is the number of
hours that you spend on your final
project. I should say that the minimum
number of hours you can spend on your final
project is 50 hours. No, I'm just joking.
OK, so let's take all of the data from
previous years and plot it on this
feature space like we looked at before.
Green points are students that have
passed the class in the past, and red
points are people that have failed. We
can plot all of this data onto this
two-dimensional grid like this, and we
can also plot you. So here you are: you
have attended four lectures and you've
only spent five hours on
your final project,
and the question is: are you going to
pass the class? Given everyone around you
and how they've done in the past, how are
you going to do? So let's do it. We have
two inputs, we have a
single-hidden-layer neural network, we
have three hidden units in that hidden
layer, and we'll see that the final
output probability, when we feed in those
two inputs of four and five, is predicted
to be 0.1, or 10%. The probability of you
passing this class is 10%. That's not
great news. The actual outcome was 1,
so you did pass the class. Now does
anyone have an idea of why the network
was so wrong in this case? Exactly, so we
never told this network anything. The
weights are wrong; we've just initialized
the weights. In fact, it has no idea what
it means to pass a class. It has no idea
of what each of these inputs means, how
many lectures you've attended and the
hours you've spent on your final project;
it's just seeing some random numbers. It
has no concept of how other people in
the class have done so far. So what we
have to do to this network first is
train it, and we have to teach it how to
perform this task. Until we teach it, it's
just like a baby that doesn't know
anything: it just entered the world, it
has no concept or idea of how to
solve this task, and we have to teach it
that. Now how do we do that? The idea here
is that first we have to tell the
network when it's wrong, so we have to
quantify what's called its loss or its
error, and to do that we actually just
take our prediction, what the network
predicts, and we compare it to what the
true answer was.
If there's a big discrepancy between the
prediction and the true answer, we can
tell the network: hey, you made a big
mistake, so this is a big error,
it's a big loss, and you should try and
fix your answer to move closer towards
the true answer. OK,
now you can imagine if you don't have
just one student but many
students. The total loss, let's call it
here the empirical risk or the objective
function (it has many different names),
is just the average of all of
those individual losses. So the
individual loss is a loss that takes as
input your prediction and your actual outcome;
that's telling you how wrong that single
example is, and then the total
loss is just the average of all of those
individual student losses. If we look
at the problem of binary classification,
which is the case that we
care about in this example, where we're
asking a question, will I pass the class,
yes or no, binary classification, we can
use what is called the softmax
cross-entropy loss. For those of you
who aren't familiar with cross-entropy,
this was actually a formulation
introduced by Claude Shannon here at MIT
during his master's thesis, and
this was about 50 years ago. It's still
being used very prevalently today, and
the idea is it just again compares how
different these two distributions are. So
you have a distribution of how
likely you think the student is going to
pass and you have the true distribution
of whether the student passed or not. You can
compare the difference between those two
distributions, and that tells you the
loss that the network incurs on that
example. Now let's assume that instead of
a classification problem we have a
regression problem, where instead of
predicting if you're going to pass or
fail the class you want to predict the
final grade that you're going to get. So
now it's not a yes/no answer problem
anymore,
but instead it's: what's the grade I'm
going to get? What's the number? So
it's a full range of numbers that
are possible now,
and now we might want to use a different
type of loss for this different type of
problem. In this case we can use
what's called a mean squared error loss.
We take
the
prediction of the network, we take the
actual true final grade that the student
got, we subtract them, we take the
square of that difference, and averaged
over all the students that's the
mean squared error, that's the loss that
the network should try to
minimize. OK, so
now that we have all this information
with the loss function and how to
actually quantify the error of the
neural network let's take this and
understand how to train train our model
to actually find those weights that it
needs to to use for its prediction so W
is what we want to find out W is the set
of weights and we want to find the
optimal set of weights that tries to
minimize this total loss over our entire
test set so our test set is this example
data set that we want to evaluate our
model on so in the class example the
test set is you so you want to
understand how likely you are to pass
this class you're the test set now what
this means is that we want to find the
W's that minimize that total loss
function which we call as the objective
function J of W now remember that W is
just a aggregation or a collection of
all of the individual w's from all of
your weights so here this is just a way
for me to express this in a clean
notation but W is a whole set of numbers
it's not just a single number and you
want to find this all of the W's you
want to find the value of each of those
weights such that you can minimize this
entire loss function it's a very
complicated problem and remember that
our loss function is just a simple
function in terms of those weights so if
we plot in the case again of a
two-dimensional weight problem so one of
the weights is on the x-axis one of the
weights is on this axis and on the z
axis we have the loss so for any
value of w we can see what the loss
would be at that point now what do we
want to do we want to find the place on
this landscape what are the values of W
that we get the minimum loss okay so
what we can do is we can just pick a
random W pick a random place on this
this landscape to start with and from
this random place let's try to
understand how the landscape is changing
what's the slope of the landscape we can
take the gradient of the loss with
respect to each of these weights to
understand the direction of maximum
ascent okay that's what the gradient
tells us now that we know which way is
up we can take a step in the direction
that's down so we know which way is up
we reverse the sign so now we start
heading downhill and we can move towards
that lowest point now we just keep
repeating this process over and over
again until we've converged to a local
minimum now we can summarize this
algorithm which is known as gradient
descent because you're taking a gradient
and you're descending down down that
landscape by starting to initialize our
rates wait randomly we compute the
gradient DJ with respect to all of our
weights then we update our weights in
the opposite direction of that gradient
and take a small step which we scale
here by eta and this eta is
referred to as the learning rate and
we'll talk a little bit more about that
later but eta is just a scalar number
that determines how much of a step you
want to take at each iteration how
strongly or aggressively do you want to
step along the negative gradient in code
the picture looks very similar so to
implement gradient descent is just a few
lines of code just like the pseudocode
you can initialize your weights randomly
in the first line you can compute your
loss with respect to those predictions
and your data and then the gradient of
that loss given that gradient you just
update your weights in the opposite
direction of that gradient vector
right
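Here's a minimal sketch of those three
steps in plain Python, just to make the
loop concrete. The toy data set, the
linear model, and the names loss,
gradient, and gradient_descent are all
assumptions for illustration and not the
code from the slide, and the gradient is
written out by hand for this one model.

```python
import random

def loss(w, data):
    # Mean squared error of the toy model y_hat = w[0] + w[1] * x.
    return sum((w[0] + w[1] * x - y) ** 2 for x, y in data) / len(data)

def gradient(w, data):
    # Gradient of the loss above with respect to each weight, derived by hand.
    n = len(data)
    errors = [(w[0] + w[1] * x - y, x) for x, y in data]
    return [sum(2 * e for e, _ in errors) / n,
            sum(2 * e * x for e, x in errors) / n]

def gradient_descent(data, eta=0.5, steps=500, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1), rng.gauss(0, 1)]           # 1. initialize weights randomly
    for _ in range(steps):
        g = gradient(w, data)                        # 2. compute the gradient
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # 3. step against it, scaled by eta
    return w

# Toy data generated from y = 1 + 2x, so the weights should approach [1, 2].
data = [(i / 10, 1.0 + 2.0 * (i / 10)) for i in range(11)]
w = gradient_descent(data)
print(w, loss(w, data))
```

The only part that depends on the model
is the gradient function, which is the
piece a real framework would compute
for you.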
now the magic line here is actually how
do you compute that gradient and that's
something I haven't told you and that's
something that's not easy at all so the
question is given a loss and given all
of our weights in our network how do we
know which way is good which way is a
good place to move given all of this
information and I never told you about
that but that's a process called back
propagation and let's talk about a very
simple example of how we can actually
derive back propagation using elementary
calculus so we'll start with a very
simple network with only one hidden
neuron and one output this is probably
the simplest neural network that you can
create you can't really get smaller than
this computing the gradient of our loss
with respect to W 2 here which is that
second weight between the hidden state and
our output can tell us how much a small
change in W 2 will impact our loss so
that's what the gradient tells us right
if we change W 2 by a very small
amount how
does our loss change does it go up or
down how does it change and by how much
really so that's the gradient that we
care about the gradient of our loss with
respect to W 2 now to evaluate this we
can just apply the chain rule in
calculus so we can split this up into
the gradient of our loss with respect to
our output Y multiplied by the gradient
of our output Y with respect to
W 2 now if we want to repeat this
process for a different weight in the
neural network let's say now W 1 not W 2
now we substitute W 1 on both sides we also
apply the chain rule but now you're
going to notice that the gradient of Y
with respect to W 1 is also not directly
computable we have to apply the chain
rule again to evaluate this so let's
apply the chain rule again we can break
that second term up into the gradient of
Y with respect to the state Z times the
gradient of Z with respect to W 1 ok and
using that we
can kind of back propagate all of these
gradients from the output all the way
back to the input that allows our error
signal to really
propagate from output to input and
allows these gradients to be computed in
practice now a lot of this is not really
important or excuse me it's not as
crucial that you understand the
nitty-gritty math here because in a lot
of popular deep learning frameworks we
have what's called automatic
differentiation which does all of this
back propagation for you under the hood
and you never even see it which is
incredible it made training neural
networks so much easier you don't have
to implement back propagation anymore
but it's still important to understand
how these work at the foundation which
is why we're going through it now ok
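To make the chain rule concrete, here's
a small sketch for a network with one
hidden neuron and one output. The tanh
activation, the squared-error loss, and
the function names are assumptions for
illustration, since the lecture doesn't
fix them here. The hand-derived
gradients are compared against a
finite-difference estimate.

```python
import math

def forward(w1, w2, x):
    z = math.tanh(w1 * x)  # hidden state (tanh is an assumed activation)
    y_hat = w2 * z         # output
    return z, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2  # assumed squared-error loss J

def backprop(w1, w2, x, y):
    z, y_hat = forward(w1, w2, x)
    dJ_dy = 2 * (y_hat - y)     # dJ/dy_hat
    dy_dw2 = z                  # dy_hat/dw2
    dy_dz = w2                  # dy_hat/dz
    dz_dw1 = (1 - z ** 2) * x   # dz/dw1, using d tanh(u)/du = 1 - tanh(u)^2
    dJ_dw2 = dJ_dy * dy_dw2             # one application of the chain rule
    dJ_dw1 = dJ_dy * dy_dz * dz_dw1     # chain rule applied twice
    return dJ_dw1, dJ_dw2

def numerical(w1, w2, x, y, eps=1e-6):
    # Central finite differences, used only as an independent check.
    d1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    d2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return d1, d2

print(backprop(0.5, -1.5, 0.8, 1.0))
print(numerical(0.5, -1.5, 0.8, 1.0))
```

The two printed pairs should agree
closely, which is a handy sanity check
whenever you derive gradients by hand.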
obviously then you repeat this for every
single weight in the network here we showed
it for just W 1 and W 2 which is every
single weight in this network but if you
have more you can just repeat it again
keep applying the chain rule from output
to input to compute this ok and that's
the back prop algorithm in theory very
simple it's just an application of the
chain rule in essence but now let's
touch on some of the insights from
training and how you can use the back
prop algorithm to train these networks
in practice optimization of neural
networks is incredibly tough in practice
so it's not as simple as the picture I
showed you the colorful one on the
previous slide here's an illustration
from a paper that came out about two or
three years ago now where the authors
tried to visualize the landscape of a
neural network with millions of
parameters but they collapsed that down
onto just two-dimensional space so that
we can visualize it and you can see that
the landscape is incredibly complex
it's not easy there are many local
minima where the gradient descent
algorithm could get stuck and
applying gradient descent in practice in
these types of environments which are very
standard in neural networks can be a
huge challenge
now recall the update equation
that we defined previously with gradient
descent this is that same equation we're
going to update our weights in the
opposite direction of
our gradient I didn't talk too much
about this parameter eta I pointed it
out
this is the learning rate it determines
how much of a step we should take in the
direction of that gradient and in
practice setting this learning rate can
have a huge impact on performance so if
you set that learning rate too small that
means that you're not really trusting
your gradient on each step so if eta is
super tiny
that means on each step you're
only going to move a little bit
in the opposite direction of your
gradient just in little small increments
and what can happen then is you can get
stuck in these local minima because
you're not being as aggressive as you
should be to escape them now if you set
the learning rate too large you can
actually overshoot completely and
diverge which is even more undesirable
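Here's a tiny sketch of that trade-off.
It assumes the simplest possible loss,
J(w) = w^2, which has a single minimum
at zero, so it only shows the slow
progress and the divergence, not the
local minima. The function name run and
the three eta values are made up for
illustration.

```python
def run(eta, steps=50, w=1.0):
    # Gradient descent on J(w) = w**2, whose gradient is 2 * w.
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

for eta in (0.001, 0.3, 1.1):
    print(eta, run(eta))
```

With the tiny eta the weight has barely
moved from its starting value after 50
steps, with the middle one it sits
essentially at the minimum, and with
the large one every step overshoots and
the weight grows instead of shrinking.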
so setting the learning rate can be very
challenging in practice you want to pick
a learning rate that's large enough such
that you avoid the local minima but
small enough such that you still converge
in practice now the question that you're
all probably asking is how do we set the
learning rate then well one option is
that you can just try a bunch of
learning rates and see what works best
another option is to do something a
little bit more clever and see if we can
try to have an adaptive learning rate
that changes with respect to our loss
landscape maybe it changes with respect
to how fast the learning is happening or
a range of other ideas within the
network optimization scheme itself this
means that the learning rate is no
longer fixed but it can now increase or
decrease throughout training so as
training progresses your learning rate
may speed up you may take more
aggressive steps you may take smaller
steps as you get closer to the local
minima so that you really converge on
that point and there are many options
here of how you might want to design
this adaptive algorithm and this has
been a widely studied field in
optimization theory for machine learning
and deep learning and there have been
many published papers and
implementations within TensorFlow on
these different types of adaptive
learning rate algorithms so SGD is just
that vanilla gradient descent that I
showed you before that's the first one
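To make the word adaptive a bit more
concrete, here's a rough sketch in the
spirit of Adagrad, one well-known
adaptive method. It is a simplified
illustration under assumed names and
hyperparameters (adagrad, grad_fn, eta,
eps), not the TensorFlow implementation.
Each weight gets its own effective step
size, which shrinks as that weight's
squared gradients accumulate.

```python
import math

def adagrad(grad_fn, w, eta=0.5, steps=200, eps=1e-8):
    w = list(w)
    accum = [0.0] * len(w)  # running sum of squared gradients, one per weight
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            accum[i] += g[i] ** 2
            # Per-weight step size: eta divided by the root of the accumulated sum.
            w[i] -= eta / (math.sqrt(accum[i]) + eps) * g[i]
    return w

# Assumed toy loss J(w) = w0^2 + 100 * w1^2, much steeper along w1 than w0.
grad = lambda w: [2 * w[0], 200 * w[1]]
print(adagrad(grad, [1.0, 1.0]))
```

With a single fixed eta, this loss would
need eta below 0.01 to stay stable in
the steep direction, which makes
progress in the shallow direction slow.
The per-weight scaling sidesteps that.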
all of the others use
adaptive learning rates which means that
they change their learning rate during
training itself so they can increase or
decrease depending on how the
optimization is going and during your
labs we really encourage you again to
try out some of these different
optimization schemes see what works what
doesn't work a lot of it is problem
dependent there are some heuristics that
you can pick up but we want you to
really gain those heuristics yourselves
through the course of the labs it's part
of building character okay so let's put
this all together from the beginning we
can define our model which is defined as
this sequential wrapper inside of this
sequential wrapper we have all of our
layers all of these layers are composed
of perceptrons or single neurons which
we saw earlier the second line defines
our optimizer which we saw in the
previous slide
this can be SGD it can also be any of
those adaptive learning rates that we
saw before now what we want to do is
during our training loop it's
the same story again as before
nothing's changing here we forward pass
all of our inputs through that model we
get our predictions using those
predictions we can evaluate them and
compute our loss our loss tells us how
wrong our network was on that iteration
it also tells us how we can compute the
gradients and how we can change all of
the weights in the network to improve in
the future and then the final line there
takes those gradients and actually
allows our optimizer to update the
weights and the trainable variables such
that on the next iteration they do a
little bit better and over time if you
keep looping this will converge and
hopefully you should fit your data
now I want to continue to talk about
some tips for training these networks in
practice and focus on a very powerful
idea of batching your data into mini
batches so to do this let's revisit the
gradient descent algorithm this gradient
is actually very computationally
expensive to compute in practice so
using the backprop algorithm is
a very expensive idea in practice so
what we want to do is actually not
compute this over all of the data points
but actually compute it over just a single
data point in the data set in most
real-life applications it's not actually
feasible to compute on your entire data
set at every iteration it's just too
much data so instead we pick a single
point randomly we compute our gradient
with respect to that point and then on
the next iteration we pick a different
point and we can get a rough estimate of
our gradient at each step right so
instead of using all of our data now we
just pick a single point i we compute
our gradient with respect to that single
point i now what's a middle ground here
so the downside of using a single point
is that it's going to be very noisy the
downside of using all of the points is
that it's too computationally expensive
is there some middle ground that we
can have in between so that middle
ground is actually very simple
instead of taking one point and instead
of taking all of the points let's take a mini
batch of points so maybe something on
the order of 10 20 30 100 maybe
depending on how rough or accurate you
want that approximation of your gradient
to be and how much you want to trade off
speed and computational efficiency now
our estimate of the true gradient is
obtained by averaging the gradient from each of
those B points so B is the size of your
batch in this case now since B is
normally not that large like I said
maybe on the order of tens to hundreds
this is much faster to compute than full
gradient descent and much more accurate
than stochastic gradient descent because
it's using more than one point more than
one estimate now this increase in the
accuracy of our gradient estimate actually
allows us to converge to our target much
quicker because it means that our
gradients are more accurate in practice
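Here's a rough sketch of that averaging
and of why a bigger batch gives a less
noisy estimate. The one-weight model,
the noisy toy data, and the names
example_gradient, batch_gradient, and
mean_error are assumptions for
illustration only.

```python
import random

def example_gradient(w, x, y):
    # Gradient of the squared error (w * x - y)^2 for a single data point.
    return 2 * (w * x - y) * x

def batch_gradient(w, data, batch_size, rng):
    # Average the per-example gradients over a random batch of B points.
    batch = rng.sample(data, batch_size)
    return sum(example_gradient(w, x, y) for x, y in batch) / batch_size

def mean_error(w, data, batch_size, rng, trials=200):
    # How far, on average, the batch estimate lands from the full gradient.
    full = sum(example_gradient(w, x, y) for x, y in data) / len(data)
    errors = [abs(batch_gradient(w, data, batch_size, rng) - full)
              for _ in range(trials)]
    return sum(errors) / trials

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(1000)]
data = [(x, 3.0 * x + rng.gauss(0, 1)) for x in xs]  # noisy toy data

for b in (1, 32, 1000):
    print(b, mean_error(0.0, data, b, rng))
```

A batch of one is the single-point
stochastic case, a batch of 1000 is the
full data set here, and 32 sits in
between on both cost and noise.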
it also means that we can increase our
learning rate and trust each update more
so if we're very noisy in our gradient
estimation we probably want to lower our
learning rate a little more so we don't
fully step in the wrong direction if
we're not totally confident with that
gradient if we have a larger batch of
data to estimate our gradients with we
can trust that learning rate a little more
and increase it so that it steps more
aggressively in that direction what this
means also is that we can now massively
parallelize this computation because we can
split up batches on multiple GPUs or
multiple computers even to achieve even
more significant speed ups with this
training process now the last topic I
want to address is that of overfitting
and this is also known as the problem of
generalization in machine learning and
it's actually not unique to just deep
learning but it's a fundamental problem
of all of machine learning now ideally
in machine learning we want a model that
will approximate or estimate our data or
accurately describes our data let's say
like that said differently we want to
build models that can learn
representations from our training data
that still generalize to unseen test
data now assume that you want to build a
line that best describes these points
you can see on the screen under
fitting describes the case where our model
can't really capture the true complexity
of this problem while overfitting on the
right starts to memorize certain aspects
of our training data and this is also
not desirable we want the middle ground
where ideally we end up with a model
that is not so complex that it
memorizes all of our training data but
also one that will continue to
generalize when it sees new data so to
address this problem of overfitting
in neural networks specifically let's
talk about the technique of regularization
which is a way that we can deal
with this and what this is doing is it's
trying to discourage complex information
from being learned so we want to
prevent the model from actually
learning to memorize the training data
we don't want it to learn very
specific points of the training data
that don't generalize well to test data
now as we've seen before this is
actually crucial for our models to be
able to generalize to our test data so
this is very important the most popular
regularization technique in
deep learning is this very basic idea of
dropout now the idea of dropout is
well actually let's start by
revisiting this picture of a neural
network that we had introduced
previously in dropout during training
we randomly set some of the
activations of the hidden neurons to
zero with some probability so let's say
our probability is 0.5
we're randomly going to set the
activations of some of our
hidden neurons to 0 with
probability 0.5 the idea is
extremely powerful because it allows the
network to lower its capacity it also
makes it such that the network can't
build these memorization channels
through the network where it tries to
just remember the data because on every
iteration 50% of that memory
is going to be wiped out so it's going
to be forced not only to generalize
better but it's going to be forced to
have multiple channels through the
network and build a more robust
representation of its prediction now we
just repeat this on every iteration so
on the first iteration we drop out
50% of the nodes on the next
iteration we can drop out a different
randomly sampled 50% which may include
some of the previously sampled nodes as
well and this will allow the network to
generalize better to new test data
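As a rough sketch, dropout on one layer's
activations during training can look
like this. The function name dropout and
its arguments are made up for
illustration. One assumption goes beyond
what was said above: the surviving
activations are rescaled by 1 / (1 - p),
a common convention so that the expected
activation stays the same.

```python
import random

def dropout(activations, p=0.5, rng=random):
    # Zero each activation independently with probability p (training only).
    # Assumption: survivors are scaled by 1 / (1 - p) to keep the expected value.
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))  # one random mask
print(dropout(hidden, 0.5, rng))  # a different random mask on the next iteration
```

At test time you would skip this step
entirely and use all of the neurons.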
the second regularization technique that
we'll talk about is the notion of early
stopping so what I want to do here is
just talk about two lines so during
training which is the x-axis here we
have two lines the y-axis is our loss
the first line is our training
loss so that's the green line the green
line tells us how
well our model is fitting our
training data we expect this to be lower
than the second line which is our
testing loss
so usually we expect to be doing better
on our training data than our testing
data as we train and as this line moves
forward into the future both of these
lines should kind of decrease go down
because we're optimizing the network
we're improving its performance
eventually though there comes a point
where the training loss starts to
diverge from the testing loss now what
happens is that the
model should always continue to fit the
training data because it's still seeing
all of the training data it's not being
penalized for that except maybe if
you use dropout or other means but the
testing data it's not seeing so at some
point the network is going to start to
do better on its training data than its
testing data and what this means is
basically that the network is starting
to memorize some of the training data
and that's what you don't want so what
we can do is well we can perform early
stopping or we can identify this point
this inflection point where the test
loss starts to increase and diverge from
the training loss so we can stop the
network early and make sure that our
test loss is as low as possible
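One simple way to find that point in
code is sketched below. It assumes you
record the held-out loss after every
epoch. The patience rule, the function
name early_stopping_epoch, and the loss
values are illustrative assumptions, not
something from the slide.

```python
def early_stopping_epoch(held_out_losses, patience=3):
    # Return the epoch with the lowest held-out loss, giving up once the loss
    # has failed to improve for `patience` epochs in a row.
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(held_out_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the held-out loss has been rising, so stop here
    return best_epoch

# Made-up curve: the loss falls, bottoms out at epoch 4, then climbs.
losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.45, 0.52, 0.60]
print(early_stopping_epoch(losses))
```

In practice you would also keep a copy
of the weights from that best epoch so
you can go back to them.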
and of course if we actually look
on either side of this line if we look on
the left side that's where a model is
under fit so we haven't reached the true
capacity of our model yet so we'd want
to keep training if we had stopped there
and on the right
side is where we've over fit where we've
passed that early stopping point and
basically we've started to
memorize some of our training data and
that's when we've gone too far I'll
conclude this lecture by just
summarizing three main points that we've
covered so far first we've learned about
the fundamentals of neural networks
which is a single neuron or a perceptron
we've learned about stacking and
composing these perceptrons together to
form complex hierarchical
representations and how we can
mathematically optimize these networks
using a technique called back
propagation using their loss and finally
we address the practical side of
training these models such as mini
batching regularization and adaptive
learning rates as well with that I'll
finish up I can take a couple questions
and then we'll move on to Ava's lecture
on deep sequential modeling I'll take
any like maybe a couple questions if
there are any now thank you
