[students murmuring]
- Okay, so good afternoon
everyone, let's get started.
So hi, so for those of
you who I haven't met yet,
my name is Serena Yeung and I'm the third
and final instructor for this class,
and I'm also a PhD student
in Fei-Fei's group.
Okay, so today we're going
to talk about backpropagation
and neural networks, and so
now we're really starting
to get to some of the core
material in this class.
Before we begin, let's see, oh.
So a few administrative details,
so assignment one is due
Thursday, April 20th,
so a reminder, we shifted
the date back by a little bit
and it's going to be due
11:59 p.m. on Canvas.
So you should start thinking
about your projects,
there are TA specialties
listed on the Piazza website
so if you have questions
about a specific project topic
you're thinking about, you
can go and try and find
the TAs that might be most relevant.
And then also for Google Cloud,
so all students are going
to get $100 in credits
to use for Google Cloud
for their assignments and project,
so you should be receiving an email
for that this week, I think.
A lot of you may have already, and then
for those of you who haven't,
they're going to come,
should be by the end of this week.
Okay so where we are, so
far we've talked about
how to define a classifier
using a function f,
parameterized by weights
W, and this function f
is going to take data x as input,
and output a vector of scores
for each of the classes
that you want to classify.
And so from here we can also define
a loss function, so for
example, the SVM loss function
that we've talked about
which basically quantifies
how happy or unhappy we are with
the scores that we've produced, right,
and then we can use that to
define a total loss term.
So L here, which is a
combination of this data term,
combined with a regularization
term that expresses
how simple our model is,
and we have a preference
for simpler models, for
better generalization.
And so now we want to
find the parameters W
that correspond to our lowest loss, right?
We want to minimize the loss function,
and so to do that we want to find
the gradient of L with respect to W.
So last lecture we talked
about how we can do this
using optimization, and we're going
to iteratively take steps in the direction
of steepest descent, which is
the negative of the gradient,
in order to walk down this loss landscape
and get to the point
of lowest loss, right?
And we saw how this gradient
descent can basically take
this trajectory, looking
like this image on the right,
getting to the bottom
of your loss landscape.
Oh!
Okay, and so we also
talked about different ways
for computing a gradient, right?
We can compute this numerically
using finite difference approximation
which is slow and approximate,
but at the same time
it's really easy to write out,
you know you can always
get the gradient this way.
We also talked about how to
use the analytic gradient
and computing this is fast
and exact once you've
gotten the expression for
the analytic gradient, but
at the same time you have
to do all the math and the
calculus to derive this,
so it's also, you know, easy
to make mistakes, right?
So in practice what we want
to do is we want to derive
the analytic gradient and use this,
but at the same time check
our implementation using
the numerical gradient to make sure
that we've gotten all of our math right.
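That gradient check can be sketched in a few lines of code. This is a minimal sketch with a toy loss of my own (not the SVM loss from the slides), just to show the pattern of comparing the analytic gradient against finite differences:

```python
import numpy as np

def toy_loss(W):
    # Hypothetical stand-in for a loss whose analytic gradient we derived
    # by hand: L(W) = sum(W**2 + 3*W), so dL/dW = 2*W + 3.
    return np.sum(W * W + 3.0 * W)

def analytic_grad(W):
    # The hand-derived expression: fast and exact, but easy to get wrong.
    return 2.0 * W + 3.0

def numerical_grad(f, W, h=1e-5):
    # Centered finite differences: slow and approximate, but hard to get wrong.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        fp = f(W)
        W[idx] = old - h
        fm = f(W)
        W[idx] = old          # restore the original value
        grad[idx] = (fp - fm) / (2.0 * h)
        it.iternext()
    return grad

W = np.random.randn(3, 4)
max_abs_err = np.max(np.abs(analytic_grad(W) - numerical_grad(toy_loss, W)))
```

If `max_abs_err` is tiny, the hand-derived math almost certainly agrees with the implementation.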
So today we're going to
talk about how to compute
the analytic gradient for
arbitrarily complex functions,
using a framework that I'm going
to call computational graphs.
And so basically what a
computational graph is,
is that we can use this
kind of graph in order
to represent any function,
where the nodes of the graph
are steps of computation
that we go through.
So for example, in this example,
the linear classifier
that we've talked about,
the inputs here are x and W, right,
and then this multiplication
node represents
the matrix multiplication of
the parameters W with
our data x that we have,
outputting our vector of scores.
And then we have another
computational node
which represents our hinge loss, right,
computing our data loss term, Li.
And we also have this
regularization term at
the bottom right, so this node
which computes our regularization term,
and then our total loss
here at the end, L,
is the sum of the regularization
term and the data term.
And the advantage is
that once we can express
a function using a computational graph,
then we can use a technique
that we call backpropagation
which is going to recursively
use the chain rule
in order to compute the gradient
with respect to every variable
in the computational graph,
and so we're going to
see how this is done.
And this becomes very
useful when we start working
with really complex functions,
so for example,
convolutional neural networks
that we're going to talk
about later in this class.
We have here the input image at the top,
we have our loss at the bottom,
and the input has to
go through many layers
of transformations in order to get all
the way down to the loss function.
And this can get even
crazier with things like
a Neural Turing Machine,
which is another kind
of deep learning model,
and in this case you can see
that the computational graph
for this is really insane, especially
once we end up, you know,
unrolling this over time.
It would be basically completely impractical
to derive by hand the gradients
for any of these intermediate variables.
Okay, so how does backpropagation work?
So we're going to start
off with a simple example,
where again, our goal is
that we have a function.
So in this case, f of x, y, z
equals the quantity x plus y, times z,
and we want to find the
gradients of the output of
the function with respect
to any of the variables.
So the first step, always, is we want
to take our function f, and we want
to represent it using
a computational graph.
Right, so here our computational
graph is on the right,
and you can see that we have our,
first we have the plus node, so x plus y,
and then we have this
multiplication node, right,
for the second computation
that we're doing.
And then, now we're going
to do a forward pass
of this network, so given the values of
the variables that we have, so here,
x equals negative two, y equals five
and z equals negative four,
I'm going to fill these all in
in our computational graph,
and then here we can compute
an intermediate value,
so x plus y gives three,
and then finally we pass it through again,
through the last node, the multiplication,
to get our final node
of f equals negative 12.
So here we want to give every
intermediate variable a name.
So here I've called this
intermediate variable after
the plus node q, and we
have q equals x plus y,
and then f equals q times z,
using this intermediate node.
And I've also written
out here, the gradients
of q with respect to x
and y, which are just one
because of the addition,
and then the gradients of f
with respect to q and z,
which is z and q respectively
because of the multiplication rule.
And so what we want to
find, is we want to find
the gradients of f with
respect to x, y and z.
So what backprop is, it's
a recursive application of
the chain rule, so we're
going to start at the back,
the very end of the computational graph,
and then we're going to
work our way backwards
and compute all the
gradients along the way.
So here if we start at
the very end, right,
we want to compute the
gradient of the output
with respect to the last
variable, which is just f.
And so this gradient is
just one, it's trivial.
So now, moving backwards,
we want the gradient
with respect to z, right, and we know
that df over dz is equal to q.
So the value of q is just three,
and so we have here, df
over dz equals three.
And so next if we want to do df over dq,
what is the value of that?
What is df over dq?
So we have here, df over
dq is equal to z, right,
and the value of z is negative four.
So here we have df over dq
is equal to negative four.
Okay, so now continuing to
move backwards to the graph,
we want to find df over dy, right,
but here in this case, the
gradient with respect to y,
y is not connected directly to f, right?
It's connected through an
intermediate node of z,
and so the way we're going to do this
is we can leverage the
chain rule which says
that df over dy can be
written as df over dq,
times dq over dy, and
so the intuition of this
is that in order to
find the effect of y on f,
we can take
the effect of q on f,
which we already know, right?
df over dq is equal to negative four,
and compound it with the
effect of y on q, dq over dy.
So what's dq over dy
equal to in this case?
- [Student] One.
- One, right. Exactly.
So dq over dy is equal to
one, which means, you know,
if we change y by a little bit,
q is going to change by approximately
the same amount right, this is the effect,
and so what this is
doing is this is saying,
well if I change y by a little bit,
the effect of y on q is going to be one,
and then the effect of q on f
is going to be approximately
a factor of negative four, right?
So then we multiply these together
and we get that the effect of y on f
is going to be negative four.
Okay, so now if we want
to do the same thing for
the gradient with respect to x, right,
we can do the, we can
follow the same procedure,
and so what is this going to be?
[students speaking away from microphone]
- I heard the same.
Yeah exactly, so in this
case we want to, again,
apply the chain rule, right?
We know the effect of q on
f is negative four,
and here again, since we have
also the same addition node,
dq over dx is equal to one, again,
we have negative four times
one, right, and the gradient
with respect to x is
going to be negative four.
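The whole walkthrough for this simple graph fits in a few lines of code. A minimal sketch (variable names are mine), using the lecture's values:

```python
# Forward pass for f(x, y, z) = (x + y) * z with the lecture's values.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node after the plus gate: q = 3
f = q * z            # output of the multiplication gate: f = -12

# Backward pass: start from df/df = 1 and apply the chain rule node by node.
df_df = 1.0
df_dz = q * df_df    # multiply gate: local gradient is the other input -> 3
df_dq = z * df_df    # -> -4
df_dx = 1.0 * df_dq  # add gate: local gradient is 1, so -4 passes through
df_dy = 1.0 * df_dq  # -> -4
```

Each backward line is just upstream gradient times local gradient, exactly as in the slides.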
Okay, so what we're doing is, in backprop,
is we basically have all of these nodes
in our computational graph, but each node
is only aware of its
immediate surroundings, right?
So we have, at each node,
we have the local inputs
that are connected to this node,
the values that are flowing into the node,
and then we also have the output
that is directly outputted from this node.
So here our local inputs are
x and y, and the output is z.
And at this node we also know
the local gradient, right,
we can compute the gradient
of z with respect to x,
and the gradient of z with respect to y,
and these are usually really
simple operations, right?
Each node is going to be something like
the addition or the multiplication
that we had in that earlier example,
which is something where
we can just write down
the gradient, and we
don't have to, you know,
go through very complex
calculus in order to find this.
- [Student] Can you go
back and explain how
the last slide was
different from finding
the first part of it using
just normal calculus?
- Yeah, so basically if we go back,
hold on, let me...
So if we go back here, we
could exactly write out,
find all of these using just calculus,
so we could say, you know,
we want df over dx, right,
and we can probably
expand out this expression
and see that it's just going to be z,
but we can do this for, in this case,
because it's simple, but
we'll see examples later on
where once this becomes a
really complicated expression,
you don't want to have to use calculus
to derive, right, the
gradient for something,
for a super-complicated expression,
and instead, if you use this formalism
and you break it down into
these computational nodes,
then you can only ever work with gradients
of very simple computations, right,
at the level of, you know,
additions, multiplications,
exponentials, things as
simple as you want them,
and then you just use the chain rule
to multiply all these together,
and get your, the value of your gradient
without having to ever
derive the entire expression.
Does that make sense?
[student murmuring]
Okay, so we'll see an
example of this later.
And so, was there another question, yeah?
[student speaking away from microphone]
- [Student] What's the negative four
next to the z representing?
- Negative, okay yeah,
so the negative four,
these were the, the green values on top
were all the values of
the function as we passed
it forward through the
computational graph, right?
So we said up here that x
is equal to negative two,
y is equal to five, and
z equals negative four,
so we filled in all of these
values, and then we just wanted
to compute the value of this function.
Right, so we said this value
of q is going to be x plus y,
it's going to be negative
two plus five, it is going
to be three, and we have z
is equal to negative four
so we fill that in here,
and then we multiplied q
and z together, negative four times three
in order to get the
final value of f, right?
And then the red values underneath
were as we were filling in the gradients
as we were working backwards.
Okay.
Okay, so right, so we said that, you know,
we have these local nodes,
and each node basically gets
its local inputs coming in
and the output that it
sees directly passing on
to the next node, and we also
have these local gradients
that we computed, right, the gradient of
the immediate output of the node
with respect to the inputs coming in.
And so what happens during
backprop is we have these,
we'll start from the
back of the graph, right,
and then we work our way from the end
all the way back to the beginning,
and when we reach each
node, at each node we have
the upstream gradients coming back, right,
with respect to the
immediate output of the node.
So by the time we reach
this node in backprop,
we've already computed the gradient
of our final loss l,
with respect to z, right?
And so now what we want to find next
is we want to find the
gradients with respect
to just before the node,
to the values of x and y.
And so as we saw earlier, we
do this using the chain rule,
right, we have from the chain rule,
that the gradient of this loss function
with respect to x is going to be
the gradient with respect
to z times, compounded by
this gradient, local gradient
of z with respect to x.
Right, so in the chain rule we always take
this upstream gradient coming down,
and we multiply it by the local gradient
in order to get the gradient
with respect to the input.
- [Student] So, sorry, is it
different because this
would never work
to get a general formula,
or general symbolic
formula, for the gradient?
It only works with instantaneous values,
where you're passing back
[student coughing]
a little constant
value, not a symbolic one.
- So the question is
whether this only works
because we're working
with the current values of
the function, and so it works, right,
given the current values of
the function that we plug in,
but we can write an expression for this,
still in terms of the variables, right?
So we'll see that gradient
of L with respect to z
is going to be some
expression, and gradient of z
with respect to x is going to
be another expression, right?
But we plug in these,
we plug in the values
of these numbers at the
time in order to get
the value of the gradient
with respect to x.
So what you could do is you
could recursively plug in
all of these expressions, right?
Gradient of z with respect to x
is going to be a simple
expression, right?
So in this case, if we
have a multiplication node,
gradient of z with
respect to x is just going
to be y, right, we know that,
but the gradient of L with respect to z,
this is probably a complex part of
the graph in itself, right, so
here's where we want to just,
in this case, have this numerical, right?
So as you said, basically
this is going to be just
a number coming down, right, a value,
and then we just multiply it with
the expression that we have
for the local gradient.
And I think this will be
more clear when we go through
a more complicated
example in a few slides.
Okay, so now the gradient
of L with respect to y,
we have exactly the
same idea, where again,
we use the chain rule,
we have gradient of L
with respect to z, times the gradient of z
with respect to y, right,
we use the chain rule,
multiply these together
and get our gradient.
And then once we have these,
we'll pass these on to
the node directly before,
or connected to this node.
And so the main thing
to take away from this
is that at each node we just
want to have our local gradient
that we compute, just keep track of this,
and then during backprop as
we're receiving, you know,
numerical values of gradients
coming from upstream,
we just take what that is, multiply it by
the local gradient, and then this is
what we then send back
to the connected nodes,
the next nodes going backwards,
without having to care
about anything else besides
these immediate surroundings.
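That local view, where a node caches what it needs on the forward pass and only multiplies upstream by local gradients on the backward pass, is often written as a little gate object. This is a hypothetical sketch of that pattern, not any particular library's API:

```python
class MultiplyGate:
    """A node that only knows its immediate surroundings."""
    def forward(self, x, y):
        # Cache the inputs: each one is the other's local gradient.
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # Chain rule: upstream gradient times the local gradient of each input.
        dx = upstream * self.y
        dy = upstream * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # the q * z step from the earlier example
dq, dz = gate.backward(1.0)     # upstream gradient of 1 at the output
```

The gate never needs to know what the rest of the graph looks like; whatever arrives as `upstream` already summarizes everything downstream.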
So now we're going to go
through another example,
this time a little bit more complex,
so we can see more why
backprop is so useful.
So in this case, our
function is f of w and x,
which is equal to one over one plus e
to the negative of the quantity, w-zero
times x-zero plus w-one x-one, plus w-two, right?
So again, the first step always is we want
to write this out as
a computational graph.
So in this case we can see
that in this graph, right,
first we multiply together the
w and x terms that we have,
w-zero with x-zero, w-one with x-one,
and w-two, then we add all
of these together, right?
Then we scale it by negative one,
we take the exponential, we add one,
and then finally we do
one over this whole term.
And then here I've also
filled in values of these,
so let's say given values that we have
for the ws and xs, right,
we can make a forward pass
and basically compute what the value is
at every stage of the computation.
And here I've also written
down here at the bottom
the values, the expressions
for some derivatives
that are going to be helpful later on,
so same as we did before
with the simple example.
Okay, so now then we're going
to do backprop through here,
right, so again, we're going to start at
the very end of the
graph, and so here again
the gradient of the output with
respect to the last variable
is just one, it's just trivial,
and so now moving
backwards one step, right?
So what's the gradient with respect to
the input just before one over x?
Well, so in this case, we know
that the upstream gradient
that we have coming down,
right, is this red one, right?
This is the upstream gradient
that we have flowing down,
and then now we need to find
the local gradient, right,
and the local gradient of this node,
this node is one over x, right,
so we have f of x equals
one over x here in red,
and the local gradient of this df over dx
is equal to negative one
over x-squared, right?
So here we're going to take
negative one over x-squared,
and plug in the value
of x that we had during
this forward pass, 1.37,
and so our final gradient
with respect to this variable is going
to be negative one over
1.37 squared times one
equals negative 0.53.
So moving back to the next node,
we're going to go through the
exact same process, right?
So here, the gradient
flowing from upstream
is going to be negative 0.53, right,
and here the local gradient,
the node here is a plus one,
and so now looking at our
reference of derivatives at
the bottom, we have that
for a constant plus x,
the local gradient is just one, right?
So what's the gradient with respect
to this variable using the chain rule?
So it's going to be the upstream gradient
of negative 0.53 times
our local gradient of one,
which is equal to negative 0.53.
So let's keep moving
backwards one more step.
So here we have the exponential, right?
So what's the upstream
gradient coming down?
[student speaking away from microphone]
Right, so the upstream
gradient is negative 0.53,
what's the local gradient here?
It's going to be the local
gradient of e to the x, right?
This is an exponential
node, and so our chain rule
is going to tell us that our gradient
is going to be negative 0.53
times e to the power of x,
which in this case is negative one,
from our forward pass, and
this is going to give us
our final gradient of negative 0.2.
Okay, so now one more node here,
the next node is, that
we reach, is going to be
a multiplication with negative one, right?
So here, what's the upstream
gradient coming down?
- [Student] Negative 0.2?
- [Serena] Negative 0.2,
right, and what's going to be
the local gradient, can
look at the reference sheet.
It's going to be, what was it?
I think I heard it.
- [Student] That's minus one?
- It's going to be minus
one, exactly, yeah,
because our local gradient,
df over dx for f equals a times x,
is going to be a, right, and the value of a
that we scaled x by is negative one here.
So we have here that the gradient
is negative one times negative 0.2,
and so our gradient is 0.2.
Okay, so now we've
reached an addition node,
and so in this case we
have these two branches
both connected to it, right?
So what's the upstream gradient here?
It's going to be 0.2, right,
just as everything else,
and here now the gradient with respect
to each of these branches,
it's an addition, right,
and we saw from before
in our simple example
that when we have an addition node,
the gradient with respect
to each of the inputs
to the addition is just
going to be one, right?
So here, our local gradient
for the top branch
is going to be one times
the upstream gradient
of 0.2, which is going to give
a total gradient of 0.2, right?
And then we, for our bottom branch we'd do
the same thing, right, our
upstream gradient is 0.2,
our local gradient is one again,
and the total gradient is 0.2.
So is everything clear about this?
Okay.
So we have a few more
gradients to fill out,
so moving back now we've
reached w-zero and x-zero,
and so here we have a
multiplication node, right,
so we saw the multiplication
node from before,
so the gradient with respect
to one of the inputs is just
the value of the other input.
And so in this case, what's the gradient
with respect to w-zero?
- [Student] Minus 0.2.
- Minus, I'm hearing minus 0.2, exactly.
Yeah, so with respect to w-zero,
we have our upstream gradient, 0.2, right,
times our, this is the bottom one,
times our value of x,
which is negative one,
we get negative 0.2 and
we can do the same thing
for our gradient with respect to x-zero.
It's going to be 0.2
times the value of w-zero
which is two, and we get 0.4.
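The whole sigmoid-example circuit can be replayed in code. The lecture states w-zero equals two and x-zero equals negative one; the remaining inputs below (w1 = -3, x1 = -2, w2 = -3) are assumed values, chosen to be consistent with the intermediate values 1.37 and 0.73 quoted in the lecture:

```python
import math

# Lecture values: w0 = 2, x0 = -1. The rest are assumed for illustration.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one computational node at a time.
s = w0 * x0 + w1 * x1 + w2   # dot product plus bias -> 1.0
a = -1.0 * s                 # *(-1) node
b = math.exp(a)              # exp node
c = b + 1.0                  # +1 node -> 1.37
f = 1.0 / c                  # 1/x node -> 0.73

# Backward pass: upstream gradient times local gradient at every node.
dc = (-1.0 / c**2) * 1.0     # 1/x: local gradient -1/x^2 -> about -0.53
db = 1.0 * dc                # +1: local gradient 1
da = math.exp(a) * db        # exp: local gradient e^x -> about -0.2
ds = -1.0 * da               # *(-1): local gradient -1 -> about 0.2
dw0 = x0 * ds                # multiply gate -> about -0.2
dx0 = w0 * ds                # about 0.39 (the slides round this to 0.4)
dw1 = x1 * ds
dx1 = w1 * ds
dw2 = 1.0 * ds               # add gate passes the gradient straight through
```

Note the small rounding differences: the slides carry 0.2 forward, so they report 0.4 where the unrounded value is closer to 0.39.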
Okay, so here we've filled
out most of these gradients,
and so there was the question earlier
about why this is simpler
than just computing,
deriving the analytic gradient,
the expression with respect
to any of these variables, right?
And so you can see here,
all we ever dealt with
was expressions for local gradients
that we had to write out, so
once we had these expressions
for local gradients, all we did
was plug in the values for
each of these that we have,
and use the chain rule to
numerically multiply this all
the way backwards and get the gradients
with respect to all of the variables.
And so, you know, we can also fill out
the gradients with respect
to w-one and x-one here
in exactly the same way, and so one thing
that I want to note is that
right when we're creating
these computational graphs, we can define
the computational nodes at any
granularity that we want to.
So in this case, we broke it down into
the absolute simplest
that we could, right,
we broke it down into
additions and multiplications,
you know, it basically can't
get any simpler than that,
but in practice, right,
we can group some of
these nodes together into
more complex nodes if we want.
As long as we're able to write down
the local gradient for that node, right?
And so as an example, if we
look at a sigmoid function,
so I've defined the sigmoid function in
the upper-right here, of a sigmoid of x
is equal to one over one
plus e to the negative x,
and this is something that's
a really common function
that you'll see a lot in
the rest of this class,
and we can compute the gradient for this,
we can write it out, and if
we do actually go through
the math of doing this analytically,
we can get a nice expression at the end.
So in this case it's equal
to one minus sigma of x,
so the output of this function
times sigma of x, right?
And so in cases where we
have something like this,
we could just take all the computations
that we had in our graph
that made up this sigmoid,
and we could just replace it
with one big node that's a sigmoid, right,
because we do know the local
gradient for this gate,
it's this expression, d of the
sigmoid of x over dx, right?
So basically the important thing here is
that you can group any nodes that you want
into somewhat more complex nodes,
as long as you can write down
the local gradient for this.
And so all this is is
basically a trade-off between,
you know, how much math
that you want to do
in order to get a more, kind
of concise and simpler graph,
right, versus how simple you want
each of your gradients to be, right?
And then you can write out as complex of
a computational graph that you want.
Yeah, question?
- [Student] This is a
question on the graph itself,
is there a reason that the
first two multiplication nodes
and the weights are not connected
to a single addition node?
- So they could also be connected into
a single addition node,
so the question was,
is there a reason why w-zero and x-zero
are not connected with w-two?
All of these additions
just connected together,
and yeah, so the reason, the answer is
that you can do that if you want,
and in practice, maybe you
would actually want to do that
because this is still a
very simple node, right?
So in this case I just wrote
this out into as simple
as possible, where each node
only had up to two inputs,
but yeah, you could definitely do that.
Any other questions about this?
Okay, so the one thing that I really like
about thinking about this
like a computational graph
is that I feel very comforted, right,
like anytime I have to take a gradient,
find gradients of something,
even if the expression
that I want to compute
gradients of is really hairy,
and really scary, you know,
whether it's something like
this sigmoid or something worse,
I know that, you know, I could
derive this if I want to,
but really, if I just
sit down and write it out
in terms of a computational graph,
I can go as simple as I need to
to always be able to apply
backprop and the chain rule,
and be able to compute all
the gradients that I need.
And so this is something that
you guys should think about
when you're doing your homeworks,
as basically, you know,
anytime you're having trouble
finding gradients of something
just think about it as
a computational graph,
break it down into all of these parts,
and then use the chain rule.
Okay, and so, you know, so we talked about
how we could group these
set of nodes together
into a sigmoid gate, and
just to confirm, like,
that this is actually exactly equivalent,
we can plug this in, right?
So we have that our input
here to the sigmoid gate
is going to be one, in
green, and then we have
that the output is going
to be here, 0.73, right,
and this'll work out if you plug
it in to the sigmoid function.
And so now if we want to
do, if we want to take
the gradient, and we want
to treat this entire sigmoid
as one node, now what we should do
is we need to use this local gradient
that we've derived up here, right?
One minus sigmoid of x
times the sigmoid of x.
So if we plug this in, and here we know
that the value of sigmoid of x was 0.73,
so if we plug this value
in we'll see that this,
the value of this gradient
is equal to 0.2, right,
and so the value of this
local gradient is 0.2,
we multiply it by the
upstream gradient, which is one,
and we're going to get
out exactly the same value
of the gradient with respect
to before the sigmoid gate,
as if we broke it down into all
of the smaller computations.
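It's easy to confirm that sigmoid identity numerically. A minimal sketch, checking both the value 0.73 from the example and that (1 - sigma(x)) * sigma(x) matches a finite-difference derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                          # the input to the sigmoid gate in the example
out = sigmoid(x)                 # about 0.73

# Analytic local gradient of the whole sigmoid gate: (1 - sigma(x)) * sigma(x).
local_grad = (1.0 - out) * out   # 0.73 * 0.27 -> about 0.2

# Sanity check against a centered finite difference.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
```

So collapsing the four small nodes into one sigmoid gate gives exactly the same 0.2 gradient as stepping through them one at a time.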
Okay, and so as we're looking
at what's happening, right,
as we're taking these
gradients going backwards
through our computational graph,
there's some patterns that you'll notice
where there's some
intuitive interpretation
that we can give these, right?
So we saw that the add gate is
a gradient distributor right,
when we passed through
this addition gate here,
which had two branches coming out of it,
it took the gradient,
the upstream gradient
and it just distributed it,
passed the exact same thing
to both of the branches
that were connected.
So here's a couple more
that we can think about.
So what's a max gate look like?
So we have a max gate
here at the bottom, right,
where the input's coming in are z and w,
z has a value of two, w has
a value of negative one,
and then we took the max of
this, which is two, right,
and so we pass this
down into the remainder
of our computational graph.
So now if we're taking the
gradients with respect to this,
the upstream gradient is, let's
say two coming back, right,
and what does this local
gradient look like?
So anyone, yes?
- [Student] It'll be zero for
one, and one for the other?
- Right.
[student speaking away from microphone]
Exactly, so the answer that was given
is that z will have a gradient of two,
w will have a value, a gradient of zero,
and so one of these is going to get
the full value of the
gradient just passed back,
and routed to that variable,
and then the other one
will have a gradient of zero, and so,
so we can think of this as kind
of a gradient router, right,
so, whereas the addition node passed back
the same gradient to
both branches coming in,
the max gate will just take the gradient
and route it to one of the branches,
and this makes sense because
if we look at our forward pass,
what's happening is that only the value
that was the maximum got passed down to
the rest of the
computational graph, right?
So it's the only value
that actually affected
our function computation at
the end, and so it makes sense
that when we're passing
our gradients back,
we just want to, you know,
flow it through that
branch of the computation.
Okay, and so another one,
what's a multiplication gate,
which we saw earlier, is there
any interpretation of this?
[student speaking away from microphone]
Okay, so the answer that was given
is that the local
gradient is basically just
the value of the other variable.
Yeah, so that's exactly right.
So we can think of this as
a gradient switcher, right?
A switcher, and I guess
a scaler, where we take
the upstream gradient and we scale it by
the value of the other branch.
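Those three intuitions, distributor, router, and switcher/scaler, can be written as three one-line backward rules. A minimal sketch (function names are mine):

```python
# Backward rules for the three gate patterns, given an upstream gradient `up`.

def add_backward(up):
    # Gradient distributor: both inputs receive the upstream gradient.
    return up, up

def max_backward(up, x, y):
    # Gradient router: only the input that won the max gets the gradient.
    return (up, 0.0) if x >= y else (0.0, up)

def mul_backward(up, x, y):
    # Gradient switcher/scaler: each input gets upstream times the other value.
    return up * y, up * x

# The max example from the lecture: z = 2, w = -1, upstream gradient 2.
dz, dw = max_backward(2.0, 2.0, -1.0)
```

As in the lecture, z gets the full gradient of 2 routed to it and w gets zero, since only the maximum value affected the rest of the forward pass.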
Okay, and so one other thing to note is
that when we have a place where one node
is connected to multiple nodes,
the gradients add up at this node, right?
So at these branches, using
the multivariate chain rule,
we're just going to take the value of
the upstream gradient coming
back from each of these nodes,
and we'll add these
together to get the total
upstream gradient that's
flowing back into this node,
and you can see this from
the multivariate chain rule
and also thinking about this,
you can think about this
that if you're going to
change this node a little bit,
it's going to affect both
of these connected nodes
in the forward pass,
right, when you're making
your forward pass through the graph.
And so then when you're
doing backprop, right,
then now the, both of
these gradients coming back
are going to affect this node, right,
and so that's how we're
going to sum these up to be
the total upstream gradient
flowing back into this node.
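Here's a tiny sketch of that gradient accumulation, with one variable feeding two branches (the function is my own toy example):

```python
# x feeds two branches: f(x) = x**2 + 3*x.
x = 2.0
a = x * x        # branch 1
b = 3.0 * x      # branch 2
f = a + b        # forward pass -> 10

# Backward pass: the add gate distributes df/df = 1 to both branches.
da = 1.0
db = 1.0

# The gradients flowing back into x from each branch are summed,
# per the multivariate chain rule.
dx = 0.0
dx += 2.0 * x * da   # contribution from branch 1: 4
dx += 3.0 * db       # contribution from branch 2: 3
```

The accumulated `dx` of 7 agrees with differentiating x**2 + 3x directly, which is 2x + 3 at x = 2.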
Okay, so any questions about backprop,
going through these forward
and backward passes?
- [Student] So we haven't done anything
to actually update the weights.
[speaking away from microphone]
- Right, so the question is,
we haven't done anything yet
to update the values of these weights,
we've only found the
gradients with respect
to the variables, that's exactly right.
So what we've talked about
so far in this lecture is how
to compute gradients with
respect to any variables
in our function, right,
and then once we have these
we can just apply everything we learned in
the optimization lecture,
last lecture, right?
So given the gradient,
we now take a step in
the direction of the negative gradient
in order to update our weights,
our parameters, right?
So you can just take this entire framework
that we learned about last
lecture for optimization,
and what we've done here is
just learn how to compute
the gradients we need for
arbitrarily complex functions,
right, and so this is going
to be useful when we talk
about complex functions like
neural networks later on.
Yeah?
- [Student] Do you mind writing out
all the variables, so you could help explain
this slide a little better?
- Yeah, so I can write
this maybe on the board.
Right, so basically if we have
the gradient of f
with respect to some variable x,
and x is connected to f through
intermediate variables,
in this case different q-is,
then the chain rule
is going to take the effect of each of
these intermediate variables, right,
on our final output f, and
then compound each one with
the local effect of our variable x on
that intermediate value, right?
So yeah, it's basically just
summing all these up together.
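Written out symbolically, the rule from the board is:

```latex
\frac{\partial f}{\partial x} \;=\; \sum_i \frac{\partial f}{\partial q_i}\,\frac{\partial q_i}{\partial x}
```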
Okay, so now that we've, you
know, done all these examples
in the scalar case, we're going to look at
what happens when we have vectors, right?
So now if our variables x, y and z,
instead of just being numbers,
we have vectors for these.
And so everything stays exactly
the same, the entire flow,
the only difference is
that now our gradients
are going to be Jacobian matrices, right,
so these are now going
to be matrices containing
the derivative of each
element of, for example z
with respect to each element of x.
Okay, and so to, you
know, so give an example
of something where this is
happening, right, let's say
that we have our input is
going to now be a vector,
so let's say we have a
4096-dimensional input vector,
and this is kind of a common
size that you might see
in convolutional neural networks later on,
and our node is going to be an
element-wise maximum, right?
So we have f of x is equal to the maximum
of x compared with zero
element-wise, and then our output
is going to be also a
4096-dimensional vector.
Okay, so in this case, what's the size
of our Jacobian matrix?
Remember I said earlier,
the Jacobian matrix
is going to be, like each row is,
it's going to be partial derivatives,
a matrix of partial derivatives
of each dimension of
the output with respect to
each dimension of the input.
Okay, so the answer I
heard was 4,096 squared,
and that's, yeah, that's correct.
So this is pretty large,
right, 4,096 by 4,096
and in practice this is
going to be even larger
because we're going to
work with minibatches
of, for example, 100 inputs
at the same time, right,
and we'll put all of these
through our node at the same
time to be more efficient,
and so this is going to scale this by 100,
and in practice our Jacobian's
actually going to turn out
to be something like
409,600 by 409,600, right,
so this is really huge, and basically
completely impractical to work with.
So in practice though,
we don't actually need
to compute this huge
Jacobian most of the time,
and so why is that, like, what does
this Jacobian matrix look like?
If we think about what's happening here,
where we're taking this
element-wise maximum,
and we think about what are each of
the partial derivatives, right,
which dimension of the inputs affect
which dimensions of the output?
What sort of structure can we
see in our Jacobian matrix?
[student speaking away from microphone]
Okay, so I heard that it's
diagonal, right, exactly.
So because this is element-wise,
right, each element of
the input, say the first
dimension, only affects
that corresponding element
in the output, right?
And so because of that,
our Jacobian matrix
is just going to
be a diagonal matrix.
And so in practice then,
we don't actually have
to write out and formulate
this entire Jacobian,
we can just know the effect
of x on the output, right,
and then we can just
use these values, right,
and fill it in as we're
computing the gradient.
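As a sketch of what that looks like in code (the dimensions and random values here are arbitrary), the backward pass through the element-wise maximum never forms the 4096 by 4096 Jacobian; because the Jacobian is diagonal, it just zeroes the upstream gradient wherever the input was negative:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4096)          # input vector
out = np.maximum(x, 0.0)           # forward: element-wise max(x, 0)

dout = np.random.randn(4096)       # upstream gradient, same shape as out
# Multiplying by the diagonal Jacobian is just an element-wise
# mask: d out_i / d x_i = 1 if x_i > 0, else 0.
dx = dout * (x > 0)
```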
Okay, so now we're going to go through
a more concrete vectorized
example of a computational graph.
Right, so let's look at a case
where we have the function f of x and W
is equal to the squared
L2 norm of W multiplied by x,
and so in this case we're going to say x
is n-dimensional and W is n by n.
Right, so again our first step,
writing out the
computational graph, right?
We have W multiplied by
x, and then followed by,
I'm just going to call this L-two.
And so now let's also fill
out some values for this,
so let's say we have W be
this two by two matrix, and x
is going to be this
two-dimensional vector, right?
And so we can say, label
again our intermediate nodes.
So our intermediate node
after the multiplication
it's going to be q, we
have q equals W times x,
which we can write out
element-wise this way,
where the first element is
just W-one-one times x-one
plus W-one-two times x-two and so on,
and then we can now express
f in relation to q, right?
So looking at the second
node we have f of q
is equal to the squared L2 norm of q,
which is equal to q-one
squared plus q-two squared.
Okay, so we filled this in, right,
we get q and then we get our final output.
Okay, so now let's do
backprop through this, right?
So again, this is always the first step,
we have the gradient with respect
to our output is just one.
Okay, so now let's move back one node,
so now we want to find the
gradient with respect to q,
right, our intermediate
variable before the L-two.
And so q is a two-dimensional vector,
and what we want to do is we want
to find how each element of q
affects our final value of f,
right, and so if we
look at this expression
that we've written out
for f here at the bottom,
we can see that the gradient of f
with respect to a specific
q-i, let's say q-one,
is just going to be two times q-i, right?
This is just taking this derivative here,
and so we have this expression for,
with respect to each element of q-i,
we could also, you know, write this out
in vector form if we want to,
it's just going to be two
times our vector of q, right,
if we want to write
this out in vector form,
and so what we get is
that our gradient is 0.44,
and 0.52, this vector, right?
And so you can see that it just took q
and it scaled it by two, right?
Each element is just multiplied by two.
So the gradient with respect to a vector
is always going to be
the same size as the original vector,
and each element of this gradient
tells us how much this particular element
affects our final output of the function.
Okay, so now let's move
one step backwards, right,
what's the gradient with respect to W?
And so here again we want
to use the same concept
of trying to apply the chain rule, right,
so we want to compute our local gradient
of q with respect to W, and so let's look
at this again element-wise,
and if we do that,
let's see what's the
effect of each q, right,
each element of q with
respect to each element of W,
and so this is going to be the Jacobian
that we talked about earlier,
and if we look at this
in this multiplication, q is equal
to W times x, right,
what's the derivative,
or the gradient of the first element of q,
so our first element up top,
with respect to W-one-one?
So q-one with respect to W-one-one?
What's that value?
X-one, exactly.
Yeah, so we know that this is x-one,
and we can write this
out more generally:
the gradient of q-k with respect
to W-i,j is equal to an indicator
of k equals i, times x-j.
And then now if we want
to find the gradient
with respect to, of f,
with respect to each W-i,j.
So looking at these derivatives now,
we can use this chain rule
that we talked earlier
where we basically compound df over dq-k
for each element of q with dq-k over dW-i,j
for each element of W-i,j, right?
So we find the effect of each element of W
on each element of q, and
sum this across all q.
And so if you write this
out, this is going to give
this expression of two
times q-i times x-j.
Okay, and so filling this out then we get
this gradient with respect to W,
and so again we can compute
this each element-wise,
or we can also look at this
expression that we've derived
and write it out in
vectorized form, right?
So okay, and remember, the important thing
is always to check that the gradient
with respect to a variable
has the same shape as
the variable, and this is something
really useful in practice
as a sanity check, right?
So once you've computed
what your gradient should
be, check that it has
the same shape as your variable,
because again, each element of your gradient
is quantifying how much that element
is affecting your final output.
Yeah?
[student speaking away from microphone]
- The one, oh, the one there
is an indicator function,
so this is saying
that it's just one if k equals i.
Okay, so let's see, so we've done that,
and so now just see, one more example.
Now the last thing we need to find is
the gradient with respect to x.
So here if we compute the
partial derivatives we can see
that dq-k over dx-i is
equal to W-k,i, right,
using the same way as we did it for W,
and then again we can
just use the chain rule
and get the total
expression for that, right?
And so this is going to be the gradient
with respect to x, again,
of the same shape as x,
and we can also write this out
in vectorized form if we want.
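Putting the whole example together in NumPy, using the 2 by 2 values that give the 0.44 and 0.52 gradient mentioned above (treat the exact W and x as reconstructed assumptions):

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass: q = W x, then f = ||q||^2
q = W.dot(x)              # intermediate node
f = np.sum(q ** 2)        # squared L2 norm

# Backward pass, in vectorized form:
dq = 2.0 * q              # df/dq = 2q
dW = np.outer(dq, x)      # df/dW = 2 q x^T, same shape as W
dx = W.T.dot(dq)          # df/dx = 2 W^T q, same shape as x
```

Note that dW and dx come out with exactly the shapes of W and x, which is the sanity check discussed above.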
Okay, so any questions about this, yeah?
[student speaking away from microphone]
So we are computing the Jacobian,
so let me go back here, right,
so if we're doing, so right, so we have
these partial derivatives of q-k
with respect to x-i, right, and these
are forming your, the entries
of your Jacobian, right?
And so in practice what we're going to do
is we basically take that,
and you're going to see it
up there in the chain rule,
so the vectorized expression
of gradient with respect to x, right,
this is going to have the Jacobian here
which is this transposed value here,
so you can write it
out in vectorized form.
[student speaking away from microphone]
So well, so in this case the matrix
is going to be the same size as W right,
so it's not actually a large
matrix in this case, right?
Okay, so the way that we've
been thinking about this
is like a really modularized
implementation, right,
where in our computational graph, right,
we look at each node
locally and we compute
the local gradients and chain them
with upstream gradients coming down,
and so you can think of this as basically
a forward and a backwards API, right?
In the forward pass we
implement the, you know,
a function computing
the output of this node,
and then in the backwards
pass we compute the gradient.
And so when we actually
implement this in code,
we're going to do this
in exactly the same way.
So we can basically think
about, for each gate, right,
if we implement a forward
function and a backward function,
where the backward function
is computing the chain rule,
then if we have our entire
graph, we can just make
a forward pass through the
entire graph by iterating through
all the nodes in the graph, all the gates.
Here I'm going to use
the words gate and node
kind of interchangeably,
we can iterate through
all of these gates and just call forward
on each of the gates, right?
And we just want to do this
in topologically sorted order,
so we process all of
the inputs coming in to
a node before we process that node.
And then going backwards,
we're just going to then
go through all of the gates
in this reverse sorted order,
and then call backwards
on each of these gates.
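A minimal sketch of this forward/backward API (the gate classes and driver code here are hypothetical illustrations, not any particular framework's interface), run on the f = (x + y) * z graph from earlier in the lecture:

```python
class AddGate:
    def forward(self, x, y):
        return x + y
    def backward(self, dz):
        return dz, dz                    # add routes the upstream gradient to both inputs

class MulGate:
    def forward(self, x, y):
        self.x, self.y = x, y            # cache inputs for the backward pass
        return x * y
    def backward(self, dz):
        return dz * self.y, dz * self.x  # scale by the value of the other branch

# Forward pass: call forward on the gates in topologically sorted order
add, mul = AddGate(), MulGate()
x, y, z = -2.0, 5.0, -4.0
q = add.forward(x, y)
f = mul.forward(q, z)

# Backward pass: same gates in reverse order, starting from df/df = 1
dq, dz_ = mul.backward(1.0)
dx, dy = add.backward(dq)
```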
Okay, and so if we look at then
the implementation for
our particular gates,
so for example, this MultiplyGate here,
we want to implement
the forward pass, right,
so it gets x and y as inputs,
and returns the value of z,
and then when we go backwards, right,
we get as input dz, which
is our upstream gradient,
and we want to output the gradients
on the inputs x and
y to pass down, right?
So we're going to output dx and dy,
and so in this case, in this example,
everything is back to
the scalar case here,
and so if we look at
this in the forward pass,
one thing that's important
is that we need to,
we should cache the values
of the forward pass, right,
because we end up using this in
the backward pass a lot of the time.
So here in the forward pass,
we want to cache the values
of x and y, right, and
in the backward pass,
using the chain rule,
we're going to, remember,
take the value of the upstream gradient
and scale it by the value
of the other branch, right,
and so we'll keep, for
dx we'll take our value
of self.y that we kept, and multiply it
by dz coming down, and same for dy.
Okay, so if you look at a lot
of deep-learning frameworks
and libraries you'll see
that they exactly follow
this kind of modularization, right?
So for example, Caffe is a
popular deep learning framework,
and you'll see, if you go look
through the Caffe source code
you'll get to some
directory that says layers,
and in layers, which are
basically computational nodes,
usually layers might be
slightly more, you know,
some of these more complex
computational nodes
like the sigmoid that
we talked about earlier,
you'll see, basically just a whole list
of all different kinds of
computational nodes, right?
So you might have the sigmoid,
there's a convolution layer,
there's an Argmax layer,
you'll have all of these
layers and if you dig in
to each of them, they're
just exactly implementing
a forward pass and a backward pass,
and then all of these are called
when we do forward and backward pass
through the entire network that we formed,
and so our network is just basically going
to be stacking up all of these,
the different layers that we
choose to use in the network.
So for example, if we
look at a specific one,
in this case a sigmoid layer, you'll see
that in the sigmoid layer, right,
we've talked about the sigmoid function,
you'll see that there's a forward pass
which basically computes
exactly the sigmoid expression,
and then a backward pass, right,
where it is taking as input
something, basically a top_diff,
which is our upstream
gradient in this case,
and multiplying it by a local
gradient that we compute.
So in assignment one you'll get practice
with this kind of, this
computational graph way of thinking
where, you know, you're
going to be writing
your SVM and Softmax classes,
and taking the gradients of these.
And so again, remember always
you want to first step,
represent it as a
computational graph, right?
Figure out what are all the computations
that you did leading up to the output,
and then when you, when it's time
to do your backward pass,
just take the gradient
with respect to each of
these intermediate variables
that you've defined in
your computational graph,
and use the chain rule to
link them all together.
Okay, so summary of what
we've talked about so far.
When we get down to, you know,
working with neural networks,
these are going to be
really large and complex,
so it's going to be
impractical to write down
the gradient formula by hand
for all your parameters.
So in order to get these gradients, right,
we talked about how, what we
should use is backpropagation,
right, and this is kind of
one of the core techniques
of, you know, neural
networks, is basically
using backpropagation to
get your gradients, right?
And so this is a recursive application of
the chain rule where we have
this computational graph,
and we start at the back and
we go backwards through it
to compute the gradients with respect
to all of the intermediate variables,
which are your inputs, your parameters,
and everything else in the middle.
And we've also talked about how really
this implementation and
this graph structure,
each of these nodes is
really, you can see this
as implementing a forward
and backwards API, right?
And so in the forward
pass we want to compute
the results of the operation, and we want
to save any intermediate values
that we might want to use later
in our gradient computation,
and then in the backwards
pass we apply this chain rule
and we take this upstream gradient,
we chain it, multiply it
with our local gradient
to compute the gradient with respect to
the inputs of the node,
and we pass this down
to the nodes that are connected next.
Okay, so now finally we're going
to talk about neural networks.
All right, so really, you
know, neural networks,
people draw a lot of analogies
between neural networks
and the brain, and different types
of biological inspirations,
and we'll get to that in
a little bit, but first let's
talk about it, you know,
just looking at it as a function,
as a class of functions
without all of the brain stuff.
So, so far we've talked about, you know,
we've worked a lot with this
linear score function, right?
f equals W times x, and
so we've been using this
as a running example of a
function that we want to optimize.
So instead of using this
single linear transformation,
if we want a neural network,
in the simplest form we can
just stack two of these together, right?
Just a linear transformation
on top of another one
in order to get a two-layer
neural network, right?
And so what this looks like is
first we have a matrix
multiply of W-one with x,
and then we get this intermediate variable,
and we have this non-linear
function, a max of zero
with the output of this linear layer,
and it's really important to
have these non-linearities
in place, which we'll
talk about more later,
because otherwise if you just
stack linear layers on top
of each other, they're
just going to collapse
to, like a single linear function.
Okay, so we have our first linear layer
and then we have this
non-linearity, right,
and then on top of this we'll
add another linear layer.
And then from here, finally
we can get our score function,
our output vector of scores.
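As a sketch in NumPy (the layer sizes here are arbitrary assumptions), the whole forward pass of this two-layer network is just:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3072)               # e.g. a flattened input image
W1 = 0.01 * np.random.randn(100, 3072)  # first linear layer (the "templates")
W2 = 0.01 * np.random.randn(10, 100)    # second layer combines template scores

h = np.maximum(0.0, W1.dot(x))  # intermediate scores, after the non-linearity
s = W2.dot(h)                   # final vector of 10 class scores
```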
So basically, like, more broadly speaking,
neural networks are a class of functions
where we have simpler functions, right,
that are stacked on top of each other,
and we stack them in a hierarchical way
in order to make up a more
complex non-linear function,
and so this is the idea of
having, basically multiple stages
of hierarchical computation, right?
And so, you know, so this is kind of
the main way that we do this is by taking
something like this matrix
multiply, this linear layer,
and we just stack multiple
of these on top of each other
with non-linear functions
in-between, right?
And so one thing that this
can help solve is if we look,
if we remember back to
this linear score function
that we were talking about, right,
remember we discussed earlier how each row
of our weight matrix W was
something like a template.
It was a template that sort
of expressed, you know,
what we're looking for in the input
for a specific class, right,
so for example, you know,
the car template looks something like
this kind of fuzzy red car,
and we were looking for this
in the input to compute the
score for the car class.
And we talked about one
of the problems with this
is that there's only one template, right?
There's this red car, whereas in practice,
we actually have multiple modes, right?
We might want, we're looking
for, you know, a red car,
there's also a yellow
car, like all of these
are different kinds of cars, and so what
this kind of multiple
layer network lets you do
is now, you know, each of
this intermediate variable h,
right, W-one can still be
these kinds of templates,
but now you have all of these scores
for these templates in h,
and we can have another layer
on top that's combining
these together, right?
So we can say that actually
my car class should be,
you know, connected to, we're looking
for both red cars as well
as yellow cars, right,
because we have this matrix W-two
which is now a weighting
of all of our vector in h.
Okay, any questions about this?
Yeah?
[student speaking away from microphone]
Yeah, so there's a lot of ways,
so there's a lot of different
non-linear functions
that you can choose from,
and we'll talk later on
in a later lecture about
all the different kinds
of non-linearities that
you might want to use.
- [Student] For the pictures in the slide,
so, on the bottom row you have images
of your vector W-one weight, and so maybe
you would have images
of another vector W-two?
- So W-one, because it's
directly connected to
the input x, this is what's
like, really interpretable,
because you can formulate
all of these templates.
W-two, so h is going to be a score
of how much of each template
was present, for example,
so it might be
like you have, say, a two for the red car,
and a one for the yellow
car or something like that.
- [Student] Oh, okay, so
instead of W-one being just 10,
like, you would have a left-facing horse
and a right-facing horse,
and they'd both be included--
- Exactly, so the question
is basically whether
in W-one you could have
both left-facing horse
and right-facing horse,
right, and so yeah, exactly.
So now W-one can be many different
kinds of templates, right?
And then W-two is basically
a weighted sum of
all of these templates.
So now it allows you to weight
together multiple templates
in order to get the final
score for a particular class.
- [Student] So if you're
processing an image
then it's actually left-facing horse.
It'll get a really high score with
the left-facing horse template,
and a lower score with the
right-facing horse template,
and then this will take
the maximum of the two?
- Right, so okay, so the question is,
if our image x is like a left-facing horse
and in W-one we have a template of
a left-facing horse and
a right-facing horse,
then what's happening, right?
So what happens is yeah,
so in h you might have
a really high score for
your left-facing horse,
kind of a lower score for
your right-facing horse,
and W-two is, it's a weighted
sum, so it's not a maximum.
It's a weighted sum of these templates,
but if you have either a really high score
for one of these templates,
or let's say you have, kind of
a lower and medium score
for both of these templates,
all of these kinds of combinations
are going to give high scores, right?
And so in the end what you're going to get
is something that generally scores high
when you have a horse of any kind.
So let's say you had a front-facing horse,
you might have medium values for both
the left and the right templates.
Yeah, question?
- [Student] So is W-two
doing the weighting,
or is h doing the weighting?
- W-two is doing the
weighting, so the question is,
"Is W-two doing the weighting
or is h doing the weighting?"
h is the value, like in this example,
h is the value of scores
for each of your templates
that you have in W-one, right?
So h is like the score function, right,
it's how much of each
template in W-one is present,
and then W-two is going
to weight all of these,
weight all of these intermediate scores
to get your final score for the class.
- [Student] And which
is the non-linear thing?
- So the question is, "which
is the non-linear thing?"
So the non-linearity usually
happens right before h,
so h is the value right
after the non-linearity.
So we're talking about
this, like, you know,
intuitively as this example of like,
W-one is looking for, you know,
has these same templates as before,
and W-two is a weighting for these.
In practice it's not
exactly like this, right,
because as you said, there's all
these non-linearities thrown in and so on,
but it has this approximate
type of interpretation to it.
- [Student] So h is just W-one-x then?
- Yeah, so the question
is, is h just W-one-x?
So h is just W-one times x,
with the max function applied on top.
Oh, let me just, okay so,
so we've talked about
this as an example of
a two-layer neural network,
and we can stack more layers
of these to get deeper networks
of arbitrary depth, right?
So we can just do this one more time
at another non-linearity and
matrix multiply now by W-three,
and now we have a three-layer
neural network, right?
And so this is where the
term deep neural networks
is basically coming from, right?
This idea that you can stack
multiple of these layers,
you know, for very deep networks.
And so in homework you'll get a practice
of writing and you know, training one of
these neural networks, I
think in assignment two,
but basically a full
implementation of this using
this idea of forward pass, right,
and backward passes, and using chain rule
to compute gradients
that we've already seen.
The entire implementation of
a two-layer neural network
is actually really simple, it
can just be done in 20 lines,
and so you'll get some practice
with this in assignment two,
writing out all of these parts.
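As a hedged sketch of what that short implementation can look like (the sizes, the sigmoid non-linearity, and the learning rate are all illustrative choices, not the assignment's exact code): a two-layer network trained with backprop and gradient descent on random data:

```python
import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

losses = []
for t in range(50):
    # Forward pass: sigmoid hidden layer, linear output, squared-error loss
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    losses.append(loss)

    # Backward pass: chain rule through each step of the forward pass
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1.0 - h))  # sigmoid local gradient h(1-h)

    # Gradient descent: step in the negative gradient direction
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```

Everything in the lecture so far is here: a forward pass, backprop via the chain rule, and the update step from the optimization lecture.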
And okay, so now that we've sort of seen
what neural networks are
as a function, right,
like, you know, we hear
people talking a lot about
how there's biological
inspirations for neural networks,
and so even though it's
important to emphasize
that these analogies are really loose,
really just very loose ties,
where some of these connections
and inspirations come from.
And so now I'm going to
talk briefly about that.
So if we think about a neuron, in kind of
a very simple way, this neuron is,
here's a diagram of a neuron.
We have the impulses
that are carried towards
each neuron, right, so we have a lot
of neurons connected together
and each neuron has dendrites,
right, and these are sort
of, these are what receives
the impulses that come into the neuron.
And then we have a cell body, right,
that basically integrates
these signals coming in
and then there's a kind
of, then it takes this,
and after integrating all
these signals, it passes on,
you know, the impulse carries away from
the cell body to downstream
neurons that it's connected to,
right, and it carries
this away through axons.
So now if we look at what
we've been doing so far, right,
with each computational node, you can see
that this actually has, you can see it
in kind of a similar way, right?
Where nodes are connected to each other in
the computational graph,
and we have inputs,
or signals x, coming into a neuron,
and then all of these,
x-zero, x-one, x-two,
are combined and
integrated together, right,
using, for example, our weights W.
So we do some sort of computation, right,
and in some of the computations
we've been doing so far,
something like W times x plus b, right,
integrating all these together,
and then we have an activation function
that we apply on top, we get
this value of this output,
and we pass it down to
the connecting neurons.
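A rough sketch of this single-neuron picture in code (the Neuron class and its numbers are made-up illustrations, and the sigmoid is just one possible activation function):

```python
import numpy as np

class Neuron:
    """One 'neuron': integrate the weighted inputs plus a bias,
    then apply an activation function to get a 'firing rate'."""
    def __init__(self, w, b):
        self.w, self.b = w, b

    def fire(self, x):
        cell_body_sum = np.dot(self.w, x) + self.b   # integrate signals: w.x + b
        return 1.0 / (1.0 + np.exp(-cell_body_sum))  # sigmoid activation

neuron = Neuron(w=np.array([0.5, -0.3, 0.8]), b=0.1)
rate = neuron.fire(np.array([1.0, 2.0, 3.0]))  # a value in (0, 1)
```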
So if you look at this,
you can think about it in
a very similar way, right?
The signals coming in
are connected at synapses, right,
the synapses connecting
the multiple neurons,
the dendrites are integrating
all of this information
together in the cell body,
and then we have the output
carried away to downstream neurons.
And so this is kind of the analogy
that you can draw between them,
and if you look at these
activation functions, right?
This is what basically takes
all the inputs coming in
and outputs one number
that's going out later on,
and we've talked about examples
like sigmoid activation function, right,
and different kinds of non-linearities,
and so sort of one kind of
loose analogy that you can draw
is that these
non-linearities can represent
something sort of like the firing,
or spiking rate of the neurons, right?
Where our neurons transmit
signals to connecting neurons
using kind of these
discrete spikes, right?
And so we can think of, you know,
if they're spiking very
fast then there's kind of
a strong signal that's passed later on,
and so we can think of this value
after our activation function
as sort of, in a sense,
sort of this firing rate
that we're going to pass on.
And you know in practice,
I think neuroscientists
who are actually studying this say
that kind of one of the non-linearities
that are most similar to the way
that neurons are actually behaving
is the ReLU non-linearity,
which is something that we're going
to look at more later
on, but it's a function
that's at zero for all
negative values of input,
and then it's a linear
function for everything
that's in kind of a positive regime.
And so, you know, we'll talk more about
this activation function later on,
but that's kind of, in practice,
maybe the one that's most similar to
how neurons are actually behaving.
But it's really important
to be extremely careful
with making any of these
sorts of brain analogies,
because in practice biological neurons
are way more complex than this.
There's many different
kinds of biological neurons,
the dendrites can perform
really complex non-linear computations.
Our synapses, right, the
W-zeros that we had earlier
where we drew this analogy,
are not single weights
like we had, they're actually
really complex, you know,
non-linear dynamical systems in practice,
and also this idea of interpreting
our activation function
as a sort of rate code
or firing rate is also,
is insufficient in practice, you know.
It's just this kind of
firing rate is probably not
a sufficient model of how neurons
will actually communicate to
downstream neurons, right,
like even in a very simple
sense, the neurons
will fire at a variable rate,
and this variability probably
should be taken into account.
And so there's all of these, you know,
it's kind of a much more complex thing
than what we're dealing with.
There's references, for example
this dendritic computation
that you can look at if you're
interested in this topic,
but yeah, so that in practice, you know,
we can sort of see how
it may resemble a neuron
at this very high level, but neurons are,
in practice, much more
complicated than that.
Okay, so we talked about how
there's many different kinds
of activation functions
that could be used,
there's the ReLU that I mentioned earlier,
and we'll talk about all
of these different kinds
of activation functions in
much more detail later on,
choices of these activation functions
that you might want to use.
And so we'll also talk
about different kinds
of neural network architectures.
So we gave the example
of these fully connected
neural networks, right,
where each layer is
this matrix multiply, and
so regarding what we call
these: we said
two-layer neural network before,
and that corresponded to
the fact that we have two
of these linear layers, right,
where we're doing
a matrix multiply; two
fully connected layers
is what we call these.
We could also call this a
one-hidden-layer neural network,
so instead of counting the number
of matrix multiplies we're doing,
counting the number of
hidden layers that we have.
I think it's, you can use either,
I think maybe two-layer neural network
is something that's a
little more commonly used.
And then also here, for our
three-layer neural network
that we have, this can also be called
a two-hidden-layer neural network.
And so we saw that, you
know, when we're doing
this type of feed forward, right,
forward pass through a neural network,
each of these nodes in this network
is basically doing the
kind of operation of
the neuron that I showed earlier, right?
And so what's actually happening is
basically that each hidden
layer you can think of
as a whole vector, right,
a set of these neurons,
and so by writing it out this way
with these matrix multiplies
to compute our neuron values,
it's a way that we can
efficiently evaluate
this entire layer of neurons, right?
So with one matrix multiply
we get output values
of, you know, of a layer of let's say 10,
or 50 or 100 of neurons.
All right, and so looking at
this again, writing this out,
all out in matrix form,
matrix-vector form,
we have our non-linearity here,
f, which in this
case is a sigmoid function, right,
and we can take our data
x, some input vector
or our values, and we can apply
our first matrix multiply,
W-one on top of this,
then our non-linearity,
then a second matrix multiply to get
a second hidden layer, h-two,
and then we have our final output, right?
And so, you know, this
is basically all you need
to write the forward pass
of a neural network,
and for the backward pass,
as we saw earlier, you just use backprop
to compute all of those gradients,
and so that's basically all there is
to kind of the main idea
of what's a neural network.
Okay, so just to
summarize, we talked about
how we could arrange neurons
into these computations, right,
of fully-connected or linear layers.
This abstraction of a
layer has a nice property
that we can use very
efficient vectorized code
to compute all of these.
We also talked about how it's important
to keep in mind that neural networks
do have some, you know,
analogy and loose inspiration
from biology, but they're
not really neural.
I mean, this is a pretty loose analogy
that we're making, and
next time we'll talk
about convolutional neural networks.
Okay, thanks.
