Okay. So great to see everyone back for lecture four of the class.
Um, so, for lec,
for today's lecture, um,
what I want to do for most of the time is actually
get into the heart of these ideas of having
the backpropagation algorithm for neural nets and how we can construct
computation graphs that allow us to efficiently do backpropagation
in neural nets to train the neural nets.
So, overall, um, this is sort of what I plan to do today.
So, at the end of last lecture,
I slightly ran out of time and I started mumbling and waving my hands about the,
um, doing the derivatives with respect to the weights.
So, I kind of wanted to do that bit again.
So hopefully it actually communicates slightly better.
So, we'll do that and talk a bit more about sort of just tips for doing matrix gradients,
um, and a particular issue that comes up with word vectors.
And so then the main part of the class,
we'll be talking about the backpropagation algorithm
and how it runs over computation graphs.
Um, and then for the last part of the class,
um, is I'm not going to hide that, um,
this is sort of just a grab bag of miscellaneous stuff you should
know about neural networks and training neural networks.
Um, like, I think,
you know we dream of a future of artificial intelligence where our machines are
really intelligent and you can just say to them this is the data and this is my problem,
go and train me a model and it might work.
Um, and in some future world,
that may be [NOISE] what comes along.
It's something that's certainly being actively
researched at the moment under the topic of Auto ML.
I guess the question is whether it turns out that Auto ML was a scalable solution or
the climate change consequences of Auto ML techniques are
sufficiently bad that someone actually decides that these much lower power,
um, neural systems might actually be better still for doing some parts of the problem.
But anyway, either way we're not really there yet.
And the fact of the matter is,
when you're training neural networks,
there's just a whole bunch of stuff you have to know about
initialization and nonlinearities and learning rates and so on.
And, you know, when I taught this class
last time I somehow thought that people would pick this up by osmosis.
That if we gave starter code to people,
and in our starter code we initialized our matrices and we set our learning rates,
that by osmosis people would understand that's what you have to do and do it.
Um, I didn't really sort of teach in class the practical tips and tricks enough,
but it was perfectly obvious that when we got to
final project time that at least for quite a few people, osmosis hadn't worked.
Um, so this time,
[LAUGHTER] I'm at least wanting to spend a few minutes on that
and at least point out some of the things that are important.
And, I mean just in general,
you know the reality of 2018, deep learning, no,
wait, it's 2019 now, 2019, um,
deep learning, is deep learning is still kind of a craft.
There's quite a bit you have to know of techniques of doing things that lead
neural net training to work successfully as
opposed to your models failing to work successfully.
Okay. One final announcement and then I'll go into it.
Um, so, we've sort of been doing some further work on office
hour placement, and I guess there are sort of multiple issues, which
include the opportunities for local SCPD students without Stanford IDs
to, um,
get, um, to office hours.
So for the Thursday night office hour,
um, that's after this class,
if you'd like to go and talk about,
um, the second homework, um,
the Thursday night office hour is going to be in Thorton- Thornton 110.
Um, now I didn't know where Thornton was.
It made more sense to me when I translated that as that's the old Terman annex,
but that's probably just showing my age since probably none
of you remember when there used to be a building called Terman.
So that probably doesn't help you either.
Um, but you know,
if you're heading, right,
I don't know which direction we're facing.
If you're heading that way I guess
and if you know where the Papua New Guinea Sculpture Garden is, um, the,
the sort of open grassy area before you get to the Papua New Guinea Sculpture Garden,
that's where Terman used to be and the building that still stands in there is Thornton.
Um, Thornton 110 um tonight.
I think it starts at 6:30,
right? 6:30 to nine.
Okay. Right. So, let me just finish off where we were last time.
So remember we had this window of five words and then we're
putting it through a neural net layer of z equals Wx plus b,
a non-linearity of h equals f of z,
and then we're, um,
going to just get a score s as to whether this has in its center [NOISE]
a named entity like Paris, which is sort of
taking this dot product of a vector u times the hidden layer.
So this was our model,
and then we are wanting to work out partial derivatives of s with
respect to all of our variables and we did various of the cases,
but one we hadn't yet done is the weights,
the weights for all of this neural net layer here.
Okay. So, chain rule, um,
the partial ds/dW is ds/dh times,
um, dh/dz times dz/dW.
And well, if you remember last time,
we had sort of done some computation of what those first two,
um, partial derivatives were.
And we could say that we could just call
those delta which is our error signal coming from above.
And that concept of having an error signal coming from above is
something I'll get back to in the main part of
the lecture and a sort of a central notion.
But the bit we hadn't dealt with is this dz/dW,
and we started to look at that and I made the argument, um,
based on our shape convention that the shape of
that should be the same shape as our W matrix.
So it should be, um,
the same n times m shape as this W matrix.
So we want to work out the partial of z by W, which is the same as this,
um, [NOISE] d(Wx + b)/dW.
And so we want to work out what that derivative is.
Um, and if that's not obvious,
one way to think about it is to go back to the elements of the matrix
and actually first off work it out element-wise and think out what it should be,
and then once you've thought out what it should be, um,
to rewrite it back in matrix form to give the compact answer.
So what we have is we have the inputs here and a bias term
and we're going to do the matrix multiply of this vector to produce these.
And if you think about what's happening there,
so we've got this matrix of weights and for a particular weight,
a weight's first index is going to correspond to a position in
the hidden layer and its second index is going to
correspond to a position in the input vector.
And one weight in the matrix ends up being
part of what's used to compute one element of the hidden layer.
So, for one element of the hidden layer you're taking, um,
a row of the matrix and you're multiplying it by
the components of this vector, so they sum together, and the bias
is added on, but one element of the matrix is sort of only being
used in the computation between one element of the,
um, input and one element of the hidden vector.
Okay. So, well, that means, um,
if we're thinking about what's the partial derivative with respect to W_ij, well,
it's only contributing to z_i and it's only,
it's only doing anything with x_j.
So, that we end up with,
when we're getting the partial with respect to W_ij,
we can work that out
just with respect to z_i.
And when we're going to look at this multiplication here,
what we're ending up with is this sort of sum of terms W_ik times
x_k, where there's sort of weights in that row
of the matrix going across the positions of the vector.
So the only position in which W_ij is used is multiplying, um, by x_j.
And at that point,
what we have in terms of sort of,
in our basic one variable doing a differentiation,
this is just like we have 3x,
um, and we say what's the derivative of 3x?
Actually X is confusing,
so I shouldn't say that.
Is like we have three W and what's the derivative of three W with respect to W?
It's three, right?
So, we have a term here which is what would have been W,
will be W_ij times x_j,
and its derivative with respect to W_ij is just x_j.
Does that make sense?
Everyone believe it?
[NOISE] Fingers crossed.
Okay. Um, so, so for one element of this matrix,
we're just getting out XJ.
And at that point,
um, we say, um,
well of course we want to know what the Jacobian is for the full matrix W. Well,
if you start thinking about it,
this argument applies to every cell.
So, that for every,
um, cell of, um,
the Jacobian for W,
um, the partial of z_i by W_ij is going to be x_j.
So, that means, um,
we're just going to be able to make use of that in calculating our Jacobian.
So, the derivative for a single W_ij is delta_i x_j and that's true for all cells.
So we want to have a matrix for our Jacobian which has delta_i,
um, x_j in every cell of it.
And the way we can create that is by using an outer product.
So, if we have a row vector of the deltas,
the error signals from above and a column,
right, I said that wrong, sorry.
If we have a column of the delta error signals
from above and we have a row x-transpose vector,
um, when we multiply those together we get the outer product
and we get delta_i x_j in each cell and that is our Jacobian answer,
um, for working out,
um, the partial of s by W that we started off with at the beginning.
Okay. And this, um,
and we get this form where it's a multiplication of
an error signal from above and our computed local gradient signal.
And that's the pattern that we're going to see over and over
again and that we'll exploit in our computation graphs.
Okay, all good?
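As a rough sketch of that result, here is the outer product computed in plain Python and compared against a finite-difference estimate for one weight. The choice of tanh for the non-linearity f, the toy sizes, and all the numbers are assumptions made for illustration; they are not from the lecture.

```python
import math

def forward(W, x, b, u):
    """Compute z = Wx + b, h = f(z), s = u . h, with f = tanh (an assumed choice)."""
    z = [sum(W[i][k] * x[k] for k in range(len(x))) + b[i] for i in range(len(b))]
    h = [math.tanh(zi) for zi in z]
    s = sum(ui * hi for ui, hi in zip(u, h))
    return z, h, s

def grad_W(W, x, b, u):
    """ds/dW as the outer product delta x^T, where delta_i = u_i * f'(z_i)."""
    _, h, _ = forward(W, x, b, u)
    delta = [ui * (1.0 - hi * hi) for ui, hi in zip(u, h)]  # tanh'(z) = 1 - tanh(z)^2
    return [[di * xj for xj in x] for di in delta]          # cell (i, j) holds delta_i * x_j

# Hypothetical toy values: 2 hidden units, 3 inputs.
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
x = [1.0, 2.0, -1.0]
b = [0.05, -0.1]
u = [0.7, -1.2]

analytic = grad_W(W, x, b, u)

# Finite-difference check on the single weight W[1][2].
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[1][2] += eps
numeric = (forward(W_plus, x, b, u)[2] - forward(W, x, b, u)[2]) / eps
```

The gradient has the same shape as W, and the entry for W[1][2] should agree with the finite-difference estimate up to a small numerical error.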
Okay. Um, so, here's just,
um, here's homework two.
You're meant to do some of this stuff.
Um, here are just a couple of collected tips,
um, which I hope will help.
I mean, keeping track of your variables and
their dimensionality is really useful because if you
just work out what the dimensionality of things should be,
you're often kind of halfway there.
I mean basically what you're doing is sort of
applying the chain rule over and over again.
It always looks like this.
Um, but doing it in this sort of matrix calculus sense of the chain rule.
Um, in the homework you have to do a softmax,
which we haven't done in class.
Um, something that I think you'll find useful,
if you want to break apart the softmax is to consider two cases.
One case is when you're working it out for the correct class.
And then, the other case is for all the other incorrect classes.
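To make the two cases concrete, here is a minimal sketch that assumes the loss is the negative log of the softmax probability of the correct class, which may differ in detail from the homework setup. Under that assumption the gradient with respect to the scores is p_i minus one for the correct class and p_i for every other class, and the code compares this with a finite-difference estimate. The score values are hypothetical.

```python
import math

def softmax(theta):
    m = max(theta)                           # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(theta, correct):
    return -math.log(softmax(theta)[correct])

def grad_cross_entropy(theta, correct):
    p = softmax(theta)
    # Case 1: the correct class gets p_i - 1. Case 2: every other class gets p_i.
    return [p_i - 1.0 if i == correct else p_i for i, p_i in enumerate(p)]

theta = [2.0, -1.0, 0.5]   # hypothetical scores for three classes
correct = 0
analytic = grad_cross_entropy(theta, correct)

# Finite-difference estimate, one score at a time.
eps = 1e-6
numeric = []
for i in range(len(theta)):
    bumped = theta[:]
    bumped[i] += eps
    numeric.append((cross_entropy(bumped, correct) - cross_entropy(theta, correct)) / eps)
```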
Um, yeah.
Um, in the little derivation
I did before, I said, well,
let's work out an element-wise partial
derivative because that should give me some sense of what's going on,
what the answer is.
I think that can be a really good thing to do
if you're getting confused by matrix calculus.
And I sort of,
um, slightly skipped past another slide
last time that was talking about
the shape convention. I talked about it for a moment:
for the homeworks you can work out your answer however you want,
you can work it out in terms of,
you know, numerator-ordered Jacobians,
if that seems best to you.
But we'd like you to give the final answer to
your assignment questions following the shape convention.
So, that the derivative should be shaped as
a vector or matrix in the same way as the variable
with respect to which you're working out your derivatives.
Okay. Um, the last little bit for finishing up this example from last time,
I want to say a little bit about,
is what happens with words.
And one answer is nothing different.
But another answer is they are a little bit of a special case here because,
you know, really we have a matrix of word vectors, right?
We have a vector for each word.
And so then you can think of that as sort of this matrix of word vectors,
where each row is a different word.
But we're not actually kind of connecting up
that matrix directly to our classifier system.
Instead of that, what we're connecting up to the classifier system is
this window and the window will have in it five words.
Most commonly they're different words.
But you know occasionally the same word might appear,
um, in two positions in that window.
And so, we can nevertheless do
exactly the same thing and continue our gradients down and say okay,
um, let's work out, um,
the gradients of this word window vector.
And if, um, these are of dimension D we'll have this sort of 5-D, um, vector.
But, you know, then what do we do about it?
And the answer of what we do about it
is we can just sort of split this window vector into five pieces and say aha,
we have five updates to word vectors.
We're just going to go off and apply them to the word vector matrix.
Um, and you know, if the same word occurs twice,
um, in that window we literally apply both of the updates.
So, it gets updated twice or maybe
actually you want to sum them first and then do the update once but yeah,
that's a technical issue.
Um, so what that actually means is that we're extremely sparsely
updating the word Vector Matrix because most of
the word Vector Matrix will be unchanged and just a few rows of that,
um, will be being updated.
And, um, soon we're going to be doing stuff with PyTorch.
Um, and if you poke around PyTorch it even has some special stuff.
Um, look for things like sparse SGD, meaning
that you're sort of doing a very sparse updating like that.
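A minimal sketch of that sparse update in plain Python, not PyTorch: the window gradient is split into one piece per position, pieces for a repeated word are summed first, and only the rows for words in the window are touched. The function name, vocabulary size, learning rate, and gradient values are all hypothetical.

```python
def apply_window_gradient(embeddings, window_ids, window_grad, dim, lr):
    """Split a (window * dim) gradient into per-word pieces and update only those rows."""
    summed = {}
    for pos, word_id in enumerate(window_ids):
        piece = window_grad[pos * dim:(pos + 1) * dim]
        if word_id in summed:
            # Same word in two positions: sum the pieces, then update once.
            summed[word_id] = [a + g for a, g in zip(summed[word_id], piece)]
        else:
            summed[word_id] = list(piece)
    for word_id, g in summed.items():
        embeddings[word_id] = [w - lr * gi for w, gi in zip(embeddings[word_id], g)]
    return sorted(summed)   # the rows that were actually changed

dim = 2
embeddings = [[0.0, 0.0] for _ in range(6)]   # hypothetical 6-word vocabulary
window_ids = [3, 1, 4, 1, 5]                  # word 1 appears twice in the window
window_grad = [1.0] * (5 * dim)               # hypothetical gradient for the window vector
touched = apply_window_gradient(embeddings, window_ids, window_grad, dim, lr=0.1)
```

Rows 0 and 2 are left exactly as they were, and row 1 moves twice as far as the others because its two pieces were summed.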
Um, but there's one other sort of interesting thing that you should know about
for a lot of, um,
things that you do, which is just what actually happens if we push
down these gradients into our word vectors.
Well, the idea is, you know,
if we do that it would be just like all other neural net learning,
that we will sort of in principle say move the word vectors around in such a way
as they're more useful in helping determine
named entity classification in this case because that was our motivating example.
Um, so you know it might for example learn that the word "in" is
a very good indicator of a named entity, or sorry, a place name following.
So, after "in" you often get London, Paris et cetera.
Right, so it's sort of got a special behavior that
other prepositions don't as being a good location indicator.
And so, it could sort of um,
move its location around and say here are words that are
good location indicators and therefore help our classifier work even better.
So, in principle that's good and it's a good thing to do,
to update word vectors to help you perform better on
a supervised task such as this Named Entity Recognition classification.
But, there's a catch which is that it doesn't always work actually.
And so, why doesn't it always work?
Well, suppose that we're training a classifier.
Um, you know it could be the one I just did or a softmax or logistic regression.
And we're wanting to classify, um,
movie review sentiment as positive or negative.
Well, you know if we have trained our word vectors,
we've got some word vector space and maybe in the word vector space, um, TV,
telly and television are all very close
together because they mean basically the same thing.
So, that's great, our word vectors are good.
But, well suppose it was the case,
that in our training data for our classifier,
so, this is our training data for movie review sentiment,
we had the words TV and telly but we didn't have the word television.
Well, then what's going to happen,
is well while we try and train our sentiment classifier,
if we push gradient back down into the word vectors what's likely to happen
is that it will move around the word vectors of the words we saw in the training data.
But, necessarily television's not moving, right?
Because we're only pushing gradient down to words that are in our training data.
So, this word goes nowhere,
so it just stays where it was all along.
So, the result of our training is words get moved around.
So, here are good words for indicating negative sentiment, um,
but actually at test time,
when we're running our model,
if we evaluate on a sentence with television in it,
it's actually going to give the wrong answer.
Whereas if we haven't changed the word vectors at all and had just left
them where our word embedding learning system put them.
Then it would have said television,
that's a word that means about the same as TV or telly.
I should treat it the same in
my sentiment classifier and it would actually do a better job.
So, it's sort of two-sided whether you gain by training word vectors.
And so, this is a summary, um, that says
that it's two-sided and practically what you should do.
So, the first choice is: is it a good idea to use pre-trained word vectors like
the word2vec vectors that you used in assignment one or
using the training methods that you're doing right now for homework two?
And the answer to that is almost always yes.
And the reason for that is these word vector training methods are
extremely easy to run on billions of words of text.
So, we you know train these models like [inaudible] on billions or tens of billions of words.
And it's easy to do that for two reasons.
Firstly, because the training algorithms are very simple, right?
That um, the word2vec training algorithm, skip-gram, is a very simple algorithm.
Secondly, because we don't need any expensive resources,
all we need is a big pile of text documents and we can run it on them.
So, really easy to run it on,
you know five or 50 billion words.
Whereas, you know, we can't do that for most of the classifiers that we
want to build because if it's something
like a sentiment classifier or a named entity recognizer,
we need labeled training data to train
our classifier, and then if we ask someone how many words of labeled training data
do you have for named entity recognition, they give us back
a number like 300,000 words or one million words, right.
It's orders of magnitude smaller.
Okay. Um. So, therefore,
we can gain using pre-trained word vectors,
because they know about all the words that aren't
in our supervised classifier's training data.
And they also know much more about the words that actually
are in the training data, but only rarely.
So, the exception to that is,
if you have hundreds of millions of words of data,
then you can start off with random word vectors and go from there.
And so, a case where this is actually commonly done,
is for machine translation,
which we do later in the class.
It's relatively easy for
large languages to get hundreds of millions of words of translated text.
If you wanted to build something,
like a German- English or Chinese-English machine translation system.
Not hard to get 150 million words of translated texts.
And so, that's sort of sufficiently much data,
that people commonly just start with word vectors, um,
being randomly initialized and start training,
um, their translation system.
Okay. So then the second question is, okay.
I'm using pre-trained word vectors.
Um, when I train my supervised classifier,
should I push gradients down into the word vectors and up, and update them?
Which is often referred to as fine tuning the word vectors, um,
or should I not,
should I just sort of throw away
those gradients and not push them down into the word vectors?
And you know, the answer to that is it depends,
and it just depends on the size.
So, if you only have a small training data set, um, typically,
it's best to just treat the pre-trained word vectors as fixed,
um, and not do any updating of them at all.
If you have a large data set,
then you can normally gain by doing fine tuning of the word vectors.
And of course, the answer here,
is what counts as large.
Um, you know, if certainly,
if you're down in the regime of 100 thousand words,
a couple of hundred thousand words, you're small.
If you're starting to be over a million words,
then maybe you're large.
But you know, in practice, people do it both ways and see which number is higher,
and that's what they stick with.
Um. Yes. Um, then, the sort of,
there's the sort of point here that is just worth underlining, which is, yes,
so in principle, we can back-propagate this gradient to every variable in our model.
Um, it's actually a theorem that we can arbitrarily
decide to throw any subset of those gradients away,
and we are still improving the log-likelihood of our model, all right?
It kind of can't be inconsistent.
You can just sort of pick some subset and say only
train those 37 and throw away all the rest.
And the algorithm will still improve,
um, the log-likelihood of the model.
Perhaps not by as much as if you trained the rest of the variables,
as well, um, but yes,
it can't actually do any harm not to train anything.
Um, that's one of the reasons why often people don't notice bugs in their code, as well.
It is because if your code is kind of broken
and only half of the variables are being updated,
it will still seem to be training something and improving.
Um. It's just not doing as well as it could be doing,
if you'd coded it correctly.
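A toy illustration of that point, under assumptions that are not from the lecture: a simple squared-error loss instead of a log-likelihood, made-up numbers, and a step size small enough for plain gradient descent to behave. Updating only a subset of the variables still lowers the loss, just by less than updating all of them.

```python
def loss(w, target):
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, target):
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]

def step(w, target, lr, trainable):
    """One gradient step that throws away the gradient for every index not in `trainable`."""
    g = grad(w, target)
    return [wi - lr * gi if i in trainable else wi
            for i, (wi, gi) in enumerate(zip(w, g))]

target = [1.0, -2.0, 3.0, 0.5]   # hypothetical optimum
w0 = [0.0, 0.0, 0.0, 0.0]

full = step(w0, target, lr=0.1, trainable={0, 1, 2, 3})
partial = step(w0, target, lr=0.1, trainable={0, 2})   # half the variables never move
```

Both steps improve on the starting loss, which is why a bug that silently freezes some variables can go unnoticed.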
Okay. Um, so, at this point, um,
I've sort of, um,
almost shown you back-propagation, right?
So, back-propagation is really taking derivatives with a generalized chain rule,
with the one further trick which we sort of represented with that delta,
which is: you want to be, um,
clever in doing this, so
you minimize computation by reusing shared stuff.
Um, but now what I want to move on to is to sort of look at how we can do
that much more systematically, which is this idea:
We have a computation graph and we're going to run
a back-propagation algorithm through the computation graph.
So, this is kind of like an abstract syntax tree,
an expression tree that you might see in a compilers class,
or something like that, right?
So, when we have an arithmetic expression of the kind that we're going to compute,
we can make this tipped over on its side tree representation.
So, we've got the X and W variables,
we're going to multiply them.
There's the B variable,
we're going to add it to the previous partial result.
We're going to stick it through our non-linearity F
and then we're going to multiply it by U.
And that was the computation,
that we're doing in our neural network.
So, um, the source nodes are inputs,
the interior nodes of this tree are operations.
And then we've got these edges that pass along the results of our computation.
And so, this is the computation graph for precisely the example
I've been doing for the last lecture [NOISE].
Okay, so there are two things that we want to be able to do.
The first one is,
we want to be able to start with these variables and do this computation,
and calculate what S is.
That's the part that's dead simple,
that's referred to as forward propagation.
So, forward propagation is just expression evaluation,
as you do in any programming language interpreter.
Um, that's not hard at all.
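As a small sketch of forward propagation as expression evaluation, the graph for s = u . f(Wx + b) can be written as a list of operation nodes in topological order and evaluated one node at a time. The node names, the dictionary-based representation, the logistic choice for f, and the input values are all assumptions made for illustration.

```python
import math

# Each interior node: (output name, operation, names of its inputs), in topological order.
graph = [
    ("Wx", "matvec", ("W", "x")),
    ("z", "add", ("Wx", "b")),
    ("h", "f", ("z",)),
    ("s", "dot", ("u", "h")),
]

ops = {
    "matvec": lambda W, x: [sum(w * xi for w, xi in zip(row, x)) for row in W],
    "add": lambda a, b: [ai + bi for ai, bi in zip(a, b)],
    "f": lambda z: [1.0 / (1.0 + math.exp(-zi)) for zi in z],   # logistic, an assumed choice
    "dot": lambda u, h: sum(ui * hi for ui, hi in zip(u, h)),
}

def forward(graph, inputs):
    """Evaluate every node in order; edges carry the computed values forward."""
    values = dict(inputs)
    for name, op, arg_names in graph:
        values[name] = ops[op](*(values[a] for a in arg_names))
    return values

# Hypothetical source-node values.
inputs = {"W": [[1.0, 0.0], [0.0, -1.0]], "x": [2.0, 3.0], "b": [-2.0, 3.0], "u": [4.0, 6.0]}
values = forward(graph, inputs)
```

Here z comes out as all zeros, so h is 0.5 in each position and the score s is 5.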
Um, but the difference here is, "Hey,
we want to do a learning algorithm" so we're going to do the opposite of that, as well.
What we want to be able to do is also backward propagation,
or back-propagation or just back-prop, it's commonly called,
which is we want to be able to go,
um, from the final part.
The final part here.
And then at each step,
we want to be calculating
these partial derivatives and passing them back through the graph.
And so, this was sort of the notion before that we had an error signal, right?
So, we're starting from up here,
we've calculated a partial of S by Z,
which is this with respect to that.
And so, that's sort of our calculated error signal, up to here,
and then we want to pass that further back, to start, um,
computing, um, um, gradients further back.
Right? And we started off, um, right here,
with the partial of S by S. What's the partial of S by S going to be?
One. Okay, yes.
So, the rate at which S changes is the rate at which S changes.
So, we just start off with one,
and then we want to work out how this gradient changes as we go along.
Um, so what we're doing here is when we're working out things for one node,
that node is going to have passed in to it an upstream gradient,
which is its error signal.
So, that's the partial of our final, f- final result,
which was our loss, um, by um,
the va- variable that was the output of this computation node.
So, that's the partial of s by h, here.
And then, we did some operation here.
Here's the non-linearity, but it might be something else.
And so what we want to then work out is a downstream gradient,
which is the partial of S by Z,
which was the input to this function.
And well then the question is,
how do we do that?
And the answer to that is,
we use the chain rule, of course, right?
So, we have a concept of a local gradient.
So, here's h as the output,
um, z is the input.
So, this function here,
this is our non-linearity, right?
So, this is whatever we're using as our non-linearity,
like a logistic or tanh. We calculate h in terms of z,
and we can work out the partial of H by Z.
So, that's our local gradient.
And so then, if we have both the upstream gradient and the local gradient,
we can then work out the downstream gradient because we know the
partial of s by z is going to be ds/dh times, um, dh/dz.
And so, then we'll be able to pass down the downstream gradient to the next node.
Okay. So our basic rule,
which is just the chain rule written in different terms
is downstream gradient equals upstream gradient times local gradient.
Um, easy as that. Um, okay.
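For a single node with one input and one output, that rule can be sketched like this. The logistic function is used as the non-linearity, and the upstream value is a made-up number.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_backward(z, upstream):
    """downstream = upstream * local, that is ds/dz = ds/dh * dh/dz."""
    h = logistic(z)
    local = h * (1.0 - h)      # dh/dz for the logistic function
    return upstream * local

z = 0.0
upstream = 4.0                 # hypothetical ds/dh arriving from above
downstream = logistic_backward(z, upstream)
```

At z equal to zero the local gradient is 0.25, so an upstream gradient of 4 becomes a downstream gradient of 1.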
So, this was um,
the very simplest case where we have a node with one input and one output.
So, that's a function um,
like our logistic function.
But, we also want to have things work out for general computation graphs.
So, how are we going to do that?
Well, the next case is,
um, what about if we have multiple inputs?
So, if we're calculating something like z equals W times x,
um, where actually, yes, z and x are themselves vectors and W, um,
is a matrix, but we're treating x as an input and W as an input,
and Z as our output, right?
We kind of group vectors and matrices together.
Well, if you have multiple inputs,
you then end up with multiple local gradients.
So, you can work out um,
the partial of Z with respect to X,
or the partial of Z with respect to W. And so,
you essentially take the upstream gradient,
you multiply it by each of the local gradients,
and you pass it down the respective path,
and we calculate these different downstream gradients to pass along.
Is that making sense?
Yeah. Okay. I'll chug on.
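A sketch of the two-input case for z = Wx, written in plain Python with hypothetical numbers. The upstream gradient is the vector ds/dz; following the shape convention, the piece passed down to W is the outer product of the upstream gradient with x, and the piece passed down to x is W-transpose times the upstream gradient.

```python
def matvec_backward(W, x, upstream):
    """Given upstream = ds/dz for z = Wx, return the downstream gradients for W and x."""
    # Same shape as W: cell (i, j) is upstream_i * x_j.
    grad_W = [[d * xj for xj in x] for d in upstream]
    # Same shape as x: entry j sums W[i][j] * upstream_i over the rows i.
    grad_x = [sum(W[i][j] * upstream[i] for i in range(len(W))) for j in range(len(x))]
    return grad_W, grad_x

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # hypothetical 3 x 2 weight matrix
x = [1.0, -1.0]
upstream = [1.0, 0.0, 2.0]                 # hypothetical ds/dz
grad_W, grad_x = matvec_backward(W, x, upstream)
```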
Okay. So, let's sort of look at an example of this and then we'll see one other case.
So here's the little baby example.
This isn't kind of really looking like a neural net,
but we've got three inputs x, y, and z.
And x and y get added together and y and z get maxed.
And then we take the results of those two operations and we multiply them together.
So overall what we're calculating is x plus y, times the max of y and z.
But, you know, we have here a general technique and we can apply it in any cases.
Okay, so if we wanted to have this graph and we want to run it forward,
well, we need to know the values of x, y, and z.
So, for my example x equals one y equals two z equals zero.
Um, so, we take the values of those variables and
push them onto the calculations for the forward arrows.
And then well the first thing we do is add and the result of that is three.
And so we can put that onto the arrow.
That's the output of add.
Max, it's two as the output, and the value of times is six.
And so the forward pass we have evaluated the expression.
Its value is six.
That wasn't hard. Okay. So then the next step is we
then want to run back-propagation to work out gradients.
Um, and so we sort of want to know how to sort of,
um work out these local gradients.
So a, right, a is the result of sum.
So here's a as the result of sum.
So a equals x plus y.
So if you're taking da/dx that's just one and da/dy is also one, that makes sense.
Um, the max is slightly trickier because
there are some slopes, and the gradient for the max depends on which one's bigger.
So, if y is bigger than z,
the partial of b by y
is one, otherwise it's zero, and conversely for the partial of b by z.
So that one's a little bit dependent.
And then we do the multiplication, um,
case at the end, um,
and work out its partials with respect to a and b.
And, um, since that's a and b, which have the values three and two,
if you're taking the partial of f by a it equals b, which is two, and vice versa.
Okay. So that means we can work out the local gradients at each node.
And so then we want to use those to
calculate our gradients backwards and the back-propagation paths.
So we start at the top.
The partial of f with respect to f is one.
Because if you move f by a tenth then you've moved f by a tenth.
So that cancels out as one.
Okay. So then we want to pass backwards.
So, the first thing that we have is this sort of multiply node.
And so we worked- we know its local gradients that partial of f by a is two,
and the partial of f by b is three.
And so we get those values.
So formally we're taking the local gradients
multiplying them by the upstream gradient and getting our two and three.
And notice the fact that effectively what happens is the values on the two arcs swap.
Um, and then we sort of continue back.
Okay. There's a max node.
So our upstream gradient is now three and then we want to multiply by the local gradient.
And since the max of these two is two, it has a slope of one on this side.
So you get three,
there's no gradient on this side and we get zero.
And then we do the similar calculation on
the other side where we have local gradients of one.
And so both of them come out as two. And then the one other thing to do is we notice,
well, wait a minute.
There are two arcs that started from the y,
both of which we've back-propagated some gradient on.
And so what do we do about that?
Um, what we do about that is we sum.
So, the partial of f by x is two, the partial of f by z is zero, but the
partial of f by y is the sum of the two, which is five, right?
And so this isn't complete voodoo.
This is something that should make sense in terms of what gradients are, right?
So, that what we're saying,
is what we're calculating,
is if you wiggle x a little bit
how big an effect does that have on the outcome of the whole thing?
And so, you know, we should be able to work this out.
So, our x started off as one but let's suppose we wiggle it up a bit
and make it 1.1. Well, according to this, the output should change by about 0.2,
it should be magnified by two.
And we should be able to work that out, right?
So it's then 1.1 plus two,
so that's then 3.1.
And then we've got the two here that multiplies by it and it's 6.2.
And lo and behold it went up by 0.2, right?
So that seems correct.
And if we try and do the same for,
well, let's do the z. It's easy.
So if we wiggle the z, which had a value of zero, by 0.1,
this is 0.1.
When we max it, this is still two and
so the calculated value doesn't change, it's still six.
So the gradient here is zero.
Wiggling this does nothing.
And then the final one is y.
So, its starting-off value is two.
So, if we wiggle it a little and make it 2.1,
our claim is that the result will change by about 0.5.
It should be multiplied by five times.
So, if we make this 2.1 we then have 2.1 plus one, and that'd be 3.1.
And then the max here would also be 2.1.
And so we'd have 2.1 times 3.1.
And that's too hard arithmetic for me to do in my head.
But if we take 2.1 times 3.1 it comes out to 6.51.
So, basically it's gone up by half.
We don't expect the answers to be exact of course, right?
Because you know that's not the way calculus works, right?
[NOISE] But it's showing that we're getting the gradients right.
Okay. So this actually works.
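The whole worked example, forward pass, backward pass, and the wiggle checks, can be sketched in a few lines. The function name and the way ties in the max are handled are illustrative choices, not something specified in the lecture.

```python
def forward_backward(x, y, z):
    # Forward pass: evaluate the expression node by node.
    a = x + y
    b = max(y, z)
    f = a * b

    # Backward pass: start from df/df = 1 and apply upstream * local at each node.
    df = 1.0
    da = df * b                                   # multiply node: local gradient for a is b
    db = df * a                                   # multiply node: local gradient for b is a
    dx = da * 1.0                                 # add node: local gradient is 1
    dy_from_add = da * 1.0
    dy_from_max = db * (1.0 if y > z else 0.0)    # max node: gradient goes to the larger input
    dz = db * (1.0 if z > y else 0.0)
    dy = dy_from_add + dy_from_max                # y feeds two nodes, so its gradients are summed
    return f, (dx, dy, dz)

f, (dx, dy, dz) = forward_backward(1.0, 2.0, 0.0)

# Wiggle each input by 0.1 and look at how much the output moves.
f_x, _ = forward_backward(1.1, 2.0, 0.0)
f_y, _ = forward_backward(1.0, 2.1, 0.0)
f_z, _ = forward_backward(1.0, 2.0, 0.1)
```

Wiggling x moves the output to about 6.2, wiggling y moves it to about 6.51, and wiggling z leaves it at 6, in line with gradients of two, five, and zero.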
So, what are the techniques that we need to know?
Um, so we've sort of already seen them all.
So, you know, we discussed when there are multiple incoming arcs,
how we sort of work out the different local derivatives.
The main other case that we need to know is if, um,
in the function computation there's a branch
outward, where the result of something is used in multiple places.
And so this was like the case here.
I mean, here this was an initial variable,
but you know, it could have been computed by something further back.
So, if this thing is used in multiple places and
you have the computation going out in different ways.
It's just this simple rule that when you do backpropagation
backwards you sum the gradients that you get from the different output branches.
Okay. So, if a equals x plus y, well, that's the one we showed you
before, where we were doing this sum operation to work out the total partial of f by y.
Okay. And if you sort of think about it just a little bit more,
there's sort of these obvious patterns,
um, which we saw in this very simple example.
So, if you've got a plus, then really the upstream gradient is going to
be sort of heading down every one of
these branches when you have multiple branches of things being summed.
Now, in this case,
it's just copied unchanged, but that's because our computation was x plus y.
You know, it could be more complicated,
but we're passing it down each of those branches.
So plus distributes the upstream gradient.
When you have a max, that's kind of like a routing operation,
because max is going to be sending the gradient in the direction that's the max,
and other things are going to get no gradient being passed down to them.
Um, and then when you have, um,
a multiplication, this has this kind of
fun effect that what you do is switch the gradient, right?
And so this reflects the fact that when you have u times
v, regardless of whether u and v are vectors or just,
um, scalars, the derivative of the result with respect to
u is v and the derivative of the result with respect to v is u.
And so, the, um,
gradient signal is the flip,
um, of the two numbers on the different sides.
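As a rough sketch, the three patterns (plus distributes, max routes, multiply switches) can be written as small backward functions for scalar inputs. The function names are hypothetical, and sending the gradient to the first argument when the two inputs of max tie is an arbitrary choice made for this sketch.

```python
def backward_add(upstream, x, y):
    # Plus distributes: both inputs receive the upstream gradient unchanged.
    return upstream, upstream

def backward_max(upstream, x, y):
    # Max routes: only the larger input receives the gradient.
    # Ties go to x here, which is an assumption of this sketch.
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def backward_mul(upstream, x, y):
    # Multiply switches: each input receives upstream times the other input.
    return upstream * y, upstream * x

# Backprop through f = (x + y) * max(y, z) at x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0
a = x + y
b = max(y, z)
f = a * b

df_da, df_db = backward_mul(1.0, a, b)         # top gradient df/df is 1
df_dx, dy_from_a = backward_add(df_da, x, y)   # branch of y through the plus
dy_from_b, df_dz = backward_max(df_db, y, z)   # branch of y through the max
df_dy = dy_from_a + dy_from_b                  # sum over the outgoing branches
```

Here y is used in two places, so its gradient is the sum of what comes back from the plus node and from the max node.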
Okay. Um, so this is sort of most of how we have
these computation graphs and we can work out backpropagation backwards in them.
There's sort of one more part of this to do,
um, which is to say, gee,
we want to do this efficiently.
So, there's a bad way to do this, which is to say, "Oh well,
we wanted to calculate the partial of this by b and so we can calculate that partial."
Which was essentially what I was doing on last time's slides.
We say, "Um, partial of s by b equals the partial of s by h,
times the partial of h by z,
times the partial of z by b,
and we have all of those partials."
We work them all out and multiply them together, and then someone says,
"Um, what's the partial of s by W?"
And we say, "Huh, that's the chain rule again, I'll do it all again.
It's the partial of s by,
um, h times the partial of h by z,
times the partial of z by x,"
no, no, right, ah, lost it.
But you do a big long list of them and you calculate it all again.
That's not what we want to do.
Instead we want to say, "Oh,
look there's this shared stuff.
There's this error signal coming from above."
And we can work out the error signal, the upstream gradient, for this node.
We can use it to calculate the upstream gradient for this node.
We can use this to calculate the upstream gradient for this node and then,
using the local gradients, of which there are two calculated at
this node, we can then calculate this one and that one.
Um, and then, from here, knowing this upstream gradient,
we can use the local gradients at this node to compute this one and that one.
And so, we're sort of doing this efficient, computer-science-like computation,
um, where we don't do any repeated work. Does that make sense?
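Here is a small NumPy sketch of that sharing, for the score s = u . h with h = f(z) and z = Wx + b. It assumes f is tanh and uses made-up sizes and random values, purely for illustration. The shared error signal (called delta here, a name chosen for the sketch) is computed once and then reused for both the bias gradient and the weight gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # input vector (hypothetical size)
W = rng.normal(size=(3, 4))   # weight matrix
b = rng.normal(size=3)        # bias vector
u = rng.normal(size=3)        # output weights

def score(W, b):
    # s = u . tanh(W x + b); tanh is an assumed choice of non-linearity
    return u @ np.tanh(W @ x + b)

# Forward pass, keeping the intermediate values
z = W @ x + b
h = np.tanh(z)
s = u @ h

# Shared upstream gradient ds/dz, computed once
delta = u * (1.0 - h ** 2)

# Both parameter gradients reuse delta instead of redoing the chain rule
grad_b = delta                 # dz/db is the identity
grad_W = np.outer(delta, x)    # same shape as W
```

Nothing above grad_b and grad_W is recomputed: the only extra work per parameter is the last local step.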
Yeah. Okay. Um, and so that is,
um, the whole of backprop.
So, um, here's sort of a slightly sketchy, um, graph
which is sort of just recapitulating this thing.
So, if you have any computation that you want to perform, um, well,
the hope is that you can sort your nodes into
what's called a topological sort which means that things that are arguments,
variables that are arguments are sorted before
variables that are results that depend on that argument.
You know, providing what you have is an acyclic graph,
you'll be able to do that.
If you have a cyclic graph, you're in trouble.
Um, well, there are actually techniques people
use to roll out those graphs, but I'm not gonna go into that now.
So, we've sorted the nodes, which is kind of loosely represented here from
bottom to top, in a topological sort.
Okay. So then, for the forward prop, we sort of go through the nodes in
the topological sort order and,
if it's a variable, we just set its value to what its variable value is.
If it's computed from other variables, their values must have been
set already because they're earlier in the topological sort, um,
and then we compute the value of those nodes according to their predecessors,
and we pass it up and work out the final output,
the loss function of our neural network, and that is our forward pass.
Okay. So then, after that we do our backward pass and so for
the backward pass we initialize the output gradient with one.
The top thing is always one,
the partial of z with respect to z.
And then, we now sort of go through the nodes in reverse topological sort order.
And so therefore, anything that's,
ah, language is complex,
anything that's above a node,
anything that we calculated based on it in the
forward pass, will already have had its
gradient calculated as a product of upstream gradient
times local gradient, and then we can use that,
um, to compute the next thing down.
Um, and so basically the overall rule
is, for any node, you work out its set of successors,
the things that are above it,
that depend on it, and then you say, "Okay,
the partial of z with respect to x is simply the sum over the set of
successors of the local gradient that you
calculated at the node times the upstream gradient of that node."
Um, and in the examples that I gave before there were never
multiple upstream gradients.
But if you imagine a general big graph, there could actually be
sort of different upstream gradients that are being used for the various successors.
So, we apply that backwards and then we've worked out in backpropagation, um,
the gradient of every,
the gradient of the final result z with respect to every node in our graph.
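The forward and backward procedure over a topologically sorted graph might be sketched like this in plain Python. The graph encoding (a list of name, op, inputs triples) and the OPS table are inventions for this sketch, not any framework's API. Instead of summing over successors at each node, the code pushes each node's gradient down into its inputs and accumulates there, which gives the same sums.

```python
# Each op maps to (forward function, function giving the local gradient per input)
OPS = {
    "add": (lambda a, b: a + b, lambda a, b: (1.0, 1.0)),
    "mul": (lambda a, b: a * b, lambda a, b: (b, a)),
    "max": (lambda a, b: max(a, b),
            lambda a, b: (1.0, 0.0) if a >= b else (0.0, 1.0)),
}

# Non-input nodes in topological order: (name, op, names of inputs)
GRAPH = [
    ("a", "add", ("x", "y")),
    ("b", "max", ("y", "z")),
    ("f", "mul", ("a", "b")),
]

def forward(inputs):
    # Visit nodes in topological order, so inputs are always ready.
    values = dict(inputs)
    for name, op, args in GRAPH:
        values[name] = OPS[op][0](*(values[n] for n in args))
    return values

def backward(values, output="f"):
    # Visit nodes in reverse topological order, so a node's own gradient
    # is complete before it is passed down to its inputs.
    grads = {name: 0.0 for name in values}
    grads[output] = 1.0   # gradient of the output with respect to itself
    for name, op, args in reversed(GRAPH):
        local = OPS[op][1](*(values[n] for n in args))
        for arg, local_grad in zip(args, local):
            grads[arg] += grads[name] * local_grad   # upstream times local
    return grads

values = forward({"x": 1.0, "y": 2.0, "z": 0.0})
grads = backward(values)
```

Each node is visited once going forward and once going backward, which is where the matching complexity of the two passes comes from.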
Um, and the thing to notice about this is,
if you're doing it right and efficiently,
the big-O order of complexity of doing backpropagation is exactly the
same as doing forward propagation, i.e., expression evaluation.
So, it's not some super expensive complex procedure
that you can imagine doing and scaling up.
Um, you're actually in exactly the same complexity order.
Okay. Um, so as I've presented it here, this procedure
you could just think of as something that you're running on
an arbitrary graph, calculating this forward pass and the backward pass.
I mean, almost without exception, the kind of
neural nets that we actually use have a regular layer-like
structure, and that's then precisely why it makes
sense to work out these gradients in terms of,
um, vectors, matrices, and Jacobians of the kind we were using before.
Okay. Um, so since we have this sort of really nice algorithm now, um,
this sort of means that, um,
we can do this just computationally and so we don't have to think or know how to do math.
Um, and we can just have our computers do all of this with this.
Um, so that using this graph structure, um,
we can just automatically work out how to apply, um, backprop.
And there are sort of two cases of this, right?
So, if what was calculated at each node,
um, is given as a symbolic expression,
we could actually have our computer work out for
us what the derivative of that symbolic expression is.
So, it could actually calculate, um,
the gradient of that node, and that's often referred to as automatic differentiation.
So, this is kind of like Mathematica or Wolfram Alpha.
You know how you can do your math homework on it?
You just type in your expression,
say what's the derivative, and it gives it back to you, right?
Um, it's doing symbolic computation and working out the derivative for you.
Um, so that method could be used to
work out the local gradients, and then we can use
the graph structure and our rule,
upstream gradient times local gradient gives downstream gradient,
i.e., the chain rule, um,
to then propagate it through the graph and do
the whole backward pass completely automatically.
And so that sounds, um, great.
Um, slight disappointment, um,
current deep learning frameworks don't quite give you that.
Um, there was actually a famous framework that attempted to give you that.
So the Theano framework that was developed at the University of Montreal, um,
though they've now abandoned it in the modern era
of large technology corporation deep learning frameworks.
Theano did precisely that.
It did the full thing of automatic differentiation, um,
for reasons that we could either think of good or bad,
current deep learning frameworks like TensorFlow or
PyTorch actually do a little bit less than that.
So what they do is, say,
well, for the computations at an individual node,
you have to do the calculus for yourself.
Um, for this individual node,
you have to write the forward propagation, say, you know,
return X plus Y and you have to write the backward propagation,
saying the local gradients, uh,
one and one to the two inputs X and Y, um,
but providing you or someone else has
written out the forward and backward local step at this node,
then TensorFlow or PyTorch does all the rest
of it for you and runs the backpropagation algorithm.
Um, and then, you know, effectively,
that sort of saves you having to have a big symbolic computation engine,
because the person coding
the node computations is writing
a bit of code as you might normally imagine doing it, whether in,
you know, C or Pascal,
of saying return X plus Y,
and, you know, local gradient, return one.
Right? And you don't actually have to have a whole symbolic computation engine.
Okay. So that means the overall picture looks like this.
Right? So um, schematically,
we have a computation graph, um,
and to calculate the forward computation, um,
we, um, sort of put inputs into
our computation graph, where there's sort of X and Y variables,
and then we run through the nodes in topologically sorted order,
and for each node we calculate its forward value, and
necessarily the things that it depends on have already been
computed, and we just do expression evaluation forward.
And then we return, um,
the final gate in the graph,
which is our loss function, or objective function.
But then, also we have the backward pass,
and for the backward pass,
we go through the nodes in reverse topologically sorted order,
and for each of those nodes,
we return their backward value,
and for the top node,
we return a backward value of one,
and that will then give us our gradients.
And so that means, um,
for any node, any piece of computation that we perform,
we need to write a little bit of code that um
says what it's doing on the forward pass and what it's doing on the backward pass.
So on the forward pass, um,
this is our multiplication,
so we're just saying return X times Y.
So that's pretty easy.
That's what you're used to doing.
But, well, we also need to do the backward pass,
the local gradients: return what is the
partial of L with respect to X and with respect to Y.
And well, to do that,
we have to do a little bit more work.
So we have to do a little bit more work,
first of all, in the forward pass.
So, in the forward pass,
we have to remember to sort of stuff away in some variables
what values we computed in the for-
what- what values were given to us in the forward pass,
or else we won't be able to calculate the backward pass.
So we store away the values of X and Y,
um, and so then,
when we're doing the backward pass,
we have passed in to us the upstream gradient,
the error signal, and now we just calculate, um,
upstream gradient times local gradient for each of the two inputs,
and we return backwards,
um, those, um, downstream gradients.
And so providing we do that for all the nodes of our graph,
um, we then have something that, um,
the system can learn for us as a deep learning system.
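A minimal sketch of such a node for multiplication, in plain Python, could look like the following. The class name and method names are made up for the sketch and are not the API of TensorFlow or PyTorch. The point is just that forward stores its inputs so that backward can form upstream times local gradient.

```python
class MultiplyGate:
    """A hypothetical node computing z = x * y for scalar inputs."""

    def forward(self, x, y):
        # Stash the inputs; the backward pass needs them.
        self.x = x
        self.y = y
        return x * y

    def backward(self, upstream):
        # upstream is dL/dz; the local gradients are dz/dx = y and dz/dy = x.
        dx = upstream * self.y
        dy = upstream * self.x
        return dx, dy   # the downstream gradients dL/dx and dL/dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)
dx, dy = gate.backward(1.0)
```

With inputs 3 and 2 and an upstream gradient of one, the gate hands back 2 for x and 3 for y, the switch described earlier.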
And so what that means in practice,
um, is that, you know,
any of these deep learning frameworks come with a whole box of tools that says,
um, here is a fully connected forward layer,
here is a sigmoid unit,
here are other more complicated things we'll do later,
like convolutions and recurrent layers.
And to the extent that you are using one of those,
somebody else has done this work for you.
Right? They've, um, defined, um,
nodes or a layer of nodes that have forward and backward already written for them.
And to the extent that that's true, um,
that means that making neural nets is heaps of fun. It's just like lego.
Right? You just stick these layers together and say,
"God, I have to learn on some data and train it."
You know, it's so easy that my high school student is building these things.
Right? Um, you don't have to understand much really,
um, but, you know,
to the extent that you actually want to do some original research and think,
"I've got this really cool idea of how to do things differently.
I'm going to define my own kind of different computation."
Well, then you have to do this and define your class,
and as well as, sort of saying,
how to compute the forward value,
you will have to pull out your copy of
Wolfram Alpha and work out what the derivatives are,
um, and put that into the backward pass.
Um, yeah.
Okay. So here's just one little more note on that.
Um, you know, in the early days of deep learning,
say prior to 2014,
what we always used to state to everybody very sternly is,
"You should check all your Gradients,
by doing numeric Gradient checks.
It's really really important."
Um, and so what that meant was, well, you know,
if you want to know whether you have coded your backward pass right,
an easy way to check, um,
whether you've coded it right,
is to do this numeric Gradient
where you're sort of estimating the slope by wiggling it a bit,
and wiggling the input a bit,
and seeing what effect it has.
So I'm working out the value of the function, F of X plus H,
for H very small, like ten to the minus four,
and then F of X minus H, um,
and then taking the difference and dividing by 2H,
and I'm saying, well, what is the slope at this point,
and I'm getting a numeric estimate of the gradient with respect,
um, to my variable X here.
Um, so this is what you will have seen in
high school when you did the sort of first, um, estimates of gradients,
where you sort of worked out F of X plus H minus F of X, divided by H,
and you're doing rise over run and got a point estimate of the gradient.
Um, exactly the same thing,
except for the fact,
in this case, rather than doing it one sided like that,
we are doing it two-sided.
It turns out that if you actually wanna do this,
two-sided is asymptotically hugely better,
and so you're always better off doing
two-sided gradient checks rather than one-sided gradient checks.
Um, so since it's hard to implement this wrong,
this is a good way to check that your gradients are
correct if you've defined them yourselves.
Um, as a technique to use for anything else,
it's completely, completely hopeless,
because think of doing this over
our deep learning model for a fully connected layer.
What this means is that,
if you've got this sort of W matrix of N by M and you want to, um,
calculate, um, your partial derivatives to check if they're correct,
it means that you have to do this for every element of the matrix.
So you have to calculate the eventual loss,
first jiggling W11, then jiggling W12,
then jiggling W13, W14, et cetera.
So in a complex network,
you'll end up literally doing millions of function evaluations
to check the gradients at one point in time.
So, you know, it's,
it's not like what I advertised for
backprop when I said it's just as efficient as calculating,
um, the forward value.
Doing this is forward
value computation time multiplied by number of parameters in our model,
which is often huge for deep learning networks.
So this is something that you only want to
have inside if statements that you could turn off.
So you could just sort of run it to check that your code isn't, um, buggy.
Um, you know, in honesty,
this is just much less needed now because, you know,
by and large you can plug together your components and layers in PyTorch,
um, and other people wrote the code right and it will work.
Um, so you probably don't need to do this all the time.
But it is still a useful thing to look at and to know
about if things um, are going wrong.
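A two-sided numeric gradient check might be sketched as follows, in plain Python. The function numeric_gradient is a hypothetical helper written for this illustration, checked here against the hand-derived gradients of the earlier example f = (x + y) * max(y, z). Note that it calls the function twice per parameter, which is exactly why it is far too slow for anything but debugging.

```python
def numeric_gradient(f, params, h=1e-4):
    # Two-sided estimate: (f(p + h) - f(p - h)) / (2h), one parameter at a time.
    grads = []
    for i in range(len(params)):
        original = params[i]
        params[i] = original + h
        f_plus = f(params)
        params[i] = original - h
        f_minus = f(params)
        params[i] = original   # restore before moving on
        grads.append((f_plus - f_minus) / (2 * h))
    return grads

def f(p):
    x, y, z = p
    return (x + y) * max(y, z)

analytic = [2.0, 5.0, 0.0]   # gradients worked out by backprop earlier
numeric = numeric_gradient(f, [1.0, 2.0, 0.0])
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If max_error is not tiny, the hand-written backward pass is the first thing to suspect.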
Yeah. Okay, so we- we've now mastered the core technology of neural nets.
We've now seen, well, basically everything we need to know about neural nets,
and I sort of just, um, summarized it there.
Um, just to sort of emphasize um once more.
Um, you know, I think some people think,
"Why do we even need to learn all this stuff about gradients?"
And there's a sense in which it's [inaudible] really,
because these modern deep learning frameworks will compute all of the gradients for you.
You know, we make you suffer on homework two,
but in homework three,
you can have your gradients computed for you.
But, you know, I- so you know it's sort of just, like, well,
why should you take a c- a class on compilers, right?
That there's actually something useful in understanding what goes on under the hood,
even though most of the time,
we're just perfectly happy to let the C compiler do its thing,
without being experts on x86 assembler every day of the week.
But, you know, there is more to it than that.
Um, you know, because even though backpropagation is great,
once you're building complex models,
backpropagation doesn't always work as you would expect it to.
Perfectly is maybe the wrong word,
because you know mathematically it's perfect.
Um, but it might not be achieving what you're wanting it to.
And well, if you want to sort of then debug and improve models,
it's kind of crucial to understand what's going on.
So, there's a nice Medium piece by Andrej Karpathy,
"Yes you should understand backprop," um, that's on the syllabus page, um,
that talks about this, and indeed, um,
the week after next, Abby is actually going to lecture about recurrent neural networks,
and you know, one of the places, um,
where you can easily fail, um,
in doing backpropagation turns up there,
um, and is a good example.
Okay. So anyone have any questions about backpropagation and computation graphs?
Okay. If not the remainder of the time is, um,
the grab bag of things that you really should know about,
if you're going to be doing deep learning.
And so, yeah, these are just itsy-bitsy things,
but let me say them.
Um, so up until now,
when we've had um loss functions,
and we've been maximizing the likelihood of our data,
and stuff like that,
we've sort of just had this part here which is the likelihood of our data,
and we've worked to maximize it.
Um, however, um, in practice that works badly usually,
and we need to do something else which is regularize our models.
And if you've done the Machine Learning class,
or something like that you will have seen regularization.
And there are various techniques to do regularization, but, um,
compared to anything else,
regularization is even more important,
um, for deep learning models, right?
So, um, the general idea is if you have a lot of parameters in your model,
those parameters can just essentially memorize what's in the data that you trained it on.
And so they're very good at predicting the answers.
The model becomes very good at predicting the answers to the data you trained it on,
but the model may become poor at working in the real world, and different examples.
And somehow we want to stop that.
And this problem is especially bad for deep learning models,
because typically deep learning models have vast,
vast numbers of parameters.
So in the good old days when statisticians ruled the show,
they told people that it was completely ridiculous to
have a number of parameters that approached your number of training examples.
You know, you should never have more parameters in your model,
than one-tenth of the number of your training examples.
So that's the kind of, um, rule of thumb you were told,
so that you had lots of examples with which to estimate every parameter.
Um, that's just not true with deep learning models.
It's just really common that we train
deep learning models that have 10 times as many parameters
as we have training examples.
Um, but miraculously it works.
In fact it works brilliantly,
those highly over-parameterized models,
and this is one of the big secret sauces of why deep learning has been so brilliant,
but it only works if we regularize the model.
So, if you train a model without sufficient regularization,
what you find is that you're training it and working out your loss on the training data,
and the model keeps on getting better,
and better, and better, and better.
Um, necessarily, the algorithm has to improve loss on the training data.
So the worst thing that could happen
is that the graph could become absolutely flat.
What you'll find is, with most models that we train,
they have so many parameters that this will just keep on going down,
until the loss is sort of approaching the numerical precision of zero,
if you leave it training for long enough.
It just learns the correct answer for every example,
because it effectively can memorize the examples.
Okay, but if you then say,
''Let me test out this model on some different data.''
What you find is this red curve:
up until a certain point, um,
you are also building a model that's better at predicting on different data,
but after some point this curve starts to curve up again.
And ignore that bit where it seems to curve down again,
that was a mistake in the drawing.
Um, and so this is then referred to as over-fitting:
from here on, the model being trained is
just learning to memorize whatever was in the training data,
but not in a way that later generalizes to other examples.
And so this is not what we want.
We want to try and avoid over-fitting as much as possible,
and there are various regularization techniques that we use for that.
And a simple starting one is this one here, where we penalize the log-likelihood by saying,
''You're going to be penalized to the extent that you move parameters away from zero.''
So the default state of nature is all parameters are zero,
so they're ignored in computations.
You can have parameters that have big values,
but you'll be penalized a bit for it,
and this is referred to as L2 regularization.
And, you know, that's sort of a starting point of
something sensible you could do with regularization,
but there's more to say later.
And we'll talk, in the lecture before we discuss
final projects, about other clever regularization techniques for neural networks.
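As a small sketch of the L2 idea in plain Python: the penalty is a strength lam times the sum of squared parameters, added to whatever data loss is being minimized (assumed here to be something like a negative log-likelihood). The names l2_penalty, regularized_loss, and lam, and the example numbers, are made up for the illustration.

```python
def l2_penalty(params, lam):
    # lam * sum of squared parameters: zero when every parameter is zero
    return lam * sum(p * p for p in params)

def l2_penalty_grad(params, lam):
    # Gradient of the penalty: pulls each parameter back toward zero
    return [2.0 * lam * p for p in params]

def regularized_loss(data_loss, params, lam):
    # Quantity to minimize: data loss plus the penalty
    return data_loss + l2_penalty(params, lam)

params = [0.5, -2.0, 0.0]
total = regularized_loss(1.0, params, lam=0.01)   # 1.0 + 0.01 * 4.25
pull = l2_penalty_grad(params, lam=0.01)
```

The big parameter (-2.0) contributes most of the penalty and gets the strongest pull toward zero, while the parameter already at zero is left alone.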
Okay. Um, grab bag number two,
vectorization is the term that you have here,
um, but it's not only vectors.
This is also matrixization,
and, for higher dimensional matrices, what are called tensors
in this field, tensorization.
Um, getting deep learning systems to run fast and
efficiently is only possible if we vectorize things.
Um, and what does that mean?
What that means is, you know,
the straightforward way to write a lot of code, um,
that you saw in your first CS class,
is you say for i in range n, um, and calculate something for item i.
Um, but when we want to be clever,
um, people, um, that are doing things fast,
um, we say rather than work out this W dot one word vector at a time,
and do it in a for loop,
we could instead put all of our word vectors into one matrix,
and then do simply one matrix-matrix multiply of W by our word vector matrix.
And even if you run your code on your laptop on a CPU,
you will find out that if you do it the vectorized way,
things will become hugely faster.
So in this example,
it became over an order of magnitude faster,
when doing it vectorized rather than,
um, with a for loop.
Um, and those gains are only compounded when we run code on a GPU:
you'll get no gains in speed at all on a GPU
unless your code is vectorized.
But if it is vectorized,
then you can hope to have results of, oh,
yeah, this runs 40 times faster
than it did on the CPU.
Okay, um, yeah, so always try to use vectors and matrices not for loops.
Um, of course it's useful when developing stuff to time your code,
and find out what's slow.
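A rough NumPy sketch of the comparison: multiply W by each word vector in a for loop, then do the same work as one matrix-matrix multiply, timing both. The sizes (500 vectors of dimension 300, 5 output rows) and variable names are arbitrary choices for the sketch, and the actual speedup will depend on the machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, d, C = 500, 300, 5    # number of word vectors, their dimension, rows of W
W = rng.normal(size=(C, d))
wordvectors_list = [rng.normal(size=d) for _ in range(N)]
wordvectors_matrix = np.stack(wordvectors_list, axis=1)   # d x N, one column per word

# Loop version: one matrix-vector product per word vector
start = time.perf_counter()
looped = [W.dot(v) for v in wordvectors_list]
loop_seconds = time.perf_counter() - start

# Vectorized version: a single matrix-matrix product
start = time.perf_counter()
vectorized = W.dot(wordvectors_matrix)                    # C x N
vectorized_seconds = time.perf_counter() - start

# Both versions compute the same numbers
same = np.allclose(np.stack(looped, axis=1), vectorized)
print(loop_seconds, vectorized_seconds, same)
```

Timing a single run like this is noisy; for real measurements something like the timeit module with repeated runs is a safer bet.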
Um, okay.
Point three.
Um, okay, so we discussed this idea, um, last time,
and the time before, that after having the sort of affine layer,
where we, you know,
go from X to WX plus B.
That's referred to as an affine layer,
so we're doing this, um,
multiplying a vector by a matrix,
and adding, um, biases.
We necessarily, to have power in a deep network, um,
have to have some form of non-linearity.
And so, I just wanted to go through a bit of background
on non-linearities and what people use,
and what to use.
So, if you're sort of starting from the idea of what we know as logistic regression, um,
what's commonly referred to as the sigmoid curve,
or maybe more precisely the logistic,
um, function, is this picture here.
So it's something that squashes any real
number, positive or negative, into the range zero to one.
It gives you a probability output.
Um, this use of this, um,
logistic function was really really common in early neural nets.
If you go back to '80s, '90s neural nets,
there were, um, sigmoid functions absolutely everywhere.
Um, in more recent times,
90 percent of the time nobody uses
these, and they've been found to sort of actually work quite poorly.
The only place these are used is when you
actually want a value between zero and one as your output.
So we'll talk later about how you have gating in networks,
and so gating is a place where you want to have a probability between two things.
And then you will use one of those,
but you use them absolutely nowhere else.
Um, here is the tanh curve.
Um, so the formula for tanh, um,
looks like a scary thing with lots of exponentials in it,
and it doesn't really look much like a logistic curve whatsoever.
Um, but if you, um, dig up your math textbook you can convince yourself that
a tanh curve is actually exactly the same as
the logistic curve, apart from you multiply it by two,
so it has a range of two rather than one,
and you shift it down by one.
So, this is sort of just a re-scaled logistic.
It's now symmetric between one and minus one,
and the fact that it's symmetric in the output actually helps
a lot for putting into neural networks. Um.
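That relationship can be checked numerically. The standard identity is tanh(z) = 2 * logistic(2z) - 1, so besides multiplying by two and shifting down by one, the input is also scaled by two. A quick sketch in plain Python, with function names and test points chosen for the illustration:

```python
import math

def logistic(z):
    # The logistic (sigmoid) function, with outputs in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh_from_logistic(z):
    # Rescaled logistic: double the input, double the output, shift down by one
    return 2.0 * logistic(2.0 * z) - 1.0

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(math.tanh(z) - tanh_from_logistic(z)) for z in points)
```

The gap between the two is at the level of floating point rounding, so the tanh really is the logistic curve stretched to the range minus one to one.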
So, tanh's are still reasonably widely used
in quite a number of places, um, in, um, neural networks.
So, tanh should be a friend of yours and you should know about that.
But you know, one of the bad things about using,
um, transcendental functions like the sigmoid or tanh is,
you know, they involve these expensive math operations, um, that slow you down.
Like, it's sort of a nuisance to be kind
of computing exponentials and tanh's in your computer;
things are kind of slow.
So people started um playing around with ways
to make things faster and so someone came up with this idea like,
maybe we could come up with a hard tanh,
um where it's just sort of flat out here
and then it has a linear slope and then it's flat at the top.
You know, it sort of looks like a tanh but we just squared it off.
Um, and, well, this is really cheap to compute, right? You say,
if x is less than minus one, return minus one; if it's greater than one,
return plus one; or else just return the number.
No complex transcendentals.
The funny thing is,
it turns out that this actually works pretty well.
You might be scared and you might justifiably be
scared because if you start thinking about gradients,
once you're over here,
there's no gradient, right?
It's completely flat at zero.
So, things go dead as soon as they're at one of the ends.
So, it's sort of important to stay in this middle section at least for
a while and then its just got a slope of one, right?
It's a constant slope of one.
But this is enough of a non-linearity that actually it
works well in neural networks and you can train neural networks.
So, that sent the whole field in the opposite direction and people thought,
oh, if that works,
maybe we can make things even simpler.
And that led to the now famous what's referred to [inaudible] as ReLU.
So there is a mistake in my editing there,
delete off hard tanh.
That was in slides by mistake.
[LAUGHTER] The ReLU unit,
everyone calls it ReLU which stands for rectified linear unit.
So, the ReLU is essentially the simplest non-linearity you can have.
So the ReLU is zero,
slope zero as soon as you're in the negative regime and it's just a line slope one,
when you're in the positive regime.
I mean, when I first saw this,
I mean, it sort of blew my mind that it could possibly work.
Because it sort of, I guess,
I was brought up on these sort of tanh's and sigmoids and the sorts of these arguments
about the slope and you get these gradients and you can move around with the gradient.
And how is it meant to work if half of this function just says
output zero and no gradient and the other half is just this straight line.
And in particular, when you're in the positive regime,
this is just an identity function.
And, you know, I sort of argued before that if you just compose linear transforms,
you don't get any power, but provided we're in the right-hand part of the regime,
since this is an identity function,
that's exactly what we're doing.
We're just composing linear transforms.
So you sort of believe it just can't possibly
work, but it turns out that this works brilliantly.
And this is now by far
the default choice when people are building feed-forward deep networks.
That people use ReLU non-linearities and they are very fast,
they train very quickly and they perform very well.
And so, effectively, you know,
it is simply just that,
depending on the inputs,
each unit is just either dead or it's passing things on as an identity function.
But that's enough of a
non-linearity that you can do
arbitrary function approximation still with a deep learning network.
And people now make precisely the opposite argument, which is,
because this unit just has a slope of one over its non-zero range, that means
the gradient is passed back very efficiently to
the inputs and therefore the models train very efficiently, whereas,
when you are with these kinds of curves,
when you're over here, there's very little slope so your models might train very slowly.
Okay. So, you know,
for feed-forward network, try this before you try anything else.
But there's sort of then been a sub literature that says,
well, maybe that's too simple and we could do a bit better.
And so that led to the leaky ReLU which said,
"Maybe we should put a tiny bit of slope over here so it's not completely dead."
So you can make it something like
one one-hundredth of the slope of this part.
And then people said, well,
let's build off that,
maybe we could actually put another parameter into
our neural network and we could have a parametric ReLU.
So, there's some slope over here but we're also going to
backpropagate into our non-linearity which has this extra alpha parameter,
which is how ma- much slope it has.
And so, variously people have used these,
you can sort of find 10 papers on arXiv where people say,
you can get better results from using one or other of these.
You can also find papers where people said it made
no difference for them versus just using a ReLU.
So, I think basically,
you can start off with a ReLU and work from there.
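For reference, here is a plain-Python sketch of the cheap non-linearities just discussed, for scalar inputs, along with their slopes. The function names and the default slope of 0.01 are choices made for the sketch. The parametric ReLU is the same function as the leaky ReLU except that the slope, called alpha here, is a learned parameter, so it also needs a gradient with respect to alpha.

```python
def hard_tanh(x):
    # Flat at -1, slope one in the middle, flat at +1
    if x < -1.0:
        return -1.0
    if x > 1.0:
        return 1.0
    return x

def hard_tanh_grad(x):
    return 1.0 if -1.0 < x < 1.0 else 0.0

def relu(x):
    # Zero in the negative regime, identity in the positive regime
    return x if x > 0.0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0.0 else 0.0

def leaky_relu(x, slope=0.01):
    # A tiny slope on the negative side so the unit is never completely dead
    return x if x > 0.0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0.0 else slope

def prelu_alpha_grad(x):
    # Parametric ReLU: gradient of the output with respect to the learned slope
    return 0.0 if x > 0.0 else x
```

In the flat regions the gradient is exactly zero, which is the "dead" behaviour described above; the leaky and parametric variants trade that for a small nonzero slope.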
Yes. So, parameter initialization.
So, when we have these matrices and parameters in our model,
it's vital, vital, vital,
that you initialize those parameter weights with small random values.
This was precisely the lesson that
some people hadn't discovered when it came to final project time.
So I'll emphasize it is vital, vital.
So, if you just start off with the weights being zero,
you kind of have these complete symmetries,
right, that everything will be calculated the same,
everything will move the same and you're not actually training
this complex network with a lot of units that are specializing to learn different things.
So, somehow, you have to break the symmetry and we
do that by giving small random weights.
So, you know, there's sort of some fine points.
When you have biases,
you may as well just start them at zero,
as neutral, and let the system learn the bias that you want, et cetera.
But in general, the weights you want to initialize to small random values.
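You can see the symmetry problem in a toy example. What follows is only a sketch under assumptions made up for illustration: a tiny network with two inputs, two sigmoid hidden units and one linear output, a squared-error loss, three hand-picked training pairs, and a hypothetical helper called train. Starting from all zeros, the two hidden units get exactly the same updates at every step, so they never become different from each other.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(W, u, data, lr=0.5, steps=20):
    """Plain SGD on a 2-input, 2-hidden-unit, 1-output network.
    W[j] holds hidden unit j's input weights; u[j] is its output weight."""
    for _ in range(steps):
        for x, y in data:
            h = [sigmoid(sum(w * xi for w, xi in zip(W[j], x))) for j in range(2)]
            y_hat = sum(u[j] * h[j] for j in range(2))
            err = y_hat - y  # derivative of 0.5 * (y_hat - y)**2 w.r.t. y_hat
            for j in range(2):
                # Error signal at hidden unit j, computed before u[j] changes.
                delta = err * u[j] * h[j] * (1.0 - h[j])
                u[j] -= lr * err * h[j]
                for i in range(2):
                    W[j][i] -= lr * delta * x[i]
    return W, u

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]

# All-zero start: both hidden units stay identical throughout training.
W_zero, u_zero = train([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], data)

# Small random start: the symmetry is broken from the first step.
rng = random.Random(0)
W_init = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
u_init = [rng.uniform(-0.1, 0.1) for _ in range(2)]
W_rand, u_rand = train(W_init, u_init, data)

print(W_zero[0] == W_zero[1])  # the two hidden units are still copies of each other
print(W_rand[0] == W_rand[1])
```

With the zero start you have paid for two hidden units but effectively you only have one.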
You'll find in PyTorch or other deep learning packages,
a common initialization that's used and often recommended is this Xavier Initialization.
And so, the trick of this is that,
for a lot of models and a lot of places,
think of some of these things like these ones and these,
you'd like the values in the network to sort of stay small,
in this sort of middle range here.
And well, if you kind of have a matrix with big values in it
and you multiply a vector by this matrix,
you know, things might get bigger.
And then if you put it through another layer,
it'll get bigger again and then sort of everything
will be too big and you will have problems.
So, really, Xavier Initialization is seeking to avoid that by saying,
how many inputs are there to this node?
How many outputs are there?
We want to sort of tamp down the initialization based on the inputs
and the outputs because effectively we'll be using this number that many times.
It's a good thing to use, you can use that.
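As a rough sketch of what that looks like, here is the uniform version of Xavier (Glorot) initialization in plain Python, where the weights are drawn from a range whose size shrinks as the number of inputs plus outputs grows, giving a variance of 2 / (fan_in + fan_out). The function name and the 300-by-100 layer size are just assumptions for the example; in practice you would normally call your package's built-in initializer instead of writing your own.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    # Draw from U(-bound, bound) with bound = sqrt(6 / (fan_in + fan_out)),
    # which gives each weight a variance of 2 / (fan_in + fan_out).
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

# Hypothetical layer: 300 inputs feeding 100 outputs.
W = xavier_uniform(300, 100, random.Random(0))

values = [w for row in W for w in row]
mean = sum(values) / len(values)
variance = sum((w - mean) ** 2 for w in values) / len(values)
print(variance, 2.0 / (300 + 100))  # empirical variance vs. the target
```

The more inputs and outputs a layer has, the smaller the individual weights start out, which is the tamping down described above.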
Optimizers. Up till now,
we've just talked about plain SGD.
You know, normally plain SGD actually works just fine.
But often if you want to use just plain SGD,
you have to spend time tuning the learning rate,
that alpha that we multiplied the gradient by.
For complex nets and situations or to avoid worry,
there's sort of now this big family of more sophisticated adaptive optimizers.
And so, effectively they're scaling the parameter adjustment by accumulated gradients,
which has the effect that they learn per-parameter learning rates,
so that they can see which parameters would be useful to move
more and which ones less, depending on the sensitivity of those parameters.
So, where things are flat,
you can be trying to move quickly.
Where things are bouncing around a lot,
you are going to be trying to move just a little so as not to overshoot.
And so, there's a whole family of these; Adagrad,
RMSprop, Adam, there are actually other ones.
There's Adamax and a whole lot of them.
I mean, Adam is one fairly reliable one that many people use and that's not bad.
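To give a feel for what scaling by accumulated gradients means, here is a minimal sketch of an Adagrad-style update in plain Python. It is only the simplest member of the family, not Adam, and the function name, the learning rate of 0.1 and the two made-up gradients are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """One Adagrad-style update, in place. cache[i] accumulates the squared
    gradients seen so far for parameter i."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        # Dividing by the root of the accumulated squares gives each
        # parameter its own effective learning rate.
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)

params = [0.0, 0.0]
cache = [0.0, 0.0]
grads = [10.0, 0.1]  # one steep direction, one nearly flat direction

adagrad_step(params, grads, cache)
first_step = [abs(p) for p in params]

before = list(params)
adagrad_step(params, grads, cache)
second_step = [abs(p - b) for p, b in zip(params, before)]

print(first_step)   # both parameters move by about lr despite the 100x gradient gap
print(second_step)  # steps shrink as squared gradients accumulate
```

So the flat direction is not stuck taking tiny steps, and the steep direction is held back so it does not overshoot.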
And then one more slide and I'm done.
Yes, so learning rates.
So, normally you have to choose a learning rate.
So, one choice is just have a constant learning rate.
You pick a number, maybe 10 to the minus three, and say that's my learning rate.
You want your learning rate to be order-of-magnitude right.
If your learning rate is too big,
your model might diverge or not converge because it just sort of leaps you around by
huge movements and you completely miss the good parts of your function space.
If your learning rate is too small,
your model may not train by the assignment deadline and then you'll be unhappy.
So, you saw that, you know,
commonly people sort of try powers of 10 and see how it looks, right.
They might try, you know, 0.01, 0.001,
0.0001 and look at how the loss is declining and see what seems to work.
In general, you want to use
the fastest learning rate that isn't making things become unstable.
Commonly, you could get better results by decreasing the learning rate as you train.
So, sometimes people just do that by hand.
So, we use the term epoch for a full pass
through your training data and people might say,
halve the learning rate after every three epochs
as you train and that can work pretty well.
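That halving rule is easy to write down as a function of the epoch number. This is a small sketch in plain Python, where the name step_decay, the zero-based epoch counting and the starting rate of 0.01 are assumptions for the example.

```python
def step_decay(initial_lr, epoch, factor=0.5, every=3):
    # Multiply the learning rate by `factor` once per completed block of
    # `every` epochs; epochs are counted from zero.
    return initial_lr * factor ** (epoch // every)

# Learning rates for the first nine epochs, starting from 0.01.
schedule = [step_decay(0.01, epoch) for epoch in range(9)]
print(schedule)
```

Epochs 0 to 2 use the starting rate, epochs 3 to 5 use half of it, epochs 6 to 8 a quarter, and so on.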
You can use formulas to get per-epoch learning rates.
There are even fancier methods.
You can look up cyclic learning rates online if you want,
which sort of actually make the learning rates
sometimes bigger and then sometimes smaller,
and people have found that that can be useful for getting you out
of bad regions in interesting ways.
The one other thing to know is,
if you're using one of the fancier optimizers,
they still ask you for a learning rate but that learning rate is
the initial learning rate which typically the optimizer will shrink as you train.
So, commonly if you're using something like Adam,
you might be starting off by saying the learning rate is 0.1,
sort of a bigger number, and it will be shrinking it later as the training goes along.
Okay, all done. See you next week.
