This lecture is on
optimization for machine
learning.
Specifically, we're going to be
looking more at methods that
work well for neural networks.
So, the talk will be divided into
five sections.
First, I'm going to give a quick
intro and motivation for
optimization methods in machine
learning.
Then, I'm going to talk about basic methods like gradient descent, then momentum methods, then second-order methods, and finally we'll talk about stochastic optimization, which is applicable to all the previous methods that we're going to discuss. So, optimization
algorithms are the basic engine
behind deep learning methods and
machine learning methods in
general, and they enable models
to learn from data by adapting
their parameters.
So, what they do is solve the problem of minimising an objective function that measures the mistakes that the model makes.
In classification problems, for example, this is the prediction error, where you're comparing your predictions of, say, the label of an image to the actual values. In
reinforcement learning,
this could be negative reward
where reward
essentially measures how well
you're doing at the particular
task. And optimization methods
work by making a sequence of
small incremental changes to the
model parameters.
And each step is more or less guaranteed to reduce the objective by some small amount; accumulate enough of these steps in sequence and you eventually solve the problem.
So, just a little bit of notation
upfront. Throughout the lecture,
I'm going to refer to the parameters
always as theta, denoted here.
These will be assumed to lie in R^n, the n-dimensional real space; in neural networks, n can sometimes be in the hundreds of millions.
The objective function will be
denoted by h,
so it's a function of the
parameters. And, you can see over
here,
I've drawn an example objective
function, so up is the value of
the objective function, along the
side is the value of the
parameter. Of course,
if I'm drawing in this
two-dimensional plane, I can only
have a one-dimensional parameter.
Of course, in general, this
picture is much more interesting
looking in higher dimensions, but
it's useful for illustrative
purposes just to consider the
low-dimensional cases.
Here's our optimum.
Here's maybe where we start.
And, yes, so the goal of
optimization is to minimise
functions of this form,
essentially by moving this
magenta point down towards the
optimum as best we can.
So, the most important example of
this, which arises in machine
learning, are objectives of this
form.
So, you have a- essentially what
is a sum over examples,
i is going to index the example,
we'll have m of them.
So, this is an average over those
examples, and it's for measuring
the loss between a target y and
the prediction made by the
network.
So, f here is going to denote the
neural network function, which
takes input x as well as the
parameters, outputs some
prediction, y being the true
value for the prediction that we
want.
L is now a loss function, which
measures this level of
disagreement between our
prediction and the correct
target.
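Written out (reconstructing the slide's formula from the definitions just given):

$$h(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\big(y_i, f(x_i, \theta)\big)$$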
So, this is just illustrated here
again. This f is a neural network,
of course it doesn't have to be, and
for the purposes of this
lecture,
f could be any sort of
prediction model. So, next, we'll talk about the basic method in optimization, the one you always have to start with: gradient descent.
So, the gradient descent method is as simple as it gets. So,
you'll see throughout this
lecture a bunch of equations of
this form.
This is showing an update rule
for our parameter.
So, every step we advance to the
next best guess for the optimum
value of the parameters, and we do
this by taking the current value
and then subtracting off,
in this case, a multiple of the
gradient to obtain the new value.
That multiple here we denote by alpha; it's also sometimes called the learning rate or the step size. And the gradient, denoted here, is just the standard gradient, where you take the partial derivatives with respect to each element of the parameter vector. So, what's
gradient descent doing, why
should this method make any
sense?
Well, one way to think of the
gradient is that it's giving you
the direction of steepest
descent in the loss surface.
So, you know, in this picture here,
again one dimension,
well, there really is only one
direction you can vary,
you can go sort of left or
right.
This is the direction that goes
downhill, so this is where the
negative gradient would be
pointing, in that direction.
We follow the negative gradient. So, this is giving us the direction that reduces the objective value the fastest, if you go in that direction.
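To make this concrete, here's a minimal sketch of the update in code (NumPy; the quadratic example objective and all names here are mine, for illustration):

```python
import numpy as np

def gradient_descent(grad_h, theta0, alpha=0.1, num_steps=100):
    """Minimise h by repeatedly stepping against its gradient:
    theta_{k+1} = theta_k - alpha * grad h(theta_k)."""
    theta = theta0.copy()
    for _ in range(num_steps):
        theta = theta - alpha * grad_h(theta)
    return theta

# Illustrative example: h(theta) = 0.5 * theta^T A theta, so grad h(theta) = A theta.
A = np.diag([100.0, 1.0])  # badly conditioned: curvature 100 in one direction, 1 in the other
theta = gradient_descent(lambda th: A @ th, np.array([1.0, 1.0]), alpha=0.01)
```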
Now, it's not obvious that doing
this should make any sense.
And, indeed, if the picture looks
more like this, and I
were to start here and take a
step that way, I might not
actually go down.
I mean, I might start to go down,
but then I might go- start
going up because the objective
function curves upwards like
this, it curves upwards very
fast.
So, this is bad.
Of course, I could remedy this by
setting the learning rate lower.
That is the multiple that I take
for the gradient, but that might
slow me down.
Now, this scheme is only going to work if the function is ultimately smooth at some level - you know, when you zoom in far enough - or at least continuous.
So, another way of thinking of
gradient descent is that - and this is
the way I prefer to think about
it because it sort of leads into
the other optimization methods
quite elegantly - is that we
start-
So, at any given point, let's just
say this is where we are
currently, we're going to propose
an approximation to the
objective function. For gradient
descent,
this is a linear approximation,
denoted by this line here,
that orange dotted line, and here
is this written in an equation.
This is just the first-order
Taylor series at the current
theta.
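In symbols, that's (a standard form, reconstructed from the description):

$$h(\theta + d) \approx h(\theta) + \nabla h(\theta)^\top d$$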
And d is the direction that
we're going to vary, and we're
sort of modelling how h,
the objective function, changes as we
vary d, according to a linear
model. Okay, so this is a line. Now, for small
enough d,
this line matches the objective
function pretty well,
I mean, right around here it's
almost a perfect match and, as
you start to move away, as your d
gets bigger, the match starts to
degrade a bit.
But so, if we were to minimise
this model over a reasonable
region around here, that would
give us a d which is this
negative gradient step times some
alpha, and the alpha would be
determined by the region in
which we're allowed to move,
essentially.
So, gradient descent is sort of
the canonical algorithm but it
has obvious problems, problems
that are sort of hard to
illustrate in one dimension,
but if you even just go to two
dimensions, it already becomes
apparent what's going on.
So, here- and it's important to
understand this picture
because I'm going to keep using it
throughout the lecture.
This- what I've drawn here
essentially is a sort of a two-
dimensional narrow valley. So, you
can think of this as a three-
dimensional object where up
means, you know, the value of the
objective function,
so how high are you, say, in a two-
dimensional terrain. You can
think of this as maybe like a
hill or something, and across,
you know, so North, South, East, West, is
going to be the values of the
parameter, which is appearing here
to be two dimensional.
And so I've drawn this valley
essentially where, along the base
of the valley, there's this sort
of direction that we can move
towards the minimum of the
objective function. But we sort
of start out here at the side of
the valley, and we're going to-
if we're going to be running gradient
descent with, say, a high learning
rate, what's going to happen is
the gradient is going to point
this way, right off the hill,
because that's the direction
that goes down the fastest, and
that's going to sort of bounce
to either side of the
valley. Then we hit that wall, and now we have the gradient pointing, once again, in the steepest direction, which is now going to be the opposite way because, of course, the function goes down this way. So, we'll be bouncing back here.
And, if the learning rate is too
high, these steps are going to
oscillate wildly, and eventually
you're going to diverge.
So, this is the bad situation, or
at least one bad situation for
gradient descent.
Now, you could try to remedy this
by lowering the learning rate, so
we're going to take these steps
but they're going to be much
smaller, and we're going to
bounce back and forth as we were
before, but it's not going to get
out of control.
The only problem now is we're
going to be taking these very,
very small steps.
And, once you get to the base of
this valley structure and you
still want to move along the
base of it relatively fast, well, because your learning rate is small, you're not going to be able to do that anymore.
So, there's no good choice in
this situation for gradient
descent.
I can make the learning rate high, and I diverge. I make it low, and I converge too slowly along those low-curvature directions.
So, this can be described in
theoretical terms, which I'll get
into now.
So, a couple of common technical
assumptions that we're going to
use throughout the lecture, which
have intuitive meanings, which
I'll talk about, are as follows.
The first is that we're going to assume that h has Lipschitz continuous derivatives, AKA is Lipschitz smooth.
Now, what does this mean?
Basically, it means that the
gradient doesn't change too much
as we change the parameters.
So, a small change in the
parameter translates to a small
change in the gradient or, in
other words, the function doesn't
have too much curvature.
So, this L coefficient, which tells us the relationship, can be thought of as an upper bound on the curvature of the objective function. And we'll have a corresponding condition called strong convexity. Of course, we don't necessarily need to assume this globally; it's good enough if it holds sort of locally, around the area to which we're converging. That's how it can apply, say, to neural net objective functions, which aren't convex in general. This condition is given here.
Essentially, it's saying that the
function curves at least as
much as this quadratic term,
which has a curvature of mu.
So, this now gives us a lower
bound on the curvature of the
objective function.
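Written out (standard forms of these two conditions; the slide itself isn't reproduced here):

$$\|\nabla h(\theta) - \nabla h(\theta')\| \le L\,\|\theta - \theta'\| \quad \text{(Lipschitz smoothness)}$$

$$h(\theta') \ge h(\theta) + \nabla h(\theta)^\top (\theta' - \theta) + \frac{\mu}{2}\,\|\theta' - \theta\|^2 \quad \text{(strong convexity)}$$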
We're also going to assume for
now that gradients are not
stochastic.
I can get into stochastic convergence theory later, but for now it's simpler to think of the gradients as computed deterministically, that is, without approximation.
So, if you have those two
conditions, then this is the
basic bound that you get with
gradient descent.
So, just to decode this a bit,
this part here is the
difference in the objective
function value between where I am
after k steps and where the
optimum is.
And it goes down according to
this function.
Now, this function depends on
k
up here.
So, as k gets bigger, this exponent gets bigger. But this quantity is less than 1, so, as the exponent grows, the whole thing gets smaller, and it gets smaller at an exponential rate.
And there's some dependency on how far away you are from the initial point, over here.
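A standard form of this bound (the slide's exact expression isn't reproduced here, but it looks roughly like this) is:

$$h(\theta_k) - h(\theta^*) \le \left(1 - \frac{1}{\kappa}\right)^{k} \big(h(\theta_0) - h(\theta^*)\big)$$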
So, the key quantity here is
kappa,
and that will determine sort of
how much bigger than- or, sorry,
how much less than one this
quantity actually is.
We prefer smaller values of
kappa. Kappa is the ratio of the
highest curvature to the lowest
curvature.
And so this is what is often
called the condition number, although
we're taking a global definition
of condition number, so condition
number is usually describing a
property of a matrix but you can
also apply it to an overall
optimization problem, which is
sort of the biggest eigenvalue
globally for the Hessian
everywhere
divided by the smallest one.
And so perhaps it's more useful
to think of what this bound says in
terms of the number of
iterations that we need to obtain a particular optimality gap, that is, to get an error between the optimal value and the current value of no more than epsilon.
And that's given by this
expression.
So, k has to be roughly kappa, the condition number, times log of 1 over the error we'd like to achieve.
So, just some words about bounds:
Are they useful in practice? Kind
of. There's a lot of issues with
them, though, so often they're too
pessimistic.
You know, when you prove these things, you have to consider all examples, including worst-case examples, and of course real problems aren't necessarily worst case.
Sometimes they make assumptions
that are too strong,
say, for example, convexity.
Now, we don't really need
convexity to prove these kinds
of things often, at least if
you're looking at this sort of
asymptotic behaviour, but it's
convenient, and oftentimes you
can sort of describe what you're
doing in optimization theory as-
you know, you assume that you're
starting already close to the
minimum, close enough that the
function looks convex.
Oftentimes these theorems make
assumptions that are too weak, in
so far as they don't use enough
structure from the real problem.
Real problems have all sorts of
interesting structure, which are
not captured necessarily by, you
know, condition numbers or
Lipschitz- you know, L bounds, things
like this and, you know, these
bounds often rely on these
crude measures, such as condition
numbers, which are only sort of a
vague description of what the
function is actually doing.
Finally, and this is perhaps the most important point, these bounds are usually focused on asymptotic performance, so they don't give you a reasonable idea of how fast you're converging, let's say, long before k, the iteration number, gets very large.
And, in practice, we often stop
our optimisers before they
converge,
so you actually do care how
quickly you're getting to that
point
preasymptotically.
So, I would say, in practice, your
choice of optimiser should be,
you know, first and foremost
informed by experience.
Try different things but, at the
same time, you know, you can use
theory to tell you- or, at least, help
guide your choices in terms of
different optimisers.
So, theory can develop intuitions, which then translate into the real world. And if you're sort of nuanced enough, that works out, although not always. Be careful about anybody who says anything is optimal, by the way.
So, next, I'll talk about the
momentum methods.
This is sort of the easiest
modification of gradient descent
that we can make to have it
perform better.
So, the basic motivation is, you
know, as we saw in the example of
the valley, the two-
dimensional valley, gradient
descent can flip back and forth
when it's- when the
learning rate is large.
And, of course, if the learning
rate is small you have the
opposite problem that you don't
move along these low curvature
directions, say, the base of the
valley, you don't move along
there fast enough.
So, the key idea with momentum is
that we're going to try to
accelerate along these
directions of low curvature,
let's say at the base of the
valley, and we're going to do
this essentially following a
physics-like equation of how
momentum works in real
physical situations.
So, in particular, you can think
of the parameters like a ball
rolling along the surface.
The surface is defined by the
objective function.
And this ball is subject to the
force of gravity.
So, as it starts to roll along a
direction, let's say, the base of
this valley here, it accumulates
momentum.
This is actually the best
illustration that I could find
on the Internet,
unfortunately, because you can
find a lot of videos of balls
rolling down hills.
But what's important here is
that it's hitting the sides of
this valley, right.
It goes up and then comes down,
and it's that suppression of
oscillation that is really the
important thing that's
happening in momentum, right.
Anybody can just move faster, but if moving faster causes you to go off the side of this hill, you're done, essentially.
But here the ball is
rolling in such a way that it's
not going over that and it's
staying along this valley like
it should.
But, at the same time, it's
picking up speed as it goes
down.
So, that's the idea that
we're going to try to exploit
with momentum.
So, here are the equations for
momentum.
The basic version is that we have a velocity vector v, and we're just going to keep adding the gradient to it, while also decaying our current velocity by some factor - that is, we multiply the current velocity by some factor each step. You can think of this as friction in a physical system: the velocity is not preserved perfectly, it sort of decays over time, but we're always adding this force to it. And then, once we have our velocity, we just add it to the parameters, times some learning rate as before.
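As a minimal sketch in code (one common convention; names are illustrative, and the slides may fold the learning rate in differently):

```python
import numpy as np

def momentum_descent(grad_h, theta0, alpha=0.01, beta=0.9, num_steps=100):
    """Gradient descent with momentum: decay the velocity by the friction
    factor beta, add the (negative) gradient force, then step along it."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        v = beta * v - grad_h(theta)
        theta = theta + alpha * v
    return theta
```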
There's also a different version of this - Nesterov's variant - which I won't really get into, but it's a slight tweak of the basic version that's more useful in theory and sometimes in practice. So, returning to this 2D
valley example again: this is the situation for gradient descent, where you can bounce along the sides but eventually you bounce too much and diverge, or you take a learning rate that's too slow - sorry, too low.
And now we're not- we don't
oscillate out of control, but we
move too slowly.
You can think of this as, say, in
that video, the ball, if it never
picked up any momentum, it would
just sort of crawl along, you
know, after somebody pushed it, it
would never actually get any
faster.
What momentum allows you to do is this: you hit the other side, but this vector here is immediately remembered in terms of velocity. So, the ball has velocity going this way, and then it hits this side, and there's something pushing it in the other direction, but that cancels out its initial velocity.
So, actually it ends up just
going straight, and it comes down
here, and then it starts to roll
back this way.
But, again, it's cancelled by the
gradient which is pointing that
way.
So, it's sort of never able to oscillate out of control.
Meanwhile, there is one direction
which is always pointing
downhill and that's this one.
And so velocity keeps accumulating in that direction, and therefore we get to our goal in fewer steps.
So, we can justify this stuff
theoretically as well.
So, given an objective function
satisfying the same technical
conditions as before, that is, we
have this upper bound and lower
bound on curvature across the
objective function, we have this
bound. This is very similar to
the one before except now this
term here depends on the square
root of kappa, not just kappa
itself, but otherwise it's
identical to what we had before.
And what you essentially get out
of this is that the number of
iterations needed to achieve a
certain error is roughly this
expression where we have log 1
over epsilon as before, but
now there's a square root in
front of the kappa.
So, we've improved our dependency on this kappa term by lowering it because, of course, this is a quantity that's greater than one.
And so now we've got this better dependence. So, for problems where kappa is large, using momentum is going to make more of a difference. So, we can then
ask: Is this as good as we can
do?
We've added velocity to
gradient descent.
You know, maybe we could add some sort of acceleration term that also preserves some higher-order effects.
Well, it turns out, in some
technical sense, this is the best
we can do.
So, I'll first start off by
defining first-order method
technically. This is a term that
gets thrown around a lot.
But there is a technical
definition, and this is the one
that was presumably proposed by
Nesterov, although I'm pretty sure
it goes back much further than
that.
And it's essentially that the difference between parameters at consecutive iterations - in other words, these steps d that we take - is contained in the span of all previous gradients that we've seen. Span just means the set of all linear combinations of these things. So, in other words, we've added various multiples of previous gradients to each other to arrive at where we currently are.
That restricts you to a certain
class of algorithms.
But this is actually a pretty interesting class: it includes, for example, gradient descent, and momentum methods as well,
also conjugate gradient methods,
which are typically only applied
to quadratic problems, although
there are sort of nonlinear
versions that also fall into
this category.
So, what in particular is not
included in this definition are
preconditioned or second-order
methods, which we're going to
discuss later.
So, given this definition of first-order method, we can now ask: how well can you do with first-order methods?
And it turns out that we
actually now have a matching
lower bound.
So, we had an upper bound
from before on performance,
but now we have this lower bound
which says that you cannot
converge faster than this.
And this looks a lot like the
term that we had before.
And, in particular, it requires
that the number of iterations to
converge is of this form, and
this is the same as the upper
bound that we had from momentum.
So, in some sense, momentum
methods are quote optimal.
Of course, this is only worst-
case optimal and all the caveats
that I gave before do apply.
But, you know, you at least know that in the worst case there is no major algorithmic improvement that you can make, at least if you keep yourself inside the class of first-order methods.
Yeah.
So, just getting back to the- our
bounds so far, we have this worst-
case lower bound for first-order
methods.
This is the bound that we got for
gradient descent.
Here we have the kappa term
without the square root.
So, this is the worst we can get. In practice, gradient descent does attain this, even though it's an upper bound - it's actually a fairly tight upper bound for gradient descent. And
the upper bound for gradient
descent with Nesterov's momentum
is this.
So, it matches the lower bound.
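To summarise the three bounds in one place (standard forms, reconstructed since the slide isn't reproduced here), the number of iterations k needed to reach error epsilon is roughly:

$$k = O\big(\kappa \log(1/\epsilon)\big) \quad \text{(gradient descent, upper bound)}$$

$$k = \Omega\big(\sqrt{\kappa}\,\log(1/\epsilon)\big) \quad \text{(any first-order method, lower bound)}$$

$$k = O\big(\sqrt{\kappa}\,\log(1/\epsilon)\big) \quad \text{(gradient descent with Nesterov's momentum)}$$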
Okay.
So, that's first-order methods.
And next I'll get into second-
order methods.
So.
So, how are we doing for time? Oh, we've got plenty of time.
So, second-order methods are sort
of the next step in optimization
methods
beyond first-order methods.
The big problem that we had with
first-order methods was this
dependency on this sort of
global condition number kappa.
And, in particular, the number of iterations that we needed scaled as some function of that.
So, kappa is the ratio of the
maximum curvature again over the
minimum curvature, and, for
certain problems,
this is going to be very big.
Say, for example, certain kinds of
deep architectures, although
surprisingly for networks like
ResNets,
actually this number is really
not that bad at all.
And that's why, for example, you
see people use regular gradient
descent in ResNets with a lot
of success.
But there are certain kinds of
models,
for example, models that I've
been exploring more recently.
People at DeepMind have gotten into certain physics-inspired neural networks that are harder to optimize and, classically, we had a bunch more of these hard-to-optimize networks around before everybody started using ResNets.
So, in practice, this might matter,
but it might not.
And it really depends on the
problem.
But we'd like our optimization
methods to be as robust as
possible, not to break down if
our problems become too hard in
some sense.
So, it's worth trying to improve the situation, and second-order methods do allow us to do this.
They allow us to improve or
sometimes even eliminate the
dependency on kappa, and we
get similar bounds
but now the kappa term has
vanished.
So, the basic idea with second-
order methods is to essentially
return to the approximation idea
that we had before for first-
order methods.
So, we were going to locally
approximate our objective
function by a simpler function.
Now, before, we had a linear
function, which was a straight
line.
Now we're going to replace that
with a quadratic function.
And the easiest way to do that,
although not necessarily the
best way, is to take the second-
order Taylor series around the
current point. So, the second-under
Taylor series is, locally anyway,
the most accurate quadratic
approximation you can make to
the function.
And, if you were to minimise this
approximation to the objective
function with respect to d, you
get an update of this form, which
is the negative of the Hessian
inverse times the gradient.
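In symbols (a standard form, reconstructed from the description, with H denoting the Hessian):

$$h(\theta + d) \approx h(\theta) + \nabla h(\theta)^\top d + \frac{1}{2} d^\top H(\theta)\, d, \qquad d^* = -H(\theta)^{-1} \nabla h(\theta)$$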
So, it requires you to compute
this Hessian matrix and take its
inverse and multiply that by the
gradient. And then the basic
update iteration is the same as
before for gradient descent.
You can also augment this type
of equation with momentum as
well.
And that can sometimes give you
an additional boost.
It really depends, in some sense, on how well you're approximating the curvature matrix.
And if you're, you know, if you're
doing a perfect second-order
method, in some sense momentum is
not going to help you because
you've already eliminated the
dependence on the condition number.
But if you've only improved it a
bit, you could still get an
additional boost from momentum.
But we're just going to assume
that we're not using momentum
for the purposes at least of the
theoretical discussion of second-
order methods, although in
practice people use momentum
quite a bit with second-order
methods.
So, now we can return once again
to this example.
So, here we had gradient descent.
I've just shown the picture for
the small learning rate, which is
the one that doesn't diverge.
Momentum was able to, you know,
help us get around this
oscillation issue without
sacrificing our ability to move
fast along the base of the
valley.
Second-order methods are quite
elegant, in the sense that they
just model this curvature.
So, a second-order method
actually sees that both sides of
the valley curve upwards like
this.
It models that, and then it just goes straight to the bottom. And then, once it gets here, it sees: oh, this is actually a very, very smooth pathway - in other words, it's quite flat and it's going downhill at a reasonable rate - so I can just instantly accelerate in that direction.
Right.
So, another way to think about
the relationship between
gradient descent and second-
order methods is to think of
gradient descent as a kind of
primitive second-order method.
So, when you're doing gradient descent, the maximum allowable learning rate is, up to a fudge constant, 1 over L, where L quantifies the maximum curvature like before.
And so, given this learning rate,
which is the maximum one that we
can tolerate, you can think then
of gradient descent as like a
second-order method where we
start out by proposing to use
the second-order Taylor series
approximation for the objective
function and then minimising
over that.
But then we decide that we don't like this Hessian term, since it's too complicated, and we substitute it with L times the identity matrix.
So, when you do that substitution,
you're essentially replacing the
Hessian with a term that says
the curvature is maximum
everywhere as opposed to trying
to distinguish between
directions that have high
curvature,
let's say the sides of the
valley, and the directions that have
low curvature,
in other words, the base of the
valley.
So, all directions are
treated as having this maximum
curvature.
And when you do that, well, you
don't- then you don't get to see
that the base of the valley is
actually quite smooth and a
good direction to accelerate in.
You just have to move slowly
in all directions.
And so L times the identity is, in some sense, just too pessimistic.
And it's too crude of an approximation for this sort of altered second-order method to perform well.
So, second-order methods sound great,
but there's a lot of catches. And
the first one, which is actually
pretty easy to handle but is
very important and often
overlooked, almost criminally so,
is that this idea of using
approximations,
well, it has the same problems
that gradient descent has but in
some sense you need more
machinery to deal with them
because the approximations are,
in some sense, pushing you
further.
So, what do I mean by that?
Well, if you think of,
again, this- think of this example
here where the purple line
is the true objective function
and, over here, we might take the
second-order Taylor series.
And this is a good approximation
of the function locally.
But a second-order approximation
could go off like that.
It could be wildly inaccurate as
we move away from our
current point.
So, you know, because gradient descent in some sense is assuming the maximum curvature, it can never actually go too wrong when it comes to the minimisation of its implicit second-order approximation.
But if you're using the Hessian,
you can go wrong because a
direction that might start out
as having low curvature, if you
go too far in that direction, it
might start to curve upwards abruptly because, again, your loss surface is not perfectly quadratic, so this kind of thing can happen.
Say, for example, in the valley: close to the bottom of the hill, you could suddenly have this ramp that goes up, and you don't see that locally until you get close enough.
So, we can't move too fast with
second-order methods.
And the key idea then is to
restrict our updates into a
region around the current point.
But, unlike first-order methods,
you know, actually doing-
implementing this is going to be
a bit trickier in practice.
So, how does this look?
Well, you can start out by
defining again a minimisation
problem over the quadratic
approximation but restricted to
some region.
And it's usually convenient to
take a region which is
essentially a ball around zero
for our- for our update
vector d of radius, say, lower-case r.
Now, it turns out that, in many situations - there are sort of technical conditions that have to be observed - this problem becomes equivalent to one where we're just minimising globally over a new quadratic, but where we've added some multiple of the identity to the curvature matrix, or the Hessian, here.
And so, of course, we know how to
solve this problem.
That's just the inverse of the
matrix times the vector
times negative one.
So, okay, that's true for some
lambda. Now, actually working out
lambda can be tricky, but we
don't really need to do that in
practice.
We don't have to worry about r and its relationship with lambda, or about computing, at each step, the lambda for a given r. We can just work with lambda directly. We can just say: I'm
adding this value lambda, maybe
it's too big, maybe it's too
small, and there are ways that
you can adjust this in practice,
various heuristics, inspired by algorithmic work, that people often use. I, for example, use a Levenberg-Marquardt style method, which allows you to sort of adjust lambda on the fly.
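Here's a minimal sketch of such a damped update with an on-the-fly lambda adjustment (a Levenberg-Marquardt style heuristic; the thresholds and exact rule are illustrative, not the ones from the slides):

```python
import numpy as np

def damped_newton_step(grad, H, lam):
    """Minimise the damped quadratic model: solve (H + lam*I) d = -grad."""
    return np.linalg.solve(H + lam * np.eye(len(grad)), -grad)

def adjust_lambda(actual_decrease, model_decrease, lam, boost=1.5, drop=0.9):
    """Compare the actual decrease in h to the decrease predicted by the
    quadratic model; shrink lam when the model is trustworthy, grow it when not."""
    rho = actual_decrease / max(model_decrease, 1e-12)
    if rho > 0.75:
        return lam * drop
    if rho < 0.25:
        return lam * boost
    return lam
```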
So, there's another thing
about second-order methods, which
is sort of important to talk
about, which is that the Hessian
might not actually be the best
matrix.
And this, I think, a lot of people
find really counterintuitive.
And this comes up a lot in
neural network research where
nobody uses the Hessian. Even if
you could, even if somebody gave
you an oracle to compute the
inverse Hessian, you wouldn't
necessarily want to use it.
And it's kind of hard to
understand that.
But I think it's worth thinking
about what makes a good
quadratic approximation to the
objective function.
Right.
I mean, the Taylor series, the
second-order Taylor series is
locally optimal,
right, in the sense that it gives
the most faithful approximation
of the loss surface in a sort of
a small vicinity of the current
point. But maybe that's not what
you want.
So, say, for example, you might
want an approximation that gives
you sort of a more global view
of the objective function.
So, again, here purple is the
objective function.
This might be our second-order
Taylor series approximation, but
there is a different
approximation which kind of
gives you a better global view,
and if I was to minimise this
approximation, I'd actually be
doing much better.
Even though it's not necessarily
that accurate out here, it's
still bringing me to the sort of
the right rough area.
So, in some sense, it's capturing
more of a global structure in
the objective function.
I might also want my
approximation to be more
conservative.
So, say, for example, here - this is the same example as on the previous slide, where we were talking about trust regions.
Now, orange being our Taylor
series approximation, but this
green one here, if we were to
minimise this one, we'd get over
here. If we were to minimise the orange one,
well, we'd get out here
somewhere.
Of course, the objective function
is curved up long before that
happens.
So, if you were to move out here,
you'd essentially- you'd be, you
know, the objective function
value's now shot up to infinity
or something, and that's no good.
So, there are definitely
situations where you might want
to use a different quadratic
approximation to the objective
other than the second-order
Taylor series.
And we find this in
practice.
So, the most important family of examples that I and others have found for neural networks are the generalised Gauss-Newton matrix; the Fisher information matrix, which is often related to the first one - they are in fact often equivalent for certain kinds of losses; and also the empirical Fisher, which is kind of a weird approximation of the Fisher information matrix.
It's cheap to compute but
mathematically it's a bit
dubious.
So, some nice properties of these
particular matrices versus, say,
the normal Hessian,
well, first, they're always
positive semi-definite,
so there's no negative curvature
in the matrix itself.
Now that's good because, of
course, if you have a negative
curvature and a quadratic
approximation, that's kind of
telling you that you can go
infinitely far.
Of course, if you restrict yourself to a trust region, you kind of solve that problem,
but it is nice to have an
approximation which just, even
without the application of trust
regions, gives you a minimisation
problem that actually has a
reasonable minimum.
And then also you get some-
you open yourself up to a wider
class of theoretical results if
you can assume that your matrix
that you're multiplying the
gradient by is positive semi-
definite.
Another interesting fact is that, if you take small enough steps - as you make the learning rate small enough - you are invariant to any reparameterisation of the objective function. At least the first two of these matrices - actually, all three - have that property.
So, many people will know that
Newton's method is invariant to
linear reparameterisations of the
problem, but these- methods
based on these matrices actually
are invariant to any smooth
reparameterisation,
if you take small enough steps.
And that's just not true of the-
well, it really depends on how
small you mean,
but that happens much faster
anyway for these methods.
And, finally - and this is just an empirical fact - these matrices work better in practice for neural nets, and there isn't a totally comprehensive understanding as to why that's true.
I like to think that some of the
intuitions, given the previous
slide and these observations, are
important, but nobody has a fully
comprehensive story yet about
this.
So, I've gone over sort of the
common problems with second-
order methods and ways you can
change them, new matrices you can
use, but there is a, you know, a
huge elephant in the room with
second-order methods, which is
just that these matrices, the
Hessian or one of these
alternatives that we might want
to compute are huge.
So, for neural networks, for
example, you know, the- the
dimension, I should say, of
the parameters can be in the
tens of millions.
So, that means now that we have a
10 million by 10 million matrix,
say, that we have to compute, we
have to store it, and then we
have to actually invert it, and
that's just totally out of
the question as n gets into
those ranges.
So, the common solution, and the one that we're going to talk about in this lecture anyway, is approximation of the matrix itself. Although I should point out that there is a different class of methods which, instead of approximating the matrix, just approximate the problem of minimising the quadratic.
So they don't perform an exact
minimisation and therefore they
don't need to compute an
inverse.
But those methods have sort of
become less popular in recent
times.
And so approximating the matrix
is sort of the easiest and most
effective thing you can do.
So, the first approximations I'm going to talk about are diagonal approximations, and these are the absolute simplest things.
What you do is you just take the matrix that you have and zero out all the off-diagonal entries. So, inversion and storage are super easy, right? Because now you just have n entries, and to invert a diagonal matrix you just take the reciprocal of each entry.
So that's trivial. Computing these matrices is actually slightly non-trivial, but it really depends.
Now, getting back to the different choices - the Hessian, the Gauss-Newton, the Fisher - depending on which one you choose, there can be different computational costs associated.
Although I should say that, for any of those choices, there are good approximation algorithms that will get you the diagonal, just not exactly. For the empirical Fisher, though, it actually is quite cheap to get it exactly.
Now, of course, the obvious problem with this method is that it's a very primitive approximation, and it's really not going to give you anything unless there are obvious, sort of axis-aligned, scaling issues.
And what do I mean by that? Well, if you think of the 2D valley example again: suppose the high-curvature direction - the one that hits either side of the valley - is one parameter, and the other parameter moves exactly along the base of the valley. Well, that would be the perfect situation for diagonal methods.
In fact, the true curvature is diagonal in that situation. But in general you don't have that: different directions of curvature - the eigenvectors of the Hessian, or whatever matrix you happen to be using - are not going to be aligned with the coordinate axes, and so the matrix itself is not going to be diagonal. The consequences of that can be severe, and can sort of erase any advantage you might get from using second-order methods.
Nonetheless, they are pretty popular. So, if you take the square root of the diagonal empirical Fisher - which is a slight fudge to the algorithm; I view it as a way of compensating for the crappiness of the diagonal approximation, and therefore sort of hedging your bets by being more conservative - you get RMSprop or Adam, which are actually quite popular optimization algorithms to use, at least for neural nets.
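A minimal sketch of that square-rooted-diagonal idea, RMSprop-style (illustrative constants; Adam additionally keeps a decayed average of the gradients themselves):

```python
import numpy as np

def rmsprop_step(theta, grad, s, alpha=1e-3, decay=0.99, eps=1e-8):
    """Keep a decayed average s of squared gradients (a diagonal,
    empirical-Fisher-like quantity) and divide by its square root."""
    s = decay * s + (1 - decay) * grad**2
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s
```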
Now, one step above diagonal methods are block-diagonal methods. With a block-diagonal method, instead of zeroing out all the off-diagonal entries, we're going to organise our matrix - our parameters, I should say - into sort of groups. Each group is then represented by a full matrix, but relationships between different groups are not modelled in our matrix, and so we zero out all those entries.
So in the neural net blocks
could be say for example all the
weights associated with one
particular layer or one
particular neuron and those will
give rise to different block
diagonal approximation schemes.
So these are still fairly cheap
depending on how big your block
is.
The storage cost: b here is the block size, and I'll assume, just for simplicity, that all blocks are the same size. This gives you a storage cost which is b times n.
You've only increased your storage cost over diagonal methods by a factor of b. The inversion cost is b squared times n, so that's quite a bit worse than the diagonal case. But again, if b is not too big, that might be reasonable.
Computationally, it's basically just as difficult to compute this, once you get past the additional storage versus the diagonal case. But, like I said, it can only really be applied when b is small.
And, let's say, if your blocks are the parameters for an entire layer, well, that's still millions of parameters sometimes, and that might just be way too big to deal with.
So, one method, which is probably the best at this, is something called TONGA. Although, to be frank, block-diagonal methods in their raw state haven't really been popular for many years. But that is sort of the go-to work on this.
Now, one way you can improve on block-diagonal methods is with so-called Kronecker-product methods, or Kronecker-product approximations. So, we start with a block-diagonal approximation of the generalised Gauss-Newton or the Fisher, where the blocks correspond to whole layers - which, like I was saying before, are too big to be treated naively - and then we further approximate those blocks with a special algebraic form called a Kronecker product, which gets you this approximation.
So, what is a Kronecker product? Well, it's denoted like this, A ⊗ C, and, in terms of the actual matrix that you get, it's essentially created by taking multiple copies of C and tiling them over and over again - you tile one copy for every entry of A, scaled by that entry - so you create a much, much bigger matrix out of two small matrices.
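For instance, with NumPy (a tiny illustrative example):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
C = np.eye(2)

# np.kron tiles a copy of C for every entry of A, scaled by that entry,
# turning two 2x2 matrices into one 4x4 matrix.
print(np.kron(A, C))
```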
And that seems like an arbitrary construction, but it actually arises very naturally when you start thinking about neural nets and approximations, although I don't have enough time to get into exactly how that happens.
But it does, and it allows you to do much better in terms of storage and computation.
Now, this is a typo - this is not O(n), and I don't know why I wrote that; I must have copied and pasted it from a slide - but it's not. It is more expensive to store these matrices, these approximations, than a simple diagonal approximation, but it's not that much more so. As for the cost of applying these - OK, I see why I wrote that: there are some circumstances where that might be true, but there are also circumstances where it's not really accurate. It's too difficult to get into here, because you'd need a more fine-grained analysis, but you can think of it as being roughly the same most of the time.
Now, the cost to apply an inverse is b to the one half times n, so that's just a little bit more expensive than a diagonal matrix approximation, again with b being the number of parameters in, let's say, an entire layer. So this could still be significant: if you've got nearly a million parameters in a single layer, b to the one half is still a factor of a thousand here. So it's not nothing.
And this gives rise to what I would argue is the most powerful neural net optimiser, K-FAC. It's also a little bit heavyweight, but it does optimise difficult nets the most effectively.
And so, finally, I'm going to talk about stochastic methods. Throughout this lecture, I've been talking about deterministic optimization methods, mostly because it's just easier to talk about them. The theory is nicer, and a lot of the intuitions that you build when you consider deterministic methods apply in the stochastic case - partly because, if you take a mini-batch large enough, a stochastic method sort of looks like a deterministic method. But I'm getting ahead of myself; I haven't even defined yet what a mini-batch is.
So, a typical training objective, which we saw before, consists of a sum of... OK, this really is a typo; these two should be reversed. Right. So, this is a typical objective function, which is an average of a bunch of individual losses, one for each training case, although in general there are other ways that this kind of form can arise in machine learning.
And so that means that
our gradient is the sum
or the average of these
individual gradients.
And, well, if m, the size of our data set, is very large, this computation could just be way too expensive to always run.
And so, the idea with stochastic methods starts from the observation that these individual objective functions from each training case are not totally independent, right? They all, for example, involve a neural network learning how to make a prediction. Not every single training case is totally different from every other training case.
There's a lot of overlap in
terms of the task that you're
trying to solve.
And this is especially true early in learning. You can imagine that, in a neural network, the first things it has to learn are the basic properties of the data set - the simple statistics of the images, let's say, in terms of their means and their variances - and then maybe it starts to distinguish between coarse categories like cat and dog, but it hasn't yet learned all about the fine distinctions between different breeds of dog.
So, learning is often easier at the beginning in this sense, and therefore the overlap between different cases is stronger - in other words, the fine-grained information the cases are telling you is not as important. And that intuition really does carry through.
You do see this in neural net optimization: stochastic methods, when you start optimising, behave very much like deterministic methods, and this correspondence degrades over time, as you start to converge.
So, the idea with stochastic methods is that we're just going to take a subset of the training set: we're not going to take all m cases, we're going to sample some random subset S and then just average over those, with b being the size of S. This gives us some kind of stochastic approximation of the gradient - in fact, an unbiased stochastic approximation.
So stochastic gradient
descent is then defined just
like gradient descent was
but we have our stochastic
gradient in place of the true
gradient.
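A minimal sketch (illustrative; grad_i here stands for whatever function computes the gradient for a single training case):

```python
import numpy as np

def sgd(grad_i, theta0, m, batch_size=64, alpha=0.01, num_steps=1000):
    """Stochastic gradient descent over a dataset of m training cases."""
    theta = theta0.copy()
    for _ in range(num_steps):
        S = np.random.choice(m, size=batch_size, replace=False)  # random mini-batch
        g = np.mean([grad_i(i, theta) for i in S], axis=0)       # unbiased estimate
        theta = theta - alpha * g
    return theta
```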
And this method, right off the bat, actually just won't work precisely - it's not even going to converge - unless we do one of the following things.
So, one thing you can do is decay the learning rate, and there are specific formulas that you can use here, where the value essentially goes to zero as k grows. This is a form that's sort of elegant and works well in theory; you can prove theorems with this kind of formula. In practice, there are better formulas that you can pick, but this at least is a simple baseline, and every formula that you might pick is going to be roughly inspired by this one.
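A classic schedule of this elegant-in-theory kind (my reconstruction; the slide's exact formula isn't reproduced here) is

$$\alpha_k = \frac{\alpha_0}{1 + k/\tau},$$

which satisfies the standard conditions $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$ that this kind of convergence proof relies on.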
Now, this is getting back to my whole discussion of theory versus practice: there's one thing that theory says you should do, but in practice, by exploiting additional properties, you can often do better.
So, another - I would argue better - alternative is Polyak averaging. This involves taking an average of all the parameter values that we visited in the past.
It seems like kind of a silly thing to do because, you know, the initial parameter value is just our random starting point, which is not particularly significant at all. But nonetheless, as you start to take an average of more and more things, the dependence on that point fades.
You can encourage that to happen by taking a kind of decayed average - an exponentially decayed average. This is a type of average that forgets the starting point faster than a normal average does, but the theoretical things you can say aren't quite as good, or at least the theory isn't quite as elegant for this case. Still, this is what people do in practice, and it works better.
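A minimal sketch of the exponentially decayed version (illustrative decay constant; the plain Polyak average would instead weight all past iterates equally):

```python
def update_polyak_average(theta_avg, theta, decay=0.999):
    """Exponentially decayed running average of the parameter iterates."""
    return decay * theta_avg + (1 - decay) * theta
```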
So this will allow your
stochastic method to converge.
Another thing you can do is you
can actually just increase the
mini batch size during
optimization. And if you do this
sufficiently quickly people have
shown that that actually gives
convergence as well.
So there's a bunch of options
here.
And oftentimes the best thing to
do comes down to you know really
just running the experiment and
trying different things. Until we...
at least until we have a much
better theory for this kind of
stuff.
So, stochastic methods in general are going to converge slower than their non-stochastic counterparts.
And this is kind of obvious.
I mean you're just taking the
gradient and replacing it with
some kind of noisy
approximation.
So, you're basically just taking your good algorithm and corrupting it with noise.
But it's not that bad.
So, this formula is what you get if you do stochastic gradient descent with Polyak averaging. Now, I haven't defined this matrix, but you can think of it as the covariance matrix of the gradient estimates - because, again, our gradient estimates are stochastic quantities, so they have a covariance matrix, and you can just compute that. You multiply this by the inverse Hessian at the optimum, take the trace, multiply by one over k, and then you add some higher-order terms, which are going to decay faster than one over k.
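The classic result here (Polyak-Juditsky style; my reconstruction, since the slide's exact expression isn't reproduced) has roughly this shape:

$$\mathbb{E}\big[h(\bar{\theta}_k)\big] - h(\theta^*) \approx \frac{\operatorname{tr}\big(H(\theta^*)^{-1}\,\Sigma\big)}{2k} + o(1/k)$$

where $\bar{\theta}_k$ is the averaged iterate, $H(\theta^*)$ is the Hessian at the optimum, and $\Sigma$ is the covariance of the stochastic gradients.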
Because, again, this trace is a constant; the only dependence on k, the iteration number, is over here.
So, these higher-order terms can matter, but this is the asymptotically dominant term, and it gives you an asymptotic convergence rate which, for sufficiently large k, essentially scales like this. So, to get our epsilon error, this is the form that we get.
And you'll notice that there's no
log here.
So before with deterministic
methods we had the log here.
Here there's no log, so this
actually this dependency is much
worse.
And you can think of it this way: in a stochastic method, the error is going down as one over k, whereas in our deterministic methods it was going down as an exponential function of k. So this is much worse in some sense. But actually these methods do quite well in practice.
And part of that comes down to various ways that you can mitigate this term, make it small. And actually, the higher-order terms end up being larger than you'd think, so oftentimes your true optimization performance is dominated by those, though that won't be true asymptotically. An interesting
thing that I think is worth
pointing out is that it's been
shown that this is as good as
you can do.
So asymptotically this form is
as good as any algorithm can do
ever.
And essentially the way you prove that is by arguing that, if you have only seen a certain amount of data, there's an intrinsic uncertainty in your parameters. Right? You don't know what the true value is because there's literally no way to disambiguate it given what you've seen so far. And that intrinsic uncertainty is the kind of error that you would get with this term here.
So SGD with Polyak
averaging is actually optimal in
a very very very strong sense.
But it's only asymptotically
optimal.
And again, asymptotics are not
always the whole story.
So you could apply second order
methods with stochastic
gradients.
And people do, and there are sort of tricks of the trade to make this work. Now, when we're computing curvature - these Hessians or these Gauss-Newton matrices - we have the same problem that we have with computing the gradient, right?
So, the common thing done in practice is to take a decayed average. That's just where, again, you have a running value that you update as you go. It is itself an approximation, but it's often good enough.
It is worth pointing out, though, that, based on the discussion I just gave on the previous slide, there is no asymptotically faster method than SGD with Polyak averaging. So, asymptotically, we cannot hope to get any advantage out of doing this. But pre-asymptotically, yes, it matters.
And, just going back to the previous slide: these higher-order terms hide a lot of dependency. For example, unlike the leading term, they can depend on things like the condition number - or rather, the condition number after preconditioning - so any improvement you make to the condition number might be reflected here; it won't be reflected in the leading term.
The leading term does not depend on that: you get the same expression even if you use a second-order method, and there's no improvement. It literally does not depend on the Hessian, or whatever matrix you end up picking. But the higher-order term does, and that term can be improved.
So, when would you expect this to help in practice? Well, if the curvature is bad enough - for example, the condition number might be very big, although the condition number is only one measure of badness - or if the mini-batch is large enough. If the mini-batch is big, you're naturally going to have low variance; in other words, this term here is going to be small, and then this whole leading term doesn't matter as much as it did before in terms of your total error.
So those are two ways that it
can still help. And if you have a combination of these two things going on, then there can be an advantage. In fact, this is a graph that was produced very recently, which I feel is sort of the ultimate vindication of this kind of research.
So people these days train
ResNets if they're doing deep
image classification almost
exclusively.
But you can consider networks
that are say one hundred layers
deep but don't have skip
connections, don't have Batch Norm
and don't have the usual tricks.
And this gives rise to a much
harder optimization problem
because it turns out those
tricks are actually helping make
the optimization problem much
easier.
So, if you have such a network and you initialise it carefully, then, let's say, you pick a batch size which is not crazy - in fact, a similar result holds for much smaller batch sizes, say 64.
Then in fact there can be a huge
advantage to using a second
order method like K-FAC vs
momentum methods or Adam (which
is a popular diagonal second
order method).
But it's interesting to note
that if you were to run this
same experiment on a ResNet
the differences vanish
completely.
So all methods perform almost
identically.
And that's what you'd expect from the theory if the implicit condition number of the ResNet were really good. And it seems to be - good enough, anyway, that the asymptotics predicted before are the dominant factor, and so then it really only matters how much data you've seen.
Yep.
[INAUDIBLE]
Yes.
[INAUDIBLE]
Yeah, so it is typically slower, but it really depends on what kind of second-order method you're talking about. Diagonal methods add almost no cost.
[INAUDIBLE]
It would be about half the
speed
I would say. But it depends on
how you optimise it.
So, people have done work on mitigating these overheads, and they can go down to, like, 10 percent slower, for example, depending on the different tradeoffs. Because those things always depend on implementation details, I tend not to talk about that.
But, yeah, you can get these overheads down quite a bit, and this difference, by the way, would never be made up by a 2X slowdown. This graph goes out really, really far; the first-order method will almost never catch up. In fact, these networks are basically impossible to optimize with first-order methods, at least to the same level.
So, yeah.
So these methods can make a difference in certain kinds of networks. But there is this sort of tension in the community between making the networks easier to optimize and just making the optimization technology better. Both are solving the same problem in some sense, but it's nice to have more than one solution, and I hope that by embracing these more powerful methods we might open up new classes of models that people wouldn't necessarily have been able to optimize before.
So now I'll just go over some wrap-up and conclusions for the lecture. I talked about optimization methods and how they're important in machine learning. They work by adapting the parameters to minimise an objective function, and they're the main engine behind neural network learning.
We talked about first-order methods such as gradient descent. The key interpretations being that it is the steepest descent method, or alternatively that it minimises a local first-order approximation. We saw how this can run into issues when the curvature differs between directions: say, the base of the valley versus the sides of the valley.
We talked about momentum methods, which allow us to accelerate along the directions of low curvature (say, the base of the valley), and which are in fact optimal, in a certain asymptotic sense, amongst all first-order methods you could propose.
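As a reminder, the classical momentum update has the standard form (the lecture's exact notation may differ):

$$v_{k+1} = \mu\, v_k - \alpha\, \nabla h(\theta_k), \qquad \theta_{k+1} = \theta_k + v_{k+1}$$

where $\mu \in [0,1)$ is the momentum constant. The velocity $v$ accumulates along directions where successive gradients agree, which is what produces the acceleration along the valley floor.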
Then we got into second-order methods. We talked about how these can improve problems associated with bad curvature: how they can eliminate the dependency on the condition number, or at least improve it. Although all of this comes with a bunch of caveats. For example, you need to use trust regions or damping for them to work well, you have to consider alternative curvature matrices, and you then have to approximate those matrices for any of this to be practical in neural net training.
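Schematically, a damped second-order update has the generic form (a sketch, not a specific method from the slides):

$$\theta_{k+1} = \theta_k - \alpha\,\big(B(\theta_k) + \lambda I\big)^{-1}\,\nabla h(\theta_k)$$

where $B$ is whichever curvature matrix you pick (Hessian, Gauss-Newton, Fisher, or an approximation of one of these) and $\lambda \ge 0$ is the damping strength. A trust region achieves something similar by bounding the size of the update directly, which implicitly selects a $\lambda$.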
Finally, we talked about stochastic methods, which use mini-batches of data to estimate gradients and possibly curvature. We saw how these are asymptotically slower than deterministic methods, but how their pre-asymptotic performance can in principle be sped up with the use of second-order methods. And we saw an example of that in practice. And that is the end of the talk.
So, I think we're doing pretty well for time; I finished about 10 minutes early, so I'm happy to answer any questions you might have. And I also have some references at the end here, in case you're interested in learning more. Yep.
STUDENT [INAUDIBLE]
Yeah. So... oh, should I repeat the question? So you were asking about initialisation methods, what are the optimal ways of doing that?
So I would say initialisation is a topic that's picking up a lot of steam recently. It's something that I think was brushed aside for years, and now you've got a bunch of papers coming out that are tackling it. The initialisation method that I mentioned on that slide, where I said "careful initialisation": that's been sort of my two-year-long epic project. It's actually very, very hard to initialise a deep network like that and have it train as fast as a regular network will. When I say "regular" I mean a ResNet. You can get almost arbitrarily complex with this stuff; it's a very deep subject, and I think the most exciting results are going to come out this year.
If you're using a package like TensorFlow, you're typically using a default initialiser. The typical rule is to take a Gaussian and multiply by one over the square root of f, where f is the fan-in for that layer. Almost every initialisation scheme starts from that basic point, but there's much more you can do beyond it.
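As a minimal sketch of that default rule (the function name and shapes here are my own illustration, not TensorFlow's actual API):

```python
import numpy as np

def fan_in_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Gaussian weights scaled by 1/sqrt(fan_in), so that the variance of
    # each unit's pre-activation stays roughly constant from layer to layer.
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

W = fan_in_init(256, 128)  # weights for a layer mapping 256 inputs to 128 outputs
```

Schemes like Glorot or He initialisation are variations on this same scaling idea.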
Yeah. So it's important; it's very important. In particular, if you're trying to get rid of things like batch normalisation or skip connections, you can use a better optimiser to solve the optimization problem associated with doing that, but you still have to solve the initialisation problem, because those methods also, as it turns out, were fixing bad initialisations. So if you fix both the optimization and the initialisation, then you can get rid of those things.
Yeah.
STUDENT So for the second-order methods, the Hessian or the other alternative matrices you presented: are those generally analytic, or do quasi-Newton methods appear?
LECTURER By analytic, you mean you can compute them locally, that there's an actual formula for them?
STUDENT Yeah, yeah.
LECTURER There is, yes. So they're not like quasi-Newton methods; quasi-Newton methods of course depend on the history of iterates. No, these matrices are well-defined at every point. The only reason you'd ever accumulate data to compute them is that you want a more statistically robust estimate of them. But in principle you could just throw out your old estimate and compute an entirely fresh estimate where you are. Yeah.
There's another one.
STUDENT So you said things like skip connections and batch normalisation help make the optimization easier. Is that because, despite the fact that you're adding more parameters, you're also transmitting the gradient information more directly, so it's kind of worth it?
Yeah. I mean, it's complex; there are different ideas out there. One idea that I find compelling, from a paper recently published by DeepMind, is that the skip connections are in some sense making the network look much more like a shallow network (and shallow networks are easier to optimise), and that you can slowly recruit more non-linearity into your model as you go; the skip-connection-plus-batch-norm architecture enables this.
You'd be hard pressed to find a fully rigorous story here, although I think things are moving along in such a way that what I just said is almost certainly true, spiritually anyway. The initialisation method that I've been working on, for example, follows a similar principle, in the sense that it makes the network start out very linear-looking, but in such a careful way that it's allowed to become more non-linear gradually.
You can actually do this analysis based on kernel theory, where you can really see, with high probability, what a neural network will do if you keep adding layers on top. What you quickly observe is that the function by which the neural network maps its input to its output degenerates very, very fast unless you're extremely careful with how you set the weights at each layer. And the burden of setting those weights carefully becomes harder and harder as you keep adding more layers: your algorithm has to be more and more precise.
So using default initialisations, which are quite primitive, we get away with training shallow networks no problem. And a ResNet looks shallow enough in some sense (it's hiding all its depth), so those naive initialisation methods are good enough for it. But if you can solve the optimization problem associated with deeper networks, then these differences are in some sense not important.
Once you use a good enough initialisation, we can train... So one thing I didn't plot on this slide here is that if I were to plot a regular ResNet, it would follow this orange curve exactly. So we now get the same optimization performance without the ResNet architecture, but you have to do a lot of work.
Yeah.
STUDENT In practice, a lot of machine learning algorithms also have bad data, right? So does stochastic gradient descent sort of help to regularise in such cases?
Yeah, well, the community has gone a bit back and forth on the importance of stochastic gradients in terms of adding regularization to the problem. It's certainly true that they do (that's sort of undisputed), but there's some disagreement over how important the effect is. If you look at the modern convergence theory for deep nets, it actually predicts that the loss surface is essentially a quadratic within the vicinity of a good initialisation point. So all this theory that I talked about actually is applicable, because the function is more or less convex, at least in the neighbourhood that you care about. Not only is it convex, it's quadratic, at least if you use a certain type of loss. And yeah, so that theory... Sorry, I got lost... the question again was about... oh yeah, regularisation.
So if you are in that situation, where the objective looks like a convex quadratic, there really is only one minimum you're ever going to get to. Right? So if your stochastic method converges, it's going to find the same solution as your deterministic method would have. Now, of course, stochastic methods often don't converge, or they're not taken to convergence. Then you could ask, well, how good is my unconverged point? Maybe the lack of convergence is itself a form of regularisation. Probably true, but not a very big source.
In these kinds of results that I'm plotting here, you do lose a little bit on the test set, but it's a few percent, and I would like to believe that you can make up for those differences, say, by adding more explicit regularisation to the problem. It's a bit dubious, in my opinion, to rely on the optimiser to perform the regularisation. That's classically been the job of the objective function; the optimiser's job should just be to optimise.
Yep.
STUDENT [INAUDIBLE]
Yeah, that's very hard. I don't think anybody's really come up with an easy way to do [INAUDIBLE]
Yeah, I mean, you could try to measure that condition number, or some related quantity. I would argue the condition number is too primitive a quantity to really tell you that much, because, for example, you don't really care about the minimum curvature. The function might be totally flat along certain dimensions and it doesn't matter at all, right? So your condition number could be infinity, but the "true" condition number, the quantity that only cares about directions that actually matter in terms of the error value, might be much smaller; smaller than infinity.
So the condition number is problematic right off the bat. You could try to start computing the eigenvalues of the Hessian, but then you get into the problem that those values only describe what's happening right where you're initialising. The curvature evolves as you optimise, and it's hard to predict how it's going to look halfway through the optimization until you've actually run half of an optimization. But by that point you're already doing an empirical evaluation of your method.
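That said, if you do want to probe the curvature at a given point without forming the Hessian, the usual trick is power iteration on Hessian-vector products. Here's a minimal sketch of that idea (my own illustration, using a finite-difference Hessian-vector product):

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    # Finite-difference Hessian-vector product:
    # H v is approximately (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps),
    # so we never need the Hessian explicitly.
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def top_eigenvalue(grad_fn, theta, iters=50, rng=np.random.default_rng(0)):
    # Power iteration: repeatedly apply H to a random vector and renormalise;
    # the Rayleigh quotient then estimates the largest-magnitude eigenvalue.
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(grad_fn, theta, v)

# Check on a quadratic h(theta) = 0.5 * theta^T A theta, where grad = A theta:
A = np.diag([3.0, 1.0, 0.5])
print(top_eigenvalue(lambda t: A @ t, np.zeros(3)))  # approximately 3.0
```

But, as the answer above says, this only tells you about the curvature where you evaluate it.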
So no, I don't think you can, in general. You can use intuitions: the deeper the network is, the harder it typically is, unless you're using skip connections and batch norm, which mitigate that. RNNs are typically harder than feed-forward nets because they basically look like very deep nets without skip connections. So there are intuitions you can apply, but in general... I don't have a good answer for that.
Yup
STUDENT If we go back to the slide on SGD with Polyak averaging, there's a formula on the...
LECTURER This one?
STUDENT What's the difference between the H matrix and the capital sigma?
LECTURER Between the H and...?
STUDENT And the sigma, the covariance of the gradient estimate.
LECTURER Between these ones?
Well, yeah: this is the covariance of the gradients, and this is the Hessian. It's interesting that sometimes those matrices will in fact be equivalent. That actually does happen in certain types of problems, under certain technical conditions. So this term can sometimes become the trace of the identity. But in general the two matrices will be different.
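To spell out the "trace of the identity" remark: assuming the slide's noise term is the usual $\operatorname{tr}(H^{-1}\Sigma)$ one, then whenever $\Sigma = H$,

$$\operatorname{tr}\!\big(H^{-1}\Sigma\big) = \operatorname{tr}\!\big(H^{-1}H\big) = \operatorname{tr}(I) = n,$$

so the term reduces to the number of parameters $n$.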
STUDENT So can you approximate the Hessian with the [INAUDIBLE]
Well, kind of. The covariance of the gradients is kind of like the empirical Fisher, and the empirical Fisher is kind of like the Fisher, which is kind of like the Gauss-Newton, which is an approximation of the Hessian. So yes, kind of. There are all sorts of situations where that will be a bad approximation, or where you can show that it's a bad one; so bad, in fact, that it'll cause the algorithm to fail very reliably. Nonetheless, there are situations where it's good, and it really comes down to the particulars.
Oh gosh, more questions.
I'll go over there,
you were next.
STUDENT Do you think the single-minimum picture is, like, a good assumption? And, leading on from that, do you think a million-dimensional hypersphere has a surface that our low-dimensional intuition can handle?
Yeah, well, I think these intuitions that we get from lower-dimensional cases have proved useful. If you're talking about differences of curvature, all you need are two dimensions to start meaningfully talking about them, and then the problem just scales: you keep adding more eigenvectors to your Hessian. It's one of those cases where I really do think you get a lot from the low-dimensional picture. And in terms of minima: for a long time people thought the neural net loss surface must be really, really crazy. And it is, if the network is badly initialised and is very deep; it can develop all sorts of pathologies.
But if you actually look at the loss surface of a ResNet, for example, or to a lesser extent one of these nets that I've talked about, it's actually quite good. And the theory that's emerging now from this neural-networks-as-Gaussian-processes literature, which is basically looking at what happens when you make the layers infinitely wide, actually predicts that if you start close to a good initialisation, then locally there is a quadratic-looking bowl with a minimum, and that minimum is the global minimum: it gets zero error. And it's not so hard to see why there would be zero error, because of course if you've got infinitely wide layers, they can get zero error, right? They can just memorise the problem.
Yep
STUDENT Is there any relationship between mini-batch size and learning rate?
Yeah, there can be. As the mini-batch size gets bigger, the stochastic gradient becomes better estimated; in other words, the variance goes down. And in stochastic methods, the reason you decay the learning rate, or use Polyak averaging, is to cancel out this variance. So if the variance is lower, you don't need to lower the learning rate as aggressively, because there's just less of it.
Right. So the rule of thumb that people sometimes use is that the two scale together: if you double the batch size, you might be able to double the learning rate. That's not always going to be true. It's certainly not true in a deterministic method that your learning rate can be infinity, right? Which is what it would be if you just kept doubling the batch size and applying that rule naively. But that rule of thumb holds in a certain range of learning rates and batch sizes, in some sense.
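As a toy illustration of that heuristic, including where it has to be cut off (the constants here are made up, not from the lecture):

```python
def scaled_learning_rate(base_lr, base_batch, batch, max_lr=1.0):
    # Linear-scaling rule of thumb: the learning rate grows in proportion
    # to batch size, but clipped, since the rule breaks down for very large
    # batches (a deterministic method cannot use an infinite learning rate).
    return min(base_lr * batch / base_batch, max_lr)

print(scaled_learning_rate(0.1, 64, 128))   # 0.2  -- double the batch, double the rate
print(scaled_learning_rate(0.1, 64, 4096))  # 1.0  -- clipped; the naive rule would say 6.4
```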
Oh sure
STUDENT Thank you for the lecture. It's maybe a bit of a difficult question: is there a way to use less training data? I mean, we can skip some training objects that we already know have low error. Not all training data are useful...
LECTURER Yeah.
STUDENT ...I mean, not equally useful for us, so we can skip some training objects to save computational time, because on these objects we have close-to-zero gradients.
LECTURER Yeah, that's a good insight. And people have developed algorithms that do that. One way you can view it is that you're sampling random cases, and the uniform mini-batch might not actually be the statistically smartest way to get gradient estimates, right? It's unbiased, but it's not necessarily the lowest variance, for example. And there have been algorithms, and corresponding theory, to describe that: variations on what you've just talked about.
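One simple instance of that idea is importance sampling over training examples: sample each example with probability proportional to some difficulty score, then reweight so the gradient estimate stays unbiased. A minimal sketch, where the loss-based score and all names are my own illustration:

```python
import numpy as np

def importance_sample_batch(losses, batch_size, rng=np.random.default_rng(0)):
    # Sample examples in proportion to their current loss (with a small
    # floor so no example gets probability zero), instead of uniformly.
    p = losses + 1e-3
    p = p / p.sum()
    idx = rng.choice(len(losses), size=batch_size, p=p)
    # Importance weights keep the estimate unbiased: under the sampling
    # distribution p, E[w_i * g_i] equals the uniform average of the g_i.
    weights = 1.0 / (len(losses) * p[idx])
    return idx, weights

losses = np.array([0.01, 0.02, 1.5, 2.0])   # two easy examples, two hard ones
idx, w = importance_sample_batch(losses, batch_size=2)
# The weighted mini-batch gradient would be mean_i(w[i] * grad(example idx[i])).
```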
I would say those algorithms are not used much in practice, partly because it's actually kind of awkward to do that kind of data processing: you have to look at what the model currently thinks about each data point, which means running evaluations, and data pipelines, the way they're written, are always pre-loading data and pre-processing it with all these threads in the background, so adding too much complexity to that can be detrimental. I think it's an under-explored area, though; there might be potential for a big gain there.
So, for example, the stochastic gradient theory that I gave: it wouldn't automatically apply in that situation, although insofar as you can phrase what you're doing as a variance reduction technique, it would apply. But if you were able to magically get a gradient that has zero variance, let's say because you just took the whole dataset, then all that theory immediately stops working.
But yeah, engineering-wise it is tricky, which is why we haven't explored those methods that much. Anyway, I've been told that we should stop. But if you have any more questions, you can just come down here; I'll be available for the next 20 minutes.
