Alright, thank you
for this lovely introduction,
and, as Thore mentioned, today
we will go through the foundations
of neural networks.
The talk itself will last around
90 minutes so I would ask you to
wait with any questions 'til the
end of the lecture where
we'll have a separate slot to
address these, and I will also
hang around for a while after
the lecture if you
would prefer to ask some
in person. The lecture itself
will be structured as follows:
There'll be six sections.
I will start with a basic
overview, trying to convince you
that there is a point in
learning about neural nets,
what is the actual motivation to
be studying this specific branch
of research. In the second part,
which is the main meat of the
story, we will go through both the
history of neural nets, their
properties, the way they are
defined, and their inner
workings.
Mostly in order to gather good
and deep intuition about them so
that you are both prepared for
future, more technical, more
in-depth lectures in this series,
as well as ready to simply work
with these models in your own
research or work.
Equipped with these, we will be
able to dive into learning
itself since even the best model
is useless without actually
knowing how to set all the knobs
and weights inside them.
After that, we'll fill in a few
gaps in terms of these pieces
of the puzzle that
we'll be building as the talk
progresses, and we'll finish with
some practical issues, or
practical guidelines, to actually
deal with most common problems
in training neural nets.
If time permits, we will also
have a bonus slide or two on
what I like to call
multiplicative interactions.
So, this is what will be in the
lecture. There's quite a few
things that could be included in
a lecture
called foundations of neural
nets that are not going to be
part of this talk. I like to
split this into three branches:
First is what I refer to as old-
school neural nets, not to
suggest that the neural nets that we
are working with these days are
not old or they don't go back
like 70 years, but rather to make
sure that you see that this is a
really wide field with quite
powerful methods that were once
really common and important for
the field that are not that
popular anymore.
But it's still plausible that
they will come back, and it's
valuable to learn about things
like Restricted Boltzmann
machines, deep belief networks,
Hopfield networks, Kohonen maps.
So, I'm just going to leave these
as sort of keywords for further
reading
if you want to really dive deep.
The second part is
biologically plausible neural
nets, where the goal really is to
replicate the inner workings of
the human brain,
so you have physical simulators,
there are spiking neural nets. These two branches - marked in red here - don't share that much common ground with this talk.
They are still in neural
network land, but they don't
necessarily follow the
same design principles.
On the other hand, the third
branch, called "other" - because of
lack of a better name - do share a
lot of similarities, and even
though we won't explicitly talk
about capsule networks, graph
networks, or neural differential equations, what you will learn today - the high-level ideas, motivations, and overall scheme - directly applies to all of
these.
They simply are somewhat beyond
the scope of the series and the
ones in green, like convolutional
networks, recurrent neural
networks, are simply not part of
this lecture but will come in
weeks to come.
For example in Sander's talk and others. So, why are we learning about neural nets?
Quite a few examples were
already given
a week ago, I just want to
stress a few.
The first one is computer vision in general: most of the modern solutions and applications in computer vision use some form of neural-network-based processing.
These are not just hypothetical
objects.
You know, things that are great
for mathematical analysis or for
research purposes.
There are many actual commercial
applications, products that use
neural networks on a daily basis.
Pretty much in every smartphone
you'll find at least one neural
net these days. The second one is natural language processing and text synthesis, with great recent results from OpenAI and their GPT-2 model, as well as commercial results, like WaveNet-based speech synthesis built into the Google Assistant, if you own one.
Finally, control, which doesn't
just allow us to create AI for
things like Go, Chess,
Starcraft, for games or
simulations in general, but is
actually being used in products
like self-driving cars.
So what made all this possible?
What started the deep learning
revolution?
What are the fundamentals that
neural networks - that really
benefited from this - need to
have?
The first one is compute, and I
want to make sure that you
understand that there are two
sides to this story.
It's not just that computers got
faster.
They were always getting faster.
What specifically happened in recent years is that a specific kind of hardware compute, namely GPUs - graphics processing units that were designed for games - got really useful for machine learning. So this is a two-sided thing;
on the one hand we've got
hardware
that's just much faster but
it's not generally faster, it
won't make your sequential programs
run faster.
It is faster with respect to
very specific operations and
neural networks happen to use
exactly these operations.
We will reinforce this point in
a further part of the lecture
but think about matrix
multiplication as this core
element.
Other machine learning
techniques that don't rely on
matrix multiplications would not
benefit that much from this
exponential growth in compute
that came from GPUs, and these
days, from TPUs.
The second is data, and the same
argument applies.
You have various methods of
learning, some of which scale very well with data, and some of which scale badly: the computational complexity goes through the roof, and you don't benefit from pushing in more and more data.
So again there are two sides to this story. On the one hand, there is much more data available, just because of the internet, the internet of things, and various other sources. And on the other hand, you have
models that really are data
hungry and actually improve as
the amount of data increases.
And finally, the
modularity of the system itself,
the fact that deep learning is not that well-defined a field of study. It's more of a mental picture, a high-level idea of modular blocks that can be arranged in various ways.
And I want to try to sell to you the intuition of viewing deep
learning as this sort of puzzle
where all we are doing as
researchers is building these
small blocks that can be
interconnected in various ways
so that they jointly process data.
To quote a recent Turing Award winner whom we all know, Professor Yann LeCun: 'Deep learning is constructing networks of parameterised functional modules and training them from examples using gradient-based optimisation.'
The core idea is that what we're working with is an extremely modular system.
We are not just defining one
model that we are trying to
apply to various domains.
we are defining a language to build them that relies on very simple basic principles, and for a single such node or piece of the puzzle these basic principles are really two properties. First, each of them needs to know, given an input, given data, what to output.
They are simple computational
units that compute
one thing: they take an average,
they multiply things, they exponentiate, and things like this.
And they have also the other
mode of operation:
if they knew how the output of
their computations should change,
they should be able to tell
their inputs how to change
accordingly.
So if I tell this node:
'your output should be higher',
it should know how to change
inputs. If you are more
mathematically oriented you'll
quickly find the analogy to
differentiation.
And this is pretty much the
underlying mathematical
assumption here, so we will usually work with differentiable objects - and this is also how Professor Yann LeCun defines it. This is not necessarily a strict requirement: you'll see through this lecture that, in practice, people will put in things that are only kind of differentiable, and if you ask a practitioner, you know people just put everything into deep nets, not necessarily caring about the full mathematical reasoning
behind it. So given this really
high level view, let us go through
some fundamentals, and as usual,
let's start with some biological
intuition.
So in every neural network
lecture you need to see a
neuron.
So I drew one for you that's, I
know, oversimplified.
So if you have biological
background
forgive me for being very naive
but I wanted to capture just
very basic properties.
So people in neurobiology have been studying what real neurons look like,
and one really high level view
is that these are just small
cells that have multiple
dendrites, which are inputs, from
other neurons from which they
accumulate their spikes, their
activity.
There is some simple computation
in the Soma, in the cell body,
and then there is a single Axon,
where the output is being
produced. The human brain is
composed of billions of these
things connected to many many
others, and you can see that this
kind of looks like a complex
distributed computation system,
right? We have these neurons connected
to many others, each of them
represents a very simple
computation on its own.
And what people noticed is that some of these connections inhibit your activity - so if the other neuron is active, you aren't - and some excite, so if they are active, you are active as well.
And of course there are many other properties: these cells have state, they live through time, they output spikes through time. Each time you'll see a
slide like this,
with a yellow box. This is a
reference to further reading.
I won't go into too much depth
on various topics but if you
want to read more,
these are nice references. For
example, the Hodgkin-Huxley
model is a nice read
if you are somewhere between
neurobiology and mathematics.
So this is an intuition, and just an intuition. What people did with that - and by people I mostly mean McCulloch and Pitts - is to look at this and ask themselves: what seems to be the main set of neurophysiological observations that we need to replicate?
It's important to stress that the model they proposed was not trying to replicate all of the dynamics. This is not an artificial simulation of a neuron; it's just something vaguely inspired by some properties of real neurons, and these properties are highlighted in green.
These are things that will be easy to compose because - and we'll go through this in a second - you have blue inputs that are multiplied by some weights, w, which are just real numbers attached to each input, and then they are summed. So it's a weighted sum of your inputs, and we also have a parameter, b - also referred to as the bias - which is added to produce the output of our neuron. You can see that this is something you could easily compose: real numbers as inputs, real numbers as outputs. It represents simple
computation,
well it's literally
weighted average, you can't get
more basic than that -
I mean, maybe slightly more
basic.
It also has this property of
inhibiting or exciting.
If W is negative, you inhibit, if
it's positive, you excite. But
they left out quite a few
properties.
For example, this is a stateless
model, you can compute it
many many times and the output
is exactly the same. If you were to take a real neuron and put the same action potentials into it, its spiking might change through time, because it is a living thing that has a physical state. The model also outputs real values rather than spikes through time, just because the time dimension got completely removed. And also, to
set some notation: each time something is blue in the equations, it is an input; if something is red, it is a parameter, a weight, something that you would usually train in your model, and the same applies to the schemes themselves.
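Here is a minimal sketch of this artificial neuron in Python with NumPy - a weighted sum of real-valued inputs plus a bias. All names and values are illustrative, not from the slide:

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs x (blue) with weights w and bias b (red)."""
    return np.dot(w, x) + b

x = np.array([0.5, -1.0, 2.0])   # inputs coming from "other neurons"
w = np.array([0.1, -0.4, 0.3])   # negative weight inhibits, positive excites
b = 0.2                          # bias

print(neuron(x, w, b))           # stateless: same inputs, same output, every time
```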
So what does this weighted average do, intuitively? My personal favourite intuition is that it defines a linear, or affine, projection of your data. You can imagine that this horizontal line is one such neuron, with W equal to [0, 1] and b equal to zero, and then all the data that you have, your x's, would be perpendicularly projected onto this line and you would get this mess - everything would be on top of each other.
If you had a different W - say the vertical line instead of the horizontal one - then there would be nicely separated groups, because it would just collapse them onto that line; if it were a diagonal line, things would be slightly separated, as in the bottom part of this slide.
So, when we define something
like this,
we can start composing them and
the most natural or the first
mode of composition is to make a
layer out of these neurons.
So you can see,
the idea is just to take each
such neuron, put them next to
each other, and what do we gain from this?
Mostly we gain a lot of
efficiency in terms of compute
because now the equation
simplifies to a simple affine
transformation,
with W being a matrix of the
weights that are in between our
inputs and outputs, X being a
vectorised input,
so we just gather all the
inputs for it as a vector.
Why is it important? It's twofold. Multiplication of two matrices in the naive fashion is cubic, but you probably know from Algorithmics 101 or 102 or whatever that you can get the exponent down - to around 2.7 - by being slightly smart about how you multiply things, basically using divide-and-conquer kinds of methods.
And furthermore, this is
something that fits the GPU
paradigm extremely well.
So this is one of these
things that just matches exactly
what was already there
hardware-wise, and as such could
benefit from this huge boost in
compute.
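In code, the whole layer collapses into one affine transformation, which is exactly the operation GPUs accelerate. A minimal sketch with illustrative shapes:

```python
import numpy as np

# A layer is just many neurons stacked: the per-neuron dot products become
# one matrix-vector product y = W x + b.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 4 output units, 3 inputs
b = rng.normal(size=4)
x = rng.normal(size=3)

y = W @ x + b                 # one affine transformation instead of 4 loops
print(y.shape)                # (4,)
```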
There's also a lot of small
caveats in the neural network
land in terms of naming
conventions.
So each object will have from
one to five names, and I am
deeply sorry for us as a
community for doing this.
The main reason is many of these
things were independently
developed by various groups of
researchers, and the grand
unification never happened.
Some of these names are more
common than others, for example:
this is usually called 'linear
layer',
even though a mathematician
would probably cry and say 'no,
it's affine, it's not linear,
there's a bias,
this doesn't satisfy linearity
constraints'.
Neurons will be often called
units,
so if I say unit or neuron, I
just use these interchangeably,
and parameters and weights are
also the same object.
So, you might ask: 'isn't this just
linear regression? The equation
looks exactly like Statistics
101: a linear regression model.'
And to some extent, yes, you're
right. It is exactly the same
predictive model, but what's
important is to have in mind our
big picture.
Yes, we start small but our end
goal is to produce these highly
composable functions.
And if you're happy with
composing many linear regression
models on top of each other,
especially multinomial
regression models, then you can
view it like this:
the language that the neural network community uses is to think about these as neurons, or collections of neurons, that talk to each other, because this is our angle.
Yes, this very beginning of our
puzzle could be something that's
known in literature under
different names, but what's
really important is that we view
them as these single composable
pieces that can be arranged in
any way and much of research is
about composing them in a smart
way so that you get a new
quality out of it.
But let's view these simple
models as neural networks first.
So, we'll start with a single
layer neural network, just so we
can gradually see what is being
brought to the table with each
extension, with each added
module.
So what we defined right now - or what we're going to define right now - can be expressed more or less like this: we have data, it goes through the linear module, then there'll be some extra nodes that we are going to define, and then there is going to be a loss.
It will also be connected to a
target. We are missing these
two,
so let's define what can be used
there, and let's start with the
first one, which is often called
an activation function or a
non-linearity.
This is an object that is
usually used to induce more
complex models.
If you had many linear models,
many affine models, and you compose
them,
it's very easy to prove that
composition of linear is linear,
composition of affine things is
affine.
You would not really bring
anything to the table.
You need to add something that
bends the space in a more funky
way.
So one way of doing this, or
historically one of the first
ones, is to use the sigmoid activation function, which you can view as squashing the real line to the [0, 1] interval.
We often will refer to things
like this as producing
probability estimates or
probability distributions,
and while there exists a
probabilistic interpretation of
this sort of model, what this
usually means in the ML
community is that it simply
outputs things between zero and
one, or that they sum to one.
So, let's not be too strict
when we say probability estimate
here,
it might mean something as
simple as being in the correct
range. And this thing also has a very simple derivative, to refer back to the differentiability that we were talking about earlier.
But there are also caveats that
make it slightly less useful as
you will see in the grand scheme
of things.
One is that because it saturates
as you go to plus or minus
infinity it approaches one or zero
respectively,
this means that the partial
derivatives vanish. The
gradients far, far to the right
will be pretty much zero because
your function is flat,
so the gradient is pretty much
zero.
The same applies in minus
infinity.
So once you are in this specific
point, if you view gradient
magnitude as amount of
information that you are getting
to adjust your model, then
functions like this won't work
that well.
Once you saturate, you won't be told how to adjust your weights anymore.
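A minimal sketch of the sigmoid and its derivative, showing the saturation just described (the numbers are illustrative):

```python
import numpy as np

def sigmoid(x):
    # squashes the real line into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # the simple closed form: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))        # ~0, 0.5, ~1
print(sigmoid_grad(x))   # ~0 at both ends: the gradient vanishes once saturated
```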
But this is what we are going to use, at least initially. So we plug the sigmoid on top of our linear model,
and the only thing we are
missing is a loss.
And the most commonly used one
for the simplest possible task
which is gonna be binary
classification - meaning that our
targets are either zero or one -
something is either false or
true, something is a face or not,
something is a dog or not.
Just these sort of problems.
Then the most common loss function - which should be a two-argument function that returns a scalar, so it accepts, in this notation, p, our prediction, and t, our target, and is supposed to output a single scalar, a real value, such that smaller loss means a better model, being closer to the correct prediction - is cross entropy, which has at least three names, being also negative log likelihood, logistic loss, and probably many others (well, yeah, there's one more, but let's forget about this one). It gives us the negation of the logarithm of the probability of correct classification, which is exactly what you care about in classification - at least usually.
It is also nicely composable with the sigmoid function, which we'll come back to towards the end of the lecture, showing how this specific composition removes two numerical instabilities at once - because on its own, unfortunately, it is quite numerically unstable.
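A naive sketch of binary cross entropy as defined here - and you can see the instability: log(0) is -inf, which is why frameworks fuse it with the sigmoid. The values are illustrative:

```python
import numpy as np

def binary_cross_entropy(p, t):
    """p: predicted probability in (0, 1); t: target in {0, 1}."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

print(binary_cross_entropy(0.9, 1))   # small loss: confident and correct
print(binary_cross_entropy(0.9, 0))   # large loss: confident and wrong
```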
So given these three things, we
can compose them, and have the
simplest possible neural
classifier: we have data, it goes through the linear model, through the sigmoid, through cross entropy, and attaches to targets...
This is what you'd know from
statistics as logistic
regression.
And again, the fact that we are
defining a well-known model
from a different branch of
science is fine because we won't
stop here,
this is just to gain intuition:
what we can already achieve. What
we can already achieve in
practice is we can separate data that's labelled with two possible labels - true or false, zero and one - as long as you can put a line, or a hyperplane in higher dimensions, that completely separates these two datasets.
So in the example you see - red and blue - the more vertical line can separate this dataset pretty much perfectly, and it will have a very low cross entropy loss. The important property of this specific loss - and, I would say, of 95-ish percent of all the losses in machine learning - is that it is additive with respect to samples. So the loss that you can see at the bottom decomposes additively over samples: there is a small function, ℓ, that we just defined over each sample - with t with i in the superscript being the i'th target - and the total loss can be expressed as a sum of these. This specific
property relates to the data
aspect of the deep learning
revolution. Losses that have this
form undergo very specific
decomposition and can be trained
with what is going to be
introduced as stochastic gradient
descent, and can simply scale
very well with big datasets. And
unfortunately, as we just
discussed, this is still slightly
numerically unstable.
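The additivity property in code: the dataset loss is just the mean of per-sample losses, so a random minibatch gives an unbiased, much cheaper estimate - which is what stochastic gradient descent will exploit. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))               # 1000 samples, 2 features
t = (X[:, 0] + X[:, 1] > 0).astype(float)    # linearly separable labels
w, b = rng.normal(size=2), 0.0               # some (untrained) linear model

def per_sample_loss(X, t, w, b):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # linear -> sigmoid
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

full = per_sample_loss(X, t, w, b).mean()    # loss over the whole dataset
batch = rng.choice(1000, size=32, replace=False)
estimate = per_sample_loss(X[batch], t[batch], w, b).mean()
print(full, estimate)                        # close, at a fraction of the cost
```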
So, what happens when we have
more than
two classes? Then we usually
define what's called a 'softmax'
which is, as the name suggests, a
smooth version of the maximum
operation. You take the exponent of your inputs and just normalise: divide by the sum of the exponents. You can see this will sum to one, and everything is non-negative because, well, exponents, by definition, are not negative.
So we produce probability
estimates, in the sense that the
output lies on the Simplex, and
it can be seen as a strict
multidimensional generalisation
of the sigmoid.
So it's not a different thing
it's just a strict
generalisation.
If you take a single x, append a zero, and compute the softmax of the pair, then the first coordinate of the output will be the sigmoid of x and the second one minus the sigmoid. Right? So it's simply a way to go beyond two classes while keeping a very, very similar mathematical formulation.
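A sketch of the softmax, with a check of the claim that it strictly generalises the sigmoid. The max subtraction inside is the standard trick for the overflow issue mentioned towards the end of the lecture; it doesn't change the output:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.7
print(softmax(np.array([x, 0.0])))   # [sigmoid(x), 1 - sigmoid(x)]
print(sigmoid(x), 1 - sigmoid(x))
```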
And it's by far the most
commonly used final activation
in classification problems when the
number of classes is bigger than
two. It still has the same issues as the sigmoid - for obvious reasons: being a generalisation, it cannot remove them.
But the nice thing is, now we
can just substitute the piece of
the puzzle that we defined
before,
right? We had the sigmoid now just put
softmax in its place, and
exactly the same reasoning and
mechanisms that would work
before
apply now.
So we'll use exactly the same loss function - apart from the fact that it now sums over all the classes - and we can separate, still linearly of course, more than two colours - say classes zero, one, and two - which is equivalent to multinomial logistic regression,
if you went for some statistical
courses, and the combination of
the softmax and the cross
entropy - as I mentioned before -
becomes numerically stable
because of this specific
decomposition - and there will be
also a more in-depth version
towards the end of this lecture.
The only thing that it doesn't
do very well, is it doesn't scale
that well with number of
classes.
So one thing that you might want is to be able to select one class specifically - to just say one, to just say zero. And of course, with an equation like the softmax's, you can't represent exact ones or zeros - you can get arbitrarily close, but never exactly one or zero. There are nice other solutions to this, like the sparsemax module, for example. And also, it doesn't scale that well with K: it will work well if K, the number of classes, is, say, in the hundreds. If it's in the hundreds of thousands, you might need to look for a slightly different piece of the puzzle.
The nice news is you can
literally just swap them and
they will start scaling up.
So, why are we even talking about
these simple things?
So, apart from the fact that they
become pieces of the bigger
puzzle, it's also because they
just work, and you might be
surprised that linear models are
useful, but they really are.
If you look at this very well
known MNIST dataset of just
handwritten digits, and try to
build a linear model that
classifies which digit it is
based on pixels, you might get
slightly surprising results of
somewhere around
92 percent of test accuracy.
That's pretty good for something
that just takes pixels and
computes a weighted average - and
that's all it does.
And one of the intuitions behind it is that we usually keep thinking about these models in 1D, 2D, 3D. And yes, in 2D there are not that many arrangements of objects that a line can separate; in 3D, not that many arrangements that a hyperplane can separate. In 100,000 dimensions, the 99,999-dimensional hyperplanes can actually shatter a lot of possible labellings.
So, as you get higher
dimensionality, you can actually
deal with them pretty well even
with linear models.
Furthermore, in commercial
applications, a big chunk of them
actually use linear models. In
natural language processing, for
many years,
the most successful model was nothing else but 'MaxEnt', the maximum entropy classifier, which is a fourth name for logistic regression. So why don't we stop here, right? We could just end the lecture here.
But obviously we are interested
in something slightly more
complex, like AI for chess or
for Go, and for this we know - I mean, we know empirically - that linear models are just not powerful enough.
But before we go that
far ahead, maybe let's focus on
something that's the simplest
thing that linear models cannot
do.
And it's gonna be the very well known XOR problem, where we have a two-dimensional dataset with one class on one diagonal and the second class on the other diagonal. You can quickly iterate in your head over all possible lines: not a single line has red dots on one side and blue on the other; ergo, we need something more powerful. So
our solution is gonna be to
introduce a hidden layer.
So now we're going to look into
two layer neural networks, that
in our puzzle view, look like
this. We have data, goes to linear,
goes through sigmoid, goes through
another linear, goes through
softmax, cross entropy, target. As you
can see, we already have all the
pieces,
we just,
we are just connecting them
differently.
That's all we are doing.
And I want to now convince you
that we're adding, qualitatively,
more than just adding
dimensions or something like
this.
So, let's start with a potential solution: how can we solve this with just two hidden neurons and the sigmoid activation function? So we have our dataset, and for simplicity of visualisation I'm gonna recolour all of the points so that we have four different colours - blue, red, green, and pink - just so you see where the projections end up.
Just remember that we want to
separate one diagonal from
the other. And the two hidden neurons
are going to be these two
projection lines.
So the top one is oriented
downwards, which means that we're
going to be projecting in such a
way that the blue class will end
up on the right hand side. Pink
on the left, green and red in the
middle.
So, somehow, I misordered these two
slides...
So this is how it's going to look: if you look at the
right hand side, you have a
projection on this top line:
blue on the right because
everything is flipped.
Sorry, I should have flipped the
whole graph.
Pink on the left: green
and red composed on top of each
other.
The second line is pretty
symmetrically oriented,
and there, you can see blue
datasets or blue blobs projected
on the left hand side,
pink projected on the right, and
green and red, again,
superimposed on each other.
Right?
This is all we did: drew two
lines and just projected
everything onto them.
These are the weights
and biases at the bottom that
would correspond to this
projection.
Now we add sigmoid. All that
sigmoid does is it squashes,
instead of being an identity
function,
it nonlinearly squashes. So,
we squash these two plots on the
sides, and recompose them as a
two dimensional object.
We have now an X axis,
the first
projection, just through sigmoid, and
this is why it became extreme.
The blue things ended up being
basically in one, and everything
else went to zero - maybe slightly
Boomerang-y here - and the second
neuron -
this projection - after squashing
for sigmoid became the y axis.
You can see, now the pink one got
separated,
everything else got boomerang-ly
squashed.
The nice thing about this, maybe
it doesn't look that nice, but
what it allows us to do is now
draw a line that's going to
separate all the blue and pink
things from everything else.
And this was our goal.
So if I now project on this
line, or equivalently if I were
to put the decision boundary
here, it would separate exactly
what I wanted.
Right.
So the blues and pinks were
supposed to be one class and the
remaining two colours were
supposed to be the other.
So, I can just project them, put the boundary in, and if you now look at the input space, we ended up with this discontinuous classification: the chasm of sorts in the middle became one class and the remainder became the other.
Right.
That was just going through the internals, layer by layer, of how a neural network with a single hidden layer would operate.
All it really did was use this hidden layer to rotate and then slightly bend the input space with the sigmoid - you can think about this as a kind of bending or squishing, a sort of topological transformation - which allowed the purely linear model on top of it to solve the original problem. Right, so it pre-processed the data such that it became linearly separable, and we just needed two hidden neurons to do this. Even though the problem was not that complex, it is a qualitative change in what we can do.
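A runnable sketch of this construction with hand-picked - not learned - weights. The particular values are one illustrative choice, not the ones on the slide: one hidden unit fires when at least one input is on, the other only when both are, and the output unit detects "exactly one":

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])   # h1 fires if x1+x2 > 0.5, h2 if x1+x2 > 1.5
W2 = np.array([20.0, -20.0])    # "h1 but not h2": exactly one input is on
b2 = -10.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(W1 @ np.array(x, dtype=float) + b1)   # hidden layer bends space
    y = sigmoid(W2 @ h + b2)                           # linear model on top
    print(x, round(float(y)))                          # 0, 1, 1, 0 - XOR
```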
So what if something
is slightly more complex? Let's
imagine we want to separate a
circle from a doughnut, then two
neurons won't be enough, you can
prove it's not enough - doughnuts
are just too complex - but six
neurons are doing just fine.
And at this point I would like
to advertise to you this great
tool by Daniel Smilkov and others called Playground, at playground.tensorflow.org, where you can just play with these sorts of simple classification problems.
You can pick one of the
datasets, you can add hidden
layers, add neurons at the top, you
can select activation function
to be sigmoid to follow what we
just talked about, and if you
select classification it will
just attach the sigmoid plus
cross entropy as the loss.
You hit run and you get the
solution which separates our
data quite nicely, you can see
the loss going down as expected,
and arguably this is the easiest
and most important way of
learning about neural nets:
playing with them. Actually
interacting with them, it's
really hard to gain intuitions
by just studying their
mathematical properties, unless
you are a person with really
great imagination. I personally
need to play with stuff to
understand, so I'm just trying to
show this sort of lesson that I
learned. So, what makes it
possible for neural nets to
learn arbitrary shapes - I mean
arguably a doughnut is not that
complex of a shape, but believe
me if I were to draw a dragon it
would also do just fine. And the brilliant result, arguably the most important theoretical result in neural nets, is the work of Cybenko from the late 80s, where he proved that neural networks are what he called universal approximators. Using slightly more technical language, what it actually means is: if you take any continuous function from a hypercube - so your inputs are between zero and one and have d dimensions, and your function is continuous, so relatively smooth - that outputs a single scalar value, a single number, then there exists a neural network with one hidden layer with sigmoids that will get at most epsilon error, and this is true for every positive epsilon. So we pick an error, say 1e-20, and there will exist a neural net satisfying this constraint. You can pick 1e-100, and there will exist one that satisfies it.
So, one could ask: what if I pick epsilon equal to zero? Then the answer is no.
It can only approximate, it can
not represent.
So, you won't ever be able to
represent most of the continuous
functions, but you can get really
close to them. At the cost of
using potentially huge,
exponentially growing models
with respect to input
dimensions.
It shows that neural networks are
really extremely expressive -
they can do a lot of stuff.
What it doesn't tell us though,
is how on earth we would learn them.
It's an existential proof, right? If you went through proper mathematical training, like theoretical mathematical training, you know there are two types of proofs: they are either constructive or existential. Arguably the constructive ones are more interesting - they provide you with insights into how to solve problems; the existential ones are these tricky, funky things that just say it's impossible for this to be false. And this is the kind of proof that Cybenko provided: he just showed that there is no way for this not to hold. There are no constructive methods of finding the weights of the specific network in the proof that he made. Since then, we've actually had more constructive versions.
Furthermore, this size can grow
exponentially; what Cybenko
attributed this brilliant
property to, was the sigmoid
function; that this squashing,
this smooth, beautiful squashing
is what gives you the
generality.
It wasn't long before Hornik showed that what actually matters is more the neural network structure itself: you don't need the sigmoid activation function, you can take pretty much anything, as long as it's not degenerate - and what he meant by non-degenerate is that it's non-constant, bounded, and continuous, right? So you can take a sine wave; you can pretty much take any squiggle, as long as it squiggles at least a bit, so things are not constant, and they are bounded so they cannot go to infinity.
So it shows that this extreme potential of representing - sorry, approximating - functions relies on these transformations being stacked on top of each other with some notion of non-linearity in between them. It still doesn't tell you how to train them; it just says that, in principle, there are neural networks doing all this stuff, we just don't know how to find them.
So, let me give you some intuition - and, to be precise, this is going to be an intuition behind the property, not behind the proof. The true proof relies on showing that the spaces defined by neural networks are dense sets in the set of continuous functions. Instead, we are going to rely on an intuition for why approximating functions with sigmoid-based networks should be possible: a proof by picture.
So, let's imagine that we have
this sort of mountain ridge
that's our target function, and
to our disposal is only our
sigmoid activation, and one hidden
layer.
So, of course I can represent a function like this: I just take a positive W and a negative b, so it shifts a bit to the right - the details don't matter that much - because I'm also gonna get the symmetrical one, where W is negative and b is positive. Right, so I have two sigmoids.
Then if we take an average it
should look like a bump, and you
probably see where I'm going
with this.
It's gonna rely on a very
similar argument to how
integration works.
I just want to have enough bumps
so that after adding them, they
will correspond to the target
function of interest.
So, let's take three of them, and
they just differ in terms of
biases that I have chosen.
So, I'm using six hidden neurons
right. Two for each bump.
And now, in the layer that follows - the final classification layer, now a regression layer - I'm just gonna mix them: the first with weight one half, the second one third, the third one and a half. After adding these three bumps with these weights, I end up with an approximation of the original shape.
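The same "proof by picture" as a sketch: pairs of opposing sigmoids form bumps, and a weighted sum of three bumps (six hidden neurons) approximates a 1D target. The centres, widths, and sharpness below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bump(x, centre, width=0.5, sharpness=10.0):
    # a sigmoid going up, minus a copy shifted to the right = a bump
    up = sigmoid(sharpness * (x - centre + width))
    down = sigmoid(sharpness * (x - centre - width))
    return up - down

x = np.linspace(-4, 4, 9)
# mix three bumps with weights one half, one third, one and a half
approx = 0.5 * bump(x, -2.0) + (1 / 3) * bump(x, 0.0) + 1.5 * bump(x, 2.0)
print(np.round(approx, 2))   # three bumps of heights ~0.5, ~0.33, ~1.5
```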
Of course it's not perfect - as we just learned, we are never gonna be able to represent functions exactly with sigmoids, but we can get really close. And this 'really close', the epsilon, is what's missing here. I only used six hidden neurons and got some error. If you want to squash the error further, you just keep adding bumps.
Now I need a bump here to
resolve this issue.
I need a tiny bump somewhere
around here.
I need a tiny bump here.
You just keep adding and adding
and eventually you'll get as
close as you want.
You won't ever get it exactly
right.
But it's gonna go in the right
direction. So you can ask:
Okay,
it's 1D, usually things in
1D are just a completely
different story than a K
dimensional case,
is there an equivalent
construction at least for 2D?
And the answer is positive, and you've seen it seven-ish slides ago - it's this one. The doughnut we saw is nothing but a bump in 2D, right?
If you think about the blue
class as a positive one,
the one that it's supposed to
get to one as the output.
This is essentially a 2D
bump.
So it's not the perfect version, right - we could do a better job - but even with this sort of bump, we could compose enough of them to approximate any 2D function.
And you can see how things start to grow exponentially, right? We just needed two neurons for a bump in 1D; now we need six for 2D, and you can imagine that for k dimensions it's going to be horrible.
But in principle, it's possible. And this is what drives the building blocks of this sort of universal approximation theorem.
So, let's finally go deeper. Since we said that things are in principle possible in shallow land,
there needs to be something
qualitatively different about
going deeper versus going wider.
So, the kind of models we are
going to be working with will
look more or less like this:
there's data, goes through linear,
some node, linear, node, linear,
node linear, node... And
eventually a loss attached to
our targets.
What we are missing here is what
is gonna be this special node in
between, that as advertised
before, it's not going to be a
sigmoid.
And the answer to this is the
ReLU:
rectified linear unit -
again, quite a few names, but essentially what this is, is a pointwise maximum between the x's, the inputs, and zero. All it does is check whether the input signal is positive: if so, it acts as an identity; otherwise it just flattens it, sets it to zero, and that's all.
Why is it interesting?
Well, from say, a practical
perspective, because it is the
most commonly used activation
these days that just works
across the board in a wide
variety of practical applications, from computer vision all the way to reinforcement learning. It still introduces
nonlinear behaviour - no one can claim that a function with a hinge is linear - but at the same time, it's kind of linear in the sense that it's piecewise linear. So all it can do, if you were to use it across different layers, is cut your input space into polyhedra.
So, with the linear transformations it would cut the space into multiple pieces, and in each such part it can define a different affine transformation, right, because there are just two possibilities: I'm either the identity, or I'm just cutting you off.
So, in each of these pieces you have a hyperplane, and each piece might have a different hyperplane, but the overall function is
really piecewise linear. In 1D,
it would be just the composition
of lines. In 2D, of planes that
are just changing their angles
and, in kd, well, k minus one-
dimensional hyperplanes that
are oriented in a funky way.
The nice thing is: derivatives no
longer vanish. They are either one, when you're on the positive side, or zero otherwise - I mean, arguably that one had already vanished way before we started.
The bad thing is that dead
neurons can occur.
So, imagine that all your activations are negative; then the output of such a neuron will just be a function constantly equal to zero, which is completely useless.
So, we maybe need to pay more attention to the way we initialise the model, and there's maybe one extra thing to keep track of: just see how many dead units you have, because it might be a nice debugging signal that you did something wrong. And, also,
technically this function is not
differentiable at zero,
and the reason why people
usually don't care is that, from a
probabilistic perspective, this
is a zero measure set. You will
never actually hit zero.
You could hand wave and say,
well, the underlying mathematical
model is actually smooth around
zero.
I just never hit it, so I never
care.
If you want to pursue a more theoretically grounded analysis, you can just substitute it with a smooth version, the softplus, log(1 + e^x). This is the dotted line here, which has the same limiting behaviour but is fully smooth around zero. You can also use slightly different reasoning, where you don't talk about gradients but about different objects with similar properties that are just fine with single points of non-differentiability.
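A sketch comparing the two; np.logaddexp(0, x) is one way to compute log(1 + e^x) without overflow for large x (the test values are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.logaddexp(0.0, x)   # log(e^0 + e^x) = log(1 + e^x), stable

x = np.array([-5.0, 0.0, 5.0])
print(relu(x))       # [0.    0.    5.   ]
print(softplus(x))   # [~0.007, ~0.693, ~5.007] - same limits, smooth at zero
```

So, we can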
now stack these things together,
and we have our typical deep
learning model that you would
see in every book on deep
learning, linear, ReLU,
linear, ReLU. And
the intuition behind depth that
people had from the very
beginning, especially in terms of
computer vision, was that each
layer will be some sort of more
and more abstract feature
extraction module.
So, let's imagine that these are
pixels that come as the
input.
Then you can imagine that the
first layer will detect some
sort of lines and corners.
And this is what
each of the neurons will
represent, whether there is a
specific line, like horizontal
line or vertical line or
something like this.
Once you have this sort of
representation, the next layer
could compose these and
represent shapes, like squiggles
or something slightly more
complex.
Once you have these shapes, the
next layer could compose them
and represent things like ears
and noses and things like this.
And then, once you have this sort
of representation, maybe you can
tell whether it's a dog or not
based on the number of ears or
the existence of ears in the first
place.
But this is a very high-level
intuition and, while
confirmed in practice, this is
not necessarily that visible
from the math, and the really
nice result from - sorry, I cannot
pronounce French - from Montúfar,
from Guido, and Pascanu,
Cho, and Bengio is to show
mathematical properties that
somewhat encode this high-level
intuition and is a provable
statement.
So, one thing is that, when we
talked about these linear
regions that are created by
ReLU networks, what you can show
is, as you keep adding layers
rather than neurons,
the number of chunks in which
you are dividing your input
space grows exponentially with
depth and only polynomially
with going wider, which shows
you that there is simply an
enormous reason to go deeper
rather than wider, right.
Exponential growth simply will
escape any polynomial growth
sooner or later and, with the scale
with which we are working these
days,
it escaped a long time ago.
The other thing is, if you
believe in this high-level idea
of learning from, say, Vapnik
times, from statistical learning
theory, that the principle of
learning is to encounter some
underlying structure in data,
right.
We get some training data set, which
is some number of samples,
we build a model, and we expect
it to work really well on the
test data, which comes from the
same distribution but is
essentially a different set.
How can this be done?
Well, only if you learned, if you
discovered some principles
behind the data and the output
space, and one such or a few such
things can be mathematically
defined as finding regularities,
symmetries in your input
space. And what ReLU networks
can be seen as is a method to
keep folding your input space on
top of each other, which has two
effects.
One: of course, if you keep folding space, you get more and more of these regions.
When I say fold space, I mean that
the points that end up on top of
each other are treated the same.
So, whatever I build on top of it
will have exactly the same
output values for both points
that got folded. So, you can see why things will grow exponentially, right? You fold the paper once, you
have two things on top of each
other, then four, then eight.
It's kind of how this proof is
built.
It's really beautiful - I really recommend reading this paper; it has beautiful pictures as well.
And the second thing is this is
also the way to represent
symmetries.
If your data, if your input space is
symmetric, the easiest way to
learn that this symmetry is
important is by folding this
space in half.
If the symmetry is more complex - as represented in this beautiful butterfly-ish, I don't know, shape - you might need to fold in this
extra way so that all the red
points that are quite swirled
end up being mapped onto this
one, single, slightly curved
shape.
And this gives you this sort of generalisation - you discover the structure, if you can of course learn it - that only depth can give you. If you were to build a
much wider model, you'd need
exponentially many neurons to
represent exactly the same
invariance, exactly the same
transformation, which is a really
nice mathematical insight into
this: why depth really matters.
So, people believe this.
I mean, of course, people were
using depths before just because
they were seeing better results.
They didn't need necessarily
a mathematical explanation for
that.
So, let's focus on this simple
model that we just defined.
We have three neural networks - sorry, three hidden layers in our neural network: linear, ReLU, linear, ReLU, and so on and so forth.
And now we'll go from our puzzle
view -
that was a nice high-level
intuition - into something that is
extremely similar and was
actually used in pretty much
every machine learning library
underneath, which is called a
computational graph.
So, it's a graph which represents
this sort of relations, of what
talks to what.
And I use the same colour-
coding... So, again, blue things:
inputs.
So this is my input x. This is
going to be a target.
Orange is going to be our loss,
the reds are going to be weight
parameters.
Some of you might have noticed that, when I was talking about the linear layer, I treated both the weights and the x's as inputs to the function - I was writing f of x, W, b, not really discriminating between weights and inputs, apart from giving them colours for easier readability. The reason is that, in practice, it really doesn't matter.
There's no difference between a
weight or an input into a node
in a computational graph, and
this alone gives you a huge
flexibility. If you want to do
really funky stuff, like maybe
weights of your network are
going to be generated by another
neural network, it fully fits
this paradigm because all you're
going to do is you're going to
substitute one of these red
boxes that would normally be a
weight with yet another network,
and it just fits the same
paradigm, and we'll go through some
examples in a second. To be more
precise,
we have this graph that represents the computational graph of a three-layer neural network with ReLUs, at a high level of abstraction, omitting captions because they are not necessary for this story.
These graphs don't have to be linear chains. You can have side tracks, skip connections. There is nothing stopping you from saying, okay, the output from this layer is actually going to be connected to yet another layer that is also parameterised by something else, and then they go back and merge, maybe via a mean operation or a concatenation - there are many ways to merge two signals.
There is nothing stopping us
from having many losses, and they
don't even have to be at the end
of our graph.
We might have a loss attached
directly to weights that will
act as a penalty for weights
becoming too large, for example,
or maybe enforcing some specific constraints - maybe we want them to lie on a sphere, and
we're going to penalise the
model for not doing so. Our
losses don't even need to be
the last things in the
computational graph.
You can have a neural network
that has a loss at the end, and
this loss value is fed back into the next parts of the
neural network.
And this is the actual output
that you care about eventually.
You can also do a lot of sharing.
So, the same inputs can be
plugged into multiple parts of
your net
in skip connection fashion.
You can share the weights of your model across parts of this computational graph. There's nothing wrong with connecting one node to many places.
This is an extremely flexible language that allows this really modular development, and arguably it actually helped researchers find new techniques, because the engineering advancement of computational-graph frameworks freed us from saying, oh, these are weights, these are inputs, they're qualitatively different. Engineers came and said, no, from my perspective they're exactly the same, and the research followed. People just started plugging crazy things together and ended up with really powerful things like hyper-networks.
So, how do we learn in all this
madness?
And the answer is surprisingly
simple.
You just need basic linear
algebra 101.
So, just to recap, gradients and
Jacobians - I hope everyone knows
what they are,
if not, in very short words:
If we have a function that goes
from d-dimensional space to a
scalar space like R, then the
gradient is nothing but the
vector of partial derivatives.
So, in an i'th dimension we have
partial derivative of this
function with respect to i'th
input. What's the gradient, in a high-level abstraction? Just the direction in which the function grows the most, and minus the gradient is the direction in which it decreases the most. The Jacobian
is nothing but the k-dimensional
generalisation.
If you have k outputs, then it's a matrix whose (i, j) entry is the partial derivative of the i'th output with respect to the j'th input - nothing else, a very basic thing.
The nice thing about these
things is they can be
analytically computed for many
of the functions that we care
about. And then the gradient
descent technique that is
numerical methods 101...
So, surprisingly, deep learning takes a lot of very basic components from across the board of mathematics and just composes them in a very nice way.
The idea behind gradient descent
is extremely simple. You can view
this as sort of physical
simulation where you have your
function or loss landscape.
You just pick an initial point
and imagine that it's a ball that
keeps rolling down the hill
until it hits a stable point
where it just cannot locally
minimise your loss anymore.
So at each iteration you just take your current point and subtract the learning rate at time t times the gradient at this specific point, and this is going to guarantee convergence to a local minimum under some minor assumptions on the smoothness of the function.
So, it needs to be smooth for it to
actually converge, and it has
this nice property that I was
referring to before that, because
a gradient of the sum is a sum of
the gradients,
you can show that analogous
properties hold for the
stochastic version where you
don't sum over all examples,
you just take a subset and keep
repeating this. This will still converge, under some assumptions bounding, basically, the noise - the variance of this estimator. And the important thing is that this choice of the learning rate unfortunately matters, like quite a few other hyperparameters in machine learning,
and there have been quite a few
other optimisers that were
developed on top of gradient
descent, one of which became a sort of gold standard, like a step zero that you always start with, which is called Adam. And when we get to practical issues, I will say
this yet again: If you are just
starting with some model, just
use Adam before even thinking
about the optimisation. It's just a
good starting rule.
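A minimal sketch of the gradient descent update rule just described, on a toy loss L(theta) = (theta - 3)^2 with gradient 2(theta - 3) - the loss and values are illustrative, not from the lecture:

```python
def grad(theta):
    # gradient of the toy loss (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0            # initial point: the ball starts here
lr = 0.1               # the learning rate: the choice that "matters"
for t in range(100):
    theta = theta - lr * grad(theta)   # roll downhill a little each step

print(theta)           # ~3.0, the minimum (here global, in general local)
```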
And, in principle, you can apply
gradient descent to non-smooth
functions, and a lot of stuff in
deep learning is kind of non-
smooth and people still apply it,
but the consequence is you will
lose your convergence guarantees.
So, the fact that your loss
doesn't decrease anymore might
as well be the fact that you
just did something you were not
supposed to be doing, like you
provided a node without a
well-defined gradient or you
defined a wrong gradient.
Maybe you put a stop gradient in, or you created a GAN - then things might stop converging. So, what do
we need from the perspective of our
nodes so that we can apply
gradient descent directly to the
computational graph, right?
Because we have this
computational graph pretty much
for everything that we talked
about, and the only API that we
need to follow is very similar
to the one we talked about before.
We need a forward pass: Given x,
given input, what is the output?
And also we need a backward pass - so what is that? Basically, the Jacobian with respect to your inputs. For computational efficiency, we won't necessarily compute the full Jacobian but, rather, the product between the Jacobian and the gradient of the loss that you eventually care about.
And this is going to be
information we're going to send through
the network.
So, let's be more precise with
our computational graph.
We have three layers.
We have this sort of gradient
descent algorithm. We have our
parameters, thetas. And we want
to unify these views somehow.
Right, so I need to know what
theta is and how to compute the
gradient. So, let's start with
making theta appear.
So, one view that you might use
is that there actually is an
extra node called theta, and all
these parameters, these W's, B's
that I need for every layer, is
just slicing and reshaping of
this one huge theta.
Right, so just imagine there was this huge vector theta, and I'm just saying the first W, whatever its shape is, is the first k dimensions. I just take them and reshape. This is a well-defined, differentiable operation, right - the gradient of a reshaping is the reshaping of the gradient, that kind of thing. So, I can
have one theta, and then the
only question is how to compute
the gradient, and the whole math
behind it is really chain rule,
the fact that the composition of
functions decomposes with
respect to the inner nodes.
So, if you have f composed with g
and you try to compute the
partial derivative of the output
with respect to the input you,
can as well compute the partial
derivative of the output with
respect to this inner node g
and multiply it by the partial
derivative of g with respect to
x. And if g happens to be
multi-dimensional, if there are many
outputs, then from matrix
calculus you know that the
analogous object requires you to
simply sum over all these paths.
So, what does this mean from the perspective of the computational graph?
Well, let's take a look at one
path.
So, we have the dependence of our
loss node on our weight
node that now became an input,
changed to blue because as we
discussed before there is
literally no difference between
these two, and it's going through
this.
So, now all we are going to do is
apply the first rule. We're going
to take the final loss and ask
it,
okay, we want you to be
small, what's the gradient?
We are now in the node that
needs to know - given how the
output needs to change,
which is already told to us by
this node - how it needs to adjust
its inputs, which is this Jacobian
type, the partial derivative
of the loss with
respect to our output. So, we can
send back, and we already
have the dL/d, whatever is the
name of this node. The previous
node has the same property, right?
It's being told, your outputs need
to change in these directions,
and internally it knows - and, by
it knows, I mean we can compute
its Jacobian - how to adjust its
inputs so that its outputs
change in the same direction. And
you go through all this graph
backwards until you hit
theta, and this is using just
this rule.
The only problem is: there is a
bit more than one path through this
network, there's way more
dependence, but this is where the
other one comes into place. We'll
just need to sum over all the
paths that connect these two
nodes. There might be
exponentially many paths, but
because they reuse computation
the whole algorithm is fully
linear, right? Because we only go
through each node once. Computing
up until here is deterministic,
and then we can in parallel also
compute these two paths until
they meet again, so we have a linear
algorithm that backprops
for the whole thing. You can ask,
couldn't I just do it by hand,
going through all the equations?
Of course you could, but it would
be, at the very least, quadratic
if you do it naively. This
is just a computational trick to
make everything linear and fit
into this really generic scheme
that allows you to do all this
funky stuff, including all the
modern and different
architectures. Representing
everything that's computational
graphs just allows you to stop
thinking about this. And you can
see this shift in research
papers as well. Until, like, 2005-ish you'd see in every machine learning paper a section, "the gradient of my loss",
where people would define some
specific model, and then there
will be a section where they say,
"oh, I sat down and wrote down all
the partial derivatives, this is
what you need to plug in to
learn my model," and since then
it disappeared. No one ever writes
this. They just say, "and I used
TensorFlow." Period. Or Keras or
your favourite library. It's a good
thing,
I mean, it moved the field forward: instead of postdocs, you know, spending a month deriving everything by hand, they spend five seconds pressing return.
So, let's reimagine these few
modules that we introduced as
computational graphs.
We have our linear model. As we
talked before, it's just a function
with three arguments.
It is basically a dot product
between x and w.
We add b, and what we need to
define is this backward
computation
with respect to each of the
inputs, no matter if it's an
actual input - blue thing - or a
weight, as we discussed before.
And, for x and w's themselves, the
situation is symmetric.
Essentially, for x, it's just
multiplied by w - the errors that
are coming from the future,
I mean, from further down the
graph, not from the future - and,
for the w, it's just the same
situation but with x's, right?
Because the dot product is a
pretty symmetric operation
itself, and the update for the
biases is just the identity, since
they are just added at the end, so
you can just adjust them
very easily.
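As a minimal NumPy sketch of this module (the names and shapes here are mine, not from the slides):

```python
import numpy as np

def linear_forward(x, W, b):
    # y = W x + b
    return W @ x + b

def linear_backward(x, W, b, dy):
    # dy is the error arriving from further down the graph
    dx = W.T @ dy          # gradient w.r.t. the input: multiply by W
    dW = np.outer(dy, x)   # gradient w.r.t. the weights: same role, played by x
    db = dy                # biases are added at the end: identity
    return dx, dW, db
```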
The nice thing to note is that
all these things in the backwards
graph, they are also basic
algebra.
And, as such, they could be a
computational graph themselves.
And this is what happens in many
of these libraries when you call
tf.gradients, for example, or
something like this, the backward
computation will be added to
your graph.
There'll be a new chunk of your
graph that represents the
backwards graph.
And what's cool about this is
now you can go really crazy and
say I want tenth order
derivative, I want to backprop through
backprop, and all you need to do is
just grab a node that
corresponds to this computation
that was done for you,
just call it again and again and
again and just get this really,
really powerful differentiation
technique, until your GPU or
RAM dies, right?
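In TensorFlow 2, for instance, backprop through backprop is just nested tapes - a minimal sketch (tf.gradients is the older graph-mode spelling of the same idea):

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = x ** 3
    dy_dx = inner.gradient(y, x)    # 3x^2, itself a node in the graph
d2y_dx2 = outer.gradient(dy_dx, x)  # backprop through backprop: 6x
```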
But there's a cost to everything.
ReLU itself, super simple: in the forward pass you take the maximum of zero and x. In the backwards pass you end up with a masking operation.
So, if the specific neuron was
active - when I say active, I mean
it was positive -
and the ReLU just passed it
through,
then you just pass the
gradients through as well. And if
it was inactive - meaning it was
negative, it hit zero -
then, of course, gradients coming
back need to be zeroed as well.
Because we don't know how to
adjust them, right? Locally, from a
mathematical perspective,
if you are in the zero land and you make an infinitesimally small step, you are still in zero land.
Let's forget about actual zero
because this one is tricky.
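The same masking, as a NumPy sketch (reusing np from before):

```python
def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(x, dy):
    # pass the gradient only where the neuron was active (x > 0);
    # zero it where the unit was stuck at zero
    return dy * (x > 0)
```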
Softmax is also relatively
simple.
Maybe this graph is slightly funkier because there's this exponentiation, summation, division - but it's the same principle, right.
And you can also derive the
corresponding partial derivative,
which is the backwards pass, and is
essentially a difference between
the incoming gradient and the
output. And you can see that
these things might blow up, right.
Softmax itself: if x_j is very big, then the exponent will just overflow whatever the numerical precision of your computer is and, as such, it is rarely used in this form. It's either composed with something that squashes it back to a reasonable scale, or you do some tricks, like taking the minimum of x_j and, say, 50, so that you lose some of the mathematical beauty of this, but at least things will not blow up to infinity.
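The standard trick - same spirit as the clipping just mentioned - is to subtract the maximum before exponentiating, which changes nothing mathematically but keeps every exponent bounded. A NumPy sketch:

```python
def softmax(x):
    # subtracting max(x) cancels between numerator and denominator,
    # but guarantees the largest exponent is exp(0) = 1
    z = np.exp(x - np.max(x))
    return z / np.sum(z)
```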
And now, if you look at the cross
entropy, it's also very simple to
vectorise. And its partial
derivatives:
Now you can see why things get messy computationally: you divide by p, and dividing by small numbers, as you know from computer science basics, can again overflow. So it's something that, on its own, is not safe to do.
Again, you could hack things
around, but there are nicer solutions.
And the nice thing about viewing all these things jointly - inputs, weights, targets, whatever - as the same kind of object is that we can reuse exactly the same paradigm. With exactly the same model we used to say: well, these are pictures of dogs and cats, right, and these are the targets - what is the set of weights for this model that maximises the probability of this labelling?
You can also ask the question: given
this neural network,
what is the most probable
labelling of these pictures so
that this neural network is
going to be happy about it, its loss is going to be low? You do it by simply attaching our gradient descent technique directly to t instead of to theta, right.
And, as long as these things are
properly defined in your library,
it's going to work.
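A minimal TensorFlow sketch of the swap - everything here (the toy model, the sizes) is hypothetical: we freeze the weights and make the labelling the trainable variable, parameterised through a softmax so it stays a distribution.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # theta stays fixed
x = tf.random.normal([1, 784])                            # one fixed "picture"
u = tf.Variable(tf.zeros([1, 10]))                        # logits of the labelling t

opt = tf.keras.optimizers.SGD(0.5)
p = tf.nn.softmax(model(x))                               # the network's beliefs
for _ in range(100):
    with tf.GradientTape() as tape:
        t = tf.nn.softmax(u)                              # the labelling we learn
        loss = -tf.reduce_sum(t * tf.math.log(p + 1e-9))  # cross entropy, now in t
    # gradient descent is attached to t (through u), not to theta
    opt.apply_gradients([(tape.gradient(loss, u), u)])
```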
And now we can see why you would compose Softmax and cross entropy: the backwards pass simplifies enormously. Instead of all this nastiness - divisions, small numbers, etc. - you just get the partial derivative of the loss with respect to the inputs as the difference between the softmax output and the targets. As simple as that. All the numerical instabilities are gone. You can, of course, still learn labels, and that partial derivative is relatively okay too.
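A NumPy sketch of the fused block: the forward pass written stably via log-sum-exp, and the backward pass collapsing to a single subtraction (reusing softmax from above):

```python
def softmax_xent_forward(z, t):
    # cross entropy of softmax(z) against targets t, without ever
    # materialising tiny probabilities: log p = z - logsumexp(z)
    z_shift = z - np.max(z)
    log_p = z_shift - np.log(np.sum(np.exp(z_shift)))
    return -np.sum(t * log_p)

def softmax_xent_backward(z, t):
    # the whole backward pass: softmax output minus targets
    return softmax(z) - t
```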
This is one of the main reasons
why, when using machine learning
libraries like Keras,
TensorFlow, and many others, you'll
encounter this cross entropy
jungle.
You'll see ten functions with cross-entropy-something names: sparse cross entropy with logits, cross entropy with Softmax, I don't know, applied twice...
The reason is: Because each of
these operations on its own is
numerically unstable and people
wanted to provide you with a
solution that is numerically
stable,
they just took literally every
single combination, gave it a
name, and each of these
combinations is implemented in a
way that is numerically stable, and
all you need to do is to have
this lookup table, which
combination you want to use and
pick the right name.
Right,
but underneath they're always
just composing cross entropy
with either sigmoid or Softmax
or something like this.
And it's exactly this problem
that they are avoiding.
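In TensorFlow, for instance, a few entries of the jungle look like this - each call fuses its pieces in a numerically stable way:

```python
import tensorflow as tf

z = tf.random.normal([4, 10])        # logits
t = tf.one_hot([1, 2, 3, 0], 10)     # dense one-hot targets
idx = tf.constant([1, 2, 3, 0])      # integer class indices

# softmax composed with cross entropy, dense targets
tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=z)

# the same, but with sparse integer targets
tf.nn.sparse_softmax_cross_entropy_with_logits(labels=idx, logits=z)

# sigmoid composed with cross entropy, independent binary labels
tf.nn.sigmoid_cross_entropy_with_logits(labels=t, logits=z)
```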
If you want to do things by hand,
feel free, but don't be surprised
if even on MNIST from time to time you see an infinity in your loss. It's just the beauty of finite-precision arithmetic in the continuous land. So,
let's go back to our example,
right.
It was this small puzzle piece.
Now we can explicitly label each
of the nodes. So we have our
x's, they go through dot product
with weights, biases are being added,
then there is ReLU.
We do it quite a few times.
At this point, we have probability estimates, and this is the output of our model, even though our loss is computed later. This is also one of the things I mentioned before, right, that
the output or the special nodes
don't have to be at the end.
They might be branching from the
middle. We can replace this with theta - sorry - and just the slicing, and apply our gradient descent. And, maybe to the surprise of some of you, this is literally what training of most deep neural nets looks like in the supervised setting.
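Putting the sketched pieces from above together, one full supervised training step is just a forward pass, a backward pass node by node, and a gradient step - a toy two-layer net, with arbitrary sizes:

```python
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (32, 784)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (10, 32)), np.zeros(10)
lr = 0.1

def train_step(x, t):
    global W1, b1, W2, b2
    # forward: linear -> ReLU -> linear -> fused softmax + cross entropy
    h_pre = linear_forward(x, W1, b1)
    h = relu_forward(h_pre)
    z = linear_forward(h, W2, b2)
    loss = softmax_xent_forward(z, t)
    # backward: walk the graph once, back to front
    dz = softmax_xent_backward(z, t)
    dh, dW2, db2 = linear_backward(h, W2, b2, dz)
    dh_pre = relu_backward(h_pre, dh)
    _, dW1, db1 = linear_backward(x, W1, b1, dh_pre)
    # gradient descent on theta (here, theta's four slices)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return loss
```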
Reinforcement learning:
slightly different stuff. But
it's this underlying principle
that allows you to work with any
kind of neural network.
It doesn't have to be this
linear structure.
All of these funky things that I was trying to portray ten slides ago rely on exactly the same principle, and you use exactly the same rules. You just keep composing, run the same algorithm, and you get an optimisation method that is going to converge to some local minimum - not necessarily a perfect model, but it's going to learn something. So, there are a few
things that we omitted from this
that are still interesting
pieces of the puzzle.
One such thing is taking a
maximum.
So imagine that one of your
nodes wants to take a maximum.
You have a competition between
your inputs, and only the maximal
one is going to be selected.
Then the backwards pass of this operation is nothing but gating again: gradients are passed through if and only if this specific dimension was the maximal one, and everything else is zeroed out. You can see
that this will not learn how to
select things, but at least it
will tell the maximal thing how
to adjust under the conditions
that it got selected.
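The gating, as a NumPy sketch:

```python
def max_forward(x):
    return np.max(x)

def max_backward(x, dy):
    # only the (first) maximal input receives the gradient;
    # everything else is zeroed out
    g = np.zeros_like(x)
    g[np.argmax(x)] = dy
    return g
```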
Right, so there are these sorts of small things that don't necessarily guarantee convergence in a mathematical sense, but they are commonly used,
and you'll see in, say, Sander's
talk on convolutional neural networks
that they are part of the max
pooling layers.
You can have funky things like
conditional execution, like five
different computations and then
one hot layer that tells you
which of these branches to
select. And if it was one-hot encoded, then selection can be viewed as just pointwise multiplication - you multiply by one - and then the backwards pass is just going to be, again, gated in the same way. But, if it was not one-hot encoded but rather the output
of the Softmax, of
something parameterised, then
looking at the backwards pass
with respect to the gating
allows you to literally learn
the conditioning.
You can learn which branch of execution to go through, as long as you smoothly mix between them using Softmax - and this is the high-level principle, the high-level idea, behind modern attention models, which essentially do this.
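A NumPy sketch of the soft version: mix the branches with softmax weights, and the gradient with respect to the gating scores is what lets you learn which branch to pick.

```python
def soft_branch_forward(branch_outputs, scores):
    # branch_outputs: (n_branches, dim); scores: (n_branches,)
    g = softmax(scores)          # smooth, differentiable gate
    return g @ branch_outputs, g

def soft_branch_backward_scores(branch_outputs, g, dy):
    dg = branch_outputs @ dy     # gradient w.r.t. the gate values
    return g * (dg - g @ dg)     # ...pushed through the softmax
```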
And to give you a trivial example of another loss: you had cross entropy but, of course, many problems in real life are not classification. If it's regression - your outputs are just real numbers - then L2, quadratic loss, or one of at least ten other names for this quantity, which is just the squared norm of the difference between the target and your prediction, can also be seen as a
computational graph. And the
backwards pass is again nothing
but the difference between what
you predicted and what you
wanted.
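As a NumPy sketch (with the conventional factor of 1/2 so the gradient comes out clean):

```python
def l2_forward(pred, target):
    return 0.5 * np.sum((pred - target) ** 2)

def l2_backward(pred, target):
    # again: prediction minus target, just like softmax + cross entropy
    return pred - target
```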
And there's this nice duality, as
you can see from the backwards
perspective, that looks exactly
the same as in the case of the cross
entropy with the Softmax, which
also provides you with some
intuitions into how these things
are related. So, let's quickly go through some practical issues, given that we know roughly what we're working with. And the first one is the well-known problem of overfitting and regularisation. So, from statistical learning theory - we are, again, going back to, say, Vapnik and way before him - we know that in this situation
where we just have some training
set, which is a finite set of
data that we are building our
model on top of, minimising error
on it, which we are going to call
training error or training risk, is not
necessarily what we care about.
What we care about is how our
model is going to behave in the
wild, right? What's going to happen
if I take a test sample that
kind of looks the same but is a
different dog than the one that
I saw in training. This is what
we're going to call test risk,
test error. And it's a provable
statement, that there is this
relation between complexity of
your model and the behaviour of these two errors. As your model gets more and more complex - and by complex I mean more capable of representing more and more crazy functions, or able to store more and more information - your training error has to go down; not in terms of any learning method, but in terms of the existence of parameters that realise it.
Think of the universal approximation theorem - it says literally that. But at the same time, as
things get more complex and
bigger, your test risk initially goes down, because you are just getting better at representing the underlying structure, but eventually - in the worst-case scenario - it's actually going to go up, because you might as well represent things in a very bad way: for example, by enumerating all the training examples and outputting exactly what's expected of you. Zero training error, horrible representation, horrible generalisation power. And this
sort of curve you'll see in
pretty much any machine learning
book until 2016-ish, when
people started discovering
something new that we'll go
through in a second, but even if
you just look at this example,
you'll notice that there is some
reason to keep things simple. And
so people developed many regularisation techniques, such as Lp regularisation, where you attach one of these extra losses - that we talked about before - directly to the weights: just an Lp norm, like the L2, quadratic, norm, or L1, or something like this, on each of your weights, so that your weights stay small. And you can prove - I guarantee - that, if the weights are small, the function cannot be too complex. So, you are restricting yourself to the left-hand side of this graph.
You can do dropout, where some neurons are randomly deactivated - again, it becomes much harder to represent complex things.
You can add noise to your data.
You can stop early or you can
use various notions of
normalisation that will be
talked through in the next
lecture, but that's all in this
worst-case scenario.
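As a small NumPy sketch of the Lp idea, for p = 2: the extra loss and its gradient just get added to whatever loss and dW you already have.

```python
def l2_penalty(W, lam=1e-4):
    # extra loss attached directly to the weights...
    return 0.5 * lam * np.sum(W ** 2)

def l2_penalty_grad(W, lam=1e-4):
    # ...and its contribution to dW: pull the weights towards zero
    return lam * W
```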
What people recently discovered
or recently started working on
is how this relates to our deep neural
networks that don't have
hundreds of parameters.
They have billions of parameters,
and yet somehow they don't
really overfit as easily as you
would expect.
So, a new version of the picture emerged that's currently referred to as double descent, where you have this phase change: yes, things get worse as you get a more and more complex model, but eventually you hit this magical boundary of over-parameterisation, where you have so many parameters that, even though in theory you could do things in a very nasty way, like by enumerating samples, because of the learning methods that we are using, you never will.
You start to behave kind of like a Gaussian process and, as you keep increasing the number of parameters, the simpler solutions are actually found first, rather than the more complex ones. And so the curve descends again.
And it has been proven by Belkin
et al. under some constraints
and shown in simple examples.
Then it was also reinforced by a cool work from - sorry - Preetum Nakkiran and colleagues at OpenAI, where they showed that this holds for the deep, big models that we care about. So one
could ask, well, does it mean we
don't care about
regularisation anymore,
we just make models bigger? And the answer is, well, not exactly. It's true that, as you increase the model size - that's the x-axis - your test loss, after rapidly increasing, keeps decreasing over time; but adding regularisation can keep the whole curve lower.
So here, as you go through curves
from top to bottom, it's just more
and more regularisation being
added.
So, what does it mean, and how does it relate to this theory of complexity? What it mostly means is that model complexity is much more than the number of parameters. And this was a local minimum - a research local minimum - that people were in for quite a while, where they thought: well, your neural network is huge, so surely it's not going to generalise well, because the Vapnik-Chervonenkis (VC) bounds are infinite - you're doomed.
And it seems not to be the case.
The complexity of the model strongly depends on the way we train it and, as a result, you are still kind of in a regime where things can get worse and you do need to regularise, but adding more parameters is also a way to get better results. Slightly counterintuitive - and it only applies if you keep using gradient descent, not some nastier method.
Okay.
So just a few things.
There's a lot of stuff that can
go wrong when you train a
neural net, and it can be a hard,
harsh experience initially.
So, first of all, if you haven't
tried,
don't get discouraged.
Initially nothing works, and it's something we all went through, and there is nothing that solves it apart from practice. Just playing with this will eventually get you there.
There's a brilliant blog post from Andrej Karpathy that I'm referring to here and, also, a few points that I like to keep in mind each time I train neural networks.
First of all, initialisation really matters - there is a whole theory built on top of the practical results. If you initialise your network badly, it won't learn, and you can prove it won't learn well.
What you should always start with - especially if you're introducing a new model - is trying to overfit on some small data sample.
If you can't overfit, almost
surely you made a mistake.
Unless, for some reason, your model doesn't work for small sample sizes - then obviously just ignore what I just said.
You should always monitor
training loss.
I know, it sounds obvious, but quite a few people just assume that the loss will go down because gradient descent guarantees it. Without monitoring it, you will never know whether you are in the right spot, especially
given that many of our models
are non-differentiable and, as such,
the loss doesn't have to go
down.
So if it's not going down, you
might want to reconsider using
these non-differentiable units.
More importantly, something that
people apparently stopped doing
in deep learning on a daily
basis is monitoring the norms of
your weights. Norms going to infinity are something to be worried about. And if it's not making your job crash right now, eventually it will, once you leave it running for a few days. And then you'll regret not having monitored it earlier.
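A trivial NumPy sketch of the habit - log these alongside the loss every few hundred steps:

```python
def weight_norms(params):
    # params: dict of name -> array, e.g. {"W1": W1, "W2": W2}
    return {name: float(np.linalg.norm(W)) for name, W in params.items()}
```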
Another thing is adding shape asserts. All the modern deep learning libraries are great and have brilliant features, one of which is automatic broadcasting.
You take a column vector. You take a
row vector.
You add them. You get the matrix.
Very useful,
unless this is not what you wanted to
do.
You just wanted to add two
vectors, and you ended up with a
matrix.
If the next operation is taking
a maximum or taking the average
you won't notice, right? Afterwards,
there is just a scalar.
Everything looks fine, but your learning will be really crazy. You can try to fit a linear regression and, just by mistake, transpose the targets, and you will see how badly linear regression can behave because of a one-line error that throws no exceptions - and your loss will still go down. It just won't be the model that you're expecting.
The only way that I know about
to resolve this is to add shape
asserts everywhere.
Each time you add an operation, you just write down an assert - literally a low-level engineering thing - to make sure that the shape is exactly what you expect; otherwise you might run into issues.
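A minimal sketch of the habit, with the broadcasting trap from above (shapes are illustrative):

```python
x = np.zeros((5, 1))   # column vector
y = np.zeros((1, 5))   # row vector... oops
z = x + y              # broadcasting silently produces a (5, 5) matrix
assert z.shape == (5, 1), f"expected (5, 1), got {z.shape}"  # fails loudly
```

A thing that we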
mentioned before: Use Adam as
your starting point.
Just because 3e-5 is the magical learning rate: it works in 99 percent of deep learning models, for reasons unknown to everyone.
Finally, it's very tempting to
change five things at a time
because you feel like you have
so many good ideas and, don't get
me wrong,
you probably do.
But if you change all of them at once, you will regret it afterwards, when you struggle with debugging and/or with the credit assignment of what actually improved your model - and your reviewers won't be happy either when your ablation skips five steps at once. So, given the last few
minutes before the questions, I
wanted to spend, say, three-ish
minutes on the bonus thing on
multiplicative interactions.
So, I've been trying to convince you throughout this lecture that neural networks are really powerful, and I hope I succeeded.
They are really powerful, but I want to
ask this maybe funny question:
What is one thing that these
multilayer networks where we
just have a linear, then an
activation function, say,
similar to ReLU, stacked
on top of each other
definitely cannot do? Well, there are many answers, right. They can't do a lot of stuff, but one trivial thing they can't do is multiply.
There's just no way for them to
multiply two numbers given as
inputs.
Again, you might be slightly
confused.
We just talked about the universal
approximation theorem.
But what I'm referring to is
representing multiplication.
They can approximate
multiplication to any precision,
but they can never actually
represent the function that
multiplies.
So, no matter how big your
dataset is going to be, no matter
how deep your network is going to
be - if you train it to multiply two numbers, I can always find two new numbers that it is going to miserably fail at. And, by miserably, I mean an arbitrarily big error.
Maybe my numbers are going to be
huge.
Doesn't matter.
There is something special about
multiplication that I would like
to see in neural nets.
What's special about it? For example, conditional execution relies on multiplying something between zero and one by something else.
Many things in real life can be
represented as multiplication,
for example, computing distance
between two points relies on
being able to compute a dot
product plus norms and things
like this.
So, it's quite useful to have
this sort of operation, yet
stacking even infinitely many,
yes, infinitely many, layers will
not help you. And one way to
resolve it is to have a unit
that just implements
multiplicative interactions.
One way to formalise it is as follows: you have a tensor W, and you take your inputs through it - you can see this as a Mahalanobis-style dot product, if you went through this part of algebra - then you have matrix projections of the remaining things, and you just add a bias.
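A NumPy sketch of such a unit - the exact parameterisation here is my reading of the standard form, a bilinear tensor term plus linear projections and a bias:

```python
def multiplicative_unit(x, z, W, U, V, b):
    # W: (out, dim_z, dim_x) tensor -> bilinear term z^T W x;
    # with W[0] the identity, the unit represents the dot product exactly
    bilinear = np.einsum('j,kji,i->k', z, W, x)
    # U: (out, dim_z), V: (out, dim_x) -> ordinary linear projections
    return bilinear + U @ z + V @ x + b
```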
So, if you just look at the approximation side: if you were to, say, compute the dot product, and you did it with a normal neural net with linears and ReLUs, then you would need exponentially many parameters to approximate it to - I believe I used 0.1 error here - with respect to the dimensionality of the input. There was a very steep exponential growth just to approximate, and there was still going to be this problem that you don't generalise; even approximation requires huge numbers of parameters, while a model like this explicitly has linear growth and comes with a guarantee, right?
Once you hit the dot product,
which can be represented exactly
with this module, you generalise
everywhere. And there's a nice
work from Siddhant et al. at this
year's
ICML if you want to dig
deeper. But I want to just stress:
There's a qualitative difference
between approximation and
representation and, in some sense, this sends you home with the take-home message,
which is: If you want to do
research in this sort of
fundamental building blocks of
neural networks,
please try not to focus on
improving things, like marginally
improving things, that neural
networks already do very well.
If we already have this piece of the puzzle, polishing it - I mean, it is an improvement, but it's really not what's cool about this field of study, and it's not where the biggest gains lie, both for you scientifically and for the community.
The biggest gain is in identifying what your networks cannot do or cannot guarantee. Think about it:
Maybe you want a module that's guaranteed to be convex, or quasi-convex, or to have some other funky mathematical property that you are personally interested in - and propose a model that does that.
I guarantee you that that will be a
much better experience for you
and a much better result for all
of us.
And with that I'm going to
finish.
So, thank you.
