The first time I heard about neural networks
I was 11 or something.
I saw an article in a popular tech magazine
that said scanners now use neural networks
to recognize characters.
Naively, I thought they utilized actual biological
neurons.
When I told my mom about it she said if they
use anything biological then you need to feed
it.
Does the scanner consume sugar or something?
I mean, she was right.
But we don't feed our scanners sugar, do we?
Many years later I figured out that what they
used was nothing but a mathematical model.
So what's so neural about them?
These models are called neural networks because
they are loosely inspired by biological neural
networks.
Artificial neural networks consist of artificial neurons, each of which resembles a biological neuron in the sense that it receives signals from other neurons, accumulates those signals with a different weight for each input, and fires if the accumulated signal is strong enough.
Although artificial neural networks draw some
inspiration from biological models, modern
computer science research in neural networks
focuses more on building useful models rather
than understanding the brain and modeling
it accurately.
Understanding how the brain works is a very
interesting field of research too, but we
will not focus on that in this series.
To model a neural network, let's start with
a single neuron.
An artificial neuron takes the inputs x0 through
xn, multiplies them with weights w0 through
wn, and sums the products to produce the output
y.
We can express this operation as a simple
matrix multiplication.
Assume that the weights are stored in a row vector w^T and the inputs are stored in a column vector x.
The output of the neuron is then simply the product of the two: y = w^T x.
Linear algebra recap: to multiply these, we multiply x0 with w0, x1 with w1, and x2 with w2, and sum the products: y = w0 x0 + w1 x1 + w2 x2.
So why do we do this?
What we are trying to accomplish here is to
approximate a function.
A linear function in this example.
Given a set of (x, y) pairs, our goal is to find the weights w0, w1, and w2 that fit the data we have best.
For example, we can represent the function
y=2x using a single neuron with a single input
with linear activation, meaning that the output
is merely a product of the input and the weight.
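Here is a minimal sketch of this computation in NumPy (the variable names and values are mine, just for illustration):

    import numpy as np

    # A neuron with linear activation: y = w^T x
    w = np.array([0.5, -1.0, 2.0])  # weights w0, w1, w2
    x = np.array([1.0, 3.0, 0.5])   # inputs x0, x1, x2
    y = np.dot(w, x)                # w0*x0 + w1*x1 + w2*x2 = -1.5

    # A single neuron with a single weight can represent y = 2x:
    w_single = 2.0
    print(w_single * 3.0)  # 6.0, i.e. y = 2x evaluated at x = 3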
However, the functions that we want to approximate
might not be as simple as y=2x.
We can learn more complex functions by using
a network of these neurons, where the outputs
of a set of neurons are fed into another set
of neurons as inputs.
Let's take a look at a common type of neural
network: a multi-layer perceptron, which consists
of layers of neurons.
This type of neural network is called a feedforward network because the data flows in one direction, from the input layer to the output layer.
The first layer is the input layer, where
each neuron is connected to an input variable.
The last layer is the output layer, which
has as many neurons as the output variables.
For example, if this is a regression problem where we try to predict the current value of a car, x0, x1, and x2 might be the year, mileage, and the price of the car when it was new, and y0 is the predicted current value.
Or if this is a classification problem where
we want to classify indoor and outdoor pictures,
x0 through xn can be the pixel values, and
y0 and y1 can be the indoor and outdoor neurons.
The layers between the input and output layers are called hidden layers.
These layers learn to produce outputs that
are useful for the next layers.
They are called hidden layers because we don't
explicitly specify what happens at these layers.
The learning algorithm decides how to use
these layers to approximate a function.
Each hidden layer tries to make the input
more useful for the next layer.
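To make this structure concrete, here is a rough sketch of a forward pass through such a network, with the layer sizes picked arbitrarily and each layer's weights stored in a matrix (no activation functions yet; we'll get to those shortly):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))  # input layer (3 inputs) -> hidden layer (4 neurons)
    W2 = rng.normal(size=(2, 4))  # hidden layer -> output layer (2 outputs)

    x = np.array([1.0, 0.5, -2.0])  # input variables x0, x1, x2
    h = W1 @ x                      # each hidden neuron takes a weighted sum of the inputs
    y = W2 @ h                      # each output neuron takes a weighted sum of h
    print(y)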
The number of these layers gives the depth
of the model --that's where the term deep
learning comes from-- whereas the number of
neurons per layer gives the width of the model.
Increasing either the width or the depth increases model complexity, which allows the network to learn more complex patterns.
Or does it?
Each one of these neurons is basically taking a weighted sum of its inputs.
Isn't a weighted sum of linear functions also
a linear function?
Let's simplify things by taking a slice of this network and seeing what happens.
The input gets multiplied by w0, w1, w2, and
w3, and we get our output y0.
We can rewrite this equation as y0 = x0 wc, where wc is the product of all these weights.
Then we could represent this function using
a single neuron.
Basically, if we don't introduce some sort
of nonlinearities between these neurons, the
entire network collapses into a single linear
function.
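You can check this collapse numerically; here is a quick sketch with arbitrary weights:

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.normal(size=(4, 3))  # first linear layer
    W2 = rng.normal(size=(2, 4))  # second linear layer
    x = rng.normal(size=3)

    y_two_layers = W2 @ (W1 @ x)  # two linear layers stacked...
    Wc = W2 @ W1                  # ...equal one combined linear layer
    y_one_layer = Wc @ x
    print(np.allclose(y_two_layers, y_one_layer))  # True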
That's why we use non-linear activation functions
at the outputs of neurons, meaning that we
pass the output of a neuron through a non-linear
function before we feed it to the next one.
Doing so introduces non-linearities in our
network.
This non-linear function can be the sigmoid
function, which squashes its input into a
range between 0 and 1.
The sigmoid function is usually not ideal
for deep models but we'll come back to that
later.
Let's rewrite the function that our model
represents, now with the non-linearities in
between neurons...
As you can see, it no longer reduces to a
single layer model.
This enables our model to represent non-linear
functions.
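As a small sketch, here is the sigmoid and a chain of two neurons with it in between; unlike the purely linear chain above, this does not reduce to a single weight wc (the weights here are arbitrary):

    import numpy as np

    def sigmoid(z):
        # Squashes any real input into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    w0, w1 = 0.8, -1.5        # arbitrary weights
    x = 2.0
    y = w1 * sigmoid(w0 * x)  # non-linear in x: no single weight wc gives y = wc * x
    print(sigmoid(0.0))       # 0.5, the sigmoid's midpoint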
Let's run a simulation on TensorFlow playground
to observe the impact of the non-linear activations.
First, let's try to classify linearly separable
data without using nonlinearities.
The model learns a decision boundary without
any trouble.
How about classifying these data points that lie on a Swiss-roll-shaped manifold?
Seems like the decision boundary is still
linear despite having several hidden layers.
Now let's try again, this time using a non-linear activation function.
It's learning non-linear decision boundaries
now, but what do we actually mean by learning?
Let's go back to the previous example.
The weights, w0, w1, and w2, are the trainable
parameters.
These parameters are learned from training
data.
The values of these parameters are what we keep when we deploy our model.
Let's talk about how we train a model to learn
these weights.
Training a neural network is essentially an
optimization problem, where our goal is to
minimize a loss function.
The loss function tells our model how well it's doing on the training data, so that the weights can be updated to decrease the loss and, as a consequence, improve future performance.
What we mean by training is simply finding
the weights that minimize our loss function.
For example, if our goal is to predict a continuous-valued
variable we can use the mean squared error
or the mean absolute error as our loss function.
The mean squared error is simply the mean
of the squared differences between the predicted
and actual values of output variables.
And the mean absolute error is the mean of
absolute differences between these actual
and predicted values.
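In NumPy these two losses look like this (a sketch; y_true and y_pred hold made-up actual and predicted values):

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0])  # actual values
    y_pred = np.array([2.5,  0.0, 2.0])  # predicted values

    mse = np.mean((y_true - y_pred) ** 2)   # mean squared error: ~0.167
    mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error: ~0.333
    print(mse, mae)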
The loss function can be anything as long
as it's differentiable, and we'll see why
soon.
We'll go back to loss functions later.
First, let's go through a very simple example to understand how we train neural networks.
Let's say we have these data points and we
want to learn a function that generates similar
data points.
Let's use a single neuron with a single weight
and use mean squared error as our loss function.
For simplicity let's omit the bias term and
the non-linear activation function.
First step: we initialize our weights to small
random values.
In this example, we have a single weight w, which is "randomly" initialized to 0.5.
Then we evaluate the output given a data sample.
The data is usually shuffled before training,
so let's pick x = 2 as our first training
sample.
Plugging in x, we get y = 1, which is not so close to the actual value, 4.
So, how do we fix this?
How do we tell the model to update the weight
towards the right direction?
We picked the mean squared error as our loss
function.
Since we are evaluating the samples one by
one, it's simply the squared difference between
the actual and predicted values of y.
We take the derivative of the error with respect
to the weight.
Then, we use the derivative to define an update
rule,
which tells us how to change the weight to
make the predictions better.
Here, alpha is the learning rate, which specifies
the magnitude of the update at every iteration.
It's a common practice to decay the learning rate gradually, which is loosely analogous to how humans learn: kids learn faster, but adults are less gullible since they have been exposed to more training data.
For simplicity, let's fix the learning rate
at 0.1 in this example.
Now that we have our update rule let's iterate
over the data.
We get a data sample and we update the weight
by evaluating the update rule.
We do this until the loss converges to an acceptable point, which in this example is the global minimum of 0.
This optimization algorithm is called Stochastic Gradient Descent.
There are some tricks to improve this optimization process, but this is how it works in its plain form.
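Here is a sketch of that plain SGD loop on our toy example (a single weight, data generated by y = 2x, learning rate fixed at 0.1):

    # Data generated by y = 2x; we want to recover w = 2
    xs = [2.0, 1.0, 3.0, 2.0, 1.0, 3.0]
    ys = [2.0 * x for x in xs]

    w = 0.5      # "randomly" initialized weight
    alpha = 0.1  # learning rate, fixed for simplicity

    for x, y in zip(xs, ys):
        y_pred = w * x
        # Squared error E = (y - w*x)^2, so dE/dw = -2*x*(y - w*x)
        grad = -2.0 * x * (y - y_pred)
        w = w - alpha * grad  # update rule: step against the gradient
        print(w)              # approaches 2.0 as the loss approaches 0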
In this example, we iterated over the samples
one-by-one.
This is called online learning.
Training a model this way can sometimes lead to noisy weight updates and slow down convergence.
Alternatively, we could use the entire dataset at once and average the loss over all data points at every iteration.
That's called batch learning and the iterative
optimization algorithm that we used earlier
is called gradient descent when we use the
entire dataset for each update.
However, in many modern applications, this
is not a feasible approach since the dataset
is usually too big to fit into memory.
Even if the entire dataset fits in memory, it might still be preferable not to use all of it at every step.
In the previous example, our loss function
was a nice and smooth convex function.
Here's roughly how the overall loss looks when plotted as a function of the weight.
This would be an ideal case for the full-batch
gradient descent.
However, this is hardly the case for real-life
applications.
In practice, w can be much higher-dimensional, and the loss landscape is unlikely to be perfectly convex.
Many applications adopt an approach between
these two: pick a mini-batch that consists
of a number of samples and average the loss
over the samples in the mini-batch at every
iteration.
The number of samples in a mini-batch is called
the batch size.
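A sketch of the mini-batch variant on the same toy problem (batch size 2; again, the numbers are just for illustration):

    import numpy as np

    xs = np.array([2.0, 1.0, 3.0, 2.0])  # toy data from y = 2x
    ys = 2.0 * xs

    w, alpha, batch_size = 0.5, 0.05, 2

    for epoch in range(20):
        for i in range(0, len(xs), batch_size):
            xb, yb = xs[i:i + batch_size], ys[i:i + batch_size]
            # Gradient of the loss averaged over the mini-batch
            grad = np.mean(-2.0 * xb * (yb - w * xb))
            w = w - alpha * grad
    print(w)  # close to 2.0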
Now we know what training a neural network
means and how to train a single neuron with
a single trainable parameter.
But how do we train networks with many more
layers and many more trainable parameters?
It's not as complicated as one might think.
In the next video, we will talk about how training deep neural networks differs from training shallow ones, and give an intuitive explanation of how we train these models.
That's all for today.
Thanks for watching, stay tuned and see you
next time.
