Welcome back!
In the previous video, we talked about what
artificial neural networks are and how to
train a single neuron.
If you haven't watched the previous video
yet, find the link in the description below.
In this video, we will pick up where we left
off and talk about how to train deeper and
more complex networks.
Quick recap: how do we train a model that
has a single weight?
We define a loss function that tells us how
well the model is doing,
take its derivative with respect to the weight,
then incrementally update the weight in the
direction opposite to the sign of the derivative.
The derivative simply tells us the slope of
the loss function at a given point.
For example, if the weight is at this point
where the slope is positive, then we need
to decrement the weight.
If the weight is here on the opposite side,
then the slope becomes negative, so we need
to increment the weight to move towards the
value that minimizes the loss.
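This single-weight update rule can be sketched in a few lines of Python. The data point, starting weight, and learning rate below are made-up values for illustration, not from the video:

```python
# Gradient descent on a single weight: fit y = w * x to one data point
# using squared error as the loss (illustrative values throughout).
x, y_true = 2.0, 6.0           # one training sample; the ideal w is 3
w = 0.0                        # starting point for the weight
learning_rate = 0.1

for _ in range(100):
    y_pred = w * x
    # loss L = (y_pred - y_true)^2, so dL/dw = 2 * (y_pred - y_true) * x
    grad = 2 * (y_pred - y_true) * x
    w -= learning_rate * grad  # step against the sign of the derivative

print(round(w, 4))             # converges towards 3.0
```

Each step moves the weight a little in the direction that decreases the loss, exactly as the slope argument above describes.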
It's pretty straightforward to learn a single
weight, but what do we do when we have many
parameters to learn?
We do the same thing: compute the derivatives
and update the weights.
Take this simplified multi-layer network for
example.
We can rewrite the derivative of the error
with respect to the weight as the derivative
of the error with respect to the output y,
times the derivative of y with respect to the
hidden node z, times the derivative of z with
respect to the weight w.
This is called the chain rule of calculus
and computing the derivatives this way is
called backpropagation, or backpropagating
the error since the error is calculated at
the output layer and propagated back through
the earlier hidden layers.
We do this for all the weights in the network.
We first randomly initialize them,
then compute the outputs given the data samples,
which is called forward propagation.
Then compute the error and the partial derivatives
with respect to all these weights.
The partial derivatives are collectively called
the gradient.
Each one of these partial derivatives measures
how the loss function would change if we were
to change a single variable.
Once we have the derivatives, we update the
weights just like we did in the previous video.
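Forward propagation and the chain rule above can be sketched on a toy network with one sigmoid hidden unit and a linear output. The weight names, input, and squared-error loss here are illustrative assumptions:

```python
import math

# Toy two-layer network: x -> z = sigmoid(w1 * x) -> y = w2 * z,
# with squared error E = (y - target)^2. All values are made up.
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x, target = 1.5, 0.5
w1, w2 = 0.3, -0.8             # "randomly initialized" weights

# forward propagation
z = sigmoid(w1 * x)            # hidden activation
y = w2 * z                     # network output
error = (y - target) ** 2

# backpropagation: chain rule dE/dw1 = dE/dy * dy/dz * dz/dw1
dE_dy = 2 * (y - target)
dy_dz = w2
dz_dw1 = z * (1 - z) * x       # sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
dE_dw1 = dE_dy * dy_dz * dz_dw1
dE_dw2 = dE_dy * z             # gradient for the output weight

# sanity check: compare against a numerical finite-difference derivative
eps = 1e-6
error_plus = (w2 * sigmoid((w1 + eps) * x) - target) ** 2
numeric = (error_plus - error) / eps
```

The finite-difference check at the end is a common way to verify a hand-derived gradient: the two values should agree closely.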
There are several tricks to improve both stability
and efficiency of this process.
For example, the derivatives can be computed
more efficiently by storing previously computed
derivatives in a table and reusing them to
compute others.
In dynamic programming, this technique is called
memoization, not to be confused with memorization,
which usually refers to overfitting and which
we will cover later.
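Memoization in general looks like this. The example below uses the classic Fibonacci recursion rather than derivatives, purely to illustrate the cache-and-reuse idea:

```python
from functools import lru_cache

# Memoization: store previously computed results in a table and reuse
# them. We count calls to show how much recomputation is avoided.
calls = 0

@lru_cache(maxsize=None)
def fib(n):
    global calls
    calls += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

result = fib(30)
# Without the cache this would take over a million recursive calls;
# with memoization each fib(k) for k in 0..30 is computed exactly once.
```

Backpropagation applies the same idea: the derivative at each node is computed once and reused by every path that flows through it.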
Good news!
Unless you want to develop a deep learning
framework yourself, you don't have to worry
too much about how to compute these derivatives
because many modern deep learning frameworks
take care of it for us.
In the previous video, I briefly talked about
activation functions and mentioned that sigmoid
activation is not ideal for deep learning.
Now, let's see why.
First, let's take a look at what the sigmoid
function looks like.
It's an S-shaped curve that saturates when
its input is large in magnitude.
Now let's see what happens to the gradient
when we use sigmoid activations.
This is how we computed the derivative of
the error with respect to the weight using
the chain rule.
Since these hidden units use sigmoid activations
their derivatives involve the derivative of
the sigmoid function, which looks like this.
As you can see, it quickly shrinks to very
small values as its input moves away from zero.
As we backpropagate the error by chaining
these expressions, the magnitude of the gradient
rapidly diminishes.
The more sigmoids we encounter along the backpropagation
path, the smaller the gradient gets.
In deeper models, the derivatives get so small
towards the earlier layers that it becomes
virtually impossible to update the early layer
parameters.
Moreover, since the later layers depend on
the earlier layers, they don't learn anything
useful either, although they get larger updates.
This problem is called the 'vanishing gradient
problem'.
Other saturating activation functions, such
as the hyperbolic tangent function, which
is simply a scaled and shifted version of
the sigmoid function, also suffer from this
problem.
Still, tanh usually performs better than the
sigmoid since its output is zero-centered.
So how do we fix this vanishing gradient problem,
or at least alleviate it?
A simple solution is to use a rectified linear
unit activation.
A rectified linear unit (or ReLU for short)
is a clipped version of the linear activation,
meaning that the output of the function is
the same as input if the input is positive
and zero otherwise.
What does the derivative of this function
look like?
It's a simple step function, which doesn't
vanish unlike saturating activation functions.
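Both ReLU and its derivative are trivial to write down, which is part of their appeal. A minimal sketch:

```python
# ReLU and its derivative as plain scalar functions.
def relu(a):
    # identity for positive inputs, zero otherwise
    return a if a > 0 else 0.0

def relu_derivative(a):
    # a step function: 1 for positive inputs, 0 otherwise
    return 1.0 if a > 0 else 0.0
```

For any positive input the gradient passes through unchanged, no matter how many ReLU layers the backpropagation path crosses.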
ReLUs are easy to optimize but they have a
problem too: they sometimes die.
For example, if a neuron learns a large negative
bias, the input of the ReLU function might
always be negative.
Therefore the neuron might never fire again.
Since the gradient flowing through the neuron
will always be zero its parameters will not
get any updates either.
There are some variants of the rectified linear
function that aim to solve this problem.
For example, Leaky ReLUs use a small slope
on the left-hand side instead of zero,
so that a neuron always gets an update and
can eventually recover from death.
Parametric ReLUs take this idea one step further
and learn the value of this slope during training.
Another alternative is an exponential linear
unit, which is linear on the right-hand side
and exponential on the left-hand side.
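These three variants can be sketched as scalar functions. The slope of 0.01 and alpha of 1.0 below are common defaults, not values prescribed by the video:

```python
import math

# Sketches of the ReLU variants described above (illustrative defaults).
def leaky_relu(a, slope=0.01):
    # small non-zero slope on the left keeps the gradient alive
    return a if a > 0 else slope * a

def parametric_relu(a, slope):
    # same shape as leaky ReLU, but slope is a learned parameter
    return a if a > 0 else slope * a

def elu(a, alpha=1.0):
    # linear on the right, exponential (saturating at -alpha) on the left
    return a if a > 0 else alpha * (math.exp(a) - 1.0)
```

All three agree with plain ReLU for positive inputs and differ only in how they treat the negative side.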
Dying neurons is usually not a huge problem
and many modern neural network architectures
still use the plain ReLU function.
So, if you are not sure which activation function
to choose, I would say go for ReLU first.
Last week we talked a little bit about loss
functions and used the mean squared error
as our loss function in the examples.
Clearly, the mean squared error is not our
only option when it comes to loss functions,
perhaps not even the most popular one.
We can design our custom loss function depending
on what we really want to minimize.
So we need to express our goal in terms of
a function that we can minimize through tuning
the model parameters.
Let's take a look at what type of loss function
we generally use for classification problems.
Let's say we want to recognize handwritten
digits and our input is an 8x8 binary matrix.
We feed the values in this matrix to a neural
network that consists of several layers of
neurons.
What do you think the output of this neural
network should be?
Maybe a single neuron that gives us the value
of the predicted digit?
We could train this network using the mean
squared error between the predicted and the
actual values of the digits, right?
Well, not quite.
I mean we could but that wouldn't be ideal.
Although the labels are digits, they aren't
really numerical variables.
Take this handwritten digit for example.
It looks like a six but could be a zero as
well.
If we are not sure if it's zero or six, should
we call it a three?
It's certainly not three.
The better way to do this is to have a separate
output neuron for each class and train the
model using a cross-entropy function.
For simplicity, we can think of the cross-entropy
as a measure of similarity between the actual
and the predicted probability distributions
of the classes.
It's not truly a similarity metric though
because it's not symmetric.
The cross-entropy between p and q is not the
same as the cross-entropy between q and p.
So how do we get these probability distributions?
For the ground truth labels, it's quite simple.
We assign probability one to the true class
label and zero to all others.
This representation is also called one-hot
encoding.
These labels don't have to be binary; in some
cases we might have a softer probability
distribution for each sample.
To get the predicted class probabilities,
we pass the output variables (which are called
logits) through a softmax function, which squashes
the outputs into the range [0, 1] so that they
sum to 1.
Once we have the actual and the predicted
probability distributions, we can easily compute
the cross-entropy function and train the model
to minimize it.
Since softmax outputs always sum up to 1,
it creates a competition between its inputs.
When a neuron for one class gets a large value
it pushes all other class probabilities down.
This is a useful property when our class labels
are mutually exclusive.
But in some cases, each sample can have more
than one label.
For example, if we are detecting objects in
pictures, one picture can contain both a laptop
and a smartphone.
In such cases, we can pass the logits through
separate sigmoid functions instead of a single
softmax so that the probabilities do not necessarily
sum up to 1.
Then given the actual and predicted probability
distributions we can compute the cross-entropy
just like we did before.
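The multi-label variant replaces the single softmax with one sigmoid per class. The logits and labels below are made up for the laptop-and-smartphone example:

```python
import math

# Multi-label classification: one sigmoid per class, so the per-class
# probabilities need not sum to 1 (illustrative values throughout).
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

logits = [3.0, 2.5, -4.0]              # e.g. laptop, smartphone, cat
labels = [1.0, 1.0, 0.0]               # the picture has a laptop AND a phone

probs = [sigmoid(v) for v in logits]   # independent per-class probabilities

# binary cross-entropy, summed over the classes
loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
            for t, p in zip(labels, probs))
```

Because each sigmoid acts independently, two classes can both receive high probability at the same time, which is exactly what we want here.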
Earlier I mentioned that cross entropy is
not symmetric.
Then you might wonder how we decide whether
p or q should be the predicted probabilities
and the other one is the actual probabilities.
Think of it this way.
From a theoretical perspective, we can think
of taking the logarithm of the predicted probabilities
as reverting the exponential introduced by
the sigmoid or the softmax functions.
From a practical perspective, since the target
probabilities are usually exactly 0 or 1, taking
their logarithm wouldn't be informative.
One last thing.
Theoretically, a feedforward network with
a single hidden layer with a nonlinear activation
function can approximate any continuous function
to within any desired error.
This is called the universal approximation
theorem.
Then why do we even need deeper models?
First, the number of hidden units we need
to represent a complex function might be infeasibly
large.
More importantly being able to represent a
function with a model doesn't mean that our
model can easily learn to represent that function.
Deep architectures allow for learning a hierarchy
of features, resulting in fewer units and
better generalization in many cases.
So, what do we mean by generalization?
That's what we are going to talk about next.
In the next video, we will cover some of the
key concepts in machine learning such as the
model capacity, overfitting, and underfitting.
We will also go through a basic recipe for
machine learning that can be applied to many
kinds of AI problems.
That's all for today.
As always, thanks for watching, stay tuned,
and see you next time.
