Artificial Neural Networks are inspired
by biological neural networks.
They are one of the most popular classes of
machine learning methods and an important
component of state-of-the-art artificial
intelligence applications such as voice recognition,
weather forecasting, and self-driving cars.
---
The building block of a biological nervous
system is a neuron, and likewise, the building
block of an artificial neural network is an
artificial neuron.
An artificial neuron takes a number of inputs
from its dendrites, integrates these inputs,
and produces a single output.
Each input x has a connection strength or
weight associated with it.
The contribution of each input x is obtained
by multiplying it with its weight.
I’m going to use the letter z for the summation
of the inputs multiplied by their weights.
There is a special type of input, which always
has a value of exactly 1.
If all the other inputs are zero, this special
input provides a baseline input value.
This input is sometimes also called the bias
term.
When a neuron is drawn, the bias term is usually
not shown.
In some of the slides I may not show the bias
term, but you should know that it is still
there.
---
We can rewrite the summation as a series.
---
One way to handle the bias term is to just
pretend it is one of the inputs.
Let’s also call it x0, so we can simplify
our formula for the total input signal.
---
If we consider the set of inputs and the set
of weights as vectors, we can write the summation
as a dot product of the vectors X and W.
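As a quick sketch in code (the input and weight values here are made up for illustration), the dot product form of the total input signal looks like this:

```python
import numpy as np

# Inputs, with the bias input x0 = 1 at the front.
x = np.array([1.0, 2.0, 3.0])
# One weight per input, w0 being the bias weight.
w = np.array([0.5, -1.0, 0.25])

# Total input signal z as a dot product of the two vectors.
z = np.dot(w, x)
```

This computes exactly the same sum as multiplying each input by its weight and adding the products up.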
---
The total input signal in a neuron is processed
by an activation function to produce the neuron’s
output signal.
The activation function is a function of the
total input z, but you can also write it as
a function of inputs X and weights W.
---
The activation function we use in a neuron
depends on the application.
The two most commonly used activation functions
are the linear and sigmoid functions.
The linear function simply outputs the total
input signal.
The sigmoid function produces an output that
is between 0 and 1.
The sigmoid function is also known as the
logistic function.
When I draw a neuron, I will show the shape
of the activation function, because that is
an important property of a neuron.
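The two activation functions can be written in a few lines; this is just a sketch, not tied to any particular library:

```python
import math

def linear(z):
    # Linear activation: the output equals the total input signal.
    return z

def sigmoid(z):
    # Sigmoid (logistic) activation: the output is always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `sigmoid(0.0)` is exactly 0.5, and large positive or negative inputs push the output toward 1 or 0.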
---
The best part about a neuron is that we can
train it on a training dataset that contains
a set of samples and their target values and
have the neuron learn to produce desired outputs
for each sample.
I will show the dataset as a matrix, where
each row represents a sample, and each column
represents a feature or attribute.
This dataset contains N samples and D features.
We feed the samples into the neuron one at
a time.
Each feature of a sample becomes one of the
input signals into the neuron.
The process of learning involves adjusting
the weights of the input signals so that the
neuron’s output matches the desired target values.
---
I’ll demonstrate the learning process with
a simple example.
I will train a neuron that has a linear activation
function.
I will work with a training dataset that contains
only a single sample.
I will simplify the problem further and assume
we have a single input attribute in the dataset.
Including the bias term, that makes two inputs.
This neuron will have two weights that we
can adjust.
---
Let’s see what the neuron outputs for this
sample.
I will feed the sample into the neuron, calculate
the total input signal, and apply the activation
function.
We’ll have to start with some random values
for the weights.
Let’s use w0=5 and w1=7.
Multiplying each input with its weight and
adding them up gives us a total of 21.1.
The linear activation function produces 21.1
as output.
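These numbers can be reproduced in a few lines; the attribute value x1 = 2.3 is implied by the totals above (5·1 + 7·2.3 = 21.1):

```python
x = [1.0, 2.3]   # bias input x0 = 1 and the single attribute x1
w = [5.0, 7.0]   # the randomly chosen starting weights w0, w1

# Total input signal: the sum of inputs multiplied by their weights.
z = sum(wi * xi for wi, xi in zip(w, x))
y = z            # linear activation passes z through unchanged
```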
The desired target value we wanted the neuron
to output is 2.7.
We have to make changes to the weights so
they produce a value closer to 2.7.
But how do we do that?
---
We need to define an error measure that captures
how far off we are from the desired target
value.
You might be tempted to just subtract the
desired target from the neuron’s output and
use that difference as your error value.
But then you would have to take the absolute
value to account for being above or below
the target.
Using the absolute error does not give us
a mathematically clean measure.
As you will see, being able to take the derivative
of the error measure is important, but taking
the derivative of the absolute value is mathematically
a little messy.
The most common error metric is the difference
between output and target values, squared.
For convenience, we’ll multiply that by
one half, so that when we take the derivative,
we don’t have to keep writing the number
2.
The error of our neuron is currently 169.
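That error value can be checked with a short helper (a sketch; the function name is mine):

```python
def squared_error(y, target):
    # Half the squared difference between output and target;
    # the 1/2 cancels the 2 that appears when we take the derivative.
    return 0.5 * (y - target) ** 2
```

With the output 21.1 and the target 2.7, this gives 0.5 · 18.4² ≈ 169.28.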
Changing the weights will change the output
Y, which will then change the error.
There are many optimization strategies that
can be used to adjust the weights to minimize
the error.
We’ll use our favorite method, Gradient
Descent.
We have two weights, so we apply gradient
descent to both of them.
We change each weight, such that we move in
the opposite direction of the gradient of
the error with respect to that weight.
The gradient is multiplied by a learning rate
eta, to control the magnitude of the adjustment
we make.
---
The output Y of the linear neuron is equal
to w0 times x0 plus w1 times x1.
Let’s plug that into the error formula.
We can then obtain the derivative of the error
with respect to w0 and with respect to w1
and plug these into the gradient descent rule,
to calculate the change we need to make to
each weight.
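For a linear neuron, the derivative of the error with respect to each weight works out to (Y − target) · xi, so one gradient descent step can be sketched as (the function name is mine):

```python
def gradient_step(w, x, target, eta):
    # Linear neuron: dE/dwi = (Y - target) * xi, so each weight
    # moves opposite the gradient, scaled by the learning rate eta.
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * (y - target) * xi for wi, xi in zip(w, x)]
```

With the example's starting weights, inputs, target 2.7, and learning rate 0.1, this step yields the adjusted weights 3.16 and about 2.77.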
---
We have identified everything we need to update
the weights.
Let’s use a learning rate of 0.1.
w0 was 5; changing it by this much gives
us the adjusted w0 value of 3.16.
w1 was 7.
After the weight update, it becomes 2.77.
We update the neuron with these new weights.
We are ready to see how these new weights
will perform in terms of the error of the
output they would generate.
---
When we feed the sample data into the neuron
again, we get 9.53 as the output, which is
closer to the target value of 2.7 than before.
The error has gone down to 23.
But we are still not satisfied with the amount
of error.
We repeat the weight update again, change
the weights, feed the sample again, and get
the new output.
The new error is now 3.2, which again is better
than before.
We repeat the weight update one more time
and achieve an error of 0.44.
And when we repeat for 10 iterations, we get
an output that is very close to the target
value 2.7.
We stop the learning process because we like
the results.
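The whole procedure fits in a short loop. The attribute value x1 = 2.3 is implied by the totals on the slides, and the loop reproduces the error sequence described above (169, 23, 3.2, 0.44, ...):

```python
x = [1.0, 2.3]          # bias input and the single attribute
w = [5.0, 7.0]          # starting weights
target, eta = 2.7, 0.1  # desired output and learning rate

errors = []
for _ in range(10):
    y = sum(wi * xi for wi, xi in zip(w, x))      # forward pass
    errors.append(0.5 * (y - target) ** 2)        # current error
    w = [wi - eta * (y - target) * xi             # gradient descent step
         for wi, xi in zip(w, x)]

y = sum(wi * xi for wi, xi in zip(w, x))          # output after 10 updates
```

After ten iterations the output is within about 0.001 of the target 2.7.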
---
We have formulated a learning process by which
we can adjust the weights so the network produces
an output that is close to the desired target.
What if we had more attributes in the data?
Let’s go back to our formulation and come
up with a more general weight update rule
for each weight.
The gradient descent update for the ith weight
is calculated from the partial derivative
of the error with respect to that weight.
To calculate the partial derivative, we apply
the chain rule and get this.
Because we are using a linear activation function,
the output Y is equal to the total input signal
z, so the derivative of Y with respect to
z is just 1.
And remember that the total input z is just
a summation of the weights multiplied by inputs.
The partial derivative of the total signal
with respect to any of the ith weights is
just xi.
For example, the partial derivative of z with
respect to w2 is just x2.
When we plug this back into the derivative
of the error, we get this.
And plugging that back into the weight update
rule gives us this.
So if we have any number of inputs in the
neuron, we just apply this weight update formula
for each of the weights.
---
What if the neuron had a different activation
function?
When we had a linear activation, we were able
to drop dY/dz because Y was just equal to
z, but now we’ll have to keep that.
---
For an arbitrary activation function, the
partial derivative of the error will now include
the derivative of the activation function.
---
Let’s take a look at the case where the
sigmoid function is used for activation.
The derivative of the sigmoid function is
shown here.
---
With a little bit of magic, we can rewrite
the derivative as the output Y times (1-Y).
This means after we feed a sample into the
neuron and find the value of Y, the derivative
of the activation function can simply be written
as Y times (1-Y), which is more convenient
than having to type or calculate this formula.
The partial derivative of the error for a
sigmoid activation function becomes this formula.
If you had a different activation function,
you would have to stick with this general
formula, but for the sigmoid function, we
can use this.
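The identity is easy to verify numerically; the function names below are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv_direct(z):
    # The derivative of the sigmoid written out: e^-z / (1 + e^-z)^2
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

def sigmoid_deriv_from_output(y):
    # The same derivative expressed in terms of the output Y.
    return y * (1.0 - y)
```

Both forms agree at every input, but the second only needs the output Y we already computed in the forward pass.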
---
Let’s go back to our training example and
use a sigmoid neuron instead of a linear neuron
and see how the weights are updated.
---
When we feed the sample into the neuron, calculate
the total input signal, and apply the sigmoid
function, we get an output value that is slightly
under 1.
The error in our output is around 1.4.
---
The derivative of the sigmoid for that total
input signal, is in the order of 10^-10, which
is a very small number.
As a result, the weight update rule barely
makes any change in the weights.
When we use these slightly modified weights,
and feed the sample into the neuron again,
the new output is only slightly different
and the error is reduced only slightly.
The change is so small that the number of
digits I am showing here does not show any
change.
---
It turns out that no matter how many times
we apply gradient descent, we cannot get an
error less than 1.4.
Here you see the result after 1000 iterations,
where we were able to make only a tiny difference
in the error.
There are two problems here.
One problem is that the output of sigmoid
will never go above 1.
No matter how we change the weights, we cannot
make the output Y greater than 1.
The other problem is that the derivative of
the sigmoid is too small and the weights don’t
change much.
To understand this better, let’s examine
the properties of the sigmoid function.
---
The total input signal can be any number from
negative infinity to positive infinity.
The sigmoid function is a saturating function
that outputs a value that is always between
0 and 1.
It is incapable of producing an output outside
the 0 to 1 range.
Because of this saturating behavior, the sigmoid
function is not appropriate for learning arbitrary
target values.
It is only good for learning target values
that are between 0 and 1.
---
Another issue with the sigmoid function is
that its derivative is too small when the
magnitude of the total input signal is large.
Since the weight update in gradient descent
is proportional to the derivative of the activation
function, this results in tiny weight update
values.
And that makes the weights change too slowly,
requiring many iterations for convergence.
This slow behavior also suggests a guideline
for how we initialize the weights at the beginning
of training.
We have to make sure those initial random
weights are small, so that the total input
signal stays in the region where the gradient
is at a considerable level.
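A quick check shows why small initial weights matter; the value 21.1 is the total input signal from the example above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    y = sigmoid(z)
    return y * (1.0 - y)

# A large total input gives an almost-zero gradient, while a small
# total input keeps the gradient near its maximum value of 0.25.
grad_large = sigmoid_deriv(21.1)
grad_small = sigmoid_deriv(0.5)
```

The gradient at 21.1 is on the order of 10^-10, while at 0.5 it is about 0.24, close to the maximum.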
---
The desired features of an activation function
are: being continuous, easily differentiable,
and monotonic.
These properties make gradient descent possible
and allow the training to converge.
---
Here are some of the activation functions
you may encounter.
The hyperbolic tangent function is just a
scaled and shifted version of the sigmoid function.
You may see the step function when you read
about activation functions, but it is never
really used in practice, because it is not
differentiable at the step and its derivative
is zero everywhere else, which means the
weights would never change if we applied
gradient descent.
The Rectified Linear Unit (ReLU) function
is zero for negative values and linear above
zero.
ReLU is non-differentiable at zero, but it
has a nice feature that, in a network of neurons,
it allows some neurons to be inactive.
There is a smooth version of ReLu function
that is differentiable.
The Gaussian activation function has the property
that it turns on for some samples that are
close to its mean, and stays off for others.
Among these activation functions, Linear and
Sigmoid functions are the most commonly used
ones.
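These functions can be sketched in a few lines; softplus is one common smooth version of ReLU, and the mean and width of the Gaussian are illustrative parameters:

```python
import math

def tanh_act(z):
    # Hyperbolic tangent: output in (-1, 1).
    return math.tanh(z)

def relu(z):
    # Rectified Linear Unit: zero below zero, linear above.
    return max(0.0, z)

def softplus(z):
    # A smooth, differentiable version of ReLU.
    return math.log(1.0 + math.exp(z))

def gaussian(z, mean=0.0, sigma=1.0):
    # Gaussian activation: "on" near its mean, off far from it.
    return math.exp(-((z - mean) ** 2) / (2.0 * sigma ** 2))
```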
---
Linear activation function is useful for regression
problems where we try to make the neuron learn
unbounded and continuous values.
That being said, it is still possible to use
a linear function for classification, by using
a threshold on its output.
Sigmoid activation function is more appropriate
for classification problems where the target
value is usually binary.
0 would represent one class and 1 would represent
the other.
---
A sigmoid neuron will output a real value
between 0 and 1; we interpret that output
using a threshold of 0.5 to determine whether
the neuron thinks the sample is in one class
or the other.
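In code, the interpretation step is a one-liner (the function name is mine):

```python
def classify(y, threshold=0.5):
    # Map a sigmoid output in (0, 1) to one of the two class labels.
    return 1 if y >= threshold else 0
```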
---
We discussed training a neuron for just a
single sample.
How do we make the neuron learn from all
the samples?
The overall error on all the input samples
is usually defined as the average of the squared
errors from each sample.
This is just to get a sense of how the neuron
is performing for all the samples.
When we are updating the weights, we would
need to calculate the gradient for each sample
separately.
To update the weights for a set of samples,
the most common approach is batch learning,
where we feed all the patterns to the neuron,
calculate the weight update for each sample,
but we don’t modify the weights until all
the samples are processed.
We then take the average of the weight updates,
and modify the neuron’s weights with these
averages.
A single iteration of batch learning where
we present all the samples is called an epoch.
Batch update has the advantage of minimizing
the overall error across all samples at each
epoch.
Sometimes because of the nature or the size
of the data, it may not be possible to use
batch learning.
For example, we may be dealing with an application
where we constantly get new data samples,
such as streaming data obtained from sensors.
Or the dataset may just be too large to fit
in the computer’s memory, making batch learning
computationally infeasible.
In such cases, we can train the neuron one
sample at a time and update the weights after
we present each sample; this is called iterative
learning.
A combination of batch and iterative learning
is also possible, where we can feed a subset
of the samples to the neuron, update its weights,
and then move on to the next subset.
A third option is Stochastic learning, which
is similar to iterative learning, except samples
are now picked randomly.
In iterative learning, we just use the samples
in the order they appear in the dataset.
Iterative and stochastic learning tend to
be less consistent than batch learning.
Because when a neuron is trained on each new
sample, it may forget about the previous samples.
Each new sample kind of pulls the weights
toward itself, and the weights end up moving
all over the place.
To get more consistent learning, we usually
run through the dataset multiple times.
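The three update schedules can be sketched for a linear neuron as follows (each sample already includes the bias input x0 = 1; the function name and structure are illustrative):

```python
import random

def train_epoch(w, samples, targets, eta, mode="batch"):
    # One pass (epoch) over the data for a linear neuron.
    order = list(range(len(samples)))
    if mode == "stochastic":
        random.shuffle(order)         # visit samples in random order
    if mode == "batch":
        # Accumulate the weight updates over all samples, apply once.
        grads = [0.0] * len(w)
        for i in order:
            y = sum(wi * xi for wi, xi in zip(w, samples[i]))
            for d, xi in enumerate(samples[i]):
                grads[d] += (y - targets[i]) * xi
        return [wi - eta * g / len(samples) for wi, g in zip(w, grads)]
    # Iterative or stochastic: update the weights after every sample.
    for i in order:
        y = sum(wi * xi for wi, xi in zip(w, samples[i]))
        w = [wi - eta * (y - targets[i]) * xi
             for wi, xi in zip(w, samples[i])]
    return w
```

Running several epochs of any mode walks the weights toward the values that minimize the overall error; batch mode takes one averaged step per epoch, while the other two take one step per sample.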
---
Our lecture title is Neural Networks, but
all we have discussed so far is a single neuron.
A single neuron is clean and simple, and the
gradient descent algorithm does a good job
of learning the weights from the data.
The problem with a single neuron is that it
is not powerful enough to learn complex relationships.
In fact, if we consider classification problems,
a single neuron is capable of learning only
linearly separable datasets.
Two classes of samples are linearly separable
if and only if there exists a line (or, in
higher dimensions, a hyperplane) that separates
samples of one class from the samples of the
other class.
A linearly non-separable classification problem
is where it is not possible to separate the
samples of the two classes by a line.
It does not matter if you use a linear activation
function or a sigmoid activation function,
which is non-linear.
The limitation really comes from the way the
total input signal is calculated, as a linear
function of the sample features.
---
A neural network is where we set up a network
of neurons where the output of some neurons
feed in as inputs to other neurons.
Let’s see if a network of linear neurons
is more powerful than a single neuron.
I have two neurons that each receive the inputs;
I then feed the outputs from these intermediate
neurons into a third neuron.
The output from this last neuron is considered
to be the output of the entire network.
These intermediate neurons are called hidden
neurons. This type of network is called a
feed-forward network because the information
flows in one direction.
I’m using the letters i and j to distinguish
between the hidden neurons and the output
neuron.
I am also using subscript numbers to indicate
the source and target of that connection.
For example, this weight goes from the first
input to the second hidden neuron.
---
If you work out the overall output of the
network, you can rewrite it in this form.
So, the overall output of the network is simply
a linear combination of the inputs.
---
We could have easily created a single neuron
with appropriate weights to result in the
same output.
This was kind of disappointing.
Combining multiple linear neurons does not
really add any power for representing more
complex functions.
---
And it doesn’t matter how many neurons you
have or how many layers you use.
We can show that for any number of linear
neurons, you can produce the same output with
a single neuron.
In other words, if you take linear transformations
of input signals from multiple neurons and
then linearly combine them again, the overall
result is still a linear transformation of
the inputs.
Using a network of linear neurons does not
represent any function that a single neuron
cannot.
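We can check this numerically with arbitrary weights (the values are made up, and bias terms are omitted for brevity):

```python
import numpy as np

# Two hidden linear neurons and one linear output neuron.
W_hidden = np.array([[0.5, -1.0],    # weights into hidden neuron 1
                     [2.0,  0.3]])   # weights into hidden neuron 2
w_out = np.array([1.5, -0.7])        # weights into the output neuron

x = np.array([0.8, -0.4])            # an input sample

# Network output: linear neurons just pass their total input through.
y_network = w_out @ (W_hidden @ x)

# The equivalent single linear neuron: collapse the two layers
# into one weight vector.
w_single = w_out @ W_hidden
y_single = w_single @ x
```

The two outputs are identical for every input, which is exactly the point: the linear network collapses to a single linear neuron.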
---
But the same is not true for a network of
non-linear neurons.
In fact, a feed-forward network of non-linear
neurons is so powerful that the universal
approximation theorem states that it is capable
of approximating any continuous function.
You of course need to have a sufficient number
of neurons in each layer of the network.
The universal approximation theorem doesn’t
say how we can actually train the network
and set the weights.
It just says that if we can set the correct
weights, the network is capable of approximating
any function.
From a classification point of view, you can
think of each neuron dividing the space, and
the neurons in subsequent layers divide the
space further, forming complex decision boundaries.
We can certainly increase the number of intermediate
layers, and that is something deep learning
networks use.
Using only one hidden layer and one output
layer is one of the most commonly used architectures.
---
A network with only one hidden layer is called
a 3-layer feedforward network.
The term 3-layer is actually a little misleading.
Only the hidden layer and the output layers
have neurons.
The input layer just represents the input
values of a sample.
---
Earlier, we showed how a single neuron can
be trained using gradient descent.
Let’s now see how a feed-forward network
can be trained.
Just like in the single neuron case, we start
off by feeding a new sample into the network,
calculating the total input signals for each
of the hidden neurons, then calculating the
activations of the hidden units and feeding
those into the output neuron to get the final
output.
We compare that output Y with the desired
target value to calculate the error.
Now we need to come up with an error gradient
for each connection weight in the network,
so that we can update them using gradient
descent.
Training the neuron in the output layer is
easy.
It’s just like training a single neuron.
I have here the derivative of the error for
a single neuron that we derived earlier for
reference.
In the single neuron, we used the input into
that neuron.
Here, the input to the output neuron is Yj,
which is the output that is coming from the
j’th hidden neuron.
I am using Yj and wj to represent any of these
connections.
Their weights will be updated following this
rule.
---
Let’s take a look at the derivative of the
error for the hidden unit weights.
The influence of the hidden weights on the
final output and the error is indirect.
We use a method called error back-propagation
which is just a fancy word for applying the
chain rule in calculus.
To find the derivative of the total signal
zk in the output neuron with respect to wi
of the hidden neuron, we make use of intermediate
variables that link them together.
In this case, Yj and zj provide a connection
from wi to zk, so we’ll rewrite the derivative
using the chain rule, as the derivative of
zk with respect to Yj, multiplied by the derivative
of Yj with respect to zj, and multiplied by
the derivative of zj with respect to wi.
Zk is the total input signal of the output
neuron, which is the sum of Yj’s multiplied
by wj’s, so the derivative of zk with respect
to Yj is just the weight wj.
The derivative of Yj with respect to zj is
just the derivative of the activation function
for the hidden neuron.
zj is the total input signal of the hidden
unit, which is just the summation of its inputs
multiplied by their respective weights.
The derivative of each term in that summation
is zero, except for the one containing xi
times wi, whose derivative is xi.
---
Let’s clean up the formulas a bit.
Let’s call this part delta-k, so we can
reuse it in the hidden neurons too.
Similarly, let’s call this part delta-j
for the hidden neuron.
---
If we had multiple neurons at the output layer,
which would happen if we were trying to teach
the network more than one target attribute,
we would need to incorporate the errors in
all of the outputs when we are updating the
hidden unit weights.
And the way we do that is by summing up the
delta-k’s of all the output units, weighted
by the connections going out from the j’th
hidden node.
This is in a way how we attribute the blame
that the j’th node has in the errors observed
in all of the output neurons.
---
This derivative formulation can be extended
to work for any feedforward network, with
one or more hidden layers.
We call this the generalized delta rule.
We calculate the delta value of each hidden
layer by summing up the weighted delta’s
of the next layer.
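As a minimal sketch (not a full implementation), one back-propagation step for a network with one sigmoid hidden layer and one linear output neuron can be written like this; the function name and weight layout are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, target, W_hid, w_out, eta):
    # Forward pass: one sigmoid hidden layer, one linear output neuron.
    z_hid = [sum(wi * xi for wi, xi in zip(row, x)) for row in W_hid]
    y_hid = [sigmoid(z) for z in z_hid]
    y = sum(wj * yj for wj, yj in zip(w_out, y_hid))

    # delta_k for the linear output neuron: (Y - target) times dY/dz = 1.
    delta_k = y - target
    # delta_j for each sigmoid hidden neuron: delta_k weighted by the
    # outgoing connection, times sigmoid'(z_j) = Yj * (1 - Yj).
    deltas = [delta_k * wj * yj * (1.0 - yj) for wj, yj in zip(w_out, y_hid)]

    # Gradient descent updates for both layers.
    new_w_out = [wj - eta * delta_k * yj for wj, yj in zip(w_out, y_hid)]
    new_W_hid = [[wi - eta * dj * xi for wi, xi in zip(row, x)]
                 for row, dj in zip(W_hid, deltas)]
    return new_W_hid, new_w_out, y
```

Repeating this step drives the network output toward the target for the given sample, just as in the single-neuron example.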
---
When you are using a neural network for a
machine learning application, you wouldn’t
really implement the weight update rules yourself.
Implementations of gradient descent for feedforward
networks are already available as programming
libraries or as separate software tools.
In a 3-layer feedforward network, the number
of input and output units will depend on how
many input and output attributes you have.
The number of hidden units is something you
can adjust.
There is no hard rule for how many hidden
units you should use, and you should try out
different numbers of hidden units to see which
one gives you the best performance.
The activation functions of the output units
will depend on your application.
If you are doing regression and predicting
unbounded continuous target values, you should
use linear activation functions in the output
neurons.
If you are doing classification where the
target values are binary, 0 or 1, you should
use a sigmoid activation function in the output
neurons.
When programming, you usually need to use
separate functions for regression and classification
networks, or you need to specify which one
you want as a parameter when setting up your
network.
Regardless of whether you are doing classification
or regression, the hidden unit activation
functions should always be nonlinear, and
they are usually sigmoid functions.
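As one concrete example, assuming the scikit-learn library is available, a regression network and a classification network could be set up like this (the tiny dataset is made up for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = np.array([0.0, 2.0, 4.0, 6.0])   # unbounded continuous targets
y_cls = np.array([0, 0, 1, 1])           # binary class targets

# Regression network: MLPRegressor uses a linear (identity)
# output activation, with sigmoid (logistic) hidden units.
reg = MLPRegressor(hidden_layer_sizes=(8,), activation="logistic",
                   max_iter=2000, random_state=0).fit(X, y_reg)

# Classification network: MLPClassifier applies a logistic
# output activation for binary targets.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y_cls)
```

Notice that the choice between regression and classification is made by picking the class, while the number of hidden units is a parameter you can tune.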
---
Let’s now review some issues with Neural
Networks.
A disadvantage of neural networks is that
the calculations that are captured in the
hidden layers are not easy to interpret.
For that reason, neural networks are considered
black boxes that work well in replicating
the desired target values, but they do not
provide much insight into how they accomplish
that.
Using gradient descent for weight updates
means the neural network training has the
limitations of the gradient descent algorithm.
The network is typically slow to train, especially
if we have a large network with many neurons,
and a large number of connections.
The weights can get stuck in local minima
and not improve the error beyond that.
To overcome these limitations, we can apply
the strategies that we discussed in gradient
descent, such as using an adaptive learning
rate, momentum, and random restarts.
A related problem that we have to watch out
for is overfitting.
Overfitting is when the network becomes very
good at replicating the targets for the samples
we train it on, but performs poorly for samples
that it has not seen.
---
To avoid overfitting, we can use a validation
set, in order to simulate the neural network’s
performance on unseen samples.
Given a dataset, we would set aside a small
subset as the validation set.
We would train the network and update the
weights based only on the error on the training
set.
As we are training the network, we would also
keep an eye on the error it produces for the
validation set.
During the first phase of training, we would
usually see an improvement in the error for
both the training and validation sets.
This is where the network is learning the
essence of the samples.
After a certain number of iterations, the
network will start overfitting to the training
set, and the error on the training set will
keep decreasing as we update the weights over
each epoch.
However, the error on the validation set,
which acts as an independent performance measure,
starts increasing.
This is where the network goes beyond the
essence of the problem and starts memorizing
the samples in the training set.
After we stop training, we would go back and
identify the epoch where the network had the
best validation error and use the weights
it had then as our final trained network.
The performance on the validation set can
also be used to stop training earlier, so
we don’t have to keep training the network
if it is no longer reducing the validation
error.
---
A similar approach can be used to control
the number of hidden units used in the network.
The overfitting problem happens when the network
is too smart and it can memorize the samples.
Using fewer hidden units reduces the capacity
of the network and makes it more likely to
learn the essence of the samples and less
likely to memorize individual samples.
However, we do not want to use so few hidden
units that the network is no longer able to
learn the desired target values.
We can again use a validation set to monitor
network’s ability to learn rather than memorize.
We can try networks with different numbers
of hidden units and use the one that gives
the best validation performance.
I want to point out that using a validation
set is not a substitute for a test set and
cross-validation, which I talk about in a
separate lecture.
Even though the validation set is not used
for updating the weights, it is used to make
decisions about training, and therefore we
are technically making use of those samples
for training.
You still need to check the performance of
the network on a separate independent test
set in order to convince people that your
network is able to generalize beyond the samples
used in training and validation sets.
---
I would like to mention some of the more advanced
forms of neural networks before I finish this
lecture.
Deep Learning has been quite popular in recent
years, especially in image processing applications,
such as facial recognition.
Deep Learning is essentially a neural network
with many hidden layers.
Because of the huge number of weight parameters
that need to be identified, deep networks
only work if you have a large amount of data
to train from.
The layers of a deep network are organized
such that lower layers learn local features
from the data, such as edges and colors in
an image, whereas higher layers learn more
abstract features.
---
An autoencoder network is where a network
is trained to reproduce the input samples.
So the target values and the input values
are the same.
Using a hidden layer with a small number of
units, the network is forced to learn a compressed
representation of the samples.
Auto-encoders are used in noise removal and
dimension reduction.
---
Sometimes auto-encoders are also used as a
first phase in training for a prediction problem.
After an auto-encoder network is trained,
its decompression layer is removed and replaced
with output neurons for a different prediction
problem.
For example, an auto-encoder that is trained
to compress animal pictures can then be repurposed
to make a prediction about whether a picture
is a dog or a cat.
In training for this new problem, we would
lock the autoencoder weights from changing
and only adjust the new neuron weights.
This strategy of taking a network that was
trained for one dataset and repurposing it
for a different problem is called transfer
learning.
Transfer learning is usually used when we
don’t have enough samples in our dataset,
so we use a network that is trained on a related
dataset.
---
To conclude this lecture, I would like to
say that neural networks have really survived
the test of time and are here to stay.
There are many different neural network architectures
that are designed for different applications,
and we have barely scratched the surface.
