Good morning. Today we come to the second
part of the lecture on Neural Networks, where we
will talk about Multilayer Neural Networks.
We have already looked at individual neural
units and discussed that they can represent
linear functions, but the main excitement
about neural networks is that they can represent
non-linear functions. And we can represent
non-linear functions by stacking layers of
perceptrons into different architectures.
So, today we will look at Feed Forward Multilayer
Neural Networks, which are a particular type
of connection pattern in neural networks.
Now, first of all we will look at the
limitations of perceptrons. We have
not really discussed this before, but perceptrons
have a monotonicity property.
So in a perceptron what we are doing is that
we have multiple inputs and a single output,
and we have weights associated with the inputs.
And because of this type of connection, perceptrons
have a monotonicity property: if a link has
positive weight, the activation can only increase
as the input value increases. A perceptron cannot represent
functions where input interactions can cancel
each other. Each input interacts with the neuron
individually, so it cannot
handle interactions between the inputs. For example, perceptrons can be used
to represent gates like AND. Suppose there
are two inputs x1 and x2 and you want to output
1 if both x1 and x2 are true. Then you
can set the weights and the threshold such
that the result will be 1 only when both x1 and
x2 are 1.
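As an illustration, such an AND gate can be sketched as a single threshold unit. The weights 1, 1 and the threshold 1.5 below are one possible choice, not the only one:

```python
# A sketch of a perceptron computing AND, assuming a simple step
# activation: output 1 if the weighted sum reaches the threshold.
def perceptron(x1, x2, w1=1.0, w2=1.0, threshold=1.5):
    """Fire (return 1) only when w1*x1 + w2*x2 >= threshold."""
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0

# With weights 1, 1 and threshold 1.5, only the input (1, 1) fires.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))
```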
However, a perceptron cannot represent the XOR
function. In the XOR function, suppose we have
two variables x1 and x2 which can take values 0
and 1. XOR is true when exactly one of x1 and x2
is 1, and false when both are 0 or both are 1.
So, the XOR function
is not linearly separable, it is not monotonic
in the individual inputs, and it cannot be
represented by a perceptron.
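To see this concretely, here is a small brute-force check (an illustrative sketch, not part of the lecture's argument): we search a grid of weight and threshold settings and confirm that no single threshold unit reproduces XOR on the four inputs:

```python
# Enumerate a grid of weights and thresholds for one threshold unit
# and check whether any setting realizes XOR. The grid range is an
# arbitrary illustrative choice; XOR fails for every setting because
# it is not linearly separable.
import itertools

def fires(w1, w2, theta, x1, x2):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = [i / 2 for i in range(-8, 9)]  # values -4.0, -3.5, ..., 4.0

found = any(
    all(fires(w1, w2, theta, x1, x2) == y for (x1, x2), y in xor.items())
    for w1, w2, theta in itertools.product(grid, repeat=3)
)
print(found)  # no setting in the grid realizes XOR
```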
So, a solution to this is to have a multilayer
neural network, where we have these units stacked
on each other. So, we have the inputs
x1, x2, x3, ..., xn, then a first layer which
we call Hidden Layer 1 with units
z1, z2, z3, ..., zl, and then
the output layer with units
y1, y2. The inputs feed forward
from the input layer to the hidden layer, and
from there to the output layer. So, we can have
a multilayer network where we have the input
layer and the output layer, and between the
input and the output we can have other units
which are called Hidden Units. Why hidden?
Because in the training examples they are
not observed; we observe only the input and the output,
so these intermediate units are called hidden units. And through
these hidden units we can represent many non-linear
functions.
For example, if you look at this picture,
we have x1 and x2, and each of z1 and z2 is
a linear function of x1 and x2, so each does
a linear separation in the x1-x2 space.
Then y is a function of z1 and z2, and if we
use suitable non-linear activation functions,
this sort of connection can represent
XOR or other non-linear functions.
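One standard such construction (an illustrative choice of weights, assuming step activations) builds XOR from an OR unit and a NAND unit feeding an AND unit:

```python
# XOR from two layers of step units: hidden units compute OR and NAND,
# and the output unit computes AND of the two. All weights and
# thresholds here are one illustrative choice.
def step_unit(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def xor_net(x1, x2):
    z1 = step_unit((x1, x2), (1.0, 1.0), 0.5)     # z1 = x1 OR x2
    z2 = step_unit((x1, x2), (-1.0, -1.0), -1.5)  # z2 = NOT (x1 AND x2)
    return step_unit((z1, z2), (1.0, 1.0), 1.5)   # y  = z1 AND z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

No single unit here is non-monotonic, but the composition cancels the two inputs against each other, which is exactly what a lone perceptron could not do.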
So, we can look at what
multilayer networks can express. We have seen
that single layer networks can represent linearly
separable functions. Multilayer networks can
express interactions among the inputs. In particular,
a 2 layer network, where you have 1 hidden layer and
1 output layer (the input layer is not counted;
the hidden layer and the output layer are the
computing layers), can represent any Boolean function,
and any continuous function within a tolerance,
provided of course you have the requisite
number of hidden units and you use appropriate
activation functions. If you have 3 layer neural
networks, then you can represent all computable
functions. So, these networks have very good
representation capacity.
But the next question is, is it learnable?
Just because a representation exists for
a function does not immediately mean that
you can learn the function well. For
neural networks like this, learning algorithms
do exist, but they have weaker guarantees.
For the perceptron learning rule we said that
if the function is representable, then the
procedure will converge. For multilayer neural networks
we cannot give such strong guarantees, but
algorithms exist and people are working on
many exciting kinds of algorithms.
So, let us look at the general structure of
a multilayer network. This is a 3 layer network
where there is the input, and then we have
the first hidden layer, the second hidden
layer, and the output. This is an example
of a layered feed forward neural network.
It is a feed forward neural network because
all the connections we have drawn are unidirectional:
input to first hidden layer, first hidden layer
to second hidden layer, second hidden layer to output.
All the edges are single directional, going forward
from the input to the output;
there is no back link.
So, this is why it is called a feed forward neural
network. It is called a layered network
because we have organized the neurons into
layers, and layer i is connected to layer
i + 1. Also, this particular diagram shows
a fully connected layered feed forward network,
where there are 2 hidden layers, 1 output
layer, and of course the input layer. So, in this
type of feed forward neural network the input
flows forward from the input to the output through
the hidden layers. Now, while talking
about perceptron training we said that, if we
observe an error in the output, between
what should be the ideal value and what is
computed, then we change the weights of the
connections so that the error is made smaller.
That is what we looked at in perceptron
training.
Here also we need to do the same thing. However,
here there is one difficulty. We know what
the ideal output should be at the output units,
so based on that we can change the weights
into the output layer. But at a hidden node the
ideal output is not told to us, it is not
known to us directly, so on what basis do
we update the weights into the hidden layer? At the
output we know the ideal value, so we can compute
the error directly and on that basis update those
weights. But we need to know what the error is at
the hidden units. So, what is done is that the error
observed at the output is propagated backwards. The
assumption is that the error at an output unit is
a result of errors at the hidden units feeding
into it, because those are the nodes providing its
inputs. Therefore, we back propagate the output
error to those hidden nodes, so that at each hidden
node we know its notional error, and based on that
error we update its incoming weights. If there were
more layers, the error at the hidden nodes would
again be back propagated to the nodes of the
earlier layers.
So, the error signal flows backward. The computation
in the network is forward, but the error signals
flow backward, and based on this computation of
the error signals we figure out how the weights
have to be changed. That is why the method
for updating weights in such multilayer
neural networks is called back propagation:
the input is going forward and the error
signal is going backward.
So, these are the steps in the back propagation
training algorithm. The first step is initialization:
we give the structure of the network and we
initialize the weights and biases in the network,
usually to small random values.
After initialization we do the forward
computation.
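A minimal sketch of this initialization step might look as follows; the range [-0.05, 0.05] and the layer sizes are assumed for illustration:

```python
# Initialize one layer's weights and biases to small random values.
# The scale 0.05 is an illustrative choice, not prescribed by the lecture.
import random

def init_layer(n_inputs, n_units, scale=0.05):
    """One weight vector per unit, plus one bias per unit."""
    weights = [[random.uniform(-scale, scale) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [random.uniform(-scale, scale) for _ in range(n_units)]
    return weights, biases

# Example structure: 3 inputs -> hidden layer of 4 units -> 2 outputs.
hidden_w, hidden_b = init_layer(3, 4)
output_w, output_b = init_layer(4, 2)
```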
Now what we do is that we take our training
examples, which comprise (x1, y1), (x2, y2), ...,
(xm, ym). We take the examples one by one and apply
them to the network. We apply x1 to the network
and we get some output y1 hat; similarly for x2
we get y2 hat, and for xm we get ym hat, for a
particular configuration of the network. The ideal
output for x1 is y1, and if y1 hat and y1 are
different, then we look at the error and based
on this we change the weights.
So, in the forward computation we apply the input
to the first layer; at each unit we first take the
summation and then apply the activation function,
and we get the outputs z1, z2, z3 at this layer.
Then based on that we compute y1, y2, etcetera,
where each y is computed as the activation function
applied to sigma wi zi, just as each z is the
activation function applied to sigma xi wi.
So, we do the forward computation.
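The forward computation just described can be sketched as follows, assuming sigmoid activations; the example weights are illustrative:

```python
# Forward pass through one layer: for each unit j,
# net_j = sum_i w_ij * o_i + b_j, then o_j = sigmoid(net_j).
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def layer_forward(inputs, weights, biases):
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Tiny example: 2 inputs -> 2 hidden units -> 1 output unit.
x = [1.0, 0.0]
z = layer_forward(x, [[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1])
y_hat = layer_forward(z, [[1.0, -1.0]], [0.0])
```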
After doing the forward computation, we
have to update the weights. As I said,
we can update the weights at the output layer
in a similar fashion as we did for single
layer units; for single layer sigmoid units
we have already worked out how to update the
weights. Suppose we call the set of weights
into the output layer w and the set of weights
into the hidden layer v. Then w is updated using
the sigmoid layer training rule which we discussed
in the last class.
But how to update v? As we said, we do
not know the target values of z1, z2, z3,
so what we do is back propagate
the errors which are visible at the output units
to the hidden units, and then based on that
error we apply a similar training rule. And
if there were more layers between the hidden
layer and the input, the errors could be further
propagated in that direction, further propagated
towards the input. So, error back propagation
can be continued if the net has more hidden layers.
Now, we have to see how to compute these
errors, and for that let us work through
the derivation.
Now, for the output neurons the error function
is E = (1/2) sum of (y - y_hat) squared,
and we can take the summation over all
the training examples if we wish. Now, in the
network we have different units, either at the
output or at the hidden layers, so let us take
any unit j. First of all, we have seen that in
a unit there are two components: one is the
summation component, the second is the activation
function component. First the summation is applied
and then the activation function is applied. Let
us call the output of the summation net_j, and
the final output of the unit o_j. So for
unit j, o_j = phi(net_j), and
net_j = sum over k = 1 to n of w_kj o_k,
where k ranges over all the inputs to unit j,
w_kj is the weight from unit k to unit j,
and o_k is the output of unit k; the output
of unit k is an input to unit j. Suppose
j is a node with three inputs 1, 2, 3; then
the weights are w_1j, w_2j, w_3j, the
corresponding inputs are o_1, o_2, o_3, and
from these the output of unit j is computed.
Now, if we want to find the partial derivative
of the error with respect to w_ij, by the chain
rule we get del E by del w_ij = (del E by del o_j)
times (del o_j by del net_j) times (del net_j by
del w_ij). So, to make the computation simple for
us, we have used the chain rule to write the
partial derivative of the error with respect to
w_ij as the product of these three partial
derivatives. Here j is my unit, i is one of the
units feeding into it, inside j we first compute
net_j and then the output o_j, and the output
of unit i is o_i.
Now, we have decomposed this partial derivative
into three components; let us do these three
computations separately and then put them together.
First let us look at del net_j by del w_ij.
What is net_j? net_j = sum over k of w_kj o_k.
So del net_j by del w_ij is the derivative of
(sum from k = 1 to n of w_kj o_k) with respect
to w_ij, which is simply o_i, because w_ij appears
only in the term o_i w_ij; the other terms
o_1 w_1j, o_2 w_2j, o_3 w_3j have nothing to do
with w_ij. So, del net_j by del w_ij = o_i.
This first part we have computed, and it is
equal to o_i. Let me clean this so that we can
work out the other two components of the expression.
Next let us take the second term, del o_j
by del net_j. What is o_j? o_j = phi(net_j),
so del o_j by del net_j is the derivative of
phi at net_j. We have already seen that this
derivative depends on the form of the
activation function. If you assume a sigmoid
activation function, then this is equal to
phi(net_j) (1 - phi(net_j)). So, this is
the second term, which we have computed
like this.
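We can sanity-check this sigmoid derivative numerically; the test point net = 0.7 is arbitrary:

```python
# Compare the analytic sigmoid derivative phi(net)*(1 - phi(net))
# with a central finite difference at an arbitrary point.
import math

def phi(net):
    return 1.0 / (1.0 + math.exp(-net))

net, eps = 0.7, 1e-6
numeric = (phi(net + eps) - phi(net - eps)) / (2 * eps)
analytic = phi(net) * (1 - phi(net))
print(abs(numeric - analytic) < 1e-9)  # the two derivatives agree
```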
Now, let us look at the third term, del E by
del o_j. For this we can get a recursive
expression for the derivative, as follows.
The error does not depend on o_j directly;
it depends on o_j through the units which are
downstream of j. So, this is my unit j; j has
certain inputs, which we have already seen,
but j also outputs to the next layer. Let the
output of j go to the inputs of a set of units,
which for simplicity let us denote 1, 2, ..., l.
These are the nodes to which the output of j
is feeding. During back propagation, the error
at o_j is based on the errors at net_1, net_2,
..., net_l. Since E depends on o_j only through
net_1, net_2, ..., net_l, we can write, by the
chain rule, del E by del o_j = sum over l of
(del E by del net_l) times (del net_l by del o_j).
So, this error is coming from the downstream
units, and I am taking the summation over them.
For each such node l we have net_l and then o_l,
and the output o_j contributes to net_1,
net_2, net_3, and so on.
Now, we can expand del E by del net_l as
(del E by del o_l) times (del o_l by del net_l),
since the output of unit l is o_l. And for
del net_l by del o_j: net_l depends on o_j
through the term w_jl o_j, so del net_l by
del o_j = w_jl. Putting these together,
del E by del o_j = sum over l of
(del E by del o_l)(del o_l by del net_l) w_jl.
So, we can compute this derivative of the error
with respect to o_j if all the derivatives with
respect to the outputs of the next layer are
already computed.
Now, we can write del E by del w_ij = delta_j o_i,
where o_i is the output of the unit feeding the
weight, and delta_j is the error term at unit j:
delta_j = (del E by del o_j) times
(del o_j by del net_j). This delta_j corresponds
to the error which is computed at unit j, and it
is treated in different ways depending on whether
the unit is an output unit or an intermediate unit.
If it is an output unit, we can use the formula
which we have already seen, so it will be
delta_j = (o_j - t_j) o_j (1 - o_j), where t_j is
the target output at that node. This is if j is an
output neuron. If it is an intermediate neuron, it
gets the errors, the delta values, from the next
layer as we have written here, so
delta_j = (sum over l in the set z of delta_l w_jl)
times o_j (1 - o_j), if j is an inner neuron.
So, for an output neuron delta_j is given by the
first formula, while for an intermediate neuron
it comes from the second formula: delta_j is the
sum over z, where z is the set of units to which
this unit feeds its output, of the delta values
computed at those nodes times w_jl, all multiplied
by o_j (1 - o_j). This last factor, o_j (1 - o_j),
comes from the sigmoid activation function; if you
change the activation function, this part will be
a little different.
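The two delta formulas can be sketched as follows, assuming sigmoid units; the numbers in the example are illustrative:

```python
# delta_j for an output neuron: (o_j - t_j) * o_j * (1 - o_j).
def delta_output(o_j, t_j):
    return (o_j - t_j) * o_j * (1 - o_j)

# delta_j for an inner neuron: (sum_l delta_l * w_jl) * o_j * (1 - o_j),
# where downstream is a list of (delta_l, w_jl) pairs for the units l
# that unit j feeds into.
def delta_hidden(o_j, downstream):
    back = sum(d_l * w_jl for d_l, w_jl in downstream)
    return back * o_j * (1 - o_j)

# Example: an output unit with o_j = 0.8 and target 1.0, fed by a
# hidden unit with o_j = 0.6 over a weight of 0.5.
d_out = delta_output(0.8, 1.0)             # approx -0.032
d_hid = delta_hidden(0.6, [(d_out, 0.5)])  # approx -0.00384
```

Each weight is then updated in proportion to delta_j times the output o_i of the unit at the other end of the weight, as the expression del E by del w_ij = delta_j o_i indicates.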
So, we will stop the class here today, and in the
next class we will look at how this is incorporated
into the back propagation algorithm.
Thank you very much.
