Hey everyone! Welcome back to the neural networks lecture series. In this lecture, we are going to discuss the backpropagation algorithm. Backpropagation is used to learn the weights of a multi-layer neural network with a fixed architecture. What it basically does is perform gradient descent to minimize the sum squared error between the network's output values and the given target values. So, loosely speaking, we can say that the backpropagation algorithm helps us calculate the gradient of the cost function.
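Just to make the connection to gradient descent concrete before we begin, here is a tiny Python sketch of the weight update step that uses the gradient backpropagation gives us. The names (weights, grad, learning_rate) and all numbers are made up for illustration; this is only the update rule, not the algorithm itself.

```python
# Minimal gradient descent update, assuming backpropagation already gave us the gradient.
learning_rate = 0.1

def gradient_descent_step(weights, grad, learning_rate):
    # Move each weight a small step against its partial derivative of the cost.
    return [w - learning_rate * g for w, g in zip(weights, grad)]

# Example: two weights and their (hypothetical) partial derivatives of the cost.
weights = [0.5, -0.3]
grad = [0.12, -0.04]
weights = gradient_descent_step(weights, grad, learning_rate)
print(weights)  # roughly [0.488, -0.296]
```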
Let's see how this works with the help of
a simple neural network. So consider
that this is a neural network used to
classify the input pattern into two
classes: class A and class B. Let's say that this is the neuron dedicated to class A and this is the neuron dedicated to class B. That is,
when a class A input comes, it is this
neuron that gets activated. Now
consider that initially the network is
not properly trained. So when we provide
a class A input, the activation of the
neuron corresponding to class A is only
0.4. So what we need to do is increase the activation over here. Note that I represent positive correlation between neurons with white synaptic connections and negative correlation with red synaptic connections. What I mean by positive correlation is that when the
activation of this particular neuron
increases, since these two are positively
correlated, the activation of this neuron
also increases; similarly, since these two neurons are negatively correlated, when the activation of this neuron increases, it will in turn decrease the activation of this output neuron. So
in order to increase the activation
value of this output neuron to the
desired value of 1, we should increase
the activation of this neuron, increase
the activation of this neuron, but
decrease the activation of this neuron.
Let's mark it over here. But this is
not the entire picture. We also have
another output neuron corresponding to class B
and let's say that the activation
of B is 0.5 when we give an input
corresponding to class A. So we need to bring down the activation of this output neuron to the desired value of 0. Let's
consider that these two neurons are
negatively correlated, this is positively
correlated and these two are negatively
correlated. So in order to bring
down the activation of this, we need to
increase the activation of this neuron,
decrease the activation of this neuron, and increase the activation of this neuron. Let's mark that there as well. Now the effective change in activation for these neurons, that is, the neurons in this hidden layer, can be found by summing up the changes requested for each of them by the two output neurons. So if we take the case of this
particular neuron, we can see that we
need to increase the activation of this
neuron so as to increase the activation
over here and decrease the activation
over here. But we can't directly change the
activation of this neuron in order to
change the activation of this neuron. We
should change the synaptic weights connected to this neuron. So we can see that, in order to change the activation of this neuron, we need to tweak the values of these two synaptic weights, which in turn depend upon the activations of these two neurons. So, in effect, we can see that the desired changes propagate backwards through the network. It is in this context that the term "backpropagation" becomes relevant. I hope you have understood this part clearly.
Let's start on the mathematics behind the backpropagation algorithm by taking an
extremely simple network, that is, one
where each layer has only a single
neuron in it. Here there is only
one neuron, in this layer also one neuron,
and in this layer also one neuron. So
we can say that the output of this
particular network is determined by the weight over here and the weight over here, and our goal is to understand how
sensitive this output is to these
variables. That way we'll get to
know which adjustments to these terms
decrease the cost function most
efficiently. Now let's label the layers. I am labeling this layer as L, this layer as L-1, and this layer as L-2. Now follow along carefully: let's denote the activation of this last neuron as y^L and that of the previous neuron as y^(L-1). Let this synaptic weight be denoted as w^L. Note that these are not exponents, just indices attached to the notation. Let the desired value
for the output neuron be d. So the cost function of this simple network, for a single training sample, is:

E = (y^L - d)^2

Similarly, we can write y^L as some activation function f of the weighted input:

y^L = f(w^L · y^(L-1))

For simplification purposes, I am writing:

v^L = w^L · y^(L-1)

Therefore this equation becomes:

y^L = f(v^L)
In brief, we can say that the weight w^L and the activation of the previous neuron, that is, this neuron, are used to calculate v^L, which is used to calculate y^L, which along with d is used to calculate the cost function E. So you can see that a tiny change in w^L causes a change in v^L, which causes a change in y^L, which in turn causes a change in the cost function. You can also see that y^(L-1) itself is affected by the weights and activations of the previous units.
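As a small aside, here is a minimal Python sketch of this forward chain for the one-neuron-per-layer network. The sigmoid activation and all numeric values are assumptions made for illustration; the lecture does not fix a particular activation function.

```python
import math

def sigmoid(v):
    # One possible choice of activation function f; assumed for illustration.
    return 1.0 / (1.0 + math.exp(-v))

# Assumed example values: previous activation, weight, and target.
y_prev = 0.6     # y^(L-1)
w_L    = 1.5     # w^L
d      = 1.0     # desired output

v_L = w_L * y_prev        # v^L = w^L * y^(L-1)
y_L = sigmoid(v_L)        # y^L = f(v^L)
E   = (y_L - d) ** 2      # cost for this single training sample
print(v_L, y_L, E)
```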
Now our first goal is to understand how sensitive the cost function is to changes in w^L. That is, we need to calculate:

∂E/∂w^L

But in order to find this value, we first need to understand how v^L is affected by w^L. So we'll write:

∂v^L/∂w^L = y^(L-1)

Now, since we know how v^L is affected, we can calculate how y^L is affected:

∂y^L/∂v^L = f'(v^L)

Finally, we can calculate how E changes with respect to y^L:

∂E/∂y^L = 2 (y^L - d)

Therefore, chaining these together, the final expression can be written as:

∂E/∂w^L = (∂v^L/∂w^L) · (∂y^L/∂v^L) · (∂E/∂y^L) = y^(L-1) · f'(v^L) · 2 (y^L - d)
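Continuing the small sketch from before, here is how this chain-rule expression could be checked numerically against a finite-difference estimate. Again, the sigmoid activation and all values are assumptions made only for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

y_prev, w_L, d = 0.6, 1.5, 1.0   # assumed example values

# Analytic gradient from the chain rule: dE/dw^L = y^(L-1) * f'(v^L) * 2 (y^L - d)
v_L = w_L * y_prev
y_L = sigmoid(v_L)
grad_analytic = y_prev * sigmoid_prime(v_L) * 2.0 * (y_L - d)

# Finite-difference check: (E(w + h) - E(w - h)) / (2h)
def cost(w):
    return (sigmoid(w * y_prev) - d) ** 2

h = 1e-6
grad_numeric = (cost(w_L + h) - cost(w_L - h)) / (2 * h)

print(grad_analytic, grad_numeric)  # the two values should agree closely
```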
Keep in mind that what we calculated over here is based on the cost function for just a single training sample. The full cost function over all training samples can be found as the average of the cost functions for the individual training samples.
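Written out explicitly (a small clarification: here n denotes the number of training samples and E_i the cost of the i-th sample, symbols the lecture does not introduce itself; the second equality just uses the linearity of the derivative):

```latex
E_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} E_i,
\qquad
\frac{\partial E_{\text{total}}}{\partial w^L}
   = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial E_i}{\partial w^L}
```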
Again, note that this is just one component of the gradient vector, which consists of the partial derivatives of the cost function with respect to all the weights in the network. For example, if you consider the simple network we looked at here, we can write:

∇E = [ ∂E/∂w^(L-1) , ∂E/∂w^L ]
Why only two components?
Because there are only two synaptic
weights in this neural network. Keep in mind that in actual practice the gradient vector will be really large, because an actual neural network consists of many hidden layers and many synaptic connections between the neurons in those layers. Now one might ask what
would happen to this equation
as we increase the complexity of the
neural network. So let's probe into that.
Here is a neural network which is
slightly more complex than the one we considered before. Here let me label this layer as layer L and this layer as layer L-1. I will use the subscript k for the output neurons and j for the input neurons, and in a similar fashion the weight between layers L and L-1 is denoted as w_kj^L. If you look at how the equations get modified, you can find that:

v_k^L = Σ_j w_kj^L · y_j^(L-1),   y_k^L = f(v_k^L)

where j ranges from 1 to the number of neurons in layer L-1. The cost function gets modified as:

E = Σ_k (y_k^L - d_k)^2

where k ranges from 1 to the number of neurons in layer L. Now, if you do the math, you can see that the equation doesn't change at all, except for the fact that you add some subscripts.
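To see this concretely, here is a short Python sketch of the subscripted version for one layer. Since each weight w_kj^L only affects v_k^L, the per-weight gradient is ∂E/∂w_kj^L = y_j^(L-1) · f'(v_k^L) · 2 (y_k^L - d_k), exactly the earlier expression with subscripts added. The sigmoid activation and all numbers are assumptions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

# Assumed example: 3 neurons in layer L-1 (indexed by j), 2 in layer L (indexed by k).
y_prev = [0.2, 0.7, 0.5]                 # y_j^(L-1)
W = [[ 0.8, 0.6, -0.7],                  # w_kj^L: row k = output neuron, column j = input neuron
     [-0.5, 0.9, -0.4]]
d = [1.0, 0.0]                           # desired outputs d_k

# Forward pass: v_k^L = sum_j w_kj^L * y_j^(L-1),  y_k^L = f(v_k^L)
v = [sum(W[k][j] * y_prev[j] for j in range(len(y_prev))) for k in range(len(W))]
y = [sigmoid(vk) for vk in v]

# Cost for this single sample: E = sum_k (y_k^L - d_k)^2
E = sum((y[k] - d[k]) ** 2 for k in range(len(d)))

# Gradient: dE/dw_kj^L = y_j^(L-1) * f'(v_k^L) * 2 (y_k^L - d_k)
grad = [[y_prev[j] * sigmoid_prime(v[k]) * 2.0 * (y[k] - d[k])
         for j in range(len(y_prev))]
        for k in range(len(W))]
print(E)
print(grad)
```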
That's all for this lecture. I hope everyone understood it clearly. If you have any doubts, please ask in the comment section. Thanks for watching! Have a nice day :)
