So, welcome. We will be continuing from where we left off in the last lecture, on going from multi layer perceptrons to deep neural networks. Up to the last lecture we had a good amount of briefing and a recap of the learning rules: how to create a single perceptron model, then a whole collection of perceptrons in matrix form for representing the data; from there going on to the gradient and what happens with a gradient based learning rule; and then trying to come to the point of how we calculate the gradient of the output itself, that is ∂J(w)/∂w, how it gets broken down into partial products, and how, using these partial derivative products, you can find the total derivative of the network itself.
From there we will be entering into the multi layer perceptron model of a deep neural network and how it works. Just to do a brief recap of where we left off in the last class: that was the gradient computation part. Say I have three scalars x1, x2 and x3, and I would like to map them to another predicted scalar, which is my p̂. The way this network was constructed is that I had three weights w1, w2 and w3, all put into a linear summation block, together with another added component called the bias. Taking all of these linear summations and the bias together, you get an output called y, and that is mapped through a nonlinearity to your predicted output p̂. Now, while we have done the forward pass, which is how to get from x to p̂, the question we had raised is how to get the derivative, and for that we take a product of different partial derivatives.
You take the derivative of the cost function with respect to the predicted output, ∂J(w)/∂p̂; then you take the derivative of the predicted output with respect to y, the output of the summation block, ∂p̂/∂y; and you take the derivative of y with respect to w, ∂y/∂w, which is the derivative of the linear part of the network. These were the three different parts where we left off in the last class.
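To make that concrete, here is a minimal sketch in Python with NumPy of the three-factor chain rule for a single sigmoid neuron with a squared-error cost. The variable names follow the notation above; the specific numbers are my own placeholders, not values from the lecture.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, -1.2, 3.0])   # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.2])   # weights w1, w2, w3
b = 0.3                          # bias
p = 1.0                          # ground-truth target

# forward pass: linear summation block, then the nonlinearity
y = np.dot(w, x) + b
p_hat = sigmoid(y)

# cost: J(w) = 1/2 (p - p_hat)^2
J = 0.5 * (p - p_hat) ** 2

# the three partial-derivative factors
dJ_dphat = -(p - p_hat)           # derivative of the cost w.r.t. p_hat
dphat_dy = p_hat * (1.0 - p_hat)  # derivative of the sigmoid w.r.t. y
dy_dw = x                         # derivative of the linear part w.r.t. w

# total derivative by the chain rule
dJ_dw = dJ_dphat * dphat_dy * dy_dw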
Now, the point is that this kind of computation holds true for just one single neuron. The next point is: what if it is not just one single neuron, but a collection of neurons, something which is a deep neural network? So one bunch of neurons in one layer, then another bunch of neurons in another layer, then another bunch in yet another layer: typically what you call a multi layer perceptron, due to its multiple layers. That is exactly what we are speaking about. I have my bunch of inputs x, which connects to a set of intermediate nodes; that connects to another set of intermediate nodes, and that subsequently to another set, and finally you get your final predictor, p̂. Each of these intermediate sets of nodes you connect through is what is called a hidden layer.
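Before we get to derivatives, a hedged sketch of that stacked forward pass in NumPy may help; the layer sizes and the choice of a sigmoid at every layer are assumptions for illustration, not fixed by the lecture.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)

# assumed sizes: 3 inputs, two hidden layers of 4 nodes each, 1 output
sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    z = x
    for W, b in zip(weights, biases):
        y = W @ z + b   # linear summation block of the layer
        z = sigmoid(y)  # nonlinear transfer function; z is the hidden-layer output
    return z            # final predictor p_hat

p_hat = forward(np.array([0.5, -1.2, 3.0]))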
Now, we did find out how to get the derivative when I have just one single neuron connecting my inputs to my classification neuron. Here too you will need the derivative, in its own way; but to get it, you can see that there is clearly not just one single connection from the inputs to my output. The signal passes through multiple nonlinear transformations along the depth itself, and that is where the major question comes in. What we do in that case is something slightly tricky, so let us look into a small part of the network.

Say I am looking at one particular layer, called the d-th layer. For my d-th layer, I can write the gradient as a product of partials: ∂J(w)/∂p̂, then ∂p̂/∂y, the derivative at the output, and that can be extended with ∂y_d/∂w_d, the derivative of the linear summation that feeds that particular block; here w_d are the weights connecting the d-th layer to the output. Now, if I go one layer before it, to the (d−1)-th layer, what we see is that ∂y_d/∂w_{d−1}, the derivative of the linear part of this block with respect to the weights connecting the two blocks, can itself be written as a partial product: the derivative of y_d with respect to the output z of the earlier block, times the derivative of that earlier block with respect to its own weights. So this is my d-th layer, whose weights connect to my target output layer; this is my (d−1)-th layer, with connections w_{d−1} going into it, and that is where the expansion happens. I keep repeating this whole thing along the chain, and the final factor I get is ∂y_1/∂w_1, at my first layer, and that incidentally is equal to whatever my inputs are. If you just look at the matrix form of the representation of y in terms of w and x, you have a linear dot product of the weights and x, which gives rise to the output y; so obviously the derivative of the linear part of the output with respect to the weights is simply the input to it, which is my x. This is the typical way in which we now calculate the whole network's gradient.

So if I want the total network to be solved out, this is exactly what I would be doing in my calculations. Whenever I do not know what the input coming into a layer is, I differentiate that layer's output with respect to the layer before it, and that is what the chain rule keeps doing: every time you have a ∂y_d/∂w term, you keep going back to the next previous layer. Each such block gets represented again as a product of two partial derivative multipliers, those two keep expanding in the same way, and the final point where it stops is ∂y_1/∂w_1, which equals my input x. I believe this part is quite clear to you, and quite intuitive; it is actually not so hard to calculate.
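Written end to end, the expansion just described can be put on one line. This is a reconstruction in LaTeX under the naming used above, with z_k = f_nl(y_k) denoting the output of the k-th layer:

\[
\frac{\partial J(w)}{\partial w_1}
= \frac{\partial J}{\partial \hat{p}}
  \cdot \frac{\partial \hat{p}}{\partial y_d}
  \cdot \frac{\partial y_d}{\partial z_{d-1}}
  \cdot \frac{\partial z_{d-1}}{\partial y_{d-1}}
  \cdots
  \frac{\partial z_1}{\partial y_1}
  \cdot \frac{\partial y_1}{\partial w_1},
\qquad
\frac{\partial y_1}{\partial w_1} = x .
\]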
Now, the next point is that I have my final form of the whole derivative, going down something like this. The first part of it is the derivative of my cost function, ∇J(w), the gradient of the cost. The second part is the derivative of the nonlinear transfer function. The other part is the derivative of the perceptron itself, and these two together are what is called the derivative of the network. And finally there is the input to the network, which is my x.
These together constitute any sort of learning mechanism within a multi layer perceptron, or any kind of deep neural network. What you will have to do is find the derivative of your cost function; find the derivative of the network, which consists of two parts, the derivative of the nonlinear transfer function and the derivative of the perceptron together; and then there is the input, which is always known to you, because you are just pushing the input into the network. By solving out this complete derivative we get our neural network learning algorithm in terms of gradient descent, and that is how it goes. That brings us to a very important aspect here, the existence criterion. What this essentially means is that for the total derivative of the cost function with respect to the weights to exist, every single factor of the derivative must exist, every single part of the chain rule expansion we were doing.
Only if every single component of the chain rule exists does the total derivative of the network exist as well, and only then can you learn on it properly. My first component, ∂J(w)/∂p̂, is the derivative of the cost function, and that part should exist. So let us take a very simple example with two cost functions: the first one is the L2 norm, the second one is the L1 norm, and let us see whether their derivatives exist. I will give you a moment to ponder this and really contemplate whether you think the derivative exists for each of them. What you can do is quite simple: just take ∂J/∂p̂ and find out whether the derivative can exist or not. Interestingly, it definitely does exist for the first case, the Euclidean norm. That is not so hard to see, because for any L2 norm the existence of a derivative is a pretty straightforward case. Now you have the second one, and this is where the fun is: do you think the derivative of this one exists as well? Just take a few seconds here while I wait. If you look at this one carefully, the derivative does not exist everywhere, and one of the reasons is that you have a kink at zero. You have an L1 norm, just a modulus, |p − p̂|: the function itself is always non-negative and continuous, but its derivative jumps at p − p̂ = 0, and that is the reason why this direct absolute summation is not differentiable everywhere.
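As a reconstruction of the two cases in LaTeX (the 1/2 scaling on the L2 cost is my own convention for convenience, not necessarily the one on the slide):

\[
J_{L2} = \tfrac{1}{2}\,(p - \hat{p})^2
\;\Rightarrow\;
\frac{\partial J_{L2}}{\partial \hat{p}} = \hat{p} - p
\quad \text{(exists for every } \hat{p}\text{)}
\]
\[
J_{L1} = \lvert p - \hat{p} \rvert
\;\Rightarrow\;
\frac{\partial J_{L1}}{\partial \hat{p}} = \operatorname{sign}(\hat{p} - p)
\quad \text{(undefined at } \hat{p} = p\text{)}
\]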
Now let us look into the other part: the derivative of the nonlinearity, and with it the derivative of the rest of the network. One important point here is that the derivative of the nonlinear transfer function f_nl must also exist, otherwise your ∂p̂/∂y will not exist. The ∂y/∂z terms do not have any issue, because they just keep expanding over and over, but ∂p̂/∂y, and likewise every ∂z/∂y, needs to exist at every single nonlinearity, wherever you put one.

So let us take two different cases of nonlinearities. The first one is a sigmoid function as the nonlinearity; the second one is one over the modulus, something like the inverse of the absolute value of the input which goes into the unit. The question is: does the derivative of each of them exist? Let me give you some time; you can calculate the derivative of the first function, your sigmoid nonlinearity. I will give you a couple of seconds to calculate this part. If you look carefully, what you see is that for the first case, the sigmoid nonlinearity, the derivative does exist, and it is a perfectly differentiable function; whereas the second one is again something which is not differentiable, because of the discontinuity at x = 0.
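For the sigmoid, the derivative even has a convenient closed form, σ'(x) = σ(x)(1 − σ(x)). Here is a small illustrative check in Python, not from the lecture, comparing that closed form against a numerical central difference:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
h = 1e-6

closed_form = sigmoid(x) * (1.0 - sigmoid(x))              # sigma'(x)
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)  # central difference

# the two agree everywhere because the sigmoid is smooth;
# for a 1/|x| nonlinearity the derivative blows up at x = 0
assert np.allclose(closed_form, numerical, atol=1e-6)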
These are two really interesting facts, because remember, in the first week's lecture, where we were discussing neural networks and how to build them, you were exposed to the concept of a nonlinear transfer function, as well as its different properties. One property was definitely to make it bounded in some form, but we also mentioned that there are other properties, and one of the important ones is that the transfer function itself (not the cost function, the transfer function) needs to be differentiable in its own right. The question of why it needs to be differentiable is answered here, and this is one of the major reasons: until and unless the transfer function is differentiable end to end, the derivative of the complete network cannot exist. That is the reason why you cannot take transfer functions like signum of x. A signum function is +1 for any value greater than zero, −1 for any value less than zero, and is set to zero at the origin to make it precise; but at x = 0 you still see a discontinuity in the function, and that is why such functions cannot be used: they are not differentiable, and they cannot comply with the way a neural network has to learn in the reverse direction.
That brings us to the point of trying to visualize our learning rule itself. Say I have a multi layer perceptron: I have my inputs x, which connect through intermediate weights w_1 up to w_d and go down to my final output, p̂. The way we proceed is of this sort. The first step is called the forward pass of the network, and this is similar to what you did in your laboratory classes last week, where you were learning how to actually train a perceptron model; we had used a similar concept there for classification purposes as well. So the first step is a forward pass of x in order to obtain your p̂.

Now that you have done your forward pass and obtained some p̂ within a particular epoch, the next point is that I need to compute my J, the cost function. The way of computing this J is that I have my predicted output p̂, I have the ground truth of my predictors, p, and I need to find the Euclidean distance between them; that is the straightforward way. The Euclidean distance is not the only way of defining a cost function, though we are sticking to this very basic form here, because you have seen that your cost function also needs to be differentiable, and if it is not differentiable, the whole thing does not work. We will be going into more detailed cost functions a bit later in this course, where we will bring in very specific cost functions designed for classification, and very specific cost functions designed for regression problems. They will eventually come, but as of now, let us stick to the very basic form of the Euclidean cost function.
The next part is that you compute whatever cost comes out for a given epoch and whatever combination of weights you have. Once you have your cost function, you find its gradient, ∂J/∂p̂; that is the first part, ∇J, the gradient of J. Once I have my gradient of J, the next part is to compute the gradient of the network, ∇net, which is the gradient of p̂ with respect to the y's, and of the y's with respect to the w's. The intermediate points for all of these hidden layers give me some outputs called z, so within my ∇net the derivative of p̂ with respect to y and w gets remolded in terms of the outputs z with respect to the y's, and this keeps going down the line. Once I have this part computed too, and it is very straightforward to compute by solving out those sets of equations, here is where the utility of most deep learning libraries comes in: you do not need to explicitly sit down and calculate these with pen and paper, nor do you have to write a separate bunch of code to implement them. If you are using very standard forms of cost functions and very standard nonlinear transfer functions, then calculating these two parts, ∇J and ∇net, is a very straightforward activity for any of the standard deep neural network libraries. The library we are going to use from the next class onwards is called PyTorch, and it has very standard ways of doing this.

There you would typically be doing something like this: you do a forward pass and get your output; from there you calculate your error in terms of the cost; from there you find the gradient of the cost; next you compute the gradient of the network; and once you have the gradient of the network, you do an update of w. This update happens in the reverse direction, as you saw with the arrows: once you invoke the update, it goes down updating all the weights from the output side to the input side together. Then you continue again with step number one, which is to do a forward pass of x, obtain p̂, and repeat all the steps together. This keeps on continuing while your cost function J is above a certain threshold value, epsilon; the moment you go below the threshold, you stop. You can set this threshold empirically: it can be something like 10^-3, 10^-4, 10^-5, or even smaller, and it can be a larger value as well, say one or two, something of that sort. It all depends on the problem you are handling.
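Putting those steps together, here is a minimal sketch of the loop in Python with NumPy for the single-neuron case from earlier in the lecture; the learning rate eta, the data values and the stopping threshold epsilon are all assumptions for illustration.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, -1.2, 3.0])  # input
p = 1.0                         # ground truth
w = np.zeros(3)                 # weights
b = 0.0                         # bias
eta, epsilon = 0.5, 1e-4        # learning rate and stopping threshold

while True:
    # step 1: forward pass of x to obtain p_hat
    y = np.dot(w, x) + b
    p_hat = sigmoid(y)
    # step 2: compute the cost J (Euclidean form)
    J = 0.5 * (p - p_hat) ** 2
    # step 3: gradient of the cost, grad J
    dJ_dphat = -(p - p_hat)
    # step 4: gradient of the network, grad net
    dphat_dy = p_hat * (1.0 - p_hat)
    # step 5: update w (and b) in the reverse direction
    w -= eta * dJ_dphat * dphat_dy * x
    b -= eta * dJ_dphat * dphat_dy
    # repeat from step 1 until the cost falls below epsilon
    if J < epsilon:
        break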
I often get the question of how we choose this value of epsilon. That is something we will be covering subsequently in the later classes, by coming to an understanding of how it depends on the data, on the architecture of the network, on the kind of nonlinear transfer functions you use, and on the nature of the cost function you are using. Certain kinds of cost functions have a large dynamic range: some have values in the range of one to ten, some in the range of a hundred to ten thousand, and some have values on the order of 10^-3 or 10^-4. So this epsilon has to be chosen based on the nature of the cost function and of the nonlinear transfer function you are using, and before that it is really hard to say, until and unless we get into it. As we enter into these very practical experiments we will become familiar with them one by one, and in the lab sessions we will understand how they are to be used. And finally, the end is that once you are below your threshold of error, you can just stop the whole learning process.
What typically happens is that within your learning, while you are writing your code, you come across some very well known terms. Say net is basically a pointer: a variable that points to the data structure defining the neural network. Then net:forward is what executes the forward part, to get your p̂. Next is criterion:forward, which evaluates the output of the cost J. The next part is criterion:backward. The parameters which go as arguments here are not indicated on the slide; those are things we will be studying in the lab classes, but it is not so hard to guess what they would be. If I am doing a net:forward, the parameter it needs is just this x, and its output will obviously be p̂. Now, if I want to do a criterion:forward, I will have to give it p̂ and the actual state of p, the ground truth; these two are the inputs to the criterion function. Next, the gradient of the criterion itself, the gradient of the cost function, is evaluated using the backward operator, criterion:backward; what it needs is my predicted output and the target, from which it gives the gradient of the cost. The next one is net:backward, and this needs the input and that gradient together; it is not so hard to work out, because if you go back to the equations on the earlier slides, you will see from there how this relates. Once this is done, the next part is to update the parameters of the network. The inputs here are the pointer to the network, net, and the two gradients calculated by those backward operators; together they update w following the update rules. This updates all of the weights, not just one of them: each weight goes in a subsequent fashion, this one updates, then this one, then this one, and that is what finishes off what is called one single iteration, or one epoch. Eventually you can trace your error and then either decide to stop or continue.
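Since the labs will use PyTorch rather than the older Torch-style net:forward and criterion:backward calls, here is a hedged sketch of what the same loop typically looks like in PyTorch; the network shape, learning rate, data and threshold are placeholders of my own choosing.

import torch
import torch.nn as nn

# placeholder network: 3 inputs -> 4 hidden -> 1 output, sigmoid nonlinearities
net = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.MSELoss()  # Euclidean-style cost
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

x = torch.tensor([[0.5, -1.2, 3.0]])  # input
p = torch.tensor([[1.0]])             # ground truth

for epoch in range(10000):
    p_hat = net(x)              # forward pass (the role of net:forward)
    J = criterion(p_hat, p)     # cost (the role of criterion:forward)
    optimizer.zero_grad()
    J.backward()                # gradients of cost and network (the backward calls)
    optimizer.step()            # update w across all layers
    if J.item() < 1e-4:         # stop once below the threshold epsilon
        break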
This brings us almost to the end of trying to understand a deep neural network from the multi layer perceptron model. At the end of it, what I would really suggest is that you do much more detailed reading from the textbook Neural Networks and Learning Machines by Simon Haykin. On the toolbox side, the simplest thing a lot of people get started with for neural networks is the Neural Network Toolbox within MATLAB for pattern recognition, the nprtool; you can alternatively look into Theano and scikit-learn in Python, and Lua for Torch. We will be doing the labs with PyTorch, which is basically a very recently ported version of Torch that works within Python environments; it makes things much easier because of the rest of the scripting ecosystem available within Python, and it is based on what is called a dynamic graph based architecture, which allows us to compute these forward passes and gradients much more easily, in a computationally attractive form. Finally, if you are really into getting a deeper hands-on understanding, the typical suggestion is that these networks do not tend to work so well on laptops, and you might experience heating problems as well; the best option is to get a custom workstation. A very competitively built workstation can be put together with a GTX 1060, a GTX 1080 or a 1080 Ti, and in Indian rupees this would cost less than about one lakh rupees for a complete desktop rig and setup; you can get much more professional ones with the Titan X or Tesla versions. There are, again, multiple such toolboxes available, and just for revision, these are the different reading sources you can read from, and these are the two major conferences where you would see most of the advancements in the field coming out. With that, we come to an end of deep neural networks and multilayer perceptrons on the lecturing and theory side. In the next class we will be doing a lab session where you will get a good walkthrough and get to execute it on your side, as well as on remote clusters to which you can gain access under certain kinds of academic licenses.
