Hello everyone.
In this video,
I'll walk you through a simple example of
how you can calculate the derivative of the
sigmoid function, and why a particular form
of that derivative is used in deep learning.
So let's get started.
I'm assuming everyone's aware of
what a sigmoid function is.
The sigmoid function is defined as
1 by 1 plus e to the power minus x.
The sigmoid function is widely used in algorithms
such as logistic regression, and it also finds
a special place in
deep learning algorithms.
Let's visualize the sigmoid function.
This is what the sigmoid function looks like.
It is an S-shaped curve wherein the values
of y, that is your output, range between 0 and 1.
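As a minimal sketch of the definition above, the sigmoid can be written in a few lines of Python. The function name `sigmoid` and the sample inputs are illustrative choices, not something shown in the video, and only the standard library is assumed.

```python
import math

def sigmoid(x: float) -> float:
    """Return 1 / (1 + e^(-x)), the sigmoid of x."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative inputs: large negative values approach 0,
# large positive values approach 1, and x = 0 gives 0.5.
for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(x, sigmoid(x))
```

Every printed value lies between 0 and 1, which matches the S-shaped curve described above.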
Now, let's proceed to the main task that we have
set for this video, which is to find
the derivative of the sigmoid function.
The first step before we find the derivative
is to work out which rule of differentiation
we can apply to the given sigmoid function.
As we can clearly see, you have a numerator term
in the sigmoid function, which is 1, and you
have a denominator term in the sigmoid function,
which is 1 plus e to the power minus x.
When you try to find the derivative of
a term having a numerator and a denominator,
we have something called the quotient rule to our rescue.
The way you find the derivative
of a fraction having a numerator and a denominator
using the quotient rule is this: you take the denominator
term, multiply it by the derivative of the
numerator with respect to x, and keep it as
the first term of your new numerator.
From that first term you subtract
the product
of the numerator term as it is and the derivative
of the denominator term; that is what the
second term implies here. Then you divide the
difference of the two terms
by the
square of the denominator.
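Written as a formula, the rule just described reads as follows. The names f for the numerator and g for the denominator are notation assumed here for illustration, not taken from the video.

```latex
\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right]
  = \frac{g(x)\,f'(x) - f(x)\,g'(x)}{g(x)^{2}}
```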
Now, it may look daunting, but this is
a very simple example that will help you
understand how the quotient rule is applied
and how you can find the derivative of
the sigmoid function as well.
Let's keep the quotient rule as the headline so that you
can always refer back to it
while we differentiate our sigmoid function.
This is what we want to differentiate, and
then we apply the quotient rule, going by what
I have specified at the top.
I pick the denominator as it is; in our case
the denominator is 1 plus e to the power
minus x. I keep it as it is.
I multiply this term with the derivative of
my numerator with respect to x.
Now my numerator is 1.
The derivative of a constant with respect to x is 0,
which is what I get here.
Something similar is what I do in the second
part of the numerator.
I copy the numerator as it
is, that is 1, and if I differentiate the denominator
with respect to x, the constant term gives 0 and
e to the power minus x gives e to the power
minus x with a negative sign, which is what
I get here.
And I square the denominator term.
After applying the quotient rule,
the two minus signs in the numerator cancel, and
I am left with this equation: the derivative
of the sigmoid function is e to the
power minus x upon 1 plus e to the power minus
x, the whole squared.
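For reference, the quotient-rule step for the sigmoid can be written out like this. It is a worked restatement of the spoken derivation, with the first numerator term vanishing because the derivative of the constant 1 is 0.

```latex
\frac{d}{dx}\left[\frac{1}{1+e^{-x}}\right]
  = \frac{\left(1+e^{-x}\right)\cdot 0 \;-\; 1\cdot\left(-e^{-x}\right)}{\left(1+e^{-x}\right)^{2}}
  = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
```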
If you go through the
literature, you won't find this form
used that commonly,
so we will try to rewrite the derivative that
we have just computed for the sigmoid function
in a form that is very easy to compute
when we actually implement a neural network.
So the next thing I do is keep
the derivative that I've already computed on the slide so that
you can always refer to it as we move along.
Now, to bring it into a form that we can use easily
when we create neural networks, we start
off by adding and subtracting 1 in the numerator.
If you add and subtract
the same term in the numerator, it doesn't change
the fraction at all, so I add and subtract
1 in the numerator, and this is what I
get: 1 plus e to the power minus x, minus 1, upon 1 plus e to the power minus x, the whole squared.
Once I have reached this position,
I split my fraction into two
fractions that add up to it.
What I do next is separate the fraction
into its constituents:
I take this 1
and this e to the power minus x and keep
them together as
the first term, which is 1 plus e to
the power minus x divided by 1 plus e to
the power minus x, the whole squared. I'm left
with this minus 1 term, which comes here,
so my second term is minus
1 divided
by 1 plus e to the power minus x, the whole
squared. Why am I doing this? It will all be clear
as we move along.
So now that we've done this, let's go to the
next step.
I start off again from the point
where I left off in my previous slide.
As you can clearly see, in the first fraction the numerator and the denominator are
the same term, but the denominator is the squared version
of the numerator, so what I do is reduce
the power of the denominator by one.
One factor cancels, and I'm just left with 1 upon
1 plus e to the power minus x, minus 1
upon 1 plus e to the power minus x, the whole
squared, which is this term.
Now the next thing
that I do is take this term as it is and
factor out whatever is common, leaving the rest inside
a bracket. What I mean by that is I
have 1 upon 1 plus e to the power minus
x outside,
into 1 minus 1 upon 1 plus e to the power minus x inside the bracket.
If you
look closely, we started off with the sigmoid
function
s of x, which was 1 upon 1 plus e to the power
minus x.
So essentially, after doing all of these small
manipulations, we've reached the derivative
of the sigmoid function in
the form s of x into 1 minus s of x.
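The whole chain of manipulations can be summarized in one place. The notation s(x) for the sigmoid follows the video; s'(x) for its derivative is a symbol assumed here.

```latex
\begin{aligned}
s'(x) &= \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
       = \frac{\left(1+e^{-x}\right) - 1}{\left(1+e^{-x}\right)^{2}} \\
      &= \frac{1+e^{-x}}{\left(1+e^{-x}\right)^{2}} - \frac{1}{\left(1+e^{-x}\right)^{2}}
       = \frac{1}{1+e^{-x}} - \frac{1}{\left(1+e^{-x}\right)^{2}} \\
      &= \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right)
       = s(x)\,\bigl(1 - s(x)\bigr)
\end{aligned}
```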
Now, you might be wondering why we have done
so much manipulation.
In deep learning, when
you implement a neural network,
the first pass is the forward pass, wherein
your activations are calculated. Since
s of x is the activation, that is the
layer's output, all of these values
are stored as a matrix.
In order to start with
backpropagation and update the weights of
my existing layers, what I essentially have
to do for the sigmoid's gradient is take whatever activation values are stored for
that layer,
subtract each of them from 1, and multiply the result with the
value of that particular activation itself.
So there isn't any complex calculation involved
when I use sigmoid as my activation function
because the gradient that we've calculated
is essentially the function itself multiplied
by 1 minus the function value itself.
So that is the uniqueness that this representation
gives us.
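A small sketch of that idea, assuming plain Python lists in place of the activation matrix: the forward pass keeps its outputs, and the gradient is then computed from those stored values alone. The function names `forward`, `sigmoid_grad_from_cache` and `sigmoid_grad_direct`, and the input values, are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs):
    """Forward pass: compute the activations; the caller keeps them for later."""
    return [sigmoid(x) for x in inputs]

def sigmoid_grad_from_cache(activations):
    """Gradient s * (1 - s) from stored activations, with no call to exp()."""
    return [s * (1.0 - s) for s in activations]

def sigmoid_grad_direct(x):
    """The unsimplified form e^(-x) / (1 + e^(-x))^2, for comparison only."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]    # illustrative pre-activation values
cache = forward(inputs)                 # stored during the forward pass
grads = sigmoid_grad_from_cache(cache)  # reused during the backward pass

# Each row shows the input, the cached-form gradient, and the direct form.
for x, g in zip(inputs, grads):
    print(x, g, sigmoid_grad_direct(x))
```

Both forms agree up to floating-point rounding, but the cached form needs only one subtraction and one multiplication per value.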
So architecturally, what we can do is
store the forward propagation results in
a cache, and while updating the weights
we can directly access the values from the
cache and update the weights fast, because
we represented the derivative of the
sigmoid function using the function values themselves. So
this is the uniqueness of representing the
derivative of the sigmoid function the way I represented it,
and if you plot the sigmoid function's derivative,
this is how it appears.
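Since the plot itself is not visible in a transcript, here is a rough text-only sketch of the derivative's shape. The grid of x values and the bar scaling are arbitrary choices made for illustration. The curve is bell-shaped, symmetric about x = 0, and peaks there at 0.25.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = [float(i) for i in range(-6, 7)]       # -6.0, -5.0, ..., 6.0
ds = [sigmoid_derivative(x) for x in xs]

# Print each value with a crude bar whose length is proportional to it.
for x, d in zip(xs, ds):
    print(f"{x:5.1f}  {d:.4f}  " + "#" * round(d * 100))
```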
I hope you found the video informative.
If you do have any questions about what we covered
in this video, then feel free to ask in the
comments section below and I'll do my best
to answer them.
If you enjoy these tutorials
and would like to support them, then the easiest
way is to simply like the video and give it a
thumbs up. It's also a huge help to share
these videos with anyone you think would
find them useful.
Please consider clicking
the subscribe button to be notified of future
videos, and thank you so much for watching
the video.
