We continue with artificial neural networks. To quickly recapitulate what we had done: we considered a perceptron, which takes a simple form. You have inputs, let us say x1 and x2, and a bias input, let me call it x0, and these connections have weights w0, w1, w2. You then apply an activation function, often taking the form of a step function: if the weighted input exceeds some threshold, usually zero, the output is +1, otherwise -1. That gives the final output. We also discussed that the perceptron actually realizes a line in the x1-x2 plane, w0 + w1 x1 + w2 x2 = 0, where w0 is nothing but the bias (intercept) and w1, w2 determine the slope. So it implements a line of this form.
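As a minimal sketch of what was just described, here is the perceptron's decision function in Python, assuming the threshold-at-zero, plus-or-minus-one output convention from the lecture:

```python
# Minimal perceptron decision function: weights w1, w2 and a bias weight w0
# on a constant input x0 = 1, followed by a step activation.
def perceptron_output(w0, w1, w2, x1, x2):
    # Weighted sum; since x0 is fixed at 1, w0 acts as the bias term.
    s = w0 * 1 + w1 * x1 + w2 * x2
    # Step activation: +1 if the sum exceeds the threshold (zero), else -1.
    return 1 if s > 0 else -1
```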
Then we studied what to do given a prediction problem, where you are given a training set consisting of n input vectors and their corresponding desired output vectors. What we do is start with random values of the weights w0, w1, w2 and then update the weights over iterations till the weights become stable, that is, they do not change. We discussed two update rules.
One is the perceptron learning rule. There, what we do is start with random weights and then look at a training example. Every set of weights, random or otherwise, corresponds to a line. If that line correctly classifies the training point, do not do anything; if it misclassifies it, you tilt the line to try to classify it correctly. So what is the tilting rule? Your w_new is w_old + delta-w, where delta-w = (t_i - o_i) x_i: given the current values of the weights, for input x_i I calculate the actual output, call it o_i, and I am supposed to achieve some target output t_i; I take the difference and multiply it by x_i. You go on making this correction to w till all points are correctly classified.
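Here is a minimal sketch of this learning rule in Python. The epoch limit is an illustrative assumption, and the learning rate eta is a common addition; with eta = 1 this is exactly the rule as stated:

```python
import numpy as np

def perceptron_learning(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning rule as described above.

    X : (n, d) array of input vectors (a constant 1 is prepended for the bias).
    t : (n,) array of target outputs in {-1, +1}.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # x0 = 1 for the bias weight w0
    w = np.random.randn(X.shape[1])               # start with random weights
    for _ in range(max_epochs):
        converged = True
        for x_i, t_i in zip(X, t):
            o_i = 1 if w @ x_i > 0 else -1        # actual output for this example
            if o_i != t_i:                        # misclassified: tilt the line
                w += eta * (t_i - o_i) * x_i      # delta-w = eta (t_i - o_i) x_i
                converged = False
        if converged:                             # all points correctly classified
            break
    return w
```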
We can actually show that if the points from the two classes are linearly separable, this algorithm will indeed converge. Then we also studied an alternative algorithm called the gradient descent algorithm. What the gradient descent algorithm does is, for different values of w0, w1, w2, it calculates an error surface.
So what is the error surface? The error surface is nothing but the squared error, target output minus obtained output squared, summed over all n training examples: E(w) = sum over i of (t_i - o_i)^2. We can visualize it by looking at, say, the w1-w2 plane and plotting the error on the vertical axis; we will get some surface.
What gradient descent does is start with random values of w0, w1, w2 (or however many weights there are) and then perform what is called steepest descent: it changes the weights in the direction that leads to the maximum decrease in the error, which is the same as descending along the gradient of the error surface. So delta-w is some learning constant eta times the negative derivative of the error function with respect to w. For this kind of mean squared error you can actually show that delta-w becomes nothing but eta times (t_i - o_i) x_i summed over all the training examples, because the error itself is a sum over all the training examples.
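A minimal sketch of the batch-mode version in Python; it assumes an unthresholded linear output o_i = w . x_i, which is the setting in which the delta-rule derivation referred to above goes through, and the values of eta and the epoch count are illustrative:

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=1000):
    """Batch-mode gradient descent on the squared error.

    Uses a linear unit o_i = w . x_i; the correction is summed
    over all training examples before the weights are updated.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 for the bias
    w = np.random.randn(X.shape[1])               # start with random weights
    for _ in range(epochs):
        o = X @ w                                 # outputs for all n examples
        # delta-w = eta * sum_i (t_i - o_i) x_i : steepest descent on E(w)
        w += eta * X.T @ (t - o)
    return w
```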
The derivation is there in the earlier slides; I am not repeating it here. If you work it out, you get a rule of exactly this form, so the weight update is the same in the perceptron as in the gradient descent algorithm. The only difference is that the perceptron looks at every example, updates if it is misclassified, and works with the updated value of w, whereas gradient descent calculates delta-w based on the sum of the corrections over all the training points and then updates. So one is an incremental mode, where every training example changes w; the other is a batch mode, where you go through all the examples and then change w. Both of these do converge for the simple case of a perceptron.
We also showed that you can realize logic functions, logic gates like AND, OR, etcetera, by choosing proper values of the weights: proper values of the weights and the bias, together with the threshold function, realize these gates. So I would request that you take some standard AND and OR functions and try to find out the values of w1 and w2 that give exactly the output the logic function should give for inputs like (1,1), (1,0), and so on: what value of w gives the proper output for the given logic gate? It is in fact easy to understand in some sense,
because if we take a two-input logic function with inputs x1 and x2, and for example the AND function, and plot the points for which AND is true, you will see that (1,1) is the only point where AND is true; all the rest, (1,0), (0,0), (0,1), are negative examples where AND is false. So in the x1-x2 space we have the four corners (0,0), (0,1), (1,0), (1,1), and only (1,1) is positive. That is the AND function.
Since the perceptron is a straight line, you can actually draw a straight line that puts (1,1) on one side and the rest of the points on the other side. Similarly, look at the OR function: there (0,0) is the only negative example and the rest are positive examples, where OR is true. Again you can draw a line to recognize OR. So this works for many of the gates; concrete weight values appear in the sketch below.
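As one concrete answer to the exercise above: here is one possible choice of weights, not the only one, since any line that separates the point sets works. The bias values -1.5 and -0.5 are illustrative assumptions, and 0/1 outputs are used to match the truth tables:

```python
def step_perceptron(w0, w1, w2, x1, x2):
    # Step activation with threshold at zero, using 0/1 outputs.
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = step_perceptron(-1.5, 1.0, 1.0, x1, x2)  # AND: line x1 + x2 = 1.5
        or_out  = step_perceptron(-0.5, 1.0, 1.0, x1, x2)  # OR:  line x1 + x2 = 0.5
        print(x1, x2, "AND:", and_out, "OR:", or_out)
```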
But now let me give you an example of a logic gate which cannot be realized by this perceptron. Let us look at the XOR gate. Let me write down the truth table, treating it as a classification problem: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0. So if I plot it, you can see that (0,0) and (1,1) are the negative points and (1,0) and (0,1) are the positive points. Now you cannot draw a line to separate the positives from the negatives; you can check, no line can be drawn. So the perceptron cannot realize the XOR function. What is the solution? The perceptron failed because I cannot separate them by one line, but maybe I can separate them by two lines.
So maybe I can separate them with one line like this and another line like this, so that everything between the two lines is plus and everything outside is minus. For XOR, a point that is on the non-origin side of the lower line but the origin side of the upper line is a plus point, while a point on the origin side of both lines, or the non-origin side of both lines, is a minus point. So if I can use two lines, I can realize it.
So then people thought: how do you realize two or more lines? Let me connect up perceptrons. It is like in your logic circuits: maybe a single gate cannot realize the function, so what you do is connect up gates, I hope you are familiar with digital logic, and add a second level of gates; now you can realize it. The same idea applies here: if a single perceptron is not enough, you take two perceptrons, perceptron one and perceptron two, and feed the outputs of these two perceptrons as inputs to a third perceptron. Earlier the output of the single perceptron was the final output; now I have two perceptrons, and their outputs go to the third perceptron as its inputs,
and the output of the third perceptron is our actual output. This type of network is called a multi-layer perceptron, or MLP, because there are different layers of perceptrons; you can think of one layer feeding into the next layer, which feeds into the next. The terminology is as follows: the first layer of inputs is called the input layer, the second-level perceptrons form a hidden layer, and the final one is called the output layer. There may be multiple hidden layers, as many as you want. In fact, the modern trend is something called a deep neural network, which is nothing but a network with a large number of hidden layers; that is why it is called deep.
So what does this mean geometrically? Each first-layer perceptron gives you a line, with its own weights w1, w2, and the output of each of these perceptrons is either +1 or -1 depending on which side of its line the point to be classified falls. These values, +1 or -1, are the inputs to the second-level perceptron, which looks at whether the point is on the plus side of both lines, or on the plus side of one line and the minus side of the other, and gives the result accordingly.
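A minimal sketch of this two-line construction for XOR, using the lines x1 + x2 = 0.5 and x1 + x2 = 1.5 as an illustrative (not unique) choice, with 0/1 outputs:

```python
def step(s):
    return 1 if s > 0 else 0

def xor_mlp(x1, x2):
    # Hidden perceptron 1: fires on the non-origin side of x1 + x2 = 0.5.
    h1 = step(x1 + x2 - 0.5)
    # Hidden perceptron 2: fires on the origin side of x1 + x2 = 1.5.
    h2 = step(1.5 - x1 - x2)
    # Output perceptron: fires only when the point lies between the two lines.
    return step(h1 + h2 - 1.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))  # reproduces the XOR truth table
```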
So you can see that many non-linearly-separable sets of classes, for example classes arranged like this, can be classified by considering a number of perceptrons. The resulting boundary is what you would call a piecewise linear boundary. In fact it can be shown that any complex decision surface can be realized this way.
I have explained everything with the help of logic functions, but in general it holds true for any kind of classification task. So this slide is the XOR. I will skip these slides on the perceptron learning rule; I have already told you all this about gradient descent and error minimization. You should go through the slides for the derivation of the update rule that I talked about.
If you have a complex function, for example a speech recognition problem where you have to recognize a set of similar words like head, hid, hod, had, hawed, and you look at the formant frequencies of utterances of the speech signal for these words, they form complex class boundaries like this; actually the vowels look like this too. You can build a multi-layer perceptron to separate them out.
As I have told you, please note down the structure: this is the multi-layer structure, which is also called a feed-forward network. Let me draw the picture clearly, because this diagram is important. Suppose I have three inputs; each input will be connected to all the neurons of hidden layer one. The output of each of these neurons, let me put arrows here, will be connected to all the nodes in the next layer; every node connects to all the nodes in the next layer. Similarly you can have a hidden layer two; the same rule holds, each node is connected to every node in the next layer. Finally the output layer, which can have a smaller number of neurons; for example, you can have only one output neuron, which gives +1 or -1.
Again the same rule applies: everything is connected. So there are two things you should observe. First, it is a fully connected network: every node in a layer connects to all the nodes in the next layer. Second, every node provides inputs only to nodes of the next layer; there is no backward connection, no node of a layer feeding as input to a node of a previous layer. This is why it is called a feed-forward network: it only feeds forward, with no back links. There is another class of networks which has back links, called recurrent networks, which is also a very important class of networks, but I am not covering that right now. So this is the network, and then you have weights: each of these links has a weight. You have directed edges, the output of one node going as input to another, and every edge has a weight value.
The convention for writing the weights is something like this: if you write w_ij for the connection from the input layer to hidden layer one, it means the edge connects the i-th neuron of the input layer to the j-th neuron of hidden layer one. Similarly, w_ij from hidden layer one to hidden layer two means the i-th neuron of hidden layer one connects to the j-th neuron of hidden layer two.
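Here is a small sketch of one forward pass through such a fully connected feed-forward network. The layer sizes (3 inputs, two hidden layers of 4, one output) are illustrative assumptions, and the sigmoid activation anticipates the discussion below:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, weights, biases):
    """One forward pass through a fully connected feed-forward network.

    weights[k][i, j] connects neuron i of layer k to neuron j of layer
    k + 1, following the w_ij convention above; biases[k][j] is the
    bias of neuron j in layer k + 1.
    """
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)   # each node feeds every node of the next layer
    return a

# Hypothetical shapes: 3 inputs -> 4 hidden -> 4 hidden -> 1 output.
rng = np.random.default_rng(0)
shapes = [(3, 4), (4, 4), (4, 1)]
weights = [rng.normal(size=s) for s in shapes]
biases = [rng.normal(size=s[1]) for s in shapes]
print(forward(np.array([0.5, -1.0, 2.0]), weights, biases))
```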
Now, one more thing is missing: I have talked about the weights, but each of these neurons, just like a perceptron, will have an activation function. Each neuron takes its weighted inputs and adds them up, so it computes w1 x1 + w2 x2 + ... plus w0, the bias term, and then applies an activation function to this sum, which usually outputs either +1 or -1.
That is a step-like function, the step function, as the activation function. But there is one problem with the step function: it is not differentiable at the corner of the step. When you are doing gradient descent, the gradient means some kind of derivative has to be taken; differentiation has to be done. So people use a smooth version of the step function, the sigmoid function, which approximates the step by an S-like curve: away from the transition it is the same as the step, but instead of a jump there is a smooth transition, and it is differentiable everywhere, which we will need at a later stage.
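A minimal sketch of the sigmoid and its derivative; the convenient closed-form derivative sigma(s)(1 - sigma(s)) is a standard fact, stated here as background rather than something derived in the lecture:

```python
import numpy as np

def sigmoid(s):
    # Smooth, S-shaped approximation of the step function.
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_derivative(s):
    # Differentiable everywhere -- the property gradient descent needs.
    y = sigmoid(s)
    return y * (1.0 - y)

# Far from zero the sigmoid matches the step; near zero it transitions smoothly.
for s in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(s, sigmoid(s), sigmoid_derivative(s))
```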
All right, so this is clear: the multi-layer perceptron. In the next lecture we will discuss, just as we did the perceptron weight update rule, the weight update rule for the multi-layer perceptron.
Thank you.
