Hello. So, in the last class we started the
discussion on the radial basis function neural
network. We have seen that a radial basis
function neural network consists of three
layers: one is, of course, the input layer,
another is the output layer,
and in between the input layer and output
layer we have a hidden layer. Unlike the
multilayer perceptron, where we can have one
or more hidden layers, the radial basis function
network has only one hidden layer, and every
neuron in the hidden layer computes a radial
basis function value for an input feature
vector. Every radial basis function has two
parameters: one is called the receptor, and
the other defines the spread of the radial
basis function.
the architecture that we have is something
like this. We have one input layer; the input
layer contains a number of neurons, and the
number of neurons in the input layer is the
same as the dimensionality of the feature
vector. So, if the feature vectors are of
dimension d, I will have d nodes in the input
layer. When the dimensionality
of the feature vector is d, in the hidden
layer I will have a number of nodes. And suppose
the number of nodes in the hidden layer is
say capital M. So, as we discussed in our
previous class, the purpose of the hidden
layer nodes is to project the d dimensional
feature vector into a higher dimensional feature
vector. Since I have M nodes in the hidden
layer, this M, the number of nodes in the
hidden layer, is greater than the dimensionality
d of the feature vector. And as we said, every
node in the hidden layer computes a radial
basis function. At the output
layer, which are basically the classifying
neurons, I have a number of nodes equal to
the number of classes. So, if I have c classes,
then at the output layer I will have c neurons,
where c is the number of classes into which
the pattern has to be classified. Every node
in the input layer feeds its input to every
node in the hidden layer, and the output of
every node in the hidden layer is connected
to every node in the output layer. So, I have
the connections, which are
something like this. So, these are the connections
from the input layer nodes to the hidden layer
nodes. And because the purpose of these connections
is simply to forward the input feature vector
to the nodes in the hidden layer, we can assume
that the weight of each of these connections
is equal to 1, and that is the difference
with the connections from the hidden layer to the output
layer nodes, because every output layer node
computes a linear combination of the outputs
of the hidden layer nodes. So, the connections
from the hidden layer nodes to the output
layer nodes are something like this, where
the i th node in the hidden layer is connected
to the j th node in the output layer through
a connection weight W i j.
So, because of this, every node in the output
layer computes a linear combination of the
outputs of the hidden layer. Based on the
value of this linear combination, the output
layer nodes decide to which class the input
vector should be
classified. Now, these output nodes can also
impose a non-linear function to ensure that
if a particular input feature vector belongs
to class omega j, then only the output of
the j th node will be equal to 1 and the outputs
of all other output layer nodes will be equal
to 0. For example, if an input feature vector
belongs to class 1, then only the output of
the first node in the output layer will be
equal to 1 and the outputs of all other nodes
will be equal to 0. So, as we discussed in
the previous class, such a radial basis function
network, an RBF network, incorporates two
types of learning. One is, we have to learn,
for every node in the hidden layer, because
every node in the hidden layer represents
a radial basis function, what should be the
receptor of that radial basis function and
what should be the spread of that radial basis
function. So, if the radial basis function
is a Gaussian function, that is, if it is
something like phi of x = e^(-||x - t||^2 / (2 sigma^2)),
where t is the receptor and sigma, the standard
deviation, decides the spread of the radial
basis function. So, for every i th radial
basis function phi i of x, t i is the receptor
and sigma i is the spread, so I have to know
the receptor and the spread of every radial
basis function. So, this is one level of learning
and the second level of learning comes once,
through these radial basis functions, a d
dimensional feature vector has been projected
onto an M dimensional feature vector.
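Such a Gaussian radial basis function can be sketched in a few lines of Python (a minimal sketch; the function name gaussian_rbf and the variable names are mine, not from the lecture):

```python
import numpy as np

def gaussian_rbf(x, t, sigma):
    """Gaussian radial basis function with receptor t and spread sigma:
    phi(x) = exp(-||x - t||^2 / (2 * sigma^2))."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.exp(-np.sum((x - t) ** 2) / (2.0 * sigma ** 2))

# At the receptor itself the value is 1, and it decays with distance.
print(gaussian_rbf([0, 0], [0, 0], 1.0))  # 1.0
print(gaussian_rbf([0, 1], [0, 0], 1.0))  # exp(-1/2), about 0.6065
```

The value depends only on the distance of x from the receptor t, which is what makes the function "radial".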
So basically, what we are doing is, we are
increasing the dimensionality of the feature
vector. As we indicated in our last class,
the purpose of increasing dimensionality is
that if the feature vectors are linearly
non-separable in the d dimensional space,
then we cast them into a higher dimensional
space, where the possibility that they will
be linearly separable increases, and this
possibility increases with the value
of M. So, as we increase the dimensionality
more and more, the possibility of linear separability
of the feature vectors also increases. So,
for feature vectors in the d dimensional space
which are not linearly separable, when I cast
them into an M dimensional space, where M
is greater than d, it is more likely that
those feature vectors will be linearly separable
in the M dimensional space. And once the feature
vectors are linearly separable in the M dimensional
space, then a linear combination of the outputs
of the hidden layer is likely to give me the
class membership. And that linear combination
is decided by the connection weights from
the hidden layer nodes to the output layer
nodes. So, we also have to learn what should
be the connection weight W i j from the i th
node in the hidden layer to the j th node
in the output layer. This is the second level
of learning. So, in the first level of learning,
for every radial basis function we try to
learn the receptor and the spread of the radial
basis function, and in the second level we
try to learn the connection weights from the
hidden layer nodes to the output layer nodes.
As we have discussed in the previous
class, the common method of learning the radial
basis functions, if you are given a set of
feature vectors for training, is the following:
suppose the value of M is equal to 3, say;
what we do is partition, or cluster, the set
of feature vectors into that number of clusters.
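This clustering step can be sketched as follows, assuming a plain k-means loop with a farthest-point initialisation (the actual clustering algorithms are discussed in later lectures; the function kmeans_receptors and the toy data are my own illustration). The cluster means are then taken as the receptors:

```python
import numpy as np

def kmeans_receptors(X, M, iters=50):
    """Cluster the N feature vectors (rows of X) into M clusters and
    return the cluster means, which serve as the receptors t_1 .. t_M."""
    assert X.shape[0] > M, "need more training vectors than clusters"
    # Farthest-point initialisation: start from X[0], then repeatedly
    # pick the vector farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(M - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign each vector to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its cluster
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    return centers

# Three well-separated blobs -> three receptors near the blob means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])
receptors = kmeans_receptors(X, M=3)
```

Here N = 60 and M = 3, so the condition N greater than M is satisfied, and each row of receptors is the centroid of one cluster.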
So, if we have M nodes in the hidden layer
and I have say N feature vectors given for
training purposes, then obviously N has to
be greater than M; otherwise, clustering N
feature vectors into M clusters does not make
any sense. So, I have to have more feature
vectors than the number of clusters to be
formed. So, I cluster these N feature vectors
into M clusters, and I can assume that the
centroid, or mean, of every cluster represents
the corresponding receptor. So, the centroid
of the i th cluster represents the receptor
of the i th radial basis function, so the
situation
is something like this. If I have a set of
feature vectors, say these are the feature
vectors belonging to different classes. What
I do is cluster these feature vectors into
three different clusters. Every cluster center
now represents a receptor: this is the receptor
t 1, this is the receptor t 2 and this is
the receptor t 3. So, the first operation
we have to perform
is the clustering of the feature vectors and
these clustering operations we will discuss
in detail in future lectures. Now, once I
have these different receptors, to find out
the spread of a particular radial basis function,
what we do is, for the i th receptor, find
the P nearest receptors and compute the root
mean square distance to them. There are different
possibilities for choosing P. So, the way
I compute sigma i for the i th radial basis
function is: I have t i, the receptor of the
i th radial basis function; I take the P receptors
nearest to t i; and for each such receptor
t k I compute ||t i - t k||^2, sum this for
k = 1 to P, divide by P, and take the square
root, that is, sigma i = sqrt((1/P) * sum
over k of ||t i - t k||^2). This defines the
spread of the i th radial basis function.
So, for every i th radial basis function I
have t i and sigma i, and once these two are
known, my radial basis function is simply
phi i of x = e^(-||x - t i||^2 / (2 sigma i^2)).
Now, let us see
by using this concept whether I can make a
linear classifier using the radial basis function
concept for the XOR problem. XOR is a very
common problem used for illustrating such
operations. So, as we have said earlier, if
I take the XOR function, I have a 2-dimensional
binary feature vector with components x 1
and x 2. Suppose this axis represents x 1,
with x 1 = 0 here and x 1 = 1 here, and the
other axis represents x 2, with x 2 = 0 here
and x 2 = 1 here. The value of the XOR function
at (0, 0) is 0; at (0, 1) the value is 1;
at (1, 0) the value is 1; and at (1, 1) the
value is again 0. So, here I have 2-dimensional
binary feature vectors. So what I do is: this
2-dimensional
feature vector, I want to cast into a four
dimensional space, by using four radial basis
functions. So, I have the radial basis functions
phi 1, phi 2, phi 3 and phi 4. For phi 1,
I choose t 1 = (0, 0) as the receptor of the
radial
basis function phi 1. Similarly, for phi 2,
I can choose t 2 = (0, 1) as the receptor
of the radial basis function phi 2. Similarly,
I can choose the receptors of the other radial
basis functions as t 3 = (1, 0) and t 4 =
(1, 1). So, these are the four receptors for
the four radial basis functions. Next, I have
to choose the spreads: sigma 1 for the first
radial basis function, sigma 2 for the second,
sigma 3 for the third and sigma 4 for the
fourth. Now, for this, for every receptor
I have to find the P nearest receptors, and
suppose I choose the value P = 2. Now, here
you find that for every receptor
there are three neighbors: two of them are
at a distance of 1, and one is at a distance
of square root of 2, about 1.414. This is
easily verifiable: from the receptor t 1 here,
t 2 is at distance 1, t 3 is at distance 1,
but t 4 is at distance square root of 2. So,
when I take P = 2, I take the two nearest
neighbors, both at distance 1, and the root
mean square of these two distances is also
equal to 1. So, I have spreads sigma 1 = 1,
sigma 2 = 1, sigma 3 = 1 and sigma 4 = 1.
So, I get phi 1 of x = e^(-||x - t 1||^2 / (2 sigma 1^2)),
and with sigma 1 equal to 1 the denominator
is simply 2. Similarly, phi 2 of x =
e^(-||x - t 2||^2 / 2), phi 3 of x =
e^(-||x - t 3||^2 / 2) and phi 4 of x =
e^(-||x - t 4||^2 / 2). So, if I compute these
values for each of the feature vectors, taking
(0, 0) as one feature vector, (0, 1) as another,
(1, 0) as another and (1, 1) as another, the
functional values will be something like this.
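These functional values can be computed with a short numpy sketch before tabulating them (the variable names are mine):

```python
import numpy as np

# Receptors t_1..t_4 and the four binary input vectors of the XOR problem.
receptors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
sigma = 1.0  # all spreads equal 1, as computed above with P = 2

# Entry (k, i) is phi_i(x_k) = exp(-||x_k - t_i||^2 / (2 sigma^2)).
# The possible values are exp(0) = 1.0, exp(-1/2) ~ 0.61, exp(-1) ~ 0.37.
table = np.exp(-np.sum((inputs[:, None, :] - receptors[None, :, :]) ** 2,
                       axis=2) / (2 * sigma ** 2))
print(np.round(table, 1))
```

Rounded to one decimal place, the entries match the 1.0 / 0.6 / 0.4 values tabulated below.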
So, I put that in the form of a table. The
inputs are the feature vectors (0, 0), (0, 1),
(1, 0) and (1, 1), and the columns are the
RBF functions phi 1, phi 2, phi 3 and phi 4.
When you input the feature vector (0, 0) to
phi 1, you find that x = t 1, so the exponent
is 0, which means phi 1 of x = 1.0. For phi 2,
my x is (0, 0) and t 2 is (0, 1), so if I
compute phi 2 of x, it comes to about 0.6.
Similarly, phi 3 of x is about 0.6 and phi 4
of x, with t 4 at distance square root of 2,
is about 0.4. Putting in all the values:

    Input     phi 1   phi 2   phi 3   phi 4
    (0, 0)    1.0     0.6     0.6     0.4
    (0, 1)    0.6     1.0     0.4     0.6
    (1, 0)    0.6     0.4     1.0     0.6
    (1, 1)    0.4     0.6     0.6     1.0

So, you find that the 2-dimensional feature
vector (0, 0) has been cast into a 4-dimensional
feature vector whose components are 1.0, 0.6,
0.6 and 0.4. Similarly, (0, 1) is a 2-dimensional
input feature vector which has been cast into
a 4-dimensional feature vector with components
0.6, 1.0, 0.4 and 0.6. So, every 2-dimensional
input feature vector is converted to a 4-dimensional
feature vector by using the four radial
basis functions. Now, I take a linear combination
of these: I give phi 1 a weight of minus 1,
phi 2 a weight of plus 1, phi 3 a weight of
plus 1, and phi 4 a weight of minus 1. So,
the function that I will finally compute at
a node in the output layer will be
phi 2 + phi 3 - phi 1 - phi 4, and if
I compute this, let us see what are the values
that I get. So, here I write the sum of W i
times phi i, where i varies from 1 to 4. For
input (0, 0) it will be 0.6 + 0.6 = 1.2 minus
1.4, which is -0.2. For (0, 1) it will be
1.4 minus 1.2, which is +0.2. For (1, 0) it
is again 1.4 minus 1.2, so +0.2, and for
(1, 1) it is 1.2 minus 1.4, so -0.2. And if
I take a decision that if
the value is more than 0 the output will be
1, if it is less than 0 the output will be
0, then the final outputs are 0, 1, 1 and 0,
which is nothing but the XOR function output.
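The whole computation, projection through the four Gaussian RBFs, the weighted sum with weights (-1, +1, +1, -1), and the threshold at zero, can be sketched end to end (a minimal sketch; the function name rbf_xor is mine):

```python
import numpy as np

receptors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
weights = np.array([-1.0, 1.0, 1.0, -1.0])  # weights for phi 1..phi 4
sigma = 1.0                                 # all spreads equal 1

def rbf_xor(x):
    """Project x through the four Gaussian RBFs, take the weighted
    sum, and threshold at 0 to get the XOR output."""
    phi = np.exp(-np.sum((np.asarray(x, dtype=float) - receptors) ** 2,
                         axis=1) / (2 * sigma ** 2))
    return 1 if weights @ phi > 0 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", rbf_xor(x))  # reproduces the outputs 0, 1, 1, 0
```

For (0, 0) the weighted sum is about -0.15 (negative, so output 0), and for (0, 1) it is about +0.15 (positive, so output 1); the exact values differ slightly from the rounded table arithmetic, but the signs, and hence the XOR outputs, are the same.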
So, the architecture of the radial basis function
network that we have used has two input layer
nodes, where x 1 is fed to one node and x 2
is fed to the other; four nodes in the hidden
layer, which compute the radial basis functions;
and one node in the output layer, which applies
a non-linear, or threshold, operator. The
connections from the input layer to the hidden
layer each have a connection weight equal
to 1. And the connections from the hidden
layer to the output layer are as we chose:
phi 1 to the output layer node has a connection
weight of minus 1, phi 2 has a connection
weight of plus 1, phi 3 again has a connection
weight of plus 1, and phi 4 has a connection
weight of minus 1. So, the connection weights
are minus 1, plus 1, plus 1, minus 1, and
this output actually gives me the XOR function,
okay?
So, this example clearly shows that by casting
the two-dimensional feature vectors into
four-dimensional feature vectors, I can implement
the XOR function using a linear network, or
a single layer perceptron, because this part
is nothing but a single layer perceptron.
Now, let us try to find an expression for
the training of the output layer, that is,
how do I find these connection weights. So,
in general
I have a network something like this. I have
a set of input layer nodes, a set of hidden
layer nodes and a set of output layer nodes.
The feature vector x is fed to the input layer
nodes, which forward it to the hidden layer
nodes through connection weights that are
all one. The hidden layer nodes compute my
radial basis functions, and their outputs
are connected to the output layer nodes; I
take the output from every output layer
node. So, if my input feature vector x belongs
to, say, the i th class, then the output of
the i th output layer node will have a high
value, likely to be 1, and the outputs of
all other output layer nodes will have a low
value,
likely to be 0. I assume that the i th node
in the hidden layer is connected to the j th
node of the output layer through a connection
weight W i j. Given this, if I call the output
of the j th output layer node o j, I will
have o j = sum of W i j times phi i of X for
an input vector X, where the summation is
computed over all nodes in the hidden layer,
that is, over i = 1 to M, as I have M nodes
in the hidden layer. And naturally over
here I will write: if X belongs to class omega j,
then the sum of W i j times phi i of X, for
i = 1 to M, should be plus 1, and if X does
not belong to omega j, that sum must be 0
(one could also use minus 1). So, let us assume
that if X belongs to class omega j this sum
equals plus 1, and if X does not belong to
omega j it equals 0, and that is what the
output from the j th node in the output layer
has to be. Now, taking this, I
can go for training of the output layer; that
means I have to find out the values of these
W i j. Now, if I consider for the moment only
the connection weights feeding the j th node
in the output layer, then for every vector
X k: suppose I have capital N vectors.
So, I have vectors X k for k varying from
1 to capital N: capital N input vectors given
for training, as we are using supervised learning.
For simplicity I will write phi i of X k as
phi i k. Then, as before, my condition is
that the sum of W i j times phi i k, for i
varying from 1 to M, has to be equal to plus 1
if X k belongs to class omega j, and has to
be 0 if X k does not belong to omega j. So,
this is the output that I expect; for every
X k I have such a linear equation, with the
summation equal to either plus 1 or 0, and
all those capital N equations I can now write
in the form of a matrix.
So, in the matrix form this can be written
as follows, where phi i k means phi i of X k,
and, as I have capital N samples for training,
there are N rows:

    [ phi 1 1  phi 2 1  ...  phi M 1 ] [ W 1 j ]   [ b 1 j ]
    [ phi 1 2  phi 2 2  ...  phi M 2 ] [ W 2 j ]   [ b 2 j ]
    [   ...      ...    ...    ...   ] [  ...  ] = [  ...  ]
    [ phi 1 N  phi 2 N  ...  phi M N ] [ W M j ]   [ b N j ]

So, what does this compute? W 1 j times phi 1 1
plus W 2 j times phi 2 1, and so on up to
W M j times phi M 1: for the first input vector
X 1, this equation simply takes a linear combination
of the outputs of the hidden layer nodes,
and likewise for every other input vector.
On the right, every b i j is equal to 1 if
the corresponding input vector X i belongs
to class omega j, and equal to 0 if it does
not, so every b i j assumes a binary value,
either 0 or 1. So, this is the kind of situation
that I have.
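Assembling this matrix and target vector can be sketched as follows, with a made-up toy training set (the helper names build_phi and build_b, and the data, are my own illustration):

```python
import numpy as np

def build_phi(X, receptors, sigmas):
    """Phi[k, i] = phi_i(X_k): the N x M matrix of hidden-layer outputs
    for N training vectors (rows of X) and M Gaussian RBFs."""
    d2 = np.sum((X[:, None, :] - receptors[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * np.asarray(sigmas) ** 2))

def build_b(labels, j):
    """b_j[k] = 1 if X_k belongs to class omega_j, else 0."""
    return (np.asarray(labels) == j).astype(float)

# Toy example: N = 4 training vectors, M = 2 receptors, classes 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
receptors = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = [1.0, 1.0]
Phi = build_phi(X, receptors, sigmas)  # shape (4, 2)
b_1 = build_b([0, 0, 1, 1], j=1)       # [0, 0, 1, 1]
```

Each row of Phi is one training vector seen through all the hidden layer nodes, and b_1 marks which training vectors belong to class 1.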
This whole matrix equation I can write in
the short form Phi W j = b j, where Phi is
this matrix, W j is the vector of weights
connected to output layer node j, and b j
is the vector of desired outputs of the j th
output layer node for the different input
vectors. So, if the network is properly trained,
that is, if all the W i j have their trained
values, then this equation should be satisfied.
But what we are trying to do is train the
network, that is, set the weights W j, so
we cannot expect this equation to be satisfied
initially. If this equality is not satisfied,
then what I can do is define an error e, which
is nothing but Phi W j - b j. And now, training
involves adaptation of this weight W j, so
that this error can be minimized. So, in order
to do that, as we have done earlier for the
mean square error technique for classifier
training, I can define a criterion function
J of W j, given by the squared norm
||Phi W j - b j||^2, and then take the gradient
with respect to W j: grad of J(W j) =
2 Phi transpose (Phi W j - b j). By equating
this to 0, what we get is W j =
(Phi transpose Phi) inverse Phi transpose b j.
And as you have seen earlier, this
(Phi transpose Phi) inverse Phi transpose
is what is called the pseudo inverse, represented
as Phi plus. So, we have W j by this pseudo
inverse technique: W j = Phi plus b j, where
this b j is defined
beforehand: every component of b j will be
either 1 or 0. It will be equal to 1 if the
corresponding input feature vector belongs
to class omega j, and equal to 0 if it does
not. So, I have this vector b j; Phi indicates
the outputs of the hidden layer nodes for
every feature vector, so from that I compute
my matrix Phi.
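The pseudo inverse solution W j = Phi plus b j can be sketched with numpy's pinv, using made-up hidden-layer outputs (a sketch, not the lecture's numbers):

```python
import numpy as np

# Hidden-layer outputs for N = 4 training vectors through M = 2 RBFs
# (toy numbers), and the desired outputs b_j for one output node.
Phi = np.array([[1.00, 0.37],
                [0.99, 0.40],
                [0.37, 1.00],
                [0.40, 0.99]])
b_j = np.array([0.0, 0.0, 1.0, 1.0])

# W_j = (Phi^T Phi)^{-1} Phi^T b_j = Phi^+ b_j
W_j = np.linalg.pinv(Phi) @ b_j

# The trained node's responses Phi @ W_j then approximate b_j in the
# least-squares sense: close to 0 for the first two vectors and close
# to 1 for the last two.
print(np.round(Phi @ W_j, 2))
```

Repeating this for every output node j, each with its own b j, gives all the hidden-to-output weights at once, with no iterative training loop.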
So, once I have this matrix Phi and this b j,
I can compute the connection weights from
the different hidden layer nodes to the j th
output layer node. If I do this for every
output layer node, I can compute all the connection
weights from the hidden layer nodes to the
output layer nodes, and that completes the
training of the RBF neural network; the network
is then ready for classification. Now, if
you compare this RBF neural network against,
say, the multilayer perceptron, you find that
training the RBF network is faster, because
in the multilayer perceptron training is done
by the back propagation algorithm, which takes
a large number of iterations. The second advantage
is that I can easily interpret the meaning,
or function, of every node in the hidden layer,
which is difficult in the case of the multilayer
perceptron: there I cannot easily interpret
the role of the different hidden layer nodes,
and, not only that, I also cannot easily decide
how many hidden layers there should be and
how many nodes in every hidden layer. Those
difficulties of the multilayer perceptron
are not there in the RBF network. However,
the RBF network has a disadvantage: though
training is faster, classification takes more
time than in the MLP, because every node in
the hidden layer has to compute a radial basis
function value for the input feature vector,
which is time consuming. So, classification
in the RBF network takes more time than in
the multilayer perceptron, okay?
So, with this we come to a conclusion on the
radial basis function neural network. Now,
I will briefly discuss another kind of classifier,
which is called the support vector machine.
The support vector machine is another type
of linear classifier. If you remember what
we discussed for a linear classifier: given
a two class problem, I can define a discriminant
function g of X of the form W transpose X + b,
and if g of X = W transpose X + b is greater
than 0, the feature vector X belongs to class
omega 1; if it is less than 0, X belongs to
class omega 2. So, for classification purposes
the actual value of g of X is not really important;
what is important is the sign of g of X. If
the sign is positive I infer that X belongs
to class omega 1; if it is negative, X belongs
to class omega 2. Now, if with every X i I
associate a number y i, which can be either
plus 1 or minus 1, then y i times
(W transpose X i + b) will always be greater
than 0 whenever the sample X i is properly
classified. This is quite obvious: I set
y i = plus 1 for a sample X i belonging to
class omega 1, and for such a sample
W transpose X i + b is greater than 0 and
y i is also positive.
So, y i times this will obviously be greater
than 0. If X i belongs to class omega 2, then
W transpose X i + b will be less than 0, but
for that case I have set y i = minus 1, so
y i times (W transpose X i + b) will again
be greater than 0. This is a concept that
we actually used when we discussed the perceptron
criterion for designing the linear classifier:
for every feature vector belonging to class
omega 2, we negated the feature vector before
designing the classifier, so that for every
correctly classified feature vector, irrespective
of whether it belongs to class omega 1 or
class omega 2, the discriminant function value
is always positive. We will discuss the support
vector machine in more detail in our next
class. Thank you.
