Hello. In the last few classes we have talked about neural networks and their application to pattern recognition. In particular, we have talked about two types of neural networks. One was the Hopfield neural network; as we said, Hopfield networks are basically meant for binary feature vectors. If I have a number of binary feature vectors, the Hopfield network learns, or remembers, those feature vectors by updating its weights. So the learning process is basically an updating of weights, and in the case of the back propagation neural network, learning is also by means of updating weights. But the back propagation, or feed forward, neural network is not confined to binary feature vectors; it also handles feature vectors with real-valued components.

Now, when we talked about the feed forward or back propagation neural network, we said that it is a multi-layered network. It has an input layer, which is just a buffer layer whose responsibility is simply to pass the information to the layers above it. In the simplest case of the back propagation network we had the single layer perceptron, which consists of only one output layer in addition to the input layer. If I draw the network architecture of the single layer perceptron, it is something like this: we have an input layer and an output layer. The number of nodes in the input layer matches the dimensionality of the feature vector; if the feature vector is of dimension d, then the number of nodes in the input layer is d plus 1, where the extra node takes care of the bias term. The number of nodes in the output layer is the same as the number of classes. From each input layer node there is a connection to every output layer node. A feature vector is fed to the input layer nodes, and one node is kept at the constant value 1, which feeds the bias term. The outputs of the different nodes in the output layer indicate the corresponding classes Ci. Every node in the input layer is connected to every node in the output layer through a connection weight, say wij; that is, the ith node in the input layer is connected to the jth node in the output layer through the connection weight wij.
During training, we fed feature vectors whose classes are known. If the class of a feature vector is known, that means I know what the output corresponding to that particular feature vector should be. That desired output is a target vector, say t. When the network has not learnt properly, then for an input vector I get an output o which may not be equal to the target t. So I have an error e, which is the square of the difference between t and o. During training of this back propagation neural network, we tried to minimize this error e by updating the connection weights: the error was expressed as a function of the set of connection weights wij, and through the gradient descent procedure we reduced the error by updating the weights wij.

In the case of the multilayer perceptron we had the same input layer, which is always present, and again the number of nodes in the output layer is the same as the number of classes. In addition, we had one or more hidden layers. The connection pattern remains the same: every node in a layer is connected to every node in the layer above it. It continues this way, and finally, we get the output.
However, the training for this network is the same in spirit as for the single layer perceptron. For a feature vector that is input to the neural network, we know what the target output is. When the network has not learnt properly, we get an output for that same input feature vector which is not the same as the target output. So there again we get an error, and when I express this error as a function of the weight matrix, or the connection weights, we try to reduce the error by adapting the weights. When we modify the weights, the error information used for reducing the error propagates from the output layer to the input layer; it propagates in the reverse direction. So the error information, or the weight update information, propagates from the output layer to the input layer.
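The weight-update loop just described can be sketched in code. This is a minimal sketch of my own, assuming a sigmoid activation on the output nodes and the squared error e = sum over j of (tj - oj) squared, with the constant factor 2 from the derivative absorbed into the learning rate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def outputs(x, w):
    # w[i][j] connects input node i to output node j;
    # input node 0 is the bias node, clamped at 1
    xb = [1.0] + list(x)
    n_out = len(w[0])
    return [sigmoid(sum(xb[i] * w[i][j] for i in range(len(xb))))
            for j in range(n_out)]

def train(samples, targets, lr=0.5, epochs=2000):
    n_in = len(samples[0])           # dimensionality d of the feature vector
    n_out = len(targets[0])          # one output node per class
    w = [[0.1] * n_out for _ in range(n_in + 1)]
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            xb = [1.0] + list(x)
            o = outputs(x, w)
            # gradient descent on e = sum_j (t_j - o_j)^2
            for j in range(n_out):
                delta = (t[j] - o[j]) * o[j] * (1.0 - o[j])
                for i in range(n_in + 1):
                    w[i][j] += lr * delta * xb[i]
    return w
```

For a linearly separable problem such as the OR function, this loop drives the outputs toward their targets.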
That is why it is called back propagation learning. During classification, however, the information moves from the input layer to the output layer; the information is fed forward from input to output, which is why it is called a feed forward network, whereas the learning algorithm is called the back propagation learning algorithm.

Now, when we discussed this feed forward or back propagation network, we said that if the classes are linearly separable, then a single layer perceptron is sufficient. But if the classes are not linearly separable, a single layer perceptron is not sufficient; we have to have one or more hidden layers. What these hidden layers do is represent a non-linear boundary by a set of piecewise linear boundaries. To explain this, we took a few examples. We took the AND function. If I represent the AND function over two-dimensional binary vectors, the inputs are 0 0, 0 1, 1 0 and 1 1, and only when the input is 1 1 is the output 1; in all other cases, 0 0, 0 1 or 1 0, the output is 0. Now, if we consider this AND function as a classification problem, it is quite obvious that I can draw a straight line separating the set of 0s from the set of 1s. That means the input vector 1 1 is put in one class, all the other input combinations, 0 0, 0 1 and 1 0, are put in another class, and I can separate these two classes by a straight line. So the AND function is a linearly separable function.

Similarly, if I take the OR function with components x1 and x2, only when both components are 0 is the output 0; in all other cases the output is 1. Here again I can separate the 1s and the 0 by a straight line, which clearly indicates that this is also a linearly separable problem. But we faced difficulty with the XOR function. For the XOR function, when the input is 0 0 the output is 0; when the input is 1 1 the output is also 0; and when the input is 0 1 or 1 0 the output is 1. Given this situation, you find that I can no longer separate the 0s and the 1s by a single straight line. Rather, I need two different straight lines to separate the 0s and the 1s.
This clearly says that the XOR function is not linearly separable. When we discussed the back propagation neural network, we showed that the AND function and the OR function can be implemented using single layer perceptrons.
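We can verify these separability claims numerically. The following sketch of my own (not from the lecture) does a brute-force search over candidate lines w1*x1 + w2*x2 + b = 0 drawn from a coarse grid of coefficients; it finds a separating line for AND and OR, but none for XOR (for XOR this is not merely a failure of the grid: no line exists at all):

```python
from itertools import product

def linearly_separable(points, labels, grid):
    # try every line w1*x + w2*y + b = 0 with coefficients from the grid
    for w1, w2, b in product(grid, repeat=3):
        signs = [w1 * x + w2 * y + b > 0 for x, y in points]
        if all(s == (l == 1) for s, l in zip(signs, labels)):
            return True
    return False

pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
grid = [x / 2 for x in range(-6, 7)]   # coefficients -3.0, -2.5, ..., 3.0
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]
print(linearly_separable(pts, AND, grid))   # a separating line is found
print(linearly_separable(pts, OR, grid))    # a separating line is found
print(linearly_separable(pts, XOR, grid))   # no line works
```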
But the XOR function cannot be implemented by a single layer perceptron; because it is not linearly separable, we need at least one hidden layer to implement the XOR function with a feed forward neural network. So today we will discuss another kind of network, which is called the radial basis function network. What the radial basis function network does is perform a non-linear transformation of the input vectors before the input vectors are fed for classification. By using such a non-linear transformation, it is possible to convert a linearly non-separable problem into a linearly separable one. In addition, the RBF network, the radial basis function network, increases the dimensionality of the feature vectors. I will come to that later; first let us see how applying a non-linear function to the input feature vectors, before we go for classification, can convert a linearly non-separable problem into a linearly separable problem. Let us take a situation something like this.
Say I have a set of feature vectors of this form. Suppose all the green crosses represent feature vectors belonging to one class and all the red crosses represent feature vectors belonging to another class. Clearly, these two sets of feature vectors cannot be separated by a straight line; rather, I need a curve, at least a quadratic curve, to separate the two classes. Now suppose I impose a non-linear transformation which expands the feature vectors in one direction and contracts them in the other. If I do this, one set of feature vectors will be gathered into one cluster and the other set into another cluster, and because of this non-linear transformation it may now be possible to find a straight line which separates one set of feature vectors from the other. That is why I say that if I apply a non-linear transformation to the feature vectors before I go for classification, it may be possible to convert a non-linearly separable problem into a linearly separable one.

The other point, as I said, is that the RBF network also increases the dimensionality of the feature vector. It has been found that if we increase the dimensionality of the feature vectors, then a classification problem which is linearly non-separable in a lower dimensional space may become linearly separable in a higher dimensional space; and the more you increase the dimension, the more the possibility of linear separability increases. It is something like this: suppose the tips of my fingers represent different feature vectors, with the tips of the fingers of the left hand representing feature vectors of one class and the tips of the fingers of the right hand representing feature vectors of another class, and suppose all these feature vectors are coplanar, lying in the same plane. In this situation, the two sets of feature vectors are obviously not linearly separable. However, if I introduce a third dimension, pushing one hand forward and the other backward, then in this third dimension they become linearly separable. This is just simple logic: by increasing the dimensionality, it is possible to increase the possibility of linear separability among classes which are linearly non-separable in the lower dimensional space.

So the RBF network actually does both: it imposes a non-linear transformation on the feature vectors, and it also increases their dimensionality. If I have a set of feature vectors whose original dimension is P, I increase the dimension of the feature vectors to dimension M.
How is this to be done? A given feature vector x is subjected to a function phi, where phi(x) is of the form (phi1(x), phi2(x), phi3(x), ..., phiM(x)). If the original feature vector x is of dimension P, then to each such feature vector x I apply this set of M phi functions. Each phi function produces a real value, so by applying the M phi functions to the feature vector x I create M real components; what I get is an M dimensional feature vector, and if M is greater than P then obviously I am increasing the dimensionality of the feature vector from P to M. In the case of the radial basis function network, each of these functions phi1, phi2, phi3 up to phiM is a radial basis function.

Then what is a radial basis function? Every radial basis function has a receptor, say t, and as I move radially away from this receptor, the value of the function goes on increasing or goes on decreasing. The value at a point depends on the radial distance of the point from the receptor t, and the value of the function is either maximum or minimum at the location t. So if I draw concentric circles around the receptor t, the value of the function phi on each of these concentric circles is constant. There are different choices of radial basis functions.
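Written out in code, the standard choices look like this (a small sketch; c and sigma are the constants appearing in the formulas that follow, and r is the distance from the receptor):

```python
import math

def multiquadric(r, c=1.0):
    # phi(r) = (r^2 + c^2)^(1/2): value c at r = 0, increasing with r
    return math.sqrt(r * r + c * c)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = 1 / (r^2 + c^2)^(1/2): value 1/c at r = 0, decreasing with r
    return 1.0 / math.sqrt(r * r + c * c)

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)): value 1 at r = 0, decreasing with r
    return math.exp(-(r * r) / (2.0 * sigma * sigma))
```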
One of them is called the multiquadric, which is given by phi(r) = (r^2 + c^2)^(1/2) for c > 0. Then I can have the inverse multiquadric, where phi(r) is just the inverse of the multiquadric, phi(r) = 1 / (r^2 + c^2)^(1/2) for c > 0. I can also use the Gaussian function, where phi(r) = exp(-r^2 / (2 sigma^2)) for sigma > 0. Here r is a measure of the distance of a point from what I called the receptor of the radial basis function: if I want to compute the radial basis function at a point x, where t is the receptor, then r is nothing but the norm of x minus t, which is the distance between x and t.

You will find that if I use the multiquadric, then at r = 0 the value of the radial basis function is c, which is the minimum, and as r goes on increasing, the value of the radial basis function goes on increasing. For the inverse multiquadric, at r = 0, that is when the point x coincides with the receptor t, the value is 1/c, which is the maximum, and as r goes on increasing, the value of the radial basis function goes on decreasing; its shape is just the inverse of that of the multiquadric. The Gaussian function has the very common bell shape you all know: at the receptor location t the value of phi is maximum, and as you move away from the receptor point, the value of phi(r) goes on reducing, following this pattern. It is this Gaussian function which is most commonly used as a radial basis function.

Now let us see how, by using this radial basis function, we can convert a linearly non-separable problem into a linearly separable one, even keeping the dimensionality of the feature vector the same. For that I will use the same XOR problem, with the same input points and outputs as before. These are the outputs of the XOR function, and what I will do is use radial basis functions to nonlinearly transform these feature vectors into another space.
The radial basis function I use is a Gaussian, so the form of the radial basis function is phi(x) = exp(-||x - t||^2), where t is the receptor; here I have assumed 2 sigma^2 = 1, which does not matter for this non-linear transformation. And as I said, I will not increase the dimensionality, so I will use two receptors: the receptor t1 at the location 0 0 and the receptor t2 at the location 1 1. Then phi1(x) = exp(-||x - t1||^2) and phi2(x) = exp(-||x - t2||^2). Using this non-linear transformation, I take the inputs 0 0, 0 1, 1 0 and 1 1 and compute the two output functions phi1 and phi2.

For the input 0 0, which coincides with the receptor t1, we have ||x - t1||^2 = 0, so phi1(0 0) = e^0 = 1, and if you compute the other term, phi2(0 0) comes out to be approximately 0.1. Next, phi1(0 1) is approximately 0.4 and phi2(0 1) is also approximately 0.4; similarly, phi1(1 0) is approximately 0.4 and phi2(1 0) is also approximately 0.4, because of the symmetry of these exponential functions. Finally, for the input 1 1, which coincides with the receptor t2, the distance ||x - t2|| is 0, so phi2(1 1) = 1 and phi1(1 1) is approximately 0.1. So 0 0 has been used as the receptor of the first radial basis function and 1 1 as the receptor of the second, and the inputs have been mapped as follows: 0 0 to (1, 0.1), 0 1 to (0.4, 0.4), 1 0 to (0.4, 0.4), and 1 1 to (0.1, 1).
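These numbers can be checked with a small script (my own), again taking 2 sigma^2 = 1; to two decimals the exact values are 0.37 and 0.14 rather than the rounder 0.4 and 0.1 used above:

```python
import math

def gauss_phi(x, t):
    # Gaussian RBF with 2*sigma^2 = 1 and receptor t
    return math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

t1, t2 = (0, 0), (1, 1)                      # the two receptors
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p1, p2 = gauss_phi(x, t1), gauss_phi(x, t2)
    print(x, round(p1, 2), round(p2, 2))
# (0, 0) -> 1.0 0.14, (0, 1) -> 0.37 0.37, (1, 0) -> 0.37 0.37, (1, 1) -> 0.14 1.0
```

In the (phi1, phi2) plane the two XOR-positive inputs land on the same point, and a line such as phi1 + phi2 = 0.9 separates them from the other two points.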
If I plot this, with phi1 and phi2 on the axes instead of x1 and x2, then 0 0 is mapped to (1, 0.1), which is somewhere over here; 0 1 is mapped to (0.4, 0.4), which is somewhere here; and 1 0 is also mapped to (0.4, 0.4), the same point. So both of those inputs have now been mapped to the same point in the phi1-phi2 domain, and that has happened because of the non-linear transformation. And 1 1 has been mapped to (0.1, 1), somewhere over here. This being the situation, you find that I can now draw a straight line separating these points: a problem which was not linearly separable in the original space has become linearly separable in the phi space, simply by using a non-linear transformation. And the possibility of linear separability between the classes increases as we go on increasing the dimension of the feature vectors using these phi functions.

Using this, the architecture of the network will be of the following form. We have the input layer as before; it is always present because this is the layer which accepts the input vectors and feeds them to the next higher layers.
We will have only one hidden layer, and the number of nodes in the hidden layer will be M, where M is the dimension to which we want to increase the dimensionality of the feature vectors. So if P is the dimension of the input feature vectors, we have P nodes in the input layer, and if M is the increased dimensionality, then M is the number of nodes in the hidden layer. Each node in the hidden layer actually computes a radial basis function. In the output layer, we will have a number of nodes equal to the number of classes c. From each node in the input layer I have a connection to each node in the hidden layer, and from each node in the hidden layer I have a connection to each node in the output layer. Finally, I have outputs from the classifying neurons, and here I feed the input feature vectors; that is the architecture. The nodes in the hidden layer perform the radial basis functions, and the nodes in the output layer perform a linear combination of the outputs of the hidden layer nodes. So at the output layer I have a linear classifier; the classification is actually done only at the output layer. At the hidden layer we only apply the radial basis functions, which is a transformation of the input feature vector to a hidden space; because this is my hidden layer, you can say it is a transformation to a hidden space, carried out by the hidden layer nodes which compute the radial basis functions.
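Putting the architecture together, a forward pass through such a network might look like this; a sketch under my own naming, assuming Gaussian hidden nodes and a purely linear output layer:

```python
import math

def rbf_forward(x, receptors, sigmas, W):
    # hidden layer: node i computes a Gaussian RBF with receptor t_i, spread sigma_i
    phi = [math.exp(-sum((xj - tj) ** 2 for xj, tj in zip(x, t)) / (2.0 * s * s))
           for t, s in zip(receptors, sigmas)]
    # output layer: each class node forms a linear combination of the phi values
    return [sum(wi * pi for wi, pi in zip(w_class, phi)) for w_class in W]
```

With the XOR receptors above (sigma = 1/sqrt(2), so that 2 sigma^2 = 1) and weights (-1, -1) on a single output node, the two XOR classes fall on opposite sides of a threshold such as -0.9.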
So I have two levels of training. The first level is training the hidden layer nodes. What does this training mean? Every node in the hidden layer is a radial basis function, so for each radial basis function I have to have a receptor t; one purpose of training the hidden layer nodes is to find out what this receptor t should be. And if I use the Gaussian function, then as we said, the radial basis function is of the form phi(x) = exp(-r^2 / (2 sigma^2)), where r is the distance between x and t. The sigma here indicates the spread of the radial basis function, so that also has to be trained. So by saying that I want to train the hidden layer nodes, I mean that for each of the nodes I have to find out its t, and for each of the nodes I have to find out its sigma. At the second level I have another training operation, by which I have to train the weight vectors connecting the outputs of the hidden layer nodes to the output layer nodes, because it is the linear combination of the outputs of the hidden layer nodes which decides to which class a sample will belong, and that decision is available at the outputs of the output layer nodes.
So let us see how you can train a node in the hidden layer. Let us look at the situation we have. If I go for supervised classification, or supervised learning, I have a set of feature vectors, labelled with their class belongingness, which are to be used for training the neural network. In the case of the multilayer perceptron and the single layer perceptron, we did not have any non-linear transformation; from the label of the input feature vector and the actual output obtained at the output layer nodes, we defined an error function. Our aim was to reduce, or minimize, that error function by updating the connection weights, and the algorithm specified how the connection weights are to be modified in order to minimize the error function. That modification of the connection weights was carried out from the output layer to the input layer, in the backward direction.

In the case of the RBF network, by contrast, the training consists of two parts, the first of which is training the hidden layer nodes. Training the hidden layer nodes means that for every hidden layer node, which represents a radial basis function, I have to find out what the receptor should be, and also the spread of the radial basis function, which is sigma. So suppose we are given N training vectors, say x1, x2 up to xN, where each vector is of dimension P, and I want to transform these vectors to vectors of dimension M. Naturally, at the hidden layer I have to have M nodes, and for each of these nodes I have to have a receptor: for the ith node I have to have a receptor ti, with i varying from 1, 2 up to M, and for the ith hidden layer node I also have to have a sigma i, again with i varying from 1, 2 up to M.
So I have to have M receptors, and I have to have M variances, or standard deviations. This can be done in various ways. The simplest way is to pick, at random from the training samples, M samples to be used as receptors, but that is not very logical. What appears more logical is to go for clustering of the N training samples into M clusters; when I go for clustering, I basically identify the regions of the space where a feature vector is more likely to occur. Clustering we have not discussed yet, but just for the completeness of this lecture, I will simply indicate what clustering means. I have N samples, from which I have to form M clusters. It is understood here that N has to be greater than M: if N is less than M, that is, if the number of training samples is less than the number of clusters to be formed, it is not possible. Even if I assume that each cluster contains only one feature vector, I can have at most N clusters, so I cannot form a number of clusters greater than the number of feature vectors provided. So our assumption is that the number of feature vectors N is greater than the number of clusters M that we have to form. Now, there are various clustering algorithms. In one of them, I initially partition the N samples into M clusters, so that every cluster contains a number of samples, and then proceed iteratively for each of the clusters so formed.
Suppose I have a set of feature vectors given like this, and suppose I want to form three clusters. Initially, I arbitrarily partition the set of samples into three different clusters: this cluster contains these samples, this cluster contains those samples, and so on. Once I have these initial clusters, then for each of the clusters so formed I compute the cluster center, which will be the representative of the samples belonging to that cluster. Suppose the center of the first cluster comes out to be somewhere over here, the center of the second cluster somewhere here, and the center of the third cluster somewhere here; call these cluster centers mu1, mu2 and mu3.

Once I have formed these cluster centers, in the next iteration, for each of the samples I find out which cluster center is nearest to that particular sample. It may be that a particular sample's distance from mu3 is less than its distance from mu2; so although this sample was initially put in cluster 2, in the second iteration, because its distance from cluster center 3 is found to be less than its distance from cluster center 2, this sample will be moved from cluster 2 to cluster 3. You perform this operation for every sample: for each sample you find its nearest cluster center and move the sample to the cluster whose center is nearest to it. At the end of the second iteration you will find that the membership of the initial clusters has changed; I still have three clusters, but the samples in them have changed. So at the end of the second iteration I again compute the cluster center of each new set of samples, and after recomputing the cluster centers, I again reassign the feature vectors based on their distances from the new cluster centers. This operation is performed over a number of iterations, and finally, when it stabilizes, we say that the clustering is complete; whatever cluster centers I have at that point are my receptors. By this, for each of the radial basis functions, that is, for each node in the hidden layer, I have computed the receptor: the receptors are nothing but the cluster centers.

Once I have done that, having found ti for every ith node, the next thing is to find sigma i. For that I can use a nearest neighbour rule: from every ti, I find, say, the P nearest receptors and their distances from ti. So I get P distances, and I take the root mean square of these distances as the standard deviation sigma i. That is, if I want to find sigma j, I take tj minus ti, square it, sum this for i equal to 1 to P over the P nearest receptors ti, take 1/P of the sum, and then the square root: sigma j = sqrt( (1/P) * sum of ||tj - ti||^2 ). This is one way in which I can compute sigma. So I have the receptors ti and I have the sigma i.
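The two steps just described, cluster centers as receptors followed by sigma from the P nearest receptors, can be sketched as follows. This is my own minimal version of the iterative clustering, not the exact algorithm of a later lecture:

```python
import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, M, iters=50, seed=0):
    # pick M initial centers at random, then alternate assignment / re-centering
    rng = random.Random(seed)
    centers = rng.sample(samples, M)
    for _ in range(iters):
        clusters = [[] for _ in range(M)]
        for x in samples:
            j = min(range(M), key=lambda k: dist2(x, centers[k]))
            clusters[j].append(x)
        for j, c in enumerate(clusters):
            if c:  # new center = mean of the samples assigned to cluster j
                centers[j] = tuple(sum(v) / len(c) for v in zip(*c))
    return centers

def spreads(centers, P=1):
    # sigma_j = RMS distance from receptor t_j to its P nearest other receptors
    out = []
    for j, t in enumerate(centers):
        d2 = sorted(dist2(t, u) for k, u in enumerate(centers) if k != j)[:P]
        out.append(math.sqrt(sum(d2) / P))
    return out
```

The returned centers serve as the receptors ti, and spreads() gives the corresponding sigma i.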
Once I have ti and sigma i, every node in the hidden layer is defined: for every node I have its ti and its sigma i. Now, if I feed any feature vector x to this network, I get the corresponding phi1(x), phi2(x), up to phiM(x); the original feature vector x is converted to an M dimensional feature vector. Using these transformed feature vectors, I now have to go for classification, and for that I have to train the output layer of this radial basis function neural network: if you remember the architecture, I had a set of weights W, and I have to compute these Ws for classification purposes. Now, as we said, imposing the non-linear transformation and increasing the dimensionality, taken together, increase the possibility that the classes will be linearly separable in the higher dimensional space, and as I increase the dimension, as I increase the value of M, the possibility of linear separability goes on increasing.
But I do not know how many radial basis functions I have to use for my original problem to become linearly separable. So although I have increased the possibility of linear separability, I cannot guarantee that the classes are linearly separable. Therefore, the best way to find the connection weights at the output layer is by means of the least mean square error technique. We have already discussed the LMS technique, and we have seen that if I have a set of N feature vectors, then by putting every feature vector in the form of a row, I can form a matrix of size N by d, where d is the dimensionality of the feature vectors.
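Jumping ahead slightly to make this concrete: once each training vector is replaced by its row of hidden-layer outputs (plus a constant 1) and each output node is given a target column t, the output weights come from the pseudo-inverse solution w = (Phi^T Phi)^(-1) Phi^T t. A sketch with my own helper names, solving the normal equations directly:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (tiny helper for small systems)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lms_weights(Phi, t):
    # least-squares (pseudo-inverse) solution of Phi w = t:
    # w = (Phi^T Phi)^(-1) Phi^T t
    n = len(Phi[0])
    AtA = [[sum(row[i] * row[j] for row in Phi) for j in range(n)] for i in range(n)]
    Atb = [sum(row[i] * ti for row, ti in zip(Phi, t)) for i in range(n)]
    return solve(AtA, Atb)
```

For the XOR example above, with Phi rows (phi1, phi2, 1) and targets (0, 1, 1, 0), the fitted outputs reproduce the XOR labels exactly.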
We call this matrix Y. The logic is that when I want to train, say, the first output node of my classifier, then when I feed a sample belonging to class one, only that output should be 1 and the rest of the outputs should be 0. So for every input feature vector I can define an output vector, and using each of these outputs I can define what, in the case of the LMS algorithm, we called the bias vector. Then, by using the pseudo-inverse technique, I can find the weights. I will stop the discussion here today and continue with it in our next class. Thank you.
