In this module we will talk about the proof of convergence for the Perceptron Learning Algorithm that we saw in the previous module. So, we have some faith and intuition that it actually works; we just need to formally prove that it actually converges. That is what we are going to do in this module.
So, before that, a few very simple definitions. If you have two sets of points P and N in an n-dimensional space, we say that these points are absolutely linearly separable if there exist n plus 1 real numbers w0 to wn such that every point which belongs to P (P is the set for which the output is 1) satisfies the condition that its weighted sum exceeds w0, and every point which lies in the negative set N satisfies the condition that its weighted sum is less than w0. So, nothing very different from what we have been saying so far; it is just formally defining it.
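Spelled out, the condition on the slide reads (a reconstruction in the lecture's usual notation, with x = (x1, ..., xn)):

```latex
\sum_{i=1}^{n} w_i x_i > w_0 \quad \forall\, (x_1,\dots,x_n) \in P,
\qquad\qquad
\sum_{i=1}^{n} w_i x_i < w_0 \quad \forall\, (x_1,\dots,x_n) \in N .
```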
Now, our proposition is this: if the sets P and N are finite, with a fixed number of points in each (which was the case in the toy example we were doing, and which will be the case in most examples we do), and they are linearly separable, then the perceptron learning algorithm updates the weight vector only a finite number of times. But before I go there, let me not give you the definition; let me ask you for it instead.
So, now I have given this first definition and this part of the proposition. Can you tell me what I need to prove if I want to prove that the algorithm converges? That is one way of looking at it. But recall what was happening in that wrong argument I was making: the weight vector continuously kept toggling. That means I am not making a finite number of updates; I have to keep changing the weights again and again, and this process continues in a loop.
So, that is how I am going to define convergence: the perceptron learning algorithm updates the weight vector only a finite number of times, and after those finitely many updates it reaches a configuration in which it is able to separate P from N. That is what the proof of convergence means.
In other words, if you pick up these vectors from the sets P and N cyclically, as we were doing in the toy example, then a weight vector wt is found after a finite number of steps which separates these two sets. So, that is what we are trying to prove; that is the definition of convergence. Does it make sense? Ok.
So, the proof is on the next slide, and it is going to take me around 5 to 10 minutes, so just stay focused. Here is a bit of setup. Before I go to the actual proof, I am going to set things up so that it becomes easier for us to prove the result. The first thing I am going to say is that if a point belongs to the negative set, then the negative of that point belongs to the positive set. That is very clear, because if the point belongs to the negative set, then w transpose x is less than 0, but then w transpose of minus x would be greater than 0. So, I take the negative of the point and I can just put it in the positive set. So, instead of considering the two different sets P and N, I am just going to consider one set P', which is the union of P and the negations of all the points in N. Is the setup clear? If this is the setup, then what is the condition that I need to ensure for every point in P'?
w transpose p should be greater than or equal to 0. So, I do not care about the negative case any more; I have just made everything positive, and I have not done anything wrong here; it is just a simple trick. And now this is how the algorithm looks in this setup: P contains the inputs with label 1, N contains the inputs with label 0, N minus contains the negations of all the points in N, and P' is the union of P and N minus. Again, I start by initializing w randomly; while not converged, I pick a random p from P'. Now, what is the if condition?
Less than 0.
Do I need the other if condition?
No.
No, right, because everything is now positive. And the other small thing I am going to do is normalize p. That does not change anything, because we are talking in terms of angles and I am not changing the direction of the vector; I am just shrinking it, or maybe scaling it, so that it has unit norm. So, everything still holds.
And in particular, you can see here that if the condition w transpose p greater than or equal to 0 was true before normalizing, it will also be true after normalizing, because dividing p by its norm does not change the sign of w transpose p. So far I have just done some simple tricks to make things easier for me later on; so now every p has been normalized.
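Before moving on, here is a minimal sketch of this setup as code; the function name perceptron_pprime, the epoch cap, and the seed are my own choices for illustration, not from the lecture:

```python
import numpy as np

def perceptron_pprime(P, N, max_epochs=1000, seed=0):
    """Perceptron learning on P' = P union {-x : x in N}, with unit-norm points."""
    # Build P': keep the positive points, negate the negative ones, normalize all.
    P_prime = [p / np.linalg.norm(p) for p in P] + \
              [-x / np.linalg.norm(x) for x in N]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(P_prime[0].shape)  # random initialization
    for _ in range(max_epochs):
        converged = True
        for p in P_prime:
            if w @ p < 0:        # the single 'if' condition: a violation
                w = w + p        # correction step: w_{t+1} = w_t + p_i
                converged = False
        if converged:            # a full pass with no violation: P' separated
            break
    return w
```

For simplicity the sketch sweeps through P' in order instead of sampling randomly; the convergence argument does not depend on how the violated points are picked.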
Now, remember that this data is linearly separable; that is how we stated the proposition: if P and N are linearly separable, then the perceptron learning algorithm will converge. So, if P and N are linearly separable, irrespective of whether we run the perceptron learning algorithm or not, what do we know?
That there exists.
A line.
There exists a w star which is a solution vector; there exists at least one w star such that it separates the P points from the N points. This vector we do not know, but we know that it exists, so we can refer to it; we will call it w star, and we will take it to be normalized, unit norm. Fine? Now we start the proof. So, w star is some optimal solution which we know exists.
But we do not know what it is. Now, suppose you are at a time step t. Remember that this algorithm is running while not converged, so you have time steps 1, 2, 3, and so on, at which you are picking up points. Suppose at time step t you pick up a random point pi and you find that the condition is actually violated: w transpose pi is less than 0. So, now, what will you have to do?
w is equal to.
wt plus pi. So, I will just call the new w: wt plus 1 is equal to the old wt plus pi, ok. Now, what I am going to do is consider the angle beta between w star and wt plus 1. I do not know what w star is, but we can still assume it exists and make some calculations based on that. So, the angle between w star and wt plus 1 is beta; and what is the cosine of that angle?
And remember that we do not have the norm of w star in the denominator, because we had assumed that w star is normalized; its norm is actually equal to 1, so we do not need it. So, now, if I just take the numerator, w star dot product wt plus 1, I am going to expand wt plus 1 as wt plus pi; that is exactly what I did in the previous step. Is that ok? Fine.
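Written out, the two expressions just described are (my reconstruction of the slide):

```latex
\cos\beta \;=\; \frac{w^{*} \cdot w_{t+1}}{\lVert w_{t+1} \rVert}
\qquad (\lVert w^{*} \rVert = 1),
\qquad\quad
w^{*} \cdot w_{t+1} \;=\; w^{*} \cdot (w_t + p_i)
\;=\; w^{*} \cdot w_t + w^{*} \cdot p_i .
```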
Now, what is pi actually? What you have is the points p1, p2, p3, up to pn (my handwriting is really horrible), and I have just picked one of these pi's. Now, suppose these are my pi's, all the vectors that I have, and suppose this is the w star that I am interested in. For each of these points I could compute w star dot p1, w star dot p2, and so on up to w star dot pn, and I could sort these values. Now, for whichever of these points the value w star dot pi is the minimum, I am going to call that value delta. Suppose w star dot p1 is the smallest quantity out of w star dot p1, w star dot p2, ..., w star dot pn; then I am calling that quantity delta.
So, I have the quantity w star dot pi here, and my delta is the minimum over all the values it can take, over w star dot p1 up to w star dot pn. So, here I get an inequality: any pi that I put in here is always going to be greater than, or in the worst case equal to, delta. Are you ok with this? Fair? Fine.
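In symbols, this step is (note that delta is strictly positive, because w star strictly separates the points of P'):

```latex
\delta \;=\; \min_{i}\; w^{*} \cdot p_i \;>\; 0,
\qquad\quad
w^{*} \cdot w_{t+1} \;=\; w^{*} \cdot w_t + w^{*} \cdot p_i
\;\geq\; w^{*} \cdot w_t + \delta .
```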
Now, this wt itself I could write as wt minus 1 plus pj, because wt also would have come from some update at a previous step. Again there is a w star dot pj term here, which I can lower-bound by delta and still retain the greater than or equal to. Fine. So, let us see where we are heading with this.
Now, notice that we do not make a correction at every time step. When I was running the toy algorithm, I was not making a correction at every time step; we were only making corrections at those time steps at which the condition was violated. So, if I am at the t-th time step, maybe I have made only k corrections, with k less than or equal to t. At most I would have made t corrections, but it could have been fewer.
So, every time we make a correction, we are adding at least delta to this dot product. So, at time step t, what would happen? I started off from w0 and reached time step t, and I have made the case that I have not necessarily made t updates; I have made k updates, with k less than or equal to t. So, how many deltas would get added?
K delta.
k delta. So, I could say that with respect to w0, where I had started from, the dot product has grown by at least k delta. Is that fine? Does anyone have a problem with this? Ok.
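Unrolling the induction over all k corrections, the numerator bound on the slide becomes:

```latex
w^{*} \cdot w_{t+1} \;\geq\; w^{*} \cdot w_0 + k\,\delta .
```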
So, so far, what have we shown? We started with a point pi, picked up at the t-th step, for which the condition was violated (w transpose pi was less than 0, not greater than or equal to 0), and hence we made the correction. And we also showed that the numerator is greater than or equal to w star dot w0 plus k delta; we showed it by induction. Fine. Now, let us look at the denominator, and in particular let us look at the denominator squared. Is that a fair step?
The denominator squared is actually wt plus 1 dot product wt plus 1, but wt plus 1 can be written as wt plus pi, so when I expand, the bracket disappears: I get wt squared plus 2 times wt dot pi plus pi squared. Is that ok? Fine. Now, what is this middle quantity, wt dot pi?
That is less than 0, because we only made a correction at this step when the condition was violated. So, now, can you guess what is the next thing that I am going to write?
That is correct; yeah, it is a negative quantity, so the whole expression is going to be less than or equal to wt squared plus pi squared. So, that is fine. And what about pi squared, this last term?
Because that middle term is less than 0, right? That is why. Yeah.
Correct. Is this fine? Ok. Now, what is pi squared?
1.
One, because we normalized every point to unit norm. Now, can you guess what I am going to do? By induction?
k.
Ok. So, what is wt squared again? Just like wt plus 1 squared was at most wt squared plus 1, wt squared is going to be at most wt minus 1 squared plus 1. And how many such ones will get added? k of those, starting from w0.
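Putting the denominator steps together (again my reconstruction of the slide):

```latex
\lVert w_{t+1} \rVert^{2}
\;=\; \lVert w_t \rVert^{2} + 2\, w_t \cdot p_i + \lVert p_i \rVert^{2}
\;\leq\; \lVert w_t \rVert^{2} + 1
\;\leq\; \lVert w_0 \rVert^{2} + k ,
```

using wt dot pi less than 0 at a correction step, unit norm pi, and induction over the k corrections.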
So, what have we shown? The numerator is greater than or equal to w star dot w0 plus k delta, and the denominator is less than or equal to the square root of w0 squared plus k. Now, if I put them together, I get that cos beta is greater than or equal to this numerator over this denominator. Now, what is that bound proportional to: k, k squared, k cubed, square root of k, k by 2?
Square root of k.
Square root of k, right: roughly speaking, you have a k in the numerator and a square root of k in the denominator, so I could roughly say that the bound is proportional to square root of k. So, as k grows, what will happen to cos beta? It will grow. And is that fine? Can it keep growing?
Only until one, right; cos beta can never exceed 1. So, cos beta is bounded below by something proportional to square root of k, and what is k? The number of updates that we make. Now, if I were to take that degenerate case which you guys were hinting at, where the weights keep changing again and again, what will happen to k? It will keep going to infinity. Can that happen?
No
No, because then cos beta would blow up past 1, and that is not allowed. So, k has to be finite, so that cos beta stays within its limits.
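Putting the two bounds together, the contradiction on the slide is:

```latex
1 \;\geq\; \cos\beta \;\geq\;
\frac{w^{*} \cdot w_0 + k\,\delta}{\sqrt{\lVert w_0 \rVert^{2} + k}} ,
```

and the right-hand side grows like the square root of k, so the number of corrections k must be finite.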
Hence, are we done? How many of you think we are done? How many of you are satisfied that we are done? It is not a trick question: we are done.
So, yeah, this says that we can only make a finite number k of such updates, and after that the algorithm will converge. Is that ok? So, we have a proof of convergence.
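As a quick sanity check of the earlier sketch (my own toy data, not from the lecture), the loop does stop after finitely many corrections on separable data:

```python
# Toy linearly separable data: positives satisfy x1 + x2 > 0, negatives x1 + x2 < 0.
P = [np.array([1.0, 2.0]), np.array([2.0, 0.5])]
N = [np.array([-1.0, -1.5]), np.array([-2.0, -0.5])]
w = perceptron_pprime(P, N)
# After convergence, w @ p >= 0 for every normalized point in P'.
print(all(w @ (p / np.linalg.norm(p)) >= 0 for p in P))   # True
print(all(w @ (-x / np.linalg.norm(x)) >= 0 for x in N))  # True
```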
Ok, now, coming back to our questions; this is where we had started at one point. What about non-boolean inputs? The perceptron allows those: we took the IMDB rating and the critics rating as inputs. Do we always need to hand-code the threshold?
No.
No. In our perceptron learning algorithm, are all inputs equal? No, we now assign weights to the inputs. What about functions which are not linearly separable? We still do not know how to handle those; it is not possible with a single perceptron, but that is where we are headed now, and we will see how to handle it. So far, is the story clear to everyone? Ok. So, we will end this module here.
