What I would like to do now is step
through a statement of a theorem.
We are not going to prove the theorem.
That's really up to people like.
But we are going to state the theorem
and understand the theorem well enough
so we can apply it to showing
that Q learning converges.
>> Okay.
>> So, the first thing we are going to
do is just to make things
simpler to write down,
we're going to say that Q learning and
update algorithms like that
are going to update all state
action values on all time steps.
However, if it's a state action value
that doesn't actually correspond
to the current state action
pair that we just experienced
then we just set the learning rate to 0.
Right, so that just means leave the Q
values alone except for the state
action pair that you actually just
experienced and got a transition for.
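
As a rough illustration of the "update everything, but with learning rate 0" convention just described, here is a minimal sketch. The function name `q_update_all`, the table shape, and the numbers are all assumptions made up for this example, not something from the lecture.

```python
import numpy as np

def q_update_all(Q, s, a, r, s_next, alpha, gamma):
    """Update every (state, action) entry on this time step.

    Only the pair (s, a) that was actually experienced gets a
    nonzero learning rate, so all other entries are left unchanged.
    """
    # Learning-rate table: zero everywhere except the visited pair.
    alphas = np.zeros_like(Q)
    alphas[s, a] = alpha
    # One-step target from the observed transition (s, a, r, s_next).
    target = r + gamma * np.max(Q[s_next])
    # Written as an update over the whole table; where alphas == 0
    # this reduces to Q itself.
    return (1.0 - alphas) * Q + alphas * target

# Hypothetical 3-state, 2-action table, all zeros to start.
Q = np.zeros((3, 2))
Q_new = q_update_all(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
```

Writing the update over the whole table is only a notational convenience: it lets the update be treated as an operator on entire Q functions rather than on single entries.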
>> Okay.
This is the beginning of the statement
of the theorem. So we're going to say B
is going to be some contraction mapping.
So this is going to be
the Bellman Operator ultimately.
Q star equals B Q star.
That's the fixed point.
That's the solution to
the Bellman Equation.
So Q star is like Q star.
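
To make "contraction mapping" and "fixed point" concrete, here is a small sketch on a made-up MDP. The transition table `P`, the reward table `R`, and the discount `gamma` are hypothetical values chosen for illustration. It checks that one application of the Bellman operator shrinks the max-norm distance between two Q functions by at least a factor of `gamma`, and that repeated application lands on a Q star with Q star = B Q star.

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def B(Q):
    """Bellman optimality operator:
    (BQ)(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * max_a' Q(s', a')."""
    return R + gamma * P @ Q.max(axis=1)

# Contraction check on two arbitrary Q functions (max norm).
rng = np.random.default_rng(0)
Q1 = rng.normal(size=(2, 2))
Q2 = rng.normal(size=(2, 2))
dist_after = np.abs(B(Q1) - B(Q2)).max()
dist_bound = gamma * np.abs(Q1 - Q2).max()

# Repeated application converges to the fixed point Q_star = B(Q_star).
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = B(Q_star)
```

Here `dist_after` should come out no larger than `dist_bound`, which is the contraction property, and `Q_star` should be unchanged (up to floating point) by one more application of `B`.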
>> Okay.
>> And
let's imagine that we've got
some sequence of Q functions
that starts off with Q0, and the way
we're going to generate the next step
from the previous step is we're going to
have a new kind of operator, B sub T.
B sub t is going to be applied to Q sub
t, producing an operator that we then
apply to Q sub t and that's what
we assign Q t plus one to be.
So in the context of Q-learning,
this is essentially the Q-learning
update, but we're going to separate out
the two different Q functions that
are used in the Q-learning update.
One is the past Q function that
we're using to average together,
to take care of the fact that
there's noise in the transitions.
And then the other Q function is the one
that we're using in one-step look ahead
as part of the Bellman equation.
But we'll get to that in a moment.
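
One way to write down the separation just described, as a sketch: the symbols \alpha_t (learning rate), r_t (observed reward), s'_t (observed next state), and \gamma (discount) are notation assumed here rather than taken from the lecture. U plays the role of the past Q function being averaged, and V the Q function used in the one-step look ahead.

```latex
Q_{t+1} = [B_t Q_t]\, Q_t,
\qquad
\bigl([B_t U]\, V\bigr)(s,a)
  = \bigl(1 - \alpha_t(s,a)\bigr)\, U(s,a)
  + \alpha_t(s,a)\,\Bigl(r_t + \gamma \max_{a'} V(s'_t, a')\Bigr)
```

Setting U = V = Q_t gives back the usual Q-learning update, with \alpha_t(s,a) = 0 for every pair other than the one just experienced.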
But here's the cool thing.
That this sequence of Q functions,
starting from any Q0 that we want,
as long as we keep applying this,
is going to go to Q star, as long
as we have certain properties holding
on how we defined these B sub t's.
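
As an empirical illustration only (not a proof, and not the theorem's actual conditions), here is a sketch that starts from a deliberately bad Q0 and keeps applying a sampled B sub t. The MDP tables, the uniform random choice of state action pairs, and the decaying learning rate `visits ** -0.6` are all assumptions picked for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman(Q):
    """Exact Bellman optimality operator, used only to get a reference Q_star."""
    return R + gamma * P @ Q.max(axis=1)

Q_star = np.zeros((2, 2))
for _ in range(200):
    Q_star = bellman(Q_star)

def apply_B_t(U, V, s, a, r, s_next, alpha):
    """([B_t U] V): average the past estimate U with a sampled
    one-step look ahead through V, at the visited pair only."""
    out = U.copy()
    out[s, a] = (1.0 - alpha) * U[s, a] + alpha * (r + gamma * V[s_next].max())
    return out

Q = np.full((2, 2), 100.0)            # arbitrary, far-off Q_0
initial_err = np.abs(Q - Q_star).max()
visits = np.zeros((2, 2))
for t in range(100_000):
    # Assumed sampling scheme: pick a state action pair uniformly at random.
    s, a = rng.integers(2), rng.integers(2)
    # Sample the next state from P[s, a, :] (two states, so one coin flip).
    s_next = int(rng.random() < P[s, a, 1])
    visits[s, a] += 1
    alpha = visits[s, a] ** -0.6      # assumed decaying learning rate
    Q = apply_B_t(Q, Q, s, a, R[s, a], s_next, alpha)
final_err = np.abs(Q - Q_star).max()
```

In a run like this, `final_err` should end up far smaller than `initial_err`. Passing the same `Q` as both `U` and `V` is what turns the two-argument operator back into ordinary Q-learning.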
