All right. Let's start guys.
Okay. So a few announcements before we start.
So if you need OAE accommodations, please let us know if you haven't done that already.
You need to let us know by October 31st, because we need to figure out the alternate exam date.
We'll get back to you with the exact details around the alternate exam, but let us know by October 31st.
Project proposals are also due this Thursday.
So do talk to the TAs, talk to us, come to office hours, all that.
Okay. All right.
So today, I wanna talk about games.
So we've started talking about this idea of state-based models:
the idea that a state represents everything we need
in order to plan for the future.
We talked about search problems already.
We have talked about MDPs, where we are playing against nature,
and nature can play probabilistically,
and then based on that, we need to respond.
And today, we want to talk about games.
The setup is: we have two players playing against each other.
We're not playing against nature, which can act probabilistically.
We're actually playing against another intelligent agent
that's deciding for its own good.
That's the main idea of games. All right.
So let's start with an example,
one that we're going to use throughout the lecture. All right.
The example is: we have three buckets, A, B, and C.
You choose one of these three buckets,
and then I choose a number from that bucket.
Your goal here is to maximize the chosen number,
and the question is: which bucket would you choose?
Okay. So, so how many of you would choose bucket A?
No one trusts me, okay [LAUGHTER] No one trusts me, good.
How many of you would choose B?
Okay. So now people don't trust me [LAUGHTER].
How many of you would choose C?
Okay. So there's a number of people there too.
So how are you making that decision?
If you choose A, you're basically assuming
that I'm not trying to get you.
I might actually give you 50,
and if I give you 50, that'll be awesome:
a very large value that you are trying to maximize.
If you think I'm going to act adversarially
and try to minimize your number,
then you're going to choose bucket B,
because in the worst-case scenario I'll choose
the lowest number in the bucket,
and in bucket B the lowest number is 1, which is better than -50 and -5.
So if you're assuming I'm trying to
minimize your value, then you're going to choose bucket B.
And if you have no idea how I'm playing,
and you're just assuming I'm acting stochastically, maybe
flipping a coin and then based on that deciding what number to give you,
you might choose C, because in expectation
C is not bad, right?
If you just average out the numbers in each bucket,
the average value for A is 0,
for B it's 2, and for C it's 5.
So if I'm playing stochastically,
you might say, well, I'm probably going to get something around 5.
So you would pick C. Okay.
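The three ways of reasoning just described can be sketched in a few lines. The exact bucket contents here are my assumption, chosen to be consistent with the numbers mentioned (A averages 0 and contains 50, B's minimum is 1 and it averages 2, C's minimum is -5 and it averages 5):

```python
# The three decision rules from the bucket example. Bucket contents are
# my assumption, consistent with the numbers mentioned in lecture:
# A = {-50, 50}, B = {1, 3}, C = {-5, 15}.
buckets = {'A': [-50, 50], 'B': [1, 3], 'C': [-5, 15]}

def best_bucket(score):
    # Pick the bucket whose score (under some model of the opponent) is largest.
    return max(buckets, key=lambda b: score(buckets[b]))

print(best_bucket(max))                           # trust me to be nice: A
print(best_bucket(min))                           # assume I'm adversarial: B
print(best_bucket(lambda ns: sum(ns) / len(ns)))  # assume a coin flip: C
```

Each call models the opponent differently: best case, worst case, and expectation.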
So today we want to talk about
these different policies you might choose in these settings,
how we should model our opponent, and how we formalize these problems as games.
This is the example we just started with.
Okay. So the plan is to formalize games,
then talk about how we compute values in the setting of games:
we're going to talk about expectimax and minimax.
Towards the end of the lecture,
we're going to talk about how to make things faster.
We'll talk about evaluation functions as a way of making things faster,
which use domain knowledge to define values over nodes.
We're also going to talk about alpha-beta pruning,
which is a more general way of pruning your tree and making things faster.
Okay. All right.
So that's the plan for today.
Okay. So we just defined this game, and a way to
go about this game is to create something called a game tree.
A game tree is very similar to a search tree,
which we talked about two weeks ago, right?
The idea is that we have nodes in this tree,
and each node is a decision point for a player.
And we have different players here:
I was playing, you were playing, two different people playing.
Each decision node belongs to one of the players, not both of them.
And each root-to-leaf path is a possible outcome of the game.
For example, if your decision was to pick bucket A and then I chose -50,
that path gives us one possible outcome of how things can go.
Okay. So that is what the tree is representing here.
The root node is the decision you were making,
and the nodes in the next level are the decisions I was making in this setting.
So if we formalize this a little more,
we're going to formalize this problem as a two-player zero-sum game.
Okay. So in this class, at least today,
we're going to talk about two-player games, where
we have an agent and we have an opponent.
And we're going to talk about policies and values, and for all of those,
think of yourself as the agent.
You're playing for the agent.
You're optimizing for the agent.
The opponent is playing against you.
Okay. Today we're also going to talk about games
that are turn-taking games:
things like chess.
We're not talking about things like rock-paper-scissors.
We'll cover that next time, when we talk about
simultaneous games, where you're playing simultaneously.
Today we are talking about turn-taking settings.
Two-player, turn-taking settings.
Full observability: we see everything.
We're not talking about games like poker,
where you only have partial observation and don't
necessarily see the hand of your opponent.
So: full observation, two-player, and also zero-sum games.
And what zero-sum means is:
if I'm winning and I'm getting, say,
$10 from winning, then my opponent is losing $10.
So the total utility is going to be equal to zero.
If I win some amount,
my opponent loses the same amount.
Okay. All right.
So what are the things we need when we define games?
We need to know the players:
we have the agent, we have the opponent.
In addition to that, you need to define a bunch of things.
This should remind you of the search lecture and the MDP lecture.
You have a start state, s_start.
You have Actions(s), a function of state,
which gives us the possible actions from state s, similar to before.
You have a successor function, similar to search problems:
it takes a state and an action and
tells us the resulting state you're going to end up in.
And you have an IsEnd function
which checks whether you're in an end state or not.
And there are two things that are different here.
One is the utility function,
which gives us the agent's utility at an end state.
Okay. So one thing to notice here
is that the utility only comes at an end state.
After you finish the game, say
I've played my chess game and I won,
then I get my utility.
As I'm making moves through my chess game,
I'm not getting any utility.
You only get the utility at an end state.
And the way we're defining the utility,
we're defining it for the agent, because again,
we are playing from the perspective of the agent.
So what would the utility of the opponent be?
Minus that, right?
The negation of that would be the utility of the opponent. Okay.
I've heard about partially observable Markov decision processes.
Is this kind of what that is?
Like, is this partially observable?
Okay. So the question is: is this a partially observable Markov decision process?
This is not a partially observable Markov decision process.
There are classes that talk about that:
Mykel Kochenderfer's class on decision making under
uncertainty actually teaches it,
so you should take classes on that.
This is not a partially observable Markov decision process.
This is fully observable.
You have two players playing against each other. It's a very different setup.
[inaudible].
So the question is:
is there any randomness here?
So far, I haven't discussed any randomness yet.
Later in the lecture, I'll talk about the case where there might be
a nature in the middle that acts randomly, and how we go about it.
But so far: two players playing against each other.
Okay. All right.
And then the other thing we need to define when defining a game
is the Player function.
Player is a function of state.
It basically tells us who is in control, who is playing now.
In the game of chess:
whose turn is it now?
That is a function that
you're going to define when formally defining the game.
Okay. All right.
So let's look at an example:
the game of chess.
The players are white and black.
Let's say you're playing white.
So the agent is white,
the opponent is black.
The state s can represent the position of all the pieces and whose turn it is.
That is what the state is representing:
whose turn it is, and
the position of all the pieces.
Actions(s) would be all the legal chess moves that Player(s) can take.
And IsEnd checks whether the state is checkmate or a draw.
That is what it is checking.
Okay. So then what would the utility be?
You're only going to get it when you win,
when you lose, or when there's a draw.
The way we're defining it:
it's going to be plus infinity if white wins,
because the agent is white;
zero if there's a draw;
and minus infinity if black wins.
Okay. So those are all the things we would need to define. Yes?
[inaudible]
Why do we have whose turn it is in the state?
That's one way of extracting the player function.
The way you can define the player function is: Player is a function of state.
So the state already needs to encode whose turn it is,
and you can extract that from the state.
You said the utility would kind of be the negative utility for the agent.
Is that assuming they're both taking the same actions the whole time?
No. This is turn-taking, right?
I take an action, then the opponent takes an action,
then the agent takes an action, the opponent takes an action,
and at the very end of the game
you get the utility, and the opponent gets the negative of that utility.
But the actions could be very different.
The policies could be very different.
And we'll talk about how to come up with them.
So what happens if white wins: you get plus infinity,
but if black wins, you get negative infinity; then
you don't have a zero-sum game?
We'll actually talk about that a little next lecture.
I'm talking about zero-sum games here
because the algorithms we're talking about are for zero-sum games.
We're going to talk about minimax-type policies,
where the opponent is minimizing and the agent is maximizing.
So I'll get back to that if I haven't answered it;
we can talk about it after class,
but also next lecture we'll talk about more variations of games.
For now, I'm making a bunch of simplifying assumptions about this game.
So if white wins, white gets plus infinity, but then black gets 0 utility? [inaudible] [NOISE]
Uh, yeah.
So these utilities need to add up to 0.
If white wins, maybe white gets 10,
but black gets minus 10.
They need to add up.
Okay. All right.
So the characteristics of games that
we've already discussed come down to two main things.
One is that all utilities are at the end state.
Throughout the path you are not getting
new utilities, as opposed to things like MDPs, where we
were getting rewards throughout the path.
Here, the utility only comes in at the very end,
at the end state. And the other thing
is that different players are in control at different states, right?
If you are in a state, you might not be able to control things:
it might be your opponent's turn and you might not be able to do anything.
Okay? So those are the two main characteristics of games. All right.
So let's look at a game that you're going to play.
All right. The game is the halving game.
We start with a number N, and
the players take turns; a player can do two things.
They can either subtract 1,
decrementing N, or they can replace N with N/2.
So they can divide or subtract.
Okay? And the player that's left with 0 is going to win.
Okay. So that is the setup.
Is everyone following?
So let's try to formalize the game, and after that we
want to figure out what a good policy for it is.
So you know like what are all the different things for the model are.
So let's just have a new file.
We are going to define this game.
So it's a halving game.
Okay, so let's, let's get this. All right.
So we're initializing with N. So we're starting with some number N. So what is our state?
Our state is going to encode whose player turn it is and that number N. Okay.
So we have a player.
Let's say our players are either plus 1 or minus 1.
That's how I'm defining like who's player it is.
So the start state. Let's say player plus 1 place with N. So so
that is plus 1 and N. And then we need to define its end.
Okay. So what you do is end check.
Well we take the state. We decouple it into player and number.
And if the number is equal to zero then then that is when the game ends.
That's our ending condition. Okay.
How about the utility?
Well, we get the utility at an end state.
So again, I take a state
and decouple it into player and number.
I make sure that we are in an end state, so I assert
that the number is equal to 0, because that
defines whether you're in an end state or not.
And then the utility: if I'm winning, I'm going to get infinity.
If I'm not winning, I'm going to get minus infinity.
The way I'm defining that here is just player times infinity.
Because I'm the agent, I'm player +1,
and the opponent is player -1.
That way, if -1 is winning, I'm going to get minus infinity.
Okay? The actions we can do are subtract 1 or divide by 2;
subtract and divide are the two actions.
And the Player function again takes the state,
decouples it into player and number, and just returns the player.
That's how I know whose turn it is.
Then we need to define the successor function.
The successor function takes a state and
an action and tells us what state you're going to end up in.
So again, I take a state
and decouple it into a player and a number.
And the actions I can take are two things:
I can either subtract 1 or divide by 2.
If I'm subtracting, I return a new state, (-player, number - 1),
because now it's the other player's turn,
minus whoever's turn it is now.
If the action is divide,
we return the new player, which is -player,
and the number divided by 2.
Okay? That is it.
So we just defined this game, okay.
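Putting the pieces just described together, the live-coded game might look something like this (a reconstruction; the class and method names are my choices, not necessarily the exact ones used in lecture):

```python
import math

# A sketch of the halving game as described in lecture.
# State = (player, number), with players labeled +1 and -1.
class HalvingGame:
    def __init__(self, N):
        self.N = N

    def startState(self):
        # Player +1 moves first, starting from N.
        return (+1, self.N)

    def isEnd(self, state):
        player, number = state
        return number == 0

    def utility(self, state):
        # Only defined at an end state; the agent is player +1.
        player, number = state
        assert number == 0
        return player * math.inf

    def actions(self, state):
        # Subtract 1 or divide by 2.
        return ['-', '/']

    def player(self, state):
        player, number = state
        return player

    def succ(self, state, action):
        player, number = state
        if action == '-':
            return (-player, number - 1)   # other player's turn, N - 1
        elif action == '/':
            return (-player, number // 2)  # other player's turn, N / 2
```

For example, `HalvingGame(15).succ((+1, 15), '-')` gives `(-1, 14)`: it's now player -1's turn with the number 14.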
Yeah. All right.
So that was my game.
We're going to play this game in a little bit.
But quickly, before playing it,
let's talk about what a solution to a game is.
What are we trying to do in a game?
If you remember MDPs, the solution to the problem was the policy.
A policy was a function of state:
it would return the action you need to take in that state.
Similar to MDPs, here we have policies.
But the thing is, I have two players,
so the policy should depend on the player too.
So I have pi_p, the policy of player p, and I can define it similar to before.
A policy can be a function of a state that returns an action.
That would be a deterministic policy:
deterministically, if I'm in a state,
the policy is going to tell me what action to take, okay.
We can also define stochastic policies.
What a stochastic policy does is
take a state and an action and
return a number between 0 and 1, which is the probability of taking that action.
So pi_p(s, a) is
the probability of player p taking action a in state s.
If you remember the bucket example:
maybe half the time I would pick the number
on the right, and half the time the number on the left.
That would be a stochastic policy, right?
I'm not deterministically telling you what the action is.
I'm giving you a stochastic description
of the policy I'm following, okay?
So we have deterministic policies and
stochastic policies. In our game, we could follow either one of them.
In what case would you want a
stochastic policy versus a deterministic policy?
Uh, can you speak up?
In what case would you want a
stochastic policy versus a deterministic policy?
So: when do you want a stochastic policy versus a deterministic policy?
We'll cover that a little more next time, depending on what games you're in.
Stochastic policies give you some properties,
and deterministic policies give you some other properties.
Right now we're just defining them as things that could exist.
We could think of our opponent as acting deterministically if
we know exactly what they're doing.
Sometimes I have no idea;
maybe I've learned the opponent's behavior somehow and there's some randomness there,
and then I'm going to use a stochastic policy
for how my opponent is going to play against me.
But what we get out of
stochastic versus deterministic policies comes a little more next time.
Okay. All right.
So now that we know that it's a policy we want to get,
let's try to write up a policy for this game.
I'm going to define a human policy.
What I mean by that is the action is going to come from a human;
that means one or two of you guys.
So I need two volunteers for this, but let's quickly write this up.
What is a human policy?
It just gets the action from the keyboard.
So what I'm going to type up here is:
get the input from the keyboard,
and that is going to be the action we're picking.
Remember, the actions are either divide or
subtract 1. And if the action is valid, return that action.
That sounds like a good policy.
Okay. So that is a human policy.
Now what I want to do is
set up the game so two humans actually play against each other.
So I want policies for my agent.
My agent is +1;
that's going to be a human policy.
And for my opponent,
I'm going to say my opponent is also a human policy.
So I just want two humans playing against each other.
Okay. And let's say we're starting the game with 15,
so the number we're starting with is 15.
Okay? All right, that looks right to me.
So how do we make sure we are progressing in the game?
If you're not in an end state, you want to progress.
So let's print a bunch of things here.
Let's print out the state.
Let's get the player out of the state, because again, the state encodes the player.
Let's get the policy,
because we have defined these policies for both of the players, so
we can get the policy of whoever is playing right now.
And then the action comes from that policy in that state,
and the new state you're going to end up in is
just the successor of the current state and action.
So I'm just progressing:
this while loop figures out what state we are in,
what policy we are following,
and where we're going to end up via the successor function.
Okay. And at the very end, I just print out the utility,
which is either plus infinity or minus infinity.
And that sounds good.
So, all right.
So let's actually run this. All right.
So who wants to play this?
Okay that's one person.
You're the agent. You're player plus 1.
Opponent is three people [LAUGHTER].
I think you were first. By [inaudible] yeah.
Okay so you're minus 1.
All right so let's,
uh, play this game.
Is this large enough?
Yeah. Okay. All right so player 1.
Player plus 1. We are at number 15.
Do you wanna, uh, decrement.
Okay. So minus 1. So we are at player minus 1.
We're at 14. What do you wanna do?
Divide.
Divide. Okay. You have a policy [OVERLAPPING] [LAUGHTER]
[BACKGROUND]
Minus 1.
Divide.
Divide. [LAUGHTER]
[LAUGHTER] Yeah.
So you kind of get the point. Wait, did I make you lose now?
[LAUGHTER] Sorry. My bad.
But you get the utility at the end, and you
can kind of see this interface.
We don't have that much time. I was going to try another pair,
but the code is online, so if you want to play with it, just play with it.
We'll have another version later, playing against an automated policy.
Um, all right.
So we're back here.
Let me close this.
All right. So we just saw how we can have
human policies playing against each other.
And again, a policy:
you give it a state and an action,
and it gives you a probability; or you give it a state, and it gives you an action.
So a deterministic policy is just an instance of a stochastic policy.
Right? If you have a deterministic policy,
you can treat it as a stochastic policy that
picks one particular action with probability 1.
So, all right.
So now we want to talk about how we evaluate a game.
Let's say someone comes in and gives me
the policies of the agent and the opponent,
and I just want to know how good they are.
If you remember, the MDP lecture
started with policy evaluation:
someone gives me a policy,
and I just want to evaluate how good it is.
We're doing something exactly analogous to that.
Someone comes in and tells me my agent is going to pick bucket A;
that is what my agent is going to do all the time.
And someone comes in and says, "Well,
my opponent is going to act stochastically, and
with probability one-half,
give you each of the numbers."
Okay? So these are the two policies we have,
and the question is: how good is this?
Going back to the game tree,
what is really happening is my agent is going to pick
bucket A, right?
So with probability one,
we end up there,
and with probability zero in any of the other buckets.
Then my opponent is going to stochastically pick either -50 or 50.
Okay? So if my opponent is picking -50 or 50,
the value of this node is just the
expectation of that, which is 0:
50% of the time it's -50,
50% of the time it's 50, so the value of this node is 0.
And if my agent is picking A,
then the value of the root node is 0.
Okay? So you can see how the value propagates up from the utility.
We had the utilities at the leaf nodes,
but we can compute a value for each
one of these nodes if I know what the policies are.
If I know who's following which policy,
I can compute these values going up the tree.
Okay? So in this case,
I can say the value of the start state,
if I'm evaluating this particular policy,
is equal to 0.
Okay? All right.
So someone gave me the policies,
and I evaluated the value at the start state.
In general, as I was saying earlier,
this is similar to policy evaluation:
someone gives me the policies, and I
evaluate how good the situation is.
And you can write a recurrence to compute that.
So I'm going to write the recurrence here.
We want to compute this value,
V_eval, which evaluates a given policy and is a function of state.
Well, what is it equal to?
It's equal to Utility(s)
if you're in an end state.
Otherwise, I have access to the policy of
my agent and the policy of my opponent, so I can take
an expected sum over all possible actions from s. If
Player(s) is the agent,
I'm looking at the policy of the agent:
the sum over actions a of pi_agent(s, a), a stochastic policy,
times V_eval of the successor state, Succ(s, a).
And this is the case where the player is the agent;
I'll just write: if Player(s) is equal to agent.
What happens if the player is the opponent?
I do the same thing.
I have access to the policy of the opponent,
so I again take a sum over
all possible actions of the opponent's policy,
which someone gave to me,
pi_opp(s, a), times the value of the successor state,
V_eval(Succ(s, a)). And this is the case where the player is the opponent.
So this is the recurrence, and it's kind of intuitive.
We have seen this kind of thing in search too:
you start with the utilities at the leaf nodes and push them back
up, based on what your policies are,
which edges of the tree
your policies take with what probability.
Okay? Does this make sense? All right.
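On the bucket example, the recurrence flattens to two levels: the agent's distribution over buckets, and the opponent's distribution over numbers. A sketch (bucket contents are my assumption, matching the averages mentioned in lecture):

```python
# Policy evaluation on the bucket example: agent always picks A,
# opponent flips a fair coin between the two numbers.
# Bucket contents are assumed: A = {-50, 50}, B = {1, 3}, C = {-5, 15}.
buckets = {'A': [-50, 50], 'B': [1, 3], 'C': [-5, 15]}

def pi_agent(state, action):
    # Deterministic policy as a stochastic one: bucket A with probability 1.
    return 1.0 if action == 'A' else 0.0

def pi_opp(bucket, number):
    # Fair coin between the two numbers in the bucket.
    return 0.5

def value_eval():
    # V_eval(start) = sum_b pi_agent(start, b) * sum_n pi_opp(b, n) * n
    return sum(pi_agent('start', b) *
               sum(pi_opp(b, n) * n for n in buckets[b])
               for b in buckets)

print(value_eval())  # always picking A evaluates to 0
```

The inner sum is the expectation at the opponent's node; the outer sum weights it by the agent's policy, exactly mirroring the two cases of the recurrence.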
Okay. So that was evaluating the game.
But what if I now want to solve for what the agent should do?
I'm the agent,
and I care about figuring out what my pi_agent is.
I don't know what my pi_agent is;
I need to figure out what policy I should be following.
That takes us to this idea of expectimax, which is
basically: if I'm in a scenario where I know what my opponent does,
so I'm still assuming I know what my opponent does,
what would be the best thing for me to do as the agent?
Okay? What would be the best thing to do?
Like in the bucket example: if you knew
I was acting probabilistically, what would you do?
Pick the action that gives you the maximum value.
You'd pick the action that gives you the maximum value, because you're
trying to maximize your own value.
So if that is the case,
then this recurrence needs to change, right?
The way it changes is:
I'm going to call this new value,
building on top of what we had,
the value of the expectimax policy,
V_exptmax. So with this value,
I'm not evaluating anything anymore.
I want to actually figure out what my agent should do,
so I'm going to call it expectimax.
And since I know the policy of my opponent,
I'm not changing anything in the opponent case;
I just compute that expectation.
But now I want to figure out what the agent should do. What should the agent do?
Well, the agent should do the thing that maximizes this value.
So I'm going to erase the sum over the agent's policy, because I don't have that policy,
and instead take the max of this value over all possible actions.
This should remind you of value iteration.
If you remember value iteration from the MDP lecture,
we weren't evaluating things there, right?
We were trying to maximize our value.
And that's analogous to what we are doing here:
we're trying to figure out the policy the agent should take to
maximize the value, under the scenario that I know what the opponent does.
So I still know what the opponent does.
So going back to this example:
let's say I know my opponent is acting stochastically. What should I do?
If my opponent is acting stochastically with probability one-half,
then the values of the three buckets are going to be 0, 2 and 5.
And I'm trying to maximize my own value,
so I'm going to pick the one that gives me 5.
That's shown with this upward triangle: I'm maximizing.
So I'm going to pick bucket C, because I'm
maximizing under the knowledge that the other player is acting stochastically.
Okay? And we call this the value of the
expectimax policy, and the value of the
expectimax policy from the start state is equal to 5.
Right? Because that's evaluating
the thing I'm going to get. Question back there?
[inaudible]
Yes. This is assuming I know my opponent's policy, and I'm
maximizing my own value knowing that my opponent
is following this policy, and what the opponent would do in expectation.
Okay? All right.
And then this is the recurrence that we would get;
we just update the recurrence
so that if the agent is
playing, we maximize the value of expectimax.
Okay? All right.
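On the bucket example, the expectimax computation is a max over expectations. A sketch (bucket contents again my assumption, matching the averages in lecture):

```python
# Expectimax on the bucket example: the agent maximizes, assuming the
# opponent picks uniformly at random within the chosen bucket.
# Assumed buckets: A = {-50, 50}, B = {1, 3}, C = {-5, 15}.
buckets = {'A': [-50, 50], 'B': [1, 3], 'C': [-5, 15]}

def expected_value(numbers):
    # Opponent node: expectation under the uniform (coin-flip) policy.
    return sum(numbers) / len(numbers)

def value_expectimax():
    # Agent node: max over buckets of the opponent's expected pick.
    return max(expected_value(ns) for ns in buckets.values())

print(value_expectimax())  # 5.0: the agent picks bucket C
```

Replacing the agent's sum with a max is the only change relative to policy evaluation, just as in the recurrence.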
So, okay: in general, I don't know the policy of my opponent,
right? In general,
no one gives me this pi_opp.
So if that is the case, what should we do?
One thing we could do is assume the worst case.
You could say: oh,
the opponent is trying to get me,
they're going to play the worst-case scenario and try to minimize my value.
And that's a fair thing to do.
We're going to talk about whether
that is always the best thing to do
a little later in the lecture.
But for now, if I know nothing about my opponent,
I can just assume my opponent is acting adversarially against me.
And that introduces this idea of
minimax, as opposed to the expectimax we just talked about.
So what would minimax do?
In the case of a minimax policy, what I'm
assuming is: I am the agent trying to maximize my
own value, and my opponent is acting adversarially.
My opponent is really trying to minimize my value.
And what that means is: from bucket A,
I'm going to get -50;
from bucket B, I'm going to get 1;
from bucket C, I'm going to get -5.
And under that assumption,
I'm going to pick the second bucket, because it gives me the highest value.
So that is a minimax policy.
So how would I change my recurrence if I were to play minimax or I'm going to-
I'm going to call it V of- so let's look at the V of minimax of a state.
Well, the recurrence is going to be over
minimax, V of minimax, so I'm gonna change that.
If the agent is playing,
the agent is still trying to maximize the value.
So, so that is all good.
What if the opponent is playing?
The opponent is going to minimize, right?
So I don't have access to pi opp.
So what I'm gonna do is I'm going to remove
this and say well the opponent is going to take
an action that minimizes the value of the successor of S and A.
Okay? And this is how you would compute the value of a minimax policy.
Is this assuming that the adversarial agent consistently tries to minimize the utility of the agent?
Yes.
What happens when, um,
the adversarial agent doesn't always go with that selection but also acts stochastically?
Yes. So that's a good question.
So what happens like if the adversarial agent is not always adversarial, right?
So in that case, you have another stochastic policy
that kind of defines what- what the opponent is doing.
And if you have access to that,
you can do something similar to expectimax.
If you don't have access to that maybe you would want to act
worst-case and assume that they're always trying to minimize.
But- but that's some prior knowledge that you have that
allows you to- to act better or maybe evaluate,
ah, the value better for whatever state you're in.
So we'll talk about evaluation functions a little bit in the lecture.
And maybe you'll look back and form your evaluation function, okay? All right.
So- so- so here the value of minimax from the start state is going to be 1,
right? Does everyone see that?
So I'm assuming my opponent is acting adversarially.
So we have minus 50, 1 and minus 5.
If I am maximizing then the best thing I can get is 1.
And then that's how we compute V of minimax, okay?
And then there is really no analogy to this in the MDP
setting, because in the MDP setting you don't really have this game.
We don't really have this opponent that's playing against us.
And this is the recurrence that you're going to get,
which is what we already have on the board, right?
Okay. So- so what would the policy be?
So the policy is just going to be the argmax of this V of minimax.
So if you want to know what the policy of your agent should be, that's Pi max.
It's the argmax of V of minimax
over the successors of that state.
And if you want to know what the policy of your opponent
at state S should be, well,
that's the argmin of V of minimax, which is intuitive, right?
So- so then that way you can actually figure out what the action should be,
what the policy with the actual action should be, okay?
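For the bucket example, the whole value-and-policy computation fits in a few lines (a sketch; bucket contents from the board):

```python
# Minimax value and policies for the bucket example.
buckets = {"A": [-50, 50], "B": [1, 3], "C": [-5, 15]}

# V_minimax: the opponent minimizes within a bucket, the agent maximizes over buckets.
v_minimax = max(min(values) for values in buckets.values())

# pi_max: argmax over buckets of the minimized value;
# pi_min: the opponent's argmin within the chosen bucket.
pi_max = max(buckets, key=lambda b: min(buckets[b]))
pi_min = min(buckets[pi_max])

print(v_minimax, pi_max, pi_min)  # expected: 1 B 1
```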
All right. So let's go back to this example, this halving game.
So what we wanna do is we wanna actually code up what
a minimax policy would do in this setting.
And maybe we can play with a minimax policy after that, okay?
So what would a minimax policy do?
So it's a policy,
so it's going to be a function of states, so let's give it state.
And you're going to just write this recursion that we have on the board.
So- so we're recursing over the state.
If you're in an end state then what are we returning?
Just the utility, okay?
So we're returning the utility of that state,
and there is no action.
And then if you're not in an end state,
then you are either maximizing or minimizing over a set of choices.
So let's actually create those choices, so we can just call max and min on them.
So for the choices, we're going to iterate over all the actions that we have.
And what is that going to be exactly?
Well, that's going to be doing a recursion over the successor states.
So we are going to recurse over the successor state.
So recurse over succ- game.successor of state and action.
And I'm going to return the action here too because I just want to get the policy later.
And this recursive function returns a value and an action.
So I just want to get the value from the first one and the action from the second one.
Okay. So if player is plus 1 that's the agent,
the agent should maximize the choices.
And if player is minus 1,
then- then that's the opponent,
the opponent should try to minimize over these choices.
And that's pretty much like this recursion that we have on the board,
and- and that's our recursive function, okay?
So we're going to recurse over- over our state and that
gives us a value and it also gives us- gives us an action.
So let's just print things out.
So you can refer to them.
So minimax gives us an action,
and it tells us this is the value that you can get [NOISE].
All right. And then it's a policy,
so let's just return the action.
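For reference, the recursion being written here might look like the following sketch. The `HalvingGame` class, its method names, and the win convention are assumptions for illustration, not the exact code used in class:

```python
import math

class HalvingGame:
    """A sketch of the halving game. The win convention is assumed:
    the player to move when the number hits 0 wins (so the previous
    player, who moved to 0, has lost)."""
    def __init__(self, N):
        self.N = N
    def startState(self):
        return (+1, self.N)  # (player to move, current number)
    def isEnd(self, state):
        return state[1] == 0
    def utility(self, state):
        player, number = state
        # +inf means the agent (+1) wins, -inf means the opponent wins.
        return player * math.inf
    def actions(self, state):
        return ['-', '/']  # decrement, or halve with integer division
    def player(self, state):
        return state[0]
    def successor(self, state, action):
        player, number = state
        if action == '-':
            return (-player, number - 1)
        else:
            return (-player, number // 2)

def minimaxPolicy(game, state):
    # The recurrence from the board: recurse returns (value, action).
    def recurse(state):
        if game.isEnd(state):
            return (game.utility(state), None)
        choices = [(recurse(game.successor(state, action))[0], action)
                   for action in game.actions(state)]
        if game.player(state) == +1:
            return max(choices)  # agent maximizes
        else:
            return min(choices)  # opponent minimizes
    value, action = recurse(state)
    print('minimax says action =', action, ', value =', value)
    return action

game = HalvingGame(3)
action = minimaxPolicy(game, game.startState())
```

Under the assumed convention, starting from 3 the agent halves and wins, so the printed value is infinity.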
Okay. So now what I'm gonna do is,
I'm going to say plus 1 agent is still a human policy,
and then it's playing against a minimax policy.
So all right.
So let's- who wants to play with this?
And it's a little scarier to play with the minimax policy [LAUGHTER].
Okay. All right.
So let's do this.
Python.
All right. So you are the agent. So you're player 1.
You're starting from 15. What do you want to do?
[BACKGROUND].
So you just lost the game [LAUGHTER].
So- so why do I know you lost the game?
Now it's player minus 1 playing, you are at 7.
And minimax policy took action minus, er,
and says action minus, um,
and- and it also,
yeah takes action minus. So we're at 6.
And then the value of the game is minus infinity.
So you're playing with a minimax policy,
you're already getting minus infinity.
So- so you just lost the game.
Anyone want to try this again [LAUGHTER].
You want to try it again maybe.
[BACKGROUND] Subtract.
[LAUGHTER] Okay.
So you- so you can win, right?
So the value is infinity right now.
And then yeah, so and then the minimax policy also did a minus.
So we're at 13 right now.
It's your turn, you're at 13 [BACKGROUND].
You just lost the game again [LAUGHTER].
So yeah, so minus infinity is- yeah actually you need to like alternate between them.
I think that is the best policy.
But play with this kind of get a sense of how this runs. The code is online.
So just feel free to play with it and figure out,
what is the best policy to use. All right.
So- okay.
So- so that was a minimax policy.
And then this is kinda the recurrence that we get for a minimax policy.
Now, what I wanna do is I wanna spend a little bit of time talking about,
um, some properties of this minimax policy.
And then we talked about two types of policy so far, right?
We have talked about expectimax,
which is basically saying,
"I as an agent,
I'm trying to maximize,
but I know what my opponent is going to do.
So I'm going to assume my opponent does whatever.
And then I'm going to maximize based on that."
So- so for example, the agent is following what I'm going to refer to as Pi of expectimax;
everything in red is for the agent,
everything in blue is for the opponent.
So I'm gonna say the agent is following this policy which says,
"I'm going to maximize assuming my opponent is doing whatever.
And here I'm calling Pi 7 some opponent policy."
It could be anything; I'm just calling it Pi 7.
So let's say the opponent is playing Pi 7,
and I'm going to maximize based on that.
And- and the value we just talked about is the value of expectimax.
The other value we just talked about is the value of minimax which says, "I am the agent.
I'm going to maximize assuming the opponent is going to minimize."
And then the opponent actually is going to minimize and is going to follow pi min.
Okay. So- so these are the two values we have talked about so far.
I want to talk a little bit about the properties of this. But before that, let me-
So is there a way to kind of mix the two together?
Like, in expectimax, the opponent policy
gives a probability distribution over the actions, right?
Like why don't we just take the action that
minimizes whatever our reward is and give it a higher weight,
in expectimax?
Um-
[NOISE] I didn't fully follow what policy you were referring to, actually.
Are you coming up with a new policy that you're
saying would be a better policy,
[NOISE] somewhere between expectimax and minimax in some sense?
So this might- this, this table might,
kind of, address that because it's,
it's considering four different cases.
It's not just considering the two cases.
So this might actually refer to what you're, what you're proposing.
So, so let's actually go through this first and then maybe,
like, if it doesn't answer that.
So, All right, so,
so I want to talk about the setting.
So this table is actually not that confusing,
but it can get confusing.
So do pay attention to this part.
Um, all right, so where do I wanna- maybe, maybe I'll write over there.
So I'm gonna use red for agent.
Where is my blue, my blue? On the floor?
Hanging on the left.
Left?
Your right.
My right, [LAUGHTER] okay, all right, [LAUGHTER] okay.
And then I'm going to use blue for,
um, and I dropped this.
I'm going to use blue for,
um, the opponent policy.
Okay. So, so then for agents,
we're are going to have Pi max.
All right. An agent could play Pi max. What does that mean again?
I'm going to maximize assuming you're going to minimize.
An agent could play Pi expectimax.
Maybe the policy 7, I'm gonna put 7 here,
which means I am going to maximize assuming you're going to follow this Pi 7.
So this is a thing that the agents can do.
[NOISE] Okay?
And then there are things that my opponent can do.
I'm going to write that here.
My opponent can actually follow Pi min which is I'm just going to minimize,
or my opponent could follow some other policy Pi 7.
Let's say Pi 7 in the bucket example right now is
just acting stochastically.
So half the time pick one number,
half the time pick another number.
Okay? So, so that is what we have.
So I'm going to draw my- actually my tree so we can go over examples of that too.
So this was the bucket example.
We had minus 50 and 50 in bucket A,
1 and 3 in bucket B,
minus 5 and 15 in bucket C. Okay?
So this was my bucket example.
I'm actually going to talk about that.
So- All right.
So I'm gonna talk about a bunch of properties of V of Pi max and Pi min,
which is what we have been referring to as the minimax value.
Okay? So, so I want to talk about this a little bit.
Okay? So the first property that,
that we can have is,
is that V of Pi max and Pi min,
it is- actually let me go back to the next slide.
It is going to be an upper bound
of the value of any other policy.
I'm going to just write Pi of expectimax for any other policy for the agent,
assuming that my opponent is playing as a minimizer.
Okay. So, so what I'm writing here is
that that value is going to be an upper bound of any other value, if
my agent decides to do anything
else, under the assumption that my opponent is a minimizer.
So my opponent is really trying to get me.
If my opponent is really trying to get me,
then the best thing I can do is to maximize.
Okay? So, so that's kind of intuitive,
right? That's an upper bound.
Let's look at that example. So what is V of Pi max and Pi min?
So, so we just talked about that, right?
So if this guy is a minimizer,
we're gonna get minus 50 here,
1 here, minus 5 here.
If this guy is a maximizer,
what is the value I'm gonna get?
You'll get 1, right? I'm gonna go down here and then I'm gonna get 1.
So V of Pi max and Pi min is just equal to 1.
That is this value that is just equal to 1.
Okay? What is this saying is that this is going to be greater
than maybe the setting where my opponent- so my,
my agent is following expectimax and my opponent is still doing Pi min.
So, so what would this correspond to?
What will this value correspond to?
So this is a value which says, well,
I'm going to take an action assuming my opponent is acting stochastically.
If my opponent is acting stochastically,
I'm gonna get 0 here,
I'm gonna get 2 here, and get 5 here.
If I'm assuming that and I'm trying to maximize my own,
my own value, which route do I go? I'm gonna go this route.
But it turns out that my opponent was not doing that.
My opponent was actually a minimizer.
So if my opponent was actually a minimizer and I went this route,
my opponent is going to give me minus 5.
So the value I'm going to end up getting is minus 5.
So this is equal to minus 5.
This is equal to minus 5.
Okay? So, so far I've shown that this guy is greater than this guy.
okay? All right.
So that's the first property.
First property is if my opponent is terrible and is trying to get me,
best thing I can do is to maximize. I shouldn't do anything else.
Okay? The second property is,
is that this is V of Pi max, again the same V,
V of Pi max and Pi min is now a lower bound of
a setting where your agent is maximizing assuming your opponent is minimizing.
But your opponent was actually not minimizing,
your opponent was following Pi 7.
So, so what this says is, if you're trying to maximize
assuming your opponent is always minimizing,
then you'll come up with
a lower bound, and if your opponent ends up doing something else,
you can always do better than this lower bound.
Okay? So what is this V equal to? Well, we just showed that it is 1, right?
That is this value.
Okay? What does this correspond to?
So this is value of Pi max which is I am going to assume you are trying to get me.
If I'm going to assume you are trying to get me I'm gonna go down
this route because that is the thing that gives me the highest, the highest value.
But you are not trying to get me,
you are following Pi 7.
So if you're following, following Pi 7,
you're just going to give me half the time
1 and half the time 3, and that actually corresponds to 2,
so I'm going to get value 2 instead of value 1.
So this is actually equal to 2 in this case.
And this corresponds to this value in the table which is again
the agent is following a maximizer assuming the opponent is a minimizer.
Opponent was not a minimizer,
opponent was just following Pi 7.
And this is just equal to 2.
Okay. So so far,
the things I've shown are actually very intuitive.
They seem a little complicated but they're very intuitive.
What I've shown is that this value of minimax
is an upper bound:
if you're assuming our opponent is a terrible opponent,
then it's going to be an upper bound, because the best thing I can do is maximize.
I've also shown it's a lower bound if my opponent is not as bad.
So, so that's what I've shown so far.
A question.
So here the opponent's policy is completely hidden from the agent.
Yeah. So here, like,
because- Yeah, the agent actually
doesn't see the opponent- where the opponent goes, right?
Even in the expectimax case,
it thinks the opponent is going to follow Pi 7,
but maybe the opponent follows Pi 7, maybe not.
Right so, so like when we talk about expectimax and minimax,
it's always the case that the agent doesn't actually see what the opponent does.
But the agent can think about what the opponent does, okay?
And I'm going to talk about one more property.
And this last property basically says if you know something,
actually goes back to your question,
if you know something about your opponent, right?
If you know something about your opponent,
then you shouldn't do the minimax policy.
You should actually do the thing that has some knowledge of what your opponent does.
So, so that basically says this:
V of Pi max and some Pi of opponent,
where you know something about Pi opponent,
you know that the opponent is playing Pi 7,
that is going to be less than or equal to the case where you are following
the Pi of expectimax of 7,
uh, and the opponent actually follows Pi 7.
Okay. So what does this last equality- inequality saying?
Well, it is saying that the case where you're trying to
maximize and you think your opponent is minimizing,
but your opponent is actually not minimizing the,
value of that is going to be less than the case where you're maximizing under
some knowledge of your opponent's policy
and your opponent's policy actually ended up doing that.
Okay? So, so the first term is always the agent.
The second term is always the opponent, right?
So this value we have already computed,
that- that's equal to 2.
This value, what is this value saying?
It is saying you are going to maximize assuming your opponent is stochastic.
So if I'm assuming my opponent is stochastic,
then I'm assuming that this is 0,
this is 2, this is 5, right?
I'm trying to maximize.
So which one of my routes shou- should I go?
I should go this route because that gives me 5.
So this is the agent thinking the opponent is going to be stochastic,
thinking he's going to get 5.
And it gets here and the opponent actually ends
up following Pi 7 which is a stochastic thing.
So, so we are actually going to get 5.
So, so this guy is equal to 5.
And this is the last inequality that we have,
which is V of Pi expectimax of 7,
and Pi of 7 is greater than or equal to V of Pi max and Pi 7.
We just showed this is equal to 5 for this example.
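The four table entries just computed (1, minus 5, 2, and 5) can be checked with a short sketch, where `min` models a minimizing opponent and `mean` models the assumed half-half Pi 7:

```python
from statistics import mean

buckets = {"A": [-50, 50], "B": [1, 3], "C": [-5, 15]}

def value(agent_model, opponent_actual):
    """agent_model scores buckets for the agent's argmax (min = assume a
    minimizer, mean = assume Pi 7); opponent_actual is what the opponent
    really yields from the chosen bucket."""
    chosen = max(buckets, key=lambda b: agent_model(buckets[b]))
    return opponent_actual(buckets[chosen])

v_max_min = value(min, min)    # minimax: 1
v_exp_min = value(mean, min)   # agent assumes Pi 7, opponent minimizes: -5
v_max_pi7 = value(min, mean)   # agent assumes minimizer, opponent plays Pi 7: 2
v_exp_pi7 = value(mean, mean)  # agent assumes Pi 7, opponent plays Pi 7: 5

# The chain of properties from the lecture:
assert v_exp_min <= v_max_min <= v_max_pi7 <= v_exp_pi7
```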
Okay. All right. Question.
[inaudible] The actions of the opponents
always whether or not the [inaudible] [NOISE].
Uh, so-
So if you know something about the stochasticity, that works too.
Like here, I knew that the opponent was
following the stochastic policy of one-half, one-half.
I might have known that the opponent is following
a deterministic policy and is always picking the left one.
So I could have followed the
same expectimax policy under that knowledge.
It could be anything else, but the whole idea of expectimax is,
I have some knowledge of what the policy of,
of the opponent is, it could be a stochastic policy,
it could be a deterministic policy under that,
how would I maximize?
Does that mean that like transitively,
that the bottom right is greater than the bottom left always?
Yeah. So the question is do we have- Yeah.
So we have what like this inequality, so transitively,
this guy is always greater than this guy.
And that kinda makes sense, right?
Like we're saying, like if you're following expectimax,
so this last one kinda makes sense, right?
It's, it's basically saying if you're following expectimax and you know
something about your opponent and your opponent actually ended up doing that,
then your value should be greater than pretty much anything, right?
Because you knew something about the opponent,
you played knowing that,
having that knowledge. Yes.
When you say knowing something about the opponent,
is that just knowing that it's acting stochastically, or knowing what it's gonna take? [NOISE]
It's knowing what they're going to take.
Right? Like here, I knew what they were going to take.
I knew that half the time they're going to take this one,
half the time you are going to take the other one,
and then I use that knowledge, right? Yeah.
So you know exactly this? [OVERLAPPING].
Yes. Yeah, yeah, the expectimax.
Is the expectimax policy given that your opponent is following Pi min policy-
Given that, sorry.
Given that your opponent is following Pi min.
Is it- do you maximize it?
So the expectimax policy is,
is this policy when here we have a sum.
The expectimax policy, uh,
assumes your opponent is following Pi opponent
and assumes that it has access to Pi opponent and so it ends up doing this sum over here.
Yeah. If Pi opponent is Pi min? Like-
Uh, if Pi oppo- I see what you are saying.
So you're saying if Pi opponent is actually Pi min,
then do they end up being equal to each other in some sense?
So yeah, I guess so.
Yeah. So if you know that the opponent is a minimizer, it becomes minimax, right?
If you know your opponent is following min,
acting as a minimizer, we just call that minimax. All right.
So I'm gonna move ahead a little bit.
All right so- and then, this is like what we have already talked about.
Okay. So a few other things about modifying this game.
So, so we have- okay so we have talked about this game,
we have talked about properties of this game.
There is a simple modification one can do which is, bringing nature in.
So there was a question earlier which was like,
is there any chance here?
And then, yeah, you can like actually bring chance inside here.
So, so let's say that you have the same game as before,
you're choosing one of the three bins.
And then, after choosing one of the three bins,
you can flip a coin and if heads comes,
then you can move one bin to the left, with wraparound.
So what this means is 50% of the time,
tails comes, you're not changing anything,
you have this set up.
50% of the time you get heads.
And then, in those settings you're just gonna pick like
a neighboring bin as opposed to your original bin, okay?
So, so the- you're adding this notion of chance here and,
and it's kind of acting as a new player, so,
so it's not actually making things that much more complicated.
So, so what happens is in some sense we have a policy of,
of coin which is nature here,
right and policy of coin is,
half the time I get 0,
I don't change anything,
half the time I just get the neighboring bin as opposed to my main bin.
And then I get this new tree where,
where I have like a whole new level for what- where the chance plays.
So we have- now we have max nodes,
we have min nodes, we also have these chance nodes here.
And the chance nodes again,
like sometimes they take me to the original bucket and then
50% of the times they take me to a neighboring bucket, okay?
But, but the whole story like stays the same, like nothing changes.
You can, you can still compute value functions,
you can still push the value functions further up.
It's the same sort of recurrence.
Nothing fundamental changes.
Just- it just feels like there are three things playing now, okay?
So, so then this is actually called expectiminimax,
so a value of expectiminimax here,
in this case for example,
is minus 2, because there is a min node for the opponent,
there is an expectation node for what nature does,
and then there is a max node for what the agent should do.
That's why it's called expectiminimax.
And then, you can actually compute the same value.
So when the game is working out,
so there's like two players.
I pick a bin then you flip a coin,
and then shift it left or not shift it left,
and then I get to pick the number?
Yes. Well, not you, well the opponent.
The opponent.
So yeah. So, so there are still two players and then the third coin thing. Yes.
[inaudible]
All right. So, so yeah.
So the way to formalize this is you have players, so you have an agent,
you have an opponent, you have coin,
and then the recurrence changes a little bit I guess.
So, so what happens is,
the recurrence that we have had for minimax was just the max and
min, and it would just return us the utility
if you're in an end state.
Now, if it is the coin's turn,
we just do an expected sum over the policy of
the coin, which is what we were doing in expectiminimax.
So we just have a new term for when the coin plays.
So, so everything here kind of follows
naturally in terms of what we were expecting, okay?
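That three-node recurrence can be sketched over an explicit tree (the node encoding here is illustrative):

```python
# Expectiminimax recurrence on an explicit game tree. Each node is a tuple:
# ('leaf', value), ('max', children), ('min', children), or
# ('chance', [(probability, child), ...]).
def expectiminimax(node):
    kind, payload = node
    if kind == 'leaf':
        return payload
    if kind == 'max':      # agent's turn: maximize
        return max(expectiminimax(child) for child in payload)
    if kind == 'min':      # opponent's turn: minimize
        return min(expectiminimax(child) for child in payload)
    if kind == 'chance':   # nature's turn: expectation under the coin
        return sum(p * expectiminimax(child) for p, child in payload)

# Tiny example: a max node choosing between a fair coin flip and a min node.
tree = ('max', [('chance', [(0.5, ('leaf', 2)), (0.5, ('leaf', 4))]),
                ('min', [('leaf', 1), ('leaf', 5)])])
print(expectiminimax(tree))  # expected: 3.0
```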
All right so the summary so far is, uh,
what we've been talking about max nodes, we have been talking about chance nodes,
like what if you have a coin there and then also these min nodes.
Um, and, and basically we've been talking about composing
these sort of nodes together and creating like a minimax game or,
or an expectimax game.
And then for the value function, uh,
you just do the usual recurrence that we have been
doing in this class, going from the utility to come up with
this expected utility value for all the nodes that we have.
So there might be other,
other scenarios that you might wanna think about for example,
for your projects or like in,
in general there are other variations of games that you might wanna think about.
So what if, like the case that you are playing with multiple opponents?
Like so far we have talked about like a two-players setting where we have
one opponent and one agent but what if you have multiple opponents,
like you can think about how the tree changes in those settings.
Uh, or for example, the turn-taking aspects of it,
like whether the game is simultaneous versus turn-taking, uh,
or like you can imagine settings where you have
some actions that allow you to have an extra turn.
So, so you have two turns.
Uh, and then the next person takes t- takes a turn.
So, so you should think about some of these,
some of them come up in the homework.
So, uh, think about variations of games in general.
They are kind of fun. So to talk a little bit about the computation aspects of this.
Um, so this is pretty bad.
[LAUGHTER] Right, we talked about a game tree which is similar to tree search.
So we are taking a tree search approach.
Uh, if you remember tree search,
like the algorithms we're using,
like if you have branching factor b and some depth d,
then in terms of time it's exponential, on the order of b to the 2d in this case.
So I'm using d for the number of- how do I say this?
It's 2d because the
agent plays and then the opponent plays;
that's how I'm counting it.
So you have 2d plies but depth d.
Does that make sense? All right.
And then in terms of space,
it's order of d, and in terms of time
it's exponential, so that's pretty bad.
So for a game of chess for example,
the branching factor is around 35,
depth is around 50.
So if you compute b to the 2d,
it's on the order of the number of atoms in the universe.
That's not doable; we are not able to use any of these methods directly.
So, so how do we make thi- things faster?
So we should be talking about how to make things faster.
So there are two approaches that we are
talking about in this class to make things faster.
And the first approach is using an evaluation function.
So, uh, using an evaluation function, what we can do is use domain-specific
knowledge about the game to define almost like
features about the game, in order to approximate
this value function at a particular state.
So I'm going to talk about that a little bit.
And then, another approach is this approach which is kind of
simple and kind of nice which is called alpha-beta pruning.
And, and the alpha-beta pruning approach,
basically gets rid of part of the tree
if it realizes you don't need to go down
that part of the sub-tree.
So it's a pruning approach that doesn't
explore all of the tree; it only explores parts of the tree.
So, so we're going to talk about both of them. All right.
So evaluation functions.
So let's talk about that.
Okay. So the breadth and depth of the game can be really large.
That's not that great. So one approach to go about solving the problem is,
is to kind of limit the depth.
So instead of like exploring everything in the tree,
just limit the depth and,
and get to that particular depth.
And then after that, when you get to that depth just call an evaluation function.
So, so if you were to search the full tree,
this was the recursion that,
that we had like we have talked about.
This was like if you're doing a minimax approach this is the recursion that you gotta do.
You gotta go over all the states and actions and,
and go over all of the tree.
But if you're using a limited-depth tree search approach, what you can do is,
you can basically have this depth d and then decrement d every time you go
over an agent-and-opponent pair, like every time you go
down the tree. At some point d just becomes 0,
so you get to some particular depth of the tree, and when d becomes 0,
you're gonna call an evaluation function on the states that you get, okay?
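A sketch of that depth-limited recursion; the game interface names mirror the earlier minimax sketch and are assumptions, and here depth is decremented after each opponent move:

```python
def depthLimitedValue(game, state, depth, evaluate):
    # Same minimax recurrence as before, but when depth runs out we fall
    # back on a (weak) evaluation function instead of searching further.
    if game.isEnd(state):
        return game.utility(state)
    if depth == 0:
        return evaluate(state)  # Eval(s): a weak estimate of V_minimax(s)
    # Decrement depth once per agent-opponent pair, i.e. after the opponent moves.
    next_depth = depth - 1 if game.player(state) == -1 else depth
    values = [depthLimitedValue(game, game.successor(state, a), next_depth, evaluate)
              for a in game.actions(state)]
    return max(values) if game.player(state) == +1 else min(values)

class BucketGame:
    """Illustrative stand-in game: the agent picks a bucket, the opponent
    picks a number from it (the same bucket example as before)."""
    buckets = {"A": [-50, 50], "B": [1, 3], "C": [-5, 15]}
    def isEnd(self, state): return isinstance(state, int)
    def utility(self, state): return state
    def player(self, state): return +1 if state == "start" else -1
    def actions(self, state):
        return list(self.buckets) if state == "start" else self.buckets[state]
    def successor(self, state, action):
        return action  # the action names the next state in this tiny game

print(depthLimitedValue(BucketGame(), "start", 2, lambda s: 0))  # expected: 1
```

With enough depth this recovers the full minimax value; with depth 0 it just returns whatever the evaluation function says about the start state.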
And this evaluation function is almost of the same form as
future costs when we were talking about search problems, right?
So if you knew exactly what it was,
then you would be done,
but you don't know exactly what it is,
because knowing that would amount to solving the whole,
uh, tree search problem.
But in general, you can have some sort of weak estimate of, um,
what the future costs would be.
So, um, yeah.
So, so an evaluation function Eval of s is a weak estimate
of V minimax of s. So it's a weak estimate of,
of your value function, okay?
All right. So, so analogy of that is future costs in search problems.
So how do we come up with an evaluation function?
So we do it in a similar manner that we had visited in the learning lecture,
where we're coming up with,
with features and, and weights for those features, right.
So, so if I'm playing, like, chess,
the way we play it is, right,
we think about a set of actions that we can take and where we end up,
and based on where we end up, um,
you kind of evaluate how good that board is, right.
You have some notions of features,
of how good that board would be from that point on.
And that allows us to evaluate what action to pick,
right, like when we play chess that's kind of what we do.
We pick a couple of actions and we see how the board would look after taking them.
An evaluation function kind of does the same thing, it tries to figure out
what are the things said we should care about in a specific game,
in this case in chess and tries to give values to them.
So, so it might be things like the number of pieces we have,
or mobility of those pieces,
or if our king is safe,
or if we have central control or not.
So, so for example,
for the pieces what we can do is,
we can look at the difference between the number of
pieces we have between what we have and what our opponent has.
So the number of kings that I have versus the number of kings my opponent has.
Well, that seems like a really important thing, because if I don't have a king and
my opponent has a king, then [LAUGHTER] I've lost the game.
So, so you might put like a really large weight on
that, and you might care about differences between the number of pawns,
or the number of queens, and the other types of pieces that you have on the board.
So, so that allows you to care about- to think about how good the board is,
or number of legal moves that you have and the number of legal moves that your opponent has,
and then that gives you some notion of like mobility of that state.
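A hand-crafted linear evaluation function along these lines might look like this sketch; the feature names and weights are made up for illustration, not a tuned chess evaluator:

```python
def evaluate(features, weights):
    # Linear evaluation: a weighted sum of domain-specific features.
    return sum(weights[name] * value for name, value in features.items())

# Illustrative weights: losing the king should dominate everything else.
weights = {"king_diff": 10000, "queen_diff": 9, "pawn_diff": 1, "mobility": 0.1}

features = {
    "king_diff": 0,    # my kings minus opponent's kings
    "queen_diff": 1,   # up one queen
    "pawn_diff": -2,   # down two pawns
    "mobility": 12,    # my legal moves minus opponent's legal moves
}
print(round(evaluate(features, weights), 2))  # expected: 8.2
```

How to pick the weights is exactly the learning question mentioned here, to be revisited later.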
Okay. All right.
So um,
so summary so far is- yeah,
so this is pretty bad,
order of b to the 2d is pretty bad,
and an evaluation function basically tries to estimate
this V minimax using some domain knowledge.
And unlike A star,
we actually don't have any guarantees in terms of
the error from these sorts of approximations.
So um, but it's an approximation, people use it, it's pretty good.
We will talk about it a little bit later next time when,
when we think about like how- what sort of weights we should,
we should pick for each one of these,
for each one of these features.
So you should think learning when you think
about what are the weights we are using. All right.
So- okay, so now I want to spend a bit of time
on alpha-beta pruning because this is- yeah, important.
Okay. So alpha-beta pruning.
Yeah. The concept of alpha-beta pruning is also pretty simple,
but I think it's one of those things
you should pay attention to,
to kind of get what is happening.
All right. So, so let's say that you want to choose between some bucket A and bucket B.
Okay, and you want to choose the maximum value,
and then you know that the values of A fall into like 3 to 5,
and the values of B fall into 5 to 10.
So, so they don't really have like any,
any intersections between each other.
So, so in that case,
if you're picking a maximum, right,
you shouldn't care about your bucket A,
or the rest of your bucket A, right,
because you already know that with B you are at 5 or above;
you are happy with B, you shouldn't even look at A.
So, so kind of the,
the underlying concept of,
of um, alpha-beta pruning is,
is maintaining a lower bound and upper bound on values,
and then if the intervals don't overlap then basically
dropping part of the sub-tree that you don't need to work on because there is,
there is no overlap between them.
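The bucket comparison boils down to a single bound check; here is a tiny sketch (the concrete numbers are just the ones from the example above):

```python
# If every value bucket B can hold is at least as large as every value
# bucket A could possibly hold, a maximizer can skip the rest of A entirely.
a_upper = 5   # bucket A's values lie in [3, 5], so 5 is an upper bound on A
b_lower = 5   # bucket B's values lie in [5, 10], so 5 is a lower bound on B
can_prune_a = b_lower >= a_upper   # the intervals have no non-trivial overlap
print(can_prune_a)  # True: no need to look inside bucket A
```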
Okay. So here's an example,
so let's say we have these max nodes and
min nodes and you're going to go down and see 3,
and then this is a min node so,
so you're going to get 3 here.
So when I get to the max node here,
right, I- what, what,
I know is that the max node is going to get 3 or higher, right.
That- that's one thing that I would know without
even looking at anything on the, on the other side,
without even looking at the sub-tree on the left,
I already know that this max node should get 3 or higher, right.
Does everybody already agree with that? Okay. So, so then
when I go down to this min node and I see 2 here,
right, I know this is a min node,
it's going to get a value that's less than or equal to 2.
Less than or equal to 2 does not overlap with greater than or equal to 3,
so I should not worry about that sub-tree.
Does everyone see that? So maybe let me draw it out here.
[NOISE] So that's kind of like the whole concept of what happens in alpha-beta pruning.
So I have this max node,
this was three, this was what- five.
I found that the guy is 3,
this is a max node.
Whatever it gets, it- it's going to be greater than or equal to 3 because,
because it's already seen 3,
it's not gonna get any value less than 3, right.
So, so we know whatever value we are going to get at
this max node is going to be 3 or higher.
Okay. Then I'm going to go down here,
and then I see two here, right,
It's a min node whatever it gets is going to be less than or equal to 2.
So less than or equal to 2 is the value that's going to get popped up here.
I already know less than or equal to 2 has no overlap with 3 or greater.
So I don't even need to worry about that, like I,
like I can completely ignore this side of the tree,
I don't need to know whatever is happening down here,
I don't even need to look at that.
Okay. Because, because I- like this value should be greater than or equal to 2. Yes.
All right, we should get a value greater than or equal to 8.
Sorry.
[inaudible]
It's minimum- so it's a minimum,
it's a minimum node, right.
So it's going to be less than or equal [NOISE] to- right.
Yeah. It's a min node,
so I still have to- even if I see 10 here or 20 here,
like I'm not going to pick that, like it's 2 or lower.
All right. So yeah- so if it is 10,
or 100, or whatever sub-tree it is there like we're not going to look at that.
So, so that, that is kind of the whole concept.
Um, All right.
So- okay.
Let me actually go to this slide, I think this will help.
So the key idea of alpha-beta pruning is that
the optimal path is going to end at some leaf node that has some utility,
and that utility is the thing that is going to be pushed up, right,
and then the interesting thing is, if you pick the optimal path,
the values of the nodes on that optimal path are all going to be equal to each other,
right, like they're basically the utility that is going
to get pushed up all the way to the top.
So, so because of that, we can't have settings where there is
no intersection between the intervals, because we know, if this
were to be the optimal path,
the value at this node should have been the same as the value at
this node, the same as the value at this node, and so on.
So if the intervals don't intersect, then there is no way that they would have the same value,
and no way for that path to be the optimal path.
Okay. So, so that's kind of the reason that it works because
the optimal path you're going to have the same value throughout.
Okay. So-all right so how do we actually do this?
So the way we do this, is we're going to keep
a lower bound on max nodes, so I'm going to call that a_s.
Let me [NOISE] get this up here.
So we are going to have a_s which is a lower bound on max nodes.
So we're going to keep track of that.
We're also going to keep track of b_s,
which is an upper bound on min nodes.
Okay. And then if the intervals don't overlap,
we just drop that sub-tree.
If they do overlap, we just keep updating a_s and b_s.
Okay. So, so here's an example,
so let's say that we start with this top node.
Somehow we have found out that this top node
should be greater than or equal to 6, right.
Somehow I know it should be greater than or equal to 6.
Okay. So that is my a_s value.
So my a_s is equal to 6,
it is going to be a lower bound on my max node.
I know the optimal value
is going to be something greater than or equal to 6.
Okay. Then somehow we get to this min node,
and then we realized that this min node should be less than or equal to 8.
So you're here, let's say 8 is here,
we still have some overlap, we're all good, right,
so b_s is going to be equal to 8, right,
we have an upper bound on the min node,
and that tells us that the upper bound is 8.
So the optimal
value- the value on the optimal path is going to be less than or equal to 8.
Okay. So far so good.
Then somehow I found out that that one is greater than or equal to 3.
Greater than or equal to 3 should be fine, right,
greater than or equal to 3 is still compatible with greater than or equal to 6,
my a_s in this case-
I'm going to call these S1, S2,
and S3- so a_s3 is equal to 3,
right, because I know I need to be greater than or equal to 3.
But like 6 already does the job, right,
like I don't need to worry about that 3.
So, so that's all, good so far.
And then for this last node,
I am at this min node,
and I realize that b_s4,
I'm going to call it b_s4,
is equal to 5.
And what this tells me is that this value should be less than or equal to 5.
So I'm going to update less than or equal to 8 to less than or equal to 5.
And now, the intervals don't overlap anymore.
So what that tells me is that this path is not going to be the optimal path,
because there is no overlap.
So- so we're not going to find this- this one number that is going to be the utility.
And what that tells me is,
I can actually ignore that whole sub-tree because-
because that's not going to be in- my- my optimal path,
I can- I can get rid of it,
I can ignore it, okay. Yes.
We also ignore 3 if, uh,
the beta is equal to alpha,
if we already have something else, is that not the same thing?
Yeah. So- so we're ignoring 3 in a different way.
I- I- so- so yeah- so we're ignoring the value
of 3 because it's already encoded here.
But we're ignoring the subtree under the 5,
like I'm not exploring it.
I did need to explore things after the 3,
because with the 3 we still had an overlap with the beta.
So with the b value- we are looking at
the overlap between the upper bound on min nodes and the lower bound on max nodes.
So that interval is the interval you're making sure still has values in it.
One example of, uh, if the two or three extend;
do you just ignore them anyway
because you already had something else that's- that's [OVERLAPPING] is that optimal?
Yeah, yeah. So, uh, yeah, I think so.
Yeah, so- so if you already have like,
if 3 were 2, is that what you're saying?
Yeah, so- so you want to have non-trivial intervals, basically, yes.
Yeah. So like if- if- if it is the same value- you still- yeah,
you don't have a non-trivial interval.
And- and yeah question.
I was wondering how we got 6 and 8 and 3.
Oh, this is an example of that, imagine somehow [NOISE].
But we- we will talk about some examples whe- where we get them.
So I'll talk about one more example where we actually like get these,
but for now just assume somehow we have found this. Yes.
Um, on the top example,
I don't understand why,
uh, 3 is an upper bound or 2 is a lower bound.
So, um, so the- the actual values,
um, I'm not showing a full example here.
So the actual values are coming from somewhere that
I'm not talking about yet but- [OVERLAPPING]
[inaudible]
Oh, the one at the top. Okay, oh sorry.
Yeah. So the one at the top right?
So- so this is a min node,
a min node, this is a max node, right?
So at my min node,
I found out that minimum between 3 and 5 is 3, right?
So max node is maximizing between 3 and a bunch of other things.
That- that's what it is supposed to do, right?
So it's maximizing between 3 and a bunch of other things,
then it's at least going to be 3.
It's not going to be 2, there is no way for it to be
2 or it's not going to be 0, right?
Because it's- it's going to take maximum of 3 and something else.
So that's why I'm saying,
well this value whatever I'm going to get at this max node is going
to be greater than or equal to 3. Does that make sense?
So now I come down here, and I see like,
I see this 2; this is a min node.
So the value here is going to be the minimum between
2 and whatever is down this tree, right?
So it is going to be at least,
uh- I'm very bad with that, the 'at least' and the 'at most'.
It's going to be- [LAUGHTER] it's going to be 2 or lower. Let me just use that.
So- so what we are getting here is going to be 2 or lower, right?
So I'm either going to get 2 or 1 or 0 or- or all that.
And that's the value that's going to be pushed up here, right?
So that's the value that's going to go down here,
it's going to be a value that is 2 or lower.
So if I'm maximizing between 3 and something that is 2 or lower,
then 3 is enough.
And I can like, kind of figure that out based on
these intervals and don't look at this side of the tree.
Like- like once I've- I've seen these two,
I already see there is no non-trivial overlap
between a value that's greater than or equal to 3 and a value that's less than or equal to 2.
So I can just not worry about stuff down there.
Okay. All right.
So one other quick implementation note,
I think- so we talked about these a
values and b values.
You can- instead- keep track of only one value each.
And that would be this Alpha value and Beta value,
where Alpha value is just -I'm going to illustrate it here.
Alpha value- let me get it right.
So Alpha of s is the max of a_s' over all these s primes that come before s. Yeah.
So- so- is what this basically says is,
remember like when we saw 3 we said, "Well,
that's already included, like we already knew that."
That's kind of the same idea.
So Alpha of s is just going to be one value.
In this case, it's just going to be 6,
because like when I see 3,
like I don't really care about that 3, right?
Like I already know I'm greater than 6,
knowing that I'm greater than 3 is not adding anything.
So we keep track of one value;
Alpha of a- al- Alpha of s. In this case,
Alpha of s is just equal to 6.
And then similar thing for Beta.
We're going to keep track of Beta of s,
and Beta of s is just minimum of b_s's.
And then, what I'm writing here is just the ordering of the nodes that you have seen.
So- so Beta of s is 5.
And then, you're looking at the intervals, like, Alpha of s
and above,
and Beta of s and below.
And if those intervals don't have any non-trivial intersection,
then you can- you can prune that part of the tree.
Okay. So- so this is more of an- an implementation thing instead of keeping
track of all these a_s's and b_s's just keep one number,
one Alpha and one Beta.
Okay. All right.
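The single-alpha, single-beta bookkeeping just described can be sketched like this. This is a minimal sketch with an assumed tree encoding (leaves are numbers, internal nodes are lists of children), not the course's reference implementation:

```python
# A minimal sketch of alpha-beta pruning keeping one alpha (tightest lower
# bound from max nodes so far) and one beta (tightest upper bound from min
# nodes). When alpha >= beta the intervals no longer overlap, so we prune.

def alphabeta(node, alpha, beta, is_max):
    if isinstance(node, (int, float)):     # leaf: just return its utility
        return node
    if is_max:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)      # tighten the lower bound
            if alpha >= beta:              # no overlap left
                break                      # prune remaining children
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)        # tighten the upper bound
            if alpha >= beta:
                break
        return value

# The small tree from earlier: max over min(3, 5) and min(2, ...).
tree = [[3, 5], [2, 100]]
print(alphabeta(tree, float('-inf'), float('inf'), True))  # 3; the 100 is never visited
```

After the first min child returns 3, alpha becomes 3; inside the second min child, seeing 2 drives beta to 2, alpha >= beta holds, and the rest of that subtree (the 100) is skipped, exactly the pruning walked through above.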
Okay. So let's look at one- one other example.
Uh, so all right.
So I'm going to just do this example real quick.
Okay. So you're going to start from some top node,
we're gonna go to this node,
this is a min node between 9 and 7.
Between 9 and 7 right?
So it's a min node,
I'm going to get this guy; 7.
I'm going to realize that
this max node is going to be something that's at least 7, right?
It's going to be something that's greater than or equal to 7.
So my Alpha of s is going to be 7 right now.
I know whatever value I'm going to get is going to be 7 or higher.
Whatever value this start node is going to get,
It's got to be 7 or higher, okay?
So now I come down here, I am at a min node.
I see a 6 here, right?
I go here, it's a min node,
so whatever we get here,
is going to be less than or equal to 6, right?
So it's going to be 6 or something that is lower.
That tells me my Beta of s is equal to 6.
That tells me whatever I am getting in that min node is going to be 6 and lower.
That doesn't have any intersections with my Alpha of
s. So I can just not do anything about this- this branch.
Like I don't- like I don't need to go
over like- like I know like all these other things like,
I can kind of ignore like this whole branch.
Okay. All right.
So now I go back up.
I go down here, I'm at the min node.
So remember the way we were computing these Beta values,
were based on the nodes that we have seen previously.
So I have a new Beta now because I'm done with this branch, right.
So I- I need to get here.
Here I have a min between- what is it?
8. This is 8?
8 and 3. So okay.
So- so I see my- maybe let me just write 8.
I see my 8 here,
it's a min node, so it's going to be less than or equal to 8.
So my new Beta value is going to be 8.
My Alpha is still 7 because that's for my top node.
So its 8 or lower.
We do have an overlapping interval,
7 to 8.
Everything is good. So I actually need to go and see what this value is.
This value is 3,
so I get 3 here,
or like, it's exactly equal to 3.
So that updates my Beta from 8 to 3.
We have already explored that part of the tree anyway,
but with 3- you don't have an overlap anymore.
If there were a bunch of things below this 3,
like I- when I see that, I wouldn't need to explore them,
but we don't really have that here.
And then we just find that our optimal value is 7,
so we just return 7, okay.
And we didn't explore this giant middle part of the tree.
Okay. One more slide and then I'll- two more- two more quick, one quick idea.
Okay. So [LAUGHTER] All right.
So the order of things actually matters,
so- so that the only thing I want to mention about
this idea of pruning is- is the order of things matters.
So- so when we look at this example,
remember we didn't explore anything about the 10,
because we already knew that this value needs to be greater than or equal to 3.
These are my buckets, right?
If I swap the buckets,
like if I just swap the order of buckets,
I move the 2-10 bucket to this side,
3-5 bucket to the other side,
I wouldn't be able to do that.
I actually need to explore the whole tree,
because my Alphas and Beta wouldn't have the same properties.
So the order that you are putting things on the tree
actually matters and- and you should care about that.
Um, so worst-case scenario,
our ordering is terrible,
so we need to actually go over the full tree, and that's order of b to the 2d.
That's the worst-case scenario.
There is this best ordering where, effectively, you only pay for half the depth.
So- so you can- like if you- if you had- if you have
a tree where you can explore up to depth 10,
then with the best ordering,
you can actually explore up to depth like 20.
So- so that's a huge improvement actually.
Uh, so the best ordering is going to be order of b to the
d. And then random ordering turns out to be pretty okay too,
so random ordering would be order of b to
the 2 times three-fourths times d. So even if you had a random ordering,
it would be better than the worst case scenario.
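Plugging in some illustrative numbers makes the gap concrete. The branching factor and depth here are assumptions for the sake of arithmetic, not values from the lecture:

```python
# Node counts for a game tree where each of the two players moves d times
# (total depth 2d) with branching factor b. Numbers are purely illustrative.
b, d = 10, 5
worst_case = b ** (2 * d)              # terrible ordering: 10^10 nodes
best_case = b ** d                     # perfect ordering:  10^5 nodes
random_case = b ** (2 * (3 / 4) * d)   # random ordering:  ~10^7.5 nodes
print(worst_case, best_case, int(random_case))
```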
And then, well, how do you figure out what is a good ordering?
Well, we can have this evaluation function.
Remember you- you are computing the evaluation function and- and what you can do is,
you can order, uh, your- so for max nodes,
you can order the successors by decreasing evaluation function,
and for min nodes, you can order the successors by increasing evaluation function.
That allows you to prune as many things as possible. All right.
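That ordering heuristic fits in a few lines. A minimal sketch; the evaluation function here is a stand-in identity on numbers, and any domain heuristic would slot in instead:

```python
# Order successors so alpha-beta sees the most promising moves first:
# decreasing evaluation at max nodes, increasing evaluation at min nodes.
def ordered_successors(successors, evaluate, is_max):
    return sorted(successors, key=evaluate, reverse=is_max)

# Toy successors scored by a stand-in evaluation (identity on numbers):
succ = [4, 9, 1]
print(ordered_successors(succ, lambda s: s, True))   # max node: [9, 4, 1]
print(ordered_successors(succ, lambda s: s, False))  # min node: [1, 4, 9]
```

Trying strong moves first tightens alpha (or beta) early, which is what makes the later siblings prunable.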
So with that I'll see you guys next lecture talking about TD learning.
