Let's start guys. Okay, so,
uh, we're gonna continue talking about games today.
Uh, just a quick announcement,
the project proposals are due today.
I think you all know that. Um, all right, let's co-
Tomorrow.
Tomorrow. You're right [LAUGHTER].
Tomorrow [LAUGHTER] Just checking.
[LAUGHTER] Yeah. Today is not Thursday.
Yeah. [LAUGHTER] Tomorrow.
For a second, I thought it's Thursday.
Um, all right, so let's talk about games.
Uh, so we started talking about games last time.
Uh, we formalized them.
Uh, we talked about zero-sum two-player games that were turn-taking, right?
And we talked about a bunch of different strategies to solve them,
like the minimax strategy or the expectimax strategy.
Uh, and today we wanna talk a little [NOISE] bit about learning in the setting of games.
So what does learning mean?
How do we learn those evaluation functions that we talked about?
And then, er, towards the end of the lecture,
we wanna talk a little [NOISE] bit about
variations of the game- the games we have talked about.
So, uh, how about if you have- how about the cases
where we have simultaneous games or non-zero-sum games.
So that's the, that's the plan for today.
So I'm gonna start with a question that
we're actually going to talk about towards the end of the lecture,
but it's a good motivation.
So, uh, think [NOISE] about
a setting where we have a simultaneous two-player zero-sum game.
So it's a two-player zero-sum game similar to the games we talked about last time,
but it is simultaneous.
So you're not ta- ta- taking turns,
you're playing at the same time.
And an example of that is rock, paper, scissors.
So can you still be optimal if you reveal your strategy?
So lets say you're playing with someone.
If you tell them what your strategy is,
can you still be optimal?
That's the question. Yes.
[inaudible] It's a small [NOISE]
enough game space for- if they know exactly [NOISE] what you're going to play,
[NOISE] you won't be successful if you- for
a zero-sum real-time simultaneously being the larger scale,
I think you could still be successful if that approach
is like superior to the other approach taken. [NOISE]
So it's not- so, so,
so the answer was about the size of the game.
So rock, paper, scissors being small versus,
versus not being small.
So, so the question is more of a motivating thing.
We'll talk about this in a lot of detail towards the end of the class.
It's actually not the size that matters.
It's the type of strategy that you play that matters,
so just to give you an idea.
But, like, the reason that we have put this I guess at,
at the beginning of the lecture is intuitively when you think about this,
you might say, "No.
I'm not gonna tell you what my strategy is, right?
Because if I say, I'm gonna play it, like,
scissors, you'll know what to play."
But th- this has
an unintuitive answer that we are gonna talk about towards the end of the lecture.
So just more of a motivating example.
Don't think about it too hard. All right.
So, so let's do a quick review of games.
So, um, so last time we talked
about having an agent and opponent playing against each other.
So, uh, and we were playing for the agent,
uh, and the agent was trying to maximize their utility.
So they were trying to get this utility.
The example we looked at was, uh,
agent is going to pick bucket A, bucket B,
or bucket C. And then the opponent is going to pick a number from these buckets.
They can either pick minus 50 or 50,
1 or 3 or minus 5 or 15.
And then if you want to maximize your,
your utility as an agent,
then you can potentially think that your opponent [NOISE] is trying to,
trying to minimize your utility,
and you can have this minimax game, kind of,
playing against each other and, and,
and based on that, uh, decide what to do.
So we had this minimax tree and based on that,
the utilities that are gonna pop up are minus 50,
1 and minus 5.
So if your goal is to maximize your utility, you're gonna pick bucket B,
the second bucket, because that's the best thing you can do,
assuming your opponent is a minimizer.
So, so that was kind of the setup that we started looking at.
And the way we thought about
solving this game was by writing a recurrence.
So we had this value.
This is V, which was the minimax value
at state s. And
if you're at an end state,
we are gonna get the utility of s, right?
Like if you get to the end state,
we get the utility because we get the utility only at the,
at the very end of the game.
And if the agent is playing,
the recurrence is to maximize V over the successor states.
And if the opponent is playing,
you wanna minimize the value of the successor states.
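As a rough sketch of that recurrence on the bucket example, assuming a made-up encoding where the game is a dictionary of buckets and each bucket is a list of end-state utilities (the names `buckets` and `minimax_value` are illustrative, not from the lecture):

```python
# Hypothetical encoding of the bucket game: the agent picks a bucket (max node),
# then the opponent picks a number from that bucket (min node).
buckets = {"A": [-50, 50], "B": [1, 3], "C": [-5, 15]}

def minimax_value(node, agent_to_move):
    """Utility at end states, max on the agent's turn, min on the opponent's turn."""
    if isinstance(node, (int, float)):  # end state: return its utility
        return node
    children = node.values() if isinstance(node, dict) else node
    values = [minimax_value(child, not agent_to_move) for child in children]
    return max(values) if agent_to_move else min(values)

# Value of each bucket when the opponent (a minimizer) moves next.
bucket_values = {name: minimax_value(nums, agent_to_move=False)
                 for name, nums in buckets.items()}
best_bucket = max(bucket_values, key=bucket_values.get)
root_value = minimax_value(buckets, agent_to_move=True)
```

Here `bucket_values` holds the three utilities that "pop up" from the opponent's layer, and `best_bucket` is the agent's choice against a minimizing opponent.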
And so that was the recurrence we started with, and,
and we looked at games that were kind of large like the game of chess.
And if you think about the game of chess,
the branching factor is huge.
The depth is really large.
It's not practical to do the full recurrence.
So we started talking about ways of speeding things up, and,
and one way to speed things up was this idea of using an evaluation function.
So do the recurrence but only do it until some depth.
So don't go over the full tree.
Just do it until some depth,
and then after that, just call an evaluation function.
And hopefully your evaluation function which is kind of this weak estimate of
your value is going to work well and give you an idea of what to do next.
Okay. So, so instead of the usual recurrence,
what we did was we decided to add this d here,
which is the depth until which we are exploring.
And then we decrease the value of depth
after both the agent and the opponent have played.
And then when depth is equal to 0,
we just call an evaluation function.
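A minimal sketch of the depth-limited recurrence, under some assumptions: the game is a toy tree of nested lists whose leaves are utilities, and `evaluate` is a crude stand-in evaluation function (the mean of the leaves below a state), not anything from the lecture.

```python
def leaves(state):
    """All end-state utilities below a state in the toy nested-list tree."""
    if isinstance(state, (int, float)):
        return [state]
    return [x for child in state for x in leaves(child)]

def evaluate(state):
    """Stand-in evaluation function (an assumption): mean utility below the state."""
    vals = leaves(state)
    return sum(vals) / len(vals)

def value(state, depth, agent_to_move=True):
    if isinstance(state, (int, float)):  # end state: true utility
        return state
    if depth == 0:                       # out of depth: fall back to Eval
        return evaluate(state)
    # Depth drops by one only after the opponent moves, i.e. after a full round.
    next_depth = depth if agent_to_move else depth - 1
    vals = [value(child, next_depth, not agent_to_move) for child in state]
    return max(vals) if agent_to_move else min(vals)

# Agent, opponent, agent, opponent, then utilities.
tree = [
    [[[1, 9], [2, 3]], [[4, 6], [0, 8]]],
    [[[5, 5], [7, 1]], [[2, 2], [3, 10]]],
]
full_value = value(tree, depth=2)     # searches to the end states
shallow_value = value(tree, depth=1)  # one round of search, then Eval
```

The two results differ because the shallow search trusts the weak estimate instead of the true utilities, which is why the quality of the evaluation function matters.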
So intuitively if you're playing chess, for example,
you might think a few steps ahead,
and when you think a few steps ahead,
you might think about what the board looks like and
kind of evaluate that based on the features that,
that, that board has and based on that, you might,
you might decide to take various actions.
So similar type of idea.
And then the question was, well,
how are we gonna come up with this evaluation function?
Like where is this evaluation function coming from?
Uh, and, and then one idea that,
that we talked about last time was it can be handcrafted.
The designer can come in and sit down and figure out what is a good evaluation function.
So in chess, for example,
you have this evaluation function that can depend on the number of pieces you have,
the mobility of your pieces.
Maybe the safety of your king,
central control, all these various things that you might care about.
So the difference between the number of
queens that you have and your opponents number of queens,
these are things, these are features that you care about.
And potentially, a designer can come in and say, "Well,
I care about how many queens I have nine times more than I care about how many pawns I have."
So you can actually hand-design these things and
write down these weights about how much you care about these features.
Okay. So I'm using terminology from the learning lecture, right?
I'm saying we have weights here and we have features here,
and someone can come and just handcraft that.
Okay. Well, one other thing we can do is instead of handcrafting it,
we could actually try to learn this evaluation function.
So, so we can still handcraft the features, right?
We can still say, "Well, I care about the number of
kings and queens and these sort of things that I have,
but I don't know how much I care about them.
And I actually wanna learn that evaluation function.
Like what the weights should be."
Okay. So to do that,
I can write my evaluation function,
eval of s,
as this V, a function of state parameterized by weights w.
And my goal is to figure out
what these weights w are.
And ideally I wanna learn that from some data.
Okay. So, so we're gonna talk about how learning is applied to these game settings.
And specifically the way we are using learning for these game settings is to just
get a better sense of what this evaluation function should be from some data.
Okay. So the questions you might have right now are,
well, what does V look like?
Where does my data come from?
Because if you know where your data comes from and what your V is,
then all you need to do is to come up with a learning algorithm
that takes your data and tries to figure out what your V is.
So we're gonna talk about that in the first part of the lecture.
Okay. And that kind of leads us to
temporal difference learning, which we're gonna discuss in a second.
It's very similar to Q-learning.
Uh, and then towards the end of the class,
we will talk about simultaneous games and non-zero-sum games. Okay.
All right. So, so let's start with this V function.
I just said, well, this V function could be
parameterized by a set of weights, a set of w's,
and the simplest form of this V function is to just write it as
a linear predictor, a linear function of a set of features: w dot Phi of s.
And these Phi's are the features that are hand-coded and someone writes them.
And then- and then I just want to figure out what w is.
So this is the simplest form.
But in general, this V function doesn't need to be a linear predictor.
It can actually be any
supervised learning model that we have discussed in the first few lectures.
It can be a neural network.
It can be anything even more complicated than neural network that just does regression.
So, so we can- basically,
any model you could use in supervised learning could be placed here as,
as, as this V function.
So all I'm doing is I'm writing
this V function as a function of state and a bunch of parameters.
Those parameters in the case of a linear predictor are just
the w's, and in the case of the neural network
they are the w's and these v's, in this case for a one-layer neural network.
Okay. Or multilayer, actually. Yeah, one way.
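The linear form of V described above can be sketched in a few lines. The function name `linear_value` and the numbers are illustrative assumptions; the only thing taken from the lecture is that V is the dot product of weights and hand-coded features.

```python
def linear_value(weights, features):
    """V(s; w) = w . phi(s) for a linear predictor."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same dimension")
    return sum(w * f for w, f in zip(weights, features))

# Illustrative numbers only: two features and two weights.
example_value = linear_value([0.5, 1.0], [1, 2])
```

Because V is linear in w, its gradient with respect to w is just the feature vector, which is what makes the learning update later so simple.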
All right. So let's look at an example.
So let's think about an example and I'm going to focus on
the linear predictor way of looking at this, just for simplicity.
So, um, okay, let's pick a game.
So we're going to look at backgammon.
So this is a very old game.
Uh, it's a two-player game.
The way it works is you have the red player and you have the white player,
and each one of them have these pieces.
And what they wanna do is they want to move all their pieces from
one side of the board to the other side of the board. It's a game of chance.
You can actually, like, roll two dice and based on the outcome of your dice,
you move your pieces various,
various amounts to, to various columns. Uh, there are a bunch of rules.
So your goal is to get all your pieces off the board.
But if you have only, like,
one piece and your opponent, like,
gets on top of you, they can push you to the bar and you have to, like, start again.
Um, there are a bunch of rules about it.
Read it, read about it on Wikipedia if you're interested.
But we are going to look at a simplified version of it.
So in this simplified version,
I have Player O and player X,
and I only have four columns.
I have column 0, 1, 2, and 3.
And in this case, I have four pieces for each one of these players and,
and the idea is,
we want to come up with features that we would care about in this game of backgammon.
So, so what are some features that you think might be useful?
Remember the learning lecture.
How did we come up with, like, feature templates? Yes.
Currently, still bound with the [inaudible].
So maybe like the location of the X's and O's. The number of them.
Yeah. Yeah. So one idea is you have all this knowledge about the board,
so maybe we should, like, care about the location of the X's.
Maybe we should care about like where the O's are,
how many pieces are on the board,
how many pieces are off the board.
So it's similar to the way we came up with features in the first few lectures.
We would basically do the same thing.
So a feature template- set of feature templates could look like this, like,
number of X's or O's in column- whatever column being equal to some value or,
uh, number of X's or O's on the bar.
Maybe fraction of X's or O's that are removed, whose turn it is.
So these are all like potential features that we could use.
So for this particular board,
here are what those features would look like.
So if you look at number of O's in column 0 equal to 1, that's equal to 1.
Remember we were using these indicator functions to be more general.
So here, again, we are using these indicator functions.
You might ask for the number of O's on the bar; that's equal to 1.
Then the fraction of O's that are removed:
So I have four pieces.
Two of them are already removed. So that's one-half.
Number of X's in column 1 equal to 1, that's 1.
Number of X's in column 3 equal to 3, that's 1.
It's O's turn. So that's equal to 1.
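Those feature templates can be sketched as a small extractor. The board encoding below (a dictionary with `columns`, `bar`, `removed`, `turn`) is a hypothetical representation chosen so that it reproduces the six feature values just listed; it is not the lecture's own code.

```python
# Hypothetical encoding of the simplified board described above.
board = {
    "columns": {"o": [1, 0, 0, 0], "x": [0, 1, 0, 3]},  # pieces per column 0..3
    "bar": {"o": 1, "x": 0},
    "removed": {"o": 2, "x": 0},
    "turn": "o",
    "pieces_per_player": 4,
}

def extract_features(board):
    """Sparse feature map: indicator features plus a couple of numeric ones."""
    features = {}
    total = board["pieces_per_player"]
    for player in ("o", "x"):
        for col, count in enumerate(board["columns"][player]):
            if count > 0:  # indicator: "# player in column col = count"
                features[f"# {player} in column {col} = {count}"] = 1
        if board["bar"][player] > 0:
            features[f"# {player} on bar"] = board["bar"][player]
        if board["removed"][player] > 0:
            features[f"fraction {player} removed"] = board["removed"][player] / total
    features[f"it is {board['turn']}'s turn"] = 1
    return features

features = extract_features(board)
```

Features whose value would be 0 are simply left out of the dictionary, which keeps the representation sparse.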
Okay. So, so we have a bunch of features.
These features, kind of, explain what the board looks like or how good this board is.
And what we wanna do is we wanna figure out what,
what are the weights that we should put for each one of
these features and how much we should care about, uh, each one of these features.
So, so that is the goal of learning here.
Okay. All right. So okay.
So, so that was my model. All right.
So far, I've talked about this V of s and w.
I've defined it as a linear predictor,
w's times features.
And now, the question is where do I get data?
Because if I'm doing learning,
I got to get data from somewhere.
So, so one idea that we can use here is we can try to generate
data based on our current policy pi agent or pi opponent,
which is based on our current estimate of what V is.
Right. So currently, I might have some idea of what this V function is.
It might be a very bad idea of what V is, but that's okay.
I can just start with that and starting with,
with that V function that I currently have,
what I can do is I can,
I can call arg max of V over the successors of s and a to get a policy for my agent.
Remember this was how we were getting the policy in the minimax setting.
The policy for the opponent is just the arg min of
that V function, and then when I call these policies,
I get a bunch of actions.
I get a sequence of, like,
states based on, based on how we are following these policies,
and that is some data that I can actually go over and try to make my V better and better.
So, so that's kind of how we do it.
We call these policies.
We get a bunch of episodes.
We go over them to make things better and better.
So, so that's, kind of, the key idea.
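A minimal sketch of generating one episode this way, assuming a toy game encoded as nested lists (the bucket game from the review) and a deliberately poor current estimate of V, the mean of the utilities below a state. All names here are illustrative.

```python
def is_end(state):
    return isinstance(state, (int, float))

def leaves(state):
    return [state] if is_end(state) else [x for c in state for x in leaves(c)]

def value_estimate(state):
    """Current (possibly bad) estimate of V: mean utility below the state."""
    vals = leaves(state)
    return sum(vals) / len(vals)

def generate_episode(start):
    """Follow arg max of V for the agent and arg min of V for the opponent."""
    episode = []
    state, agent_to_move = start, True
    while not is_end(state):
        choose = max if agent_to_move else min
        action = choose(range(len(state)), key=lambda i: value_estimate(state[i]))
        next_state = state[action]
        # Reward is 0 throughout, except when the game ends.
        reward = next_state if is_end(next_state) else 0
        episode.append((state, action, reward, next_state))
        state, agent_to_move = next_state, not agent_to_move
    return episode

bucket_game = [[-50, 50], [1, 3], [-5, 15]]
episode = generate_episode(bucket_game)
actions = [a for (_, a, _, _) in episode]
rewards = [r for (_, _, r, _) in episode]
```

With this rough V the agent is lured into the third bucket and ends with a utility of minus 5, which is exactly the kind of experience the learning step can use to improve V.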
Um, one question you might have at this point is, um,
is this deterministic or not, like,
do I need to do something like Epsilon-Greedy.
So in general, you would need to do something like Epsilon-Greedy.
But in this particular case,
you don't really need to do that because
you're actually rolling the dice.
And by rolling the dice,
you are getting different random paths
that might take us to different states.
So we, kind of, already have this,
this element of randomness here that does some of the exploration for us.
And you just mean like unexplored probability?
Yes. So by Epsilon-Greedy,
what I mean here is do I need to do extra exploration?
Am I gonna get stuck like in a particular set of states if I don't do exploration?
And in this particular case,
because we have this randomness,
we don't really need to do that.
But in general, you might imagine having
some sort of Epsilon-Greedy to make us explore a little bit more.
Okay. So then we generate episodes and then from these episodes, we want to learn.
Okay. These episodes look like
state, action, reward, state, and then they keep going until we get a full episode.
One thing to notice here is,
is the reward is going to be 0 throughout
the episode until the very end of- end of the game.
Right. Until we end the episode and we
might get some reward at that point or we might not.
Uh, but, but the reward throughout is
going to be equal to 0 because we are playing a game.
Right. Like we are not getting any rewards at the beginning.
And if you think about each one of these small pieces of experience; s, a, r,
s prime, we can try to learn something from each one of these pieces of experience.
Okay. So let me actually go to the board, maybe.
What you have here is you have a piece of experience.
Let's call it s, a.
You get some reward. Maybe it is 0.
That's fine if it is 0.
And you go to some s prime through that.
So s, take an action, you get a reward.
Maybe you get a reward.
You go to some s prime from that and you have some prediction.
Right. Your prediction is your current, like,
your current, um, V function.
So your prediction is going to be this V function at
state s, parameterized by w. And this is what you already
kind of know right now.
This, this is your current estimate of what V is.
And this is your prediction.
I'm writing the prediction as a function of w, right,
because it depends on w. And then we have a target that we're trying to get to.
And my target, which is kind- kind of acts as a label,
is going to be equal to my reward,
the reward that I'm getting.
So it's kind of, the reward- so if you look at this V of s and w, well,
it's kind of close-ish to reward plus,
I'm gonna write discount factor,
Gamma V of s prime, w. All right.
So, so my target the thing that I'm trying to like get to is the reward
plus Gamma V of s prime, w, okay?
So we're playing games,
in games Gamma is usually 1.
I'm gonna keep it here for now but I'm gonna drop it at some point,
so you don't need to really worry about Gamma.
And then one other thing to notice here is,
I'm not writing target as a function of w
because target acts kind of like my label, right?
If I'm, if I'm trying to do regression here,
target is my label, it's kind of the ground truth thing that I'm trying to get to.
So I'm gonna treat my target as just like a value,
I'm not writing it as a function of w, okay? All right.
So, so what do we try to do usually,
like when you are trying to do learning?
We have prediction, we have a target,
what do I do? Minimize the- your error.
So what is error? So I can write my error as potentially a squared error.
So I'm gonna write one-half of prediction of w,
minus target squared, this is my squared error.
I want to minimize that.
So with respect to w, okay?
How do I do that? I can take the gradient.
What is the gradient equal to?
This is simple, right?
The 2 comes down and cancels the one-half.
Gradient is just this guy, prediction of w,
minus target, times the gradient of this inner expression.
The gradient of this inner expression with respect to w is the gradient of
prediction with respect to w minus 0 because target is,
I'm treating it as a number, okay?
Let me move this up.
So now I have the gradient.
What algorithm should I use?
I can use gradient descent. All right.
So I'm going to update my w. How do we update it?
I'm gonna move in the negative direction of my gradient using some learning rate Eta,
uh, times my gradient.
My gradient is prediction of w minus target
times gradient of prediction of w with respect to w. All right.
So that's actually what's on this slide.
So the objective function is prediction minus target squared.
Gradient, we just took that,
it's prediction minus target times gradient of prediction.
And then the update is just this,
this particular update where we move in the negative direction of the gradient.
This is, this is what you guys have seen already, okay. All right.
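Written out, the derivation just described is the following, with the target treated as a constant when taking the gradient:

```latex
\begin{align*}
\text{prediction}(w) &= V(s; w), \qquad
\text{target} = r + \gamma\, V(s'; w) \quad \text{(treated as a number)} \\
\text{Loss}(w) &= \tfrac{1}{2}\bigl(\text{prediction}(w) - \text{target}\bigr)^2 \\
\nabla_w \text{Loss}(w) &= \bigl(\text{prediction}(w) - \text{target}\bigr)\,\nabla_w \text{prediction}(w) \\
w &\leftarrow w - \eta\,\bigl(\text{prediction}(w) - \text{target}\bigr)\,\nabla_w V(s; w)
\end{align*}
```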
So so far so good.
Um, so this is the TD learning algorithm.
This is all it does. So temporal difference learning,
what it does is it picks like these pieces of experience;
s, a, r, s prime,
and then based on that pieces of experience,
it just updates w based on this gradient descent update,
difference between prediction and target times the gradient of V, okay?
So what, what happens if I have,
if I have this, this linear function,
maybe let me write- let me write this in the case that I have a linear, linear function.
So what if my V of s, w is just equal to w dot phi of s?
So what happens to my update?
Minus Eta. What is prediction?
w dot phi of s, right?
What is target?
We defined it up there: it's the immediate reward
you're getting plus Gamma times V of s prime, w,
which is w dot phi of s prime. Times the gradient of your prediction, which is what? Phi of s, okay?
So I just wrote this up in the case of a linear predictor. Yes.
With Q learning, what are the differences between the two?
Yeah, so this is very similar to Q learning.
There are very minor differences that we'll
talk about actually at the end of this section,
comparing it to Q learning. All right.
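The linear TD update written on the board can be sketched as one small function. The name `td_update` and the numbers in the example call are assumptions for illustration; the default step size of 0.5 and a discount of 1 follow the values used in the lecture.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td_update(w, phi_s, reward, phi_s_next, eta=0.5, gamma=1.0):
    """One TD learning step for a linear V(s; w) = w . phi(s)."""
    prediction = dot(w, phi_s)
    target = reward + gamma * dot(w, phi_s_next)  # treated as a fixed number
    error = prediction - target
    # Gradient of the prediction with respect to w is phi(s).
    return [wi - eta * error * fi for wi, fi in zip(w, phi_s)]

# Illustrative step: prediction is 1.0, target is 0.5, so we overshoot
# and the weight on the active feature is pushed down.
new_w = td_update([0.5, 1.0], phi_s=[0, 1], reward=0, phi_s_next=[1, 0])
```

Only the weights of features that are active in the current state move, because the gradient is the feature vector itself.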
So, so I wanna go over an example,
it's kind of like a tedious example but I think it
helps going over that and kind of seeing why it works.
Especially in the case that the reward is just equal to 0 like throughout an episode.
So it kinda feels funny to use this algorithm and make it work but it works.
So I want to just go over like one example of this.
So I'm gonna show you one episode starting from S1 to some other state.
And, and I have an episode I start from some state,
I get some features of that state.
Again, these features come from just evaluating those hand-coded features.
And I'm just going to start,
what w should I start with?
0, let me just initialize w to be equal to 0, okay, right?
How do I update my w?
Let me just write it like this.
So I want to write it in
not a simpler form but just another form.
So the way we're updating w is:
the previous w minus Eta times prediction minus target
(I'm gonna use p and t for prediction and target)
times phi of s. Okay,
this is the update you're doing, okay?
Uh, yeah, that's right.
Okay. So, so what is my prediction?
What is my prediction? w dot phi of s?
0. What is my target?
So for my target I need to know what state I'm ending up at.
I'm gonna end up at 1, 0 in this episode and I'm gonna get a reward of 0.
So what is my target?
My target is reward, which is 0,
plus w times phi of s prime,
that is 0 because w is equal to 0.
So my target is equal to 0.
My p minus t is equal to 0.
So p minus t is equal to 0,
this whole thing is 0,
w stays the same.
So in the next kind of step, w is just 0, okay?
I'm gonna move forward.
Um, so what is prediction here?
0 times 0, prediction is 0. What is target?
Target is 0 because I haven't got
any reward yet. Where do I end up? I end up at 1, 2.
So yeah, so target is going to be the reward,
which is 0, plus w, which is 0, times
phi of s prime, whatever state I'm at,
so that's equal to 0. p minus t is equal to 0,
it's kind of boring [LAUGHTER].
So at this point,
w hasn't changed, w is equal to 0.
What is my prediction?
Prediction is equal to 0, that's great.
What is target equal to?
So I'm gonna end up in an end state whose features are 1, 0, and I get a reward of 1.
So this is the first time I'm getting a reward.
What should my target be?
My target is the reward, 1, plus w, which is 0, times 1, 0, which is 0,
so my target is 1.
So what this tells me is,
I'm predicting 0 but my target is 1,
so I need to push my w's a little bit up to actually address the fact that this is,
this is, this is equal to 1.
So p minus t is equal to minus 1.
So I need to do an update.
Maybe I, I'll do that update here.
So how am I updating it?
So I'm starting from 0, 0, minus,
my Eta is 0.5, that's what I defined it to be,
my prediction minus target is minus 1.
What is phi of s, phi of s is 1, 2, right?
So what should my new w be?
What is that equal to? 0.5 and then 1.
All right, so I'm just doing arithmetic here.
So my new w is going to become 0.5 and 1 at the end of this one episode.
So I just did one episode, one full episode,
where w was 0 throughout and then at the very end when I got a reward,
then I updated my w because I realized that
my prediction and target were not the same thing, okay?
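The first episode can be replayed in a few lines to check the arithmetic. One assumption is flagged in the code: the transcript does not state the features of the starting state, so a placeholder is used there, and it does not affect the result because prediction minus target is 0 on that step.

```python
eta, gamma = 0.5, 1.0
w = [0.0, 0.0]

# Each step is (phi(s), reward, phi(s')).
episode = [
    ([0, 1], 0, [1, 0]),  # phi of the first state is an assumed placeholder
    ([1, 0], 0, [1, 2]),
    ([1, 2], 1, [1, 0]),  # end state reached, reward of 1
]

history = []
for phi_s, reward, phi_next in episode:
    p = sum(wi * fi for wi, fi in zip(w, phi_s))                    # prediction
    t = reward + gamma * sum(wi * fi for wi, fi in zip(w, phi_next))  # target
    w = [wi - eta * (p - t) * fi for wi, fi in zip(w, phi_s)]
    history.append((p, t, list(w)))
```

The weights stay at zero for the first two steps and only move on the last step, when the reward finally makes the target differ from the prediction.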
So now I'm gonna, I'm gonna start
a new episode and the new episode I'm starting is going to start with this particular w,
and in the new episode, even though the rewards are going to be 0 throughout,
we are actually going to update our w's. Yes, question?
Uh, two questions.
If you initialize the weights to not be
zeros, would you update throughout instead of just at the end?
Yeah.
Okay, and second,
S4 and S9 have the same feature vector, but you said S4 is S9 [OVERLAPPING].
Uh, this is a made up example,
[LAUGHTER] so don't think about this example too much though.
Well, is it that possible to have, an end state and not end state have the same feature vector, or no? If you have the same feature vector in the same state-
It is possible, yeah,
for multiple states to have the same features, right.
You could have that, like I said up here.
It depends on what sort of features you use; you could
use really unrepresentative features.
If you really want to differentiate
between S4 and S9,
you should pick features that differentiate between them.
But if they are kind of the same and have the same sort of characteristics,
it's fine to have features that give the same value.
Like, like we have different [inaudible].
As one, uh, entry that's always isn't [inaudible] like instead of 1, 2,
we have 1, 0 leading to the,
the final weight then the weight corresponding to that.
Is going to- [OVERLAPPING] Yeah. It will never converge.
And that kind of tells you that that entry in your feature vector,
you don't care about that, or it's always,
like, it, it's always staying the same.
If it is always 0, it doesn't matter like what the weight of that entry is.
So in general, you wanna have features that are differentiating and,
and you're using it in some way.
So for the second row,
I'm not gonna write it up cause that takes time.
[LAUGHTER] So, uh, so okay,
so let's start wi- with a new episode.
We started S1 again but now I'm starting with this new W that I have.
So I can compute the prediction,
the prediction is 1.
I can compute my target it's 0.5.
And what we realize here is we overshoot it.
So before, our prediction was 0,
target was 1, we are undershooting.
We fix our Ws, but now we're overshooting.
So we need to fix that. Yes.
Uh, a little verification on the relationship between the features and the weights.
Uh, they always have to be the same dimension,
and what should we be thinking about that
would make a good feature for updating the weights specifically, like-
So, uh, okay so first off, yes,
they always need to be of the same dimension because you are doing this,
um, dot-product between them.
Um, the feature selection, um,
you don't necessarily think of it as,
like how am I updating the weights,
you think of the feature selection as is it representative of how good my board is.
Is it, for example in the case of Backgammon,
or is it representative of, uh,
how good I am navigating, uh, so,
so it should be a representation of how good your state is,
and then it's- yeah, it's usually like hand designed, right.
So, so i- i- it,
it's not necessarily- you shouldn't think of it as how is it helping my weights,
you should think of it as how is it representing how good my state is.
How is that also, like, thinking of the blackjack example,
if you have a threshold of 21 and then you have a threshold of 10, uh,
if you're using the same feature extraction for both,
how does that affect the generalizability of the model, the agent?
Yeah, so, so you might choose two,
two different features and one of them might be more like so,
so there is kind of a trade-off, right?
You might get a feature that actually differentiates between different states very well,
but then that, that makes learning longer,
that makes it not as generalizable,
and then on the other hand,
you might get a feature that's pretty generalizable but,
but then it might not do these specific things
that you would wanna do or these differentiating factors about it.
So, so picking features,
it's, it's an art, right, so.
[LAUGHTER] All right. So lemme,
lemme move forward cause we have a bunch of things coming up.
Okay, so I'll go over this real quick then.
So we have the W's, right.
So, so we now update the W based on this new value, um,
and kind of similar thing, you have a prediction,
you have a target, you're still overshooting,
so, so you still need to update it.
And then once you update it to 0.25 and 0.75 then it kind of stays there, and you are happy.
Okay. All right so,
so this was just an example of TD learning but this is
the update that you have kind of already seen, right?
And then a lot of you have pointed out that this is,
this is similar to Q-learning already, right?
This is actually a pretty similar update,
it's very similar, like we have these gradients,
the same way that we have in Q-learning.
And we are looking at the difference between prediction and target,
the same way that we do in Q-learning,
but there are some minor differences.
So, so the first difference here is that Q-learning operates on the Q function.
A Q function is a function over state and actions.
Here, we are operating on a value function, right?
On V. And V is only a function of state, right?
And, and part of that is,
is actually because in the setting of- in setting of a game,
you already know the rules of the game.
So we kind of already know the actions.
You don't need to worry about them as much,
the way you would worry about them in Q-learning.
The second difference is,
Q-learning is an off-policy algorithm.
So, so the value is based on this estimate
of the optimal policy which is this Q opt, right?
It's based on this optimal policy.
But TD learning is an on-policy algorithm:
the value is based on this exploration policy, which is based on a fixed Pi,
and sure, you're updating the Pi,
but you're going with whatever Pi you have
and kind of running with that and keep updating it.
Okay, so that's another difference.
And then, finally like in Q-learning,
you don't need to know the MDP transitions.
So you don't need to know this transition function from s, a to s-prime.
But in the case of TD learning,
um, you need to know the rules of the game.
So you need to know how the successor function of s and a works.
Okay. So, so those are some kind of minor differences,
but from like a perspective of,
like how the update works,
it is pretty similar to what Q-learning is, okay?
All right. So, so that was kind of this idea of,
I have this evaluation function,
I wanna learn it from data,
I'm going to generate data, and from that generated data I'm going to update my w's.
So, so that's what we've been talking about so far.
And the idea of learning- using learning to play games is,
is not a new idea actually.
So in the '50s,
Samuel looked at a checkers-playing program,
where he was using ideas from
self-play and similar types of things to what we have talked about,
using really smart features and
linear evaluation functions to try to solve the game of checkers.
So a bunch of other things that he did included adding intermediate rewards.
So kind of throughout,
on the way to the endpoint,
he added some intermediate rewards,
used alpha-beta pruning and some search heuristics.
And it was kind of impressive,
what he did in the '50s:
he ended up having this program that was reaching
a human amateur level of play, and he only used
9K of memory, which is really impressive [LAUGHTER] if you think about it.
So, so this idea of learning in games is old.
People have been using it.
In the case of Backgammon, um,
this was around the '90s, when Tesauro came up with
an algorithm to solve the game of Backgammon.
So he specifically used, uh,
this TD lambda algorithm,
which is similar to the TD learning that we have talked about.
It has this lambda parameter
that kind of tells us how much weight states get
as they get farther from the reward.
Uh, he didn't have any, any intermediate rewards,
he used really dumb features,
but then he used neural networks which was, uh, kind of cool.
And he was able to reach human expert play, um,
and this kind of gave us
some insight into how to play games and how to solve
these really difficult problems.
And then more recently we have been looking at the game of Go.
So in 2016, we had AlphaGo, uh,
which was using a lot of expert knowledge in addition to, um,
ideas from a Monte Carlo tree search and then, in 2017,
we had AlphaGo Zero,
which wasn't using even expert knowledge,
it was all, like, based on self-play.
Uh, it was using dumb features,
neural networks, um, and then,
basically the main idea was using Monte Carlo tree search
to try to solve this really challenging difficult problem.
So, um, I think in section we're gonna talk a little bit about AlphaGo Zero too.
So if you're attending section, I think that will be part of that story.
All right so the summary so far is,
we have been talking about parameterizing
these evaluation functions using, using features.
Um, and the idea of TD Learning is,
is to look at this error between our prediction and
our target and try to minimize that error and,
and find better W's as we go through.
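As a rough illustration of the summary above, here is a minimal sketch of one TD-learning step. It assumes a linear evaluation function, V(s; w) = w · phi(s), and the function name `td_update`, the step size `eta`, and the discount `gamma` are hypothetical names chosen for this example, not something fixed by the lecture.

```python
def td_update(w, phi_s, phi_s_next, reward, eta=0.1, gamma=1.0):
    """One TD step for a linear evaluation function V(s; w) = w . phi(s).

    w, phi_s, phi_s_next are equal-length lists of floats (assumed inputs).
    """
    # Prediction: what the current weights say about state s.
    prediction = sum(wi * fi for wi, fi in zip(w, phi_s))
    # Target: observed reward plus the (discounted) estimate at the successor.
    target = reward + gamma * sum(wi * fi for wi, fi in zip(w, phi_s_next))
    # Move the weights to shrink the gap between prediction and target.
    error = prediction - target
    return [wi - eta * error * fi for wi, fi in zip(w, phi_s)]


new_w = td_update([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], reward=1.0, eta=0.5)
print(new_w)
```

The gradient of a linear V with respect to w is just phi(s), which is why the features multiply the error in the last line.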
So, um, all right, so that was learning in games.
Uh, so now I wanna spend a little bit of time talking about,
uh, other variations of games.
So, the setting where we take our games from turn-based to simultaneous games.
And then, the setting where we go from zero-sum to non-zero-sum, okay? All right.
Okay simultaneous games.
So, um, all right so,
so far we have talked about turn-based games like
chess where you play and then next player plays,
and you play, and next player plays.
And the minimax strategy seemed to be
pretty okay when it comes to solving these turn-based games.
But not all games are turn-based, right?
Like an example of it is rock-paper-scissors.
You're all playing at the same time,
everyone is playing simultaneously.
The question is, how do we go about solving simultaneous games, okay?
So let's start with, um,
a game that is a simplified version of rock-paper-scissors.
This is called a two-finger Morra game.
So the way it works is,
we have two players, player A, and player B.
And each player is going to show either one finger
or two fingers,
and you're playing at the same time.
And the way it works is,
if both of the players show 1 at the same time,
then player B gives two dollars to player A.
If both of you show 2 at the same time,
player B gives player A four dollars.
And then, if you show different numbers, like 1 and 2,
or 2 and 1,
then player A has to give three dollars to player B.
Okay? Does that make sense?
So can you guys talk to your neighbors and play this game real quick? [BACKGROUND]
All right, so what was the outcome?
[LAUGHTER] How many of you are in the case where A chose 1
and B chose 1?
Oh, yeah, one. Okay, one pair here.
Uh, A chose 1, B chose 2?
One pair there. Is it like four people played?
So A chose 2, B chose 1.
We have, okay two pairs.
And then 2 and 2?
Okay. All right.
So, so you can kind of see like a whole mix of strategies here happening.
And this is a game that we are gonna play and talk
about a bit, and think about what would be
a good strategy to use when you are solving this simultaneous game.
Okay. All right so, um.
All right so let's formalize this.
We have player A and player B.
We have these possible actions of showing 1 or 2.
And then, we're gonna use this
payoff matrix, which
represents A's utility
if A chooses action a and B chooses action b.
So before, we had this
value function, right?
Before, we had this value function, uh,
over our states.
Now, we have this value function V over the pair of actions a and b.
That is again from the perspective of agent A.
So remember like before,
when we were thinking about value function,
we are looking at it from the perspective of the first player,
the maximizer player, the agent.
Now, I'm looking at all of these games from the perspective of player A.
So I'm trying to get good things for A. Yes.
In this case it's not at the end [inaudible] ?
Uh, yeah. And then this is like a one-step game too, right?
So, so like you're just playing and then you see what you get.
So, so we're not talking about repeated games here.
So, so you're playing,
you see what happens, okay?
So we have this V,
which is V of a and b.
And this basically represents A's utility if agent A plays a and agent B plays b.
Okay? And
you can represent this with a matrix, and that's why it's called a pay-off matrix.
I'm going to write that pay-off matrix here.
So pay-off matrix.
I'm gonna write A here, B here.
agent A can show 1 or can show 2.
agent B can show 1 or can show 2, right?
If both of us show 1 at the same time,
agent A gets $2.
If both of us show 2 at the same time,
agent A gets $4.
Otherwise agent A has to pay, so agent A gets minus $3.
And again, the reason I only talk about one value is
we are still in the setting of zero-sum games.
So whatever agent A gets,
agent B gets the negative of that, right?
So if agent A gets $4,
agent B gets minus $4.
So I am just writing one V, from the perspective of agent A.
And this is called the pay-off matrix, okay?
All right. So, uh,
so now we need to talk about what a solution means in this setting.
So what is a policy in this setting?
And the way we refer to policies in this case is as strategies.
So we have a pure strategy, which is almost the same thing as,
uh, a deterministic policy.
So a pure strategy is just a single action that you decide to take.
So you have things like pure strategies.
The difference between pure strategy and,
and deterministic policies, if you remember,
a deterministic policy again is a function of state, right?
So, so it's a policy as a function of state. It gives you an action.
Here we have like a one move game, right?
So it's just that one action and we call it pure strategy.
[NOISE] We also have this other thing that's called
a mixed strategy, which is equivalent to a stochastic policy.
And what a mixed strategy is
is a probability distribution that tells you the probability of choosing each action a.
So pure strategies are just actions a.
And then you can have things that are called
mixed strategies, and they are probabilities of
choosing each action a, okay?
All right. So here is an example.
So if, if you say, well, I'm gonna show you 1,
I'm gonna always show you 1.
Then
you can write that strategy as a pure strategy
that says I'm always gonna, with probability 1, show
you 1 and with probability 0 show you 2.
So let's say the first column is for showing 1,
the second column is for showing 2.
So this is a pure strategy that says I'm always going to show you 1.
If I tell you, well, I'm always gonna show you 2,
then I can write that strategy like this, right?
With probability 1, I'm always showing you 2.
I could also come up with a mixed strategy.
A mixed strategy would be: I'm going to flip a coin,
and, uh,
if I get heads,
I'm gonna show you one,
if I get tails, I'm gonna show you two.
And then you can write that as one-half, one-half, and this is going to be a mixed strategy.
So even though you're in the simultaneous game,
you could just bring chance in and be like half the time
I'm gonna show you one, half the time I'm gonna show you two, based on chance, okay?
Everyone happy with mixed strategies and pure-strategies?
All right. So, so how do we evaluate the value of the game.
So, so remember in, uh,
previous lecture and like in the MDP lecture even,
we were talking about evaluating.
If someone gives me the policy,
how do I evaluate how good that is?
So the way we are evaluating that is again by this value function V.
And we are gonna write this value function as a function of Pi A and Pi B.
Maybe I'll just write that up here.
Or I'm gonna erase this 'cause this is repetitive.
So I'm gonna say the value of agent A following Pi A and agent B following Pi B,
what is that equal to?
Well, that is going to be, uh,
the probability that Pi A chooses action a,
times the probability that Pi B chooses action b, times the value V of a and b,
summing over all possible a's and b's.
Okay. So, so let's look at an actual example for this.
So, so for this particular case of Two-finger Morra game,
let's say someone comes in and says I'm gonna tell you what Pi A is.
Policy of agent A is just to always show one.
And policy of agent B is this,
this mixed strategy which is half the time show one,
half the time show, show two.
And then the question is,
what is the value of,
of these two policies? How do we compute that?
[NOISE] Well, I'm gonna use my payoff matrix, right?
So 1 times 1 over 2 times the value that we get at 1,
1, which is equal to 2.
So it's 1 times
1 over 2 times 2, plus 0 times 1 over
2 times 4, plus 1 times 1 over 2
times minus 3, since the value that I get at 1, 2 is minus 3, plus, ah,
0 times 1 over 2 times minus 3.
Okay? And, well, what is that equal to?
What is that equal to? There are two 0s here, so it's 1 minus 3 over 2,
and that's minus 1 over 2.
Okay? So I just computed that
the value of these two policies is going to be minus 1 over 2.
And again this is from the perspective of, of,
um, agent A and it kinda makes sense, right?
If agent A tells you I'm gonna always show you 1,
and
agent B is following this mixed strategy,
agent A is probably losing, and agent A is losing
1 over 2 on average based on this strategy, okay? Okay.
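The evaluation just done by hand can be sketched in a few lines. This is only an illustration: the dictionary `V` encodes the two-finger Morra payoffs from A's perspective, and the names `game_value`, `always_one`, and `coin_flip` are made up for this example.

```python
from fractions import Fraction

# Payoff to agent A, indexed by (A's action, B's action).
V = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}


def game_value(pi_a, pi_b, payoff=V):
    """Sum over all (a, b) of pi_a(a) * pi_b(b) * V(a, b)."""
    return sum(pi_a[a] * pi_b[b] * payoff[(a, b)] for a in pi_a for b in pi_b)


always_one = {1: Fraction(1), 2: Fraction(0)}       # pure strategy: always show 1
coin_flip = {1: Fraction(1, 2), 2: Fraction(1, 2)}  # mixed strategy: half and half

print(game_value(always_one, coin_flip))  # -1/2
```

Exact fractions are used so the result matches the board arithmetic without floating-point rounding.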
So I guess here we only have this one state, so we only take one action. In this environment, we have one state, take one action, and that would be the end state.
If we had more than one state,
would we have that for every single one?
So that opens up a whole set of new questions that we're not discussing in this class.
So that introduces repeated games.
Ah, so you might be interested in looking at what happens in repeated games.
In this class right now we're just talking about this, one step one play.
We're playing a zero-sum game, um,
like, let's say, rock-paper-scissors, and you just play once.
Well, you might say, what happens if you play like ten times? Then you're building
some relationship and weird things can happen, and so
that introduces a whole new class of games that we're not talking about here.
All right. So, so the value is equal to minus 1 over 2.
Okay? All right.
So, so that was a game value.
So, so we just evaluated it, right?
If someone tells me pi A and pi B, I can evaluate it.
I can know how good pi A and pi B are
from the perspective of agent A.
Okay? So what do we wanna do like when we solve- when we want to try to solve games?
What we wanna do, from agent A's perspective,
is maximize this value.
I want to get as much money as possible, and this value is from my perspective as agent A.
So I should be trying to maximize this, and agent B should be trying to minimize this.
Right? Like, like think minimax.
So agent B should be min- minimizing this.
agent A should be maximizing this.
That's, that's what we wanna do.
But the challenge here is we are playing simultaneously,
so we can't really use the minimax tree.
Like if you remember the minimax tree,
in that setting we have sequential plays, and you could wait
for agent A to play and then play after
that, and that would give us a lot of information.
Here we're playing simultaneously.
So what should we do?
Okay so what should we do? So I'm going to assume we can play sequentially.
So that's what I wanna do for now.
So, so I'm going to limit myself to pure strategies.
So maybe I'll, um, I'll come over here.
So right now I'm going to focus only on pure strategies.
I will just consider a setting- very limited setting and see what happens.
And I'm going to assume oh, what if,
what if we were to play sequentially, what would happen?
How bad would it be if we were to play sequentially?
So um, we have the setting where player A plays, goes first.
What do you think? Would you think like if Player A goes first,
Is that better for player A or is that worse for player A?
Worse.
Worse for player A. Okay. So, so that's probably what's gonna happen.
Let's try that. [LAUGHTER] Okay. So player A is trying to maximize
this V, right? Player B is trying to minimize, right?
And then each of them has the actions of either showing 1 or showing 2.
This is player A, and this is player B.
They can each show 1 or 2, right?
If we show 1, 1,
player A gets what?
$2? Is that right?
It's 2, right? I can't see the board.
Um, if you have 2, 2,
player A gets $4,
and otherwise player A gets minus $3.
Right? So okay.
So, so now if,
if we have this sequential setting,
if you're playing minimax,
then player B is going second.
Player B is going to take the minimizer here.
So if A shows 1, player B is gonna show 2,
and if A shows 2, player B is going to show 1.
What should player A do?
Well in both cases player A is getting minus $3.
It doesn't actually matter, player A could do any of them and
player A at the end of the day is going to get minus $3.
Right? And this is a case where player A goes first.
What if player A goes second, second?
Okay? So, so then player B is going first,
player B is minimizing and then player A is maximizing
[NOISE] and we have the same values here.
Okay? So this is
player A going second,
and player A going second tries to maximize.
So A would like to pick these ones: 2 if B shows 1, and 4 if B shows 2.
Player B is here.
Player B wants to minimize.
So player B is going to be like, okay,
if you're going second I'd rather
show you 1,
because by showing you 1 I'm losing less.
If I show you 2,
I'm losing even more. All right.
So in that setting,
player B shows 1,
and player A is going to get $2.
Okay? All right.
So that was kind of intuitive
if we have pure strategies,
it looks like if you're going second that should be better.
Okay. So, ah, so going second is no worse.
It's the same or better.
And that can basically be represented by this minimax relationship, right?
So agent A is trying to maximize.
So in the second case,
[NOISE]
in the second case, um,
player B is going first and minimizing over b,
and player A is going second and maximizing over a: min over b of max over a of V of a and b.
So this is going to be greater than or equal to the case where player A
is going, uh, first,
which is max over a of min over b
of V of a and b.
So I'm gonna just write these things that you're
learning throughout on the side of the board, maybe up here.
So what did we just learn?
We learned, if we have pure strategies,
all right, going second is no worse.
That sounds intuitive and right.
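The two pure-strategy trees can be checked with a short sketch. It assumes the same Morra payoff table as above; the names `a_goes_first` and `a_goes_second` are just labels for this example.

```python
# Payoff to player A, indexed by (A's action, B's action).
V = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}
ACTIONS = (1, 2)

# A moves first: for each a, B replies with the b that is worst for A,
# and A picks the a whose worst case is best (max over a of min over b).
a_goes_first = max(min(V[(a, b)] for b in ACTIONS) for a in ACTIONS)

# A moves second: for each b, A replies with the best a,
# and B picks the b that leaves A the least (min over b of max over a).
a_goes_second = min(max(V[(a, b)] for a in ACTIONS) for b in ACTIONS)

print(a_goes_first, a_goes_second)  # -3 2
```

With pure strategies the gap is real here: A gets minus 3 going first and 2 going second.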
[NOISE].
Okay. So far so good.
Okay? So the question that I
wanna try to think about right now is, what if we have mixed strategies?
What's going to happen if we have mixed strategies?
Are we gonna get the same thing?
Like, if you have mixed strategies is going second better,
or is it worse, or is it the same?
So, so that's the question we're trying to answer.
Okay? So, so let's say Player A comes in,
and Player A says, "Well,
I'm gonna reveal my strategy to you.
What I'm gonna do is I'm going to flip a coin, and depending on how it comes up,
I'm either going to show you 1, or I'm gonna show you 2.
I'm gonna
tell you that's what I'm gonna do."
Okay. So, so what would be the value of the game under that setting?
So the value of the game, uh,
would be, maybe I'll write it here.
So the value of Pi A and
Pi B. Pi A is already this mixed strategy of one-half, one-half, right?
All right. So what is that going to be equal to?
It's going to be Pi B
choosing 1, times one-half,
since with probability one-half agent A is also picking 1.
If it is 1, 1, we're gonna get 2, right,
plus Pi B choosing 1,
times Pi A with one-half choosing 2,
and then we're gonna get minus $3,
plus Pi B choosing 2,
times one-half for Pi A choosing 2.
We're gonna get $4,
plus Pi B choosing 2, times one-half for Pi A choosing 1,
and that's minus $3.
So I just, like, iterated over all the four options that we can get here, uh,
under the policy of Pi B choosing 1 or 2,
and then Pi A is always just one-half, right,
because they are following this mixed strategy.
So well, what is this equal to?
Uh, that's equal to minus 1 over 2 times Pi B of 1,
plus 1 over 2 times Pi B of 2.
Okay. So that's the value.
Okay? So, so again, the setting is someone came in,
Agent A came in,
Agent A told me, "I'm following this mixed strategy.
This is gonna be the, the thing I'm gonna do."
What should I do as an Agent B?
What should I do as an Agent B?
You always want to pick 1.
So- okay, so that was too quick.
So you always [LAUGHTER] have to do, do 1.
But why, why is that?
Well, well, if Agent A comes and tells me, "Well,
this is a thing I wanna do," I should try to minimize value of Agent A, right?
So, so what I'm really trying to do as Agent B is to minimize this,
right, because I don't want Agent A to get anything.
So if I'm minimizing this, in some sense,
I'm trying to come up with a policy that minimizes this.
Pi B is a probability,
so it's a nonnegative number.
I have a positive part and a negative part here.
The way to minimize this is to put as much weight as
possible on the negative side and as little as possible on the positive side.
So that tells me to never show 2 and always show 1. Does everyone see that?
So the best thing that I can do as agent B is
to follow a pure strategy that always shows 1 and never shows 2.
Okay. So this was kind of interesting, right?
Like if someone comes in and tells me, "This is the thing.
This is a mixed strategy I'm gonna follow," I'll have a solution in response to that,
and that solution is always going to be a pure strategy actually.
So, so that's kind of cool.
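Here is a small sketch of that best-response argument, assuming the Morra payoffs and A's revealed half-and-half strategy. The helper name `best_response_for_b` is hypothetical. It computes A's expected payoff against each of B's pure actions and lets B pick the smallest one; any mixture of B's actions is a weighted average of these numbers, so it cannot go below the smallest.

```python
from fractions import Fraction

# Payoff to agent A, indexed by (A's action, B's action).
V = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}
ACTIONS = (1, 2)


def best_response_for_b(pi_a, payoff=V):
    """Return B's minimizing pure action and A's expected payoff per B action."""
    expected = {
        b: sum(pi_a[a] * payoff[(a, b)] for a in ACTIONS) for b in ACTIONS
    }
    return min(expected, key=expected.get), expected


pi_a = {1: Fraction(1, 2), 2: Fraction(1, 2)}  # A's revealed mixed strategy
action, expected = best_response_for_b(pi_a)
print(action, expected[1], expected[2])  # 1 -1/2 1/2
```

The two printed values are exactly the coefficients of Pi B of 1 and Pi B of 2 from the board.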
All right.
So, so this is actually what's happening in a more general case.
I'm gonna make a lot of generalizations in this lecture.
So I'll show you one example and then generalize it,
but if you're interested in the details of it,
like, we can talk about it offline.
So yeah, so the setting is, for any fixed mixed strategy Pi A,
so Pi A told me what their mixed strategy is.
It's a fixed mixed strategy.
What I should do as Agent B is I should minimize that value.
I should pick Pi B in a way that minimizes that value,
and that minimum can be attained by a pure strategy.
So the second thing that I've learned here,
is if Player A plays, uh, uh,
plays a mixed strategy, mixed strategy,
Player B has an optimal pure strategy.
And that's kind of interesting.
[NOISE] Right.
Okay. So in this case,
we also haven't decided what the policies should be yet, right?
Like,
we've still been talking about the setting where
Agent A comes in and tells us what their policy is,
and we know how to respond to it.
It's going to be a pure strategy.
Okay? So now we want to figure out what this policy is.
Like, what should this mixed strategy actually be?
So, so I wanna think of it more generally.
So, so I wanna go back to those two diagrams and actually
modify those two diagrams in a way where we talk about it a little bit more generally.
Maybe- yeah, I'll just modify these.
Okay. So, um, so let's say that- okay,
and, and I'm gonna think about both of the settings.
So let's say it again. Player A is deciding to go first.
Player A is going to follow a mix- a mixed strategy.
So this is all we know,
but we don't know what mixed strategy.
Play- Player A is going to decide to do- to follow mixed strategy.
This is Player A. Player A is maximizing.
Player A is following a mixed strategy.
The way I'm writing that mixed strategy is more
generally saying Player A is gonna show 1 with
probability p and is going to show 2 with
probability 1 minus p. Or generally like some, some p-value.
Okay? And then after that it's Player B's turn.
We have just seen that Player B,
the best thing Player B can do is,
is to do a pure strategy.
So Player B is either 100% is going to pick 1 or 100% is going to pick 2.
Yes?
Player B could really like [inaudible] terms
with the same then like Player B following a mixed strategy.
That would be the best strategy.
You know it's just the same as any pure strategy, does that make sense?
For those terms behind on the blue on the board here right there.
Yeah. Those terms with the same blue terms,
then like Player B can follow any kind of strategy, right?
So the thing is that strategies are probabilities, right?
So they are values from 0 to 1,
and then you kinda always end up with this negative term that you're trying to make as
negative as possible and this positive term
that you are trying to make as small as possible.
And that's kind of intuitively why you end up with a pure strategy.
And by pure strategy, what I mean is you always
end up putting as much as possible, like 1,
like all your probability, on the negative term
and nothing on the positive term, because you are trying to minimize this.
So that's kinda like intuitively why you're getting this pure strategy.
One-half and one-half?
So you wouldn't get that. So that's what I mean.
So like, you wouldn't ever get one-half and one-half.
If you get one-half and one-half,
that's a mixed strategy.
That's not a pure strategy.
And I'm saying you wouldn't get a mixed strategy
because you would always end up in this setting where, to minimize this,
you end up pushing all your probability to this negative term, okay.
All right. So, so, all right,
so let me go back to this.
So- all right.
So we have the setting where Player A goes first.
Player A is following a mixed strategy with p and
1 minus p. Player B is going to follow a pure strategy,
either 1 or 2.
I don't know which one, right?
So, uh, what's gonna happen is, say B shows 1. If you have 1, 1,
then that is going to give me 2, value 2, right?
So it's 2 times p. I'm trying to write the value here.
Am I writing it right? Is it 2 times p plus?
Yeah. 1 minus p times minus 3.
Right. So with probability 1 minus p,
A is gonna pick 2.
If B picks 1, you're gonna get minus 3.
Okay? And then for this side, where B shows 2,
with probability 1 minus p,
A is going to show 2.
If I'm gonna show 2,
then the value is 4.
So it's 4 times 1 minus p. And with probability p,
A is gonna show 1.
I'm gonna show 2.
So that is minus 3p.
Okay. All right.
So what are these equal to?
So this is equal to 5p minus 3.
That is equal to minus 7p plus 4.
Okay? So, so I'm talking about this more general case.
In this more general case,
Player A comes in.
Player A is playing first, uh,
and is following a mixed strategy but doesn't know what p they should choose.
They're choosing a p and 1 minus p here.
And then Player B has to follow,
uh, a pure strategy.
That's what we decided.
And then under that case,
we either get 5p minus 3 or minus 7p plus 4, okay?
What should Player B do here?
This is Player B at this min node.
What should Player B do?
Should Player B pick 1 or 2?
Player B should pick the one that gives the minimum of these two. All right?
So Player B is going to take the minimum of 5p minus 3 and minus 7p plus 4, okay?
What should Player A do?
What should player A do? I'm thinking minimax, right?
So- so when you think about the minimax,
Player A is maxima- maximizing the value.
So Player A is going to maximize the value that comes up here.
So Player A is going to maximize that, and also,
I'm saying Player A needs to decide what p they're picking.
So they're going to pick a p that maximizes that. Is this clear?
[inaudible]
Like these computations? Yeah, so these are the four different,
uh, things in my, uh, payoff matrix.
So what I'm saying is,
with probability p, A is going to show me 1, right?
And I'm going to go down this route where B is also choosing 1.
So if both of us are showing 1,
then I'm going to get 2, right?
So I'm going to get $2.
So that's where the $2 comes from,
times probability p. With probability 1 minus p,
A is going to show me 2.
I'm going to show 1, that's minus $3,
times probability 1 minus p.
So for this particular branch,
I know the payoff is going to be 5p minus 3.
That makes sense? And then for this side again,
like with probability 1 minus p,
A is going to show me 2.
If it is both of them 2,
I'm gonna get $4.
That's why it's 4 times probability of 1 minus p. With probability p,
A is going to show me 1.
So that's why A loses $3,
that's minus 3 times probability p. So that's minus 7p plus 4.
Okay. So and then,
and then, the second player,
what they're gonna do is,
they're going to minimize between these two values and they're going to pick 1 or 2.
They're gonna- they're deciding, "Should I pick 1 or should I pick 2?"
And the way they're deciding that is by trying to pick,
pick 1 or 2 based on which one minimizes these two values.
But I'm writing it, uh,
like using this variable p that's not decided yet.
And this variable P is the thing that Player A needs to decide.
So what, what p should Player A decide?
Uh, Player A should decide the p that maximizes this.
So I'm writing like, literally a minimax relationship here.
Okay? All right, so the interesting thing here
is this 5p minus 3
is some line, right?
With positive slope.
This is 5p minus 3, let's say.
And this minus 7p plus 4 is another line.
Minus 7p plus 4.
It's another line with negative slope.
What is the minimum of these?
It's whichever line is lower at each p.
So where is the maximum of that minimum going to happen?
Where they meet each other, right?
This is going to be the highest point of the minimum of the two.
Okay? So the p that I'm going to pick
is going to be actually the p
where these two are equal to each
other, and that turns out to be at,
I don't know what it is, 7 over 12 or something.
Actually I don't remember, what is this value?
Yeah, so it's going to happen at 7 over 12.
And the value of it is minus 1 over 12.
Right? So okay, so let's recap. Okay, what did I do?
So I'm talking about the simultaneous game,
but I'm relaxing it and making it sequential.
I'm saying A is going to play first,
B is playing second.
The thing that's going to happen is A is playing first,
A is deciding to choose a mixed strategy.
So A is deciding to say maybe one half, one half,
but maybe he doesn't wanna say one half,
one half, he wants to come up with some other probabilities.
So the thing A is deciding is,
"Should I pick 1 with probability p and should I
pick 2 with probability 1 minus p and what should that p be?"
So, so what is the probability I should be picking 1?
So that's what A is trying to decide here.
Okay? So whatever A decides with p and 1 minus p,
ends up in two different results and based on them,
B is trying to minimize that.
When B is trying to minimize that,
B is minimizing between these two linear functions.
These two linear functions meet at one point,
and that is the point where the minimum of the two is as large as possible, so that actually
corresponds to the p-value A picks when A tries to maximize this.
This is I know a little bit- this requires a little bit of thinking,
but any clarification questions?
Any- I see a lot of lost faces, so- [LAUGHTER]
By having, um, [inaudible].
Yeah and then that the- yeah,
the interesting point is exactly right.
Yeah, so A is still losing, by the way.
So even in this case,
where A is trying to come up with the best mixed strategy he could do,
the best mixed strategy for A is to
show 1 with probability 7 over 12 and show 2 with probability 5 over 12.
This comes from here. Even under that scenario, A is losing.
A's value is minus 1 over 12.
Okay? All right.
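The crossing point of the two lines can be worked out exactly. This sketch assumes the Morra payoffs and that the two lines actually cross for some p between 0 and 1, which holds here but not in every game; `a_payoff_if_b_shows` and `p_star` are names invented for this example.

```python
from fractions import Fraction

# Payoff to player A, indexed by (A's action, B's action).
V = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}


def a_payoff_if_b_shows(b, p):
    """A's expected payoff when A shows 1 with probability p and B shows b."""
    return p * V[(1, b)] + (1 - p) * V[(2, b)]


# B shows 1: 2p - 3(1 - p) = 5p - 3.   B shows 2: -3p + 4(1 - p) = -7p + 4.
# Setting the two lines equal and solving for p:
slope_1 = V[(1, 1)] - V[(2, 1)]  # 5
slope_2 = V[(1, 2)] - V[(2, 2)]  # -7
p_star = Fraction(V[(2, 2)] - V[(2, 1)], slope_1 - slope_2)

print(p_star)                          # 7/12
print(a_payoff_if_b_shows(1, p_star))  # -1/12
print(a_payoff_if_b_shows(2, p_star))  # -1/12
```

At p equal to 7 over 12, both of B's pure replies give A the same value, which is why B has nothing left to exploit.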
Okay. So also, I haven't solved a simultaneous game yet, right?
Like I have talked about the setting where A plays first.
So what if B plays first?
So I'm going to swap this.
What if B plays first?
So A goes second, B plays first.
I'm gonna modify this one now.
Okay, B goes first, A is going second.
B is going to start, and is going to reveal his strategy.
The strategy that B is going to reveal is also, again,
I'm gonna with probability p show you 1,
and with probability 1 minus p
show you, uh, 2.
Then A plays, and A is trying to maximize.
And A can play a pure strategy because of what we just saw, right?
Like the best thing A can do
is going to be a pure strategy.
So A is always going to be either showing 1 or
2, and A is deciding which one, but doesn't know yet.
And the values here are going to be exactly the same as there.
So they're 5p minus 3
and minus 7p plus 4.
Okay? All right.
So what's happening here?
So, so in this case,
A is playing second.
What A likes to do is A likes to take the maximum of 5p minus 3 and minus 7p plus 4.
That's what A likes to do.
B is going, uh,
first,
so then B has to pick a p that minimizes that.
Okay? So these two are exactly
the same two lines, but now I'm picking the maximum of them.
The minimum of the maximum of these two lines ends up being at exactly the same point as before,
ends up being at exactly the same p as
before, and gives you exactly the same value as before.
So this is also equal to minus 1 over 12.
So what this is telling me is,
if you are playing a mixed strategy,
even if you reveal your best mixed strategy at the beginning, it doesn't matter.
It actually doesn't matter if you're going first or second.
So like in the Morra game when you're playing,
if you were playing a mixed strategy and you
would tell your opponent, "This is the thing
I'm gonna do, and this is a mixed strategy," and if it was the optimal thing,
then it doesn't matter if they
know it or not, you still get the same value.
So again, you get 5p minus 3 and minus 7p plus 4.
And then now you're minimizing the maximum of these two lines.
The maximum of these two lines is smallest at the same point where they meet, and you pick the p that
minimizes that, and you get the same value.
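As a sanity check on both orderings, here is a brute-force sketch over a grid of probabilities. The grid search is a stand-in chosen for simplicity, not the method used in the lecture, and it only gives the exact answer here because 7 over 12 happens to lie on the grid. The names `a_mixes_first` and `b_mixes_first` are hypothetical.

```python
from fractions import Fraction

# Payoff to player A, indexed by (A's action, B's action).
V = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}
ACTIONS = (1, 2)
GRID = [Fraction(k, 120) for k in range(121)]  # candidate probabilities 0..1


def a_mixes_first(p):
    """A reveals 'show 1 with probability p'; B replies with the worst pure b."""
    return min(p * V[(1, b)] + (1 - p) * V[(2, b)] for b in ACTIONS)


def b_mixes_first(q):
    """B reveals 'show 1 with probability q'; A replies with the best pure a."""
    return max(q * V[(a, 1)] + (1 - q) * V[(a, 2)] for a in ACTIONS)


max_min = max(a_mixes_first(p) for p in GRID)  # A commits first
min_max = min(b_mixes_first(q) for q in GRID)  # B commits first
print(max_min, min_max)  # -1/12 -1/12
```

Both orderings give minus 1 over 12, unlike the pure-strategy case where the two orderings gave minus 3 and 2.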
So this is called von Neumann's theorem.
So this whole thing that we just
did over just one example,
there is a theorem about it that says,
for every simultaneous two-player zero-sum game
with a finite number of actions,
if the players use mixed strategies, the order of players doesn't matter.
So B is playing second or B is playing first,
the values are going to be the same thing.
If you take the maximum of the minimum or the minimum of the maximum of that value,
it's going to be the same thing.
Okay? So this is kind of the third thing that we just learned,
which is von Neumann's Theorem,
which says, and I'm writing a simpler, shorter version of it:
if playing a mixed strategy,
order of play doesn't matter.
And remember, if you play a mixed strategy,
your opponent is going to play a pure strategy,
because this is the second point that we had before.
All right? If you play a mixed strategy,
your opponent is going to follow a pure strategy.
Either 1 or 2 with probability 1.
[NOISE] But with probability p,
like, if we're doing like ordering,
like one of the two answers might- will come out,
[inaudible] it'll be either one or two and then in that case, the second [inaudible].
So in this case, yeah.
So, uh, the thing is these two end up being equal.
So it doesn't matter,
because the best you can do
is going to be at the point where the two end up being equal.
So the two branches,
if you actually plug in p equal to 7 over 12 here,
these two values end up being equal.
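As a quick check of that claim, here is a minimal Python sketch using exact fractions. The function names are made up for illustration; the two expressions are the 5p minus 3 and minus 7p plus 4 lines from the example.

```python
from fractions import Fraction

# A's expected payoff when A shows 1 finger with probability p,
# against each pure response by B (the two lines from the example).
def payoff_if_b_plays_1(p):
    return 5 * p - 3

def payoff_if_b_plays_2(p):
    return -7 * p + 4

# The lines cross where 5p - 3 = -7p + 4, i.e. 12p = 7.
p_star = Fraction(7, 12)
v1 = payoff_if_b_plays_1(p_star)
v2 = payoff_if_b_plays_2(p_star)
print(p_star, v1, v2)  # 7/12 -1/12 -1/12
```

Both branches give minus 1 over 12, so B's choice of 1 or 2 makes no difference at this p.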
Equal, right? [inaudible]. [OVERLAPPING]
Uh, none [inaudible] actually equal, and the reason that they end up being equal
is you are trying to minimize the thing that this guy is trying to maximize.
So you are trying to pick the p that actually makes these things equal.
So no matter what your opponent does,
you're gonna get the best that you can do.
So yeah, think of it like this.
Okay. So I'm player A,
I still have a choice.
My choice is to pick a p. I want to pick
a p so that I'm not gonna lose as much. What p should I pick?
I should pick a p that makes these choices the same.
Because if I pick a p that makes this one higher than this one,
of course the second player is going to make me lose and go down the route
that's better for the second player.
So the best thing that I can do here is make these two as equal as possible.
So then whatever the second player chooses,
one or two,
it's gonna be the same thing. Does that make sense?
So sounds no in expectations,
like you're multiply by p and 1 minus p as you were saying,
like if the [inaudible]. [OVERLAPPING]
So in expectation when- you're saying when you are choosing p?
Yes, so I'm treating p as a variable that I'm deciding, right?
p is the thing I've got to be deciding.
So I'm player A, I've got to be deciding a p that's not gonna be too bad for me.
Let's say I would pick a p that doesn't make these things equal.
Let's say, I don't know, I would pick a p that makes this guy
10 and this guy 5.
The second player is of course going to make me lose and of course is going
to pick the thing that's going to be the worst thing for me.
So the best thing I can do is I can make both of them, I don't know, 7.
So it's not gonna be as bad.
So that's kind of the idea. All right.
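The same argument can be seen numerically in the Morra example. The sketch below assumes the second player always picks the pure response that is worst for A, and scans a hypothetical grid of p values (twelfths, chosen only so that 7/12 is on the grid).

```python
from fractions import Fraction

def worst_case(p):
    """A's payoff if B sees p and picks the pure response that is worst for A."""
    return min(5 * p - 3, -7 * p + 4)

# Hypothetical grid of candidate probabilities p = 0/12, 1/12, ..., 12/12.
grid = [Fraction(k, 12) for k in range(13)]
best_p = max(grid, key=worst_case)
print(best_p, worst_case(best_p))  # 7/12 -1/12

# Tilting away from 7/12 in either direction lowers A's guarantee.
print(worst_case(Fraction(1, 2)), worst_case(Fraction(2, 3)))  # -1/2 -2/3
```

Any p that makes one branch higher than the other just hands the second player the lower branch, which is why the guarantee peaks where the two are equal.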
So let me move forward because there's still a bunch of things to cover. All right.
So okay.
So the key idea here is that revealing
your optimal mixed strategy does not hurt you, which is kind of a cool idea.
The proof of that is interesting.
If you're interested, look at the notes;
you can use linear programming here.
The intuition behind it is,
if you're playing a mixed strategy,
the next person has to play a pure strategy,
and you have n possible options for that pure strategy.
So that creates n constraints that you are putting into your optimization.
You end up with a single optimization with n constraints,
and you can use linear programming duality to actually solve it.
So you could compute this using linear programming,
and that's kind of the one that's here.
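For the Two-finger Morra example, that optimization might be written as the small linear program below. This is a sketch, not the form in the notes; the auxiliary variable v (the payoff A can guarantee) is an assumed name, and there is one constraint per pure strategy of the second player, so n = 2 here.

```latex
\begin{aligned}
\max_{p,\,v} \quad & v \\
\text{s.t.} \quad & v \le 5p - 3   && \text{(B plays 1)} \\
                  & v \le -7p + 4  && \text{(B plays 2)} \\
                  & 0 \le p \le 1
\end{aligned}
% Optimum: p = 7/12, v = -1/12, where both payoff constraints are tight.
```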
So let's summarize what we have talked about so far.
So we have talked about these simultaneous games,
and we've talked about the setting where we have pure strategies,
and we saw that if you have pure strategies,
going second is better.
Right. Going second is better if the first player is just telling
you what pure strategy they're using, right?
So that was kind of the first point up there.
And then if you're using mixed strategies,
it turns out it doesn't matter if you're going first or second.
You're telling them what your best mixed strategy
is and they're going to respond based on that.
So that's von Neumann's minimax theorem.
Okay? All right.
So for the next 10 minutes,
I want to spend a little bit of time talking about non-zero-sum games.
So far we have talked about zero-sum games,
where it's minimax: I get some reward,
you get the negative of that, or vice versa.
There are also these other things called collaborative
games where we are both just maximizing something.
So we both get money out of it,
and that's kind of like a single optimization.
It's a single maximization and you can think of it as plain search. In real life,
you're kind of somewhere in between those two,
and I want to motivate that by an example.
So, uh, I want to do that by this idea of the Prisoner's dilemma.
How many of you have heard of Prisoner's dilemma?
Okay. Good. Okay. So the idea of the Prisoner's dilemma is you have
a prosecutor who asks A and B
individually if they will testify against each other or not, okay?
If both of them testify,
then both of them are sentenced to five years in jail.
If both of them refuse,
then both of them are sentenced to one year in jail.
If only one testifies, then he or she gets out for free
and the other one gets a 10-year sentence.
Play with your partner real quick.
[NOISE]
All right. [LAUGHTER] Okay.
Okay, so let's look at the payoff matrix.
So I think you kind of have an idea of how the game works. Is that A or B?
So, uh, you have two players, A and B.
Each one of you has an option.
You can either testify or you can refuse to testify.
So B can testify and A can refuse to testify,
and I am going to create this payoff matrix.
This payoff matrix is going to have two entries now in each one of these cells.
And why is that? Because we have a non-zero-sum game.
Before, our payoff matrix only had one entry,
because this was for player A, and
player B would just get the negative of that.
But now player A and B are getting different values.
So if both of us testify,
then both of us get five years of jail, right?
So A gets five years of jail,
B gets five years.
Right? If both of us refuse,
A gets one year of jail,
B gets one year of jail.
One year, one year of jail.
And then if it is a setting where one of us testifies and
the other one refuses, one of us gets 0 and
the other one gets 10 years of jail.
So if I, as A, refuse to testify and B testifies,
then I get 10 years of jail and B gets 0.
And then in the other case,
A gets 0 and B gets 10.
Okay? So now,
for every player we are gonna have a payoff matrix.
So now we have this
V, this value function, which is a function of a player.
For policy A and policy B,
it will be the utility for one particular player,
because you might be looking at it from the perspective of different players.
Okay? So von Neumann's minimax theorem
doesn't really apply here because we don't have a zero-sum game.
But you do actually get something a little bit weaker,
and that's the idea of a Nash equilibrium.
So a Nash equilibrium is a set of policies Pi star A and
Pi star B so that no player has an incentive to change their strategy.
So what does that mean?
What that means is, if you look at
the value function from the perspective of player A,
that value function at the Nash equilibrium, at
Pi star A and Pi star B, is going to be greater than or equal to the value
of any other policy Pi A if you fix Pi B at Pi star B.
Okay, and at the same time the same thing is true for the value of B.
So for agent B, the value of B at the Nash equilibrium is gonna be greater
than or equal to the value of B at any other Pi B
if A's policy is fixed at Pi star A.
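In symbols, the two conditions just described might be written as follows. This is a sketch; the subscripted V_A and V_B for each player's value function are assumed notation.

```latex
\begin{aligned}
V_A(\pi_A^*, \pi_B^*) &\ge V_A(\pi_A, \pi_B^*) && \text{for all } \pi_A, \\
V_B(\pi_A^*, \pi_B^*) &\ge V_B(\pi_A^*, \pi_B) && \text{for all } \pi_B.
\end{aligned}
```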
Okay? So what does that mean in this setting?
Do we have a Nash equilibrium here?
So let's say I start from here.
I start from A equal to minus 10,
B equal to 0.
Can I make this better?
Or did I flip them?
[NOISE] Okay. Flipped, right?
0, minus 10, er, minus 10, 0.
Okay. So let's say I start from here.
Can I, can I get this better? Can I make this better?
I start from this cell,
A gets 0 years of jail.
That's pretty good. B gets 10 years of jail.
That's not that great.
So B has an incentive to change that.
Right? Like B has an incentive to actually move in this direction.
Right? So B has an incentive to get 5 years jail instead of 10 years.
Similar thing here.
What if we start here?
A has 1 year of jail,
B has 1 year of jail.
A has an incentive to change this now and get 0 years jail.
B has an incentive to change this and get 0 years jail.
And we end up with this cell.
Where like, we don't have any incentive to change our strategy.
So we have one Nash equilibrium here, and that one Nash equilibrium
is both of us testifying and both of us getting 5 years of jail.
That's kind of interesting because there is
a socially better choice to have here, right?
If both of us would refuse,
we would each get 1 year of jail, but that's not gonna be a Nash equilibrium.
Okay? All right.
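The cell-by-cell argument above can be checked by brute force. The sketch below encodes the Prisoner's dilemma payoffs as negative years in jail (an assumed sign convention) and tests every cell for a profitable one-sided change; the names ACTIONS, PAYOFF and is_pure_nash are made up for illustration.

```python
from itertools import product

ACTIONS = ["testify", "refuse"]

# Utilities (A, B) as negative years in jail, indexed by (A's action, B's action).
PAYOFF = {
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"): (0, -10),
    ("refuse", "testify"): (-10, 0),
    ("refuse", "refuse"): (-1, -1),
}

def is_pure_nash(a, b):
    """True if neither player gains by changing only their own action."""
    a_ok = all(PAYOFF[(a, b)][0] >= PAYOFF[(alt, b)][0] for alt in ACTIONS)
    b_ok = all(PAYOFF[(a, b)][1] >= PAYOFF[(a, alt)][1] for alt in ACTIONS)
    return a_ok and b_ok

equilibria = [cell for cell in product(ACTIONS, ACTIONS) if is_pure_nash(*cell)]
print(equilibria)  # [('testify', 'testify')]
```

Only the cell where both testify survives, even though both refusing would give each player a shorter sentence.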
So there's a theorem,
Nash's existence theorem, which basically says that
if you have any finite-player game with a finite number of actions,
then there exists at least one Nash equilibrium.
And this is usually a mixed-strategy Nash equilibrium,
at least one mixed-strategy Nash equilibrium.
In this case, it's actually a pure-strategy Nash equilibrium.
But in general,
there is at least one Nash equilibrium if you have a game of this form.
Okay? All right.
So, uh, let's look at a few other examples.
Two-finger Morra. What would be the Nash equilibrium for that?
So we just actually solved that using von Neumann's minimax theorem, right?
So that would be both players playing a mixed strategy of 7 over 12 and 5 over 12.
Now, you might kind of modify your Two-finger Morra game and make it collaborative.
So in a collaborative setting,
what that means is we both get $2 or we both get $4 or we both lose $3.
So a collaborative Two-finger Morra game,
it's not a zero-sum game anymore,
and you have two pure-strategy Nash equilibria.
So you would have a setting where
A and B both play 1 and the value is 2,
or A and B both play 2 and the value is 4.
Okay? And then the Prisoner's dilemma is the case where both of them testify.
We just saw that on the board. All right.
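For the zero-sum Two-finger Morra case, one way to see that the 7 over 12, 5 over 12 mix is a Nash equilibrium is to check that no pure deviation helps either player. A minimal sketch, with hypothetical helper names, using the payoffs to A of 2, 4 and minus 3 from the example:

```python
from fractions import Fraction

# A's payoff in zero-sum Two-finger Morra; B receives the negative.
A_PAYOFF = {(1, 1): 2, (1, 2): -3, (2, 1): -3, (2, 2): 4}
MIX = {1: Fraction(7, 12), 2: Fraction(5, 12)}

def value_for_a(pi_a, pi_b):
    """Expected payoff to A when A and B randomize independently."""
    return sum(pi_a[a] * pi_b[b] * A_PAYOFF[(a, b)] for a in pi_a for b in pi_b)

def pure(action):
    """A pure strategy written as a degenerate mixed strategy."""
    return {k: Fraction(1 if k == action else 0) for k in (1, 2)}

base = value_for_a(MIX, MIX)
# A wants this number high, B wants it low; each tries both pure deviations.
a_deviations = [value_for_a(pure(a), MIX) for a in (1, 2)]
b_deviations = [value_for_a(MIX, pure(b)) for b in (1, 2)]
print(base, max(a_deviations), min(b_deviations))  # -1/12 -1/12 -1/12
```

Every deviation leaves the value at minus 1 over 12, so neither player has an incentive to move away from the mix.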
Okay. So the summary so far is we have talked about simultaneous zero-sum games.
We talked about von Neumann's minimax theorem,
where you can have multiple minimax strategies but a single game value, right?
We had a single game value because it was zero-sum.
But in the case of non-zero-sum games,
we would have something that's slightly weaker, that's
Nash's existence theorem.
We could still have multiple Nash equilibria,
but we also have
multiple game values, depending on whose perspective you are looking at.
So this was just a brief, short introduction to game theory and econ.
There's a huge literature around different types of games
in game theory and economics.
If you're interested in that, take classes.
And yeah, there are other types of games still, like
security games or
resource allocation games, that have
some characteristics that are similar to things we've talked about.
If you're interested in any of them,
maybe you can take a look at them;
they would be useful for projects.
And with that, I'll see you guys next time.
