[MUSIC]
Welcome to the very last session
of this year's AI Summer School.
This is very much a case
of last but not least.
It's going to be a very exciting
session, and it's going to be
kicked off by Shimon Whiteson
from Oxford University.
And in the second
half of the session,
there will be a talk
about the project and
the collaborative AI challenge
that I started, and
we'll also hear from
participants of the challenge.
But let me start by
introducing Shimon.
I've actually known Shimon for
a long time.
He happened to be
co-advising my PhD thesis,
so I must say,
he's done a great job with that.
>> [LAUGH]
>> [LAUGH] Since then,
he's moved from the University
of Amsterdam,
had a stint at UC Irvine, and
is now at Oxford, where he's
doing some really exciting work
around reinforcement learning,
especially in the area of
multi-agent reinforcement
learning and learning to
communicate and collaborate.
The talk today is going to be
about his latest work focusing
on multi-agent RL in the game
of StarCraft. Welcome.
>> Thank you [INAUDIBLE] for
that introduction and for
inviting me to speak today.
Good morning, everyone,
it's a pleasure to be here and
to have a chance to tell you
a little bit about our latest
research on multi-agent
reinforcement learning.
I'm conscious of the fact that
this is one of the last talks
and the end of a long week.
So, I'll do my best to
help you stay awake.
What I'm gonna try to do
in this talk is first,
give a bit of an overview of
what multi-agent reinforcement
learning is and what's
interesting about it [INAUDIBLE].
And then, I'm gonna do a brief
review of single agent policy
gradient methods.
I know you guys have heard quite
a bit about policy gradients
already.
But just to review and
to familiarize you
with my notation.
And I'll tell you about our
new work which is a new
multi-agent policy
gradient method that we
call Counterfactual
Multi-Agent Policy Gradients.
And some experiments that we've
run in a StarCraft benchmark
test.
So let's dive right in.
So the single-agent perspective
that you're probably
used to seeing is a simple one.
You have an agent and
you have an environment,
you have a world.
And one of the few things this
agent needs to do
is perception.
It needs to take the raw
data coming in from its
sensors and
process it into some estimate
of the state of the world.
And then,
it needs to do control.
It needs to condition on that
estimate in order to decide
what actions to take so
as to maximize reward,
minimize cost, etc.
So what's missing from this
single-agent paradigm is
basically acknowledging
that in this environment,
we very often have other agents.
So in this single-agent
paradigm, you're just
treating the environment as if it
were some kind of black box, and
we don't really consider the
fact that, in many cases, a key
element to that environment
is a bunch of other agents.
And this is important for
a couple of reasons.
So the first reason is,
if this agent over here,
the one that we're trying to
find the control system for,
if it is learning, then there's
a good chance those other agents
in the environment
are also learning.
So if we just ignore the fact that
the environment consists of
other agents,
then that environment
becomes a non-stationary one.
Because the way the
environment evolves over time
depends on the behavior
of the other agents.
And that behavior is changing
due to the other agents
learning.
The thing is, if we take a
really broad perspective, what
you're trying to do when you
build your control system for
an autonomous agent is just
one form of machine learning.
And in machine learning,
we basically have inductive
bias and we have data.
So if we want to get leverage,
we need to get more data or
we need a better inductive bias.
And to have a better inductive
bias, we need to take advantage
of everything that we
know about the world.
So if we know that that world
consists of other agents,
that's important information
that should be reflected in our
models and in our methods, so that we
can have the best inductive bias
in order to [INAUDIBLE].
So in this multi-agent paradigm, we
make explicit the fact that the
environment consists
of other agents.
And those other agents may
[INAUDIBLE] as well,
just as you're learning.
So this is quite
a natural model.
It shouldn't be hard for
me to convince you of that at this point.
Multi-agent systems
are everywhere.
When you think about all
the things you do in your daily
life, you're basically
an agent interacting
in a multi-agent system in many,
many situations.
Driving in traffic
is just one example.
Or think about a fleet of drones
that needs to cover some territory,
or deliver some packages,
or whatever.
These multi-agent
systems are everywhere.
And these are just the two
examples that I could find
pretty pictures for; there are a lot of
pragmatic examples
in logistics, or
network packet routing,
or whatever.
These multi-agent
systems are ubiquitous.
So okay, a quick overview for
some context.
What kinds of multi-agent
systems are there? So
first of all, we have
fully cooperative settings, so
we have systems where all
the agents are on the same team,
working towards the same goal.
So formally, this is typically
modeled with a team reward.
There's one reward function, and
that reward is shared by all of
the agents, so their incentives
are perfectly aligned, and
that induces cooperative behavior.
This might not seem like a very
interesting setting, but
it is interesting because,
even though these agents are all
on the same team, they still
have this really important
challenge of coordination.
They need to figure out how to
align their behavior together
such that it achieves
their common goal.
So for example if two cars
arrive at an intersection
together and one car
thinks red means stop and
the other car thinks
red means go,
then there could be
a catastrophic miscoordination.
At the other end of
the spectrum we have fully
competitive scenarios.
So in game theory these
are what are referred to as
zero-sum games.
So our reward functions
are diametrically opposed.
And any time I gain,
it means you lose by
exactly the same amount.
So this is a more
complicated setting.
And there are a lot of
subtleties and nuances
that arise here which you don't
have in the single-agent setting
or in the fully cooperative
multi-agent setting.
Nonetheless I would
characterize this as a pretty
well-understood setting.
We have solution concepts from
game theory, such as the
minimax equilibrium,
which make sense.
They're reasonable solution
concepts to strive for
in the setting.
And we have pretty good
algorithms for computing them or
even learning them in a
reinforcement learning context.
Then in between
these two extremes,
we have these sort
of mixed scenarios,
they're neither fully
competitive, nor
fully cooperative,
what in game theory
are called general-sum games.
So if we think about,
for example, the economy
of the entire country.
Obviously, people in that
country are working together,
there are cooperative aspects.
But they're also competing with
each other over customers or
revenue or whatever.
This setting is much
much more difficult.
There are solution
concepts to this setting.
The celebrated Nash equilibrium
is the extension of the minimax
equilibrium to this
general-sum setting, but
from a machine learning
perspective, it's not clear that
this is a useful solution
concept in any way.
So first of all, we don't
have good algorithms for
learning what the Nash
equilibrium would be for
some general-sum game in
a sequential setting.
The algorithms that
we do have for
doing so typically assume that
all these agents are learning
with the same algorithm, which is not
really a justifiable assumption
when you don't have a fully
cooperative scenario.
And even if all the agents
do use the same algorithm,
the convergence properties
are very restrictive.
In addition to that, there's
a more fundamental problem,
which is that even if you had
some efficient, reliable way, to
learn these Nash equilibria and
you were sure that it was going
to converge, it is not clear
what the use would be.
What is the point of learning
this Nash equilibrium? Because in
this general-sum setting, the
Nash equilibrium is not actually
a valid prediction of the
behavior that will result when
a game is played by rational
agents. So for example,
agents could miscoordinate, and
the resulting behavior could be
one which is not a Nash
equilibrium at all.
So several years ago someone
wrote a paper called "If
multi-agent learning is
the answer, what is the question?"
He was summarizing
the fact that,
in this mixed scenario, we don't
really know what we're doing.
And I think, since people
started to realize that,
this problem has become sort of
a black hole into which good
researchers go and
never emerge again.
>> [LAUGH]
>> So
this talk is focused entirely
on the cooperative setting, not
because I think the competitive
setting is uninteresting, but
because there are so many
interesting challenges already
in a cooperative setting.
That's what we're focusing on,
at least for the moment.
The rest of this talk will be
about fully cooperative
multi-agent systems.
So this is the same
slide as before,
I just changed the title.
>> [LAUGH]
>> So
now that we're focused
upon cooperative settings,
we're talking about
coordination problems.
And I just wanna convince you
these coordination problems
are everywhere.
So if you drive in traffic,
you might think this is
a competitive scenario,
not a cooperative scenario,
and indeed it is.
>> [LAUGH]
>> But imagine that
someday these will all be
taken over by driverless cars.
And those driverless cars
might be controlled
by some central system that
makes sure that traffic
flows through the network
as efficiently as possible.
Then the question is how
all of these different cars
should coordinate their behavior so
as to make the most efficient use of
the scarce road capacity.
Or again,
we have this fleet of drones,
and they wanna get everyone
their Amazon packages as
quickly as possible.
How can we coordinate
the behavior of these drones so
as to achieve that goal?
And you can come up with many,
many more examples.
So good algorithms for
learning policies that allow
agents to coordinate are, in
my opinion, really a fundamental
issue.
Okay, so what are some of the
challenges that arise when we
think about taking
reinforcement learning and
extending it to this
multi-agent scenario?
To answer this question, I want
to get slightly more formal.
So let me introduce two formal
models of cooperative
multi-agent reinforcement
learning.
So the first one is what's
called the Multi-Agent MDP.
So this is just like an MDP, but
there are a bunch of agents, and
each of these agents
has their own set of
actions to choose from.
So every agent can
see the global state.
And each agent can select
an individual action.
So let me be clear
here about my notation.
You're probably used to seeing
the action denoted by a for
action.
But in this problem,
a refers to agent.
And so U is used to
refer to the action.
And that might seem weird, but
at least in the control
literature,
U is also a standard symbol
that's used for the action.
So this u with the superscript
a indicates the action selected,
the action u selected
by agent a.
Okay, and then just like
in a single-agent MDP,
we have a transition function
that tells us the probability
over the next state.
But now this transition function
conditions not just on
the previous state, but
on the joint action.
So this bold u is the joint action vector.
And this vector contains
the action choices of each of
the agents.
And we have the shared
team rewards, so
the reward function
is just like before.
But instead of conditioning
just on the state,
it conditions on the state and
the joint action.
So every agent contributes
to the action that generates
the reward.
And that reward is shared
by all the agents.
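Roughly, in the notation used here, the multi-agent MDP pieces look like this (a sketch; the exact symbols on the slides may differ):

```latex
% Multi-agent MDP (sketch): n agents a \in \{1,\dots,n\}, each choosing u^a \in U,
% forming the joint action \mathbf{u} = (u^1,\dots,u^n)
P(s' \mid s, \mathbf{u}) \quad \text{(transition function, conditions on the joint action)}
r(s, \mathbf{u}) \quad \text{(team reward, shared by all the agents)}
```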
Ironically enough, there's
nothing inherently multi-agent
about a multi-agent MDP.
So I can think of each of these
actions u^a as being selected by
a separate agent.
Or I can think of the whole
system as being controlled by
a single master puppeteer.
And that puppeteer needs to,
in every time step,
select an action.
And that action is
described by a vector.
And that vector contains
a bunch of action features.
And those action features
can just be thought of as
the choices of these individual
agents or subagents,
anyway how you wanna
think about it.
So we can think of this
as a multi-agent or not,
it's really a matter
of perspective.
And the reason we have that
choice is because of a very
restricted and
unrealistic assumption made
by the multi-agent MDP,
which is that all the agents
see the global state.
So even in a single agent
setting, assuming that the agent
has access to a Markov state
signal is often not realistic.
We often need to think about
partial observability even when
there's one agent.
But when there's
multiple agents,
the problem becomes
even more severe.
Basically, as you increase
the number of agents,
the assumption that all of these
agents have access to everything
that every agent can see
becomes increasingly untenable.
So that leads us to the second
formalism that I want to
mention, which is the Dec-POMDP,
the decentralized partially
observable Markov decision
process, which basically takes
this multi-agent MDP, and
it adds an observation function.
So this observation function
conditions on the global state
and the index of the agent.
Remember, a is the index of
the agent, not an action.
And then based on that,
it generates some observation.
So the agents don't see
the global state, but
they see some observation
which is correlated to that
global state.
And crucially, this observation
can be different for each agent.
So each agent may have a
different partial perspective on
the true state of the world.
Okay, so now, because there's
partial observability,
in general, these agents need
to condition their behavior,
not just on their most
recent observation, but
on their whole history.
So we formalize this as this
action-observation history, tau,
which is just like a sequence
of action-observation pairs that
are known to that agent.
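In symbols, this is roughly the following (again a sketch of the notation, not the exact slide):

```latex
% Dec-POMDP additions (sketch)
z^a = O(s, a) \quad \text{(per-agent observation, conditioned on the global state and the agent index)}
\tau^a = (z^a_1, u^a_1, z^a_2, u^a_2, \dots) \quad \text{(agent } a\text{'s action-observation history)}
\pi^a(u^a \mid \tau^a) \quad \text{(decentralized policy, conditions only on } \tau^a\text{)}
```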
Okay, so now what do
these agents need to do?
The team of agents needs to
learn a set of decentralized
policies.
So the solution to this problem,
unlike in the multiagent MDP,
is inherently decentralized.
Because each agent's policy can
condition only on its private
action observation history.
So this opportunity of
interpreting this as really just
being a single agent system
controlled by a puppeteer
is not possible.
Because there is no puppeteer
who has access to all
the information.
It's just a bunch of agents,
each of whom who has their
private information, and
needs to do the best they can
conditioned on what they know.
Okay, and then there's sort
of two settings in which this
Dec-POMDP is applicable.
So one is the natural
setting, where the physical
environment imposes some
constraints on our sensors, or
some constraints on our
ability to communicate and
share information
with each other,
such that we have this
partial observability.
Such that the observation we
receive has less information
than the Markov state.
There's also another setting
which is sort of an artificial
one, where those physical
constraints don't exist.
And in principle,
every agent could have access to
everyone else's observations and
could construct a Markov
state signal from it.
But doing so
would lead to an impractically
difficult learning problem.
So in order to cope with the
very large size of the resulting
joint action space, we might
choose to decompose it and
artificially impose restrictions
on what information each
agent can condition on,
in the hopes that this would
simplify the learning problem.
Okay, and then the last very
important thing to note about
the setting is, at least in
our interpretation of it,
the way we do it in our work:
we have a setting where
the learning is centralized, but
the execution is decentralized.
So this paradigm really comes from
the planning literature, and
in the planning literature
the setting is that you have
a central brain, like a computer,
that's computing a solution to
this planning problem.
And that solution consists of
a set of decentralized policies
which are distributed
to the agents and
then the agents execute those
decentralised policies.
During execution,
they can condition only
on a local observation.
During the planning process,
that computer that's doing the
planning can reason globally about
the state and share information
between the different
agents however it wants.
So we're doing the same
thing here but
just in the learning setting.
So imagine the learning
is taking place in some
kind of simulator, or some kind
of laboratory where the robots are
augmented with extra sensors
that allow them to communicate
without any significant
restrictions.
Then during the learning process,
we can share information however
we want among the agents.
And we can learn things which
depend on the global state.
But the result of that learning
must be policies which can be
executed in
a decentralized fashion,
with each agent
conditioning only on its
local observations.
Okay, so what are some of
the key challenges that arise in
these multi-agent formalisms?
Let me try to briefly
summarize them.
So the first one is the curse
of dimensionality in the action space.
So the curse of dimensionality
was originally formulated
by Bellman; he was the one
who coined the term.
He was referring to the state
space, to
the fact that the complexity
of the planning problem
grows exponentially with
respect to the number of
features that are used to
describe the state space.
In this multi-agent setting,
we have another
curse of dimensionality, which
arises in the action space.
So the size of the action space
is growing exponentially with
the number of agents.
If you think of this multi-agent
MDP, in principle,
like I said, you can
view it as a single-agent system
with this master puppeteer.
The problem faced by this
puppeteer is overwhelmingly
difficult, because its joint
action space grows very quickly
with [INAUDIBLE].
So that's one challenge
that we're faced with.
The second one is what's called
multi-agent credit assignment.
So again, this is sort of
an analog of a problem from
the single-agent setting.
So typically we talk
about the temporal credit
assignment problem.
So you play a game of chess and
you make 63 moves and
at the end you win
with a checkmate.
Then you look back at,
say, the 17th move, and
you have to figure out how
much of the credit for
that win goes to the 17th move.
Was this a brilliant move
that won me the game, or
was this actually a terrible
move, but I won despite it?
How do you figure that out?
And basically,
everything that Richard Bellman
did with dynamic programming and
reasoning about value functions
can be thought of as an optimal
solution to the temporal
credit assignment problem.
But now we have another
credit assignment problem,
a multi-agent credit
assignment problem.
And this credit assignment
problem results from the fact
that you have a bunch of agents
and they all take an action
together and
the result is some big reward.
How do you figure out how
much of the credit for
that big reward goes to
an individual agent?
Did that agent play a crucial
role in generating that reward,
or did that agent actually
completely mess up, but
the other agents
behaved well, and so
you still got a big [INAUDIBLE]?
How can we sort that out? That's
the multi-agent credit assignment problem.
And then the final problem is
one that arises only in this
Dec-POMDP, not in the multi-agent
MDP, as a result of
the partial observability.
And that's the difficulty of
modeling the information state
of other agents.
So in a single agent fully
observable setting, you can just
condition on this Markov state
signal that you're receiving.
In a single agent partially
observable setting,
you need to maintain some
kind of information state.
Like from a Bayesian
perspective, you need
to maintain a belief over the
hidden state of the world, cuz
you weren't able to observe it.
But in the multi-agent setting,
this belief that you would like
to maintain is about
the hidden state of the world,
and the hidden state of the world
contains these other agents.
And these other agents
themselves have some information
state, some beliefs, about you.
This leads to complex nested
reasoning:
I know that he knows that I know
that he knows that I know, and
so on.
You have to do reasoning like,
okay, I've got this observation,
therefore it's likely that my
teammate got that observation;
and if he got that observation,
he's likely going to take this
action; and if he takes this
action, then I should take that
action to complement him.
That's the kind of
reasoning we need to do,
and we need to find a way
to do it in a tractable form.
So a little bit later,
I'm gonna present a method
that we developed
which tries to address at least
some of these challenges.
But first I'm gonna give a bit
of background about single agent
policy gradient.
And you guys have had some of
this before, so either this will
just be review, or you'll see it
again but in my notation, or if
some of the subtleties escaped
you last time [INAUDIBLE].
Okay, so
in policy gradient methods,
we are trying to optimize
some policy directly,
as opposed to in
temporal difference methods,
like Q-learning, where all
the emphasis is on the value
function and the policy is
kind of an afterthought.
Here the policy actually
gets the spotlight.
So we have some policy, pi,
parametrized by some parameters,
theta.
And we just need to
optimize those parameters.
So how do we do that
with our favorite tool?
We do some gradient ascent.
And we do gradient ascent to
try to maximize our objective,
which is the expected return.
And there are a couple of ways
of writing this down, but
I selected one of the standard
ways of writing it down.
We write this expected
return as
an expectation of
the reward.
But that expectation is with
respect to the state, which is
governed by this distribution
that can loosely be thought of
as the distribution over states
induced by the policy pi,
and with respect to the distribution
over actions,
which is of course also
induced by the policy.
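Written out, one standard way is roughly the following (a sketch in this notation):

```latex
% Expected-return objective, to be maximized by gradient ascent on \theta
J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, u \sim \pi_{\theta}}\!\left[ r(s, u) \right]
```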
So why should we use
policy gradient methods?
There are many ways that
we could motivate this.
I've listed two
here on the slide.
So one is any time the process
of extracting a policy from
a learned value function is
itself non-trivial. This is what
I call greedification: finding
out what the greedy action
is with respect to some
learned value function.
Whenever that problem
is non-trivial, then
it makes sense to think about
using policy gradient methods.
So the most common example of
this is if the action space is
continuous.
If the action space is
continuous, then you know,
finding the best action with
respect to some value function
is itself a potentially
difficult optimization problem.
So if I separately learn
a policy which sort of caches
the result, the solution to
that optimization problem,
that can be useful.
And later, we'll see in
the multi-agent setting that there
can be other reasons why the
greedification step can be hard,
and therefore why it makes sense
to take the policy gradient
approach in that setting.
Policy gradient methods are also
useful if there are settings
where you have reason
to think that the value
function is gonna be a lot
more complex than the policy.
There could be settings where
the value function is very
complex but the policy is
very uniform and simple.
And so you can exploit that
simplicity by explicitly
representing and optimizing
the policy with policy gradients.
Okay, so back in 2000, Sutton
and colleagues came up with
the policy gradient theorem,
which basically says that
the gradient of this expected
return, which is the thing that
we wanna maximize, can itself
be written as an expectation.
And this is important
because that means we can
estimate this
gradient by sampling.
Yeah, we can estimate this
expectation by sampling.
And this is done by using what's
called the likelihood ratio trick,
in order to convert this
gradient of an expectation
into an expectation of gradients.
And we end up with
this formulation.
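That formulation is roughly the following (a sketch; see Sutton et al., 2000 for the precise statement):

```latex
% Policy gradient theorem: the gradient of the objective is itself an expectation,
% so it can be estimated by sampling
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{s \sim \rho^{\pi},\, u \sim \pi_{\theta}}
    \!\left[ \nabla_{\theta} \log \pi_{\theta}(u \mid s)\, Q^{\pi}(s, u) \right]
```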
We can derive a lot of different
policy gradient algorithms
using this general
policy gradient theorem.
One of them,
the simplest one,
which actually predates this
policy gradient theorem,
is called REINFORCE, in
which case this value estimate
is replaced with just
a sample return.
What's happening here is this
is the true gradient, and
down here we want to
estimate this gradient,
given some particular
trajectory.
So we took our policy,
we executed it in the world,
we got some trajectory.
We need to use
that trajectory to
estimate the gradient that we're
gonna follow to optimize theta.
So one way we can estimate that
is by using just the sample
returns that are observed
in that trajectory there.
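So the REINFORCE estimate from a single trajectory is roughly:

```latex
% REINFORCE (sketch): replace Q^{\pi}(s_t, u_t) with the sampled return R_t
g = \sum_{t} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\, R_t
```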
So that's conceptually
very simple.
The disadvantage of it is that
this estimate of the gradient
is likely to have very
high variance, because
it's based on this sample return,
which could be very noisy.
So in the actor-critic approach
to policy gradient methods,
we try to reduce the variance
in that estimate of the gradient
by explicitly learning a critic.
So instead of estimating this
term here with just a sample
return, we actually learn
a whole value function,
a separate entity
called a critic and
stick it directly into that
definition of the gradient.
So this is the same estimate
of the gradient as before;
you just replace the sample
return with the critic's estimate.
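So the actor-critic estimate is roughly:

```latex
% Actor-critic (sketch): same estimator, but with a learned critic Q_w in place of R_t
g = \sum_{t} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\, Q_w(s_t, u_t)
```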
So this figure sort of shows the
way actor-critic methods work.
You have this actor,
which generates some policy.
That policy is executed
in the environment,
where it generates
some trajectory.
The trajectory is
fed into the critic,
which is used to train
this Q function,
using standard temporal
difference methods.
And the Q function is
used to estimate the gradient,
which is then used to optimize
the actor, and the whole process repeats.
Okay, so some other
tricks that we can do.
We can further reduce
the variance in the gradient
estimate by introducing a
baseline, so we replace Q with Q
minus b, where b is any function
that depends only on the state.
As long as it doesn't
depend on the actions,
it won't introduce bias
into our gradient estimate.
One common choice for
the baseline is to use
the value function itself.
And if we do that, then this
term, which is our Q function
minus the baseline,
becomes Q minus V.
And Q minus V is also known
as the advantage function, cuz
the advantage function
quantifies, for a given action,
how much greater the value of
that action is than
the value of the state.
So this gives us yet another
policy gradient, where we
replace Q minus b with A.
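Roughly:

```latex
% Advantage function and the resulting gradient estimate (sketch)
A(s, u) = Q(s, u) - V(s)
g = \sum_{t} \nabla_{\theta} \log \pi_{\theta}(u_t \mid s_t)\, A(s_t, u_t)
```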
And yet another trick: if
estimating this advantage function
is too hard, because we have to
learn this value function which
depends on states and actions,
we can simplify the problem by
learning a critic that depends
only on s.
So just learning V instead of Q.
In some cases learning that
critic might be easier.
And then we use the TD-error in
place of the advantage function.
So this is our value function,
and
this is some target estimate
of the value function.
Typically when you do TD
learning you would follow
the gradient and
try to reduce the difference
between these two terms.
But in this case we can
interpret this TD-error as
an unbiased estimate of
the advantage function.
And this is easy to see because
the advantage function is q
minus v.
And here we have something which
is an estimate of q minus v.
That is, itself, an estimate
of the advantage function.
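In symbols, roughly:

```latex
% TD-error as an unbiased estimate of the advantage (only V needs to be learned)
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\mathbb{E}\!\left[\delta_t \mid s_t, u_t\right] = A(s_t, u_t)
```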
Okay, so these are some variants
of the actor-critic [INAUDIBLE]
architecture.
We're doing some form
of gradient ascent using
this estimated gradient,
which we then [INAUDIBLE]
to optimize the policy.
Okay, if we combine these
tricks with deep learning,
we can represent
both the actor and
the critic using deep
neural networks.
We may have convolutional layers
in order to do some feature
construction if we have
a lot of pixel input.
We may use recurrent layers in
order to deal with partial
observability [INAUDIBLE].
We may also share layers between
the actor and the critic.
For example, convolutional
layers that encode the state may
be used by both the actor and
the critic.
Then we train both of these
networks using stochastic
gradient descent.
The actor is trained on the
policy gradient, and the critic
is trained with some form of
temporal difference method,
like TD(lambda) or Sarsa(lambda).
I think actually I'll skip these
equations, cuz the details
[INAUDIBLE]
not important.
Okay, so
what have I done so far?
I've described the settings for you, and
I've given this review of
single-agent policy gradients.
Where am I going with this?
I'm gonna propose some multi
agent policy gradient methods
and show you some
results with them.
And I'm gonna start with
the simplest version.
This is really my baseline that
we're gonna try to beat later.
So the simplest thing that
you could do is what we call
independent actor critic.
And independent actor critic is
really just a small variation on
a standard multi-agent algorithm
called independent Q-learning.
And this is the simplest, but
also by far the most popular,
algorithm for multi-agent learning.
So the idea of independent Q-learning,
or independent actor-critic,
is that each agent is just
gonna learn as if it were
the only agent in the world.
So each agent is gonna have its
own value function, policy,
whatever, depending on
its learning algorithm.
And it's just gonna interact
with the world as if that world
was an environment that
didn't contain any other learning agents.
And every agent's gonna do
this at the same time, so
just treat the other agents
as part of the environment.
So now we're saying,
because we're using these
Actor-Critic methods,
independent Actor-Critic means
each agent learns independently,
with its own actor and
its own critic.
However, we can speed up this
learning process by allowing
the agents to share
parameters with each other.
So I want to try to explain this
clearly to avoid confusion.
Because it may seem like
the learning is no longer
independent; the agents
are now sharing parameters,
so in what sense is
the learning independent?
So let me try to explain this.
First of all,
even though the agents may all
have the same parameters because
they share parameters, they
will still behave differently,
because they receive
different inputs.
So they are learning
a policy which
conditions on their
private observations and
their private observations
will be different.
In addition, we can include in
these inputs an agent index.
So that this shared policy
conditions on the index
of the agents.
And so this allows the agents
really to learn a policy
anywhere they want on
the spectrum from homogeneous to
heterogeneous agents.
If they totally ignore
this agent index,
they'll be totally homogeneous.
And if this agent index
triggers a completely different
subnetwork for each agent,
then they would be
completely heterogeneous.
They can learn how
heterogeneous they would be.
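As a concrete illustration, here is a minimal sketch of such a shared actor in PyTorch; the class name, layer sizes, and the simple feed-forward architecture are illustrative assumptions, not the network from our paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActor(nn.Module):
    """One set of parameters shared by all agents. Behavior can still differ
    per agent, because each agent feeds in its own observation plus its own index."""
    def __init__(self, obs_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        self.fc1 = nn.Linear(obs_dim + n_agents, hidden)  # observation concatenated with one-hot agent id
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, obs, agent_id):
        # obs: (batch, obs_dim) floats; agent_id: (batch,) integer agent indices
        one_hot = F.one_hot(agent_id, num_classes=self.n_agents).float()
        x = F.relu(self.fc1(torch.cat([obs, one_hot], dim=-1)))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities for this agent
```

If the network ignores the agent-id input, all agents behave homogeneously; if it routes different ids to very different behaviors, they become heterogeneous.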
But more importantly,
even though they're sharing
parameters, the learning is
still independent, in the sense
that the critics that each agent
is learning
condition only on these
local private observations.
So this critic estimates
the value of taking some local
action, given some
local observation.
So nowhere is any agent
reasoning about a global value
function.
No agent is reasoning about
the value of a joint action.
So that's what we mean by
independent actor-critic,
regardless of the parameter sharing.
So, we consider a couple
of variants of this.
In the first one, we do
independent actor-critic where
the policy gradient is based on the TD-error.
So we learn a critic,
which is just V, and this V now
conditions only on
the private observation.
And then we use that to
construct a policy gradient
using the TD-error,
as I showed earlier.
Or, in another variant,
we actually learn Q,
then we use Q to estimate
the advantage, and
then we have a policy gradient
based on the advantage.
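So the two variants look roughly like this (a sketch, per agent a with local history tau^a):

```latex
% Independent actor-critic, two variants (sketch)
% (1) critic V(\tau^a), gradient weighted by the TD-error:
g^a = \nabla_{\theta} \log \pi(u^a \mid \tau^a)\,\big( r + \gamma V(\tau'^a) - V(\tau^a) \big)
% (2) critic Q(\tau^a, u^a), gradient weighted by a local advantage:
g^a = \nabla_{\theta} \log \pi(u^a \mid \tau^a)\,
      \Big( Q(\tau^a, u^a) - \textstyle\sum_{u'} \pi(u' \mid \tau^a)\, Q(\tau^a, u') \Big)
```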
Okay, this is
a very simple idea;
this is the baseline.
It's the first thing that you
think of when you try to make
actor-critic methods
[INAUDIBLE].
So it has some key limitations.
First of all, just like
independent Q-learning,
it suffers from the fact
that the learning problem
is nonstationary.
You're treating
the other agents as part of
the environment, but
that environment is
nonstationary because those
other agents are learning.
So it's well known that this can
prevent the convergence of
Q-learning or actor-critic or
whatever method you might use.
Second, it's hard to get the agents to
learn to coordinate using such
a method, because there is
no agent that's learning
about the value of joint actions;
agents are just learning about
the value of their
individual actions.
So the synergy between
their actions and
other agents' actions is
not modeled in the critic.
And the third problem:
it doesn't address this issue
of multi-agent credit assignment.
Okay, so
that brings us to our method.
So now I'm gonna try to tell you
about the new method that we
developed which
we call Counterfactual
Multi-Agent Policy Gradients.
There are sort of three ideas
behind this method, and
I'll step you through.
So the first idea is that
we centralize the critic.
So we have this scenario where
we have centralized learning and
decentralized execution.
So that means the critic which
is used only during learning and
not during execution
can be centralized,
even though we need to learn
these decentralized policies.
The second one is that we introduce
a new kind of baseline,
which we call a counterfactual
baseline, and which allows us to
tackle multi-agent credit
assignment [INAUDIBLE], giving
us a learning signal which
is specific to that agent's
contribution.
And then finally we represent
the critic in a way that makes
it efficient to actually compute
this counterfactual baseline.
Okay, so I'll step through
each of these one by one.
So, as I already said,
the idea, the insight here is
that only the actors need to
be decentralized because only
the actors are actually
used during execution.
And when we're learning,
we're allowed to centralize
whatever we want,
and the critic is used
only during learning.
So we can have a centralized
critic that will allow us to
reason about the value
of joint actions.
And this leads me back to
what I said before.
I'm actually now at a point
where I can motivate taking
an actor-critic approach to
this multi-agent problem.
As I mentioned before,
these policy gradient methods
with actor-critic approaches
are useful when the
greedification step is itself
a difficult one, and that's
exactly what we face here.
Cuz it's easy enough to say,
okay,
let's learn
a centralized critic.
But how are we gonna use that
centralized critic in order to
choose decentralized
behavior for all the agents?
If you just did a naive
greedification with respect to
this centralized critic, you'd
end up with centralized policies
that you wouldn't be able to
execute in a decentralized way.
So we have a nontrivial
greedification step.
And we can solve that problem by
taking an actor-critic approach,
and optimizing decentralized
actors with respect to
the centralized critic.
So the naive way to do this is
what's shown in this equation.
So now I have an estimate of
the gradient that's specific to
each agent.
So now these terms here depend
on the decentralized policy of
that agent and
its local action depending on
just its local observation.
But now this part of
the gradient is computed using
a centralized critic that
depends on the global state,
which only the critic can see but
the agents won't see
during execution.
That's also what's
shown in this figure.
That centralized critic
gets access to this global state
information, and it
estimates the advantage
function, that's this term,
which is fed back to the actors,
who use it to optimize
their policies.
Those policies condition
only on local observations,
and select only local actions.
Again, we still have parameter
sharing between actor
1 and actor 2.
But the point is we have
a different gradient for
each agent,
because we're optimizing
this decentralized policy.
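So this naive version of the per-agent gradient is roughly (a sketch; the weighting term can equivalently be the centralized critic's Q value or an advantage derived from it):

```latex
% Centralized critic, naive version (sketch): decentralized actor term,
% weighted by a value estimated from the centralized critic
g^a = \nabla_{\theta} \log \pi^a(u^a \mid \tau^a)\, A(s, \mathbf{u})
```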
The second idea is this
counterfactual baseline.
And let me give you
a little bit of background
before I introduce this.
This might seem like
a strange detour, but
I promise you it
will make sense.
So can I see a show of hands: how
many of you are familiar with
the movie It's a Wonderful Life?
Okay, not bad.
In my experience,
most Europeans are not familiar
with this movie, but
in the United States,
this movie plays like a key
role in Christmas mythology.
>> [LAUGH]
>> This is a Frank Capra movie
starring James Stewart and
Donna Reed.
The story is, James Stewart
plays this banker, and
he makes a big mistake and
the bank loses a bunch of money.
And he's feeling quite
down about himself, and
he goes to this bridge and he
is gonna jump off the bridge and
commit suicide,
when suddenly this angel
comes down from heaven.
And the angel says, let me take you on
a tour, and show you what
the world would've been like
if you had never been born.
And of course he sees that
everyone in his life,
his family, his friends,
they would all have been worse
off if he hadn't been there,
because he's such a positive
influence on their lives.
And he decides, you know what?
I'm not gonna kill
myself after all.
And he goes home, and
his friends and family have
raised all the money that he
needs to pay back the bank and
everything is okay.
So the reason I mention this
is because back in 2000
Wolpert and Tumer came up with
this idea that they called
the Wonderful Life Utility.
And it's sort of
inspired by this movie.
And in my opinion,
this is the best or
actually the only good
idea I've ever heard for
addressing the multi-agent
credit assignment problem.
So the idea that
they had was when
an agent is trying to decide
how to optimize his behavior,
the learning signal should
not be based on this global
reward signal.
Because that global rewards
signal will contain very
little information about how
good that agent's behavior was,
because we have this multi agent
credit assignment problem.
So instead, what the agent
should do is learn from
the signal that estimates
the difference between the team
reward that was received, and
the team reward that would have
been received if that agent
had not participated.
So I think this is
a very important idea.
And some years later, Tumer and
Agogino took this further and
actually made a practical
method out of it based on what
they called
the difference rewards.
So you have
a multi-agent system.
And each agent is learning based
on its own per-agent shaped
reward signal that's called
a difference reward.
So that's what's shown here.
This is the difference
reward for some agent a.
And it's the actual reward that
was achieved by the group minus
the reward that would have been
achieved by the group had that
agent replaced its action
with a default action c^a.
So let me explain
my notation here.
This term here is the
joint action with
the action of agent a removed.
And then we add back
in a different action,
called the default action.
And this whole thing together
constitutes a joint action.
So this difference reward
measures the difference between
the true team reward, and
the team reward had this agent a
taken a default action instead.
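In symbols, roughly:

```latex
% Difference reward for agent a (sketch): team reward actually received, minus the
% team reward had agent a taken a default action c^a (the others' actions u^{-a} fixed)
D^a = r(s, \mathbf{u}) - r\big(s, (\mathbf{u}^{-a}, c^a)\big)
```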
Okay, this is exactly
what we're looking for.
And it has the important
property that if I increase this
difference reward, I've also
increased the true reward.
So if I had some joint action
and I replaced the action of one
agent with some other one and
doing that increases
the difference reward, then that
also increases the true reward.
So we can just optimize this
difference reward instead of the true reward.
And this will actually be
a useful signal because it will
tell us what our contribution
was to this team reward.
So this is a great idea, but it
has two practical limitations.
First of all, in order
to compute this difference,
we're going to need
extra simulations.
And I need to do extra
simulations for each agent.
Because of this term here: in order
to estimate this counterfactual,
I need to do another simulation
to see what the effect would have
been of taking this different
joint action instead.
And the other problem is,
what is this default action?
I'm gonna need some domain
expertise in order to figure out
what action we need to select
so that it best approximates
that agent's absence from the system.
It's not clear how
to do that in general.
Okay, so what we've done is come
up with a new baseline that we
call a counterfactual baseline,
which uses this idea of
difference rewards, but
addresses both of
these limitations.
Okay, so
here's the way it works.
We have this gradient,
like before,
just like on the previous slide.
So let me go back to that for a second.
So here we had
a centralized critic, but
we weren't doing
anything else clever.
And we had this part of
the gradient that was specific
to the individual agent.
So we have the same thing here.
This part of the gradient
is individual to that
particular agent, but now we
have a new advantage function.
And that advantage function
is also specific to
this particular agent.
So what does this advantage
function look like?
Well, it's Q minus some
baseline, just like always.
But now this baseline estimates
exactly this counterfactual
that's included in
the difference reward.
So let's see how this is done.
So the Q value is
estimated by our critic.
And then what do we do for
the counterfactual baseline?
We consider, for
every possible action that
that agent could have taken,
what the value would have been
if the agent had taken that
action instead.
So nothing has changed here
except the action of agent a.
And we marginalize out
that action by considering
all of the actions that
agent could have taken,
weighted by the probability
that the agent would
take that action according
to its current policy.
So this is estimating
the difference between the value of
the joint action
that was selected
and the expected value,
taking into account
all the actions that this
agent could have taken,
but keeping fixed the actions
of all the other agents.
You're basically marginalizing
out the action of the agent
under consideration.
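Putting that together, the counterfactual advantage for agent a is roughly:

```latex
% Counterfactual advantage (sketch): the critic's value of the chosen joint action,
% minus a baseline that marginalizes out agent a's own action under its policy,
% keeping the other agents' actions fixed
A^a(s, \mathbf{u}) = Q(s, \mathbf{u})
  - \sum_{u'^a} \pi^a\big(u'^a \mid \tau^a\big)\, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)
```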
So this is all done
within the critic.
So because we already went to
the trouble of learning this
critic, it tells us the value of
all these different state action
pairs, we no longer need to
do any extra simulations.
The information is already
in there in the critic.
And in addition,
because we marginalize out
the action of the agent,
we no longer have to
select a default action.
That default action is
effectively given to us by
the stochastic policy that
we're trying to optimize.
Okay, so then the third point is
just how can we do this in some
kind of efficient way?
If you think about what would
be, let me go back to this for
a second,
what would be the obvious ways
to represent this critic?
So there are sort
of two choices.
So one would be: you have some
neural network, and it takes
as input the state and
the joint action, and
it produces as output
the value.
The problem here is that we
would have to do a forward propagation
of this network every time
we wanted to compute one
of the summands in
this summation.
That would be hugely expensive.
The other thing we can do is we
could take the approach that's
typically done by DQN, where
the input is just the state, and
there's an output for
every action.
But the problem here
would be the same, because the
number of outputs would be the
size of the joint action space.
So what we do instead is we have
a representation that takes
as input the actions of
only the other agents, and
it produces as output
the value of the joint action
when that joint action is
completed by each of
the different actions available
to the agent that
we're considering.
So what this means is that
in one forward propagation,
we are able to compute
the Q value for
each of the summands needed
to compute the summation.
So you do only one
forward propagation for
each agent in order to compute
this counterfactual baseline.
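Here is a minimal sketch of that critic representation and of the baseline computation, assuming a PyTorch-style setup; the class and function names, layer sizes, and exact inputs are illustrative, not the architecture from our paper.

```python
import torch
import torch.nn as nn

class CounterfactualCritic(nn.Module):
    """Centralized critic. Input: global state, the other agents' actions, and the
    agent's index. Output: one Q value for each action available to that agent, so
    every summand of the counterfactual baseline comes from a single forward pass."""
    def __init__(self, state_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        in_dim = state_dim + (n_agents - 1) * n_actions + n_agents  # state + others' one-hot actions + agent id
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # Q(s, (u^{-a}, .)) for every action of agent a
        )

    def forward(self, state, other_actions_onehot, agent_id_onehot):
        x = torch.cat([state, other_actions_onehot, agent_id_onehot], dim=-1)
        return self.net(x)  # shape: (batch, n_actions)

def counterfactual_advantage(q_all, pi_a, chosen_action):
    """Advantage for one agent: Q of the chosen action minus the counterfactual
    baseline, i.e. the policy-weighted average over that agent's own actions."""
    q_taken = q_all.gather(-1, chosen_action.unsqueeze(-1)).squeeze(-1)
    baseline = (pi_a * q_all).sum(dim=-1)
    return q_taken - baseline
```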
Okay, how am I doing on time?
I'm out of time, okay.
>> [LAUGH]
>> Let me just in
two minutes then tell
you about some results.
Okay, so I'll have to
skip all the details, but
we've applied this method to
a decentralized variant of
the StarCraft
micromanagement problem.
So you have some agents, and
they need to coordinate
their behavior in order to
defeat the other agents, who
are controlled by some fixed AI.
And unlike the original problem,
they have a limited
field of view.
So they actually have this
partial observability, and
the problem is much harder
because they actually have to coordinate
in a non-trivial way.
So I'll just show
you some results.
So these are on some different
maps, with different types of
agents or different numbers
of agents; the details
aren't important.
And so these green lines show
the independent actor-critic
methods that I mentioned.
And the blue lines are versions
of our method with some feature
taken out, so
these use a centralized critic but
they don't use
a counterfactual baseline, or
they use a baseline but
not a counterfactual one.
And the red line is our method.
The results basically show that,
especially on the harder maps,
it's very important to
have a centralized critic.
The independent methods
just don't work.
And you can get additional
leverage by using this
counterfactual baseline.
I wish I had more time to tell
you about the details, but
that's not your fault.
If you're interested in this
work, I encourage you to take
a look at our paper which is
currently posted on arXiv.
And since I'm giving this
presentation at Microsoft,
I feel that I should
mention that the experiments
in this paper were only possible
because of the generous donation
of the Azure cloud
credits from Microsoft.
So thank you very much,
Microsoft.
And thank you.
>> [APPLAUSE]
>> Let's take a few minutes for
questions, yes?
>> Thank you for your talk.
So I'm wondering, how restrictive
in practice is the assumption
that you have a critic that has
access to the global state,
when you assume that
your agents don't?
>> Yeah, a good question.
I think it sometimes
is reasonable and
sometimes it isn't.
If you want to really learn in
a deployed environment, then no.
I think the most interesting
applications of reinforced
learning are ones where there's
this significant amount
of danger.
You're not just
gonna be learning
from scratch in your
real environment anyway,
so at least some learning will
happen in the simulation.
There's the real world and
there's simulation, and
there's a vast spectrum
in between the two.
So you can think of,
for example,
a laboratory setting where you
have real, physical robots.
But if the robot
makes a mistake,
the consequences aren't
as severe as if it was
deployed in the real world and
it could run over a child.
And because it's a laboratory
setting, you may be able to put
extra cameras and add some
extra sensory information.
So any time those things
are possible, it is realistic.
And I think any interesting
problem will be one where
the stakes are high enough and
you'll need to do at least
some learning in that setting.
>> Hi, thank you for your talk.
I'd like to ask you about
the StarCraft thing.
Have you run it in a bot, or
are all of these micro-experiments?
Have you competed in a tournament?
>> No, no.
>> So what are the settings
that you have there?
Is it something like
micro-encounters, very
specifically placed, or
have you created a bot? So
these would be considered
separate scenarios,
like five marines
versus five marines?
>> So yeah,
it's a micro setting, so
it's not something global.
>> Okay. Are you considering
entering your method in a bot or
something?
Because there are these great
tournaments for bots competing
in StarCraft that do that kind
of thing, and it's very popular.
>> That would be amazing,
I would love to do that.
We probably have some
work to do first.
We are benchmarking against the
other published results for
StarCraft.
We're doing really
well in that respect;
that's the data I didn't
have time to show you.
Even versus
centralized variants,
we are performing well.
But, yeah, in order to compete in a
tournament we'd probably have to do more.
>> Okay, so my question is, how do
you know that this is really
a decentralized system? Because I could
imagine the following happens:
all the agents see each other,
and see what actions they take, so
everything they do is to learn
to reconstruct a map of the state by
applying some action that
would make the state identifiable.
And then it's just like in
the first setting that you
described: it's really just
like a puppet master, because
every agent has a complete
idea of what to do.
>> So, I mean, the field of
view is pretty restricted.
So the important information
is often out of view.
And actually, in a centralized
setting, what the reinforcement
learning ends up doing, within
that centralized setting,
is just learning when to
use certain macro actions, and
those macro actions actually
solve most of the problem.
There's a macro action which
actually tells the algorithm
what to do and when to fire.
And those macro actions are not
available to us, because those
macro actions depend
on full observability.
For example, you don't
know where the other agents
are because they're
not close enough.
You don't know where
the opposing team is because
they're not close enough.
>> Yeah, but for example, you see
where this one ranged unit
stops, so you know how
far away the other one is.
You see that these things
have probably moved there.
So maybe you can reconstruct the
real state pretty well, because
you see what the other
agents are doing.
>> I don't think so, because
you don't see the other team at all,
and those are key state features.
>> Yeah, but it's moving.
The other agents will move in this
direction, and
this you can see.
>> Please, can we take
this discussion offline?
>> Yes, sure.
>> Okay,
one more question [INAUDIBLE].
>> Yeah, go on.
>> That one [CROSSTALK].
>> I found something.
I was just wondering, because in
this whole learning and
execution setting you
are assuming that every agent is
acting at the same time.
Is there some extension,
could you extend it to
asynchronous actions
for the different agents?
>> So, if you have discrete
time, then formally that's not
a problem, because you can
just have the action space be
conditioned on the current
state and time step.
So at one time step your
action completes, and
then [INAUDIBLE] get back,
and so on.
If you have continuous time,
then yeah,
that opens up a lot of formal
work, and I'm not proposing you do that.
From a practical perspective,
I'm not sure how much it matters;
typically in games, you just
have some time resolution that
you sort of artificially impose,
and you just select that
parameter so that it's feasible
for the challenges of the task.
>> I know this is a fascinating
topic and we're out of time,
but I encourage you to get more
from Shimon after this session.
For now, let's thank Shimon again.
>> [APPLAUSE]
