>> I'm very happy to
introduce Chris Amato.
Chris is an assistant professor
at Northeastern University,
and he has done a lot of
really cool work on robotics
and some game-playing,
and today he's going to
talk about multi-agent RL.
>> Thank you very much. Yeah, so
the title of my talk is 'Scalable and
Robust Multi-Agent
Reinforcement Learning'.
I want to thank in
particular two students.
These are the students who
mainly did the work that
I'm going to talk about.
One is Shayegan and the
other one's Yuchen.
I'll also say, I don't
know where Sam went,
but Sam led very nicely with
the ending part of
his talk into the stuff that
I'm going to talk about here:
I'm going to focus on the
multi-agent learning stuff,
multi-agent reinforcement learning.
If he didn't convey
it enough already,
I think we all know that
there are going to be
multiple agents everywhere;
there are already
becoming lots of agents,
robots or otherwise, that
are helping us out in
our daily lives, whether
for drone delivery
or autonomous cars. Typically we think of
the autonomous car problem
as a single-agent problem,
but it's going to be lots of
autonomous cars, and they're going
to need to coordinate with each other
in order to really navigate
the streets and optimize
where they should go.
Things like UAV surveillance
or even home robots.
Nowadays the home
robots are becoming so
sophisticated that
they can coordinate with
each other in order to make
sure that your house is as
clean as it can possibly
be, for instance, right?
So there are lots of these
domains that we can think
about where we're going to
have multi-agent systems
that need to learn, rather than
just a single agent, as well as
video games, which are also a
cool application.
But in these multi-agent domains,
we're going to have lots
of different types of
uncertainty, which I think both
Sheila and Sam set up
nicely before as well.
We have the regular
outcome uncertainty that
we have in a typical MDP,
real-world domains always
have sensor uncertainty.
So we're going to have
partial observability
often in these domains.
Multi-agent domains often also
have communication uncertainty.
So if I have home robots that are
made by different manufacturers,
for instance, then they may not
always be able
to communicate with
each other. Or if we're
doing search and rescue or
surveillance in dangerous domains,
we don't have perfect communication;
the agents can't communicate with each
other to have a
centralized solution.
So there's going to be
these three types of
uncertainties that could exist in
any of these different domains.
In some domains, some of them may not
exist in other domains. Maybe
all three of them will.
So for these types of problems,
the common representation for
the cooperative case, for
thinking about these different
types of uncertainties
(again, we can think
about sub-classes or
even one superclass, I guess, for
modeling these cooperative agents),
is the decentralized
partially observable
Markov decision process (Dec-POMDP).
It's decentralized because
now we have this set of
agents that's operating in
a decentralized fashion.
It's obviously partially observable
because of the sensor uncertainty.
So it's just a
multi-agent extension of
the MDP and POMDP frameworks.
So now we have this set of
agents here for instance
the robots and at each step
they're going to take some action,
and it's a cooperative
problem, so there's
a single joint reward that's
generated for the team of agents.
Each one of them, though,
gets its own observation,
and then they have to
choose what to do based on
the observation sequences
that they're going to get.
So a little bit more formally,
so again it's an extension of the
MDP and the POMDP frameworks.
So we have our set of agents
and a set of states,
and then each agent has a set
of actions that it can take.
There's a transition model that
depends on all the agents.
So the world state depends on
what I do and what you do, right?
The reward model depends,
again, on what I do and what you do.
So it depends on everybody.
There's a set of observations
that could be different
for all the different
agents, and
the observation model
again depends on
possibly all the different
agents and the world state,
and then we have a discount factor.
So all these functions depend
on all the different agents.
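As a rough sketch of that tuple (all names and types here are illustrative assumptions for exposition, not the speaker's code), the Dec-POMDP can be written down as a simple container:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative sketch of the Dec-POMDP tuple <I, S, {A_i}, T, R, {Omega_i}, O, gamma>.
@dataclass
class DecPOMDP:
    agents: List[int]                                # I: the set of agents
    states: List[int]                                # S: the set of world states
    actions: Dict[int, List[int]]                    # A_i: per-agent action sets
    T: Callable[[int, Tuple[int, ...], int], float]  # T(s, joint_a, s'): transition prob
    R: Callable[[int, Tuple[int, ...]], float]       # R(s, joint_a): single joint reward
    observations: Dict[int, List[int]]               # Omega_i: per-agent observation sets
    O: Callable[[Tuple[int, ...], int, Tuple[int, ...]], float]  # O(joint_a, s', joint_o)
    gamma: float                                     # discount factor
```

Note how the transition, reward, and observation functions all take the joint action, which is exactly the "depends on all the agents" point above.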
Solutions for these
problems could be anything,
but here we need to
remember history because
again we're in partial
observable situations.
So we want to try and map
our different possible histories
to the different actions that
we may want to choose
because we don't have access
to the true state of
the world necessarily.
So we can have lots of different
policy representations:
direct history-to-action mappings,
tree-based representations,
finite-state controller
representations, recurrent
network representations.
One of these
representations is kept
separately for each of
the different agents
so that they can operate in
a decentralized fashion
in the domain itself.
Then once we have
something like this,
we can evaluate this
set of policies, one for
each of the different agents, using various
types of Bellman equations.
This is a Bellman
equation for the tree or
the finite-state controller
representation.
But you could also just think
of V of h here as well,
which depends on the
immediate reward,
the transition probability
of the system,
the observation probability
of each agent,
and then the next value that you
can get after that, right?
And the goal, just like in
any MDP or POMDP problem,
is to maximize
expected reward
over a finite or infinite horizon.
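The history-based evaluation just described can be written out; this is a hedged reconstruction in standard Dec-POMDP notation, which may differ slightly from the slide:

```latex
V^{\vec{\pi}}(s, \vec{h}) \;=\; R\big(s, \vec{a}\big)
\;+\; \gamma \sum_{s' \in S} \sum_{\vec{o} \in \vec{\Omega}}
T\big(s' \mid s, \vec{a}\big)\, O\big(\vec{o} \mid \vec{a}, s'\big)\,
V^{\vec{\pi}}\big(s', \vec{h}\vec{o}\big),
\qquad \vec{a} = \vec{\pi}(\vec{h}),
```

where the joint action comes from each agent acting on its own history, and the sum over joint observations is where each agent's observation probability enters.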
So if there are any questions
about any of this stuff
at any point, feel
free to ask, and I will
try to clarify things.
Okay. So now, like I
was saying,
just to make things more realistic,
we can think about all of these
different types of uncertainty.
This model is very general:
any cooperative
multi-agent problem
that you can
think of, we can represent
using this framework.
It's a very common
framework for planning and
learning in multi-agent
environments.
The only more general framework
that is typically used, for
the competitive setting or
possibly competitive setting,
when each agent has its
own reward function,
is the partially
observable stochastic
game, which I'm not going
to talk about today,
although some of these methods
do apply in that case.
I'm happy to answer questions
about it if people want to
know, but I'm going to focus
on the cooperative case.
But this generality of
the representation means that we
have to consider the observability
and the other agents in the
solutions to these problems.
Okay. So that's essentially what
I'm going to talk about today is,
how do we learn solutions for
this model the Dec-POMDP model?
We want to learn solutions that
are scalable to large domains,
and then we also want to,
remind me what time I'm
supposed to finish. Sorry.
>> Perfect. Thank you. Then, how do
we integrate deep reinforcement
learning methods into
multi-agent reinforcement
learning domains,
scaling to large horizons as well?
So we'll talk about different
methods to do these things.
So first, we're going to focus
on decentralized learning.
So like I said,
using Dec-POMDP models is
a common framework for multi-agent
reinforcement learning.
Most of the methods though,
do centralize learning for
decentralized execution.
So all the learning is done
offline in a centralized fashion,
so they can generate a set
of policies which can then
be executed online in a
decentralized fashion.
But this is problematic in
a couple of different ways.
One, in order to really
continue to learn online,
you need to be decentralized,
because execution is decentralized.
So in order to learn online,
you need to learn in
a decentralized way,
which means that each of the
agents is continuing to learn
while the other agents
are also learning at the
same time.
This decentralized learning is
potentially more scalable
as well, because it
means that there's less information
that all the different agents
have to keep track of.
You keep track of your own
history information separately,
and just generate a
policy directly from
your own history information to
the actions that
you want to pick.
So this is nice because it's
potentially more scalable and we
can directly apply if we'd like,
naively at least, single-agent RL
methods to each of the
different agents in the domain.
But this is problematic because now
the problem is non-stationary
from the perspective
of a particular agent.
I'm learning while at the
same time you're learning.
So this is changing what it seems
like the environment is doing.
So we need to reason about
this non-stationarity,
or do something about
it, to
make these methods not
quite so naive and
work well in practice.
So the first method that we
came up with is a combination
of some of these ideas,
using some ideas from
multi-agent reinforcement
learning and
some ideas from deep
reinforcement learning,
combining them in a
way that allows
the methods to perform well
in these types of domains.
So the basic idea here is that first,
using the idea of hysteresis,
which was originally developed a
while ago for the non-deep case,
the tabular case of
multi-agent learning where we
now have two learning rates.
So instead of just having
one learning rate,
we have a learning rate that's
different depending on
what that TD error is.
So for the negative case,
when it seems like
something bad happened,
we use a smaller learning rate.
So that is, when
something bad happens,
we're assuming it's because
the other agent was exploring.
The other agents were doing
something stupid for some reason.
So we want to discount that.
We don't want to learn from that
quite as much, because that's
a random bad thing
that happened that we don't want
to really consider too much.
So we use a smaller
learning rate there
so we can be more optimistic,
so that we can hopefully coordinate
on doing the good thing.
So we want to learn
from these cases in
which all the agents do the
right thing at the right time,
and maybe not learn quite
as much from cases in which
there's random exploration
by any of the other agents,
which is going to cause a
negative TD error in these cases.
So we have these two
different learning
rates for these different cases.
So then, if we have
a positive TD error,
when we're changing in
the positive direction,
we use the regular learning rate.
Then when we're changing
in the negative direction,
we use a smaller learning rate.
So this is going to make it possibly
a little bit too optimistic
and can run into issues
with stochasticity.
But by adjusting these
rates over time,
we hope to be able to
converge to a good value.
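A minimal tabular sketch of the hysteretic update described above (the deep version replaces the table with a recurrent Q-network; the function name and the alpha/beta values are illustrative assumptions):

```python
import numpy as np

def hysteretic_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.95):
    """One tabular hysteretic Q-learning step (sketch; alpha/beta are
    illustrative). Positive TD errors use the full learning rate alpha;
    negative TD errors use the smaller rate beta, so bad outcomes -- assumed
    to come from teammates exploring -- are partially discounted."""
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]
    lr = alpha if td >= 0 else beta   # two learning rates: the core of hysteresis
    Q[s, a] += lr * td
    return td
```

The point is only the `lr` selection: good surprises are learned from at full rate, bad ones optimistically down-weighted.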
Then one common thing
that people do, well,
before we get there, I guess:
for the deep case,
we're going to build on DQN.
I think everybody knows what
DQN is in this audience.
Then again, a common
thing that's done,
I think Matthew is in
here somewhere as well.
He's back there.
So a common thing that
people also do, as
an extension to DQN or
any of these methods,
is to stick a recurrent layer in there.
This is going to
allow us to consider history
and deal with the
partial observability.
So basically, this first approach,
the hysteresis is going to help
us to deal with nonstationarity.
DQN, or deep methods
in general, is going
to help us with scalability
in the input space, so
now we can hopefully deal
with larger observation spaces,
and the recurrent layer,
using DRQN, is going to help
with the partial observability.
One other thing: in these cases,
when we're training
these recurrent layers,
we need to sample trajectories.
Previous methods have not used
a replay buffer for this,
because it caused
instability in learning:
if each of the agents is
pulling from a replay
buffer separately,
this causes the gradients to have
issues and causes the
learning to be unstable.
So in this case,
what we did was generate
replay buffers that were
synchronized across the agents.
We can sample random
seeds beforehand, and
then during
the learning phase
we can index by the
time, the episode,
and the agent, so that when we're
sampling the trajectories from
each of the different agents,
they're sampled from the
same time steps, and
we can be more robust
in our learning.
Again, we can do this
in a decentralized way.
This thing is called a
concurrent experience replay
trajectory, or CERT.
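A sketch of the synchronized-sampling idea behind CERTs (the class layout, the integer seeding scheme, and the sequence length are my assumptions, not the paper's code):

```python
import random

class CERT:
    """Sketch of a concurrent experience replay trajectory (CERT) buffer.
    Each agent keeps its own copy, but because all agents share a random
    seed sampled beforehand and their episodes have matching lengths, the
    sampled (episode, start-time) indices line up across agents with no
    communication at training time."""
    def __init__(self, seed=0):
        self.episodes = []   # this agent's episodes: lists of transitions
        self.seed = seed     # shared across agents, sampled beforehand

    def add_episode(self, episode):
        self.episodes.append(episode)

    def sample(self, batch_size, seq_len, step):
        # Seeding on (shared seed, training step) makes every agent draw the
        # same episode indices and the same starting time steps.
        rng = random.Random(self.seed * 1_000_003 + step)
        batch = []
        for _ in range(batch_size):
            ep = rng.randrange(len(self.episodes))
            t0 = rng.randrange(max(1, len(self.episodes[ep]) - seq_len + 1))
            batch.append(self.episodes[ep][t0:t0 + seq_len])
        return batch
```

Two agents constructed with the same seed will draw aligned minibatches even though neither ever sees the other's buffer.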
Okay. So using these methods,
we could compare our method
with the previous basic method:
without
hysteresis and without
the CERT, versus using
hysteresis and the CERT.
What we essentially
see in these cases is
that our method is scalable
to larger domains.
So you see what happens here.
On top are the
previous methods,
and this is a target capture domain.
Our two agents
have partial observability as to
the location of the other agent;
that red agent flickers.
We need to try and
catch that other agent.
If we're in the same square,
we get a reward of one.
So in these cases,
for small domains, they
get about the same value.
So three-by-three, four-by-four,
they're about the same.
But then when you start moving
to five-by-five without using
hysteresis and the CERTs,
learning becomes unstable.
So it doesn't do so well, and then
for six-by-six and seven-by-seven,
the learning doesn't really
happen for the other case.
Whereas in our case,
we're able to learn for even
larger domains
and get good values.
We can actually scale to
larger numbers of agents as well;
I just didn't include those results.
We also had in this paper a version
that tried to solve
multitask problems as well,
but again, I'm not including
those results either.
That method works really well.
It can solve those problems,
but it has some issues scaling up
to problems with long horizons.
In these long-horizon problems,
especially with multiple agents,
it's very unlikely that those
agents are going to coordinate on
doing something that's very
far out into the future.
So one thing that we can do which
I think Sheila mentioned as
well is we can use an
idea of macro-actions.
So in this particular
problem, for instance,
we're trying to coordinate
(for our domain, I should
change these things)
so that we deliver tools to
these different people, which
are a bunch of coauthors.
If we need to think
about it at a low level,
each of these agents
might need to move here,
come back, then go
over here and ask the PR2 for
something, and continue on.
The horizon of this problem is going to
get quite long quite quickly.
For this type of problem,
there's no coordination
at that level;
for navigation, there's
very little coordination
that needs to happen.
So we can think of the problem
where we break it up into
single-agent parts
and multi-agent parts.
For the single-agent parts,
we can have a hierarchical
kind of method, where
the single-agent part can be
navigation, independent
of the other agents.
We can have low-level collision
avoidance that they can learn,
but really it's just navigating
from one location to
another location.
We don't need to do multi-agent
reinforcement learning at that level.
We need to do multi-agent
reinforcement learning at
the level of how they coordinate,
but we don't need to do multi-agent
reinforcement learning over
how they navigate from
point A to point B.
So we can build macro-actions, which
are in our case a type of
option that we can have
for the different agents.
We can build those
lower-level macro-actions
and then do learning
at the higher level over
those independent single-agent
macro-actions
in order to scale to
these large domains.
Sometimes these are given
to us in advance, like
navigation controllers for robots or
grasping controllers for robots,
these kinds of things.
There are already very good solutions
for those types of problems,
and we can just use those directly;
other times, we can talk about
trying to learn those as well.
We call this a macro-action
Dec-POMDP (MacDec-POMDP),
and it has
slightly different notation,
so I have a slide on that slightly
different notation as well.
So now a macro-action is M, and
a high-level observation,
a macro-observation, is Z, and
then our policy representation
will be over those M's and Z's
instead of the A's and O's that
we had in the previous case.
Otherwise, the representation
is essentially the same;
the evaluation gets more
complicated now because
it depends on time.
I'm not going to go into
this in too much detail,
but I'm happy to talk about it if
people want. Technically,
the evaluation becomes a
semi-Markov decision process,
because we have to reason
about how much time things take.
So a macro-action taken in a
particular state s
will complete after
time step k and end up in
state s',
so it becomes a semi-Markov process
rather than a traditional
Markov process.
But this is the general model
that you get in this case.
So now that this idea of
macro-actions exists,
how do we do learning
in these cases?
All the current
deep multi-agent reinforcement
learning methods
assume synchronized
primitive actions;
none of them make use of
these asynchronous
macro-actions.
So it isn't clear how to incorporate
this idea of asynchronous actions
into the deep MARL methods.
So let's think about how
we'd like to do this.
So the basic idea is
that in this case,
we will make the assumption that we
only get the information
at the macro-action-level.
At each time step, we
get information about
which macro action each of
the agents is executing,
or, in the decentralized case,
which macro action that
particular agent is
executing, the particular observation
it sees, and the joint reward.
We get this information at
every primitive time step.
So at the first time step,
Agent 1 gets this information and
Agent 2 gets this information:
the particular
macro observation,
the macro action that it's currently
executing, whether there's
a new observation, and whether
it's terminated or not;
each of the different
agents gets this information.
Sorry, I'm ignoring
the people over here;
I should point to stuff
over there as well. Then this continues:
at the next time step we
get that information,
and at the next time step we
get this information.
Like I said, I'll walk over here.
So we're accumulating
this information
per agent in terms of the rewards;
that's why the sum is happening.
For each agent, you're
accumulating the reward
while the macro action is
continuing, and then when the
macro action terminates,
you no longer accumulate
that reward, and you start
accumulating the next reward
from the next time step.
So we can generate this trajectory,
this is the trajectories of
the agents that are
generated over time at
the macro action level and we
have all this information
for these ten time steps.
So we can generate that,
and then from there,
we can get what we call
Mac-CERTs,
just an extension of
the original CERT idea.
So in order to train
the recurrent network,
we have to sample this sequence
from these trajectories.
So maybe we sample this
particular set here
between three and eight and
then from there now for each
of the different agents,
we'll only have a particular view.
So Agent 1 will only have a view
of its particular trajectory,
Agent 2 will have its
particular trajectory.
Then in this case,
we need to identify
when the macro actions change.
We don't need to have
all the information about
all these time steps
(we could, potentially), but
we're identifying when
the macro actions change.
Then in our case, we're going
to throw away the time info.
We're just going to
compress them so that
we're ignoring the time info.
You could imagine you keep
the time information,
but in our case we're
throwing it away.
So now we just have
the macro action info
for what happens within
those time steps.
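A rough sketch of that squeezing step for one agent (the tuple layout and function name are my assumptions, not the paper's code): accumulate reward while a macro action runs, emit one compressed entry when it terminates, and drop the time info.

```python
def squeeze_macro_trajectory(steps):
    """Compress one agent's primitive-time-step trace into macro-action-level
    experience (a Mac-CERT entry sketch). Each step is an assumed tuple
    (macro_obs, macro_action, reward, terminated). Rewards are accumulated
    until the macro action terminates; time info is thrown away."""
    out, acc, start_obs = [], 0.0, None
    for obs, m, r, done in steps:
        if start_obs is None:
            start_obs = obs          # observation when the macro action began
        acc += r                     # accumulate reward while m continues
        if done:                     # macro action terminated this step
            out.append((start_obs, m, acc))
            acc, start_obs = 0.0, None
    return out
```

So a ten-step trace with two macro actions compresses to just two entries, one per macro action.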
So once we have this,
now we can essentially just throw
this into the method that
we talked about earlier.
So the Dec-HDRQN
(there are too many letters there)
is the method that we
talked about before,
the hysteretic deep
recurrent Q-network
that happens to also
be decentralized.
We can feed the macro-action-level
information directly
into that algorithm so that
we can continue to learn
in a decentralized way.
So these are all calculated
in a decentralized way,
I'm again ignoring
the folks over there,
so that we can get
this information for
Agent 1 and then this
information for Agent 2.
We can put those into the
previous algorithm, and that
gives us this particular loss
function (this is
the Double DQN version,
but it doesn't matter that much).
So we can put it in the loss
function in order to try
and learn the Q-functions
in that case.
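The target inside that loss can be sketched as a Double-DQN-style target at the macro-action level (the function and array names are illustrative assumptions): the online network picks the next macro action, and the target network evaluates it.

```python
import numpy as np

def double_q_target(r, gamma, q_online_next, q_target_next):
    """Double-DQN-style bootstrap target (sketch). q_online_next and
    q_target_next are the two networks' Q-value vectors over the next
    macro actions; selecting with one and evaluating with the other
    reduces overestimation."""
    a_star = int(np.argmax(q_online_next))      # online net selects
    return r + gamma * q_target_next[a_star]    # target net evaluates
```

The loss is then the squared difference between this target and the online network's Q-value for the macro action actually taken.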
So this is the
decentralized version.
The decentralized version is actually
simpler than the centralized version.
For the centralized version,
the idea is that we're assuming
now that we have
perfect communication;
we're going to just do
centralized learning here.
Centralized learning
is potentially useful
in cases when we do in
fact have full communication online
as well as offline, or there are a
bunch of methods that try to
use the centralized
values in the interim
to learn better
decentralized values that can
then be executed online.
So in this case,
we can build the buffer
in a centralized way.
We're going to do
the same thing as before,
but here directly
identifying when
the macro actions end, which we can
do at the beginning or the end;
the buffer is the same.
It's just that we only have
one joint reward
that we're accumulating here,
rather than the different rewards for
each agent which we have in
the decentralized case.
The trick here is that we
identify when the macro action
ends for any agent.
This is the problem:
in the decentralized case,
it's clear what
termination means;
if I terminate, then I
stop accumulating my reward.
But in the centralized case,
it's not clear, because
it's asynchronous.
It's really unlikely
that both agents
terminate at the same time;
it's almost always going to be one agent
terminating while the other doesn't.
So you have to figure out
when you decide that,
otherwise you just have
one big action.
You have to break it up into
particular subsets and make
sure that you're not
over-counting the reward.
So what we do
here is assume that
a macro action terminates when any
agent's macro action terminates.
So here Agent 2's
macro action terminated,
so this becomes one
macro action.
Then after this, both of
the agents terminated,
so this is another chunk, right? And
then on the next step
Agent 2 terminated,
so that's another chunk.
So we can break it up in
that sort of way in order
to deal with the asynchronous
actions that we have here.
So we can generate that into
a centralized buffer in
this particular case, where
we remove the time info,
again just like we did before.
Now, we can learn in
a centralized way
using these centralized experiences.
In this case, it's
just the Double DRQN that we
use for learning.
There's one other thing
we can use
in the centralized case,
when we're doing the argmax.
What happens here is that,
remember, not all
agents get to change
their macro actions at each of
those steps, because what we
know from these trajectories is that
only some of the agents
stopped, and therefore
only some of the agents
were able to change
what they were doing;
the problem is asynchronous.
So the basic idea of
this conditional target
value prediction is we fix
the macro actions for the
agents that don't change and we
only allow the argmax to go
over agents that do change.
So we only consider agents
that are able to change
the macro action at that
particular time step
based on the trajectory
information that we have.
Lots of replay buffers,
but the idea is relatively
straightforward I think.
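A minimal sketch of that conditional argmax, assuming a tabular joint Q-function (all names here are illustrative, not the paper's code):

```python
import itertools

def conditional_argmax(q_values, prev_joint, changed, action_sets):
    """Conditional target-value prediction sketch: agents whose macro action
    did not terminate keep their previous macro action fixed, and the argmax
    only ranges over the agents that were able to change. q_values maps a
    joint macro action (tuple) to a value; `changed[i]` says whether agent
    i's macro action terminated at this step."""
    candidates = []
    for joint in itertools.product(*action_sets):
        # Keep only joint actions consistent with the non-terminating agents.
        if all(c or joint[i] == prev_joint[i] for i, c in enumerate(changed)):
            candidates.append(joint)
    return max(candidates, key=lambda j: q_values[j])
```

Without the constraint, the target could credit a joint action the non-terminating agent was not actually free to take at that step.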
Okay. So once we have this idea,
now we can talk about results
and obviously we want to try
and compare on the
previous domain as well.
So this is the previous domain,
the target capture domain.
In this case, we didn't learn
the macro actions or
the macro observations;
we set them for these problems.
In this case it's
pretty straightforward
to set the original
actions to just up,
down, left, right, and stay.
For the macro actions:
you get an observation of
the target, which flickers,
so the macro action is to move
towards the last observed
location of the target.
That seems like a pretty straightforward
thing to do, right?
If we do that,
the problem is
so simple, for a
four-by-four and a 10-by-10
version of the problem,
that in the macro-action case,
there's very little learning that
even needs to be done, because
the macro-actions are pretty good
and the problem is pretty simple.
So in both of these cases,
the primitive version and
the macro-action version
learn the same thing, but
the macro-action version, as
you can imagine, is going
to learn it much faster.
But then we can think about
more complicated problems.
This is a box pushing type problem
where these are way points
that the agents can move to.
We have these two robots.
There's a bigger box that they
can coordinate to push on
or there's these smaller boxes
that they can push independently.
So they have to both push this
together at the same time
in order to move the box,
and they get a reward for moving
it to the goal at the top.
So we can look at the
decentralized case
here where we have the primitive-actions
and the macro-actions.
For this case, again,
in the macro-action case
we can move to these
different way points,
we can push these kind of things,
and in the primitive case it's up,
down, turn, straight,
push, things like that.
The values here are for the
10-by-10 and 30-by-30 cases,
where
the primitive-action version
learns a little bit, but very
slowly, and does not
do very well at all.
Whereas the macro-action version
can learn relatively fast,
and this is the optimal value,
the dashed line up here.
So it can pretty quickly
get to the optimal
value for this problem.
Then we can compare it
with a centralized method.
As you would imagine, the
centralized method is going to do
better than the decentralized method
in this particular problem,
because it has more information
and can coordinate explicitly.
So the centralized method learns
even faster than the decentralized
method for this problem.
We can look at a more
complicated problem.
This is the one-human
version of the problem.
We looked at a couple of
different versions, but now this
is a multi-robot problem,
or it will be at least a
multi-robot problem,
where we have a robot
at a desk that can try to find
objects that are on the desk.
Then these are two delivery robots,
similar to the picture
that I showed earlier.
There's a human in a
workshop here that's doing
some task, and we need to
figure out
which objects to get to
the person at different times
based on what their tasks are.
So we need to monitor
them and bring them
the right objects at
the right time in
order to make sure that they can work
efficiently and complete all
the tasks they need to do.
Then we compared our
decentralized method
and our centralized method
for this particular problem.
The centralized method
could learn pretty
quickly and converge to
a near optimal policy,
while the decentralized method
didn't do so well on
this particular problem.
But the reason why it didn't do so
well isn't necessarily a
problem with the method itself,
but rather a problem
with the problem itself.
The problem is just really hard,
and the information that you
get in the decentralized case
is not enough
to coordinate well.
In this particular case,
the information that
the Fetch robot had
is just not
enough, and the data that it
got wasn't enough for it to
figure out what
the right objects were to give the
particular humans.
So there's certainly problems
in which there isn't a good
decentralized solution that you can
get in these cases whereas
obviously the centralized solution
will be better. Okay.
So then we ran an extension of
our method on a real robot task.
This is the real robot task,
where we have our Fetch
robot and our two TurtleBots.
There's a human worker
over there that
needs to be monitored.
This is actually an extension
of the method that I showed
on the previous slide.
So it's not quite the centralized
or decentralized method;
it's pretty similar to
the centralized solution,
where they're monitoring the
person and bringing the person
the first object or the first
tool that the human needs.
The human's trying to build this
table and needs the tape measure,
and then the clamp,
and then the drill,
in order to be able to build
this table in the most
efficient manner.
This is obviously a simplified
version of the problem; you can imagine
many humans and many more robots
in a much more complicated setting,
but this stuff is hard
to get working on
a set of robots, so you
want to start simple.
So here you see that it gave
the person the tape measure,
and now the next robots
getting the clamp.
So it'll bring the clamp to
the person while the fetch is
giving the last tool to
the other robot there.
So the clamp gets
brought to the human here,
and they can clamp the table;
the table is
sufficiently clamped, I guess.
Then they can finally have
the drill there, and they can
use the drill to finish
drilling all the screws into
the table in order to finish
the beautiful table.
So I think I'm quickly
running out of time,
so I will skip this method I think.
I think I have five minutes left.
So I'll skip this method.
It's a non-deep method for
trying to learn controller-based
representations for
the decentralized
macro-action approach.
So I'll skip this, even
though it does have
a cool robot video
that I wanted to show.
You can look on my web page for
the robot video. I'll just say
this is search and rescue, where
there's an aerial vehicle and
a bunch of ground vehicles.
The air vehicle can see
more stuff than the ground
vehicles, obviously,
but the ground vehicles have to bring
things to the people
and rescue the people.
So they need to coordinate and
communicate with each other, with a
limited communication range, in
order to figure
out where the people are,
get to them most effectively,
and rescue them most effectively.
So that's a particular domain
that we are looking at there.
Maybe for this audience
I don't need this slide,
but I like this slide.
People often ask about or often
try to use deep RL.
Nowadays it's quite popular
for trying to solve many
different problems,
and it's helpful in some ways,
but, especially
in these partially observable
multi-agent problems,
it certainly doesn't solve all the problems.
So there are a number of
big issues, some of
which have been studied in
the past and some of which there
aren't good solutions for;
there are a bunch of big
issues that still need to be
solved to be able to
deal with solving these
large, interesting problems.
One of which is centralized
versus decentralized learning.
So the stuff that I was talking about
today was the decentralized
learning case,
which is the hardest case.
All the agents
are acting online
based on very limited
information that only they have.
It's quite hard to
learn in that case.
So what are the best
methods that we can use?
How do we use those signals in
a better way, to do things
that are more sophisticated
than hysteresis, to
be able to deal with
the non-stationarity that we
get when we're doing
decentralized learning?
In the centralized learning case,
what's the best way of balancing off
the centralized information
in order to learn well?
So there's a bunch of methods
that are popular for using
centralized value functions to help
learn decentralized solutions.
But that can often be a bad thing.
The centralized value function
can make you do stuff
that you just can't do in
the decentralized case.
So we need to think
more about what the best methods
are to use in that case as well.
Obviously, again, the
deep methods are popular,
but they don't always have very
good sample efficiency.
So how do we use them
online in order to be able
to learn quickly? And then,
dealing with partial observability:
just using LSTMs in
there is an easy solution,
but it's probably not
the best solution.
How can we better handle
partial observability,
and come up with other structures,
in order to be able to learn
well and deal with the
partial observability
that exists in these
different domains?
So I will conclude there by
saying that for multi-agent
reinforcement learning problems,
one of the most general
representations is
this Dec-POMDP, the decentralized POMDP,
which considers the outcome,
sensor, and communication
uncertainty in these
domains, so we can
model any cooperative multi-agent
coordination problem this way.
So I talked about a
couple of different methods:
one, just integrating deep learning
with multi-agent
reinforcement learning under
this framework, and then using
macro-actions and abstraction to
improve scalability
and the horizon,
using learning methods.
These methods also
apply in sub-classes as well,
and some of them even
in the super-class.
So if we don't have
uncertainty, or if we have
different reward functions, some
of these methods can still work.
These are some methods that
we've started to develop,
but like I said,
there's lots of cool open questions
that still need to be solved.
I'm sure Sam and his team
will solve many of them,
but there will still be
plenty open for other people.
So I'm happy to talk about
them as well afterwards
if people have questions.
