Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
Reinforcement learning is a learning algorithm
that we can use to choose a sequence of actions
in an environment to maximize a score.
There are many applications of such learners,
but we typically cite video games because
of the diverse set of challenges they can
present the player with.
And in reinforcement learning, we typically
have one task, like learning backflips, and
one agent that we wish to train to perform
it well.
This work is DeepMind's attempt to supercharge
reinforcement learning by training one agent
that can do a much wider variety of tasks.
Now, this clearly means that we have to acquire
more training data and also be prepared to
process all this data as effectively as possible.
By the way, the test suite you see here is
also new; typical tasks in this environment
involve pathfinding through mazes, collecting
objects, finding keys to open their matching
doors, and more.
And every Fellow Scholar knows that the paper
describing its details is, of course, available
in the description.
This new technique builds upon an earlier
architecture that was also published by DeepMind.
This earlier architecture, A3C, unleashes a
bunch of actors into the wilderness, each
of which gets a copy of the playbook that
contains the current strategy.
These actors then play the game independently,
and periodically stop to write what worked
and what didn't back into this playbook.
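For the more curious Fellow Scholars, here is a tiny sketch of this pattern in Python. Note that this is a toy illustration with made-up names and a made-up score, not DeepMind's actual code: each actor copies the playbook, plays on its own, and writes its lesson, a gradient, back into the shared copy.

```python
import random
import threading

# A toy sketch of the A3C pattern (all names are illustrative): several
# actors each grab a copy of the shared parameters, the "playbook", play
# on their own, and periodically push a lesson, here a noisy gradient
# estimate, back into the shared copy.

class SharedParams:
    """The shared playbook, updated under a lock."""
    def __init__(self, n):
        self.values = [1.0] * n
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return list(self.values)

    def apply_gradient(self, grad, lr=0.05):
        with self.lock:
            for i, g in enumerate(grad):
                self.values[i] -= lr * g

def noisy_gradient(params):
    # Stand-in for "play an episode and work out what helped": the gradient
    # of a toy loss sum(p^2), plus noise from the randomness of play.
    return [2.0 * p + random.gauss(0.0, 0.1) for p in params]

def actor_loop(shared, updates=200):
    for _ in range(updates):
        local = shared.snapshot()        # copy the current playbook
        grad = noisy_gradient(local)     # lesson from independent play
        shared.apply_gradient(grad)      # write the lesson back

shared = SharedParams(n=4)
actors = [threading.Thread(target=actor_loop, args=(shared,)) for _ in range(4)]
for t in actors:
    t.start()
for t in actors:
    t.join()
print("learned parameters (should approach zero):", shared.values)
```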
With this new IMPALA architecture, there are
two key changes. One, in the middle, we have
a learner, and the actors don't share what
worked and what didn't with this learner;
they share their raw experiences instead.
Later, the centralized learner draws the
proper conclusions from all this data.
Imagine if each football player on a team
tried to tell the coach the things they tried
on the field and what worked.
That is surely going to work at least okay,
but instead of these conclusions, we could
aggregate all the experience of the players
into some sort of centralized hive mind, and
get access to a lot more, and higher-quality,
information.
Maybe we will see that a strategy only works
well if executed by the players who are known
to be faster than their opponents on the field.
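In code, this first change might look something like the toy sketch below; the names and the two-action game are mine, not DeepMind's. The important part is that the actor records what happened, including the probabilities of the strategy it was using at the time, and ships that raw experience off, leaving all the conclusions to the learner.

```python
import random

# A toy illustration of what an IMPALA-style actor sends to the learner:
# the raw experience itself, together with the probabilities of the policy
# that generated it, rather than any conclusions drawn from it.

def play_episode(policy_probs, steps=5):
    trajectory = {"actions": [], "rewards": [], "mu": []}
    for _ in range(steps):
        action = random.choices([0, 1], weights=policy_probs)[0]
        reward = 1.0 if action == 1 else 0.0           # a toy two-action game
        trajectory["actions"].append(action)
        trajectory["rewards"].append(reward)
        trajectory["mu"].append(policy_probs[action])  # behaviour policy prob
    return trajectory

# This dict is sent to the learner as-is; computing gradients, the
# "conclusions", is now entirely the learner's job.
print(play_episode([0.5, 0.5]))
```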
The other key difference is that with traditional
reinforcement learning, we play for a given
number of steps, then stop and perform learning.
With this technique, the playing and the
learning are decoupled, so it is possible
to create an algorithm that performs both
of them continuously.
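Here is a minimal sketch of this decoupling, again with toy stand-ins rather than DeepMind's real system: the actor and the learner run as two independent loops connected only by a queue of experience, so neither of them ever has to stop and wait for the other.

```python
import queue
import threading
import time

# A toy sketch of decoupled, continuous acting and learning: the actor
# never pauses to wait for a learning step, and the learner never pauses
# to wait for play to finish.

experience = queue.Queue(maxsize=32)
stop = threading.Event()

def actor_loop():
    step = 0
    while not stop.is_set():
        try:
            # Keep playing and handing over experience, without ever
            # stopping for a learning phase.
            experience.put(("trajectory", step), timeout=0.1)
            step += 1
        except queue.Full:
            pass

def learner_loop():
    consumed = 0
    while not stop.is_set():
        try:
            experience.get(timeout=0.1)   # keep learning continuously
            consumed += 1
        except queue.Empty:
            pass
    print("trajectories consumed:", consumed)

threads = [threading.Thread(target=actor_loop),
           threading.Thread(target=learner_loop)]
for t in threads:
    t.start()
time.sleep(0.5)
stop.set()
for t in threads:
    t.join()
```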
This also raises new questions, so make sure
to have a look at the paper, specifically
the part about the new off-policy correction
method called V-trace.
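For reference, the V-trace value targets from the paper can be written down quite compactly. Here is a NumPy sketch of that computation, leaving out the optional lambda parameter from the paper; the example trajectory at the end is made up for illustration.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets from the IMPALA paper (Espeholt et al., 2018).

    rewards, values, rhos are arrays of length T for one trajectory, where
    rhos[t] = pi(a_t | x_t) / mu(a_t | x_t) is the ratio between the
    learner's current policy pi and the behaviour policy mu that actually
    generated the experience."""
    T = len(rewards)
    values_next = np.append(values[1:], bootstrap_value)
    clipped_rhos = np.minimum(rho_bar, rhos)  # weights the TD errors
    cs = np.minimum(c_bar, rhos)              # clips the influence of later steps
    deltas = clipped_rhos * (rewards + gamma * values_next - values)

    # Work backwards through the trajectory using the recursion
    # v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs

# A made-up 3-step trajectory where the learner's policy already differs
# from the behaviour policy that collected the experience.
rewards = np.array([0.0, 0.0, 1.0])
values  = np.array([0.1, 0.2, 0.5])
rhos    = np.array([1.2, 0.8, 1.5])
print(vtrace_targets(rewards, values, bootstrap_value=0.0, rhos=rhos))
```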
When tested on 30 of these levels and a bunch
of Atari games, the new technique was typically
able to double the score of the previous A3C
architecture, which was also really good.
And at the same time, this is at least 10
times more data-efficient, and its knowledge
generalizes better to other tasks.
We have had many episodes on neural network-based
techniques, but as you can see, research on
the reinforcement learning side is also progressing
at a remarkable pace.
If you have enjoyed this episode, and you
feel that 8 science videos a month is worth
a dollar, please consider supporting us on
Patreon.
You can also pick up cool perks like early
access to these episodes.
The link is available in the video description.
Thanks for watching and for your generous
support, and I'll see you next time!
