Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér.
Reinforcement learning is a learning technique
in which an agent chooses a sequence of actions
in an environment to maximize a score.
This class of techniques enables us to train
an AI to master a large variety of video games
and has many more cool applications.
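If you would like to see what this looks like in code, here is a minimal sketch of the basic reinforcement learning loop. The Gymnasium library, the CartPole environment, and the random policy below are my own illustrative choices, not the setup from the paper; a real agent would replace the random action with a learned policy.

```python
import gymnasium as gym

# Minimal sketch of the reinforcement learning loop:
# the agent repeatedly observes the environment, picks an action,
# and tries to maximize the total score (cumulative reward).
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    # A real agent would pick actions from a learned policy;
    # here we simply sample at random as a placeholder.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Score for this episode:", total_reward)
```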
Reinforcement learning typically works well
when the rewards are dense.
What does this mean exactly?
This means that if we play a game and die
immediately after making a mistake, it is
easy to identify which of our actions was
the mistake.
However, if the rewards are sparse, we are
likely playing something that is akin to a
long-term strategy planner game.
If we lost, it is possible that we were outmaneuvered
in the final battle, but it is also possible
that we lost the game way earlier due to building
the wrong kind of economy.
There are a million other possible reasons,
because we get feedback on how well we have
done only once, and long after we have chosen
our actions.
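To make the contrast concrete, here is a toy sketch of dense versus sparse reward signals. The two reward functions and the state fields are purely hypothetical, just to illustrate the difference in how often feedback arrives.

```python
# Toy illustration of dense versus sparse rewards.
# The state fields ("dead", "won") are hypothetical placeholders.

def dense_reward(state, action, next_state):
    # Dense: feedback after every single action, e.g. dying right
    # after a bad move gives an immediate negative signal.
    return -1.0 if next_state["dead"] else 0.1

def sparse_reward(episode_states, episode_actions):
    # Sparse: feedback only once, at the very end of a long game.
    # Nothing tells us which of the thousands of earlier actions
    # was responsible for the loss.
    return 1.0 if episode_states[-1]["won"] else 0.0
```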
Learning from sparse rewards is very challenging,
even for humans.
And it gets even worse.
In this problem formulation, we don't have
any teacher that guides the learning of the
algorithm, and we have no prior knowledge of the environment.
So this problem sounds almost impossible to
solve.
So what did DeepMind's scientists come up
with to at least have a chance at approaching
it?
And now, hold on to your papers, because this
algorithm learns like a baby learns about
its environment.
This means that before we start solving problems,
the algorithm is unleashed into the
environment to experiment and master basic
tasks.
In this case, our final goal would be to tidy
up the table.
First, the algorithm learns to activate its
haptic sensors and control its joints and fingers;
then, it learns to grab an object, and then
to stack objects on top of each other.
And in the end, the robot learns that
tidying up is nothing but a sequence
of these elementary actions that it has already
mastered.
The algorithm also has an internal scheduler
that decides which action to master
next, while keeping in mind that the
goal is to maximize progress on the main task,
which, in this case, is tidying up the table.
And now, onto validation.
When we are talking about software projects,
the question of real-life viability often
emerges.
So, the question is, how would this technique
work in reality, and what better ultimate
test could there be than running it on a real
robot arm?
Let's look here and marvel at the fact that
it easily finds and moves the green block
to the appropriate spot.
And note that it had learned how to do it
from scratch, much like a baby would learn
to perform such tasks.
And also note that this was a software project
that was deployed on this robot arm, which
means that the algorithm generalizes well
to different control mechanisms, a property
that is highly sought after when talking
about intelligence.
And if earlier progress in machine learning
research is indicative of the future, this
technique may learn how to perform backflips and play
video games at a superhuman level within two
follow-up papers.
I cannot wait to see that, and I'll be here
to report on that for you.
Thanks for watching and for your generous
support, and I'll see you next time!
