In this work, we focus on robotic manipulation with deep reinforcement learning from pixels.
Deep RL is difficult to apply in robotics
because it requires large amounts of data and a precisely specified reward signal.
Our framework addresses these limitations.
The central idea is to store and reuse historical data when learning a new skill.
The second key ingredient is specifying the reward function through efficient human annotations.
Combining hundreds of hours of historical data with a learnt reward function
enables us to harness the power of batch RL.
Our framework consists of the following steps, which we present in turn.
Suppose we want to teach a robot to stack blocks.
To collect data for deep RL, we record everything that a single robot ever does:
lifting, sorting, and moving tasks, random policies and even failed experiments.
All of this off-task data is essential for the final performance.
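To make this concrete, here is a minimal sketch of what such an experience store could look like. All names and the storage format are hypothetical illustrations, not the actual system:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Step:
    """One timestep of robot experience."""
    pixels: np.ndarray    # camera frame, e.g. shape (128, 128, 3)
    proprio: np.ndarray   # joint positions and velocities
    action: np.ndarray    # commanded action


@dataclass
class Episode:
    """A full trajectory, kept no matter which task (or failure) produced it."""
    task_hint: str                              # e.g. "sorting", "random_policy"
    steps: list = field(default_factory=list)   # list of Step


class ExperienceStore:
    """Append-only log of everything the robot has ever done."""

    def __init__(self) -> None:
        self.episodes: list = []

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def all_frames(self):
        """Iterate over every stored frame, so a reward model can label it later."""
        for episode in self.episodes:
            yield from episode.steps
```

The key design point is that the store is task-agnostic: episodes are never discarded just because they came from a different task or a failed run.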
First, a human teleoperates the robot to demonstrate the desired behaviour.
The next step is to specify the task through reward annotations.
We select a few episodes: some demonstrated trajectories and some drawn from the historical data.
For each episode, the annotator draws a “sketch”: a per-frame curve indicating progress towards the target task.
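Conceptually, one annotation could be represented as follows. This is a hypothetical format for illustration only; in practice the curve is drawn interactively over the episode's video:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RewardSketch:
    """An annotator's per-frame progress curve for one episode and one task."""
    task: str              # e.g. "stack_green_on_red"
    episode_id: int
    progress: np.ndarray   # one value in [0, 1] per frame:
                           # 0 = no progress, 1 = task solved
```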
Once we have annotated a few examples, we use them to train a reward model.
This model is then applied to label all of the historical data with rewards for the stacking task.
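A minimal sketch of this step is shown below, assuming each frame has already been encoded into a fixed-size embedding by some vision network. The architecture, names, and loss (plain MSE regression onto the sketched progress values) are illustrative simplifications, not the actual training procedure:

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a frame embedding to a scalar progress value in [0, 1]."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding).squeeze(-1)


def fit_reward_model(embeddings: torch.Tensor,
                     sketched: torch.Tensor,
                     steps: int = 1000) -> RewardModel:
    """Regress a model onto the annotator's per-frame progress values."""
    model = RewardModel(embeddings.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(embeddings), sketched)
        loss.backward()
        opt.step()
    return model
```

With a trained model, labelling the historical data is a single forward pass over every stored frame, e.g. `rewards = model(all_embeddings)`.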
We now have hundreds of hours of videos with rewards.
We use this data to do batch RL, which is a completely offline process.
Choosing batch RL enables us to train policies without a real robot in the loop.
The diverse historical data is essential for achieving good performance in this off-policy setting.
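The agent itself could be any off-policy algorithm; the sketch below shows only the core offline TD update, in which every transition is sampled from the fixed, reward-labelled dataset rather than from a live robot. All interfaces here are hypothetical:

```python
import torch


def batch_rl_step(q_net, q_target, policy, batch, opt, gamma: float = 0.99):
    """One offline TD update; `batch` is sampled from the fixed dataset.

    `batch` holds tensors keyed by obs, action, reward (produced by the
    learnt reward model), next_obs and done; no robot is queried here.
    """
    with torch.no_grad():
        next_action = policy(batch["next_obs"])
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * q_target(
            batch["next_obs"], next_action)
    loss = torch.nn.functional.mse_loss(
        q_net(batch["obs"], batch["action"]), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```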
After the first pass through these steps, the resulting policy may not be very good.
We then repeat the process, collecting more episodes and reward annotations.
With each iteration, the performance and robustness of the policy improve.
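Put together, one iteration of the loop might look like the following schematic sketch, in which every function name is a placeholder:

```python
def run_iteration(store, robot, annotate, fit_reward_model, batch_rl):
    """One round of the loop: annotate -> relabel -> batch RL -> collect."""
    sketches = annotate(store.sample_episodes())   # human draws reward sketches
    reward_model = fit_reward_model(sketches)      # learn the reward function
    dataset = store.label_all(reward_model)        # relabel all historical data
    policy = batch_rl(dataset)                     # fully offline policy training
    store.add(robot.run(policy))                   # fresh episodes for next round
    return policy
```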
The robot successfully accomplishes the goal even under adversarial human perturbations.
Moreover, the generalisation ability of deep RL
allows the robot to deal with novel objects, recover from mistakes, and start from unseen initial conditions.
Reinforcement learning also allows
the agent to become faster than the human demonstrations it observed during training.
The framework is very general and can solve other tasks as well.
By executing the same pipeline again,
the robot learns to lift unseen and deformable objects without any feature or reward engineering.
