
Hi again! Welcome to tutorial 4 of the reinforcement learning day of Neuromatch Academy 2020.
My name is Marcelo Mattar. And in this tutorial, I'm going to tell you about model-based
reinforcement learning.
Now, this is one of my favorite topics ever, as I think it's a great framework for understanding a variety of
fascinating topics in neuroscience and cognitive science, such as planning, memory replay,
memory consolidation & even dreaming. So for this reason, this is actually the topic that
I study in my own lab. So here's a high-level idea.
So far,
we've seen how agents learn to act by acting in the world. So, it's like trial and error.
They act. They experience the outcomes of their actions, and then they learn from those outcomes.
So, these models are called model-free, because they do not require a model. And in fact,
that's the case for both Q-learning and SARSA algorithms that we just learned in the previous tutorial.

Now in this tutorial, we will learn about model-based methods, which compute actions by
a process called planning. This means that instead of computing values from experience,
planning will allow the agent to compute actions from a model.
Well, wait a second. What is a model?
A model is a representation of how the world might respond to the agent's actions. In some way,
you can think of a model as a representation of the environment that lives inside the agent.
So formally, you can define a model as a mathematical object such that, given a state and an action,
it outputs a prediction of the resultant
state and reward.
Just like the environment works,
except that it's inside the agent. And the advantage of having a model is that
you can use it to mimic or simulate real experience.
But if you can simulate experience, you can then learn from those simulations.
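
Just to make this concrete, here is a minimal sketch of what such a model could look like in code, assuming a small, deterministic, tabular environment; the names and numbers are purely illustrative, not the tutorial's actual notebook code.

# A tabular model: a lookup table from (state, action) to (reward, next_state),
# mirroring what the real environment returns, except that it lives inside the agent.
model = {}

def update_model(state, action, reward, next_state):
    # Remember how the world responded to taking `action` in `state`.
    model[(state, action)] = (reward, next_state)

def simulate(state, action):
    # Query the model instead of the real environment.
    return model[(state, action)]

# After a real transition such as (state=0, action=1) -> (reward=0.0, next_state=3) ...
update_model(0, 1, 0.0, 3)
# ... the agent can replay that experience offline, without acting in the world again:
reward, next_state = simulate(0, 1)

Here the model is deterministic; for a stochastic environment you would instead store counts or probabilities over the possible outcomes.
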
So think about it this way: in model-free reinforcement learning, the agent uses real
experience to compute values and, ultimately, a policy, via a process called learning.
This is just like we've seen in all previous tutorials.
In model-based reinforcement learning, however,
the agent instead uses simulated experience
which is produced from a model, to compute those values. The process of computing values, and
ultimately a policy, from simulated experience is what we call planning.
But, it turns out that planning and learning are not necessarily mutually
exclusive. It's fairly easy to program an agent that can implement both
things concurrently and this diagram actually illustrates exactly that.
If you focus on the right inner loop here, we see that
the agent has a value function indicating the values of different actions, and
using those values, the agent can then select and execute an action
producing real experience. And
then, this real experience can be used in two ways.
First, it can be used in the usual way, using model-free learning algorithms such as Q-learning & SARSA,
to teach the animal, or the agent, about action values.
But the agent can also use real experience in a different way. For instance, the agent can learn about
how the world works: how it evolves, how it responds to each action, and so on.
In other words, the agent can learn a model.
And with a model available,
the agent can then use planning to produce its own simulated experiences and then to learn from them.
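
In code, this dual use of a single real transition might look roughly like the sketch below; the names, sizes, and parameter values are illustrative assumptions rather than the notebook's exact implementation.

import numpy as np

n_states, n_actions = 5, 2            # illustrative sizes
alpha, gamma = 0.1, 0.9               # illustrative learning rate and discount factor
q_values = np.zeros((n_states, n_actions))   # the agent's value function
model = {}                                    # (state, action) -> (reward, next_state)

def q_update(s, a, r, s_next):
    # Use 1: direct, model-free learning (a Q-learning update),
    # which works the same for real and simulated transitions.
    q_values[s, a] += alpha * (r + gamma * np.max(q_values[s_next]) - q_values[s, a])

def model_update(s, a, r, s_next):
    # Use 2: model learning -- remember how the world responded.
    model[(s, a)] = (r, s_next)

# Suppose acting in the world produced the transition (s=0, a=1) -> (r=0.0, s'=3):
s, a, r, s_next = 0, 1, 0.0, 3
q_update(s, a, r, s_next)       # learn directly from the real experience
model_update(s, a, r, s_next)   # and also use it to improve the model

Planning then just means repeatedly drawing remembered (state, action) pairs from the model and feeding the predicted outcomes back into the same update.
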
So, this is the idea. And in fact, a very powerful one, that underlies
a variety of cutting-edge reinforcement learning algorithms,
including those that Google DeepMind used to beat
professional players in the board game, Go. Now, unfortunately, we're not going to implement DeepMind's
algorithms for playing Go today, but we will instead implement a model-based
reinforcement learning algorithm called Dyna-Q to illustrate the concepts that I just described.
So, Dyna-Q is
possibly the simplest
model-based reinforcement learning algorithm that integrates learning and planning.
In the exercise that you're gonna do next,
we will analyze Dyna-Q in detail, but just to give you an idea in a nutshell,
this algorithm will implement Q-learning in steps (a) through (d).
As you can see here, in this table, the agent observes
the state, chooses an action, takes the action and observes the result, and then learns from the outcome
in step (d).
Then, in step (e),
the agent learns the model: it simply saves and remembers
what the result of the action that it executed was. And then, in step (f), it implements planning.
It loops n times,
selecting a state and an action to simulate, recalling what the result was by drawing it
from the model, and then learning from it in the very last line of this algorithm.
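
As a rough sketch of what steps (e) and (f) amount to in code, assuming the same kind of tabular Q-table and dictionary model as in the earlier sketches (planning_steps, alpha, and gamma are illustrative values, not the notebook's):

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9
planning_steps = 10                   # "n", the number of simulated updates per real step

q_values = np.zeros((n_states, n_actions))
model = {}                            # (state, action) -> (reward, next_state)

def q_update(s, a, r, s_next):
    q_values[s, a] += alpha * (r + gamma * np.max(q_values[s_next]) - q_values[s, a])

def dyna_q_after_real_step(s, a, r, s_next):
    # Steps (a)-(d): the usual Q-learning update from the real transition.
    q_update(s, a, r, s_next)
    # Step (e): model learning -- save what just happened.
    model[(s, a)] = (r, s_next)
    # Step (f): planning -- loop n times over simulated experience.
    for _ in range(planning_steps):
        # Pick a previously experienced (state, action) pair at random ...
        s_sim, a_sim = list(model.keys())[np.random.randint(len(model))]
        # ... ask the model what the outcome was ...
        r_sim, s_next_sim = model[(s_sim, a_sim)]
        # ... and learn from that simulated outcome with the same update rule.
        q_update(s_sim, a_sim, r_sim, s_next_sim)

# For example, after acting in the real world and observing (s=0, a=1, r=0.0, s'=3):
dyna_q_after_real_step(0, 1, 0.0, 3)

With planning_steps set to zero this reduces to plain Q-learning; increasing it lets the agent squeeze more learning out of every real interaction with the world.
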
Now, what's the relevance of this for Neuroscience? Well, for one, I'm sure we would all agree that
our brains are sometimes focused on things other than what's happening around us.
For example, we may remember things that
have happened in the past, or we may think about things that might happen in the future.
And of course, this is not an accident of evolution.
It's really not difficult for us to see why it might be useful for us to learn from
imagined or remembered experiences in addition to real experiences.

But, we don't have to rely only on our
phenomenological experiences to study this ability.
So, it turns out that in the last couple decades, we've been able to record
neural activity that relates precisely to this phenomenon.
This is called hippocampal replay, and I find it a fascinating type of data.
So, let me show you an example from this beautiful paper here, by Ji and Wilson.
The authors recorded activity from both the visual cortex and the hippocampus of the animal, as it
explored an environment.
They observed some patterns that you can see in the top line, and then during sleep,
they observed that in both regions, both in the hippocampus and in the visual cortex,
there were brief moments in time when the neural activity resembled what had happened
during awake behavior, with the caveat that those sequences were a bit compressed in time.

But, they really looked a lot like what had been observed during awake behavior.
So, the hypothesis that they were testing, and found supporting evidence for, is that
perhaps the hippocampus is engaged in model-based planning, simulating experiences from the past, and
then the cortex is learning from those replayed experiences that the hippocampus is
producing. This, in fact, is a long-standing idea about how
the brain consolidates memories, in a process called systems consolidation.
But a similar phenomenon can actually happen also during awake behavior.
It turns out that if you record from place cells in the hippocampus as the animal navigates,
most of the time you see these cells firing according to the actual location of the animal.
This is just how place cells work.
Sometimes, if the animal stops, for instance,
if it arrives at some bifurcation in a maze, you will find that place cells are not representing the
actual location of the animal, but instead, they can represent future locations
just as if the animal were remembering or imagining what might happen next.
This is known as awake replay, and is thought to be the neural basis of planning.
Now, it turns out that this story is a bit more complicated: these sequences do not only go forward.
Sometimes they can also extend backwards,
behind the animal, and sometimes they can represent remote locations altogether. So,
it's difficult to imagine how all of this is serving the role of planning.
But, if you want to see how reinforcement learning as a
mathematical language can be used to explain all of these findings, and in fact, to link
various ideas from memory retrieval to planning,
consolidation, and even dreaming, I suggest that you take a look at this 2018 paper by Nathaniel Daw and myself.

Alright, I'm gonna stop here for now and I'll let you play with the code implementing Dyna-Q. Have fun
implementing your first model-based reinforcement learning agent.
