Okay, so great to see everyone back for lecture four of the class. For today's lecture, what I want to do for most of the time is actually get into the heart of these ideas of having the back-propagation algorithm for neural nets, and how we can construct computation graphs that allow us to do back-propagation efficiently in order to train the neural nets.

So overall, this is what I plan to do today. At the end of the last lecture I slightly ran out of time and started mumbling and waving my hands about doing the derivatives with respect to the weight gradient, so I wanted to do that bit again, and hopefully it communicates slightly better. We'll do that, and talk a bit more about tips for doing matrix gradients, and a particular issue that comes up with word vectors. Then the main part of the class will be talking about the back-propagation algorithm and how it runs over computation graphs. And for the last part of the class, I'm not going to hide it: it's just a grab bag of miscellaneous stuff you should know about neural networks and training neural networks.
I mean, we dream of a future of artificial intelligence where our machines are really intelligent and you can just say to them, "this is the data and this is my problem, go and train me a model," and it might work. In some future world maybe that's what comes along; it's certainly something that's being actively researched at the moment under the topic of AutoML. I guess the question is whether it turns out that AutoML is a scalable solution, or whether the climate-change consequences of AutoML techniques are sufficiently bad that someone actually decides that these much lower-power neural systems might still be better for doing some parts of the problem. But either way, we're not really there yet, and the fact of the matter is that when you're training neural networks there's just a whole bunch of stuff you have to know about: initialization, non-linearities, learning rates, and so on.

When I taught this class last time, I somehow thought people would pick this up by osmosis: that if we gave starter code to people, and in our starter code we initialized our matrices and set our learning rates, then by osmosis people would understand that's what you have to do, and do it. I didn't really teach the practical tips and tricks enough in class, and it was perfectly obvious when we got to final-project time that, at least for quite a few people, osmosis hadn't worked. So this time I'm at least willing to spend a few minutes on that and point out some of the things that are important. Just in general, the reality of 2018 deep learning (wait, 2019 now) is that deep learning is still kind of a craft: there's quite a bit you have to know about techniques for doing things that lead neural-net training to work successfully, as opposed to your models failing to work.
Okay, one final announcement before I go through things. We've been doing some further work on our office-hours placement, and there are multiple issues involved, which include opportunities for local SCPD students without Stanford IDs to get to office hours. So for the Thursday night office hour (that's after this class, if you'd like to go and talk about the second homework), the Thursday night office hour is going to be in Thornton 110. Now, I didn't know where Thornton was; it made more sense to me when I translated it as "the old Terman annex," but that's probably just showing my age, since probably none of you remember when there used to be a building called Terman, so that probably doesn't help you either. But if you're heading, I don't know which direction we're facing, that way, I guess, and if you know where the Papua New Guinea sculpture garden is: the sort of open grassy area before you get to the Papua New Guinea sculpture garden is where Terman used to be, and the building that still stands there is Thornton. Thornton 110, tonight; I think it starts at 6:30.
Okay, right, so let me just finish off where we were last time. Remember, we had this window of five words, and we were putting it through a neural-net layer, z = Wx + b, then a non-linearity, h = f(z), and then we were going to get a score as to whether this window has in its centre a named entity like Paris, where the score is the dot product of a vector u with the hidden layer. So this was our model, and then we were wanting to work out partial derivatives of s with respect to all of our variables. We did various of the cases, but the one we hadn't done is the weights of this neural-net layer here.
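To keep the formulas straight, here is the model from last time written out, with x being the concatenated window of word vectors:

\[
z = Wx + b, \qquad h = f(z), \qquad s = u^\top h ,
\]

and we want the partial derivatives of s with respect to u, W, b, and x.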
Okay, so by the chain rule, the partial of s with respect to W is the partial of s by h, times the partial of h by z, times the partial of z by W. If you remember, last time we'd done some computation of what those first two partial derivatives were, and we said we could just call their product delta, which is our error signal coming from above. That concept of having an error signal coming from above is something I'll get back to in the main part of the lecture; it's sort of a central notion. But the bit we haven't dealt with is this partial of z by W. We started to look at it, and I made the argument, based on our shape convention, that the shape of that answer should be the same shape as our W matrix. So we want to work out the partial of z by W, which is the same as the partial of Wx + b by W, and we want to work out what that derivative is.
If that's not obvious, one way to think about it is to go back to the elements of the matrix: actually first work it out element-wise and think about what it should be, and then, once you've thought out what it should be, rewrite it back in matrix form to give the compact answer. So what we have is these inputs here and a bias term, and we're going to do the matrix multiply of W with this vector to produce z. If you think about what's happening there: we've got this matrix of weights, and for a particular weight W_ij, the first index corresponds to a position in the hidden layer and the second index corresponds to a position in the input vector. Any one weight in the matrix ends up being used to compute only one element of the hidden layer: for one element of the hidden layer you're taking a row of the matrix, multiplying it against the components of this vector, summing, and adding on the bias. So one element of the matrix is only being used in a computation between one element of the input and one element of the hidden vector.

Okay, so that means that if we're thinking about the partial derivative with respect to W_ij, well, W_ij only contributes to z_i, and it's only doing anything with x_j. So when we're getting the partial with respect to W_ij, we can work it out with respect to just z_i, and when we look at this multiplication, what we end up with is a sum of terms W_ik times x_k, where the weights in that row of the matrix go across the positions of the vector. The only place in which W_ij is used is multiplying by x_j, and at that point, in terms of basic one-variable differentiation, this is just like: we have 3x and we ask what's the derivative of 3x with respect to x. Sorry, that's confusing, I shouldn't say that; it's like we have 3w, and what's the derivative of 3w with respect to w? It's 3. Right, so we have a term here which is W_ij times x_j, and its derivative with respect to W_ij is just x_j. That makes sense, if you're willing to believe it, fingers crossed.
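Spelled out as an equation, that element-wise argument is:

\[
z_i = \sum_k W_{ik}\, x_k + b_i
\;\;\Longrightarrow\;\;
\frac{\partial z_i}{\partial W_{ij}} = x_j,
\qquad
\frac{\partial z_k}{\partial W_{ij}} = 0 \ \text{ for } k \neq i .
\]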
Okay, so for one element of this matrix we're just getting out x_j. At that point we say: well, of course we want to know what the full answer is for the whole matrix W. If you start thinking about it, this argument applies to every cell, so for every cell (i, j) the local derivative is going to be x_j. That means we can just make use of that in calculating our gradient: the derivative of s for a single W_ij is delta_i times x_j, and that's true for all cells. So we want a matrix for our gradient which has delta_i x_j in every cell, and the way we can create that is by using an outer product. If we have a row vector of the deltas, the error signals from above, and a column... wait, I said that wrong, sorry: if we have a column vector of the delta error signals from above and a row vector, x transpose, then when we multiply those together we get the outer product, and we get delta_i x_j in each cell. That is our answer for working out the partial of s by W that we started off with at the beginning. And we get this form where it's a multiplication of an error signal from above and our computed local gradient signal, and that's the pattern that we're going to see over and over again, and that we will exploit in our computation graphs. Okay, all good?
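Here's a minimal numerical sketch of that outer-product form in numpy; the array sizes and values are made up purely for illustration:

```python
import numpy as np

# Toy sizes: hidden layer of 3 units, input window of 4 features.
x = np.array([1.0, 2.0, 0.5, -1.0])    # input vector (length 4)
delta = np.array([0.1, -0.3, 0.2])     # upstream error signal ds/dz (length 3)

# Gradient of s with respect to W under the shape convention:
# cell (i, j) is delta_i * x_j, i.e. the outer product delta x^T.
dW = np.outer(delta, x)                # shape (3, 4), same shape as W

# Sanity check against the element-wise formula for one cell.
assert np.isclose(dW[1, 2], delta[1] * x[2])
print(dW)
```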
So, for homework two you're meant to do some of this stuff, and here are a couple of collected tips which I hope will help. Keeping track of your variables and their dimensionality is really useful, because if you can just work out what the dimensionality of things should be, you're often kind of halfway there. Basically what you're doing is applying the chain rule over and over again; it always looks like this, but you're doing it in this sort of matrix-calculus sense of the chain rule. In the homework you have to do a softmax, which we haven't done in class. Something that I think you'll find useful, if you want to break apart the softmax, is to consider two cases: one case for when you're working it out for the correct class, and the other case for all the other, incorrect classes. And, as in the little derivation I did before, where I said let's work out an element-wise partial derivative because that should give me some sense of what's going on and what the answer is, I think that can be a really good thing to do if you're getting confused by matrix calculus.
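For reference, the standard result you should land on from that two-case, element-wise approach (written here in generic notation, which may differ from the assignment's) is:

\[
\frac{\partial}{\partial z_i}\bigl(-\log \operatorname{softmax}(z)_c\bigr)
= \operatorname{softmax}(z)_i - \mathbf{1}[i = c] ,
\]

that is, the predicted probability minus one for the correct class c, and just the predicted probability for every incorrect class.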
I also slightly skipped past another slide last time, where I was talking about the shape convention; I talked about it only for a moment. For the homeworks, you can work out your answer however you want; you can work it out in terms of, say, numerator-layout Jacobians if that seems best to you. But we'd like you to give the final answer to your assignment questions following the shape convention, so that the derivatives are shaped as a vector or matrix in the same way as the variable with respect to which you're working out your derivatives.
Okay, the last little bit for finishing up this example from last time: I want to say a little bit about what happens with words. One answer is "nothing different," but another answer is that they are a little bit of a special case here, because really we have a matrix of word vectors: we have a vector for each word, and you can think of that as a matrix with a different word in each row. But we're not actually connecting that matrix up directly to our classifier system; instead, what we connect up to the classifier system is this window, and the window has five words in it. Most commonly they're different words, but occasionally the same word might appear in two positions in that window. We can nevertheless do exactly the same thing and continue our gradients down, and say: okay, let's work out the gradient with respect to this word-window vector. If the word vectors are of dimension d, we'll have this 5d-dimensional vector. But then what do we do about it? The answer is that we can just split this window gradient into five pieces and say: aha, we have five updates to word vectors, and we're just going to go off and apply them to the word-vector matrix. And if the same word occurs twice in that window, we literally apply both of the updates, so it's updated twice, or maybe actually you want to sum them first and then do the update once, but that's a technical issue.

What that actually means is that we're updating the word-vector matrix extremely sparsely, because most of the word-vector matrix will be unchanged and just a few rows of it will be updated. Soon we're going to be doing stuff with PyTorch, and if you poke around, PyTorch even has some special machinery for this (look for things like sparse SGD), meaning that you're doing a very sparse update like that.
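Here's a minimal numpy sketch of that sparse, duplicate-safe update; the vocabulary size, dimension, learning rate, and word indices are all made up for illustration:

```python
import numpy as np

d = 4                                    # toy word-vector dimension
vocab = 10
E = np.zeros((vocab, d))                 # word-vector (embedding) matrix
window_ids = [3, 7, 3, 1, 5]             # the 5 window words; word 3 appears twice

grad_window = np.random.randn(5 * d)     # gradient w.r.t. the concatenated 5d window vector
grad_pieces = grad_window.reshape(5, d)  # split into five per-word updates

# Accumulate the pieces into the right rows; np.add.at sums correctly
# even when the same row index appears more than once in the window.
np.add.at(E, window_ids, -0.01 * grad_pieces)   # SGD step with learning rate 0.01
```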
But there's one other sort of interesting thing that you should know about, which matters for a lot of things that you do: what actually happens if we push these gradients down into our word vectors? The idea is that if we do that, it would be just like all other neural-net learning: we will, in principle, move the word vectors around in such a way that they're more useful for helping determine named-entity classification, in this case, because that was our motivating example. So it might, for example, learn that the word "in" is a very good indicator of a place name following, since after "in" you often get London, Paris, etc., so it's got a special behaviour that other prepositions don't have, as being a good location indicator. So the model could move its vector around and say: here are words that are good location indicators, and therefore help our classifier work even better.

In principle that's good, and it's a good thing to do, to update word vectors to help you perform better on a supervised task such as this named-entity recognition classification. But there's a catch, which is that it doesn't always actually work. So why doesn't it always work?
Well, suppose that we're training a classifier (it could be the one I just did, or a softmax, or logistic regression) and we want to classify movie reviews for positive or negative sentiment. If we have trained our word vectors, we've got some word-vector space, and maybe in that word-vector space "TV", "telly", and "television" are all very close together, because they mean basically the same thing; so that's great, our word vectors are good. But suppose it was the case that in the training data for our classifier (this is our training data for movie sentiment reviews) we had the words "TV" and "telly", but we didn't have the word "television". Then what's going to happen is that while we train our sentiment classifier, if we push gradient back down into the word vectors, what's likely to happen is that it will move around the word vectors of the words we saw in the training data, but "television" necessarily isn't moving, because we're only pushing gradient down to words that are in our training data. This word goes nowhere; it just stays where it was all along. So if the result of our training is that words get moved around ("here are good words for indicating negative sentiment"), then actually at test time, when we're running our model, if we evaluate on a sentence with "television" in it, it's going to give the wrong answer; whereas if we hadn't changed the word vectors at all and just left them where our word-embedding learning system put them, then it would have said "television" is a word that means about the same as "TV" or "telly", I should treat it the same in my sentiment classifier, and it would actually have done a better job. So it's sort of two-sided whether you've gained by training the word vectors, and this slide is a summary that says it's two-sided and what you should practically do.
The first choice is: gee, is it a good idea to use pre-trained word vectors, like the word2vec vectors that were used in assignment one, or the ones from the training methods that you're implementing right now for homework two? The answer to that is almost always yes, and the reason is that these word-vector training methods are extremely easy to run on billions of words of text. We train these models, like GloVe or word2vec, on billions or tens of billions of words, and it's easy to do that for two reasons: firstly because the training algorithms are very simple (the word2vec skip-gram training algorithm is a very simple algorithm), and secondly because we don't need any expensive resources; all we need is a big pile of text documents and we can run it on them. So it's really easy to run on, you know, five or fifty billion words. Whereas we can't do that for most of the classifiers that we want to build, because if it's something like a sentiment classifier or a named-entity recognizer, we need labelled training data to train our classifier. And when we ask someone "how many words of labelled training data do you have for named-entity recognition?", they give us back a number like three hundred thousand words, or one million words; it's orders of magnitude smaller. Therefore we can gain by using pre-trained word vectors, because they know about all the words that aren't in our supervised classifier's training data, and they also know much more about the words that actually are in the training data.

Only rarely is that not so. The exception is if you have hundreds of millions of words of data; then you can start off with random word vectors and go from there. A case where this is actually commonly done is machine translation, which we do later in the class. It's relatively easy for large languages to get hundreds of millions of words of translated text: if you want to build something like a German-English or Chinese-English machine-translation system, it's not hard to get a hundred and fifty million words of translated text, and that's sufficiently much data that people commonly just start with the word vectors randomly initialized and train their translation system from there.
Okay, so the second question is: okay, I'm using pre-trained word vectors; when I train my supervised classifier, should I push gradients down into the word vectors and update them, which is often referred to as fine-tuning the word vectors, or should I not, and just throw away those gradients and not push them down into the word vectors? The answer is: it depends, and it mainly depends on the size of your data. If you only have a small training data set, typically it's best to just treat the pre-trained word vectors as fixed and not do any updating of them at all. If you have a large data set, then you can normally gain by doing fine-tuning of the word vectors. Of course, the question here is what counts as large: certainly if you're down in the regime of a hundred thousand words, a couple of hundred thousand words, you're small; if you're up to over a million words, then maybe you're large. But in practice people do it both ways, see which number is higher, and that's what they stick with.
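In PyTorch terms, this choice usually comes down to whether the embedding layer's weights receive gradients; here is a small sketch, with a random matrix standing in for real pre-trained vectors:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # stand-in for a loaded GloVe / word2vec matrix

# Small dataset: keep the pre-trained vectors fixed (no fine-tuning).
emb_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Large dataset: let gradients flow into the word vectors (fine-tuning).
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(emb_fixed.weight.requires_grad, emb_tuned.weight.requires_grad)  # False True
```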
Then there's one point here that's just worth underlining. Yes, in principle we can back-propagate this gradient to every variable in our model, but it's actually a theorem that we can arbitrarily decide to throw any subset of those gradients away and we're still improving the log likelihood of our model. It kind of can't be inconsistent: you can just pick some subset and say "only train those 37 and throw away all the rest," and the algorithm will still improve the log likelihood of the model, perhaps not by as much as if you trained the rest of the variables as well, but it can't actually do any harm not to train something. That's one of the reasons why people often don't notice bugs in their code: because if your code is kind of broken and only half of the variables are being updated, it'll still seem to be training something and improving; it's just not doing as well as it could be doing if you'd coded it correctly. Okay.
So at this point I've sort of almost shown you back-propagation already. Back-propagation is really taking derivatives with a generalized chain rule, with one further trick, which we sort of represented with that delta: you want to be clever in doing this, so you minimize computation by reusing shared stuff. But now what I want to move on to is to look at how we can do that much more systematically, which is this idea that we have a computation graph, and we're going to run a back-propagation algorithm through the computation graph. This is kind of like an abstract-syntax-tree expression tree that you might see in a compilers class or something like that. When we have an arithmetic expression of the kind that we're going to compute, we can make this tipped-over-on-its-side tree representation: we've got the x and W variables, we're going to multiply them; there's the b variable, we're going to add it to the previous partial result; we're going to stick it through our non-linearity f; and then we're going to multiply by u. That was the computation that we're doing in our neural network. So the source nodes are inputs, the interior nodes of this tree are operations, and then we've got these edges that pass along the results of our computation. This is the computation graph for precisely the example I've been doing since the last lecture.
There are two things that we want to be able to do. The first one is that we want to be able to start with these variables, do this computation, and calculate what s is. That's the part that's dead simple; it's referred to as forward propagation. Forward propagation is just expression evaluation, as you'd do in any programming-language interpreter; that's not hard at all. But the difference here is: hey, we want to do a learning algorithm, so we're going to do the opposite of that as well.
What we also want to be able to do is backward propagation, or back-propagation, or just "backprop" as it's commonly called: we want to be able to go from the final part, the final node here, and then at each step we want to be calculating these partial derivatives and passing them back through the graph. So this was the notion from before that we had an error signal: starting from up here, we've calculated the partial of s by z, that is, of this with respect to that, and so that's our calculated error signal up to here, and then we want to pass that further back to start computing our gradients further back.

We started off right here with the partial of s by s. What's the partial of s with respect to s? It's going to be one, yes: the rate at which s changes as s changes. So we just start off with one, and then we want to work out how this gradient changes as we go along. What we're doing here is: when we're working things out for one node, that node is going to have passed into it its upstream gradient, which is its error signal, the partial of our final result (which was our loss) by the variable that was the output of this computation node; so that's the partial of s by h here. Then we did some operation here (here it's the non-linearity, but it might be something else), and what we want to work out then is the downstream gradient, which is the partial of s by z, where z was the input of this function. Well then, the question is how we do that, and the answer is: we use the chain rule, of course.
So we have a concept of a local gradient. Here h is the output and z is the input, and this function here is our non-linearity, whatever we're using as our non-linearity, like a logistic or a tanh; we're calculating h in terms of z, and we can work out the partial of h by z. That's our local gradient. And so then, if we have both the upstream gradient and the local gradient, we can work out the downstream gradient, because we know the partial of s by z is going to be the partial of s by h times the partial of h by z, and then we'll be able to pass the downstream gradient down to the next node. Okay, so our basic rule, which is just the chain rule written in different terms, is: downstream gradient equals upstream gradient times local gradient, easy as that.
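As a tiny numeric illustration of that rule, using a logistic (sigmoid) non-linearity and made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                        # input to the non-linearity on the forward pass
h = sigmoid(z)                 # output, stored during the forward pass

upstream = 2.0                 # ds/dh: the error signal arriving from above
local = h * (1.0 - h)          # dh/dz: the local gradient of the sigmoid
downstream = upstream * local  # ds/dz: what gets passed down to the node below
print(downstream)
```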
Okay, so that was the very simplest case, where we have a node with one input and one output; that's a function like our logistic function. But we also want things to work out for a general computation graph, so how are we going to do that? Well, the next case is: what about if we have multiple inputs? If we're calculating something like z = Wx, then yes, z and x are themselves vectors and W is a matrix, but we're treating x and W each as one input and z as our output; we kind of group vectors and matrices together. Well, if you have multiple inputs, you then end up with multiple local gradients: you can work out the partial of z with respect to x, and the partial of z with respect to W. So essentially you take the upstream gradient, you multiply it by each of the local gradients, and you pass it down the respective path, and we calculate these different downstream gradients to pass along. Is that making sense?
Yeah? Okay, I'll chug on. So let's look at an example of this, and then we'll see one other case. Here's a little baby example; this isn't really looking like a neural net, but we've got three inputs, x, y, and z. x and y get added together, y and z get maxed, and then we take the results of those two operations and multiply them together. So overall what we're calculating is (x + y) times max(y, z). But we have here a general technique, and we can apply it in any case.

Okay, so if we have this graph and we want to run it forward, we need to know the values of x, y, and z. For my example, x = 1, y = 2, z = 0, so we take the values of those variables and push them onto the forward arrows for the calculations. The first thing we do is add, and the result of that is three, so we can put that onto the arrow that's the output of the add; the output of the max is 2; and the output of the multiplication is 6. So in the forward pass we have evaluated the expression, and its value is 6; that wasn't hard.

Okay, so then the next step is that we want to run back-propagation to work out gradients, and so we want to know how to work out these local gradients. a is the result of the sum, so a = x + y; if you're taking da/dx, that's just 1, and da/dy is also 1. That makes sense.
The max is slightly trickier, because where there's some slope, some gradient, for the max depends on which input is bigger. So if y is bigger than z, the partial of b by y is 1 and otherwise it's 0, and conversely for the partial of b by z; so that one's a little bit dependent on the values. And then we do the multiplication case at the end and work out its partials with respect to a and b: since f = a times b, and a and b have the values 3 and 2, if you take the partial of f by a it equals b, which is 2, and vice versa: the partial of f by b is a, which is 3.
反之亦然，这意味着我们可以
and vice versa okay so that means we can
计算出每个局部的梯度
work out the local gradients at each
然后我们要用这些
node and so then we want to use those to
计算我们的梯度
calculate our gradients backwards in the
反向传播路径所以我们从
back propagation path so we start at the
F关于F的偏导
top the partial of F with respect to F
是1因为如果你知道，如果你知道
is 1 because if you move if you know by
然后你把F移到1/10
1/10 then you've moved the F by 1/10 so
这是一个建议，然后是1，然后
that's a Counsel's out as 1 okay so then
我们想要倒退，所以第一个
we want to pass backwards so the first
我们拥有的东西是这样的
thing that we have is this sort of
乘以节点，我们知道，我们知道
multiply node and so we work we know
它是局部梯度，F的偏导数
it's local gradients the partial of F by
a是2而F的偏导是3
a is 2 and the partial of F by B is 3
所以我们得到了这些值
and so we get those values so formally
我们采用的是局部梯度
we're taking the local gradients
将它们乘以上游
multiplying them by the upstream
渐变，得到3和2
gradients and getting our 3 and 2 and
注意到这一事实
notice the fact that sort of effectively
所发生的是两者的值
what happened is the values on the two
弧交换，然后我们继续
arcs swaps and then we sort of continue
好的，这里有一个最大的节点
back okay there's a max node so our
上游梯度现在是3，然后我们
upstream gradient is now 3 and then we
想要乘以局部梯度
want to multiply by the local gradient
因为这两个的最大值是2
and since the max of these two is 2
这条边的斜率是1
there's a slope of 1 on this side so we
得到3，这边没有梯度
get 3 there's no gradient on this side
我们得到0，然后我们做相似的
and we get 0 and then we do the similar
在另一边的计算
calculation on the other side where we
局部梯度为1
have local gradients of 1
所以这两个都是两个
and so both of them come out of two and
另一件要做的事是
then the one other thing to do is we
注意，等一下
notice well wait a minute
有两条弧线从
there are two arcs that started from the
这两个都是我们传回来的
Y both of which we've back propagated
有一些梯度，我们该怎么做
some gradient on and so what do we do
关于这一点我们要做的是
about that what we do about that is we
所以F除以X的偏导是2
sum so the partial of F by X is 2 the
F除以Z的偏导是0，但偏导
partial of F by Z is 0 but the partial
F by Y是2和5的和
of F by Y is the sum of the two and five
Right, and this isn't complete voodoo; it's something that should make sense in terms of what gradients are. What we're calculating is: if you wiggle x a little bit, how big an effect does that have on the outcome of the whole thing? And we should be able to check this. Our x started off as 1; suppose we wiggle it up a bit to make it 1.1. According to this, our output should change by about 0.2, it should be magnified by two, and we should be able to work that out: it's then 1.1 plus 2, so that's 3.1, and then we've got the 2 here that multiplies by it, and it's 6.2, and lo and behold it went up by 0.2. Right, so that seems correct. And if we try to do the same for, well, let's do z, it's easy: if we wiggle z, which had the value zero, by 0.1, making it 0.1, then when we take the max it's still 2, so the calculated value doesn't change, it's still 6; the gradient here is zero, wiggling z does nothing. And then the final one is y, whose starting value was 2. If we wiggle it a little and make it 2.1, our claim is that the result should change by about 0.5: it should be multiplied by five. If we make this 2.1, we then have 2.1 plus 1, which would be 3.1, and when we take the max here it's also 2.1, so we have 2.1 times 3.1, and that's harder arithmetic for me to do in my head, but it comes out as 6.51. So basically it's gone up by about half. We don't expect the answers to be exact, of course, because that's not the way calculus works (it's a local linear approximation), but it's showing that we're getting the gradients right. Okay.
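Here's the whole baby example as a few lines of Python, with the hand back-propagation and a two-sided numeric "wiggle" check side by side:

```python
# The toy graph f(x, y, z) = (x + y) * max(y, z) from the lecture example.
def f(x, y, z):
    a = x + y
    b = max(y, z)
    return a * b

x, y, z = 1.0, 2.0, 0.0
a, b = x + y, max(y, z)          # forward pass: a = 3, b = 2, f = 6

# Backward pass: downstream = upstream * local at every node.
df = 1.0                                          # df/df
da = df * b                                       # multiply node: local grad w.r.t. a is b
db = df * a                                       # local grad w.r.t. b is a
dx = da * 1.0                                     # plus node: local grads are 1
dy = da * 1.0 + db * (1.0 if y > z else 0.0)      # y feeds two branches: sum them
dz = db * (1.0 if z >= y else 0.0)                # max routes gradient to the larger input
print(dx, dy, dz)                                 # 2.0 5.0 0.0

# Numeric "wiggle" check for y.
h = 1e-4
print((f(x, y + h, z) - f(x, y - h, z)) / (2 * h))   # approximately 5.0
```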
So this actually works. So what are the techniques that we need to know? We've sort of already seen them all. We discussed how, when there are multiple incoming arcs, you work out the different local derivatives. The main other case that we need to know is when, in the function computation, there's a branch outward: the result of something is used in multiple places. This was like the case here; here it was an initial variable, but it could have been something computed further back. If a result is used in multiple places and you have the computation going out in different ways, it's just this simple rule: when you do back-propagation backwards, you sum the gradients that you get from the different outward branches. So if a = x + y, well, that's the one we showed you before, where we're doing this sum operation to work out the total partial of f by y.

And if you think about it just a little bit more, there are these sort of obvious patterns, which we saw in this very simple example. If you've got a plus, the upstream gradient is going to head down every one of the branches: "plus" distributes the upstream gradient when you have multiple branches or things being summed. Now, in this case it just gets copied unchanged, but that's because our computation was x + y; it could be more complicated, but we're passing it down each of those branches. When you have a max, that's kind of like a routing operation, because the max is going to send the gradient in the direction that was the max, and the other inputs are going to get no gradient passed down to them. And then when you have a multiplication, this has the kind of fun effect that what you do is switch the gradients: this reflects the fact that when you have u times v, regardless of whether u and v are vectors or just scalars, the derivative of the result with respect to u is v and the derivative of the result with respect to v is u, so the gradient signal is the flip of the two numbers on the different sides.
Okay, so that's most of how we have these computation graphs and can work out back-propagation backwards in them. There's one more part of this to do, which is to say: gee, we want to do this efficiently. There's a bad way to do this, which is to say, "oh well, we want to calculate the partial of s by b, so we can calculate that," which was essentially what I was doing on last time's slides: we say the partial of s by b equals the partial of s by h, times the partial of h by z, times the partial of z by b, and we have all of those partials, so we work them all out and multiply them together. Then someone says "what's the partial of s by W?", and we say: huh, that's the chain rule again, I'll do it all again; it's the partial of s by h, times the partial of h by z, times the partial of z by W. You do a big long list of them and you calculate it all again. That's not what we want to do. Instead we want to say: look, there's this shared stuff, there's this error signal coming from above, and we can work out the error signal, the upstream gradient, for this node; we can use it to calculate the upstream gradient for the next node down; we can use that to calculate the upstream gradient for the node below that; and then, using the local gradients that we calculated at this node, we can calculate this gradient and that one; and then from here, knowing this upstream gradient, we can use the local gradients at this node to compute this one and that one. So we're doing this efficient, computer-science-style computation where we don't do any repeated work. Does that make sense? Yeah, okay.
And so, if that is the whole of backprop, here's a slightly sketchy picture which is just recapitulating this thing. If you have any computation that you want to perform, the hope is that you can sort your nodes into what's called a topological sort, which means that variables that are arguments are sorted before variables that are results depending on those arguments. Provided you have an acyclic graph you'll be able to do that; if you have a cyclic graph you're in trouble. Well, I mean, there are actually techniques people use to unroll those graphs, but I'm not going to go into that now. So we've sorted the nodes, loosely represented here from bottom to top, into a topological sort order.

Then, for the forward prop, we go through the nodes in their topological sort order: if a node is a variable, we just set its value to whatever that variable's value is; if it's computed from other variables, their values must have been set already, because they come earlier in the topological sort, and then we compute the value of the node according to its predecessors. We pass it up and work out the final output, the loss function of our neural network, and that is our forward pass.

Okay, so then after that we do our backward pass. For the backward pass we initialize the output gradient with 1 (the top thing is always 1, the partial of z with respect to z), and then we go through the nodes in reverse topological sort order. Therefore, for each node, everything above it, everything that we calculated based on it in the forward pass, will already have had its gradient calculated, as a product of upstream gradient times local gradient, and then we can use that to compute the next thing down. So basically the overall rule is: for any node, you work out its set of successors (the things that are above it, that depend on it), and then you say: okay, the partial of z with respect to x is simply the sum, over the set of successors, of the local gradient that you calculate at the node times the upstream gradient of that successor. In the examples that I gave before there were never multiple upstream gradients, but if you imagine a general big graph, there could actually be several different upstream gradients being used for the various successors. We apply that going backwards, and then we've worked out, in back-propagation, the gradient of the final result z with respect to every node in our graph.

The thing to notice about this is that, if you're doing it right and efficiently, the big-O order of complexity of doing that backward pass is exactly the same as doing forward propagation, i.e. expression evaluation. So it's not some super-expensive, complex procedure that you couldn't imagine doing and scaling up; you're in exactly the same complexity order.
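Here is a compact sketch of that general recipe; the Node class and helper names are invented for illustration, not any framework's API:

```python
class Node:
    """A node computes a value from its input nodes and knows its local gradients."""
    def __init__(self, inputs, forward_fn, local_grad_fn):
        self.inputs = inputs                  # list of parent Nodes
        self.forward_fn = forward_fn          # parent values -> value
        self.local_grad_fn = local_grad_fn    # parent values -> one local grad per parent
        self.value = None
        self.grad = 0.0                       # accumulates upstream * local from all successors

def topo_sort(output):
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p in n.inputs:
                visit(p)
            order.append(n)                   # parents come before the nodes that use them
    visit(output)
    return order

def forward_backward(output):
    order = topo_sort(output)
    for n in order:                           # forward pass: plain expression evaluation
        if n.inputs:
            n.value = n.forward_fn(*[p.value for p in n.inputs])
    output.grad = 1.0                         # d(output)/d(output)
    for n in reversed(order):                 # backward pass: reverse topological order
        if n.inputs:
            locals_ = n.local_grad_fn(*[p.value for p in n.inputs])
            for parent, local in zip(n.inputs, locals_):
                parent.grad += n.grad * local   # downstream = upstream * local, summed

# The lecture's example: f = (x + y) * max(y, z) at x=1, y=2, z=0.
x = Node([], None, None); x.value = 1.0
y = Node([], None, None); y.value = 2.0
z = Node([], None, None); z.value = 0.0
a = Node([x, y], lambda u, v: u + v, lambda u, v: (1.0, 1.0))
b = Node([y, z], lambda u, v: max(u, v), lambda u, v: (float(u > v), float(v >= u)))
f = Node([a, b], lambda u, v: u * v, lambda u, v: (v, u))
forward_backward(f)
print(f.value, x.grad, y.grad, z.grad)        # 6.0 2.0 5.0 0.0
```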
Now, as I've presented it here, you could just think of this procedure as something that you run on an arbitrary graph, calculating this forward pass and backward pass. But almost without exception, the kind of neural nets that we actually use have a regular, layer-like structure, and that's precisely why it makes sense to work out these gradients in terms of vectors, matrices, and Jacobians, as we were doing before. Okay.
So, since we have this sort of really nice algorithm, this means that we can do this purely computationally, and we don't have to think or know how to do the math: we can just have our computers do all of it for us. Using this graph structure, we can automatically work out how to apply backprop, and there are sort of two versions of this. If what is calculated at each node is given as a symbolic expression, we could actually have our computer work out for us what the derivative of that symbolic expression is, so it could actually calculate the gradient of that node. That's often referred to as automatic differentiation, and it's kind of like Mathematica or Wolfram Alpha, if you know how you do your math homework on it: you just type in your expression, ask "what's the derivative?", and it gives it back to you. It's doing symbolic computation and working out the derivative for you. So that method could be used to work out the local gradients, and then we can use the graph structure and our rule (upstream gradient times local gradient gives downstream gradient, i.e. the chain rule) to propagate it through the graph and do the whole backward pass completely automatically. And that sounds great.
Slight disappointment: current deep-learning frameworks don't quite give you that. There was actually a famous framework that attempted to give you that: the Theano framework, developed at the University of Montreal, which they've now abandoned in the modern era of large-technology-corporation deep-learning frameworks. Theano did precisely that; it did the full thing of automatic, symbolic differentiation. For reasons that we could think of as either good or bad, current deep-learning frameworks like TensorFlow or PyTorch actually do a little bit less than that. What they do is say: well, for the computations at an individual node, you have to do the calculus yourself. For this individual node you have to write the forward propagation, say "return x + y", and you have to write the backward propagation, giving the local gradients (1 and 1 for the two inputs x and y). But providing you, or someone else, has written out the forward and backward local steps at this node, then TensorFlow or PyTorch does all the rest of it for you and runs the back-propagation algorithm. Effectively, that saves you having to have a big symbolic-computation engine, because the person coding the node computations is just writing a bit of code as you might normally imagine doing it, whether in, you know, C or Pascal: saying "return x + y", and for the local gradient "return 1". And you don't actually have to have a whole symbolic-computation engine.
Okay, so that means the overall picture looks like this. Schematically, we have a computation graph, and to calculate the computation we put inputs into our computation graph, where there are sort of x and y variables, and then we run through the nodes in topologically sorted order, and for each node we calculate its forward value. Necessarily, the things it depends on have already been computed, so we just do expression evaluation forwards, and then we return the final gate in the graph, which is our loss function or objective function. But then also we have the backward pass, and for the backward pass we go through the nodes in reverse topologically sorted order; for each of those nodes we return their backward value, and for the top node we return a backward value of one, and that will then give us our gradients. So that means that for any node, any piece of computation that we perform, we need to write a little bit of code that says what it's doing on the forward pass and what it's doing on the backward pass.
So, on the forward pass, say this is our multiplication node: we're just saying "return x times y". That's pretty easy; that's what you're used to doing. But we also need to do the backward pass, returning as local gradients what the partial of L with respect to z is, with respect to x and with respect to y, and to do that we have to do a little bit more work. First of all, on the forward pass we have to remember to stash away in some variables what values we computed, or were given, in the forward pass, or else we won't be able to calculate the backward pass; so we store away the values of x and y. Then, when we're doing the backward pass, the upstream gradient, the error signal, is passed in to us, and now we just calculate upstream gradient times local gradient, upstream gradient times local gradient, and we return backwards those downstream gradients. Providing we do that for all the nodes of our graph, we then have something that the system can run for us as a deep-learning system.
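A minimal sketch of such a multiply node as a toy Python class (not PyTorch's actual internals):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Stash the forward-pass inputs: the backward pass needs them.
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # downstream = upstream * local gradient, for each input:
        # d(xy)/dx = y and d(xy)/dy = x.
        dx = upstream * self.y
        dy = upstream * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)     # forward pass: 6.0
dx, dy = gate.backward(1.0)      # backward pass with upstream gradient 1
print(out, dx, dy)               # 6.0 2.0 3.0
```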
What that means in practice is that any of these deep-learning frameworks comes with a whole box of tools: here is a fully connected forward layer, here is a sigmoid unit, here are other more complicated things we'll do later, like convolutions and recurrent layers. To the extent that you're using one of those, somebody else has done this work for you: they've defined nodes, or a layer of nodes, that have forward and backward already written for them. And to the extent that that's true, making neural nets is heaps of fun; it's just like Lego, you just stick these layers together, set it loose on some data, and train it. It's so easy that my high-school student is building these things; you don't have to understand much, really. But to the extent that you actually want to do some original research, and you think "I've got this really cool idea of how to do things differently, I'm going to define my own kind of different computation," well, then you have to do this and define your own class, and as well as saying how to compute the forward value, you have to pull out your copy of Wolfram Alpha, work out what the derivatives are, and put that into the backward pass. Yeah, okay.
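In PyTorch, that "define your own class" route looks roughly like the sketch below, using torch.autograd.Function; here it just re-implements multiplication so the forward/backward shape of it is visible:

```python
import torch

class MyMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)       # stash forward-pass values for the backward pass
        return x * y

    @staticmethod
    def backward(ctx, upstream):
        x, y = ctx.saved_tensors
        return upstream * y, upstream * x   # downstream grads for x and y

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
MyMultiply.apply(x, y).backward()
print(x.grad, y.grad)                      # tensor(2.) tensor(3.)
```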
So here's just one more little note on that. In the early days of deep learning — say, prior to 2014 — what we always said to everybody, very sternly, is that you should check all your gradients by doing numeric gradient checks; it's really, really important. What that meant was: if you want to know whether you've coded your backward pass right, an easy way to check is to do this numeric gradient, where you estimate the slope by wiggling the input a bit and seeing what effect it has. So I work out the value of the function f(x + h) for h very small — like 1e-4 — and f(x − h), and then divide by 2h, and I ask: what is the slope at this point? That gives me a numerical estimate of the gradient with respect to my variable x. This is what you will have seen in high school when you did your first estimates of gradients, where you worked out (f(x + h) − f(x)) / h — rise over run — and got a point estimate of the gradient. It's exactly the same thing, except that rather than doing it one-sided like that, we're doing it two-sided. It turns out that if you actually want to do this, two-sided is asymptotically hugely better, so you're always better off doing two-sided gradient checks rather than one-sided gradient checks.
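As a sketch of what such a check might look like (plain NumPy; the function name is just illustrative):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Estimate df/dx_i for every element of x by central differences."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; f_plus = f(x)
        x[i] = old - h; f_minus = f(x)
        x[i] = old                           # restore the original value
        grad[i] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad
```

You would then compare this estimate against the gradient your backward pass produced, for example with a relative-error check.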
Since it's hard to implement this wrong, it's a good way to check that your gradients are correct if you've defined them yourself. As a technique to use for anything else, it's completely hopeless, because think about doing it over our deep learning model: for a fully connected layer, if you've got a W matrix of n by m and you want to calculate your partial derivatives to check they're correct, you have to do this for every element of the matrix. You have to calculate the eventual loss first jiggling w11, then jiggling w12, then jiggling w13, w14, and so on, so in a complex network you'll end up literally doing millions of function evaluations just to check the gradients at one point in time. It's not like what I advertised for backprop when I said it's just as efficient as calculating the forward value: doing this costs forward-value computation time times the number of parameters in our model, which is often huge for deep learning networks. So this is something you only want to have inside if-statements that you can turn off, so you just run it to check that your code isn't buggy. In honesty, this is much less needed now, because by and large you can plug together your components and layers in PyTorch, and other people wrote the code right, so it will work. You probably don't need to do this all the time, but it is still a useful thing to know about if things are going wrong. Yeah.
Okay, so if we've now mastered the core technology of neural nets, we've seen basically everything we need to know about neural nets, and I've just summarized it there to emphasize it once more. I think some people think: why do we even need to learn all this stuff about gradients? And there's a sense in which you don't, really, because these modern deep learning frameworks will compute all the gradients for you — we make you suffer in homework 2, but in homework 3 you can have your gradients computed for you. But it's sort of like asking why you should take a class on compilers: there's actually something useful in understanding what goes on under the hood, even though most of the time we're perfectly happy to let the C compiler do its thing without being experts on x86 assembler every day of the week. And there is more to it than that, because even though backpropagation is great, once you're building complex models, backpropagation doesn't always work as you would expect it to — "perfectly" may be the wrong word, because mathematically it's perfect, but it might not be achieving what you want it to. If you want to debug and improve models, it's kind of crucial to understand what's going on. There's a nice Medium post by Andrej Karpathy, "Yes you should understand backprop," linked from the syllabus page, that talks about this. And indeed, the week after next Abby is actually going to lecture about recurrent neural networks, and one of the places where you can easily fail at doing backpropagation turns up there, so it's a good example. Okay, does anyone have any questions about backpropagation and computation graphs? Okay.
如果不是，剩下的时间是
if not the remainder of the time is the
把你真正想要的东西拿出来
grab bag of things that you really
应该知道你是否会成为
should know about if you're going to be
做深度学习，是的，这是
doing deep learning and so yeah this is
只是有点让人不安，但是让我说一下
just itsy-bitsy and but let me say them
到目前为止，我们已经失去了
so up until now when we've had loss
函数，我们一直在最大化
functions and we've been maximizing the
我们数据的可能性
likelihood of our data and stuff like
我们已经有了这个部分
that we've sort of just had this part
这是我们数据的可能性
here which is the likelihood of our data
我们一直在努力使它最大化
and we've worked to maximize it however
在实践中，这种方法通常很糟糕
in practice that works badly usually and
我们需要做一些其他的事情
we need to do something else which is
将我们的模型规范化，如果你已经完成了
regularize our models and if you've done
机器学习班之类的
the machine learning class or something
就像你看到的那样
like that you will have seen
正则化，有各种各样的
regularization and there are various
进行正规化的技术
techniques to do regularization but
与其他任何规则相比
compared to anything else regularization
对深度学习更重要
is even more important for deep learning
模型是正确的所以一般的想法是
models right so the general idea is if
你有很多参数
you have a lot of parameters in your
对这些参数进行建模
model those parameters can just
本质上是记住数据中的内容
essentially memorize what's in the data
你训练了它，所以它们很好
that you trained it and so they're very
善于预测模型的答案
good at predicting the answers the model
很擅长预测他们的
becomes very good at predicting their
你对数据的回答
answers to the data you trained it on
但是这种模式在工作中可能会变得很差
but the model may become poor at working
在现实世界和不同的例子中
in the real world and different examples
我们想要阻止它
and somehow we want to stop that and
这个问题对深层来说尤其糟糕
this problem is especially bad for deep
学习模型，因为通常深
learning models because typically deep
学习模式有
learning models have
大量的参数
vast numbers of parameters so in the
统计学家统治的美好时光
good old days when statisticians ruled
他们告诉人们这是
the show they told people that it was
有一个数字是完全荒谬的
completely ridiculous to have a number
接近你的参数
of parameters that approached your
你认识的培训案例的数量
number of training examples you know you
永远不应该有更多的参数
should never have more parameters in
你的模型比这个数字的十分之一
your model than one-tenth of the number
你们的培训案例中
of your training examples from what's
这是你被告知的经验法则
the kind of rules of thumb you are told
所以你有很多这样的例子
so that you had lots of examples with
Westar的每个参数都是
which Westar made every parameter that's
这不是真正的深度学习模型
just not true deep learning models it's
我们的训练很普遍
just really common that we train deep
学习模型有十倍于
learning models that have ten times as
我们有很多参数
many parameters as there we have
训练的例子，但奇迹般地
training examples but miraculously it
事实上，它的效果很好
works in fact it works brilliantly those
高度超参数化模型
highly over parameterised models and
这是我的一个重要的秘密来源
it's one of the big secret sources of my
深度学习是如此的精彩，但是
deep learning has been so brilliant but
只有当我们将模型规范化时，它才会起作用
it only works if we regularize the model
So if you train a model without sufficient regularization, what you find is that as you train and work out your loss on the training data, the model keeps getting better and better and better — necessarily, our algorithm has to improve the loss on the training data; the worst that could happen is that the curve goes absolutely flat. What you will find with most models we train is that they have so many parameters that this just keeps going down until the loss approaches the numerical precision of zero, if you leave it training long enough — it just learns the correct answer for every example, because it can effectively memorize the examples. But if you then say, let me test this model on some different data, what you find is this red curve: up to a certain point you're also building a model that's better at predicting on different data, but after some point this curve starts to curve up again (ignore the bit where it seems to curve down again — that was a mistake in the drawing). This is what's referred to as overfitting: from here on, the training model was just learning to memorize whatever was in the training data, but not in a way that lets it generalize to other examples. That's not what we want — we want to avoid overfitting as much as possible, and there are various regularization techniques we use for that. A simple starting one is this one here, where we penalize the log-likelihood by saying you're going to be penalized to the extent that you move parameters away from zero. The default state of nature is that all parameters are zero, so they're ignored in computations; you can have parameters with big values, but you'll be penalized a bit for them. This is referred to as L2 regularization, and it's a sensible starting point for what you could do with regularization.
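As a minimal sketch of that penalty (NumPy; lam stands for the regularization strength λ, and the names are illustrative):

```python
import numpy as np

def regularized_loss(data_loss, weight_matrices, lam=1e-4):
    # L2 regularization: data loss + lambda * sum of all squared weights
    penalty = sum(np.sum(W ** 2) for W in weight_matrices)
    return data_loss + lam * penalty
```

In practice, frameworks often expose this as a weight-decay setting on the optimizer rather than a term you write out yourself.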
But there's more to say later: we'll talk about other clever regularization techniques for neural networks in the lecture before we discuss final projects. Okay, grab bag number two.
Vectorization is the term you hear, but it's not only vectors — it's also matrices and higher-dimensional matrices, what are called tensors in this field. Getting deep learning systems to run fast and efficiently is only possible if we vectorize things. What does that mean? The straightforward way to write a lot of code, the way you saw in your first CS class, is to say "for i in range(...)" and compute each thing one at a time. But when we want to be clever people who do things fast, rather than working out W dot one word vector at a time in a for loop, we instead put all of our word vectors into one matrix and then do simply one matrix-matrix multiply of W by our word-vector matrix. Even if you run your code on your laptop's CPU, you will find that if you do it the vectorized way, things become hugely faster — in this example it became over an order of magnitude faster when done vectorized rather than with a for loop. And those gains are only compounded when we run code on a GPU: you will get no gain in speed at all on a GPU unless your code is vectorized, but if it is vectorized, then you can hope to see results like "oh yeah, this runs 40 times faster than it did on the CPU." So always try to use vectors and matrices, not for loops — and of course, it's useful when developing stuff to time your code and find out what's slow. Okay.
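As a rough illustration of the kind of comparison being described — a loop over word vectors versus one matrix-matrix multiply (NumPy; the sizes are made up):

```python
import time
import numpy as np

W = np.random.rand(500, 300)        # a hypothetical weight matrix
X = np.random.rand(300, 10000)      # 10,000 word vectors stored as columns

t0 = time.time()
slow = [W.dot(X[:, i]) for i in range(X.shape[1])]   # one word vector at a time
t1 = time.time()
fast = W.dot(X)                     # one matrix-matrix multiply does the same work
t2 = time.time()

print(f"for-loop: {t1 - t0:.3f}s   vectorized: {t2 - t1:.3f}s")
```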
Point three. Okay, so we discussed this idea last time and the time before: after having the affine layer, where we go from x to Wx + b — that's referred to as an affine layer, so we're multiplying a vector by a matrix and adding biases — to have power in a deep network we necessarily have to have some form of non-linearity. So I just wanted to go through a bit of background on non-linearities: what people use and what to use. If you start from the idea of what we know as logistic regression, what's commonly referred to as the sigmoid curve — or maybe more precisely the logistic function — is this picture here. It squashes any real number, positive or negative, into the range zero to one; it gives you a probability output. This use of the logistic function was really common in early neural nets — if you go back to 80s and 90s neural nets, there were sigmoid functions absolutely everywhere. In more recent times, 90% of the time nobody uses these, and they've been found to actually work quite poorly. The only place they're used is when you actually want a value between 0 and 1 as your output — we'll talk later about how you have gating in networks, and gating is a place where you want a probability between two things, and then you'll use one of these — but you use them absolutely nowhere else.
Here is the tanh curve. The formula for tanh looks like a scary thing with lots of exponentials in it, and it doesn't really look much like a logistic curve at all, but if you dig up your math textbook, you can convince yourself that a tanh curve is actually exactly the same as the logistic curve, apart from multiplying it by two — so it has a range of two rather than one — and shifting it down by one. So this is just a rescaled logistic, but it's now symmetric between 1 and −1, and the fact that it's symmetric in its output actually helps a lot when putting it into neural networks. So tanh's are still reasonably widely used in quite a number of places in neural networks; tanh should be a friend of yours, and you should know about it.
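For reference, the exact identity being described (including a rescaling of the input that goes along with the stretching of the output) is:

$$\tanh(z) \;=\; 2\,\sigma(2z) - 1, \qquad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}.$$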
But one of the bad things about using transcendental functions like the sigmoid or tanh is that they involve expensive math operations that slow you down — it's a nuisance to be computing exponentials and tanh's, since on a computer those things are kind of slow. So people started playing around with ways to make things faster, and someone came up with this idea: maybe we could have a hard tanh, where it's flat out here, then it has a linear slope, and then it's flat at the top — it sort of looks like a tanh, but we've squared it off. This is really cheap to compute: if the input is below −1, return −1; above the top of the range, return +1; otherwise, just return the number — no complex transcendentals. The funny thing is, it turns out this actually works pretty well. You might be scared — and you might justifiably be scared — because if you start thinking about gradients, once you're over here there's no gradient: it's completely flat, slope zero, so things go dead as soon as they're out at one of the ends. So it's important to stay in this middle section, at least for a while, and then it's just got a constant slope of 1 — but this is enough of a non-linearity that it actually works well and you can train neural networks with it. And that sent the whole field off in the opposite direction.
People thought, oh, if that works, maybe we can make things even simpler, and that led to the now-famous unit referred to everywhere as a ReLU — sorry, there's a mistake in my editing there; delete the hard tanh that was on the slide by mistake — the ReLU, which stands for rectified linear unit. The ReLU is essentially the simplest non-linearity you can have: it's zero — slope zero — as soon as you're in the negative regime, and it's just a line of slope one when you're in the positive regime. When I first saw this, it sort of blew my mind that it could possibly work, because I guess I was brought up on tanh's and sigmoids, and there are these arguments about the slope, and how you get these gradients and can move around with the gradient — and how is it meant to work if half of this function just has output zero and no gradient, and the other half is just this straight line? In particular, when you're in the positive regime this is just an identity function, and I argued before that if you just compose linear transforms you don't get any power — and providing we're in the right-hand part of the regime, since this is an identity function, that's exactly what we're doing: we're just composing linear transforms. So you'd believe it just couldn't possibly work — but it turns out that it works brilliantly, and this is now by far the default choice when people are building feed-forward deep networks. These are the non-linearities people really use: they are very fast, they train very quickly, and they perform very well. Effectively, depending on its inputs, each unit is either dead or it's passing things on as an identity function — but there's enough of a non-linearity that you can still do arbitrary function approximation with a deep network. People now make precisely the opposite argument: because this unit has a slope of 1 over its nonzero range, the gradient is passed back very efficiently to the inputs, and therefore the models train very efficiently — whereas with these other curves, when you're out at the ends there's very little slope, so your models might train very slowly. Okay, so for a feed-forward network, try this before you try anything else.
But there's a sub-literature that says, well, maybe that's too simple and we could do a bit better, and that led to the leaky ReLU, which says maybe we should put a tiny bit of slope over here so it's not completely dead — you can make the slope of this part something like 0.01. Then people built off that: maybe we could actually put another parameter into our neural network and have a parametric ReLU, so there's some slope over here, but we're also going to back-propagate into our non-linearity, which has this extra alpha parameter that says how much slope there is. Variously, people have used these — you can find papers on arXiv where people say you can get better results from using one or other of them, and you can also find papers where people say it made no difference for them versus just using a ReLU. So I think basically you can start off with a ReLU and work from there. Yeah.
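For reference, here is what these non-linearities look like written out (a NumPy sketch; the ±1 bounds for hard tanh and the 0.01 leaky slope are the values mentioned above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # squashes into (0, 1)

def tanh(z):
    return np.tanh(z)                      # rescaled logistic, range (-1, 1)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)           # flat, then slope 1, then flat

def relu(z):
    return np.maximum(0.0, z)              # 0 for negatives, identity for positives

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small slope instead of a dead zone
```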
So, parameter initialization. When we have these matrices of parameters in our model, it is vital — vital, vital — that you initialize those parameter weights with small random values. This was precisely the lesson that some people hadn't discovered when it came to final project time, so I'll emphasize it: vital, vital. If you just start off with the weights being zero, you have these complete symmetries: everything will be calculated the same, everything will move the same, and you're not actually training a complex network with a lot of units that specialize to learn different things. So somehow you have to break the symmetry, and we do that by giving small random weights. There are some fine points: when you have biases, you may as well just start them at zero, as neutral, and let the system learn the bias it wants, et cetera — but in general, the weights you want to initialize to small random values. In deep learning packages, a common initialization that's used, and often recommended, is this Xavier initialization. The trick of it is that for a lot of models, in a lot of places — think of curves like these ones — you'd like the values in the network to stay small, in this middle range here. Well, if you have a matrix with big values in it and you multiply a vector by this matrix, things might get bigger; then if you put it through another layer, they get bigger again, and soon everything is too big and you'll have problems. So Xavier initialization seeks to avoid that by asking: how many inputs are there to this node, and how many outputs? We want to tamp down the initialization based on the inputs and the outputs, because effectively we'll be using each weight that many times. It's a good thing to use; you can use that.
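A common form of Xavier (Glorot) initialization draws weights uniformly with a limit that shrinks as fan-in and fan-out grow; here is a sketch in NumPy (the uniform variant, with made-up layer sizes):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # scale the random weights down as the layer gets wider,
    # so activations neither blow up nor shrink layer after layer
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_uniform(fan_in=300, fan_out=500)   # e.g. a 500 x 300 weight matrix
b = np.zeros(500)                              # biases can just start at zero
```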
Optimizers. Up till now we've just talked about plain SGD. Normally, plain SGD actually works just fine, but often, if you want to use plain SGD, you have to spend time tuning the learning rate — that alpha we multiply the gradient by. For complex nets and situations, or to avoid the worry, there is now this big family of more sophisticated adaptive optimizers. Effectively, they scale the parameter adjustment by accumulated gradients, which has the effect that they learn a per-parameter learning rate, so they can see which parameters it would be useful to move more and which ones less, depending on the sensitivity of those parameters. Where things are flat, you can try to move quickly; where things are bouncing around a lot, you can try to move just a little, so as not to overshoot. There's a whole family of these — Adagrad, RMSprop, Adam, and others such as Adamax. Adam is one fairly reliable one that many people use, and it's not bad.
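In PyTorch, for example, swapping in such an optimizer is essentially a one-line choice; here is a minimal, self-contained sketch with a toy model (the model and all the sizes are made up purely for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 5)                        # a toy model, just for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 300)                         # a fake batch of 32 inputs
y = torch.randint(0, 5, (32,))                   # fake labels

optimizer.zero_grad()                            # clear old gradients
loss = loss_fn(model(x), y)
loss.backward()                                  # backprop fills in .grad for every parameter
optimizer.step()                                 # Adam's per-parameter adaptive update
```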
Then one more slide and I'm done. Yeah — learning rates. Normally you have to choose a learning rate, and one choice is just to have a constant learning rate: you pick a number, maybe 10 to the minus 3, and say that's my learning rate. You want your learning rate to be the right order of magnitude: if it's too big, your model might diverge or fail to converge, because each adjustment leaps you around with huge movements and you completely miss the good parts of your function space; if it's too small, your model may not have trained by the assignment deadline, and then you'll be unhappy. Commonly, people try powers of ten and see how it looks — they might try 0.01, 0.001, 0.0001 — and look at how the loss is declining to see what seems to work. In general, you want to use the fastest learning rate that isn't making things become unstable. Commonly, you get better results by decreasing the learning rate as you train, so sometimes people just do that by hand: we use the term "epoch" for a full pass through your training data, and people might say, halve the learning rate after every three epochs as you train, and that can work pretty well. Or you can use formulas to get per-epoch learning rates.
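As a sketch of the "halve it every few epochs" idea (the 0.5 factor and three-epoch step are just the example values mentioned above):

```python
def learning_rate(epoch, lr0=1e-3, factor=0.5, step=3):
    # halve the learning rate after every `step` full passes through the data
    return lr0 * factor ** (epoch // step)

# PyTorch has built-in schedulers that do the same thing, e.g.
# torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
```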
There are even fancier methods — you can look up cyclic learning rates online if you want, which actually make the learning rate sometimes bigger and then sometimes smaller, and people have found that this can be useful for getting you out of bad regions in interesting ways. The one other thing to know is that if you're using one of the fancier optimizers, they still ask you for a learning rate, but that learning rate is the initial learning rate, which the optimizer will typically shrink as you train. So commonly, if you're using something like Adam, you might start off by saying the learning rate is 0.1 — sort of a bigger number — and it will be shrinking it later as the training goes along. Okay, all done — see you next week.
