Okay. Hi everyone. Let's get started. Great to see you all here. Welcome back for week two of CS224N.

So this is a little preview of what's coming up in the class for this week and next week. This week is perhaps the worst week of this class. [LAUGHTER] In week two, our hope is to go through some of the nitty-gritty of neural networks and how they're trained, and how we can learn good neural networks by backpropagation. That means in particular we're going to be talking about the training algorithms and doing calculus to work out gradients and prove them. We're also looking a little bit at word window classification and named entity recognition. So there's a teeny bit of natural language processing in there, but basically week two is the math of deep learning and neural network models; really, neural network fundamentals. The hope is that that will give you a good understanding of how these things really work, and will give you all the information you need to do the upcoming homework. Then in week three, we kind of flip: week three is going to be mainly about natural language processing, so we'll then talk about how to put syntactic structures over sentences, building dependency parses of sentences, which is actually what's used in homework three. So we're chugging along rapidly. And then we'll talk about this idea of the probability of a sentence, which leads into neural language models.
So, on the homeworks: homework one was due approximately two minutes ago, so I hope everyone has submitted it. Just one admonition: in general, homework one we hope you found was a good warm-up and not too hard, so it's really best to get homework one in quickly rather than to burn lots of your late days on it. And right now out on the website there's homework two, so we are chugging along. Homework two corresponds to this week's lectures. In the first part, we're expecting you to grind through some math problems working out gradient derivations, and the second part is implementing your own version of word2vec making use of NumPy. So this time you're writing a Python program; it's no longer an IPython notebook. I encourage you to start early and look at the materials on the web. In particular, corresponding to today's lecture, there are some quite good tutorial materials available on the website, and I'd also encourage you to look at those.
More generally, just to make a couple more comments. I guess this is true of a lot of classes at Stanford, but when we get the course reviews for this class, we always get the full spectrum, from people who say the class is terrible and way too much work, to people who say it's a really great class, one of their favorite classes at Stanford, the instructors obviously care, et cetera. Partly this reflects that we get a very wide range of people taking this class: on one margin, perhaps, we have the physics PhDs, and on the other margin we have some freshmen who think this will be fun to do anyway. We welcome everybody, but in principle this is a graduate-level class. That doesn't mean we want to fail people out; we'd like everyone to succeed. But, as in a graduate-level class, we'd like you to take some initiative in your success. Meaning, if there are things that you need to know to do the assignments and you don't know them, then you should be taking some initiative to find some tutorials, come to office hours and talk to people, get any help you need, and learn to fill any holes in your knowledge.
Okay. So here's the plan for today. That was the course information update. This lecture is, in some sense, a machine learning and neural nets intro, just to try and make sure everyone is up to speed on all of this stuff. I'll talk a little bit about classification, introduce neural networks, take a little detour into named entity recognition, then show a model for doing word window classification. In the end part, we dive deeper into what kinds of tools we need to learn neural networks, so today we're going to go through something somewhere between a review and a primer of matrix calculus, and that will lead into next time's lecture, which talks more about backpropagation and computation graphs.

This material, especially the part at the end: for some people it'll seem really babyish, the kind of stuff you do every week; for other people it might seem impossibly difficult; but hopefully, for a large percentage of you in the middle, this will be a useful review of doing this kind of matrix calculus and of the kinds of things that we hope you can do on homework two. So, sorry if I'm boring some people. If you sat through CS229 last quarter, you saw what a classifier was like, and hopefully this will seem familiar, but I'm just hoping to have everyone in week two up to speed and on roughly the same page.
So here's our classification setup. We assume we have a training dataset where we have these vectors x_i, our data points, and then for each one of them we have a class. The inputs might be words, sentences, documents, or something like that, represented as d-dimensional vectors, and the y_i are the labels or classes that we want to classify into; we've got a set of C classes that we're trying to predict. Those might be something like the topic of the document, the positive or negative sentiment of a document, or, as we'll look at a bit more later, named entities.
Okay. The intuition here is that we've got this vector space, again pictured in 2D, and we have points in that vector space which correspond to our x items. What we want to do is look at the ones in our training sample, see which ones are green and which are red for our two classes here, and then learn a line that divides between the green ones and the red ones as well as possible; that learned line is our classifier. In traditional machine learning or statistics, we have these x_i vectors, data items that are purely fixed, and we multiply those x_i by some estimated weight vector, and that estimated weight vector then goes into a classification decision.

The classifier I'm showing here is a softmax classifier, which is almost identical, but not quite, to a logistic regression classifier, which you should have seen in CS 109 or a stats class or something like that; it gives a probability for the different classes. In particular, if you've got a softmax classifier or a logistic regression classifier, these are what are called linear classifiers. The decision boundary between two classes here is a line in some suitably high-dimensional space; it's a plane or a hyperplane once you've got a bigger input vector.
Okay. So here's our softmax classifier. There are two parts to it. In the weight matrix W, we have a row corresponding to each class, and for that row, we're dot-producting it with our data point vector x_i. That gives us a kind of score for how likely it is that the example belongs to that class, and then we run that through a softmax function. Just as we saw in week one, the softmax takes a bunch of numbers and turns them into a probability distribution. Does that make sense to people? People remember that from last week? Good so far?
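(As a concrete illustration, not part of the lecture: a minimal NumPy sketch with made-up dimensions of exactly this computation, the row-wise dot products followed by a softmax.)

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: C = 3 classes, d = 5 input dimensions.
W = np.random.randn(3, 5)   # one row of weights per class
x = np.random.randn(5)      # a single data point x_i

scores = W @ x              # one score per class (each row of W dotted with x)
probs = softmax(scores)     # probability distribution over the C classes
print(probs, probs.sum())   # the probabilities sum to 1
```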
Okay. I'm not going to go through this in detail, but essentially this is what logistic regression does as well. The difference is that in this setup we have a weight vector for each class, whereas what the statisticians doing logistic regression point out is that this gives us one more weight vector than we really need: for C classes, we can get away with C minus one weight vectors. In particular, if you're doing binary logistic regression, you only need one weight vector, whereas in this softmax regression formulation you've actually got two weight vectors, one for each class. So there's that little difference there, which we could get into, but it's basically the same; whether we're doing softmax or logistic regression doesn't matter.
So when we're training, what we want is to be able to predict the correct class. The way we're going to do that is to train our model so that it gives as high a probability as possible to the correct class, and therefore as low a probability as possible to the wrong classes. Our criterion for doing that is to take the negative log probability of the correct assignments, and then we want to minimize that negative log probability, which corresponds to maximizing the log probability, which corresponds to maximizing the probability.
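(Written out, a standard formulation consistent with what's described here rather than copied from the slides, the objective over N training examples is:)

$$ J(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i) = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i} \cdot x_i}}{\sum_{c=1}^{C} e^{W_c \cdot x_i}} $$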
But pretty soon now, we're going to start doing more stuff with deep learning frameworks, in particular PyTorch, and you can discover in that that there's actually a thing called NLL loss, which stands for negative log-likelihood loss. Basically, no one uses it, because the more convenient thing to use is what's called the cross-entropy loss, and so you'll hear everywhere that we're training with cross-entropy loss. So I just wanted to briefly mention that and explain what's going on there.
The concept of cross entropy comes from baby information theory, which is about the amount of information theory I know. We assume that there's some true probability distribution p, and our model has built some probability distribution q; that's what we've built with our softmax regression, and we want a measure of whether our estimated probability distribution is a good one. The way we do it with cross entropy is: we go through the classes and ask, "What's the probability of the class according to the true model?" Using that weighting, we work out the log of the probability according to our estimated model, sum those up, and negate the result, and that is our cross-entropy measure. In general, this gives you a measure of information between distributions.
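(In symbols, this is the standard definition being described:)

$$ H(p, q) = -\sum_{c=1}^{C} p(c)\,\log q(c) $$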
But in our particular case, remember that for each example we're assuming this is a piece of labeled training data, so we're saying, for that example, the right answer is class seven. Therefore our true distribution p is, for this example, class seven with probability one, and anything else with probability zero. If you think about what then happens with this formula, you've got this summation over all the classes; p(c) is going to be either one or zero, and it's going to be one only for the true class, so what you're left with is that this equals minus the log of q(c) for the true class, which is what we were computing on the previous slide. Okay, so that's basically how you get to the cross-entropy loss.
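(To make the one-hot reduction concrete, a minimal NumPy sketch with hypothetical values, not from the lecture:)

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) * log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.1, 0.2, 0.6, 0.1])   # model's estimated distribution over 4 classes
p = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot "true" distribution: class 3 is correct

# With a one-hot p, the cross entropy reduces to -log q[true class].
assert np.isclose(cross_entropy(p, q), -np.log(q[2]))
```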
One other concept to mention: when you have a full dataset of a whole bunch of examples, the cross-entropy loss is taking the per-example average; I guess it's what information theory people sometimes call the cross-entropy rate. Additionally, factored in there: when you're training on these examples, it's that one-hot target vector that's coming in there.
Okay. So that's cross-entropy loss. Is that okay? Yeah?
[Student] Is there sometimes a mixture of the actual labels in the ground truth?

Sure, good question. Right, so the simplest case is that your gold data has been hand-labeled: someone has labeled one class as one and the rest as zero. But you can think of cases where that isn't the case. One case is that human beings sometimes don't know the right answer, so if a human said, "I'm not sure whether this should be class three or four," you could imagine making training data where we put probability one-half on both of them; that wouldn't be a crazy thing to do, and then you'd have a true cross-entropy loss using more of a distribution. The case where it's much more commonly used in actual practice is that there are many circumstances in which people want to do semi-supervised learning. I guess this is a topic that both my group and Chris Re's group have worked on quite a lot, where we don't actually have fully labeled data, but we've got some means of guessing what the labels of the data are. And if we try to guess labels of data, quite often we'll say, "Here's this data point: it's two-thirds chance this label, but it could be these other four labels," and we'd use a probability distribution; then it's the more general cross-entropy loss.
Okay? Right. So that's cross-entropy loss. This bottom bit is a little bit different, which is to say: now we have the full dataset. The other thing to notice is that when we have a full dataset of x's, we also have a full set of weights. Here we were working with a row vector of weights for one class, but we're going to work it out for all classes. So we can simplify what we're writing here: we can start using matrix notation and work directly in terms of the matrix W. Okay.
So, for traditional ML optimization, our parameters are these sets of weights for the different classes. For each of the classes, we have a d-dimensional row vector of weights, because we're going to dot-product it with the d-dimensional input vector. So we have C times d items in our W matrix, and those are the parameters of our model. If we want to learn that model using the ideas of gradient descent and stochastic gradient descent, we're going to do what we started to talk about last time.
We have this set of parameters. We work out the gradient, the partial derivatives of the loss with respect to all of these parameters, and we use that to get a gradient update on our loss function: we move around the W's, and moving around the W's corresponds to moving this line that separates the classes. We fiddle with it so as to minimize our loss, which corresponds to choosing a line that best separates the items of the classes, in some sense.
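(In code form, a minimal sketch of that update loop for the softmax classifier just described; the sizes and the loss_gradient helper are illustrative, not the homework's actual API:)

```python
import numpy as np

# Hypothetical setup: C classes, d-dimensional inputs.
C, d = 3, 5
W = np.zeros((C, d))          # the parameters: one weight row per class
lr = 0.1                      # learning rate (step size)

def loss_gradient(W, x, y):
    """Gradient of the softmax cross-entropy loss for one example (x, y)."""
    scores = W @ x
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    probs[y] -= 1.0           # dJ/dscores = probs - one_hot(y)
    return np.outer(probs, x) # dJ/dW, shape (C, d)

# One stochastic gradient descent step on a single training example.
x, y = np.random.randn(d), 1
W -= lr * loss_gradient(W, x, y)
```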
Okay. So that's a basic classifier. The first question is: how are things going to be different with a neural network classifier? The central observation is that most of the classic classifiers that people used a lot of the time, which includes things like Naive Bayes models, basic support vector machines, and softmax or logistic regression, are fairly simple classifiers. In particular, those are all linear classifiers, which classify by drawing a line, or, in a higher-dimensional space, by drawing some kind of plane that separates examples. Having a simple classifier like that can be useful in certain circumstances; it gives you what's known in machine learning as a high-bias classifier, which there's lots of talk of in CS229. But if you have a dataset like this one, you can't do a very good job of classifying all the points correctly with a high-bias classifier, because you're only going to draw a line. So you'd like to have a more powerful classifier.

Essentially, what's been powering a lot of the use of deep learning is that in a lot of cases, when you have natural signals, so things like speech, language, images, and so on, you have a ton of data, so you could learn a quite sophisticated classifier. But representing the classes in terms of the input data is very complex; you could never do it by just drawing a line between the two classes. So you'd like to use some more complicated kind of classifier. The multi-layer neural networks that we're going to be starting to get into now do precisely that: they provide a way to learn very complex, almost limitlessly complex, classifiers.
So if you look at the decisions they're making in terms of the original space, they can be learning cases like this. I put the pointer on a couple of the slides here; this is a visualization that was done by Andrej Karpathy, who was a PhD student here until a couple of years ago. It's a little JavaScript app that you can find off his website, and it's actually a lot of fun to play with, to see what kinds of decision boundaries you can get a neural net to come up with.
Okay. For getting more advanced classification out of a neural net used for natural language, there are two things you can do that I want to talk about, which are in some sense the same thing when it comes down to it. But I'll mention them separately at the beginning: one is that we have these word vectors, and the second is that we're going to build deeper multi-layer networks.

So, the first crucial difference, which we already started to see with what we were doing last week, is that rather than having a word be "this is the word house," we instead say house is a vector of real numbers, and what we can do is change the vector that corresponds to house in such a way that we can build better classifiers. That means we're going to be moving house's representation around the space to capture things that we're interested in, like word similarity, analogies, and things like that. This is actually, you know, kind of a weird idea compared to conventional stats or ML: rather than saying we just have the parameters W, we also say that all of these word representations are also parameters of our model. So we're actually going to change the representations of words to allow our classifiers to do better. We're simultaneously changing the weights and changing the representation of words, and we're optimizing both of them at once, to try to make our model as good as possible. This is the sense in which people often talk about the deep learning models we build doing representation learning.
I said there were two ways, two things I was going to mention: one is this word vector representation learning, and the second is that we're going to start looking at deeper multi-layer neural networks. Hidden over here on the slide is the observation that you can really think of word vector embedding as just having a model with one more neural network layer. So imagine that each word was a one-hot vector over the different word types in your model: say you had a 150,000-dimensional vector with a one-hot encoding of the different words. Then you could say you have a matrix L, which is sort of your lexicon matrix, and you pass your one-hot vector for a word through a layer of the neural net, which multiplies L by the one-hot vector. And since this was a one-hot vector, the effect of doing that is to take out a column of L. So, really, we've got an extra matrix layer in our neural net, and we're learning the parameters of that matrix in the same way as we'd learn a deep neural network for other purposes. Mathematically, that completely makes sense, and it's a sensible way to think about what you're doing with word embeddings in neural networks.

Implementation-wise, this makes no sense at all, and no one does this, because it just doesn't make sense to do a matrix multiply when the result will be, okay, this is word ID 17: constructing a one-hot vector of length 150,000 with a one in position 17 and then doing a matrix multiply makes no sense. You just take out, as we've discussed, column (or row) 17 of your matrix, and that's what everyone actually does.
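(As a sketch of that point, minimal NumPy with a made-up vocabulary size, not from the lecture:)

```python
import numpy as np

V, d = 150_000, 100              # vocabulary size, embedding dimension
L = np.random.randn(d, V)        # lexicon matrix: one d-dimensional column per word

word_id = 17

# The "extra neural network layer" view: multiply L by a one-hot vector.
one_hot = np.zeros(V)
one_hot[word_id] = 1.0
embedding_slow = L @ one_hot     # O(V * d) work for the same answer

# What everyone actually does: just index out the column.
embedding_fast = L[:, word_id]

assert np.allclose(embedding_slow, embedding_fast)
```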
Okay. Here's my one obligatory picture of neurons for the class; so don't miss it, I'm not going to show it again all class. So, the origins [LAUGHTER] of neural networks were, in some sense, in trying to construct an artificial neuron that seemed, in some sense, to capture the kinds of computations that go on in human brains. It's a very loose analogy for what was produced, but, you know, our model here is this tiny part of our human brains. So here are neurons; this is a neuron cell here. What does a neuron consist of? Up the back, it's got these dendrites, lots of dendrites. Then it's got a cell body, and if there's stuff coming in on the dendrites, the cell body will become active, and then it all starts spiking down this long thing, which is called the axon. These axons then lead to the dendrites of a different cell, or lots of different cells; this one, I'm not sure it's shown, but some of these are going to different cells. And then you have these terminal buttons on the axon, which come close to the dendrites but have a little gap between them, and some miracles of biochemistry happen there: that's the synapse, of course, across which activation flows into the next neuron. So that was the starting-off model that people wanted to try and simulate in computation.
So people came up with this model of an artificial neuron. We have things coming in from other neurons at some level of activation: those are numbers x0, x1, x2. Then synapses vary in how excitable they are, that is, in how easily they'll let signal cross the synapse; that's modeled by multiplying the inputs by weights w0, w1, w2. Then the cell body sums the amount of excitation it's getting from the different dendrites, and it can have its own bias for how likely it is to fire: that's the b. We take that sum, and the cell has some overall threshold or propensity for firing, so we stick it through an activation function, which determines a firing rate, and that is the signal that goes out on the output axon. So that was the starting point, but, you know, really, for what we've ended up computing, we just have a little bit of baby math here, which actually looks very familiar, like the kind of baby math you see in linear algebra and statistics; it's really no different.
So, in particular, a neuron can very easily be a binary logistic regression unit. For logistic regression, you take your input x, multiply it by a weight vector, add your bias term, and then put it through a non-linearity like the logistic function. So you're calculating a logistic regression inside this sort of neuron model. And this is the difference with the softmax version of logistic regression that I was mentioning: the softmax for two classes has two sets of parameters, while this has just one set of parameters, and you're modeling the two classes by giving the probability of one class, from zero to one, depending on whether the input to the logistic function is highly negative or highly positive. Okay. So, really, we can just say these artificial neurons are like binary logistic regression units, or we can make variants of binary logistic regression units by using some different function f. We'll come back to that again pretty soon.
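(A single artificial neuron in code, a minimal sketch with arbitrary example numbers:)

```python
import numpy as np

def sigmoid(z):
    # The logistic function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -2.0])   # inputs arriving on the "dendrites"
w = np.array([0.3, -0.1, 0.8])   # synapse weights
b = 0.1                          # the cell's own bias toward firing

h = sigmoid(w @ x + b)           # the neuron's output "firing rate" in (0, 1)
print(h)
```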
Okay. Well, that gives us one neuron, and one neuron is a logistic regression unit for current purposes. Crucially, what we're wanting to do with neural networks is to say: well, why only run one logistic regression? Why don't we run a whole bunch of logistic regressions at the same time? So here are our inputs, and here's our little logistic regression unit, but we could run three logistic regressions at the same time, or any number of them. Well, that's good, but for conventional training of a statistical model, we'd have to determine, for those orange outputs of the logistic regressions, what we're training each of them to capture; we'd have to have data for what they're going to try to predict. And so the secret of building bigger neural networks is to say: we don't actually want to decide ahead of time what those little orange logistic regressions are trying to capture. We want the neural network to self-organize, so that those orange logistic regression units learn something useful.
And what is something useful? Well, our idea is to say: we do actually have some task that we want to do at the end of the day. Maybe we want to decide whether a movie review is positive or negative, something like sentiment analysis. And we're going to have a logistic regression classifier there telling us positive or negative. But the inputs to that aren't going to be directly something like the words in the document; they're going to be this intermediate layer of logistic regression units, and we're going to train this whole thing to minimize our cross-entropy loss. Essentially, what we're going to want to have happen, and what the backpropagation algorithm will do for us, is to say: you things in the middle, it's your job to find some useful way to calculate values from the underlying data, such that it'll help our final classifier make a good decision. In particular, back to this picture: the final classifier is just a linear classifier, a softmax or logistic regression; it's going to have a line like this. But the intermediate classifiers, which are like a word embedding, can re-represent the space and shift things around. So they can learn to shift things around in such a way that you're learning a highly non-linear function of the original input space.
Okay. And at that point, it's simply a matter of saying: well, why stop there? Maybe it gets even better if we put in more layers. This gets us into the area of deep learning. There have been, so to speak, three comings of neural networks. The first work, in the 50s, was essentially when people had a model of a single neuron like this, and only gradually worked out how it related to more conventional statistics. The second version of neural networks, which we saw in the 80s and early 90s, was when people built neural networks like this with one hidden layer, where a representation could be learned in the middle. But at that time, it really wasn't effective: by and large, people weren't able to build deeper networks and get them to do anything useful. So you had these neural networks with one hidden layer, and precisely the motivating question for the research that started into deep learning was: we believe we'll be able to do even more sophisticated classification for more complex tasks, things like speech recognition and image recognition, if we could have a deeper network, one which would be able to more effectively learn more sophisticated functions of the input, which would allow us to do things like recognize the sounds of a language. How could we possibly train such networks so they'll work effectively? And that's the kind of thing we'll go on to, starting in this lecture and more so in the next lecture.
But before we get there, just to underline it again: once we have something like this as a layer of a neural network, we have a vector of inputs and a vector of outputs, and everything is connected, so that we've got weights along every one of these black lines. So we can say a1 is: you take the weights times each component of x, add a bias term, and then run it through our non-linearity, and that gives us an output. And we do that for each of a1, a2, and a3. So, again, we can regard a as a vector, and we can collapse this into matrix notation for working out the effects of layers. The fully connected layers are effectively matrices of weights, and we commonly write them like this, where we have the bias terms as a vector of bias terms. There's a choice there: you can either have an always-on input, and then the bias terms become part of the weights of a slightly bigger matrix with one extra column or row; or you can just have them separately in those b's.
Okay. And then the final note here: once we've calculated this part, we always put things through a non-linearity, which is referred to as the activation function; something like the logistic transform I showed earlier is an activation function. This is written as an activation function taking a vector input and giving a vector output, and what that always means is that we apply the function element-wise. So we're applying the logistic function, which is naturally a one-input, one-output function, like the little graph I showed before; when we apply it to a vector, we apply it to each element of the vector, element-wise.
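(Putting the layer together, a minimal sketch with made-up sizes:)

```python
import numpy as np

def sigmoid(z):
    # Applied to a vector, NumPy computes this element-wise automatically.
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_out = 4, 3
x = np.random.randn(d_in)          # input vector
W = np.random.randn(d_out, d_in)   # one weight row per output unit (the black lines)
b = np.random.randn(d_out)         # one bias term per output unit

z = W @ x + b                      # affine part of the layer
a = sigmoid(z)                     # element-wise non-linearity: the layer's output
print(a.shape)                     # (3,)
```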
Okay. We'll come back very soon to say more about non-linearities and which non-linearities people actually use. But something you might be wondering is: why does he always have these non-linearities, and say there has to be an f function there? Why don't we just calculate z = Wx + b in one layer, and then go on to another layer that does z2 = W2 z1 + b2, and keep on going with layers like that? There's a very precise reason for that, which is: if you want a neural network to learn anything interesting, you have to stick in some function f which is non-linear, such as the logistic curve I showed before. The reason is that if you're doing linear transforms like Wx + b, then W2 z1 + b2, then W3 z2 + b3, you're doing a sequence of linear transforms; and multiple linear transforms just compose into one linear transform, right? One linear transform is rotating and stretching the space somehow, and you can rotate and stretch the space again, but the result of that is just one bigger rotation and stretch of the space. So you don't get any extra power for a classifier by simply having multiple linear transforms.
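(Written out, a one-line derivation consistent with what's being said, not from the slides: composing two linear layers gives)

$$ W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b', $$

which is itself just a single linear transform with W' = W_2 W_1 and b' = W_2 b_1 + b_2.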
But as soon as you stick in almost any kind of non-linearity, you get additional power. And so, you know, in general, when we're doing deep networks, in the middle of them we're not thinking, "Ah, it's really important to have a non-linearity for thinking about probabilities" or something like that. Our general picture is: we want to be able to do effective function approximation, or curve fitting. We'd like to learn a space like this, and we can only do that if we're putting in some non-linearities which allow us to learn these kinds of curvy decision patterns. So f is used, effectively, for doing accurate function approximation, or a sort of pattern matching, as you go along.
Okay, we're behind already. So that was the intro to baby neural networks. All good? Any questions? Yes?
[Student] Yeah, like, say feature one and feature four, if you multiply them together, are highly indicative of the label y; can you get that product relationship with just a couple of layers that are linear?

Yes, good question. So in conventional stats, you have your basic input features, and when people are building something like a logistic regression model by hand, people often say: something that's really important for classification is looking at the pair of feature four and feature seven; if both of those are true at the same time, something important happens. That's normally referred to in stats as an interaction term, and you can add interaction terms to your model by hand. So, essentially, a large part of the secret here is having these intermediate layers: they can learn to build interaction terms by themselves. Yeah, so it's sort of automating the search for the higher-order terms that you want to put into your model.
Okay, I'll go on. Other questions? Okay. So here's a brief little interlude on a teeny bit more NLP, which is the kind of problem we're going to look at for a moment. This is the task of named entity recognition that I very briefly mentioned last time. So, if we have some text... wait, it isn't appearing here. Okay. If we have some text, something that people in all sorts of places want to do is find the names of the things that are mentioned. And normally, as well as finding the names of things, you'd actually like to classify them: to say, for example, that some of them are organizations, some of them are people, and some of them are places. This has lots of uses, you know: people like to track mentions of companies and people in newspapers and things like that. When people do question answering, a lot of the time the answers to questions are what we call named entities: the names of people, locations, organizations, pop songs, movie names; all of those kinds of things are named entities. And if you want to start building up a knowledge base automatically from a lot of text, what you normally want to do is get out the named entities and get out the relations between them. So this is a common task.
So how can we go about doing that? A common way is to say: we're going to go through the words one at a time, and they're going to be words in a context, just like they were for word2vec, and what we're going to do is run a classifier and assign each of them a class. So we'll say the first word is organization, the second word is organization, the third word isn't a named entity, the fourth word is a person, the fifth word is a person, and continue down. So we're running a classification of a word within a position in the text, where it's got surrounding words around it. Then, to say what the entities are (many entities are multi-word terms), the simplest thing you can imagine doing is just to take any sequence of words that are all classified the same and call that the entity, "Shen Guofang" or something like that. There's a reason why that's slightly defective, and so what people often use is the BIO encoding that I show on the right, but I'm just going to run ahead and not do that now.
So it might seem at first that named entity recognition is trivial, because, you know, you have company names: Google and Facebook are company names, and whenever you see Google or Facebook you just say "company," and how could you be wrong? But in practice, there's a lot of subtlety, and it's easy to be wrong in named entity recognition. These are just some of the hard cases. It's often hard to work out the boundaries of an entity. So in this sentence, "First National Bank donates two vans to Future School of Fort Smith": there's presumably the name of a bank there, but is it "National Bank," with "First" just being the first word of the sentence, which is capitalized, as in "First she ordered some food" or something? So it's kind of unclear what it is. Sometimes it's hard to know whether something's an entity at all: at the end of this sentence, is "Future School" the name of some exciting kind of 21st-century school, or does it just mean a future school that's going to be built in this town? Is it an entity or not at all? Working out the class of an entity is often difficult: "to find out more about Zig Ziglar," and reading features by him, what class is Zig Ziglar? Kind of hard to tell if you don't know; it's actually a person's name. And there are various entities that are ambiguous, right? So Charles Schwab in text is, 90% of the time, an organization name, because there's Charles Schwab Brokerage. But in this particular sentence, "in Woodside, where Larry Ellison and Charles Schwab can live discreetly among wooded estates," it's a reference to Charles Schwab the person. So there's a fair bit of understanding, variously, that's needed to get it right.
Okay. So what are we going to do with that? This suggests that what we want to do is build classifiers for language that work inside a context. In general, classifying a word outside of a context isn't very interesting; we don't actually do that much in NLP. But once you're in a context, it's interesting to do, and named entity recognition is one case; there are lots of other places it comes up. Here's a slightly cool one: there are some words that can mean both themselves and their opposite at the same time. So to "sanction" something can either mean to allow something, or it can mean to punish people who do it; and to "seed" something can either mean to plant seeds, as in seeding the soil, or it can mean to take seeds out of something, like a watermelon. You just need to know from the context which it is.
Okay. So that suggests the task: we classify a word in its context of neighboring words, and NER is an example of that. The question is how we might do that. A very simple way might be to say: well, we have a bunch of words in a row, which each have a word vector from something like word2vec; maybe we could just average those word vectors and then classify the resulting vector. The problem is that that doesn't work very well, because you lose position information: you don't actually know anymore which of those word vectors is the one that you're meant to be classifying. So a simple way to do better than that is to say: well, why don't we make a big vector of a word window? So here are the words, and they each have a word vector, and to classify the middle word in the context of plus or minus two words, we simply concatenate these five vectors together, and now we have a bigger vector, and we build a classifier over that vector. So we're classifying this x_window, which is then a 5d-dimensional vector if we're using d-dimensional word vectors.
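(Concretely, a minimal sketch; the word_vectors dict standing in for pretrained word2vec vectors is hypothetical:)

```python
import numpy as np

d = 4  # dimensionality of each word vector (toy size)
# Hypothetical pretrained vectors, e.g. from word2vec.
word_vectors = {w: np.random.randn(d)
                for w in ["museums", "in", "Paris", "are", "amazing"]}

window = ["museums", "in", "Paris", "are", "amazing"]  # center word plus/minus 2

# Concatenate the five word vectors into one 5d-dimensional input x_window.
x_window = np.concatenate([word_vectors[w] for w in window])
print(x_window.shape)  # (20,) == 5 * d
```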
We could do that in the kind of way we did previously: we could say, okay, for that big vector we're going to learn weights W, put it through a softmax classifier, and then make the decisions. That's a perfectly good way to do things, for its purpose.
What I want to get to in the last part of this is to start looking at matrix calculus. And, you know, we could use this model as a classifier and learn the weights of it, and indeed, the handout on the website that we suggest you look at does do it with a softmax classifier of precisely this kind. But for the example I do in class, I've tried to make it a bit simpler, and I want to do this quickly because I'm fast running out of time. So, one of the famous early papers of neural NLP was this paper by Collobert and Weston, which was first an ICML paper in 2008, and which actually, just a couple of weeks ago, won the ICML 2018 test-of-time award; then there's a more recent journal version of it from 2011. They used this idea of window classification to assign classes like named entities to words in context, but they did it in a slightly different way.
So what they said is: well, we've got these windows; this is one with a location named entity in the middle, and this is one without a location entity in the middle. What we want to do is have a system that returns a score: it should return a high score, just as a real number in this case, and it should return a low score if there isn't a location name in the middle of the window. So, explicitly, the model just returns the score: you take the top level of your neural network, a, dot-product it with a vector u, and with that final dot product you just return a real number. They used that as the basis of their classifier.
So, in full glory, what you had was this window of words; you looked up a word vector for each word; you then concatenated the word vectors for the window; you multiplied them by a matrix and added a bias to get a hidden layer, which is a; and then you multiplied that by a final vector, and that gave you a score for the window. You wanted the score to be large if it was a location, and small if it wasn't a location. So in this pretend example, where we have four-dimensional word vectors, that means the window here is a 20 x 1 vector; for calculating the next hidden layer we've got an 8 by 20 matrix plus the bias vector; then we've got this 8-dimensional hidden layer; and then we compute a final real number.
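(As a sketch of that computation, using the same shapes as the pretend example, with random values standing in for learned parameters, and a logistic non-linearity assumed for f:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pretend sizes from the example: 4-dim word vectors, window of 5 words.
x_window = np.random.randn(20)   # concatenated window, 20 x 1
W = np.random.randn(8, 20)       # hidden layer weights
b = np.random.randn(8)           # hidden layer bias
u = np.random.randn(8)           # final scoring vector

a = sigmoid(W @ x_window + b)    # 8-dimensional hidden layer
score = u @ a                    # a single real number: the window's score
print(score)
```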
Okay. Crucially, this is an example of what the earlier question was about. We've put in this extra layer here, right? We could have just said: here's a big word vector of context; let's just stick a softmax or logistic classification on top to say yes or no for location. But by putting in that extra hidden layer, precisely this extra hidden layer can calculate non-linear interactions between the input word vectors. So it can calculate things like: if the first word is a word like "museum," and the second word is a preposition like "in" or "around," then that's a very good signal that there should be a location in the middle position of the window. So extra layers of a neural network let us calculate these kinds of interaction terms between our basic features.
好的，嗯，这是
Okay. Um, so there's
这里还有几张幻灯片展示了他们的模型的细节，
a few more slides here that sort of go through the details of their model,
但我现在就跳过这些，因为我有点落后了。
but I'm gonna just skip those for now because I'm a little bit behind.
在最后，我们得到了这个分数。
And at the end of it we've just got this score.
这是我们的模型，我刚刚讲过的
So this is our model which is the one that I just outlined where we're
计算分数，我们想要一个大的分数，嗯，位置。
calculating the score and we're wanting a big score, um, for location.
And so, what we're going to want to do is consider how we can use this model to learn the parameters of our neural network. In particular, remember, it's the same story we've had before. We have a loss function J, and we want to work out the gradient of the loss function with respect to our current theta parameters. Then we want to subtract a little multiple of that, given by the learning rate, from our current parameters to get updated parameters, and if we repeatedly do that stochastic gradient descent, we'll have better and better parameters, which give higher probability to the things we're actually observing in our training data.
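In code, that update is essentially one line. Here is a minimal sketch, where sgd_step, theta, and grad_J are hypothetical names standing in for the parameters and the gradient of the loss at those parameters:

```python
import numpy as np

def sgd_step(theta, grad_J, learning_rate=0.01):
    """One SGD update: subtract a small multiple of the gradient,
    scaled by the learning rate."""
    return theta - learning_rate * grad_J

# Hypothetical parameters and a gradient of matching shape.
theta = np.zeros(5)
grad_J = np.ones(5)
theta = sgd_step(theta, grad_J)   # applied repeatedly, this is SGD
print(theta)
```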
So, the thing we want to know is: in general, how can we do this differentiation and work out the gradient of our loss function? And so I want to spend the remaining time in this lecture going through how we can do that by hand, using math, and then that will lead into discussing, more generally, the backpropagation algorithm in the next one.
Okay. So, if we're doing gradients by hand, we're doing multivariable calculus, multivariable derivatives. But in particular, normally the most useful way to think about this is as doing matrix calculus, which means we work directly with vectors and matrices to work out our gradients, and that's normally much faster and more convenient for summarizing our neural network layers than trying to do it in a non-vectorized way. But that doesn't mean it's the only way to do it. If you're confused about what's going on, sometimes thinking it through in the non-vectorized way can be a better way to understand what's happening and make more progress.
So, for example, last time, when I did the word2vec derivatives, writing too small on that board, sorry, I was doing it in a non-vectorized way: working out the weights and talking about them individually. But here we're going to do it with vectors and matrices.
And again, look to the lecture notes to cover this material in more detail. In particular, so that no one misses it, let me clarify what I mean by lecture notes. If you look at the course syllabus, in the left-hand column there are the slides that you can download, and straight under the slides it says "lecture notes". That's what I mean by the lecture notes. In the middle column there are then some readings, and there are some additional things there that cover similar material, so they might be helpful as well. But the thing that's closest to what I'm about to present is the lecture notes that appear immediately under the slides link.
Okay. So my hope here is the following: if you can't remember how to do single-variable calculus, sorry, you're basically sunk and might as well leave now. [LAUGHTER] I'm assuming you know how to do single-variable calculus, and I'm assuming you know what a vector and a matrix are. But I hope that even if you never did multivariable calculus, or you can't remember any of it, what we have to do here is not that hard, and you can do it. So, here's what you do.
All right. So, if we have a simple function f(x) = x cubed, its gradient — and the gradient is the slope, right, saying how steep or shallow something is, and also, once we go into multiple dimensions, the direction of the slope — its gradient is just its derivative. So its derivative is 3x squared, and if you're at the point x = 3, the slope there is 27, which is very steep.
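If you want to see that symbolically, here's a quick check with sympy (assuming you have it installed):

```python
import sympy as sp

x = sp.symbols('x')
f = x ** 3
print(sp.diff(f, x))              # 3*x**2
print(sp.diff(f, x).subs(x, 3))   # 27: a very steep slope at x = 3
```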
Okay. So, what if we have a function with one output, but now it has many inputs? That's the sort of function we had with the dot products, where we compute u^T v or w^T x to calculate a value. Well, then what we calculate is a gradient, which is a vector of partial derivatives with respect to each input. So you take the slope of the function as you change x1, the slope of the function as you change x2, through to the slope of the function as you change xn, and each of these you can calculate just as if you were doing single-variable calculus. You put them all in a vector, and that gives you the gradient; in multi-dimensional spaces, the gradient then gives you the direction and slope of the surface that touches your multi-dimensional function f.
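Here is a small sketch of that idea with a made-up two-input function; each partial derivative is checked by a single-variable-style finite difference:

```python
import numpy as np

# Hypothetical function with several inputs, one output: f(x) = x1**2 + 3*x2.
f = lambda x: x[0] ** 2 + 3 * x[1]
grad = lambda x: np.array([2 * x[0], 3.0])   # the vector of partial derivatives

x0, eps = np.array([1.0, 2.0]), 1e-6
for i in range(2):
    d = np.zeros(2)
    d[i] = eps
    # Each slot is just single-variable calculus on one coordinate.
    print((f(x0 + d) - f(x0)) / eps, grad(x0)[i])
```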
Okay. So that's getting a bit scarier, but it gets a little bit scarier than that, because if we have a neural network layer, we then have a function which has n inputs, the input neurons, and m outputs. If that's the case, you then have a matrix of partial derivatives, which is referred to as the Jacobian.
So in the Jacobian, you take these partial derivatives with respect to each output along the rows and with respect to each input down the columns. And so you get these m by n partial derivatives, considering every combination of an output and an input. But again, you can fill in every cell of this matrix just by doing single-variable calculus, provided you don't get yourself confused.
Okay. We already saw, when we were doing word2vec, that the central tool we have to use to work out the derivatives of something like a neural network model is this: we have a sequence of functions that we run one after another. In a neural network you're running a sequence of functions one after another, so we have to use the chain rule to work out derivatives when we compose functions.
So, if we have one-variable functions — say z = 3y and y = x squared — and we want to work out the derivative of z with respect to x, we say: aha, that's a composition of two functions, so I use the chain rule. That means what I do is multiply the derivatives. So I take dz/dy — sorry, I almost said that wrong — dz/dy is just 3; that's the derivative of the top line. And then dy/dx is 2x. I multiply those together and get the answer: the derivative of z with respect to x is 6x.
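Again as a quick sympy sanity check (assuming sympy is available):

```python
import sympy as sp

x = sp.symbols('x')
y = x ** 2            # inner function
z = 3 * y             # outer function, composed with y
print(sp.diff(z, x))  # chain rule: (dz/dy) * (dy/dx) = 3 * 2x = 6x
```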
Okay. This bit then gets a little bit freakier, but it's true: if you have lots of variables at once, you simply multiply the Jacobians and you get the right answer. So if we now imagine our neural net — this is our typical neural net layer, right? We have our weight matrix multiplied by the input vector, plus the bias, and then we put it through a non-linearity. And then if we want to know what the partials of h with respect to x are, we just say: huh, it's a function composition. So this is easy to do. We work out our first Jacobian, which is the partials of h with respect to z, and then we just multiply it by the partials of z with respect to x, and we get the right answer.
Easy. So here's an example Jacobian, a special case that comes up a lot, and it's good to recognize it; we'll see it with our neural net. One of the things we have is these elementwise activation functions: we have h = f(z). So, what is the partial derivative of h with respect to z? Well, remember that we apply this elementwise, so we're actually saying h_i = f(z_i). So, formally, this function has n inputs and n outputs, and its partial derivatives are going to be an n by n Jacobian.
But if we think about what's happening there, what we actually find when we work out the terms of this — how does f(z_i) change as you change z_j? — is that if j is not equal to i, it makes no difference at all, right? If my f function is something like putting a number through the logistic function, or anything else applied elementwise to a number, it makes no difference to the calculation of f(z_i) if I change z_j, because z_j is just not in the equation. And so, therefore, the only terms that actually occur and are non-zero are the terms where i equals j.
So, for working out these partial derivatives: if i does not equal j, it's zero. If i does equal j, then we have to work out a single-variable derivative — the derivative of the activation function. And so this is what a Jacobian looks like for an activation function: it's a diagonal matrix. Everything else is zero, and for the activation function we work out its derivative and then evaluate it at the different z_i values. Okay, so that's the Jacobian for an activation function.
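Here is a small NumPy sketch of that diagonal Jacobian — tanh is an assumed choice of activation — with a finite-difference check of one column:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(4)

f = np.tanh                          # assumed elementwise activation
fprime = lambda z: 1 - np.tanh(z) ** 2

J = np.diag(fprime(z))               # n x n Jacobian: diagonal, zeros elsewhere

# Nudging z_0 only moves h_0, so the first column has one non-zero entry.
eps = 1e-6
d = np.zeros(4)
d[0] = eps
print((f(z + d) - f(z)) / eps)       # approximately [f'(z_0), 0, 0, 0]
print(J[:, 0])
```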
What are the other main cases that we need for a neural network? These I go through a little more slowly in the same lecture notes, but they're similar to what we saw in the very first class.
So if we want to work out the partial derivatives of Wx + b with respect to x, what we get is W. And if we want to work out the partial derivative of Wx + b with respect to b, we get an identity matrix, because b just passes straight through: each output gets its own b_i added on, so the ones come out on the diagonal to preserve the b. And this was the case we saw when we were doing the word vectors: if you have a vector dot product of u and h and you ask for the partial derivatives of that with respect to u, then you get h transpose. If you haven't seen those before, look at the lecture notes handouts and see if you can compute them at home and convince yourself they make sense; but for the moment we're going to believe them and use them to see how we can work out derivatives inside the neural network.
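One way to convince yourself at home is a quick finite-difference check of those three facts; a minimal sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W, x, b = rng.standard_normal((3, 4)), rng.standard_normal(4), rng.standard_normal(3)
u, h = rng.standard_normal(5), rng.standard_normal(5)
eps = 1e-6

# d(Wx + b)/dx = W: nudging x_0 moves the output by the first column of W.
d = np.zeros(4); d[0] = eps
print((W @ (x + d) + b - (W @ x + b)) / eps, W[:, 0])

# d(Wx + b)/db = I: nudging b_0 moves output 0 by exactly that amount.
d = np.zeros(3); d[0] = eps
print((W @ x + (b + d) - (W @ x + b)) / eps)   # approximately [1, 0, 0]

# d(u.h)/du = h^T: nudging u_0 moves the score by h_0.
d = np.zeros(5); d[0] = eps
print(((u + d) @ h - u @ h) / eps, h[0])
```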
Okay. So here's the same neural network we saw before. We have a window of words, we look up word vectors, we put them through a hidden layer, and then we just do a vector dot product to get this final score. And so, what we [NOISE] want to do, to be able to train our neural network, is find out how s changes depending on all the parameters of the model: the x, the W, the b, the u. So we want to work out the partial derivatives of s with respect to each of those, because then we can work out: okay, if you move b up, the score gets better, which is good if there really is a location in the middle, and therefore we'll want to nudge the elements of b up appropriately.
Okay, and I'm just doing the gradient with respect to the score here; I've skipped over a couple of slides. So if you're just staring at this picture and asking, well, how do I work out the partial derivative of s with respect to b — it probably doesn't look obvious.
So the first thing you want to do is break the equations up into simple pieces that compose together, right? You have the input x; that goes into z = Wx + b; then you compose that with the next thing, h = f(z), our activation function; and then this h goes into the next thing, s = u^T h. So we've got this sequence of functions.
And pretty much you want to break things up as much as you can. I mean, I could have broken this up even further: I could have said z1 = Wx and z = z1 + b. But it turns out that if you've just got things added together, you can do that in one step, because that pathway separates cleanly when doing the derivatives; anything else that composes together, though, you want to pull out into separate pieces.
Okay. So now our neural net is doing a sequence of function compositions, and when we see that we say: okay, we know how to do that — the chain rule. So if you want to work out the partials of s with respect to b, it's just going to be the product of the derivatives of each step along the way: the partial of s with respect to h, times the partial of h with respect to z, times the partial of z with respect to b, and that will give us the right answer. So then all we have to do is actually compute that. I think this just shows that we're taking the partials of each step of that composition.
Okay. So now we want to compute that, and this is where I'm going to use the Jacobians that I asserted without much proof on the preceding slide. First of all, we have the partial of s with respect to h. Well, s is just the dot product of the two vectors u and h, so the Jacobian of that with respect to h is just u transpose. Okay, that's a start. Then we have h = f(z); that's the activation function, so the Jacobian of that is the diagonal matrix made of the elementwise derivative of the function f. And then we have the partial of z with respect to b, and that's the bit that comes out as the identity matrix. And so that gives us our calculation of the partial of s with respect to b: we can see that the identity matrix goes away, so we end up with u transpose combined elementwise with f'(z).
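Putting those three Jacobians together in NumPy — a sketch under the same assumptions as before (logistic activation, made-up dimensions) — and checking one component of the partial of s with respect to b numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b, x = rng.standard_normal((8, 20)), rng.standard_normal(8), rng.standard_normal(20)
u = rng.standard_normal(8)

f = lambda z: 1 / (1 + np.exp(-z))     # assumed logistic activation
fprime = lambda z: f(z) * (1 - f(z))

z = W @ x + b
# ds/db = (ds/dh)(dh/dz)(dz/db) = u^T . diag(f'(z)) . I, i.e. u elementwise f'(z).
delta = u * fprime(z)                  # the error signal, shape (8,)

score = lambda b_: u @ f(W @ x + b_)
eps = 1e-6
d = np.zeros(8)
d[0] = eps
print((score(b + d) - score(b)) / eps, delta[0])   # should match
```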
Okay, suppose we then want to go on and compute the partial of s with respect to W? Well, the starting point is exactly the same chain rule, where we work out each of the stages: first you work out z from the Wx + b part, then you put it through the non-linearity, then you do the dot product of the vectors. So that part is the same. And what you should notice is that if you compare the partial of s with respect to W against the partial of s with respect to b, most of the factors are the same; it's only the part at the end that's different. And that makes sense in terms of our neural net, right? When we had our neural net, the W and the b were coming in at the same place, and once you've done some stuff with them, you're putting things through the same activation function and doing the same dot product to create a score. So you're doing the same calculations that you then compose with, and it makes sense that you should be getting the same partial derivatives occurring at that point.
Oops. And so, effectively, these partial derivatives correspond to the computations in the neural network that are above where W and b enter. Those are commonly referred to as delta — note, the symbol delta, which is different from the partial-derivative symbol — and delta is referred to as the error signal in neural-network talk. So it's what you calculate as the partial derivatives above the parameters that you're differentiating with respect to.
So, a lot of the secret, as we'll see next time — a lot of the secret of what happens with backpropagation — is just that we want to do efficient computation, in the way computer-science people like to do efficient computation. And precisely what we want to notice is that there is one error signal that comes from above, and we want to compute it once, and then reuse it when calculating both the partial derivatives with respect to W and with respect to b.
Okay. So there are two things still to do. One is: it'd be kind of useful to know what the partial derivative of s with respect to W actually looks like. I mean, is it a number, a vector, a matrix, a three-dimensional tensor? And then we actually want to work out its values, and to work out its values we're still going to have to work out the partial derivative of z with respect to W. But first of all, let's just try to work out its shape: what kind of shape does it have?
And this is actually a bit tricky; it's the dirty underbelly of doing this kind of matrix calculus. Since our weight matrix W is an n by m matrix, the partial of s with respect to W is the derivative of a function with n times m inputs — all the elements of W — and just one output, which is our score. So according to what I said before, that sounds like we should have a 1 by (n times m) Jacobian. But it turns out that's not really what we want, right? Because what we want to do is use what we calculate inside this stochastic gradient descent update algorithm: we take the old weight matrix and we'd like to subtract a bit from it to get a new weight matrix. So it would be kind of nice if the shape of our Jacobian were the same shape as W. And in general, what you always want to do with neural nets is follow what we call the shape convention: we represent the Jacobian so it's in the same shape as the input. This whole thing is kind of the bad part of doing matrix calculus: there's a lot of inconsistency in how people represent it. If you go to different fields like economics and physics, some people use a numerator convention, some people use a denominator convention. We're using neither of those; we're going to use this shape convention, so we match the shape of the input, which makes it easy to do our weight updates.
Okay. Right. So that's what we want the answer to look like. So the final thing we need to do to work out the partial of s with respect to W: we have the error signal delta, which is going to be part of the answer, and then we want to work out the partial of z with respect to W. Well, it turns out — and I'm about to be saved by the bell here, since I'm down to two minutes left — it turns out that what we end up with is the product of delta times x: effectively, we've got the local error signal above W, and we have the inputs x, and we work out an outer product of them. And the way to think about this is in terms of the W's: we've got the elements of the W matrix, these different connections between our neurons, and each one of them connects one output to one input. So we're making this n by m matrix of partial derivatives, where each entry is the error signal for the appropriate output multiplied by the appropriate input, and those give us the partial derivatives.
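In NumPy, under the shape convention, the whole gradient computation is a couple of lines. Here is a sketch with the same assumed setup as before, showing how the one error signal delta is computed once and reused for both W and b:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b, x = rng.standard_normal((8, 20)), rng.standard_normal(8), rng.standard_normal(20)
u = rng.standard_normal(8)

f = lambda z: 1 / (1 + np.exp(-z))     # assumed logistic activation
fprime = lambda z: f(z) * (1 - f(z))

z = W @ x + b
delta = u * fprime(z)                  # error signal: computed once...

grad_W = np.outer(delta, x)            # ...reused here; shape (8, 20) matches W
grad_b = delta                         # ...and here; shape (8,) matches b

# Because the gradients match the parameter shapes, the SGD update is direct:
lr = 0.01
W = W - lr * grad_W
b = b - lr * grad_b
```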
I'm skipping ahead quickly in my last minute. Okay, right: this is what I said about using the shape convention — I'm going to skip that.
Okay. So I ran out of time a teeny bit at the end, but I think that hopefully conveys most of the idea of how you can use the chain rule and work out the derivatives in terms of these vector and matrix derivatives. [NOISE] And essentially, what we want to do for backpropagation is ask how we can get a computer to do this automatically for us, and do it efficiently. That's what deep learning frameworks like TensorFlow and PyTorch do, and how you can do that is what we'll look at more next time.
