
Hello! In the last lesson, we looked at the
basic Perceptron algorithm, and now we're
going to look at the Multilayer Perceptron.
Multilayer Perceptrons are simply networks
of Perceptrons, networks of linear classifiers.
They have an input layer, some hidden layers
perhaps, and an output layer.
If we just look at the picture on the lower
left, the green nodes are input nodes.
This is actually for the numeric weather data.
Although you probably can't read the labels,
the top one is "outlook=sunny";
underneath is "outlook=overcast";
then "outlook=rainy"; and then we have "temperature",
"humidity" and "windy" for the nodes.

This is the numeric weather data, so "outlook"
is the only nominal variable, and that's been
made into three binary attributes.
These two yellow nodes are the output nodes, for "play" and "don't play" respectively.
Each of those two yellow nodes performs
a weighted sum, and each of the connections
has a weight.
If we look at the more complicated picture
to the right, we've got some red nodes here.
These are three hidden layers, with a different number of neurons (nodes) in each.
Each node performs a weighted sum of its inputs
and thresholds the result, just like in the
regular, basic Perceptron.
But in the basic Perceptron, you looked to
see whether the result was greater than zero
or less than zero.
In Multilayer Perceptrons, instead of using
that hard-edged function, people use what's
called a "sigmoid" function.

I've drawn a few sigmoid functions on the
slide up in the top right.
You can see that as they become more extreme,
they approach the step function, which
corresponds to the hard-edged threshold used
in the basic Perceptron.
But here we're going to use a smooth, continuous
sigmoid function.
Actually, there is a theoretical property
that the network will converge if the sigmoid
function is differentiable.
That's kind of important.
Anyway, that's by the by.
These nodes are often called "neurons", the
red nodes and the yellow nodes.
These are not to be confused with the neurons
that you have in your head.
The big questions are how many layers, and
how many nodes in each? We know for the input
layer, we're going to have one for each attribute,
and the attributes are numeric or binary.
For the output layer, we're going to have
one for each class.
How many hidden layers? Well, that's up to
you.

If you have zero hidden layers, that's the
standard Perceptron algorithm.
That's suitable if the data is linearly separable.
There are theoretical results:
with one hidden layer, that's suitable for a single,
convex region of the decision space;
two hidden layers are enough to generate arbitrary
decision boundaries.
However, people don't necessarily use two
hidden layers, because that really increases
the number of connections--
that's the number of weights that would have
to be learned.
The next big question is: how big should the
hidden layers be? Their sizes are usually chosen
somewhere between the sizes of the input and
output layers. A common heuristic, and the one
Weka uses, is the mean of the number of nodes
in the input and output layers.
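
As arithmetic, the heuristic is simply this; the input and output counts below are read off the figure for the numeric weather data, and, if I have Weka's option wildcards right, the value "a" for the hiddenLayers option applies the same rule:

```java
// Weka's heuristic: hidden-layer size is the mean of the number of
// input and output nodes. For the numeric weather data there are
// 6 inputs (outlook expanded into 3 binary attributes, plus
// temperature, humidity, and windy) and 2 outputs (play, don't play).
public class HiddenLayerHeuristic {
    public static void main(String[] args) {
        int numInputs = 6;
        int numOutputs = 2;
        int hiddenSize = (numInputs + numOutputs) / 2;
        System.out.println("Hidden layer size: " + hiddenSize); // prints 4
    }
}
```
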
What are these weights? Well, they're learned.
They're learned from the training set.
They are learned by iteratively minimizing
the error using the steepest descent method,
and the gradient is determined using a backpropagation
algorithm.
We're not going to talk about the backpropagation
here.
The change in weight is computed by multiplying
the gradient by a constant called the "learning
rate", and adding the previous change in weight
multiplied by another parameter called "momentum".
So Wnext (the next weight vector) is W + ΔW,
where ΔW is minus the learning rate times
the gradient (minus because we want to go
downhill), plus momentum times the previous
change in weight.
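
As a sketch with hypothetical variable names (the gradient itself would be supplied by backpropagation, which we're not covering), a single weight update looks like this:

```java
// One weight update with momentum; the gradient comes from
// backpropagation, which is not shown here.
public class WeightUpdate {
    double learningRate = 0.3;   // Weka's default learning rate
    double momentum = 0.2;       // Weka's default momentum
    double previousDeltaW = 0.0; // the previous change in this weight

    double update(double weight, double gradient) {
        double deltaW = -learningRate * gradient    // minus: we go downhill
                      + momentum * previousDeltaW;  // plus momentum times the last change
        previousDeltaW = deltaW;                    // remember for the next step
        return weight + deltaW;                     // Wnext = W + deltaW
    }
}
```
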
Multilayer Perceptrons can get excellent results,
but they often involve a lot of experimentation
with the number and size of the hidden layers
and the value of the learning rate and momentum
parameters.
Let's take a look in Weka.
I'm going to use the numeric weather data.
Over here, I've got it open.
I'm going to go to Classify and find MultilayerPerceptron
in the functions category.
Here it is, and let's just run it.
We get 79%.
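
If you'd rather script this than click through the Explorer, here's a minimal sketch using Weka's Java API; the dataset path is an assumption, and the accuracy may differ slightly from a GUI run depending on the evaluation settings:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunMLP {
    public static void main(String[] args) throws Exception {
        // Load the numeric weather data (path is illustrative).
        Instances data = DataSource.read("data/weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Default network: "a" asks for (attributes + classes) / 2
        // nodes in a single hidden layer.
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("a");

        // Evaluate with 10-fold cross-validation, as in the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```
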
I want to show you the network we used.

Let me just switch on GUI, the graphical user
interface.
Now when I run it, I get a picture of the
network.
That is Weka's default network.
These are the input nodes that we looked at
before, the green ones.
Weka has chosen 4 neurons in the hidden layer.
That's the average of the number of input
and output nodes.
There are 2 output neurons.
Going back to the slide:
when I tried IBk, I also got 79% on this
dataset.
J48 and so on do worse.
However, it's just a toy problem, so those
results aren't really indicative.
On real problems Multilayer Perceptrons often
do quite well, but they're slow.
There are a number of parameters:
the number of hidden layers and the size of
the hidden layers;
the learning rate and momentum.

The algorithm makes multiple passes through
the data, and training continues until the
error on the validation set consistently
increases (that is, we start going uphill),
or until the training time, the maximum number
of epochs allowed, is exceeded.
Going back to Weka, I'm going to configure
this to use 5 neurons, 10 neurons, and 20
neurons in 3 hidden layers.
Look at this!
You can see the three hidden layers with 5,
10, and 20 neurons --
an awful lot of weights here.
We've got the learning rate, and we can change
the momentum.
We've got the maximum number of epochs.
We can just run that.
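
The same configuration, sketched with Weka's option setters (setTrainingTime is Weka's name for the maximum number of epochs; the hidden-layer string matches the demo, and the other values are Weka's defaults):

```java
import weka.classifiers.functions.MultilayerPerceptron;

public class ConfigureMLP {
    public static void main(String[] args) {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setGUI(true);               // pop up the network picture during training
        mlp.setHiddenLayers("5,10,20"); // three hidden layers: 5, 10, and 20 neurons
        mlp.setLearningRate(0.3);       // multiplies the gradient
        mlp.setMomentum(0.2);           // multiplies the previous weight change
        mlp.setTrainingTime(500);       // maximum number of epochs
        // ...then build and evaluate as before.
    }
}
```
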
Also, in Weka, you can create your own network
structure.
You can add new nodes, add connections, and
delete nodes and so on.

I'm going to go back to Weka, and I'm just
going to use the default number of hidden
layers.
I've now got my 4 neurons in the 1 hidden
layer.
I'm going to add another hidden layer.
If I click empty space, I create a neuron.
It's yellow, which means it's selected.
I'm going to deselect it by clicking empty
space, and create another couple.
With this one here, I'm going to connect it
up to this.
If I click these, they connect the selected
neuron--that is, the yellow one--to the one
I click.
Then I can deselect it and select this one
and make connections here.
You can see it's pretty quick to add connections.
I've added another hidden layer.
Well, I need to do some things with the output
here, but you can get the idea from this.
We can click to select a node and right-click
in empty space to deselect.
We can create and delete nodes by clicking
an empty space to create and right-clicking
to delete.

We can create and delete connections, and
we can set parameters in this interface too.
Are they any good? Well, I tried the Experimenter
with 6 datasets, and I used 9 algorithms.
MultilayerPerceptron gave me the best results
on 2 of the 6 datasets.
The other wins were: SMO won on 2 other
datasets; J48 and IBk won on 1 dataset each.
When I say "win", I mean beat all the other
methods.
MultilayerPerceptron was not too bad, but
in fact it was between 10 and 2000 times slower
than other methods, which is a bit of a disadvantage.
Here's the summary.
Multilayer Perceptrons can implement arbitrary
decision boundaries given two or more hidden
layers, provided you've got enough neurons in
the hidden layers and provided they're trained properly.

Training is done by backpropagation, which
is an iterative algorithm based on gradient
descent.
In practice, you get quite good performance,
but Multilayer Perceptrons are extremely slow.
I'm still not a fan of Multilayer Perceptrons;
I'm sorry about that.
They might be a lot more impressive on more
complex datasets;
I don't know.
But for me, configuring Multilayer Perceptrons
involves too much messing around.
But you're going to be doing that in the activity
associated with this lesson.
Have fun with that, and we'll see you in the
next lesson.
Bye for now!
