
Hi! In the last class, we looked at a bare-bones
algorithm for constructing decision trees.
To get an industrial strength decision tree
induction algorithm, we need to add some more
complicated stuff, notably pruning.
We're going to talk in this lesson about
pruning decision trees.
Here's a guy pruning a tree, and that's a
good image to have in your mind when we're
talking about decision trees.
We're looking at those little twigs and little
branches around the edge of the tree, seeing
if they're worthwhile, and snipping them off
if they're not contributing.
That way, we'll get a decision tree that might
perform worse on the training data, but perhaps
generalizes better to independent test data.
That's what we want.
Here's the weather data again.
I'm sorry to keep harking back to the weather
data, but it's just a nice simple example
that we all know now.
I've added here a new attribute.
I call it an ID code attribute, which is different
for each instance.
I've just given them an identification code:
a, b, c, and so on.
Let's just think back to the last lesson: what's
going to happen when we consider which is
the best attribute to split on at the root,
the first decision?
We're going to be looking for the information
gain from each of our attributes separately.
We're going to gain a lot of information by
choosing the ID code.
Actually, if you split on the ID code, that
tells you everything about the instance we're
looking at.
That's going to be a maximal amount of information
gain, and clearly we're going to split on
that attribute at the root node of the decision
tree.
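To make that concrete, here is a minimal sketch of the information
gain calculation from the last lesson, in plain Java with no Weka
dependencies. The class counts (9 yes, 5 no) and the per-branch
counts for outlook come from the weather data; the point is that
splitting on the ID code gives one pure branch per instance, so its
gain equals the full entropy of the dataset -- the maximum possible.

    // A minimal sketch of information gain on the weather data:
    // 14 instances, 9 "yes" and 5 "no".
    public class InfoGainDemo {

        // Entropy in bits of a two-class distribution.
        static double entropy(int yes, int no) {
            double total = yes + no, e = 0;
            for (int c : new int[] {yes, no}) {
                if (c > 0) {
                    double p = c / total;
                    e -= p * Math.log(p) / Math.log(2);
                }
            }
            return e;
        }

        // Gain = entropy before the split minus weighted entropy after it.
        // branches[i] = {yes count, no count} within branch i.
        static double gain(int yes, int no, int[][] branches) {
            double total = yes + no, after = 0;
            for (int[] b : branches) {
                after += (b[0] + b[1]) / total * entropy(b[0], b[1]);
            }
            return entropy(yes, no) - after;
        }

        public static void main(String[] args) {
            // outlook: sunny {2 yes, 3 no}, overcast {4, 0}, rainy {3, 2}
            int[][] outlook = {{2, 3}, {4, 0}, {3, 2}};

            // ID code: 14 branches of one instance each, all pure.
            int[][] idCode = new int[14][2];
            for (int i = 0; i < 14; i++) idCode[i][i < 9 ? 0 : 1] = 1;

            System.out.printf("gain(outlook) = %.3f bits%n", gain(9, 5, outlook)); // ~0.247
            System.out.printf("gain(ID code) = %.3f bits%n", gain(9, 5, idCode)); // ~0.940
        }
    }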

But that's not going to generalize at all
to new weather instances.
To get around this problem, having constructed
a decision tree, decision tree algorithms
then automatically prune it back.
You don't see any of this; it just happens
when you start the algorithm in Weka.
How do we prune? There are some simple techniques
for pruning, and some more complicated techniques
for pruning.
A very simple technique is to not continue
splitting if the nodes get very small.
I said in the last lesson that we're going
to keep splitting until each node has just
one class associated with it.
Perhaps that's not such a good idea. If we
have a very small node with a couple of instances,
it's probably not worth splitting that node.
That's actually a parameter in J48.

I've got Weka going here. I'm going to choose J48
and look at the parameters.
There's a parameter called minNumObj.
If I mouse over that parameter, it says "The
minimum number of instances per leaf".
The default value for that is 2.
The second thing we do is to build a full
tree and then work back from the leaves.
It turns out to be better to build a full
tree and prune back rather than trying to
do forward pruning as you're building the
tree.
We apply a statistical test at each stage.
That's the confidenceFactor parameter.
It's here.
The default value is 0.25.
"The confidence factor used for pruning [smaller
values incur more pruning]."
Then, sometimes it's good to prune an interior
node, and to raise the subtree beneath that
interior node up one level.
That's called subtreeRaising.

That's this parameter here.
We can switch it on or switch it off.
"Whether to consider the subtree raising operation
during pruning." Subtree raising actually
increases the complexity of the algorithm,
so it would work faster if you turned off
subtree raising on a large problem.
I'm not going to talk about the details of
these methods.
Pruning is a messy and complicated subject,
and it's not particularly illuminating.
Actually, I don't really recommend playing
around with these parameters here.
The default values on J48 tend to do a pretty
good job.
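
For reference, here is a minimal sketch of setting those same
pruning options through Weka's Java API rather than the Explorer;
the values shown are just the defaults, which, as just said, tend
to do a good job.

    import weka.classifiers.trees.J48;

    // A sketch of configuring J48's pruning options programmatically;
    // these setters mirror the fields shown in the Explorer.
    public class J48Options {
        public static void main(String[] args) {
            J48 tree = new J48();
            tree.setMinNumObj(2);            // minimum number of instances per leaf
            tree.setConfidenceFactor(0.25f); // smaller values incur more pruning
            tree.setSubtreeRaising(true);    // turn off to speed up large problems
            System.out.println(String.join(" ", tree.getOptions()));
        }
    }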
Of course, it's become apparent to you now
that the need to prune is really a result
of the original unpruned tree overfitting
the training dataset.
This is another instance of overfitting.

Sometimes simplifying a decision tree gives
better results, not just a smaller, more manageable
tree, but actually better results.
I'm going to open the diabetes data.
I'm going to choose J48, and I'm just going
to run it with the default parameters.
I get an accuracy of 73.8%, evaluated using
cross-validation.
The size of the tree is 20 leaves, and a total
of 39 nodes.
That's 19 interior nodes and 20 leaf nodes.
Let's switch off pruning.
J48 prunes by default.
We're going to switch off pruning.
We've got an unpruned option here, which is
false, which means it's pruning.
I'm going to change that to true -- which
means it's not pruning any more -- and run
it again.
Now we get a slightly worse result, 72.7%,
probably not significantly worse.
We get a slightly larger tree -- 22 leaves and
43 nodes.
That's a double whammy, really.
We've got a bigger tree, which is harder to
understand, and we've got a slightly worse
prediction result.
We would prefer the pruned tree in this
example, on this dataset.
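
As an aside, here is a minimal sketch of running the same comparison
outside the Explorer, assuming Weka is on the classpath and that
diabetes.arff from Weka's data directory is in the working directory.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Cross-validates J48 with pruning on and off, as in the Explorer run.
    public class PruningDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1); // class is last

            for (boolean unpruned : new boolean[] {false, true}) {
                J48 tree = new J48();
                tree.setUnpruned(unpruned); // false = pruning (the default)

                // 10-fold cross-validation for the accuracy estimate.
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));

                // Build on the full dataset to report the tree's size.
                tree.buildClassifier(data);
                System.out.printf("unpruned=%b: %.1f%% correct, %d leaves, %d nodes%n",
                        unpruned, eval.pctCorrect(),
                        (int) tree.measureNumLeaves(), (int) tree.measureTreeSize());
            }
        }
    }

Pointing DataSource.read at breast-cancer.arff instead gives the more
extreme contrast coming up next.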
I'm going to show you a more extreme example
with the breast cancer data.
I don't think we've looked at the breast cancer data before.
The class is no-recurrence-events versus recurrence-events,
and there are attributes like age, menopause,
tumor size, and so on.
I'm going to go classify this with J48 in
the default configuration.

I need to switch on pruning -- that is, make
unpruned false -- and then run it.
I get an accuracy of 75.5%, and I get a fairly
small tree with 4 leaves and 2 internal nodes.
I can look at that tree here, or I can visualize
the tree.
We get this nice, simple little decision structure
here, which is quite comprehensible and performs
pretty well, 75% accuracy.
I'm going to switch off pruning.
Make unpruned true, and run it again.
First of all, I get a much worse result, 69.6%
-- probably significantly worse than the 75.5%
I had before.
More importantly, I get a huge tree, with
152 leaves and 179 nodes in total.

It's massive.
If I try to visualize that, I probably won't
be able to see very much.
I can try to fit that to my screen,
and it's still impossible to see what's going on here.
In fact, if I look at the textual description
of the tree, it's just extremely complicated.
That's a bad thing.
Here, an unpruned tree is a very bad idea.
We get a huge tree which does quite a bit
worse than a much simpler decision structure.
J48 does pruning by default and, in general,
you should let it do pruning according to
the default parameters.
That would be my recommendation.
We've talked about J48 -- or, in other words,
C4.5.
Remember, in Lesson 1.4, we talked about the
progression from Ross Quinlan's C4.5.

Here is a picture of Ross Quinlan, an Australian
computer scientist, at the bottom of the screen.
The progression went from Ross's C4.5 to J48,
which is its Java implementation, essentially
equivalent to C4.5.
It's a very popular method.
It's a simple method and easy to use.
Decision trees are very attractive because
you can look at them and see what the structure
of the decision is, see what's important about
your data.
There are many different pruning methods,
and their main effect is to change the size
of the tree.
They have a small effect on the accuracy,
and often make it worse.
They often have a huge effect on the size of
the tree, as we just saw with the breast cancer data.
Pruning is actually a general technique to
guard against overfitting, and it can be applied
to structures other than trees, like decision rules.
There's a lot more we could say about decision trees.

For example, we've been talking about univariate
decision trees -- that is, ones that have
a single test at each node.
You can imagine a multivariate tree, where
there is a compound test.
The test at a node might be 'if this attribute
is that AND that attribute is something else'.
You can imagine more complex decision trees
produced by more complex decision tree algorithms.
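
As a toy illustration, here is a sketch contrasting the two kinds of
node test; the attributes are borrowed from the weather data, where
J48's actual (univariate) tree classifies sunny days with humidity
above 75 as 'no'.

    // Univariate versus a hypothetical multivariate (compound) node test.
    public class CompoundTestDemo {
        public static void main(String[] args) {
            String outlook = "sunny";
            double humidity = 80;

            // Univariate, as in J48: one attribute per node, so the
            // conjunction emerges over two levels of the tree.
            if (outlook.equals("sunny")) {
                if (humidity > 75) {
                    System.out.println("univariate, two nodes: play = no");
                }
            }

            // Multivariate: a single node applies the compound test.
            if (outlook.equals("sunny") && humidity > 75) {
                System.out.println("multivariate, one node: play = no");
            }
        }
    }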
In general, C4.5/J48 is a popular and useful
workhorse algorithm for data mining.
You can read a lot more about decision trees
if you go to the course text.
Section 6.1 tells you about pruning and gives
you the mathematical details of the pruning
methods that I've just sketched here.
It's time for you to do the activity, and
I'll see you in the next lesson.
Bye for now!
