
Hello again! We're going to take one last look at discretization in this lesson, Lesson 2.3. We're going to look at how J48 does discretization. J48 does deal with numeric attributes, so somehow, inside J48, it's got to effectively discretize an attribute, or at least be able to determine a split point for a numeric attribute.
Let's just review how J48 works.
It's a top-down, recursive, divide-and-conquer
algorithm, and we talked about this in the
last course, Data Mining with Weka.
The algorithm involves first of all selecting
an attribute to split on at the root node
(that's the "outlook" attribute in this case),
creating a branch for every possible value
of that attribute ("sunny", "overcast", and
"rainy"), splitting the instances into subsets,
then going down those three branches and repeating
recursively for each branch--selecting an
attribute and so on--using only instances
that reach the branch.
The key questions are: what's the best attribute
to split on? And: when should you stop the
process? The answer to the first question
is, the attribute with the greatest "information
gain"--at least, that's J48's answer.
Information gain is the amount of information that's gained by knowing the value of the attribute: the entropy of the class distribution before the split minus the weighted average entropy of the distributions after it. Entropy is defined in terms of p log p's: the sum of -p log2(p) over the class probabilities, measured in bits.
We talked about that briefly in the previous
course.

We didn't really go into the details, and I don't think they're too important for you at this point.
In the previous example, the weather data,
the information gain for outlook was 0.247
bits according to that calculation, and that
was the greatest information gain of all of
the attributes.
So that's the one that was split on.
That's how it works.
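
To make that calculation concrete, here's a minimal sketch of the entropy and information-gain arithmetic in Java (the class and method names are just illustrative; this is not J48's internal code). Using the class counts from the weather data, it reproduces the 0.247 bits for outlook:

```java
public class InfoGain {
    // Entropy of a class distribution, in bits: -sum of p * log2(p).
    static double entropy(int... counts) {
        double total = 0, e = 0;
        for (int c : counts) total += c;
        for (int c : counts) {
            if (c == 0) continue;           // 0 log 0 is taken to be 0
            double p = c / total;
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    public static void main(String[] args) {
        // Weather data: 9 "yes" and 5 "no" instances overall.
        double before = entropy(9, 5);
        // After splitting on outlook: sunny = [2 yes, 3 no],
        // overcast = [4 yes, 0 no], rainy = [3 yes, 2 no].
        // Each subset's entropy is weighted by the subset's size.
        double after = (5 * entropy(2, 3) + 4 * entropy(4, 0) + 5 * entropy(3, 2)) / 14;
        System.out.printf("gain(outlook) = %.3f bits%n", before - after); // 0.247
    }
}
```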
Now let's look at a numeric attribute.
This is the temperature attribute in the numeric
weather data.
Here the split point is a number. The trouble is, there are infinitely many possible numbers, so we can't possibly try them all.
However, we're going to choose split points mid-way between adjacent values in the training set. That reduces it to a finite problem: at most n-1 possibilities.
So here, for the temperature attribute, it
goes from 64 at the bottom end to 85 at the
top end.

Below are the class values of the instances:
when the value of temperature was 64 it was
a "yes" instance; and there were two instances
where the value was 72, one "no" and one "yes"
instance.
There are just n-1 possibilities, and we're
going to try them all.
Try all possible boundaries.
If we take the split point that's shown, on the left side of the split point we've got four "yes"s and two "no"s, and on the right side we've got five "yes"s and three "no"s. We can calculate the entropy before the split and the entropy after the split and subtract them, and the information gain in this case is 0.001 bits.
We might choose that if that's the greatest
information gain of all of the possible split
points.
That's how it's going to work for numeric
attributes.
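
As a sketch of that procedure (again illustrative Java, not J48's actual implementation), the snippet below takes the sorted temperature values and class labels from the numeric weather data, enumerates the midpoints between adjacent distinct values, and prints each candidate's information gain. Assuming the split shown on the slide is 71.5, midway between 71 and 72, its gain comes out at roughly 0.001 bits, matching the figure above:

```java
public class NumericSplit {
    // Entropy of a class distribution, in bits.
    static double entropy(int... counts) {
        double total = 0, e = 0;
        for (int c : counts) total += c;
        for (int c : counts) {
            if (c == 0) continue;
            double p = c / total;
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    // Information gain of a binary split: entropy before the split
    // minus the size-weighted entropy of the two subsets.
    static double gain(int[] left, int[] right) {
        int nL = left[0] + left[1], nR = right[0] + right[1];
        double before = entropy(left[0] + right[0], left[1] + right[1]);
        double after = (nL * entropy(left) + nR * entropy(right)) / (nL + nR);
        return before - after;
    }

    public static void main(String[] args) {
        // Temperature values from the numeric weather data, sorted,
        // with their class labels (true = "yes").
        double[] temp = {64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85};
        boolean[] yes = {true, false, true, true, true, false, false, true,
                         true, true, false, true, true, false};
        for (int i = 0; i < temp.length - 1; i++) {
            if (temp[i] == temp[i + 1]) continue;       // no boundary between ties
            double split = (temp[i] + temp[i + 1]) / 2; // midway between adjacent values
            int[] l = {0, 0}, r = {0, 0};
            for (int j = 0; j < temp.length; j++)
                (temp[j] < split ? l : r)[yes[j] ? 0 : 1]++;
            System.out.printf("split %.1f  gain %.3f bits%n", split, gain(l, r));
        }
    }
}
```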
Here's an example.
We've already split, let's say, on "outlook".
We've chosen that at the root.
If we look at the "sunny" branch, then there
are five instances whose outlook is "sunny".
Those are in that little table there.

If you look at the value of "humidity", it's
70 and 70 for the two "yes" instances and
85, 90, and 95 for the three "no" instances.
That neatly splits the instances into "yes"s
and "no"s if we choose a split point somewhere
between 70 and 85.
We're going to choose it halfway, at the point 75. Well, 75 isn't halfway between 70 and 85, but we've got two things at 70 and one thing at 85, so we're going to use sort of a weighted halfway point: (2×70 + 1×85)/3 = 75, a third of the way from 70 to 85. That's where we get the 75 from.
That's the split point, and in this case,
that's the end of the matter.
We've separated the instances into "yes" and "no" instances down that branch.
In a more complicated example, you can imagine
we might split on the same attribute more
than once.
If we have a nominal attribute, once we split
on it--once we split on "outlook" here--then
we've used all of the information in the "outlook"
attribute, and we certainly aren't going to
split further down the tree on the same attribute.
That's not true for numeric attributes.
We could split with a certain threshold for
a numeric attribute, and further down the
tree we might split again on the same attribute,
but with a different threshold.
Let's just think about this issue: discretization when building a tree, as I've described, versus discretization in advance, which we looked at in the last couple of lessons.
When you do discretization when building a
tree, you're determining the boundaries in
a more specific context, in the context of
that subtree.
Only a subset of the instances gets down there, so you've got a more specific dataset, which may give you a better determination of where a discretization boundary should be.
On the other side, the negative side, your
decision is based on a small subset of the
overall information.
You've always got to remember when you're
working with a tree that as you get further
down to the bottom of the tree, you've got
smaller and smaller and smaller numbers of
instances.
Although you might have a large dataset to
begin with, by the time you get way down the
tree you're only dealing with a small number
of instances, which is maybe not a good basis
on which to make a decision.
Another issue is computational complexity.
As I've described the algorithm, for every
internal node, the instances that reach it
must be sorted separately for each numeric
attribute, so that you can work out which
split point on which attribute gives you the
best information-gain value.
The complexity of sorting is O(n log n), where n is the number of things you're sorting.
It looks like you've got to repeatedly sort
instances, which could be a bit demanding
computationally, but in fact, repeated sorting
can be avoided with a better data structure.
It's not computationally disastrous to do
internal discretization.
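
The lesson doesn't say which data structure is used, but one standard trick looks like this (a hedged Java sketch under that assumption, not J48's actual code): sort each numeric attribute's instance indices once at the root, and when a node splits, distribute each sorted list to the children in a single stable pass, so the children's lists stay sorted with no further O(n log n) work:

```java
import java.util.ArrayList;
import java.util.List;

public class SortOnce {
    // indices: instance indices pre-sorted by one numeric attribute
    // (sorted once at the root, O(n log n)).
    // goesLeft[i]: true if instance i is routed to the left child
    // by the split chosen at this node.
    static List<List<Integer>> partition(List<Integer> indices, boolean[] goesLeft) {
        List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
        for (int i : indices)                    // one linear pass; order is preserved,
            (goesLeft[i] ? left : right).add(i); // so both children remain sorted
        return List.of(left, right);
    }
}
```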
That's it.
C4.5 incorporated discretization very early
on.

Pre-discretization, as we've seen in the last couple of lessons, is an alternative that came along a bit later.
Supervised discretization uses essentially
the same entropy heuristic as C4.5.
We can retain the ordering information that
numeric attributes imply.
We don't have to keep on sorting them as we
go further down the tree.
Will internal discretization in J48 outperform
pre-discretization? Well, there are arguments
both for and against, which we've talked about.
It's an experimental question, and you will
answer it in the activity associated with
this lesson.
Not just for J48, but for other classifiers,
as well.
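
If you'd like to try that comparison from Weka's Java API rather than the Explorer, here's a minimal sketch (the file name is an assumption; substitute any dataset with numeric attributes). It cross-validates plain J48, which discretizes internally, against a FilteredClassifier that applies the supervised Discretize filter before J48 sees the data:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.Discretize;

public class CompareDiscretization {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.numeric.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Internal discretization: J48 chooses split points as it builds the tree.
        Evaluation internal = new Evaluation(data);
        internal.crossValidateModel(new J48(), data, 10, new Random(1));

        // Pre-discretization: supervised Discretize filter, then J48.
        FilteredClassifier pre = new FilteredClassifier();
        pre.setFilter(new Discretize());
        pre.setClassifier(new J48());
        Evaluation filtered = new Evaluation(data);
        filtered.crossValidateModel(pre, data, 10, new Random(1));

        System.out.printf("J48 (internal):       %.2f%% correct%n", internal.pctCorrect());
        System.out.printf("Discretize then J48:  %.2f%% correct%n", filtered.pctCorrect());
    }
}
```

Swapping other classifiers in for J48 lets you run the same comparison for them, as the activity asks.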
Good luck with that, and I'll see you in the
next lesson.
Bye for now!
