
Hello, and welcome to More Data Mining
with Weka.
This is Class 2.
We're going to look at two topics in this
class.
In the first three lessons, we're going to
look at Discretization, and in the last three
lessons, we're going to look at document classification.
Let's get going.
Discretizing is transforming numeric attributes
to nominal.
There are a couple of obvious ways of doing
this.
We've got a numeric attribute with a certain
range.
We could take that range and chop it into
a certain number of equal parts, or bins.
Just divide it into equal bins, and wherever
a numeric value falls, we take that bin and
use its identification as the discretized
version of the numeric value.
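Equal-width binning is simple enough to sketch in a few lines. This is an illustrative Python version, not Weka's implementation; the function name and interface are made up for this example:

```python
def equal_width_bins(values, num_bins):
    """Assign each numeric value to one of num_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    bins = []
    for v in values:
        # Index of the interval the value falls into, clamped so that the
        # maximum value lands in the last bin rather than one past the end.
        i = min(int((v - lo) / width), num_bins - 1)
        bins.append(i)
    return bins

print(equal_width_bins([0.0, 0.3, 0.5, 0.9, 1.0], 2))  # -> [0, 0, 1, 1, 1]
```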
Or instead of using equal-sized bins, we can
adjust the size to make the number of instances
that fall into each bin approximately the
same: equal-frequency binning.
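Equal-frequency binning can be sketched the same way: sort the values and deal them out so each bin receives roughly the same count. Again this is an illustrative Python version, not Weka's code; Weka's filter does extra work around tied values when choosing cut points:

```python
def equal_frequency_bins(values, num_bins):
    """Assign values to num_bins bins with roughly equal counts."""
    # Rank each position by its value; rank within the sorted order
    # determines the bin. Ties may still spill across bin boundaries.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * num_bins // n, num_bins - 1)
    return bins

print(equal_frequency_bins([5, 1, 9, 3, 7, 2], 3))  # -> [1, 0, 2, 1, 2, 0]
```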
We're going to talk about those two things.
We'll talk briefly about the choice of the
number of bins.
Then we'll talk about how to exploit the ordering
information that's implicit in a numeric value,
but not implicit in a nominal value that you
convert it to.
Let's look at equal-width binning.
I'm going to take ionosphere.arff, which has
got a lot of numeric attributes.
I'm going to use J48.
I've set Weka up here with ionosphere.
I've run J48, and I get 91.5% accuracy.
Let's go and look at some of these numeric
attributes.
The first one, a1, has got just two distinct
values, 0 and 1, actually.
You can see the two values here.

The third attribute has got a bunch of different
values ranging between -1 and +1, and kind
of scrunched up towards the top end.
The fourth attribute also varies between -1
and +1.
It looks like it could almost be a normal
distribution.
I'm going to go to a filter here, an unsupervised
attribute filter called Discretize.
Amongst the parameters here is the number
of bins, and I'm going to use 40 bins.
And there's an equal-frequency option--we're going to use equal-width
binning, not equal frequency, so leave that at false.
I'm going to run it, and look at the result.
Here is the first attribute from 0 to 1, just
two values.
Here's the one that was all scrunched up to
the top end.
This is -1, this is 0, and this is +1.
Here is the one that looked kind of normal.

You can see it is sort of normal-ish except
for a bunch of extra values down here at -1
and +1.
I can look at those in the Edit panel, actually.
If I undo the effect of that, and go and look
in the Edit panel and sort by this attribute,
you can see all the -1's here, and then a
bunch of numbers, and then up at the top you
can see a bunch of extra +1's in this column.
Now I've applied the filter again.
I'm going to classify it and see what J48
makes of that.
We get 87.7% accuracy, which is not very good.
I can go back and change the number of bins.
I'm going to go straight to two bins here.
I'm going to, first of all, undo the effect
of this, and then apply the two-bin version.
You can see that--well, this attribute had only two possible values
to start off with--and it's been discretized into two bins.
If I run J48 again, I get 90.9%, which is
pretty good, actually.

Going back to the slide, you can see the results
for different numbers of bins here.
The last one, 90.9%, is about the same, not
too much worse than the original undiscretized version.
What's more, the tree has only got 13 nodes.
It's a much smaller, much more economical
tree than the one we had before and very little
loss in accuracy.
So that looks really good.
I'm going to move now to equal-frequency binning.
Let's go back here, and take the discretize
filter and change it to equal frequency.
I'm going to go back to 40 bins here, and
I'm going to run that.
First, I need to undo the discretization,
and then I'm going to apply this filter.
Well, it can't do much with the first attribute;
that was binary to start off with.
But here, you can see that this is where they
were all scrunched up towards the top end.
This is -1, this is 0, and this is +1.

You can see that, where possible, it's chosen
the size of the bins to equalize the frequency.
It can't do anything with this large bin at
the top, or this one at the bottom, or this
one in the middle, because all of the instances
have +1 and here they've got 0 and here they've
got -1.
But where it can, it has kind of equalized
the frequency.
This is the one that used to look normal.
You can see there's some extra -1's, 0's,
and +1's, and it's kind of equalized the frequency
by choosing appropriate bin widths.
I can go and classify.
J48 gives me 87%.
It's a bit disappointing, not very good at
all.
I can try with different numbers of bins.
Let me change this to two bins.
I need to undo this one first.
Then apply.
It hasn't done much here--this one was originally
just two values--but you can see that here we've
got two bins with roughly equal frequencies.

That's what histogram equalization, equal
frequency, is trying to do: make bins with
the same number of instances in each.
If I just run J48 on that, I get 83%, which,
again, is pretty disappointing.
Coming back to the slide, you can see that
all of these equal frequency binning results
are worse than the original results.
The size of the tree is not hugely smaller,
either.
So they're not really very good.
Which method should you use? How many bins
should you use? Well, these are experimental
questions.
There's a theoretical result called "proportional
k-interval discretization" which says that
the number of bins should be proportional
to the square root of the number of instances.
That doesn't really help you very much in
choosing the number of bins, because it doesn't
tell you what the constant of proportionality
should be.
It's an experimental question.
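A quick calculation shows why the missing constant matters. This is an illustrative Python sketch; 351 is the instance count of the ionosphere dataset used above, and the candidate constants are arbitrary:

```python
import math

def suggested_bins(n, c):
    """Proportional k-interval discretization: bins = c * sqrt(n),
    for some unknown constant of proportionality c."""
    return round(c * math.sqrt(n))

# Plausible constants give very different bin counts for n = 351:
for c in (0.5, 1.0, 2.0):
    print(c, suggested_bins(351, c))  # 9, 19, and 37 bins
```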
A more interesting question is how to exploit
ordering information.
In the numeric version of the attribute--and
this is it at the top, the attribute value--we've
got a value v here, and there's an ordering
relationship between different values of this attribute.
However, when we discretize it here into five
different bins, then there's no ordering information
between these bins.
Which is a problem, because we might have
a test in a tree, "is x<v?", before discretization.
After discretization, to get the equivalent
test, we would need to ask "is y=a?", "is
y=b?", "is y=c?" and replicate the tree underneath
each of these nodes.
That's clearly inefficient, and is likely
to lead to bad results.
There's a little trick here.
Instead of discretizing into five different
values a to e, we can discretize into four
different binary attributes, k-1 binary attributes.
The first attribute here, z1, says whether the
value v is in the first range, a, and the second attribute,
z2, says whether it's in this range, a or b.
The third, z3, says whether it's in this range,
a, b, or c.
The fourth says whether it's in the first
four ranges.
If in our tree we have a test "is x<v?", where
v is the boundary at the top of range c, then x is less
than v exactly when it falls in a, b, or c, which
is exactly when z3 is true.
So an equivalent test on the binary attributes
is "is z3=true?".
If we take the tree we had before, testing
on "x<v", an equivalent test is "is z3 true?".
Then we have the same kind of economy of the
tree underneath this without replicating different
subtrees.
That's very easy in Weka.
We just go to our filter, and we set makeBinary
to True.
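The encoding itself can be sketched like this. This is an illustrative Python version of the idea; Weka's filter labels the resulting binary attributes by cut point rather than using this exact True/False scheme:

```python
def make_binary(bin_index, num_bins):
    """Encode a discretized value (its 0-based bin index) as
    num_bins - 1 ordered binary attributes: attribute z_j is True
    when the value lies in the first j+1 bins, i.e. below cut j+1.
    A numeric test "x < cut_j" then maps to the single test z_j == True."""
    return [bin_index <= j for j in range(num_bins - 1)]

# With five bins a..e, a value falling in bin c (index 2) becomes:
print(make_binary(2, 5))  # -> [False, False, True, True]
```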

You'll be using that option--and finding out
that it's really pretty good, actually--quite
a bit in the activity associated with this
lesson.
It makes sense, you know.
It allows us to retain the ordering information
that's implicit in the original numeric attribute.
Here's what we've done.
We've looked at equal-width binning and equal-frequency
binning, also called histogram equalization.
We briefly considered how many bins.
It's an experimental question.
We've talked about exploiting ordering information.
In the next lesson, we'll talk about taking
the class into account, supervised discretization.
There's a bit in the book on discretization.
You should do the activity, and I'll see you
in the next lesson.
Bye for now!
