
Hello again, and welcome to Lesson 3.2 of
More Data Mining with Weka.
In the last lesson, we looked at decision
rules versus decision trees.
They are kind of similar representations,
but they are interestingly different in how
they express things.
We also looked at a bottom-up covering algorithm
for generating rules called "PRISM".
If you did the activity associated with that
lesson, then you will realize that PRISM is
not really a terribly good machine learning
method.
It's not really meant for serious use.
In this lesson, we're going to look at a couple
of schemes for rule learning and show how
to use them in Weka.

The first scheme we're going to look at is
called "PART", and it's a way of forming rules
from partial decision trees.
It's the basic separate and conquer algorithm:
make a rule, remove the instances it covers,
and continue creating rules for the remaining
instances.
To make a rule, PART builds a tree.
It builds and prunes a decision tree for the
current set of instances, and reads off the
rule for the largest leaf--the most important
rule, if you like.
Then it throws the tree away, and carries
on with the covering algorithm.
It seems very wasteful, and I suppose perhaps
it is a bit wasteful, but it turns out you
can build just a partial tree--you don't have
to build a whole tree.
That's how PART works.
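The covering loop described above can be sketched in a few lines of Python. This is a toy illustration, not Weka's implementation: the `toy_make_rule` function is a crude stand-in for PART's real step of building and pruning a partial decision tree and reading off the rule for the largest leaf.

```python
class Rule:
    """A conjunction of attribute tests plus a class label."""
    def __init__(self, tests, label):
        self.tests = tests            # list of (attribute, value) pairs
        self.label = label
    def covers(self, instance):
        return all(instance.get(a) == v for a, v in self.tests)

def separate_and_conquer(instances, make_rule):
    """Basic covering loop: make a rule for the current instances,
    remove everything it covers, repeat until nothing is left."""
    rules = []
    while instances:
        rule = make_rule(instances)   # PART: build/prune a partial tree,
        rules.append(rule)            # take the rule for the largest leaf
        instances = [x for x in instances if not rule.covers(x)]
    return rules

# Toy make_rule: pick the single attribute=value test that purely
# covers the most instances of one class.
def toy_make_rule(instances):
    best = None
    for x in instances:
        for a, v in x.items():
            if a == "class":
                continue
            covered = [y for y in instances if y.get(a) == v]
            labels = {y["class"] for y in covered}
            if len(labels) == 1 and (best is None or len(covered) > best[0]):
                best = (len(covered), Rule([(a, v)], covered[0]["class"]))
    if best is None:                  # fall back: a rule for one instance
        x = instances[0]
        return Rule([(a, v) for a, v in x.items() if a != "class"], x["class"])
    return best[1]

data = [
    {"outlook": "sunny", "windy": "false", "class": "yes"},
    {"outlook": "sunny", "windy": "true",  "class": "yes"},
    {"outlook": "rainy", "windy": "true",  "class": "no"},
]
for r in separate_and_conquer(list(data), toy_make_rule):
    print(r.tests, "->", r.label)
```

The loop structure is the whole point: each rule is found on its own shrinking set of instances, which is what makes "throwing the tree away" each time less wasteful than it sounds.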
The second method is called "RIPPER": in fact,
in Weka it's called "JRip".
It's a basic incremental reduced-error pruning
algorithm.

There's a whole class of algorithms that goes by this
name.
The idea is that PRISM is good at producing
rules, but it produces exact rules.
Typically, we want to produce rules that are
not necessarily exact, but merely very good.
What incremental reduced-error pruning does
is to take the instances, the training set,
and split them into two sets, one called Grow
and one called Prune, in the ratio 2:1.
It uses the Grow set for growing rules, adding
clauses to rules until you get a perfect rule.
Then it uses the Prune set when you're pruning
rules, deleting the clauses from the rule
until you're left with a good rule.
For each class, while there are instances
of that class in both these sets, we're going
to use PRISM to create the best perfect rule
for that class.
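The 2:1 split into a Grow set and a Prune set can be sketched as follows; the function name and the fixed seed are illustrative choices, not Weka's code.

```python
import random

def grow_prune_split(training_set, seed=42):
    """Split the training data 2:1 into a Grow set (for building
    rules) and a Prune set (for simplifying them), as in
    incremental reduced-error pruning."""
    data = list(training_set)
    random.Random(seed).shuffle(data)   # shuffle so both sets are representative
    cut = (2 * len(data)) // 3          # 2:1 ratio
    return data[:cut], data[cut:]

grow, prune = grow_prune_split(range(9))
print(len(grow), len(prune))  # → 6 3
```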

Then we're going to calculate the worth of
the rule.
Now, we need some measure of the "worth" of
a rule.
There are different ways of measuring the
worth of a rule, and different incremental
reduced-error pruning algorithms do different
things.
For example, you might just use the success
rate, or you might use some more complicated
thing, perhaps even some entropy metric.
Anyway, whatever you do, let's assume you've
got a way of measuring the worth of a rule.
We calculate the worth of that rule, and then
we omit the final condition--the last one
that we added--and look at the worth of that.
If it's worthwhile, then we take away that
final condition and carry on trying to remove
conditions from the rule until we get an optimal
version of the rule.
So we build up the rule on the Grow set, and
then we prune it back on the Prune set until
we get a rule whose worth is good.
It turns out it's better to prune backwards
than to prune on the way forwards.
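The backward-pruning step can be sketched like this. As the text says, there are different ways of measuring worth; the sketch uses the simplest one mentioned, success rate on the Prune set, and all the names here are hypothetical, not Weka's.

```python
def success_rate(conditions, label, prune_set):
    """Worth of a rule = fraction of covered Prune-set instances
    the rule classifies correctly (one simple choice of 'worth')."""
    covered = [x for x in prune_set
               if all(x.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(x["class"] == label for x in covered) / len(covered)

def prune_backwards(conditions, label, prune_set):
    """Repeatedly drop the final condition (the last one added)
    while doing so does not hurt the rule's worth on the Prune set."""
    conditions = list(conditions)
    while len(conditions) > 1:
        worth = success_rate(conditions, label, prune_set)
        shorter = success_rate(conditions[:-1], label, prune_set)
        if shorter >= worth:
            conditions = conditions[:-1]   # keep the simpler rule
        else:
            break                          # dropping more would hurt
    return conditions

prune_set = [
    {"outlook": "sunny", "windy": "false", "class": "yes"},
    {"outlook": "sunny", "windy": "true",  "class": "yes"},
    {"outlook": "rainy", "windy": "true",  "class": "no"},
]
rule = prune_backwards([("outlook", "sunny"), ("windy", "false")],
                       "yes", prune_set)
print(rule)  # → [('outlook', 'sunny')]
```

In the example, the `windy` test adds nothing on the Prune set, so the rule is pruned back to the single `outlook` condition.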

Again, it sounds a bit wasteful, but it's
a good idea to build up the whole rule and
then prune backwards.
Then we just carry on.
We select the rule with the largest worth,
and we prune it and remove the instances it
covers, carrying on with the basic covering
algorithm.
RIPPER follows this by a fiendishly complicated
global optimization step that's really detailed,
really complex, not really very principled,
but works really well.
I'm not going to tell you about that.
It's just not worthwhile--you'd never remember
it, it's just too hard to--I mean, I don't
remember it.
It's just really complicated.
But this is the basic kind of incremental
reduced-error pruning algorithm that it uses
to generate the rule set in the first place.
All right, let's go to Weka.
I've loaded the diabetes dataset.
I'm going to try J48, PART, and JRip.
So here we are in Weka.
Here's the diabetes dataset: 768 instances.
I go to Classify, and I've already run J48.
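The same three schemes can also be run from the command line instead of the Explorer. This fragment assumes `weka.jar` and `diabetes.arff` (shipped in Weka's data directory) are in the current directory; adjust the paths to your installation.

```shell
# Evaluate each scheme on diabetes.arff; -t names the training file,
# and with no separate test file Weka does 10-fold cross-validation.
java -cp weka.jar weka.classifiers.trees.J48  -t diabetes.arff
java -cp weka.jar weka.classifiers.rules.PART -t diabetes.arff
java -cp weka.jar weka.classifiers.rules.JRip -t diabetes.arff
```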

This is the result from J48.
I've got a decision tree here, quite a complicated
decision tree: it's got 20 leaves and a total
of 39 nodes in this tree.
It gets 74% accuracy.
PART produces a rule set that looks like this.
Look at these rules: "if plas <= 127 and mass
<= 26.4, etc., then tested_negative" (in this dataset, there
are two classes, negative and positive).
There's a rule for negative, and a rule for
positive, and so on.
This rule set has got 13 rules, involving
25 separate tests.
We get 75% accuracy.
RIPPER does really well, 76% accuracy, there
at the top.

It has only 4 rules.
Amazing, eh?
In fact, going back to the slide, here are
the results, and here are the 4 rules.
Actually, RIPPER starts out by taking the
majority class--in this case tested_negative--and
leaving that to the end.
So it only produces rules for the other classes,
and then leaves the majority class for the
default clause like this.
So tested_positive is the smaller class, and
tested_negative is the larger class.
These are the rules it's come up with.
Only 4 rules, 9 tests, and the best performance
of all.
That's pretty amazing.
PART is quick and quite an elegant algorithm,
really.
Repeatedly constructing decision trees and
discarding them is less wasteful than it sounds.
Incremental reduced-error pruning is a standard
technique.
RIPPER does incremental reduced-error pruning
followed by a global optimization step:
it usually produces fewer rules than PART.

There's some stuff in the course text on classification
rules, so go and read that.
The activity associated with this lesson involves
using the Experimenter to do more reliable
comparisons between rule systems.
I thought I'd end by telling you about
Australia and New Zealand.
You know, a lot of people in the northern
hemisphere think that Australia and New Zealand
are really close together.
This map is false.
They're not that close together.
Actually, in New Zealand, we say there are
three islands: the North Island, the South
Island, and the West Island--the people who
live there call it Australia.
It's quite a long way from New Zealand, and
the countries are completely different.
In fact, here's a little encapsulation of
what's in Australia: not very much, and a lot
of what is there is dangerous.
New Zealand's completely different.
It's clean and green, and a nice place to
live.
I just thought I'd leave you with that thought,
and I'll see you in the next lesson.
Bye for now!

