
Hi! This is the third class of Data Mining
with Weka, and in this class,
we're going to look at some simple machine
learning methods and how they work.
We're going to start out emphasizing the message
that simple algorithms often work very well.
In data mining, maybe in life in general,
you should always try simple things before
you try more complicated things.
There are many different kinds of simple structure.
For example, it might be that one attribute in
the dataset does all the work:
everything depends on the value of one of
the attributes.
Or, it might be that all of the attributes
contribute equally and independently.

Or a simple structure might be a decision
tree that tests just a few of the attributes.
We might calculate the distance from an unknown
sample to the nearest training sample,
or a result may depend on a linear combination
of attributes.
We're going to look at all of these simple
structures in the next few lessons.
There's no universally best learning algorithm.
The success of a machine learning method depends
on the domain.
Data mining really is an experimental science.
We're going to look at the OneR rule learner,
where one attribute does all the work.
It's extremely simple, trivial actually,
but we're going to start with simple things
and build up to more complex things.
OneR learns what you might call a one-level
decision tree,
or a set of rules that all test one particular
attribute.

A tree that branches only at the root node
depending on the value of a particular attribute,
or, equivalently, a set of rules that test
the value of that particular attribute.
In the basic version of OneR,
there's one branch for each value of the attribute.
We first choose which attribute to branch on,
and we make one branch for each of its
possible values.
Each branch assigns the most frequent class
that comes down that branch.
The error rate is the proportion of instances
that don't belong to the majority class of
their corresponding branch.
We choose the attribute with the smallest
error rate.
Let's look at what this actually means.
Here's the algorithm.
For each attribute, we're going to make some
rules.
For each value of the attribute,
we're going to make a rule that counts how
often each class appears,
finds the most frequent class,
makes the rule assign that most frequent class
to this attribute value combination,

and then we're going to calculate the error
rate of this attribute's rules.
We're going to repeat that for each of the
attributes in the dataset,
and choose the attribute with the smallest
error rate.
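The algorithm just described can be sketched in a few lines of Python. This is a minimal illustration with made-up function and variable names, not Weka's implementation:

```python
from collections import Counter, defaultdict

def one_r(instances, class_index):
    """Return (attribute, rules, errors) for the attribute whose one-level
    rule set makes the fewest errors. Each instance is a tuple of nominal
    values; class_index is the position of the class value."""
    best = None
    for attr in range(len(instances[0])):
        if attr == class_index:
            continue
        # For this attribute, count how often each class appears per value.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_index]] += 1
        # One rule per value: predict that value's most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances that don't belong to their branch's majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:  # ties broken arbitrarily
            best = (attr, rules, errors)
    return best
```

Note that both `most_common` and the `<` comparison simply keep whichever candidate came first, which matches the arbitrary tie-breaking described above.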
Here's the weather data again.
What OneR does, is it looks at each attribute
in turn,
outlook, temperature, humidity, and wind,
and forms rules based on that.
For outlook, there are three possible values:
sunny, overcast, and rainy.
We just count out of the 5 sunny instances,
2 of them are yeses and 3 of them are nos.
We're going to choose the rule:
if it's sunny, choose no.
We're going to get 2 errors out of 5.

For overcast, all of the 4 overcast values
of outlook lead to yes values for the class play.
So, we're going to choose the rule,
if outlook is overcast, then yes, giving us
0 errors.
Finally, for outlook is rainy,
we're going to choose yes,
as well, and that would also give us 2 errors
out of the 5 instances.
So if we branch on outlook,
we get a total of 4 errors.
We can branch on temperature and do the same
thing.
When temperature is hot,
there are 2 nos and 2 yeses.
We just choose arbitrarily in the case of
a tie,
so we'll choose if it's hot,
let's predict no, getting 2 errors.
If temperature is mild,
we'll predict yes, getting 2 errors out of 6,
and if the temperature is cool,
we'll predict yes, getting 1 out of the 4
instances as an error.
And the same for humidity and wind.

We look at the total error values; we choose
the rule with the lowest total error value -- either
outlook or humidity.
That's a tie, so we'll just choose arbitrarily,
and choose outlook.
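Those tallies are easy to verify in Python. Here's a sketch with the 14 instances of the nominal weather data written out inline (attribute order: outlook, temperature, humidity, windy, play):

```python
from collections import Counter, defaultdict

# The 14 instances of the nominal weather data.
weather = [
    ("sunny","hot","high","false","no"),     ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"),("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"),  ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"),("rainy","mild","high","true","no"),
]
names = ["outlook", "temperature", "humidity", "windy"]

def total_errors(attr):
    """Errors made by a one-level rule set on this attribute: for each value,
    count the instances that are not in that value's majority class."""
    counts = defaultdict(Counter)
    for inst in weather:
        counts[inst[attr]][inst[-1]] += 1
    return sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())

errors = {names[a]: total_errors(a) for a in range(4)}
# errors == {'outlook': 4, 'temperature': 5, 'humidity': 4, 'windy': 5}
# outlook and humidity tie on 4 errors; like the lecture, we break the tie
# arbitrarily and pick outlook.
```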
That's how OneR works,
it's as simple as that.
Let's just try it.
Here's Weka.
I'm going to open the
nominal weather data.
I'm going to go to Classify.
This is such a trivial dataset that the results
aren't very meaningful,
but if I just run ZeroR to start off with,
I get a success rate of 64%.
If I now choose OneR,
and run that.
I get a rule, and the rule I get branches
on outlook:
if it's sunny, choose no;
if it's overcast, choose yes; and if it's rainy, choose yes.

We get 10 out of 14 instances correct on the
training set.
We're evaluating this using cross-validation,
which doesn't really make much sense on such
a small dataset.
It's interesting, though, that the success rate
we get, 42%, is pretty bad, worse than ZeroR.
Actually, with any 2-class problem,
you would expect to get a success rate of
at least 50%.
Tossing a coin would give you 50%.
This OneR scheme is not performing very well
on this trivial dataset.
Notice the rule it finally prints out: since
we're using 10-fold cross-validation,
it does the whole thing 10 times, and then
on the 11th time it calculates a rule from the
entire dataset; that's what it prints out.
That's where this rule comes from.
OneR, one attribute does all the work.
This is a very simple method of machine learning
described in 1993,

20 years ago in a paper called "Very Simple
Classification Rules Perform Well on Most
Commonly Used Datasets"
by a guy called Rob Holte, who lives in Canada.
He did an experimental evaluation of the OneR
method on 16 commonly used datasets.
He used cross-validation, just as we've told
you, to evaluate these things,
and he found that the simple rules from OneR
often outperformed far more complex methods
that had been proposed for these datasets.
How can such a simple method work so well?
Some datasets really are simple,
and others are so small, noisy, or complex
that you can't learn anything from them.
So, it's always worth trying the simplest
things first.
Section 4.1 of the course text talks about
OneR.
Now it's time for you to go and do the activity
associated with this lesson.
Bye for now!
