
Hi! Welcome back for another few minutes in
New Zealand.
In the last lesson, Lesson 5.1, we learned
that Weka only helps you with a small part
of the overall data mining process, the technical
part, which is perhaps the easy part.
In this lesson, we're going to learn that
there are many pitfalls and pratfalls even
in that part.
Let me just define these for you.
A "pitfall" is a hidden or unsuspected danger
or difficulty, and there are plenty of those
in the field of machine learning.
A "pratfall" is a stupid and humiliating action,
which is very easy to do when you're working
with data.
The first lesson is that you should be skeptical.
In data mining it's very easy to cheat.
Whether you're cheating consciously or unconsciously,
it's easy to mislead yourself or mislead others
about the significance of your results.
For a reliable test, you should use a completely
fresh sample of data that has never been seen before.
You should save something for the very end,
that you don't use until you've selected your
algorithm, decided how you're going to apply
it, and the filters, and so on.
At the very, very end, having done all that,
run it on some fresh data to get an estimate
of how it will perform.
Don't be tempted to then change it to improve
it so that you get better results on that data.
Always do your final run on fresh data.
We've talked a lot about overfitting, and
this is basically the same kind of problem.
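That holdout discipline can be sketched in plain Python (the function name and split fraction are illustrative, not anything Weka-specific):

```python
import random

def split_final_holdout(instances, holdout_frac=0.2, seed=42):
    """Set aside a final holdout set that is never touched during
    development; use the rest for all model selection work."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout = shuffled[:n_holdout]      # used exactly once, at the very end
    development = shuffled[n_holdout:]  # choose algorithm, filters, etc. here
    return development, holdout

dev, final = split_final_holdout(list(range(100)))
print(len(dev), len(final))  # 80 20
```

The point is purely procedural: everything you tune, whether the algorithm, the filters, or the parameters, sees only the development data, and the holdout set is evaluated on once and never used to improve the model afterwards.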

Of course, you know not to test on the training
set.
We've talked about that endlessly throughout
this course.
Data that's been used for development in any
way is tainted.
Any time you use some data to help you make
a choice of the filter, or the classifier,
or how you're going to treat your problem,
then that data is tainted.
You should be using completely fresh data
to get evaluation results.
Leave some evaluation data aside for the very
end of the process.
That's the first piece of advice.
Another thing I haven't told you about in
this course so far is missing values.
In real datasets, it's very common that some
of the data values are missing.
They haven't been recorded.
They might be unknown; we might have forgotten
to record them; they might be irrelevant.

There are two basic strategies for dealing
with missing values in a dataset.
You can omit instances where the attribute
value is missing, or somehow find a way of
omitting that particular attribute in that
instance.
Or you can treat missing as a separate possible
value.
You need to ask yourself, is there significance
in the fact that a value is missing? They
say that if you've got something wrong with
you and go to the doctor, and he does some
tests on you: if you just record the tests
that he does -- not the results of the test,
but just the ones he chooses to do -- there's
a very good chance that you can work out what's
wrong with you just from the existence of
the tests, not from their results.
That's because the doctor chooses tests intelligently.
The fact that he doesn't choose a test doesn't
mean that that value is missing, or accidentally
not there.
There's huge significance in the fact that
he's chosen not to do certain tests.
This is a situation where "missing" should
be treated as a separate possible value.
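Those two strategies can be sketched on a toy attribute column in plain Python (the values are illustrative, with "?" marking a missing value as in ARFF):

```python
# A toy attribute column; "?" marks a missing value.
column = ["sunny", "?", "overcast", "rainy", "?", "sunny"]

# Strategy 1: omit instances where the attribute value is missing.
complete_only = [v for v in column if v != "?"]

# Strategy 2: treat "missing" as a separate possible value,
# so a learner may branch on it like any other category.
with_missing_category = [v if v != "?" else "missing" for v in column]

print(complete_only)          # ['sunny', 'overcast', 'rainy', 'sunny']
print(with_missing_category)  # ['sunny', 'missing', 'overcast', 'rainy', 'missing', 'sunny']
```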

There's significance in the fact that a value
is missing.
But in other situations, a value might be
missing simply because a piece of equipment
malfunctioned, or for some other reason -- maybe
someone forgot something.
Then there's no significance in the fact that
it's missing.
Pretty well all machine learning algorithms
deal with missing values.
In an ARFF file, if you put a question mark
as a data value, that's treated as a missing
value.
All methods in Weka can deal with missing
values.
But they make different assumptions about
them.
If you don't appreciate this, it's easy to
get misled.
Let me just take two simple and well known
(to us) examples -- OneR and J48.
They deal with missing values in different
ways.
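For instance, a fragment of an ARFF file with a missing value might look like this (a cut-down, illustrative version of the weather data, not the full dataset):

```
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,FALSE,no
?,TRUE,no
overcast,FALSE,yes
```

The second data row's outlook is unknown; Weka reads the question mark as a missing value.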

I'm going to load the nominal weather data
and run OneR on it: I get 43%.
Let me run J48 on it, to get 50%.
I'm going to
edit this dataset by changing the value of
"outlook" for the first four "no" instances
to "missing".
That's how we do it here in this editor.
If we were to write this file out in ARFF
format, we'd find that these values are written
into the file as question marks.
Now, if we look at "outlook", you can see
that it says here there are 4 missing values.
If you count up these labels -- 2, 4, and
4 -- that's 10 labels.
Plus another 4 that are missing, to make the
14 instances.

Let's go back to J48 and run it again.
We still get 50%, the same result.
Of course, this is a tiny dataset, but the
point is that the results here are not affected
by a few of the values being missing.
However, if we run OneR, we get a much higher
accuracy: 93%.
The rule that I've got is "branch on outlook",
which is what we had before I think.
Here it says there are 4 possibilities: if
it's sunny, it's a yes; if it's overcast it's
a yes; if it's rainy, it's a yes; and if it's
missing, it's a no.
Here, OneR is using the fact that a value
is missing as significant, as something you
can branch on.
Whereas if you were to look at a J48 tree,
it would never have a branch that corresponded
to a missing value.
It treats them differently.
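To see concretely how a OneR-style rule can branch on "missing", here is a minimal sketch in plain Python. This is not Weka's OneR implementation; the toy data mirrors the class counts of the edited weather dataset, with "?" standing for missing:

```python
from collections import Counter, defaultdict

# Toy (outlook, play) pairs modeled on the edited weather data:
# 4 missing (all "no"), sunny 2 "yes", overcast 4 "yes", rainy 3 "yes"/1 "no".
data = [("?", "no"), ("?", "no"), ("?", "no"), ("?", "no"),
        ("sunny", "yes"), ("sunny", "yes"),
        ("overcast", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("overcast", "yes"),
        ("rainy", "yes"), ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no")]

# Tally classes per attribute value; "?" gets its own tally,
# so "missing" becomes a branch like any other value.
by_value = defaultdict(Counter)
for value, cls in data:
    by_value[value][cls] += 1

# The rule predicts the majority class for each branch.
rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
print(rule)  # {'?': 'no', 'sunny': 'yes', 'overcast': 'yes', 'rainy': 'yes'}
```

Counting how often each branch's prediction is right gives 13 of 14 instances correct, which is where an accuracy like 93% can come from. J48, by contrast, never creates a branch for a missing value.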

That's very important to know and remember.
The final thing I want to tell you about in
this lesson is the "no free lunch" theorem.
There's no free lunch in data mining.
Here's a way to illustrate it.
Suppose you've got a 2-class problem with
100 binary attributes.
Let's say you've got a huge training set with
a million instances and their classifications
in the training set.
The number of possible instances is 2 to the
100 (2^100), because there are 100 binary
attributes.
And you know 10^6 of them.
So you don't know the classes of 2^100 - 10^6
examples.
Let me tell you that 2^100 - 10^6 is 99.999...%
of 2^100.
There's this huge number of examples that
you just don't know the classes of.
How could you possibly figure them out? If
you apply a data mining scheme to this, it
will figure them out, but how could you possibly
figure out all of those things just from the
tiny amount of data that you've been given?
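The arithmetic behind that 99.999...% claim is easy to check in plain Python. `Decimal` is used only to show the long run of nines that ordinary float division would round straight to 100.0:

```python
from decimal import Decimal, getcontext

total = 2 ** 100    # possible instances over 100 binary attributes
known = 10 ** 6     # instances whose class the training set gives us
unknown = total - known

# The known fraction is vanishingly small (under 1e-24).
print(known / total)

# The unknown fraction as a percentage: 99.999...%,
# with roughly two dozen nines before any other digit appears.
getcontext().prec = 40
unknown_pct = 100 * Decimal(unknown) / Decimal(total)
print(unknown_pct)
```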
In order to generalize, every learner must
embody some knowledge or assumptions beyond
the data it's given.
Each learning algorithm implicitly provides
a set of assumptions.
The best way to think about those assumptions
is to think back to the Boundary Visualizer
we looked at in Lesson 4.1.
You saw that different machine learning schemes
are capable of drawing different kinds of
boundaries in instance space.
These boundaries correspond to a set of assumptions
about the sort of decisions we can make.
There's no universal best algorithm; there's
no free lunch.
There's no single best algorithm.
Data mining is an experimental science, and
that's why we've been teaching you how to
experiment with data mining yourself.
This is just a summary.

Be skeptical: when people tell you about data
mining results and claim a certain accuracy,
then to be sure about that you should have
them test their classifier on your own new,
fresh data that they've never seen before.
Overfitting has many faces.
Different learning schemes make different
assumptions about missing values, which can
really change the results.
There is no universal best learning algorithm.
Data mining is an experimental science, and
it's very easy to be misled by people quoting
the results of data mining experiments.
That's it for now.
Off you go and do the activity.
We'll see you in the next lesson.
Bye for now!
