
Hi! Before we go on to talk about some more
simple classifier methods, we need to talk about overfitting.
Any machine learning method may 'overfit'
the training data: that is, it produces
a classifier that fits the training data too
tightly and doesn't generalize well to independent
test data.
Remember the user classifier that you built
yourself at the beginning of Class 2? Imagine tediously putting
a tiny circle around every single training
data point.
You could build a classifier very laboriously
that would be 100% correct on the training
data, but probably wouldn't generalize very
well to independent test data.
That's overfitting.
It's a general problem.
We're going to illustrate it with OneR.
We're going to look at the numeric version
of the weather problem, where temperature
and humidity are numbers and not nominal values.
If you think about how OneR works, when it
comes to make a rule on the temperature attribute,
it's going to make a complex rule that branches
perhaps 14 different ways, one for each of the 14
instances in the dataset.
Each branch is going to have zero errors; the rule
is going to get the training data exactly right.
If we branch on temperature, we're going to
get a perfect rule, with a total error count
of zero.
In fact, OneR has a parameter that limits
the complexity of rules.
I'm not going to talk about how it works.
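The degenerate rule described above is easy to sketch in a few lines of Python. The numbers below are illustrative toy values, not the actual weather dataset:

```python
# Toy numeric attribute with one distinct value per instance
# (illustrative numbers, not the real weather data).
temperature = [64, 65, 68, 69, 70, 71, 72, 73, 75, 76, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
        "yes", "yes", "no", "yes", "yes", "no"]

# With no limit on rule complexity, every value can get its own branch:
# the "rule" simply memorizes the training set.
rule = dict(zip(temperature, play))

branches = len(rule)
train_errors = sum(rule[t] != c for t, c in zip(temperature, play))
print(branches, train_errors)  # 14 0
```

Fourteen branches, zero training errors: perfect on the training data, and useless on anything new.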

It's pretty simple, but it's just a bit distracting
and not very important.
The point is that the parameter allows you
to limit the complexity of the rules that are
produced by OneR.
Let's open the numeric weather data.
We can go to OneR, and choose it.
There's OneR, and let's just create a rule.
Here the rule is based on the outlook attribute.
This is exactly what happened in the last
lesson with
the nominal version of the weather data.
Let's just remove the outlook attribute, and
try it again.
Now let's see what happens when we classify
with OneR.

Now it branches on humidity.
If humidity is less than 82.5%, it's a yes day;
if it's greater than 82.5%, it's a no day, and
that gets 10 out of 14 instances correct.
So far so good. That's using the default setting
of OneR's parameter that controls the complexity
of the rules it generates.
We can go and look at OneR, and remember you
can configure a classifier by clicking on it.
We see that there's a parameter called minBucketSize,
and it's set to 6 by default, which is a good
compromise value.
I'm going to change that value to 1, and then
see what happens.
Run OneR again, and now I get a different
kind of rule.
It's branching many different ways on the
temperature attribute.
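A simplified sketch of the bucketing idea behind minBucketSize is shown below. Weka's actual algorithm is subtler (it counts instances of the majority class per bucket and merges neighbouring buckets with the same majority), but the effect on rule complexity is the same:

```python
from collections import Counter

def one_r_numeric(values, labels, min_bucket_size):
    """Simplified sketch of OneR's numeric discretization: sort by value,
    then greedily grow buckets until each holds at least min_bucket_size
    instances. Each bucket's upper boundary predicts its majority class."""
    pairs = sorted(zip(values, labels))
    buckets, current = [], []
    for pair in pairs:
        current.append(pair)
        if len(current) >= min_bucket_size:
            buckets.append(current)
            current = []
    if current:  # fold any leftover instances into the last bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    # (upper value of bucket, majority class of bucket)
    return [(bucket[-1][0], Counter(l for _, l in bucket).most_common(1)[0][0])
            for bucket in buckets]

temps = [64, 65, 68, 69, 70, 71, 72, 73, 75, 76, 80, 81, 83, 85]  # toy values
labels = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
          "yes", "yes", "no", "yes", "yes", "no"]

overfit = one_r_numeric(temps, labels, 1)
sensible = one_r_numeric(temps, labels, 6)
print(len(overfit), len(sensible))  # 14 2
```

With a bucket size of 1 the rule branches once per instance; with the default of 6 it collapses to a couple of broad intervals.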

This rule is overfitted to the dataset.
It's a very accurate rule on the training
data, but it won't generalize well to independent
test data.
Now let's see what happens with a more realistic
dataset.
I'll open diabetes, which is a numeric dataset.
All the attributes are numeric, and the class
is either tested_negative or tested_positive.
Let's run ZeroR to get a baseline figure for
this dataset.
Here I get 65% for the baseline.
We really ought to be able to do better than
that.
Let's run OneR.
The default parameter setting is a value
of 6 for OneR's parameter that controls rule
complexity.
We get 71.5%.
That's pretty good.
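ZeroR itself is almost trivial, as this sketch shows. The 500/268 class split below matches the diabetes dataset's, though the point is the idea, not the exact numbers:

```python
from collections import Counter

def zero_r(labels):
    """ZeroR ignores every attribute and always predicts the majority
    class; its accuracy is the baseline a real classifier must beat."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# The diabetes data has a 500/268 split between the two class values.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
prediction, baseline = zero_r(labels)
print(prediction, round(baseline, 2))  # tested_negative 0.65
```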

We're evaluating using cross-validation.
OneR outperforms the baseline accuracy by
quite a bit -- 71% versus 65%.
If we look at the rule, it branches on "plas".
This is the plasma-glucose concentration.
So, depending on which of these regions the
plasma-glucose concentration falls into,
then we're going to predict a negative or
a positive outcome.
That seems like quite a sensible rule.
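The cross-validation behind these accuracy figures can be sketched as follows. This version uses plain folds and a majority-class learner as a ZeroR stand-in; Weka stratifies its folds and would be running OneR here:

```python
from collections import Counter

def cross_val_accuracy(data, k, learn):
    """Plain k-fold cross-validation: hold each fold out once, train on
    the remaining folds, and average the held-out accuracies."""
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i, held_out in enumerate(folds):
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        classify = learn(train)
        accuracies.append(sum(classify(x) == y for x, y in held_out) / len(held_out))
    return sum(accuracies) / len(accuracies)

def majority_learner(train):
    # A ZeroR stand-in: learn the majority class, predict it for everything.
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

# Toy data: 10 "pos" and 20 "neg" instances keyed by a numeric attribute.
data = [(i, "pos" if i % 3 == 0 else "neg") for i in range(30)]
acc = cross_val_accuracy(data, 10, majority_learner)
print(round(acc, 2))  # 0.67
```

Every instance is used for testing exactly once, and never by the classifier that tests it.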
Now, let's change OneR's parameter to make
it overfit.
We'll configure OneR, find the minBucketSize parameter,
and change it to 1.
When we run OneR again, we get 57% accuracy,
quite a bit lower than the ZeroR baseline
of 65%.
Let's look at the rule; here it is.
It's testing a different attribute, pedi,
which -- if you look at the comments of the
ARFF file -- happens to be the diabetes pedigree
function, whatever that is.

You can see that this attribute has a lot
of different values, and it looks like we're
branching on pretty well every single one.
That gives us lousy performance when evaluated
by cross-validation, which is what we're doing now.
If you were to evaluate it on the training
set, you would expect to see very good performance.
Yes, here we get 87.5% accuracy on the training
set, which is very good for this dataset.
Of course, that figure is completely misleading;
the rule is strongly overfitted to the training
dataset, and doesn't generalize well to independent
test sets.
That's a good example of overfitting.
Overfitting is a general phenomenon that plagues
all machine learning methods.
We've illustrated it by playing around with
the parameter of the OneR method, but it happens
with all machine learning methods.
It's one reason why you should never evaluate
on the training set.
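The training-set trap is easy to reproduce: a classifier that memorizes its training data, loosely like OneR with minBucketSize 1, is perfect on the data it has seen and no better than guessing on anything held out. A toy leave-one-out demonstration:

```python
def memorizer(train, default="yes"):
    """Memorize every training instance; fall back to a default class
    for anything unseen (loosely like OneR with minBucketSize 1)."""
    table = dict(train)
    return lambda x: table.get(x, default)

# Toy data: 8 "yes" and 4 "no" instances keyed by a numeric attribute.
data = [(i, "yes" if i % 3 else "no") for i in range(12)]

# Training-set accuracy: always perfect, and always misleading.
classify = memorizer(data)
train_acc = sum(classify(x) == y for x, y in data) / len(data)

# Leave-one-out accuracy: the held-out point is never in the table,
# so the memorizer can only guess the default class.
hits = 0
for i, (x, y) in enumerate(data):
    classify = memorizer(data[:i] + data[i + 1:])
    hits += classify(x) == y
loo_acc = hits / len(data)

print(train_acc, round(loo_acc, 2))  # 1.0 0.67
```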
Overfitting can occur in more general contexts.
Let's suppose you've got a dataset and you
choose a very large number of machine learning methods, 

say a million different machine learning
methods, and pick the best one for your dataset
using cross-validation.
Well, because you've used so many machine
learning methods, you can't expect to get
the same performance on new test data.
You've chosen so many, that the one that you've
ended up with is going to be overfitted to
the dataset you're using.
It's not sufficient just to use cross-validation
and believe the results.
In this case, you might divide the data three
ways, into a training set, a test set, and
a validation set.
Choose the method using the training and test set.
By all means, use your million machine learning
methods and choose the best on the training
and test set or the best using cross-validation
on the training set.
But then, leave aside this separate validation
set for use at the end, once you've chosen
your machine learning method, and evaluate
it on that to get a much more realistic assessment

of how it would perform on independent test data.
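The protocol just described can be sketched as a simple three-way partition; the sizes and the 60/20/20 split here are illustrative:

```python
import random

random.seed(0)
indices = list(range(1000))
random.shuffle(indices)

train = indices[:600]        # fit every candidate method here
test = indices[600:800]      # pick the best-looking method here
validation = indices[800:]   # touched once, after the choice is made

# The three sets must not overlap, or the final estimate is contaminated.
assert set(train).isdisjoint(test)
assert set(train).isdisjoint(validation)
assert set(test).isdisjoint(validation)
print(len(train), len(test), len(validation))  # 600 200 200
```

Because the validation set played no part in choosing the method, its accuracy is an honest estimate rather than an optimistically biased one.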
Overfitting is a really big problem in machine learning.
You can read a bit more about OneR and what
this parameter actually does in the course
text in Section 4.1.
Off you go now and do the activity associated
with this class.
Bye for now.
