
Hello again! In data mining, people are always
asking "how much data do I need?" We're going
to show you how you can address that question
in this lesson using learning curves.
The advice on evaluation from "Data Mining
with Weka" was: if you've got a large, separate
test set, then just go ahead and use the test
set.
If you've got lots of data, use the holdout
method.
Otherwise, use 10-fold cross-validation --
it's the best way of getting the most reliable
performance estimate out of a limited amount
of data.
You might repeat it 10 times or more, like
the Experimenter does.
But how much data is a lot? Well, that's a
good question, and there is no answer:
it depends.
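
To make the 10-fold cross-validation advice concrete, here is a minimal Python sketch of how the folds are formed, assuming a plain list of instances (this is not Weka's implementation, and it omits the stratification Weka does by default):

```python
# Minimal sketch of 10-fold cross-validation index splitting.
import random

def cross_validation_folds(instances, k=10, seed=1):
    """Shuffle the data once, then yield (train, test) splits for each fold."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    for i in range(k):
        test = data[i::k]              # every k-th instance forms the test fold
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

# Each instance appears in exactly one test fold, so every performance
# estimate makes full use of the limited data.
folds = list(cross_validation_folds(range(100), k=10))
```

Because each instance is tested exactly once, averaging the k fold accuracies squeezes the most reliable estimate out of a small dataset.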

Supposing you've got 1000 instances.
That sounds like quite a lot.
If you've got a 2-class dataset with 500
of each class, then maybe that's pretty good.
If you've got a 10-class dataset with 1000
instances and the classes are unevenly distributed--so,
maybe for some classes there are only 10 or
15 instances--well, that doesn't sound so
good.
Although perhaps you don't care about those
small classes.
It depends on the number of attributes.
Again, with your 1000-instance dataset, that
sounds like a lot, but if you have 1000 attributes,
that might not be such a lot of instances.
It depends on the structure of the domain.
Are you looking for complicated decision boundaries?
It depends on the kind of model, the sort
of decision boundaries it makes.
If you've got a machine learning technique
that looks for linear decision boundaries,
then they're pretty simple.
You might not need so much data as you would
for techniques that look for more convoluted
boundaries -- decision trees, perhaps.
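
A quick way to judge whether "1000 instances" is really a lot is to count the per-class distribution. The class labels below are made up purely for illustration:

```python
# Count instances per class to see whether the dataset only "sounds big".
from collections import Counter

# Hypothetical 10-class-style imbalance: 1000 instances overall.
labels = ["a"] * 500 + ["b"] * 350 + ["c"] * 135 + ["d"] * 15
counts = Counter(labels)

# Class "d" has only 15 instances -- a model may not learn that class
# reliably, even though 1000 instances sounds like plenty.
print(counts)
```

The same total can be "pretty good" (500 per class in a 2-class problem) or "not so good" (a 15-instance minority class), which is exactly why the answer is "it depends".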

It's an impossible question to answer in general.
Really, the only way to address it is empirically,
using learning curves.
I've shown a plot here of a learning curve.
As the size of the training data increases,
the performance gets better and better, but
of course, it asymptotes off.
The point where it starts to asymptote off
is probably enough training data to get a
reliable estimate.
Let's talk about how to plot a learning curve
in Weka.
We're going to sample the data.
When we sample, we choose a subset of the
data, and you need to understand the difference
between sampling with replacement and sampling
without replacement.
Sampling is really a question of whether you
move or copy the data.
If you sample "with replacement", then it's
like you take it out of the original dataset
and put it into the sample dataset, and then
replace it back in the original dataset.
You don't really take it out.
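
The distinction can be sketched in a few lines of Python (this is just the concept, not Weka's Resample filter):

```python
# Sampling with replacement "puts the instance back" after each draw, so it
# can appear more than once in the sample; sampling without replacement
# "moves" the instance, so it can appear at most once.
import random

rng = random.Random(42)
data = list(range(10))

with_replacement = [rng.choice(data) for _ in range(10)]   # duplicates possible
without_replacement = rng.sample(data, 10)                 # no duplicates
```

With replacement, drawing as many items as the dataset holds will almost always repeat some instances; without replacement, the same draw just reorders the dataset.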

With replacement, you copy it: the instance
goes from the original dataset to the sample
dataset but remains available in the original.
"Without replacement" means you move it.
You can't see it again;
you can't sample it twice.
If you sample with replacement, then instances
might occur more than once in the sample dataset.
If you sample without replacement, then they
can't.
That's the first thing.
We're going to sample the training set, but
not the test set.
We want to find out how performance changes
as the size of the training set increases.
But the test set determines the reliability
of our estimate --
we don't want to make that artificially smaller.
We always want to use the same size test set.
We can do that in Weka by using the FilteredClassifier.
There's a Resample filter, and if we wrap
that up in the FilteredClassifier, that means
that the filtering will apply to the training
data and not to the test data.
I'm going to do that with the glass dataset.
I've opened the glass dataset here.
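
Conceptually, wrapping the Resample filter in the FilteredClassifier means the filter shrinks each training fold while every test fold stays untouched. The sketch below mimics that in plain Python; the function names and the trivial bookkeeping are made up for illustration and are not Weka's API:

```python
# Resample only the training fold inside cross-validation, as the
# FilteredClassifier arranges; the test fold keeps its full size.
import random

def resample_without_replacement(train, percent, rng):
    n = max(1, int(len(train) * percent / 100))
    return rng.sample(train, n)

def cross_validate(instances, percent, k=10, seed=1):
    rng = random.Random(seed)
    data = list(instances)
    rng.shuffle(data)
    fold_sizes = []
    for i in range(k):
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        train = resample_without_replacement(train, percent, rng)  # train only!
        fold_sizes.append((len(train), len(test)))
    return fold_sizes

# With 214 instances (the size of the glass dataset) and a 50% sample,
# each training fold is halved but the test folds still cover all the data.
sizes = cross_validate(range(214), percent=50)
```

Keeping the test folds at full size is the whole point: sampling the test set too would just make the performance estimate noisier.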

I'm going to go to Classify.
In meta, I'm going to find the FilteredClassifier.
Then I'm going to check--I'm going to use
J48 as the classifier.
For the filter, I'm going to use the Resample
filter.
It's an unsupervised instance filter:
we're resampling instances.
There it is.
Here are the parameters.
We can sample with or without replacement,
and I would like to sample without replacement,
so I set the noReplacement parameter to true.
I want a 50% sample.
I can go ahead and run that.
I'm doing 10-fold cross-validation, sampling
the training set,
using a 50% sample of the training set and
leaving the test set untouched.

I get 65% performance.
Back to the slide.
Here
is the 50% level:
65% performance.
I did this for other sample sizes,
which enabled me to plot this learning curve
empirically:
the performance against the percentage of
training data I'm using.
I've shown the ZeroR performance there, for
reference.
The line's a bit jagged, and to get a smoother
line, I'd want to do it several times with
cross-validation.
If I do 10 repetitions of J48, then I get
this line here.
(I did this with the Experimenter.
It's very easy to do.) Then I did 1000 repetitions.
I get this red line here, this smooth line.
You can look at this line and make your own
judgement as to how much training data you
need to get pretty close to the ultimate accuracy
of J48 on this dataset.
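
The procedure just described -- evaluate at several training-set percentages, repeat, and average -- can be sketched like this. The `evaluate` function here is a hypothetical stand-in for running J48 in Weka, with made-up numbers that merely rise and asymptote:

```python
# Plot a learning curve empirically: average repeated evaluations at each
# training-set percentage so the jagged line becomes smooth.
import random

def evaluate(percent, rng):
    # Hypothetical noisy accuracy that rises and asymptotes with more data
    # (a stand-in for one cross-validated J48 run, not real results).
    base = min(0.67, 0.3 + 0.4 * (percent / 100) ** 0.5)
    return base + rng.uniform(-0.02, 0.02)

def learning_curve(percentages, repetitions, seed=1):
    rng = random.Random(seed)
    curve = {}
    for p in percentages:
        runs = [evaluate(p, rng) for _ in range(repetitions)]
        curve[p] = sum(runs) / len(runs)   # averaging smooths the jagged line
    return curve

# More repetitions give a smoother line, as with the 10 vs 1000
# Experimenter runs described above.
curve = learning_curve([10, 20, 50, 100], repetitions=1000)
```

Reading off where the averaged curve flattens tells you roughly how much training data is enough on this dataset.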

It looks like, provided you have about
50-60% of the training data, you're going
to be fairly close to the final accuracy.
That's it for learning curves.
The question is how much data is enough? The
answer is we don't know! So, you can plot
a learning curve.
We looked at resampling with and without replacement,
but we didn't want to sample the test set,
because that would just decrease the reliability
of evaluation.
We used the FilteredClassifier.
Obviously, the performance figures you get
are only estimates, and you can improve the
reliability of those estimates by repeating
the test several times.
That's it for now.
Off you go and do the activity, and we'll
see you again in the next lesson.
Bye for now!
