
Hi! In this lesson, Lesson 2.5, I want to
introduce you to the standard way of evaluating
the performance of a machine learning algorithm,
which is called cross-validation.
A couple of lessons back, we looked at evaluating
on an independent test set, and we also talked
about evaluating on the training set (don't
do that).
We also talked about evaluating using the
holdout method by taking the one dataset and
holding out a little bit for testing and using
the rest for training.
There is a fourth option on Weka's Classify
panel, which is called cross-validation, and
that's what we're going to talk about here.
Cross-validation is a way of improving upon
repeated holdout.
We tried using the holdout method with different
random-number seeds each time.
That's called repeated holdout.
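The repeated holdout procedure is easy to sketch. The Python fragment below is an illustration only, not Weka's implementation; `evaluate` is a placeholder for whatever trains a classifier on the training portion and returns its accuracy on the test portion:

```python
import random
import statistics

def holdout_split(data, test_fraction=0.1, seed=0):
    """Shuffle with the given seed, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def repeated_holdout(data, evaluate, repetitions=10):
    """Repeat the holdout with a different random seed each time
    and average the accuracy estimates."""
    scores = [evaluate(*holdout_split(data, seed=s)) for s in range(repetitions)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Each seed gives a different random split, and the spread of the ten scores is exactly the variance in the estimate that cross-validation is designed to reduce.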
Cross-validation is a systematic way of doing
repeated holdout that actually improves upon
it by reducing the variance of the estimate.
We take a training set and we create a classifier.
Then we're looking to evaluate the performance
of that classifier, and there is a certain
amount of variance in that evaluation, because
it's all statistical underneath.
We want to keep the variance in the estimation
as low as possible.
Cross-validation is a way of reducing the
variance, and a variant on cross-validation
called stratified cross-validation reduces
it even further.
I'm going to explain that in this class.

In a previous lesson, we held out 10% for
testing and repeated that 10 times.
That's the repeated holdout method.
We had one dataset, and we divided it independently,
10 separate times, into a training set and
a test set.
With cross-validation, we divide it just once,
but we divide into, say, 10 pieces.
Then, we take 9 of the pieces and use them
for training,
and the last piece we use for testing.
Then, with the same division, we take another
9 pieces and use them for training and the
held out piece for testing.
We do the whole thing 10 times, using a different
segment for testing each time.
In other words, we divide the dataset into
10 pieces, and then we hold out each of these
pieces in turn for testing, train on the rest,
do the testing and average the 10 results.
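The 10-fold procedure just described can be sketched in the same spirit (again an illustration, not Weka's code; `evaluate` is a placeholder for training on the nine pieces and testing on the held-out one):

```python
import statistics

def cross_validation(data, evaluate, k=10):
    """Split the data once into k folds, hold out each fold in turn
    for testing, train on the remaining k-1 folds, and average
    the k results."""
    folds = [data[i::k] for i in range(k)]  # one fixed division into k pieces
    scores = []
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, test_fold))
    return statistics.mean(scores)
```

The key contrast with repeated holdout is that the division happens only once, and every instance lands in a test set exactly one time.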

That would be 10-fold cross-validation.
Divide the dataset into 10 parts (these are
called folds), hold out each part in turn
and average the results.
So, each data point in the dataset is used
once for testing and 9 times for training.
That's 10-fold cross-validation.
Stratified cross-validation is a simple variant
where, when we do the initial division into
10 parts, we ensure that each fold has got
approximately the correct proportion of each
of the class values.
Of course, there are many different ways
of dividing a dataset into 10 equal parts;
we just make sure we choose a division that
has approximately the right representation
of class values in each of the folds.
That's stratified cross-validation.
It helps reduce the variance in the estimate
a little bit more.
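One simple way to achieve that stratified division is to deal each class's instances round-robin across the folds, so every fold gets roughly the overall class proportions. This is an illustrative sketch; Weka's own stratification may differ in detail:

```python
from collections import defaultdict

def stratified_folds(instances, k=10):
    """Deal the instances of each class round-robin across the k folds,
    so every fold gets approximately the overall class proportions.
    Each instance is a (features, label) pair."""
    by_class = defaultdict(list)
    for features, label in instances:
        by_class[label].append((features, label))
    folds = [[] for _ in range(k)]
    i = 0
    for members in by_class.values():
        for inst in members:
            folds[i % k].append(inst)
            i += 1
    return folds
```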
Then, once we've done the cross-validation,
what Weka does is run the algorithm an eleventh
time on the whole dataset.
That will then produce a classifier that we
might deploy in practice.
We use 10-fold cross-validation in order to
get an evaluation result and estimate of the error,
and then finally, we do classification
one more time to get an actual classifier
to use in practice.
That's what I wanted to tell you.
Cross-validation is better than repeated holdout,
and we'll look at that in the next lesson.
Stratified cross-validation is even better.
Weka does stratified cross-validation by default.
With 10-fold cross-validation, Weka invokes
the learning algorithm 11 times, once for each
fold of the cross-validation and then a final
time on the entire dataset.
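That workflow, ten runs for the estimate plus an eleventh run on everything for the deployable model, can be sketched with a toy majority-class "learner" (purely illustrative; `majority_learner` is an invented placeholder, not a Weka classifier):

```python
import statistics
from collections import Counter

def majority_learner(train):
    """Toy 'classifier': always predict the most common class
    in the training data."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test):
    return sum(model(x) == label for x, label in test) / len(test)

def cv_then_final(data, k=10):
    """Run the learner k times for the performance estimate, then an
    extra (k+1)-th time on the whole dataset for the final model."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        scores.append(accuracy(majority_learner(train), folds[i]))
    return statistics.mean(scores), majority_learner(data)
```

The estimate comes only from the ten held-out folds; the model you would actually deploy is the one trained on all the data.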
The practical rule of thumb is that if you've
got lots of data, you can use a percentage
split and evaluate it just once.

Otherwise, if you don't have that much data,
you should use stratified 10-fold cross-validation.
How big is lots? Well, this is what everyone
asks.
How long is a piece of string, you know? It's
hard to say, but it depends on a few things.
It depends on the number of classes in your
dataset.
If you've got a two-class dataset, then if
you had, say, 100-1000 data points, that would
probably be good enough for a pretty reliable
evaluation if you did a 90%/10% split into
training and test sets.
If you had, say, 10,000 data points in a two-class
problem, then I think you'd have lots and
lots of data; you wouldn't need to go to cross-validation.
If, on the other hand, you had 100 different
classes, then that's different, right?
You would need a larger dataset, because you want
a fair representation of each class when you
do the evaluation.

It's really hard to say exactly; it depends
on the circumstances.
If you've got thousands and thousands of data
points, you might just do things once with
a holdout.
If you've got less than a thousand data points,
even with a two-class problem, then you might
as well do 10-fold cross-validation.
It really doesn't take much longer.
Well, it takes 10 times as long, but the times
are generally pretty short.
You can read more about this in Section 5.3
of the course text on cross-validation.
Now it's time for you to go and do the activity
associated with this [lesson].
See you soon!
