
Hi! This is Lesson 2.2 in Data Mining with
Weka, and here we're going to look at training
and testing in a little bit more detail.
Here's the situation.
We've got a machine learning algorithm; we feed training data into it, and it produces
a classifier: the basic machine learning setup.
We can test that classifier with some
independent test data: put the test data into the classifier and get
some evaluation results. Separately,
we can deploy the classifier in some real
situation to make predictions on fresh data
coming from the environment.

It's really important in classification, when
you're looking at your evaluation results, that
the test data is different from the training data:
only then do you get reliable evaluation results.
That's what we're going to look at in this
lesson.
What if you only have one dataset? If you
just have one dataset, you should divide it
into two parts:
use some of it for training and some
of it for testing.
Perhaps 2/3 of it for training and 1/3
of it for testing.
It's really important that the training data
is different from the test data.
Both training and test sets are produced by
independent sampling from an infinite population.
That's the basic scenario here, but they're
different independent samples.
It's not the same data.
If it is the same data, then your evaluation
results are misleading.
They don't reflect what you should actually
expect on new data when you deploy your classifier.
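As a sketch of that idea in plain Python (the dataset here is invented for illustration; Weka does this for you via its Test options):

```python
import random

# A hypothetical labelled dataset of (features, class) pairs.
dataset = [([i, i % 3], i % 2) for i in range(30)]

# Shuffle a copy so the two parts are independent random samples,
# then hold out the final third for testing.
rng = random.Random(42)
shuffled = dataset[:]
rng.shuffle(shuffled)

split = (2 * len(shuffled)) // 3   # 2/3 for training, 1/3 for testing
train_set = shuffled[:split]
test_set = shuffled[split:]

print(len(train_set), len(test_set))   # 20 10
```

The key point is that the two parts are disjoint: every instance ends up in exactly one of the sets.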

Here we're going to look at the segment dataset, which we used in the last lesson.
I'm going to open the segment-challenge dataset
and use a supplied test set.
First of all, I'll choose the J48 tree
learner.
Then, under Test options, I'll select Supplied test set
and point it at the appropriate file, segment-test.arff.
I'm going to open that. Now we've got
a test set, and let's see how it does.
In the last lesson, on the same data with
the user classifier, I think I got 79% accuracy.
J48 does much better;
it gets 96% accuracy on the same test set.

Suppose I were to evaluate it on the training
set? I can do that by just selecting
Use training set under Test options.
Now it will train again and evaluate
on the training set, which is not what you're
supposed to do, because it gives misleading
results.
Here, it's saying the accuracy is 99% on the
training set.
That is not representative of what we would
get using this on independent data.
If we had just one dataset, if we didn't have
a test dataset, we could do a percentage split.
Here's a percentage split.
This is going to be 66% training data and
34% test data.
That's going to make a random split of the
dataset.
If I run that, I get 95%.
That's just about the same as what we got
when we had an independent test set,
just slightly worse.
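The gap between training-set accuracy and honest test accuracy can be illustrated with a toy sketch in plain Python (invented data and a 1-nearest-neighbour classifier, not J48): a model that memorizes its training data looks perfect when evaluated on that same data.

```python
import random

# Toy illustration (invented data, not J48): a 1-nearest-neighbour
# classifier memorizes its training instances, so evaluating it on
# the training set looks perfect but says nothing about new data.

def nearest_neighbour(train, x):
    """Predict the class of x from the closest training instance."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda inst: sqdist(inst[0], x))[1]

def accuracy(train, eval_data):
    hits = sum(nearest_neighbour(train, x) == y for x, y in eval_data)
    return hits / len(eval_data)

rng = random.Random(1)
# Two overlapping classes: class 0 centred at 0.0, class 1 at 1.0.
data = [([rng.gauss(c, 0.8)], c) for c in (0, 1) for _ in range(50)]
rng.shuffle(data)
train, test = data[:66], data[66:]      # roughly a 66%/34% split

print(accuracy(train, train))   # 1.0 -- perfect, but misleading
print(accuracy(train, test))    # lower: the honest estimate
```

Because each training point is its own nearest neighbour, training-set accuracy is 100% by construction, while the held-out test set reveals the real error caused by the overlapping classes.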

If I were to run it again with a different
split, we'd expect a slightly different result,
but actually, I get exactly the same result:
95.098%.
That's because, before each run, Weka
reinitializes the random number generator.
The reason is to make sure that you can get
repeatable results.
If it didn't do that, then the results
you got would not be repeatable.
However, if you want to have a look at the
differences you might get on different
runs, there is a way of resetting the
random number seed between runs.
We're going to look at that in the next lesson.
So, to sum up this lesson:
The basic assumption of machine learning is
that the training and test sets are independently
sampled from an infinite population, the same
population.
If you have just one dataset, you should hold
part of it out for testing, maybe 34% as we
just did or perhaps 10%.
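The seed-reinitialization behaviour described above can be sketched like this (plain Python with hypothetical data; Weka's own generator works analogously):

```python
import random

# Sketch of why reinitializing the seed makes runs repeatable:
# the same seed always yields the same "random" percentage split.
data = list(range(10))

def percentage_split(seed):
    rng = random.Random(seed)   # fresh generator, as Weka makes before each run
    shuffled = data[:]
    rng.shuffle(shuffled)
    return shuffled[:7], shuffled[7:]   # hold out 3 of the 10 for testing

run1 = percentage_split(seed=1)
run2 = percentage_split(seed=1)   # same seed -> identical split
run3 = percentage_split(seed=2)   # different seed -> usually a different split

print(run1 == run2)   # True
print(run1 == run3)
```

Reusing the seed makes every run reproduce the same split, which is exactly why repeated Percentage split runs in Weka report identical accuracy.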

We would expect a slight variation in results
each time if we held out a different set,
but by design Weka produces the same result
each time, because it reinitializes
the random number generator before each run.
We ran J48 on the segment-challenge dataset.
If you'd like, you can go and look at the
course text on
Training and testing, Section 5.1,
and please go and do the activity associated with this lesson.
Bye for now!

