
Hello, and welcome back! 
We're going to be using the Experimenter
quite a lot in this course.
So, in Lesson 1.3, I'm going to show you how
to use it to compare classifiers.
Here's a question: Is J48 better than ZeroR
or OneR on the Iris dataset? Of course, we
could fire up the Explorer.
You know how to do this, so I'm not going
to do it for you.
We can open the dataset, we can get the results
for these three different machine learning
methods, and we can see that J48 with 96%
cross-validation accuracy is better than OneR,
which is better than ZeroR.
But the question is, how reliable is this comparison?
Things could change if we happened to choose
a different random number seed.
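The seed-sensitivity point can be seen outside Weka too. Here is a small sketch in scikit-learn (an assumed stand-in, not Weka itself: `DecisionTreeClassifier` plays the role of J48) showing that 10-fold cross-validation accuracy on Iris shifts when the folds are shuffled with a different random seed:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for J48

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Same learner, same data: only the fold-shuffling seed changes.
for seed in (1, 2, 3):
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    acc = cross_val_score(tree, X, y, cv=folds).mean()
    print(f"seed {seed}: accuracy {acc:.3f}")
```

The accuracies are all high, but not identical, which is exactly why a single run is a shaky basis for comparison.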

So, the Experimenter helps produce more reliable
comparisons between datasets and classification
algorithms.
I'm going to fire up the Experimenter.
I'm going to open the Iris dataset and use
the same three classification algorithms and
compare them.
Here we are in the Experimenter.
I'm going to create a new experiment.
I'm going to open a dataset.
I'm going to add three classification algorithms.
I can reorder these algorithms, by the way.
If I select one and go up, and select another
one and go down, I can reorder them.
I'm going to go to Run and run this.
Then I'm going to go to the Analyse panel,
click Experiment -- that's important -- and
then click Perform test.
Back to the slides here -- that's what I did.
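What the Experimenter automates can be sketched in a few lines. This uses scikit-learn analogues as an assumption for illustration only: `DummyClassifier` for ZeroR, a depth-1 tree as a rough OneR-like baseline, and a full tree for J48.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumed stand-ins for the three Weka schemes in this lesson.
schemes = {
    "ZeroR": DummyClassifier(strategy="most_frequent"),
    "OneR":  DecisionTreeClassifier(max_depth=1, random_state=0),
    "J48":   DecisionTreeClassifier(random_state=0),
}

for name, clf in schemes.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The Experimenter does the same kind of loop over schemes (and datasets), but keeps the per-fold results so it can test significance afterwards.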

I switched to the Analyse panel and clicked
these things and got these results, which
look like this, actually.
Now, we can see the three figures for the
three classification algorithms on the Iris
dataset.
We can see that both OneR and ZeroR are worse
than J48, just looking at the numbers.
The star (*) means that ZeroR is significantly
worse than J48.
The absence of a star on OneR means that
we cannot be sure that OneR is significantly
worse than J48 at the 5% level of statistical
significance.
In other words, J48 seems better than ZeroR,
and we're pretty sure (5% level) that this is not
due to chance.
And it seems to be better than OneR, but this
may be due to chance -- we can't rule it out
at the 5% level of statistical significance.
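The star is the outcome of a significance test on the per-fold accuracies. Here is a simplified sketch using a plain paired t-test; note that Weka actually uses a corrected resampled t-test, so treat this as an illustration of the idea, not the Experimenter's exact calculation:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Per-fold accuracies for the two schemes on the same 10 folds.
j48 = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
zeror = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=10)

t, p = ttest_rel(j48, zeror)        # paired test: same folds for both
print(f"p = {p:.4f}; star at 5% level: {p < 0.05}")
```

A star appears when p falls below the chosen significance level (5% by default); no star means the observed gap could plausibly be chance.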
Now, I can add a bunch more datasets.

In fact, I'll just go and do that.
I'll rerun the experiment.
It'll take a little bit of time.
Then I'll analyze the results.
Over here on the slide, these are the results
I get.
So I can see that at the 5% level of significance
J48 is significantly better than both OneR
and ZeroR on three of the datasets.
That's looking at the stars; the star means
that those methods are significantly worse than J48.
In other words J48 is significantly better
than them.
It's significantly better than OneR in breast-cancer
and german_credit, and it's significantly
better than ZeroR on iris and pima_diabetes
datasets.
So, you can see from the table of figures
and the stars where the significant results
are.
Now, what if we wanted to know whether OneR
was significantly better than ZeroR?
This slide doesn't tell us, because here
we're comparing everything with J48.
We can go back to the Experimenter and select
something different for the test base.
I'm selecting OneR for the test base and performing
the test.
Now I've got OneR in the first column, and
things are being compared with it.
Going back to the slide, having changed the
test base, I can see that OneR is significantly
worse than ZeroR on the german_credit dataset,
about the same on the breast-cancer dataset,
and significantly better on all the rest
of the datasets.

Another thing that we can do is to change
the order of the columns in this matrix.
If I go back to the Experimenter and select
for the row -- currently the Dataset is selected --
I'm going to select Scheme for the row.
And for the column, currently Scheme is selected,
and I'm going to select Dataset for the column.
Then perform the test again.
Now, we get the datasets going along horizontally
here.
This is the list of datasets, and we get the
algorithms going vertically.
So I can see whether J48 performs significantly
better or worse on the iris dataset than it
does, say, on the breast-cancer dataset.
What we've looked at is comparing classifiers.
In statistical terms, people talk about the
"null hypothesis",
that is, that one classifier's performance
is the same as another.

The result that we observe is highly unlikely
if the null hypothesis is true:
that is, we reject the null hypothesis.
We reject the hypothesis that they're the
same at the 5% level of statistical significance.
So the Experimenter tells you when the null hypothesis is
being rejected,
or, equivalently, we can say that A performs
significantly better than B at the 5% level.
In the Experimenter, we can change the significance
level.
It's common to use 5%; 1% for critical applications,
maybe medical applications; perhaps 10% for
less critical applications.
We can change the comparison field.
We have used percent correct, but we can change
that in the Experimenter, and it's common to compare
over a set of datasets.
We might say on these datasets, method A has
so many wins and so many losses over method
B, referring to the number of statistically
significant times A is better than B or B
is better than A.
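A wins/losses tally of this kind can be sketched as follows, again with scikit-learn datasets and classifiers as assumed stand-ins (the Experimenter does this with Weka schemes and its corrected t-test):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wins = losses = 0
for loader in (load_iris, load_wine, load_breast_cancer):
    X, y = loader(return_X_y=True)
    # Method A: a decision tree; method B: a majority-class baseline.
    a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    b = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=10)
    t, p = ttest_rel(a, b)
    if p < 0.05:                      # only significant differences count
        if a.mean() > b.mean():
            wins += 1
        else:
            losses += 1
print(f"A vs B: {wins} wins, {losses} losses over 3 datasets")
```

Only the statistically significant differences contribute to the tally; datasets where the two methods are indistinguishable at the chosen level count as neither a win nor a loss.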
There's a problem you ought to be aware of -- the
multiple comparison problem.
If you make a large number of tests, some
of them will appear to be significant just
by chance.
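The multiple comparison problem is easy to demonstrate in miniature with a simulation (pure synthetic noise, nothing to do with Weka): run many t-tests in which the null hypothesis really is true, and roughly 5% of them come out "significant" at the 5% level anyway.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_tests = 1000
false_alarms = 0
for _ in range(n_tests):
    # Two "classifiers" with identical true accuracy: only noise differs.
    a = rng.normal(0.8, 0.05, size=10)
    b = rng.normal(0.8, 0.05, size=10)
    t, p = ttest_rel(a, b)
    false_alarms += p < 0.05
print(f"{false_alarms} of {n_tests} null tests look significant at 5%")
```

Roughly 50 of the 1000 tests will be flagged even though no real difference exists, which is why a few stars scattered across a large results matrix should be read with care.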
As usual, this is not an exact science.
The interpretation of results requires a certain
amount of care.
All right. The activity associated with this
lesson will ask you to do some more work with
the Experimenter, comparing datasets and
classification algorithms.
Off you go and do that, and I'll see you in
the next lesson.
Bye for now!
