
English: 
Hello! Welcome back to New Zealand for another
few minutes of More Data Mining with Weka.
By the way,
I'd just like to thank all of those who did
the first course for their nice comments and feedback.
You know, the University of Waikato is just a little
university on the far side of the world,
but they listen.
They listen when they hear feedback,
and they've listened to you.
As you can see,
they've put me in a bigger office with more
books and a bigger plant.
So, this has been great.
They really appreciate the positive feedback
that we've had from you for the previous course.
Thank you very much indeed.
Today we're going to look at the Experimenter.
As you know,

Chinese: 
大家好！欢迎回到新西兰听我讲解几分钟More Data Mining with Weka。
顺便说一句，
感谢前一门课学生提出的评论和建议。
怀卡托大学只是世界另一端的一所学校。
但是我们会去听，
听取并且会接受
您的建议。
大家可以看到，
我现在在一个更大的办公室，有更多的书和更繁茂的植物。
这很好。
怀卡托确实能够听取积极的建议。
非常感谢。
今天我们要学习Experimenter界面。
我们讲过，

Chinese: 
Weka有四个界面：
 Explorer，我们上次课讲过的界面、
 Experimenter和另外两个界面。
这节课和下节课我们要学习Experimenter。
它用于计算分类算法对于某一数集的
平均数和标准差。
上一门课中，
我们手算了这一部分。
使用 Experimenter，很容易针对多个数集设置多个学习方案。
你可以找出针对某一特定数集一种分类方案是否优于另一种分类方案，
它们的差别是否在统计意义上显著不同。
你可以检验同一算法的不同参数的效果，
然后通过ARFF文档显示测试结果。
这样你可以用数据挖掘的结果做数据挖掘。
如果你愿意。
有时，Experimenter可能运算
几天甚至几周的时间。
它能使用多台机器同时运算，
比如同一实验室的所有计算机。
这对Experimenter来说很容易，

English: 
there are four interfaces to Weka: the Explorer,
which we looked at in the last course; 
the Experimenter, and two more.
We're going to look at the Experimenter today
and in the next lesson as well.
It's used for things like determining the
mean and standard deviation performance of
a classification algorithm on a dataset,
which you did manually,
actually, in the previous course.
It's easy to do several algorithms on several
datasets,
and you can find out whether one classifier
is better than another on a particular dataset
and whether the difference is statistically
significant or not.
You can check the effect of different parameter
settings of an algorithm,
and you can actually express the results of
these tests as an ARFF file.
So you can sort of do data mining on the results
of data mining experiments,
if you like.
In the Experimenter,
sometimes the computation takes days or even
weeks,
and it can be distributed over several computers,
like all the computers in a lab.
That's quite easy to do with the Experimenter,

English: 
but we're not going to be covering that in
this course.
When you invoke the Experimenter,
you get three panels: the Setup panel,
the Run panel,
and the Analyse panel.
Before we go to those,
let me just refresh your memory.
This is a slide from Data Mining with Weka,
Lesson 2.3, I think,
where we talked about the training set and
the test set.
A basic assumption of machine learning is
that these are independent sets produced by
independent sampling from an infinite population.
In Lesson 2.3 -- perhaps if you don't remember
this you can go back and look at that video
from the first course again -- we took a dataset,
segment-challenge,
and the learning algorithm J48,
and we used a percentage split method of evaluation.
We evaluated it and got a certain figure for
the accuracy.
Then we repeated that with different random
number seeds,
and, in fact, we got ten different figures for the accuracy.

Chinese: 
但我们这门课不涉及这方面。
Experimenter包含
三个面板：Setup面板、
Run面板
和Analyse面板。
在开始之前，
我们来复习一下之前的内容。
这是课程Data Mining with Weka第2.3节中的幻灯片。
我们讲到了训练数据和测试数据。
机器学习一个基本的假设是这些数据是
取自无限总体的独立的样本数集。
在上门课2.3节中我们讲过。如果你不记得，请看上次课的视频。
我们选择一个数集，
名为segment-challenge，
选择算法J48，
使用比例分割方法来评估。
通过评估，得到一个准确率。
然后，我们使用不同的随机种子来重复这一步骤，
最终得到10个不同的准确率。

Chinese: 
计算样本平均数，方差
和标准差。
如果你记不清这部分，
请复习这段视频。
还有，
我们需要回顾交叉验证的内容。
在Data Mining with Weka第2.5节中，
我们学习了10折交叉验证，
也就是把数集分成10组，
每次保留一组测试数据，
然后平均十次的结果。
我们来看Experimenter界面。
点击Experimenter，
进入Setup面板，
开始设定实验。
默认设置是10折交叉验证，
我们来重复十次实验。
选择一个数集，
数据集segment-challenge
在这儿。

English: 
From those we manually computed the sample
mean and the variance,
and hence the standard deviation.
If you can't remember that,
go and refresh your memory.
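
As a reminder of that manual calculation, here is a minimal sketch in Python. The ten accuracy figures are made-up placeholders, not the figures from the lesson, and `sample_stats` is a hypothetical helper name. It uses the sample variance, which divides by n - 1.

```python
import math


def sample_stats(values):
    """Return (mean, sample variance, sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance divides by n - 1, not n.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance, math.sqrt(variance)


# Hypothetical accuracies (percent correct) from ten runs with different seeds.
accuracies = [94.0, 95.0, 96.0, 95.0, 94.0, 96.0, 95.0, 95.0, 94.0, 96.0]

mean, variance, std_dev = sample_stats(accuracies)
print(f"mean={mean:.2f} variance={variance:.4f} std_dev={std_dev:.4f}")
```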
Also, while we're at it,
let me just remind you about cross-validation.
In Lesson 2.5 of Data Mining with Weka
we looked at this technique of 10-fold cross-validation,
which involves dividing the dataset into ten parts,
holding out each part in turn,
and averaging the results of the ten runs.
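
The idea of holding out each part in turn can be sketched roughly as follows. This is an illustration only, not Weka's implementation: it does no shuffling or stratification, and `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names introduced for this example.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    # Deal the indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


def cross_validate(n_instances, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_instances, k)]
    return sum(scores) / len(scores)


# Arbitrary instance count, chosen only to show the fold sizes.
splits = list(k_fold_indices(1500, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each instance ends up in the test set of exactly one fold, so every instance is used for testing once and for training nine times.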
Let's get into the Experimenter.
If I just go here and click Experimenter,
I get the Setup panel.
I'm going to start a new experiment.
I'm just going to note that we've got 10-fold
cross-validation by default,
and we're repeating the experiment ten times
by default.
I'm going to add a dataset.
I'm going to add the segment-challenge dataset,
which is here.

English: 
I'm going to add a machine learning algorithm --
I'm going to use J48.
You've seen this kind of menu before, many,
many times; it's the same as in the Explorer.
If I just select J48 and click OK,
then I've got this dataset and this learning
algorithm.
Well, let's just run it.
I'm going to go to the Run panel and click
Start.
It's running.
You can see at the bottom here,
it's doing the fifth,
sixth, seventh, eighth, ninth, tenth run,
because we repeated the whole thing ten times.
We repeated 10-fold cross-validation ten times.
Now, if I go to the Analyse panel,
it doesn't show anything.
I need to analyze the results of the experiment
I just did.
Click Experiment.
And I need to perform the test.
You can see here that it's showing for a dataset called
"segment" that we've got an average of 95.71%
correct using this J48 algorithm.
We wanted to look at the standard deviation.
If I click Show std. deviations

Chinese: 
选择机器学习算法，
算法J48。
你之前曾多次在Explorer中
见过这种菜单，
选择J48，点击OK，
之后得到这个数集和算法。
运行。
选择Run面板，点击Start。
开始运行。
你可以看到，在面板下方，
运行第五次、
六、七、八、九、十次，
因为重复运算了十次，
10折交叉验证运行了十次。
现在来看Analyse面板，
没有任何显示。
我们需要分析刚才的实验结果。
点击Experiment，
运行测试数据。
可以看到对于segment数据集使用J48算法
得到了95.71%的正确率。
我们来看标准差。
点击Show std. deviations，

English: 
and perform the test again,
then I get the standard deviation.
So, we've effectively done what we did rather
more laboriously in the first course by doing
ten individual runs.
Over on the slide here,
this just summarizes what we've done.
In the Setup panel,
we set things up.
In the Run panel,
we just clicked Start,
and in the Analyse panel,
we clicked Experiment,
and we selected Show std. deviations 
and performed the test.
Now, what about those detailed results of the individual
runs? 
I'm going to go back to the Setup panel here.
I'm going to write the results to a CSV file,
which we'll call "Lesson 1.2".

Chinese: 
再次运行测试，
得到标准差。
我们高效地得到了上门课中需单独
运行十次的结果。
这张幻灯片
总结了我们刚才的步骤。
使用Setup面板
设置实验，
点击Run面板
上的Start，
然后点击Analyse面板
上的Experiment，
并选择Show std. deviations，
开始测试。
每次运行的结果如何呢？
我们回到Setup面板，
我要将运行结果写入CSV文档，
文档名为Lesson 1.2。

Chinese: 
我还要做比例分割，
训练数据90%，
测试数据10%。
已有了数集和机器学习算法，
我们来运行。
下面来看输出的CSV文档，
很好
看这里，
我们重复了十次实验，
这些是十次运行记录。
每次运行的记录，
都会得到很多信息，
很多。
我们现在要找的是Percent_correct
这是十次中每次运行的正确率。
表中还有很多信息，
包括开始时间、运行时间、
和其他很多信息。
也许你应该花时间好好看看。

English: 
I think I'll just do a percentage split.
I'll do 90% training,
10% test.
I've got my dataset and my machine learning
method,
so I'll just go and run.
If I look at the CSV file that's been produced,
well,
here it is.
We repeated the experiment ten times.
These are the ten different runs.
And for each of these ten runs,
we've got a lot of information.
A lot of information.
The information that we're looking
for here is Percent_correct.
That's the percent correct for each of those
ten separate runs.
We've got all sorts of other stuff here,
including, for example, the user time, the elapsed time,
and lots and lots of other things.
Maybe you should take a look at those yourself.
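
If you would rather pull that column out with a script than a spreadsheet, something like the following sketch might do. Only the `Percent_correct` column name comes from the lesson; the other column names and all the values in `SAMPLE_CSV` are assumptions made up for illustration, and `percent_correct_column` is a hypothetical helper.

```python
import csv
import io

# Made-up stand-in for the results file; a real file has many more columns.
SAMPLE_CSV = """Key_Dataset,Key_Run,Percent_correct
segment,1,95.0
segment,2,96.0
segment,3,94.0
"""


def percent_correct_column(csv_file):
    """Read the Percent_correct value from every row of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [float(row["Percent_correct"]) for row in reader]


values = percent_correct_column(io.StringIO(SAMPLE_CSV))
print(values)
print(sum(values) / len(values))
```

To use it on a real results file, pass an open file handle (opened with `newline=""`) in place of the `io.StringIO` object.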

Chinese: 
数据表中显示了十次运行的具体结果。
现在我要做十折交叉验证。
重复十次
我们做了一次百分比分割
如果选择十折交叉验证，
将结果写入文档，
再次运行。
需要长一些的时间，
因为每次都要做交叉验证。
现在完成了。
我们来看实验结果，
和之前的结果很相近，但是数据要多得多。
我们运行了十次。
十次十折交叉验证。
这是第一次，
有十折。
第一次运行有十折，
这是第二次运行的十折，
以此类推。
和之前的结果是一样的。
我们可以得到实验的详细信息。

English: 
That's given us the detailed results for each
of the ten runs.
I'm going to do 10-fold cross-validation now.
These are the ten repetitions, right,
and we did a single percentage split.
If I do 10-fold cross-validation,
and write the result into a file,
and run it again.
It takes a little bit longer,
because it's doing cross-validation each time.
Now it's finished,
and if we look at the resulting file,
we get something that's very similar but much
bigger.
We repeated the whole thing ten times.
We repeated 10-fold cross-validation ten times.
This is the first run,
and there were ten folds.
There are ten folds of the first run.
Here are the ten folds of the second run,
and so on.
I've got the same results as I had before
along here.
I've got a very detailed account of what was
done in that experiment.
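
One way to make sense of the larger file is to average the folds within each run and then average the runs. The sketch below assumes you have already extracted (run, fold, percent correct) triples from the file; the four rows shown are made up, and `mean_per_run` is a hypothetical helper.

```python
from collections import defaultdict


def mean_per_run(rows):
    """Map each run number to the mean percent correct over its folds."""
    by_run = defaultdict(list)
    for run, _fold, percent_correct in rows:
        by_run[run].append(percent_correct)
    return {run: sum(v) / len(v) for run, v in sorted(by_run.items())}


# Made-up rows: (run, fold, percent_correct). A real file from ten
# repetitions of 10-fold cross-validation would have 100 such rows.
rows = [(1, 1, 90.0), (1, 2, 100.0), (2, 1, 92.0), (2, 2, 96.0)]

run_means = mean_per_run(rows)
overall = sum(run_means.values()) / len(run_means)
print(run_means, overall)
```

When every run has the same number of folds, the mean of the run means equals the mean over all rows.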

Chinese: 
我们来看幻灯片：
为了得到详细的数据，
我们选择了Setup面板、CSV文档、命名了输出结果的文件。
这是我们做百分比分割的结果。
然后我们用交叉验证的方法又做了一遍，
得到了更大的数据表。
我们来回顾Experimenter的功能。
它有三个面板。
在面板Setup中，
我们可以设置实验，
和存储实验，
但我们通常做的是开始新的实验。
我们一般点击这里开始。
这里有高级模式。
我们暂不讲解高级模式，
我们将继续学习基本模式。
我们可以为输出文档命名，
选择ARFF文档或者CSV文档，
或者数据库文档。
可以做交叉验证或者百分比分割。
实际上，

English: 
Just coming back to the slides here:
to get detailed results
we went back to the Setup panel, and selected
CSV file, and put in a file name for the results.
This is the file that we got with percentage split.
Then we did the same thing for the cross-validation
experiment,
and got a larger results spreadsheet.
Let's just review the Experimenter.
We've got three panels.
In the Setup panel,
you can open an experiment,
and you can save an experiment,
but what we usually do is start a new experiment.
We normally start by clicking here.
There's an Advanced mode.
We're not going to talk about the Advanced
mode here; we're going to continue to use
the simple mode of the Experimenter.
You can set a file name for the results if
you want,
either an ARFF file or a CSV file or,
in fact, a database file.
You can do either a cross-validation or a
percentage split.
Actually,

English: 
you can preserve the order in percentage split.
The reason for that is that there's no way
of specifying a separate test file in the
Experimenter.
To do that,
you would kind of glue the training set and
test set together, preserve the order, and
specify the appropriate percentage so that
those last instances were used as the test set.
Normally,
we're not doing that,
we just randomize things for the percentage
split.
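
To make the gluing trick concrete, here is a rough sketch of working out the percentage to specify. The tiny lists are placeholders, the function names are hypothetical, and the rounding of the split point is an assumption of this sketch rather than a statement about how Weka computes it.

```python
def train_percentage(n_train, n_test):
    """Percentage to specify so that the last n_test instances form the test set."""
    return 100.0 * n_train / (n_train + n_test)


def ordered_split(instances, percent):
    """Split without shuffling: the first `percent` of instances train, the rest test."""
    # Assumption: the split point is rounded to the nearest instance.
    cut = round(len(instances) * percent / 100.0)
    return instances[:cut], instances[cut:]


# Placeholder data: six training instances glued in front of two test instances.
train = [f"train{i}" for i in range(6)]
test = [f"test{i}" for i in range(2)]
combined = train + test

pct = train_percentage(len(train), len(test))
train_part, test_part = ordered_split(combined, pct)
print(pct, train_part, test_part)
```

Because the order is preserved, the split recovers exactly the original training and test sets.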
We've got the number of repetitions.
We repeated the whole thing ten times,
but we could have repeated it a hundred times.
Here we can add new datasets.
We can add more datasets.
We can delete datasets that we've added,
delete this dataset.
Here we add more learning algorithms.
We can just add new learning algorithms into
the learning algorithms box.
That's the Setup panel.
Then there's the Run panel.
You don't do much in the Run panel except
click Start,
and just monitor for errors here.
There were zero errors in the three runs I
did.
Then,
in the Analyse panel,

Chinese: 
我们可以在做百分比分割时保持数据的原有顺序。
因为Experimenter不支持使用独立的
测试文档。
这样，
就可以将训练数集和测试数集绑定在一起，设定顺序，
指定合适的百分比，这样后面的数据就会用作测试数据。
一般来说，
我们不会这样做，
我们会随机百分比。
我们可以得到重复的次数。
我们重复了十次，
还可以重复一百次。
在这里我们还可以增加新的数集，
更多的数集。
我们可以删除数集。
删除这个数集。
在这里我们输入算法。
我们可以在算法栏中输入新算法。
这是Setup面板。
这是Run面板。
我们只点击Start，
在这里监控错误。
我们做的三次运行都没有错误。
然后，
在面板Analyse中，

English: 
you can load results from a file or a database,
but what we normally want to do is click Experiment
here to get the results from the experiment
we've just done.
There are many options,
and we're going to be looking at some of these
options as we go through this course.
That's the Experimenter.
We've learned how to open the Experimenter.
We've looked at the Setup,
Run,
and Analyse panels.
We've evaluated a classifier on a dataset
using both cross-validation repeated ten times
and percentage split repeated ten times.
We've looked at the spreadsheet output.
We've looked at the Analyse panel.
We found out how to get the mean and the standard
deviation,
and we've looked at some of the other options
on the Setup and Run panels.
There's a chapter in the course text on the
Experimenter,
Chapter 13.
If you go to the activity now associated with
this lesson,
you'll do some of the things I've just been
doing,
and more besides.
Good luck,
and we'll see you in the next lesson.

Chinese: 
我们可以从文档或数据库中导入结果，
但我们通常做的是点击Experiment获取实验数据。
我们是这样做的。
这里有很多选项。
我们这门课会学习一些选项。
这就是Experimenter。
我们学习了如何打开Experimenter。
Setup面板、
Run面板
和Analyse面板。
我们使用十次十折交叉验证和十次百分比分割
评估了一个数集的分类器。
我们看了输出的数据表、
Analyse面板。
我们知道如何获取平均数和标准差。
我们学习了Setup和Run面板的一些选项。
教材中有一章专门介绍Experimenter，
第十三章。
课后练习中
部分练习和我刚才演示的一样，
还有其他一些内容。
祝好运。
下次课见。

Chinese: 
再见！

English: 
Bye for now!
