
Hello again! In this lesson we're going to
look at an important new concept called baseline
accuracy. We're going to actually use a new
dataset, the diabetes dataset.
I've got Weka here, and I'm going to open
diabetes.arff.
There it is.
Have a quick look at this dataset.
The class is tested_negative or tested_positive
for diabetes.
We've got attributes like preg, which I think
has to do with the number of times they've
been pregnant; age, which is the age.

Of course, we can learn more about this dataset
by looking at the ARFF file itself.
Here is the diabetes dataset.
You can see it's diabetes in Pima Indians.
There's a lot of information here.
The attributes: number of times pregnant, plasma glucose concentration, diabetes pedigree function, and so on.
I'm going to use percentage split.
I'm going to try a few different classifiers.
Let's look at J48 first, our old friend J48.
We get 76% with J48.
I'm going to look at some other classifiers.
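As an aside, the percentage-split idea is worth seeing concretely. Here is a minimal pure-Python sketch (the helper name is hypothetical; Weka's own implementation differs), assuming the default 66% train / 34% test split:

```python
import random

# Sketch of a percentage-split evaluation: shuffle the instances,
# train on the first 66%, and measure accuracy only on the held-out rest.
def percentage_split(instances, train_pct=66, seed=1):
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)           # randomize order before splitting
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

# 768 dummy instance ids standing in for the diabetes rows.
train, test = percentage_split(list(range(768)))
print(len(train), len(test))  # 506 506 train instances, 262 test instances
```

The key point is that accuracy is reported on the held-out portion only, which the classifier never saw during training.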

You learn about these classifiers later on
in this course, but right now we're just going
to look at a few.
Look at the NaiveBayes classifier in the bayes
category, and run that.
Here we get 77%, a little bit better, but
probably not significant.
Let's choose in the lazy category IBk.
Again, we'll learn about this later on.
Here we get 73%, quite a bit worse.
We'll try one final one, PART, partial decision
rules, in the rules category.
Here we get 74%.
We'll learn about these classifiers later,
but they are just different classifiers, alternative
to J48.
You can see that J48 and NaiveBayes are pretty
good, probably about the same.
The 1% difference between them probably isn't
significant.
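To see why a 1% gap is probably noise, here is a rough back-of-the-envelope calculation, assuming Weka's default 66%/34% percentage split (so roughly 262 held-out instances) and treating each test prediction as an independent coin flip:

```python
import math

# Rough standard error of an accuracy estimate from a percentage split.
# Assumption: the default split holds out 34% of the 768 instances.
n_total = 768
n_test = n_total - (n_total * 66 // 100)  # ~262 held-out instances

p = 0.76  # J48's observed accuracy on the diabetes data
std_error = math.sqrt(p * (1 - p) / n_test)  # binomial standard error

print(f"held-out instances: {n_test}")
print(f"standard error: {std_error:.1%}")  # about 2.6%
```

With a standard error around 2.6%, a 1% difference between two classifiers is well inside the noise band.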

IBk and PART are probably about the same performance.
Again, 1% between them.
There is a fair gap, I guess, between those
bottom two and the top two, which probably
is significant.
I'd like to think about these figures.
76%: is that a good accuracy to get?
If we go back and look at this dataset, the class,
we see that there are 500 negative instances
and 268 positive instances.
If you had to guess, you'd guess negative, and
you'd be right 500/768 of the time
(768 being the sum of those two counts, the
total number of instances).
If you always guess negative, that's 500/768,
which works out to 65%.
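The arithmetic from this paragraph, spelled out:

```python
# Majority-class baseline for the diabetes dataset: always guess
# tested_negative and count how often that guess would be right.
negatives = 500  # tested_negative instances
positives = 268  # tested_positive instances

total = negatives + positives  # 768 instances altogether
baseline = negatives / total   # fraction correct when always guessing negative

print(f"{baseline:.1%}")  # 65.1%
```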

Actually, there's a rules classifier called
ZeroR, which does exactly that.
The ZeroR classifier just looks for the most
popular class and guesses that all the time.
If I run this on the training set, that will
give us the exact same number, 500/768,
which is 65%.
It's a very, very simple, kind of trivial
classifier, that always just guesses the most
popular class.
It's ok to evaluate that on the training set,
because it's hardly using the training set
at all to form the classifier.
That's what we would call the baseline.
The baseline gives 65% accuracy, and J48 gives
76% accuracy.
It's significantly above the baseline, but
not all that much above the baseline.
It's always good when you're looking at these
figures to consider what the very simplest kind of classifier,
the baseline classifier, would get you.
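ZeroR is simple enough to sketch in a few lines. This is a hypothetical pure-Python rendering of the idea, not Weka's actual implementation:

```python
from collections import Counter

class ZeroR:
    """Trivial baseline: remember the most frequent class, always predict it."""

    def fit(self, labels):
        # "Training" is just counting class frequencies.
        self.majority_class, _ = Counter(labels).most_common(1)[0]
        return self

    def predict(self, n_instances):
        # Ignore the attributes entirely; every prediction is the same.
        return [self.majority_class] * n_instances

# Evaluate on the training data, as in the lesson: 500 negatives, 268 positives.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
model = ZeroR().fit(labels)
predictions = model.predict(len(labels))
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"{accuracy:.1%}")  # 65.1%, matching ZeroR on this dataset
```

Because the attributes are never consulted, evaluating on the training set barely flatters this classifier, which is why it makes a fair baseline.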

Sometimes, baseline might give you the best
results.
I'm going to open a dataset here.
We're not going to discuss this dataset.
It's a bit of a strange dataset, not really
designed for this kind of classification.
It's called supermarket.
I'm going to open supermarket, and without
even looking at it, I'm just going to apply
a few schemes here.
I'm going to apply ZeroR, and I get 64%.
I'm going to apply J48,
and I think I'll use a percentage split for evaluation because
it's not fair to use the training set here.
Now I get 63%.
That's worse than the baseline.
If I try NaiveBayes (these are the ones I tried
before), I again get 63%, worse than the baseline.

If I choose IBk, this is going to take a little
while here, it's a rather slow scheme.
Here we are; it's finished now.
Only 38%.
That is way, way worse than the baseline.
We'll just try PART, partial decision rules.
Here we get 63%.
The upshot is that the baseline actually gave
a better performance than any of these classifiers,
and one of them was really atrocious compared
with the baseline.
This is because, for this dataset, the attributes
are not really informative.
The rule here is, don't just apply Weka to
a dataset blindly.
You need to understand what's going on.

When you do apply Weka to a dataset, always
make sure that you try the baseline classifier,
ZeroR, before doing anything else.
In general, simplicity is best.
Always try simple classifiers before you try
more complicated ones.
Also, when you get these small differences, you
should consider whether they are likely to be
significant.
We saw these 1% differences in the last lesson
that were probably not at all significant.
You should always try a simple baseline.
You should look at the dataset.
We shouldn't blindly apply Weka to a dataset;
we should try to understand what's going on.
That's this lesson.
Off you go and do the activity associated
with this lesson, and I'll see you soon!
