
Hello again!
At the end of the last lesson we were looking
at a two-class dataset where the accuracy
on one of the classes was very high and the
accuracy of the other class was not very high.
But because there was an overwhelming majority
of the instances in the first class, the overall
accuracy looked very high.
In this lesson, we're going to take a closer
look at this kind of situation and come up
with a more subtle way of evaluating classifiers
under these circumstances.
Here in Weka, I've opened the weather data.
14 instances, a simple, artificial dataset,
and I'm going to classify it with Naive Bayes.

I've selected NaiveBayes here, and there it
is.
I'm interested in the Confusion Matrix.
In fact, I've put it over on the slide here.
Here is the confusion matrix.
You can see there are "a"s and "b"s, "yes"s and
"no"s.
There are 7 "a"s that are classified as "a"s,
and 2 "a"s that are classified as "b"s, incorrectly.
There's 1 "b" that's classified as "b"--that's
correct--and 4 "b"s that are classified as
"a", incorrectly.
I want to introduce some terminology here.
We're going to talk about "true positives",
those 7 correctly classified "a"s; and "true
negatives", that 1 correctly classified "b".
"False positives" are negative instances that
are incorrectly assigned to the positive class.
They look like they're positives, but they're
false.
That's the 4.
And "false negatives" conversely.

We're going to be interested in the "true
positive rate", that is the accuracy on class
"a", which is 7 (the number of true positives),
divided by the total size of class "a", that
is 9; and the "false positive rate", which
is the number of false positives, 4, divided
by the total number of negative instances,
that is 5.
That's 0.80.
That's 1 minus the accuracy on class "b".
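As a quick check on the arithmetic, the two rates can be computed from the confusion-matrix counts in a few lines of Python (a minimal sketch; the variable names are mine, not Weka's):

```python
# Confusion-matrix counts from the lesson: 7 "a"s correct, 2 "a"s wrong,
# 4 "b"s wrongly called "a", and 1 "b" correct.
tp, fn = 7, 2   # actual "a" (positive): correctly / incorrectly classified
fp, tn = 4, 1   # actual "b" (negative): incorrectly / correctly classified

tp_rate = tp / (tp + fn)   # accuracy on the positive class: 7/9
fp_rate = fp / (fp + tn)   # 1 minus the accuracy on the negative class: 4/5

print(round(tp_rate, 3), round(fp_rate, 3))  # 0.778 0.8
```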
The main point of this lesson is that there's
a tradeoff between these things.
You can trade off the accuracy on class "a"
against the accuracy on class "b".
You can get better accuracy on class "a" at
the expense of accuracy on class "b", and
vice versa.
To show you what I mean, let's go back to
Weka.
In the More options menu, I'm going to output
the predictions.

Let's just run Naive Bayes again.
I'm interested in this table of predictions.
These are the 14 instances.
For this instance, which is actually a "no",
Naive Bayes had a prediction probability of
92.6% for the "yes" class and 7.4% for the
"no" class.
These two things add up to 1.
Because the probability for the "yes" class
was greater than the probability for the "no"
class, Naive Bayes predicted "yes".
Incorrectly, as it turns out, because it was
actually a "no"--that's why there's a plus
in this error column.
That's the way Naive Bayes gets all of its
predictions.
It takes the "yes" probability and the "no"
probability and sees which is larger, and
predicts a "yes" or a "no" accordingly.
Over on the slide, I've got the same data,
and then I've processed it on the right into
a simpler table, with just the actual class
and the probability of the "yes" class that's
output by Naive Bayes.
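The decision rule just described is nothing more than a comparison of the two probabilities; a minimal sketch using the numbers from this instance:

```python
p_yes = 0.926        # Naive Bayes probability for "yes" on this instance
p_no = 1 - p_yes     # the two probabilities add up to 1

# Predict whichever class has the larger probability.
prediction = "yes" if p_yes > p_no else "no"
print(prediction)    # yes -- incorrect here, since the actual class is "no"
```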

I've sorted the instances in decreasing order
of prediction probability.
At the top, we've got an instance which is
actually a "no" that Naive Bayes predicts
to be a "yes", because the prediction probability
for "yes" is 0.926, which is way larger than
the prediction probability for a "no", 1 minus
that.
In fact, if you think about it, it's like
Naive Bayes is drawing a line at the 0.5 point--that
horizontal line--and everything above that
line it's predicting to be a "yes"; everything
below that line it's predicting to be a "no".
The true positives are those "yes"s above
the line--that's 7 of them.
The "yes"s below the line are incorrectly
predicted positive instances.

So the "true positive" rate is 7 over 9.
Conversely, for the "no" class, things below
the line are predicted as a "no".
There's only one correct prediction there
below the line.
That's the very last entry.
There are 4 "no"s above the line that are
incorrectly predicted to be "yes"s because
they are above the line.
That gives a false positive rate of 0.8.
Like I say, there's a tradeoff.
We could change things if we put the line
in a different place.
Naive Bayes puts it at 0.5.
But if we were to move the line from 0.5 (that's
the P line) to 0.75 (that's the Q line), then
we'd have a true positive rate of 5/9--that's
those 5 "yes"s above the line compared with
the 4 "yes"s below the line--and a false positive
rate of 0.2.
That's the Q line.
We're going to plot these points on a graph.
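Moving the line is the same as changing a probability threshold. The sketch below sweeps such a threshold over a hypothetical set of scores (the video doesn't list all 14 probabilities, so these are made-up values chosen only to reproduce the counts quoted here):

```python
def roc_point(labels, scores, threshold):
    """(false positive rate, true positive rate) when everything with a
    score at or above the threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == "yes" and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == "no" and s >= threshold)
    return fp / labels.count("no"), tp / labels.count("yes")

# Hypothetical scores: 9 "yes" instances and 5 "no" instances, arranged so
# the thresholds 0.5 and 0.75 give the same counts as in the lesson.
labels = ["yes"] * 9 + ["no"] * 5
scores = [0.9, 0.9, 0.85, 0.8, 0.8, 0.7, 0.6, 0.4, 0.3,   # the "yes"s
          0.926, 0.7, 0.65, 0.55, 0.1]                     # the "no"s

print(roc_point(labels, scores, 0.5))    # the P line: FPR 0.8, TPR 7/9
print(roc_point(labels, scores, 0.75))   # the Q line: FPR 0.2, TPR 5/9
```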

We're going to plot the accuracy on class
"a" (TP) against 1 minus the accuracy on class
"b" (FP).
You can see the P and Q points on the graph.
Now we can get other points on the graph by
putting the line in different places.
In the extreme, we could put the line right
at the very top above the first instance.
That means that we'd be classifying everything
as a "no", which gives us 100% accuracy on
the "no" class--that's an FP rate of 0--and
0% accuracy on the "yes" class--that's a TP
rate of 0.
That's the 0,0 point on the graph.
Then, if we take our horizontal line and move
it down the table one by one, we're going
to be moving up along that red line until
we get to the top, the upper right-hand corner,
which corresponds to a line underneath the
whole table where we classify everything as
a "yes", getting 100% accuracy on the "yes"
class, and nothing as a "no", getting 0% accuracy
on the "no" class, the "b" class.
You can get different tradeoffs between accuracy
on class "a" and accuracy on class "b" by
putting the line at different points.
That's for a single machine learning method.
What about a different machine learning method?
Well, different machine learning methods will
give you different red lines.
There's one, the dashed line down a little
bit below.
That's actually worse than the Naive Bayes
line with the P and the Q on it, because where
you want to be is in the top left-hand corner.
The top left-hand corner corresponds to perfect
accuracy on class "a" and perfect accuracy
on class "b".
That's where you'd like to be.
So lines that push up toward that top left corner,
like that top red dotted line, are better.
That's where you want to be.

One way of evaluating the overall merit of
a particular classifier, say the Naive Bayes
one shown in the P-Q line, is to look at the
area under the curve.
That's the area shown there.
If that area is large, then we've got a classifier
that does better across all the different possible
tradeoffs, the different thresholds.
The area under the curve is a way of measuring
classifier accuracy independent of the particular
tradeoff that you happen to choose.
Actually, in Weka, you can look at this curve.
It's called a "threshold curve", and we're
going to visualize the threshold curve for
the positive class.
That's what we get.
It's not a smooth curve, it's a bit of a jagged
curve.

In fact, we plot the y axis against the x
axis--true positive rate against false positive
rate--and each of these points corresponds
to a particular point in the table.
There are 13 points, plus 1 at the beginning
and 1 at the end; 15 points altogether.
The point that I've circled there corresponds
to a false positive rate of 2/5 and a true
positive rate of 5/9.
All the other points correspond to different
points on the curve.
What we want to measure is the area under
the curve.
It's called an ROC, "Receiver Operating Characteristic",
curve, for historical reasons.
Weka prints out the area under the ROC curve.
In this case it's 0.5778.
If we could find a classifier that pushed
a bit more up towards the top left, then that
would be better, give us a better area.
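One way to sketch this computation: sweep the threshold through every distinct score to generate the ROC points, then integrate with the trapezoidal rule. This is a generic illustration, not Weka's implementation:

```python
def auc(labels, scores):
    """Area under the ROC curve: sweep the threshold through every distinct
    score, collect (FPR, TPR) points, and integrate with the trapezoidal rule."""
    pos = labels.count("yes")
    neg = labels.count("no")
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == "yes" and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == "no" and s >= t)
        points.append((fp / neg, tp / pos))
    points = [(0.0, 0.0)] + points + [(1.0, 1.0)]   # anchor the two endpoints
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data where every "yes" scores above every "no": a perfect ranking.
print(auc(["yes", "yes", "no", "no"], [0.9, 0.8, 0.3, 0.2]))  # 1.0
```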

And actually, if we were to evaluate J48--which
I won't do, but it's very simple--on the same
dataset (just run J48 and look at the curve),
we'll get a curve like this, the dashed blue
line, which is better.
The area under that curve is 0.6333, which
is better than Naive Bayes.
We're looking at threshold curves that plot
the accuracy of one class against the accuracy
on the other class and that depict the tradeoff
between these two things.
ROC curves plot the true positive rate against
the false positive rate.
They go from the lower left to the upper right,
and good ones stretch up towards the top left corner.
In fact, a diagonal line corresponds to a
random decision, so you shouldn't go below
the diagonal line.
The area under the curve is a measure of the
overall quality of a classifier.
It turns out that it's equal to the probability
that the classifier ranks a randomly chosen
positive test instance above a randomly chosen
negative one.
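That ranking interpretation can be checked directly: compare every (positive, negative) pair of scores and count how often the positive one is ranked higher, with ties counting half. A minimal sketch (the score lists are hypothetical):

```python
from itertools import product

def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive instance
    gets the higher score; ties count as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos_scores, neg_scores))
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_by_ranking([0.9, 0.8], [0.3, 0.2]))  # 1.0: a perfect ranking
print(auc_by_ranking([0.5], [0.5]))            # 0.5: a tie, like random guessing
```

For any scored dataset this pairwise count agrees with the area under the ROC curve, which is why the diagonal line, AUC 0.5, corresponds to random guessing.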
This has been a bit of a theoretical lesson,
and it might be worth going to the textbook
and reading the subsection on ROC curves in
Section 5.2.
The activity associated with this lesson involves
looking at actual ROC curves in an actual
classification system.
Good luck with that, and I'll see you in the
next lesson.
Bye for now!
