
Hello again.
You know, the trouble with life is that sometimes
everything just comes down to money.
In this lesson and the next we're going to
look at counting the cost in data mining applications.
What is success? Well, that's a pretty good
question, I suppose.
In data mining terms, we've looked at the
classification rate, measured on a test set,
a holdout set, or by cross-validation.
But, essentially, we're trying to minimize
the number of errors or maximize the classification
rate.
In real life, different kinds of errors might
have different costs, and minimizing the total
errors might be inappropriate.

Now, we looked at the ROC curve in Class 2,
and that shows you the different tradeoffs
between the different error costs.
But it's not really appropriate if you actually
know the error costs.
Then we want to pick a particular point on
this ROC curve.
We're going to look at the credit rating dataset,
credit-g.arff.
It's worse to class a customer as "good" when
they're "bad" than it is to class a customer
as "bad" when they're "good".
In this dataset, the class value is "good"
or "bad".
The idea is that if you class someone as "good"
when they're "bad" and you give them a loan,
then he's going to run away with all your
money, whereas if you make an error the other
way round then you might have an opportunity
to rectify it later on.
To tell you the truth, I know nothing about
the credit rating industry, but let's just
suppose that's the case.
Furthermore, let's suppose that the cost ratio
is 5 to 1.
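That 5-to-1 assumption can be written as a 2-by-2 cost matrix, with one row per actual class and one column per predicted class, mirroring the confusion matrix; a minimal sketch (correct predictions are assumed to cost nothing):

```python
classes = ["good", "bad"]

# cost[(actual, predicted)]: classing a "bad" customer as "good" costs 5,
# the opposite mistake costs 1, and correct predictions cost nothing.
cost = {
    ("good", "good"): 0, ("good", "bad"): 1,
    ("bad", "good"): 5,  ("bad", "bad"): 0,
}
```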

I've got the credit dataset open here, and
I'm going to run J48.
What I get is an error rate of 29.5%, a success
rate of 70-71%.
Down here is the confusion matrix.
I've copied those over here on to this slide.
You can see that the cost here, the number
of errors, is effectively the 183 plus 112,
those off-diagonal elements of the confusion
matrix.
If errors cost the same amount, that's a fair
reflection of the cost of this confusion matrix.
However, if the cost matrix is different,
then we need to do a different kind of evaluation.
On the Classify panel, we can do a cost-sensitive
evaluation.
Let me go and do that for you.
In the More options menu, we're going to do
a cost-sensitive evaluation.
I need to set a cost matrix.
This interface is a little weird.
I need a 2 by 2 matrix; I'm going to resize
this.

Here we've got a cost of 1 for both kinds
of error, but I want a cost of 5 for this
kind of error.
Just close that and then run this again.
Now I've got the same result, the same confusion
matrix, but I've got some more figures here.
I've got a total cost of 1027 and an average
cost of 1.027.
(There are 1000 instances in this dataset.)
Coming back to the slide, the cost here is
computed by taking the 183 in the lower left
and multiplying it by 5--because that's the
cost of errors down there--and the 112 times
1, adding those up, and I get 1027.
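That arithmetic can be sketched as follows. The off-diagonal counts (183 and 112) are quoted above; the diagonal counts are inferred from the 1000 instances and credit-g's 700/300 good/bad class split, so treat them as an assumption:

```python
# J48 confusion matrix on credit-g: rows = actual class, columns = predicted class.
confusion = [[588, 112],   # actual "good": 588 classed "good", 112 classed "bad"
             [183, 117]]   # actual "bad": 183 classed "good" (expensive), 117 classed "bad"

# Cost matrix in the same orientation: the lower-left error costs 5, the upper-right 1.
cost = [[0, 1],
        [5, 0]]

# Total cost = sum over cells of (count * cost), i.e. 183*5 + 112*1.
total = sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))
average = total / 1000     # 1000 instances in the dataset
print(total, average)      # 1027 1.027
```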
If I take the baseline, let's go and have
a look at ZeroR.
I'm going to run ZeroR on this.
Here it is.
Here I get a cost of 1500.
I get this confusion matrix.

Over here on the slide, there's the confusion
matrix.
And although I've only got 300 errors here,
they're expensive errors, they each cost $5,
so I've got a cost of 1500.
This is classifying everything as "good",
because there are more "good" instances than
"bad" in this dataset.
If I were to classify everything as "bad"
the total cost would only be 700.
That's actually better than either J48 or
ZeroR.
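Both baseline figures follow from the class distribution; the 700 "good" / 300 "bad" split used here is the standard one for credit-g (an assumption, since the split isn't stated above):

```python
n_good, n_bad = 700, 300   # class counts in credit-g (assumed standard split)

# ZeroR predicts the majority class, "good": all 300 "bad" instances
# become expensive errors at cost 5 each.
zeror_cost = n_bad * 5         # 1500
# Predicting everything as "bad" instead: 700 cheap errors at cost 1 each.
all_bad_cost = n_good * 1      # 700
print(zeror_cost, all_bad_cost)
```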
Obviously we ought to be taking the cost matrix
into account when we're doing the classification,
and that's exactly what the CostSensitiveClassifier
does.
We're going to take the CostSensitiveClassifier,
select J48, define a cost matrix, and see
what happens.
It's in the meta > CostSensitiveClassifier,
which is here.
I can define a classifier.
I'm going to choose J48, which is here.

I need to specify my cost matrix.
I want it 2 by 2.
I'll need to resize that.
I need to put a 5 down here.
Cool.
I'm just going to run it.
Now I get a worse classification rate.
We've only got 60-61% accuracy, but we've
got a smaller cost, 658.
And we've got a different confusion matrix.
Back here on the slide you can see that.
The old confusion matrix looked like this,
and the new confusion matrix is the one on
the right.
You can see that the number of expensive
errors, 183, has been reduced to 66.
That brings the cost down, the average cost,
to 0.66 per instance instead of 1.027, despite
the fact that we now have a worse classification
rate.
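Only two of the new figures are quoted (66 expensive errors, total cost 658), but the rest can be derived from them; the counts below are reconstructions from those totals, not values read from the Weka output:

```python
total_cost = 658
expensive_errors = 66                               # "bad" classed "good", cost 5 each

cheap_errors = total_cost - expensive_errors * 5    # cost-1 errors: 658 - 330 = 328
total_errors = expensive_errors + cheap_errors      # 394
accuracy = 1 - total_errors / 1000                  # 0.606, the 60-61% quoted above
average_cost = total_cost / 1000                    # 0.658, quoted as 0.66
print(cheap_errors, total_errors, accuracy, average_cost)
```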
Let's look at what ZeroR does with the CostSensitiveClassifier.

It's kind of interesting because we're going
to get a different rule.
Instead of classifying everything as "good",
we're going to classify everything as "bad".
We're going to make 700 mistakes, but they're
cheap mistakes.
It's only going to cost us $700.
That's what we've learned today.
Is classification accuracy the best measure?
Very likely it isn't.
In real life, different kinds of errors usually
do have different costs.
If you don't know the costs, you might want
to look at the tradeoff between the different
error costs in different parts of the space;
the ROC curve is appropriate for that.
But if you do know the costs--the cost matrix--then
you can do cost-sensitive evaluation to find
the total cost on the test set of a particular
learned model, or you can do cost-sensitive
classification, that is, take the costs into
account when producing the classifier.
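One standard way a wrapper can take a cost matrix into account at prediction time is to pick, for each instance, the class with minimum expected cost under the base classifier's probability estimates. This is a sketch of that idea, not necessarily Weka's exact implementation, and the probabilities below are made up:

```python
def min_expected_cost_class(probs, cost, classes):
    """probs[i] = estimated probability that the actual class is classes[i];
    cost[i][j] = cost of predicting classes[j] when the actual class is classes[i]."""
    expected = [sum(probs[i] * cost[i][j] for i in range(len(classes)))
                for j in range(len(classes))]
    return classes[expected.index(min(expected))]

cost = [[0, 1], [5, 0]]        # rows = actual (good, bad), columns = predicted
classes = ["good", "bad"]

# With a 5:1 penalty, even a customer estimated 70% likely to be "good"
# is safer to refuse: expected cost of predicting "good" = 0.3*5 = 1.5,
# of predicting "bad" = 0.7*1 = 0.7.
print(min_expected_cost_class([0.7, 0.3], cost, classes))   # bad
```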

The CostSensitiveClassifier does this: it
makes any classifier cost-sensitive.
How does it do this? Very good question.
We're going to find out in the next lesson.
Off you go now and do the activity, and we'll
see you soon.
Bye for now!
