
Hi! Welcome back to Data Mining with Weka.
In the last lesson, we looked at classification
by regression, how to use linear regression
to perform classification tasks. In this lesson
we're going to look at a more powerful way
of doing the same kind of thing. It's called
"logistic regression". It's fairly mathematical,
and we're not going to go into the dirty details
of how it works, but I'd like to give you
a flavor of the kinds of things it does and
the basic principles that underlie logistic
regression. Then, of course, you can use it
yourself in Weka without any problem.
One of the things about data mining is that
you can sometimes do better by using prediction
probabilities rather than actual classes.
Instead of predicting whether it's going to
be a "yes" or a "no", you might do better
to predict the probability with which you
think it's going to be a "yes" or a "no".
For example, the weather is 95% likely to
be rainy tomorrow, or 72% likely to be sunny,
instead of saying it's definitely going to
be rainy or it's definitely going to be sunny.
Probabilities are really useful things in
data mining. NaiveBayes produces probabilities;
it works in terms of probabilities. We've
seen that in an earlier lesson.
I'm going to open diabetes and run NaiveBayes.
I'm going to use a percentage split with 90%,
so that leaves 10% as a test set. Then I'm
going to make sure I output the predictions
on those 10%, and run it. I want to look at
the predictions that have been output.
This is a 2-class dataset; the classes are tested_negative
and tested_positive, and these are the instances
-- number 1, number 2, number 3, etc. This
is the actual class -- tested_negative, tested_positive,
tested_negative, etc. This is the predicted
class -- tested_negative, tested_negative,
tested_negative, tested_negative, etc. There's
a plus under the error column to mark where
there's an error, so there's an error with
instance number 2. These are the actual probabilities
that come out of NaiveBayes.
So for instance 1 we've got a 99% probability
that it's negative, and a 1% probability that
it's positive. So we predict it's going to
be negative; that's why that's tested_negative.

And in fact we're correct; it is tested_negative.
For this instance, which we actually get wrong,
we're predicting 67% for negative and 33%
for positive, so we decide it's a
negative, and we're wrong. We might have been
better saying that here we're really sure
it's going to be a negative, and we're right;
here we think it's going to be a negative,
but we're not sure, and it turns out that
we're wrong. Sometimes it's a lot better to
think in terms of the output as probabilities,
rather than being forced to make a binary,
black-or-white classification.
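
If you want to reproduce this outside the Explorer, here is a minimal
sketch using Weka's Java API. The file name diabetes.arff and the random
seed are my assumptions, and the shuffle won't exactly match the
Explorer's percentage split, so the exact probabilities may differ:

    import java.util.Random;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesPredictions {
        public static void main(String[] args) throws Exception {
            // Load the dataset and use the last attribute as the class
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 90% percentage split: shuffle, train on the first 90%,
            // keep the remaining 10% as a test set
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.9);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize,
                                           data.numInstances() - trainSize);

            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);

            // Print actual class, predicted class, and class probabilities
            for (int i = 0; i < test.numInstances(); i++) {
                Instance inst = test.instance(i);
                double[] dist = nb.distributionForInstance(inst);
                int predicted = (int) nb.classifyInstance(inst);
                System.out.printf(
                    "%3d actual=%-15s predicted=%-15s p(neg)=%.2f p(pos)=%.2f%n",
                    i + 1,
                    inst.stringValue(inst.classIndex()),
                    test.classAttribute().value(predicted),
                    dist[0], dist[1]);
            }
        }
    }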
Other data mining methods produce probabilities,
as well. If I look at ZeroR, and run that,
these are the probabilities -- 65% versus
35%. All of them are the same.

Of course, it's ZeroR! -- it always produces the same
thing. In this case, it always says tested_negative
and always has the same probabilities. The
reason why the numbers are like that, if you
look at the slide here, is that we've chosen
a 90% training set and a 10% test set, and
the training set contains 448 negative instances
and 243 positive instances.
Remember the "Laplace Correction" in Lesson 3.2? -- we add 1 to
each of those counts to get 449 and 244.
That gives us a 65% probability for being a negative
instance. That's where these numbers come from.
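
To make the arithmetic explicit: with the corrected counts of 449 and 244,

    Pr[tested_negative] = 449 / (449 + 244) = 449 / 693 ≈ 0.648

which is where the 65% (and the complementary 35%) figures come from.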
If we look at J48 and run that, then we get
more interesting probabilities here --
the negative and positive probabilities, respectively.
You can see where the errors are.

These probabilities are all different.
Internally, J48 uses probabilities in order
to do its pruning operations.
We talked about that when we discussed J48's
pruning, although I didn't explain explicitly
how the probabilities are derived.
The idea of logistic regression is to make
linear regression produce probabilities, too.
This gets a little bit hairy.
Remember, when we use linear regression for
classification, we calculate a linear function
using regression and then apply a threshold
to decide whether it's a 0 or a 1.
It's tempting to imagine that you can interpret
these numbers as probabilities, instead of
thresholding like that, but that's a mistake.
They're not probabilities.
These numbers that come out on the regression
line are sometimes negative, and sometimes
greater than 1.
They can't be probabilities, because probabilities
always lie between 0 and 1.

In order to get better probability estimates,
a slightly more sophisticated technique is used.
In linear regression, we have a linear sum.
In logistic regression, we have the same linear
sum down here -- the same kind of linear sum
that we saw before -- but we embed it in this
kind of formula.
This is called a "logit transform".
A logit transform -- this is multi-dimensional
with a lot of different a's here.
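
The slide isn't reproduced in this transcript, but the formula in
question is the standard logistic model:

    Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(-(w0 + w1*a1 + w2*a2 + ... + wk*ak)))

where the w's are weights learned from the training data; the expression
inside exp() is the same kind of linear sum we used for linear regression.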
If we've got just one dimension, one variable,
a1, then if this is the input to the logit
transform, the output looks like this: it's
between 0 and 1.
It's a sort of S-shaped curve, a softer function.
Rather than just 0 and then a step function,
it's a soft version of a step function that
never gets below 0, never gets above 1, and
has a smooth transition in between.
When you're working with a logit transform,
instead of minimizing the squared error (remember,
when we do linear regression we minimize the
squared error), it's better to choose weights
to maximize a probabilistic function called
the "log-likelihood function", which is this
pretty scary looking formula down at the bottom.
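
That slide isn't shown here either, but for a two-class problem the
log-likelihood being maximized is, in standard form,

    sum over all training instances i of
        (1 - x_i) * log(1 - Pr[1 | a_i]) + x_i * log Pr[1 | a_i]

where x_i is the actual class of instance i (0 or 1) and Pr[1 | a_i] is
the output of the logit transform above.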
That's the basis of logistic regression.
We won't talk about the details any more:
let me just do it.
We're going to use the diabetes dataset.
In the last lesson we got 76.8% with classification
by regression.
Let me tell you: if you do ZeroR, NaiveBayes,
and J48, you get these numbers here.
I'm going to find the logistic regression
scheme.
It's in "functions", and called "Logistic".
I'm going to use 10-fold cross-validation.
I'm not going to output the predictions.
I'll just run it -- and I get 77.2% accuracy.
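
Here's the same experiment as a minimal sketch against the Weka Java
API; the file name and the seed are my assumptions, so your figure may
not be exactly 77.2%:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunLogistic {
        public static void main(String[] args) throws Exception {
            // Load the data and use the last attribute as the class
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation with seed 1
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new Logistic(), data, 10, new Random(1));
            System.out.printf("Logistic: %.1f%% correct%n", eval.pctCorrect());
        }
    }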
That's the best figure in this column, though
it's not much better than NaiveBayes, so you
might be a bit skeptical about whether it
really is better.
I did this 10 times and calculated the means
myself, and we get these figures for the mean
of 10 runs.
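
A rough way to automate that averaging with the Java API, varying only
the cross-validation seed (again, the file name and the seed range 1-10
are my choices; the Experimenter automates this kind of repetition):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MeanOfTenRuns {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Ten cross-validation runs, each with a different seed
            double sum = 0;
            for (int seed = 1; seed <= 10; seed++) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new Logistic(), data, 10,
                                        new Random(seed));
                sum += eval.pctCorrect();
            }
            System.out.printf("Mean of 10 runs: %.1f%%%n", sum / 10);
        }
    }

Swap Logistic for ZeroR, NaiveBayes, or J48 to fill in the other rows.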
ZeroR stays the same, of course, at 65.1%;
it produces the same accuracy on each run.
NaiveBayes and J48 are different, and here
logistic regression gets an average of 77.5%,
which is appreciably better than the other
figures in this column.
You can extend the idea to multiple classes.
When we did this in the previous lesson, we
performed a regression for each class, a multi-response
regression.
That actually doesn't work well with logistic
regression, because you need the probabilities
to sum to 1 over the various different classes.
That introduces more computational complexity
and needs to be tackled as a joint optimization problem.
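
For reference (the lecture doesn't give the formula), the usual joint
model for m classes, often called multinomial logistic regression, uses
one weight vector w_c per class:

    Pr[class = c | a] = exp(w_c . a) / (exp(w_1 . a) + ... + exp(w_m . a))

so the probabilities sum to 1 by construction, and all the weight
vectors have to be fitted together.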

The result is logistic regression, a popular
and powerful machine learning method that
uses the logit transform to predict probabilities directly.
It works internally with probabilities, like
NaiveBayes does.
We also learned in this lesson about prediction
probabilities that can be obtained from other
methods, and how to calculate probabilities
from ZeroR.
You can read in the course text about logistic
regression in Section 4.6.
Now you should go and do the activity associated
with this lesson.
See you soon.
Bye for now!
