
Hi! This is Lesson 3.3 on using probabilities.
It's the one bit of Data Mining with Weka
where we're going to see a little bit of mathematics,
but don't worry, I'll take you through it
gently.
The OneR strategy that we've just been studying
assumes that one of the attributes does all
the work, that it takes sole responsibility
for the decision.
That's a simple strategy.
Another simple strategy is the opposite, to
assume all of the attributes contribute equally
and independently to the decision.
This is called the "Naive Bayes" method --
I'll explain the name later on.

There are two assumptions that underlie Naive
Bayes: that the attributes are equally important
and that they are statistically independent,
that is, knowing the value of one of the attributes
doesn't tell you anything about the value
of any of the other attributes.
This independence assumption is never actually
correct, but the method based on it often
works well in practice.
There's a theorem in probability called "Bayes
Theorem" after this guy Thomas Bayes from the
18th century.
It's about the probability of a hypothesis
H given evidence E.
In our case, the hypothesis is the class of
an instance and the evidence is the attribute
values of the instance.
The theorem is that Pr[H|E] -- the probability of the class
given the instance, the hypothesis
given the evidence -- is equal to Pr[E|H] times Pr[H] divided
by Pr[E].
Pr[H] by itself is called the prior probability
of the hypothesis H.
That's the probability of the event before
any evidence is seen.
That's really the baseline probability of
the event.
For example, in the weather data, I think
there are 9 yeses and 5 nos, so the baseline
probability of the hypothesis "play equals
yes" is 9/14 and "play equals no" is 5/14.
What this equation says is how to update that
probability Pr[H] when you see some evidence,
to get what's called the "a posteriori" probability
of H, that means after the evidence.
The evidence in our case is the attribute
values of an unknown instance. That's E.

That's Bayes Theorem.
Now, what makes this method "naive"? The naive
assumption is -- I've said it before -- that the
evidence splits into parts that are statistically
independent.
The parts of the evidence in our case are
the four different attribute values in the
weather data.
When you have independent events, the probabilities
multiply, so Pr[H|E],
according to the top equation, is the product
of Pr[E|H] times the prior probability
Pr[H] divided by Pr[E].
Pr[E|H] splits up into
these parts: Pr[E1|H],
the first attribute value; Pr[E2|H],
the second attribute value; and so on for all
of the attributes.
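
Spelled out with the top equation, that naive
factorization reads (again just restating the prose,
with E1 through E4 the four attribute values):

```latex
\Pr[H \mid E] \;=\; \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\,\Pr[E_3 \mid H]\,\Pr[E_4 \mid H]\;\Pr[H]}{\Pr[E]}
```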
That's maybe a bit abstract, let's look at
the actual weather data.
On the right-hand side is the weather data.

In the large table at the top, we've taken
each of the attributes.
Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at
how many times the outlook is "sunny".
It's sunny twice under yes and 3 times under no.
That comes straight from the data in the table.
Overcast.
When the outlook is overcast, it's always
a "yes" instance, so there were 4 of those,
and zero "no" instances.
Then, rainy is 3 "yes" instances and 2 "no"
instances.
Those numbers just come straight from the
data table given the instance values.
Then, we take those numbers and underneath
we make them into probabilities.
Let's say we know the hypothesis.
Let's say we know it's a "yes".
Then the probability of it being "sunny" is
2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,
simply because when you add up 2 plus 4 plus
3 you get 9.
Those are the probabilities.
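
Here is a minimal Python sketch of that tallying --
my illustration, not part of the lesson. The
(outlook, play) pairs below are the standard
14-instance nominal weather data, and the same loop
works for the other three attributes:

```python
from collections import Counter

# (outlook, play) pairs from the 14-instance nominal weather data
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

counts = Counter(pairs)                            # e.g. ("sunny", "yes") -> 2
class_totals = Counter(play for _, play in pairs)  # "yes" -> 9, "no" -> 5

# Turn the counts into conditional probabilities Pr[value | class]
for outlook in ("sunny", "overcast", "rainy"):
    for play in ("yes", "no"):
        n, total = counts[(outlook, play)], class_totals[play]
        print(f"Pr[{outlook}|{play}] = {n}/{total} = {n / total:.3f}")
```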

If we know that the outcome is "no", the probabilities
are "sunny" 3/5ths, "overcast" 0/5ths, and "rainy"
2/5ths.
That's for the "outlook" attribute.
That's what we're looking for, you see, the
probability of each of these attribute values
given the hypothesis H.
The next attribute is temperature, and we
just do the same thing with that to get the
probabilities of the 3 values -- hot, mild,
and cool -- under the "yes" hypothesis or the
"no" hypothesis.
The same with humidity and windy. Play,
that's the prior probability -- Pr[H].
It's "yes" 9/14ths of the time, "no" 5/14ths of the
time, even if you don't know anything about
the attribute values.
The equation we're looking at is this one
below, and we just need to work it out.
Here's an example.
Here's an unknown day, a new day.

We don't know what the value of "play" is, but
we know it's sunny, cool, high, and windy.
We can just multiply up these probabilities.
If we multiply for the yes hypothesis, we
get 2/9th times 3/9ths times 3/9ths times
3/9ths -- those are just the numbers on the
previous slide Pr[E1|H], Pr[E2|H], Pr[E3|H]
Pr[E4|H] -- finally Pr[H], that is 9/14ths.
That gives us a likelihood of 0.0053 when
you multiply them.
Then, for the "no" class, we do the same to
get a likelihood of 0.0206.
These numbers are not probabilities.
Probabilities have to add up to 1.
They are likelihoods.
But we can get the probabilities from them
by using a straightforward technique of normalization.
Take those likelihoods for "yes" and "no",
and normalize them as shown below to make
them add up to 1.
That's how we get the probability of "play"
on a new day with different attribute values.
Just to go through that again.
The evidence is "outlook" is "sunny", "temperature"
is "cool", "humidity" is "high", "windy" is "true" --
and we don't know what play is.
The likelihood of a "yes", given the evidence,
is the product of those 4 probabilities -- one
for outlook, temperature, humidity and windy
-- times the prior probability, which is
just the baseline probability of a "yes".
That product of fractions is divided by Pr[E].
We don't know what Pr[E] is, but it doesn't
matter, because we can do the same calculation
for Pr[E] of "no", which gives us another
equation just like this, and then we can calculate
the actual probabilities by normalizing them
so that the two probabilities add up to 1.
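
In code, the whole calculation for this day is only
a few lines -- a sketch of mine, not the lesson's;
the "no" factors are read off the same slide table
in the same way as the "yes" ones:

```python
# Factors from the table: Pr[Ei|class] for sunny, cool, high, windy=true,
# then the prior Pr[class]
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~ 0.0206

# Normalize: divide each likelihood by their sum so they add up to 1
total = like_yes + like_no
print(f"Pr[yes|E] = {like_yes / total:.3f}")   # ~ 0.205
print(f"Pr[no|E]  = {like_no / total:.3f}")    # ~ 0.795
```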

Pr[yes|E] plus Pr[no|E] equals 1.
It's actually quite simple when you look at
it in numbers, and it's simple when you look
at it in Weka, as well.
I'm going to go to Weka here, and I'm going
to open the nominal weather data,
which is here.
We've seen that before, of course, many times.
I'm going to go to Classify.
I'm going to use the NaiveBayes method.
It's under this bayes category here.
There are a lot of implementations of different
variants of Bayes.
I'm just going to use the straightforward
NaiveBayes method here.
I'll just run it.
This is what we get.
The success probability calculated according
to cross-validation.
More interestingly, we get the model.
The model is just like the table I showed
you before divided under the "yes" class and
the "no" class.

We've got the four attributes -- outlook,
temperature, humidity, and windy -- and then,
for each of the attribute values, we've got
the number of times that attribute value appears.
Now, there's one little but important difference
between this table and the one I showed you
before.
Let me go back to my slide and look at these
numbers.
You can see that for outlook under "yes" on
my slide, I've got 2, 4, and 3, and Weka has
got 3, 5, and 4.
That's 1 more each time for a total of 12,
instead of a total of 9.
Weka adds 1 to all of the counts.
The reason it does this is to get
rid of the zeros.
In the original table under outlook, under
"no", the probability of overcast given "no" is
zero, and we're going to be multiplying that
into things.
What that would mean in effect, if we took
that zero at face value, is that the probability
of the class being "no" given any day for which
the outlook was overcast would be zero.
Anything multiplied by zero is zero.
These zeros in probability terms have sort
of a veto over all of the other numbers, and
we don't want that.
We don't want to categorically conclude that
it must be a "no" day on the basis that it's
overcast and we've never seen an overcast outlook
on a "no" day before.
That's called the "zero-frequency problem", and
Weka's solution -- the most common solution
-- is very simple: we just add 1 to all the
counts.
That's why all those numbers in the Weka table
are 1 bigger than the numbers in the table
on the slide.
Aside from that, it's all exactly the same.
We're avoiding zero frequencies by effectively
starting all counts at 1 instead of starting
them at 0, so they can't end up at 0.
That's the Naive Bayes method.
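
A minimal sketch of that add-one estimate, assuming
(as the Weka output suggests, with counts 3, 5, and 4
summing to 12 rather than 9) that the denominator
grows by the number of attribute values:

```python
def laplace(count, class_total, n_values):
    """Add-one ("Laplace") estimate: every count effectively starts
    at 1, so no conditional probability can ever be exactly zero."""
    return (count + 1) / (class_total + n_values)

# Pr[outlook=overcast | no]: raw count is 0 out of 5 "no" days,
# and outlook has 3 possible values (sunny, overcast, rainy)
print(laplace(0, 5, 3))   # 1/8 = 0.125 instead of a vetoing 0
```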

The assumption is that all attributes contribute
equally and independently to the outcome.
That works surprisingly well, even in situations
where the independence assumption is clearly violated.
Why does it work so well when the assumption
is wrong?
That's a good question.
Basically, classification doesn't need accurate
probability estimates.
We're just going to choose as the class the
outcome with the largest probability.
As long as the greatest probability is assigned
to the correct class, it doesn't matter whether
the probability estimates are all that accurate.
This actually means that if you add redundant
attributes you get problems with Naive Bayes.
The extreme case of dependence is where two
attributes have the same values, identical
attributes.
That will cause havoc with the Naive Bayes
method.
However, Weka contains methods for attribute
selection that allow you to select a subset
of fairly independent attributes, after which
you can safely use Naive Bayes.
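
To see the havoc, here's a tiny illustration of mine,
not the lesson's, reusing the numbers from the worked
example: if outlook were duplicated, its factor would
enter the product twice, double-counting that one
piece of evidence:

```python
# Same new day as before, but with the outlook evidence mistakenly
# counted twice, as if an identical copy of the attribute were added
like_yes = (2/9)**2 * (3/9) * (3/9) * (3/9) * (9/14)
like_no  = (3/5)**2 * (1/5) * (4/5) * (3/5) * (5/14)
print(like_yes / (like_yes + like_no))   # ~ 0.09, down from ~ 0.205
```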
There's quite a bit of stuff on statistical
modeling in Section 4.2 of the course text.
Now you need to go and do that activity.
See you soon!
