
Hello again! This is the last lesson in Class
2, Lesson 2.6, and we're going to get back
to some actual document classification.
In fact, we're going to introduce a new classifier,
Multinomial Naive Bayes, designed for document classification.
I'd like you to recall the Naive Bayes classifier.
We talk about the probability of the event
H, that is, the probability of a particular
class, given evidence E, that is, a particular
set of attribute values for an instance.
The prior probability of H is the probability
of the class before we know anything about the instance.
So in the weather data, I think there are
9 "play" instances and 5 "don't play" instances,

so the prior probability of "play" is 9/14
without knowing anything about the instance.
Naive Bayes updates that with information
about the instance, that is, the attribute
values, to get the probability of H, the class,
given the instance.
The "naive" part is that it takes these attribute
values, this evidence, and splits it into
independent parts, one for each attribute,
and multiplies these together.
This is a good thing to do if the attributes
really are independent.
So E1 is like the first attribute value, and
E2 is like the second attribute value and
so on.
That's how Naive Bayes works.
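As a quick sketch, the rule just described can be written out as the standard Bayes formula with the naive independence split, where E_1 through E_n are the attribute values:

```latex
\Pr[H \mid E] \;=\; \frac{\Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\;\Pr[H]}{\Pr[E]}
```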
There are a couple of problems here for document
classification.
First of all, the non-appearance of a word
counts just as much in Naive Bayes as the
appearance of a word.
It makes intuitive sense that the class of
a document is more determined by the words
that are in it than the words that aren't
in it.

Secondly, Naive Bayes doesn't account for
the fact that a word might occur multiple
times in a document.
A word that occurs many times should probably
have a greater influence on the class of the document
than a word that appears only once.
Thirdly, it treats all words the same.
The word "and" or "the" is treated the same
as an unusual word like "weka" or "breakfast",
and that doesn't sound reasonable, either.
Multinomial Naive Bayes is an enhancement
of Naive Bayes that solves these problems.
We take that complicated formula and replace
it by the thing at the bottom.
Just forget about those exclamation marks
for the moment.
This is basically a product, over all the words
in the document, of p_i, the probability
of word i, raised to the power n_i, the number
of times that word appears in the document.
It's like treating each word appearance as
an independent event and multiplying them
all together.
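Written out as a sketch, the formula at the bottom of the slide has this shape, where p_i is the probability of word i given the class H, n_i is its count in the document, and N = n_1 + n_2 + ... is the document length:

```latex
\Pr[E \mid H] \;\approx\; N! \,\prod_{i} \frac{p_i^{\,n_i}}{n_i!}
```

The factorials are the exclamation marks mentioned above.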

And those factorials are just a technicality
that account for the possibility of different
word orderings.
That's the theory; you don't have to understand
that.
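For anyone curious, here is a minimal sketch of that scoring idea in Python. The toy data and function names are my own, not Weka's, and the factorial terms are omitted because they depend only on the document, not the class, so they cancel when comparing classes:

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_class):
    """Estimate log priors and smoothed log word probabilities per class."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        # Laplace smoothing so an unseen word doesn't zero out the product
        log_p = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                 for w in vocab}
        model[cls] = (math.log(len(docs) / total_docs), log_p)
    return model

def classify(model, doc):
    """Score log Pr[H] + sum over words of n_i * log p_i; pick the best class."""
    word_counts = Counter(doc)
    def score(cls):
        log_prior, log_p = model[cls]
        return log_prior + sum(n * log_p[w]
                               for w, n in word_counts.items() if w in log_p)
    return max(model, key=score)

# Hypothetical toy training documents, each a list of words
docs = {
    "grain": [["wheat", "corn", "harvest"], ["corn", "export", "wheat"]],
    "other": [["market", "stock", "price"], ["price", "oil", "market"]],
}
model = train_multinomial_nb(docs)
print(classify(model, ["wheat", "wheat", "corn"]))  # prints "grain"
```

Note that `classify` only loops over words that actually appear in the document, which is exactly the sparse-representation point made at the end of this lesson.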
It's very easy to use Multinomial Naive Bayes
in Weka.
This is what we're going to do.
I'm going to open a training set.
We're going to use "ReutersGrain", which is
like the "corn" dataset we used previously,
only its documents are about grain.
I'm going to open that training file.
Then I'm going to use a supplied test set,
that is, the corresponding test file.
Then I'm going to use J48.
When I try to choose J48, well, it's grayed
out.
We know why it's grayed out; it's grayed out
because the training file contains a string
attribute, and J48 can't deal with string
attributes.

We know that what we're supposed to do here
is to use the FilteredClassifier, which is
here.
Configure that to have J48 as the classifier,
which is the default, and for the filter we're
going to choose the unsupervised attribute
filter called StringToWordVector.
There it is.
Let me just run that.
Here I get 96% accuracy, but if I look at
the accuracy in the minority class, the one
that we're most interested in, the "grain"
class, the accuracy is not very good.
I get 38 correct out of a total of 57 (19+38).
That's not very good accuracy at all.
We know by now that I should be looking at the
ROC Area, which is 0.906.
Going back to the slide: I've summarized that
information.

I could run NaiveBayes; I won't do that, but
let me just tell you that I would get quite
a bit worse classification accuracy but a
better success rate on the grain-related
documents, 46/57, and a slightly worse ROC
Area (0.885).
I'm going to run Multinomial Naive Bayes.
I'm going to go back to my FilteredClassifier
and configure it to choose NaiveBayesMultinomial.
Run that.
It's very quick.
I don't get a very good classification accuracy,
but I get rather a good ROC Area, and not
a bad accuracy on the minority class, 52 out
of 57.
That's not too bad: a definite improvement
over J48 in terms of ROC Area and minority
class accuracy.

I can actually mess around with some of the
parameters in the StringToWordVector filter.
There are a lot of parameters here, and they're
very useful.
One of the parameters is to output word counts.
By default, the filter outputs a 1 if the
document contains that word and a 0 otherwise.
But we can output the number of appearances
of that word in the document, which is suitable
for Multinomial Naive Bayes.
I'm going to do a few other things at the
same time.
I can change all the tokens, all the words,
into lower case.
I'm going to do that, so that it doesn't matter
whether a word is expressed in uppercase or
lowercase, it's going to count as the same
word.
Also, I'm going to use a "stoplist".
"Stop words" are those common words, like
"and" and "the", and there's a standard stoplist
for English.
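The three options just mentioned (word counts, lowercasing, and a stoplist) can be sketched in a few lines of Python; the whitespace tokenizer and the tiny stoplist here are illustrative stand-ins, not Weka's actual ones:

```python
from collections import Counter

# Illustrative stoplist; Weka ships a much fuller standard English list
STOPLIST = {"and", "the", "of", "a", "to"}

def string_to_word_vector(text, output_word_counts=True,
                          lower_case_tokens=True, use_stoplist=True):
    """Mimic three StringToWordVector options on a single document."""
    tokens = text.split()
    if lower_case_tokens:
        tokens = [t.lower() for t in tokens]       # "Wheat" and "wheat" become one word
    if use_stoplist:
        tokens = [t for t in tokens if t not in STOPLIST]
    counts = Counter(tokens)
    if output_word_counts:
        return dict(counts)                        # word -> number of appearances
    return {w: 1 for w in counts}                  # default behavior: present -> 1

print(string_to_word_vector("the Wheat and the wheat harvest"))
# prints {'wheat': 2, 'harvest': 1}
```

With `output_word_counts=False` you get the 0/1 presence vector that the filter produces by default, which is what plain Naive Bayes was working with.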

If I set this to True, then it's going to
disregard common words, words on the stoplist
in Weka.
Let me run that again and see what I get.
Here I get a slightly better accuracy, a pretty
good accuracy actually.
I get a much better ROC Area, and I get phenomenal
accuracy on the minority class: just 1 error
out of 57 here.
Going back to my slide: with J48 I got really
good classification accuracy; now I'm not
quite at the same level with NaiveBayesMultinomial.
When I first did NaiveBayesMultinomial, it
wasn't too bad, but then when I set outputWordCounts,
well, it got slightly worse, actually.
I got a worse ROC Area, which is a little
bit surprising; better accuracy on the minority
class, 54 out of 57.

Then when I set lowerCaseTokens and the stoplist
as well, I got very good accuracy on the minority
class, and a very good ROC Area of 0.978.
That's it.
Multinomial Naive Bayes is a machine learning
method that's designed for use with text.
It takes into account word appearance, rather
than word non-appearance.
It accounts for multiple repetitions of a
word, and it treats common words differently
from unusual ones by looking at the frequency
with which they appear in the document collection.
It's actually a lot faster in Weka than plain
Naive Bayes.
For one thing, it ignores words that don't
appear in a document; when you think about
it, most words don't appear in a document!
Internally, Weka uses what's called a "sparse
representation" of the data; Multinomial Naive
Bayes takes advantage of that.
The StringToWordVector filter has many interesting
options.
We looked at some of those.

It actually outputs the results in sparse
format, which Multinomial Naive Bayes takes
advantage of.
There's some stuff on this method in the course
text.
Now you should go and do the activity associated
with this lesson.
By the way, it's now time to do the mid-course
test.
This is the end of Class 2, so you should
sit the mid-course test and see how you get
on with that, as well as doing the activity.
I'll see you in Class 3.
Bye for now!
