
Hello again!
And now, as they say, for something completely
different.
The second half of this class is about document
classification, this lesson and the next two.
And the only thing it has to do with the first
half of the class is that both use the
FilteredClassifier.
Let's look at some documents.
Here are 6 documents.
They are very short documents (we'll look
at a much larger example in a minute), just
a single sentence each, and they're classified
into "yes" and "no" classes.
You can see when you read these that they
are all about oil.
The "yes" documents are about oil coming from
the ground, and the "no" documents are about
oil as used in cooking, "the food was very
oily," for example.

We code this training set into ARFF in the
standard way, with string attributes.
For string attributes we just take the text
and surround it by quotes, just as I've shown
in the bottom here.
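A training file in this style might look roughly like the following (the relation name and the wording of the documents here are illustrative, not the actual file):

```
@relation oil-documents

@attribute text string
@attribute class {yes, no}

@data
"crude oil prices rose sharply", yes
"the food was very oily", no
```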
I've loaded this dataset into Weka.
We can just have a look at it here.
There it is, just what you saw on the slide.
And of course we can't do anything with this
at the moment.
There are 6 distinct values for the text attribute,
and no learning system can learn anything
from these 6 different values.
What we're going to do is use a filter, the
StringToWordVector filter (under Unsupervised >
Attribute), which is here.
It's got a bunch of options, but let's just
apply it.
Wow!
Look at all these attributes.
We've got 34 attributes.
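In spirit, what the filter just did can be sketched in a few lines: a word-presence version with invented example sentences (the real filter has many options we are ignoring here):

```python
# Sketch of StringToWordVector in word-presence mode: one attribute per
# distinct word in the training documents, valued 1 if present, else 0.
# The two example sentences below are invented for illustration.
docs = [
    ("the crude price rose", "yes"),
    ("the food was very oily", "no"),
]

# Vocabulary: every distinct word across the training documents.
vocab = sorted({w for text, _ in docs for w in text.split()})

# Each document becomes a 0/1 vector over that vocabulary.
vectors = [[1 if w in text.split() else 0 for w in vocab] for text, _ in docs]

print(vocab)
print(vectors)
```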

They're words like "crude" and "demand" and
"the".
When you look at it, these are just the words
that appeared in the training documents.
Actually, the type, the "yes" or "no" thing,
has been moved to the first attribute, not
the last attribute.
When we look at the individual word attributes,
like the one for "crude", it's just a number,
it's a numeric attribute with two values,
0 or 1: 0 if it doesn't appear in that document,
and 1 if it does appear in that document.
Let's go and classify this.
Let's use J48.
It's in grey, actually.
I can still select it, but I can't start it.
The reason why I can't start it is that, by
default, Weka is predicting the last attribute,
and the last attribute is numeric, the word
"was".
So I'll just change this to predict the type.
Then I can run J48, but there's a problem
evaluating it, because there are only 6 instances
and we're trying to do 10-fold cross-validation,
which isn't going to work.
Let's just evaluate this on the training set
for the moment.
The most useful thing to look at in the result
here is the decision tree that's produced,
which is here.
Let's look at the tree.
You can see that it tests on the single word
"crude".
If "crude" does not appear, then it's a "no"
document--that is, it's about food.
If "crude" does appear, it's a "yes" document--that
is, it's about oil coming out of the ground.
It makes kind of sense; it's a kind of trivial
example, I guess.
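Since the whole tree is a single test on "crude", its behaviour can be sketched as a one-rule classifier (example sentences invented):

```python
# One-rule sketch of the learned tree: a document is "yes" (oil from the
# ground) exactly when it contains the word "crude".
def classify(text):
    return "yes" if "crude" in text.lower().split() else "no"

print(classify("crude oil prices rose"))   # yes
print(classify("the food was very oily"))  # no
```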
I'll just go back to the slides.
This is what we've done.
We loaded the data set into Weka.
We looked at the string attributes.
We applied this filter, which created a lot
of new attributes, one for each word.
They were binary (two-valued) numeric attributes.
We used J48, had to set the class attribute,
and evaluated on the training set.
Then we looked at the tree.

I want to evaluate this on a supplied test
set.
I want to see what the predictions are on
this test set.
These are the documents in the test set.
I've coded them as Unknown, that is, a question
mark in the ARFF file.
We've never done this before.
We haven't ever looked at predictions for
individual test documents or test instances.
Let me now go and get the supplied test set,
which I have here.
Now I've got that test set.
I can start this running.
Well, it's obvious really--there's a problem
evaluating the classifier, because, you know,
when I look at the test documents, it's an
ARFF file with string attributes, and the
training documents are an ARFF file with word
attributes.
Of course, I can take these test documents
and convert them using the StringToWordVector
filter, but that still wouldn't solve the
problem, because I might have different words
in a different order here, so I'd still have
a different structure to the ARFF file.
We've got to do something different.
That's where the FilteredClassifier comes
in.
Just going back to the slide, there's a problem
evaluating the classifier.
We can't simply apply StringToWordVector to
the test file.
The solution is the FilteredClassifier.
As we saw previously, the FilteredClassifier
will create a filter from the training set
and use it for the test set.
That's exactly what we're going to do here.
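The idea can be sketched like this, with toy sentences: fix the vocabulary on the training data only, then reuse it unchanged on the test data, ignoring unseen words. (In Weka, the FilteredClassifier does this bookkeeping for you.)

```python
# Why the FilteredClassifier matters: the attribute set must come from
# the training documents and be reused, unchanged, for the test documents.
def fit_vocab(train_texts):
    return sorted({w for t in train_texts for w in t.split()})

def transform(texts, vocab):
    # Words not in the training vocabulary are simply ignored.
    return [[1 if w in t.split() else 0 for w in vocab] for t in texts]

train = ["crude oil prices rose", "the food was oily"]
test = ["iraq has significant oil reserves"]  # mostly unseen words

vocab = fit_vocab(train)           # built from the training set only
test_vec = transform(test, vocab)  # same attributes, same order

print(vocab)
print(test_vec)
```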
Coming back to Weka, I'm going to undo the
effect of this filter, so I've got the original
string attribute.
I'm going to find the FilteredClassifier (meta>FilteredClassifier).
I'm going to configure that to use J48 as
the classifier, which is done by default,
and I'm going to use the StringToWordVector
filter.

It's an Unsupervised Attribute filter.
Let me just run this.
Here we get the result.
That's actually not very interesting, because
these documents had question marks instead
of classifications.
What I wanted to do was output the predictions,
and I can do that in the More options menu.
If I click Output predictions and run it again,
now I can see the predictions for the test
instances.
As you can see, there's one "yes" prediction
and three "no" predictions.
The actual class is a question mark in each
case.
Coming back to the slide.
That's not exactly what I wanted.
The first instance is certainly "yes", oil
coming out of the ground, but so is the third.

That should have been a "yes", and, in fact
J48 has predicted a "no" for that document
"Iraq has significant oil reserves".
Obviously, it doesn't contain the word "crude",
which is the test that J48 is doing.
Well, these are tiny little documents.
Let's look at something a bit more substantial.
I'm going to take a big dataset, ReutersCorn-train.arff.
Let's just look at it in a minute.
I'm going to open it now.
There are 1,554 documents.
This is a lot bigger.
If I apply the StringToWordVector filter,
then--it just takes a second--I get a lot
of attributes corresponding to words.
Actually, there are 2,234 attributes.
Again, the class attribute has been moved
to the top, attribute number 1.

I'm going to undo the effect of this [filter],
because we're going to classify this using
the FilteredClassifier.
I'm going to set a different test set.
I'm going to open ReutersCorn-test.arff.
Then I'm going to run this with J48.
The FilteredClassifier.
It's just going to take a second.
It's finished now.
I get 97% accuracy.
Before we go on, let's actually have a look
at what this dataset looks like.
I'm going to open up the file, the training
file.
Here it is.
There are two attributes: a string attribute
and a class attribute which is 0 or 1.
Here's the beginning of the first string,
and it's a long string.
In fact, this open quote runs right down
to the closing quote here.
This whole bit of text is one string attribute
value.

It's followed by a 0, which means the classification
of that document is 0.
For this dataset, that means it's not about
corn.
You can see this is regular text except these
"\n"s, those are new lines.
If we just had a regular newline in a string,
then Weka would get confused when you tried
to load in that ARFF file.
It would think that the continuation of the
line was the next instance.
So we just encode newlines as "\n".
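That encoding step is trivial to sketch (a hypothetical helper, not Weka's own code):

```python
def encode_newlines(text):
    # Replace each real newline with the two characters backslash + n,
    # so the whole document fits on one physical line in the ARFF file.
    return text.replace("\n", "\\n")

doc = "Grain prices rose.\nTraders reacted quickly."
encoded = encode_newlines(doc)
print(encoded)  # Grain prices rose.\nTraders reacted quickly.
```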
This is one instance, classified as 0.
The next thing starts with a quote.
This is the string, and it ends here.
That's a 1 document; this document *is* about
corn.
That doesn't necessarily mean it just contains
the word "corn", it means that a human has
decided whether this document is about corn
or not about corn.
I don't know a lot about corn, but an expert
will have made that decision.
These are the documents, and, like I said,
there are 1,554 of them.

Each instance contains this extensive string.
If I now go back and have a look, well, I've
got really high accuracy, 97%, which sounds
really good.
Unfortunately, though, when I look at this,
the documents that are about corn, the "1"
documents--there's only 24 of them--and the
accuracy there is 15 correct out of 24, which
is not so good.
For the "0" documents, the ones which aren't
about corn, then I've got 573 correct out
of 580, which is very good.
When I combine those two, that's what gives
me this rather high-looking 97% accuracy.
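The arithmetic behind that headline number is worth spelling out, using the per-class counts just described:

```python
# Per-class counts from the lecture's confusion matrix.
corn_correct, corn_total = 15, 24      # "1" documents: about corn
other_correct, other_total = 573, 580  # "0" documents: not about corn

overall = (corn_correct + other_correct) / (corn_total + other_total)
corn_accuracy = corn_correct / corn_total

print(round(overall * 100, 1))        # 97.4 -- the impressive-looking figure
print(round(corn_accuracy * 100, 1))  # 62.5 -- on the class we care about
```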
When I look at the tree--here it is--it's
little bit more complicated.
We've got a branch on the word "corn".
If the document contains the word "corn",
then we're going to look for the word "planted".

If it contains the word "planted", then it's
a "0".
If it doesn't contain the word "planted",
then it's a "1", that is, it's about corn.
Down here, we're looking for the word "1986/87",
which is a very strange thing to be looking for.
We're looking for the word "maize".
Here we're looking for the word "the".
This tree doesn't look like it makes a huge
amount of sense.
And yet it does get 97% accuracy.
This is what we've done here.
We looked at this dataset.
We applied the StringToWordVector filter.
We just had a look, and we found that there
were 2,234 attributes.
Then we used the filtered classifier to get
97% classification accuracy, but we discovered
that the accuracy on the 24 corn-related documents
was only 62%.
That's a shame, because those are probably
the documents we're most interested in.
These are the ones that aren't about corn,
and we get very high accuracy on those.
Which makes you wonder whether the overall
classification accuracy is really the right
thing to optimize.
This is what we've done in this lesson.
We looked at string attributes.
We looked at the StringToWordVector filter,
which creates one attribute for each different
word.
We looked at the options for the StringToWordVector--no,
we didn't look at the options.
Let's have a really quick look back in Weka
here at the options for the StringToWordVector
filter.
Suffice it to say, there are a lot of options:
it's a pretty comprehensive kind of filter.
We'll look at those options in a subsequent
lesson.
We looked at J48 models for text data.
J48 is not necessarily a very sensible learning
scheme to use on text data.
Then we looked at the overall classification
accuracy.
Is it really what we care about? Perhaps not.
That's what we're going to look at in the
next lesson.
Before you do that, you should do the activity,
which will get you to do a little more work
on classifying real documents.
Good luck with that, and we'll see you in
the next lesson.
Bye for now!
