
Hi! You probably learned a bit about flowers if you did the activity
associated with the last lesson.
Now, we're going to actually build
a classifier: this is Lesson 1.4,
Building a classifier.
We're going to use a
system called J48—I'll
tell you why it's called J48 in a minute—
to analyze the glass dataset
that we looked at in the last lesson.
I've got the glass dataset open here.
I'm going to go to the Classify panel.
I choose a classifier here.
There are different kinds of
classifiers. Weka has
bayes classifiers, functions classifiers,
lazy classifiers, meta classifiers, and so on.
We're going to use a tree classifier. J48 is
a tree classifier. I'm going to open trees and click
J48.
Here is the J48 classifier.
Let's run it. If we just press start,
we've got the dataset, we've got the classifier,
and lo and behold,
it's done it.
It's a bit of an anticlimax, really.
Weka makes things very easy
for you to do.
The problem is understanding what
it is that you have done.
Let's take a look.
Here is some information
about the dataset,
the glass dataset. The number of
instances and attributes.
Then it's printed out a
representation of a tree here.
We'll look at these trees later on,
but just note that this tree has
30 leaves and 59 nodes altogether.
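As an aside, those two numbers are consistent with each other: since the glass attributes are all numeric, J48's splits here are binary, and a strictly binary tree with n leaves has 2n - 1 nodes in total, so 30 leaves gives 59 nodes. A quick sketch of the counting (plain Python, not Weka):

```python
def count_nodes(tree):
    """Count (leaves, total_nodes) of a tree given as nested tuples.

    A leaf is any non-tuple value; an internal node is a tuple of subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 1  # a leaf counts as one leaf and one node
    leaves = nodes = 0
    for subtree in tree:
        sub_leaves, sub_nodes = count_nodes(subtree)
        leaves += sub_leaves
        nodes += sub_nodes
    return leaves, nodes + 1  # +1 for this internal node

# For a strictly binary tree, total nodes = 2 * leaves - 1,
# matching J48's report of 30 leaves and 59 nodes.
tiny = (("a", "b"), ("c", ("d", "e")))  # a made-up tree with 5 leaves
print(count_nodes(tiny))  # (5, 9)
```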
The overall accuracy
is 66.8%.
So, it's done pretty well.
Down at the bottom,
we've got a confusion matrix.
Remember there were about seven different
kinds of glass.
This is
building windows made of float glass.
You can see that 50 of these
have been classified as 'a', which is
correct.
15 of them have been classified as 'b',
which is building windows non-float glass,
so those are errors,
and 3 have been classified as 'c',
and so on.
This is a confusion matrix.
Most of the weight is down the
main diagonal, which
we like to see because that
indicates correct classifications.
Everything off the main diagonal
indicates a misclassification.
That's the confusion matrix.
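The accuracy figure and the confusion matrix are directly connected: the correct classifications are exactly the entries on the main diagonal, so accuracy is the diagonal sum divided by the total. A small sketch with made-up predictions for three of the glass classes (plain Python, not Weka output):

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of instances on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical predictions, just to illustrate the bookkeeping:
labels    = ["a", "b", "c"]
actual    = ["a", "a", "a", "b", "b", "c", "c", "c"]
predicted = ["a", "a", "b", "b", "c", "c", "c", "a"]
m = confusion_matrix(actual, predicted, labels)
print(m)            # [[2, 1, 0], [0, 1, 1], [1, 0, 2]]
print(accuracy(m))  # 0.625
```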
Let's investigate this a bit further.
We're going to open a configuration
panel for J48.
Remember I chose it
by clicking the Choose button.
Now, if I click J48 here in this menu,
I get a configuration panel, which
gives a bunch of parameters.
I'm not going to
really talk about these parameters.
Let's just look at one of them, the unpruned
parameter, which by default is false.
What we've just done is to build a
pruned tree, because unpruned is false.
We can change this to
make it true and build an unpruned tree.
We've changed the configuration.
We can run it again.
It just ran again, and now we have
a potentially different result.
Let's just have a look. We have
67% correct classification.
What did we have before?
These are the runs.
This is the previous run,
and there we had
66.8%.

Now, in this run that we've just done with
the unpruned tree, we've got 67% accuracy,
and the tree is the same size.
That's one option.
I'm just going to look at another option,
and then we'll look at some trees.
I'm going to click the
configuration panel again,
and I'm going to change
the minNumObj parameter.
What is that?
That is the minimum number of
instances per leaf.
I'm going to change that from 2
up to 15
to have larger leaves.
These are the leaves of the tree here,
and these numbers in
brackets are the number of
instances that get to the leaf. When
there are two numbers, this means that one
incorrectly classified instance
got to this leaf and five correctly
classified instances got there.
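The effect of minNumObj can be sketched without Weka: when the tree builder considers split thresholds on a numeric attribute, any threshold that would leave fewer than the minimum number of instances on either side is simply ruled out. A simplified illustration (plain Python; J48's real split selection also uses information gain, which is omitted here, and the Ba values are made up):

```python
def candidate_thresholds(values, min_leaf):
    """Midpoints between consecutive distinct values that leave at
    least min_leaf instances on each side of the split."""
    values = sorted(values)
    thresholds = []
    for i in range(len(values) - 1):
        left_count = i + 1
        right_count = len(values) - left_count
        if (values[i] != values[i + 1]
                and left_count >= min_leaf
                and right_count >= min_leaf):
            thresholds.append((values[i] + values[i + 1]) / 2)
    return thresholds

barium = [0.0, 0.0, 0.1, 0.2, 0.9, 1.5, 1.6, 2.2]  # invented Ba values
print(candidate_thresholds(barium, min_leaf=2))  # several legal splits
print(candidate_thresholds(barium, min_leaf=4))  # stricter: fewer splits
```

Raising the minimum prunes away candidate splits, which is why the resulting trees get smaller.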

You can see that all of
these leaves are pretty small,
sometimes with just two or three
instances, although here is one with 31.
We've now constrained this number:
in the tree that's about to be generated,
this number is always going to be
15 or more. Let's run it again.
Now we've got
a worse result, 61%
correct classification, but a much
smaller tree,
with only eight leaves.
Now, we can visualize this tree.
If I right click
on the line—these are the lines that describe
each of the runs that we've done, and this
is the third run—if I right
click on that, I get a little menu,
and I can visualize the tree.
There it is. If I right click on empty
space, I can fit this to the screen.
This is the decision tree.
This says first look at the
Barium (Ba) content.

If it's large, then it must be headlamps.
If it's small, then look at Magnesium (Mg).
If that's small, then let's look at potassium (K),
and if that's small, then we've got tableware.
That sounds like a pretty good thing to me;
I don't want too much potassium in my tableware.
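The path just read off the visualization is nothing more than nested if statements. A hand-transcribed sketch of that fragment of the tree (plain Python; the thresholds and the two placeholder branches are illustrative, not the values J48 actually learned from the glass data):

```python
def classify(ba, mg, k):
    """A hand-coded fragment of a J48-style glass tree.

    Thresholds (0.3, 2.4, 0.6) are made up for illustration.
    """
    if ba > 0.3:                      # high barium content
        return "headlamps"
    if mg <= 2.4:                     # low magnesium
        if k <= 0.6:                  # low potassium
            return "tableware"
        return "containers"           # placeholder for the rest of the subtree
    return "building windows"         # placeholder for the rest of the subtree

print(classify(ba=1.0, mg=3.0, k=0.1))  # headlamps
print(classify(ba=0.0, mg=1.0, k=0.1))  # tableware
```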
This is a visualization of the tree
and it's the same tree that you
can see by looking here.
This is a different representation
of the same tree.
I'll just show you one more
thing about this configuration panel,
the More button. This
gives you more information
about the classifier,
about J48.
It's always useful to look at that to
see where these classifiers have come from.
In this case,
let me explain why it's called
J48. It's based on a famous
system that's called C4.5,
which was described in a book.
The book is referenced here.
In fact, I think I've got it
on my shelf here. This book here,
"C4.5: Programs for Machine Learning"
by an Australian
computer scientist called Ross Quinlan.
He started out with a system called ID3—
I think that might have
been in his PhD thesis—
and this kind of morphed through various
versions into C4.5, which became quite famous;
the book came out, and so on.
He continued to work on this system.
It went up to C4.8,
and then he went commercial. Up until
then, these were all open source
systems.
When we built Weka,
we took the latest version
of C4.5,
which was C4.8,
and we rewrote it. Weka's written
in Java, so we called it J48.
Maybe it's not a
very good name,
but that's the name that stuck.
There's a little bit of history for you.
We've talked about classifiers in Weka.

I've shown you where you find the
classifiers. We classified the glass
dataset. We looked at how to interpret
the output from J48, in
particular the confusion matrix.
We looked at the configuration panel for J48.
We looked at a couple of options: pruned
versus unpruned trees and the option to
avoid small leaves.
I told you how
J48 really corresponds to the
machine learning system that
most people know as C4.5.
C4.5 and C4.8 were really pretty similar,
so we just talk
about J48 as if it's synonymous with C4.5.
You can read about this in the book—
Section 11.1 about Building a
decision tree and Examining the output.
Now, off you go, and do the
activity associated with this lesson.
See you again soon!
