
English: 
Hi! Welcome back to Data Mining with Weka.
This is Class 2.
In the first class, we downloaded Weka and
we looked around the Explorer and a few datasets;
we used a classifier, the J48 classifier;
we used a filter to remove attributes and
to remove some instances;
we visualized some data—we visualized classification
errors on a dataset;
and along the way we looked at a few datasets,
the weather data, both the nominal and numeric
version, the glass data, and the iris dataset.
This class is all about evaluation.

English: 
In Lesson 1.4, we built a classifier using J48.
In this first lesson of the second class,
we're going to see what it's like to actually
be a classifier ourselves.
Then, later on in subsequent lessons in this
class, we're going to look at more about evaluation,
training and testing, baseline accuracy and
cross-validation.
First of all, we're going to see what it's
like to be a classifier.
We're going to construct a decision tree ourselves,
interactively.
I'm going to just open up Weka here.
The Weka Explorer.
I'm going to load the segment-challenge dataset.
segment-challenge.arff -- that's the one I
want.
We're going to look at this dataset.
Let's first of all look at the class.

English: 
The class values are brickface, sky, foliage,
cement, window, path, and grass.
It looks like this is kind of an image analysis
dataset.
When we look at the attributes, we see things
like the centroid of columns and rows, pixel
counts, line densities, means of intensities,
and various other things.
Saturation, hue, and the class, as I said
before, is different kinds of texture: bricks,
sky, foliage, and so on.
That's the segment challenge dataset.
I'm going to select the user classifier.
The user classifier is a tree classifier.
We'll see what it does in just a minute.
That's the user classifier.
Before I start, this is really quite important.

English: 
I'm going to use a supplied test set.
I'm going to set the test set, which is used
to evaluate the classifier, to segment-test.
The training set is segment-challenge, the
test set is segment-test.
Now we're all set.
I'm going to start the classifier.
What we see is a window with two panels: the
Tree Visualizer and the Data Visualizer.
Let's start with the Data Visualizer.
We looked at visualization in the last class,
how you can select different attributes for
the x and y.
I'm going to plot the region-centroid-row
against the intensity-mean.
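The idea of a supplied test set is that training and evaluation use disjoint data. As a minimal sketch of that idea (not Weka code), here is a ZeroR-style majority-class baseline "trained" on one set of labels and scored on a separate one; the tiny label lists are invented for illustration:

```python
# Hedged sketch: evaluate on a supplied test set, never on the training
# data. A majority-class baseline stands in for a real classifier.
# The label lists below are made up for illustration.
from collections import Counter

train_labels = ["sky", "sky", "path", "sky", "grass"]
test_labels  = ["sky", "path", "path", "grass"]

# "Train": find the most common class in the training set.
majority = Counter(train_labels).most_common(1)[0][0]   # -> "sky"

# "Test": accuracy is measured on the separate test set only.
accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
print(majority, accuracy)   # sky 0.25
```

The same baseline scores 0.6 on its own training labels, which is exactly why the held-out test set matters.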

English: 
That's the plot I get.
Now, we're going to select a class.
I'm going to
select Rectangle.
If I draw out with my mouse a rectangle here,
I'm going to have a rectangle that's pretty
well pure reds, as far as I can see.
I'm going to submit this rectangle.
You can see that that area has gone and the
picture has been rescaled.

English: 
I'm building up a tree here.
If I look at the Tree Visualizer, I've got
a tree.
We've split on these two attributes, region-centroid-row
and intensity-mean.
Here we've got sky, these are all sky classes.
Here we've got a mixture of brickface, foliage,
cement, window, path, and grass.
We're kind of going to build up this tree.
What I want to do is to take this node and
refine it a bit more.
Here is the Data Visualizer again.
I'm going to select a rectangle containing
these items here, and submit that.
They've gone from this picture.
You can see that here, I've created this split,
another split on region-centroid-row and
intensity-mean, and here, this is almost all
path.
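Each rectangle drawn in the Data Visualizer amounts to a pair of axis-parallel threshold tests, so the tree being built can be sketched as nested ifs. The attribute names come from the dataset, but the threshold values and returned regions below are invented for illustration:

```python
# Hedged sketch: a user-built tree of axis-parallel splits written as
# nested threshold tests. Thresholds (155, 20, 10) are hypothetical.
def classify(region_centroid_row, intensity_mean):
    if region_centroid_row < 155 and intensity_mean > 20:
        return "sky"        # the first, nearly pure rectangle
    if region_centroid_row > 155 and intensity_mean < 10:
        return "path"       # the second, almost-pure split
    return "mixture"        # the remaining node still needs refining

print(classify(100, 50))   # sky
```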

English: 
233 path instances, and then a mixture here.
This is a pure node we've got over there.
This is almost a pure node.
This is the one I want to work on.
I'm going to cover some of those instances
now.
Let's take this lot here and submit that.
Then I'm going to take this lot here and submit
that.
Maybe I'll take those ones there and submit
that.
This little cluster here seems pretty uniform.
Submit that.
I haven't actually changed the axes, but,
of course, at any time, I could change these
axes to better separate the remaining classes.
I could kind of mess around with these.
Actually, a quick way to do it is to click
here on these bars.
Left click for x and right click for y.

English: 
I can quickly explore different pairs of axes
to see if I can get a better split.
Here's the tree I've created.
I'm going to fit it to the screen.
It looks like this.
You can see that we have successively elaborated
down this branch here.
When I finish with this, I can accept the
tree.
Actually, before I do that, let me just show
you that we were selecting rectangles here,
but I've got other things I can select: a
polygon or a polyline.
If I don't want to use rectangles, I can use
polygons or polylines.
If you like, you can experiment with those
to select different shaped areas.
There's an area I've got selected; I just can't
quite finish it off.

English: 
Alright, I right clicked to finish it off.
I could submit that.
I'm not confined to rectangles;
I can use different shapes.
I'm not going to do that.
I'm satisfied with this tree for the moment.
I'm going to accept the tree.
Once I do this, there is no going back, so
you want to be sure.
If I accept the tree, "Are you sure?" -- yes.
Here, I've got a confusion matrix, and I can
look at the errors.
My tree classifies 78% of the instances correctly,
nearly 79% correctly, and 21% incorrectly.
That's not too bad, especially considering
how quickly I built that tree.
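Accuracy comes straight from the confusion matrix: correct classifications sit on the diagonal. A minimal sketch, with made-up 3x3 counts (the real segment data has seven classes):

```python
# Hedged sketch: accuracy from a confusion matrix. The counts below
# are invented; rows are actual classes, columns are predicted ones.
matrix = [
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 45],
]
correct = sum(matrix[i][i] for i in range(len(matrix)))   # diagonal
total = sum(sum(row) for row in matrix)
print(correct / total)
```

With these counts, 135 of 156 instances are on the diagonal, about 86.5% correct.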
It's over to you now.
I'd like you to play around and see if you
can do better than this by spending a little
bit longer on getting a nice tree.
I'd like you to reflect on a couple of things.
First of all, what strategy you're using to
build this tree.

English: 
Basically, we're covering different regions
of the instance space, trying to get pure
regions to create pure branches.
This is kind of like a bottom-up covering
strategy.
We cover this area and this area and this
area.
That's not how J48 works.
When it builds its trees, it tries to make a
judicious split through the whole dataset.
At the very top level, it'll split the entire
dataset into two in a way that doesn't necessarily
split out particular classes, but makes it
easier, when it starts working on each half
of the dataset, to split further in a top-down
manner and produce an optimal tree.
It will produce trees much better than the
one that I just produced with the user classifier.
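That top-down idea can be sketched in a few lines: consider every candidate threshold over the whole dataset and keep the one that leaves the two halves purest. This toy version uses one numeric attribute and Gini impurity as the purity score; it is only an illustration of the strategy, not J48's actual algorithm (J48 uses information gain, handles many attributes, and prunes):

```python
# Hedged sketch of top-down splitting: score every candidate threshold
# on the whole dataset and keep the one with the lowest weighted Gini
# impurity. Toy one-attribute data; not J48's real algorithm.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    best = None
    for t in sorted(set(values))[1:]:              # candidate thresholds
        left  = [c for v, c in zip(values, labels) if v < t]
        right = [c for v, c in zip(values, labels) if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    return best[1]

values = [1, 2, 3, 10, 11, 12]
labels = ["sky", "sky", "sky", "path", "path", "path"]
print(best_split(values, labels))   # 10 -- a perfectly clean split
```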
I'd also like you to reflect on what it is
we're trying to do here.
Given enough time, you could produce a 'perfect'
tree for the dataset, but don't forget that

English: 
the dataset that we've loaded is the training
dataset.
We're going to evaluate this tree on a different
dataset, the test dataset, which hopefully
comes from the same source, but is not identical
to the training dataset.
We're not trying to precisely fit the training
dataset;
we're trying to fit it in a way that generalizes
the kinds of patterns exhibited in the dataset.
We're looking for something that will perform
well on the test data.
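The extreme case makes the point: a classifier that simply memorizes the training instances fits them perfectly but generalizes not at all. A minimal sketch with invented values:

```python
# Hedged sketch of why evaluation needs a separate test set: a
# classifier that memorizes the training data is perfect on it but
# useless on unseen instances. All values are invented.
train = {(1, 2): "sky", (3, 4): "path", (5, 6): "grass"}
test  = {(1, 2): "sky", (7, 8): "path"}

def memorize(x):
    return train.get(x, "unknown")      # no generalization at all

train_acc = sum(memorize(x) == y for x, y in train.items()) / len(train)
test_acc  = sum(memorize(x) == y for x, y in test.items()) / len(test)
print(train_acc, test_acc)   # 1.0 0.5
```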
That highlights the importance of evaluation
in machine learning.
That's what this class is going to be about,
different ways of evaluating your classifier.
That's it.
There's some information in the course text
about the user classifier, which you can read
if you like.
Please go on and do the activity associated
with this lesson and produce your own classifier.
Hopefully, you'll be able to do much better
than me given 5-10 minutes.

English: 
Good luck!
