
Hello, again, and welcome to Data Mining with
Weka, back here in New Zealand. In this class,
Class 4, we're going to look at some pretty
cool machine learning methods.
We're going to look at linear regression,
classification by regression, logistic regression,
support vector machines, and ensemble learning.
The last few of these are contemporary methods,
which haven't been around very long. They
are kind of state-of-the-art machine learning
methods.
Remember, there are 5 classes in this course,
so next week is Class 5, the last class. We'll
be tidying things up and summarizing things
then. You're well over halfway through; you're
doing well. Just hang on in there.

In this lesson, we're going to start by looking
at classification boundaries for different
machine learning methods. We're going to use
Weka's Boundary Visualizer, which is another
Weka tool that we haven't encountered yet.
I'm going to use a 2-dimensional dataset.
I've prepared iris.2d.arff. It's a
2-dimensional version of the iris dataset.
I took the regular iris dataset and deleted
a couple of attributes -- sepallength and
sepalwidth -- leaving me with this 2D dataset,
and the class.
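The attribute-deletion step can be sketched in a few lines of plain Python (in Weka itself you would use a filter; the rows below are illustrative iris-style values in the usual attribute order):

```python
# Illustrative sketch: derive the 2-D iris data by dropping sepallength
# and sepalwidth, keeping petallength, petalwidth, and the class.
# The example rows are just sample values, not the full dataset.

ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
KEEP = ["petallength", "petalwidth", "class"]

def project(rows, attributes=ATTRIBUTES, keep=KEEP):
    """Keep only the named attributes, in the order given by `keep`."""
    idx = [attributes.index(a) for a in keep]
    return [tuple(row[i] for i in idx) for row in rows]

rows = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (5.9, 3.0, 4.2, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]
print(project(rows)[0])  # (1.4, 0.2, 'Iris-setosa')
```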
We're going to look at that using the Boundary
Visualizer. You get that from this Visualization
menu on the Weka Chooser. There are a lot
of tools in Weka, and we're just going to
look at this one here, the Boundary Visualizer.
I'm going to open the same file in the Boundary
Visualizer, the 2-dimensional iris dataset.
Here we've got a plot of the data.
You can see that we're plotting petalwidth on the
y-axis against petallength on the x-axis.
This is a picture of the dataset with the
3 classes setosa in red, versicolor in green,
and virginica in blue.
I'm going to choose a classifier. Let's begin
with the OneR classifier, which is in rules.
I'm going to select "plot training data" and just let it rip.
The color diagram shows the decision boundaries, with the training data superimposed on it. 
Let's look at what
OneR does to this dataset in the Explorer.
OneR has chosen to split on petalwidth.

If it's less than a certain amount, we get a
setosa; if it's intermediate, we get a versicolor;
and if it's greater than the upper boundary,
we get a virginica.
It's the same as what's being shown here.
We're splitting on petalwidth. If it's less
than a certain amount, we get a setosa; in
the middle, a versicolor; and at the top,
a virginica.
This is a spatial representation of the decision
boundary that OneR creates on this dataset.
That's what the Boundary Visualizer does;
it draws decision boundaries.
It shows here that OneR chooses an attribute
-- in this case petalwidth -- to split on.
It might have chosen petallength, in which
case we'd have vertical decision boundaries.
Either way, we're going to get stripes from
OneR.
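The stripes follow directly from how OneR works, which can be sketched like this (the cut points 0.8 and 1.7 are hypothetical stand-ins, not Weka's actual output):

```python
# Minimal sketch of a OneR-style rule on one numeric attribute.
# OneR picks a single attribute (here petalwidth) and divides its range
# into intervals; the other attribute is ignored entirely, which is why
# the decision regions come out as stripes. Thresholds are made up.

def one_r_predict(petallength, petalwidth):
    """Predict from petalwidth alone; petallength never matters."""
    if petalwidth < 0.8:
        return "Iris-setosa"
    elif petalwidth < 1.7:
        return "Iris-versicolor"
    else:
        return "Iris-virginica"

# Varying petallength never changes the prediction:
print({one_r_predict(x, 0.5) for x in (1.0, 4.0, 6.0)})  # {'Iris-setosa'}
```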
I'm going to go ahead and look at some boundaries
for other schemes.
Let's look at IBk, which is a "lazy" classifier.
That's the instance-based learner we looked
at in the last class.
I'm going to run that.

Here we get a different kind of pattern.
I'll just stop it there.
We've got diagonal lines.
Down here are the setosas underneath this
diagonal line; the versicolors in the intermediate
region; and the virginicas, by and large,
in the top right-hand corner.
Remember what IBk does.
It takes a test instance.
Let's say we had an instance here, just on
this side of the boundary, in the red.
Then it chooses the nearest instance to that.
That would be this one, I guess.
That's a bit nearer than this one here.
This is a red point.
If I were to cross over the boundary here,
it would choose a green class, because this
would be the nearest instance then.
If you think about it, this boundary goes
halfway between this nearest red point and
this nearest green point.
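The nearest-instance rule just described can be sketched in a few lines (the training points are made-up (petallength, petalwidth) values, not the real iris data):

```python
import math

# Sketch of the 1-NN rule: classify a test point by the class of its
# single nearest training instance. The boundary between two classes
# therefore falls halfway between the nearest red and green points --
# the perpendicular bisector of the segment joining them.

train = [((1.4, 0.2), "red"), ((4.5, 1.5), "green"), ((6.0, 2.2), "blue")]

def nn_predict(point, train=train):
    """Return the class of the nearest training instance."""
    return min(train, key=lambda t: math.dist(point, t[0]))[1]

# The red/green midpoint is (2.95, 0.85); points either side of it flip class.
print(nn_predict((2.8, 0.8)))  # red
print(nn_predict((3.1, 0.9)))  # green
```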

English: 
Similarly, if I take a point up here, I guess
the two nearest instances are this blue one
and this green one.
This blue one is closer.
In this case, the boundary goes along this
straight line here.
You can see that it's not just a single line:
this is a piecewise linear line, so this part
of the boundary goes exactly halfway between
these two points quite close to it.
Down here, the boundary goes exactly halfway
between these two points.
It's the perpendicular bisector of the line
joining these points.
So we get a piecewise linear boundary made
up of little pieces.
It's kind of interesting to see what happens
if we change the parameter: if we look at,
say, 5 nearest neighbors instead of just 1.
Now we get a slightly blurry picture, because
whereas down here in the pure red region the
5 nearest neighbors to a point are all red
points, if we look in the intermediate region
here, then the nearest neighbors to a point
here -- this is going to be in the 5, and
this might be another one in the 5, and there
might be a couple more down here in the 5.
So we get an intermediate color here, and
IBk takes a vote.
If we had 3 reds and 2 greens, then we'd be
in the red region and that would be depicted
as this darker red here.
If it had been the other way round with more
greens than reds, we'd be in the green region.
So we've got a blurring of these boundaries.
These are probabilistic descriptions of the
boundary.
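The vote that produces those blurred, probabilistic boundaries can be sketched like this (the training points are made-up illustrative values):

```python
import math
from collections import Counter

# Sketch of IBk-style k-nearest-neighbor voting. With k > 1, a point near
# a boundary gets a mixed vote (e.g. 3 reds vs. 2 greens), which is what
# the Boundary Visualizer draws as intermediate, blurred colors.

train = [((1.0, 1.0), "red"), ((1.2, 0.9), "red"), ((1.1, 1.3), "red"),
         ((2.0, 1.0), "green"), ((2.1, 1.2), "green")]

def knn_vote(point, k, train=train):
    """Return per-class vote fractions among the k nearest instances."""
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))[:k]
    counts = Counter(cls for _, cls in nearest)
    return {cls: n / k for cls, n in counts.items()}

print(knn_vote((1.5, 1.0), k=5))  # {'red': 0.6, 'green': 0.4}
```

Larger k averages over more neighbors, so the vote fractions change more gradually as the test point moves, giving the blurrier boundaries seen with k = 20.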
Let me just change k to 20 and see what happens.
Now we get the same shape, but even more blurry
boundaries.
The Boundary Visualizer reveals the way that
machine learning schemes are thinking, if
you like.

It reveals the internal representation of the dataset, and helps you think about the sorts of things that machine learning methods do.
Let's choose another scheme.
I'm going to choose NaiveBayes.
When we talked about NaiveBayes, we only talked
about discrete attributes.
With continuous attributes, I'm going to choose
a supervised discretization method.
Don't worry about this detail, it's the most
common way of using NaiveBayes with
numeric attributes.
Let's look at that picture.
This is interesting.
When you think about NaiveBayes, it treats
each of the two attributes as contributing
equally and independently to the decision.
It sort of decides what it should be along
this dimension and decides what it should
be along this dimension and multiplies the
two together.

Remember the multiplication that went on in
NaiveBayes.
When you multiply these things together, you
get a checkerboard pattern of probabilities.
That's because the attributes are being treated
independently.
That's a very different kind of decision boundary
from what we saw with instance-based learning.
That's what's so good about the Boundary Visualizer:
it helps you think about how things are working
inside.
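That multiplication can be sketched as follows. All the probabilities here are made-up numbers; the point is only that each attribute contributes an independent factor, so the decision regions come out as axis-aligned cells:

```python
# Sketch of the multiplication inside NaiveBayes after discretization.
# Each attribute contributes an independent factor P(interval | class);
# the score is the product, giving the checkerboard pattern.
# Intervals are indexed 0 (low), 1 (mid), 2 (high); numbers are made up.

p_len = {"setosa": [0.9, 0.05, 0.05], "versicolor": [0.1, 0.8, 0.1],
         "virginica": [0.05, 0.15, 0.8]}   # P(petallength interval | class)
p_wid = {"setosa": [0.9, 0.05, 0.05], "versicolor": [0.1, 0.8, 0.1],
         "virginica": [0.05, 0.15, 0.8]}   # P(petalwidth interval | class)
prior = {"setosa": 1 / 3, "versicolor": 1 / 3, "virginica": 1 / 3}

def nb_predict(len_bin, wid_bin):
    """Multiply the prior by the two independent per-attribute factors."""
    scores = {c: prior[c] * p_len[c][len_bin] * p_wid[c][wid_bin]
              for c in prior}
    return max(scores, key=scores.get)

print(nb_predict(0, 0))  # setosa
print(nb_predict(1, 1))  # versicolor
```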
I'm going to do one more example.
I'm going to do J48, which is in trees.
Here we get this kind of structure.
Let's take a look at what happens in the Explorer
if we choose J48.
We get this little decision tree: split first
on petalwidth; if it's less than 0.6 it's
a setosa for sure.

Then split again on petalwidth; if it's greater
than 1.7, it's a virginica for sure.
Then, in between, split on petallength and
then again on petalwidth, getting a mixture
of versicolors and virginicas.
We split first on petalwidth; that's this
split here.
Remember the vertical axis is the petalwidth
axis.
If it's less than a certain amount, it's a
setosa for sure.
Then we split again on the same axis.
If it's greater than a certain amount, it's
a virginica for sure.
If it's in the intermediate region, we split
on the other axis, which is petallength.
Down here, it's a versicolor for sure, and
here we're going to split again on the petalwidth attribute.
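The tree just described can be sketched as nested ifs. The outer cut points 0.6 and 1.7 come from the lesson itself; the inner ones (4.9 and 1.5) are illustrative stand-ins for whatever J48 actually learns:

```python
# Sketch of the J48 decision tree for the 2-D iris data. Splits on
# petalwidth give horizontal boundaries, splits on petallength vertical
# ones, so the decision regions are axis-aligned rectangles.

def j48_predict(petallength, petalwidth):
    if petalwidth < 0.6:
        return "Iris-setosa"        # bottom stripe: setosa for sure
    if petalwidth > 1.7:
        return "Iris-virginica"     # top stripe: virginica for sure
    # Intermediate band: split on petallength, then on petalwidth again.
    # These inner thresholds are hypothetical, not Weka's actual output.
    if petallength <= 4.9:
        return "Iris-versicolor"
    return "Iris-virginica" if petalwidth <= 1.5 else "Iris-versicolor"

print(j48_predict(1.4, 0.2))  # Iris-setosa
print(j48_predict(4.2, 1.3))  # Iris-versicolor
print(j48_predict(6.0, 2.0))  # Iris-virginica
```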

Let's change the minNumObj parameter, which
controls the minimum size of the leaves.
If we increase that, we're going to get a
simpler tree.
We discussed this parameter in one of the
lessons of Class 3.
If we run now, then we get a simpler version,
corresponding to the simpler rules we get
with this parameter set.
Or we can set the parameter to a higher value,
say 10, and run it again.
We get even simpler rules, very similar to
the rules produced by OneR.
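The effect of a minimum-leaf-size constraint like minNumObj can be sketched like this (the data and threshold are made up; real J48 pruning is more involved, but the idea is the same):

```python
from collections import Counter

# Sketch of how a minNumObj-style constraint simplifies a tree: a split
# is only kept if every branch retains at least min_num_obj instances;
# otherwise the node collapses to a majority-class leaf.
# Data is made-up (petalwidth, class) pairs.

def split_or_leaf(data, threshold, min_num_obj):
    left = [c for v, c in data if v < threshold]
    right = [c for v, c in data if v >= threshold]
    if len(left) < min_num_obj or len(right) < min_num_obj:
        # Refuse the split: one leaf, majority class.
        return ("leaf", Counter(c for _, c in data).most_common(1)[0][0])
    return ("split", threshold)

data = [(0.2, "setosa")] * 2 + [(1.3, "versicolor")] * 8

print(split_or_leaf(data, 0.6, min_num_obj=2))   # ('split', 0.6)
print(split_or_leaf(data, 0.6, min_num_obj=10))  # ('leaf', 'versicolor')
```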
We've looked at classification boundaries.
Classifiers create boundaries in instance
space and different classifiers have different
capabilities for carving up instance space.
That's called the "bias" of the classifier
-- the way in which it's capable of carving
up the instance space.

We looked at OneR, IBk, NaiveBayes, and J48,
and found completely different biases, completely
different ways they carve up the instance
space.
Of course, this kind of visualization is restricted
to numeric attributes and 2-dimensional plots,
so it's not a very general tool, but it certainly
helps you think about these different classifiers.
You can read about classification boundaries
in Section 17.3 of the course text.
Now off you go and do the activity associated
with this lesson.
Good luck! We'll see you later.
Bye!
