
Hello again! This is Lesson 3.6, and we're
still talking about clustering.
In the last lesson we looked at some different
clustering algorithms, and each of them had
a different metric.
SimpleKMeans talked about the total squared
distance of each instance from its cluster center.
That's not necessarily a good way of evaluating
clustering, and it certainly makes it difficult
to compare the results of different clustering
algorithms.
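As a concrete illustration of that metric, here is a minimal pure-Python sketch (my own function and variable names, not Weka's code) of the within-cluster sum of squared errors that SimpleKMeans reports:

```python
# Sketch of the SimpleKMeans metric: the total squared distance of each
# instance from the centre of the cluster it was assigned to.

def within_cluster_sse(instances, assignments, centres):
    """Sum of squared Euclidean distances from each instance to its centre."""
    total = 0.0
    for point, cluster in zip(instances, assignments):
        centre = centres[cluster]
        total += sum((x - c) ** 2 for x, c in zip(point, centre))
    return total

# Toy data: two obvious clusters in one dimension.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
assign = [0, 0, 1, 1]
centres = [(0.5,), (10.5,)]
print(within_cluster_sse(points, assign, centres))  # 1.0
```

Note that this number depends on the scale of the attributes and on the number of clusters, which is exactly why it is hard to use it to compare different clusterings.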
One thing we can do in Weka is to visualize
the clusters.
Over here in Weka I've got the iris data open.

I've got here the SimpleKMeans method with
3 clusters selected, and I'm going to run
that.
On the right-click menu, I'm going to visualize
the cluster assignments.
Here they are.
This would make most sense if we plot the
cluster against the instance number.
Remember the iris data: the first 50 instances
are one kind of iris, and the next 50 are
another, and the third 50 are another.
Well, this looks too good to be true.
Here the first 50 are in one cluster, the
second 50 are in another cluster, and the
third 50 are in another cluster.
In data mining, if things look too good to
be true, they probably are.
The problem here, when you think about it,
is that one of the attributes is the "class",
and it's not really fair to include the class
when we're doing the clustering.
On the clustering panel we can ignore attributes.

I'm going to ignore the "class" attribute
and try again.
Now I've got 61 instances in one cluster and
50 in another and 39 in another.
If I visualize the cluster assignments and
choose the cluster here, I get a different
picture.
You can see that the first cluster looks pretty
good, but there are some errors here, some
green instances have crept into this second cluster.
For the last 50 items of the dataset, which
all belong to one class of iris, we've got
a whole bunch of stuff coming in here from
another cluster.
That's not looking so good.
How do you tell which instances are in which
cluster? To do that, there is an unsupervised
attribute filter called "AddCluster".

In this filter, we can specify a clusterer.
Here we specified SimpleKMeans, and I'll choose
3 clusters again.
I'm going to apply the filter, and that's
going to add a new attribute.
Let's do this.
You can see that we've got a new attribute.
It's called "cluster", attribute 6.
If we edit this dataset, we can have a look
at the values for the last attribute and compare
them with the class.
This is an unsupervised filter, so the class
was not used when running the filter.
The clustering is done just on the basis of
the first four attributes.
You can see that the iris-setosas are all
in cluster 2.
The next lot of irises, versicolors, are mostly
in cluster 1--there are a couple of cluster
3's here.
The third lot, the iris-virginicas, are mostly
in cluster 3, but there are quite a lot of
cluster 1's.
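Conceptually, the filter behaves something like the following sketch (the nearest-centroid assignment and all names here are illustrative assumptions, not Weka's implementation): the clusterer sees only the ordinary attributes, and the resulting cluster number is appended as a new attribute.

```python
# Sketch of what AddCluster does: run a clusterer over the attributes (the
# class is never consulted, since the filter is unsupervised) and append the
# resulting cluster number as a new, final attribute.

def nearest_centroid(point, centres):
    """Index of the closest centre by squared Euclidean distance."""
    def sq_dist(p, c):
        return sum((x - y) ** 2 for x, y in zip(p, c))
    return min(range(len(centres)), key=lambda k: sq_dist(point, centres[k]))

def add_cluster_attribute(instances, centres):
    """Return instances with a 'cluster<k>' value appended as a new attribute."""
    out = []
    for attrs, klass in instances:          # klass is carried along, never used
        k = nearest_centroid(attrs, centres)
        out.append((attrs, klass, "cluster%d" % k))
    return out

# Two fixed centres stand in for a trained SimpleKMeans model.
data = [((0.1, 0.2), "setosa"), ((5.0, 5.1), "virginica")]
centres = [(0.0, 0.0), (5.0, 5.0)]
for row in add_cluster_attribute(data, centres):
    print(row)
```

Because the class value is carried through untouched, you can then compare the new cluster attribute against the class, which is what the Edit button lets you do by eye.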

That's just exactly what we saw when we visualized
the cluster assignments before.
Coming back to the slide, we've looked at
the visualize cluster assignments on the Cluster
panel.
We've learned how to ignore attributes.
Typically the class attribute is a good one
to ignore if you've got a dataset with a class.
Then we've looked at a filter, the AddCluster
unsupervised attribute filter.
We looked at the result of that and how you
can add a new attribute, which gives a cluster
number, and then look at which instances have
got which cluster by clicking the Edit button.
One way of evaluating clustering in Weka is
called the "classes-to-clusters evaluation".
I'm going to go back to the iris data and
do a classes-to-clusters evaluation.
(Let me get rid of this.) I'm going to undo
the filter we just did to get the original
iris data back.

I'm going to go to my Cluster panel, click
"Classes to clusters evaluation", and run
that.
Now I see I've got my 3 classes.
There are 3 clusters, and you can see how
many of each class were assigned to which
cluster.
You can see there are 17 incorrectly clustered
instances.
We'll have a look at that in a minute, but
first let me go and use the EM algorithm and
see how that does.
Again, I'm going to specify 3 clusters, and
I'm going to run that.
I get a similar kind of thing here.
Back on the slide, this is the result I saw
for SimpleKMeans with 3 clusters.
You can see that the majority in cluster 0
is this 47 here.
That's the versicolor.
So, we're going to assign versicolor to cluster
0.
The majority in cluster 1, that's the second
column, are the setosas--that 50 there in
the second column, the column labeled 1.
The final column, there's a 36 there, so the
majority class is virginica.
That's where we get the 17 incorrectly clustered
instances from.
EM does quite a lot better here.
We only get 14 incorrectly clustered instances,
or 9% of the dataset.
That's a classes-to-clusters evaluation.
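The error count can be reproduced from the confusion matrix with a short sketch. Note this is a simplification that takes each cluster's majority class independently, whereas Weka computes a class-to-cluster assignment; on this matrix the answer comes out the same.

```python
# Sketch of the classes-to-clusters figure: label each cluster with its most
# frequent class, and count every instance whose class differs from its
# cluster's majority class as incorrectly clustered.

def classes_to_clusters_errors(counts):
    """counts[c][k] = number of instances of class c assigned to cluster k.
    Returns the number of incorrectly clustered instances."""
    n_clusters = len(counts[0])
    total = sum(sum(row) for row in counts)
    majority = sum(max(row[k] for row in counts) for k in range(n_clusters))
    return total - majority

# The SimpleKMeans result from the lesson: rows are setosa, versicolor,
# virginica; columns are clusters 0, 1, 2.
counts = [
    [0, 50, 0],    # setosa: all in cluster 1
    [47, 0, 3],    # versicolor: mostly cluster 0
    [14, 0, 36],   # virginica: mostly cluster 2
]
print(classes_to_clusters_errors(counts))  # 17
```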
There's a meta-classifier called "ClassificationViaClustering".
It works by ignoring the classes, clustering
the data, and assigning to each cluster its
most frequent class.
That's a classifier.
It's very similar to what we just did, but
we can evaluate it like we evaluate classifiers.
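The per-cluster majority step might be sketched like this (a hypothetical helper, not the ClassificationViaClustering source): once each cluster is labeled with its most frequent training class, classifying a new instance is just a matter of seeing which cluster it lands in and reading off that label.

```python
# Sketch of the core of classification-via-clustering: map each cluster id
# to the most frequent class among the training instances it contains.
from collections import Counter

def build_cluster_to_class(train_clusters, train_classes):
    """Label every cluster with its most frequent training class."""
    per_cluster = {}
    for k, c in zip(train_clusters, train_classes):
        per_cluster.setdefault(k, Counter())[c] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in per_cluster.items()}

# Cluster ids produced by some clusterer, plus the known training classes.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
classes  = ["versicolor", "versicolor", "virginica",
            "setosa", "setosa",
            "virginica", "virginica", "versicolor"]
mapping = build_cluster_to_class(clusters, classes)
print(mapping)  # {0: 'versicolor', 1: 'setosa', 2: 'virginica'}
```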
Let's get back to Weka.
I'm going to go to Classify, and in my meta
list I'm going to choose ClassificationViaClustering.

I'm going to stick to SimpleKMeans with 3
clusters.
Now, if I evaluate that on the training set,
that's exactly what we just did on the clustering
panel.
Let me start that.
Here I get exactly the same matrix as I just
looked at.
As you can see, there are 17 errors here.
That's evaluating on the training set.
Of course, there are the 17 errors up there.
We know we shouldn't be evaluating on the
training set.
We're going to use cross-validation, which
is going to do the usual thing: take 90%,
form a clustering, form a classification based
on that clustering, and then see how well
that does on the held-out 10% of the dataset.
In this case, I get slightly worse results,
as I would expect.
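The fold arithmetic for the iris data can be sketched as follows (an assumed index-splitting helper, not Weka's evaluation code, which also stratifies and randomizes the folds):

```python
# Sketch of a 10-fold split: each fold holds out 10% of the instances for
# testing, and the cluster-then-label classifier is built on the other 90%.

def ten_fold_indices(n):
    """Yield (train_indices, test_indices) pairs for 10-fold cross-validation."""
    folds = [list(range(i, n, 10)) for i in range(10)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

splits = list(ten_fold_indices(150))
print(len(splits))        # 10 folds
print(len(splits[0][1]))  # 15 held-out instances per fold for iris
```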

I've got 19 errors, or an 87% success rate.
That's ClassificationViaClustering.
Of course, I could choose different clusterers
and build classifiers based on them.
It's a very good way of comparing clusterers.
It's hard to evaluate clustering.
SimpleKMeans, for instance, uses within-cluster
sum of squared errors, but really clustering
should be evaluated with respect to a particular
application.
Visualization helps: it helps you to see what's
happening to your data.
The AddCluster filter allows you to see which
instances are in each cluster, which is often
useful to see.
The classes-to-clusters evaluation gives you
a way of looking at the clusters, but, in
effect, it uses the entire dataset.
So looking at the incorrectly assigned instances
based on a classification made from the entire
dataset risks overfitting.
You should never evaluate on the training
set.
Classification via clustering uses the same
kind of technique to produce a classifier
that can then be evaluated in different ways,
for example, 10-fold cross-validation, which
is what we just did.
There's some more stuff on evaluating clustering
in the course text.
You should go and do the activity associated
with this lesson.
This is the last lesson in Class 3, so we'll
see you soon in Class 4.
Bye for now!

