
Hello again! We're going to talk about clustering,
and I'm going to show you some different clustering
methods.
With clustering, there's no "class" attribute.
We're just trying to divide the instances
into natural groups, or "clusters".
For instance, imagine the iris dataset that
we looked at in the last course.
Imagine deleting the class attribute.
Of course (you might remember) there are 3
kinds of iris, and in the iris dataset there
are 50 of each kind: iris setosas, iris versicolors,
and iris virginicas.
The dataset gives the petal length, and petal
width, and so on.
If you deleted the "class" attribute, the
question is: could you recover the 3 classes
by clustering the data? You'll be trying that
in the activity after this lesson.
The different kinds of clustering algorithms
produce different sorts of representations
of clusters.
One way of thinking about clusters is to imagine
disjoint sets.
We take the instance space and divide it into
sets such that each part of the instance space
is in just one cluster.
Or the clusters might overlap, as shown in
the second picture.
Thirdly, with overlapping clusters you might
have probabilistic assignment of instances.
So a, b, c, d, and e are instances, and there
are 3 clusters here.
Instance a has 0.4 probability of belonging
to cluster 1, and 0.1 probability for cluster
2, and 0.5 probability for 3.
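To make "probabilistic assignment" concrete, here is instance a's row from the slide as a tiny Java sketch (just an illustration; the numbers are the ones on the slide, and taking the largest probability would turn the soft assignment into a hard one):

    // Instance a's membership probabilities for clusters 1, 2, and 3 (they sum to 1).
    double[] probsForA = {0.4, 0.1, 0.5};
    int best = 0;
    for (int c = 1; c < probsForA.length; c++)
        if (probsForA[c] > probsForA[best]) best = c;
    System.out.println("most likely cluster: " + (best + 1));  // prints 3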
Fourthly, we might have a hierarchical clustering
method.
Here the instances are along the bottom and
a and g get clustered together at the bottom level.

In fact, you can see the clusters at the bottom
level.
Then these clusters join together at the next
level up, and so on, until at the very top
level the whole dataset is just one big cluster.
That kind of tree is called a dendrogram.
The first algorithm we're going to look at
is called "KMeans": it does iterative distance-based clustering.
First, we want to specify the desired number
of clusters; we call that k.
Then the algorithm chooses k points at random
as cluster centers.
It assigns all the instances in the dataset
to their closest cluster center.
Then it takes each cluster and calculates
the centroid of the instances in it--that's
the mean of all of the instances.
These centroids are new cluster centers.
It goes back to the beginning and carries
on until the cluster centers don't change.
At some point when you re-calculate the cluster
centers you get just the same numbers you
had before.
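What follows is a minimal sketch of that loop in plain Java, on made-up 2-D points, numbered to match the steps above (Weka's SimpleKMeans is the real implementation; this is only the bare algorithm):

    import java.util.Arrays;
    import java.util.Random;

    // A bare-bones k-means on hypothetical 2-D points, to make the loop concrete.
    public class KMeansSketch {
        public static void main(String[] args) {
            double[][] pts = {{1, 1}, {1.5, 2}, {1, 0.5}, {8, 8}, {9, 9}, {8.5, 9.5}};
            int k = 2;
            Random rnd = new Random(10);  // the random number seed

            // 1. Choose k points at random as the initial cluster centers.
            //    (A real implementation makes sure the k points are distinct.)
            double[][] centers = new double[k][];
            for (int c = 0; c < k; c++) centers[c] = pts[rnd.nextInt(pts.length)].clone();

            int[] assign = new int[pts.length];
            boolean changed = true;
            while (changed) {  // 4. Repeat until the cluster centers don't change.
                // 2. Assign every instance to its closest cluster center.
                for (int i = 0; i < pts.length; i++) {
                    assign[i] = 0;
                    for (int c = 1; c < k; c++)
                        if (dist(pts[i], centers[c]) < dist(pts[i], centers[assign[i]]))
                            assign[i] = c;
                }
                // 3. Recompute each center as the centroid (mean) of its instances.
                changed = false;
                for (int c = 0; c < k; c++) {
                    double[] mean = new double[2];
                    int n = 0;
                    for (int i = 0; i < pts.length; i++)
                        if (assign[i] == c) { n++; mean[0] += pts[i][0]; mean[1] += pts[i][1]; }
                    if (n == 0) continue;  // empty cluster: keep the old center
                    mean[0] /= n;
                    mean[1] /= n;
                    if (!Arrays.equals(mean, centers[c])) { centers[c] = mean; changed = true; }
                }
            }
            System.out.println("assignments: " + Arrays.toString(assign));
        }

        // Squared Euclidean distance between two points.
        static double dist(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return dx * dx + dy * dy;
        }
    }

A different seed gives different initial centers, and hence possibly a different final clustering--the local-minimum behaviour described next.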

This algorithm minimizes the total squared
distance from instances to their cluster centers.
Unfortunately, it's a local minimum, not a
global minimum.
You get different results with different random
number seeds.
We're going to look at KMeans clustering in Weka.
I've opened the weather dataset, and on the
Cluster panel I'm going to open SimpleKMeans.
It's got some parameters here: the number
of clusters, which we've set to 2; the distance
function, which we can change; and the random
number seed.
Let's just run it.
Here we get 2 clusters.
One's got 9 instances and the other's got
5 instances.
The total squared error is 16.2--that's what
we're trying to minimize.
The thing is that if we run this with a different
random number seed, say 11, then we're going
to get different clusters.

There are 6 instances in one cluster and 8
in another cluster.
Here the total squared error is 13.6.
If we were to do it again with another seed,
let's say 12, we get a different clustering
again.
Going back to the slide, you can see that
for each different random number seed we get
a different clustering, and that doesn't seem
like a very good thing.
Although maybe it's the dataset: with a more
suitable dataset, we might get better, more
consistent results.
But KMeans always depends on the initial
choice of cluster centers.
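The same experiment can be scripted against the Weka Java API. This is a sketch, assuming the weather data is saved as weather.nominal.arff in the working directory:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.ManhattanDistance;
    import weka.core.converters.ConverterUtils.DataSource;

    // Build SimpleKMeans on the weather data with several seeds and compare
    // the within-cluster sums of squared errors, as in the Explorer runs above.
    public class KMeansSeeds {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            for (int seed : new int[] {10, 11, 12}) {
                SimpleKMeans km = new SimpleKMeans();
                km.setNumClusters(2);  // k, the desired number of clusters
                km.setSeed(seed);      // the random number seed
                // km.setDistanceFunction(new ManhattanDistance());  // a different metric
                km.buildClusterer(data);
                System.out.println("seed " + seed + ": squared error = "
                    + km.getSquaredError());
            }
        }
    }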
XMeans, also in Weka, is an extended version
of KMeans.
It selects the number of clusters itself--with
KMeans, you've got to specify that.
For XMeans, you can specify a minimum and
a maximum for this number.

It uses kD-trees, which are sophisticated
data structures, to make it operate pretty quickly.
Unfortunately, though, XMeans cannot handle
nominal attributes.
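Here is a sketch of XMeans via the Java API. Because it can't handle nominal attributes, the weather data won't do; the iris data with the class attribute deleted works instead (in recent Weka versions XMeans is installed through the package manager, and the file path here is an assumption):

    import weka.clusterers.XMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    // Let XMeans choose the number of clusters within a range, on numeric data.
    public class XMeansDemo {
        public static void main(String[] args) throws Exception {
            Instances iris = new DataSource("iris.arff").getDataSet();
            Remove remove = new Remove();
            remove.setAttributeIndices("last");  // delete the class attribute
            remove.setInputFormat(iris);
            Instances data = Filter.useFilter(iris, remove);

            XMeans xm = new XMeans();
            xm.setMinNumClusters(2);  // the minimum for the number of clusters
            xm.setMaxNumClusters(5);  // ... and the maximum
            xm.buildClusterer(data);
            System.out.println(xm.numberOfClusters() + " clusters chosen");
        }
    }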
Let's look at another method.
EM is a probabilistic clustering method.
It stands for Expectation Maximization.
I'm going to go to the cluster panel and choose
EM.
There are some parameters here.
We've got here the number of clusters.
That's set to -1, which means EM will try to
determine the number of clusters itself.
I'm going to set that to 2.
Then I'm going to run EM.
Here I get 2 clusters.
In fact, these are the clusters.
For each attribute value, I get a kind of
probability: for example, the probability
that "outlook" is "sunny", "overcast", or
"rainy" in each of the clusters.

We get the probability by dividing this by
the total here.
Given those probabilities, if we had a new
instance we could calculate the probability
of it belonging to each cluster.
As an overall quality measure, EM uses a
thing called the "log likelihood".
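As a sketch against the Java API (same assumed weather file as before), distributionForInstance gives those per-cluster probabilities for any instance, and ClusterEvaluation reports the log likelihood:

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.EM;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // EM with 2 clusters; print an instance's membership probabilities
    // and the overall log likelihood of the clustering.
    public class EMDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            EM em = new EM();
            em.setNumClusters(2);  // -1, the default, lets EM decide
            em.buildClusterer(data);

            double[] probs = em.distributionForInstance(data.instance(0));
            for (int c = 0; c < probs.length; c++)
                System.out.printf("cluster %d: %.3f%n", c, probs[c]);

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(em);
            eval.evaluateClusterer(data);
            System.out.println("log likelihood: " + eval.getLogLikelihood());
        }
    }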
Back to the slide, we've got two clusters
with these prior probabilities.
Within each cluster, we've got the probability
of each value for a nominal attribute; and
for numeric attributes we've got the mean
and standard deviation.
Let's look at one final clustering method.
This is a hierarchical clustering method called
Cobweb.
Back in Weka, let me just run Cobweb.
It's got some rather magic parameters here.
I'm going to choose 0.3.
It's a bit of a black art, actually, using
Cobweb.
I'm going to run this.

I'll get a tree, which I can visualize using
the right-click menu.
Going back to the slide, this is the tree
that we get for the weather data, with 10
clusters.
You can see these clusters at the bottom level,
and then these clusters at the level above,
and one cluster at the very top.
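Scripted, a Cobweb run looks like this sketch (which of the two "magic parameters", acuity or cutoff, gets the 0.3 in the lecture is an assumption on my part; here it's the cutoff):

    import weka.clusterers.Cobweb;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Build Cobweb's hierarchical clustering on the weather data and dump the tree.
    public class CobwebDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            Cobweb cw = new Cobweb();
            cw.setCutoff(0.3);  // one "magic parameter" (the other is acuity)
            cw.buildClusterer(data);
            System.out.println(cw.numberOfClusters() + " clusters");
            // The textual form of the dendrogram; the Explorer's right-click
            // tree visualization draws the same structure graphically.
            System.out.println(cw.graph());
        }
    }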
That's clustering.
In clustering, there's no class value.
There are different representations of clusters,
and different algorithms produce different
kinds of representation.
KMeans is the simplest, standard clustering
method.
It's an iterative, distance-based method.
It can take different distance metrics.
We were using the Euclidean distance, but
you can select different distance metrics
in KMeans and XMeans.
It's really hard to evaluate clustering, and
we're going to be looking at that in the next
lesson.
Meanwhile, go and do some clustering with
the activity.

Have a look at the course text, if you like,
and we'll see you in the next lesson.
Bye for now!
