
Hello again.
In most courses, there comes a point where
things start to get a little tough.
In the last couple of lessons, you've seen
some mathematics that you probably didn't
want to see, and you might have realized that
you'll never completely understand how all
these machine learning methods work in detail.
I want you to know that what I'm trying to
convey is the gist of modern machine learning
methods, not the details.
What's important is that you can use them
and that you understand a little bit of the
principles behind how they work.
And the math is almost finished.
So hang in there; things will start to get
easier -- and anyway, there's not far to go:
just a few more lessons.
I told you before that I play music.

Someone came round to my house last night
with a contrabassoon.
It's the deepest, lowest instrument in the
orchestra.
You don't often see or hear one.
So, here I am, trying to play a contrabassoon
for the first time.
I think this has got to be the lowest point
of our course, Data Mining with Weka!
Today I want to talk about support vector
machines, another advanced machine learning
technique.
We looked at logistic regression in the last
lesson, and we found that it produces linear
boundaries in instance space.
In fact, here I've used Weka's Boundary Visualizer
to show the boundary produced by a logistic
regression machine -- this is on the 2D Iris
data, plotting petalwidth against petallength.
This black line is the boundary between these
classes, the red class and the green class.
It might be more sensible, if we were going
to put a boundary between these two classes,
to try and drive it through the widest channel
between the two classes, the maximum separation
from each class.
Here's a picture where the black line now
is right down the middle of the channel between
the two classes.
Actually, mathematically, we can find that
line by taking the two critical members, one
from each class -- they're called support
vectors; these are the critical points that
define the channel -- and take the perpendicular
bisector of the line joining those two support
vectors.
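The construction just described is easy to sketch in code. Below is a toy Python illustration, not Weka's implementation, with made-up 2D coordinates for the two support vectors: the boundary passes through their midpoint, perpendicular to the line joining them.

```python
# Toy sketch of the perpendicular-bisector construction described above.
# The two support vectors are made-up 2D points, one from each class.
s_red = (1.0, 2.0)    # critical point from the "red" class
s_green = (3.0, 6.0)  # critical point from the "green" class

# The midpoint of the segment joining the support vectors lies on the boundary.
mid = ((s_red[0] + s_green[0]) / 2, (s_red[1] + s_green[1]) / 2)

# The vector from one support vector to the other is normal to the boundary,
# so the decision function is f(x) = w . (x - mid).
w = (s_green[0] - s_red[0], s_green[1] - s_red[1])

def decision(x):
    """Positive on the green side, negative on the red side, zero on the line."""
    return w[0] * (x[0] - mid[0]) + w[1] * (x[1] - mid[1])

print(decision(s_green))  # 10.0: positive side
print(decision(s_red))    # -10.0: negative side
print(decision(mid))      # 0.0: exactly on the boundary
```

Any point with a positive decision value falls on the green side of the bisector, any negative point on the red side.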
That's the idea of support vector machines.
We're going to put a line between the two
classes, but not just any old line that separates them.
We're trying to drive the widest channel between
the two classes.
Here's another picture.
We've got two clouds of points, and I've drawn
a line around the outside of each cloud -- the
green cloud and the brown cloud.
It's clear that any interior points aren't
going to affect this hyperplane, this plane,
this separating line.
I call it a line, but in multiple dimensions
it would be a plane, or a hyperplane in four
or more dimensions.
There are just a few points in each cloud
that define the position of the line: the
support vectors.
In this case, there are three points.
Support vectors define the boundary.
The thing is that all the other instances
in the training data could be deleted without
changing the position of the dividing
hyperplane.
There's a simple equation and this is the
last equation in this course.
A simple equation that gives the formula for
the maximum margin hyperplane as a sum over
the support vectors.
Each term is a kind of vector product with
one of the support vectors, and you sum those terms.
It's pretty simple to calculate this maximum
margin hyperplane once you've got the support
vectors.
It's a very easy sum, and, like I say, it
only depends on the support vectors.
None of the other points play any part in
this calculation.
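That sum can be written out directly. Here is a minimal Python sketch of the usual form of the decision function, f(x) = b + sum over support vectors of alpha_i * y_i * (s_i . x). The support vectors, alpha coefficients, and bias below are made up for illustration; a real learner such as Weka's SMO would find them by optimization.

```python
# Minimal sketch of the SVM decision function as a sum over support vectors:
#   f(x) = b + sum_i  alpha_i * y_i * dot(s_i, x)
# The support vectors, alphas, and bias are illustrative values only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# (support vector, class label y, coefficient alpha)
support_vectors = [
    ((3.0, 6.0), +1, 0.1),
    ((1.0, 2.0), -1, 0.1),
]
b = -2.0  # bias term

def f(x):
    """Decision function: the sign gives the class; only support vectors appear."""
    return b + sum(alpha * y * dot(s, x) for s, y, alpha in support_vectors)

print(f((3.0, 6.0)))  # +1.0: sits on the positive margin
print(f((1.0, 2.0)))  # -1.0: sits on the negative margin
```

Notice that no training point other than the support vectors appears anywhere in the calculation, which is exactly the point made above.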
Now in real life, you might not be able to
drive a straight line between the classes.
Classes are called "linearly separable" if
there exists a straight line that separates
the two classes.
In this picture, the two classes are not linearly
separable.
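As a hedged aside, one concrete way to test this definition in code: the classic perceptron algorithm converges exactly when the classes are linearly separable, so running it with an iteration cap gives a rough separability check. The data points below are made up for illustration.

```python
# Rough linear-separability check via the perceptron algorithm, which
# converges if and only if a separating line exists (subject to the cap).

def linearly_separable(points, labels, max_epochs=1000):
    """Return True if a perceptron finds a separating line (with bias term)."""
    w = [0.0, 0.0, 0.0]  # two weights plus a bias
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:  # misclassified
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True  # every point is on the correct side
    return False  # no separating line found within the cap

# Two well-separated clouds:
print(linearly_separable([(0, 0), (0, 1), (5, 5), (6, 5)], [-1, -1, 1, 1]))  # True
# An XOR-like layout, not separable by any straight line:
print(linearly_separable([(0, 0), (1, 1), (0, 1), (1, 0)], [-1, -1, 1, 1]))  # False
```

A separable dataset with an extremely narrow margin could exhaust the iteration cap before converging, so this is a heuristic check rather than a proof.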

It might be a little hard to see, but there
are some blue points on the green side of
the line, and a couple of green points on
the blue side of the line.
It's not possible to get a single straight
line that divides these points.
That makes support vector machines -- the
mathematics -- a little more complicated.
But it's still possible to define the maximum
margin hyperplane under these conditions.
That's it: support vector machines.
It's a linear decision boundary.
Actually, there's a really clever technique
which allows you to get more complex boundaries.
It's called the "Kernel trick".
By using different formulas for the "kernel"
-- and in Weka you just select from some possible
different kernels -- you can get different
shapes of boundaries, not just straight lines.
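In code terms, the kernel trick just swaps the plain dot product inside the support-vector sum for a kernel function. Here is a hedged Python sketch of two common kernel choices, polynomial and RBF, similar in spirit to the options Weka offers; the parameter values are illustrative only.

```python
import math

# Sketch of the kernel trick: replace the plain dot product with a kernel.
# Everything else about the support-vector sum stays the same, but the
# resulting boundary in the original space is no longer a straight line.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, degree=2):
    """Polynomial kernel: a dot product in an implicit higher-dim space."""
    return (dot(u, v) + 1) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """RBF (Gaussian) kernel: a similarity that decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = (1.0, 0.0), (0.0, 1.0)
print(poly_kernel(u, v))   # (0 + 1)^2 = 1.0
print(rbf_kernel(u, u))    # 1.0: identical points have maximal similarity
```

Choosing a different kernel changes the implicit feature space, and therefore the shape of the boundary, without changing the maximum-margin machinery itself.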

Support vector machines are fantastic because
they're very resilient to overfitting.
The boundary just depends on a very small
number of points in the dataset.
So it's not going to overfit the dataset,
because it doesn't depend on almost all of
the points in the dataset, just a few of these
critical points -- the support vectors.
So it's very resilient to overfitting, even
with large numbers of attributes.
In Weka, there are a couple of implementations
of support vector machines.
We could look in the "functions" category
for "SMO".
Let me have a look at that over here.
If I look in "functions" for "SMO", that implements
an algorithm called "Sequential Minimal Optimization"
for training a support vector classifier.

There are a few parameters here, including,
for example, the different choices of kernel.
You can choose different kernels: you can
play around and try out different things.
There are a few other parameters.
Actually, the SMO algorithm is restricted
to two classes, so this will only work with
a 2-class dataset.
There are other, more comprehensive, implementations
of support vector machines in Weka.
There's a library called "LibSVM", an external
library, and Weka has an interface to this library.
This is a wrapper class for the LibSVM tools.
You need to download these separately from
Weka and put them in the right Java classpath.
You can see that there are a lot of different
parameters here, and, in fact, a lot of information
on this support vector machine package.
That's support vector machines.
You can read about them in Section 6.4
of the textbook if you like, and please
go and do the associated activity.
See you soon for the last lesson in this class.
Bye!
