
Hello again! Welcome back to New Zealand.
In this lesson, we're going to look at a new
class of attribute selection methods: Scheme-Independent
attribute selection.
The Wrapper method that we looked at before
is straightforward, simple, and direct, but
it's really slow.
So here are a couple of alternatives.
We could use a single-attribute evaluator,
evaluate the attributes one by one independently,
rank them, and base our attribute selection
on that.
That allows us to eliminate irrelevant attributes,
and we'll be looking at that in the next lesson.
A second alternative is to combine an attribute
subset evaluator with a search method.

And that allows us to eliminate redundant
attributes as well as irrelevant ones, so
it's potentially much more powerful.
Now, we've already looked at different kinds
of search methods in lesson 4.1, and we've
looked at one kind of attribute subset evaluator,
the wrapper method.
That is a scheme-dependent way of evaluating
an attribute subset.
Now we're going to look at scheme-independent
ways of evaluating attribute subsets.
In fact, we're going to look at a method
called CfsSubsetEval.
It considers an attribute subset to be good
if the attributes it contains are highly correlated
with the class attribute and not strongly
correlated with one another.
It comes up with a measure of "goodness" that
is applied to the attribute subset as a whole.
We sum the correlation between the attribute
and the class over all of the attributes in
the subset; then we divide that by the correlations
of each attribute with each other attribute,
summed over all pairs of attributes (we take
the square root of that).
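As a rough sketch of the measure just described, assuming the correlations have already been computed as numbers (the function name and the inputs class_corr[i], the correlation of attribute i with the class, and pair_corr[i][j], the correlation between attributes i and j, are made up for illustration; Weka computes everything from the data itself):

```python
from math import sqrt

def subset_goodness(class_corr, pair_corr):
    # Numerator: attribute-class correlations summed over the subset.
    numerator = sum(class_corr)
    # Denominator: square root of the attribute-attribute correlations
    # summed over all pairs, including each attribute with itself
    # (where the correlation is 1).
    k = len(class_corr)
    total = sum(pair_corr[i][j] for i in range(k) for j in range(k))
    return numerator / sqrt(total)

# Two attributes, each correlated 0.8 with the class and uncorrelated
# with each other: goodness = 1.6 / sqrt(2)
print(subset_goodness([0.8, 0.8], [[1.0, 0.0], [0.0, 1.0]]))
```

Adding a redundant attribute grows the denominator faster than the numerator, so the score drops, which is exactly the trade-off this evaluator is after.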
For correlation, the CfsSubsetEval method
uses an entropy-based metric called the "symmetric
uncertainty".
It's pretty straightforward, but I'm not going
to talk about that.
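For completeness, the symmetric uncertainty between two discrete variables X and Y can be written as 2·I(X;Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information; it ranges from 0 to 1. A minimal sketch in Python (the function names are my own, not Weka's):

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy (in bits) of a sequence of discrete values.
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    # SU = 2 * I(X;Y) / (H(X) + H(Y)), with the mutual information
    # computed as I(X;Y) = H(X) + H(Y) - H(X,Y).
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    if hx + hy == 0:
        return 0.0
    return 2 * (hx + hy - hxy) / (hx + hy)

# Identical variables are maximally dependent:
print(symmetric_uncertainty(list("aabb"), list("aabb")))  # 1.0
```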
Let's try it.
Let's compare CfsSubsetEval with Wrapper selection
on the ionosphere data.
We're going to look first at Naive Bayes.
Coming over to Weka here, I've got the ionosphere
data open, and I'm going to classify that
with Naive Bayes, standard Naive Bayes.
When I do that, I get 82-83%.
All right.
Now let's do attribute selection and, of course,
we're going to use the AttributeSelectedClassifier
to ensure that we're not cheating.

That's a meta classifier, the AttributeSelectedClassifier.
Within that, remember, we can select a classifier--we're
going to choose Naive Bayes--and we're also
going to choose a subset evaluator--we're
going to use the default, CfsSubsetEval.
And for the search method, I'll just use the
default search method.
Let's run that.
Now we get 88.6% ...
89%, which is a lot better, so attribute selection
has really helped here.
Let's try attribute selection using the Wrapper
method.
I'm going to use the same learning scheme,
Naive Bayes, but here I'm going to choose
the Wrapper method.
For that, of course, I've got to specify a
machine-learning method to use to wrap, and
we're going to wrap Naive Bayes.

I'm going to run that--everything else is
default--and it's going to take a while.
Here we go.
It's finished now; it took quite a long time.
We got 91% accuracy.
Back on the slide.
In the NaiveBayes column, we got 83% without
attribute selection.
Attribute selection helped quite a lot with
CfsSubsetEval, which is very fast, and it was
even better with the very slow Wrapper method.
When I did IBk, I got 86% for plain IBk, 89%
for CfsSubsetEval.
And for the wrapper, I wrapped IBk--in each
of these things, I wrapped the corresponding
classifier, the one that we're using for classification--and
I got 89%.
The two attribute selection methods were the
same.
J48 was already extremely good without any
attribute selection.

I got 92% for the very fast method, and in
fact, I got slightly worse results (90%) for
the much slower wrapper selection.
It's a little surprising that wrapper selection
does worse than CfsSubsetEval for J48.
These are just based on one run, of course.
The conclusion is that CfsSubsetEval is nearly
as good as the Wrapper method, and much faster.
There are a number of attribute subset evaluators
in Weka.
There are a couple of scheme-dependent methods:
the WrapperSubsetEval uses internal cross-validation,
and I think in a previous lesson we mentioned
briefly the ClassifierSubsetEval, which is
like the Wrapper method but instead of using
cross-validation it uses a separate held-out
test set.
Those are scheme-dependent.
And then the scheme-independent methods, there
are a few of those.
We've looked at CfsSubsetEval, and there's
another one called ConsistencySubsetEval,
which measures the consistency of the class
values in the training set with respect to the attributes.
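The idea can be sketched like this (my own simplified version of a consistency measure, not Weka's exact implementation): group the instances by their values on the chosen attributes, and see how often the class agrees with its group's majority.

```python
from collections import Counter, defaultdict

def consistency(instances, subset, labels):
    # Group instances by their values on the attributes in `subset`,
    # count the majority class in each group, and return the fraction
    # of instances that agree with their group's majority.
    # 1.0 means the subset determines the class exactly.
    groups = defaultdict(list)
    for row, label in zip(instances, labels):
        key = tuple(row[i] for i in subset)
        groups[key].append(label)
    majority_total = sum(Counter(g).most_common(1)[0][1]
                         for g in groups.values())
    return majority_total / len(labels)

rows = [("sunny", "hot"), ("sunny", "cool"),
        ("rainy", "hot"), ("rainy", "cool")]
labels = ["no", "yes", "yes", "yes"]
print(consistency(rows, [0, 1], labels))  # 1.0: both attributes pin down the class
print(consistency(rows, [0], labels))     # 0.75: attribute 0 alone leaves a conflict
```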

Let me just go over to Weka here and have a
look at the different methods of attribute
selection.
There's CfsSubsetEval.
I've talked about ClassifierSubsetEval; that's
a scheme-dependent method.
ConsistencySubsetEval, that's the one we were
just talking about, and I can look at that
and get some more information.
It evaluates the worth of a subset by consistency,
and to really understand that method you need
to go and look at the paper where it's referenced.
As you can see, there are quite a lot of different
methods for attribute subset evaluation, and
the list includes meta-evaluators, which incorporate
other operations.
I'm not going to talk about that here.
In conclusion, attribute subset selection
involves a subset evaluation measure and a
search method.
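To make that division of labour concrete, here is a toy greedy forward search with a pluggable subset evaluator (a sketch only; Weka's actual search methods, such as BestFirst, are more sophisticated):

```python
def forward_selection(n_attributes, evaluate):
    # Greedy forward search: start from the empty subset and repeatedly
    # add the single attribute that most improves the evaluation score;
    # stop as soon as no addition improves it.
    current, best = [], evaluate([])
    improved = True
    while improved:
        improved = False
        scored = [(evaluate(current + [a]), a)
                  for a in range(n_attributes) if a not in current]
        if scored:
            score, attr = max(scored)
            if score > best:
                current, best = current + [attr], score
                improved = True
    return current, best

# Hypothetical evaluator: attributes 1 and 3 carry all the signal.
useful = {1: 0.25, 3: 0.5}
evaluate = lambda subset: sum(useful.get(a, 0.0) for a in subset)
print(forward_selection(5, evaluate))  # ([3, 1], 0.75)
```

Any subset evaluation measure -- CFS merit, wrapper accuracy, or consistency -- can be plugged in as the `evaluate` function; the search method itself stays the same.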

Some measures are scheme-dependent, like the
Wrapper method, which is very slow, and others
are scheme-independent, like CfsSubsetEval
which, as we found, is quite fast.
Even faster is to use a single-attribute evaluator
with ranking, and we are going to talk about
that in the next lesson.
But before that, off you go and do the activity
associated with this lesson.
We'll talk to you soon.
Bye for now!
