
Hello again! One final lesson on attribute
selection.
You're probably getting a bit fed up with
attribute selection by now, but you know it's
really important.
It's one of those things that can really improve
the performance of machine-learning methods,
and more importantly, it really improves the
understandability.
You know, you select out some attributes--it's
easy to explain to other people what you've
done to get such good performance on their
data set.
Attribute selection is pretty important.
We're going to look in this lesson at fast
attribute selection using ranking.
Remember, in the last lesson we looked at
attribute subset selection, which involves
a subset evaluation measure and a search method,
and we were looking for rapid subset evaluation
methods.
The Wrapper method is very slow, and we were
looking for faster alternatives.
But, of course, searching is slow.
So we're not doing any searching now.
We're going to use a single-attribute evaluator,
that doesn't evaluate a subset, it evaluates
each attribute individually.
This can help eliminate irrelevant attributes,
but it can't remove redundant attributes,
because it's only looking at individual attributes,
one at a time.
You need to choose the ranking search method
whenever you select a single-attribute evaluator.
The ranking search method doesn't really search,
it just sorts them into rank order of the
evaluation.
We've seen several metrics for evaluating
attributes before.
We looked at OneR in the last course, ages
ago.

Remember OneR? It's effectively a method of
ranking attributes.
In Weka, there are attribute selection methods
based on all of these.
The OneR attribute evaluator.
C4.5, what we know as J48 in Weka, uses information
gain, so there's an information gain attribute
evaluator.
Actually, it uses gain ratio, slightly more
complex than information gain, and there's
also a gain ratio attribute evaluator.
In the last lesson we saw the CfsSubsetEval
method, and that uses symmetric uncertainty,
so there is a symmetric uncertainty attribute
evaluator in Weka.
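All of these metrics are entropy-based. As a rough illustration of what they measure (a from-scratch Python sketch, not Weka's implementation), here are information gain, gain ratio, and symmetric uncertainty for a single discrete attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, cls):
    """H(class) - H(class | attribute) for paired discrete columns."""
    n = len(cls)
    cond = 0.0
    for v in set(attr):
        subset = [c for a, c in zip(attr, cls) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(cls) - cond

def gain_ratio(attr, cls):
    """Information gain normalized by the attribute's own entropy."""
    split = entropy(attr)
    return info_gain(attr, cls) / split if split > 0 else 0.0

def symmetric_uncertainty(attr, cls):
    """2 * gain / (H(attribute) + H(class)), lies in [0, 1]."""
    denom = entropy(attr) + entropy(cls)
    return 2 * info_gain(attr, cls) / denom if denom > 0 else 0.0

# A perfectly predictive attribute scores higher than an irrelevant one.
cls  = ['y', 'y', 'n', 'n']
good = ['a', 'a', 'b', 'b']   # matches the class exactly
bad  = ['a', 'b', 'a', 'b']   # carries no class information
print(info_gain(good, cls), info_gain(bad, cls))   # 1.0 0.0
```

Each evaluator in Weka refines the same idea: gain ratio corrects information gain's bias toward many-valued attributes, and symmetric uncertainty normalizes it into [0, 1].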
The ranker search method is very simple.
It just sorts attributes according to their
evaluation, and you can specify the number
of attributes to retain.
The default is to retain them all, or you
can ask it to discard attributes whose evaluation
falls below a certain threshold, or you can
specify a certain set of attributes that you
want to ignore.
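The Ranker's behavior is simple enough to sketch in a few lines. Here's a hypothetical Python analogue (the function and argument names are mine, not Weka's API) covering those same three options:

```python
def rank_attributes(scores, num_to_select=-1, threshold=float('-inf'), ignore=()):
    """Sort attributes by score, best first, mimicking Weka's Ranker options.

    scores: dict mapping attribute name -> evaluation score.
    num_to_select: -1 keeps all survivors, otherwise keeps the top k.
    threshold: attributes scoring below this are discarded
               (the default is minus infinity, i.e. keep everything).
    ignore: attribute names to exclude before ranking.
    """
    ranked = sorted(
        (name for name in scores
         if name not in ignore and scores[name] >= threshold),
        key=lambda name: scores[name],
        reverse=True,
    )
    return ranked if num_to_select < 0 else ranked[:num_to_select]

scores = {'a01': 0.02, 'a05': 0.31, 'a03': 0.18, 'a07': 0.0}
print(rank_attributes(scores, num_to_select=2))   # ['a05', 'a03']
```

There is no search here at all, which is exactly why this method is so fast.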
Let's have a look.

Let's compare GainRatioAttributeEval with
the other methods we looked at in the last
lesson, on the ionosphere data.
The gray part of this slide--no attribute
selection, CfsSubsetEval, and the wrapper--shows
the results we got in the last lesson.
We're just going to look at the GainRatioAttributeEval.
I'm going to go to Weka.
I've got my ionosphere dataset.
Of course, I'm going to use the AttributeSelectedClassifier
to get a fair evaluation.
Meta > AttributeSelectedClassifier.
Here I'm going to specify--let's just use
Naive Bayes to start off with.
I'm going to use a GainRatioAttributeEval.
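The reason for wrapping everything in the AttributeSelectedClassifier is that selection must be redone on each training fold of the cross-validation; selecting once on the full dataset would leak test-fold information and inflate the accuracy. A minimal Python sketch of that discipline (the callbacks are hypothetical stand-ins, not Weka's API):

```python
def cross_validate_with_selection(X, y, select, train, predict, k=10):
    """Fair CV: attribute selection is redone on each training fold only,
    so the held-out fold never influences which attributes are chosen."""
    n = len(y)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_rows = [i for i in range(n) if i not in test_idx]
        # 1. Choose attributes using the training fold ONLY.
        cols = select([X[i] for i in train_rows], [y[i] for i in train_rows])
        # 2. Train on the reduced training fold.
        model = train([[X[i][c] for c in cols] for i in train_rows],
                      [y[i] for i in train_rows])
        # 3. Evaluate on the held-out fold, reduced to the same attributes.
        for i in test_idx:
            correct += predict(model, [X[i][c] for c in cols]) == y[i]
    return correct / n

# Tiny demo: a majority-class "classifier" and select-everything selection.
X = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [0, 1]]
y = ['a', 'a', 'b', 'a', 'a', 'b']
acc = cross_validate_with_selection(
    X, y,
    select=lambda Xt, yt: [0, 1],
    train=lambda Xt, yt: max(set(yt), key=yt.count),
    predict=lambda model, row: model,
    k=2,
)
```

In Weka, the AttributeSelectedClassifier does all of this bookkeeping for you.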

If I just run that, it's not going to work:
the attribute evaluators must use the Ranker
search method.
Sorry about that, I should have specified
here the Ranker search method.
There are a couple of parameters.
The number to select: -1 means select them
all.
It's not really very useful to select them
all.
I'm going to select 7 attributes for this.
We could have a set to ignore.
This threshold here, this bizarre number,
is effectively minus infinity in Java, which
is why it's such a strange number.
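If the value shown is the one I believe Weka uses, it's -1.7976931348623157E308, Java's -Double.MAX_VALUE: the most negative finite double, standing in for minus infinity so that no attribute is discarded by default. The same value is easy to reproduce in other languages, e.g. Python:

```python
import sys

# The "bizarre number" is the most negative finite double-precision value,
# used as a stand-in for minus infinity so no attribute falls below it.
lowest = -sys.float_info.max
print(lowest)   # -1.7976931348623157e+308
```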
That's all I need to do.
I'm going to run that, and I get 89-90% accuracy.
Let's go back to the slide and compare this.
So last time with Naive Bayes I got 83% accuracy,
and then 89% with CfsSubsetEval, 91%
with the Wrapper selection method, and with
this new method GainRatioAttributeEval, a
single-attribute evaluator, I get 90%.
Fantastic performance for a method that's
lightning fast.
For IBk, the performance is really not very
good.
It's just the same as IBk without any attribute
selection.
For J48, it's the same as J48 without any
attribute selection.
Single-attribute selection is lightning fast
but very sensitive to the number of attributes.
I chose 7 here because it turned out to be
a good number for this problem.
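Since the right cut-off varies from problem to problem, the usual experiment is simply to try several values of k and keep the best. A hypothetical sketch (in practice the `evaluate` callback would be a cross-validated classifier; here it's a toy stand-in):

```python
def best_num_to_select(ranked_attrs, evaluate, candidates=(1, 2, 5, 7, 10)):
    """Try several cut-offs and keep the one with the best evaluation score.

    ranked_attrs: attribute names, best first (e.g. from a Ranker).
    evaluate: callback mapping a list of attributes -> accuracy estimate;
              in practice this would be cross-validation of a classifier.
    """
    results = {k: evaluate(ranked_attrs[:k])
               for k in candidates if k <= len(ranked_attrs)}
    return max(results, key=results.get), results

# Toy demo: accuracy peaks at 2 attributes, then degrades.
fake_accuracy = {1: 0.80, 2: 0.91, 5: 0.86, 7: 0.84, 10: 0.83}
k, scores = best_num_to_select(
    list('abcdefghij'),
    evaluate=lambda attrs: fake_accuracy[len(attrs)],
)
print(k)   # 2
```

Be aware that if you tune k this way, the tuning itself should sit inside the cross-validation, for the same fairness reasons as the selection.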
There are a lot of single-attribute evaluators
in Weka.
We talked about the first four a minute ago.
There's one based on the chi-squared test,
one based on support vector machines, one
instance-based evaluator, principal components
transform, and latent semantic analysis.
The workings of these are all explained in
the papers referenced in the More
button for that attribute evaluator.
There are also meta-evaluators, which incorporate
other operations.
That's it.
We've seen that attribute subset selection
involves searching, which is bound to be slow
no matter how quickly you can evaluate the
subsets, so instead we can use single-attribute
evaluation.
It involves ranking, which is really fast.
It's hard to specify a suitable cut-off; you
need to experiment.
It doesn't cope with redundant attributes.
For example, if you have copies of an attribute,
then they will be repeatedly selected, because
attributes are evaluated individually.
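You can see this directly: an exact copy of an attribute receives exactly the same individual score as the original, so a ranker keeps both, whereas a subset evaluator such as CfsSubsetEval would penalize the redundant pair. A small from-scratch illustration (simple information-gain scoring, not Weka's code):

```python
from collections import Counter
from math import log2

def entropy(vals):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(vals)
    return -sum((c / n) * log2(c / n) for c in Counter(vals).values())

def info_gain(attr, cls):
    """Individual merit of one attribute: H(class) - H(class | attribute)."""
    n = len(cls)
    cond = 0.0
    for v in set(attr):
        sub = [c for a, c in zip(attr, cls) if a == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(cls) - cond

cls      = ['y', 'y', 'n', 'n', 'y', 'n']
original = ['a', 'a', 'b', 'b', 'a', 'b']
copy_    = list(original)                  # an exact duplicate attribute
noise    = ['a', 'b', 'a', 'b', 'a', 'b']  # irrelevant attribute

scores = {'original': info_gain(original, cls),
          'copy': info_gain(copy_, cls),
          'noise': info_gain(noise, cls)}
ranked = sorted(scores, key=scores.get, reverse=True)
# The duplicate ties with the original at the top of the ranking;
# individual evaluation cannot tell that one of the two is redundant.
```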
Many single-attribute evaluators are based
on machine-learning methods we've already
looked at.
The activity associated with this lesson is
really worth doing, because it gives you a
really remarkable result.
It uses this attribute selection method on
a document classification system, and it gets
stunningly good results with just 2 attributes.

I was very surprised when I set up this activity
to see how well you could do using this method.
Please go ahead and do that.
You'll be surprised and impressed, and we'll
see you in the next lesson.
Bye for now!
