
Hello again! Welcome to Class 4 of More Data
Mining with Weka.
We're going to talk about attribute selection
in the first four lessons of this class.
Then we're going to talk about cost-sensitive
evaluation and classification in the last
two.
In this lesson we are going to talk about
attribute selection using the wrapper method.
D'you remember way back in Data Mining with
Weka, in the first class, you looked at glass.arff,
you ran J48, you removed some attributes and--much
to your surprise--you sometimes got better
results with fewer attributes? Well, that
was a laborious manual process, where you
started with the full attribute set and
removed the best attribute by selectively
trying all possibilities; then you carried
on doing that.
You probably remember the pain involved.
Well, of course there's a better way, and
that's what the Select Attributes panel does.
We're going to go to Weka, and I've opened
the glass data set--there it is: 214 instances.
I'm going to go to the Select Attributes panel.
We're talking here about the wrapper method
of attribute selection, and that involves
wrapping up a classifier.
We're going to wrap up J48, which is exactly
what you did back then all those weeks ago.
I'm going to use 10-fold cross-validation,
which actually is what you did--although in
Class 1 of Data Mining with Weka you'd never
heard of cross-validation.
That looks pretty good to me.
I'm going to select a threshold of -1: I'll
explain that later on.
Then we have a search method.
We're going to use the BestFirst search, but
we're going to search backwards--I'll talk
about that later on.
And we're going to have a search termination...
yes, we're going to leave it at that.
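By the way, everything we've just set up in the panel can also be driven from Weka's Java API. Here's a minimal sketch (assuming glass.arff is in the working directory, with the class as the last attribute) that mirrors these panel settings:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WrapperSelectionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Wrapper evaluation: J48 inside 10-fold cross-validation;
            // threshold -1 forces a single cross-validation per subset
            WrapperSubsetEval evaluator = new WrapperSubsetEval();
            evaluator.setClassifier(new J48());
            evaluator.setFolds(10);
            evaluator.setThreshold(-1);

            // BestFirst search, backwards; searchTermination left at its default
            BestFirst search = new BestFirst();
            search.setDirection(new SelectedTag(0, BestFirst.TAGS_SELECTION)); // 0 = backward

            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(evaluator);
            selection.setSearch(search);
            selection.SelectAttributes(data);
            System.out.println(selection.toResultsString());
        }
    }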
OK.
Let's just run it.
Now it's running, doing all those cross-validations.
And, lo and behold, it's got the same attribute
subset as you got before, and it's got a "Merit
of best subset" of 74%.
Going back to the slide here, that's really
the same kind of thing as we got before.
Same subset, and the "merit" is the same as
the accuracy.
It is a little bit of a coincidence that we
got the same results, because Weka doesn't
do exactly the same thing--the setting of
the random number generator differs, and so on.
But anyway, we did get the same results here
in this situation.
A good question is: how many subsets did we
have to evaluate? How many attributes? How
much experimentation did we have to do? So,
I'm going to go back here, and
I'm going to set the searchTermination to
1--and again I will explain that in a minute--and
run it again.
And here it tells me that it's evaluated 36
subsets.
Back on the slide, you can count these subsets.
It took the complete attribute set and then
it tried removing all of the 9 attributes,
one by one.
That's 9 more evaluations.
Then it removed another attribute, 8 evaluations,
and another one and another one, which gave
it the final attribute subset.
But to check that it was the final attribute
subset and you couldn't do better by removing
another attribute, it had to do a further
five evaluations.
And if you add up all of those, you get 36
subsets evaluated.
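To spell out the arithmetic: that's 1 evaluation of the full set, then 9 + 8 + 7 + 6 single-removal rounds, then the final check of 5, so 1 + 9 + 8 + 7 + 6 + 5 = 36.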
The wrapper method involves an evaluation
method and a search method.
Let's talk about search.
We were doing backwards searching.

English: 
We started with all 9 attributes, and selected
one to remove, and so on and so forth until
we decided to stop: that's the searchTermination
criterion.
It would be equally viable to do forwards
search, starting with a 0-attribute subset
and adding the best attribute each time until
you decided to stop.
Or you could do bi-directional search.
You could start with some random attribute
subset--actually, Weka allows you to specify
what attribute subset to start with--and then
either add or subtract an attribute depending
on which gives the most performance improvement.
Or you could do exhaustive search: in this
case there are 512 possible subsets of 9 attributes
and you could simply try them all.
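In API terms, these are just different search objects plugged into the same AttributeSelection; continuing the earlier sketch, the variations might look like this (ExhaustiveSearch is weka.attributeSelection.ExhaustiveSearch):

    // Forward best-first search: start from the empty set and add attributes
    BestFirst forward = new BestFirst();
    forward.setDirection(new SelectedTag(1, BestFirst.TAGS_SELECTION)); // 1 = forward, 2 = bi-directional
    selection.setSearch(forward);

    // Or simply try all 2^9 = 512 subsets
    ExhaustiveSearch exhaustive = new ExhaustiveSearch();
    selection.setSearch(exhaustive);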
The searchTermination criterion is interesting.
When we did this manually, we stopped as soon
as the results started to get worse: we got
a local maximum in the search space.

But you might do better by plowing on through
that minimum that you get, going a little
bit further to see if perhaps you might reach
an even higher peak further on.
If you set the searchTermination criterion
to something greater than 1, then Weka will
try a little bit harder, go a little bit further,
before deciding to abandon the search.
Now, I'm not going to show you all these different
searches, but here are some results.
It's a pretty complex process.
I showed you Backwards search, and we got
that first subset at a 0.72 evaluation.
And then we set the searchTermination up to
5 which gives us a chance of powering on past
a local maximum, finding an even bigger maximum
in the search space, and that gives us a better
evaluation.
Or with Forwards search, you get that 3-attribute
subset RI, AL and CA if you search on a little
bit further--instead of terminating the search
prematurely you can get a better subset with
better accuracy.
And Bi-directional search will give that 3-attribute
subset, and again you can improve that by
setting the searchTermination criterion to
search a little bit further.
Note that we are always finding a local optimum,
but setting the searchTermination criterion
to more than 1 gives you a chance of traversing
a valley in the search space to find a better
local optimum.
It turns out that, on this dataset, AL is the
single best attribute to use.
(OneR will confirm that for you.) All Forward
search results will include AL.
Curiously, AL is the best single attribute
to drop.
So if you start with a full set, the best
one to drop is AL.
This sounds pretty strange, and I must admit
it is pretty unusual.

But nevertheless it's true, and it's certainly
not impossible.
Let's just go back to Weka here.
I'm going to set cross-validation and see
what happens.
What it's doing now is it's doing the attribute
evaluation 10 separate times.
It's showing us here how many times this attribute,
RI, appeared in the final attribute subset.
In this case, it appeared in 9 out of the
10 attribute subsets.
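If you're scripting this, the same mode is reachable through the API; a sketch, continuing from the earlier one:

    // Repeat the whole wrapper selection inside a 10-fold cross-validation:
    // each fold runs the search on 90% of the data, and the output counts
    // how often each attribute made it into the final subset
    selection.setXval(true);
    selection.setFolds(10);
    selection.setSeed(1);
    selection.SelectAttributes(data);
    System.out.println(selection.toResultsString());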
Coming back to the slide: in how many folds
does this attribute appear
in the final subset? You can see that RI and
Mg and Ba appear in all 10 of the folds, and
AL, Si, K and Fe appear in not too many, 2
or 3 of the folds.
This gives you an indication of the stability
of the attribute selection method.
For this data set it's not really very stable,
as we've seen by getting all those different
subsets when we try different parameters of
the wrapper method.
If we do forward search, of course, we will
definitely choose AL, so this was done with
Backwards search.
The gory details of the Wrapper method.
In general, Weka implementations follow descriptions
in the research literature, so these parameters
came from the research literature.
It tries to do a 5-fold cross-validation by
default, not a 10-fold cross-validation, but
it doesn't necessarily do all 5 folds.
It does at least 2 and up to 5 runs, and stops
when the standard deviation is less than a
user-specified threshold.
Setting a negative threshold, which is what
we did, forces a single cross-validation each
time.
The BestFirst search method is the default,
and the searchTermination defaults to 5 for
traversing valleys.
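In API terms, those defaults correspond to something like this (a sketch with an illustrative threshold value, not the settings we ran above):

    WrapperSubsetEval defaults = new WrapperSubsetEval();
    defaults.setFolds(5);        // 5-fold cross-validation by default
    defaults.setThreshold(0.01); // repeat CV (2 to 5 runs) until the std dev falls below this
    // a negative threshold, e.g. setThreshold(-1), forces a single cross-validation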
The Wrapper method uses cross-validation to
select the best attribute to add or drop at
each stage.
If we go back to Weka, there's another attribute
evaluator which is called ClassifierSubsetEval.
That allows us to specify a classifier and
also a HoldOutFile, so here we would use the
HoldOutFile to evaluate each subset in turn.
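A sketch of that, continuing the earlier code and assuming a held-out ARFF file with the same header exists (the file name glass_holdout.arff is made up for illustration):

    // Evaluate each candidate subset by testing J48 on a separate hold-out file
    ClassifierSubsetEval holdOut = new ClassifierSubsetEval(); // weka.attributeSelection
    holdOut.setClassifier(new J48());
    holdOut.setUseTraining(false); // score on the hold-out set, not the training data
    holdOut.setHoldOutFile(new java.io.File("glass_holdout.arff"));
    selection.setEvaluator(holdOut);
    selection.SelectAttributes(data);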
That's attribute selection using the Wrapper
method.
We use a classifier to find a good attribute
set: we used J48.
We wrap the classifier in a cross-validation
loop.
There are two components here: the attribute
evaluator, which evaluates a subset of attributes;
and the search method, which searches through
the attribute space.
Searching can be forwards, backwards, or bidirectional
starting from any subset.
It's computationally intensive: m^2 subsets
need to be evaluated for m attributes, and
there's an exhaustive method which evaluates
2^m subsets, which you're going to use in
the activity.
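For the glass data, m = 9: the greedy search is of the order of 9^2 = 81 evaluations (we saw 36 of them with searchTermination set to 1), whereas the exhaustive method needs 2^9 = 512.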
Greedy search always finds a local optimum
in the search space, and you can traverse
valleys by increasing the searchTermination
parameter.
You can read the section in the course text
about attribute selection; and off you go
and do the activity, which will let you think
about the Wrapper method and give you some
experience using it.
Good luck and we'll see you in the next lesson.
Bye for now.
