
Hello again! In data mining, you often want
to optimize parameters for some situation,
and I'm going to show you some methods in
Weka that allow you to do that.
These are "wrapper" meta-learners.
There are three of them.
Do you remember the AttributeSelectedClassifier
with WrapperSubsetEval? The way it worked
was to select an attribute subset based on
how well a classifier performed, and it evaluated
that using cross-validation.
These do the same kind of thing.
CVParameterSelection selects the best value
for a parameter.
Again, it uses cross-validation.
It can optimize according to various measures: accuracy
or the root mean-squared error.
GridSearch optimizes two parameters by searching
a 2-dimensional grid.

The ThresholdSelector selects a probability
threshold, and you can optimize various measures
with that.
Let's take a look first at CVParameterSelection.
Over in Weka, I've got the diabetes dataset
open, and I'm looking at J48.
Now, do you remember J48 has got these two
parameters: "C" and "M"? We can optimize those.
Let's just run it.
In plain mode, we get 73.8%.
Now, we can optimize these parameters.
Coming back to the slide.
We can use CVParameterSelection.
The way we express our optimization is to
write a loop.
The "C" parameter is going to go from 0.1
to 1 in 10 steps.
That will take it right up to 1.0.

Actually, if you were to try this, you would
find it would fail, because if C is set to
1, then J48 can't cope with that.
Instead, we're going to have C go from 0.1
to 0.9 in 9 steps.
To find out about this syntax, you need to
use the More button.
Let's go back to Weka and do that.
I'm going to choose CVParameterSelection:
it's a meta-classifier.
I'm going to wrap up J48.
My string says that the C parameter goes from
0.1 to 0.9 in 9 steps.
I need to Add that.
That's it here.
I'll leave this and then go back and have
another look.
It still says the same thing.
This is what it's doing.
This is the list you want.
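
If you drive Weka from Java code rather than the Explorer, the same setup looks roughly like this (a minimal sketch, assuming Weka is on the classpath; the class name CVParamDemo and the file name diabetes.arff are illustrative):

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.CVParameterSelection;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CVParamDemo {
        public static void main(String[] args) throws Exception {
            // Load the diabetes data; the file name is an assumption.
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Wrap J48 in the CVParameterSelection meta-classifier.
            CVParameterSelection ps = new CVParameterSelection();
            ps.setClassifier(new J48());

            // The same string as in the GUI: C from 0.1 to 0.9 in 9 steps.
            ps.addCVParameter("C 0.1 0.9 9");

            // Evaluate the wrapped classifier with 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ps, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }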

You can have several lines in this list.
If I just go ahead and do that, then it will
optimize that parameter.
It will take quite a long time.
I'm going to stop it now.
I'm going to be disappointed, because actually,
I'm going to get worse results.
It'll choose a value of C as 0.1 instead of
the default of 0.2, and it's going to get
slightly worse results, only 73.4%.
I'm going to get better luck with minNumObj,
the other parameter, which is called M.
Let's go back here.
We're going to go back and reconfigure CVParameterSelection.
I'm going to add another optimization:
M goes from 1 to 10 in 10 steps.
I'm going to Add that; it comes first --
and underneath is the loop for C.
So I'm going to loop around M and get the
best value for M, and then I'm going to loop
around C and get the best value for C with
that best value for M.
I'm not going to do this; it takes a long
time.
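
In code, that two-parameter setup is just a second parameter string; following the sketch above, it might look like this (ps2 is a hypothetical fresh instance):

    // Reconfigure: a fresh CVParameterSelection around J48.
    CVParameterSelection ps2 = new CVParameterSelection();
    ps2.setClassifier(new J48());
    // The loop for M comes first: M from 1 to 10 in 10 steps ...
    ps2.addCVParameter("M 1 10 10");
    // ... and underneath it, the loop for C.
    ps2.addCVParameter("C 0.1 0.9 9");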

But let me tell you the results.
It gets 74.3% with C as 0.2 and M as 10.
Actually, it gets a much simpler tree.
We get a very slightly better result than with
plain J48, and we get a simpler tree.
That's a worthwhile optimization.
The next method is GridSearch.
You can do CVParameterSelection with multiple
parameters, and it will optimize the first
parameter and then the other parameter.
GridSearch optimizes the two parameters together.
It allows you to explore the best parameter
combinations not just for a classifier, but
for a filter and a classifier together.
You can optimize various things.
It's very flexible, but pretty complicated
to set up.
Let's take a quick look at GridSearch.
You would need to study this to actually use
it.
This is the configuration panel.
You can see it's pretty complex.

We're doing "x" and "y".
x is actually going to be the filter.
We can optimize a number of components in
the filter, the x property.
y is going to be the classifier, and we're
going to optimize the ridge parameter of the
classifier.
That's in this default configuration.
We're using linear regression, which has got
a ridge parameter.
This is the parameter we're optimizing.
For the filter, we're using partial least
squares, and that's got a parameter called
numComponent.
That's what we're going to be optimizing.
That's the default configuration.
In order to change this configuration, you'd
need to look at the More button and think
about this quite a bit.
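
For reference, here is a sketch of that default-style configuration in code (hedged: the setter names mirror the properties shown in the configuration panel, and GridSearch may need to be installed as a separate package in recent Weka releases, so check them against your version):

    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.meta.GridSearch;
    import weka.filters.supervised.attribute.PLSFilter;

    public class GridSearchDemo {
        public static void main(String[] args) throws Exception {
            GridSearch gs = new GridSearch();

            // y axis: the ridge parameter of linear regression.
            gs.setClassifier(new LinearRegression());
            gs.setYProperty("classifier.ridge");

            // x axis: the numComponents parameter of the PLS filter.
            gs.setFilter(new PLSFilter());
            gs.setXProperty("filter.numComponents");

            // Ranges, step sizes, and the measure to optimize are further
            // properties (XMin/XMax/XStep and so on) -- see the More button.
            System.out.println(String.join(" ", gs.getOptions()));
        }
    }
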
The third thing I want to look at is the ThresholdSelector.
Do you remember in the last class, Lesson
4.6, we looked at probability thresholds,
and we found that Naive Bayes uses a probability
threshold of 0.5? We fiddled around with that
to optimize a cost matrix.
That's exactly the kind of thing that ThresholdSelector
can optimize.
In fact, in this case, it's unlikely to do
better than Naive Bayes, but we can do different things.
I'm going to use the credit dataset and Naive Bayes.
I've got them here, the credit dataset and
Naive Bayes.
I can just run that, and I'll get 75.4% accuracy.
Now, I can use the threshold selector.
Let's look at the ThresholdSelector.
It's a meta-classifier, of course.
I'm going to configure that with Naive Bayes.
There are various things I can do.
The designated class:
I'm going to designate the first class value.
In this dataset, the class values are "good"
and "bad".
The first class is the "good" class.
Let me optimize the accuracy and see what
happens.
I get exactly the same 75.4% that I got before.
We can actually optimize a number of different
measures here;
in fact, measures like the TP_Rate, the FP_Rate,
and so on.
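
In code, the same configuration looks roughly like this (a hedged sketch: setMeasure and the tag constants follow the ThresholdSelector source, and the credit-g.arff file name is an assumption, so verify both against your Weka version):

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.meta.ThresholdSelector;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ThresholdDemo {
        public static void main(String[] args) throws Exception {
            // Load the credit data; the file name is an assumption.
            Instances data = DataSource.read("credit-g.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Wrap Naive Bayes in the ThresholdSelector meta-classifier.
            ThresholdSelector ts = new ThresholdSelector();
            ts.setClassifier(new NaiveBayes());

            // Optimize accuracy; other tags select F-measure, TP rate, etc.
            ts.setMeasure(new SelectedTag(ThresholdSelector.ACCURACY,
                    ThresholdSelector.TAGS_MEASURE));

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ts, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }
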
Back on the slide, there are some new terms
here: the F-measure, Precision, and Recall.
Remember the confusion matrix? The TP is there,
so that's True Positive.
True Negative (TN) is in the lower right-hand
corner of the confusion matrix.
The TP_Rate is TP divided by TP plus FN.

We've talked about those before.
We haven't talked about Precision, Recall,
and F-measure, which are commonly used measures
in the area of information retrieval.
Those are defined there on the slide for you.
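
For reference, the standard definitions (in LaTeX notation; these are the usual information-retrieval formulas, consistent with the TP Rate given above):

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \mathrm{TP\ Rate} = \frac{TP}{TP + FN}, \qquad
    F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}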
Going back to Weka, let's optimize something
simple, like the number of true positives.
Look--we've got 700 of them here,
isn't that fantastic? A very high number of
true positives.
Or we could change the classifier to optimize
the number of true negatives.
Here we get 295, a very high number of true
negatives.
The threshold value's actually given here
up at the top.
You can see it's chosen almost 1 here.
It's tuning on one third of the data; that's
how it's evaluating this.

English: 
We can optimize other things--Precision, Recall,
and F-measure--as well as the accuracy.
That's it.
The moral is: don't optimize parameters manually.
If you do, you'll overfit, because you'll
use the whole dataset in cross-validation.
That's cheating! We're going to use wrapper
methods with internal cross-validation.
We've looked at CVParameterSelection, GridSearch,
and ThresholdSelector.
Now it's time for you to do the activity,
and we'll see you in Lesson 5.5.
Bye for now!
