
English: 
Hello again.
In the last couple of lessons, we've been
looking at decision rules.
In this lesson, Lesson 3.3, and the next one,
I want to look at a different kind of rule,
association rules.
Association rules are about finding associations
between attributes.
Between any attributes.
There's no particular class attribute.
Rules can predict any attribute, or, indeed,
any combination of attributes.
For this we need a different kind of algorithm.
The one that we use in Weka, and the most
popular association rule algorithm, is called
Apriori.
I don't know if you remember the weather data
from Data Mining with Weka.

English: 
Here's this little dataset with 14 instances
and a few attributes.
Well, here are some association rules.
"If = outlook=overcast, then play=yes." If
you look at that, there are 4 "overcast" instances,
and it's "yes" for all of them: that rule
is 100% correct.
"If temperature=cool, then humidity=normal";
that's also 100% correct.
"If outlook=sunny and play=no, then humidity=high."
We don't have to predict "play" or indeed
any particular attribute.
If you look at rule #4, "outlook=sunny and
play=no," 3 instances satisfy the left-hand
side (the first 2 instances and instance 8),
and the rule holds for all of them.
So it's 100% correct, but it only covers 3
instances.
There are lots of 100% correct rules for the
weather data.
I think there are 336 rules that are 100%
correct.
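These counts are easy to check directly. As a minimal sketch in Python, assuming the standard weather.nominal dataset that ships with Weka, the first three rules can be evaluated like this:

```python
# Weka's weather.nominal dataset: 14 instances, 5 nominal attributes.
WEATHER = [
    # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")

def evaluate(antecedent, consequent):
    """Return (support, confidence) of the rule 'if antecedent then consequent'.

    Both sides are dicts mapping attribute -> value; support is the number
    of instances matching both sides, and confidence is support divided by
    the number of instances matching the antecedent alone.
    """
    def matches(inst, conds):
        return all(inst[ATTRS.index(a)] == v for a, v in conds.items())
    lhs = [i for i in WEATHER if matches(i, antecedent)]
    both = [i for i in lhs if matches(i, consequent)]
    return len(both), len(both) / len(lhs)

# Rule 1: outlook=overcast -> play=yes
print(evaluate({"outlook": "overcast"}, {"play": "yes"}))         # (4, 1.0)
# Rule 2: temperature=cool -> humidity=normal
print(evaluate({"temperature": "cool"}, {"humidity": "normal"}))  # (4, 1.0)
# Rule 3: humidity=normal and windy=false -> play=yes
print(evaluate({"humidity": "normal", "windy": "false"},
               {"play": "yes"}))                                  # (4, 1.0)
```

All three rules come out at 100% confidence with a support of 4, as stated above.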

English: 
Somehow we need to discriminate between these
rules.
The way we're going to do this is to look
at the "support", the number of instances
that satisfy a rule.
The "confidence" is the proportion of instances
for which the conclusion holds, and the "support"
is the number of instances that satisfy a
rule.
Here I've got the same rules.
They all have 100% confidence, but they've
got different degrees of support, different
numbers of instances.
We're looking for high support/high confidence
rules, but we don't really want to specify
100% confidence and look for all of those
rules, because, like I said, there are hundreds
of them and a lot of them have very low support.
Typically what we do is specify a minimum
degree of confidence and seek the rules with
the greatest support with that minimum confidence.
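That selection step, keep only the rules above a confidence threshold and rank them by support, can be sketched as follows; the rule names and their (support, confidence) figures here are made-up placeholders, not rules mined from the weather data:

```python
# Hypothetical (rule, support, confidence) triples -- illustrative only.
rules = [
    ("A -> B", 4, 1.00),
    ("C -> D", 7, 0.85),
    ("E -> F", 2, 1.00),
    ("G -> H", 6, 0.60),
]

MIN_CONFIDENCE = 0.8

# Keep rules meeting the confidence threshold, best-supported first.
selected = sorted(
    (r for r in rules if r[2] >= MIN_CONFIDENCE),
    key=lambda r: r[1],
    reverse=True,
)
print([r[0] for r in selected])  # ['C -> D', 'A -> B', 'E -> F']
```

Note that "G -> H" is dropped despite its decent support, because it falls below the confidence threshold.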
I want to introduce you to the idea of an
"itemset".

English: 
An itemset is a set of attribute-value pairs,
like "humidity=normal and windy=false and
play=yes".
An itemset has got a certain support given
a dataset.
Here there are 4 instances in the dataset
that are in that itemset.
We can take that itemset and permute it in
7 different ways to produce rules, all of
which have a support of 4.
"If humidity=normal and windy=false than play=yes"
has a support of 4 and a confidence of 4/4--that's
100%--because all of the instances for which
humidity=normal and windy=false have play=yes.
As we go down this list of rules, we get a
lower degree of confidence.
The last rule, for example, doesn't have anything
on the left-hand side: "anything implies humidity=normal,
windy=false and play=yes" has a support of
4, but there are 14 instances that satisfy
the left-hand side.
All of the instances satisfy the left-hand
side, so the confidence is 4/14.
You can see that as you go down this list
of rules, the confidence is decreasing from
100% through 4/6 (67%) down to quite a low
value, 4/14.
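The seven rules from the itemset "humidity=normal, windy=false, play=yes" and their falling confidences can be reproduced with a short sketch, again assuming Weka's standard weather.nominal data:

```python
from itertools import combinations

# Weka's weather.nominal data as sets of (attribute, value) items, one per instance.
ROWS = [
    "sunny hot high false no", "sunny hot high true no",
    "overcast hot high false yes", "rainy mild high false yes",
    "rainy cool normal false yes", "rainy cool normal true no",
    "overcast cool normal true yes", "sunny mild high false no",
    "sunny cool normal false yes", "rainy mild normal false yes",
    "sunny mild normal true yes", "overcast mild high true yes",
    "overcast hot normal false yes", "rainy mild high true no",
]
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [set(zip(ATTRS, row.split())) for row in ROWS]

ITEMSET = {("humidity", "normal"), ("windy", "false"), ("play", "yes")}

def count(items):
    """Number of instances that contain every item in the given set."""
    return sum(1 for inst in DATA if items <= inst)

# Split the itemset into antecedent/consequent in every way, including the
# empty antecedent: 1 + 3 + 3 = 7 rules, each with the itemset's support of 4.
for k in range(len(ITEMSET)):
    for lhs in combinations(sorted(ITEMSET), k):
        lhs = set(lhs)
        rhs = ITEMSET - lhs
        conf = count(ITEMSET) / count(lhs)
        print(sorted(lhs), "->", sorted(rhs), f"confidence {conf:.2f}")
```

Running this shows confidence 1.00 for the rule with both humidity=normal and windy=false on the left, 4/6 (0.67) for the two-item antecedents involving play=yes, and 4/14 (0.29) for the empty antecedent, matching the list above.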
What Apriori does is generate high-support
itemsets.
Then, given an itemset, it gets all the rules
from it, and just takes those with more than
a specified minimum degree of confidence.
The strategy is to iteratively reduce the
minimum support until the required number
of rules is found with a given minimum confidence.
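That outer loop can be sketched like this; `find_rules` stands in for Apriori's actual itemset and rule generation, here replaced by a brute-force filter over a hypothetical precomputed rule pool, so the names and numbers are placeholders:

```python
def find_rules(min_support, min_confidence, rules):
    """Placeholder for Apriori's itemset/rule generation: return the
    rules from a precomputed pool meeting both thresholds."""
    return [r for r in rules if r[1] >= min_support and r[2] >= min_confidence]

def apriori_outer_loop(rules, num_rules_wanted, min_confidence,
                       start_support=14, delta=1):
    """Lower the minimum support step by step until enough rules with
    the required minimum confidence have been found."""
    min_support = start_support
    while min_support > 0:
        found = find_rules(min_support, min_confidence, rules)
        if len(found) >= num_rules_wanted:
            return min_support, found
        min_support -= delta
    return 0, find_rules(0, min_confidence, rules)

# Hypothetical rule pool: (name, support, confidence) -- illustrative only.
pool = [("r1", 6, 0.95), ("r2", 4, 1.00), ("r3", 4, 0.70), ("r4", 2, 0.90)]
support, found = apriori_outer_loop(pool, num_rules_wanted=2, min_confidence=0.9)
print(support, [r[0] for r in found])  # 4 ['r1', 'r2']
```

The loop stops at the highest support level that yields the requested number of rules, so low-support rules like "r4" never appear unless they are needed to make up the count.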
That's it for this lesson.
There are far more association rules than
classification rules.
We need different techniques.
The "support" and "confidence" are two important
measures.

English: 
Apriori is the standard algorithm, and I just
want to show you that algorithm over here
in Weka.
I've opened a dataset that you're going to
be using in the activity: it's the "vote"
dataset.
In order to use Apriori, I go to the Associate
panel.
There are a few association rule algorithms,
of which by far the most popular is Apriori;
that's the default one.
Then I just run that to get association rules.
That's what you'll be doing in the activity.
We want to specify the minimum confidence
value and seek rules with the most support,
and the details of that are in the next lesson.
There's some stuff about association rules
in the textbook.
Good luck with the activity, and I'll see
you in the next lesson.
Bye for now!
