
Hello again.
In this lesson, 3.4, we're going to carry
on looking at association rules and the Apriori
algorithm.
We left off last lesson looking at an itemset
with 3 items and a support of 4, which means
that there are 4 instances in the dataset
for which those conditions are true.
This itemset can be massaged into 7 possible
rules by choosing different items for the
left- and right-hand sides -- see the sketch below.
I said that the strategy of Apriori is to
specify a minimum confidence level, and
then iteratively reduce the support until
enough rules are found with at least that
confidence.
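
To make those 7 rules concrete: every non-empty subset of the itemset can serve as the right-hand side, with the remaining items (possibly none) on the left, giving 2^3 - 1 = 7 rules. Here is a minimal sketch that enumerates them; the item names come from the weather example.

```java
import java.util.ArrayList;
import java.util.List;

public class SevenRules {
    public static void main(String[] args) {
        // The 3-item itemset with support 4 from the weather example.
        List<String> items = List.of("humidity=normal", "windy=false", "play=yes");
        int n = items.size();
        // Each non-empty subset becomes a right-hand side; whatever is
        // left over (possibly nothing) becomes the left-hand side.
        for (int mask = 1; mask < (1 << n); mask++) {
            List<String> lhs = new ArrayList<>(), rhs = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (((mask >> i) & 1) == 1) rhs.add(items.get(i));
                else lhs.add(items.get(i));
            }
            System.out.println(lhs + " ==> " + rhs); // 7 rules in all
        }
    }
}
```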

These rules have got support 4 and confidence
values ranging from 100%, which is 4/4, down
to 4/14, about 29%.
For the weather data, Apriori will first generate
itemsets with support 14.
There aren't any of those.
If there were, it would find rules in them
with greater than the minimum confidence level;
the default for Weka is 90% confidence level.
Since there weren't any itemsets with support
14, it would decrease the support to 13 and
carry on decreasing it until it had found
the specified number of rules.
Actually, the weather data has 336 rules with
confidence of 100%.
The reason why it works in this slightly crazy
way is that if you simply went looking for
high-confidence rules, you'd find huge numbers
of very low-support, very high-confidence rules.
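
As a minimal sketch of that strategy -- not Weka's actual code -- the outer loop looks something like this. findRules is a hypothetical placeholder for the itemset-and-rule search at a single support level.

```java
import java.util.List;

public class SupportLoop {
    public static void main(String[] args) {
        int numInstances = 14;        // the weather data
        double minConfidence = 0.90;  // Weka's default
        int rulesWanted = 10;         // Weka's default
        for (int support = numInstances; support >= 1; support--) {
            List<String> rules = findRules(support, minConfidence);
            if (rules.size() >= rulesWanted) {
                System.out.println("Stopped at support " + support);
                break;                // enough high-confidence rules found
            }
        }
    }

    // Hypothetical stand-in: would return every rule with the given support
    // and at least the given confidence. Stubbed so the sketch compiles.
    static List<String> findRules(int support, double minConfidence) {
        return List.of();
    }
}
```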

From a large dataset, you'd have massive numbers
of 100% confidence rules that weren't very
interesting because they had tiny support.
That's why it does this.
Let's go over to Weka.
I've opened the weather data, the 14-instance
weather data.
I'm going to go to Associate and run Apriori,
that's the default association rule learner.
This is the output I get.
Here are the 10 rules; the default number
of rules is 10.
You can see this is the support of these rules,
and it ranges from 4 down to 3 down to 2.
The last 2 rules have only 2 instances in
the dataset that satisfy them.
The rules are all 100% confidence.
Let's go back to the slide here.
We specify the minimum confidence level; the
default is 90%.
We specify how many rules we want; the default
is 10.
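
The same demo can be reproduced from Weka's Java API. This is a sketch assuming weka.jar is on the classpath; the path to the sample ARFF file is an assumption, so adjust it to wherever your Weka installation keeps its datasets.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunApriori {
    public static void main(String[] args) throws Exception {
        // The 14-instance nominal weather data shipped with Weka.
        Instances data = DataSource.read("data/weather.nominal.arff");
        Apriori apriori = new Apriori();  // default settings, as in the demo
        apriori.buildAssociations(data);
        System.out.println(apriori);      // prints the 10 rules
    }
}
```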

We express the support as a proportion of
the total number of instances.
Then we run the Apriori algorithm several
times.
We start at 100% support, and decrease it
by 5% each time.
We stop when we get the required number of rules
with more than the minimum confidence level,
or when the support reaches the lower bound
set by another parameter, lowerBoundMinSupport.
So there are quite a lot of parameters here.
Let's just take a look at how this works for
the weather data.
On the output that I just showed you, it said
that the minimum support is 0.15.
0.15 times 14 instances is about 2 -- the support
is expressed as a proportion of the total
number of instances.
The minimum confidence is 0.9; that was set
as the default parameter.
It actually performed 17 cycles, reducing
the support each time.
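
Here is the same run with those parameters set explicitly through the API. The values shown are Weka's defaults, so the output should match the demo; the file path is again an assumption.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriParameters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        Apriori apriori = new Apriori();
        apriori.setUpperBoundMinSupport(1.0); // start at 100% support
        apriori.setDelta(0.05);               // reduce support by 5% per cycle
        apriori.setLowerBoundMinSupport(0.1); // never go below 10% support
        apriori.setMinMetric(0.9);            // minimum confidence of 90%
        apriori.setNumRules(10);              // stop once 10 rules are found
        apriori.buildAssociations(data);
        System.out.println(apriori);
    }
}
```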

Just looking down at the bullet point underneath,
the 17 cycles correspond to starting with a
support of 100% and then reducing it by 5% each
time--95%, 90%, 85%, and so on.
It actually got right down to 15%.
When you translate those percentages into
numbers of instances, it was using 14 instances,
then 13 instances, then it did it again with 13
instances, because 95% and 90% of 14 both round
to 13.
It's a little crazy on this tiny dataset--it's
doing a bit of extra work--but on a large
dataset that wouldn't happen.
It got down to 3 instances at the 20% level,
and it only found 8 rules with confidence
greater than 90% and support 3.
That's why it was forced to go down to a support
of 2.
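
You can check that translation yourself. Assuming Weka converts the percentage to an instance count by rounding to the nearest whole number -- an assumption, but one that matches the counts quoted above -- a few lines reproduce the schedule:

```java
public class SupportSchedule {
    public static void main(String[] args) {
        // 100%, 95%, 90%, ... down to the 15% mentioned above.
        for (int pct = 100; pct >= 15; pct -= 5) {
            long instances = Math.round(pct * 14 / 100.0);
            System.out.printf("%d%% of 14 instances = %d%n", pct, instances);
        }
    }
}
```
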
What are these itemsets? Well, we can look
at the itemsets.
The itemsets are based on the final support
values.
Let's go back to Weka and have a look at the
itemsets.

Here are the parameters for Apriori.
As I said before, there are quite a few
of them.
This is the amount by which the support is
reduced each time: 5%.
This is when it stops, when it gets to a support
of 10%.
It's looking for rules with a confidence greater
than 90%.
It's looking for 10 rules.
It's starting at a support of 100%, which
is normally what one does.
Here we can output the itemsets, so let's
output the itemsets, and run it again.
We've got the same rules, and here are the
itemsets.
These are the itemsets with support of at
least 2.
The way it works is that it starts with itemsets
containing just 1 item, and these are the support
counts for each of those itemsets.
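
Incidentally, the "output itemsets" option toggled a moment ago corresponds to the outputItemSets property in the API -- a one-line fragment to add to the earlier snippet before buildAssociations is called:

```java
apriori.setOutputItemSets(true); // print the large itemsets alongside the rules
```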

Then it adds new conditions to these to generate
2-item itemsets.
This is L1 here: these are the itemsets with
1 item, and down here are the itemsets with 2 items.
That's how it generates the itemsets.
It's a little bit convoluted, and it does
this for efficiency reasons for large datasets.
Here are the itemsets with 3 items, and here
are the itemsets with 4 items in them.
In fact, there weren't any itemsets with more
items than that at this support level.
Coming back to the slide.
There were 12 1-item sets with support of
at least 2.
There were 47 2-item sets, 39 3-item sets,
6 4-item sets, and 0 5-item sets, which is
where it stopped.
That's how it goes through the itemsets: it
converts each one into rules and looks for rules
whose confidence is greater than the minimum.
Here it ended up with the 10 rules that we
saw before.
It's a little bit complicated, but as I say,
it's done for efficiency reasons.
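
For intuition, here is a from-scratch sketch of that level-wise search, run on a few hypothetical toy transactions. It is illustrative only -- Weka's implementation is far more efficient -- but it shows the shape of the process: keep the itemsets that meet the minimum support at each level, extend each survivor by one item to form the next level, and stop when nothing survives.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AprioriSketch {
    // Number of transactions containing every item in the itemset.
    static int support(List<Set<String>> data, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : data) if (t.containsAll(itemset)) count++;
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical toy transactions; each string is an attribute=value item.
        List<Set<String>> data = List.of(
            Set.of("outlook=sunny", "play=no"),
            Set.of("outlook=sunny", "humidity=high", "play=no"),
            Set.of("outlook=overcast", "play=yes"),
            Set.of("humidity=high", "play=no"));
        int minSupport = 2;

        // Level 1: the 1-item itemsets that meet the minimum support.
        Set<Set<String>> level = new HashSet<>();
        for (Set<String> t : data)
            for (String item : t)
                if (support(data, Set.of(item)) >= minSupport)
                    level.add(Set.of(item));

        // Levels 2, 3, ...: extend each survivor by one item and keep the
        // extensions that still meet the minimum support.
        while (!level.isEmpty()) {
            for (Set<String> itemset : level)
                System.out.println(itemset + "  support=" + support(data, itemset));
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> itemset : level)
                for (Set<String> t : data)
                    for (String item : t)
                        if (!itemset.contains(item)) {
                            Set<String> extended = new HashSet<>(itemset);
                            extended.add(item);
                            if (support(data, extended) >= minSupport)
                                next.add(extended);
                        }
            level = next;
        }
        // Each surviving itemset would then be split into candidate rules,
        // keeping those whose confidence meets the minimum, as described above.
    }
}
```
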
There are some other parameters.
In the Weka implementation, the car parameter
(class association rules) makes it produce only
rules that predict the class attribute.
You can filter rules according to a statistical
test, a Chi-squared statistical test, but
that's actually unreliable because we're making
a very large number of statistical tests here
and significant results will be found just
by chance.
We've talked about confidence, but there are
other metrics that can be used for ranking
rules; Weka also offers lift, leverage, and
conviction.
You can also remove all attributes whose values
are all "missing".
Those are extra parameters.
In the activity associated with this lesson
you're going to look at the supermarket data
and do some market basket analysis.
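
As API calls, those extra options look roughly like this (a fragment for the earlier snippet). The setter names are from weka.associations.Apriori, but the tag index used here for lift is an assumption worth checking against your Weka version.

```java
import weka.core.SelectedTag;

apriori.setCar(true);                  // mine only rules that predict the class
apriori.setClassIndex(-1);             // -1 means the last attribute is the class
apriori.setSignificanceLevel(0.05);    // chi-squared filter (unreliable, as noted)
apriori.setMetricType(new SelectedTag(1, Apriori.TAGS_SELECTION)); // assumed: 1 = lift
apriori.setRemoveAllMissingCols(true); // drop attributes whose values are all missing
```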

This data was collected from an actual New
Zealand supermarket.
Let's just go and have a quick look at this
dataset, supermarket.arff.
Here it is.
You can see there are departments here: there's
baking needs, and coupons, and tea, and biscuits--very
popular in New Zealand--frozen foods, razor
blades, gardening aids, spices.
There are a large number of attributes -- 217
of them -- and 4,600 instances in this dataset.
Multiply those numbers together and you get
about 1 million attribute values.
In this dataset, missing values are used to
indicate that the basket did not contain that item.
In fact, 92% of the values are missing.
That means that the average basket contains
roughly 220 attributes times 8%, which is only
about 18 items -- the average number of items
in a supermarket basket.
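
The arithmetic behind those figures, for checking -- note the lecture rounds 217 attributes up to 220:

```java
public class BasketArithmetic {
    public static void main(String[] args) {
        int attributes = 217, instances = 4600;     // figures from the lecture
        System.out.println(attributes * instances); // 998200 -- about a million values
        // 92% of values missing means each basket holds about 8% of the items:
        System.out.println(0.08 * attributes);      // 17.36 -- the "about 18 items"
    }
}
```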

The most popular items are bread-and-cake,
vegetables, frozen foods, and biscuits.
That's Apriori.
It makes multiple passes through the data,
generating 1-item sets, 2-item sets, and so
on, with more than the minimum support.
It turns each one into rules and checks their
confidence.
It's fast and efficient, providing that the
data fits into main memory.
Weka invokes the algorithm several times,
gradually reducing the support each time until
sufficient high-confidence rules have been
found.
There are many parameters that control this
iteration.
In the activity, you'll be looking at the
supermarket data, and you'll get to realize
just how difficult it is to come up with cool
association rules in a real dataset.
Association rule mining is not an easy thing
to do.
Actually, it's something I personally don't
have a lot of experience with, but it's a
very common use of data mining.
That's it for association rule mining.

We'll see you in the next lesson.
Bye for now!
