
Hello again! Welcome to the last lesson of
Class 1.
It's time to talk about big data.
Everyone's talking about big data.
I've heard people say it's like teenage sex.
Everyone talks about it, but no one's actually
doing it.
Those people probably didn't have teenage
children.
Anyway, different people mean different things
by big data, and what I mean by big data is
datasets that can't fit into the Weka Explorer.
The Explorer loads the entire dataset.
When you load a dataset, it's all got to fit
into main memory.
How much can it handle? Well, roughly speaking,
a million instances with 25 attributes in
the default configuration.

Actually, if you go to the Explorer and right-click
on Status, you can get memory information,
and this gives three figures here.
The last figure is the total amount of memory
that is allocated to Weka, which is actually
a gigabyte.
That's the default configuration.
The other two figures, well, it's a little
bit complicated.
The most important thing is the difference
between these two figures.
If you want to find out more, then you should
look up the Java functions freeMemory() and
totalMemory().
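Those two calls can be tried outside Weka; here is a minimal stand-alone sketch of the figures the status bar reports (the class name and output formatting are my own):

```java
// Sketch of the memory figures Weka's status bar reports.
// totalMemory() is the heap the JVM has currently claimed;
// freeMemory() is the unused part of it, so total - free is in use.
public class MemoryInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();  // heap currently allocated to the JVM
        long free  = rt.freeMemory();   // unused portion of that heap
        long used  = total - free;      // memory actually in use
        System.out.println("total = " + total + " bytes");
        System.out.println("free  = " + free  + " bytes");
        System.out.println("used  = " + used  + " bytes");
    }
}
```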
Although Weka initializes itself with a gigabyte
of memory, on my computer there's more.
In fact, if I look on my computer, if I right-click
on Computer here, I can get the properties, and
the properties will show me that I've got
8 GB of memory.
So, I could, in fact, arrange for Weka to
initialize itself with more main memory, but
I'm not going to do that now.
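For reference, the heap limit is just the standard JVM `-Xmx` option; a hedged example of a launch command (the path to weka.jar depends on where Weka is installed):

```shell
# Start Weka with a 4 GB maximum heap instead of the 1 GB default.
java -Xmx4g -jar weka.jar
```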
I'm going to try and break it.
Let's see what happens when you break Weka.

Well, we can do this by downloading a large
dataset, and that's actually what you're going
to do in the activity after this lesson.
But I'm going to introduce you to Weka's
data generator instead.
On the Preprocess panel, there's a Generate
button, and that will generate random data
according to particular patterns.
I'm going to use the LED24 data, and show it,
and generate it.
What this has generated is a dataset with
100 instances of the LED data, which
has got 25 attributes.
There they are, the 100 instances,
that's what's loaded in. But I can easily
generate more than the default 100 instances.
Let's generate 100,000 instances by just adding
3 zeros to this.
Generate that.

Now it's generated 100,000 instances.
Let's go and classify this.
We could choose, say, J48.
I'm going to use
Percentage split here --
cross-validation would take a long time.
J48 is working away.
It's finished now, and it's come up with
a percentage accuracy of 73%.
Or we could use NaiveBayes, which I think
will be a little bit quicker, and that comes
up with an accuracy of 74%.

Let's go and generate a million instances
then with the data generator.
We've got 100,000; add another zero and there's 1,000,000.
We can generate that.
It'll take a few seconds.
There's a million instances, and we can go
and classify that with NaiveBayes.
After a few seconds, I get the result.
Here we go, 74% again.
Now, I could try this with J48, but I happen
to know that J48 uses more memory than NaiveBayes,
and it will crash on this dataset.
As things get bigger, the Explorer starts
to crash.
Actually, I could go and try to generate,
say, two million instances of this dataset.

The Explorer would crash if I did that.
When you're doing this kind of thing, you're
better off using the console version of the
Explorer.
If you go to your All Programs menu, you'll
find that there are a couple of versions of
Weka that are installed for you automatically.
One is Weka with console.
That brings up this console window, and it's
the console window that reports when things
crash, out-of-memory errors, and so on.
If you're going to mess around with this kind
of thing, I'd recommend using that version
of Weka.
This is the error message that you ought to
get when J48 crashes.
Unfortunately, when things break, they tend
to break in different ways on different computers,
so you might not get this error message.
You might see that Weka just goes into an
infinite loop and waits forever.
It depends.
That's why the console version is a better
thing to use.

To go further, first of all, we mustn't use
the Explorer, because it loads the entire
dataset in.
Secondly, we need to use updateable classifiers.
These are incremental classification models
that process a single instance at a time.
They don't load the whole dataset into memory.
There are a few of them.
In fact, we looked at them in the activity
associated with the last lesson.
The one we're going to use is NaiveBayesUpdateable,
which is just like NaiveBayes, but an updateable
implementation.
IBk is also an updateable classifier, and
there are a few others.
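This is not Weka's code, but the core idea of an updateable classifier can be sketched in a few lines: keep running statistics and fold in one instance at a time, so memory use stays constant however long the stream is. Here the statistic is just a class count (the prior part of Naive Bayes); the class and method names are my own:

```java
// Toy illustration of the updateable-classifier idea: statistics
// are updated one instance at a time, never holding the dataset.
import java.util.HashMap;
import java.util.Map;

public class IncrementalCounts {
    private final Map<String, Integer> classCounts = new HashMap<>();
    private int n = 0;

    // Analogous to updateClassifier(instance): fold in one instance.
    public void update(String classLabel) {
        classCounts.merge(classLabel, 1, Integer::sum);
        n++;
    }

    // Estimated prior probability of a class from the instances seen so far.
    public double prior(String classLabel) {
        return n == 0 ? 0.0 : classCounts.getOrDefault(classLabel, 0) / (double) n;
    }

    public static void main(String[] args) {
        IncrementalCounts model = new IncrementalCounts();
        for (String c : new String[]{"a", "a", "b", "a"}) {
            model.update(c);  // one instance at a time, constant memory
        }
        System.out.println(model.prior("a"));  // 0.75
    }
}
```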
How much data can Weka handle? If you use
the Simple Command Line interface and updateable
classifiers, then it's unlimited.
Let's open up the Simple Command Line interface.
Here it is.
I'm going to create a huge dataset.

Actually, I'm going to create a pretty small
dataset here with 100,000 instances in.
I'm going to run the LED24 data generator
and put that in this file here.
That has created that dataset of 100,000 instances,
which I'm going to use as a test file.
For a training file, I'm going to use 10 million
instances.
I could change this to 10 million and put this
in the training file.
However, that would just take a few minutes,
so I'm not going to do that.
Instead, I've prepared these files in advance.
Let me just show you.
Here we've got test.arff.
The test file is half a megabyte, with 100,000
instances.
The training file is half a gigabyte, with 10 million
instances.
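Preparing those files from the Simple CLI looks roughly like this, assuming the generator accepts the usual -n (number of examples) and -o (output file) options; the file names mirror the lecture:

```shell
# 100,000-instance test file and 10,000,000-instance training file
java weka.datagenerators.classifiers.classification.LED24 -n 100000   -o test.arff
java weka.datagenerators.classifiers.classification.LED24 -n 10000000 -o train.arff
```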

I've done a really big training file here,
which is 5 GB, with 100 million instances.
Those are the files I'm going to use.
I just need to run the NaiveBayesUpdateable
classifier with the training file.
This is the very large training file.
This is the much smaller test file.
If I run that by typing Enter here, it'll
take 4 minutes and produce 74% accuracy with
NaiveBayesUpdateable.
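In the Simple CLI that run is a single line, where -t names the training file and -T the test file (file names as in the lecture):

```shell
# Train incrementally on train.arff, evaluate on test.arff
java weka.classifiers.bayes.NaiveBayesUpdateable -t train.arff -T test.arff
```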
I can't do it with J48 because that's not
an updateable classifier.
I can try it with a really big file, with
any size file.
If I were to use my 5 GB training file with
100 million examples in it, then it would run --

it takes about 40 minutes on my computer.
So, there you have it.
The Explorer can handle about a million instances
with 25 attributes, say.
It depends.
You can increase the amount of memory allocated
to the Explorer if your computer's got more
than 1 GB of main memory.
We haven't talked about how to do that, but
it's not difficult.
The Simple Command Line interface works incrementally
wherever it can.
It doesn't load the dataset into main memory
the way the Explorer does.
If you use updateable classifier implementations
-- you can find which ones are updateable
using the Javadoc -- then the Simple Command
Line interface will work incrementally.
Then you can work with arbitrarily large files,
many gigabytes or hundreds of gigabytes.
However, you shouldn't use cross-validation.
If you were to specify cross-validation in
the Simple Command Line interface, then it
would have to load the file all in at once.
The Command Line interface avoids loading the whole
file only when you're not using cross-validation.
That's why we use an explicit test file instead
of the default of cross-validation.
Working with big data can be difficult and
quite frustrating.
You'll find that out if you do the activity
associated with this lesson.
Good luck with that, and we'll see you in
Class 2.
Bye for now!

