
Chinese: 
大家好！这是课程5.5：
已经接近尾声了。
让我们回到最基础的部分，
简单介绍一下ARFF格式。
记得，ARFF文件开头是“@relation”来命名关系，
然后是一些“@attribute”，来说明每一个属性，
它包含属性的类型：名词性，这种情况下，会给出可能的值，或者数值性。
整数或实数，它们是一样的--
在Weka中都定义为数值属性。
还有字符串属性。
然后，有一行“@data”，之后每个实例占一行。
我们用问号表示缺失值。
而且当然会有一些注释行，起始于%。

English: 
Hello again! This is Lesson 5.5:
very near the end now.
I just wanted to go right back to the beginning
and talk about the ARFF format a little bit
more.
Remember, an ARFF file starts out with
an @relation to name the relation, and then
some @attribute statements, one for each
attribute.
It declares them to be nominal, in which case
it gives the values, or numeric.
Integer or real, it's the same thing--
they're all numeric for Weka.
There are also string attributes.
Then there's an @data line, and following
that, for each instance there's one data line.
We use a question mark for missing values.
And of course there are comment lines, beginning
with %.
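
As a rough illustration of the layout just described, here is a minimal Python sketch that reads the header and data section of a small ARFF text. The relation name `weather.sketch`, the cut-down attribute list, and the helper `parse_arff` are assumptions made for this example; it ignores quoted values, string and date attributes, and is not how Weka itself loads files.

```python
# A tiny ARFF text: comment, relation, attributes, then one line per instance.
ARFF_TEXT = """\
% Comment lines start with a percent sign
@relation weather.sketch
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, FALSE, no
overcast, ?, TRUE, yes
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple dense ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # "?" marks a missing value; represent it as None
            rows.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                # nominal attribute: keep the list of declared values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                # numeric, integer, real, string, ...
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(rows)
```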

Chinese: 
也许，这些都是你知道，但是还有一些是你不知道的。
首先，我们可以创建稀疏ARFF文件。
有一种稀疏数据的格式，还有NonSparseToSparse和SparseToNonSparse的过滤器。
以天气数据为例，首先按照常规的格式。
稀疏和常规格式的头部分是一样的。
左边是常规格式，右边是稀疏格式。
第一个实例，“sunny, hot, high, false, no”，在稀疏格式中，
属性的第一个值被认为是默认值。
“Sunny”，“hot”和“high”都是默认值。
所以第一个实例在稀疏格式中表示是
3，属性3（我们从0开始）属性3是false，4是no。

English: 
Well, you know all that, but there are a few
more things that you don't know.
First of all, we can have sparse ARFF files.
There's a sparse format, and filters NonSparseToSparse
and SparseToNonSparse.
Here's an example of the weather data, first
of all in the regular format.
Both the sparse and the regular format have
the same header.
On the left is the regular format, on the
right is the sparse format.
In the first instance, which is "sunny, hot,
high, false, no", well, in the sparse format,
if the attribute has the first value, then
that's considered the default.
"Sunny", "hot", and "high" are all default.
So the first instance in the sparse format
is 3, attribute number 3
(we count from 0). Attribute number 3 is false
and 4 is no.

English: 
In the second instance, "sunny, hot, high,
true, no"; "sunny", "hot", "high", and "true"
are all default.
Those are all the first possible values as
declared in the ARFF header,
so we don't need to specify those.
We just specify that the 4th attribute--numbering
again from 0--is a "no".
And the third instance, "overcast", well,
that's not the first value for "outlook", so
we've got to specify that,
so we say the 0th [attribute] is "overcast".
Then "hot", "high", and "yes" are all default, but
"false" isn't, so we say the 3rd attribute
is "false".
And so we go on:
just specify those attributes that do not
have the first value.
All classifiers accept sparse data as input,
but some of them just nullify the savings
by expanding the sparse data internally.
Others actually use sparsity to speed up the
computation.
Good examples are NaiveBayesMultinomial and
SMO.
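
The rule for sparse instances can be sketched in a few lines of Python. This assumes the nominal weather header with values declared in the order shown in `HEADER`, and it wraps each sparse instance in curly braces as Weka's sparse ARFF format does. The helpers `to_sparse` and `to_dense` are names invented for this example; Weka's own conversion is done by the NonSparseToSparse and SparseToNonSparse filters.

```python
# Declared values per attribute, in header order.
# The first declared value of each attribute is the sparse default.
HEADER = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", ["hot", "mild", "cool"]),
    ("humidity", ["high", "normal"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]


def to_sparse(row, header=HEADER):
    """Keep only 'index value' pairs whose value is not the first declared one."""
    parts = [
        f"{index} {value}"
        for index, (value, (_, values)) in enumerate(zip(row, header))
        if value != values[0]
    ]
    return "{" + ", ".join(parts) + "}"


def to_dense(sparse, header=HEADER):
    """Start from all defaults, then overwrite the listed attributes."""
    row = [values[0] for _, values in header]
    body = sparse.strip().strip("{}").strip()
    if body:
        for item in body.split(","):
            index, value = item.split()
            row[int(index)] = value
    return row


DENSE = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["sunny", "hot", "high", "TRUE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
]

for row in DENSE:
    print(to_sparse(row))
```

Attribute indices count from 0, so the first instance keeps only attributes 3 and 4, the second only attribute 4, and the third attributes 0 and 3.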

Chinese: 
第二个实例，“sunny, hot, high, true, no”；
“Sunny”，“hot”，“high”和“true”都是默认值。
它们都是ARFF文件开头说明的可能的属性值的第一个。
所以我们不需要再次输入这些了。
我们只输入第四个属性的值--从零开始--是一个“no”。
接着第三个实例，“overcast”，当然，这不是“outlook”的默认值，
我们得给出它的值，
所以我们说第零个属性的值是“overcast”。
然后，“hot”，“high”和“yes”都是默认值，但是
“false”不是，我们要说明第三个属性的值是“false”。
等等，以此类推：
只要特别说明那些属性值不是第一个可选值的属性。
所有的分类器接收稀疏数据作为输入，但是，
有些通过内部展开稀疏数据来放弃节省。
另外一些其实在利用稀疏性加速运算。
NaiveBayesMultinomial和SMO是很好的例子。

Chinese: 
有几个过滤器--比如StringToWordVector--可以产生稀疏格式的输出。
所以，如果搭配使用StringToWordVector和Multinomial Naive Bayes
运算速度会非常快，你在做文档分类的时候
可能已经注意到这点了。
（ARFF文件）还有其他一些功能，例如给实例加权。
我们了解过，但现在再一次看看如何在Weka内部给实例加权。
ARFF文件中，可以在实例末尾
的括号里输入权重。
再次以天气数据为例，我们有两个实例，
第一个实例的权重是0.5，第二个权重是2.0。
如果没有指定权重，当然，权重默认值是1.0。
我们可以在ARFF文件里明确的设置权重。
还有日期属性。
我们这里不多说了。
我们可以有关系属性，这在多实例学习中很有用，

English: 
There are a couple of filters--the StringToWordVector,
for example--that produce sparse output.
So if you use the StringToWordVector filter
in combination with Multinomial Naive Bayes,
you get a very fast system, and you probably
noticed that when you were doing document
classification.
There are a couple of other features like
weighted instances.
We've talked now and again about instances
being weighted internally to Weka.
You can specify weighted instances in ARFF
files in curly brackets at the end of the
instance.
Again, with the weather data, we've got a
couple of instances and the first instance
has got a weight of 0.5 and the second instance
has got a weight of 2.0.
If weights are missing, of course, they're
assumed to be 1.0.
You can specify weights explicitly in your
ARFF file.
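
To make the weight syntax concrete, here is a small Python sketch that splits a dense data line into its values and its weight. The two example lines are assumed (the first two weather instances with the weights mentioned above), and `split_weight` is a hypothetical helper, not part of Weka. It handles dense lines only.

```python
import re

# A trailing ", {number}" at the end of a data line is the instance weight.
_WEIGHT = re.compile(r",\s*\{\s*([0-9.eE+-]+)\s*\}\s*$")


def split_weight(line):
    """Return (values, weight); the weight defaults to 1.0 when absent."""
    match = _WEIGHT.search(line)
    if match is None:
        weight, body = 1.0, line
    else:
        weight, body = float(match.group(1)), line[:match.start()]
    values = [v.strip() for v in body.split(",")]
    return values, weight


print(split_weight("sunny, hot, high, FALSE, no, {0.5}"))
print(split_weight("sunny, hot, high, TRUE, no, {2.0}"))
print(split_weight("overcast, hot, high, FALSE, yes"))
```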
There are also date attributes.
I won't go into the format.
You can have relational attributes, which
are really intended for multi-instance learning,

English: 
which we haven't touched upon in this course.
There's an XML version of the ARFF format
called XRFF
(I don't know how to pronounce that).
The Explorer can read and write XRFF files.
It's very verbose.
Here's an example.
We've got a header, and then at the end of
the header, we've got the body.
The header contains the ARFF header, and the
body contains the data, the instances.
In the header, there's a bit for each attribute
where it specifies the name of the attribute
and the type of the attribute, and if it's
a nominal attribute, the possible labels for
it, for each attribute.
In the body, we say <instances>, and within
<instances>, we have <instance>: define the first
instance, define the attribute values.
Then we would follow that with another instance
defined in the second <instance>, and so on.
It's the same information as in ARFF files.
It's clearly very verbose.
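
To give a feel for that verbosity, here is a Python sketch that builds an XRFF-style document for two weather instances with the standard library. The element names (`dataset`, `header`, `attributes`, `attribute`, `labels`, `label`, `body`, `instances`, `instance`, `value`) follow Weka's XRFF documentation as I understand it; treat the exact layout as an assumption and check it against a file saved from the Explorer. `build_xrff` is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def build_xrff(relation, header, rows):
    """Build a header (attribute declarations) and a body (instances)."""
    dataset = ET.Element("dataset", name=relation)

    # Header: one <attribute> per attribute, with its labels if nominal.
    attributes = ET.SubElement(ET.SubElement(dataset, "header"), "attributes")
    for name, labels in header:
        attribute = ET.SubElement(attributes, "attribute",
                                  name=name, type="nominal")
        labels_el = ET.SubElement(attribute, "labels")
        for label in labels:
            ET.SubElement(labels_el, "label").text = label

    # Body: one <instance> per data line, one <value> per attribute.
    instances = ET.SubElement(ET.SubElement(dataset, "body"), "instances")
    for row in rows:
        instance = ET.SubElement(instances, "instance")
        for value in row:
            ET.SubElement(instance, "value").text = value
    return dataset


header = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
rows = [["sunny", "FALSE", "no"], ["overcast", "FALSE", "yes"]]

root = build_xrff("weather.sketch", header, rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Two three-attribute instances that fit on two short lines in ARFF already need dozens of tags here.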

Chinese: 
这也不是我们这节课的内容。
ARFF格式有一种XML版本，叫做XRFF
（我不确定读音）。
Explorer 可以读取和写入XRFF文件。
（XRFF文件）非常冗长。
来看一个例子。
我们有一个文件抬头，然后在抬头的结尾处，是文件的主体。
抬头包括ARFF文件的抬头，主体包括数据和实例。
在抬头中，这部分是给属性的，说明它的名字和属性类别，
如果是名词性属性，要列举出属性值，
所有可能的属性值。
主体当中，我们用<instances>表示，在<instances>中间是<instance>：
定义第一个实例，定义属性值。
然后我们接着在第二个<instance>中定义下一个实例，等等。
这和ARFF文件包含的信息是一样的。
显然这很冗长。

Chinese: 
我们也可以像ARFF文件那样设置实例的权重。
我们可以比ARFF文件做的更多一些。
XML格式，允许我们指定“类”属性--记得，Weka
默认最后一个属性是类。
ARFF文件不允许更改默认类，但是XRFF文件可以。
我们可以设定属性的权重，从而得到加权的属性。
XRFF还有一种压缩格式：xrff.gz。
Explorer同样可以读写这些文件。
所以你应该知道这些。
就是这些。
ARFF还有其他一些你不知道的特性：稀疏格式，实例加权，
日期属性和关系属性。
有些过滤器和分类器利用稀疏性使运算变得
在时间上和空间上更加有效率。

English: 
You can have instance weights, as you can
with ARFF files.
You can do a little bit more than you can
with ARFF files.
In the XML format, you can specify which is
the "class" attribute -- remember, Weka
assumes by default that the last attribute
is the class.
There's no way to change that in an ARFF file,
but there is in an XRFF file.
You can also specify attribute weights to
have weighted attributes.
There's a compressed version of this: .xrff.gz.
The Explorer can read and write those files,
as well.
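
The extra XRFF features can be sketched in Python as well. The markup below is an assumption based on Weka's XRFF documentation as I recall it: `class="yes"` on the class attribute, a `weight` attribute on `<instance>`, and an attribute weight stored as a `<property name="weight">` inside `<metadata>`. Verify these against a file written by the Explorer before relying on them. The gzip round trip only illustrates that .xrff.gz is the same text, compressed.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumed XRFF markup: the class attribute is declared first here,
# which a plain ARFF file could not express.
XRFF_TEXT = """\
<dataset name="weather.sketch">
  <header>
    <attributes>
      <attribute class="yes" name="play" type="nominal">
        <labels><label>yes</label><label>no</label></labels>
      </attribute>
      <attribute name="windy" type="nominal">
        <labels><label>TRUE</label><label>FALSE</label></labels>
        <metadata><property name="weight">0.5</property></metadata>
      </attribute>
    </attributes>
  </header>
  <body>
    <instances>
      <instance weight="0.5"><value>no</value><value>FALSE</value></instance>
      <instance><value>no</value><value>TRUE</value></instance>
    </instances>
  </body>
</dataset>
"""

root = ET.fromstring(XRFF_TEXT)
attrs = root.findall("./header/attributes/attribute")

# Which attribute is flagged as the class?
class_name = next(a.get("name") for a in attrs if a.get("class") == "yes")

# Instance weights default to 1.0 when the attribute is missing.
instance_weights = [float(i.get("weight", "1.0")) for i in root.iter("instance")]

# Attribute weights, for the attributes that declare one.
attribute_weights = {
    a.get("name"): float(p.text)
    for a in attrs
    for p in a.findall("./metadata/property[@name='weight']")
}

# .xrff.gz is the same XML, gzip-compressed.
compressed = gzip.compress(XRFF_TEXT.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")

print(class_name, instance_weights, attribute_weights)
```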
So you should know about that.
That's it.
ARFF has some extra features that you didn't
know about: the sparse format, instance weights,
date attributes, and relational attributes.
Some filters and classifiers take advantage
of the sparsity to operate more efficiently
in both time and space.

English: 
XRFF is an XML equivalent of ARFF, plus some
additional features.
That's it.
There is an activity associated with this
lesson.
Off you go and do it, and we'll see you in
the last lesson.
Bye for now!

Chinese: 
XRFF是一种相当于ARFF的XML，它拥有一些新的特性。
就这样。
课后有本课相关的练习。
开始练习吧。我们下节课见。
再见！
