
English: 
Hi! Welcome back for another
five minutes in New Zealand
with Data Mining with Weka.
This is Lesson 1.3, and we're going
to look at exploring datasets
in this lesson.
We looked at this data file in the 
last lesson. It's the
weather data
toy dataset, of course. It has
fourteen days, or
instances, and each instance, 
each day, is described by
five attributes,
four to do with the weather, and
the last attribute,
which we called the class,
the thing that we're trying to 
predict, whether or not to play this
unspecified game.

English: 
This is called a classification problem.
We're trying to predict the class value.
Let's open up Weka.
It's here on my desktop.
I'm going to go into the Explorer.
We always use the Explorer.
I'm going to open the file.
I put the datasets in my Documents folder,
so I can see them here.
Just open
the Weka datasets and 
the nominal weather data.
There's the weather data in Weka.
As we saw last time,
you can see the size of the dataset, 
the number of instances—fourteen—
you can see the attributes,
you can click any of these attributes
and get the values for those attributes
up here in this panel.
You also get at the bottom
a histogram of the attribute values
with respect to the different
class values. The different class
values are

English: 
blue for yes, play and
red for no, don't play.
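
The histogram described here can be imitated in a few lines of Python. This is only a sketch of the idea, not how Weka draws it: the (outlook, play) pairs are assumed from the standard nominal weather data, and the variable names are made up for illustration.

```python
from collections import Counter

# (outlook, play) pairs for the 14 days. Assumed from the standard nominal
# weather data, so treat the values as illustrative.
days = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

# Count how often each (attribute value, class value) pair occurs.
counts = Counter(days)

for outlook in ("sunny", "overcast", "rainy"):
    yes = counts[(outlook, "yes")]
    no = counts[(outlook, "no")]
    # One text bar per class value, standing in for the two colors.
    print(f"{outlook:9s} yes {'#' * yes:5s} no {'#' * no}")
```

Each row corresponds to one bar of the histogram, split by class value.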
By default,
Weka takes the last attribute as the class.
You can change this if you like. If you
change it here you can decide to
predict a different one other than the last
attribute.
That's the weather dataset, and
we've already explored that.
As I said, it's a classification problem,
sometimes called a supervised learning
problem. Supervised
because you get to know the
class values of the training instances.
We take as input a dataset of 
classified examples; 
these are independent 
examples with a class value attached.
 
The idea is to produce automatically 
some kind of model 
that can classify new examples. 
That's the classification problem. 
Here is what the examples 
look like. This is an instance, with 
the different attribute values 

English: 
a fixed set of features,
 and then we add to that 
the class to get the classified example. 
That's what we have to 
have in our training dataset. 
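
As a rough illustration of "features plus class", here is a small Python sketch. The helper `split_instance` is hypothetical, not part of Weka, and the first-day values are assumed from the nominal weather data. It takes the last attribute as the class by default and lets you pick another one.

```python
def split_instance(instance, class_index=-1):
    """Split one instance into (feature values, class value)."""
    values = list(instance)          # copy, so the caller's list is untouched
    label = values.pop(class_index)  # remove and return the class value
    return values, label

attribute_names = ["outlook", "temperature", "humidity", "windy", "play"]
first_day = ["sunny", "hot", "high", "FALSE", "no"]

# Default: the last attribute (play) is the class.
features, label = split_instance(first_day)
print(features, "->", label)

# Choosing a different attribute to predict, here outlook.
features2, label2 = split_instance(first_day, attribute_names.index("outlook"))
print(features2, "->", label2)
```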
 
These attributes or features 
can be discrete or continuous. 
What we 
looked at in the weather data were 
discrete, or we call them nominal, 
attribute values where they 
belong to a certain fixed set, 
or they can be numeric 
or continuous values. 
Also, the class can be discrete or 
continuous. We're looking at a discrete class, 
yes or no, in the case of the weather 
data. Another kind of machine 
learning problem would involve 
continuous classes, where 
you're trying to predict a number. 
That's called a regression problem
 in the trade.
I'm going to have a look at a similar 
dataset to the weather dataset. 
The numeric weather
dataset. 
Let me just open that in Weka, 

English: 
weather.numeric.arff. 
Here it is. It's very similar, 
almost identical in fact, 
with 14 instances, 5 attributes, the same attributes. 
Maybe I should just look at this dataset 
in the edit panel. 
You can see here that two of the 
attributes—temperature and humidity—
are numeric attributes, whereas 
previously they were nominal
attributes. So here there are numbers.
When we look at 
the attribute values for outlook, just as 
before, we have 
sunny, overcast and rainy. 
For temperature, though, 
we can't enumerate the values, 
there are too many numbers to enumerate. 
Instead we get the minimum and maximum 
values, the mean, and the standard deviation. 
That's what Weka gives you 
for the numeric values.
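
The same four statistics are easy to compute by hand. The sketch below uses Python's standard `statistics` module. The temperature values are assumed from the numeric weather data and should be treated as illustrative, and the sample standard deviation is used here, which may or may not be exactly what Weka reports.

```python
import statistics

# Temperature for the 14 days. Assumed values, for illustration only.
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

stats = {
    "minimum": min(temperature),
    "maximum": max(temperature),
    "mean": statistics.mean(temperature),
    # Sample standard deviation (divides by n - 1).
    "stdev": statistics.stdev(temperature),
}

for name, value in stats.items():
    print(f"{name:8s} {value:.3f}")
```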
 
I'm going to look at a different dataset. 
I'm going to look at the glass dataset, 
which is a rather more extensive dataset. 
It's a real world dataset, 

English: 
not a terribly big one. 
Let's open it. 
Here we've got 214 instances 
and 10 attributes. 
Here are the 10 attributes, 
it's not clear what they are. 
Let's look at the class, 
by default the last 
attribute shown. 
There are seven values for the class, 
and the labels of these values give 
you some indication of what 
this dataset is about. 
We have headlamps, tableware, and containers. 
Then we have building and vehicle windows, 
both float and non-float. 
You may not know this, but there are 
different ways of making glass, and 
the floating process is a way of making glass. 
These are seven different kinds of glass. 
What are the attribute values? 
I don't know what you remember about physics, 
 
and I guess it doesn't 
matter if you don't remember. 
RI stands for the refractive index. 

English: 
It's always a good idea to check for 
reasonableness when you're looking at 
datasets. It's really important to 
get down and dirty with your data. 
Here we're looking at the values of the 
refractive index: a minimum of 1.511
and a maximum of 1.534. 
It's good to think about whether these are 
reasonable values for refractive index. If you 
go to the web and have a look around, 
you'll find that these are 
good values for 
the refractive index.
Na. 
If you did chemistry, you'll recognize Na as sodium. 
Here, it looks like these are percentages, 
the different percentages of sodium. 
Magnesium, Mg, 
and so on. We would expect silicon (Si) 
to make up the majority of glass. 
It varies between 69.81% 
and 75.41%.  
These are percentages of 
different elements in the glass. 

English: 
We can confirm our guesses 
here by looking at the data file itself. 
Let me just find the glass data. 
It's in Weka datasets, 
and it's glass.arff. 
This is the ARFF 
file format. 
It starts with a bunch of comments about 
the glass database. These lines beginning 
with percentage signs (%) are comments. 
You can read through them later; 
we don't have time now.
The comments describe the 
attributes, and they do say that 
the attributes are 
refractive index, sodium, magnesium, and so on. 
And the type of glass, just as I said, covers 
windows, containers, tableware, and so on.
We can get down to the end of the comments, 
and here we have stuff for Weka. This is 
the ARFF format. The relation has a 
name, 
you'll see it printed in 
the interface when you look. 

English: 
The attributes are defined, 
they are real valued attributes,
 numeric attributes. 
The type 
attribute is nominal, and 
the different values of type are
 enumerated here in quotes. 
That defines the relation 
and the attributes. Then we have an
 '@data' line, and following that in the 
ARFF format, are simply the instances, 
one after the other, with the attribute 
values all on one line, ending with 
class by default. This is the 
class value for the first instance. 
I think there are 214 
instances here. 
There's the last one. 
That's the ARFF format. It is a very simple, 
textual file format. 
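
Because the format is so simple, a few lines of Python are enough to read the basic structure. This is a minimal sketch, not Weka's own reader: `parse_arff` is a hypothetical helper, the sample text is hand-written rather than copied from glass.arff, and quoted names containing spaces, missing values, and the sparse format are not handled.

```python
SAMPLE = """\
% A tiny hand-written example, not the real glass.arff
@relation weather.sample
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

def parse_arff(text):
    """Return (relation name, [(attribute name, type)], [instances])."""
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # After @data, each line is one instance, comma-separated.
            instances.append([v.strip() for v in line.split(",")])
        elif line.lower().startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif line.lower().startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return relation, attributes, instances

relation, attributes, instances = parse_arff(SAMPLE)
print(relation, len(attributes), "attributes,", len(instances), "instances")
```

The last value on each data line is the class, following the default described above.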
Now we've confirmed our guesses 
about these numbers being percentages 
and different elements. 
We can think about 
this some more. It's important 
then, that these numbers are

English: 
reasonable. If they went negative, for example, 
that would indicate some kind of corrupted 
value. You can't have a negative 
percentage. 
We're expecting silicon to 
be the majority component; 
we're expecting the refractive index to be 
in this kind of range. It's always a good 
idea when you get a dataset to just 
click around in the Weka interface 
and make sure things look real. 
Rather small amounts 
of aluminum in glass. I guess that's not surprising; 
I don't know very much about glass myself. 
We're just kind of checking for reasonableness here—
a very good thing to do.
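
A check like this can also be written down as code. The following Python sketch flags values outside an expected range. The helper `out_of_range` and the silicon values are hypothetical, with one deliberately corrupted entry to show what the check catches.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]

# Hypothetical silicon percentages; the last one is deliberately corrupted.
silicon = [72.61, 69.81, 75.41, -73.2]

# A percentage cannot be negative or greater than 100.
suspicious = out_of_range(silicon, 0.0, 100.0)
print(suspicious)
```

The same helper could be pointed at the refractive index with whatever range you found to be plausible.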
That's it then. 
In this lesson, we've 
looked at the classification problem. 
We've looked at the nominal weather 
data and the numeric weather data. 
We've talked about 
nominal versus numeric attributes, 
and we've 
talked about the ARFF file format. 
We've looked at the glass.arff 
dataset, 
and I've talked about sanity checking 
of attributes, and the importance of 

English: 
getting down and dirty with your data. 
If you'd like some further background 
on this, you can read Section 11.1 
of the text and read about 
Preparing the data and Loading the data 
into the Explorer.
Whether or not you do that, 
please go and look at the activity 
associated with this lesson.
We'll see you soon. Bye!

