
Chinese: 
大家好！昨天晚上，我看了一个非常不错的电影《了不起的盖茨比》。
我希望你们不要介意我喝完我的马丁尼酒。无论如何，这节课我还要再强调
接近你数据的必要性，全面地了解它。
这是第一部分的最后一节课，我们将学习如何建立数据的可视化模型。
这就是我们接下来要做的。我们使用可视化面板。我要载入
iris数据集。你在之前的练习中已经使用过iris数据集了。

English: 
Hi! I went to see that movie The Great Gatsby
last night. I thought that was really good.
I hope you don't mind if I finish off my martini.
Anyway, one of the constantly recurring themes
in this course is the necessity to get close
to your data, look at it in every possible
way. In this last lesson of the first class,
we're going to look at visualizing your data.
This is what we're going to do. We're going
to use the Visualize panel. I'm going to open
the iris dataset. You came across the iris
dataset in one of the activities, I think.

Chinese: 
我们选择载入它是因为这是一个带有数值属性的数据集，有四个数值属性：花萼长、花萼宽、
花瓣长、花瓣宽；同时，包含三种鸢尾花：鸢尾花setosa, 鸢尾花versicolor
和鸢尾花virginica。让我们切换到可视化面板，建立
这个数据集的可视化模型。这有一个二维的五乘五的矩阵图。
选择其中之一，我们看x轴为花萼宽，y轴为花瓣宽的散点图。
这是一个散点图。不同的颜色代表三种不同的鸢尾花种类。我们也可以改变
颜色。如果不喜欢某种颜色，可以选择其他的颜色，不过，现在
我不打算这么做。单击一个单独的数据点，可以查看这个点的详细信息。这是

English: 
I'm using it because it has numeric attributes,
four numeric attributes: sepallength, sepalwidth,
petallength, petalwidth. The classes are the
three kinds of iris flower: Iris-setosa, Iris-versicolor,
and Iris-virginica.
Let's go to the Visualize panel and visualize
this data. There is a matrix of two dimensional
plots, a five-by-five matrix of plots.
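
As a rough illustration of where the five-by-five figure comes from, the sketch below pairs every attribute with every other one. It assumes the matrix has one panel per ordered (x, y) pair over the four numeric attributes plus the class; the variable names are made up for this example and are not part of Weka.

```python
from itertools import product

# The four numeric iris attributes named in the lesson, plus the class.
attributes = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]

# One panel per ordered (x, y) pair, including each attribute against itself.
panels = list(product(attributes, repeat=2))

print(len(panels))  # 5 attributes x 5 attributes
print(("sepalwidth", "petalwidth") in panels)  # the panel picked in the lesson
```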
If I select one of these plots, I'm going
to be looking at a plot of sepalwidth on
the x-axis and petalwidth on the y-axis.
That's a plot of the data. The colors correspond
to the three classes. I can actually change
the colors. If I don't like those, I could
select another color, but I'm going to leave
them the way they are. I can look at individual
data points by clicking on them. This is talking

English: 
about instance number 86 with a sepallength
of 6, sepalwidth of 3.4, and so on.
That's a versicolor, which is why this spot is colored
red. We can look at individual instances.
We can change the x- and y-axes using
the menus here. Better still, if we click
on this little set of bars here, these represent
the attributes. I'm going to click on this
and the x-axis will change to sepallength.
Here the x-axis is sepalwidth. Here the x-axis
is petallength, and so on. If I right click,
then it will change the y-axis to sepallength.
So, I can quickly browse around these different
plots. There is a Jitter slider.

Chinese: 
第86个实例，花萼长等于6，花萼宽等于3.4, 等等。
这是一个鸢尾花versicolor的实例，这就是为什么这个点是红色的。我们可以查看单独的实例。
我们可以在下拉菜单中选择不同的x轴和y轴。更简单的方法是，单击
这些代表不同属性的小横条。左键单击这里，
x轴就会改变为花萼长。单击这里，x轴就会改变为花萼宽。单击这里，x轴
就会改变为花瓣长。右键单击这里，y轴就会改变为花萼长。
这样，我们就可以快速地浏览这些不同的图。这是一条可以调节抖动（jitter）的滑块。

Chinese: 
有时候，多个点叠加在同一个位置，通过抖动（jitter）可以给x轴和y轴
增加点随机性。增加了抖动之后，深色的点代表了多个实例。
如果点击其中一个，我们可以看到这个点代表了三个单独的实例。
这三个都是鸢尾花setosa的实例，并且它们有相同的花瓣长度，
和相同的花萼宽度。
它们都标绘在这个散点图上。这三个实例的花萼宽和花瓣长
分别是3.0和1.4。
点击另外一个点。这个点包含两个实例，它们的花萼宽和花瓣长很接近，
都是鸢尾花versicolor。
抖动（jitter）滑块可以帮助你区分实际位置特别近的点。

English: 
Sometimes, points sit right on top of each other, and
jitter just adds a little bit of randomness
to the x- and the y-axis. With a little bit
of jitter on here, the darker spots represent
multiple instances. If I click on one of those,
I can see that that point represents three
separate instances, all of class Iris-setosa,
and they all have the same value of petallength
and sepalwidth, both of which are being plotted
on this graph. The sepalwidth and petallength are 3.0 and 1.4
for each of the three instances.
If I click another one here, these are two instances
with very similar sepalwidths and petallengths,
both of the class versicolor.
The jitter slider helps you distinguish between
points that are in fact very close together.
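
The idea behind jitter can be sketched in a few lines of Python. This is only a minimal illustration, assuming uniform noise of a fixed size on both axes; Weka's own jitter may be implemented differently, and the function name `jitter` and its parameters are invented for this example.

```python
import random

def jitter(points, amount, seed=None):
    """Return a copy of points with uniform noise in [-amount, amount]
    added independently to the x and y coordinates."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# Three instances with sepalwidth 3.0 and petallength 1.4 land on one spot.
overlapping = [(3.0, 1.4), (3.0, 1.4), (3.0, 1.4)]

# After jittering, each instance is drawn at a slightly different position.
spread = jitter(overlapping, amount=0.05, seed=1)
for x, y in spread:
    print(f"{x:.3f}, {y:.3f}")
```

The noise only changes where the points are drawn. The underlying attribute values stay the same, which is why clicking a jittered spot still reports the original numbers.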

English: 
Another thing we can do is select bits of
this dataset. I'm going to choose select rectangle
here. If I draw a rectangle now, I can select
these points. If I were to submit this rectangle,
then all other points would be excluded and
just these points would appear on the graph,
with the axes re-scaled appropriately. Here
we go. I've submitted that rectangle, and
you can see that there's just the red points
and green points there. I could save that
if I wanted as a different dataset, or I could
reset it and maybe try another kind of selection
like this, where I'm going to have some blue
points, some red and some green points and
see what that looks like. This might be a
way of cleaning up outliers in your data,
by selecting rectangles and saving the new dataset.
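
Selecting a rectangle and re-scaling the axes amounts to a simple filter followed by recomputing the ranges. The sketch below is an assumption about the general idea, not Weka's code; the function names and the sample (sepalwidth, petalwidth) values are made up for illustration.

```python
def select_rectangle(points, x_min, x_max, y_min, y_max):
    """Keep only the points that fall inside the rectangle (edges included)."""
    return [
        (x, y) for x, y in points
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]

def axis_ranges(points):
    """Return ((x_low, x_high), (y_low, y_high)) for re-scaling the plot."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), max(xs)), (min(ys), max(ys))

# Made-up (sepalwidth, petalwidth) pairs, not taken from the real dataset.
points = [(3.5, 0.2), (3.0, 1.4), (2.8, 1.5), (3.2, 2.3), (2.2, 1.0)]

kept = select_rectangle(points, x_min=2.5, x_max=3.3, y_min=1.0, y_max=2.5)
print(kept)               # points inside the rectangle
print(axis_ranges(kept))  # new axis limits after submitting the rectangle
```

Points outside the rectangle are dropped from the list entirely, so saving `kept` would correspond to saving the reduced dataset.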
That's visualizing the dataset itself. What
about visualizing the result of a classifier?

Chinese: 
另外，我们可以选择数据集的一部分。我们来选择rectangle。
通过（在散点图上）画一个矩形，我们可以选定这些点。如果我提交这个矩形，
那么所有这个矩形之外的点都不再出现，图上只显示选定的点。
当然，比例经过了重新调整。我们已经提交了矩形，
现在图上只剩下红色和绿色的点了。 如果需要的话，
我们可以保存这个为不同的数据集。我们也可以复原，重新去选。
选择一些蓝色、红色和绿色的点，
看看会怎样。选择矩形范围并且保存为新的数据集
是清理数据中异常数据的一个方法。
这就是可视化数据模型。那么，可视化分类结果是什么样呢？

Chinese: 
让我们退出这个可视化面板，回到预处理面板，
我将试一下分类器， J48。 我们可以在（分类器列表的）树形分类器下找到它。
运行J48。然后如果我们右击在日志区的这条记录，可以查看
分类器的误差。 这里我们得到实际分类和预测分类对照图。这些方块
代表误差。如果我们单击横条之一，就可以改变坐标轴。
我可以改变x轴和y轴， 但我还是设回为class和predictedclass
如果我们单击其中的一个方块，就可以看到误差。

English: 
Let's get rid of this Visualize panel and
go back to the Preprocess panel. I'm going to
use a classifier. I'm going to use, guess
what, J48. Let's find it under trees. I'm
going to run it. Then if I right click on
this entry here in the log area, I can view
classifier errors. Here we've got the class
plotted against the predicted class. The square
boxes represent errors. If I click on one
of these, I can, of course, change the different
axes if I want. I can change the x-axis and
the y-axis, but I'm going to go back to class
and predictedclass. If I click on one of these
boxes, I can see where the errors are.

Chinese: 
这里有两个实例，预测分类为鸢尾花versicolor，实际分类为鸢尾花virginica。
我们可以在混淆矩阵看到这两个实例。实际分类为鸢尾花virginica，但是
预测分类为鸢尾花versicolor，也就是分类“b”。混淆矩阵中的这两条记录代表了
这两个实例。如果我们看一下另外一个点，比如这个。这里我们得到了一个实例，
这个实例的实际分类是鸢尾花setosa，预测分类是鸢尾花versicolor。我们可以通过看这个图，找到
实际发生的错误分类，也就是混淆矩阵中的误差。
深入了解你的数据，并且建立可视化模型。你可以做各种各样的事情。

English: 
There are two instances where the predicted class
is versicolor and the actual class is virginica.
We can see these in the confusion matrix.
The actual class is virginica, and the predicted
class is versicolor, that's 'b'. This 2 entry
in the confusion matrix is represented by these
2 instances here. If I look at another point,
say this one, here I've got one instance which
is in fact a setosa predicted to be a versicolor.
I can look at this plot and find out where
the misclassifications are actually occurring,
the errors in the confusion matrix.
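
The link between the error plot and the confusion matrix can be sketched as follows. This is a minimal illustration with toy labels, not the real J48 output; the helper names `confusion_counts` and `misclassified` are invented for this example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def misclassified(actual, predicted):
    """List (index, actual, predicted) for every instance that was misclassified."""
    return [
        (i, a, p)
        for i, (a, p) in enumerate(zip(actual, predicted))
        if a != p
    ]

# Toy labels for illustration only: two virginica instances are
# predicted as versicolor, the rest are classified correctly.
actual = ["Iris-virginica", "Iris-virginica", "Iris-virginica",
          "Iris-setosa", "Iris-versicolor"]
predicted = ["Iris-versicolor", "Iris-versicolor", "Iris-virginica",
             "Iris-setosa", "Iris-versicolor"]

counts = confusion_counts(actual, predicted)
errors = misclassified(actual, predicted)

# The off-diagonal count and the list of error instances describe the same thing.
print(counts[("Iris-virginica", "Iris-versicolor")])
print(errors)
```

Each off-diagonal count corresponds to a group of square boxes in the plot, and the list of errors corresponds to the individual instances you see when you click on one of them.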
Get down and dirty with your data and visualize
it. You can do all sorts of things. You can

English: 
clean it up, detect outliers. You can look
at the classification errors. For example,
there's a filter that allows you to add the
classifications as a new attribute.
Let's just go and have a look at that. I'm going
to go and find a filter. We're going to add
an attribute. It's supervised because it uses
a class. Add an attribute, and AddClassification.
Here I get to choose in the configuration
panel, the machine learning scheme. I'm going
to choose J48, of course, and I'm going to
set outputClassification to True.
That's configured it, and I'm going to apply it.
It will add a new attribute. It's done it,
and this attribute is the classification according
to J48. Weka is very powerful. You can do
all sorts of things with classifiers and filters.
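
Conceptually, adding the classification as a new attribute means running a classifier over each instance and storing its prediction in an extra column. The sketch below shows that idea in plain Python. It does not call Weka: `toy_rule` is a hypothetical stand-in for a trained J48 tree, its thresholds are assumptions, and the rows are made-up values.

```python
def add_classification(rows, classify, name="classification"):
    """Return new rows with the classifier's prediction stored under `name`."""
    return [{**row, name: classify(row)} for row in rows]

def toy_rule(row):
    """Hypothetical stand-in for a trained classifier (not the real J48 tree)."""
    if row["petalwidth"] <= 0.6:
        return "Iris-setosa"
    if row["petalwidth"] <= 1.7:
        return "Iris-versicolor"
    return "Iris-virginica"

# Made-up instances for illustration only.
rows = [
    {"petallength": 1.4, "petalwidth": 0.2, "class": "Iris-setosa"},
    {"petallength": 4.5, "petalwidth": 1.6, "class": "Iris-versicolor"},
    {"petallength": 5.1, "petalwidth": 1.8, "class": "Iris-virginica"},
]

augmented = add_classification(rows, toy_rule)
for row in augmented:
    print(row["class"], "->", row["classification"])
```

The original rows are left untouched and the prediction sits alongside the true class, so the two can be compared or plotted against each other.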
That's the end of the first class.
There's a section of the book on Visualization. Please
go and do the activity associated with this

Chinese: 
你可以清理你的数据，删除异常数据。你可以观察分类误差。例如，
有一种过滤器可以添加类为一个新的属性。
让我们去看看。找到这个过滤器，添加
一个属性。这是个监督过滤器，因为它用到了类。添加一个属性，用过滤器AddClassification。
这里，我们打开配置面板，机器学习方案，
选择J48，将 outputClassification设置为True。
完成配置。现在应用这个过滤器。它将添加一个新的属性。完成了。
这个新增的属性是根据J48分类的结果。Weka的功能非常强大，你可以
利用分类器和过滤器做各种各样的事情。这节课到此为止。
书里面有一节关于可视化的内容。请大家完成本课课后练习。

English: 
lesson, and I'll see you in the next class. Bye!

Chinese: 
我们下节课见。再见！
