
English: 
Hello again! This is the last class of Data
Mining with Weka, and we're going to step
back a little bit and take a look at some
more global issues with regard to the data
mining process.
It's a short class with just four lessons:
the data mining process, pitfalls and pratfalls,
data mining and ethics, and finally, a quick
summary.
Let's get on with Lesson 5.1.
This might be your vision of the data mining
process.
You've got some data or someone gives you
some data.
You've got Weka.

Chinese: 
大家好！这是“用Weka进行数据挖掘”课程的最后一部分。
我们退一步，看看数据挖掘过程中的
一些更全局性的问题。
这部分比较短，只有四次课：数据挖掘过程，数据挖掘失误和缺陷，
数据挖掘和道德规范，最后，课程总结。
我们来学习课程5.1。
这幅图可能是你对数据挖掘过程的想象。
你手头有数据，或者别人给你一些数据。
你拿到Weka。

Chinese: 
你应用Weka分析数据，得到一些很酷的结果。每个人都很满意。
如果是这样，告诉你个坏消息，
事情完全不会那样。
说实话，这是对它的更好想象。
整个过程会是一个循环，你会不停的循环。
Weka是很重要 -- 它是圆圈的中心。
尽管它重要，它只是你要做的事的一小部分。
或许最大的问题是提出合适的问题。
你需要做的是回答这个问题，而不是盲目地处理数据。
然后，你需要收集些你能收集到的，能用数据挖掘技术
帮助你回答这个问题的数据。
收集数据是很难的。
你可能有一个初始数据集，但是你可能需要增加一些人口数据、

English: 
You apply Weka to the data, you get some kind
of cool result from that, and everyone's happy.
If so, I've got bad news for you.
It's not going to be like that at all.
Really, this would be a better way to think
about it.
You're going to have a circle; you're going
to go round and round the circle.
It's true that Weka is important -- it's in
the very middle of the circle here.
It's going to be crucial, but it's only a
small part of what you have to do.
Perhaps the biggest problem is going to be
to ask the right kind of question.
You need to be answering a question, not just
vaguely exploring a collection of data.
Then, you need to get together the data that
you can get hold of that gives you a chance
of answering this question using data mining
techniques.
It's hard to collect the data.
You're probably going to have an initial dataset,
but you might need to add some demographic

English: 
data, or some weather data, or some data about
other stuff.
You're going to have to go to the web and
find more information to augment your dataset.
Then you'll merge all that together: do some
database hacking to get a dataset that contains
all the attributes that you think you might
need -- or that you think Weka might need.
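
To make the merging step concrete, here is a minimal sketch in plain Python. It is not Weka code, and the attribute names (`postcode`, `spend`, `median_age`) and the data are invented for illustration. It keeps every row of the initial dataset and adds the extra attributes where a matching key exists, leaving a missing value otherwise.

```python
def merge_on_key(primary, extra, key):
    """Left join: keep every primary row, add the extra attributes by key."""
    lookup = {row[key]: row for row in extra}
    # Every attribute name that the extra dataset contributes.
    extra_fields = sorted({name for row in extra for name in row if name != key})
    merged = []
    for row in primary:
        combined = dict(row)
        match = lookup.get(row[key], {})
        for name in extra_fields:
            # None marks a missing value when there is no matching row.
            combined[name] = match.get(name)
        merged.append(combined)
    return merged


# Hypothetical initial dataset and hypothetical demographic data.
customers = [
    {"postcode": "3240", "spend": 120.0},
    {"postcode": "3200", "spend": 80.0},
]
demographics = [
    {"postcode": "3240", "median_age": 31},
]

merged = merge_on_key(customers, demographics, "postcode")
for row in merged:
    print(row)
```

The rows without a match are kept rather than dropped, so the missing values stay visible for the cleaning step that follows.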
Then you're going to have to clean the data.
The bad news is that real-world data is always
very messy.
That's a long and painstaking process of looking
around, looking at the data, trying to understand it,
trying to figure out what the anomalies
are and whether it's good to delete them or not.
That's going to take a while.
Then you're going to need to define some new
features, probably.
This is the feature engineering process, and
it's the key to successful data mining.
Then, finally, you're going to use Weka, of
course.
You might go around this circle a few times
to get a nice algorithm for classification,

Chinese: 
气候数据、或者其它数据。
你可以上网搜索一些信息来补充你的数据集。
然后，你可以把所有这些组合起来：创建一个数据集，它包含了
你认为有必要（ 或Weka需要）的属性。
然后你需要清理数据。
坏消息是现实生活中的数据总是非常杂乱。
研究数据、理解数据、找出异常、确定是否剔除某些数据
是一个漫长而痛苦的过程。
这会花费一些时间。
然后你或许会需要定义一些新的特征。
这是特征工程的过程，是数据挖掘能够成功的关键步骤。
最后，当然，你需要使用Weka.
你可能需要重复这个循环几次，才能得到好的分类算法。

English: 
and then you're going to need to deploy the
algorithm in the real world.
Each of these processes is difficult.
You need to think about the question that
you want to answer.
"Tell me something cool about this data" is
not a good enough question.
You need to know what you want to know from
the data.
Then you need to gather it.
There's a lot of data around, like I said
at the very beginning, but the trouble is
that we need classified data to use classification
techniques in data mining.
We need expert judgements on the data, expert
classifications, and there's not so much data
around that includes expert classifications,
or correct results.
They say that more data beats a clever algorithm.
So rather than spending time trying to optimize
the exact algorithm you're going to use in
Weka, your time might be better employed in
getting more and more data.

Chinese: 
然后你需要在现实世界中使用算法。
这个过程中的每一步都不容易。
你需要思考想回答的问题
“发现这组数据中酷的地方”并不是一个很好的问题。
你需要知道通过数据你想得到什么。
然后你需要收集数据。
就像我课程开始时说的，我们周围有很多数据，问题是
我们需要已分类数据以便在数据挖掘中使用分类技术。
我们需要专家对数据的判断，专家的分类，但有专家分类，
或准确结果的数据并不多。
人们说较多的数据会打败一个聪明的算法。
因此，与其花时间优化你将在Weka中使用的算法，
不如收集更多的数据。

Chinese: 
然后你需要清理数据，就像我之前说过，真正的数据非常杂乱。
观察数据，找出异常是痛苦的过程。
特征工程是下一步，数据挖掘的关键。
一会我们会谈及Weka如何帮助你。
然后你就要利用结果。
实施，这是容易的部分。
困难的部分是说服你的老板使用数据挖掘过程的结果。
也许他会觉得这很神秘，而不太信任。
在现实世界中付诸实施是很棘手的问题。
所有这些中主要的技术部分是特征工程，Weka中有许多过滤器
可以帮助我们。
这里有一些过滤器。
或许值得定义一个新的特征，一个新的属性，它是集合了
现有属性的数学表达。
或者你会想修改一个现有属性。

English: 
Then you've got to clean it, and like I said
before, real data is very mucky.
That's going to be a painstaking matter of
looking through it and looking for anomalies.
Feature engineering, the next step, is the
key to data mining.
We'll talk about how Weka can help you a little
bit in a minute.
Then you've got to deploy the result.
Implementing it -- well, that's the easy part.
The difficult part is to convince your boss
to use this result from this data mining process
that he probably finds very mysterious and
perhaps doesn't trust very much.
Getting anything actually deployed in the
real world is a pretty tough call.
The key technical part of all this is feature
engineering, and Weka has a lot of filters
that will help with this.
Here are just a few of them.
It might be worthwhile defining a new feature,
a new attribute that's a mathematical expression
involving existing attributes.
Or you might want to modify an existing attribute.

Chinese: 
通过AddExpression，你可以使用任何数学公式在现有属性基础上
创建新属性。
你或许想规范化、中心化或者标准化你的数据。
把数值属性的均值转换为零( 这就是“中心化”)。
或者是转化到给定的数值区间 （这就是“规范化”）。
或者转化成零均值和单位方差，这种在统计上叫做“标准化”。
你也许想把数值属性离散为名词性值。
Weka有监督和无监督属性离散过滤器。
还有许多其他的转换方法。
例如，主分量转换方法（ PrincipalComponents transformation），它用矩阵分析数据
来在线性空间选择主分量。
这是关于数学的。Weka有很好的实现。

English: 
With AddExpression, you can use any kind of
mathematical formula to create a new attribute
from existing ones.
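
As an illustration of the idea (a plain-Python sketch, not the AddExpression filter itself), the following derives a new attribute from two existing ones. The helper `add_attribute`, the attribute names and the body-mass-index formula are all assumptions chosen for the example.

```python
def add_attribute(rows, name, formula):
    """Return new rows with an extra attribute computed from existing ones."""
    return [{**row, name: formula(row)} for row in rows]


# Hypothetical data: two existing numeric attributes.
people = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": 64.0},
]

# New attribute defined as a mathematical expression over existing ones.
with_bmi = add_attribute(
    people, "bmi", lambda r: r["weight_kg"] / r["height_m"] ** 2
)
for row in with_bmi:
    print(row)
```

The original rows are left untouched, which makes it easy to try several candidate features and compare them.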
You might want to normalize or center your
data, or standardize it statistically.
Transform a numeric attribute to have a zero
mean -- that's "center".
Or transform it to a given numeric range -- that's
"normalize".
Or give it a zero mean and unit variance,
that's a statistical operation called "standardization".
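
The three operations just described can be sketched in a few lines of plain Python. This is an illustration of the arithmetic, not Weka's implementation. One assumption to flag: the sketch uses the population standard deviation, and a given tool may use the sample version instead.

```python
from statistics import mean, pstdev


def center(values):
    """Shift the values so that their mean is zero."""
    m = mean(values)
    return [v - m for v in values]


def normalize(values, low=0.0, high=1.0):
    """Rescale the values linearly into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]  # constant attribute: nothing to scale
    scale = (high - low) / (hi - lo)
    return [low + (v - lo) * scale for v in values]


def standardize(values):
    """Give the values zero mean and unit variance."""
    m, s = mean(values), pstdev(values)  # population std dev (assumption)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


temps = [10.0, 20.0, 30.0]  # made-up numeric attribute
print(center(temps))
print(normalize(temps))
print(standardize(temps))
```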
You might want to take those numeric attributes
and discretize them into nominal values.
Weka has both supervised and unsupervised
attribute discretization filters.
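
For a feel of what unsupervised discretization does, here is a minimal equal-width binning sketch in plain Python. It is one simple scheme chosen for illustration, not Weka's filter, and the bin labels and the `ages` data are invented. A supervised method would also use the class values to decide where to cut, which this sketch does not attempt.

```python
def discretize_equal_width(values, bins):
    """Map each numeric value to a nominal label using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return ["bin0" for _ in values]  # constant attribute: a single bin
    labels = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), bins - 1)
        labels.append(f"bin{index}")
    return labels


ages = [18, 25, 40, 63, 78]  # made-up numeric attribute
labels = discretize_equal_width(ages, 3)
print(labels)
```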
There are a lot of other transformations.
For example, the PrincipalComponents transformation
involves a matrix analysis of the data to
select the principal components in a linear space.
That's mathematical, and Weka contains a good
implementation.

Chinese: 
RemoveUseless可以删除一成不变或变化过多的属性。
实际上，我想我们在一个练习中见过。
然后，这里有几个过滤器可以帮助你处理时间序列，
当你的实例是随时间推移的一系列数据，
你或许希望找到实例之间的区别，
或者有一定间隔的实例的区别 （ 一个实例和在它之前5个实例，或者之前10个的区别）。
这只是Weka中能帮你做特征工程的几个过滤器。
这节课的主旨是Weka只是整个数据挖掘的一小部分，
最简单的部分。
这门课中，我们选择讲解了整个过程的最容易部分！
对不起。
在实践中，其它的部分会难得多。
这是一个老程序员的祝福：“希望你所有的问题都是技术问题”。

English: 
RemoveUseless will remove attributes that
don't vary at all, or vary too much.
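
The idea can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the RemoveUseless filter: here "varies too much" is taken to mean a string-valued attribute whose fraction of distinct values exceeds a threshold, and the threshold name `max_distinct_fraction`, its value and the data are all invented for the example.

```python
def remove_useless(rows, max_distinct_fraction=0.99):
    """Drop attributes that never vary, or nominal ones that vary too much."""
    keep = []
    for name in rows[0]:
        column = [row[name] for row in rows]
        distinct = len(set(column))
        constant = distinct == 1
        # Assumption: only string-valued (nominal) attributes can be "too varied".
        too_varied = (
            isinstance(column[0], str)
            and distinct / len(column) > max_distinct_fraction
        )
        if not (constant or too_varied):
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


rows = [
    {"id": "a1", "country": "NZ", "colour": "red", "score": 3.0},
    {"id": "a2", "country": "NZ", "colour": "blue", "score": 4.5},
    {"id": "a3", "country": "NZ", "colour": "red", "score": 5.0},
]
cleaned = remove_useless(rows)
print(cleaned[0])
```

In this made-up data, `country` is dropped because it never changes and `id` is dropped because every instance has its own value, so neither can help a classifier generalize.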
Actually, I think we encountered that in one
of our activities.
Then, there are a couple of filters that help
you deal with time series, when your instances
represent a series over time.
You probably want to take the difference between
one instance and the next, or a difference
with some kind of lag -- one instance and
the one 5 before it, or 10 before it.
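
A lagged difference of this kind is easy to sketch in plain Python. This is an illustration, not one of Weka's time-series filters, and the function name and the `prices` series are made up. The first `lag` positions have no earlier instance to compare with, so the sketch marks them as missing.

```python
def lagged_difference(series, lag=1):
    """Difference between each value and the value `lag` steps before it."""
    missing = [None] * min(lag, len(series))  # no earlier instance available
    diffs = [series[i] - series[i - lag] for i in range(lag, len(series))]
    return missing + diffs


prices = [10, 12, 15, 11, 18, 20]  # made-up series over time
print(lagged_difference(prices))         # one instance and the next
print(lagged_difference(prices, lag=5))  # one instance and the one 5 before it
```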
These are just a few of the filters that Weka
contains to help you with your feature engineering.
The message of this lesson is that Weka is
only a small part of the entire data mining
process, and it's the easiest part.
In this course, we've chosen to tell you about
the easiest part of the process! I'm sorry
about that.
The other bits are, in practice, much more
difficult.
There's an old programmer's blessing: "May
all your problems be technical ones".

English: 
It's the other problems -- the political problems
in getting hold of the data, and deploying
the result -- those are the ones that tend
to be much more onerous in the overall data
mining process.
So good luck!
There's some stuff about this in the course
text.
Section 1.3 contains information on Fielded
Applications, all of which have gone through
this kind of process in order to get them
out there and used in the field.
There's an activity associated with this lesson.
Off you go and do it, and we'll see you in
the next lesson.
Bye for now!

Chinese: 
其它问题 （ 获取数据所涉及的政治问题，以及实施结果的问题）
这些是在整个数据挖掘过程中艰巨得多的
任务。
祝好运！
教材中有一些相关内容。
第1.3节是关于现场应用的，为了能够在现实中应用，
它们都经历了这样的过程。
这节课有一个课后练习。
请完成它。下节课见。
再见！
