
Hi! Welcome to Lesson 5.3 of Data Mining with
Weka.
Before we start, I thought I'd show you where
I live.
I told you before that I moved to New Zealand
many years ago.
I live in a place called Hamilton.
Let me just zoom in and see if we can find
Hamilton in the North Island of New Zealand,
around the center of the North Island.
This is where the University of Waikato is.
Here is the university; this is where I live.
This is my journey to work: I cycle every
morning through the countryside.
As you can see, it's really nice.
I live out here in the country.

I'm a sheep farmer! I've got four sheep, three
in the paddock and one in the freezer.
I cycle in -- it takes about half an hour
-- and I get to the university.
I have the distinction of being able to go
from one week to the next without ever seeing
a traffic light, because I live out on the
same edge of town as the university.
When I get to the campus of the University
of Waikato, it's a very beautiful campus.
We've got three lakes.
There are two of the lakes, and another lake
down here.
It's a really nice place to work! So I'm very
happy here.
Let's move on to talk about data mining and
ethics.
In Europe, they have a lot of pretty stringent
laws about information privacy.
For example, if you're going to collect any
personal information about anyone, a purpose
must be stated.
The information should not be disclosed to
others without consent.

Records kept on individuals must be accurate
and up to date.
People should be able to review data about
themselves.
Data should be deleted when it's no longer
needed.
Personal information must not be transmitted
to other locations.
Some data is too sensitive to be collected,
except in extreme circumstances.
This is true in some countries in Europe,
particularly Scandinavia.
It's not true, of course, in the United States.
Data mining is about collecting and utilizing
recorded information, and it's good to be
aware of some of these ethical issues.
People often try to anonymize data so that
it's safe to distribute for other people to
work on, but anonymization is much harder
than you think.
Here's a little story for you.
When Massachusetts released medical records
summarizing every state employee's hospital
record in the mid-1990's, the Governor gave
a public assurance that it had been anonymized
by removing all identifying information -- name,
address, and social security number.

He was surprised to receive his own health
records (which included a lot of private information)
in the mail shortly afterwards! People could
be re-identified from the information that
was left there.
There's been quite a bit of research done
on re-identification techniques.
For example, using publicly available records
on the internet, 50% of Americans can be identified
from their city, birth date, and sex.
85% can be identified if you include their
zip code as well.
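The kind of linkage attack being described can be sketched in a few lines: join a supposedly anonymized dataset against a public one on the quasi-identifiers (zip code, birth date, sex) and keep the unique matches. All records below are invented for illustration.

```python
# Sketch of a linkage attack: joining an "anonymized" dataset with a
# public one on quasi-identifiers (zip code, birth date, sex).
# All records here are invented for illustration.

anonymized = [  # medical data with direct identifiers removed
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1972-01-02", "sex": "M", "diagnosis": "flu"},
]

public = [  # e.g. a voter roll, which still carries names
    {"name": "J. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "A. Jones", "zip": "02140", "dob": "1980-03-15", "sex": "M"},
]

def link(anon, pub):
    """Re-identify anonymized rows whose quasi-identifiers match
    exactly one public record."""
    hits = []
    for a in anon:
        matches = [p for p in pub
                   if (p["zip"], p["dob"], p["sex"]) ==
                      (a["zip"], a["dob"], a["sex"])]
        if len(matches) == 1:  # unique match -> re-identified
            hits.append((matches[0]["name"], a["diagnosis"]))
    return hits

print(link(anonymized, public))  # -> [('J. Smith', 'asthma')]
```

The point of the sketch is that the "anonymized" table never needed a name or social security number: the combination of a few innocuous fields is unique enough to act as one.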
There was some interesting work done on a
movie database.
Netflix released a database of 100 million
records of movie ratings.
They got individuals to rate movies [on the
scale] 1-5, and they had a whole bunch of
people doing this -- a total of 100 million
records.

It turned out that you could identify 99%
of people in the database if you knew their
ratings for 6 movies and approximately when
they saw them.
Even if you only know their ratings for 2
movies, you can identify 70% of people.
This means you can use the database to find
out the other movies that these people watched.
They might not want you to know that.
Re-identification is remarkably powerful,
and it is incredibly hard to anonymize data
effectively in a way that doesn't destroy
the value of the entire dataset for data mining
purposes.
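The Netflix-style attack differs from an exact join: the adversary knows only a few (movie, rating, approximate date) tuples and looks for the subscriber whose history is consistent with all of them. Here is a minimal sketch of that matching step; all subscriber data and the date tolerance are invented.

```python
from datetime import date

# Sketch of a ratings re-identification attack: find subscribers whose
# history matches every known (movie, rating, ~date) tuple.
# All data here is invented.

database = {  # subscriber_id -> list of (movie, rating, date)
    "user_1017": [("Movie A", 4, date(2005, 3, 2)),
                  ("Movie B", 1, date(2005, 3, 9)),
                  ("Movie C", 5, date(2005, 4, 1))],
    "user_2042": [("Movie A", 4, date(2005, 6, 20)),
                  ("Movie D", 3, date(2005, 7, 2))],
}

def candidates(known, db, day_slop=14):
    """Subscribers consistent with every known rating, allowing the
    observed date to be off by up to day_slop days."""
    out = []
    for uid, history in db.items():
        def matches(k):
            movie, rating, when = k
            return any(hm == movie and hr == rating
                       and abs((hd - when).days) <= day_slop
                       for hm, hr, hd in history)
        if all(matches(k) for k in known):
            out.append(uid)
    return out

known = [("Movie A", 4, date(2005, 3, 1)),    # target rated A 4 stars, early March
         ("Movie C", 5, date(2005, 3, 25))]   # and C 5 stars, late March
print(candidates(known, database))  # -> ['user_1017']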
Of course, the purpose of data mining is to
discriminate: that's what we're trying to do! 
We're trying to learn rules that discriminate
one class from another in the data -- who
gets the loan? -- who gets a special offer?
But, of course, certain kinds of discrimination
are unethical, not to mention illegal.
For example, racial, sexual, and religious
discrimination is certainly unethical, and
in most places illegal.

But it depends on the context.
Sexual discrimination is usually illegal ... except for doctors.
Doctors are expected to take gender into account
when they make their diagnoses.
They don't want to tell a man that he is pregnant,
for example.
Also, information that appears innocuous may
not be.
For example, area codes -- zip codes in the
US -- correlate strongly with race; membership
of certain organizations correlates with gender.
So although you might have removed the explicit
racial and gender information from your database,
it still might be able to be inferred from
other information that's there.
It's very hard to deal with data: it has a way of revealing secrets about itself in unintended ways.
Another ethical issue concerning data mining
is that correlation does not imply causation.

Here's a classic example: as ice cream sales
increase, so does the rate of drownings.
Therefore, ice cream consumption causes drowning?
Probably not.
They're probably both caused by warmer temperatures
-- people going to beaches.
What data mining reveals is simply correlations,
not causation.
Really, we want causation.
We want to be able to predict the effects
of our actions, but all we can look at using
data mining techniques is correlation.
To understand about causation, you need a
deeper model of what's going on.
I just wanted to alert you to some of the
issues, some of the ethical issues, in data
mining, before you go away and use what you've
learned in this course on your own datasets:
issues about the privacy of personal information;
the fact that anonymization is harder than
you think; re-identification of individuals
from supposedly anonymized data is easier
than you think; data mining and discrimination
-- it is, after all, about discrimination;
and the fact that correlation does not imply
causation.
There's a section in the textbook, Data mining
and ethics, which you can read for more background
information, and there's a little activity
associated with this lesson, which you should
go and do now.
I'll see you in the next lesson, which is
the last lesson of the course.
Bye for now!
