>> MEASE: So, I'll try and give you guys some
background about what's going on here if you
haven't figured it out from the e-mail.
So, this is a class called, "Statistical Aspects
of Data Mining."
I'm teaching it at Stanford this summer.
Today was the first day at Stanford just like
today is the first day here.
Basically, I just had the idea that I'm teaching
the course at Stanford anyway, it seems like
something that might be of interest to people
here at Google.
So, I work here at Google and I just teach
the class at Stanford.
So, I thought, well, since I'm going to come
to work every day after I teach the class,
why don't I just go ahead and teach it here?
And so people that are interested can sort
of sit here at Google and take the class here.
Because of that, sort of the slides and everything
you see is basically the exact same as the
slides that I'm presenting when I'm at Stanford.
So, some of the things like, for example,
number four up here, the fourth thing we did
at Stanford today was I took pictures of the
students so I could have pictures of them.
Obviously, I'm not going to take pictures
of you because your pictures are on MoMA,
et cetera.
So there's going to be certain things that
obviously don't apply to you and you just
sort of have to be a little bit patient with
that.
But for the most part, I'm going to follow
exactly the script from what I do at Stanford.
And to that end, the outline for today is
basically to go over the information on the
course webpage.
Run through chapter one of the textbook--I'll
talk about the textbook--and then talk about
the software that we're going to be using
for this class and how you can go about getting
it.
So, if there--unless there's any sort of pressing
questions before we begin, I'm going to start
going through these slides.
Question in the back?
>> [INDISTINCT] hear you back here.
>> MEASE: Okay.
So, is that better?
>> Yes.
>> MEASE: Okay.
I'll try and stay close to the microphone
and I'll just try and speak loudly when I
walk over to the board.
If you can't hear me, let me know.
>> You have a lapel mic.
Maybe...
>> MEASE: Yes, but they said this is just
for recording, for the video conferencing.
So maybe we can figure out how to get that
to work over the speaker.
Yes?
>> Textbook?
>> MEASE: Textbook, yes, we'll talk about
that right off the bat.
Okay.
So, textbook, okay.
This is available on Amazon for about 80 bucks.
It's called--well, I'll show you on the webpage.
It's called "Introduction to Data Mining."
Authors are Tan, Steinbach, Kumar.
We're trying to order some copies for you
guys but there's no way we're going to have
enough.
So, if you come next time, I may have some
to give you via some lottery system.
But you can go ahead and order it online if
you haven't already.
It's a good book to get if you want to just
sort of, you know, hope your officemate will
order one and, you know, you guys can share.
And a good--it's a good book to get and we
are going to be following it quite closely
and I'll talk more about it in a second.
>> Is it first edition or second edition?
>> MEASE: There's only one edition and I've
been told there may be a paperback available
if you're trying to save money.
But there's only one edition as far as I know.
>> What are the authors' names?
>> MEASE: Pardon me?
>> Can you repeat the authors' names?
>> MEASE: Yes, the authors are Tan, Steinbach,
Kumar.
And I'll show you where that's up on the webpage.
Okay, so the first thing to talk about is
the webpage.
It's www.stats202.com.
If you ever forget this, if you remember my last
name and look up my last name on Google, you'll
find my homepage and then there's a link from
my homepage.
But basically, stats202.com will have all
the information.
Now, again, this is the information for the
students at Stanford but it will be relevant
to you.
And so, if you just go see www.stats202.com
and so, this is what the webpage looks like.
Now, some things are not relevant to you.
You don't care about the current grades,
right?
Those are the grades for students at Stanford.
Homework and exam solutions you might care
about, right?
Because I'm going to give homework assignments,
if you want to play along and do them, you
know, the solutions then will be up there.
Obviously, we're not going to grade them but
there might be, you know, something extra
you can do.
And then the homework assignments, like I
said, they'll be linked from here, if you
want to do them, it's up to you and I will
be posting solutions for the students at Stanford.
Lecture 1 is linked here; these are just the
PowerPoint slides I have, so those will be
up there.
And then probably the most important thing
on the stats202.com webpage is the course
information.
If you click there, you go to this pink page
and it has--so, you know, don't use this
e-mail, use my Google email, of course.
Also, you know the e-mail.
Let me--in case any of you don't know it,
let me write it down.
This is datamining, one word, no underscore,
07@google.com.
That's the e-mail I set up.
You should have got an e-mail to that already
if you signed--I signed up basically everyone
on the Trix spreadsheet, so you should have
got an e-mail to that already.
If you haven't, it's public; you can go to
mailman and add yourself to that.
Phone, that's my cell phone, it's in [INDISTINCT]
office hours don't pertain to you.
TA, don't bother the TA, webpage, stats202.com,
okay.
Okay.
Yeah, he would get really confused.
Okay.
So, the textbook, we were talking about there
are the authors' names right there, Tan, Steinbach,
Kumar, "Introduction to Data Mining."
Like I said, I think it retails somewhere
between $60 and $80.
So, go ahead and get yourself a copy of that
or find someone else who's going to have a
copy of it and agree to share it.
So we are going to try and get some.
But it will have to be a lottery because there's
no way we have enough for everyone that's
going to be in the room.
Course description.
Okay.
So this is the Stanford generic catalog description,
so "Data mining is used to discover patterns
and relationships in data.
Emphasis is on large, complex data sets such
as those in very large databases or through
web mining."
Topics are going to be decision trees.
We will talk about neural networks.
We'll talk about association rules which if
you're coming from a stats background like
I am, that's something new.
We will talk about it.
Clustering, you've seen before, no doubt.
Case-based methods and data visualization.
And then we're going to basically follow the
textbook pretty closely.
So, first chapter is introduction, just sort
of a soft introduction, I'm going to go over
that today.
Second chapter is on data, basically types
of data, importing data, caveats about data.
We'll talk about that for about two lectures.
Chapter three is exploring data and for those
of you who know me, I love to make plots of
data and so I think that's very important
even though a lot of people think it's trivial.
So, we'll spend, I think, at least three lectures
on chapter three talking about different ways
of summarizing data through graphs and tables
and then chapter six, association analysis,
basic concepts and algorithms.
I have a break right there because that's
when the students at Stanford are going to
be taking a midterm.
If you want I can, you know, e-mail you guys
the midterm if you want, you know, to sort
of quiz yourself.
What it might mean practically for us is before
chapter--between chapter six and four, we
might have a day where we don't--where we
don't meet or we might use it as a catch
up if, for some reason, we don't get through
everything because we are only meeting for
an hour whereas, at Stanford, they're meeting
for an hour and 15 minutes.
Chapters four and five are both on classification.
That's sort of one of my favorite areas so
we're going to spend, you know, a good amount
of time on chapters four and five.
And then finally we'll finish with chapter
eight which is the cluster analysis.
Evaluation, you don't care about either.
The late assignments, you don't care about;
technology, you do care about.
So, basically we're going to be using R and
Excel.
Okay?
So, if you have a PC that sort of makes your
life easy because Excel is probably installed
on your PC and R is, of course, a free download
that's available for PC, it's available for
Mac.
And, you know, there is an R user's e-mail
list; maybe I'll send that around to you and
with a link for how to install R depending
on what your Linux platform is.
I don't really keep up with it because I
tend to use R more on my Windows machine.
But I know that we have installations for
Linux and I just--I haven't really kept up
with it.
So, maybe I'll try and send around a pointer
to you guys for that.
I'll run through today briefly how to install
R on Windows, and then maybe from there, you
can sort of extrapolate and figure out how
to install it on Linux.
But mainly, we're going to be using R with
a little bit of Excel, which Excel, for those
of you who aren't familiar, is just a real
simple spreadsheet application that works
for all the small data.
Academic honesty, you don't care about.
So that's all the--that's all the information
on the webpage.
So, go ahead and use, you know, stats202.com
as your reference for things in this class.
Just remember that the webpage is designed
for the students of Stanford so the obvious
things, you know, don't pertain to you.
And, you know, for example, right, don't
e-mail me at stanford.edu, e-mail me at @google.com.
I think that's all I wanted to say about the
webpage.
Are there any questions about anything I said
so far about the webpage?
Yes.
>> This is an undergraduate class?
>> MEASE: It's a master's level class but
it's an intro class and there's upper--there's
a higher level class.
There's a 300 class for those of you who are
familiar with the Stanford curriculum.
So, it sort of necessarily keeps this at an
intro level, which is--which I think is good
for us because a lot of us are sort of, you
know, this is our first time seeing some of
this stuff.
If this isn't your first time seeing some
of this stuff, that might, you know, you might
think, "Well, this might be too basic for
me."
So sort of pick and choose when you come or
what lectures you watch.
The lectures are being videotaped; they are
available on Fish.
So, those of you who don't want to sit here,
would rather just sit on their PC and watch
it there, sit on your machine and watch it
there, then you know they are going to be
up on Fish.
Any other questions about anything I said
so far?
Okay.
So moving on, so the textbook, again, we'll
start with chapter one, it's just a real soft
introduction to what we're going to be doing
in this class.
Well, this is sort of interesting.
I--when I said I was going to teach this class
on data mining, the first thing my officemate
asked me, you know, he said, "Well, what is
data mining?"
I said, "Well, I'll be able to tell you that
by the time I'm done teaching the class."
Well, hopefully, you know, by the end of today,
we'll be able to say something intelligent
about what is data mining.
So, the definition in your textbook, it says,
"The process of automatically discovering
useful information in large data repositories,"
and there's many other definitions.
So, let me just sort of dissect this a little
bit and sort of, you know--I come from stats.
The question is how is data mining different
from statistics?
Well, I think the easiest one, right, is the
notion of large, right?
The fact that the data set is large.
So, one way you could define data mining is,
well, it's statistics with large data sets.
Okay.
But there's more than that, right?
There's this idea that I'm automatically discovering
useful information.
You know, again, what does automatically mean?
Well, you know, you're not going to write
a script that's going to do all the analysis
for you and tell you, "Hey, I looked at your
data, and you know, you should be aware that,
you know, there is problem with this variable,"
or "There's something strange going on here,"
right?
It's not going to be completely automatic,
but it's sort of more automatic than statistics,
right?
So in statistics, you might say, you know
what?
I really want to analyze these two variables
and see, you know, what the correlation is
between them, blah, blah, blah.
In data mining you might say, look, I have
a thousand predictors in this data set; I
want to look at all pairwise correlations
and I want to get an automatic e-mail every
time two of those correlations goes above
a certain value, for example.
So, on some level, it's more automated than
stats but, of course, it's not like a magic
thing that does all the work for you.
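The pairwise-correlation alert described above could be sketched roughly as follows. This is a minimal illustration in plain Python rather than the R used in the class; the function names, the toy columns, and the 0.9 threshold are all assumptions made up for the example, and a print statement stands in for the automatic e-mail.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated (a simplification)
    return cov / sqrt(var_x * var_y)


def flag_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical toy data: three predictors measured on five rows.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],  # exactly 2 * x1, so perfectly correlated
    "x3": [5, 1, 4, 2, 3],
}
alerts = flag_correlated_pairs(columns)
for name_a, name_b, r in alerts:
    print(f"ALERT: {name_a} and {name_b} have correlation {r:.2f}")
```

With a thousand predictors the same loop would examine roughly half a million pairs, which is why this kind of check gets automated rather than done by hand.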
And then the final aspect I was just going
to mention is discovering useful information,
right?
I mean, obviously, there's a lot of data out
there and the goal of data mining is to see
if there's anything useful there.
And actually, one last aspect to this definition
I want to mention is the last part where it
says, "The data is in large data repositories,"
right?
So, it doesn't just say large data sets, it
says large data repositories.
So, you think of a repository as some place
where data just accumulates, right?
You didn't necessarily collect it.
It's just there, right?
So, web logs are an example, right?
I mean, the data is just there; whether or
not you're going to get any use out of it,
it's up to you.
Like, a whole bunch of other examples would be
credit card transactions, supermarket
data.
The data isn't really being collected for
any specific reason, but it's sort of hard
to not collect it.
The data just sort of accumulates naturally.
So, the question is, given that all this data
is there, can we find any useful information
in it?
And that's quite different from statistics
where, in statistics, you often say, you know
what?
I'm going to go out and collect data specifically
to answer a specific question.
Whereas in data mining, you're accumulating
the data and the question is, "Can I find
anything useful in this--in this data?"
Okay.
So then I say there are many other definitions.
And on the next slide I say, so find a different
definition and see how it compares to the
previous slide.
So, this is sort of a fun exercise to sort
of look through and see what other people
say is data mining.
And the first thing that you'll notice is
that, you know, the authority, right, Wikipedia,
it doesn't give one definition; it gives
two, which already suggests to you that there's
some, you know, non-uniform standard for what
is data mining.
So, the first definition is "Nontrivial extraction
of implicit, previously unknown, and potentially
useful information from data," and the other
is "The science of extracting useful information
from large data sets or databases."
So, that second definition which--I think
that's a stat reference, right?
Yeah, David Hand.
So, that's very similar to what we had.
The idea is that you're looking to see if
there's anything useful.
The data set is large and, you know, basically,
it's the art or the science of extracting
that.
The first definition isn't too different.
It just says potentially useful information from
the data, a little bit of an admission that
sometimes we're not going to find anything
useful.
It does say here, the first one, nontrivial,
and I'll talk about that in a second.
There's a lot of tasks that you could say,
look, I'm extracting useful information from
data in an automated way, but it's sort of
trivial, right?
So data mining deals with what we'll call
nontrivial.
And I'll give you some examples, in a second,
of what I would consider trivial and nontrivial,
and your textbook talks about those.
Other definitions, so you can sort of see--I
think there's a few I clicked on earlier.
What is data mining?
It says here, "Generally, data mining, sometimes
called data or knowledge discovery, is the
process of analyzing data from different perspectives
and summarizing it into useful information,"
sort of not that good of a definition.
I think this one was pretty close to what
we were talking about.
Let's see.
There's a "What is data mining?"
somewhere down here.
Yeah, "Data mining or knowledge discovery
is the computer-assisted process of digging
through and analyzing enormous sets of data
and then extracting the meaning of the data."
So you see, the digging through is sort of
carrying on the mining analogy.
There are a couple more.
Maybe I'll show you one more of these that
I thought was pretty good.
What does this one say?
Data mining is what?
"Analytic process designed to explore data,
usually large amounts of data, typically business
or market related, in search of consistent
patterns and/or systematic relationships between
variables."
I think this little parenthetic statement,
typically business or market-related is telling.
I mean, we're looking at it from a point of
view of, you know, we're--most of us are computer
scientists and so, we're looking at it from
more of a science point of view.
But it's really things in industry and the
market that have driven data mining.
That's really where the phrase comes from
and it's one of these, you know, catchy,
trendy words, like, "Oh, that company,
you know, my competitor is doing data mining
and I'm not, so they're going to beat me."
So that's, you know, that's where a lot of
it comes from.
You know, if you're cynical, it really is
just, "Well, it's statistical techniques or
it's machine-learning techniques or it's,
you know, these techniques with a new word
put on them."
But, you know, it's sort of--that's how things
in business and market get popularity; someone
attaches a word to them.
And so this is basically the word that's been
attached to--again, as your textbook says--process
of automatically discovering useful information
in large data repositories.
Now, I'll say this on the side.
So, you know, like I said, I come from statistics
and my officemate has the same background
as me, and he said, "You know, all you really
have here is two ingredients, you know, to
make a disaster, right?
You have a lot of data and you don't know
what you're looking for."
He said, "You're only--you're only going to
get yourself in trouble."
Well, you know, you can and you cannot, so
we'll talk about some caveats that you have
to be careful about.
But generally, this is the feel of it, you
have a large data set and you're just looking
to see if you can find some useful information
there.
And what he warns about getting yourself in
trouble is you need to make sure it really
is useful and you're not just telling yourself
some story that's completely artificial.
So I mentioned to you some data mining tasks
aren't really data mining tasks, right?
Sometimes you think you're extracting useful
information from a large data repository but
it's not really considered data mining, and
that's because it's too trivial, right?
So here on the left side of the screen, you
have some data mining tasks.
On the right, you have some non-examples,
right?
So, for example, looking up a phone number
in a phone directory; well, that's extracting
useful information.
If you want the phone number, it's useful
to you.
The phone directory is a large data set, so
you're extracting useful information from
a large data set, but it's not considered
data mining; it's too trivial, right?
An example of something that would be data
mining would be suppose you have the phonebook
and you start to look for relationships that
you previously didn't know.
So, for example, it says here you see names
like O'Brien, O'Rurke, O'Reilly occurring more
in the Boston area.
And you say, "Oh, I didn't really, you know,
know that but it makes sense to me now because
I sort of know, you know, how the different,
you know, people settled in the United States
and I know there's a lot of people, you know,
of certain descent in this area so it makes
sense to me."
And if you say, "Well, that's--you know, I
knew that already," imagine, you know, being given
the phone book from India or from, you
know, Brazil or a country that you're not
familiar with and you don't know any of these
names and you start to see how they cluster
in the different regions, you know, you're
learning something about the data without
really knowing what you're looking for before
going in; you start to see this grouping.
So, that would be an example of data mining.
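One hedged way to picture the phone-book example: tally surnames by region and look for names that are concentrated in one place. The listings, the region names, and the `top_region_share` helper below are invented for illustration; a real analysis would also need to account for how many listings each region has overall.

```python
from collections import Counter, defaultdict

# Hypothetical (surname, region) pairs standing in for phone-book listings.
listings = [
    ("O'Brien", "Boston"), ("O'Reilly", "Boston"), ("O'Rurke", "Boston"),
    ("O'Brien", "Boston"), ("Smith", "Boston"),
    ("Smith", "Houston"), ("Garcia", "Houston"), ("O'Brien", "Houston"),
    ("Garcia", "Houston"), ("Smith", "Houston"),
]

# Count how often each surname appears in each region.
counts = defaultdict(Counter)
for surname, region in listings:
    counts[surname][region] += 1


def top_region_share(surname):
    """Return (region, share) for the region holding most listings of surname."""
    by_region = counts[surname]
    region, n = by_region.most_common(1)[0]
    return region, n / sum(by_region.values())


for surname in sorted(counts):
    region, share = top_region_share(surname)
    print(f"{surname}: {share:.0%} of listings are in {region}")
```

Nothing here requires knowing in advance which names to look for; the grouping falls out of the counts.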
On the right here, the second one, query
a web search engine for information about
"Amazon."
Okay.
I'm getting useful information from a large
data set, right?
The web is a large data set.
But again, that's not data mining; something
that would be data mining here on the left,
group together similar documents returned
by a search engine according to their context.
For example, if you thought about, you know,
drawing a picture of all the web pages that
come back for the query Amazon, and--sorry, if
you can't hear me--you start to see two groups,
right?
You start to see--here's a group over here,
and here's a group over here.
Okay?
And you say, you know, how are these groups--how
are these groups, right?
Well, maybe, you know, users that query Amazon
go to these pages or they go to these pages,
but there's very few users who query Amazon
that go to a page here and a page here, right?
So these are connected and these are connected
but they're very split.
So what have you learned?
Well, you've learned that maybe Amazon has
two different dominant interpretations.
So presumably one is the retail site and the
other one is the river.
And you say, "Well, I knew that already.
Hang on a second.
I knew that already," you know.
But imagine doing it in a language that you
didn't know already or imagine having some
automated process that would tell you one
query has two dominant interpretations or
one only has one main interpretation.
What was your question?
>> Just when you write on the board, if you
had a black marker, it would be easier.
>> MEASE: Yes, if someone can toss me one.
I don't really have one.
>> There's one right under the podium.
>> MEASE: Where?
See?
I have to search for it.
Okay.
Okay.
Not that ornery.
All right.
So imagine those are black.
Okay.
So, that is--that would be an example of data
mining.
And that's actually clustering, and we'll
talk about that specifically as an example
of clustering.
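The two-groups picture drawn on the board can be sketched as a graph problem: link two result pages whenever the same user visited both, then find the connected groups. This is only a toy stand-in for the clustering methods covered later in the class; the page names and sessions are hypothetical.

```python
from collections import defaultdict

# Hypothetical sessions: the result pages each user visited after querying "Amazon".
sessions = [
    ["amazon.com/books", "amazon.com/electronics"],
    ["amazon.com/electronics", "amazon.com/cart"],
    ["amazon-river-facts", "rainforest-travel"],
    ["rainforest-travel", "amazon-river-map"],
]

# Build an undirected graph: pages are linked if one user visited both.
graph = defaultdict(set)
for pages in sessions:
    for page in pages:
        graph[page].update(p for p in pages if p != page)


def connected_groups(graph):
    """Return the connected components of the graph as a list of sets."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in group:
                continue
            group.add(page)
            stack.extend(graph[page] - group)
        seen |= group
        groups.append(group)
    return groups


groups = connected_groups(graph)
print(f"{len(groups)} dominant interpretations found")
for group in groups:
    print(sorted(group))
```

Real click data would be noisier, with a few users crossing between the groups, so a practical version would need a similarity threshold rather than strict connectivity.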
Okay.
So why mine data?
So there's the scientific point of view.
Now, I'm going to talk about the scientific
point of view and the commercial point of view.
Both of these basically have this flavor like
I'm collecting the data anyway, so there might
be some useful information in it.
And from a scientific point of view, you're
collecting lots of data.
Examples would be a satellite that has sensors
on it, telescopes that look across the sky,
microarrays.
You know, gene expression data was sort
of trendy a few years back.
Generally, you know, any simulation you do,
you can generate lots and lots of data.
I don't need to tell, you know, you guys about
collecting lots of data.
Traditional techniques are infeasible, and
so data mining might be helpful in sort of
classifying and segmenting data or in forming
hypotheses.
So that's sort of the scientific point of
view.
The commercial point of view, you know, again,
the commercial point of view is really sort
of driving the data mining on some level.
Data there is being collected from web data,
from e-commerce, right, any time you use a
search engine, I don't have to tell you, any
time you buy something from a site online,
any time you go to a department or a grocery
store, any time you use your--a bank or a
credit card.
So the data is just there.
Computers are cheap, so you can't say, "Oh,
we can't afford to store that much data."
No, storage is cheap.
"Oh, we can't afford, you know, to analyze
that."
No, the, you know, processing is cheap.
And then your competitor is doing it, right?
If you don't want to do data mining, well,
you can be certain that your competitor is--and
if it's giving them any edge, well, you know,
you're going to get beat out eventually.
Now, the one thing I wanted to talk about
on this slide which I always thought was interesting,
the grocery store example is sort of the very--it's
like the classic defining example of data
mining, which is that when you go to the grocery
store, they have a record of, you know, you
bought eggs and you also bought diapers or
you bought milk and you also bought beer,
you bought chips and you also bought salsa.
So they have a record of that.
Now, the funny thing is--you should think
about is how do they have a record of that?
Right?
So if I go to the grocery store today and
I buy chips and I pay cash, right?
Suppose I pay cash, I don't use my credit
card; I'm trying to sort of be off the grid,
right?
So I pay cash and I get the chips.
Then tomorrow, I go, "You know, I forgot the
salsa," so tomorrow I go and I buy salsa and
I pay cash again.
What they want to know is not just that one
person bought chips yesterday and one person
bought salsa today.
They want to know that it's me.
They don't just want to know what I bought.
They want to know who I am.
So, the question I'll ask you is how do they
know who I am if I don't pay with my credit
card?
>> [INDISTINCT]
>> MEASE: Yes.
You have the little, you know, your Safeway,
save six cents on gas, right?
You have your little Safeway cards.
So, you can think about it, you know, that
sure, they don't get the data for free but
all they have to do is give you a little card
and let you save three cents on every purchase
or whatever it is, and now they get all the
data in the world, right?
And, you know, you can opt in or opt out.
You don't have to use the card.
If you really want to--you know, don't want
people spying on you, you can just not use
the card.
But it's not hard for them to get the data.
And once they do something like that, they
have the data.
So that type of supermarket data where they
know each customer, at least their ID, and
what they bought is sort of one of the classic
examples of data mining data sets, you know,
where they use that data to discover relationships
between, you know, people who buy this product
usually buy this product.
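A rough sketch of how loyalty-card data could tie purchases together: group transactions by card ID, then measure how often customers who bought one product also bought another. The card IDs, items, and the `confidence` helper are assumptions for illustration, not a description of any real store's system.

```python
from collections import defaultdict

# Hypothetical transactions: (card_id, day, item). The card ID links separate visits.
transactions = [
    ("card_1", "mon", "chips"), ("card_1", "tue", "salsa"),
    ("card_2", "mon", "chips"), ("card_2", "mon", "salsa"),
    ("card_3", "mon", "chips"), ("card_3", "wed", "milk"),
    ("card_4", "tue", "milk"), ("card_4", "tue", "beer"),
]

# Collect everything each cardholder bought, across all visits.
baskets = defaultdict(set)
for card_id, _day, item in transactions:
    baskets[card_id].add(item)


def confidence(item_a, item_b):
    """Fraction of customers who bought item_a that also bought item_b."""
    bought_a = [items for items in baskets.values() if item_a in items]
    if not bought_a:
        return 0.0
    return sum(item_b in items for items in bought_a) / len(bought_a)


print(f"chips -> salsa: {confidence('chips', 'salsa'):.2f}")
print(f"milk -> beer: {confidence('milk', 'beer'):.2f}")
```

Note that `card_1` bought chips and salsa on different days; without the card ID those two cash purchases could not be linked to the same person.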
Now, what does that mean for the grocery store?
Well, you know, use your imagination.
If they often buy these two products at the
same time, maybe they should put them in the
same aisle.
Better yet, maybe they can say, "Look, let's
close down the whole supermarket and just
sell these two products because we know, you
know, if we just stock those, we can make
this much money," things like that.
So, anyway, they have--they have that data
because they give you a little discount card.
And they give you a discount card for other
reasons too.
So, this is sort of a fun exercise.
I knew I was going to give this one today
so I started thinking about it as soon as
I woke up.
So I'll give you four examples.
It says here, give an example of something
you did yesterday or today which resulted
in data which could potentially be mined to
discover useful information.
Okay, so in black, I will write here the four
things that I thought of and see if I can
get you guys to give me some others that I
haven't thought of.
So, I just literally went from the time I
woke up--actually, I went from the time I
woke up to the time that I started teaching and
came up with some examples.
So, the first thing when I wake up, the door
on my apartment doesn't have a key lock.
It has this card, right?
Little light card and it goes beep when you--when
you open it.
So you think, "Well, they're not going to
keep that data, right?"
What do they want--why would they want to
know that data?
Why would they want to spy on you that much?
Well, actually when I moved in, they told
me.
They said, "Don't use your card to try and
open someone else's apartment door because
we'll let--we'll come after you, you know.
We'll get mad at you."
And I thought right away, I thought, "That's
kind of weird, right?
I mean, what if I just take the elevator to
the wrong floor and, you know, I'm half awake,
right?"
But, you know, presumably, they're keeping
that data around or at least they have some
sort of alert system.
So, you wouldn't think it, but--you know,
I'll call that my apartment door.
You know, presumably, that data is sitting
around somewhere where they know what tenants
tried to open what doors at what time.
And if there's any useful information there,
you know, they can use it.
What would you use that information for?
I don't know.
Maybe they want to hire a security guard and
they want to know what sort of traffic, whether
people are coming and going.
Maybe they do really want to spy on you.
I mean, they can use the data for whatever
they want, right?
And you're consenting to it because you're
the one using the card to open and shut your
door.
You're the one living in their apartment.
Okay.
So then after I open the door, what do I do?
Well, I go and I, you know, I hit the elevator
button.
Now, that one, I'm not really sure if they're
keeping that data around.
But I kind of--you know, I wish they would
because maybe if they had some intelligent
system, I wouldn't have to wait so long for
the elevator because, you know, what's it
doing down there in the basement when everyone's
sleeping?
They know that, you know, it should be sitting
up at the top.
Okay, after I go in the elevator, then I--the
parking garage has another thing but that's
the same as the apartment door.
As soon as I get on Guadalupe Expressway,
there's metering lights.
And I wish that they would use the traffic
sensors, you know, to do something better
about the traffic, right?
So presumably, they could know that they don't
need to turn the metering lights on 87 when
101 is moving so quickly.
So, you know, they could mine that data too.
They know who's driving on the highway--well,
they don't know who's driving.
They know how many cars are driving on the
highway at what time.
They don't really know who you are, although
if you had, like, the FasTrak going over
the bridge, they would know who you are, right,
because it's your FasTrak.
And then finally, when I get to Stanford,
they don't give me, like, a nice parking
pass, so I have to put money in the--in the
pay parking machine.
And how do they know who I am?
I use my credit card and I use the same credit
card every time.
So, these are--you know, none of these are
related to Internet applications.
I'm trying to, you know, be a little bit creative.
All cases where I'm producing data that someone
could be using to do data mining, you know,
and they're not trying to spy on me; it's
just I'm giving them the data.
It's freely available for them to use for
whatever purpose they want.
So, this is what?
In-class exercise number two, I call it.
So I gave you four.
How about you guys give me four?
Yeah, Charles?
>> [INDISTINCT] stuff down Micro Kitchen.
>> MEASE: The Micro Kitchen.
They run out of data, right?
So they have to restock the Micro Kitchen.
They run out, so they know what we're eating
and what building we live in, right?
Okay.
So, that's--yeah.
So presumably, someone is looking at this
data in the Micro Kitchen.
Okay.
One more.
In green.
Sorry.
>> Yeah, the government's tracking wherever
you go through your cell phone.
>> MEASE: Cell phone, right?
Not only--not only--right?
Yes.
So you can turn that off, right?
But not only do they know who you are, who
you called, what time you called, they also
now know where you are because they have that
little, you know, GPS location sensor in there.
And, you know, I don't know.
If you're a paranoid person, this isn't a
good exercise for you.
But, you know, the data is there.
You know, they could ignore it if they want,
but it's there and they might find information
in it.
Okay.
Another one in the front.
>> Google badge [INDISTINCT]
>> MEASE: My badge, right?
So someone asked me this one time.
So let me--let me just say this is my badge,
right, which is--oops, B-A-D-G-E, which is
similar to--similar to my apartment door but
this is, you know, an employer, right, who
might have a little more interest in who I
am and where I'm going at what time.
And I have--when people ask me--I don't know
if they ask you this, when you tell them you
work at Google, they say, "Well, what time
do you start work?"
And you say, "Well, it depends what time I
wake up," and things like that.
And they say, "Well, what time does your boss
tell you have to be there?"
Well, you know, whatever.
And then they say, "But certainly, you know,
they know when you scan your badge in and
they keep track of that," and I'll go, "I
guess they could," right?
But, you know, knowing Google, they're likely
to use that data but not, you know, to spy
on us; more so to sort of just keep statistics
and just know when they should stock the Micro
Kitchens, right, or know when they should
serve breakfast.
Okay.
So let's get one more, one more.
Yeah.
>> That's [INDISTINCT] probably know, we know...
>> MEASE: Yes.
>> ...where you are and all [INDISTINCT]
>> MEASE: Yes.
>> [INDISTINCT]
>> MEASE: Yes.
So...
>> What data you are transferring.
>> MEASE: Laptop.
Yes, one time--well, that doesn't say P.
Laptop.
One time I was using a computer somewhere
in an office at the university, and the guy
called me and he said, "Why are you VPN?
Why are you using a VPN connection?"
And I thought, "Who are you to ask me why
I am..." but he was the administrator of the
network, right?
So, yes, any time you use a computer, people
are getting lots of information about you.
And you know we have web logs or Google--Stats202.com,
we have web logs from that.
So one of the things we're going to be doing
is playing with the logs for that and it'll
be cute because we can see certain spikes
when certain events happen and I can see what
webpage you go to.
I don't know who you are but I know your
IP address.
So anyway, you know, you can think of loads
of examples here of different cases where
you're producing data that, if someone wants
to, they can mine, they can use it to get
information about you and help them make different
decisions.
Okay.
So where does data mining come from?
So, you know, this--you can tell this book
is sort of a statistical book because you
see statistics and then you see everything
else.
What's everything else?
Well, you have artificial intelligence, you
have machine learning, you have pattern recognition,
and some people sort of put databases and
things like that and information retrieval
there too.
But, you know, it is sort of like we're teaching--or
I'm teaching this course from a statistics
point of view, but it's not just statistics,
of course.
It's borrowing ideas from artificial intelligence,
machine learning, pattern recognition and
all those.
Traditional techniques--when they say
traditional techniques in the second bullet,
it's that traditional statistical techniques would
be unsuitable.
Why?
The data is large, not just large like a lot
of observations, but large, it's high dimensional
and heterogeneous and distributed, right?
So, there are sort of new challenges for statistics.
We coined this phrase "data mining" but we're
borrowing information from or we're borrowing
ideas from all these other areas too.
Okay.
So, the book breaks down into two types of
data mining tasks, and this dichotomy is a
little bit forced in some cases but I'll walk
through it.
So, they differentiate between predictive
methods and descriptive methods.
And let me sort of write the shorthand version
of these.
The one thing to remember is I guess descriptive
methods don't really have one right answer.
You sort of know if you found something useful
because it's useful but you really never know
exactly what you're going after whereas predictive
methods, you're going to look at your classification
accuracy or your precision and your recall,
so those are straightforward.
So predictive methods, this is predictive.
What do we have here?
It says "Some variables to predict unknown
or future values of other variables," right?
So basically we're trying to predict the future
in some sense, right?
We're trying to use some inputs to predict
the future or the classification of some output.
Whereas, descriptive methods, descriptive,
for that, we're just basically trying to find
patterns in the data.
Okay.
Find patterns.
Okay.
So, you know, the way to remember this right
is sort of this is the supervised learning
and the unsupervised learning, if you will,
right?
So the example I would--I would give you if
you think about the Amazon, right, with Amazon
I found a pattern, right?
There were two distinct types of pages about
Amazon.
There is like the commercial Amazon and there
was the river Amazon, so I found the pattern.
Okay.
Suppose, alternatively, that I already--so
that would be descriptive.
I described the pattern, I described that
there were two groups of the Amazon webpages.
Predictive would be more like, I know that
there's two types of Amazon webpages and I
know there's like the--one's about commercial
site and I know there's one about the river.
Okay.
I know that there's two groups, but can I
predict given a new one, given a new webpage,
right, can I have a computer algorithm that
will predict which one of these two classes
it falls in?
And the way I am going to measure success
there is how accurate am I going to be able
to predict this.
Right?
What is my misclassification rate?
Am I going to get 90% of them correct, 95%
of them correct, et cetera?
Now, it's really easy for a human to read
the webpage and say, "Oh, this is about, you
know, Amazon the retailer or this is about
Amazon the rainforest."
But, you know, can a computer use the human
labeled observations to get a pretty accurate
rule?
That's predictive data mining, whereas, again,
I told you descriptive data mining was just
finding the fact that there's two groups in
the first place.
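
A minimal Python sketch of the predictive setup just described, assuming a handful of invented, hand-labeled page snippets and a deliberately crude word-overlap rule. The page texts, the labels "retailer" and "river", and the helper names `train` and `predict` are all hypothetical; this is not the method used in the course, only an illustration of learning a rule from human-labeled observations and measuring how often it is right.

```python
from collections import Counter

# Hypothetical hand-labeled pages (invented for illustration only).
labeled_pages = [
    ("buy books online with free shipping", "retailer"),
    ("order electronics and track your shipping", "retailer"),
    ("the river flows through the rainforest in brazil", "river"),
    ("rainforest wildlife along the river basin", "river"),
]

def train(pages):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in pages:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def predict(counts, text):
    """Pick the label whose training words overlap most with the new page."""
    words = text.split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(labeled_pages)

# New pages with known labels, used only to measure accuracy.
test_pages = [
    ("free shipping on books", "retailer"),
    ("boat trip on the river in brazil", "river"),
]
correct = sum(predict(model, text) == label for text, label in test_pages)
accuracy = correct / len(test_pages)
print("accuracy:", accuracy, "misclassification rate:", 1 - accuracy)
```

The point of the sketch is the evaluation step: because the test pages carry labels, there is a right answer to compare against, which is what makes the task predictive rather than descriptive.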
So the topics that we're going to cover fall
into these two categories as follows.
So, the book talks about classification and
regression as both being predictive.
So let me make this--so we'll put here classification
and regression.
Both of these as being predictive.
Now, classification, we're going to cover
in chapters four and five.
Regression, we're not going to cover in this
course.
However, if you take a regress--oh, sorry,
if you take a stats course, they're going
to cover regression.
Really, the main difference between these
is classification, you're trying to say, you
know, I said I'm 90% accurate, right?
I have two classes; it's either Amazon the
retailer or Amazon the river, right?
Or you could have three classes or four classes
or any number of classes and you're trying
to see how accurate you are.
Regression is analogous to that but instead
of trying to predict the class, you're generally
trying to predict a continuous attribute,
right?
So let me give you an example, right?
So just to change from a web application.
This is a book, this is an eraser.
You can tell these apart, right?
Suppose you send like sonar signals, right,
and bounced the sonar signals off the book
and the sonar signals off the eraser.
Well, they're going to look different, right?
And so classification would be to use those
sonar signals and some labeled instances--some
labeled instances of the book, some labeled
instances of the eraser, and predict for new
cases whether it's a book or an eraser, right?
That's classification.
I'm trying to predict is it the book class
or is it the eraser class, right?
Just like, is it the Amazon rainforest class
or the Amazon web--commercial class?
Regression would be more like, can I use the
sonar to predict the size of the book?
Right?
So you don't just either get it right or
wrong.
If you say the book is 11 inches tall and
it's really 10.5 inches tall, you're off by
exactly .5.
So, in classification, you're basically trying
to predict what class it is, whereas regression,
you're trying to predict some continuous attribute.
And so you would measure your performance
a little bit differently.
Classification, you might use recall, precision
or misclassification rate.
Regression, you might use like squared error
loss, L1 loss, some sort of loss function
like that that measures how close you are
to the target.
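
The measures named here can be sketched in a few lines of Python. The labels and numbers below are invented to match the book-versus-eraser example, and the function names are hypothetical; the formulas themselves are the standard definitions of misclassification rate, precision, recall, mean squared error loss, and mean absolute (L1) loss.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases where the predicted class is wrong."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

def precision_recall(actual, predicted, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def squared_error_loss(actual, predicted):
    """Mean of squared differences between target and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def l1_loss(actual, predicted):
    """Mean of absolute differences between target and prediction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Classification: invented true and predicted classes for four objects.
actual = ["book", "book", "eraser", "eraser"]
predicted = ["book", "eraser", "eraser", "eraser"]
print(misclassification_rate(actual, predicted))
print(precision_recall(actual, predicted, positive="book"))

# Regression: the book is really 10.5 inches tall, the prediction says 11.
print(squared_error_loss([10.5], [11.0]), l1_loss([10.5], [11.0]))
```

In the classification case each prediction is simply right or wrong; in the regression case the loss grows with how far the predicted height is from the true one.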
Again, we're not going to cover regression.
Classification is more common in data mining,
but regression gets a lot of attention in
classical stats courses and also a lot of
the classification techniques you can extend
to regression; we're just not going to get
into them.
Okay.
Then for descriptive, the visualization we're
going to cover in chapter three, association
analysis in six, clustering in eight and anomaly
detection is in chapter ten, although we're
not going to get to it.
So, let me just say a few words about these.
So, visualization is in chapter three.
Visualization, as I said before, it's one
of the most important things you're going
to do.
If you think about writing a report or doing
some study, people are going to remember the
picture, right?
If you can't tell it with one simple picture,
you probably haven't really said anything
interesting.
And there's sort of an art in making good
pictures and making pictures that people can see
clearly.
And so, we're going to spend a fair amount
of time in chapter three just talking about
different ways to visualize data.
And visualization can serve two purposes, right?
One is to present to someone.
Okay.
You know what you want to say and this is
just a good way to present it and the other
is to learn something yourself.
You don't know what you're looking for.
If you're just going to look at a bunch of
pictures or if you only look at one picture
that's going to tell you what's going on so
you can discover.
Both of those are visualization tasks.
They're both descriptive, and we'll talk about
those in chapter three.
Association analysis.
Association analysis, this is something that
doesn't really make it into mainstream stats
too much.
We're going to talk about that in chapter
six.
This one is the classic supermarket one, right?
The people that bought--the people that bought
diapers often bought beer, right?
The people that bought chips often bought
salsa.
This is a type of association analysis and
we're going to talk about that in chapter
six, in particular, that market basket example
I just talked about.
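
A small Python sketch of the market basket idea, assuming five invented shopping baskets. It computes two standard association-analysis quantities, support (how often an itemset appears) and confidence (how often the right-hand item appears given the left-hand item). The basket contents and function names are hypothetical.

```python
# Hypothetical shopping baskets (invented for illustration only).
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"chips", "salsa"},
    {"diapers", "milk"},
    {"chips", "salsa", "beer"},
]

def count(transactions, items):
    """Number of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in transactions)

def support(transactions, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among baskets containing `antecedent`, the fraction also containing `consequent`."""
    return count(transactions, antecedent | consequent) / count(transactions, antecedent)

print(support(transactions, {"diapers", "beer"}))
print(confidence(transactions, {"diapers"}, {"beer"}))
print(confidence(transactions, {"chips"}, {"salsa"}))
```

A rule such as "people who bought diapers often bought beer" corresponds to a pair of items with reasonably high support and confidence in data like this.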
Clustering, chapter eight.
Clustering.
Okay.
The clustering example--the canonical example
there would be like the Amazon search engine
versus the Amazon Rainforest, right?
You see two distinct groups emerge, you know
something is going on and, of course, that
example is trivial, but suppose I give you
a query in a language you don't know.
Can you tell me, you know, what sort of patterns
there are in those web pages?
Are there two main interpretations?
Is there one dominant interpretation and one
slightly less common interpretation?
So you're just sort of looking for patterns
in the data, and grouping is one pattern,
and that type of grouping is called clustering.
News stories, right?
Can you--can you group together different
news stories?
These are about sports, these are about politics.
Can you see different groups emerging in the
data even without having labels on them?
So, it's unsupervised.
It's clustering.
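
To make the unsupervised idea concrete, here is a minimal Python sketch of k-means, one common clustering algorithm (the lecture does not say which algorithm the course will use, so that choice is an assumption). The six two-dimensional points are invented, and starting the two centers at the first and last point is an arbitrary choice made only to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and moving centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to end up empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled points (invented); nothing tells the algorithm there are two groups' names.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[points[0], points[-1]])
print(centers)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from how the points sit relative to one another, and there is no accuracy figure to check them against.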
And then finally, anomaly detection, we're
not going to have time to get to but you might
want to read about it in--only one L, right?
You might want to read about it in chapter
10.
When we do chapter three, we'll do some of
it because when we make pictures of data,
sometimes that's exactly what we're looking
for, things that are strange.
Anomaly detection is probably, you know, as
I say, association analysis is one of
the classic examples of data mining.
Anomaly detection is one of the ones that
gets all the press because--shoot, I had a
news story, I don't know if I can find it.
That, you know, you always see data mining
in the news because they're using it for purposes
of, you know, credit card fraud detection
and they're using it to find terrorists, right?
And both of these things are anomalies, right?
So, what is credit card fraud, right?
Someone--you know, you have your credit card
and all of a sudden, you spent a whole bunch
of money in a place that you've never been
before.
That's an anomaly.
The credit card company flags it.
The sooner they can flag it, the more money
they can save.
So, that's anomaly detection through credit
cards.
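
A rough Python sketch of the credit card example, assuming an invented purchase history. It flags a new charge when the amount is far above the cardholder's usual spending (measured as a z-score against past amounts) and the place has never appeared before. The threshold of 3, the rule of requiring both conditions, and all names and data are assumptions made for illustration, not how any card issuer actually works.

```python
from statistics import mean, stdev

# Hypothetical past charges as (place, amount) pairs, invented for illustration.
history = [
    ("grocery", 25.0),
    ("gas station", 40.0),
    ("grocery", 30.0),
    ("restaurant", 35.0),
    ("grocery", 28.0),
    ("gas station", 32.0),
]

def is_anomalous(history, charge, threshold=3.0):
    """Flag a charge that is unusually large AND from a place never seen before."""
    amounts = [amount for _, amount in history]
    known_places = {place for place, _ in history}
    place, amount = charge
    # How many standard deviations above the usual amount this charge is.
    z_score = (amount - mean(amounts)) / stdev(amounts)
    unusual_amount = z_score > threshold
    new_place = place not in known_places
    return unusual_amount and new_place

print(is_anomalous(history, ("overseas electronics store", 2500.0)))
print(is_anomalous(history, ("grocery", 33.0)))
```

The sooner a check like this runs after the charge arrives, the sooner the charge can be flagged, which is the saving the lecture refers to.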
With respect to terrorists, what are they
looking for?
Strange behavior, right?
Something that's indicative of a terrorist.
Now, you could argue, well, in that case,
maybe this should be up here because you're
trying to see how accurately you can, you
know, spot the terrorist.
But, you know, that's why I said this line
is a little bit blurry.
But your textbook tends to classify anomaly
detection as descriptive because you don't
really know exactly what you're looking for.
Okay.
That's all the chapter one notes I want to talk
about.
Now I'm going to talk a little bit about the
software.
But let me stop and see if there's questions.
Question?
>> So, we end up giving yet more data to the
data mining mill because every time we're planning
a trip, we have to inform every credit card
company that we will be spending stuff abroad.
>> MEASE: Right.
Yeah.
So the question is, if you don't want your
credit card to sort of, you know, call you
and cancel your card because they see something
weird, some people will call the credit card company
ahead of time and tell them, "Look, I'm going
to be traveling overseas," but then, the point
is that they can also use that data to, you
know, feed into the data mining framework.
Yeah, some people will do that.
Some people, every time they're going to travel,
they'll let their credit card company know
ahead of time because they don't want any
problems.
Actually, I have a friend who is a pilot who
uses cash only which is surprising in this
day and age.
But for that very reason, he doesn't want
to call the credit card company every time
he goes somewhere.
And, you know, he does flights that look as
though they're anomalies, right?
So, anyway, are there any other questions
on anything I said so far before I talk about
software?
Question?
>> So what is the difference with clustering,
classification [INDISTINCT]
>> MEASE: Okay.
So what--your question is what's the difference
between clustering and classification?
Okay.
So, they're very similar.
Clustering is unsupervised.
Classification is supervised.
So the thing is--let's see.
Let's go with the web page example, right,
with the Amazon, Amazon, all right.
In one case, I have all the--all the
instances labeled.
This one is about the rainforest.
This one is about the e-commerce company.
And I'm trying to predict for a new observation
which one it is and I'm going to measure how
accurately I'm doing.
That's classification, that's predictive.
I want to see how well I can predict a new
observation into these two classes.
Okay.
Clustering, on the other hand, is the act
of actually discovering that there are two
classes, because I didn't know that ahead
of time.
I was just looking at a bunch of different
queries, looking at how things grouped together,
and I saw two distinct groups emerge for Amazon.
Now, with clustering, you don't really know
if you're right.
You know, are there really two groups?
Well, in this case, you do.
But you're just sort of trying to discover
a relationship.
So, does--is that good?
Is there anyone else that can give a better
definition than I just gave?
Because there are a lot of people in this room that
are experts on machine learning and they
can tell you supervised learning, unsupervised
learning and--but that's sort of my take
on it.
It's a little--it can be a little blurry especially
after you do the clustering if you say, "Oh,
I really did learn something that was right."
That, you know, it tends to have a little
bit of a classification feel, but that should
help.
Okay.
Other questions?
Yes.
>> This anomaly, it's actually the same as
the [INDISTINCT]
>> MEASE: Yes, to a large degree.
To a large degree.
And there are some subtle differences and
you can--you can read about that.
But, yeah, generally an anomaly is an outlier
and vice versa, generally speaking.
Yeah.
An outlier, you can sort of see the word outlier,
something that lies out from--away
from the rest of the data, an outlier.
So in some space, an anomaly is an outlier,
but the key might be to find that space.
Other questions about things that I have said
so far?
>> Is an anomaly like unsupervised learning
an outlier?
An outlier pertaining any actual [INDISTINCT]
cluster.
>> MEASE: Right.
Right.
So Charles was making a point about the relationship
between outliers and anomalies.
I don't want to get too much into that distinction,
but yeah, there's a relationship and some
subtle differences.
Other questions on what I've said, anything
I've said so far?
Okay.
So, let me see.
How are we doing here?
So, okay, we're doing good on time.
So, like I said, Stanford, this is an hour
and 15 minutes, but here, we're trying to
stay under an hour, obviously, so I go a little
bit faster and skip a few things that don't
matter for you guys.
Okay.
So, what are we using in here?
We're using Excel and we're using R. Now Excel,
if you have a PC, you're in a good shape.
If you don't--you know, I don't know.
Trix probably won't give you everything you
need.
You know, no offense, you know, but it's just
not the same product.
Open Office might give you everything you need
but if you have a PC, you're in good shape.
If you have Mac, I'm sure Excel is installed
on there.
If you don't have either of these, if you're
sort of a--just a strict Linux user, I don't
know.
I could--I can't really speak to Open Office
but it might--it might get you through most
of it, but we are going to be using Excel, though not
primarily, right.
Excel isn't very powerful, it doesn't handle
large data, it's very slow, it's--you know.
But for some purposes, it's good.
And it's--it is good to sort of have a spreadsheet
application sort of that you're comfortable
with because sometimes you can do things very
quickly that, you know, you don't really
want to take the time to script up.
So we're doing some things in Excel, and so
you should have access to that.
But then primarily, we're going to be using
R, which is free.
If you have Windows, I'll go through how to
install it on a Windows machine right now.
The same installation instructions will generally
hold for Mac.
And for Linux, like I said, I'm going to try
and send you guys a link to something on the--something
I can get from the R users which talks about
installation.
But different people do different installations
depending on what they're doing and so, I
haven't really kept up with it.
But let me take you through R and give you
a little preview of that and show you how
to get it installed on your Windows machine.
I have a Windows machine right here, obviously.
So, I'm going to--as I go through examples,
I'll be doing it on the Windows machine.
So, you know, you might, if you have a Windows
laptop, just install R on that and use that
for going through examples.
Also, it's easy, you can bring it with you
and you can sort of play along as you're sitting
here.
Okay.
So, how do I download R?
So you go to this web page, right?
It's sort of a little bit tricky.
For a while on Google if you just, you know,
queried R, you wouldn't get it.
I think you can now, but let's just go to
CRAN.
Let's see.
Here we go.
Okay.
So, we're going to be good with--you know,
I have a Windows machine, so I just hit Windows
95 or later and then, Base.
So, it's open source and different people
contribute different packages.
If we need anything special, I'll let you
know.
But for now, Base is going to do the trick.
And then if I go down to this one here, this
is a self-extracting EXE file.
It says I have to get it from a [INDISTINCT].
But it turns out if I click on this right
now, the behavior is it's just going to give
me one and I can just save it.
Then you double click, go through all the
defaults--everything as it is is going
to be fine--and it will get you R. And once
you do all that, you can see what it looks
like.
Here.
Let me--let me just run through those screenshots
again.
So these are up on the PowerPoint slides if
you're--you sort of forget what I said.
So, go to cran.r-project.org.
And for me, I would click on Windows 95 and
later.
Just click here on Base, and then 2.5.0.
I think I have a 2.4 version.
The version shouldn't matter too much if
it's, you know, within the last year.
You just save this to your machine and then
it will just install itself and all the defaults
are pretty good.
Once you do that, okay, you get something
that looks like this.
So here I have actually 2.4.1 on my machine.
And it's sort of command line, right?
It's not a spreadsheet app; it's sort of
command line.
Let me see if I can make this a little bit
bigger for you so you can see.
Let's change the font size from 10 to--let's
try 20.
That's too big probably, right?
It looks like a cartoon.
Okay.
So, you know, it's sort of command line, right?
I can do 10 + 1 and figure out that that's
11.
Okay.
You have functions in here, right?
So, let's see.
Let's think about an easy function.
exp is the exponential; e to the zero is one.
Okay.
So, your functions.
You can look for help on the functions.
So, I want to, like, get help on the exp function.
If I type question mark and then exp,
it brings up a window and tells me, you know,
that log computes natural logarithms,
log10 computes--okay.
And it gives you some examples.
So, the help is pretty good.
And you can look things up online because
there's a lot of documentation.
So, it sort of has command line.
You can write your own functions, you can,
you know, sort of use it as a little bit of
a scripting language.
But also, it's really good for plotting, right?
So, if you type like, okay, well, seq(1:10)
is the integers 1 through 10.
So, if I made a plot of--well, here.
Here, here.
I'll show you.
So, let's--let x be seq(1:10).
And if I plot, suppose, like, x, let's say,
x+10, then I get a plot, right?
And you get to change sort of almost everything
you can change on this plot.
You can change the plotting symbol, you can
change the font.
You can change the color.
You can make it clickable, you can label
things.
So, the plotting in R is really good and it
gives you a lot--a nice tool for making plots
very quickly.
And we'll go through a lot of that when we
get through chapter three.
But I think in the meantime, go ahead and
make sure that you have a machine where you
can--you can use R and get it working.
And next time, when I get into chapter two,
I'm going to go through some example datasets
and we'll talk about, you know, manipulating
them in R and Excel a little bit.
But let me just take, you know, the last minute
here and see if there's sort of any questions.
So, I'll just run through in case you missed
it at the beginning.
The whole point here is that this class, you
know, I'm teaching it at Stanford, so it's
not too much extra effort for me to come here
and teach it here.
It is being videotaped, so you can watch all
these, you know, on your--on your machine
at your desk.
It's an hour here, even though the Stanford class
is an hour and 15 minutes.
And, you know, you can sign up on Mailman;
you're welcome to come to any lectures you
want.
Everyone is invited.
This is the textbook.
We'll try and, you know, distribute some of
these next time, but there won't be enough.
We'll have to do it by lottery.
So, go ahead and buy one.
Go to www.stats202.com for all the information
about the course and make sure you're subscribed
to the datamining07@google.com.
And so, I think, that's all the organizational
information.
We're going to meet Tuesdays and Fridays from
1:00 to 2:00.
I can't think of anything else that I might
have said.
Let me just stop and ask if there's any organizational
questions or anything.
Yes, question.
>> Can you get a larger room?
>> MEASE: No, actually.
I mean, I asked about this and they said like,
"Well, you can, but you have to go through
another process."
So this is the biggest the Google EDU folks
had and so...
>> The machine learning EDU talks had a lot
of attrition like for the first lectures [INDISTINCT]
>> MEASE: Yes.
So, we hope that a lot of you--we hope that...
>> Why are you looking at me when you say
that?
>> MEASE: Because you're--because you're sitting
on the floor.
>> [INDISTINCT] to that second group?
>> MEASE: That's--I can talk to the Google
EDU folks about that.
I mean, they've been extremely helpful.
It was sort of a challenge to estimate the
attendance and we knew the Trix was an overestimate
and didn't know with the video conferencing
how many people would actually want to come.
And I think I still don't know how many people
actually want to come.
I think we'll know more on Friday.
I think if it's still overflowing on Friday,
then we can--we can really try and see if
we can do something better than having people
sit on the floor.
Other organizational questions?
No other organizational questions?
Okay.
There's free lunch in the cafeteria.
