Why Data Mining?
There is so much data coming out
every single second.
With the internet we've got so many sources:
social media, YouTube, Google.
We've got tons of data coming in.
And that's just internet based.
There's also sensor data, stock data,
patient data in medical areas and
medical research, and energy research.
So the amount of data is just growing
astronomically fast.
With the data explosion, you can see in this
chart here just how much it has grown since
1970, and then this exponential curve with
the web data in addition to business transaction
data.
And with storage costs going so much lower,
more data can be stored. So now the whole
point is, well, what do we do with all of this
data?
We want to be able to mine it for interesting
knowledge that we can gain from it.
And the data explosion has really pushed IT,
as far as storage and processing too.
So sometimes with data mining, trying
to answer one question could take days.
That's just the processing, the machine learning
and the artificial intelligence behind it,
trying to figure out an answer to the question
when you have massive amounts of data.
And so we'll talk about more efficient
and effective ways to deal with large amounts
of data as well,
and how we're going to move forward
with all of this.
So really the question is, why data mining?
We are drowning in data, but starving for
knowledge.
We have tons of data, and we need to figure
out ways to analyze it, and especially to
automate that process, and look for something
interesting in it:
some interesting patterns, some knowledge
we can gain based on the data that we have.
So here are some of the questions we'd like to answer.
With market basket analysis, where
should products be placed in the store to
maximize sales?
In other words, what's commonly bought together?
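To make "commonly bought together" a bit more concrete, here is a minimal sketch that counts how often each pair of items shows up in the same basket. The function name pair_support and the toy baskets are made up for illustration, and this is only the simplest pair-counting step, not a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    if not baskets:
        return {}
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


# Hypothetical toy baskets, invented for this example
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "chips"],
    ["bread", "butter"],
]

support = pair_support(baskets)
# Highest support first; ties are broken alphabetically by pair
top_pairs = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
for pair, frac in top_pairs:
    print(pair, frac)
```

Pairs with high support are candidates for being placed near each other in the store. With real data you would also want a minimum support cutoff so rare pairs drop out.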
Another market basket analysis example is Amazon:
what products should we recommend and show
to the user?
What are other users doing, the ones whose interests
are very similar to this particular user's,
that this user may also be interested in?
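As a rough sketch of that "users similar to this user" idea, the snippet below scores other users by how much their purchases overlap with the target user's (Jaccard similarity) and suggests items those neighbors bought. The names jaccard and recommend, the user names, and the purchase data are all hypothetical; this is not how Amazon or any particular site actually does it.

```python
def jaccard(a, b):
    """Overlap between two sets of purchased items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, k=2):
    """Suggest items bought by the k users most similar to `target`."""
    mine = purchases[target]
    others = [u for u in purchases if u != target]
    # Rank other users by similarity; ties broken by user name
    neighbors = sorted(others, key=lambda u: (-jaccard(mine, purchases[u]), u))[:k]
    scores = {}
    for u in neighbors:
        sim = jaccard(mine, purchases[u])
        # Only score items the target user does not already have
        for item in purchases[u] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical purchase histories, invented for this example
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove", "sleeping bag"},
    "cat": {"tent", "lantern", "boots"},
    "dan": {"novel", "bookmark"},
}

suggestions = recommend("ann", purchases)
print(suggestions)
```

Here the two users who overlap with "ann" drive the suggestions, and the user with nothing in common is ignored.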
Another example would be seasonal patterns.
Are certain items more popular around particular seasons,
whether it's Christmas time or the Fourth
of July, or whatever the holiday may be?
Or summer versus winter?
For classification and clustering, we're looking
at evaluating, for example, loan applicants.
Is it worth giving a loan to this particular
person?
How much of a loan?
How risky is it to give them a loan this large?
Which previous customers are most similar?
We can then infer from those
previous customers to make a more informed
decision.
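One simple way to sketch "find the most similar previous customers and infer from them" is a nearest-neighbor lookup. The example below assumes each customer is described by two made-up features already scaled to the 0 to 1 range, and the function name knn_default_rate and all of the records are hypothetical. A real lender would use many more features and a properly validated model.

```python
import math


def knn_default_rate(applicant, history, k=3):
    """Fraction of the k most similar past customers who defaulted.

    Assumes `history` is a non-empty list of (features, defaulted) records.
    """
    # Closest past customers first, by straight-line distance between features
    ranked = sorted(history, key=lambda rec: math.dist(applicant, rec[0]))
    nearest = ranked[:k]
    return sum(1 for _, defaulted in nearest if defaulted) / len(nearest)


# Hypothetical records: (income_scaled, debt_ratio_scaled), defaulted?
history = [
    ((0.9, 0.1), False),
    ((0.8, 0.2), False),
    ((0.7, 0.3), False),
    ((0.2, 0.9), True),
    ((0.3, 0.8), True),
    ((0.1, 0.7), True),
]

low_risk = knn_default_rate((0.85, 0.15), history)
high_risk = knn_default_rate((0.2, 0.8), history)
print(low_risk, high_risk)
```

The returned fraction is a rough risk estimate: the more of the applicant's nearest neighbors defaulted, the riskier the loan looks.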
In medical diagnosis, is this cancer?
At what stage can we identify certain differentiating
patterns?
For time series, we're looking at data over time
and identifying changes.
The stock market is a common
one.
We can look at crime rates, or the use
of Uber- and Lyft-type ride sharing.
With social networks like Twitter and Facebook, what's trending?
With wind speed, how big a wind farm do we need to generate power?
How much do we need to supplement that power, and what is the demand?
For stocks, we want to be able to predict
stock changes, and make lots of money.
So these are some examples.
Outlier analysis is used for things like fraudulent activity.
Most credit card transactions are
normal, proper
purchases.
With outlier analysis, we're looking for the abnormal.
That's what we're actually interested in finding.
So we don't want to delete those from our
dataset, we actually want to find them.
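As a minimal sketch of finding the abnormal transactions rather than deleting them, the snippet below flags amounts that sit far from the median, using a modified z-score based on the median absolute deviation. The function name flag_outliers, the 3.5 cutoff (a commonly used convention, not something from this lecture), and the transaction amounts are all assumptions for illustration. Real fraud detection looks at much more than the amount.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return indexes of amounts whose modified z-score exceeds `threshold`."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical transaction amounts for one card, invented for this example
amounts = [12.5, 40.0, 23.1, 18.75, 31.0, 27.4, 2500.0, 15.2]

suspicious = flag_outliers(amounts)
print(suspicious)
```

The median is used instead of the mean because one huge purchase would drag the mean toward itself and make the outlier harder to see.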
And if a card is being used fraudulently, we might end
up cancelling the card so that, if someone
stole that information, we don't end up
paying too much on it if we're one of the
credit card companies.
There are lots of reasons to do data mining.
