Now that we've seen several data mining applications in business, let's look at the big picture of what data mining is all about.
This is what I call data mining in a nutshell.
We're going to get into a lot more details in the course,
but just so that you're familiar with general terminology, and where we're going with data mining, let's look at the big
concepts. Let's start by talking about what data mining is not - and in fact you've already taken a statistics course where
you've been exposed to data analysis in some form - so how is data mining different from statistics?
First and foremost, in statistics, we're looking at the macro level, what we call macro-decisioning -
a term coined by a colleague.
This means that we're looking at the average customer, or the average household or overall sales.
In contrast, in data mining, we're doing micro-decisioning where we're looking at individuals:
it could be an individual customer, an individual transaction or anything at the individual level.
A second distinction between statistics and data mining, is that in statistics
we're trying to explain or to describe some kind of relationship that's happening in the population.
We could ask ourselves: what are the factors that influence pricing?
In contrast, in data mining we're forward-looking, we're looking into the future and trying to predict values of new records.
So we're not looking at an aggregate
and we're not looking at explanation or description. We're looking at prediction at the individual record level.
Now if you think about the origins of these two fields, they come from very different backgrounds.
In statistics, data was very scarce, and
statistical methods were designed to squeeze every possible bit of information out of your sample, in order to
infer something about the population.
In contrast, data mining came out of a deluge of data, where we have mountains and mountains of data, and we're trying to extract
insights, or knowledge, out of those mountains of data.
That's why lots of algorithms in data mining come from computer science, from artificial intelligence, from machine learning;
but the surprising part is that data mining also includes a bunch of
statistical methods that have been adapted from the field of statistics.
It turns out that these methods that were designed for small samples work really nicely with large samples as well.
So don't be surprised when you see linear regression being used in data mining, although in a very different way.
How do we define a good model? In statistics, a good model is one that fits the data reasonably well.
We want a statistical model that's a good approximation of the data.
In contrast, in data mining we don't try to find the model that fits the data best;
we try to find the model or method or algorithm that best predicts new data
and that sometimes is different from finding a model that fits the data well.
Finally, how do we actually use the tools to come up with decisions?
In statistics we use confidence intervals,
hypothesis tests, statistical significance, P values - in order to answer the question of interest that we stated upfront.
In data mining we do something else. We try to evaluate the predictive power of the model that we built, and how well it
answers the requirements of what we were trying to do upfront. Here, when we're looking at predictive power,
we have special metrics that are different from statistical significance; and another major issue that comes into play is costs.
For instance, overprediction might cost more than underprediction - or
prediction errors of one type can be more costly than errors of another type.
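To make the cost idea concrete, here is a minimal sketch of cost-sensitive evaluation in Python. The data and the cost figures are invented for illustration; the point is only that the same set of predictions can look very different once the two error types carry unequal costs.

```python
# A minimal sketch of cost-sensitive evaluation: the same predictions can
# look very different once we assign unequal costs to the two error types.
# The data and cost figures below are made up for illustration.

def cost_weighted_error(actual, predicted, cost_fp=1.0, cost_fn=1.0):
    """Total cost of misclassifications, with separate costs for
    false positives (predicted 1, actual 0) and false negatives."""
    total = 0.0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 0:
            total += cost_fp
        elif p == 0 and a == 1:
            total += cost_fn
    return total

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]   # 2 false positives, 1 false negative

# With equal costs, the two error types are interchangeable...
equal = cost_weighted_error(actual, predicted)
# ...but if a missed positive costs 10x a false alarm,
# the same predictions look much worse.
skewed = cost_weighted_error(actual, predicted, cost_fp=1.0, cost_fn=10.0)
print(equal, skewed)  # 3.0 12.0
```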
So this differentiates data mining from statistics and gives you a little bit of a flavor of where we're going.
Now let's look at two major terms in data mining. In data mining, we
divide all the different methods - and there's a plethora of methods - into supervised learning and into unsupervised learning.
Here's the difference between the two.
In supervised learning, we take the measurements that we have on multiple records and we divide them into inputs and outputs.
Similar to this little picture where we have the inputs - the worms - and the output, which is the egg -
obviously, there's a small joke hiding here in the middle -
and the idea is that given a set of inputs we want to predict a certain output.
For example,
we might want to look at your performance in the program, and your past experience and your
demographics, and use that to try and predict some output. Now, outputs are typically one of two types: they can be numerical,
or they can be categorical.
A numerical output of interest is
something like your salary, if we're trying to predict what your salary will be after you graduate.
A categorical output is something like whether or not you'll find a job within three months.
When we're using supervised learning with a numerical output,
we call it prediction, or sometimes, machine learning people will call it regression.
When we're trying to predict a categorical outcome, we call this
classification, because we're trying to classify records into one of several classes, so this is all supervised learning.
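The salary and job examples above can be sketched with one deliberately simple learner - a nearest-neighbor predictor, which is not a method introduced yet, and whose data here is entirely made up. The only thing that changes between prediction (regression) and classification is whether the output column is numerical or categorical.

```python
# One simple learner, two kinds of supervised tasks: the only difference is
# whether the output is numerical (prediction/regression) or categorical
# (classification). The records and numbers are invented for illustration.

def nearest_neighbor_predict(train_rows, train_outputs, new_row):
    """Predict the output of new_row by copying the output of the
    closest training record (squared Euclidean distance)."""
    def dist(r):
        return sum((a - b) ** 2 for a, b in zip(r, new_row))
    best = min(range(len(train_rows)), key=lambda i: dist(train_rows[i]))
    return train_outputs[best]

# Inputs: (GPA, years of experience) for past students.
students = [(3.9, 2.0), (2.8, 0.5), (3.5, 4.0), (3.0, 1.0)]

# Numerical output -> prediction (regression): salary after graduation.
salaries = [95000, 52000, 88000, 61000]
print(nearest_neighbor_predict(students, salaries, (3.8, 2.5)))   # 95000

# Categorical output -> classification: found a job within 3 months?
found_job = ["yes", "no", "yes", "yes"]
print(nearest_neighbor_predict(students, found_job, (2.9, 0.7)))  # no
```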
Now let's look at unsupervised learning. Here, we don't have a distinction between inputs and outputs;
we just have a big set of measurements on a large number of people.
What type of things do we want to do when we just have a bunch of measurements? Well, here are three things.
One is dimension reduction.
This means that I have many, many columns of measurements - that's the dimension - and I'm trying to reduce them
into a smaller set of measurements that is easier to work with, for different reasons.
And we'll see that there are a bunch of methods aimed at dimension reduction.
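As a crude stand-in for those methods - not one of the techniques the course will cover - here is the simplest possible flavor of dimension reduction: keep only the columns that actually vary, on the theory that near-constant columns carry little information. The data is made up.

```python
# A deliberately crude illustration of dimension reduction: keep only the k
# columns with the highest variance; near-constant columns carry little
# information. (Principled methods come later; this is just for intuition.)

def top_variance_columns(rows, k):
    """Return the indices of the k columns with the largest variance."""
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return sorted(range(len(variances)), key=lambda j: -variances[j])[:k]

rows = [
    (1.0, 100.0, 5.0),
    (1.0, 250.0, 5.1),
    (1.0, 180.0, 4.9),
]
keep = top_variance_columns(rows, k=1)
print(keep)  # [1] -- column 1 varies the most; columns 0 and 2 are nearly constant
```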
Another goal that we want to achieve is to
reduce the dimension in terms of the observations, or the records - so instead of talking about 1 million customers,
I want to segment them into, say, several segments of customers and talk about segments. So this is another type of
unsupervised technique, aimed at segmentation or clustering.
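The segmentation idea can be sketched with a bare-bones k-means routine: group customers into k segments so we can talk about segments instead of a million individuals. The two-feature customer data is invented, and real clustering methods come later in the course.

```python
# A minimal k-means sketch of segmentation: assign each point to its nearest
# center, move each center to the mean of its points, repeat. The customer
# data (annual spend, visits per month) is made up for illustration.
import random

def kmeans(points, k, iterations=20, seed=0):
    """Return a list assigning each point one of k cluster labels."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels

# Eight customers forming two obvious segments: low spend vs. high spend.
customers = [(100, 1), (120, 2), (110, 1), (105, 2),
             (900, 9), (950, 8), (920, 10), (940, 9)]
labels = kmeans(customers, k=2)
print(labels)  # first four customers share one label, last four the other
```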
And lastly, we have methods that look at which measurements go with which other measurements.
So, classic examples are recommender systems
or association rules where you have baskets or carts that you shop in a supermarket or online.
And we're trying to look at multiple carts, and deduce from that which items are usually purchased together.
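A bare-bones version of that market-basket idea: scan many carts and count which pairs of items are purchased together most often. Real association rules add support and confidence thresholds; the carts below are invented.

```python
# Count, across all carts, how often each unordered pair of items co-occurs.
# The carts are made up; real association-rule methods add support and
# confidence thresholds on top of counts like these.
from itertools import combinations
from collections import Counter

def pair_counts(carts):
    """Return a Counter of co-occurring item pairs across all carts."""
    counts = Counter()
    for cart in carts:
        for pair in combinations(sorted(set(cart)), 2):
            counts[pair] += 1
    return counts

carts = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "beer"},
]
counts = pair_counts(carts)
print(counts.most_common(1))  # [(('bread', 'milk'), 3)]
```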
So this gives you an overview of
types of data mining techniques, and as I said, we're going to get into more detail as we move forward.
Now, how is data mining actually performed? It's a whole process;
it's not just running some software and getting some results out of it. The data mining process is
closely integrated with the business
process, and we start by defining an objective.
Defining an objective is one of the most difficult problems,
because we have to translate the business objective that we identify into a data mining or analytics
objective, and the two are very, very different. We'll talk a lot about defining objectives later in the course.
Once we define an objective - and we can't start without one, you can't just say
"Oh, just take some data and see what you can find", that's not going to lead you in good directions -
then we're going to obtain a data set. We can either go and actively collect data for the purpose of this
project, or else we might use data that already exists - maybe in our department,
maybe in other departments, maybe outside - in order to try and address the goal of interest.
So now we have an objective,
and we have data, and the next step is to start cleaning the data and exploring it
and that's where business intelligence tools such as
dashboards and reports are actually useful.
Cleaning the data is a big pain, but unfortunately it's always a major factor,
and it usually takes a long time - a big portion of this process is devoted to data cleaning.
So we have a task, and we have a data set that we reasonably understand now, and we know what each
field means, and it's reasonably clean, and then we're going to choose
a toolkit, which is going to include a bunch of different data mining techniques.
Unfortunately, you don't know in advance which data mining method is going to work for which problem. And therefore, we need to try a
set of different techniques, and this is where an iterative process starts, where you try a certain method,
and maybe it doesn't work very well, and you tweak it, you change some parameters in it,
and then you evaluate it again, and then you try some other method and you start comparing. This whole process of
model selection
takes some time and requires careful
implementation.
When we're finally done, and we have our final model,
we're going to evaluate its predictive performance.
And that's going to happen by applying it to a set of data that we have not seen. We're going to talk about this concept
of holdout data in just a minute.
When we find that the model performs according to our
requirements - only then will we move to the deployment stage. At the deployment stage, we go back to our original
objective and deploy our algorithm in order to
achieve what we set out to do at first.
This is a very large process, a very long process, and it includes a lot of links between the different stakeholders.
This cannot just happen in an isolated IT department.
Now, as I said before, there's a plethora of data mining methods, and our book includes
quite a few of them. To help you make sense of which methods belong where in this process,
we built this little map,
that's also available in the book in both editions,
and what it does is it maps the different methods and different approaches into either the data preparation stage, or
into the choice of methods, model evaluation, etc.
Where are we going from here? Well, a very important concept that applies to any method you use - a
critical part of the data mining approach - is evaluating predictive performance. And to do that,
we're going to need a concept called the
"holdout set": we partition the data and keep a holdout set aside.
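As a preview, partitioning can be sketched in a few lines: shuffle the records, fit on one part, and keep the rest as a holdout set the model never sees during fitting. The record count and split fraction below are arbitrary for illustration.

```python
# A minimal sketch of data partitioning: shuffle the records, train on one
# part, keep the rest as a holdout set. Fractions and record count are
# arbitrary for illustration.
import random

def partition(records, holdout_fraction=0.3, seed=42):
    """Randomly split records into (training, holdout) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))          # stand-ins for 100 data records
training, holdout = partition(records)
print(len(training), len(holdout))  # 70 30

# Every record lands in exactly one partition.
assert set(training) | set(holdout) == set(records)
assert not set(training) & set(holdout)
```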
We'll talk about this in the next video.
