Hello and welcome. In this session we are going to look at a very interesting application of database technologies, namely the field of data mining and knowledge discovery. In recent years data mining has become a field eliciting an extremely large amount of interest, not just from researchers but also from the commercial domain. In fact, the commercial utility of data mining is probably of more interest than, or at least as much interest as, the research interest that lies in data mining.
And in addition to commercial interest, data mining has also started a number of public debates, ranging over topics like legality and ethics, the rights to certain information, the right to non-disclosure of information, the right to privacy, and so on. So data mining has in some sense opened a Pandora's box, and only time will tell whether the technology has, in an overall sense, been beneficial or destructive in nature.
But then there is nothing beneficial or destructive about technology per se; it is how we use technology that matters. In any case, in this session we shall be concentrating mostly on the technical aspects of data mining.
We shall look at the basic algorithms and concepts that make up data mining, what exactly is meant by data mining, and how it differs from the traditional ways in which databases are used. The overview of this set of two sessions is as follows. Let us first motivate the need for data mining, that is, why data mining, and look at some of the basic underlying concepts, the building blocks of data mining. Then we look at data mining algorithms and several classes of these algorithms. We will start with tabular mining, that is, mining relational tables, and look at classification and clustering approaches; we will also look at mining of other kinds of data, such as sequence data and streaming data. Data warehousing concepts will be covered in a separate session altogether. First of all, why data mining from a managerial perspective? Let us first look at what data mining has to offer the commercial world before we go into its technical aspects.
If you were to do an internet search, or ask a manager why he or she would invest in data mining, you would encounter a variety of answers. One would say something like strategic decision making: I look for certain patterns, or mine for certain nuggets of knowledge, that help in strategic decision making. Somebody else would say it is very useful for something called wealth generation; although there is no precise definition of that term, the idea is that data mining helps in making the right decisions, the ones that increase one's financial portfolio.
Somebody else would say they use data mining for analyzing trends: how customers behave, how a particular market is behaving, and so on. And more recently, data mining has been used extensively for security purposes, especially mining network logs or network streaming data to look for abnormal behavioural patterns, or patterns that might be linked to abnormal activity in the network or the system. So security is a relatively recent and very important application area of data mining.
So, what is this data mining all about, and why is it so controversial and, at the same time, so interesting from a technical perspective? Data mining is the generic term for looking for hidden patterns and trends in data that are not immediately apparent from just summarizing the data. Suppose I have the set of all students and their grades, and I want to look for patterns in how the students perform over time, or for some kind of relation between subject A and subject B; for instance, whether a student who does well in subject A does badly in subject B.
Such things cannot be discovered by just aggregating the data, by computing averages or sums. Moreover, such things cannot, in a sense, be "discovered" at all if we ourselves have to write the queries that compute these aggregations. If we already knew what we were looking for, it would not be a hidden pattern any more: we would already know that students performing well in subject A do not perform well in subject B, we would know that such a correlation exists, and there would be nothing hidden about the pattern.
So data mining essentially has no query; if we are performing data mining on a database, we do not talk of a data mining query. It is the mining algorithm that should give us something we do not know. Now, how do we ask for something we do not know? Put that way, the problem sounds hopelessly vague. Data mining is therefore controlled by what are called interestingness criteria: we specify to the database what we understand by an interesting pattern, say a correlation between performances in subject A and subject B, or some kind of trend over a period of time, and ask it to find everything we do not already know that is interesting according to these criteria.
So when we talk about data mining, we begin with a set of data, that is, a database; we then supply one or more interestingness criteria, and the output is one or more hidden patterns which we did not know existed in the first place. Given this model, the obvious question to ask is: what types of patterns? What do we mean by a pattern, and when do we say that something is a pattern and something else is not?
To answer that, we have to ask two further questions: what is the type of data set we are looking at, and what is the type of interestingness criterion we are using? What exactly do we mean by interestingness; is it a correlation between two things, or something else?
So let us look at the different types of data that we encounter in different situations. The most common kind is tabular data, the relational database, which is in the form of a set of tables, or the slightly different multi-dimensional form of database. Any kind of transaction data, for example the data coming out of an ATM, or from the transactional database at a railway reservation counter or a bank, is tabular in nature. So it is the most common form of data, and a rich source of data to be mined.
In addition to tabular data, there is spatial data, where data is represented in the form of points or regions encoded with certain coordinates, say X, Y, Z coordinates. Each point, in addition to having certain attributes, also has coordinates, and mining in this context also requires us to know the significance of the coordinate system.
In addition to spatial data, there are other kinds of data, such as temporal data, where each data element has a time tag associated with it. Temporal data includes, for example, streaming data: the set of all packets flowing through a network forms a fast-flowing stream in which each packet can be assigned some kind of timestamp. Activity logs, such as a database activity log, are also temporal data. There can also be spatio-temporal data, that is, data tagged with both time and coordinates. And there are other kinds of data, like tree data, for example XML databases, and graph data, such as bio-molecular data, or the World Wide Web, which is one big graph, and so on.
Then there is sequence data, such as data about genes and DNA; a sequence is a kind of temporal data in which the timestamps need not be explicit. Then there is text data, that is arbitrary text, multimedia, and so on. So there are several different kinds of data that can serve as the source from which we can extract, or mine for, unknown nuggets of knowledge.
Similarly, when we talk about interestingness criteria, several things could be interesting. If certain patterns of events or of data keep occurring frequently, they might be of interest to us. So frequency is by itself a criterion on which interestingness can be based.
Similarly rarity: if something happens very rarely and we do not know about it, that is again a very interesting pattern to search for, for instance when we are looking at abnormal behaviour of a system or of network traffic. Something that happens rarely, away from the norm, is again an interesting pattern. Correlation between two or more elements, where the correlation exceeds a threshold, is also interesting, as is the length of an occurrence in the case of sequence or temporal data.
And then there is consistent occurrence. Consistency is different from frequency in the sense that, over the entire database, a given pattern may not be frequent enough. For example, suppose one particular customer comes to a bank on the tenth of every month. If we are looking for frequently banking customers, this customer would not figure in the results, because this customer comes only once a month whereas other customers may come many times a month. However, if we are looking for consistency of behaviour, then this customer's behaviour is far more consistent than that of someone who comes, say, 10 times the first month, once the second month, and 50 times the third month. So in terms of consistency of behaviour across months, this pattern is interesting even though it is not frequent.
Then there is repetition, or periodicity, which is slightly similar to consistency, except that consistency is measured across the entire set, across all the months if we have divided our database into months, whereas in a periodic pattern the time interval of repetition can vary. If a customer comes to the bank five times every six months, we may not be able to catch this as part of a consistent-pattern analysis, but an algorithm that detects periodic occurrences of events will be able to detect it. And there are several other kinds of interestingness that one could think of.
Now, when we talk about data mining, there is sometimes a misconception, or at least a contention, that data mining is the same as statistical inference. In many respects this is true: several concepts from statistics have been incorporated into data mining, and data mining software uses statistical concepts and algorithms extensively. However, there is a fundamental difference between statistical inference and data mining, which is perhaps the reason for the renewed interest in data mining algorithms. Here is the general idea behind data mining versus statistical inference.
What do we do in statistical inference? Statistical inference techniques essentially have the following three steps, as shown in this slide. We start out with a conceptual model, or what is called a null hypothesis. That is, we first form a hypothesis about the system in question, something to the effect that if exams are held in the month of March then the turnout will be higher than if they are held in June. Based on this hypothesis, we perform what is called sampling of the data set, or of the system. Sampling is a very important step in a statistical inference process; there is a huge amount of literature on what is meant by correct sampling, what constitutes a representative sample, and so on. Based on the sample drawn from the system, we then either support or refute our hypothesis: either the statistical analysis of the sample shows the hypothesis to be true, or it is rejected as false.
For example, if we are performing statistical inference about user preferences, say for some kind of market analysis, we present a questionnaire to different users based on our null hypothesis, our conceptual model. Since the questionnaire has been created from the conceptual model, it already "knows" what to look for, and the answers will either support or refute the hypothesis. Data mining, on the other hand, is a completely different process; rather, it is the opposite process.
In data mining we just have a huge data set, and we do not know what it is that we are looking for. We have no hypothesis, no null hypothesis to begin with; we just have a huge data set and some notions of interestingness. We use these interestingness criteria to mine the data set, and usually no sampling is performed: the entire data set is scanned at least once by the data mining algorithm in order to look for patterns. So there is no question of sampling, and there is no null hypothesis to begin with.
We just have a vague notion of interestingness, based on which we run a data mining algorithm over the data set. Out of this come certain interesting patterns, which form the basis for forming a hypothesis. So data mining is sometimes also called hypothesis discovery. Of course, we cannot discover a complete hypothesis using data mining alone, but we do discover patterns from which we can formulate a hypothesis. So in a sense it is the opposite process of statistical inference.
Let us look at some data mining concepts. Two fundamental concepts are of interest, especially in the core algorithms of data mining, the apriori-based algorithms. These are what are called associations and item sets. An association is a rule of the form "if X then Y", as shown in this slide, and is denoted X → Y.
For example: if India wins in cricket, then sales of sweets go up. Here X is "India wins in cricket" and Y is the predicate "sales of sweets go up". We say that we have discovered such a rule if, based on analyzing the data, we can conclusively say that whenever India wins in cricket, the sales of sweets go up. If, on the other hand, the direction of the rule does not matter, that is, "if X then Y" also implies "if Y then X", then what we have is an interesting item set: just a set of items. For example, people buying school uniforms in June also buy school bags, and equally, people buying school bags in June also buy school uniforms. School uniforms and school bags form a set of items that is interesting by itself.
Having defined the notions of an association rule and an item set, we now come to the concepts of support and confidence. How do we decide that a rule is interesting? We say that a rule is interesting, in the sense of occurring frequently, if the support for that rule is high enough; the support for a given rule R is the fraction of all records in which R occurs. We will illustrate the notion of support with an example in the next slide, where it will become clearer.
As for the confidence of a rule: suppose I have a rule "if X then Y". The confidence of the rule is the ratio of the number of occurrences in which both X and Y hold to the number of occurrences in which X holds. In other words, given that X is true, with what confidence, with what percentage of confidence, can I say that Y is also going to be true?
Let us look at some examples. Say these item sets are data distilled from the purchases of different consumers over a period of time, in a given month. The first consumer has bought a bag, a uniform, and a set of crayons; the second has bought books, a bag, and a uniform; the third a bag, a uniform, and a pencil; and so on. Now suppose I take the item set (Bag, Uniform). What is the support for this item set? Look at all the transactions, the rows here, in which bag and uniform occur together: there are 5 such rows out of a total of 10.
Therefore the support for (Bag, Uniform) is 5 divided by 10, which is 0.5: this data set supports the assertion that bag and uniform will be bought together with 50% support. Next, what is the confidence of the rule "if bag then uniform"? That is, with what confidence can we say that whenever somebody buys a bag, they also buy a uniform? For this we have to look at the set of all transactions in which a bag occurs, with or without a uniform. A bag occurs in 8 different rows, out of which bag and uniform occur together in 5. Therefore the confidence for this association rule is 5 divided by 8, which is 62.5%. That means that if a consumer has bought a bag, then with 62.5% confidence we can say that the consumer will also buy a school uniform along with it. The question now is: how do we mine the set of all interesting item sets and the set of all interesting association rules?
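The support and confidence computations above can be sketched in a few lines of Python. The transaction list below is hypothetical, reconstructed only to be consistent with the counts quoted in the lecture (10 rows, 8 containing a bag, 5 containing both bag and uniform); the actual slide data is not fully shown.

```python
# Hypothetical transactions, consistent with the counts quoted in the lecture.
transactions = [
    {"bag", "uniform", "crayons"},
    {"books", "bag", "uniform"},
    {"bag", "uniform", "pencil"},
    {"bag", "uniform", "crayons"},
    {"bag", "uniform", "crayons"},
    {"bag", "pencil", "books"},
    {"bag", "pencil", "books"},
    {"bag", "books", "eraser"},
    {"uniform", "crayons", "books"},
    {"uniform", "pencil", "eraser"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS union RHS) / support(LHS), i.e. P(RHS | LHS)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"bag", "uniform"}, transactions))       # 0.5
print(confidence({"bag"}, {"uniform"}, transactions))  # 0.625
```

This reproduces the two numbers from the worked example: support 0.5 for (Bag, Uniform), and confidence 5/8 = 62.5% for "if bag then uniform".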
Now have a look at the previous slide once again. When we talked about item sets, we saw a single item set with two elements, but that need not be the case: bag by itself could be a one-element item set, uniform by itself could be a one-element item set, crayons could be a one-element item set, or bag, uniform, and crayons together could be a three-element item set, and so on. Item sets can be of any size: 1, 2, 3, up to n. Now we have to find the set of all item sets, that is, all sets of items that have been bought together frequently in this transaction log.
Now how do we do that? There is a very famous algorithm called the apriori algorithm, which performs this discovery of all frequent item sets in a very efficient manner. The basic idea behind the apriori algorithm is shown in this slide; however, let us not go through the slide in detail, since it will be easier to explain apriori through an example.
The essential idea behind the apriori algorithm is this: suppose I have an n-element item set, say a 5-element item set, that is frequent. Then all subsets of this item set must also be frequent. This seems obvious, but it is a very important observation in the apriori algorithm. If I have discovered the set of all frequent item sets of size 1, then there is no need to look at any other items when searching for frequent item sets of size 2: the set of all frequent item sets of size 2 will be made up of combinations of the frequent item sets of size 1.
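This "every subset of a frequent item set is frequent" observation is what drives candidate generation. As a minimal sketch (in Python, with hypothetical example data), here is a candidate-generation step that joins frequent k-item sets and prunes any candidate that has an infrequent k-subset:

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune any
    candidate one of whose k-subsets is not itself frequent."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1:
                # downward closure: every k-subset must be frequent
                if all(frozenset(s) in frequent_k
                       for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

# Example: with these frequent 2-itemsets, only {bag, uniform, crayons}
# survives; {bag, uniform, pencil} is pruned because {uniform, pencil}
# is not frequent.
f2 = {frozenset({"bag", "uniform"}), frozenset({"bag", "crayons"}),
      frozenset({"uniform", "crayons"}), frozenset({"bag", "pencil"})}
print(generate_candidates(f2, 2))  # {frozenset({'bag', 'uniform', 'crayons'})}
```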
So let us illustrate the apriori process with an example. Let us take our previous consumer database again, where consumers buy school utilities like school bags, school uniforms, crayons, pencils, books, and so on.
Now, when we ask the apriori miner to mine for all interesting item sets, the interestingness criterion here is frequency, that is, frequent occurrence. Frequency is parameterized by a threshold called the minimum support, or minsup. Let us say the minimum support is 0.3: we term an item set interesting if its support is at least 0.3.
Given this, what are all the interesting one-element item sets? That is, which one-element item sets occur at a rate of at least 30%? This data set has a total of 10 rows, so we have to look for all one-element item sets which occur 3 or more times. We see that all of them are interesting: bag, uniform, crayons, pencil, and books. Bag occurs much more than three times, uniform occurs more than three times, crayons occur more than three times, and so on. So all of these one-element item sets have a support of at least 30%.
Now suppose we have to find the set of all interesting two-element item sets. How do we build it? We look at all possible combinations of the one-element item sets: bag uniform, bag crayons, bag pencil, bag books, uniform crayons, uniform pencil, uniform books, and so on. For each such two-element item set, we count how many times it occurs in the data set, and we find that only some of these combinations have a minimum support of 0.3 or more: for example, bag uniform, bag crayons, bag pencil, and bag books are all interesting.
However, uniform and books, say, is not interesting: it does not occur three or more times. Let us count: uniform and books occur together only twice, but we need a minimum of three occurrences, so this pair is not interesting. Similarly, uniform and pencil is not interesting. We have therefore filtered away certain item sets from our exploration and identified only a smaller subset of all possible combinations of one-element item sets.
Now, to look for all three-element item sets, we have to generate the set of all candidate three-element item sets: perform a union across all possible combinations of the interesting two-element item sets to create all possible distinct three-element item sets, and then look for those which occur at least three times in this database. Given that, we see that there is only one interesting three-element item set, namely bag, uniform, and crayons: it occurs at least three times, that is, it has support of at least 30% in this data set.
So you can visualize the apriori algorithm in the form of an iceberg; such queries over databases are also called iceberg queries. At the base there are a large number of one-element item sets, but once we start combining them, we get smaller and smaller numbers of combinations, and we peak at a very small number of large item sets which are frequent.
The beauty of the apriori algorithm is that, to construct the candidates for the present iteration, it does not have to re-examine the entire space of item sets: it only needs to consult the results of the previous iteration, the item sets one element smaller than those of the present iteration. Given this explanation with an example, let us now go back and look at the apriori algorithm itself, which will be a little easier to understand.
Initially we are given a minimum required support s as the interestingness criterion. First, we search for all individual elements, that is, one-element item sets, that have a minimum support of s. Then we go into a loop, looking for item sets of progressively larger sizes: from the results of the previous search for i-element item sets, we search for all (i+1)-element item sets that have a minimum support of s. This in turn is done by first generating a candidate set of (i+1)-element item sets and then choosing only those among them which have a minimum support of s. These become the set of all frequent (i+1)-element item sets. The loop is repeated until the item set size reaches its maximum, that is, until there are no more candidates to generate for the next item set size, or no more frequent item sets in the current iteration.
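The loop just described can be sketched as follows. This is a minimal, unoptimized rendering in Python; the transaction list is the same hypothetical reconstruction used earlier (the real slide data is not fully shown), and the simple join step here omits subset-based pruning for brevity.

```python
transactions = [
    {"bag", "uniform", "crayons"}, {"books", "bag", "uniform"},
    {"bag", "uniform", "pencil"},  {"bag", "uniform", "crayons"},
    {"bag", "uniform", "crayons"}, {"bag", "pencil", "books"},
    {"bag", "pencil", "books"},    {"bag", "books", "eraser"},
    {"uniform", "crayons", "books"}, {"uniform", "pencil", "eraser"},
]

def apriori(transactions, minsup):
    """Return a dict mapping each frequent item set (frozenset) to its support."""
    n = len(transactions)

    def frequent_among(candidates):
        # one scan of the data to count each candidate, keep those >= minsup
        out = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= minsup:
                out[c] = sup
        return out

    # pass 1: frequent one-element item sets
    current = frequent_among({frozenset([i]) for t in transactions for i in t})
    result = dict(current)
    k = 1
    while current:
        # candidates for the next pass come only from the previous pass
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = frequent_among(candidates)
        result.update(current)
        k += 1
    return result

res = apriori(transactions, 0.3)
print(res[frozenset({"bag", "uniform", "crayons"})])  # 0.3
```

On this data the loop terminates after the third pass, with {bag, uniform, crayons} as the single frequent three-element item set, matching the walkthrough above.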
Now, that was about item sets. A property of item sets is that we consider an item set as one entity: there is no ordering within it. It does not matter whether somebody buys a bag first, or a uniform first, or a crayon first; the only thing we infer is that the items bag, uniform, and crayons are quite likely to be bought together.
Therefore, if I run a supermarket, it would make sense for me to place bags, school uniforms, and crayons next to each other, because there is a higher probability that all three will be bought together. But when we are looking for association rules, we are also concerned about the direction of the association: "if A then B" is different from "if B then A". So association rule mining requires two different thresholds: the minimum support, as with item sets, and the minimum confidence with which we can determine that a given association rule is interesting.
So how do we mine association rules using apriori? Again, as before, we shall come back to the general procedure after we have illustrated with an example how association rules can be mined using the apriori algorithm.
The main idea is the following. First use the apriori algorithm to generate the set of all frequent item sets. Say we have generated a frequent item set of size 3, namely bag, uniform, and crayons, with a minsup of 0.3, that is, a minimum support threshold of 30%. This item set can be divided into the following rules: if bag then uniform and crayons; if bag and uniform then crayons; if bag and crayons then uniform; and so on. The first rule means that when a customer buys a bag, the customer also buys a uniform and crayons; the second means that if a customer has bought a bag and a school uniform, then the customer will also buy a set of crayons; the third that if a customer has bought a bag and a set of crayons, then the customer will also buy a school uniform; and so on.
Now each of these association rules has a certain confidence based on this data set. What is the confidence for the rule "if bag then uniform and crayons", that is, if a customer buys a school bag, then he or she will also buy a school uniform and a set of crayons? To calculate this, we first look at all the entries in which the customer has bought a bag: there are 8 such entries.
Among these 8 entries, in how many did the customer also buy a uniform and crayons? There are 3 such entries, 3 instances out of 8 in which this rule holds. Therefore, whenever a customer buys a bag, one can say with 3/8, or 37.5%, confidence that the customer is also going to buy a school uniform and a set of crayons. Similarly we can calculate the confidence of each of the other association rules: 0.6, 0.75, 0.428, and so on.
Now, given a minimum confidence as a second threshold, suppose we say that the minimum confidence is 0.7. Then, among the rules we have discovered, we retain every rule that has a confidence of at least 70%. That means we have discovered the following three rules: if bag and crayons then uniform; if uniform and crayons then bag; and if crayons then bag and uniform. What does that mean in plain English? It means that people who buy a school bag and a set of crayons are likely to buy a school uniform as well: bag and crayons implies uniform.
Similarly, people who buy a school uniform and a set of crayons are also likely to buy a school bag; and if somebody buys a set of crayons, they are very likely to buy both a school bag and a school uniform. That is, if somebody buys crayons, then with 75% confidence one can say that they will also buy a bag and a school uniform. Again, this is useful for direct marketing and the like: if somebody is interested in crayons, you can be reasonably sure that they are also interested in a bag and a school uniform. Now let us look back at the algorithm for mining association rules.
A simple mechanism for mining association rules
is, first of all, to use apriori to generate
frequent item sets of different sizes, and at each iteration
divide each item set into two parts,
an LHS part and an RHS part: the left hand
side part, the antecedent, and the right hand
side part, the consequent.
So this represents a rule of the form LHS
implies RHS. Then the confidence of such a
rule is the support of the entire item set
divided by the support of the LHS. That is, support of LHS union
RHS divided by support of LHS gives us the
confidence of this rule. And then we discard
all rules whose confidence is less than minconf.
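The rule-generation step just described can be sketched in Python. The support values below are hypothetical illustrative numbers, not the ones from our table:

```python
from itertools import combinations

# Sketch: given frequent item sets and their supports (here hard-coded,
# hypothetical values), split each set into every LHS/RHS pair, compute
# confidence = support(LHS u RHS) / support(LHS), and keep rules >= minconf.

def generate_rules(supports, minconf):
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                rhs = tuple(sorted(set(itemset) - set(lhs)))
                conf = supp / supports[frozenset(lhs)]
                if conf >= minconf:
                    rules.append((lhs, rhs, conf))
    return rules

# Hypothetical supports, as fractions of all transactions:
supports = {
    frozenset({"bag"}): 1.0,
    frozenset({"crayons"}): 0.5,
    frozenset({"uniform"}): 0.5,
    frozenset({"bag", "crayons"}): 0.5,
    frozenset({"bag", "uniform"}): 0.375,
    frozenset({"uniform", "crayons"}): 0.375,
    frozenset({"bag", "uniform", "crayons"}): 0.375,
}

for lhs, rhs, conf in generate_rules(supports, minconf=0.7):
    print(lhs, "->", rhs, round(conf, 3))
```

Note that by the apriori property every LHS of a frequent item set is itself frequent, so its support is guaranteed to be in the table.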
So now let us look into the question of how
we prepare tabular data for association rule
mining, or let us say item set mining, and so on.
Because we have been using, let us say, a relational
data set, you might have observed something,
or had a little doubt, when we were
considering a data set like this. There
is something peculiar about this data set.
What is peculiar about this data set here?
The peculiarity is that it looks like every
customer coming to this store is buying exactly
three items, which is very unlikely.
In fact what is more realistic is that this
data set contains records of variable
length. That is, one customer may have bought
just two different items, whereas some other
customer may have bought 10 different items,
a third customer may have bought only
5 different items, a fourth customer may
have bought only one item, and so on and so
forth.
So it is not possible to represent this item
set data as a well formed table like
this, because basically it is a set of item
sets of different lengths. In fact the best
way to represent this would be in a normalized
form, let us say in a database where, for example,
the same bill number 15563 appears twice;
both of these rows refer to the same customer. That is,
it is the same customer who has bought books
and crayons. This is not completely normalized,
because the date is not really necessary here,
but nevertheless all of these records
are of uniform length, and if we order them based
on bill number then we get the
set of all the different transactions.
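This normalization step can be sketched in Python. Bill number 15563 is from the running example; the other bill numbers, the date placeholders, and some of the items are hypothetical:

```python
from collections import defaultdict

# Sketch: turn normalized (bill_number, date, item) rows, all of uniform
# width, back into variable-length transactions by grouping on bill number.
rows = [
    (15563, "day1", "books"),
    (15563, "day1", "crayons"),
    (15564, "day1", "bag"),
    (15564, "day1", "uniform"),
    (15564, "day1", "crayons"),
    (15565, "day2", "bag"),
]

transactions = defaultdict(set)
for bill, date, item in rows:
    transactions[bill].add(item)  # one variable-length item set per bill

print(dict(transactions))
```

The resulting dictionary maps each bill number to an item set of whatever length that customer's purchase happened to have.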
Now depending on what we are looking for,
this ordering might make a difference. How
does this ordering make a difference
when we are looking at a data set like this?
Given a data set like this, performing
group by's on different fields will yield
different kinds of behavioral data sets.
So what does that mean? Suppose we
perform a group by based on the bill number;
then each group will
represent the behavior of one particular customer,
that is, one bill
number represents one particular customer
or one particular transaction. So suppose
we group by bill numbers and then
perform apriori across these different groups;
then we would be getting frequent patterns
across different customers.
On the other hand, suppose we group by
date rather than bill number. Then all transactions
happening on a given date will come into
one group and all transactions happening on
another date will come into another group.
A given date may have transactions from
several different customers, but all of them
are now grouped into one single group. And
suppose we run apriori over these different
groups; then we would actually
be looking for frequent patterns across different
days, that is, across the different dates. So
we have to interpret what we mean by something
that is frequent based on how we have grouped
the data. If we have grouped the data by
customer then it would show aggregate
behavior over the set of all customers with
whom we are interacting.
On the other hand, if you have performed the
group by over dates and run apriori over the result,
then it would show you aggregated behavior
over a given time period rather than over
the set of all customers. Well, it also includes
the set of all customers, but what is more
important here is how the behavior
has changed over time.
So if something is frequent over time, it
means that it is in some sense uniform or
consistent over this entire period of time.
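The contrast between the two group-by choices can be sketched as follows. The rows are hypothetical, shaped like the running example; grouping by bill number yields per-customer item sets, while grouping by date pools everything sold on a day into one item set:

```python
from collections import defaultdict

# Sketch: the same (bill_number, date, item) rows grouped two different ways.
rows = [
    (15563, "day1", "books"),
    (15563, "day1", "crayons"),
    (15564, "day1", "bag"),
    (15565, "day2", "bag"),
    (15565, "day2", "crayons"),
]

def group_by(rows, key_index):
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])  # collect items under the chosen key
    return dict(groups)

by_bill = group_by(rows, 0)  # per-customer behavior: one group per bill
by_date = group_by(rows, 1)  # per-day behavior: one group per date

print(by_bill)
print(by_date)
```

Running apriori over `by_bill` would find patterns frequent across customers; running it over `by_date` would find patterns frequent across days.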
So let us summarize what we have learnt in
this session. We started with the notion of
data mining and, like I said in the beginning,
data mining is a very interesting sub field
of databases which has elicited a lot of
interest not just from researchers and
not just from the technology perspective, but
from several other perspectives like the defense
or security perspective, the commerce,
that is business, perspective and so on. And
several debates have raged on
whether it is right to use data mining to
look for certain behavior patterns.
For example, would it be right if a government
uses data mining over, let us say, the set of
all the different activities of people and finds
out the behavior pattern of a particular
individual, and so on? And there are pros and cons
on both sides of the debate: one would say that
for security reasons it is right to look for
behavior patterns, and another would say that for
privacy reasons it is not right to look for
behavior patterns, and so on and so forth.
So it's a topic which is very much pertinent
and has spawned a huge amount of interest from
several different areas.
And data mining, in some sense, I called
it a sub field of databases, but that's not
entirely true, in the sense that data mining
and knowledge discovery, many would claim, is
a field in itself. That is, it relies on database
concepts as well as several other concepts
like learning theory or statistical inference
in order to perform data mining. So don't be really surprised if
somebody says that data mining is a complete
field in itself and is only associated with
databases, not really a sub field of databases.
But anyway, data mining, as we said, is the process
of discovery of previously unknown patterns,
in the sense that we are not really sure
what it is that the database is going to give
us, or what new pattern or what new nugget
of knowledge, so to say, we are going to
learn as part of the data mining process.
As a result there is no query as part of a
data mining process; that is, a data mining
algorithm is based around one or more interestingness
criteria rather than a given query.
And we saw that conceptually it is in
some way the opposite of statistical inference,
where we start with a null hypothesis and
either refute or prove the hypothesis by
statistical sampling of the population, while
here we don't start with a hypothesis, but
the end result of the data mining process
is a set of patterns which can help us in
formulating a hypothesis. We also saw the
notion of association rules and item sets,
as well as the concepts of support and confidence,
and two algorithms: the apriori algorithm
for mining frequent item sets, from which
we also derived the algorithm for mining
association rules. In the next session on
data mining, we are going to look at several
other algorithms, like say classification or
discovery. So that brings us to the end
of this session. Thank you.
