Hi guys, this tutorial is aimed at teaching what a decision tree is.
I have a very small
problem at hand.
I want to play tennis,
but whether I go out and play or not depends on a lot of factors:
is it raining today? Is it humid? If
it's humid, it's very difficult to play.
So what do I do?
Should I go out or not?
Who will help me?
This is where my decision tree would help me.
Now, what is a decision tree?
A decision tree is a supervised machine learning algorithm
used for predicting outcomes
based on certain rules,
and it works by recursively partitioning the data into subsets.
Don't be scared by whatever I've just said; I'll explain everything as things come along.
This is some data that I've collected.
On day 1 it was sunny, it was really hot, it was
humid and there was a weak wind as well, so I
didn't play tennis that day.
On the other hand,
when it was overcast and hot, with normal humidity and a weak wind blowing, I did play tennis that day.
Now, today it's a very sunny day with a cool temperature,
it's very humid,
and there is a strong wind blowing as well.
Whether I go out or not is what my
14 datapoints will help me decide.
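The full table of 14 rows is never shown on screen, so as a reference point, here is Quinlan's classic Play Tennis table, which matches every count quoted in this tutorial (9 Yes / 5 No and all the per-attribute splits); treat the exact rows as an assumption:

```python
# Quinlan's classic Play Tennis table (assumed; it reproduces every
# count quoted in this tutorial).
rows = [
    # (Outlook, Temperature, Humidity, Wind, Play)
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
yes = sum(r[-1] == "Yes" for r in rows)
print(len(rows), yes, len(rows) - yes)  # 14 rows: 9 Yes, 5 No
```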
As you've already seen,
the four attributes are outlook, humidity, wind
and temperature, and my target variable is whether
I play tennis or not.
Finally, this is how
my decision tree should look. But have
you noticed something?
Why is outlook first,
and then humidity, and then wind?
Why can't I first have humidity,
and then outlook, and then wind?
There is a reason for that.
The red boxes are the internal nodes,
the blue boxes are the values of those internal nodes,
and your leaf nodes are
basically your final output: the target variable.
Now, there is something called entropy.
The ordering of your nodes, that is, your splitting criterion,
is based on entropy.
Entropy measures, given
a node,
how homogeneous the samples at that node are.
By homogeneity, I mean this:
if I have 20 samples
and I make a split,
do all 20 fall into one category,
all yes
or all no?
Homogeneous is either
20 and 0, or 0 and 20.
Non-homogeneous is 10 and 10:
if I make a split and there are fifty percent yes and fifty percent no,
that is when I say it's not at all homogeneous.
In that case, the formula is
entropy = -(p+) log2(p+) - (p-) log2(p-),
where p+ is the fraction of positive samples and p- is the fraction of negative samples.
So this way you have an entropy function.
Now, what entropy tells you is how homogeneous your samples are:
if your sample is perfectly homogeneous,
the value is 0;
if your sample is not homogeneous at all (a 50/50 split),
the value is 1.
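As a quick sketch (my own helper, not code from the video), the entropy of a node with a given number of positive and negative samples looks like this:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a node holding `pos` positive and `neg` negative samples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count > 0:                 # convention: 0 * log2(0) = 0
            p = count / total
            e -= p * log2(p)
    return e

print(entropy(20, 0))    # perfectly homogeneous -> 0.0
print(entropy(10, 10))   # 50/50 split -> 1.0
```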
The criterion for making a split
is known as
information gain.
Now, information gain
starts off by first calculating the entropy
of your target variable.
If you go back a
bit, you will realize I have nine yes's
and five no's
in the Play Tennis column.
So what I'll do first is
calculate
the entropy of my output variable,
that is, Play Tennis;
that turns out to be 0.94.
In order to make a
split,
I want a very high information gain.
So these are my training examples, as I
have already shown you.
Now, on to selecting the
first splitting attribute.
I start with humidity.
At the root it has nine positives and five negatives,
so the entropy at that point is 0.94, the way
we started off.
Humidity splits into High and Normal,
which have three positive and four negative
samples, and six positive and one negative
sample, respectively. Calculating the entropy at each branch
from that formula,
-(p+) log2(p+) - (p-) log2(p-),
you get 0.985 for High, and
for the Normal samples you get an entropy of 0.592.
The formula for the gain is the starting entropy minus,
for each branch, the entropy of that branch
weighted by the fraction of samples it holds. Here the High branch has
seven of the fourteen elements you started
with, so that's seven by fourteen times
its entropy of 0.985; likewise, the
Normal branch contributes seven by fourteen
times its entropy, which is 0.592.
Subtracting both from 0.94, you get 0.151.
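Plugging those counts into the entropy formula checks the humidity gain in a few lines (a sketch; the 0.151 quoted above is this value under slightly coarser rounding):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            e -= p * log2(p)
    return e

parent = entropy(9, 5)   # ~0.940: the whole Play Tennis column
high   = entropy(3, 4)   # ~0.985: Humidity = High   (7 of 14 rows)
normal = entropy(6, 1)   # ~0.592: Humidity = Normal (7 of 14 rows)

gain_humidity = parent - (7 / 14) * high - (7 / 14) * normal
print(gain_humidity)     # ~0.1518, the 0.151 quoted in the tutorial
```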
You do something similar for the wind
variable and get 0.048,
which is very low.
And checking the outlook
variable: you start with fourteen
samples, and here they split into five,
four, and five again, so your
gain here would be 0.940
minus 5/14 times 0.971,
minus 4/14 times 0, minus again
5/14 times 0.971, which turns out to be 0.247.
If you're wondering why that middle entropy is 0.0,
just see: it's a very homogeneous sample, one that
has all positives, so it's a very homogeneous
sample, and that is why the entropy at that
node is 0.0.
Once you have all the entropies
in hand, sorry, the gains in hand, the
attribute with the maximum gain is the one that
you use as your splitting criterion.
And going
ahead, I've also assumed that 0 log to the base
2 of 0 is 0.
So, coming here:
now that the first node is ready, I have
to split on the other attributes at this
node. I again check the information gain,
so again I go back and do the same calculation
for the remaining three attributes, humidity,
temperature and wind, and I find that
the gain for humidity is again the maximum, so humidity
comes here, and I'll go on doing this.
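The whole procedure (pick the max-gain attribute, split, recurse) can be sketched as a tiny ID3 implementation; the 14 rows are repeated so the block runs on its own, and are assumed to be Quinlan's classic table, which matches every number quoted in this tutorial:

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]  # columns 0-3
DATA = [  # (Outlook, Temperature, Humidity, Wind, Play) - assumed rows
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    """Information gain of splitting `rows` on attribute column `i`."""
    g = entropy([r[-1] for r in rows])
    for value in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == value]
        g -= len(subset) / len(rows) * entropy([r[-1] for r in subset])
    return g

def build(rows, attr_idx):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # homogeneous node -> leaf
        return labels[0]
    if not attr_idx:                   # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_idx, key=lambda i: gain(rows, i))
    children = {value: build([r for r in rows if r[best] == value],
                             [i for i in attr_idx if i != best])
                for value in set(r[best] for r in rows)}
    return (ATTRS[best], children)

tree = build(DATA, [0, 1, 2, 3])
print(tree[0])  # the root attribute, derived exactly as above
```

Each internal node comes out as an (attribute, {value: subtree}) tuple and each leaf as a plain "Yes"/"No" string, mirroring the internal nodes, values, and leaves of the diagram.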
And that's how I
grow the decision tree. Once a new testing
datapoint arrives, I just have to traverse the
decision tree until I reach the yes or no leaf that
matches the current criteria, and I'll give
that as my output. Hopefully you enjoyed this
small introduction to how a decision tree
works, and thank you.
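To close the loop, classifying today's conditions (sunny, cool, very humid, strong wind) is just a walk down the finished tree; the nested-tuple encoding below is my own, not something shown in the video:

```python
# The finished tree written out by hand: internal nodes are
# (attribute, {value: subtree}) tuples, leaves are "Yes"/"No" strings.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"}),
})

def predict(node, sample):
    while isinstance(node, tuple):     # descend until we hit a leaf
        attribute, children = node
        node = children[sample[attribute]]
    return node

today = {"Outlook": "Sunny", "Temperature": "Cool",
         "Humidity": "High", "Wind": "Strong"}
print(predict(tree, today))  # No: on the Sunny branch, High humidity loses
```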
