Hey, welcome back to the data science course. In today's session we'll look at different methods to evaluate a model's performance. We'll start off by learning how to find the right threshold for a model, then we'll build an ROC curve, and finally we'll learn how to get the area under the curve. So let's get started.

Now, in the previous sessions we gave some arbitrary values for the threshold, but obviously that is not the right way; to get the best accuracy we need to find that perfect threshold value, so we need to plot a graph of accuracy versus cutoff. There's a package called ROCR in R which helps us get this threshold value. Accuracy is not the only criterion, though: there should also be the right trade-off between the true positive rate and the false positive rate. The best cutoff has the highest true positive rate together with the lowest false positive rate, and to get the right balance between the two we can build an ROC curve. The term ROC stands for receiver operating characteristic, and the curve shows the performance of a model across all classification thresholds. Then we have AUC. AUC measures the entire two-dimensional area underneath the ROC curve, so it's an aggregate measure of performance across all possible classification thresholds, and it ranges in value from zero to one: a model whose predictions are 100% wrong has an AUC of zero, and a model whose predictions are 100% correct has an AUC of one.

So let's go to RStudio and work with these. We have RStudio right in front of us, and this is our customer churn dataset. Before we go ahead with the performance metrics, let's actually build the model.
First I load the caTools package. Now, with the help of the sample.split function I'll divide the dataset, and I'll be dividing it with respect to the Churn column; the split ratio I'll give is 0.65. I will store this in an object called split_tag. Now, with the help of the subset function, let me split the data. I give the first parameter, which is the dataset; then, from the entire dataset, wherever the value of split_tag is TRUE I'll store the observations in the training set, and similarly, wherever the value of split_tag is FALSE I'll select the observations and store them in the test set. So we have the training and testing sets ready.

Now it's time to build the model on top of the training set, so I'll use the glm function. I give the formula first: this time I want to understand how churn varies with respect to the monthly charges, i.e. Churn will be the dependent variable and MonthlyCharges will be the independent variable. Next is the dataset, which is train, and after the dataset I need to give the family of regression, which is binomial here. I will store this in an object called mod_log. So I've built the model; now it's time to predict the values. I use the predict function, whose first parameter is the model we just built, so that will be mod_log; then we need to give the new data, which is test; and after the new dataset I need to give the type of prediction, which here is "response". I will store this in an object called result_log. Let me have a glance at the predicted values with View — yes, these are all of the predicted probabilities.
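The steps so far can be collected into one runnable sketch. This is a minimal, self-contained version: the column and object names (Churn, MonthlyCharges, split_tag, mod_log, result_log) mirror the session, but the data here is synthetic, and base R's sample() stands in for caTools::sample.split() so the snippet runs without extra packages.

```r
set.seed(7)

# Synthetic stand-in for the customer churn dataset used in the session
customer_churn <- data.frame(
  MonthlyCharges = runif(1000, 20, 120),
  Churn          = factor(sample(c("No", "Yes"), 1000, replace = TRUE))
)

# 65/35 split; the session uses caTools::sample.split(customer_churn$Churn, 0.65)
split_tag <- rep(FALSE, nrow(customer_churn))
split_tag[sample(nrow(customer_churn), 0.65 * nrow(customer_churn))] <- TRUE
train <- subset(customer_churn, split_tag == TRUE)
test  <- subset(customer_churn, split_tag == FALSE)

# Logistic regression: Churn as the dependent variable,
# MonthlyCharges as the independent variable
mod_log <- glm(Churn ~ MonthlyCharges, data = train, family = "binomial")

# type = "response" returns predicted probabilities of churn
result_log <- predict(mod_log, newdata = test, type = "response")
head(result_log)
```

Because family = "binomial" with type = "response", every value in result_log is a probability between 0 and 1.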
Now let me just build a confusion matrix with a random threshold. I type table, I include the actual values, test$Churn, and then the predicted values, which are in result_log, with an arbitrary threshold value of 0.1. This is what we get. But then again, we can't just use any random threshold value, and this is where performance metrics come in; we can get them with the help of the ROCR package. Let me load it first: library(ROCR) — keep in mind that ROCR is all in capitals. So I have loaded the ROCR package. This package gives us two very important functions, prediction and performance. Let me show you how it's done. I type prediction, which takes two parameters: first, all of the predicted values, and next, the labels of the original values. So let me give the predicted values, which are in result_log, and the labels from the original set, which are in test$Churn. I'll store this in a new object named pred_log — this is a new prediction-type object.

Now I will use the performance function. This function helps us work with different performance metrics such as accuracy, the ROC curve, and the area under the curve. To start with, I'd like to get the maximum accuracy for this model. Again this takes two parameters: first the prediction object we've just built and stored in pred_log, and then the type of performance metric we want. I want accuracy, so I type "acc", and I store this in an object called acc. Now let me plot this, so plot(acc), and let's see what we get. So this is basically a plot of accuracy versus cutoff — accuracy is on the y-axis and cutoff is on the x-axis — and it helps us determine how the accuracy varies with respect to the cutoff. What we see is that the maximum accuracy is around 0.7, and the threshold value for that would be around 0.41 or 0.42.
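What performance(pred_log, "acc") computes can be cross-checked by hand: for every candidate cutoff, threshold the probabilities and measure accuracy. Here is a self-contained sketch with made-up values (in the session this runs over result_log and test$Churn instead):

```r
probs  <- c(0.9, 0.7, 0.6, 0.45, 0.4, 0.3, 0.2, 0.1)  # made-up probabilities
actual <- c(1,   1,   0,   1,    0,   0,   1,   0)    # 1 = churn, 0 = no churn

# Accuracy at each cutoff: fraction of cases where (prob > cutoff) matches the label
cutoffs  <- seq(0, 1, by = 0.05)
accuracy <- sapply(cutoffs, function(c) mean((probs > c) == (actual == 1)))

plot(cutoffs, accuracy, type = "l")   # accuracy (y-axis) versus cutoff (x-axis)
best <- cutoffs[which.max(accuracy)]
best                                  # the cutoff with the highest accuracy
```

Reading the peak of this curve is exactly how the 0.41–0.42 threshold is picked from the ROCR accuracy plot.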
So now what I'll do is build a confusion matrix with respect to this threshold value, so that I get the maximum accuracy. I type table, I give the actual values, test$Churn, then the predicted values from result_log, and the threshold value of 0.41. This is what we get. Now let me check the accuracy: it will be (1780 + 6) / (1780 + 6 + 31 + 648), so we get an accuracy of 72 percent when we use the threshold value of 0.41.

But then again, as I've told you, accuracy cannot be the only measure when we are trying to evaluate the performance of a model; we also have to look at the true positive rate and the false positive rate. Let's take the true positive rate first. The true positive rate is found by dividing the true positives by the sum of true positives and false negatives, so it will be 6 / (6 + 648), and the true positive rate is about 0.009. Now let me get the false positive rate: the false positive rate is obtained by dividing the false positives by the sum of false positives and true negatives, so it will be 31 / (31 + 1780), and the false positive rate is about 0.017. Now, the true positive rate is very low; the false positive rate is also low, but it is actually greater than the true positive rate, which should not be the case. So even though the accuracy is good, the trade-off between the true positive rate and the false positive rate is very bad, and that is why we will build an ROC curve — to get the right trade-off between the true positive rate and the false positive rate.
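The three numbers above can be reproduced directly from the confusion-matrix counts quoted in the session for the 0.41 cutoff:

```r
# Confusion matrix at cutoff 0.41 (rows = actual, columns = predicted):
#        FALSE TRUE
#   No    1780   31
#   Yes    648    6
tn <- 1780; fp <- 31; fn <- 648; tp <- 6

accuracy <- (tp + tn) / (tp + tn + fp + fn)
tpr <- tp / (tp + fn)   # true positive rate (sensitivity)
fpr <- fp / (fp + tn)   # false positive rate

round(accuracy, 2)   # 0.72
round(tpr, 4)        # 0.0092
round(fpr, 3)        # 0.017
```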
Again I will use the performance function. I give the prediction object, which is pred_log; since I want to build an ROC curve, I type "tpr", which stands for true positive rate, and then "fpr", which stands for false positive rate, and I store this in, let's say, roc. Now let me plot this, so plot(roc), and this is what we get: a graph of true positive rate versus false positive rate — the true positive rate is on the y-axis and the false positive rate is on the x-axis. Now let me also color the line with respect to the cutoff, so I'll use the colorize parameter and set it to TRUE.
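Under the hood, the curve that performance(pred_log, "tpr", "fpr") draws is just the (FPR, TPR) pair at every cutoff, and the same points also give the area under the curve (which the session computes later with ROCR). Here is a hand-rolled, self-contained sketch with made-up values:

```r
probs  <- c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1)   # made-up predicted probabilities
actual <- c(1,   1,   0,   1,   0,   0)     # 1 = churn

# One cutoff per distinct probability; Inf gives the (0, 0) starting point
cutoffs <- c(Inf, sort(unique(probs), decreasing = TRUE))
tpr <- sapply(cutoffs, function(c) sum(probs >= c & actual == 1) / sum(actual == 1))
fpr <- sapply(cutoffs, function(c) sum(probs >= c & actual == 0) / sum(actual == 0))

plot(fpr, tpr, type = "b")   # one point per cutoff, as in the ROCR plot

# Area under this curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Lowering the cutoff can only add predicted positives, which is why both rates rise monotonically from (0, 0) to (1, 1) along the curve.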
So the color of the line you see is determined by the cutoff values; on the color scale the cutoff values range from about 0.15 up to about 0.43. Now, when we use a threshold value of 0.41, that is over here where the line is red, and at that point the true positive rate is very, very low and the false positive rate is very, very low — so that is not the right trade-off between true positive rate and false positive rate. So let's take some other point: over here the true positive rate is very good and the false positive rate is moderate enough. This is the green region, which ranges from about 0.26 to 0.32, and the value I'll take is 0.28. So now I'll take a threshold value of 0.28 and build the confusion matrix with respect to that threshold: table, first test$Churn, then the predicted values from result_log, and the threshold value of 0.28.
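The confusion matrix at this cutoff gives the counts below (from the session's output), and the metrics discussed next follow directly from them:

```r
# Confusion matrix at cutoff 0.28 (rows = actual, columns = predicted):
#        FALSE TRUE
#   No    1066  745
#   Yes    225  429
tn <- 1066; fp <- 745; fn <- 225; tp <- 429

accuracy <- (tp + tn) / (tp + tn + fp + fn)
tpr <- tp / (tp + fn)   # true positive rate
fpr <- fp / (fp + tn)   # false positive rate

round(accuracy, 2)   # 0.61
round(tpr, 2)        # 0.66
round(fpr, 2)        # 0.41
```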
Now let me start off by checking the accuracy: it will be (1066 + 429) / (1066 + 429 + 225 + 745), so we get an accuracy of about 61 percent. Even though this accuracy is not as good as before, let's also check the true positive rate and the false positive rate before we come to a conclusion. Let's start with the true positive rate: it will be 429 / (429 + 225), so we get a true positive rate of about 0.66. So when we used the threshold value of 0.41 the true positive rate was very low, and when we use the threshold value of 0.28 the true positive rate is good enough.
Now let's also check the false positive rate: it will be 745 / (745 + 1066), so the false positive rate is about 0.41, which is lower than the true positive rate. So this seems to be a much more reasonable trade-off between the true positive rate and the false positive rate. That is why accuracy cannot be the only performance metric when we are evaluating a model; we also have to look at the true positive rate and the false positive rate. So this was the ROC curve.

Then we have the final performance metric, the AUC, which is nothing but the area under the curve, and this area under the curve gives us an aggregate measure of performance across all possible classification thresholds. So again I type performance; I give the prediction object, which is pred_log, and after this I give the type of performance metric, which is "auc", and I'll store this in an object called auc. Now let me print this. So this is the area under the ROC curve, and the value for it is 0.62. Now, as I've already told you, if this value is closer to one then the predictions are close to 100 percent correct, and if it is closer to zero then they are close to 100 percent wrong; it's 0.62 here, so it's good enough. So these were the three performance metrics from the ROCR package, and yes, this brings us to the end of this session. Thanks for attending, and let's meet in the next class.
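To recap, here are the three ROCR calls from this session collected in one place. This is a sketch of the session's workflow, not a standalone script: it assumes the ROCR package is installed and that result_log (the predicted probabilities) and test$Churn (the actual labels) already exist as built above.

```r
library(ROCR)

# Pair the predicted probabilities with the actual labels
pred_log <- prediction(result_log, test$Churn)

# 1. Accuracy at every cutoff
acc <- performance(pred_log, "acc")
plot(acc)

# 2. ROC curve: true positive rate versus false positive rate
roc <- performance(pred_log, "tpr", "fpr")
plot(roc, colorize = TRUE)   # line colored by cutoff value

# 3. Area under the ROC curve
auc <- performance(pred_log, "auc")
auc@y.values[[1]]            # 0.62 in this session
```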
