Using getwd() I can see that right now my working directory is Desktop. I am going to read a CSV file, and the first row contains information about the variables, so I am going to say header = TRUE, and we'll call this data set binary. We can run this, and you can see binary has 400 observations and 4 variables. We can look at its structure with str().
Using this data we want to create a predictive model to predict whether or not a student will be admitted to this college, and the variables that help make this prediction are GRE, GPA, and rank. So I am going to use the nnet package. Our variable of interest is admit, then a tilde, and the dot means I want to use all three variables, GRE, GPA, and rank, and our data is binary. We run this model and get a solution.
Now let's predict the values for this data set binary and store these predictions in p. Then we can create a table of the predictions against the actual values, and we'll store the table in tab. If you look at tab, this is what we get: the actual values are given on one side, and here you have the values based on the prediction.
This means 253 students who applied were not admitted, and this model also predicts that they are not admitted. But there are 20 students who were actually not admitted while the model predicts that they should be admitted, so this is misclassification. Similarly, there were 98 students who were actually admitted but the model says they should not be admitted, and 29 who were actually admitted and for whom the model also predicts that they should be admitted. So obviously the correct classification rate based on this data set is 253 plus 29 divided by the size of the entire data set, which is 400. Let's calculate that: the sum of the diagonal values of the table divided by the sum of the entire tab.
This gives us about 0.705, so that is the correct classification rate, and if you do one minus that we get 0.295, which is the misclassification rate. Now the question is whether this 70.5% correct classification is good, so let's simply look at how many students were actually admitted and how many were not. I'm going to make a quick small table.
You can see that in the data set 127 students were admitted and 273 out of 400 were not admitted. One way to predict whether or not these applicants will be accepted is to go with the larger group: 273 divided by 400 means that if we predict that no student will be accepted, we will still be right 68.25% of the time. If we create a statistical model and find that its accuracy is less than this number, obviously we should not use that model. Right now our logistic regression model gives an accuracy of 70.5%, which is slightly better, so at least it beats this benchmark number.
Now, to move on to model performance evaluation, let's make use of the ROCR package. First we'll make a prediction using the model that we developed, mymodel, on our data set binary, and the type of prediction that we want is probability, so we predict probability values and store them in pred.
Now pred has 400 prediction values. If you want to look at them you can type pred, or in fact head(pred), and you can see the first 6 probability values. If you look at head(binary), you can see that the first applicant was not admitted, and our predicted probability is 0.18, which is very low, so the prediction also is that this student should not be admitted. This is a correct prediction.
The classification table that we made uses a cutoff of 0.5: if the probability is below 0.5 the prediction is 0, and if the probability is above 0.5 the prediction is 1. Here you can see the second probability is 0.3, so the prediction is that the student should not be admitted, whereas in reality this student was admitted, so this is a classification error. Similarly, another one is 0.71, which is more than 0.5, and you can also see that this student was admitted, so this is a correct classification. Now let's look at the prediction values by making a histogram of pred.
You can see these probabilities vary between zero and about 0.8, and most of the values are below 0.4. So if we use 0.5 as the cutoff we get one classification result, but if we use a cutoff of, say, 0.4 or 0.6, the accuracy or misclassification might change. Let's see what happens if we do that. For that I am going to make use of the prediction() function within ROCR, using the probabilities that we calculated and stored in pred, together with the actual values, and store the result in pred again. Now we use performance() on this pred, ask for accuracy values, store the result in eval (for evaluation), and then plot eval.
We get this kind of curve: the cutoff values change from zero to one along the x-axis, and for each cutoff the picture shows the overall accuracy we would get. You can see that when the cutoff is close to 0.1 the accuracy is really very low, in fact close to 30%, and it rises rapidly as we increase the cutoff and reaches a peak. Remember 0.5 was our default cutoff, and here we can see what the accuracy would have been for other cutoff values. Now, to identify the best value, let's draw a horizontal line on this chart using abline at about 0.71; you can see the peak is roughly there, and the corresponding cutoff is about 0.45, so we can add a vertical line at 0.45. This gives us more or less the highest accuracy for this model, but it is based on an eyeball estimate.
To find the exact value we are going to use which.max, i.e., which one is the maximum. The way the ROCR package is built, the results are stored in slots, so I am going to make use of slot() on eval; we are interested in y.values, and with double square brackets we specify the first element, and we store the result in max. Before you run this, if you simply want to look at what exactly is contained in eval, you can just type eval and hit Enter; you will notice there are a lot of values in there, y values, x values, and so on. So let's run this; it identifies the position of the maximum value, and if you simply type max, it says it is the sixty-first value. Now let's go into the slot for eval again; we are interested in y.values, with double square brackets, and then one more square bracket in which we specify the max that we identified on the last line, and we store this accuracy value in acc. So now acc has that value; if we look at what is contained in acc, it is 0.7175, so the highest accuracy here is 0.7175. Now we want to figure out the optimal cutoff level for that 0.7175 value; it may not be exactly the 0.45 that we saw on the graph, it may be slightly different.
We are going to use the same format: slot() on eval, but now we want the values on the x-axis, so x.values, with double square brackets and one in it, and again indexed at the maximum; let's call this cut, for cutoff. If you look at the cutoff, you can see it is 0.4683, so not exactly the 0.45 we were reading off the graph. Now we can print both: running this gives us the accuracy value and the cutoff value, so compared to the default cutoff of 0.5 that we had, a cutoff of 0.4683 gives us a better accuracy of 0.7175. Remember, this is based on the kind of table that we saw earlier; that table was just for one situation, where the cutoff is 0.5, and it tells us how the model performed. But sometimes, instead of focusing on the overall accuracy or misclassification, we are more concerned with predicting accurately in one group compared to the other. For example, if we have data on bankruptcies and we are trying to predict whether a company will go bankrupt, represented by 1, our interest may lie more in correctly predicting 1 than 0. That's where we can make use of the ROC curve.
We'll make use of performance() again and calculate tpr, the true positive rate. Based on this table, the true positive rate is 29 divided by (29 plus 98), so the true positive rate is about 23%. Obviously this is a very low accuracy level for correctly predicting 1; most of the time 1 is being misclassified as 0, so this part obviously needs big improvement. When we look at the overall model and see that the accuracy is 71.75%, that looks very good, but when we have to focus on the 1s, an accuracy level of 23% obviously is not very good. Similarly, we also calculate what is called the false positive rate, again from the same table: here 20 is falsely predicted as 1 out of 20 plus 253, so the false positive rate is 20 divided by 273, about 7%. We can do this calculation and store the result in roc, because we are going to make a ROC curve. Remember, these calculations are based on the default cutoff value of 0.5, but with the ROC curve we will also be able to see the performance for different cutoff values; that's the idea.
We got this one; now let's plot roc. This is how the ROC curve looks: you have the true positive rate on the y-axis and the false positive rate on the x-axis. The ideal situation would be a curve that starts here at (0, 0), goes straight up to (0, 1), and then across to (1, 1); that would be perfect classification, with 100% accuracy. In reality, based on the data, we get curves that are not really close to that ideal. We can draw a diagonal line with intercept a = 0 and slope b = 1; this straight line represents a model with no predictive value, no better than random guessing. If a model does worse than that, its curve will fall below the line, but obviously in this case the model is doing better. This curve can be compared across different models to see which model is doing better and which is not doing well. We can customize this chart
by adding a few more things. We can colorize it by saying colorize = TRUE; if you run that line you will see there is now a color, and that color is based on the cutoff. For example, 0.5 is somewhere here, the light green region; the cutoff values range from about 0.05 up to 0.72 in this example. We can also add a title. Note that the y-label here, true positive rate, has another name, sensitivity; for the x-label we can say 1 - specificity, which is another name for the false positive rate. If you run that, you get the title plus Sensitivity and 1 - Specificity as the axis labels, and of course we can draw the abline and see how the model performs against this benchmark.
Another way people use the ROC curve is to calculate the area under the curve. Visually we can see that this curve is doing better than the benchmark, but when you have many curves on one chart it becomes difficult to differentiate between the performances, so we need a numeric value. What we do is find the area under the curve; a higher area means better model performance. Note that the total area of the entire rectangle here is 1, and the area below the diagonal line is 50%. Obviously, the area under the curve for the model that we have built will be more than 50%, so let's see how much we get.
We will use performance() with the pred object that we calculated earlier and ask for auc, the area under the curve, and store this in auc. Then we make use of unlist() with slot() on the auc object we just calculated, taking y.values, and store that in auc as well. If you simply run auc, you can see we get 0.6921 and so on, but if you want to round this to fewer decimals you can use round() on auc with, let's say, four, and now you see only four decimal places.
Now let's add this number to the graph using legend. Say we want the legend to start at x = 0.6 and y = 0.2, we want the auc value shown, and we can also give it a title, so the area under the curve is indicated on the graph. You can also change the size of this text using cex; if you say 4 it will be very big and obviously will not fit, so let me run the line again with 1.2, and this is with size about 1.2.
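The AUC computation and legend might be sketched as:

```r
auc <- performance(pred, "auc")       # ROCR's area-under-curve measure
auc <- unlist(slot(auc, "y.values"))  # about 0.6921
auc <- round(auc, 4)

legend(0.6, 0.2, auc, title = "AUC", cex = 1.2)  # place the value on the chart
```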
