The following content is
provided under a Creative
Commons license.
Your support will help
MIT OpenCourseWare
continue to offer high quality
educational resources for free.
To make a donation, or to
view additional materials
from hundreds of MIT courses,
visit MIT OpenCourseWare
at ocw.mit.edu.
PROFESSOR: Hello, everybody.
Before we start the material,
a couple of announcements.
As usual, there's some
reading assignments,
and you might be surprised
to see something from Chapter
5 suddenly popping up.
But this is my
relentless attempt
to introduce more Python.
We'll see one new concept later
today, list comprehension.
Today we're going to
look at classification.
And you remember
last time, on Monday,
we looked at
unsupervised learning.
Today we're looking at
supervised learning.
Supervised learning can usually be divided
into two categories.
Regression, where
you try and predict
some real number associated
with the feature vector,
and this is something
we've already done really,
back when we looked at curve
fitting, linear regression
in particular.
It was exactly building a model
that, given some features,
would predict a point.
In this case, it
was pretty simple.
It was given x predict y.
You can imagine generalizing
that to multiple dimensions.
Today I'm going to talk
about classification,
which is very common,
in many ways more
common than regression
in the machine learning world.
And here the goal is to predict
a discrete value, often called
a label, associated with
some feature vector.
So this is the sort of thing
where you try and, for example,
predict whether a
person will have
an adverse reaction to a drug.
You're not looking
for a real number,
you're looking for will they get
sick, will they not get sick.
Maybe you're trying to predict
the grade in a course A, B, C,
D, and other grades
we won't mention.
Again, those are
labels, so it doesn't
have to be a binary label but
it's a finite number of labels.
So here's an example
to start with.
We won't linger on it too long.
This is basically
something you saw
in an earlier lecture, where
we had a bunch of animals
and a bunch of properties,
and a label identifying
whether or not they
were a reptile.
So we start by building
a distance matrix.
How far apart they are.
And in fact, in this case,
I'm not using the
representation you just saw.
I'm going to use the
binary representation,
as Professor Grimson showed
you, and for the reasons
he showed you.
If you're interested, I didn't
produce this table by hand,
I wrote some Python
code to produce it,
not only to compute
the distances,
but more delicately to
produce the actual table.
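Just to make the idea concrete, here's a hedged sketch of how such a distance matrix might be computed from binary feature vectors. The animal names and feature flags below are made up for illustration; they're not the lecture's actual table:

```python
import math

# Hypothetical animals with binary feature vectors (e.g. egg-laying,
# has-scales, poisonous, cold-blooded, has-legs as 0/1 flags).
# These names and flags are illustrative, not the lecture's real data.
animals = {
    'cobra':        (1, 1, 1, 1, 0),
    'rattlesnake':  (1, 1, 1, 1, 0),
    'dart frog':    (1, 0, 1, 0, 1),
    'alligator':    (1, 1, 0, 1, 1),
}

def distance(v1, v2):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Build the distance matrix as a dict of dicts.
matrix = {a1: {a2: distance(f1, f2) for a2, f2 in animals.items()}
          for a1, f1 in animals.items()}

print(round(matrix['cobra']['rattlesnake'], 3))  # identical vectors, so 0.0
```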
And you'll probably find it
instructive at some point
to at least remember
that that code is there,
in case you need to ever
produce a table for some paper.
In general, you probably noticed
I spent relatively little time
going over the
vast amounts of code
we've been posting.
That doesn't mean you
shouldn't look at it.
In part, a lot of
it's there because I'm
hoping at some point in
the future it will be handy
for you to have a model
on how to do something.
All right.
So we have all these distances.
And we can tell how far apart
one animal is from another.
Now how do we use those
to classify animals?
And the simplest approach
to classification,
and it's actually one that's
used a fair amount in practice
is called nearest neighbor.
So the learning part is trivial.
We don't actually learn anything
other than we just remember.
So we remember
the training data.
And when we want to predict
the label of a new example,
we find the nearest example
in the training data,
and just choose the label
associated with that example.
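A minimal sketch of that learning-free scheme in Python. The representation of the training data as a list of (feature vector, label) pairs is my assumption for illustration, not the posted code's interface:

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor(training, example, dist):
    """Return the label of the training point closest to example.

    training: list of (feature_vector, label) pairs.
    """
    best, best_label = None, None
    for features, label in training:
        d = dist(features, example)
        if best is None or d < best:
            best, best_label = d, label
    return best_label

# Tiny 2-D example: black points near the origin, a red point far away.
training = [((0, 0), 'black'), ((1, 0), 'black'), ((5, 5), 'red')]
print(nearest_neighbor(training, (4, 4), euclidean))  # prints red
```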
So here I've just
drawn a cloud
of red dots and black dots.
I have a fuchsia
colored X. And if I
want to classify
X as black or red,
I'd say well its
nearest neighbor is red.
So we'll call X red.
Doesn't get much
simpler than that.
All right.
Let's try and do it
now for our animals.
I've blocked out this
lower right hand corner,
because I want to classify these
three animals that are in gray.
So my training data, very
small, are these animals.
And these are my test set here.
So let's first try and
classify the zebra.
We look at the zebra's
nearest neighbor.
Well it's either a
guppy or a dart frog.
Well, let's just choose one.
Let's choose the guppy.
And if we look at the
guppy, it's not a reptile,
so we say the zebra
is not a reptile.
So got one right.
Look at the python, choose
its nearest neighbor,
say it's a cobra.
The label associated
with cobra is reptile,
so we win again on the python.
Alligator, its nearest
neighbor is clearly a chicken.
And so we classify the
alligator as not a reptile.
Oh, dear.
Clearly the wrong answer.
All right.
What might have gone wrong?
Well, the problem with
nearest neighbor,
we can illustrate it by
looking at this example.
So one of the things people do
with classifiers these days is
handwriting recognition.
So I just copied from a
website a bunch of numbers,
then I wrote the number 40 in
my own inimitable handwriting.
So if we go and we look for,
say, the nearest neighbor
of four--
or sorry, of whatever
that digit is.
It is, I believe, this one.
And sure enough that's
the row of fours.
We're OK on this.
Now if we want to
classify my zero,
the actual nearest
neighbor, in terms
of the bitmaps if you will,
turns out to be this guy.
A very poorly written nine.
I didn't make up this nine,
it was already there.
And the problem we see here
when we use nearest neighbor is
if something is noisy, if you
have one noisy piece of data,
in this case, a rather
ugly-looking version of a nine,
you can get the wrong
answer because you match it.
And indeed, in this case, you
would get the wrong answer.
What is usually done to
avoid that is something
called K nearest neighbors.
And the basic idea here
is that we don't just
take the nearest
neighbors, we take
some number of nearest
neighbors, usually
an odd number, and we
just let them vote.
So now if we want to
classify this fuchsia X,
and we set K equal to
three, we say, well, these
are its three
nearest neighbors.
One is red, two
are black, so we're
going to call X black;
that's our better guess.
And maybe that actually
is a better guess,
because it looks like this
red point here is really
an outlier, and we don't want
to let the outliers dominate
our classification.
And this is why people almost
always use K nearest neighbors
rather than just
nearest neighbor.
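The voting version might be sketched like this. Again, the (feature vector, label) pair representation is an assumption; this isn't the posted kNearestClassify:

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_nearest(training, example, k, dist):
    """Classify example by majority vote of its k nearest training points.

    training: list of (feature_vector, label) pairs.
    """
    neighbors = sorted(training, key=lambda pair: dist(pair[0], example))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# One red outlier sits among the black points; with k=3 it gets outvoted.
training = [((0, 0), 'black'), ((1, 1), 'black'), ((0.5, 0.5), 'red'),
            ((5, 5), 'red'), ((6, 6), 'red')]
print(k_nearest(training, (0.4, 0.6), 3, euclidean))  # prints black
```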
Now if we look at this, and
we use K nearest neighbors,
those are the three nearest
to the first numeral,
and they are all fours.
And if we look at the
K nearest neighbors
for the second numeral,
we still have this nine
but now we have two zeros.
And so we vote and we
decide it's a zero.
Is it infallible?
No.
But it's typically
much more reliable
than just nearest neighbors,
hence used much more often.
And that was our problem, by
the way, with the alligator.
The nearest neighbor
was the chicken,
but if we went back
and looked at it--
maybe we should go do that.
And we take the alligator's
three nearest neighbors,
it would be the chicken, a
cobra, and the rattlesnake--
or the boa, we
don't care, and we
would end up correctly
classifying it now
as a reptile.
Yes?
AUDIENCE: Is there like a
limit to how many [INAUDIBLE]?
PROFESSOR: The
question is is there
a limit to how many nearest
neighbors you'd want?
Absolutely.
Most obviously, there's no point
in setting K equal to-- whoops.
Ooh, on the rebound--
to the size of the training set.
So one of the problems
with K nearest neighbors
is efficiency.
If you're trying to
find the K nearest neighbors
and K is bigger,
it takes longer.
So we worry about
how big K should be.
And if we make it too big--
and this is a crucial thing--
we end up getting dominated
by the size of the class.
So let's look at this
picture we had before.
It happens to be more
red dots than black dots.
If I make K 10 or 15, I'm going
to classify a lot of things
as red, just because red is so
much more prevalent than black.
And so when you have an
imbalance, which you usually
do, you have to be very careful
about K. Does that make sense?
AUDIENCE: [INAUDIBLE] choose K?
PROFESSOR: So how
do you choose K?
Remember back on Monday when we
talked about choosing K for
K-means clustering?
We typically do a very
similar kind of thing.
We take our training data and
we split it into two parts.
So we have training
and testing, but now
we just take the training,
and we split that
into training and
testing multiple times.
And we experiment with
different K's, and we
see which K gives us the best
results on the training data.
And then that becomes our K.
And that's a very common method.
It's called
cross-validation, and it's--
for almost all of machine
learning, the algorithms
have parameters; in this case,
it's just one parameter, K.
And the way we typically
choose the parameter values
is by searching
through the space using
this cross-validation
in the training data.
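A hedged sketch of that parameter search, repeatedly splitting the training data 80/20 and keeping the K with the best average score. The evaluate function's interface here is my assumption, not the lecture's code:

```python
import random

def choose_k(training, candidate_ks, num_trials, evaluate):
    """Pick K by cross-validation within the training data.

    evaluate(train, test, k) is assumed to return an accuracy-like
    score (higher is better); sketch only.
    """
    best_k, best_score = None, None
    for k in candidate_ks:
        total = 0.0
        for _ in range(num_trials):
            shuffled = training[:]
            random.shuffle(shuffled)
            cut = int(0.8 * len(shuffled))     # 80/20 split
            total += evaluate(shuffled[:cut], shuffled[cut:], k)
        avg = total / num_trials
        if best_score is None or avg > best_score:
            best_k, best_score = k, avg
    return best_k
```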
Does that make
sense to everybody?
Great question.
And there was someone
else had a question,
but maybe it was the same.
Do you still have a question?
AUDIENCE: Well, just that
you were using like K nearest
and you get, like
if my K is three
and I get three different
clusters for the K [INAUDIBLE]
PROFESSOR: Three
different clusters?
AUDIENCE: [INAUDIBLE]
PROFESSOR: Well, right.
So if K is 3, and I had
red, black, and purple
and I get one of each,
then what do I do?
And then I'm kind of stuck.
So you need to typically
choose K in such a way
that when you vote
you get a winner.
Nice.
So if there are two classes, any
odd number will do.
If there are three, well, then
you need to choose K
in such a way that
there's always a majority.
Right?
You want to make sure
that there is a winner.
Also a good question.
Let's see if I get
this to you directly.
I'm much better at
throwing overhand, I guess.
Wow.
Finally got applause
for something.
All right, advantages
and disadvantages of KNN?
The learning is
really fast, right?
I just remember everything.
No math is required.
Didn't have to show
you any theory.
It was an obvious idea.
It's easy to explain the method
to somebody, and the results.
Why did I label it black?
Because that's who
it was closest to.
The disadvantage is
it's memory intensive.
If I've got a million examples,
I have to store them all.
And the predictions
can take a long time.
If I have an example and I
want to find its K nearest
neighbors, I'm doing
a lot of comparisons.
Right?
If I have a million
training points,
I have to compare my
example to all million of them.
So I have no real
pre-processing overhead.
But each time I need
to do a classification,
it takes a long time.
Now there are better
algorithms than brute force
that give you approximate
K nearest neighbors.
But on the whole,
it's still not fast.
And we're not getting any
information about what process
might have generated the data.
We don't have a model of the
data the way we did when
we did linear regression
for curve fitting;
there we had a model for the data
that sort of described the pattern.
We don't get that out
of k nearest neighbors.
I'm going to show you a
different approach where
we do get that.
And I'm going to do it on
a more interesting example
than reptiles.
I apologize to those of
you who are herpetologists.
So you probably all
heard of the Titanic.
There was a movie
about it, I'm told.
It was one of the great
sea disasters of all time,
a so-called unsinkable ship--
they had advertised
it as unsinkable--
hit an iceberg and went down.
Of the 1,300
passengers, 812 died.
The crew did way worse.
So at least it looks as
if the crew was actually
pretty heroic.
They had a higher death rate.
So we're going to
use machine learning
to see if we can predict
which passengers survived.
There's an online
database I'm using.
It doesn't have all
1,200 passengers,
but it has information
about 1,046 of them.
Some of them they couldn't
get the information.
It says what cabin class they
were in: first, second,
or third, how old they
were, and their gender.
Also has their
name and their home
address and things,
which I'm not using.
We want to use these
features to see
if we can predict
which passengers were
going to survive the disaster.
Well, the first
question, something
that Professor Grimson
alluded to, is: is it OK
just to look at accuracy?
How are we going to evaluate
our machine learning?
And it's not.
If we just predict died
for everybody, well then
we'll be 62% accurate for the
passengers and 76% accurate
for the crew members.
Usually machine
learning, if you're 76%
you say that's not bad.
Well, here I can get that
just by predicting died.
So whenever you have a class
imbalance that much more of one
than the other, accuracy isn't
a particularly meaningful
measure.
I discovered this early on
in my work in the medical area.
There are a lot of
diseases that rarely occur,
they occur in say 0.1%
of the population.
And I can build a great
model for predicting it
by just saying,
no, you don't have
it, which will be 99.9%
accurate, but totally useless.
Unfortunately, you do see
people doing that sort
of thing in the literature.
You saw these in an earlier
lecture, just to remind you,
we're going to be
looking at other metrics.
Sensitivity, think
of that as how good
is it at identifying
the positive cases.
In this case, positive
is going to be dead.
How specific is it, and the
positive predictive value.
If we say somebody died,
what's the probability
is that they really did?
And then there's the
negative predictive value.
If we say they
didn't die, what's
the probability they didn't die?
So these are four
very common metrics.
There is something called an
F score that combines them,
but I'm not going to be
showing you that today.
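The four metrics fall straight out of the confusion counts. A sketch; the posted getStats presumably does something similar:

```python
def get_stats(tp, fp, tn, fn):
    """Compute the lecture's metrics from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # a.k.a. recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return accuracy, sensitivity, specificity, ppv, npv

# Illustrative counts: positive here means "died", as in the lecture.
print(get_stats(tp=40, fp=10, tn=30, fn=20))
```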
I will mention that
in the literature,
people often use the word
recall to mean sensitivity,
or sensitivity to mean recall,
and specificity and precision
are used pretty much
interchangeably.
So you might see various
combinations of these words.
Typically, people talk
about recall and precision
or sensitivity and specificity.
Does that make
sense, why we want
to look at the measures
other than accuracy?
We will look at accuracy,
too, and how they all tell us
kind of different
things, and how you might
choose a different balance.
For example, if I'm running
a screening test, say
for breast cancer, a
mammogram, and trying
to find the people
who should get on
for a more extensive
examination,
what do I want to
emphasize here?
Which of these is likely
to be the most important?
Or what would you
care about most?
Well, maybe I want sensitivity.
Since I'm going to send this
person on for future tests,
I really don't want to miss
somebody who has cancer,
and so I might
think sensitivity is
more important than specificity
in that particular case.
On the other hand,
if I'm deciding
who is so sick I should do
open heart surgery on them,
maybe I want to be
pretty specific.
Because the risk of the
surgery itself are very high.
I don't want to do it on
people who don't need it.
So we end up having to choose
a balance between these things,
depending upon our application.
The other thing I want to talk
about before actually building
a classifier is how we
test our classifier,
because this is very important.
I'm going to talk about
two different methods,
leave-one-out testing
and repeated
random subsampling.
For leave one out,
it's typically
used when you have a
small number of examples,
so you want as much
training data as possible
as you build your model.
So you take all of your n
examples, remove one of them,
train on n minus
1, test on the 1.
Then you put that 1 back
and remove another 1.
Train on n minus 1, test on 1.
And you do this for each
element of the data,
and then you average
your results.
Repeated random
subsampling is done
when you have a larger set of
data, and there you might say
split your data 80/20.
Take 80% of the data to
train on, test it on 20.
So this is very similar to
what I talked about earlier,
when I answered the
question about how
to choose K. I haven't
yet seen the test examples,
but in order to
believe in my model
and, say, my parameter
settings, I do this repeated
random subsampling or
leave one out, either one.
There's the code
for leave one out.
Absolutely nothing
interesting about it,
so I'm not going to waste
your time looking at it.
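For the curious, a leave-one-out loop along those lines might look like this, assuming the method returns a (TP, FP, TN, FN) tuple; this is a sketch, not the posted code:

```python
def leave_one_out(examples, method):
    """Train on n-1 examples, test on the held-out one, for every
    example; accumulate the confusion counts.

    method(training, test_set) is assumed to return a
    (tp, fp, tn, fn) tuple.
    """
    tp = fp = tn = fn = 0
    for i in range(len(examples)):
        test_case = [examples[i]]
        training = examples[:i] + examples[i + 1:]
        r_tp, r_fp, r_tn, r_fn = method(training, test_case)
        tp, fp, tn, fn = tp + r_tp, fp + r_fp, tn + r_tn, fn + r_fn
    return tp, fp, tn, fn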
Repeated random subsampling
is a little more interesting.
What I've done here
is I first sample--
this one just
splits it 80/20.
It's not doing
anything repeated,
and I start by sampling 20% of
the indices, not the samples.
And I want to do that at random.
I don't want to say
get consecutive ones.
So we do that, and then
once I've got the indices,
I just go through and
assign each example,
to either test or training,
and then return the two sets.
But if I had just
sampled the examples themselves,
then I'd have to do something
more complicated to subtract
them from the training set.
This is just efficiency.
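A sketch of that index-sampling split, sending a random 20% of the indices to the test set (assuming the number of examples divides evenly; the posted code may handle that differently):

```python
import random

def split_80_20(examples):
    """Randomly assign 20% of the indices to test, the rest to training."""
    sample_indices = random.sample(range(len(examples)),
                                   len(examples) // 5)
    test_set, training_set = [], []
    for i in range(len(examples)):
        if i in sample_indices:
            test_set.append(examples[i])
        else:
            training_set.append(examples[i])
    return training_set, test_set

train, test = split_80_20(list(range(100)))
print(len(train), len(test))  # prints 80 20
```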
And then here's the--
sorry about the yellow there--
the random splits.
Obviously, I was
searching for results
when I did my screen capture.
For i in range of the
number of splits,
I'm going to split it 80/20.
It takes a parameter method,
and that's interesting,
and we'll see the
ramifications of that later.
That's going to be the
machine learning method.
We're going to compare KNN
to another method called
logistic regression.
I didn't want to
have to do this code
twice, so I made the
method itself a parameter.
We'll see that introduces
a slight complication,
but we'll get to it
when we get to it.
So I split it, I apply
whatever that method is
to the training and test
sets, and I get the results:
true positives, false positives,
true negatives, false negatives.
And then I call this
thing get stats,
but I'm dividing it by
the number of splits,
so that will give me
the average number
of true positives, the average
number of false positives, etc.
And then I'm just going
to return the average.
Get stats actually just prints
a bunch of statistics for us.
Any questions about
the two methods,
leave one out versus
repeated random sampling?
Let's try it for
KNN on the Titanic.
So I'm not going to show you
the code for K nearest classify.
It's in the code we uploaded.
It takes four arguments:
the training set,
the test set, the label that
we're trying to classify.
Are we looking for
the people who died?
Or the people who didn't die?
Are we looking for
reptiles or not reptiles?
Or if, say, there
were six labels,
which one are we
trying to detect?
And K as in how many
nearest neighbors?
And then it returns the true
positives, the false positives,
the true negatives, and
the false negatives.
Then you'll recall we'd
already looked at lambda
in a different context.
The issue here is K nearest
classify takes four arguments,
yet if we go back here, for
example, to random splits,
what we're seeing is I'm
calling the method with only two
arguments.
Because after all, if I'm not
doing K nearest neighbors,
maybe I don't need to pass
in K. I'm sure I don't.
Different methods will
take different numbers
of parameters, and yet I want
to use the same function,
method, here.
So the trick I use
to get around that--
and this is a very common
programming trick--
is called currying in math,
after the mathematician
Haskell Curry,
not the Indian dish.
I'm creating a
new function called KNN.
This will be a function of
two arguments, the training
set and the test
set, and it will
be K nearest classify
with training set and test
set as variables, and
two constants, survived--
so I'm going to
predict who survived--
and 3, the K.
I've been able to turn a
function of four arguments,
K nearest classify, into a
function of two arguments
KNN by using lambda abstraction.
This is something that
people do fairly frequently,
because it lets you build much
more general programs when
you don't have to worry about
the number of arguments.
So it's a good trick to
keeping your bag of tricks.
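Here's the currying trick in miniature, with a hypothetical stand-in for the four-argument kNearestClassify; the placeholder body is invented purely to make the call visible:

```python
# Hypothetical stand-in for the lecture's kNearestClassify: four arguments.
def k_nearest_classify(training, test_set, label, k):
    return ('called with', label, k)  # placeholder body, for illustration

# Lambda abstraction turns it into a two-argument function, freezing
# the label ('Survived') and K (3) as constants:
knn = lambda training, test_set: k_nearest_classify(training, test_set,
                                                    'Survived', 3)

print(knn([], []))  # prints ('called with', 'Survived', 3)
```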
Again, it's a trick
we've used before.
Then I've just chosen 10
for the number of splits,
and we'll try it, and we'll try
it for both methods of testing.
Any questions before
I run this code?
So here it is.
We'll run it.
Well, I should learn how to
spell finished, shouldn't I?
But that's OK.
Here we have the
results, and they're--
well, what can we
say about them?
They're not much
different to start with,
so it doesn't appear that
our testing methodology had
much of a difference on
how well the KNN worked,
and that's actually
kind of comforting.
The accuracy-- none of
the evaluation criteria
are radically different,
so that's kind of good.
We hoped that was true.
The other thing to notice
is that we're actually
doing considerably better than
just always predicting, say,
didn't survive.
We're doing better than
a random prediction.
Let's go back now
to the Power Point.
Here are the results.
We don't need to
study them anymore.
Better than 62% accuracy,
but not much difference
between the experiments.
So that's one method.
Now let's look at
a different method,
and this is probably
the most common method
used in machine learning.
It's called logistic regression.
It's, in some ways, if
you look at it, similar
to a linear regression,
but different
in some important ways.
Linear regression, you
will I'm sure recall,
is designed to
predict a real number.
Now what we want here
is a probability, so
the probability of some event.
We know that the dependent
variable can only
take on a finite set of values,
so we want to predict survived
or didn't survive.
It's no good to say we predict
this person half survived,
you know survived, but is
brain dead or something.
I don't know.
That's not what
we're trying to do.
The problem with just using
regular linear regression
is a lot of time you get
nonsense predictions.
Now you can claim,
OK, 0.5 is fine,
and it means they had a half
probability of dying,
not that they half died.
But in fact, if you
look at what goes on,
you could get more
than one or less than 0
out of linear
regression, and that's
nonsense when we're talking
about probabilities.
So we need a different method,
and that's logistic regression.
What logistic
regression does is it
finds what are called the
weights for each feature.
You may recall I complained when
Professor Grimson used
the word weights to mean
something somewhat different.
We take each feature, for
example the gender, the cabin
class, the age, and
compute for each one a weight
that we're going to use
in making predictions.
So think of the weights
as corresponding
to the coefficients we get
when we do a linear regression.
So we have now a coefficient
associated with each variable.
We're going to take
those coefficients,
add them up, multiply
them by something,
and make a prediction.
A positive weight implies--
and I'll come back
to this later--
it almost implies that
the variable is positively
correlated with the outcome.
So we would, for
example, say that
having scales is
positively correlated
with being a reptile.
A negative weight implies that
the variable is negatively
correlated with the
outcome, so number of legs
might have a negative weight.
The more legs an animal
has, the less likely
it is to be a reptile.
It's not absolute, it's
just a correlation.
The absolute
magnitude is related
to the strength of
the correlation,
so if it's a big
positive, it means
it's a really strong indicator.
If it's big negative,
it's a really strong
negative indicator.
And then we use an
optimization process
to compute these weights
from the training data.
It's a little bit complex.
Its key is the way it uses
the log function, hence
the name logistic, but I'm not
going to make you look at it.
But I will show
you how to use it.
You start by importing something
called sklearn.linear_model.
Sklearn is a Python library,
and in that is a class
called LogisticRegression.
It's the name of a
class, and here are
three methods of that class.
Fit, which takes a
sequence of feature vectors
and a sequence of
labels and returns
an object of type
logistic regression.
So this is the place where
the optimization is done.
Now all the examples
I'm going to show you,
these two sequences will be--
well all right.
So think of this as the
sequence of feature vectors,
one per passenger, and the
labels associated with those.
So this and this have
to be the same length.
That produces an
object of this type,
and then I can ask for
the coefficients, which
will return the weight of
each variable, each feature.
And then I can
make a prediction,
given a feature vector
returned the probabilities
of different labels.
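Here's a small, self-contained example of those three methods on toy data. The feature values and labels below are invented for illustration; real use would pass the Titanic feature vectors and labels instead:

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: one informative feature per example.
feature_vecs = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
labels = ['Died', 'Died', 'Died', 'Survived', 'Survived', 'Survived']

model = LogisticRegression()      # create an instance of the class
model.fit(feature_vecs, labels)   # the optimization happens here

print(model.classes_)               # the labels, in the order used below
print(model.coef_)                  # one weight per feature
print(model.predict_proba([[0.9]])) # probability of each class
```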
Let's look at it as an example.
So first let's build the model.
To build the model, we'll take
the examples, the training
data, and I just said whether
we're going to print something.
You'll notice from
this slide I've
elided the printed stuff.
We'll come back in a later slide
and look at what's in there.
But for now I want to focus on
actually building the model.
I need to create two vectors,
two lists in this case,
the feature vectors
and the labels.
For e in examples,
featureVecs.append(e.getFeatures())
and labels.append(e.getLabel()).
Couldn't be much
simpler than that.
Then, just because it wouldn't
fit on a line on my slide,
I've created this
identifier called
logistic regression,
which is sklearn.linearmo
del.logisticregression.
So this is the thing I
imported, and this is a class,
and now I'll get
a model by first
creating an instance of the
class, logistic regression.
Here I'm getting an
instance, and then I'll
call dot fit with
that instance, passing
it feature vecs and labels.
I now have built a
logistic regression
model, which is simply
a set of weights
for each of the variables.
This makes sense?
Now we're going to
apply the model,
and I think this is the
last piece of Python
I'm going to introduce this
semester, in case you're
tired of learning about Python.
And this is
list comprehension.
This is how I'm going to build
my set of test feature vectors.
So before we go and
look at the code,
let's look at how list
comprehension works.
In its simplest form,
says some expression
for some identifier
in some list,
L. It creates a new list by
evaluating this expression len(L)
times, with the ID in
the expression replaced
by each element of
the list L. So let's
look at a simple example.
Here I'm saying L equals x
times x for x in range 10.
What's that going to do?
It's going to,
essentially, create a list.
Think of it as a
list, or at least
a sequence of values, a range
type actually in Python 3--
of values 0 to 9.
It will then create a
list of length 10, where
the first element is
going to be 0 times 0.
The second element
1 times 1, etc.
OK?
So it's a simple
way for me to create
a list that looks like that.
I can be fancier and say
L equals x times x for x
in range 10, and I add an if:
if x mod 2 is equal to 0.
Now instead of returning all--
building a list using
each value in range 10,
it will use only those values
that satisfy that test.
We can go look at what
happens when we run that code.
You can see the first
list is 0 times 0, 1 times
1, et cetera, and
the second list
is much shorter, because I'm
only squaring even numbers.
Well, you can see that
list comprehension gives us
a convenient compact way to
do certain kinds of things.
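The two slide examples as runnable code:

```python
# Square every value from 0 through 9.
L = [x * x for x in range(10)]
print(L)      # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# Square only the even values: the if filters range(10) first.
evens = [x * x for x in range(10) if x % 2 == 0]
print(evens)  # [0, 4, 16, 36, 64]
```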
Like lambda expressions,
they're easy to misuse.
I hate reading code where I
have list comprehensions that
go over multiple lines on
my screen, for example.
So I use it quite a lot
for small things like this.
If it's very large, I
find another way to do it.
Now we can move forward.
In applying the model, I
first build my testing feature
vecs: e.getFeatures()
for e in testSet,
so that will give me
the features associated
with each element
in the test set.
I could obviously have written
a for loop to do the same thing,
but this was just
a little cooler.
Then we get model.predict
for each of these.
Model.predict_proba is nice in
that I don't have to predict it
for one example at a time.
I can pass it as set of
examples, and what I get back
is a list of predictions,
so that's just convenient.
And then, after setting these
counts to 0, for i in range len of
probs, I compare each to 0.5.
What that's saying is what I
get out of logistic regression
is a probability of
something having a label.
I then have to build a
classifier by giving a threshold.
And here what I've said, if the
probability of it being true
is over a 0.5, call it true.
So if the probability
of survival is over 0.5,
call it survived.
If it's below, call
it not survived.
We'll later see that, again,
setting that probability
is itself an interesting thing,
but the default in most systems
is half, for obvious reasons.
I get my probabilities
for each feature vector,
and then for i in range
len of the probabilities,
I'm just testing whether
the predicted label is
the same as the actual label,
and updating true positives,
false positives, true
negatives, and false negatives
accordingly.
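Putting the thresholding together, here's a sketch of such an apply-model step. The getFeatures/getLabel interface on examples is taken from the lecture's description; the rest is my reconstruction, not the posted code:

```python
def apply_model(model, test_set, label, prob=0.5):
    """Threshold the model's predicted probabilities at prob and
    tally the confusion counts.

    Assumes each test example has getFeatures() and getLabel()
    methods, and model has predict_proba and classes_, as sklearn's
    LogisticRegression does.
    """
    test_feature_vecs = [e.getFeatures() for e in test_set]
    probs = model.predict_proba(test_feature_vecs)
    label_index = list(model.classes_).index(label)
    tp = fp = tn = fn = 0
    for i in range(len(probs)):
        if probs[i][label_index] > prob:        # predicted the label
            if test_set[i].getLabel() == label:
                tp += 1
            else:
                fp += 1
        else:                                    # predicted not the label
            if test_set[i].getLabel() != label:
                tn += 1
            else:
                fn += 1
    return tp, fp, tn, fn
```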
So far, so good?
All right, let's
put it all together.
I'm defining something called
LR, for logistic regression.
It takes the training data,
the test data, the probability,
it builds a model, and
then it gets the results
by calling apply model
with the label survived
and whatever this prob was.
Again, we'll do it
for both leave one out
and random splits, and
again for 10 random splits.
You'll notice it actually runs--
maybe you won't notice, but
it does run faster than KNN.
One of the nice things
about logistic regression
is building the
model takes a while,
but once you've got
the model, applying it
to a large number of
feature vectors is fast.
It's independent of the
number of training examples,
because we've got our weights.
So solving the optimization
problem, getting the weights,
depends upon the number
of training examples.
Once we've got the weights, it's
just evaluating a polynomial.
It's very fast, so
that's a nice advantage.
If we look at those--
and we should probably compare
them to our earlier KNN
results, so KNN on the
left, logistic regression
on the right.
And I guess if I look at it, it
looks like logistic regression
did a little bit better.
That's not guaranteed,
but it often
does outperform because it's
more subtle in what it does,
in being able to assign
different weights
to different variables.
It's a little bit better.
That's probably a good
thing, but there's
another reason that's really
important that people prefer
logistic regression,
is it provides
insights about the variables.
We can look at the
feature weights.
This code does that, so remember
we looked at build model
and I left out the printing?
Well, here it's the same
function, but leaving out
everything except the printing.
We can do model
underbar classes,
so model.classes underbar
gives you the classes.
In this case, the classes
are survived, didn't survive.
I forget what I called it.
We'll see.
So I can see what the
classes it's using
are, and then for i in range
len of model dot coef underbar,
these are giving the
weights of each variable.
The coefficients, I can
print what they are.
So let's run that
and see what we get.
We get a syntax error
because I turned a comment
into a line of code.
Our model classes are
died and survived,
and for label survived--
what I've done, by the
way, in the representation
is I represented the cabin
class as a binary variable.
It's either 0 or 1, because
it doesn't make sense
to treat them as if they were
really numbers because we don't
know, for example,
the difference
between first and second is
the same as the difference
between second and third.
If we treated the class
as a number, just said cabin class
was an integer, implicitly
the learning algorithm
is going to assume that the
difference between 1 and 2
is the same as between 2 and 3.
If you, for example, look at
the prices of these cabins,
you'll see that that's not true.
The difference in an
airplane between economy plus
and economy is way smaller than
between economy plus and first.
Same thing on the Titanic.
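One way to encode cabin class as binary variables, as just described. This is illustrative; the lecture's actual feature code isn't shown here:

```python
def encode_cabin_class(cabin_class):
    """One-hot encode cabin class 1, 2, or 3 as three binary features,
    so the learner doesn't assume the classes are evenly spaced numbers.
    """
    return [1 if cabin_class == c else 0 for c in (1, 2, 3)]

print(encode_cabin_class(1))  # prints [1, 0, 0]
print(encode_cabin_class(3))  # prints [0, 0, 1]
```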
But what we see here is
that for the label survived,
pretty good sized
positive weight
for being in first class cabin.
Moderate for being
in the second,
and if you're in the third
class well, tough luck.
So what we see here is
that rich people did better
than the poor people.
Shocking.
If we look at age, we'll see
it's negatively correlated.
What does this mean?
It's not a huge weight,
but it basically
says that if you're older,
the bigger your age,
the less likely you are to
have survived the disaster.
And finally, it
says it's really bad
to be a male, that the men--
being a male was very negatively
correlated with surviving.
We see a nice thing here is
we get these labels, which
we can make sense of.
One more slide
and then I'm done.
These values are
slightly different,
because different randomization,
different example,
but the main point
I want to say is
you have to be a little
bit wary of reading
too much into these weights.
Because, not just in this
example but in other examples,
these features
are often correlated,
and if they're
correlated, you run into trouble--
actually, it's 3:56.
I'm going to explain the
problem with this on Monday
when I have time
to do it properly.
So I'll see you then.
