Hi. Let's look at an example of a naive
Bayes classifier implemented in Python.
Before we do that, let's look at our data.
Here we have a text file with tab-separated
columns, as you can see here.
The first one has the labels for the
data, so some documents are neg for
negative and some are pos for positive.
And all of these are movie reviews. For each of those negative and positive documents we
have, in the 0th element, the label,
whether it's negative or positive,
and in element number two, we have
the actual text of the review. So there's
a negative review that says: plot two
teen couples go to a church party drink
and then drive they get into an accident.
Let's look at a positive one. So there are
about 120 negative
reviews, and another 120 positive
reviews. The first positive review says:
films adapted from comic books have had
plenty of success. So 120 positive
reviews and 120 negative reviews, each
with its label and the text of the
review. Here we
have the code.
So we're going to have a function here
that returns unigrams and bigrams from
the text; specifically, it returns the
200 most frequent
bigrams for a text. We'll take a look at it
in a minute. The first thing the code is
going to do is open the text file
mini-movie-reviews and read its text
into file lines, which is going to
be just a list of the lines. Then it's
going to split each of the lines on the
tabs, because the columns are
divided by tabs, and look at the 0th
column, the very first element. If it
contains the letters neg, because it's a
negative review, then we take element
number 2, which was the text, and append it
to negReviewsText. This is just
the text of the negative reviews. If we
get pos, for positive, in the zeroth
element, then we take element number 2,
which is the text of the review, and put
it in posReviewsText. As an
example, let's print the positive reviews.
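The parsing step just described might be sketched like this. The helper name split_reviews is hypothetical, and the exact filename and column layout are assumptions based on the walkthrough:

```python
def split_reviews(file_lines):
    """Split tab-separated review lines into negative and positive texts."""
    negReviewsText, posReviewsText = [], []
    for line in file_lines:
        cols = line.rstrip("\n").split("\t")   # columns are tab-separated
        if len(cols) < 3:
            continue
        if "neg" in cols[0]:                   # column 0 holds the label
            negReviewsText.append(cols[2])     # column 2 holds the review text
        elif "pos" in cols[0]:
            posReviewsText.append(cols[2])
    return negReviewsText, posReviewsText

# With the real file you would do something like:
# neg, pos = split_reviews(open("mini-movie-reviews.txt").readlines())
```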
As you can see, that is just gonna give
you a list with the text of all of the
movies that are reviewed as positive. So
it's just a list of strings. Next we want
to get the features that describe each
of the reviews so we want to get some
sort of vector that has the presence or
absence of n-grams, unigrams, bigrams,
and we're going to put them here in
negative features and positive features.
So for each of our positive reviews,
we're going to first tokenize them. Let's
see what this looks like.
Let's extract the tokens from the
positive reviews. And as you can see it's
just all those tokens right there: heart,
strings, and, of, course, the, kids, or, the.
After you have the tokens, we're going to
send this to our n-gram generating
function. So the list of tokens, the
complete list, is going to be sent here,
and this is going to return to us a
dictionary containing just a big vector
of the n-grams in each of the reviews. Let's see. Those are for
the positive ones, so for each of the 120
positive reviews, we're getting a list
with the n-grams belonging to that
document. And so those are going to be
the features. As you can see, this is
returned as a type of Python dictionary.
So for example, we have the structure of
mouth, and it is true that this structure
exists. We have the bigram was
pleasantly, and it is true that this
bigram exists.
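By the way, the n-gram generating function described earlier might look roughly like this sketch using NLTK's collocation finder. The name ngram_feats is hypothetical, and raw frequency is an assumption for the scorer, since the walkthrough says the 200 most frequent bigrams:

```python
from itertools import chain

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def ngram_feats(words, n=200):
    """Mark every unigram, plus the n most frequent bigrams, as present."""
    finder = BigramCollocationFinder.from_words(words)
    top_bigrams = finder.nbest(BigramAssocMeasures.raw_freq, n)
    # one big dictionary of {n-gram: True}, as shown in the output above
    return {gram: True for gram in chain(words, top_bigrams)}
```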
And here are the unigrams. So we have the
unigram, black and it is true that the
unigram exists, so we get a list of these
for each of our documents, for each of
our movie reviews, and then we add the
tag positive to each of them. Let me show
you what that would look like, so for
every document, we're going to get a list
of all of its n-grams. So this will be
our n-gram vector that contains the
features, for example, it has the bigram,
you have, it has the bigram because it,
it has the bigram it's sound. This is a
huge vector for each of the documents,
and we're going to append an element to
indicate whether that vector belongs to
a positive document or to a negative
document. In this case we're looking at
the positive ones, so it's just the list
of n-grams and positive indicating that
it belonged to a positive document. And
we're going to append all of that into a
big list called posfeats,
which contains, yes, a big list
of many dictionaries for the features,
and then an additional element for whether
it's positive or negative. We can look at
the ones for the negatives, for example.
Yes, it's a long list.
That's because it's printing for each
one of them as the list grows.
Let me just kill it, and why don't we
print it outside of the for loop.
There we go; now it only prints
once. As you can see
this negfeats is a list over all of the
documents: a vector of
features for every document with the
label negative, then the
features for the second negative
document with the label negative, the features
of the third negative document with negative, of the fourth negative document with negative. This is what
negfeats contains. The next thing we'll do
is establish a cutoff point so that we
can get 90% of the positive features for
training, and 10% of the positive
features for testing. And we're going to
do the same with the negatives, so we're
going to get the integer of 90% of the
length of negative features, which is
just the list of documents, and we're
going to get 90% of the length of
positive features, which is a list with
the vectors for each of the documents. So
because they both have the same size, 120,
the cutoff is 108 documents, which is 90% of the
documents that they contain. So we're
going to make two new lists: trainfeats
is going to contain the first 90% of the
negative documents, so the vectors for
each of the negative documents, plus the
first 90% of the vectors for the
positive documents, and this is going to
be contained in trainfeats. Likewise,
test features is going to contain the
last 10%, so from 90% to the end of the
negative features, and from 90% to the
end, so 10%, of the positive features. So
the positive features are the vectors
for the positive documents. You get
the n-grams, you split the sets, and
here you can see that I printed test
features. We have a list of features for
the last document there on the screen,
and then a label, the positive label, and
one thing that we couldn't see, because
it was hidden by this output, is a
little message about the set sizes.
It says we're training on 216
documents, because that's 108 plus 108,
and testing on 24 test
documents, because that's 12 test positive
and 12 test negative.
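Put together, the labeling and the 90/10 split might look like this sketch. The feature extractor is stubbed to plain unigram presence, and the review lists are stand-ins so the sketch runs on its own; the variable names follow the walkthrough:

```python
def word_feats(words):
    # stand-in for the unigram/bigram extractor described earlier
    return {w: True for w in words}

# illustrative stand-ins for the lists built from the file
negReviewsText = ["plot two teen couples go to a church party"] * 120
posReviewsText = ["films adapted from comic books have had success"] * 120

# one (feature-dict, label) pair per review
negfeats = [(word_feats(t.split()), 'neg') for t in negReviewsText]
posfeats = [(word_feats(t.split()), 'pos') for t in posReviewsText]

# 90% of each class for training, the remaining 10% for testing
negcutoff = int(len(negfeats) * 0.9)   # 108 of 120
poscutoff = int(len(posfeats) * 0.9)

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

print('train on %d instances, test on %d instances'
      % (len(trainfeats), len(testfeats)))   # train on 216, test on 24
```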
Here we're gonna have an object called
classifier that is a naive Bayes
classifier, and I'm going to train it on
our trainfeats list, which is just the
vectors for each of the training
documents, of the 216 training documents.
And by the way, NaiveBayesClassifier
comes from NLTK, as you can see
here. So we're gonna make a classifier
object which we're going to use later.
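The training call itself is a one-liner. Here's a minimal runnable sketch with a tiny made-up training set in the same (feature-dict, label) shape; the real code passes the 216-document trainfeats:

```python
import nltk

# tiny illustrative training set, not the real movie-review data
trainfeats = [({'great': True, ('great', 'film'): True}, 'pos'),
              ({'mess': True, ('worst', 'film'): True}, 'neg')]

# train a naive Bayes classifier on the labeled feature vectors
classifier = nltk.NaiveBayesClassifier.train(trainfeats)
print(classifier.classify({'mess': True}))   # -> neg
```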
Next we're going to create two
dictionaries that are going to contain
the labels of the test set.
So goldLabels is going to contain the
labels that we actually have in our test
set, so on those 24 documents, and
predictedLabels is going to contain the
classifier's predictions for each of
those documents. So here we have the
object testfeats which is the list - the
list of vectors with the features of
each document. We're going to split it
into the vector with the features, and
the label positive or negative. So for
each document, we split its item in the
test list, into the vector with features
and the label, and we take the label and
put it in gold labels. So this is the
actual label that we have in our testing
set. Let's look at it real quick.
Hmm, it's training, it's making the
classifier, okay, and as you can see it's
telling you which documents are negative
and which documents are positive. And
these are the ones that we have in our
testing set, so these are the correct
ones, and they are in order
because the negative ones came first
and the positive ones came second. Now
what's gonna happen next?
We're going to use the classifier object
to classify a set of features, a vector
containing features, the presence of
unigrams or bigrams. We're gonna get
it from the test set, and then we're
gonna run it through the classifier, and
then we're going to get an observed
prediction, and we're going to put that
prediction in the
dictionary predictedLabels. So let's
print them side by side.
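The two dictionaries and the prediction loop can be sketched like this. The classifier and testfeats here are tiny stand-ins so the sketch runs on its own; in the walkthrough they come from the earlier steps:

```python
import collections

import nltk

trainfeats = [({'great': True}, 'pos'), ({'mess': True}, 'neg')]
testfeats = [({'mess': True}, 'neg'), ({'great': True}, 'pos')]
classifier = nltk.NaiveBayesClassifier.train(trainfeats)

goldLabels = collections.defaultdict(set)
predictedLabels = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
    goldLabels[label].add(i)               # the label we actually have
    observed = classifier.classify(feats)  # the classifier's prediction
    predictedLabels[observed].add(i)

print(sorted(goldLabels.items()))
print(sorted(predictedLabels.items()))
```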
Hmm, we're training. There we go. So as you
can see, the computer made a few mistakes.
Here are the actual labels from the
testing set, some of them are negative,
and some of them are positive. And here
is which documents were predicted to be
negative by the model, and which
documents were predicted to be positive
by the model. And we can already see that
it had a few problems. For example number
one is negative, but the computer
predicted it to be positive. So here we
get the original labels, and then we use
the classifier to actually predict the
vector in the testing features, and then
compare them. How do we compare them
using these measurements? So precision
and recall are two functions from NLTK
as you can see, in nltk.metrics. We
also have the F score there. We can
calculate the precision for the positive
items by sending all of the items
labeled as positive versus all the items
predicted as positive. We can calculate
the recall by also sending all of the
items labeled as positive and all of the
items predicted as positive, and see how
they match. As we can see, as you can see,
we can get the precision, recall for
positive, precision and recall for
negative. And the F score for the
negative and positive sets. Let's see
them all. Also we're gonna get the
accuracy, also from NLTK, by sending the
classifier object, and the vectors from
the testing set. Now that we can print
that let's run it again.
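These metric calls take two sets of document ids: the gold set and the predicted set. A minimal sketch with made-up id sets:

```python
from nltk.metrics import precision, recall, f_measure

# made-up document-id sets standing in for goldLabels / predictedLabels
gold_pos, pred_pos = {0, 1, 2}, {0, 1, 3}
gold_neg, pred_neg = {3, 4, 5}, {2, 4, 5}

print('pos precision:', precision(gold_pos, pred_pos))  # 2 of 3 predicted pos are right
print('pos recall:   ', recall(gold_pos, pred_pos))     # 2 of 3 actual pos were found
print('pos F-measure:', f_measure(gold_pos, pred_pos))
print('neg precision:', precision(gold_neg, pred_neg))
print('neg recall:   ', recall(gold_neg, pred_neg))

# accuracy takes the classifier object and the labeled test vectors:
# print('accuracy:', nltk.classify.accuracy(classifier, testfeats))
```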
Training and testing, mm-hmm. As you can
see the program had about 71 percent
accuracy. It had about 73
percent precision and 67 percent recall
for the positives, and it had 69 percent
precision and 75
percent recall for the negatives. So as you can
see, recall is about the ones that were
actually negative that it managed to
classify as negative: out of all the ones
that were actually negative, it found 75%
of them. Precision is about the ones that it called
negative: only 69 percent of the ones
that the computer classified
as negative were actually negative.
That's the precision and the recall. And the
F measure, or F score, is a
combination of the precision and the recall.
So as you can see it's performing not
excellently, but around 72, and 72 and
70 are not terrible measurements for
classification. Finally I want to show
you this function of the classifier, show
most informative features. This is very
useful. It's gonna tell us which of the
features was more important when
classifying something as positive or
negative.
Training mm-hmm. So we have here our
measurements for accuracy, precision
recall, the F scores, and we have
informative features. For example, the
existence of the feature mess made a
document more negative than positive. As
a matter of fact, if it has the feature
mess, it's 13 times more
likely to be negative than to be
positive.
This is what this means: these are
likelihood ratios. So a
document is 13 times more
likely to be negative if it has the word
mess. A document is 10.6 times
more likely to be negative if it has the
word worst. A document is seven times
more likely to be positive if it has the
bigram supportive cast. A
document is seven times more likely to
be positive if it has the unigram, the
word appreciate. And that's what this
does.
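Calling it is simple; here's a minimal sketch with a tiny made-up classifier (the features echo the examples above):

```python
import nltk

# tiny illustrative training set, not the real movie-review data
trainfeats = [({'mess': True, 'worst': True}, 'neg'),
              ({'appreciate': True}, 'pos')]
classifier = nltk.NaiveBayesClassifier.train(trainfeats)

# prints each feature with its label likelihood ratio,
# e.g. "mess = True   neg : pos = ..."
classifier.show_most_informative_features()       # default is the top 10
classifier.show_most_informative_features(n=50)   # or any number you want
```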
And as you can see here, you can
adjust it (gosh, is this how you manipulate
this one? Let me see, I can't remember if
it's just 50 or n equals 50. We'll see in
a minute. Oh yeah, n equals 50.) So you can set an arbitrary
number of features that you want to
detect as more or less important. So
with 50 for example we get a lot more
information. Something is five times more likely
to be negative if it has the word
lame. Something is 4.3 times more likely
to be positive if it has the word
individual. And so forth. So this is an
example of a naive Bayes classifier. It
can take a set of texts, for example
movie reviews, and extract
features from them, which can be unigrams
or bigrams or any type of n-gram really, or any type of feature that
you define. Once you get the vectors
with the features, you tag each of
them with a positive or negative tag.
Before you give that to the classifier,
you split it into training and test sets.
You give the training set to the classifier,
and then you can measure how well it did
with the testing set. You can get recall, precision, F scores, and the most
informative features.
