In this video, we will look at the math of how a naive Bayes classifier works. So, what have we been doing so far? We have training sets, say documents for example, and we extract features from that training set: for example, whether they contain the word greatest, whether they contain the word disappointing, whether they have an n with the curl on top (ñ), and we get a vector of features. Next to that
vector of features, we have a label that
comes in with the training data. This
label tells you what the correct label
or classification should be. For example, if something has the word greatest and does not have the word disappointing, it is probably going to be a positive movie review. So, having this, we need an equation that links these features to this label. For example, if a document has the feature greatest and does not have the feature disappointing, what is the probability that it will be positive or negative? We will then use this to classify new documents that come in without a label, because the system has never seen them before: the computer will try to calculate the probability of them belonging to one category or the other. We'll use a technique called naive
Bayes classification. Why is it called
naive? Because we will use a technique
called bag of words, which ignores the order of the words. As we saw last week when we looked at n-grams, the ordering of words is very important. When you have the words "we finish each other's," you have almost completely determined what could come next. Maybe it will be sandwiches, maybe it will be sentences. It will practically never be chairs, so it is naive to assume that the order of words is irrelevant, which is what this system does. It takes all of the words in your document, basically puts them in a bag, and then the only thing you know is whether a word is there or not, or maybe how many times you see each word in the bag, but you completely lose the connections carried by the order of the words. We will assume that position does not matter.
This is a very naive supposition. We assume that any word will have the same effect regardless of its position in the document. For example, the word love matters the same whether it is at the beginning, the middle, or the end, whether it is followed by a certain word or not, and so on. By the way, we can put bigrams and trigrams into the bag of words, it is not just single words, but we will lose the exact ordering and, most importantly, we will lose information about their position in the whole document. So we can have things like bigrams and trigrams in the bag of words; they are just going to be disordered in a more abstract way. Another way to look at this: under this assumption, the presence of one feature is unrelated to the presence of other features. So again,
we're going to presuppose that the
existence of a certain bigram is
independent from the existence of a
certain other bigram in the bag of
words, which is probably not a true
assumption but we will use it in this
system. So that's why it's naive. Why do
we call it Bayesian? Because we are going
to count how many times we see each word
or each feature, and then we're going to
relate those counts to the probability that a document belongs to one category or another.
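To make the bag-of-words idea concrete, here is a minimal Python sketch (the function name and the example strings are mine, just for illustration): it throws away all ordering and keeps only how often each word occurs.

```python
from collections import Counter

def bag_of_words(document):
    # Lower-case, split on whitespace, and keep only the counts:
    # every trace of word order disappears at this point.
    return Counter(document.lower().split())

# Two "documents" with the same words in different orders
# produce exactly the same bag.
print(bag_of_words("predictable with no fun"))
print(bag_of_words("no fun with predictable"))
print(bag_of_words("predictable with no fun") == bag_of_words("no fun with predictable"))  # True
```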
Let's look at a very simple example from
our book. Let's say we have a training
set of movie reviews with just five reviews. Three of them are negative: "just plain boring," "entirely predictable and lacks energy," and "no surprises and very few laughs." So we have three negative documents, and then we have two positive documents: "very powerful" and "the most fun film of the summer." We
want the computer to calculate the
probability of certain features being in
negative documents and positive
documents, and then we want to use those
probabilities to guess whether new reviews are going to be positive or negative, for example the new review "predictable with no fun." Is it positive or negative?
Your human intuition is probably telling
you that it's a negative review. Let's
see what the math says. The first step is
calculating the probability that a review is either positive or negative in our training set. For example, we have a total of five elements in the set, and three of them are negative. So the probability that a review will be negative is three negatives out of five total documents in the set. On the other hand, the probability that a review is positive is two positive documents in the training set divided by five total documents, so 60% and 40% are the probabilities that a review in the training set will be negative or positive, respectively.
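As a small sketch of that prior calculation over the five training reviews (the list and variable names here are mine, just mirroring the example):

```python
# The five training reviews from the example, with their labels.
training_set = [
    ("just plain boring", "neg"),
    ("entirely predictable and lacks energy", "neg"),
    ("no surprises and very few laughs", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]

labels = [label for _, label in training_set]
priors = {c: labels.count(c) / len(labels) for c in ("neg", "pos")}
print(priors)   # {'neg': 0.6, 'pos': 0.4}
```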
The next thing we need to calculate is the probability that, given that we know our document is negative, we will see the word predictable. Also, given that we know a certain document is positive, what is the probability of the word predictable? And so forth for the three features we will define: the presence or absence of predictable, the presence or absence of no, and the presence or absence of fun. More precisely, how many times we see predictable, no, and fun in the documents of the training set.
Let's look at each of the elements in
this equation. The first one is the
number of times that we see predictable
in either positive or negative reviews,
so for example in the positive reviews, "very powerful" and "the most fun film of the summer," the word predictable does not appear, so it appears zero times in the positive reviews. On the other hand, the word predictable appears once in the negative reviews, in "entirely predictable and lacks energy," so we have
a 1 there. So this is the number of times
the feature or the word appears in each
category of reviews. The next number is for smoothing: we add a 1 so that we can account for words that do not appear in either the positive or the negative set. This is so that we have no probabilities equal to 0; we have already seen that zero probabilities would cause problems, because they would multiply everything by 0 and make the system less able to handle new things it sees. So we have
the total number of appearances of
predictable in positive or negative and
then +1 for smoothing. In the denominator we have two elements. First, we have the total number of words in the positive and in the negative examples. If you count the positive words, there are nine words in total; if you count the words in the negative examples, there are fourteen words in total. The final element of this equation is the total number of distinct words in all of the training set. I have the distinct words there to the right; there are 20 of them.
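If you want to check those counts, here is a quick sketch (the same five reviews as above; the variable names are mine):

```python
reviews = [
    ("just plain boring", "neg"),
    ("entirely predictable and lacks energy", "neg"),
    ("no surprises and very few laughs", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]

neg_words, pos_words, vocabulary = [], [], set()
for text, label in reviews:
    words = text.lower().split()
    vocabulary.update(words)                 # distinct words across the whole set
    (neg_words if label == "neg" else pos_words).extend(words)

print(len(pos_words))    # 9 word tokens in the positive reviews
print(len(neg_words))    # 14 word tokens in the negative reviews
print(len(vocabulary))   # 20 distinct words in the training set
```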
So, as you can see, we add 20. These equations give the probability that, given that a document is negative or positive, we will see the feature predictable. Given that a document is positive, how often will we see the word predictable? One divided by 29. Given that a document is negative, how often will we see the word predictable? Two divided by 34. So we have the probabilities for all
of the other features, you can check them
real quick, and then what do we do with
this? We multiply them. The probability
that a document will be negative given
the features that we see has two
components: it is equal to the
probability that a document will be
negative, just out of all the documents,
multiplied by the probability that it
has those features given that it is
negative.
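Concretely, each of those per-feature factors is an add-one (Laplace) smoothed count: the number of times the word appears in that class, plus one, divided by the total number of word tokens in that class plus the vocabulary size. A minimal sketch with the counts from the example hard-coded (the names are mine):

```python
def likelihood(word, word_counts, total_tokens, vocab_size):
    # P(word | class) with add-one smoothing:
    # (count of word in class + 1) / (word tokens in class + vocabulary size)
    return (word_counts.get(word, 0) + 1) / (total_tokens + vocab_size)

# From the example: 14 word tokens in the negative reviews,
# 9 in the positive reviews, 20 distinct words in the training set.
neg_counts = {"predictable": 1, "no": 1, "fun": 0}
pos_counts = {"predictable": 0, "no": 0, "fun": 1}

print(likelihood("predictable", neg_counts, 14, 20))  # 2/34
print(likelihood("predictable", pos_counts, 9, 20))   # 1/29
```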
Let's move one step forward in the
equation. The probability that the document is negative is 3/5, and then we multiply this by a chain of factors, one for each of our features.
So the first feature that we extracted for the test element was predictable. The probability of the word predictable given that the document is negative is, from the training counts, two divided by 34, multiplied by the probability of the word no given that the document is negative, which is two divided by 34, multiplied by the probability of the word fun given that the document is negative, which is one divided by 34.
Likewise with the positive features: the probability that a document is positive given the features that we see is equal to the probability that any document is positive, multiplied by the probability of the features given the positive training examples. The first element is 2/5, then the probability that predictable shows up given that it is a positive document, 1 divided by 29, then the probability that we have the word no given that the document is positive, 1 divided by 29, and the probability that we have the word fun given that the document is positive, 2 divided by 29. If you do these multiplications, you will see that the first one is 6.1 × 10⁻⁵ and the second one is 3.2 × 10⁻⁵. The larger probability is the negative one.
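Putting the whole calculation together for "predictable with no fun" (a sketch of the same arithmetic; the helper names are mine, and the positive score comes out at roughly 3.3 × 10⁻⁵ before rounding down):

```python
vocab_size = 20                    # distinct words in the whole training set
neg_tokens, pos_tokens = 14, 9     # word tokens in the negative / positive reviews

def smoothed(count, total_tokens):
    # Add-one smoothed likelihood of a word given a class.
    return (count + 1) / (total_tokens + vocab_size)

neg_counts = {"predictable": 1, "no": 1, "fun": 0}
pos_counts = {"predictable": 0, "no": 0, "fun": 1}

score_neg, score_pos = 3 / 5, 2 / 5           # the priors
for word in ["predictable", "no", "fun"]:     # the features of the test review
    score_neg *= smoothed(neg_counts[word], neg_tokens)
    score_pos *= smoothed(pos_counts[word], pos_tokens)

print(score_neg)   # ≈ 6.1e-05
print(score_pos)   # ≈ 3.3e-05
print("negative" if score_neg > score_pos else "positive")   # negative
```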
So what did we do? We took the sentence "predictable with no fun," extracted three features (predictable, no, and fun),
and then asked the system, what is the
probability that a document is negative
given that I see those three words? We
did a chain of multiplications over the counts,
and the larger probability is the
negative one, so the system is going to
tell you, I think this document is a
negative review, predictable with no fun,
and this matches our intuition that this
should be a negative review. Notice again
that I did it based on the probabilities
in the training set. For clarity I put
this here, the first element in the
multiplication is the probability that
a document is positive or negative, or
one category versus another, and the
second element is the probability that we would see those features or those words given that the review belongs to that category, estimated from the training set. This is the math
behind it. A naive Bayes classifier
calculates the probability of a label
given the presence/absence or count of
features, and because we estimate those probabilities from the training set, we can use them to predict the label of a new document that we have not seen before. In doing that, the system can see new examples and give them a label; it becomes a classifier. In
the next video, we will look at how to
evaluate the classifier and see if it's
working or not.
