In this video we will try to figure out
how to describe a document according
to the words it contains. For example if
I have a Shakespeare play and tell you
that it has the word Juliet, you will
immediately know what play I'm talking
about. Likewise, if I have a Shakespeare
play and I tell you that it contains the
word Caesar, you will immediately know
which play I'm talking about. On the other
hand if I tell you that I have a
Shakespeare play and it contains the
word and, what is the play that I'm
talking about? The information that I
just gave you is completely useless. You
will not be able to know what play it is
just from that very common word, and.
This is what we will be talking about. There's
one technique to figure out which words
are useful and which words are not and
it's called the tf-idf matrix. So what we want
to do is try to figure out what words
are necessary for us to describe what a
document is about. This is an extremely
simple example where we have two
documents, one says apples are delicious
and the other one says cookies are
delicious. What are these two documents,
what are they about, and how is the first
one different from the second one? The
difference between the two documents
cannot be in the word are, because both
documents contain it. It cannot be the
word delicious either, for the same
reason. The difference must
reside in what's unique to each of them:
apples for the first one and cookies for
the second one. When we are trying to
figure out how to describe a document we
need to look at the document itself but
also at the entire collection that it
belongs to, so that we can figure out
what sets this document apart from the others.
So when we want to decide which features
will be relevant, for example how many
times a word occurs, we need to find
words that are very frequent in a
document, like apples in the example we
just saw, but also infrequent in the
other documents; the other document did
not have the word apples. So we need to
measure how frequent something is within
our document and also in the rest of the
collection. And the method to do this is
called a term frequency inverse document
frequency matrix or tf-idf. So why can't
we just count words? Maybe we can just
count how many words we have in every
document and that'll be the end of that.
Because it'll give us a strange result
for what the document is about. The
short answer is that there are words
that are very frequent but don't
communicate much. If I give you this
document, Sushi is the best type of snack
in the world, it's the best,
what is this document about? Your human
brain probably wants to say that it's
about sushi, but what the computer can
see is that a few words are very
frequent, such as the word the, which
occurs three times, and the word best,
which occurs two times. So if we go by
raw frequency alone, this document would
be about the word the and
about the word best. This document would
not be about sushi so we can't just
count words and hope that the most
frequent word is going to give us the
actual answer we need. And this is a
feature of language: very common words
like the and is carry very little
content, very little meaning. More
infrequent words, like sushi, have a lot
of meaning encapsulated in them.
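To see this concretely, here is a minimal Python sketch that counts raw words in that sentence. The lowercase-and-split tokenization is a deliberately naive assumption, just for illustration:

```python
from collections import Counter

doc = "Sushi is the best type of snack in the world it's the best"
counts = Counter(doc.lower().split())

print(counts.most_common(3))
# [('the', 3), ('best', 2), ('sushi', 1)] -- raw counts rank 'the' first, not 'sushi'
```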
So to overcome the problems of just
counting words we are going to use an
object called the tf-idf matrix, which
has two sides and two intuitions behind
each side. The tf means term frequency.
For each document, we're gonna try to
count the words and see which words are
very frequent within that document. So
for example if a given document contains
the word sushi a lot, that document is
probably about sushi. The other side of
this matrix is the inverse document
frequency, or idf. We're going to count
how often we see the words in the other
documents. More accurately, we're gonna
take all of the documents and figure out
how many of them contain that
word. If we have the word the, and we see
the word the in every single
document, it means that it's not gonna be
a good way to tell the documents apart,
because they all have it. On the other
hand, if we have all the documents and the
word sushi appears only in a few of them
then we know that this word is gonna be
a good feature for distinguishing a set
of documents from the rest of the
documents. So let's turn these intuitions
into a language the computer can
understand: numbers. The first part
is the term frequency matrix. We have
here four plays from Shakespeare: two
comedies, As You Like It and Twelfth
Night, and two tragedies, Julius Caesar
and Henry V. We also have four words,
four features: how many times we see the
word battle in each of them, how many
times we see the word good, how many
times we see the word fool, and how many
times we see the word wit in each of the
plays. As you can see, the
word battle has a distribution where you
see more of it in the tragedies. You have
many more occurrences in Julius Caesar
and Henry V compared to the comedies. On the
other hand, words like fool and wit,
which were associated with comedy in the
16th century, are more common in the
comedies. They occur 36 and 58 times
there, compared to just a few times in
the tragedies. And then
there's the word good, which is fairly
stable across all the documents: in all
of them it's a very frequent word, and
it doesn't really help us distinguish
comedies from tragedies.
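As an aside, counts like the ones in this table are easy to produce in code. A minimal sketch in Python, using the two toy documents from earlier rather than the full plays:

```python
from collections import Counter

# Toy documents standing in for the plays; in the exercise you would
# load the full text of each Shakespeare play instead.
docs = {
    "doc1": "apples are delicious",
    "doc2": "cookies are delicious",
}

# One Counter of raw term counts per document.
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Print the count of every vocabulary word in every document.
vocab = sorted({word for c in counts.values() for word in c})
for word in vocab:
    print(word, [counts[name][word] for name in docs])
# apples [1, 0]
# are [1, 1]
# cookies [0, 1]
# delicious [1, 1]
```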
So in order to compensate for words that
are very common throughout all of our
documents, we have the second matrix:
the inverse document frequency. We're
going to take all the documents in our
collection and count how many documents
contain the word we're looking for. For
example, in 37 plays of Shakespeare, the
word Romeo appears in only one: Romeo
and Juliet. So its document frequency is
one. On the other hand, a word like good
appears in all of the 37 plays, so its
document frequency is 37. We need to perform a
further calculation: we're going to take
the total number of documents in the
collection, 37, and divide it by the
number of documents that contain the
word. In the case of Romeo it's 37
divided by one, which equals 37, and then
we're going to take the base 10 logarithm
of that number: log10(37) is about 1.57,
which is what we see here. So all of
these numbers are the total number of
documents in the collection, 37, divided
by the number of documents that have the
word, with the base 10 logarithm applied,
and that is what appears here. We're
going to perform a similar calculation
for the tf side.
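In code, this calculation is one line per word. A minimal sketch in Python:

```python
import math

N = 37  # total number of plays in the collection

def idf(doc_freq: int) -> float:
    """Inverse document frequency: log10(total documents / documents with the word)."""
    return math.log10(N / doc_freq)

print(idf(1))   # 'Romeo' appears in 1 play   -> log10(37/1)  = 1.568... ~ 1.57
print(idf(37))  # 'good' appears in all plays -> log10(37/37) = 0.0
```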
Let's explain a few things first: why do
we have the logarithm? It's there to
compensate for the fact that some words
are very frequent and some words are
very infrequent. Not all words are the
same: there are words that appear in
practically every sentence we use, words
like the or is or and, which are
extremely common. On the other hand,
there are words that are incredibly
infrequent, like chlorophyll, for
example. So the distribution of word
frequencies is exponential, with many
words that are relatively infrequent and
a few words that are very frequent. We
compensate for that by applying a
logarithm, which turns this into
something closer to a linear scale.
Then we have the
base 10 logarithm of how many times we
see the word in the document, plus 1. The
1 is there to help us with the math. In
Twelfth Night, for example, the word
battle appears zero times, and if we try
to take the base 10 logarithm of zero
we're gonna get a math error, because
that number is undefined. We add the 1 so
that we can compute the number correctly:
the base 10 logarithm of 1 is equal to
zero. So we're gonna add one to each of
our values to make sure our math behaves
correctly. In general this kind of
adjustment is called smoothing. We're
gonna talk more about it in week 4.
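You can see the problem, and the fix, in a couple of lines of Python (a quick sketch):

```python
import math

count = 0  # the word 'battle' appears zero times in Twelfth Night

# math.log10(count) would raise "ValueError: math domain error" here;
# adding 1 makes the zero case well defined.
tf = math.log10(count + 1)
print(tf)  # 0.0
```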
So the number that we want for the term
frequency is the base 10 logarithm of
the frequency of each word within the
document plus 1: how many times we see
the word battle in Julius Caesar, for
example, plus 1. Those are the numbers we have
here. Then we multiply the two sides. For
example, the tf-idf value for wit in As
You Like It is its term frequency in the
document, 1.32, multiplied by its inverse
document frequency over the entire
collection, 0.037. As we can see here,
that number is 0.049. So this is a
measure of frequency that takes into
account both the local frequency of a
word in each document and its global
frequency in the entire collection.
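Putting both sides together in code, a minimal sketch. Note that the raw count of 20 and the document frequency of 34 for wit are assumptions chosen to reproduce the 1.32 and 0.037 seen here; they are not numbers given in the lecture:

```python
import math

N = 37         # plays in the collection
count = 20     # assumed raw count of 'wit' in As You Like It: log10(20 + 1) ~ 1.32
doc_freq = 34  # assumed number of plays containing 'wit': log10(37/34) ~ 0.037

tf = math.log10(count + 1)      # term frequency side, ~1.32
idf = math.log10(N / doc_freq)  # inverse document frequency side, ~0.037
print(round(tf * idf, 3))       # 0.049, the value in the tf-idf matrix
```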
We can see that this has very interesting
properties. We still have the clusters we
detected before: the word battle is most
frequent in the tragedies, Julius Caesar
and Henry V, and the words fool and wit
are most frequent in the comedies, As You
Like It and Twelfth Night. But the word
good is now irrelevant: it is zero
throughout all of our plays. That's
because it was never relevant to begin
with; you could see it in every play, and
therefore it uniquely described no play.
We're going
to use tf-idf matrices to cluster
objects, and one of your exercises this
week is going to be to take the
Shakespeare plays, take a number of
features, approximately 200, and cluster
the plays together. As you can see, it
works! For example, in the upper section
we have Julius Caesar and Henry V, and we
can see that they are very close
together. This is because their tf-idf
numbers are relatively similar.
This also clusters them with other
tragedies, such as Romeo and Juliet and
Hamlet. At the bottom of the clusters we
have the comedies Twelfth Night and As
You Like It. Not only are these two close
together, but they're also close to other
comedies such as All's Well That Ends
Well and Much Ado About
Nothing. So in summary, when we are trying
to figure out what a document is about,
we need to find words that clearly
distinguish it from other documents and
these words are gonna have two
properties. They're gonna be frequent
within the document but they're gonna be
infrequent throughout the rest of the
collection, so that when a document does
contain our word, that word is
distinctive and describes the document
well. We can
identify these special words using a
tf-idf matrix and we can also use
this matrix to cluster things together.
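As a sketch of what that clustering exercise might look like in Python, assuming scikit-learn, SciPy, and matplotlib are available. The file names are hypothetical placeholders, and scikit-learn's tf-idf weighting differs in detail from the log10 formulas above, though the idea is the same:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical file names; point these at the play texts you are given.
titles = ["Julius Caesar", "Henry V", "As You Like It", "Twelfth Night"]
texts = [open(f"{title}.txt").read() for title in titles]

# Build a tf-idf matrix with roughly 200 features, as in the exercise.
X = TfidfVectorizer(max_features=200).fit_transform(texts)

# Hierarchical clustering on the tf-idf rows: plays with similar
# vocabularies end up in the same branch of the dendrogram.
Z = linkage(X.toarray(), method="ward")
dendrogram(Z, labels=titles)
plt.show()
```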
