Hi and welcome to week 5. This week we're
going to study two techniques for
supervised learning: naive Bayes classifiers and support vector machines, and we're going to study how they can be used for natural language processing. But in order to do that, we first need to
look at the structure of supervised
learning. So what we've been doing over
the last two weeks is called
unsupervised learning. Here we provide
the computer with some training set made
up of the features that describe some
original document, for example, so we have
a vector with features and then the
computer needs to figure out what's
similar between those vectors, so that we
can use those similarities to create
clusters. Maybe the features are the different
words that related languages have, maybe
the features are the presence or absence
of the word sushi, or the word restaurant
or the word Hanover, so we have sets of
features and then the computer tries to
find similarities between them so that
it can create clusters. It does so so that in the future, when it gets a set of features it hasn't seen before, it can decide whether this document should go into this cluster, or whether it's more similar to this other cluster. In that way, the computer
is learning in an unsupervised manner,
because we didn't tell it exactly what
it should be looking for. We just told it
find similarities between these and
give us some clusters. For example, with the Shakespeare texts, it just found similar groupings, but it didn't have specific names for the clusters. Here we're going to look at a
different technique for machine learning
called supervised learning. In
supervised learning, we have our vectors
with features that describe a document
for example, or a language, or anything we
want to classify,
but in addition to this vector with
features, we're going to give each vector
an additional field which will be the
label. We are going to tell the computer explicitly: these features here describe a cat, these here also describe a cat, and this one here describes a not-cat, something else. So let's say we have, for example, cats and not-cats, and we give the computer a list of vectors, some for cats, some for not-cats, and then we tell the computer: please find what it is about the features that makes some of these vectors cats, and some of these vectors not cats.
This problem is called classification,
and what we hope to achieve is that in
future runs, the computer will get a set
of features that it hasn't seen before,
and it will be able to compute for
example, the probability that this unseen
vector is similar to cat or is similar
to not a cat. Classification is what we'll be studying this week, but there are many other types of supervised learning. Regression is an example; deep learning, which we'll be studying later, is another; and so forth.
Let's focus on classification. What we
need first is a training set, for example
a list of movie reviews. This is an example from our textbook. We'll later see whether each review is positive or negative, but for now we only need the text: for example, that a movie is unbelievably disappointing, or that it was pathetic and the worst part about it was the boxing scenes. So we
have the training set, and we'll extract
some features from it, for example the
presence of the word disappointing, the
presence of the word greatest, and in
addition to the training data, we have
the labels. These labels are the correct
classification for each element, and
they're the ones that we're aiming
towards. For example, we tell the computer
explicitly that the review Unbelievably
disappointing is supposed to be negative.
We tell it explicitly the review This is
the greatest screwball comedy ever
filmed is supposed to be positive. So we
have some vector with the features for
each item in the training set, and we
have labels for whether this is supposed
to be positive or negative. We would need
to define the features. This is not a
trivial task. Let's do something very
simple for now and use the words as the
features as we've been doing in the last
couple of weeks. Maybe the word
disappointing can tell us if a review is
positive or negative. So if, for example, a review contains the word disappointing, is that review negative? We
want to figure out whether the
probability of this is high, so that if you see the word disappointing, it is very likely that the review is negative; or maybe the probability is low, and the word disappointing doesn't really tell you whether the review is positive or negative. By the way, this is
conditional probability like we studied
last week. In Bayesian terms, this would
be read: the probability that a review is
negative given that we see the word
disappointing. So I'm going to give you a
moment to pause the video. I'm gonna
count to five and you tell me: is a document likely to be negative if you find the word disappointing in it? Look at the little
corpus that we have and let me know.
The probability is very high. Within our little training set here, the word disappointing only appears in negative documents, and so disappointing is highly associated with negative reviews. So this probability here is gonna be very high, again because we only find it in negative reviews and never in positive reviews. How about a feature
like whether the review contains the
word greatest or not? If a review
contains the word greatest, is it a
positive review? What is the
probability of this? I'm gonna give you a
moment to pause the video and think
about it on your own.
If the word greatest appears, it is only associated with positive reviews in this toy training set that we have.
Again the training set determines the
behavior of your machine learning
algorithm. So because the word greatest
only appears in positive reviews, we
believe that the probability that a
review is positive given that it
contains the word greatest is very high.
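These counting arguments can be sketched in a few lines of Python. The toy corpus below is a hand-typed stand-in for the textbook reviews (the exact strings are my approximation, not the textbook's data), and the conditional probability is estimated by simple document counting:

```python
# A toy labeled corpus, loosely based on the review snippets above
# (illustrative stand-ins, not the textbook's full data).
corpus = [
    ("unbelievably disappointing", "negative"),
    ("pathetic the worst part about it was the boxing scenes", "negative"),
    ("this is the greatest screwball comedy ever filmed", "positive"),
]

def prob_label_given_word(word, label, docs):
    """Estimate P(label | word) as: (# docs containing the word with that label)
    divided by (# docs containing the word)."""
    labels_with_word = [lab for text, lab in docs if word in text.split()]
    if not labels_with_word:
        return 0.0  # the word never occurs, so the estimate is undefined; use 0
    return labels_with_word.count(label) / len(labels_with_word)

print(prob_label_given_word("disappointing", "negative", corpus))  # 1.0
print(prob_label_given_word("greatest", "positive", corpus))       # 1.0
```

With only three documents these estimates come out extreme (0 or 1 for most words); real classifiers typically smooth such counts, as we'll see with naive Bayes.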
Let's give this one more try. Look at the
corpus that you have here, and try to
figure out what is the probability that
a review is negative if it has the word
was. Is it high or is it low? I'm gonna
let you pause the video and work it out.
So this is interesting. In this tiny training set, the word was only appears in negative reviews, so there's a very high probability that the document is negative if it has the word was. However, we know from real life that the word was probably appears in both positive and negative reviews, so it shouldn't really be helpful. This is just
an artifact of our training set, and
again the training set will determine a
lot of the behavior of your program. So
we could use features like the presence
or absence of a word to describe the
document and then append to it labels
like positive or negative, but we could
really use any label such as whether
something is about sports or about not
sports. Here we have another training set:
a bunch of documents that say: A great game; The election was over; Very clean match; A clean but forgettable game; It was a close election. So try to
work out which words in those documents
are very informative towards the category sports; that is, they only occur in the documents that belong to the category sports. Then try to work out which words are very informative for the category not sports; these are words that only occur in documents that are not about sports. Finally, which words don't go either way? You can find these words in both sports and non-sports
documents. I'll give you a moment to
pause the video, try to work it out and
then come back.
So it could be something like this: the words great, game, very, clean, match, but, forgettable, and game (oh, game is there twice, I'm sorry) are very informative for the category sports, because they only appear in documents labeled sports. Words like the, election, was, over, it, and close are
very informative of the category not
sports, because they only appear in
documents that are labeled not sports,
and notice again that we shouldn't expect the word the to be informative about not sports in real life; it's just an artifact of how the training set is structured. The word a is uninformative
because it appears both in sports
documents and in not sports documents,
such as a great game but also a close
election. Let's look at one final example.
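Word lists like the ones just worked out can be checked mechanically: collect each class's vocabulary, then take set differences to find the words exclusive to one class. The five documents below are typed from the example above, so treat the exact strings as an approximation:

```python
# The five toy documents from the sports example, with their labels.
docs = [
    ("a great game", "sports"),
    ("the election was over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]

# Collect the vocabulary of each class.
vocab = {}
for text, label in docs:
    vocab.setdefault(label, set()).update(text.split())

sports_only = vocab["sports"] - vocab["not sports"]      # informative for sports
not_sports_only = vocab["not sports"] - vocab["sports"]  # informative for not sports
both = vocab["sports"] & vocab["not sports"]             # uninformative

print(sorted(sports_only))      # ['but', 'clean', 'forgettable', 'game', 'great', 'match', 'very']
print(sorted(not_sports_only))  # ['close', 'election', 'it', 'over', 'the', 'was']
print(sorted(both))             # ['a']
```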
This is an example of spam, of junk email.
Try to figure out what features you
would use to distinguish between spam
and non-spam email based on things you
see here. I'll give you a moment to pause
and think about it.
For example, maybe if the email contains the word lucky, it should be flagged as potentially spam. If it contains things like winning, it might be flagged too, particularly if it contains winning in all caps. Look at how many things here are in all caps; regular emails are not written like this. So this could be one thing that we could use to distinguish between spam email and non-spam email: features like does it contain the word lucky or not, does it contain the word winning or not, does it have many words in all caps or not. In summary, in
supervised learning, we need to provide a training set: a series of documents that we want to classify, or something which we ultimately turn into features, and these features are going to describe the items in the training set. We also have labels; these labels are the classification that we believe to be correct, such as positive or negative, spam or not spam, sports or not sports. So
we have the features that describe the
document, and the label that we've
assigned to the document, and then the
computer will try to find what it is about this combination of features that tells you that it belongs to that label.
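As a closing sketch, here is one way the pieces of this summary fit together in code for the spam example: a feature extractor turns each raw document into a feature vector, and the training set pairs those vectors with labels. The particular features and the two example emails are illustrative choices, not a fixed recipe:

```python
def extract_features(text):
    """Turn a raw email into a small feature vector.
    The features are illustrative: presence of trigger words,
    and the share of fully capitalized words."""
    words = text.split()
    caps = [w for w in words if w.isupper() and len(w) > 1]
    return {
        "has_lucky": "lucky" in text.lower(),
        "has_winning": "winning" in text.lower(),
        "caps_ratio": len(caps) / len(words) if words else 0.0,
    }

# A tiny hypothetical training set: (features, label) pairs,
# which is exactly the shape of input a supervised learner expects.
training_set = [
    (extract_features("YOU are the LUCKY WINNER, claim your WINNING ticket NOW"), "spam"),
    (extract_features("Meeting moved to Tuesday, agenda attached"), "not spam"),
]

for features, label in training_set:
    print(features, "->", label)
```

A supervised learner would then search for the combination of feature values that separates spam from not spam.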
This week, we're going to look at two
methods to do this: naive Bayes
classifiers and support vector machines.
