Welcome to the third week of accelerated
computational linguistics. This week we
will start our study of natural language
processing. We'll turn documents into
collections of features so that we can
then feed these features into machine
learning algorithms. In this first video we'll start with a very basic application of this idea: sentiment analysis. We'll convert documents into words, words into features, and then use those features to compute sentiment values for a whole document. First, let's try to think of words as bundles of features, features
that describe them. Consider a question that has been the bread and butter of philosophers for centuries: what is a human, and in general, how would you define one? One way to do it would be to come up with a system of features that sets this word apart from other words. So what characterizes a human? For example, humans don't have feathers, so they are negative for the feature "feathers," and they are bipeds, so they are positive for the feature "biped." This is a definition Plato once gave of a human, and it is a fairly arbitrary one; but as a matter of fact, any feature system we could use is fairly arbitrary. Defining a word is extremely difficult, and doing it with features like these is especially so, because we would need to find some combination of features that is exactly right. Let's see how far
the system can take us. Let's try to define
three words, man, woman, and dog, for
example. We could define woman as plus human, a being that is human, and plus hair. As you can see in this example, the woman has hair and the other being does not. To our left we have man, who is plus human and minus hair, because he has absolutely no hair. Dog would be minus human, plus hair, because the puppy there does appear to have hair.
This system, again, is arbitrary; it is just one way to find features that fit the data we have. One short comment, which we will come back to: any time we use a binary feature, or in general an n-ary feature, we are forcing reality into these categories. When we do that, we are either leaving chunks of reality out or squeezing chunks of reality into categories where they don't fit, so that phenomena that do not sit neatly in these categories become invisible, or become dissimulated by some other feature. Later in the class we will criticize these kinds of binary systems, but for now let's see if they can help us with the problem at hand. In computer science we use
binary systems because we're used to them; this is the way our computers work. We can easily turn "not human" versus "human" into 0 or 1. So a woman would be someone who has the value 1 for the feature of being human and the value 1 for the feature of having hair; a man would be someone who has the value 1 for being human and the value 0 for having hair; and for the dog the values would be 0 and 1. We have been
using features to define words. Let's try
to use them to assign emotions to words.
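As a quick sketch, the 0/1 encoding we just described can be written directly in Python. The feature names and values simply mirror the toy man/woman/dog example above; they are illustrative, not a real lexical resource:

```python
# Binary feature bundles: 1 = the word is positive for the feature,
# 0 = negative. Features and values follow the toy example above.
FEATURES = ("human", "hair")

WORDS = {
    "woman": (1, 1),  # +human, +hair
    "man":   (1, 0),  # +human, -hair
    "dog":   (0, 1),  # -human, +hair
}

def describe(word):
    """Render a word's bundle as +feature / -feature labels."""
    values = WORDS[word]
    return [("+" if v else "-") + feat for feat, v in zip(FEATURES, values)]

print(describe("man"))  # ['+human', '-hair']
```

The same dictionary-of-tuples shape will reappear below when the features are emotions instead of biological properties.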
Let's try to come up with a system where a word is characterized by its emotion features. We'll use something called the Plutchik wheel of emotions. Take love, for example: what kinds of emotions does love conjure? Does it entail serenity, interest, annoyance, boredom, pensiveness, distraction, apprehension, or acceptance? Let's say we will use the middle ring of the wheel, the one that has joy, anticipation, anger, disgust, sadness, surprise, fear, and trust. Does love entail joy? Probably yes; those two emotions do come together. How about sadness? Hopefully those two emotions are not bundled together.
In general, let's assign 1 to the features that we do associate with the word and 0 to the features that we do not. Love would be associated with the features joy and positive, for example, and it would not be associated with features like anger or disgust. How about a different word, something like hate? Hate could be associated with the features anger, disgust, fear, negative in general, and sadness, so it would have the value 1 for those features.
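Sketching this in Python, with the middle-ring emotions plus overall positive/negative as the feature set (the 0/1 assignments below just mirror the discussion of love and hate above):

```python
# Middle-ring emotions of the Plutchik wheel, plus overall polarity.
EMOTIONS = ["joy", "anticipation", "anger", "disgust",
            "sadness", "surprise", "fear", "trust",
            "positive", "negative"]

def emotion_vector(associated):
    """Encode a word as 0/1 values over the emotion features."""
    return {emo: int(emo in associated) for emo in EMOTIONS}

love = emotion_vector({"joy", "positive"})
hate = emotion_vector({"anger", "disgust", "fear", "sadness", "negative"})

print(love["joy"], love["anger"])  # 1 0
print(hate["anger"], hate["joy"])  # 1 0
```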
This is exactly what the NRC Word-Emotion Association Lexicon does. It is a document, a text file, with about 14,000 words, and the authors of the lexicon crowdsourced the emotions: they had people vote on whether love was positive or negative, joyful or sad, and out of crowdsourcing this with a large number of participants they decided on the actual values of those features. The lexicon is available at this URL, where you can also see the exact procedures for the crowdsourcing, but it is basically this:
it's a text file that has the word in the first column, then the emotion in the second column (anger, anticipation, disgust, and so on), and then the value for that feature, 0 or 1. For example, in line 21 we can see
that the word abacus entails the feeling
of trust because it has the number 1 for
that feature. So how can we use this?
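A minimal sketch of how one might parse this three-column format. The sample lines here are illustrative (the abacus/trust entry matches the example above), and the tab separator is an assumption about the file layout; in practice you would read the downloaded lexicon file itself:

```python
from collections import defaultdict

# NRC-style format: word <TAB> emotion <TAB> 0-or-1, one triple per line.
# A tiny inline sample stands in for the real ~14,000-word file.
sample = """\
abacus\tanger\t0
abacus\ttrust\t1
love\tjoy\t1
love\tpositive\t1
love\tanger\t0
"""

def parse_nrc(lines):
    """Map each word to the set of emotions it is flagged 1 for."""
    lexicon = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        word, emotion, value = line.split("\t")
        if int(value) == 1:
            lexicon[word].add(emotion)
    return lexicon

lexicon = parse_nrc(sample.splitlines())
print(lexicon["abacus"])  # {'trust'}
```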
Let's say we have a document, a series of sentences, for a movie review, and the review says: "I loved that movie! It was incredible and the characters were so compelling." This review is probably positive, more positive than negative. The first thing we need to do is tokenize and normalize it: we take all the words and turn them into individual tokens, and we perform some normalization. If something has different spellings, we merge the spellings into one form; here, for example, we turn everything into lowercase. Once we have the tokens and have removed the punctuation, the exclamation marks, we get the stems. The NRC lexicon is not going to have a word like loved in the past tense, but it will have the word love in the present tense, so we need to convert loved into love. This is done through stemming, as we studied in Week 1.
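Here is a minimal sketch of that pipeline in plain Python. The tiny `STEMS` table is a hypothetical stand-in for a real stemmer or lemmatizer (in practice you would use the stemming tools from Week 1):

```python
import re

# Illustrative stem/lemma table standing in for a real stemmer.
STEMS = {"loved": "love", "was": "be", "were": "be", "characters": "character"}

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, then stem each token."""
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenize + normalize
    return [STEMS.get(tok, tok) for tok in tokens]  # stem

review = "I loved that movie! It was incredible and the characters were so compelling."
print(preprocess(review))
# ['i', 'love', 'that', 'movie', 'it', 'be', 'incredible',
#  'and', 'the', 'character', 'be', 'so', 'compelling']
```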
So let's say we have the review "I loved that movie! It was incredible and the characters were so compelling." We tokenize it, we normalize it, and then we stem it to get the array "i love that movie it be incredible and the character so compelling," and then we figure out which words in that array are present in the NRC list. We're going to find four words: love, movie, character, and compelling. Then we add up the values for all the emotions. In total, no words had the emotion of anger associated with them, one word had the emotion of joy associated with it, and two words had the emotion positive associated with them. So this method correctly predicts that this review is relatively positive and relatively joyful compared to the other emotions. Let's look at a second review; maybe we have "That movie was terrible, I hated it and the characters were so boring."
After tokenization, normalization, and stemming of the sentence, we get "that movie be terrible i hate it and the character be so boring." When we check which words in that array are present in the NRC lexicon, we find movie, terrible, hate, character, and boring. We add up their values and see that two words from this list have the emotion of anger associated with them (hate and terrible), two words have disgust associated with them, and three words have the general negative emotion associated with them (terrible, hate, and boring). So these values correctly predict that the overarching emotion of this review is negative rather than positive, because three words carry the negative emotion and zero words carry the positive emotion. In summary, this method, even though it is incredibly simple, correctly predicts the emotions of these movie reviews.
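The whole scoring step can be sketched as follows. The toy lexicon below contains only the handful of word-emotion associations implied by the two reviews above (in practice the full NRC lexicon would be loaded from its file), and it reproduces the counts just discussed:

```python
from collections import Counter

# Toy subset of NRC-style word -> emotions associations, matching the
# two worked examples above (illustrative only).
LEXICON = {
    "love":       {"joy", "positive"},
    "movie":      set(),
    "character":  set(),
    "compelling": {"positive"},
    "terrible":   {"anger", "disgust", "fear", "negative", "sadness"},
    "hate":       {"anger", "disgust", "fear", "negative", "sadness"},
    "boring":     {"negative"},
}

def score(tokens):
    """Sum, over all tokens found in the lexicon, the emotions they carry."""
    totals = Counter()
    for tok in tokens:
        totals.update(LEXICON.get(tok, set()))
    return totals

good = ["i", "love", "that", "movie", "it", "be", "incredible",
        "and", "the", "character", "be", "so", "compelling"]
bad = ["that", "movie", "be", "terrible", "i", "hate", "it",
       "and", "the", "character", "be", "so", "boring"]

print(score(good)["positive"], score(good)["joy"])  # 2 1
print(score(bad)["negative"], score(bad)["anger"])  # 3 2
```

A document's overall sentiment is then just a comparison of these totals: the first review scores highest on positive and joy, the second on negative.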
As you can see, we can extend this to longer documents. One of your exercises this week will be to take the scripts of the first Star Wars movies, episodes four, five, and six, and figure out what Darth Vader's emotions were during those movies. You're going to get all of Darth Vader's lines of dialogue, compute their emotion values, and add them up, and then we're going to see how he's feeling in each of the movies: maybe he's feeling happier in movie four, maybe bolder or more joyful in movie five, and so forth. Here is an example of what it might look like: these are the feelings throughout all three movies for some of the characters with the most lines. You can see that the Emperor is angry quite a bit; that is the lowermost line in the Emperor's summation. We can use this for other
works of art: for example, this is a summation of positive and negative values for the chapters of The Lord of the Rings books, and you can see that some chapters are more positive, some are more negative, and it goes back and forth. Python has many interesting libraries that support sentiment analysis and some of the things we did in Week 1, like lemmatization, stemming, and so forth. So this was a very simple exercise, but it accurately predicted the sentiment we were looking for. Just as we can describe the meanings of words as bundles of features, we can describe the sentiment of words as bundles of emotion features. These features give us an account of what a word feels like, and this is a very simple way to do sentiment analysis.
