[MUSIC PLAYING]
SPEAKER 1: OK, welcome back, everyone,
to our final topic in an introduction
to artificial intelligence with Python.
And today, the topic is language.
So thus far in the class,
we've seen a number
of different ways of interacting
with AI, artificial intelligence,
but it's mostly been happening in
the way of us formulating problems
in ways that AI can understand--
learning to speak the language of AI,
so to speak, by trying to take a problem
and formulate it as a search problem,
or by trying to take a problem and make
it a constraint satisfaction problem--
something that our AI
is able to understand.
Today, we're going to try and come
up with algorithms and ideas that
allow our AI to meet us
halfway, so to speak--
to allow our AI to understand,
and interpret, and get
some sort of meaning
out of human language--
the type of language,
the spoken language,
like English, or some other
language that we naturally speak.
And this turns out to be a
really challenging task for AI.
And it really encompasses a
number of different types of tasks
all under the broad heading of
natural language processing,
the idea of coming up
with algorithms that
allow our AI to be able to process
and understand natural language.
So these tasks vary in
terms of what we might
want an AI to perform,
and therefore, the types of
algorithms that we might use.
But some common
tasks that you might see
are things like automatic summarization.
You give an AI a long document,
and you would like for the AI
to be able to summarize
it, come up with a shorter
representation of the same idea, but
still in some kind of natural language,
like English.
Something like information extraction--
given a whole corpus of information
in some body of documents
or on the internet,
for example, we'd like for
our AI to be able to extract
some sort of meaningful semantic
information out of all of that content
that it's able to look at and read.
Language identification--
the task of, given a page,
can you figure out what language
that document is written in?
This is the type of thing you might
see if you use a web browser where,
if you open up a page
in another language,
that web browser might ask you, oh,
I think it's in this language-- would
you like me to translate into
English for you, for example?
And that language
identification process is a task
that our AI needs to be able to
do, which is then related
to machine translation, the process
of taking text in one language
and translating it into another
language-- which there's
been a lot of research
and development on really
over the course of the
last several years.
And it keeps getting
better, in terms of how
it is that AI is able to
take text in one language
and transform that text into
another language as well.
In addition to that, we have topics
like named entity recognition.
Given some sequence of text, can you
pick out what the named entities are?
These are names of companies,
or names of people,
or names of locations for example, which
are often relevant or important parts
of a particular document.
Speech recognition is a related task,
having to do not with text that is written,
but text that is spoken-- being able
to process audio and figure out,
what are the actual words
that are spoken there?
And if you think about smart
home devices, like Siri or Alexa,
for example, these are
all devices that are now
able to listen when we
speak, figure out
what words we are saying, and draw some
sort of meaning out of that as well.
We've talked about how you
could formulate something,
for instance, as a hidden
Markov model to be able to draw
those sorts of conclusions.
Text classification, more
generally, is a broad category
of types of ideas, whenever we
want to take some kind of text
and put it into some sort of category.
And we've seen these
classification type problems
and how we can use statistical
machine learning approaches
to be able to solve them.
We'll be able to do something
very similar with natural language,
though we may need to make a couple
of adjustments that we'll see soon.
And then something like
word sense disambiguation,
the idea that, unlike in
the language of numbers,
where AI has very precise
representations of everything, words
are a little bit fuzzy,
in terms of their meaning,
and words can have multiple
different meanings--
and natural language is
inherently ambiguous,
and we'll take a look at some of
those ambiguities in due time today.
But one challenging
task, if you want an AI
to be able to understand
natural language,
is being able to
disambiguate or differentiate
between different possible
meanings of words.
If I say a sentence like, I went to
the bank, you need to figure out,
do I mean the bank where I
deposit and withdraw money or do
I mean the bank like the river bank?
And different words can
have different meanings
that we might want to figure out.
And based on the context
in which a word appears--
the wider sentence,
or paragraph, or paper
in which a particular word appears--
that might help to
inform how it is that we
disambiguate between different
meanings or different senses
that a word might have.
And there are many other topics
within natural language processing,
many other algorithms
that have been devised
in order to deal with and
address these sorts of problems.
And today, we're really just
going to scratch the surface,
looking at some of the fundamental ideas
that are behind many of these tasks
within natural language processing,
within this idea of trying to come up
with AI algorithms that are able to do
something meaningful with the languages
that we speak everyday.
And so to introduce this idea,
when we think about language,
we can often think about it in
a couple of different parts.
The first part refers to
the syntax of language.
This is more to do with just
the structure of language
and how it is that that structure works.
And if you think about natural
language, syntax is one of those things
that, if you're a native
speaker of a language,
it comes pretty readily to you.
You don't have to think
too much about it.
If I give you a sentence from Sir
Arthur Conan Doyle's Sherlock Holmes,
for example, a sentence like this--
"just before 9 o'clock, Sherlock
Holmes stepped briskly into the room"--
I think we could probably
all agree that this
is a well-formed grammatical sentence.
Syntactically, it makes
sense, in terms of the way
that this particular
sentence is structured.
And syntax applies not just to natural
language, but to programming languages
as well.
If you've ever seen a syntax error
in a program that you've written,
it's likely because you
wrote some sort of program
that was not syntactically well-formed.
The structure of it was
not a valid program.
In the same way, we can look at
English sentences, or sentences
in any natural language, and
make the same kinds of judgments.
I can say that this sentence
is syntactically well-formed.
When all the parts are put together,
all these words are in this order,
it constructs a grammatical sentence, or
a sentence that most people would agree
is grammatical.
But there are also grammatically
ill-formed sentences.
A sentence like, "just
before Sherlock Holmes
9 o'clock stepped briskly the room"--
well, I think we would all agree that
this is not a well-formed sentence.
Syntactically, it doesn't make sense.
And this is the type of thing that,
if we want our AI, for example,
to be able to generate
natural language--
to be able to speak to us the way
like a chat bot would speak to us,
for example--
well then our AI is going to need to be
able to know this distinction somehow,
is going to be able to know what
kinds of sentences are grammatical,
what kinds of sentences are not.
And we might come up with rules or ways
to statistically learn these ideas,
and we'll talk about some
of those methods as well.
Syntax can also be ambiguous.
There are some sentences that are
well-formed and not well-formed,
but certain way-- there are certain
ways that you could take a sentence
and potentially construct multiple
different structures for that sentence.
A sentence like, "I saw the man on
the mountain with a telescope," well,
this is grammatically well-formed--
syntactically, it makes sense--
but what is the structure
of the sentence?
Is it the man on the mountain
who has the telescope, or am
I seeing the man on the mountain and
I am using the telescope in order
to see the man on the mountain?
There's some interesting ambiguity
here, where it could have potentially
two different types of structures.
And this is one of the ideas
that we'll come back to also,
in terms of how to think about dealing
with AI when natural language is
inherently ambiguous.
So that then is syntax,
the structure of language,
and getting an
understanding for how it is
that, depending on the order
and placement of words,
we can come up with different
structures for language.
But in addition to language having
structure, language also has meaning.
And now we get into the world
of semantics, the idea of,
what it is that a word,
or a sequence of words,
or a sentence, or an entire
essay actually means?
And so a sentence like, "just
before 9:00, Sherlock Holmes
stepped briskly into the
room," is a different sentence
from a sentence like, "Sherlock Holmes
stepped briskly into the room just
before 9:00."
And yet they have
effectively the same meaning.
They're different
sentences, so an AI reading
them would recognize them as
different, but we as humans
can look at both the
sentences and say, yeah,
they mean basically the same thing.
And maybe, in this case, it was just
because I moved the order of the words
around.
Originally, 9 o'clock was near
the beginning of the sentence.
Now 9 o'clock is near
the end of the sentence.
But you might imagine that I could come
up with a different sentence entirely,
a sentence like, "a few minutes
before 9:00, Sherlock Holmes
walked quickly into the room."
And OK, that also has
a very similar meaning,
but I'm using different words
in order to express that idea.
And ideally, AI would
be able to recognize
that these two sentences, these
different sets of words that
are similar to each other,
have similar meanings,
and to be able to get
at that idea as well.
Then there are also ways that a
syntactically well-formed sentence
might not mean anything at all.
A famous example from linguist Noam
Chomsky is this sentence here--
"colorless green ideas sleep furiously."
Syntactically, that
sentence is perfectly fine.
Colorless and green are adjectives
that modify the noun ideas.
Sleep is a verb.
Furiously is an adverb.
These are correct constructions,
in terms of the order of words,
but it turns out this
sentence is meaningless.
If you tried to ascribe meaning to
the sentence, what does it mean?
And it's not easy to
be able to determine
what it is that it might mean.
Semantics itself can also be ambiguous,
given that different structures can
have different types of meanings.
Different words can have
different kinds of meanings,
so the same sentence
with the same structure
might end up meaning
different types of things.
So my favorite example is
a headline that appeared in the Los
Angeles Times a little while back.
The headline says, "Big rig carrying
fruit crashes on 210 freeway,
creates jam."
So depending on how it is
you look at the sentence--
how you interpret the sentence-- it
can have multiple different meanings.
And so here too are challenges in this
world of natural language processing,
being able to understand
both the syntax of language
and the semantics of language.
And today, we'll take a
look at both of those ideas.
We're going to start
by talking about syntax
and getting a sense for how it
is that language is structured,
and how we can start by coming
up with some rules, some ways
that we can tell our computer,
tell our AI what types of things
are valid sentences, what types
of things are not valid sentences.
And ultimately, we'd like
to use that information
to be able to allow our AI to
draw meaningful conclusions,
to be able to do
something with language.
And so to do so, we're going
to start by introducing
the notion of formal grammar.
And what formal grammar is
all about is this: a formal grammar
is a system of rules that
generates sentences in a language.
I would like to know what are
the valid English sentences--
not in terms of what they mean--
just in terms of their structure--
their syntactic structure.
What structures of English
are valid, correct sentences?
What structures of
English are not valid?
And this is going to apply in a very
similar way to other natural languages
as well, where language follows
certain types of structures.
And we intuitively know
what these structures mean,
but it's going to be helpful to
try and really formally define
what the structures mean as well.
There are a number of different
types of formal grammar
all across what's known as the
Chomsky hierarchy of grammars.
And you may have seen
some of these before.
If you've ever worked with
regular expressions before,
those correspond to regular languages,
which are a particular type of language.
But also on this hierarchy
is a type of grammar
known as a context-free grammar.
And this is the one we're
going to spend the most
time on taking a look at today.
And what a context-free grammar
is is a way of generating sentences
in a language via what
are known as rewriting rules--
replacing one symbol with other symbols.
And we'll take a look in a
moment at just what that means.
So let's imagine, for example,
a simple sentence in English,
a sentence like, "she saw the city"--
a valid, syntactically
well-formed English sentence.
But we'd like for some way for our
AI to be able to look at the sentence
and figure out, what is the
structure of the sentence?
If you imagine an AI in a
question answering format--
if you want to ask the AI a
question like, what did she see,
well, then the AI wants to be
able to look at this sentence
and recognize that what she saw is the
city-- to be able to figure that out.
And it requires some
understanding of what
it is that the structure of
this sentence really looks like.
So where do we begin?
Each of these words--
she, saw, the, city--
we are going to call terminal symbols.
There are symbols in our language--
where each of these words is just
a symbol--
where this is ultimately what
we care about generating.
We care about generating these words.
But each of these words
we're also going to associate
with what we're going to
call a non-terminal symbol.
And these non-terminal symbols initially
are going to look kind of like parts
of speech, if you remember
back to like English grammar--
where she is a noun,
saw is a V for verb,
the is a D. D stands for determiner.
These are words like the,
and a, and an, for example.
And then city-- well, city is
also a noun, so an N goes there.
So each of these--
N, V, and D--
these are what we might
call non-terminal symbols.
They're not actually
words in the language.
She saw the city-- those are
the words in the language.
But we use these non-terminal symbols
to generate the terminal symbols,
the terminal symbols which
are like, she saw the city--
the words that are actually
in a language like English.
And so in order to translate these
non-terminal symbols into terminal
symbols, we have what are
known as rewriting rules,
and these rules look
something like this.
We have N on the left side
of an arrow, and the arrow
says, if I have an N
non-terminal symbol,
then I can turn it into any of these
various different possibilities
that are separated with a vertical line.
So a noun could translate
into the word she.
A noun could translate into the
word city, or car, or Harry,
or any number of other things.
These are all examples
of nouns, for example.
Meanwhile, a determiner, D, could
translate into the, or a, or an.
V for verb could translate
into any of these verbs.
P for preposition could translate
into any of those prepositions--
to, on, over, and so forth.
And then ADJ for adjective can translate
into any of these possible adjectives
as well.
So these then are rules in
our context-free grammar.
When we are defining what
it is that our grammar is,
what is the structure of the English
language or any other language,
we give it these types of
rules saying that a noun could
be any of these possibilities, a verb
could be any of those possibilities.
But it turns out we can then begin
to construct other rules where
it's not just one non-terminal
translating into one terminal symbol.
We're always going to have one
non-terminal on the left-hand side
of the arrow, but on the
right-hand side of the arrow,
we could have other things.
We could even have other
non-terminal symbols.
So what do I mean by this?
Well, we have the idea of nouns-- like
she, city, car, Harry, for example--
but there are also noun phrases--
phrases that work as nouns--
that are not just a single word,
but that are multiple words.
Like the city is two words,
that together, operate
as what we might call a noun phrase.
It's multiple words, but they're
together operating as a noun.
Or if you think about a more complex
expression, like the big city--
three words all operating
as a single noun--
or the car on the street--
multiple words now, but that entire set
of words operates kind of like a noun.
It substitutes as a noun phrase.
And so to do this, we'll
introduce the notion
of a new non-terminal symbol called
NP, which will stand for noun phrase.
And this rewriting rule says that
a noun phrase could be a noun--
so something like she is
a noun, and therefore, it
can also be a noun phrase--
but a noun phrase could also be a
determiner, D, followed by a noun--
so two ways we can have a noun
phrase in this very simple grammar.
Of course, the English language
is more complex than just this,
but a noun phrase is either a noun or
it is a determiner followed by a noun.
So for the first example, a
noun phrase that is just a noun,
that would allow us to
generate noun phrases like she,
because a noun phrase is
just a noun, and a noun
could be the word she, for example.
Meanwhile, if we wanted to look at
one of the examples of these, where
a noun phrase becomes a
determiner and a noun,
then we get a structure like this.
And now we're starting to
see the structure of language
emerge from these rules in a
syntax tree, as we'll call it,
this tree-like structure that represents
the syntax of our natural language.
Here, we have a noun
phrase, and this noun phrase
is composed of a determiner and a noun,
where the determiner is the word the,
according to that rule,
and noun is the word city.
So here then is a noun phrase that
consists of multiple words inside
of the structure.
And using this idea of taking one symbol
and rewriting it using other symbols--
that might be terminal
symbols, like the and city,
but might also be non-terminal symbols,
like D for determiner or N for noun--
then we can begin to construct
more and more complex structures.
In addition to noun phrases, we
can also think about verb phrases.
So what might a verb phrase look like?
Well, a verb phrase might
just be a single verb.
In a sentence like "I
walked," walked is a verb,
and that is acting as the
verb phrase in that sentence.
But there are also more complex verb
phrases that aren't just a single word,
but that are multiple words.
If you think of the sentence like
"she saw the city," for example,
saw the city is really
that entire verb phrase.
It's capturing what it is
that she is doing, for example.
And so our verb phrase
might have a rule like this.
A verb phrase is either
just a plain verb
or it is a verb followed
by a noun phrase.
And we saw before that a
noun phrase is either a noun
or it is a determiner
followed by a noun.
And so a verb phrase
might be something simple,
like a verb phrase that is just a verb.
And that verb could be the
word walked for example.
But it could also be
something more sophisticated,
something like this one, where we
begin to see a larger syntax tree,
where the way to read the
syntax tree is that a verb
phrase is a verb and
a noun phrase, where
that verb could be something like saw.
And this is a noun phrase we've
seen before, this noun phrase that
is the city-- a noun phrase composed
of the determiner the and the noun
city all put together to
construct this larger verb phrase.
And then just to give one
more example of a rule,
we could also have a rule like this--
sentence S goes to noun
phrase and a verb phrase.
The basic structure of
a sentence is that it is
a noun phrase followed by a verb phrase.
And this is a formal grammar
way of expressing the idea
that you might have learned when you
learned English grammar, when you read
that a sentence is like a subject
and a verb, subject and action--
something that's happening
to a particular noun phrase.
And so using this structure,
we could construct
a sentence that looks like this.
A sentence consists of a noun
phrase and a verb phrase.
A noun phrase could just be
a noun, like the word she.
The verb phrase could be
a verb and a noun phrase,
where-- this is something we've
seen before-- the verb is saw
and the noun phrase is the city.
And so now look what we've done here.
What we've done is, by
defining a set of rules,
there are algorithms that we
can run that take these words--
and the CYK algorithm, for example, is
one example of this if you want to look
into that--
where you start with a set of terminal
symbols, like she saw the city,
and then using these rules,
you're able to figure out,
how is it that you go from a
sentence to she saw the city?
And it's all through
these rewriting rules.
So the sentence is a noun
phrase and a verb phrase.
A verb phrase could be a verb and
a noun phrase, so on and so forth,
where you can imagine
taking this structure
and figuring out how it is that
you could generate a parse tree--
a syntax tree-- for that set of
terminal symbols, that set of words.
And if you tried to do this for a
sentence that was not grammatical,
something like "saw the city
she," well, that wouldn't work.
There'd be no way to
take that sentence and use
these rules to be able to
generate it, because that sentence
is not inside of that language.
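These rewriting rules can also be used generatively. Here's a minimal, illustrative Python sketch (the rule table and the derive function are my own illustration, not part of NLTK) that repeatedly rewrites the leftmost non-terminal until only terminal words remain:

```python
import random

# Toy context-free grammar: each non-terminal maps to its
# possible expansions (rewriting rules from the lecture).
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"], ["walked"]],
}

def derive(symbols):
    """Rewrite the leftmost non-terminal until only terminals remain."""
    for i, symbol in enumerate(symbols):
        if symbol in RULES:
            expansion = random.choice(RULES[symbol])
            return derive(symbols[:i] + expansion + symbols[i + 1:])
    return symbols  # no non-terminals left: a finished sentence

print(" ".join(derive(["S"])))  # e.g. "she saw the city"
```

Each run may produce a different sentence; the rules guarantee the structure is grammatical even when the meaning is odd (something like "the car walked a city" is possible).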
So this sort of model
can be very helpful
if the rules are expressive
enough to express
all the ideas that you might want to
express inside of natural language.
Of course, using just the
simple rules we have here,
there are many sentences that we
won't be able to generate-- sentences
that we might agree are grammatical
and syntactically well-formed,
but that we're not going to be able
to construct using these rules.
And then, in that case,
we might just need
to have some more complex rules in
order to deal with those sorts of cases.
And so this type of
approach can be powerful
if you're dealing with a
limited set of rules and words
that you really care about dealing with.
And one way we can actually
interact with this in Python
is by using a Python library called
NLTK, short for natural language
toolkit, which we'll see
a couple of times today,
which has a wide variety of
different functions and classes
that we can take
advantage of that are all
meant to deal with natural language.
And one such algorithm that
it has is the ability to parse
a context-free grammar, to
be able to take some words
and figure out according to
some context-free grammar,
how would you construct
the syntax tree for it?
So let's go ahead and
take a look at NLTK
now by examining how we might construct
some context-free grammars with it.
So here inside of cfg0--
cfg's short for context-free grammar--
I have a sample context-free grammar
which has rules that we've seen before.
So sentence goes to noun phrase
followed by a verb phrase.
Noun phrase is either a
determiner and a noun or a noun.
Verb phrase is either a verb
or a verb and a noun phrase.
The order of these things
doesn't really matter.
Determiners could be the
word the or the word a.
A noun could be the
word she, city, or car.
And a verb could be the word saw
or it could be the word walked.
Now, using NLTK, which I've
imported here at the top,
I'm going to go ahead
and parse this grammar
and save it inside of this
variable called parser.
Next, my program is going
to ask the user for input.
Just type in a sentence,
and dot split will just
split it on all of the
spaces, so I end up
getting each of the individual words.
We're going to save that inside
of this list called sentence.
And then we'll go ahead and try to parse
the sentence, and for each sentence
we parse, we're going to
pretty print it to the screen,
just so it displays in my terminal.
And we're also going to draw it.
It turns out that NLTK has
some graphics capacity,
so we can really visually see
what that tree looks like as well.
And there are multiple different
ways a sentence might be parsed,
which is why we're putting
it inside of this for loop.
And we'll see why that can
be helpful in a moment too.
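As a rough sketch, the cfg0 program described here might look something like this (the actual file's wording may differ slightly; the sentence is hardcoded rather than read from input, for brevity):

```python
import nltk

# Context-free grammar with the rules described above
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

def parse(sentence):
    """Return all syntax trees the grammar allows for a sentence."""
    return list(parser.parse(sentence.split()))

# Pretty-print each possible tree; tree.draw() would additionally
# open the graphical view seen in lecture.
for tree in parse("she saw the city"):
    tree.pretty_print()
```

An ungrammatical ordering of the same words, like "saw the city she", simply yields no trees at all.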
All right, now that I have
that, let's go ahead and try it.
I'll cd into cfg, and we'll
go ahead and run cfg0.
So it then is going to prompt
me to type in a sentence.
And let me type in a very
simple sentence-- something
like she walked, for example.
Press Return.
So what I get is, on
the left-hand side, you
can see a text-based
representation of the syntax tree.
And on the right side here-- let
me go ahead and make it bigger--
we see a visual representation
of that same syntax tree.
This is how it is that my computer has
now parsed the sentence she walked.
It's a sentence that consists of
a noun phrase and a verb phrase,
where each phrase is just a single
noun or verb, she and then walked--
same type of structure
we've seen before,
but this now is our
computer able to understand
the structure of the
sentence, to be able to get
some sort of structural understanding
of how it is that parts of the sentence
relate to each other.
Let me now give it another sentence.
I could try something like she
saw the city, for example--
the words we were dealing
with a moment ago.
And then we end up getting
this syntax tree out of it--
again, a sentence that has a
noun phrase and a verb phrase.
The noun phrase is fairly simple.
It's just she.
But the verb phrase is more complex.
It is now saw the city, for example.
Let's do one more with this grammar.
Let's do something like she saw a car.
And that is going to look very similar--
that we also get she.
But our verb phrase is now different.
It's saw a car, because there
are multiple possible determiners
in our language and
multiple possible nouns.
I haven't given this grammar
rule that many words,
but if I gave it a larger
vocabulary, it would then
be able to understand more and
more different types of sentences.
And just to give you a sense of some
added complexity we could add here,
the more complex our grammar,
the more rules we add,
the more different
types of sentences we'll
then have the ability to generate.
So let's take a look
at cfg1, for example,
where I've added a whole number
of other different types of rules.
I've added the adjective phrases, where
we can have multiple adjectives inside
of a noun phrase as well.
So a noun phrase could be an adjective
phrase followed by a noun phrase.
If I wanted to say
something like the big city,
that's an adjective phrase
followed by a noun phrase.
Or we could also have a noun
and a prepositional phrase--
so the car on the street, for example.
On the street is a
prepositional phrase, and we
might want to combine those two ideas
together, because the car on the street
can still operate as something
kind of like a noun phrase as well.
So no need to understand all of
these rules in too much detail--
it starts to get into the
nature of English grammar--
but now we have a more complex way of
understanding these types of sentences.
So if I run Python cfg1--
and I can try typing something like
she saw the wide street, for example--
a more complex sentence.
And if we make that larger, you can
see what this sentence looks like.
I'll go ahead and
shrink it a little bit.
So now we have a sentence like
this-- she saw the wide street.
The wide street is one
entire noun phrase,
saw the wide street is
an entire verb phrase,
and she saw the wide street ends
up forming that entire sentence.
So let's take a look at one more example
to introduce this notion of ambiguity.
So I can run Python cfg1.
Let me type a sentence like
she saw a dog with binoculars.
So there's a sentence, and here
now is one possible syntax tree
to represent this idea--
she saw, the noun phrase a dog,
and then the prepositional phrase
with binoculars.
And the way to interpret the sentence is
that what it is that she saw was a dog.
And how did she do the seeing?
She did the seeing with binoculars.
And so this is one possible
way to interpret this.
She was using binoculars.
Using those binoculars, she saw a dog.
But another possible way
to parse that sentence
would be with this tree over
here, where you have something
like she saw a dog with binoculars,
where a dog with binoculars
forms an entire noun phrase of its own--
same words in the same order, but
a different grammatical structure,
where now we have a dog with binoculars
all inside of this noun phrase,
meaning what did she see?
What she saw was a dog,
and that dog happened
to have binoculars with it-- so
different ways to parse the sentence--
structures for the sentence-- even given
the same possible sequence of words.
And NLTK's algorithm-- this
particular parsing algorithm--
has the ability to find
all of these, to be
able to understand the
different ways that you might
be able to parse a sentence and be able
to extract some sort of useful meaning
out of that sentence as well.
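That ambiguity can be seen directly in code. With a small grammar that lets a prepositional phrase attach either to a noun phrase or to a verb phrase (this grammar is a simplified stand-in for cfg1, not its actual contents), the parser finds both structures:

```python
import nltk

# Simplified grammar with prepositional-phrase attachment ambiguity
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> N | D N | NP PP
    PP -> P NP
    VP -> V NP | VP PP

    D -> "a" | "the"
    N -> "she" | "dog" | "binoculars"
    P -> "with"
    V -> "saw"
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse("she saw a dog with binoculars".split()))

# Two distinct trees: "with binoculars" attaches either to the
# verb phrase (she used binoculars to see) or to the noun phrase
# (the dog has the binoculars).
print(len(trees))  # 2
```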
So that then is a brief
look at what we can do
with the structure of language,
using these context-free grammar
rules to be able to describe
the structure of language.
But what we might also
care about is understanding
how it is that these
sequences of words are
likely to relate to each other in
terms of the actual words themselves.
The grammar that we saw before could
allow us to generate a sentence like,
I ate a banana, for example, where I
is the noun phrase and ate a banana
is a verb phrase.
But it would also allow
for sentences like, I
ate a blue car, for example, which
is also syntactically well-formed
according to the rules, but is probably
a less likely sentence for a person
to speak.
And we might want for our
AI to be able to encapsulate
the idea that certain sequences of words
are more or less likely than others.
So to deal with that, we'll
introduce the notion of an n-gram,
and an n-gram, more generally,
just refers to some sequence
of n items inside of our text.
And those items might take
various different forms.
We can have character n-grams,
which are just a contiguous
sequence of n characters--
so three characters in a row,
for example, or four
characters in a row.
We can also have word n-grams,
which are a contiguous
sequence of n words in a row
from a particular sample of text.
And these end up proving
quite useful, and you
can choose n to decide
how long our sequence is going to be.
So when n is 1, we're just looking at
a single word or a single character.
And that is what we might
call a unigram, just one item.
If we're looking at two
characters or two words,
that's generally called
a bigram-- so an n-gram
where n is equal to 2, looking at
two words that are consecutive.
And then, if there are
three items, you might
imagine we'll often call those
trigrams-- so three characters
in a row or three words that happen
to be in a contiguous sequence.
And so if we took a
sentence, for example--
here's a sentence from,
again, Sherlock Holmes--
"how often have I said
to you that, when you
have eliminated the
impossible, whatever remains,
however improbable, must be the truth."
What are the trigrams that we
can extract from the sentence?
If we're looking at
sequences of three words,
well, the first trigram
would be how often
have-- just a sequence of three words.
And then we can look
at the next trigram,
often have I. The next
trigram is have I said.
Then I said to, said to you,
to you that, for example--
those are all trigrams of words,
sequences of three contiguous words
that show up in the text.
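As a quick sketch of that extraction, not the course's own code, trigrams can be pulled out with nothing but the standard library by zipping a word list against shifted copies of itself:

```python
# Extract word trigrams from the Sherlock Holmes sentence using only the
# standard library; punctuation is stripped from token edges for simplicity.
import string

sentence = ("how often have I said to you that, when you have eliminated "
            "the impossible, whatever remains, however improbable, "
            "must be the truth")
words = [w.strip(string.punctuation) for w in sentence.split()]

# Zip the list against itself offset by one and two positions to form
# every contiguous triple of words.
trigrams = list(zip(words, words[1:], words[2:]))

print(trigrams[:3])
# [('how', 'often', 'have'), ('often', 'have', 'I'), ('have', 'I', 'said')]
```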
And extracting those bigrams and
trigrams, or n-grams more generally,
turns out to be quite
helpful, because often,
when we're dealing with
analyzing a lot of text,
it's not going to be particularly
meaningful for us to try
and analyze the entire text at one time.
But instead, we want to segment
that text into pieces that we
can begin to do some analysis of--
that our AI might never have
seen this entire sentence before,
but it's probably seen the
trigram to you that before,
because to you that is something that
might have come up in other documents
that our AI has seen before.
And therefore, it knows a little
bit about that particular sequence
of three words in a row-- or
something like have I said,
another example of another sequence
of three words that's probably
quite popular, in terms of where you
see it inside the English language.
So we'd like some way to be able
to extract these sorts of n-grams.
And how do we do that?
How do we extract
sequences of three words?
Well, we need to take our
input and somehow separate it
into all of the individual words.
And this is a process generally
known as tokenization,
the task of splitting up some
sequence into distinct pieces,
where we call those pieces tokens.
Most commonly, this refers to
something like word tokenization.
I have some sequence of text
and I want to split it up
into all of the words
that show up in that text.
But it might also come up
in the context of something
like sentence tokenization.
I have a long sequence of text
and I'd like to split it up
into sentences, for example.
And so how might word tokenization
work, the task of splitting up
our sequence of characters into words?
Well, we've also already seen this idea.
We saw word tokenization
just a moment ago, when I
took an input sequence and just called
Python's split method on it, where
the split method took
that sequence of words
and just separated it based on
where the spaces showed up.
And so if I had a sentence like,
whatever remains, however improbable,
must be the truth, how
would I tokenize this?
Well, the naive approach is just
to say, anytime you see a space,
go ahead and split it up.
We're going to split up this particular
string just by looking for spaces.
And what we get when we do
that is a sentence like this--
whatever remains, however
improbable, must be the truth.
But what you'll notice here is
that, if we just split things
up in terms of where the spaces are, we
end up keeping the punctuation around.
There's a comma after the word remains.
There's a comma after
improbable, a period after truth.
And this poses a little
bit of a challenge, when
we think about trying to tokenize
things into individual words,
because if you're comparing
words to each other, this word
truth with a period after it--
if you just string
compare it, it's going
to be different from the word
truth without a period after it.
And so this punctuation can
sometimes pose a problem for us,
and so we might want some way of dealing
with it-- either treating punctuation
as a separate token altogether or maybe
removing that punctuation entirely
from our sequence as well.
So that might be
something we want to do.
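One simple way to handle that, sketched here with only the standard library rather than a full tokenizer, is to strip punctuation from the edges of each whitespace-separated token:

```python
import string

text = "whatever remains, however improbable, must be the truth."

# Naive tokenization: splitting on whitespace keeps punctuation attached,
# so 'truth.' and 'truth' would compare as different words.
naive = text.split()
print(naive)
# ['whatever', 'remains,', 'however', 'improbable,', 'must', 'be', 'the', 'truth.']

# Stripping punctuation from the edges of each token is one simple fix.
tokens = [word.strip(string.punctuation) for word in naive]
print(tokens)
# ['whatever', 'remains', 'however', 'improbable', 'must', 'be', 'the', 'truth']
```

Note that edge-stripping alone sidesteps, rather than answers, the harder questions about apostrophes and hyphenated words that come up next.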
But there are other cases where it
becomes a little bit less clear.
If I said something like,
just before 9 o'clock,
Sherlock Holmes stepped
briskly into the room,
well, this apostrophe after 9 o'clock--
after the O in 9 o'clock-- is
that something we should remove?
Should we split based on that
as well, into o and clock?
There's some interesting
questions there too.
And it gets even trickier if you begin
to think about hyphenated words--
something like this, where we
have a whole bunch of words
that are hyphenated and then you
need to make a judgment call.
Is that a place where you're
going to split things apart
into individual words, or are you going
to consider frock-coat, and well-cut,
and pearl-grey to be
individual words of their own?
And so those tend to pose challenges
that we need to somehow deal with
and something we need to
decide as we go about trying
to perform this kind of analysis.
Similar challenges arise when it comes
to the world of sentence tokenization.
Imagine this sequence of
sentences, for example.
If you take a look at this
particular sequence of sentences,
you could probably imagine you could
extract the sentences pretty readily.
Here is one sentence and
here is a second sentence,
so we have two different sentences
inside of this particular passage.
And the distinguishing feature
seems to be the period--
that a period separates
one sentence from another.
And maybe there are other
types of punctuation
you might include here as well--
an exclamation point, for
example, or a question mark.
But those are the types of
punctuation that we know
tend to come at the end of sentences.
But it gets trickier again if you look
at a sentence like this-- not
talking to Sherlock, but
instead talking to Mr. Holmes.
Well now, we have a
period at the end of Mr.
And so if you were just
separating on periods,
you might imagine this
would be a sentence,
and then just Holmes
would be a sentence,
and then we'd have a
third sentence down below.
Things do get a little
bit trickier as you start
to imagine these sorts of situations.
And dialogue, too, starts to
make this trickier--
when you have quoted lines
attached to something like
he said, for example--
that he said this
particular sequence of words
and then that particular
sequence of words.
There are interesting
challenges that arise there too,
in terms of how it is
that we take the sentence
and split it up into
individual sentences as well.
And these are just things that
our algorithm needs to decide.
In practice, there are usually some
heuristics that we can use.
We know that certain
occurrences of periods,
like the period after Mr.,
do not mark the
beginning of a new sentence,
and so we can encode
those rules into our AI
to allow it to be able to
do this tokenization the way
that we want it to.
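A toy version of that heuristic might look like the sketch below, where the abbreviation list is a small hand-picked assumption, not a complete rule set:

```python
import re

# Assumed, hand-picked list of abbreviations that end with a period but
# do not end a sentence.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def naive_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace,
    unless the chunk so far ends with a known abbreviation."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip() if buffer else part
        # Keep accumulating if the chunk ends with something like 'Mr.'
        if not any(buffer.endswith(abbr) for abbr in ABBREVIATIONS):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(naive_sentences("He was talking to Mr. Holmes. Then he left."))
# ['He was talking to Mr. Holmes.', 'Then he left.']
```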
So once we have this ability to
tokenize a particular passage--
take the passage, split it
up into individual words--
from there, we can begin to extract
what the n-grams actually are.
So we can actually take
a look at this by going
into a Python program that will
serve the purpose of extracting
these n-grams.
And again, we can use NLTK, the
Natural Language Toolkit, in order
to help us here.
So I'll go ahead and go into ngrams
and we'll take a look at ngrams.py.
And what we have here
is we are going to take
some corpus of text, just
some sequence of documents,
and use all those documents and extract
what the most popular n-grams happen
to be.
So in order to do so, we're going to
go ahead and load data from a directory
that we specify in the
command line argument.
We'll also take in a number
n as a command line argument,
which determines how many words
we're going to look at in sequence.
Then we're going to go ahead and
just count up all of the nltk.ngrams.
So we're going to look at all of
the grams across this entire corpus
and save it inside this variable ngrams.
And then we're going to
look at the most common ones
and go ahead and print them out.
And so in order to do so,
I'm not only using NLTK--
I'm also using Counter, which is built
into Python as well, where I can just
count up, how many times do these
various different grams appear?
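The script itself leans on NLTK, but the core counting step can be sketched with the standard library alone. The corpus string below is a stand-in for illustration, not the course's Holmes files:

```python
from collections import Counter

# Stand-in corpus: in ngrams.py this text would be loaded from the files
# given as a command-line argument.
corpus = "of the of the in the in the it was of the"
n = 2  # bigrams

words = corpus.split()

# Build every contiguous sequence of n words, then count occurrences.
ngrams = Counter(zip(*[words[i:] for i in range(n)]))

for gram, count in ngrams.most_common(3):
    print(gram, count)
```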
So we'll go ahead and show that.
We'll go into ngrams, and I'll
say something like python ngrams--
and let's just first look
for the unigrams, sequences
of one word inside of a corpus.
And the corpus that
I've prepared is I have
all of the-- or some of these
stories from Sherlock Holmes
all here, where each one is just
one of the Sherlock Holmes stories.
And so I have a whole bunch of
text here inside of this corpus,
and I'll go ahead and provide that
corpus as a command line argument.
And now what my program
is going to do is
it's going to load all of the
Sherlock Holmes stories into memory--
or all the ones that I've
provided in this corpus at least--
and it's just going to look
for the most popular unigrams,
the most popular sequences of one word.
And it seems the most popular one is
just the word the, used 9,700 times;
followed by I, used 5,000 times;
and, used about 5,000 times--
the kinds of words you might expect.
So now let's go ahead and check for
bigrams, for example, ngrams 2, holmes.
All right, again, sequences of two
words now that appear multiple times--
of the, in the, it was, to the, it
is, I have-- so on and so forth.
These are the types
of bigrams that happen
to come up quite often inside this
corpus, the Sherlock
Holmes stories.
And it probably is true
across other corpora as well,
but we could only find out
if we actually tested it.
And now, just for good
measure, let's try
one more-- maybe try three, looking now
for trigrams that happen to show up.
And now we get it was the, one
of the, I think that, out of the.
These are sequences of
three words now that
happen to come up multiple times
across this particular corpus.
So what are the
potential use cases here?
Now we have some sort of data.
We have data about how often
particular sequences of words
show up in this particular
order, and using that,
we can begin to do some
sort of predictions.
We might be able to say that, if
you see the words it was,
there's a reasonable
chance the word that
comes after it should be the word a.
And if I see the words one of,
it's reasonable to imagine
that the next word might be
the word the, for example,
because we have this data about
trigrams, sequences of three words
and how often they come up.
And now, based on two
words, you might be
able to predict what the
third word happens to be.
And one model we can use for that is
a model we've actually seen before.
It's the Markov model.
Recall again that the
Markov model really
just refers to some sequence
of events that happen one time
step after another,
where every unit has some ability
to predict what the next
unit is going to be--
or maybe the past two units predict
what the next unit is going to be,
or the past three predict what
the next one is going to be.
And we can use a Markov
model and apply it
to language for a very
naive and simple approach
at trying to generate natural
language, at getting our AI
to be able to speak English-like text.
And the way it's going to work is we're
going to say something like, come up
with some probability distribution.
Given these two words,
what is the probability
distribution over what the
third word could possibly
be based on all the data?
If you see it was, what are the
possible third words we might have?
How often do they come up?
And using that information,
we can try and construct
what we expect the third word to be.
And if you keep doing
this, the effect is
that our Markov model
can effectively start
to generate text-- can be
able to generate text that
was not in the original
corpus, but that sounds
kind of like the original corpus.
It's using the same sorts of rules
that the original corpus was using.
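A minimal sketch of that generation idea, using only the standard library and a tiny stand-in training text rather than the Shakespeare corpus or the course's generator:

```python
import random
from collections import defaultdict

# Tiny stand-in training text, invented for illustration.
text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
words = text.split()

# Map each pair of consecutive words to the list of words that followed it.
# Repeated followers appear multiple times, so sampling from the list
# reflects how often each continuation occurred.
model = defaultdict(list)
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    model[(w1, w2)].append(w3)

# Start from a bigram and repeatedly sample a plausible next word.
random.seed(0)
w1, w2 = "it", "was"
output = [w1, w2]
for _ in range(8):
    if (w1, w2) not in model:
        break  # dead end: this bigram never appeared in the training text
    w3 = random.choice(model[(w1, w2)])
    output.append(w3)
    w1, w2 = w2, w3

print(" ".join(output))
```

Every consecutive triple in the output is a trigram that really occurred in the training text, which is why the result sounds like the source even when the full sentence never appeared in it.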
So let's take a look
at an example of that
as well, where here now, I have
another corpus that I have here,
and it is the corpus of all of
the works of William Shakespeare.
So I've got a whole bunch of stories
from Shakespeare, and all of them
are just inside of this big text file.
And so what I might like to do is
look at what all of the n-grams are--
maybe look at all the trigrams
inside of shakespeare.txt--
and figure out, given
two words, can I predict
what the third word is likely to be?
And then just keep
repeating this process--
I have two words--
predict the third word; then,
from the second and third word,
predict the fourth word; and
from the third and fourth word,
predict the fifth word, ultimately
generating random sentences that
sound like Shakespeare, that are
using similar patterns of words
that Shakespeare used, but that never
actually showed up in Shakespeare
itself.
And so to do so, I'll
show you generator.py,
which, again, is just going to
read data from a particular file.
And I'm using a Python library
called markovify, which is just
going to do this process for me.
So there are libraries out here that
can just train on a bunch of text
and come up with a Markov
model based on that text.
And I'm going to go
ahead and just generate
five randomly generated sentences.
So we'll go ahead and go in to markov.
I'll run the generator
on shakespeare.txt.
What we'll see is it's going to load
that data, and then here's what we get.
We get five different
sentences, and these
are sentences that never showed
up in any Shakespeare play,
but that are designed to
sound like Shakespeare,
that are designed to just
take two words and predict,
given those two words, what would
Shakespeare have been likely to choose
as the third word that follows it.
And you know, these sentences
probably don't have any meaning.
It's not like the AI is trying to
express any sort of underlying meaning
here.
It's just trying to understand,
based on the sequence
of words, what is likely to come
after it as a next word, for example.
And these are the types of sentences
that it's able to come up with,
just generating.
And if you ran this multiple times, you
would end up getting different results.
I could run this again and
get an entirely different set
of five different
sentences that also are
supposed to sound kind of like the way
that Shakespeare's sentences sounded
as well.
And so that then was
a look at how it is we
can use Markov models to be able to
naively attempt generating language.
The language doesn't mean
a whole lot right now.
You wouldn't want to use the
system in its current form
to do something like
machine translation,
because it wouldn't be able
to encapsulate any meaning,
but we're starting to see now that
our AI is getting a little bit better
at trying to speak our
language, at trying
to be able to process natural language
in some sort of meaningful way.
So we'll now take a look
at a couple of other tasks
that we might want our
AI to be able to perform.
And one such task is text
categorization, which really is just
a classification problem.
And we've talked about
classification problems already,
these problems where we would
like to take some object
and categorize it into a
number of different classes.
And so the way this comes up in text
is anytime you have some sample of text
and you want to put it inside of a
category, where I want to say something
like, given an email, does it belong
in the inbox or does it belong in spam?
Which of these two
categories does it belong in?
And you do that by looking
at the text and being
able to do some sort of analysis on that
text to be able to draw conclusions,
to be able to say that, given the
words that show up in the email,
I think this is probably
belonging in the inbox,
or I think it probably
belongs in spam instead.
And you might imagine
doing this for a number
of different types of classification
problems of this sort.
So you might imagine that another
common example of this type of idea
is something like sentiment
analysis, where I want to analyze,
given a sample of text, does
it have a positive sentiment
or does it have a negative sentiment?
And this might come up in the case
of product reviews on a website,
for example, or feedback on a website,
where you have a whole bunch of data--
samples of text that are
provided by users of a website--
and you want to be able to quickly
analyze, are these reviews positive,
are the reviews negative,
what is it that people
are saying, just to get a sense for
what it is that people are saying,
to be able to categorize text into
one of these two different categories.
So how might we approach this problem?
Well, let's take a look at
some sample product reviews.
Here are some sample product reviews
that we might come up with.
My grandson loved it.
So much fun.
Product broke after a few days.
One of the best games I've
played in a long time.
Kind of cheap and flimsy.
Not worth it.
Different product reviews that you
might imagine seeing on Amazon, or eBay,
or some other website where people
are selling products, for instance.
And we humans can pretty
easily categorize these
into positive sentiment
or negative sentiment.
We'd probably say that the
first and the third one, those
are positive sentiment messages.
The second one and the fourth
one, those are probably
negative sentiment messages.
But how could a computer
do the same thing?
How could it try and take these
reviews and assess, are they positive
or are they negative?
Well, ultimately, it
depends upon the words
that happen to be
in these particular reviews-- inside
of these particular sentences.
For now, we're going to
ignore the structure
and how the words are
related to each other,
and we're just going to focus
on what the words actually are.
So there are probably some key
words here, words like loved,
and fun, and best.
Those probably show up in more
positive reviews, whereas words
like broke, and cheap, and flimsy--
well, those are words
that probably are more
likely to come up inside of negative
reviews, instead of positive reviews.
So one way to approach this
sort of text analysis idea
is to say, let's, for now, ignore the
structures of these sentences-- to say,
we're not going to care about how it
is the words relate to each other.
We're not going to try and parse
these sentences to construct
the grammatical structure
like we saw a moment ago.
But we can probably just rely
on the words that were actually
used-- rely on the fact that
the positive reviews are
more likely to have words
like best, and loved, and fun,
and that the negative reviews are
more likely to have the negative words
that we've highlighted there as well.
And this sort of model-- this approach
to trying to think about language--
is generally known as
the bag of words model,
where we're going to model a sample of
text not by caring about its structure,
but just by caring about the
unordered collection of words that
show up inside of a sample--
that all we care about
is what words are in the text.
And we don't care about what
the order of those words is.
We don't care about the
structure of the words.
We don't care what noun
goes with what adjective
or how things agree with each other.
We just care about the words.
And it turns out this
approach tends to work
pretty well for doing classifications
like positive sentiment
or negative sentiment.
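In code, a bag of words is just an unordered collection of token counts. Python's Counter captures exactly that; this is an illustration, not the course's implementation:

```python
from collections import Counter

review = "my grandson loved it so much fun"

# The bag-of-words representation: word counts, with no order and
# no grammatical structure.
bag = Counter(review.split())
print(bag)

# Two samples with the same words in a different order get the same bag,
# which is precisely the information the model throws away.
reordered = Counter("so much fun my grandson loved it".split())
print(bag == reordered)  # True
```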
And you could imagine doing
this in a number of ways.
We've talked about different approaches
to trying to solve classification style
problems, but when it
comes to natural language,
one of the most popular approaches
is that naive Bayes approach.
And this is one approach to trying to
analyze the probability that something
is positive sentiment
or negative sentiment,
or just trying to categorize
some text into possible categories.
And it doesn't just work for text--
it works for other types of ideas
as well-- but it is quite
popular in the world
of analyzing text and natural language.
And the naive Bayes approach
is based on Bayes' rule, which
you might recall back from when
we talked about probability,
that Bayes' rule looks like this--
that the probability of
some event b, given a,
can be expressed using
this expression over here.
Probability of b given a is the
probability of a given b multiplied
by the probability of b divided
by the probability of a.
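Written symbolically, that is:

```latex
P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}
```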
And we saw that this came about
as a result of just the definition
of conditional probability and
looking at what it means for two events
to happen together.
This was our formulation
then of Bayes' rule, which
turned out to be quite helpful.
We were able to predict one
event in terms of another
by flipping the order of those events
inside of this probability calculation.
And it turns out this approach
is going to be quite helpful--
and we'll see why in a moment--
for being able to do this
sort of sentiment analysis,
because I want to say, you
know, what is the probability
that a message is positive,
or what is the probability
that the message is negative?
And I'll go ahead and represent
these just using the emojis for
simplicity-- probability of
positive, probability of negative.
And that is what I
would like to calculate,
but I'd like to calculate
that given some information--
given information like
here is a sample of text--
my grandson loved it.
And I would like to know not just what
is the probability that any message is
positive, but what is the probability
that the message is positive,
given my grandson loved it
as the text of the sample?
So given this information that inside
the sample are the words my grandson
loved it, what is the probability
then that this is a positive message?
Well, according to the bag of
words model, what we're going to do
is really ignore the
ordering of the words--
not treat this as a single sentence
that has some structure to it,
but just treat it as a whole
bunch of different words.
We're going to say something
like, what is the probability
that this is a positive
message, given that the word my
was in the message, given that the
word grandson was in the message,
given that the word loved was in
the message, and given that the word it
was in the message?
The bag of words model here--
we're treating the entire
sample as just a whole bunch
of different words.
And so this then is what I'd like
to calculate, this probability--
given all those words,
what is the probability
that this is a positive message?
And this is where we can
now apply Bayes' rule.
This is really the probability
of some b, given some a.
And that now is what
I'd like to calculate.
So according to Bayes' rule, this
whole expression is equal to--
well, it's the probability--
I switched the order of them--
it's the probability
of all of these words,
given that it's a positive
message, multiplied
by the probability that is
the positive message divided
by the probability of
all of those words.
So this then is just an
application of Bayes' rule.
We've already seen where I want to
express the probability of positive,
given the words, as related to
somehow the probability of the words,
given that it's a positive message.
And it turns out that--
as you might recall, back
when we talked about probability,
that this denominator is
going to be the same.
Regardless of whether we're looking
at positive or negative messages,
the probability of these
words doesn't change,
because we don't have a
positive or negative down below.
So we can just say that,
rather than just say
that this expression up here is
equal to this expression down below,
it's really just proportional
to just the numerator.
We can ignore the denominator for now.
Using the denominator would
get us an exact probability.
But it turns out that
what we'll really just do
is figure out what the probability
is proportional to, and at the end,
we'll have to normalize the
probability distribution-- make
sure the probability distribution
ultimately sums up to the number 1.
So now I've been able to
formulate this probability--
which is what I want to care about--
as proportional to multiplying these two
things together-- probability of words,
given positive message, multiplied by
the probability of positive message.
But again, if you think back
to our probability rules,
we can calculate this really as just
a joint probability of all of these
things happening-- that the probability
of positive message multiplied
by the probability of these words,
given the positive message--
well, that's just the joint
probability of all of these things.
This is the same thing
as the probability
that it's a positive message, and my
is in the sample,
and grandson is in the sample,
and loved is in the sample,
and it is in the sample.
So using that rule for the
definition of joint probability,
I've been able to say that
this entire expression is now
proportional to this sequence--
this joint probability of these
words and this positive that's
in there as well.
And so now the interesting
question is just how
to calculate that joint probability.
How do I figure out
the probability that,
given some arbitrary message, that it is
positive, and the word my is in there,
and the word grandson is in there,
and the word loved is in there,
and the word it is in there?
Well, you'll recall that we can
calculate a joint probability
by multiplying together all of
these conditional probabilities.
If I want to know the
probability of a, and b, and c,
I can calculate that as
the probability of a times
the probability of b, given a, times
the probability of c, given a and b.
I can just multiply these
conditional probabilities together
in order to get the overall joint
probability that I care about.
And we could do the same thing here.
I could say, let's
multiply the probability
of positive by the probability of the
word my showing up in the message,
given that it's positive, multiplied
by the probability of grandson
showing up in the message, given
that the word my is in there
and that it's positive, multiplied
by the probability of loved,
given these three things,
multiplied by the probability of it,
given these four things.
And that's going to end up being a
fairly complex calculation to make,
one that we probably
aren't going to have
a good way of knowing the answer to.
What is the probability that
grandson is in the message, given
that it is positive and the
word my is in the message?
That's not something we're really
going to have a readily easy answer to,
and so this is where the naive
part of naive Bayes comes about.
We're going to simplify this notion.
Rather than compute exactly what
that probability distribution is,
we're going to assume
that these words are
going to be effectively
independent of each other,
if we know that it's
already a positive message.
If it's a positive message, it
doesn't change the probability
that the word grandson
is in the message,
if I know that the word loved
is in the message, for example.
And that might not necessarily
be true in practice.
In the real world, it
might not be the case
that these words are
actually independent,
but we're going to assume
it to simplify our model.
And it turns out that
simplification still
lets us get pretty good
results out of it as well.
And what we're going to assume is that
the probability that all of these words
show up depend only on whether
it's positive or negative.
I can still say that loved
is more likely to come up
in a positive message than a negative
message, which is probably true,
but we're also going to say that it's
not going to change whether or not
loved is more likely or less likely to
come up if I know that the word my is
in the message, for example.
And so those are the assumptions
that we're going to make.
So while the top expression is
proportional to this bottom expression,
we're going to say it's naively
proportional to this expression,
probability of being a positive message.
And then, for each of the words
that show up in the sample,
I'm going to multiply what's
the probability that my
is in the message, given
that it's positive,
times the probability of grandson
being in the message, given
that it's positive-- and
then so on and so forth
for the other words that happen
to be inside of the sample.
And it turns out that these are
numbers that we can calculate.
The reason we've done all of this
math is to get to this point,
to be able to calculate this probability
distribution that we care about,
given these terms that we
can actually calculate.
And we can calculate them,
given some data available to us.
And this is what a lot of
natural language processing
is about these days.
It's about analyzing data.
If I give you a whole bunch of
data with a whole bunch of reviews,
and I've labeled them
as positive or negative,
then you can begin to calculate
these particular terms.
I can calculate the probability
that a message is positive just
by looking at my data
and saying, how many
positive samples were there, and divide
that by the number of total samples.
That is my probability
that a message is positive.
What is the probability that the
word loved is in the message, given
that it's positive?
Well, I can calculate
that based on my data too.
Let me just look at how many positive
samples have the word loved in it
and divide that by my total
number of positive samples.
And that will give me
an approximation for,
what is the probability that loved is
going to show up inside of the review,
given that we know that
the review is positive.
And so this then allows us to be able
to calculate these probabilities.
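Those counting rules can be sketched over a tiny labeled dataset. The reviews and labels below are invented for illustration, not drawn from any real data:

```python
# Estimate naive Bayes terms by counting over a toy labeled dataset.
reviews = [
    ("my grandson loved it", "positive"),
    ("one of the best games", "positive"),
    ("product broke after a few days", "negative"),
    ("kind of cheap and flimsy not worth it", "negative"),
]

positives = [text for text, label in reviews if label == "positive"]

# P(positive) = positive samples / total samples
p_positive = len(positives) / len(reviews)

# P("loved" | positive) = positive samples containing "loved"
#                         / total positive samples
p_loved_given_positive = (
    sum("loved" in text.split() for text in positives) / len(positives)
)

print(p_positive)              # 0.5
print(p_loved_given_positive)  # 0.5
```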
So let's now actually
do this calculation.
Let's calculate for the
sentence, my grandson loved it.
Is it a positive or negative review?
How could we figure out
those probabilities?
Well, again, this up here is the
expression we're trying to calculate.
And I'll give you the
data that is available to us.
And the way to interpret
this data in this case
is that, of all of the messages, 49%
of them were positive and 51% of them
were negative.
Maybe online reviews tend to be a
little bit more negative than they
are positive-- or at least
based on this particular data
sample, that's what I have.
And then I have distributions for
each of the various different words--
that, given that it's
a positive message,
how many positive messages
had the word my in them?
It's about 30%.
And for negative messages, how many
of those had the word my in them?
About 20%-- so it seems like the word
my comes up more often in positive
messages-- at least slightly more
often based on this analysis here.
Grandson, for example--
maybe that showed up
in 1% of all positive messages
and 2% of all
negative messages.
The word loved showed up in 32%
of all positive messages, 8%
of all negative messages, for example.
And then the word it showed up in
30% of positive messages,
40% of negative messages-- again, just
arbitrary data here just for example,
but now we have data with which we can
begin to calculate this expression.
So how do I calculate multiplying
all these values together?
Well, it's just going to
be multiplying probability
that it's positive times the
probability of my, given positive,
times the probability of
grandson, given positive--
so on and so forth for
each of the other words.
And if you do that multiplication and
multiply all of those values together,
you get this, 0.00014112.
By itself, this is not
a meaningful number,
but it's going to be meaningful
if you compared this expression--
the probability that it's positive times
the probability of all of the words,
given that I know that
the message is positive,
and compare it to the same thing,
but for negative sentiment messages
instead.
I want to know the probability
that it's a negative message
times the probability
of all of these words,
given that it's a negative message.
And so how can I do that?
Well, to do that, you just multiply
probability of negative times
all of these conditional probabilities.
And if I take those five values,
multiply all of them together,
then what I get is this value
for negative, 0.00006528--
again, in isolation, not a
particularly meaningful number.
What is meaningful is treating these
two values as a probability distribution
and normalizing them, making it so
that both of these values sum up to 1,
the way a probability
distribution should.
And we do so by adding these two up
and then dividing each of these values
by their total in order to
be able to normalize them.
And when we do that, when we normalize
this probability distribution,
you end up getting something like
this, positive 0.6837, negative 0.3163.
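The arithmetic just described can be sketched in a few lines of Python, using the probabilities quoted in this example (all of these numbers are the made-up figures from the example, not real data):

```python
# Prior probabilities and, for each word, the pair
# (P(word | positive), P(word | negative)) from the example above.
p_positive = 0.49
p_negative = 0.51
likelihoods = {
    "my":       (0.30, 0.20),
    "grandson": (0.01, 0.02),
    "loved":    (0.32, 0.08),
    "it":       (0.30, 0.40),
}

# Multiply the prior by each word's conditional probability
score_positive = p_positive
score_negative = p_negative
for p_pos, p_neg in likelihoods.values():
    score_positive *= p_pos
    score_negative *= p_neg

# Normalize so the two values sum to 1, like a probability distribution
total = score_positive + score_negative
print(round(score_positive, 8))          # 0.00014112
print(round(score_positive / total, 4))  # 0.6837
print(round(score_negative / total, 4))  # 0.3163
```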
It seems like we've been able to
conclude that we are about 68%
confident-- we think there's
a probability of 0.68
that this message is a positive
message-- my grandson loved it.
And why are we 68% confident?
Well, it seems like we're more
confident than not because the word
loved showed up in 32%
of positive messages,
but only 8% of negative messages.
So that was a pretty strong indicator.
And for the others, while
it's true that the word
it showed up more often
in negative messages,
it wasn't enough to
offset that loved shows up
far more often in positive
messages than negative messages.
And so this type of analysis is
how we can apply naive Bayes.
We've just done this calculation.
And we end up getting not just a
categorization of positive or negative,
but I get some sort of confidence level.
What do I think the probability
is that it's positive?
And I can say I think it's positive
with this particular probability.
And so naive Bayes can be quite
powerful at trying to achieve this.
Using just this bag of words
model, where all I'm doing
is looking at what words
show up in the sample,
I'm able to draw these
sorts of conclusions.
Now, one potential drawback-- something
that you'll notice pretty quickly
if you start applying
this rule exactly as is--
is what happens depending on if
0's are inside this data somewhere.
Let's imagine, for example, this same
sentence-- my grandson loved it--
but let's instead imagine that this
value here, instead of being 0.01,
was 0, meaning inside of
our data set, it has never
before happened that in a positive
message the word grandson showed up.
And that's certainly possible.
If I have a pretty small data
set, it's quite likely
that not all the messages are
going to have the word grandson.
Maybe it is the case that no
positive messages have ever
had the word grandson in
it, at least in my data set.
But if it is the case that
2% of the negative messages
have still had the word
grandson in it, then we
run into an interesting challenge.
And the challenge is this-- when I
multiply all of the positive numbers
together and multiply all the negative
numbers together to calculate these two
probabilities, what I end up getting
is a positive value of 0.000.
I get pure 0's, because when I
multiply all of these numbers
together-- when I
multiply something by 0,
doesn't matter what the other numbers
are-- the result is going to be 0.
And the same thing can be said
of negative numbers as well.
So this then would seem to be a problem
that, because grandson has never
showed up in any of the positive
messages inside of our sample,
we're able to say-- we seem to
be concluding that there is a 0%
chance that the message is positive.
And therefore, it must be negative,
because the only cases where
we've seen the word grandson come
up is inside of a negative message.
And in doing so, we've totally
ignored all of the other probabilities
that a positive message is much more
likely to have the word loved in it,
because we've multiplied
by 0, which just
means none of the other probabilities
can possibly matter at all.
So this then is a challenge
that we need to deal with.
It means that we're
likely not going to be
able to get the correct results if
we just purely use this approach.
And it's for that reason there
are a number of possible ways
we can try and make sure that we
never multiply something by 0.
It's OK to multiply
something by a small number,
because then it can
still be counterbalanced
by other larger numbers, but multiplying
by 0 means it's the end of the story.
You multiply a number
by 0, and the output's
going to be 0, no matter how big any
of the other numbers happen to be.
So one approach that's fairly
common in naive Bayes is
this idea of additive smoothing, adding
some value alpha to each of the values
in our distribution just to
smooth the data a little bit.
One such approach is
called Laplace smoothing,
which basically just means adding one
to each value in our distribution.
So if I have 100 samples and zero
of them contain the word grandson,
well then I might say
that, you know what?
Instead, let's pretend that I've had
one additional sample where the word
grandson appeared and one additional
sample where the word grandson didn't
appear.
So I'll say all right,
now I have 1 out of 102--
so one sample that does have the
word grandson out of 102 total.
I'm basically creating two
samples that didn't exist before.
But in doing so, I've been able to
smooth the distribution a little bit
to make sure that I never have
to multiply anything by 0.
By pretending I've seen one more value
in each category than I actually have,
this gets us that result
of not having to worry
about multiplying a number by 0.
So this then is an approach
that we can use in order
to try and apply naive
Bayes, even in situations
where we're dealing with words that we
might not necessarily have seen before.
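As a quick sketch of that smoothing idea, with the 0-out-of-100 "grandson" example from above and alpha equal to 1 (Laplace smoothing):

```python
def smoothed_probability(word_count, total_samples, alpha=1):
    # Additive smoothing: pretend we saw alpha extra samples where the
    # word appeared and alpha extra samples where it did not.
    return (word_count + alpha) / (total_samples + 2 * alpha)

# "grandson" appeared in 0 of 100 positive messages: pretend it
# appeared in 1 of 102 instead, so the result is never exactly 0.
print(smoothed_probability(0, 100))  # 1/102, about 0.0098
```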
And let's now take a look at how we
could actually apply that in practice.
It turns out that NLTK, in addition
to having the ability to extract
n-grams and tokenize
things into words, also
has the ability to be able to apply
naive Bayes on some samples of text,
for example.
And so let's go ahead and do that.
What I've done is, inside of sentiment,
I've prepared a corpus of just, you
know, reviews that I've generated, but
you can imagine using real reviews.
I just have a couple of
positive reviews-- it was great.
So much fun.
Would recommend.
My grandson loved it.
Those sorts of messages.
And then I have a whole bunch of
negative reviews-- not worth it,
kind of cheap, really bad,
didn't work the way we expected--
just one on each line.
A whole bunch of positive
reviews and negative reviews.
And what I'd like to do now
is analyze them somehow.
So here then is sentiment.py,
and what we're going to do first
is extract all of the positive
and negative sentences,
create a set of all of the words that
were used across all of the messages,
and then we're going to go ahead and
train NLTK's naive Bayes classifier
on all of this training data.
And what the training data
effectively is, is that I
take all of the positive messages
and give them the label positive, all
the negative messages and
give them the label negative,
and then I'll go ahead and apply
this classifier to it, where I'd say,
I would like to take all
of this training data
and now have the ability to
classify it as positive or negative.
I'll then take some input from the user.
They can just type in
some sequence of words.
And then I would like to
classify that sequence
as either positive or
negative, and then I'll
go ahead and print out what the
probabilities of each happened to be.
And there are some helper functions here
that just organize things in the way
that NLTK is expecting them to be.
But the key idea here is that
I'm taking the positive messages,
labeling them, taking
the negative messages,
labeling them, putting them
inside of a classifier,
and then now trying to classify
some new text that comes about.
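NLTK's naive Bayes classifier handles all of this internally, but the core idea can be illustrated from scratch. This is only a rough sketch, with a tiny hypothetical corpus, equal priors, and Laplace smoothing, not sentiment.py's actual implementation:

```python
# A minimal from-scratch sketch of the kind of classification that
# sentiment.py delegates to NLTK. Tiny hypothetical corpus below.
positives = ["it was great", "so much fun",
             "would recommend", "my grandson loved it"]
negatives = ["not worth it", "kind of cheap",
             "really bad", "didn't work the way we expected"]

vocabulary = {w for s in positives + negatives for w in s.lower().split()}

def word_probabilities(messages):
    # P(word appears | label), with Laplace smoothing to avoid zeros
    return {
        word: (sum(word in m.lower().split() for m in messages) + 1)
              / (len(messages) + 2)
        for word in vocabulary
    }

p_word_positive = word_probabilities(positives)
p_word_negative = word_probabilities(negatives)

def classify(text):
    # Multiply equal priors by each known word's conditional probability,
    # then normalize the two scores into a probability distribution
    words = set(text.lower().split()) & vocabulary
    score_pos, score_neg = 0.5, 0.5
    for word in words:
        score_pos *= p_word_positive[word]
        score_neg *= p_word_negative[word]
    total = score_pos + score_neg
    return {"Positive": score_pos / total, "Negative": score_neg / total}

print(classify("this was great"))
```

Here "was" and "great" each appear only in the positive training messages, so the sketch leans positive.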
So let's go ahead and try it.
I'll go ahead and go into sentiment,
and we'll run python sentiment.py,
passing in as input that
corpus that contains
all of the positive
and negative messages--
because depending on the corpus, that's
going to affect the probabilities.
The effectiveness of
our ability to classify
is entirely dependent on how good our
data is, and how much data we have,
and how well they happen to be labeled.
So now I can try something and say--
let's try a review
like, this was great--
just some review that I might leave.
And it seems that, all right,
there is a 96% chance it estimates
that this was a positive message--
4% chance that it was a negative,
likely because the word great
shows up inside of
the positive messages,
but doesn't show up inside
of the negative messages.
And that might be something that
our AI is able to capitalize on.
And really, what it's going to look
for are the differentiating words--
that if the probability
of words like this and was
is pretty similar between
positive and negative messages,
then the naive Bayes
classifier isn't going
to end up using those values as
having some sort of importance
in the algorithm.
Because if they're the
same on both sides,
you multiply that value for
both positive and negative,
you end up getting about the same thing.
What ultimately makes the
difference in naive Bayes
is when you multiply by a value
that's much bigger for one category
than for another category--
when one word like great
is much more likely to show
up in one type of message
than another type of message.
And that's one of the nice
things about naive Bayes--
without me telling it that great
is more important to care
about than this or was,
naive Bayes can figure
that out based on the data.
It can figure out that this shows
up about the same amount of time
between the two, but great,
that is a discriminator,
a word that can be different
between the two types of messages.
So I could try it again--
type in a sentence like,
lots of fun, for example.
This one it's a little less sure about--
62% chance that it's positive, 37%
chance that it's negative-- maybe
because there aren't
as clear discriminators
or differentiators inside of this data.
I'll try one more--
say kind of overpriced.
And all right, now
95%, 96% sure that this
is a negative sentiment--
likely because of the word
overpriced, because it's shown up
in a negative sentiment expression
before, and therefore, it thinks,
you know what, this is probably
going to be a negative sentence.
And so naive Bayes has now given
us the ability to classify text.
Given enough training data,
given enough examples,
we can train our AI to be able to
look at natural language, human words,
figure out which words
are likely to show up
in positive as opposed to
negative sentiment messages,
and categorize them accordingly.
And you could imagine
doing the same thing
anytime you want to take text
and group it into categories.
If I want to take an email
and categorize as email--
as a good email or as a spam email,
you could apply a similar idea.
Try and look for the
discriminating words,
the words that make it more
likely to be a spam email or not,
and just train a naive Bayes
classifier to be able to figure out
what that distribution is and to be
able to figure out how to categorize
an email as good or as spam.
Now, of course, it's not going to be
able to give us a definitive answer.
It gives us a probability
distribution, something like 63%
positive, 37% negative.
And that might be why our spam filters
and our emails sometimes make mistakes,
sometimes think that a good
email is actually spam or vice
versa, because ultimately,
the best that it can do
is calculate a probability distribution.
If natural language is
ambiguous, we can usually
just deal in the world of
probabilities to try and get
an answer that is reasonably good, even
if we aren't able to guarantee for sure
that it is the number that we
actually expect for it to be.
That then was a look at how
we can begin to take some text
and to be able to analyze the text and
group it into some sorts of categories.
But ultimately, in addition to just being
able to analyze text and categorize it,
we'd like to be able to figure
out information about the text,
get some sort of meaning
out of the text as well.
And this starts to get us
in the world of information,
of being able to try and
take data in the form of text
and retrieve information from it.
So one type of problem is known
as information retrieval, or IR,
which is the task of finding relevant
documents in response to a query.
So this is something like you type
in a query into a search engine,
like Google, or you're
typing in something
into some system that's going to look
for-- inside of a library catalog,
for example-- that's going to
look for responses to a query.
I want to look for documents that are
about the US constitution or something,
and I would like to get a
whole bunch of documents
that match that query back to me.
But you might imagine that what
I really want to be able to do
is, in order to solve
this task effectively,
I need to be able to take
documents and figure out,
what are those documents about?
I want to be able to say what is it
that these particular documents are
about-- what of the topics
of those documents--
so that I can then more effectively
be able to retrieve information
from those particular documents.
And this refers to a set of tasks
generally known as topic modeling,
where I'd like to discover what the
topics are for a set of documents.
And this is something
that humans could do.
A human could read a document
and tell you, all right,
here's what this document
is about, and give maybe
a couple of topics for who are the
important people in this document, what
are the important objects in the
document-- can probably tell you
that kind of thing.
But we'd like for our AI to
be able to do the same thing.
Given some document, can you
tell me what the important words
in this document are?
What are the words that
set this document apart
that I might care about if
I'm looking at documents
based on keywords, for example?
And so one instinctive idea-- an
intuitive idea that probably makes
sense--
is let's just use term frequency.
Term frequency is just
defined as the number of times
a particular term appears in a document.
If I have a document with 100 words and
one particular word shows up 10 times,
it has a term frequency of 10.
It shows up pretty often.
Maybe that's going to
be an important word.
And sometimes, you'll
also see this framed
as a proportion of the total number
of words, so 10 words out of 100.
Maybe it has a term frequency of
0.1, meaning 10% of all of the words
are this particular
word that I care about.
Ultimately, that doesn't change
how relatively important
the words are for any
one particular document;
it's the same idea.
The idea is look for words that show
up more frequently, because those
are more likely to be the important
words inside of a corpus of documents.
And so let's go ahead
and give that a try.
Let's say I wanted to find out what
the Sherlock Holmes stories are about.
I have a whole bunch of
Sherlock Holmes stories
and I want to know, in
general, what are they about?
What are the important characters?
What are the important objects?
What are the important parts of
the story, just in terms of words?
And I'd like for the AI to be able
to figure that out on its own,
and we'll do so by looking
at term frequency--
by looking at, what are the words
that show up the most often?
So we'll go ahead
and go into the tfidf directory.
You'll see why it's
called that in a moment.
But let's first open up tf0.py, which
is going to calculate the top 10 term
frequencies-- or maybe
top five term frequencies
for a corpus of documents,
a whole bunch of documents
where each document is just
a story from Sherlock Holmes.
We're going to load all
the data into our corpus
and we're going to figure out,
what are all of the words that
show up inside of that corpus?
And we're going to
basically just assemble
all of the term frequencies.
We're going to calculate, how
often do each of these terms
appear inside of the document?
And we'll print out the top five.
And so there are some data
structures involved that you
can take a look at if you'd like to.
The exact code is not so important,
but it is the idea of what we're doing.
We're taking each of these
documents and first sorting them.
We're saying, take all
the words that show up
and sort them by how
often each word shows up.
And let's go ahead and just, for
each document, save the top five
terms that happen to show up
in each of those documents.
So again, some helper functions you can
take a look at if you're interested.
But the key idea here is
that all we're going to do
is run tf0 on the
Sherlock Holmes stories.
And what I'm hoping to get out of this
process is I am hoping to figure out,
what are the important words in
Sherlock Holmes, for example?
So we'll go ahead and run
this and see what we get.
And it's loading the data.
And here's what we get.
For this particular story, the
important words are the, and and, and I,
and to, and of.
Those are the words that
show up more frequently.
In this particular story, it's
the, and and, and I, and a, and of.
This is not particularly useful to us.
We're using term frequencies.
We're looking at what words show
up the most frequently in each
of these various different
documents, but what we get naturally
are just the words that
show up a lot in English.
The words the, and of, and and happen
to show up a lot in English,
and therefore, they happen
to show up a lot in each
of these various different documents.
This is not a particularly
useful metric for us
to be able to analyze
what words are important,
because these words are just part of
the grammatical structure of English.
And it turns out we can categorize words
into a couple of different categories.
These words happen to be known as what
we might call function words, words
that have little meaning
on their own, but that
are used to grammatically connect
different parts of a sentence.
These are words like am, and
by, and do, and is, and which,
and with, and yet-- words that,
on their own, what do they mean?
It's hard to say.
They get their meaning
from how they connect
different parts of the sentence.
And these function words are what we
might call a closed class of words
in a language like English.
There's really just some
fixed list of function words,
and they don't change very often.
There's just some list of
words that are commonly
used to connect other grammatical
structures in the language.
And that's in contrast with what
we might call content words, words
that carry meaning independently--
words like algorithm,
category, computer, words that
actually have some sort of meaning.
And these are usually the
words that we care about.
These are the words where
we want to figure out,
what are the important
words in our document?
We probably care about
the content words more
than we care about the function words.
And so one strategy we could apply is
just ignore all of the function words.
So here in tf1.py, I've
done the same exact thing,
except I'm going to load a whole bunch
of words from a function_words.txt
file, inside of which are just a whole
bunch of function words in alphabetical
order.
These are just a whole
bunch of function words
that are just words that are used
to connect other words in English,
and someone has just compiled
this particular list.
And these are the words
that I just want to ignore.
If any of these words show up, let's just
ignore them as one of the top terms,
because these are not words
that I probably care about
if I want to analyze what the
important terms inside of a document
happen to be.
So in tf1, what we're
ultimately doing is,
if the word is in my
set of function words,
I'm just going to skip over it, just
ignore any of the function words
by continuing on to
the next word and then
just calculating the frequencies
for those words instead.
So I'm going to pretend the
function words aren't there,
and now maybe I can get
a better sense for what
terms are important in each of the
various different Sherlock Holmes
stories.
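A sketch of that filtering step, using a small hypothetical stand-in for the list in function_words.txt:

```python
from collections import Counter

# A tiny, hypothetical stand-in for the list in function_words.txt
FUNCTION_WORDS = {"the", "and", "a", "an", "of", "to", "in", "it"}

def top_content_terms(document, n=5):
    # Same frequency count as before, but skip function words entirely
    words = [w for w in document.lower().split()
             if w not in FUNCTION_WORDS]
    return Counter(words).most_common(n)

print(top_content_terms("the hound of the baskervilles and the hound"))
# [('hound', 2), ('baskervilles', 1)]
```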
So now let's run tf1 on the Sherlock
Holmes corpus and see what we get now.
And let's look at, what is the most
important term in each of the stories?
Well, it seems like,
for each of the stories,
the most important word is Holmes.
I guess that's what we would expect.
They're all Sherlock Holmes stories.
And Holmes is not a function word.
It's not the, or a, or
an, so it wasn't ignored.
But Holmes and man--
these are probably not what I mean when
I say, what are the important words?
Even though Holmes does
show up the most often,
it's not giving me a whole
lot of information here
about what each of the different
Sherlock Holmes stories
are actually about.
And the reason why is because Sherlock
Holmes shows up in all the stories,
and so it's not meaningful for me to
say that this story is about Sherlock
Holmes if I want to try and
figure out the different topics
across the corpus of documents.
What I really want to know
is, what words show up
in this document that show up less
frequently in the other documents,
for example?
And so to get at that idea, we're
going to introduce the notion
of inverse document frequency.
Inverse document frequency
is a measure of how common,
or rare, a word happens to be
across an entire corpus of words.
And mathematically, it's
usually calculated like this--
as the logarithm of the
total number of documents
divided by the number of
documents containing the word.
So if a word like Holmes shows
up in all of the documents,
well, then the total number of documents
and the number of documents containing
Holmes are going to be the same number.
So when you divide these two,
you'll get 1, and the logarithm of 1
is just 0.
And so what we get is, if Holmes
shows up in all of the documents,
it has an inverse
document frequency of 0.
And you can think of
inverse document frequency
as a measure of how rare
the word that shows up
in this particular document is--
if a word doesn't show up
across many documents at all, this
number is going to be much higher.
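That formula is short enough to write directly; the three tiny documents here are hypothetical:

```python
import math

def inverse_document_frequency(word, documents):
    # idf = log(total documents / number of documents containing the word)
    containing = sum(word in document for document in documents)
    return math.log(len(documents) / containing)

documents = [
    {"holmes", "watson", "dog"},
    {"holmes", "adler", "scandal"},
    {"holmes", "moriarty", "falls"},
]
print(inverse_document_frequency("holmes", documents))  # 0.0
print(inverse_document_frequency("adler", documents))   # log(3), about 1.0986
```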
And this then gets us to
a model known as tf-idf,
which is a method for ranking what
words are important in the document
by multiplying these two ideas together.
Multiply term frequency, or TF, by
inverse document frequency, or IDF,
where the idea here now is
that how important a word is
depends on two things.
It depends on how often it
shows up in the document using
the heuristic that, if a
word shows up more often,
it's probably more important.
And we multiply that by
inverse document frequency IDF,
because if the word is rarer,
but it shows up in the document,
it's probably more important
than if the word shows up
across most or all of the documents,
because then it's probably
a less important factor in
what the different topics
across the different documents
in the corpus happen to be.
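Putting the two ideas together, a sketch of the tf-idf score, again with hypothetical mini-documents rather than the real corpus:

```python
import math

def tfidf(word, document, documents):
    tf = document.count(word)  # term frequency in this document
    containing = sum(word in doc for doc in documents)
    idf = math.log(len(documents) / containing)  # inverse document frequency
    return tf * idf

corpus = [
    ["holmes", "watson", "holmes", "dog"],
    ["holmes", "adler", "scandal"],
    ["holmes", "moriarty", "falls"],
]
print(tfidf("holmes", corpus[0], corpus))  # 0.0 -- shows up in every document
print(tfidf("dog", corpus[0], corpus))     # log(3) -- rare across the corpus
```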
And so now let's go ahead and apply
this algorithm on the Sherlock Holmes
corpus.
And here's tfidf.
Now what I'm doing is,
for each of the documents,
for each word, I'm
calculating its TF score,
term frequency, multiplied
by the inverse document
frequency of that word-- not just
looking at a single value,
but multiplying these
two values together
in order to compute the overall values.
And now, if I run tfidf
on the Holmes corpus,
this is going to try and get us
a better approximation for what's
important in each of the stories.
And it seems like it's
trying to extract here
probably like the names
of characters that
happen to be important in the
story-- characters that show up
in this story that don't
show up in the other story--
and prioritizing the more
important characters that
happen to show up more often.
And so this then might be a better
analysis of what types of topics
are more or less important.
I also have another corpus, which
is a corpus of all of the Federalist
Papers from American history.
If I go ahead and run tfidf
on the Federalist Papers,
we can begin to see what
the important words in each
of the various different
Federalist Papers happen to be--
that Federalist Paper Number 61
seems like it's a lot about elections.
Federalist Paper 66 is about
the Senate and impeachments.
You can start to extract what
the important terms and
the important words are just by
looking at what things don't show up
across many of the documents,
but show up frequently enough
in certain of the documents.
And so this can be a
helpful tool for trying
to figure out this
kind of topic modeling,
figuring out what it is that
a particular document happens
to be about.
And so this then is starting to get
us into this world of semantics,
what it is that things actually mean
when we're talking about language.
Now, we're no longer going to
think just about the bag of words,
where we treat a sample of
text as just a whole bunch of words
and don't care about the order.
Now, when we get into
the world of semantics,
we really do start to care about what
it is that these words actually mean,
how it is these words
relate to each other,
and in particular, how we can
extract information out of that text.
Information extraction is
somehow extracting knowledge
from our documents-- figuring
out, given a whole bunch of text,
can we automate the process of having
an AI look at those documents,
and get out what the useful or relevant
knowledge inside those documents
happens to be?
So let's take a look at an example.
I'll give you two samples
from news articles.
Here up above is a sample of a news
article from the Harvard Business
Review that was about Facebook.
Down below is an example of a
Business Insider article from 2018
that was about Amazon.
And there's some information
here that we might
want an AI to be able to extract--
information, knowledge
about these companies
that we might want to extract.
And in particular, what I
might want to extract is--
let's say I want to know data
about when companies were founded--
that I wanted to know that
Facebook was founded in 2004,
Amazon founded in 1994--
that that is important information
that I happen to care about.
Well, how do we extract that
information from the text?
What is my way of being
able to understand this text
and figure out, all right,
Facebook was founded in 2004?
Well, what I can look for are
templates or patterns, things
that happened to show up across multiple
different documents that give me
some sense for what this
knowledge happens to mean.
And what we'll notice
is a common pattern
between both of these passages,
which is this phrasing here.
When Facebook was
founded in 2004, comma--
and then down below, when Amazon
was founded in 1994, comma.
And those two templates end up giving
us a mechanism for trying to extract
information-- that this notion, when
company was founded in year comma,
this can tell us something about
when a company was founded,
because if we set our
AI loose on the web,
let it look at a whole bunch of papers
or a whole bunch of articles,
and it finds this pattern--
when blank was founded in blank, comma--
well, then our AI can
pretty reasonably conclude
that there's a good chance that this
is going to be like some company,
and this is going to be like the year
that company was founded, for example--
might not be perfect, but at
least it's a good heuristic.
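A sketch of such a hand-written template as a regular expression; the snippets of text here are paraphrased, not the actual articles:

```python
import re

# Hand-written template: "when <company> was founded in <year>,"
TEMPLATE = re.compile(r"[Ww]hen (.+?) was founded in (\d{4}),")

text = ("When Facebook was founded in 2004, it had no revenue. "
        "Back when Amazon was founded in 1994, online retail was new.")

for company, year in TEMPLATE.findall(text):
    print(company, year)
# Facebook 2004
# Amazon 1994
```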
And so you might imagine
that, if you wanted
to train an AI to be able
to look for information,
you might give the AI
templates like this--
not only give it a template like when
company blank was founded in blank,
but give it like, the book blank
was written by blank, for example.
Just give it some templates
where it can search the web,
search a whole big corpus of documents,
looking for templates that match that,
and if it finds that, then
it's able to figure out,
all right, here's the
company and here's the year.
But of course, that requires
us to write these templates.
It requires us to figure out, what
is the structure of this information
likely going to look like?
And it might be difficult to know.
The different websites are, of
course, going to do this differently.
This type of method isn't going to be
able to extract all of the information,
because if the words are
in a slightly different order,
it won't match on that
particular template.
But one thing we can do is, rather
than give our AI the template,
we can give AI the data.
We can tell the AI, Facebook was founded
in 2004 and Amazon was founded in 1994,
and just tell the AI those
two pieces of information,
and then set the AI loose on the web.
And now the idea is that the AI can begin
to look for where Facebook and 2004
show up together, where
Amazon and 1994 show up together,
and it can discover these
templates for itself.
It can discover that
this kind of phrasing--
when blank was founded in blank--
tends to relate Facebook to 2004,
and relates Amazon to 1994,
so maybe it will hold the same
relation for others as well.
And this ends up being--
this automated template
generation ends up being quite
powerful, and we'll go ahead
and take a look at that now as well.
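One very simplified way to sketch that discovery process: take each known pair, pull out the text that appears between its two values, and treat any context shared by every pair as a new template. This is only an illustration of the idea; search.py's actual template scoring is more involved:

```python
import re

pairs = [("Facebook", "2004"), ("Amazon", "1994")]
corpus = ("When Facebook was founded in 2004, it grew quickly. "
          "Back when Amazon was founded in 1994, the web was young.")

# For each known pair, record the text that appears between its values
middles = set()
for left, right in pairs:
    match = re.search(re.escape(left) + r"(.*?)" + re.escape(right), corpus)
    if match:
        middles.add(match.group(1))

# A middle shared by every pair becomes a discovered template
template = middles.pop() if len(middles) == 1 else None
print(repr(template))  # ' was founded in '

# Use the discovered template to extract brand-new pairs from other text
pattern = re.compile(r"(\w+)" + re.escape(template) + r"(\d{4})")
print(pattern.findall("Walmart was founded in 1962, in Arkansas."))
# [('Walmart', '1962')]
```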
What I have here inside
of templates directory
is a file called companies.csv,
and this is all of the data
that I am going to give to my AI.
I'm going to give it the pair
Amazon, 1994 and Facebook, 2004.
And what I'm going to
tell my AI to do is
search a corpus of
documents for other data--
other pairs like these--
other relationships.
I'm not telling the AI that this
is a company and the date
that it was founded.
I'm just giving it Amazon,
1994 and Facebook, 2004
and letting the AI do the rest.
And what the AI is going to do is
it's going to look through my corpus--
here's my corpus of documents--
and it's going to find, like
inside of Business Insider,
that we have sentences like, back when
Amazon was founded in 1994, comma--
and that kind of phrasing is going to be
similar to this Harvard Business Review
story that has a sentence like,
when Facebook was founded in 2004--
and it's going to look across
a number of other documents
for similar types of patterns to be able
to extract that kind of information.
And to see what it will do,
I'll go ahead and go into templates.
So I'll say python search.py.
I'm going to look for data
like the data in companies.csv
inside of the companies directory, which
contains a whole bunch of news articles
that I've curated in advance.
And here's what I get--
Google 1998, Apple
1976, Microsoft 1975--
so on and so forth--
Walmart 1962, for example.
These are all of the pieces
of data that happened
to match that same template that
we were able to find before.
And how was it able to find this?
Well, it's probably because, if
we look at the Forbes article,
for example, that it has a phrase in it
like, when Walmart was founded in 1962,
comma-- that it's able to
identify these sorts of patterns
and extract information from them.
Now, granted, I have curated
all these stories in advance
in order to make sure that there
is data that it's able to match on.
And in practice, it's not always
going to be in this exact format
when you're seeing a company related
to the year in which it was founded,
but if you give the AI access to enough
data-- like all of the data of text
on the internet-- and just have
the AI crawl the internet looking
for information, it can very
reliably, or with some probability,
try and extract information
using these sorts of templates
and be able to generate
interesting sorts of knowledge.
And the more knowledge it
learns, the more new templates
it's able to construct,
looking for constructions that
show up in other locations as well.
So let's take a look at another example.
And then here I'll show
you presidents.csv,
where I have two presidents
and their inauguration date--
so George Washington 1789,
Barack Obama 2009 for example.
And I also am going to give
to our AI a corpus that
just contains a single
document, which is the Wikipedia
article for the list of presidents
of the United States, for example--
just information about presidents.
And I'd like to extract from this raw
HTML document on a web page information
about the president.
So I can say search in presidents.csv.
And what I get is a whole
bunch of data about presidents
and what year they were likely
inaugurated, by looking
for patterns that matched--
Barack Obama 2009, for example--
looking for these sorts
of patterns that happened
to give us some clues as to what it
is that a story happens to be about.
So here's another example.
If I open up inside the olympics,
here is a scraped version
of the Olympic home page
that has information
about various different Olympics.
And maybe I want to extract
Olympic locations and years
from this particular page.
Well, the way I can do that is
using the exact same algorithm.
I'm just saying, all right, here are two
Olympics and where they were located--
so 2012 London, for example.
Let me go ahead and
just run this process,
Python search, on olympics.csv,
look at all the Olympic data set,
and here I get some information back.
Now, this information--
not totally perfect.
There are a couple of examples
that are obviously not
quite right, because my template might
have been a little bit too general.
Maybe it was looking for
a broad category of things
and certain strange things happened to
capture on that particular template.
So you could imagine adding rules to try
and make this process more intelligent,
making sure the thing on the left
is just a year, for example,
and doing other sorts of analysis.
But purely just based
on some data, we are
able to extract some interesting
information using some algorithms.
And all search.py is really doing here
is it is taking my corpus of data,
finding templates that match it--
here, I'm filtering down to just the
top two templates that happen to match--
and then using those templates
to extract results from the data
that I have access to, being able
to look for all of the information
that I care about.
And that's ultimately
what's going to help me,
to print out those results to figure
out what the matches happen to be.
And so information extraction
is another powerful tool
when it comes to trying
to extract information.
But of course, it only works
in very limited contexts.
It only works when I'm able to
find templates that look exactly
like this in order to come up
with some sort of match that
is able to connect this
to some pair of data,
that this company was
founded in this year.
What I might want to do, as we start
to think about the semantics of words,
is to begin to imagine some way
of coming up with definitions
for all words, being able to relate
all of the words in a dictionary
to each other, because that's ultimately
what's going to be necessary if we want
our AI to be able to communicate.
We need some representation
of what it is that words mean.
And one approach to doing this
is a famous data set called WordNet.
And what WordNet is is
it's a human-curated--
researchers have curated
together a whole bunch of words,
their definitions, their
various different senses--
because the word might have
multiple different meanings--
and also how those words
relate to one another.
And so what we mean by this is--
I can show you an example of WordNet.
WordNet comes built into NLTK.
Using NLTK, you can
download and access WordNet.
So let me go into WordNet,
and go ahead and run WordNet,
and extract information about a
word-- a word like city, for example.
Go ahead and press Return.
And here is the information
that I get back about a city.
It turns out that city has
three different senses, three
different meanings,
according to WordNet.
And it's really just kind
of like a dictionary, where
each sense is associated with its
meaning-- just some definition
provided by a human.
And then it's also got categories,
for example, that a word belongs to--
that a city is a type
of municipality, a city
is a type of administrative district.
And that allows me to
relate words to other words.
So one of the powers of WordNet
is the ability to take one word
and connect it to other related words.
If I do another example, let me
try the word house, for instance.
I'll type in the word house
and see what I get back.
Well, all right, the house
is a kind of building.
The house is somehow
related to a family unit.
And so you might imagine
trying to come up
with these various different
ways of describing a house.
It is a building.
It is a dwelling.
And researchers have just
curated these relationships
between these various different words to
say that a house is a type of building,
that a house is a type
of dwelling, for example.
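In NLTK, these lookups go through nltk.corpus.wordnet (for example, wordnet.synsets("city")). To show the underlying idea without downloading the WordNet data, here is a tiny hand-rolled stand-in whose few "is-a" entries are invented for illustration:

```python
# A toy stand-in for WordNet-style "is-a" (hypernym) links. These
# entries are invented for illustration; the real WordNet contains
# over a hundred thousand human-curated entries.
HYPERNYMS = {
    "city": "municipality",
    "municipality": "administrative district",
    "house": "building",
    "building": "structure",
}

def hypernym_chain(word):
    """Follow is-a links upward until no more are defined."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

print(hypernym_chain("city"))
# ['city', 'municipality', 'administrative district']
```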
But this type of
approach, while certainly
helpful for being able to
relate words to one another,
doesn't scale particularly well.
As you start to think
about language changing,
as you start to think about all
the various different relationships
that words might have to one another,
this challenge of word representation
ends up being difficult.
What we've done is just
defined a word as a sentence that
explains what that word is,
but what we really
would like is some way
to represent the meaning
of a word in a way
that our AI is going to be able
to do something useful with it.
Anytime we want our AI to
be able to look at texts
and really understand
what that text means,
to relate text and
words to similar words
and understand the
relationship between words,
we'd like some way that a computer
can represent this information.
And what we've seen all
throughout the course
multiple times now is
the idea that, when
we want our AI to
represent something, it
can be helpful to have the AI
represent it using numbers--
that we've seen that we can
represent utilities in a game,
like winning, or losing,
or drawing, as a number--
1, negative 1, or a 0.
We've seen other ways that
we can take data and turn it
into a vector of features,
where we just have
a whole bunch of numbers that represent
some particular piece of data.
And if we ever want to pass
words into a neural network,
for instance, to be able
to say, given some word,
translate this sentence
into another sentence,
or to be able to do interesting
classifications with neural networks
on individual words, we need
some representation of words
just in terms of vectors--
a way to represent words, just
by using individual numbers
to define the meaning of a word.
So how do we do that?
How do we take words and
turn them into vectors
that we can use to represent
the meaning of those words?
Well, one way is to do this.
If I have four words that I want
to encode, like he wrote a book,
I can just say, let's let
the word he be this vector--
1, 0, 0, 0.
Wrote will be 0, 1, 0, 0.
A will be 0, 0, 1, 0.
Book will be 0, 0, 0, 1.
Effectively, what I have here is what's
known as a one-hot representation
or a one-hot encoding, which
is a representation of meaning,
where meaning is a vector that has a
single 1 in it and the rest are 0's.
The location of the 1 tells
me the meaning of the word--
that a 1 in the first
position, that means he--
1 in the second position,
that means wrote.
And every word in the
dictionary is going
to be assigned to some representation
like this, where we just
assign one place in the vector
that has a 1 for the word
and 0 for the other words.
And now I have
representations of words that
are different for a whole
bunch of different words.
This is this one-hot representation.
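A one-hot encoding like this takes only a few lines to write, using the lecture's four-word sentence as the vocabulary:

```python
# One-hot encoding for the four-word vocabulary "he wrote a book":
# each word's vector has a single 1 at that word's index, 0 elsewhere.
vocabulary = ["he", "wrote", "a", "book"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot("he"))     # [1, 0, 0, 0]
print(one_hot("wrote"))  # [0, 1, 0, 0]
print(one_hot("book"))   # [0, 0, 0, 1]
```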
So what are the drawbacks of this?
Why is this not necessarily
a great approach?
Well, here, I am only
creating enough vectors
to represent four words in a dictionary.
If you imagine a dictionary with 50,000
words that I might want to represent,
now these vectors get enormously long.
These are 50,000 dimensional
vectors to represent
a vocabulary of 50,000 words--
that he is a 1 followed by all these 0's.
Wrote has a whole bunch of 0's in it.
That's not a particularly tractable
way of trying to represent numbers,
if I'm going to have to deal
with vectors of length 50,000.
Another problem-- a subtler problem--
is that ideally, I'd
like for these vectors
to somehow represent meaning
in a way that I can extract
useful information out of-- that if
I have the sentence he wrote a book
and he authored a novel, well, wrote
and authored are going to be two
totally different vectors.
And book and novel are going to be
two totally different vectors inside
of my vector space that have
nothing to do with each other.
The 1 is just located
in a different position.
And really, what I would
like to have happen
is for wrote and
authored to have vectors
that are similar to one
another, and for book and novel
to have vector representations
that are similar to one another,
because they are words
that have similar meanings.
Because their meanings are
similar, ideally, I'd like for--
when I put them in vector form and
use a vector to represent meanings,
I would like for those vectors to
be similar to one another as well.
So rather than this
one-hot representation,
where we represent a word's meaning
by just giving it a vector that is one
in a particular location,
what we're going to do--
which is a bit of a strange
thing the first time you see it--
is what we're going to call
a distributed representation.
We are going to represent
the meaning of a word as just
a whole bunch of different values--
not just a single 1 and the rest 0's,
but a whole bunch of values.
So for example, in he wrote a book,
he might just be a big vector.
Maybe it's 50 dimensions, maybe it's
100 dimensions, but certainly fewer
than tens of thousands, where
each value is just some number--
and same thing for
wrote, and a, and book.
And the idea now is that, using
these vector representations,
I'd hope that wrote and authored
have vector representations that
are pretty close to one another.
Their distance is not too far
apart-- and same with the vector
representations for book and novel.
So this is the goal of a
lot of what statistical machine learning
approaches to natural
language processing
are about: using these vector
representations of words.
But how on earth do we define
a word as just a whole bunch
of these sequences of numbers?
What does it even mean to talk
about the meaning of a word?
The famous quote that
answers this question
is from a British linguist in the
1950s, JR Firth, who said, "You shall
know a word by the company it keeps."
And what we mean by
that is the idea that we
can define a word in terms of the words
that show up around it, that we can get
at the meaning of a word based on the
context in which that word happens
to appear.
That if I have a sentence like
this, four words in sequence--
for blank he ate--
what goes in the blank?
Well, you might imagine
that, in English,
the types of words that might fill in
the blank are words like breakfast,
or lunch, or dinner.
These are the kinds of words
that fill in that blank.
And so if we want to define,
what does lunch or dinner mean,
we can define it in terms
of what words happened
to show up around it--
that if a word shows up
in a particular context and
another word happens to show up
in very similar context, then
those two words are probably
related to each other.
They probably have a similar
meaning to one another.
And this then is the
foundational idea of an algorithm
known as word2vec, which is a
model for generating word vectors.
You give word2vec a corpus of
documents, just a whole bunch of texts,
and what word2vec will produce
is vectors for each word.
And there are a number of
ways that it can do this.
One common way is through what's known
as the skip-gram architecture, which
basically uses a neural network
to predict context words,
given a target word-- so
given a word like lunch,
use a neural network to try and
predict, given the word lunch, what
words are going to show up around it.
And so the way we
might represent this is
with a big neural
network like this, where
we have one input cell for every word.
Every word gets one node
inside this neural network.
And the goal is to use this
neural network to predict,
given a target word, a context word.
Given a word like lunch, can I predict
the probabilities of other words
showing up in a context of one word
away or two words away, for instance,
in some sort of window of context?
And if you just give the AI, this neural
network, a whole bunch of data of words
and what words show up in context,
you can train a neural network
to do this calculation, to be able
to predict, given a target word--
can I predict what those context
words ultimately should be?
And it will do so using
the same methods we've
talked about-- back propagating
the error from the context word
back through this neural network.
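The data side of that training setup is easy to sketch: from running text, generate (target, context) pairs within some window, and those pairs become the examples the network trains on. This is only the preprocessing step, not a full word2vec implementation.

```python
# Sketch of skip-gram training-data generation: for each target word,
# emit one (target, context) pair per neighbor within the window.
# A real word2vec implementation then trains a network on these pairs.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["for", "lunch", "he", "ate"], window=1))
# [('for', 'lunch'), ('lunch', 'for'), ('lunch', 'he'),
#  ('he', 'lunch'), ('he', 'ate'), ('ate', 'he')]
```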
And what you get is, if
we use a single layer--
just a single layer of hidden nodes--
what I get is, for every single
one of these words, I get--
from this word, for example, I
get five edges, each of which
has a weight to each of
these five hidden nodes.
In other words, I get five
numbers that effectively
are going to represent this
particular target word here.
And the number of hidden nodes I
choose in this middle layer here--
I can pick that.
Maybe I'll choose to have 50
hidden nodes or 100 hidden nodes.
And then, for each of
these target words,
I'll have 50 different values
or 100 different values,
and those values we can
effectively treat as the vector
numerical representation of that word.
And the general idea here is
that, if words are similar,
two words show up in similar contexts--
meaning, using the same target words,
I'd like to predict
similar context words--
well, then these vectors and these
values I choose in these vectors
here-- these numerical values
for the weight of these edges
are probably going to be similar,
because for two different words that
show up in similar
contexts, I would like
for these values that are
calculated to ultimately
be very similar to one another.
And so ultimately, the high-level
way you can picture this
is that what this word2vec
training method is
going to do is, given a whole
bunch of words, where initially,
recall, we initialize these
weights randomly-- just pick
random weights to start.
Over time, as we train
the neural network,
we're going to adjust these weights,
adjust the vector representations
of each of these words
so that gradually,
words that show up in similar
contexts grow closer to one another,
and words that show up
in different contexts
get farther away from one another.
And as a result, hopefully
I get vector representations
of words like breakfast, and lunch, and
dinner that are similar to one another,
and then words like book,
and memoir, and novel
are also going to be similar
to one another as well.
So using this algorithm, we're
able to take a corpus of data
and just train our computer, train this
neural network to be able to figure out
what vector, what sequence
of numbers is going
to represent each of these words-- which
is, again, a bit of a strange concept
to think about representing a word
just as a whole bunch of numbers.
But we'll see in a moment just
how powerful this really can be.
So we'll go ahead and go into vectors,
and what I have inside of vectors.py--
which I'll open up now--
is I'm opening up words.txt, which
is a pretrained model that just--
I've already run word2vec
and it's already given me
a whole bunch of vectors for
each of these possible words.
And I'm just going to
take like 50,000 of them
and go ahead and save their vectors
inside of a dictionary called words.
And then I've also defined
some functions called distance;
closest_words, which will get me
the closest words to a particular word;
and then closest_word, which just gets
me the one closest word, for example.
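The loading step described here might look something like the following sketch. The exact format of words.txt is an assumption (one word per line followed by its vector components); the sample lines are invented.

```python
# Hypothetical sketch of loading pretrained word vectors into a
# dictionary, as vectors.py is described as doing. The assumed file
# format -- word followed by its numbers on one line -- is a guess.
def load_vectors(lines):
    words = {}
    for line in lines:
        parts = line.split()
        words[parts[0]] = [float(x) for x in parts[1:]]
    return words

sample = ["city 0.12 -0.43 0.88",
          "house 0.10 -0.50 0.91"]
words = load_vectors(sample)
print(words["city"])  # [0.12, -0.43, 0.88]
```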
And so now let me try doing this.
Let me open up the Python
interpreter and say something like,
from vectors import star--
just import everything from vectors.
And now let's take a look at
the meanings of some words.
Let me look at the
word city, for example.
And here is a big array that is the
vector representation of the word
city.
And this doesn't mean anything, in
terms of what these numbers exactly are,
but this is how my
computer is representing
the meaning of the word city.
We can do a different
word, like words["house"],
and here then is the vector
representation of the word house,
for example-- just a
whole bunch of numbers.
And this is encoding somehow
the meaning of the word house.
And how do I get at that idea?
Well, one way to measure how
good this is is by looking at,
what is the distance between
various different words?
There are a number of ways
you can define distance.
In the context of vectors,
one common way is what's
known as the cosine distance
that has to do with measuring
the angle between vectors.
But in short, it's just
measuring, how far apart
are these two vectors from each other?
So if I take a word like the word book,
how far away is it from itself--
how far away is the
word book from book--
well, that's zero.
The word book is zero
distance away from itself.
But let's see how far away the word
book is from a word like breakfast,
where we're going to say one is
very far away and zero is not far away.
All right, book is about
0.64 away from breakfast.
They seem to be pretty far apart.
But let's now try and calculate
the distance from the word book
to the word novel, for example.
Now, those two words are
closer to each other--
0.34.
The vector representation
of the word book
is closer to the vector
representation of the word novel
than it is to the vector
representation of the word breakfast.
And I can do the same thing and,
say, compare breakfast to lunch,
for example.
And those two words are
even closer together.
They have an even more
similar relationship
between one word and another.
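Cosine distance itself is a short formula: one minus the cosine of the angle between the two vectors. The three-dimensional vectors below are invented for illustration (real word vectors are much longer), but the distance function is the standard definition.

```python
from math import sqrt

# Cosine distance: 0 when two vectors point the same way,
# approaching 1 as they become unrelated (orthogonal).
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1 - dot / norms

# Toy vectors, invented for illustration.
book = [0.9, 0.1, 0.3]
novel = [0.8, 0.2, 0.4]
breakfast = [0.1, 0.9, 0.2]

print(round(cosine_distance(book, book), 6))  # 0.0 -- zero distance from itself
print(cosine_distance(book, novel) < cosine_distance(book, breakfast))  # True
```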
So now it seems we have some
representation of words,
representing a word using vectors, that
allows us to be able to say something
like words that are
similar to each other
ultimately have a smaller distance
that happens to be between them.
And this turns out to be
incredibly powerful to be
able to represent the meaning of
words in terms of their relationships
to other words as well.
I can tell you as well--
I have a function called
closest_words that
basically just takes a word
and gets all the closest words to it.
So let me get the closest
words to book, for example,
and maybe get the 10 closest words.
We'll limit ourselves to 10.
All right.
Book is obviously closest
to itself-- the word book--
but is also closely related to books,
and essay, and memoir, and essays,
and novella, anthology.
And why are these the words that it
computed as being close to it?
Well, because based on
the corpus of information
that this algorithm was trained
on, the vectors arose
based on what words
show up in similar contexts--
that the word book shows up in
contexts similar to those of words
like memoir and essays, for example.
And if I do something like--
let me get the closest words to city--
you end up getting city,
town, township, village.
These are words that happen to show up
in a similar context to the word city.
Now, where things get really interesting
is that, because these are vectors,
we can do mathematics with them.
We can calculate the relationships
between various different words.
So I can say something like, all
right, what if I had man and king?
These are two different vectors,
and this is a famous example
that comes out of word2vec.
I can take these two vectors and
just subtract them from each other.
This line here, the distance
here, is another vector
that represents like king minus man.
Now, what does it mean to take a
word and subtract another word?
Normally, that doesn't make sense.
In the world of vectors, though,
you can take some vector, some
sequence of numbers, subtract
some other sequence of numbers,
and get a new vector, get
a new sequence of numbers.
And what this new sequence of
numbers is effectively going to do
is it is going to tell me, what do I
need to do to get from man to king?
What is the relationship
then between these two words?
And this is some vector
representation of what
takes us from man to king.
And we can then take this value
and add it to another vector.
You might imagine that the
word woman, for example,
is another vector that exists
somewhere inside of this space,
somewhere inside of this vector space.
And what might happen if I
took this same idea, king
minus man-- took that same vector
and just added it to woman?
What will we find around here?
It's an interesting
question we might ask,
and we can answer it very easily,
because I have vector representations
of all of these things.
Let's go back here.
Let me look at the
representation of the word man.
Here's the vector representation of man.
Let's look at the
representation of the word king.
Here's the representation
of the word king.
And I can subtract these two.
What is the vector
representation of king minus man?
It's this array right here--
whole bunch of values.
So king minus man now represents the
relationship between king and man
in some sort of numerical vector format.
So what happens then
if I add woman to that?
Whatever took us from man to king,
go ahead and apply that same vector
to the vector representation
of the word woman,
and that gives us this vector here.
And now, just out of curiosity,
let's take this expression
and find, what is the closest
word to that expression?
And amazingly, what we get
is we get the word queen--
that somehow, when you take the
distance between man and king--
this numerical representation
of how man is related to king--
and add that same
notion, king minus man,
to the vector representation
of the word woman.
What we get is we get the vector
representation, or something close
to the vector representation
of the word queen,
because this distance somehow encoded
the relationship between these two
words.
And when you run it
through this algorithm,
it's not programmed to do this,
but if you just try and figure
out how to predict words
based on context words,
you get vectors that are able to
make these SAT-like analogies out
of the information that has been given.
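The analogy arithmetic can be demonstrated with made-up two-dimensional vectors. The numbers below are chosen by hand so the geometry works out; real word2vec vectors have far more dimensions, and where the lecture's helper uses cosine distance, this sketch uses plain Euclidean distance for brevity.

```python
# Toy demonstration of king - man + woman ~= queen. The 2-D vectors
# are invented so the arithmetic is easy to follow by eye.
words = {
    "man":   [1.0, 1.0],
    "woman": [1.0, 2.0],
    "king":  [3.0, 1.1],
    "queen": [3.0, 2.1],
}

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dist(u, v):
    # Euclidean distance, used here instead of cosine for brevity.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def closest_word(vector, exclude=()):
    return min((w for w in words if w not in exclude),
               key=lambda w: dist(words[w], vector))

# Whatever takes us from man to king, applied to woman...
target = add(sub(words["king"], words["man"]), words["woman"])
print(closest_word(target, exclude={"king", "woman"}))  # queen
```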
So there are more examples of this.
We can say, all right,
let's figure out, what
is the distance between
Paris and France?
So Paris and France are words.
They each have a vector representation.
This then is a vector representation of
the distance between Paris and France--
what takes us from France to Paris.
And let me go ahead and add the vector
representation of England to that.
So this then is the
vector representation
of Paris minus
France plus England--
so the distance between
France and Paris as vectors.
Add the England vector,
and let's go ahead
and find the closest word to that.
And it turns out to be London.
You do this relationship, the
relationship between France and Paris.
Go ahead and add the England vector
to it, and the closest vector to that
happens to be the vector
for the word London.
We can do more examples.
I can say, let's take
the word for teacher--
that vector representation
and-- let me subtract
the vector representation of school.
So what I'm left with is, what
takes us from school to teacher?
And apply that vector to a
word like hospital and see,
what is the closest word to that--
turns out the closest word is nurse.
Let's try a couple more examples--
closest word to ramen, for example.
Subtract closest word to Japan.
So what is the relationship
between Japan and ramen?
Add the word for America to that.
Want to take a guess at what
you might get as a result?
Turns out you get burritos
as the relationship.
If you do the subtraction,
do the addition,
this is the answer that you happen to
get as a consequence of this as well.
So these very interesting
analogies arise
in the relationships
between these two words--
that if you just map out all of
these words into a vector space,
you can get some pretty interesting
results as a consequence of that.
And this idea of representing
words as vectors turns out
to be incredibly useful
and powerful anytime
we want to be able to do
some statistical work with
regards to natural language,
to be able to have--
represent words not just
as their characters,
but to represent them as numbers,
numbers that say something
or mean something about
the words themselves,
and somehow relate the meaning
of a word to other words that
might happen to exist--
so many tools then for
being able to work inside
of this world of natural language.
Natural language is tricky.
We have to deal with the syntax of
language and the semantics of language,
but we've really just seen just the
beginning of some of the ideas that are
underlying a lot of natural language
processing-- the ability to take text,
extract information out of it, get
some sort of meaning out of it,
generate sentences maybe by having some
knowledge of the grammar or maybe just
by looking at probabilities of what
words are likely to show up based
on other words that have
shown up previously--
and then finally, the
ability to take words
and come up with some distributed
representation of them, to take words
and represent them as
numbers, and use those numbers
to be able to say something
meaningful about those words as well.
So this then is yet another
topic in this broader
heading of artificial intelligence.
And just as I look back
at where we've been now,
we started our conversation by
talking about the world of search,
about trying to solve problems
like tic-tac-toe by searching
for a solution, by exploring our
various different possibilities
and looking at what algorithms we
can apply to be able to efficiently
try and search a space.
We looked at some simple algorithms
and then looked at some optimizations
we could make to those
algorithms, and ultimately, that
was in service of trying to get our
AI to know things about the world.
And this has been a lot of what
we've talked about today as well,
trying to get knowledge out
of text-based information,
the ability to take information and draw
conclusions based on that information.
If I know these two things
for certain, maybe I
can draw a third conclusion as well.
That then was related to
the idea of uncertainty.
If we don't know
something for sure, can we
predict something, figure out
the probabilities of something?
And we saw that again
today in the context
of trying to predict whether
a tweet or whether a message
is positive sentiment
or negative sentiment,
and trying to draw that
conclusion as well.
Then we took a look at
optimization-- the sorts
of problems where we're looking
for a global or local maximum
or minimum.
This has come up time and time
again, especially most recently
in the context of neural networks, which
are really just a kind of optimization
problem where we're trying to
minimize the total amount of loss
based on the setting of our
weights of our neural network,
based on the setting of what
vector representations for words we
happen to choose.
And those ultimately helped
us to be able to solve
learning-related problems-- the
ability to take a whole bunch of data,
and rather than us tell
the AI exactly what to do,
let the AI learn patterns
from the data for itself.
Let it figure out what makes an inbox
message different from a spam message.
Let it figure out what
makes a counterfeit
bill different from an
authentic bill, and being
able to draw that analysis as well.
And one of the big tools
in learning that we used
were neural networks,
these structures that
allow us to relate inputs to outputs
by training these internal networks
to learn some sort of function that
maps us from some input to some output--
ultimately yet another model in this
language of artificial intelligence
that we can use to
communicate with our AI.
Then finally today,
we looked at some ways
that AI can begin to communicate
with us, looking at ways
that AI can begin to get an
understanding for the syntax
and the semantics of language to
be able to generate sentences,
to be able to predict things about
text that's written in a spoken
language or a written
language like English,
and to be able to do interesting
analysis there as well.
And there's so much more
in active research that's
happening all over the areas within
artificial intelligence today,
and we've really only just seen the
beginning of what AI has to offer.
So I hope you enjoyed this
exploration into this world
of artificial intelligence with Python.
A big thank you to the course's
teaching staff and the production team
for making this class possible.
This was an Introduction to
Artificial Intelligence with Python.
