In the last video we looked at the
probabilities of unigrams and bigrams
and at the conditional
probability of one word given that we
had already seen a word before it. So
essentially the probability of a bigram.
In this video we're going to combine
those probabilities to calculate the
probability of a sequence of words of a
sentence, and throughout the week we're gonna use those probabilities to distinguish between English sentences that we hear all the time, sentences that are poorly formed and that the computer should not use, and forms that are unusual but that we have heard before, such as all your base are belong to us. What we are trying to design is a language model: some kind of computational structure that can duplicate the intuitions that humans have. For example
at the beginning of the week we looked
at structures like: it's like we finish
each other's sentences or sandwiches.
Humans can estimate that these are
relatively probable sentences. And that
something like: it's like we finish each
other's desks is a very unlikely
sentence. We are trying to build a
mathematical structure that can tell us
that: it's like we finish each other's
sentences is highly probable and it's
like we finish each other's desks is
very improbable. A system
that can capture human intuitions about
language is a language model, and this
language model, an n-gram model, can
calculate the probability of sequences of words or sequences of characters, and
in particular the probability of some
upcoming word given the words that have
happened before. It's like we finish each
other's ... and then what's gonna happen? Is
it sentences? Is it desks? The computer
is gonna be able to tell us, and in doing
so it'll replicate a capability that
humans have with language.
So we want to estimate the probability
of sentences like I would like to swim in
the river, which is a perfectly fine
English sentence but if you have a
corpus of say 10 million words, it might
be the case that you never find this
exact sentence: I would like to swim in
the river. Therefore if you never find it,
the probability would be zero because
the computer never saw that exact
sequence, and this estimation would be
wrong because that sequence of words is
a very good sequence in English, it's a
perfectly fine sentence. So we want to
approximate the probability of this
sequence, even if we don't ever see the
exact sequence in a corpus, and use the
corpus structure to derive that this
is a good English sentence. The way we do
this is by decomposing the probability
of the sequence into the probability of
its adjacent elements, of its bigrams.
And this was an insight of Andrey Markov, the idea behind Markov chains. So this whole
sequence, I would like to swim in the
river, can be decomposed into the probability of I given the beginning of a sentence (so beginning of sentence followed by I), multiplied by the
probability of would given I, multiplied
by the probability of like given would,
multiplied by the probability of to given
like. And so forth until multiplied by
the probability of the end of a sentence
given river. So this multiplication of bigram probabilities is going to be an approximation of the probability of the entire sequence.
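To make that decomposition concrete, here is a minimal sketch in Python. The markers <s> and </s> for the beginning and end of a sentence, and the function bigram_prob, are assumptions for illustration; bigram_prob would return an estimate of the probability of a word given the previous word from some corpus.

```python
# A minimal sketch of the bigram decomposition described above.
# <s>, </s>, and bigram_prob(word, prev) are illustrative assumptions;
# bigram_prob would return an estimate of P(word | prev) from a corpus.

def sentence_probability(words, bigram_prob):
    """Approximate P(sentence) by multiplying the probabilities of its bigrams."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)  # P(word | prev)
    return prob

# For "I would like to swim in the river" this multiplies
# P(I | <s>) * P(would | I) * P(like | would) * ... * P(</s> | river).
```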
Here we have a very small example. Let's imagine we have a
corpus with just three sentences: I am Sam; Sam I am; I do not like green eggs and ham; with the beginning and the end of the sentence marked for each of these three sentences. So all
our model knows about English is those
three sentences. From that let's
calculate the probability of some of the
bigrams. For example we have the
probability of I given the start of a
sentence. The number for this would be the count of how many times we see the bigram beginning of the sentence followed by I, divided by how many times we see the unigram beginning of the sentence. The bigram beginning of the sentence followed by I happens once in the first sentence, and then again in the third sentence, where we have beginning of the sentence, I do not. So start of sentence followed by I happens two times, and the unigram beginning of the sentence happens three times, so two divided by three, approximately zero point six seven.
This is the probability of I given the
start of the sentence, and notice how this
number captures the intuition of what we
saw in the previous video: if you're
standing on the beginning of the
sentence, what will happen next? In the
first sentence, you go to I. In the second
sentence you go from the beginning to
Sam. In the third sentence, you go from
the beginning to I again. So if you're standing on the beginning of the sentence, two out of three times (in sentences one and three) you're gonna go to I, and one time out of three (in sentence two) you're gonna go to Sam. Two out of three is exactly the number that we're getting here. What about the probability
of Sam given am? It's the probability of
the sequence am Sam divided by the
probability of the unigram am. The sequence am Sam appears once, in sentence one, and that's it: one time. And then how many times do we see the unigram am? We see it twice, in sentences one and two. So 1/2 equals
0.5. Again the intuition is that if I'm standing on am, what's going to happen next? One of the times I'm going to go into Sam, and the other time I'm going to go into the end of a sentence. So if I'm standing on am, 1 out of 2 times I'm going to go into Sam.
This is the probability that we get. Now
you go ahead and give it a try. This is
the generalized form: the probability of the word b given a is the count of the bigram a b divided by the count of the unigram a, that is, by how many times we see the unigram a.
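As a rough sketch (not part of the original example), this count-and-divide formula could be implemented in Python like this, assuming <s> and </s> tokens mark the beginning and end of each sentence in the three-sentence corpus above:

```python
from collections import Counter

# Toy corpus from the example, with assumed <s> and </s> markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                  # count(a)
    bigram_counts.update(zip(tokens, tokens[1:]))  # count(a b)

def bigram_prob(b, a):
    """P(b | a) = count(a b) / count(a)."""
    return bigram_counts[(a, b)] / unigram_counts[a]

print(bigram_prob("I", "<s>"))   # 2/3, approximately 0.67, as computed above
print(bigram_prob("Sam", "am"))  # 1/2 = 0.5, as computed above
```

The same bigram_prob function can be used to check your answers to the exercise that follows.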
Now use this general formula to calculate the probability of these two bigrams: the end of a sentence given Sam, and the word do given I. You
can use this formula and then you can
come back to see the result. So go ahead
and pause the video.
Welcome back!
The probability of the end of a sentence
given Sam is the probability of the
sequence Sam end of the sentence divided
by the probability of the unigram Sam.
So Sam followed by the end of the sentence happens once, in sentence one, and then the unigram Sam happens twice, so this is 1/2, 50%.
And again, if you're standing on Sam, what happens? You can go into the end of the sentence in the first sentence, or you can go into the word I in the second sentence. So if you're standing on Sam, 50% of the times you go into the end of the sentence, and 50% of the times you go into I. The
probability of do given I is the
probability of the sequence I do divided
by the probability of the unigram I. You
see I do one time in these three sentences, and then the unigram I happens three times in the three sentences, so 1/3, 0.33. Again this is because when I am standing on I, in the first sentence I go into am, in the second sentence I go into am, and in the third sentence I go into do. So if I'm standing on the word I, two out of three times I go into am, and one out of three times I go into do: 0.33. What we
are doing here is that we're estimating the probability of a sequence given the observations that we have in this huge corpus of three sentences. The general name of this technique is maximum likelihood estimation, in that we're estimating the likelihood of seeing some event, such as the end of the sentence given Sam. We keep these numbers between zero and one so that we can use them as probabilities, which is why we divide by the count of the unigram. We call this a normalized maximum likelihood estimation, so that all of our probabilities are in between zero and one. So here's
another one that I want you to try:
let's say we have the sentence I am Sam
which is the beginning of a sentence, I
am Sam, end of a sentence. From the chain
of probabilities, we know that this is
the probability of I given the start of
a sentence, of am given I, of Sam given am,
and of the end of the sentence given Sam.
So three of these probabilities are in
the previous slides. One of them you're
going to have to calculate. Please
calculate the probability of this whole
sequence and I'm gonna give you a moment
to pause the video and then come back
for the result.
Welcome back.
The probability is 0.67 times 0.67, which is the one that you had to calculate (if you remember, if you're on I, two out of three times you're going to am and one time you're going to do), times the probability of Sam given am (0.5), times the probability of the end of the sentence given Sam (0.5). The probability of the sentence I am Sam is 11 percent, zero point one one. How about a sentence
that doesn't mean anything, like I am
fuf. This would be the beginning of the
sentence, I am fuf, end of the sentence.
We have the first two probabilities but
we do not have the sequence am fuf anywhere in the model. And in any case it would be a bad sentence of English because it would not mean anything. So the probability of I am fuf is
zero because we never observe the
sequence am fuf. This makes the whole
multiplication zero, so by having these
numbers the model that we've devised, the
n-gram model, can tell you that there are
sentences like I am Sam, which have a higher probability and are therefore preferable to sentences like I am fuf, which have a lower probability.
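As a rough sketch under the same assumptions as before (<s> and </s> markers and the toy three-sentence corpus), chaining the bigram probabilities to score whole sentences could look like this, reproducing the roughly 0.11 for I am Sam and the zero for I am fuf:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(b, a):
    # count(a b) / count(a); returns 0 for anything never seen in the corpus
    if unigram_counts[a] == 0:
        return 0.0
    return bigram_counts[(a, b)] / unigram_counts[a]

def sentence_probability(words):
    # Chain the conditional bigram probabilities across the whole sentence
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)
    return prob

print(sentence_probability(["I", "am", "Sam"]))  # about 0.11
print(sentence_probability(["I", "am", "fuf"]))  # 0.0, since "am fuf" is never observed
```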
In summary, we can describe the probability of a sequence of words using the probabilities of its sub-n-grams. Here we used bigrams, and we multiplied the bigram conditional probabilities to form a chain of probabilities that describes the sentence. We can use these
probabilities to figure out if a
combination is possible, such as I am
Sam, and it will help us find sentences
that are good but also ones that are
nonsensical or ungrammatical, like I am
fuf, or fuf am I, for example. In the next couple of
videos we're going to see an example of
these techniques with more sentences and
a way to distinguish between good
sentences, bad sentences, and sentences
that are unusual but that should still
be admissible. This is gonna be called smoothing.
