This week we are going to study n-grams.
N-grams capture the intuition that, in
your brain, you know how sentences are
going to end. You know that certain
sequences of words are more likely than
others. In essence, it's like we finish each other's... sentences? Sandwiches. The first time I encountered this joke was in a show called Arrested Development; many shows have used it over the years. It goes something like this: someone says "it's like we finish each other's..." and in your mind you know they mean to say sentences, but the other person says sandwiches. We
all have intuitions about which words
should come next in a language. These are
some examples from English. What should the next word be in salt and ...? You're probably thinking salt and pepper. In the sentence I'm a student at Dartmouth ..., pizza is probably not the next word; college probably is. Please come on in and sit roof, or please come on in and sit mouse, are probably not the ones you're thinking of. You're probably thinking of sit down.
Those examples were very restricted in what you could predict. A few are more open-ended: I always order pizza with cheese and ... Maybe you think of anchovies, maybe you think of basil, but under no circumstance are you going to think of phone, as in I always order pizza with cheese and phone. And in no case are these words going to be the end of the sentence: salt and the; I'm a student at Dartmouth and; please come on in and sit book; I always order pizza with cheese and book.
You have intuitions in your brain about what's going to happen next. You know that some words are very likely after you've heard a certain sequence, and that other words are very unlikely: pepper is very likely given salt and, while book is very unlikely given salt and. We could calculate these probabilities if we had a large enough collection of documents, say a trillion words of English. With such a collection, we could see that the sequence salt and pepper is more likely than salt and vinegar, which is more likely than salt and celery, which is more likely than salt and diamonds. We know this intuitively; salt and pepper feels like the first one we would think of. We use collections of documents to calculate those probabilities. We call these collections corpora; the singular is corpus. Using a
corpus, you can count how often you encounter a word; we call a single word a unigram. You can count, among your trillion words, how often you find the word salt. You can also count characters, for example how often you find the letter t, similar to what Markov did with Eugene Onegin. You can also calculate the probability of bigrams, which are two-word sequences: salt and, table and, lock and. Bigrams can also be two-character units, such as t followed by h, or h followed by e. You can also have trigrams, which are three-word units (salt and pepper, table and chairs, lock and key) or three-character units (t h e, t h o, t h p). In English, the character sequence t h e is much more likely than the character sequence t h p, for example. In general, we're going to call these n-grams, which are strings of n words or n characters, and we're going to calculate the probability of those sequences from a corpus, a collection of documents in a certain language.
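To make this concrete, here is a small sketch (not something from the lecture itself) of counting word unigrams, bigrams, and trigrams in a tiny made-up corpus and estimating a conditional probability from those counts:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A tiny made-up corpus; a real one would have billions of words.
corpus = "please pass the salt and pepper . i like salt and vinegar on my fries .".split()

unigram_counts = Counter(ngrams(corpus, 1))
bigram_counts = Counter(ngrams(corpus, 2))
trigram_counts = Counter(ngrams(corpus, 3))

print(unigram_counts[("salt",)])  # 2: the word "salt" appears twice

# Estimate P(pepper | salt and) as count(salt and pepper) / count(salt and).
p_pepper = trigram_counts[("salt", "and", "pepper")] / bigram_counts[("salt", "and")]
print(p_pepper)  # 0.5 here: "salt and" appears twice, followed once by "pepper" and once by "vinegar"
```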
N-grams are extremely useful for many natural language processing tasks. For example, in machine translation, say you want to translate some language into English and you have two possible translations of an original phrase: high winds tonight and large winds tonight. You could consult your n-gram model for the probability of these trigrams, these three-word sequences, and you would see that in English the sequence high winds is much more likely than large winds. So even if both high and large are possible translations of the original word, the machine should go with high winds, because that one is more likely in English according to the n-gram model.
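As a minimal sketch of that decision, assuming we already had trigram counts from some large English corpus (the counts below are invented for illustration), the comparison could look like this:

```python
# Hypothetical trigram counts from a large English corpus (invented numbers).
trigram_counts = {
    ("high", "winds", "tonight"): 210,
    ("large", "winds", "tonight"): 3,
}

candidates = [("high", "winds", "tonight"), ("large", "winds", "tonight")]

# Choose the candidate translation whose trigram is more frequent in English.
best = max(candidates, key=lambda tri: trigram_counts.get(tri, 0))
print(" ".join(best))  # high winds tonight
```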
N-grams can also be used for spell checking. A minuet is a kind of dance, so minuets does exist as a word and you could have 15 minuets, but it's a very infrequent word. The computer might calculate that 15 minuets is very unlikely compared with 15 minutes, and if it detects this disparity it might say: maybe what I have here is a spelling mistake, and 15 minutes is what was intended.
We can also use n-grams in speech recognition. Say we hear a sequence of sounds like I saw a fan. We could write those sounds down as the English words I saw a fan, but they could also be written as eyes awe of an. The computer can take each four-word sequence, calculate the probability of I saw a fan and the probability of eyes awe of an, determine that the first is much more likely, and conclude that this is probably how the words should be transcribed.
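One simple way to sketch that comparison is to score each hypothesis with bigram probabilities; the probabilities below are invented for illustration, not real corpus estimates:

```python
import math

# Invented bigram probabilities P(word | previous word); "<s>" marks the sentence start.
bigram_prob = {
    ("<s>", "i"): 0.05, ("i", "saw"): 0.02, ("saw", "a"): 0.10, ("a", "fan"): 0.005,
    ("<s>", "eyes"): 0.001, ("eyes", "awe"): 1e-6, ("awe", "of"): 0.01, ("of", "an"): 0.02,
}

def score(words):
    """Log probability of a word sequence under the bigram model (tiny floor for unseen pairs)."""
    tokens = ["<s>"] + words
    return sum(math.log(bigram_prob.get(pair, 1e-9)) for pair in zip(tokens, tokens[1:]))

print(score("i saw a fan".split()) > score("eyes awe of an".split()))  # True
```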
Did I mention a trillion words? Yes: there is an important n-gram collection, the Google N-gram corpus, built from one trillion, twenty-four billion words of text. From that corpus they calculated 1.1 billion sequences of up to five words that appear at least 40 times in the text. From the whole corpus they also got 13.5 million unique words, after discarding those that appear fewer than 200 times; that was their threshold for what counts as a word, something that appears at least 200 times, and they found 13.5 million of those. You can download this collection, by the way (I'll show you where in a second) and use it in your own programs. If you download it, you will find the frequencies of certain collocations, like these four-grams: serve as the independent, serve as the individual, serve as the index. The four-gram serve as the independent occurs 794 times in those trillion words; a four-gram like serve as the indispensable occurs 40 times in those trillion words.
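If you do download the data, it comes as plain text; assuming a file with one n-gram and its count per line, separated by a tab (the file name below is hypothetical), loading the four-grams could look like this:

```python
from collections import Counter

def load_ngram_counts(path):
    """Load counts from a file with one 'n-gram<TAB>count' entry per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            counts[tuple(ngram.split())] = int(count)
    return counts

# Hypothetical file name; the real data is split across many such files.
four_grams = load_ngram_counts("4gms-0001.txt")
print(four_grams[("serve", "as", "the", "independent")])
```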
Here's the website. Let's take a brief look. The default example compares the bigram Albert Einstein, the bigram Sherlock Holmes, and the unigram Frankenstein. What you see here is a comparison of how often those terms occur in the corpus over the last 200 years. The word Frankenstein has existed since the 1800s and, as you can see, has had a spike in occurrence in documents written in the 1960s or later. The bigram Albert Einstein appears in the early 20th century and has grown steadily ever since. Sherlock Holmes starts in the 1880s and has gone up and down, but it's fairly steady.
Notice that you can do Google N-grams in several languages, not that many actually, just quite a few. Let's look at the original n-grams that we had a few slides ago.
Salt and pepper, salt and vinegar, salt
and celery, salt and diamonds. As you can
see, salt and pepper is much more frequent than all of the others. Its occurrences shot up in documents from the 1960s on, but it has always been there, and it's two orders of magnitude more frequent than the next one, salt and vinegar, which you can barely see here.
Let's eliminate salt and pepper so that the scale will adjust. Now we have salt and vinegar, which is much more likely than salt and celery or salt and diamonds; it is again two orders of magnitude larger than salt and celery. Then salt and celery is more frequent than salt and diamonds, but not by much, and salt and diamonds is still the least frequent one. It does occur, for example in contexts about mines, but the food collocations are more likely. Now look at them again on the original scale: salt and pepper, salt and vinegar, salt and celery, salt and diamonds. Let's use an example in some
other language. In Spanish, the same examples: sal y pimienta is salt and pepper, sal y vinagre is salt and vinegar. As you can see, salt and pepper is much more frequent, one order of magnitude more frequent than salt and vinegar, in Spanish as well. And again, there are very few languages available; one would hope for many more, but these collections are very difficult to build, since gathering a trillion words in any language is hard work. In summary,
words don't occur at random; certain words are likely to occur together. Salt and pepper would be a very common trigram; salt and book would not. You have these intuitions in your mind, and somehow we have to replicate them in our computer systems. We're going to do that by gathering corpora. A corpus is a large collection of documents from which we can calculate probabilities, and from it we're going to build a language model that knows which words go together. We're going to call this an n-gram model. N-grams are extremely useful for many natural language processing tasks, such as spell checking, machine translation, and speech recognition, and we're going to use them for things like new sentence generation and spell checking in the rest of the week.
