Hi. You've all heard about bots
generating language, and that computers
may someday write poetry. In
this video, we're going to take our very
first look at natural language
generation, and we're going to use n-grams
to generate language, to generate new
sentences no one has ever uttered before.
So Botnik Studios is one of my favorites.
They take natural language input and
then teach the computer to generate new
outputs, for example Super Bowl facts:
the modern Super Bowl Sunday is the
exact size of a bus, the game is strictly
platonic. Natural language generation is
two things. First, it's the transformation
of structured computer data into language,
for example to report results from a
device or from a database: think of the
sci-fi ship's computer that announces the
status of the ship, or the robot that
turns its database of knowledge into an
explanation, or, in a more down-to-earth
application, a big data system whose
results you want the computer to present
in English in 30 seconds. This is one
kind of natural language generation.
The second is generating
new text that is creative, for example
writing poems, or writing a novel, maybe
it looks at a picture and from there it
writes a poem, or maybe it reads all of
Shakespeare and then tries to write a
new Shakespeare play. This is also a type
of natural language generation. In this
class, we're going to look at three ways
in which we can generate natural
language. This video talks about
n-grams: chains of words linked
by their probabilities. In week six,
we're going to talk about neural language
generation, which uses deep learning
networks like RNNs and LSTMs to generate
new text. And in week seven, we'll talk
about parsing rules that aid natural
language generation, for example when we
have a very specific structure that we
want to generate, like a poem. Here we're
going to look at the most basic form
of natural language generation: n-grams.
So this is automatic Shakespeare. For
example, if we generated it out of
unigrams, it would sound
like this: To him swallowed confess
hear both. Which. Of save on rail for
are ay device and rote life have. Hill
he late speaks or a more to leg less
first you enter. Not great. Let's see how
we do with chains of bigrams.
Why dost stand forth thy canopy, forsooth
he is this palpable hit the King Henry.
Live king. Follow. What means, sir. I confess
she? Then all sorts, he is trim,
captain. Still not great. Let's see what
we can do with chains of trigrams. Fly,
and will rid me these news of price.
Therefore the sadness of parting, as they
say, tis done. This shall forbid it
should be branded, if renown made it
empty. A little bit more readable. How
about chains of 4-grams?
King Henry. What! I will go seek the traitor
Gloucester. Exeunt some of the watch. A
great banquet served in. It cannot be but
so. How did the computer make these?
Let's go back to our n-gram table, the
bigram table from the restaurant example.
We have nine thousand sentences
asking questions about restaurants, and
from there we calculated how many times
each combination occurs: I want, 828
times; I eat, 10 times; and I food, just
the one time that we inserted with smoothing.
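If you'd like to see where such a count table comes from, here's a minimal Python sketch. The three sentences are made up for illustration; the real table comes from the full nine-thousand-sentence corpus, and this sketch skips the smoothing step.

```python
from collections import Counter

# Tiny stand-in for the restaurant corpus (illustrative only).
sentences = [
    "i want to eat lunch",
    "i want to eat chinese food",
    "i want to spend money",
]

bigram_counts = Counter()
for sentence in sentences:
    words = sentence.split()
    # Count every adjacent pair of words.
    for first, second in zip(words, words[1:]):
        bigram_counts[(first, second)] += 1

print(bigram_counts[("i", "want")])   # 3 in this toy corpus
print(bigram_counts[("want", "to")])  # 3
```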
Now let's choose a random starting point.
How about I want? We have the bigram I
want, which has two elements, I and want.
Let's take the second element and make it
the first element of a new bigram. Where
can we go? We have want as our first element,
and then what are the possible second
elements of that bigram? It could be want
to, which happens 609 times; it could be
want food, which happens seven times, same
as want Chinese; it could be want lunch,
which happens six times. So let's go with
the one with the highest count: want to.
So we have I want, and to it we add want
to. We have the bigram want to,
so let's take the second element, to, and
make it the first element of the new
bigram. We're standing on to and then
where do we go?
Maybe we can go to to eat, which occurs
687 times. We can go to to spend, which
occurs 212 times. We could go to to lunch,
which occurs seven times. Since we're
using the one with the highest count,
let's take to eat. So: I want,
want to, to eat.
So we have the bigram to eat. Let's
take the second element, eat, and make it
the first element of the new bigram; we
are chaining bigrams together. If we are
standing on eat, where could we go? We
could go to eat lunch, which occurs 43
times; eat Chinese, 17 times; eat to, which
occurs three times. Let's choose the one
with the highest count: eat lunch. I want,
want to, to eat, eat lunch. A chain of
bigrams. That was the most likely chain
for us to follow.
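Here's a minimal sketch of that greedy chaining in Python, using the counts from the table above (only the rows we actually used are included):

```python
# Bigram counts from the restaurant example (partial table).
counts = {
    "i":    {"want": 828, "eat": 10, "food": 1},
    "want": {"to": 609, "food": 7, "chinese": 7, "lunch": 6},
    "to":   {"eat": 687, "spend": 212, "lunch": 7},
    "eat":  {"lunch": 43, "chinese": 17, "to": 3},
}

def greedy_chain(start, length=5):
    """Always follow the continuation with the highest count;
    the second element of each bigram becomes the first
    element of the next one."""
    words = [start]
    while len(words) < length and words[-1] in counts:
        options = counts[words[-1]]
        words.append(max(options, key=options.get))
    return " ".join(words)

print(greedy_chain("i"))  # i want to eat lunch
```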
I want, 828 times; want to, 609 times;
to eat, 687; eat lunch, 43. The model just
generated a sentence from these humble
beginnings, which are just counts of how
many bigrams you have. This language model
generated a valid, grammatical, perfectly
good sentence of English: I want to eat
lunch. So it knows a little bit of English.
I chose the path with the highest count
for each link of the chain. Why don't
you give it a try?
Let's start with the word I. If
you're at I, you can go to I want, I eat,
or I spend. Take the two most
likely paths, that is, the two paths with
the highest count, choose one of them at
random, and keep going until you have five
words. Let's see what kind of sentence you
can generate. Then give it another try
and see if you land on some other
sentence. This way you'll be able to
generate quite a few sentences of English.
So follow these instructions and give it a
try.
Pause the video now; I'll give you some
of my solutions after I count to five.
One, two, three, four, five...
Welcome back. These are some of the ones
I made: I want to eat Chinese lunch; I
want to spend to; I want food to eat; I
want food to spend; I eat Chinese food to.
So all of these are English sentences.
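Here's a sketch of the exercise's recipe in Python: at each step, look at the two continuations with the highest counts and pick one of them at random. The counts table is the same partial one as before.

```python
import random

# Same partial bigram counts as in the greedy sketch.
counts = {
    "i":    {"want": 828, "eat": 10, "food": 1},
    "want": {"to": 609, "food": 7, "chinese": 7, "lunch": 6},
    "to":   {"eat": 687, "spend": 212, "lunch": 7},
    "eat":  {"lunch": 43, "chinese": 17, "to": 3},
}

def random_chain(start, length=5, seed=None):
    """Pick at random between the two highest-count
    continuations at each step."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < length and words[-1] in counts:
        options = counts[words[-1]]
        top_two = sorted(options, key=options.get, reverse=True)[:2]
        words.append(rng.choice(top_two))
    return " ".join(words)

for i in range(3):
    print(random_chain("i", seed=i))  # e.g. "i want to eat chinese"
```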
This humble n-gram model knows quite a
bit of English, because it knows how to
put English sentences together. Again, try
to see this against the backdrop of
learning a foreign language: think of how
long it would take you to be able to do
this in a new language, while the computer
can replicate that ability just by taking
in the input and counting the bigrams it
contains. So this is
a way that the computer can use n-grams
to generate sentences of a language. This
is how the Shakespeare above got generated.
If it's a unigram model, it's just drawing
unigrams at random from a big bag, so the
most likely word gets chosen most often,
and so forth. To him swallowed confess
hear both... This makes absolutely no sense.
no sense. Bigrams would be exactly what
we did: chains of two words. Why does stand forth
that canopy forsooth. This is still not
very good; it's mostly gibberish. Trigrams
would be chains where you have one two
three elements, and then the last two
elements become the first two elements
of the following trigram. And once you've
chained that one, the last two elements
of that trigram become the next two
elements of the following trigram. Using
this kind of chaining, the trigram says
Fly, and will rid me these news of price.
Therefore the sadness of parting, as they
say, tis done. That sounds a little bit more like
English. It's starting to sound like it.
The 4-gram is pushing it. A 4-gram is
four words, and you take the last three
words of each 4-gram as the first three
words of the next one; that's how the
chaining works. King Henry. What! I will
go seek the traitor Gloucester. Exeunt
some of the watch.
This sounds very much like English.
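Here's a minimal sketch of this general n-gram chaining in Python. The file name shakespeare.txt is a placeholder; any large text would do, and for simplicity this version samples in proportion to the counts rather than always taking the maximum:

```python
from collections import Counter, defaultdict
import random

def train(tokens, n):
    """Count which word follows each (n-1)-word context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate(model, n, start, length=20, seed=0):
    """Slide the window along: the last n-1 words generated
    so far are the context for the next word."""
    rng = random.Random(seed)
    out = list(start)
    for _ in range(length):
        context = tuple(out[-(n - 1):])
        if context not in model:
            break
        options = model[context]
        out.append(rng.choices(list(options),
                               weights=list(options.values()))[0])
    return " ".join(out)

# Hypothetical usage, assuming a local text file:
# tokens = open("shakespeare.txt").read().lower().split()
# model = train(tokens, 3)
# print(generate(model, 3, start=("to", "be")))
```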
We'll come back to the 4-gram in a second.
Again, the input determines the output. We
would get very different sentences if we
trained this on the Wall Street Journal,
for example (forty million words of it, by
the way!). If it's just unigrams, we get:
Months the my and issue of year foreign new
exchange's September were recession
exchange new endorsed a acquire to six
executives. That doesn't sound like
English at all. The bigram says: Last
December through the way to preserve the
Hudson corporation... Not great. The trigram
starts to sound a little more like it:
They also point to ninety nine point
six billion dollars from 204.063% of the
rates of interest stores as Mexico and
Brazil on market conditions. That sounds a
little bit more like it. There's a danger,
though, in going to larger and larger n-grams.
Let's take Shakespeare. Shakespeare has
about 884 thousand tokens in total, that
is, words across all of Shakespeare's
work, and about 29 thousand unique tokens.
Those unique tokens are our unigrams. With
29 thousand unigrams, you could have 844
million possible bigrams, 24.5 trillion
possible trigrams, and 713 quadrillion
possible 4-grams. That is a monstrous amount.
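You can check these numbers with a couple of lines of Python; 29,066 unique tokens (which rounds to the lecture's 29 thousand) is the vocabulary size that reproduces the figures above:

```python
# Vocabulary size: about 29 thousand unique tokens.
V = 29_066

for n, name in [(2, "bigrams"), (3, "trigrams"), (4, "4-grams")]:
    print(f"possible {name}: {V ** n:.3e}")

# possible bigrams: 8.448e+08   (~844.8 million)
# possible trigrams: 2.456e+13  (~24.6 trillion)
# possible 4-grams: 7.137e+17   (~713.7 quadrillion)
```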
However, most of them are zero: 99.96
percent of the possible bigrams never
occur, because most words never appear
next to one another. Because of this, the
larger n grows, the larger the percentage
of n-grams that are just zeros. And as you
get more and more zeros, the possible
paths you can take become fewer and fewer,
so by the time you get to 4-grams, there
are so few paths that you are essentially
just reciting the text you read when you
were building the 4-grams. You are
producing Shakespeare because it is
Shakespeare: it is exactly what the
program read, and it's just spitting it
out. This is a form of overfitting, where
the model only really learns the exact
data it has been shown as input. Trigrams
are a happy medium here. They don't
produce amazing text, but they do produce
new structures that you could not find in
the original text. One important thing
is that n-grams have the same
problems we've been seeing so far with
finite state machines. When you're trying
to model language, you have to model
phenomena like long-distance dependencies.
Take this sentence: The fact that he's the
one who didn't read the recipe and tried
to make the thing all by himself.
The words he and himself depend on one
another: if we changed he to she, we would
also need to change himself to herself.
To capture this dependency, you would need
something like an 18-gram to hold both
words within the same probability, and at
that point you'd be ridiculously
overfitting. That n-gram would be no good
for generating new structures, or really
for predicting anything other than the
data it just saw. So
n-grams do have limitations in how much
data they can generate and how much they
can model, but they are very useful for
modeling short-distance dependencies, and
we're going to be using them a lot: for
example, to predict the word you're typing
on your phone right now, or the next word;
in spell checking, to see whether the word
you're typing could correspond to some
other word; and in predicting emails, for
example email auto-completion. In
summary, natural language generation is
a process that uses an existing language
model, like an n-gram model, to create new
sentences that have never been uttered.
We can use n-grams to generate new
sentences by jumping from one bigram to
the next, or from one trigram to the next.
This is very good for modeling
immediate, short-distance dependencies,
but it's not going to be good at
remembering long-distance dependencies,
because the larger our n-grams are, the
more we overfit the data: generation
collapses to the one path we saw when we
were taking the data in. But this doesn't
mean that n-grams are not useful; they're
very useful. In the next videos of the
week, we're going to look at some of their
applications, for example text prediction,
predicting words as you type, and spell
checking.
