LAURENCE MORONEY: Through
this series so far,
you've been learning the
basics of NLP using TensorFlow.
You saw how to tokenize
and then sequence text,
preparing it to train
neural networks.
You saw how sentiment in
text can be represented
with embeddings and how
the semantics of text
over long stretches
might be learned
using recurrent neural
networks and LSTMs.
In this video, we'll
put all of that
together into a fun scenario.
We'll create a
model and train it
on the lyrics of
traditional Irish songs.
From that, you'll see then if
it can write its own poetry
using those words.
Let's look at the
steps involved.
First of all, this is our text.
Within the entire corpus are the
lyrics to lots of Irish songs.
One of them, "Lanigan's
Ball," is listed here,
and you can see these words
have a very distinctive style.
If we were to read
them in, we could
do it something like this.
And for simplicity, I'll just
use this one song for now.
It's stored as a single
string with slash n's
to give new lines.
That, I can then break
into a number of sentences
by splitting the string by
that new line character,
and this will form
my corpus of text.
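If it helps to follow along, that step might look something like this, where data is just an illustrative name for the string holding the lyrics:

data = "In the town of Athy one Jeremy Lanigan\n..."  # rest of the song omitted here
# Split on the newline character to get one sentence per line.
corpus = data.lower().split("\n")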
Later, you'll see how to change
it to read the full corpus off
of disk, but the methods
will be exactly the same.
I can then fit my tokenizer to
the corpus to get a word index.
As I'm using an out
of vocabulary token,
I'll add 1 to the length
of the word index just
to cater for that.
Now, you might
wonder, why not just
encode with an out
of vocabulary token?
There's a subtle difference here
between generating text
and the previous scenario,
where we were classifying text.
When generating text, we don't
need a validation data set.
We're going to use
every bit we have
to try to spot the patterns
of where and how words occur.
So if we tokenize
our entire corpus,
there will be, by definition,
no out of vocabulary token.
However, in a moment
you'll see where
we will start to
pad subsentences
from the full corpus,
and for that, we'll
need some kind of a zero token.
Hence, we'll add 1 here,
counting that token
as a valid word.
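In code, that tokenization step might look something like this, where total_words is just an illustrative name:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
# Add 1 to the size of the word index to account for the extra zero token.
total_words = len(tokenizer.word_index) + 1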
Now that we have a
list of sentences
and we've tokenized
them, we can turn them
into a set of training data.
Now, there's a key
difference here
between what we've seen
previously for classification
and what we'll use
for generation.
So let's go over this line
by line so it's clear.
First of all, I'll create an
empty list of input sequences.
We'll populate this
as we go along.
Now, for each line
in the corpus,
we'll create the list of tokens.
Note that we're not calling
texts_to_sequences on the entire body.
We're going to do it
one line at a time.
So this will give me the
texts_to_sequences result
for the current line.
Now for example, this will
sequence just the first line
the first time through the loop.
And "In the town of Athy
one Jeremy Lanigan" will be
tokenized into the
numbers as shown.
Next, we're going to
go through this list
and generate n-grams from it.
What does that mean?
It's best if we look
at it like this.
The line that we
tokenized is represented
by a list of numbers,
but we can split that
into a number of other lists.
The first two, the first three,
the first four, and so on.
The reason for that
is that we want
to train a model to predict
the likely next word.
So for each sentence we
have, we can train it
for when you see this
word, this one is next.
When you see these two
words, this one is next.
When you see these three words,
this one is next, and so on.
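A sketch of that loop might look like this, building every prefix of each tokenized line as its own training sequence:

input_sequences = []
for line in corpus:
    # Tokenize one line at a time rather than the whole corpus at once.
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Build the prefixes: the first two tokens, the first three, the first four, and so on.
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])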
Now that we've split the
sentence into multiple lists,
we'll need to pad them.
So we'll start by
getting the length
of the longest of the
sentences and then
pad everything with
a 0 up to the length
of the maximum sentence.
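That padding step might look something like this, pre-padding so the zeros come before the words:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Find the longest sequence, then pad everything to that length with leading zeros.
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = np.array(
    pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))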
So now our line of eight words
has given us the same seven
lists, but each one is now
padded with 0's at the beginning.
Thus, we can see our
set of input sequences
for this one line
just looks like this.
And this is ideal for giving
us features and labels
or X's and Y's.
We can take everything but
the last value as our X,
and we can use the
last value as our Y.
So when we see a bunch
of 0's followed by a 4,
the label for that will be 2.
Similarly, when we
see a bunch of 0's
ending with a 4 and then a 2,
the label for that will be 66.
Similarly, 4, 2, 66
will be labeled as 8.
Python makes it
super simple for us
to slice our lists like this.
We can simply use code like
this to generate our X's
and our labels.
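A sketch of that slicing could look like this, where xs and labels are just illustrative names:

# Everything except the last token is the input; the last token is the label.
xs = input_sequences[:, :-1]
labels = input_sequences[:, -1]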
Finally, we'll want our Y
to be categorical and one
hot encoded, so
that when we train,
we'll be able to predict across
all of the words in our corpus
which one is the most likely
word to be next in the sequence
given the current set of words.
And then we can use the Keras
to_categorical utility to achieve this.
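That might look something like this:

import tensorflow as tf

# One-hot encode each label across the full vocabulary.
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)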
So for example, given
the above sentence,
we'll split it into
X and label, where
X is the beginning of the
list, and the label is 70.
We can then one-hot encode
the label to get the Y.
And if you look
closely, you'll see
that the seventieth
element in the y list
is a one while
everything else is 0.
So now that we have our
features and our labels,
let's train a neural network
with all of the data.
And here's a very simple model
architecture to achieve that.
This is completely
unoptimized, particularly
in the middle layers, so
please feel free to experiment
and improve it.
It starts with a Sequential model
and adds an Embedding layer
at the top, like we saw earlier.
As there's a massive
variation of words,
I gave it a lot of dimensions.
And in this case, it's 240.
The first parameter
is the number
of unique words in the corpus.
The input length is the maximum
sequence length minus 1,
because we lopped off the
final value in each sequence
to make a label.
After that, we've just
got a single LSTM,
but we'll make it
bi-directional.
And then importantly,
our output is a Dense layer
with the total number of words.
Remember that the labels
were one-hot encoded,
so we want an output that
is representative of this.
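A sketch of that architecture could look like this; the 240 embedding dimensions come from above, while the LSTM size is just a starting point to experiment with:

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = tf.keras.Sequential([
    # Vocabulary size, 240 embedding dimensions, and the padded length minus 1.
    Embedding(total_words, 240, input_length=max_sequence_len - 1),
    # A single LSTM, made bidirectional; 150 units is an arbitrary choice here.
    Bidirectional(LSTM(150)),
    # One output per word in the vocabulary, to match the one-hot labels.
    Dense(total_words, activation='softmax')
])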
It's then a matter of
defining your loss function
and optimizer.
Remember, as this is categorical
with lots of classes,
you'll need a
categorical loss function
such as categorical
cross-entropy here.
And once you've done that, you
just fit the X's to the Y's.
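For example, something along these lines; the optimizer and the epoch count are just reasonable defaults to try:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500)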
As you're training, you might
see the initial accuracy is
really small like 0.05 or 0.06.
Don't worry, it will
go up with time.
This is very unstructured data,
and it's trying to figure out
the rules that match
your X's to your Y's.
When it's done,
you'll have a model
to which you can pass a
sequence, and it will give you
the predicted next value.
You can then use this
to generate poetry:
take a sequence and
get the next value,
add that to the sequence,
pass it to the model,
get the next value, add that
to the sequence, and so on.
With the simple model
architecture above,
it ends up with an
accuracy around 70 to 75%.
And that means that given
a sequence of words,
it will pick the correct word
right about 70% of the time.
If it gets a sequence of words
it hasn't previously seen,
it can make a rough
prediction for what
the next word could be.
So to get it to generate text,
we can seed it with some words
and predict the next value.
We'll add that to
our string of words
and get it to predict the
next value, and so on.
And here's the code for that.
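In sketch form, assuming a seed string and a fixed number of words to generate, that loop might look like this:

seed_text = "I made a poetry machine"  # the seed phrase mentioned below
next_words = 20                        # how many words to generate; any number works

for _ in range(next_words):
    # Tokenize and pad the current seed the same way as the training data.
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    # Pick the most likely next word and append it to the seed.
    predicted = np.argmax(model.predict(token_list), axis=-1)[0]
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            seed_text += " " + word
            break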
And when seeded with the words,
"I made a poetry machine,"
I got the following
sequence generated for me.
It's not bad, though
if anybody can
explain "shed love
raw boo," please let
me know in the comments below.
Experiment with
different architectures
and run times, and let me
know what you come up with.
Now, that brings us to the
end of this series on NLP.
I hope you've enjoyed it,
and if you want more, please
let us know in the
comments, and don't
forget to hit that
Subscribe button for more
great TensorFlow content.
Thank you.
[MUSIC PLAYING]
