In this video we're going to look at a
structure called word2vec. It takes
advantage of the following intuition: if
you don't know a word you can probably
guess what it means from its neighboring
words. It's going to use the neighboring
words of a target word to describe its
meaning, and then represent that meaning
as a vector of features. So let's
talk about meaning. Sometimes as humans
we find words that we don't understand
that we've never seen before and we have
to try to guess what they could possibly
mean. For example, ongchoi. What does the
word ongchoi mean? If you've never heard
it before and I just tell you that, then
you probably wouldn't have a clue about
what it means. But what if I give you the
word ongchoi in context, with sentences
like Ongchoi is delicious sauteed with
garlic, Ongchoi is superb over rice, or
ongchoi leaves with salty sauces, and so
forth. If I ask you again what does
ongchoi mean, maybe you won't know exactly,
but you might have an idea: maybe it's
some kind of food, maybe it's some kind
of leaf, maybe it's something that you
eat warm because you have to saute it.
We don't really know what it means but
again we would assume that it's something
similar to spinach, to kale, to leafy
vegetables in general, because the word
ongchoi appears with garlic, with rice,
with leaves, and this is something that
the word spinach also does: the word
spinach frequently occurs with words like
garlic, rice, and leaves. Indeed, this is
ongchoi, also known as water spinach or
morning glory. It's
absolutely delicious. Here we had a
target word that was ongchoi and we were
trying to figure out what that target
word means. And we also have a context
for the word: words that frequently
co-occur with ongchoi, words such as
delicious, sauteed, garlic, and so forth.
So maybe we don't know the target word,
but we can describe it using its
neighboring words, the words that occur
before it or after it in context.
Let's take this idea and see
where it takes us. Here we have an
example of a sentence, Natural language
processing and machine learning is fun
and exciting, and we want to figure out
what those words mean: natural, language,
processing, and, and so forth, marked here
in orange. In order to look at the
context, we first have to decide how much
context we want to look at: maybe we want
to look at 20 words around us, or five
words around us. Let's take a conservative
size and look at four words around us,
two preceding words and two following words.
This method is gonna be called skip-gram,
because we're gonna take the target and
then skip through the text, moving from
word to word to look at its context.
Let's look at row number five, machine. So
the word machine is preceded by the
words processing and and. And the word
machine is followed by the words
learning and is. These are the words that
are going to occur in the window of
machine, the target word. Machine is going
to be the target, and the four words
around it (processing, and, learning, is)
are going to be the context. Let's look at
row number one, the word natural, natural
is surrounded by two words that follow
it, language and processing, but by no
words that precede it because there were
no previous words in the sentence. So in
that case, the window is just going to
include the words language and processing.
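To make this concrete, here is a minimal sketch in Python (my own illustration, not code shown in the video) that slides this window of two preceding and two following words over the example sentence:

```python
# Extract each target word and its context window from the example sentence.
sentence = ("natural language processing and machine "
            "learning is fun and exciting").split()

window = 2  # two preceding words and two following words
for i, target in enumerate(sentence):
    # Slice the window, clipping at the start and end of the sentence.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(target, "->", context)

# Row 1: natural -> ['language', 'processing']
# Row 5: machine -> ['processing', 'and', 'learning', 'is']
```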
So from here, we can make a series of
vectors. On the left, we have very simple
vectors laid out as columns; the first one,
the one in orange, is called the target
embedding. Target because it refers to the
target word, the word we are currently studying.
So for natural, this vector is going to
have the value of 1, because we are
talking about the word natural, it's
going to have the value of 0 for
everything else because language is not
the word natural, processing is not the
word natural, 0 0 0 and so forth. In the
matrix on the right, the one for row
number 5, or the word machine, we have the
target embedding 0 0 0 0 1 0 0 0 0: we
have a 1 in the fifth position because this
refers to the word machine and not to the
other words.
And by the way, take a look at the far
left at the words that we're looking at:
natural, language, processing, and, machine,
learning, is, fun, exciting. These are the
unique tokens that we have in the text.
We do not count and twice, so the rows
in these matrices are the unique words
that we find in the document we're
looking at. So again the target
embeddings just have a 1 or a 0. 1 if
it's the word we're looking at and 0 for
everything else.
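As a small sketch (again my own illustration, assuming the nine unique tokens above), building these one-hot target vectors in Python could look like this:

```python
# Build one-hot target vectors over the unique tokens of the document.
tokens = ("natural language processing and machine "
          "learning is fun and exciting").split()
vocab = list(dict.fromkeys(tokens))  # unique words; "and" is counted once
word2idx = {w: i for i, w in enumerate(vocab)}

def target_vector(word):
    # 1 for the target word itself, 0 for every other vocabulary word.
    vec = [0] * len(vocab)
    vec[word2idx[word]] = 1
    return vec

print(target_vector("natural"))  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(target_vector("machine"))  # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```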
We also have context vectors. These are
going to have a 1 if this word appears
in the window, or the context of the
target word. So in the matrix for
natural on the left, we have the first
context vector, which has a 1 for language
and 0 for everything else. This means that
the word language is contained in the
context of natural. The second vector for
the context of natural is 0 0 1 0 0 0 0 0 0.
This is because this contains the word
processing, and so those two context
vectors describe the words in the
context of natural. For the matrix on the
right, we have four context vectors: one
that has a 1 for processing, one for and,
one for learning, and one for is, with
everything else set to 0. So for every
word we have the target vector which
tells us what word we're talking about
(natural or machine), and we have the
context vectors, which tell us whether each
word exists in the context of the target
word or not.
So if we squash together our context
vectors, we would have one vector that
tells us whether each word has ever
appeared in the context window of the
target. These are the context embeddings.
In the case of natural, we have again the
target vector, the orange one, which just
has a 1 for natural, and then the context
embedding, which is 0 1 1 0 0 0 0 0 0.
It has a 1 for language because language
is a word that is found in the window
around natural. It also has a 1 for
processing because processing is found
in the window around natural. Let's
turn to the matrix on the right. The
context embedding for the word machine
is 0 0 1 1 0 1 1 0 0. This is because
there are some words that are found in
the context of machine: processing, and,
learning, is. Those have a value of 1.
There are some words that are not found
in the context of machine: natural,
language, machine, fun, exciting. Those
are marked with a 0.
So the context embedding tells you which
words are in the context of the target.
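Continuing the sketch from before (still my own illustration), these squashed context embeddings could be computed like this:

```python
# Mark every vocabulary word that appears in a window around the target.
tokens = ("natural language processing and machine "
          "learning is fun and exciting").split()
vocab = list(dict.fromkeys(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}

def context_embedding(word, window=2):
    # 1 if the vocabulary word occurs in any window around the target word.
    vec = [0] * len(vocab)
    for i, t in enumerate(tokens):
        if t != word:
            continue
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for c in neighbors:
            vec[word2idx[c]] = 1
    return vec

print(context_embedding("natural"))  # [0, 1, 1, 0, 0, 0, 0, 0, 0]
print(context_embedding("machine"))  # [0, 0, 1, 1, 0, 1, 1, 0, 0]
```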
So we can build an entire matrix out of
these. For example, a target matrix which
is all of the target vectors. This
example has only two: natural and machine.
So the first column of the target matrix
is for the word natural, and it has a 1
for natural because it is the word natural,
and the second column is for the word
machine, and it has a 1 only for machine.
These are the target words. Then we have
the matrix for the context words, which
again tells you which words appeared in
the context of something. So the first
column corresponds to the context of
natural, which is the words language
and processing, and the second column
corresponds to the context of machine,
which is processing, and, learning, is. So
we have target vectors and context
vectors. The contexts describe which words
co-occur with the targets. We're also
going to throw in a few words that were
not in the context so that the computer
can get negative examples, so that the
computer can also see things that maybe
were not there. We're gonna choose them
at random from the text. And for example
we could have the word sushi in there.
Let's imagine that the document did
include the word sushi at some point, so
we have a new column for the target
matrix, the column for sushi, which has a
1 for sushi, telling us that we're talking
about the word sushi. We also have a new
row in the context matrix; this one is 0
and 0, because sushi never appeared in the
context of natural or in the context of
machine. But we
have it there so that the computer can
learn from positive examples and from
negative examples.
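As a rough sketch of that idea: the helper below simply picks random words outside the target's context, as described in the video (real word2vec implementations draw negatives from a smoothed word-frequency distribution, but that detail is not needed here):

```python
import random

# Draw k random words that are NOT in the target's context, to serve as
# negative examples alongside the positive ones.
vocab = ["natural", "language", "processing", "and", "machine",
         "learning", "is", "fun", "exciting", "sushi"]

def negative_samples(target, context, k=2):
    candidates = [w for w in vocab if w != target and w not in context]
    return random.sample(candidates, k)

context_of_machine = ["processing", "and", "learning", "is"]
print(negative_samples("machine", context_of_machine))  # e.g. ['sushi', 'fun']
```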
And now what we need to do is to come up
with some vector, with some series of
numbers that connects these two matrices.
For example, if we give it the target
matrix, we need to come up with some
transformation that gives us the context
matrix. Likewise if we give it the
context matrix, we need some
transformation that gives us the target
matrix. Those numbers that generate the
transformation are going to be called a
word2vec matrix. How do we calculate
this? If you have not taken machine
learning or a class with neural networks,
please ignore this for now and we'll
come back to it in week 6. If you have
taken machine learning, it's basically a
neural network. It has a hidden layer, and
so it has input neurons, weights into the
hidden layer, weights into an output layer,
and this neural network is trained until
we get good predictions. Predictions of
what? It could go either way. There is one
way of making word2vec which is called
CBOW, continuous bag of words: CBOW takes
the context as input, trains to generate
the target vectors, and then the weights
between the input layer and the hidden
layer are the word2vec values. So for CBOW,
you have the context as input, you try to
generate the target as the output, and you
use the weights as the word2vec values. In
skip-gram, the one on the right, you take
the target vectors and try to predict the
context vectors. It is the same structure,
and the weights between the input layer
and the first hidden layer are the word2vec
values. So we can train the target to
predict the context, or the context to
predict the target, and the weights that
we get are the word2vec values.
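For readers who have taken machine learning, here is a toy skip-gram trainer in plain numpy. It is only a sketch of the idea (a full softmax instead of negative sampling, and a made-up embedding size), not the implementation behind any particular word2vec release:

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ("natural language processing and machine "
          "learning is fun and exciting").split()
vocab = list(dict.fromkeys(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 10  # vocabulary size; toy embedding size

# Skip-gram training pairs (target index, context index), window of two.
pairs = [(word2idx[t], word2idx[c])
         for i, t in enumerate(tokens)
         for c in tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]]

W_in = rng.normal(scale=0.1, size=(V, D))   # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output weights

lr = 0.1
for epoch in range(200):
    for t, c in pairs:
        h = W_in[t]                      # hidden layer = the target's row
        scores = h @ W_out               # one score per vocabulary word
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()             # softmax: P(context word | target)
        grad = probs.copy()
        grad[c] -= 1.0                   # cross-entropy gradient on the scores
        grad_in = W_out @ grad
        W_out -= lr * np.outer(h, grad)  # update hidden -> output weights
        W_in[t] -= lr * grad_in          # update the target word's row

# The rows of W_in are the word2vec values: one vector per vocabulary word.
print(W_in[word2idx["machine"]])
```

For CBOW you would do the reverse: average the context words' rows to form the hidden layer and predict the target word.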
A word2vec matrix looks something like
this. This is the first line of a
Spanish-language word2vec trained on
Wikipedia text, and as you can see, it's a
huge series of weight values for a neural
network. These vectors vary in size:
sometimes word2vec embeddings are as small
as 50 features, which is roughly the
smallest size at which they still work
well. Sometimes they have 200 features,
sometimes 300. Beyond 300 features, the
performance does not really improve, so
that is about the largest size you will see.
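If you want to read such a file yourself, a common plain-text layout (an assumption here, and the filename is hypothetical) is one word per line followed by its weight values:

```python
# Read a plain-text word2vec file: each line is a word and its weights.
vectors = {}
with open("wiki.es.vec", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            continue  # skip an optional "vocab_size dimensions" header line
        vectors[parts[0]] = [float(x) for x in parts[1:]]

print(len(vectors["de"]))  # e.g. 300 weight values for the word de
```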
This first line of the Spanish word2vec
is the word de, which means of, and you
can see its different weights there. The
second line is the word la, which means
the, among other things; as you can see,
it's a different set of weights. One
interesting property of word2vec is that
the vectors are very dense: they all have
values, and almost none of them are zeros.
In theory, we could represent a word with
zeros and ones, like we were doing in our
target matrices and our context matrices.
For example, even our system from the
first video: we have sushi, Hanover, and
origami. So we could represent those
words as 1 0 0, 0 1 0
or 0 0 1. This system, by the way, would
be called a one-hot encoding. However,
doing it like this has two disadvantages:
first, these vectors are very sparse,
they're full of zeros. Second, they tell
you that the word exists, that this is the
word sushi, but they don't tell you much
else. They don't give you information
about the meaning of the word, which
would be something very desirable. Word2vec
is very dense, and because of its
connections, it captures some of the
meaning of the words, some of the
semantics of the words, because when you
are training it, it takes into account all
of a word's neighbors. So ongchoi, spinach,
and kale are gonna have very similar
neighbors, and so they're gonna have very
similar word2vec values. In summary, a
word2vec is a structure of anywhere from
50 to 200-300 features, and it describes
the meaning of a word based on its
neighboring words. As we shall see in the
next video, this has very interesting
properties. Indeed, the weights for words
like ongchoi, spinach, and kale are very
similar, so we're going to be able to
capture a lot of meaning, and a lot of
analogies between words, using word2vec.
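If you want to try this yourself, one option (my suggestion; the video does not name a toolkit) is the gensim library. The parameter names below are for gensim 4.x:

```python
from gensim.models import Word2Vec

# Train a tiny skip-gram model on the example sentence (sg=1 selects
# skip-gram; sg=0 would select CBOW). Real models need far more text.
sentences = [["natural", "language", "processing", "and", "machine",
              "learning", "is", "fun", "and", "exciting"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

print(model.wv["machine"])               # the 50 weight values for "machine"
print(model.wv.most_similar("machine"))  # nearest words by vector similarity
```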
