[MUSIC]
Stanford University.
>> Alright!
Hello, everybody.
Welcome to lecture three.
I'm Richard, and today we'll talk
a little bit more about word vectors.
But before that, let's do three
little organizational items.
First we'll have our first
coding session this week.
Next, the problem set one has
a bunch of programming for
you, as the first and only one where
you will do everything from scratch.
So, do get started early on it.
The coding session is mostly
to help you chat with other
people and go through small bugs.
Make sure you have everything set
up properly, your environments and
everything, so you can get into the
exciting deep learning parts right away.
Then there's the career fair,
the computer science forum.
It's there to help you find companies
to work at, and talk about your career.
And then my first project advice office
hour is today. I'll just grab a quick
dinner after this and then I'll be back
here in the Huang basement to chat.
Mostly about projects, so we encourage you
to think about your projects early and so
we'll start that today.
Very excited to chat with you. If you wanna
just bounce off ideas in the beginning,
that will be great.
Any questions around organization, yes.
I think just like outside,
yeah like, you can't miss it,
like right here in front of the class.
Any other organizational questions?
Yeah.
He will hold office hours too.
And we have a calendar on the website, and
you can find all our office
hours on the calendar.
Okay.
We'll fix that.
We'll add the names of who's doing
the office hours, especially for
Chris and me. All right, great.
So we'll finish word2vec.
But then where it gets
really interesting is,
we'll actually ask what
word2vec really captures.
We have these objective
functions we're optimizing.
And we'll take a bit of a look and
analyze what's going on there.
And then we'll try to actually
capture the essence of word2vec,
a little more effectively.
And then also look at our first analysis,
of intrinsic and
extrinsic evaluations for word vectors.
So, it'll be really exciting.
By the end, you actually have a good sense
of how to evaluate word vectors, and
you have at least two methods under
your belt on how to train them.
So let's do a quick review of word2vec.
We ended with this following equation
here, where we wanted to basically predict
the outside vectors from the center word,
and
so let's just recap really
quickly what that meant.
So let's say I have the beginning of
a corpus, and it says something like,
I like deep learning
and NLP.
Now, what we were gonna do is, we
basically wanna compute the probability.
Let's say we start with these word
vectors, and this is our first center word,
and that's deep.
So, we wanna first compute
the probability of
the first outside word,
I, given the word deep, and
that was something like
the exponential here of u_o transpose v_c.
So the u vector is the outside word and
so that's,
in our case, u of I here, transposed, times v of deep.
And then we had this big sum here and
the sum is always the same,
given a certain v_c.
So that is the center word.
Now, how do we get this V and this U?
We basically have a large matrix here,
with all the different word vectors for
all the different words.
So it starts with vector for aardvark.
And a and so on,
all the way to maybe the vector for zebra.
And we had basically all
our center words v in here.
And then we have one large matrix,
where we have again, all the vectors
starting with aardvark and A,
and so on, all the way to zebra.
And when we start in our first window
through this corpus, we basically
take that vector for deep here,
this vector V, plug it in here, and
then we wanna maximize this probability.
And now, we'll take the vectors for
U for all these different words like I,
like, learning, and and.
So the next thing would be u for like, or
the probability of like given deep.
And that'll be the exponential of
u of like, transposed, times v of deep.
And again, we have to divide by this
pretty large sum over
the entire vocabulary.
So, it's essentially little
classification problems all over.
So that's the first window of this corpus.
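
As a rough illustration of the probability just described, here is a minimal NumPy sketch. The tiny vocabulary, the 4-dimensional vectors, and the names `V`, `U`, and `p_outside_given_center` are assumptions made for this example, not code from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "like", "deep", "learning", "and", "NLP"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d = 4  # toy dimensionality; the lecture talks about e.g. 100

# Two matrices of column vectors: V for center words, U for outside words.
V = rng.uniform(-0.1, 0.1, size=(d, len(vocab)))
U = rng.uniform(-0.1, 0.1, size=(d, len(vocab)))

def p_outside_given_center(outside, center):
    """Softmax probability p(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)."""
    v_c = V[:, word_to_id[center]]
    scores = U.T @ v_c                 # one inner product per vocabulary word
    scores = scores - scores.max()     # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[word_to_id[outside]]

p_I = p_outside_given_center("I", "deep")
p_like = p_outside_given_center("like", "deep")
```

The denominator is the same for every outside word once the center word is fixed, which is why the sketch computes all scores once and then indexes into the result.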
Now, when we move to the next window,
we basically move one over.
And now the center word is learning, and
we wanna predict these outside words.
So now we'll take for this next,
the second window here.
This was the first window,
the second window.
We'll now take the vector V for learning
and the U vector for like, deep and NLP.
So that was the skip-gram model that
we talked about in the last lecture,
just explained again
with the same notation.
But basically, you take one window
at a time, you move that window and
you keep trying to predict
the outside words
next to the center word.
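
The window-by-window procedure can be sketched in a few lines of Python. The helper name `skipgram_pairs` is hypothetical, and unlike the walkthrough above, which starts at deep, this sketch also emits the truncated windows at the edges of the corpus.

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (center, outside) pairs for every window position."""
    for c, center in enumerate(tokens):
        lo = max(0, c - window_size)
        hi = min(len(tokens), c + window_size + 1)
        for o in range(lo, hi):
            if o != c:
                yield center, tokens[o]

tokens = "I like deep learning and NLP".split()
pairs = list(skipgram_pairs(tokens, window_size=2))
deep_pairs = [o for c, o in pairs if c == "deep"]
learning_pairs = [o for c, o in pairs if c == "learning"]
```

Each pair is one of the little classification problems: the center word's v vector is used to predict the outside word's u vector.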
Are there any questions around this?
Cuz we'll move on. Yep?
That's a good question, so
how do you actually develop that?
You start with all the numbers,
all these vectors are just random.
Little small random numbers, often sampled
uniformly between two small numbers.
And then, you take the derivatives with
respect to these vectors in order to
increase these probabilities.
And you essentially take the gradient
here, of each of these windows with SGD.
And so, when you take the derivatives that
we went through in the last lecture,
with respect to all these different
vectors here, you get this very,
very large sparse update.
Cuz all your parameters
are essentially all the word vectors.
And basically these two matrices,
with all these different column vectors.
And so let's say you have
100 dimensional vectors, and
you have a vocabulary of
let's say 20,000 words.
So that's a lot of different
numbers that you have to optimize.
And so these updates are very, very large.
But, they're also very
sparse cuz each window,
you usually only see five words
if your window size is two.
Yeah?
>> [INAUDIBLE]
>> That's a good question.
We'll get to that once we look at
the evaluation of these word vectors.
This cost function is not
convex, so it does matter.
Sorry, I should repeat all the questions,
sorry for the people on the video.
So the first question was,
how do we choose the dimensionality?
We'll get to that very soon.
And then, this question here.
Was how do we start?
And how much does it matter?
It turns out most of the objective
functions, pretty much all of them in
this lecture, are not convex, and
so initialization does matter.
And we'll go through tips and
tricks on how to circumvent getting
stuck in very bad local optima.
But it turns out in practice,
as long as you initialize with small
random numbers especially in these word
vectors, it does not tend to be a problem.
All right, so we basically run SGD,
it's just a recap of last lecture.
Run SGD, we update now our
cost function here at each
window as we move through the corpus,
right?
And so when you think about these updates
and you think about implementing that,
which you'll do very soon for
problem set one, you'll realize well,
if I have this entire matrix,
this entire vector here, sorry.
This vector of all these
different numbers and
I explicitly actually
keep around these zeros,
you have very, very large updates, and
you'll run out of memory very quickly.
And so what instead you
wanna do is either have very
sparse matrix operations where
you update only specific columns.
For this second window, you only have
to update the outside vectors for
like, deep and NLP and
inside vector for learning.
Or you could also implement this as
essentially a hash where you have keys and
values.
And the values are the vectors,
and the keys are the word strings.
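
A minimal sketch of the hash-based variant, assuming plain Python dictionaries of NumPy vectors. The gradients below are random placeholders standing in for the real derivatives, and `apply_sparse_update` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
vocab = ["I", "like", "deep", "learning", "and", "NLP"]

# Keys are word strings, values are the vectors (small random initial values).
center_vecs = {w: rng.uniform(-0.01, 0.01, d) for w in vocab}
outside_vecs = {w: rng.uniform(-0.01, 0.01, d) for w in vocab}

def apply_sparse_update(table, grads, lr=0.05):
    """Update only the vectors that actually received a gradient in this window."""
    for word, grad in grads.items():
        table[word] -= lr * grad

# Placeholder gradients for the second window (center word "learning");
# in a real implementation these come from the derivatives of the cost.
center_grads = {"learning": rng.normal(size=d)}
outside_grads = {w: rng.normal(size=d) for w in ["like", "deep", "and", "NLP"]}

untouched_before = center_vecs["I"].copy()
learning_before = center_vecs["learning"].copy()
apply_sparse_update(center_vecs, center_grads)
apply_sparse_update(outside_vecs, outside_grads)
```

Nothing is ever allocated for the words that do not appear in the window, so the zeros of the full update are never stored.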
All right, now, when I told you
this is the skip-gram model,
I actually kind of lied a little bit
to teach it to you one step at a time.
It turns out when you do
this computation here,
the upper part is pretty simple, right?
This is just
the hundred-dimensional vector, and
you multiply that with another
hundred-dimensional vector.
So that's pretty fast.
But at each window, and again you
go through an entire corpus, right?
You do this one step at a time,
one word at a time.
And for each window,
you do this computation.
And you do also this gigantic sum.
And this sum goes over
the entire vocabulary.
Again, potentially 20,000 maybe
even a million different words
in your whole corpus.
All right, so each window,
you have to make 20,000 times
this inner product down here.
And that's not very efficient.
And it turns out,
you also don't teach the model that much.
At each window you say, deep learning, or
learning does not co-occur with zebra.
It does not co-occur with aardvark.
It does not co-occur
with 20,000 other words.
And it's kind of repetitive, right?
Cuz most words don't actually appear with
most other words, it's pretty sparse.
And so the main idea behind skip-gram is a
very neat trick, which is we'll just train
a couple of binary logistic
regressions for the true pairs.
So we keep this idea of
wanting to optimize and
maximize this inner product of
the center word and the outside words.
But instead of going through all,
we'll actually just take
a couple of random words and say
that these random words from
the rest of the corpus don't co-occur.
And this leads us to the original
objective function of the skip-gram model,
which, as a software
package, is often called word2vec.
And the original paper title was
Distributed Representations of Words and
Phrases and their Compositionality.
And so the overall objective
function is as follows.
Let's walk through this slowly together.
Basically, you go again
through each window.
So T here corresponds to each window
as you go through the corpus,
and then we have two terms here.
The first one is essentially just a log
probability of these two center words and
outside words co-occurring.
And so the sigmoid here is
a simple element-wise function.
We'll become very good friends.
We'll use the sigmoid function a lot.
You'll have to really be able to
take derivatives of it and so on.
But essentially what it does,
it just takes any real number and
squashes it to be between zero and one.
And that's, for you deep learning people,
good enough to call it a probability.
If you're reading statistics,
you wanna have proper measures and so on,
so it's not quite that much, but
it's a number between zero and one.
We'll call it a probability.
And then we basically can call this
here a term that we basically wanna
maximize the log probability of
these two words co-occurring.
Any questions about the first term?
This is very similar to before, but
then we have the second term here.
And the original description
was this expected value here.
But really, we can have some clear
notation that essentially just shows that
we're going to randomly subsample
a couple of the words from the corpus.
And for each of these,
we will essentially try to minimize
their probability of co-occurring.
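
Written out, the two terms take the following form. This is a sketch in the notation used so far (u for outside vectors, v for center vectors), where k stands for the number of randomly sampled words and P(w) for the distribution they are drawn from.

```latex
J_t(\theta) = \log \sigma\!\left(u_o^{\top} v_c\right)
            + \sum_{j=1}^{k} \log \sigma\!\left(-u_j^{\top} v_c\right),
\qquad j \sim P(w),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

The first term rewards a large inner product for the true pair, and each term in the sum rewards a small inner product for a sampled pair.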
And so one good exercise for you,
in preparation for midterms
and whatnot, is to prove to
yourself that sigmoid
of minus x is the same as
one minus sigmoid of x.
That is a nice little quick
proof to get into the zone.
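
Before attempting the proof, a quick numerical sanity check can be reassuring. This short sketch only checks the identity on a grid of points, so it is no substitute for the derivation itself.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-10.0, 10.0, 201)
# Largest difference between sigmoid(-x) and 1 - sigmoid(x) on the grid.
max_gap = np.max(np.abs(sigmoid(-xs) - (1.0 - sigmoid(xs))))
```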
And so basically this is one
minus the probability of this.
So we'd subsample a couple of random words
from our corpus instead of going through
all the different ones saying
an aardvark doesn't appear.
Zebra doesn't appear with learning and
so on.
We'll just sample five, or ten, or so,
and then we minimize their probabilities.
And so usually, and
this is again a hyperparameter, one that
we'll have to evaluate how much it matters,
we will take k negative samples for
the second part here of the objective
function for each window.
And then we minimize the probability
that these random words appear
around the center word.
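
A minimal sketch of the per-window objective with k negative samples, assuming NumPy. The function name `neg_sampling_objective` and the toy vectors are made up for illustration; the two calls at the end only contrast a well-aligned configuration with a badly aligned one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, U_neg):
    """log sigma(u_o . v_c) + sum_j log sigma(-u_j . v_c), to be maximized.

    v_c:   center vector, shape (d,)
    u_o:   outside vector of the true pair, shape (d,)
    U_neg: outside vectors of the k sampled words, shape (k, d)
    """
    true_term = np.log(sigmoid(u_o @ v_c))
    noise_term = np.sum(np.log(sigmoid(-(U_neg @ v_c))))
    return true_term + noise_term

rng = np.random.default_rng(0)
d, k = 8, 5
v_c = rng.normal(size=d)
# True pair aligned with v_c, sampled words pointing the opposite way.
aligned = neg_sampling_objective(v_c, u_o=v_c, U_neg=-np.tile(v_c, (k, 1)))
# The reverse: true pair opposed, sampled words aligned.
opposed = neg_sampling_objective(v_c, u_o=-v_c, U_neg=np.tile(v_c, (k, 1)))
```

Each window now costs k + 1 inner products instead of one per vocabulary word.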
And then the way we sample them is
actually from a simple
unigram distribution here.
We basically look at how often do
the words generally appear, and
then we sample them based on that.
But we also take the power
of three-fourths.
It's kind of a hacky term.
If you play around with this model for
long enough, you say, well,
maybe it should more often sample some
of these rare words cuz otherwise,
it would very, very often sample the and
a and other stop words.
And would probably never, ever sample
aardvark and zebra in our corpus,
so you take this to
the power of three-fourths.
And you don't have to
implement this function,
we'll just give it to you cuz you kind
of have to compute the statistics
of how often each word
appears in the corpus.
But we'll give this to
you in the problem set.
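
To see what the three-fourths power does, here is a small sketch with made-up counts. The function name `negative_sampling_distribution` is hypothetical and is not the helper provided in the problem set.

```python
from collections import Counter
import numpy as np

def negative_sampling_distribution(tokens, power=0.75):
    """Unigram frequencies raised to `power`, then renormalized to sum to 1."""
    counts = Counter(tokens)
    words = sorted(counts)
    freqs = np.array([counts[w] for w in words], dtype=float)
    unigram = freqs / freqs.sum()
    smoothed = unigram ** power
    return words, unigram, smoothed / smoothed.sum()

# Made-up toy counts: "the" is frequent, "aardvark" is rare.
tokens = ["the"] * 90 + ["learning"] * 9 + ["aardvark"]
words, unigram, smoothed = negative_sampling_distribution(tokens)

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=10, p=smoothed)  # k = 10 sampled words
```

With these counts the rare word's sampling probability goes up relative to the plain unigram distribution and the frequent word's goes down, which is the intended effect.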
All right, so
any questions around the skip-gram model?
Yeah?
That's right, so the question is,
is it a choice of how to define p of w?
And it is a choice, you could do
a lot of different things there.
But it turns out a very simple thing,
like just taking the unigram distribution.
How often does this word
appear works well enough.
So people haven't really explored
more complex versions than that.
That's a good question.
Should we make sure that
the random samples here
aren't the same as exactly this word?
Yes, but it turns out that the probability
for very large corpora is so
tiny that the very, very few times that
ever happens is kind of irrelevant.
Cuz we randomly subsample so
much that it doesn't change.
Orders of magnitude for which part?
K, it's ten.
It's relatively small, and
it's an interesting trade-off that
you'll observe in actually
several deep learning models.
Often, as you go through the corpus,
you could do an update after each window,
but you could also say, let's go through
five windows, collect the updates, and
then make a step with a
mini-batch in your stochastic
gradient descent, and
we'll go through a lot of these kinds
of options later in the class.
All right, last question on skip-gram.
What does J_t(theta) represent?
It's a good question.
So theta is often a parameter that we
use for all the variables in our model.
So in our case here for
the skip-gram model,
it's essentially all the U vectors and
all the V vectors.
Later on, when
we use theta,
it might have other parameters of
the neural network, layers and so on.
And J is just our cost function and
t is the t-th time step or
the t-th window as we
go through our corpus.
So in the end, our overall objective
function that we actually optimize is
the sum of all of them.
But again, we don't wanna do one large
update of the entire corpus, right?
We don't wanna go through all the windows,
collect all the updates and
then make one gigantic step cuz that
usually doesn't work very well.
So, good question I think, last lecture
we talked a lot about minimization.
Here, we have these log probabilities and
in the paper you wanna maximize that.
And it's often very intuitive, right?
Once you have probabilities,
you usually wanna maximize the probability
of the actual thing that you
see in your corpus happening.
And then other times,
when we call it a cost function,
we wanna minimize the cost and so on.
All right, so in word2vec there's
another model,
which you won't have to implement
unless you want to get bonus points.
But we will ask you to take
derivatives of, and so
it's good to understand it at least
in a very simple conceptual level.
And it's very similar
to the skip-gram model.
Basically, we want to predict
the center word from the sum
of the surrounding words.
So very simply here, we sum up
the vector of and, of NLP, of deep, and
of like, and
we have the sum of these vectors.
And then we have some inner products
with just the vector of the inside.
And basically that's called
the continuous bag of words model.
You'll learn all about the details and
the definition of that in the problem set.
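
As a conceptual sketch only, the sum-then-inner-product idea might look like this in NumPy. Turning the inner products into probabilities with a softmax is an assumption of this example, as are the names `context_vecs`, `center_vecs`, and `cbow_center_probs`; the problem set's definition is the authoritative one.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "like", "deep", "learning", "and", "NLP"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d = 4
context_vecs = rng.uniform(-0.1, 0.1, size=(len(vocab), d))  # one row per surrounding word
center_vecs = rng.uniform(-0.1, 0.1, size=(len(vocab), d))   # one row per candidate center word

def cbow_center_probs(context_words):
    """Softmax over inner products of the summed context vector with every center vector."""
    h = context_vecs[[word_to_id[w] for w in context_words]].sum(axis=0)
    scores = center_vecs @ h
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = cbow_center_probs(["like", "deep", "and", "NLP"])
p_learning = probs[word_to_id["learning"]]
```

Compared with skip-gram, the direction of prediction is reversed: the surrounding words jointly predict the center word.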
So what actually happens when we
train these word vectors, right?
We optimize this objective function and
we take gradients and
after a while, something kind of
magical happens to these word vectors.
And that is that they actually start to
cluster around similar kinds of meaning,
and sometimes also similar
kinds of syntactic functions.
So when we zoom in, and again, this is,
usually these vectors are 25 to even
500 or thousand dimensional, this is just
a PCA visualization of these vectors.
And what we'll observe is that Tuesday and
Thursday and
weekdays cluster together,
number terms cluster together,
first names cluster together and so on.
So basically, words that appear
in similar contexts turn out to
often have similar meaning, as
we discussed in the previous lecture.
And so
they essentially get similar vectors
after we train this model for
a sufficient number of steps.
All right, let's summarize word2vec.
Basically, we went through
each word in the corpus.
We looked at the surrounding
words in the window.
We predict the surrounding words.
Now, what we are essentially
doing there is
trying to capture
the co-occurrence of words.
How often does this word
co-occur with the other word?
And we did that one count at a time.
It's like, I see the deep and
learning happen.
I make an update to both of these vectors.
And then you go over the corpus and then
you probably will eventually see deep and
learning co-occurring again and
you make again a separate update step.
When you think about that,
it's not very efficient, right?
Why not just go through the entire corpus
once, count how often deep and
learning co-occur, how often these two
words co-occur, and then make one
update step that captures the entire
count instead of one sample at a time?
And, yes we can do that and
that is actually a method that
came historically before word2vec.
And there are different
options of how we can do this.
The simplest one or
the one that is similar to word2vec at
least is that we again use a window around
each word and we basically just
go through the entire corpus.
We don't update anything,
we don't do any SGD.
We just collect the counts first.
And once we have the counts,
then we do something to that matrix.
And so when we look at just
the window of length maybe two,
like in this example here, or maybe five,
some small window size around each word,
what we'll do is we'll capture,
not just the semantics, but
also some of the syntactic
information of each word.
Namely, what kind of
part of speech tag is it.
So verbs are going to be
closer to one another
than the verbs are to nouns, for instance.
If, on the other hand, we look at
co-occurrence counts that aren't just
around the window, but entire document,
so I don't just look at each window.
But I say, this word appears with all
these other words in this entire Wikipedia
article, for instance, or
this entire Word document.
Then, what you'll capture is actually
more topics, and this is often
called Latent Semantic Analysis,
a big popular model from a while back.
And basically what you'll get there is,
you'll ignore the part of
speech, you'll ignore any kind of
syntactic information and just say,
well swimming and boat and
water and weather and the sun,
they all kind of appear in this topic
together, in this document together.
So we won't go into too many details for
these cuz it turns out that for
a lot of other downstream tasks,
like machine translation and so on,
we really want to use these windows,
but it's good knowledge to have.
So let's go over a simple example of
what we would do if we had a very small
corpus and wanna collect these windows and
then compute word vectors from that.
So it is technically not cosine cuz we
are not normalizing over the length, and
technically we are not optimizing inner
products of these probabilities and so on.
But continue.
That's right.
So the question is,
in all these visualizations here,
we kind of look at Euclidean distance.
And it's true, we're actually often
going to use inner-product
kinds of similarities.
So yes, in some cases, Euclidean
distance works reasonably well still,
despite not doing this. In fact, we'll see
one evaluation that is entirely based or
partly based on Euclidean distances and
partly inner products.
So it turns out both work well despite
our objective function only having this.
And even more surprising there're a lot
of things that work quite well on this
despite starting with this
kind of objective function.
So yeah, despite having
only this inner product optimization,
we will actually also do often very
well in terms of Euclidean distances.
Yep.
Well, it gets complicated, but there
are some interesting relationships between
the ratios of the co-occurrence counts.
We don't have enough time to dive into
the details, but if you are interested
in that, I will
mention the title of a paper in
five or ten slides that will help
you understand that a little better and
gain some more intuition. Yep.
All right, so,
window based co-occurrence matrices.
So, let's say,
we have this corpus here, and
let's define our window length
as just 1, for simplicity.
Usually, we more commonly have
window sizes of 5 to 10.
And we assume we have
a symmetric window so,
we don't care if a word is to the left or
to the right of our center word.
And we have this corpus.
So, this is essentially what a window
based co-occurrence matrix would be, for
this very, very simple corpus.
We just look at the word I and then,
we look at which words appear next to I.
And so, we look at I, we see like
twice so, we have number two here.
And we see enjoy once so,
we put the count one here.
And then, we know we have the word like.
And so, like co-occurs twice
with the word I on its left and
once with deep and once with NLP.
And so, this is essentially we go through
all the words in a very large corpus and
we compute all these counts, super simple.
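
Collecting the counts might look like the sketch below. The three-sentence corpus is an assumption, chosen so that it reproduces the counts read off above, and `cooccurrence_matrix` is a hypothetical helper name. Punctuation is left out and windows do not cross sentence boundaries.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=1):
    """Symmetric window-based co-occurrence counts within each sentence."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    X[index[word], index[sent[j]]] += 1
    return vocab, index, X

# Assumed toy corpus, consistent with the counts described in the lecture.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
sentences = [s.split() for s in corpus]
vocab, index, X = cooccurrence_matrix(sentences, window=1)
```

Each row of `X` is the raw count vector for one word.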
Now, you could say, well,
that's a vector already, right?
You have a list of numbers here and that
list of numbers now represents that word.
And you already kinda capture things like,
well, like and enjoy have some overlap so,
maybe they're more similar.
So, you already have a word vector, right?
But now, it's not a very ideal word
vector for a couple of reasons.
The first one is,
if you have a new word in your vocabulary,
that word vector changes.
So, if you have some downstream machine
learning models that take that
vector as input, they always have to change
and there's always some parameter missing.
Also, this vector is going
to be very high-dimensional.
Of course, for this tiny corpus,
it's small but generally,
we'll have tens of thousands of words.
So, it's a very high-dimensional vector.
So, you'll have sparsity issues if you
try to train a machine learning model
on this afterwards, and that results in
much less robust downstream models.
And so, the solution to that is, let's again
have a similar idea to word2vec and
not store all of the co-occurrence
counts, every single number,
but just store most of
the important information
in a fixed, small number of dimensions.
Similar to word2vec,
those will be somewhere around
25 to 1,000 dimensions.
And then, the question is okay,
how do we now reduce the dimensionality,
we have these very large
co-occurrence matrices here.
In the realistic setting, we'll have
20,000 by 20,000 or even a million by
a million, very large sparse matrix,
how do we reduce the dimensionality?
And the answer is we'll
just use very simple SVD.
So, who here is familiar with
singular value decomposition?
All right, good, the majority of people,
if you're not then,
I strongly suggest you go to the office
hours and brush up on your linear algebra.
But, basically, we'll have here
this X hat matrix, which is
going to be our best rank k approximation
to our original co-occurrence matrix X.
And we'll have basically these three
simple matrices with orthonormal columns.
U we often call also our left-singular
vectors and we have here S the diagonal
matrix containing all the singular
values usually from largest to smallest.
And we have our matrix V transpose here,
with orthonormal rows.
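
In symbols, with sigma_i the singular values and u_i, v_i the corresponding singular vectors, the decomposition and its rank-k truncation can be written as follows. "Best" here is meant in the least-squares (Frobenius norm) sense.

```latex
X = U S V^{\top},
\qquad
\hat{X}_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top},
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0
```

Keeping only the first k columns of U then gives a k-dimensional vector for each word.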
And so, in code,
this is also extremely simple,
we can literally implement this
in just a few lines, if we have,
this is our corpus here, and
this is our co-occurrence matrix X.
Then, we can simply run SVD with
one line of Python code and
then, we get this matrix U.
And now, we can take the first two
columns here of U and plot them, right?
And if we do this in the first two
dimensions here, we'll actually get
similar kinda visualization to all this
other ones I've shown you, right?
But this is a few lines of Python code
to create that kinda word vector.
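
A sketch of those few lines, assuming NumPy. The matrix holds window-size-1 counts for an assumed three-sentence corpus that matches the counts discussed earlier; the coordinates are printed rather than plotted to keep the example dependency-free.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
# Window-size-1 co-occurrence counts for the assumed corpus
# "I like deep learning", "I like NLP", "I enjoy flying".
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],   # I
    [2, 0, 0, 1, 0, 1, 0],   # like
    [1, 0, 0, 0, 0, 0, 1],   # enjoy
    [0, 1, 0, 0, 1, 0, 0],   # deep
    [0, 0, 0, 1, 0, 0, 0],   # learning
    [0, 1, 0, 0, 0, 0, 0],   # NLP
    [0, 0, 1, 0, 0, 0, 0],   # flying
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # the one-line SVD
coords = U[:, :2]                                  # first two columns of U

for word, (x, y) in zip(words, coords):
    print(f"{word:>8s}  {x:+.3f}  {y:+.3f}")
```

The signs of singular vectors are not unique, so a plot from a different SVD routine may come out mirrored while showing the same neighborhoods.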
And now, it's kinda reading tea leaves,
for none of these dimensions can we really
say, this dimension is the nounness or the verbness
of a word, or something like that.
But as you look at these long enough,
you'll definitely observe
some kinds of patterns.
So for instance, I and like are very
frequent words in this corpus and
they're a little further to the left so,
that's one.
Like and enjoy are nearest
neighbors in this space so
that's another observation,
they're both verbs, and so on.
So, the things that were being liked,
flying and
deep and other things
are closer together and so on.
So, with such a very simple method, you get
a first approximation to what word
vectors can and should capture.
Are there any questions around this SVD
method and the co-occurrence matrix?
It's a good question,
is the window always symmetric?
And the answer is no, we can actually
evaluate asymmetric windows and
symmetric windows, and I'll show you
the result of that in a couple of slides.
All right, now, once you realize, wow,
this is so simple and it works kinda well,
and you're a researcher, you always
wanna try to improve it a little bit.
And so, there are a lot of different hacks
that we can make to this co-occurrence
matrix.
So, instead of taking the raw counts, for
instance, as you do this, you realize,
well, a lot of representational power
in this word vectors is now captured
by the fact that the and he and
has and a lot of other very, very frequent
words co-occur with almost all the nouns.
Like the appears in the window of
pretty much every noun out there.
And it doesn't really give us that much
information that it does over and over and
over again.
And so, one thing we can do is
actually just cap it and say,
all right, whatever the word the co-occurs
with the most, and a lot of these other
function words,
we'll just cap the count at 100.
Or, I know some people do this also,
we just ignore a couple of the most
frequent words cuz they really,
we have a power law distribution or
Zipf's law where basically,
the most frequent words appear much,
much more frequently than other words and
then, it peters out.
And then, there's a very long tail of
words that don't appear that often but
those very rare words often
have a lot of semantic content.
Then, another way we can change this,
the way we compute these counts is by
not counting all the words equally.
So, we can say, well,
words that appear right next to my
center word get a count of one.
And words that appear
five steps away,
five words away, only
get a count of 0.5.
And so, that's another hack we can do.
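
One way to sketch this distance weighting in Python is shown below. The linear ramp between 1.0 and 0.5 is an assumption of this example; the lecture only gives the two endpoints, and the names `ramped_weight` and `weighted_cooccurrence` are hypothetical.

```python
from collections import defaultdict

def ramped_weight(distance, window=5):
    """Assumed linear ramp: 1.0 at distance 1, 0.5 at distance `window`."""
    if window == 1:
        return 1.0
    return 1.0 - 0.5 * (distance - 1) / (window - 1)

def weighted_cooccurrence(tokens, window=5):
    """Distance-weighted counts, stored sparsely as {(word, other): weight}."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += ramped_weight(abs(i - j), window)
    return counts

tokens = "a b c d e f".split()
counts = weighted_cooccurrence(tokens, window=5)
```

Storing the counts in a dictionary keyed by word pairs avoids materializing the full vocabulary-by-vocabulary matrix.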
And then, instead of counts, we could
compute correlations and set negative values to 0.
You get the idea, you can play a little
around with this matrix of co-occurrence
counts in a variety of different ways and
sometimes they help quite significantly.
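
Two of these modifications, capping the counts and replacing them with correlations clipped at zero, can be sketched with NumPy as follows. The counts are made up, and using Pearson correlations between rows is an assumption about which correlation is meant.

```python
import numpy as np

def cap_counts(X, max_count=100):
    """Clip every co-occurrence count at `max_count`."""
    return np.minimum(X, max_count)

def positive_correlations(X):
    """Pearson correlations between the rows of X, with negative values set to 0."""
    corr = np.corrcoef(X)
    return np.maximum(corr, 0.0)

# Made-up counts: the first row and column stand in for a very frequent function word.
X = np.array([
    [0.0, 5000.0, 4200.0, 3900.0],
    [5000.0, 0.0, 12.0, 1.0],
    [4200.0, 12.0, 0.0, 7.0],
    [3900.0, 1.0, 7.0, 0.0],
])
X_capped = cap_counts(X)
C = positive_correlations(X_capped)
```

Either matrix can then be fed to the SVD in place of the raw counts.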
So, in 2005, so quite a long time ago,
people used this SVD method and
compared a lot of different
ways of hacking the co-occurrence
matrix and modifying it.
And basically found quite surprising and
awesome results.
And so, this is another way
we can try to visualize
this very high dimensional space.
Again, these vectors are usually
around 100 dimensions or so, so
it's hard to visualize it.
And so, instead of projecting it down to
just 2D, here they just choose a couple of
words and look at the nearest
neighbours and which word is closest to
what other word and they find that wrist
and ankle are closest to one another.
And next closest word is shoulder.
And the next closest one is arm and so on.
And so different extremities cluster
together, we'll see different
cities clustering together, and American
cities are closer to one another than
cities from other countries, and country
names are close together, and so on.
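
A nearest-neighbour lookup of this kind can be sketched as below. The 2-D vectors are invented purely so the example runs; real vectors would come from the SVD of a large co-occurrence matrix, and `nearest_neighbors` is a hypothetical helper name.

```python
import numpy as np

def nearest_neighbors(query, words, vectors, top_n=3):
    """Rank all other words by Euclidean distance to the query word's vector."""
    q = vectors[words.index(query)]
    dists = np.linalg.norm(vectors - q, axis=1)
    order = np.argsort(dists)
    return [(words[i], float(dists[i])) for i in order if words[i] != query][:top_n]

# Invented 2-D vectors for illustration only.
words = ["wrist", "ankle", "shoulder", "Tuesday", "Thursday"]
vectors = np.array([
    [1.0, 1.0],
    [1.1, 0.9],
    [1.4, 1.3],
    [-2.0, 0.5],
    [-2.1, 0.7],
])
neighbors = nearest_neighbors("wrist", words, vectors)
```

Swapping the Euclidean distance for an inner product or cosine similarity only changes the line that computes `dists`.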
So it's quite amazing, right?
Even with something as simple
as SVD around these windows,
you capture a lot of different
kinds of information.
In fact it even goes to syntactic and
grammatical kinds of patterns that
are captured by this SVD method.
So show, showed, shown or
take, took, taken and so
on always appear together in
often similar kinds of patterns.
And it goes further and
even more semantic in the verbs that
are very similar and
related to these kinds of nouns.
Often appear even in roughly similar
kinds of Euclidean distances.
So, swim and swimmer, clean and janitor,
drive and driver, teach and teacher.
They all basically have a similar
kind of vector difference.
And intuitively you would
think well they appear,
they often have similar kinds
of context in which they appear.
And there's some intuitive sense of why
this would happen,
as you're trying to capture
these co-occurrence counts.
Does the language matter?
Yes, in what way?
Great question.
So if it was German instead of English.
So it's actually a sad truth of a lot of
natural language processing research that
the majority of it is in English.
And a few people do this.
It turns out this works for
a lot of other languages.
But people don't have as good
evaluation metrics often for
these other languages and evaluation
data sets which we'll get to in a bit.
But we would believe that it works for
pretty much all languages.
Now there's a lot of complexity because
some languages like Finnish or German have
potentially a lot of different words, cuz
they have much richer morphology, right?
German has compound nouns.
And so you get more and
more rare words, and
then the rarer the words are,
the less good counts you have of them,
and the harder it is to use
this method in a vanilla way.
Which eventually in the limit
will get us to character-based
natural language processing,
which we'll get to in a couple weeks.
But in general, this works for
pretty much any language.
Great question.
So now, what's the problem here?
Well SVD, while being very simple and
one nice line of Python code, is actually
computationally not always great,
especially as we get larger and
larger matrices.
So we essentially have this quadratic
cost here in the smaller dimension.
So either if it's a word by
word co-occurrence matrix or
even a word by document,
we'd assume this gets very, very large.
And then it also gets hard to
incorporate new words or documents
into this whole model cuz you have
to rerun this whole PCA or sorry,
the SVD, singular value decomposition.
And then on top of that SVD, and
how we optimize that is quite different
to a lot of the other downstream deep
learning methods that we'll use
like neural networks and so on.
It's a very different
kind of optimization.
And so the word2vec objective
function is optimized with SGD:
you look at one window at a time.
You make an update step.
And that is very similar to how we
optimize most of the other models in this
lecture and in deep learning for NLP.
And so basically what we
came up with, with a post-doc in
Chris' group, so
Jeffrey Pennington, me and
Chris, is a method that tries to
combine the best of both worlds.
So let's summarize what the advantages and
disadvantages are of these two
different kinds of methods.
Basically we have these count
based methods based on SVD and
the co-occurrence matrix.
And we have the window-based or
direct prediction methods
like the Skip-Gram model.
The advantage of SVD is that
it's relatively fast to train,
unless the matrix gets very,
very large but
we're making very efficient usage of
the statistics that we have, right?
We only have to collect the statistics
once, and we could in theory,
throw away the whole corpus.
And then we can try a lot of different
things on just these co-occurrence counts.
Sadly, when you do this,
it captures mostly word similarity,
and not various other patterns that
the word2vec model captures, and
we'll show you what
those are in evaluation.
And we give often disproportionate
importance to these large counts.
And we can try various ways
of lowering the importance
that these function words and
very frequent words have.
The disadvantage of
the Skip-Gram model is that
it scales with the corpus size, right?
You have to go through
every single window,
which is not very efficient, and
hence you also don't really make very
efficient usage of the statistics that
you have overall, of the data set.
However we actually get, in many cases,
much better performance
on downstream tasks.
And we don't know yet,
those downstream tasks, that's why we have
the whole lecture for this whole quarter.
But for a variety of different
problems like named entity recognition or
part of speech tagging and so on,
things that you'll implement
in the problem sets,
it turns out the Skip-Gram-like models
work slightly better.
And we can capture various
complex patterns, some of
which are very surprising and we'll get
to in the second part of this lecture.
And so, basically,
what we tried to do here is combine
the best of both of these worlds.
And the result of that was the GloVe
model, our Global Vectors model.
So let's walk through this
objective function a little bit.
Again, theta here will
be all our parameters.
So in this case, again,
we have these U and these V vectors.
But they're even more symmetric now,
we basically just go through all pairs
of words that might ever co-occur.
So we go through this very
large co-occurrence matrix that
we computed in the beginning and
that we call P here.
And for
each pair of words in this entire corpus,
we basically want to minimize
the distance between the inner
product here, and
the log count of these two words.
So again, this is just this kind of
matrix here that we're going over.
We're going over all elements of
this kind of co-occurrence matrix.
But instead of running the large SVD,
we'll basically just optimize
one such count at a time here.
So I have the square of this distance and
then we also have this term here,
f, which allows us to weight even
lower some of these very
frequent kinds of co-occurrences.
So the word "the", for instance, will have
the maximum weight that we allow
inside this overall objective function.
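As a rough sketch of the objective just described: each nonzero co-occurrence count contributes a weighted squared error between the inner product and the log count. The cap of 100 and the exponent 0.75 below are the defaults reported in the GloVe paper, not values stated in this lecture, the factor 0.5 is only a convention that rescales the loss, and the function names are made up for illustration.

```python
import math

X_MAX = 100.0   # assumed cap on the count (GloVe paper default, not given in the lecture)
ALPHA = 0.75    # assumed exponent (GloVe paper default)

def weight(count):
    """f(count): grows with the count but is capped at 1 for very frequent pairs."""
    if count < X_MAX:
        return (count / X_MAX) ** ALPHA
    return 1.0

def pair_loss(u_i, v_j, count):
    """0.5 * f(P_ij) * (u_i . v_j - log P_ij)^2 for one nonzero co-occurrence count."""
    inner = sum(a * b for a, b in zip(u_i, v_j))
    return 0.5 * weight(count) * (inner - math.log(count)) ** 2
```

With this shape of f, a pair that co-occurs a million times gets the same weight as one that co-occurs a hundred times, which is the capping effect described above.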
All right,
so now what this allows us to do is
essentially we can train very quickly.
Cuz instead of saying, all right, we'll
optimize that "deep" and "learning" co-occur
in one window, and then we'll go on and a
couple windows later, they co-occur again,
and we update again, we just say once how often
"deep" and "learning" co-occur
in this entire corpus.
Which could now be in all of Wikipedia or
in our case, all of common crawl.
Which is most of the Internet,
that's kind of amazing.
It's a gigantic corpus
with billions of tokens.
And we just say, all right, "deep" and
"learning" in these billions of documents
co-occur 536 times or something like that.
Probably now a lot more often.
And then we'll just optimize basically
this inner product to be close
in its value to the log of
that overall count.
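To make "optimize one count at a time" concrete, here is a minimal sketch that runs plain stochastic gradient steps over a handful of made-up counts. It follows the simplified objective from this lecture (no bias terms, which the GloVe paper does include); the learning rate, dimensions, toy counts, and the name train_glove_sketch are all assumptions for illustration, not the released GloVe implementation.

```python
import math
import random

def train_glove_sketch(counts, vocab_size, dim=5, lr=0.05, epochs=200, seed=0):
    """Fit u_i . v_j to log(count) for every nonzero pair, one pair at a time."""
    rng = random.Random(seed)
    u = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

    def weight(x, x_max=100.0, alpha=0.75):
        # Assumed weighting function f; caps the influence of very frequent pairs.
        return (x / x_max) ** alpha if x < x_max else 1.0

    def total_loss():
        return sum(
            0.5 * weight(c) * (sum(a * b for a, b in zip(u[i], v[j])) - math.log(c)) ** 2
            for (i, j), c in counts.items()
        )

    initial = total_loss()
    pairs = list(counts.items())
    for _ in range(epochs):
        rng.shuffle(pairs)
        for (i, j), c in pairs:
            err = sum(a * b for a, b in zip(u[i], v[j])) - math.log(c)
            g = lr * weight(c) * err
            # Gradient step on both vectors, each computed from the old values.
            u[i], v[j] = (
                [a - g * b for a, b in zip(u[i], v[j])],
                [b - g * a for a, b in zip(u[i], v[j])],
            )
    return u, v, initial, total_loss()

# Toy counts: (row word, column word) -> co-occurrences in the whole corpus.
toy_counts = {(0, 1): 536.0, (0, 2): 3.0, (1, 2): 12.0}
u, v, loss_before, loss_after = train_glove_sketch(toy_counts, vocab_size=3)
```

Note that the work per epoch depends on the number of nonzero counts, not on how many windows the corpus contains.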
And because of that,
it scales to very large corpora.
Which is great because the rare
words appear not very often, and
more data just allows us to capture
the semantics of even very rare words.
And because of the efficient usage
of the statistics, it turns out
to also work very well on small
corpora and even smaller vector sizes.
So now you might be confused
because in the visualization,
we keep showing you a single vector but
here, again, just like with the skip-gram
model, we have the u and v vectors, the
outside vectors and the inside vectors.
And so let's get rid of that confusion and
basically tell you that there are a lot
of different options of how you get,
eventually, just a single vector
from having these two vectors.
You could concatenate them but
it turns out what works best
is just to sum them up.
They essentially both
capture co-occurrence counts.
And if we just sum them,
that turns out to work best in practice.
And so, that also destroys
some of the intuitions of why
certain things should happen, but it turns
out in practice this works best, yeah?
>> [INAUDIBLE]
>> What are U and
V again, so U here are again just
the vectors of all the words.
And so here, just like with the skip-gram,
we had the inside and the outside vectors.
Here, u and v are just the vectors in
the column and the vectors in the row.
They're essentially interchangeable and
because of that,
it makes even more sense to sum them up.
You could even say, well, why don't
you just have one set of vectors?
But then, you'd have a less
well-behaved objective function here,
because you have the inner product between
two of the same sets of parameters.
And it turns out, in terms of the
optimization having the separate vectors
during optimization and combining them at
the very end just was much more stable.
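Combining the two sets at the very end is a one-liner. A small sketch, with made-up numbers and a hypothetical helper name:

```python
def combine(u, v):
    """Final word vector = outside vector + inside vector, element-wise, per word."""
    return [[a + b for a, b in zip(u_w, v_w)] for u_w, v_w in zip(u, v)]

# Toy example: two words, three dimensions (values made up).
u = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0]]
v = [[0.3, 0.2, 0.1], [0.5, 0.5, 0.5]]
word_vectors = combine(u, v)
```

Concatenating instead would double the dimensionality; summing keeps it the same.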
That's right.
Even for skip-gram, that's the question.
Is it also common for
skip-gram to sum them up?
It is.
And it's good whenever you
have these choices and they seem a little
arbitrary, also, for all your projects.
The best thing to always do is like,
well, there are two things.
You could just come to me and
say, hey what should I do?
X or Y?
And the true answer,
especially as you get closer to your
project and to more research and
novel kinds of applications, the best
answer is always, try all of them.
And then have a real metric, a quantitative
measure of how well all of them do, and
then have a nice little
table in your final project
description that tells you
very concretely what it is.
And once you do that many times,
you'll gain some intuitions,
and by the fifth project,
you'll just realize, well, summing
them up usually works best, so
I'm just going to continue doing that.
Especially as you get into the field,
it's good to try a lot of these
different knobs and hyperparameters.
>> [INAUDIBLE]
>> That's right,
they're all in the same scale here.
Really they are quite interchangeable,
especially for the GloVe model.
Is that a question?
Alright I will try to repeat it.
So in theory here you're right.
So the question is does the magnitude
of these vectors matter?
Good paraphrase?
And so you are right.
It does.
But in the end you will see them basically
in very similar contexts, a lot of times.
And so in this log here,
they will eventually have to
capture the log count, right?
So they will have to go to a certain size
of what these log counts usually are.
And then the model just figures
out that they are in the end
roughly in the same place.
There's nothing in the optimization
that pushes some vectors to get really,
really large, except of course,
the vectors of words that appear very
frequently, and
that's why we have exactly this term here,
to basically cap the importance
of the very frequent words.
Yes, so the question is, and I'll just
phrase it the way it is, which is right.
The skip-gram model tries to capture
co-occurrences one window at a time.
And the GloVe model tries to capture
the counts of the overall statistics
of how often these words appear together,
all right.
One more question?
I think there was one.
No?
Great.
So now we can look at some fun results.
And, basically, we found,
the nearest neighbors for
frog were all these various words.
And we were at first a little worried,
but then we looked them up
and realized, all right,
those are actually quite good.
So you'll see here even for
very rare words, GloVe will give you very,
very good nearest neighbors in this space.
And so next,
we will do the evaluation, but
before that we'll do a little
intermission with Arun.
Take it away.
>> [SOUND] Cool, so
we've been talking about word vectors.
I'm gonna take a brief detour
to talk about polysemy.
So far we've seen that word vectors
encode similarity, we see that
similar concepts are even distributed
in Euclidean space near each other.
And the question I want you to think
about is, what do we do about polysemy?
Suppose you have a word like tie.
All right, tie could mean
something like a tie in a game.
So maybe it should be near this cluster.
Over here.
It could be a piece of clothing, so
maybe it should be near this cluster, or
it could be an action like braid twist,
should be near this cluster.
Where should it lie?
So this paper by Sanjeev Arora and
the entire group,
they seek to answer this question.
And one of the first things
they find is this. Suppose
you could, in your imagination, split
up tie into separate vectors, one per sense.
You'd have tie one every time you
talk about the sports event,
tie two every time you talk
about the garment, and so on.
Then you can show that the actual
vector for tie, which is a combination of
all of these senses, lies in the linear
superposition of all of these vectors.
You might be wondering, how is this
vector close to all of them, but
that's because we're projecting
this into a 2D plane and so
it's actually closer to
them in other dimensions.
Now that we know that
this tie lies near or
in the plane of the different senses,
we might be curious to find out,
can we actually find out what
the different senses of a word are?
Suppose we can only see this word tie.
Could we computationally find out
that tie had
meanings about sports, clothing, etc.?
So the second thing that they're able
to show is that there's an algorithm
called sparse coding.
That is able to recover these.
I don't have time to discuss exactly how
the sparse coding algorithm works, but
let me describe the model.
The model says that every word
vector you have is composed as
the sum of a small selected number
of what are called context vectors.
So these context vectors,
there are only 2,000 that they found for
their entire corpus,
are common across every word.
But every word like tie is
only composed of a small
number of these context vectors.
So, the context vector could
be something like sports, etc.
There's some noise added in,
but that's not very important.
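Written out, the two claims so far look roughly like this. The symbols are notation chosen here for illustration: the alphas are the mixing coefficients, the A_j are the shared context vectors (m of them, about 2,000 in this case), and eta_w is the noise term. For any one word, only a small number of the alpha_{w,j} are nonzero.

```latex
% Superposition: the vector for "tie" is approximately a weighted sum of its sense vectors.
v_{\mathrm{tie}} \approx \alpha_1 v_{\mathrm{tie}_1} + \alpha_2 v_{\mathrm{tie}_2} + \alpha_3 v_{\mathrm{tie}_3}

% Sparse coding model: every word vector is a sparse combination of shared context vectors plus noise.
v_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \eta_w
```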
And so, if you look at the type of output
that you get for something like tie,
you see something to do with clothing,
with sports.
Very interestingly you also
see output about music.
Some of you might realize
that actually makes sense.
And now,
we might note that this is all qualitative.
Is there a way we can quantitatively
evaluate how good the senses we
recover are?
So it turns out, yes you can, and
here's the sort of experimental set-up.
So, for
every word that was taken from WordNet,
about 20 sets of
related senses were picked.
So, a bunch of words that represent
that sense, like tie, blouse, or
pants, or something totally unrelated,
like computer, mouse, and keyboard.
And so now they asked a bunch of grad
students, because they're guinea pigs, to
see if they could find out which
one of these sets of words corresponds to tie.
And they also asked the algorithm
if it could make that distinction.
The interesting thing is that,
the performance of this method that
I alluded to earlier, is about at
the same level as the non-native grad
students that they had surveyed.
Which I think is interesting.
The native speakers do better on the task.
So in summary,
word vectors can indeed capture polysemy.
It turns out the word vectors
of polysemous words
are in the linear superposition
of the vectors of their senses.
You can recover the senses that
a polysemous word has with sparse coding.
And the senses that you
recover are almost as good as
those of a non-native English speaker.
Thank you.
>> Awesome, thank you Arun.
>> [APPLAUSE]
>> All right,
so now on to evaluating word vectors.
So we've now gone through
a bunch of new machinery.
And you say, well,
how well does this actually work?
I have all these hyperparameters.
What's the window size?
What's the vector size?
And we already came up
with these questions.
How much does it matter
how we choose them?
And these are all the answers now.
Well, at least some of them.
So, at a very high level, and this will be
true for a lot of your projects as well,
you can make a high level decision of
whether you will have an intrinsic or
an extrinsic evaluation of
whatever project you're doing.
And in the case of word vectors,
that is no different.
So intrinsic evaluations are usually on
some specific or intermediate subtask.
So we might, for instance, look at how
well do these vector differences or vector
similarities and inner products correlate
with human judgments of similarity.
And we'll go through a couple
of these kinds of evaluations in
the next couple of slides.
The advantage of intrinsic evaluations
is that they're going to be very fast
to compute.
You have your vectors,
you run them through this quick
similarity correlation study.
And you get a number out and
you then can claim victory very quickly.
And then you can modify your model and
try 50,000 different little knobs and
combinations and tune this very quickly.
It sometimes helps you really understand
very quickly how your system works, what
kinds of hyperparameters actually have
an impact on this metric of similarity,
for instance.
However, there's no free lunch here.
It's not clear, sometimes,
if your intermediate or
intrinsic evaluation improvements
actually carry over to a real
improvement in some task
real people will care about.
And real people is a little
tricky definition.
I guess real people,
usually we'll assume are like
normal people who want to just have
a machine translation system or a question
answering system or something like that.
Not necessarily linguists and
natural language processing
researchers in the field.
And so, sometimes you actually
observe people trying to
optimize their intrinsic
evaluations a lot.
And they spent years of
their life optimizing them.
And other people later find out, well,
it turns out those improvements on your
intrinsic task, when I actually
applied your better word vectors or
something to named entity recognition or
part of speech tagging or
machine translation,
I don't see an improvement.
So then the question is, well, how useful
is your intrinsic evaluation task?
So as you go down this route, and
a lot of you will for your projects,
you always wanna make sure you establish
some kind of correlation between these.
Now, the extrinsic one is basically
evaluation on a real task.
And that's really where
the rubber hits the road, or
the proof is in the pudding, or whatever.
The problem with that is that
it can take a very long time.
You have your new word vectors and
you're like,
I took the Pearson correlation instead of
the raw count of my co-occurrence matrix.
I think that's the best thing ever.
Now I wanna evaluate whether that
word vector really helps for
machine translation.
And you say, all right,
now I'm gonna take my word vectors and
plug them into this machine
translation system.
And that turns out to
take a week to train.
And then you have to wait a long time,
and now you have ten other knobs, and
before you know it, the year is over.
And you can't really just do
that every time you have a tiny,
little improvement on your first
early word vectors, for instance.
So that's the problem,
it takes a long time.
And then people will often make
the mistake of tuning a lot of different
subsystems.
And then they put it all together
into the full system, the real task,
like machine translation.
And something overall has improved,
but now it's unclear which part
actually gave the improvement.
Maybe there were two parts where, actually, one was
really good, the other one was bad.
They cancel each other out, and so on.
So you wanna basically,
when you use extrinsic evaluations,
be very certain that you only change
one thing that you came up with, or
one aspect of your word vectors,
for instance.
And if you then get an improvement
on your overall downstream task,
then you're really in a good place.
So let's be more explicit and
go through some of these
intrinsic word vector evaluations.
One that was very popular and
came out just very recently
with the word2vec paper was
these word vector analogies.
Where basically they found,
which was initially very surprising to
a lot of people, that you have amazing
kinds of semantic and syntactic analogies
that are captured through these
cosine distances in these vectors.
So for instance, you might ask,
what is man to woman and
the relationship of king to another word?
And basically a simple analogy.
Man to woman is like king to queen.
That's right.
And so it turns out that,
when you just take vector of woman,
you subtract the vector of man,
and you add the vector of king.
And then you try to find the vector
that has the largest cosine similarity.
It turns out the vector of queen
is actually that vector that has
the largest cosine
similarity to this term.
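The procedure just described can be sketched in a few lines. The vectors below are hand-made 3-d toys, not learned ones, and skipping the three query words when searching is a common convention assumed here rather than something stated in the lecture.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))  # skip the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors with dimensions (royalty, male, female); real vectors are learned.
toy = {
    "man":   [0.1, 1.0, 0.0],
    "woman": [0.1, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.0, 0.3, 0.3],
}
result = analogy("man", "woman", "king", toy)
```

On these toy vectors, woman minus man plus king lands exactly on the queen vector, so queen has the largest cosine similarity.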
And so that is quite amazing,
and it works for
a lot of different kinds
of very intuitive patterns.
So, let’s go through a couple of them.
So you'd have similar things like
sir to madam being similar to man to woman,
or heir to heiress, or king to queen,
or emperor to empress, and so on.
So they all have a similar kind of
relationship that is captured very well
by these cosine distances in these simple
Euclidean subtractions and additions.
It goes even more specific.
You have similar kinds of companies and
their CEO names.
And you can take the CEO name
minus the company plus another company, and
you get to the vector of the name
of the CEO of that other company.
And it works not just for
semantic relationships but also for
syntactic relationships, so slow,
slower, and slowest in these GloVe
vectors have very similar
kinds of differences, and so
on, to short, shorter, and shortest,
or strong, stronger, and strongest.
You can have a lot of fun with this and
people did so here are some even more fun
ones like sushi - Japan + Germany
goes to bratwurst, and so on.
Which as a German, I'm mildly offended by.
And of course,
it's very intuitive in some ways.
But it's also questionable.
Maybe it should have been [INAUDIBLE] or
whatever.
Other typical German foods.
While this is very intuitive
for some people in terms of the actual
semantics that are captured here, you
might really wonder why this has happened.
And there is no mathematical proof
of why this has to fall out but
intuitively you can kind of
make sense of it a little bit.
Superlatives for instance might
appear next to certain words,
very often, in similar kinds of ways.
Maybe most, for instance,
appears in front of a lot of superlatives.
Or barely might appear in front of
certain words like slower or shorter.
It's barely shorter
than this other person.
And since in these vectors you're
capturing these co-occurrence counts,
as you take out basically one co-occurrence,
you subtract that one co-occurrence.
Intuitively, and it's a little hand-wavy,
again, this is
not a nice mathematical proof, but
intuitively you can see how similar kinds
of words appeared and you subtract those
counts and hence you arrive in similar
kinds of places in the vector space.
Now first you try a couple of these, and
you're surprised that this works well.
And then you want to make it
a little more quantitative.
All right, so
this was a qualitative sub sample of some
words where this works incredibly well.
It's also true that when you
really play around with it for
a while,
you'll find some things that are like
Audi minus German goes to some
crazy sushi term or something.
It doesn't always make sense but
there are a lot of them where it
really is surprisingly intuitive.
And so people essentially then came
up with a data set to try to see
how often does it really appear and
does it really work this well?
And so they basically collected
this Word Vector Analogies task.
And these are some examples.
You can download all of
them on this link here.
This is, again, the original
word2vec paper that discovered and
described these linear relationships.
And they basically look at Chicago and
Illinois, and Houston and Texas.
And you can basically come up
with a lot of different analogies
where this city appears in that state.
Of course there are some problems, and
as you optimize this metric more and
more you will observe, well, maybe
that city name actually appears in
multiple different states, since cities
in different states have the same name.
And then it kind of depends on the
corpus that you're training on whether
this has been captured or not.
But still, for a lot of people,
it makes a lot of sense
to optimize these
at least for a little bit.
Here are some other examples of analogies
that are in this data set that are being
captured, like the capitals of
the world. Of course, as those
change, if that change doesn't show up in your
corpus, that's also problematic.
But in many cases the capitals of
countries don't change, and so
it's quite intuitive. And here are some
examples of syntactic relationships and
analogies that are basically
in this data set to evaluate.
We have several thousands
of these analogies and
now, we compute our word vectors,
we've tuned some knob,
we changed the hyperparameter instead of
25 dimensions, we have 50 dimensions and
then we evaluate which one is better for
these analogies.
And again, here is another syntactic one
with past tense kinds of relationships.
Dancing to danced should
be like going to went.
Now, we can basically look at a lot of
different methods, and we don't know all
of these in the class here, but we know
the skip-gram (SG) and the GloVe model.
And here is the first
evaluation that is quantitative
and basically looks at the semantic and
the syntactic relationships, and
then just average, in terms of the total.
And just says, how often is
exactly this relationship true,
for all these different analogies
that we have here in the data set.
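The metric itself is simple to compute. A minimal sketch, assuming each question comes tagged with a category and that only an exact match counts as correct; the lookup-table predictor below is a stand-in for a real vector-based one, and the function name is hypothetical.

```python
from collections import defaultdict

def analogy_accuracy(questions, predict):
    """questions: (category, a, b, c, expected_d). Returns accuracy per category and in total."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, a, b, c, expected in questions:
        total[category] += 1
        if predict(a, b, c) == expected:   # only an exact match counts
            correct[category] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["total"] = sum(correct.values()) / sum(total.values())
    return scores

# Toy questions and a stand-in predictor (a lookup table instead of real vectors).
questions = [
    ("semantic", "Chicago", "Illinois", "Houston", "Texas"),
    ("semantic", "man", "woman", "king", "queen"),
    ("syntactic", "dancing", "danced", "going", "went"),
    ("syntactic", "slow", "slower", "short", "shorter"),
]
canned = {
    ("Chicago", "Illinois", "Houston"): "Texas",
    ("man", "woman", "king"): "queen",
    ("dancing", "danced", "going"): "goed",   # a deliberately wrong answer
}
scores = analogy_accuracy(questions, lambda a, b, c: canned.get((a, b, c)))
```

Here the stand-in predictor gets both semantic questions right and both syntactic ones wrong, so the total is one half.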
And it turns out that when both of
these papers came out in 2013 and
2014, basically GloVe was the best
at capturing these relationships.
And so we observe a couple
of interesting things here.
One, it turns out
sometimes more dimensions
don't actually help in capturing
these relationships better, so
thousand-dimensional vectors work
worse than 300-dimensional vectors.
Another interesting observation and that
is something that is somewhat sadly true
for pretty much every deep learning model
ever is more data will work better.
If you train your word
vectors on 42 billion tokens,
it will work better than
on 6 billion tokens.
By you know, 4% or so.
Here we have the same 300 dimensions.
Again, we only want to change one thing
to understand whether that one change
actually has an impact.
And we'll see here a big gap.
It's a good question.
How come the performance
sometimes goes down?
It turns out it also depends on what
you're training your word vectors on.
It turns out, Wikipedia for instance,
is really great because Wikipedia has very
good descriptions of all these
capitals in all the world.
But now if you take news, and let's say if
you take US news and in US news you might
not have Abuja and
Ashgabat mentioned very often.
Well, then the vectors for
those words will also not
capture their semantics very well and
so you will do worse.
And so bigger is not always
better, it also depends on the quality
of the data that you have.
And Wikipedia has fewer misspellings
than general Internet texts and so on.
And it's actually a very good data set.
And so here are some of
the evaluations and we have a lot of
questions of like how do we choose this
hyperparameter the size and so on.
This is I think a very good and careful
analysis that Jeffrey had done here three
years ago on a variety of these different
hyperparameters that we've observed and
kind of mentioned in passing.
And so
this is also a great sort of way that you
should try to emulate for your projects.
Whenever I see plots like this I
get a big smile on my face and
your grades just like improve right away.
>> [LAUGH]
>> Unless
you make certain mistakes in your plots.
But let's go through them.
Here we look at basically the symmetric
context versus the asymmetric context, which is
where we only count words on
one side of the current word.
We ignore the things on the other side, but it
turns out symmetric usually works better.
And so the vector dimension here
is a good one to evaluate.
It's pretty fundamental:
how high-dimensional
should these be?
And we basically observe
that when they're very small
it doesn't work as well in capturing these
analogies but then after around 200,
300 it actually kind of peters out and
then it doesn't get much better.
In fact, over all it's pretty
flat between 300 and 600.
And this is good.
So, the main number we often look
at here is the overall accuracy and
that's in red here.
And that's flat.
So, one mistake you could make
when you create such a plot is,
you have some hyperparameter and
you have some kind of accuracy.
This could be the vector size,
and you create a nice plot and
you say look, things got better.
And then my comment if I see
a plot like this would be,
well why didn't you go
further in this direction?
It seems to just be going up and up.
Like, so that is not good.
You should extend your plots until
they actually kind of peter out, and
you say all right now, I really found
the optimum value for this hyperparameter.
So, another important
thing to evaluate here
is the window size, and there
are sometimes other considerations around this.
So for word vectors, for instance,
maybe
300 works slightly better than 200.
But, larger word vectors
also means more RAM, right?
Your software now needs
to store more data.
And you might want
to ship it to the cellphone.
And now yes you might get 2%
improvement on this intrinsic task.
But you also have 30%
higher RAM requirements.
And maybe you say, well,
I don't care about those 2% or
so improvement in accuracy
on this intrinsic task.
I still choose a smaller word vector.
So, that's a legit argument,
but in general here,
we're just trying to optimize this metric.
And so
we wanna look at carefully what these are.
All right, now, window size, again
this is how many words to the left and
to the right of each of the center
words do we wanna predict and
compute the counts for.
Turns out around eight or
so, you get the highest.
But again that also increases
the complexity and the training time.
The longer the windows are,
the more times you have to compute
these kind of expressions.
And then for asymmetric context,
it's actually a slightly different
window size that works best.
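Here is a small sketch of what the window size and the symmetric versus asymmetric choice mean when collecting counts. It is a simplified illustration: every co-occurrence counts as 1 regardless of distance, and the asymmetric variant looks only at the words to the left, which is a choice made for this sketch. The function name and the toy sentence are made up.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2, symmetric=True):
    """Count how often each (center, context) pair appears within `window` positions."""
    counts = Counter()
    for i, center in enumerate(tokens):
        # Asymmetric here means: only look at one side (the words to the left).
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window] if symmetric else []
        for context in left + right:
            counts[(center, context)] += 1
    return counts

tokens = "i like deep learning and i like nlp".split()
sym = cooccurrence_counts(tokens, window=1, symmetric=True)
asym = cooccurrence_counts(tokens, window=1, symmetric=False)
```

A larger window means more context words per center word, which is where the extra computation mentioned above comes from.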
All right,
any question around these evaluations?
Great.
Now, it's very hard actually,
to compare GloVe and the skip-gram model,
cuz they're very different
kinds of training regimes.
One goes through one window at a time,
the other one first computes all
the counts, and then works on the counts.
So this is kind of us
trying to do well and
answer a reviewer question of
when you compare them directly.
So what we did here is we
looked at the Negative Samples.
So remember, we had that sum in
the objective function for
the skip-gram model of how many words
we want to push down the probability of
cuz they don't appear in that window and
so
that is one way to increase training time,
and in theory do better on that objective.
Versus different iterations of how
often do we go over these co-occurrence
counts to optimize each pair in
the co-occurrence matrix for GloVe.
And in this evaluation GloVe did better
regardless of how many hours you
sort of trained both models.
And this is the "more data helps"
argument I already made.
Especially Wikipedia.
So here Gigaword is I think
mostly a news corpus.
So news, despite being more data, actually
does not work quite as well, overall, and
especially not for semantic
relationships and analogies,
but Common Crawl, which is a super large
data set of 42 billion tokens, works best.
All right, so now these amazing analogies
of king minus man plus woman and
so on were very exciting.
Before that, people often used
just correlation judgements.
So basically they asked a bunch of people,
often grad students,
to give on a scale of one to ten, how
similar do you think these two words are?
So tiger and cat, when you ask three or
five humans on a scale from one to ten
how similar they are, they might say,
one might say seven, the other eight,
the other six or something like that and
then you average.
And then you get basically a score
here of similarity, so computer and
internet are about a seven.
But stock and
CD are not very similar at all.
So a bunch of people will say on a scale
from one to ten, it's only 1.3 on average.
>> [INAUDIBLE]
>> And now,
we could try to basically say all right.
We want to train word vectors such that
their distances, be it cosine similarity or
Euclidean distance, have a high
correlation with these human judgements,
or you can try different distance metrics
too and look at how close they are.
And so here's one such example.
You take the word Sweden and
you look in terms of cosine similarity and
you basically find lots of words
that are very, very close by or
have the largest cosine similarity and
you basically get Norway and
Denmark to be very close by.
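A nearest-neighbor lookup like this is just a sort by cosine similarity. A minimal sketch with hand-made toy vectors (the numbers are invented so that the countries sit close together; real vectors would come from training):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(word, vectors, k=2):
    """Return the k other words with the largest cosine similarity to `word`."""
    query = vectors[word]
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[w], query), reverse=True)[:k]

# Toy vectors, made up for illustration.
toy = {
    "sweden":  [0.9, 0.8, 0.1],
    "norway":  [0.8, 0.9, 0.1],
    "denmark": [0.9, 0.6, 0.2],
    "frog":    [0.1, 0.0, 0.9],
}
neighbors = nearest_neighbors("sweden", toy, k=2)
```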
And so, if you have a lot of these
kinds of data sets and this one,
WordSim353 has basically
353 such pairs of words.
And you can look at how well
do your vector distances correlate
with these human judgements.
So the higher the correlation,
the more intuitive we would think are the
distances in this large vector space.
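One way to compute such a correlation is a rank correlation between the human ratings and the model similarities for the same word pairs. The sketch below uses Spearman's rank correlation, which is an assumption here since the lecture only says "correlate", and the numbers are made up rather than taken from WordSim353.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: averaged human ratings (1-10) and model cosine similarities, same pairs.
human = [7.0, 7.5, 1.3, 5.0]
model = [0.70, 0.60, 0.10, 0.40]
rho = spearman(human, model)
```

Because only the ordering matters for a rank correlation, the human scale (1 to 10) and the cosine scale do not need to match.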
And again, GloVe does very well
here across a whole host of
different kinds of datasets
like WordSim353, and,
again, the largest training
dataset here did best for GloVe.
Any questions on word vector
similarities and correlations?
No, good, all right.
Now, basically, intrinsic evaluations
have this huge problem, right?
We have these nice similarities,
but who knows?
Maybe that doesn't actually improve the
real tasks that we care about in the end.
And so the best kinds of evaluations,
but again they are very expensive,
are those on real tasks or at least
subsequent kinds of downstream tasks.
And so one such example is
named entity recognition.
It's a good one cuz
it's relatively simple.
But it's actually useful enough.
You might want to run a named entity
recognition system over a bunch of
your corporate emails.
To understand which person is in
relationship to what company, and
where do they live and the locations
of different people and so on.
It's actually a useful system to have,
a named entity recognition system.
And basically we'll go
through the actual models for
doing a named entity recognition
in the next lecture.
But as we plug in different
word vectors into these
downstream models that we'll describe in
the next lecture we'll observe that for
many of them GloVe vectors again do very,
very well on these downstream tasks.
All right.
Any questions on extrinsic methods?
We'll go through the actual
model that works here later.
That's right.
Well, so you're not optimizing
anything here, you're just evaluating.
You're not training anything.
You've trained your word vectors with your
objective function from skip-gram, and
you fix them, and
then you just evaluate them.
And so what you're evaluating here now
is you look at for instance Sweden and
Norway, and they have a certain
distance between them, and
then you want to basically
look at the human
measure of how similar do humans
think these two words are.
And then you want these kinds of human
judgements of similarity to correlate well
with the cosine distances of the vectors.
And when they correlate well,
you think, the vectors are capturing
similar kinds of intuitions that people
have, and hence they should be good.
And again, intuitively it would
make sense that if Sweden
has good cosine similarity and you plugged
it into some other downstream system,
that that system will also get
better at capturing named entities.
Because maybe at training time
it sees the vector of Sweden and
at test time it sees
the vector of Norway and
at training time you were told that Sweden is
a location, and so at test time it might
be more likely to correctly identify
Norway or Denmark also as a location.
Because they're actually
close by in the vector space.
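The intuition can be sketched with a 1-nearest-neighbor rule, which is only a stand-in here and not the downstream model from the lecture. The vectors and labels are made up for illustration: the classifier only ever sees Sweden labeled as a location, yet it labels Norway the same way because the two vectors are close.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical training data: word -> (vector, label).
train = {
    "sweden": (np.array([0.9, 0.1, 0.0]), "location"),
    "banana": (np.array([0.0, 0.1, 0.9]), "other"),
}

def predict(x):
    # Return the label of the most similar training word.
    best = max(train, key=lambda w: cosine_similarity(x, train[w][0]))
    return train[best][1]

# Norway never appeared in training, but its (hypothetical) vector
# lies close to Sweden's.
norway = np.array([0.8, 0.2, 0.1])
prediction = predict(norway)
print("norway ->", prediction)
```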
And we'll actually go through an example
of how we train word vectors, or rather
how we train downstream tasks,
in the next lecture.
So I think we have until 5:50,
so we got 8 more minutes.
So, let's look briefly at simple,
single word classification.
So you know we talked about these
word vectors and I basically showed
you the difference between starting with
these very simple co-occurrence counts and
these very sparse large vectors versus
having small dense vectors like word2vec.
And so the major benefits are basically
that because similar words cluster
together, we'll be able to classify and
be more robust in classifying
different kinds of words that we might
not see in the training data set.
So for instance,
because countries cluster together and
our goal is to classify location words
then we'll do better if we initialize
all these country words to be in
a similar part of the vector space.
It turns out later we'll actually
fine tune these vectors too.
So right now we learned with
an unsupervised objective function.
It's unsupervised in the sense that we
don't have human labels that we assigned
to each input, we just basically
took a large corpus of words, and
we learned with these
unsupervised objective functions.
But there are other tasks where that
doesn't actually work as well.
So for instance sentiment analysis turns
out to not be a great downstream task for
some word vectors because good and bad
might actually appear in similar contexts.
I thought this movie was really good or
bad.
And so when your downstream
task is sentiment analysis
it turns out that maybe you can just
initialize your word vectors randomly.
So this is kind of a bummer
after listening to us for
many hours on how word
vectors should be trained.
But fret not, in many cases word
vectors are helpful as your first step for
your deep learning model, just not always.
And again, that will be
something that you can evaluate.
Can I just initialize my words randomly or
should I initialize them with
the word2vec or the GloVe model?
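The two options can be compared with a small helper like the one below. This is a sketch under assumptions: `init_embeddings`, the tiny vocabulary, and the `pretrained` dictionary are all hypothetical, and in practice the pretrained vectors would be loaded from a word2vec or GloVe file.

```python
import numpy as np

def init_embeddings(vocab, dim, pretrained=None, seed=0):
    # Start from small random vectors for every word.
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(len(vocab), dim))
    # Overwrite rows for words that have a pretrained vector.
    if pretrained is not None:
        for i, word in enumerate(vocab):
            if word in pretrained:
                E[i] = pretrained[word]
    return E

vocab = ["good", "bad", "movie"]
# Hypothetical pretrained vectors; "bad" is deliberately missing.
pretrained = {
    "good": np.array([0.5, -0.2]),
    "movie": np.array([0.1, 0.9]),
}

E_random = init_embeddings(vocab, dim=2)
E_pretrained = init_embeddings(vocab, dim=2, pretrained=pretrained)
```

Training the same downstream model once from `E_random` and once from `E_pretrained` and comparing the task metric is the evaluation being described.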
So as we're trying to classify words,
what we'll use is the softmax.
And so you've seen this equation already
in the very beginning in the first slide
of the lecture.
But we'll change the notation a little
bit because all the math that will follow
will be easier to go through
with this kind of notation.
So this is going to be
the softmax that we'll optimize.
It's essentially just a different
term for logistic regression.
And in many cases we'll generally have
a matrix W here for our different classes.
So x, for instance, could be, in
its simplest form, just a word vector.
We're just trying to classify different
word vectors with no context, just like,
are these locations or not.
It's not very useful, but just for
pedagogical reasons, let's assume x,
our input here, is just a word vector.
And I want to classify, is it a location,
or is it not a location.
And then we give it basically these
different kinds of word vectors that we
computed, for instance, for Sweden and
Norway, and then we want to classify:
is Finland, or Switzerland,
also a location, yes or no.
So that's the task.
And so our softmax here might just
have in the simplest case two,
two doesn't really make sense so let's say
we have multiple different classes and
each class has one row vector here.
And so this notation y is essentially
the index of the row,
so the specific row that we have.
And we have here the inner product of this
row vector with this column vector x.
And then we normalize just
like we always do for
logistic regression to get
an overall vector here for
all the different classes that sums to 1.
So W in general for classification
will be a C by d dimensional matrix.
Where d is the dimensionality of our input and
C is the number of classes that we have.
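A minimal sketch of the softmax just described, assuming the standard form p(y | x) = exp(W_y x) / sum over c of exp(W_c x), with W of shape C by d. The random W and x are placeholders for illustration, and subtracting the maximum score is a common numerical-stability step rather than something stated in the lecture.

```python
import numpy as np

def softmax_probs(W, x):
    # One score per class: inner product of each row W_c with x.
    scores = W @ x                      # shape (C,)
    # Shifting by the max leaves the result unchanged but avoids overflow.
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    # Normalize so the class probabilities sum to 1.
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
C, d = 3, 4                             # hypothetical sizes
W = rng.normal(size=(C, d))             # placeholder weights
x = rng.normal(size=d)                  # placeholder word vector

p = softmax_probs(W, x)
print(p, p.sum())
```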
And again, logistic regression, just a
different term for softmax classification.
And the nice thing about the softmax is
that it will generalize well to
multiple different classes.
And so, basically this is also
something we've already covered.
So for the loss function, we'll use similar
terms in all the subsequent lectures.
Loss function, cost function and objective
functions, we kind of use interchangeably.
And what we'll use to optimize
the softmax is the cross entropy loss.
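As a rough sketch, assuming a single example with one correct class, the cross entropy loss reduces to the negative log of the probability the softmax assigns to that class. The probability vector below is made up for illustration.

```python
import numpy as np

def cross_entropy(probs, target):
    # Negative log probability of the correct class.
    return float(-np.log(probs[target]))

# Hypothetical softmax output over three classes.
probs = np.array([0.7, 0.2, 0.1])

loss_if_class_0 = cross_entropy(probs, 0)   # model is fairly confident here
loss_if_class_2 = cross_entropy(probs, 2)   # model gave this class little mass
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the correct class gets high probability and grows as that probability shrinks.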
And so for the last minute,
I'll just give you one extra minute,
cuz if we start now, it'll be too late.
So that's it, thank you.
>> [APPLAUSE]
