Okay. Hello everyone.
Um, welcome back to the second class of, um, CS224N.
Okay, so right at the end of last time I was just showing you a little,
um, from this, um,
IPython Notebook of things that you could do with word vectors,
but I kind of ran out of time a little bit.
So, I'll just spend a couple of more minutes first,
um, showing the end of  this.
I stuck this IPython Notebook up on the course page.
So, under lecture one you can find a copy of it and you can download it.
So, I both stuck up just an HTML version of it and a zip file.
The HTML file is only good to look at.
You can't do anything with it.
So, you wanna, if you wanna play with it by yourself, um,
download the zip file and get the IPython Notebook out of that.
Okay. So we were looking at
these GloVe word vectors which I'll talk about a bit more today and so there were
these sort of basic results of similarity in this vector space work very nicely for,
um, discovering similar words and then going on from that,
there was this idea that we'll spend some more time on today which was, um,
maybe this vector space is not only a similarity space where
close together things have similar meaning but it actually captures meaning
in a considerably deeper and more profound way which is to say that there are
actually directions in the space that you can point which have a certain meaning.
So, that if you are pointing in one direction it means this is more so the case,
if you are pointing in a different direction and the meaning space it might be this is
the capital of this country or
all sorts of different meanings could be encoded in the space.
And a way of testing that,
is to use these analogy, um, problems.
And I quickly showed this at the end, but just to make sure you caught it,
since it's sort of- it's sort of a clever thing, right?
So, the idea is that we're going to start with a pair of words like king and man.
And so what we're gonna do is we're gonna say well,
there's a vector for king in the space and there's a vector for man in
the space and but what we're gonna do is we're going to
subtract as in just good old vector subtraction that you hopefully learned in your,
um, linear algebra class.
We're gonna subtract the man vector from
the king vector and the idea we have in our head then is if we do
that what will happen is we'll be left with the meaning of kingship without the manness.
Um, and so then there's also a vector for woman.
So, we can add the woman vector to that resulting vector and then we could say well,
we end up at some point in the vector space and then we're gonna say well,
what's the closest word to that point,
and it's gonna print out the closest word, and as we saw,
um, last time, um,
lo and behold if you do that,
um, you get the answer.
I'm saying you get,
um, king, man, woman.
No? All right. [LAUGHTER].
You gotta reverse king and man.
I have to reverse king and,
ah, sure, sure, sure. I'm sorry.
Oops. Yeah, okay, I kinda do it well like man, king.
Ah, [LAUGHTER] Okay. Yeah, that's right.
Sorry. Okay. Yeah, because it should be
man is to king as woman is to something sorry yeah.
I was getting [LAUGHTER] my order of components wrong.
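The arithmetic being described can be sketched in a few lines of Python. This is a toy illustration, not the notebook's actual code: the vocabulary and 4-dimensional vectors here are made up by hand so that the analogy works out, whereas the real demo uses learned 100-dimensional GloVe vectors.

```python
import numpy as np

# Hand-made toy vectors, constructed so the analogy arithmetic holds;
# real GloVe vectors are learned from a corpus instead.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "queen": np.array([0.9, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def analogy(a, b, c):
    """Return the word closest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -1.0
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman"))  # → queen
```

With real vectors you would search the whole vocabulary in exactly the same way, excluding the three query words.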
Okay. Um, and, you know,
as I was, I guess, showing some examples last time with
nationality words, but, in a way that is sort of surprising, even shocking,
this actually works for all kinds of things; you can get meaning out of this space.
So, I can ask various kinds of analogies of sorts.
So I can say Australia is to beer as France is to-.
Wine.
Wine. You might think wine.
What it gives back is champagne, which seems a pretty good answer.
[LAUGHTER] Um, I'll go with that.
Um, um, you can do more syntactic facts.
So I can say tall, ta- tall is to tallest as long is to longest, and it gets that.
Um, if I say good is to fantastic as bad is to terrible,
it seems to get out that there's some kind of notion of
a "make more extreme" direction, and it gets this direction out.
I skipped over one.
Obama is to Clinton as Reagan is to-.
You may or may not like the answer it gives for this one:
Obama is to Clinton as Reagan is to Nixon.
Um, now one thing you might notice at
this point and this is something I actually want to come back to at the end.
Um, well, there's this problem because Clinton's ambiguous, right?
There's Bill or there's Hillary.
Um, and, um, I forget,
you know, so this data as I said is a few years old.
So, this data was from 2014,
so it definitely doesn't
have Trump really in it as a politician, um, but, you know,
it would have both Clintons, though it sort of makes sense, probably, um,
for 2014 data,
um, that Bill Clinton dominated.
So, I think what we're getting, um,
out of this is that Clinton and Nixon are sort of similar as people in danger,
um, of being impeached,
um, and, uh, on both sides of the aisle, so it's thinking primarily of Bill Clinton.
But, um, this sort of brings up something that
I'll come back to right at the end, um:
it sort of looks like we've got a sort of a problem here, because we
just have this string, literally "Clinton", and that, um,
string covers any possible sense and meaning of the string Clinton, and so minimally, um,
we have Bill Clinton and Hillary Clinton in there.
Maybe you have some friends that are called Clinton as well, right,
and they're all mixed together in this one "Clinton".
And so that seems kinda problematic and that's sort of been an issue
that's been discussed some for these word vectors and I'll come back to that.
Um, another thing you can do is you can give
a set of words and say which is the odd one out.
Maybe you used to do puzzles like that in middle school or something.
Um, and so you can do that and it decides
that cereal is the odd one out of that set. It seems okay.
Um, and then one other thing I'll just show you is, so, um,
it'll sort of be nice to look at these words that I've drawn
them in some of the slide pictures.
So, this is saying to put together a PCA or
Principal Components Analysis, um, scatter plot.
Um, so, I can do that and then I can say, "Um,
give it a set of words and draw me these as a scatter plot" and um,
hopefully if I can just about fit it in,
um, here's my scatter plot.
And it works pretty well, right?
I've got the wine, champagne,
beer up here then the coffee and tea.
Um, here are the countries.
Here are the schools: college, institute, universities.
Um, the animals are down here.
Um, foodstuffs there.
So, yeah, this sort of really does work with this two-dimensional display.
It basically shows you similarity.
Now, um, there are, you know,
to some extent though you want to hold on to your wallet with these PCA displays.
So, it's as I've discussed before since you're
taking something that was 100-dimensional and we're just doing
this 2D projection that is capturing some of the major geometry of
the space but it just has to be losing a huge amount of the information.
So, when things end up close together,
they might be really close together in the original space, or
they might just be words that got placed close together in
the 2D projection because there were other patterns that
were more dominant and were chosen as the first two principal components.
So, you sort of don't wanna over-trust
these things, and something, if you like InfoVis, you might think
about is whether there are other ways you might be able to
represent the distances in a way that was more accurate.
Um, but anyway, this is very simple to do: I'm just getting
PCA to reduce the dimensionality of the matrix and then,
um, transforming these word vectors with it and plotting them.
Um, it's mostly easy to do.
The bit that wasn't easy for me to do, um,
but if someone's got some clever Python um plotting tips I'd like one,
if someone wants to send me a message after class.
I would have thought there'd be some default way in which you could just
label points in a scatter plot but I wasn't able to find one.
So, what I did, um,
was I'm just sort of plotting the texts and
I'm offsetting it a little bit from the points.
Um, now that works kinda crappily
because they just collide with each other as you can see.
Um, so, it'd be better if there was a better way to do point labeling in Python plots.
So, if anyone knows the answer to that one you can send it to me.
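For the record, one common way to label scatter-plot points is matplotlib's `annotate` with a small offset in points, which is essentially the trick described above. A minimal sketch, using random stand-in vectors instead of the real GloVe vectors and a plain SVD-based PCA:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical word list with random stand-in vectors; in the notebook
# these would be the learned 100-dimensional GloVe vectors.
words = ["wine", "beer", "coffee", "tea", "dog", "cat"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 100))

# PCA via SVD: center the data, then project onto the first
# two principal components.
X = vectors - vectors.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
reduced = X @ Vt[:2].T

fig, ax = plt.subplots()
ax.scatter(reduced[:, 0], reduced[:, 1])
for word, (x, y) in zip(words, reduced):
    # annotate() labels each point, offset a few points from the dot
    ax.annotate(word, (x, y), xytext=(3, 3), textcoords="offset points")
fig.savefig("words_pca.png")
```

Labels can still collide for nearby points; fancier placement needs an extra layout step.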
Um, okay. So, that's that. Ah.
And if you haven't used IPython Notebooks
before and don't want your computer to run really slowly,
it's a good idea to halt
your IPython Notebooks when you're not gonna be using them anymore,
um, especially if they're computing something.
Um, okay. [NOISE] So now,
um, lecture two, and so for today,
we're gonna keep on talking about things you can do with
word vectors and say a little bit at the end about word senses.
So, in more detail, [NOISE] um,
I'm gonna say a bit more about, um, Word2Vec.
I'm gonna have a sort of a very brief excursion on optimization, um,
but then I sort of want to explain a bit more of the space of what
people have done and can do with dense word representations.
So I am gonna say something about
count-based approaches to capturing meaning and how do they work.
I'm gonna talk for a bit about a,
a different model of word vectors, which was the GloVe model that,
um, a post-doc of mine, um,
Jeffrey Pennington, and, uh,
I worked on a couple of years ago,
um, talk some about evaluation-
a really quite dominant theme in a lot of what
we do in natural language processing is how do we
evaluate things and how much do we trust our evaluations-
um, and then say a little bit about, um, word senses.
I have a sort of a goal here which is that by the end of the class,
um, you should actually sort of understand, um,
enough of the lay of the land that you could
read papers about word vectors such as the ones that
are in the syllabus and actually understand
them and where they're coming from and roughly how they work.
And so, you know, if you really wanna minimize
work for your c- this class, you could think, "I,
I know everything I need to know after the first week and I'm gonna
do a final project on word vectors and I'll be okay."
Um, and you know, you could actually do that;
I mentioned during the, um,
class a couple of recent pieces of work on word vectors.
On the other hand, um,
doing things with word vectors is a fairly mined-out area,
so you're probably better off, um,
also listening to some of the later parts of the class.
Okay. So, remember we had this idea of Word2Vec,
so it was an iterative updating algorithm that learned, um,
these vector representations of words,
that in some sense capture their meaning, and the way it worked was we kinda
moved position by position through a corpus, and at each point in time,
we had a center word, here "into", and it's trying to predict
the words around that by having
a probability distribution over words that will occur around it,
and that probability distribution is defined simply in terms
of the dot product of the word vectors via the softmax function.
And so, what we wanna do is change those vectors in
a way that this gives good probability predictions,
that gives as high probability as possible to words that you tend to see in the context.
And so, just to drill that in a little bit more, you know,
what we actually have is we have two matrices, right?
We have for center words,
we have a matrix where for each word in our vocabulary,
we have a vector, um, and at this,
this is probably as good a point as any to say that it
turns out that all the major deep learning packages,
TensorFlow, PyTorch, etc., for their word vectors,
the word vectors are represented as rows.
If you've done a bunch of math classes,
that might not be what you would expect.
You might have expected the other way around,
but they all put them in rows.
So we have rows for our words;
here, we have six words and a five-dimensional vector for each.
Okay. And then, we have this outside, um,
matrix where we also have a second, um,
vector for each word which is this representation in context.
Um, so when we have a particular center word here,
word four, you know,
when we're doing our computations,
we're taking a dot product between v_4 and each row of
U, and that's then giving us a vector of dot product scores.
And so, then after that,
we're running a softmax over those numbers, exponentiating
element-wise and normalizing, and that's giving us
a probability distribution over words in the context.
Um, and the sort of things to notice there, um,
which hopefully you noticed last time,
but to make sure you noticed that,
um, you know, we've just got one probability distribution, right?
So in terms of what words we predict,
we're predicting exactly the same probability distribution at every position.
We're sort of saying the most likely word one to the left
is, whatever it is, "house"; the most likely word two to the left is "house";
three to the left is "house";
the one to the right should be "house" too, right?
So, it's not any sort of fine-grained prediction,
it's just an overall kind of
probability distribution of words that are likely to occur in my context.
So, all we're asking for is a model that gives
reasonably high probability estimates to all words that
occur in the context of this word relatively often,
is nothing more to it than that.
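That computation- dot products with every row of U, then a softmax- can be sketched like this, with made-up sizes (6 words, 5 dimensions) and random vectors standing in for learned ones:

```python
import numpy as np

# Toy sizes: 6-word vocabulary, 5-dimensional vectors (both made up here).
rng = np.random.default_rng(42)
V = rng.normal(size=(6, 5))  # center-word vectors, one row per word
U = rng.normal(size=(6, 5))  # outside (context) vectors, one row per word

def context_distribution(center_idx):
    """P(o | c): dot v_c with every row of U, then softmax."""
    scores = U @ V[center_idx]           # vector of 6 dot-product scores
    exp = np.exp(scores - scores.max())  # exponentiate (shifted for stability)
    return exp / exp.sum()               # normalize to probabilities

# The same single distribution is used for every position in the window.
p = context_distribution(3)
```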
And that's part of why it's sort of surprising when you've got
such a simplistic thing that it seems like at the end of the day,
it can end up capturing so much about
the meanings of words and aspects of the meanings of words,
like in the examples I was just showing you in the IPython Notebook.
Um, and [NOISE] there's one other thing I was gonna say, oh yeah,
one other thing I was gonna say was the other thing that might occur to you from this is,
um, well, wait a minute,
there are words like "that" and "and"
and "of" that occur all the time.
Um, so that means every word must have a high dot product with words like "that" and "of",
um, to get their probabilities right.
And the first answer to that is, "Yup, that's true."
And it turns out that all word vectors, [NOISE] um,
have a very strong prob- word probability component that reflects that.
And I mean, one of the things that some workers discuss,
so on the readings,
there are two papers from Sanjeev Arora's group at Princeton, and one of
those papers sort of discusses, um, this
high-frequency effect, and a crude way of [NOISE] actually
fixing this high-frequency effect is that normally, um,
the first, um,
the biggest component in
your word vectors is actually a frequency effect, and if you just lop it off,
you can make your semantic similarities better.
Um, but there are other things that we do to sort of deal with high frequencies.
Okay, so we get these lovely spaces that I've shown some of.
But I'll make one more remark. Um.
Yeah, so did I say this last time? Oh, oh.
Um, my remark anyway is that,
um, we show all these two-dimensional pictures.
They're exceedingly, exceedingly misleading, because in these
two-dimensional pictures, you know,
you have these effects that if, you know,
Samsung is close to Nokia,
it has to be over here, and then it has to be far away from words that are over here.
Um, whereas you might sort of also want to have the effect that
Nokia is close to Finland for a different reason,
um, and you can't do that in a two-dimensional, um,
vector space, but, you know, um,
many of the properties of high-dimensional vector spaces are very unintuitive,
and one of the ways that they're unintuitive is that in a high-dimensional vector space,
a word can be close to lots of other words in different directions.
Um, okay. So um,
we sort of started to talk about how we went about learning these word vectors.
I'm sort of going to take about a five minute detour into optimization.
Now, this isn't really an optimization class,
If you want to learn a lot about optimization,
well, you can learn more if you do
229, and if you do something like Stephen Boyd's optimization class
you can learn a lot, but this is
sort of really baby optimization, just to make sure everyone's on the same page:
here are three slides.
Right, so what we did at the end,
what we did over there,
where I apologized that my writing was too small-
well, that will give you the chance, when doing homework two,
to write that out and work it out for yourselves and learn more in the process.
Right, so what we had was a cost function that we wanted to
minimize and so what we did was we did our bit of
calculus to calculate the gradient of the cost function with respect
to our word vectors which were our variables theta and then what we want to do is say,
well if we take a small step in
the direction of the negative of the gradient that will be taking us down,
down hill in this space and we want to keep on
doing that and sort of head to the minimum of our space.
I mean, of course in our high multi-dimensional space,
you know, it might not be a nice smooth curve like this.
It might be a horrible and non-convex curve but that's just the idea.
So, essentially we're saying we've got the old parameters,
we work out the gradient of the objective function using those old parameters.
We multiply that by a small alpha which is
our step size or learning rate because we only want to move a
little bit each time, because if, back here,
we sort of said downhill is this way and said,
"Great, let's go a long way that way,"
you could kind of completely overshoot,
so we only want to go a little bit each time.
So we normally have a small learning rate alpha, and so we
subtract a small multiple of the gradient
from the old parameters and we get
our new parameters, and that's effectively being worked out
component-wise, as shown below:
we're just doing that to each of the partial derivatives, and then
our hope is that that will let us gradually walk down this surface.
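The update rule described here- new parameters equal old parameters minus alpha times the gradient- can be sketched on a baby one-dimensional example, J(theta) = theta squared, whose minimum is at zero:

```python
# Baby gradient descent on J(theta) = theta**2, minimized at theta = 0.
def gradient(theta):
    return 2 * theta  # dJ/dtheta

alpha = 0.1   # learning rate / step size: small, so we don't overshoot
theta = 5.0   # arbitrary starting point
for _ in range(100):
    theta = theta - alpha * gradient(theta)  # one small step downhill
# theta has now walked down very close to the minimum at 0
```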
Now, if you actually did this,
it would be unbelievably bad for the kind of
systems that we build and there's a lot of work on
clever optimization but the most basic thing
which you definitely need to know is that well,
our objective function here,
J of theta was a function of our entire corpus, right?
And to get this to work well,
the first thing you want to do is,
you know collect a few billion words of your favorite language and then say,
"Go and build a Word2Vec model for me, " and so,
if you have to evaluate
a billion center words, and then for each maybe ten context words
if you have a window size of five- so you have to do these sort of 10 billion, um,
softmax calculations before you work out what your gradient is-
then your computer is going to be computing for quite a long time before
you make one little step in the gradient, and so things are going to go so, so slowly.
So, no one does that in deep learning systems.
Um, so what people- everyone does is use
stochastic gradient descent and in stochastic gradient descent,
we sample our window in the simplest case.
We, just for this one window,
work out an estimate of the gradient and we use it as a parameter update.
So, this is sort of an amazingly,
amazingly noisy estimate of
the gradient but it sort of doesn't matter too much because as soon as we've done it,
we're going to choose a different center word and do it again and again,
so that gradually we sort of approach what we would have gotten if we'd sort
of looked at all of the center words before we took any steps,
but because we take steps as we go,
we get to the minimum of the function orders of magnitude more quickly.
So thi- this shows the simplest case where we're just sampling one window.
In practice, that's not what we normally do.
We normally sample a small bunch,
you know, on the order of 32 or 64.
Um, so if we have a sample that's bigger,
that's generally referred to as a mini-batch and we
calculate a gradient estimate from the mini-batch.
Um, so that has two advantages.
One advantage is that you kind of get less noisy estimates of
the gradient because you've kind of averaged
over a bunch of examples rather than just using one,
but the second advantage,
which is the one why we really care,
is if we want our computations to go fast when we're using a GPU,
that you need to get parallelization of doing the same operation a whole bunch of
times and then you gain a lot by using
a mini-batch of 64 examples or something like that.
Um, and you don't have to but you know,
it turns out from the details of the guts of the hardware that, you know,
it isn't- [inaudible] GPUs, you know, they have these,
whatever they have inside them,
they're in powers of two.
So, you get better speedups if you use batches like 32 or 64,
rather than just deciding that 42 is still your favorite number from
high school [LAUGHTER] and you're going to use that as the size of your mini-batch.
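As a concrete sketch of mini-batch SGD- not Word2Vec itself, just the update pattern- here's the same idea on a toy least-squares problem, estimating the mean of some data from noisy gradients computed on batches of 32:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=10_000)  # the "corpus"; true mean is 3.0

theta, alpha = 0.0, 0.05
for _ in range(2_000):
    batch = rng.choice(data, size=32)    # sample a mini-batch of 32
    grad = 2 * (theta - batch).mean()    # noisy estimate of dJ/dtheta
    theta -= alpha * grad                # one small, cheap step
# theta now hovers near the true mean, 3.0, despite each gradient
# estimate being quite noisy
```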
Okay. um, yeah here's
one other interesting thing which
actually has some optimization details in it, it turns out.
Um, if you think of these um,
doing stochastic gradients with word vectors,
that's actually very different to
some other deep learning problems like vision deep learning problems.
Because for either a single window or even a sort of a reasonably sized mini-batch,
it will turn out that that
mini-batch only has, you know,
relatively speaking, a handful of words in it, right?
So, if you have a mini-batch of size 32 and a window size of ten,
you know, probably there are only about 100 to 150 different words in it.
Um, but yet we're building this model over
a vocabulary of quarter of a million words or something like that.
So, just about all of the elements in this vector are zero.
Um, and so, um,
we sort of really have this very sparse, um,
parameter update and so, um,
that sort of suggests that we actually probably, um,
want to sort of only update the word vectors that
appear and then the question is whether you can achieve that, right?
The dumb way to do it is you just have this matrix that's normally
nearly all zeros, and you say add
those two matrices together and there you go; and then the question is,
can you actually have a sparse matrix update which only updates
the certain rows of the matrix that contain
the words that actually occurred, and do things much faster?
And if you're doing something even cleverer like doing
distributed computation over multiple computers and sharing your parameters,
well then definitely you just sort of only want to update
the word vectors that you've actually been getting a parameter estimate for.
So, there's sort of some details there, but I'm
going to skip past them.
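A sketch of the sparse-row idea: rather than adding a mostly-zero gradient matrix to the full parameter matrix, update only the rows for words that actually appeared. The sizes and indices here are made up:

```python
import numpy as np

vocab_size, dim = 1000, 50
U = np.zeros((vocab_size, dim))  # outside-vector matrix, one row per word

words_in_batch = np.array([3, 17, 17, 942])        # word indices that occurred
grads = np.full((len(words_in_batch), dim), 0.1)   # their gradient rows

# np.add.at does an unbuffered scatter-add, so the repeated index 17
# correctly accumulates two updates instead of one.
np.add.at(U, words_in_batch, -0.01 * grads)
```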
Right. So, a couple of people asked afterwards, yeah,
why are there these two word vectors, the center one and the outside one?
And, I mean the answer to that is,
it makes the math I showed you easy, right?
So that if, um,
you do it as I showed you, well,
you know, working out, um,
the partial derivatives for the center word
is easy, just as I showed you.
Um, but if you use only one set of word vectors,
well then the same word,
that's the center word,
will be one of the choices for
the context word when you're working out that Softmax for the context word.
And then you'll get these terms that are
squared terms, in terms of the two references
to that same word,
and that makes your math more difficult.
Um, so it's sort of just a practical thing,
um, in the end.
I mean it sort of doesn't make very much difference,
because if you sort of think about it since you're going along through all the,
um, positions, you know.
What was a center word at one point is immediately afterwards
a context word of what used to be a context word,
which is now the center word.
So, you're sort of doing the same computations, because, you know,
the dot product is symmetric, actually,
um, all over again.
So, they get pretty similar vector representations.
So, it seems like in general you can get the best results
by averaging what comes out for your two vectors,
and you end up with just one vector per word.
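That averaging step is one line; here the two matrices are random stand-ins for the trained center (V) and outside (U) vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 5))  # trained center-word vectors (stand-ins)
U = rng.normal(size=(6, 5))  # trained outside-word vectors (stand-ins)

# After training, average the two vectors for each word to get
# a single final vector per word.
word_vectors = (V + U) / 2
```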
Okay, more substantively, um,
if you go to the word2vec paper,
you'll discover that there's sort of more to word2vec: they
define sort of a family of word2vec models.
And there are two main choices in that family.
Um, firstly, there's a choice between the Continuous Bag of Words model
and the skip-gram model.
And what I presented was the skip-gram model.
So, in the skip-grams model,
you've got one center word and you're trying to
predict all the words in context one at a time.
For the Continuous Bag of Words model it's the opposite.
You've got all of the outside words and you're trying to use all of them,
though considered independently like a Naive Bayes model to predict the center word.
Um, and then the second one is, um,
the way I presented learning this was
the method that's using the so called Naive Softmax.
So, therefore when we are wanting to work things out,
we were sort of saying okay we want probability estimates for the context words,
and so we're just going to sum over
the whole vocabulary and we'll come up with these probability estimates.
Um, in practice, that turns out to be a sort of
a bad idea because that would also make things mega slow.
So, in homework two,
coming up next week, um,
you will get to implement a much more practical, um,
way of doing this which they present in the word2vec papers, right?
So, the problem is,
if we're using this equation that we use to do the calculus,
that down in this denominator here,
we're doing the sum over the entire vocabulary.
So, if you have a vocabulary of a quarter million words,
we're sort of doing a quarter of a million dot products and
exponentials and adding them all to work out that denominator.
And that sort of seems uh,
sort of a really bad idea if you want things to be fast.
Um, so, um, Tomas Mikolov and
colleagues came up with the idea that negative sampling would be near enough.
And so the idea of negative sampling,
is we're going to train binary logistic regressions instead.
And so, we're going to train one binary logistic regression
for the actual word observed what's in the numerator,
and you want to give high probability to the word that was actually observed.
And then, what we're going to do,
is we're going to sort of randomly sample a bunch of other words,
they're the negative samples and say they weren't the ones that were actually seen.
So, you should be trying to give them as low a probability as possible.
Okay, so, um, the sort of notation that
they use in the paper is sort of slightly different to the one I've used.
They actually do maximization not minimization,
and that's the equation which I'll come back to.
Um, though before we do that here's the sigmoid function.
So, the sigmoid function is normally written like this,
one over one plus E to the minus X.
But, um, essentially,
the sigmoid function is like a binary case of the Softmax function, right?
That we have two possible outcomes, yes or no,
and that you're sort of again got an input that is any real number,
and it's mapping it onto a probability distribution between
zero and one which represents these two binary outcomes.
And to the extent that the number is positive,
it goes up toward one, and negative goes down toward zero.
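For reference, the sigmoid written out as code, along with its key properties:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x)): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# sigma(0) = 0.5; large positive inputs go toward 1, large negative
# toward 0; and sigma(-x) = 1 - sigma(x) (symmetry around zero).
```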
Okay, so this time,
for the good word,
we're going to take the dot product of the two vectors,
shove it through our sigmoid function,
and then we're going to want that probability estimate,
um, to be as high as possible.
So, if I show you this version,
which is just written slightly differently, um,
to look as much as possible like the notation that we used last time,
here is our new objective function for using negative sampling.
And we've got two terms,
the first one, um,
is the log of the sigmoid of the observed context word,
the outside words, dot producted with the center word,
and we're going to want that to be big.
Um, and then on the other hand,
um, we've got, um, the,
um, K randomly chosen words,
which are just other words,
and we're going to work out dot products between them and the center word.
And we're going to want those to be as small as possible.
Um, note that extra minus sign in there which is causing
the sign of the two things to be different, right?
So, those are our negative samples.
And K can be a reasonably modest number;
you can just take kind of 10 or
15 negative samples, and that works pretty fine.
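Putting the two terms together, the per-window objective (in the maximization form) can be sketched as follows, with random vectors standing in for learned ones and K = 10:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, neg_samples):
    """J = log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c);
    we want this as large as possible (negate it to minimize)."""
    pos = np.log(sigmoid(u_o @ v_c))                        # observed word
    neg = sum(np.log(sigmoid(-u_k @ v_c)) for u_k in neg_samples)
    return pos + neg

rng = np.random.default_rng(1)
v_c = rng.normal(size=5)                         # center word vector
u_o = rng.normal(size=5)                         # observed outside word
negs = [rng.normal(size=5) for _ in range(10)]   # K = 10 negative samples
J = neg_sampling_objective(v_c, u_o, negs)
```

Note the minus sign on the negative samples: pushing their dot products down raises the objective, while pushing the observed word's dot product up raises it too.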
Um, I said we sort of sampled some words,
um, to be the negative samples.
They in particular propose a sampling distribution that helps them along
a little in partly dealing with this pro- problem of very frequent words.
Um, so the starting point of how you sample words is you
use what we call the- the unigram distribution.
So, that just means you take words in a large corpus and count up
how often each one occurs, just as counts of individual words,
so those are called unigram counts.
And so you start off with unigram counts,
but then you raise them to the three quarters power.
And raising to the three quarters power,
has the effect of, um,
decreasing how often you sample very common words,
and increasing how often you sample rarer words.
Okay, um, and that's that.
Okay, so that's everything about word2vec I am going to say.
Anyone have any last questions?
Yes. [NOISE]
Oh, oh [NOISE]. This is a- sorry, Z.
That capital Z is often used as a normalization term, and so this is saying,
well, if you want a probability distribution over words,
you work out this three-quarters power of the count
of the word for every word in the vocabulary, and then these
numbers, you just sum them up over the vocabulary, and that'll be the sum
total, and we're dividing by that so we get a probability distribution.
Good question, because I hadn't explained that.
Um, in this class, when you see the letter Z with no explanation,
it normally means "I am a normalization term to turn things into
probabilities": you sort of iterate over
the numerator terms, summing them, and divide through.
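Numerically, with some made-up counts, you can see what the 3/4 power does: the most frequent word's sampling probability shrinks relative to the raw unigram distribution, and rare words' probabilities grow.

```python
import numpy as np

# Made-up unigram counts for a 4-word vocabulary, most common first.
counts = np.array([1000, 100, 10, 1], dtype=float)

powered = counts ** 0.75      # raise counts to the 3/4 power
P = powered / powered.sum()   # Z = powered.sum(), the normalization term

# Compare against the raw unigram distribution.
P_unigram = counts / counts.sum()
```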
Any other questions of things I haven't explained or otherwise? Yes.
So the window [inaudible] that's a [inaudible]
Yeah, yes.
So, [NOISE] what size window do you use?
I'll actually come back to that in a bit and show a little bit of data on that,
but yeah, we haven't done anything about that.
At the moment we're guessing a window size like five,
which isn't a bad one um,
but you know, there hasn't really been any science behind that, um;
people treat that as what's called a hyperparameter, which means that, um,
you try a few different numbers and see which one seems
best, and that's the one that you use in your future work. Yeah.
Um, [inaudible] three quarters power
chosen for any theoretical reason or just because it seems to work in practice?
Um, no. Um, that,
that was, um, also chosen as a hyperparameter, and it improved performance.
I mean, actually um, you know,
for this Word2Vec paper, I mean,
you know, it turns out that um,
in the actual paper, um,
the model looks fairly clean, but what
people discovered when they started digging through the code-
which, to their credit, they did make available, reproducible research-
is that there are actually a whole bunch of tricks,
different things like these hyperparameters of, um,
how you sample, and how you weight windows, and various things to make the numbers better.
So, you know, people play quite a few tricks to make
the numbers go up which aren't particularly theoretical.
Are we good?
Yeah.
[inaudible] [NOISE].
Ah, sometimes.
So in general, for a lot of these sampling things,
if you're going to be doing multiple passes and you just go boom,
boom, boom and then boom, boom, boom again in the same order,
that's a bad idea,
is that they do use this shuffling operation at the beginning.
So for each epoch,
they'll shuffle the data randomly and then they'll go through it in sequence and that has
the benefits of faster computation from locality et cetera um,
while meaning that when you do a different epoch,
it will work out differently.
Uh, yeah, yeah.
[inaudible] [NOISE] [inaudible].
That last question I think was talking about taking the mini-batches from the corpus and
contrasting whether you actually say sample 20
randomly from the whole corpus versus just sort of working from left to right.
Yes, do you have a question?
Um, yeah [inaudible] [NOISE].
Yeah. So- so you could argue- you could argue
whether or not this was written in the clearest way, but, right.
So, we're taking this dot product and then we're negating it,
which is then flipping which side of the space we're on, right?
Because the sigmoid is symmetric around zero.
So, if we've got some dot product um,
and then we negate it,
we're sort of working out a one minus probability, and so
that's the way in which, for the first um,
for the first term, we're wanting the probability to be
high, and then for the negative samples,
we're wanting their probability to be low.
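To make the symmetry concrete: sigmoid(-x) = 1 - sigmoid(x), so negating the dot product turns "probability high" into "probability low." Here's a minimal sketch of the negative-sampling idea, not the actual Word2Vec code (function and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(center, true_ctx, neg_ctxs):
    """Want sigmoid(dot) high for the true context word and
    sigmoid(-dot), i.e. 1 - sigmoid(dot), high for negative samples."""
    obj = np.log(sigmoid(center @ true_ctx))
    for neg in neg_ctxs:
        obj += np.log(sigmoid(-center @ neg))  # negation flips sides of the sigmoid
    return obj
```

Maximizing this objective pushes the true context's probability up and the sampled words' probabilities down.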
Okay, I'll maybe run ahead now.
Um, so this was an algorithm where um,
sort of you're going through this corpus position by position and you're sort of doing
this prediction of words, and then you're
updating some parameters and you're learning something, and you know,
by Jove, it seemed to work based on what we saw in the examples,
but you know, you might have thought um,
that that was kind of weird right?
Look, we have this whole big pile of data, you know,
sort of traditional thinking in statistics, right?
So you have a big pile of data,
you aggregate it, and it sort of seems like there are obvious things you could do here.
You could say, well, there's a word like,
whatever word we're using, banana.
Let's just see what words occur in the context of the word banana and count
them all up and then we'll be able to use those to predict somehow and you know,
those kinds of methods were traditionally
used including even with distributed representation techniques.
Um, so I want to say a bit about that,
so you're fully educated and don't sound like one of those people who are
aware of no work that happened before 2013 when neural networks took off.
Um, okay. So, what we could do is we can essentially
do the same thing as sort of Word2Vec.
We could say there's a five word window around
each word instance, which is often referred to as a word token, right?
So in NLP, we often want to distinguish between a particular word type, like banana
or apple, versus particular instances of it
in the text, and that's referred to as the type-token distinction.
So we could, um,
look at each um, token of a word,
and the five words around it,
and then we could start counting up which words
occur with it, and so we can then have a matrix of co-occurrence counts.
Um, okay.
So, we'll have again,
and I'm going to give an example of this.
So, normally again you'd use a window of five to 10, but you know, I can just
use a window of one to keep my counts very simple and small.
I ignore whether it's left or right, just like Word2Vec did,
and so if I have a teeny baby corpus like this,
you know, what I could do,
is just say here is the matrix of word co-occurrence counts.
So, within my window size of one,
I occurs next to like twice,
and that means that like occurs next to I twice, it's symmetric,
and all my other counts here are singletons, um.
And so this gives me a big huge sparse matrix of word co-occurrence counts.
And so one thing that you could do is just use this matrix directly,
because I haven't really got enough data here.
But, you know, if you sort of,
um, decided that, you know,
the word like is like the word learning,
what you'd do is you'd expect that
these two vectors would end up kind of similar to each other.
And [NOISE] they do.
So, you can just measure, um,
similarity of the vectors directly in terms of these co-occurrence counts.
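A sketch of that counting, with a hypothetical toy corpus standing in for the one on the slide (I'm assuming window-1, direction-ignoring counts exactly as described):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Build symmetric word-word co-occurrence counts as a dict."""
    counts = defaultdict(int)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, w in enumerate(tokens):
            # count neighbors within the window, ignoring left vs. right
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

# Hypothetical toy corpus in the spirit of the lecture's example
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
C = cooccurrence_counts(corpus, window=1)
```

With this corpus, "I" occurs next to "like" twice, and the matrix is symmetric: the (like, I) count is also two.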
But, you know, it's a little bit unappealing doing things this way, right?
If you have a quarter million word vocabulary, you're,
where my math may be bad, in the tens of billions for the number of cells of this matrix,
which might require a lot of storage.
Though if you're clever and notice that most of the cells are zero, you could do
some clever sparse matrix representation, and it might take a little bit less.
Um, your classification models might have sparsity issues because, you know,
a lot of those cells aren't present, and so it might not be very robust.
And so there's a traditional answer to all of these things, which is, well,
maybe we could take that big co-occurrence counts matrix
and somehow reduce its dimensionality, um,
find a corresponding low dimensional matrix which preserves,
uh, most of the information, um,
in the original matrix, and, you know,
maybe we'll reduce things to a dimensionality of somewhere around 25 to 1,000,
um, as is done with Word2Vec.
So, there's sort of a standard, most common way of doing
this dimensionality reduction, and you don't really have to understand all the math,
but you get to play with this in homework one, which is, um,
for any matrix you can do what's called the singular value decomposition, um,
which is a way you can take an arbitrary matrix and decompose it into three matrices, um,
where the center one is diagonal and has in it what are
called singular values, which are weightings of the different dimensions.
So, they decrease in size as you go downwards.
And then these two, U and V, are then
orthogonal bases corresponding to the rows and columns.
And so in particular,
it's even simpler in the case where we just have these word-word vectors,
because you have a square matrix and so the rows and columns are effectively the same.
But, you know, for the general case, um,
although you get these sort of full orthogonal bases,
you then have these bits that don't really
matter, because they end up being multiplied by nothing when you work out the product.
Um, and then if you want to reduce the dimensionality, what you say is,
throw away the smallest singular values, which, remember, are in decreasing
size, and that means you're then effectively
throwing away rows and columns of these other matrices.
And then it says,
behold, I've now reduced these things to
a two-dimensional representation from
the original three-dimensional representation, and that's referred to as
the reduced SVD, and the classic result is that, in terms of least squares error in
estimation, the product of these three things will give X_k, which is the best
rank-k approximation to the original X in terms of,
uh, a least squares criterion.
So, we could do this and we could build word vectors.
So, I can, um,
make use of, um,
NumPy's SVD function and I can throw into it,
um, matrices and, um,
I can make word vectors.
And these ones look really bad, but hey,
I gave it a dataset of three sentences [LAUGHTER] and it's not exactly a fair comparison.
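A minimal NumPy sketch of that reduced SVD (the toy count matrix here is made up, not taken from the slides):

```python
import numpy as np

# Hypothetical small symmetric co-occurrence count matrix
X = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, singular values s in decreasing order
U, s, Vt = np.linalg.svd(X)

# Keep only the k largest singular values for the rank-k approximation
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rows of the scaled left singular vectors can serve as k-dim word vectors
word_vectors = U[:, :k] * s[:k]
```

The product of the three full matrices reconstructs X exactly; truncating to k columns gives the best rank-k approximation in the least squares sense.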
But- so this technique was, um,
popularized around, um, the turn- the turn of the millennium.
It generally, um, went, for
some word applications, under the name of latent semantic analysis or
latent semantic indexing, and the idea was that you could have
these semantic directions that you are
finding in this low dimensional space that had meaning.
And people worked with it quite a bit for techniques
like t- trying to do information retrieval
using these LSA approximations and it sort of worked a bit.
It kind of never really worked very well I think,
um, and so it never sort of hugely caught on.
Um, but it's- the methods kind of continued to be explored, actually mainly in the sort of
cog psych- cognitive psychology community, where people were doing things with word meaning.
And there's this sort of kind of interesting, um,
[NOISE] thread in the literature, that there was this guy Doug Rohde, um,
who, um, did a PhD at CMU, um, in 2005.
And basically what he discovered was,
look if rather than just using raw counts,
I start doing quite a bit more in terms of,
you know, fiddling with the counts,
I can start to produce results that are much better.
So, rather than using raw counts,
you have to do something to deal with those very high-frequency words.
So, one idea is you could log-scale them, which
is also commonly used in information retrieval.
Another idea is you could just use something like,
uh, a ceiling function,
so you take the minimum of X and t, with t set at some number like around 100.
Um, he had- he used the idea, which was also another of the hacks that was put into
Word2Vec, that rather than just treating the whole window the same, you should,
um, count words that are closer more.
So, in Word2Vec, they sample closer words more commonly than further away words.
Um, in his system, you sort of have
a differential count for closer words, et cetera.
And then, um, beyond any of that, rather than using counts at all,
he then started using Pearson correlations, which
helped, except they're sometimes negative, and he decided that it helped,
um, if you then got rid of the negative values.
So, in- in some sense,
this sounds like a bag of hacks,
um, but on the other hand,
he was able to show that, you know,
these transformed counts could actually then give
you very useful word vectors as I'm about to show.
And- well, we have to realize that actually, in slightly different forms,
several of these exact same tricks are actually being used in Word2Vec as well.
Do you hear that?
Yeah. Were they [inaudible].
Yeah. So, so that's an- I'm about to show exactly that.
Um, that's actually a really interesting little,
um, bit of the data.
So, you know, what, um, yeah,
so the, the thing is, if you do that,
you not only get word similarities that are pretty good.
Let me show you this example which is cleaner.
Um, so this- the precise idea of
evaluating with analogies was not something that had really been developed.
So, that was actually something that Tomas Mikolov, um, suggested.
But actually, um, Doug Rohde made this, um,
really interesting observation which was- he said, look,
"Once I do these kind of transformations to
improve the semantic representation of my word vectors,
look this really interesting property emerges.
Um, that what you find is that there are semantic vectors
which are basically linear components in my carefully-constructed space.
So, here we have the sort of, um,
verb to the doer-of-the-verb direction:
drive, driver; clean,
janitor; swim, swimmer; learn,
teacher, or teach, teacher;
treat, doctor; pray, priest.
I mean, you know, it's not exactly perfect,
you know, there's a little bit of wiggle there, right?
But, you know, roughly it's completely clear that there's sort of a direction
in the space that corresponds to- from a verb to the doers of a verb.
Um, and yeah, so he [inaudible] - he-
no one had thought of this idea of using analogies as tests.
But the thing in retrospect that's obvious is,
if you can construct a vector space that has this linearity property,
then you're definitely gonna do well in analogy.
So, effectively he had invented a vector space that would do
well in analogies, because this means that you've got
this direction which is the doer, and then you can immediately
say that's the doer vector, which you can get from subtracting clean from swimmer.
And the- right. So, it's clean from janitor.
And then we can add it on to swim and we'll get somewhere close to swimmer.
Um, so his space actually did do that.
And so, um, this is- so the,
the moral in some sense is,
if you kind of do carefully controlled counts and so on,
then conventional methods can also give you good word vector spaces, and- I mean,
so that was actually the starting off point for our work on GloVe.
Um, so that essentially,
there had been these two schools of work.
Um, there had been the school of work that had
been explored more in COG psych than anywhere else,
which had been based on counting and transforming counts.
And, you know, it had some advantages or it seemed it had some advantages, right?
That, um, you're making sort of efficient use of statistics as you're using
the global statistics of the whole matrix directly to estimate things.
Um, and at that poi- up until then,
it had really only being used to capture word similarity, um,
and a lot of it had suffered from disproportionate im- importance given to large counts.
But Doug Rohde, he had sort of started to show how to solve both of these problems.
And so on the other hand,
there had been these neural network methods
which are kind of direct prediction methods that
we're defining that probability distribution and trying to predict the words that occur.
And they had some advantages, right?
The fact that you're sampling means that you're not going to run out of memory, hopefully.
I know we've had some memory problems with homework one, but in principle,
you're not in as bad a memory position as if you have to
construct a huge matrix, because you're going along linearly,
um, but, you know, since you're doing it sample by
sample, it's an inefficient use of statistics, um.
Okay. And so, but on the other hand, Mikolov's work, it performed perfectly.
Not perfectly, but really well.
Um, so this has sort of led into this work,
um, that Jeffrey Pennington, um,
Richard Socher and I did [inaudible]: can we sort of combine these ideas
and sort of have some of the goodness of the neural net methods,
um, while trying to do things with some kind of count matrix.
And so in particular, um,
we wanted to get, in
a slightly less hacky way, the result that you want: to have components of meaning
be linear ope- linear operations in
the vector space, that they're just some effect of adding vectors or something like this.
And so the crucial observation of this model was that we could use
ratios of co-occurrence probabilities to encode meaning components.
And so the idea here is,
if you have a word like ice and
you say how often the thing's going to co-occur with that,
well solid should co-occur a lot and gas shouldn't.
But well water is also going to co-occur a lot and some random word won't occur much.
If you have, oops.
If you have steam,
you get the opposite pattern with solid and gas, right?
But so the thing to notice is,
it's not enough to just have large by itself, because large
appears both here and here, or small appears there and there;
the thing that's interesting is sort of the difference between
these components, and that's indicating a meaning component.
And so we can get at that if we look at the ratio of co-occurrence probabilities.
And so for solid, the ratio of co-occurrence probabilities picks out a dimension of
meaning, whereas for other words this sort of ratio cancels out to about one.
And so on this slide I've moved on, so it's not just my
'small' and 'large'; these are actual counts from a corpus.
So we roughly get a dimension of meaning between
solid and gas, and the other ones are coming out
as about one, because they are not on the dimension of meaning.
And so, it seems like what we want is, we want to have ratios of
co-occurrence probabilities become linear in our space.
And then we're in good business.
And so that's what we want to set about doing.
Well, how can you do that?
Well, the way you can do that,
is, if you can make the dot products equal to the log of the co-occurrence probability,
then immediately you get the fact that when you have
a vector difference it turns into a ratio of the co-occurrence probabilities.
And so, essentially the whole of the model is that we
want to have dot products be logs of co-occurrence probabilities.
And so, that's what we do.
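Spelled out (using $w$ for word vectors and $\tilde{w}$ for context vectors, following the GloVe paper's notation), the observation is:

```latex
\begin{align*}
w_a \cdot \tilde{w}_x &= \log P(x \mid a) \\
\Rightarrow\quad (w_a - w_b) \cdot \tilde{w}_x
  &= \log P(x \mid a) - \log P(x \mid b)
   = \log \frac{P(x \mid a)}{P(x \mid b)},
\end{align*}
```

so vector differences encode exactly the ratios of co-occurrence probabilities.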
So, here is our objective function here,
and it's made to look a little bit more complicated than it is.
But essentially we've got this squared loss here,
and then we're wanting to say the dot product should be as similar
as possible to the log of the
co-occurrence probability, and so there'll
be loss to the extent that they're not the same,
but we kind of complexify it a little by putting in bias terms for both of the two words.
Because maybe the word is just overall common and likes to
co-occur with things, or is uncommon and doesn't.
And then we do one more little trick, because every [inaudible] does tricks to make the performance
better, which is that we also use this f function in front,
so that we're sort of capping the effect that
very common word pairs can have on the performance of the system.
Okay. And so that gave us the GloVe model of word vectors.
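As a sketch, the objective just described looks roughly like this (a minimal version with my own variable names and a plausible weighting function; the real GloVe implementation differs in many details):

```python
import numpy as np

def glove_loss(w, w_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Weighted squared error between dot products (plus biases)
    and log co-occurrence counts, summed over nonzero cells of X."""
    loss = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                # f caps the influence of very frequent word pairs
                f = min(1.0, (X[i, j] / x_max) ** alpha)
                diff = w[i] @ w_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                loss += f * diff ** 2
    return loss
```

The loss is zero exactly when every dot product plus biases matches the log count, which is the property that makes vector differences turn into log ratios.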
And theoretically, the interest of this was,
you know, a lot of the preceding literature had been there had
been these count methods and there had been these prediction methods.
And the hope was that this could sort of unify the
two by showing you how you could have a method that
is estimated simply of a count matrix but it's done in the same kind of
iterative loss based estimation method that's
used for the neural methods to get good word vectors.
And this also worked to give good word vectors.
So here's GloVe results for the word frog.
And frogs and toad are obvious.
But there are these different kinds of words, uh,
various kinds of pretty tree frogs and things like that.
Okay. Um, so I'll then go from here and say a little
bit more about some of the work on evaluating word vectors.
And this is maybe also a chance to just talk a little bit about evaluation altogether.
So, normally in NLP when we do an evaluation,
the first thing that comes up is intrinsic versus extrinsic evaluation.
So, normally if there's something we're trying to do, like model, um,
word similarity with word vectors, or we're trying to, um,
put parts of speech on words or something,
we can just have an intrinsic evaluation of saying how good a job did you do.
Are you guessing the right part of speech?
Are you putting synonyms close together?
And that's sort of normally very easy to do and fast to compute.
And it's useful to do because it helps us understand the system.
On the other hand, a lot of the time with those intrinsic evaluations,
it's not very clear whether- whether having done well on that task is really going to
help us build the amazing natural language understanding robots
that we so ardently desire.
Um, so, people are also very interested in extrinsic evaluations.
And so extrinsic is then saying, well, suppose you use
this new stuff in a real system, does it make performance go up?
And it's then sort of definitional what counts
to you as a real system; normally that's
meaning it's some application that human beings actually care about and like to use.
So that's something like web search, or question answering,
or a phone dialog system or something like that, um,
that you can put it into that system and the numbers get- go up.
So, that seems what you want to do.
You want to have stuff that works in real tasks.
Of course, on the other hand, there are sort of a lot of things that are a lot harder then.
It's much more work to do such an evaluation and run different variants of a system.
And even when the results are, uh,
poor or great, sometimes it's hard to diagnose.
You know, if- if your great new word vectors don't work better in the system, you know,
it might be for sort of some extraneous reason about
how the system was built that's sort of hiding all your magic.
And if you just changed the rest of the system, it would suddenly show its good effects.
So, it's kind of hard to do,
um, sort of, um,
apportionment of goodness and badness. Okay.
So, um, so, today I'm mainly going to say a little bit more about
these intrinsic word vector evaluations that we've talked about.
So we've talked quite a bit about these analogies.
So if we're actually working out the analogies,
it turns out that normally what people are doing is working out
a cosine distance, an angle, between, um,
different word candidates, um,
to work out which is the word that solves the analogy, which
is a little tiny wrinkle of difference there.
And there's also one other trick that people commonly use.
They forbid the system from returning one of the three words you put
into the analogy. Okay.
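That procedure can be sketched like so (a toy illustration with made-up vectors; this is not how any particular evaluation codebase is actually written):

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """a : b :: c : ?  Pick the word with highest cosine similarity to
    (b - a + c), forbidding the three input words from being returned."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the common trick: exclude the input words
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy, made-up vectors purely for illustration
vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king": np.array([1.2, 0.2]),
    "queen": np.array([0.2, 1.2]),
    "apple": np.array([1.0, 1.0]),
}
answer = solve_analogy("man", "king", "woman", vectors)
```

Without the exclusion trick, the nearest word to b - a + c is often b or c itself, which is why people forbid the inputs.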
But nevertheless, so, this is something that you can evaluate.
Here now some GloVe visualizations.
And so these GloVe visualizations show exactly the same kind of
linearity property that Doug Rohde discovered, which means that analogies work.
Sort of by construction,
because our vector space wanted to make meaning components linear.
So, this is then, um,
showing a gender display.
This is showing one between companies and their CEOs, kind of cool.
And you can also do more syntactic facts.
So this is showing, um,
positive comparative and superlative of adjectives.
Yeah. So, Tomas Mikolov came up with this idea of doing these analogy tasks.
And so he built a data-set with a lot of analogies in it.
It's sort of- it's a bit of a weirdo data-set, because it sort of tests a few
random different things, which may have been things that his system worked well on, um,
but you know, it tests countries and capitals,
cities and states, countries and currency.
So there are a bunch of semantic things it tests.
And then there are some, um,
syntactic things it tests, so bad, worst,
fast, fastest for superlatives.
But, you know, even some of the ones I was showing before, you know,
there's no- there's no Obama is to
Clinton kind of ones that are actually in this evaluation set.
Um, here's a big table of results,
um, that comes from our GloVe paper.
So not surprisingly, the GloVe model performs best in this evaluation.
Because that was our paper. Um, [LAUGHTER]
[LAUGHTER] But I mean perhaps- you know,
perhaps the things to start to notice is,
yeah, if you just do a plain SVD on counts,
you know, that works abominably badly for these, um, analogy tasks.
But, you know, kind of as Doug Rohde showed,
if you start then doing manipulations of the count matrix before you do an SVD,
you can actually start to produce
an SVD based system that actually performs quite well on these tasks.
Um, you know, not badly against other things.
Um, other things that you will discover:
right at the top there are 100-dimensional ones,
and at the bottom there are some 1,000-dimensional ones,
and other 300-dimensional ones.
At least when you're training on a big amount of text,
bigger dimensionality definitely works better.
And I'll come back to that in a minute.
Um, the amount of text makes a difference as well, right?
So we're going up from- so one to 1.5 billion words at the beginning,
to these ones down here are being trained over 42 billion words of text,
and perhaps unsurprisingly, the 42 billion words of texts ones work better.
Um, so it's big data.
Um, here are a couple more steps from this paper.
So this is a graph of dimensionality and what the performance is.
So for the three lines the green one's semantic,
the blue one's the syntactic analogies and so red's the overall score.
So sort of what you see is, up to dimensionality
300, things are clearly increasing quite a bit,
and then it gets fairly flat,
which is precisely why you find a lot of word vectors,
um, that are of dimensionality 300.
Um, this one's showing the effect of window size.
So this is sort of what we talked about, symmetric on both sides window size,
as it goes from 2, 4, 6, 8, 10.
And sort of what you see is,
if you use a very small window like two, that actually works okay.
The, the syntactic prediction is stronger, because, well,
syntactic effects are very local.
Whereas as you go out,
the semantic prediction gets better and better.
Actually this syntactic gets a bit better as well,
but it's especially the semantic that gains.
Um, the right graph shows that if you only use context on one side,
um, your numbers aren't as good.
Okay, um, so, I sort of just wanted to sort of sneak in a little cameos of a couple of,
um, recent bits of work,
as sort of a first of what things people are doing,
um, with word vectors.
Um, so this one, um,
was actually by two Stanford people.
Um, now the best- this would be the best story.
If I could say that this was a final project,
um, in this class last year,
but unfortunately that's not true.
This paper has nothing to do with this class [LAUGHTER].
But it-- right.
Um, Zi Yin and Yuanyuan Shen,
um, actually had, um,
some sort of clever and very mathy ideas,
where they're using matrix perturbation theory.
Um, and sort of just showing how, um,
dimensionality in word vectors actually sort of feeds into the bias-variance trade-off.
If you've seen that,
um, in other parts of machine learning.
And I'm not even going to attempt to explain their paper.
Um, but here it is;
they did really well with this paper,
they got an oral talk at NeurIPS from it.
Um, and so- but there's sort of
an interesting result of what you see with these word vectors,
which is in a way kind of surprising.
So this is showing doing word vector dimensions from zero up to 10,000.
So we're going way higher than we talked about before.
And so what you discover, which people have known for ages, is
that there's sort of a little blip somewhere around 200 or 300,
which seems to optimize performance.
So, I've used those sizes.
But the thing that they were sort of doing a lot of their theory about,
and it's kind of surprising is, well,
surely if you have a humongous humongous number, like,
if you are using 10,000,
um, dimensional vectors, you know,
you're trying to estimate another two orders of magnitude more numbers for every word,
surely things should just fall apart, um,
because you've got hopelessly many parameters relative to
the amount of training data that you're trying to estimate these numbers from.
And so the interesting result that they show is,
that things don't fall apart.
Um, and that you can essentially go out to these huge huge dimensionalities,
and the performance stays flat.
And that they've got a lot of theory,
sort of for predicting why that that's actually going to end up being the case.
Um, yeah.
So for training these models iteratively,
this is- orange is showing, um, GloVe training.
You know, they keep on getting better for a while.
So you know, just go out,
go sleep and see in the morning how it's doing, right?
So that if you were running it, um,
for 24 hours your numbers are better than if you only ran it for six hours.
Um, and that's true for a lot of deep learning models, sorry.
So this is the key reason why you don't want
to start your assignment the night before it's due.
Because even if you program it perfectly,
you might just not have enough time for it to run,
um, so that you produce good numbers at the end of it.
Um, okay. Uh, yeah so,
so couple of more, um,
things, on that, um.
Yes. So, um, what are we showing here?
So these are again semantic, syntactic, and overall numbers.
So there are sort of two things that are sort of being mixed together here.
One is, if we just look at the overall numbers,
they're highest over here, um,
which is this 42 billion Common Crawl web-pages corpus,
that gives us the highest overall number.
But there's sort of something else that's interesting in this graph, which is,
um, that using Wikipedia works surprisingly well.
So you actually find that 1.6 billion tokens of Wikipedia works
better than 4.3 billion tokens of newswire newspaper article data.
And so I, I think that sort of actually makes sense,
which is, well, you know,
the job of encyclopedias is to just sort of
explain concepts and how they relate to each other, right?
So encyclopedias are
just much more expository text that shows all the connections between things,
whereas newspapers in general aren't trying to explain how things fit together.
They're just telling you about, you know,
who got shot dead last night or something like that, right?
So, um, so this is sort of interesting fact, um,
that this Wikipedia data kind of really,
it sort of is differentially useful, um,
for, um, making word vectors.
And you know, in fact, you know,
we did very well with our GloVe word vectors, and lots of people use those.
You know, I think actually one of the reasons why they work so well is that
the original Word2Vec vectors that Google distributes are built only on Google News data,
whereas ours sort of have this,
um, Wikipedia data inside them.
Okay, um, rushing ahead.
Um, yes, so the- there's all the work on analogy,
but the other more basic evaluation is this one of capturing similarity judgments.
And I haven't said much about this, but you know,
there is this sort of large sub-literature in the psychology community,
where people have wanted to model humans judgments of similarity.
So like a good psych person, what you do,
is you find your classroom of Psych one undergrads,
and you show them pairs of words and say rate
these things for similarity on a scale of one to 10.
And lots of that data has been collected,
and you work out the mean over human beings,
and they give numbers like this: tiger and cat, 7.35.
Tiger is similar to tiger, 10; book and paper,
plane and car, stock and phone,
stock and CD, and you get numbers.
So then, what we're doing is wanting to say,
well let's use distance in the space to map directly onto these similarity judgments,
and how well does it map?
And so that sort of similarity judging has
also then been used for evaluating these systems.
So again, here are a lot of models.
This is again from our GloVe paper.
But so there are these various similarity data-sets.
So one of the best-known ones that I had on the slide before is this, um, WordSim-353.
It has 353, um,
different word pairs in it,
and so you are sort of then modeling a correlation
between your judgments of similarity and the ones that came from the human beings.
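The evaluation being described, correlating model cosine similarities against human scores, can be sketched as follows (toy vectors and scores are invented; real evaluations use Spearman rank correlation over the full 353 pairs, and this rank helper ignores ties):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def _ranks(x):
    # simple ranking, ignoring ties; enough for a sketch
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x))
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(_ranks(np.asarray(a)), _ranks(np.asarray(b)))[0, 1])

def evaluate_similarity(pairs, human_scores, vectors):
    """Correlate model cosine similarities with human judgments."""
    model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    return spearman(model_scores, human_scores)
```

A correlation near 1 means the model orders word pairs by similarity the same way the Psych One undergrads did.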
Okay. Two more things I want to say. Um, yes.
So, we had that problem right at the beginning
of Clinton and how that could be various people.
And that's perhaps in some sense the simplest case of words being ambiguous,
when you have names which have reference to different people.
Um, but it's not only true of names.
So by and large,
words in human languages are ambiguous and have lots of meanings.
Um, that's especially true of common words.
They always have lots of meaning.
It's especially true of words that have existed for a long time.
It's not true of new very technical words, you know, carcinoma.
I think that only has one meaning.
Um, but, you know,
if you think of any relatively, um,
common word and start, um,
scratching your head for a moment,
you'll find it has lots of meanings.
I- maybe this isn't even such a common word,
but my random word I've got here is Pike.
Um, pike has lots of meanings,
it has meanings like?
Fish.
Fish, it's a kind of fish, yeah.
So there's a fish that's a pike.
What else is a pike?
A large spear.
A large spear.
Yes, so a large spear is a pike.
Other kinds of pikes?
Gymnastics move.
It's a road.
Gymnastics move or in diving move.
It's a road.
Um, yeah. Um, so there are lots of meanings.
Um, there are other meanings.
Um, in Australian English,
pike is also used as a verb to mean,
um, to pull out from doing something.
Like, "We were all going to go out to a nightclub later, but Joe piked."
[LAUGHTER] Um, I don't think that usage is common in this country,
but, um, you can try that, um. [LAUGHTER]
Right. But lots of meanings and, you know,
this isn't only true of the word pike, right?
Pick any other simple word, right?
You can pick a word like shell or field or house or make,
you know, they have lots of meanings when it comes down to it.
So, you know, but, uh,
how can this work if we just have one meaning for words?
And that's the interesting question and it was something that
[NOISE] we were actually interested in early on.
So, even before the Word2Vec paper came out, back in 2012,
um, we were playing around, um,
with neural word vectors, and, um, we thought,
boy, this is so broken, having only one,
um, sense for a word.
Why don't we come up with a model that has multiple senses for a word?
And so we did that and we did it in a pretty crude way,
I guess, [NOISE] um,
the way we did it is say,
well, let's for each common word,
let's cluster all the contexts in which it occurs.
And then we'll see if there seem to be
multiple clear clusters by some criterion for that word.
And if so, we'll just sort of split the word into pseudo words.
So, if it seems like that there are five clusters,
um, for the word,
the example I meant to use here is jaguar.
Five clusters for the word jaguar,
I will just call them jaguar_1, jaguar_2, jaguar_3, four,
five. So we just literally changed
the word in our corpus according to its cluster number.
And then we ran our word vector algorithm and so we get
a representation for each of those senses of the word.
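A minimal sketch of what that pseudo-word preprocessing could look like, assuming a tokenized corpus. The function name, the toy setup, and the choice of k-means over bag-of-words context vectors are all illustrative, not the actual code from that project:

```python
# Cluster the contexts of an ambiguous word, then relabel each occurrence
# with its cluster id (jaguar_1, jaguar_2, ...) before training word vectors.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def relabel_by_context(corpus, target, n_senses=5, window=5, seed=0):
    """corpus: list of token lists. Returns a copy where each occurrence
    of `target` is renamed `target_k` according to its context cluster."""
    # Collect the context words around each occurrence of the target.
    contexts, positions = [], []
    for i, sent in enumerate(corpus):
        for j, tok in enumerate(sent):
            if tok == target:
                ctx = sent[max(0, j - window):j] + sent[j + 1:j + 1 + window]
                contexts.append(ctx)
                positions.append((i, j))
    vocab = sorted({w for ctx in contexts for w in ctx})
    index = {w: k for k, w in enumerate(vocab)}
    # Represent each occurrence by a bag-of-words count vector of its context.
    X = np.zeros((len(contexts), len(vocab)))
    for r, ctx in enumerate(contexts):
        for w, c in Counter(ctx).items():
            X[r, index[w]] = c
    labels = KMeans(n_clusters=min(n_senses, len(contexts)),
                    n_init=10, random_state=seed).fit_predict(X)
    # Rewrite the corpus so each occurrence carries its cluster number.
    relabeled = [list(sent) for sent in corpus]
    for (i, j), k in zip(positions, labels):
        relabeled[i][j] = f"{target}_{k + 1}"
    return relabeled
```

After this relabeling, any standard word-vector trainer run on the modified corpus produces one vector per pseudo-word, i.e. one per discovered sense.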
And basically, that works,
right up the top is jaguar_1 next to,
uh, luxury and convertible.
Um, here is, I guess there's a very old version of MacOS called Jaguar,
anyone remem- remember that one?
Um. Right. So, jaguar's right next to software and Microsoft up there, so that's hopeful.
Um, here's the jaguar that's right next to hunter, um,
and I'm a bit confused on this one,
this jaguar is near solo, musical, keyboard and string.
Is there a band, [NOISE] a brand of keyboard called jaguar?
I'm not quite sure about that one,
but anyway, it's sort of basically works.
Um, but that was sort of crude and it's also perhaps problematic,
since a lot of the time, the divisions between senses aren't very clear, right?
A lot of senses are actually related to each other and overlapping, because
the way senses normally arise is that people stretch the meanings of words.
It's not that they just sort of randomly
wake up the next morning and say, "I know, carpet.
I could also refer to that as stone," um,
and give a new sense to the word stone, right?
So you take something that you know about like
a web and you extend it metaphorically to other uses of webbing.
Um, so here's a perhaps more interesting thing,
so this is the other Sanjeev Arora,
um, paper that I was going to mention.
So, what happens if you don't,
um, if you don't have more than one sense for each word?
Well, effectively what you get is that
the word vector that you learn is what's referred to by
physicists and fancy people as a superposition
of the word vectors of the different senses.
And superposition just means a weighted average.
Um, um, [LAUGHTER] so that effectively
my meaning of pike is sort of
a weighted average of the vectors for the different senses of pike,
and the components are just weighted by their frequency.
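A tiny numerical illustration of that superposition idea, with made-up sense vectors and made-up frequencies (none of these numbers come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
# Hypothetical sense vectors for "pike" (fish, spear, road, ...) and their
# relative corpus frequencies; the single learned vector for the word is
# just the frequency-weighted average of its sense vectors.
senses = rng.normal(size=(3, d))
freqs = np.array([0.6, 0.3, 0.1])   # must sum to 1
v_pike = freqs @ senses             # superposition = weighted average
```

So the more frequent a sense is in the training corpus, the more it dominates the word's single vector.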
Um, so that part maybe is perhaps not too surprising,
but the part that's really surprising is well,
if we're just averaging these word vectors,
you'd think you couldn't get anything out of the average, right?
Like if I tell you I'm thinking of two numbers and their
weighted sum is 54,
what are my two numbers, right?
You are sort of really short of information to be able to answer my question.
But, well, you know,
for these word vectors, um,
we have these high-dimensional spaces and even though there
are a lot of words, the space is so vast in that many dimensions
that actual words or senses are very sparse in that space.
And so it turns out that there's this whole literature on, um,
sparse coding, compressed sensing,
um, some of which has actually done by people in the stats department here,
um, which shows that in these cases where you have these sort of sparse,
um, codes in these high dimensional spaces,
you can actually commonly reconstruct out the components of a superposition,
even though all you've done is sort of done this weighted average,
and so, um, this paper looks at how you can do this and so they have,
um, these underlying meaning components,
and they sort of separate them out.
So, tie has one meaning component
that's in this space of trousers, blouse, waistcoat,
that makes sense, and another one in this meaning component of season, teams,
winning, league, makes sense.
Um, scoreline, goal, equalizer, clinching, scorers,
this one seems to overlap with that one a bit.
Um, but here tie,
this is sort of cable ties and wire ties and things like that.
So, they are actually able to pull out the different sense meanings,
um, out of the meaning of the word.
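A rough sketch of why recovery from the weighted average is possible at all: with random, nearly orthogonal "meaning components" in a high-dimensional space, a greedy matching-pursuit loop can pull the components back out of the superposition. This is a toy stand-in for the sparse-coding machinery in the actual paper, not their method:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_atoms = 300, 2000
# A "dictionary" of candidate meaning components. In high dimension,
# random directions are nearly orthogonal, which is what makes recovery
# from a weighted average possible.
atoms = rng.normal(size=(n_atoms, d))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

true = [17, 503, 1999]              # which components the "word" mixes
weights = np.array([0.6, 0.3, 0.1])
v = weights @ atoms[true]           # the observed superposition

# Greedy recovery (matching pursuit): repeatedly pick the atom most
# correlated with the residual and subtract its contribution.
residual, recovered = v.copy(), []
for _ in range(len(true)):
    scores = atoms @ residual
    k = int(np.argmax(np.abs(scores)))
    recovered.append(k)
    residual = residual - scores[k] * atoms[k]
```

Because the 2,000 atoms are nearly orthogonal in 300 dimensions, the three true components stand out from the cross-talk at each step.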
Um, so that is a kind of a cool thing.
I just wanna, um,
say one more thing.
Okay. [NOISE] All the evaluations so far were intrinsic,
um, you also might wanna do extrinsic evaluation.
Why, why did word vectors excite people in NLP so much?
It turned out that having this meaning,
having this meaning representation, just turned out to be
very useful and sort of improved all of your tasks after that.
Um, and so, um,
this is doing named entity recognition which is labeling
persons and locations and organizations, but, you know,
it's typical of many tasks of what people found,
which was if you started with a model without, sort of, word representations
and you threw in your word vectors, regardless of whether they were Word2Vec or GloVe ones,
just kind of your numbers go up a couple of percent or more.
And so word vectors were just sort of this useful source that you could
throw into any NLP system that you build and your numbers went up.
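A sketch of that "throw in your word vectors" recipe for a tagger like an NER system, assuming some pretrained lookup table is available; the helper name and the window scheme here are illustrative:

```python
import numpy as np

# Represent each token for a tagger by concatenating the pretrained vectors
# of the words in a window around it. `vectors` stands in for any pretrained
# table (Word2Vec, GloVe, ...); unknown words fall back to a zero vector.
def window_features(tokens, vectors, dim, window=2):
    pad = np.zeros(dim)
    embed = [vectors.get(t.lower(), pad) for t in tokens]
    feats = []
    for i in range(len(tokens)):
        ctx = [embed[i + k] if 0 <= i + k < len(tokens) else pad
               for k in range(-window, window + 1)]
        feats.append(np.concatenate(ctx))
    return np.stack(feats)  # shape: (n_tokens, (2*window + 1) * dim)
```

These feature rows then feed any classifier you like; the point of the slide is that swapping one-hot features for pretrained vectors in exactly this slot is what bumped the numbers up.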
So, it was just a very effective technology, um,
which actually did work in
basically any extrinsic task you tried it on. Okay. Thanks a lot.
