Just a few things that I want to do today.
Uh, my plan is to get through those things,
which is really just a bit more on GloVe,
then Word2Vec and retrofitting,
and that kind of rounds out, uh,
this first big unit on distributed, uh, representations.
Next week, we're going to turn to the topic of
supervised sentiment analysis and explore a different set of topics.
But my idea for today is that we get through this stuff,
we're not in a rush, so feel free to ask questions.
But when we get through it,
we're just going to break, um,
and I'm going to encourage you to do some hacking.
Uh, to explore the notebooks if you haven't already to get
a more hands-on look at the material or also you could just dive into homework one.
Um, you should certainly make sure that your system is set up.
Please don't leave here today without feeling
like you're ready to just work productively on the problems.
Um, so a bunch of the teaching team are here in
the room and then we're going to have a setup,
I'll give you the details on that later,
where you could video-in to talk to
some of the TAs who are going to be available in a kind of queue.
So if you're not in the room right now but you need some help, uh,
feel free to contact us,
and I just want to emphasize that this is the moment,
the best moment to get your system working,
do not wait until Monday afternoon to discover that you don't have crucial components.
That's what this is. This is the perfect moment to make sure that's all straightened out.
Before I dive in, any questions or comments? Yes?
There's two office hours tomorrow, uh,
one of them says it has Zoom,
the other one doesn't because both-
Are they both also online?
For tomorrow.
So there's another one in the evening?
That's right. He's kinda always by video.
Oh.
You can reach him, and I think the question was
featured [inaudible] [LAUGHTER]
So you could choose either one.
Can, can that one also be online, or is it only in person?
No, that one you have to go see the [inaudible].
Yeah, if you want to just call in,
we can arrange that.
Oh, thank, thank you.
Yeah, you could uh, I don't know.
Yeah, you could- I don't want you to email me directly,
but maybe you could post on Piazza.
Actually, that would be tough too.
Maybe I'll just create a link,
so if people wanna call in from the line, they can as well.
Just so I'm inclusive of everyone whether they are in person or not.
Okay, do what you can,
but otherwise, you might have to make your way [inaudible]
Do we use QueueStatus?
We do, that's what we're using today.
No, I don't think there's Zoom today,
because we have the [inaudible]
Okay. Yes so we use Zoom sometimes.
Can you give details on it when we- we're in the middle
[OVERLAPPING] on how it works? Yeah.
It's very easy.
You can just basically call me,
and then [inaudible] and that might
mean that I'm not going to be able to see it like you, but we see it.
Okay, great.
Where we left off is that I had planted the seed for GloVe.
Basically, I said some nice words about this paper.
I'm a big fan of this paper and I gave you the intuition and
also here are some other practical details about implementations that you might use.
So we're using mittens by and large;
there's also a C implementation,
and I think in a minute,
I'm going to show you kind of a reference implementation that is very, very slow,
but it's kind of what you would do to just
naively implement the algorithm and I think that
there's a real pedagogical value to see exactly how those computations would go.
So we're going to glance at that but not-
I'm not saying that you should use it for anything beyond study.
So here's how I introduced GloVe.
I gave at the top there what they present as kind of their ideal objective, which is
that for a row representation or a row embedding w sub i,
its dot product with a column embedding w sub k (that's the primary term),
plus two bias terms,
should be equal to the log of the co-occurrence probability,
or at least, I guess, proportional to it.
They kind of dive down further, so their ideal reconstruction of what that dot product is,
is that it should be the log of the co-occurrence probability, and they define
the log of that co-occurrence probability as the log of
the co-occurrence count minus the log of the row total.
Um, and if you do allow that the rows and columns in this matrix might be different,
so that you'd want to normalize not just on that row but also on the column,
I think you get this expression here, which isn't in the paper, but I think
it's certainly in the spirit of the paper, because even for their matrix,
it seems the rows and columns could be different.
And so then you've reconstructed the co-occurrence probability
as the log of the co-occurrence minus the log of the product of the two,
the row and the column probabilities, and what I just pointed out to you is that,
you know, that's PMI.
That is, in some sense,
that's the literal calculation of PMI, and you can certainly see that what we're,
what we're getting from this picture is that GloVe is built on that PMI intuition.
It's not the same as PMI,
of course because we're learning,
we're learning regularized reduced dimensional representations,
whereas PMI is just a re-weighting of the entire count matrix.
But I really like this as an entry point to GloVe.
When you think of GloVe,
you could think of it first and foremost as capitalizing on that core insight.
That's where we left off.
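To make that literal PMI calculation concrete, here's a small sketch with a made-up symmetric count matrix (the counts are hypothetical, purely for illustration):

```python
import numpy as np

# A tiny, made-up co-occurrence matrix (hypothetical counts).
X = np.array([[10., 2., 1.],
              [ 2., 8., 4.],
              [ 1., 4., 6.]])

total = X.sum()
row_prob = X.sum(axis=1) / total   # row marginal P(w_i)
col_prob = X.sum(axis=0) / total   # column marginal P(w_k)
joint = X / total                  # joint P(w_i, w_k)

# PMI(i, k) = log P(i, k) - log P(i) - log P(k): the log co-occurrence
# minus the logs of the row and column probabilities, as discussed.
pmi = np.log(joint) - np.log(row_prob)[:, None] - np.log(col_prob)[None, :]
```

GloVe's learned dot products approximate this quantity, whereas PMI itself is just this re-weighting of the full count matrix.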
There are a few more details for the paper though that I think are
really important for why GloVe works so well.
So again, I just repeated the original GloVe objective up here at the top.
What they actually argued for is a weighted version of
it, which you see down here, and the,
the new piece: at its core, you have
that dot product again of the row and the column,
you have two bias terms for the row and the column, and then
you subtract out the log co-occurrence.
It's additionally weighted by this function f. So f is a function of the counts,
and what it's doing, essentially, is kind of
flattening out and re-scaling those count values.
So f of x,
that's the general function defined here.
You set x_max; the default, in the spirit of the paper, is to set it at 100.
So a count of 100.
So if x is less than that max value, you set it as a proportion of
that max with an exponential scaling here at Alpha,
and if it's equal to or greater than the max,
you just set it to 1.
Yeah as I've said down here,
Alpha is typically 0.75 and x_max is typically 100.
I think that's fine, reasonable defaults.
I am gonna suggest that you think a little bit about that though.
But is that clear, so far?
That's kind of the whole model, right?
When you think of GloVe you should think of that dot product of row and column with
some bias terms and then subtracting out the log co-occurrence. Make sense?
Plus a re-weighting of that.
The GloVe hyper-parameters.
So there are really three.
The first is just the dimensionality of the learned representation.
You get to set that.
You can set it at any number and you can imagine that setting
it at larger values is going to give you more capacity to learn,
but also take more time and everything else.
So that's kind of something that you play with for all of these models.
x_max is going to flatten out all the high counts, and the idea I
want to plant there is that you might
reflect on that value, and the way you might think about it is:
for my count matrix, what percentage of
the values are actually at or above that threshold?
If it's 100% of them, for some reason,
then setting x_max to 100 is going to turn your entire matrix into a bunch of 1s.
That's certainly not going to be good, right?
It's unlikely that you would have a count matrix that was that dense, but
what percentage is it,
and what are you doing to your matrix when you flatten it out?
Maybe that's a large percentage of the, of the values.
So you can think about that,
and then you could also think about Alpha which is going to control
how much scaling you do for counts that are below your max.
And here, I've given an example: the incoming row vector for this f,
with the defaults, is 100, 99, 75, 10,
and 1, and once I've applied this re-weighting, it looks like this.
So obviously, the magnitudes are much different.
1, 0.99, 0.81, 0.18, and 0.03.
So you're doing something very important to this space when you do
this re-weighting. Makes sense?
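To make the weighting concrete, here's a minimal sketch of f with the defaults just described; it reproduces the example row vector from the slide (the function name is mine):

```python
import numpy as np

def glove_f(x, xmax=100, alpha=0.75):
    """The GloVe weighting function: counts below xmax are scaled as
    (x / xmax) ** alpha; counts at or above xmax are flattened to 1."""
    return np.minimum(np.asarray(x, dtype=float) / xmax, 1.0) ** alpha

# The example row vector from the slide, with the default settings:
print(glove_f([100, 99, 75, 10, 1]))
# roughly [1.0, 0.99, 0.81, 0.18, 0.03]
```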
This is a new slide that I added.
I recognize that it's very dense.
It's kind of there for a reference,
but let me give you a guided tour a bit.
Let me step back and just say,
"I was thinking about other ways that we might,
at an analytic level,
get a deeper understanding of GloVe.
We have the PMI insight."
I've shown you the GloVe objective,
um, but what else could we do?
And what I thought is,
when I motivated LSA before,
I gave you this example of gnarly and wicked.
And my motivation, my motivation for showing you that was,
these are two words that are similar and if semantically,
they're both kind of positive slang terms maybe.
Uh, but they're unlikely to co-occur with each
other because they're kind of from different dialects, that was the idea,
and my pitch to you was that LSA has the capacity
to recognize that they are similar despite never co-occurring,
because they have other neighbors in common: they
co-occur with awesome more than you would expect, and with terrible less than you would expect.
And as a result, LSA is able to capture that insight.
Right, that was one of the motivations there.
And I thought, how does GloVe do this?
Because GloVe, by its nature,
because of this log count thing here it doesn't even visit 0-valued cells.
Yeah, this is like my just very quick reference implementation,
where you set up the word and the context matrix and the bias terms.
You can take the log of the entire matrix all at once.
So do that as a pre-computing thing and you can
re-weight the entire matrix in the way that I just showed you with f,
could do that outside of the loop.
And then once you start iterating and working on specific examples,
you don't visit the values that are 0,
and the reason is that log term, right:
you'd either have a problem there, since the log of 0 is undefined,
or you'd have to stipulate that it just contributes a value of 0,
that there's no update for this example.
So that's a kind of definitional thing about GloVe.
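This isn't the notebook's actual code, just a naive sketch consistent with that description: precompute the logs and the f-weights outside the loop, then skip the zero-valued cells during the per-cell gradient updates. All names and defaults here are illustrative, and it is deliberately slow, for study only:

```python
import numpy as np

def glove_reference(X, n=10, xmax=100, alpha=0.75, eta=0.05, max_iter=1000, seed=0):
    """A deliberately naive, slow GloVe sketch: per-cell gradient
    descent on the weighted GloVe objective, skipping zero cells."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    W = rng.normal(scale=0.1, size=(m, n))    # word (row) vectors
    C = rng.normal(scale=0.1, size=(m, n))    # context (column) vectors
    bw = np.zeros(m)                          # word biases
    bc = np.zeros(m)                          # context biases
    # Precomputed once, outside the loop: log counts (zero cells are
    # left at 0 and never visited) and the f-weights.
    logX = np.log(X, where=X > 0, out=np.zeros_like(X, dtype=float))
    weights = np.minimum(X / xmax, 1.0) ** alpha
    for _ in range(max_iter):
        for i in range(m):
            for k in range(m):
                if X[i, k] == 0:              # zero cells: no update
                    continue
                diff = W[i] @ C[k] + bw[i] + bc[k] - logX[i, k]
                grad = weights[i, k] * diff
                dW, dC = grad * C[k], grad * W[i]  # grads from old values
                W[i] -= eta * dW
                C[k] -= eta * dC
                bw[i] -= eta * grad
                bc[k] -= eta * grad
    return W, C, bw, bc
```

On a dense matrix this drives each dot product plus biases toward the corresponding log count.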
But then, how is it going to do what I want it,
in the case where,
here's my count matrix,
I've said that wicked and gnarly never co-occur.
This is a version of that LSA example, but I made it word-by-word because, of course,
GloVe requires a word-by-word matrix, a square matrix.
So how is it going to do this?
I kind of hope that it does have the capacity to do it.
Actually, does anybody have any guesses?
There's a lot of detail here but like what's your hunch?
Anyone have a hunch about how it might be able to do this,
or maybe it's going to be unable to? Yeah.
Basing it on the relation to the word terrible which is a negative, sure,
so how, how the positive nature of it is connected to the matrix.
Yeah, beautifully put.
That's in fact what happens.
I think, experimentally, what GloVe will do is
push both gnarly and wicked away from terrible and toward excellent,
because it does have co-occurrence values for them.
And you're kind of relying on its ability to do that,
in order to achieve this abstract connection.
Um, and I think that does happen.
So what I did for this little problem is,
I just got rid of the bias term so that I could display the calculations.
And I also started
the GloVe optimization process in a way
that I was sure that gnarly and wicked were really far apart.
So in iteration one, I basically made them occupy
completely separate parts of the space.
And then I thought what's going to happen as I do these updates?
And building on exactly your intuition,
what I've decided to do is look at what's
happening for both of them as it relates to excellent.
And I could have done it for terrible where you'd expect them to push away.
And I thought that might happen,
that they- that they both gravitate toward excellent, and it does.
So here, it's kind of small, but this is wicked and terrible and awesome and gnarly.
And that's what I meant that gnarly and wicked are at like
opposite ends of this little two-dimensional space here.
And after one iteration [NOISE] it really has been reorganized.
Here's awesome, here's wicked,
and here's gnarly and terrible.
It's been thrust up in- into the corner here.
And I think if I keep running this,
[NOISE] then- then gnarly and wicked are just
gonna gravitate closer to awesome and farther away from terrible.
And that's the sense in which we'll be capturing their abstract similarity.
And then on the right here,
what I've done with kind of color-coding is just make that really concrete.
So there are the counts up in the upper left.
This is the reweighting.
This is w_0 and the context vector zero.
So that's the starting point.
And I am watching for gnarly and wicked and how they're gonna get closer to awesome.
And when you do that calculation,
this is just the derivative of the GloVe objective here,
which I stated up in the corner.
You get these error terms and then,
that updates the weights.
And then after another iteration,
the error terms have gotten smaller.
That is, just as you've seen in the picture over here.
These terms are gravitating toward excellent and you can imagine that as I go,
these values are gonna get closer and closer to 0.
So, I don't know, I found that reassuring.
Does that make sense? Any questions I can answer about it? Yes.
Do you mean awesome or excellent?
[inaudible].
Awesome. That's the one that I picked.
Oh did- I wrote excellent up here. I did mean awesome.
Now this is my space;
gnarly, wicked, awesome, terrible,
and I set it up so that awesome is really
frequent and co-occurs a lot with both gnarly and wicked,
and hardly ever with terrible.
That's the idea. Then you could see it pulling toward the blue.
Those are the comparisons I keep making.
And just for the comment before,
if I had done comparable comparisons with terrible,
I think I would have seen them pull apart.
Well, why would gnarly and awesome have, like, a lot of co-occurrences?
They have none. Gnarly and awesome.
Yeah.
Oh, because they're both positive slang terms.
But for a co-occurrence, like, right next to each other? I'm trying to think of sentences.
Oh, this would be like if I was reviewing your paper and I liked it,
I would say it was both gnarly and awesome. I don't know.
Probably that would be for more like your skateboarding moves
[LAUGHTER] than your paper.
[LAUGHTER] There is one more thing
just to kinda fill out my little analytic framework here.
So when I showed you all those reweighting schemes before,
I showed you before and after on the distribution of values,
and here for GloVe, this is really nice looking, right?
So the raw counts come in and they have this really difficult scaling.
And then, when you reweight with GloVe,
with just those raw counts coming in,
you get this, like, essentially normal distribution here
of cell values, and look how nicely they're scaled between -2 and 2.
Ah, and we'll see this.
I'm gonna return to this point again and again.
Like a lot of the deep learning models that we look at are gonna be
very sensitive to the scaling of incoming values.
And you could just say like,
this is a- a really nice starting point for a lot of
those networks because everything is kinda scaled in a rational way.
So this is also partly why I think GloVe does well as an input to another system.
And then I had some GloVe snippets here.
The first thing I did is
just load the, um, data frames, two of them,
IMDB5 and IMDB20, and I just wrote a little function that would look at
the number of non-zero values that were at or above some threshold that I set,
so that would be x_max.
And for IMDB5, it's about 2% of the values and for IMDB20,
it's 15% of the values.
Uh, corresponding to the fact that IMDB20 has a much larger window and therefore,
it's much denser because more things co-occur with other things.
But you can see that that is interacting in an important way with the chosen value for
x_max because I've done relatively little to IMDB5 by thresholding,
but I did a whole lot [NOISE] to IMDB20.
I really made a lot of those values the
same even though they might have been very different.
So that's something to keep in mind as you use GloVe. I mean,
I think I should have been more reflective on this in the past when I've used GloVe:
think about the impact of that hyperparameter.
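Here's a hypothetical little helper in that spirit (not the notebook's function) for checking what fraction of your non-zero counts a given x_max would flatten:

```python
import numpy as np

def prop_flattened(counts, xmax=100):
    """Proportion of the non-zero counts at or above xmax, i.e., the
    cells that the GloVe weighting f will flatten to exactly 1."""
    counts = np.asarray(counts, dtype=float)
    nonzero = counts[counts > 0]
    return float((nonzero >= xmax).mean())
```

A high proportion means the weighting is erasing a lot of distinctions among your largest counts.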
And then, for this code snippet here,
what I did is just test this idea:
the GloVe objective is to learn
word vectors whose dot product matches the log of their co-occurrence count.
And so you might think that if I can somehow perfectly achieve this objective,
then I'll get a perfect correlation between the dot product for
two vectors and their log co-occurrence in the original count matrix.
So I just wrote a little function correlation test that does
that and I ran GloVe for a little while.
And you can see that the correlation is about
0.38 for IMDB5 and about 0.48 for IMDB20,
for the amount that I ran GloVe.
I would expect that value to get larger the longer I ran GloVe for.
But this is a kind of measure of how much we're living up to
the GloVe objective and also kind of how
much we are compromising on the original count matrix.
It's not that we want to reconstruct the original count matrix.
So maybe some loss is useful actually. Yeah.
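Here's a sketch of such a correlation test, assuming we correlate the learned dot products with the log counts (per the objective) over the non-zero cells, and ignoring the bias terms for simplicity:

```python
import numpy as np

def correlation_test(X, W, C):
    """Pearson correlation between learned dot products W @ C.T and
    log counts, computed over the non-zero cells of X only."""
    mask = X > 0
    pred = (W @ C.T)[mask]
    obs = np.log(X[mask])
    return np.corrcoef(pred, obs)[0, 1]
```

A correlation of 1.0 would mean the learned space perfectly reproduces the log count structure; lower values measure how much the reduced-dimensional model is compromising.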
This slide changed from the prior classes.
I made just one change.
I meant to do a shout out for that.
Somebody noted that I had taken the log twice.
It doesn't matter whether you take it in the line row_prob or row_log_prob,
but I did it in both,
um, which was not good.
[LAUGHTER] It was affecting the scaling, and since
the correlation test that I chose is sensitive to the scaling of the values,
the original correlations were much higher.
And I should have reflected on that.
They were so high that I was like,
can that really be right, that it's doing so well?
This is much more realistic.
I thought we would do one more model here.
word2vec. I'm just gonna do this one quickly, uh,
and I can justify that in part by the fact that the TensorFlow tutorial on
word representations or vector representations
of words is basically an introduction to word2vec.
And it's really, really good.
So that's my first practical tip and the other is that if you wanna train
your own word2vec representations,
then Gensim is a nice Python package,
kind of arbitrarily scalable that you could use to do that.
Well, lemme just give you the gist of this so that you have it in your toolkit.
So the fundamental idea, as I take it, for word2vec is,
that I can take a corpus and think of it as a labeled data set.
And the way that I can do that is by setting a window size,
just the way we did for all our count matrices,
and think of pairs of words that co-occur
within that window as labeled training instances.
So this is a positive label: because 'it' and 'was' are
there together in that sequence,
I say 'it' is labeled with 'was'.
And I also say it is labeled with 'the', and so forth, for all of these co-occurrence pairs.
So it's a funny kind of supervision, right?
It's not annotation in the usual sense, but it's a positive example of
this word co-occurring with the other one.
And still, there's a sense in which this is labeled data.
Where do 'it' and 'best' co-occur?
Here.
It's window two, right?
Yep. One, two.
Yeah. So it and best do not co-occur because they're outside that window.
But that's a parameter that you would set, and the larger
you set it, the more training data you have, in some sense.
Of course it's not all equal.
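A minimal sketch of that labeled-dataset construction; the function name is mine, and the example assumes the sequence discussed above is "it was the best of times":

```python
def skipgram_pairs(tokens, window=2):
    """Turn a token sequence into (target, context) training pairs, one
    positive pair per co-occurrence within the window on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

With window=2, 'it' gets labeled with 'was' and 'the', but not with 'best', which falls outside the window.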
So this is the basic model and you can probably see at the top
here that this is a kind of log linear or softmax classifier.
You might have seen this equation before.
This is the explicit objective, but I
think we could reduce it all the way down to this kind of
blurred model here, where X is my embedding matrix for the vocabulary,
W is some weights,
maybe we also have a bias term.
And then the kind of hard thing to get used to about this model is that
C is a label vector for an individual example.
And it's one-hot encoded but it has the dimensionality of your entire vocabulary.
If you have 20,000 words in your vocabulary,
then it's a 20,000-dimensional vector with
a single dimension having a 1, corresponding to
the word for that training instance that co-occurs with your input word.
So this is obviously really difficult to
train and it's quite slow but that's the intuition, right?
Is that I created this labeled data set
and then I just do a very standard thing in machine learning,
which is have a classifier that learns to predict those labels.
It's just that it happens in this very high dimensional space.
And I take it that that's kind of the core insight
behind what's called the skip-gram variant of word2vec.
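Here's one way to sketch that softmax formulation for a single training pair; the shapes, the 20,000-word vocabulary, and the omission of a bias term are just illustrative choices:

```python
import numpy as np

def skipgram_softmax_loss(x, W, c):
    """Cross-entropy for one skip-gram example under the full softmax:
    x is the input word's embedding, W maps embeddings to vocab-sized
    scores, and c indexes the observed context word (the single 1 in
    the one-hot, vocabulary-sized label vector)."""
    scores = x @ W                           # shape: (vocab_size,)
    scores = scores - scores.max()           # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[c])
```

The expense is visible here: every single training pair requires a normalization over the entire vocabulary, which is why this version is so slow to train.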
And then one
other thing you might think about is: I can do
this for individual examples as they come in,
but if I do it for my entire vocabulary all at once,
that's fine because they're kind of all independent.
But when you phrase it that way,
you start to see that this is kind of like another one
of these matrix factorization methods, where I have, like,
my word vector, where my word embedding space and I have my context embedding space W,
just like GloVe did,
and I am taking all those dot products.
And the objective of the model is to push those dot products in a particular direction.
And in this case, it's in the direction of favoring the things that co-occur a lot,
and in turn disfavor the things that don't co-occur a lot.
In practical terms I think people don't use
this model because of the very high dimensionality of C
causing lots of problems.
So the more popular version which is implemented for you in TensorFlow,
you'll see that in the tutorial,
is called skip-gram with noise contrastive estimation.
And this is the kind of approximation of that model that I just showed you,
where instead of doing that very high-dimensional thing,
I have the sum of two separate objectives here,
each one of them binary.
And it's just saying, for the left here those are the things that actually co-occur,
you know, push in that direction.
And then I sample some negative instances to be the negative ones over here,
and they give some advice about how to sample negative instances.
But those are presumably things that you didn't see co-occurring in your training data.
It's not that they never co-occur.
It's kind of a miracle that this works at all because of course they might be
perfectly well formed pairs but the idea is that on average with a large corpus,
when I sample these negative instances they will be kind of sequences that are
unlikely to appear together and therefore it's right to down-weight them. Yeah.
Is there is some intuition that you would want to choose negative pairs that kind of
could appear now or should you just pick a totally absurd pair?
That's an interesting idea.
It sounds like something like a linguist would suggest,
which is let's think very carefully about which words are plausible.
I think they don't do that.
They have some schemes for sampling from the frequency distribution,
which might kind of align with your intuition.
But I've not seen work that tried to get down to
like can this adjective appear next to this verb or not.
Anyone seen anything like that?
It's a potential place for innovation just
because this negative sampling part really does
matter for how efficiently this model is
trained and what kinds of embeddings you could have.
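For concreteness, here's a sketch of the negative-sampling objective for one example, the popular simplification that is a close cousin of the NCE objective mentioned above; the inputs are assumed to be a word embedding, its observed context embedding, and some sampled negative context embeddings:

```python
import numpy as np

def sgns_loss(w, c_pos, c_negs):
    """Skip-gram with negative sampling: one binary term pushes the
    true (word, context) dot product up, and a second term pushes each
    sampled negative's dot product down."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    loss = -np.log(sigmoid(w @ c_pos))            # observed pair
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-(w @ c_neg)))     # sampled negatives
    return loss
```

Instead of normalizing over the whole vocabulary, each example touches only the true context and a handful of sampled negatives, which is what makes this version efficient.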
Okay, final section here.
Retrofitting. The spirit of this
is I think it's very exciting to study distributional representations,
and I think they've been very successful.
And it's- it's kind of amazing how well they work, given that they
operate only on co-occurrence statistics in large data sets and large corpora.
So obviously there are
important aspects of meaning that are latent in that co-occurrence data.
But I do not myself believe the claim that that's all there is to meaning.
If I think about meaning as it's studied by
linguists or by cognitive and developmental psychologists,
they emphasize other things beyond co-occurrence,
like grounding in a social situation,
interaction with other people in language.
All of that stuff, it's not merely co-occurrence but
rather about kind of our entire cognitive lives.
And that leads me to doubt that
these distributional representations are going to tell the whole story of meaning.
And a response that I really like about that is that we might look to ways that we could
take these distributional vectors and improve them with some supervision,
that is imbue them with information that wasn't merely co-occurrence information.
That's the pitch of retrofitting,
and it's really just one of many methods that you could adopt for doing that.
Yeah, here I've given a kind of summary.
I like that these things are high-dimensional.
I like that they're dense and capture lots of linguistic relationships.
But I'm doubtful that we can capture meaning solely from
co-occurrences, and I don't like that there's no grounding.
You can leave the symbolic question for later.
Not sure I have an opinion on that.
But these two things are things that I would really like to address.
There have been in the past a bunch of different ways to do this,
a bunch of models proposed.
This is some work that I did with Andrew Maas when he was a PhD student here,
and this was really just taking distributional vectors that were learned from
a kind of auto-encoder and infusing them with sentiment information,
like what's the gradient information from a sentiment classifier.
And what we're pitching here is that you could take this space that's kind of not
well differentiated but when it comes to sentiment,
as you can see by the mixture of colors,
and kind of really spread it out so that the negative and
the positive are in their own regions.
And that's coming from the supervision.
And there are lots of ways you could do that. Here's another example.
This is actually some stuff that I did in
an exploratory way with , one of our TAs.
Like here I have a kind of PMI space,
and if I take that space and train it against a sentiment objective with a deep classifier,
then I look at the hidden representations from that classifier,
they of course have shifted the space around in response to their supervision.
And you can see a really deep sentiment split here with
all the positive and negative words really
gone into their own part of the embedding space.
These methods can be kind of hard to tune,
so I was really excited when I saw this retrofitting model appear.
This is from Faruqui et al.
It was a NAACL paper;
I think it won one of the best paper awards at NAACL.
Some kind of pioneering work,
and it's implemented in the notebook in case you want to explore it.
I've shown you how to experiment with WordNet and so forth.
The gist of this is here's the objective,
and what I've said here is that this is an objective that's balancing two pressures.
First, I want to be like the original vectors that I've got.
That's the pressure in orange,
but I would also like to look more like my neighbors in a knowledge graph.
Let me unpack that a little bit.
Let's suppose that you trained a bunch of representations by whatever method.
So you have some vectors for a vocabulary of English.
The sense in which you're retrofitting is that you could find all of those words,
say, in the WordNet graph. Just look them up.
And now you can imagine that those nodes in
that graph are actually your vector representations.
What the retrofitting model does is say: try to remain like
those original vectors, but also update
them so that they're more like their neighbors in the knowledge graph,
and that's the sense in which you're retrofitting.
And it's kind of nice and modular because you can take any embedding space,
and as long as you can ground it in some kind of graph you can
run this objective and move those vectors around.
So here's a quick example.
In this tiny little graph here,
this node 0 is related to 1 and to 2 in a unidirectional fashion.
When I run retrofitting,
you can see that 0 moves closer to both of them,
and they stay put, 1 and 2,
because they're not related back to 0.
So the direction of the edges does matter.
Here's a case where they're all related to each other,
and what you can see with retrofitting is no matter where they start,
they're gonna kind of pull together because everything is by
this graph similar to everything else because of its connectedness.
And then here's one more example.
I meant to say there are
hyper-parameters alpha and beta,
and if you set alpha to 1 and beta to 1,
then they're evenly balanced.
Suppose I set alpha to 0.
That's a kind of interesting edge case where I
say "I don't care what the original embedding was like.
All I care about is being like my neighbors," and that's degenerate, right,
because very clearly they just become identical to each other with alpha at 0.
And conversely, if you set alpha really high,
then you'll really remain like
your original representations and the graph will have almost no influence.
And in between, you might find something that's
usefully making use of the original embedding and also of the knowledge graph structure.
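Here's a minimal sketch of those updates, with constant alpha and beta (the original formulation allows them to vary per node and edge); `edges` is a hypothetical adjacency mapping from each node index to the nodes it points to:

```python
import numpy as np

def retrofit(Q_hat, edges, alpha=1.0, beta=1.0, n_iters=10):
    """Iterative retrofitting updates: each vector is pulled toward its
    original value (weight alpha) and toward its graph neighbors
    (weight beta), using the closed-form update for one vector at a
    time with the others held fixed."""
    Q = Q_hat.astype(float).copy()
    for _ in range(n_iters):
        for i, neighbors in edges.items():
            if not neighbors:
                continue   # no outgoing edges: stay at the original vector
            Q[i] = (alpha * Q_hat[i] + beta * sum(Q[j] for j in neighbors)) \
                   / (alpha + beta * len(neighbors))
    return Q
```

With alpha at 0, each connected vector becomes just the mean of its neighbors, which is the degenerate collapse described above; a large alpha pins everything to the original space.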
Yeah.
Where is this knowledge graph structure?
Like where does it come from?
You need to have it out there in the world.
So I gave the example of WordNet, ah,
ah [NOISE] as a case where you have really easy alignments
between whatever representations you've trained for words in English
and this preexisting structure.
But you can think really creatively about this.
Lots of data can be represented in a graph,
and if you can find that structure,
then you can perform the retrofitting step.
And I've given some extensions here,
one thing that you might find unsatisfying about the original retrofitting model,
is that it's baking in very deeply the idea that if I'm connected in the graph,
it means that I'm like my neighbors.
It's like one semantics for these things.
And this is a bunch of attempts.
So, uh, I was involved with some of this work,
to relax that assumption.
So for example, in WordNet there is an edge relation for antonyms.
And I think it's probably a mistake to say that,
just because I'm connected via that edge, I want to be similar to that word.
Antonym has very different semantics from, for example, synonym,
which is the dominant kind of edge, or
a hypernym, like entails or entailed by.
So these are nice extensions, and we released the code for this middle one;
this is kind of the most general version of this, which subsumed this other
work that was more specialized around different, ah, graphical relationships.
Then here are a few code snippets.
Uh, I'll let you look at the notebooks for the full story,
because there's some data processing involved in,
for example using WordNet for retrofitting.
But here, I just gave a simple example that kind of documents the interfaces,
um, so that you can think creatively about applying it in- in your domain.
That's it. Hope that's an inspiring idea to end on. Yeah.
Could you, in order to construct the graphical representation,
just take, like, any set of embeddings, and then choose the k
closest neighbors for each word, and artificially form
a graph that way, and then run retrofitting?
Oh interesting, so you'd say from the original space,
connect things that are close by some standard and then retrofit to them.
I mean you can certainly do it. It's well-defined.
Ah, like I said, I don't know what would happen. I guess
what you'd be saying is that,
for whatever threshold you set,
you really want those things to be pulled together.
Um, and then everything else should kind of drift into the background.
That's unusual, because I think of retrofitting as,
you know, I have my embedding space,
and I have some other resource,
some other knowledge graph and you're kind of using them together.
I don't know but I- you could try it out and see what happens. Yeah.
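To make the questioner's idea concrete, here is a minimal sketch (my own names, not course code) of deriving a graph from an embedding space by connecting each word to its k nearest neighbors under cosine similarity; the resulting edge dict could then be handed to a retrofitting step:

```python
import numpy as np

def knn_edges(X, k=2):
    """Build a k-nearest-neighbor graph over the rows of X
    using cosine similarity, as in the questioner's proposal."""
    # Normalize rows so dot products are cosine similarities:
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-edges
    # For each row, take the indices of the k most similar rows:
    return {i: list(np.argsort(-sims[i])[:k]) for i in range(len(X))}
```

As discussed above, it's well-defined but empirically untested: for whatever k you choose, you are asserting that those neighbors should be pulled even closer together.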
Um, what does retrofitting, uh,
with WordNet do to the analogical relationships?
Does it preserve those things,
or make them better, or destroy them?
Oh that's a good question.
Well, oh, so in the original paper,
I believe they test on- on analogies and see some improvement.
I think that's one of the tasks that they evaluate on.
Um, I should double-check that.
They retrofit to WordNet and FrameNet,
maybe one other gra- oh,
and also a dataset of, um, paraphrases.
But I forget precisely the experiments;
it's an empirical question,
so I would check the paper. Yeah.
Ah, with the word embeddings, uh,
could something like Word2Vec work together with GloVe?
Like, some kind of weighted combination of both of
these that improves on their original inputs.
Does this idea, uh, give strength?
Are you trying for the bake-off?
No, I'm just, ah.
Yeah, the question was basically whether or not you could combine forces.
That is, bring a bunch of different embedding spaces together and see some gains.
I think the general answer is yes,
and the thing you might think about is,
what's the overall scaling of the representations that you're creating?
Are they kind of harmonized, or are they obviously
just two things that you smashed together?
Because that's going to affect how easy
some later learning problems are, for example.
But I think the general answer is,
yeah, pull these things together.
If you look at recent papers that use
these newer contextual word representations like BERT and ELMo,
you might notice that in a number of places,
they have actually included the GloVe vectors as well,
to have even more dimensions and
even more information brought to the problem and you see gains in that case.
And for the bake-off,
it's certainly within bounds to smash together all four
of those count matrices or things that you've done to them [NOISE] and,
uh, see whether that leads to improvements.
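A hedged sketch of that "smashing together", with one way to harmonize scales first; the function and the L2-normalization choice are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

def combine(spaces):
    """Concatenate multiple embedding matrices over the same vocabulary,
    L2-normalizing each space first so no one space dominates by scale.

    spaces : list of (n_words, d_i) arrays with rows aligned by word.
    """
    normed = []
    for X in spaces:
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        normed.append(X / np.maximum(norms, 1e-12))
    # Stack side by side: the result has sum(d_i) dimensions per word.
    return np.hstack(normed)
```

Without the per-space normalization, a space with large vector norms would dominate any distance-based comparison in the combined space.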
And that's a nice transition point.
So what we wanna do now is break,
no more lecturing, get hands-on.
Make sure you can get through the homework because I think those are
nice preliminary steps to building your own system.
Build that system, and then you'll have a bake-off
entry ready for when we do the announcement
on Monday with the new test sets. Um, a few tips.
Um, I was [inaudible], I thank you again,
in the back somewhere.
We added to utils a function called fix_random_seeds,
which gives you a bunch of options for setting random seeds for
various parts of the scientific computing environment that you're in.
You might wanna do that for your bake-off entry,
as a way of ensuring that we can exactly reproduce whatever you got.
And then we can have a separate discussion about,
if your win depends on seed 42 and you don't win with seed one,
what that means for us.
So let's not dwell on that now.
You might make use of this function in utils
to make it easier for us to reproduce exactly what you've done.
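For reference, a function like that typically just seeds each library's random number generator. This is a sketch of the idea, not the actual `utils.fix_random_seeds`, whose signature and options may differ:

```python
import random

import numpy as np

def fix_random_seeds(seed=42):
    """Sketch of a seed-fixing utility: seed each library's RNG
    so that repeated runs produce identical results."""
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's global RNG
    try:
        import torch         # only if PyTorch is installed
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Calling this once at the top of a script makes any downstream random initialization (shuffles, weight inits) repeatable for the same seed.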
And that is allowed. The bake-off is of course
oriented towards the best-performing system.
That's the standard thing for bake-offs.
But in talking with you, I have become interested in two other questions.
Definitely, what's the worst-performing, but well-motivated system?
We were talking about this yesterday.
So not just the worst-performing system in general,
that's not so interesting because you could just do a bunch of incoherent stuff.
But, like, what's the system that really ought to have worked,
if you think about the theory, but was a disaster for the problem?
I really would be interested in seeing those systems.
And this is a kind of grim joke but which systems consumed the most resources?
[NOISE] So you can see, I've gotten word that somebody did train GloVe for 50,000 iterations,
19 hours of CPU time,
and it didn't perform so well.
But maybe that gets the award for most resources consumed.
So I don't know.
If you think you're not going to win then maybe try something crazy,
you know, shoot the moon as they say.
You'll get full credit for entering,
and we would be interested in these systems.
I think you're on your own now.
So now, excuse me, is there anything we should say about how people
can connect with TAs if they're remote?
I think they can just call.
Okay, just call in and there's some details on Piazza, right?
Yeah. [OVERLAPPING]
Okay. We're available.
Great, get to work I guess.
