Welcome, everyone.
I propose that we get started. There's lots to do.
Um, we're gonna continue on vector space models,
um, focused especially today on dimensionality reduction.
We'll look at a whole family of models for dimensionality reduction, and
kind of try to understand what they're doing
and how they're related to each other and so forth.
Stepping back for a second,
I was thinking that the ideal timing for
everything this week would be that we actually finish this slide show, like,
midway through class on Wednesday,
because what I'd love to do is just create some space in
here for people to do hands-on work with the notebooks, uh,
or especially with the homework,
because it would be wonderful if you all left here on Wednesday feeling like, you know,
your system was set up,
you knew what to do with the data,
you kinda knew the rules of the game,
you could very quickly run through
the homework problems to get thinking about your original system.
That's the ideal.
Um, I do want to balance that though against the fact that I don't want to rush,
uh, and in particular the questions from people last time were really wonderful.
They were very perceptive and kind of pushed us in just the right direction.
So I certainly don't want you to feel like you
can't raise your hands because we're not, in fact, in a rush.
This is just an ideal and we have to balance these pressures.
But just so you know,
I would like to create some space for
hands-on problem solving because I think that's really rewarding,
and we have a lot of the teaching team available to help you
out with exactly the problem that you have at exactly the moment you have it,
and I think that's a really great environment for learning.
So that's where we're headed.
Um, a few other announcements.
So we have a very accomplished teaching team.
Um, they have lots of different kinds of expertise,
and so a bunch of them have volunteered to do kind of short little lectures or
units that we sprinkled throughout
the course that I think are going to add various things.
So the- the first one- yeah,
the first one comes on May 1 at the end of our unit on natural language inference,
which is our first look at like really big supervised labeled data sets.
[inaudible] is gonna talk to us a bit about Lifelong Learning,
and I'll let him unpack that term for you,
but he's done a lot of work under that rubric.
That's kind of going to be giving you
some interesting machine-learning insights and
also maybe connecting with things from cognitive science.
So that's nicely timed,
and then a little bit later in the term,
when we talk about metrics and methods,
has volunteered to think about,
kind of, a next phase of that,
where we move beyond standard precision, recall, accuracy-type kinds of
things, and really think in a serious way about what it means
for a system to generalize in a useful way on a real-world problem,
and kind of introspect on what the systems are actually learning.
Uh, so that'll be exciting,
and that's nicely timed.
When we talk about contextual word representations in this class,
I am gonna talk about the models,
but we're gonna be, kind of,
focused on how those,
um, models and, you know,
pre-trained representations could be used as tools that would improve,
uh, whatever you're doing for your project.
That's going to be my pitch.
And as a kind of counterpart to that,
just gonna talk about another very familiar thing that might improve your systems,
which is just I give you a pretty long blob of text,
like a whole article or some kind of
piece of context for the problem you wanna talk about.
What's an effective way to turn that big blob of text into
a fixed-dimensional dense representation that you could use for another problem?
On May 22, again,
in the spirit of just helping you get moving in interesting directions on your project,
he's going to talk about data augmentation,
which is I guess like making maximal use of the data you have available to you.
And then, when you guys are thinking about actually writing up and presenting your work,
he's going to talk about new techniques that you might use to
kind of probe what might be a black-box model to figure out what it's learning,
and that could be illuminating just for you
as someone who wants to understand your own system.
But also, of course, that's wonderful stuff for
a discussion or error analysis section where we
kind of get a higher-level insight beyond the numbers into what your system is doing.
So I'm really excited by the sequence of
things because I think this is gonna really nicely
complement the core strand of work that we plan to talk about.
So thanks to the teaching team and lots more to come.
Oh and- and relatedly, um,
the Friday session on Python and Jupyter Notebooks, that was great.
Uh, thanks to everyone who participated.
The video and corresponding materials
are linked from Piazza.
So if you missed that session,
but you're fumbling around with your notebooks or with your Python,
do check out that stuff.
And we're also thinking of doing one, kind of,
a bit later in the term, uh,
that would be focused on NumPy for scientific computing,
that is kind of like vectorizing your code and also PyTorch,
which this year we're kinda pitching as the default choice you might make for,
uh, doing deep learning, and we've got a lot of support for that.
All right. That's some stuff to come.
A few other things that I just wanted to mention.
First, a bunch of people have asked about the role of the notebooks.
In my view, for this year,
what the notebooks do is complement the main lecture.
So what we are trying to do in here with you all,
is give you a guided tour of the content and highlight some stuff.
But I think the real work will start when you open up one of those notebooks,
follow that narrative and whenever you don't understand
something or you have an open question or something else you want to pursue,
you use the code embedded in that notebook to answer that question for yourself.
That's the kind of hands-on learning that I think can really push
you in new directions and that's what the notebooks are for.
They're also useful if you open up the homework notebook and feel,
kind of, at sea about how things work.
You can bet that the actual notebooks are meant to get you oriented and so forth.
So, but, they're meant to play that kind of informal role. Yeah. Go ahead.
At least for the notebook, we use, we read it, if we have questions,
but we don't need to read every line of code to try to understand what's going on.
Yeah that's fair.
I mean you must have questions,
because this is complicated stuff and you could spend
a whole lifetime meditating on any one point of it.
So my pitch to you would be when you have questions,
the notebook is a good way for you to begin to answer them for yourself,
and that's the primary thing that I want.
It's not so much about evaluating you or making sure that you do a particular,
kind of, reading or anything but rather just empowering you to try some new stuff.
And all of this is meant to give you a head start on a project or something like that.
And also as I said on the first day, to, kind of,
start to try to convey to you what we understand to be
best practices for setting up projects,
running experiments, dealing with data and so forth.
Um, if you are struggling with your setup,
you can post on Piazza and we might be able to help,
but there is some stuff that you could do on your own that might help with debugging.
And I think one powerful way that you could, kind of,
check the health of your class environment,
is to actually use the tests that are embedded in the GitHub repository.
There's a sub-directory called test.
You can see from all these files test_NLI and so forth.
We've got pretty good coverage across
all the core code. If you use pytest, which is installed
if you use Anaconda, you type py.test (I like to add -vv because it gives you a,
kind of, health dashboard)
and you run it on one of these files.
It will run a bunch of tests,
and all the tests in these directories should pass. If they don't,
you might get an error saying like,
"Hey you haven't installed this library," or something and
that would be really helpful for people trying to debug with you.
And also I'm just kinda pitching
this: pytest is a very easy way to write a bunch of tests.
You basically just name your function so that
the first word is test, and then
pytest will treat it as a test. And if you look at my code,
you see you can get really good coverage really fast.
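As a minimal sketch of that naming convention: the file name and the `l2_normalize` function below are made up for illustration and are not taken from the course repository.

```python
# Contents of a hypothetical file named test_example.py.
import math


def l2_normalize(vec):
    """Scale a list of numbers to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# pytest collects functions whose names start with `test`.
def test_l2_normalize_has_unit_length():
    result = l2_normalize([3.0, 4.0])
    assert math.isclose(sum(x * x for x in result), 1.0)


def test_l2_normalize_keeps_direction():
    result = l2_normalize([3.0, 4.0])
    assert math.isclose(result[0], 0.6)
    assert math.isclose(result[1], 0.8)
```

Running `py.test -vv test_example.py` on a file like this should list each test by name along with whether it passed.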
And one thing that might be of utility to you,
is what I have in test_np_model_gradients.
That's a kind of gradient-checking infrastructure.
So if you write your own model,
like in pure NumPy,
then you could use that to check its health,
essentially the health of the back-prop.
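To give a sense of what gradient checking means, here is a small sketch in pure NumPy. It is not the repository's test code; the linear model and all the function names are assumptions chosen for illustration. The idea is to compare a hand-derived gradient against a finite-difference estimate.

```python
import numpy as np


def loss(w, X, y):
    """Mean squared error of a linear model."""
    return np.mean((X @ w - y) ** 2)


def analytic_grad(w, X, y):
    """Hand-derived gradient of `loss` with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

approx = numeric_grad(lambda v: loss(v, X, y), w)
exact = analytic_grad(w, X, y)
max_gap = np.max(np.abs(approx - exact))  # should be tiny if the gradient is right
```

If the hand-written gradient had a bug, say a missing factor of 2, `max_gap` would be large, which is the signal a gradient check is looking for.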
That's it by way of,
kind of, semi-random announcements.
For the homework one, I have a few things that I wanted to clarify.
Um, let's start with the easiest stuff.
So I hope it's emerging by now,
and it's certainly going to emerge today that PMI,
pointwise mutual information is a very powerful insight,
um, related to all this stuff,
not only re-weighting but also dimensionality reduction.
And as a result, it's a very natural baseline.
If you were actually doing a project in this space,
then it would be nice if the main results table had a row that was just what
happens with positive PMI, or PMI, because you can bet that
the fancy thing that you're actually advocating for is in some sense related
to, or complements, or is in opposition to the PMI insight.
And so the reason for having that as the first homework is just to
push you to essentially fill in the first row of that results table.
From there, you know,
you're, kind of, exploring more advanced things.
We're gonna talk about LSA today,
and I'll talk about- I'll, sort of,
explain to you why you might wanna check a bunch of
these different k-values, and why that's gonna be an issue for you in all likelihood.
And then this final one is just the toe in the water on using the GloVe model,
which we're gonna talk about today,
to possibly develop some powerful vectors.
And I'm hoping that having done
those first three questions and thought about the rest of this material,
you've got some ideas for your original system,
which you'll enter into the bake-off.
That's the overall rationale.
Any questions about that that have come up? Makes sense? Yeah.
Uh, I've two questions. One is like,
[NOISE] it's just hard.
[inaudible] kind of distance.
If you can interpret what that is or time distance.
It's a little confusing getting through this series when like
the columns or the words and the rows are like, you know,
either the window and like the scaling,
like what is that distance?
Oh, you didn't quite go where I thought you were going
because what I thought you were gonna say is,
if you give me a word by word matrix,
it's kind of interpretable what's happening because if I
look across the row with the different column values I can see like,
oh, this is the strength of association in
some sense with the word amazing or the word terrible.
And that's pretty interpretable.
I mean, we're talking about huge high-dimensional spaces,
so it's not like any of us has analytic command but that's pretty good.
And, but then once you start to do things like GloVe or LSA which we'll talk about today,
the columns become much less interpretable.
But you were actually at the first point feeling like you
didn't have the intuitions, is that right?
Yeah. [NOISE] It's like the column has like the window, right, of like how
[inaudible] of the distance vector between two columns.
How is that, what is that?
I don't know. I mean, the scaling thing I would separate out,
and I would say scaling is you modeling your intuition
about how proximity to your focal word counts for co-occurrence.
It could be that you count them all the same or that you feel like
being farther away is less of a co-occurrence.
Um, but once you've built the matrix,
you know you have this high-dimensional space, and from there,
I always do this, but I kind of want to quote this famous, um,
line from John von Neumann:
"In mathematics you don't understand things,
you just get used to them".
Uh, and I feel like a lot of this stuff is just about getting used to it.
And I've already pitched to you that
the vsm.neighbors function as a way to kind of see what's happening,
um, and comparing across different matrices you've built will give you intuitions.
And I'm, we're gonna do some visualization stuff today that might also help.
But beyond that, you might just have to reflect.
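The scaling intuition from this answer can be sketched in a few lines. This is a toy counter, not the code that built the course matrices; the function name, the window size, and the example sentence are all assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2, scaling="flat"):
    """Count co-occurrences within `window` tokens on each side.

    scaling="flat": every context word in the window counts 1.
    scaling="inverse": a context word at distance d counts 1/d.
    """
    counts = defaultdict(float)
    for i, focal in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            distance = abs(i - j)
            weight = 1.0 if scaling == "flat" else 1.0 / distance
            counts[(focal, tokens[j])] += weight
    return dict(counts)


tokens = "we keep tabs on the birds".split()
flat = cooccurrence_counts(tokens, window=2, scaling="flat")
inverse = cooccurrence_counts(tokens, window=2, scaling="inverse")
```

With flat scaling, `keep` and `on` (two tokens apart) count the same as `keep` and `tabs` (adjacent); with inverse scaling, the farther pair contributes only half as much.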
Two points of clarification,
the first one is easy.
I threw into VSM data
this adjective adverb matrix which is
collected from Gigaword as adverb-adjective dependency pairs,
which I think is a really interesting notion of context,
very different from just raw,
like being in that, um, linear window.
Um, but I didn't mean for you to have to make
sure that your bake-off entry worked for that matrix,
it just won't, it's a completely different vocabulary and so forth.
So I added in just a,
a clarification that all I'm talking about is
the two IMDB ones and the two Gigaword ones.
Don't do the heroic thing it would take to
get the experiment code to work with that other one,
that's certainly not something I wanted you to construct.
Um, then more subtle,
maybe more worth discussion here is,
I added a note, trying to be really explicit, that for the bake-off,
you should not bring in external vectors.
Like you shouldn't just download the pre-trained GloVe or the pre-trained
BERT or ELMo or other representations that
you've heard about and enter them into the bake-off.
The spirit of this is to say,
"Start with the count matrices that I gave you.
So we all begin from like a level playing field, same spot",
and then do cool things to those matrices,
that's the spirit of this.
Um, it gets a little complicated because I want to encourage you to do retrofitting.
So I can't say something as clean as don't introduce any outside information,
because I think retrofitting to WordNet is a worthwhile thing to
do and I'm gonna talk about how you might do that in a little bit.
So I've done this thing of saying
no external vectors but I don't know that this is completely the right answer.
I mean part of me wants to just leave you completely unfettered to do
whatever you want beyond like figuring out what the test set is, right.
You know in the real-world if you'd entered a bake-off,
you could do pretty much whatever you wanted to do.
And here I don't know. This is an attempt to keep us all on the same trajectory.
But, I don't know. [NOISE] I did download BERT and ELMo,
at least some small models, and GloVe, to see how well they would do in the bake-off.
Uh, they do okay, um,
but then again, I never win these bake-offs, so what do I know?
Yeah. I don't know, do people have thoughts about this?
You know, I don't know what the right answer is. Yeah.
Just a few questions not necessarily particularly on this one, but one of them is certainly related. In VSM 3,
the retrofitting one, it does actually download the pre-trained GloVe model and uses that.
Would that be something that we work- is that what you're [inaudible] be okay,
to use in retrofitting? [NOISE]
Um, so what I would want to exclude is that you would use the GloVe ones, but
you could use the vectors from, like, Gigaword or ones that you had developed;
you could retrofit them.
That's the spirit of this.
But in doing that, you'd bri- be bringing in WordNet structure and
that's where I fall apart on this idea of not introducing outside information.
And then, but the real battle I'm having here is like,
maybe it's the right thing to do,
to download BERT and do some cool reweighting and
then learn a function that will do real- really well at this task, right.
That could be a, a scientific discovery or combining
BERT and ELMo and representations for
my matrices that you, you know, massaged in various ways,
maybe that's the right answer.
And so I don't want you to feel
constrained artificially, because what's the sense in that?
On the other hand, it would be kind of sad if everybody used off-the-shelf BERT
and everybody got the same score.
Um, so I don't know, um,
I guess I'm the authority,
and so I've set the rules now and we'll see what happens.
For, for future bake-offs,
I don't think we have to play this game because I think I can more or less just say,
uh, do what you think will be interesting scientifically and successful empirically.
Yep.
The same question was you mentioned earlier not having to
manipulate the bake-off matrix for a t-test.
Uh, I don't think we have that matrix now currently, is that correct?
So, uh, I didn't follow,
you could- once you've implemented this t-test thing here,
you can run that on one of these matrices.
I think that's pretty successful actually.
Is that- yeah, that'll work.
So what were you saying with heroic effort that we didn't need to do? [NOISE]
What- when, I did say heroic effort but what was the context?
[inaudible]
Oh, for that one, the other lonely duckling in the VSM directory,
it's called- I clarified down here.
If you have the latest notebook,
you can just read through this:
the Gigaword NYT advmod matrix.
That one is just there for fun.
Um, if you're a linguist who has studied
scalar adjectives then this is a particularly interesting space.
But for most of you,
it's probably just an aside here.
But it has the wrong vocabulary to enter the bake-off.
You'll get a whole bunch of errors,
if you try to make sure your code works with that matrix, and that's what I was
worried about: you all battling against this when really I just meant to exclude it.
Thank you.
Sure [NOISE] One more thing about this notebook.
So this is a small change that I
pushed and this is so strange how the mind works I guess,
I woke up this morning just sure that I had made this omission and sure enough I had.
I don't know why it occurred to me at the point that it did,
but this full word similarity evaluation,
the change that I made is just exposing distfunc.
Before, it wasn't exposed and that meant that when you called
word similarity evaluation down here it was going to use cosine,
because cosine is the default for that function.
And the reason that I'm shocked that I did this
is just that I had been quietly thinking that
a really interesting source of innovation for
this bake-off would be that people actually thought a lot about that distfunc there,
because you can use Cosine or Euclidean or Jaccard but all you really need to have in
that position for this code to work is
some function that will take two vectors and produce a real value.
And if you think about it at that level,
there's lots of stuff that you could do that would kind of go beyond the methods I'm
introducing but some of you have the background that could make that very exciting.
It could be a learned function of the data, for example,
that you wrote yourself.
Uh, and so I was sad that I had not exposed this,
nothing will go wrong,
if you don't use my updated version but I kind of encourage it
because that's a point, something to think about.
That makes sense?
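To make the point about that slot concrete, here is a sketch. The `rank_pairs` helper and the three toy vectors are hypothetical stand-ins, not the course's evaluation code; the only thing carried over from the lecture is that `distfunc` is any function from two vectors to a real value.

```python
import numpy as np


def cosine(u, v):
    """Cosine distance: 1 minus cosine similarity."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def euclidean(u, v):
    """Straight-line distance between u and v."""
    return float(np.linalg.norm(u - v))


def manhattan(u, v):
    """A custom choice: any (vector, vector) -> real function fits the slot."""
    return float(np.sum(np.abs(u - v)))


def rank_pairs(vectors, pairs, distfunc=cosine):
    """Hypothetical helper: order word pairs from closest to farthest."""
    scored = [(distfunc(vectors[a], vectors[b]), a, b) for a, b in pairs]
    return [(a, b) for _, a, b in sorted(scored)]


vectors = {
    "good": np.array([4.0, 1.0]),
    "great": np.array([8.0, 2.0]),
    "bad": np.array([5.0, 4.0]),
}
pairs = [("good", "great"), ("good", "bad")]

by_cosine = rank_pairs(vectors, pairs, distfunc=cosine)
by_manhattan = rank_pairs(vectors, pairs, distfunc=manhattan)
```

On these toy vectors the two choices disagree: cosine puts `good` closest to `great` because they point the same way, while the length-sensitive Manhattan distance puts `good` closest to `bad`. That is the kind of difference the choice of `distfunc` can make.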
Good. There are just a couple more announcements that I wanted to make.
Um, again just by way of getting you familiar with this: this is the repository.
I hope you all are working productively in this repository.
I just wanted to point out that kind of on the backend modeling side,
there are three groups of models.
There are these NP prefixed ones, that's for NumPy.
Those are simple versions of the models like autoencoder,
and classifier, and so forth.
Um, but they expose everything.
So if you really want to understand what's happening
under the hood when these systems learn things from data,
that would be a good place to look.
But if you just want to get some interesting work done,
then I would use either the Torch versions or the TensorFlow versions.
Um, I had kind of planned to just switch to Torch.
Ah, because I think the code is really nice and transparent.
And if you want to do modifications and so forth,
it's typically pretty easy and I'm going to ask you to do some of that later.
So I recommend the Torch,
but the TensorFlow is faster.
Um, and so if you need to get a big job done or you're just accustomed to TensorFlow,
then those are there for you as well.
And just conspicuously missing is the tree version.
Um, because at least as far as I can tell,
it's very difficult to do, ah,
neural network models with kind of arbitrary graphical structure in TensorFlow.
But in PyTorch, that's wonderfully transparent and easy.
So that's kind of what pushed me over the edge.
Relatedly, I thought I would do a shout out.
I just pushed some nice improvements to the PyTorch code, um,
that are thanks to who's a student in this class. So thanks .
I also wanted to thank the people who are helping me get credits.
So we have Google Cloud Credits that's posted on Piazza,
and I'm working with some of these other vendors to try to get credits for you,
ah, via them as well.
All right. Those are- that's it for announcements and prefatory material.
Any questions? Ask me anything before we dive in.
Oh, one more thing.
Uh, I made some updates to the slideshow.
So you might want to download the version,
download it from the web again.
Uh, mostly it's just additions,
but there were a couple of corrections.
Um, I guess this will be on the video.
I'm not sure if you're taking notes and you started taking notes on the first PDF,
this will help you merge the two having this list here.
Um, but otherwise, you can just download a fresh copy.
One of the imp- improvements is this slide 26,
and that's kind of where I want to start.
Um, but let me build up to that.
So last time we talked about matrix designs,
we talked about vector comparison,
and we started in on basic reweighting.
For vector comparison, the- there were kind of three big ideas,
like three big comparison methods.
So first was Euclidean and Cosine.
And tho- you could call those kind of like the-
the classical geometric vector comparison methods,
both stated as distance, um, measures.
And then, I introduced all these matching coefficients, all- sorry,
all these methods that are based on matching coefficient like Jaccard,
and Dice, and Overlap.
And then, I introduced a family of like probabilistic vector comparison methods,
that seem like good choices if you are
dealing with things that you would call probability distributions.
So that was the kind of- kind of the plot of that section.
And I stated some relationships and
generalizations which you can use the code to check if you like,
and this was by way of trying to make sense of this.
Now, the other shoutout that I wanted to do is to [inaudible].
So, ah, he and I were chatting after class,
we walked back to the department, and,
um, he was very supportive of course, a very gentle guy.
But I could tell that something was bugging him a little bit.
And so we kept talking and I kept trying to figure out what was bugging him.
And the one thing that I managed to extract from him that was clearly bugging him
is that I had called a lot of these things distance metrics, especially cosine.
And the way I stated cosine distance,
that is not a proper distance metric.
Uh, and I- so this is- this is true.
I kind of knew this. I had just not made much of it,
but I actually like this as a way of additionally like making sense of the options here.
So this slide kind of summarizes the argument.
To qualify as a distance metric,
a vector comparison method D has to be symmetric.
It has to assign 0 to identical vectors,
and it has to satisfy crucially this thing called the triangle inequality,
which is that the distance between x and z has to be less than or equal to the distance from x to y plus the distance from y to z.
And it's called the triangle inequality.
You can kind of see here like if this is z at the bottom, x and y,
it's just saying that the sum of these two sides
has to be at least as large as the shortcut here. All right.
Very intuitive. Cosine distance as I defined it,
is not a metric because it fails that triangle inequality.
And you know it's very easy to find counter-examples,
and here's just one random one that I've found,
uh, that would show that as I defined it,
this is not a distance metric.
There is a way to correct that.
Here's the definition here.
I actually don't know myself why this isn't, for example, the version that's in SciPy.
Uh, SciPy uses the one that I introduced,
which seems to be pervasive.
But this is the correct metric version.
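A quick numerical check of this argument, as a sketch. The three vectors are my own toy counterexample, not the one on the slide, and I am assuming the corrected version is the usual angular distance (the arccosine of cosine similarity, divided by pi); the slide's exact definition may differ.

```python
import numpy as np


def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def cosine_distance(u, v):
    """The common definition: 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def angular_distance(u, v):
    """Angle between u and v, rescaled to [0, 1] (assumed correction)."""
    sim = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return np.arccos(sim) / np.pi


x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
z = np.array([0.0, 1.0])

# Cosine distance: the "shortcut" x-z is longer than going through y.
direct = cosine_distance(x, z)                          # 1.0
detour = cosine_distance(x, y) + cosine_distance(y, z)  # about 0.586
violates = direct > detour

# Angular distance: the detour is not shorter (here it ties, up to rounding).
ang_direct = angular_distance(x, z)
ang_detour = angular_distance(x, y) + angular_distance(y, z)
holds = ang_direct <= ang_detour + 1e-12
```

So on this one example, cosine distance breaks the triangle inequality and the angular version does not. A single example only demonstrates the failure, of course; it does not prove that the corrected version is a metric.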
And if you want to think in these terms,
then this could help you make decisions,
because the only things that are distance metrics
of the stuff I introduced are Euclidean,
Jensen-Shannon, and cosine as it's defined in this new way here.
The things that aren't distance metrics are all those matching methods.
KL divergence because it's not even symmetric,
but also symmetric KL is not a proper distance metric.
And of course KL with skew which is just a kind of
weighting of the reference versus the target distribution.
That's just like KL, it's not a- a distance metric.
So if you find this- this
elegant as a classification and you want to work only with proper metrics,
then some of your choices become clear.
If you're in a probabilistic space, use Jensen-Shannon.
It's got all these nice properties.
If you want a more classically geometric notion,
use Euclidean or the corrected cosine.
And the nice thing about that that I think will emerge especially today is,
if your vector space is kinda properly normalized, like for
example if you L2-normed all the vectors,
then Euclidean versus cosine doesn't matter at all.
That's a point that I made before, and I just wanted to verify it:
you can check with the code that if I use
cosine distance here, I get the exact same ranking
as if I do the L2 norm and then use Euclidean as the comparison method.
That was one of those generalizations.
The numbers are different,
but the ranks are all the same.
And that's kind of like saying for what we care about,
there's no difference at that point.
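Here is one way to check that generalization on random data. This is a sketch with made-up vectors rather than the course matrices, and the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.random((50, 10))   # 50 toy "word" vectors
query = rng.random(10)


def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def l2_normalize(v):
    return v / np.linalg.norm(v)


cos_dists = np.array([cosine_distance(query, row) for row in matrix])
euc_dists = np.array([
    np.linalg.norm(l2_normalize(query) - l2_normalize(row)) for row in matrix
])

# Different numbers, same neighbor ordering.
same_ranking = np.array_equal(np.argsort(cos_dists), np.argsort(euc_dists))

# The reason: for unit vectors, squared Euclidean = 2 * cosine distance.
identity_holds = np.allclose(euc_dists ** 2, 2.0 * cos_dists)
```

Because the squared Euclidean distance between unit vectors is just twice the cosine distance, one is a monotonic function of the other, and that is why the ranks cannot differ.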
And so that's actually kind of clarifying from this whole mess of things that I offered
kind of like by way of just helping you with
keywords you might encounter in the literature,
I think we can distill it down to some pretty clear guidance.
So thank you for being particular about this.
And then there were some code snippets here.
Questions before we move on.
That was kind of a recap with a little bit of bonus content to help clarify. All good.
We also started talking about reweighting.
Er, and I gave the goals here which is kind of
amplify the important and de-emphasize the mundane and the quirky.
We covered L2 norming and probability distributions as ways to do that.
And then I kind of paused here and I said, observed over expected.
This is a really important idea because you see
this throughout all of these methods that I'm gonna introduce today.
Uh, and there's a whole mess of math up here.
What this metric is essentially doing is
calculating our expectation for the count based on the row and the column,
and then comparing that to the actual count.
And if our expectation is smaller than the actual count,
we should amplify that, that kind of thing.
There's a lot of math here. And so what I've tried to do is,
just I- I added this slide.
This is a kind of intuitive example of why these,
why these numbers are interesting.
So I created this fake,
this toy vector space here,
and the conceit of this is that the word keep is incredibly frequent.
It co-occurs with lots of other words.
So it's 20, 20, 20, and enjoy is also pretty frequent.
And then if you think down the column,
this word tabs here,
my idea is that it's idiomatically related to keep, right?
Keep tabs as an idiom and tabs itself.
It does occur outside of that idiom, but relatively infrequently.
And surely because this is a kind of collocational idiomatic expression,
keep tabs should have a really high count there.
That's what we would expect because of that idiomaticity. That's the count matrix.
And when you count, when you go through the expected values here,
this is the full calculation and then you get these expected values,
you get the result that you were hoping for in the sense that
our expected count for keeping tabs,
just given the probability of the row keep and
the probability of the column tabs is like 13,
but the actual count is 20.
So it's o- it's occurring more than we would expect given this null hypothesis.
And correspondingly, tabs and enjoy occur less often than you would expect.
And that's reflected in the fact that the actual count is 1,
but our expectation is 8.5.
And then all observed over expected is doing- doing is dividing the actual count,
the observed count by
the expected count to kind of bring these things into a ratio space.
That makes sense? That's what we hope to see.
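The calculation can be sketched as follows. The count matrix is my reconstruction, chosen so that it reproduces the numbers quoted in this discussion (about 12.48 expected for keep/tabs, 23.76 for keep/reading, and 8.5 for enjoy/tabs); the third column is an unnamed placeholder, and the actual slide may differ.

```python
import numpy as np

# Rows: keep, enjoy. Columns: tabs, reading, and a placeholder third word.
counts = np.array([
    [20.0, 20.0, 20.0],   # keep
    [1.0, 20.0, 20.0],    # enjoy
])


def expected_counts(X):
    """Counts expected if rows and columns were independent."""
    row_totals = X.sum(axis=1, keepdims=True)
    col_totals = X.sum(axis=0, keepdims=True)
    return row_totals @ col_totals / X.sum()


def observed_over_expected(X):
    """Divide each observed count by its expected count."""
    return X / expected_counts(X)


expected = expected_counts(counts)
oe = observed_over_expected(counts)
```

Here `oe` comes out above 1 for keep/tabs (observed 20 against roughly 12.5 expected) and well below 1 for enjoy/tabs (observed 1 against roughly 8.5 expected), which is the amplification and de-emphasis described above.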
Yeah.
[inaudible] the larger number tends to like,
incur more expectation,
so why is it that we're okay with keep tabs being less than say keep reading?
Or is the metric preferring lower values?
Well, I think that's kinda the intuition: keep
and reading co-occur a lot because they're both kinda frequent,
and it's no surprise that they occur together.
And so actually, having their expected value be
just a little bit bigger than the actual value,
you know, that's kinda good.
But keep and tabs, uh,
co-occur way more frequently than we would
expect given the row and the column and that's why we're amplifying it.
My question, more formally: the 12.48 versus 23.76,
why is it that it's not higher than 12?
Oh well, maybe realistically it is.
It's just a toy example.
So if I had changed the numbers I guess I could have made it higher.
But this seemed pretty good to me in the sense that 20 divided by
13 is much different than 20 divided by 23,
just in kinda an absolute sense and that's the sense in which this is occurring
way more often than we would expect. Yeah.
So expected is basically what we would expect if everything was independent?
Yes.
And then our OE is a frequency independent,
um, frequency independent like,
evaluation of how often something happens.
Definitely right for the first part.
If you think about these in probabilistic terms,
then this is a kind of- that's the- the probability,
the joint probability that you would expect given that
the row and the column were independent in the probabilistic sense.
For the second part, I wasn't sure about
what you meant because it is dependent on the actual frequency,
that's the observed count.
So I guess you could take like one class of documents and another class of
documents, do the OE for them separately.
But then still take the OE that you get from the class one
and class two and compare them without doing any scaling.
Sure.
Yeah, yeah, yeah,
um, I just paused a little bit just because I do wanna show you
some stuff about how these OE values scale.
It's maybe not ideal but I think,
I think I understand your intuition. I'm inclined to agree.
I belabor this point in part because I told you that PMI was the hero of this story.
And PMI is just log-space observed over expected.
Um, where like in- when you normally see this calculation,
people have done what I've done here, which is,
first create a joint probability table and then do the calculations in terms of
those probabilities but everything
comes out the same as if you do it with the raw counts.
Um, but this is a kinda basis for
the calculation and then here's the PMI reweighted matrix.
Um, it's doing the same fundamental thing as OE but in log-space,
and I highlighted these orange values because you can
see that this is prone to exaggerating very small counts.
The one over here,
because it's kinda lonely in this row and this column,
ends up really big in the final matrix.
Even though, if you've worked in NLP for a while,
you might think that 1 is probably not so important as an event.
[LAUGHTER] Uh, maybe in more precise spaces this 1 would be very exciting.
But for us it's like a mis-tokenization or some kind of artifact in the document.
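A minimal NumPy sketch of observed over expected and PMI as described here. The function names and the toy counts are made up for illustration and are not the course's own code. It checks that the raw-count route and the joint-probability route agree, and that a lonely count of 1 ends up with the largest PMI value.

```python
import numpy as np

def observed_over_expected(X):
    """Divide each cell by the count expected if row and column were independent."""
    X = np.asarray(X, dtype=float)
    row_totals = X.sum(axis=1, keepdims=True)   # shape (n_rows, 1)
    col_totals = X.sum(axis=0, keepdims=True)   # shape (1, n_cols)
    expected = row_totals @ col_totals / X.sum()
    return X / expected

def pmi(X):
    """Log of observed/expected, using the log(0) = 0 convention for zero counts."""
    oe = observed_over_expected(X)
    with np.errstate(divide="ignore"):          # silence the log(0) warning
        values = np.log(oe)
    values[np.isinf(values)] = 0.0
    return values

# Made-up counts: a dense 2x2 block plus one "lonely" count of 1.
toy_counts = np.array([
    [10.0, 10.0, 0.0],
    [10.0, 10.0, 0.0],
    [0.0, 0.0, 1.0]])

toy_pmi = pmi(toy_counts)

# The same O/E values computed from a joint probability table.
P = toy_counts / toy_counts.sum()
oe_from_probs = P / (P.sum(axis=1, keepdims=True) @ P.sum(axis=0, keepdims=True))
```

In this toy matrix the single count in the bottom-right cell has an expected count of 1/41, so it dominates the reweighted matrix even though it is only one observation.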
I gave a few plots here just to kinda give you a sense for PMI values and how they scale.
You could, uh, check that out on your own.
This is an important, um,
extension of PMI, positive PMI.
[NOISE] Uh, this is actually the default for VSM
because it's kinda better behaved in certain respects.
Um, but let me articulate the argument for why you might
choose this positive PMI over regular PMI.
PMI is actually undefined when the count is 0,
because I have to at some point take the log of something that would be 0.
So the question is what to do in those situations
because we can't just leave them as infinite values or something.
So the usual response is the one that I gave above here kinda implicitly, right,
with log 0 = 0,
which is to map it to 0.
But so Levy and Goldberg have this nice paper about this general class of methods,
and one thing they point out is just that this is kind of incoherent.
And the argument is, look,
for PMI larger than expected counts get
a really large PMI and smaller than expected counts get a really small PMI.
That's perfectly intuitive.
But if we do this log 0 = 0 thing,
then the zero-count stuff gets placed in the middle of that ranking.
But we don't have evidence for it being in the middle of the ranking,
like, actually, we just don't know that's the point.
Um, and so there's kinda something amiss here.
And a response that kinda cleans this up is just to
say that for anything that would be below 0,
I map it to 0,
and that's the sense in which it's positive PMI.
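A small sketch of positive PMI under the same assumptions (NumPy, a made-up count matrix with no empty rows or columns, an illustrative function name). Zero counts give log(0), which is negative infinity, so clipping everything below 0 to 0 handles them together with all the smaller-than-expected cells.

```python
import numpy as np

def ppmi(X):
    """Positive PMI: log(observed / expected), with negative values mapped to 0."""
    X = np.asarray(X, dtype=float)
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum()
    with np.errstate(divide="ignore"):      # log(0) = -inf is clipped below
        log_oe = np.log(X / expected)
    return np.maximum(log_oe, 0.0)

# Made-up counts for illustration only.
toy_counts = np.array([
    [8.0, 2.0, 0.0],
    [1.0, 5.0, 4.0],
    [3.0, 3.0, 6.0]])

toy_ppmi = ppmi(toy_counts)
```

Here the cell with count 8 stays positive, while the cell with count 1 (below its expected 3.75) and the cell with count 0 both end up at 0, which is exactly the information loss raised in the question that follows.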
Make sense? Yeah.
Could we just lose all information about elements
that are smaller than expected counts if we're mapping to 0?
Yes. Just to repeat that because I can't, I can't disagree.
You're losing a lot of information about all of those negative values.
[inaudible]
Just guessing based on not knowing
whether we have smaller or larger than expected?
It's a great question I guess,
I have to do this a lot especially in this unit.
It's an empirical question [LAUGHTER],
um, which behaves better.
I am gonna try to give you some guidance around this
because obviously you're doing something pretty serious to
the underlying distribution of values and that could give you
some analytic guidance about whether this is the right choice or not.
Um, but fundamentally, I have to agree. Yeah.
How could the expected value be 0?
If it was just exactly the same.
So for O over E,
if it was just exactly the same as the actual count,
then it would be 0 in log-space.
Oh wait, it's like [inaudible].
And that would be kinda like, saying this is exactly what we expected.
Let me introduce a couple more and then we'll step back.
Another famous reweighting scheme that draws on
slightly different notions of row and context is TF-IDF.
This is like famous from information retrieval.
Uh, it's defined in terms of two things.
So first, term frequency,
which you would get in our, um,
vector spaces by dividing everything in the- I don't wanna get this backwards,
so you would normalize column-wise.
Uh, if you think row-wise, right?
So you would be getting column-wise term frequency
for each one of these columns or documents.
And then the inverse document frequency is essentially counting
up how many columns that term appears in.
Uh, and so what the calculation is essentially doing
is amplifying the values for things that occur in very few documents.
And then the TF-IDF is the product of these two.
So the intuition is that you're gonna get
really high TF-IDF values for cells that have very high probability for your document.
That is, they're like over-represented along that
column and occur in very few documents, that is,
traveling along the row because that's like saying that that term
is really specially associated with that document.
Over-represented in it and unusual in appearing in it at all.
Let's say it's those two intuitions. All right.
Here's an example of the calculations just to make it really concrete.
You can do the IDF values and the TF values
and then you just take their product to get the full reweighted matrix.
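A sketch of that calculation, assuming rows are terms, columns are documents, TF is the column-normalized count, and IDF is the log of the number of documents over the number of documents containing the term. The function name and the counts are illustrative, and other TF-IDF variants define these pieces differently.

```python
import numpy as np

def tfidf(X):
    """TF-IDF for a term-by-document count matrix (rows = terms, columns = documents)."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    doc_freq = (X > 0).sum(axis=1)              # number of documents each term appears in
    idf = np.log(n_docs / doc_freq)             # undefined if a term appears in no document
    tf = X / X.sum(axis=0, keepdims=True)       # column-wise term frequency
    return tf * idf[:, np.newaxis]

# Made-up counts: term 0 is in every document, term 1 is in exactly one.
toy_counts = np.array([
    [5.0, 5.0, 5.0, 5.0],
    [0.0, 0.0, 8.0, 0.0],
    [2.0, 0.0, 1.0, 0.0]])

toy_tfidf = tfidf(toy_counts)
```

The term that occurs in every document gets IDF log(1) = 0, so its whole row is zeroed out, and the largest value goes to the term that is frequent in a single document.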
And again, I have some plots that kinda show you how these values tend to scale.
So for example, setting aside the 0 case,
which is always a problem in these log, uh, models.
As I get larger and as I occur in more documents,
that's along this X-axis,
the IDF values go down and like the epitome of
a high IDF value is to occur in just one document because then it's like,
really specially associated with that one document.
And if it has high frequency in that document,
then its TF-IDF value is gonna be really, really large. Yep.
So for our homework,
I don't think we'll be able to use the TF-IDF
because our matrices are word-by-word, is that correct?
Well, technically you can,
nothing would stop you. It's perfectly defined.
However, um, it can be really problematic to
use these very dense matrices with, uh, TF-IDF.
They are really kind of expecting a lot of sparsity.
And in fact, if you have a word that appears in every document, uh,
that is, it co-occurs with every other word,
like a stop word, then these TF-IDF values can get completely ridiculous,
or even be undefined.
Yeah. Um, well, I was kind of hoping you discover that for yourselves,
but now I have given that away.
Well, it'll save you some time.
Um, but for a very sparse word by document, for example,
this has proven time and again to be a really powerful reweighting scheme.
And a powerful way to find documents for
a given query that are really like highly associated with that query.
And here are a few others.
T-test is one you're implementing,
and I like that one just because if you stare at this long enough,
you can see that this is a kind of PMI intuition as well.
There are lots of TF-IDF variants.
And then another one that I could just throw out there
is I could create a pairwise distance matrix,
and that might be an interesting reweighting scheme,
where like all of these values in here are
the cosine distances between the various elements.
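One way the pairwise distance idea could look, as a hedged sketch: every row is length-normalized and the matrix of cosine distances between rows becomes the new representation. The function name is made up, and rows are assumed to be non-zero.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Return the matrix of cosine distances between all pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit_rows = X / norms                       # assumes no all-zero rows
    return 1.0 - unit_rows @ unit_rows.T

# Made-up vectors for illustration.
toy_vectors = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0]])

toy_distances = cosine_distance_matrix(toy_vectors)
```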
Let's step back for a second.
I gave you last time a little bit of a framework for
thinking about what these reweighting schemes are doing.
Uh, you know, things you might ask yourself as you're making these choices.
And one of them was just that you should look at
the distribution of cell values that result from doing the reweighting scheme,
and think about what that might mean for your problem.
And here I've kind of mapped that out for the ones that we've looked at.
So up here, these are the raw counts.
And they're actually- like you can see that they're shaped like this.
Uh, it's actually much worse than this plot appears,
because I did the log scale for the y-axis.
Otherwise, it looked like everything was just at 1,
uh, and then, you know,
a long tail of things.
So that's a really strange, you know,
Zipfian or log-log distribution.
It's kind of hard to deal with for lots of methods.
And then we can look at what happens when we take that incoming distribution,
and change it in various ways.
So for example, the L2 norm has more of a U-shape,
probabilities are kind of similar.
Although I confess that I don't have a firm grip on
why it has this kind of hump before dropping off here.
Observed over expected values have a similar distribution to the raw counts.
So that's maybe not so good,
because even though we're amplifying some stuff,
it's kind of like the underlying statistical problem remains.
[NOISE].
PMI, this is a kind of friendly picture because
it looks quite normally distributed. That's pretty good.
Uh, PPMI does the thing of just chopping off the left half and amplifying all the 0s.
Um, but it's still not as scary looking as for example the raw counts.
And then TF-IDF looks a lot like PPMI.
One more thing before I take questions is just that,
beyond just staring at these distributions,
you can also, I know it's hard to read,
but you can look along the x-axis at the actual cell values.
So of course for the counts,
it goes from 0 to 100,000 or whatever,
um, for these word-word matrices,
whereas the L2 norm of course is 0 to 1.
Um, the probabilities are 0 to 1.
Observed over expected, that's kind of constrained in a space,
like, so even though it looks like a scary distribution,
the x-axis is much tighter.
For PMI and PPMI it's like here, -10 to about 20.
So not as normalized as probabilities or L2 norms,
but at least constrained.
And then the TF-IDF values have a similar kind of 0 to,
you know, small value, um,
to 3 or something like that.
And the reason I emphasize that is just that if you
think about taking distance measurements in these spaces,
we've seen how sensitive some methods are to the magnitude of these values.
And this is giving you a picture of how much that choice
about distance is gonna matter for these various spaces.
Sorry. There's a question, did you have a?
[NOISE] Um, how did you obtain the probability distribution?
Well, for the probabilities,
I just mean that I row normalized using that scheme,
and then I just looked at all the values in that entire matrix.
That's what I did for all of these in fact.
This is just a histogram of all the cell values.
Think of them as just one long vector.
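A sketch of that recipe in NumPy: flatten the matrix into one long vector and bin the values. The matrix here is random, heavy-tailed stand-in data, not one of the course matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, heavy-tailed "counts" standing in for a real co-occurrence matrix.
toy_matrix = rng.zipf(a=2.0, size=(50, 50)).astype(float)

# Treat every cell as one entry in a single long vector, then histogram it.
cell_values = toy_matrix.ravel()
bin_counts, bin_edges = np.histogram(cell_values, bins=20)

# The range of cell values is what the x-axis of such a plot would show.
value_range = (cell_values.min(), cell_values.max())
```

The same three lines apply to any reweighted matrix, which is what makes the distributions directly comparable.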
[NOISE] Uh, another pitch I made to you is
that you might think about how the co-occurrence counts relate to the reweighted counts.
So obviously, for example,
if your reweighting scheme
just makes a proportional shift of the underlying counts,
it's probably not so meaningful.
What we really want presumably is to see something really
different when we look at the new cell values,
and how they correlate with the old ones.
And so here are some plots that do that.
These plots are a little bit harder to understand.
But like, up here, L2 norming,
so along the x-axis,
these are the co-occurrence counts.
That is the raw cell values,
although I did the log scale to make it easier to see what's happening.
And these are the reweighted values.
And you can see it's a very different distribution of values.
And the correlation here which I've given up at the top is pretty low.
So it's not that I can make sense of the space per se,
but rather at least I know it's different from the underlying counts.
So I did something meaningful.
Uh, same thing with observed over expected,
the correlation is kind of strange.
I actually don't know what these artifacts are with these kinds of
shocks of correlated values.
But look at PMI, that's really nice.
Right? There's essentially no correlation between
the underlying counts and the reweighted values.
So we did something.
Uh, PPMI actually has the highest correlation of all of these,
it's the one that's by this measure the least different from the underlying counts.
Make sense? I'll let you stare at these and try to make sense of them on your own.
I actually don't understand completely why they have the particular shapes that they do.
But I'm reassured that something is happening when I reweight.
Are we looking for one, uh,
one of the- like is it the smaller correlation, the better?
No, I wouldn't say that.
I can't go so far as to say better,
I can just say that at least I know I did something.
[LAUGHTER] But it really depends on
how much you think the underlying counts were important to retain.
Uh, which kind of relates to whether or not you think
frequency is information that you wanna preserve in your model.
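A sketch of this comparison, assuming a plain Pearson correlation over all cells; whether the plots used raw or log counts for the correlation itself is not stated, so treat that as an assumption. The helper name and the counts are made up. A proportional rescaling correlates perfectly with the counts, while PMI does not.

```python
import numpy as np

def cell_correlation(X, Y):
    """Pearson correlation between the flattened cells of two same-shaped matrices."""
    return float(np.corrcoef(np.ravel(X), np.ravel(Y))[0, 1])

# Made-up counts with no zeros, so the log below is always defined.
toy_counts = np.array([
    [10.0, 10.0, 1.0],
    [10.0, 10.0, 1.0],
    [1.0, 1.0, 1.0]])

expected = np.outer(toy_counts.sum(axis=1), toy_counts.sum(axis=0)) / toy_counts.sum()
toy_pmi = np.log(toy_counts / expected)

r_scaled = cell_correlation(toy_counts, 3.0 * toy_counts)   # proportional shift
r_pmi = cell_correlation(toy_counts, toy_pmi)               # PMI reweighting
```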
Here are a bunch of generalizations,
just by way of wrapping up this unit here.
So a theme of all of this,
as I've said before,
is that we want to weight a cell value relative to the value we would expect,
given its row and its column.
I think that runs through all these schemes.
Many weighting schemes end up favoring rare events that may not be trustworthy,
so you have to kind of watch out.
And you might want to smooth that out.
The magnitude of counts can be important.
Right? So 1 in 10 and 1,000 in
10,000 might be very different situations in terms of language data.
But for example, if you take a probability distribution,
they will look identical.
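The arithmetic behind that point, as a tiny sketch with made-up rows: a row with 1 out of 10 and a row with 1,000 out of 10,000 become the same probability vector once normalized.

```python
import numpy as np

# Two made-up rows: same proportions, very different amounts of evidence.
rare_row = np.array([1.0, 9.0])             # 1 in 10
frequent_row = np.array([1000.0, 9000.0])   # 1,000 in 10,000

rare_probs = rare_row / rare_row.sum()
frequent_probs = frequent_row / frequent_row.sum()
```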
PMI and its variants will amplify the values of counts that are tiny,
relative to the rows and columns,
and that can be a problem,
something to watch out for.
And then, related to the question from the back there,
TF-IDF severely punishes words that appear in many documents.
So it will behave oddly for our dense matrices,
which are word by word.
And then I offered a few code snippets here.
Again, all of this is in the notebooks,
all of this is relevant for the homeworks,
but I just thought it might be nice to consolidate it here,
and show, for example, you can do observed over expected,
you can length normalize, PMI with and without positive, and TF-IDF.
And just to show you that we're making some progress,
at the end of class last time I showed you the neighbors of the word bad,
in the raw imdb5 matrix.
So tight window, highly scaled.
And it looks terrible.
It's highly collocational, and then there's a period,
and then taste, and then guy again.
When I do PPMI on that same space,
then the neighbors of bad are good,
awful, terrible, and horrible.
And that looks like progress.
Right? We're in a coherent semantic space,
terrible and horrible seem like they should be near bad,
even though they have different frequencies.
And then you might just worry that good is in there.
Maybe that's the problem that you kind of wanna qualitatively solve,
because you have what looks like a sentiment confusion,
in those being so close.
I have a short section in here on bringing in subword information.
This is a newer idea and I just wanted to make you aware of
it and show you some code that will allow you to do it.
Um, I think, this idea in my mind at least traces to a,
a seminal paper by Hinrich Schütze, um,
who showed that if you did some subword modeling you could, kind of,
overcome some sparsity problems and also find
different and more abstract connections between words that might
turn on their morphological analysis, for example.
Uh, yeah, I've said that below here,
subword modeling will pull morphological variants closer together.
Right. You might find yourself seeing
abstract connections between all the adjectives that end in able,
uh, in a way that you wouldn't if you just looked at the words as,
kind of, atomic units.
Uh, it can facilitate the modeling of out-of-vocabulary items,
because if you don't have a representation for a word you can
retreat to using its subparts which are probably highly represented.
Uh, and it can also reduce
the importance of any particular tokenization scheme which is nice because
tokenization in NLP is
this hard a priori thing that you do before you start solving your problem,
that often has huge downstream consequences for the models that you've built.
So it's nice that we can find ways to make that less of a major choice.
And what I did here is just sketch
a technique for doing that using the data that we have.
I'll just walk through the method quickly and if there are questions feel free to ask.
Given a word-level VSM like the ones we've been working with,
you can say that the vector for a character level n-gram
x is the sum of all the vectors of words containing x.
So, it's like you've gone down into
the word and treated all of those things as [NOISE] co-occurring.
Uh, and then you can represent each word as
the sum of its character [NOISE] level n-grams,
so that's the sense in which you could gather strength from all the parts.
And then, I think this is a nice touch.
If you have a representation for the actual word bring that in as well.
And then maybe just, like,
sum up all these vectors and now you've done
some subword modeling and brought in a whole lot of
information from different parts of your corpus that would have been
completely neglected if you hadn't gone down to the subwords.
And here, I just gave an example of how you might decompose the word
superbly into its 4-grams and the only thing that's worth pointing out there is,
I do think it's valuable to have this notion of
start and end contexts which I've signaled with a w here.
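A sketch of that decomposition. The boundary markers `<w>` and `</w>` and the function name are assumptions for illustration; the lecture only says the start and end contexts are signaled with a w.

```python
def character_ngrams(word, n=4):
    """Split a word into character n-grams, with assumed boundary symbols."""
    symbols = ["<w>"] + list(word) + ["</w>"]   # each boundary marker counts as one symbol
    return ["".join(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

superbly_4grams = character_ngrams("superbly", n=4)
```

Treating each boundary marker as a single symbol means the first and last n-grams record where the word starts and ends, so a final "bly" is kept distinct from a word-internal one.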
And then, it's just a code snippet,
so I just wrote a few utility functions that will allow you to
create an n-gram space for any n that you choose.
And then this, like, character_level_rep here,
it's just a small function that I wrote that presents
one way of mapping a word into its subword model.
And I just showed at the bottom here that even though
superbly as a word is not in the vector space,
I can now get a representation of it because
all its subparts or at least most of them are in the corpus.
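The method walked through above can be sketched as follows. This is not the course's implementation: the function names, the boundary symbols, and the tiny two-dimensional vocabulary are all made up, and summing is just one way to combine the pieces.

```python
import numpy as np

def character_ngrams(word, n=4):
    """Character n-grams with assumed <w> and </w> boundary symbols."""
    symbols = ["<w>"] + list(word) + ["</w>"]
    return ["".join(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

def build_ngram_space(word_vectors, n=4):
    """Vector for n-gram x = sum of the vectors of all words containing x."""
    space = {}
    for word, vec in word_vectors.items():
        for gram in set(character_ngrams(word, n)):     # count each word once per n-gram
            space[gram] = space.get(gram, 0.0) + np.asarray(vec, dtype=float)
    return space

def subword_rep(word, ngram_space, word_vectors=None, n=4):
    """Sum the vectors of the word's known n-grams, plus the word's own vector if any."""
    parts = [ngram_space[g] for g in character_ngrams(word, n) if g in ngram_space]
    if word_vectors is not None and word in word_vectors:
        parts.append(np.asarray(word_vectors[word], dtype=float))
    if not parts:
        raise KeyError(f"No known n-grams for {word!r}")
    return np.sum(parts, axis=0)

# Made-up word-level vectors; "superbly" is deliberately out of vocabulary.
toy_word_vectors = {
    "superb": np.array([1.0, 0.0]),
    "super": np.array([2.0, 1.0]),
    "badly": np.array([0.0, 3.0])}

toy_ngram_space = build_ngram_space(toy_word_vectors, n=4)
superbly_vector = subword_rep("superbly", toy_ngram_space, toy_word_vectors, n=4)
```

In this toy vocabulary the out-of-vocabulary word picks up all of its strength from the n-grams it shares with superb and super.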
And this is kind of interesting.
If you look at how that vector superbly relates to other words,
you see some stuff that's good and some stuff that might worry you,
like, now it's very close to super.
Um, and other, like,
adverbs that end in ly,
and that might be over-emphasizing the subword information.
Um, but in general,
I think- if I think back to the bake-offs last year,
subword modeling was often an important step for the top systems.
Make sense? before I, yeah, go ahead.
How do you use subword modeling in the same VSM as like irregular verbs? [NOISE].
You can't quite do the same VSM.
That's why you have to create this imdb5
n-grams and it has different dimensionality, right.
All ours have dimension 5k by 5k,
but this one because there are more n-grams, it has almost 10k by 5k.
So why [inaudible] 5k why [inaudible]? [NOISE].
That's just a design decision so that we can use the original count matrices and,
kind of, find all this subword structure in them.
Because we keep the notion of co-occurrence context the
same on the approach that I've sketched. Yeah.
How's this compare to, like, just stemming your documents beforehand?
I think stemming would be really different, right?
So stemming would be, like,
opposed because there you would be saying that,
I'm gonna get rid of a lot of this morphological stuff at the end, whereas,
here you might be saying,
I'm gonna find the way in which superbly is related to
all these other adverbs by the fact that they end in ly.
Another quick unit here,
I'm not gonna spend too much time on this,
but I think this can be really valuable for doing exploration.
Just some fast visualization techniques.
So let me state the goals. Our goal is to visualize
very high-dimensional spaces in two or three dimensions so that we can understand them.
And you just have to accept upfront,
that this is going to involve a bunch of compromises.
There's not gonna be a system,
a mapping from this high dimensional space into 2D that preserves all that structure,
unless in fact, your original matrix had very little structure to begin with, right?
So we have to have some compromises, but,
uh, my pitch to you would be that,
visualization can still be great because it can give you a feel for what's in your VSM.
And it can be especially nice to pair this with some, kind of,
qualitative investigations using the neighbors function.
I feel like with a holistic picture that's approximate,
and some sampling with neighbors,
you can get a feel for whether your matrix contains what you want it to contain,
and whether it has promise.
Even despite all these compromises that I'm flagging for you.
There are lots of techniques for doing this and, uh,
in fact in recent years scikit-learn has added a bunch of them,
and they have a wonderful user guide.
So I would think if you want to go beyond the one that I'm going to show you, uh,
check out scikit-learn [NOISE] and see what,
what they can do and they talk a lot about the trade-offs for various methods.
But what I've included in VSM is
a little wrapper around scikit's implementation of t-SNE.
I'm not actually sure whether I'm saying that right but I've been saying it for years.
It stands for t-Distributed Stochastic Neighbor Embedding.
And I would say intuitively,
what this algorithm is trying to do is,
go from a very high-dimensional space,
to a dimensionality reduction in a way that you can lay things out
in a few dimensions while preserving a lot of local structure.
This is a t-SNE plot of giga20,
that matrix after I did positive pointwise mutual information on it.
And I would just say that,
this looks pretty good to me.
Uh, yeah actually, at this point I think this is quite beautiful.
You might come to also find these things quite
beautiful because it looks like a giant blob,
but actually there's a lot of local structure.
And it's that local structure that I think t-SNE can capture pretty well,
and that is telling me that there's, kind of,
neighborhoods of coherence which is what I expect from this very high-dimensional space,
that I would find some dense neighborhoods that are interesting.
So that's where you can see and I actually,
so here are two examples.
This is the cooking one,
it's probably too small to read but it's lots of stuff about cooking,
and over here this is lots of stuff about conflict.
Terrorism, war, politics, and so forth.
But it's really semantically coherent which is nice.
And then I did the same thing, this is imdb20 with PPMI,
again pretty good local structure in here.
Here's a positive section,
this is lots of positive words and then negative ones.
And I would say this is nice,
because not only is there pretty good preservation of sentiment in
this space but also, if you look beyond
the adjectives, you can see that when people talk about
positive and negative things they talk about
somewhat different aspects of these movies.
So for example, they complain a lot about the dialogue, um,
but they say lots of nice things
about actors and their appearances and so forth.
Um, so kind of,
one layer beyond just raw sentiment.
And then here's a little bit of code for using the wrapper that I wrote.
It's really straightforward and the only nice
add-on that I want to mention is that the function which is called, um,
tsne_viz, if you give it a vector of colors which you can set up however you
want as long as it's aligned with your vocabulary
then it will display the words in those colors.
And for this example here,
I just downloaded a sentiment lexicon and then
displayed all the positive words in blue and negative in red.
And that can be, kind of, nice to see for something that you
expect to align with the semantics of your space,
how well is it actually aligning, right?
Surely, I hope that I'll see lots of blue clusters
and lots of red clusters amongst all the gray.
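A rough sketch of the same idea using scikit-learn's `TSNE` class directly rather than the course wrapper. The vocabulary, the random matrix, and the three-word sentiment lexicon are made-up stand-ins; the only requirement carried over from the lecture is that the color list is aligned with the vocabulary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Made-up stand-ins for a real vocabulary and its reweighted VSM rows.
vocab = [f"word{i}" for i in range(30)]
X = rng.normal(size=(30, 10))

# Hypothetical sentiment lexicon; anything not listed is drawn in gray.
sentiment = {"word0": "positive", "word1": "negative", "word2": "positive"}
color_for = {"positive": "blue", "negative": "red"}
colors = [color_for.get(sentiment.get(w), "gray") for w in vocab]

# Map the rows to 2D; perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
coords = tsne.fit_transform(X)
```

The rows of `coords` and the entries of `colors` line up with `vocab`, so they can be handed to any plotting library to label each point with its word in its color.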
Yeah. [NOISE].
Is the more, like, global structure of a 2D visualization
like this meaningful, ever?
Can you interpret, like,
where the clusters are in relation to one another?
I think not. I think that's where this starts to break down.
Certainly, you can't trust anything about position on
the plane in an absolute sense because as you re-run t-SNE,
you'll see rotations of approximately the same space.
But I think, correct me if I'm wrong,
but I think it is not meaningful that,
like, whatever blob this is,
is kind of, close to whatever blob this is.
I think that kind of
influence is getting dispersed pretty fast throughout this visualization.
Does that accord with your understanding?
That's why I tend to zoom in on local structure.
Excellent. Final phase that we wanna do here is dimensionality reduction.
I'm gonna present to you a few intuitive methods for doing this,
um, and try to give you a sense for why you might pick one over the other.
We're gonna start with the classic which is
a matrix factorization method called Latent Semantic Analysis.
It's also called truncated SVD for singular value decomposition,
that's really the underlying algorithm.
This is one of the oldest and most widely used reduction techniques,
and I would say that it is a standard baseline,
and often very tough to beat.
This is a powerful method.
And if you think about building up a table of results,
then this may be the second row beyond just raw PMI.
I am not gonna walk through the full algorithm with you because in my experience,
this is kind of the culmination of a class in linear algebra,
takes a long time to build up the relevant concepts here.
But I think, even if we don't do that,
I can give you a sense for why this works and why you might want to apply it.
So let's start with those guiding intuitions.
Thinking just in 2D,
I've got these points A, B, D,
C, and you're probably accustomed to doing
something like fitting a linear regression to those points.
So that's that orange line there.
And by definition, that linear regression is gonna
find the source of largest variation among those dots,
and fit a line through them.
And then one thing you can do is think of that model as saying that B,
its fitted value is on this line right here.
And C, its fitted value is on this line right here.
So you lose one dimension of variation for the sake of this linear model.
And notice what happens if you do that.
If you do that mental projection of pulling B down and pulling C up,
now they're very close together.
That's the sense in which I've abstracted away from one dimension of variation,
and found an abstract connection between these points,
they were modeled in a similar way.
I actually find that this is even easier to think about if you go 3D.
So think about the Stanford campus.
It's very flat, its main source of variation is kind of XY let's say on the plane,
but it does have some tall buildings.
So if you are standing at the base of the Hoover Tower,
let's say I'm at the base and Bill is at the top,
we're far apart because the Hoover Tower is pretty tall.
But if I decide to abstract away from this vertical dimension,
I decide I'm just gonna look [NOISE] at the plane,
then Bill will be pulled down and he'll be right next to me.
That's the sense in which we're doing dimensionality reduction.
We're taking something in 3D,
smooshing it down into 2D.
And when we do that,
we find a connection that was kind of missing from the original space. Does that make sense?
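The tower picture in numbers, as a toy sketch: the coordinates and the height are made up, and simply dropping the vertical axis stands in for the projection a real method would learn.

```python
import numpy as np

# Made-up (x, y, height) positions: same spot on the plane, different heights.
me_at_base = np.array([10.0, 20.0, 0.0])
bill_at_top = np.array([10.0, 20.0, 85.0])

distance_3d = np.linalg.norm(me_at_base - bill_at_top)

# Abstract away from the vertical dimension by keeping only x and y.
distance_2d = np.linalg.norm(me_at_base[:2] - bill_at_top[:2])
```

In 3D the two points are 85 units apart; on the plane they coincide.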
The method is singular value decomposition,
and kind of the fundamental theorem here is that for any matrix m by n,
I can do a decomposition into three matrices,
T which you could think of as like the row rotation or the row matrix,
some singular values, that's the diagonal,
and a kind of column rotation or a column matrix that's over here.
And the idea is that this- the combination of these three matrices equals this one.
So at the level of the full decomposition,
I haven't done anything, right?
I've just shown you a decomposition of the original matrix.
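A quick check of that claim with NumPy's `np.linalg.svd`, on a made-up matrix. The variable names T, s, and Dt follow the term / singular values / document naming used here; `full_matrices=False` asks for the compact form of the decomposition.

```python
import numpy as np

# Made-up 4 x 3 matrix for illustration.
toy_matrix = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
    [1.0, 1.0, 0.0],
    [4.0, 0.0, 2.0]])

# T: row (term) matrix, s: singular values, Dt: transposed column (document) matrix.
T, s, Dt = np.linalg.svd(toy_matrix, full_matrices=False)

# Multiplying the three pieces back together recovers the original matrix.
reconstructed = T @ np.diag(s) @ Dt
```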
Where this gets interesting is when we start to think
about doing reduced dimensional versions of it,
where we're gonna approximate the full matrix by some subparts of the decomposition.
Let me try to motivate that.
So here's my example that I like.
This is a little vector space of adjectives.
The conceit of my example [NOISE] which I think is still pretty true,
is that gnarly as an adjective is a West Coast thing.
Whereas, wicked is kind of like [NOISE] its counterpart on the East Coast.
If you're from Boston, you say wicked
and if you're from Los Angeles, you say gnarly.
If this is not true anymore,
that's a little embarrassing because I'm the linguist here but set that aside,
I think it kinda makes sense.
The issue there is that gnarly and wicked are both positive.
But because of this dialect split,
they're unlikely to co-occur in any document.
Um, rather, what will happen is that they'll have neighbors in common.
So gnarly and wicked will both tend to over-occur with awesome,
because that's like across the world and across the country anyway.
And not tend to occur with lame and terrible because they're sentiment opposed.
But if I just use the methods that I've showed you so
far and I calculate distance in this little matrix here,
gnarly and wicked are really far apart.
And the reason they're really far apart is,
again, think back to my [NOISE] campus metaphor.
There, uh, they're not abstractly anywhere near each other because they never co-occur.
What we want is a kind of second-order notion of
co-occurrence that's taking into account the other things that they vary with.
And to get at that notion,
we have to do some work.
Here's the Singular Value Decomposition,
term, singular values, and document.
But what I've done here, this is the truncated part.
I'm gonna look at just the top two dimensions.
The algorithm ensures that the singular values are organized by their size, uh,
and that correspondingly these columns are organized by, kind of, the
amount of information [NOISE] that they contain.
So when I do the approximate reconstruction using just the two columns here,
I get this new vector space, reduced dimensional.
And in that new vector space gnarly and wicked are really close to each other because
the method has found this abstract notion of co-occurrence,
second-order co-occurrence you might call it.
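To make the gnarly/wicked example concrete, here is a minimal sketch of truncated SVD with NumPy. The count matrix is a made-up toy in the spirit of the example, not necessarily the numbers on the slide, and the helper names `lsa` and `cosine_distance` are illustrative.

```python
import numpy as np

def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def lsa(X, k):
    """Truncated SVD: keep the top k columns of T, scaled by their singular values."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    return T[:, :k] * s[:k]

words = ["gnarly", "wicked", "awesome", "lame", "terrible"]

# Made-up word-by-document counts: gnarly and wicked never share a document,
# but each shares documents with awesome.
toy_counts = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1]], dtype=float)

gnarly, wicked = words.index("gnarly"), words.index("wicked")

raw_distance = cosine_distance(toy_counts[gnarly], toy_counts[wicked])

reduced = lsa(toy_counts, k=2)
reduced_distance = cosine_distance(reduced[gnarly], reduced[wicked])
```

In the raw counts the two rows are orthogonal, so their cosine distance is 1. In the two-dimensional space both words load on the same dimension as awesome, and the distance collapses to essentially 0.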
Make sense? Yeah.
How is this different from PCA?
It's very similar.
Yeah. In fact, I do wanna point out that
there's a whole family of these methods that you could think about,
and they're all drawing on essentially the same intuition,
uh, and they just kind of play out in slightly different ways.
I picked LSA because it's like the classic for this space.
Cell-value comparisons, right?
I'm pitching this as a way of understanding what happened to the space at a high level.
Here, the raw counts.
If I just apply LSA directly,
I get kind of this funny distribution here.
And I think that's because LSA as fundamentally a least-squares method is
not so well matched to the Zipfian distribution of counts coming in over here.
So if I first reweighted by PMI
which I showed you has this kind of nice normal distribution,
then the resulting values for LSA also have a similar distribution over here.
They're slightly differently scaled,
and they're different values,
um, but it's kinda more tractable looking.
One more note about this.
So when you read about truncated SVD or LSA,
the question comes up of how you would pick the dimensionality K,
in the sense of like I picked two here.
But you're working with matrices that have 5,000 dimensions.
How will you pick the number of singular values that you wanna pay attention to?
The dream scenario that you read about is that when you
plot the singular values by their rank,
you see a bunch of high values and then a really rapid drop-off.
And if you saw that,
you'd say, "Oh good,
I'm gonna pick 20 as my dimensionality because I have
such a clear fall off in information value from there."
In my experience, this never happens,
my distribution of singular [LAUGHTER] values tends to look like this.
This is actually from one of the matrices.
And I just feel like I have no idea which value of k to [LAUGHTER] pick, um,
because there's this kind of quick thing and
then a smooth drop-off all the way to the end.
And that's why, as you can see from the homework,
this emerges as a kind of
heuristic empirical issue of maybe trying a bunch of different values of k,
and seeing which one gives you the gain you were looking
for on the tasks that you're actually trying to solve.
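A sketch of how one might inspect the singular values before falling back on trial and error. The matrix is random stand-in data, and the retained-mass number is only a rough diagnostic, not a substitute for evaluating each k on the actual task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up count-like matrix standing in for a real VSM.
toy_matrix = rng.poisson(lam=3.0, size=(40, 30)).astype(float)

# Singular values come back sorted from largest to smallest.
singular_values = np.linalg.svd(toy_matrix, compute_uv=False)

# Fraction of the total squared singular-value mass kept by the top k dimensions.
retained = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

candidate_ks = [2, 5, 10, 20]
for k in candidate_ks:
    print(k, round(float(retained[k - 1]), 3))
```

Plotting `singular_values` against rank is how one would look for the rapid drop-off described in the dream scenario.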
If you ever see one of these dream scenarios,
do post it on Piazza,
we could like have a wall of fame for them.
But probably, it'll be mostly this wall of confusion.
And as I said, here are a bunch of other methods and a lot of them are implemented
in scikit-learn, in its decomposition and manifold packages.
Questions or comments before I move on to another method?
[BACKGROUND]
There's some LSA code snippets.
I think this is really easy.
They're just there for reference.
Let's move on to autoencoders.
This is our first deep learning model I guess.
So autoencoders are flexible deep learning architectures
for learning reduced dimensional representations, right.
That's the name of the game here.
It's a nice reference.
Let me give you the intuition.
For an autoencoder, the incoming examples could be for example,
the 5,000 dimensional rows from one of the count matrices that you guys are working with.
The idea is that I'll take that really high-dimensional thing
and pass it through this very narrow layer in orange,
maybe that has dimension 100 or 50 or 10 or whichever one you pick.
The job of the autoencoder is to do its best to reconstruct the input.
So you pass it through this narrow pipe and then try to reproduce it.
Of course, you expect given
a complex enough incoming matrix to have some information loss,
but we also had information loss with LSA.
The hope of the autoencoder is that it will learn to lose the information
that's not so important and preserve the sources of variation that really matter.
And that's the sense in which you'll get a powerful intermediate representation.
So when we run an autoencoder,
we train it against this reconstruction objective.
But what we're actually interested in is, you know,
this hidden value here or one of
the hidden values if you've decided to put a lot in the middle there.
I will say that it can be hard to make it work with
that raw 5,000 dimensional vector coming in.
So people do things that are outside the autoencoder,
uh, to reduce the dimensionality.
So for example, you might first run LSA and kinda do a
model-external dimensionality reduction and then feed that through
your network and try to reconstruct it to do a further layer of dimensionality reduction.
For this second panel here,
I thought it would be nice to just show you everything you need.
If you wanted to implement this autoencoder from scratch,
the forward pass would go up and the notation is all
here and then this shows you how to calculate the gradients,
and I used exactly the notation from the NumPy implementation that's included
with the course repo in case you wanna
really go down and see what the computations are like.
But it's a pretty straightforward model,
compress the information and hope that it works out for the best. Yeah.
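To make the forward pass and gradients concrete, here is a minimal from-scratch sketch in NumPy. It assumes a single tanh hidden layer, a linear output layer, and a squared-error reconstruction loss; the function name `train_autoencoder`, the variable names, and the toy data are hypothetical and may not match the course repo's implementation.

```python
import numpy as np

def train_autoencoder(X, hidden_dim=2, lr=0.01, epochs=1000, seed=0):
    """Fit a one-hidden-layer autoencoder with full-batch gradient descent.

    Returns the hidden representations (one row per input row) and the
    per-epoch reconstruction losses.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_xh = rng.uniform(-0.5, 0.5, size=(d, hidden_dim))  # input -> hidden
    b_h = np.zeros(hidden_dim)
    W_hy = rng.uniform(-0.5, 0.5, size=(hidden_dim, d))  # hidden -> output
    b_y = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: squeeze through the narrow layer, then reconstruct.
        h = np.tanh(X @ W_xh + b_h)
        y = h @ W_hy + b_y
        err = y - X
        losses.append(0.5 * np.sum(err ** 2) / n)
        # Backward pass: gradients of the loss computed above.
        d_y = err / n
        d_W_hy = h.T @ d_y
        d_b_y = d_y.sum(axis=0)
        d_h = (d_y @ W_hy.T) * (1.0 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
        d_W_xh = X.T @ d_h
        d_b_h = d_h.sum(axis=0)
        # Gradient descent update.
        W_hy -= lr * d_W_hy
        b_y -= lr * d_b_y
        W_xh -= lr * d_W_xh
        b_h -= lr * d_b_h
    hidden = np.tanh(X @ W_xh + b_h)
    return hidden, losses

rng = np.random.default_rng(1)
# Toy data: two underlying sources of variation spread over 8 columns.
X = 0.5 * rng.normal(size=(40, 2)) @ rng.normal(size=(2, 8))
hidden, losses = train_autoencoder(X, hidden_dim=2)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The `hidden` matrix is the part you would keep: it plays the same role as the reduced matrix you get back from LSA.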
So the hidden layer here would be what we use,
uh, let's say in the homework to predict word pairs.
Yes, and my autoencoder's fit method returns that matrix of
hidden representations so that it kinda acts like LSA. Yeah.
Uh, on the note of the task that's being used,
is it more like reversed LSA
because if you process LSA beforehand you've sort of done dimensionality reduction,
and if you're trying to reinvent the input,
do you mean the original like giga5 matrix that was going to fit to LSA,
or you're trying to re-engineer your matrix after LSA.
Uh, well, let's just say there's lots of possibilities.
It will depend on what you do but the sharp answer is if you feed in raw counts,
the model will try to reproduce the raw counts.
If you feed in the LSA representation,
it will try to reproduce that.
Other mixtures would be different; they would not be standard autoencoders. Yes.
[inaudible]
Yeah. It would and if you had the hidden dimensions,
the same as the input and output,
you would hope it would learn to just copy.
But since you're forcing it to pass it through this narrow pipe,
it has to lose some information or at least represent it very differently.
Thank you.
Sure. And here's some code for doing this, the interfaces are
maybe a little bit non-standard and I tried a few different things here in particular.
I guess here I tried to length normalize
the inputs so that it didn't get raw counts coming in because I
think that's a very difficult learning problem
for a deep learning architecture to have raw counts coming in.
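Length normalization here just means scaling each row vector to unit L2 length. A minimal sketch, assuming a dense NumPy matrix (the helper name `length_norm` and the toy counts are illustrative, not the course code):

```python
import numpy as np

def length_norm(matrix):
    """Scale each row to unit L2 length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    return matrix / norms

# Rows with very different raw magnitudes end up on the same scale.
counts = np.array([[3.0, 4.0, 0.0],
                   [1000.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
normed = length_norm(counts)
```

After this step the network sees values between -1 and 1 rather than raw counts that can differ by orders of magnitude from row to row.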
Uh, and it works pretty well. I think you need to run it for
longer than I did but it's definitely on
the right track to doing something interesting. Oh yeah.
Here's a little bit of proof of that: finance in the giga5 raw counts, that does not look good.
Uh, the autoencoder looks much better, right.
So that's the raw autoencoder,
and here I got a little fancy, I ran PPMI and then
LSA with dimension 100 and fed that into the autoencoder,
uh, and it's at least different.
I don't know how to evaluate the two.
I guess that's what the homework is all about.
Any questions about that?
This is great timing then.
Let me introduce one more model just quickly and then we'll circle back to it.
Uh, I think this is my favorite of
all of these dimensionality reduction techniques for a few reasons.
So this is GloVe.
Global Vectors, you might have heard about it.
It was developed here at Stanford by
Jeffrey Pennington and Richard Socher and Chris Manning.
This is the paper, and I guess I want to say right at the outset as a kinda meta comment.
This is a lovely paper.
I highly recommend that you read it not only for the content but also just
to experience a really well-written, well-motivated paper
in our space, not all of them are.
Uh, but this is a kind of exemplar.
They really nicely articulate
the high level goals that kind of place you in the context that you're in right now,
which is that like, we have these matrix factorization methods,
we have things like PMI,
we have Word2Vec.
Let's think about their strengths and weaknesses.
And then let's really think about what we are trying to do when we build
word vectors and from there over
like section two they build up this model very carefully.
Uh, and then of course, it had great experiments and they
released code and they released lots of pre-trained word vectors,
which the community has benefited from in lots of ways.
So it's like the complete package of what you'd be looking for in a contribution.
Uh, roughly speaking, the objective for GloVe is to learn vectors for
words such that their dot-product
is proportional to the log probability of their co-occurrence.
Keep that in mind because that might already sound to you like PMI.
In terms of practical details,
we're using the implementation in the mittens package,
which I wrote with Nick Dingwall.
It's called mittens because we have a variant of GloVe, uh,
it's GloVe with a warm start and we take mittens to be warmer than gloves.
And it should also be fast because we found a way to vectorize GloVe.
So I would use that for modest jobs and for really big things,
you should use the GloVe team's C implementation,
which is also an impressive engineering feat and works extremely well.
Let me do this one slide for you.
Maybe two and then we'll wrap up for
the day but I would like to just plant the seed for this.
So at the top here,
this is the GloVe objective.
You can see that you have two words,
i and k. Those are the vectors that we're learning.
We're gonna take their dot product.
That's an idea that you've seen before because if you think back to cosine distance,
the similarity part is the dot product over the normalization constant there.
So this is kinda like un-normalized cosine similarity.
There are two bias terms,
and the idea is that that's equal to the log of
the co-occurrence count for i and k. They kind of unpack that a little bit so they
define this dot product here as the log of the probability of
co-occurrence and they define that as
the log of the co-occurrence count minus the log of the row probability.
The reason that they subtract only the row log probability,
is because they're assuming at this stage what they call the exchange symmetry,
which is that in the matrix the rows and the columns are the same,
um, because they assume they're dealing with a word-by-word matrix.
It certainly needs to be a square matrix for GloVe to work.
If you allow that the rows and the columns might be different,
and in fact they construct matrices where the rows and
the columns are different and you might do that as well,
then you would wanna factor in
the column probability and that's what I've done down here.
So now you have the log of the co-occurrence
minus the log of the product of the row and the column probabilities.
And that is something that you've seen before, that is PMI,
right, just by the equivalence given down here.
That's quite, quite interesting, right?
They give this lovely argument that sounds like it's in a different place
from PMI and they end up back at the PMI place.
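The chain of equations being described can be written out as follows. The notation is an assumption based on the GloVe paper's conventions: $X_{ik}$ is the co-occurrence count, $X_{i*}$ and $X_{*k}$ are the row and column totals, $N$ is the total count, and $w_i$, $\tilde{w}_k$, $b_i$, $\tilde{b}_k$ are the learned vectors and biases.

```latex
\begin{align*}
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k &= \log X_{ik} \\
w_i^\top \tilde{w}_k &= \log P_{ik} = \log X_{ik} - \log X_{i*} \\
w_i^\top \tilde{w}_k &= \log X_{ik} - \log\left(X_{i*} \, X_{*k}\right) \\
\mathrm{PMI}(i, k) &= \log \frac{P(i, k)}{P(i)\,P(k)}
  = \log X_{ik} - \log\left(X_{i*} \, X_{*k}\right) + \log N
\end{align*}
```

Written in raw counts, the third line differs from PMI only by the constant $\log N$; written in probabilities, as on the slide, the two coincide.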
Now, it's not literally PMI,
because you have to remember that what GloVe is doing is
learning reduced dimensional representations,
right, it's doing reweighting and dimensionality reduction all at once.
But it is certainly, and you can see this in the objective, drawing on the same insight as PMI.
And that's the sense in which this is kind of the hero of our story.
There's a few other things that are really
meaningful so I don't wanna minimize them because this is kind of
the ideal objective and then the actual objective is
a weighted version of it and we should talk
about why they chose the weighting scheme that they did.
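For reference while reading, here is a minimal NumPy sketch of the weighted objective as the GloVe paper states it: each squared error is scaled by f(x) = (x / x_max)^alpha for x below x_max and by 1 otherwise, with x_max = 100 and alpha = 0.75 as the paper's reported settings. The function names and the toy matrices are assumptions for illustration; this is not the mittens or C implementation.

```python
import numpy as np

def glove_weight(counts, x_max=100.0, alpha=0.75):
    """Weighting function f from the GloVe paper: grows with the count,
    capped at 1 once the count reaches x_max."""
    return np.minimum(counts / x_max, 1.0) ** alpha

def glove_loss(W, C, b_w, b_c, counts):
    """Weighted squared error between dot product plus biases and log counts."""
    mask = counts > 0
    log_counts = np.log(np.where(mask, counts, 1.0))  # zero cells get no weight
    pred = W @ C.T + b_w[:, None] + b_c[None, :]
    sq_err = (pred - log_counts) ** 2
    return np.sum(glove_weight(counts) * sq_err * mask)

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(6, 6)).astype(float)  # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(6, 3))  # word vectors
C = rng.normal(scale=0.1, size=(6, 3))  # context vectors
b_w = np.zeros(6)
b_c = np.zeros(6)
print(glove_loss(W, C, b_w, b_c, counts))
```

Note that cells with a zero count contribute nothing to the loss, both because f(0) = 0 and because of the explicit mask.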
Um, but just in the interest of this having been a lot of content,
I propose that we wrap up here,
maybe read the GloVe paper,
think about that slide and then we'll pick up
there and move on into retrofitting next time.
