Hi everyone. Hope you had a good weekend.
Uh, we're gonna start off today's class talking about the
bake-off that happened last week.
So if you don't remember what you did,
you evaluated distributed representations using word similarity datasets,
and so all the similarity datasets had these word pairs with associated, um,
human annotated similarity scores,
and your evaluation measured the distance between the word pairs in your chosen VSM.
So here's the histogram of the scores.
Um, as you can see, there's kind of two bumps: one on the lower end,
one on the higher end, most people are on the higher end.
Um, so what Chris did was, um,
calculate the observed-over-expected scores after
separating all the bake-off entries into
the top-scoring teams and the bottom-scoring teams.
So on the left here, we have,
um, the words that are more likely to appear for top-scoring teams.
And then over here, we have the ones that are more likely to
appear for the bottom-scoring teams.
That helps you get an idea of like what distinguishes
the really good models from the meh models.
So, uh, so it seems like retrofitting on WordNet and using
LSA are common ways to build better models from this rough analysis.
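To make that concrete, here's a rough sketch of what an observed-over-expected analysis like this could look like; the function, the tokenization, and the toy inputs are all illustrative inventions, not the actual analysis code.

```python
from collections import Counter

def observed_over_expected(top_docs, bottom_docs):
    """For each token, compare its observed count in the top-scoring
    group to the count expected if tokens were distributed evenly
    across both groups. Ratios > 1 suggest association with the
    top-scoring entries."""
    top = Counter(tok for doc in top_docs for tok in doc.split())
    bottom = Counter(tok for doc in bottom_docs for tok in doc.split())
    total = top + bottom
    top_share = sum(top.values()) / sum(total.values())
    return {tok: top[tok] / (n * top_share) for tok, n in total.items()}

# Toy "system description" strings for top- and bottom-scoring entries:
scores = observed_over_expected(
    ["pmi retrofitting lsa", "pmi retrofitting"],
    ["jaccard raw-counts", "jaccard pmi"])
```

Tokens with scores above 1 (like "retrofitting" here) are over-represented among the top scorers; scores below 1 mark the bottom group.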
Okay. So our first place team is Group 80.
Is Group 80 here? No, okay.
But yeah, Group 80 basically, uh,
stacked PMI,
t-tests, retrofitting on WordNet,
and an autoencoder, and they got the highest score.
So good job for Group 80.
[LAUGHTER] [APPLAUSE] Yay.
Um, now, second place is- is this person here?
No? Okay. So what this person did looks like a lot,
but it's basically PMI plus retrofitting with a sub-word model, and they did pretty well.
Third place, are you here?
No. Um, so they also piled on a bunch of things, PMI,
t-test, LSA, retrofitting on WordNet,
and they did pretty well as well.
Okay. Now, [LAUGHTER] a model that- that did not work as well.
We're not gonna say who this model belongs to.
But it is kind of interesting how they did like t-tests, LSA, Jaccard.
So like, so they did t-tests and LSA together,
but I think the thing that made them kind of
suffer a bit more is they used Jaccard as their distfunc,
while the winning teams all had the default distfunc,
which I think is cosine, so that's like an interesting detail there.
So it could be cool just to see like if we swapped out Jaccard with cosine,
maybe they wouldn't have scored so low.
Um, but it's still an interesting result,
so like if you're not doing too well on your models for these bake-offs,
it's still interesting to see like how something that could
be very well-motivated can still fail.
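For intuition about why that swap matters, here's a hedged sketch of the two distance functions; I'm assuming `distfunc` means a pairwise vector distance, and these plain-Python versions are my own, not the course's `vsm` module.

```python
import math

def cosine_distance(u, v):
    # 1 minus cosine similarity: insensitive to vector magnitude.
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def jaccard_distance(u, v):
    # Generalized (weighted) Jaccard for non-negative vectors:
    # 1 - sum(min)/sum(max). Unlike cosine, it penalizes differences
    # in magnitude, which may be part of why it hurt here.
    return 1.0 - (sum(min(a, b) for a, b in zip(u, v))
                  / sum(max(a, b) for a, b in zip(u, v)))

u, v = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]  # same direction, different scale
cos_d = cosine_distance(u, v)
jac_d = jaccard_distance(u, v)
```

On these two vectors, cosine distance is essentially 0 while generalized Jaccard distance is 0.5, which illustrates how the two measures can rank word pairs quite differently.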
Um, and yeah, I think that's all I have.
Any questions? Yeah.
There was like a- another category for like took the most effort for real-time, right?
Who is- who is the winner for that?
Oh, I wasn't able to find who that was, because it's so hard to, like,
do you quantify it by lines of code, you know?
Because I saw a lot of teams who ran a lot of tests,
like they had, like, 10 cells and each cell has a different experiment,
and they did a lot of hyperparameter tuning and stuff like that.
So but it- it could be there.
Yeah. There are some low scoring teams that put in effort.
[inaudible] their teams.
Yeah.
For some special consideration,
if you could quantify it yourself in terms of compute time or [OVERLAPPING].
Great.
I'm gonna hand this off to the real professors.
Thank you miss.
[APPLAUSE]
Okay. Uh, I have a couple of announcements
before we get to our main event. So here they are.
Bake-off 2 has begun,
so there was a post on Piazza within
the last hour or so giving you details on how to do that.
So make sure you take a look at that post.
Uh, Bake-off 2 closes on Wednesday,
so you definitely wanna,
uh, look at that today or tomorrow.
Um, on Friday, we are gonna have
a special session to give you an intro to NumPy and PyTorch.
Uh, so the details are there and this session is gonna be recorded.
On Wednesday, we're gonna have some guests.
So Guillaume Genthial, don't know how to pronounce that,
uh,
who's from Roam Analytics, and Marta Recasens, who is from Google Research,
are gonna be visiting for a panel discussion on NLU in industry.
So we wanna encourage you to attend in person if you can to, um, show them a
good, uh, show of support
from Stanford and to have lots of good interaction.
I think the value of that session will be greatly enhanced if there's,
um, you know, lots of in-person interaction.
And if you can come to the session with
specific questions that you'd like to get some insight into,
what is it really like to use NLU techniques, um,
in the industry and, and apply the,
the stuff that we've been talking about in real-world applications.
Um, if you have specific questions or topics in mind,
there's a Piazza post, um,
en- encouraging you to think about this and you can add your,
uh, proposed question as a comment on that Piazza post.
And it's pinned so it's visible at the top of the list.
Um, that NLU panel discussion probably won't take the
full hour and 15 minutes next time,
so we'll use the extra time for coding in class, uh,
focusing on Homework 3 and Bake-off 3,
which is about relation extraction,
that's what we're gonna look at today.
But having some time in class for that on Wednesday will be particularly valuable,
because you'll have a little bit less time, uh,
to prepare before that homework is due next Monday.
So any questions before I go on? Yes.
For Bake-off 2, uh, do you have to train one more time?
Um, if you have a model that you think has, like, good links or something,
could you just, like, uh,
not do that, and get points?
Not sure I got that.
Yeah. I think you won't be able to use that as the experiment.
But I gave some guidelines about this in the post.
Just make sure that you're using
exactly the same training data that you did before,
you test on the test data, and then you report macro-F1.
And if you do those three things, it's totally fine.
[NOISE] Okay.
Today, I wanna turn to a new topic,
which is relation extraction.
Um, and there's kind of two big ideas that I wanna introduce you to.
One is the task of relation extraction itself,
and the other is this idea of distant supervision.
Um, relation extraction is an interesting topic
for a bunch of reasons that I'll say more about in a moment.
But one of the reasons that it's an interesting turn for us in kind
of the narrative that we're developing is that it's the first time
that we're looking at an NLU task where the output of
the task is a discrete object rather than a numeric value.
So we've just been looking at, uh, sentiment analysis.
And sentiment analysis, the output is a scalar value,
it might be like one through five or
in the ternary task that we're doing for the homework and the bake-off,
it's just, uh, minus one,
zero, or plus one.
In some versions of sentiment analysis,
it's just zero or one;
it's basically binary classification.
But in any case, it's a scalar value.
Uh, before that we looked at vector space models of meaning,
where the output is a vector of real values.
With relation extraction, for the first time,
we're gonna be looking at a task where the output is an object.
In this case, it's a relational triple.
So it's like, uh, a row in a database essentially.
Um, so that's, uh, uh, something new.
And a couple of weeks from now,
we're gonna look at the task of semantic parsing
where we go even further in that direction,
and we're looking at outputs which are complete logical forms,
so that have real complex structure to them.
So relation extraction is one big idea.
The other big idea is distant supervision,
and this is interesting because it's a strategy
for breaking the data bottleneck that has been,
uh, one of the biggest, um,
sort of obstacles to progress in applying machine learning methods to NLP and NLU.
Um, the idea of distant supervision enabled people working on
relation extraction suddenly to be able to use
100 times as much data as had ever been used before for relation extraction.
Um, and that enabled, um,
vastly greater statistical power in the- in the models that were trained.
Um, and I'll explain how- how that works in a moment.
But that was a big innovation that's applicable not only to relation extraction,
but to other kinds of problems as well.
Um, what's interesting is that the innovation here is not so much a new kind of model.
It's not a new neural architecture or
a new loss function or a new optimization algorithm.
Instead, the innovation is in the kind of supervision that's used to provide,
uh, a signal on which to hill-climb.
Um, it's also interesting because
the ideas of distant supervision were developed right here at Stanford.
So the idea first came about in this paper by Rion Snow and colleagues.
Uh, in 2005, Rion Snow was a PhD student who was advised by Dan Jurafsky,
whom some of you may know.
Um, so that idea, um,
incubated here in 2005.
It was then extended by,
um, Mintz et al.
I think I have a reference to Mintz et al right here.
In this paper, distant supervision for relation extraction without labeled data.
So that was also in Dan Jurafsky's research group and part of the,
part of the Stanford NLP group.
Um, and then this same strand of work led
eventually to a generalization of the idea of distant supervision,
which was the idea of data programming developed in Chris Re's group.
Some of you might know Chris Re who's a-
also a professor in the Computer Science Department here.
And, um, a package called Snorkel,
which implements some of the ideas of data programming, um,
which you can see as kind of a generalization of,
of this basic idea.
These are all strategies to break through
the limit imposed by needing manually annotated,
uh, fully supervised data for training.
So let me say a little bit about the task of
relation extraction. Is this big enough for people to read? This okay?
Legible? Okay. So the task of
relation extraction is to extract relational triples from natural language text.
And by relational triple,
I mean things like these examples here: founders(SpaceX, Elon_Musk),
has_spouse(Elon_Musk, Talulah_Riley), and worked_at(Elon_Musk, Tesla_Motors).
So each of these expresses a fact about the real world.
Each of them has three components.
There is a relation and then two entities.
Um, the entities have these underscores in here.
So that suggests that they're some kind of
unique identifier of these entities and I'll say more about that in a moment.
And the relation is one of
a small number of pre-defined relations that we're interested in.
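In code, a relational triple is just a three-field record. Here's a minimal sketch; the `KBTriple` name and field names are illustrative, and the course's own classes may differ.

```python
from collections import namedtuple

# Illustrative triple type: a relation plus subject and object entities.
KBTriple = namedtuple("KBTriple", ["rel", "sbj", "obj"])

kb = [
    KBTriple("founders", "SpaceX", "Elon_Musk"),
    KBTriple("has_spouse", "Elon_Musk", "Talulah_Riley"),
    KBTriple("worked_at", "Elon_Musk", "Tesla_Motors"),
]

# A knowledge base is then just a collection of such rows,
# filterable by relation like a database table:
founders = [t for t in kb if t.rel == "founders"]
```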
So why is this interesting? Why do we care about this?
Well, if we can accumulate a large collection of these relational triples,
that amounts to a database.
Uh, a database of facts about the real world.
Um, building a knowledge base manually is slow and expensive and
time-consuming, but the web and
other large collections of documents are filled with these kinds of facts already.
So if we can figure out a way to extract
these facts automatically from natural language text,
we can greatly accelerate the creation of knowledge bases.
So why would you care about that?
Why do we want to have these large knowledge- knowledge bases?
It turns out that,
um, they're incredibly useful.
First, because a huge proportion of human knowledge can be expressed in this form.
Um, and second, because there are
abundant commercial applications for knowledge bases like this.
Um, for example, my current boss, John Giannandrea, uh,
earlier in his career,
founded a company called Metaweb and was the co-creator of Freebase.
Freebase doesn't actually exist anymore, which is a terrible shame.
But Freebase was a very large community-authored knowledge
base which was essentially a vast collection of relational triples like this.
Um, it was acquired by Google in 2010.
And it became the foundation of
Google's knowledge graph which was used to
power a variety of different knowledge applications
including question answering and also the knowledge panels that you sometimes
see on the right-hand side of Google search results.
Um, it was a really powerful idea.
Uh, a couple of years later in I think 2014,
Microsoft introduced its own version of a knowledge graph,
theirs was called Satori but was fundamentally the same thing.
Um, it also matters for Apple.
Uh, in 2017, Apple acquired a startup called
Lattice, which was co-founded by Chris Re
from the Stanford Computer Science Department. And Lattice's,
uh, business was also to do relation extraction from,
uh, documents from the web, um,
in order to build and extend, uh,
knowledge bases that in this case are used to help Siri better answer,
uh, questions about the world.
Um, so there's, uh,
abundant commercial interest in building knowledge bases like these.
There are also other applications that might be a little bit less obvious.
Um, one is extending WordNet.
Um, WordNet is basically a knowledge base of lexical semantic relations.
So in WordNet, the role of entities is played by words or actually synsets.
And the relations between them are things like
the hypernym relation or the synonym relation or the antonym relation.
So, um, you can formulate the task
of adding new stuff to WordNet as a task that has this form.
There's, um, there's lots of stuff that should be in WordNet but isn't.
So, you know, WordNet doesn't know,
um, about, um, things like,
um, like there's, um,
categories and subcategories and sub-subcategories of video games, right?
And, uh, WordNet doesn't know about all those different categories of video games.
And the fact that this one is a subcategory of that one.
Um, but it would be great to be able to automatically
extend WordNet to have that kind of knowledge.
And so I mentioned that this technique of
distant supervision was introduced by Rion Snow in 2005.
That was actually the application that he applied it to,
was automatically inferring relationships between
word senses by extracting them from natural language text.
Um, there are other applications for example in biology.
Every year there are thousands and thousands of biology research papers published,
some of them have to do with, uh,
gene expression and gene regulation networks.
If we can apply these techniques to biology research papers, maybe we can run, uh,
relation extraction, uh, focused on
genes and gene activations over these thousands of research papers.
And extract from them lots of relationships between different genes
and whether they have activating or suppressing effects on other genes.
And by doing that, build a big graph of relationships between
all these different genes and then somehow that's- I'm not a biologist,
so I don't know what you do with that from that point.
But- but that can be really useful, uh,
to biologists who are trying to understand
these complex networks of relationships between different genes.
So that gives you a sense of some of
the applications and why you might be interested in this task.
Okay. So how do we actually do this? Let's return,
let's come back to these examples here.
We want to learn some facts about
Elon Musk and we want to extract them from natural language.
How can we actually do that?
One of the techniques that might- might come to mind first is
to just write down some patterns that express the relations that we're interested in.
So for example, um,
a pattern like X is the founder of Y that seems to express the founder relation.
So maybe we can just search a large corpus for occurrences of that pattern
and extract the XY pairs and add those to our knowledge base.
So we might find, um,
in a large corpus, a sentence that says Elon Musk is
the founder of SpaceX and we're like, "Okay, great."
I can add a new triple to my knowledge base,
that seems very straightforward.
Um, and in fact,
this approach was the dominant approach, sort of the dominant paradigm, for
relation extraction for many years, like during the '60s and '70s,
so in the very early days of natural language understanding.
People tried to be slightly more sophisticated by, instead of defining, uh,
an exact string and looking for exact matches on that string,
defining patterns that could be like regular expressions to
accommodate some minor variation in things like verb tense and so on.
Um, but nevertheless it was the basic approach.
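The pattern-based approach can be sketched in a few lines; the regular expression below is a hypothetical example of mine, with a little flexibility for tense built in.

```python
import re

# Hypothetical pattern for the founder relation, allowing "is"/"was".
PATTERN = re.compile(
    r"(?P<x>[A-Z][\w ]+?) (?:is|was) the founder of (?P<y>[A-Z]\w+)")

def extract_founders(text):
    """Return (X, Y) pairs matching 'X is/was the founder of Y'."""
    return [(m.group("x"), m.group("y")) for m in PATTERN.finditer(text)]

hits = extract_founders("Elon Musk is the founder of SpaceX.")
# A paraphrase of the same fact slips right past the pattern,
# which is exactly the recall problem described below:
misses = extract_founders("Elon Musk, co-founder of PayPal, established SpaceX.")
```

The first call extracts ("Elon Musk", "SpaceX"); the second finds nothing, even though the sentence expresses the same fact.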
But this doesn't get us very far.
And the fundamental reason it doesn't get us very far is
that language is incredibly varied.
There's just a million different ways to say the same thing.
So look at examples like these, um,
you may also be thinking of Elon Musk,
parentheses, founder of SpaceX who started PayPal.
Interesting fact, Elon Musk,
co-founder of PayPal went on to establish SpaceX.
Or, if Space Exploration, parentheses, SpaceX,
founded by PayPal pioneer Elon Musk succeeds, blah blah blah.
So each of these sentences expresses the fact that Elon Musk is the founder of SpaceX.
But none of these match the pattern that we proposed X is the founder of
Y nor any other pattern that would be obvious at the outset, right?
You wouldn't have anticipated any of these exact patterns.
And even if you imagine that you like, um,
create a little bit of flexibility by making
it a regular expression instead of an exact match,
there's no way you would've anticipated any of these patterns.
Now, once you see these examples,
you could say, okay,
well I'm going to create new patterns.
I'm going to add to my pattern list by having a pattern that captures, you know,
X parentheses founder of
Y and maybe that will match some more examples somewhere else in my corpus.
And then I can add another pattern which is X,
co-founder of PayPal went on to establish Y or maybe I
should actually replace the PayPal with a star so that I get some more generalization.
But like these things aren't really going to help at all,
these patterns are still really specific.
They're not going to match that many, if any,
other examples in your corpus.
They're not going to generalize well to new examples.
And so the result of adopting this approach is that even if you
manually engineer, write down lots of patterns, and you try
to make them general by using regular expressions,
you're going to wind up with a system which might have
high precision but is inescapably gonna have very low recall.
There's gonna be lots and lots of sentences which
express a relation which
don't match any of your patterns and you're just gonna miss those.
You're not gonna have the opportunity to do those extractions.
So that strongly suggests embracing machine learning,
and so this is what happened in the- in the 90s and the 2000s.
There was the so-called statistical revolution that came to NLP,
and people said, "Hey,
let's use data and machine learning,
ah, to learn the things that we wanna learn."
Um, and in addition, um,
let's use a feature representation that will allow
much better generalization from the examples that we have.
Um, so to do this,
we're going to need lots and lots of supervised data to train our model.
So what we would do is,
go through a corpus and every time there's a sentence that has two entities in it,
we would manually annotate it with the relation, if any,
that holds between the two entities
or if there is no particular relation then we just say, "No relation".
So, we go through and we'd find this sentence and we'd say,
"Elon Musk" and "SpaceX" and we'd annotate that with something that says,
"Founder," and we'd annotate
the second sentence and the third sentence with something that says
"Founder". Um, there might also be sentences in our corpus like this one,
billionaire entrepreneur Elon Musk announced, blah blah blah.
I'll let you read it. It's got both Elon Musk and SpaceX,
but it doesn't actually express the founder relation.
It's not saying that Elon is the founder,
and so we'd manually annotate this example with a label that says "No relation".
So now, we've got a big corpus where all pairs of
entity mentions are annotated with a relation that holds between them, or no relation.
And we can train a classifier
if we can define a feature representation,
ah, for each of those pairs of entity mentions.
Maybe it's just the bag of words in the sentence,
and then we can train a classifier using techniques very
much like what we've been exploring for sentiment analysis, and this works.
It totally works and it was a huge step forward for relation extraction.
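Here's a minimal sketch of that supervised setup, with invented training sentences and a tiny Naive Bayes classifier standing in for whatever model you'd actually use over bag-of-words features.

```python
import math
from collections import Counter, defaultdict

def train_nb(sentences, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for sent, lab in zip(sentences, labels):
        toks = sent.lower().split()
        word_counts[lab].update(toks)
        vocab.update(toks)
    return class_counts, word_counts, vocab

def predict_nb(model, sentence):
    """Pick the label maximizing log P(label) + sum log P(token|label),
    with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for lab, n in class_counts.items():
        lp = math.log(n / total)
        denom = sum(word_counts[lab].values()) + len(vocab)
        for tok in sentence.lower().split():
            lp += math.log((word_counts[lab][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

# Invented training data: sentence-level relation labels.
train = [
    ("Elon Musk is the founder of SpaceX", "founder"),
    ("Elon Musk went on to establish SpaceX", "founder"),
    ("Elon Musk announced that SpaceX would launch a rocket", "no_relation"),
]
model = train_nb([s for s, _ in train], [l for _, l in train])
pred = predict_nb(model, "Larry Page is the founder of Google")
```

The classifier generalizes beyond the exact training strings because it scores individual words like "founder", which is exactly the advantage over hand-written patterns.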
It allowed us to build relation extraction systems that got
much better performance than the hand-built patterns that came before,
and particularly, much better recall than what came before.
But it has a big shortcoming as well, which is that it relies on
having this manually annotated data from which to learn.
You have to go through a corpus and manually label everything,
and that manual annotation, again,
is slow and expensive and it doesn't scale.
And in fact, when this was the dominant paradigm
in relation extraction during the 90s and 2000s,
the largest data sets that were constructed for
relation extraction had on the order of thousands of examples.
And today, a dataset with only thousands of examples seems kinda small.
And it's certainly not enough to,
um, apply sophisticated,
ah, modern methods to.
So the big innovation that I wanna, um,
focus your attention on and get you excited about is this idea of distant supervision.
So here's the big idea.
This is the big conceptual breakthrough.
We're gonna sidestep the need to manually annotate entity mentions in
a sentence by using an external resource as a source of truth.
So in particular, we're gonna use relational triples from an existing knowledge base,
ah, to automatically identify extraction examples in a big corpus.
Let's say for example, we already have a knowledge base that contains
this relational triple, founders, SpaceX, Elon Musk.
What we do, is we take that pair of entities "SpaceX" and "Elon Musk",
we search through a large corpus for any sentence that contains those two entities,
and then we make the unreliable assumption
that that sentence expresses the founder relation,
and we effectively label each, ah,
entity pair, each example in a sentence with that relation label.
So it's a way of using this external data source, the knowledge base,
to generate labels in the corpus rather
than having to manually label all of the examples in the corpus. Yeah.
Is it possible that we have situations where we have both founders,
[NOISE] SpaceX, Elon Musk,
and, like, likes, SpaceX, Elon Musk?
So when we're looking through a corpus, are we going
to assign more than one label to a sentence?
Absolutely, and we'll come back to that question in a moment because it- it goes to,
how do you formulate the classification problem?
Is it a multi-class classification problem
where you have to choose just one relation out of a bunch of relations,
or is it a multi-label classification problem where you
can assign as many labels as you want? Yes.
So can you only do this on something where you've
already built up a knowledge base [inaudible]?
Yeah. One of the limitations of this approach is that because it
starts with examples of the relation that you care about,
um, you can't use it to build a knowledge base for a completely new relation,
or add a new relation to a knowledge base.
If I wanna learn to do extractions for the founders relation,
I need to start with some examples of the founders relation.
Um, if I don't have any examples yet,
I'll need to use a different strategy. Yep.
Bootstrapping?
It is, ah, um,
say a little bit more, bootstrapping is an overloaded term.
What I mean is when you do that,
you have, like, seed pairs, and then you check over the dataset to see different relationships that use those pairs, and then use that to try to find the relation.
It is basically the same idea.
If you read, you might have- you might be referring to the reading from
Jurafsky and Martin where they talk about bootstrapping?
Yes.
Yeah, so it's basically a way of
formalizing that idea and making it a little bit more rigorous.
The way the bootstrapping idea is presented in Jurafsky and Martin,
the seed examples, the seed tuples are just chosen by hand.
You just kind of think about it.
What are some examples of the founder relation?
And you come up with a very small number like maybe a dozen,
and start the process from there.
The other difference is that, um,
the way bootstrapping is presented by Jurafsky and Martin,
it can be, um,
a repeated cycle where you use the seed tuples to get some extractions from your,
ah, or to identify some examples in your corpus which may help you identify new patterns.
Those patterns in turn can help you identify new tuples,
and the process can repeat.
Um, with distant supervision we're not repeating that process.
We start with a much larger set of seed tuples.
Hopefully, you know,
our knowledge base is already kind of
large and our goal is to extend the knowledge base,
so we're starting with lots of seed tuples, and,
ah, we're not sort of going around that cycle repeatedly.
We're just doing it once.
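The distant supervision labeling step described above can be sketched like this; entity mentions are matched as plain strings for simplicity, where real systems would match resolved entity IDs.

```python
# One KB triple: (relation, entity1, entity2).
kb = [("founders", "SpaceX", "Elon Musk")]

corpus = [
    "Elon Musk is the founder of SpaceX.",
    "Billionaire entrepreneur Elon Musk announced that SpaceX would launch.",
    "Tesla reported record deliveries this quarter.",
]

def distant_label(kb, corpus):
    """Label every sentence containing both entities with the relation."""
    examples = []
    for rel, e1, e2 in kb:
        for sent in corpus:
            if e1 in sent and e2 in sent:
                # The unreliable assumption: this sentence expresses rel.
                examples.append((sent, rel))
    return examples

examples = distant_label(kb, corpus)
# Both SpaceX/Elon Musk sentences get labeled "founders", including the
# announcement sentence that doesn't actually express it; that is the
# label noise discussed later in the lecture.
```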
Other questions before I go on?
Somewhat related to an earlier question,
but say, for example, like that sentence with PayPal, could it
extract that Elon Musk is both the
founder of SpaceX and the co-founder of PayPal?
Absolutely. So the way the problem is usually set up,
every pair of entities is a candidate for relation extraction,
and so if you have three entities in a sentence,
there are three pairs and you can consider each one of those three pairs.
Um, and, you know, if there are more entities there you have
even more potential pairs and each one can be considered. Yeah.
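Candidate generation over the entity mentions in a sentence is just enumerating pairs, along these lines:

```python
from itertools import combinations, permutations

# Three resolved entity mentions in one sentence.
entities = ["Elon_Musk", "SpaceX", "PayPal"]

# Every unordered pair is a candidate for some relation...
unordered = list(combinations(entities, 2))
# ...and if relations are directional, both orders are candidates.
ordered = list(permutations(entities, 2))
```

With three entities you get three unordered pairs (six directed candidates), and the counts grow quadratically as more entities appear in a sentence.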
So using this external knowledge base as a source of supervision,
that is the central idea of distant supervision and it's a really powerful idea.
Um, and one of the reasons it was
such a game changer for relation extraction about 10 years ago,
is that it allowed us to suddenly use
100 times as much data as had ever been used before.
I mentioned that the largest datasets that were
fully supervised had thousands of examples.
But at one stroke,
this idea allowed us to, uh,
allowed Mintz et al.,
the authors of the 2009 paper to use 100 times as much data.
By using 100 times as much data,
they were able to define a feature representation that had
much more precise, sort of
specific, pinpoint features, and they actually had enough data to,
um, overcome the problem of sparsity for those very specific, precise features,
and that enabled them to get much better power from
the model that they were building.
And in fact, like, at that time,
they were able to achieve a 100x gain.
I mean, there's nothing in principle that limits even larger scaling up.
Um, they were constrained by the computational resources available at that time.
Today, we could probably easily go to 1,000 times or 10,000
times as much data as people were using for relation extraction,
uh, in the 2000s.
So it's been a very successful idea.
It has two limitations, or two,
um, [NOISE] yeah, two limitations.
One is that it relies on making this assumption. So,
um, I'm starting from this triple in my knowledge base: founders, SpaceX, Elon Musk.
I go to the corpus and I find every sentence where those two entities co-occur.
I'm making the assumption that every sentence where those two entities co-occur,
actually expresses the founder relation.
But we know that's not true.
We know that's an unreliable assumption, and in fact,
I just showed an example of a sentence that doesn't express the founder relation.
The distant supervision approach is gonna label that as an example that
does express the founder relation, that's wrong.
So by doing this,
I'm essentially injecting noise into my dataset, and that creates, um,
a challenge for the learning algorithm.
It's harder to learn when you have noisy labels.
That disadvantage is more than outweighed by the benefit of having vastly more data.
And so even though your learner has this additional challenge,
it's still able to learn much better models.
Um, the other limitation of this idea is the one that was already mentioned,
it can only be used to extend an existing knowledge base or an existing relation.
Uh, but because it relies on existing triples from the database,
from the knowledge base, it can't be used to create a new one from whole cloth.
Okay. So that's the biggest idea.
Um, if you only got that from today's session,
um, I think you would have already gotten something valuable.
What I wanna do with, uh,
the rest of the time is to walk you through the rest of this notebook, um,
to help prepare you for looking at Homework 3 and Bake-off 3,
which are gonna be due next week.
So we have a little bit of setup stuff.
We're gonna import some useful classes and libraries,
um, and some data.
So the data directory for the class includes a sub-directory called rel_ext_data.
And there are two things in there:
there's a corpus and there's a KB, and I'm gonna say a little bit more,
uh, about each of those.
Um, if you have sort of the default setup,
um, then this code should work for you out of the box.
If you put the data directory somewhere else,
you may need to fiddle with the line that tells it where to look for the data.
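A hypothetical version of that setup, if you need to point it somewhere else; the variable names and file layout here are illustrative, not the notebook's actual code.

```python
import os

# Illustrative layout: a data directory with the rel_ext_data
# subdirectory holding the corpus and the KB. Edit data_home if your
# copy of the data lives elsewhere.
data_home = os.path.join("data", "rel_ext_data")
corpus_path = os.path.join(data_home, "corpus")  # filename is illustrative
kb_path = os.path.join(data_home, "kb")          # filename is illustrative
```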
Okay. So two data assets to talk about;
the corpus and the knowledge base.
Um, as usual, whenever we're doing NLU we start with a corpus,
a large collection of examples of language, and in this case,
there are some specific things that we need from our corpus.
We need lots of examples that contain two entities.
That's the whole idea.
Um, and it's gonna be really useful if those entities
already have what are called entity resolutions.
So an entity resolution basically takes an entity mention like
a string in a sentence that refers to an entity and it maps it into,
um, a unique unambiguous identifier.
Um, and that's gonna be really useful because we're gonna wanna
connect the corpus to the knowledge base,
which is gonna use the same unique and unambiguous identifiers.
Um, so, um, making the identifiers unique and unambiguous kinda solves two problems.
One is the problem of polysemy;
polysemy is when one word or one phrase can mean two different things.
So the example here is New York.
New York can mean New York State or New York City.
If I've done entity resolution correctly,
those will map into two different unique identifiers.
New York City will have one identifier,
New York State will have a different identifier,
and hopefully something in the context around it,
uh, will tell me which one of those it should map to.
So that's polysemy.
The other problem that it helps to solve is synonymy,
when you can have two different expressions that refer to the same entity.
So New York City and the Big Apple both refer to New York City.
Hopefully, those two different expressions will both get mapped to
the same unique identifier when we do entity resolution.
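As a toy illustration of what resolution buys us (the identifiers and the "context" keys here are made up, not any real system's output):

```python
# Each (surface string, disambiguating context) pair maps to a unique ID.
resolutions = {
    ("New York", "city context"): "New_York_City",
    ("New York", "state context"): "New_York_State",
    ("the Big Apple", "city context"): "New_York_City",
}

# Polysemy handled: one surface string, two different IDs.
polysemy_resolved = (resolutions[("New York", "city context")]
                     != resolutions[("New York", "state context")])
# Synonymy handled: two surface strings, one shared ID.
synonymy_resolved = (resolutions[("the Big Apple", "city context")]
                     == resolutions[("New York", "city context")])
```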
Um, entity resolution is a whole topic unto itself.
There are people, you know, who
work on developing models for doing entity resolution properly.
That's a topic that we could unpack in this class but we're not gonna do that.
We're gonna kinda treat it as
a black box and assume that it's already been done correctly.
And for this project,
what we're gonna
use for a corpus is, uh,
a bunch of examples that are derived from this Wikilinks corpus.
Wikilinks was, uh, the result of a collaboration in 2013
between Google and some researchers at UMass and um,
it's drawn from a
really large snapshot of the Internet,
um, 10 million web pages altogether.
Google has a terrific entity resolution module in-house, and they just ran
this entity resolution system over
millions of web pages and basically recorded the results.
So the entity resolutions that appear in the Wikilinks, um,
corpus are not the result of human annotation.
They're the result of, uh,
a model that's just running over webpages.
And if you wanted to,
you could build an entity resolution component from
scratch and then you don't need any annotations on your data at all.
You just start from, uh,
arbitrary text from the web.
Uh, but to make our lives a little bit easier for ourselves,
we're gonna leverage the output of
this entity resolution system from Google through, through, uh, Wikilinks.
[NOISE] Um, there's something else we want,
which is that we wanna have the context around the two entity mentions.
So we'd like to have the text before the first mention,
the text between the first mention and
the second mention and the text after the second mention
because all of that context is potentially predictive of what the right relation is.
If you look back at, um,
some of the examples that I gave, um,
like, in the first example,
Elon Musk, founder of SpaceX.
Well, there's something in the middle here, the word founder,
that's strongly predictive of the right relation.
Um, [NOISE] in the second example,
you've got this word established in the middle,
which is strongly predictive.
Um, [NOISE] oh, I thought I had an example where, at least intuitively,
the most predictive thing is not in
the middle but outside. Maybe that example comes later.
Anyway, we wanna have that context around the two entity mentions, and, um,
the UMass version of the Wikilinks dataset actually has that context.
So to help set things up to be convenient for the purposes of this class,
we massaged the data;
we filtered the data to make it small enough to easily load into memory
and to work with.
Uh, but what we wind up with is
a compact corpus that's gonna work really well for, for our purposes.
Um, there's, uh, a corpus class that provides some easy access methods
to the contents of the corpus.
So I wanna give you a peek at some of that,
and at the same time do some data exploration.
So to load the corpus into memory,
we just use this call here,
and we can start to poke around and see what we're working with.
Um, so the corpus has examples.
Every example is two entity mentions in the context of a sentence.
There are 332,000 examples.
So that's great. That's plenty of examples,
but still small enough to easily work with in memory.
Um, and let's take a look at what an example looks like.
So we'll just look at the first example here.
Um, this is horizontally scrolling so it's a little hard to read.
But, um, each example is basically a tuple, and it has, if I remember correctly, 12 fields.
So the first two fields are called entity_1 and entity_2.
So in this example,
entity_1 is New_Mexico and entity_2 is Arizona.
Um, these, these are the unique identifiers for entities that I mentioned.
Um, the fact that New_Mexico has an underscore
in it kinda suggests that this is not just an English expression,
this is actually a unique identifier in some kind of entity space.
And in particular, these are so-called Wiki IDs.
So this is basically the last part of the,
the URL for the Wikipedia page corresponding to this thing.
Um, and serves as a unique identifier for New Mexico and for Arizona in this case.
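Since a Wiki ID is just the last segment of a Wikipedia URL, extracting one is a one-liner. A small sketch (the helper name is my own, not part of the course code):

```python
# Sketch: a Wiki ID is the last path segment of a Wikipedia URL,
# e.g. https://en.wikipedia.org/wiki/New_Mexico -> New_Mexico.
def wiki_id_from_url(url):
    return url.rstrip("/").rsplit("/", 1)[-1]

print(wiki_id_from_url("https://en.wikipedia.org/wiki/New_Mexico"))
print(wiki_id_from_url("https://en.wikipedia.org/wiki/Arizona"))
```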
Um, then the next five fields are fragments of the text from this example.
We have a field called left,
which is all the text to the left of the first entity mention.
Then we have mention_1,
which is the English text.
So here it's New Mexico without the underscore
because this is the actual fragment of text.
Then we have the middle,
which is everything between the two mentions.
In this case, it's just and.
Then we have mention_2, Arizona,
and then we have right,
which is everything to the right of the second mention.
And so if you paste those together,
you can kinda get the whole sentence.
Um, "The procedure still used in parts of New Mexico
and Arizona is to build a fire inside the horno," and so forth.
And then, the last five fields are similar to the previous five fields,
except that, um, they have part of speech annotations.
So every word has a slash after it,
and following the slash is something in all caps TO, DT,
JJ, NNS, DT, NN,
and so forth, that indicates the part of speech of that particular word.
And again, the parts of speech have been generated by an automated system, by a model.
Um, you don't need to worry too much about the meaning of the different parts of speech.
They come from the Penn Treebank annotation scheme.
There's about 40 or 50 different categories.
You can sort of figure out what they mean,
NN means noun, NNS means plural noun, things like that.
But you probably don't need to worry too much about what the specific things mean.
The reason they're here is that, uh,
they could be useful in building feature representations for your model.
And you may find- in fact,
I think one of the homework questions asks you to try to
leverage the parts-of-speech information to get
better generalization capability.
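The word/TAG format described above is easy to split apart if you want to use the tags as features. A minimal sketch (the helper name is my own):

```python
# Sketch: split a "word/POS" annotated fragment into (word, tag) pairs.
# rsplit on the last "/" so words containing slashes still parse.
def parse_pos(tagged):
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]

print(parse_pos("to/TO build/VB a/DT fire/NN"))
```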
Oh, and here's a code fragment which actually pastes
left, mention_1, middle, mention_2,
and right all together to reconstruct the original sentence.
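The idea of reconstructing the sentence from the example fields can be sketched like this. The 12-field layout follows the description in the lecture, but the exact field names of the actual corpus class are an assumption, and the POS fields are left empty here for brevity:

```python
from collections import namedtuple

# Sketch of the example structure described above: two entity IDs,
# five text fragments, and five POS-annotated fragments (12 fields).
# Field names are assumptions based on the lecture.
Example = namedtuple("Example",
    "entity_1 entity_2 left mention_1 middle mention_2 right "
    "left_POS mention_1_POS middle_POS mention_2_POS right_POS")

ex = Example(
    entity_1="New_Mexico", entity_2="Arizona",
    left="The procedure still used in parts of",
    mention_1="New Mexico", middle="and", mention_2="Arizona",
    right="is to build a fire inside the horno.",
    left_POS="", mention_1_POS="", middle_POS="",
    mention_2_POS="", right_POS="")

# Paste the five text fields back together to reconstruct the sentence.
sentence = " ".join([ex.left, ex.mention_1, ex.middle, ex.mention_2, ex.right])
print(sentence)
```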
Let's do a little bit more data exploration,
just to get a better sense of what's in this corpus.
And by the way, this is good methodological practice.
Whenever you start working with a new data resource,
do some poking around,
and figure out what's in there.
Try to- try to look at some summary statistics of what the data looks like.
Uh, we wrote a little bit of code to extract,
um, all of the entity mentions,
and then count them up.
And so, here's a list of all the entities by frequency.
Um, there's a lot of geography things.
In fact, looking at this list,
you might think that it was only geography.
That's not actually the case,
but it certainly looks that way right here.
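The mention-counting exploration just described can be sketched with a `Counter`. The example structure here is a simplification; the real corpus class stores examples differently:

```python
from collections import Counter

# Sketch of counting entity mentions by frequency. Each example
# contributes its two entity IDs to the tally. The dict-based example
# structure is an assumption for illustration.
def count_entities(examples):
    counts = Counter()
    for ex in examples:
        counts[ex["entity_1"]] += 1
        counts[ex["entity_2"]] += 1
    return counts

toy = [{"entity_1": "France", "entity_2": "Spain"},
       {"entity_1": "France", "entity_2": "Germany"}]
counts = count_entities(toy)
print(counts.most_common(2))
```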
Um, one of the benefits of having a corpus class
is that we can do some indexing and make it
easy to retrieve examples having specific characteristics.
So in particular, there's a method called show_examples_for_pair that makes it
really easy to look up all the examples that contain a particular pair of entities.
So here, we looked up, uh,
the examples that contain Elon Musk and Tesla Motors.
And the output tells us there's five examples like that,
and it prints out, uh,
the first of those five examples.
But actually, direction matters here.
So this isn't necessarily all of the examples that contain Elon Musk and Tesla,
it's just the examples that have Elon Musk as entity one and Tesla as entity two.
It could be the other way around.
So let's also look at
show_examples_for_pair with Tesla Motors and Elon Musk.
And it turns out that there's two more examples that have it the other way around.
So we need to be conscious of direction.
If we're looking for examples that have Elon Musk and Tesla Motors,
we need to check in both directions,
um, when we're making those calls.
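The "check both directions" point can be sketched as follows. The `(entity_1, entity_2)`-keyed index here is a stand-in for whatever the corpus class does internally:

```python
# Sketch: direction matters, so collect examples with the pair in
# either argument order. The index structure is an assumption; the
# real class exposes show_examples_for_pair instead.
def examples_for_pair_both_ways(index, e1, e2):
    return index.get((e1, e2), []) + index.get((e2, e1), [])

index = {("Elon_Musk", "Tesla_Motors"): ["ex1", "ex2", "ex3", "ex4", "ex5"],
         ("Tesla_Motors", "Elon_Musk"): ["ex6", "ex7"]}
both = examples_for_pair_both_ways(index, "Elon_Musk", "Tesla_Motors")
print(len(both))  # 5 in one direction plus 2 in the other
```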
This corpus does have some flaws.
One flaw that you'll discover as you get
further into things and start poking around more,
is that there are lots of examples that are near duplicates of each other.
So not exactly the same,
but with minor variations,
like only a few words different, or something like that.
This comes from the Wikilinks dataset,
and I think this is a consequence of the sampling strategy that was
used to choose the 10 million web pages from which the dataset was constructed.
I don't know exactly how they did that,
but there's lots of near duplicate documents on the web,
pages that differ from each other only by a little bit.
And I think they didn't try to weed those out.
And so, you will see lots of those near duplicates,
and they will cause some funny-looking results that we'll run into later.
Um, it's a flaw, but it's not a showstopper for this dataset.
Um, the other observation to make here,
is that this corpus doesn't contain any information about relations.
Unlike the supervised paradigm,
there's no explicit indication in
this corpus about what relations hold between the entities.
Uh, there's nothing in this corpus that tells us- nothing except
the English words in this corpus that tells us that Elon Musk is the founder of Tesla.
So to get that,
we're going to have to connect the corpus to the knowledge base.
So we'll turn our attention to the KB now.
Before I go on, any- any more questions about the corpus? Yeah.
[inaudible] like a record for every single [inaudible] factory and space and things like that.
Say if we did like space in two.
Would we get a list of words that have both space and two
or are there only certain entities that it's marked out?
Only entities. And in this case,
it was entities that were identified as entities in the Wikilinks corpus,
and therefore entities that were recognized by
the entity resolution system that Google used to generate this data.
Um, but that's a possible variation.
And in fact, earlier I mentioned the work by Snow that was used to extend WordNet,
and they had a different notion of like what were the units of extraction.
Um, and instead of entities, they were looking much more
generally at nouns, or noun phrases,
and trying to learn relationships between noun phrases.
Okay. Let's turn to the knowledge base.
Now, the knowledge base that we're gonna use is a fragment of Freebase.
I mentioned Freebase earlier.
It's this large,
sort of community-constructed knowledge base of
relationships between entities.
It was very sadly shut down in 2016,
but you can see- if you poke around on the Internet,
you can still find data dumps from Freebase.
And the- the knowledge base that we're gonna include
here was derived from a Freebase data dump.
So in Freebase, you have these relational triples that we've already looked at.
Each one consists of a relation,
a subject, and an object.
Uh, subject and object here have no connection to
subject and object in a syntactic sense,
in the sense of English syntax.
It's just kind of an arbitrary designation for the first argument to the relation,
and the second argument to the relation.
The relation, the thing in
the first position is one of a small number of pre-defined relations,
like place of birth or has spouse.
And the entities that appear in the second and third position,
again, are Wiki IDs.
So it's like the last segment of a Wikipedia URL,
and it constitutes, uh,
a unique identifier that we can connect to the identifiers used in the corpus.
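The triple structure just described can be sketched as a small namedtuple. The field names `rel`, `sbj`, and `obj` follow the relation/subject/object terminology from the lecture, but are my own labels, not necessarily the course code's:

```python
from collections import namedtuple

# Sketch of a KB triple: a relation plus two Wiki-ID arguments.
# Field names are assumptions based on the relation/subject/object
# terminology in the lecture.
KBTriple = namedtuple("KBTriple", ["rel", "sbj", "obj"])

t = KBTriple(rel="place_of_birth", sbj="Elon_Musk", obj="Pretoria")
print(t.rel, t.sbj, t.obj)
```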
Um, just like we did for the corpus, um,
we should do some data exploration and see what this data actually looks like.
So here's a line of code to load the KB into memory,
and then we can start poking around.
So there's 46,000 KB triples here.
That's great. That seems like a fair amount to work with.
Um, there's 16 relations in this knowledge base.
The full Freebase has way more than this.
So Freebase has thousands of relations,
millions of entities and literally billions of relational triples.
What we're working with here is a- is a fairly small slice of that whole knowledge base.
Um, then you might want to know how big is each relation.
How many triples does each relation have?
So we've counted them up,
and you see the results here.
It looks like the contains relation is the biggest one.
It's got 19,000 triples.
Um, and some of them are quite small.
So the capital relation has only 500 triples.
So they vary quite a bit in size.
Um, here's some code that prints one example from each relation,
just so you get a sense of what's in here.
So adjoins, France, Spain, pretty intuitive.
Author, Uncle_Silas, Sheridan_Le_Fanu,
Capital, Panama, Panama_City.
I think these are pretty intuitive for the most part.
Um, one thing to point out is that
some of these relations are intuitively symmetric relations.
So I presume the adjoins relation is symmetric.
And if adjoins, France, Spain is in the knowledge base, then probably,
hopefully, adjoins, Spain, France is also in the knowledge base.
However, there are no guarantees.
Um, in theory, it's symmetric,
but nothing actually guarantees that
the inverse relation is actually in the knowledge base.
Um, other relations are intuitively asymmetric.
So Author, Uncle_Silas,
Sheridan_Le_Fanu, that's fun to say.
Um, it's a bit arbitrary that the relation was defined in that direction.
There is no reason why it couldn't have been the other way around.
So it could have been instead,
author of Sheridan_Le_Fanu, Uncle_Silas.
That would have been fine.
The representational choice that was made here was to put the
author first and the work second.
Um, and by the way, that also points out that relations
frequently or typically have types.
So, uh, a relation will have
a type for its first argument and a type for its second argument.
Um, just like in the corpus,
there's a convenience method that makes it easy to look up the KB triples that,
uh, contain a particular pair of entities.
So we can look up all the,
um, triples that contain France and Germany.
Uh, it looks like there's just one.
There is a relation adjoins, France, Germany.
Um, and we can also look up,
uh, Germany and France.
So it turns out, yep,
the inverse relation is in there. That's great.
Um.
Uh, we can look up the relations between Tesla Motors and Elon Musk,
and we get the result that the founders relation holds between Tesla and Elon Musk.
That's great. I kind of said some of this stuff already.
Oh, and then we can look up things the other way around.
We can look at the relation between Elon Musk and Tesla Motors and
there is a triple, worked at, Elon_Musk, Tesla_Motors.
So that kinda makes sense.
Um, and then this point addresses a question that came up earlier.
There can be more than one relation that holds between a given pair of entities.
So, uh, Cleopatra and Ptolemy blah, blah, blah,
have both the- has sibling relation and has
spouse relation which seems you know unusual but,
um, I guess things were different back then.
Um, and then we can look at the distribution of entities in the KB.
So here we go through all the triples,
count up all the entities and list them in order of frequency.
Um, and again you see that it's dominated by entities that are related to geography.
Um, and in fact this list looks like it's only geography,
but that's not actually the case.
One last observation, um,
there's no guarantee that the KB is complete.
So this knowledge base has founders, SpaceX, Elon_Musk.
It has founders, Tesla, Elon_Musk.
It has worked_at, Elon_Musk, Tesla_Motors.
But it does not have worked_at, Elon_Musk, SpaceX even though he did.
And given the other three you would think that it
should but it just so happens it doesn't.
And in fact, that's the whole reason that we're doing this.
The whole reason that we're doing this is that we want to extend
an existing knowledge base with facts that should be in there but aren't.
If the knowledge base were already complete we wouldn't have any work to do.
So the fact that it's incomplete is no surprise and in fact,
it's the whole motivation for doing this thing.
Okay, so the next question to look at is, um,
how are we going to formulate the prediction problem that we're going to undertake?
And there's really two questions here.
One is, what's the input to the prediction problem?
And the other is, what's the output?
"What's the input" means:
what is the unit that we're going to try to make a prediction about?
The classic formulation of the relation extraction problem was to say,
the input is a pair of entity mentions in a specific context in
a specific sentence and for a pair of entity mentions in context,
we're going to try to predict the relation that holds between them.
There's an alternative which is to say,
the unit of prediction,
the input to the prediction problem is instead just a pair of entities.
Not a pair of entity mentions but just a pair of entities
abstracted away from any specific context.
So that's one choice we need to make.
The other choice is what's the output of
the prediction and this ties back to the question that came up earlier.
Are we trying to predict a single relation that holds between the two entities?
Or can we predict multiple relations to hold between the two entities?
Like the example of has_sibling and has_spouse
both holding between one pair of people.
Um, it's worth noting that the classical approach to
relation extraction chose the first answer to both of these questions.
But today, the path that we're going to pursue chooses
the second answer to both of
these questions and I'll spell that out a little bit more as we go.
So we're first going to talk about the first one.
How do we- how do we formulate the input to this prediction problem?
And that really comes down to the question of how we're going to
connect the corpus and the KB.
Um, there's kind of two possibilities.
The way I described it earlier it kind of sounded like we were going to use
the knowledge base as a way to generate labels on entity mentions in context,
and that's similar to the classical approach to relation extraction.
Um, and we could do that.
But actually, we're going to do something else.
Instead, we're gonna define our problem as the problem of
classifying not a pair of entity mentions but a pair of entities.
And we're going to use all of the examples from the corpus where
those two entities co-occur to generate
the feature representation for that classification problem.
So instead of trying to classify a mention of Elon and Tesla in a specific sentence,
we're gonna take all of the sentences where Elon and
Tesla co-occur and use those to generate
a feature representation which will help us make a prediction
about whether Elon and Tesla in abstract,
apart from any specific context,
are related to each other.
That's a really important step.
So let me pause there and let that sink in and make sure that it makes sense.
We're gonna- the input to the prediction problem will be a pair of
entities considered apart from any specific context.
But when we generate a feature representation,
we're going to use all of the sentences in the corpus where
those two entities co-occur as the source for those features.
Okay. Um, so that means that we need to
use the entity IDs,
the Wiki IDs, as the way to join the corpus and the KB.
And continuing the idea of data exploration,
we can look at how many examples we have for each triple.
So we can go through the triples in our knowledge base.
For each one we look up all the examples in the corpus that contain those two entities.
And we can look, in aggregate or on average over
the knowledge base, at how many examples we have to support each triple.
So this table summarizes that.
Like for adjoins, there are 60,000 examples altogether,
um, but 1,700 triples,
which means there's a lot of examples for every triple.
On average, we have about 34 examples for every triple. That's terrific.
We have lots of evidence,
lots of examples to draw features from to
help us learn an extraction model for this relation.
There are other relations where the number of examples is much smaller.
So some of these have an average number of examples per triple of like one-and-a-half,
and that means we're going to have a lot less evidence in the corpus to draw on.
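The average-examples-per-triple computation just described can be sketched like this. `corpus_lookup` is a hypothetical stand-in for whatever lookup the corpus class provides:

```python
# Sketch: for each relation, count the corpus examples that support
# each of its triples, then average per relation. `corpus_lookup` is a
# hypothetical function returning the examples for an entity pair.
def examples_per_triple(triples, corpus_lookup):
    stats = {}
    for rel, sbj, obj in triples:
        n = len(corpus_lookup(sbj, obj))
        total, count = stats.get(rel, (0, 0))
        stats[rel] = (total + n, count + 1)
    return {rel: total / count for rel, (total, count) in stats.items()}

lookup = lambda a, b: {("France", "Spain"): ["e1", "e2"]}.get((a, b), [])
avg = examples_per_triple([("adjoins", "France", "Spain")], lookup)
print(avg)
```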
Um, there's one more thing we need, um,
to train these models and that is negative examples.
Connecting the KB triples to
examples in the corpus is gonna give us positive examples for training our models,
but you can't train a model from positive examples alone.
You need negative examples as well.
So to get negative examples for each relation,
we're basically gonna use pairs that appear together in the corpus,
so we have some evidence for them;
we have features for them.
But, uh, pairs of entities that are unrelated according to the knowledge base.
That is they don't appear together in any KB triple in the knowledge base.
So these are, uh,
pairs of entities that appear to be unrelated,
but where we have sentences that they co-occur.
Um, and we've written some code that,
um, walks through and finds all of those unrelated pairs.
And here's some examples: William_Randolph_Hearst, The_Cat's_Meow,
Les_McCann, Bobby_Timmons, Shatoy, Ibn_al-Khattab.
I don't even know what most of these things are.
But they appeared together in a sentence somewhere and yet
according to the knowledge base they're not related.
So we're going to use these as the negative examples for our classifier.
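The negative-example mining just described can be sketched like this. The course's own code does the real work; this just illustrates the set logic, using entity pairs mentioned in the lecture:

```python
# Sketch of mining negative examples: entity pairs that co-occur in
# the corpus but appear together in no KB triple, in either order.
def find_unrelated_pairs(corpus_pairs, kb_triples):
    related = set()
    for rel, sbj, obj in kb_triples:
        related.add((sbj, obj))
        related.add((obj, sbj))  # direction-insensitive check
    return [p for p in corpus_pairs if p not in related]

corpus_pairs = [("France", "Spain"), ("Les_McCann", "Bobby_Timmons")]
kb = [("adjoins", "France", "Spain")]
neg = find_unrelated_pairs(corpus_pairs, kb)
print(neg)
```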
Okay so that was all related to the input side of the prediction problem.
On the output side, we have this choice to make about whether we're doing
multi-class classification or multi-label classification.
Um, and the direction we're gonna go is
multi-label classification and part of the motivation for that is
that it's actually quite common in the knowledge base
for a particular pair of entities to belong to more than one relation.
We already saw one example,
the one with Cleopatra,
has_spouse and has_sibling.
But there's a lot of them. Uh, and we can write some code to count them up.
It turns out that there are some overlaps that
are really common and they're kind of intuitive.
So, um, it's really common for a pair of
entities to belong both to is_a and profession, that makes sense.
So it's like Einstein is a physicist,
um, and his profession is Physicist.
Um, capital and contains,
that makes sense as well.
Place_of_birth and place_of_death,
there's no necessary relationship there.
But in practice, it's quite common for people to die in the same place they were born,
so that's not too surprising.
When we go down further, there are some that are surprising, uh,
has_spouse and parents,
um, co-workers at least once.
Oh, this is interesting, has sibling and has spouse,
it turns out Cleopatra's not alone.
There are at least seven other cases where those two overlap.
Uh, there's one case where we have three relations that overlap;
nationality, place_of_birth, and place_of_death.
I think that one makes sense.
Um, but like this one looks like noise,
parents and worked_at,
there's two examples of those overlapping,
and I think that's just some kind of data error.
I don't know what happened there, but I think that's noise.
So we're gonna formulate our problem as multi-label classification.
This means that we can assign multiple relations to a given entity pair.
There's many different ways to do multi-label classification,
but the simplest way to do it is,
uh, what's called the binary relevance method.
And the binary relevance method basically means factoring
your prediction problem into independent binary classification problems.
So in this case,
we have 16 relations, right?
What we're gonna do is just factor it into 16 separate binary classification problems.
For each relation, we're gonna try to predict,
does this entity pair belong to the relation or not?
It's just a binary classification for each of the 16 relations.
Um, there are smarter ways to do multi-label classification because this way of doing it
doesn't take account of the fact that the labels or in this case,
the relations can be correlated with each other.
Um, it just assumes everything is independent,
but it's a very simple way to set things up and it
will work well enough for our purposes.
So one way to think about our prediction problem is: for a candidate KB triple,
for a specific relation, entity one, and entity two,
is it or is it not a valid KB triple?
Should we add it to our KB if it's not already there?
That's the binary classification problem that we're reducing this problem to.
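A toy sketch of the binary relevance idea: one independent yes/no decision per relation, so an entity pair can receive several labels at once. The per-relation "classifiers" here are placeholder threshold rules, not learned models:

```python
# Sketch of binary relevance: factor the multi-label problem into one
# independent binary question per relation. The classifiers below are
# hypothetical hand-written rules standing in for trained models.
def binary_relevance_predict(relations, classifiers, features):
    """Return the set of relations whose binary classifier says yes."""
    return {rel for rel in relations if classifiers[rel](features)}

relations = ["has_spouse", "has_sibling", "adjoins"]
classifiers = {
    "has_spouse": lambda f: f["spouse_cue"] > 0.5,
    "has_sibling": lambda f: f["sibling_cue"] > 0.5,
    "adjoins": lambda f: False,
}
# Multi-label: more than one relation can hold for the same entity pair
# (the Cleopatra case).
preds = binary_relevance_predict(
    relations, classifiers, {"spouse_cue": 0.9, "sibling_cue": 0.8})
print(preds)
```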
Okay, then there is some code to create datasets.
I'm not gonna dwell on this.
A dataset is gonna be a bunch of KB triples along with labels that say
what the correct binary classification for each KB triple is.
Um, the thing to be aware of is that because we're
factoring our prediction problem by relation,
all of these data structures are gonna be organized into maps,
fro- or dictionaries from a relation to some other stuff.
So we're gonna have
these things where it says kbts_by_rel.
That basically means a collection of KB triples that have been grouped by relation.
So the data structure is a dictionary where the key is
a relation name and the value is a list of KBTs.
There will be a list of KBTs for this relation and a list
of KBTs for that relation and so on.
Um, that's this first thing here.
The second thing here is an exactly parallel data structure that has the labels.
For each KBT, it's gonna say whether that thing should have,
uh, a true or a false prediction made for it.
And um, this method here, build_dataset,
is gonna combine positive and negative data.
It's gonna derive the positive data from the KB,
and it's gonna derive the negative data in the way I described
by looking for pairs of entities that co-occur in the corpus,
but don't co-occur in the knowledge base.
It will include both positive and negative data.
There's a parameter that lets you control the sampling rate.
The default sampling rate is 0.1 because we have
about 10 times as much negative data as we do positive data.
So using the sampling rate will give you a roughly balanced dataset.
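The sampling idea can be sketched like this. This is not the course's `build_dataset` itself, just the underlying logic, with a seeded RNG for reproducibility:

```python
import random

# Sketch of negative sampling at rate 0.1: keep roughly one in ten
# negative pairs so positives and negatives come out roughly balanced.
# Hypothetical helper; build_dataset in the course code does more.
def sample_negatives(negative_pairs, sampling_rate=0.1, seed=1):
    rng = random.Random(seed)  # seed makes results reproducible
    return [p for p in negative_pairs if rng.random() < sampling_rate]

kept = sample_negatives(list(range(10000)))
print(len(kept))  # roughly 1,000 of the 10,000 negatives
```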
Um, and it also lets you specify
a random seed so that you can have reproducible results if you want. Yeah?
Um, what do we do with
overloaded words like I'm playing the guitar, someone's playing soccer.
Or like anything of the sort that could be I guess confused.
[NOISE].
Um, I'm not sure if I understand the question.
Are, are you talking about words that you might
use in the feature representation as predictors?
Um, I guess so.
Like I'm just thinking like,
I guess you can tell me whether or not this is an issue.
When I think of sorting these triplets by relation, um,
I can see the word play as being used for either playing
an instrument or playing a sport which are very different things.
Yeah.
And so A, is a problem,
I guess and then B,
if that is a problem how do you,
I guess, counter that?
Um, so I think you're asking about
the words that you would use,
the way you'd construct a feature representation
to try to make a prediction, and whether you need to account for ambiguity.
Do you need to resolve ambiguities like that
one in constructing the feature representation?
Um, that's something interesting you could play with.
I think the typical answer is no.
We don't typically try to do word sense
disambiguation before constructing a feature representation.
Um, but it's possible that that could yield some,
that it could yield some benefits.
Um, I don't know, do you have a-
[inaudible] in the knowledge base,
we would hope that the knowledge base distinguishes those two senses of play.
It might or it might not.
But we hope that it'll work with
some different symbols and then it will go to two different relations.
Now, the data that you've learned on,
the thing that you perform the feature representations from,
it's like complete chaos.
But we would hope that the relationship [inaudible] that.
Are, are you imagining that play is a relation that appears in your, in your KB?
Yeah.
Okay, I missed that. I missed that.
Yeah, the, the knowledge base should if it's,
if it's a well-constructed knowledge base,
it should resolve those ambiguities.
Um, part of what we're aiming for with
the knowledge base is that it's completely unambiguous.
I kind of spoke about this a little bit
earlier in connection with the entities: when we have
an entity ID that shows up in the knowledge base,
we definitely want that to be unambiguous.
We wanna have for New York City,
we wanna have an unambiguous entity ID in our knowledge base.
So there's no question about which New York we're talking about.
The same is true for the relations: we don't
want the relations to be ambiguous in any way.
Okay. Let me keep going.
Um, we wanna talk a little bit about the approach to evaluation.
And it's a really good idea to define the way you're gonna
evaluate your systems before you start building systems.
It's kind of the same idea as test-driven development in software engineering.
So first, you set up a test harness that allows you to measure performance,
and then you can start building your system and iterating on performance.
Having that test harness setup in advance means that you
can sort of measure progress as you go.
And actually, a great first step is to build
a random classifier that only takes five minutes to write but gives you
a baseline
on which you can begin to measure progress.
A good approach to evaluation typically
starts with splitting the data into different pieces.
And in this case, I'm advocating a three-way split into a tiny split,
a train split, and a dev split.
So the tiny split is only 1% of your data.
It's gonna be really small,
but it's also going to be really fast to train and test on the tiny split.
And the merit of having a tiny split is
that you can do your early development on this tiny split,
and it can help you flush out bugs in your code,
just make sure your code is running properly, before you scale
up to training on the full train split and testing on the dev split,
which can take a lot longer to run.
So I think that's a great methodological practice, uh,
that I encourage you to use in other places.
And the train split is gonna be about three-quarters of our data.
The dev split is gonna be about a quarter of our data.
That's a pretty typical setup.
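The tiny/train/dev split just described can be sketched like this. The proportions follow the lecture (1% tiny, 25% dev, the rest train); the function itself is my own illustration, not the course's splitting utility:

```python
import random

# Sketch of a three-way split: tiny (1%) for fast debugging,
# dev (~25%) for evaluation, train (the rest, ~74%) for training.
# Hypothetical helper; proportions follow the lecture.
def three_way_split(items, seed=1):
    rng = random.Random(seed)  # seeded for reproducibility
    items = items[:]
    rng.shuffle(items)
    n = len(items)
    n_tiny = max(1, n // 100)
    n_dev = n // 4
    tiny = items[:n_tiny]
    dev = items[n_tiny:n_tiny + n_dev]
    train = items[n_tiny + n_dev:]
    return tiny, train, dev

tiny, train, dev = three_way_split(list(range(100)))
print(len(tiny), len(train), len(dev))  # 1 74 25
```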
And during development, you'll typically train on
the train split and test on the dev split.
There's also, we've also held out a test split that we're gonna use for the bake-off.
So you don't have access to that yet but it exists.
And we're, you know, keeping it in a,
in a vault deep underground for now.
Um, I'm gonna skip this part.
There's some discussion of how to split;
we want to split both the knowledge base and the corpus,
and there's some discussion of how to do that split
so the two align with each other as much as possible,
but I'm gonna skip that for now.
Um, then we wanna talk about evaluation metrics a little bit.
So we're factoring our problem into 16 binary classification tasks,
one for each relation.
Because it's binary classification,
evaluation metrics are pretty straightforward.
This is, uh, an arena where it's sort of
well understood what the best evaluation metrics are.
Typically, precision and recall are what you look at,
particularly when there's an unbalanced label distribution as there is for this problem.
So we're gonna look at precision and recall.
But it can be really helpful when developing a system
to have a single summary statistic on which to hill climb.
So to combine precision and recall into one statistic,
the most common solution is to use F measure,
which is just the harmonic mean of precision and recall.
Usually, people use F1,
which gives equal weight to precision and recall.
For this application, though,
I think you can make an argument that precision matters more than recall.
The goal here is to identify new KB triples that we can add to our knowledge base.
And we don't wanna be putting garbage into our knowledge base.
It's better to miss something than to put garbage into our knowledge base.
So we wanna give more weight to precision than we do to recall.
And the way to do that is,
um, using the F_0.5 score,
which, a little bit counter-intuitively,
gives twice as much weight to precision as it does to recall.
Um, and scikit-learn makes it easy to calculate that.
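As a quick sketch of that calculation, scikit-learn's `fbeta_score` implements F_β = (1+β²)·P·R / (β²·P + R), so β=0.5 is the precision-weighted version; the toy labels below are invented:

```python
# F_0.5 with scikit-learn: beta=0.5 weights precision twice as much as recall.
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # toy labels, not the real data
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)          # 2 of 3 predictions correct: 2/3
r = recall_score(y_true, y_pred)             # 2 of 4 positives found: 1/2
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # (1.25*p*r)/(0.25*p + r) = 0.625

print(p, r, f05)
```

Note that the F_0.5 score (0.625) lands closer to precision (0.667) than to recall (0.5), which is exactly the asymmetry we want here.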
We also want to have a summary statistic that averages across labels.
So we'll compute precision, recall,
and F_0.5 for each relation separately,
but we wanna have a way to combine it all into one summary statistic,
an average, and the only question is whether to do a micro-average or a macro-average.
We're gonna do a macro-average which gives equal weight to each relation,
because it doesn't really matter how many instances
in this dataset happen to fall into each relation.
So we're gonna treat each relation as equally important.
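A macro-average is just the unweighted mean of the per-relation scores. Here's a rough sketch of that, with invented per-relation toy results (the relation names and labels are made up for illustration):

```python
# Macro-averaging F_0.5 across relations: each relation counts equally,
# no matter how many instances it has.
from sklearn.metrics import fbeta_score

# toy per-relation results: relation -> (y_true, y_pred)
results = {
    "parents":        ([1, 1, 0, 0], [1, 0, 0, 0]),
    "place_of_birth": ([1, 0, 0, 0], [1, 1, 0, 0]),
}

scores = [fbeta_score(yt, yp, beta=0.5) for yt, yp in results.values()]
macro_f05 = sum(scores) / len(scores)  # unweighted mean over relations

print(round(macro_f05, 3))
```

A micro-average would instead pool all the instances before computing the score, letting the big relations dominate; that's exactly what we want to avoid here.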
Um, so there's some code that helps you actually run these evaluations.
Um, we run it here on a random classifier.
And what you see when running on a random classifier is that
precision is bad almost everywhere,
which kind of makes sense.
Uh, but recall is always right around 50%.
This random classifier is just flipping a coin.
It's predicting true half the time.
So it kind of makes sense that if
there actually is a true lab- when the label is actually true,
we have about a 50-50 chance of getting that,
and that's what recall is measuring.
But we're predicting true half the time, and most of the time,
that's not the right label, which is why precision is quite low.
So the macro average precision is 8%,
the macro-average recall is 51%,
the macro average F-score is 10%. It's really low.
It's much closer to the precision than it is to the recall,
which is a characteristic of F-measure.
It tends to be pessimistic, uh,
much closer to the lower of precision or recall.
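As a sanity check on those numbers, here's a quick simulation of a coin-flip classifier on a toy imbalanced dataset; the 8% positive rate is an assumption chosen to roughly mirror the macro-average precision mentioned above:

```python
# Why a coin-flip classifier gets ~50% recall but very low precision
# when the label distribution is unbalanced (toy data: 8% positives).
import random

random.seed(0)
y_true = [1] * 80 + [0] * 920                    # ~8% positive rate
y_pred = [random.randint(0, 1) for _ in y_true]  # flip a coin every time

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = tp / sum(y_true)       # ~0.5: half the true positives are guessed
precision = tp / sum(y_pred)    # ~0.08: roughly the positive base rate

print(round(precision, 2), round(recall, 2))
```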
Um, and then the next thing I do is create a simple baseline model,
which, uh, I'll let you, uh,
look at on your own, but here's the basic idea behind the simple baseline model.
It's not even a learned model.
It's not, uh, it's not a machine-learned model.
Instead, all we do is look through the examples, uh,
tied to each relation, and we find
the most common phrase that occurs between the two entity mentions.
We count up what are the three most common phrases for each relation,
and then we just use exact matches to one of those three phrases as the predictor.
Um, so we find,
for example, that for the,
um, there was a- there was a particularly good one,
was it parents, yeah.
So look at the parents relation.
For the parents relation,
the most common phrases are ", son of",
"and", and "," in the forward direction.
And in the reverse direction,
"and", ",", and "his son".
Um, so you can,
like those sort of make sense.
You're certainly seeing the word son in there, makes sense.
You also see that,
um, punctuation and stopwords are really common.
So it's probably a bad idea to drop those.
In some other NLP tasks, people advocate dropping stopwords and punctuation;
you probably don't wanna do that here.
Um, you also see the comma by itself,
a bare comma shows up almost everywhere.
Almost every one of these relations has comma as one of its most frequent middles.
So comma is significant,
but it's extremely ambiguous.
It could indicate almost any relation.
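The baseline just described might be sketched roughly like this; the `training_data` dict and the phrases in it are invented for illustration, not taken from the actual corpus:

```python
# Sketch of the simple baseline: for each relation, find the three most
# common phrases between the two entity mentions, and predict the relation
# only on an exact match to one of those phrases. No learning involved.
from collections import Counter

# toy middles observed per relation in "training" (made up for illustration)
training_data = {
    "parents":  [", son of", "and", ",", ", son of", ", son of", "and"],
    "founders": ["founded", ",", "founded", ", founder of"],
}

# keep the three most common middle phrases for each relation
top_phrases = {
    rel: [phrase for phrase, _ in Counter(middles).most_common(3)]
    for rel, middles in training_data.items()
}

def predict(relation, middle):
    """Predict True iff the middle string exactly matches a top phrase."""
    return middle in top_phrases[relation]

print(predict("parents", ", son of"))   # True
print(predict("parents", "met with"))   # False
```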
Um, we can evaluate how this does and it does significantly better.
The summary statistic is 11.5,
so that's the macro averaged F-score,
significantly better than the random guesser.
But on the other hand, still really terrible, right?
The random guesser was like 9%,
now we're up to 11.5%.
It definitely helped, but it's still really terrible.
Um, and actually, you should have modest expectations for how we're
gonna be able to do on this task on this dataset.
We're gonna have trouble getting great precision because the KB is incomplete.
Um, there could be some relational triples that actually are
true in real life and there are sentences in the corpus that tell us that they're true,
but we can't measure that they're true because they're not in the KB.
Um, that's the KB being incomplete.
And we're also going to have trouble getting great recall because
there could be some relations that are true and are in the KB,
but there's no evidence of it in the corpus.
Both the KB and the corpus are kind of small,
and that's gonna put limits on the precision and recall we can achieve.
Nevertheless, the next step in this journey
is to apply a real machine learning model to this problem.
And in the second notebook, uh,
which is linked from the course webpage,
you'll see an example of doing that.
So using a bag of words feature representation,
logistic regression classifier, the most vanilla setup possible,
and even that very vanilla setup gets to a summary statistic of,
if I remember right, 55%,
which is vastly better than the simple baseline model.
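That vanilla setup might look roughly like the following sketch; the tiny dataset here is invented for illustration, and the course notebook uses its own feature functions and data:

```python
# Sketch of the "most vanilla setup possible": bag-of-words features over
# the middle text plus a logistic regression classifier, treating one
# relation (here "parents") as a binary prediction task.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy middles and labels (1 = the "parents" relation holds), made up
middles = [", son of", "his son", "founded",
           "the CEO of", ", daughter of", "acquired"]
labels = [1, 1, 0, 0, 1, 0]

# CountVectorizer builds the bag-of-words features; LogisticRegression classifies
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(middles, labels)

print(clf.predict([", son of"]))
```

In the real setup you'd train one such classifier per relation, all sixteen of them, and macro-average the resulting F_0.5 scores.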
So that will start you down the road.
That vanilla setup, um,
leaves lots of room for improvement.
And one of the things you'll be looking at in homework 3 and in
a bake-off is how to do much better on that. All right. Thank you.
